Apache Cassandra Sample Material
VS-1046
1. INTRODUCTION TO NOSQL
NoSQL databases offer functionality that traditional relational database management
systems do not. Whether the task is holding simple key-value pairs for short periods for
caching purposes, or keeping unstructured collections of data that cannot easily be handled
with relational databases and the Structured Query Language (SQL), they are here to help.
1.1. NoSQL Basics
A NoSQL (originally referring to "non SQL", "non relational" or "not only SQL") database
provides a mechanism for storage and retrieval of data which is modeled in means other
than the tabular relations used in relational databases. Such databases have existed since
the late 1960s, but did not obtain the "NoSQL" moniker until a surge of popularity in the
early twenty-first century, triggered by the needs of Web 2.0 companies such as Facebook,
Google, and Amazon.com. NoSQL databases are increasingly used in big data and real-time
web applications. NoSQL systems are also sometimes called "Not only SQL" to
emphasize that they may support SQL-like query languages.
Motivations for this approach include: simplicity of design, simpler "horizontal" scaling to
clusters of machines (which is a problem for relational databases), and finer control over
availability. The data structures used by NoSQL databases (e.g. key-value, wide column,
graph, or document) are different from those used by default in relational databases,
making some operations faster in NoSQL. The particular suitability of a given NoSQL
database depends on the problem it must solve. Sometimes the data structures used by
NoSQL databases are also viewed as "more flexible" than relational database tables.
Many NoSQL stores compromise consistency (in the sense of the CAP theorem) in favor
of availability, partition tolerance, and speed. Barriers to the greater adoption of NoSQL
stores include the use of low-level query languages instead of SQL (for instance, the lack of
ability to perform ad-hoc joins across tables), lack of standardized interfaces, and huge
previous investments in existing relational databases. Most NoSQL stores lack true ACID
transactions, although a few databases, such as MarkLogic, Aerospike, FairCom c-treeACE,
Google Spanner (though technically a NewSQL database), Symas LMDB, and OrientDB
have made them central to their designs.
Instead, most NoSQL databases offer a concept of "eventual consistency" in which database
changes are propagated to all nodes "eventually" (typically within milliseconds) so queries
for data might not return updated data immediately or might result in reading data that is
not accurate, a problem known as stale reads. Additionally, some NoSQL systems may
exhibit lost writes and other forms of data loss. Fortunately, some NoSQL systems provide
concepts such as write-ahead logging to avoid data loss. For distributed transaction
processing across multiple databases, data consistency is an even bigger challenge that is
difficult for both NoSQL and relational databases. Even current relational databases "do
not allow referential integrity constraints to span databases." There are few systems that
maintain both ACID transactions and X/Open XA standards for distributed transaction
processing.
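The stale reads described above can be illustrated with a minimal sketch. It assumes a toy setup that is not taken from any real database: two in-memory replicas and a hypothetical `pending` queue that stands in for asynchronous replication.

```python
class Replica:
    """A toy replica: just an in-memory key / value map."""
    def __init__(self):
        self.data = {}

primary, secondary = Replica(), Replica()
pending = []  # writes accepted by the primary but not yet applied to the secondary

def write(key, value):
    """Apply the write to the primary at once and queue it for the secondary."""
    primary.data[key] = value
    pending.append((key, value))

def propagate():
    """Deliver all queued writes to the secondary ("eventually")."""
    while pending:
        key, value = pending.pop(0)
        secondary.data[key] = value

write("stock", 5)
propagate()
write("stock", 4)
stale = secondary.data["stock"]   # read before propagation: the old value
propagate()
fresh = secondary.data["stock"]   # read after propagation: the new value
print(stale, fresh)
```

A reader that happens to contact the secondary between the write and the propagation sees the old value; once propagation completes, both replicas agree again.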
Types and examples of NoSQL databases
There have been various approaches to classify NoSQL databases, each with different
categories and subcategories, some of which overlap. What follows is a basic classification
by data model, with examples:
Column: Accumulo, Cassandra, Druid, HBase, Vertica, SAP HANA
Document: Apache CouchDB, ArangoDB, Clusterpoint, Couchbase,
DocumentDB, HyperDex, IBM Domino, MarkLogic, MongoDB, OrientDB,
Qizx, RethinkDB
Key-value: Aerospike, ArangoDB, Couchbase, Dynamo, FairCom c-treeACE,
FoundationDB, HyperDex, MemcacheDB, MUMPS, Oracle NoSQL Database,
OrientDB, Redis, Riak, Berkeley DB
Graph: AllegroGraph, ArangoDB, InfiniteGraph, Apache Giraph, MarkLogic,
Neo4J, OrientDB, Virtuoso, Stardog
Multi-model: Alchemy Database, ArangoDB, CortexDB, Couchbase,
FoundationDB, MarkLogic, OrientDB
By design, NoSQL databases and management systems are relation-less (or schema-less).
They are not based on a single model (e.g. the relational model of RDBMSs); each
database adopts a different one, depending on its target functionality.
There are four main operational models for NoSQL databases:
Key / Value: e.g. Redis, MemcacheDB, etc.
Column: e.g. Cassandra, HBase, etc.
Document: e.g. MongoDB, Couchbase, etc.
Graph: e.g. OrientDB, Neo4J, etc.
In order to better understand the roles and underlying technology of each database
management system, let's quickly go over these four operational models.
Key / Value Based
We will begin our NoSQL modeling journey with key / value based database management
systems, simply because they can be considered the most basic implementation and the
backbone of NoSQL.
These types of databases work by matching keys with values, similar to a dictionary. There is
no structure or relation. After connecting to the database server (e.g. Redis), an
application can state a key (e.g. the_answer_to_life) and provide a matching value (e.g. 42),
which can later be retrieved by supplying the same key.
Key / value DBMSs are usually used for quickly storing basic information, and sometimes
not-so-basic information, such as the result of a CPU- and memory-intensive
computation. They are extremely performant, efficient and usually easily scalable.
In computing, a dictionary usually refers to a special sort of data object: a collection
of unique keys, each matched to a value.
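The key / value idea above can be sketched with an ordinary Python dictionary. This is only an in-memory stand-in, not a client for Redis or any other server; the helper names `put` and `get` are chosen for illustration.

```python
# A plain dict standing in for a key / value store (illustrative only).
store = {}

def put(key, value):
    """Store a value under a key, overwriting any previous value."""
    store[key] = value

def get(key, default=None):
    """Return the value for a key, or a default if the key is absent."""
    return store.get(key, default)

put("the_answer_to_life", 42)
answer = get("the_answer_to_life")
missing = get("no_such_key")
print(answer, missing)
```

The only way to reach a value is through its key, which is what keeps this model simple and fast.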
Column Based
Column based NoSQL database management systems work by extending the simple
nature of key / value based ones.
Despite their reputation for being complicated, these databases work very
simply by creating collections of one or more key / value pairs that match a record.
Unlike the predefined schemas of relational databases, column-based NoSQL
solutions do not require a pre-structured table to work with the data. Each record comes
with one or more columns containing the information, and the columns can differ from
record to record.
Basically, column-based NoSQL databases are two-dimensional arrays in which each key
(i.e. row / record) has one or more key / value pairs attached to it. These management
systems allow very large and unstructured data to be kept and used (e.g. a record with a
great deal of information).
These databases are commonly used when simple key / value pairs are not enough and
storing very large numbers of records, each holding a large amount of information, is a must.
DBMSs implementing column-based, schema-less models can scale extremely well.
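As a rough sketch of the column-based model described above, a dictionary of dictionaries can play the role of the two-dimensional structure: the outer key is the row key and the inner mapping holds that row's columns. The table name `users`, its contents, and the helper functions are made up for illustration and do not reflect how any particular database stores data.

```python
# Row key -> {column name: value}; each row may carry different columns.
users = {
    "alice": {"email": "alice@example.com", "city": "Paris"},
    "bob": {"email": "bob@example.com", "phone": "555-0100", "age": 31},
}

def get_column(table, row_key, column):
    """Look up one column of one row; None if the row or column is absent."""
    return table.get(row_key, {}).get(column)

def add_column(table, row_key, column, value):
    """Attach a new column to a single row without touching other rows."""
    table.setdefault(row_key, {})[column] = value

add_column(users, "alice", "last_login", "2016-01-15")
alice_columns = sorted(users["alice"])
bob_city = get_column(users, "bob", "city")
print(alice_columns, bob_city)
```

Adding `last_login` to one row does not require the other rows to have that column, which is the flexibility the text refers to.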
Document Based
Document based NoSQL database management systems have recently surged in
popularity. These DBMSs work in a similar fashion to
column-based ones; however, they allow much deeper nesting and more complex structures
(e.g. a document, within a document, within a document).
Documents overcome the constraint of one or two levels of key / value nesting found in
columnar databases. Basically, any complex and arbitrary structure can form a document,
which can be stored using these management systems.
Despite their powerful nature and the ability to query records by individual keys,
document based management systems have their own issues and drawbacks compared to
others. For example, retrieving a single value from a record can mean fetching the whole
document, and the same goes for updates, both of which affect performance.
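The nesting described above can be sketched with a JSON document. The `order` structure and the `get_path` helper are hypothetical and only illustrate the idea; real document databases provide their own query languages.

```python
import json

# A document nested three levels deep: order -> customer -> address.
order = {
    "order_id": 1001,
    "customer": {
        "name": "Alice",
        "address": {"city": "Paris", "country": "FR"},
    },
    "items": [
        {"sku": "A-1", "qty": 2},
        {"sku": "B-7", "qty": 1},
    ],
}

def get_path(document, path):
    """Follow a dotted path such as 'customer.address.city' through nested dicts."""
    current = document
    for key in path.split("."):
        current = current[key]
    return current

serialized = json.dumps(order)      # the whole document is written as one unit
restored = json.loads(serialized)   # and read back as one unit
city = get_path(restored, "customer.address.city")
total_qty = sum(item["qty"] for item in restored["items"])
print(city, total_qty)
```

Note that reading the single `city` value here involved loading the entire document, which mirrors the performance cost mentioned in the text.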
Graph Based
Finally, a particularly interesting flavour of NoSQL database management systems is the
graph based one.
Graph based DBMS models represent data in a completely different way from the
previous three models. They use graph structures, with nodes connected to each other by
edges that represent relations.
As in mathematics, certain operations are much simpler to perform using these types
of models, thanks to the way they link and group related pieces of information (e.g.
connected people).
These databases are commonly used by applications in which the connections between
entities need to be clearly established. For example, when you register with a social network
of any sort, your friends' connections to you and their friends' relations to you are
much easier to work with using graph-based database management systems.
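The social network example above can be sketched with an adjacency list, where each person (node) maps to the set of people they are connected to (edges). The names and the `friends_of_friends` helper are invented for illustration; a graph database would expose this kind of traversal through its own query language.

```python
# Each node maps to the set of nodes it shares an edge with.
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan"},
    "cat": {"ann", "dan", "eve"},
    "dan": {"bob", "cat"},
    "eve": {"cat"},
}

def friends_of_friends(graph, person):
    """Return people two hops away, excluding the person and their direct friends."""
    direct = graph.get(person, set())
    result = set()
    for friend in direct:
        result |= graph.get(friend, set())
    return result - direct - {person}

fof = friends_of_friends(friends, "ann")
print(sorted(fof))
```

The same question asked of a relational schema would typically need a self-join on a friendship table for every extra hop.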
NoSQL databases have the following properties:
Design Simplicity
Horizontal Scaling
High Availability
The data structures used in Cassandra are more specialized than those used in relational
databases, which can make some operations faster than their relational equivalents.
NoSQL databases are increasingly used in Big Data and real-time web applications.
NoSQL databases are sometimes called "Not Only SQL" because they may support SQL-like
query languages.
NoSQL vs. RDBMS
The differences between relational databases and NoSQL databases are summarized in the
table below.
Relational Database                          | NoSQL Database
Handles data coming in at low velocity       | Handles data coming in at high velocity
Data arrives from one or a few locations     | Data arrives from many locations
Manages structured data                      | Manages structured, unstructured and semi-structured data
Supports complex transactions (with joins)   | Supports simple transactions
Single point of failure with failover        | No single point of failure
Handles data in moderate volume              | Handles data in very high volume
Centralized deployments                      | Decentralized deployments
Transactions written in one location         | Transactions written in many locations
Gives read scalability                       | Gives both read and write scalability
Deployed in vertical fashion                 | Deployed in horizontal fashion
1.2. Cassandra Basics and Terminology
Apache Cassandra is a highly scalable, distributed, high-performance NoSQL database.
Cassandra is designed to handle huge amounts of data.
In the accompanying diagram, circles represent Cassandra nodes and the lines between the
circles show the distributed architecture, while the client sends data to a node. Cassandra
handles huge amounts of data with its distributed architecture. Data is placed on different
machines with a replication factor greater than one, which provides high availability and no
single point of failure.
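The placement of replicas on different machines can be sketched as a token ring. The following is a simplified illustration, not Cassandra's implementation: it assumes one token per node, uses MD5 as an arbitrary hash, and places replicas on the next nodes clockwise from the key's position. The node names and function names are hypothetical.

```python
import bisect
import hashlib

def token(value):
    """Map a string to a position on the ring (MD5 is an arbitrary choice here)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

def build_ring(nodes):
    """Return the ring as a list of (token, node) pairs sorted by token."""
    return sorted((token(node), node) for node in nodes)

def replicas_for(ring, key, replication_factor):
    """Walk clockwise from the key's position and collect distinct nodes."""
    tokens = [t for t, _ in ring]
    start = bisect.bisect_left(tokens, token(key)) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]

nodes = ["node-a", "node-b", "node-c", "node-d"]
ring = build_ring(nodes)
replicas = replicas_for(ring, "user:42", 3)
print(replicas)
```

With a replication factor of 3, each key lives on three different machines, so losing any single node still leaves two copies available.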
Cassandra History
Cassandra was first developed at Facebook for inbox search.
Facebook open sourced it in July 2008.
The Apache Incubator accepted Cassandra in March 2009.
Cassandra has been a top-level Apache project since February 2010.
At the time of writing, the latest version of Apache Cassandra is 3.2.1.
The 3.0 release was made available in November 2015. Its features include:
The underlying storage engine has been rewritten to more closely match CQL
constructs
Support for materialized views (sometimes also called global indexes)
Java 8 is now the supported version
The Thrift-based Command Line Interface (CLI) is removed
Apache Cassandra Features
The main features of Cassandra are:
Massively Scalable Architecture: Cassandra has a masterless design where all nodes
are at the same level, which provides operational simplicity and easy scale-out.
Masterless Architecture: Data can be written to and read from any node.
Linear Scale Performance: As more nodes are added, the performance of
Cassandra increases.
No Single Point of Failure: Cassandra replicates data on different nodes, which ensures
that there is no single point of failure.
Fault Detection and Recovery: Failed nodes can easily be restored and recovered.
Flexible and Dynamic Data Model: Supports a variety of data types with fast writes and reads.
Data Protection: Data is protected by the commit log design and by built-in mechanisms
such as backup and restore.
Tunable Data Consistency: Support for strong data consistency across a distributed
architecture.
Multi Data Center Replication: Cassandra can replicate data across
multiple data centers.
Data Compression: Cassandra can compress data by up to 80% without any overhead.
Cassandra Query Language: Cassandra provides a query language that is similar to
SQL, which makes the move from a relational database to Cassandra easier for
relational database developers.
Application of Cassandra
Cassandra is a non-relational database that can be used for different types of applications.
Here are some use cases where Cassandra should be preferred.
Messaging - Cassandra is a great database for companies that provide mobile
phone and messaging services. These companies have a huge amount of data, so
Cassandra suits them well.
Internet of Things applications - Cassandra is a great database for applications
where data arrives at very high speed from different devices or sensors.
Product catalogs and retail apps - Cassandra is used by many retailers for durable
shopping cart protection and fast product catalog input and output.
Social media analytics and recommendation engines - Cassandra is a great database
for many online companies and social media providers, who use it for analysis and for
making recommendations to their customers.
Distributed Database
Cassandra is distributed, which means that it is capable of running on multiple machines
while appearing to users as a unified whole. In fact, there is little point in running a single
Cassandra node. Although you can do it, and that’s acceptable for getting up to speed on
how it works, you quickly realize that you’ll need multiple machines to really realize any
benefit from running Cassandra. Much of its design and code base is specifically
engineered toward not only making it work across many different machines, but also for
optimizing performance across multiple data center racks, and even for a single Cassandra
cluster running across geographically dispersed data centers. You can confidently write data
to anywhere in the cluster and Cassandra will get it.
Once you start to scale many other data stores (MySQL, Bigtable), some nodes need to be
set up as masters in order to organize other nodes, which are set up as slaves. Cassandra,
however, is decentralized, meaning that every node is identical; no Cassandra node
performs certain organizing operations distinct from any other node. Instead, Cassandra
features a peer-to-peer protocol and uses gossip to maintain and keep in sync a list of nodes
that are alive or dead.
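The way gossip spreads membership information without a master can be sketched as follows. This is a toy model under stated assumptions: each node keeps a map of heartbeat counters, and each round it exchanges that map with one fixed neighbour so the result is deterministic. Real gossip protocols typically choose peers at random, and the names `merge` and `gossip_round` are illustrative only.

```python
def merge(view_a, view_b):
    """Exchange state: both peers keep the highest heartbeat seen per node."""
    for name in set(view_a) | set(view_b):
        newest = max(view_a.get(name, 0), view_b.get(name, 0))
        view_a[name] = newest
        view_b[name] = newest

def gossip_round(views, order):
    """Each node bumps its own heartbeat, then gossips with its ring neighbour."""
    for i, name in enumerate(order):
        views[name][name] += 1
        peer = order[(i + 1) % len(order)]
        merge(views[name], views[peer])

names = ["n1", "n2", "n3", "n4"]
views = {name: {name: 0} for name in names}  # each node starts knowing only itself
for _ in range(2):
    gossip_round(views, names)

everyone_known = all(set(view) == set(names) for view in views.values())
print(everyone_known)
```

No node plays a coordinating role here: every node runs the same two functions, and knowledge of the whole cluster still reaches all of them. A node whose heartbeat stops increasing could then be suspected dead by its peers.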
The fact that Cassandra is decentralized means that there is no single point of failure. All of
the nodes in a Cassandra cluster function exactly the same. This is sometimes referred to as
“server symmetry.” Because they are all doing the same thing, by definition there can’t be a
special host that is coordinating activities, as with the master/slave setup that you see in
MySQL, Bigtable, and so many others.
Decentralization, therefore, has two key advantages: it’s simpler to use than master/slave,
and it helps you avoid outages. It can be easier to operate and maintain a decentralized
store than a master/slave store because all nodes are the same. That means that you don’t
need any special knowledge to scale; setting up 50 nodes isn’t much different from setting
up one. There’s next to no configuration required to support it.
Moreover, in a master/slave setup, the master can become a single point of failure (SPOF).
To avoid this, you often need to add some complexity to the environment in the form of
multiple masters. Because all of the replicas in Cassandra are identical, failures of a node
won’t disrupt service.
Elastic Scalability
Scalability is an architectural feature of a system that can continue serving a greater number
of requests with little degradation in performance. Vertical scaling—simply adding more
hardware capacity and memory to your existing machine—is the easiest way to achieve this.
Horizontal scaling means adding more machines that have all or some of the data on them
so that no one machine has to bear the entire burden of serving requests. But then the
software itself must have an internal mechanism for keeping its data in sync with the other
nodes in the cluster.
Elastic scalability refers to a special property of horizontal scalability. It means that your
cluster can seamlessly scale up and scale back down. To do this, the cluster must be able to
accept new nodes that can begin participating by getting a copy of some or all of the data
and start serving new user requests without major disruption or reconfiguration of the
entire cluster. You don’t have to restart your process. You don’t have to change your
application queries. You don’t have to manually rebalance the data yourself. Just add
another machine—Cassandra will find it and start sending it work.
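One reason a cluster can grow without a full reshuffle can be shown with a small consistent-hashing sketch. It assumes a simplified ring with one token per node and MD5 as an arbitrary hash, which is not how a production Cassandra cluster is configured; the node and key names are hypothetical.

```python
import bisect
import hashlib

def token(value):
    """Map a string to a position on the ring (MD5 is an arbitrary choice here)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

def owner(nodes, key):
    """Return the node whose token is the first at or after the key's token."""
    ring = sorted((token(node), node) for node in nodes)
    tokens = [t for t, _ in ring]
    index = bisect.bisect_left(tokens, token(key)) % len(ring)
    return ring[index][1]

keys = [f"key-{i}" for i in range(1000)]
before = {key: owner(["node-a", "node-b", "node-c"], key) for key in keys}
after = {key: owner(["node-a", "node-b", "node-c", "node-d"], key) for key in keys}

# Keys whose owner changed when node-d joined the ring.
moved = [key for key in keys if before[key] != after[key]]
print(f"{len(moved)} of {len(keys)} keys changed owner")
```

Every key that changes owner moves to the new node, and all other keys stay where they were, so existing nodes never have to trade data among themselves when a machine is added.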
Consistency
Consistency essentially means that a read always returns the most recently written value.
Consider two customers attempting to put the same item into their shopping carts on
an ecommerce site. If I place the last item in stock into my cart an instant after you do, you
should get the item added to your cart, and I should be informed that the item is no longer
available for purchase. This is guaranteed to happen when the state of a write is consistent
among all nodes that have that data.
But as we’ll see later, scaling data stores means making certain trade-offs between data
consistency, node availability, and partition tolerance. Cassandra is frequently called
“eventually consistent,” which is a bit misleading. Out of the box, Cassandra trades some
consistency in order to achieve total availability. But Cassandra is more accurately termed
“tuneably consistent,” which means it allows you to easily decide the level of consistency
you require, in balance with the level of availability.
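The trade-off behind tunable consistency can be made concrete with a little arithmetic. A quorum is a majority of the replicas, and a read is guaranteed to overlap the latest successful write when the number of replicas read plus the number of replicas written exceeds the replication factor. The function names below are illustrative, not part of any Cassandra API.

```python
def quorum(replication_factor):
    """A majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1

def reads_see_latest_write(read_replicas, write_replicas, replication_factor):
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor

rf = 3
q = quorum(rf)
strong = reads_see_latest_write(q, q, rf)   # quorum reads and quorum writes
weak = reads_see_latest_write(1, 1, rf)     # one replica for reads and writes
print(q, strong, weak)
```

With a replication factor of 3, quorum reads and writes (2 + 2 > 3) give consistent results while tolerating one unavailable replica; reading and writing a single replica (1 + 1) is faster and more available but may return stale data.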

More Related Content

PPTX
Unit 3 MongDB
PPSX
A Seminar on NoSQL Databases.
PDF
the rising no sql technology
DOCX
NoSQL_Databases
PPTX
2018 05 08_biological_databases_no_sql
PDF
Datastores
PDF
Artigo no sql x relational
PPTX
Unit 3 MongDB
A Seminar on NoSQL Databases.
the rising no sql technology
NoSQL_Databases
2018 05 08_biological_databases_no_sql
Datastores
Artigo no sql x relational

What's hot (20)

PPTX
No sq lv2
PDF
Introduction to NoSQL
DOCX
PDF
NoSQL-Database-Concepts
PDF
Comparative study of no sql document, column store databases and evaluation o...
PPTX
Nosql databases
PPT
NoSQL databases
PPT
No sql databases explained
PPTX
NoSQL Basics and MongDB
PPTX
Sql vs NoSQL-Presentation
PDF
cassandra
PPTX
Selecting best NoSQL
PDF
A NOVEL APPROACH FOR HOTEL MANAGEMENT SYSTEM USING CASSANDRA
PPTX
No sql database
PPT
NoSQL Basics - a quick tour
PPTX
Non relational databases-no sql
PDF
NOSQL- Presentation on NoSQL
PPTX
NoSQL databases
ODP
Nonrelational Databases
No sq lv2
Introduction to NoSQL
NoSQL-Database-Concepts
Comparative study of no sql document, column store databases and evaluation o...
Nosql databases
NoSQL databases
No sql databases explained
NoSQL Basics and MongDB
Sql vs NoSQL-Presentation
cassandra
Selecting best NoSQL
A NOVEL APPROACH FOR HOTEL MANAGEMENT SYSTEM USING CASSANDRA
No sql database
NoSQL Basics - a quick tour
Non relational databases-no sql
NOSQL- Presentation on NoSQL
NoSQL databases
Nonrelational Databases
Ad

Similar to Vskills Apache Cassandra sample material (20)

DOCX
Unit II -BIG DATA ANALYTICS.docx
PPTX
unit2-ppt1.pptx
PPTX
Introduction to Data Science NoSQL.pptx
PPTX
cours database pour etudiant NoSQL (1).pptx
PDF
NoSQL BIg Data Analytics Mongo DB and Cassandra .pdf
PPTX
NoSQL.pptx
PDF
PDF
NOsql Presentation.pdf
PPTX
No SQL DATABASE Description about 4 no sql database.pptx
PDF
NoSql and it's introduction features-Unit-1.pdf
PPTX
UNIT-2.pptx
PPTX
Unit 5.pptx computer graphics and gaming
PPTX
Introduction to asdfghjkln b vfgh n v
PPTX
Introduction to NoSQL database technology
PDF
The Rise of Nosql Databases
PPTX
VG AWT.pptxgtyfrtgtrfgttyuygtgyyuut6ytygtyg
PDF
Brief introduction to NoSQL by fas mosleh
PPTX
nosqldatabnjxjdjases-240121150542-d4ec9e23.pptx
PDF
Big Data technology Landscape
PPTX
Unit II -BIG DATA ANALYTICS.docx
unit2-ppt1.pptx
Introduction to Data Science NoSQL.pptx
cours database pour etudiant NoSQL (1).pptx
NoSQL BIg Data Analytics Mongo DB and Cassandra .pdf
NoSQL.pptx
NOsql Presentation.pdf
No SQL DATABASE Description about 4 no sql database.pptx
NoSql and it's introduction features-Unit-1.pdf
UNIT-2.pptx
Unit 5.pptx computer graphics and gaming
Introduction to asdfghjkln b vfgh n v
Introduction to NoSQL database technology
The Rise of Nosql Databases
VG AWT.pptxgtyfrtgtrfgttyuygtgyyuut6ytygtyg
Brief introduction to NoSQL by fas mosleh
nosqldatabnjxjdjases-240121150542-d4ec9e23.pptx
Big Data technology Landscape
Ad

More from Vskills (20)

PDF
Vskills certified administrative support professional sample material
PDF
vskills customer service professional sample material
PDF
Vskills certified operations manager sample material
PDF
Vskills certified six sigma yellow belt sample material
PDF
Vskills production and operations management sample material
PDF
vskills leadership skills professional sample material
PDF
vskills facility management expert sample material
PDF
Vskills international trade and forex professional sample material
PDF
Vskills production planning and control professional sample material
PDF
Vskills purchasing and material management professional sample material
PDF
Vskills manufacturing technology management professional sample material
PDF
certificate in agile project management sample material
PDF
Vskills angular js sample material
PDF
Vskills c++ developer sample material
PDF
Vskills c developer sample material
PDF
Vskills financial modelling professional sample material
PDF
Vskills basel iii professional sample material
PDF
Vskills telecom management professional sample material
PDF
Vskills retail management professional sample material
PDF
Vskills contract law analyst sample material
Vskills certified administrative support professional sample material
vskills customer service professional sample material
Vskills certified operations manager sample material
Vskills certified six sigma yellow belt sample material
Vskills production and operations management sample material
vskills leadership skills professional sample material
vskills facility management expert sample material
Vskills international trade and forex professional sample material
Vskills production planning and control professional sample material
Vskills purchasing and material management professional sample material
Vskills manufacturing technology management professional sample material
certificate in agile project management sample material
Vskills angular js sample material
Vskills c++ developer sample material
Vskills c developer sample material
Vskills financial modelling professional sample material
Vskills basel iii professional sample material
Vskills telecom management professional sample material
Vskills retail management professional sample material
Vskills contract law analyst sample material

Recently uploaded (20)

PDF
Supply Chain Operations Speaking Notes -ICLT Program
PDF
Insiders guide to clinical Medicine.pdf
PDF
01-Introduction-to-Information-Management.pdf
PDF
Physiotherapy_for_Respiratory_and_Cardiac_Problems WEBBER.pdf
PDF
TR - Agricultural Crops Production NC III.pdf
PDF
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
PPTX
Institutional Correction lecture only . . .
PDF
Black Hat USA 2025 - Micro ICS Summit - ICS/OT Threat Landscape
PDF
Module 4: Burden of Disease Tutorial Slides S2 2025
PPTX
Lesson notes of climatology university.
PDF
Computing-Curriculum for Schools in Ghana
PDF
Basic Mud Logging Guide for educational purpose
PPTX
1st Inaugural Professorial Lecture held on 19th February 2020 (Governance and...
PDF
VCE English Exam - Section C Student Revision Booklet
PDF
STATICS OF THE RIGID BODIES Hibbelers.pdf
PPTX
Renaissance Architecture: A Journey from Faith to Humanism
PDF
Classroom Observation Tools for Teachers
PPTX
Pharma ospi slides which help in ospi learning
PDF
O7-L3 Supply Chain Operations - ICLT Program
PDF
FourierSeries-QuestionsWithAnswers(Part-A).pdf
Supply Chain Operations Speaking Notes -ICLT Program
Insiders guide to clinical Medicine.pdf
01-Introduction-to-Information-Management.pdf
Physiotherapy_for_Respiratory_and_Cardiac_Problems WEBBER.pdf
TR - Agricultural Crops Production NC III.pdf
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
Institutional Correction lecture only . . .
Black Hat USA 2025 - Micro ICS Summit - ICS/OT Threat Landscape
Module 4: Burden of Disease Tutorial Slides S2 2025
Lesson notes of climatology university.
Computing-Curriculum for Schools in Ghana
Basic Mud Logging Guide for educational purpose
1st Inaugural Professorial Lecture held on 19th February 2020 (Governance and...
VCE English Exam - Section C Student Revision Booklet
STATICS OF THE RIGID BODIES Hibbelers.pdf
Renaissance Architecture: A Journey from Faith to Humanism
Classroom Observation Tools for Teachers
Pharma ospi slides which help in ospi learning
O7-L3 Supply Chain Operations - ICLT Program
FourierSeries-QuestionsWithAnswers(Part-A).pdf

Vskills Apache Cassandra sample material

  • 2. 1. INTRODUCTION TO NOSQL NoSQL databases try to offer certain functionality that more traditional relational database management systems do not. Whether it is for holding simple key-value pairs for shorter lengths of time for caching purposes, or keeping unstructured collections (e.g. collections) of data that could not be easily dealt with using relational databases and the structured query language (SQL) – they are here to help. 1.1. NoSQL Basics A NoSQL (originally referring to "non SQL", "non relational" or "not only SQL") database provides a mechanism for storage and retrieval of data which is modeled in means other than the tabular relations used in relational databases. Such databases have existed since the late 1960s, but did not obtain the "NoSQL" moniker until a surge of popularity in the early twenty-first century, triggered by the needs of Web 2.0 companies such as Facebook, Google, and Amazon.com. NoSQL databases are increasingly used in big data and real- time web applications. NoSQL systems are also sometimes called "Not only SQL" to emphasize that they may support SQL-like query languages. Motivations for this approach include: simplicity of design, simpler "horizontal" scaling to clusters of machines (which is a problem for relational databases), and finer control over availability. The data structures used by NoSQL databases (e.g. key-value, wide column, graph, or document) are different from those used by default in relational databases, making some operations faster in NoSQL. The particular suitability of a given NoSQL database depends on the problem it must solve. Sometimes the data structures used by NoSQL databases are also viewed as "more flexible" than relational database tables. Many NoSQL stores compromise consistency (in the sense of the CAP theorem) in favor of availability, partition tolerance, and speed. 
Barriers to the greater adoption of NoSQL stores include the use of low-level query languages (instead of SQL, for instance the lack of ability to perform ad-hoc joins across tables), lack of standardized interfaces, and huge previous investments in existing relational databases.] Most NoSQL stores lack true ACID transactions, although a few databases, such as MarkLogic, Aerospike, FairCom c-treeACE, Google Spanner (though technically a NewSQL database), Symas LMDB, and OrientDB have made them central to their designs. Instead, most NoSQL databases offer a concept of "eventual consistency" in which database changes are propagated to all nodes "eventually" (typically within milliseconds) so queries for data might not return updated data immediately or might result in reading data that is not accurate, a problem known as stale reads. Additionally, some NoSQL systems may exhibit lost writes and other forms of data loss. Fortunately, some NoSQL systems provide concepts such as write-ahead logging to avoid data loss. For distributed transaction processing across multiple databases, data consistency is an even bigger challenge that is difficult for both NoSQL and relational databases. Even current relational databases "do not allow referential integrity constraints to span databases." There are few systems that maintain both ACID transactions and X/Open XA standards for distributed transaction processing.
  • 3. Types and examples of NoSQL databases
There have been various approaches to classifying NoSQL databases, each with different categories and subcategories, some of which overlap. What follows is a basic classification by data model, with examples:
Column: Accumulo, Cassandra, Druid, HBase, Vertica, SAP HANA
Document: Apache CouchDB, ArangoDB, Clusterpoint, Couchbase, DocumentDB, HyperDex, IBM Domino, MarkLogic, MongoDB, OrientDB, Qizx, RethinkDB
Key-value: Aerospike, ArangoDB, Couchbase, Dynamo, FairCom c-treeACE, FoundationDB, HyperDex, MemcacheDB, MUMPS, Oracle NoSQL Database, OrientDB, Redis, Riak, Berkeley DB
Graph: AllegroGraph, ArangoDB, InfiniteGraph, Apache Giraph, MarkLogic, Neo4J, OrientDB, Virtuoso, Stardog
Multi-model: Alchemy Database, ArangoDB, CortexDB, Couchbase, FoundationDB, MarkLogic, OrientDB
By design, NoSQL databases and management systems are relation-less (or schema-less). They are not based on a single model (e.g. the relational model of RDBMSs); each database adopts a different one, depending on its target functionality. There are four main operational models for NoSQL databases:
Key / Value: e.g. Redis, MemcacheDB, etc.
Column: e.g. Cassandra, HBase, etc.
Document: e.g. MongoDB, Couchbase, etc.
Graph: e.g. OrientDB, Neo4J, etc.
In order to better understand the roles and underlying technology of each database management system, let's quickly go over these four operational models.
Key / Value Based
We will begin our NoSQL modeling journey with key / value based database management systems, simply because they can be considered the most basic implementation, and the backbone, of NoSQL. These types of databases work by matching keys with values, similar to a dictionary. There is neither structure nor relation. After connecting to the database server (e.g. Redis), an application can state a key (e.g. the_answer_to_life) and provide a matching value (e.g. 42), which can later be retrieved the same way by supplying the key.
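The set-then-get flow described above can be sketched with a toy in-memory store. This is only an illustration in Python: the `KeyValueStore` class and its `set`/`get` methods are hypothetical names chosen for this example, not the API of Redis or any other product.

```python
class KeyValueStore:
    """Toy in-memory key/value store (hypothetical, for illustration only)."""

    def __init__(self):
        # A plain dictionary plays the role of the server-side storage.
        self._data = {}

    def set(self, key, value):
        # Store the value under the key, overwriting any previous value.
        self._data[key] = value

    def get(self, key, default=None):
        # Look the value up by key; return `default` if the key is unknown.
        return self._data.get(key, default)


store = KeyValueStore()
store.set("the_answer_to_life", 42)

answer = store.get("the_answer_to_life")   # the value stored above
missing = store.get("no_such_key")         # unknown keys fall back to None
```

Note that the store knows nothing about the shape of the values: there is no schema and no relation between keys, which is what makes this model so simple to scale.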
  • 4. Key / value DBMSs are usually used for quickly storing basic information, and sometimes not-so-basic information, for example the result of a CPU and memory intensive computation. They are extremely performant, efficient and usually easily scalable. When it comes to computers, a dictionary usually refers to a special sort of data object: a collection of individual keys, each matching a value.
Column Based
Column based NoSQL database management systems work by advancing the simple nature of key / value based ones. Despite their complicated-to-understand image on the internet, these databases work very simply, by creating collections of one or more key / value pairs that match a record. Unlike the traditional defined schemas of relational databases, column-based NoSQL solutions do not require a pre-structured table to work with the data. Each record comes with one or more columns containing the information, and each column of each record can be different. Basically, column-based NoSQL databases are two dimensional arrays whereby each key (i.e. row / record) has one or more key / value pairs attached to it. These management systems allow very large and unstructured data to be kept and used (e.g. a record with tons of information). These databases are commonly used when simple key / value pairs are not enough, and storing very large numbers of records with very large amounts of information is a must. DBMSs implementing column-based, schema-less models can scale extremely well.
Document Based
Document based NoSQL database management systems can be considered the latest craze that managed to take a lot of people by storm. These DBMSs work in a similar fashion to column-based ones; however, they allow much deeper nesting and more complex structures (e.g. a document, within a document, within a document). Documents overcome the constraint of one or two levels of key / value nesting found in columnar databases.
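The difference between the column and document models described above can be sketched with ordinary Python dictionaries. This is a minimal illustration under stated assumptions: the row keys, column names and the sample order document are invented for this example, and real systems such as Cassandra or MongoDB store and index data very differently.

```python
# Column-style: each row key maps to its own set of column / value pairs.
# Rows in the same collection do not need to share the same columns.
column_family = {
    "user:1": {"name": "Ada", "email": "ada@example.com"},
    "user:2": {"name": "Linus", "city": "Helsinki", "language": "C"},
}

# Document-style: values may themselves be documents or lists,
# nested to arbitrary depth.
document = {
    "_id": "order:1001",
    "customer": {
        "name": "Ada",
        "address": {"city": "London", "country": "UK"},
    },
    "items": [
        {"sku": "A-1", "qty": 2},
        {"sku": "B-7", "qty": 1},
    ],
}


def columns_of(row_key):
    """Return the sorted column names present on one row."""
    return sorted(column_family[row_key])


# Each row reports a different set of columns.
user1_columns = columns_of("user:1")
user2_columns = columns_of("user:2")

# Reaching into a nested document takes one lookup per level.
city = document["customer"]["address"]["city"]
total_qty = sum(item["qty"] for item in document["items"])
```

The column-style structure stops at two levels (row key, then column), while the document nests three levels deep for the address and also holds a list of sub-documents.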
Basically, any complex and arbitrary structure can form a document, which can be stored using these management systems. Despite their powerful nature, and the ability to query records by individual keys, document based management systems have their own issues and downsides compared to others. For example, retrieving a value of a record means getting the whole record, and the same goes for updates, all of which affects performance.
Graph Based
Finally, a very interesting flavour of NoSQL database management systems is the graph based one.
  • 5. The graph based DBMS models represent the data in a completely different way than the previous three models. They use graph structures, with nodes connected to each other by edges that represent relations. As in mathematics, certain operations are much simpler to perform using these types of models, thanks to their nature of linking and grouping related pieces of information (e.g. connected people). These databases are commonly used by applications in which clear boundaries for connections need to be established. For example, when you register with a social network of any sort, your friends' connections to you and their friends' friends' relations to you are much easier to work with using graph-based database management systems.
NoSQL databases have the following properties:
Design Simplicity
Horizontal Scaling
High Availability
Data structures used in Cassandra are more specialized than those used in relational databases, which makes some operations faster. NoSQL databases are increasingly used in Big Data and real-time web applications. NoSQL databases are sometimes called "Not Only SQL", i.e. they may support an SQL-like query language.
NoSQL vs RDBMS
Here are the differences between relational databases and NoSQL databases in a tabular format.
Relational Database | NoSQL Database
Handles data coming in at low velocity | Handles data coming in at high velocity
Data arrives from one or a few locations | Data arrives from many locations
Manages structured data | Manages structured, unstructured and semi-structured data
Supports complex transactions (with joins) | Supports simple transactions
Single point of failure with failover | No single point of failure
Handles data in moderate volume | Handles data in very high volume
Centralized deployments | Decentralized deployments
Transactions written in one location | Transactions written in many locations
Gives read scalability | Gives both read and write scalability
Deployed in vertical fashion | Deployed in horizontal fashion
1.2. Cassandra Basics and Terminology
Apache Cassandra is a highly scalable, distributed, high-performance NoSQL database. Cassandra is designed to handle a huge amount of data.
  • 6. In the image above, the circles are Cassandra nodes and the lines between the circles show the distributed architecture, while the client sends data to a node. Cassandra handles the huge amount of data with its distributed architecture. Data is placed on different machines with a replication factor greater than one, which provides high availability and no single point of failure.
Cassandra History
Cassandra was first developed at Facebook for inbox search. Facebook open sourced it in July 2008. The Apache Incubator accepted Cassandra in March 2009. Cassandra has been a top-level Apache project since February 2010. The latest version of Apache Cassandra is 3.2.1. The 3.0 release was made available in November 2015. Its features include:
The underlying storage engine has been rewritten to more closely match CQL constructs
Support for materialized views (sometimes also called global indexes)
Java 8 is now the supported version
The Thrift-based Command Line Interface (CLI) is removed
Apache Cassandra Features
The main features of Cassandra are:
Massively Scalable Architecture: Cassandra has a masterless design where all nodes are at the same level, which provides operational simplicity and easy scale-out.
Masterless Architecture: Data can be written and read on any node.
Linear Scale Performance: As more nodes are added, the performance of Cassandra increases.
No Single Point of Failure: Cassandra replicates data on different nodes, which ensures there is no single point of failure.
Fault Detection and Recovery: Failed nodes can easily be restored and recovered.
Flexible and Dynamic Data Model: Supports datatypes with fast writes and reads.
  • 7. Data Protection: Data is protected by the commit log design and by built-in safeguards such as backup and restore mechanisms.
Tunable Data Consistency: Support for strong data consistency across the distributed architecture.
Multi Data Center Replication: Cassandra can replicate data across multiple data centers.
Data Compression: Cassandra can compress data by up to 80% without any overhead.
Cassandra Query Language: Cassandra provides a query language that is similar to SQL. This makes it very easy for relational database developers to move from a relational database to Cassandra.
Application of Cassandra
Cassandra is a non-relational database that can be used for different types of applications. Here are some use cases where Cassandra should be preferred.
Messaging - Cassandra is a great database for companies that provide mobile phone and messaging services. These companies have a huge amount of data, so Cassandra is a good fit for them.
Internet of Things applications - Cassandra is a great database for applications where data is coming in at very high speed from different devices or sensors.
Product catalogs and retail apps - Cassandra is used by many retailers for durable shopping cart protection and fast product catalog input and output.
Social media analytics and recommendation engines - Cassandra is a great database for many online companies and social media providers for analysis and recommendations to their customers.
Distributed Database
Cassandra is distributed, which means that it is capable of running on multiple machines while appearing to users as a unified whole. In fact, there is little point in running a single Cassandra node. Although you can do it, and that’s acceptable for getting up to speed on how it works, you quickly realize that you’ll need multiple machines to get any real benefit from running Cassandra.
Much of its design and code base is specifically engineered toward not only making it work across many different machines, but also optimizing performance across multiple data center racks, and even for a single Cassandra cluster running across geographically dispersed data centers. You can confidently write data to anywhere in the cluster and Cassandra will get it. Once you start to scale many other data stores (MySQL, Bigtable), some nodes need to be set up as masters in order to organize other nodes, which are set up as slaves. Cassandra, however, is decentralized, meaning that every node is identical; no Cassandra node performs certain organizing operations distinct from any other node. Instead, Cassandra features a peer-to-peer protocol and uses gossip to maintain and keep in sync a list of nodes that are alive or dead. The fact that Cassandra is decentralized means that there is no single point of failure. All of the nodes in a Cassandra cluster function exactly the same. This is sometimes referred to as “server symmetry.”
  • 8. Because they are all doing the same thing, by definition there can’t be a special host that is coordinating activities, as with the master/slave setup that you see in MySQL, Bigtable, and so many others. Decentralization, therefore, has two key advantages: it’s simpler to use than master/slave, and it helps you avoid outages. It can be easier to operate and maintain a decentralized store than a master/slave store because all nodes are the same. That means that you don’t need any special knowledge to scale; setting up 50 nodes isn’t much different from setting up one. There’s next to no configuration required to support it. Moreover, in a master/slave setup, the master can become a single point of failure (SPOF). To avoid this, you often need to add some complexity to the environment in the form of multiple masters. Because all of the replicas in Cassandra are identical, failures of a node won’t disrupt service.
Elastic Scalability
Scalability is an architectural feature of a system that can continue serving a greater number of requests with little degradation in performance. Vertical scaling (simply adding more hardware capacity and memory to your existing machine) is the easiest way to achieve this. Horizontal scaling means adding more machines that have all or some of the data on them, so that no one machine has to bear the entire burden of serving requests. But then the software itself must have an internal mechanism for keeping its data in sync with the other nodes in the cluster. Elastic scalability refers to a special property of horizontal scalability. It means that your cluster can seamlessly scale up and scale back down. To do this, the cluster must be able to accept new nodes that can begin participating by getting a copy of some or all of the data and start serving new user requests without major disruption or reconfiguration of the entire cluster. You don’t have to restart your process.
You don’t have to change your application queries. You don’t have to manually rebalance the data yourself. Just add another machine; Cassandra will find it and start sending it work.
Consistency
Consistency essentially means that a read always returns the most recently written value. Consider two customers attempting to put the same item into their shopping carts on an ecommerce site. If I place the last item in stock into my cart an instant after you do, you should get the item added to your cart, and I should be informed that the item is no longer available for purchase. This is guaranteed to happen when the state of a write is consistent among all nodes that have that data. But as we’ll see later, scaling data stores means making certain trade-offs between data consistency, node availability, and partition tolerance. Cassandra is frequently called “eventually consistent,” which is a bit misleading. Out of the box, Cassandra trades some consistency in order to achieve total availability. But Cassandra is more accurately termed “tuneably consistent,” which means it allows you to easily decide the level of consistency you require, in balance with the level of availability.
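The trade-off behind "tuneable" consistency can be illustrated with a little arithmetic. In replicated stores of this kind, a read is guaranteed to see the latest acknowledged write when the number of replicas consulted on reads plus the number acknowledged on writes exceeds the replication factor, because the two sets of replicas must then overlap. The sketch below is written in Python with hypothetical helper names (`quorum`, `is_strongly_consistent`); it models only this rule and is not Cassandra's client API.

```python
def quorum(replication_factor):
    """Smallest majority of the replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def is_strongly_consistent(replication_factor, write_replicas, read_replicas):
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


rf = 3            # assumed replication factor for this example
q = quorum(rf)    # majority of 3 replicas

# Writing and reading at a majority: the replica sets always overlap.
quorum_both = is_strongly_consistent(rf, write_replicas=q, read_replicas=q)

# Writing to one replica and reading from one: fast and highly available,
# but a read may reach a replica that has not received the write yet.
one_and_one = is_strongly_consistent(rf, write_replicas=1, read_replicas=1)

# Writing to all replicas lets a single-replica read stay consistent,
# at the cost of writes failing whenever any replica is down.
all_and_one = is_strongly_consistent(rf, write_replicas=rf, read_replicas=1)
```

Lower replica counts favour availability and latency, while higher counts favour consistency; choosing them per operation is the kind of balance the paragraph above describes.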