1706409 Zhu Na
Agenda
• What is Cassandra
• Main features and known issues
• Demo : Use Cassandra for OLAP
What is Cassandra
• Apache Cassandra is:
• Free to download and install
• Open source, and still actively developed on GitHub and JIRA
• A NoSQL database management system
• Designed to be distributed:
o Handles large amounts of data
o Runs across many commodity servers
o Provides high availability with no single point of failure
Cassandra Query Language (CQL)
• CQL is:
• A simple interface for accessing Cassandra
• An alternative to the traditional Structured Query Language (SQL)
• CQL provides native syntax for collections and other common encodings.
• Language drivers are available for Java (JDBC), Python (DBAPI2), Node.JS (Helenus), Go (gocql) and C++.
Something special
• Scalability
• MapReduce support
• Distributed
• Supports replication and multi data center replication
• Fault-tolerant
• consistency
Scalability
Map-Reduce
• Hadoop vs Spark
• Spark + Cassandra
Distributed: How to store data
• Key features of Cassandra’s distributed architecture are specifically tailored for multiple-data-center deployment.
• Cassandra operates by dividing all data evenly around a cluster of nodes, which can be visualized as a ring. Nodes generally run on commodity hardware. Each Cassandra node in the cluster is assigned, and is responsible for, a token range (essentially a range of hashes defined by a partitioner).
• Each update or addition of data carries a row key (the primary key). The partition key part of the primary key is hashed to find the replica (node) responsible for the token range that includes that key. The data is then stored in the cluster n times (where n is the keyspace’s replication factor), once on each replica responsible for the row key.
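The ring described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: MD5 stands in for Cassandra's real partitioner, each node gets a single token, and replicas are simply the next nodes clockwise. The names `Ring`, `token` and the node labels are hypothetical.

```python
import bisect
import hashlib


def token(partition_key: str) -> int:
    """Hash a key onto the ring (MD5 is only a stand-in for the real partitioner)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    """A toy token ring: one token per node, replicas chosen clockwise."""

    def __init__(self, node_names, replication_factor):
        if replication_factor > len(node_names):
            raise ValueError("replication factor cannot exceed the number of nodes")
        self.replication_factor = replication_factor
        # Each node owns the token range that ends at its own token.
        self.tokens = sorted((token(name), name) for name in node_names)

    def replicas(self, partition_key):
        """Return the nodes that store this key: the range owner, then its successors."""
        positions = [position for position, _ in self.tokens]
        # First node whose token is >= the key's token; wrap around past the end.
        start = bisect.bisect_left(positions, token(partition_key)) % len(self.tokens)
        return [
            self.tokens[(start + step) % len(self.tokens)][1]
            for step in range(self.replication_factor)
        ]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
owners = ring.replicas("Disney")  # three distinct nodes, found without asking every node
```

Because the lookup is a hash plus a binary search, any coordinator can find the replicas for a key directly instead of querying every node.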
Distributed: How to read / write data
• Read request: if a read is processed at QUORUM consistency and the keyspace was created with a “replication factor” of 3, then 2 of the 3 replicas for the requested data are contacted, their results are merged, and a single result is returned to the client.
• Write request: the coordinator node sends the write, with all mutated columns, to all replica nodes for the given row key. On each replica:
• First, the write is added to the commit log, which ensures the durability of the write.
• Next, it is also added to the memtable. A memtable is a bounded in-memory write-back cache that contains recent writes which have not yet been flushed to an SSTable (a permanent, immutable, serialized on-disk copy of the table’s data).
• When updates cause a memtable to reach its configured maximum in-memory size, the memtable is flushed to an immutable SSTable, persisting its data permanently on disk and making room for future updates.
• In the event of a crash or node failure, events are replayed from the commit log, which prevents the loss of any data from memtables that had not been flushed to disk before an unexpected event such as a power outage or crash.
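The write path on a single replica can be modelled roughly as follows. This is a simplified sketch, not Cassandra's implementation: the class `Replica` and the parameter `memtable_limit` are hypothetical, the limit counts entries rather than bytes, and Python lists and dicts stand in for the on-disk commit log and SSTables.

```python
from types import MappingProxyType


class Replica:
    """Toy model of one replica: commit log, memtable, and flushed SSTables."""

    def __init__(self, memtable_limit):
        self.memtable_limit = memtable_limit  # assumption: counted in entries, not bytes
        self.commit_log = []  # append-only record of writes not yet flushed
        self.memtable = {}    # recent writes, held in memory
        self.sstables = []    # read-only snapshots, oldest first

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. log first, for durability
        self.memtable[key] = value            # 2. then update the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # 3. flush when the memtable is full

    def flush(self):
        # Persist the memtable as a sorted, read-only table and start afresh.
        self.sstables.append(MappingProxyType(dict(sorted(self.memtable.items()))))
        self.memtable = {}
        self.commit_log = []  # flushed writes no longer need to be replayed

    def crash_and_recover(self):
        self.memtable = {}  # everything held in memory is lost
        for key, value in self.commit_log:
            self.memtable[key] = value  # replay unflushed writes from the log

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest SSTable first
            if key in table:
                return table[key]
        return None


replica = Replica(memtable_limit=2)
replica.write("AA", 1)
replica.write("DL", 2)  # memtable is full, so it is flushed to an SSTable
replica.write("WN", 3)  # lives only in the commit log and the memtable
replica.crash_and_recover()
```

After the simulated crash, the write for "WN" is still readable because it was replayed from the commit log, while "AA" and "DL" are served from the flushed SSTable.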
Something tricky
• Cassandra is not row-level consistent:
• When inserts and updates to a table
o affect the same row and are processed at approximately the same time,
o they may affect the non-key columns in inconsistent ways.
o One update may affect one column while another affects the other,
o resulting in sets of values within the row that were never specified or intended.
o When applying an update, Cassandra does not check whether the data conflicts!
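A small sketch of how such a mixed row can arise, assuming each column value carries its own write timestamp and the newest timestamp wins per column. The function `apply_write`, the column names and the timestamps are made up for illustration, and timestamp ties are ignored.

```python
def apply_write(row, columns, timestamp):
    """Merge a write into a row, keeping the newest value for each column separately."""
    for name, value in columns.items():
        current = row.get(name)
        if current is None or timestamp > current[1]:
            row[name] = (value, timestamp)
    return row


# Client A sets both columns; client B, almost simultaneously, changes only the gate.
write_a = ({"gate": "A1", "terminal": "A"}, 100)
write_b = ({"gate": "B7"}, 101)

row = {}
apply_write(row, *write_a)
apply_write(row, *write_b)
values = {name: cell[0] for name, cell in row.items()}
# values == {"gate": "B7", "terminal": "A"}: a combination neither client asked for

# The result does not depend on arrival order, and no conflict is ever reported.
reordered = {}
apply_write(reordered, *write_b)
apply_write(reordered, *write_a)
```

Neither write is rejected: the row silently ends up with gate B7 in terminal A, which is exactly the kind of unintended combination the slide warns about.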
Data model
• The most important thing to know in Cassandra data modeling: the primary key
• The simplest form:
• The first element in our PRIMARY KEY is what we call the partition key.
• The partition key has a special use in Apache Cassandra beyond showing the uniqueness of the record in the database. Its other purpose, and one that is very critical in distributed systems, is determining data locality.
• With more elements added:
• All columns listed after the partition key are called clustering columns.
• This is where Cassandra takes a huge break from relational databases. Where the partition key is important for data locality, the clustering columns specify the order in which the data is arranged inside the partition. The way we read this is left to right:
• Item one is the partition key.
• Item two is the first clustering column.
• Item three is the second clustering column.
• After inserting data, you should expect your SELECT to return data in the ascending/descending order of item two for a single partition.
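To make the left-to-right reading concrete, here is a toy in-memory model, assuming ascending clustering order. The class `Table` is hypothetical and the sample rows are invented; the column names follow the key used later in the demo.

```python
from collections import defaultdict


class Table:
    """Toy table: rows grouped by partition key, ordered by clustering columns."""

    def __init__(self, partition_key, clustering_columns):
        self.partition_key = partition_key
        self.clustering_columns = clustering_columns
        self.partitions = defaultdict(dict)  # partition value -> {clustering tuple: row}

    def insert(self, row):
        partition_value = row[self.partition_key]  # item one decides where the row lives
        clustering_value = tuple(row[name] for name in self.clustering_columns)
        self.partitions[partition_value][clustering_value] = row

    def select(self, partition_value):
        """Return one partition's rows in clustering order (ascending here)."""
        partition = self.partitions.get(partition_value, {})
        return [partition[key] for key in sorted(partition)]


table = Table("airline", ["destination_airport", "origin_airport"])
# Invented sample rows, inserted in no particular order.
table.insert({"airline": "AA", "destination_airport": "LAX", "origin_airport": "JFK"})
table.insert({"airline": "AA", "destination_airport": "BOS", "origin_airport": "ORD"})
table.insert({"airline": "DL", "destination_airport": "ATL", "origin_airport": "JFK"})
table.insert({"airline": "AA", "destination_airport": "LAX", "origin_airport": "BOS"})

routes = [(r["destination_airport"], r["origin_airport"]) for r in table.select("AA")]
```

Whatever the insertion order, the "AA" partition comes back sorted by destination first and origin second, and the "DL" row is never touched by that query.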
Demo for flight “delay”
• Maybe we have all experienced being late to catch a flight, or running like crazy through the airport to transfer to the next flight because the previous one was delayed.
• Did you ever notice that sometimes your flight even departs earlier than scheduled? How often might this happen?
• While booking tickets, how could I know that this airline is “always late”, or that this transfer airport is always crowded, so that I can still take a walk even if the first flight is one hour later than scheduled?
• If we know where and how to look at the data, can we avoid some problems that have a very high probability of happening?
Dataset
• Source: Kaggle dataset flight-delay
• Files: flights.csv (USA, 2015, all the unscheduled flights); airlines.csv; airports.csv
CQL
• Use the primary key / clustering key
• No joins
• ALLOW FILTERING
• Gives you more control:
• User-defined functions (UDF)
• User-defined aggregate functions (UDA)
Start Cassandra (Mac OS) and import the CSV data
• Start Cassandra first in a terminal: /usr/local/apache-cassandra-3.10/bin/cassandra -f
• Then start cqlsh in another terminal tab: /usr/local/apache-cassandra-3.10/bin/cqlsh
• Time for fun in cqlsh:
  CREATE KEYSPACE flight WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
  USE flight;
  CREATE TABLE flight
    (YEAR SMALLINT, MONTH SMALLINT,
    ….
    WEATHER_DELAY TEXT,
    PRIMARY KEY (AIRLINE, destination_airport, origin_airport));
  COPY flight (YEAR, MONTH, DAY, ….., WEATHER_DELAY)
    FROM '/Users/nanazhu/Downloads/flights.csv'
    WITH header=true AND NULL = 'NULL';
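One consequence of this primary key is worth checking before trusting any counts: Cassandra writes are upserts, so CSV rows that share the same airline, destination and origin replace one another during the import. The sketch below imitates the COPY options (header row, 'NULL' marker) on a tiny invented sample; the function `load` is hypothetical, and replacing the whole row is a simplification of Cassandra's per-column behavior.

```python
import csv
import io

# Invented sample in the spirit of flights.csv; not real data.
SAMPLE_CSV = """AIRLINE,ORIGIN_AIRPORT,DESTINATION_AIRPORT,DEPARTURE_DELAY,WEATHER_DELAY
AA,JFK,LAX,-3,NULL
AA,JFK,LAX,25,10
DL,JFK,LAX,0,NULL
"""


def load(text, key_columns, null_marker="NULL"):
    """Read CSV text with a header row, map the null marker to None, upsert by key."""
    table = {}
    rows_read = 0
    for record in csv.DictReader(io.StringIO(text)):  # the first line supplies column names
        row = {name: (None if value == null_marker else value)
               for name, value in record.items()}
        key = tuple(row[name] for name in key_columns)
        table[key] = row  # same primary key: the later row replaces the earlier one
        rows_read += 1
    return rows_read, table


rows_read, table = load(SAMPLE_CSV, ("AIRLINE", "DESTINATION_AIRPORT", "ORIGIN_AIRPORT"))
# Three rows were read, but only two distinct primary keys remain.
```

If every flight should be kept, the key would need a column that distinguishes individual flights; which column to add depends on the queries and is left open here.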
Query: I want to fly from JFK to LAX. Which airline / what time should be double-checked?
At what time do delays happen most?
Query: How many times, and for how much time in total, does an airline depart early?
Query: How many times, and for how much time in total, does an airline arrive late?
Reference
• From DataStax
• Using CQL
• DS220: Data Modeling
• From Tutorialspoint
• Cassandra tutorial
Questions?
Thank you

Cassandra Tutorial


Editor's Notes

  • #2: Request Development of a small project. Students are strongly encouraged to propose their own idea for projects. As a suggestion, they can refer to (and also select from) the following list of tools. The project connected to a tool consists, for example, in studying the logical data model(s) adopted by the tool, the native storage data structure it uses, the query language it provides, and highlighting further distinguishing features. Also, a demonstration of the basic use of the tool through one or more examples is required. Presentation connected to projects (possibly through slides) should last around 20 minutes (including the demo).
  • #3: https://academy.datastax.com/resources/ds101-introduction-cassandra?unit=introduction-cassandra-overview https://academy.datastax.com/resources/getting-started-apache-spark?unit=connecting-spark-reading-data-cassandra
  • #4: https://www.tutorialspoint.com/cassandra/cassandra_architecture.htm
  • #5: https://www.tutorialspoint.com/cassandra/cassandra_shell_commands.htm
  • #6: About
  • #9: Why do we need hashing? Imagine you are searching for data that might contain “Disney” and you don’t know which node(s) hold it, so you would have to turn over every stone to find it. Do you really think it is a good idea to ask every node “do you have the data ‘Disney’?” Solution: hash the primary key, go directly to the node that has the data, and do the rest of the operation there!
  • #12: https://www.tutorialspoint.com/cassandra/cassandra_data_model.htm
  • #18:
    CREATE OR REPLACE FUNCTION group_time_and_sum_delay(state map<smallint, int>, time smallint, delay int)
        CALLED ON NULL INPUT
        RETURNS map<smallint, int>
        LANGUAGE java
        AS 'if (delay != null && delay > 0) state.put(time, delay + state.getOrDefault(time, 0)); return state;';

    CREATE AGGREGATE IF NOT EXISTS group_time_and_sum(smallint, int)
        SFUNC group_time_and_sum_delay
        STYPE map<smallint, int>
        INITCOND {};

    select group_time_and_sum(MONTH, ARRIVAL_DELAY) from flight;
    select group_time_and_sum(day_of_week, ARRIVAL_DELAY) from flight;
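For readers unfamiliar with user-defined aggregates, the aggregate above behaves like a fold: the state starts at INITCOND and the state function (SFUNC) is applied once per row. A rough Python equivalent, using invented (time, delay) pairs rather than the real dataset:

```python
from functools import reduce


def group_time_and_sum_delay(state, time, delay):
    """State function: add a positive delay to the running total for its time bucket."""
    if delay is not None and delay > 0:
        state[time] = delay + state.get(time, 0)
    return state


def group_time_and_sum(rows):
    """Aggregate: start from the initial condition {} and fold the state function over rows."""
    return reduce(lambda state, row: group_time_and_sum_delay(state, *row), rows, {})


# Invented (month, arrival_delay) pairs; None models a NULL delay.
rows = [(1, 15), (1, None), (2, -4), (1, 5), (2, 30)]
totals = group_time_and_sum(rows)
```

Null and non-positive delays are skipped, so only late arrivals contribute to each bucket's total.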
  • #19:
    cqlsh:flight> CREATE OR REPLACE FUNCTION group_airline_and_sum_early(state map<text, int>, airline text, delay int)
        CALLED ON NULL INPUT
        RETURNS map<text, int>
        LANGUAGE java
        AS 'if (delay != null && delay < 0) state.put(airline, delay + state.getOrDefault(airline, 0)); return state;';
    cqlsh:flight> CREATE AGGREGATE IF NOT EXISTS group_and_sum_early(text, int)
        SFUNC group_airline_and_sum_early
        STYPE map<text, int>
        INITCOND {};
    cqlsh:flight> CREATE OR REPLACE FUNCTION group_airline_and_counter_early(state map<text, int>, airline text, delay int)
        CALLED ON NULL INPUT
        RETURNS map<text, int>
        LANGUAGE java
        AS 'if (delay != null && delay < 0) state.put(airline, 1 + state.getOrDefault(airline, 0)); return state;';
    cqlsh:flight> CREATE AGGREGATE IF NOT EXISTS group_and_count_early(text, int)
        SFUNC group_airline_and_counter_early
        STYPE map<text, int>
        INITCOND {};

    cqlsh:flight> select group_and_count_early(airline, departure_delay) from flight;

     flight.group_and_count_early(airline, departure_delay)
    --------------------------------------------------------
     {'AA': 459, 'AS': 118, 'B6': 273, 'DL': 476, 'EV': 754, 'F9': 215, 'HA': 25, 'MQ': 228, 'NK': 257, 'OO': 686, 'UA': 588, 'US': 162, 'VX': 39, 'WN': 1341}

    (1 rows)

    Warnings :
    Aggregation query used without partition key

    cqlsh:flight> select group_and_sum_early(airline, departure_delay) from flight;

     flight.group_and_sum_early(airline, departure_delay)
    ------------------------------------------------------
     {'AA': -3995, 'AS': -3010, 'B6': -1707, 'DL': -4560, 'EV': -7246, 'F9': -2277, 'HA': -419, 'MQ': -2300, 'NK': -1972, 'OO': -8081, 'UA': -2608, 'US': -707, 'VX': -398, 'WN': -3849}

    (1 rows)

    Warnings :
    Aggregation query used without partition key

    cqlsh:flight> select airline from airlines where iata_code in ('DL','WN');

     airline
    ------------------------
     Delta Air Lines Inc.
     Southwest Airlines Co.
  • #20:
    cqlsh:flight> CREATE OR REPLACE FUNCTION group_airline_and_sum_late(state map<text, int>, airline text, delay int)
        CALLED ON NULL INPUT
        RETURNS map<text, int>
        LANGUAGE java
        AS 'if (delay != null && delay > 0) state.put(airline, delay + state.getOrDefault(airline, 0)); return state;';
    cqlsh:flight> CREATE AGGREGATE IF NOT EXISTS group_and_sum_late(text, int)
        SFUNC group_airline_and_sum_late
        STYPE map<text, int>
        INITCOND {};
    cqlsh:flight> CREATE OR REPLACE FUNCTION group_airline_and_counter_late(state map<text, int>, airline text, delay int)
        CALLED ON NULL INPUT
        RETURNS map<text, int>
        LANGUAGE java
        AS 'if (delay != null && delay > 0) state.put(airline, 1 + state.getOrDefault(airline, 0)); return state;';
    cqlsh:flight> CREATE AGGREGATE IF NOT EXISTS group_and_count_late(text, int)
        SFUNC group_airline_and_counter_late
        STYPE map<text, int>
        INITCOND {};

    cqlsh:flight> select group_and_sum_late(airline, ARRIVAL_DELAY) from flight;

     flight.group_and_sum_late(airline, arrival_delay)
    ---------------------------------------------------
     {'AA': 13623, 'AS': 2347, 'B6': 8998, 'DL': 13679, 'EV': 38140, 'F9': 8255, 'HA': 434, 'MQ': 9605, 'NK': 10276, 'OO': 28878, 'UA': 17146, 'US': 8407, 'VX': 805, 'WN': 34363}

    (1 rows)

    Warnings :
    Aggregation query used without partition key

    cqlsh:flight> select group_and_count_late(airline, ARRIVAL_DELAY) from flight;

     flight.group_and_count_late(airline, arrival_delay)
    -----------------------------------------------------
     {'AA': 447, 'AS': 129, 'B6': 225, 'DL': 424, 'EV': 855, 'F9': 232, 'HA': 29, 'MQ': 251, 'NK': 262, 'OO': 897, 'UA': 364, 'US': 192, 'VX': 29, 'WN': 1013}

    (1 rows)

    Warnings :
    Aggregation query used without partition key

    cqlsh:flight> select airline from airlines where iata_code in ('HA','VX','WN');

     airline
    ------------------------
     Hawaiian Airlines Inc.
     Virgin America
     Southwest Airlines Co.