Introduction to Data Modeling with Apache Cassandra
Luke Tillman (@LukeTillman)
Language Evangelist at DataStax
1 Relational Modeling vs. Cassandra
2 The Basics
3 CQL Collections
4 Relationships
5 Time Series Use Case
Relational Modeling vs. Cassandra
The Good ol’ Relational Database
• Been around a long time (first proposed in 1970)
• Data modeling is well understood (typically 3NF or higher)
• ACID guarantees are easy for developers to reason about
• SQL is ubiquitous and allows flexible querying
– JOINs, Sub SELECTs, etc.
Relational Data Modeling
• Five normal forms
• Foreign Keys
• Joins at read time
– Example SQL: Get employee and department for user id 5 (Helena Edelson)

Employees
Id  First   Last     DeptId
1   Luke    Tillman  201
2   Jon     Haddad   201
5   Helena  Edelson  205

Departments
Id   Dept
201  Evangelists
205  Engineering

SELECT e.First, e.Last, d.Dept
FROM Employees e
JOIN Departments d
ON e.DeptId = d.Id
WHERE e.Id = 5
Relational Data Modeling Thought Process
Data -> Models -> Application
Cassandra Data Modeling Thought Process
Application -> Models -> Data
CQL vs SQL
• Similar syntax in many cases, but...
• No Joins
• No Aggregations

Employees
Id  First   Last     DeptId
1   Luke    Tillman  201
2   Jon     Haddad   201
5   Helena  Edelson  205

Departments
Id   Dept
201  Evangelists
205  Engineering

SELECT e.First, e.Last, d.Dept
FROM Employees e
JOIN Departments d
ON e.DeptId = d.Id
WHERE e.Id = 5
Denormalization
• Combine table columns into single view at write time
• No joins necessary

Employees
Id  First   Last     Dept
1   Luke    Tillman  Evangelists
2   Jon     Haddad   Evangelists
5   Helena  Edelson  Engineering

SELECT First, Last, Dept
FROM Employees
WHERE Id = 5
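
A rough way to see the difference between the read-time join and the denormalized table is the sketch below. It uses Python's built-in sqlite3 module as a stand-in relational engine, which is an assumption made purely for illustration (it is not Cassandra), and the table name employees_denorm is hypothetical. The department name is copied into each employee row at write time, so the second read needs no join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Departments (Id INTEGER PRIMARY KEY, Dept TEXT);
    CREATE TABLE Employees (Id INTEGER PRIMARY KEY, First TEXT, Last TEXT, DeptId INTEGER);
    INSERT INTO Departments VALUES (201, 'Evangelists'), (205, 'Engineering');
    INSERT INTO Employees VALUES
        (1, 'Luke', 'Tillman', 201),
        (2, 'Jon', 'Haddad', 201),
        (5, 'Helena', 'Edelson', 205);
""")

# Relational approach: combine the two tables at read time.
joined = conn.execute("""
    SELECT e.First, e.Last, d.Dept
    FROM Employees e
    JOIN Departments d ON e.DeptId = d.Id
    WHERE e.Id = 5
""").fetchone()

# Denormalized approach: store the department name with each employee
# when the row is written (employees_denorm is a hypothetical name).
conn.execute(
    "CREATE TABLE employees_denorm (Id INTEGER PRIMARY KEY, First TEXT, Last TEXT, Dept TEXT)"
)
conn.execute("""
    INSERT INTO employees_denorm
    SELECT e.Id, e.First, e.Last, d.Dept
    FROM Employees e JOIN Departments d ON e.DeptId = d.Id
""")
denormalized = conn.execute(
    "SELECT First, Last, Dept FROM employees_denorm WHERE Id = 5"
).fetchone()

print(joined, denormalized)
```

Both reads return the same row; the cost of combining the data has simply moved from every read to the single write.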
Sequences and Auto-Incrementing Ids
• Great for letting the RDBMS handle auto-generating Ids
• Guaranteed to be unique
• Needs ACID to work (uh oh)
INSERT INTO Employees (Id, First, Last)
VALUES (seq.nextVal(), 'Patrick', 'McFadin')
No More Sequences
• Almost impossible in a distributed system like Cassandra
• Couple of great choices instead:
– Natural Keys: Unique values like Email
– Surrogate Key: UUID (or GUID for MS folks)
• UUID: Universally Unique Identifier
– 128-bit number represented in character form
– Can be generated easily on the client side
99051fe9-6a9c-46c2-b949-38ef78858dd0
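
As a small illustration of generating ids on the client side, the sketch below uses Python's standard uuid module. The variable names are made up for the example; the only value taken from the slide is the sample UUID.

```python
import uuid

# Version 4 UUIDs are random, so any client can create one without
# coordinating with the database or with other clients.
user_id = uuid.uuid4()

# The sample value from the slide parses as a version 4 UUID as well.
slide_example = uuid.UUID("99051fe9-6a9c-46c2-b949-38ef78858dd0")

print(user_id)                 # 36-character text form of the 128-bit value
print(slide_example.version)
```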
The Basics
Cassandra Data Modeling Thought Process
• Start with your application and the queries it needs to run
• Then build models to satisfy those queries
Application -> Models -> Data
Entity Table
• Query: Find user by id
• Simple view of a single user
• UUID used for ID
• Simple primary key
CREATE TABLE users (
userid uuid,
firstname text,
lastname text,
email text,
created_date timestamp,
PRIMARY KEY (userid)
);
SELECT firstname, lastname
FROM users
WHERE userid = 99051fe9-6a9c-46c2-b949-38ef78858dd0
Entity Table – A reminder on Partition Keys
• First part of Primary Key is the Partition Key
CREATE TABLE users (
userid uuid,
firstname text,
lastname text,
email text,
created_date timestamp,
PRIMARY KEY (userid)
);
userid       firstname  ...
689d56e5- …  Luke       ...
93357d73- …  Jon        ...
d978b136- …  Patrick    ...
More Complicated Primary Keys
• Query: Find comments for a video (most recent first)
CREATE TABLE comments_by_video (
videoid uuid,
commentid timeuuid,
userid uuid,
comment text,
PRIMARY KEY (videoid, commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);
SELECT commentid, userid, comment
FROM comments_by_video
WHERE videoid = 0fe6ab76-cf17-4664-abcc-4e363cee273f
LIMIT 10
Let's Break This Down
• TimeUUID: a UUID with a timestamp component
• Ordering by a TimeUUID is like ordering by its timestamp
CREATE TABLE comments_by_video (
videoid uuid,
commentid timeuuid,
userid uuid,
comment text,
PRIMARY KEY (videoid, commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);
eeaca440-c745-11e4-8830-0800200c9a66 -> 03/10/2015 16:53:09 GMT
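
To make the timestamp component concrete, here is a minimal sketch that decodes it with Python's standard uuid module. The helper name timeuuid_to_datetime is hypothetical; it relies on the fact that version 1 UUIDs count 100-nanosecond ticks from 1582-10-15.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Version 1 UUID timestamps count 100-nanosecond ticks since 1582-10-15.
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)

def timeuuid_to_datetime(value: uuid.UUID) -> datetime:
    """Return the UTC timestamp embedded in a version 1 (time-based) UUID."""
    if value.version != 1:
        raise ValueError("not a version 1 (time-based) UUID")
    # value.time is in 100 ns ticks; divide by 10 to get microseconds.
    return GREGORIAN_EPOCH + timedelta(microseconds=value.time // 10)

comment_id = uuid.UUID("eeaca440-c745-11e4-8830-0800200c9a66")
created = timeuuid_to_datetime(comment_id)
print(created.isoformat())
```

Applied to the TimeUUID on the slide, this yields the 10 March 2015, 16:53:09 GMT timestamp shown next to it.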
Let's Break This Down
• The Primary Key uniquely identifies a row, so a comment is uniquely identified by its videoid and commentid
CREATE TABLE comments_by_video (
videoid uuid,
commentid timeuuid,
userid uuid,
comment text,
PRIMARY KEY (videoid, commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);
Let's Break This Down
• The first part of the Primary Key is the Partition Key, so comments for a given video will be stored together in a partition
• When we query for a given videoid, we only need to talk to one partition (and thus one node), which is fast
CREATE TABLE comments_by_video (
videoid uuid,
commentid timeuuid,
userid uuid,
comment text,
PRIMARY KEY (videoid, commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);
Let's Break This Down
• The second part of the Primary Key is the Clustering Column(s)
• Inside a partition, comments for a given video will be ordered by commentid
• Remember ordering by TimeUUID is ordering by timestamp
CREATE TABLE comments_by_video (
videoid uuid,
commentid timeuuid,
userid uuid,
comment text,
PRIMARY KEY (videoid, commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);
Let's Break This Down
• We can specify a default clustering order when creating the table, which will affect the ordering of the data stored on disk
• Since our query was to get the latest comments for a video, we order by commentid descending
CREATE TABLE comments_by_video (
videoid uuid,
commentid timeuuid,
userid uuid,
comment text,
PRIMARY KEY (videoid, commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);
Let's Break This Down
CREATE TABLE comments_by_video (
videoid uuid,
commentid timeuuid,
userid uuid,
comment text,
PRIMARY KEY (videoid, commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);
Partition: videoid='0fe6a...'
  commentid='82be1...' (10/1/2014 9:36AM)   userid='ac346...'   comment='Awesome!'
  commentid='765ac...' (9/17/2014 7:55AM)   userid='f89d3...'   comment='Garbage!'
This query will be fast
Partition: videoid='0fe6a...'
  commentid='82be1...' (10/1/2014 9:36AM)   userid='ac346...'   comment='Awesome!'
  commentid='765ac...' (9/17/2014 7:55AM)   userid='f89d3...'   comment='Garbage!'
SELECT commentid, userid, comment
FROM comments_by_video
WHERE videoid = 0fe6ab76-cf17-4664-abcc-4e363cee273f
LIMIT 10
1. Locate single partition
2. Single seek on disk
3. Slice 10 latest rows and return
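
The three steps above can be pictured with a toy in-memory model. This is only a sketch under stated assumptions: a dictionary stands in for the cluster, integers stand in for timeuuid values (larger means more recent), and the function names are hypothetical. It is not how Cassandra is implemented.

```python
from collections import defaultdict

# Toy model: partition key -> rows kept sorted by clustering column, newest first.
comments_by_video = defaultdict(list)

def insert_comment(videoid, commentid, userid, comment):
    rows = comments_by_video[videoid]
    rows.append((commentid, userid, comment))
    # Mimic CLUSTERING ORDER BY (commentid DESC): keep rows stored newest first.
    rows.sort(key=lambda row: row[0], reverse=True)

def latest_comments(videoid, limit=10):
    partition = comments_by_video.get(videoid, [])  # 1. locate the single partition
    return partition[:limit]                        # 2-3. read the first rows in stored order

for n in range(25):
    insert_comment("video-a", n, f"user-{n % 3}", f"comment {n}")
insert_comment("video-b", 100, "user-9", "other video")

top = latest_comments("video-a")
print([row[0] for row in top])
```

Because the rows are already stored newest first, the read is a plain prefix of one partition; no sorting or cross-partition work happens at query time.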
Getting the most from queries
• Queries on Partition Key are fast
– Querying inside a single partition should be the goal
– Always specify a value for partition key when querying
• Queries on Partition Key and one or more Clustering Column(s) are fast
– Again, inside a single partition should be the goal
– Use default ordering when creating the table to optimize if applicable
• Cassandra rejects most queries that stray from these patterns (for example, filtering without the partition key fails unless you add ALLOW FILTERING)
More than one way to query the same data
• New Query: Find comments made by a user (most recent first)
CREATE TABLE comments_by_user (
userid uuid,
commentid timeuuid,
videoid uuid,
comment text,
PRIMARY KEY (userid, commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);
SELECT commentid, videoid, comment
FROM comments_by_user
WHERE userid = 99051fe9-6a9c-46c2-b949-38ef78858dd0
LIMIT 10
More than one way to query the same data
• Two views of the same data
• Use a batch when inserting to both tables
• Denormalize at write time to do efficient queries at read time
CREATE TABLE comments_by_user (
userid uuid,
commentid timeuuid,
videoid uuid,
comment text,
PRIMARY KEY (userid, commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);
CREATE TABLE comments_by_video (
videoid uuid,
commentid timeuuid,
userid uuid,
comment text,
PRIMARY KEY (videoid, commentid)
) WITH CLUSTERING ORDER BY (commentid DESC);
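
One way to picture "use a batch when inserting to both tables" is the sketch below, where two dictionaries stand in for the two tables and one function applies both writes together. The function add_comment and the in-memory layout are assumptions for illustration only; a real application would send a CQL batch through its driver.

```python
import uuid
from collections import defaultdict

comments_by_video = defaultdict(dict)  # videoid -> {commentid: row}
comments_by_user = defaultdict(dict)   # userid  -> {commentid: row}

def add_comment(videoid, userid, comment):
    """Write one comment into both query tables, like a two-statement batch."""
    commentid = uuid.uuid1()  # time-based UUID, playing the role of a timeuuid
    batch = [
        (comments_by_video, videoid, {"userid": userid, "comment": comment}),
        (comments_by_user, userid, {"videoid": videoid, "comment": comment}),
    ]
    for table, partition_key, row in batch:
        table[partition_key][commentid] = row
    return commentid

video = uuid.uuid4()
user = uuid.uuid4()
cid = add_comment(video, user, "Awesome!")
print(comments_by_video[video][cid], comments_by_user[user][cid])
```

The same commentid appears in both tables, so either view can be read with a single-partition query.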
CQL Collections
CQL Collection Basics
• Store a collection of related things in a column
• Meant to be a dynamic part of a table
• Update syntax is very different from insert
• Reads require all of the collection to be read
CQL Set
• No duplicates, sorted by CQL type's comparator
INSERT INTO collections_example (id, set_example)
VALUES (1, {'Patrick', 'Jon', 'Luke'});
set_example set<text>
  set_example: collection name (column name)
  set: collection type
  text: CQL type
CQL Set
• Adding an element to a set
UPDATE collections_example
SET set_example = set_example + {'Rebecca'}
WHERE id = 1
• Removing an element from a set
UPDATE collections_example
SET set_example = set_example - {'Luke'}
WHERE id = 1
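
For readers who think in application code, the set updates above behave roughly like this Python analogue. It is only a sketch of the semantics (no duplicates, sorted on read), not driver code.

```python
# A Python set has no order, so sort on read to mimic CQL's sorted set.
set_example = {"Patrick", "Jon", "Luke"}

set_example = set_example | {"Rebecca"}  # set_example + {'Rebecca'}
set_example = set_example | {"Jon"}      # adding a duplicate changes nothing
set_example = set_example - {"Luke"}     # set_example - {'Luke'}

as_stored = sorted(set_example)
print(as_stored)
```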
CQL List
• Allows duplicates, sorted by insertion order
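• Use with caution: some list operations read before writing, and appends and prepends are not idempotent if a write is retried
INSERT INTO collections_example (id, list_example)
VALUES (1, ['Patrick', 'Jon', 'Luke']);
list_example list<text>
  list_example: collection name (column name)
  list: collection type
  text: CQL type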
CQL List
• Adding an element to the end of a list
UPDATE collections_example
SET list_example = list_example + ['Rebecca']
WHERE id = 1
• Adding an element to the beginning of a list
UPDATE collections_example
SET list_example = ['Rebecca'] + list_example
WHERE id = 1
• Removing an element from a list
UPDATE collections_example
SET list_example = list_example - ['Luke']
WHERE id = 1
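
A Python analogue of the three list updates, offered as a sketch of the semantics rather than driver code. The helper remove_all is hypothetical; it reflects that removing by value from a CQL list drops every matching element, not just the first.

```python
def remove_all(values, unwanted):
    """Drop every element of `values` that appears in `unwanted`."""
    return [v for v in values if v not in unwanted]

list_example = ["Patrick", "Jon", "Luke"]

list_example = list_example + ["Rebecca"]          # append to the end
list_example = ["Rebecca"] + list_example          # prepend to the beginning
after_luke = remove_all(list_example, ["Luke"])    # list_example - ['Luke']
after_rebecca = remove_all(after_luke, ["Rebecca"])

print(after_luke, after_rebecca)
```

Duplicates are allowed, so "Rebecca" appears twice after the append and prepend, and a single removal by value takes out both copies.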
CQL Map
• Key and value, sorted by key's CQL type comparator
INSERT INTO collections_example (id, map_example)
VALUES (1, { 'Patrick' : 72, 'Jon' : 33, 'Luke' : 34 });
map_example map<text, int>
  map_example: collection name (column name)
  map: collection type
  text: key CQL type
  int: value CQL type
CQL Map
• Adding an element to a map
UPDATE collections_example
SET map_example['Rebecca'] = 29
WHERE id = 1
• Updating an existing element in a map
UPDATE collections_example
SET map_example['Jon'] = 34
WHERE id = 1
• Removing an element from a map
DELETE map_example['Luke']
FROM collections_example
WHERE id = 1
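
The map updates follow the same pattern as a Python dictionary, as this sketch of the semantics shows: adding and updating use the same syntax, and entries read back sorted by key.

```python
map_example = {"Patrick": 72, "Jon": 33, "Luke": 34}

map_example["Rebecca"] = 29  # add a new key
map_example["Jon"] = 34      # the same syntax overwrites an existing key
del map_example["Luke"]      # DELETE map_example['Luke']

# Sort by key on read to mimic CQL's key-ordered map.
as_stored = sorted(map_example.items())
print(as_stored)
```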
Relationships
Revisiting our One-to-Many Relationship

Employees
Id        First   Last     DeptId
7bc7a...  Luke    Tillman  5078c...
d7463...  Jon     Haddad   5078c...
8c26b...  Helena  Edelson  1d0f3...

Departments
Id        Dept
5078c...  Evangelists
1d0f3...  Engineering

Department (1) has (n) Employee
Revisiting our One-to-Many Relationship
• Query: Get an employee and his/her department by employee id
– Denormalize department data

Employees
Id        First   Last     Dept
7bc7a...  Luke    Tillman  Evangelists
d7463...  Jon     Haddad   Evangelists
8c26b...  Helena  Edelson  Engineering

CREATE TABLE employees (
id uuid,
first text,
last text,
dept text,
PRIMARY KEY (id)
);
SELECT first, last, dept
FROM employees
WHERE id = 7bc7a...
What about the other side of the relationship?
• Query: Get all the employees for a given department
CREATE TABLE employees_by_dept (
dept_id uuid,
emp_id uuid,
first text,
last text,
dept text,
PRIMARY KEY (dept_id, emp_id)
);
SELECT first, last, dept
FROM employees_by_dept
WHERE dept_id = 5078c...
What about the other side of the relationship?
CREATE TABLE employees_by_dept (
dept_id uuid,
emp_id uuid,
first text,
last text,
dept text,
PRIMARY KEY (dept_id, emp_id)
);
Partition: dept_id='5078c...'
  emp_id='7bc7a...'   dept='Evangelists'   first='Luke'   last='Tillman'
  emp_id='d7463...'   dept='Evangelists'   first='Jon'    last='Haddad'
Static Columns
• Department name (dept) will be the same across all rows in the partition
• This is a good candidate for a static column
CREATE TABLE employees_by_dept (
dept_id uuid,
emp_id uuid,
first text,
last text,
dept text,
PRIMARY KEY (dept_id, emp_id)
);
Partition: dept_id='5078c...'
  emp_id='7bc7a...'   dept='Evangelists'   first='Luke'   last='Tillman'
  emp_id='d7463...'   dept='Evangelists'   first='Jon'    last='Haddad'
Static Columns
• For data that is shared across all rows in a partition, use static columns
• Updates to the value will affect all rows in the partition
CREATE TABLE employees_by_dept (
dept_id uuid,
emp_id uuid,
first text,
last text,
dept text STATIC,
PRIMARY KEY (dept_id, emp_id)
);
Partition: dept_id='5078c...'   dept='Evangelists' (static)
  emp_id='7bc7a...'   first='Luke'   last='Tillman'
  emp_id='d7463...'   first='Jon'    last='Haddad'
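
A static column can be pictured as one value stored at the partition level and reported on every row. The sketch below models that idea in plain Python; the class DeptPartition and the replacement department name "Developer Relations" are hypothetical and exist only for the example.

```python
from dataclasses import dataclass, field

@dataclass
class DeptPartition:
    dept_id: str
    dept: str = ""                            # static: one value per partition
    rows: dict = field(default_factory=dict)  # emp_id -> (first, last)

    def select_all(self):
        """Return rows the way a SELECT would, with the static value on each."""
        return [(emp_id, first, last, self.dept)
                for emp_id, (first, last) in sorted(self.rows.items())]

p = DeptPartition(dept_id="5078c...", dept="Evangelists")
p.rows["7bc7a..."] = ("Luke", "Tillman")
p.rows["d7463..."] = ("Jon", "Haddad")

p.dept = "Developer Relations"  # one write changes what every row reports
result = p.select_all()
print(result)
```

The department name is stored once, so renaming it is a single write rather than one update per employee row.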
Time Series Use Case
Weather Station
• Weather station collects data
• Cassandra stores in sequence
• Application reads in sequence
Weather Station
Needed Queries
• Get all data for one weather station
• Get data for a single date and time
• Get data for a range of dates and times
Data Model for Queries
• Store data per weather station
• Store time series in order: first to last
Weather Station
• Weather station id and time are unique
• Store as many as needed
CREATE TABLE temperatures (
weather_station text,
year int,
month int,
day int,
hour int,
temperature double,
PRIMARY KEY (weather_station, year, month, day, hour)
);
INSERT INTO temperatures (weather_station, year, month, day, hour, temperature)
VALUES ('10010:99999', 2005, 12, 1, 7, -5.6);
INSERT INTO temperatures (weather_station, year, month, day, hour, temperature)
VALUES ('10010:99999', 2005, 12, 1, 8, -5.1);
INSERT INTO temperatures (weather_station, year, month, day, hour, temperature)
VALUES ('10010:99999', 2005, 12, 1, 9, -4.9);
INSERT INTO temperatures (weather_station, year, month, day, hour, temperature)
VALUES ('10010:99999', 2005, 12, 1, 10, -5.3);
Storage Model: Logical View
SELECT weather_station, hour, temperature
FROM temperatures
WHERE weather_station = '10010:99999'

weather_station  hour  temperature
10010:99999      7     -5.6
10010:99999      8     -5.1
10010:99999      9     -4.9
10010:99999      10    -5.3
Storage Model: Disk Layout
SELECT weather_station, hour, temperature
FROM temperatures
WHERE weather_station = '10010:99999'

Partition: 10010:99999
  2005:12:1:7   -5.6
  2005:12:1:8   -5.1
  2005:12:1:9   -4.9
  2005:12:1:10  -5.3
Storage Model: Disk Layout
SELECT weather_station, hour, temperature
FROM temperatures
WHERE weather_station = '10010:99999'

Partition: 10010:99999
  2005:12:1:7   -5.6
  2005:12:1:8   -5.1
  2005:12:1:9   -4.9
  2005:12:1:10  -5.3
  2005:12:1:11
Merged, Sorted, and Stored Sequentially
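
One way to picture "merged, sorted, and stored sequentially" is a merge of already-sorted runs. The sketch below assumes the four readings arrived in two separately sorted batches (an assumption made for illustration; the slide does not say how the data was written) and combines them with the standard heapq.merge.

```python
import heapq

# Two already-sorted runs of ((year, month, day, hour), temperature) for one station.
run_a = [((2005, 12, 1, 7), -5.6), ((2005, 12, 1, 9), -4.9)]
run_b = [((2005, 12, 1, 8), -5.1), ((2005, 12, 1, 10), -5.3)]

# heapq.merge walks both runs once and yields a single sorted sequence.
merged = list(heapq.merge(run_a, run_b))
hours = [key[3] for key, _ in merged]
print(hours)
```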
Query Patterns
• Range queries
• "Slice" operation on disk
SELECT weather_station, hour, temperature
FROM temperatures
WHERE weather_station = '10010:99999'
AND year = 2005 AND month = 12 AND day = 1
AND hour >= 7 AND hour <= 10

Partition: 10010:99999 (partition key for locality)
  2005:12:1:7   -5.6
  2005:12:1:8   -5.1
  2005:12:1:9   -4.9
  2005:12:1:10  -5.3
  2005:12:1:11
Single seek on disk
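
The "slice" idea can be sketched with a sorted list and binary search. This is a toy model under stated assumptions: the partition is a Python list of clustering keys in sorted order, and slice_hours is a hypothetical helper, not a Cassandra API.

```python
from bisect import bisect_left, bisect_right

# One partition: clustering keys (year, month, day, hour) kept in sorted order.
keys = [(2005, 12, 1, 7), (2005, 12, 1, 8), (2005, 12, 1, 9), (2005, 12, 1, 10)]
temps = [-5.6, -5.1, -4.9, -5.3]

def slice_hours(year, month, day, first_hour, last_hour):
    """Return (hour, temperature) pairs for an inclusive hour range on one day."""
    lo = bisect_left(keys, (year, month, day, first_hour))
    hi = bisect_right(keys, (year, month, day, last_hour))
    return [(keys[i][3], temps[i]) for i in range(lo, hi)]

morning = slice_hours(2005, 12, 1, 8, 9)
whole_range = slice_hours(2005, 12, 1, 7, 10)
print(morning)
```

Because the keys are stored in order, a range query only needs to find the two ends of the range and read everything in between as one contiguous run.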
Query Patterns
• Range queries
• "Slice" operation on disk
SELECT weather_station, hour, temperature
FROM temperatures
WHERE weather_station = '10010:99999'
AND year = 2005 AND month = 12 AND day = 1
AND hour >= 7 AND hour <= 10

weather_station  hour  temperature
10010:99999      7     -5.6
10010:99999      8     -5.1
10010:99999      9     -4.9
10010:99999      10    -5.3
Query Patterns
• Programmers like this
SELECT weather_station, hour, temperature
FROM temperatures
WHERE weather_station = '10010:99999'
AND year = 2005 AND month = 12 AND day = 1
AND hour >= 7 AND hour <= 10

weather_station  hour  temperature
10010:99999      7     -5.6
10010:99999      8     -5.1
10010:99999      9     -4.9
10010:99999      10    -5.3
Sorted in time order
Takeaway: Goals of Cassandra Data Modeling
• Spread data evenly around the cluster
– Choose a good Primary Key (particularly, the Partition Key portion)
• Minimize the number of partitions read for a given query
– Remember: Partitions are spread out around the cluster
• Do not worry about:
– Minimizing the number of writes: Cassandra is really fast at writes
– Minimizing data duplication: this is not 3NF from RDBMS, disk is cheap
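
To illustrate why the partition key drives how evenly data spreads, here is a toy placement sketch. It hashes each partition key and takes the result modulo a node count. Everything here is an assumption for illustration: MD5 stands in for whatever partitioner the cluster uses, the modulo placement is far simpler than a real token ring, NODES is a made-up cluster size, and the station ids are invented.

```python
import hashlib
from collections import Counter

NODES = 4  # hypothetical cluster size

def node_for(partition_key: str) -> int:
    """Map a partition key to a node index by hashing it (toy placement)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NODES

# Invented station ids; many distinct keys spread across all nodes.
stations = [f"{10000 + i}:99999" for i in range(1000)]
load = Counter(node_for(s) for s in stations)
print(sorted(load.items()))
```

With many distinct partition key values the counts come out roughly balanced; a key with only a handful of distinct values would pile the data onto a few nodes.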
Questions?
Follow me for updates or to ask questions later: @LukeTillman

More Related Content

PDF
Building your First Application with Cassandra
PDF
KillrVideo: Data Modeling Evolved (Patrick McFadin, Datastax) | Cassandra Sum...
PDF
Cassandra Day Atlanta 2015: Building Your First Application with Apache Cassa...
PDF
Cassandra 3.0 Data Modeling
PDF
Advanced Data Modeling with Apache Cassandra
PDF
Cassandra 3.0 advanced preview
PDF
Coursera Cassandra Driver
PDF
Cassandra 3.0
Building your First Application with Cassandra
KillrVideo: Data Modeling Evolved (Patrick McFadin, Datastax) | Cassandra Sum...
Cassandra Day Atlanta 2015: Building Your First Application with Apache Cassa...
Cassandra 3.0 Data Modeling
Advanced Data Modeling with Apache Cassandra
Cassandra 3.0 advanced preview
Coursera Cassandra Driver
Cassandra 3.0

What's hot (20)

PDF
Cassandra Summit 2014: Real Data Models of Silicon Valley
PPTX
Enabling Search in your Cassandra Application with DataStax Enterprise
PDF
Via forensics icloud-keychain_passwords_13
PDF
Advanced Cassandra
PPTX
Integrating OpenStack with Active Directory
PDF
Apache Cassandra and Drivers
PDF
Walkthrough Neo4j 1.9 & 2.0
PDF
Keystone deep dive 1
PDF
Leveraging Open Source for Database Development: Database Version Control wit...
PDF
Improving DSpace Backups, Restores & Migrations
PDF
10 Deadly Sins of SQL Server Configuration - APPSEC CALIFORNIA 2015
PPTX
DataStax NYC Java Meetup: Cassandra with Java
PPTX
Hadoop Hive
PDF
Introduction to DSpace
PPTX
Capture, record, clip, embed and play, search: video from newbie to ninja
PDF
Lucene/Solr 8: The Next Major Release Steve Rowe, Lucidworks
PPT
DSpace Tutorial : Open Source Digital Library
PPTX
Async servers and clients in Rest.li
PPTX
DSpace 4.2 Basics & Configuration
PPTX
Keystone - Openstack Identity Service
Cassandra Summit 2014: Real Data Models of Silicon Valley
Enabling Search in your Cassandra Application with DataStax Enterprise
Via forensics icloud-keychain_passwords_13
Advanced Cassandra
Integrating OpenStack with Active Directory
Apache Cassandra and Drivers
Walkthrough Neo4j 1.9 & 2.0
Keystone deep dive 1
Leveraging Open Source for Database Development: Database Version Control wit...
Improving DSpace Backups, Restores & Migrations
10 Deadly Sins of SQL Server Configuration - APPSEC CALIFORNIA 2015
DataStax NYC Java Meetup: Cassandra with Java
Hadoop Hive
Introduction to DSpace
Capture, record, clip, embed and play, search: video from newbie to ninja
Lucene/Solr 8: The Next Major Release Steve Rowe, Lucidworks
DSpace Tutorial : Open Source Digital Library
Async servers and clients in Rest.li
DSpace 4.2 Basics & Configuration
Keystone - Openstack Identity Service
Ad

Viewers also liked (7)

PDF
Introduction to Apache Cassandra
PDF
Event Sourcing with Cassandra (from Cassandra Japan Meetup in Tokyo March 2016)
PDF
Getting started with DataStax .NET Driver for Cassandra
PDF
Avoiding the Pit of Despair - Event Sourcing with Akka and Cassandra
PDF
A Deep Dive into Apache Cassandra for .NET Developers
PDF
Relational Scaling and the Temple of Gloom (from Cassandra Summit 2015)
PDF
From Monolith to Microservices with Cassandra, gRPC, and Falcor (from Cassand...
Introduction to Apache Cassandra
Event Sourcing with Cassandra (from Cassandra Japan Meetup in Tokyo March 2016)
Getting started with DataStax .NET Driver for Cassandra
Avoiding the Pit of Despair - Event Sourcing with Akka and Cassandra
A Deep Dive into Apache Cassandra for .NET Developers
Relational Scaling and the Temple of Gloom (from Cassandra Summit 2015)
From Monolith to Microservices with Cassandra, gRPC, and Falcor (from Cassand...
Ad

Similar to Introduction to Data Modeling with Apache Cassandra (20)

PDF
Cassandra Day Atlanta 2015: Data Modeling 101
PDF
Cassandra Day London 2015: Data Modeling 101
PDF
Cassandra Day Chicago 2015: Apache Cassandra Data Modeling 101
PDF
Introduction to data modeling with apache cassandra
PDF
Apache Cassandra & Data Modeling
PPTX
Apache Cassandra Developer Training Slide Deck
PDF
Cassandra Data Modeling
PDF
The data model is dead, long live the data model
PDF
Introduction to Data Modeling with Apache Cassandra
PPTX
Apache Cassandra Data Modeling with Travis Price
PDF
Use Your MySQL Knowledge to Become an Instant Cassandra Guru
PDF
Cassandra Data Modelling with CQL (OSCON 2015)
PPTX
Cassandra
PDF
Big Data Grows Up - A (re)introduction to Cassandra
PPTX
CQL: This is not the SQL you are looking for.
PDF
Cassandra Day Chicago 2015: CQL: This is not he SQL you are looking for
PDF
Cassandra 2012
ODP
Cassandra Data Modelling
PDF
Indexing in Cassandra
PDF
Cassandra Community Webinar | Become a Super Modeler
Cassandra Day Atlanta 2015: Data Modeling 101
Cassandra Day London 2015: Data Modeling 101
Cassandra Day Chicago 2015: Apache Cassandra Data Modeling 101
Introduction to data modeling with apache cassandra
Apache Cassandra & Data Modeling
Apache Cassandra Developer Training Slide Deck
Cassandra Data Modeling
The data model is dead, long live the data model
Introduction to Data Modeling with Apache Cassandra
Apache Cassandra Data Modeling with Travis Price
Use Your MySQL Knowledge to Become an Instant Cassandra Guru
Cassandra Data Modelling with CQL (OSCON 2015)
Cassandra
Big Data Grows Up - A (re)introduction to Cassandra
CQL: This is not the SQL you are looking for.
Cassandra Day Chicago 2015: CQL: This is not he SQL you are looking for
Cassandra 2012
Cassandra Data Modelling
Indexing in Cassandra
Cassandra Community Webinar | Become a Super Modeler

Recently uploaded (20)

PDF
Machine learning based COVID-19 study performance prediction
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PDF
NewMind AI Weekly Chronicles - August'25 Week I
PDF
cuic standard and advanced reporting.pdf
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
PDF
Encapsulation_ Review paper, used for researhc scholars
DOCX
The AUB Centre for AI in Media Proposal.docx
PPTX
MYSQL Presentation for SQL database connectivity
PDF
Empathic Computing: Creating Shared Understanding
PDF
Encapsulation theory and applications.pdf
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PDF
NewMind AI Monthly Chronicles - July 2025
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PDF
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
Machine learning based COVID-19 study performance prediction
Chapter 3 Spatial Domain Image Processing.pdf
NewMind AI Weekly Chronicles - August'25 Week I
cuic standard and advanced reporting.pdf
Mobile App Security Testing_ A Comprehensive Guide.pdf
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
Building Integrated photovoltaic BIPV_UPV.pdf
Encapsulation_ Review paper, used for researhc scholars
The AUB Centre for AI in Media Proposal.docx
MYSQL Presentation for SQL database connectivity
Empathic Computing: Creating Shared Understanding
Encapsulation theory and applications.pdf
Per capita expenditure prediction using model stacking based on satellite ima...
Agricultural_Statistics_at_a_Glance_2022_0.pdf
NewMind AI Monthly Chronicles - July 2025
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
Diabetes mellitus diagnosis method based random forest with bat algorithm

Introduction to Data Modeling with Apache Cassandra

  • 1. Introduction to Data Modeling with Apache Cassandra Luke Tillman (@LukeTillman) Language Evangelist at DataStax
  • 2. 1 Relational Modeling vs. Cassandra 2 The Basics 3 CQL Collections 4 Relationships 5 Time Series Use Case 2
  • 4. The Good ol’ Relational Database • Been around a long time (first proposed in 1970) • Data modeling is well understood (typically 3NF or higher) • ACID guarantees are easy for developers to reason about • SQL is ubiquitous and allows flexible querying – JOINs, Sub SELECTs, etc. 4
  • 5. Relational Data Modeling • Five normal forms • Foreign Keys • Joins at read time – Example SQL: Get employee and department for user id 5 (Helena Edelson) Id First Last DeptId 1 Luke Tillman 201 2 Jon Haddad 201 5 Helena Edelson 205 5 Id Dept 201 Evangelists 205 Engineering Employees Departments SELECT e.First, e.Last, d.Dept FROM Employees e JOIN Departments d ON e.DeptId = d.Id WHERE e.Id = 5
  • 6. Relational Data Modeling Thought Process 6 Data Models Application
  • 7. Cassandra Data Modeling Thought Process 7 Models Application Data
  • 8. CQL vs SQL • Similar syntax in many cases, but... • No Joins • No Aggregations Id First Last DeptId 1 Luke Tillman 201 2 Jon Haddad 201 5 Helena Edelson 205 8 Id Dept 201 Evangelists 205 Engineering Employees Departments SELECT e.First, e.Last, d.Dept FROM Employees e JOIN Departments d ON e.DeptId = d.Id WHERE e.Id = 5
  • 9. Denormalization • Combine table columns into single view at write time • No joins necessary 9 Id First Last Dept 1 Luke Tillman Evangelists 2 Jon Haddad Evangelists 5 Helena Edelson Engineering Employees SELECT First, Last, Dept FROM Employees WHERE Id = 5
  • 10. Sequences and Auto-Incrementing Ids • Great for letting the RDBMS handle auto-generating Ids • Guaranteed to be unique • Needs ACID to work (uh oh) 10 INSERT INTO Employees (Id, First, Last) VALUES (seq.nextVal(), "Patrick", "McFadin")
  • 11. No More Sequences • Almost impossible in a distributed system like Cassandra • Couple of great choices instead: – Natural Keys: Unique values like Email – Surrogate Key: UUID (or GUID for MS folks) • UUID: Universally Unique Identifier – 128-bit number represented in character form – Can be generated easily on the client side 11 99051fe9-6a9c-46c2-b949-38ef78858dd0
  • 13. Cassandra Data Modeling Thought Process • Start with your application and the queries it needs to run • Then build models to satisfy those queries 13 Models Application Data
  • 14. Entity Table • Query: Find user by id • Simple view of a single user • UUID used for ID • Simple primary key 14 CREATE TABLE users ( userid uuid, firstname text, lastname text, email text, created_date timestamp, PRIMARY KEY (userid) ); SELECT firstname, lastname FROM users WHERE userid = 99051fe9-6a9c-46c2-b949-38ef78858dd0
  • 15. Entity Table – A reminder on Partition Keys • First part of Primary Key is the Partition Key 15 CREATE TABLE users ( userid uuid, firstname text, lastname text, email text, created_date timestamp, PRIMARY KEY (userid) ); firstname ... Luke ... Jon ... Patrick ... userid 689d56e5- … 93357d73- … d978b136- …
  • 16. More Complicated Primary Keys • Query: Find comments for a video (most recent first) 16 CREATE TABLE comments_by_video ( videoid uuid, commentid timeuuid, userid uuid, comment text, PRIMARY KEY (videoid, commentid) ) WITH CLUSTERING ORDER BY (commentid DESC); SELECT commentid, userid, comment FROM comments_by_video WHERE videoid = 0fe6ab76-cf17-4664-abcc-4e363cee273f LIMIT 10
  • 17. Let's Break This Down • TimeUUID: a UUID with a timestamp component • Ordering by a TimeUUID is like ordering by its timestamp 17 CREATE TABLE comments_by_video ( videoid uuid, commentid timeuuid, userid uuid, comment text, PRIMARY KEY (videoid, commentid) ) WITH CLUSTERING ORDER BY (commentid DESC); eeaca440-c745-11e4-8830-0800200c9a6603/10/2015 16:53:09 GMT
  • 18. Let's Break This Down • The Primary Key uniquely identifies a row, so a comment is uniquely identified by its videoid and commentid 18 CREATE TABLE comments_by_video ( videoid uuid, commentid timeuuid, userid uuid, comment text, PRIMARY KEY (videoid, commentid) ) WITH CLUSTERING ORDER BY (commentid DESC);
  • 19. Let's Break This Down • The first part of the Primary Key is the Partition Key, so comments for a given video will be stored together in a partition • When we query for a given videoid, we only need to talk to one partition (and thus one node), which is fast 19 CREATE TABLE comments_by_video ( videoid uuid, commentid timeuuid, userid uuid, comment text, PRIMARY KEY (videoid, commentid) ) WITH CLUSTERING ORDER BY (commentid DESC);
  • 20. Let's Break This Down • The second part of the Primary Key is the Clustering Column(s) • Inside a partition, comments for a given video will be ordered by commentid • Remember ordering by TimeUUID is ordering by timestamp 20 CREATE TABLE comments_by_video ( videoid uuid, commentid timeuuid, userid uuid, comment text, PRIMARY KEY (videoid, commentid) ) WITH CLUSTERING ORDER BY (commentid DESC);
  • 21. Let's Break This Down • We can specify a default clustering order when creating the table which will affect the ordering of the data stored on disk • Since our query was to get the latest comments for a video, we order by commentid descending 21 CREATE TABLE comments_by_video ( videoid uuid, commentid timeuuid, userid uuid, comment text, PRIMARY KEY (videoid, commentid) ) WITH CLUSTERING ORDER BY (commentid DESC);
  • 22. Let's Break This Down 22 CREATE TABLE comments_by_video ( videoid uuid, commentid timeuuid, userid uuid, comment text, PRIMARY KEY (videoid, commentid) ) WITH CLUSTERING ORDER BY (commentid DESC); videoid='0fe6a...' userid= 'ac346...' comment= 'Awesome!' commentid='82be1...' (10/1/2014 9:36AM) userid= 'f89d3...' comment= 'Garbage!' commentid='765ac...' (9/17/2014 7:55AM)
  • 23. This query will be fast 23 videoid='0fe6a...' userid= 'ac346...' comment= 'Awesome!' commentid='82be1...' (10/1/2014 9:36AM) userid= 'f89d3...' comment= 'Garbage!' commentid='765ac...' (9/17/2014 7:55AM) SELECT commentid, userid, comment FROM comments_by_video WHERE videoid = 0fe6ab76-cf17-4664-abcc-4e363cee273f LIMIT 10 1. Locate single partition 2. Single seek on disk 3. Slice 10 latest rows and return
  • 24. Getting the most from queries • Queries on Partition Key are fast – Querying inside a single partition should be the goal – Always specify a value for partition key when querying • Queries on Partition Key and one or more Clustering Column(s) are fast – Again, inside a single partition should be the goal – Use default ordering when creating the table to optimize if applicable • Cassandra will give you errors if you try to stray 24
  • 25. More than one way to query the same data • New Query: Find comments made by a user (most recent first) 25 CREATE TABLE comments_by_user ( userid uuid, commentid timeuuid, videoid uuid, comment text, PRIMARY KEY (userid, commentid) ) WITH CLUSTERING ORDER BY (commentid DESC); SELECT commentid, videoid, comment FROM comments_by_user WHERE userid = 99051fe9-6a9c-46c2-b949-38ef78858dd0 LIMIT 10
  • 26. More than one way to query the same data • Two views of the same data • Use a batch when inserting to both tables • Denormalize at write time to do efficient queries at read time 26 CREATE TABLE comments_by_user ( userid uuid, commentid timeuuid, videoid uuid, comment text, PRIMARY KEY ( userid, commentid) ) WITH CLUSTERING ORDER BY ( commentid DESC); CREATE TABLE comments_by_video ( videoid uuid, commentid timeuuid, userid uuid, comment text, PRIMARY KEY ( videoid, commentid) ) WITH CLUSTERING ORDER BY ( commentid DESC);
  • 28. CQL Collection Basics • Store a collection of related things in a column • Meant to be dynamic part of a table • Update syntax is very different from insert • Reads require all of the collection to be read 28
  • 29. CQL Set • No duplicates, sorted by CQL type's comparator 29 INSERT INTO collections_example (id, set_example) VALUES (1, {'Patrick', 'Jon', 'Luke'}); set_example set<text> Collection name (column name) Collection type CQL type
  • 30. CQL Set • Adding an element to a set • Removing an element from a set 30 UPDATE collections_example SET set_example = set_example + {'Rebecca'} WHERE id = 1 UPDATE collections_example SET set_example = set_example - {'Luke'} WHERE id = 1
  • 31. CQL List • Allows duplicates, sorted by insertion order • Use with caution 31 INSERT INTO collections_example (id, list_example) VALUES (1, ['Patrick', 'Jon', 'Luke']); list_example list<text> Collection name (column name) Collection type CQL type
  • 32. CQL List • Adding an element to the end of a list • Adding an element to the beginning of a list • Removing an element from a list 32 UPDATE collections_example SET list_example = list_example + ['Rebecca'] WHERE id = 1 UPDATE collections_example SET list_example = ['Rebecca'] + list_example WHERE id = 1 UPDATE collections_example SET list_example = list_example - ['Luke'] WHERE id = 1
  • 33. CQL Map • Key and value, sorted by key's CQL type comparator 33 INSERT INTO collections_example (id, map_example) VALUES (1, { 'Patrick' : 72, 'Jon' : 33, 'Luke' : 34 }); map_example map<text, int> Collection name (column name) Collection type Key CQL type Value CQL type
  • 34. CQL Map
    • Adding an element to a map

    UPDATE collections_example
    SET map_example['Rebecca'] = 29
    WHERE id = 1;

    • Updating an existing element in a map

    UPDATE collections_example
    SET map_example['Jon'] = 34
    WHERE id = 1;

    • Removing an element from a map

    DELETE map_example['Luke']
    FROM collections_example
    WHERE id = 1;
  • 36. Revisiting our One-to-Many Relationship
    Department (1) has (n) Employee

    Employees
    Id        First   Last     DeptId
    7bc7a...  Luke    Tillman  5078c...
    d7463...  Jon     Haddad   5078c...
    8c26b...  Helena  Edelson  1d0f3...

    Departments
    Id        Dept
    5078c...  Evangelists
    1d0f3...  Engineering
  • 37. Revisiting our One-to-Many Relationship
    • Query: Get an employee and his/her department by employee id
      – Denormalize department data

    Employees
    Id        First   Last     Dept
    7bc7a...  Luke    Tillman  Evangelists
    d7463...  Jon     Haddad   Evangelists
    8c26b...  Helena  Edelson  Engineering

    CREATE TABLE employees (
      id uuid,
      first text,
      last text,
      dept text,
      PRIMARY KEY (id)
    );

    SELECT first, last, dept
    FROM employees
    WHERE id = 7bc7a...
  • 38. What about the other side of the relationship?
    • Query: Get all the employees for a given department

    CREATE TABLE employees_by_dept (
      dept_id uuid,
      emp_id uuid,
      first text,
      last text,
      dept text,
      PRIMARY KEY (dept_id, emp_id)
    );

    SELECT first, last, dept
    FROM employees_by_dept
    WHERE dept_id = 5078c...
  • 39. What about the other side of the relationship?

    CREATE TABLE employees_by_dept (
      dept_id uuid,
      emp_id uuid,
      first text,
      last text,
      dept text,
      PRIMARY KEY (dept_id, emp_id)
    );

    Partition: dept_id = '5078c...'
      emp_id = '7bc7a...'   dept = 'Evangelists'   first = 'Luke'   last = 'Tillman'
      emp_id = 'd7463...'   dept = 'Evangelists'   first = 'Jon'    last = 'Haddad'
  • 40. Static Columns
    • Department name (dept) will be the same across all rows in the partition
    • This is a good candidate for a static column

    CREATE TABLE employees_by_dept (
      dept_id uuid,
      emp_id uuid,
      first text,
      last text,
      dept text,
      PRIMARY KEY (dept_id, emp_id)
    );

    Partition: dept_id = '5078c...'
      emp_id = '7bc7a...'   dept = 'Evangelists'   first = 'Luke'   last = 'Tillman'
      emp_id = 'd7463...'   dept = 'Evangelists'   first = 'Jon'    last = 'Haddad'
  • 41. Static Columns
    • For data that is shared across all rows in a partition, use static columns
    • Updates to the value will affect all rows in the partition

    CREATE TABLE employees_by_dept (
      dept_id uuid,
      emp_id uuid,
      first text,
      last text,
      dept text STATIC,
      PRIMARY KEY (dept_id, emp_id)
    );

    Partition: dept_id = '5078c...'   dept = 'Evangelists' (static)
      emp_id = '7bc7a...'   first = 'Luke'   last = 'Tillman'
      emp_id = 'd7463...'   first = 'Jon'    last = 'Haddad'
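A short sketch of how the static column behaves in practice, using the employees_by_dept table with dept declared STATIC. The full UUIDs and the new department name are hypothetical (the deck elides the real ids); the names come from the slide.

```cql
-- Hypothetical full UUIDs; the deck only shows truncated ids.
INSERT INTO employees_by_dept (dept_id, emp_id, first, last, dept)
VALUES (0f6b9c1e-3d4a-4b7e-9c2d-1a2b3c4d5e6f,
        7a1d2c3b-4e5f-4a6b-8c7d-9e0f1a2b3c4d,
        'Luke', 'Tillman', 'Evangelists');

-- The second row does not need to repeat the department name.
INSERT INTO employees_by_dept (dept_id, emp_id, first, last)
VALUES (0f6b9c1e-3d4a-4b7e-9c2d-1a2b3c4d5e6f,
        d1e2f3a4-b5c6-4d7e-8f90-a1b2c3d4e5f6,
        'Jon', 'Haddad');

-- Updating a static column needs only the partition key;
-- every row in the partition then reads the new value.
UPDATE employees_by_dept
SET dept = 'Developer Evangelists'
WHERE dept_id = 0f6b9c1e-3d4a-4b7e-9c2d-1a2b3c4d5e6f;

SELECT emp_id, first, last, dept
FROM employees_by_dept
WHERE dept_id = 0f6b9c1e-3d4a-4b7e-9c2d-1a2b3c4d5e6f;
```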
  • 42. Time Series Use Case
  • 43. Weather Station
    • Weather station collects data
    • Cassandra stores in sequence
    • Application reads in sequence
  • 44. Weather Station
    Needed Queries
    • Get all data for one weather station
    • Get data for a single date and time
    • Get data for a range of dates and times
    Data Model for Queries
    • Store data per weather station
    • Store time series in order: first to last
  • 45. Weather Station
    • Weather station id and time are unique
    • Store as many as needed

    CREATE TABLE temperatures (
      weather_station text,
      year int,
      month int,
      day int,
      hour int,
      temperature double,
      PRIMARY KEY (weather_station, year, month, day, hour)
    );

    INSERT INTO temperatures (weather_station, year, month, day, hour, temperature)
    VALUES ('10010:99999', 2005, 12, 1, 7, -5.6);

    INSERT INTO temperatures (weather_station, year, month, day, hour, temperature)
    VALUES ('10010:99999', 2005, 12, 1, 8, -5.1);

    INSERT INTO temperatures (weather_station, year, month, day, hour, temperature)
    VALUES ('10010:99999', 2005, 12, 1, 9, -4.9);

    INSERT INTO temperatures (weather_station, year, month, day, hour, temperature)
    VALUES ('10010:99999', 2005, 12, 1, 10, -5.3);
  • 46. Storage Model: Logical View

    SELECT weather_station, hour, temperature
    FROM temperatures
    WHERE weather_station = '10010:99999'

    weather_station  hour  temperature
    10010:99999      7     -5.6
    10010:99999      8     -5.1
    10010:99999      9     -4.9
    10010:99999      10    -5.3
  • 47. Storage Model: Disk Layout

    SELECT weather_station, hour, temperature
    FROM temperatures
    WHERE weather_station = '10010:99999'

    Partition key: 10010:99999
      2005:12:1:7    -5.6
      2005:12:1:8    -5.1
      2005:12:1:9    -4.9
      2005:12:1:10   -5.3
  • 48. Storage Model: Disk Layout

    SELECT weather_station, hour, temperature
    FROM temperatures
    WHERE weather_station = '10010:99999'

    Partition key: 10010:99999
      2005:12:1:7    -5.6
      2005:12:1:8    -5.1
      2005:12:1:9    -4.9
      2005:12:1:10   -5.3
      2005:12:1:11

    Merged, Sorted, and Stored Sequentially
  • 49. Query Patterns
    • Range queries
    • "Slice" operation on disk

    SELECT weather_station, hour, temperature
    FROM temperatures
    WHERE weather_station = '10010:99999'
    AND year = 2005 AND month = 12 AND day = 1
    AND hour >= 7 AND hour <= 10

    Partition key for locality: 10010:99999
    Single seek on disk:
      2005:12:1:7    -5.6
      2005:12:1:8    -5.1
      2005:12:1:9    -4.9
      2005:12:1:10   -5.3
      2005:12:1:11
  • 50. Query Patterns
    • Range queries
    • "Slice" operation on disk

    SELECT weather_station, hour, temperature
    FROM temperatures
    WHERE weather_station = '10010:99999'
    AND year = 2005 AND month = 12 AND day = 1
    AND hour >= 7 AND hour <= 10

    weather_station  hour  temperature
    10010:99999      7     -5.6
    10010:99999      8     -5.1
    10010:99999      9     -4.9
    10010:99999      10    -5.3
  • 51. Query Patterns
    • Programmers like this

    SELECT weather_station, hour, temperature
    FROM temperatures
    WHERE weather_station = '10010:99999'
    AND year = 2005 AND month = 12 AND day = 1
    AND hour >= 7 AND hour <= 10

    weather_station  hour  temperature
    10010:99999      7     -5.6
    10010:99999      8     -5.1
    10010:99999      9     -4.9
    10010:99999      10    -5.3

    Sorted in time order
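The range query on these slides stays within a single day. The other two needed queries from slide 44 (a single date and time, and a range of dates and times) could be sketched against the same temperatures table as follows. The second statement assumes a Cassandra version that supports multi-column relations on clustering columns, and the end date of 2 December is a hypothetical value.

```cql
-- Single date and time: equality on every clustering column.
SELECT weather_station, hour, temperature
FROM temperatures
WHERE weather_station = '10010:99999'
  AND year = 2005 AND month = 12 AND day = 1 AND hour = 7;

-- Range that crosses a day boundary: compare the clustering columns as a tuple.
-- Assumes support for multi-column relations on clustering columns.
SELECT year, month, day, hour, temperature
FROM temperatures
WHERE weather_station = '10010:99999'
  AND (year, month, day, hour) >= (2005, 12, 1, 7)
  AND (year, month, day, hour) <= (2005, 12, 2, 10);
```

Both statements restrict the partition key, so each still reads a contiguous slice of one partition.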
  • 52. Takeaway: Goals of Cassandra Data Modeling
    • Spread data evenly around the cluster
      – Choose a good Primary Key (particularly, the Partition Key portion)
    • Minimize the number of partitions read for a given query
      – Remember: Partitions are spread out around the cluster
    • Do not worry about:
      – Minimizing the number of writes: Cassandra is really fast at writes
      – Minimizing data duplication: this is not 3NF from RDBMS, disk is cheap
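As one illustration of choosing the partition key portion, the temperatures table keeps every reading for a station in a single partition, which grows without bound. A common variation, not shown in the deck, is to add a time bucket to the partition key. The table name temperatures_by_month and the choice of a monthly bucket are assumptions made for this sketch.

```cql
-- Hypothetical variant: one partition per station per month.
-- The double parentheses mark (weather_station, year, month) as the partition key;
-- day and hour remain clustering columns, so rows are still stored in time order.
CREATE TABLE temperatures_by_month (
  weather_station text,
  year int,
  month int,
  day int,
  hour int,
  temperature double,
  PRIMARY KEY ((weather_station, year, month), day, hour)
);

-- Every query must now supply the full partition key.
SELECT day, hour, temperature
FROM temperatures_by_month
WHERE weather_station = '10010:99999' AND year = 2005 AND month = 12
  AND day = 1 AND hour >= 7 AND hour <= 10;
```

The trade-off is that a range spanning several months touches several partitions, which works against the second goal on the slide, so the bucket size should match the most common query window.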
  • 53. Questions?
    Follow me for updates or to ask questions later: @LukeTillman