Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016

Scalable data modelling by example
Carlos Alonso (@calonso)

Carlos Alonso
• Ex-Londoner
• MSc Salamanca University, Spain
• Software Engineer @ Jobandtalent
• Cassandra certified developer
• DataStax Cassandra MVP 2015 & 2016
• @calonso / http://mrcalonso.com

Jobandtalent
• Revolutionising how people find jobs and how businesses hire employees.
• Leveraging data to produce a unique job matching technology.
• 10M+ users and 150K+ companies worldwide
• @jobandtalentEng / http://jobandtalent.com
• We are hiring!!

The data model is the only thing you can’t change once in production.

Consistent Hashing
Hash function: “Carlos” → 185664
[Figure: token ring with example token values 1773456738847666528349 and -894763734895827651234]
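The idea behind the slide can be sketched as a toy token ring. This is only an illustration under stated assumptions: SHA-256 is used as a stand-in hash (Cassandra's default partitioner is Murmur3), and the `token` function, `Ring` class and node names are hypothetical.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a key to a signed 64-bit token (SHA-256 is only a stand-in hash)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


class Ring:
    """Toy ring: a key belongs to the first node whose token is >= the key's token."""

    def __init__(self, nodes):
        # Hypothetical setup: one token per node, derived from the node name.
        self._ring = sorted((token(name), name) for name in nodes)
        self._tokens = [t for t, _ in self._ring]

    def owner(self, key: str) -> str:
        # Walk clockwise from the key's token, wrapping around past the last node.
        index = bisect.bisect_left(self._tokens, token(key)) % len(self._ring)
        return self._ring[index][1]


ring = Ring(["node-a", "node-b", "node-c"])
print(ring.owner("Carlos"))  # the same key always maps to the same node
```

The useful property is that removing a node only moves the keys that node owned; every other key keeps its owner.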

Replication factor
How many copies (replicas) of your data are stored

Consistency Level
How many replicas of your data must acknowledge a read or write?

A complete read/write example
[Figure: client, driver and partitioner; the partitioner maps the key f81d4fae-… to 834]
• RF = 3
• CL = QUORUM
• SELECT * … WHERE id = f81d4fae-…
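For the RF = 3, CL = QUORUM case above, the arithmetic can be checked with a short sketch. The function names are hypothetical; the quorum size is the usual floor(RF / 2) + 1, and the overlap rule (read replicas + write replicas > RF) is the standard condition for a read to reach at least one replica that acknowledged the latest write.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that make up a QUORUM for a given replication factor."""
    return replication_factor // 2 + 1


def read_overlaps_write(replication_factor: int, write_acks: int, read_acks: int) -> bool:
    """True when every read set must share at least one replica with every write set."""
    return write_acks + read_acks > replication_factor


rf = 3
q = quorum(rf)
print(q)                               # 2
print(read_overlaps_write(rf, q, q))   # True: QUORUM writes + QUORUM reads
print(read_overlaps_write(rf, 1, 1))   # False: ONE + ONE may miss the latest write
```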

Data Modelling
• Understand your data
• Decide (know) how you’ll query the data
• Define column families to satisfy those queries
• Implement and optimise

Data Modelling
Conceptual Model → (Query-Driven Methodology) → Logical Model → (Analysis & Validation) → Physical Model

Query Driven Methodology: goals
• Spread data evenly around the cluster
• Minimise the number of partitions read
• Keep partitions manageable

Query Driven Methodology: process
• Entities and relationships: map to tables
• Key attributes: map to primary key columns
• Equality search attributes: must be at the beginning of the primary key
• Inequality search attributes: become clustering columns
• Ordering attributes: become clustering columns

The Primary Key
PARTITION KEY + CLUSTERING COLUMN(S)
CREATE TABLE . . . (
    fields . . .
    PRIMARY KEY (part_key, clust1, . . .)
);
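To make the two roles in the primary key concrete, here is a toy in-memory model. It is a sketch, not how Cassandra stores data: the `ToyTable` class is hypothetical, and it only mimics three behaviours, namely that rows sharing a partition key live together, that they are kept sorted by clustering columns, and that a second write with the same full primary key overwrites the first.

```python
import bisect
from collections import defaultdict


class ToyTable:
    """Toy model: partition key groups rows, clustering columns sort them."""

    def __init__(self, partition_key, clustering_columns):
        self.partition_key = tuple(partition_key)
        self.clustering_columns = tuple(clustering_columns)
        self._partitions = defaultdict(list)  # partition key -> sorted [(clustering, row)]

    def insert(self, row: dict) -> None:
        pk = tuple(row[c] for c in self.partition_key)
        ck = tuple(row[c] for c in self.clustering_columns)
        rows = self._partitions[pk]
        position = bisect.bisect_left([k for k, _ in rows], ck)
        if position < len(rows) and rows[position][0] == ck:
            rows[position] = (ck, row)  # same primary key: the write is an overwrite
        else:
            rows.insert(position, (ck, row))  # keep the partition sorted

    def select(self, *partition_values):
        """Return one partition's rows in clustering order."""
        return [row for _, row in self._partitions.get(tuple(partition_values), [])]


table = ToyTable(["part_key"], ["clust1"])
table.insert({"part_key": "a", "clust1": 2, "value": "second"})
table.insert({"part_key": "a", "clust1": 1, "value": "first"})
table.insert({"part_key": "a", "clust1": 2, "value": "overwritten"})
print([r["value"] for r in table.select("a")])  # ['first', 'overwritten']
```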

Analysis & Validation
• Data evenly spread?
• 1 Partition per read?
• Are write conflicts (overwrites) possible?
• How large are partitions?
– Ncells = Nrow × (Ncols - Npk - Nstatic) + Nstatic < 1M
• How much data duplication? (batches)
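The cell-count check is simple enough to script. A minimal sketch, assuming the 1M threshold from the slide; the function and constant names are hypothetical, and `n_pk` is whatever value the Npk term takes for the table being checked.

```python
MAX_CELLS = 1_000_000  # threshold quoted on the slide


def ncells(n_rows: int, n_cols: int, n_pk: int, n_static: int) -> int:
    """Ncells = Nrow x (Ncols - Npk - Nstatic) + Nstatic."""
    return n_rows * (n_cols - n_pk - n_static) + n_static


# Example: a single-row partition with 5 columns, 1 of them the key, no statics.
cells = ncells(n_rows=1, n_cols=5, n_pk=1, n_static=0)
print(cells)              # 4
print(cells < MAX_CELLS)  # True
```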

Requirement 1
Books can be uniquely identified and accessed by ISBN. We also need a title, genre, author and publisher.
Q1: Find books by ISBN
Book (Q1): ISBN (K), Title, Author, Genre, Publisher

Analysis & Validation
– Ncells = 1 × (5 - 1 - 0) + 0 = 4 < 1M
• How much data duplication? 0

Physical data model
CREATE TABLE books (
    ISBN VARCHAR PRIMARY KEY,
    title VARCHAR,
    author VARCHAR,
    genre VARCHAR,
    publisher VARCHAR
);
SELECT * FROM books WHERE ISBN = '…';

Requirement 2
Users register into the system uniquely identified by an email and a password. We also want their full name. They will be accessed by email and password or by internal unique ID.
Q1: Find users by ID
Q2: Find users by login info
Q3: Find users by email (to guarantee uniqueness)
Users_by_ID (Q1): ID (K), full_name
Users_by_login_info (Q2, Q3): email (K), password (C), full_name, ID

Analysis & Validation
– Users_by_ID: Ncells = 1 × (2 - 1 - 0) + 0 = 1 < 1M

Physical Data Model
CREATE TABLE users_by_id (
    ID TIMEUUID PRIMARY KEY,
    full_name VARCHAR
);
SELECT * FROM users_by_id WHERE ID = …;

Analysis & Validation
– Users_by_login_info: email (K), password (C), full_name, ID
– Ncells = 1 × (4 - 1 - 0) + 0 = 3 < 1M

Physical Data Model
CREATE TABLE users_by_login_info (
    email VARCHAR,
    password VARCHAR,
    full_name VARCHAR,
    ID TIMEUUID,
    PRIMARY KEY (email, password)
);
SELECT * FROM users_by_login_info
WHERE email = '…' [AND password = '…'];

Physical Data Model
BEGIN BATCH
INSERT INTO users_by_id (ID, full_name) VALUES (…) IF NOT EXISTS;
INSERT INTO users_by_login_info (email, password, full_name, ID) VALUES (…);
APPLY BATCH;

Requirement 3
Users read books. We want to know which books a user has read and show them sorted by title and author.
Q1: Find all books a logged-in user has read
Books_read_by_user (Q1): user_ID (K), title (C), author (C), full_name (S), ISBN, genre, publisher

Analysis & Validation
– Ncells = Books × (7 - 1 - 1) + 1 < 1M => about 200,000 books per user

Physical Data Model
CREATE TABLE books_read_by_user (
    user_id TIMEUUID,
    title VARCHAR,
    author VARCHAR,
    full_name VARCHAR STATIC,
    ISBN VARCHAR,
    genre VARCHAR,
    publisher VARCHAR,
    PRIMARY KEY (user_id, title, author)
);
SELECT * FROM books_read_by_user
WHERE user_ID = …;

Requirement 4
In order to improve our site’s usability we need to understand how our users use it by tracking every interaction they have with our site.
Q1: Find all actions a user does in a time range
Actions_by_user (Q1): user_ID (K), time (C), element, type

Analysis & Validation
– Ncells = Actions × (4 - 1 - 0) + 0 < 1M => 333,333 actions per user

Requirement 4: Bucketing
Actions_by_user: user_ID (K), month (K), time (C), element, type
– Ncells = Actions × (5 - 2 - 0) + 0 < 1M => 333,333 actions per user every <bucket_size>
bucket_size = 1 year => 38 actions / h
bucket_size = 1 month => 462 actions / h
bucket_size = 1 week => 1984 actions / h
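The per-hour figures follow from dividing the row budget of one partition by the hours in a bucket. A sketch of that arithmetic, assuming a 365-day year and a 30-day month (the slide does not say which lengths were used) and rounding down; the names are hypothetical.

```python
MAX_CELLS = 1_000_000  # threshold quoted in the deck


def max_rows(n_cols: int, n_pk: int, n_static: int = 0, limit: int = MAX_CELLS) -> int:
    """Largest row count that keeps Ncells strictly below the limit."""
    cells_per_row = n_cols - n_pk - n_static
    return (limit - 1 - n_static) // cells_per_row


# Assumed bucket lengths in hours.
HOURS_PER_BUCKET = {"1 year": 365 * 24, "1 month": 30 * 24, "1 week": 7 * 24}

rows = max_rows(n_cols=5, n_pk=2)  # 333333 actions per partition
for name, hours in HOURS_PER_BUCKET.items():
    print(f"bucket_size = {name} => {rows // hours} actions / h")
```

Smaller buckets raise the sustained write rate a single user can reach before a partition grows past the limit, at the cost of more partitions to read for a long time range.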

Analysis & Validation
– Ncells = Actions × (5 - 2 - 0) + 0 < 1M => 333,333 actions per user per month

Physical Data Model
CREATE TABLE actions_by_user (
    user_ID TIMEUUID,
    month INT,
    time TIMESTAMP,
    element VARCHAR,
    type VARCHAR,
    PRIMARY KEY ((user_ID, month), time)
);
SELECT * FROM actions_by_user
WHERE user_ID = … AND month = … AND time < … AND time > …;
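Because month is part of the partition key, a time range that crosses a month boundary touches several partitions, so the application has to work out which buckets to query. A minimal sketch, assuming the month INT is encoded as YYYYMM (the slides do not specify the encoding); both helper names are hypothetical.

```python
from datetime import datetime


def month_bucket(ts: datetime) -> int:
    """Assumed encoding of the month column: YYYYMM, e.g. 201609."""
    return ts.year * 100 + ts.month


def buckets_between(start: datetime, end: datetime) -> list:
    """Every month bucket touched by the range [start, end], in order."""
    buckets = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        buckets.append(year * 100 + month)
        month += 1
        if month > 12:  # roll over into January of the next year
            year, month = year + 1, 1
    return buckets


print(buckets_between(datetime(2016, 11, 20), datetime(2017, 1, 5)))
# [201611, 201612, 201701]
```

Each bucket in the list then needs its own SELECT against its (user_ID, month) partition, with the time bounds applied inside it.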

Further validation
∑sizeOf(pk) + ∑sizeOf(sc) + Nr × (∑sizeOf(rc) + ∑sizeOf(clc)) + 8 × Nv < 200 MB
– pk = Partition Key column
– sc = Static column
– Nr = Number of rows
– rc = Regular column
– clc = Clustering column
– Nv = Number of values
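The size estimate can be scripted the same way. This sketch plugs in the actions_by_user table with one month of data; the 16, 4 and 8 byte sizes are the fixed widths of TIMEUUID, INT and TIMESTAMP, while the 20-byte average for each VARCHAR column is purely an assumption, as are the function and variable names.

```python
LIMIT_BYTES = 200 * 1024 * 1024  # the 200 MB bound, taken here as MiB


def partition_size_bytes(pk_sizes, static_sizes, regular_sizes,
                         clustering_sizes, n_rows, n_values):
    """sum(pk) + sum(sc) + Nr x (sum(rc) + sum(clc)) + 8 x Nv, in bytes."""
    per_row = sum(regular_sizes) + sum(clustering_sizes)
    return sum(pk_sizes) + sum(static_sizes) + n_rows * per_row + 8 * n_values


size = partition_size_bytes(
    pk_sizes=[16, 4],        # user_ID TIMEUUID, month INT
    static_sizes=[],         # no static columns in this table
    regular_sizes=[20, 20],  # element, type: assumed 20-byte averages
    clustering_sizes=[8],    # time TIMESTAMP
    n_rows=333_333,          # row budget per monthly bucket from the deck
    n_values=999_999,        # Nv: 333,333 x (5 - 2 - 0) + 0
)
print(size)                # 23999996
print(size < LIMIT_BYTES)  # True
```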

Next Steps
• Test your models against your hardware setup
– cassandra-stress
– http://www.sestevez.com/sestevez/CassandraDataModeler/ (kudos to Sebastian Estevez)
• Monitor everything
– DataStax OpsCenter
– Graphite
– Datadog
– . . .

Thanks!
Carlos Alonso
Software Engineer
@calonso
