Scalable data modelling by example
Carlos Alonso (@calonso)
Carlos Alonso
• Ex-Londoner
• MSc Salamanca University, Spain
• Software Engineer @ Jobandtalent
• Cassandra certified developer
• Datastax Cassandra MVP 2015 & 2016
• @calonso / http://mrcalonso.com
Jobandtalent
• Revolutionising how people find jobs and how businesses hire employees.
• Leveraging data to produce a unique job matching technology.
• 10M+ users and 150K+ companies worldwide
• @jobandtalentEng / http://jobandtalent.com
• We are hiring!!
Cassandra Concepts
The data model is the only thing you can’t change once in production.
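Data organisation
[Figure: data organisation, labelled "Token"]
Physical Data Layout
[Figure: physical data layout]
Consistent Hashing
[Figure: a hash function maps a key to a token, e.g. “Carlos” → 185664; sample tokens: 1773456738847666528349, -894763734895827651234]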
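The hash-to-token idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: MD5 truncated to 64 bits stands in for the real partitioner's hash, and the node names and tokens are made up.

```python
import bisect
import hashlib


def token_for(key: str) -> int:
    """Map a partition key to a signed 64-bit token.

    Assumption: MD5 truncated to 8 bytes is an illustrative stand-in,
    not the hash a real Cassandra partitioner uses.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


class Ring:
    """Toy token ring: each node owns the tokens up to and including its own."""

    def __init__(self, node_tokens: dict):
        # Sort (token, node) pairs so ownership can be found by binary search.
        self._ring = sorted((token, node) for node, token in node_tokens.items())
        self._tokens = [token for token, _ in self._ring]

    def owner_of_token(self, token: int) -> str:
        # First node whose token is >= the given token; wrap around past the end.
        index = bisect.bisect_left(self._tokens, token) % len(self._ring)
        return self._ring[index][1]

    def owner(self, key: str) -> str:
        return self.owner_of_token(token_for(key))


# Hypothetical three-node cluster with evenly spaced tokens.
ring = Ring({"node_a": -(2**62), "node_b": 0, "node_c": 2**62})
print(ring.owner("Carlos"))
```

The same key always hashes to the same token, so any client can work out which node owns a row without a central lookup.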
Replication factor
How many copies (replicas) of your data the cluster stores.
Consistency Level
How many replicas must acknowledge a read or write before it is considered successful.
A complete read/write example
[Figure: client, driver and partitioner; the partitioner maps the key f81d4fae-… to token 834]
• RF = 3
• CL = QUORUM
• SELECT * … WHERE id = f81d4fae-…
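The arithmetic behind RF = 3 with CL = QUORUM is small enough to sketch in Python. The function names are illustrative, not part of any driver API.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def reads_overlap_writes(replication_factor: int, write_acks: int, read_acks: int) -> bool:
    """True when any read set must share at least one replica with any write set."""
    return write_acks + read_acks > replication_factor


rf = 3
print(quorum(rf))  # 2
print(reads_overlap_writes(rf, quorum(rf), quorum(rf)))  # True
print(reads_overlap_writes(rf, 1, 1))  # False
```

With RF = 3, a QUORUM read and a QUORUM write each touch 2 replicas, so at least one replica that took the write also answers the read.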
Data Modelling
• Understand your data
• Decide (know) how you’ll query the data
• Define column families to satisfy those queries
• Implement and optimise
[Figure: Conceptual Model → Query-Driven Methodology → Logical Model → Analysis & Validation → Physical Model]
Query Driven Methodology: goals
• Spread data evenly around the cluster
• Minimise the number of partitions read
• Keep partitions manageable
Query Driven Methodology: process
• Entities and relationships: map to tables
• Key attributes: map to primary key columns
• Equality search attributes: must be at the beginning of the primary key
• Inequality search attributes: become clustering columns
• Ordering attributes: become clustering columns
The Primary Key
PARTITION KEY + CLUSTERING COLUMN(S)
CREATE TABLE . . . (
  fields . . .
  PRIMARY KEY (part_key, clust1, . . .)
);
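As a rough mental model, the partition key decides which partition a row lands in, and the clustering columns decide the row order inside that partition. The Python sketch below mimics that layout in memory. The function name and sample rows are hypothetical, and it says nothing about how Cassandra stores data on disk.

```python
from collections import defaultdict


def lay_out(rows, partition_key, clustering_columns):
    """Group rows by partition key and sort each group by its clustering columns."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row)
    for stored_rows in partitions.values():
        stored_rows.sort(key=lambda row: tuple(row[c] for c in clustering_columns))
    return dict(partitions)


# Hypothetical rows shaped like PRIMARY KEY (part_key, clust1).
rows = [
    {"part_key": "u1", "clust1": "b", "value": 1},
    {"part_key": "u2", "clust1": "a", "value": 2},
    {"part_key": "u1", "clust1": "a", "value": 3},
]
layout = lay_out(rows, "part_key", ["clust1"])
print([row["clust1"] for row in layout["u1"]])  # ['a', 'b']
```

Reading one partition returns its rows already sorted, which is why ordering and range attributes belong in the clustering columns.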
Analysis & Validation
• Data evenly spread?
• 1 Partition per read?
• Are write conflicts (overwrites) possible?
• How large are partitions?
– Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
• How much data duplication? (batches)
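The partition-size check is plain arithmetic, so it can be wrapped in a small helper. The sketch below is Python with illustrative names; the sample table (6 columns, 2 key columns, 1 static column, 10,000 rows) is made up.

```python
CELL_LIMIT = 1_000_000  # the "< 1M" rule of thumb from the slide


def ncells(n_rows: int, n_cols: int, n_pk: int, n_static: int) -> int:
    """Ncells = Nrow x (Ncols - Npk - Nstatic) + Nstatic."""
    return n_rows * (n_cols - n_pk - n_static) + n_static


def within_limit(n_rows: int, n_cols: int, n_pk: int, n_static: int,
                 limit: int = CELL_LIMIT) -> bool:
    """True when the partition stays under the cell limit."""
    return ncells(n_rows, n_cols, n_pk, n_static) < limit


# Hypothetical table: 6 columns, 2 key columns, 1 static column, 10,000 rows.
print(ncells(10_000, 6, 2, 1))        # 30001
print(within_limit(10_000, 6, 2, 1))  # True
```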
An E-Library project.
Requirement 1
Books can be uniquely identified and accessed by ISBN. We also need a title, genre, author and publisher.
QDM
Book: ISBN (K, partition key), Title, Author, Genre, Publisher
Q1: Find books by ISBN
Analysis & Validation
Book (Q1: Find books by ISBN)
• Data evenly spread?
• 1 Partition per read?
• Are write conflicts (overwrites) possible?
• How large are partitions?
– Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
– 1 X (5 - 1 - 0) + 0 = 4 < 1M
• How much data duplication? 0
Physical Data Model
CREATE TABLE books (
  ISBN VARCHAR PRIMARY KEY,
  title VARCHAR,
  author VARCHAR,
  genre VARCHAR,
  publisher VARCHAR
);
SELECT * FROM books WHERE ISBN = '…';
Requirement 2
Users register in the system and are uniquely identified by an email and a password. We also want their full name. They will be accessed by email and password or by an internal unique ID.
QDM
Users_by_ID: ID (K), full_name
Q1: Find users by ID
Users_by_login_info: email (K), password (C, clustering column), full_name, ID
Q2: Find users by login info
Q3: Find users by email (to guarantee uniqueness)
Analysis & Validation
Users_by_ID (Q1: Find users by ID)
• Data evenly spread?
• 1 Partition per read?
• Are write conflicts (overwrites) possible?
• How large are partitions?
– Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
– 1 X (2 - 1 - 0) + 0 = 1 < 1M
• How much data duplication? 0
Physical Data Model
CREATE TABLE users_by_id (
  ID TIMEUUID PRIMARY KEY,
  full_name VARCHAR
);
SELECT * FROM users_by_id WHERE ID = …;
Analysis & Validation
Users_by_login_info (Q2: Find users by login info; Q3: Find users by email, to guarantee uniqueness)
• Data evenly spread?
• 1 Partition per read?
• Are write conflicts (overwrites) possible?
• How large are partitions?
– Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
– 1 X (4 - 1 - 0) + 0 = 3 < 1M
• How much data duplication? 1
Physical Data Model
CREATE TABLE users_by_login_info (
  email VARCHAR,
  password VARCHAR,
  full_name VARCHAR,
  ID TIMEUUID,
  PRIMARY KEY (email, password)
);
SELECT * FROM users_by_login_info
  WHERE email = '…' [AND password = '…'];
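Physical Data Model
-- Run the conditional insert first; write the login row only if it reports [applied] = true.
-- (Cassandra does not accept conditional statements in a batch that spans several tables.)
INSERT INTO users_by_id (ID, full_name) VALUES (…) IF NOT EXISTS;
INSERT INTO users_by_login_info (email, password, full_name, ID) VALUES (…);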
Requirement 3
Users read books. We want to know which books a user has read and show them sorted by title and author.
QDM
Books_read_by_user: user_ID (K), title (C), author (C), full_name (S, static column), ISBN, genre, publisher
Q1: Find all books a logged-in user has read
Analysis & Validation
Books_read_by_user (Q1: Find all books a logged-in user has read)
• Data evenly spread?
• 1 Partition per read?
• Are write conflicts (overwrites) possible?
• How large are partitions?
– Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
– Books X (7 - 1 - 1) + 1 < 1M => about 200,000 books per user
• How much data duplication? 0
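Turning the inequality around gives the largest number of rows a partition can hold before it reaches the cell limit. The Python sketch below uses the same values the slides plug in (Npk = 1 for Books_read_by_user and for the unbucketed Actions_by_user); the function name is illustrative.

```python
CELL_LIMIT = 1_000_000  # the "< 1M" rule of thumb from the slides


def max_rows(n_cols: int, n_pk: int, n_static: int, limit: int = CELL_LIMIT) -> int:
    """Largest Nrow with Nrow x (Ncols - Npk - Nstatic) + Nstatic strictly below limit."""
    cells_per_row = n_cols - n_pk - n_static
    if cells_per_row <= 0:
        raise ValueError("each row must contribute at least one cell")
    return (limit - n_static - 1) // cells_per_row


print(max_rows(7, 1, 1))  # 199999 (Books_read_by_user; the slide rounds to 200,000)
print(max_rows(4, 1, 0))  # 333333 (Actions_by_user without bucketing)
```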
Physical Data Model
CREATE TABLE books_read_by_user (
  user_id TIMEUUID,
  title VARCHAR,
  author VARCHAR,
  full_name VARCHAR STATIC,
  ISBN VARCHAR,
  genre VARCHAR,
  publisher VARCHAR,
  PRIMARY KEY (user_id, title, author)
);
SELECT * FROM books_read_by_user
  WHERE user_ID = …;
Requirement 4
In order to improve our site’s usability we need to understand how our users use it by tracking every interaction they have with our site.
QDM
Actions_by_user: user_ID (K), time (C), element, type
Q1: Find all actions a user does in a time range
Analysis & Validation
Actions_by_user (Q1: Find all actions a user does in a time range)
• Data evenly spread?
• 1 Partition per read?
• Are write conflicts (overwrites) possible?
• How large are partitions?
– Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
– Actions X (4 - 1 - 0) + 0 < 1M => 333,333 actions per user
• How much data duplication? 0
Requirement 4: Bucketing
Actions_by_user: user_ID (K), month (K), time (C), element, type
– Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
– Actions X (5 - 2 - 0) + 0 < 1M => 333,333 actions per user every <bucket_size>
bucket_size = 1 year => 38 actions / h
bucket_size = 1 month => 462 actions / h
bucket_size = 1 week => 1984 actions / h
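The hourly rates follow from dividing the 333,333-row budget by the number of hours in each bucket. The Python sketch below reproduces them, assuming a 365-day year and a 30-day month (assumptions chosen because they match the slide's figures) and rounding down.

```python
# Hours per bucket; the 365-day year and 30-day month are assumptions.
HOURS_PER_BUCKET = {"year": 365 * 24, "month": 30 * 24, "week": 7 * 24}


def max_actions_per_hour(rows_per_partition: int, bucket: str) -> int:
    """Sustained actions per hour that keep one bucket within its row budget."""
    return rows_per_partition // HOURS_PER_BUCKET[bucket]


for bucket in ("year", "month", "week"):
    print(bucket, max_actions_per_hour(333_333, bucket))
# year 38
# month 462
# week 1984
```

A smaller bucket tolerates a busier user, at the cost of more partitions to read when a query spans a long time range.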
Analysis & Validation
Actions_by_user, bucketed by month (Q1: Find all actions a user does in a time range)
• Data evenly spread?
• 1 Partition per read?
• Are write conflicts (overwrites) possible?
• How large are partitions?
– Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
– Actions X (5 - 2 - 0) + 0 < 1M => 333,333 / month
• How much data duplication? 0
Physical Data Model
CREATE TABLE actions_by_user (
  user_ID TIMEUUID,
  month INT,
  time TIMESTAMP,
  element VARCHAR,
  type VARCHAR,
  PRIMARY KEY ((user_ID, month), time)
);
SELECT * FROM actions_by_user
  WHERE user_ID = … AND month = … AND time < … AND time > …;
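Because month is part of the partition key, a time range that crosses a month boundary needs one query per bucket. The Python sketch below lists the buckets a range touches. The slides do not say how the INT month is encoded, so the YYYYMM encoding here is an assumption, as are the function names.

```python
from datetime import datetime


def month_bucket(moment: datetime) -> int:
    """Encode a timestamp's month as YYYYMM (an assumed encoding for the INT column)."""
    return moment.year * 100 + moment.month


def buckets_between(start: datetime, end: datetime) -> list:
    """Every month bucket touched by the inclusive range [start, end]."""
    buckets = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        buckets.append(year * 100 + month)
        # Step to the next calendar month, rolling over the year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return buckets


print(buckets_between(datetime(2016, 11, 20), datetime(2017, 1, 5)))
# [201611, 201612, 201701]
```

Each returned value would be bound to month in its own SELECT, and the application merges the results.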
Further validation
∑sizeOf(pk) + ∑sizeOf(sc) + Nr x (∑sizeOf(rc) + ∑sizeOf(clc)) + 8 x Nv < 200 MB
– pk = Partition Key column
– sc = Static column
– Nr = Number of rows
– rc = Regular column
– clc = Clustering column
– Nv = Number of values
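The size estimate can be evaluated directly once per-column sizes are known. The Python sketch below applies the formula to actions_by_user. The 16-, 4- and 8-byte sizes are the fixed widths of TIMEUUID, INT and TIMESTAMP; the 20-byte average for element and type is a made-up assumption, and Nv is taken as the 999,999 cells from the earlier cell count.

```python
LIMIT_BYTES = 200 * 1024 * 1024  # the 200 MB guideline from the slide


def partition_size_bytes(pk_sizes, static_sizes, regular_sizes,
                         clustering_sizes, n_rows, n_values):
    """Sum of key and static sizes, plus per-row sizes, plus 8 bytes per value."""
    return (
        sum(pk_sizes)
        + sum(static_sizes)
        + n_rows * (sum(regular_sizes) + sum(clustering_sizes))
        + 8 * n_values
    )


size = partition_size_bytes(
    pk_sizes=[16, 4],        # user_ID TIMEUUID, month INT
    static_sizes=[],         # no static columns
    regular_sizes=[20, 20],  # element, type (assumed average lengths)
    clustering_sizes=[8],    # time TIMESTAMP
    n_rows=333_333,
    n_values=999_999,
)
print(size)                # 23999996
print(size < LIMIT_BYTES)  # True
```

Under these assumptions a full monthly bucket comes to roughly 24 MB, well inside the guideline.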
Next Steps
• Test your models against your hardware setup
– cassandra-stress
– http://www.sestevez.com/sestevez/CassandraDataModeler/ (kudos Sebastian Estevez)
• Monitor everything
– DataStax OpsCenter
– Graphite
– Datadog
– . . .
Thanks!
Carlos Alonso
Software Engineer
@calonso

Scalable data modelling by example - Cassandra Summit '16
