Scalable data modelling by example
Carlos Alonso (@calonso)
Carlos Alonso
• Ex-Londoner
• MSc Salamanca University, Spain
• Software Engineer @ Jobandtalent
• Cassandra certified developer
• Datastax Cassandra MVP 2015 & 2016
• @calonso / http://mrcalonso.com
Jobandtalent
• Revolutionising how people find jobs and how businesses
hire employees.
• Leveraging data to produce a unique job matching
technology.
• 10M+ users and 150K+ companies worldwide
• @jobandtalentEng / http://jobandtalent.com
• We are hiring!!
Cassandra Concepts
The data model is the only thing you can’t change
once in production.
Data organisation
Token
Physical Data Layout
Consistent Hashing
Hash function: “Carlos” → 185664
Other token values shown: 1773456738847666528349, -894763734895827651234
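A minimal Python sketch of consistent hashing, under stated assumptions: the node names, the tokens assigned to them and the use of MD5 are illustrative and do not come from the slides (Cassandra's default partitioner uses Murmur3). A key is hashed to a token, and the token belongs to the first node on the ring whose own token is greater than or equal to it, wrapping around at the end.

```python
import bisect
import hashlib

# Hypothetical three-node ring; each node owns the range ending at its token.
RING = [
    (-6_000_000_000_000_000_000, "node-a"),
    (0, "node-b"),
    (6_000_000_000_000_000_000, "node-c"),
]
TOKENS = [token for token, _ in RING]


def token_for(key: str) -> int:
    """Hash a partition key to a signed 64-bit token (MD5 stands in for Murmur3)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


def owner_of(token: int) -> str:
    """Return the first node whose token is >= the given token, wrapping around."""
    index = bisect.bisect_left(TOKENS, token)
    return RING[index % len(RING)][1]


if __name__ == "__main__":
    token = token_for("Carlos")
    print(token, owner_of(token))
```

Because ownership depends only on the token, adding a node to the ring moves just the keys in the range that node takes over.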
Replication factor
How many copies (replicas) of your data are kept
Consistency Level
How many replicas of your data must acknowledge a read or write?
A complete read/write example
[Figure: Client, Driver and Partitioner; the partitioner maps the key f81d4fae-… to token 834]
• RF = 3
• CL = QUORUM
• SELECT * … WHERE id = f81d4fae-…
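The RF and CL settings above can be turned into numbers with a short Python sketch. The function names are hypothetical; the arithmetic is the usual majority rule, where a quorum is floor(RF / 2) + 1, and reads are guaranteed to overlap with acknowledged writes when the two acknowledgement counts add up to more than RF.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that form a majority: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def reads_overlap_writes(rf: int, write_acks: int, read_acks: int) -> bool:
    """True when any set of read replicas must share a node with any set of write replicas."""
    return write_acks + read_acks > rf


if __name__ == "__main__":
    rf = 3
    q = quorum(rf)
    print(q, reads_overlap_writes(rf, q, q))
```

With RF = 3, QUORUM means 2 replicas must acknowledge, so one replica can be down without failing the request.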
Data Modelling
• Understand your data
• Decide (know) how you’ll query the data
• Define column families to satisfy those queries
• Implement and optimise
Data Modelling
Conceptual Model → Query-Driven Methodology → Logical Model → Analysis & Validation → Physical Model
Query Driven Methodology: goals
• Spread data evenly around the cluster
• Minimise the number of partitions read
• Keep partitions manageable
Query Driven Methodology: process
• Entities and relationships: map to tables
• Key attributes: map to primary key columns
• Equality search attributes: must be at the beginning of the primary key
• Inequality search attributes: become clustering columns
• Ordering attributes: become clustering columns
The Primary Key
PARTITION KEY + CLUSTERING COLUMN(S)
CREATE TABLE . . .(
fields . . .
PRIMARY KEY (part_key, clust1, . . .)
);
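A toy Python model of what the primary key implies, offered as a sketch rather than a description of Cassandra's implementation. The partition key selects a partition, rows inside it stay ordered by their clustering values, and writing the same full primary key twice overwrites the earlier row. The `partitions` store and the `insert` helper are hypothetical names.

```python
import bisect
from collections import defaultdict

# partition key -> list of (clustering values, other columns), kept sorted
partitions = defaultdict(list)


def insert(part_key, clustering, columns):
    """Place a row in its partition, ordered by clustering values (upsert semantics)."""
    rows = partitions[part_key]
    keys = [existing for existing, _ in rows]
    index = bisect.bisect_left(keys, clustering)
    if index < len(rows) and keys[index] == clustering:
        rows[index] = (clustering, columns)  # same primary key: overwrite
    else:
        rows.insert(index, (clustering, columns))


if __name__ == "__main__":
    insert("user-1", ("Dune", "Herbert"), {"genre": "sci-fi"})
    insert("user-1", ("Emma", "Austen"), {"genre": "novel"})
    insert("user-1", ("Dune", "Herbert"), {"genre": "science fiction"})
    print(partitions["user-1"])
```

The overwrite branch is why the validation step asks whether write conflicts are possible: two writes with the same primary key leave only one row.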
Analysis & Validation
• Data evenly spread?
• 1 Partition per read?
• Are write conflicts (overwrites) possible?
• How large are partitions?
– Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
• How much data duplication? (batches)
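The partition-size check in the list above can be sketched in Python. The function and parameter names are hypothetical; the arithmetic is the Ncells formula as written on the slide, with the 1M threshold as the limit.

```python
CELL_LIMIT = 1_000_000  # the "< 1M" threshold from the formula above


def cells_per_partition(n_rows, n_cols, n_pk, n_static):
    """Ncells = Nrow x (Ncols - Npk - Nstatic) + Nstatic."""
    return n_rows * (n_cols - n_pk - n_static) + n_static


def max_rows_per_partition(n_cols, n_pk, n_static, limit=CELL_LIMIT):
    """Largest Nrow that keeps Ncells strictly below the limit."""
    cells_per_row = n_cols - n_pk - n_static
    if cells_per_row <= 0:
        raise ValueError("the table needs at least one non-key, non-static column")
    return (limit - 1 - n_static) // cells_per_row


if __name__ == "__main__":
    print(cells_per_partition(1, 5, 1, 0))
    print(max_rows_per_partition(4, 1, 0))
```

For instance, a table with 4 columns, 1 of them counted as key and no static columns, stays under the limit up to 333,333 rows per partition.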
An E-Library project.
Requirement 1
Books can be uniquely identified and accessed by ISBN. We also need a title, genre, author and publisher.
QDM
Q1: Find books by ISBN
Book
ISBN K
Title
Author
Genre
Publisher
Analysis & Validation
Book (Q1: Find books by ISBN)
• Data evenly spread?
• 1 Partition per read?
• Are write conflicts (overwrites) possible?
• How large are partitions?
– Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
– 1 X (5 - 1 - 0) + 0 < 1M
• How much data duplication? 0
Physical data model
CREATE TABLE books (
    ISBN VARCHAR PRIMARY KEY,
    title VARCHAR,
    author VARCHAR,
    genre VARCHAR,
    publisher VARCHAR
);
SELECT * FROM books WHERE ISBN = '…';
Requirement 2
Users register in the system and are uniquely identified by an email and a password. We also want their full name. They will be accessed by email and password or by internal unique ID.
QDM
Q1: Find users by ID
Users_by_ID
ID K
full_name
Q2: Find users by login info
Q3: Find users by email (to guarantee uniqueness)
Users_by_login_info
email K
password C
full_name
ID
Analysis & Validation
Users_by_ID (Q1: Find users by ID)
• Data evenly spread?
• 1 Partition per read?
• Are write conflicts (overwrites) possible?
• How large are partitions?
– Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
– 1 X (2 - 1 - 0) + 0 < 1M
• How much data duplication? 0
Physical Data Model
CREATE TABLE users_by_id (
    ID TIMEUUID PRIMARY KEY,
    full_name VARCHAR
);
SELECT * FROM users_by_id WHERE ID = …;
Analysis & Validation
Users_by_login_info: email K, password C, full_name, ID
Q2: Find users by login info
Q3: Find users by email (to guarantee uniqueness)
• Data evenly spread?
• 1 Partition per read?
• Are write conflicts (overwrites) possible?
• How large are partitions?
– Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
– 1 X (4 - 1 - 0) + 0 < 1M
• How much data duplication? 1
Physical Data Model
CREATE TABLE users_by_login_info (
    email VARCHAR,
    password VARCHAR,
    full_name VARCHAR,
    ID TIMEUUID,
    PRIMARY KEY (email, password)
);
SELECT * FROM users_by_login_info
WHERE email = '…' [AND password = '…'];
Physical Data Model
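-- Caution: Cassandra restricts batches containing conditional statements (IF NOT EXISTS)
-- to a single table and partition, so these two inserts may need to be issued separately.
BEGIN BATCH
    INSERT INTO users_by_id (ID, full_name) VALUES (…) IF NOT EXISTS;
    INSERT INTO users_by_login_info (email, password, full_name, ID) VALUES (…);
APPLY BATCH;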
Requirement 3
Users read books.
We want to know which books a user has read and show them sorted by title and author.
QDM
Q1: Find all books a logged-in user has read
Books_read_by_user
user_ID K
title C
author C
full_name S
ISBN
genre
publisher
Analysis & Validation
Books_read_by_user (Q1: Find all books a logged-in user has read)
• Data evenly spread?
• 1 Partition per read?
• Are write conflicts (overwrites) possible?
• How large are partitions?
– Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
– Books X (7 - 1 - 1) + 1 < 1M => 200,000 books per user
• How much data duplication? 0
Physical Data Model
CREATE TABLE books_read_by_user (
    user_id TIMEUUID,
    title VARCHAR,
    author VARCHAR,
    full_name VARCHAR STATIC,
    ISBN VARCHAR,
    genre VARCHAR,
    publisher VARCHAR,
    PRIMARY KEY (user_id, title, author)
);
SELECT * FROM books_read_by_user
WHERE user_ID = …;
Requirement 4
In order to improve our site’s usability we need to understand how our users use it by tracking every interaction they have with our site.
QDM
Q1: Find all actions a user does in a time range
Actions_by_user
user_ID K
time C
element
type
Analysis & Validation
Actions_by_user (Q1: Find all actions a user does in a time range)
• Data evenly spread?
• 1 Partition per read?
• Are write conflicts (overwrites) possible?
• How large are partitions?
– Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
– Actions X (4 - 1 - 0) + 0 < 1M => 333,333 actions per user
• How much data duplication? 0
Requirement 4: Bucketing
Actions_by_user
user_ID K
month K
time C
element
type
– Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
– Actions X (5 - 2 - 0) + 0 < 1M => 333,333 actions per user every <bucket_size>
bucket_size = 1 year => 38 actions / h
bucket_size = 1 month => 462 actions / h
bucket_size = 1 week => 1984 actions / h
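The hourly rates above follow from dividing the per-partition budget by the number of hours in each bucket. A small Python sketch, assuming a 365-day year and a 30-day month (assumptions chosen because they reproduce the figures above); the names are hypothetical.

```python
MAX_ACTIONS_PER_PARTITION = 333_333  # from Actions X (5 - 2 - 0) + 0 < 1M

# Assumed bucket lengths in hours: 365-day year, 30-day month, 7-day week.
HOURS_PER_BUCKET = {
    "year": 365 * 24,
    "month": 30 * 24,
    "week": 7 * 24,
}


def sustainable_actions_per_hour(bucket_size: str) -> int:
    """Whole actions per hour a single user can sustain without exceeding the budget."""
    return MAX_ACTIONS_PER_PARTITION // HOURS_PER_BUCKET[bucket_size]


if __name__ == "__main__":
    for name in HOURS_PER_BUCKET:
        print(name, sustainable_actions_per_hour(name))
```

Smaller buckets raise the sustainable write rate per user, at the cost of reading more partitions when a query spans a long time range.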
Analysis & Validation
Actions_by_user (Q1: Find all actions a user does in a time range)
• Data evenly spread?
• 1 Partition per read?
• Are write conflicts (overwrites) possible?
• How large are partitions?
– Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
– Actions X (5 - 2 - 0) + 0 < 1M => 333,333 / month
• How much data duplication? 0
Physical Data Model
CREATE TABLE actions_by_user (
    user_ID TIMEUUID,
    month INT,
    time TIMESTAMP,
    element VARCHAR,
    type VARCHAR,
    PRIMARY KEY ((user_ID, month), time)
);
SELECT * FROM actions_by_user
WHERE user_ID = … AND month = … AND time < … AND time > …;
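Since month is part of the partition key, a time range that crosses month boundaries needs one query per bucket. The Python sketch below lists the buckets a range touches. The YYYYMM encoding of the month INT column is an assumption (the slides do not say how it is encoded), and the function names are hypothetical.

```python
from datetime import datetime


def month_bucket(moment: datetime) -> int:
    """Encode a timestamp's month as an integer, e.g. 201609 (assumed YYYYMM format)."""
    return moment.year * 100 + moment.month


def buckets_between(start: datetime, end: datetime) -> list:
    """Return every month bucket touched by the inclusive range [start, end]."""
    buckets = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        buckets.append(year * 100 + month)
        # Step to the next calendar month, rolling over the year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return buckets


if __name__ == "__main__":
    print(buckets_between(datetime(2016, 11, 15), datetime(2017, 1, 3)))
```

Each returned value would fill the month = … placeholder of one SELECT, with the same time bounds applied in every query.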
Further validation
∑sizeOf(pk) + ∑sizeOf(sc) + Nr x (∑sizeOf(rc) + ∑sizeOf(clc)) + 8 x Nv < 200 MB
– pk = Partition Key column
– sc = Static column
– Nr = Number of rows
– rc = Regular column
– clc = Clustering column
– Nv = Number of values
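The size estimate above can be sketched as a Python function. The function name, the 20-byte average assumed for each VARCHAR column and the reading of 200 MB as 200 x 1024 x 1024 bytes are assumptions for illustration; the fixed sizes used in the example (16 bytes for TIMEUUID, 4 for INT, 8 for TIMESTAMP) are the standard sizes of those CQL types.

```python
LIMIT_BYTES = 200 * 1024 * 1024  # "200 MB", assumed to mean mebibytes


def estimated_partition_bytes(pk_sizes, static_sizes, regular_sizes,
                              clustering_sizes, n_rows, n_values):
    """Sum of key and static column sizes, per-row column sizes, and 8 bytes per value."""
    return (sum(pk_sizes) + sum(static_sizes)
            + n_rows * (sum(regular_sizes) + sum(clustering_sizes))
            + 8 * n_values)


if __name__ == "__main__":
    # actions_by_user with one full monthly bucket of 333,333 rows
    size = estimated_partition_bytes(
        pk_sizes=[16, 4],        # user_ID TIMEUUID, month INT
        static_sizes=[],
        regular_sizes=[20, 20],  # element, type: assumed average lengths
        clustering_sizes=[8],    # time TIMESTAMP
        n_rows=333_333,
        n_values=333_333 * 3,
    )
    print(size, size < LIMIT_BYTES)
```

Under these assumptions a full monthly bucket comes to roughly 24 MB, well inside the 200 MB bound.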
Next Steps
• Test your models against your hardware setup
– cassandra-stress
– http://www.sestevez.com/sestevez/CassandraDataModeler/ (kudos Sebastian Estevez)
• Monitor everything
– DataStax OpsCenter
– Graphite
– Datadog
– . . .
Thanks!
Carlos Alonso
Software Engineer
@calonso

Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra Summit 2016
