Data Modeling for Performance Masterclass: Why Data Modeling Matters

Why Data Modeling Matters
Felipe Cardeneti Mendes
Data Modeling for Performance

A building is only as strong as its foundation
The data model shapes how data gets stored and accessed in a database.
■ Trade-off between scalability and flexibility:
○ A flexible querying model is easier to work with, at the cost of performance
○ A performance-oriented model sacrifices flexibility, but benefits from near-infinite scaling
■ A weak foundation often leads to:
○ Bottlenecks and higher maintenance (thus $$$)
○ Difficulty adapting to evolving business needs
■ Our goal: Build solid and strong skyscrapers!

Real Story
https://www.youtube.com/watch?v=G71MnVtDtS4

#1 Understand Your Needs
Ad-hoc
■ Unknown access patterns
■ Lots of flexibility
■ VERY large datasets
■ Examples: Analytics, Business Intelligence, Reporting

Search
■ Prone to index overloads
■ Limited flexibility
■ Moderate datasets
■ Examples: Full-text search, Knowledge bases, Type-ahead…

Transactional
■ Defined structure
■ Good flexibility
■ Small datasets
■ Examples: Financial, Tax Filing, Reservations

Performance
■ Access patterns are well established
■ Not much flexibility beyond what was set upfront
■ Any scale
■ Examples: Real-time Machine Learning, Social Networks, IoT / Time-series, Recommendation Engines, Fraud Detection, Event Logging, Gaming, Catalogs…

#2 Use the Right Tool for the Job
■ Our focus: Performance
■ Understand the goals of each solution:
● Data Warehouses focus on throughput over latency
○ Poor insert performance and slow disks are the norm
● OLAP systems (incl. Search Engines) are optimized for complex aggregations
○ Immediate, real-time updates are an anti-pattern
● Relational databases offer limited scaling
○ Locking, JOINs and ACID guarantees inevitably affect query performance
● NoSQL OLTP databases focus on highly concurrent point queries
○ Some optimize for reads (DynamoDB, MongoDB)
○ Some optimize for writes (Cassandra, RocksDB)
○ Some optimize for both (ScyllaDB)

#3 Plan for the Future
What's wrong with this? Consider 100K devices reporting
measurements every 15 seconds, with 5 years of data retention.
CREATE TABLE timeseries (
    device_id uuid,
    time timestamp,
    temperature float,
    pressure float,
    air_speed float,
    (...)
    PRIMARY KEY(device_id, time)
);

1 year = 86,400 * 365 = 31,536,000 seconds
Single device = 1 year / 15s ~= 2.1M rows/device per year
5 years = >10M rows per device (one partition per device)
Read performance degrades over time!
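
The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name `rows_per_device` is made up for illustration, and the inputs are the figures from the scenario (100K devices, one measurement every 15 seconds, 5 years of retention).

```python
SECONDS_PER_DAY = 86_400


def rows_per_device(interval_s: int, retention_days: int) -> int:
    """Rows one device writes during the retention period."""
    return retention_days * SECONDS_PER_DAY // interval_s


# One row every 15 seconds, as in the scenario above.
per_year = rows_per_device(15, 365)
per_partition = rows_per_device(15, 5 * 365)  # device_id is the partition key
total_rows = per_partition * 100_000          # 100K devices

print(f"rows per device per year:   {per_year:,}")
print(f"rows per partition (5 yrs): {per_partition:,}")
print(f"rows in the whole table:    {total_rows:,}")
```

With `PRIMARY KEY(device_id, time)`, every one of those rows for a device lands in the same partition, which is why the partition keeps growing for the full retention period.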
Maybe acceptable… Focus on your queries!

If we infrequently read all rows, range queries help with latencies:
SELECT COUNT(1) FROM timeseries WHERE
    device_id = ? AND time >= ? AND time < ?;

At the same time, there's very little reason to group many rows
together:
CREATE TABLE timeseries (
    bucket int,
    (...)
    PRIMARY KEY((device_id, bucket), time)
);
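
The slides do not say how `bucket` is derived, so the following is a minimal sketch assuming one bucket per UTC day, counted as days since the Unix epoch. The helper names `day_bucket` and `buckets_for_range` are hypothetical. The application computes the bucket on write, and on read it enumerates the buckets that a time range touches so each one can be queried as its own partition.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
ONE_DAY = timedelta(days=1)


def day_bucket(ts: datetime) -> int:
    """Bucket value for a timestamp: whole UTC days since the Unix epoch.

    Assumption: daily buckets. The right granularity depends on your queries.
    """
    return (ts - EPOCH) // ONE_DAY


def buckets_for_range(start: datetime, end: datetime) -> list[int]:
    """Buckets that a query over the half-open range [start, end) must read."""
    if end <= start:
        return []
    last = day_bucket(end - timedelta(microseconds=1))
    return list(range(day_bucket(start), last + 1))


# At one row every 15 seconds, a daily bucket caps a partition at this many rows.
rows_per_bucket = 86_400 // 15

start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
end = datetime(2024, 1, 3, 0, 0, tzinfo=timezone.utc)
print(buckets_for_range(start, end), rows_per_bucket)
```

The trade-off is that a read spanning several days now fans out into one query per `(device_id, bucket)` pair, so the bucket size should follow the time ranges the application actually asks for.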

#4 Simplify Access Patterns
■ Minimize expensive queries
● Joins, complex filters
■ Pre-aggregate when possible
● Machine Learning – Pre-computed Counts, Sums, Similarity Scores, Summary Statistics
● Dashboards – Denormalized data structures to minimize complexity/latencies
■ Denormalize
● A single query should retrieve all data it requires
● Reduces client-side processing
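
As a rough illustration of pre-aggregation, the sketch below keeps a running summary per device and day at write time, so a dashboard read becomes a single lookup instead of a scan over raw measurements. Everything here is an assumption made for the example: the class names `Summary` and `TemperatureRollup` are hypothetical, and the in-memory dictionary merely stands in for a summary table in the database.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Summary:
    """Running aggregates, updated on every write."""
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def add(self, value: float) -> None:
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def mean(self) -> Optional[float]:
        return self.total / self.count if self.count else None


class TemperatureRollup:
    """Stand-in for a pre-aggregated table keyed by (device_id, day)."""

    def __init__(self) -> None:
        self._by_key = defaultdict(Summary)

    def record(self, device_id: str, day: int, temperature: float) -> None:
        # Write path: pay a small cost per measurement.
        self._by_key[(device_id, day)].add(temperature)

    def read(self, device_id: str, day: int) -> Optional[Summary]:
        # Read path: one lookup, no client-side processing of raw rows.
        return self._by_key.get((device_id, day))


rollup = TemperatureRollup()
for reading in (20.0, 22.0, 24.0):
    rollup.record("device-1", 19723, reading)

summary = rollup.read("device-1", 19723)
print(summary.count, summary.mean, summary.minimum, summary.maximum)
```

The design choice is to move work from reads to writes: each write does a little extra bookkeeping so that the frequent read path stays cheap and predictable.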

#5 Refine and Build!
■ Apply open-loop load testing
● Brian Taylor's talk on How to Maximize Database Concurrency
■ Define performance KPIs upfront
● Database Performance at Scale – Chapter 10: Monitoring
■ Hunt for potential imbalances
● Load imbalance → Uneven access patterns
● Data imbalance → Large partitions/rows
● Hotspots → Low cardinality/Retry Storms/Spammers
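
One simple way to hunt for data imbalance is to compare each partition's row count against the typical partition. The sketch below assumes you already have per-partition row counts from your own tooling. The function name `find_large_partitions` and the default factor of 10 are illustrative choices, not values from the talk.

```python
from statistics import median


def find_large_partitions(row_counts: dict[str, int], factor: float = 10.0) -> list[str]:
    """Return partition keys whose row count exceeds factor x the median.

    Results are ordered from largest to smallest. The threshold factor is
    an assumption and should be tuned for the workload.
    """
    if not row_counts:
        return []
    limit = factor * median(row_counts.values())
    flagged = [key for key, rows in row_counts.items() if rows > limit]
    return sorted(flagged, key=row_counts.get, reverse=True)


# Hypothetical sample: most partitions are similar, one is far larger.
counts = {"a": 100, "b": 120, "c": 90, "d": 5000, "e": 110}
print(find_large_partitions(counts))
```

The median is used instead of the mean so that a handful of oversized partitions does not inflate the baseline they are compared against.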

Keep in touch!
Felipe Cardeneti Mendes
Technical Director
ScyllaDB
felipe@scylladb.com
@felipemendes.dev
Attributions
Icon made by juicy_fish from www.flaticon.com
Icons made by Freepik from www.flaticon.com
Icon made by PixelPerfect from www.flaticon.com
