Writing Applications for Scylla

Shlomi Livne, VP R&D

Presenter
Shlomi Livne, VP of R&D
Shlomi is VP of R&D at ScyllaDB. Prior to ScyllaDB, he led the
research and development team at Convergin, which was
acquired by Oracle.

Part 1: Enhancement
for Application Development

Development Cycle: NoSQL Databases
■ Think about the queries you are going to run
■ Create a Data Model: use cassandra-stress (or another tool) to validate (*)
■ Develop: CQL Optimization
■ Scale test
■ Deploy
■ Disk Access

What will Disk Access track?
■ Disk Access looks at:
● Number of I/O operations
● Total number of bytes read
■ When sstables are read from disk, two components are involved
(everything else is in memory):
● Data - stores the actual data
● Index - provides a lookup into the data file “blocks” that contain the partition (if the
partition is large, the index also contains a promoted index)

Disk Access - Why
● The disk-to-memory ratio is increasing:
○ EC2 i3 family memory : disk ratio is 1:30
○ EC2 i3en family memory : disk ratio is 1:78
○ More queries will be served from disk
● There are workloads that you will always prefer to run from disk
(background analytics)

An IoT application(+)
■ 1,000,000 sensors, representing homes in an area
■ 1 reading per minute
■ 365 days (1 year storage requirement)
■ Total amount of data points: 526 billion temperature readings
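The 526 billion total can be checked with a few lines of arithmetic. This is a minimal sketch, assuming one reading per minute per sensor (the rate that reproduces the 526 billion figure); the constant names are illustrative and not from the talk.

```python
# Sizing sketch for the IoT example; names are illustrative only.
SENSORS = 1_000_000
DAYS = 365
READINGS_PER_DAY = 24 * 60  # assumption: one reading per minute per sensor

points_per_day = SENSORS * READINGS_PER_DAY
total_points = points_per_day * DAYS

print(f"points per day: {points_per_day:,}")
print(f"total points:   {total_points:,}")
```

The per-day figure is the amount of data a single-day analytics pass has to scan; the yearly total is what a full scan has to cover.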

Analytics over the entire data?
How long would it take at normal speeds?
■ 200,000 points/second: 730 hours (30 days)
■ 1 million points/second: 146 hours (almost a week)
We need more if analytics are a part of the pipeline. That means:
■ We need Scylla
■ We need a good application
■ And we need hardware
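The scan durations follow directly from the total number of points and a sustained read rate. A small sketch, assuming the total is exactly 1,000,000 sensors x 365 days x 1,440 readings per day (about 526 billion); `scan_hours` is a hypothetical helper name.

```python
TOTAL_POINTS = 525_600_000_000  # ~526 billion readings


def scan_hours(points_per_second: int) -> float:
    """Hours needed to read every point once at a sustained rate."""
    return TOTAL_POINTS / points_per_second / 3600


for rate in (200_000, 1_000_000):
    hours = scan_hours(rate)
    print(f"{rate:>9,} points/s -> {hours:,.0f} hours ({hours / 24:.1f} days)")
```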

Why climb Mount Everest?
Because it’s there.
George Leigh Mallory
What kind of performance are we after?

Data Model
CREATE TABLE readings (
    sensor_id int,
    date date,
    time time,
    temperature float,
    PRIMARY KEY ((sensor_id, date), time)
);
What kind of queries can we reasonably support?
■ SELECT * FROM readings WHERE sensor_id = ? AND date = ?;
■ SELECT * FROM readings WHERE sensor_id = ? AND date = ? AND time > ?;

Analytics Application Option 1
■ Let the server do as much work as possible
SELECT sensor_id,
       date,
       min(temperature) AS minTemperature,
       max(temperature) AS maxTemperature
FROM readings WHERE sensor_id = ? AND date = ?;

Application
(Example) Total amount of data to scan: 1.44 billion points/day
[Diagram: a Coordinator and three Workers (loader machines) in front of the ScyllaDB cluster. Caption: Set time frame, compute average, min, max of all sensors]

Disk Access Analysis Option 1 (in theory)
● For simplification, let's assume:
○ Every partition:
■ is fully stored in a single sstable
■ fits exactly in a single data block
○ Bloom filters do not produce false positives
● Analysis
○ Number of partitions: 365 * 10^6 = 365 million
○ I/O for index: 365 million
○ I/O for data: 365 million

Analytics Application Option 2
■ Do range scans and use CQL GROUP BY (new in Scylla 3.2)
SELECT sensor_id,
       date,
       min(temperature) AS minTemperature,
       max(temperature) AS maxTemperature
FROM readings WHERE token(sensor_id, date) > X AND
     token(sensor_id, date) < Y GROUP BY sensor_id, date;

Disk Access Analysis Option 2 (in theory)
● For simplification, let's assume:
○ The application breaks requests up by vnode token ranges
○ Every partition:
■ is fully stored in a single sstable
■ fits exactly in a single data block (and is the only one there)
○ vnode token ranges do not share data blocks
○ Bloom filters do not produce false positives
● Analysis
○ Number of scans: number of vnode token ranges
○ I/O for index: number of vnode token ranges * number of shards
○ I/O for data: number of data blocks

Disk Access Comparison
              | Option 1: Single Partition | Option 2: Range Scans
--------------+----------------------------+------------------------------------------------
Number of ops | Number of partitions       | Number of scans = number of vnode token ranges
              | 365 * 10^6 = 365 million   | 83 * 256 = 21248
I/O for index | 365 million                | Number of vnode token ranges * number of shards
              |                            | 83 * 256 * 54 = 1147392
I/O for data  | 365 million                | 365 million
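The comparison can be reproduced as plain arithmetic. This is a sketch under the stated simplifications; the figures 83 * 256 and 54 are taken from the slide as given, and the dictionary layout and names are this example's own.

```python
# Figures from the comparison; variable names are illustrative.
PARTITIONS = 365 * 10**6       # 1,000,000 sensors x 365 days
VNODE_TOKEN_RANGES = 83 * 256  # as given on the slide
SHARDS = 54                    # as given on the slide

single_partition = {
    "ops": PARTITIONS,
    "index_io": PARTITIONS,
    "data_io": PARTITIONS,
}
range_scan = {
    "ops": VNODE_TOKEN_RANGES,
    "index_io": VNODE_TOKEN_RANGES * SHARDS,
    "data_io": PARTITIONS,  # one data block per partition under the simplifications
}

for metric in single_partition:
    ratio = single_partition[metric] / range_scan[metric]
    print(f"{metric:>8}: {single_partition[metric]:>11,} vs "
          f"{range_scan[metric]:>11,} (x{ratio:,.0f})")
```

Under these assumptions the data I/O is unchanged; the theoretical saving is entirely in the number of requests and index reads.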

Billy using Full Scan (theoretical) gain
1. The number of I/O ops for Index on the cluster drops from 365 million
to ~1.1 million
● In reality, SSTable Bloom filters are not perfect, so single-partition reads will be
attempted on sstables that don’t have the partition (an even bigger win for scans)
2. The number of CQL operations on the cluster drops from 365 million
to ~21K
● Returning a result per partition takes 365M / 5000 (page size) = 73K pages (in the optimal
case), so we will need more than 21K requests
3. In reality, partitions do share data blocks; they are not perfectly
aligned
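The paging remark can be made concrete: with one result row per partition, the page count puts a floor on the number of requests regardless of how many token ranges are scanned. A minimal sketch; the names are illustrative and the page size of 5000 is the one quoted on the slide.

```python
import math

PARTITIONS = 365_000_000
PAGE_SIZE = 5000        # rows per page, as quoted on the slide
RANGE_SCANS = 83 * 256  # one scan per vnode token range

# One result row per partition, so the rows to return equal the partitions.
pages = math.ceil(PARTITIONS / PAGE_SIZE)

# Every scan needs at least one request and every page is one request,
# so the larger of the two is a lower bound on total requests.
requests_lower_bound = max(pages, RANGE_SCANS)

print(f"pages: {pages:,}")
print(f"lower bound on requests: {requests_lower_bound:,}")
```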

Putting Disk Access into practice
● Queries
● Data model
● Some test data (at small scale)
● Docker
● Scylla-Nightly (pre Scylla 3.2)
○ Tracing including disk access
'... mc-132-big-Index.db: finished bulk DMA read of size 538 at offset 0, successfully read 4096 bytes [shard 0]'
● A simple script that parses system_traces.events after running a
traced query
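A script of the kind mentioned above might look like the following. This is a hedged sketch, not the script used in the talk: the regular expression is modeled only on the single sample trace line shown, `summarize` is a hypothetical name, and fetching the activity strings from the trace events table is left out.

```python
import re
from collections import defaultdict

# Pattern modeled on the one sample trace line; other message shapes
# are simply ignored.
READ_RE = re.compile(
    r"(?P<file>\S+)-(?P<component>Index|Data)\.db: finished bulk DMA read "
    r"of size (?P<size>\d+) at offset (?P<offset>\d+), "
    r"successfully read (?P<read>\d+) bytes"
)


def summarize(activities):
    """Count I/O operations and bytes read per sstable component."""
    totals = defaultdict(lambda: {"io": 0, "bytes": 0})
    for activity in activities:
        match = READ_RE.search(activity)
        if match:
            entry = totals[match["component"]]
            entry["io"] += 1
            entry["bytes"] += int(match["read"])
    return dict(totals)


sample = [
    "... mc-132-big-Index.db: finished bulk DMA read of size 538 at offset 0, "
    "successfully read 4096 bytes [shard 0]",
]
print(summarize(sample))
```

Counting the "successfully read" bytes rather than the requested size is a choice made here so the totals reflect what the disk actually transferred.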

Billy on small scale
■ 1000 sensors, 100 dates, 1 sample per minute
■ 1 M partitions, 1440 M rows
■ # shards 4

Results
            | Single Partition | Range Scan | Gain
------------+------------------+------------+-------
Index I/O   | ~1.3 M           | 3318       | X 392
Index Bytes | ~2.8 GB          | ~6.9 MB    | X 424
Data I/O    | ~1 M             | 10738      | X 93
Data Bytes  | ~14.4 GB         | ~1.3 GB    | X 11

Billy using Full Scan: the gain is even bigger
1. Read-aheads for full scans make better use of the disk
● Single Partition avg data bytes per read: 14748600348 ÷ 1024089 = ~14.4K
● Range Scan avg data bytes per read: 1355390291 ÷ 10738 = ~126K
2. AIO reads are sent to the disk aligned to Index/Data placement, yet
disks read in whole blocks:
● Doing 2 reads for the two halves of a disk block results in reading the block twice and
returning part of it each time.
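Both points can be illustrated numerically. The sketch below computes the average bytes per data read from the measured totals, and models block-granular disk reads. The 4096-byte block size is an assumption (it matches the sample trace line, where a 538-byte request read 4096 bytes), and both function names are hypothetical.

```python
def average_read_bytes(total_bytes: int, io_count: int) -> float:
    """Average number of bytes transferred per I/O operation."""
    return total_bytes / io_count


def disk_bytes_for_read(offset: int, size: int, block_size: int = 4096) -> int:
    """Bytes the disk transfers for one request, rounded out to whole blocks.

    Assumes size > 0 and a fixed block size (4096 is an assumption).
    """
    first_block = offset // block_size
    last_block = (offset + size - 1) // block_size
    return (last_block - first_block + 1) * block_size


print(f"single partition: {average_read_bytes(14_748_600_348, 1_024_089):,.0f} bytes/read")
print(f"range scan:       {average_read_bytes(1_355_390_291, 10_738):,.0f} bytes/read")

# Two requests for the two halves of one block fetch that block twice.
two_halves = disk_bytes_for_read(0, 2048) + disk_bytes_for_read(2048, 2048)
one_read = disk_bytes_for_read(0, 4096)
print(f"two half-block reads: {two_halves} bytes, one full read: {one_read} bytes")
```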

Should Range Scans always be used
for analytics?
■ No
■ Not if Number of Partitions < Number of Token Ranges * Number of
Shards (single-partition reads then need fewer index I/Os)
■ What if we are doing a partial scan - what should we do?
a. Example: What were the max and min temperatures over the last 7/30/90 days?
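The partition-count rule above can be written as a one-line comparison of index reads. A sketch under the earlier simplifications (one index read per partition for single-partition queries, one per token range per shard for scans); `fewer_index_reads` is a hypothetical name.

```python
def fewer_index_reads(partitions: int, token_ranges: int, shards: int) -> str:
    """Return the access pattern that needs fewer index I/Os in theory."""
    range_scan_index_io = token_ranges * shards
    if partitions < range_scan_index_io:
        return "single partition"
    return "range scan"


# Token range and shard figures as quoted in the comparison.
print(fewer_index_reads(365_000_000, 83 * 256, 54))
print(fewer_index_reads(1_000_000, 83 * 256, 54))
```

With 365 million partitions the scan wins easily; with only 1 million partitions on the same cluster layout, the scan would issue more index reads than there are partitions.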

Billy+: Partial Scan
SELECT sensor_id,
       date,
       min(temperature) AS minTemperature,
       max(temperature) AS maxTemperature
FROM readings WHERE token(sensor_id, date) > X AND
     token(sensor_id, date) < Y AND date >= Z
GROUP BY sensor_id, date ALLOW FILTERING;

Billy+: Partial Scan
● Under the earlier simplifications, ~7% seems to be a good threshold:
○ Partial scan of < 7% of the data: use single partition reads
○ Partial scan of > 7% of the data: use a full scan and filter
● General case: it depends on how big the partitions are
○ Larger partitions carry a higher penalty for reading them unnecessarily
            | Single Partition | Range Scan | Range Scan / Single Partition
------------+------------------+------------+------------------------------
Total I/O   | ~2.3 M           | 14056      | 0.6%
Total Bytes | ~17.7 GB         | ~1.3 GB    | 7.7%
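The ~7% mark corresponds to the byte ratio between the two approaches: a filtered full scan costs roughly the same no matter how much of the data is wanted, while single-partition reads scale with the fraction requested. A rough sketch using the measured totals; the index byte count is the rounded slide value, and `cheaper_strategy` is a hypothetical helper.

```python
# Totals from the small-scale measurement (index bytes are rounded values).
SINGLE_PARTITION_BYTES = 17_700_000_000
RANGE_SCAN_BYTES = 1_355_390_291 + 6_900_000  # data bytes + ~6.9 MB index bytes

break_even = RANGE_SCAN_BYTES / SINGLE_PARTITION_BYTES


def cheaper_strategy(fraction_needed: float) -> str:
    """Pick the approach that reads fewer bytes for a partial scan."""
    if fraction_needed < break_even:
        return "single partition reads"
    return "full scan and filter"


print(f"break-even fraction: {break_even:.1%}")
for days in (7, 30, 90):
    print(f"last {days:>2} days of 365: {cheaper_strategy(days / 365)}")
```

This only compares bytes read; as noted above, the real threshold moves with partition size.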

We deployed; we are on the
beach drinking a Mojito

Evaluating a data model
We need this done faster. For simplicity, let's add static min/max columns to each
partition that cache the info. Does this help?
CREATE TABLE readings (
    sensor_id int,
    date date,
    time time,
    temperature float,
    temp_min float static,
    temp_max float static,
    PRIMARY KEY ((sensor_id, date), time)
);

■ Do range scans and use CQL PER PARTITION LIMIT (new in Scylla 3.1)
SELECT sensor_id,
       date,
       temp_min,
       temp_max
FROM readings WHERE token(sensor_id, date) > X AND
     token(sensor_id, date) < Y PER PARTITION LIMIT 1;

Results
            | Range Scan | Range Scan pre-computed | Gain
------------+------------+-------------------------+--------
Index I/O   | 3318       | 2874                    | X 1.15
Index Bytes | ~6.9 MB    | ~5.9 MB                 | X 1.15
Data I/O    | 10738      | 3520                    | X 3.05
Data Bytes  | ~1.3 GB    | ~430 MB                 | X 3.15

Part 2: Scylla 3.1 / 3.2 Additional
CQL Features

CQL BYPASS CACHE
■ Scylla uses read-through caching: if the data being read is not in the
cache, it is added
■ CQL BYPASS CACHE overrides that for a specific query: the read does not
go through the cache and does not populate it
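The difference between a normal read and a cache-bypassing read can be shown with a toy model. This is purely an illustration of the behavior described above, not Scylla's implementation; the class, the dict standing in for disk, and the `bypass_cache` flag are all made up for the example.

```python
class ReadThroughCache:
    """Toy read-through cache in front of a dict that stands in for disk."""

    def __init__(self, disk):
        self.disk = disk
        self.cache = {}
        self.disk_reads = 0

    def read(self, key, bypass_cache=False):
        # Normal path: serve from the cache when possible.
        if not bypass_cache and key in self.cache:
            return self.cache[key]
        self.disk_reads += 1
        value = self.disk[key]
        if not bypass_cache:
            self.cache[key] = value  # read-through: populate on a miss
        return value


store = ReadThroughCache({"sensor-1": 21.5})
store.read("sensor-1", bypass_cache=True)  # goes to disk, cache stays empty
print(store.cache, store.disk_reads)
store.read("sensor-1")                     # miss: goes to disk and populates
store.read("sensor-1")                     # hit: no disk read
print(store.cache, store.disk_reads)
```

A large background scan that bypasses the cache in this way does not evict the entries that latency-sensitive queries rely on.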

CQL PER PARTITION LIMIT
Limits the number of rows that are returned for each partition
cqlsh:ks> select * from samples;
 pk | ck | val
----+----+-----
 10 |  1 |   1
 10 |  2 |   2
 11 |  1 |   3
 11 |  2 |   4
cqlsh:ks> select * from samples PER PARTITION LIMIT 1;
 pk | ck | val
----+----+-----
 10 |  1 |   1
 11 |  1 |   3
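The semantics of the example above can be mimicked client-side on the same four rows. This is only an emulation to show what the clause returns, not how the server implements it; `per_partition_limit` is a hypothetical helper.

```python
from itertools import groupby, islice

ROWS = [(10, 1, 1), (10, 2, 2), (11, 1, 3), (11, 2, 4)]  # (pk, ck, val)


def per_partition_limit(rows, limit):
    """Keep at most `limit` rows per pk; rows must arrive grouped by pk."""
    result = []
    for _, partition in groupby(rows, key=lambda row: row[0]):
        result.extend(islice(partition, limit))
    return result


print(per_partition_limit(ROWS, 1))  # [(10, 1, 1), (11, 1, 3)]
```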

CQL GROUP BY
■ The GROUP BY option condenses into a single row all
selected rows that share the same values for a set of columns (limited
to the partition key columns, optionally followed by clustering key columns)
■ Aggregate functions produce a separate value for each group.
 pk | ck | val
----+----+-----
 10 |  1 |   1
 10 |  2 |   2
 11 |  1 |   3
 11 |  2 |   4
cqlsh:ks> select pk, min(val), max(val) from samples GROUP BY pk;
 pk | system.min(val) | system.max(val)
----+-----------------+-----------------
 10 |               1 |               2
 11 |               3 |               4
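The grouping in the example above is equivalent to the following client-side computation on the same rows. It is an emulation for illustration only; `min_max_by_pk` is a hypothetical name.

```python
from itertools import groupby

ROWS = [(10, 1, 1), (10, 2, 2), (11, 1, 3), (11, 2, 4)]  # (pk, ck, val)


def min_max_by_pk(rows):
    """Return one (pk, min(val), max(val)) tuple per partition key."""
    result = []
    for pk, partition in groupby(rows, key=lambda row: row[0]):
        values = [row[2] for row in partition]
        result.append((pk, min(values), max(values)))
    return result


print(min_max_by_pk(ROWS))  # [(10, 1, 2), (11, 3, 4)]
```

Doing this on the server means only one row per group crosses the network instead of every reading.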

CQL LIKE
■ Filtering using LIKE syntax
■ No need for indexing
 pk | ck | val
----+----+-----
 10 |  1 |   1
 10 |  2 |   2
 11 |  1 |   3
 11 |  2 |   4
cqlsh:ks> select * from samples where pk like '%0' ALLOW FILTERING;
 pk | ck | val
----+----+-----
 10 |  1 |   1
 10 |  2 |   2
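The pattern match in the example above can be emulated with a small translation to a regular expression. This sketch assumes SQL-style wildcards (`%` for any sequence, `_` for a single character), ignores escape sequences, and treats pk as text; `like` is a hypothetical helper, not a driver API.

```python
import re

ROWS = [("10", 1, 1), ("10", 2, 2), ("11", 1, 3), ("11", 2, 4)]  # pk as text


def like(value: str, pattern: str) -> bool:
    """Match value against a LIKE-style pattern (escapes not handled)."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")  # any sequence of characters
        elif ch == "_":
            parts.append(".")   # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.fullmatch("".join(parts), value, flags=re.DOTALL) is not None


print([row for row in ROWS if like(row[0], "%0")])
```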

● Disk Access:
○ Is another way to evaluate data models
○ It's especially useful for analytics / background batch processing jobs, since
those access data from disk
● Scylla 3.1 includes:
○ CQL:
■ BYPASS CACHE
■ PER PARTITION LIMIT
● Upcoming Scylla 3.2 will include:
○ Tracing with Disk Access
○ CQL:
■ GROUP BY
■ LIKE
■ Non-frozen UDTs (not covered)
● Optimized(*) full scans reduce the overall amount of disk access
when compared to aggregated single-partition scans

Thank you Stay in touch
Any questions?
Shlomi Livne
shlomi@scylladb.com
@shlomilivne
