Writing Applications for Scylla
Shlomi Livne, VP R&D
Presenter
Shlomi Livne, VP of R&D
Shlomi is VP of R&D at ScyllaDB. Prior to ScyllaDB he led the
research and development team at Convergin, which was
acquired by Oracle.
Part 1: Enhancement for Application Development
The basics
Development Cycle NoSQL Databases
■ Think about the queries you are going to run
■ Create a Data Model
■ Use cassandra-stress (or other) to validate (*)
■ Develop
○ CQL Optimization
■ Scale test
■ Deploy
■ Use Disk Access
What Disk Access tracks
■ Disk Access looks at:
● Number of I/O operations
● Total number of bytes read
■ When sstables are read from disk, two components are involved (everything else is in memory):
● Data - stores the actual data
● Index - provides a lookup into the data file “blocks” that contain the partition (if the partition is large, the index also contains a promoted index)
Disk Access - Why
● The memory : disk ratio keeps shifting toward disk:
○ EC2 i3 family memory : disk ratio is 1:30
○ EC2 i3en family memory : disk ratio is 1:78
○ More queries will be served from disk
● There are workloads that you will always prefer to run from disk (background analytics)
Sample App Billy(+)
An IoT application(+)
■ 1,000,000 sensors, representing homes in an area
■ 1 reading per minute
■ 365 days (1 year storage requirement)
■ Total amount of data points: 526 billion temperature readings
Analytics over the entire data?
How long would it take at normal speeds?
■ 200,000 points/second: 730 hours (30 days)
■ 1 million points/second: 146 hours (almost a week)
We need more if analytics are a part of the pipeline
■ That means we need Scylla
■ We need a good application
■ And we need hardware
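The totals above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming one reading per minute (1,440 per sensor per day), which is the rate the 526 billion total implies; the function and constant names are illustrative and not part of the talk.

```python
SENSORS = 1_000_000
DAYS = 365
READINGS_PER_DAY = 24 * 60  # assumption: one reading per minute per sensor


def total_points(sensors=SENSORS, days=DAYS, per_day=READINGS_PER_DAY):
    """Total number of stored temperature readings."""
    return sensors * days * per_day


def scan_hours(points, points_per_second):
    """Hours needed to read `points` at a sustained scan rate."""
    return points / points_per_second / 3600


if __name__ == "__main__":
    points = total_points()
    print(f"total points: {points:,}")
    for rate in (200_000, 1_000_000):
        print(f"{rate:>9,} points/s: {scan_hours(points, rate):.0f} hours")
```

At 200,000 points/second the full scan takes about 730 hours, and at 1 million points/second about 146 hours, matching the figures on the slide.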
Why climb Mount Everest?
Because it’s there.
George Leigh Mallory
What kind of performance are we after?
Data Model
CREATE TABLE readings (
    sensor_id int,
    date date,
    time time,
    temperature float,
    PRIMARY KEY ((sensor_id, date), time)
);
What kind of queries can we reasonably support?
■ SELECT * FROM readings WHERE sensor_id = ? AND date = ?;
■ SELECT * FROM readings WHERE sensor_id = ? AND date = ? AND time > ?;
Analytics Application Option 1
■ Let the server do as much work as possible
SELECT sensor_id,
       date,
       min(temperature) AS minTemperature,
       max(temperature) AS maxTemperature
FROM readings WHERE sensor_id = ? AND date = ?;
[Figure: Application. A coordinator sets the time frame; worker (loader) machines query the ScyllaDB cluster to compute the average, min and max of all sensors. (Example) Total amount of data to scan: 1.44 billion points/day]
Disk Access Analysis Option 1 (in theory)
● For simplification, let’s assume:
○ Every partition:
■ is fully stored in a single sstable
■ fits exactly in a single data block
○ Bloom filters produce no false positives
● Analysis
 Number of partitions | 365 * 10^6 = 365 Million
 I/O for index        | 365 Million
 I/O for data         | 365 Million
Analytics Application Option 2
■ Do range scans and use CQL GROUP BY (new in 3.2)
SELECT sensor_id,
       date,
       min(temperature) AS minTemperature,
       max(temperature) AS maxTemperature
FROM readings WHERE token(sensor_id, date) > X AND
      token(sensor_id, date) < Y GROUP BY sensor_id, date;
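To run such a scan, the application has to pick the X and Y bounds for every request. The sketch below is one possible way to do that, not the method used in the talk: it assumes the Murmur3 partitioner's signed 64-bit token space and cuts it into equal-width subranges instead of following the actual vnode boundaries. It also uses a closed upper bound (`<=`) so that adjacent subranges neither overlap nor skip a token. `split_token_ring` and `SCAN_QUERY` are hypothetical names.

```python
# Assumption: Murmur3 partitioner, tokens are signed 64-bit integers.
MIN_TOKEN = -(2**63)
MAX_TOKEN = 2**63 - 1

# Query template for one subrange; bind (start, end) from split_token_ring().
SCAN_QUERY = (
    "SELECT sensor_id, date, "
    "min(temperature) AS minTemperature, "
    "max(temperature) AS maxTemperature "
    "FROM readings "
    "WHERE token(sensor_id, date) > ? AND token(sensor_id, date) <= ? "
    "GROUP BY sensor_id, date"
)


def split_token_ring(n):
    """Return n contiguous (start, end] token subranges covering the ring.

    The very first token (MIN_TOKEN) is excluded by the open lower bound;
    this sketch assumes no partition is stored at that token.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    span = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + span * i // n for i in range(n + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


if __name__ == "__main__":
    for start, end in split_token_ring(4):
        print(start, end)
```

Each worker can then take a slice of the returned list and execute the template once per subrange, paging through the results.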
Disk Access Analysis Option 2 (in theory)
● For simplification, let’s assume:
○ Application breaks requests by vnode token ranges
○ Every partition:
■ is fully stored in a single sstable
■ fits exactly in a single data block (and is the only one there)
○ vnode token ranges do not share data blocks
○ Bloom filters produce no false positives
● Analysis
 Number of scans | Number of vnode token ranges
 I/O for index   | Number of vnode token ranges * Number of shards
 I/O for data    | Number of data blocks
Disk Access Comparison
               | Option 1: Single Partition | Option 2: Range Scans
---------------+----------------------------+-------------------------------------------------
 Number of ops | Number of partitions       | Number of scans = Number of vnode token ranges
               | 365 * 10^6 = 365 Million   | 83 * 256 = 21248
 I/O for index | 365 Million                | Number of vnode token ranges * Number of shards
               |                            | 83 * 256 * 54 = 1147392
 I/O for data  | 365 Million                | 365 Million
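The comparison can be turned into a small calculator to try other cluster shapes. This is a sketch of the simplified model only (one sstable and one data block per partition, perfect Bloom filters). The slide does not label its factors, so reading 83 as the node count, 256 as vnodes per node and 54 as the shard count is an assumption, and `Cluster`, `single_partition_io` and `range_scan_io` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    nodes: int            # assumption: the "83" on the slide
    vnodes_per_node: int  # assumption: the "256" on the slide
    shards: int           # assumption: the "54" on the slide

    @property
    def token_ranges(self):
        return self.nodes * self.vnodes_per_node


def single_partition_io(partitions):
    """Option 1: one query, one index read and one data read per partition."""
    return {"ops": partitions, "index_io": partitions, "data_io": partitions}


def range_scan_io(partitions, cluster):
    """Option 2: one scan per vnode token range, one index read per range
    per shard, and still one data block per partition."""
    return {
        "ops": cluster.token_ranges,
        "index_io": cluster.token_ranges * cluster.shards,
        "data_io": partitions,
    }


if __name__ == "__main__":
    partitions = 365 * 10**6
    print(single_partition_io(partitions))
    print(range_scan_io(partitions, Cluster(83, 256, 54)))
```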
Billy using Full Scan (theoretical) gain
1. The number of index I/O operations on the cluster drops from 365 Million to ~1.2 Million
● In reality SSTable Bloom filters are not perfect, so single-partition reads will be attempted on sstables that don’t contain the partition (an even bigger win for scans)
2. The number of CQL operations on the cluster drops from 365 Million to ~21K
● Returning one result per partition takes 365M / 5000 (page size) = 73K pages in the optimal case, so we will need more than 21K requests
3. In reality partitions do share data blocks; they are not perfectly aligned
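The paging remark can be checked with a short calculation. The first function reproduces the slide's lower bound; the second is an extension that is not in the talk and assumes partitions are spread evenly across token ranges, with every range paged separately. Both function names are hypothetical.

```python
import math


def min_pages(partitions, page_size=5000):
    """Lower bound on result pages when one row is returned per partition."""
    return math.ceil(partitions / page_size)


def pages_with_even_ranges(partitions, token_ranges, page_size=5000):
    """Pages needed if every token range holds the same number of
    partitions and is paged on its own (assumption: perfectly even spread)."""
    per_range = partitions / token_ranges
    return token_ranges * math.ceil(per_range / page_size)


if __name__ == "__main__":
    print(min_pages(365_000_000))                      # optimal case
    print(pages_with_even_ranges(365_000_000, 21248))  # per-range paging
```

Under the even-spread assumption each range holds about 17,000 partitions and so needs four pages, which puts the request count above the optimal-case page count.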
Putting Disk Access into practice
● Queries
● Data model
● Some test data (at small scale)
● Docker
● Scylla-Nightly (Pre Scylla 3.2)
○ Tracing including disk access
‘... mc-132-big-Index.db: finished bulk DMA read of size 538 at offset 0, successfully read 4096 bytes [shard 0]’
● A simple script that parses system_traces.events after running a traced query
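The parsing script itself is not shown in the talk. A minimal sketch of what it might look like follows, assuming the trace messages have exactly the shape of the sample line above and that the `activity` strings have already been fetched from the events table with cqlsh or a driver. The regular expression and the choice to count the "successfully read" byte figure (rather than the requested size) are assumptions.

```python
import re
from collections import defaultdict

# Pattern derived from the single sample message; other Scylla versions
# may word the trace differently.
READ_RE = re.compile(
    r"\S+-(?P<component>Index|Data)\.db: finished bulk DMA read "
    r"of size (?P<size>\d+) at offset (?P<offset>\d+), "
    r"successfully read (?P<read>\d+) bytes"
)


def summarize(activities):
    """Tally I/O count and bytes read per sstable component (Index/Data)."""
    totals = defaultdict(lambda: {"io": 0, "bytes": 0})
    for activity in activities:
        match = READ_RE.search(activity)
        if match is None:
            continue  # not a disk read event
        entry = totals[match.group("component")]
        entry["io"] += 1
        entry["bytes"] += int(match.group("read"))
    return dict(totals)


if __name__ == "__main__":
    sample = (
        "... mc-132-big-Index.db: finished bulk DMA read of size 538 "
        "at offset 0, successfully read 4096 bytes [shard 0]"
    )
    print(summarize([sample, "Parsing a statement"]))
```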
Billy on small scale
■ 1000 sensors, 100 dates, 1 sample per minute
■ 1 M partitions, 1440 M rows
■ # shards 4
Results
             | Single Partition | Range Scan | Gain
-------------+------------------+------------+-------
 Index I/O   | ~1.3 M           | 3318       | X 392
 Index Bytes | ~2.8 GB          | ~6.9 M     | X 424
 Data I/O    | ~1 M             | 10738      | X 93
 Data Bytes  | ~14.4 GB         | ~1.3 GB    | X 11
Billy using Full Scan gain is even bigger
1. Read-ahead for the full scans makes better use of the disk
● Single Partition Avg Data Bytes per I/O: 14748600348 ÷ 1024089 = ~14.4K
● Range Scan Avg Data Bytes per I/O: 1355390291 ÷ 10738 = ~126K
2. AIO reads are sent to the disk aligned to Index/Data placement, yet disks do block-size reads:
● Doing 2 reads for two halves of a disk block will result in reading the block twice and returning part of it each time.
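The two averages come from dividing the data bytes read by the number of data I/O operations. A short sketch using the raw counts quoted above (the function name is illustrative):

```python
def avg_bytes_per_io(total_bytes, io_count):
    """Average number of bytes transferred per disk read."""
    return total_bytes / io_count


single_partition = avg_bytes_per_io(14_748_600_348, 1_024_089)
range_scan = avg_bytes_per_io(1_355_390_291, 10_738)

if __name__ == "__main__":
    print(f"single partition: {single_partition:,.0f} bytes per read")
    print(f"range scan:       {range_scan:,.0f} bytes per read")
    print(f"ratio:            {range_scan / single_partition:.1f}x")
```

Each range-scan read moves several times more data than a single-partition read, which is the read-ahead effect the slide describes.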
Should Range Scans always be used for analytics?
■ No
■ Not if Number of Partitions < Number of Token Ranges * Number of Shards: in that case the scans issue more index reads than single-partition queries would
■ What if we are doing a partial scan - what should we do?
a. Example: What was the max & min temperature over the last 7/30/90 days?
Billy+: Partial Scan
SELECT sensor_id,
       date,
       min(temperature) AS minTemperature,
       max(temperature) AS maxTemperature
FROM readings WHERE token(sensor_id, date) > X AND
      token(sensor_id, date) < Y AND date >= Z
GROUP BY sensor_id, date ALLOW FILTERING;
Billy+: Partial Scan
● Going back to the simplifications, ~7% seems to be a good mark:
○ Partial scan of < 7% of the data: use single partitions
○ Partial scan of > 7% of the data: use full scan and filter
● General case: it depends on how big the partitions are
○ Larger partitions carry a higher penalty when they are read unnecessarily
             | Single Partition | Range Scan | Range Scan / Single Partition
-------------+------------------+------------+-------------------------------
 Total I/O   | ~2.3 M           | 14056      | 0.6%
 Total Bytes | ~17.7 GB         | ~1.3 GB    | 7.7%
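The rule of thumb can be written as a break-even calculation. This sketch assumes the cost of single-partition reads grows linearly with the fraction of partitions touched, while a filtered full scan always pays the full scan cost; the function names are hypothetical. With the rounded totals from the table the bytes-based break-even comes out near 7.3%, slightly below the 7.7% quoted, which presumably uses unrounded byte counts.

```python
def break_even_fraction(single_partition_cost, range_scan_cost):
    """Fraction of the data at which reading it partition by partition
    costs as much as one full range scan (assumes linear scaling)."""
    return range_scan_cost / single_partition_cost


def choose_strategy(fraction_needed, break_even):
    """Pick the cheaper access pattern for a partial scan."""
    if fraction_needed < break_even:
        return "single-partition reads"
    return "full scan and filter"


if __name__ == "__main__":
    by_io = break_even_fraction(2.3e6, 14056)
    by_bytes = break_even_fraction(17.7e9, 1.3e9)
    print(f"break-even by I/O count: {by_io:.1%}")
    print(f"break-even by bytes:     {by_bytes:.1%}")
    for days in (7, 30, 90):
        print(days, "days:", choose_strategy(days / 365, 0.07))
```

With a 7% threshold, a 7-day query out of a year of data falls on the single-partition side, while 30-day and 90-day queries are better served by a filtered scan.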
We deployed; we are on the beach, drinking a Mojito
Writing Applications for Scylla
Evaluating a data model
We need this done faster. For simplicity, let’s add a static min/max to each partition that will cache the info. Does this help?
CREATE TABLE readings (
    sensor_id int,
    date date,
    time time,
    temperature float,
    temp_min float static,
    temp_max float static,
    PRIMARY KEY ((sensor_id, date), time)
);
■ Do range scans and use CQL PER PARTITION LIMIT (new in 3.1)
SELECT sensor_id,
       date,
       temp_min,
       temp_max
FROM readings WHERE token(sensor_id, date) > X AND
      token(sensor_id, date) < Y PER PARTITION LIMIT 1;
Results
             | Range Scan | Range Scan pre-computed | Gain
-------------+------------+-------------------------+--------
 Index I/O   | 3318       | 2874                    | X 1.15
 Index Bytes | ~6.9 M     | ~5.9 M                  | X 1.15
 Data I/O    | 10738      | 3520                    | X 3.05
 Data Bytes  | ~1.3 GB    | ~430 M                  | X 3.15
Part 2: Scylla 3.1 / 3.2 Additional CQL Features
CQL BYPASS CACHE
■ Scylla uses read-through caching: if the data being read is not in the cache, it will be added
■ CQL BYPASS CACHE overrides that for a specific query: the read does not go through the cache and does not populate it
CQL PER PARTITION LIMIT
Limits the number of rows that are returned for each partition
cqlsh:ks> select * from samples ;
pk | ck | val
----+----+-----
10 | 1 | 1
10 | 2 | 2
11 | 1 | 3
11 | 2 | 4
cqlsh:ks> select * from samples PER PARTITION LIMIT 1;
pk | ck | val
----+----+-----
10 | 1 | 1
11 | 1 | 3
CQL GROUP BY
■ The GROUP BY option condenses into a single row all selected rows that share the same values for a set of columns (limited to the partition key plus, optionally, clustering keys)
■ Aggregate functions will produce a separate value for each group.
cqlsh:ks> select * from samples ;
pk | ck | val
----+----+-----
10 | 1 | 1
10 | 2 | 2
11 | 1 | 3
11 | 2 | 4
cqlsh:ks> select pk, min(val), max(val) from samples GROUP BY pk;
pk | system.min(val) | system.max(val)
----+-----------------+-----------------
10 | 1 | 2
11 | 3 | 4
CQL LIKE
■ Filtering using LIKE syntax
■ No need for indexing
cqlsh:ks> select * from samples ;
pk | ck | val
----+----+-----
10 | 1 | 1
10 | 2 | 2
11 | 1 | 3
11 | 2 | 4
cqlsh:ks> select * from samples where pk like '%0' ALLOW FILTERING;
pk | ck | val
----+----+-----
10 | 1 | 1
10 | 2 | 2
To Summarize
● Disk Access:
○ Is another way to evaluate data models
○ It’s especially useful for analytics / background batch processing jobs, since those will access data from disk
● Scylla 3.1 includes
○ CQL:
■ BYPASS CACHE
■ PER PARTITION LIMIT
● Upcoming Scylla 3.2 will include:
○ Tracing with Disk Access
○ CQL:
■ GROUP BY
■ LIKE
■ Non-frozen UDTs (not covered)
● Optimized(*) full scans reduce the overall amount of disk access when compared to aggregated single partition scans
Thank you Stay in touch
Any questions?
Shlomi Livne
shlomi@scylladb.com
@shlomilivne

More Related Content

PDF
Lookout on Scaling Security to 100 Million Devices
PDF
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
PPTX
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at Night
PPTX
How to be Successful with Scylla
PDF
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
PPTX
Zeotap: Moving to ScyllaDB - A Graph of Billions Scale
PPTX
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...
PDF
Scylla Summit 2022: Operating at Monstrous Scales: Benchmarking Petabyte Work...
Lookout on Scaling Security to 100 Million Devices
Scylla Summit 2022: How to Migrate a Counter Table for 68 Billion Records
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at Night
How to be Successful with Scylla
Scylla Summit 2016: Analytics Show Time - Spark and Presto Powered by Scylla
Zeotap: Moving to ScyllaDB - A Graph of Billions Scale
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...
Scylla Summit 2022: Operating at Monstrous Scales: Benchmarking Petabyte Work...

What's hot (20)

PDF
A glimpse of cassandra 4.0 features netflix
PPTX
Sizing Your Scylla Cluster
PDF
Safer restarts, faster streaming, and better repair, just a glimpse of cassan...
PPTX
Scylla Summit 2019 Keynote - Dor Laor - Beyond Cassandra
PPTX
FireEye & Scylla: Intel Threat Analysis Using a Graph Database
PPTX
iFood on Delivering 100 Million Events a Month to Restaurants with Scylla
PPTX
Maintaining Consistency Across Data Centers (Randy Fradin, BlackRock) | Cassa...
PPTX
Seastar Summit 2019 Keynote
PDF
ScyllaDB @ Apache BigData, may 2016
PPTX
How Workload Prioritization Reduces Your Datacenter Footprint
PDF
Scylla Summit 2016: Compose on Containing the Database
PDF
Seastar / ScyllaDB, or how we implemented a 10-times faster Cassandra
PDF
Looking towards an official cassandra sidecar netflix
PPTX
Scylla Summit 2018: Consensus in Eventually Consistent Databases
PDF
Building and running cloud native cassandra
PDF
Comparing Apache Cassandra 4.0, 3.0, and ScyllaDB
PDF
Cassandra Exports as a Trivially Parallelizable Problem (Emilio Del Tessandor...
PDF
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
PPTX
S3 cassandra or outer space? dumping time series data using spark
PDF
ScyllaDB: NoSQL at Ludicrous Speed
A glimpse of cassandra 4.0 features netflix
Sizing Your Scylla Cluster
Safer restarts, faster streaming, and better repair, just a glimpse of cassan...
Scylla Summit 2019 Keynote - Dor Laor - Beyond Cassandra
FireEye & Scylla: Intel Threat Analysis Using a Graph Database
iFood on Delivering 100 Million Events a Month to Restaurants with Scylla
Maintaining Consistency Across Data Centers (Randy Fradin, BlackRock) | Cassa...
Seastar Summit 2019 Keynote
ScyllaDB @ Apache BigData, may 2016
How Workload Prioritization Reduces Your Datacenter Footprint
Scylla Summit 2016: Compose on Containing the Database
Seastar / ScyllaDB, or how we implemented a 10-times faster Cassandra
Looking towards an official cassandra sidecar netflix
Scylla Summit 2018: Consensus in Eventually Consistent Databases
Building and running cloud native cassandra
Comparing Apache Cassandra 4.0, 3.0, and ScyllaDB
Cassandra Exports as a Trivially Parallelizable Problem (Emilio Del Tessandor...
Scylla Summit 2022: Building Zeotap's Privacy Compliant Customer Data Platfor...
S3 cassandra or outer space? dumping time series data using spark
ScyllaDB: NoSQL at Ludicrous Speed
Ad

Similar to Writing Applications for Scylla (20)

PDF
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
PPT
BWC Supercomputing 2008 Presentation
PDF
Security sizing meetup
PDF
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
PDF
Data Science in the Cloud @StitchFix
PDF
MongoDB World 2019: Finding the Right MongoDB Atlas Cluster Size: Does This I...
PDF
Using BigBench to compare Hive and Spark (short version)
PDF
Apache con 2020 use cases and optimizations of iotdb
PDF
Approximate "Now" is Better Than Accurate "Later"
PPTX
Sizing MongoDB Clusters
PPTX
Our journey with druid - from initial research to full production scale
PDF
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
PPTX
MongoDB for Time Series Data: Sharding
PPTX
Lessons learned from designing a QA Automation for analytics databases (big d...
PDF
How to Develop and Operate Cloud First Data Platforms
PPTX
splunkquickstartsplunkquickstartsplunkquickstart
DOC
Sqlmaterial 120414024230-phpapp01
PPTX
Lessons learned from designing QA automation event streaming platform(IoT big...
PPTX
Adventures in RDS Load Testing
PDF
How to Develop and Operate Cloud Native Data Platforms and Applications
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018
BWC Supercomputing 2008 Presentation
Security sizing meetup
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
Data Science in the Cloud @StitchFix
MongoDB World 2019: Finding the Right MongoDB Atlas Cluster Size: Does This I...
Using BigBench to compare Hive and Spark (short version)
Apache con 2020 use cases and optimizations of iotdb
Approximate "Now" is Better Than Accurate "Later"
Sizing MongoDB Clusters
Our journey with druid - from initial research to full production scale
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
MongoDB for Time Series Data: Sharding
Lessons learned from designing a QA Automation for analytics databases (big d...
How to Develop and Operate Cloud First Data Platforms
splunkquickstartsplunkquickstartsplunkquickstart
Sqlmaterial 120414024230-phpapp01
Lessons learned from designing QA automation event streaming platform(IoT big...
Adventures in RDS Load Testing
How to Develop and Operate Cloud Native Data Platforms and Applications
Ad

More from ScyllaDB (20)

PDF
Understanding The True Cost of DynamoDB Webinar
PDF
Database Benchmarking for Performance Masterclass: Session 2 - Data Modeling ...
PDF
Database Benchmarking for Performance Masterclass: Session 1 - Benchmarking F...
PDF
New Ways to Reduce Database Costs with ScyllaDB
PDF
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
PDF
Powering a Billion Dreams: Scaling Meesho’s E-commerce Revolution with Scylla...
PDF
Leading a High-Stakes Database Migration
PDF
Achieving Extreme Scale with ScyllaDB: Tips & Tradeoffs
PDF
Securely Serving Millions of Boot Artifacts a Day by João Pedro Lima & Matt ...
PDF
How Agoda Scaled 50x Throughput with ScyllaDB by Worakarn Isaratham
PDF
How Yieldmo Cut Database Costs and Cloud Dependencies Fast by Todd Coleman
PDF
ScyllaDB: 10 Years and Beyond by Dor Laor
PDF
Reduce Your Cloud Spend with ScyllaDB by Tzach Livyatan
PDF
Migrating 50TB Data From a Home-Grown Database to ScyllaDB, Fast by Terence Liu
PDF
Vector Search with ScyllaDB by Szymon Wasik
PDF
Workload Prioritization: How to Balance Multiple Workloads in a Cluster by Fe...
PDF
Two Leading Approaches to Data Virtualization, and Which Scales Better? by Da...
PDF
Scaling a Beast: Lessons from 400x Growth in a High-Stakes Financial System b...
PDF
Object Storage in ScyllaDB by Ran Regev, ScyllaDB
PDF
Lessons Learned from Building a Serverless Notifications System by Srushith R...
Understanding The True Cost of DynamoDB Webinar
Database Benchmarking for Performance Masterclass: Session 2 - Data Modeling ...
Database Benchmarking for Performance Masterclass: Session 1 - Benchmarking F...
New Ways to Reduce Database Costs with ScyllaDB
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Powering a Billion Dreams: Scaling Meesho’s E-commerce Revolution with Scylla...
Leading a High-Stakes Database Migration
Achieving Extreme Scale with ScyllaDB: Tips & Tradeoffs
Securely Serving Millions of Boot Artifacts a Day by João Pedro Lima & Matt ...
How Agoda Scaled 50x Throughput with ScyllaDB by Worakarn Isaratham
How Yieldmo Cut Database Costs and Cloud Dependencies Fast by Todd Coleman
ScyllaDB: 10 Years and Beyond by Dor Laor
Reduce Your Cloud Spend with ScyllaDB by Tzach Livyatan
Migrating 50TB Data From a Home-Grown Database to ScyllaDB, Fast by Terence Liu
Vector Search with ScyllaDB by Szymon Wasik
Workload Prioritization: How to Balance Multiple Workloads in a Cluster by Fe...
Two Leading Approaches to Data Virtualization, and Which Scales Better? by Da...
Scaling a Beast: Lessons from 400x Growth in a High-Stakes Financial System b...
Object Storage in ScyllaDB by Ran Regev, ScyllaDB
Lessons Learned from Building a Serverless Notifications System by Srushith R...

Recently uploaded (20)

PDF
Electronic commerce courselecture one. Pdf
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PDF
Encapsulation_ Review paper, used for researhc scholars
PPTX
Spectroscopy.pptx food analysis technology
PDF
MIND Revenue Release Quarter 2 2025 Press Release
PPTX
Big Data Technologies - Introduction.pptx
PPTX
Understanding_Digital_Forensics_Presentation.pptx
PPT
“AI and Expert System Decision Support & Business Intelligence Systems”
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
PDF
Unlocking AI with Model Context Protocol (MCP)
PDF
NewMind AI Weekly Chronicles - August'25 Week I
PDF
cuic standard and advanced reporting.pdf
PDF
KodekX | Application Modernization Development
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
DOCX
The AUB Centre for AI in Media Proposal.docx
PDF
Spectral efficient network and resource selection model in 5G networks
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
Electronic commerce courselecture one. Pdf
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
Encapsulation_ Review paper, used for researhc scholars
Spectroscopy.pptx food analysis technology
MIND Revenue Release Quarter 2 2025 Press Release
Big Data Technologies - Introduction.pptx
Understanding_Digital_Forensics_Presentation.pptx
“AI and Expert System Decision Support & Business Intelligence Systems”
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
Unlocking AI with Model Context Protocol (MCP)
NewMind AI Weekly Chronicles - August'25 Week I
cuic standard and advanced reporting.pdf
KodekX | Application Modernization Development
Per capita expenditure prediction using model stacking based on satellite ima...
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
The AUB Centre for AI in Media Proposal.docx
Spectral efficient network and resource selection model in 5G networks
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy

Writing Applications for Scylla

  • 2. Presenter Shlomi Livne, VP of R&D Shlomi is VP of R&D at ScyllaDB. Prior to ScyllaDB he led the research and development team at Convergin, which was acquired by Oracle.
  • 3. Part 1: Enhancement for Application Development
  • 5. Development Cycle NoSQL Databases Think about the queries you are going to run
  • 6. Development Cycle NoSQL Databases Think about the queries you are going to run Create a Data Model
  • 7. Development Cycle NoSQL Databases Think about the queries you are going to run Create a Data Model Use cassandra-stress (or other) to validate (*) Develop
  • 8. Development Cycle NoSQL Databases Think about the queries you are going to run Create a Data Model Use cassandra-stress (or other) to validate (*) Develop
  • 9. Development Cycle NoSQL Databases Think about the queries you are going to run Create a Data Model Use cassandra-stress (or other) to validate (*) Develop
  • 10. Development Cycle NoSQL Databases Think about the queries you are going to run Create a Data Model Use cassandra-stress (or other) to validate (*) Develop CQL Optimization
  • 11. Development Cycle NoSQL Databases Think about the queries you are going to run Create a Data Model Use cassandra-stress (or other) to validate (*) Develop Scale test
  • 12. Development Cycle NoSQL Databases Think about the queries you are going to run Create a Data Model Use cassandra-stress (or other) to validate (*) Develop Scale test
  • 13. Development Cycle NoSQL Databases Think about the queries you are going to run Create a Data Model Use cassandra-stress (or other) to validate (*) Develop Scale test Deploy
  • 14. Development Cycle NoSQL Databases Think about the queries you are going to run Create a Data Model Use cassandra-stress (or other) to validate (*) Develop Scale test Deploy
  • 16. Development Cycle NoSQL Databases Think about the queries you are going to run Create a Data Model Use cassandra-stress (or other) to validate (*) Develop Scale test Deploy Disk Access
  • 17. What will Disk Access track ■ Disk Access looks at: ● Amount of I/O operations ● Overall amount of read bytes ■ When sstables are read from disk there are two related components (everything else is in memory): ● Data - stores the actual data ● Index - provides lookup into the data file “blocks” that contain the partition (if the partition is large - it contains promoted index)
  • 18. Disk Access - Why ● The ratio memory : disk is increasing: ○ EC2 i3 family memory : disk ratio for is: 1:30 ○ EC2 i3en family memory : disk ratio for is 1:78 ○ More queries will be served from disk
  • 19. Disk Access - Why ● The ratio memory : disk is increasing: ○ EC2 i3 family memory : disk ratio for is: 1:30 ○ EC2 i3en family memory : disk ratio for is 1:78 ○ More queries will be served from disk ● There are workloads that you will always prefer running from disk (background analytics)
  • 21. An IoT application(+) Total amount of data points 526 billion temperature readings 1,000,000 sensors, representing homes in an area 365 days (1 year storage requirement) 1 reading per second
  • 22. Analytics over the entire data? How long would it take at normal speeds? We need more if analytics are a part of the pipeline That means we need Scylla We need a good application And we need hardware 200,000 points/second 730 hours (30 days) 1 million points/second 146 hours (almost a week)
  • 23. Why climb Mount Everest? Because it’s there. George Leigh Mallory What kind of performance are we after?
  • 24. Data Model CREATE TABLE readings ( sensor_id int, date date, time time, temperature float, PRIMARY KEY ((sensor_id, date), time)) What kind of queries can we reasonably support? ■ SELECT * from readings where sensor_id = ? and date = ?; ■ SELECT * from readings where sensor_id = ? and date = ? and time > ?;
  • 25. Analytics Application Option 1 ■ Let the server do as much work as possible SELECT sensor_id, date, min(temperature) as minTemperature, max(temperature) as maxTemperature FROM readings where sensor_id = ? and date = ?`
  • 26. Application (Example) Total amount of data to scan: 1.44 billion points/day Coordinator Worker (loader machine) ScyllaDB cluster Worker (loader machine) Worker (loader machine) Set time frame, compute average, min, max of all sensors
  • 27. Disk Access Analysis Option 1 (in theory) ● For simplification lets assume ○ Every partition: ■ is fully stored in a single sstable ■ is exactly placed in a single data block ○ Bloom filters do not provide false positives ● Analysis Number of partitions 365 * 10^6 = 365 Million I/O for index 365 Million I/O for data 365 Million
  • 28. Analytics Application Option 2 ■ Do range scan’s and use CQL GROUP BY (new in 3.2) SELECT sensor_id, date, min(temperature) as minTemperature, max(temperature) as maxTemperature FROM readings where token(sensor_id, token_id) > X and token(sensor_id, token_id) < Y GROUP BY sensor_id, date
  • 29. Application (Example) Total amount of data to scan: 1.44 billion points/day Coordinator Worker (loader machine) ScyllaDB cluster Worker (loader machine) Worker (loader machine) Set time frame, compute average, min, max of all sensors
  • 30. Disk Access Analysis Option 2 (in theory) ● For simplification lets assume ○ Application breaks requests by vnode token ranges ○ Every partition: ■ is fully stored in a single sstable ■ is exactly placed in a single data block (and the only one there) ■ vnode token ranges do not share data blocks ○ Bloom filters do not provide false positives ● Analysis Number of scans Number of vnode token ranges I/O for index Number of vnode token ranges * Number of shards I/O for data Number of data blocks
  • 31. Disk Access Comparison Option1: Single Partition Option 2: Range Scans Number of ops Number of partitions 365 * 10^6 = 365 Million Number of scans Number of vnode token ranges 83 * 256 = 21248 I/O for index 365 Million Number of vnode token ranges * Number of shards 83 * 256 * 54 = 1147392 I/O for data 365 Million 365 Million
  • 32. Billy using Full Scan (theoretical) gain 1. The number of I/O ops for Index on the cluster drops from 365 Million to ~ 1.2 Million ● In reality SSTable Bloom Filters are not perfect so single partitions reads will be attempted on sstables that don’t have the partition - even a bigger win for scans) 1. The number of CQL operations on the cluster drops from 365 Million to ~22K ● Returning a result per partition - 365M / 5000 (page size) = 73K pages (in optimal case) so we will need more than 22K requests. 1. In reality partitions do share data blocks they are not perfectly aligned
  • 33. Putting Data Access into practice ● Queries ● Data model ● Some test data (at small scale) ● Docker ● Scylla-Nightly (Pre Scylla 3.2) ○ Tracing including disk access ‘ ... mc-132-big-Index.db: finished bulk DMA read of size 538 at offset 0, successfully read 4096 bytes [shard 0] ‘ ● A simple script that parses system_trace.events after running a traced query
  • 34. Billy on small scale ■ 1000 sensors, 100 dates, 1 sample per minute ■ 1 M partitions, 1440 M rows ■ # shards 4
  • 35. Billy on small scale ■ 1000 sensors, 100 dates, 1 sample per minute ■ 1 M partitions, 1440 M rows ■ # shards 4 Results Single Partition Range Scan Gain Index I/O ~1.3 M Index Bytes ~2.8 GB Data I/O ~1 M Data Bytes ~14.4 GB
  • 36. Billy on small scale ■ 1000 sensors, 100 dates, 1 sample per minute ■ 1 M partitions, 1440 M rows ■ # shards 4 Results Single Partition Range Scan Gain Index I/O ~1.3 M 3318 X 392 Index Bytes ~2.8 GB ~6.9 M X 424 Data I/O ~1 M 10738 X 93 Data Bytes ~14.4 GB ~1.3 GB X 11
  • 37. Billy using Full Scan gain is even bigger 1. Read aheads for the full scans - utilizing better the disk ● Single Partition Avg Data Byts: 14748600348÷1024089 = ~14.5K ● Range Scan Avg Data Bytes: 1355390291÷10738 = ~126K 1. AIO reads are sent to the disk aligning to Index/Data placement - yet disks do block size reads: ● Doing 2 reads for two halves of a disk block will result in reading the block twice and returning part of it each time.
  • 38. Should Range Scans always be used for analytics ? ■ No ■ If Number of Partitions < Number of Token Ranges * Number of Shards ■ What if we are doing a partial scan - what should we do ? a. Example: What was the max & min temperature over the last 7/30/90 days
  • 39. Billy+: Partial Scan SELECT sensor_id, date, min(temperature) as minTemperature, max(temperature) as maxTemperature FROM readings where token(sensor_id, date) > X and token(sensor_id, date) < Y and date >= Z GROUP BY sensor_id, date ALLOW FILTERING
  • 40. Billy+: Partial Scan ● If we are back to the simplifications: ~7% seems to be a good mark: ○ Partial Scan < 7% data use single partitions ○ Partial Scan > 7% data use full scan and filter ● General case: it depends how big the partitions are ○ Larger partitions have a higher penalty on reading them unnecessarily Single Partition Range Scan Total I/O ~2.3 M 14056 0.6% Total Bytes ~17.7 GB ~1.3 GB 7.7%
  • 41. We deployed we are on the beach and drinking a Mojito
  • 43. Evaluating a data model We need this done faster - for simplicity lets add static min/max for each partition that will cache the info - does this help CREATE TABLE readings ( sensor_id int, date date, time time, temperature float, temp_min float static, temp_max float static, PRIMARY KEY ((sensor_id, date), time))
  • 44. ■ Do range scans and use CQL PER PARTITION LIMIT (new in 3.1)
      SELECT sensor_id, date, temp_min, temp_max
      FROM readings
      WHERE token(sensor_id, date) > X
        AND token(sensor_id, date) < Y
      PER PARTITION LIMIT 1
  • 45. Results
                  | Range Scan | Range Scan pre-computed | Gain
      ------------+------------+-------------------------+--------
      Index I/O   | 3318       | 2874                    | X 1.15
      Index Bytes | ~6.9 M     | ~5.9 M                  | X 1.15
      Data I/O    | 10738      | 3520                    | X 3.05
      Data Bytes  | ~1.3 GB    | ~430 M                  | X 3.15
  • 46. Part 2: Scylla 3.1 / 3.2 Additional CQL Features
  • 47. CQL BYPASS CACHE
      ■ Scylla uses Read-Through caching - if the information read is not in the cache, it will be added
      ■ CQL BYPASS CACHE allows overriding that for a specific query - don’t read via the cache / don’t populate the cache
  • 48. CQL PER PARTITION LIMIT
      Limits the number of rows that are returned for each partition
      cqlsh:ks> select * from samples ;
       pk | ck | val
      ----+----+-----
       10 |  1 |   1
       10 |  2 |   2
       11 |  1 |   3
       11 |  2 |   4
      cqlsh:ks> select * from samples PER PARTITION LIMIT 1;
       pk | ck | val
      ----+----+-----
       10 |  1 |   1
       11 |  1 |   3
  • 49. CQL GROUP BY
      ■ The GROUP BY option condenses into a single row all selected rows that share the same values for a set of columns (limited to the partition key plus, optionally, clustering keys)
      ■ Aggregate functions produce a separate value for each group.
      cqlsh:ks> select * from samples ;
       pk | ck | val
      ----+----+-----
       10 |  1 |   1
       10 |  2 |   2
       11 |  1 |   3
       11 |  2 |   4
      cqlsh:ks> select pk, min(val), max(val) from samples GROUP BY pk;
       pk | system.min(val) | system.max(val)
      ----+-----------------+-----------------
       10 |               1 |               2
       11 |               3 |               4
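To make the semantics of PER PARTITION LIMIT (previous slide) and GROUP BY concrete, here is a small in-memory emulation over the same `samples` rows. It is a sketch of the observable behaviour only, not of how Scylla executes these clauses; the function names are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# (pk, ck, val) rows of the `samples` table, in the order cqlsh printed them.
rows = [(10, 1, 1), (10, 2, 2), (11, 1, 3), (11, 2, 4)]


def per_partition_limit(rows, limit):
    """Keep at most `limit` rows of each partition (rows must be grouped by pk)."""
    out = []
    for _, partition in groupby(rows, key=itemgetter(0)):
        out.extend(list(partition)[:limit])
    return out


def group_by_pk_min_max(rows):
    """One (pk, min(val), max(val)) row per partition (rows must be grouped by pk)."""
    out = []
    for pk, partition in groupby(rows, key=itemgetter(0)):
        vals = [val for _, _, val in partition]
        out.append((pk, min(vals), max(vals)))
    return out


print(per_partition_limit(rows, 1))  # [(10, 1, 1), (11, 1, 3)]
print(group_by_pk_min_max(rows))     # [(10, 1, 2), (11, 3, 4)]
```

Both results match the cqlsh output shown on the slides: one row per partition in the first case, one aggregated row per partition in the second.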
  • 50. CQL LIKE
      ■ Filtering using LIKE syntax
      ■ No need for indexing
      cqlsh:ks> select * from samples ;
       pk | ck | val
      ----+----+-----
       10 |  1 |   1
       10 |  2 |   2
       11 |  1 |   3
       11 |  2 |   4
      cqlsh:ks> select * from samples where pk like '%0' ALLOW FILTERING;
       pk | ck | val
      ----+----+-----
       10 |  1 |   1
       10 |  2 |   2
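The matching rule behind the `'%0'` example can be sketched by translating a LIKE pattern into a regular expression. This assumes the usual SQL wildcard meaning (`%` for any run of characters, `_` for exactly one character) and deliberately ignores escape sequences; the helper names are hypothetical, and `pk` is treated as text, as the pattern on the slide implies.

```python
import re


def like_to_regex(pattern: str) -> re.Pattern:
    """Translate a LIKE pattern into a regular expression.

    '%' matches any run of characters (including none) and '_' matches
    exactly one character. Escape sequences are not handled in this sketch.
    """
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def like(value: str, pattern: str) -> bool:
    """True when the whole of `value` matches the LIKE `pattern`."""
    return like_to_regex(pattern).fullmatch(value) is not None


# Same rows as on the slide, with pk stored as text.
rows = [("10", 1, 1), ("10", 2, 2), ("11", 1, 3), ("11", 2, 4)]
matching = [row for row in rows if like(row[0], "%0")]
print(matching)  # [('10', 1, 1), ('10', 2, 2)]
```

Because every row has to be tested against the pattern, the query on the slide needs ALLOW FILTERING.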
  • 52. ● Disk Access:
          ○ Is another way to evaluate data models
          ○ It's especially useful for analytics / background batch processing jobs, since those will access data from disk
      ● Scylla 3.1 includes:
          ○ CQL:
              ■ BYPASS CACHE
              ■ PER PARTITION LIMIT
      ● Upcoming Scylla 3.2 will include:
          ○ Tracing with Disk Access
          ○ CQL:
              ■ GROUP BY
              ■ LIKE
              ■ Non-Frozen UDTs (not covered)
      ● Optimized(*) full scans reduce the overall amount of disk access when compared to aggregated single partition scans
  • 53. Thank you Stay in touch Any questions? Shlomi Livne shlomi@scylladb.com @shlomilivne

Editor's Notes

  • #6: NoSQL - you need to start with the queries
  • #7: Data Model is built to answer those queries
  • #8: Testing the Data Model and the queries - some start with c-s or another simulation tool. This is more complex than it sounds: simulating the data distribution and the request distribution over the data set is not that simple
  • #9: Next step is to develop
  • #10: And once you start, you find some queries need to be updated / the data model needs to be changed
  • #11: Last year we showed how using Monitoring CQL optimization can help find development bugs earlier
  • #12: Next - you move to scale testing - trying to emulate the real production data
  • #13: In this scale test - you find that you may get large partitions - and that changes ...
  • #14: You deploy
  • #15: And you find yourself with hot partitions / large partitions that you may not have detected in scale testing. So this requires changes
  • #17: Disk Access can be applied around Data Model verification and can detect some issues that would otherwise be detected further down the line
  • #21: Billy is the internal code name for the system that Glauber + … presented at the keynote session, doing more than 1B ops per second
  • #24: Or, phrasing it differently: Glauber just showed you we can do it. My session is about showing you how we can do it EVEN BETTER
  • #34: We do expose metrics for disk access, yet it is not possible to tell that they are 100% related to a single query (so, in order not to mislead you, we looked for a different way)
  • #37: Index Bytes = 2926800234
  • #48: BYPASS CACHE goes hand in hand with Workload Prioritization. Workload Prioritization assures that the analytic workload co-exists side by side with the online workload. BYPASS CACHE allows enforcing this even further, assuring that analytics are only done from disk and do not “pollute” the cache