How we made large partition
scans over two times faster
Botond Denes
Software Developer @ ScyllaDB
Presenter bio
Botond is a software engineer who has worked in a range of
roles from web-developer to backend developer in a range of
industries from railway automation to finance. He loves
programming and solving challenging problems with elegant
code, open-source software, Linux, and C++. What he likes best
about working here is that Scylla is made up of that entire list.
Contents
▪ Problem Statement
▪ Solution
▪ Benchmarking Results
Problem Statement
What is Paging?
CLIENT SCYLLA
How Did Paging Work?
CLIENT SCYLLA
Paging State Cookie
LEGEND
What is Stateless Paging?
Paging State Cookie
Create Query State
Destroy Query State
LEGEND
CLIENT SCYLLA
What is Wrong with Stateless Paging?
▪ Setting up the query state requires a non-trivial amount of work
▪ Relatively cheap for row-cache
▪ Expensive for read-from-disk:
• Identify sstables
• Read summary and index files
• Skip to start position in the sstable
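
The cost listed above can be illustrated with a toy model. This is only a sketch: `open_reader` is a hypothetical stand-in for the setup work (identifying sstables, reading summary and index files, seeking), and none of the names come from Scylla. It counts how often that setup runs when every page is served as an independent query.

```python
from typing import Dict, List, Optional, Tuple

# Counts how many times the (expensive) query state is created.
stats: Dict[str, int] = {"setups": 0}


def open_reader(rows: List[int], start_after: Optional[int]) -> List[int]:
    """Hypothetical stand-in for sstable lookup, index reads and seeking."""
    stats["setups"] += 1
    if start_after is None:
        return rows
    return [r for r in rows if r > start_after]


def read_page_stateless(rows: List[int], paging_state: Optional[int],
                        page_size: int) -> Tuple[List[int], Optional[int]]:
    """Serve one page; query state is created and discarded inside this call."""
    remaining = open_reader(rows, paging_state)
    page = remaining[:page_size]
    # The paging state is just the position of the last row returned.
    next_state = page[-1] if len(remaining) > page_size else None
    return page, next_state


def scan(rows: List[int], page_size: int) -> List[int]:
    """Client loop: request pages until the server reports no more data."""
    result: List[int] = []
    state: Optional[int] = None
    while True:
        page, state = read_page_stateless(rows, state, page_size)
        result.extend(page)
        if state is None:
            return result
```

In this model a scan of 10 sorted rows with a page size of 3 performs the setup four times, once per page, which is the overhead the slide describes.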
The Solution
Stateful Paging
CLIENT SCYLLA
Save Query State
Look-up Query State
Query Key
Paging State Cookie
Create Query State
Destroy Query State
LEGEND
What is the query state exactly?
[Diagram: a cluster of three nodes (Node 0, Node 1, Node 2), each with shards 0 to 5]
Sticky Replicas
▪ Send all page requests to the same set of replicas
▪ Implemented by storing the list of replicas in the
Paging State Cookie
▪ The replicas are chosen on the first page and “stuck to”
for the rest of the query
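
A minimal sketch of how a paging state cookie could carry the sticky replica list, assuming illustrative field names (`position`, `query_key`, `replicas`) that are not taken from Scylla's actual cookie layout. Replica failure handling is deliberately left out.

```python
import random
import uuid
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class PagingState:
    """Opaque to the client; the field names here are illustrative only."""
    position: Tuple[str, int]    # (partition key, clustering key) last read
    query_key: str               # identifies the saved state on the replicas
    replicas: Tuple[str, ...]    # replicas chosen on the first page


def choose_replicas(all_replicas: List[str], count: int,
                    prev: Optional[PagingState]) -> Tuple[str, ...]:
    """Pick replicas on the first page, then stick to the ones in the cookie."""
    if prev is not None:
        return prev.replicas
    return tuple(random.sample(all_replicas, count))


def next_paging_state(prev: Optional[PagingState],
                      replicas: Tuple[str, ...],
                      position: Tuple[str, int]) -> PagingState:
    """Build the cookie returned with a page; the query key is created once."""
    key = prev.query_key if prev is not None else uuid.uuid4().hex
    return PagingState(position=position, query_key=key, replicas=replicas)
```

Because the client echoes the cookie back unchanged, any coordinator that receives the next page request can recover both the query key and the replica set from it.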
Querier Cache - Overview
▪ Special-purpose cache
▪ Each shard of each node has one
▪ Entries are saved under the query key
▪ Multiple entries can be inserted with the same key
Querier Cache - Dealing with Failures
▪ Create new querier on miss
▪ Drop found querier and create a new one on:
• Read position mismatch
• Schema version mismatch
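
The lookup rules above can be sketched roughly as follows. The class and field names are hypothetical, and the real querier holds far more state. The sketch assumes entries sharing a query key are told apart by their read range, and that a looked-up entry is always removed from the cache.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Querier:
    read_range: Tuple[int, int]  # partition range being read
    position: int                # where reading stopped on the previous page
    schema_version: int


class QuerierCache:
    def __init__(self) -> None:
        self._entries: Dict[str, List[Querier]] = {}
        self.lookups = 0
        self.misses = 0
        self.drops = 0

    def insert(self, query_key: str, querier: Querier) -> None:
        # Several queriers may be saved under the same query key.
        self._entries.setdefault(query_key, []).append(querier)

    def lookup(self, query_key: str, read_range: Tuple[int, int],
               position: int, schema_version: int) -> Optional[Querier]:
        """Return a usable querier, or None so the caller creates a new one."""
        self.lookups += 1
        bucket = self._entries.get(query_key, [])
        for i, q in enumerate(bucket):
            if q.read_range == read_range:
                del bucket[i]
                if not bucket:
                    self._entries.pop(query_key, None)
                if q.position != position or q.schema_version != schema_version:
                    self.drops += 1  # found, but not suitable: drop it
                    return None
                return q
        self.misses += 1
        return None
```

Returning `None` for both a miss and a drop keeps the caller simple: in either case it falls back to creating a fresh querier, exactly as it would on the first page.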
Querier Cache - Eviction Policies
▪ Time based
▪ Memory based
▪ Read permit based
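
The three policies can be sketched together in one small structure. Everything here is an assumption made for illustration: the class name, the explicit `now` argument (used instead of a real clock so the behaviour is easy to follow), and the choice to always evict the oldest entry first.

```python
from collections import deque
from typing import Deque, Tuple


class EvictingCache:
    def __init__(self, ttl_seconds: float, max_bytes: int) -> None:
        self.ttl = ttl_seconds
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self.evictions = 0
        # (key, expiry time, size in bytes), oldest entry on the left.
        self._entries: Deque[Tuple[str, float, int]] = deque()

    def _evict_oldest(self) -> bool:
        if not self._entries:
            return False
        _, _, size = self._entries.popleft()
        self.used_bytes -= size
        self.evictions += 1
        return True

    def expire(self, now: float) -> None:
        """Time based: evict entries whose TTL has passed."""
        while self._entries and self._entries[0][1] <= now:
            self._evict_oldest()

    def insert(self, key: str, size: int, now: float) -> bool:
        """Memory based: evict the oldest entries until the new one fits."""
        self.expire(now)
        if size > self.max_bytes:
            return False  # can never fit, so it is not cached at all
        while self.used_bytes + size > self.max_bytes:
            self._evict_oldest()
        self._entries.append((key, now + self.ttl, size))
        self.used_bytes += size
        return True

    def evict_for_permit(self) -> bool:
        """Permit based: evict one entry so a new read can take its permit."""
        return self._evict_oldest()

    def __len__(self) -> int:
        return len(self._entries)
```

Since every entry gets the same TTL and `now` is assumed to be non-decreasing, insertion order equals expiry order, which is why a plain deque is enough here.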
Diagnostics
New counters:
▪ Lookups
▪ Misses
▪ Drops
▪ Evictions
▪ Population
New CQL trace messages:
▪ When a querier is looked up
▪ When a querier is cached
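
As a small worked example, a hit rate can be derived from the lookup, miss and drop counters. The sketch assumes a drop is counted separately from a miss (the lookup found a querier, but it could not be used), so hits are whatever remains after subtracting both.

```python
def hit_rate(lookups: int, misses: int, drops: int) -> float:
    """Fraction of lookups that returned a querier usable for the next page."""
    if lookups == 0:
        return 0.0
    return (lookups - misses - drops) / lookups
```

For instance, 100 lookups with 5 misses and 5 drops give a hit rate of 0.9.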
Benchmarking Results
Single Partition Scans
CQL Reads
Partition Range Scans
CQL Reads
Disk Bytes Read/CQL Read
Disk OPS/CQL Read
Summary
▪ Better resource utilization
▪ Improved performance
▪ Better handling of large partitions
▪ Especially beneficial for disk-bound setups
More Details
▪ https://www.scylladb.com/2018/07/13/efficient-query-paging/
▪ https://www.scylladb.com/2018/11/01/more-efficient-range-scan-paging-with-scylla-3-0/
▪ https://www.scylladb.com/users-blog/
▪ https://www.scylladb.com/developers-blog/
Thank You
Any Questions?
Please stay in touch
bdenes@scylladb.com

Scylla Summit 2018: How We Made Large Partition Scans Over Two Times Faster

Editor's Notes

  • #2: Scylla's query paging. How we improved it. How this benefits large partition scans and partition range scans.
  • #4: Note that I might use “read” and “query” interchangeably.
  • #6: Queries can return an unknown amount of data. The exact amount is only known *after* the query has been executed. Reading an unknown amount of data at once is dangerous: it can fill memory and hog the CPU, which can cause denial of service. To avoid this Scylla uses paging, that is, it reads and transmits the results in limited-size chunks, called pages. Pages are limited by the number of rows (10k, changeable by client) and size (1MB - fixed, sanity limit, setting to a high number can lead to denial of service). After sending each page to the client, Scylla stops and waits for the client to explicitly request the next page.
  • #7: A Cookie transmitted on each request - response. This cookie is called the “Paging State”. The Paging State is an opaque binary blob with arbitrary content from the point of view of the client. Scylla can choose what to include in it, its content is not part of the protocol. This provides a certain flexibility for the server in the implementation of paging. In any case Scylla stores just the position - last partition and clustering key. Scylla itself didn’t store any state related to the query.
  • #8: The internal Query State of Scylla was created anew at the beginning of each page and destroyed at the end. The Query State is an abstract concept that represents all the internal state required to serve the query. Essentially each page is a separate query. This has the advantage of simple code but has many drawbacks.
  • #9: Scylla had to do this all over again at the start of each page. Scylla doesn’t use the OS page cache, so all this effort is truly lost when the state is destroyed; we don’t get the benefit of the OS having cached the recently read files in RAM for us. This gets worse as the size of the scanned partition increases, so it hurts large partitions especially badly.
  • #11: Make paging stateful, that is, create the Query State on the first page and use it throughout the entire query. This sounds simple, but there were a lot of details to get right. At the end of each page we save the Query State in a cache. For this we use a unique key called the Query Key, which is generated on the first page. This key is then remembered by being included in the Paging State Cookie. At the beginning of all subsequent pages we look up the saved Query State and continue the query where we left off.
  • #12: The Query has a local state on each shard of each node that is involved in the query. In this imaginary cluster of 3 nodes, with 6 shards each, an imaginary query is run on Shard1 of Node0 and Node2. This imaginary query can be a single partition scan, executed with a CL=2. So the query state will be made up of the local state on Shard1 of Node0 and the local state on Shard1 of Node2. We call the local state the Querier. The querier is an actual object that encapsulates all state and logic required to serve the query on a single shard of a single node. So when we are talking about saving the query state, we mean saving the actual querier objects. All this state is located on replicas, no state is stored on the coordinator. The coordinator is the node that receives the request. Its job is to select the replicas to forward the request to, merge the results and send them to the client. The replica is the node that actually has the data, and that actually executes the query. The coordinator can be a replica as well, in fact drivers will choose replicas such that this is true.
  • #13: Since state is local to replicas, we have to use the same set of replicas throughout the query. This has a side effect: the driver can choose a coordinator that is not one of the replicas previously used for this query, in which case an extra network hop is introduced. Drivers choose a new coordinator for each page for load balancing. This can be fixed by changing the driver to stick to the same coordinator for the entire query. Piotr Jastrzebski talks about this in detail in his talk about driver optimizations.
  • #14: It is the foundation upon which stateful paging is built. When multiple entries have the same key, we distinguish them by their read range - the partition range they are reading. In the case of single partition scans this will be just a single partition. This is possible for IN queries, if two listed partitions are located on the same shard of the same node.
  • #15: In a perfect world each lookup for a saved querier succeeds and the querier can be used to continue the query. We don’t live in a perfect world - a lot can go wrong in a distributed database. A previously used replica can crash or be partitioned - the query has to move to a new one, where the lookup will miss. It is also possible for the lookup to succeed but for the querier to be unsuitable for continuing the query. It is possible that the page request will want to continue from a position that doesn’t match the cached querier’s. The position of a querier is the position where it stopped reading on the previous page and consequently the position where it will continue on the next page. This position has row granularity. A mismatch can be caused by nodes having mismatching data - read repair. Or by a node having been skipped for a few pages - due to partition or slowness. Schema updates can run concurrently with the query - this would require complex code to deal with - not worth it, we drop the querier instead.
  • #16: Abandoned queriers. Can happen for a number of reasons - client crashed, node was partitioned. Each inserted querier has a TTL of 10s. Bound memory consumption. Currently 4% of the shard’s memory. We have a read concurrency control. It is permit based, each new read, that is each new querier has to obtain a permit before it can start reading. Permits are limited. Queriers hold on to their permit for their entire lifetime. It can happen that incoming new reads cannot be started as all permits have run out - evict cached queriers to free up permits.
  • #17: Misses - number of lookups that failed. Drops - number of lookups that succeeded but where the querier is not suitable for continuing the query. Hit rate can be derived from these three metrics (lookups, misses and drops).
  • #18: Mostly focused on benchmarking scanning large partitions, read from disk - the use case that suffered the most from stateless paging.
  • #19: Normalized graph. Focusing on the improvement itself, instead of the actual numbers. Explain BEFORE and AFTER. Amazing almost 2.5X improvement in throughput.
  • #20: Also normalized graph, showing only the improvements. Improvement in throughput is not as impressive as that of single partition scans. Partition range scans are a lot more complicated, with a higher CPU cost. Disk is a smaller factor in their performance. We observed the bottleneck moving from the disk to the CPU. We can see that the improvement in disk usage is much more significant. Disk is accessed a lot less. We read fewer bytes. Per CQL read.
  • #21: Stateful paging achieved: Better (less) resource utilization. Improved performance (throughput). Vastly improved handling of large partitions, a pain point of Scylla’s in the past.
  • #22: We published two blog posts on this topic, with a lot more details on how all this is implemented. If you are interested in more details then I recommend reading them. Even if you are not interested in more details on paging, I still recommend visiting our blog as it has a lot of other interesting posts.
  • #24: Index Slide