Kirill Alekseev, Mail.ru Group
High-Load Storage
of Users’ Actions
with Scylla and HDDs
Kirill Alekseev
+ Software Engineering Team Lead @ Mail Service @ Mail.Ru Group
+ Master’s degree in Computer Science in 2019 @ Lomonosov Moscow
State University
+ Love coding, music and parties
Presenter
19 million
unique real users DAU
47 million
unique real users MAU
3
1 000 000
emails per minute
4
Agenda
o Service overview
o Data model, cluster specs
o Application details
o Using Scylla with HDDs
o Q&A
High-load storage
of users’ actions
6
7
Service overview
Basically, the actions history is a time series of actions keyed by email address:
8
user | system.totimestamp(time) | ip | project_id | event_id
-----------------+--------------------------------------+---------------+------------+------------
test@mail.ru | 2020-11-15 15:22:46.000000+0000 | 172.27.56.34 | 3 | 4
test@mail.ru | 2020-11-15 15:22:45.000000+0000 | 172.27.56.34 | 3 | 13
test@mail.ru | 2020-11-15 15:22:41.000000+0000 | 172.27.56.34 | 3 | 20
test@mail.ru | 2020-11-15 15:22:23.000000+0000 | 172.27.56.34 | 3 | 4
test@mail.ru | 2020-11-15 15:22:22.000000+0000 | 172.27.56.34 | 3 | 120
Service overview
9
HTTP API
Mail Service Cloud Service Calendar Service
write action by user read a list of actions by user
10
65000
peak API write RPS
50
peak API read RPS
Problems of previous storage
The previous storage had the following problems:
+ poor scalability
+ difficult to maintain
+ lack of must-have DBMS features (secondary indexes, tunable replication, a query
language, etc.)
11
HTTP API
Mail Service Cloud Service Calendar Service
write action by user read a list of actions by user
12
Scylla as a storage
for users’ actions
Cluster and data model overview, hardware specs
13
Cluster overview
+ 2 DCs, 4+5 nodes, RF=1 inside each DC
+ CL=ONE for writes/reads
+ Bare metal
+ 2 x Intel Xeon Gold 6230
+ 6 x 32GB DDR4 2666 MHz
+ 2 x SATA SSD 1TB in RAID 1 for commitlogs, 10 x HDD 16TB in RAID 10 for data
+ 10 Gb/s Network
14
CREATE TABLE becca.actions (
user text, year smallint, week tinyint,
time timeuuid,
project_id smallint, event_id smallint,
ip inet, args map<text, text>,
PRIMARY KEY ((user, year, week, project_id), time)
) WITH CLUSTERING ORDER BY (time DESC)
Data model
+ Partition is a list of actions sorted by time
+ Partition is identified by user, year, week and project
15
cqlsh> SELECT user, toTimestamp(time), ip, project_id, event_id, args FROM becca.actions
WHERE user = 'test@mail.ru' AND year = 2020 AND week = 46 AND project_id = 3 LIMIT 5;
user | system.totimestamp(time) | ip | project_id | event_id | args
--------------+---------------------------------+--------------+------------+----------+------------------------------------
test@mail.ru | 2020-11-15 15:22:46.000000+0000 | 172.27.56.34 | 3 | 4 | {'rid': '5aa73d', 'ua': 'Mozilla'}
test@mail.ru | 2020-11-15 15:22:45.000000+0000 | 172.27.56.34 | 3 | 13 | {'rid': '44a7b0', 'ua': 'Mozilla'}
test@mail.ru | 2020-11-15 15:22:41.000000+0000 | 172.27.56.34 | 3 | 20 | {'rid': 'a17143', 'ua': 'Mozilla'}
test@mail.ru | 2020-11-15 15:22:23.000000+0000 | 172.27.56.34 | 3 | 4 | {'rid': 'c77f6d', 'ua': 'Mozilla'}
test@mail.ru | 2020-11-15 15:22:22.000000+0000 | 172.27.56.34 | 3 | 120 | {'rid': 'e4b3ad', 'ua': 'Mozilla'}
16
Example of SELECT query:
Data model
Compaction Strategy
+ Time Window Compaction Strategy gives the lowest write amplification
+ We chose a time window of 1 week
+ Compacting 1 week of data takes ~3 hours
+ We never expire data (no TTL)
17
Various other options
+ We tried different compression settings; LZ4 with 64 KB chunks works best for us (0.243385 compression ratio)
+ bloom_filter_fp_chance = 0.001
18
Reading by a secondary key
19
+ Out-of-the-box secondary indexes involve an unpredictable number of network requests and
lots of random IO
+ Materialized views require a read-before-update for every write operation (a non-starter
on HDDs)
+ Our approach: duplicate writes to a separate table with a different partition key
+ Duplicating writes to a separate table by a different partition key
CREATE TABLE becca.actions_by_ip (
ip inet, year smallint, week tinyint,
user text, time timeuuid,
project_id smallint, event_id smallint,
args map<text, text>,
PRIMARY KEY ((ip, year, week, project_id), time, user)
) WITH CLUSTERING ORDER BY (time DESC)
Secondary key data model
+ Requires 2x space and 2x write load
+ Gives predictable performance on reads
20
cqlsh> SELECT ip, user, toTimestamp(time), project_id, event_id, args FROM becca.actions_by_ip
WHERE ip = '172.27.28.155' AND year = 2020 AND week = 46 AND project_id = 3 LIMIT 5;
ip | user | system.totimestamp(time) | project_id | event_id | args
---------------+---------------+---------------------------------+------------+----------+----------------------------------
172.27.28.155 | test1@mail.ru | 2020-11-12 08:16:50.000000+0000 | 3 | 4 | {'rid': 'ef749e', 'ua': 'Mozilla'}
172.27.28.155 | test1@mail.ru | 2020-11-12 08:10:34.000000+0000 | 3 | 120 | {'rid': '7aa30b', 'ua': 'Mozilla'}
172.27.28.155 | test2@mail.ru | 2020-11-12 08:09:30.000000+0000 | 3 | 4 | {'rid': 'dd6679', 'ua': 'Mozilla'}
172.27.28.155 | test3@mail.ru | 2020-11-12 08:08:31.000000+0000 | 3 | 81 | {'rid': '55f33c', 'ua': 'Mozilla'}
172.27.28.155 | test3@mail.ru | 2020-11-12 08:08:29.000000+0000 | 3 | 80 | {'rid': 'e8f3d2', 'ua': 'Mozilla'}
Reading by a secondary key
21
Example of INSERT query:
Example of SELECT query:
cqlsh> INSERT INTO becca.actions_by_ip(ip, year, week, time, user, project_id, event_id, args)
VALUES('172.27.56.34', 2020, 46, a447b680-278c-11eb-ac37-fa163e4302ba, 'test@mail.ru', 3, 4, {'ua':'Mozilla'});
240 000
writes per second
95% ~1.5ms, 99.9% ~22ms
22
10 (100 peak)
reads per second
Avg ~120ms, 95% ~400ms,
99.9% ~650ms
23
+4TB
of compressed data
every week
24
Read/write
API
Overview of the API and the logic it implies
25
write/action
26
Write action for the user:
curl -d '{"ua": "Mozilla"}' "http://api.mail.ru/api/v1/write/action?user=test@mail.ru&project_id=3&event_id=4&ip=172.27.56.34&ts=$(date +%s)"
{"code":200}
INSERT INTO becca.actions (user, year, week, project_id, event_id, time, ip, args)
VALUES ('test@mail.ru', 2020, 46, 3, 4, a447b680-278c-11eb-ac37-fa163e4302ba, '172.27.56.34', {'ua': 'Mozilla'});
INSERT INTO becca.actions_by_ip (ip, year, week, time, user, project_id, event_id, args)
VALUES ('172.27.56.34', 2020, 46, a447b680-278c-11eb-ac37-fa163e4302ba, 'test@mail.ru', 3, 4, {'ua': 'Mozilla'});
CQL:
Retries on write
27
+ One retry per write request
+ Retry goes to another DC
+ Safe retries, thanks to LWW (primary key is ((user, year, week, project_id), time))
(Diagram: the HTTP API writes to DC 1 and retries in DC 2.)
read/actions
28
Read actions for the given user in the given time range:
curl "http://api.mail.ru/api/v1/read/actions?user=test@mail.ru&project_id=3&ts_min=1605392206&ts_max=1605478606&limit=1"
{"code":206,"body":[{"ts":1605477057,"ip":"172.27.56.34","project_id":3,"id":4,"state":"a447b680-278c-11eb-ac37-fa163e4302ba","args":{"ua":"Mozilla"}}]}
SELECT event_id, time, ip, args FROM becca.actions
WHERE user = 'test@mail.ru' AND year = 2020 AND week = 46 AND project_id = 3
AND time > maxTimeUUID(1605392205) AND time < minTimeUUID(1605478607)
ORDER BY time DESC LIMIT 1
CQL:
read/actions
29
Read actions for the given user in the given time range:
curl "http://api.mail.ru/api/v1/read/actions?user=test@mail.ru&project_id=3&ts_min=1605392206&ts_max=1605478606&limit=1"
{"code":206,"body":[{"ts":1605477057,"ip":"172.27.56.34","project_id":3,"id":4,"state":"a447b680-278c-11eb-ac37-fa163e4302ba","args":{"ua":"Mozilla"}}]}
SELECT event_id, time, ip, args FROM becca.actions
WHERE user = 'test@mail.ru' AND year = 2020 AND week = 46 AND project_id = 3
AND time > 148c0480-26c7-11eb-bf7f-7f7f7f7f7f7f AND time < 4026f180-2790-11eb-8080-808080808080
ORDER BY time DESC LIMIT 1
CQL:
Concurrent reads
30
+ The time range in a request can vary from 1 second to one month
+ The API breaks a month into weeks and makes concurrent requests to Scylla
+ Tradeoff: more (possibly excessive) concurrent requests to Scylla => faster response times
for the API
4
concurrent reads
API timings reduced
95% → 2.49x, 99.9% → 2.26x
31
Exploiting promoted index
32
+ Max partition size we have: 500 000 rows
+ To get predictable response times we put a limit on the number of rows returned, which goes
straight into the CQL query
+ If the partition has more than LIMIT rows, code 206 is returned
+ The client can then pass the timeuuid of the last action with the next API call
+ Scylla serves the next portion of rows in no time (thanks to the promoted index)
Exploiting promoted index
33
Read the next portion of actions:
curl "http://api.mail.ru/api/v1/read/actions?user=test@mail.ru&project_id=3&ts_min=1605392206&ts_max=1605478606&limit=1&state=a447b680-278c-11eb-ac37-fa163e4302ba"
{"code":206,"body":{"events":[{"ts":1605477055,"ip":"172.27.56.34","project_id":3,"id":4,"state":"a3168980-278c-11eb-ac35-fa163e4302ba","args":{"rid":"5aa73d","ua":"Mozilla"}}],"state":"a3168980-278c-11eb-ac35-fa163e4302ba"}}
SELECT event_id, time, ip, args FROM becca.actions
WHERE user = 'test@mail.ru' AND year = 2020 AND week = 46 AND project_id = 3
AND time > 148c0480-26c7-11eb-bf7f-7f7f7f7f7f7f AND time < a447b680-278c-11eb-ac37-fa163e4302ba
ORDER BY time DESC LIMIT 1
CQL:
Retries on read
34
+ One retry per API request
+ Retry goes to another DC
+ If the retry fails, the API returns a partial response (206); the next API request continues
where the previous one stopped
(Diagram: the HTTP API reads from DC 1 and retries in DC 2.)
Tuning gocql
35
+ Set prefetch to 0.999 to speed up background fetches
+ Implement custom unmarshaller to optimize allocations
(Diagram: result pages 1-5; with a high prefetch value, the next page is fetched in the background while the current one is being consumed.)
Using Scylla
with HDDs
Potential problems and possible solutions to them
36
num-io-queues
37
+ num-io-queues sets the number of I/O queues (threads) that interact with the disks
+ You have to find your sweet spot so that throughput is optimal and latencies are OK (Little's
Law)
+ 10 HDDs in RAID 10 provide a maximum write concurrency of 5, so set
num-io-queues to 4-5
Cluster repairs
38
+ nodetool repair does not finish in acceptable time (months)
+ nodetool repair overloads cluster (read latencies grow 4 times)
+ We came up with a more IO-efficient way to repair a cluster in our case
Cluster repairs
39
(Diagram: per-node timelines of weekly SSTable windows, weeks 42-46; the SSTables for the damaged week 45 are copied from a healthy node to the damaged one.)
Cluster repairs
40
+ nodetool refresh will finish quickly
+ compactions of new data will be triggered, but the cluster will not be overloaded
+ compactions will finish in a couple of hours
+ run nodetool cleanup to remove ‘foreign’ data
Cluster repairs
41
The algorithm is as simple as these commands:
+ ./filter_sstables.sh $range_min $range_max
+ rsync $sstables $dmgd_node/scylla_upload_dir
+ nodetool refresh
Data full scan
Full scan is an anti-pattern for CQL; however, it can work well if the database only has to
handle full scans on rare occasions.
42
Possible use cases:
+ find all unique users that authenticated at least one time last month
+ how many users activated a particular feature last quarter
+ any other case useful for business
Data full scan
Let’s say we want to do a full scan within a particular time range (1 month, for example). A naive
CQL approach takes too long and overloads the cluster.
Our way:
1. sstablemetadata to collect SSTables with data in the given time range
2. sstabledump on every SSTable from step 1 (multiple SSTables in parallel)
3. parse JSON output of sstabledump in a streaming fashion
43
Problem: output of sstabledump is a single large JSON (will not fit in memory)
Data full scan: parsing a huge JSON
44
{
"bands": [
{
"name": "Metallica",
"origin": "USA",
"albums": [
...
]
},
...
{
"name": "Enter Shikari",
"origin": "England",
"albums": [
...
]
}
]
}
What we have:
+ a huge JSON object (tens of GBs)
+ any field can be arbitrarily long
+ you need only a small subset of fields
Solution: implement a custom parser for the
regular subset of JSON
Data full scan: parsing a huge JSON in Go
45
englishArtists := 0
state := searchingForOriginKey
for {
    // lexer is a streaming JSON tokenizer over the sstabledump output
    currToken, err := lexer.Token()
    if err == io.EOF {
        break
    }
    if err != nil {
        log.Fatal(err)
    }
    switch state {
    case searchingForOriginKey:
        if currToken == "origin" {
            state = pendingOriginValue
        }
    case pendingOriginValue:
        if currToken == "England" {
            englishArtists++
        }
        state = searchingForOriginKey
    }
}
Need a streaming JSON tokenizer

encoding/json
+ provides a streaming JSON tokenizer
+ consumes lots of CPU

https://github.com/gibsn/gojsonlex
+ drop-in replacement for standard encoding/json
+ 2-3 times faster than encoding/json
+ requires a small amount of memory
Data full scan
46
Pros
+ most effective use of HDDs (sequential access)
+ finishes in reasonable time (days)
+ excludes network requests (fast and reliable)

Cons
+ requires some coding
+ has to run on each node (at least in one DC)
Problems yet to be solved
47
The following problems are yet to be solved:
+ latencies grow during compactions, cleanup, bootstrap
+ latencies grow when a node is down
+ slow bootstrapping
$150 000
CAPEX saved per 1PB
48
compared to an SSD setup
Conclusion
49
Results
We have achieved the following results:
+ we have built a high-load, horizontally scalable service for storing users’ actions with Scylla
and HDDs
+ the service handles 240 000 writes per second with a 95th-percentile latency of ~1.5 ms
on just a few Scylla nodes
+ we have implemented an approach to serve reads by a secondary key with predictable
performance
+ we have implemented an approach to run full scans in reasonable time on rare occasions
50
Future work
In 2021:
+ third DC
+ optimize Scylla and clients to get even better latencies
+ integrate Scylla into more projects
51
Special Thanks
I would like to give special thanks to:
+ Dmitry Pavlov, Pavel Buchinchik, Igor Platonov
+ Vladislav Zolotarov, Avi Kivity, Raphael Carvalho
+ The whole ScyllaDB team
52
Q&A
gibsn@mail.ru
Stay in touch
United States
545 Faber Place
Palo Alto, CA 94303
Israel
11 Galgalei Haplada
Herzelia, Israel
www.scylladb.com
@scylladb
Thank you
More Related Content

PDF
WEBINAR - Introducing Scylla Open Source 3.0: Materialized Views, Secondary I...
PDF
Introducing Project Alternator - Scylla’s Open-Source DynamoDB-compatible API
PPTX
Cassandra vs. ScyllaDB: Evolutionary Differences
PDF
RDBMS to NoSQL: Practical Advice from Successful Migrations
PDF
Building Event Streaming Architectures on Scylla and Kafka
PPTX
Lightweight Transactions in Scylla versus Apache Cassandra
PPTX
Real-time Fraud Detection for Southeast Asia’s Leading Mobile Platform
PPTX
Scylla Summit 2018: Scalable Stream Processing with KSQL, Kafka and ScyllaDB
WEBINAR - Introducing Scylla Open Source 3.0: Materialized Views, Secondary I...
Introducing Project Alternator - Scylla’s Open-Source DynamoDB-compatible API
Cassandra vs. ScyllaDB: Evolutionary Differences
RDBMS to NoSQL: Practical Advice from Successful Migrations
Building Event Streaming Architectures on Scylla and Kafka
Lightweight Transactions in Scylla versus Apache Cassandra
Real-time Fraud Detection for Southeast Asia’s Leading Mobile Platform
Scylla Summit 2018: Scalable Stream Processing with KSQL, Kafka and ScyllaDB

What's hot (20)

PDF
Webinar how to build a highly available time series solution with kairos-db (1)
PDF
Introducing Scylla Manager: Cluster Management and Task Automation
PDF
Building a Real-time Streaming ETL Framework Using ksqlDB and NoSQL
PDF
Introducing Scylla Cloud
PPTX
Building a Lambda Architecture with Elasticsearch at Yieldbot
PDF
Numberly on Joining Billions of Rows in Seconds: Replacing MongoDB and Hive w...
PDF
NoSQL and NewSQL: Tradeoffs between Scalable Performance & Consistency
PDF
Running Scylla on Kubernetes with Scylla Operator
PDF
Critical Attributes for a High-Performance, Low-Latency Database
PDF
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
PDF
The True Cost of NoSQL DBaaS Options
PDF
Webinar: Using Control Theory to Keep Compactions Under Control
PDF
Webinar: How to Shrink Your Datacenter Footprint by 50%
PDF
Running a DynamoDB-compatible Database on Managed Kubernetes Services
PDF
Measuring Database Performance on Bare Metal AWS Instances
PDF
How to Build a Scylla Database Cluster that Fits Your Needs
PDF
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
PPTX
Seastar Summit 2019 Keynote
PDF
How to achieve no compromise performance and availability
PDF
TechTalk: Reduce Your Storage Footprint with a Revolutionary New Compaction S...
Webinar how to build a highly available time series solution with kairos-db (1)
Introducing Scylla Manager: Cluster Management and Task Automation
Building a Real-time Streaming ETL Framework Using ksqlDB and NoSQL
Introducing Scylla Cloud
Building a Lambda Architecture with Elasticsearch at Yieldbot
Numberly on Joining Billions of Rows in Seconds: Replacing MongoDB and Hive w...
NoSQL and NewSQL: Tradeoffs between Scalable Performance & Consistency
Running Scylla on Kubernetes with Scylla Operator
Critical Attributes for a High-Performance, Low-Latency Database
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
The True Cost of NoSQL DBaaS Options
Webinar: Using Control Theory to Keep Compactions Under Control
Webinar: How to Shrink Your Datacenter Footprint by 50%
Running a DynamoDB-compatible Database on Managed Kubernetes Services
Measuring Database Performance on Bare Metal AWS Instances
How to Build a Scylla Database Cluster that Fits Your Needs
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Seastar Summit 2019 Keynote
How to achieve no compromise performance and availability
TechTalk: Reduce Your Storage Footprint with a Revolutionary New Compaction S...
Ad

Similar to Fast NoSQL from HDDs? (20)

PPTX
High-Load Storage of Users’ Actions with ScyllaDB and HDDs
PDF
Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce
PDF
Tues 115pm cassandra + s3 + hadoop = quick auditing and analytics_yazovskiy
PDF
The Future of Sharding
 
PDF
Cassandra introduction 2016
PDF
Tweaking performance on high-load projects
PDF
Cassandra summit keynote 2014
PDF
Scaling Twitter
PPTX
Lightweight Transactions at Lightning Speed
PDF
PgQ Generic high-performance queue for PostgreSQL
PDF
Scaling Twitter 12758
PPTX
Hadoop World 2011: Advanced HBase Schema Design - Lars George, Cloudera
PDF
Tweaking perfomance on high-load projects_Думанский Дмитрий
PDF
Cassandra Data Modeling
PDF
Cassandra introduction @ ParisJUG
PDF
Manchester Hadoop User Group: Cassandra Intro
PPTX
Hadoop World 2011: Advanced HBase Schema Design
PDF
Treasure Data and AWS - Developers.io 2015
PDF
Cassandra - lesson learned
PDF
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...
High-Load Storage of Users’ Actions with ScyllaDB and HDDs
Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce
Tues 115pm cassandra + s3 + hadoop = quick auditing and analytics_yazovskiy
The Future of Sharding
 
Cassandra introduction 2016
Tweaking performance on high-load projects
Cassandra summit keynote 2014
Scaling Twitter
Lightweight Transactions at Lightning Speed
PgQ Generic high-performance queue for PostgreSQL
Scaling Twitter 12758
Hadoop World 2011: Advanced HBase Schema Design - Lars George, Cloudera
Tweaking perfomance on high-load projects_Думанский Дмитрий
Cassandra Data Modeling
Cassandra introduction @ ParisJUG
Manchester Hadoop User Group: Cassandra Intro
Hadoop World 2011: Advanced HBase Schema Design
Treasure Data and AWS - Developers.io 2015
Cassandra - lesson learned
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...
Ad

More from ScyllaDB (20)

PDF
Understanding The True Cost of DynamoDB Webinar
PDF
Database Benchmarking for Performance Masterclass: Session 2 - Data Modeling ...
PDF
Database Benchmarking for Performance Masterclass: Session 1 - Benchmarking F...
PDF
New Ways to Reduce Database Costs with ScyllaDB
PDF
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
PDF
Powering a Billion Dreams: Scaling Meesho’s E-commerce Revolution with Scylla...
PDF
Leading a High-Stakes Database Migration
PDF
Achieving Extreme Scale with ScyllaDB: Tips & Tradeoffs
PDF
Securely Serving Millions of Boot Artifacts a Day by João Pedro Lima & Matt ...
PDF
How Agoda Scaled 50x Throughput with ScyllaDB by Worakarn Isaratham
PDF
How Yieldmo Cut Database Costs and Cloud Dependencies Fast by Todd Coleman
PDF
ScyllaDB: 10 Years and Beyond by Dor Laor
PDF
Reduce Your Cloud Spend with ScyllaDB by Tzach Livyatan
PDF
Migrating 50TB Data From a Home-Grown Database to ScyllaDB, Fast by Terence Liu
PDF
Vector Search with ScyllaDB by Szymon Wasik
PDF
Workload Prioritization: How to Balance Multiple Workloads in a Cluster by Fe...
PDF
Two Leading Approaches to Data Virtualization, and Which Scales Better? by Da...
PDF
Scaling a Beast: Lessons from 400x Growth in a High-Stakes Financial System b...
PDF
Object Storage in ScyllaDB by Ran Regev, ScyllaDB
PDF
Lessons Learned from Building a Serverless Notifications System by Srushith R...
Understanding The True Cost of DynamoDB Webinar
Database Benchmarking for Performance Masterclass: Session 2 - Data Modeling ...
Database Benchmarking for Performance Masterclass: Session 1 - Benchmarking F...
New Ways to Reduce Database Costs with ScyllaDB
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Powering a Billion Dreams: Scaling Meesho’s E-commerce Revolution with Scylla...
Leading a High-Stakes Database Migration
Achieving Extreme Scale with ScyllaDB: Tips & Tradeoffs
Securely Serving Millions of Boot Artifacts a Day by João Pedro Lima & Matt ...
How Agoda Scaled 50x Throughput with ScyllaDB by Worakarn Isaratham
How Yieldmo Cut Database Costs and Cloud Dependencies Fast by Todd Coleman
ScyllaDB: 10 Years and Beyond by Dor Laor
Reduce Your Cloud Spend with ScyllaDB by Tzach Livyatan
Migrating 50TB Data From a Home-Grown Database to ScyllaDB, Fast by Terence Liu
Vector Search with ScyllaDB by Szymon Wasik
Workload Prioritization: How to Balance Multiple Workloads in a Cluster by Fe...
Two Leading Approaches to Data Virtualization, and Which Scales Better? by Da...
Scaling a Beast: Lessons from 400x Growth in a High-Stakes Financial System b...
Object Storage in ScyllaDB by Ran Regev, ScyllaDB
Lessons Learned from Building a Serverless Notifications System by Srushith R...

Recently uploaded (20)

PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PDF
Encapsulation theory and applications.pdf
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PPTX
Cloud computing and distributed systems.
PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PDF
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
PPTX
MYSQL Presentation for SQL database connectivity
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PPTX
A Presentation on Artificial Intelligence
PPTX
Big Data Technologies - Introduction.pptx
PDF
NewMind AI Weekly Chronicles - August'25 Week I
PPTX
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication
PDF
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PPT
Teaching material agriculture food technology
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
20250228 LYD VKU AI Blended-Learning.pptx
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
Encapsulation theory and applications.pdf
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
Cloud computing and distributed systems.
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
MYSQL Presentation for SQL database connectivity
Mobile App Security Testing_ A Comprehensive Guide.pdf
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
Chapter 3 Spatial Domain Image Processing.pdf
A Presentation on Artificial Intelligence
Big Data Technologies - Introduction.pptx
NewMind AI Weekly Chronicles - August'25 Week I
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
Teaching material agriculture food technology

Fast NoSQL from HDDs?

  • 1. Kirill Alekseev, Mail.ru Group High-Load Storage of Users’ Actions with Scylla and HDDs
  • 2. Kirill Alekseev + Software Engineering Team Lead @ Mail Service @ Mail.Ru Group + Master’s degree in Computer Science in 2019 @ Lomonosov Moscow State University + Love coding, music and parties Presenter
  • 3. 19 million unique real users DAU 47 million unique real users MAU 3
  • 4. 1 000 000 emails per minute 4
  • 5. Agenda o Service overview o Data model, cluster specs o Application details o Using Scylla with HDDs o Q&A
  • 8. Basically, actions history is a time series of actions stored by email: 8 user | system.totimestamp(time) | ip | project_id | event_id -----------------+--------------------------------------+---------------+------------+------------ test@mail.ru | 2020-11-15 15:22:46.000000+0000 | 172.27.56.34 | 3 | 4 test@mail.ru | 2020-11-15 15:22:45.000000+0000 | 172.27.56.34 | 3 | 13 test@mail.ru | 2020-11-15 15:22:41.000000+0000 | 172.27.56.34 | 3 | 20 test@mail.ru | 2020-11-15 15:22:23.000000+0000 | 172.27.56.34 | 3 | 4 test@mail.ru | 2020-11-15 15:22:22.000000+0000 | 172.27.56.34 | 3 | 120 Service overview
  • 9. 9 HTTP API Mail Service Cloud Service Calendar Service write action by user read a list of actions by user
  • 10. 10 65000 peak API write RPS 50 peak API read RPS
  • 11. Problems of previous storage The previous storage had the following problems: + poor scalability + difficult to maintain + lack of must-have DBMS features (secondary indexes, tunable replication, query language etc) The Speaker’s camera displays here 11
  • 12. HTTP API Mail Service Cloud Service Calendar Service write action by user read a list of actions by user 12
  • 13. Scylla as a storage for users’ actions Cluster and data model overview, hardware specs 13
  • 14. Cluster overview + 2 DCs, 4+5 nodes, RF=1 inside each DC + CL=ONE for writes/reads + Bare metal + 2 x Intel Xeon Gold 6230 + 6 x 32GB DDR4 2666 MHz + 2 x SATA SSD 1TB RAID 1 for clogs, 10 x HDD 16TB RAID 10 for data + 10 Gb/s Network 14
  • 15. CREATE TABLE becca.actions ( user text, year smallint, week tinyint, time timeuuid, project_id smallint, event_id smallint, ip inet, args map<text, text>, PRIMARY KEY ((user, year, week, project_id), time) ) WITH CLUSTERING ORDER BY (time DESC) Data model + Partition is a list of actions sorted by time + Partition is identified by user, year, week and project 15
  • 16. cqlsh> select user, toTimestamp(time), ip, project_id, event_id, args from becca.events where user = 'test@mail.ru' and year = 2020 and week = 46 and project_id = 3 LIMIT 5; user | system.totimestamp(time) | ip | project_id | event_id | args --------------+---------------------------------+--------------+------------+----------+------------------------------------ test@mail.ru | 2020-11-15 15:22:46.000000+0000 | 172.27.56.34 | 3 | 4 | {'rid': '5aa73d', 'ua': 'Mozilla'} test@mail.ru | 2020-11-15 15:22:45.000000+0000 | 172.27.56.34 | 3 | 13 | {'rid': '44a7b0', 'ua': 'Mozilla'} test@mail.ru | 2020-11-15 15:22:21.000000+0000 | 172.27.56.34 | 3 | 20 | {'rid': 'a17143', 'ua': 'Mozilla'} test@mail.ru | 2020-11-15 15:22:23.000000+0000 | 172.27.56.34 | 3 | 4 | {'rid': 'c77f6d', 'ua': 'Mozilla'} test@mail.ru | 2020-11-15 15:22:22.000000+0000 | 172.27.56.34 | 3 | 120 | {'rid': 'e4b3ad', 'ua': 'Mozilla'} 16 Example of SELECT query: Data model
  • 17. Compaction Strategy + Time Window Compaction Strategy gives the best (min) write amplification + We chose time window with size of 1 week + Compacting 1 week of data takes ~3 hours + We never expire data (no TTL) 17
  • 18. Various other options + We tried different compression settings, LZ4 64kb works best for us (0.243385 ratio) + bloom_filter_fp_chance = 0.001 18
  • 19. Reading by a secondary key 19 + Out-of-the-box secondary indexes involve an ambiguous number of network requests and lots of random IO + Materialized views require a read-before-update for every write operation (not gonna work with HDDs) + Duplicating writes to a separate table by a different partition key
  • 20. CREATE TABLE becca.actions_by_ip ( ip text, year smallint, week tinyint, user text, time timeuuid, project_id smallint, event_id smallint, ip inet, args map<text, text>, PRIMARY KEY ((ip, year, week, project_id), time, user) ) WITH CLUSTERING ORDER BY (time DESC) Secondary key data model + Requires 2x space and 2x write load + Gives predictable performance on reads 20
  • 21. cqlsh> SELECT ip, user, toTimestamp(time), project_id, event_id, args FROM becca.actions_by_ip WHERE ip = '172.27.28.155' AND year = 2020 AND week = 46 AND project_id = 3 LIMIT 5; ip | user | system.totimestamp(time) | project_id | event_id | args ---------------+---------------+---------------------------------+------------+----------+---------------------------------- 172.27.28.155 | test1@mail.ru | 2020-11-12 08:16:50.000000+0000 | 3 | 4 | {rid': 'ef749e', 'ua': 'Mozilla'} 172.27.28.155 | test1@mail.ru | 2020-11-12 08:10:34.000000+0000 | 3 | 120 | {rid': '7aa30b', 'ua': 'Mozilla'} 172.27.28.155 | test2@mail.ru | 2020-11-12 08:09:30.000000+0000 | 3 | 4 | {rid': 'dd6679', 'ua': 'Mozilla'} 172.27.28.155 | test3@mail.ru | 2020-11-12 08:08:31.000000+0000 | 3 | 81 | {rid': '55f33c', 'ua': 'Mozilla'} 172.27.28.155 | test3@mail.ru | 2020-11-12 08:08:29.000000+0000 | 3 | 80 | {rid': 'e8f3d2', 'ua': 'Mozilla'} Reading by a secondary key 21 Example of INSERT query: Example of SELECT query: cqlsh> INSERT INTO becca.actions_by_ip(ip, year, week, time, user, project_id, event_id, args) VALUES('172.27.56.34', 2020, 46, a447b680-278c-11eb-ac37-fa163e4302ba, 'test@mail.ru', 3, 4, {'ua':'Mozilla'});
  • 22. 240 000 writes per second 95% ~1.5ms, 99.9% ~22ms 22
  • 23. 10 (100 peak) reads per second Avg ~120ms, 95% ~400ms, 99.9% ~650ms 23
  • 25. Read/write API Overview of the API and the logic it implies 25
  • 26. write/action 26 Write action for the user: curl -d '{"ua": "Mozilla"}' "http://api.mail.ru/api/v1/write/action?user=test@mail.ru&project_id=3&event_id=4&ip=172.27.5 6.34&ts=$(date +%s)" {"code":200} INSERT INTO becca.action(user,year,week,project_id,event_id,time,ip,args) VALUES ('test@mail.ru', 2020,46,3,4,a447b680-278c-11eb-ac37-fa163e4302ba,'172.27.56.34',{'ua': 'Mozilla'}); INSERT INTO becca.action_by_ip(ip, year, week, time, user, project_id, event_id, args) VALUES('172.27.56.34',2020,46,a447b680-278c-11eb-ac37-fa163e4302ba,'test@mail.ru',3,4,{'ua':' Mozilla'}); CQL:
  • 27. Retries on write 27 + One retry per one write request + Retry goes to another DC + Safe retries, thanks to LWW (primary key is ((user, year, week, project_id), time)) DC 1 DC 2 HTTP API
  • 28. read/actions 28 Read actions for the given user in the given time range: curl "http://api.mail.ru/api/v1/read/actions?user=test@mail.ru&project_id=3&ts_min=1605392206&ts_max=160547 8606&limit=1" {"code":206,"body":[{"ts":1605477057,"ip":"172.27.56.34","project_id":3,"id":4,"state":"a447b680-278c- 11eb-ac37-fa163e4302ba","args":{"ua":"Mozilla"}}]} SELECT event_id, time, ip, args FROM becca.events WHERE user = 'test@mail.ru' AND year = 2020 AND week = 46 AND project_id = 3 AND time > maxTimeUUID(1605392205) AND time < minTimeUUID(1605478607) ORDER BY time DESC LIMIT 1 CQL:
• 29. read/actions

Read actions for the given user in the given time range:

curl "http://api.mail.ru/api/v1/read/actions?user=test@mail.ru&project_id=3&ts_min=1605392206&ts_max=1605478606&limit=1"
{"code":206,"body":[{"ts":1605477057,"ip":"172.27.56.34","project_id":3,"id":4,"state":"a447b680-278c-11eb-ac37-fa163e4302ba","args":{"ua":"Mozilla"}}]}

CQL:

SELECT event_id, time, ip, args FROM becca.events
WHERE user = 'test@mail.ru' AND year = 2020 AND week = 46 AND project_id = 3
  AND time > 148c0480-26c7-11eb-bf7f-7f7f7f7f7f7f AND time < 4026f180-2790-11eb-8080-808080808080
ORDER BY time DESC LIMIT 1
• 30. Concurrent reads

+ The time range in a request can vary from 1 second to one month
+ The API breaks a month into weeks and makes concurrent requests to Scylla
+ Tradeoff: more (possibly excessive) concurrent requests to Scylla => faster response time for the API
• 31. 4 concurrent reads: API timings reduced 95% → 2.49x, 99.9% → 2.26x
• 32. Exploiting the promoted index

+ Max partition size we have: 500 000 rows
+ To get predictable response times we put a limit on the number of rows returned, which goes straight into the CQL query
+ If the partition has more than LIMIT rows, code 206 is returned
+ The client can then pass the timeuuid of the last action with the next API call
+ Scylla returns the next portion of rows in no time (thanks to the promoted index)
• 33. Exploiting the promoted index

Read the next portion of actions:

curl "http://api.mail.ru/api/v1/read/actions?user=test@mail.ru&project_id=3&ts_min=1605392206&ts_max=1605478606&limit=1&state=a447b680-278c-11eb-ac37-fa163e4302ba"
{"code":206,"body":{"events":[{"ts":1605477055,"ip":"172.27.56.34","project_id":3,"id":4,"state":"a3168980-278c-11eb-ac35-fa163e4302ba","args":{"rid":"5aa73d","ua":"Mozilla"}}],"state":"a3168980-278c-11eb-ac35-fa163e4302ba"}}

CQL:

SELECT event_id, time, ip, args FROM becca.events
WHERE user = 'test@mail.ru' AND year = 2020 AND week = 46 AND project_id = 3
  AND time > 148c0480-26c7-11eb-bf7f-7f7f7f7f7f7f AND time < a447b680-278c-11eb-ac37-fa163e4302ba
ORDER BY time DESC LIMIT 1
• 34. Retries on read

+ One retry per API request
+ The retry goes to the other DC
+ If the retry also fails, the API gives a partial response (206); the next API request continues where the previous one stopped

[diagram: HTTP API reads from DC 1 and retries to DC 2]
• 35. Tuning gocql

+ Set prefetch to 0.999 to speed up background fetches
+ Implement a custom unmarshaller to optimize allocations

[diagram: pages 1-5 fetched lazily vs prefetched ahead of the reader]
• 36. Using Scylla with HDDs: potential problems and possible solutions to them
• 37. num-io-queues

+ num-io-queues sets the number of threads that interact with the disks
+ You have to find your sweet spot so that throughput is optimal and latencies are OK (Little's Law)
+ 10 HDDs in RAID 10 provide a maximum write concurrency of 5, so set num-io-queues to 4-5
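For reference, the flag is passed to Scylla through its seastar options. An illustrative config fragment; the file path and variable name are assumptions that vary between Scylla versions and packagings, so check your installation:

```shell
# /etc/scylla.d/io.conf (path and variable name are version-dependent)
# 10 HDDs in RAID 10 => effective write concurrency of ~5,
# so cap the number of I/O queues just below that:
SEASTAR_IO="--num-io-queues 4"
```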
• 38. Cluster repairs

+ nodetool repair does not finish in acceptable time (months)
+ nodetool repair overloads the cluster (read latencies grow 4 times)
+ We came up with a more IO-efficient way to repair the cluster in our case
• 40. Cluster repairs

+ nodetool refresh finishes quickly
+ compactions of the new data will be triggered, but the cluster will not be overloaded
+ compactions finish in a couple of hours
+ run nodetool cleanup to remove 'foreign' data
• 41. Cluster repairs

The algorithm is as simple as these commands:

+ ./filter_sstables.sh $range_min $range_max
+ rsync $sstables $dmgd_node/scylla_upload_dir
+ nodetool refresh
• 42. Data full scan

A full scan is an anti-pattern for CQL; however, it can work well if the DB can handle full scans on rare occasions.

Possible use cases:
+ find all unique users that authenticated at least once last month
+ count how many users activated a particular feature last quarter
+ any other case useful for the business
• 43. Data full scan

Let's say we want to do a full scan within a particular time range (1 month, for example). The naive CQL approach takes too long and overloads the cluster. Our way:

1. sstablemetadata to collect the SSTables with data in the given time range
2. sstabledump on every SSTable from step 1 (multiple SSTables in parallel)
3. parse the JSON output of sstabledump in a streaming fashion

Problem: the output of sstabledump is a single large JSON (it will not fit in memory)
• 44. Data full scan: parsing a huge JSON

What we have:
+ a huge JSON object (tens of GBs)
+ any field can be arbitrarily long
+ only a small subset of fields is needed

Solution: implement a custom parser for the regular subset of JSON

{
  "bands": [
    {
      "name": "Metallica",
      "origin": "USA",
      "albums": [ ... ]
    },
    ...
    {
      "name": "Enter Shikari",
      "origin": "England",
      "albums": [ ... ]
    }
  ]
}
• 45. Data full scan: parsing a huge JSON in Go

We need a streaming JSON tokenizer.

encoding/json:
+ provides a streaming JSON tokenizer
+ consumes lots of CPU

gojsonlex (https://github.com/gibsn/gojsonlex):
+ drop-in replacement for the standard encoding/json
+ 2-3 times faster than encoding/json
+ requires a small amount of memory

const (
    searchingForOriginKey = iota
    pendingOriginValue
)

englishArtists := 0
state := searchingForOriginKey

for {
    currToken, err := lexer.Token()
    if err != nil {
        break // io.EOF or a parse error
    }

    switch state {
    case searchingForOriginKey:
        if currToken == "origin" {
            state = pendingOriginValue
        }
    case pendingOriginValue:
        if currToken == "England" {
            englishArtists++
        }
        state = searchingForOriginKey
    }
}
• 46. Data full scan

Pros:
+ most effective use of HDDs (sequential access)
+ finishes in reasonable time (days)
+ excludes network requests (fast and reliable)

Cons:
+ requires some coding
+ has to run on each node (at least in one DC)
• 47. Problems yet to be solved

+ latencies grow during compactions, cleanup and bootstrap
+ latencies grow when a node is down
+ slow bootstrapping
• 48. $150 000 CAPEX saved per 1PB compared to an SSD setup
• 50. Results

We have achieved the following:
+ built a high-load, horizontally scalable service for storing users' actions with Scylla and HDDs
+ the service handles 240 000 writes per second with a 95th-percentile latency of 1.5ms on just a few Scylla nodes
+ implemented an approach to serve reads by a secondary key with predictable performance
+ implemented an approach to do full scans in reasonable time on rare occasions
• 51. Future work

In 2021:
+ a third DC
+ optimize Scylla and the clients to get even better latencies
+ integrate Scylla into more projects
• 52. Special Thanks

I would like to give special thanks to:
+ Dmitry Pavlov, Pavel Buchinchik, Igor Platonov
+ Vladislav Zolotarov, Avi Kivity, Raphael Carvalho
+ The whole ScyllaDB team
  • 54. United States 545 Faber Place Palo Alto, CA 94303 Israel 11 Galgalei Haplada Herzelia, Israel www.scylladb.com @scylladb Thank you