Kirill Alekseev, Mail.ru Group
High-Load Storage
of Users’ Actions
with Scylla and HDDs
Kirill Alekseev
+ Software Engineering Team Lead @ Mail Service @ Mail.Ru Group
+ Master’s degree in Computer Science in 2019 @ Lomonosov Moscow
State University
+ Love coding, music and parties
Presenter
19 million
unique real users DAU
47 million
unique real users MAU
3
1 000 000
emails per minute
4
Agenda
o Service overview
o Data model, cluster specs
o Application details
o Using Scylla with HDDs
o Q&A
High-load storage
of users’ actions
6
7
Service overview
Basically, the actions history is a time series of actions keyed by email address:
8
user | system.totimestamp(time) | ip | project_id | event_id
-----------------+--------------------------------------+---------------+------------+------------
test@mail.ru | 2020-11-15 15:22:46.000000+0000 | 172.27.56.34 | 3 | 4
test@mail.ru | 2020-11-15 15:22:45.000000+0000 | 172.27.56.34 | 3 | 13
test@mail.ru | 2020-11-15 15:22:41.000000+0000 | 172.27.56.34 | 3 | 20
test@mail.ru | 2020-11-15 15:22:23.000000+0000 | 172.27.56.34 | 3 | 4
test@mail.ru | 2020-11-15 15:22:22.000000+0000 | 172.27.56.34 | 3 | 120
Service overview
9
HTTP API
Mail Service Cloud Service Calendar Service
write action by user read a list of actions by user
10
65000
peak API write RPS
50
peak API read RPS
Problems of previous storage
The previous storage had the following problems:
+ poor scalability
+ difficult to maintain
+ lack of must-have DBMS features (secondary indexes, tunable replication, a query
language, etc.)
11
HTTP API
Mail Service Cloud Service Calendar Service
write action by user read a list of actions by user
12
Scylla as a storage
for users’ actions
Cluster and data model overview, hardware specs
13
Cluster overview
+ 2 DCs, 4+5 nodes, RF=1 inside each DC
+ CL=ONE for writes/reads
+ Bare metal
+ 2 x Intel Xeon Gold 6230
+ 6 x 32GB DDR4 2666 MHz
+ 2 x SATA SSD 1TB in RAID 1 for commitlogs, 10 x HDD 16TB in RAID 10 for data
+ 10 Gb/s Network
14
CREATE TABLE becca.actions (
user text, year smallint, week tinyint,
time timeuuid,
project_id smallint, event_id smallint,
ip inet, args map<text, text>,
PRIMARY KEY ((user, year, week, project_id), time)
) WITH CLUSTERING ORDER BY (time DESC)
Data model
+ Partition is a list of actions sorted by time
+ Partition is identified by user, year, week and project
15
cqlsh> SELECT user, toTimestamp(time), ip, project_id, event_id, args FROM becca.actions
WHERE user = 'test@mail.ru' AND year = 2020 AND week = 46 AND project_id = 3 LIMIT 5;
user | system.totimestamp(time) | ip | project_id | event_id | args
--------------+---------------------------------+--------------+------------+----------+------------------------------------
test@mail.ru | 2020-11-15 15:22:46.000000+0000 | 172.27.56.34 | 3 | 4 | {'rid': '5aa73d', 'ua': 'Mozilla'}
test@mail.ru | 2020-11-15 15:22:45.000000+0000 | 172.27.56.34 | 3 | 13 | {'rid': '44a7b0', 'ua': 'Mozilla'}
test@mail.ru | 2020-11-15 15:22:41.000000+0000 | 172.27.56.34 | 3 | 20 | {'rid': 'a17143', 'ua': 'Mozilla'}
test@mail.ru | 2020-11-15 15:22:23.000000+0000 | 172.27.56.34 | 3 | 4 | {'rid': 'c77f6d', 'ua': 'Mozilla'}
test@mail.ru | 2020-11-15 15:22:22.000000+0000 | 172.27.56.34 | 3 | 120 | {'rid': 'e4b3ad', 'ua': 'Mozilla'}
16
Example of SELECT query:
Data model
Compaction Strategy
+ Time Window Compaction Strategy gives the lowest write amplification
+ We chose a time window of 1 week
+ Compacting 1 week of data takes ~3 hours
+ We never expire data (no TTL)
17
Various other options
+ We tried different compression settings; LZ4 with 64 KB chunks works best for us (0.243385 compression ratio)
+ bloom_filter_fp_chance = 0.001
18
Reading by a secondary key
19
+ Out-of-the-box secondary indexes involve an unpredictable number of network requests and
lots of random IO
+ Materialized views require a read-before-update for every write operation (a non-starter
on HDDs)
+ Our approach: duplicate writes to a separate table with a different partition key
+ Duplicating writes to a separate table by a different partition key
CREATE TABLE becca.actions_by_ip (
ip inet, year smallint, week tinyint,
user text, time timeuuid,
project_id smallint, event_id smallint,
args map<text, text>,
PRIMARY KEY ((ip, year, week, project_id), time, user)
) WITH CLUSTERING ORDER BY (time DESC)
Secondary key data model
+ Requires 2x space and 2x write load
+ Gives predictable performance on reads
20
cqlsh> SELECT ip, user, toTimestamp(time), project_id, event_id, args FROM becca.actions_by_ip
WHERE ip = '172.27.28.155' AND year = 2020 AND week = 46 AND project_id = 3 LIMIT 5;
ip | user | system.totimestamp(time) | project_id | event_id | args
---------------+---------------+---------------------------------+------------+----------+----------------------------------
172.27.28.155 | test1@mail.ru | 2020-11-12 08:16:50.000000+0000 | 3 | 4 | {'rid': 'ef749e', 'ua': 'Mozilla'}
172.27.28.155 | test1@mail.ru | 2020-11-12 08:10:34.000000+0000 | 3 | 120 | {'rid': '7aa30b', 'ua': 'Mozilla'}
172.27.28.155 | test2@mail.ru | 2020-11-12 08:09:30.000000+0000 | 3 | 4 | {'rid': 'dd6679', 'ua': 'Mozilla'}
172.27.28.155 | test3@mail.ru | 2020-11-12 08:08:31.000000+0000 | 3 | 81 | {'rid': '55f33c', 'ua': 'Mozilla'}
172.27.28.155 | test3@mail.ru | 2020-11-12 08:08:29.000000+0000 | 3 | 80 | {'rid': 'e8f3d2', 'ua': 'Mozilla'}
Reading by a secondary key
21
Example of INSERT query:
Example of SELECT query:
cqlsh> INSERT INTO becca.actions_by_ip(ip, year, week, time, user, project_id, event_id, args)
VALUES('172.27.56.34', 2020, 46, a447b680-278c-11eb-ac37-fa163e4302ba, 'test@mail.ru', 3, 4, {'ua':'Mozilla'});
240 000
writes per second
95% ~1.5ms, 99.9% ~22ms
22
10 (100 peak)
reads per second
Avg ~120ms, 95% ~400ms,
99.9% ~650ms
23
+4TB
of compressed data
every week
24
Read/write
API
Overview of the API and the logic it implies
25
write/action
26
Write action for the user:
curl -d '{"ua": "Mozilla"}' "http://api.mail.ru/api/v1/write/action?user=test@mail.ru&project_id=3&event_id=4&ip=172.27.56.34&ts=$(date +%s)"
{"code":200}
INSERT INTO becca.actions (user, year, week, project_id, event_id, time, ip, args)
VALUES ('test@mail.ru', 2020, 46, 3, 4, a447b680-278c-11eb-ac37-fa163e4302ba, '172.27.56.34', {'ua': 'Mozilla'});
INSERT INTO becca.actions_by_ip (ip, year, week, time, user, project_id, event_id, args)
VALUES ('172.27.56.34', 2020, 46, a447b680-278c-11eb-ac37-fa163e4302ba, 'test@mail.ru', 3, 4, {'ua': 'Mozilla'});
CQL:
Retries on write
27
+ One retry per write request
+ Retry goes to another DC
+ Safe retries, thanks to LWW (primary key is ((user, year, week, project_id), time))
(Diagram: the HTTP API writes to DC 1 and retries in DC 2.)
read/actions
28
Read actions for the given user in the given time range:
curl "http://api.mail.ru/api/v1/read/actions?user=test@mail.ru&project_id=3&ts_min=1605392206&ts_max=1605478606&limit=1"
{"code":206,"body":[{"ts":1605477057,"ip":"172.27.56.34","project_id":3,"id":4,"state":"a447b680-278c-11eb-ac37-fa163e4302ba","args":{"ua":"Mozilla"}}]}
SELECT event_id, time, ip, args FROM becca.actions
WHERE user = 'test@mail.ru' AND year = 2020 AND week = 46 AND project_id = 3
AND time > maxTimeUUID(1605392205) AND time < minTimeUUID(1605478607)
ORDER BY time DESC LIMIT 1
CQL:
read/actions
29
Read actions for the given user in the given time range:
curl "http://api.mail.ru/api/v1/read/actions?user=test@mail.ru&project_id=3&ts_min=1605392206&ts_max=1605478606&limit=1"
{"code":206,"body":[{"ts":1605477057,"ip":"172.27.56.34","project_id":3,"id":4,"state":"a447b680-278c-11eb-ac37-fa163e4302ba","args":{"ua":"Mozilla"}}]}
SELECT event_id, time, ip, args FROM becca.actions
WHERE user = 'test@mail.ru' AND year = 2020 AND week = 46 AND project_id = 3
AND time > 148c0480-26c7-11eb-bf7f-7f7f7f7f7f7f AND time < 4026f180-2790-11eb-8080-808080808080
ORDER BY time DESC LIMIT 1
CQL:
Concurrent reads
30
+ The time range in a request can vary from 1 second to one month
+ The API breaks a month into weeks and makes concurrent requests to Scylla
+ Tradeoff: more (possibly excessive) concurrent requests to Scylla => faster response times
for the API
4
concurrent reads
API timings reduced
95% → 2.49x, 99.9% → 2.26x
31
Exploiting promoted index
32
+ Max partition size we have: 500 000 rows
+ To get predictable response times we put a limit on the number of rows returned, which goes
straight into the CQL query
+ If the partition has more than LIMIT rows, code 206 is returned
+ The client can then pass the timeuuid of the last action with the next API call
+ Scylla serves the next portion of rows in no time (thanks to the promoted index)
Exploiting promoted index
33
Read the next portion of actions:
curl "http://api.mail.ru/api/v1/read/actions?user=test@mail.ru&project_id=3&ts_min=1605392206&ts_max=1605478606&limit=1&state=a447b680-278c-11eb-ac37-fa163e4302ba"
{"code":206,"body":{"events":[{"ts":1605477055,"ip":"172.27.56.34","project_id":3,"id":4,"state":"a3168980-278c-11eb-ac35-fa163e4302ba","args":{"rid":"5aa73d","ua":"Mozilla"}}],"state":"a3168980-278c-11eb-ac35-fa163e4302ba"}}
SELECT event_id, time, ip, args FROM becca.actions
WHERE user = 'test@mail.ru' AND year = 2020 AND week = 46 AND project_id = 3
AND time > 148c0480-26c7-11eb-bf7f-7f7f7f7f7f7f AND time < a447b680-278c-11eb-ac37-fa163e4302ba
ORDER BY time DESC LIMIT 1
CQL:
Retries on read
34
+ One retry per API request
+ Retry goes to another DC
+ If the retry fails, the API returns a partial response (206); the next API request continues
where the previous one stopped
(Diagram: the HTTP API reads from DC 1 and retries in DC 2.)
Tuning gocql
35
+ Set prefetch to 0.999 to speed up background fetches
+ Implement custom unmarshaller to optimize allocations
(Diagram: result pages 1-5; with a high prefetch value, the next page is fetched in the background while the current one is being consumed.)
Using Scylla
with HDDs
Potential problems and possible solutions to them
36
num-io-queues
37
+ num-io-queues sets the number of I/O queues (threads) that interact with the disks
+ You have to find your sweet spot so that throughput is optimal and latencies are OK (Little's
Law)
+ 10 HDDs in RAID 10 provide a maximum write concurrency of 5, so set
num-io-queues to 4-5
Cluster repairs
38
+ nodetool repair does not finish in acceptable time (months)
+ nodetool repair overloads cluster (read latencies grow 4 times)
+ We came up with a more IO-efficient way to repair a cluster in our case
Cluster repairs
39
(Diagram: per-node timelines of weekly SSTable windows, weeks 42-46; the SSTables for the damaged week 45 are copied from a healthy node to the damaged one.)
Cluster repairs
40
+ nodetool refresh will finish quickly
+ compactions of new data will be triggered, but the cluster will not be overloaded
+ compactions will finish in a couple of hours
+ run nodetool cleanup to remove ‘foreign’ data
Cluster repairs
41
The algorithm is as simple as these commands:
+ ./filter_sstables.sh $range_min $range_max
+ rsync $sstables $dmgd_node/scylla_upload_dir
+ nodetool refresh
Data full scan
Full scan is an anti-pattern for CQL; however, it can work well if the database only has to
handle full scans on rare occasions.
42
Possible use cases:
+ find all unique users that authenticated at least one time last month
+ how many users activated a particular feature last quarter
+ any other case useful for business
Data full scan
Let’s say we want to do a full scan within a particular time range (1 month, for example). A naive
CQL approach takes too long and overloads the cluster.
Our way:
1. sstablemetadata to collect SSTables with data in the given time range
2. sstabledump on every SSTable from step 1 (multiple SSTables in parallel)
3. parse JSON output of sstabledump in a streaming fashion
43
Problem: output of sstabledump is a single large JSON (will not fit in memory)
Data full scan: parsing a huge JSON
44
{
"bands": [
{
"name": "Metallica",
"origin": "USA",
"albums": [
...
]
},
...
{
"name": "Enter Shikari",
"origin": "England",
"albums": [
...
]
}
]
}
What we have:
+ a huge JSON object (tens of GBs)
+ any field can be arbitrarily long
+ you need only a small subset of fields
Solution: implement a custom parser for the
regular subset of JSON
Data full scan: parsing a huge JSON in Go
45
englishArtists := 0
state := searchingForOriginKey
for {
    // lexer is a streaming JSON tokenizer over the sstabledump output
    currToken, err := lexer.Token()
    if err == io.EOF {
        break
    }
    if err != nil {
        log.Fatal(err)
    }
    switch state {
    case searchingForOriginKey:
        if currToken == "origin" {
            state = pendingOriginValue
        }
    case pendingOriginValue:
        if currToken == "England" {
            englishArtists++
        }
        state = searchingForOriginKey
    }
}
Need a streaming JSON tokenizer

encoding/json
+ provides a streaming JSON tokenizer
+ consumes lots of CPU

https://github.com/gibsn/gojsonlex
+ drop-in replacement for standard encoding/json
+ 2-3 times faster than encoding/json
+ requires a small amount of memory
Data full scan
46
Pros
+ most effective use of HDDs (sequential access)
+ finishes in reasonable time (days)
+ excludes network requests (fast and reliable)

Cons
+ requires some coding
+ has to run on each node (at least in one DC)
Problems yet to be solved
47
The following problems are yet to be solved:
+ latencies grow during compactions, cleanup, bootstrap
+ latencies grow when a node is down
+ slow bootstrapping
$150 000
CAPEX saved per 1PB
48
compared to an SSD setup
Conclusion
49
Results
We have achieved the following results:
+ we have built a high-load, horizontally scalable service for storing users’ actions with Scylla
and HDDs
+ the service handles 240 000 writes per second with a 95th-percentile latency of ~1.5 ms
on just a few Scylla nodes
+ we have implemented an approach to serve reads by a secondary key with predictable
performance
+ we have implemented an approach to run full scans in reasonable time on rare occasions
50
Future work
In 2021:
+ third DC
+ optimize Scylla and clients to get even better latencies
+ integrate Scylla into more projects
51
Special Thanks
I would like to give special thanks to:
+ Dmitry Pavlov, Pavel Buchinchik, Igor Platonov
+ Vladislav Zolotarov, Avi Kivity, Raphael Carvalho
+ The whole ScyllaDB team
52
Q&A
gibsn@mail.ru
Stay in touch
United States
545 Faber Place
Palo Alto, CA 94303
Israel
11 Galgalei Haplada
Herzelia, Israel
www.scylladb.com
@scylladb
Thank you
More Related Content

PDF
WEBINAR - Introducing Scylla Open Source 3.0: Materialized Views, Secondary I...
PDF
Introducing Project Alternator - Scylla’s Open-Source DynamoDB-compatible API
PPTX
Cassandra vs. ScyllaDB: Evolutionary Differences
PDF
RDBMS to NoSQL: Practical Advice from Successful Migrations
PDF
Building Event Streaming Architectures on Scylla and Kafka
PPTX
Lightweight Transactions in Scylla versus Apache Cassandra
PPTX
Real-time Fraud Detection for Southeast Asia’s Leading Mobile Platform
PPTX
Scylla Summit 2018: Scalable Stream Processing with KSQL, Kafka and ScyllaDB
WEBINAR - Introducing Scylla Open Source 3.0: Materialized Views, Secondary I...
Introducing Project Alternator - Scylla’s Open-Source DynamoDB-compatible API
Cassandra vs. ScyllaDB: Evolutionary Differences
RDBMS to NoSQL: Practical Advice from Successful Migrations
Building Event Streaming Architectures on Scylla and Kafka
Lightweight Transactions in Scylla versus Apache Cassandra
Real-time Fraud Detection for Southeast Asia’s Leading Mobile Platform
Scylla Summit 2018: Scalable Stream Processing with KSQL, Kafka and ScyllaDB

What's hot (20)

PDF
Webinar how to build a highly available time series solution with kairos-db (1)
PDF
Introducing Scylla Manager: Cluster Management and Task Automation
PDF
Building a Real-time Streaming ETL Framework Using ksqlDB and NoSQL
PDF
Introducing Scylla Cloud
PPTX
Building a Lambda Architecture with Elasticsearch at Yieldbot
PDF
Numberly on Joining Billions of Rows in Seconds: Replacing MongoDB and Hive w...
PDF
NoSQL and NewSQL: Tradeoffs between Scalable Performance & Consistency
PDF
Running Scylla on Kubernetes with Scylla Operator
PDF
Critical Attributes for a High-Performance, Low-Latency Database
PDF
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
PDF
The True Cost of NoSQL DBaaS Options
PDF
Webinar: Using Control Theory to Keep Compactions Under Control
PDF
Webinar: How to Shrink Your Datacenter Footprint by 50%
PDF
Running a DynamoDB-compatible Database on Managed Kubernetes Services
PDF
Measuring Database Performance on Bare Metal AWS Instances
PDF
How to Build a Scylla Database Cluster that Fits Your Needs
PDF
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
PPTX
Seastar Summit 2019 Keynote
PDF
How to achieve no compromise performance and availability
PDF
TechTalk: Reduce Your Storage Footprint with a Revolutionary New Compaction S...
Webinar how to build a highly available time series solution with kairos-db (1)
Introducing Scylla Manager: Cluster Management and Task Automation
Building a Real-time Streaming ETL Framework Using ksqlDB and NoSQL
Introducing Scylla Cloud
Building a Lambda Architecture with Elasticsearch at Yieldbot
Numberly on Joining Billions of Rows in Seconds: Replacing MongoDB and Hive w...
NoSQL and NewSQL: Tradeoffs between Scalable Performance & Consistency
Running Scylla on Kubernetes with Scylla Operator
Critical Attributes for a High-Performance, Low-Latency Database
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
The True Cost of NoSQL DBaaS Options
Webinar: Using Control Theory to Keep Compactions Under Control
Webinar: How to Shrink Your Datacenter Footprint by 50%
Running a DynamoDB-compatible Database on Managed Kubernetes Services
Measuring Database Performance on Bare Metal AWS Instances
How to Build a Scylla Database Cluster that Fits Your Needs
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe...
Seastar Summit 2019 Keynote
How to achieve no compromise performance and availability
TechTalk: Reduce Your Storage Footprint with a Revolutionary New Compaction S...
Ad

Similar to Fast NoSQL from HDDs? (20)

PPTX
High-Load Storage of Users’ Actions with ScyllaDB and HDDs
PDF
Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce
PDF
Tues 115pm cassandra + s3 + hadoop = quick auditing and analytics_yazovskiy
PDF
The Future of Sharding
 
PDF
Cassandra introduction 2016
PDF
Tweaking performance on high-load projects
PDF
Cassandra summit keynote 2014
PDF
Scaling Twitter
PPTX
Lightweight Transactions at Lightning Speed
PDF
PgQ Generic high-performance queue for PostgreSQL
PDF
Scaling Twitter 12758
PPTX
Hadoop World 2011: Advanced HBase Schema Design - Lars George, Cloudera
PDF
Tweaking perfomance on high-load projects_Думанский Дмитрий
PDF
Cassandra Data Modeling
PDF
Cassandra introduction @ ParisJUG
PDF
Manchester Hadoop User Group: Cassandra Intro
PPTX
Hadoop World 2011: Advanced HBase Schema Design
PDF
Treasure Data and AWS - Developers.io 2015
PDF
Cassandra - lesson learned
PDF
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...
High-Load Storage of Users’ Actions with ScyllaDB and HDDs
Polyglot Persistence in the Real World: Cassandra + S3 + MapReduce
Tues 115pm cassandra + s3 + hadoop = quick auditing and analytics_yazovskiy
The Future of Sharding
 
Cassandra introduction 2016
Tweaking performance on high-load projects
Cassandra summit keynote 2014
Scaling Twitter
Lightweight Transactions at Lightning Speed
PgQ Generic high-performance queue for PostgreSQL
Scaling Twitter 12758
Hadoop World 2011: Advanced HBase Schema Design - Lars George, Cloudera
Tweaking perfomance on high-load projects_Думанский Дмитрий
Cassandra Data Modeling
Cassandra introduction @ ParisJUG
Manchester Hadoop User Group: Cassandra Intro
Hadoop World 2011: Advanced HBase Schema Design
Treasure Data and AWS - Developers.io 2015
Cassandra - lesson learned
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...
Ad

More from ScyllaDB (20)

PDF
Understanding The True Cost of DynamoDB Webinar
PDF
Database Benchmarking for Performance Masterclass: Session 2 - Data Modeling ...
PDF
Database Benchmarking for Performance Masterclass: Session 1 - Benchmarking F...
PDF
New Ways to Reduce Database Costs with ScyllaDB
PDF
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
PDF
Powering a Billion Dreams: Scaling Meesho’s E-commerce Revolution with Scylla...
PDF
Leading a High-Stakes Database Migration
PDF
Achieving Extreme Scale with ScyllaDB: Tips & Tradeoffs
PDF
Securely Serving Millions of Boot Artifacts a Day by João Pedro Lima & Matt ...
PDF
How Agoda Scaled 50x Throughput with ScyllaDB by Worakarn Isaratham
PDF
How Yieldmo Cut Database Costs and Cloud Dependencies Fast by Todd Coleman
PDF
ScyllaDB: 10 Years and Beyond by Dor Laor
PDF
Reduce Your Cloud Spend with ScyllaDB by Tzach Livyatan
PDF
Migrating 50TB Data From a Home-Grown Database to ScyllaDB, Fast by Terence Liu
PDF
Vector Search with ScyllaDB by Szymon Wasik
PDF
Workload Prioritization: How to Balance Multiple Workloads in a Cluster by Fe...
PDF
Two Leading Approaches to Data Virtualization, and Which Scales Better? by Da...
PDF
Scaling a Beast: Lessons from 400x Growth in a High-Stakes Financial System b...
PDF
Object Storage in ScyllaDB by Ran Regev, ScyllaDB
PDF
Lessons Learned from Building a Serverless Notifications System by Srushith R...
Understanding The True Cost of DynamoDB Webinar
Database Benchmarking for Performance Masterclass: Session 2 - Data Modeling ...
Database Benchmarking for Performance Masterclass: Session 1 - Benchmarking F...
New Ways to Reduce Database Costs with ScyllaDB
Designing Low-Latency Systems with Rust and ScyllaDB: An Architectural Deep Dive
Powering a Billion Dreams: Scaling Meesho’s E-commerce Revolution with Scylla...
Leading a High-Stakes Database Migration
Achieving Extreme Scale with ScyllaDB: Tips & Tradeoffs
Securely Serving Millions of Boot Artifacts a Day by João Pedro Lima & Matt ...
How Agoda Scaled 50x Throughput with ScyllaDB by Worakarn Isaratham
How Yieldmo Cut Database Costs and Cloud Dependencies Fast by Todd Coleman
ScyllaDB: 10 Years and Beyond by Dor Laor
Reduce Your Cloud Spend with ScyllaDB by Tzach Livyatan
Migrating 50TB Data From a Home-Grown Database to ScyllaDB, Fast by Terence Liu
Vector Search with ScyllaDB by Szymon Wasik
Workload Prioritization: How to Balance Multiple Workloads in a Cluster by Fe...
Two Leading Approaches to Data Virtualization, and Which Scales Better? by Da...
Scaling a Beast: Lessons from 400x Growth in a High-Stakes Financial System b...
Object Storage in ScyllaDB by Ran Regev, ScyllaDB
Lessons Learned from Building a Serverless Notifications System by Srushith R...

Recently uploaded (20)

PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PDF
Encapsulation theory and applications.pdf
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PPTX
Cloud computing and distributed systems.
PPTX
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
PDF
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
PPTX
MYSQL Presentation for SQL database connectivity
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PPTX
A Presentation on Artificial Intelligence
PPTX
Big Data Technologies - Introduction.pptx
PDF
NewMind AI Weekly Chronicles - August'25 Week I
PPTX
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication
PDF
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PPT
Teaching material agriculture food technology
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
20250228 LYD VKU AI Blended-Learning.pptx
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
Encapsulation theory and applications.pdf
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
Cloud computing and distributed systems.
KOM of Painting work and Equipment Insulation REV00 update 25-dec.pptx
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
MYSQL Presentation for SQL database connectivity
Mobile App Security Testing_ A Comprehensive Guide.pdf
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
Chapter 3 Spatial Domain Image Processing.pdf
A Presentation on Artificial Intelligence
Big Data Technologies - Introduction.pptx
NewMind AI Weekly Chronicles - August'25 Week I
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
Teaching material agriculture food technology

Fast NoSQL from HDDs?

  • 1. Kirill Alekseev, Mail.ru Group High-Load Storage of Users’ Actions with Scylla and HDDs
  • 2. Kirill Alekseev + Software Engineering Team Lead @ Mail Service @ Mail.Ru Group + Master’s degree in Computer Science in 2019 @ Lomonosov Moscow State University + Love coding, music and parties Presenter
  • 3. 19 million unique real users DAU 47 million unique real users MAU 3
  • 4. 1 000 000 emails per minute 4
  • 5. Agenda o Service overview o Data model, cluster specs o Application details o Using Scylla with HDDs o Q&A
  • 8. Basically, actions history is a time series of actions stored by email: 8 user | system.totimestamp(time) | ip | project_id | event_id -----------------+--------------------------------------+---------------+------------+------------ test@mail.ru | 2020-11-15 15:22:46.000000+0000 | 172.27.56.34 | 3 | 4 test@mail.ru | 2020-11-15 15:22:45.000000+0000 | 172.27.56.34 | 3 | 13 test@mail.ru | 2020-11-15 15:22:41.000000+0000 | 172.27.56.34 | 3 | 20 test@mail.ru | 2020-11-15 15:22:23.000000+0000 | 172.27.56.34 | 3 | 4 test@mail.ru | 2020-11-15 15:22:22.000000+0000 | 172.27.56.34 | 3 | 120 Service overview
  • 9. 9 HTTP API Mail Service Cloud Service Calendar Service write action by user read a list of actions by user
  • 10. 10 65000 peak API write RPS 50 peak API read RPS
  • 11. Problems of previous storage The previous storage had the following problems: + poor scalability + difficult to maintain + lack of must-have DBMS features (secondary indexes, tunable replication, query language etc) The Speaker’s camera displays here 11
  • 12. HTTP API Mail Service Cloud Service Calendar Service write action by user read a list of actions by user 12
  • 13. Scylla as a storage for users’ actions Cluster and data model overview, hardware specs 13
  • 14. Cluster overview + 2 DCs, 4+5 nodes, RF=1 inside each DC + CL=ONE for writes/reads + Bare metal + 2 x Intel Xeon Gold 6230 + 6 x 32GB DDR4 2666 MHz + 2 x SATA SSD 1TB RAID 1 for clogs, 10 x HDD 16TB RAID 10 for data + 10 Gb/s Network 14
  • 15. CREATE TABLE becca.actions ( user text, year smallint, week tinyint, time timeuuid, project_id smallint, event_id smallint, ip inet, args map<text, text>, PRIMARY KEY ((user, year, week, project_id), time) ) WITH CLUSTERING ORDER BY (time DESC) Data model + Partition is a list of actions sorted by time + Partition is identified by user, year, week and project 15
  • 16. cqlsh> select user, toTimestamp(time), ip, project_id, event_id, args from becca.events where user = 'test@mail.ru' and year = 2020 and week = 46 and project_id = 3 LIMIT 5; user | system.totimestamp(time) | ip | project_id | event_id | args --------------+---------------------------------+--------------+------------+----------+------------------------------------ test@mail.ru | 2020-11-15 15:22:46.000000+0000 | 172.27.56.34 | 3 | 4 | {'rid': '5aa73d', 'ua': 'Mozilla'} test@mail.ru | 2020-11-15 15:22:45.000000+0000 | 172.27.56.34 | 3 | 13 | {'rid': '44a7b0', 'ua': 'Mozilla'} test@mail.ru | 2020-11-15 15:22:21.000000+0000 | 172.27.56.34 | 3 | 20 | {'rid': 'a17143', 'ua': 'Mozilla'} test@mail.ru | 2020-11-15 15:22:23.000000+0000 | 172.27.56.34 | 3 | 4 | {'rid': 'c77f6d', 'ua': 'Mozilla'} test@mail.ru | 2020-11-15 15:22:22.000000+0000 | 172.27.56.34 | 3 | 120 | {'rid': 'e4b3ad', 'ua': 'Mozilla'} 16 Example of SELECT query: Data model
  • 17. Compaction Strategy + Time Window Compaction Strategy gives the best (min) write amplification + We chose time window with size of 1 week + Compacting 1 week of data takes ~3 hours + We never expire data (no TTL) 17
  • 18. Various other options + We tried different compression settings, LZ4 64kb works best for us (0.243385 ratio) + bloom_filter_fp_chance = 0.001 18
  • 19. Reading by a secondary key 19 + Out-of-the-box secondary indexes involve an ambiguous number of network requests and lots of random IO + Materialized views require a read-before-update for every write operation (not gonna work with HDDs) + Duplicating writes to a separate table by a different partition key
  • 20. CREATE TABLE becca.actions_by_ip ( ip text, year smallint, week tinyint, user text, time timeuuid, project_id smallint, event_id smallint, ip inet, args map<text, text>, PRIMARY KEY ((ip, year, week, project_id), time, user) ) WITH CLUSTERING ORDER BY (time DESC) Secondary key data model + Requires 2x space and 2x write load + Gives predictable performance on reads 20
  • 21. cqlsh> SELECT ip, user, toTimestamp(time), project_id, event_id, args FROM becca.actions_by_ip WHERE ip = '172.27.28.155' AND year = 2020 AND week = 46 AND project_id = 3 LIMIT 5; ip | user | system.totimestamp(time) | project_id | event_id | args ---------------+---------------+---------------------------------+------------+----------+---------------------------------- 172.27.28.155 | test1@mail.ru | 2020-11-12 08:16:50.000000+0000 | 3 | 4 | {rid': 'ef749e', 'ua': 'Mozilla'} 172.27.28.155 | test1@mail.ru | 2020-11-12 08:10:34.000000+0000 | 3 | 120 | {rid': '7aa30b', 'ua': 'Mozilla'} 172.27.28.155 | test2@mail.ru | 2020-11-12 08:09:30.000000+0000 | 3 | 4 | {rid': 'dd6679', 'ua': 'Mozilla'} 172.27.28.155 | test3@mail.ru | 2020-11-12 08:08:31.000000+0000 | 3 | 81 | {rid': '55f33c', 'ua': 'Mozilla'} 172.27.28.155 | test3@mail.ru | 2020-11-12 08:08:29.000000+0000 | 3 | 80 | {rid': 'e8f3d2', 'ua': 'Mozilla'} Reading by a secondary key 21 Example of INSERT query: Example of SELECT query: cqlsh> INSERT INTO becca.actions_by_ip(ip, year, week, time, user, project_id, event_id, args) VALUES('172.27.56.34', 2020, 46, a447b680-278c-11eb-ac37-fa163e4302ba, 'test@mail.ru', 3, 4, {'ua':'Mozilla'});
  • 22. 240 000 writes per second 95% ~1.5ms, 99.9% ~22ms 22
  • 23. 10 (100 peak) reads per second Avg ~120ms, 95% ~400ms, 99.9% ~650ms 23
  • 25. Read/write API Overview of the API and the logic it implies 25
  • 26. write/action 26 Write action for the user: curl -d '{"ua": "Mozilla"}' "http://api.mail.ru/api/v1/write/action?user=test@mail.ru&project_id=3&event_id=4&ip=172.27.5 6.34&ts=$(date +%s)" {"code":200} INSERT INTO becca.action(user,year,week,project_id,event_id,time,ip,args) VALUES ('test@mail.ru', 2020,46,3,4,a447b680-278c-11eb-ac37-fa163e4302ba,'172.27.56.34',{'ua': 'Mozilla'}); INSERT INTO becca.action_by_ip(ip, year, week, time, user, project_id, event_id, args) VALUES('172.27.56.34',2020,46,a447b680-278c-11eb-ac37-fa163e4302ba,'test@mail.ru',3,4,{'ua':' Mozilla'}); CQL:
  • 27. Retries on write 27 + One retry per one write request + Retry goes to another DC + Safe retries, thanks to LWW (primary key is ((user, year, week, project_id), time)) DC 1 DC 2 HTTP API
  • 28. read/actions 28 Read actions for the given user in the given time range: curl "http://api.mail.ru/api/v1/read/actions?user=test@mail.ru&project_id=3&ts_min=1605392206&ts_max=160547 8606&limit=1" {"code":206,"body":[{"ts":1605477057,"ip":"172.27.56.34","project_id":3,"id":4,"state":"a447b680-278c- 11eb-ac37-fa163e4302ba","args":{"ua":"Mozilla"}}]} SELECT event_id, time, ip, args FROM becca.events WHERE user = 'test@mail.ru' AND year = 2020 AND week = 46 AND project_id = 3 AND time > maxTimeUUID(1605392205) AND time < minTimeUUID(1605478607) ORDER BY time DESC LIMIT 1 CQL:
• 29. read/actions

Read actions for the given user in the given time range:

curl "http://api.mail.ru/api/v1/read/actions?user=test@mail.ru&project_id=3&ts_min=1605392206&ts_max=1605478606&limit=1"
{"code":206,"body":[{"ts":1605477057,"ip":"172.27.56.34","project_id":3,"id":4,"state":"a447b680-278c-11eb-ac37-fa163e4302ba","args":{"ua":"Mozilla"}}]}

CQL:

SELECT event_id, time, ip, args FROM becca.events
WHERE user = 'test@mail.ru' AND year = 2020 AND week = 46 AND project_id = 3
  AND time > 148c0480-26c7-11eb-bf7f-7f7f7f7f7f7f AND time < 4026f180-2790-11eb-8080-808080808080
ORDER BY time DESC LIMIT 1
• 30. Concurrent reads

+ The time range in a request can vary from 1 second to one month
+ The API breaks a month into weeks and makes concurrent requests to Scylla
+ Tradeoff: more (possibly excessive) concurrent requests to Scylla => faster response time for the API
• 31. 4 concurrent reads: API timings reduced 95% → 2.49x, 99.9% → 2.26x
• 32. Exploiting the promoted index

+ Max partition size we have: 500 000 rows
+ To get predictable response times we put a limit on the number of rows returned, which goes straight into the CQL query
+ If the partition has more than LIMIT rows, code 206 is returned
+ The client can then pass the timeuuid of the last action with the next API call
+ Scylla returns the next portion of rows in no time (thanks to the promoted index)
• 33. Exploiting the promoted index

Read the next portion of actions:

curl "http://api.mail.ru/api/v1/read/actions?user=test@mail.ru&project_id=3&ts_min=1605392206&ts_max=1605478606&limit=1&state=a447b680-278c-11eb-ac37-fa163e4302ba"
{"code":206,"body":{"events":[{"ts":1605477055,"ip":"172.27.56.34","project_id":3,"id":4,"state":"a3168980-278c-11eb-ac35-fa163e4302ba","args":{"rid":"5aa73d","ua":"Mozilla"}}],"state":"a3168980-278c-11eb-ac35-fa163e4302ba"}}

CQL:

SELECT event_id, time, ip, args FROM becca.events
WHERE user = 'test@mail.ru' AND year = 2020 AND week = 46 AND project_id = 3
  AND time > 148c0480-26c7-11eb-bf7f-7f7f7f7f7f7f AND time < a447b680-278c-11eb-ac37-fa163e4302ba
ORDER BY time DESC LIMIT 1
• 34. Retries on read

+ One retry per API request
+ The retry goes to the other DC
+ If the retry also fails, the API gives a partial response (206); the next API request continues where the previous one stopped

[diagram: HTTP API reads from DC 1 and retries to DC 2]
• 35. Tuning gocql

+ Set prefetch to 0.999 to speed up background fetches
+ Implement a custom unmarshaller to optimize allocations

[diagram: pages 1-5 fetched lazily vs prefetched ahead of the reader]
• 36. Using Scylla with HDDs: potential problems and possible solutions to them
• 37. num-io-queues

+ num-io-queues sets the number of threads that interact with the disks
+ You have to find your sweet spot so that throughput is optimal and latencies are OK (Little's Law)
+ 10 HDDs in RAID 10 provide a maximum write concurrency of 5, so set num-io-queues to 4-5
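For reference, the flag is passed to Scylla through its seastar options. An illustrative config fragment; the file path and variable name are assumptions that vary between Scylla versions and packagings, so check your installation:

```shell
# /etc/scylla.d/io.conf (path and variable name are version-dependent)
# 10 HDDs in RAID 10 => effective write concurrency of ~5,
# so cap the number of I/O queues just below that:
SEASTAR_IO="--num-io-queues 4"
```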
• 38. Cluster repairs

+ nodetool repair does not finish in acceptable time (months)
+ nodetool repair overloads the cluster (read latencies grow 4 times)
+ We came up with a more IO-efficient way to repair the cluster in our case
• 40. Cluster repairs

+ nodetool refresh finishes quickly
+ compactions of the new data will be triggered, but the cluster will not be overloaded
+ compactions finish in a couple of hours
+ run nodetool cleanup to remove 'foreign' data
• 41. Cluster repairs

The algorithm is as simple as these commands:

+ ./filter_sstables.sh $range_min $range_max
+ rsync $sstables $dmgd_node/scylla_upload_dir
+ nodetool refresh
• 42. Data full scan

A full scan is an anti-pattern for CQL; however, it can work well if the DB can handle full scans on rare occasions.

Possible use cases:
+ find all unique users that authenticated at least once last month
+ count how many users activated a particular feature last quarter
+ any other case useful for the business
• 43. Data full scan

Let's say we want to do a full scan within a particular time range (1 month, for example). The naive CQL approach takes too long and overloads the cluster. Our way:

1. sstablemetadata to collect the SSTables with data in the given time range
2. sstabledump on every SSTable from step 1 (multiple SSTables in parallel)
3. parse the JSON output of sstabledump in a streaming fashion

Problem: the output of sstabledump is a single large JSON (it will not fit in memory)
• 44. Data full scan: parsing a huge JSON

What we have:
+ a huge JSON object (tens of GBs)
+ any field can be arbitrarily long
+ only a small subset of fields is needed

Solution: implement a custom parser for the regular subset of JSON

{
  "bands": [
    {
      "name": "Metallica",
      "origin": "USA",
      "albums": [ ... ]
    },
    ...
    {
      "name": "Enter Shikari",
      "origin": "England",
      "albums": [ ... ]
    }
  ]
}
• 45. Data full scan: parsing a huge JSON in Go

We need a streaming JSON tokenizer.

encoding/json:
+ provides a streaming JSON tokenizer
+ consumes lots of CPU

gojsonlex (https://github.com/gibsn/gojsonlex):
+ drop-in replacement for the standard encoding/json
+ 2-3 times faster than encoding/json
+ requires a small amount of memory

const (
    searchingForOriginKey = iota
    pendingOriginValue
)

englishArtists := 0
state := searchingForOriginKey

for {
    currToken, err := lexer.Token()
    if err != nil {
        break // io.EOF or a parse error
    }

    switch state {
    case searchingForOriginKey:
        if currToken == "origin" {
            state = pendingOriginValue
        }
    case pendingOriginValue:
        if currToken == "England" {
            englishArtists++
        }
        state = searchingForOriginKey
    }
}
• 46. Data full scan

Pros:
+ most effective use of HDDs (sequential access)
+ finishes in reasonable time (days)
+ excludes network requests (fast and reliable)

Cons:
+ requires some coding
+ has to run on each node (at least in one DC)
• 47. Problems yet to be solved

+ latencies grow during compactions, cleanup and bootstrap
+ latencies grow when a node is down
+ slow bootstrapping
• 48. $150 000 CAPEX saved per 1PB compared to an SSD setup
• 50. Results

We have achieved the following:
+ built a high-load, horizontally scalable service for storing users' actions with Scylla and HDDs
+ the service handles 240 000 writes per second with a 95th-percentile latency of 1.5ms on just a few Scylla nodes
+ implemented an approach to serve reads by a secondary key with predictable performance
+ implemented an approach to do full scans in reasonable time on rare occasions
• 51. Future work

In 2021:
+ a third DC
+ optimize Scylla and the clients to get even better latencies
+ integrate Scylla into more projects
• 52. Special Thanks

I would like to give special thanks to:
+ Dmitry Pavlov, Pavel Buchinchik, Igor Platonov
+ Vladislav Zolotarov, Avi Kivity, Raphael Carvalho
+ The whole ScyllaDB team
  • 54. United States 545 Faber Place Palo Alto, CA 94303 Israel 11 Galgalei Haplada Herzelia, Israel www.scylladb.com @scylladb Thank you