Apache Cassandra Lunch #74: ScyllaDB - Peter Corless

Cassandra 4.0 vs. Scylla Open Source 4.5
Similarities & Differences
Cassandra.lunch
https://github.com/Anant/Cassandra.Lunch

Peter Corless
Director of Technical Advocacy, ScyllaDB
+ Listen to customer stories
+ Write blogs & case studies
+ Play (and design) strategy & roleplaying games

Comparison: Cassandra vs. Scylla
+ Comparing Open Source Releases
+ Software Release Cycles
+ “Tale of the Tape”
+ Beyond Cassandra
+ Features
+ Same-Same, Similarities, Differences
+ Benchmarks

Comparing Open Source Releases
Apache Cassandra Compared to Scylla

Cassandra 4.0 is Finally Here!
+ February 2017: Cassandra 3.11 released
+ May 2019: First Roadmap for Cassandra 4.0 laid out
+ September 2019: First 4.0 Alpha
+ July 2020: First 4.0 Beta
+ April 26, 2021: Cassandra 4.0rc1
+ April 28, 2021: Cassandra 4.0 “World Party”
+ July 27, 2021: Actual Cassandra 4.0 Release Date
+ September 7, 2021: Cassandra 4.0.1
+ >4 Years from last minor release (3.11) to 4.0

Scylla’s Predictable Releases
Aug 2021: Apache Cassandra 4.0 vs. Scylla 4.4: Comparing Performance

Scylla Open Source + Enterprise

Head-to-Head
+ Scylla engineers make ~5x more commits/month
+ Bigger engineering team: >50% more active committers
+ More active release cycle: 13x more major/minor releases over the past 3 years
+ More popular: Scylla exceeds Cassandra in GitHub stars

“Chasing Cassandra”
+ Scylla traditionally trailed Cassandra in feature implementation
+ Playing “catch-up” c. 2016 – 2020
+ Scylla 4.0 went beyond “feature completeness” for Cassandra 3.11
+ Now Scylla has features not found in Cassandra
+ Though Cassandra 4.0 has some features not (yet) present in Scylla
+ Some we’ll add for parity/compatibility
+ Some we’ll go our own way (solve differently, improve or obviate)

Scylla Beyond Cassandra
[Diagram: overlapping feature sets of Cassandra and Scylla]
+ Cassandra “core” / Scylla same-same “core” iteration: same/similar feature implemented identically/intercompatible
+ Scylla specific implementation / Cassandra specific implementation: same/similar feature implemented differently; may or may not be intercompatible
+ Unique to Cassandra: not in Scylla
+ Unique to Scylla: not in Cassandra

What’s the Same Between Scylla & Cassandra?
Commonalities

Common Ancestry
+ Cassandra and Scylla both descend from the same historical antecedents / whitepapers
+ Google’s Bigtable
+ Amazon’s Dynamo
+ Facebook’s Cassandra
+ [Not to be confused with commercial offerings Google Cloud Bigtable and Amazon DynamoDB, or open source Apache Cassandra]

Keyspaces, Tables
+ CREATE KEYSPACE
+ CREATE TABLE
+ ALTER KEYSPACE
+ ALTER TABLE
+ DROP KEYSPACE
+ DROP TABLE
ᐩ Pretty much standard Cassandra Query Language (CQL)
https://xkcd.com/327/

Basic CQL CRUD Operations
+ Create [INSERT]
+ Read [SELECT]
+ Update [UPDATE]
+ Delete [DELETE]
+ WHERE clause
+ ALLOW FILTERING
+ TTL functions
ᐩ Pretty much standard Cassandra Query Language (CQL)
ᐩ Like SQL, at least at a cursory glance, but do not be lulled into a false sense of familiarity

High Availability
+ Peer-to-peer leaderless topology
+ Replication Factor (RF)
+ Tunable consistency per request
+ Multi datacenter replication
+ CAP Theorem: Availability/Partition Tolerant “AP”
ᐩ No primary/replica complications
ᐩ Homogeneity of nodes
ᐩ Full datacenter loss can be survivable
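
The interplay between Replication Factor and per-request consistency can be illustrated with the usual quorum arithmetic. The following is a minimal Python sketch, not driver code: the function names are hypothetical, and it assumes the common rule of thumb that a read and a write are guaranteed to touch at least one common replica when R + W > RF.

```python
def quorum(rf: int) -> int:
    """Replicas that must respond for a QUORUM request: a majority of RF."""
    return rf // 2 + 1


def read_sees_latest_write(read_replicas: int, write_replicas: int, rf: int) -> bool:
    """True when the read set and the write set must overlap on a replica."""
    return read_replicas + write_replicas > rf


# Illustration only: RF=3 with QUORUM reads and QUORUM writes overlaps,
# while ONE/ONE at RF=3 does not.
for rf in (3, 5):
    q = quorum(rf)
    print(f"RF={rf}: QUORUM={q}, overlap={read_sees_latest_write(q, q, rf)}")
print("RF=3, ONE/ONE overlap:", read_sees_latest_write(1, 1, 3))
```

Lower consistency levels trade this overlap guarantee for latency and availability, which is the "tunable" part of the bullet above.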

Ring Architecture
+ Token ring topology
+ Wide column “Key-key value”
+ Partition key
+ Clustering key
+ Nodes/vNodes
+ Automatic sharding
+ Same murmur3 partitioner & hash algorithms
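
How a partition key lands on a node in a token ring can be sketched in a few lines of Python. This is a toy model under stated assumptions, not either database's implementation: `TokenRing` and `token_for` are hypothetical names, MD5 from the standard library stands in for the murmur3 partitioner (so tokens will not match a real cluster), and replicas are chosen by simply walking the ring clockwise.

```python
import bisect
import hashlib
from typing import Dict, List


def token_for(partition_key: str) -> int:
    """Map a partition key to a signed 64-bit token.

    Assumption: MD5 is a stand-in hash here, NOT the murmur3 partitioner.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


class TokenRing:
    """Toy token ring: each node owns one or more tokens (vNodes)."""

    def __init__(self, node_tokens: Dict[str, List[int]]) -> None:
        # Flatten to a list of (token, node) pairs sorted by token.
        self._ring = sorted(
            (token, node) for node, tokens in node_tokens.items() for token in tokens
        )
        self._tokens = [token for token, _ in self._ring]

    def owner(self, token: int) -> str:
        """Node owning the first ring token >= the given token, wrapping around."""
        index = bisect.bisect_left(self._tokens, token) % len(self._ring)
        return self._ring[index][1]

    def replicas(self, token: int, rf: int) -> List[str]:
        """Walk clockwise from the owner, collecting up to rf distinct nodes."""
        if rf <= 0:
            return []
        start = bisect.bisect_left(self._tokens, token)
        found: List[str] = []
        for step in range(len(self._ring)):
            node = self._ring[(start + step) % len(self._ring)][1]
            if node not in found:
                found.append(node)
            if len(found) == rf:
                break
        return found


# Tiny hand-picked tokens so the ring is easy to read; real rings span 64 bits.
ring = TokenRing({"node-a": [-100, 50], "node-b": [0], "node-c": [100]})
print(ring.owner(1), ring.replicas(1, rf=2))
print(ring.owner(token_for("user:42")))
```

Giving `node-a` two tokens mimics vNodes: a node with more tokens owns more, smaller slices of the ring.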

What’s similar but not the same?
Cassandra and Scylla differences

CQL
+ For the most part, all basic CQL queries for Cassandra will work with Scylla
+ Scylla uses the same CQL wire protocol as Cassandra
ᐩ Scylla does implement some features differently (we’ll get into those)
ᐩ Naturally, those differences will have related CQL commands
ᐩ Implementation lag: Scylla is compatible with CQL 3.4.0; current Cassandra CQL is 3.4.5

SSTables
+ Scylla supports the same immutable on-disk SSTable LSM tree file formats
+ Standard compaction algorithms are the same (LCS, STCS, TWCS)
ᐩ Cassandra 4.0 implemented a new “nb” SSTable file format
ᐩ Scylla will add support for “nb” file format #8593
// na (4.0-rc1): uncompressed chunks, pending repair session, isTransient, checksummed sstable metadata file, new Bloomfilter format
// nb (4.0.0): originating host id

Lightweight Transactions (LWT)
+ Both use Paxos consensus algorithm
+ Compare-and-set operations
+ Also called “conditional updates”
ᐩ Scylla can accomplish LWTs in only 3 round trips (Cassandra takes 4)
ᐩ Scylla is more performant / efficient
ᐩ Blog: https://www.scylladb.com/2020/07/15/getting-the-most-out-of-lightweight-transactions-in-scylla/
Scylla accomplishes LWTs in 3 round trips
Cassandra LWTs take 4 round trips
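
The compare-and-set semantics behind a conditional update can be sketched as follows. This is a single-process Python illustration only: `conditional_update` is a hypothetical helper, a plain dict stands in for a row, and the Paxos rounds that make the operation safe across replicas are deliberately left out. The boolean it returns plays the role of the `[applied]` column that CQL reports for a statement such as `UPDATE ... SET col = new IF col = expected`.

```python
from typing import Any, Dict, Tuple


def conditional_update(
    row: Dict[str, Any], column: str, expected: Any, new_value: Any
) -> Tuple[bool, Any]:
    """Set row[column] to new_value only if it currently equals expected.

    Returns (applied, value): whether the write happened, and the value
    stored in the column after the call.
    """
    current = row.get(column)
    if current != expected:
        # Condition failed: leave the row untouched and report what is there.
        return False, current
    row[column] = new_value
    return True, new_value


# Two clients race to claim the same lock row; only the first one wins.
lock_row: Dict[str, Any] = {"owner": None}
print(conditional_update(lock_row, "owner", None, "alice"))
print(conditional_update(lock_row, "owner", None, "bob"))
```

In a real cluster the check and the write must be agreed on by a quorum of replicas, which is where the round trips counted above come from.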

Materialized Views
+ Cassandra: introduced in 3.0 [2017], but still experimental
+ Problems when base table gets out of sync
+ To this day, major issues like CASSANDRA-10346 are still open
ᐩ Scylla: production ready since 3.0 [Jan 2019]
ᐩ Serve as the infrastructural basis for Secondary Indexes
ᐩ Can still get out of sync, but not easily
ᐩ Continually improving implementation
“If you have them, take them out.”
Nate McCall, PMC Chair, on Materialized Views in Cassandra [2018]*
* Read more: https://www.scylladb.com/2018/09/19/overheard-at-distributed-data-summit/

Secondary Indexes
+ Cassandra: only local Secondary Indexes (SIs)
+ Scylla: both local and global SIs
+ The choice is now yours!
ᐩ https://www.scylladb.com/2019/07/23/global-or-localsecondary-indexes-in-scylla-the-choice-is-now-yours/
A global indexing query workflow in Scylla

Change Data Capture (CDC)
+ Introduced in C* 3.8, uses a commitlog-like structure
+ Creates indexes as commit logs are written, for improved performance and reliability
+ Feature enabled through cassandra.yaml
+ CDC can be enabled per table through ALTER TABLE command
+ Currently, no standard way to read CDC files
+ DS planning to open source Kafka Source connector
+ Advanced replication from DS Labs
+ Example CDC project built by someone
CDC in Scylla
ᐩ Implemented as standard CQL Tables
ᐩ Just like adding another table
ᐩ Enabled by default
ᐩ Easy to integrate & consume
+ Deltas (changes) plus pre/post image
+ Replicated in same way as normal data
ᐩ Reasonable overhead
ᐩ TTL prevents unbounded data
ᐩ Easily consumable by Apache Kafka

Kafka CDC Source Connector
+ Debezium-based
+ Simply consumes CDC data via CQL
+ Doesn’t need to de-dupe data
+ Pumps data into Kafka topics
+ Confluent-certified
+ Less muss & fuss

Zero Copy Streaming vs. Row-level Repair
+ Cassandra now can stream SSTables as a whole
+ Bypasses turning SSTables into objects (aka “object reification”) providing 5x better performance
ᐩ Scylla implemented a completely different approach in 2019
ᐩ Scylla’s row-level repair feature is used instead of streaming
ᐩ Row-level repair is more:
○ Robust: Better able to endure interruptions and outages
○ Granular: Only specific rows are transferred
○ Efficient: There’s no extra data streaming!

Netty Async Messaging
+ C* 4.0 integrates async-driven code from the Netty library for communication between nodes to leverage Java’s Non-Blocking IO (NIO) capability.
+ A single thread pool for all connections to corresponding nodes instead of maintaining N threads per peer.
+ Potentially improves internode performance, providing better tail latencies and facilitating zero-copy streaming.
ᐩ Scylla also believes in non-blocking IO
ᐩ Scylla uses asynchronous / non-blocking I/O in C++ (aio) with its own schedulers
ᐩ Scylla per-core shards maintain as great a shared-nothing approach as possible; use async messaging when needed
ᐩ Read: https://www.scylladb.com/2021/09/15/what-weve-learned-after-6-years-of-io-scheduling/

Kubernetes Support & Sidecars
+ Plethora of K8s operators
+ DataStax K8ssandra 1.3+
+ Orange CassKop 2.0+
+ Bitnami Charts
+ [cass-operator deprecated]
+ Sidecars collocated/run on the same instance as the DB server daemon
+ What Works and What Doesn’t: https://k8ssandra.io/blog/articles/kubernetes-and-apache-cassandra-what-works-and-what-doesnt/
ᐩ Scylla Operator offers great K8s support: it just works
ᐩ Scylla Manager Agent is a sidecar and already included by default with Scylla Operator
ᐩ https://www.scylladb.com/product/scylla-operator-kubernetes/

What’s Just Totally Different?
Cassandra and Scylla differences

Shard-per-Core Architecture
+ Based on the Seastar framework (also used in Redpanda, Red Hat Crimson)
+ Designed/optimized for multicore systems (scales to 100+ CPUs per node)
ᐩ Cassandra is shard-per-node
ᐩ Scylla balances data with more granularity

DynamoDB-compatible API (Alternator)
+ Run your DynamoDB-compatible workloads anywhere:
+ on AWS or in an AWS Outpost
+ on Google Cloud, Azure, or
+ on-premises
+ Supports DynamoDB Streams
+ Supports Load Balancing
+ Scylla Spark Migrator to move data to any Scylla cluster anywhere
ᐩ Cassandra has no comparable feature

Seedless Gossip
+ Gossip in Cassandra requires seed nodes, which violates the idea of homogeneity of nodes
+ Requires manual assignment and configuration
+ Seed nodes do not bootstrap
+ Complicated to add new seed node or replace a dead seed node
ᐩ Scylla implemented gossip without requiring seed nodes
ᐩ More symmetric; less problematic
ᐩ Read more: https://www.scylladb.com/2020/09/22/seedless-nosql-getting-rid-of-seed-nodes-in-scylla/

Benchmarking:
Cassandra 4.0 vs Scylla 4.4
and how Scylla dominates

Cassandra 4.0 vs. Scylla 4.4
+ Scylla up to 100x lower P99 latencies
+ Scylla can maintain 2x - 5x throughput
+ Scylla adds nodes 3x faster

Scylla 4.4 vs. Cassandra 4.0
+ Cassandra 4.0 cannot maintain usable low latencies except at very low throughput (≤30-40k ops)
+ Scylla can maintain low latencies for far greater throughputs (≤170-180k ops)

Replacing a Node
+ Scylla can heal clusters far faster than Cassandra 4.0 by spinning nodes up and rebalancing data ~3x - 4x faster

Doubling Cluster Capacity
+ Scylla doubled a cluster’s capacity in just over an hour and a half (94 minutes)
+ It took Cassandra 4.0 just shy of 4 hours (238 minutes) to perform the same task
+ Scylla performed 2.5X faster
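
The 2.5X figure follows directly from the two durations quoted on this slide. A short Python check, where `speedup` is a hypothetical helper name and the inputs are the slide's own numbers:

```python
def speedup(slower_minutes: float, faster_minutes: float) -> float:
    """How many times faster the quicker run was."""
    return slower_minutes / faster_minutes


cassandra_minutes = 238  # from the slide: "just shy of 4 hours"
scylla_minutes = 94      # from the slide: "just over an hour and a half"

print(f"Speedup: {speedup(cassandra_minutes, scylla_minutes):.2f}x")
```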

Major Compaction Speed
+ Scylla 4.4: 36 min on a 3-node cluster
+ Cassandra 4.0 took 36x - 63x as long (nearly a day; or a day and a half!)
+ Cassandra 4.0 performed worse than Cassandra 3.11 with num_tokens: 16
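
The "nearly a day; or a day and a half" remark can be checked by scaling the 36-minute baseline by the two multipliers on the slide. A small Python sketch, with `scaled_duration_hours` as a hypothetical helper name:

```python
def scaled_duration_hours(base_minutes: float, factor: float) -> float:
    """Duration in hours of a run that takes `factor` times the baseline."""
    return base_minutes * factor / 60


scylla_minutes = 36  # from the slide
for factor in (36, 63):  # from the slide: "36x - 63x as long"
    hours = scaled_duration_hours(scylla_minutes, factor)
    print(f"{factor}x as long: {hours:.1f} hours")
```

The two results work out to roughly 21.6 and 37.8 hours, which matches the slide's wording.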

TCO Comparison: 4 vs. 40
+ 4x i3.metal instances with Scylla provided the same or better performance as 40 nodes of Cassandra on i3.4xlarge
+ Cassandra had 640 vCPUs
+ Scylla had 288 vCPUs
+ Scylla got better utility out of hardware
+ Cost savings of 60%
+ Administrative burden/attack surface reduced by 90%
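
The node and vCPU figures on this slide are mutually consistent, which a few lines of Python can confirm. This sketch uses only the slide's numbers; the helper names are hypothetical, and the 60% cost saving is not recomputed because instance prices are not given here.

```python
def per_node(total_vcpus: int, nodes: int) -> float:
    """vCPUs per node implied by a cluster total."""
    return total_vcpus / nodes


def reduction(before: float, after: float) -> float:
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before


cassandra_nodes, cassandra_vcpus = 40, 640  # from the slide
scylla_nodes, scylla_vcpus = 4, 288         # from the slide

print("Cassandra vCPUs per node:", per_node(cassandra_vcpus, cassandra_nodes))
print("Scylla vCPUs per node:", per_node(scylla_vcpus, scylla_nodes))
print(f"Fewer nodes to administer: {reduction(cassandra_nodes, scylla_nodes):.0%}")
print(f"Fewer vCPUs: {reduction(cassandra_vcpus, scylla_vcpus):.0%}")
```

The 90% administrative-burden figure corresponds to the drop in node count from 40 to 4, assuming burden scales with the number of nodes.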

Published Benchmarks
BLOGS
+ Benchmark, Part 1: Cassandra 4.0 vs. Cassandra 3.11: Comparing Performance
+ Benchmark, Part 2: Apache Cassandra 4.0 vs. Scylla 4.4: Comparing Performance
+ Webinar: Your Questions about Cassandra 4.0 vs. Scylla 4.4 Answered
WEBINAR
+ Comparing Apache Cassandra 4.0, 3.0 and ScyllaDB

FREE Virtual Training Event
AMERICAS - Tuesday, Nov 9th
9AM-1PM PT | 12PM-4PM ET | 1PM-5PM BRT
EMEA and APAC - Wednesday, Nov 10th
8:00-12:00 UTC | 9AM-1PM CET | 1:30PM-5:30PM IST
https://lp.scylladb.com/university-live-2021-11-registration

Learn NoSQL for free!
university.scylladb.com

United States
2445 Faber St, Suite #200
Palo Alto, CA USA 94303
Israel
Maskit 4
Herzliya, Israel 4673304
www.scylladb.com
@scylladb
@petercorless
