Cassandra Exports as a Trivially Parallelizable Problem (Emilio Del Tessandoro, Spotify) | Cassandra Summit 2016

Cassandra exports as a trivially parallelizable problem
Emilio Del Tessandoro
Spotify

Agenda
1 The problem
2 Introducing Cassandra-Hezo
3 Wrap up
© DataStax, All Rights Reserved.

About Emilio
● From Lucca, Italy
● Studied (Theoretical) Computer Science
● Software Engineer at Spotify
● Started 6 months ago!

What does Emilio do at Spotify?
● Part of the bases team
● Making sure that data is reliably stored, backed up and restore tested
● Advising and creating tools for operating Cassandra
Like:
● Cassandra Reaper (last year's talk)
● Hecuba (other talk)
● Cassandra-Hezo (this talk)

The problem
...of exporting terabytes of data from a distributed database

The problem
We want to export all data from a distributed database.
A lot of open problems in this area, but we are eventually consistent... :)
Export data like:
● Playlists
● Financially relevant information
● Various kinds of user generated content
To be able to quickly analyze it.

So, what’s out there?
Not much for batch processing (although there are streaming solutions).
● SELECT * is not enough
● COPY is not enough
● A bunch of small GitHub projects

And at Spotify?
cass2hdfs
● Not too bad, but very fragile
● Involves shipping SSTables to Hadoop
● Custom parsing and Avro conversion in MapReduce jobs
● Runtime is dependent on the SSTable size

How we would like to solve it
● No impact on the source cluster
● Cassandra version agnostic
● Point in time snapshot
● Horizontally Scalable
● Composable (easy to understand and test)
● Possibly incremental

Let’s start with this...
● We don’t want to impact the source cluster
➔ So we need to have data off the source cluster quickly
● But we also want to be horizontally scalable
➔ So we actually need to be able to get the data to multiple machines quickly

Also...
● We want to avoid custom parsing code
➔ So we need to use Cassandra read path, on those machines
● But SELECT * is too expensive
➔ So we need to make data more local
➔ SELECT * WHERE token(pk) < X AND token(pk) > Y
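A minimal sketch of how the token-range idea above could be turned into per-worker queries. It assumes the default Murmur3Partitioner token bounds; the table name `ks.playlists`, the key `user_id`, and both helper functions are hypothetical and not taken from the tool.

```python
# Murmur3Partitioner token bounds (Cassandra's default partitioner).
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1


def split_ring(n):
    """Split the full token ring into n contiguous (start, end] ranges."""
    span = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + (span * i) // n for i in range(n)] + [MAX_TOKEN]
    return list(zip(bounds[:-1], bounds[1:]))


def range_query(table, pk, start, end, first=False):
    """Build the CQL for one range. Table and key names are placeholders."""
    lower = ">=" if first else ">"  # only the first range includes its lower bound
    return (f"SELECT * FROM {table} "
            f"WHERE token({pk}) {lower} {start} AND token({pk}) <= {end}")


queries = [
    range_query("ks.playlists", "user_id", lo, hi, first=(i == 0))
    for i, (lo, hi) in enumerate(split_ring(4))
]
for q in queries:
    print(q)
```

Each query touches a disjoint slice of the ring, so the slices can be handed to independent processes without coordination.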

Cassandra-Hezo architecture
[Diagram: three clones of the source data, each running SELECT * WHERE queries and a CQL -> Avro conversion]

In case you didn’t know
Spotify is now using GCP (Google Cloud Platform).
news.spotify.com/us/2016/02/23/announcing-spotify-infrastructures-googley-future

How to clone storage in GCP
With Persistent Disks (PDs)!
Interesting features like:
● PD snapshotting
● PD creating (from snapshot)
● PD attaching
Snapshotting, creating and attaching together give us cloning!
PD snapshots are incremental!
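The snapshot, create and attach steps can be sketched as three `gcloud` invocations per disk. The talk does not say which API or CLI the tool actually uses, so treat this as an assumption; the function name and all disk, snapshot, worker and zone names are hypothetical. The commands are only built here, not executed.

```python
def clone_commands(source_disk, snapshot, clone_disk, worker, zone):
    """Return the three gcloud invocations that clone one disk onto a worker."""
    return [
        # 1. Snapshot the source node's persistent disk.
        ["gcloud", "compute", "disks", "snapshot", source_disk,
         "--snapshot-names", snapshot, "--zone", zone],
        # 2. Create a new disk from that snapshot.
        ["gcloud", "compute", "disks", "create", clone_disk,
         "--source-snapshot", snapshot, "--zone", zone],
        # 3. Attach the new disk to a dump machine.
        ["gcloud", "compute", "instances", "attach-disk", worker,
         "--disk", clone_disk, "--zone", zone],
    ]


cmds = clone_commands("cass-node-1-data", "export-snap-1",
                      "export-disk-1", "dump-worker-1", "europe-west1-d")
for argv in cmds:
    print(" ".join(argv))
```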

Why one-node clusters?
● No need for internode communications
● Easier setup
● No need to attach everything to everything
● Perfect setup for even further read-tuning
● The perfect distributed application!

Implementation
● An orchestrator written in Python.
● “Just” a state machine with a bunch of external binaries.
● Super fine grain parallelization (file descriptors and I/O events).
Fewer than 2000 lines of code, including everything from start to end.
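A minimal sketch of a state machine driving external commands, in the spirit of the orchestrator described above. The stage names are hypothetical (the talk does not list the real states), and this version polls child processes in a loop rather than reacting to file descriptor events, which is a simplification.

```python
import subprocess
import sys
import time

# Hypothetical stage names; the real tool's states are not listed in the talk.
STAGES = ["snapshot", "create_disk", "attach_disk", "export"]


class Worker:
    """Tracks one worker's progress through STAGES."""

    def __init__(self, name, commands):
        self.name = name
        self.commands = commands  # one argv list per stage
        self.stage = 0
        self.proc = None
        self.state = "pending"

    def step(self):
        """Advance without blocking: start, poll, or finish the current stage."""
        if self.state in ("done", "failed"):
            return
        if self.proc is None:
            self.proc = subprocess.Popen(self.commands[self.stage])
            self.state = STAGES[self.stage]
            return
        code = self.proc.poll()
        if code is None:
            return  # still running
        if code != 0:
            self.state = "failed"
            return
        self.proc = None
        self.stage += 1
        if self.stage == len(STAGES):
            self.state = "done"


def run(workers, interval=0.05):
    """Drive all workers concurrently until each is done or failed."""
    while any(w.state not in ("done", "failed") for w in workers):
        for w in workers:
            w.step()
        time.sleep(interval)
    return {w.name: w.state for w in workers}


# Stand-in commands so the sketch runs anywhere; real stages would call external binaries.
ok = [sys.executable, "-c", "pass"]
bad = [sys.executable, "-c", "raise SystemExit(1)"]
result = run([Worker("w1", [ok] * 4), Worker("w2", [ok, bad, ok, ok])])
print(result)
```

Because every `step` call returns immediately, one loop can keep many workers moving at once, and a failure in one worker does not stall the others.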

Looking back at cass2hdfs
✓ We now use Cassandra read path
✓ Robust to topology changes
✓ We can easily dump single tables and exclude columns
✓ No need for a worker to see all the data
✓ Much less Cassandra specific code
✓ Automatic CQL -> Avro conversion
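To illustrate what an automatic CQL -> Avro conversion with column exclusion might look like, here is a minimal sketch that derives an Avro record schema from a table's column types. The type mapping is an illustrative subset chosen for this example, and `avro_schema` and the column names are hypothetical; none of it is taken from the tool.

```python
import json

# Illustrative subset of a CQL -> Avro type mapping (an assumption, not the tool's table).
CQL_TO_AVRO = {
    "text": "string",
    "int": "int",
    "bigint": "long",
    "boolean": "boolean",
    "double": "double",
    "float": "float",
    "blob": "bytes",
    "uuid": "string",
    "timestamp": {"type": "long", "logicalType": "timestamp-millis"},
}


def avro_schema(table, columns, exclude=()):
    """Build an Avro record schema; every field is nullable, as CQL columns can be unset."""
    fields = [
        {"name": name, "type": ["null", CQL_TO_AVRO[cql_type]], "default": None}
        for name, cql_type in columns
        if name not in exclude
    ]
    return {"type": "record", "name": table, "fields": fields}


schema = avro_schema(
    "playlists",
    [("user_id", "uuid"), ("name", "text"), ("updated", "timestamp"), ("secret", "blob")],
    exclude={"secret"},
)
print(json.dumps(schema, indent=2))
```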

And back to our requirements
✓ No impact on the source cluster
✓ Cassandra version agnostic
✓ Point in time snapshot
✓ Horizontally Scalable
✓ Composable (easy to understand and test)
✗ Partially incremental

Performance
         Cassandra size  Output size  Avg row size  Export time  Workers  Total processes  Export cost
Small    415GiB          290GiB       57B           ~40min       16       128              $18
Medium   530GiB          58GiB        124B          ~70min       24       192              $30
Large    12.8TiB         2.7TiB       730B          ~80min       32       256              ~$75
● Around 10x faster than our previous solution.
● Without any tuning.
● Without even fully utilizing the dump machines!
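A few derived figures can be worked out from the performance numbers above. This is plain arithmetic on the slide's values; the export times are approximate on the slide, so the throughput figures are rough. In all three runs the process count works out to 8 per worker.

```python
GIB_PER_TIB = 1024

# (name, cassandra_size_gib, export_minutes, workers, total_processes)
runs = [
    ("Small", 415, 40, 16, 128),
    ("Medium", 530, 70, 24, 192),
    ("Large", 12.8 * GIB_PER_TIB, 80, 32, 256),
]

derived = {}
for name, size_gib, minutes, workers, processes in runs:
    derived[name] = {
        "processes_per_worker": processes // workers,
        "gib_per_minute": size_gib / minutes,
        "gib_per_minute_per_worker": size_gib / minutes / workers,
    }
    print(name, derived[name])
```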

Wrapping up
● We can now dump our biggest cluster in less than 1 hour.
● A synergy of Cassandra and GCP snapshots.
● Developed in ~2 months by 4 people.
● Working on deployment and deprecation of the old tool.
Cassandra specific, but maybe possible for other databases.
