Cassandra exports as a trivially parallelizable problem
Emilio Del Tessandoro
Spotify
Agenda
1 The problem
2 Introducing Cassandra-Hezo
3 Wrap up
© DataStax, All Rights Reserved.
About Emilio
● From Lucca, Italy
● Studied (Theoretical) Computer Science
● Software Engineer at Spotify
● Started 6 months ago!
What does Emilio do at Spotify?
● Part of the bases team
● Making sure that data is reliably stored, backed up and restore-tested
● Advising and creating tools for operating Cassandra
Like:
● Cassandra Reaper (last year's talk)
● Hecuba (other talk)
● Cassandra-Hezo (this talk)
The problem
...of exporting terabytes of data from a distributed database
The problem
We want to export all data from a distributed database.
A lot of open problems in this area, but we are eventually consistent... :)
Export data like:
● Playlists
● Financially relevant information
● Various kinds of user generated content
To be able to quickly analyze it.
So, what’s out there?
Not much for batch processing (although there are streaming solutions).
● SELECT * is not enough
● COPY is not enough
● A bunch of small GitHub projects
And at Spotify?
cass2hdfs
● Not too bad, but very fragile
● Involves shipping SSTables to Hadoop
● Custom parsing and Avro conversion in MapReduce jobs
● Runtime is dependent on the SSTable size
How we would like to solve it
● No impact on the source cluster
● Cassandra version agnostic
● Point in time snapshot
● Horizontally Scalable
● Composable (easy to understand and test)
● Possibly incremental
Introducing Cassandra-Hezo
Let’s start with this...
● We don’t want to impact the source cluster
➔ So we need to get the data off the source cluster quickly
● But we also want to be horizontally scalable
➔ So we actually need to be able to get the data to multiple machines quickly
Also...
● We want to avoid custom parsing code
➔ So we need to use the Cassandra read path, on those machines
● But SELECT * is too expensive
➔ So we need to make data more local
➔ SELECT * WHERE token(pk) < X AND token(pk) > Y
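The token-range idea above can be sketched as follows. This is a minimal illustration, not Cassandra-Hezo's code: it assumes the default Murmur3Partitioner (tokens from -2^63 to 2^63 - 1), and the names `split_token_ring`, `range_query`, `my_keyspace` and `my_table` are hypothetical. It also spells out the FROM clause that the slide shorthand leaves out.

```python
# Token bounds of Murmur3Partitioner (an assumption about the cluster setup).
MIN_TOKEN = -(2**63)
MAX_TOKEN = 2**63 - 1


def split_token_ring(parts):
    """Split the full token ring into `parts` contiguous, non-overlapping
    ranges. Each range is a (start, end) pair, inclusive on both ends."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    total = MAX_TOKEN - MIN_TOKEN + 1  # number of distinct tokens, 2**64
    bounds = [MIN_TOKEN + (total * i) // parts for i in range(parts + 1)]
    return [(bounds[i], bounds[i + 1] - 1) for i in range(parts)]


def range_query(keyspace, table, pk, start, end):
    """Build the CQL text that reads one token range of a table."""
    return (
        f"SELECT * FROM {keyspace}.{table} "
        f"WHERE token({pk}) >= {start} AND token({pk}) <= {end}"
    )


if __name__ == "__main__":
    # Hypothetical keyspace/table/partition key names, for illustration only.
    for start, end in split_token_ring(4):
        print(range_query("my_keyspace", "my_table", "pk", start, end))
```

Because the ranges never overlap, each query can run in its own process without coordinating with the others, which is what makes the export trivially parallelizable.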
Cassandra-Hezo architecture
[Diagram: three "clone" arrows, each leading to "SELECT * WHERE" queries followed by "CQL -> Avro" conversion.]
In case you didn’t know
Spotify is now using GCP (Google Cloud Platform).
news.spotify.com/us/2016/02/23/announcing-spotify-infrastructures-googley-future
How to clone storage in GCP
With Persistent disks (PDs)!
Interesting features like:
● PD snapshotting
● PD creating (from snapshot)
● PD attaching
➔ cloning!
PD Snapshots are incremental!
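The three PD features map onto three `gcloud` commands. The sketch below only builds the command lines; the talk does not say whether Cassandra-Hezo uses the CLI or the API, so treat this as one possible way to do it. The function name `clone_disk_commands` and all disk, snapshot, worker and zone names are hypothetical.

```python
def clone_disk_commands(source_disk, snapshot, clone_disk, worker, zone):
    """Return the gcloud invocations that clone one persistent disk
    onto a worker machine: snapshot, create from snapshot, attach."""
    return [
        # 1. PD snapshotting: take a snapshot of the source disk.
        ["gcloud", "compute", "disks", "snapshot", source_disk,
         f"--snapshot-names={snapshot}", f"--zone={zone}"],
        # 2. PD creating: make a new disk from that snapshot.
        ["gcloud", "compute", "disks", "create", clone_disk,
         f"--source-snapshot={snapshot}", f"--zone={zone}"],
        # 3. PD attaching: attach the new disk to the export worker.
        ["gcloud", "compute", "instances", "attach-disk", worker,
         f"--disk={clone_disk}", f"--zone={zone}"],
    ]


if __name__ == "__main__":
    # Hypothetical resource names, for illustration only.
    for argv in clone_disk_commands(
        "cassandra-node-1-data", "export-snap-1",
        "export-clone-1", "export-worker-1", "europe-west1-b",
    ):
        print(" ".join(argv))
```

Each list can be passed to `subprocess.run(argv, check=True)` on a machine with an authenticated `gcloud` installation.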
Why one node clusters
[Diagram: three "clone" arrows, each feeding its own "SELECT * WHERE" and "CQL -> Avro" step.]
● No need for internode communications
● Easier setup
● No need to attach everything to everything
● Perfect setup for even further read-tuning
● The perfect distributed application!
Implementation
● An orchestrator written in Python.
● “Just” a state machine with a bunch of external binaries.
● Super fine-grained parallelization (file descriptors and I/O events).
Less than 2000 lines of code.
Including everything, from start to end.
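The general pattern described on this slide (a small state machine that drives external binaries and reacts to I/O events on their file descriptors) can be sketched with the Python standard library. This is not the Cassandra-Hezo orchestrator: the function name `run_all`, the states RUNNING/DONE/FAILED and the demo commands are assumptions made for illustration, and the sketch relies on POSIX pipes.

```python
import os
import selectors
import subprocess
import sys


def run_all(commands):
    """Run all commands concurrently and return {name: (state, output)}.

    `commands` maps a job name to an argv list. Each job moves through
    the states RUNNING -> DONE or RUNNING -> FAILED.
    """
    selector = selectors.DefaultSelector()
    jobs = {}
    for name, argv in commands.items():
        proc = subprocess.Popen(argv, stdout=subprocess.PIPE)
        jobs[name] = {"proc": proc, "state": "RUNNING", "output": bytearray()}
        selector.register(proc.stdout, selectors.EVENT_READ, data=name)

    running = len(jobs)
    while running:
        # Block until at least one child has output or has closed its pipe.
        for key, _events in selector.select():
            job = jobs[key.data]
            chunk = os.read(key.fd, 65536)
            if chunk:
                job["output"] += chunk
                continue
            # An empty read means EOF: the child closed stdout, so reap it.
            selector.unregister(key.fileobj)
            key.fileobj.close()
            exit_code = job["proc"].wait()
            job["state"] = "DONE" if exit_code == 0 else "FAILED"
            running -= 1
    selector.close()
    return {n: (j["state"], bytes(j["output"])) for n, j in jobs.items()}


if __name__ == "__main__":
    # Stand-ins for real external binaries.
    demo = {
        "ok": [sys.executable, "-c", "print('exported')"],
        "bad": [sys.executable, "-c", "raise SystemExit(3)"],
    }
    for name, (state, output) in run_all(demo).items():
        print(name, state, output)
```

A single thread can supervise many child processes this way, since it only wakes up when one of them has something to report.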
Looking back at cass2hdfs
✓ We now use the Cassandra read path
✓ Robust to topology changes
✓ We can easily dump single tables and exclude columns
✓ No need for a worker to see all the data
✓ Much less Cassandra-specific code
✓ Automatic CQL -> Avro conversion
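The slides do not describe how the CQL -> Avro conversion works. As a rough sketch of what an automatic mapping could look like, the code below derives an Avro record schema from a list of column names and CQL types. The function name `avro_schema`, the choice of covered types and the example table are assumptions; only the CQL type names and Avro type names themselves are standard.

```python
import json

# A small, deliberately incomplete mapping from CQL types to Avro types.
CQL_TO_AVRO = {
    "boolean": "boolean",
    "int": "int",
    "bigint": "long",
    "float": "float",
    "double": "double",
    "text": "string",
    "varchar": "string",
    "blob": "bytes",
    "uuid": {"type": "string", "logicalType": "uuid"},
    "timestamp": {"type": "long", "logicalType": "timestamp-millis"},
}


def avro_schema(record_name, columns):
    """Build an Avro record schema from (column_name, cql_type) pairs.

    Every field is a union with "null", since a CQL column may be unset.
    """
    fields = []
    for column, cql_type in columns:
        if cql_type not in CQL_TO_AVRO:
            raise ValueError(f"unsupported CQL type: {cql_type}")
        fields.append({
            "name": column,
            "type": ["null", CQL_TO_AVRO[cql_type]],
            "default": None,
        })
    return {"type": "record", "name": record_name, "fields": fields}


if __name__ == "__main__":
    # Hypothetical table layout, for illustration only.
    schema = avro_schema("playlist", [
        ("id", "uuid"),
        ("name", "text"),
        ("created", "timestamp"),
    ])
    print(json.dumps(schema, indent=2))
```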
And back to our requirements
✓ No impact on the source cluster
✓ Cassandra version agnostic
✓ Point in time snapshot
✓ Horizontally Scalable
✓ Composable (easy to understand and test)
✗ Partially incremental
Performance
                 Small     Medium    Large
Cassandra size   415GiB    530GiB    12.8TiB
Output size      290GiB    58GiB     2.7TiB
Avg row size     57B       124B      730B
Export time      ~40min    ~70min    ~80min
Workers          16        24        32
Total processes  128       192       256
Export cost      $18       $30       ~$75
● Around 10x faster than our previous solution.
● Without any tuning.
● Without even fully utilizing the dump machines!
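A few derived figures follow directly from the performance numbers. The sketch below computes processes per worker, export throughput and cost per TiB from the Cassandra size, export time, workers, total processes and export cost rows. The export times and the Large cost are given as approximate on the slide, so the results are rough; the names `RUNS` and `derived` are made up for this example. In all three runs the ratio of total processes to workers comes out at 8.

```python
# Figures copied from the performance slide (sizes converted to GiB).
RUNS = {
    "Small": {"size_gib": 415.0, "minutes": 40, "workers": 16,
              "processes": 128, "cost_usd": 18},
    "Medium": {"size_gib": 530.0, "minutes": 70, "workers": 24,
               "processes": 192, "cost_usd": 30},
    "Large": {"size_gib": 12.8 * 1024, "minutes": 80, "workers": 32,
              "processes": 256, "cost_usd": 75},
}


def derived(run):
    """Compute rough per-run ratios from the slide's raw numbers."""
    return {
        "processes_per_worker": run["processes"] // run["workers"],
        "gib_per_minute": run["size_gib"] / run["minutes"],
        "usd_per_tib": run["cost_usd"] / (run["size_gib"] / 1024),
    }


if __name__ == "__main__":
    for name, run in RUNS.items():
        d = derived(run)
        print(
            f"{name}: {d['processes_per_worker']} processes/worker, "
            f"{d['gib_per_minute']:.1f} GiB/min, "
            f"${d['usd_per_tib']:.2f}/TiB"
        )
```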
Wrapping up
● We can now dump our biggest cluster in less than 1 hour.
● A synergy of Cassandra and GCP snapshots.
● Developed in ~2 months by 4 people.
● Working on deployment and deprecation of the old tool.
Cassandra-specific, but maybe possible for other databases.
Thanks!
Questions?

Cassandra Exports as a Trivially Parallelizable Problem (Emilio Del Tessandoro, Spotify) | Cassandra Summit 2016
