Bulk Loading into Cassandra
What are we talking about today?
• Problem statement
• Possible Solutions
– cqlsh COPY FROM
– Custom code using SSTable formatted files
– Java CQL INSERTs
• Test Results
• Unloading considerations
© 2015. All Rights Reserved.
The problem is simple…
Load a pile of files into Cassandra
• Where do the files start out?
– “On my Laptop/Server’s local file system”
• The focus today!
– “In HDFS (or another DFS)”
• Consider using Spark – not the topic today
– “In an NFS mount”
• Consider using Spark – not the topic today
The Options
• The “Front Door”
– cqlsh COPY FROM
– Java program loading via INSERT statements and executeAsync()
• Or the language of your choice: C/C++, Python, C#, etc.
• The “Side Door”
– Leverage “streaming” via sstableloader
– Need to create SSTables via Java and CQLSSTableWriter
• No other language choice
The Front Door: CQL INSERT
Cassandra Write Path
[Diagram: Cassandra write path. Labels: Coordinator, Commit Log, Memtable, SSTable (several); arrows marked “Synchronously” and “Periodically”.]
Cassandra Clients
• Load Balancing
– Prepared Statements
– Token-aware routing
– Round-robin
• Connections per Cassandra host
– The driver connects to every Cassandra host in the “local” data center
• Synchronous / Asynchronous Execution
– How many “in-flight queries”?
• Consistency Level
• Does not require all nodes to be online
– Standard Cassandra rules apply – hinted handoff, etc
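The “in-flight queries” question can be illustrated with a minimal sketch. It assumes a hypothetical `insert_row` callable standing in for a real driver call (no Cassandra driver is used here); `load_rows`, `max_in_flight`, and `workers` are illustrative names, not driver options.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def load_rows(rows, insert_row, max_in_flight=4, workers=8):
    """Submit every row asynchronously with at most max_in_flight pending."""
    slots = threading.Semaphore(max_in_flight)  # one permit per in-flight insert
    lock = threading.Lock()
    stats = {"in_flight": 0, "peak": 0, "done": 0, "failed": 0}

    def task(row):
        with lock:
            stats["in_flight"] += 1
            stats["peak"] = max(stats["peak"], stats["in_flight"])
        outcome = "failed"
        try:
            insert_row(row)  # hypothetical stand-in for a driver INSERT
            outcome = "done"
        except Exception:
            pass  # a real loader would log or retry here
        finally:
            with lock:
                stats["in_flight"] -= 1
                stats[outcome] += 1
            slots.release()  # free the permit for the next row

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for row in rows:
            slots.acquire()  # blocks the reader while the window is full
            pool.submit(task, row)
    # Leaving the "with" block waits for all submitted inserts to finish.
    return stats
```

A larger window keeps the cluster busier but holds more pending rows in client memory; the right value depends on row size and cluster capacity.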
cqlsh COPY FROM
• Command-line CQL tool
• Ships with Cassandra
– Can be run from a client machine (versions must match)
• Built in Python
– In 2.1 leverages the Python driver
– No “token aware routing” yet
• Only makes connection to one coordinator
– Does not round-robin
• Executes CQL INSERTs asynchronously
Java client – e.g., cassandra-loader
(https://github.com/brianmhess/cassandra-loader)
• Java program leveraging the Java CQL driver
– Java driver (and others) provided by DataStax
• Connects to every node in the cluster
– Potentially multiple times per node
• Variety of driver options
– Load balancing – TokenAwarePolicy, DCAwareRoundRobinPolicy, etc
– Connections per host
– Consistency Level, etc
• Asynchronous execution
– Or Synchronous – e.g., for “DDL” operations
– Aside: cassandra-loader uses asynchronous execution (no DDL)
The Side Door: “Streaming”
Streaming – the Client
• A connection to each Cassandra node
– Along with token range information
• For each file
– Read records
– Determine which nodes own the token range for this record
– Send the record to those nodes
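The “determine which nodes own the token range” step can be sketched as a lookup on a sorted token ring. This is a toy model under stated assumptions: `toy_token` is a stand-in hash (not Cassandra's Murmur3 partitioner), the ring and node names are made up, and replica placement simply walks clockwise to the next distinct nodes.

```python
import bisect
import hashlib


def toy_token(partition_key: str) -> int:
    """Stand-in hash for illustration only; NOT the Murmur3 partitioner."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


def replicas_for(token, ring, replication_factor):
    """Return the nodes that own `token`.

    ring: list of (token, node) pairs sorted by token. A node owns the range
    ending at its own token, so the primary owner is the first node whose
    token is >= the record's token, wrapping around at the end of the ring.
    """
    tokens = [t for t, _ in ring]
    start = bisect.bisect_left(tokens, token) % len(ring)
    owners = []
    for i in range(len(ring)):
        node = ring[(start + i) % len(ring)][1]
        if node not in owners:
            owners.append(node)
        if len(owners) == replication_factor:
            break
    return owners


# Hypothetical three-node ring with one token per node.
ring = [(-100, "node-a"), (0, "node-b"), (100, "node-c")]
print(replicas_for(1, ring, 2))
```

With this ring, a record whose token is 1 falls in the range owned by `node-c`, and the second replica wraps around to `node-a`.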
Streaming – the Cluster
• Receive records from the client
– First write out to SSTable file
– Read file back in to create various Cassandra objects
• The “Primary Index” – in memory index for “shortcuts”
• Any secondary indices defined on this table
• Any materialized views defined on this table (in 3.0)
– Move on to next SSTable file and repeat
Streaming
• Streaming requires all nodes to be online
– Because sstableloader will connect to each node
• Can “blacklist” nodes to skip
– sstableloader will not stream to those nodes
– Must know which nodes are down up front, e.g., via nodetool status
– SSTables won’t be streamed later to offline nodes when they come online
• No “streaming hints”
• To get data to offline nodes, you must repair
• Streaming also requires SSTables to start with
– Use CQLSSTableWriter Java class to create SSTables
The test
• Delimited files
– Different size rows: 100 bytes, 1KB, 10KB, 1MB
– Same “schema”: 12-byte TEXT, 8-byte BIGINT, rest in a TEXT
• 12-byte TEXT is the partition key, BIGINT is unique and the clustering column
– Each file is 1GB – 20 files to load
• Larger rows mean fewer rows per file
– Parallel execution of commands is allowed – use all the cores
• Hardware/Software
– 8x i2.2xlarge nodes for Cassandra running DSE 4.7.3
– 1x r3.xlarge node as the client
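To make “larger rows mean fewer rows per file” concrete, here is a rough back-of-the-envelope calculation. It assumes decimal units (1GB read as 10^9 bytes, 1KB as 1,000 bytes) and ignores delimiters and newlines, so the counts are approximations rather than the exact row counts used in the test.

```python
FILE_BYTES = 10**9  # assumption: "1GB" read as 10^9 bytes
FILES = 20          # from the test description
# Assumption: decimal units for the row sizes named in the test.
ROW_SIZES = {"100B": 100, "1KB": 1_000, "10KB": 10_000, "1MB": 1_000_000}


def approx_rows(row_bytes, file_bytes=FILE_BYTES, files=FILES):
    """Return (rows per file, rows across all files), ignoring delimiters."""
    per_file = file_bytes // row_bytes
    return per_file, per_file * files


for label, size in ROW_SIZES.items():
    per_file, total = approx_rows(size)
    print(f"{label:>5}: ~{per_file:,} rows/file, ~{total:,} rows total")
```

Under these assumptions the row count spans four orders of magnitude across the tests, which is why rows/s and MB/s can rank the tools differently.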
The Contenders
1. CQLSSTableWriter + sstableloader
– Wrote a Java program to take delimited files to SSTables
– Use command-line sstableloader to load
– Need to combine the times of both
2. cqlsh COPY FROM
– Use DSE 4.7.3 (not started) on the client
3. cassandra-loader
– https://github.com/brianmhess/cassandra-loader
– Java CQL driver client
Experiment Execution Details
• CQLSSTableWriter + sstableloader
– Leverage “make -j 8” to run 8 at a time
– Time CQLSSTableWriter and then time sstableloader
• cqlsh COPY FROM
– Leverage “make -j 2” to run 2 at a time
– Running more than 2 caused timeouts and errors
• cassandra-loader
– 8 threads (“-numThreads 8”)
– unlogged batches of size 4 (“-batchSize 4”)
– 10000 queries in flight (“-numFutures 10000”)
– Exception for 100-byte test
• 10 threads
• unlogged batches of size 20
• 50000 queries in flight
Results
[Charts: Duration (s), Rows/s, and Data Rate (MB/s) for each row size (100B, 1KB, 10KB, 1MB), comparing cassandra-loader, sstablewriter+sstableloader, and copy.]
Observations
• Java executeAsync() was the fastest in every test but one
– The exception was the 100-byte test, where it was a close second
– cassandra-loader means no custom code
• CQLSSTableWriter+sstableloader works better for smaller records
– Performance eroded as record size increased
– Requires a custom Java program for each file format
• cqlsh COPY FROM was never the winner
– Second place in the 10KB/row test
– Could not handle the 1MB/row test (ERROR)
To Batch or Not To Batch
• Varying opinions on unlogged batches
• Batching puts more load on the coordinator
– The coordinator gets the list of INSERTs and executes each one
– Not all INSERTs will be “owned” by the coordinator
– Essentially, the client offloads work to the coordinator
• Batching means fewer queries to the cluster
– Since the INSERTs are bundled into one query
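The “fewer queries” point can be sketched as a simple chunking step on the client. This is a minimal illustration, not cassandra-loader's implementation: `batches` and `queries_needed` are hypothetical helper names, and no driver or real batch statement is involved.

```python
from itertools import islice
from math import ceil


def batches(rows, batch_size):
    """Yield consecutive lists of up to batch_size rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(rows)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk  # each chunk would become one unlogged batch query


def queries_needed(row_count, batch_size):
    """Number of queries sent to the cluster for row_count INSERTs."""
    return ceil(row_count / batch_size)


print(list(batches(range(10), 4)))
print(queries_needed(10, 4))
```

Each chunk still contains rows for different partitions unless the client groups by partition key first, which is why the coordinator ends up forwarding most of the INSERTs in a batch to other nodes.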
Batch test
• 10 delimited files
– 10 BIGINT columns – one is partition key, one is clustering column
• Use cassandra-loader
– Vary the -batchSize argument
• Measure
– Throughput – Rows/sec
– Latency – 95th Percentile
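The two measurements can be computed from raw samples as sketched below. The slides do not say which percentile definition was used; this sketch assumes the nearest-rank method, and the function names are illustrative.

```python
from math import ceil


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, ceil(p * len(ordered) / 100))  # 1-based rank
    return ordered[rank - 1]


def rows_per_second(row_count, elapsed_seconds):
    """Throughput over the whole run."""
    return row_count / elapsed_seconds


# Hypothetical per-query latencies in milliseconds.
latencies_ms = [12, 15, 11, 240, 13, 14, 16, 12, 13, 18]
print(percentile(latencies_ms, 95))
```

A tail percentile is used rather than the mean because a few slow queries can hide behind a good average.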
Results
• Observation: Increasing batch size
– Increases throughput (to a point)
– Increases latency
• Neither is surprising…
[Charts: Rows/sec and 95th Percentile Latency (ms) for batch sizes 1, 2, 4, 6, 8, 12, 16, 24, 32, 64, 128.]
Bulk Unloading
• The Problem
– Get all that data from Table X out to file(s)
• Refinement
– Where’s the data going?
• Local FS
• Distributed FS (e.g., HDFS) – Use Spark (or Hadoop, if you have to)?
• The “Front Door”
– CQL SELECT
• There is no “Side Door”
Parallel unload
• Split token range into pieces
– Need the set of splits to cover and not overlap
• Cassandra drivers provide that
– Need each split to be completely within one node
• So each extract is able to talk only to one Cassandra node
• Optimization step – not necessary
– Same approach as Spark and Hadoop (and others)
• Connection / Query per “split”
– Export to a different file
• Optimize paging size
– Reduce overhead for decompression
• Consistency Level
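Splitting the token range so the pieces cover it without overlapping can be sketched as follows. This assumes the Murmur3 partitioner's token range of -2^63 to 2^63-1 and slices it evenly; it does not align splits to node ownership (the optimization step above), which would need the token ranges from driver metadata. The table and column names are hypothetical.

```python
MIN_TOKEN = -(2**63)    # assumption: Murmur3 partitioner token bounds
MAX_TOKEN = 2**63 - 1


def split_token_range(n):
    """Return n (start, end) pairs that cover the full range with no overlap."""
    span = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + span * i // n for i in range(n)] + [MAX_TOKEN]
    return list(zip(bounds[:-1], bounds[1:]))


def split_queries(table, partition_key, n):
    """Build one SELECT per split; each could run on its own connection."""
    queries = []
    for i, (start, end) in enumerate(split_token_range(n)):
        # Splits are (start, end]; the first also includes its lower bound.
        lower = ">=" if i == 0 else ">"
        queries.append(
            f"SELECT * FROM {table} "
            f"WHERE token({partition_key}) {lower} {start} "
            f"AND token({partition_key}) <= {end}"
        )
    return queries


# Hypothetical table and partition key column.
for query in split_queries("ks.table_x", "pk", 4):
    print(query)
```

Each query's results would be written to a separate file, so the exports can run in parallel without coordinating with each other.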
Available Tools
• cqlsh COPY TO
– Built into Cassandra command-line tool cqlsh
– Leverages the Python driver
– Recent improvements: CASSANDRA-9304
• Parallel export, etc
• cassandra-unloader
– Part of the https://github.com/brianmhess/cassandra-loader project
– Delimited file options, just like cassandra-loader
– Parallel export
Performance
• From the CASSANDRA-9304 ticket:
“A small benchmark was done on a table of 10M rows inside of a Vagrant box
with 8 cores. The table was created using the following command
`tools/bin/cassandra-stress write n=10M -rate threads=50`.
The original single proc version took about 30 minutes to export the table.
The multi proc version takes about 7 minutes.
Brian Hess's cassandra-unloader takes a little over 2 minutes.”
• Summary:
– Pre-9304 COPY TO: 30 minutes
– Post-9304 COPY TO: 7 minutes
– cassandra-unloader: 2 minutes
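The relative speedups implied by these figures can be worked out directly. The timings in the ticket are approximate (“about”, “a little over”), so the ratios below are rough; the 10M row count comes from the quoted benchmark.

```python
ROWS = 10_000_000  # table size from the quoted benchmark

# Approximate durations in minutes, as summarized above.
minutes = {
    "Pre-9304 COPY TO": 30,
    "Post-9304 COPY TO": 7,
    "cassandra-unloader": 2,
}


def speedup(baseline_minutes, other_minutes):
    """How many times faster `other` is than the baseline."""
    return baseline_minutes / other_minutes


baseline = minutes["Pre-9304 COPY TO"]
for name, mins in minutes.items():
    rate = ROWS / (mins * 60)
    print(f"{name}: {speedup(baseline, mins):.1f}x, ~{rate:,.0f} rows/s")
```

That works out to roughly 4.3x for the parallel COPY TO and 15x for cassandra-unloader, relative to the original single-process export.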
Summary
• Bulk Loading
– CQL asynchronous INSERTs are your best bet
• Simplicity, performance (almost always), configurability, low/no coding
– CQLSSTableWriter requires a custom Java application
– sstableloader requires all nodes to be online
• Operational consideration
• Batching
– Can improve throughput at the cost of latency
• Bulk Unloading
– Parallel export via splitting token range
– Use CQL, there is no “side door”
Thank you
