Cassandra ETL
Streaming Bulk Loading
     Alex Araujo alexaraujo@gmail.com
Background

•   Sharded MySql ETL
    Platform on EC2 (EBS)

•   Database Size - Up to
    1TB

•   Write latencies
    grow exponentially
    with data size
Background
• Cassandra Thrift Loading on EC2
  (Ephemeral RAID0)

• Database Size - ∑ available node space
• Write latencies ~linearly proportional to
  number of nodes
• 12 XL node cluster (RF=3): 6-125x
  Improvement over EBS backed MySQL
  systems
Thrift ETL

• Thrift overhead: Converting to/from
  internal structures
• Routing from coordinator nodes
• Writing to commitlog
• Internal structures -> on-disk format
          Source: http://wiki.apache.org/cassandra/BinaryMemtable
Bulk Load
• Core functionality
• Existing ETL
  Nodes for bulk
  loading
• Move data file &
  index generation
  off C* nodes
BMT Bulk Load

• Requires StorageProxy API (Java)
• Rows not live until flush
• Wiki example uses Hadoop
         Source: http://wiki.apache.org/cassandra/BinaryMemtable
Streaming Bulk Load
• Cassandra as Fat Client
• BYO SSTables
• sstableloader [options] /path/to/keyspace_dir
  • Can ignore list of nodes (-i)
  • keyspace_dir should be name of keyspace
    and contain generated SSTable Data &
    Index files
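The invocation above can also be assembled programmatically. The following is a minimal sketch, not the presenter's code: `build_sstableloader_cmd` and its parameters are hypothetical names, the only flags used are the `-i`, `-v`, and `--debug` options that appear in these slides, and the comma-separated form of the `-i` node list is an assumption to verify against the help output of your Cassandra version.

```python
import os


def build_sstableloader_cmd(keyspace_dir, keyspace, ignore_nodes=(),
                            verbose=False, debug=False):
    """Return an argument list for sstableloader (hypothetical helper).

    Raises ValueError when the directory is not named after the keyspace,
    because the slides state that keyspace_dir should carry the keyspace name.
    """
    dir_name = os.path.basename(os.path.normpath(keyspace_dir))
    if dir_name != keyspace:
        raise ValueError(
            "directory %r must be named after keyspace %r" % (dir_name, keyspace))
    cmd = ["sstableloader"]
    if verbose:
        cmd.append("-v")
    if debug:
        cmd.append("--debug")
    if ignore_nodes:
        # Assumption: the ignore list is passed as one comma-separated value.
        cmd += ["-i", ",".join(ignore_nodes)]
    cmd.append(keyspace_dir)
    return cmd


if __name__ == "__main__":
    # "/tmp/Prefs" and the node address are made-up illustration values.
    print(build_sstableloader_cmd("/tmp/Prefs", "Prefs",
                                  ignore_nodes=["10.0.0.5"], verbose=True))
```

Building the command as a list (for example, to hand to `subprocess.run`) sidesteps the quoting and missing-space mistakes that string concatenation invites.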
Users
 Row key            Columns
 UserId <Hash>      email, name, ...

UserGroups
 Row key            Column name     Column value
 GroupId <UUID>     UserId          {"date_joined":"<date>","date_left":"<date>","active":<true|false>}

UserGroupTimeline
 Row key            Column name     Column value
 GroupId <UUID>     <TimeUUID>      UserId
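To make the layout concrete, here is a rough sketch of the three column families as nested Python dictionaries (row key, then column name, then value). Every value is made up for illustration, `join_group` is a hypothetical helper, and storing `None` for `date_left` while a membership is active is an assumption; the slides only show the JSON shape.

```python
import json
import uuid


def join_group(user_groups, timeline, group_id, user_id, date_joined):
    """Record a membership in UserGroups and an event in UserGroupTimeline."""
    membership = {"date_joined": date_joined, "date_left": None, "active": True}
    # UserGroups: row key = GroupId, column name = UserId, value = JSON blob
    user_groups.setdefault(group_id, {})[user_id] = json.dumps(membership)
    # UserGroupTimeline: row key = GroupId, column name = TimeUUID, value = UserId
    event_id = uuid.uuid1()  # version 1 UUIDs are time-based
    timeline.setdefault(group_id, {})[event_id] = user_id
    return event_id


if __name__ == "__main__":
    # Users: row key = UserId (a hash), columns = email, name, ...
    users = {"abc123": {"email": "user@example.com", "name": "Example User"}}
    user_groups, timeline = {}, {}
    group = uuid.uuid4()
    join_group(user_groups, timeline, group, "abc123", "2011-07-01")
    print(user_groups[group]["abc123"])
```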
Setup

• Opscode Chef 0.10.2 on EC2
• Cassandra 0.8.2-dev-SNAPSHOT (trunk)
• Custom Java ETL JAR
• The Grinder 3.4 (Jython) Test Harness
Chef 0.10.2

• knife-ec2 bootstrap with --ephemeral
• ec2::ephemeral_raid0 recipe
 • Installs mdadm, unmounts default /mnt,
    creates RAID0 array on /mnt/md0
Chef 0.10.2
• cassandra::default recipe
 • Downloads/extracts apache-cassandra-<version>-bin.tar.gz
 • Links /var/lib/cassandra to /raid0/cassandra
 • Creates cassandra user & directories,
    increases file limits, sets up cassandra
    service, generates config files
Chef 0.10.2

• cassandra::cluster_node recipe
 • Determines # nodes in cluster
 • Calculates initial_token; generates
    cassandra.yaml
  • Creates keyspace and column families
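The slides do not show how the recipe calculates initial_token. A common approach for an evenly balanced ring, assuming the RandomPartitioner and its 0 to 2^127 token range, is to space the tokens evenly. The sketch below illustrates that idea in Python rather than the recipe's own language, and `initial_tokens` is a hypothetical name.

```python
def initial_tokens(node_count):
    """Evenly spaced tokens for node_count nodes (RandomPartitioner assumed)."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    ring_size = 2 ** 127  # assumed size of the RandomPartitioner token space
    return [i * ring_size // node_count for i in range(node_count)]


if __name__ == "__main__":
    # 12 matches the cluster size mentioned earlier in the deck.
    for index, tok in enumerate(initial_tokens(12)):
        print("node %2d: initial_token = %d" % (index, tok))
```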
Chef 0.10.2

• cassandra::bulk_load_node recipe
 • Generates same cassandra.yaml with
    empty initial_token
 • Installs/configures grinder scripts; Java
    ETL JAR
ETL JAR
for (File file : files)
{
  CassandraBulkLoadImporter importer =
      new CassandraBulkLoadImporter(...);
  importer.open();
  // Processing omitted
  importer.close();
}
ETL JAR
CassandraBulkLoadImporter.initSSTableWriters():
// Output directory, named after the keyspace as sstableloader expects
File tempFiles = new File("/path/to/Prefs");
tempFiles.mkdirs();
for (String cfName : COLUMN_FAMILY_NAMES) {
  SSTableSimpleUnsortedWriter writer = new
    SSTableSimpleUnsortedWriter(
        tempFiles, Model.Prefs.Keyspace.name, cfName,
        Model.COLUMN_FAMILY_COMPARATORS.get(cfName),
        null, bufferSizeInMB); // null subcomparator: no Super CFs
  // Key each writer by its column family name
  tableWriters.put(cfName, writer);
}
ETL JAR
CassandraBulkLoadImporter.processSuppressionRecips():


for (User user : users) {
  String key = user.getUserId();
  SSTableSimpleUnsortedWriter writer =
      tableWriters.get(Model.Users.CF.name);
  // rowKey() converts String to ByteBuffer
  writer.newRow(rowKey(key));
  org.apache.cassandra.thrift.Column column = newUserColumn(user);
  writer.addColumn(column.name, column.value,
                    column.timestamp);
  ...
  // Repeat for each column family
}
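The "unsorted" in SSTableSimpleUnsortedWriter means rows may be added in any order: the writer buffers them in memory and sorts them before writing a file. The toy Python sketch below illustrates only that buffer-then-sort idea. It is not Cassandra's implementation: `BufferedSortedWriter` is a hypothetical class, it flushes on a row count rather than a size in MB, and `token` is a simplified approximation of RandomPartitioner ordering.

```python
import hashlib


def token(key):
    """Simplified RandomPartitioner-style token: |MD5(key)| as an integer."""
    return abs(int.from_bytes(hashlib.md5(key).digest(), "big", signed=True))


class BufferedSortedWriter:
    """Buffers rows added in any order and flushes them in token order."""

    def __init__(self, max_buffered_rows):
        self.max_buffered_rows = max_buffered_rows
        self.buffer = {}      # row key -> {column name: (value, timestamp)}
        self.flushed = []     # each entry stands in for one sorted SSTable
        self.current_key = None

    def new_row(self, key):
        # Flush first when a new key would exceed the buffer limit.
        if key not in self.buffer and len(self.buffer) >= self.max_buffered_rows:
            self.flush()
        self.current_key = key
        self.buffer.setdefault(key, {})

    def add_column(self, name, value, timestamp):
        if self.current_key is None:
            raise RuntimeError("call new_row() before add_column()")
        self.buffer[self.current_key][name] = (value, timestamp)

    def flush(self):
        if self.buffer:
            rows = sorted(self.buffer.items(), key=lambda item: token(item[0]))
            self.flushed.append(rows)
            self.buffer = {}

    def close(self):
        # Write out whatever is still buffered.
        self.flush()


if __name__ == "__main__":
    writer = BufferedSortedWriter(max_buffered_rows=2)
    for user_id in (b"carol", b"alice", b"bob"):   # made-up keys
        writer.new_row(user_id)
        writer.add_column(b"email", user_id + b"@example.com", 1)
    writer.close()
    print([len(table) for table in writer.flushed])
```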
ETL JAR
CassandraBulkLoadImporter.close():

for (String cfName : COLUMN_FAMILY_NAMES) {
  try {
    tableWriters.get(cfName).close();
  }
  catch (IOException e) {
    log.error("close failed for " + cfName);
    throw new RuntimeException(cfName + " did not close");
  }
  // Trailing space keeps the path separate from the --debug flag
  String streamCmd = "sstableloader -v --debug "
    + tempFiles.getAbsolutePath();
  Process stream = Runtime.getRuntime().exec(streamCmd);
  // A non-zero exit code means the stream failed
  if (stream.waitFor() != 0)
    log.error("stream failed");
}
cassandra_bulk_load.py
import random
import sys
import uuid

from java.io import File
from net.grinder.script.Grinder import grinder
from net.grinder.script import Statistics
from net.grinder.script import Test
from com.mycompany import App
from com.mycompany.tool import SingleColumnBulkImport
cassandra_bulk_load.py
input_files = [] # files to load
site_id = str(uuid.uuid4())
import_id = random.randint(1,1000000)
list_ids = []     # lists users will be loaded to

try:
  App.INSTANCE.start()
  dao = App.INSTANCE.getUserDAO()
  bulk_import = SingleColumnBulkImport(
     dao.prefsKeyspace, input_files, site_id,
     list_ids, import_id)
except:
   exception = sys.exc_info()[1]
   print exception.message
   print exception.stackTrace
cassandra_bulk_load.py
# Import stats
grinder.statistics.registerDataLogExpression(
  "Users Imported", "userLong0")
grinder.statistics.registerSummaryExpression(
  "Total Users Imported", "(+ userLong0)")
grinder.statistics.registerDataLogExpression(
  "Import Time", "userLong1")
grinder.statistics.registerSummaryExpression(
  "Import Total Time (sec)", "(/ (+ userLong1) 1000)")

rate_expression = ("(* (/ userLong0 (/ (/ userLong1 1000) "
  + str(num_threads) + ")) " + str(replication_factor) + ")")

grinder.statistics.registerSummaryExpression(
  "Cluster Insert Rate (users/sec)", rate_expression)
cassandra_bulk_load.py
# Import and record stats
def import_and_record():
    bulk_import.importFiles()
    grinder.statistics.forCurrentTest.setLong(
      "userLong0", bulk_import.totalLines)
    grinder.statistics.forCurrentTest.setLong(
      "userLong1",
      grinder.statistics.forCurrentTest.time)

# Create an Import Test with a test number and a description
import_test = Test(1, "Recip Bulk Import").wrap(
  import_and_record)

# A TestRunner instance is created for each thread
class TestRunner:
    # This method is called for every run.
    def __call__(self):
        import_test()
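The "Cluster Insert Rate" summary expression registered in this script uses The Grinder's prefix notation. Written as ordinary arithmetic it is users / ((milliseconds / 1000) / threads) × replication factor. The sketch below restates that calculation in plain Python; `cluster_insert_rate` is a hypothetical name and the numbers are made up, not results from the talk.

```python
def cluster_insert_rate(users_imported, import_time_ms, num_threads,
                        replication_factor):
    """Mirror of (* (/ userLong0 (/ (/ userLong1 1000) threads)) RF)."""
    seconds = import_time_ms / 1000.0            # (/ userLong1 1000)
    seconds_per_thread = seconds / num_threads   # (/ ... num_threads)
    users_per_second = users_imported / seconds_per_thread
    # Each user is written replication_factor times across the cluster.
    return users_per_second * replication_factor


if __name__ == "__main__":
    # Illustrative inputs only: 1,000,000 users, 200 s total, 4 threads, RF=3.
    print(cluster_insert_rate(1000000, 200000, 4, 3))
```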
Stress Results
• Once Data and Index files generated,
  streaming bulk load is FAST
 • Average: ~2.5x increase over Thrift
 • ~15-300x increase over MySQL
• Impact on cluster is minimal
• Observed downside: Writing own SSTables
  slower than Cassandra
Q’s?
