ETL With Cassandra Streaming Bulk Loading

Cassandra ETL
Streaming Bulk Loading
Alex Araujo alexaraujo@gmail.com

Background

• Sharded MySQL ETL platform on EC2 (EBS)

• Database size: up to 1 TB

• Write latencies grew exponentially with data size

Background
• Cassandra Thrift loading on EC2 (ephemeral RAID0)
• Database size: ∑ available node space
• Write latencies ~linearly proportional to number of nodes
• 12 XL node cluster (RF=3): 6-125x improvement over EBS-backed MySQL systems

Thrift ETL

• Thrift overhead: converting to/from internal structures
• Routing from coordinator nodes
• Writing to commitlog
• Internal structures -> on-disk format
Source: http://wiki.apache.org/cassandra/BinaryMemtable

Bulk Load
• Core functionality
• Existing ETL nodes for bulk loading
• Move data file & index generation off C* nodes

BMT Bulk Load

• Requires StorageProxy API (Java)
• Rows not live until flush
• Wiki example uses Hadoop
Source: http://wiki.apache.org/cassandra/BinaryMemtable

Streaming Bulk Load
• Cassandra as fat client
• BYO SSTables
• sstableloader [options] /path/to/keyspace_dir
• Can ignore list of nodes (-i)
• keyspace_dir should be the name of the keyspace and contain the generated SSTable Data & Index files
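
As a rough sketch of how an ETL driver might assemble the sstableloader invocation described above, the Python helper below builds the argument list. The helper name build_sstableloader_cmd, the example path, and the node addresses are hypothetical; the -i option comes from this slide and -v / --debug from the ETL JAR code in this deck. It assumes -i takes a comma-separated list of nodes.

```python
def build_sstableloader_cmd(keyspace_dir, ignore_nodes=None,
                            verbose=False, debug=False):
    """Return an argv list for sstableloader (hypothetical helper)."""
    cmd = ["sstableloader"]
    if verbose:
        cmd.append("-v")
    if debug:
        cmd.append("--debug")
    if ignore_nodes:
        # Assumption: -i accepts a comma-separated list of nodes to skip
        cmd.extend(["-i", ",".join(ignore_nodes)])
    # The directory name must match the keyspace name
    cmd.append(keyspace_dir)
    return cmd


# Example with made-up values
example_cmd = build_sstableloader_cmd(
    "/tmp/Prefs", ignore_nodes=["10.0.0.1", "10.0.0.2"], verbose=True)
print(" ".join(example_cmd))
```

Passing the result to subprocess.call as a list avoids the quoting and missing-space mistakes that are easy to make when concatenating a single command string.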

Users
Row key: UserId <Hash>
Columns: email, name, ...

UserGroups
Row key: GroupId <UUID>
Columns: UserId, ... (value: {"date_joined":"<date>","date_left":"<date>","active":<true|false>})

UserGroupTimeline
Row key: GroupId <UUID>
Columns: <TimeUUID>, ... (value: UserId)
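
The UserGroups column value in the model above is a small JSON document. A minimal Python sketch of how such a value could be serialized follows; the helper name membership_value and the date format are assumptions for illustration, not part of the original model.

```python
import json


def membership_value(date_joined, date_left=None, active=True):
    """Serialize one UserGroups column value (hypothetical helper)."""
    return json.dumps({
        "date_joined": date_joined,  # date format is an assumption
        "date_left": date_left,
        "active": active,
    }, sort_keys=True)


value = membership_value("2011-07-01")
print(value)
```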

Setup

• Opscode Chef 0.10.2 on EC2
• Cassandra 0.8.2-dev-SNAPSHOT (trunk)
• Custom Java ETL JAR
• The Grinder 3.4 (Jython) Test Harness

Chef 0.10.2

• knife-ec2 bootstrap with --ephemeral
• ec2::ephemeral_raid0 recipe
• Installs mdadm, unmounts default /mnt, creates RAID0 array on /mnt/md0

Chef 0.10.2
• cassandra::default recipe
• Downloads/extracts apache-cassandra-<version>-bin.tar.gz
• Links /var/lib/cassandra to /raid0/cassandra
• Creates cassandra user & directories, increases file limits, sets up cassandra service, generates config files

Chef 0.10.2

• cassandra::cluster_node recipe
• Determines # nodes in cluster
• Calculates initial_token; generates cassandra.yaml
• Creates keyspace and column families

Chef 0.10.2

• cassandra::bulk_load_node recipe
• Generates same cassandra.yaml with empty initial_token
• Installs/configures grinder scripts; Java ETL JAR

ETL JAR
for (File file : files)
{
    // CBLI = CassandraBulkLoadImporter
    importer = new CBLI(...);
    importer.open();
    // Processing omitted
    importer.close();
}

ETL JAR
CassandraBulkLoadImporter.initSSTableWriters():
File tempFiles = new File("/path/to/Prefs");
tempFiles.mkdirs();
for (String cfName : COLUMN_FAMILY_NAMES) {
    // One unsorted writer per column family; null sub-comparator = no super CFs
    SSTableSimpleUnsortedWriter writer = new SSTableSimpleUnsortedWriter(
        tempFiles, Model.Prefs.Keyspace.name, cfName,
        Model.COLUMN_FAMILY_COMPARATORS.get(cfName),
        null, bufferSizeInMB);
    tableWriters.put(cfName, writer);
}

ETL JAR
CassandraBulkLoadImporter.processSuppressionRecips():

for (User user : users) {
    String key = user.getUserId();
    SSTableSimpleUnsortedWriter writer = tableWriters.get(Model.Users.CF.name);
    // rowKey() converts String to ByteBuffer
    writer.newRow(rowKey(key));
    org.apache.cassandra.thrift.Column column = newUserColumn(user);
    writer.addColumn(column.name, column.value,
        column.timestamp);
    ...
    // Repeat for each column family
}

ETL JAR
CassandraBulkLoadImporter.close():

for (String cfName : COLUMN_FAMILY_NAMES) {
    try {
        tableWriters.get(cfName).close();
    }
    catch (IOException e) {
        log.error("close failed for " + cfName);
        throw new RuntimeException(cfName + " did not close");
    }
}
// Stream the generated SSTables once all writers are closed
String streamCmd = "sstableloader -v --debug "
    + tempFiles.getAbsolutePath();
Process stream = Runtime.getRuntime().exec(streamCmd);
if (stream.waitFor() != 0)
    log.error("stream failed");

cassandra_bulk_load.py
import random
import sys
import uuid

from java.io import File
from net.grinder.script.Grinder import grinder
from net.grinder.script import Statistics
from net.grinder.script import Test
from com.mycompany import App
from com.mycompany.tool import SingleColumnBulkImport

input_files = [] # files to load
site_id = str(uuid.uuid4())
import_id = random.randint(1,1000000)
list_ids = [] # lists users will be loaded to

try:
    App.INSTANCE.start()
    dao = App.INSTANCE.getUserDAO()
    bulk_import = SingleColumnBulkImport(
        dao.prefsKeyspace, input_files, site_id,
        list_ids, import_id)
except:
    exception = sys.exc_info()[1]
    print exception.message
    print exception.stackTrace

# Import stats
grinder.statistics.registerDataLogExpression(
    "Users Imported", "userLong0")
grinder.statistics.registerSummaryExpression(
    "Total Users Imported", "(+ userLong0)")
grinder.statistics.registerDataLogExpression(
    "Import Time", "userLong1")
grinder.statistics.registerSummaryExpression(
    "Import Total Time (sec)", "(/ (+ userLong1) 1000)")

# num_threads and replication_factor are defined elsewhere (not shown)
rate_expression = ("(* (/ userLong0 (/ (/ userLong1 1000) "
    + str(num_threads) + ")) " + str(replication_factor) + ")")

grinder.statistics.registerSummaryExpression(
    "Cluster Insert Rate (users/sec)", rate_expression)

# Import and record stats
def import_and_record():
    bulk_import.importFiles()
    grinder.statistics.forCurrentTest.setLong(
        "userLong0", bulk_import.totalLines)
    grinder.statistics.forCurrentTest.setLong(
        "userLong1",
        grinder.statistics.forCurrentTest.time)

# Create an Import Test with a test number and a description
import_test = Test(1, "Recip Bulk Import").wrap(
    import_and_record)

# A TestRunner instance is created for each thread
class TestRunner:
    # This method is called for every run.
    def __call__(self):
        import_test()
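
To make the "Cluster Insert Rate" expression in the script above easier to follow, here is the same arithmetic as a plain Python function. This is a sketch that mirrors the Grinder expression term by term; the function name cluster_insert_rate and the sample numbers are made up for illustration and are not measured results.

```python
def cluster_insert_rate(users_imported, import_time_ms,
                        num_threads, replication_factor):
    """Mirror of: (* (/ userLong0 (/ (/ userLong1 1000) threads)) rf)."""
    # Convert milliseconds to seconds, then divide by the thread count
    seconds_per_thread = (import_time_ms / 1000.0) / num_threads
    # Users per second, scaled by the replication factor
    return users_imported / seconds_per_thread * replication_factor


# Illustrative inputs only: 1,000,000 users, 200,000 ms summed import time,
# 4 threads, RF=3
print(cluster_insert_rate(1000000, 200000, 4, 3))
```

Multiplying by the replication factor counts every replica write, so the figure describes inserts across the whole cluster, not rows seen by the client.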

Stress Results
• Once Data and Index files are generated, streaming bulk load is FAST
• Average: ~2.5x increase over Thrift
• ~15-300x increase over MySQL
• Impact on cluster is minimal
• Observed downside: writing own SSTables is slower than Cassandra
