Intravert Server side processing for Cassandra

Before we get into the heavy
stuff, let's imagine hacking
around with C* for a bit...

You run a large video website
● CREATE TABLE videos (
videoid uuid,
videoname varchar,
username varchar,
description varchar, tags varchar,
upload_date timestamp,
PRIMARY KEY (videoid,videoname) );
● INSERT INTO videos (videoid, videoname, username,
description, tags, upload_date) VALUES
('99051fe9-6a9c-46c2-b949-38ef78858dd0', 'My funny cat', 'ctodd',
'My cat likes to play the piano! So funny.', 'cats,piano,lol',
'2012-06-01 08:00:00');

You have a bajillion users
● CREATE TABLE users (
username varchar,
firstname varchar,
lastname varchar,
email varchar,
password varchar,
created_date timestamp,
PRIMARY KEY (username));
● INSERT INTO users (username, firstname, lastname, email,
password, created_date) VALUES ('tcodd','Ted','Codd',
'tcodd@relational.com','5f4dcc3b5aa765d61d8327deb882cf99'
,'2011-06-01 08:00:00');

That's great! Then you ask
yourself...

● Can I slice a slice (or subquery)?
● Can I do advanced WHERE clauses?
● Can I union two slices server side?
● Can I join data from two tables without two
request/response round trips?
● What about procedures?
● Can I write functions or aggregation functions?

Let's look at the APIs we have

http://www.slideshare.net/aaronmorton/apachecon-nafeb2013

But none of those APIs do what I
want, and it seems simple
enough to do...

Intravert joins the “party”
at the API Layer

Why not just do it client side?
● Move processing close to data
– Idea borrowed from Hadoop
● Doing work close to the source can result in:
– Less network IO
– Less memory spent encoding/decoding 'throw-away' data
– New storage and access paradigms

Vert.x + Cassandra
● What is Vert.x?
– Distributed event bus which spans the server and
even penetrates into the client side for effortless
'real-time' web applications
● What are the cool features?
– Asynchronous
– Hot re-loadable modules
– Modules can be written in Groovy, Ruby, Java, JavaScript

http://vertx.io

Transport, payload, and
batching

HTTP Transport
● HTTP is easy to use on firewalled networks
● Easy to secure
● Easy to compress
● The de facto way to do everything anyway
● IntraVert attempts to limit round-trips
– The goal is not a terse binary format

JSON Payload
● Simple nested types like list, map, String
● Request is composed of N operations
● Each operation has a configurable timeout
● Again, IntraVert attempts to limit round-trips
– The goal is not a terse message format

Why not use lightning-fast transport
and serialization library X?
● X's language/code-generation issues
● You probably cannot tcpdump X
● Net-admins don't like 90 jars for health checks
● IntraVert attempts to limit round-trips:
– Prepared statements
– Server side filtering
– Other cool stuff

Sample request and response
Request:
{"e": [ {
    "type": "SETKEYSPACE",
    "op": { "keyspace": "myks" }
  }, {
    "type": "SETCOLUMNFAMILY",
    "op": { "columnfamily": "mycf" }
  }, {
    "type": "SLICE",
    "op": {
      "rowkey": "beers",
      "start": "Allagash",
      "end": "Sierra Nevada",
      "size": 9
    }
  } ]}

Response:
{
  "exception": null,
  "exceptionId": null,
  "opsRes": {
    "0": "OK",
    "1": "OK",
    "2": [{
      "name": "Founders",
      "value": "Breakfast Stout"
    }]
  }
}
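
To make the batching idea concrete, here is a minimal Java sketch that assembles the same three-operation request as one JSON document and posts it in a single HTTP call. It is an illustration only, not IntraVert's client library: the class and method names are hypothetical, the JSON is built by plain string concatenation to stay dependency-free, and the endpoint URL passed to `send` is an assumption you would replace with your own server's address.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

class IntraRequestSketch {
    // Wrap one operation as {"type": ..., "op": {...}}; opJson is a raw JSON object.
    static String op(String type, String opJson) {
        return "{\"type\": \"" + type + "\", \"op\": " + opJson + "}";
    }

    // Batch N operations into a single request document: {"e": [op0, op1, ...]}.
    static String request(List<String> ops) {
        return "{\"e\": [" + String.join(", ", ops) + "]}";
    }

    // The same three-step batch as the sample request.
    static String sampleBatch() {
        return request(List.of(
            op("SETKEYSPACE", "{\"keyspace\": \"myks\"}"),
            op("SETCOLUMNFAMILY", "{\"columnfamily\": \"mycf\"}"),
            op("SLICE", "{\"rowkey\": \"beers\", \"start\": \"Allagash\", "
                + "\"end\": \"Sierra Nevada\", \"size\": 9}")));
    }

    // One HTTP POST carries the whole batch, so three operations cost one round-trip.
    // The endpoint is an assumption, supplied by the caller.
    static String send(String endpoint, String body) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(URI.create(endpoint))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
        HttpResponse<String> res = HttpClient.newHttpClient()
            .send(req, HttpResponse.BodyHandlers.ofString());
        return res.body();
    }
}
```

The results come back keyed by operation position ("0", "1", "2"), which is why the order of the operations in the batch matters.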

Imagine your data looks like...
{ "rowkey": "beers", "name": "Allagash", "value": "Allagash Tripel" }
{ "rowkey": "beers", "name": "Founders", "value": "Breakfast Stout" }
{ "rowkey": "beers", "name": "Dogfish Head", "value": "Hellhound IPA" }

Application requirement
● The user wants to know which beers are
“Breakfast Stout”(s)
● Common “solutions”:
– Write a copy of the data sorted by type
– Request all the data and parse on client side

Using an IntraVert filter
● Send a function to the server
● Function is applied to subsequent get or slice
operations
● Only results of the filter are returned to the
client
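
The filter semantics can be sketched in plain Java. This is an illustration of the idea rather than IntraVert's implementation: the class and method names are hypothetical, a filter is modelled as a function from a column to a column or null, and the rows are the beer columns from the example data.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

class FilterSketch {
    // Apply the filter to every column; columns for which it returns null are dropped,
    // so only the survivors would travel back to the client.
    static List<Map<String, Object>> applyFilter(
            List<Map<String, Object>> rows, UnaryOperator<Map<String, Object>> filter) {
        List<Map<String, Object>> kept = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            Map<String, Object> out = filter.apply(row);
            if (out != null) {
                kept.add(out);
            }
        }
        return kept;
    }

    // Helper to build one {name, value} column.
    static Map<String, Object> column(String name, String value) {
        Map<String, Object> m = new LinkedHashMap<>();
        m.put("name", name);
        m.put("value", value);
        return m;
    }

    // The example slice result for row key "beers".
    static List<Map<String, Object>> beers() {
        return List.of(
            column("Allagash", "Allagash Tripel"),
            column("Founders", "Breakfast Stout"),
            column("Dogfish Head", "Hellhound IPA"));
    }

    // Keep only columns whose value is "Breakfast Stout".
    static final UnaryOperator<Map<String, Object>> STOUTS =
        row -> "Breakfast Stout".equals(row.get("value")) ? row : null;
}
```

Calling `FilterSketch.applyFilter(FilterSketch.beers(), FilterSketch.STOUTS)` keeps only the "Founders" column. Because a filter returns a row rather than a boolean, the same hook can also transform a row before returning it.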

Defining a filter JavaScript
● Syntax to create a filter
{
"type": "CREATEFILTER",
"op": {
"name": "stouts",
"spec": "javascript",
"value": "function(row) { if (row['value'] == 'Breakfast Stout') return row; else return null; }"
}
},

Defining a filter Groovy/Java

● We can define a Groovy closure or a Java filter
{
"type": "CREATEFILTER",
"op": {
"name": "stouts",
"spec": "groovy",
"value": "{ row -> if (row['value'] == 'Breakfast Stout') return row else return null }"
}
},

Common filter use cases
● Transform data
● Prune columns/rows like a where clause
● Extract data from complex fields (JSON, XML,
protobuf, etc.)

It's the cure for your “Redis envy”

Imagine your data looks like...
● { "row key":"1", name:"a", val... }
● { "row key":"1", name:"b", val... }
● { "row key":"4", name:"a", val... }
● { "row key":"4", name:"z", val... }

Application Requirements
● User wishes to intersect the column names of
two slices/queries
● Common “solutions”
– Pull all results to client and apply the intersection
there

Server Side MultiProcessor
● Send a class that implements the MultiProcessor
interface to the server
● public List<Map> multiProcess(Map<Integer,Object> input, Map params);
● Do one or more get/slice operations as input
● Invoke MultiProcessor on input

Multi-processor use cases
● Union N slices
● Intersect N slices
● Some “Join” scenarios
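
As a sketch of the intersection case, the class below implements the `multiProcess` signature shown earlier. Several things here are assumptions rather than IntraVert's documented behavior: the interface is declared locally as a stand-in for the real one, each input value is assumed to be the result of one slice (a list of maps with a "name" key), and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Local stand-in with the signature from the slide; the real interface ships with IntraVert.
interface MultiProcessor {
    List<Map> multiProcess(Map<Integer, Object> input, Map params);
}

class IntersectColumnNames implements MultiProcessor {
    @Override
    public List<Map> multiProcess(Map<Integer, Object> input, Map params) {
        Set<Object> common = null;
        // Assumption: each input value is one slice result, i.e. a list of column maps.
        for (Object result : input.values()) {
            Set<Object> names = new LinkedHashSet<>();
            for (Object col : (List<?>) result) {
                Object name = ((Map<?, ?>) col).get("name");
                if (name != null) {
                    names.add(name);
                }
            }
            if (common == null) {
                common = names;           // first slice seeds the candidate set
            } else {
                common.retainAll(names);  // later slices narrow it down
            }
        }
        // Return one map per column name present in every slice.
        List<Map> out = new ArrayList<>();
        if (common != null) {
            for (Object name : common) {
                out.add(Map.of("name", name));
            }
        }
        return out;
    }
}
```

With the example data, the slice for row key "1" has column names a and b, the slice for row key "4" has a and z, and only a is returned to the client. A union would accumulate with `addAll` instead of `retainAll`.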

Fat client becomes
the 'Phat client'

Imagine you want to insert this data
● The user wants to record this event in multiple column
families
– 09/10/2011 11:12:13
– App=Yahoo
– Platform=iOS
– OS=4.3.4
– Device=iPad2,1
– Resolution=768x1024
– Events: videoPlayPercent=38, Taste=great

http://www.slideshare.net/charmalloc/jsteincassandranyc2011

Inserting the data
aggregateColumnNames("AppPlatformOSVersionDeviceResolution") = "app+platform+osversion+device+resolution#"

def ccAppPlatformOSVersionDeviceResolution(c: (String) => Unit) = {
  c(aggregateColumnNames("AppPlatformOSVersionDeviceResolution") + app + p(platform) + p(osversion) + p(device) + p(resolution))
}

aggregateKeys(KEYSPACE "ByMonth") = month //201109
aggregateKeys(KEYSPACE "ByDay") = day //20110910
aggregateKeys(KEYSPACE "ByHour") = hour //2011091012
aggregateKeys(KEYSPACE "ByMinute") = minute //201109101213

def r(columnName: String): Unit = {
  aggregateKeys.foreach { tuple: (ColumnFamily, String) => {
      val (columnFamily, row) = tuple
      if (row != null && row.size > 0)
        rows add (columnFamily -> row has columnName inc) //increment the counter
    }
  }
}
ccAppPlatformOSVersionDeviceResolution(r)

http://www.slideshare.net/charmalloc/jsteincassandranyc2011

Solution
● Send the data once and compute the N
permutations on the server side
public void process(JsonObject request, JsonObject state, JsonObject response, EventBus eb) {
    // Parameters the client sent once with the request
    JsonObject params = request.getObject("mpparams");
    String uid = params.getString("userid");
    String fname = params.getString("fname");
    String lname = params.getString("lname");
    String city = params.getString("city");

    // One mutation for the user's row, one column per field
    RowMutation rm = new RowMutation("myks", IntraService.byteBufferForObject(uid));
    QueryPath qp = new QueryPath("users", null, IntraService.byteBufferForObject("fname"));
    rm.add(qp, IntraService.byteBufferForObject(fname), System.nanoTime());
    QueryPath qp2 = new QueryPath("users", null, IntraService.byteBufferForObject("lname"));
    rm.add(qp2, IntraService.byteBufferForObject(lname), System.nanoTime());
    ...
    try {
        StorageProxy.mutate(mutations, ConsistencyLevel.ONE);
        // Report success only if the write went through
        response.putString("status", "OK");
    } catch (WriteTimeoutException | UnavailableException | OverloadedException e) {
        e.printStackTrace();
        response.putString("status", "FAILED");
    }
}

IntraVert status
● Still pre-1.0
● Good docs
– https://github.com/zznate/intravert-ug/wiki/_pages
● Functionally equivalent to Thrift (most features
ported)
● CQL support
● Virgil (coming soon)
● HBase-like scanners (coming soon)

Hack at it

https://github.com/zznate/intravert-ug
