Brisk: more powerful Hadoop powered by Cassandra

Brisk: More Powerful Hadoop
Powered by Cassandra
jbellis@datastax.com

Monday, July 25, 2011

The evolution of Analytics

[Diagram; labels: Analytics + Realtime, replication, Analytics, Realtime, ETL]


Brisk re-unifies realtime and analytics


The Traditional Hadoop Stack
Master Nodes: Name Node, Secondary Name Node, Job Tracker, HBase Master
Slave Nodes: Data Node, Task Tracker, Region Server
Client Nodes: Pig, Hive
Also shown: ZooKeeper, MetaStore


Brisk Architecture


Brisk Highlights

✤ Easy to deploy and operate
✤ No single points of failure
✤ Scale and change nodes with no downtime
✤ Cross-DC, multi-master clusters
✤ Allocate resources for OLAP vs OLTP
✤ With no ETL


Cassandra data model

✤ ColumnFamilies contain rows + columns
✤ (Not really schemaless for a while now)

          password  name              site
zznate    *         Nate McCall
driftx    *         Brandon Williams
jbellis   *         Jonathan Ellis    datastax.com


Sparse

zznate    password: *   name: Nate McCall
driftx    password: *   name: Brandon Williams
jbellis   password: *   name: Jonathan Ellis   site: datastax.com
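
One rough way to picture the sparse layout above is a map of row key to that row's own columns, so a row stores only the columns it actually has. The sketch below is plain Python, not a Cassandra API; the names `users` and `get_column` are made up for illustration.

```python
# Illustrative only: a column family modeled as a dict of per-row dicts.
# `users` and `get_column` are hypothetical names, not Cassandra APIs.
users = {
    "zznate": {"password": "*", "name": "Nate McCall"},
    "driftx": {"password": "*", "name": "Brandon Williams"},
    "jbellis": {"password": "*", "name": "Jonathan Ellis", "site": "datastax.com"},
}


def get_column(column_family, row_key, column_name, default=None):
    """Return one column of one row, or `default` if the row lacks that column."""
    return column_family.get(row_key, {}).get(column_name, default)
```

Only the jbellis row carries a `site` column; the other rows simply omit it rather than storing a null.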


Rows as containers / materialized views

circle1: driftx, thobbs, pcmanus, jbellis, zznate
circle2: xedin, mdennis
circle3: xedin, pcmanus, ymorishita
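
The container idea can be sketched in a few lines of Python: each circle is one row, and the column names themselves are the members, so reading a single row answers "who is in this circle?" without a join. The names `circles`, `members`, and `circles_of` are hypothetical helpers, not part of Cassandra or Brisk.

```python
# Illustrative only: each row is a container whose column names are the members.
circles = {
    "circle1": {"driftx", "thobbs", "pcmanus", "jbellis", "zznate"},
    "circle2": {"xedin", "mdennis"},
    "circle3": {"xedin", "pcmanus", "ymorishita"},
}


def members(circle):
    """Read one row: every member of the given circle, in sorted order."""
    return sorted(circles.get(circle, ()))


def circles_of(user):
    """Scan all rows to find the circles that contain `user`."""
    return sorted(name for name, row in circles.items() if user in row)
```

The forward lookup is a single-row read, while the reverse lookup has to scan every row; a second container row keyed by user would be the usual way to make that direction cheap too.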


CassandraFS

✤ data stored as ByteBuffer internally -- excellent fit for blocks
✤ local reads mmap data directly (no RPC)
✤ blocks are compressed with Google Snappy
✤ hadoop distcp hdfs:///mydata cfs:///mydata


Hive support

✤ Hive MetaStore in Cassandra
✤ Unified schema view from any node, with no external systems and no SPOF
✤ Automatically maps Cassandra column families to Hive tables
✤ Supports static and dynamic column families (and supercolumns)


Hive: CFS and ColumnFamilies

-- Hive table stored in CFS, loaded from a local file
CREATE TABLE users (name STRING, zip INT);
LOAD DATA LOCAL INPATH 'kv2.txt' OVERWRITE INTO TABLE users;

-- Static column family: each named column maps to a Hive column
CREATE EXTERNAL TABLE Keyspace1.Users (name STRING, zip INT)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler';

-- Dynamic column family: one Hive row per (row key, column name, value)
CREATE EXTERNAL TABLE Keyspace1.Users
  (row_key STRING, column_name STRING, value STRING)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler';


Pig Support

✤ With standard Cassandra:
$ export PIG_HOME=/path/to/pig
$ export PIG_INITIAL_ADDRESS=localhost
$ export PIG_RPC_PORT=9160
$ export PIG_PARTITIONER=org.apache.cassandra.dht.RandomPartitioner
$ contrib/pig/bin/pig_cassandra

grunt>

✤ With Brisk:
$ bin/brisk pig
grunt>


Pig: CFS and ColumnFamilies

grunt> data = LOAD 'cfs:///example.txt' USING PigStorage()
       AS (name:chararray, value:long);

grunt> data = LOAD 'cassandra://Demo1/Scores' USING CassandraStorage()
       AS (key, columns: {T: tuple(name, value)});

grunt> data = LOAD 'cassandra://Demo1/Scores?slice_start=M&slice_end=S'
       USING CassandraStorage()
       AS (key, columns: {T: tuple(name, value)});
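
To make the slice parameters concrete, here is a small Python sketch of what restricting a row to a range of column names means. It assumes the columns are kept sorted by name and that both bounds are inclusive; the function `slice_columns` and the sample column names are invented for illustration and are not the CassandraStorage implementation.

```python
from bisect import bisect_left, bisect_right


def slice_columns(columns, slice_start, slice_end):
    """Return the (name, value) pairs with slice_start <= name <= slice_end.

    `columns` must already be sorted by column name. Inclusive bounds are an
    assumption of this sketch.
    """
    names = [name for name, _ in columns]
    lo = bisect_left(names, slice_start)   # first name >= slice_start
    hi = bisect_right(names, slice_end)    # one past the last name <= slice_end
    return columns[lo:hi]


# Made-up sample row, sorted by column name.
scores_row = [("Alice", 10), ("Mallory", 7), ("Oscar", 3), ("S", 5), ("Trent", 9)]
sliced = slice_columns(scores_row, "M", "S")
```

With the sample row, only the columns named Mallory, Oscar, and S fall inside the slice, so the loader would hand Pig a smaller bag per row.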


Data model: Realtime
LiveStocks
        last
GOOG    $95.52
AAPL    $186.10
AMZN    $112.98

Portfolios
            GOOG  LNKD  P   AMZN  AAPL
Portfolio1  80    20    40  100   20

StockHist
        2011-01-01  2011-01-02  2011-01-03
GOOG    $79.85      $75.23      $82.11


Data model: Analytics
HistLoss
            worst_date  loss
Portfolio1  2011-07-23  -$34.81
Portfolio2  2011-03-11  -$11432.24
Portfolio3  2011-05-21  -$1476.93


10dayreturns
ticker  rdate       return
GOOG    2011-07-25  $8.23
GOOG    2011-07-24  $6.14
GOOG    2011-07-23  $7.78
AAPL    2011-07-25  $15.32
AAPL    2011-07-24  $12.68

INSERT OVERWRITE TABLE 10dayreturns
SELECT a.row_key ticker,
       b.column_name rdate,
       b.value - a.value
FROM StockHist a
JOIN StockHist b
  ON (a.row_key = b.row_key
      AND date_add(a.column_name, 10) = b.column_name);
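
The self-join above pairs each price with the price for the same ticker ten days later and subtracts the earlier from the later. A minimal Python sketch of the same logic follows; the function name `ten_day_returns` and the sample prices are invented for illustration and are not taken from the demo data.

```python
from datetime import date, timedelta


def ten_day_returns(prices, days=10):
    """Compute (ticker, rdate, later_price - earlier_price) tuples.

    `prices` maps ticker -> {date: price}. A pair is emitted only when a
    price exists exactly `days` days after the earlier date, mirroring the
    date_add(...) = b.column_name join condition.
    """
    results = []
    for ticker, history in prices.items():
        for start, start_price in history.items():
            end = start + timedelta(days=days)
            if end in history:
                results.append((ticker, end, round(history[end] - start_price, 2)))
    return sorted(results)


# Made-up sample prices, chosen so the arithmetic is easy to check by eye.
sample_prices = {
    "GOOG": {
        date(2011, 1, 1): 100.0,
        date(2011, 1, 11): 108.0,
        date(2011, 1, 12): 90.0,
    }
}
sample_returns = ten_day_returns(sample_prices)
```

Dates with no partner exactly ten days later drop out, just as unmatched rows drop out of the inner join.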


        2011-01-01  2011-01-02  2011-01-03
GOOG    $79.85      $75.23      $82.11

row_key  column_name  value
GOOG     2011-01-01   $8.23
GOOG     2011-01-02   $6.14
GOOG     2011-01-03   $7.78
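
The mapping from one wide row to several (row_key, column_name, value) rows can be sketched in Python as a simple flattening step. This is an illustration of the idea only; `flatten_rows` is a hypothetical helper, and the sample reuses the GOOG prices from the StockHist slide.

```python
def flatten_rows(column_family):
    """Yield one (row_key, column_name, value) tuple per stored column."""
    for row_key, columns in column_family.items():
        # Sort by column name so the output order is predictable.
        for column_name, value in sorted(columns.items()):
            yield (row_key, column_name, value)


# One wide row: the GOOG price history from the StockHist slide.
stock_hist = {
    "GOOG": {"2011-01-01": 79.85, "2011-01-02": 75.23, "2011-01-03": 82.11},
}
triples = list(flatten_rows(stock_hist))
```

Each column of the wide row becomes its own narrow row, which is the shape the Hive queries in this deck select from.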


portfolio_returns
portfolio   rdate       preturn
Portfolio1  2011-07-25  $118.21
Portfolio1  2011-07-24  $60.78
Portfolio1  2011-07-23  -$34.81
Portfolio2  2011-07-25  $2143.92
Portfolio3  2011-07-24  -$10.19

INSERT OVERWRITE TABLE portfolio_returns
SELECT row_key portfolio,
       rdate,
       SUM(b.return)
FROM portfolios a
JOIN 10dayreturns b
  ON (a.column_name = b.ticker)
GROUP BY row_key, rdate;


HistLoss
            worst_date  loss
Portfolio1  2011-07-23  -$34.81
Portfolio2  2011-03-11  -$11432.24
Portfolio3  2011-05-21  -$1476.93

INSERT OVERWRITE TABLE HistLoss
SELECT a.portfolio, rdate, minp
FROM (
  SELECT portfolio, MIN(preturn) AS minp
  FROM portfolio_returns
  GROUP BY portfolio
) a
JOIN portfolio_returns b
  ON (a.portfolio = b.portfolio AND a.minp = b.preturn);
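
The query above finds each portfolio's minimum return and then joins back to recover the date on which it occurred. A short Python sketch of the same "worst day per group" step follows; `worst_returns` is a hypothetical name, and the sample rows are the Portfolio1 rows from the portfolio_returns slide.

```python
def worst_returns(rows):
    """Map each portfolio to the (rdate, preturn) of its lowest return.

    `rows` is an iterable of (portfolio, rdate, preturn). On ties this keeps
    the first row seen, whereas the Hive join would return every tied row.
    """
    worst = {}
    for portfolio, rdate, preturn in rows:
        if portfolio not in worst or preturn < worst[portfolio][1]:
            worst[portfolio] = (rdate, preturn)
    return worst


# Portfolio1 rows as shown on the portfolio_returns slide.
portfolio1_rows = [
    ("Portfolio1", "2011-07-25", 118.21),
    ("Portfolio1", "2011-07-24", 60.78),
    ("Portfolio1", "2011-07-23", -34.81),
]
hist_loss = worst_returns(portfolio1_rows)
```

For Portfolio1 this picks 2011-07-23 with a return of -34.81, which agrees with the HistLoss row shown on the slide.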


Portfolio Demo dataflow

[Diagram; labels: Portfolios, Web-based Portfolios, Historical Prices, Live Prices for today, Intermediate Results, Largest loss]


OpsCenter


Where to get it

✤ http://www.datastax.com/brisk

