Rimas Silkaitis
From Postgres to Cassandra
NoSQL vs SQL
||
&&
Rimas Silkaitis
Product
@neovintage
app cloud
DEPLOY MANAGE SCALE
$ git push heroku master
Counting objects: 11, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (10/10), done.
Writing objects: 100% (11/11), 22.29 KiB | 0 bytes/s, done.
Total 11 (delta 1), reused 0 (delta 0)
remote: Compressing source files... done.
remote: Building source:
remote:
remote: -----> Ruby app detected
remote: -----> Compiling Ruby
remote: -----> Using Ruby version: ruby-2.3.1
Heroku Postgres
Over 1 Million Active DBs
Heroku Redis
Over 100K Active Instances
Apache Kafka on Heroku
Runtime
Runtime
Workers
$ psql
psql => \d
List of relations
schema | name | type | owner
--------+----------+-------+-----------
public | users | table | neovintage
public | accounts | table | neovintage
public | events | table | neovintage
public | tasks | table | neovintage
public | lists | table | neovintage
Ugh… Database Problems
$ psql
psql => \d
List of relations
schema | name | type | owner
--------+----------+-------+-----------
public | users | table | neovintage
public | accounts | table | neovintage
public | events | table | neovintage
public | tasks | table | neovintage
public | lists | table | neovintage
Site Traffic
Events
* Totally Not to Scale
One
Big Table
Problem
CREATE TABLE users (
id bigserial,
account_id bigint,
name text,
email text,
encrypted_password text,
created_at timestamptz,
updated_at timestamptz
);
CREATE TABLE accounts (
id bigserial,
name text,
owner_id bigint,
created_at timestamptz,
updated_at timestamptz
);
CREATE TABLE events (
user_id bigint,
account_id bigint,
session_id text,
occurred_at timestamptz,
category text,
action text,
label text,
attributes jsonb
);
Table
events
events
events_20160901
events_20160902
events_20160903
events_20160904
Add Some Triggers
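(A rough sketch of what those triggers could look like, assuming plain inheritance-style child tables keyed on occurred_at; the function and trigger names here are made up for illustration.)
$ psql
neovintage::DB=> CREATE OR REPLACE FUNCTION events_insert_trigger()
RETURNS trigger AS $$
BEGIN
  -- route each row to the child table for its day
  IF NEW.occurred_at >= '2016-09-01' AND NEW.occurred_at < '2016-09-02' THEN
    INSERT INTO events_20160901 VALUES (NEW.*);
  ELSIF NEW.occurred_at >= '2016-09-02' AND NEW.occurred_at < '2016-09-03' THEN
    INSERT INTO events_20160902 VALUES (NEW.*);
  -- ...one branch per daily partition
  END IF;
  RETURN NULL; -- keep the row out of the parent table
END;
$$ LANGUAGE plpgsql;
neovintage::DB=> CREATE TRIGGER events_partition_insert
BEFORE INSERT ON events
FOR EACH ROW EXECUTE PROCEDURE events_insert_trigger();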
$ psql
neovintage::DB=> \e
INSERT INTO events (
user_id,
account_id,
category,
action,
occurred_at)
VALUES (1,
2,
'in_app',
'purchase_upgrade',
'2016-09-07 11:00:00 -07:00');
events_20160901
events_20160902
events_20160903
events_20160904
events
INSERT
query
Constraints
• Data has little value after a period of time
• Small range of data has to be queried
• Old data can be archived or aggregated
There’s A Better Way
&&
One
Big Table
Problem
$ psql
psql => \d
List of relations
schema | name | type | owner
--------+----------+-------+-----------
public | users | table | neovintage
public | accounts | table | neovintage
public | events | table | neovintage
public | tasks | table | neovintage
public | lists | table | neovintage
Why Introduce
Cassandra?
• Linear Scalability
• No Single Point of Failure
• Flexible Data Model
• Tunable Consistency
Runtime
Workers
New Architecture
I only know relational databases.
How do I do this?
Understanding Cassandra
Two Dimensional
Table Spaces
RELATIONAL
Associative Arrays
or Hash
KEY-VALUE
Postgres is Typically Run as a Single Instance*
• Partitioned Key-Value Store
• Has a Grouping of Nodes (data
center)
• Data is distributed amongst the
nodes
Cassandra Cluster with 2 Data Centers
Cassandra Query Language
SQL-like
[sēkwel lahyk]
adjective
Resembling SQL in appearance,
behavior or character
adverb
In the manner of SQL
Let's Talk About Primary Keys
Partition
Table
Partition Key
• 5 Node Cluster
• Simplest terms: data is partitioned
amongst all the nodes using a
hashing function (token query below).
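(Once a table exists you can see that hashing at work by asking for the token computed from the partition key — an illustrative query against the events table defined later in this talk.)
$ cqlsh
cqlsh> SELECT token(user_id, occurred_at), user_id
FROM neovintage_prod.events
LIMIT 3;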
Replication Factor
Replication Factor
Setting this parameter
tells Cassandra how
many nodes to copy
incoming data to.
This is a replication factor of 3
But I thought
Cassandra had
tables?
Prior to 3.0, tables were called column families
Let’s Model Our Events
Table in Cassandra
We’re not going to go
through any setup
Plenty of tutorials exist
for that sort of thing
Let’s assume we’re
working with a 5-node
cluster
$ psql
neovintage::DB=> \d events
Table "public.events"
Column | Type | Modifiers
---------------+--------------------------+-----------
user_id | bigint |
account_id | bigint |
session_id | text |
occurred_at | timestamp with time zone |
category | text |
action | text |
label | text |
attributes | jsonb |
$ cqlsh
cqlsh> CREATE KEYSPACE
IF NOT EXISTS neovintage_prod
WITH REPLICATION = {
'class': 'NetworkTopologyStrategy',
'us-east': 3
};
$ cqlsh
cqlsh> CREATE SCHEMA
IF NOT EXISTS neovintage_prod
WITH REPLICATION = {
'class': 'NetworkTopologyStrategy',
'us-east': 3
};
KEYSPACE ==
SCHEMA
• CQL can use KEYSPACE and SCHEMA
interchangeably
• SCHEMA in Cassandra is somewhere between
`CREATE DATABASE` and `CREATE SCHEMA` in
Postgres
$ cqlsh
cqlsh> CREATE SCHEMA
IF NOT EXISTS neovintage_prod
WITH REPLICATION = {
'class': 'NetworkTopologyStrategy',
'us-east': 3
};
Replication Strategy
$ cqlsh
cqlsh> CREATE SCHEMA
IF NOT EXISTS neovintage_prod
WITH REPLICATION = {
'class': 'NetworkTopologyStrategy',
'us-east': 3
};
Replication Factor
Replication Strategies
• NetworkTopologyStrategy - You have to define the
network topology by defining the data centers. No
magic here.
• SimpleStrategy - Has no idea of the topology and
doesn’t care to. Data is replicated to adjacent nodes (sketch below).
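(For contrast with the NetworkTopologyStrategy examples, a SimpleStrategy keyspace looks like this — fine for a single-data-center dev setup; the keyspace name is made up.)
$ cqlsh
cqlsh> CREATE KEYSPACE
IF NOT EXISTS neovintage_dev
WITH REPLICATION = {
'class': 'SimpleStrategy',
'replication_factor': 3
};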
$ cqlsh
cqlsh> CREATE TABLE neovintage_prod.events (
user_id bigint primary key,
account_id bigint,
session_id text,
occurred_at timestamp,
category text,
action text,
label text,
attributes map<text, text>
);
Remember the Primary
Key?
• Postgres defines a PRIMARY KEY as a constraint
that a column or group of columns can be used as a
unique identifier for rows in the table.
• CQL shares that same constraint but extends the
definition even further; its main purpose is to
describe how data is distributed and ordered across the cluster.
• CQL includes partitioning and sort order of the data
on disk (clustering).
$ cqlsh
cqlsh> CREATE TABLE neovintage_prod.events (
user_id bigint primary key,
account_id bigint,
session_id text,
occurred_at timestamp,
category text,
action text,
label text,
attributes map<text, text>
);
Single Column Primary
Key
• Used for both partitioning and clustering.
• Syntactically, it can be defined inline or on a separate
line within the DDL statement (both spellings sketched below).
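(A minimal sketch of both spellings — the same single-column key, written inline or on its own line; the table names are made up for illustration.)
$ cqlsh
cqlsh> CREATE TABLE neovintage_prod.events_inline (
user_id bigint PRIMARY KEY,
label text
);
cqlsh> CREATE TABLE neovintage_prod.events_separate (
user_id bigint,
label text,
PRIMARY KEY (user_id)
);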
$ cqlsh
cqlsh> CREATE TABLE neovintage_prod.events (
user_id bigint,
account_id bigint,
session_id text,
occurred_at timestamp,
category text,
action text,
label text,
attributes map<text, text>,
PRIMARY KEY (
(user_id, occurred_at),
account_id,
session_id
)
);
$ cqlsh
cqlsh> CREATE TABLE neovintage_prod.events (
user_id bigint,
account_id bigint,
session_id text,
occurred_at timestamp,
category text,
action text,
label text,
attributes map<text, text>,
PRIMARY KEY (
(user_id, occurred_at),
account_id,
session_id
)
);
Composite
Partition Key
$ cqlsh
cqlsh> CREATE TABLE neovintage_prod.events (
user_id bigint,
account_id bigint,
session_id text,
occurred_at timestamp,
category text,
action text,
label text,
attributes map<text, text>,
PRIMARY KEY (
(user_id, occurred_at),
account_id,
session_id
)
);
Clustering Keys
PRIMARY KEY (
(user_id, occurred_at),
account_id,
session_id
)
Composite Partition Key
• This means that both the user_id and the occurred_at
columns are going to be used to partition data.
• If you were not to include the inner parentheses, the
first column listed in this PRIMARY KEY definition
would be the sole partition key (fragments below).
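(The same column list with and without the inner parentheses — illustrative fragments, not full DDL.)
PRIMARY KEY ((user_id, occurred_at), account_id, session_id)
-- partition key: user_id + occurred_at; clustering: account_id, session_id
PRIMARY KEY (user_id, occurred_at, account_id, session_id)
-- partition key: user_id only; clustering: occurred_at, account_id, session_id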
PRIMARY KEY (
(user_id, occurred_at),
account_id,
session_id
)
Clustering Columns
• Defines how the data is sorted on disk. In this case, it’s
sorted by account_id and then session_id.
• It is possible to change the direction of the sort order.
$ cqlsh
cqlsh> CREATE TABLE neovintage_prod.events (
user_id bigint,
account_id bigint,
session_id text,
occurred_at timestamp,
category text,
action text,
label text,
attributes map<text, text>,
PRIMARY KEY (
(user_id, occurred_at),
account_id,
session_id
)
) WITH CLUSTERING ORDER BY (
account_id desc, session_id asc
);
Ahhhhh… Just
like SQL
Data Types
Postgres Type | Cassandra Type
--------------+----------------
bigint        | bigint
int           | int
decimal       | decimal
float         | float
text          | text
varchar(n)    | varchar
blob          | blob
json          | N/A
jsonb         | N/A
hstore        | map<type, type>
Challenges
• JSON / JSONB columns don't have 1:1 mappings in
Cassandra
• You’ll need to nest your JSON into Cassandra’s MAP type
or flatten it out
• Be careful about timestamps!! Time zones are already
challenging in Postgres.
• If you don’t specify a time zone in Cassandra, the time
zone of the coordinator node is used. Always specify
one (example below).
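(A hedged example covering both points: the jsonb payload flattened into the map<text, text> column and the timestamp written with an explicit UTC offset; the values are made up.)
$ cqlsh
cqlsh> INSERT INTO neovintage_prod.events
(user_id, occurred_at, account_id, session_id,
category, action, label, attributes)
VALUES (1, '2016-09-07 11:00:00-0700', 2, 'abc123',
'in_app', 'purchase_upgrade', 'upgrade',
{'plan': 'pro', 'source': 'web'});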
Ready for
Webscale
General Tips
• Just like Table Partitioning in Postgres, you need to
think about how you’re going to query the data in
Cassandra. This dictates how you set up your keys (query example below).
• We just walked through the semantics on the
database side. Tackling this change on the
application-side is a whole extra topic.
• This is just enough information to get you started.
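(A quick illustration of “queries dictate keys”: with the partition key (user_id, occurred_at) used earlier, a query has to supply both columns; filtering on user_id alone would be rejected or forced through ALLOW FILTERING.)
$ cqlsh
cqlsh> SELECT category, action, label
FROM neovintage_prod.events
WHERE user_id = 1
AND occurred_at = '2016-09-07 11:00:00-0700';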
Runtime
Workers
Runtime
Workers
Foreign Data Wrapper
fdw
=>
fdw
We’re not going to go through
any setup, again……..
https://bitbucket.org/openscg/cassandra_fdw
$ psql
neovintage::DB=> CREATE EXTENSION cassandra_fdw;
CREATE EXTENSION
$ psql
neovintage::DB=> CREATE EXTENSION cassandra_fdw;
CREATE EXTENSION
neovintage::DB=> CREATE SERVER cass_serv
FOREIGN DATA WRAPPER cassandra_fdw
OPTIONS (host '127.0.0.1');
CREATE SERVER
$ psql
neovintage::DB=> CREATE EXTENSION cassandra_fdw;
CREATE EXTENSION
neovintage::DB=> CREATE SERVER cass_serv
FOREIGN DATA WRAPPER cassandra_fdw
OPTIONS (host '127.0.0.1');
CREATE SERVER
neovintage::DB=> CREATE USER MAPPING FOR public
SERVER cass_serv
OPTIONS (username 'test', password 'test');
CREATE USER
$ psql
neovintage::DB=> CREATE EXTENSION cassandra_fdw;
CREATE EXTENSION
neovintage::DB=> CREATE SERVER cass_serv
FOREIGN DATA WRAPPER cassandra_fdw
OPTIONS (host '127.0.0.1');
CREATE SERVER
neovintage::DB=> CREATE USER MAPPING FOR public SERVER cass_serv
OPTIONS (username 'test', password 'test');
CREATE USER
neovintage::DB=> CREATE FOREIGN TABLE cass.events (id int)
SERVER cass_serv
OPTIONS (schema_name 'neovintage_prod',
table_name 'events', primary_key 'id');
CREATE FOREIGN TABLE
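(The (id int) definition above is the slide’s minimal placeholder; to accept the INSERT on the next slide the foreign table needs matching columns. A hedged sketch — the events_full name is made up, and note the later gotcha that cassandra_fdw only takes a single primary_key column.)
neovintage::DB=> CREATE FOREIGN TABLE cass.events_full (
user_id bigint,
occurred_at timestamptz,
label text
)
SERVER cass_serv
OPTIONS (schema_name 'neovintage_prod',
table_name 'events', primary_key 'user_id');
CREATE FOREIGN TABLE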
neovintage::DB=> INSERT INTO cass.events (
user_id,
occurred_at,
label
)
VALUES (
1234,
'2016-09-08 11:00:00 -0700',
'awesome'
);
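(Reads work the same way — once the mapping is in place you can query the Cassandra data with plain SQL from Postgres; a sketch assuming the foreign table exposes these columns.)
neovintage::DB=> SELECT user_id, occurred_at, label
FROM cass.events
WHERE user_id = 1234;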
Some Gotchas
• No Composite Primary Key Support in
cassandra_fdw
• No support for UPSERT
• Postgres 9.5+ and Cassandra 3.0+ Supported