Relational data model in Cassandra: Will it fit?

Distributed Data Days SF, September 2018
Matija Gobec
matija.gobec@smartcat.io
@mad_max0204

Agenda
Cassandra data model
Options and alternatives
UDT use case and Apache Spark

What are the 3 major Cassandra issues?
Data model
Over-expectations
Poor resource planning

It’s simple
Map[k, Map[k, v]]
It sucks
Or not...
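
One way to read Map[k, Map[k, v]] is as a map from partition key to a sorted map of clustering key to value. The sketch below is only a mental model under that assumption, not Cassandra's storage engine; the ToyTable class and its methods are hypothetical names.

```python
from collections import defaultdict


class ToyTable:
    """Toy model: partition key -> {clustering key -> row}."""

    def __init__(self):
        self._partitions = defaultdict(dict)

    def insert(self, partition_key, clustering_key, row):
        # Writing the same (partition, clustering) pair again overwrites it (upsert).
        self._partitions[partition_key][clustering_key] = row

    def select(self, partition_key):
        # Rows inside one partition come back ordered by clustering key.
        partition = self._partitions.get(partition_key, {})
        return [partition[ck] for ck in sorted(partition)]


table = ToyTable()
table.insert("org-1", 2, "Bob")
table.insert("org-1", 1, "Alice")
table.insert("org-2", 1, "Carol")
rows = table.select("org-1")
```

Reading by the outer key is a single lookup, which is why the model rewards queries that name the partition key.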

Slim row
Primary key | DATA

Wide row
Partition key | Clustering key: DATA | Clustering key: DATA | Clustering key: ...

But my data model looks like this

Denormalization
Query based data model
Relational model
Employee: EmployeeID, OrganizationID, Name
Organization: OrganizationID, Name
1. Select all employees for a given organizationID
OrganizationID | Name | Employee name | EmployeeID
2. Select employee for a given employeeID
EmployeeID | Name | OrganizationID
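
A query based model keeps one table per query and writes every change to each of them. The following is a minimal in-memory sketch of that idea, with plain dictionaries standing in for the two tables; the names employees_by_org, employee_by_id, and add_employee are made up for illustration.

```python
employees_by_org = {}  # query 1: organization_id -> {employee_id: employee name}
employee_by_id = {}    # query 2: employee_id -> (employee name, organization_id)


def add_employee(employee_id, name, organization_id):
    # Every write goes to each query table; keeping them in sync is the
    # application's job, which is where the data management cost comes from.
    employees_by_org.setdefault(organization_id, {})[employee_id] = name
    employee_by_id[employee_id] = (name, organization_id)


add_employee(1, "Alice", 10)
add_employee(2, "Bob", 10)
add_employee(3, "Carol", 20)

org_10 = sorted(employees_by_org[10].values())  # query 1: one lookup
bob = employee_by_id[2]                         # query 2: one lookup
```

Each read is a single lookup, at the price of duplicated data on every write.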

Denormalization
Application level joins
Relational model
Organization: OrganizationID, Name
Employee: EmployeeID, OrganizationID, Firstname, Lastname, Email, ...
1. Select all employees for a given organization
OrganizationID | Name | EmployeeID | EmployeeID | ...
EmployeeID | Name | OrganizationID
Results in multiple select statements
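
To make "multiple select statements" concrete, here is a small sketch that counts the lookups an application level join needs. The data and the helper names (select_organization, select_employee, employees_of) are invented for the example; each helper stands in for one SELECT.

```python
organizations = {10: {"name": "Acme", "employee_ids": [1, 2]}}
employees = {
    1: {"name": "Alice", "organization_id": 10},
    2: {"name": "Bob", "organization_id": 10},
}
query_log = []  # one entry per simulated SELECT


def select_organization(organization_id):
    query_log.append(("organization", organization_id))
    return organizations[organization_id]


def select_employee(employee_id):
    query_log.append(("employee", employee_id))
    return employees[employee_id]


def employees_of(organization_id):
    # The join happens in the application: one query for the organization,
    # then one more query per referenced employee.
    organization = select_organization(organization_id)
    return [select_employee(eid)["name"] for eid in organization["employee_ids"]]


names = employees_of(10)
```

With N employees the request costs N + 1 lookups, so latency grows with the size of the organization.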

Denormalization
Secondary indexes
Relational model
Organization: OrganizationID, Name
Employee: EmployeeID, OrganizationID, Firstname, Lastname, Email, ...
1. Select all employees for a given organization
EmployeeID | Name | OrganizationID | ...
CREATE INDEX on OrganizationID (secondary index)
Performance impact
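
One likely source of the performance impact is that the index lives next to the data on each node, so a query that does not name the partition key has to ask every node. The sketch below assumes exactly that (a node-local index) and uses a modulo as a stand-in for real token-based placement; Node, insert, and select_by_organization are hypothetical names.

```python
class Node:
    def __init__(self):
        self.rows = {}   # employee_id -> name
        self.index = {}  # organization_id -> employee ids stored on THIS node only

    def insert(self, employee_id, name, organization_id):
        self.rows[employee_id] = name
        self.index.setdefault(organization_id, set()).add(employee_id)

    def lookup(self, organization_id):
        return [self.rows[e] for e in sorted(self.index.get(organization_id, ()))]


nodes = [Node() for _ in range(3)]


def insert(employee_id, name, organization_id):
    # Rows are placed by employee_id (the partition key), not by organization.
    nodes[employee_id % len(nodes)].insert(employee_id, name, organization_id)


def select_by_organization(organization_id):
    names, contacted = [], 0
    for node in nodes:  # no partition key in the query, so every node is asked
        contacted += 1
        names.extend(node.lookup(organization_id))
    return sorted(names), contacted


insert(1, "Alice", 10)
insert(2, "Bob", 10)
insert(3, "Carol", 20)
names, contacted = select_by_organization(10)
```

Two matching rows still cost a round trip to all three nodes in this toy cluster.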

Denormalization
PROS
Fast reads
One query per request (usually)
Scalable (probably)
CONS
Complex data management
Can be extremely hard and complex on insert/update/delete
Need to know all queries upfront

UDTs
-- The type must exist before a table can reference it,
-- and both live in the same keyspace (test).
CREATE TYPE test.employee (
    employeeid bigint,
    firstname text,
    lastname text,
    email text
);
CREATE TABLE test.organization (
    organizationid bigint PRIMARY KEY,
    name text,
    employees list<frozen<employee>>
);
OrganizationID | Name | Employees: [Employee, Employee, ...]

UDTs
PROS
Fast(er) reads
One query per request
Scalable (should be!!)
Indexing?
CONS
No partial updates
Indexing?
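
"No partial updates" means that a field inside a frozen employee cannot be changed in place: the application reads the value, builds a new one, and writes it back. A rough Python sketch of that read-modify-write cycle, using an immutable NamedTuple to mimic a frozen value; the sample data and update_email are made up for illustration.

```python
from typing import NamedTuple


class Employee(NamedTuple):  # immutable, like a frozen UDT value
    employeeid: int
    firstname: str
    lastname: str
    email: str


organization = {
    "organizationid": 1,
    "name": "Acme",
    "employees": (
        Employee(1, "Alice", "Jones", "alice@example.com"),
        Employee(2, "Bob", "Smith", "bob@example.com"),
    ),
}


def update_email(org, employeeid, email):
    # Build a replacement employee and a replacement list, then write the
    # whole column back; nothing is modified in place.
    new_employees = tuple(
        e._replace(email=email) if e.employeeid == employeeid else e
        for e in org["employees"]
    )
    return {**org, "employees": new_employees}


updated = update_email(organization, 2, "bob@new.example")
```

The cost of changing one email therefore scales with the size of the whole employee list.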

Blob data
CREATE TABLE test.organization (
    organizationid bigint PRIMARY KEY,
    name text,
    employees text  -- or blob
);
OrganizationID | Name | Employees
Employees: the employee list as JSON text or as a blob of serialized objects

Blob data
PROS
Fast reads
One query per request
No need to serialize into JSON
CONS
No partial updates
No indexing option
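
Without an index, a question about the contents of the employees column can only be answered by decoding every row. A small sketch of that scan, assuming the column holds JSON text; the rows and the organizations_with_lastname helper are invented for the example.

```python
import json

rows = {
    1: {"name": "Acme", "employees": json.dumps([
        {"employeeid": 1, "firstname": "Alice", "lastname": "Jones"},
    ])},
    2: {"name": "Globex", "employees": json.dumps([
        {"employeeid": 2, "firstname": "Bob", "lastname": "Smith"},
        {"employeeid": 3, "firstname": "Carol", "lastname": "Jones"},
    ])},
}


def organizations_with_lastname(lastname):
    matches, decoded = [], 0
    for organization_id, row in rows.items():
        # The database sees only opaque text, so filtering happens client side.
        employees = json.loads(row["employees"])
        decoded += 1
        if any(e["lastname"] == lastname for e in employees):
            matches.append(organization_id)
    return sorted(matches), decoded


matches, decoded = organizations_with_lastname("Smith")
```

Every row is fetched and parsed even though only one of them matches.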

Relational database
PROS
It’s made for relational data
CONS
Scaling
Availability
Fault tolerance
Performance

Other options
Cassandra+Indexing
Cassandra+RDB
...

Use case
Highly nested data model
Impossible to denormalize
Fairly simple access patterns
Top level (root) entity

Data model
[Figure: a root entity at the top of a tree of child entities nested several levels deep]

How to insert data
INSERT ... JSON (Cassandra 2.2+)
Passed in as a JSON string, stored in typed columns
Easy to manage and debug
Keep track of the data size!
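
A simple way to keep track of the data size is to measure the JSON document before it is sent. The sketch below serializes a nested entity and rejects it above a byte budget; the 1,000,000 byte limit is an arbitrary illustration rather than a Cassandra setting, and to_insert_payload and the sample entity are hypothetical.

```python
import json

MAX_PAYLOAD_BYTES = 1_000_000  # arbitrary budget chosen for this example


def to_insert_payload(entity):
    # Compact separators keep the document as small as JSON allows.
    payload = json.dumps(entity, separators=(",", ":"))
    size = len(payload.encode("utf-8"))
    if size > MAX_PAYLOAD_BYTES:
        raise ValueError(f"payload is {size} bytes, budget is {MAX_PAYLOAD_BYTES}")
    return payload, size


root = {
    "id": 1,
    "children": [
        {"id": 11, "children": [{"id": 111, "children": []}]},
        {"id": 12, "children": []},
    ],
}
# The payload would be the string passed to an INSERT ... JSON statement.
payload, size = to_insert_payload(root)
```

Logging the size per root entity also helps spot entities that grow over time.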

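Spark dataframe UDT mapping
import org.apache.spark.sql.functions.{col, collect_list, struct}
One to many
// Group child rows by the join columns and collect them into a list of structs
dataframe.as("parent").join(
  child.groupBy(seq.map(col): _*)
    .agg(collect_list(struct(columns.map(col): _*)).alias(alias)),
  seq, joinType
)
One to one
// Wrap each child row in a single struct column
dataframe.join(
  child.withColumn(alias, struct(child.columns.map(col): _*))
    .select(joinColumn, alias),
  Seq(joinColumn), joinType
)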
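
For readers who do not use Spark, the one to many mapping can be sketched in plain Python: group the child rows by the join key, collect each group into a list, and attach the list to the parent. This illustrates the shape of the result only; nest_one_to_many and the sample rows are made up, and unlike a Spark left join it yields an empty list rather than null for parents without children.

```python
from collections import defaultdict


def nest_one_to_many(parents, children, key, alias):
    grouped = defaultdict(list)
    for child in children:
        # Plays the role of groupBy(key) + collect_list(struct(...)).
        grouped[child[key]].append(child)
    # Left join: every parent is kept, with its children nested under alias.
    return [{**parent, alias: grouped.get(parent[key], [])} for parent in parents]


organizations = [
    {"organizationid": 10, "name": "Acme"},
    {"organizationid": 20, "name": "Globex"},
]
employees = [
    {"organizationid": 10, "employeeid": 1, "firstname": "Alice"},
    {"organizationid": 10, "employeeid": 2, "firstname": "Bob"},
]
nested = nest_one_to_many(organizations, employees, "organizationid", "employees")
```

Applying the same step level by level, from the leaves up, produces the nested structure of the root entity.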

Inserting from Spark
import org.apache.spark.sql.SaveMode
// Save to Cassandra through the Spark Cassandra Connector data source
dataframe.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map(
    "keyspace" -> keyspace,
    "table" -> table
  ))
  .mode(SaveMode.Append)
  .save()

Indexing UDTs
Not possible with just Cassandra
Lucene/Solr based secondary index
Indexing of fields on nested UDTs
Field analyzers

Solr schema example
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<schema name="solrSchema" version="1.5">
  <types>
    <fieldType class="org.apache.solr.schema.TrieIntField" name="TrieIntField"/>
    <fieldType class="org.apache.solr.schema.TrieDateField" name="TrieDateField"/>
    ...
  </types>
  <fields>
    <field docValues="true" indexed="true" multiValued="false" name="partition_key" stored="true" type="TrieIntField"/>
    <field docValues="true" indexed="true" multiValued="false" name="clustering_key" stored="true" type="TrieDateField"/>
    <field docValues="true" indexed="true" multiValued="false" name="some_type.id" stored="true" type="TrieIntField"/>
    <field docValues="true" indexed="true" multiValued="false" name="some_type.some_other_type.id" stored="true" type="TrieIntField"/>
    ...
  </fields>
  <uniqueKey>(partition_key,clustering_key)</uniqueKey>
</schema>
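
The field names in the schema address nested UDT fields with dotted paths such as some_type.some_other_type.id. The sketch below shows how such paths relate to a nested row by flattening a dictionary into dotted names; it illustrates the naming convention only, not how the index plugin works internally, and dotted_fields and the sample row are hypothetical.

```python
def dotted_fields(value, prefix=""):
    """Flatten nested dictionaries into {dotted.path: leaf value}."""
    fields = {}
    for name, item in value.items():
        path = f"{prefix}.{name}" if prefix else name
        if isinstance(item, dict):
            # A nested UDT contributes its own fields under the parent's path.
            fields.update(dotted_fields(item, path))
        else:
            fields[path] = item
    return fields


row = {
    "partition_key": 7,
    "some_type": {"id": 1, "some_other_type": {"id": 2}},
}
flat = dotted_fields(row)
```

Each key of the result corresponds to one field entry that could be declared in the schema.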

Closing notes
Cassandra data model supports a lot of use cases
Data modeling skills are required
Relational model is hard but not impossible
Additional tools in the ecosystem
Don’t be stubborn

Matija Gobec
matija.gobec@smartcat.io
@mad_max0204
Thank you
