Relational data model in Cassandra: Will it fit?
Relational model in Cassandra:
Will it fit?
Distributed Data Days SF, September 2018
Matija Gobec
matija.gobec@smartcat.io
@mad_max0204
Why this talk?
Agenda
Cassandra data model
Options and alternatives
UDT use case and Apache Spark
Who is using Cassandra?
What are the 3 major Cassandra issues?
Data model
Over-expectations
Poor resource planning
Cassandra data model
It’s simple
Map[k, Map[k, v]]
It sucks
Or not...
Data model
Cassandra data model
Slim row:
  Primary key | DATA
Cassandra data model
Wide row:
  Partition key | Clustering key: DATA | Clustering key: DATA | Clustering key: ...
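The Map[k, Map[k, v]] picture and the wide-row layout can be sketched in a few lines of plain Python. This is only a toy model, not how Cassandra stores data: the class name `Table` and the sensor example are made up for illustration. The outer dictionary is keyed by partition key, and each partition holds rows that are read back in clustering-key order.

```python
class Table:
    """Toy model of Map[partition_key, Map[clustering_key, row]] (illustrative only)."""

    def __init__(self):
        # partition key -> {clustering key -> row}
        self.partitions = {}

    def insert(self, partition_key, clustering_key, row):
        # Writing the same (partition key, clustering key) again overwrites the row.
        self.partitions.setdefault(partition_key, {})[clustering_key] = row

    def select(self, partition_key):
        # Rows of a single partition come back ordered by clustering key.
        partition = self.partitions.get(partition_key, {})
        return [partition[ck] for ck in sorted(partition)]


readings = Table()
readings.insert("sensor-1", 3, {"temp": 21})
readings.insert("sensor-1", 1, {"temp": 19})
readings.insert("sensor-2", 2, {"temp": 30})

print(readings.select("sensor-1"))  # rows for one partition, ordered by clustering key
```

A slim row is the special case where each partition holds exactly one row.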
But my data model looks like this
Or even...
What are my options?
Denormalization
Query based data model
Relational model:
  Employee (EmployeeID, OrganizationID, Name)
  Organization (OrganizationID, Name)
1. Select all employees for a given organizationID
  Table: OrganizationID, Name, Employee name, EmployeeID
2. Select employee for a given employeeID
  Table: EmployeeID, Name, OrganizationID
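A query-based model means one table per query, with the same data written to each. The sketch below mimics that with two Python dictionaries standing in for the two tables; the names `employees_by_org`, `employee_by_id` and `add_employee` are hypothetical and not taken from the talk.

```python
# One "table" per query; both are filled on every write.
employees_by_org = {}  # organization_id -> {employee_id -> row}   (query 1)
employee_by_id = {}    # employee_id -> row                        (query 2)


def add_employee(organization_id, organization_name, employee_id, employee_name):
    row = {
        "organization_id": organization_id,
        "organization_name": organization_name,
        "employee_id": employee_id,
        "employee_name": employee_name,
    }
    # The same data is duplicated into every table that serves a query.
    employees_by_org.setdefault(organization_id, {})[employee_id] = dict(row)
    employee_by_id[employee_id] = dict(row)


def employees_for_org(organization_id):
    # Query 1: a single lookup by organization_id.
    rows = employees_by_org.get(organization_id, {})
    return [rows[eid] for eid in sorted(rows)]


def employee(employee_id):
    # Query 2: a single lookup by employee_id.
    return employee_by_id.get(employee_id)


add_employee(1, "Acme", 10, "Ana")
add_employee(1, "Acme", 11, "Bo")
```

Each read is a single lookup, at the price of keeping every copy in sync on write.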
Denormalization
Application level joins
Relational model:
  Organization (OrganizationID, Name)
  Employee (EmployeeID, OrganizationID, Firstname, Lastname, Email, ...)
1. Select all employees for a given organization
  Table: OrganizationID, Name, EmployeeID, EmployeeID, ...
2. Select employee for a given employeeID
  Table: EmployeeID, Name, OrganizationID
Results in multiple select statements
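Why application-level joins result in multiple select statements can be shown with a small counter. This is a sketch under assumptions: the dictionaries stand in for the organization and employee tables, and `select_one` is a hypothetical helper that represents one SELECT by primary key.

```python
organizations = {1: {"name": "Acme", "employee_ids": [10, 11]}}
employees = {
    10: {"name": "Ana", "organization_id": 1},
    11: {"name": "Bo", "organization_id": 1},
}
query_count = 0


def select_one(table, key):
    """Stand-in for a single SELECT by primary key; counts round trips."""
    global query_count
    query_count += 1
    return table[key]


def employees_for_org(organization_id):
    # One query for the organization row, then one query per employee ID.
    organization = select_one(organizations, organization_id)
    return [select_one(employees, eid) for eid in organization["employee_ids"]]


team = employees_for_org(1)
print(len(team), "employees fetched with", query_count, "queries")
```

The number of queries grows with the number of employees, which is the cost this option trades for a simpler write path.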
Denormalization
Secondary indexes
Relational model:
  Organization (OrganizationID, Name)
  Employee (EmployeeID, OrganizationID, Firstname, Lastname, Email, ...)
Table: EmployeeID, Name, OrganizationID, ...
1. Select all employees for a given organization
  Secondary index on OrganizationID (CREATE INDEX)
2. Select employee for a given employeeID
Performance impact
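The performance impact comes from the fact that a Cassandra secondary index is local to each node: a lookup by partition key goes to the node that owns the key, while an index lookup may have to ask many nodes. The sketch below is a deliberately simplified model. `Node`, `Cluster` and the modulo-based placement are assumptions for illustration; real behavior depends on version, replication and paging.

```python
class Node:
    def __init__(self):
        self.rows = {}       # employee_id -> row
        self.org_index = {}  # organization_id -> employee_ids stored on THIS node only

    def insert(self, row):
        self.rows[row["employee_id"]] = row
        self.org_index.setdefault(row["organization_id"], set()).add(row["employee_id"])


class Cluster:
    def __init__(self, node_count):
        self.nodes = [Node() for _ in range(node_count)]

    def _owner(self, employee_id):
        # Stand-in for token-based placement of a partition key.
        return self.nodes[employee_id % len(self.nodes)]

    def insert(self, row):
        self._owner(row["employee_id"]).insert(row)

    def by_employee_id(self, employee_id):
        """Primary-key lookup: returns (row, nodes contacted)."""
        return self._owner(employee_id).rows.get(employee_id), 1

    def by_organization(self, organization_id):
        """Index lookup: every node's local index is consulted in this model."""
        found = []
        for node in self.nodes:
            for employee_id in node.org_index.get(organization_id, ()):
                found.append(node.rows[employee_id])
        return sorted(found, key=lambda r: r["employee_id"]), len(self.nodes)


cluster = Cluster(node_count=3)
for employee_id, organization_id in [(10, 1), (11, 1), (12, 2)]:
    cluster.insert({"employee_id": employee_id, "organization_id": organization_id})

row, contacted_pk = cluster.by_employee_id(11)
rows, contacted_index = cluster.by_organization(1)
```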
Denormalization
PROS
Fast reads
One query per request (usually)
Scalable (probably)
CONS
Complex data management
Can be extremely hard and complex on insert/update/delete
Need to know all queries upfront
UDTs
CREATE TYPE test.employee (
  employeeid bigint,
  firstname text,
  lastname text,
  email text
);
CREATE TABLE test.organization (
  organizationid bigint PRIMARY KEY,
  name text,
  employees list<frozen<employee>>
);
Row layout: OrganizationID | Name | Employees: [Employee, Employee, ...]
UDTs
PROS
Fast(er) reads
One query per request
Scalable (should be!!)
Indexing?
CONS
Complex data management
No partial updates
Need to know all queries upfront
Indexing?
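"No partial updates" follows from the UDT values being frozen: a single field of a frozen value cannot be changed in place, so the value has to be rewritten. The sketch below models the simplest approach, a read-modify-write of the whole employees column, using a frozen dataclass as a stand-in for the frozen UDT. The function name `change_email` and the sample data are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Employee:
    """Stand-in for the frozen employee UDT; fields cannot be assigned in place."""
    employeeid: int
    firstname: str
    lastname: str
    email: str


organization = {
    "organizationid": 1,
    "name": "Acme",
    "employees": (
        Employee(10, "Ana", "Ng", "ana@old.example"),
        Employee(11, "Bo", "Li", "bo@old.example"),
    ),
}


def change_email(row, employeeid, new_email):
    """Read the whole employees value, rebuild it, and write it back as one value."""
    rewritten = tuple(
        replace(e, email=new_email) if e.employeeid == employeeid else e
        for e in row["employees"]
    )
    return {**row, "employees": rewritten}


updated = change_email(organization, 11, "bo@new.example")
```

The larger the nested value, the more data every small change has to rewrite.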
Blob data
CREATE TABLE test.organization (
  organizationid bigint PRIMARY KEY,
  name text,
  employees text  -- or blob
);
Row layout: OrganizationID | Name | Employees
Employees: the employee list as JSON text or as a blob of serialized objects
Blob data
PROS
Fast reads
One query per request
No need to serialize into JSON
CONS
Complex data management
No partial updates
Need to know all queries upfront
No indexing option
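With the text variant, the application owns serialization and any filtering. A minimal sketch, assuming the employees column holds a JSON array of objects; the helper names are made up for illustration.

```python
import json


def to_text_column(employees):
    """Serialize the whole employee list into the value stored in the text column."""
    return json.dumps(employees, sort_keys=True)


def from_text_column(text):
    return json.loads(text)


def find_by_email(text, email):
    # The database cannot index inside the text value, so the application
    # deserializes everything and scans.
    return [e for e in from_text_column(text) if e["email"] == email]


stored = to_text_column([
    {"employeeid": 10, "firstname": "Ana", "email": "ana@example.com"},
    {"employeeid": 11, "firstname": "Bo", "email": "bo@example.com"},
])
matches = find_by_email(stored, "bo@example.com")
```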
Relational database
PROS
It’s made for relational data
CONS
Scaling
Availability
Fault tolerance
Performance
Other options
Cassandra+Indexing
Cassandra+RDB
...
Leveraging UDTs
Use case
Highly nested data model
Impossible to denormalize
Fairly simple access patterns
Top level (root) entity
Data model
[Diagram: a root entity with several levels of nested child entities]
How to insert data
INSERT ... JSON (Cassandra 2.2+)
Inserted as string, stored as a column type
Easy to manage and debug
Keep track of the data size!!!
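A small helper can build the JSON payload and check its size before it is sent. This is a sketch under assumptions: the function name and the size budget `MAX_ROW_BYTES` are hypothetical, and the right limit depends on the cluster. The statement is assembled as a string only to show its shape; with a driver, a bound parameter (`INSERT INTO ... JSON ?`) would normally be preferable.

```python
import json

MAX_ROW_BYTES = 1_000_000  # hypothetical budget, not a Cassandra constant


def insert_json_statement(table, row, max_bytes=MAX_ROW_BYTES):
    """Return (CQL statement, payload size in bytes) for one nested row."""
    payload = json.dumps(row)
    size = len(payload.encode("utf-8"))
    if size > max_bytes:
        raise ValueError(f"row is {size} bytes, budget is {max_bytes}")
    # Single quotes inside a CQL string literal are escaped by doubling them.
    escaped = payload.replace("'", "''")
    return f"INSERT INTO {table} JSON '{escaped}'", size


row = {
    "organizationid": 1,
    "name": "O'Neil Ltd",
    "employees": [{"employeeid": 10, "firstname": "Ana", "lastname": "Ng",
                   "email": "ana@example.com"}],
}
statement, size = insert_json_statement("test.organization", row)
```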
Spark dataframe UDT mapping
One to many:
dataframe.as("parent").join(
  child.groupBy(seq.map(col): _*)
    .agg(collect_list(struct(columns.map(col): _*))
    .alias(alias)), seq, joinType
)
One to one:
dataframe.join(child
  .withColumn(alias, struct(child.columns.map(col): _*))
  .select(joinColumn, alias), Seq(joinColumn), joinType)
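What the two mappings produce can be shown without Spark. The sketch below is plain Python over lists of dictionaries, not PySpark, and the function names are hypothetical. It mirrors the idea only: one-to-many groups the child rows per key into a list of structs, one-to-one attaches a single struct. Parents without children get an empty list or None here, which is a simplification of what a left join would return.

```python
from collections import defaultdict


def nest_one_to_many(parents, children, key, alias):
    """Attach all children with a matching key as a list under `alias`."""
    grouped = defaultdict(list)
    for child in children:
        grouped[child[key]].append(dict(child))
    return [{**parent, alias: grouped.get(parent[key], [])} for parent in parents]


def nest_one_to_one(parents, children, key, alias):
    """Attach the single matching child as a nested value under `alias`."""
    by_key = {child[key]: dict(child) for child in children}
    return [{**parent, alias: by_key.get(parent[key])} for parent in parents]


organizations = [{"organizationid": 1, "name": "Acme"},
                 {"organizationid": 2, "name": "Globex"}]
employees = [{"organizationid": 1, "employeeid": 10},
             {"organizationid": 1, "employeeid": 11}]
addresses = [{"organizationid": 2, "city": "Belgrade"}]

with_employees = nest_one_to_many(organizations, employees, "organizationid", "employees")
with_address = nest_one_to_one(organizations, addresses, "organizationid", "address")
```

Applying such steps bottom-up, one level at a time, yields the nested row shape that UDT columns expect.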
Inserting from Spark
// Save to Cassandra
import org.apache.spark.sql.SaveMode

dataframe.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map(
    "keyspace" -> s"$keyspace",
    "table" -> s"$table"
  ))
  .mode(SaveMode.Append)
  .save()
Indexing UDTs
Not possible with just Cassandra
Lucene/Solr based secondary index
Indexing of fields on nested UDTs
Field analyzers
Solr schema example
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<schema name="solrSchema" version="1.5">
  <types>
    <fieldType class="org.apache.solr.schema.TrieIntField" name="TrieIntField"/>
    <fieldType class="org.apache.solr.schema.TrieDateField" name="TrieDateField"/>
    ...
  </types>
  <fields>
    <field docValues="true" indexed="true" multiValued="false" name="partition_key" stored="true" type="TrieIntField"/>
    <field docValues="true" indexed="true" multiValued="false" name="clustering_key" stored="true" type="TrieDateField"/>
    <field docValues="true" indexed="true" multiValued="false" name="some_type.id" stored="true" type="TrieIntField"/>
    <field docValues="true" indexed="true" multiValued="false" name="some_type.some_other_type.id" stored="true" type="TrieIntField"/>
    ...
  </fields>
  <uniqueKey>(partition_key,clustering_key)</uniqueKey>
</schema>
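The field names in the schema address nested UDT fields with dotted paths such as some_type.some_other_type.id. The sketch below only illustrates that naming convention by flattening a nested value into dotted names. It is an assumption-based illustration, not how the index extracts fields, and `flatten` is a hypothetical helper.

```python
def flatten(value, prefix=""):
    """Map a nested dict to {dotted.field.path: leaf value}."""
    fields = {}
    for name, inner in value.items():
        path = f"{prefix}.{name}" if prefix else name
        if isinstance(inner, dict):
            # Descend into the nested type and extend the path.
            fields.update(flatten(inner, path))
        else:
            fields[path] = inner
    return fields


row = {
    "partition_key": 7,
    "some_type": {"id": 1, "some_other_type": {"id": 2}},
}
indexed_fields = flatten(row)
```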
But will it blend?
Closing notes
Cassandra data model supports a lot of use cases
Data modeling skills are required
Relational model is hard but not impossible
Additional tools in the ecosystem
Don’t be stubborn
Q&A
Thank you
