Cassandra Day Chicago 2015: CQL: This is not the SQL you are looking for

Aaron Ploetz

Wait... CQL is not SQL?
- CQL3 was introduced in Cassandra 1.1.
- CQL is beneficial to new users who have a relational background (which is most of us).
- However similar, CQL is NOT a direct implementation of SQL.
- New users leave themselves open to issues and frustration when they use CQL with SQL-based expectations.

$ whoami
- Aaron Ploetz
- @APloetz
- Lead Database Engineer
- Using Cassandra since version 0.8.
- Contributor to the Cassandra tag on
- 2014/15 DataStax MVP for Apache Cassandra

1 SQL features/keywords not present in CQL
2 Differences between CQL and SQL keywords
3 Secondary Indexes
4 Anti-Patterns
5 Questions

SQL features/keywords not present in CQL
- JOINs
- LIKE
- Subqueries
- Aggregation
- Arithmetic
  - Except for counters and collections.

Differences between CQL and SQL keywords
- WHERE
- PRIMARY KEY
- ORDER BY
- IN
- DISTINCT
- COUNT
- LIMIT
- INSERT vs. UPDATE (“upsert”)

WHERE
- Only supports AND, IN, =, >, >=, <, <=.
  - Some only function under certain conditions.
- Also: CONTAINS, CONTAINS KEY for indexed collections.
- Does not exist: OR, !=
- Conditions can only operate on PRIMARY KEY components, and in the defined order of the keys.

WHERE (cont)
- SELECT * FROM shipcrewregistry WHERE shipname='Serenity';
- Start with partition key(s); cannot skip PRIMARY KEY components.
- CREATE TABLE shipcrewregistry
  (shipname text, lastname text, firstname text, citizenid uuid, aliases set<text>,
  PRIMARY KEY (shipname, lastname, firstname, citizenid));
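
The "cannot skip PRIMARY KEY components" rule can be pictured as a prefix check over the key columns. The sketch below is plain Python, not Cassandra code: `restricts_key_prefix` is a hypothetical helper, and it deliberately ignores range operators, IN, and secondary indexes.

```python
# Toy model only: the real rules are enforced by Cassandra itself.
# Key columns of shipcrewregistry, in the order they appear in PRIMARY KEY.
PRIMARY_KEY = ["shipname", "lastname", "firstname", "citizenid"]

def restricts_key_prefix(primary_key, restricted_columns):
    """Return True if the restricted columns are exactly the first N key columns."""
    restricted = set(restricted_columns)
    prefix = set(primary_key[:len(restricted)])
    return restricted == prefix

print(restricts_key_prefix(PRIMARY_KEY, ["shipname"]))               # True
print(restricts_key_prefix(PRIMARY_KEY, ["shipname", "lastname"]))   # True
print(restricts_key_prefix(PRIMARY_KEY, ["lastname"]))               # False: skips shipname
print(restricts_key_prefix(PRIMARY_KEY, ["shipname", "firstname"]))  # False: skips lastname
```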

ALLOW FILTERING
- Actually I lied, you can skip primary key components if you apply the ALLOW FILTERING clause.
- SELECT * FROM shipcrewregistry WHERE lastname='Washburne';

ALLOW FILTERING (cont)
- SELECT * FROM shipcrewregistry WHERE lastname='Washburne' ALLOW FILTERING;
- But I don't recommend that.
- ALLOW FILTERING pulls back all rows and then applies your WHERE conditions.
- The folks at DataStax have proposed some alternate names...
- Bottom line, if you are using ALLOW FILTERING, you are doing it wrong.
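
To see why "pulls back all rows and then applies your WHERE conditions" is costly, here is a rough in-memory sketch in plain Python. The table layout, the sample rows, and both function names are invented for illustration; only the ship name and last name come from the slides.

```python
# Toy model: a table is a dict of partitions; each partition is a list of rows.
# All rows below are made-up sample data.
table = {
    "Serenity": [
        {"lastname": "Reynolds", "firstname": "Malcolm"},
        {"lastname": "Washburne", "firstname": "Hoban"},
        {"lastname": "Washburne", "firstname": "Zoe"},
    ],
    "Hypothetical Ship": [
        {"lastname": "Doe", "firstname": "Jane"},
    ],
}

def by_partition(table, shipname):
    """Partition-key lookup: only one partition's rows are read."""
    rows = table.get(shipname, [])
    return rows, len(rows)

def allow_filtering(table, lastname):
    """Filtering scan: every row of every partition is read, then filtered."""
    rows_read = 0
    matches = []
    for rows in table.values():
        for row in rows:
            rows_read += 1
            if row["lastname"] == lastname:
                matches.append(row)
    return matches, rows_read

matches, rows_read = allow_filtering(table, "Washburne")
print(len(matches), rows_read)  # 2 matches, but all 4 rows were read
```

In this model the cost of the filtering scan grows with the size of the whole table, not with the size of the result.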

PRIMARY KEY
- PRIMARY KEYs function differently in Cassandra than in relational databases.
- Cassandra uses primary keys to determine data distribution and on-disk sort order.
- Partition keys are the equivalent of “old school” row keys.
- Clustering keys determine on-disk sort order within a partition.
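
As a loose illustration of "primary keys determine data distribution", the sketch below hashes a partition key to pick a node. This is an assumption-heavy stand-in: `node_for` is a hypothetical helper, MD5 plus modulo is used only to keep the example short, and Cassandra's actual partitioner and token ring work differently in detail.

```python
import hashlib

def node_for(partition_key: str, node_count: int) -> int:
    """Map a partition key to a node index (stand-in for Cassandra's token ring)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big")  # treat the first 8 bytes as a token
    return token % node_count

# Rows that share a partition key always land on the same node.
print(node_for("Serenity", 3) == node_for("Serenity", 3))  # True
```

The point of the model is that the partition key alone decides where a row lives, which is why queries are expected to supply it.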

ORDER BY
- One of the most misunderstood aspects of CQL.
- Can only order by clustering columns, in the key order of the clustering columns listed in the table definition (CLUSTERING ORDER).
- Which means that you really don't need ORDER BY.
- So what does it do? It can reverse the sort direction (ASCending vs. DESCending) of the first clustering column.

PRIMARY KEY / ORDER BY Example:
Table Definition
- CREATE TABLE postsByUserYear
  (userid text, year bigint, tag text, posttime timestamp, content text, postid UUID,
  PRIMARY KEY ((userid, year), posttime, tag))
  WITH CLUSTERING ORDER BY (posttime desc, tag asc);

PRIMARY KEY / ORDER BY Example:
Queries
- SELECT * FROM postsByUserYear WHERE userid='2';
- SELECT * FROM postsByUserYear ORDER BY posttime;
- SELECT * FROM postsByUserYear WHERE userid='2' AND year=2015 ORDER BY posttime DESC;
- SELECT * FROM postsByUserYear WHERE userid='2' AND year=2015 ORDER BY tag;
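
One way to picture why ORDER BY is mostly redundant: rows inside a partition are already stored in clustering order, so a read either walks them forwards or backwards. The sketch below models a single partition in plain Python. The rows are invented sample data and `read_partition` is a hypothetical helper, not a driver call.

```python
from datetime import datetime

# Toy model of ONE partition (userid='2', year=2015); rows are made up.
rows = [
    {"posttime": datetime(2015, 3, 1, 9, 0), "tag": "cql"},
    {"posttime": datetime(2015, 3, 3, 9, 0), "tag": "cassandra"},
    {"posttime": datetime(2015, 3, 2, 9, 0), "tag": "sql"},
]

# CLUSTERING ORDER BY (posttime DESC, tag ASC): the order is fixed at write time.
# Python's sort is stable, so sort by the minor key first, then by the major key.
stored = sorted(rows, key=lambda r: r["tag"])
stored = sorted(stored, key=lambda r: r["posttime"], reverse=True)

def read_partition(stored_rows, reverse=False):
    """Return rows in stored order, or walk the same rows backwards."""
    return list(reversed(stored_rows)) if reverse else list(stored_rows)

print([r["tag"] for r in read_partition(stored)])                # ['cassandra', 'sql', 'cql']
print([r["tag"] for r in read_partition(stored, reverse=True)])  # ['cql', 'sql', 'cassandra']
```

Nothing is re-sorted at read time in this model; the only choice is the direction of the walk.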

IN
- Can only operate on the last partition key and/or the last clustering key.
  - And only when the first partition/clustering keys are restricted by an equals relation.
- Does not perform well... especially with large clusters.

Testing IN
- CREATE TABLE bladerunners (id text, type text, ts timestamp, name text, data text, PRIMARY KEY (id));

Testing IN (cont)
- SELECT * FROM bladerunners WHERE id IN ('B26354','B26354');

DISTINCT
- Returns a list of the queried partition keys.
- Can only operate on partition key column(s).
- In Cassandra, DISTINCT returns the partition (row) keys, so it is a fairly light operation (relative to the size of the cluster and/or data set).
- Whereas in the relational world, DISTINCT is a very resource-intensive operation.

COUNT
- Counts the number of rows returned, dependent on the WHERE clause.
- Does not aggregate.
- Similar to its RDBMS counterpart.
- Can be (inadvertently) restricted by LIMIT.
- Resource-intensive command, especially because it has to scan each row in the table (which may be on different nodes) and apply the WHERE conditions.

LIMIT
- Limits your query to N rows (where N is a positive integer).
- SELECT * FROM bladerunners LIMIT 2;
- Does not allow you to specify a start point.
- You cannot use LIMIT to “page” through your result set.

Cassandra “Upserts”
- Under the hood, INSERT and UPDATE are treated the same by Cassandra.
- Colloquially known as an “Upsert.”
- Both INSERT and UPDATE operations require the complete PRIMARY KEY.
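
A minimal way to picture an "upsert" is a dictionary keyed by the complete primary key: writing to a key that does not exist creates the row, and writing to one that does overwrites only the columns supplied. The sketch below is plain Python with a hypothetical `upsert` helper. It ignores write timestamps, tombstones, and everything else Cassandra does on the real write path.

```python
# Toy model: a table maps a complete primary key to a dict of column values.
def upsert(table, primary_key, **columns):
    """Create the row if it is missing, otherwise overwrite only the given columns."""
    table.setdefault(primary_key, {}).update(columns)

bladerunners = {}
upsert(bladerunners, "B16442", name="Harry Bryant", type="Captain")  # behaves like an insert
upsert(bladerunners, "B16442", data="Drink some for me, huh pal?")   # behaves like an update

print(len(bladerunners))               # 1
print(bladerunners["B16442"]["name"])  # Harry Bryant
print(bladerunners["B16442"]["data"])  # Drink some for me, huh pal?
```

Both calls go through the same code path, which is the sense in which INSERT and UPDATE are "treated the same".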

So why the different syntax?
- Flexibility. Some situations call for one or the other.
- Counter columns/tables can only be incremented with an UPDATE.
- INSERTs can save you some dev time in the application layer if your PRIMARY KEY changes.

“Upsert” example
- UPDATE bladerunners SET data='This guy is a one-man slaughterhouse.', name='Harry Bryant', ts='2015-03-30 14:47:00-0600', type='Captain'
  WHERE id='B16442';
- UPDATE bladerunners SET data='Drink some for me, huh pal?' WHERE id='B16442';

“Upsert” example (cont)
- INSERT INTO bladerunners (id, type, ts, data, name)
  VALUES ('B29591','Blade Runner','2015-03-30 14:34:00-0600','Captain Bryant would like a word.','Eduardo Gaff');
- INSERT INTO bladerunners (id, data)
  VALUES ('B29591','It''s too bad she won''t live. But then again, who does?');

Secondary Indexes
- Cassandra provides secondary indexes to allow queries on non-partition key columns.
- In 2.1.x you can even create indexes on collections and user-defined types.
- Designed for convenience, not for performance.
- Does not perform well on high-cardinality columns.
- Extremely low cardinality is also not a good idea.
- Low performance on a frequently updated column.
- In my opinion, try to avoid using them altogether.

Anti-Patterns
- Multi-Key queries: IN
- Secondary Index queries
- DELETEs or INSERTing null values

Summary
- While CQL is designed to make use of our previous experience using SQL, it is important to remember that the two do not behave the same.
- Even if you are at an expert level in SQL, read the CQL documentation before making any assumptions.

Additional Reading
- Getting Started with Time Series Data Modeling – Patrick McFadin
- SELECT – DataStax CQL 3.1 documentation
- Counting Keys in Cassandra – Richard Low
- Cassandra High Availability – Robbie Strickland
