Cassandra 3 new features @ Geecon Krakow 2016

Cassandra 3.0
new features
DuyHai DOAN
Apache Cassandra Evangelist
Kraków, 11-13 May 2016

Who Am I?
DuyHai DOAN (@doanduyhai), Kraków, 11-13 May 2016
Apache Cassandra Evangelist
•  talks, meetups, confs
•  open-source dev (Achilles, Apache Zeppelin)
•  OSS Cassandra point of contact
☞ duy_hai.doan@datastax.com
☞ @doanduyhai

DataStax
•  Founded in April 2010
•  We contribute a lot to Apache Cassandra™
•  400+ customers (25 of the Fortune 100), 450+ employees
•  Headquarters in the San Francisco Bay Area
•  EU headquarters in London, offices in France and Germany
•  DataStax Enterprise = OSS Cassandra + extra features

Agenda
•  Materialized Views (MV)
•  User Defined Functions (UDF) & User Defined Aggregates (UDA)
•  JSON syntax
•  New SASI full text search

Materialized Views (MV)

Why Materialized Views?
•  Relieve the pain of manual denormalization
CREATE TABLE user(id int PRIMARY KEY, country text, …);
CREATE TABLE user_by_country( country text, id int, …,
PRIMARY KEY(country, id));

Materialized Views creation
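-- Before: a denormalized table maintained by hand
-- CREATE TABLE user_by_country (
--   country text, id int,
--   firstname text, lastname text,
--   PRIMARY KEY(country, id));

-- After: a view of the same name, maintained by Cassandra from the base table
CREATE MATERIALIZED VIEW user_by_country
AS SELECT country, id, firstname, lastname
FROM user
WHERE country IS NOT NULL AND id IS NOT NULL
PRIMARY KEY(country, id);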

Materialized View Demo

Materialized Views Performance
•  Write performance
   •  slower than a normal write
   •  local lock + read-before-write cost (but paid only once for all views)
   •  for each base table update, worst case: mv_count x 2 (DELETE + INSERT) extra mutations for the views
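
The "mv_count x 2" worst case can be illustrated with a small Python model. This is only a sketch of the reasoning, not Cassandra's actual write path: the function names, the dict-based rows, and the assumption that all view key columns are non-null are hypothetical.

```python
from typing import Optional


def view_mutations(old_row: Optional[dict], new_row: dict, view_key: tuple) -> list:
    """Mutations one view needs for one base-table update (simplified model).

    old_row is what the read-before-write found in the base table (None if absent).
    Assumes every view key column is non-null in both rows.
    """
    new_key = tuple(new_row[c] for c in view_key)
    if old_row is not None:
        old_key = tuple(old_row[c] for c in view_key)
        if old_key != new_key:
            # The view row moves to another primary key:
            # remove the stale entry, then write the new one.
            return [("DELETE", old_key), ("INSERT", new_key)]
    # Same view key, or no previous row: a single write is enough.
    return [("INSERT", new_key)]


def worst_case_extra_mutations(mv_count: int) -> int:
    """Every view has to DELETE + INSERT, hence mv_count x 2."""
    return mv_count * 2


# A user moves from FR to PL: the user_by_country view row changes partition.
moved = view_mutations(
    {"id": 1, "country": "FR"},
    {"id": 1, "country": "PL"},
    ("country", "id"),
)
print(moved)  # [('DELETE', ('FR', 1)), ('INSERT', ('PL', 1))]
```

The best case (the view key is unchanged) costs one extra mutation per view, which is why the slide speaks of a worst case.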

•  Write performance vs manual denormalization
   •  MV better because no client-server network traffic for read-before-write
   •  MV better because less network traffic for multiple views (client-side BATCH)
   •  Makes developers' lives easier → priceless

•  Read performance vs secondary index
   •  MV better because single node read (secondary index can hit many nodes)
   •  MV better because single read path (secondary index = read index + read data)

Materialized Views Consistency
•  Consistency level
   •  CL honoured for base table, ONE for MV + local batchlog
   •  Weaker consistency guarantees for MV than for base table

Q & A

User Defined Functions (UDF)

Rationale
•  Push computation server-side
   •  save network bandwidth (1000 nodes!)
   •  simplify client-side code
   •  provide standard & useful functions (sum, avg …)
   •  accelerate analytics use cases (pre-aggregation for Spark)

How to create a UDF?
CREATE [OR REPLACE] FUNCTION [IF NOT EXISTS]
[keyspace.]functionName (param1 type1, param2 type2, …)
CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUT
RETURNS returnType
LANGUAGE language
AS $$
// source code here
$$;

(param1 type1, param2 type2, …)
•  param name: the name to refer to in the code
•  type: a Cassandra type

CALLED ON NULL INPUT
•  the function is always called; a null check is mandatory in the code

RETURNS NULL ON NULL INPUT
•  if any input is null, function execution is skipped and null is returned
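
The difference between the two null-handling modes can be modelled in a few lines of Python. This is an illustrative sketch only: the decorator and the `max_of` functions are hypothetical names, and Python's `None` stands in for a CQL null.

```python
from typing import Callable


def returns_null_on_null_input(fn: Callable) -> Callable:
    """Model of RETURNS NULL ON NULL INPUT: skip the body if any argument is null."""
    def wrapper(*args):
        if any(arg is None for arg in args):
            return None  # the function body is never executed
        return fn(*args)
    return wrapper


@returns_null_on_null_input
def max_of(a, b):
    # No null check needed here: the wrapper guarantees non-null inputs.
    return max(a, b)


def max_of_called_on_null(a, b):
    """Model of CALLED ON NULL INPUT: the body always runs, so it must check for nulls."""
    if a is None:
        return b
    if b is None:
        return a
    return max(a, b)


print(max_of(1, None))                 # None
print(max_of_called_on_null(1, None))  # 1
```

CALLED ON NULL INPUT is the mode to pick when a null carries meaning for the function (for instance "ignore this value"), at the price of explicit checks in the body.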

RETURNS returnType
Cassandra types
•  primitives (boolean, int, …)
•  collections (list, set, map)
•  tuples
•  UDT

LANGUAGE language
JVM supported languages
•  Java, Scala
•  Javascript (slow)
•  Groovy, Jython, JRuby
•  Clojure (JSR 223 impl issue)

UDF Demo

User Defined Aggregates (UDA)
•  Real use case for UDF
•  Aggregation server-side → huge network bandwidth saving
•  Provide behavior similar to Group By, Sum, Avg etc.

How to create a UDA?
CREATE [OR REPLACE] AGGREGATE [IF NOT EXISTS]
[keyspace.]aggregateName(type1, type2, …)
SFUNC accumulatorFunction
STYPE stateType
[FINALFUNC finalFunction]
INITCOND initCond;
•  (type1, type2, …): only types, no param names
•  STYPE: state type
•  INITCOND: initial state value

SFUNC accumulatorFunction
Accumulator function signature:
accumulatorFunction(stateType, type1, type2, …)
RETURNS stateType

Accumulator function ≈ foldLeft function

[FINALFUNC finalFunction]
Optional final function signature:
finalFunction(stateType)
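
The foldLeft analogy can be made concrete with a Python sketch of how SFUNC, STYPE, INITCOND and FINALFUNC fit together. It models the semantics only. The `run_aggregate` helper and the average example are hypothetical, with a `(count, total)` tuple playing the role of the state type.

```python
from functools import reduce


def run_aggregate(values, sfunc, initcond, finalfunc=None):
    """Fold the values into a state, then optionally post-process the state."""
    state = reduce(sfunc, values, initcond)  # the foldLeft step
    return finalfunc(state) if finalfunc is not None else state


def avg_state(state, value):
    """Accumulator: (stateType, type1) -> stateType."""
    count, total = state
    if value is None:  # behaves like CALLED ON NULL INPUT: skip nulls explicitly
        return state
    return (count + 1, total + value)


def avg_final(state):
    """Final function: stateType -> result."""
    count, total = state
    return None if count == 0 else total / count


result = run_aggregate([10, 20, None, 30], avg_state, (0, 0), avg_final)
print(result)  # 20.0
```

Without a final function the aggregate simply returns the last state, here the `(count, total)` pair.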

UDA Demo

Gotchas
•  UDA in Cassandra is not distributed!
•  Do not execute a UDA on a large number of rows (10^6 for example)
   •  single fat partition
   •  multiple partitions
   •  full table scan

•  → Increase the client-side timeout
   •  default Java driver timeout = 12 secs

Cassandra UDA or Apache Spark?
Consistency Level | Single/Multiple Partition(s) | Recommended Approach
ONE               | Single partition             | UDA with token-aware driver because node local
ONE               | Multiple partitions          | Apache Spark because distributed reads
> ONE             | Single partition             | UDA because data locality is lost with Spark
> ONE             | Multiple partitions          | Apache Spark definitely

JSON Syntax

Why JSON?
•  JSON is a very good exchange format
•  But a terrible schema …

•  How to have the best of both worlds?
   •  use Cassandra schema
   •  convert rows to JSON format

JSON Syntax Demo

SASI full text search index

Why SASI?
•  Searching (and full text search) was always a pain point for Cassandra
   •  limited search predicates (=, <=, <, > and >= only)
   •  limited scope (only on primary key columns)
•  Existing secondary index performance is poor
   •  inverted index
   •  uses Cassandra itself as index storage …
   •  limited predicate ( = ). Inequality predicate = full cluster scan 😱

How is it implemented?
•  New index structure = suffix trees
•  Extended predicates (=, inequalities, LIKE %)
•  Full text search (tokenizers, stop-words, stemming …)
•  Query Planner to optimize AND predicates
•  NO, we don't use Apache Lucene
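
Why a suffix-based structure makes LIKE '%term%' cheap can be sketched in Python. The example below is an assumption-heavy toy: it uses a sorted list of suffixes (closer to a suffix array than to a suffix tree), keeps everything in memory, and is not SASI's actual on-disk format. The function names are hypothetical.

```python
from bisect import bisect_left


def build_suffix_index(rows: dict) -> list:
    """Index every suffix of every value: sorted list of (suffix, row_id)."""
    entries = set()
    for row_id, text in rows.items():
        lowered = text.lower()  # case-insensitive matching, as an analyzer might do
        for start in range(len(lowered)):
            entries.add((lowered[start:], row_id))
    return sorted(entries)


def like_contains(index: list, term: str) -> set:
    """Row ids whose value contains term, i.e. LIKE '%term%'.

    A value contains term exactly when one of its suffixes starts with term,
    and all such suffixes are contiguous in the sorted index.
    """
    term = term.lower()
    position = bisect_left(index, (term,))  # first suffix >= term
    matches = set()
    while position < len(index) and index[position][0].startswith(term):
        matches.add(index[position][1])
        position += 1
    return matches


index = build_suffix_index({1: "Cassandra", 2: "Sandra", 3: "Spark"})
print(like_contains(index, "sand"))  # {1, 2}
```

The lookup is a binary search followed by a short scan, instead of a test against every value. The cost is index size, since each value contributes one entry per character.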

Who made it?
•  Open source contribution by a team of engineers from …

Full Text Search Demo

When is it available?
•  Right now with Cassandra ≥ 3.5
   •  available in Cassandra 3.4 but with critical bugs
•  Later improvements
   •  index on collections (List, Set & Map)
   •  OR clause (WHERE (xxx OR yyy) AND zzz)
   •  != operator

SASI vs Search Engine
SASI vs Solr/ElasticSearch/DataStax Enterprise Search?
•  Cassandra is not a search engine! (database = durability)
•  always slower because 2 passes (SASI index read + original Cassandra data)
•  no scoring
•  no ordering (ORDER BY)
•  no grouping (GROUP BY) → Apache Spark for analytics

Thank You
@doanduyhai
duy_hai.doan@datastax.com
https://academy.datastax.com/
