Cassandra 3.0
new features
DuyHai DOAN
Apache Cassandra Evangelist
GeeCON Kraków, 11-13 May 2016
Duyhai DOAN (@doanduyhai) Kraków, 11-13 May 2016
Who Am I?
Apache Cassandra Evangelist
•  talks, meetups, confs
•  open-source devs (Achilles, Apache Zeppelin)
•  OSS Cassandra point of contact
☞ duy_hai.doan@datastax.com
☞ @doanduyhai
DataStax
•  Founded in April 2010
•  We contribute a lot to Apache Cassandra™
•  400+ customers (25 of the Fortune 100), 450+ employees
•  Headquarters in the San Francisco Bay Area
•  EU headquarters in London, offices in France and Germany
•  DataStax Enterprise = OSS Cassandra + extra features
Agenda
•  Materialized Views (MV)
•  User Defined Functions (UDF) & User Defined Aggregates (UDA)
•  JSON syntax
•  New SASI full text search
Materialized Views (MV)
Why Materialized Views?
•  Relieve the pain of manual denormalization
CREATE TABLE user(id int PRIMARY KEY, country text, …);
CREATE TABLE user_by_country( country text, id int, …,
PRIMARY KEY(country, id));
Materialized Views creation
-- Instead of a denormalized table maintained by hand:
CREATE TABLE user_by_country (
    country text, id int,
    firstname text, lastname text,
    PRIMARY KEY(country, id));
-- declare a materialized view on the base table:
CREATE MATERIALIZED VIEW user_by_country
AS SELECT country, id, firstname, lastname
FROM user
WHERE country IS NOT NULL AND id IS NOT NULL
PRIMARY KEY(country, id);
Materialized View Demo
Materialized Views Performance
•  Write performance
    •  slower than normal write
    •  local lock + read-before-write cost (but paid only once for all views)
    •  for each base table update, worst case: mv_count x 2 (DELETE + INSERT) extra mutations for the views
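The worst-case figure above can be pictured with a toy in-memory model. This is only a sketch under stated assumptions: the class `BaseTableWithViews` and its fields are hypothetical names, it is not Cassandra's actual write path, and it merely counts the extra mutations applied to the views.

```python
class BaseTableWithViews:
    """Toy model (hypothetical): a base table keyed by id plus one view per column."""

    def __init__(self, view_columns):
        self.base = {}                                   # id -> row dict
        self.views = {col: {} for col in view_columns}   # col -> {(value, id): row}
        self.view_mutations = 0                          # extra mutations applied to views

    def upsert(self, row_id, **columns):
        # Read-before-write: the old row is read once and reused for every view.
        old = self.base.get(row_id)
        new = {**(old or {}), **columns, "id": row_id}
        self.base[row_id] = new
        for col, view in self.views.items():
            old_key = (old[col], row_id) if old and old.get(col) is not None else None
            new_key = (new[col], row_id) if new.get(col) is not None else None
            if old_key is not None and old_key != new_key:
                del view[old_key]            # DELETE the stale view row
                self.view_mutations += 1
            if new_key is not None:
                view[new_key] = new          # INSERT the current view row
                self.view_mutations += 1


table = BaseTableWithViews(["country", "city"])
table.upsert(1, country="FR", city="Paris")     # first write: one INSERT per view
table.upsert(1, country="PL", city="Krakow")    # update: DELETE + INSERT per view
print(table.view_mutations)
```

With two views, the update that changes both view keys costs 2 x 2 extra mutations, which matches the mv_count x 2 worst case.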
Materialized Views Performance
•  Write performance vs manual denormalization
    •  MV better because no client-server network traffic for read-before-write
    •  MV better because less network traffic for multiple views (client-side BATCH)
•  Makes developer life easier → priceless
Materialized Views Performance
•  Read performance vs secondary index
    •  MV better because single node read (secondary index can hit many nodes)
    •  MV better because single read path (secondary index = read index + read data)
Materialized Views Consistency
•  Consistency level
    •  CL honoured for base table, ONE for MV + local batchlog
•  Weaker consistency guarantees for MV than for base table
Q & A
User Defined Functions (UDF)
Rationale
•  Push computation server-side
•  save network bandwidth (1000 nodes!)
•  simplify client-side code
•  provide standard & useful functions (sum, avg …)
•  accelerate analytics use-case (pre-aggregation for Spark)
How to create a UDF?
CREATE [OR REPLACE] FUNCTION [IF NOT EXISTS]
[keyspace.]functionName (param1 type1, param2 type2, …)
CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUT
RETURNS returnType
LANGUAGE language
AS $$
// source code here
$$;
•  param1, param2, …: param name to refer to in the code
•  type1, type2, …: type = Cassandra type
•  CALLED ON NULL INPUT: always called. Null-check mandatory in code
•  RETURNS NULL ON NULL INPUT: if any input is null, function execution is skipped and null is returned
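The difference between the two null-handling clauses can be modelled outside Cassandra. The sketch below is only an analogy in Python: the decorator names and the `max_of` functions are hypothetical, and real UDF bodies are written in one of the supported JVM languages.

```python
def called_on_null_input(fn):
    """Model of CALLED ON NULL INPUT: the body always runs, even with None arguments."""
    return fn


def returns_null_on_null_input(fn):
    """Model of RETURNS NULL ON NULL INPUT: skip the body if any argument is None."""
    def wrapper(*args):
        if any(arg is None for arg in args):
            return None
        return fn(*args)
    return wrapper


@returns_null_on_null_input
def max_of(a, b):
    # No null check needed: this body never sees None.
    return max(a, b)


@called_on_null_input
def max_of_null_safe(a, b):
    # The null check is mandatory here because the body runs for None inputs too.
    if a is None:
        return b
    if b is None:
        return a
    return max(a, b)


print(max_of(1, None), max_of_null_safe(1, None))
```

The first variant keeps the function body simple; the second lets the function decide what a null input means.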
Cassandra types
•  primitives (boolean, int, …)
•  collections (list, set, map)
•  tuples
•  UDT
JVM supported languages
•  Java, Scala
•  JavaScript (slow)
•  Groovy, Jython, JRuby
•  Clojure (JSR 223 impl issue)
UDF Demo
User Defined Aggregate (UDA)
•  Real use-case for UDF
•  Aggregation server-side → huge network bandwidth saving
•  Provide similar behavior for Group By, Sum, Avg etc …
How to create a UDA?
CREATE [OR REPLACE] AGGREGATE [IF NOT EXISTS]
[keyspace.]aggregateName(type1, type2, …)
SFUNC accumulatorFunction
STYPE stateType
[FINALFUNC finalFunction]
INITCOND initCond;
•  (type1, type2, …): only types, no param names
•  STYPE: state type
•  INITCOND: initial state
Accumulator function signature:
accumulatorFunction(stateType, type1, type2, …)
RETURNS stateType

Accumulator function ≈ foldLeft function
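The foldLeft analogy can be made concrete with a small sketch. It is an illustration in Python, not CQL: `avg_state`, `avg_final` and `aggregate` are hypothetical names standing in for the SFUNC, FINALFUNC and aggregate roles, and the tuple `(0, 0)` plays the part of INITCOND.

```python
from functools import reduce


def avg_state(state, value):
    """Accumulator (SFUNC role): current state plus one value gives the next state."""
    count, total = state
    return (count + 1, total + value)


def avg_final(state):
    """Optional final function (FINALFUNC role): turn the last state into the result."""
    count, total = state
    return total / count if count else None


def aggregate(values, sfunc, initcond, finalfunc=None):
    """Fold the values left to right, starting from the initial state (INITCOND role)."""
    state = reduce(sfunc, values, initcond)
    return finalfunc(state) if finalfunc else state


print(aggregate([1, 2, 3, 6], avg_state, (0, 0), avg_final))
```

The state type here is a pair (count, total); without a final function the aggregate would simply return that last state.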
Optional final function signature:
finalFunction(stateType)
UDA Demo
Gotchas
•  UDA in Cassandra is not distributed!
•  Do not execute UDA on a large number of rows (10^6 for ex.)
    •  single fat partition
    •  multiple partitions
    •  full table scan
•  → Increase client-side timeout
    •  default Java driver timeout = 12 secs
Cassandra UDA or Apache Spark?
Consistency Level | Single/Multiple Partition(s) | Recommended Approach
ONE               | Single partition             | UDA with token-aware driver because node local
ONE               | Multiple partitions          | Apache Spark because distributed reads
> ONE             | Single partition             | UDA because data-locality lost with Spark
> ONE             | Multiple partitions          | Apache Spark definitely
Q & A
JSON Syntax
Why JSON?
•  JSON is a very good exchange format
•  But a terrible schema …
•  How to have best of both worlds?
    •  use Cassandra schema
    •  convert rows to JSON format
JSON Syntax Demo
SASI full text search index
Why SASI?
•  Searching (and full text search) was always a pain point for Cassandra
    •  limited search predicates (=, <=, <, > and >= only)
    •  limited scope (only on primary key columns)
•  Existing secondary index performance is poor
    •  reversed-index
    •  use Cassandra itself as index storage …
    •  limited predicate ( = ). Inequality predicate = full cluster scan 😱
How is it implemented?
•  New index structure = suffix trees
•  Extended predicates (=, inequalities, LIKE %)
•  Full text search (tokenizers, stop-words, stemming …)
•  Query Planner to optimize AND predicates
•  NO, we don’t use Apache Lucene
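To see why indexing suffixes makes LIKE '%…%' predicates answerable without a full scan, here is a toy sketch. It is not SASI's actual data structure or on-disk format: `ToySuffixIndex` is a hypothetical class that keeps every suffix of each term in a sorted list and answers a "contains" query with a prefix lookup over those suffixes.

```python
import bisect


class ToySuffixIndex:
    """Hypothetical in-memory index: every suffix of every term, kept sorted."""

    def __init__(self):
        self._entries = []  # sorted list of (suffix, row_id) pairs

    def add(self, row_id, term):
        term = term.lower()
        for start in range(len(term)):
            bisect.insort(self._entries, (term[start:], row_id))

    def contains(self, fragment):
        """Row ids whose term would match LIKE '%fragment%'."""
        fragment = fragment.lower()
        # A term contains the fragment exactly when one of its suffixes starts with it,
        # and all such suffixes sit next to each other in sorted order.
        position = bisect.bisect_left(self._entries, (fragment,))
        matches = set()
        while (position < len(self._entries)
               and self._entries[position][0].startswith(fragment)):
            matches.add(self._entries[position][1])
            position += 1
        return matches


index = ToySuffixIndex()
index.add(1, "Cassandra")
index.add(2, "Sandra")
index.add(3, "Spark")
print(sorted(index.contains("sand")))
```

The lookup cost depends on the number of matching suffixes rather than on the number of indexed rows, at the price of storing one entry per suffix.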
Who made it?
•  Open source contribution by an engineering team from …
Full Text Search Demo
When is it available?
•  Right now with Cassandra ≥ 3.5
    •  available in Cassandra 3.4 but with critical bugs
•  Later improvements
    •  index on collections (List, Set & Map)!
    •  OR clause (WHERE (xxx OR yyy) AND zzz)
    •  != operator
SASI vs Search Engine
SASI vs Solr/ElasticSearch/Datastax Enterprise Search?
•  Cassandra is not a search engine! (database = durability)
•  always slower because 2 passes (SASI index read + original Cassandra data)
•  no scoring
•  no ordering (ORDER BY)
•  no grouping (GROUP BY) → Apache Spark for analytics
Q & A
Thank You
@doanduyhai
duy_hai.doan@datastax.com
https://academy.datastax.com/

Cassandra 3 new features @ Geecon Krakow 2016
