Introduction to Apache Cassandra (for Java developers!)
Nate McCall [email_address] @zznate

Brief Intro
- NOT a "key/value store"
- Columns are dynamic inside a column family
- SSTables are immutable
- SSTables merged on reads
- All nodes share the same role (i.e. no single point of failure)
- Trading ACID compliance for scalability is a fundamental design decision

How does this impact development?
- Substantially.
- For operations affecting the same data, that data will become consistent eventually as determined by the timestamps.
- But you can trade availability for consistency. (More on this later)
- You can store whatever you want. It's all just bytes.
- You need to think about how you will query the data before you write it.

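To make "consistent eventually as determined by the timestamps" concrete, here is a minimal sketch of last-write-wins reconciliation in plain Java. The ColumnVersion and TimestampReconciler classes are hypothetical names invented for this illustration; they are not Cassandra or Hector types, and the tie handling is a simplification.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for one replica's copy of a column. */
final class ColumnVersion {
    final String value;
    final long timestamp; // supplied by the client with each write

    ColumnVersion(String value, long timestamp) {
        this.value = value;
        this.timestamp = timestamp;
    }
}

final class TimestampReconciler {
    /** Returns the copy with the highest timestamp; on a tie the first one seen is kept. */
    static ColumnVersion reconcile(List<ColumnVersion> copies) {
        ColumnVersion winner = null;
        for (ColumnVersion copy : copies) {
            if (winner == null || copy.timestamp > winner.timestamp) {
                winner = copy;
            }
        }
        return winner;
    }

    public static void main(String[] args) {
        ColumnVersion winner = reconcile(Arrays.asList(
                new ColumnVersion("Austin", 1000L),
                new ColumnVersion("Round Rock", 3000L),
                new ColumnVersion("Pflugerville", 2000L)));
        System.out.println(winner.value); // prints "Round Rock"
    }
}
```

The order in which copies arrive does not matter; only the timestamps decide, which is why client clocks deserve attention.
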
Neat. So Now What?
Like any database, you need a client!
Python:
- Telephus: http://github.com/driftx/Telephus (Twisted)
- Pycassa: http://github.com/pycassa/pycassa
Java:
- Hector: http://github.com/rantav/hector (Examples: https://github.com/zznate/hector-examples)
- Pelops: http://github.com/s7/scale7-pelops
- Kundera: http://code.google.com/p/kundera/
- Datanucleus JDO: http://github.com/tnine/Datanucleus-Cassandra-Plugin
Grails:
- grails-cassandra: https://github.com/wolpert/grails-cassandra
.NET:
- FluentCassandra: http://github.com/managedfusion/fluentcassandra
- Aquiles: http://aquiles.codeplex.com/
Ruby:
- Cassandra: http://github.com/fauna/cassandra
PHP:
- phpcassa: http://github.com/thobbs/phpcassa
- SimpleCassie: http://code.google.com/p/simpletools-php/wiki/SimpleCassie

... but do not roll your own

Thrift
- Fast, efficient serialization and network IO.
- Lots of clients available (you can probably use it in other places as well)
Why you don't want to work with the Thrift API directly:
- SuperColumn
- ColumnOrSuperColumn
- ColumnParent.super_column
- ColumnPath.super_column
- Map<ByteBuffer,Map<String,List<Mutation>>> mutationMap

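To show why the nested mutationMap is awkward to build by hand, the sketch below fills the same shape (row key, then column family name, then a list of mutations) using only the JDK. SketchMutation is a hypothetical placeholder for the Thrift-generated Mutation type, and the row key and column family names are assumptions for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical placeholder for the Thrift-generated Mutation type. */
final class SketchMutation {
    final String columnName;
    final String value;

    SketchMutation(String columnName, String value) {
        this.columnName = columnName;
        this.value = value;
    }
}

final class MutationMapSketch {
    /** Adds one mutation under row key -> column family, creating the inner levels on demand. */
    static void add(Map<ByteBuffer, Map<String, List<SketchMutation>>> mutationMap,
                    String rowKey, String columnFamily, SketchMutation mutation) {
        ByteBuffer key = ByteBuffer.wrap(rowKey.getBytes(StandardCharsets.UTF_8));
        mutationMap.computeIfAbsent(key, k -> new HashMap<>())
                   .computeIfAbsent(columnFamily, cf -> new ArrayList<>())
                   .add(mutation);
    }

    public static void main(String[] args) {
        Map<ByteBuffer, Map<String, List<SketchMutation>>> mutationMap = new HashMap<>();
        add(mutationMap, "512202", "Npanxx", new SketchMutation("city", "Austin"));
        add(mutationMap, "512202", "Npanxx", new SketchMutation("state", "TX"));
        System.out.println(mutationMap.size()); // prints 1: both mutations share one row key
    }
}
```

Three levels of generics for a two-column write is the kind of bookkeeping a higher-level client hides.
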
Higher Level Client
Hector
- JMX Counters
- Add/remove hosts:
  - automatically
  - programmatically
  - via JMX
- Pluggable load balancing
- Complete encapsulation of Thrift API
- Type-safe approach to dealing with Apache Cassandra
- Lightweight ORM (supports JPA 1.0 annotations)
- Mavenized! http://repo2.maven.org/maven2/me/prettyprint/

"CQL"
- Currently in Apache Cassandra trunk
- Experimental
- Lots of possibilities
From test/system/test_cql.py:
  UPDATE StandardLong1 SET 1L="1", 2L="2", 3L="3", 4L="4" WHERE KEY="aa"
  SELECT "cd1", "col" FROM Standard1 WHERE KEY = "kd"
  DELETE "cd1", "col" FROM Standard1 WHERE KEY = "kd"

Avro??
- Gone. Added too much complexity after Thrift caught up.
- "None of the libraries distinguished themselves as being a particularly crappy choice for serialization." (See CASSANDRA-1765)

Thrift API Methods
- Retrieving
- Writing/Removing
- Meta Information
- Schema Manipulation

Thrift API Methods - Retrieving
- get: retrieve a single column for a key
- get_slice: retrieve a "slice" of columns for a key
- multiget_slice: retrieve a "slice" of columns for a list of keys
- get_count: counts the columns of a key (you have to deserialize the row to do it)
- get_range_slices: retrieve a slice for a range of keys
- get_indexed_slices (FTW!)

Thrift API Methods - Writing/Removing
- insert
- batch_mutate (batch insertion AND deletion)
- remove
- truncate**

Thrift API Methods - Meta Information
- describe_cluster_name
- describe_version
- describe_keyspace
- describe_keyspaces

Thrift API Methods - Schema
- system_add_keyspace
- system_update_keyspace
- system_drop_keyspace
- system_add_column_family
- system_update_column_family
- system_drop_column_family

vs. RDBMS - Consistency Level
- Consistency is tunable per request!
- Cassandra provides consistency when R + W > N (read replica count + write replica count > replication factor).
- *** CONSISTENCY LEVEL FAILURE IS NOT A ROLLBACK ***
- Idempotent: an operation can be applied multiple times without changing the result

vs. RDBMS - Append Only
- Proper data modelling will minimize seeks (Go to Tyler's presentation for more!)

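The R + W > N rule is plain arithmetic, so a small helper can check a chosen pair of consistency levels. This is a sketch; ConsistencyMath and its method names are made up for illustration and are not part of any client library.

```java
final class ConsistencyMath {
    /** Replicas needed for a quorum: a strict majority of the replication factor. */
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    /** True when every read set must overlap every write set, i.e. R + W > N. */
    static boolean readsSeeLatestWrite(int readReplicas, int writeReplicas, int replicationFactor) {
        return readReplicas + writeReplicas > replicationFactor;
    }

    public static void main(String[] args) {
        int n = 3;
        int q = quorum(n);
        System.out.println(q);                            // prints 2
        System.out.println(readsSeeLatestWrite(q, q, n)); // prints true  (2 + 2 > 3)
        System.out.println(readsSeeLatestWrite(1, 1, n)); // prints false (1 + 1 <= 3)
    }
}
```

With N = 3, quorum reads plus quorum writes satisfy the rule, as does writing to all three replicas and reading from one.
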
On to the Code...
https://github.com/zznate/cassandra-tutorial
- Uses Maven.
- Really basic.
- Modify/abuse/alter as needed.
- Descriptions of what is going on and how to run each example are in the Javadoc comments.
- Sample data is based on the North American Numbering Plan: http://en.wikipedia.org/wiki/North_American_Numbering_Plan

Data Shape
  512 202 30.27 097.74 W TX Austin
  512 203 30.27 097.74 L TX Austin
  512 204 30.32 097.73 W TX Austin
  512 205 30.32 097.73 W TX Austin
  512 206 30.32 097.73 L TX Austin

Get a Single Column for a Key
GetCityForNpanxx.java
Retrieve a single column with:
- Name
- Value
- Timestamp
- TTL

Get the Contents of a Row
GetSliceForNpanxx.java
- Retrieves a list of columns (Hector wraps these in a ColumnSlice)
- "SlicePredicate" can be either an explicit set of columns OR a range (more on ranges soon)
- Another messy either/or choice encapsulated by Hector

Get the (sorted!) Columns of a Row
GetSliceForStateCity.java
- Shows why the choice of comparator is important (this is the order in which the columns hit the disk - take advantage of it)
- Can be easily modified to return results in reverse order (but this is slightly slower)

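As a rough illustration of why the comparator matters, the sketch below sorts the same column names with a text ordering and with a numeric ordering, using a TreeMap as a stand-in for a row. ComparatorSketch and its members are hypothetical names; the real comparators are configured on the column family, not in client code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

final class ComparatorSketch {
    /** Orders names as text, so "10" sorts before "9". */
    static final Comparator<String> AS_TEXT = Comparator.naturalOrder();

    /** Orders names by their numeric value, so "9" sorts before "10". */
    static final Comparator<String> AS_LONG =
            (a, b) -> Long.compare(Long.parseLong(a), Long.parseLong(b));

    /** Returns the column names in the order the given comparator would store them. */
    static List<String> sortedNames(List<String> names, Comparator<String> comparator) {
        TreeMap<String, String> row = new TreeMap<>(comparator);
        for (String name : names) {
            row.put(name, "");
        }
        return new ArrayList<>(row.keySet());
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("9", "10", "202");
        System.out.println(sortedNames(names, AS_TEXT));            // prints [10, 202, 9]
        System.out.println(sortedNames(names, AS_LONG));            // prints [9, 10, 202]
        System.out.println(sortedNames(names, AS_LONG.reversed())); // prints [202, 10, 9]
    }
}
```

The same three names come back in a different order under each comparator, which changes what a range slice over them returns.
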
Get the Same Slice from Several Rows
MultigetSliceForNpanxx.java
- Very similar to get_slice examples, except we provide a list of keys

Get Slices From a Range of Rows
GetRangeSlicesForStateCity.java
- Like multiget_slice, except we can specify a KeyRange (encapsulated by RangeSlicesQuery#setKeys(start, end))
- The results of this query will be significantly more meaningful with OrderPreservingPartitioner (try this at home!)

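To see why key ranges are more meaningful with an order-preserving partitioner, the sketch below orders the same row keys two ways: by the key itself and by a hash-derived token. The MD5-based token is only a stand-in for how a hashing partitioner scatters rows, and PartitionerSketch with its methods is hypothetical.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

final class PartitionerSketch {
    /** Token derived from an MD5 hash of the key (illustrative, not the server's exact code). */
    static BigInteger md5Token(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            return new BigInteger(md5.digest(key.getBytes(StandardCharsets.UTF_8))).abs();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Row order when rows are placed by key: adjacent keys stay adjacent. */
    static List<String> keyOrder(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        Collections.sort(sorted);
        return sorted;
    }

    /** Row order when rows are placed by hashed token: unrelated to key order. */
    static List<String> tokenOrder(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort(Comparator.comparing(PartitionerSketch::md5Token));
        return sorted;
    }
}
```

A range from one key to another only selects "the keys in between" under keyOrder; under tokenOrder the same start and end bracket an essentially arbitrary set of rows.
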
Get Slices From a Range of Rows - 2
GetSliceForAreaCodeCity.java
- Compound column name for controlling ranges
- Comparator at work on text field

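A compound column name lets a text comparator do range selection by prefix. The sketch below uses a TreeMap as a stand-in for a sorted row; the "City__exchange" naming scheme and the CompoundNameSlice class are assumptions for illustration, not the format used in the tutorial code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class CompoundNameSlice {
    /** Returns the column names between start and finish, inclusive, in sorted order. */
    static List<String> slice(TreeMap<String, String> row, String start, String finish) {
        return new ArrayList<>(row.subMap(start, true, finish, true).keySet());
    }

    /** Selects every column whose name begins with the given prefix. */
    static List<String> sliceByPrefix(TreeMap<String, String> row, String prefix) {
        // '\uffff' sorts after any character used in the names here, so it closes the range.
        return slice(row, prefix, prefix + '\uffff');
    }

    public static void main(String[] args) {
        TreeMap<String, String> row = new TreeMap<>();
        row.put("Austin__202", "");
        row.put("Austin__203", "");
        row.put("Dallas__214", "");
        row.put("Houston__713", "");
        System.out.println(sliceByPrefix(row, "Austin")); // prints [Austin__202, Austin__203]
    }
}
```

Because the columns are stored sorted, the prefix query is a contiguous read rather than a scan with a filter.
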
Get Slices from Indexed Columns
GetIndexedSlicesForCityState.java
- You only need to index a single column to apply clauses on other columns (BUT - the indexed column must be present with an EQUALS clause!)
- (It's just another ColumnFamily maintained automatically)

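The rule that the indexed column must appear with an EQUALS clause follows from how the lookup proceeds: the index narrows the candidates, and any other clauses only filter them. Here is a minimal in-memory sketch of that idea; IndexedSliceSketch and its column names are hypothetical, and it deliberately ignores row updates and deletes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

final class IndexedSliceSketch {
    // Row key -> (column name -> value); a stand-in for the data column family.
    private final Map<String, Map<String, String>> rows = new TreeMap<>();
    // Indexed "state" value -> row keys; a stand-in for the automatically maintained index.
    private final Map<String, TreeSet<String>> stateIndex = new HashMap<>();

    /** Writes a new row and records it in the index (row updates are not handled here). */
    void insert(String rowKey, String state, String city) {
        Map<String, String> columns = new HashMap<>();
        columns.put("state", state);
        columns.put("city", city);
        rows.put(rowKey, columns);
        stateIndex.computeIfAbsent(state, s -> new TreeSet<>()).add(rowKey);
    }

    /** EQUALS on the indexed column picks the candidates; the city clause then filters them. */
    List<String> keysWhere(String stateEquals, String cityEquals) {
        List<String> matches = new ArrayList<>();
        TreeSet<String> candidates = stateIndex.get(stateEquals);
        if (candidates == null) {
            return matches;
        }
        for (String rowKey : candidates) {
            if (cityEquals.equals(rows.get(rowKey).get("city"))) {
                matches.add(rowKey);
            }
        }
        return matches;
    }
}
```

Without the equality on the indexed column there is no candidate set to start from, which is why a clause on a non-indexed column alone cannot be served this way.
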
Insert, Update and Delete
... are effectively the same operation.
- InsertRowsForColumnFamilies.java
- DeleteRowsForColumnFamily.java
Run each in succession (in whichever combination you like) and verify your results on the CLI.
Hint: watch the timestamps
  bin/cassandra-cli --host localhost
  use Tutorial;
  list AreaCode;
  list Npanxx;
  list StateCity;

Stuff I Punted on for the Sake of Brevity
- meta_* methods
  - CassandraClusterTest.java: L43-81 @hector
- system_* methods
  - SchemaManipulation.java @ hector-examples
  - CassandraClusterTest.java: L84-157 @hector
- ORM (it works and is in production)
  - ORM Documentation
- multiple nodes
- failure scenarios
- Data modelling (go see Tyler's presentation)

Things to Remember
- deletes and timestamp granularity
- "range ghosts"
- using the wrong column comparator and InvalidRequestException
- deletions actually write data
- use column-level TTL to automate deletion
- "how do I iterate over all the rows in a column family"?
  - get_range_slices, but don't do that
  - a good sign your data model is wrong

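Two of the points above, "deletions actually write data" and "deletes and timestamp granularity", can be illustrated with a small tombstone sketch. TombstoneSketch is a hypothetical in-memory model; the assumption that a delete wins over a write carrying the same timestamp is a simplification stated here for the example.

```java
import java.util.HashMap;
import java.util.Map;

final class TombstoneSketch {
    /** A stored cell; a null value marks a tombstone left behind by a delete. */
    private static final class Cell {
        final String value;
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    private final Map<String, Cell> cells = new HashMap<>();

    void insert(String name, String value, long timestamp) {
        apply(name, new Cell(value, timestamp));
    }

    /** A delete is just another write: it stores a tombstone with its own timestamp. */
    void delete(String name, long timestamp) {
        apply(name, new Cell(null, timestamp));
    }

    private void apply(String name, Cell incoming) {
        Cell current = cells.get(name);
        boolean newer = current == null || incoming.timestamp > current.timestamp;
        // Assumption for this sketch: on equal timestamps the tombstone wins.
        boolean tombstoneWinsTie = current != null
                && incoming.timestamp == current.timestamp && incoming.value == null;
        if (newer || tombstoneWinsTie) {
            cells.put(name, incoming);
        }
    }

    /** Returns the live value, or null if the column is absent or deleted. */
    String get(String name) {
        Cell cell = cells.get(name);
        return cell == null ? null : cell.value;
    }

    /** Number of stored cells, tombstones included. */
    int storedCells() {
        return cells.size();
    }
}
```

If a delete and a re-insert carry the same timestamp, the re-insert is silently shadowed, which is why coarse timestamp granularity can make a freshly written column appear to vanish.
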
Dealing with *Lots* of Data (Briefly)
- Two biggest headaches have been addressed:
  - Compaction pollutes the OS page cache (CASSANDRA-1470)
  - More than 143 million keys in a single SSTable means more Bloom filter (BF) false positives (CASSANDRA-1555)
- Hadoop integration: Yes. (Go see Jeremy's presentation)
- Bulk loading: Yes. CASSANDRA-1278
- For more information: http://wiki.apache.org/cassandra/LargeDataSetConsiderations
