Cassandra: From tarball to production
Why talk about this?
You are about to deploy Cassandra
You are looking for “best practices”
You don’t want:
... to scour the documentation
... to do something known not to work well
... to forget an important step
What we won’t cover
● Cassandra: how does it work?
● How do I design my schema?
● What’s new in Cassandra X.Y?
So many things to do
Monitoring, Snitch, DC/Rack Settings, Time Sync, Seeds/Autoscaling, Full/Incremental Backups, AWS Instance Selection, AWS AMI (Image) Selection, Disk - SSD?, Disk Space - 2x?, Periodic Repairs, Replication Strategy, Compaction Strategy, SSL/VPC/VPN, Authorization + Authentication, OS Conf - Users, OS Conf - Limits, OS Conf - Perms, OS Conf - FSType, OS Conf - Logs, OS Conf - Path, C* Start/Stop, Use case evaluation
Chef to the rescue?
Chef community cookbook available:
https://github.com/michaelklishin/cassandra-chef-cookbook
● Installs Java
● Creates a “cassandra” user/group
● Downloads/extracts the tarball
● Fixes up ownership
● Builds the C* configuration files
● Sets the ulimits for file handles, processes, and memory locking
● Sets up an init script
● Sets up data directories
Chef Cookbook Coverage
Monitoring, Snitch, DC/Rack Settings, Time Sync, Seeds/Autoscaling, Full/Incremental Backups, Disk - SSD?, Disk - How much?, AWS Instance Type, AWS AMI (Image) Selection, Periodic Repairs, Replication Strategy, Compaction Strategy, SSL/VPC/VPN, Authorization + Authentication, OS Conf - Users, OS Conf - Limits, OS Conf - Perms, OS Conf - FSType, OS Conf - Logs, OS Conf - Path, C* Start/Stop, Use case evaluation
Monitoring
Is every node answering queries?
Are nodes talking to each other?
Are any nodes running slowly?
Push UDP! (statsd)
http://hackers.lookout.com/2015/01/cassandra-monitoring/
https://github.com/lookout/cassandra-statsd-agent
Monitoring - Synthetic
Health checks, bad and good
● ‘nodetool status’ exit code
○ Might return 0 if the node is not accepting requests
○ Slow, cross-node reads
● cqlsh -u sysmon -p password < /dev/null
● Verifies this node can read the auth table
● https://github.com/lookout/cassandra-health-check
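Wired into a monitor, the good check is just an exit-code test; a minimal sketch (the sysmon user and password are placeholders):
$ cqlsh -u sysmon -p password < /dev/null && echo OK || echo CRITICAL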
What about OpsCenter?
We chose not to use it
Want consistent interface for all monitoring
The GUI vs. command-line argument
Didn’t see good auditing capabilities
Didn’t interface well with our Chef setup
Snitch
Use the right snitch!
● AWS? Ec2MultiRegionSnitch
● Google? GoogleCloudSnitch
● GossipingPropertyFileSnitch
NOT
● SimpleSnitch (default)
Community cookbook: set it!
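The corresponding cassandra.yaml line is a one-liner (shown here for AWS; the cookbook exposes it as an attribute):
endpoint_snitch: Ec2MultiRegionSnitch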
What is RF?
Replication Factor (RF) is the number of copies of each piece of data
The partition key is hashed to determine the primary host
Additional copies always go to the next nodes on the ring
What is CL?
Consistency Level -- it’s not RF!
Describes how many replicas must respond before the operation is considered COMPLETE
CL_ONE - only one replica responds
CL_QUORUM - (RF/2)+1 replicas (round down)
CL_ALL - all RF replicas respond
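For example, cqlsh lets you set the consistency level per session, which makes the trade-off easy to experiment with (keyspace and query are placeholders):
cqlsh> CONSISTENCY QUORUM;
Consistency level set to QUORUM.
cqlsh> SELECT * FROM my_ks.users WHERE id = 42;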
DC/Rack Settings
You might need to set these
Maybe you’re not in Amazon
Rack == Availability Zone?
Hard: renaming a DC or adding racks
Renaming DCs
Clients “remember” which DC they talk to
Renaming single DC causes all clients to fail
Better to spin up a new DC than rename the old one
Adding a rack
Start with a 6-node cluster, all in rack R1
Replication factor 3
Add 1 node in R2, and rebalance
Rack-aware placement puts one replica per rack, so the lone R2 node gets a copy of ALL the data!
Good idea to keep racks balanced
I don’t have time for this
Clusters must have synchronized time
You will get lots of drift with: [0-3].amazon.pool.ntp.org
The community cookbook doesn’t cover anything here
Better make time for this
C* serializes write operations by timestamps
Clocks on virtual machines drift!
It’s the relative difference among clocks that matters
C* nodes should synchronize with each other
Solution: use a pair of peered NTP servers (stratum 2 or 3) and a small set of known upstream providers
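A minimal /etc/ntp.conf sketch for one of the two peered servers (hostnames are hypothetical):
server 0.pool.ntp.org iburst # small, fixed set of upstreams
server 1.pool.ntp.org iburst
peer ntp2.internal.example.com # the other in-house NTP server
Point every C* node at ntp1/ntp2 only, so the cluster drifts together rather than apart.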
From a small seed…
Seeds are used by new nodes to find the cluster
Every new node should use the same seeds
Seed nodes learn of topology changes faster
Each seed node must be in the config file
Multiple seeds per datacenter recommended
Tricky to configure on AWS (autoscaling changes IPs)
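The cassandra.yaml shape, with placeholder IPs (pick two or three stable nodes per DC):
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "10.0.1.10,10.0.2.10,10.1.1.10"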
Backups - Full+Incremental
Nothing in the cookbooks for this
C* makes it “easy”: snapshot, then copy
Snapshots might require a lot more space
Remove the snapshot after copying it
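A minimal sketch of the cycle (keyspace, paths, and bucket are placeholders; flag spellings vary slightly across C* versions):
$ nodetool snapshot -t nightly my_keyspace # hard-links the current SSTables
$ aws s3 cp --recursive /var/lib/cassandra/data/my_keyspace/ s3://my-backups/$(hostname)/ --exclude "*" --include "*/snapshots/nightly/*"
$ nodetool clearsnapshot -t nightly my_keyspace # free the space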
Disk selection
● SSD (ephemeral): low latency, great random r/w performance, no network use for disk. Recommended.
● Rotational (ephemeral): any size instance, good write performance, no network use for disk. Not cheap.
● EBS: any size instance, less expensive, no node rebuilds after an instance is replaced.
AWS Instance Selection
We moved to EC2
c3.2xlarge (15 GiB mem, 160 GB disk)?
i2.xlarge (30 GiB mem, 800 GB disk)
Max recommended storage per node is 1 TB
Use instance types that support HVM:
“Some previous generation instance types, such as T1, C1, M1, and M2 do not support Linux HVM AMIs. Some current generation instance types, such as T2, I2, R3, G2, and C4 do not support PV AMIs.”
How much can I use??
Snapshots take space (kind of: they’re hard links, so they only consume space as compaction replaces the original SSTables)
Best practice: keep disks half full!
An 800 GB disk becomes 400 GB usable
Snapshots during repairs?
Lots of uses for snapshots!
Periodic Repairs
Buried in the docs: “As a best practice, you should schedule repairs weekly”
http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_repair_nodes_c.html
● “-pr” (yes)
● “-par” (maybe)
● “--in-local-dc” (no)
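Putting the flags together, the weekly per-node run is simply:
$ nodetool repair -pr # primary ranges only; run on every node, staggered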
Repair Tips
Raise gc_grace_seconds (more tombstone headroom)
Run on one node at a time
Schedule for low-usage hours
Use “-par” if you have dead time (faster)
Tune with: nodetool setcompactionthroughput
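Compaction throughput is a live knob, no restart needed (16 MB/s is the usual default):
$ nodetool setcompactionthroughput 64 # open it up during a quiet repair window
$ nodetool setcompactionthroughput 16 # back to the default afterwards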
I thought I deleted that
Compaction removes “old” tombstones
10-day default grace period (gc_grace_seconds)
After that, deletes will not be propagated!
Run ‘nodetool repair’ at least every 10 days
Once a week is perfect (3 days of margin)
Node down >7 days? ‘nodetool removenode’ it!
Changing RF within DC?
Easy to decrease RF
Increasing RF (usually) requires running repair afterwards
Reads at CL_ONE might fail until that repair completes!
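The mechanics, with placeholder keyspace and DC names — raise RF, then repair before trusting CL_ONE reads again:
cqlsh> ALTER KEYSPACE my_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'us-east': 3};
$ nodetool repair my_ks # on each node, to stream in the new replicas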
Replication Strategy
How many replicas should we have?
What happens if some data is lost?
Are you write-heavy or read-heavy?
Quorum considerations: odd is better!
RF=1? RF=3? RF=5?
Magic JMX setting: reduce traffic to a node
Great when a node is “behind” the 4-hour hint window
Used by the gossiper to divert traffic during repairs
Writes: ok; read repair: ok; nodetool repair: ok
$ java -jar jmxterm.jar -l localhost:7199
$> set -b org.apache.cassandra.db:type=DynamicEndpointSnitch Severity 10000
Don’t be too severe!
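And undo it once the node has caught up — setting Severity back to 0 restores normal routing:
$> set -b org.apache.cassandra.db:type=DynamicEndpointSnitch Severity 0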
Compaction Strategy
Largely driven by your use case and schema design
SizeTiered or Leveled?
Leveled has better guarantees for read times
SizeTiered may require 10 (or more) reads!
Leveled uses less disk space
Leveled tombstone collection is slower
Auth*
Cookbooks default to OFF
Turn authenticator and authorizer on
The default ‘cassandra’ user is super special
Its sign-on requires QUORUM (cross-DC)
All other users sign on at LOCAL_ONE!
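The cassandra.yaml lines, plus a sketch of replacing the default superuser (names and passwords are placeholders):
authenticator: PasswordAuthenticator
authorizer: CassandraAuthorizer
cqlsh> CREATE USER dba WITH PASSWORD 'changeme' SUPERUSER;
cqlsh> ALTER USER cassandra WITH PASSWORD 'something-long-and-random';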
Users
OS users vs Cassandra users: 1 to 1?
Shared credentials for apps?
Nothing logs the user taking the action!
‘cassandra’ user is created by cookbook
All processes run as ‘cassandra’
Limits
Chef helps here! Startup script:
ulimit -l unlimited # memory locking
ulimit -n 48000 # file descriptors
/etc/security/limits.d:
cassandra - nofile 48000
cassandra - nproc unlimited
cassandra - memlock unlimited
Filesystem Type
Officially supported: ext4 or XFS
XFS is slightly faster
Interesting options:
● ext4 without journal
● ext2
● zfs
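A minimal sketch for an XFS data volume (device name is hypothetical; noatime is a common recommendation):
$ mkfs.xfs /dev/xvdb
$ mount -o noatime /dev/xvdb /var/lib/cassandra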
Logs
To consolidate or not to consolidate?
Push or pull? Usually push!
FOSS: syslogd, syslog-ng, logstash/kibana, heka, banana
Others: Splunk, SumoLogic, Loggly, Stackify
Shutdown
Nice init script with cookbook; the steps are:
● nodetool disablethrift (no more clients)
● nodetool disablegossip (stop talking to the cluster)
● nodetool drain (flush all memtables)
● kill the JVM
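As a shell sketch (the pid file location depends on your init script):
$ nodetool disablethrift && nodetool disablegossip && nodetool drain
$ kill $(cat /var/run/cassandra/cassandra.pid)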
Quick performance wins
● Disable assertions - cookbook property
● No swap space (or vm.swappiness=1)
● concurrent_reads (cassandra.yaml)
● concurrent_writes (cassandra.yaml)
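Where these live, as a sketch (values are rough starting points, not gospel):
$ sysctl -w vm.swappiness=1 # or remove swap entirely
# cassandra.yaml -- rule of thumb: reads ~ 16 x data drives, writes ~ 8 x cores
concurrent_reads: 32
concurrent_writes: 64
Assertions are the -ea flag in cassandra-env.sh; the cookbook property simply drops it.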
Thank You!
@rkuris
ron.kuris@gmail.com