PRACTICE MAKES PERFECT:
EXTREME CASSANDRA OPTIMIZATION
@AlTobey
Tech Lead, Compute and Data Services
#CASSANDRA13
I didn’t name this talk. The conference people did, but I like it a lot.
2
⁍ About me / Ooyala
⁍ How not to manage your Cassandra clusters
⁍ Make it suck less
⁍ How to be a heuristician
⁍ Tools of the trade
⁍ More Settings
⁍ Show & Tell
#CASSANDRA13
Outline
3
⁍ Tech Lead, Compute and Data Services at Ooyala, Inc.
⁍ C&D team is #devops: 3 ops, 3 eng, me
⁍ C&D team is #bdaas: Big Data as a Service
⁍ ~100 Cassandra nodes, expanding quickly
⁍ Obligatory: we’re hiring
#CASSANDRA13
@AlTobey
⁍ I won’t go into devops today, but I’m happy to talk about it later.
⁍ 2 years at Ooyala, SRE -> TL Tools Team -> C&D
⁍ C&D builds BDaaS for Ooyala: fully managed Cassandra / Spark / Hadoop / Zookeeper / Kafka
⁍ 11 clusters, 5-36 nodes, working on something big
⁍ BEFORE: engineers deployed systems themselves (expensive, error-prone); AFTER: engineers use APIs & consult
4
⁍ Founded in 2007
⁍ 230+ employees globally
⁍ 200M unique users, 110+ countries
⁍ Over 1 billion videos played per month
⁍ Over 2 billion analytic events per day
#CASSANDRA13
Ooyala
5
Ooyala has been using Cassandra since v0.4
Use cases:
⁍ Analytics data (real-time and batch)
⁍ Highly available K/V store
⁍ Time series data
⁍ Play head tracking (cross-device resume)
⁍ Machine Learning Data
#CASSANDRA13
Ooyala & Cassandra
Ooyala: Legacy Platform
6
[Diagram: players and loggers ➜ API ➜ S3 ➜ Hadoop cluster ➜ ABE Service ➜ Cassandra cluster, with the Cassandra write path labeled read-modify-write; "START HERE" marks the players/loggers]
#CASSANDRA13
⁍ Ruby MR -- CDH3u4 -- 80 Dell Blades
⁍ Cassandra 0.4 --> 1.1 / DSE 3.x
⁍ 18x Dell r509 48GiB RAM 6x 600G 15k SAS / MD RAID 5 -- more on RAID later
⁍ We’ve scaled our data volume by 2x yearly for the last 4 years.
Avoiding read-modify-write
7#CASSANDRA13
cassandra13_drinks column family

memTable
  Albert     Tuesday 6    Wednesday 0
  Evan       Tuesday 0    Wednesday 0
  Frank      Tuesday 3    Wednesday 3
  Kelvin     Tuesday 0    Wednesday 0
  Krzysztof  Tuesday 0    Wednesday 0
  Phillip    Tuesday 12   Wednesday 0
⁍ CF to track how much I expect my team at Ooyala to drink
⁍ Row keys are names
⁍ Column keys are days
⁍ Values are a count of drinks
Avoiding read-modify-write
8#CASSANDRA13
cassandra13_drinks column family

memTable
  Albert     Tuesday 2    Wednesday 0
  Phillip    Tuesday 0    Wednesday 1

ssTable
  Albert     Tuesday 6    Wednesday 0
  Evan       Tuesday 0    Wednesday 0
  Frank      Tuesday 3    Wednesday 3
  Kelvin     Tuesday 0    Wednesday 0
  Krzysztof  Tuesday 0    Wednesday 0
  Phillip    Tuesday 12   Wednesday 0
⁍ The next day, after a flush
⁍ I’m speaking so I decided to drink less
⁍ Phillip informs me that he has quit drinking
Avoiding read-modify-write
9#CASSANDRA13
cassandra13_drinks column family

memTable
  Albert     Tuesday 22   Wednesday 0

ssTable
  Albert     Tuesday 2    Wednesday 0
  Phillip    Tuesday 0    Wednesday 1

ssTable
  Albert     Tuesday 6    Wednesday 0
  Evan       Tuesday 0    Wednesday 0
  Frank      Tuesday 3    Wednesday 3
  Kelvin     Tuesday 0    Wednesday 0
  Krzysztof  Tuesday 0    Wednesday 0
  Phillip    Tuesday 12   Wednesday 0
⁍ I’m drinking with all you people so I decide to add 20
⁍ read 2, add 20, write 22
Avoiding read-modify-write
10#CASSANDRA13
cassandra13_drinks column family

ssTable
  Albert     Tuesday 22   Wednesday 0
  Evan       Tuesday 0    Wednesday 0
  Frank      Tuesday 3    Wednesday 3
  Kelvin     Tuesday 0    Wednesday 0
  Krzysztof  Tuesday 0    Wednesday 0
  Phillip    Tuesday 0    Wednesday 1
⁍ After compaction & conflict resolution
⁍ Overwriting the same value is just fine! It works really well for some patterns, such as time-series data
⁍ Separate read/write streams are handy for debugging, but not a big deal (a hedged CQL sketch of the overwrite pattern follows)
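⁍ A minimal sketch of the overwrite pattern in CQL 3 syntax (the cluster above used Thrift at the time; the keyspace, table and values below are illustrative, not from the deck):

cqlsh <<'CQL'
CREATE KEYSPACE IF NOT EXISTS demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
CREATE TABLE IF NOT EXISTS demo.cassandra13_drinks (
  name   text,
  day    text,
  drinks int,
  PRIMARY KEY (name, day)
);
-- anti-pattern: SELECT the old count, add 20 in the client, then UPDATE (read-modify-write)
-- instead, blindly write the new absolute value; the newest timestamp wins when
-- memtables and sstables are merged at read/compaction time, as on the slides above
UPDATE demo.cassandra13_drinks SET drinks = 22 WHERE name = 'Albert' AND day = 'Tuesday';
CQL

⁍ This only works when the client can compute the absolute value; for pure increments, counters are the alternative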
2011: 0.6 ➜ 0.8
11
⁍ Migration is still a largely unsolved problem
⁍ Wrote a tool in Scala to scrub data and write via Thrift
⁍ Rebuilt indexes - faster than copying
[Diagram: Hadoop + Cassandra 0.6 cluster ➜ Scala Map/Reduce over Thrift, staged on GlusterFS P2P ➜ Cassandra 0.8 cluster]
#CASSANDRA13
⁍ Because of some legacy choices, we know we had a bunch of expired tombstones
⁍ GlusterFS: userspace, ionice(1), fast & easy
⁍ sstabledump and similar tools were TOO SLOW; the Scala MR job took only a week (with production running too!)
Changes: 0.6 ➜ 0.8
12
⁍ Cassandra 0.8
⁍ 24GiB heap
⁍ Sun Java 1.6 update
⁍ Linux 2.6.36
⁍ XFS on MD RAID5
⁍ Disabled swap or at least vm.swappiness=1
#CASSANDRA13
⁍ More on XFS settings & bugs later
⁍ Got significant improvements from RAID & readahead tuning (more later)
⁍ Al’s first rule of tuning databases: disable swap or GTFO
⁍ fixed lots of applications by simply disabling swap
13
⁍ 18 nodes ➜ 36 nodes
⁍ DSE 3.0
⁍ Stale tombstones again!
⁍ No downtime!
[Diagram: existing Cassandra cluster ➜ Scala Map/Reduce over Thrift, staged on GlusterFS P2P ➜ DSE 3.0 cluster]
#CASSANDRA13
2012: Capacity Increase
⁍ I switched teams, working on Hastur, didn’t document enough, repairs were forgotten again
⁍ 60 day GC Grace Period expired ... 3 months ago
⁍ rsync is not enough for hardware moves: do rebuilds!
⁍ Use DSE Map/Reduce to isolate most of the load from production
System Changes: Apache 1.0 ➜ DSE 3.0
14
⁍ DSE 3.0 installed via apt packages
⁍ Unchanged: heap, distro
⁍ Ran much faster this time!
⁍ Mistake: Moved to MD RAID 0
Fix: RAID10 or RAID5, MD, ZFS, or btrfs
⁍ Mistake: Running on Ubuntu Lucid
Fix: Ubuntu Precise
#CASSANDRA13
⁍ Previously deployed with Capistrano
⁍ DSE 3’s Hadoop is compiled on Debian 6 so native components will not load on 10.04’s libc
⁍ still gradually rebuilding nodes from RAID0 ➜ RAID5 and Lucid -> Precise
Config Changes: Apache 1.0 ➜ DSE 3.0
15
⁍ Schema: compaction_strategy = LCS
⁍ Schema: bloom_filter_fp_chance = 0.1
⁍ Schema: sstable_size_in_mb = 256
⁍ Schema: compression_options = Snappy
⁍ YAML: compaction_throughput_mb_per_sec: 0
#CASSANDRA13
⁍ LCS is a huge improvement in operations life (no more major compactions)
⁍ Bloom filters were tipping over a 24GiB heap
⁍ With lots of data per node, sstable sizes in LCS must be MUCH bigger
⁍ > 100,000 open files slows everything down, especially startup
⁍ 256 MB vs. 5 MB is a 50x reduction in file count
⁍ Compaction can’t keep up: even huge rates don’t work, must be disabled
⁍ try to adjust heap, etc. so you’re flushing at nearly full memtables to reduce compaction needs
⁍ backreference RMW?
⁍ might be fixed in >= 1.2
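⁍ The cluster itself was changed with the 1.0-era tooling; as a hedged sketch, the same settings in later CQL 3 / cassandra.yaml syntax ("ks.cf" is a placeholder, not an Ooyala table):

# cassandra.yaml (as on the slide): disable the compaction throughput cap
#   compaction_throughput_mb_per_sec: 0
cqlsh <<'CQL'
ALTER TABLE ks.cf
  WITH compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 256}
  AND compression = {'sstable_compression': 'SnappyCompressor'}
  AND bloom_filter_fp_chance = 0.1;
CQL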
16
⁍ 36 nodes ➜ lots more nodes
⁍ As usual, no downtime!
#CASSANDRA13
[Diagram: two DSE 3.1 clusters with replication between them]
2013: Datacenter Move
⁍ Size omitted in published slides. I was asked not to publish yet, I will tweet, etc. in a couple weeks.
⁍ Wasn’t the original plan, but we save a lot of $$ by leaving old cage
⁍ Prep for next-generation architecture!
17
Upcoming use cases:
⁍ Store every event from our players at full resolution
⁍ Cache code for our Spark job server
⁍ AMPLab Tachyon backend?
#CASSANDRA13
Coming Soon for Cassandra at Ooyala
⁍ This is the intro for the next slide / diagram.
⁍ Considering Astyanax or CQL3 backend for Tachyon so we can contribute it back
18
[Diagram: players and loggers ➜ API ➜ Kafka ➜ ingest ➜ DSE 3.1 (Cassandra), with a Spark job server and possibly Tachyon on top]
#CASSANDRA13
Next Generation Architecture: Ooyala Event Store
⁍ Look mom! No Hadoop! Remember what I said about latency?
⁍ But we’re not just running DSE on these machines. They’re running: DSE, Spark, KVM, and CDH3u4 (legacy)
⁍ Secret is cgroups!
⁍ Also, ZFS (later)
19
⁍ Security
⁍ Cost of Goods Sold
⁍ Operations / support
⁍ Developer happiness
⁍ Physical capacity (cpu/memory/network/disk)
⁍ Reliability / Resilience
⁍ Compromise
#CASSANDRA13
There’s more to tuning than performance:
Shifting themes: philosophy of tuning
⁍ Security is always #1: The decision to disable security features is an important decision!
⁍ Example: EC2 instances sizes vary wildly in consistency and raw performance
⁍ Leveled vs. Size Tiered compaction, ZFS/LVM/MDRAID, bare metal vs. EC2
⁍ how much of this stuff do my devs need to know? How much work is it to get a new KS/CF?
⁍ speed of node rebuilds, risk incurred by extended rebuilds, speed of repair
a.) e.g. it takes a full 24 hours to repair each node in our 36-node cluster, so > 1 month to repair the cluster
⁍ repeatable configurations, do future admins have to remember to do stuff or is it automated?
⁍ Look up “John Allspaw Resilience”
⁍ you only have access to EC2 or old hardware, or your company has an OS/filesystem/settings policy (e.g. my $PREVIOUS_JOB hardened CentOS 5.3 distro on Linux 2.6.18.x)
There are others of course.
20
⁍ I’d love to be more scientific, but production comes first
⁍ Sometimes you have to make educated guesses
⁍ It’s not as difficult as it’s made out to be
⁍ Your brain is great at heuristics. Trust it.
⁍ Concentrate on bottlenecks
⁍ Make incremental changes
⁍ Read Malcolm Gladwell’s “Blink”
#CASSANDRA13
I am not a scientist ... heuristician?
⁍ A truly scientific approach would take a lot of time and resources.
⁍ When under time pressure and things are slow, you have to move fast and measure “by the seat of your pants”
⁍ Be educated, do research, and make sensible decisions without months of testing, be prepared to do better next time
⁍ It’s actually pretty fast and easy this way!
⁍ More on what tools I use later on.
21
Observe, Orient, Decide, Act:
⁍ Observe the system in production under load
⁍ Make small, safe changes
⁍ Observe
⁍ Commit or Revert
#CASSANDRA13
The OODA Loop
⁍ Understand YOUR production workload first!
⁍ Look at Opscenter latency numbers
⁍ cl-netstat.pl (later)
⁍ Examples:
⁍ Changing /proc/sys/vm/dirty_background_ratio is fairly safe and shows results quickly.
⁍ Some network settings can take your node offline, temporarily or require manual intervention.
⁍ Changing the compaction scheme requires a lot of time and has other implications.
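⁍ For example, one OODA pass on the dirty_background_ratio knob from the note above (standard Linux paths; the value 2 is just the starting point from the sysctl slide later on):

cat /proc/sys/vm/dirty_background_ratio        # observe the current value
echo 2 > /proc/sys/vm/dirty_background_ratio   # act: one small, reversible change
dstat -lrvn 10                                 # observe again under load, then commit (sysctl.conf) or revert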
Testing Shiny Things
22
⁍ Like kernels
⁍ And Linux distributions
⁍ And ZFS
⁍ And btrfs
⁍ And JVM’s & parameters
⁍ Test them in production!
#CASSANDRA13
⁍ Testing stuff in a lab is fine, if you have one and you have the time.
⁍ Take (responsible) advantage of Cassandra’s resilience:
⁍ test things you think should work well in production on ONE node or a couple nodes well spaced out.
[Diagram: a cluster of mostly ext4 nodes, with one node each trying ZFS, btrfs, and a kernel upgrade]
Testing Shiny Things: In Production
23#CASSANDRA13
⁍ Use your staging / non-prod environments first if you have them (some people don’t and that’s unfortunate but it happens)
⁍ test things you think should work well in production on ONE node or a couple nodes well spaced out.
24#CASSANDRA13
Brendan Gregg’s Tool Chart
http://joyent.com/blog/linux-performance-analysis-and-tools-brendan-gregg-s-talk-at-scale-11x
⁍ Brendan Gregg’s chart is so good, I just copied it for now.
⁍ Original: http://joyent.com/blog/linux-performance-analysis-and-tools-brendan-gregg-s-talk-at-scale-11x
⁍ I’ll briefly talk about a few
25#CASSANDRA13
dstat -lrvn 10
⁍ Just like vmstat but prettier and does way more
⁍ 35 lines of output = about 5 minutes of 10s snapshots
⁍ What’s interesting?
⁍ IO wait starting at line 5, but all numbers are going up, so this is probably during a map/reduce job
⁍ IO wait is high, but disk throughput isn’t impressive at all
⁍ ~2 blocked “procs” (which includes threads)
Not bothering to tune this right now because production latency is fine.
26#CASSANDRA13
cl-netstat.pl
https://github.com/tobert/perl-ssh-tools
⁍ Home grown.
⁍ Requires no software on the target machines except for SSH.
⁍ Recent Net::SSH2 supports ssh-agent
27#CASSANDRA13
iostat -x 1
⁍ Mostly I just look at the *wait numbers here.
⁍ Great for finding a bad disk with high latency.
28#CASSANDRA13
htop
⁍ Per-CPU utilization bars are nice
⁍ Displays threads by default (hit “H” in plain top)
⁍ Very configurable!
⁍ For example: 1 thread at 100% CPU is usually the GC
29#CASSANDRA13
jconsole
⁍ Looks like I can reduce the heap size on this cluster, but should probably increase -Xmn to 100mb * (physical cores) (not counting hypercores)
30#CASSANDRA13
opscenter
⁍ It looks better on a high-resolution display ;)
31#CASSANDRA13
nodetool ring
10.10.10.10 Analytics rack1 Up Normal 47.73 MB 1.72% 1012046694721756637024691720378965
10.10.10.10 Analytics rack1 Up Normal 63.94 MB 0.86% 1026714038123521225967078556906197
10.10.10.10 Analytics rack1 Up Normal 85.73 MB 0.86% 1041381381525285814909465393433428
10.10.10.10 Analytics rack1 Up Normal 47.87 MB 0.86% 1056048724927050403851852229960659
10.10.10.10 Analytics rack1 Up Normal 39.73 MB 0.86% 1070716068328814992794239066487891
10.10.10.10 Analytics rack1 Up Normal 40.74 MB 1.75% 1100423945662575060114582859200003
10.10.10.10 Analytics rack1 Up Normal 40.08 MB 2.20% 1137814208669076757916163680305794
10.10.10.10 Analytics rack1 Up Normal 56.19 MB 3.45% 1196501513956187970179620530735245
10.10.10.10 Analytics rack1 Up Normal 214.88 MB 11.62% 1394248867770897155613247921498720
10.10.10.10 Analytics rack1 Up Normal 214.29 MB 2.45% 1435882108713996181107000284314407
10.10.10.10 Analytics rack1 Up Normal 158.49 MB 1.76% 1465773686249280216901752503449044
10.10.10.10 Analytics rack1 Up Normal 40.3 MB 0.92% 1481401683578223483181070489250370
⁍ hotspots
32#CASSANDRA13
nodetool cfstats
Keyspace: gostress
Read Count: 0
Read Latency: NaN ms.
Write Count: 0
Write Latency: NaN ms.
Pending Tasks: 0
Column Family: stressful
SSTable count: 1
Space used (live): 32981239
Space used (total): 32981239
Number of Keys (estimate): 128
Memtable Columns Count: 0
Memtable Data Size: 0
Memtable Switch Count: 0
Read Count: 0
Read Latency: NaN ms.
Write Count: 0
Write Latency: NaN ms.
Pending Tasks: 0
Bloom Filter False Positives: 0
Bloom Filter False Ratio: 0.00000
Bloom Filter Space Used: 336
Compacted row minimum size: 7007507
Compacted row maximum size: 8409007
Compacted row mean size: 8409007
Could be using a lot of heap
Controllable by sstable_size_in_mb
⁍ bloom filters
⁍ sstable_size_in_mb
33#CASSANDRA13
nodetool proxyhistograms
Offset Read Latency Write Latency Range Latency
35 0 20 0
42 0 61 0
50 0 82 0
60 0 440 0
72 0 3416 0
86 0 17910 0
103 0 48675 0
124 1 97423 0
149 0 153109 0
179 2 186205 0
215 5 139022 0
258 134 44058 0
310 2656 60660 0
372 34698 742684 0
446 469515 7359351 0
535 3920391 31030588 0
642 9852708 33070248 0
770 4487796 9719615 0
924 651959 984889 0
⁍ units are microseconds
⁍ can give you a good idea of how much latency coordinator hops are costing you
34#CASSANDRA13
nodetool compactionstats
al@node ~ $ nodetool compactionstats
pending tasks: 3
compaction type keyspace column family bytes compacted bytes total progress
Compaction hastur gauge_archive 9819749801 16922291634 58.03%
Compaction hastur counter_archive 12141850720 16147440484 75.19%
Compaction hastur mark_archive 647389841 1475432590 43.88%
Active compaction remaining time : n/a
al@node ~ $ nodetool compactionstats
pending tasks: 3
compaction type keyspace column family bytes compacted bytes total progress
Compaction hastur gauge_archive 10239806890 16922291634 60.51%
Compaction hastur counter_archive 12544404397 16147440484 77.69%
Compaction hastur mark_archive 1107897093 1475432590 75.09%
Active compaction remaining time : n/a
35#CASSANDRA13
⁍ cassandra-stress
⁍ YCSB
⁍ Production
⁍ Terasort (DSE)
⁍ Homegrown
Stress Testing Tools
⁍ we mostly focus on cassandra-stress for burn-in of new clusters
⁍ can quickly figure out the right setting for -Xmn
⁍ Terasort is interesting for comparing DSE to Cloudera/Hortonworks/etc. (it’s fast!)
⁍ Consider writing custom benchmarks for your application patterns
⁍ sometimes it’s faster to write one than figure out how to make a generic tool do what you want
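⁍ As a rough sketch, a burn-in with the 2013-era cassandra-stress might look like this (flags recalled from the old stress tool, so check --help on your version; node names and counts are placeholders):

cassandra-stress -d node1,node2,node3 -o insert -n 10000000 -t 200   # write phase
cassandra-stress -d node1,node2,node3 -o read   -n 10000000 -t 200   # read the same keys back
# watch GC in jconsole / the logs while this runs to pick -Xmn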
36#CASSANDRA13
kernel.pid_max = 999999
fs.file-max = 1048576
vm.max_map_count = 1048576
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 65536 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
vm.dirty_ratio = 10
vm.dirty_background_ratio = 2
vm.swappiness = 1
/etc/sysctl.conf
⁍ pid_max doesn’t fix anything, I just like it and have never had a problem with it
⁍ These are my starting point settings for nearly every system/application.
⁍ Generally safe for production.
⁍ vm.dirty*ratio can go big for fake fast writes, generally safe for Cassandra, but beware you’re more likely to see FS/file corruption on power loss
⁍ but you will get latency spikes if you hit dirty_ratio (percentage of RAM), so don’t tune too low
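⁍ These load without a reboot; a quick way to apply the file and spot-check a few values:

sysctl -p /etc/sysctl.conf                                      # load the settings above
sysctl vm.swappiness vm.dirty_ratio vm.dirty_background_ratio   # verify what actually took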
37#CASSANDRA13
ra=$((2**14)) # 16k
ss=$(blockdev --getss /dev/sda)
blockdev --setra $(($ra / $ss)) /dev/sda
echo 256 > /sys/block/sda/queue/nr_requests
echo cfq > /sys/block/sda/queue/scheduler
echo 16384 > /sys/block/md7/md/stripe_cache_size
/etc/rc.local
⁍ Lower readahead is better for latency on seeky workloads
⁍ More readahead will artificially increase your IOPS by reading a bunch of stuff you might not need!
⁍ nr_requests = number of IO structs the kernel will keep in flight, don’t go crazy
⁍ Deadline is best for raw throughput
⁍ CFQ supports cgroup priorities and is occasionally better for latency on SATA drives
⁍ Default stripe cache is 128. The increase seems to help MD RAID5 a lot.
⁍ Don’t forget to set readahead separately for MD RAID devices
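⁍ Following that last note, the MD device gets the same treatment as the sd* devices above (device name is an example):

ra=$((2**14))                       # same 16k target as above
ss=$(blockdev --getss /dev/md7)
blockdev --setra $(($ra / $ss)) /dev/md7
blockdev --getra /dev/md7           # verify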
38#CASSANDRA13
-Xmx8G leave it alone
-Xms8G leave it alone
-Xmn1200M 100MiB * nCPU
-Xss180k should be fine
-XX:+UseNUMA
numactl --interleave
JVM Args
⁍ In general, most people should leave the defaults alone. Especially the heap, which can cause no end of trouble if you do it wrong and cause GC pauses.
⁍ Don’t count hypercores.
⁍ Our biggest bang for the buck so far has been tuning newsize.
⁍ Have you ever seen “out of memory” when there’s plenty of memory available? You probably have a full NUMA node.
⁍ NUMA is how modern machines are built. Older Apache Cassandra distros had numactl --interleave, but this doesn’t seem to be in the DSE scripts. I’ve been running +UseNUMA for about a year and a half now and it seems to work fine.
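⁍ In a package install these usually live in cassandra-env.sh rather than on the command line; a sketch assuming 12 physical cores (paths differ between Apache and DSE packages):

# e.g. /etc/cassandra/cassandra-env.sh or /etc/dse/cassandra/cassandra-env.sh
MAX_HEAP_SIZE="8G"                   # -Xmx / -Xms: leave alone unless you have a reason
HEAP_NEWSIZE="1200M"                 # -Xmn: ~100MiB per physical core, hypercores not counted
JVM_OPTS="$JVM_OPTS -XX:+UseNUMA"    # or wrap startup in: numactl --interleave=all <command>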
cgroups
39#CASSANDRA13
Provides fine-grained control over Linux resources
⁍ Makes the Linux scheduler better
⁍ Lets you manage systems under extreme load
⁍ Useful on all Linux machines
⁍ Can choose between determinism and flexibility
⁍ static resource assignment has better determinism / consistency
⁍ weighted resources provide most of the advantage with a lot more flexibility
cgroups
40#CASSANDRA13
# appended to /etc/default/cassandra, which the init script sources at startup
cat >> /etc/default/cassandra <<'EOF'
# create a "cassandra" cgroup and move this process (and its children) into it
cpucg=/sys/fs/cgroup/cpu/cassandra
mkdir -p $cpucg
cat $cpucg/../cpuset.mems >$cpucg/cpuset.mems
cat $cpucg/../cpuset.cpus >$cpucg/cpuset.cpus
echo 100 > $cpucg/cpu.shares
echo $$ > $cpucg/tasks
EOF
⁍ automatically adds cassandra to a CG called “cassandra”
⁍ cpuset.mems can be used to limit NUMA nodes if you have huge machines
⁍ cpuset.cpus can restrict tasks to specific cores (like taskset, stricter)
⁍ shares is just a number, set your own scale, 1-1000 works for me
⁍ adding a task to a CG is as simple as adding its PID
⁍ children are not necessarily added, you must add threads too if joining after startup (ps -efL)
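⁍ Per that last note, a JVM joining the cgroup after startup needs every thread added, not just the main PID; a hypothetical helper (the pgrep pattern and cgroup path are assumptions):

cg=/sys/fs/cgroup/cpu/cassandra
pid=$(pgrep -f CassandraDaemon | head -n1)   # the running Cassandra JVM
for tid in /proc/$pid/task/*; do             # one directory per thread, the same list ps -efL shows
  echo ${tid##*/} > $cg/tasks
done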
Successful Experiment: btrfs
41#CASSANDRA13
mkfs.btrfs -m raid10 -d raid0 /dev/sd[c-h]1
mount -o compress=lzo /dev/sdc1 /data
⁍ Like ZFS, btrfs can manage multiple disks without mdraid or LVM.
⁍ We have one production system in EC2 running btrfs flawlessly.
⁍ I’m told there are problems when the disk fills up so don’t do that.
⁍ noatime isn’t necessary on modern Linux, relatime is the default for xfs / ext4 and is good enough
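⁍ Given the warning about full filesystems, it is worth keeping an eye on allocation with the standard btrfs commands:

btrfs filesystem df /data          # data vs. metadata usage inside the filesystem
btrfs filesystem show /dev/sdc1    # per-device allocation across the pool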
Successful Experiment: ZFS on Linux
42#CASSANDRA13
zpool create data raidz /dev/sd[c-h]
zfs create data/cassandra
zfs set compression=lzjb data/cassandra
zfs set atime=off data/cassandra
zfs set logbias=throughput data/cassandra
⁍ ZFS really is the ultimate filesystem.
⁍ RAIDZ is like RAID5 but totally different:
⁍ variable-width stripes
⁍ no write hole
⁍ VERY fast, plays well with C*
⁍ Stable! (so far)
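⁍ A couple of standard checks after building the pool, to confirm the properties took and what lzjb is saving:

zpool status data                                         # pool health and raidz layout
zfs get compression,compressratio,atime data/cassandra    # confirm settings and the compression win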
Conclusions
43#CASSANDRA13
⁍ Tuning is multi-dimensional
⁍ Production load is your most important benchmark
⁍ Lean on Cassandra, experiment!
⁍ No one metric tells the whole story
Questions?
44#CASSANDRA13
⁍ Twitter: @AlTobey
⁍ Github: https://github.com/tobert
⁍ Email: al@ooyala.com / tobert@gmail.com