PRACTICE MAKES PERFECT:
EXTREME CASSANDRA OPTIMIZATION
@AlTobey
Tech Lead, Compute and Data Services
#CASSANDRA13
I didn’t name this talk. The conference people did, but I like it a lot.
2
⁍ About me / Ooyala
⁍ How not to manage your Cassandra clusters
⁍ Make it suck less
⁍ How to be a heuristician
⁍ Tools of the trade
⁍ More Settings
⁍ Show & Tell
#CASSANDRA13
Outline
3
⁍ Tech Lead, Compute and Data Services at Ooyala, Inc.
⁍ C&D team is #devops: 3 ops, 3 eng, me
⁍ C&D team is #bdaas: Big Data as a Service
⁍ ~100 Cassandra nodes, expanding quickly
⁍ Obligatory: we’re hiring
#CASSANDRA13
@AlTobey
⁍ I won’t go into devops today, but I’m happy to talk about it later.
⁍ 2 years at Ooyala, SRE -> TL Tools Team -> C&D
⁍ C&D builds BDaaS for Ooyala: fully managed Cassandra / Spark / Hadoop / Zookeeper / Kafka
⁍ 11 clusters, 5-36 nodes, working on something big
⁍ BEFORE: engineers deployed systems themselves (expensive, error-prone); AFTER: engineers use APIs & consult
4
⁍ Founded in 2007
⁍ 230+ employees globally
⁍ 200M unique users, 110+ countries
⁍ Over 1 billion videos played per month
⁍ Over 2 billion analytic events per day
#CASSANDRA13
Ooyala
5
Ooyala has been using Cassandra since v0.4
Use cases:
⁍ Analytics data (real-time and batch)
⁍ Highly available K/V store
⁍ Time series data
⁍ Play head tracking (cross-device resume)
⁍ Machine Learning Data
#CASSANDRA13
Ooyala & Cassandra
Ooyala: Legacy Platform
6
[Diagram: players and loggers ➜ API ➜ S3 ➜ Hadoop cluster ➜ ABE Service ➜ Cassandra cluster, with the Cassandra write path labeled read-modify-write; "START HERE" marks the players/loggers]
#CASSANDRA13
⁍ Ruby MR -- CDH3u4 -- 80 Dell Blades
⁍ Cassandra 0.4 --> 1.1 / DSE 3.x
⁍ 18x Dell r509 48GiB RAM 6x 600G 15k SAS / MD RAID 5 -- more on RAID later
⁍ We’ve scaled our data volume by 2x yearly for the last 4 years.
Avoiding read-modify-write
7#CASSANDRA13
cassandra13_drinks column family

memTable
  Albert     Tuesday 6    Wednesday 0
  Evan       Tuesday 0    Wednesday 0
  Frank      Tuesday 3    Wednesday 3
  Kelvin     Tuesday 0    Wednesday 0
  Krzysztof  Tuesday 0    Wednesday 0
  Phillip    Tuesday 12   Wednesday 0
⁍ CF to track how much I expect my team at Ooyala to drink
⁍ Row keys are names
⁍ Column keys are days
⁍ Values are a count of drinks
Avoiding read-modify-write
8#CASSANDRA13
cassandra13_drinks column family

memTable
  Albert     Tuesday 2    Wednesday 0
  Phillip    Tuesday 0    Wednesday 1

ssTable
  Albert     Tuesday 6    Wednesday 0
  Evan       Tuesday 0    Wednesday 0
  Frank      Tuesday 3    Wednesday 3
  Kelvin     Tuesday 0    Wednesday 0
  Krzysztof  Tuesday 0    Wednesday 0
  Phillip    Tuesday 12   Wednesday 0
⁍ The next day, after a flush
⁍ I’m speaking so I decided to drink less
⁍ Phillip informs me that he has quit drinking
Avoiding read-modify-write
9#CASSANDRA13
cassandra13_drinks column family

memTable
  Albert     Tuesday 22   Wednesday 0

ssTable
  Albert     Tuesday 2    Wednesday 0
  Phillip    Tuesday 0    Wednesday 1

ssTable
  Albert     Tuesday 6    Wednesday 0
  Evan       Tuesday 0    Wednesday 0
  Frank      Tuesday 3    Wednesday 3
  Kelvin     Tuesday 0    Wednesday 0
  Krzysztof  Tuesday 0    Wednesday 0
  Phillip    Tuesday 12   Wednesday 0
⁍ I’m drinking with all you people so I decide to add 20
⁍ read 2, add 20, write 22
Avoiding read-modify-write
10#CASSANDRA13
cassandra13_drinks column family

ssTable
  Albert     Tuesday 22   Wednesday 0
  Evan       Tuesday 0    Wednesday 0
  Frank      Tuesday 3    Wednesday 3
  Kelvin     Tuesday 0    Wednesday 0
  Krzysztof  Tuesday 0    Wednesday 0
  Phillip    Tuesday 0    Wednesday 1
⁍ After compaction & conflict resolution
⁍ Overwriting the same value is just fine! It works really well for some patterns, such as time-series data
⁍ Separate read/write streams are handy for debugging, but not a big deal (a hedged CQL sketch of the overwrite pattern follows)
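⁍ A minimal sketch of the overwrite pattern in CQL 3 syntax (the cluster above used Thrift at the time; the keyspace, table and values below are illustrative, not from the deck):

cqlsh <<'CQL'
CREATE KEYSPACE IF NOT EXISTS demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
CREATE TABLE IF NOT EXISTS demo.cassandra13_drinks (
  name   text,
  day    text,
  drinks int,
  PRIMARY KEY (name, day)
);
-- anti-pattern: SELECT the old count, add 20 in the client, then UPDATE (read-modify-write)
-- instead, blindly write the new absolute value; the newest timestamp wins when
-- memtables and sstables are merged at read/compaction time, as on the slides above
UPDATE demo.cassandra13_drinks SET drinks = 22 WHERE name = 'Albert' AND day = 'Tuesday';
CQL

⁍ This only works when the client can compute the absolute value; for pure increments, counters are the alternative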
2011: 0.6 ➜ 0.8
11
⁍ Migration is still a largely unsolved problem
⁍ Wrote a tool in Scala to scrub data and write via Thrift
⁍ Rebuilt indexes - faster than copying
[Diagram: Hadoop + Cassandra 0.6 cluster ➜ Scala Map/Reduce over Thrift, staged on GlusterFS P2P ➜ Cassandra 0.8 cluster]
#CASSANDRA13
⁍ Because of some legacy choices, we know we had a bunch of expired tombstones
⁍ GlusterFS: userspace, ionice(1), fast & easy
⁍ sstabledump and similar tools were TOO SLOW; the Scala MR job took only a week (with production running too!)
Changes: 0.6 ➜ 0.8
12
⁍ Cassandra 0.8
⁍ 24GiB heap
⁍ Sun Java 1.6 update
⁍ Linux 2.6.36
⁍ XFS on MD RAID5
⁍ Disabled swap or at least vm.swappiness=1
#CASSANDRA13
⁍ More on XFS settings & bugs later
⁍ Got significant improvements from RAID & readahead tuning (more later)
⁍ Al’s first rule of tuning databases: disable swap or GTFO
⁍ fixed lots of applications by simply disabling swap
13
⁍ 18 nodes ➜ 36 nodes
⁍ DSE 3.0
⁍ Stale tombstones again!
⁍ No downtime!
[Diagram: existing Cassandra cluster ➜ Scala Map/Reduce over Thrift, staged on GlusterFS P2P ➜ DSE 3.0 cluster]
#CASSANDRA13
2012: Capacity Increase
⁍ I switched teams, working on Hastur, didn’t document enough, repairs were forgotten again
⁍ 60 day GC Grace Period expired ... 3 months ago
⁍ rsync is not enough for hardware moves: do rebuilds!
⁍ Use DSE Map/Reduce to isolate most of the load from production
System Changes: Apache 1.0 ➜ DSE 3.0
14
⁍ DSE 3.0 installed via apt packages
⁍ Unchanged: heap, distro
⁍ Ran much faster this time!
⁍ Mistake: Moved to MD RAID 0
Fix: RAID10 or RAID5, MD, ZFS, or btrfs
⁍ Mistake: Running on Ubuntu Lucid
Fix: Ubuntu Precise
#CASSANDRA13
⁍ Previously deployed with Capistrano
⁍ DSE 3’s Hadoop is compiled on Debian 6 so native components will not load on 10.04’s libc
⁍ still gradually rebuilding nodes from RAID0 ➜ RAID5 and Lucid -> Precise
Config Changes: Apache 1.0 ➜ DSE 3.0
15
⁍ Schema: compaction_strategy = LCS
⁍ Schema: bloom_filter_fp_chance = 0.1
⁍ Schema: sstable_size_in_mb = 256
⁍ Schema: compression_options = Snappy
⁍ YAML: compaction_throughput_mb_per_sec: 0
#CASSANDRA13
⁍ LCS is a huge improvement in operations life (no more major compactions)
⁍ Bloom filters were tipping over a 24GiB heap
⁍ With lots of data per node, sstable sizes in LCS must be MUCH bigger
⁍ > 100,000 open files slows everything down, especially startup
⁍ 256 MB vs. 5 MB is a 50x reduction in file count
⁍ Compaction can’t keep up: even huge rates don’t work, must be disabled
⁍ try to adjust heap, etc. so you’re flushing at nearly full memtables to reduce compaction needs
⁍ backreference RMW?
⁍ might be fixed in >= 1.2
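⁍ The cluster itself was changed with the 1.0-era tooling; as a hedged sketch, the same settings in later CQL 3 / cassandra.yaml syntax ("ks.cf" is a placeholder, not an Ooyala table):

# cassandra.yaml (as on the slide): disable the compaction throughput cap
#   compaction_throughput_mb_per_sec: 0
cqlsh <<'CQL'
ALTER TABLE ks.cf
  WITH compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 256}
  AND compression = {'sstable_compression': 'SnappyCompressor'}
  AND bloom_filter_fp_chance = 0.1;
CQL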
16
⁍ 36 nodes ➜ lots more nodes
⁍ As usual, no downtime!
#CASSANDRA13
[Diagram: two DSE 3.1 clusters with replication between them]
2013: Datacenter Move
⁍ Size omitted in published slides. I was asked not to publish yet, I will tweet, etc. in a couple weeks.
⁍ Wasn’t the original plan, but we save a lot of $$ by leaving old cage
⁍ Prep for next-generation architecture!
17
Upcoming use cases:
⁍ Store every event from our players at full resolution
⁍ Cache code for our Spark job server
⁍ AMPLab Tachyon backend?
#CASSANDRA13
Coming Soon for Cassandra at Ooyala
⁍ This is the intro for the next slide / diagram.
⁍ Considering Astyanax or CQL3 backend for Tachyon so we can contribute it back
18
[Diagram: players and loggers ➜ API ➜ Kafka ➜ ingest ➜ DSE 3.1 (Cassandra), with a Spark job server and possibly Tachyon on top]
#CASSANDRA13
Next Generation Architecture: Ooyala Event Store
⁍ Look mom! No Hadoop! Remember what I said about latency?
⁍ But we’re not just running DSE on these machines. They’re running: DSE, Spark, KVM, and CDH3u4 (legacy)
⁍ Secret is cgroups!
⁍ Also, ZFS (later)
19
⁍ Security
⁍ Cost of Goods Sold
⁍ Operations / support
⁍ Developer happiness
⁍ Physical capacity (cpu/memory/network/disk)
⁍ Reliability / Resilience
⁍ Compromise
#CASSANDRA13
There’s more to tuning than performance:
Shifting themes: philosophy of tuning
⁍ Security is always #1: The decision to disable security features is an important decision!
⁍ Example: EC2 instances sizes vary wildly in consistency and raw performance
⁍ Leveled vs. Size Tiered compaction, ZFS/LVM/MDRAID, bare metal vs. EC2
⁍ how much of this stuff do my devs need to know? How much work is it to get a new KS/CF?
⁍ speed of node rebuilds, risk incurred by extended rebuilds, speed of repair
a.) e.g. it takes a full 24 hours to repair each node in our 36-node cluster, so > 1 month to repair the cluster
⁍ repeatable configurations, do future admins have to remember to do stuff or is it automated?
⁍ Look up “John Allspaw Resilience”
⁍ you only have access to EC2 or old hardware, or your company has an OS/filesystem/settings policy (e.g. my $PREVIOUS_JOB hardened CentOS 5.3 distro on Linux 2.6.18.x)
There are others of course.
20
⁍ I’d love to be more scientific, but production comes first
⁍ Sometimes you have to make educated guesses
⁍ It’s not as difficult as it’s made out to be
⁍ Your brain is great at heuristics. Trust it.
⁍ Concentrate on bottlenecks
⁍ Make incremental changes
⁍ Read Malcolm Gladwell’s “Blink”
#CASSANDRA13
I am not a scientist ... heuristician?
⁍ A truly scientific approach would take a lot of time and resources.
⁍ When under time pressure and things are slow, you have to move fast and measure “by the seat of your pants”
⁍ Be educated, do research, and make sensible decisions without months of testing, be prepared to do better next time
⁍ It’s actually pretty fast and easy this way!
⁍ More on what tools I use later on.
21
Observe, Orient, Decide, Act:
⁍ Observe the system in production under load
⁍ Make small, safe changes
⁍ Observe
⁍ Commit or Revert
#CASSANDRA13
The OODA Loop
⁍ Understand YOUR production workload first!
⁍ Look at Opscenter latency numbers
⁍ cl-netstat.pl (later)
⁍ Examples:
⁍ Changing /proc/sys/vm/dirty_background_ratio is fairly safe and shows results quickly.
⁍ Some network settings can take your node offline, temporarily or require manual intervention.
⁍ Changing the compaction scheme requires a lot of time and has other implications.
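⁍ For example, one OODA pass on the dirty_background_ratio knob from the note above (standard Linux paths; the value 2 is just the starting point from the sysctl slide later on):

cat /proc/sys/vm/dirty_background_ratio        # observe the current value
echo 2 > /proc/sys/vm/dirty_background_ratio   # act: one small, reversible change
dstat -lrvn 10                                 # observe again under load, then commit (sysctl.conf) or revert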
Testing Shiny Things
22
⁍ Like kernels
⁍ And Linux distributions
⁍ And ZFS
⁍ And btrfs
⁍ And JVM’s & parameters
⁍ Test them in production!
#CASSANDRA13
⁍ Testing stuff in a lab is fine, if you have one and you have the time.
⁍ Take (responsible) advantage of Cassandra’s resilience:
⁍ test things you think should work well in production on ONE node or a couple nodes well spaced out.
[Diagram: a cluster of mostly ext4 nodes, with one node each trying ZFS, btrfs, and a kernel upgrade]
Testing Shiny Things: In Production
23#CASSANDRA13
⁍ Use your staging / non-prod environments first if you have them (some people don’t and that’s unfortunate but it happens)
⁍ test things you think should work well in production on ONE node or a couple nodes well spaced out.
24#CASSANDRA13
Brendan Gregg’s Tool Chart
http://joyent.com/blog/linux-performance-analysis-and-tools-brendan-gregg-s-talk-at-scale-11x
⁍ Brendan Gregg’s chart is so good, I just copied it for now.
⁍ Original: http://joyent.com/blog/linux-performance-analysis-and-tools-brendan-gregg-s-talk-at-scale-11x
⁍ I’ll briefly talk about a few
25#CASSANDRA13
dstat -lrvn 10
⁍ Just like vmstat but prettier and does way more
⁍ 35 lines of output = about 5 minutes of 10s snapshots
⁍ What’s interesting?
⁍ IO wait starting at line 5, but all numbers are going up, so this is probably during a map/reduce job
⁍ IO wait is high, but disk throughput isn’t impressive at all
⁍ ~2 blocked “procs” (which includes threads)
Not bothering to tune this right now because production latency is fine.
26#CASSANDRA13
cl-netstat.pl
https://github.com/tobert/perl-ssh-tools
⁍ Home grown.
⁍ Requires no software on the target machines except for SSH.
⁍ Recent Net::SSH2 supports ssh-agent
27#CASSANDRA13
iostat -x 1
⁍ Mostly I just look at the *wait numbers here.
⁍ Great for finding a bad disk with high latency.
28#CASSANDRA13
htop
⁍ Per-CPU utilization bars are nice
⁍ Displays threads by default (hit “H” in plain top)
⁍ Very configurable!
⁍ For example: 1 thread at 100% CPU is usually the GC
29#CASSANDRA13
jconsole
⁍ Looks like I can reduce the heap size on this cluster, but should probably increase -Xmn to 100mb * (physical cores) (not counting hypercores)
30#CASSANDRA13
opscenter
⁍ It looks better on a high-resolution display ;)
31#CASSANDRA13
nodetool ring
10.10.10.10 Analytics rack1 Up Normal 47.73 MB 1.72% 1012046694721756637024691720378965
10.10.10.10 Analytics rack1 Up Normal 63.94 MB 0.86% 1026714038123521225967078556906197
10.10.10.10 Analytics rack1 Up Normal 85.73 MB 0.86% 1041381381525285814909465393433428
10.10.10.10 Analytics rack1 Up Normal 47.87 MB 0.86% 1056048724927050403851852229960659
10.10.10.10 Analytics rack1 Up Normal 39.73 MB 0.86% 1070716068328814992794239066487891
10.10.10.10 Analytics rack1 Up Normal 40.74 MB 1.75% 1100423945662575060114582859200003
10.10.10.10 Analytics rack1 Up Normal 40.08 MB 2.20% 1137814208669076757916163680305794
10.10.10.10 Analytics rack1 Up Normal 56.19 MB 3.45% 1196501513956187970179620530735245
10.10.10.10 Analytics rack1 Up Normal 214.88 MB 11.62% 1394248867770897155613247921498720
10.10.10.10 Analytics rack1 Up Normal 214.29 MB 2.45% 1435882108713996181107000284314407
10.10.10.10 Analytics rack1 Up Normal 158.49 MB 1.76% 1465773686249280216901752503449044
10.10.10.10 Analytics rack1 Up Normal 40.3 MB 0.92% 1481401683578223483181070489250370
⁍ hotspots
32#CASSANDRA13
nodetool cfstats
Keyspace: gostress
Read Count: 0
Read Latency: NaN ms.
Write Count: 0
Write Latency: NaN ms.
Pending Tasks: 0
Column Family: stressful
SSTable count: 1
Space used (live): 32981239
Space used (total): 32981239
Number of Keys (estimate): 128
Memtable Columns Count: 0
Memtable Data Size: 0
Memtable Switch Count: 0
Read Count: 0
Read Latency: NaN ms.
Write Count: 0
Write Latency: NaN ms.
Pending Tasks: 0
Bloom Filter False Positives: 0
Bloom Filter False Ratio: 0.00000
Bloom Filter Space Used: 336
Compacted row minimum size: 7007507
Compacted row maximum size: 8409007
Compacted row mean size: 8409007
Could be using a lot of heap
Controllable by sstable_size_in_mb
⁍ bloom filters
⁍ sstable_size_in_mb
33#CASSANDRA13
nodetool proxyhistograms
Offset Read Latency Write Latency Range Latency
35 0 20 0
42 0 61 0
50 0 82 0
60 0 440 0
72 0 3416 0
86 0 17910 0
103 0 48675 0
124 1 97423 0
149 0 153109 0
179 2 186205 0
215 5 139022 0
258 134 44058 0
310 2656 60660 0
372 34698 742684 0
446 469515 7359351 0
535 3920391 31030588 0
642 9852708 33070248 0
770 4487796 9719615 0
924 651959 984889 0
⁍ units are microseconds
⁍ can give you a good idea of how much latency coordinator hops are costing you
34#CASSANDRA13
nodetool compactionstats
al@node ~ $ nodetool compactionstats
pending tasks: 3
compaction type keyspace column family bytes compacted bytes total progress
Compaction hastur gauge_archive 9819749801 16922291634 58.03%
Compaction hastur counter_archive 12141850720 16147440484 75.19%
Compaction hastur mark_archive 647389841 1475432590 43.88%
Active compaction remaining time : n/a
al@node ~ $ nodetool compactionstats
pending tasks: 3
compaction type keyspace column family bytes compacted bytes total progress
Compaction hastur gauge_archive 10239806890 16922291634 60.51%
Compaction hastur counter_archive 12544404397 16147440484 77.69%
Compaction hastur mark_archive 1107897093 1475432590 75.09%
Active compaction remaining time : n/a
35#CASSANDRA13
⁍ cassandra-stress
⁍ YCSB
⁍ Production
⁍ Terasort (DSE)
⁍ Homegrown
Stress Testing Tools
⁍ we mostly focus on cassandra-stress for burn-in of new clusters
⁍ can quickly figure out the right setting for -Xmn
⁍ Terasort is interesting for comparing DSE to Cloudera/Hortonworks/etc. (it’s fast!)
⁍ Consider writing custom benchmarks for your application patterns
⁍ sometimes it’s faster to write one than figure out how to make a generic tool do what you want
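⁍ As a rough sketch, a burn-in with the 2013-era cassandra-stress might look like this (flags recalled from the old stress tool, so check --help on your version; node names and counts are placeholders):

cassandra-stress -d node1,node2,node3 -o insert -n 10000000 -t 200   # write phase
cassandra-stress -d node1,node2,node3 -o read   -n 10000000 -t 200   # read the same keys back
# watch GC in jconsole / the logs while this runs to pick -Xmn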
36#CASSANDRA13
kernel.pid_max = 999999
fs.file-max = 1048576
vm.max_map_count = 1048576
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 65536 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
vm.dirty_ratio = 10
vm.dirty_background_ratio = 2
vm.swappiness = 1
/etc/sysctl.conf
⁍ pid_max doesn’t fix anything, I just like it and have never had a problem with it
⁍ These are my starting point settings for nearly every system/application.
⁍ Generally safe for production.
⁍ vm.dirty*ratio can go big for fake fast writes, generally safe for Cassandra, but beware you’re more likely to see FS/file corruption on power loss
⁍ but you will get latency spikes if you hit dirty_ratio (percentage of RAM), so don’t tune too low
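⁍ These load without a reboot; a quick way to apply the file and spot-check a few values:

sysctl -p /etc/sysctl.conf                                      # load the settings above
sysctl vm.swappiness vm.dirty_ratio vm.dirty_background_ratio   # verify what actually took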
37#CASSANDRA13
ra=$((2**14)) # 16k
ss=$(blockdev --getss /dev/sda)
blockdev --setra $(($ra / $ss)) /dev/sda
echo 256 > /sys/block/sda/queue/nr_requests
echo cfq > /sys/block/sda/queue/scheduler
echo 16384 > /sys/block/md7/md/stripe_cache_size
/etc/rc.local
⁍ Lower readahead is better for latency on seeky workloads
⁍ More readahead will artificially increase your IOPS by reading a bunch of stuff you might not need!
⁍ nr_requests = number of IO structs the kernel will keep in flight, don’t go crazy
⁍ Deadline is best for raw throughput
⁍ CFQ supports cgroup priorities and is occasionally better for latency on SATA drives
⁍ Default stripe cache is 128. The increase seems to help MD RAID5 a lot.
⁍ Don’t forget to set readahead separately for MD RAID devices
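⁍ Following that last note, the MD device gets the same treatment as the sd* devices above (device name is an example):

ra=$((2**14))                       # same 16k target as above
ss=$(blockdev --getss /dev/md7)
blockdev --setra $(($ra / $ss)) /dev/md7
blockdev --getra /dev/md7           # verify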
38#CASSANDRA13
-Xmx8G leave it alone
-Xms8G leave it alone
-Xmn1200M 100MiB * nCPU
-Xss180k should be fine
-XX:+UseNUMA
numactl --interleave
JVM Args
⁍ In general, most people should leave the defaults alone. Especially the heap, which can cause no end of trouble if you do it wrong and cause GC pauses.
⁍ Don’t count hypercores.
⁍ Our biggest bang for the buck so far has been tuning newsize.
⁍ Have you ever seen “out of memory” when there’s plenty of memory available? You probably have a full NUMA node.
⁍ NUMA is how modern machines are built. Older Apache Cassandra distros had numactl --interleave, but this doesn’t seem to be in the DSE scripts. I’ve been running +UseNUMA for about a year and a half now and it seems to work fine.
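⁍ In a package install these usually live in cassandra-env.sh rather than on the command line; a sketch assuming 12 physical cores (paths differ between Apache and DSE packages):

# e.g. /etc/cassandra/cassandra-env.sh or /etc/dse/cassandra/cassandra-env.sh
MAX_HEAP_SIZE="8G"                   # -Xmx / -Xms: leave alone unless you have a reason
HEAP_NEWSIZE="1200M"                 # -Xmn: ~100MiB per physical core, hypercores not counted
JVM_OPTS="$JVM_OPTS -XX:+UseNUMA"    # or wrap startup in: numactl --interleave=all <command>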
cgroups
39#CASSANDRA13
Provides fine-grained control over Linux resources
⁍ Makes the Linux scheduler better
⁍ Lets you manage systems under extreme load
⁍ Useful on all Linux machines
⁍ Can choose between determinism and flexibility
⁍ static resource assignment has better determinism / consistency
⁍ weighted resources provide most of the advantage with a lot more flexibility
cgroups
40#CASSANDRA13
# appended to /etc/default/cassandra, which the init script sources at startup
cat >> /etc/default/cassandra <<'EOF'
# create a "cassandra" cgroup and move this process (and its children) into it
cpucg=/sys/fs/cgroup/cpu/cassandra
mkdir -p $cpucg
cat $cpucg/../cpuset.mems >$cpucg/cpuset.mems
cat $cpucg/../cpuset.cpus >$cpucg/cpuset.cpus
echo 100 > $cpucg/cpu.shares
echo $$ > $cpucg/tasks
EOF
⁍ automatically adds cassandra to a CG called “cassandra”
⁍ cpuset.mems can be used to limit NUMA nodes if you have huge machines
⁍ cpuset.cpus can restrict tasks to specific cores (like taskset, stricter)
⁍ shares is just a number, set your own scale, 1-1000 works for me
⁍ adding a task to a CG is as simple as adding its PID
⁍ children are not necessarily added, you must add threads too if joining after startup (ps -efL)
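⁍ Per that last note, a JVM joining the cgroup after startup needs every thread added, not just the main PID; a hypothetical helper (the pgrep pattern and cgroup path are assumptions):

cg=/sys/fs/cgroup/cpu/cassandra
pid=$(pgrep -f CassandraDaemon | head -n1)   # the running Cassandra JVM
for tid in /proc/$pid/task/*; do             # one directory per thread, the same list ps -efL shows
  echo ${tid##*/} > $cg/tasks
done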
Successful Experiment: btrfs
41#CASSANDRA13
mkfs.btrfs -m raid10 -d raid0 /dev/sd[c-h]1
mount -o compress=lzo /dev/sdc1 /data
⁍ Like ZFS, btrfs can manage multiple disks without mdraid or LVM.
⁍ We have one production system in EC2 running btrfs flawlessly.
⁍ I’m told there are problems when the disk fills up so don’t do that.
⁍ noatime isn’t necessary on modern Linux, relatime is the default for xfs / ext4 and is good enough
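⁍ Given the warning about full filesystems, it is worth keeping an eye on allocation with the standard btrfs commands:

btrfs filesystem df /data          # data vs. metadata usage inside the filesystem
btrfs filesystem show /dev/sdc1    # per-device allocation across the pool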
Successful Experiment: ZFS on Linux
42#CASSANDRA13
zpool create data raidz /dev/sd[c-h]
zfs create data/cassandra
zfs set compression=lzjb data/cassandra
zfs set atime=off data/cassandra
zfs set logbias=throughput data/cassandra
⁍ ZFS really is the ultimate filesystem.
⁍ RAIDZ is like RAID5 but totally different:
⁍ variable-width stripes
⁍ no write hole
⁍ VERY fast, plays well with C*
⁍ Stable! (so far)
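⁍ A couple of standard checks after building the pool, to confirm the properties took and what lzjb is saving:

zpool status data                                         # pool health and raidz layout
zfs get compression,compressratio,atime data/cassandra    # confirm settings and the compression win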
Conclusions
43#CASSANDRA13
⁍ Tuning is multi-dimensional
⁍ Production load is your most important benchmark
⁍ Lean on Cassandra, experiment!
⁍ No one metric tells the whole story
Questions?
44#CASSANDRA13
⁍ Twitter: @AlTobey
⁍ Github: https://github.com/tobert
⁍ Email: al@ooyala.com / tobert@gmail.com