DIRTY COW
STORY AT
YAHOO GRID
Sameer Gawande
Savitha Ravikrishnan
June 13, 2017
Agenda
Topic                                          Speaker(s)
Overview, Planning, Execution & Coordination   Savitha Ravikrishnan
Automation details & Final outcome             Sameer Gawande
Q&A                                            All Presenters
WHAT IS DIRTY COW?
• Dirty COW (Copy-On-Write, CVE-2016-5195) is a security vulnerability in the Linux kernel that affects all Linux-based operating systems, including Android.
• It allows a malicious actor to tamper with read-only, root-owned executable files.
• It had been present for about a decade, but surfaced and was actively exploited in early Q4 2016.
• The Linux kernel needed to be patched, followed by a full reboot (a quick way to flag unpatched hosts is sketched below).
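A minimal sketch of the kind of check a fleet scan might run to flag unpatched hosts; the PATCHED build string is a placeholder from the distro advisory, not necessarily the version Yahoo deployed:

  # Flag hosts still booted on a kernel older than the patched build.
  PATCHED="2.6.32-642.6.2"   # placeholder: substitute your distro's fixed build
  CURRENT="$(uname -r)"
  # sort -V orders version strings; if PATCHED does not sort first (or tie),
  # the running kernel predates the fix.
  if [ "$(printf '%s\n' "$PATCHED" "$CURRENT" | sort -V | head -n1)" != "$PATCHED" ]; then
      echo "$(hostname): vulnerable kernel $CURRENT"
  fi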
CHALLENGE
• Yahoo Grid comprises 38 clusters:
• 19 Hadoop clusters
• 9 HBase clusters
• 10 Storm clusters
• 47,000+ hosts of diverse makes and models
GRID STACK
• Backend support: ZooKeeper, monitoring, Starling for logging
• Hadoop storage: HDFS, HBase as NoSQL store, HCatalog for metadata registry
• Hadoop compute: YARN (MapReduce) and Tez for batch processing, Storm for stream processing, Spark for iterative programming
• Hadoop services: Pig for ETL, Hive for SQL, Oozie for workflows
• Support shop: proxy services, GDM for data management, CaffeOnSpark for ML
CHALLENGE
• End-of-quarter deadline.
• Cannot afford data loss.
• Need minimal to no downtime, to avoid inconveniencing customers using the clusters.
• Coordinating the whole effort across different tiers of operations, site ops technicians, and the users.
• Rigorous end-to-end automation.
PLANNING AND PREPARATION
• Numerous discussions between prod ops and dev teams.
• Leverage the existing framework to roll out the new kernel.
• Many of these hosts hadn't been rebooted in ages, so behavior was uncertain.
• Thorough testing of the new kernel on different kinds of hardware.
• Encountered a variety of issues while testing.
o Used this as an opportunity to fix hosts with hardware issues.
• Resulted in BIOS + BMC + CPLD upgrades across a particular type of system.
• Use kexec on systems at higher risk and under time constraints.
EXECUTION
• Pre-upgrade work:
o Required scanning all hosts for hardware issues: memory, disks, and CPU.
o Decommission them before the upgrade.
• We kept the namenodes up at all times and used them to help with the upgrade by reporting missing blocks.
• Namenode HA setup: IP aliasing, with nn1-ha1, nn1-ha2, and nn1.
o Clients talk to nn1.
o Upgrade components while nn1 was down (a failover sketch follows this list).
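A minimal sketch of how the nn1 IP alias can be failed over between nn1-ha1 and nn1-ha2; the interface name and address are hypothetical, not the actual Yahoo setup:

  # On the namenode going down for upgrade: release the nn1 service alias.
  ip addr del 10.0.0.10/24 dev eth0     # hypothetical nn1 service IP
  # On the peer taking over: claim the alias and announce it to the network.
  ip addr add 10.0.0.10/24 dev eth0
  arping -c 3 -A -I eth0 10.0.0.10      # gratuitous ARP so clients re-learn the MAC

Since clients only ever talk to nn1, moving the alias lets either physical namenode be taken down for the upgrade.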
EXECUTION
• Before the start of the upgrade:
o Increase the namenode heartbeat recheck interval (dfs.namenode.heartbeat.recheck-interval); see the config sketch after this list.
o Upgrade namenodes.
o Build a block map of hosts to the blocks they hold.
• The upgrade:
o Bring down nn1; bring down component services.
o Try a rack and a stripe first, and increase that count as needed.
o Troubleshoot hosts failing to come back up.
• For Storm and HBase, the rolling upgrade script was updated to do a system upgrade, as they could sustain a rolling upgrade.
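Raising the recheck interval buys time before the namenode declares rebooting datanodes dead and starts re-replicating their blocks: the dead-node timeout is 2 × dfs.namenode.heartbeat.recheck-interval + 10 × dfs.heartbeat.interval, about 10.5 minutes with the defaults (300,000 ms and 3 s). A sketch of the hdfs-site.xml change, with an illustrative value rather than the one used on the Grid:

  <property>
    <name>dfs.namenode.heartbeat.recheck-interval</name>
    <!-- milliseconds; 1800000 (30 min) is illustrative, not the production value -->
    <value>1800000</value>
  </property>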
Subsystem Upgrade
HADOOP SUBSYSTEMS
• This included various sub-components such as LDAP, Kerberos, syslog servers, monitoring nodes, proxy nodes, gateways, and admin servers, to name a few.
• These servers could be failed over and so were not a single point of failure.
• The upgrade was done in rolling fashion, with no downtime to the service.
• Support for this was built into the kernel upgrade tooling.
COORDINATION
• Comprehensive UI:
o Displays all the clusters, with kernel and BIOS versions of all hosts.
o Displays host upgrade progress and host health status.
o Displays stats on the number of hosts upgraded, being upgraded, and not yet upgraded.
• 2nd-tier ops scan the UI for hosts with hardware issues that need to be looked at by site ops.
• Site ops technicians on standby to immediately troubleshoot hosts with hardware issues.
UI SNAPSHOT
[UI screenshots]
Kernel Upgrade Flow Diagram

Kernel Upgrade Flow (flow diagram, reconstructed as steps):
1. Initialize the workflow / anchor function.
2. Find hosts on the active vs. non-active kernel: hosts whose kernel is already current, or that are unreachable, are skipped; the rest require the upgrade.
3. Push a new temporary repo and the new kernel RPMs. /boot can't hold multiple kernels, so the old kernel is moved out first.
4. Validate nodes; validation involved disk, CPU, and memory consistency checks. Failed nodes register errors/failures and are terminated from the workflow; passed nodes become the set of nodes to work on.
5. Select a batch to work on, shut down processes, and reboot; reboot failures are registered.
6. Start services; service failures are registered.
7. Check HDFS status: if thresholds are crossed (under-replicated blocks on failed nodes / missing blocks), stop; otherwise find the active nodes in HDFS and continue with the next batch.
Block Map Tool (flow diagram, reconstructed as steps):
1. Initialize the block map tool.
2. Find all blocks on the DNs and record their paths: run find to upload every block path to HDFS and locally, and use Pig to compute block locations.
3. Monitor the namenode for missing blocks; trigger a metasave.
4. From the metasave, find the failed nodes and the nodes holding all replicas of a block, and escalate to site ops.
After this step we are ready to do the kernel upgrade. (Equivalent stock HDFS commands are sketched below.)
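A rough sketch of how the same information can be gathered with stock HDFS tooling (output paths are illustrative):

  # Dump block/replica state from the namenode into its log directory.
  hdfs dfsadmin -metasave block-state.meta
  # Map every file to its blocks and datanode locations (very large output at this scale).
  hdfs fsck / -files -blocks -locations > /grid/0/tmp/blockmap.txt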
TOOL CONFIG

default:
  database_type: 'mysql'
  host_netswitch_map: /home/y/conf/ygrid_kernel_upgrade/netswitch_mapping.yaml
  hbase_client_config: /home/y/conf/cluster_upgrade/ygrid_package_version.yaml
  repo_file: 'http://xxxxxx.yyyyyyyy.yahoo.com:xxxxx/yum/properties/ylinux/ylinux/dirtycow/ylinux6-kernel-upgrade.yum'
  # Host selection logic, based on the batch specified:
  #   [0-9]+s    - stripe: select a stripe in the cluster
  #   r          - rack: select the biggest rack of the cluster
  #   [0-9]+     - group of n (number) hosts
  #   stop, halt - stop further execution
  # Example:
  #   r,s,50,100,stop - upgrade a rack, then a stripe, then batches of 50 and
  #                     100 respectively, then stop whether or not hosts remain
  batch: r,s,4s,7s
  reboot_wait: 1500
  missing_blocks_threshold: 1000
  namenode_safemode_timeout: 3600
  addNodes:
    datanode: command_add_datanode
    storm: command_add_storm
  removeNodes:
    datanode: command_remove_datanode
    storm: command_remove_storm
  moveKernel: "mv /boot/initramfs-2.6.32-*.img /boot/initrd-2.6.32-*.img /grid/0/tmp/"
  installKernel: "yum -y shell /tmp/ylinux6-kernel-upgrade.yum"
  validateKernelHost: "/usr/local/libexec/validateNodeHealth.py"
  reboot: "SUDO_USER=kernelupgrade /etc/init.d/systemupgrade.py"
  # kexec variant of the reboot command (skips BIOS POST; see the KEXEC slide):
  reboot: "kernel=`grubby --default-kernel`; initrd=`grubby --info=${kernel} | grep '^initrd' | cut -d'=' -f2`; kexec -l $kernel --initrd=$initrd --command-line=\"$(cat /proc/cmdline)\"; sleep 5; reboot"
command:
  command_add_datanode: "/home/y/bin/addNodes -input_data [cluster]_[colo]:HDFS:[hosts]"
  command_add_storm: "/home/y/bin/quarantineDebugNodes -input_data [cluster]_[colo]:STORM:[hosts]"
  command_remove_datanode: "/home/y/bin/shutdownNodes -input_data [cluster]_[colo]:HDFS:[hosts]"
  command_remove_storm: "/home/y/bin/shutdownNodes -input_data [cluster]_[colo]:STORM:[hosts]"
Rolling Upgrade: Low Latency Services

HBase Upgrade (CI/CD flow diagram, reconstructed):
• CI/CD process: Git holds the release info; Jenkins starts the upgrade, pulling the package + conf version from the repo server.
• Numbered stages (1-5) cover: putting the NN in rolling-upgrade (RU) mode and upgrading the NN, the SNN and Master upgrades, the region-server upgrade process (steps 3a-3f), the Stargate and gateway upgrades, and the HDFS rolling upgrade process.
• Region-server upgrade process (3a-3f), a system upgrade of each regionserver: for each DN/RS, iterating over each group and then over each server in a group: stop the regionserver, stop the DN, reboot the host, then validate and start the DN and RS. (A per-host sketch using stock daemon scripts follows.)
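A sketch of the per-host loop (3a-3f) using stock Hadoop/HBase daemon scripts; Yahoo's actual automation wrapped the equivalents from the tool config shown earlier:

  # Once per cluster: put HDFS into rolling-upgrade mode before touching datanodes.
  hdfs dfsadmin -rollingUpgrade prepare
  # Per host, in batches:
  hbase-daemon.sh stop regionserver    # regions reassign to the remaining RSs
  hadoop-daemon.sh stop datanode
  reboot                               # into the new kernel
  # After the host comes back:
  hadoop-daemon.sh start datanode
  hbase-daemon.sh start regionserver   # then validate before moving on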
Storm Kernel Upgrade (CI/CD flow diagram, reconstructed):
• CI/CD process: Git holds the release info and Jenkins starts the upgrade. RE Jenkins and the SD process generate a statefile for each component and update Git with release info; statefiles are published in Artifactory (state files & release info) and downloaded during the upgrade.
• Flow: system-upgrade Pacemaker, then Nimbus; kill workers and stop the Supervisor; reboot the host(s); start the Supervisor services and verify services; system-upgrade DRPC; run a test/validation topology (sketched below); finally, audit all components.
• The upgrade fails if more than X supervisors fail to upgrade.
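A sketch of the test/validation-topology step with the standard Storm CLI; the jar and class names are hypothetical:

  # Submit a known-good test topology against the upgraded cluster.
  storm jar /home/y/lib/validation-topology.jar com.example.ValidationTopology validation
  storm list | grep validation         # confirm the topology is ACTIVE
  storm kill validation -w 10          # tear it down once verified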
Impact and Statistics
TEST RESULTS: MODEL VS RHEL VERSIONS
• We used configs spanning multiple architectures such as Westmere, Sandy Bridge, Ivy Bridge, Haswell, and Broadwell.
• Each of the configurations was installed with different OS versions and kernel versions:

OS version              Kernel minor version
RHEL 6.4                2.6.32-358
RHEL 6.6 and RHEL 6.7   2.6.32-432 to 2.6.32-512
RHEL 6.8                2.6.32-632
MODEL VS RHEL AND KERNEL
• Issues encountered:
o Slower reboots
o Boot failures due to iDRAC/IPMI
o Slowness on some systems
o Hardware issues
KEXEC
• The primary difference between a standard system boot and a kexec boot is that the hardware initialization (POST) normally performed by the BIOS is skipped during a kexec boot, which reduces the time required for a reboot.
• We had approximately 3,000 nodes that had the potential to cause issues with a standard system boot: they belonged to a specific config and had a bad history when it came to rebooting, so we used kexec on them (see the sketch below).
• We did do a full system reboot, in rolling fashion, after the Dirty COW kernel upgrade project was done.
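A cleaned-up sketch of the kexec reboot used in the tool config above, with each step annotated (standard grubby/kexec-tools usage):

  # Resolve the default kernel and its initrd from the bootloader config.
  kernel=$(grubby --default-kernel)
  initrd=$(grubby --info="$kernel" | grep '^initrd' | cut -d'=' -f2)
  # Stage the new kernel with the current boot command line, then reboot;
  # kexec jumps straight into the staged kernel, skipping BIOS POST.
  kexec -l "$kernel" --initrd="$initrd" --command-line="$(cat /proc/cmdline)"
  sleep 5
  reboot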
SUCCESS METRICS
• Zero data loss.
• 47,000+ nodes upgraded at an extremely fast pace.
• Minimal customer downtime.
• S0 security bug resolved.
• Minimal impact to low-latency services.
• Uncovered multiple system issues: got the opportunity to upgrade BIOS and BMC and to fix an EDAC issue that was causing system slowness, resulting in improved system reliability.