DIRTY COW
STORY AT
YAHOO GRID
Sameer Gawande
Savitha Ravikrishnan
June 13, 2017
Agenda
Topic                                          Speaker(s)
Overview, Planning, Execution & Coordination   Savitha Ravikrishnan
Automation details & Final outcome             Sameer Gawande
Q&A                                            All Presenters
WHAT IS DIRTY COW?
• Dirty COW (Copy-On-Write, CVE-2016-5195) is a security vulnerability in the Linux kernel that affects all Linux-based operating systems, including Android.
• It allows a malicious actor to tamper with read-only, root-owned executable files.
• It had been present for about a decade, but surfaced and was actively exploited in early Q4 2016.
• The Linux kernel needed to be patched, followed by a full reboot (a quick way to flag unpatched hosts is sketched below).
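A minimal sketch of the kind of check a fleet scan might run to flag unpatched hosts; the PATCHED build string is a placeholder from the distro advisory, not necessarily the version Yahoo deployed:

  # Flag hosts still booted on a kernel older than the patched build.
  PATCHED="2.6.32-642.6.2"   # placeholder: substitute your distro's fixed build
  CURRENT="$(uname -r)"
  # sort -V orders version strings; if PATCHED does not sort first (or tie),
  # the running kernel predates the fix.
  if [ "$(printf '%s\n' "$PATCHED" "$CURRENT" | sort -V | head -n1)" != "$PATCHED" ]; then
      echo "$(hostname): vulnerable kernel $CURRENT"
  fi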
CHALLENGE
• Yahoo Grid comprises 38 clusters:
• 19 Hadoop clusters
• 9 HBase clusters
• 10 Storm clusters
• 47,000+ hosts of diverse makes and models
GRID STACK
• Backend support: ZooKeeper, monitoring, Starling for logging
• Hadoop storage: HDFS, HBase as NoSQL store, HCatalog for metadata registry
• Hadoop compute: YARN (MapReduce) and Tez for batch processing, Storm for stream processing, Spark for iterative programming
• Hadoop services: Pig for ETL, Hive for SQL, Oozie for workflows
• Support shop: proxy services, GDM for data management, CaffeOnSpark for ML
CHALLENGE
• End-of-quarter deadline.
• Cannot afford data loss.
• Need minimal to no downtime, to avoid inconveniencing customers using the clusters.
• Coordinating the whole effort across different tiers of operations, site ops technicians, and the users.
• Rigorous end-to-end automation.
PLANNING AND PREPARATION
• Numerous discussions between prod ops and dev teams.
• Leverage the existing framework to roll out the new kernel.
• Many of these hosts hadn't been rebooted in ages, so behavior was uncertain.
• Thorough testing of the new kernel on different kinds of hardware.
• Encountered a variety of issues while testing.
o Used this as an opportunity to fix hosts with hardware issues.
• Resulted in BIOS + BMC + CPLD upgrades across a particular type of system.
• Use kexec on systems at higher risk and under time constraints.
EXECUTION
• Pre-upgrade work:
o Required scanning all hosts for hardware issues: memory, disks, and CPU.
o Decommission them before the upgrade.
• We kept the namenodes up at all times and used them to help with the upgrade by reporting missing blocks.
• Namenode HA setup: IP aliasing, with nn1-ha1, nn1-ha2, and nn1.
o Clients talk to nn1.
o Upgrade components while nn1 was down (a failover sketch follows this list).
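A minimal sketch of how the nn1 IP alias can be failed over between nn1-ha1 and nn1-ha2; the interface name and address are hypothetical, not the actual Yahoo setup:

  # On the namenode going down for upgrade: release the nn1 service alias.
  ip addr del 10.0.0.10/24 dev eth0     # hypothetical nn1 service IP
  # On the peer taking over: claim the alias and announce it to the network.
  ip addr add 10.0.0.10/24 dev eth0
  arping -c 3 -A -I eth0 10.0.0.10      # gratuitous ARP so clients re-learn the MAC

Since clients only ever talk to nn1, moving the alias lets either physical namenode be taken down for the upgrade.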
EXECUTION
• Before the start of the upgrade:
o Increase the namenode heartbeat recheck interval (dfs.namenode.heartbeat.recheck-interval); see the config sketch after this list.
o Upgrade namenodes.
o Build a block map of hosts to the blocks they hold.
• The upgrade:
o Bring down nn1; bring down component services.
o Try a rack and a stripe first, and increase that count as needed.
o Troubleshoot hosts failing to come back up.
• For Storm and HBase, the rolling upgrade script was updated to do a system upgrade, as they could sustain a rolling upgrade.
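Raising the recheck interval buys time before the namenode declares rebooting datanodes dead and starts re-replicating their blocks: the dead-node timeout is 2 × dfs.namenode.heartbeat.recheck-interval + 10 × dfs.heartbeat.interval, about 10.5 minutes with the defaults (300,000 ms and 3 s). A sketch of the hdfs-site.xml change, with an illustrative value rather than the one used on the Grid:

  <property>
    <name>dfs.namenode.heartbeat.recheck-interval</name>
    <!-- milliseconds; 1800000 (30 min) is illustrative, not the production value -->
    <value>1800000</value>
  </property>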
Subsystem Upgrade
HADOOP SUBSYSTEMS
• This included various sub-components such as LDAP, Kerberos, syslog servers, monitoring nodes, proxy nodes, gateways, and admin servers, to name a few.
• These servers could be failed over and so were not a single point of failure.
• The upgrade was done in rolling fashion, with no downtime to the service.
• Support for this was built into the kernel upgrade tooling.
COORDINATION
• Comprehensive UI:
o Displays all the clusters, with kernel and BIOS versions of all hosts.
o Displays host upgrade progress and host health status.
o Displays stats on the number of hosts upgraded, being upgraded, and not yet upgraded.
• 2nd-tier ops scan the UI for hosts with hardware issues that need to be looked at by site ops.
• Site ops technicians on standby to immediately troubleshoot hosts with hardware issues.
UI SNAPSHOT
[UI screenshots]
Kernel Upgrade Flow Diagram

Kernel Upgrade Flow (flow diagram, reconstructed as steps):
1. Initialize the workflow / anchor function.
2. Find hosts on the active vs. non-active kernel: hosts whose kernel is already current, or that are unreachable, are skipped; the rest require the upgrade.
3. Push a new temporary repo and the new kernel RPMs. /boot can't hold multiple kernels, so the old kernel is moved out first.
4. Validate nodes; validation involved disk, CPU, and memory consistency checks. Failed nodes register errors/failures and are terminated from the workflow; passed nodes become the set of nodes to work on.
5. Select a batch to work on, shut down processes, and reboot; reboot failures are registered.
6. Start services; service failures are registered.
7. Check HDFS status: if thresholds are crossed (under-replicated blocks on failed nodes / missing blocks), stop; otherwise find the active nodes in HDFS and continue with the next batch.
Block Map Tool (flow diagram, reconstructed as steps):
1. Initialize the block map tool.
2. Find all blocks on the DNs and record their paths: run find to upload every block path to HDFS and locally, and use Pig to compute block locations.
3. Monitor the namenode for missing blocks; trigger a metasave.
4. From the metasave, find the failed nodes and the nodes holding all replicas of a block, and escalate to site ops.
After this step we are ready to do the kernel upgrade. (Equivalent stock HDFS commands are sketched below.)
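A rough sketch of how the same information can be gathered with stock HDFS tooling (output paths are illustrative):

  # Dump block/replica state from the namenode into its log directory.
  hdfs dfsadmin -metasave block-state.meta
  # Map every file to its blocks and datanode locations (very large output at this scale).
  hdfs fsck / -files -blocks -locations > /grid/0/tmp/blockmap.txt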
TOOL CONFIG

default:
  database_type: 'mysql'
  host_netswitch_map: /home/y/conf/ygrid_kernel_upgrade/netswitch_mapping.yaml
  hbase_client_config: /home/y/conf/cluster_upgrade/ygrid_package_version.yaml
  repo_file: 'http://xxxxxx.yyyyyyyy.yahoo.com:xxxxx/yum/properties/ylinux/ylinux/dirtycow/ylinux6-kernel-upgrade.yum'
  # Host selection logic, based on the batch specified:
  #   [0-9]+s    - stripe: select a stripe in the cluster
  #   r          - rack: select the biggest rack of the cluster
  #   [0-9]+     - group of n (number) hosts
  #   stop, halt - stop further execution
  # Example:
  #   r,s,50,100,stop - upgrade a rack, then a stripe, then batches of 50 and
  #                     100 respectively, then stop whether or not hosts remain
  batch: r,s,4s,7s
  reboot_wait: 1500
  missing_blocks_threshold: 1000
  namenode_safemode_timeout: 3600
  addNodes:
    datanode: command_add_datanode
    storm: command_add_storm
  removeNodes:
    datanode: command_remove_datanode
    storm: command_remove_storm
  moveKernel: "mv /boot/initramfs-2.6.32-*.img /boot/initrd-2.6.32-*.img /grid/0/tmp/"
  installKernel: "yum -y shell /tmp/ylinux6-kernel-upgrade.yum"
  validateKernelHost: "/usr/local/libexec/validateNodeHealth.py"
  reboot: "SUDO_USER=kernelupgrade /etc/init.d/systemupgrade.py"
  # kexec variant of the reboot command (skips BIOS POST; see the KEXEC slide):
  reboot: "kernel=`grubby --default-kernel`; initrd=`grubby --info=${kernel} | grep '^initrd' | cut -d'=' -f2`; kexec -l $kernel --initrd=$initrd --command-line=\"$(cat /proc/cmdline)\"; sleep 5; reboot"
command:
  command_add_datanode: "/home/y/bin/addNodes -input_data [cluster]_[colo]:HDFS:[hosts]"
  command_add_storm: "/home/y/bin/quarantineDebugNodes -input_data [cluster]_[colo]:STORM:[hosts]"
  command_remove_datanode: "/home/y/bin/shutdownNodes -input_data [cluster]_[colo]:HDFS:[hosts]"
  command_remove_storm: "/home/y/bin/shutdownNodes -input_data [cluster]_[colo]:STORM:[hosts]"
Rolling Upgrade: Low Latency Services

HBase Upgrade (CI/CD flow diagram, reconstructed):
• CI/CD process: Git holds the release info; Jenkins starts the upgrade, pulling the package + conf version from the repo server.
• Numbered stages (1-5) cover: putting the NN in rolling-upgrade (RU) mode and upgrading the NN, the SNN and Master upgrades, the region-server upgrade process (steps 3a-3f), the Stargate and gateway upgrades, and the HDFS rolling upgrade process.
• Region-server upgrade process (3a-3f), a system upgrade of each regionserver: for each DN/RS, iterating over each group and then over each server in a group: stop the regionserver, stop the DN, reboot the host, then validate and start the DN and RS. (A per-host sketch using stock daemon scripts follows.)
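A sketch of the per-host loop (3a-3f) using stock Hadoop/HBase daemon scripts; Yahoo's actual automation wrapped the equivalents from the tool config shown earlier:

  # Once per cluster: put HDFS into rolling-upgrade mode before touching datanodes.
  hdfs dfsadmin -rollingUpgrade prepare
  # Per host, in batches:
  hbase-daemon.sh stop regionserver    # regions reassign to the remaining RSs
  hadoop-daemon.sh stop datanode
  reboot                               # into the new kernel
  # After the host comes back:
  hadoop-daemon.sh start datanode
  hbase-daemon.sh start regionserver   # then validate before moving on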
Storm Kernel Upgrade (CI/CD flow diagram, reconstructed):
• CI/CD process: Git holds the release info and Jenkins starts the upgrade. RE Jenkins and the SD process generate a statefile for each component and update Git with release info; statefiles are published in Artifactory (state files & release info) and downloaded during the upgrade.
• Flow: system-upgrade Pacemaker, then Nimbus; kill workers and stop the Supervisor; reboot the host(s); start the Supervisor services and verify services; system-upgrade DRPC; run a test/validation topology (sketched below); finally, audit all components.
• The upgrade fails if more than X supervisors fail to upgrade.
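A sketch of the test/validation-topology step with the standard Storm CLI; the jar and class names are hypothetical:

  # Submit a known-good test topology against the upgraded cluster.
  storm jar /home/y/lib/validation-topology.jar com.example.ValidationTopology validation
  storm list | grep validation         # confirm the topology is ACTIVE
  storm kill validation -w 10          # tear it down once verified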
Impact and Statistics
TEST RESULTS: MODEL VS RHEL VERSIONS
• We used configs spanning multiple architectures such as Westmere, Sandy Bridge, Ivy Bridge, Haswell, and Broadwell.
• Each of the configurations was installed with different OS versions and kernel versions:

OS version              Kernel minor version
RHEL 6.4                2.6.32-358
RHEL 6.6 and RHEL 6.7   2.6.32-432 to 2.6.32-512
RHEL 6.8                2.6.32-632
MODEL VS RHEL AND KERNEL
• Issues encountered:
o Slower reboots
o Boot failures due to iDRAC/IPMI
o Slowness on some systems
o Hardware issues
KEXEC
• The primary difference between a standard system boot and a kexec boot is that the hardware initialization (POST) normally performed by the BIOS is skipped during a kexec boot, which reduces the time required for a reboot.
• We had approximately 3,000 nodes that had the potential to cause issues with a standard system boot: they belonged to a specific config and had a bad history when it came to rebooting, so we used kexec on them (see the sketch below).
• We did do a full system reboot, in rolling fashion, after the Dirty COW kernel upgrade project was done.
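A cleaned-up sketch of the kexec reboot used in the tool config above, with each step annotated (standard grubby/kexec-tools usage):

  # Resolve the default kernel and its initrd from the bootloader config.
  kernel=$(grubby --default-kernel)
  initrd=$(grubby --info="$kernel" | grep '^initrd' | cut -d'=' -f2)
  # Stage the new kernel with the current boot command line, then reboot;
  # kexec jumps straight into the staged kernel, skipping BIOS POST.
  kexec -l "$kernel" --initrd="$initrd" --command-line="$(cat /proc/cmdline)"
  sleep 5
  reboot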
SUCCESS METRICS
• Zero data loss.
• 47,000+ nodes upgraded at an extremely fast pace.
• Minimal customer downtime.
• S0 security bug resolved.
• Minimal impact to low-latency services.
• Uncovered multiple system issues: got the opportunity to upgrade BIOS and BMC and to fix an EDAC issue that was causing system slowness, resulting in improved system reliability.