Chef Patterns From Building Clusters
Biju Nair
Boston DevOps Meetup
08-July-2015
  
Background
•  Automate build & management of clusters
   – Hadoop
   – Kafka, etc.
•  Patterns which can be used elsewhere
  
Movies On Demand

Service On Demand
•  Common services which can be requested
   – Copy logs from applications to a centralized location
   – Service available on all the nodes
   – Applications can request the service dynamically
  
Service On Demand
•  Node attribute to store service requests
default['bcpc']['hadoop']['copylog'] = {}
•  Data structure to make service requests
{
  'app_id' => {
    'logfile' => "/path/file_name_of_log_file",
    'docopy' => true    # or false
  },
  ...
}
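
The request structure above can be tried outside Chef with plain Ruby hashes. The following is a minimal sketch, assuming an ordinary Hash behaves closely enough to the node attribute for illustration; the `kafka_server` entry, the host name in the path, and the `active_copy_requests` helper are hypothetical and do not come from the slides.

```ruby
# Plain-Ruby stand-in for node['bcpc']['hadoop']['copylog'] (assumption:
# a regular Hash is close enough to a Chef node attribute for this sketch).
copylog = {}

# Application "recipes" register their requests under their own app id.
copylog['hbase_master'] = {
  'logfile' => '/var/log/hbase/hbase-master-host1.log',
  'docopy'  => true
}
# Hypothetical second application that has switched the service off.
copylog['kafka_server'] = {
  'logfile' => '/var/log/kafka/server.log',
  'docopy'  => false
}

# Hypothetical helper: keep only the requests that ask for a copy.
def active_copy_requests(requests)
  requests.select { |_app_id, request| request['docopy'] }
end

active_copy_requests(copylog).each do |app_id, request|
  puts "copy #{request['logfile']} for #{app_id}"
end
```

Because every application writes under its own key, requests from different recipes do not overwrite each other, and setting `docopy` to false withdraws a request without deleting it.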
  
Service On Demand
•  Application recipes make service requests
#
# Updating node attributes to copy HBase master log file to HDFS
#
node.default['bcpc']['hadoop']['copylog']['hbase_master'] = {
  'logfile' => "/var/log/hbase/hbase-master-#{node.hostname}.log",
  'docopy' => true
}
node.default['bcpc']['hadoop']['copylog']['hbase_master_out'] = {
  'logfile' => "/var/log/hbase/hbase-master-#{node.hostname}.out",
  'docopy' => true
}

Service On Demand
•  Service recipe
node['bcpc']['hadoop']['copylog'].each do |id, f|
  if f['docopy']
    # Render one Flume agent configuration per requested log file
    template "/etc/flume/conf/flume-#{id}.conf" do
      source "flume_flume-conf.erb"
      action :create
      ...
      variables(:agent_name => "#{id}",
                :log_location => "#{f['logfile']}")
      notifies :restart, "service[flume-agent-multi-#{id}]", :delayed
    end
    # One service resource per agent; the id is passed to the shared init script
    service "flume-agent-multi-#{id}" do
      supports :status => true, :restart => true, :reload => false
      service_name "flume-agent-multi"
      action :start
      start_command "service flume-agent-multi start #{id}"
      restart_command "service flume-agent-multi restart #{id}"
      status_command "service flume-agent-multi status #{id}"
    end
  end
end
•  Separate role at the end of run list
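
The per-request rendering done by the service recipe can be approximated with Ruby's standard ERB library. This is only a sketch: the real `flume_flume-conf.erb` template is not shown in the slides, so the template text below and the helper names `render_flume_conf` and `flume_configs` are assumptions made for illustration.

```ruby
require 'erb'

# Hypothetical stand-in for flume_flume-conf.erb; the real template is not
# shown in the slides, so these Flume properties are illustrative only.
FLUME_TEMPLATE = <<~TEMPLATE
  <%= agent_name %>.sources = tail
  <%= agent_name %>.sources.tail.type = exec
  <%= agent_name %>.sources.tail.command = tail -F <%= log_location %>
TEMPLATE

# Render one agent configuration, mirroring the variables the recipe
# passes to the template resource (:agent_name and :log_location).
def render_flume_conf(agent_name, log_location)
  ERB.new(FLUME_TEMPLATE).result_with_hash(
    agent_name: agent_name, log_location: log_location
  )
end

# Build { config path => rendered text } for every request with docopy set.
def flume_configs(requests)
  requests.each_with_object({}) do |(id, f), configs|
    next unless f['docopy']
    configs["/etc/flume/conf/flume-#{id}.conf"] = render_flume_conf(id, f['logfile'])
  end
end

requests = {
  'hbase_master' => { 'logfile' => '/var/log/hbase/master.log', 'docopy' => true },
  'skipped_app'  => { 'logfile' => '/var/log/other.log', 'docopy' => false }
}
flume_configs(requests).each { |path, text| puts "#{path}:\n#{text}" }
```

Placing the role that runs this recipe at the end of the run list matters because the loop can only see requests that earlier recipes have already written into the attribute.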
  	
  
Choices	
  
Pluggable Alerts
•  Single source for monitored stats
   – Allows users to visualize stats across different parameters
   – Didn’t want to duplicate the stats collection by the alerting system
   – Need to feed data to the alerting system to generate alerts
  
Pluggable Alerts
•  Attribute where users can define alerts
default["bcpc"]["hadoop"]["graphite"]["queries"] = {
  'hbase_master' => [
    {
      'type' => "jmx",
      'query' => "memory.NonHeapMemoryUsage_committed",
      'key' => "hbasenonheapmem",
      'trigger_val' => "max(61,0)",
      'trigger_cond' => "=0",
      'trigger_name' => "HBaseMasterAvailability",
      'trigger_dep' => ["NameNodeAvailability"],
      'trigger_desc' => "HBase master seems to be down",
      'severity' => 1
    }, {
      'type' => "jmx",
      'query' => "memory.HeapMemoryUsage_committed",
      'key' => "hbaseheapmem",
      ...
    }, ...
  ],
  'namenode' => [...],
  ...
}

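One way to read the alert definition above is as the raw material for a trigger expression. The sketch below assumes a Zabbix-style expression of the form `{host:key.function(params)}condition`; how the actual recipe assembles its triggers is not shown in the slides, and the host name and the helpers `trigger_expression` and `triggers_for` are hypothetical.

```ruby
# One alert definition in the shape used by the attribute above.
ALERT = {
  'type'         => 'jmx',
  'query'        => 'memory.NonHeapMemoryUsage_committed',
  'key'          => 'hbasenonheapmem',
  'trigger_val'  => 'max(61,0)',
  'trigger_cond' => '=0',
  'trigger_name' => 'HBaseMasterAvailability',
  'trigger_dep'  => ['NameNodeAvailability'],
  'severity'     => 1
}.freeze

# Assumption: the alerting system accepts expressions of the form
# {host:key.function(params)}condition. The host name is hypothetical.
def trigger_expression(host, alert)
  "{#{host}:#{alert['key']}.#{alert['trigger_val']}}#{alert['trigger_cond']}"
end

# Hypothetical helper: flatten the per-service lists into trigger records.
def triggers_for(host, queries)
  queries.flat_map do |_service, alerts|
    alerts.map do |alert|
      { name: alert['trigger_name'],
        expression: trigger_expression(host, alert),
        depends_on: alert.fetch('trigger_dep', []) }
    end
  end
end

puts triggers_for('cluster-head', { 'hbase_master' => [ALERT] }).first[:expression]
```

Keeping the metric query and the trigger fields in one record is what lets a single attribute drive both the statistics pull and the alert definition.
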
Pluggable Alerts
•  Recipes and templates use the data structure
   – To generate queries to pull data from the statistics store and send it to the alerting system
      •  https://github.com/bloomberg/chef-bach/blob/master/cookbooks/bcpc-hadoop/templates/default/graphite.query_graphite.config.erb
   – To create the requested trigger-related objects in the alarming system
      •  https://github.com/bloomberg/chef-bach/blob/master/cookbooks/bcpc-hadoop/recipes/graphite_to_zabbix.rb
  
Pluggable Alerts
•  Servers defined in the role are used by recipes
"default_attributes" : {
  "jmxtrans": {
    "servers": [
      {
        "type": "hbase_master",
        "service": "hbase-master",
        "service_cmd": "org.apache.hadoop.hbase.master.HMaster"
      }, {
        "type": "hbase_rs",
        "service": "hbase-regionserver",
        "service_cmd": "org.apache.hadoop.hbase.regionserver.HRegionServer"
      }
    ]
  } ...

Dependency	
  
Service Restart
•  We use jmxtrans to monitor JMX stats
   – Services to be monitored vary by node
   – There can be more than one service to be monitored
   – A restart of a monitored service requires jmxtrans to be restarted**
  
Service Restart
•  Data structure in roles to define the services
"default_attributes" : {
  "jmxtrans": {
    "servers": [
      {
        "type": "datanode",
        "service": "hadoop-hdfs-datanode",
        "service_cmd": "org.apache.hadoop.hdfs.server.datanode.DataNode"
      }, {
        "type": "hbase_rs",
        "service": "hbase-regionserver",
        "service_cmd": "org.apache.hadoop.hbase.regionserver.HRegionServer"
      }
    ]
  } ...

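To see what recipes get out of this role data, the fragment can be parsed with Ruby's standard JSON library. This is a sketch under assumptions: the JSON below is the slide's fragment closed off so that it parses, and `service_commands` is a hypothetical helper, not part of the cookbook.

```ruby
require 'json'

# Role fragment in the shape shown above, closed off so that it parses.
ROLE_JSON = <<~JSON
  {
    "default_attributes": {
      "jmxtrans": {
        "servers": [
          { "type": "datanode",
            "service": "hadoop-hdfs-datanode",
            "service_cmd": "org.apache.hadoop.hdfs.server.datanode.DataNode" },
          { "type": "hbase_rs",
            "service": "hbase-regionserver",
            "service_cmd": "org.apache.hadoop.hbase.regionserver.HRegionServer" }
        ]
      }
    }
  }
JSON

# Hypothetical helper: map each monitored service name to the command
# string that identifies its process.
def service_commands(role_json)
  servers = JSON.parse(role_json).dig('default_attributes', 'jmxtrans', 'servers') || []
  servers.map { |server| [server['service'], server['service_cmd']] }.to_h
end

service_commands(ROLE_JSON).each { |service, cmd| puts "#{service} -> #{cmd}" }
```

Since the list lives in the role, each node type carries its own set of monitored services without any change to the recipe code.
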
Service Restart
•  Jmxtrans service restart logic built dynamically
jmx_services = Array.new
jmx_srvc_cmds = Hash.new
node['jmxtrans']['servers'].each do |server|
  jmx_services.push(server['service'])
  jmx_srvc_cmds[server['service']] = server['service_cmd']
end
service "restart jmxtrans on dependent service" do
  service_name "jmxtrans"
  supports :restart => true, :status => true, :reload => true
  action :restart
  jmx_services.each do |jmx_dep_service|
    subscribes :restart, "service[#{jmx_dep_service}]", :delayed
  end
  only_if { process_require_restart?("jmxtrans", "jmxtrans-all.jar", jmx_srvc_cmds) }
end

Service Restart
require 'date'    # DateTime.parse is used below

def process_require_restart?(process_name, process_cmd, dep_cmds)
  tgt_process_pid = `pgrep -f #{process_cmd}`
  ...
  tgt_process_stime = `ps --no-header -o start_time #{tgt_process_pid}`
  ...
  ret = false
  restarted_processes = Array.new
  dep_cmds.each do |dep_process, dep_cmd|
    dep_pids = `pgrep -f #{dep_cmd}`
    if dep_pids != ""
      # pgrep prints one PID per line
      dep_pids_arr = dep_pids.split("\n")
      dep_pids_arr.each do |dep_pid|
        dep_process_stime = `ps --no-header -o start_time #{dep_pid}`
        # A dependency that started after the target means the target is stale
        if DateTime.parse(tgt_process_stime) < DateTime.parse(dep_process_stime)
          restarted_processes.push(dep_process)
          ret = true
        end
        ...

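The comparison at the heart of `process_require_restart?` can be isolated from the `pgrep` and `ps` calls so that it is easy to exercise. The following is a sketch, assuming start times are already available as `Time` values; `restarted_dependencies` and `restart_required?` are hypothetical names and not the helper from the slides.

```ruby
# Hypothetical pure function: given the monitor's start time and the start
# times of its dependencies, list the dependencies that started later.
def restarted_dependencies(target_started_at, dependency_start_times)
  newer = dependency_start_times.select do |_name, times|
    times.any? { |started_at| started_at > target_started_at }
  end
  newer.keys
end

# A restart is needed when at least one dependency is newer than the monitor.
def restart_required?(target_started_at, dependency_start_times)
  !restarted_dependencies(target_started_at, dependency_start_times).empty?
end

jmxtrans_started = Time.utc(2015, 7, 8, 10, 0)
dependencies = {
  'hbase-regionserver'   => [Time.utc(2015, 7, 8, 10, 30)], # restarted later
  'hadoop-hdfs-datanode' => [Time.utc(2015, 7, 8, 9, 0)]    # older than jmxtrans
}
puts restarted_dependencies(jmxtrans_started, dependencies).inspect
puts restart_required?(jmxtrans_started, dependencies)
```

Each dependency maps to a list of start times because `pgrep -f` may match more than one process for the same command string.
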
External Dependency

Rolling Restart
•  Changes to configuration
•  Availability
   – Toxic Configuration
•  Contention
   – Poll & Wait
   – Fail the Run
   – Simply Skip Service Restart and Go On
•  Store the state and need for restart
•  Breaks assumptions of Procedural Chef Runs
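
The three contention options listed above can be written down as a small policy switch. This is an illustrative sketch only: `handle_contention`, its strategy names, and the `lock_free` probe are hypothetical, and the slides do not say which option was finally chosen.

```ruby
# Hypothetical policy switch for a node that may find the restart lock taken.
# lock_free is any callable returning true when the restart may proceed.
def handle_contention(strategy, lock_free, attempts: 3)
  case strategy
  when :poll_and_wait
    # Probe a bounded number of times before giving up.
    attempts.times { return :restarted if lock_free.call }
    :gave_up
  when :fail_run
    # Abort the whole run when the lock is busy.
    raise 'restart lock is held by another node' unless lock_free.call
    :restarted
  when :skip
    # Carry on with the run and remember that a restart is still owed.
    lock_free.call ? :restarted : :skipped_restart_pending
  else
    raise ArgumentError, "unknown strategy #{strategy}"
  end
end

probe_results = [false, false, true]
puts handle_contention(:poll_and_wait, -> { probe_results.shift })
puts handle_contention(:skip, -> { false })
```

The `:skip` branch is where the "store the state and need for restart" bullet applies: once a restart is skipped, a later run has to know that it is still pending.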
  
Rolling Restart
•  ZooKeeper
   – Service specific znode as lock
•  Node attribute to flag restart failures
https://github.com/bloomberg/chef-bach/blob/rolling_restart/cookbooks/bcpc-hadoop/definitions/hadoop_service.rb
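
The lock-and-flag idea can be illustrated without a ZooKeeper ensemble. The sketch below is an assumption-heavy stand-in: `ServiceLock` is an in-memory class that only mimics a per-service lock znode, and `restart_under_lock` and the `restart_failed` key are hypothetical names, not the code in the linked definition.

```ruby
# In-memory stand-in for a per-service lock znode (assumption: the real
# implementation talks to ZooKeeper; this class only mimics the semantics).
class ServiceLock
  def initialize
    @holders = {}
  end

  # Succeeds only if nobody holds the lock for this service.
  def acquire(service, owner)
    return false if @holders.key?(service)
    @holders[service] = owner
    true
  end

  # Only the current holder may release the lock.
  def release(service, owner)
    @holders.delete(service) if @holders[service] == owner
  end
end

# Restart while holding the lock and record failures in a node attribute.
def restart_under_lock(lock, service, host, node_attrs)
  return :lock_busy unless lock.acquire(service, host)
  begin
    yield
    node_attrs['restart_failed'] = false
    :restarted
  rescue StandardError
    node_attrs['restart_failed'] = true
    :failed
  ensure
    lock.release(service, host)
  end
end

lock = ServiceLock.new
attrs = {}
puts restart_under_lock(lock, 'kafka', 'node1', attrs) { puts 'restarting kafka on node1' }
puts attrs.inspect
```

Releasing the lock in `ensure` keeps one failed node from blocking the rest of the cluster, while the flag leaves a trace that the restart did not succeed.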
  
Change Course

Logic Injection
•  We use Community cookbooks
   – Takes care of standard install, enable and starting of services
•  Need to add logic to cookbook recipes
   – Take action on a service only when conditions are satisfied
   – Take action on a service based on dependent service state
  
Logic Injection
kafka_install node.kafka.version_install_dir do
  from kafka_target_path
  not_if { kafka_installed? }
end
template ::File.join(node.kafka.config_dir, 'server.properties') do
  source 'server.properties.erb'
  ...
  helpers(Kafka::Configuration)
  if restart_on_configuration_change?
    notifies :restart, 'service[kafka]', :delayed
  end
end
service 'kafka' do
  provider kafka_init_opts[:provider]
  supports start: true, stop: true, restart: true, status: true
  action kafka_service_actions
end

Logic Injection
•  Changes to standard cookbook
   – Create a new recipe to perform service action
      •  Resource to intercept notifications to service resource
      •  Original service resource
•  Add node attribute which stores name of new recipe
•  Update original recipe
   – Remove the service resource from the original recipe
   – Replace it with include_recipe new_attribute
  
Logic Injection
•  New recipe to perform service actions
   – First step is the ruby_block to intercept notifications
ruby_block 'coordinate-kafka-start' do
  block do
    Chef::Log.debug 'Default recipe to coordinate Kafka start is used'
  end
  action :nothing
  notifies :restart, 'service[kafka]', :delayed
end
service 'kafka' do
  provider kafka_init_opts[:provider]
  supports start: true, stop: true, restart: true, status: true
  action kafka_service_actions
end

Logic Injection
•  Attribute to set the recipe for service actions
#
# Attribute to set the recipe used to coordinate Kafka service start;
# if nothing is set, the default recipe "_coordinate" will be used
#
default.kafka.start_coordination.recipe = 'kafka::_coordinate'

Logic Injection
•  Changes to the original recipe
kafka_install node.kafka.version_install_dir do
  from kafka_target_path
  not_if { kafka_installed? }
end
template ::File.join(node.kafka.config_dir, 'server.properties') do
  source 'server.properties.erb'
  ...
  helpers(Kafka::Configuration)
  if restart_on_configuration_change?
    # Notify the interceptor instead of the service resource
    notifies :create, 'ruby_block[coordinate-kafka-start]', :immediately
  end
end
include_recipe node.kafka.start_coordination.recipe

Logic Injection
•  Changes in wrapper cookbook
   – Create custom recipe in wrapper cookbook
      •  Notification interceptor ruby_block should be first
      •  Logic to determine service restart action
      •  service resource
      •  Any clean-up logic
   – Overwrite attribute with custom recipe name
  
Logic Injection
ruby_block 'coordinate-kafka-start' do
  block do
    Chef::Log.info 'Custom recipe to coordinate Kafka start/restart'
  end
  ...
end
ruby_block 'restart-coordination' do
  block do
    Chef::Log.info 'Implement the process to coordinate the restart'
  end
  ...
end
service 'kafka' do
  provider kafka_init_opts[:provider]
  supports start: true, stop: true, restart: true, status: true
  ...
end
ruby_block 'restart-coordination-cleanup' do
  block do
    Chef::Log.info 'Implement any cleanup logic required'
  end
end

Logic Injection
•  Overwrite attribute to set the custom recipe
#
# Overwrite the community cookbook attribute with custom recipe name
#
default[:kafka][:start_coordination][:recipe] = 'kafka-bcpc::coordinate'

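The indirection behind this pattern can be modelled in a few lines of plain Ruby. This is a toy model and not Chef API: the `RECIPES` table, the local `include_recipe` stand-in, `converge`, and the flattened `start_coordination_recipe` key are all hypothetical, and the lock steps in the custom recipe are placeholders for whatever coordination a wrapper cookbook implements.

```ruby
# Toy model of attribute-driven recipe inclusion; none of this is Chef API.
RECIPES = {
  # Default coordination recipe shipped with the community cookbook.
  'kafka::_coordinate' => lambda { |log| log << 'restart kafka' },
  # Hypothetical custom recipe supplied by a wrapper cookbook.
  'kafka-bcpc::coordinate' => lambda { |log|
    log << 'acquire lock' << 'restart kafka' << 'release lock'
  }
}.freeze

# Stand-in for include_recipe: look up the recipe named by the attribute.
def include_recipe(name, log)
  RECIPES.fetch(name).call(log)
end

# The original recipe no longer owns the service logic; it only follows
# whatever recipe name the attribute holds.
def converge(attributes)
  log = ['write server.properties']
  include_recipe(attributes['start_coordination_recipe'], log)
  log
end

defaults = { 'start_coordination_recipe' => 'kafka::_coordinate' }
wrapper  = defaults.merge('start_coordination_recipe' => 'kafka-bcpc::coordinate')

puts converge(defaults).inspect
puts converge(wrapper).inspect
```

The community recipe stays untouched after the one-time change; swapping the coordination behaviour is a matter of changing a single attribute value in the wrapper cookbook.
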
Questions?

References
•  https://github.com/bloomberg/chef-bach
•  http://blog.asquareb.com/blog/categories/chef-patterns/

Thank You!!
bnair@asquareb.com
  
