Monitoring
Far Beyond the
Operating System
WeOp 2014
Marcus Vechiato - @vechiato
http://weop.com.br
Agenda
⦿ Goal
⦿ How do we envision a monitoring system?
⦿ From simple to complex
⦿ What to monitor?
⦿ What to track?
⦿ Locaweb numbers
⦿ Where some get lost
⦿ Configuration automation
⦿ ITIL and ITSM Tools
⦿ Automatic Incident Creation
⦿ Tools already being used
⦿ Challenges
Goal
The objective of this presentation is to explore monitoring
implementations without focusing on tools.
It covers best practices, highlighting what worked well and the
lessons learned from mistakes made over the years.
How do we envision a monitoring system?
⦿ It's not just a tool.
⦿ The monitoring tool is one of the components of the
process.
⦿ Process - it can lead to bureaucracy if it's not effective.
Locaweb numbers
⦿ Network
⚫ Brocade / Cisco / Force10 and others
⦿ ~21k servers (physical and virtual)
⚫ Windows (2003/2008/2012)
⚫ Linux (CentOS/Red Hat/Debian)
⚫ Oracle/MySQL/PostgreSQL/MSSQL/MongoDB
⚫ VMware/Xen
⦿ ~500 thousand items/services monitored every minute
⦿ ~17 thousand incidents handled per month
From simple to complex
⦿ Have a clear understanding of your biggest challenges to define your
objectives.
⦿ Do not idealize a perfect system that will cover all the gaps; it does not
exist.
⦿ Remember what your resources are and what the real skills of the team
are.
⦿ Prefer a gradual implementation with well-defined deliverables.
What to monitor?
⦿ Core Services and Infrastructure - network/uninterruptible power
supply/temperature/DNS
⦿ Operating System (memory/CPU/local network/disk) where applicable
⦿ Applications
⚫ User perspective (HTTP/TCP requests)
⚫ Local (memory usage/threads/processes/etc.)
⦿ Business Indicators/errors
⚫ Example: Sales per hour
⚫ Example: Authentication failures per minute
What to track?
Translate infrastructure indicators into views by product/component/team
⦿ Dashboards for different audiences
⚫ Operations
○ KPI view by teams/infrastructure
⚫ Ex.: MTTR of N1 incidents by priority
⚫ Ex.: SLA and MTTR of storage abc
⚫ Products/Business
○ Common and specific indicator view
⚫ Ex.: SLA of product xyz 99.89%
⚫ Ex.: MTTR of product xyz 0h45m
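The two product indicators above can be derived from incident records. What follows is a minimal sketch, assuming each incident is stored as an opened/resolved pair of timestamps, that incidents do not overlap, and that "SLA" means availability over a reporting window. The names `Incident`, `mttr` and `availability` and the sample data are illustrative and do not come from the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    opened: datetime
    resolved: datetime


def mttr(incidents):
    """Mean time to resolve: average of (resolved - opened)."""
    total = sum((i.resolved - i.opened for i in incidents), timedelta())
    return total / len(incidents)


def availability(incidents, window):
    """Percentage of the window with no open incident (assumes no overlap)."""
    downtime = sum((i.resolved - i.opened for i in incidents), timedelta())
    return 100.0 * (1 - downtime / window)


# Hypothetical data: a 30-minute and a 60-minute incident in a 30-day month.
start = datetime(2014, 1, 1)
incidents = [
    Incident(start, start + timedelta(minutes=30)),
    Incident(start + timedelta(days=10), start + timedelta(days=10, hours=1)),
]
print(mttr(incidents))                                         # 0:45:00
print(round(availability(incidents, timedelta(days=30)), 2))   # 99.79
```

Grouping the incident list by product, team or component before calling these functions gives the per-audience dashboard views.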
Where some get lost
⦿ It is a mistake to jump to the diagnosis: "the xyz tool doesn't work, we need a new one."
⦿ Monitoring probe intervals are too short.
⦿ Retries are important to reduce false positives.
⦿ From my experience:
⚫ Standard probe intervals range from 1 to 5 minutes
⚫ Retries:
○ 5 minutes during deployment/with known instabilities.
○ 3 minutes in stable environments.
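The effect of retries on false positives can be sketched as follows. The slide gives the retry values in minutes, so this example assumes the retry window is converted into a number of consecutive failed probes by dividing it by the probe interval. The function names are hypothetical, and real monitoring tools expose this as configuration rather than code.

```python
def failures_needed(retry_window_min, probe_interval_min):
    """Consecutive failed probes required before an alarm is raised."""
    return max(1, retry_window_min // probe_interval_min)


def should_alert(results, needed):
    """results is a sequence of probe outcomes, True meaning the check passed."""
    consecutive = 0
    for ok in results:
        consecutive = 0 if ok else consecutive + 1
        if consecutive >= needed:
            return True
    return False


# Stable environment: 3-minute retry window with a 1-minute probe interval.
needed = failures_needed(3, 1)
print(should_alert([False, False, True, False, False], needed))  # False
print(should_alert([False, False, False], needed))               # True
```

A transient failure that recovers inside the retry window never reaches the incident queue, which is the point of the retry setting.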
Configuration Automation
⦿ Monitoring is the best place to start managing component installation and
configurations.
⚫ Start with the monitoring agent (if available).
⚫ Monitoring server
○ Via API where possible
○ Configuration files
⦿ Which tool to use for automation?
⚫ It depends on your environment and the team's knowledge. Chef and
Puppet are good options to start with.
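As an illustration of the "configuration files" route, the sketch below renders Nagios-style host definitions from an inventory list. In practice a Chef or Puppet template would do this rendering; plain Python stands in here. The host names, addresses and the `generic-host` template name are made-up examples, and `render_host` is a hypothetical helper.

```python
def render_host(name, address, template="generic-host"):
    """Render one Nagios-style host object definition."""
    return (
        "define host {\n"
        f"    use        {template}\n"
        f"    host_name  {name}\n"
        f"    address    {address}\n"
        "}\n"
    )


# Hypothetical inventory; normally this comes from the provisioning system.
inventory = [("web01", "10.0.0.11"), ("db01", "10.0.0.21")]
config = "\n".join(render_host(name, address) for name, address in inventory)
print(config)
```

Generating the file from the same inventory that provisions the servers means a new server cannot be created without also being monitored.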
ITIL and ITSM Tools
⦿ ITSM Tools
⚫ I strongly recommend using one
⚫ If you intend to manage incidents automatically, spend more time
evaluating which tool will be used
⦿ Processes are the backbone
⚫ Incident Management
⚫ Problem Management
⚫ Change Management
⦿ CMDB - registration/control is mandatory
⚫ In small installations, your monitoring tool is your CMDB
⚫ In larger environments, you will need to synchronize it with the ITSM
tool
Automatic Incident Creation
Some benefits of automatic incident creation in larger environments:
⦿ Addresses the inefficiency of manual incident logging
⦿ Registers failures exactly when they occur
⦿ Allows predefining the importance of each component/service and
prioritizing its resolution in case of failure
⦿ Reduces informal incident resolution without logging
⦿ Provides insight for in-depth analysis of the environment
⦿ Integrated with crisis management, reduces resolution time and improves
related communication
⦿ Enables realistic calculation of OLAs and SLAs
Automatic Incident Creation
⦿ Integration via:
⚫ API preferably (REST/SOAP)
⚫ Email - with templates, most tools allow it (only use as a last resort)
⦿ Use the priority when opening the incident to allow prioritization by the
resolving team. According to ITIL, on a scale of 1-5:
⚫ Priorities (think of a pyramid):
○ 1 and 2: should be less than 5% of incidents
○ 3: 20%
○ 4: 30%
○ 5: 45%
⦿ For each priority, define different resolution OLAs. Remember that this will
directly affect the size of the team.
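The pyramid can be checked against real incident data. This is a minimal sketch, assuming incidents are available as a list of integer priorities; priorities 1 and 2 are treated as a single bucket because the slide gives them one combined ceiling. The function names and the sample data are illustrative.

```python
from collections import Counter

# Target shares from the slide, in percent.
TARGETS = {"1-2": 5.0, "3": 20.0, "4": 30.0, "5": 45.0}


def bucket(priority):
    return "1-2" if priority in (1, 2) else str(priority)


def distribution(priorities):
    """Share of incidents per bucket, in percent."""
    counts = Counter(bucket(p) for p in priorities)
    total = len(priorities)
    return {b: 100.0 * counts.get(b, 0) / total for b in TARGETS}


def top_heavy(priorities):
    """True when priorities 1 and 2 reach or exceed their 5% ceiling."""
    return distribution(priorities)["1-2"] >= TARGETS["1-2"]


# Hypothetical month of 100 incidents.
sample = [1] * 2 + [2] * 2 + [3] * 20 + [4] * 30 + [5] * 46
print(distribution(sample))
print(top_heavy(sample))  # False
```

If the top of the pyramid grows beyond its share, either the priority rules are too generous or the team sized for the OLAs will not keep up.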
Automatic Incident Creation
⦿ Automatically reopen an incident if, after being resolved, the check
keeps failing in monitoring or fails again within 30
minutes.
⦿ Open a new incident if a new alarm arrives more than 30 minutes
after the last resolved incident.
⦿ Suppress incident creation during scheduled
maintenance
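The three rules above amount to a small decision function. The sketch below assumes the integration knows when the component's previous incident was resolved and whether a maintenance window is active, and that no incident is currently open for the component. Treating exactly 30 minutes as "reopen" is an assumption, and `Action` and `on_alarm` are hypothetical names.

```python
from datetime import datetime, timedelta
from enum import Enum

REOPEN_WINDOW = timedelta(minutes=30)  # value from the slide


class Action(Enum):
    SUPPRESS = "suppress"
    REOPEN = "reopen"
    CREATE = "create"


def on_alarm(now, last_resolved_at, in_maintenance):
    """Decide what to do with a new alarm for a component.

    last_resolved_at is when the component's previous incident was
    resolved, or None if there is no earlier incident.
    """
    if in_maintenance:
        return Action.SUPPRESS
    if last_resolved_at is not None and now - last_resolved_at <= REOPEN_WINDOW:
        return Action.REOPEN
    return Action.CREATE


now = datetime(2014, 1, 1, 12, 0)
print(on_alarm(now, now - timedelta(minutes=10), False))  # Action.REOPEN
print(on_alarm(now, now - timedelta(minutes=45), False))  # Action.CREATE
print(on_alarm(now, now - timedelta(minutes=10), True))   # Action.SUPPRESS
```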
Automatic Incident Creation
⦿ Automatically closing incidents with status "no intervention" when
monitoring normalizes before the team intervenes allows:
⚫ Refinement of the solution and its efficiency
⚫ Adjustment of very tight thresholds
⚫ Information for opening Problems
⚫ Identification of failures in planning/execution of changes
⚫ Quickly resume incident treatment after events with
hundreds/thousands of incidents opened in a short period of time
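A minimal sketch of the automatic closure rule, assuming the integration can tell whether anyone on the team has touched the incident. The `Incident` fields, the status strings and the helper names are hypothetical and not tied to any particular ITSM tool.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    component: str
    status: str = "open"
    touched_by_team: bool = False


def on_recovery(incident):
    """Close an untouched open incident when monitoring reports recovery."""
    if incident.status == "open" and not incident.touched_by_team:
        incident.status = "closed: no intervention"
    return incident


def no_intervention_rate(incidents):
    """Percentage of incidents that closed themselves."""
    closed = [i for i in incidents if i.status == "closed: no intervention"]
    return 100.0 * len(closed) / len(incidents)


a = on_recovery(Incident("web01"))
b = on_recovery(Incident("db01", touched_by_team=True))
print(a.status)                      # closed: no intervention
print(b.status)                      # open
print(no_intervention_rate([a, b]))  # 50.0
```

A component with a high "no intervention" rate is a candidate for threshold adjustment or for a Problem record.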
Tools already being used
⦿ Monitoring (open source):
⚫ Nagios
⚫ Check_mk – Locaweb
⚫ Zabbix
⦿ ITSM:
⚫ ServiceNow (API) – Locaweb
⚫ CA – Service Desk Manager (API) – Locaweb
⚫ HP – Service Center (API)
⚫ OTRS – (API)
Challenges
⦿ Golden Rule: "Every alarm must have a corrective action" even if it's just
adjusting the thresholds in case of false positives.
⦿ Don't be fooled - in the beginning, you will have many false positives.
Persistence is key.
⦿ If you don't close incidents automatically during instabilities, typically
network-related, you will be buried in incidents and will miss important
alarms when the instability ceases.
Challenges
⦿ Who implements the solution and who administers day-to-day operations?
⚫ Implementation of the solution: naturally the most senior team/person.
⚫ Who should enable the monitoring in new systems? If you thought of
the intern or the junior members of the team, you're mistaken. It is also
the responsibility of the most senior members, and it should be automated.
Challenges
More important than the tools are the people and adherence to the defined
processes, end-to-end.
Periodically revisit the processes to adjust and evolve according to current
needs.
If any process is not working, change it. Do not allow it to be abandoned or
circumvented.
Q&A ?
