The Elusive Root Cause Of IT Problems
And How To Easily Identify It


Noam Biran
Director of Product Management
Introduction
               Mr. Biran
               •    Director of Product Management at Neebula
               •    20 years of experience in systems management & BSM
               •    Innovation Product Management at BMC
               •    Co-founder of Appilog (now HP uCMDB & DDMA)



 About Neebula
  Neebula provides the first and only automatic service-centric IT management
  solution allowing IT organizations to improve the service provided to the business
  by shifting from managing disparate technology silos to managing the services
  running in the data center. Leveraging unique technology that automatically maps
  business services to the underlying infrastructure, Neebula enables the IT team to
  increase availability of the main services they manage and reduce the time to
  repair of problems.
Agenda
•   Introduction
•   Root cause analysis defined
•   The problem resolution process
•   Problem detection
•   Root cause analysis methods
•   Improving root cause analysis processes
Root Cause Analysis Definition
   ITIL V3
              An Activity that identifies the Root Cause of
              an Incident or Problem.
              Root Cause Analysis typically concentrates on
              IT Infrastructure failures.



  Wikipedia
              Root Cause Analysis is any structured
              approach to identify the factors that resulted
              in the harmful consequences of one or more
              past events
The Importance of Root Cause Analysis
• Root Cause Analysis has a high impact on
  – IT processes
     • The efficiency of the overall incident/problem
       management process
     • Good RCA discipline requires well established
       configuration management
  – Organizational goals
     • Meeting internal and external SLAs
     • Financial (budget & revenue) implications
     • Brand / customer loyalty
Root Cause Analysis Nowadays
The Critical Role of Root Cause Analysis
• Improper (or lack of) identification of the real
  root cause may yield:
   – Repeating problems
   – Increased downtime
   – Waste of human
     resources on
     “fixing” the wrong
     issues
   – Risk to the business
The Life of The Operator
We expect the operator
    – To handle thousands of cryptic events
    – To understand the impact on hundreds of services
    – To understand the correlation to
       customer service complaints
    – To understand what changed
    – To orchestrate the resolution
And to make these decisions within minutes to
reduce MTTR

   Are we giving our operators the tools to
   succeed?
Problem Resolution Process
Problem Resolution Process
• Events come in to the NOC
• NOC performs some investigation
• Root cause analysis is shared between NOC
  & 2nd/3rd level support (admins)
• Low level diagnostics & problem resolution
  is done by 2nd/3rd level support (admins)
Involved Parties & Tools

• Tools
  – Monitoring tools
  – Configuration management tools
• People
  – Users
  – NOC
  – Admins – specialized teams focused on specific
    area, e.g. system, database, network
  – Application support / developers
The Common Process – Blame Game
•   No structured process
•   Lack of overall cross-domain view
•   Each team has its own terminology and view
•   Each team is working on its own
Problem Detection
Potential Problem Symptoms
• Lack of certain functionality
  – A certain transaction does not work
• Performance degradation
  – Fund transfer response time is above 2 sec.
• Availability issue
  – Application doesn’t work
• None
  – Unnoticeable failure due to high availability
    configuration
Problem Detection
• Good problem detection methods are key for a
  structured root cause analysis process
• Problem detection tools should provide sufficient
  data to the root cause analysis process
• There are various distinct methods, each with its
  own pros and cons
• There is no single superior detection method
Detection – Users
• What it does
  – Compensates for unknown / unreported
    problems
• What it doesn’t
  – Supposedly accurate – actually might point in
    the wrong direction
  – Usually takes place
    too late for a quick fix,
    when the business is already impacted
Detection – Infrastructure Monitoring
• What it does
  – Monitor each technical element
    comprising the service
  – Great way to identify
    specific availability failures
• What it doesn’t
  – Hard to correlate with real user experience
  – Too many false positives
  – Lots of events on symptoms rather than on the actual problem
Detection – End User Experience
• What it does
  – Measure overall response time of user transactions
  – Synthetic or real user transactions
  – The ultimate problem detection method
• What it doesn’t
  – No real breakdown to assist
    in pinpointing the problem
    or even the domain
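A minimal sketch of end user experience detection, assuming a synthetic transaction can be scripted as a plain function. The 2-second threshold echoes the fund transfer example on the symptoms slide; `check_transaction` and `fake_fund_transfer` are hypothetical names used only for illustration.

```python
import time
from typing import Callable, Dict

# Assumption: 2 seconds mirrors the "fund transfer" example in this deck;
# a real threshold would come from the service's SLA.
RESPONSE_TIME_THRESHOLD_SEC = 2.0


def check_transaction(transaction: Callable[[], object],
                      threshold_sec: float = RESPONSE_TIME_THRESHOLD_SEC) -> Dict[str, object]:
    """Run one synthetic transaction and classify the observed symptom."""
    start = time.perf_counter()
    try:
        transaction()
    except Exception as exc:
        # The transaction did not complete: a functionality or availability symptom.
        return {"status": "failed",
                "elapsed_sec": time.perf_counter() - start,
                "detail": repr(exc)}
    elapsed = time.perf_counter() - start
    # The transaction completed; compare its response time to the threshold.
    status = "degraded" if elapsed > threshold_sec else "ok"
    return {"status": status, "elapsed_sec": elapsed, "detail": ""}


def fake_fund_transfer() -> None:
    """Hypothetical stand-in for a scripted user transaction."""
    time.sleep(0.01)


if __name__ == "__main__":
    print(check_transaction(fake_fund_transfer))
```

Note that the result says only that the transaction is slow or failing, not which component or domain is responsible, which is the limitation listed on this slide.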
Detection – Transaction Breakdown
• What it does
  – Discovery of each transaction’s path
    within the data center
  – Highlight potential performance
    problems within the transaction
    execution
• What it doesn’t
  – No correlation to infrastructure
    monitoring
  – Cannot cover the entire data center
    – domain specific
Detection – Domain Specific Tools
• What it does
  – Drill down in a specific application
  – Great analysis & diagnostics within an application
• What it doesn’t
  – No data center wide view
  – Lack of insight into the
    connections between
    applications
Detection – Synergy
Root Cause Analysis Methods
Potential Root Cause Types

•   Configuration change
•   Version upgrade
•   Hardware fault
•   Software bug
•   Capacity problem
•   Resource collision
Common Ways for Root Cause Analysis

•   War room scenario
•   The log file approach
•   APM tools
•   Transaction management
•   Manual event correlation / analysis
War Room Scenario

•   Getting everyone in the same room
•   Each participant has their own data and terminology
•   Blame game
•   Takes a lot of time
The Log File Approach

• An admin sits and analyzes log files and
  other historical data from various sources
• A domain specific approach
• Certain degree of structured process
• Might identify problems that
  are not the root cause
  (distractions)
APM Tools

• An admin sits and analyzes log files and
  other historical data from various sources
• A domain specific approach
• Certain degree of structured process
• Might identify problems that
  are not the root cause
  (distractions)
Transaction Management

• A great tool to point to the probable area
  where the root cause resides
• Limited to specific domains
• Inability to correlate with infrastructure
  metrics / failures
Manual Event Correlation / Analysis

• Requires cross-domain expertise
• Requires understanding of dependencies
  between components
• Time consuming
• Lack of insight into other
  non-event data
Improving Root Cause Analysis Processes
Making the Best of Existing Tools

• Choose problem detection methods that
  assist in the root cause analysis process
• Turn the root cause analysis into a
  structured process
  – Internal team processes
  – Inter-team processes
• Common language & visibility between
  teams
New Methods: Mapping

• Mapping of business services & applications
  to the supporting infrastructure
• Ties symptoms (user) to problems
  (technology)
• Introduces a common language between
  teams
• Enables a high level cross-domain view
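One way to picture such a map is as a lookup from each business service to the components it runs on. The sketch below is illustrative only: the service and component names are invented, and a real map would be discovered automatically rather than typed in by hand.

```python
from typing import Dict, List, Set

# Hypothetical service map: business service -> supporting components.
SERVICE_MAP: Dict[str, Set[str]] = {
    "online-banking": {"web-01", "app-01", "db-01"},
    "fund-transfer": {"app-01", "mq-01", "db-01"},
    "reporting": {"app-02", "db-02"},
}


def impacted_services(component: str, service_map: Dict[str, Set[str]]) -> List[str]:
    """Technology -> user view: which services depend on this component?"""
    return sorted(name for name, parts in service_map.items() if component in parts)


def shared_components(services: List[str], service_map: Dict[str, Set[str]]) -> Set[str]:
    """User -> technology view: components common to all affected services."""
    sets = [service_map[name] for name in services]
    return set.intersection(*sets) if sets else set()


if __name__ == "__main__":
    print(impacted_services("db-01", SERVICE_MAP))
    print(shared_components(["online-banking", "fund-transfer"], SERVICE_MAP))
```

The same structure answers both questions teams tend to ask each other: which services a failing component affects, and which components to check first when several services show symptoms at once.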
New Methods: Structured Process

• Define a structured process for problem
  investigation and root cause analysis
• Define how collaboration should occur
  during root cause analysis between teams
New Methods: Tools

• Use tools that provide a historical
  dimension for problem investigation
• Use tools that enable the correlation of
  problems to configuration changes
• Use topology based correlation instead of
  rule based (or manual based) correlation
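The last two bullets can be sketched as follows, under stated assumptions: the topology is a simple acyclic dependency graph, alerts arrive as a set of component names, and changes are records with a component and a timestamp. All names, dates and the 24-hour window are hypothetical; this is not how any particular product implements correlation.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, List

# Hypothetical topology: component -> components it depends on (assumed acyclic).
DEPENDS_ON: Dict[str, List[str]] = {
    "web-01": ["app-01"],
    "app-01": ["db-01", "mq-01"],
    "db-01": ["san-01"],
    "mq-01": [],
    "san-01": [],
}


def root_cause_candidates(alerting: Iterable[str],
                          depends_on: Dict[str, List[str]]) -> List[str]:
    """Alerting components with no alerting component anywhere beneath them."""
    alerting_set = set(alerting)

    def has_alerting_dependency(node: str) -> bool:
        stack, seen = list(depends_on.get(node, [])), set()
        while stack:
            dep = stack.pop()
            if dep in seen:
                continue
            seen.add(dep)
            if dep in alerting_set:
                return True
            stack.extend(depends_on.get(dep, []))
        return False

    return sorted(n for n in alerting_set if not has_alerting_dependency(n))


def recent_changes(changes: List[dict], component: str, incident_time: datetime,
                   window: timedelta = timedelta(hours=24)) -> List[dict]:
    """Changes recorded on the component within the window before the incident."""
    return [c for c in changes
            if c["component"] == component
            and incident_time - window <= c["time"] <= incident_time]


if __name__ == "__main__":
    # Hypothetical change records and incident time, for illustration only.
    changes = [
        {"component": "db-01", "time": datetime(2024, 1, 10, 9, 0), "what": "parameter change"},
        {"component": "db-01", "time": datetime(2024, 1, 5, 9, 0), "what": "patch"},
        {"component": "app-01", "time": datetime(2024, 1, 10, 11, 0), "what": "deploy"},
    ]
    incident = datetime(2024, 1, 10, 12, 0)
    for candidate in root_cause_candidates({"web-01", "app-01", "db-01"}, DEPENDS_ON):
        print(candidate, recent_changes(changes, candidate, incident))
```

Unlike rule-based correlation, nothing here has to be rewritten when the environment changes: only the topology data is updated, and the symptom events on dependent components are suppressed by walking the graph.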

Editor's Notes

  • #2: Introduction to the subject. Webinar logistics: presentation first, send questions during, answer questions at the end.
  • #5: RCA is problematic even to define. ITIL definition -> useless. ITIL failed. Wikipedia: structured; factors; consequences; past events – I’ll call them symptoms.
  • #6: Talk about each bullet
  • #7: Many data sources (event feeds). All are mixed and funneled into the NOC. NOC needs to filter and make order in them based on: relevance; source / derived. But the NOC doesn’t have the tools or processes to do this. No structured way to do this filtering (though the NOC is used to structured processes like run book).
  • #8: Taking care of the symptoms and not the problems. Associating wrong events -> figuring out the incorrect root cause.
  • #9: NOC is used to structured processes (like run book). We don’t give them tools. We don’t give them structured processes (or any processes). They don’t usually possess cross-domain knowledge.
  • #11: Isolation – diagnostics. NOC’s investigation may yield forwarding to the wrong team and therefore wrong analysis done in the wrong context.
  • #12: Explain each. How do they all tie together? Usually they don’t.
  • #15: Problem detection begins with the symptoms. Same symptoms may be caused by different problems.
  • #22: We need a combination of tools. Choose the right mix to assist in the RCA process. Need synergy between the methods.
  • #24: Cross domain. Cross discipline. Require deep understanding.
  • #26: Not a structured approach