Root Cause Failure Analysis with Case Histories
Eugene T. Cottle
Reliability Engineer
Life Cycle Engineering
2© Life Cycle Engineering 2008
A confluence of events, factors and conditions which
conspire to produce an (undesirable) outcome.
Event(s), factor(s) or condition(s) which are under
your control and which, if corrected or eliminated,
will prevent recurrence of the undesirable outcome.
More Terminology…
• RCA: Root Cause Analysis
– A disciplined process for focusing ideas to identify
root causes. A class of problem solving methods
• RCFA: Root Cause Failure Analysis
– Reactive, in response to a failure
• RCCA: Root Cause and Corrective Action
– Incorporates preventive corrective action into the
process (i.e., elimination of special causes)
Root Cause Analysis
• Safety-based RCA
– accident analysis and
– occupational safety and
health
• Production-based RCA
– quality control for
industrial manufacturing
• Process-based RCA
– Expanded scope to
include business
processes
• Failure-based RCA
– Based on failure analysis
– employed in engineering
and maintenance.
• Systems-based RCA
– amalgamation of all
the others, and includes
• change management,
• risk management, and
• systems analysis
Objectives
• Prevent Recurrence
• Responsibility
– “Hand-off” the investigation
• Begins with an assumption of “cause”
– Liability
– Blame
Deming’s 14 points
1. Create constancy of purpose toward improvement of
product and service, with the aim to become
competitive and stay in business, and to provide
jobs.
2. Adopt the new philosophy. We are in a new
economic age. Western management must awaken
to the challenge, must learn their responsibilities,
and take on leadership for change.
3. Cease dependence on inspection to achieve quality.
Eliminate the need for inspection on a mass basis
by building quality into the product in the first place.
4. End the practice of awarding business on the basis
of price tag. Instead, minimize total cost. Move
towards a single supplier for any one item, on a
long-term relationship of loyalty and trust.
5. Improve constantly and forever the system of
production and service, to improve quality and
productivity, and thus constantly decrease cost.
6. Institute training on the job.
7. Institute leadership (see Point 12 and Ch. 8 of "Out
of the Crisis"). The aim of supervision should be to
help people and machines and gadgets to do a
better job. Supervision of management is in need of
overhaul, as well as supervision of production
workers.
8. Drive out fear, so that everyone may work effectively for
the company. (See Ch. 3 of "Out of the Crisis")
9. Break down barriers between departments. People in
research, design, sales, and production must work as a
team, to foresee problems of production and in use that
may be encountered with the product or service.
10. Eliminate slogans, exhortations, and targets for the work
force asking for zero defects and new levels of
productivity. Such exhortations only create adversarial
relationships, as the bulk of the causes of low quality and
low productivity belong to the system and thus lie beyond
the power of the work force.
11. a. Eliminate work standards (quotas) on the factory floor.
Substitute leadership.
b. Eliminate management by objective. Eliminate
management by numbers, numerical goals. Substitute
leadership.
12. a. Remove barriers that rob the hourly worker of his right
to pride of workmanship. The responsibility of supervisors
must be changed from sheer numbers to quality.
b. Remove barriers that rob people in management and
in engineering of their right to pride of workmanship. This
means, inter alia, abolishment of the annual or merit
rating and of management by objective (See CH. 3 of
"Out of the Crisis").
13. Institute a vigorous program of education and self-
improvement.
14. Put everyone in the company to work to accomplish the
transformation. The transformation is everyone's work.
5 Whys Method
5 Whys Method: Car not Starting
1. Why? - The battery is dead. (first
why)
2. Why? - The alternator is not
functioning. (second why)
3. Why? - The alternator belt has
broken. (third why)
4. Why? - The alternator belt was well
beyond its useful service life and
has never been replaced. (fourth
why)
5. Why? - I have not been maintaining
my car according to the
recommended service schedule.
(fifth why, root cause)
Sakichi Toyoda
(豊田 佐吉 Toyoda Sakichi,
February 14, 1867 –
October 30, 1930)
5 Whys continued
• Why 5 Questions?
– Nothing magic about the number 5
– After about 5 it can get absurd or go out of scope
– Do we have control over this cause?
– Will eliminating this cause prevent recurrence?
• Shortcoming of Procedure
– Oversimplifies cause and effect relationships
• Multiple causal and contributing factors
• Confluence of events
– Not a structured method for effective investigations
• Other methods help identify possible factors
• Fundamental idea underlying all RCAs:
Cause → Effect
Ishikawa Diagram Method
Also named: “Fish-Bone” Diagram
• Can come at any point in the process
• Helps direct activities
• Brainstorming tool
• Followed by data collection, verification, tests, etc.
Tague, Nancy R. The Quality Toolbox, Second Edition, ASQ Quality Press, 2004, pages 247-249
Ishikawa diagrams
The 6 “M”s
1. Machine,
2. Method,
3. Materials,
4. Maintenance,
5. Man and
6. Mother Nature
(Environment)
The 8 “P”s
1. Price,
2. Promotion,
3. People,
4. Processes,
5. Place / Plant,
6. Policies,
7. Procedures, and
8. Product (or
Service)
The 4 “S”s
1. Surroundings,
2. Suppliers,
3. Systems,
4. Skills
Failure Model
• The level at which
any root cause
should be identified is
the level at which it is
possible to identify an
appropriate failure
management policy
“8 Disciplines” or “8D”
• The 8 Disciplines
1. Use Team Approach
2. Describe the Problem
3. Implement and Verify Short-Term Corrective Actions
4. Define and Verify Root Causes
5. Verify Corrective Actions
6. Implement Permanent Corrective Actions
7. Prevent Recurrence
8. Congratulate Your Team
• Other tools can be incorporated into the
steps of an 8D
Kepner-Tregoe (KT) analysis
Pioneered in early 1960’s
USAF and NASA
“built on the premise that people can be taught
to think critically”
• Invite someone from a different area as a
“fresh set of eyes”
– “Could you please explain…?”
– “How do you know…?”
– “Do you have any data to show that…?”
What is acceptable?
What do you expect?
Everything fails …
If you push it hard enough
If you run it long enough
If it gets hot enough
Etc.
It will fail.
• Also, “failure probability distribution” –
Answers simultaneously
– “How many as a portion of the
population?” and
– “At what point in their life (age, cycles,
etc.)?”
• You need good data to answer these
questions
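As a rough illustration of how one distribution answers both questions, the sketch below assumes a two-parameter Weibull model (the family used later in the pump case). The function names and the parameter values `beta` and `eta` are hypothetical and chosen only for the example.

```python
import math


def weibull_unreliability(t: float, beta: float, eta: float) -> float:
    """Fraction of the population expected to have failed by age t: F(t) = 1 - exp(-(t/eta)^beta)."""
    if t < 0 or beta <= 0 or eta <= 0:
        raise ValueError("t must be >= 0 and beta, eta must be > 0")
    return 1.0 - math.exp(-((t / eta) ** beta))


def age_at_fraction_failed(p: float, beta: float, eta: float) -> float:
    """Age by which a fraction p of the population is expected to have failed (inverse of F)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return eta * (-math.log(1.0 - p)) ** (1.0 / beta)


# Hypothetical parameters: shape beta = 2.0 (wear-out), scale eta = 100 months.
beta, eta = 2.0, 100.0
print("How many have failed by 24 months?", weibull_unreliability(24.0, beta, eta))
print("At what age have 10% failed?", age_at_fraction_failed(0.10, beta, eta))
```

The parameters themselves have to be fitted from failure and survivor records, which is why the slide stresses good data.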
Statistical Analysis
• Important to understand…
– Coincidence
– Correlation
– Cause
• Tools…
– Design of Experiments (DOE)
– Analysis of Variance (ANOVA)
– Correlation analyses
– Hypothesis testing
“Smoking is one of
the leading causes
of statistics.”
-- Fletcher Knebel
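To make the coincidence/correlation distinction concrete, here is a small sketch that computes a Pearson correlation and then uses a permutation test to ask how often a correlation that strong would appear by chance. The data (monthly ambient temperature against failure count) are invented for illustration, and even a low p-value here would indicate correlation only, not cause.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant sample has no defined correlation")
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(xs, ys, trials=2000, seed=0):
    """Share of random re-pairings whose |r| is at least as large as the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)


# Hypothetical data: mean ambient temperature (deg C) and failures per month.
temperature = [12, 15, 19, 23, 27, 30, 31, 28, 22, 17]
failures = [3, 2, 4, 5, 7, 8, 9, 6, 5, 3]

print("r =", pearson_r(temperature, failures))
print("p =", permutation_p_value(temperature, failures))
```

Establishing cause would still need a controlled comparison such as a designed experiment, which is where DOE and ANOVA come in.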
Selecting and prioritizing actions
• Requires some knowledge of probability of
occurrence - Data
Risk matrix (cell values are risk ranks)

Probability | Probability range (From / To) | Definition | SEVERITY: Catastrophic | Critical | Marginal | Negligible
Frequent | ~1 / 8 x 10^-2 | Likely to occur frequently | 1 | 3 | 6 | 10
Probable | 8 x 10^-2 / 8 x 10^-3 | Will occur several times in life of an item | 2 | 5 | 9 | 14
Occasional | 8 x 10^-3 / 8 x 10^-4 | Likely to occur sometime in life of an item | 4 | 8 | 13 | 17
Remote | 8 x 10^-4 / 8 x 10^-5 | Unlikely but possible to occur in the life of an item | 7 | 12 | 16 | 19
Improbable | 8 x 10^-5 / ~0 | So unlikely it may be assumed that it won't occur | 11 | 15 | 18 | 20

Actions by risk rank

Rank | Customer Notification | Containment | Corrective Action
1~5 | Immediate | Restrict field use. Purge existing stock. | Complete field retrofit as quickly as possible.
6~10 | Immediate | Warn customer to avoid conditions leading to the failure. Hold shipments till design change is incorporated. | Complete paced field retrofit at earliest opportunity.
11~15 | Service Bulletin | No containment required | Change design, offer upgrade to customer.
16~20 | Revision notes | No containment required | Change design at next opportunity, or correct the problem in the next generation product.
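The two tables can be read as a pair of lookups: probability and severity give a rank, and the rank gives the response. The sketch below encodes the values from the slide. The function names are hypothetical, and treating a probability exactly on a boundary as belonging to the higher band is an assumption, since the slide does not say.

```python
SEVERITIES = ("Catastrophic", "Critical", "Marginal", "Negligible")

# (lower bound of the probability band, band name, ranks in SEVERITIES order)
PROBABILITY_BANDS = [
    (8e-2, "Frequent", (1, 3, 6, 10)),
    (8e-3, "Probable", (2, 5, 9, 14)),
    (8e-4, "Occasional", (4, 8, 13, 17)),
    (8e-5, "Remote", (7, 12, 16, 19)),
    (0.0, "Improbable", (11, 15, 18, 20)),
]

# (highest rank covered, customer notification, containment, corrective action)
RESPONSES = [
    (5, "Immediate", "Restrict field use. Purge existing stock.",
     "Complete field retrofit as quickly as possible."),
    (10, "Immediate",
     "Warn customer to avoid conditions leading to the failure. "
     "Hold shipments till design change is incorporated.",
     "Complete paced field retrofit at earliest opportunity."),
    (15, "Service Bulletin", "No containment required",
     "Change design, offer upgrade to customer."),
    (20, "Revision notes", "No containment required",
     "Change design at next opportunity, or correct the problem "
     "in the next generation product."),
]


def risk_rank(probability: float, severity: str):
    """Return (band name, rank) for a probability of occurrence and a severity class."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    column = SEVERITIES.index(severity)  # raises ValueError for an unknown severity
    for lower_bound, band, ranks in PROBABILITY_BANDS:
        if probability >= lower_bound:  # boundary values go to the higher band (assumption)
            return band, ranks[column]
    raise AssertionError("unreachable: the last band starts at 0")


def response_for(rank: int):
    """Return (notification, containment, corrective action) for a rank from 1 to 20."""
    if rank < 1:
        raise ValueError("rank must be at least 1")
    for highest_rank, notification, containment, action in RESPONSES:
        if rank <= highest_rank:
            return notification, containment, action
    raise ValueError("rank must be at most 20")


band, rank = risk_rank(1e-3, "Marginal")
print(band, rank, response_for(rank)[0])
```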
Selecting and prioritizing actions
• FMEA: Failure Modes and Effects Analysis
• Requires some knowledge of probabilities
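The slide does not say how FMEA results are scored. One widely used convention, assumed here, rates each failure mode from 1 to 10 for severity, occurrence and detection and multiplies the three into a Risk Priority Number (RPN). The failure modes and scores below are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int    # 1 (negligible) to 10 (catastrophic)
    occurrence: int  # 1 (improbable) to 10 (frequent); this is where probability data is needed
    detection: int   # 1 (almost certain to be detected) to 10 (undetectable)

    def rpn(self) -> int:
        """Risk Priority Number under the assumed severity x occurrence x detection convention."""
        for score in (self.severity, self.occurrence, self.detection):
            if not 1 <= score <= 10:
                raise ValueError("scores must be between 1 and 10")
        return self.severity * self.occurrence * self.detection


def prioritize(modes):
    """Sort failure modes from highest to lowest RPN."""
    return sorted(modes, key=lambda mode: mode.rpn(), reverse=True)


# Hypothetical entries, not taken from the case histories.
modes = [
    FailureMode("Seal leakage", severity=5, occurrence=6, detection=3),
    FailureMode("Gear tooth fracture", severity=9, occurrence=2, detection=7),
    FailureMode("Sensor connector fatigue", severity=4, occurrence=4, detection=4),
]

for mode in prioritize(modes):
    print(mode.rpn(), mode.name)
```

The occurrence score is only as good as the failure data behind it, which is the point the slide makes about probabilities.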
Keys for Success
• You aren’t the expert
– Challenge everything
– Speak with data, act on fact
• Have the data – and use it
• Don’t let motivations drive conclusions
• Resources
– Always resource-constrained
– Depends on risk and criticality
• Finish the job - verification
“In theory, there is no difference
between theory and practice; In
practice, there is.”
-- Chuck Reid
Boeing C-17 landing Gear
• Brake Sensors designed for and subjected
to 600 hour durability test, vibration and
thermal as specification requirement.
• A couple of redesigns already
– Identified location and failure mechanism
– Made it more robust both times
• Discarding 10-13 sensors per month
• Problem: Solve the high failure rate.
• Discarding 10-13 sensors per month
• A couple of redesigns already
• Problem: Solve the high failure rate.
• Although each redesign had made the sensor
stronger, there was never clear definition of the
requirement
• Initial problem was an inadequate specification
• Most of the sensors currently being discarded had
not failed
• Swaptronics…Resolution: Improve troubleshooting
Blue Screen of Death BSOD
BSOD continued…Software Errors
Key take-aways
• Conclusion
– Micro-bubbles forming on the control computer disk
drives
– Only happens if the computer is left on all the time
– Corrective action was to turn off the computers and
restart them once every 24 hours
• Not a true corrective action
• Lessons for RCFA
– Took about 18 months from initiation of activity to report
– Dedicated and determined engineer
Conveyor Drive Failures
• High failure Rate
– Motors Tripping
– Gearbox
Failures
• Solve the high
failure rate
Conveyor Failures continued…
• Problem definition
– The corrective action team determined that the
failures were generally of two types,
1. Premature wear out consistent with long term,
slightly elevated loading, and
2. Failures consistent with transient torque
overloads.
– One side has a higher failure rate than the other
– Load?
Strain gauges
applied at the
couplings on both
conveyors
• Set up a remote data
acquisition system
(WebDaq)
• Began gathering long-term
data
– About 8 days of continuous
data
– Then about 137 hours of
intermittent (triggered) data
Cutouts for strain gauges.
Strain gauge locations
Normal operation Overload event
Overload event (zoom)
• Frequency indicates
coupling slipping
Conveyor Failures conclusion
• Life difference between drives is
normal wear-out due to higher
load during normal operation
• Premature failures due to
overload events…
– “Clamping” of the belts due to
programming errors in control
system
• Latent causes not addressed…
– Development, installation and
run-off process that permitted
the programming errors
– Process that failed to catch
the errors
• Fundamental Principles /
Lessons Learned
for Root Cause Failure
Analysis…
– Devoted adequate resources
– Did not do a design change
based on initial “apparent”
cause
– Problem definition / Data
collection
– Time commitment
• 10 Months from
identification of failure for
RCFA to final report
Mobile Hydraulic Truck Pumps Leakage
• Problem
– Reported substantial increase
in failure rate due to leakage
– Initial conclusion (assumed)
faulty pump
– Initiated a campaign to
replace all the pumps
• Very good data
– Extensive details on every
failure
• Model, serial number,
application, hours in service,
calendar time in service…
Mobile Hydraulic Truck Pumps Continued…
• Established 13 year
timeline showing
entire history of
design and
application
• Reviewed detailed
removal history and
failure probability
distributions
• Identified 2 different
failure modes…
[Figure: Weibull probability plot (ReliaSoft's Weibull++ 6.0 - www.Weibull.com). Unreliability, F(t) vs. Time in Service (months). Data set: AllParts_Months_Warr; W5 MLE - SRM MED; F=24896 / S=334701; CB[FM]@90.00%, 1-Sided-U [T1].]
b[1]=2.2600, h[1]=16.5572, R[1]=0.0921 ; b[2]=1.8892, h[2]=163.0341, R[2]=0.9079
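The two parameter sets reported on the plot legend can be combined into a single unreliability curve. The sketch below assumes that `b` is the Weibull shape, `h` the scale in months, and `R` the portion of the population in each subpopulation (the two R values sum to 1), and that the overall curve is the portion-weighted sum of the two Weibull curves. That reading of the legend is an assumption, not something the slide states.

```python
import math

# (portion R, shape b, scale h in months), copied from the plot legend
PUMP_SUBPOPULATIONS = [
    (0.0921, 2.2600, 16.5572),
    (0.9079, 1.8892, 163.0341),
]


def mixed_weibull_unreliability(t_months: float, subpopulations=PUMP_SUBPOPULATIONS) -> float:
    """F(t) = sum over subpopulations of R_i * (1 - exp(-(t / h_i) ** b_i))."""
    if t_months < 0:
        raise ValueError("time in service must be non-negative")
    total = 0.0
    for portion, shape, scale in subpopulations:
        total += portion * (1.0 - math.exp(-((t_months / scale) ** shape)))
    return total


for months in (6, 12, 24, 60, 120):
    print(months, round(mixed_weibull_unreliability(months), 4))
```

Under this reading, the small first subpopulation (about 9% of units, scale near 17 months) accounts for most of the early failures, while the remainder wears out on a much longer time scale.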
[Figure: Weibull probability plot by subpopulation (ReliaSoft's Weibull++ 6.0 - www.Weibull.com). Unreliability, F(t) vs. Time in Service (months). All fits: W5 MLE - SRM MED; CB[FM]@90.00%, 1-Sided-U [T1].]
16000K: F=1752 / S=9223; b1[1]=1.8996, h1[1]=11.4709, R1[1]=0.1469 ; b1[2]=1.7402, h1[2]=133.6546, R1[2]=0.8531
16000L: F=120 / S=209; b2[1]=1.8024, h2[1]=9.8963, R2[1]=0.3433 ; b2[2]=1.5343, h2[2]=116.0825, R2[2]=0.6567
16000M: F=13527 / S=89454; b3[1]=2.2610, h3[1]=14.0255, R3[1]=0.0997 ; b3[2]=1.8634, h3[2]=121.8553, R3[2]=0.9003
16000N: F=264 / S=9106; b4[1]=1.9221, h4[1]=12.1015, R4[1]=0.0298 ; b4[2]=4.4799, h4[2]=91.3611, R4[2]=0.9702
AllOthers: F=648 / S=42756; b5[1]=1.6788, h5[1]=12.0507, R5[1]=0.0137 ; b5[2]=1.7057, h5[2]=612.6454, R5[2]=0.9863
Further analysis
permitted us to
isolate and
identify
subpopulations
with distinctly
different failure
distributions
Mobile Hydraulic Truck Pumps Continued…
• Truck test results
– 1. The highest acceleration
levels are always associated
with rapid pressure drops,
|dP/dt| about 1800 bar per
second or greater.
– 2. Pressure drops (|dP/dt|)
on Truck 2 were on average a
little greater than on Truck 1, but they
never resulted in the impact
signature.
– 3. |dP/dt| >= about 1800 bar
per second ALWAYS results
in an impact signature on
Truck 1
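A check like the one behind these results can be sketched as a finite-difference pass over logged pressure samples, flagging every pressure drop whose rate reaches the threshold. The function name, the sampling rate and the sample values are invented for illustration; only the figure of about 1800 bar per second comes from the slide.

```python
def rapid_pressure_drops(times_s, pressures_bar, threshold_bar_per_s=1800.0):
    """Return (time, dP/dt) for each sample interval where pressure falls at or above the threshold rate."""
    if len(times_s) != len(pressures_bar):
        raise ValueError("times and pressures must have the same length")
    events = []
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        if dt <= 0:
            raise ValueError("timestamps must be strictly increasing")
        rate = (pressures_bar[i] - pressures_bar[i - 1]) / dt
        # Only drops count: the rate must be negative and large in magnitude.
        if rate < 0 and abs(rate) >= threshold_bar_per_s:
            events.append((times_s[i], rate))
    return events


# Hypothetical 100 Hz log: one 19 bar drop within a single 10 ms interval.
times = [0.00, 0.01, 0.02, 0.03]
pressures = [300.0, 299.0, 280.0, 279.0]

for time_s, rate in rapid_pressure_drops(times, pressures):
    print(f"t = {time_s:.2f} s, dP/dt = {rate:.0f} bar/s")
```

As finding 2 shows, the pressure rate alone does not predict an impact on every truck, so flagged events would still need to be matched against the measured acceleration signature.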
Mobile Hydraulic Truck Pumps Continued…
• Have the data
• Statistical tools
• Resources
– About 1 year
Acme* Gearbox - Background
• 3-stage, 1800 kW
gearbox driving a rock
crusher
• Late in the evening there
was a vibration alarm
• Alarm was “not unusual”,
they continued operating
• Early the next morning
there was a loud noise,
and shutdown for
vibration
*Some details have been changed
Background continued…
• Over the next few days they replaced the
gearbox with a spare
• Vendor was consulted. They “knew
exactly what went wrong”
• Insurance company requested an
independent Root Cause Failure Analysis
Complications
• “Independent”
– Implies limited cooperation between experts
• People who designed and built the equipment
• People who maintained and operated equipment
– Don’t take everything at face value
• Consider everyone’s motivations
• There are vested interests in different possible
conclusions
• Limited access to the hardware
– Resources
Investigation
What did the
people do?
Why did they do it?
(systems, procedures, motivations)
This is where you usually find the “root” cause
Investigation
Induced
• Application
• Environment
• “You broke it”
(vendor)
Inherent
• Design
• Materials
• “It broke”
(user)
Answers the question “which humans?”
Data
“Describe the problem” from 8D form
• Loading, both before the incident and
historically
• Equipment design, ratings (what was it
expected to do?)
• Maintenance history
• Vibration analyses / reports
Oil Contaminant report…
Vibration
• Requested source data, FFT parameters, etc.
(monthly checks… one year history)
(motor bearing)
Vibration (source data)
[Figure: Loading over a 96-hour window (0:00 to 96:00), with traces for the time of failure, 1 year earlier, and 2 years earlier.]
Power – 30 days leading up to failure
Power – 30 days 1 year earlier
[Figure: event timeline]
• Gearbox is rebuilt
• Maintenance crew fixes an oil leak (shown repeatedly; oil leaks repaired “many” times, most undocumented)
• Internal winding failure; motor is replaced (-2 days)
• Maintenance crew fixes an oil leak (-2 days)
• Tripped due to vibration alarm (-2 hours)
• Supervisor decides to continue running (-2 hours)
• Plant shuts down due to vibration alarm (date and time)
Interviews – the picture that emerges
• 2 days prior – high speed
shaft was not properly drawn
up to engage the pinion
– Crew did not have specs or
manuals
– No one knew where they
were
• Oil leaks had been repaired
“many times” since rebuild
• Could have been improperly
reinstalled any of those times
• Prior to failure, crews heard
“Rumble” typical of loading too
much material (common
occurrence); Overloading.
• Other crews described the
proper procedure, “tribal
knowledge”
• Maintenance records were
incomplete
• Vendor reported no
apparent problems when
new motor was installed
• Control room vibration
monitoring was not helpful
• Alarms occurred “all the
time” with no action taken
• There were indications a
failure was imminent
Remaining questions:
• Was damage accumulating
over time?
• Were there material or
design contributors?
Metallurgical report
• Two contact patterns…
– “Frosting” below the pitch line, indicating a period of
normal wear
– Obvious indications of wear near tooth tips
• Bearings indicated a severe misalignment
• Nothing anomalous in material properties
(hardness, case depth, chemical and
microstructure)
• Failure was due to low cycle fatigue prior to
overload
Root Cause Conclusions
• Induced failure due to
– improper maintenance,
resulting in low cycle fatigue
then overload
– High loads due to material
overloading were a likely
contributor
• Latent factors:
– Poor cooperation with
supplier(s)
– Inadequate documentation and
equipment specific training
– Ineffective warning system and
propensity to ignore warnings
• Proposed corrective actions
– Acquire up-to-date
specifications, documentation
and maintenance procedures
for critical equipment
– Ensure equipment specific
training for maintenance
personnel
– Review adequacy of alarm
system to ensure warnings are
adequate and meaningful
– Define appropriate responses
– Instill a culture that expects
response and action
Conclusions; or if you remember nothing else about
root cause analysis, remember this:
• Do it. RCA is the engine that drives continuous improvement.
• Have the data
– Keep good records, not just of failures but of
• All maintenance actions
• When did it begin service? … end?
• Operating conditions
– If you don’t have a good CMMS, get one.
– If you do (or when you do), USE IT
• Resources. Have the right
– People
– Training, and
– Tools.
The last word…
[Figure: humorous “Problem Solving Flow Chart”. Decision boxes, each with YES / NO branches: “Is It Working?”, “Did You Mess With It?”, “Does Anyone Know?”, “Are You In Hot Water?”, “Can You Blame Someone Else?”. Outcome boxes: “Don’t Mess With It!”, “YOU POOR FOOL!”, “Hide It”, “Throw Away The Evidence”, “TOO BAD!”, “NO PROBLEM!”.]

More Related Content

PDF
The six big losses - lean tool
PDF
KAIZEN: A Lean Manufacturing Technique
PPT
Managerial and Technical skills of supervisors
PPT
BUS 51 - Mosley7e ch14
PDF
Optimizing Petroleum Refining Unit Operations Training
PDF
Managing productivity
PPT
How to Introduce Operational Excellence in your Organisation?
The six big losses - lean tool
KAIZEN: A Lean Manufacturing Technique
Managerial and Technical skills of supervisors
BUS 51 - Mosley7e ch14
Optimizing Petroleum Refining Unit Operations Training
Managing productivity
How to Introduce Operational Excellence in your Organisation?

What's hot (18)

PPTX
Common ways to avoid the most frequent GMP errors
DOC
Employee productivity
PPTX
Value Chain Analysis, MUDA, Poke Yoke and Kaizen
PDF
Six Sigma Case Cart Project Final Report Jan. 2011
PPTX
Operations_Excellence_Presentation_Promotional
PPTX
Lean System
PPTX
CASE STUDY ON IMPLEMENTATION OF KAIZEN AND 5S TECHNIQUES IN SMALL MANUFACTURI...
PDF
The+complete+guide+to+simple+oee
PDF
Cedac ptg
PPT
Optimizing Sterile Processing Workflow
PPT
OMG: Preventive Maintenance 2015
PDF
IRJET- Application of Continuous Improvement Process in Manufacturing Industry
DOC
Combating entropy in business
DOC
Six sigma project report
PPTX
Where Does The Time Go?
PPT
Total productive maintenance
PDF
Oei building operational excellence in petroleum refining
PDF
Yellow belt process improvement training and certification module
Common ways to avoid the most frequent GMP errors
Employee productivity
Value Chain Analysis, MUDA, Poke Yoke and Kaizen
Six Sigma Case Cart Project Final Report Jan. 2011
Operations_Excellence_Presentation_Promotional
Lean System
CASE STUDY ON IMPLEMENTATION OF KAIZEN AND 5S TECHNIQUES IN SMALL MANUFACTURI...
The+complete+guide+to+simple+oee
Cedac ptg
Optimizing Sterile Processing Workflow
OMG: Preventive Maintenance 2015
IRJET- Application of Continuous Improvement Process in Manufacturing Industry
Combating entropy in business
Six sigma project report
Where Does The Time Go?
Total productive maintenance
Oei building operational excellence in petroleum refining
Yellow belt process improvement training and certification module
Ad

Similar to Root Cause Failure Analysis by Eugene Cottle-Lifecycle Engineering (20)

PPT
The challenges facing in pharmaceutical maintenance
DOCX
Om0015 – maintenance management
PDF
Multi criteria Decision model (MCDM) for the evaluation of maintenance practi...
PDF
Total Productive Maintenance - A Systematic Review
PPTX
[Pem Zhipeng Xie] project management: lean six sigma
PDF
Chapter 11_ The role of quality in performance management.pdf
PDF
IRJET - Implementation of TPM Philosophy on Critical Paint Shop Machine
PDF
Guidelines for Safe Pre-commissioning, Commissioning, and Operation of Proces...
PPT
PDF
Lean Production for Competitive Advantage A Comprehensive Guide to Lean Metho...
PDF
Lean Production for Competitive Advantage A Comprehensive Guide to Lean Metho...
PPT
Chapter-4-Motion_and_Time_Study in industry.ppt
DOCX
On concept of Total Productive Maintenance
PDF
Lean Production for Competitive Advantage A Comprehensive Guide to Lean Metho...
DOCX
6 sigma assignment
PDF
Lean Production for Competitive Advantage A Comprehensive Guide to Lean Metho...
PPTX
TPM - Total Productive Maintenance
PPT
Total productive maintenance
PPTX
Six sigma in automobile Industry
PPTX
CLT ver1.00
The challenges facing in pharmaceutical maintenance
Om0015 – maintenance management
Multi criteria Decision model (MCDM) for the evaluation of maintenance practi...
Total Productive Maintenance - A Systematic Review
[Pem Zhipeng Xie] project management: lean six sigma
Chapter 11_ The role of quality in performance management.pdf
IRJET - Implementation of TPM Philosophy on Critical Paint Shop Machine
Guidelines for Safe Pre-commissioning, Commissioning, and Operation of Proces...
Lean Production for Competitive Advantage A Comprehensive Guide to Lean Metho...
Lean Production for Competitive Advantage A Comprehensive Guide to Lean Metho...
Chapter-4-Motion_and_Time_Study in industry.ppt
On concept of Total Productive Maintenance
Lean Production for Competitive Advantage A Comprehensive Guide to Lean Metho...
6 sigma assignment
Lean Production for Competitive Advantage A Comprehensive Guide to Lean Metho...
TPM - Total Productive Maintenance
Total productive maintenance
Six sigma in automobile Industry
CLT ver1.00
Ad

More from Abdulrahman Alkhowaiter (12)

PDF
Centrifugal Pump Bearings
PDF
RCFA Success of Root Cause Failure Analysis
PDF
Creativity in Science and Engineering by Martin Perl
PDF
How to draw mechanical parts correctly
PDF
Specialty reciprocating-rod-pump-improves-reliability-in-sand-laden-oil-wells...
PDF
Elimination voc-emissions-reciprocating-pump-stuffing-box
PDF
Plunger Pumps Pulsation Dampener Designs
PDF
Root Cause Failure Analysis Methods for Pump Failures
PDF
API 610 Centrifugal Pump Repairs By Clyde Union Pump Company
PDF
Reciprocating Pumps Texas A-M Article
PDF
Timoshenko: Vibration Problems in Engineering E-book Part-1
PDF
Timoshenko: Vibration Problems in Engineering E-book Part-2
Centrifugal Pump Bearings
RCFA Success of Root Cause Failure Analysis
Creativity in Science and Engineering by Martin Perl
How to draw mechanical parts correctly
Specialty reciprocating-rod-pump-improves-reliability-in-sand-laden-oil-wells...
Elimination voc-emissions-reciprocating-pump-stuffing-box
Plunger Pumps Pulsation Dampener Designs
Root Cause Failure Analysis Methods for Pump Failures
API 610 Centrifugal Pump Repairs By Clyde Union Pump Company
Reciprocating Pumps Texas A-M Article
Timoshenko: Vibration Problems in Engineering E-book Part-1
Timoshenko: Vibration Problems in Engineering E-book Part-2

Recently uploaded (20)

PDF
Well-logging-methods_new................
PDF
BIO-INSPIRED HORMONAL MODULATION AND ADAPTIVE ORCHESTRATION IN S-AI-GPT
PPTX
CYBER-CRIMES AND SECURITY A guide to understanding
PPTX
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
PDF
Unit I ESSENTIAL OF DIGITAL MARKETING.pdf
PPT
Project quality management in manufacturing
PPTX
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
PDF
Artificial Superintelligence (ASI) Alliance Vision Paper.pdf
PPTX
Foundation to blockchain - A guide to Blockchain Tech
PDF
Human-AI Collaboration: Balancing Agentic AI and Autonomy in Hybrid Systems
PDF
R24 SURVEYING LAB MANUAL for civil enggi
PDF
Embodied AI: Ushering in the Next Era of Intelligent Systems
PPT
Mechanical Engineering MATERIALS Selection
PPTX
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
PDF
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
PPTX
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
PPTX
Fundamentals of safety and accident prevention -final (1).pptx
DOCX
573137875-Attendance-Management-System-original
PDF
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
PPTX
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
Well-logging-methods_new................
BIO-INSPIRED HORMONAL MODULATION AND ADAPTIVE ORCHESTRATION IN S-AI-GPT
CYBER-CRIMES AND SECURITY A guide to understanding
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
Unit I ESSENTIAL OF DIGITAL MARKETING.pdf
Project quality management in manufacturing
Infosys Presentation by1.Riyan Bagwan 2.Samadhan Naiknavare 3.Gaurav Shinde 4...
Artificial Superintelligence (ASI) Alliance Vision Paper.pdf
Foundation to blockchain - A guide to Blockchain Tech
Human-AI Collaboration: Balancing Agentic AI and Autonomy in Hybrid Systems
R24 SURVEYING LAB MANUAL for civil enggi
Embodied AI: Ushering in the Next Era of Intelligent Systems
Mechanical Engineering MATERIALS Selection
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
Fundamentals of safety and accident prevention -final (1).pptx
573137875-Attendance-Management-System-original
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx

Root Cause Failure Analysis by Eugene Cottle-Lifecycle Engineering

  • 1. Root Cause Failure Analysis with Case Histories Eugene T. Cottle Reliability Engineer Life Cycle Engineering
  • 2. 2© Life Cycle Engineering 2008 A confluence of events, factors and conditions which conspire to produce an (undesirable) outcome. Event(s), factor(s) or condition(s) which are under your control and which, if corrected or eliminated, will prevent recurrence of the undesirable outcome.
  • 3. More Terminology… • RCA: Root Cause Analysis – A disciplined process for focusing ideas to identify root causes. A class of problem-solving methods • RCFA: Root Cause Failure Analysis – Reactive, in response to a failure • RCCA: Root Cause and Corrective Action – Incorporates preventive corrective action into the process (i.e., elimination of special causes)
  • 4. Root Cause Analysis • Safety-based RCA – accident analysis and – occupational safety and health • Production-based RCA – quality control for industrial manufacturing • Process-based RCA – Expanded scope to include business processes • Failure-based RCA – Based on failure analysis – employed in engineering and maintenance • Systems-based RCA – amalgamation of all the others, and includes • change management, • risk management, and • systems analysis
  • 5. Objectives • Prevent Recurrence • Responsibility – “Hand-off” the investigation • Begins with an assumption of “cause” – Liability – Blame
  • 6. Deming’s 14 points
    1. Create constancy of purpose toward improvement of product and service, with the aim to become competitive and stay in business, and to provide jobs.
    2. Adopt the new philosophy. We are in a new economic age. Western management must awaken to the challenge, must learn their responsibilities, and take on leadership for change.
    3. Cease dependence on inspection to achieve quality. Eliminate the need for inspection on a mass basis by building quality into the product in the first place.
    4. End the practice of awarding business on the basis of price tag. Instead, minimize total cost. Move towards a single supplier for any one item, on a long-term relationship of loyalty and trust.
    5. Improve constantly and forever the system of production and service, to improve quality and productivity, and thus constantly decrease cost.
    6. Institute training on the job.
    7. Institute leadership (see Point 12 and Ch. 8 of "Out of the Crisis"). The aim of supervision should be to help people and machines and gadgets to do a better job. Supervision of management is in need of overhaul, as well as supervision of production workers.
    8. Drive out fear, so that everyone may work effectively for the company. (See Ch. 3 of "Out of the Crisis")
    9. Break down barriers between departments. People in research, design, sales, and production must work as a team, to foresee problems of production and in use that may be encountered with the product or service.
    10. Eliminate slogans, exhortations, and targets for the work force asking for zero defects and new levels of productivity. Such exhortations only create adversarial relationships, as the bulk of the causes of low quality and low productivity belong to the system and thus lie beyond the power of the work force.
    11. a. Eliminate work standards (quotas) on the factory floor. Substitute leadership. b. Eliminate management by objective. Eliminate management by numbers, numerical goals. Substitute workmanship.
    12. a. Remove barriers that rob the hourly worker of his right to pride of workmanship. The responsibility of supervisors must be changed from sheer numbers to quality. b. Remove barriers that rob people in management and in engineering of their right to pride of workmanship. This means, inter alia, abolishment of the annual or merit rating and of management by objective (See Ch. 3 of "Out of the Crisis").
    13. Institute a vigorous program of education and self-improvement.
    14. Put everyone in the company to work to accomplish the transformation. The transformation is everyone's work.
  • 8. 8© Life Cycle Engineering 2008
  • 9. 5 Whys Method: Car not Starting
    1. Why? - The battery is dead. (first why)
    2. Why? - The alternator is not functioning. (second why)
    3. Why? - The alternator belt has broken. (third why)
    4. Why? - The alternator belt was well beyond its useful service life and has never been replaced. (fourth why)
    5. Why? - I have not been maintaining my car according to the recommended service schedule. (fifth why, root cause)
    Sakichi Toyoda (豊田 佐吉 Toyoda Sakichi, February 14, 1867 – October 30, 1930)
  • 10. 5 Whys continued • Why 5 Questions? – Nothing magic about the number 5 – After about 5 it can get absurd or go out of scope – Do we have control over this cause? – Will eliminating this cause prevent recurrence? • Shortcomings of the procedure – Oversimplifies cause and effect relationships • Multiple causal and contributing factors • Confluence of events – Not a structured method for effective investigations • Other methods help identify possible factors • Fundamental idea underlying all RCAs: cause and effect
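The two stopping questions on this slide (is the cause under our control, and will eliminating it prevent recurrence?) can be sketched as a simple walk down the why-chain. This is only an illustrative sketch: the `Cause` class, the `find_root_cause` function and the yes/no flags assigned to each answer are assumptions made for the car example, not part of the original deck.

```python
from dataclasses import dataclass


@dataclass
class Cause:
    description: str
    under_our_control: bool        # "Do we have control over this cause?"
    fix_prevents_recurrence: bool  # "Will eliminating this cause prevent recurrence?"


def find_root_cause(chain, max_whys=5):
    """Return (depth, cause) for the first cause that passes both tests.

    Returns None if no cause within max_whys qualifies, which suggests the
    chain has gone out of scope or needs a more structured method.
    """
    for depth, cause in enumerate(chain[:max_whys], start=1):
        if cause.under_our_control and cause.fix_prevents_recurrence:
            return depth, cause
    return None


# Hypothetical flags for the "car not starting" example: fixing the first
# four causes restores the car but does not stop the problem coming back.
car_not_starting = [
    Cause("The battery is dead.", True, False),
    Cause("The alternator is not functioning.", True, False),
    Cause("The alternator belt has broken.", True, False),
    Cause("The belt was beyond its service life and never replaced.", True, False),
    Cause("The car was not maintained per the recommended service schedule.", True, True),
]

depth, root = find_root_cause(car_not_starting)
print(f"Root cause found at why #{depth}: {root.description}")
```

The sketch also shows the shortcoming noted on the slide: a single linear chain cannot represent several contributing factors acting together.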
  • 11. Ishikawa Diagram Method Also named: “Fish-Bone” Diagram • Can come at any point in the process • Helps direct activities • Brainstorming tool • Followed by data collection, verification, tests, etc. Nancy R. Tague, The Quality Toolbox, Second Edition, ASQ Quality Press, 2004, pages 247-249
  • 12. Ishikawa diagrams • The 6 “M”s: 1. Machine, 2. Method, 3. Materials, 4. Maintenance, 5. Man and 6. Mother Nature (Environment) • The 8 “P”s: 1. Price, 2. Promotion, 3. People, 4. Processes, 5. Place / Plant, 6. Policies, 7. Procedures, and 8. Product (or Service) • The 4 “S”s: 1. Surroundings, 2. Suppliers, 3. Systems, 4. Skills
  • 13. Failure Model • The level at which any root cause should be identified is the level at which it is possible to identify an appropriate failure management policy.
  • 14. “8 Disciplines” or “8D” • The 8 Disciplines 1. Use Team Approach 2. Describe the Problem 3. Implement and Verify Short-Term Corrective Actions 4. Define and Verify Root Causes 5. Verify Corrective Actions 6. Implement Permanent Corrective Actions 7. Prevent Recurrence 8. Congratulate Your Team • Other tools can be incorporated into the steps of an 8D
  • 15. 15© Life Cycle Engineering 2008
  • 16. Kepner-Tregoe (KT) analysis Pioneered in early 1960s USAF and NASA “built on the premise that people can be taught to think critically” • Invite someone from a different area as a “fresh set of eyes” – “Could you please explain…?” – “How do you know…?” – “Do you have any data to show that…?”
  • 17. What is acceptable? What do you expect? Everything fails… If you push it hard enough, if you run it long enough, if it gets hot enough, etc., it will fail. • Also, “failure probability distribution” – Answers simultaneously – “How many, as a portion of the population?” and – “At what point in their life (age, cycles, etc.)?” • You need good data to answer these questions
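As a hedged illustration of how one failure probability distribution answers both questions at once, assume a two-parameter Weibull model (the distribution family used in the pump plots later in this deck). The shape β and scale η in the worked line are made-up values, not data from the case histories. The unreliability F(t) is the portion of the population expected to have failed by age t:

```latex
F(t) = 1 - \exp\!\left[-\left(\frac{t}{\eta}\right)^{\beta}\right]

% Hypothetical example: \beta = 2, \eta = 100 months, t = 50 months
F(50) = 1 - \exp\!\left[-\left(\tfrac{50}{100}\right)^{2}\right]
      = 1 - e^{-0.25} \approx 0.22
```

Under these assumed parameters, roughly 22% of the population would have failed by 50 months in service.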
  • 18. Statistical Analysis • Important to understand… – Coincidence – Correlation – Cause • Tools… – Design of Experiments (DOE) – Analysis of Variance (ANOVA) – Correlation analyses – Hypothesis testing “Smoking is one of the leading causes of statistics.” -- Fletcher Knebel
  • 19. Selecting and prioritizing actions • Requires some knowledge of probability of occurrence - Data

    Risk index by probability and severity:

    | Probability | Probability range (from, to) | Definition | Catastrophic | Critical | Marginal | Negligible |
    |---|---|---|---|---|---|---|
    | Frequent | ~1 to 8 x 10^-2 | Likely to occur frequently | 1 | 3 | 6 | 10 |
    | Probable | 8 x 10^-2 to 8 x 10^-3 | Will occur several times in life of an item | 2 | 5 | 9 | 14 |
    | Occasional | 8 x 10^-3 to 8 x 10^-4 | Likely to occur sometime in life of an item | 4 | 8 | 13 | 17 |
    | Remote | 8 x 10^-4 to 8 x 10^-5 | Unlikely but possible to occur in the life of an item | 7 | 12 | 16 | 19 |
    | Improbable | 8 x 10^-5 to ~0 | So unlikely it may be assumed that it won't occur | 11 | 15 | 18 | 20 |

    Response by risk index:

    | Risk index | Customer Notification | Containment | Corrective Action |
    |---|---|---|---|
    | 1~5 | Immediate | Restrict field use. Purge existing stock. | Complete field retrofit as quickly as possible. |
    | 6~10 | Immediate | Warn customer to avoid conditions leading to the failure. Hold shipments till design change is incorporated. | Complete paced field retrofit at earliest opportunity. |
    | 11~15 | Service Bulletin | No containment required | Change design, offer upgrade to customer. |
    | 16~20 | Revision notes | No containment required | Change design at next opportunity, or correct the problem in the next generation product. |
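The matrix on this slide amounts to a two-step lookup: probability category and severity give a risk index from 1 to 20, and the index band gives the response. A minimal sketch of that lookup follows. The index values and notification levels are transcribed from the slide, while the names `RISK_MATRIX`, `risk_index` and `customer_notification` are hypothetical and chosen only for this illustration.

```python
# Risk index transcribed from the slide: rows are probability categories,
# columns follow SEVERITIES order (index 1 = highest priority).
SEVERITIES = ("Catastrophic", "Critical", "Marginal", "Negligible")
RISK_MATRIX = {
    "Frequent":   (1, 3, 6, 10),
    "Probable":   (2, 5, 9, 14),
    "Occasional": (4, 8, 13, 17),
    "Remote":     (7, 12, 16, 19),
    "Improbable": (11, 15, 18, 20),
}

# Customer-notification level for each band of risk index, per the slide.
NOTIFICATION_BANDS = (
    (range(1, 6), "Immediate"),
    (range(6, 11), "Immediate"),
    (range(11, 16), "Service Bulletin"),
    (range(16, 21), "Revision notes"),
)


def risk_index(probability, severity):
    """Look up the 1-20 risk index for a probability/severity pair."""
    return RISK_MATRIX[probability][SEVERITIES.index(severity)]


def customer_notification(index):
    """Map a risk index to the customer-notification level for its band."""
    for band, level in NOTIFICATION_BANDS:
        if index in band:
            return level
    raise ValueError(f"risk index out of range: {index}")


idx = risk_index("Occasional", "Critical")
print(idx, customer_notification(idx))
```

Choosing the probability category is the step that needs data: without failure counts and exposure, the row of the matrix is a guess.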
  • 20. Selecting and prioritizing actions • FMEA: Failure Modes and Effects Analysis • Requires some knowledge of probabilities
  • 21. Keys for Success • You aren’t the expert – Challenge everything – Speak with data, act on fact • Have the data – and use it • Don’t let motivations drive conclusions • Resources – Always resource-constrained – Depends on risk and criticality • Finish the job - verification
  • 22. “In theory, there is no difference between theory and practice; In practice, there is.” -- Chuck Reid
  • 23. 23© Life Cycle Engineering 2008
  • 24. Boeing C-17 Landing Gear
  • 25. • Brake sensors designed for, and subjected to, a 600-hour durability test (vibration and thermal) as a specification requirement. • A couple of redesigns already – Identified location and failure mechanism – Made it more robust both times • Discarding 10-13 sensors per month • Problem: Solve the high failure rate.
  • 26. • Discarding 10-13 sensors per month • A couple of redesigns already • Problem: Solve the high failure rate. • Although each redesign had made the sensor stronger, there was never a clear definition of the requirement • Initial problem was an inadequate specification • Most of the sensors currently being discarded had not failed • Swaptronics… Resolution: Improve troubleshooting
  • 27. Blue Screen of Death (BSOD)
  • 28. BSOD continued… Software Errors
  • 29. Key take-aways • Conclusion – Micro-bubbles forming on the control computer disk drives – Only happens if the computer is left on all the time – Corrective action was to turn off the computers and restart them once every 24 hours • Not a true corrective action • Lessons for RCFA – Took about 18 months from initiation of activity to report – Dedicated and determined engineer
  • 30. Conveyor Drive Failures • High failure rate – Motors Tripping – Gearbox Failures • Solve the high failure rate
  • 31. Conveyor Failures continued… • Problem definition – The corrective action team determined that the failures were generally of two types: 1. Premature wear-out consistent with long-term, slightly elevated loading, and 2. Failures consistent with transient torque overloads. – One side has a higher failure rate than the other – Load?
  • 32. Strain gauges applied at the couplings on both conveyors • Set up a remote data acquisition system (WebDaq) • Began gathering long-term data – About 8 days of continuous data – Then about 137 hours of intermittent (triggered) data [Figure labels: cutouts for strain gauges; strain gauge locations]
  • 33. [Plots: normal operation; overload event; overload event (zoom)] • Frequency indicates coupling slipping
  • 34. Conveyor Failures conclusion • Life difference between drives is normal wear-out due to higher load during normal operation • Premature failures due to overload events… – “Clamping” of the belts due to programming errors in control system • Latent causes not addressed… – Development, installation and run-off process that permitted the programming errors – Process that failed to catch the errors • Fundamental Principles / Lessons Learned for Root Cause Failure Analysis… – Devoted adequate resources – Did not do a design change based on initial “apparent” cause – Problem definition / Data collection – Time commitment • 10 months from identification of failure for RCFA to final report
  • 35. Mobile Hydraulic Truck Pumps Leakage • Problem – Reported substantial increase in failure rate due to leakage – Initial conclusion (assumed): faulty pump – Initiated a campaign to replace all the pumps • Very good data – Extensive details on every failure • Model, serial number, application, hours in service, calendar time in service…
  • 36. Mobile Hydraulic Truck Pumps Continued… • Established 13-year timeline showing entire history of design and application • Reviewed detailed removal history and failure probability distributions • Identified 2 different failure modes… [Plot: Weibull probability plot (ReliaSoft's Weibull++ 6.0, www.Weibull.com); x-axis: Time in Service (months); y-axis: Unreliability, F(t). Data set AllParts_Months_Warr, W5 MLE - SRM MED, F=24896 / S=334701, CB[FM]@90.00% 1-Sided-U [T1]. Fitted parameters: b[1]=2.2600, h[1]=16.5572, R[1]=0.0921; b[2]=1.8892, h[2]=163.0341, R[2]=0.9079]
  • 37. [Plot: Weibull probability plot (ReliaSoft's Weibull++ 6.0, www.Weibull.com); x-axis: Time in Service (months); y-axis: Unreliability, F(t); fits are W5 MLE - SRM MED, CB[FM]@90.00% 1-Sided-U [T1]]

    | Data set | F / S | b[1] | h[1] | R[1] | b[2] | h[2] | R[2] |
    |---|---|---|---|---|---|---|---|
    | 16000K | 1752 / 9223 | 1.8996 | 11.4709 | 0.1469 | 1.7402 | 133.6546 | 0.8531 |
    | 16000L | 120 / 209 | 1.8024 | 9.8963 | 0.3433 | 1.5343 | 116.0825 | 0.6567 |
    | 16000M | 13527 / 89454 | 2.2610 | 14.0255 | 0.0997 | 1.8634 | 121.8553 | 0.9003 |
    | 16000N | 264 / 9106 | 1.9221 | 12.1015 | 0.0298 | 4.4799 | 91.3611 | 0.9702 |
    | AllOthers | 648 / 42756 | 1.6788 | 12.0507 | 0.0137 | 1.7057 | 612.6454 | 0.9863 |

    Further analysis permitted us to isolate and identify subpopulations with distinctly different failure distributions
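The two failure modes show up in the plot legends as a two-subpopulation (mixed) Weibull fit. The sketch below evaluates such a mixture using the all-parts values from the slide 36 legend. It rests on an assumption about the legend notation: `b` is read as the shape β, `h` as the scale η in months, and `R` as the portion of the population in each subpopulation (the two `R` values sum to 1). The function names are hypothetical.

```python
import math


def weibull_cdf(t, beta, eta):
    """Two-parameter Weibull unreliability F(t) = 1 - exp(-(t/eta)**beta)."""
    return 1.0 - math.exp(-((t / eta) ** beta))


def mixed_weibull_cdf(t, subpopulations):
    """Unreliability of a mixture: portion-weighted sum of subpopulation CDFs.

    subpopulations is a list of (beta, eta, portion) tuples.
    """
    return sum(p * weibull_cdf(t, beta, eta) for beta, eta, p in subpopulations)


# All-parts fit from the plot legend, read as (beta, eta in months, portion).
ALL_PARTS = [
    (2.2600, 16.5572, 0.0921),   # early-failing subpopulation
    (1.8892, 163.0341, 0.9079),  # long-lived subpopulation
]

for months in (6, 12, 24, 48):
    print(months, round(mixed_weibull_cdf(months, ALL_PARTS), 4))
```

Read this way, roughly 9% of the all-parts population belongs to a subpopulation with a characteristic life near 17 months, which is the kind of separation that a single-distribution fit would hide.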
  • 38. Mobile Hydraulic Truck Pumps Continued… • Truck test results – 1. The highest acceleration levels are always associated with rapid pressure drops, |dP/dt| about 1800 bar per second or greater. – 2. Pressure drops (|dP/dt|) on Truck 2 were on average a little greater than on Truck 1, but they never resulted in the impact signature. – 3. |dP/dt| >= about 1800 bar per second ALWAYS results in an impact signature on Truck 1
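Screening a pressure trace for the rapid drops described above can be sketched with a finite-difference estimate of dP/dt. This is an assumption-laden illustration, not the team's actual analysis: the sample trace is invented, the function names are hypothetical, and only the 1800 bar per second threshold comes from the slide.

```python
def pressure_rates(samples):
    """Finite-difference dP/dt in bar/s between consecutive samples.

    samples is a list of (time_s, pressure_bar) tuples in time order.
    Each rate is reported at the end time of its interval.
    """
    rates = []
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        rates.append((t1, (p1 - p0) / (t1 - t0)))
    return rates


def rapid_pressure_drops(samples, threshold_bar_per_s=1800.0):
    """Return (time, rate) pairs where pressure falls at or above the threshold."""
    return [
        (t, rate)
        for t, rate in pressure_rates(samples)
        if rate <= -threshold_bar_per_s
    ]


# Invented trace sampled every 10 ms: one sharp 30 bar drop, then a slow decay.
trace = [(0.00, 200.0), (0.01, 200.0), (0.02, 170.0), (0.03, 169.0)]

for t, rate in rapid_pressure_drops(trace):
    print(f"rapid drop ending at t={t:.2f} s, dP/dt={rate:.0f} bar/s")
```

Per the test results on the slide, flagged events would then be compared against the accelerometer record, since the same pressure drop produced an impact signature on Truck 1 but not on Truck 2.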
  • 39. Mobile Hydraulic Truck Pumps Continued… • Have the data • Statistical tools • Resources – About 1 year
  • 40. Acme* Gearbox - Background • 3-stage, 1800 kW gearbox driving a rock crusher • Late in the evening there was a vibration alarm • Alarm was “not unusual”, they continued operating • Early the next morning there was a loud noise, and shutdown for vibration *Some details have been changed
  • 41. Background continued… • Over the next few days they replaced the gearbox with a spare • Vendor was consulted. They “knew exactly what went wrong” • Insurance company requested an independent Root Cause Failure Analysis
  • 43. Complications • “Independent” – Implies limited cooperation between experts • People who designed and built the equipment • People who maintained and operated equipment – Don’t take everything at face value • Consider everyone’s motivations • There are vested interests in different possible conclusions • Limited access to the hardware – Resources
  • 44. Investigation What did the people do? Why did they do it? (systems, procedures, motivations) This is where you usually find the “root” cause
  • 45. Investigation Induced • Application • Environment • “You broke it” (vendor) Inherent • Design • Materials • “It broke” (user) Answers the question “which humans?”
  • 46. Data “Describe the problem” from 8D form • Loading, both before the incident and historically • Equipment design, ratings (what was it expected to do?) • Maintenance history • Vibration analyses / reports
  • 47. Oil Contaminant report…
  • 48. Vibration • Requested source data, FFT parameters, etc. (monthly checks… one year history) (motor bearing)
  • 49. Vibration (source data)
  • 50. [Plot: Loading over a 96-hour window (0:00 to 96:00); series: at time of failure, 1 year earlier, 2 years earlier]
  • 51. Power – 30 days leading up to failure
  • 52. Power – 30 days 1 year earlier
  • 53. [Timeline figure; event labels]
    – Gearbox is rebuilt
    – Maintenance crew fixes an oil leak (repeated; oil leaks repaired “many” times, most undocumented)
    – Motor is replaced (-2 days): internal winding failure
    – Maintenance crew fixes an oil leak (-2 days)
    – Tripped due to vibration alarm (-2 hours)
    – Supervisor decides to continue running (-2 hours)
    – Plant shuts down due to vibration alarm (date and time)
  • 54. Interviews – the picture that emerges • 2 days prior – high speed shaft was not properly drawn up to engage the pinion – Crew did not have specs or manuals – No one knew where they were • Oil leaks had been repaired “many times” since rebuild • Could have been improperly reinstalled any of those times • Prior to failure, crews heard “Rumble” typical of loading too much material (common occurrence); Overloading. • Other crews described the proper procedure, “tribal knowledge” • Maintenance records were incomplete • Vendor reported no apparent problems when new motor was installed • Control room vibration monitoring was not helpful • Alarms occurred “all the time” with no action taken • There were indications a failure was imminent
  • 55. 55© Life Cycle Engineering 2008
  • 56. Remaining questions: • Was damage accumulating over time? • Were there material or design contributors?
  • 57. Metallurgical report • Two contact patterns… – “Frosting” below the pitch line, indicating a period of normal wear – Obvious indications of wear near tooth tips • Bearings indicated a severe misalignment • Nothing anomalous in material properties (hardness, case depth, chemical and microstructure) • Failure was due to low cycle fatigue prior to overload
  • 58. Root Cause Conclusions • Induced failure due to – improper maintenance, resulting in low cycle fatigue then overload – High loads due to material overloading were a likely contributor • Latent factors: – Poor cooperation with supplier(s) – Inadequate documentation and equipment-specific training – Ineffective warning system and propensity to ignore warnings • Proposed corrective actions – Acquire up-to-date specifications, documentation and maintenance procedures for critical equipment – Ensure equipment-specific training for maintenance personnel – Review adequacy of alarm system to ensure warnings are adequate and meaningful – Define appropriate responses – Instill a culture that expects response and action
  • 59. Conclusions; or if you remember nothing else about root cause analysis, remember this: • Do it. RCA is the engine that drives continuous improvement. • Have the data – Keep good records, not just of failures but of all maintenance actions – When did it begin service? … end? – Operating conditions – If you don’t have a good CMMS, get one. If you do (or when you do), USE IT • Resources. Have the right – People – Training, and – Tools.
  • 60. The last word… Problem Solving Flow Chart [Flowchart with YES/NO branches; node labels: Is It Working?; Don’t Mess With It!; Did You Mess With It?; YOU POOR FOOL!; Does Anyone Know?; Hide It; Are You In Hot Water?; TOO BAD!; Throw Away The Evidence; Can You Blame Someone Else?; NO PROBLEM!]

Editor's Notes

  • #3: How you define the “root” cause will depend on your objectives and motives in undertaking the investigation (Do a demonstration dropping a ball… ask what the root cause is…)
  • #10: History, development, different “schools of thought”
  • #11: For example… Why did the ball fall?
  • #15: http://www.national.com/analog/quality/8d
  • #18: How many could have the problem but don’t? Suspensions: what portion of the population is failing? At what rate? What is acceptable? What is the threshold?
  • #19: Use of statistics in medical, pharmaceutical fields
  • #22: Have the data… One of the primary roles of the Reliability Engineer is to see that the correct data is collected. Collect data on non-failures. You aren’t the expert… On which piece of equipment at your facility are you the most knowledgeable person in the plant? If you are the most knowledgeable person, what don’t you know? You are not the expert… For what percentage of the RCFAs you are likely to be involved in are you the single greatest repository of knowledge about possible causes and effects? Even then, there is critical information you don’t know or have. The “apocryphal” story of the great engineer who “just saw” an obscure answer that no one else saw… that is a rare event. That is why it makes such a great story.
  • #23: Assume that you will have some responsibility to do RCFAs within your organization. Things that illustrate some of the basic principles… one or two key take-aways. Some (not all) things that they might have occasion to deal with. Vibration wherever possible.
  • #24: Brake temperature constrains operations Excessive temperature
  • #25: Department of Defense Inspector General Auditing Report 99-193, C-17 Landing Gear Durability and Parts Support, June 24, 1999
  • #26: Expensive, failed attempts to solve the problem already “Failure rate…”
  • #36: References:
  • #37: References: C:\Documents and Settings\GCottle\My Documents\0\Archived\Vol1041\0\Automation Group R&M\2004\BRM\Background history and reference.zip\ETC_timeline-International Plugs and Fittings History.doc C:\Documents and Settings\GCottle\My Documents\0\Archived\Vol1041\0\Automation Group R&M\2004\BRM\20040607_HEUI_Warranty_Review.zip
  • #38: Ref C:\Documents and Settings\GCottle\My Documents\0\Archived\Vol1041\0\Automation Group R&M\2004\BRM\20041116_Sleuthing.zip\PRES.2004.11.15.LB TO Cottle T. Hanks Talking Points
  • #39: Ref: 20050114 R&M Overview
  • #42: What to do with the contractor’s conclusion… Don’t discard it. They know more about the design than most of us. Can you accept it? (Obviously not…)
  • #43: What to do with the contractor’s conclusion… Don’t discard it. They know more about the design than most of us. Can you accept it? (Obviously not…)
  • #57: Note that this is not a complete fault tree