
Com MN Product Generation

Guideline

Root Cause Analysis

A1R16851 DS:03 SC:435

For internal use only

Date: 07 July 2005


Guideline Root Cause Analysis

Copyright Siemens AG 2005


Issued by the
Siemens AG Com Mobile Networks
St. Martin-Strasse 76
D-81541 Munich
Responsible for updating: Elena Rossi, Cinisello, Mi, Tel. (+39) 02437 7887
Claudio Ravizza, Cinisello, Mi, Tel(+39) 022437 7386

Subject to technical changes.


Technical specifications and performance features are binding only insofar as they are specifically and
expressly agreed upon in a written contract.

The file name of this document is: "RCA_Guideline 03_070705.doc".


It is based on the "Standard WORD Template for PEPP Working Level Process Documents".


Contents

0 Preamble
0.1 Scope
0.2 Applicability
0.3 Document management
0.4 History
0.5 Responsibility
1 Generality
1.1 Principles of RCA
1.2 FEKAT RCA
1.2.1 RCA process
1.2.2 Technical aspects
1.2.2.1 RCA clauses
1.2.2.2 Data insertion
1.2.2.3 Data collection and reporting
1.2.2.4 Data analysis
1.3 Ad-hoc RCA
1.3.1 Starting
1.3.2 Data Collection
1.3.3 Assessment
1.3.4 Corrective Actions Identification and Proposal
1.3.5 Corrective Actions Deployment
1.3.6 Follow-up
2 Responsibility
3 References
ANNEXES
A Definitions
B Forms
C Example of ad-hoc RCA
D RCA: an urban legend


0 Preamble

0.1 Scope
This document defines and describes goals, principles, process and tools for Root Cause Analysis
(RCA) within Com MN PG.
With respect to PPP:D for Radio Access Networks - Process Handbook, RCA is one of the activities
within the Project Review process section of the Quality Management support
process. It may be referenced by other PG Areas as well.

0.2 Applicability
This work instruction applies to the Projects of Siemens Com MN PG.

0.3 Document management


Document regulation is described in [7].
Computer-generated printouts are only ever provided for information purposes and are not included
in a change service! Copies provided for information purposes are not labeled as such.

0.4 History
First Issue, mainly based on the pre-existing SICN ICM N MN PG RA document Root Cause
Analysis (A7060-D73-A719-*-7635), updated in order to contain references to ClearDDTS tool
and a partial redefinition of FEKAT clauses.

Issue 02: formal update to shorten some FEKAT clauses according to maximum allowed length
(up to 20 characters).

Issue 03: generalization and extension as PG-wide rule; update of involved/responsible roles
and of mandatory milestones/baselines for FEKAT RCA execution; introduction of ad-hoc RCA
in addition to regular FEKAT RCA.

0.5 Responsibility
This document is under the responsibility of Siemens Com CD RA.


1 Generality
RCA is a technique for identifying the causes and inner mechanisms that lead to costly
or risky problems related to the quality of the delivered products or the efficiency of the
development process.
This technique can also be seen as a step towards the implementation of the CMMI Level-5 support
Process Area Causal Analysis and Resolution, which aims to identify causes of defects and other
problems and to take action to prevent them from occurring in the future.
In other words, the main goals of this practice are:
o systematic identification of root causes of defects (or any other problems)
- selection of data about defects to be analyzed
- analysis of causes
o systematic prevention of future occurrences of defects, addressing the root causes
- identification and deployment of actions for defects prevention
- measurements/evaluation of effects of actions
- recording of data for knowledge base setup / future reuse

The following context diagram (from CMMI presentation material) graphically depicts this concept.


Experience gained at PG through several years and product releases has shown that the most
favourable cost/benefit ratio of RCA comes from systematic application to critical and high-priority
faults detected in System Integration, System Test and Customer Acceptance (i.e., between
D500 and B700). The main source of information in this case is the Fault Management tool
FEKAT. From here on this kind of RCA will be referred to as FEKAT RCA.
RCA can also be applied to earlier review or testing phases, or in general to investigate the true
reasons for any kind of problem that has occurred.
Therefore, in addition to the systematic FEKAT RCA (described in chapter 1.2), ad hoc
Root Cause Analysis can also be performed; generic guidance for this topic is given in chapter 1.3.
In these cases, one of the typical trigger events is EmA (priority A Emergency Case) as described
in section 2.2.3 of document [5] ICM N Management Escalation for Emergency Cases.

1.1 Principles of RCA


The goal of RCA is to analyse a problem that has occurred and to evaluate why it happened, in order
to identify the root causes of the problem, not simply the symptoms.
This makes it possible to derive a diagnosis for fixing the defects in the process or materials that
caused the problem in the first place.
RCA techniques are based on the observation that SW/HW project problems (faults, scheduling
delays, budget overruns) are symptoms not only of a problem in the product, but mainly of a problem
in the process that created it; as a consequence, the root causes might involve the processes, inputs,
environment and people.
In order to gain this knowledge and remove the root causes (thus preventing future problems of a
similar nature), an extra effort is needed to go back and determine why the problem was
created in the first place.
This is done by means of Cause-Effect Diagrams, which are useful for different reasons.
They help to acquire knowledge about the processes, the product and the organization.
They guide the discussion on Process Improvement.
They can also be used to detect and understand positive factors, which could be exploited in
analogous situations to obtain substantial improvements.
They can be a support tool in the study of every kind of problem, since they guide towards action.
The preliminary condition for the application of RCA is the traceability of the processes.
In fact, for an objective approach, the processes shall be defined and described so that the
analysis can be performed on the project documentation.
A second condition for the application of RCA is that it must be supported by accurate data
collection, by means of a well-defined set of RCA clauses.


1.2 FEKAT RCA

1.2.1 RCA process

The process of FEKAT RCA can be split into the steps summarized in the following pictures and
listed below.

o Definition of RCA Clauses
The objective of this phase is the definition of the Clauses that contribute to the Effect. The list of
clauses is implemented in FEKAT (and any other Fault Management system) and can be
reviewed as feedback of data analysis if required. For a given release the definition of RCA
clauses, either confirming the existing ones or introducing updates, must be completed
within D300, in order to apply it effectively in the following testing phases.
o Alignment of Fault Management system
In order to have reliable collected information, the RCA activities are supported by the Fault
Management system that has to be tailored whenever RCA clauses are changed.
o Insertion of RCA Information
The goal is the detection of the RCA clauses as soon as faults are corrected.
The RCA applies to all Fault Reports (FRs) originating typically from System Integration,
System Test and Customers and matching following criteria:
- Priority 1 or 2
- Solved by Development.
The Fault Correction Responsible fills in the RCA fields when correcting fault reports, using the
appropriate fields of the Fault Management system.

This activity can be performed at any testing stage, starting at D300, until B800; it is mandatory
between D400 and B800.
o Data Collection, Reporting, Analysis and Deployment.
Data are collected by means of:
- Extraction of RCA fields from Fault Management systems (e.g., FEKAT, OmniTracker),
importing such data to the Metric Database (e.g., MEDAL tool), from where
subsequent exports can be done (e.g., to Excel format).
- Formal analysis of completeness and correctness of the values inserted; if relevant
information is missing or wrong, the Fault Correction Responsible is requested to complete or
correct the values.
The purpose of Data Reporting is to gather and present data in an easy-to-analyze form, in order
to get reliable and easy-to-handle information.
The analysis is performed at all milestones and baselines between D500 and B800 in the
scope of a "lessons learned rule" aimed to
- identify measures derived from RCA
- have these measures approved by the Release Manager, if of release-specific character, or
even by the PG Quality Board if of general character
- ensure deployment of measures under responsibility of the respective Quality Manager.
This may be the EQM, if of entity character, the PQM, if of Release character,
the RQM if of Product line or process character, or Com MN PG BE
- identify metrics to monitor later on the effectiveness of defined measures
- ensure implementation of measures strictly following the Continuous Process Improvement.
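
The selection criteria given above for the step "Insertion of RCA Information" (priority 1 or 2, solved by Development) can be illustrated with a minimal sketch. The `FaultReport` type, its field names and the sample records are hypothetical and do not reflect the actual FEKAT data model; the origin of a report is carried along only for information, since the guideline names the typical origins without making them a strict filter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultReport:
    """Hypothetical, simplified view of a Fault Report (FR)."""
    fr_id: str
    priority: int    # 1 = highest
    solved_by: str   # e.g. "Development"
    origin: str      # e.g. "System Test" (informational only)


def subject_to_rca(fr: FaultReport) -> bool:
    """Apply the two criteria: priority 1 or 2 AND solved by Development."""
    return fr.priority in (1, 2) and fr.solved_by == "Development"


# Illustrative data only.
reports = [
    FaultReport("FR-001", 1, "Development", "System Test"),
    FaultReport("FR-002", 3, "Development", "System Integration"),
    FaultReport("FR-003", 2, "Documentation", "Customer"),
    FaultReport("FR-004", 2, "Development", "Customer"),
]

selected = [fr.fr_id for fr in reports if subject_to_rca(fr)]
print(selected)  # ['FR-001', 'FR-004']
```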

1.2.2 Technical aspects

This section describes the activities needed to perform the process steps detailed above.
In particular focus is put on:
o RCA clauses
o Filling RCA data within Fault Management systems (e.g., in FEKAT)
o Extraction of RCA data (e.g., in MEDAL)
o Data Reporting
o Data Analysis
o Definition of Corrective Actions
o Deployment of Corrective Actions
o Review of effectiveness of Corrective Actions.

1.2.2.1 RCA clauses

The aim of RCA is the improvement of the process through the analysis of problems, by means
of a set of activities including:
o Fault injection analysis: investigation of when and why problems have been introduced

o Missed fault detection analysis: investigation of when the problems could have been
detected, and why they were not detected earlier.¹
The RCA Clauses have been grouped into four clusters, in order to simplify the work of the Fault
Correction Responsible, thus increasing the reliability of the data collected, but also having in
mind a reasonable trade-off between the data reliability and the granularity of the gathered
information.
It is important to point out that, in answering the questions, the Fault Correction Responsible filling
in all RCA-related fields should try to ask "why" many times, not stopping at the most immediate
and easy answer, in order to go back to the real roots of the problem!²
The Fault Correction Responsible shall provide answers to all 4 questions about the
fault: phase and cause of introduction, phase where it could have been detected, and cause of
missed detection.
In the following paragraphs, the defined RCA Clauses are underlined, and the related FEKAT
codes are reported in parentheses³; other Fault Tracking tools (ClearDDTS, OmniTracker,
etc.) allow direct selection of RCA clauses (no code is needed).
The current definition of RCA Clauses results from about 10 years of application of this practice,
and has been periodically revised and tuned, according to the first two steps described in
section 1.2.1.
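
The four answers required for each fault (PFI, EIC, PED, MDC) can be pictured as one record per Fault Report. The following is only a sketch: the `RcaAnswers` class and the `missing_answers` helper are hypothetical names, and the special cases described later (EIC set to "Not relevant", MDC left empty on purpose) are deliberately ignored here.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class RcaAnswers:
    """One field per RCA cluster; None means 'not yet answered'."""
    pfi: Optional[str] = None  # Phase of Fault Injection
    eic: Optional[str] = None  # Error Injection Cause
    ped: Optional[str] = None  # Phase of possible Earliest Detection
    mdc: Optional[str] = None  # Missed Detection Cause


def missing_answers(answers: RcaAnswers) -> List[str]:
    """Return the names of the clusters that still have no value."""
    return [f.name for f in fields(answers) if not getattr(answers, f.name)]


entry = RcaAnswers(
    pfi="SBS_PFI_DESIGN",
    eic="SBS_EIC_SPECIFICAT",
    ped="SBS_PED_DESIGNREVIEW",
)
print(missing_answers(entry))  # ['mdc']
```

A check of this kind corresponds to the formal completeness analysis mentioned in section 1.2.1; it says nothing about whether the chosen values are the real root cause.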

1.2.2.1.1 Fault injection

Phases where the faults were injected (PFI - Phase of Fault Injection)
The phases where the faults can be inserted are defined as follows:
o Preanalysis (SBS_PFI_PREANALYSIS)
o Analysis (SBS_PFI_ANALYSIS)
o Design (SBS_PFI_DESIGN)
o Implementation (SBS_PFI_IMPLEMENTAT)
o CR Implementation (SBS_PFI_CRIMPLEMENT)
o Error Correction (SBS_PFI_ERRORCORR).

Moreover the following choices can also be made:


o Previous Releases (SBS_PFI_PREVRELEASE)
o Included SW Products (SBS_PFI_INCLSWPROD).
In the last two cases the next field (Causes for fault injection) may be set as Not relevant
because causes for fault injection might be unknown to the Fault Correction Responsible.
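
As an illustration, the PFI clauses and their FEKAT codes listed above can be held in a simple lookup table. The sketch below also checks the 20-character limit on FEKAT clauses mentioned in the document history (Issue 02) and encodes the rule that EIC may be "Not relevant" for the last two choices. The function and constant names are hypothetical.

```python
MAX_CODE_LENGTH = 20  # maximum clause length stated in the History section

PFI_CODES = {
    "Preanalysis": "SBS_PFI_PREANALYSIS",
    "Analysis": "SBS_PFI_ANALYSIS",
    "Design": "SBS_PFI_DESIGN",
    "Implementation": "SBS_PFI_IMPLEMENTAT",
    "CR Implementation": "SBS_PFI_CRIMPLEMENT",
    "Error Correction": "SBS_PFI_ERRORCORR",
    "Previous Releases": "SBS_PFI_PREVRELEASE",
    "Included SW Products": "SBS_PFI_INCLSWPROD",
}

# For these two choices the cause of injection may be unknown to the
# Fault Correction Responsible, so EIC may be set to "Not relevant".
EIC_OPTIONAL_FOR = {"SBS_PFI_PREVRELEASE", "SBS_PFI_INCLSWPROD"}


def codes_exceeding_limit(codes):
    """Return all codes longer than the allowed clause length."""
    return sorted(c for c in codes.values() if len(c) > MAX_CODE_LENGTH)


def eic_may_be_not_relevant(pfi_code):
    """True if 'Not relevant' is an acceptable EIC for this PFI code."""
    return pfi_code in EIC_OPTIONAL_FOR


print(codes_exceeding_limit(PFI_CODES))                # []
print(eic_may_be_not_relevant("SBS_PFI_PREVRELEASE"))  # True
print(eic_may_be_not_relevant("SBS_PFI_DESIGN"))       # False
```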

Causes for fault injection (EIC - Error Injection Cause)


The following Ishikawa fish-bone diagram pictures the cause-effect relationships of the RCA
clauses for error injection showing all categories.

¹ It is important to note that this analysis, especially when focused on high-priority Customer Faults, is also
equivalent to a so-called escape analysis, aimed at understanding why such faults were not properly filtered
(escaped) by earlier review and testing stages.

² If you are already bored from reading this paper so far, maybe you can take a break, relax and
read Annex D, where a witty example of RCA explains this concept.

³ These are values defined for GERAN-related projects, identified by the prefix SBS_.

[Figure: Ishikawa (fish-bone) diagram for the effect "Fault Injection". Main branches: People, Input,
Methods/Tools, Resources. Causes shown: Lack of Training on SBS Domain, Lack of Training on Dev. Env.,
Information Flow, Specifications, Feature Complexity, Wrong Planning, Wrong HW/FW Behaviour,
Code Complexity, Analysis/Design Tools Problem, Guidelines, Underestimated Effort, Staffing,
Unrealistic Deadline.]

In the following, the RCA clauses for error injection cause are described in detail.
A mapping is also provided to more detailed clauses, which must be considered as examples of
the main cause, representing possible reasons of error injection in the different phases of the
project lifecycle.

o Specifications (SBS_EIC_SPECIFICAT)
Wrong specification
Specification contained wrong information
e.g., a message was described as containing 4 fields instead of 5
Unclear specification
A detail in the specification was ambiguous and led the programmer to a wrong
interpretation.
e.g., "The variable ENV must be set to its starting value", without describing what the
starting value is supposed to be.
Unstable specification
Specifications changed frequently during implementation; the continual updating of
code to follow specification changes caused error insertion.
e.g., some procedure or variable no longer needed was forgotten in the latest version
of SW code; this interacted with other SW parts and caused wrong behaviour.
Missing specification⁴
Important information was not contained in the specification, and this led to an error
in implementation
e.g., the specification forgets to mention that some given messages must be sent in a
specific sequence to obtain a correct behaviour of the system

⁴ 'Missing specification' can also be interpreted as requirement

Wrong Change Request


A change request was made that contained one of the errors described above (it was either
wrong, unclear or missing some information)

o Wrong Planning (SBS_EIC_WRONGPLAN)


Activity not planned
An activity was forgotten when the activity plan was written; this led to errors.
e.g., skipping reviews of some deliverables to save time

o Lack of training on SBS domain (SBS_EIC_TECHKNOWSBS)


Lack of know-how on the system
The Designer had insufficient knowledge of the system and this led to the introduction of
faulty code.
e.g., Programmer did not know how the system would react to a certain input
Lack of knowledge of existing code
The Designer had insufficient knowledge of the existing code and this led to the introduction
of faulty code.
e.g., variable type mismatch, errors in procedure call parameters.

o Lack of training on Development Environment (SBS_EIC_TECHKNOWENV):


Lack of know-how on the environment
The Designer had insufficient knowledge of the environment and this led to the introduction
of faulty code.
e.g., wrong system calls, wrong use of system libraries, wrong version of source
code component selected for load production.

o Information Flow (SBS_EIC_INFORMATFLOW)


Lack of communication within project team
There was a misunderstanding between team members and this led to the introduction of
errors due to different interpretations
e.g., members of the same team did not discuss the meaning of a term and interpreted
it in different ways
Lack of communication outside the project team
There was a misunderstanding between members of different teams and this led to
the introduction of errors due to different interpretations
e.g., the team developing an interface did not promptly communicate a value change
of a field in a message; another team that had to use that interface did not know of
the new value and introduced the error.
Designers belonging to groups that should collaborate do not participate in each
other's Review Meetings
Lack of information
There was general misinformation among project developers: tasks and responsibilities
were not clear and led to ambiguous interpretations.

o Underestimated effort (SBS_EIC_UNDEREFFORT)


Effort was underestimated in the initial plan, and too few resources were allocated

o Unrealistic Deadline (SBS_EIC_WRONGDEADLIN)


The deadline was initially set too early; subsequent estimates led to its postponement. This
caused poor time management, leading to error introduction

o Staffing (SBS_EIC_STAFFING)
Staff shortage
There was a staff shortage during the project due to people leaving the Company

e.g., the project started with 130 resources, but during the development 20 of them
resigned. This led to an overload of work for those who remained, causing a lower
quality work and introduction of errors.
Overlapping releases
A new release was scheduled when the previous one was not yet completed and this
created an overload on Designers, which led to a worse code quality and introduction
of errors
Resources shifted to other projects
The start of other projects caused some resources to be shifted; this led to a staff shortage
Unexpected extra load
The load of work increased for some unexpected cause: a new, unplanned activity,
the necessity to repeat some tasks already done, etc.

o Guidelines (SBS_EIC_GUIDELINES)
Operating procedure not available
The operating procedure guiding the process was not available when needed; this
led to ambiguous or wrong decisions, which introduced some errors
Operating procedure insufficient
The operating procedure guiding the process was in some sense incomplete (i.e. not
finished, or lacking some necessary information); this led to ambiguous or wrong
decisions, which introduced some errors

o Analysis/Design Tool Problems (SBS_EIC_ANSDSGTOOL)


The error was due to problems in the tool used during Analysis/Design phases

o Feature Complexity (SBS_EIC_FEATCOMPLEXI)


The feature complexity was such that it caused the introduction of errors

o Code Complexity (SBS_EIC_CODECOMPLEXI)


The code complexity was such that it caused the introduction of errors
e.g., a Designer had to modify some existing code developed without following norms
of good programming; the difficulty of the existing code led to the introduction of errors

o Wrong HW/FW Behaviour (SBS_EIC_HW/FWBEHAVIO)


The error was caused by wrong behaviour in HW/FW and is not due to SW

1.2.2.1.2 Missed fault detection

Phases where faults should have been detected earliest (PED - Phase of possible
Earliest Detection)
The phases, during which faults should have been detected, with reasonable effort and in the
scope of the relevant test phase, are defined as follows:
o Preanalysis Review (SBS_PED_PREANALYSIS)
o Analysis Review (SBS_PED_ANALYSREVIEW)
o Design Review (SBS_PED_DESIGNREVIEW)
o Coding Review (SBS_PED_CODEREVIEW)
o Debug / Module Testing (SBS_PED_DEBUGMODTEST)
o Off-Line Testing (SBS_PED_OFFLINETEST)
o White Box Testing (SBS_PED_WHITEBOXTEST)
o Black Box Testing (SBS_PED_BLACKBOXTEST)
o SBS Integration Testing (SBS_PED_INTEGRATION)

o System Test (SBS_PED_SYSTEMTEST).

For all the phases in which the test is not directly performed by the Fault Correction Responsible who
enters the RCA information (typically the last two cases), the Fault Correction Responsible is likely
unable to figure out why someone else (the Tester) failed to detect the fault. In
such cases the Fault Correction Responsible will have to contact the Tester to get information about the
following field (Causes for missed error detection).
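
The PED codes above, and the remark about phases not tested by the Fault Correction Responsible, can be sketched as follows. This is an illustration under two assumptions: the list order is the order in which the phases are listed in this section, and "the last two cases" are taken literally as the last two list entries (SBS Integration Testing and System Test). The helper name is hypothetical.

```python
# PED codes in the order listed in this section.
PED_PHASES = [
    "SBS_PED_PREANALYSIS",
    "SBS_PED_ANALYSREVIEW",
    "SBS_PED_DESIGNREVIEW",
    "SBS_PED_CODEREVIEW",
    "SBS_PED_DEBUGMODTEST",
    "SBS_PED_OFFLINETEST",
    "SBS_PED_WHITEBOXTEST",
    "SBS_PED_BLACKBOXTEST",
    "SBS_PED_INTEGRATION",
    "SBS_PED_SYSTEMTEST",
]

# Assumption: phases typically run by a Tester rather than by the
# Fault Correction Responsible are the last two entries of the list.
TESTER_RUN_PHASES = set(PED_PHASES[-2:])


def mdc_needs_tester_input(ped_code):
    """True if the Tester should be contacted before filling in MDC."""
    if ped_code not in PED_PHASES:
        raise ValueError(f"unknown PED code: {ped_code}")
    return ped_code in TESTER_RUN_PHASES


print(mdc_needs_tester_input("SBS_PED_SYSTEMTEST"))  # True
print(mdc_needs_tester_input("SBS_PED_CODEREVIEW"))  # False
```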

Causes for missed error detection (MDC - Missed Detection Cause)


The following Ishikawa fish-bone diagram depicts the cause-effect relationships of the RCA
clauses for missed error detection, showing all proposed categories.

[Figure: Ishikawa (fish-bone) diagram for the effect "Missed Error Detection". Main branches: People,
Input, Methods/Tools, Resources. Causes shown: Technical Know-How, Lack of Training on Dev. Env.,
Information Flow, Test Design, Wrong Planning, Missed Regression Activities, Test Tools Problems,
Test Environment, Guidelines, Underestimated Effort, Staffing, Unrealistic Deadline.]

The person responsible for entering the RCA data into the relevant Fault Management tool has to
categorize the reason why the error was not found in the phases/activities defined above. The
following categories are defined; for each of them, samples are provided of the meaning of the
various root causes.

o Test Design (SBS_MDC_TESTDESIGN)


Test specification missing
The test specification was not written
Test specification delayed
The test specification was written too late
Test specification wrong
The test specification contained some errors, which prevented finding the error
Test specification not aligned with project evolution
The test specification was not updated when some feature was changed or added

o Wrong Planning (SBS_MDC_WRONGPLAN)


Activity not planned
A test activity was not planned
Test execution too time consuming
A test activity was not done because it would have required too much time

o Underestimated effort (SBS_MDC_UNDEREFFORT)


Testing effort was underestimated and led to incomplete testing

o Unrealistic Deadline (SBS_MDC_WRONGDEADLIN)


Activity not fully executed due to time constraints
There was not enough time to complete all the scheduled tests

o Test Environment (Test Bed Set-up, Configuration): (SBS_MDC_TESTENVIRONM)


Testing environment too weak
Testing environment could not represent all the possible configurations of the actual
device, therefore testing activity was incomplete
Testing environment not aligned
Testing environment was slightly different from the actual device, therefore testing
activity was incomplete as not all the possible cases could be tested
Testing environment wrongly configured
Testing environment had a different configuration from the actual device, therefore
testing activity was incomplete as not all the possible cases could be tested
Data Base insufficient / wrong
Test environment was incomplete due to insufficient or wrong contents in database
Load / stress environment not sufficient
The environment used in load/stress was not adequate to support testing activity
Lack of Network Entity in Test Environment
Test environment did not include a given Network Entity (e.g., BSC, BTS, OMC,
MSC, SGSN, RNC, ngRNC, NB)
Lack of boards
Test environment did not include some boards, which were necessary for test
execution

o Test Tools Problems (SBS_MDC_TESTTOOLPBLM)


Missing test execution tools
Some tools needed for test execution were missing so not all the tests could be
executed
Missing test analysis facilities
Some tools needed for test analysis were missing so not all the tests could be
executed

o Technical Know-How (SBS_MDC_TECHKNOWHOW)


Designer error in test execution
During test execution the Designer made some trivial mistakes
Designer error in analysis of test results
Tests were correctly executed, but results were misinterpreted

o Lack of Training on Development Environment (SBS_MDC_LACKOFTRAINI)


The test activity was not well conducted because the development environment was
not well known

o Information Flow (SBS_MDC_INFORMATFLOW)


There were misunderstandings between team members or teams and this led to an
erroneous testing activity due to different interpretations of available information

o Missed Regression Activities (SBS_MDC_REGRACTIVITY)


Some regression activities were not performed
- due to lack of time or resources
- or because a given area was wrongly assumed as not affected by changes and
this prevented the error from being discovered

o Staffing (SBS_MDC_STAFFING)
Activity not fully executed due to resource shortage
Lack of resources prevented the complete testing of the product.

o Guidelines (SBS_MDC_GUIDELINES)
Guidelines about testing activities were either missing or incomplete

The causes in each group are examples of the main cause, representing the possible reasons of
missed error detection in the different phases of the release life-cycle.
In some cases this field might not be meaningful, as explained at the end of the next paragraph.

1.2.2.2 Data insertion

Between D300 and B800, when identified faults have been analysed and corrected, the relevant
data for RCA is entered by the Fault Correction Responsible (typically HW or SW-engineer) into
the Fault Management system.
To ensure the semantic reliability and accuracy of input data, the Fault Correction Responsible must
identify the real root cause, avoiding easy and simple answers (in early SBS releases, a possible
cause for EIC was Designer Error, which collected more than 90% of answers just because it
was the easiest one).
It is also important to avoid input of nonsensical or inconsistent data. The following is a short list
of inconsistency examples:

o PFI = analysis AND PED = preanalysis review
(an error in Analysis cannot have been detected in Preanalysis review)
o PFI = design AND (PED = preanalysis review OR analysis review)
(an error in Design cannot have been detected in Preanalysis or Analysis review)
o (PFI = implementation OR CR implementation) AND (PED = preanalysis review OR
analysis review OR design review)
(an error in Implementation cannot have been detected in Preanalysis, Analysis or Design review)
In general, an error inserted in a given phase cannot potentially be detected in an earlier phase!

o PFI = (preanalysis OR analysis) AND EIC = code complexity
(an error inserted during preanalysis or analysis has nothing to do with source code!)

o (PED = actual phase of error detection) AND (MDC not empty)

If an error could not be found earlier than it was actually found, it makes no sense to ask why it was
not found earlier (because it was indeed found at the earliest possible phase).
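
The inconsistency examples listed above lend themselves to an automated plausibility check. The sketch below uses the plain-language phase names from the list rather than tool-specific codes; the rank tables, the function name and the `detected_in` parameter (the phase in which the fault was actually found) are assumptions introduced for illustration. Only the combinations named in this section are checked.

```python
# Relative order of the phases named in the inconsistency list.
PFI_RANK = {
    "preanalysis": 0,
    "analysis": 1,
    "design": 2,
    "implementation": 3,
    "CR implementation": 3,
}
PED_REVIEW_RANK = {
    "preanalysis review": 0,
    "analysis review": 1,
    "design review": 2,
}


def find_inconsistencies(pfi, eic, ped, mdc, detected_in):
    """Return a list of problems found in one RCA entry (empty if none)."""
    problems = []
    # A fault cannot be detectable in a review that precedes its injection.
    if pfi in PFI_RANK and ped in PED_REVIEW_RANK:
        if PED_REVIEW_RANK[ped] < PFI_RANK[pfi]:
            problems.append("PED precedes PFI")
    # Faults injected before any code exists cannot be due to code complexity.
    if pfi in ("preanalysis", "analysis") and eic == "code complexity":
        problems.append("code complexity given for a pre-coding phase")
    # If the fault was found at the earliest possible phase, MDC must be empty.
    if ped == detected_in and mdc:
        problems.append("MDC filled although PED is the actual detection phase")
    return problems


print(find_inconsistencies("analysis", "specifications",
                           "preanalysis review", "test design", "system test"))
# ['PED precedes PFI']
print(find_inconsistencies("design", "guidelines",
                           "design review", "test design", "system test"))
# []
```

Entries flagged by such a check would be returned to the Fault Correction Responsible for correction, in line with the formal analysis of completeness and correctness described in section 1.2.1.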

1.2.2.3 Data collection and reporting

The extraction of the RCA raw data from Fault Management system data is made easier by
appropriate tooling.
For example, at Com MN PG R, the MEDAL (Measurement Data Library) tool enables an easy
display of the raw data originating from the Fault Management systems. In MEDAL, an RCA query
can be selected by clicking on options Raw Data Faults RCA view, and applying further
filtering, if desired (e.g. on internal/customer originator, on priority), and the results can be easily
exported to Excel format.
RCA raw data will be subsequently processed and elaborated in order to obtain summary
information about the phases and causes of error injection and missed error detection for each
release, as well as for trends over releases.
The MEDAL tool is reachable through Milan Com CD RA Intranet portal; direct link is
http://ik2sw001.icn.siemens.it/medal2/login_new.asp.
Raw data are processed and elaborated in the scope of Lesson Learned and Project Review
meetings, in order to obtain analysis of fault insertion and missed error detection phases and
causes for the current release and of trend over releases.

1.2.2.4 Data analysis

In the following, some typical RCA graphs are shown as examples.

1.2.2.4.1 Analysis for the current release

Data for each of the four RCA clusters (phase of fault insertion, cause of fault insertion, phase
of earliest possible detection, cause of missed detection) are grouped and a relative percentage
is calculated for each category, in order to obtain a ranking.
The results can be summarised in either table or graphical format.
It is advisable to report percentages rather than absolute values, because this allows comparison
and trend analysis across releases that may have different numbers of faults depending on their
size.
For graphical representation, among the many possible chart types (histogram, pie, etc.) the
most suitable one is a Pareto chart, so that attention and the definition of countermeasures can
be concentrated on the most relevant causes.

The following example refers to the phase of fault insertion (PFI), but the same approach can be
applied to the other RCA clauses (EIC, PED, MDC):

Phase of fault insertion (PFI)    Share of faults
Implementation                     74.0%
Error Correction                    9.9%
CR Implementation                   6.1%
Previous Releases                   4.6%
Analysis                            2.3%
Design                              2.3%
Included SW Product                 0.8%
Preanalysis                         0.0%
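
The ranking shown above can be reproduced with a few lines of code. The sketch below computes the
relative share and the cumulative share used in a Pareto chart. The absolute counts are
hypothetical: they were chosen only so that the resulting percentages match the PFI figures above,
and the function name pareto_table is an assumption made for this example.

```python
def pareto_table(counts):
    """Return (category, percent, cumulative percent) rows, largest first."""
    total = sum(counts.values())
    if total == 0:
        return []
    rows = []
    cumulative = 0
    # sorted() is stable, so categories with equal counts keep their input order.
    for category, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        cumulative += n
        rows.append((category,
                     round(100 * n / total, 1),
                     round(100 * cumulative / total, 1)))
    return rows


# Hypothetical absolute fault counts (131 faults in total).
pfi_counts = {
    "Implementation": 97,
    "Error Correction": 13,
    "CR Implementation": 8,
    "Previous Releases": 6,
    "Analysis": 3,
    "Design": 3,
    "Included SW Product": 1,
    "Preanalysis": 0,
}

for category, share, cum in pareto_table(pfi_counts):
    print(f"{category:<22}{share:>6.1f}%{cum:>8.1f}%")
```

The cumulative column shows how few categories account for most of the faults, which is the
information a Pareto chart is meant to convey.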
The same analysis is performed for

o the cause of fault insertion (EIC)
o the phase of earliest possible detection (PED)
o the reason for missed error detection, i.e., possible reasons why the fault was NOT found in
the earliest possible phase/activities (MDC)
The analysis of the data should be done by experts from Development, BBT, SINT, ST (the
group where the error has been detected). Together with experts from earlier test phases and
with experts from the responsible organizational unit where the fault was injected, they propose
countermeasures aiming at earlier, cheaper, etc. error finding, or even at prevention.
Apart from investigating each of the 4 RCA clusters of answers individually, the analysis
should also try to obtain a holistic view (a comprehensive overall picture gained by looking at
separate details and trying to correlate them).
For example, suppose that RCA focused on System Integration faults gives the following results:
o Phase of earliest possible detection (PED) ==> high percentage in Host Test
o Missed Detection Cause (MDC) ==> main reason is Test Design.
The correlation of these two results suggests that Host Tests were poorly designed.
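
Such a correlation is easier to confirm when the two answers are counted as pairs per fault
report, rather than as two separate rankings. The following sketch counts (PED, MDC)
combinations; the records and all category names other than "Host Test" and "Test Design" are
hypothetical and serve only as an illustration.

```python
from collections import Counter

# Hypothetical (PED, MDC) answer pairs from System Integration fault reports.
records = [
    ("Host Test", "Test Design"),
    ("Host Test", "Test Design"),
    ("Host Test", "Test Execution"),
    ("BBT", "Test Design"),
    ("Host Test", "Test Design"),
    ("Review", "Missing Review"),
]

pair_counts = Counter(records)
total = len(records)

# Most frequent combinations first.
for (ped, mdc), n in pair_counts.most_common():
    print(f"{ped:<10} {mdc:<15} {n:>2}  {100 * n / total:5.1f}%")
```

If the pair (Host Test, Test Design) dominates, the conclusion about poorly designed Host Tests
is supported by the same fault reports, not merely by two independent percentages.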

1.2.2.4.2 Analysis of trend over Releases

Evaluating the trend of the phases and causes of fault insertion and missed error detection,
through an analysis of the data from the different releases, can provide additional information
and also makes it possible to check the effectiveness of a countermeasure aimed at reducing a
given root cause.
In order to make the analysis independent of the relative size of the release and the absolute
number of fault reports, the data can be presented as relative percentages.
The results can be summarised in a graph, from which the trend over releases for each RCA
clause can be easily seen (the following example is from entity BLT of GERAN SBS releases).

[Figure: bar chart of the share of faults per phase of fault insertion for each release.
Vertical axis: Releases (BR2.1, BR3.0, BR3.6, BR3.7, BR4.0, BR4.5, BR5.0, BR5.5, BR6.0(1), BR6.02,
BR7.0). Horizontal axis: 0% to 100%. Legend: Pre-Analysis, Analysis, Design, Implementation,
CR Implementation, Error Correction, Previous releases, Included SW Products.]
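
The conversion of absolute counts into relative percentages per release can be sketched as
follows. The release names and fault counts are hypothetical and are not taken from the chart;
the function name shares_by_release is likewise an assumption made for this example.

```python
def shares_by_release(counts_by_release):
    """Convert absolute fault counts per release into percentage shares."""
    result = {}
    for release, counts in counts_by_release.items():
        total = sum(counts.values())
        result[release] = {
            category: (round(100 * n / total, 1) if total else 0.0)
            for category, n in counts.items()
        }
    return result


# Hypothetical releases of different size.
history = {
    "R1": {"Design": 10, "Implementation": 30, "Error Correction": 10},
    "R2": {"Design": 5, "Implementation": 40, "Error Correction": 5},
    "R3": {"Design": 4, "Implementation": 12, "Error Correction": 4},
}

trend = shares_by_release(history)
for release, shares in trend.items():
    print(release, shares)
```

In this example R1 and R3 have the same percentage profile although R3 has far fewer fault
reports, which is why percentages rather than absolute numbers are compared across releases.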

1.3 Ad-hoc RCA

Whenever an unwanted situation occurs which consumes resources and tends to happen repeatedly,
it is necessary to figure out what is really causing this situation, and to remove that cause in
order to avoid such a situation in future. The aim of this practice is not just to cure the
symptoms, but to collect and analyze data that allow diagnosing, addressing and finally
removing the real cause of the problem.

This raises the following questions:

a) How to determine which situations are candidates for root cause analysis?

b) How to figure out what the real root cause is?

c) How to identify countermeasures, apply them and monitor their effectiveness?

The following paragraphs, based on the CMMI Support Process Area CAR (Causal Analysis and
Resolution), propose a structured approach to provide answers.

The table below maps the CMMI CAR Specific Practices to the following paragraphs of this guideline.

CMMI CAR Specific Practices                      see paragraph:

SG 1 Determine Causes of Defects
  SP 1.1-1 Select Defect Data for Analysis       1.3.1, 1.3.2
  SP 1.2-1 Analyze Causes                        1.3.3

SG 2 Address Causes of Defects
  SP 2.1-1 Implement the Action Proposals        1.3.4, 1.3.5
  SP 2.2-1 Evaluate the Effect of Changes        1.3.6
  SP 2.3-1 Record Data                           1.3.6

1.3.1 Starting

Typical conditions for triggering RCA are:

o EmA (as defined in [5] paragraph 2.2.3)
o systematic occurrence of faults at a rate higher than normal (in reviews, testing, customer
reports)
o product and/or process showing deviations from the expected behaviour
o outcome from Q-Gates (see [6] paragraph 2.1)
o relevant problems identified by Project Management and requiring corrective actions
o occurrence of an emergency event (usually this should be an exception, as RCA is typically
applied in case of repeated occurrence of troubles, which may be a warning sign of some
underlying problem in the process).

Some examples of ad-hoc RCA are:

o RCA for faulty patches delivered to the customer
o RCA for excessively long CR duration
o RCA due to insufficient overload handling

In any case, before triggering RCA, a cost/benefit evaluation must be done, comparing the
impact of the faults and their frequency of occurrence against the added value and the cost of the
RCA (the time and resources needed).

This evaluation might even lead to the conclusion that, on cost/benefit considerations, it is
acceptable to live with the fault, or to apply only an immediate countermeasure that cures the
symptoms for instant relief.
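
A very rough form of this comparison is sketched below. It is not a formula prescribed by this
guideline: the function name, the parameters, the unit (person-days) and the assumed reduction
factor are all hypothetical and only illustrate the idea of weighing impact times frequency
against the cost of the RCA.

```python
def rca_is_worthwhile(cost_per_occurrence, occurrences_per_year, rca_cost,
                      expected_reduction=1.0):
    """Compare the yearly cost avoided by removing the cause with the RCA cost.

    All costs are in the same unit (e.g. person-days). expected_reduction is
    the assumed fraction of occurrences that the RCA outcome would prevent.
    """
    avoided = cost_per_occurrence * occurrences_per_year * expected_reduction
    return avoided > rca_cost


# Frequent, expensive fault: 5 person-days each, 12 times a year.
print(rca_is_worthwhile(5, 12, rca_cost=20, expected_reduction=0.5))

# Rare, cheap fault: living with it may be acceptable.
print(rca_is_worthwhile(1, 3, rca_cost=20, expected_reduction=0.5))
```

In practice the estimate would also consider effects that are hard to express in person-days,
such as customer impact.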

1.3.2 Data Collection

It is important to start gathering data as soon as possible after the occurrence has been
identified, to ensure that no information is lost. The information to be collected consists of
conditions before, during, and after the occurrence; personnel involvement (including actions
taken); environmental factors; and other information relevant to the occurrence.

When the analysis involves SW/HW faults, information from the Fault Management system described
in section 1.2 may also be used, if available.

1.3.3 Assessment

The RCA should include the following steps:

o identify the problem
o determine the significance of the problem
o identify the causes (conditions or actions) immediately preceding and surrounding the
problem
o identify the reasons why the causes in the preceding step existed, walking back to the root
cause (the fundamental reason which, if corrected, will prevent recurrence of these and
similar occurrences). The identification of this root cause completes the assessment phase.
If the number of faults under analysis is large, it may be useful to group them according to
root cause categories (e.g. lack of communication, lack of training, process deficiency, process
not correctly executed, component failure, human error).
Such analysis is performed by a team of experts (as above) who have the best understanding of the
problem under study.
Root Cause Analysis experts should be effective:

o root causes must be identified in detail, avoiding generic classifications such as "external
factor"
o root causes must be identified so that Management can influence them: for instance, if the
root causes for late delivery of parts are severe weather conditions and truck engine failures,
Management can address the latter by taking action to improve maintenance, but likely has
no control over the former
o it is not practical to spend valuable time indefinitely searching for the most remote root
causes, as in today's highly interconnected systems there would always be another root cause
for everything.

Fishbone diagrams and Check sheets can be used as methods supporting this activity.

1.3.4 Corrective Actions Identification and Proposal

For each relevant root cause, corrective actions may be defined to reduce the probability that
the problem will recur. The result of this step is a set of recommendations for effective
corrective actions, such as
o training in areas where problems occur
o revision of the process in the phases that proved to be more error-prone: this may include
- reordering of steps (e.g. reducing excessive concurrency/overlapping of activities)
- introduction of additional steps (e.g. additional meetings or reviews)
- automation of certain activities.

Corrective Action proposals are documented by reporting the

o description of the problem
o phases when it was introduced and when it was detected
o description of the root cause
o proposed corrective action
o corrective action originator (typically the head of the RCA expert team).

1.3.5 Corrective Actions Deployment


The Corrective Action proposals shall be ranked taking into consideration cost/benefit and
added value. The responsibility varies depending on the scope of the corrective action proposal.

Typical ranking criteria are:

o Consequences if faults are not addressed at all
o Cost to implement process improvements for fault prevention
o Expected positive impact on product quality
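
One simple way to apply the three criteria above is to rate each proposal on a common scale and
sort by a combined score. The sketch below is only an illustration: the 1 to 5 scale, the scoring
formula and the example proposals are arbitrary assumptions, not a method defined by this
guideline.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    name: str
    consequence_if_ignored: int  # 1 (negligible) .. 5 (severe)
    implementation_cost: int     # 1 (cheap) .. 5 (expensive)
    quality_impact: int          # 1 (small) .. 5 (large)

    def score(self) -> float:
        # Benefit-like criteria divided by cost: an arbitrary illustrative formula.
        return (self.consequence_if_ignored + self.quality_impact) / self.implementation_cost


# Hypothetical proposals and ratings.
proposals = [
    Proposal("Test automation", 3, 4, 5),
    Proposal("Additional design review", 4, 2, 4),
    Proposal("Training on overload handling", 2, 1, 1),
]

ranked = sorted(proposals, key=Proposal.score, reverse=True)
for p in ranked:
    print(f"{p.score():4.1f}  {p.name}")
```

The ranking is only an aid to the decision; the final selection remains with the persons
responsible for the scope of each proposal.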

As a result of ranking and selection, responsibilities are defined and assigned for the
deployment of Corrective Actions, including the following information:
o Description of deployment actions
o Rationale for the decision to deploy it
o Time and cost spent for fault analysis and correction
o Estimated cost/consequence of not eliminating the root cause (if applicable)
o Estimated cost and benefit of eliminating the root cause
o Person responsible for deployment
o Description of the affected areas
o Description of additionally required test cases and their phases (if applicable), e.g., BBT,
SINT, ST
o Schedule and reporting
o People to be informed of its status

The actual deployment of the Corrective Action is carried out by the identified responsible
person through the assignment of tasks to the persons doing the work and their coordination, the
review of results and the tracking of progress to closure. For very complex changes in process or
tools it is wise to perform a trial (experiment) before wide application, strictly applying the
concept of Continuous Process Improvement.

1.3.6 Follow-up
Follow-up includes determining whether the corrective action has been effective in resolving the
addressed problems. A review verifies the effectiveness of the deployed corrective actions and
the prevention of recurrence.

Typically, evidence of effectiveness can be obtained by means of metrics on product quality or
process performance before and after the introduction of the corrective action.
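
As a minimal illustration of such a before/after comparison, the relative change of a metric can
be computed as shown below. The metric (faults per KLOC) and its values are hypothetical.

```python
def relative_change(before, after):
    """Relative change of a metric; a negative value means the metric went down."""
    if before == 0:
        raise ValueError("baseline must be non-zero")
    return (after - before) / before


# Hypothetical metric: faults per KLOC before and after the corrective action.
before, after = 2.0, 1.5
change = relative_change(before, after)
print(f"{change:+.0%}")  # prints -25%
```

A single pair of values is rarely conclusive; the comparison is more meaningful when the metric
is tracked over several releases or measurement periods.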

In order to share the knowledge and, where appropriate, reuse it for the benefit of other
organizations and processes, information on Root Cause Analysis and Corrective Actions must be
recorded.

2 Responsibility
The following table lists the departments affected by the activities of the RCA process, in
relation to the phases and responsibilities described in the previous sections.

Phase                                                 Responsibility                   Department

Definition of FEKAT RCA Clauses                       jEPG                             BE, SW and HW Development
Alignment of fault management systems                 Tool responsible                 varies
Insertion of FEKAT RCA information                    Fault Correction Responsible     SW and HW Development
FEKAT RCA Data collection and reporting               EQM                              SW and HW Development
FEKAT RCA Data Analysis                               Experts Teams including          SW and HW Development
                                                      Directors, Department
                                                      Responsibles, System
                                                      Integration, System Test,
                                                      TQM.
Ad-hoc RCA start                                      varies                           varies
Ad-hoc RCA data collection                            varies                           varies
Ad-hoc RCA assessment                                 varies                           varies
Ad-hoc RCA Corrective Actions Identification          varies                           varies
and Proposal
Ad-hoc RCA Corrective Actions Deployment              varies                           varies
Ad-hoc RCA Corrective Actions Follow-up               varies                           varies

3 References
[1] A7060-073-A209-*-7635
Customer Fault Handling with FEKAT - Siemens Information and Communication Networks

[2] A30862-X1001-A558-*-76A1
Error Handling for SBS - Siemens AG

[3] A30862-X0711-B014-*-7635
Error Handling for Entity-IT Phase - Siemens AG, for SBS projects: BTS, BTSplus, OMCplus

[4] A30862-X0711-B029-*-7635
User Manual ClearDDTS (Distributed Defect Tracking System) - Siemens AG

[5] CM N Management Escalation for Emergency Cases

[6] Rule 482: Com MN Q-Gate Principles

[7] A1R11401*
Regulation of PEPP Process Documents

ANNEXES

A Definitions

CAUSE        Factor influencing the Effect

ClearDDTS    Distributed Defect Tracking System from Rational

EFFECT       Qualitative Characteristic to improve

FEKAT        FEhler KATalog: Siemens proprietary tool used for fault report management
             (see [1] and [2])

FR           Fault Report

GERAN        GSM EDGE Radio Access Network

MEDAL        MEasurement DAta Library: tool used to store, manage and report data derived
             from FEKAT

RCA          Root Cause Analysis

SBS          Siemens Base Station

B Forms
Embedded below are some examples of templates for recording the findings and corrective actions
of an ad-hoc RCA.

o Basic template "Root Cause Summary Table" (adapted from the internet:
http://www.asq.org/pub/qualityprogress/past/0704/qp0704rooney.pdf)
Embedded file: Root Cause Summary Table.doc

o Template for Root Cause Analysis and Action Plan (adapted from the internet:
http://www.stratosinstitute.com/forms/ONT-rootcauseanalysis.pdf)
Embedded file: RCA Framework.doc

o Template for Corrective Action Report (adapted from the Siemens Medical Solutions intranet:
http://mlvv1i6a.ww005.siemens.net/qmsi404/HDArtifact155.htm)
Embedded file: Corrective Action Report.doc

C Example of ad-hoc RCA


The PowerPoint slide show embedded below is an informal example of an ad-hoc RCA done to
address the problem of the excessively long time needed to process and decide Change Requests.

Embedded file: CR_long_duration-RCA.pps

D RCA: an urban legend


This urban legend has been circulating on the internet since about 2000 :-)

When we see a Space Shuttle sitting on its launch pad, there are two big booster rockets
attached to the sides of the main fuel tank. These are solid rocket boosters, or SRBs. The
SRBs are made by Thiokol at their factory in Utah. The engineers who designed the SRBs
might have preferred to make them a bit fatter, but the specifications did not allow this.
Why?
Because the SRBs had to be shipped by train from the factory to the launch site. The railroad
line from the factory had to run through a tunnel in the mountains. The SRBs had to fit
through that tunnel. The tunnel is slightly wider than the railroad track. Well, the US standard
railroad gauge (width between the two rails) is 4 feet, 8.5 inches. That's an exceedingly
odd number. Why was that gauge used?
Because that's the way they built them in England, and the US railroads were built by English
expatriates.

Why did the English build them like that? Because the first rail lines were built by the same
people who built the pre-railroad tramways, and that's the gauge they used.

Why did "they" use that gauge then? Because the people who built the tramways used the
same jigs and tools that they used for building wagons which used that wheel spacing.

Okay! Why did the wagons have that particular odd wheel spacing? Well, if they tried to use
any other spacing, the wagon wheels would break on some of the old, long distance roads in
England, because that's the spacing of the wheel ruts.

So who built those old rutted roads? The first long distance roads in Europe (and England)
were built by Imperial Rome for their legions. The roads have been used ever since. And the
ruts in the roads? Roman war chariots first formed the initial ruts, which everyone else had
to match for fear of destroying their wagon wheels. Since the chariots were made for (or by)
Imperial Rome, they were all alike in the matter of wheel spacing.

The United States standard railroad gauge of 4 feet, 8.5 inches derives from the original
specification for an Imperial Roman war chariot. Specifications and bureaucracies live forever.

So the next time you are handed a specification and wonder what horse's ass came up with
it, you may be exactly right, because the Imperial Roman war chariots were made just wide
enough to accommodate the back ends of two war horses. Or, in other words, the major design
feature of what is arguably the world's most advanced transportation system was
determined over two thousand years ago by the width of a Horse's Ass!
