Guideline
Contents
0 Preamble
0.1 Scope
0.2 Applicability
0.3 Document management
0.4 History
0.5 Responsibility
1 Generality
1.1 Principles of RCA
1.2 FEKAT RCA
1.2.1 RCA process
1.2.2 Technical aspects
1.2.2.1 RCA clauses
1.2.2.2 Data insertion
1.2.2.3 Data collection and reporting
1.2.2.4 Data analysis
1.3 Ad-hoc RCA
1.3.1 Starting
1.3.2 Data Collection
1.3.3 Assessment
1.3.4 Corrective Actions Identification and Proposal
1.3.5 Corrective Actions Deployment
1.3.6 Follow-up
2 Responsibility
3 References
ANNEXES
A Definitions
B Forms
C Example of ad-hoc RCA
D RCA: an urban legend
0 Preamble
0.1 Scope
This document defines and describes the goals, principles, process and tools for Root Cause Analysis
(RCA) within Com MN PG.
With respect to PPP:D for Radio Access Networks - Process Handbook, RCA is one of the activities
within the Project Review process section of the Quality Management support
process. It may be referenced by other PG Areas as well.
0.2 Applicability
This work instruction applies to the Projects of Siemens Com MN PG.
0.4 History
First Issue, mainly based on the pre-existing SICN ICM N MN PG RA document Root Cause
Analysis (A7060-D73-A719-*-7635), updated in order to contain references to ClearDDTS tool
and a partial redefinition of FEKAT clauses.
Issue 02: formal update to shorten some FEKAT clauses according to maximum allowed length
(up to 20 characters).
Issue 03: generalization and extension as PG-wide rule; update of involved/responsible roles
and of mandatory milestones/baselines for FEKAT RCA execution; introduction of ad-hoc RCA
in addition to regular FEKAT RCA.
0.5 Responsibility
This document is under the responsibility of Siemens Com CD RA.
1 Generality
RCA is a technique appropriate to identify the causes and inner mechanisms that lead to costly
or risky problems related to the quality of the delivered products or the efficiency of the devel-
opment process.
This technique can also be seen as a step towards the implementation of CMMI Level-5 support
Process Area Causal Analysis and Resolution, aimed to identify causes of defects and other
problems and take action to prevent them from occurring in the future.
In other words, the main goals of this practice are:
o systematic identification of root causes of defects (or any other problems)
- selection of data about defects to be analyzed
- analysis of causes
o systematic prevention of future occurrences of defects, addressing the root causes
- identification and deployment of actions for defects prevention
- measurements/evaluation of effects of actions
- recording of data for knowledge base setup / future reuse
The following context diagram (from CMMI presentation material) graphically depicts this concept.
Experience gained at PG over several years and product releases has shown that the most
favourable cost/benefit ratio of RCA comes from systematic application to critical and high-priority
faults detected in System Integration, System Test and Customer Acceptance (i.e., between
D500 and B700). The main source of information in this case is the Fault Management tool
FEKAT. From here on this kind of RCA will be referred to as FEKAT RCA.
RCA can also be applied to earlier review or testing phases, or in general to investigate the true
reasons for any kind of problem that has occurred.
Therefore, in addition to the systematic FEKAT RCA (described in chapter 1.2), an ad-hoc
Root Cause Analysis can also be performed; generic guidance for this topic is given in chapter 1.3.
In these cases, one of the typical trigger events is an EmA (priority A Emergency Case) as described
in section 2.2.3 of document [5] ICM N Management Escalation for Emergency Cases.
1.2 FEKAT RCA
1.2.1 RCA process
The process of FEKAT RCA can be split into the steps summarized in the following pictures and
listed below.
o Definition of RCA Clauses
The objective of this phase is the definition of the Clauses that contribute to the Effect. The list of
clauses is implemented in FEKAT (and any other Fault Management system) and can be
reviewed as feedback of data analysis if required. For a given release, the definition of RCA
clauses (either confirming the existing ones or introducing updates) must be completed
within D300, in order to apply it effectively in the following testing phases.
o Alignment of Fault Management system
In order to have reliable collected information, the RCA activities are supported by the Fault
Management system that has to be tailored whenever RCA clauses are changed.
o Insertion of RCA Information
The goal is the detection of the RCA clauses as soon as faults are corrected.
The RCA applies to all Fault Reports (FRs) originating typically from System Integration,
System Test and Customers and matching the following criteria:
- Priority 1 or 2
- Solved by Development.
The Fault Correction Responsible fills in the RCA fields when correcting fault reports, using the
appropriate fields of the Fault Management system.
This activity can be performed at any testing stage, starting at D300, until B800; it is mandatory
between D400 and B800.
o Data Collection, Reporting, Analysis and Deployment.
Data are collected by means of:
- Extraction of RCA fields from Fault Management systems (e.g., FEKAT, OmniTracker),
importing such data to the Metric Database (e.g., MEDAL tool), from where
subsequent exports can be done (e.g., to Excel format).
- Formal analysis of completeness and correctness of the values inserted; if relevant information
is missing or wrong, the Fault Correction Responsible is requested to complete or
correct the values.
The scope of Data Reporting is to gather and present data in an easy-to-analyze form, in order
to get reliable and easy-to-handle information.
The analysis is performed at all milestones and baselines between D500 and B800 in the
scope of a "lessons learned rule" aimed to
- identify measures derived from RCA
- have these measures approved by the Release Manager, if of release-specific character, or
by the PG Quality Board, if of general character
- ensure deployment of measures under the responsibility of the respective Quality Manager.
This may be the EQM, if of entity character, the PQM, if of Release character,
the RQM, if of Product line or process character, or Com MN PG BE
- identify metrics to monitor later on the effectiveness of the defined measures
- ensure implementation of measures strictly following the Continuous Process Improvement.
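The selection criteria and the formal completeness check listed above can be sketched in a few lines of Python. This is only an illustration: the record layout (`FaultReport` and its field names) is hypothetical and does not reflect the actual FEKAT, OmniTracker or MEDAL data model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FaultReport:
    """Hypothetical fault report record; real tool exports will look different."""
    fr_id: str
    priority: int                 # 1 = highest priority
    solved_by_development: bool
    pfi: Optional[str] = None     # phase of fault injection
    eic: Optional[str] = None     # error injection cause
    ped: Optional[str] = None     # phase of possible earliest detection
    mdc: Optional[str] = None     # missed detection cause


def in_rca_scope(fr: FaultReport) -> bool:
    """Criteria from the text: priority 1 or 2 and solved by Development."""
    return fr.priority in (1, 2) and fr.solved_by_development


def missing_rca_fields(fr: FaultReport) -> List[str]:
    """Names of the RCA fields that are still empty."""
    return [name for name in ("pfi", "eic", "ped", "mdc") if not getattr(fr, name)]


# Invented sample data, for illustration only.
reports = [
    FaultReport("FR-1", 1, True, "SBS_PFI_DESIGN", "SBS_EIC_SPECIFICAT",
                "SBS_PED_DESIGNREVIEW", "SBS_MDC_STAFFING"),
    FaultReport("FR-2", 2, True, pfi="SBS_PFI_IMPLEMENTAT"),
    FaultReport("FR-3", 3, True),
]

# Fault reports to send back to the Fault Correction Responsible.
to_follow_up: Dict[str, List[str]] = {
    fr.fr_id: missing_rca_fields(fr)
    for fr in reports
    if in_rca_scope(fr) and missing_rca_fields(fr)
}
print(to_follow_up)
```

In this sketch FR-3 is ignored because of its priority, and only FR-2 is reported as incomplete.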
1.2.2 Technical aspects
This section describes the activities needed to perform the process steps detailed above.
In particular focus is put on:
o RCA clauses
o Filling RCA data within Fault Management systems (e.g., in FEKAT)
o Extraction of RCA data (e.g., in MEDAL)
o Data Reporting
o Data Analysis
o Definition of Corrective Actions
o Deployment of Corrective Actions
o Review of effectiveness of Corrective Actions.
The aim of RCA is the improvement of the process through the analysis of problems, by means
of a set of activities including:
o Fault injection analysis: investigation of when and why problems have been introduced
o Missed fault detection analysis: investigation of when the problems could have been detected,
and why they were not detected earlier.¹
The RCA Clauses have been grouped into four clusters, in order to simplify the work of Fault
Correction Responsible, thus increasing the reliability of the data collected, but also having in
mind a reasonable trade-off between the data reliability and the granularity of the gathered in-
formation.
It is important to point out that, in answering the questions, the Fault Correction Responsible filling
in all RCA related fields should try to ask "why" many times, not stopping at the most immediate
and easy answer, in order to go back to the real roots of the problem!²
The Fault Correction Responsible shall answer all 4 questions about the fault (phase and cause of
introduction, phase where it could have been detected, and cause of missed detection).
In the following paragraphs, the defined RCA Clauses are underlined, and the related FEKAT
Codes are reported within parentheses³; other Fault Tracking tools (ClearDDTS, OmniTracker,
etc.) allow direct selection of RCA clauses (no code is needed).
The current definition of RCA Clauses results from about 10 years of application of this practice,
and has been periodically revised and tuned, according to the first two steps described in
section 1.2.1.
Phases where the faults were injected (PFI - Phase of Fault Injection)
The phases where the faults can be inserted are defined as follows:
o Preanalysis (SBS_PFI_PREANALYSIS)
o Analysis (SBS_PFI_ANALYSIS)
o Design (SBS_PFI_DESIGN)
o Implementation (SBS_PFI_IMPLEMENTAT)
o CR Implementation (SBS_PFI_CRIMPLEMENT)
o Error Correction (SBS_PFI_ERRORCORR).
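As an illustration, the clause-to-code mapping above can be held in a small lookup table and checked against the 20-character limit mentioned in the History section (Issue 02). The helper names `too_long` and `clause_for` are hypothetical and are not part of FEKAT.

```python
MAX_CODE_LENGTH = 20  # maximum clause length mentioned under Issue 02

# PFI clauses and FEKAT codes exactly as listed in this guideline.
PFI_CODES = {
    "Preanalysis": "SBS_PFI_PREANALYSIS",
    "Analysis": "SBS_PFI_ANALYSIS",
    "Design": "SBS_PFI_DESIGN",
    "Implementation": "SBS_PFI_IMPLEMENTAT",
    "CR Implementation": "SBS_PFI_CRIMPLEMENT",
    "Error Correction": "SBS_PFI_ERRORCORR",
}


def too_long(codes, limit=MAX_CODE_LENGTH):
    """Return the codes that exceed the allowed length."""
    return [code for code in codes.values() if len(code) > limit]


def clause_for(code, codes=PFI_CODES):
    """Reverse lookup: FEKAT code -> clause name, or None if the code is unknown."""
    for clause, value in codes.items():
        if value == code:
            return clause
    return None


print(too_long(PFI_CODES))           # empty list when every code fits
print(clause_for("SBS_PFI_DESIGN"))
```

The same table layout could be reused for the EIC, PED and MDC clauses.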
¹ It is important to note that this analysis, especially when focused on high-priority Customer Faults, is also
equivalent to a so-called escape analysis, aimed at understanding why such faults were not properly filtered
(i.e., escaped) by earlier review and testing stages.
² If you are already bored from reading this paper so far, maybe you can take a break, relax and
read Annex D, where a witty example of RCA explains this concept.
³ These are values defined for GERAN-related projects, identified by the prefix SBS_.
A1R16851 DS:03 SC:435 07-July-2005 Page 9 of 28
Copyright Siemens AG For internal use only
Guideline Root Cause Analysis
In the following, the RCA clauses for error injection causes are described in detail.
A mapping is also provided with more detailed clauses that must be considered as examples of
the main cause, representing possible reasons for error injection in the different phases of the
project lifecycle.
o Specifications (SBS_EIC_SPECIFICAT)
Wrong specification
Specification contained wrong information
e.g., a message was described as containing 4 fields instead of 5
Unclear specification
A detail in the specification was ambiguous and led the programmer to a wrong inter-
pretation.
e.g., "The variable ENV must be set to its starting value", without describing what the
starting value is supposed to be.
Unstable specification
Specifications changed frequently during implementation; the continual updating of
code to follow specification changes caused error insertion.
e.g., some procedure or variable that was no longer needed was left in the latest version
of the SW code; this interacted with other SW parts and caused wrong behaviour.
Missing specification⁴
Important information was not contained in the specification, and this led to an error
in implementation
e.g., the specification fails to mention that some given messages must be sent in a
specific sequence to obtain a correct behaviour of the system
⁴ 'Missing specification' can also be interpreted as requirement
o Staffing (SBS_EIC_STAFFING)
Staff shortage
There was a staff shortage during the project due to people leaving the Company
e.g., the project started with 130 resources, but during the development 20 of them
resigned. This led to an overload of work for those who remained, causing lower-quality
work and the introduction of errors.
Overlapping releases
A new release was scheduled when the previous one was not yet completed and this
created an overload on Designers, which led to worse code quality and the introduction
of errors
Resources shifted to other projects
The start of other projects caused some resources to be shifted; this led to a staff shortage
Unexpected extra load
The load of work increased for some unexpected cause: a new, unplanned activity,
the necessity to repeat some tasks already done, etc.
o Guidelines (SBS_EIC_GUIDELINES)
Operating procedure not available
The operating procedure guiding the process was not available when needed; this
led to ambiguous or wrong decisions, which introduced some errors
Operating procedure insufficient
The operating procedure guiding the process was in some sense incomplete (i.e. not
finished, or lacking some necessary information); this led to ambiguous or wrong de-
cisions, which introduced some errors
Phases where faults should have been detected earliest (PED - Phase of possible
Earliest Detection)
The phases, during which faults should have been detected, with reasonable effort and in the
scope of the relevant test phase, are defined as follows:
o Preanalysis Review (SBS_PED_PREANALYSIS)
o Analysis Review (SBS_PED_ANALYSREVIEW)
o Design Review (SBS_PED_DESIGNREVIEW)
o Coding Review (SBS_PED_CODEREVIEW)
o Debug / Module Testing (SBS_PED_DEBUGMODTEST)
o Off-Line Testing (SBS_PED_OFFLINETEST)
o White Box Testing (SBS_PED_WHITEBOXTEST)
o Black Box Testing (SBS_PED_BLACKBOXTEST)
o SBS Integration Testing (SBS_PED_INTEGRATION)
For all the phases in which the test is not directly performed by the Fault Correction Responsible who
enters the RCA information (typically the last two cases), the Fault Correction Responsible is likely
not able to figure out why someone else (the Tester) failed to detect the fault. In
such cases the Fault Correction Responsible will have to contact the Tester to get information about the
following field (Causes for missed error detection).
[Figure: cause-and-effect diagram of the causes of Missed Error Detection]
The person responsible for entering the RCA data into the relevant Fault Management tool has to
categorize the reason why the error was not found in the phases/activities defined above. The
following categories are defined:
In the following, samples are provided of the meaning of various root causes.
The test activity was not well conducted because the development environment was
not well known
o Staffing (SBS_MDC_STAFFING)
Activity not fully executed due to resource shortage
Lack of resources prevented the complete testing of the product.
o Guidelines (SBS_MDC_GUIDELINES)
Guidelines about testing activities were either missing or incomplete
The causes in each group are examples of the main cause, representing the possible reasons of
missed error detection in the different phases of the release life-cycle.
In some cases this field might not be meaningful, as explained at the end of the next paragraph.
Between D300 and B800, when identified faults have been analysed and corrected, the relevant
data for RCA is entered by the Fault Correction Responsible (typically HW or SW-engineer) into
the Fault Management system.
To ensure the semantic reliability and accuracy of input data, the Fault Correction Responsible must
identify the real root cause, avoiding easy and simple answers (in early SBS releases, a possible
cause for EIC was "Designer Error", which collected more than 90% of answers just because it
was the easiest one).
It is also important to avoid input of nonsense / inconsistent data. Following is a short list of
inconsistency examples:
If an error could not be found earlier than it was actually found, it makes no sense to ask why it was
not found earlier (because it was indeed found).
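A consistency rule of this kind can be checked automatically before the data are analysed. The sketch below is only an illustration under stated assumptions: the flag `could_be_found_earlier` is a hypothetical input (the guideline does not define such a field), and the second rule simply restates the requirement that all four questions are answered when they are meaningful.

```python
from typing import List, Optional


def rca_inconsistencies(could_be_found_earlier: bool, mdc: Optional[str]) -> List[str]:
    """Return a list of problems found in one RCA record (empty list = consistent).

    could_be_found_earlier: hypothetical flag, True if an earlier phase
        could reasonably have detected the fault.
    mdc: missed detection cause entered by the Fault Correction Responsible.
    """
    problems = []
    if not could_be_found_earlier and mdc:
        problems.append("MDC filled although the fault could not have been found earlier")
    if could_be_found_earlier and not mdc:
        problems.append("MDC missing although the fault could have been found earlier")
    return problems


print(rca_inconsistencies(False, "SBS_MDC_STAFFING"))
print(rca_inconsistencies(True, "SBS_MDC_GUIDELINES"))
```

Records with a non-empty result would be returned to the Fault Correction Responsible for correction.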
The extraction of the RCA raw data from Fault Management system data is made easier by
appropriate tooling.
For example, at Com MN PG R, the MEDAL (Measurement Data Library) tool enables an easy
display of the raw data originating from the Fault Management systems. In MEDAL, an RCA query
can be selected by clicking on options Raw Data Faults RCA view, and applying further
filtering, if desired (e.g. on internal/customer originator, on priority); the results can be easily
exported to Excel format.
RCA raw data will be subsequently processed and elaborated in order to obtain summary infor-
mation about the phases and causes of error injection and missed error detection for each
release, as well as for trends over releases.
The MEDAL tool is reachable through Milan Com CD RA Intranet portal; direct link is
http://ik2sw001.icn.siemens.it/medal2/login_new.asp.
Raw data are processed and elaborated in the scope of Lesson Learned and Project Review
meetings, in order to obtain analysis of fault insertion and missed error detection phases and
causes for the current release and of trend over releases.
Data for each of the four RCA clusters (phase of fault insertion, cause of fault injection, phase
of earliest possible detection, cause of missed detection) are grouped and a relative percentage
is calculated for each phase, in order to obtain a ranking.
The results can be summarised either in table or graphical format.
It is advised to report percentages rather than absolute values, because this allows comparison
and trend analysis across different releases that may have different numbers of faults depending
on size.
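The grouping, percentage calculation and ranking described above can be sketched as follows. The function name `clause_percentages` and the sample values are invented for illustration; they are not taken from MEDAL or from real release data.

```python
from collections import Counter


def clause_percentages(values):
    """Group clause values, compute their relative percentage and rank them.

    values: one clause value per fault report (e.g. the PFI of each FR).
    Returns a list of (clause, percentage) pairs, highest percentage first.
    """
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return []
    shares = [(clause, 100.0 * n / total) for clause, n in counts.items()]
    return sorted(shares, key=lambda item: item[1], reverse=True)


# Invented sample: PFI values of 10 fault reports.
pfi_values = ["Implementation"] * 6 + ["Design"] * 3 + ["Analysis"]
for clause, share in clause_percentages(pfi_values):
    print(f"{clause:<20}{share:5.1f}%")
```

The same function can be applied unchanged to the EIC, PED and MDC values of a release.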
The following examples refer to the table for phase of fault insertion (PFI), but can similarly be
applied to the other RCA clauses (EIC, PED, MDC):
For graphical representation, among the many possible chart types (histogram, pie, etc.) the
most suitable one is a Pareto chart, so that attention and definition of countermeasures can be
concentrated on the most relevant causes.
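The data behind a Pareto chart are the ranked percentages together with their running (cumulative) total. A minimal sketch follows, assuming an 80% cut-off for the "most relevant causes"; that threshold is an assumption made for illustration and is not prescribed by this guideline, and the counts are invented.

```python
def pareto_rows(counts):
    """Return (clause, percentage, cumulative percentage) rows, largest first."""
    total = sum(counts.values())
    if total == 0:
        return []
    rows, cumulative = [], 0
    for clause, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        cumulative += n
        rows.append((clause, 100.0 * n / total, 100.0 * cumulative / total))
    return rows


def main_causes(counts, threshold=80.0):
    """Clauses that together reach the cumulative threshold (assumed 80%)."""
    selected = []
    for clause, _share, cumulative in pareto_rows(counts):
        selected.append(clause)
        if cumulative >= threshold:
            break
    return selected


# Invented fault counts per phase of fault injection.
counts = {"Implementation": 50, "Design": 30, "Analysis": 15, "Preanalysis": 5}
print(pareto_rows(counts))
print(main_causes(counts))
```

With these sample counts, countermeasures would be concentrated on the first two clauses of the ranking.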
Implementation 74.0%
CR Implementation 6.1%
Analysis 2.3%
Design 2.3%
Preanalysis 0.0%
The same analysis is performed about
The evaluation of the trend of the phases and causes of fault insertion and missed error detection,
through an analysis of the data coming from the different releases, can provide additional
information, and also allows checking the effectiveness of a countermeasure aimed at reducing a given
root cause.
In order to make the analysis independent of the relative size of the release and the absolute
number of fault reports, the data can be presented as relative percentages.
The results can be summarised in a graph, from which the trend over releases for each RCA
clause can be easily seen (the following example is from entity BLT of GERAN SBS releases).
[Figure: bar chart of the phase of fault injection per release (BR2.1, BR3.0, BR3.6, BR3.7, BR4.0, BR4.5, BR5.0, BR5.5, BR6.0(1), BR6.02, BR7.0) on a 0% to 100% scale; legend: Pre-Analysis, Analysis, Design, Implementation, CR Implementation, Error Correction, Previous releases, Included SW Products]
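The trend of a single RCA clause over releases can be derived from per-release fault counts, as in the sketch below. The release names are borrowed from the example chart, but the counts are invented for illustration and the function name `clause_trend` is hypothetical.

```python
def clause_trend(counts_by_release, clause):
    """Relative percentage of one clause in each release, in the given release order.

    counts_by_release: {release name: {clause: number of fault reports}}
    """
    trend = []
    for release, counts in counts_by_release.items():
        total = sum(counts.values())
        share = 100.0 * counts.get(clause, 0) / total if total else 0.0
        trend.append((release, share))
    return trend


# Invented counts; only the release names come from the example chart.
counts_by_release = {
    "BR5.0": {"Implementation": 40, "Design": 10},
    "BR5.5": {"Implementation": 30, "Design": 20},
}
print(clause_trend(counts_by_release, "Implementation"))
```

A falling share for a clause after a countermeasure has been deployed is one possible indication that the countermeasure is effective.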
1.3 Ad-hoc RCA
Whenever an unwanted situation occurs which consumes resources and tends to happen
repeatedly, it is required to figure out what is really causing this situation, and remove that cause in
order to avoid such a situation in future. The result of this practice will be not just to cure the
symptoms, but to collect and analyze data that will allow diagnosing, addressing and finally re-
moving the real cause of the problem.
The following paragraphs, based on the CMMI Support Process Area CAR (Causal Analysis and
Resolution) propose a structured approach to provide answers.
The table below maps CMMI CAR Specific Practices to next paragraphs of this guideline.
1.3.1 Starting
In any case, before triggering RCA, a cost/benefit evaluation must be done, comparing the im-
pact of the faults and their frequency of occurrence against the added value and the cost of the
RCA (the time and resources needed).
This evaluation might even lead to the result that it is acceptable, on cost/benefit ratio consid-
erations, to live with the fault, or just apply immediate countermeasure to cure the symptoms for
instant relief.
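The cost/benefit evaluation can be made explicit with a very simple model. The sketch below is only one possible way to frame the comparison; the formula, the parameter names and the figures are assumptions for illustration and are not prescribed by this guideline.

```python
def rca_worthwhile(impact_per_occurrence, expected_occurrences, rca_cost,
                   expected_reduction=1.0):
    """Compare the expected benefit of an RCA with its cost.

    impact_per_occurrence: cost of one occurrence of the problem
    expected_occurrences:  how often the problem is expected to recur
    rca_cost:              time and resources needed for the RCA, in the same unit
    expected_reduction:    assumed fraction of recurrences the RCA will prevent
    """
    expected_benefit = impact_per_occurrence * expected_occurrences * expected_reduction
    return expected_benefit > rca_cost


# Invented figures, e.g. in person-days.
print(rca_worthwhile(10, 5, 20))   # frequent, costly problem
print(rca_worthwhile(2, 3, 20))    # rare, cheap problem
```

When the result is negative, living with the fault or applying only an immediate countermeasure may be the acceptable outcome described above.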
1.3.2 Data Collection
It is important to start gathering data as soon as possible following the occurrence identification,
to ensure that no information is lost. The information that should be collected consists of condi-
tions before, during, and after the occurrence; personnel involvement (including actions taken);
environmental factors; and other information having relevance to the occurrence.
When the analysis involves SW/HW Faults, information from Fault Management system de-
scribed in section 1.2 may be also used, if available.
1.3.3 Assessment
o root causes must be identified in detail, avoiding generic classifications such as "external
factor"
o root causes must be identified so that Management can influence them: for instance, if the root
causes for late delivery of parts are severe weather conditions and truck engine failures,
Management can address the latter by taking action to improve maintenance, but
likely has no control over the former
o it is not practical to spend too much valuable time indefinitely searching for the most remote
root causes, as in today's highly interconnected systems there would always be
another root cause for everything.
Fishbone diagrams and Check sheets can be used as methods supporting this activity.
1.3.4 Corrective Actions Identification and Proposal
For each relevant root cause, corrective actions may be defined, to reduce the probability that the
problem will recur. The result of this step is the generation of recommendations to implement
effective corrective actions, such as
o training in areas where problems occur
o revision of process in the phases that resulted to be more error-prone: this may include
- reordering of steps (e.g. reduce too high concurrency/overlapping of activities)
- introduction of additional steps (e.g. additional meetings or reviews)
- automation of certain activities.
As a result of ranking and selection, responsibilities are defined and assigned for the deployment
of Corrective Actions, including the following information:
o Description of deployment actions
o Rationale for decision to deploy it
o Time and cost spent for fault analysis and correction
o Estimated cost/consequence of not eliminating the root cause (if applicable)
o Estimated cost and benefit of eliminating the root cause
o Person responsible for deployment
o Description of the affected areas
o Description of additionally required test cases and their phases (if applicable), e.g., BBT,
SINT, ST
o Schedule and reporting
o People to be informed of its status
1.3.5 Corrective Actions Deployment
The actual deployment of the Corrective Action is carried out by the identified responsible person
through assignment of tasks to the persons doing the work and their coordination, the review of
results and the tracking of progress to closure. For very complex changes in process or tools it
is wise to perform a trial (experiment) before wide application, applying strictly the concept of
Continuous Process Improvement.
1.3.6 Follow-up
Follow-up includes determining whether the corrective action has been effective in resolving the addressed
problems. A review ensures the effectiveness of the deployed corrective actions and the
prevention of recurrence.
In order to share the knowledge and possibly reuse it, as appropriate, for the benefit of other
organizations and processes, information on Root Cause Analysis and Corrective Actions must be
recorded.
2 Responsibility
In the following table, the Departments impacted by the deployment of activities during the RCA process are
listed in relation to the Phases and Responsibilities described in the previous sections.
3 References
[1] A7060-073-A209-*-7635
Customer Fault Handling with FEKAT - Siemens Information and Communication Networks
[2] A30862-X1001-A558-*-76A1
Error Handling for SBS - Siemens AG
[3] A30862-X0711-B014-*-7635
Error Handling for Entity-IT Phase Siemens AG for SBS projects: BTS, BTSplus, OMCplus
[4] A30862-X0711-B029-*-7635
User Manual ClearDDTS Siemens AG (Distributed Defect Tracking System)
[7] A1R11401*
Regulation of PEPP Process Documents
ANNEXES
A Definitions
FEKAT FEhler KATalog: Siemens proprietary tool used for fault report
management (see [1] and [2])
FR Fault Report
B Forms
Find embedded below some examples of templates for recording findings and corrective actions
of ad-hoc RCA.
Root Cause Summary Table.doc
o Template for Root Cause Analysis and Action Plan (adapted from internet -
http://www.stratosinstitute.com/forms/ONT-rootcauseanalysis.pdf)
RCA Framework.doc
Corrective Action Report.doc
CR_long_duration-RCA.pps
D RCA: an urban legend
When we see a Space Shuttle sitting on its launch pad, there are two big booster rockets
attached to the sides of the main fuel tank. These are solid rocket boosters, or SRBs. The
SRBs are made by Thiokol at their factory in Utah. The engineers who designed the SRBs
might have preferred to make them a bit fatter, but the specifications did not allow this.
Why?
Because the SRBs had to be shipped by train from the factory to the launch site. The railroad
line from the factory had to run through a tunnel in the mountains. The SRBs had to fit
through that tunnel. The tunnel is slightly wider than the railroad track. Well, the US stan-
dard railroad gauge (width between the two rails) is 4 feet, 8.5 inches. That's an exceedingly
odd number. Why was that gauge used?
Because that's the way they built them in England, and the US railroads were built by English
expatriates.
Why did the English build them like that? Because the first rail lines were built by the same
people who built the pre-railroad tramways, and that's the gauge they used.
Why did "they" use that gauge then? Because the people who built the tramways used the
same jigs and tools that they used for building wagons which used that wheel spacing.
Okay! Why did the wagons have that particular odd wheel spacing? Well, if they tried to use
any other spacing, the wagon wheels would break on some of the old, long distance roads in
England, because that's the spacing of the wheel ruts.
So who built those old rutted roads? The first long distance roads in Europe (and England)
were built by Imperial Rome for their legions. The roads have been used ever since. And the
ruts in the roads? Roman war chariots first formed the initial ruts, which everyone else had
to match for fear of destroying their wagon wheels. Since the chariots were made for (or by)
Imperial Rome, they were all alike in the matter of wheel spacing.
The United States standard railroad gauge of 4 feet, 8.5 inches derives from the original
specification for an Imperial Roman war chariot. Specifications and bureaucracies live for-
ever.
So the next time you are handed a specification and wonder what horse's ass came up with
it, you may be exactly right, because the Imperial Roman war chariots were made just wide
enough to accommodate the back ends of two war horses. Or, in other words, the major de-
sign feature of what is arguably the world's most advanced transportation system was
determined over two thousand years ago by the width of a Horse's Ass!