
A Case Study in the Use of Defect Classification in Inspections

Diane Kelly and Terry Shepard


Royal Military College of Canada
Abstract
In many software organizations, defects are classi-
fied very simply, using categories such as Minor,
Major, Severe, Critical. Simple classifications of
this kind are typically used to assign priorities in
repairing defects. Deeper understanding of the
effectiveness of software development methodolo-
gies and techniques requires more detailed classifi-
cation of defects. A variety of classifications has
been proposed.
Although most detailed schemes have been devel-
oped for the purpose of analyzing software pro-
cesses, defect classification schemes have the
potential for more specific uses. These uses require
the classification scheme to be tailored to provide
relevant details. In this vein, a new scheme was
developed to evaluate and compare the effective-
ness of software inspection techniques. This paper
describes this scheme and its use as a metric in two
empirical studies. Its use was considered success-
ful, but issues of validity and repeatability are dis-
cussed.
Keywords
Software engineering, software maintenance,
software metrics, software testing, software
validation, orthogonal defect classification.
1. INTRODUCTION
Classification of software defects can be as simple
as specifying major/minor or as detailed as the
scheme described by Beizer [2]. For deciding
whether to assign resources to fix defects, major/
minor may be sufficient. To assess sources of
defects and trouble spots in a large complex system,
something more detailed is needed.
As described in [6], defect classifications have
been successfully used to analyze and evaluate dif-
ferent aspects of software development. Several
organizations have developed defect classification
schemes to identify common causes of errors and
develop profiles of software development method-
ologies.
For example, IBM has been refining a defect
classification scheme for about ten years [3,4,7].
The IBM scheme is intended to provide analysis
and feedback to steps in the software process that
deal with defect detection, correction and preven-
tion. As software techniques have evolved, IBM's
defect classification has changed to add support for
new areas such as object oriented messaging and
international language support.
Defect classification schemes are concerned
with removing the subjectivity of the classifier and
creating categories that are distinct, that is, orthogo-
nal. IBM's scheme defines an orthogonal defect
classification (ODC) by defining a relatively small
number of defect types. The thought is that with
fewer choices for any defect, the developer can
choose accurately among the types.
To evaluate the results of an empirical study of
specific software inspection techniques, we devel-
oped a classification scheme specific to our needs.
Similar to exercises carried out at IBM and Sperry
Univac [6], we used findings from an extensive
industrial inspection exercise [8] to develop a
detailed defect classification specifically for com-
putational code (ODC-CC) [9]. Each category in
the classification was then associated with one of
four levels of understanding, based on the perceived
conceptual difficulty of finding a defect in a given
category. Using these levels of understanding, the
results from two inspection experiments were ana-
lyzed to determine if a new software inspection
technique encouraged inspectors to gain a deeper
understanding of the code they were inspecting
[10].
Software inspection is recognized as an effec-
tive defect detection technique, e.g. [14]. Research
into improving this effectiveness has focused both
on the inspection process, e.g. [13], [17] and on
individual inspection techniques, e.g. [11], [12]. In
each of the winters of 2000 and 2001, we conducted
an experiment to examine the effectiveness of a
new individual inspection technique called task-
directed inspection (TDI) [9]. Instead of using sim-
ple findings counts (e.g. [13]) as a basis for the
analysis of the technique, we used the ODC-CC to
differentiate between defects based on the different
levels of understanding that findings represent.
Validating the completeness of the coverage of
the ODC-CC was straightforward, but validating
the orthogonality of the ODC-CC is more problem-
atic. Validation of the ODC-CC was carried out as
part of both experiments. Details are given in the
rest of the paper, with the main emphasis on the
second experiment.
2. Defect Classification Schemes
Defect classification schemes can be created for
several purposes, including:
making decisions during software development,
tracking defects for process improvement,
guiding the selection of test cases, and
analyzing research results.
This paper illustrates a defect classification scheme
used for the last of these purposes.
The 1998 report by Fredericks and Basili [6]
provides an overview of classification schemes that
have been developed since 1975. The goals for
most of the schemes, from companies such as HP,
IBM and Sperry Univac, are to identify common
causes for defects in order to determine corrective
action.
Our ODC-CC is unique among the defect classi-
fication schemes that we know of, in that it was
developed specifically to analyze the results of
inspection experiments. In other words, the activity
of software inspection is analyzed in isolation from
the software development process, without worry-
ing about the cause of the defects or the action
taken to fix the defect. This is a very different view-
point from those of other classification schemes. As
a point of comparison, we describe the IBM ODC
which has been evolving over the past ten years and
describe how the ODC-CC differs.
3. The IBM Orthogonal Defect Classifi-
cation Scheme
The IBM Orthogonal Defect Classification (ODC)
was originally described in the paper by Chillarege
et al. in 1992 [4].
The goal of the IBM ODC as described by Chillarege et al. is to provide a measurement paradigm
to extract key information from defects and use that
information to assess some part of a software devel-
opment process for the purpose of providing correc-
tive actions to that process. The application of the
1992 version of the IBM ODC involves identifying,
for each defect, a defect trigger, a defect type, and a
defect qualifier. More recent versions of the IBM
ODC [7] include the activity that uncovered the
defect, defect impact, and defect target, age and
source as well as the originally described trigger,
type, and qualifier. The activity, trigger, and impact
are normally identified when the defect is found;
the others are normally identified after the defect
has been fixed.
The activities in the current version of the IBM
ODC (v. 5.11) include design reviews, code inspec-
tion, and three kinds of testing. For the purpose of
this paper, we focus on code inspection. There are
nine defect triggers assigned by the inspector to
indicate the event that prompted the discovery of
the defect.
Impact presents a list of thirteen qualities that
define the impact the defect may have on the cus-
tomer if the defect escapes to the field.
Assigned at fix time, defect types are described
in the 1993 paper by Chaar et al [3] as: assignment,
checking, algorithm, timing/serialization, interface,
function, build/package/merge, and documenta-
tion. In version 5.11 of the IBM ODC [7], the inter-
face defect type has been expanded to include
object messages and the algorithm defect type has
been expanded to include object methods. Build/
package/merge and documentation have been
removed from the defect type list. An additional
defect type now appears: relationship, defined as
"problems related to associations among proce-
dures, data structures, and objects."
The defect qualifier, also assigned to the defect
at the time of the fix, evolved from two qualifiers,
Missing and Incorrect, to include a third qualifier,
Extraneous [7]. As an example, a section of docu-
mentation that is not pertinent and should be
removed, would be flagged as Extraneous.
Target represents the high level identity of the
entity that was fixed, for example, code, require-
ments, build script, user guide.
Age identifies the defect as being introduced in:
base: part of the product that was not modified by
the current project; it is a latent defect,
new: new functionality created for this product,
rewritten: redesign or rewrite of an old function,
refix: a fix of a previous (wrong) fix of a defect
Source identifies the development area that the
defect was found in: developed in-house, reused
from library, outsourced, or ported.
Since our goals for defect classification are dif-
ferent from those of the IBM ODC, the IBM ODC
is not completely suitable for our analysis. The
IBM ODC serves instead as a starting point and as a
point of reference for describing the ODC-CC.
4. Inspection Experiments
In 1996, one of us developed a new technique for
guiding and motivating inspections of computa-
tional code [8]. In the winter of 2000, we conducted
a first experiment to evaluate the effectiveness of
this new inspection technique, called task-directed
inspection (TDI) [9]. A second experiment was
conducted in the winter of 2001. The need to ana-
lyze results of the experiments led to the develop-
ment of the ODC-CC, and a metric based on it, as
described in the following sections. The intent of
the metric is to differentiate inspection results in
such a way that the TDI could be compared to
industry standard inspection techniques such as ad
hoc or paraphrasing.
The TDI technique, similar to scenario-based
inspection techniques [16], provides structured
guidance to inspectors during their individual work.
The TDI technique piggybacks code inspections on
other software development tasks and uses the
familiarity the inspector gains with the code to
identify issues that need attention. In the applica-
tion of TDI so far, the software development tasks
that combine readily with code inspections are doc-
umentation tasks and development of test cases.
Both experiments involved graduate students enrolled in a Software Verification and Validation graduate course [15] offered at Queen's and the Royal Military College (RMC). The experiments each consisted of applying three different inspection techniques to three different code pieces drawn from computational software used by the military. The computational software chosen was written in Visual Basic and calculates loads on bridges due to vehicle convoy traffic. The pieces of code chosen for the experiments were all of equivalent length and were intended to be of equivalent complexity. The pieces were not seeded with defects, so as not to predetermine the types of defects in the code.
The three inspection techniques chosen for the
experiments consisted of one industry standard
technique and two TDI techniques. The industry
standard technique used was paraphrasing (reading
the code and acquiring an understanding of the
intent of the code without writing the intent down).
The two TDI techniques used in the first experi-
ment were Method Description and White Box Test
Plan. Method Description required the student to
document in writing the logic of each method in the
assigned piece of code. White Box Test Plan
required the student to describe a series of test cases
for each method by providing values for controlled
variables and corresponding expected values for
observed variables.
For the second experiment, three different
pieces of code were chosen from the same military
application. The new pieces were shorter and turned
out to be less complex. The White Box Test Plan
was simplified to a Test Data Plan. This involved
identifying variables participating in decision state-
ments in the code and listing the values those vari-
ables should take for testing purposes.
Both experiments were a partial factorial,
repeated measures design where each student used
all three techniques on the three different code
pieces. The application of a technique to a code
piece is referred to as a round. The paraphrasing
technique was always used first, with the two TDI
techniques being alternated amongst the students
during rounds 2 and 3. The code pieces used were
permuted amongst the students and the rounds. For
example, student 1 may use code pieces 3, 2, 1 in
rounds 1, 2 and 3 while student 2 uses code pieces
2, 1, 3. In the first experiment, twelve students were
involved, which allowed the partial factorial design
to be complete. Only 10 students were involved in
the second experiment, so the partial factorial
design was incomplete.
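As an illustration of the kind of assignment just described, the following sketch (in Python, which was not part of the experiments; the specific permutations shown are hypothetical and are not the actual schedule documented in [9]) generates a schedule in which paraphrasing is always used in round 1, the two TDI techniques alternate in rounds 2 and 3, and the code pieces are permuted across students.

    from itertools import cycle

    students = [f"s{i}" for i in range(1, 13)]
    # Permutations of the three code pieces; hypothetical, for illustration only.
    piece_orders = cycle([(3, 2, 1), (2, 1, 3), (1, 3, 2)])
    # The two TDI techniques alternate between rounds 2 and 3.
    tdi_orders = cycle([("D", "T"), ("T", "D")])

    for student, pieces, (tdi2, tdi3) in zip(students, piece_orders, tdi_orders):
        techniques = ("P", tdi2, tdi3)   # paraphrasing (P) always in round 1
        schedule = ", ".join(
            f"round {r}: {tech} on piece {piece}"
            for r, (tech, piece) in enumerate(zip(techniques, pieces), start=1)
        )
        print(f"{student}: {schedule}")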
The goal of the experiment was to measure the
effectiveness of the TDI technique as compared to
the industry standard technique. Effectiveness was
defined as the ability of the individual inspector to
detect software defects that require a deeper under-
standing of the code. To evaluate whether that had
been achieved, a metric was needed beyond simply
counting findings. The metric must differentiate
between findings that simply address formatting
issues and those that address logic errors. The
ODC-CC [9] was developed for this purpose. Asso-
ciated with this detailed defect classification was
the concept of a level of understanding.
Each category in the defect classification was
assigned a level of understanding intended to reflect
the depth of understanding needed by an inspector
to be able to identify a defect in that category. The
analysis of the experiment results involved first
classifying each defect and then assigning the asso-
ciated level of understanding.
If an inspector using a TDI technique identifies
proportionally more defects at a deeper level of
understanding, then using a TDI technique is more
effective than using the paraphrasing technique for
finding these deeper defects.
The depth of understanding needed to make a
finding of course does not correlate with the end
consequence of the finding on the operation of the
software product. Defects that are logically difficult
to find may have minor consequences while obvi-
ous defects may have major consequences.
5. Comparing IBM ODC and ODC-CC
The IBM ODC scheme [7] is probably the most
developed of the classification schemes due to its
continual evolution over the past ten years. How-
ever, due to its different purposes, the IBM ODC
did not readily lend itself to what we needed for the
analysis of the results from the inspection experi-
ments. By considering the attributes defined in the
IBM ODC, we can both map our inspection activity
to the ODC and identify where changes are neces-
sary.
Our defect removal activity is code inspection.
The triggers are defined in the IBM ODC as "what you were thinking about when you discovered the defect" [7]. We define a trigger in such a way as to remove any subjective aspect of the activity. Instead of considering what the inspector was thinking of, we define the trigger as the task the inspector was carrying out. In our inspection experiments, this was clearly defined, e.g. writing a method description or creating a test plan.
Impact was not considered in our experiment.
Target was the code or the documentation used
in our experiments.
For the ODC-CC, we changed the time at which
defect types are assigned, expanded the set of defect
types, and decreased the granularity of classifica-
tion.
In the IBM ODC, defect types are assigned at
the time the developer fixes the defect. This means
the defect types are defined in terms of the fix. For
example, the defect type function is defined as "The
error should require a formal design change ...". To
simplify evaluation of the results in our inspection
experiments, the defect types are assigned before
the time of fix without considering the change nec-
essary to fix the problem. This is a valid view in
industry as well, where there are times when
inspection is decoupled from fixing.
For our experiments, definitions of defect types
must thus reflect the problem as the inspector per-
ceives it in the code: the defect type must relate to
the code rather than the fix activity. For example,
obscure language constructs, lack of encapsulation,
and logically unrelated data items in a structure all
reflect what the inspector may find. Any of these
could eventually require a "formal design change".
This is a significant shift in viewpoint from the cat-
egorization needed for the inspection experiment to
the IBM ODC categorization done by the fixer.
As well as changing the viewpoint of the defect
type from fixer to inspector, we found the list of
defect types for code and design was inadequate for
defects typically found in computational code.
Defects such as poor naming conventions for vari-
ables, inaccessible code, and inadequate capture of
error conditions didn't seem to fit any category in
the IBM ODC. It was unclear if wrong assumptions
should be classified as assignment defects or algo-
rithm defects. The ODC described in the Research
web site [7] removed documentation from the
defect type list, yet this is a category we needed for
the inspector.
Finally, we expanded the number of defect types
substantially. Finer detail was needed than was
offered by the IBM ODC. The issue of granularity
in a defect classification scheme leads in conflicting
directions. A small number of types may make it
easier to pick the one that applies. A larger number
of types may increase precision and give greater
certainty but may also mean that classification takes
longer. In our case, the extra detail was needed to be
able to deduce the level of understanding needed to
find a given defect.
In the IBM ODC the defect qualifier is also
defined at the time of the fix. If we assign the defect
qualifier at the time of the inspection, then further
qualifiers are needed. "Inconsistent" becomes nec-
essary since there are cases where, for example, the
inspector may not be able to identify if the code is
wrong or the documentation is wrong, only that
they are inconsistent. We also found that "obscure"
was a frequent defect qualifier in complex compu-
tational code, and this was added to the list devel-
oped for the ODC-CC.
The IBM ODC Age attribute simplifies to Base
for our experiments. The Source attribute is Devel-
oped In-House. Neither of these attributes contrib-
ute to the analysis of our experiments.
The next section gives the details of the ODC-
CC.
6. Description of the Defect Types for the
ODC-CC
As mentioned above, a new defect classification
was necessary for the analysis of findings from the
inspection experiments. The list had to be compre-
hensive enough to include all types of findings that
could arise from the experiments and fine-grained
enough to allow us to do a meaningful analysis of
the experimental results.
In the industry inspection exercise reported in
[8], approximately 950 findings were identified. In
preparation for the analysis of the results of our
experiments, these 950 findings were grouped into
about 90 categories, based on subjective judge-
ments as to which of the 950 findings were most
similar to each other. These categories were then
used as a base to define defect types typical of com-
putational code. This formed the basis of the new
ODC-CC as shown in the Annex.
The ODC-CC contains multiple levels of defect
type categories, each level adding more detail. The
top level is composed of the following five catego-
ries:
Documentation: documentation against which the
code may be compared. The findings classified
here are with respect to the documentation. This
includes comments in the code.
Calculation/Logic: findings of implementation
related to flow of logic, numerical problems, and
formulation of computations.
Error Checking: findings related to data values
(conditions and bounds), pre- and post- condi-
tions, control flow, data size, where specific
checks should be included in the code (defensive
programming).
Support Code: findings in supplementary code
used for testing, debugging, and optional execu-
tion.
External Resources: findings in interactions with
the environment or other software systems.
These five top level categories are intended to
represent a partition of the source code. The first
category, documentation, is the only category that
includes both code and software products external
to the code such as user documentation and design
documents. Code self-documentation includes not
only headers, comment lines, and inline comments,
but elements of the active code. Well written code
serves as documentation in itself, with meaningful
names, well constructed structures, and consistent
usages. So included in the documentation category
are variable naming and use, data and code struc-
ture, and user interface constructs. Classes 1.1 and
1.2 in the Annex represent the types of external
documentation used in our experiments. Other
types of external documents would be needed for
other contexts.
The second top level category includes compu-
tational work such as calculations, logic, and algo-
rithms. This category covers non-exception cases.
The third top level category is for error checking
and exception handling. The fourth top level cate-
gory includes support code for testing and other
checks of the software. The fifth top level category
covers interactions with the environment during
normal execution.
Subtypes of each of the five top-level types are
added in successive levels. Each level provides
more detail, dividing the level above it into finer
categories. In the ODC-CC as it currently exists,
four levels were found to be sufficient. The multiple
levels allow the flexibility of a very fine grained
classification or a very broad classification. This
has advantages in different ways, e.g. in training
inspectors (the detailed levels help to make the clas-
sification clearer), and in adjusting the level of
effort desired in classifying the results of an inspec-
tion (by requiring more or less detail).
Five defect qualifiers are included in the ODC-
CC:
M = missing
W = wrong
S = superfluous
I = inconsistent
O = obscure
Each inspection finding is fully categorized by a
defect type and a defect qualifier. For example, for
defect type 2.2.1: Calculation/Logic - Data values
before computation - Terms initialized (see Annex);
the qualifiers can indicate if the initialization defect
is missing, wrong, superfluous, inconsistent, or
obscure. There should be only one defect type and
one qualifier into which the finding fits.
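As a concrete illustration (a minimal sketch in Python; the record layout, field names, and the example description are our own and are not part of the ODC-CC itself), a single finding can be recorded as a defect type code plus one qualifier:

    from dataclasses import dataclass

    QUALIFIERS = {
        "M": "missing",
        "W": "wrong",
        "S": "superfluous",
        "I": "inconsistent",
        "O": "obscure",
    }

    @dataclass
    class Finding:
        odc_cc_type: str   # e.g. "2.2.1": Calculation/Logic - Data values - Terms initialized
        qualifier: str     # one of M, W, S, I, O
        description: str   # the inspector's free-form comment (hypothetical here)

        def label(self) -> str:
            return f"{self.odc_cc_type}{self.qualifier} ({QUALIFIERS[self.qualifier]})"

    f = Finding("2.2.1", "M", "Accumulator is never initialized before the loop")
    print(f.label())   # -> 2.2.1M (missing)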
As well as providing a tool for analyzing the
types of findings identified by the inspectors, the
ODC-CC also provides a means to separate
reported findings. In our experiments, inspectors
sometimes recorded more than one finding, rolled
into one finding report. By subsequent classifica-
tion of the findings, the rolled up findings become
evident and can be identified as multiple distinct
findings. This helps to provide consistency among
the reported findings of the various inspectors. An
example is given later (Sec. 9.1).
7. Validation of the Completeness of the
ODC-CC
The completeness of the ODC-CC was informally
validated in three ways. First, before the first exper-
iment, a code inspection checklist developed by
Karl Wiegers [18] was used to ensure that all items
in the Wiegers checklist could be categorized in the
ODC-CC. Second, the ODC-CC was reviewed by a
professional software developer. Third, the data
from the three rounds of both experiments was cate-
gorized to verify that all findings could be associ-
ated with a defect class in the ODC-CC.
8. Levels of Understanding for the ODC-
CC and Validation of the ODC-CC Cat-
egories.
The goal of our inspection experiments is to com-
pare the effectiveness of a structured TDI technique
to the effectiveness of an unstructured inspection
technique. In this paper, effectiveness in inspecting
computational code is taken to mean acquiring
enough understanding to recognize subtle issues.
Other measures of effectiveness are of course possi-
ble. Our choice necessitates identifying those find-
ings that require deeper understanding.
To make this identification, each category in the
ODC-CC was associated with a level of under-
standing. These levels of understanding are referred
to as CISL, which stands for Comparative, Identi-
fier, Structural, and Logical levels of understanding.
They are defined by the depth of understanding an
inspector must attain to identify a defect. The low-
est level of understanding is at the Comparative Level, where the inspector compares the code to other documentation; the easiest defects to identify in source code are those found by comparing the code against other documentation. The next level in conceptual diffi-
culty is Identifier, where the inspector determines
the use of variable identifiers and whether those
uses are consistent and unique for each variable.
The third level is the Structure of the code, where
the inspector obtains an understanding of the struc-
ture of the software to identify the coherence of
structure with the semantics of the different compo-
nents in the structure. The level requiring the great-
est understanding is the Logical Understanding,
where the inspector must understand the logical
flow, the formulation of equations, and the handling
of error conditions. The CISL categorization pre-
sented here is subjective, based on our experience,
and that of a number of other people who have
looked at it. Further research will continue its vali-
dation.
Table 1 gives the detailed correspondence between CISL and the ODC-CC as given in the Annex. Table 2 gives a high level description of the CISL categories.

Table 1: CISL correspondence to ODC-CC
C - 1.1, 1.2, 1.3, 1.4.1 to 1.4.3, 1.5 (all from documentation category)
I - 1.4.4 to 1.4.7 (all from documentation category)
S - 1.4.8 to 1.4.13, 4. (from documentation and all of support code categories)
L - 2., 3., 5. (all of calculation/logic, error checking and external resources)

Table 2: CISL Descriptions
C - Comparison of active code to User Documentation, Theory Documentation, internal comments, etc. Naming conventions for variables, modules. Formatting styles.
I - Naming of variables and modules versus use.
S - Semantic or logical structure of data, active code, and modules. Supporting infrastructure for testing, optional features, and debugging.
L - Calculation/logic and error conditions.
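As a rough illustration of how the Table 1 correspondence can be applied mechanically (this plays the role of the spreadsheet associations described in Section 9.2), the following Python sketch maps an ODC-CC type code to its CISL level. The function and data layout are our own; only the prefix lists come from Table 1.

    # Each CISL level is associated with a set of ODC-CC type-code prefixes (Table 1).
    CISL_MAP = [
        ("C", ["1.1", "1.2", "1.3", "1.4.1", "1.4.2", "1.4.3", "1.5"]),
        ("I", ["1.4.4", "1.4.5", "1.4.6", "1.4.7"]),
        ("S", ["1.4.8", "1.4.9", "1.4.10", "1.4.11", "1.4.12", "1.4.13", "4"]),
        ("L", ["2", "3", "5"]),
    ]

    def cisl_level(odc_cc_code: str) -> str:
        """Return the CISL level (C, I, S or L) for a dot-separated ODC-CC type code."""
        parts = odc_cc_code.split(".")
        for level, prefixes in CISL_MAP:
            for prefix in prefixes:
                prefix_parts = prefix.split(".")
                if parts[:len(prefix_parts)] == prefix_parts:
                    return level
        raise ValueError(f"Unknown ODC-CC code: {odc_cc_code}")

    # Examples used in the paper: 2.2.1 (terms initialized) is an L-level defect,
    # 1.4.1.3 (meaningful constant names) is a C-level defect.
    assert cisl_level("2.2.1") == "L"
    assert cisl_level("1.4.1.3") == "C"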
The IBM ODC assigns levels of experience to
the different triggers, indicating competencies the
inspectors had in order to find the defect. In devel-
oping CISL, we took a different viewpoint. First, in
the experiment, the students were assumed to be all
at the same level of competency. In reality, they are
not. Second, the trigger is defined as the task which
is carried out by all students. The level of under-
standing could not be assigned to the trigger (i.e.
the task) in advance, since this was exactly what we
were trying to measure: what level of understanding
was precipitated by the TDI task. Instead, the level
of understanding had to be a characteristic of the
defect itself (and hence a characteristic of the code).
This is what CISL provides.
9. Use of the ODC-CC in the Inspection
Experiments
9.1. Analysis Steps in the Experiments
The primary data for each experiment were the
findings identified by each student using each
inspection technique. Analysis of this data pro-
ceeded in several steps including the following:
the raw findings were categorized using the
ODC-CC;
the categorized findings were tagged by CISL
levels;
the numbers of findings in each of the CISL levels were normalized to create C-, I-, S-, and L-proportions; this gave the fraction of each student's findings that were at each of the CISL levels;
an analysis was done for the C- and L-proportions.
The categorizations for each experiment were
handled slightly differently. They are discussed in
turn. All findings from the three rounds of each
experiment were categorized using the most
detailed levels of the ODC-CC.
For the first experiment, the categorization was
done by one of us (rounds 1 and 3) and by a profes-
sional developer (round 2), taking about 15 hours to
complete. This corresponds to a rate of about 30
findings per hour. Both categorizers were well
acquainted with the ODC-CC, computational type
codes, and the types of defect inherent in them.
As the categorization was carried out, twenty-
eight findings from the three rounds were rejected
when it was obvious that some findings were due to
some students' lack of coding skills or to a misun-
derstanding of what should be recorded. The cate-
gorization exercise also identified eighteen multiple
findings. For example, the following finding report
contains two findings:
"The meaning of each value of the flag is not
stated anywhere; static final variables should be
used for the constants to give more meaning to the
flag."
The first is the lack of documentation for a par-
ticular variable i.e. the flag (classification
1.3.1.1M from the ODC-CC); the second is the lack
of a meaningful name for a constant value (classifi-
cation 1.4.1.3M). Findings such as this are given
additional classifications and counted as multiple
findings. This helped to standardize the lists of
findings submitted by the different students.
After the categorization, there was a total of 446
findings identified by the twelve students during the
three rounds of the first experiment.
For the second experiment, the students were
asked to classify their own defects at the end of the
three rounds of the experiment. A second indepen-
dent classification was done by a professional soft-
ware developer who was familiar with Visual Basic
and with computational software but not with this
particular application. The ten students identified
about 250 findings for the three rounds. The catego-
rizations by the students were compared to those of
the professional. Section 10 describes this compari-
son.
9.2. CISL Categorization and Normalization of
the CISL Finding Counts
Once the categorization of the findings using the
ODC-CC was complete, the findings were associ-
ated with each CISL level. This sequence was fol-
lowed to ensure there were no unconscious biases
creeping into the categorization. The CISL results
were obtained from the ODC-CC categorizations
by using associations programmed into a spread-
sheet to map the classified defects onto the CISL
levels.
Results for the first experiment for each of the
twelve participants (p1 to p12) for the round using
the paraphrasing inspection technique are shown in
Table 3.
Before we could interpret the results, we had to
take into account that the number of findings gener-
ated by each inspector using each technique is
affected by the technique itself and the performance
of the inspector. The performance of an individual
can vary significantly, both compared to other
inspectors, and between rounds for the same inspec-
tor [1], [9]. This variation can easily mask the influ-
ence of a technique on the experimental results.
The repeated measures design of the experiment
allowed each participant to act as his/her own con-
trol. Even at that, we wanted to normalize the indi-
vidual variations in findings counts. We did this by
dividing each individual's count of findings in each
category of CISL by the individual's total findings
for that round. This gave a proportion of CISL find-
ings for each round for that individual. The compar-
isons are then across the results for each individual.
For example, participant p1 in the first experi-
ment, using the P technique, had a total of nineteen
findings. Fifteen of those were in the Comparative
(C) level of understanding, one in the Identifier (I)
level of understanding, two in the Structural (S)
level of understanding, and one in the Logical (L)
level of understanding. For participant p1 and the P
technique, the counts were divided by nineteen to
give the proportions C: 0.79, I: 0.05, S: 0.11, L:
0.05. This normalization was done for each individ-
ual in both experiments.
Alternative normalizations were examined and
are discussed in [9].
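A minimal sketch of the normalization described above, in Python, using participant p1's paraphrasing-round counts from Table 3 (the function name and the rounding to two decimals are our own choices):

    def cisl_proportions(counts: dict) -> dict:
        """Divide each CISL count by the participant's total findings for the round."""
        total = sum(counts.values())
        return {level: count / total for level, count in counts.items()}

    # Participant p1, paraphrasing (P) technique, first experiment (Table 3).
    p1_counts = {"C": 15, "I": 1, "S": 2, "L": 1}
    print({k: round(v, 2) for k, v in cisl_proportions(p1_counts).items()})
    # -> {'C': 0.79, 'I': 0.05, 'S': 0.11, 'L': 0.05}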
9.3. Graphical Results
The CISL scale allowed us to identify and
extract the findings related to deeper understanding
of the code. This in turn allowed an evaluation of
the effectiveness of the TDI techniques as com-
pared to the unstructured inspection technique. The
L-proportions represent findings that require the
deepest understanding of the source code.
Graph 1 shows the L-proportions for each stu-
dent and each technique from the first experiment
(note that P=paraphrasing technique, D=Method
Description TDI, and T=Test Plan TDI). The trends
in the graph allowed us to make conjectures regard-
ing the use of the inspection techniques, the levels
of understanding acquired by the students and the
students' background experience.
      C    I    S    L
p1   15    1    2    1
p2   16    4    1    6
p3   18    5    3    3
p4   18    1    2   11
p5   16    7    0    2
p6    4    3    2    1
p7    0    2    0    5
p8    1    0    0    7
p9    2    5    0    4
p10   0    0    0    1
p11   1    1    0    1
p12   4    1    0    2
Table 3: CISL counts for the Paraphrasing Inspection - first experiment

Graph 1: L-Proportions by Technique (2000 experiment). [Figure: L-proportions, from 0 to 1.2, plotted against participant number (1 to 12) for the P, D, and T techniques.]

Student experience in the first experiment was highly polarized. Students 1 to 6 had extensive software experience and generally demonstrated a greater ability to find L-level issues with the TDI techniques. Students 7 to 12 had little or no software experience, turned in lower numbers of findings (making the L-proportions often unreasonably high), and had difficulty performing the TDI tasks. A more detailed discussion can be found in [9], [10]. A similar trend was evident in the second experiment, though not so prominent. This is discussed in Section 11.
10. Comparing Professional and non-
Professional Classifications
For the second experiment, the findings were
classified by the students as a separate step after the
three inspection rounds. The findings were also
classified independently by a professional software
developer. Interesting results came from comparing the students' classifications to those of the professional. Graphs 2 and 3 show the differences when comparing the L-Proportions by technique for the ten students. Graph 2 uses the classifications done by the professional and Graph 3 uses the classifications done by the students.

Graph 2: L-Proportions by Technique - Categorization by professional (2001 experiment). [Figure: L-proportions, from 0 to 1.2, plotted against participant number for the P, D, and T techniques.]

Graph 3: L-Proportions by Technique - Categorization by students (2001 experiment). [Figure: L-proportions, from 0 to 1.2, plotted against participant number for the P, D, and T techniques.]
For the sake of the experimental results, differ-
ences that result in changes of trends between the
paraphrasing round (P) and the two TDI rounds (D
and T) are of most interest, i.e., where the L-proportions for P versus D and T change relative to each other. This happens for students 1, 4, 5, and 9. Students 4 and 9 turned in fewer than five findings per round, making their results dubious at best. Students 1 and 5 each turned in more than 30 findings over the three rounds, so it is of more interest to examine these two cases.
For student 1, the categorization by the profes-
sional moved three findings from CISL level L to
CISL levels C, S, and I. Two L findings were dis-
counted. For example, the student inspector was
concerned about a variable not being reset properly
on a FOR loop; the professional marked this as not
being an issue. As another example, the student
classified a finding as an improper initialization
(category 2.2.1W) and the professional classified
the finding as inconsistent use of the variable with
respect to type (category 1.4.6I). Generally, for this
student, the professional categorized the findings as
being documentation problems rather than logical
errors.
For student 5, there were 9 findings whose CISL
levels differed depending on the categorization.
Two of the student's C level findings were categorized as L level by the professional. Two of the student's L level findings were categorized as C level by the professional. Three of the student's S level findings and one I level finding were categorized as
L level by the professional. For example, the stu-
dent classified an irrelevant return value as super-
fluous debugging code (category 4.1.3) whereas the
professional classified the same finding as a case
statement having extraneous code (category 2.4.2).
Here, the professional generally judged the student's findings to be more logic oriented than the student did.
There is no particular pattern in these differ-
ences, but in Section 11 below, we see that the
changes made by the professional actually
strengthen our final results from the second experi-
ment.
Table 4 gives the number of findings that
changed CISL level for each student based on the
two categorizations done. Student 9 was struggling
with the inspection and all findings were catego-
rized differently by the professional, including
rejecting half of the findings.
Generally, the weaker the student, the greater the discrepancy between the student's categorizations and the professional's. Section 12 discusses
possible reasons for the difficulty in classifying
defects.
11. Analysis of Experimental Results
In a previous paper [10], we stated that: "There seems at this point evidence that for the experienced participants in the experiment, using tasks to structure the inspection resulted in proportionally more L level findings being identified." This was
based on the first experiment. The analysis leading
to this conclusion is given in [10]. Subjective com-
ments from the participants of both experiments
supported the conclusion.
The results of the second experiment support
the same conclusion, although some explanation is
needed as to why this is so, since Graphs 2 and 3
show some participants for whom it is not true.
After the second experiment was complete, we
observed that the number of findings for code piece
1 was less than for the other two pieces, and the
number of L-findings was significantly less. We
also found that the number of L-findings in round 3
was significantly less than for the other two rounds.
When we looked more closely at code piece 1, it appeared that it was too simple to have very many L-findings. Since we put significant time pressure on the students, and discussed the amount of extra time they were spending before they completed round 3, it seems likely that at least some of the students decided to put less effort into round 3, and that this skewed the results.

Student   Total findings over 3 rounds     Findings that changed    % change
          (professional categorization)    CISL level
1         35                                5                        14
2         17                                4                        23
3         21                                4                        19
4         14                                3                        21
5         53                                9                        17
6         22                                4                        18
7         27                                7                        26
8         32                                9                        28
9          6                               12                       200
10        16                                3                        19
Table 4: CISL level changes from different categorizations (second experiment)
On this basis, we eliminated the results for par-
ticipants 3 through 9 in Graphs 2 and 3. For partici-
pants 1 and 2, round 3 used the D technique, so all
that is left to compare is P and T. For participant 10,
round 3 used the T technique, so all that is left to
compare is P and D. For both graphs and for all
three of these participants, the L-proportions using
the D or T technique (as the case may be) are
greater than they are for the P technique, so our
results as given in [10] are supported by these three
participants.
12. Categorization is Hard
The range of efforts to create defect classification schemes described earlier in this paper, and the long history in which no single scheme has come into wide use, suggest that defect classification is hard, and that repeatable orthogonal classification is in itself difficult. Generally, training is advised for anyone using a classification scheme. In the second experiment the students were asked to do the classification of their own findings. They received very limited training. This was likely the largest contributor, along with inexperience, to the variations between the students' work and the professional's.
Our experience with the differences found in
comparing classification results from students to
results from a professional developer raises issues
of design for defect classification schemes. These
include:
- what level of detail is appropriate for the purpose of the scheme?
  - lower level details help to define higher level categories more clearly
  - lower level details can make classification harder and more time consuming
- defects and categories are open to multiple interpretations:
  - the ODC-CC categories can be more clearly defined, but perhaps not to the point of perfection
  - different categorizers interpret inspectors' written comments/findings differently
  - the code author's intent may be difficult to interpret
  - the level of experience and knowledge of the categorizer affects how the code is interpreted
  - there may be legitimate differences of opinion on the source of the problem (e.g. documentation versus logic error)
  - inconsistency can be interpreted differently
- the ODC-CC categories have a tendency to reflect procedural/FORTRAN code, so students may misinterpret when their experience is with OO code
- there may be a subconscious influence in making a classification decision due to the type of fix the categorizer expects, which may be different from the category the defect belongs to.
13. Conclusions
Although normally applied to software process
evaluation, defect classifications can be used as a
metric for individual activities as shown in this case
study.
For such activities, the development and use of
the classification scheme may require a shift in
viewpoint and a change of granularity in the classi-
fication categories. Existing classification schemes
can provide starting frameworks from which
schemes focused on specific needs can be devel-
oped.
It is hard work to develop a classification
scheme that is both complete and repeatable in the
ideal sense. The ODC-CC was complete for our
purposes. Its repeatability has to be studied further.
It is difficult to factor out the subjectivity of the
humans involved at all stages of classification: cre-
ation of the scheme, interpretation of the scheme,
interpretation of the items being classified. Further
refinement of the scheme and training are possible
remedies. Dependence on the skill of the classifier
is hard to avoid. However, as imperfect as a defect
classification scheme may be, it provides a valuable
metric and potential for insight into many different
aspects of software development.
14. Biographies
Diane Kelly is an instructor and Ph.D. student at
the Royal Military College of Canada. Previously,
Diane worked in the Nuclear Division at Ontario
Hydro where she participated in a wide variety of
roles in software development, from programmer to
project leader, trainer to QA advisor. Diane has an
M.Eng. in Software Engineering from the Royal
Military College, a B.Sc. in Mathematics and a B.
Ed. in Mathematics and Computer Science, both
from the University of Toronto, Canada.
Terry Shepard is a professor in the Department
of Electrical and Computer Engineering at the
Royal Military College of Canada, where he has
played the lead role in creating strong software
engineering programs. This includes working
extensively with a number of Canadian military
software projects, and creating and teaching gradu-
ate and undergraduate courses on software design,
V&V, and maintenance. He has over 30 years of
software experience in industry, government and
academia. Terry received his B.Sc. and M.A. from
Queen's University in Kingston, and his Ph.D. from
the University of Illinois, all in Mathematics. He is
a Registered Professional Engineer. He has pub-
lished a number of papers in software design and
verification, and worked with ObjecTime Ltd. for
several years.
15. References
[1] Victor Basili, Richard W. Selby, David Hutchens, "Experimentation in Software Engineering", IEEE Transactions on Software Engineering, Vol. SE-12, No. 7, July 1986, pp. 733-743
[2] Boris Beizer, Software Testing Techniques, van
Nostrand Reinhold, 2nd Edition, 1990
[3] J.K. Chaar, M.J. Halliday, I.S. Bhandari and R.
Chillarege, "In-Process Evaluation for Soft-
ware Inspection and Test", IEEE TSE, Nov. 93,
pp. 1055-1071, v. 19, n. 11
[4] Ram Chillarege, Inderpal S. Bhandari, Jarir K.
Chaar, Michael J. Halliday, Diane S. Moebus,
Bonnie K. Ray, Man-Yuen Wong; "Orthogonal
Defect Classification - A Concept for In-Pro-
cess Measurements", IEEE Transactions on
Software Engineering, Vol. 18, No. 11,
November 1992, pp. 943-956
[5] Khaled El Emam and Isabelle Wieczorek; "The
Repeatability of Code Defect Classifications",
Proceedings of the 9th International Sympo-
sium on Software Reliability Engineering,
1998, pp. 322-333
[6] Michael Fredericks and Victor Basili, Using
Defect Tracking and Analysis to Improve Soft-
ware Quality, Technical Report, DoD Data &
Analysis Center for Software (DACS), Nov.
1998
[7] IBM Centre for Software Engineering, Details
of ODC, http://www.research.ibm.com/soft-
eng/ODC/DETODC.HTM and
FAQ.HTM#concepts
[8] Diane Kelly and Terry Shepard; "A Novel
Approach to Inspection of Legacy Code", Pro-
ceedings of PSQT00, Austin Texas, March
2000
[9] Diane Kelly; "An Experiment to Investigate a New Software Inspection Technique", Master's Thesis, Royal Military College of Canada, July 2000.
[10] Diane Kelly and Terry Shepard, "Task-
Directed Software Inspection Technique: An
Experiment and Case Study", IBM CASCON
2000, Toronto, November 2000
[11] Oliver Laitenberger, Colin Atkinson, Maud Schlich, and Khaled El Emam; "An Experimental Comparison of Reading Techniques for Defect Detection in UML Design Documents", December 1999, NRC 43614
[12] Adam Porter, Lawrence G. Votta, Jr., Victor R.
Basili; "Comparing Detection Methods for
Software Requirements Inspections: A Repli-
cated Experiment", IEEE Transactions on Soft-
ware Engineering, Vol. 21, No. 6, June 1995,
pp. 563-575
[13] Adam Porter, Harvey P. Siy, Carol A. Toman,
Lawrence G. Votta; "An Experiment to Assess
the Cost-Benefits of Code Inspections in Large
Scale Software Development", IEEE Transac-
tions on Software Engineering, Vol. 23, No. 6,
June 1997, pp. 329-346
[14] Glen Russell; "Experience with Inspection in
Ultralarge-Scale Developments", IEEE Soft-
ware, Vol.8, No.1, Jan. 1991, pp. 25-31
[15] Terry Shepard; "On Teaching Software Verification and Validation", Proceedings of the 8th SEI Conference on Software Engineering Education, New Orleans, LA, 1995, pp. 375-386
[16] University of Maryland, Notes on Perspective
Based Scenarios; http://www.cs.umd.edu/
projects/SoftEng/ESEG/manual/pbr_package/
node21.html,[online], November 1999.
[17] Lawrence Votta, "Does Every Inspection Need a Meeting?", SIGSOFT '93 - Proceedings of the 1st ACM SIGSOFT Symposium on Foundations of Software Engineering, ACM Press, NY, 1993, pp. 107-114
[18] Karl Wiegers; Process Impact Review Checklists, http://www.processimpact.com/process_assets/review_checklists.doc
Annex: ODC-CC detailed classification
1. Documentation
1.1 User documentation (no descriptions of
equations or models)
1.1.1 Data input description
1.1.1.1 Format
1.1.1.2 Default values
1.1.1.3 Recommended values
1.1.1.4 Use of input data
1.1.1.5 Data size
1.1.1.6 Measurement units
1.1.2 Data output description
1.1.2.1 Format
1.1.2.2 Measurement units
1.1.2.3 Description
1.1.2.4 Source of output
1.1.3 Error messages
1.1.4 Interrelationships between data
items
1.1.5 Software environment
1.2 Theory documentation
1.2.1 Descriptions
1.2.1.1 Description of functionality
1.2.1.2 Stated defaults, assumptions,
limitations for models
1.2.1.3 Precondition on calculation
1.2.1.4 Post condition on calculation
1.2.2 Symbology
1.2.2.1 Definitions, distinctions between
symbols
1.3 Internal documentation (comments in
code)
1.3.1 Description of variables
1.3.1.1 Local
1.3.1.2 Global
1.3.1.3 Interface
1.3.2 Description of process
1.3.2.1 Logic/Calculations
1.3.2.2 Module header: description of
functionality, version, references
1.4 Code (style, internal consistency,
structure)
1.4.1 Convention for naming variables
1.4.1.1 Variable naming scheme
1.4.1.2 Variable names avoid using
reserved keywords
1.4.1.3 Constant values have meaningful
names (e.g., PI for 3.14)
1.4.2 Module naming
1.4.2.1 Reflects use
1.4.3 Formatting style, e.g., labels, white
space, comments
1.4.3.1 Use of labels and other inactive
code features
1.4.3.2 Use of white space
1.4.3.3 Use of comments
1.4.4 Variable naming versus use
1.4.4.1 One variable name per semantic
(e.g., MASS for mass of fuel rod;
MFUEL for same quantity)
1.4.4.2 One use over time for each variable
name. (e.g., COUNT used for
number of channels and later for
number of devices)
1.4.4.3 One use at any one time for each
variable name (e.g., +ve values are
Masses, -ve values are Volumes)
1.4.5 Interface parameters; type versus use,
size and scope
1.4.6 Local/private variables; type versus
use, size and scope
1.4.7 Global variables; type versus use, size
and scope
1.4.8 Data structures composed of semanti-
cally or logically related data
1.4.9 Structure of data logically evident or
thoroughly documented
1.4.10 Code structure reflects logical
relationships
1.4.11 Common functionality confined to
one module (e.g., use of library
routines, one module for common
computations)
1.4.12 Volatile or complex structures
isolated or encapsulated (e.g., access
routines for complex data structures
or system dependent tasks routines)
1.4.13 Language constructs
1.4.13.1 Generalized to reduce restrictions
and/or obscurity
1.4.13.2 Recommended constructs to
replace obsolete constructs
1.5 I/O Code
1.5.1 Output wording and format
1.5.2 Input wording and format
1.5.3 Error message wording and format
1.5.4 Program action versus error message
1.5.5 Program action versus output
1.5.6 Program action versus input
2. Calculation/ Logic
2.1 Formulation of equation
2.1.1 Combination of terms of equations
2.2 Data values before computation
2.2.1 Terms initialized
2.2.2 Appropriate data value used (e.g.,
from previous calculations or based
on related data)
2.2.3 Appropriate assumptions, defaults
2.3 Formulation of boolean condition
2.3.1 Variables in boolean expressions
2.3.1.1 Variables checked against correct
limits
2.3.1.2 Comparing two floating point
numbers
2.3.2 Formulation of boolean condition
2.4 Logic flow
2.4.1 All cases addressed
2.4.2 Coding for each case is complete and
each case has only code required
specifically for that case
2.4.3 All code accessible (including calls to
modules, use of passive code: formats,
inline functions)
2.4.3.1 All modules accessed
2.4.3.2 All parts of logic accessible
2.4.3.3 All passive code constructs used
(labels, inline functions referenced)
2.5 Calculation
2.5.1 All calculations completed before
subsequent use
2.5.2 Calculation done within scope of
assumptions
2.5.3 Normal exit from calculation
2.5.4 Order of calculations
2.5.5 Limitations on calculations consistent
with subsequent calculations
2.5.6 Access to data structures
2.6 Algorithm
2.6.1 Appropriate algorithm
2.6.2 Algorithm used within documented
limitations
2.7 Numerics
2.7.1 Round-off error
2.7.2 Discrete computation
2.7.3 Accuracy
2.8 Error handling
2.8.1 Clean up or reset state variables after
error condition
3. Error Checking (identified places in code
where error checking should be done)
3.1 Preconditions on data values
3.1.1 Expected data values (e.g., discrete
values), maximum and minimum
values, zero values
3.1.2 Data values based on related data, e.g.
check the related value is as expected
3.2 Post conditions on data values
3.2.1 Completion of calculation
3.2.2 Expected range of values, e.g., void
fraction calculated value between 0
and 1
3.3 Conditions in logic structures, case
structures, loops
3.3.1 Bounds on loops
3.3.2 All conditions covered, fall through
3.3.3 Appropriate conditions for IF block
3.4 Data size
3.4.1 Array bounds
3.4.2 Size of character or other variables
3.5 Status codes or error conditions passed
from other software
3.5.1 Capturing and handling error
conditions from the operating system,
hardware
3.5.2 Capturing and handling error
conditions from other application
software
3.5.3 Capturing and handling error
conditions from other modules or
submodules within application
4. Support Code
4.1 Code to support testing
4.1.1 Test harnesses
4.1.1.1 Stubs
4.1.1.2 Other
4.1.2 Debugging code
4.2 Code to support additional development
features
4.2.1 "backdoors", shortcuts
4.2.2 trial options, temporary features
5. External Resources
5.1 File/memory handling
5.1.1 Buffers flushed
5.1.2 Files opened before writes, files
closed; memory allocated/deallocated
5.1.3 End of file flags
5.1.4 File positioning
5.2 Library programs
5.2.1 Interfaces
5.2.2 Assumptions on use
5.2.3 Interpretation of returned values from
external routines
5.3 Interaction with other software systems
5.3.1 Interfaces, both input and output
5.3.2 Assumptions on exchanged data
5.4 Dependencies on environment
5.4.1 Assumptions about internal word size
5.4.2 Assumptions about compiler
initialization of data
