A Case Study in the Use of Defect Classification in Inspections
Diane Kelly and Terry Shepard
Royal Military College of Canada

Abstract

In many software organizations, defects are classified very simply, using categories such as Minor, Major, Severe, Critical. Simple classifications of this kind are typically used to assign priorities in repairing defects. Deeper understanding of the effectiveness of software development methodologies and techniques requires more detailed classification of defects. A variety of classifications has been proposed. Although most detailed schemes have been developed for the purpose of analyzing software processes, defect classification schemes have the potential for more specific uses. These uses require the classification scheme to be tailored to provide relevant details. In this vein, a new scheme was developed to evaluate and compare the effectiveness of software inspection techniques. This paper describes this scheme and its use as a metric in two empirical studies. Its use was considered successful, but issues of validity and repeatability are discussed.

Keywords

Software engineering, software maintenance, software metrics, software testing, software validation, orthogonal defect classification.

1. INTRODUCTION

Classification of software defects can be as simple as specifying major/minor or as detailed as the scheme described by Beizer [2]. For deciding whether to assign resources to fix defects, major/minor may be sufficient. To assess sources of defects and trouble spots in a large complex system, something more detailed is needed.

As described in [6], defect classifications have been successfully used to analyze and evaluate different aspects of software development. Several organizations have developed defect classification schemes to identify common causes of errors and develop profiles of software development methodologies. For example, IBM has been refining a defect classification scheme for about ten years [3,4,7]. The IBM scheme is intended to provide analysis and feedback to steps in the software process that deal with defect detection, correction and prevention. As software techniques have evolved, IBM's defect classification has changed to add support for new areas such as object oriented messaging and international language support.

Defect classification schemes are concerned with removing the subjectivity of the classifier and creating categories that are distinct, that is, orthogonal. IBM's scheme defines an orthogonal defect classification (ODC) by defining a relatively small number of defect types. The thought is that with fewer choices for any defect, the developer can choose accurately among the types.

To evaluate the results of an empirical study of specific software inspection techniques, we developed a classification scheme specific to our needs. Similar to exercises carried out at IBM and Sperry Univac [6], we used findings from an extensive industrial inspection exercise [8] to develop a detailed defect classification specifically for computational code (ODC-CC) [9]. Each category in the classification was then associated with one of four levels of understanding, based on the perceived conceptual difficulty of finding a defect in a given category. Using these levels of understanding, the results from two inspection experiments were analyzed to determine if a new software inspection technique encouraged inspectors to gain a deeper understanding of the code they were inspecting [10].

Software inspection is recognized as an effective defect detection technique, e.g. [14].
Research into improving this effectiveness has focused both on the inspection process, e.g. [13], [17], and on individual inspection techniques, e.g. [11], [12]. In each of the winters of 2000 and 2001, we conducted an experiment to examine the effectiveness of a new individual inspection technique called task-directed inspection (TDI) [9]. Instead of using simple findings counts (e.g. [13]) as a basis for the analysis of the technique, we used the ODC-CC to differentiate between defects based on the different levels of understanding that findings represent.

Validating the completeness of the coverage of the ODC-CC was straightforward, but validating the orthogonality of the ODC-CC is more problematic. Validation of the ODC-CC was carried out as part of both experiments. Details are given in the rest of the paper, with the main emphasis on the second experiment.

2. Defect Classification Schemes

Defect classification schemes can be created for several purposes, including: making decisions during software development, tracking defects for process improvement, guiding the selection of test cases, and analyzing research results. This paper illustrates a defect classification scheme used for the last of these purposes.

The 1998 report by Fredericks and Basili [6] provides an overview of classification schemes that have been developed since 1975. The goals for most of the schemes, from companies such as HP, IBM and Sperry Univac, are to identify common causes for defects in order to determine corrective action.

Our ODC-CC is unique among the defect classification schemes that we know of, in that it was developed specifically to analyze the results of inspection experiments. In other words, the activity of software inspection is analyzed in isolation from the software development process, without worrying about the cause of the defects or the action taken to fix them. This is a very different viewpoint from those of other classification schemes. As a point of comparison, we describe the IBM ODC, which has been evolving over the past ten years, and how the ODC-CC differs from it.

3. The IBM Orthogonal Defect Classification Scheme

The IBM Orthogonal Defect Classification (ODC) was originally described in the paper by Chillarege et al in 1992 [4]. The goal of the IBM ODC as described by Chillarege et al is to provide a measurement paradigm to extract key information from defects, and to use that information to assess some part of a software development process for the purpose of providing corrective actions to that process.

The application of the 1992 version of the IBM ODC involves identifying, for each defect, a defect trigger, a defect type, and a defect qualifier. More recent versions of the IBM ODC [7] include the activity that uncovered the defect, defect impact, and defect target, age and source, as well as the originally described trigger, type, and qualifier. The activity, trigger, and impact are normally identified when the defect is found; the others are normally identified after the defect has been fixed.

The activities in the current version of the IBM ODC (v. 5.11) include design reviews, code inspection, and three kinds of testing. For the purpose of this paper, we focus on code inspection. There are nine defect triggers assigned by the inspector to indicate the event that prompted the discovery of the defect. Impact presents a list of thirteen qualities that define the impact the defect may have on the customer if the defect escapes to the field.
Assigned at fix time, defect types are described in the 1993 paper by Chaar et al [3] as: assignment, checking, algorithm, timing/serialization, interface, function, build/package/merge, and documentation. In version 5.11 of the IBM ODC [7], the interface defect type has been expanded to include object messages, and the algorithm defect type has been expanded to include object methods. Build/package/merge and documentation have been removed from the defect type list. An additional defect type now appears: relationship, defined as "problems related to associations among procedures, data structures, and objects."

The defect qualifier, also assigned to the defect at the time of the fix, evolved from two qualifiers, Missing and Incorrect, to include a third qualifier, Extraneous [7]. As an example, a section of documentation that is not pertinent and should be removed would be flagged as Extraneous.

Target represents the high level identity of the entity that was fixed, for example, code, requirements, build script, user guide.

Age identifies the defect as being introduced in:
- base: part of the product that was not modified by the current project; it is a latent defect,
- new: new functionality created for this product,
- rewritten: redesign or rewrite of an old function,
- refix: a fix of a previous (wrong) fix of a defect.

Source identifies the development area that the defect was found in: developed in-house, reused from library, outsourced, or ported.

Since our goals for defect classification are different from those of the IBM ODC, the IBM ODC is not completely suitable for our analysis. The IBM ODC serves instead as a starting point and as a point of reference for describing the ODC-CC.
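The attributes just described can be pictured as a single per-defect record. The following sketch is our own illustrative summary, not IBM's schema; the field names and example values simply paraphrase the descriptions above.

    from dataclasses import dataclass
    from typing import Optional

    # Illustrative summary of the IBM ODC attributes described above.
    # Field names and example values paraphrase [7]; this is not IBM's schema.
    @dataclass
    class IbmOdcDefect:
        # Normally identified when the defect is found:
        activity: str                      # e.g. "code inspection", "design review"
        trigger: str                       # for code inspection, one of nine triggers
        impact: str                        # one of thirteen customer-impact qualities
        # Normally identified after the defect has been fixed:
        defect_type: Optional[str] = None  # e.g. "assignment", "algorithm", "relationship"
        qualifier: Optional[str] = None    # "missing", "incorrect" or "extraneous"
        target: Optional[str] = None       # e.g. "code", "requirements", "user guide"
        age: Optional[str] = None          # "base", "new", "rewritten" or "refix"
        source: Optional[str] = None       # "in-house", "reused", "outsourced" or "ported"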
4. Inspection Experiments

In 1996, one of us developed a new technique for guiding and motivating inspections of computational code [8]. In the winter of 2000, we conducted a first experiment to evaluate the effectiveness of this new inspection technique, called task-directed inspection (TDI) [9]. A second experiment was conducted in the winter of 2001. The need to analyze results of the experiments led to the development of the ODC-CC, and a metric based on it, as described in the following sections. The intent of the metric is to differentiate inspection results in such a way that TDI could be compared to industry standard inspection techniques such as ad hoc or paraphrasing.

The TDI technique, similar to scenario-based inspection techniques [16], provides structured guidance to inspectors during their individual work. The TDI technique piggybacks code inspections on other software development tasks and uses the familiarity the inspector gains with the code to identify issues that need attention. In the application of TDI so far, the software development tasks that combine readily with code inspections are documentation tasks and development of test cases.

Both experiments involved graduate students enrolled in a Software Verification and Validation graduate course [15] offered at Queen's and the Royal Military College (RMC). The experiments each consisted of applying three different inspection techniques to three different code pieces drawn from computational software used by the military. The computational software chosen was written in Visual Basic and calculates loads on bridges due to vehicle convoy traffic. The pieces of code chosen for the experiments were all of equivalent length and were intended to be of equivalent complexity. The pieces were not seeded with defects, in order not to predetermine the types of defects in the code.

The three inspection techniques chosen for the experiments consisted of one industry standard technique and two TDI techniques. The industry standard technique used was paraphrasing (reading the code and acquiring an understanding of the intent of the code without writing the intent down). The two TDI techniques used in the first experiment were Method Description and White Box Test Plan. Method Description required the student to document in writing the logic of each method in the assigned piece of code. White Box Test Plan required the student to describe a series of test cases for each method by providing values for controlled variables and corresponding expected values for observed variables.

For the second experiment, three different pieces of code were chosen from the same military application. The new pieces were shorter and turned out to be less complex. The White Box Test Plan was simplified to a Test Data Plan. This involved identifying variables participating in decision statements in the code and listing the values those variables should take for testing purposes.

Both experiments were a partial factorial, repeated measures design in which each student used all three techniques on the three different code pieces. The application of a technique to a code piece is referred to as a round. The paraphrasing technique was always used first, with the two TDI techniques being alternated amongst the students during rounds 2 and 3. The code pieces were permuted amongst the students and the rounds. For example, student 1 may use code pieces 3, 2, 1 in rounds 1, 2 and 3 while student 2 uses code pieces 2, 1, 3, as sketched below. In the first experiment, twelve students were involved, which allowed the partial factorial design to be complete. Only ten students were involved in the second experiment, so the partial factorial design was incomplete.
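The following sketch illustrates one way such an assignment could be generated. It uses a simple cyclic rotation; the actual permutations used in the experiments differed, and the function name is our own.

    # One way to permute code pieces across rounds (cyclic rotation).
    # A sketch only: the experiments' actual permutations are not reproduced here.
    def assign_code_pieces(num_students: int, num_pieces: int = 3) -> list[list[int]]:
        """Return, for each student, the code piece used in each round."""
        return [[(s + r) % num_pieces + 1 for r in range(num_pieces)]
                for s in range(num_students)]

    # Each row lists one student's code pieces for rounds 1 to 3; the technique
    # order is handled separately (paraphrasing first, the two TDI techniques
    # alternated in rounds 2 and 3).
    for student, pieces in enumerate(assign_code_pieces(3), start=1):
        print(f"student {student}: pieces {pieces}")   # e.g. student 2: pieces [2, 3, 1]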
The goal of the experiment was to measure the effectiveness of the TDI technique as compared to the industry standard technique. Effectiveness was defined as the ability of the individual inspector to detect software defects that require a deeper understanding of the code. To evaluate whether that had been achieved, a metric was needed beyond simply counting findings. The metric must differentiate between findings that simply address formatting issues and those that address logic errors. The ODC-CC [9] was developed for this purpose. Associated with this detailed defect classification was the concept of a level of understanding. Each category in the defect classification was assigned a level of understanding intended to reflect the depth of understanding needed by an inspector to be able to identify a defect in that category. The analysis of the experiment results involved first classifying each defect and then assigning the associated level of understanding. If an inspector using a TDI technique identifies proportionally more defects at a deeper level of understanding, then using a TDI technique is more effective than using the paraphrasing technique for finding these deeper defects.

The depth of understanding needed to make a finding of course does not correlate with the end consequence of the finding on the operation of the software product. Defects that are logically difficult to find may have minor consequences, while obvious defects may have major consequences.

5. Comparing IBM ODC and ODC-CC

The IBM ODC scheme [7] is probably the most developed of the classification schemes, due to its continual evolution over the past ten years. However, due to its different purposes, the IBM ODC did not readily lend itself to what we needed for the analysis of the results from the inspection experiments. By considering the attributes defined in the IBM ODC, we can both map our inspection activity to the ODC and identify where changes are necessary.

Our defect removal activity is code inspection.

The triggers are defined in the IBM ODC as "what you were thinking about when you discovered the defect" [7]. We define a trigger in such a way as to remove any subjective aspect to the activity. Instead of considering "what you were thinking of", we define the trigger as "what task were you carrying out". In our inspection experiments, this was clearly defined, e.g. writing a method description or creating a test plan.

Impact was not considered in our experiment. Target was the code or the documentation used in our experiments.

For the ODC-CC, we changed the time at which defect types are assigned, expanded the set of defect types, and decreased the granularity of classification. In the IBM ODC, defect types are assigned at the time the developer fixes the defect. This means the defect types are defined in terms of the fix. For example, the defect type function is defined as "The error should require a formal design change ...". To simplify evaluation of the results in our inspection experiments, the defect types are assigned before the time of fix, without considering the change necessary to fix the problem. This is a valid view in industry as well, where there are times when inspection is decoupled from fixing.

For our experiments, definitions of defect types must thus reflect the problem as the inspector perceives it in the code: the defect type must relate to the code rather than the fix activity. For example, obscure language constructs, lack of encapsulation, and logically unrelated data items in a structure all reflect what the inspector may find. Any of these could eventually require a "formal design change". This is a significant shift in viewpoint from the categorization needed for the inspection experiment to the IBM ODC categorization done by the fixer.

As well as changing the viewpoint of the defect type from fixer to inspector, we found the list of defect types for code and design was inadequate for defects typically found in computational code. Defects such as poor naming conventions for variables, inaccessible code, and inadequate capture of error conditions didn't seem to fit any category in the IBM ODC. It was unclear if wrong assumptions should be classified as assignment defects or algorithm defects. The ODC described on the Research web site [7] removed documentation from the defect type list, yet this is a category we needed for the inspector.

Finally, we expanded the number of defect types substantially. Finer detail was needed than was offered by the IBM ODC. The issue of granularity in a defect classification scheme leads in conflicting directions. A small number of types may make it easier to pick the one that applies. A larger number of types may increase precision and give greater certainty, but may also mean that classification takes longer. In our case, the extra detail was needed to be able to deduce the level of understanding needed to find a given defect.
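Because the ODC-CC identifiers (see the Annex) are dotted and hierarchical, classification granularity can be adjusted mechanically by truncating an identifier to a chosen depth. A minimal sketch, with a function name of our own choosing:

    # Adjusting classification granularity by truncating a dotted ODC-CC
    # identifier (numbering as in the Annex); the function name is ours.
    def at_level(category: str, level: int) -> str:
        """Truncate an ODC-CC category identifier to a coarser level."""
        return ".".join(category.split(".")[:level])

    assert at_level("2.2.1", 1) == "2"          # Calculation/Logic
    assert at_level("2.2.1", 2) == "2.2"        # Data values before computation
    assert at_level("1.4.1.3", 4) == "1.4.1.3"  # full four-level detail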
In the IBM ODC, the defect qualifier is also defined at the time of the fix. If we assign the defect qualifier at the time of the inspection, then further qualifiers are needed. "Inconsistent" becomes necessary since there are cases where, for example, the inspector may not be able to identify if the code is wrong or the documentation is wrong, only that they are inconsistent. We also found that "obscure" was a frequent defect qualifier in complex computational code, and this was added to the list developed for the ODC-CC.

The IBM ODC Age attribute simplifies to Base for our experiments. The Source attribute is Developed In-House. Neither of these attributes contributes to the analysis of our experiments.

The next section gives the details of the ODC-CC.

6. Description of the Defect Types for the ODC-CC

As mentioned above, a new defect classification was necessary for the analysis of findings from the inspection experiments. The list had to be comprehensive enough to include all types of findings that could arise from the experiments and fine-grained enough to allow us to do a meaningful analysis of the experimental results.

In the industry inspection exercise reported in [8], approximately 950 findings were identified. In preparation for the analysis of the results of our experiments, these 950 findings were grouped into about 90 categories, based on subjective judgements as to which of the 950 findings were most similar to each other. These categories were then used as a base to define defect types typical of computational code. This formed the basis of the new ODC-CC as shown in the Annex.

The ODC-CC contains multiple levels of defect type categories, each level adding more detail. The top level is composed of the following five categories:

- Documentation: documentation against which the code may be compared. The findings classified here are with respect to the documentation. This includes comments in the code.
- Calculation/Logic: findings of implementation related to flow of logic, numerical problems, and formulation of computations.
- Error Checking: findings related to data values (conditions and bounds), pre- and post-conditions, control flow, data size, and where specific checks should be included in the code (defensive programming).
- Support Code: findings in supplementary code used for testing, debugging, and optional execution.
- External Resources: findings in interactions with the environment or other software systems.

These five top level categories are intended to represent a partition of the source code. The first category, documentation, is the only category that includes both code and software products external to the code, such as user documentation and design documents. Code self-documentation includes not only headers, comment lines, and inline comments, but elements of the active code. Well written code serves as documentation in itself, with meaningful names, well constructed structures, and consistent usages. So included in the documentation category are variable naming and use, data and code structure, and user interface constructs. Classes 1.1 and 1.2 in the Annex represent the types of external documentation used in our experiments. Other types of external documents would be needed for other contexts.

The second top level category includes computational work such as calculations, logic, and algorithms. This category covers non-exception cases. The third top level category is for error checking and exception handling. The fourth top level category includes support code for testing and other checks of the software. The fifth top level category covers interactions with the environment during normal execution.
Subtypes of each of the five top-level types are added in successive levels. Each level provides more detail, dividing the level above it into finer categories. In the ODC-CC as it currently exists, four levels were found to be sufficient. The multiple levels allow the flexibility of a very fine grained classification or a very broad classification. This has advantages in different ways, e.g. in training inspectors (the detailed levels help to make the classification clearer), and in adjusting the level of effort desired in classifying the results of an inspection (by requiring more or less detail).

Five defect qualifiers are included in the ODC-CC:

  M = missing
  W = wrong
  S = superfluous
  I = inconsistent
  O = obscure

Each inspection finding is fully categorized by a defect type and a defect qualifier. For example, for defect type 2.2.1: Calculation/Logic - Data values before computation - Terms initialized (see Annex), the qualifiers can indicate if the initialization defect is missing, wrong, superfluous, inconsistent, or obscure. There should be only one defect type and one qualifier into which each finding fits.
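Each finding thus reduces to a (defect type, qualifier) pair, written compactly as the type followed by the qualifier letter (e.g. 2.2.1M), the notation used in Section 9.1. A minimal sketch of this encoding, with function and variable names of our own choosing:

    # A finding as a (defect type, qualifier) pair, e.g. "2.2.1M" means
    # defect type 2.2.1 with qualifier M (missing).
    QUALIFIERS = {"M": "missing", "W": "wrong", "S": "superfluous",
                  "I": "inconsistent", "O": "obscure"}

    def parse_finding(classification: str) -> tuple[str, str]:
        """Split a classification such as '2.2.1M' into defect type and qualifier."""
        defect_type, qualifier = classification[:-1], classification[-1]
        if qualifier not in QUALIFIERS:
            raise ValueError(f"unknown qualifier: {qualifier!r}")
        return defect_type, qualifier

    assert parse_finding("2.2.1M") == ("2.2.1", "M")   # terms initialized: missing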
As well as providing a tool for analyzing the types of findings identified by the inspectors, the ODC-CC also provides a means to separate reported findings. In our experiments, inspectors sometimes recorded more than one finding, rolled into one finding report. By subsequent classification of the findings, the rolled up findings become evident and can be identified as multiple distinct findings. This helps to provide consistency among the reported findings of the various inspectors. An example is given later (Sec. 9.1).

7. Validation of the Completeness of the ODC-CC

The completeness of the ODC-CC was informally validated in three ways. First, before the first experiment, a code inspection checklist developed by Karl Wiegers [18] was used to ensure that all items in the Wiegers checklist could be categorized in the ODC-CC. Second, the ODC-CC was reviewed by a professional software developer. Third, the data from the three rounds of both experiments was categorized to verify that all findings could be associated with a defect class in the ODC-CC.

8. Levels of Understanding for the ODC-CC and Validation of the ODC-CC Categories

The goal of our inspection experiments is to compare the effectiveness of a structured TDI technique to the effectiveness of an unstructured inspection technique. In this paper, effectiveness in inspecting computational code is taken to mean acquiring enough understanding to recognize subtle issues. Other measures of effectiveness are of course possible. Our choice necessitates identifying those findings that require deeper understanding.

To make this identification, each category in the ODC-CC was associated with a level of understanding. These levels of understanding are referred to as CISL, which stands for the Comparative, Identifier, Structural, and Logical levels of understanding. They are defined by the depth of understanding an inspector must attain to identify a defect. The lowest level of understanding is the Comparative level, where the inspector compares the code to other documentation; the easiest defects to identify in source code are those found by such comparisons. The next level in conceptual difficulty is Identifier, where the inspector determines the use of variable identifiers and whether those uses are consistent and unique for each variable. The third level is the Structural level, where the inspector obtains an understanding of the structure of the software to identify the coherence of the structure with the semantics of the different components in the structure. The level requiring the greatest understanding is the Logical level, where the inspector must understand the logical flow, the formulation of equations, and the handling of error conditions. The CISL categorization presented here is subjective, based on our experience and that of a number of other people who have looked at it. Further research will continue its validation.

Table 1 gives the detailed correspondence between CISL and the ODC-CC as given in the Annex. Table 2 gives a high level description of the CISL categories.

Table 1: CISL correspondence to ODC-CC

  C: 1.1, 1.2, 1.3, 1.4.1 to 1.4.3, 1.5 - all from documentation category
  I: 1.4.4 to 1.4.7 - all from documentation category
  S: 1.4.8 to 1.4.13, 4. - from documentation and all of support code categories
  L: 2., 3., 5. - all of calculation/logic, error checking and external resources categories

Table 2: CISL descriptions

  C: Comparison of active code to User Documentation, Theory Documentation, internal comments, etc. Naming conventions for variables, modules. Formatting styles.
  I: Naming of variables and modules versus use.
  S: Semantic or logical structure of data, active code, and modules. Supporting infrastructure for testing, optional features, and debugging.
  L: Calculation/logic and error conditions.

The IBM ODC assigns levels of experience to the different triggers, indicating competencies the inspectors had in order to find the defect. In developing CISL, we took a different viewpoint. First, in the experiment, the students were assumed to be all at the same level of competency. In reality, they are not. Second, the trigger is defined as the task which is carried out by all students. The level of understanding could not be assigned to the trigger (i.e. the task) in advance, since this was exactly what we were trying to measure: what level of understanding was precipitated by the TDI task. Instead, the level of understanding had to be a characteristic of the defect itself (and hence a characteristic of the code). This is what CISL provides.

9. Use of the ODC-CC in the Inspection Experiments

9.1. Analysis Steps in the Experiments

The primary data for each experiment were the findings identified by each student using each inspection technique. Analysis of this data proceeded in several steps, including the following:

- the raw findings were categorized using the ODC-CC;
- the categorized findings were tagged by CISL levels;
- the number of findings in each of the CISL levels was normalized to create C-, I-, S-, and L-proportions; this gave the fraction of each student's findings that were at each of the CISL levels;
- an analysis was done for the C- and L-proportions.

The categorizations for each experiment were handled slightly differently. They are discussed in turn. All findings from the three rounds of each experiment were categorized using the most detailed levels of the ODC-CC.

For the first experiment, the categorization was done by one of us (rounds 1 and 3) and by a professional developer (round 2), taking about 15 hours to complete. This corresponds to a rate of about 30 findings per hour.
Both categorizers were well acquainted with the ODC-CC, with computational code, and with the types of defect inherent in it.

As the categorization was carried out, twenty-eight findings from the three rounds were rejected when it was obvious that some findings were due to some students' lack of coding skills or to a misunderstanding of what should be recorded. The categorization exercise also identified eighteen multiple findings. For example, the following finding report contains two findings: "The meaning of each value of the flag is not stated anywhere; static final variables should be used for the constants to give more meaning to the flag." The first is the lack of documentation for a particular variable, i.e. the flag (classification 1.3.1.1M from the ODC-CC); the second is the lack of a meaningful name for a constant value (classification 1.4.1.3M). Findings such as this are given additional classifications and counted as multiple findings. This helped to standardize the lists of findings submitted by the different students. After the categorization, there was a total of 446 findings identified by the twelve students during the three rounds of the first experiment.

For the second experiment, the students were asked to classify their own defects at the end of the three rounds of the experiment. A second independent classification was done by a professional software developer who was familiar with Visual Basic and with computational software, but not with this particular application. The ten students identified about 250 findings for the three rounds. The categorizations by the students were compared to that of the professional. Section 10 describes this comparison.

9.2. CISL Categorization and Normalization of the CISL Finding Counts

Once the categorization of the findings using the ODC-CC was complete, the findings were associated with each CISL level. This sequence was followed to ensure that no unconscious biases crept into the categorization. The CISL results were obtained from the ODC-CC categorizations by using associations programmed into a spreadsheet to map the classified defects onto the CISL levels. Results for the first experiment for each of the twelve participants (p1 to p12), for the round using the paraphrasing inspection technique, are shown in Table 3.

Table 3: CISL counts for the paraphrasing inspection - first experiment

         C    I    S    L
  p1    15    1    2    1
  p2    16    4    1    6
  p3    18    5    3    3
  p4    18    1    2   11
  p5    16    7    0    2
  p6     4    3    2    1
  p7     0    2    0    5
  p8     1    0    0    7
  p9     2    5    0    4
  p10    0    0    0    1
  p11    1    1    0    1
  p12    4    1    0    2

Before we could interpret the results, we had to take into account that the number of findings generated by each inspector using each technique is affected by the technique itself and by the performance of the inspector. The performance of an individual can vary significantly, both compared to other inspectors, and between rounds for the same inspector [1], [9]. This variation can easily mask the influence of a technique on the experimental results. The repeated measures design of the experiment allowed each participant to act as his/her own control. Even at that, we wanted to normalize the individual variations in findings counts. We did this by dividing each individual's count of findings in each category of CISL by the individual's total findings for that round. This gave a proportion of CISL findings for each round for that individual. The comparisons are then across the results for each individual.

For example, participant p1 in the first experiment, using the P technique, had a total of nineteen findings. Fifteen of those were in the Comparative (C) level of understanding, one in the Identifier (I) level, two in the Structural (S) level, and one in the Logical (L) level. For participant p1 and the P technique, the counts were divided by nineteen to give the proportions C: 0.79, I: 0.05, S: 0.11, L: 0.05. This normalization was done for each individual in both experiments. Alternative normalizations were examined and are discussed in [9].
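The spreadsheet associations and the normalization step can be expressed compactly in code. The sketch below is our own reconstruction: the mapping follows Table 1, the function names are ours, and the defect types in the example are hypothetical stand-ins chosen to reproduce participant p1's counts above.

    from collections import Counter

    # Our reconstruction of the spreadsheet associations: map an ODC-CC defect
    # type to its CISL level, following Table 1.
    def cisl_level(defect_type: str) -> str:
        top = defect_type.split(".")[0]
        if top in ("2", "3", "5"):   # calculation/logic, error checking, external resources
            return "L"
        if top == "4":               # support code
            return "S"
        if defect_type.startswith("1.4."):   # documentation: code style splits by subcategory
            sub = int(defect_type.split(".")[2])
            if sub <= 3:
                return "C"           # 1.4.1 to 1.4.3
            if sub <= 7:
                return "I"           # 1.4.4 to 1.4.7
            return "S"               # 1.4.8 to 1.4.13
        return "C"                   # 1.1, 1.2, 1.3, 1.5

    def cisl_proportions(defect_types: list[str]) -> dict[str, float]:
        """Normalize one inspector's findings for one round into CISL proportions."""
        counts = Counter(cisl_level(d) for d in defect_types)
        total = sum(counts.values())
        return {level: counts.get(level, 0) / total for level in "CISL"}

    # Participant p1, paraphrasing round: 15 C, 1 I, 2 S and 1 L findings (19 total).
    p1 = ["1.1.1.1"] * 15 + ["1.4.5"] + ["1.4.9", "4.1.2"] + ["2.2.1"]
    props = {level: round(p, 2) for level, p in cisl_proportions(p1).items()}
    assert props == {"C": 0.79, "I": 0.05, "S": 0.11, "L": 0.05}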
9.3. Graphical Results

The CISL scale allowed us to identify and extract the findings related to deeper understanding of the code. This in turn allowed an evaluation of the effectiveness of the TDI techniques as compared to the unstructured inspection technique. The L-proportions represent findings that require the deepest understanding of the source code.

Graph 1 shows the L-proportions for each student and each technique from the first experiment (note that P = paraphrasing technique, D = Method Description TDI, and T = Test Plan TDI). The trends in the graph allowed us to make conjectures regarding the use of the inspection techniques, the levels of understanding acquired by the students, and the students' background experience.

[Graph 1: L-Proportions by Technique (2000 experiment). L-proportions plotted against participant number (1 to 12) for the P, D, and T techniques.]

Student experience in the first experiment was highly polarized. Students 1 to 6 had extensive software experience and generally demonstrated a greater ability to find L-level issues with the TDI techniques. Students 7 to 12 had little or no software experience, turned in lower numbers of findings (making the L-proportions often unreasonably high), and had difficulty performing the TDI tasks. A more detailed discussion can be found in [9], [10]. A similar trend was evident in the second experiment, though not so prominent. This is discussed in Section 11.

10. Comparing Professional and non-Professional Classifications

For the second experiment, the findings were classified by the students as a separate step after the three inspection rounds. The findings were also classified independently by a professional software developer. Interesting results came from comparing the students' classifications to that of the professional. Graphs 2 and 3 show the differences when comparing the L-proportions by technique for the ten students. Graph 2 uses the classifications done by the professional and Graph 3 uses the classifications done by the students.

[Graph 2: L-Proportions by Technique - categorization by professional (2001 experiment).]

[Graph 3: L-Proportions by Technique - categorization by students (2001 experiment).]

For the sake of the experimental results, differences that result in changes of trends between the paraphrasing round (P) and the two TDI rounds (D and T) are of most interest, i.e., where the L-proportions for P versus D and T change relative to each other. This happens for students 1, 4, 5, and 9. Students 4 and 9 turned in fewer than 5 findings per round, making their results dubious at best. Students 1 and 5 each turned in more than 30 findings over the three rounds, so it is of more interest to examine these two cases.
For student 1, the categorization by the professional moved three findings from CISL level L to CISL levels C, S, and I. Two L findings were discounted. For example, the student inspector was concerned about a variable not being reset properly on a FOR loop; the professional marked this as not being an issue. As another example, the student classified a finding as an improper initialization (category 2.2.1W) and the professional classified the finding as inconsistent use of the variable with respect to type (category 1.4.6I). Generally, for this student, the professional categorized the findings as being documentation problems rather than logical errors.

For student 5, there were nine findings whose CISL levels differed depending on the categorization. Two of the student's C level findings were categorized as L level by the professional. Two of the student's L level findings were categorized as C level by the professional. Three of the student's S level findings and one I level finding were categorized as L level by the professional. For example, the student classified an irrelevant return value as superfluous debugging code (category 4.1.3) whereas the professional classified the same finding as a case statement having extraneous code (category 2.4.2). Here, the professional generally judged the student's findings to be more logic oriented than the student did.

There is no particular pattern in these differences, but in Section 11 below, we see that the changes made by the professional actually strengthen our final results from the second experiment.

Table 4 gives the number of findings that changed CISL level for each student, based on the two categorizations done. Student 9 was struggling with the inspection and all findings were categorized differently by the professional, including rejecting half of the findings. Generally, the weaker the student, the greater the discrepancy between the student's and the professional's categorizations. Section 12 discusses possible reasons for the difficulty in classifying defects.

Table 4: CISL level changes from different categorizations (second experiment)

  Student   Total findings over 3 rounds      Findings that         % change
            (professional categorization)     changed CISL level
     1                 35                             5                 14
     2                 17                             4                 23
     3                 21                             4                 19
     4                 14                             3                 21
     5                 53                             9                 17
     6                 22                             4                 18
     7                 27                             7                 26
     8                 32                             9                 28
     9                  6                            12                200
    10                 16                             3                 19

11. Analysis of Experimental Results

In a previous paper [10], we stated that: "There seems at this point evidence that for the experienced participants in the experiment, using tasks to structure the inspection resulted in proportionally more L level findings being identified." This was based on the first experiment. The analysis leading to this conclusion is given in [10]. Subjective comments from the participants of both experiments supported the conclusion.

The results of the second experiment support the same conclusion, although some explanation is needed as to why this is so, since Graphs 2 and 3 show some participants for whom it is not true. After the second experiment was complete, we observed that the number of findings for code piece 1 was less than for the other two pieces, and the number of L-findings was significantly less. We also found that the number of L-findings in round 3 was significantly less than for the other two rounds.
When we looked more closely at code piece 1, it appeared that it was too simple to have very many L-findings. Since we put significant time pressure on the students, and discussed the amount of extra time they were spending before they completed round 3, it seems likely that at least some of the students decided to put less effort into round 3, and that this skewed the results.

On this basis, we eliminated the results for participants 3 through 9 in Graphs 2 and 3. For participants 1 and 2, round 3 used the D technique, so all that is left to compare is P and T. For participant 10, round 3 used the T technique, so all that is left to compare is P and D. For both graphs and for all three of these participants, the L-proportions using the D or T technique (as the case may be) are greater than they are for the P technique, so our results as given in [10] are supported by these three participants.

12. Categorization is Hard

The range of efforts to create defect classification schemes described earlier in this paper, and the long history, in which there has been no single, widely used scheme, suggest that defect classification is hard, and that repeatable orthogonal classification is in itself difficult. Generally, training is advised for anyone using a classification scheme. In the second experiment, the students were asked to do the classification of their own findings. They received very limited training. This, along with inexperience, was likely the largest contributor to the variations between the students' work and the professional's.

Our experience with the differences found in comparing classification results from students to results from a professional developer raises issues of design for defect classification schemes. These include:

- What level of detail is appropriate for the purpose of the scheme?
  - lower level details help to define higher level categories more clearly
  - lower level details can make classification harder and more time consuming
- Defects and categories are open to multiple interpretations:
  - the ODC-CC categories can be more clearly defined, but perhaps not to the point of perfection
  - different categorizers interpret inspectors' written comments/findings differently
  - the code author's intent may be difficult to interpret
  - the level of experience and knowledge of the categorizer affects how the code is interpreted
  - there may be legitimate differences of opinion on the source of the problem (e.g. documentation versus logic error)
  - inconsistency can be interpreted differently
  - the ODC-CC categories have a tendency to reflect procedural/FORTRAN code, so students may misinterpret when their experience is with OO code
  - there may be a subconscious influence in making a classification decision due to the type of fix the categorizer expects, which may be different from the category the defect belongs to.

13. Conclusions

Although normally applied to software process evaluation, defect classifications can be used as a metric for individual activities, as shown in this case study. For such activities, the development and use of the classification scheme may require a shift in viewpoint and a change of granularity in the classification categories. Existing classification schemes can provide starting frameworks from which schemes focused on specific needs can be developed.
It is hard work to develop a classification scheme that is both complete and repeatable in the ideal sense. The ODC-CC was complete for our purposes. Its repeatability has to be studied further. It is difficult to factor out the subjectivity of the humans involved at all stages of classification: creation of the scheme, interpretation of the scheme, and interpretation of the items being classified. Further refinement of the scheme and training are possible remedies. Dependence on the skill of the classifier is hard to avoid. However, as imperfect as a defect classification scheme may be, it provides a valuable metric and potential for insight into many different aspects of software development.

14. Biographies

Diane Kelly is an instructor and Ph.D. student at the Royal Military College of Canada. Previously, Diane worked in the Nuclear Division at Ontario Hydro, where she participated in a wide variety of roles in software development, from programmer to project leader, trainer to QA advisor. Diane has an M.Eng. in Software Engineering from the Royal Military College, and a B.Sc. in Mathematics and a B.Ed. in Mathematics and Computer Science, both from the University of Toronto, Canada.

Terry Shepard is a professor in the Department of Electrical and Computer Engineering at the Royal Military College of Canada, where he has played the lead role in creating strong software engineering programs. This includes working extensively with a number of Canadian military software projects, and creating and teaching graduate and undergraduate courses on software design, V&V, and maintenance. He has over 30 years of software experience in industry, government and academia. Terry received his B.Sc. and M.A. from Queen's University in Kingston, and his Ph.D. from the University of Illinois, all in Mathematics. He is a Registered Professional Engineer. He has published a number of papers in software design and verification, and worked with ObjecTime Ltd. for several years.

15. References

[1] Victor Basili, Richard W. Selby, David Hutchens; "Experimentation in Software Engineering", IEEE Transactions on Software Engineering, Vol. SE-12, No. 7, July 1986, pp. 733-743
[2] Boris Beizer; Software Testing Techniques, Van Nostrand Reinhold, 2nd Edition, 1990
[3] J.K. Chaar, M.J. Halliday, I.S. Bhandari and R. Chillarege; "In-Process Evaluation for Software Inspection and Test", IEEE Transactions on Software Engineering, Vol. 19, No. 11, November 1993, pp. 1055-1071
[4] Ram Chillarege, Inderpal S. Bhandari, Jarir K. Chaar, Michael J. Halliday, Diane S. Moebus, Bonnie K. Ray, Man-Yuen Wong; "Orthogonal Defect Classification - A Concept for In-Process Measurements", IEEE Transactions on Software Engineering, Vol. 18, No. 11, November 1992, pp. 943-956
[5] Khaled El Emam and Isabelle Wieczorek; "The Repeatability of Code Defect Classifications", Proceedings of the 9th International Symposium on Software Reliability Engineering, 1998, pp. 322-333
[6] Michael Fredericks and Victor Basili; "Using Defect Tracking and Analysis to Improve Software Quality", Technical Report, DoD Data & Analysis Center for Software (DACS), November 1998
[7] IBM Centre for Software Engineering; Details of ODC, http://www.research.ibm.com/softeng/ODC/DETODC.HTM and FAQ.HTM#concepts
[8] Diane Kelly and Terry Shepard; "A Novel Approach to Inspection of Legacy Code", Proceedings of PSQT 2000, Austin, Texas, March 2000
[9] Diane Kelly; An Experiment to Investigate a New Software Inspection Technique, Master's Thesis, Royal Military College of Canada, July 2000
[10] Diane Kelly and Terry Shepard; "Task-Directed Software Inspection Technique: An Experiment and Case Study", IBM CASCON 2000, Toronto, November 2000
[11] Oliver Laitenberger, Colin Atkinson, Maud Schlich, and Khaled El Emam; "An Experimental Comparison of Reading Techniques for Defect Detection in UML Design Documents", NRC 43614, December 1999
[12] Adam Porter, Lawrence G. Votta, Jr., Victor R. Basili; "Comparing Detection Methods for Software Requirements Inspections: A Replicated Experiment", IEEE Transactions on Software Engineering, Vol. 21, No. 6, June 1995, pp. 563-575
[13] Adam Porter, Harvey P. Siy, Carol A. Toman, Lawrence G. Votta; "An Experiment to Assess the Cost-Benefits of Code Inspections in Large Scale Software Development", IEEE Transactions on Software Engineering, Vol. 23, No. 6, June 1997, pp. 329-346
[14] Glen Russell; "Experience with Inspection in Ultralarge-Scale Developments", IEEE Software, Vol. 8, No. 1, January 1991, pp. 25-31
[15] Terry Shepard; "On Teaching Software Verification and Validation", Proceedings of the 8th SEI Conference on Software Engineering Education, New Orleans, LA, 1995, pp. 375-386
[16] University of Maryland; Notes on Perspective Based Scenarios, http://www.cs.umd.edu/projects/SoftEng/ESEG/manual/pbr_package/node21.html, [online], November 1999
[17] Lawrence Votta; "Does Every Inspection Need a Meeting?", SIGSOFT '93: Proceedings of the 1st ACM SIGSOFT Symposium on Foundations of Software Engineering, ACM Press, New York, 1993, pp. 107-114
[18] Karl Wiegers; Process Impact Review Checklists, http://www.processimpact.com/process_assets/review_checklists.doc

Annex: ODC-CC detailed classification

1. Documentation
  1.1 User documentation (no descriptions of equations or models)
    1.1.1 Data input description
      1.1.1.1 Format
      1.1.1.2 Default values
      1.1.1.3 Recommended values
      1.1.1.4 Use of input data
      1.1.1.5 Data size
      1.1.1.6 Measurement units
    1.1.2 Data output description
      1.1.2.1 Format
      1.1.2.2 Measurement units
      1.1.2.3 Description
      1.1.2.4 Source of output
    1.1.3 Error messages
    1.1.4 Interrelationships between data items
    1.1.5 Software environment
  1.2 Theory documentation
    1.2.1 Descriptions
      1.2.1.1 Description of functionality
      1.2.1.2 Stated defaults, assumptions, limitations for models
      1.2.1.3 Precondition on calculation
      1.2.1.4 Postcondition on calculation
    1.2.2 Symbology
      1.2.2.1 Definitions, distinctions between symbols
  1.3 Internal documentation (comments in code)
    1.3.1 Description of variables
      1.3.1.1 Local
      1.3.1.2 Global
      1.3.1.3 Interface
    1.3.2 Description of process
      1.3.2.1 Logic/Calculations
      1.3.2.2 Module header: description of functionality, version, references
  1.4 Code (style, internal consistency, structure)
    1.4.1 Convention for naming variables
      1.4.1.1 Variable naming scheme
      1.4.1.2 Variable names avoid using reserved keywords
      1.4.1.3 Constant values have meaningful names (e.g., PI for 3.14)
    1.4.2 Module naming
      1.4.2.1 Reflects use
    1.4.3 Formatting style, e.g., labels, white space, comments
      1.4.3.1 Use of labels and other inactive code features
      1.4.3.2 Use of white space
      1.4.3.3 Use of comments
    1.4.4 Variable naming versus use
      1.4.4.1 One variable name per semantic (e.g., MASS for mass of fuel rod; MFUEL for same quantity)
      1.4.4.2 One use over time for each variable name (e.g., COUNT used for number of channels and later for number of devices)
      1.4.4.3 One use at any one time for each variable name (e.g., +ve values are Masses, -ve values are Volumes)
    1.4.5 Interface parameters; type versus use, size and scope
    1.4.6 Local/private variables; type versus use, size and scope
    1.4.7 Global variables; type versus use, size and scope
    1.4.8 Data structures composed of semantically or logically related data
    1.4.9 Structure of data logically evident or thoroughly documented
    1.4.10 Code structure reflects logical relationships
    1.4.11 Common functionality confined to one module (e.g., use of library routines, one module for common computations)
    1.4.12 Volatile or complex structures isolated or encapsulated (e.g., access routines for complex data structures or system dependent task routines)
    1.4.13 Language constructs
      1.4.13.1 Generalized to reduce restrictions and/or obscurity
      1.4.13.2 Recommended constructs to replace obsolete constructs
  1.5 I/O Code
    1.5.1 Output wording and format
    1.5.2 Input wording and format
    1.5.3 Error message wording and format
    1.5.4 Program action versus error message
    1.5.5 Program action versus output
    1.5.6 Program action versus input
2. Calculation/Logic
  2.1 Formulation of equation
    2.1.1 Combination of terms of equations
  2.2 Data values before computation
    2.2.1 Terms initialized
    2.2.2 Appropriate data value used (e.g., from previous calculations or based on related data)
    2.2.3 Appropriate assumptions, defaults
  2.3 Formulation of boolean condition
    2.3.1 Variables in boolean expressions
      2.3.1.1 Variables checked against correct limits
      2.3.1.2 Comparing two floating point numbers
    2.3.2 Formulation of boolean condition
  2.4 Logic flow
    2.4.1 All cases addressed
    2.4.2 Coding for each case is complete and each case has only code required specifically for that case
    2.4.3 All code accessible (including calls to modules, use of passive code: formats, inline functions)
      2.4.3.1 All modules accessed
      2.4.3.2 All parts of logic accessible
      2.4.3.3 All passive code constructs used (labels, inline functions referenced)
  2.5 Calculation
    2.5.1 All calculations completed before subsequent use
    2.5.2 Calculation done within scope of assumptions
    2.5.3 Normal exit from calculation
    2.5.4 Order of calculations
    2.5.5 Limitations on calculations consistent with subsequent calculations
    2.5.6 Access to data structures
  2.6 Algorithm
    2.6.1 Appropriate algorithm
    2.6.2 Algorithm used within documented limitations
  2.7 Numerics
    2.7.1 Round-off error
    2.7.2 Discrete computation
    2.7.3 Accuracy
  2.8 Error handling
    2.8.1 Clean up or reset state variables after error condition
3. Error Checking (identified places in code where error checking should be done)
  3.1 Preconditions on data values
    3.1.1 Expected data values (e.g., discrete values), maximum and minimum values, zero values
    3.1.2 Data values based on related data, e.g. check the related value is as expected
  3.2 Postconditions on data values
    3.2.1 Completion of calculation
    3.2.2 Expected range of values, e.g., void fraction calculated value between 0 and 1
  3.3 Conditions in logic structures, case structures, loops
    3.3.1 Bounds on loops
    3.3.2 All conditions covered, fall through
    3.3.3 Appropriate conditions for IF block
  3.4 Data size
    3.4.1 Array bounds
    3.4.2 Size of character or other variables
  3.5 Status codes or error conditions passed from other software
    3.5.1 Capturing and handling error conditions from the operating system, hardware
    3.5.2 Capturing and handling error conditions from other application software
    3.5.3 Capturing and handling error conditions from other modules or submodules within application
4. Support Code
  4.1 Code to support testing
    4.1.1 Test harnesses
      4.1.1.1 Stubs
      4.1.1.2 Other
    4.1.2 Debugging code
  4.2 Code to support additional development features
    4.2.1 "Backdoors", shortcuts
    4.2.2 Trial options, temporary features
5. External Resources
  5.1 File/memory handling
    5.1.1 Buffers flushed
    5.1.2 Files opened before writes, files closed; memory allocated/deallocated
    5.1.3 End of file flags
    5.1.4 File positioning
  5.2 Library programs
    5.2.1 Interfaces
    5.2.2 Assumptions on use
    5.2.3 Interpretation of returned values from external routines
  5.3 Interaction with other software systems
    5.3.1 Interfaces, both input and output
    5.3.2 Assumptions on exchanged data
  5.4 Dependencies on environment
    5.4.1 Assumptions about internal word size
    5.4.2 Assumptions about compiler initialization of data