
2009 Ninth International Conference on Quality Software

Are Fault Failure Rates Good Estimators of Adequate Test Set Size?
Vidroha Debroy and W. Eric Wong
Department of Computer Science, The University of Texas at Dallas, USA
{vxd024000, ewong}@utdallas.edu

Abstract
Test set size in terms of the number of test cases is an important consideration when testing software systems. Using too few test cases might result in poor fault detection and using too many might be very expensive and suffer from redundancy. For a given fault, the ratio of the number of failure-causing inputs to the number of possible inputs is referred to as the failure rate. Assuming a test set represents the input domain uniformly, the failure rate can be redefined as the fraction of failed test cases in the test set. This paper investigates the relationship between fault failure rates and the number of test cases required to detect the faults. Our experiments suggest that an accurate estimation of failure rates of potential fault(s) in a program can provide a reliable estimate of an adequate test set size with respect to fault detection (a test set of size sufficient to detect all of the faults) and therefore should be one of the factors kept in mind during test set generation.

Keywords: fault failure rate, fault detection, software testing, test set size.

1. Introduction
The number of test cases, or the size of the test set, used to test software has a significant impact on the overall testing process. A test set consisting of too few test cases may not be able to detect faults effectively. On the other hand, a test set with too many test cases greatly increases the cost associated with testing and may also result in test case redundancy. Thus, test set size is one of the chief determinants of the overall cost and effectiveness of the software testing process. Recognizing that large, comprehensive sets of test cases are rarely available and are impractical to use due to their expense, several studies have explored the link between the size of a test set and its corresponding effectiveness in terms of fault detection capability [15,19,20]. In contrast, the objective of this study is to observe empirically whether the hardness of detection of faults can be used as an effective predictor of an adequate test set size, that is, a test set size sufficient to detect all the faults. We formally define this hardness of detection in Section 2. To the best of our knowledge, no prior study has pursued an identical objective.

For the purposes of such an initial study, we make the following assumptions. First, we assume that a test set represents the input domain uniformly. This means that for a given faulty program, the fraction of test cases in the test set that fail on the program equals the fraction of failure-causing inputs in the actual input domain. Second, we assume that a faulty version of a program contains only one fault. Thus, if a test case fails on a particular faulty program, we can attribute the failure to one and only one fault. Additionally, we apply our method only to faulty versions that result in the failure of at least one test case. All faulty versions that cannot be detected by any test case available for this study have been excluded. Also excluded are faulty versions that result in a failure of every test case used in this study. Hereafter, when we use the term 100% effectiveness we refer to a complete detection of all the faults (available for study) in a program, each of which must cause at least one failure, i.e., have a non-zero failure rate.

Our approach uses probabilistic estimation of the expected number of faults detected by a test set. We then predict an adequate test set size such that all of the faults are detected and we achieve 100% fault detection effectiveness. The results are compared against those obtained from actual test case execution and the prediction error is recorded. In order to make the validation more comprehensive, we repeat this process across two suites of programs (eight programs in total), each with multiple faulty versions.

The rest of the paper is organized as follows: Section 2 gives background information relevant to this paper. Section 3 describes the methodology. Section 4 elaborates on the case studies undertaken and presents an analysis of the results. Section 5 discusses threats to validity and related issues. Related work is in Section 6, and conclusions and future work appear in Section 7.

2. Background
In this paper, we use terminology similar to that defined in [1]. A failure is an event that occurs when a delivered service deviates from correct service. The deviation of system state from correct state is called an error. The adjudged or hypothesized cause of an error is called a fault. Given a group of faulty versions (faults) of the same program and a test set T, we define the fault detection effectiveness E as the percentage of faults detected by T. For example, consider a group of five faults, i.e., five programs with one fault each. If at least one test case in T fails on three of the programs (it does not have to be the same test case for all of the programs), but none fails on the other two, then the effectiveness is 60%.

As in [5,6], for any program P, let the input domain be represented by D, let d denote the size of D, and let d_f denote the number of failure-causing inputs in D. The overall failure rate of the input domain, denoted by θ, is the proportion of failure-causing inputs in the input domain, that is, θ = d_f/d. Note that a fault that is capable of being detected must have a non-zero failure rate. Also, the failure rate is restricted to a value between 0 and 1. A failure rate close to 0 means that the fault is relatively hard to detect, as the ratio of failed test cases to successful ones is low. On the other hand, a failure rate close to 1 means that the fault is relatively easy to detect, as a large percentage of test cases fail on the fault. This failure rate represents the hardness of detection of a fault. For the purposes of this study, we assume that a test set is constructed such that it represents the input domain uniformly. Therefore, we extend the above definition of the failure rate to be equivalent to the fraction of failed test cases in the test set used. Selection of test cases with replacement leads to simpler mathematical models. However, one of the goals of this study is to assist in the identification of an adequate test set size (to detect all faults used in the study) while avoiding redundancy. Thus, we restrict ourselves to test sets with distinct, non-repetitive test cases, so duplication within a test set is avoided. Furthermore, repeating the same test case offers no gain in fault detection effectiveness, assuming the execution environment does not change.
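To make the test-set view of the failure rate concrete, the following is a minimal sketch (ours, not from the paper; the function name and the pass/fail outcomes are hypothetical) that computes a fault's failure rate as the fraction of failed test cases:

```python
def failure_rate(outcomes):
    """Approximate a fault's failure rate as the fraction of failed test cases,
    where each entry of `outcomes` is True if that test case fails on the fault."""
    return sum(outcomes) / len(outcomes)

# Hypothetical outcomes of running 10 distinct test cases against one faulty version:
outcomes = [False, True, False, False, False, False, True, False, False, False]
print(failure_rate(outcomes))  # 0.2, i.e., 20% of the test cases fail on this fault
```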

3. Methodology

Let there be a set of n faults: f_1, f_2, ..., f_n. Let there also be a test set T that contains m test cases: t_1, t_2, ..., t_m. Let each fault f_i be associated with a failure rate θ_i such that

θ_i = φ_i / m    (1)

where φ_i represents the number of test cases out of m that fail on fault f_i. Suppose we wish to use a subset of T, say T', with σ test cases such that σ ≤ m. We define p_i^F(σ) as the probability of detecting fault f_i by a test case in T'. The outcome of the fault detection process is binary because f_i is either detected by T' or it is not. Therefore, we can say

p_i^F(σ) = 1 − p_i^S(σ)    (2)

where p_i^S(σ) represents the probability of not detecting f_i, i.e., the probability that none of the test cases in T' fails on the fault. Let Y_i denote the outcome of the fault detection process for fault f_i; then Y_i can take one of two values:

Y_i = 1 if fault f_i is detected, and Y_i = 0 otherwise    (3)

The expected value of Y_i is

EX(Y_i) = (1)(p_i^F(σ)) + (0)(1 − p_i^F(σ)) = p_i^F(σ)    (4)

Note that each faulty version has only one fault and therefore the detection of a fault is independent of the detection of another. The expected number of faults detected by T' (which is the sum of the individual expectations) is computed as

p_1^F(σ) + p_2^F(σ) + ... + p_n^F(σ) = ∑_{i=1}^{n} p_i^F(σ)    (5)

Accordingly, the effectiveness is computed as

E = ( (1/n) ∑_{i=1}^{n} p_i^F(σ) ) × 100    (6)

Combining Equations (2) and (6), the effectiveness equation now reduces to

E = ( 1 − (1/n) ∑_{i=1}^{n} p_i^S(σ) ) × 100    (7)

For any fault f_i, p_i^S(σ) may be computed as follows:

p_i^S(σ) = [ (m − φ_i)! (m − σ)! ] / [ (m − φ_i − σ)! m! ]    (8)

This computation is very similar to that of the Q-measure in random testing [12], which is used to compute the probability of detecting at least one failure when test case selection is done without replacement. The effectiveness computation now resolves to

E = ( 1 − (1/n) ∑_{i=1}^{n} [ (m − φ_i)! (m − σ)! ] / [ (m − φ_i − σ)! m! ] ) × 100    (9)

Given all the failure rates θ_i (i.e., φ_i and m), the number of faults n we expect to test for, and the desired level of effectiveness E, we can solve Equation (9) for σ in order to compute an expected test set size needed to reach that level of fault detection effectiveness. Only integer solutions of σ are meaningful and applicable because there is no such thing as a fractional test case. Unfortunately, using this equation to calculate the value of σ analytically is complex and computationally intractable for increasingly large values of n. Therefore, we calculate the value of σ numerically by successive substitution (namely, σ = 1, σ = 2, and so on) until the desired level of effectiveness is achieved. We are now ready to apply this approach to actual programs in order to observe how accurate the effectiveness predictions are in reality. The next section describes the validation of the methodology described here against various C programs.
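As an illustration of Equations (8) and (9) and of the successive-substitution search for σ, here is a small sketch (a rough outline under the paper's assumptions; the function names and the example fault profile are ours, not taken from the paper):

```python
import math

def prob_not_detected(m, phi_i, sigma):
    # p_i^S(sigma), Equation (8): the hypergeometric probability that sigma distinct
    # test cases drawn from m contain none of the phi_i failure-causing ones.
    return math.comb(m - phi_i, sigma) / math.comb(m, sigma)

def predicted_effectiveness(m, phis, sigma):
    # Equation (9): expected percentage of faults detected by a test set of size sigma.
    n = len(phis)
    return (1 - sum(prob_not_detected(m, phi, sigma) for phi in phis) / n) * 100

def adequate_test_set_size(m, phis, target=99.999):
    # Successive substitution sigma = 1, 2, ... until the desired effectiveness is reached.
    for sigma in range(1, m + 1):
        if predicted_effectiveness(m, phis, sigma) >= target:
            return sigma
    return m

# Hypothetical example: 3 faults in a program with m = 1000 test cases,
# failing on 5, 40, and 200 of them (failure rates 0.005, 0.04, and 0.2).
m, phis = 1000, [5, 40, 200]
print(predicted_effectiveness(m, phis, 50))   # predicted E for a 50-test-case set
print(adequate_test_set_size(m, phis))        # smallest sigma predicted to detect all faults
```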

4. Case studies

In this section we report our case studies using the methodology proposed in Section 3 on two different suites of C programs: the Siemens suite (consisting of 7 different programs) and the space program. Each program has a varying number of test cases and faulty versions available. Further information on each program is detailed as follows.

4.1. Subject programs


The seven programs in the Siemens suite have been used extensively in testing and fault localization studies such as those in [7,10,13,14,21,22]. The correct versions, 132 faulty versions of the programs, and all the test cases were downloaded from [16]. Three faulty versions are excluded from our study: version 9 of schedule2 because it does not result in any test case failure, and versions 4 and 6 of print_tokens because these faults are located in header files rather than in the .c files and therefore represent special cases. Table 1 presents a summary of the programs in the Siemens suite.
Table 1. Summary of Siemens suite

Program         Number of faults used    Number of test cases
print_tokens    5                        4130
print_tokens2   10                       4115
schedule        9                        2650
schedule2       9                        2710
replace         32                       5542
tcas            41                       1608
tot_info        23                       1052

The space program provides a user interface that allows the user to configure an array of antennas by using a high level language. The correct version, the 38 faulty versions, and a suite of 13585 test cases used in this study were downloaded from [9]. Three faulty versions that do not cause any execution failures have been left out of the study. Thus, the number of faulty versions used in this experiment is 35. A summary of the number of faults and the number of test cases available for the space program is in Table 2.

Table 2. Summary of the space program

Program    Number of faults used    Number of test cases
space      35                       13585

4.2. Data collection

All program executions were on a PC with a 2.13GHz Intel Core 2 Duo CPU and 8GB physical memory. The operating system was SunOS 5.10 (Solaris 10) and the compiler used was GCC 3.4.3. Each faulty version was executed against all its corresponding available test cases. The success or failure of an execution was determined by comparing the outputs of the faulty version and the correct version of a program. A failure was recorded whenever there was a deviation of the observed output from the expected output. For each program, test sets are randomly generated for all sizes between 1 and the number of test cases available for that program. Furthermore, multiple test sets are generated for each size and the observed effectiveness is then averaged across the number of test sets generated, in order to minimize bias due to the choice of test cases in the set. For test sets between the sizes of 1 and 5000, 60 sets are generated for each size. For test sets between the sizes of 5001 and 10,000, 30 sets are generated for each size. Finally, for test sets between the sizes of 10,001 and 13,584, 10 sets are generated for each size. The reason that a different number of test sets is generated for different sizes is that as test set size increases, it becomes progressively harder and more time consuming to generate distinct test sets. In our experiment, failure rates are empirically calculated by executing all the available test cases downloaded from [9] and [16] against each faulty version of a program and observing the fraction of failed test cases on that faulty version.

4.3. Prediction accuracy

Comparisons between the predicted effectiveness by Equation (9) and the observed effectiveness as described in Section 4.2 are presented. The comparison is drawn until the predicted effectiveness reaches 100%. We remind the reader that effectiveness levels of 100% are possible in our controlled experiments because we exclude those faults which do not result in a failure across any of the test cases. As long as every fault can be detected by at least one test case, a trivial way to ensure 100% effectiveness is to use every test case available. We round the predicted effectiveness to 100% past three decimal places, i.e., values of 99.999% and above are treated as 100%. We also note that while the predicted effectiveness monotonically increases with an increase in test set size until all of the faults have been detected, the same may not be the case with the observed effectiveness. It is possible for a test set of (k+1) test cases to have a lower effectiveness than that of a test set with k test cases. Considering that the objective of the experiment is to identify whether failure rates can assist in identifying an adequate test set size, it is useful to note the circumstances under which either the predicted or observed effectiveness reaches exactly 100%. We recall that the observed effectiveness of any test set of a particular size is computed as the average effectiveness across multiple sets of that size in order to eliminate bias. Therefore, in order to achieve 100% observed effectiveness, each set must be able to detect all of the faults, i.e., each set must have an effectiveness of 100%. In contrast, for any set of faults (faults that do not result in test case failure are excluded), the predicted effectiveness will only be 100% when the size of the test set is greater than the minimum number of test cases required to guarantee detection of each and every fault. Such a guarantee can only be made when the size of the test set exceeds the number of successful test cases for the fault which causes the minimum number of failures. Recall that σ is the size of the test set and φ_i represents the number of failed test cases on fault f_i, where i ranges from 1 to n. Stating the above condition differently, whenever σ > m − min_i(φ_i), the predicted effectiveness will be 100%, as any such test set must include a failed test case for every fault.
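To make the 100% condition above concrete, note that the hypergeometric term of Equation (8) vanishes exactly when σ > m − φ_i. A small hedged check, reusing the hypothetical prob_not_detected sketch and fault profile from Section 3:

```python
import math

def prob_not_detected(m, phi_i, sigma):
    # Equation (8); math.comb(n, k) returns 0 when k > n, so this is 0 once sigma > m - phi_i.
    return math.comb(m - phi_i, sigma) / math.comb(m, sigma)

m, phis = 1000, [5, 40, 200]      # hypothetical fault profile, as in the earlier sketch
threshold = m - min(phis)         # 995: the largest sigma that can still miss the hardest fault
print(prob_not_detected(m, min(phis), threshold))      # small but non-zero
print(prob_not_detected(m, min(phis), threshold + 1))  # 0.0: every 996-test-case set detects it
```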

Figures 1 through 8 illustrate the comparison graphically, program by program. We observe that in general the predicted effectiveness follows the observed effectiveness closely. Without loss of generality, we discuss the program print_tokens in Figure 1 for an analysis; similar trends are observable in the curves for the other programs as well. In the case of print_tokens, the maximum percentage error between the predicted and observed effectiveness is 22.84%, which occurs when just one test case is used (i.e., a test set size of 1). When the test set consists of just one test case, the choice of test case is extremely relevant, as some test cases may be able to detect many more faults than others. Thus, there may be a drastic change in the observed effectiveness as we move from one test case to another. Intuitively, the smaller the size of the test set used, the greater the possibility of encountering such bias. This explains the high percentage error observed on test sets of smaller sizes. The error gradually diminishes as the size of the test set is increased. Further discussion on the relationship between test set size and prediction error is presented in Section 4.4, where graphs are provided to show the change in prediction error with respect to an increase in test set size. In the case of print_tokens, the predicted effectiveness approximates to 100% after the use of 1114 test cases out of a possible 4130. In contrast, the observed effectiveness on print_tokens consistently approximates to 100% after only about 850 test cases are used. An interesting observation is that the predicted effectiveness is generally conservative when compared to the observed effectiveness (i.e., the observed effectiveness is generally greater than the predicted effectiveness). A similar observation holds for the other programs, where the observed effectiveness approaches and reaches the 100% effectiveness mark faster than the predicted effectiveness.

Figure 1. Comparison on print_tokens
Figure 2. Comparison on print_tokens2
Figure 3. Comparison on schedule
Figure 4. Comparison on schedule2
Figure 5. Comparison on replace
Figure 6. Comparison on tcas
Figure 7. Comparison on tot_info
Figure 8. Comparison on space

We now present an analysis of the quality of the predictions across various effectiveness levels in Table 3. The predicted effectiveness is divided into 10 sections ranging from 10% to 100%, with intervals of 10% in between. The number of test cases required to reach a predicted effectiveness level (the rows labeled P) is presented along with the observed effectiveness (the rows labeled O) obtained by using the same number of test cases. This data is presented for each of the programs under study. Recall that the predicted effectiveness monotonically increases with an increase in test set size, while the observed effectiveness may not. Thus, it is hard to attribute a single effectiveness value to a test set with a given number of test cases, because this value is subject to variation both positive and negative. However, since the observed effectiveness on a given test set size is averaged over several test sets of the same size, the bias is greatly reduced. This observed effectiveness (O) corresponding to the number of test cases used (P) is reported in the table. Note also that it is not always possible to section the effectiveness into clean partitions such as 10%, 20%, and so on without having to consider fractional test cases. To deal with this situation, appropriate approximations are made to round to the nearest effectiveness partition. For instance, the predicted effectiveness obtained by using 6 test cases on the program print_tokens is 10.71%, which is still approximated to the 10% partition. We now discuss an example of how Table 3 is to be read. Looking at the first row, i.e., the 10% level, we observe that 6 test cases are required for a predicted effectiveness of about 10% on the program print_tokens. Correspondingly, by using a test set consisting of 6 test cases on the program print_tokens, we find that about 12.5% of the faults are detected on average; that is, we observe an effectiveness of 12.5% as opposed to the predicted 10%. As another example, we observe that for print_tokens 96 test cases are to be used in order to predict an effectiveness of 70%. If 96 test cases are actually used, then the observed effectiveness is about 69.17% on average.


Table 3. Prediction quality across various levels of effectiveness
(P: number of test cases required to reach the predicted effectiveness level; O: observed effectiveness when that many test cases are used)

Predicted        print_tokens  print_tokens2  schedule  schedule2  replace  tcas     tot_info  space
10%    P         6             2              3         7          5        4        1         -
       O         12.5%         11.17%         9.29%     10.95%     11.49%   9.44%    7.94%     -
20%    P         13            4              6         14         11       9        3         2
       O         20.28%        20.50%         21.19%    19.05%     19.71%   22.22%   25.48%    24.76%
30%    P         21            7              10        23         19       14       4         4
       O         28.61%        34.83%         28.81%    30%        31.44%   31.34%   29.44%    33.38%
40%    P         32            10             15        33         28       22       6         7
       O         40.00%        38.87%         39.76%    45.24%     38.28%   43.24%   40%       40.43%
50%    P         46            14             24        45         40       32       9         14
       O         48.11%        49.33%         51.67%    52.14%     47.93%   51.11%   48.65%    50.95%
60%    P         66            20             38        59         58       48       13        27
       O         61.94%        63.67%         59.05%    58.57%     61.32%   60.09%   59.60%    61.38%
70%    P         96            28             59        79         85       71       20        55
       O         69.17%        71.67%         68.81%    72.38%     71.15%   69.77%   68.25%    70.10%
80%    P         143           42             95        107        132      109      33        128
       O         79.44%        80.33%         81.19%    84.52%     81.03%   81.81%   79.29%    80.24%
90%    P         225           77             156       157        232      185      61        385
       O         91.11%        90.33%         90.08%    90.95%     90%      89.86%   90.48%    90.67%
100%   P         1114          844            794       687        1627     984      613       8098
       O         100%          100%           100%      100%       100%     100%     100%      100%

Certain values, such as the number of test cases required to reach a predicted effectiveness level of 10% for the space program, have been omitted, as a higher effectiveness is already achieved by using just a single test case. This is because several faults in space are generally easy to detect, i.e., have a high failure rate, and therefore one test case alone is adequate for detecting more than 10% of the faults. As with the curves (Figures 1 to 8), we observe that in general the gap between the predicted and observed effectiveness is small. Furthermore, the gap seems to narrow as the test set size increases. The following section addresses the relationship between prediction error and test set size.

4.4. Prediction error and test set size

As discussed previously, the size of the test set has a significant bearing on the quality of the predictions and consequently the degree of prediction error. Now we analyze the differences between the predicted effectiveness and the observed effectiveness in terms of two commonly used error measures: the mean absolute error (MAE) and the root mean squared error (RMSE). A description of both the MAE and RMSE measures is as follows.

Mean Absolute Error (MAE): The MAE shows how close a prediction is to an observation in terms of the absolute value of the error recorded. Formally, given a set of n observations, let a_i and b_i denote the ith prediction and observation respectively. Then the MAE is given by:

MAE = (1/n) ∑_{i=1}^{n} |a_i − b_i|    (10)

Thus, the MAE measures the average magnitude of errors without considering their direction and is a linear score because all of the individual errors are weighted equally.

Root Mean Squared Error (RMSE): The RMSE is another score which measures the average magnitude of the error but is different from the MAE in that it is a quadratic score. Using the same variables as in Equation (10), we formally define the RMSE as:

RMSE = √( (1/n) ∑_{i=1}^{n} (a_i − b_i)² )    (11)
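As a concrete illustration of Equations (10) and (11), the following is a brief sketch (ours; the sample values are hypothetical and not taken from the experiments):

```python
import math

def mae(predictions, observations):
    # Mean absolute error, Equation (10).
    return sum(abs(a - b) for a, b in zip(predictions, observations)) / len(predictions)

def rmse(predictions, observations):
    # Root mean squared error, Equation (11).
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(predictions, observations)) / len(predictions))

# Hypothetical predicted vs. observed effectiveness values (in %):
predicted = [10.0, 20.0, 30.0, 40.0]
observed = [12.5, 20.3, 28.6, 40.0]
print(mae(predicted, observed), rmse(predicted, observed))
```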

Since the RMSE squares the errors before taking their average, the RMSE assigns a correspondingly larger weight to the larger errors. Thus, the effect of relatively large errors is magnified while that of extremely small errors is diminished by the computation of the RMSE. Both the MAE and the RMSE are negatively oriented scores in the sense that the lower the value of each measure, the better the prediction quality. It is always desirable for any predictive model to have low values of MAE and RMSE. Table 4 summarizes the differences between the observed and predicted effectiveness for each of the 8 programs in terms of the MAE and the RMSE.


Table 4. Error measures across the programs

Program         Mean Absolute Error (MAE)    Root Mean Squared Error (RMSE)
print_tokens    0.55                         0.91
print_tokens2   0.35                         0.75
schedule        0.55                         0.96
schedule2       0.78                         1.37
replace         0.31                         0.58
tcas            0.33                         0.60
tot_info        0.26                         0.50
space           0.17                         0.28

We observe that the values of MAE and RMSE are relatively low for all of the programs under examination. Recall that the program print_tokens has a high percentage prediction error for a test set of size 1; however, the final values of the error measures are relatively low. This means that the accuracy of the predictions, as the test set size increases, compensates for the error in predictions on small test sets. For most of the programs, both the MAE and the RMSE lie between 0 and 1. The exception is schedule2, with an MAE of 0.78 and an RMSE of 1.37, which are slightly higher than those of the other programs. Recall that the comparison is only drawn between the predicted and observed effectiveness until the point when the predicted effectiveness approximates to 100%. In the case of schedule2, the predicted effectiveness approximates to 100% when the test set size is 688 out of a possible 2710 test cases. Test sets of sizes larger than 688 have not been factored into the MAE and RMSE computations. Even so, there is no specific reason why the MAE and RMSE measures should be relatively higher in the case of schedule2, and this may be attributed to random causes. Based on the curves presented in Figures 1 through 8 and the data provided in Table 4, we observe that even though test sets of small size have relatively large errors, on average the prediction error is relatively low both in terms of the MAE and the RMSE. It is therefore useful to observe the trend in prediction error as we gradually increase the size of the test set. We generated graphs of the change in prediction error with respect to an increase in test set size for all of the programs under study. However, due to space constraints, only the graphs for the programs print_tokens2 and schedule2 are presented for discussion purposes. Figure 9 shows the change in prediction error with an increase in test set size for the program print_tokens2, while Figure 10 does so for the program schedule2. The x-axis represents an increase in test set size until the point when the predicted effectiveness approximates to 100%. The y-axis represents a two-sided error between the predicted effectiveness and the observed effectiveness. A two-sided error is presented instead of the absolute error so that it can be observed whether the predictions are generally optimistic (over-predicting) or pessimistic (under-predicting) with respect to the observed effectiveness.

Figure 9. Prediction error versus test set size on print_tokens2

Figure 10. Prediction error versus test set size on schedule2

Based on the figures for print_tokens2 and schedule2, we observe that the general trend is for the prediction error to diminish as the test set size increases. We observe similar patterns in the graphs for the other programs under study as well. Note that while the prediction error diminishes as the number of test cases used increases and becomes negligible or zero when the predicted effectiveness evaluates to 100%, the graph crosses the x-axis several times. This implies that there are several test set sizes (though not necessarily discrete) for which the prediction error evaluates to zero. The crossing of the x-axis multiple times means that the predictions in general are both optimistic and pessimistic in nature in that sometimes we over-predict the effectiveness of a test set and sometimes we under-predict its effectiveness. To analyze the distribution of when the predictions are optimistic and when they are pessimistic is beyond the immediate scope of this paper and shall be addressed in the future.

5. Discussion
In this section we present a discussion on some of the threats to the validity of the experiment and its derived results and also address issues related to the estimation of failure rates.

5.1. Threats to validity

We recognize that there are several threats to the validity of the results described in this paper; they are addressed in the following. First and foremost, this experiment makes use of actual failure rates determined experimentally. Such information is only available after the test set has been constructed and the test cases have been run. The quality of the predictions is heavily dependent on how accurate the failure rates are, and an inaccurate estimation of failure rates might lead to a misleading estimation of adequate test set size. In practice, failure rates can be obtained from sources such as programmer and manager intuition as well as from historical data. In order to observe what the effects of an inaccurate estimation of failure rates would be on the prediction quality, we further this discussion in Section 5.2. Second, for the sake of such an initial study, we assume a test set that represents the input domain uniformly. In practice, this is nearly impossible to achieve, especially as the complexity and number of possible inputs increase. Additionally, in this study we only apply the approach to faulty versions containing one fault each. Such an assumption has been made in many fault localization related papers as well [7,10,13,14,21,22]. The presence of multiple faults in a single program may affect the individual failure rate of any one fault because one cannot always assume independence of faults. We also may not be able to generalize the results presented in this paper to any arbitrary program. However, the analysis has been conducted on programs of varying sizes, faulty versions, and numbers of test cases in order to increase our confidence in the validity and applicability of our observations. We take this opportunity to remind the reader once again that the objective of this paper was never to illustrate how to estimate the failure rates of faults in a program accurately, nor was it to establish how failure rates may change in the presence of multiple faults. Rather, before full-fledged studies are conducted and much effort is spent on the above (and related issues), we wish to establish whether the use of failure rates in test set construction is at all useful and reliable. By no means does this paper or its results imply that fault failure rates should be the sole consideration when trying to identify an adequate test set size. Instead, our results suggest that an accurate estimation of the failure rates of potential fault(s) in a program can aid in providing a reliable estimate of an adequate test set size and therefore should be one of the factors kept in mind during test set construction. We also remind the reader that the adequacy of a test set can be evaluated across multiple criteria, such as coverage. This paper refers to an adequate test set as one that is capable of detecting all of the faults (i.e., a test set that results in 100% fault detection effectiveness).

5.2. Imperfect estimation of failure rates

For the purposes of this study, fault failure rates are directly obtained by running the entire set of available test cases against a faulty version of a program and observing the fraction of failures. We realize that in practice it is very difficult, if not entirely impossible, to obtain failure rates with such a high degree of accuracy. Furthermore, as a test set is minimized or augmented, the corresponding failure rate linked to a fault is also subject to change. Therefore, we decided to extend our experiment to observe the effect on prediction accuracy in the presence of an imperfect estimation of failure rates. In order to simulate an incorrect estimation of failure rates, we induce deviation from perfect estimation by modifying the existing failure rates. Recall from Equation (1) that the failure rate of a fault is the fraction of test cases in the test set that fail on the fault; thus, a failure rate of 0.12 would mean that 12% of the test cases in the test set fail on the fault. To modify the failure rates, we first section the possible failure rates into percentage intervals of 1%, 5%, 10%, 15%, 20%, ..., 90%, 95%, and 99%. The edge conditions of 0% and 100% are not considered (but are instead replaced by 1% and 99%, respectively), because a failure rate of 0% would imply that no test case can detect the fault and a failure rate of 100% would imply that every test case detects the fault; we wish to avoid such extreme scenarios. Each accurate failure rate is then changed to one of these intervals based on the following rounding scheme.

Mixed-rounding: This scheme allows a failure rate to be rounded to the interval closest to it. Thus, a failure rate may be rounded up or down depending on its position with respect to the median value of an interval. Let us again consider the example of a failure rate of 0.12, or 12%. Since this failure rate occurs between the intervals of 10% and 15% and is less than 12.5% (the median), it is rounded down to 10%. Similarly, a failure rate of 12.5% and above (but below 17.5%) would have been rounded to 15%.

Edge conditions: Failure rates towards the boundary of the possible intervals are handled such that the lowest possible interval is 1% and the highest possible interval is 99%. Thus, all failure rates below 2.5% (including those below 1%) are rounded to 1%, and all failure rates of 97.5% and above (including those above 99%) are rounded to 99%.
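The mixed-rounding scheme and its edge conditions can be summarized in a few lines of code. The sketch below is our own reading of the description above (interior rates are rounded half-up to the nearest multiple of 5%), not code from the study:

```python
import math

def mixed_round(failure_rate):
    # Map a failure rate in (0, 1) to one of the intervals 1%, 5%, 10%, ..., 95%, 99%.
    pct = failure_rate * 100
    if pct < 2.5:      # edge condition: anything below 2.5% goes to 1%
        return 0.01
    if pct >= 97.5:    # edge condition: 97.5% and above goes to 99%
        return 0.99
    return math.floor(pct / 5 + 0.5) * 5 / 100  # round half-up to the nearest 5% interval

print(mixed_round(0.12))    # 0.10 -- 12% is below the 12.5% median, so it is rounded down
print(mixed_round(0.125))   # 0.15 -- 12.5% and above (but below 17.5%) rounds to 15%
print(mixed_round(0.004))   # 0.01 -- edge condition
```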

The experiment outlined in Section 4 is now repeated, except this time using the imperfect failure rates so obtained. As in Section 4.4, the experiment is performed across all the programs, but due to space constraints, only the results on two of the programs are presented in this paper. For consistency and continuity, once again the results are presented for the print_tokens2 and schedule2 programs of the Siemens suite. The change in failure rates implies that a different level of effectiveness may be predicted using test sets of the same size compared with the corresponding predicted effectiveness based on the correct failure rates.

However, to make the distinction clear, the comparison is drawn until test sets of the same size as in Section 4.3 are reached. Note that the observed effectiveness undergoes no change as the incorrect estimation of failure rates only applies to the predicted effectiveness. Figure 11 shows the curve of the predicted effectiveness using the incorrect failure rates against the observed effectiveness on the program print_tokens2, while Figure 12 does the same for the program schedule2.

Figure 11. Imperfect estimation on print_tokens2

Figure 12. Imperfect estimation on schedule2

From the figures we observe that in general the curve representing the predicted effectiveness still follows the observed effectiveness, though the deviations are greater in number than on the corresponding curves using perfect estimation. An important observation is that a test set size that is sufficient to reach 100% predicted effectiveness when using perfect estimation is no longer sufficient to reach 100% effectiveness when using imperfect estimation. The explanation behind this lies with the way in which the correct failure rates are modified to induce imperfect estimation.

Recall that the mixed-rounding scheme allows failure rates to be rounded up or down depending on their position with respect to the median value of an interval. We notice that more of the faults are rounded down than up on print_tokens2 and schedule2. Furthermore, there are multiple faults that are brought down to the 1% level (a failure rate of 0.01). The net effect is that it becomes more difficult to detect all the faults, which implies that more test cases are required to achieve a predicted effectiveness level of 100%.

6. Related Work

In this section we give an overview of work related to that presented in this paper. Failure rates have been used extensively in the area of adaptive random testing [2,4,5,11]. Adaptive random testing seeks to distribute test cases more evenly within the input space as an enhanced form of random testing. The failure rate defined in those papers is in terms of the entire input domain, whereas this paper assumes that a test set is uniformly representative of the input domain and therefore applies the definition to the test set itself. The concept of a software failure rate is also used extensively in the area of software reliability, but with a definition different from that used in this paper and in adaptive random testing. In the context of software reliability, the failure rate is defined as the rate at which failures occur in a certain time interval (t, t + Δt) [18]. Various methods and techniques exist to try to reduce, if not minimize, the size of a test set. In [8,19,20] the authors describe a methodology to minimize the size of a test set by eliminating redundant test cases while maintaining the same level of coverage. In [2] the authors propose a process to reduce the number of test cases used in the performance evaluation of components; the process is based on sequential curve fittings from an incremental number of test cases until a minimal pre-specified residual error is achieved. Even though the test set minimization problem is in general NP-complete, greedy heuristics such as the one presented in [17] and other such methods (e.g., the one used by the Suds tool suite [23] developed by Telcordia Technologies, formerly Bellcore) are useful strategies that work towards test set optimization. The objective of such studies and the one discussed in this paper is similar in the sense that both try to identify a suitable test set size in order to achieve certain thresholds across criteria such as coverage or fault detection effectiveness (the latter is the criterion in this paper).

7. Conclusions and future work


In this paper we examine whether fault failure rates should be included as valid considerations when determining an adequate test set size, i.e., the size of a test set which can detect all of the faults. Case studies are performed on eight different C programs, and the results generally indicate that an accurate estimation of fault failure rates can indeed help accurately predict an adequate test set size.


We also perform an analysis of the relationship between prediction accuracy and test set size. The observations support the intuition that as the size of the test set increases, the bias involved (due to the fact that some test cases may detect more faults than others) is reduced and the prediction accuracy consequently goes up. An additional analysis is performed to observe the effects of imperfect estimation of failure rates on the prediction accuracy, using an approach where failure rates are modified to generate incorrect ones. Future work includes, but is not limited to, making our study even more comprehensive by not just observing the effect of an incorrect estimation of failure rates on prediction accuracy, but also observing the effect of an incorrect estimation of the potential number of faults to detect. Based on the results of this paper, we also wish to identify reliable and practical ways to accurately estimate fault failure rates. Potential areas of application of this research include (but are not limited to) regression test selection techniques, where the goal is to reduce the cost of regression testing by selecting a subset of the test suite that was used during development to test a modified version of the program. This research is also applicable to entities involved with mutation testing, where an adequate test set size might be estimated to detect all the faults, i.e., kill all the mutants.

Acknowledgments
The authors wish to thank Hyung-Jae Chang of the Software Technology Advanced Research (STAR) Lab at the University of Texas at Dallas for his valuable insights and comments while this paper was still a work in progress.


References
[1] A. Avizienis, J. C. Laprie, B. Randell, and C. Landwehr, Basic concepts and taxonomy of dependable and secure computing, IEEE Transactions on Dependable and Secure Computing, 1(1):11-33, January 2004.
[2] J. Cangussu, K. Cooper, and W. E. Wong, Reducing the number of test cases for performance evaluation of components, International Journal of Software Engineering and Knowledge Engineering (to appear). An earlier version appeared in the Proceedings of the 19th International Conference on Software Engineering and Knowledge Engineering, pp. 145-150, Boston, Massachusetts, USA, July 2007.
[3] K. P. Chan, T. Y. Chen, and D. Towey, Restricted random testing: adaptive random testing by exclusion, International Journal of Software Engineering and Knowledge Engineering, 16(4):553-584, August 2006.
[4] T. Y. Chen, H. Leung, and I. K. Mak, Adaptive random testing, in Proceedings of the 9th Asian Computing Science Conference, pp. 320-329, Chiang Mai, Thailand, December 2004.
[5] T. Y. Chen and Y. T. Yu, On the relationship between partition and random testing, IEEE Transactions on Software Engineering, 20(12):977-980, December 1994.
[6] T. Y. Chen, T. H. Tse, and Y. T. Yu, Proportional sampling strategy: a compendium and some insights, Journal of Systems and Software, 58(1):65-81, August 2001.
[7] H. Cleve and A. Zeller, Locating causes of program failures, in Proceedings of the 27th International Conference on Software Engineering, pp. 342-351, St. Louis, Missouri, USA, May 2005.
[8] M. J. Harrold, R. Gupta, and M. L. Soffa, A methodology for controlling the size of a test suite, ACM Transactions on Software Engineering and Methodology, 2(3):270-285, July 1993.
[9] http://sir.unl.edu/portal/index.html
[10] J. A. Jones and M. J. Harrold, Empirical evaluation of the Tarantula automatic fault-localization technique, in Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering, pp. 273-283, California, USA, November 2005.
[11] F. C. Kuo, T. Y. Chen, H. Liu, and W. K. Chan, Enhancing adaptive random testing in high dimensional input domains, in Proceedings of the 2007 ACM Symposium on Applied Computing, pp. 1467-1472, Seoul, Korea, March 2007.
[12] H. Leung, T. H. Tse, F. T. Chan, and T. Y. Chen, Test case selection with and without replacement, Journal of Information Sciences, 129(1-4):81-103, November 2000.
[13] C. Liu, L. Fei, X. Yan, J. Han, and S. P. Midkiff, Statistical debugging: a hypothesis testing-based approach, IEEE Transactions on Software Engineering, 32(10):831-848, October 2006.
[14] M. Renieris and S. P. Reiss, Fault localization with nearest neighbor queries, in Proceedings of the 18th IEEE International Conference on Automated Software Engineering, pp. 30-39, Montreal, Canada, October 2003.
[15] G. Rothermel and M. J. Harrold, Empirical studies of a safe regression test selection technique, IEEE Transactions on Software Engineering, 24(6):401-419, June 1998.
[16] Siemens Suite, http://www-static.cc.gatech.edu/aristotle/Tools/subjects/, January 2007.
[17] S. Tallam and N. Gupta, A concept analysis inspired greedy algorithm for test suite minimization, in Proceedings of the 6th ACM Workshop on Program Analysis for Software Tools and Engineering, pp. 35-42, Lisbon, Portugal, September 2005.
[18] K. S. Trivedi, Probability and Statistics with Reliability, Queuing and Computer Science Applications, John Wiley & Sons, 2002.
[19] W. E. Wong, J. R. Horgan, S. London, and A. P. Mathur, Effect of test set minimization on fault detection effectiveness, Software-Practice and Experience, 28(4):347-369, April 1998.
[20] W. E. Wong, J. R. Horgan, A. P. Mathur, and A. Pasquini, Test set size minimization and fault detection effectiveness: a case study in a space application, Journal of Systems and Software, 48(2):79-89, October 1999.
[21] W. E. Wong, Y. Qi, Y. Shi, and J. Dong, BP neural network-based effective fault localization, International Journal of Software Engineering and Knowledge Engineering (to appear). An earlier version appeared in the Proceedings of the 19th International Conference on Software Engineering and Knowledge Engineering, pp. 374-379, Boston, Massachusetts, USA, July 2007.
[22] W. E. Wong, T. Wei, Y. Qi, and L. Zhao, A crosstab-based statistical method for effective fault localization, in Proceedings of the First International Conference on Software Testing, Verification and Validation, pp. 42-51, Lillehammer, Norway, April 2008.
[23] Suds Users Manual, Telcordia Technologies, 1998.
