A Dissertation
entitled
A Cost Effective Methodology for Quantitative Evaluation of
Software Reliability using Static Analysis
by
Walter W. Schilling, Jr.
Graduate School
College of Engineering
_____________________________________________________________________
Dissertation Advisor: Dr. Mansoor Alam
Recommendation concurred by:

_______________________________________
Dr. Mohsin Jamali

_______________________________________
Dr. Vikram Kapoor

_______________________________________
Dr. Henry Ledgard

_______________________________________
Dr. Hilda Standley

_______________________________________
Mr. Michael Mackin

_______________________________________
Mr. Joseph Ponyik

Committee on Final Examination
_____________________________________________________________________
Dean, College of Engineering
Copyright © 2007
Typeset using the LaTeX documentation system with the MiKTeX development package. All trademarks are the property of their respective holders and are hereby acknowledged.
An Abstract of
A Cost Effective Methodology for Quantitative Evaluation of Software Reliability using Static Analysis
by
Walter W. Schilling, Jr.

As software systems have become larger and more complex, mission critical and safety critical functionality has increasingly been implemented in software. This change has resulted in a shift of the root cause of systems failure from hardware to software. Market forces have encouraged projects to reuse existing software as well as purchase COTS solutions. This has made the usage of existing reliability models problematic; when the data these models require has not been collected or is not made available to software engineers, these modeling techniques cannot be applied. New techniques are required to address these issues. This dissertation puts forth a practical method for estimating software reliability.
The proposed software reliability model combines static analysis of existing source
code modules, functional testing with execution path capture, and a series of Bayesian
Belief Networks. Static analysis is used to detect faults within the source code which
may lead to failure. Code coverage is used to determine which paths within the source
code are executed as well as the execution rate. The Bayesian Belief Networks combine
these parameters and estimate the reliability for each method. A second series of
Bayesian Belief Networks then combines the data for each method to determine the reliability of the software package as a whole.
In order to use this model, the SOSART tool is developed. This tool serves as
a reliability modeling tool and a bug finding meta tool suitable for comparing the results of multiple static analysis tools. The model is first validated against a set of Open Source software packages. A second validation is provided using the Tempest Web Server, an embedded web server developed at the NASA Glenn Research Center.
Dedication
I would like to dedicate this dissertation to my wife Laura, whose help and assistance made its completion possible.
Acknowledgments
I would like to take this opportunity to express my thanks to the many persons who have supported this work.
First and foremost, I would like to recognize the software companies whom I am
indebted to for the usage of their tools in my research. In particular, this includes
Gimpel Software, developers of the PC-Lint static analysis tool; Fortify Software, developer of the Fortify SCA security analysis tool; Programming Research Limited, developers of the QAC, QAC++, and QAJ tools; and SofCheck, developer of the SofCheck Inspector static analysis tool. Through their academic licensing programs, I was able to use these tools at greatly reduced costs for my research, and without their support this research would not have been practical. I am also grateful for the computing facilities in Columbus which were used for reliability research and web reliability calculation.
I am indebted to the NASA Glenn Research Center in Cleveland and the Flight
Software Engineering Branch, specifically Michael Mackin, Joseph Ponyik, and Kevin Carmichael. Their assistance during my one summer on site proved vital in the validation of this research. I am also indebted to my fellowship sponsors in Ohio. Their Doctoral fellowship supported my graduate studies and made it possible for me to pursue this research.
I am also indebted to Dr. Mohsin Jamali, Dr. Vikram Kapoor, Dr. Henry Ledgard, Dr. Hilda Standley, and Dr. Afzal Upal from the University of Toledo, who served on my committee and provided valuable guidance.
Contents
Abstract iv
Dedication vi
Acknowledgments vii
Contents ix
2.3 Hard Limits Exceeded . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5 Static Analysis Fault Detectability 83
6.2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
7.7 Extended Network Verification . . . . . . . . . . . . . . . . . . . . . . 131
9.5 Jester Program Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 180
10.2.5 Analyzing the Source code for Statically Detectable Faults . . 198
Bibliography 210
List of Figures
4-6 Source exhibiting loop overflow and out of bounds array access. . . . 57
4-8 GNU gcov output from testing prime number source code. . . . . . . 64
4-9 Control flow graph for calculate distance to next prime number method. 65
4-15 gcov output for functional testing of timer routine. . . . . . . . . . . 77
6-4 BBN Network to combine multiple blocks with multiple faults. . . . . 116
8-1 Analysis menu used to import Java source code files. . . . . . . . . . 145
8-7 Java Tracer execution example. . . . . . . . . . . . . . . . . . . . . . 150
8-8 XML file showing program execution for HTTPString.java class. . . . 152
List of Tables
5.5 Static Analysis Tool False Positive and Stylistic Rule Detections . . . 91
5.7 Percentage of warnings detected as valid based upon tool and warning. 95
6.1 Bayesian Belief Network State Definitions . . . . . . . . . . . . . . . 100
7.2 Bayesian Belief Network States Defined for Execution Rate . . . . . . 126
7.4 Differences between the Markov Model reliability values and the BBN
7.7 Extended Bayesian Belief Network States Defined for Execution Rate 131
7.9 Differences between the Markov Model reliability values and the BBN
9.2 RealEstate Class Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 175
10.6 Tempest Rule Violation Count with All Rules Enabled . . . . . . . . 198
10.7 Tempest Rule Violation Densities with All Rules Enabled . . . . . . . 199
10.10 Tempest Estimated Reliabilities using SOSART . . . . . . . . . . 200
Chapter 1
Contributions
“The most significant problem facing the data processing business today is the software problem.”
This statement, leading into Myers' book Software Reliability: Principles and Practices, was written in 1976. Yet, even today, this statement is equally valid, and the software problem remains largely unsolved.
The problems of software reliability are not new. The first considerations of software reliability, in fact, were made during the late 1960's as system downtime began to impose significant operational costs. However, beyond a few specialized application environments, software failure was regarded as an unavoidable nuisance. This has been especially true in the area of embedded systems, where cost and production factors often outweigh quality and reliability issues.
Increased systemic reliance upon software has begun to change this attitude, as software now provides critical functionality in the medical, transportation, and nuclear energy fields. The software content of these systems continues to grow, much of it in the form of firmware. The latest airplanes under development contain over 5 million lines of code, and even older aircraft contain upwards of 1 million lines of code[Sha06]. A study of airworthiness directives indicated that 13 out of 33 issued for the period studied were attributable to software. The medical device field faces similar complexity and reliability problems: 79% of medical device recalls caused by software are due to defects introduced by changes made after initial release. Across these domains, software has become the principal source of reliability problems.
Aside from the inconvenience and potential safety hazards related to software fail-
ures, there is a huge economic impact as well. A 2002 study by the National Institute
of Standards and Technology found that software defects cost the American economy $59.5 billion
annually [Tas02]. For the fiscal year 2003, the Department of Defense is estimated to
have spent $21 billion on software development. Upwards of $8 billion, or 40% of the
total, was spent on reworking software due to quality and reliability issues[Sch04b].
But it is not only these large systems that have enormous economic costs. Delays in
the Denver airport automated luggage system due to software problems cost $1.1 mil-
lion per day[Nta97]. A single automotive software error led to a recall of 2.2 million vehicles. Development costs have risen to the point that modern embedded software typically costs between $15 and $30 per line
of source code[Gan01]. But, the development costs can be considered small when one
considers the economic cost for system downtime, which ranges from several thousand
dollars per hour to several million dollars per hour depending upon the organization,
as is shown in Table 1.1.
These economic costs, coupled with the associated legal liabilities, have made soft-
ware reliability an area of extreme importance. That being said, economic pressures,
including decreased time to market, stock market return on investment demands, and
shortages of skilled programmers, have led to business decisions being made that may
work against increasing reliability. In many projects, software reuse has been consid-
ered as a mechanism to solve both the problems of decreased delivery schedules and
increasing software cost. However, software reliability has not necessarily increased
with reuse. The Ariane 5[JM97]1 and Therac-25[LT93] failures can be directly related to the improper reuse of previously developed software. In today's marketplace, a product is often assembled from multiple components, with each vendor supplying a piece of software. Each piece is then integrated to form a final product. Over the
course of a product lifecycle, a given vendor may release hundreds of versions of their
component. Each time a new release is made, the integration team must make a de-
cision as to whether or not it is safe to integrate the given component into the overall
product. Unfortunately, there is often little concrete knowledge to base this decision
upon, and the exponential growth in releases by COTS vendors has been shown to make a thorough evaluation of each release impractical.
The Capability Maturity Model (CMM) was developed by Carnegie Mellon University's Software Engineering Institute to assess the ability of organizations to deliver software on time and on budget. As such, software development companies are assessed and assigned a maturity level between 1 and 5, with 5 being the best and 1 being the lowest. However, due to many issues, the usage of CMM assessment has been problematic in filtering capable companies from incapable companies, as is discussed in O'Connell and Saiedian [OS00] and [Koc04]. Its weaknesses have also been documented
1 The Ariane 5 failure is discussed in further detail in Chapter 2 of this dissertation.
in Saiedian and Kuzara[SK95]. Thus, an engineer cannot rely on CMM assessments alone when deciding whether to integrate a vendor's software into a product.
Traditional software reliability engineering approaches rely upon failure data collected during development and testing, including the operational time between failures, the severity
of the failures, and other metrics. This data is then applied to the project to determine
if an adequate software reliability has been achieved. While these methods have been
applied successfully to many projects, there are often occasions where the failure data
has not been collected in an adequate fashion to obtain relevant results. This is often
the case when reusing software which has been developed previously or purchasing
COTS components for usage within a project. In the reuse scenario, the development
data may have been lost or never collected. In the case of COTS software, the requisite data is typically proprietary and is not made available to customers.
This poses a dilemma for a software engineer wishing to reuse a piece of software or
purchase a piece of software from a vendor. Internal standards for release vary greatly,
and from the outside, it is impossible to know where on the software reliability curve
the existing software actually stands. One company might release early on the curve,
resulting in more failures occurring in the field, whereas another company might
release later in the curve, resulting in fewer field defects. Further complicating this issue is the support model for the software. With internally maintained software, when a failure occurs it is nearly immediately repaired and a new version is released. With
COTS and open source code, this immediate response generally does not occur, and
the software must be used “as is”. Licensing agreements may also restrict a customer
from fixing a known defect which leads to a failure or risk support termination if such
a fix is attempted.
Verifying third party code can be a costly and expensive endeavor. Huisman et al. [HJv00] report that the verification of one Java class, namely the Java Vector class, required a substantial investment of time and specialized expertise.
How can one economically ensure that software delivered from an external source is reliable enough for use within a product?
Thus far, we have discussed the problems of software reliability and ensuring that
reused software is of acceptable quality. A key emphasis has been on the aspect
of the economic costs of software failure. In this section, our intent is to outline the
key contributions to the software engineering body of knowledge that this dissertation makes.
In order to address the needs of software practitioners, any developed model must
be readily understood, easily applied, and generate no more than a small increase in development cost. An understanding of compiler usage is a required skill for all software engineers developing
embedded systems software, and thus, by using static analysis tools to detect stati-
cally detectable faults within source code modules, the additional knowledge required
of a practitioner to apply this model is minimized. Static analysis tools offer yet
another benefit. Because the interface to static analysis tools is similar to that of a
compiler, the operation of static analysis tools can be highly automated through the
usage of a build process. This automation not only adds to the repeatability of the analysis, but also minimizes the manual effort required to apply it.
This dissertation shows that for the Java programming language, through the appropriate application of static analysis tools, statically detectable faults can be reliably identified within existing source code.
In carrying out this research, a Bayesian Belief Network was developed using
expert knowledge which relates the risk of failure to statically detected faults.
3. The design of a Bayesian Belief Network which relates method execution rates
and reliability and obtains results comparable to those obtained through Markov
Modeling.
This network serves to combine the reliabilities of multiple methods based upon their observed execution rates.
4. The design and development of a new software tool, SOSART, which allows the practical application of the proposed model.
The SOSART tool simplifies the application of this model to existing source code packages.
5. The successful demonstration of the reliability model through two different sets
of experiments which shows that the reliability model developed here can accurately estimate software reliability.
In the first set of experiments, readily available Open Source software is an-
alyzed for reliability using the STREW metrics suite. The software is then
re-analyzed using the SOSART method and the results are compared. The second set of experiments involves applying the model to an embedded systems project and comparing the results to actual reliability data obtained from testing and field operation.
Chapter 1 has provided a synopsis of the software reliability problem and a jus-
tification for the research that follows. Next, the problem statement is clearly and
succinctly stated. An overview of the key contributions of this research to the software engineering body of knowledge completes the chapter.
Chapter 2 presents a series of software failure case studies. Understanding why software fails in the field is vital in order to create a practical software reliability model which reflects actual failure mechanisms. The failures discussed have been chosen to be ones which occurred due to a software
fault which was statically detectable through the usage of the appropriate static
analysis tools. The discussion of these failures reinforces the justification given in Chapter 1 for this research.
Chapter 3 provides an overview of static analysis tools. The research upon which
this dissertation is based relies heavily upon the capabilities of static analysis tools
to reliably detect injected faults at the source code level. In order to understand
the assessment capabilities of static analysis tools, this chapter provides a detailed survey of currently available static analysis tools.
Chapter 4 presents the key areas of contribution of our research by presenting key
concepts for our reliability model, an analysis of faults, and how certain faults lead to
failures. Then we discuss how to measure code coverage. Code coverage plays a major role within the proposed model, as it helps determine which faults may manifest themselves as failures. This chapter then concludes with the
details of our proposed software reliability model, which combines static analysis of
existing source code modules, black-box testing of the existing source code module
while observing code coverage, and a series of Bayesian Belief Networks to combine these parameters into an overall reliability estimate.
Chapter 5 discusses the results of our experimentation with static analysis tools.
While extensive studies have been done for the C, C++, and Ada languages, there
have been very few studies on the effectiveness of Java static analysis tools. This
research works with the Java language. This chapter presents both the experiment
design as well as the results of the experiment, indicating that static analysis can be
an effective mechanism for detecting faults which may cause run time failures in the
Java language.
Chapters 6 and 7 discuss in detail the developed Bayesian Belief Networks which
combine the results of static analysis execution with program execution traces ob-
tained during limited testing. In general, there are three sets of Bayesian Belief
Networks within the model. One Bayesian Belief Network combines information to determine the risk of failure from a single statically detectable fault. A second Bayesian Belief Network combines statically detectable faults which are co-located within a single code block to obtain the reliability of the given block. A third Bayesian Belief Network is then used to combine the code blocks together to assess the net reliability of the method as a whole.
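As an illustrative sketch of the mechanics (the numbers below are invented for exposition and are not the expert-derived conditional probability tables used in this dissertation), a failure node F with a fault-presence parent S is marginalized in the usual way:

P(F) = P(F | S = present) P(S = present) + P(F | S = absent) P(S = absent)

For example, with P(S = present) = 0.2, P(F | present) = 0.1, and P(F | absent) = 0.001, the network yields P(F) = 0.2 × 0.1 + 0.8 × 0.001 = 0.0208, giving a block reliability estimate of 1 − P(F) = 0.9792.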
Chapter 8 describes the SOSART tool, which has been designed to allow the usage of the proposed model in assessing the
reliability of existing software packages. The SOSART tool, developed entirely in the
Java language, interacts with the static analysis tools to allow statically detectable
faults to be combined with execution traces and analyzed using the proposed relia-
bility model. Without such a tool, the usage of this model is impractical due to the volume of data which must be managed.
Chapter 9 compares the proposed model against a set of Open Source software packages which used the STREW metrics suite to assess reliability. In this chapter, the results obtained from the two approaches are compared. The chapter concludes with a discussion of the effort required to apply this model in comparison with other approaches.
Chapter 10 provides a second validation using a large scale embedded software component. In this case, the large scale component is assessed for its reliability and
then the reliability of this component is compared with the reliability obtained from
actual field execution and testing. This chapter highlights the economic advantages
of this model versus a complete peer review of the source code for reliability purposes.
Chapter 11 summarizes the work and provides some suggestions for future work.
Chapter 2
Software Failure Case Studies1
“The lessons of cogent case histories are timeless, and hence a heightened awareness of past failures and errors of design judgment can help prevent the same mistakes from being made again.”
Before one can understand how to make software reliable, one must understand
how software fails. Leveson [Lev94] and Holloway [Hol99] both advocate that in order
to understand how to model risk for future programs, it is important to study and learn from the failures of the past. A summary of the failures examined in this chapter is provided in Table 2.1.
1 Portions of this chapter appeared in Schilling[Sch05].
This chapter examines a series of significant software failures and the faults behind these failures. Many of these failures are directly attributable to the reuse of previ-
ously developed software with insufficient testing. Furthermore, many of these failures
can be attributed to a fault which can be readily and easily detected with existing static analysis tools.
On Tuesday September 14, 2004 at about 5 p.m. Pacific Time, air traffic con-
trollers lost contact with 400 airplanes which were flying in the Southwestern United
States. Planes were still visible on radar, but all voice communications between the air traffic controllers and pilots were lost. This loss of communication was caused by
the failure of the Voice Switching and Control System (VSCS) which integrates all
air traffic controller communications into a single system. The failures lasted three
hours, and resulted in 800 flights being disrupted across the Western United States.
In at least five instances, planes violated minimum separation distances.
The cause was the failure of a newly integrated system enhancement, the VSCS Control Subsystem Upgrade (VCSU). During testing, a problem had been detected within the VCSU installation, as the system crashed after 49.7 days of continuous operation. As a workaround, the FAA instituted a maintenance reboot of the system every 30 days.
This mandatory reboot was necessitated by a design flaw within the software.
The internal software of the VCSU relies on a 32 bit counter which counts down from 2^32 to 0 as the software operates. This takes 2^32 ms, or 49.7103 days, to complete a countdown[Cre05]. This tick, part of the Win32 system and accessed by a call to the GetTickCount() API, is used to provide a periodic pulse to the system. When the counter reaches 0, the tick counter wraps around from 0 back to 4,294,967,295, and in
doing so, the periodic pulse fails to occur. In the case of the Southern California air
traffic control system, the mandatory reboot did not occur and the tick count reached
zero, shutting down the system in its entirety. The backup system attempted to start,
but was unable to handle the load of traffic, resulting in the complete communications
failure[Wal04] [Gep04].
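The defect pattern generalizes beyond the Win32 API. The following minimal Java sketch (purely illustrative; the actual VSCS source code is not public) shows how a deadline comparison built on a 32 bit millisecond counter silently misbehaves at the wraparound boundary:

// Illustrative sketch of the 32 bit tick wraparound hazard.
public class TickWrapDemo {
    public static void main(String[] args) {
        int now = Integer.MAX_VALUE - 500;  // tick value just before the wrap
        int deadline = now + 1000;          // 1000 ms later; overflows to a negative value

        // Naive comparison: wrongly concludes the deadline has already
        // passed, so a periodic pulse scheduled this way is not delivered.
        System.out.println("naive:     " + (now >= deadline));      // prints true (wrong)

        // Wrap-safe comparison: subtract first, then test the sign; the
        // subtraction remains correct across the wrap in two's complement.
        System.out.println("wrap-safe: " + (now - deadline >= 0));  // prints false (correct)
    }
}

Static analysis tools that perform value tracking can flag exactly this class of overflowing arithmetic and suspect comparison.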
On September 21, 1997, the USS Yorktown, the prototype for United States Navy
Smart Ship program, suffered a catastrophic software failure while steaming in the
Atlantic. As a result of an invalid user input, the entire ship was left dead in the water
for two hours and forty five minutes while a complete reboot of systems occurred.
was mandated by the Navy’s IT-21 report, “Information Technology for the 21st
Century”[wir98]. The monitoring and control surveillance system code was developed
in Ada[Sla98a]. In order to meet Navy schedules and deployment deadlines for the
The failure of the Yorktown system began with a petty officer entering an incorrect
calibration value into the Remote Database Manager. This zero entry resulted in a
divide by zero, causing the database system to crash. Through the ATM network, the
crash propagated to all workstations on the network. Restoration of the ship required a complete reboot of the affected systems.
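As a hedged Java analogy (the actual Yorktown software was written in Ada and is not public), the failure pattern is an unguarded division by an operator-supplied value:

// Illustrative sketch: division by unvalidated operator input.
public class CalibrationDemo {
    static long scaleReading(long reading, long calibration) {
        // BUG: calibration is operator-supplied and never validated;
        // a zero entry throws ArithmeticException, and with no handler
        // the exception propagates and takes the application down.
        return reading / calibration;
    }

    public static void main(String[] args) {
        System.out.println(scaleReading(100, 25)); // normal entry: prints 4
        System.out.println(scaleReading(100, 0));  // zero entry: uncaught crash
    }
}

Validating the divisor, or containing the exception at the subsystem boundary, would have confined the failure to a single console.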
In August 2003, automated teller machines at two financial institutions failed due to the Nachi worm in the first confirmed case of malicious code penetrating
ATM machines. The ATM machines operated on top of the Windows XP Embedded
operating system which was vulnerable to an RPC DCOM security bug exploited by
Nachi and the Blaster virus. In this case, a patch had been available from Microsoft
for over a month, but had not been installed on the machines due to an in-depth set of regression tests required before patches could be deployed.
The Nachi worm spread through a buffer overflow vulnerability within Microsoft's RPC implementation. The worm transmitted specially crafted packets to a remote machine using port 135. When the packets were processed, a buffer overflow would occur on the receiving machine, crashing the RPC service and allowing arbitrary code to be executed.
On August 14th, 2003, the worst electrical outage in North American history
occurred, as 50 million customers in eight US states and Canada lost power. The
economic impact of this failure has been estimated to range between $4.5 and $10
billion[ELC04]. While there were many contributing factors to the outage, buried
within the causes was the failure of a monitoring alarm system [Pou04a].
First Energy’s energy management system used to monitor the state of its elec-
trical grid failed silently. This system had over 100 installations worldwide [Jes04]
and the software was approximately four million lines of C code[Pou04b]. A thorough
analysis of the Alarm and Event processing routines, comprising approximately one
million lines of C and C++ code yielded a subtle race condition which allowed two
asynchronous threads to obtain write access to a common data structure. Once this
happened the data structure was corrupted and the alarm application went into an
infinite loop. This caused a queue overflow on the server hosting the alarm process,
resulting in a crash. A backup system kicked in, but it too was overwhelmed by the backlog of unprocessed events and failed.
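A hedged Java sketch of the underlying defect class (the actual alarm system was C and C++; this is only an analogy) shows two writers mutating a shared structure without synchronization:

import java.util.LinkedList;

// Illustrative race: two threads mutate an unsynchronized list.
public class AlarmRaceDemo {
    private static final LinkedList<String> alarmQueue = new LinkedList<>();

    public static void main(String[] args) throws InterruptedException {
        Runnable writer = () -> {
            for (int i = 0; i < 100_000; i++) {
                alarmQueue.add("alarm");   // unsynchronized structural write
            }
        };
        Thread t1 = new Thread(writer);
        Thread t2 = new Thread(writer);
        t1.start(); t2.start();
        t1.join();  t2.join();
        // Expected 200000. Interleaved writes can corrupt the internal
        // links, yielding a wrong count, an exception, or (as in the
        // alarm system) an endless traversal of a damaged structure.
        System.out.println(alarmQueue.size());
    }
}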
On February 25, 1991, an incoming Iraqi Scud missile struck an American Army barracks in Dhahran, Saudi Arabia. An American Patriot Missile battery was tasked with intercepting incoming missiles, but due to a software defect, the battery failed to track and intercept the incoming Scud.
The Patriot missile defense system operated by detecting airborne objects which matched the radar signature of an incoming missile. To predict the location to apply the range gate, the Patriot computer system used two fundamental data items: time, expressed as a 24 bit integer, and velocity, expressed as a 24-bit fixed point decimal number. The prediction algorithm used in the source code multiplied the current time stamp by 1/10 when calculating the next location.
In order to fit into the 24 bit registers of the Patriot computer system, the binary expansion for 1/10 was truncated to 0.00011001100110011001100, yielding an error of approximately 0.000000095 decimal. This error was cumulative with the time since the unit was initialized[Arn00].
The Alpha battery had been operating without reset for over 100 consecutive
hours. At 100 hours, the time inaccuracy was 0.3433 seconds. The incoming Iraqi
Scud missile cruised at 1,676 metres per second, and during this time error, it had
traveled more than one half of a kilometer [Car92]. The battery did not engage
the Scud missile, and an Army barracks was hit, resulting in 28 deaths and over 100
injuries.
One notable Space Shuttle software failure occurred during a training scenario. In this training scenario, the objective of the crew was to
abort the launch and land in Spain. This required the dumping of excess fuel from
the orbiter. When the crew initiated the abort, all four redundant flight computers
failed and became unresponsive. The displays each showed a large X, indicating that no valid data was available.
The failure was traced to a single fault, namely a single, common routine used to
dump fuel from the orbiter. In this particular scenario, the routine had been invoked
once during the simulated ascent, and then was canceled. Later on, the routine was
called again. However, not all variables had been re-initialized. One of the variables
was used to compute the offset address for a “GOTO” statement. This caused the
code to branch to an invalid address, resulting in the simultaneous lockup of all four
computers. A complete analysis of the shuttle code found 17 other instances of this
fault[Lad96].
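The pattern here is stale module-level state surviving a cancelled invocation. A minimal Java sketch (illustrative only; the shuttle software was not written in Java):

// Illustrative sketch: state from a cancelled call corrupts the next call.
public class FuelDumpDemo {
    private static int branchOffset = 0;   // module-level state, persists across calls

    static void startDump(int mode) {
        branchOffset = mode * 4;           // set during the first invocation
        // ... the dump is cancelled here, and branchOffset is never cleared ...
    }

    static void resumeDump() {
        // BUG: assumes branchOffset was re-initialized; the leftover value
        // from the cancelled call selects a branch target that is invalid
        // in the current context.
        System.out.println("branching with stale offset " + branchOffset);
    }

    public static void main(String[] args) {
        startDump(3);   // first invocation, then cancelled
        resumeDump();   // second invocation observes stale offset 12
    }
}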
On June 4, 1996, the maiden flight of the Ariane 5 rocket occurred. Thirty-nine seconds
into flight, a self destruct mechanism on the rocket activated, destroying the launcher
and all four payload satellites. The root cause for the failure was traced to a stat-
ically detectable programming error where a 64 bit floating point number was cast
into a 16 bit integer. As the code was implemented in Ada, this caused a runtime
exception which was not handled properly, resulting in the computer shutting down
and the ultimate loss of the vehicle and payload. Had this occurred on a less strongly
typed language, the program most likely would have continued executing without
incident[Hat99a]. In reviewing the source code for the module, it was found that there were at least three other instances of unprotected typecasts present within the same module.
The portion of the guidance software which failed in Ariane 5 had actually been
reused from Ariane 4 without review and integration testing[Lio96]. On the Ariane 4, the inertial alignment function was kept executing as a task after lift-off[Gle96]. This had the effect of saving time whenever a launch hold occurred. When the Ariane 5 software was developed, no one removed this unnecessary feature. Thus, even after launch was initiated, pre-launch alignment calculations occurred[And96]. The variable that overflowed was actually in the pre-launch alignment code, which served no purpose after lift-off.
The early operation of the Clementine orbiter was largely successful. However, for all the success the orbiter was achieving, significant problems were also occurring, as over 3000 floating point exceptions were detected during the mission. All of these problems came to a climax in May, 1994 when another floating point exception left the onboard computer unresponsive. Ground controllers sent software reset commands to the craft, but these were ignored. After 20 minutes, a
hardware reset was successful at bringing the Clementine probe back on line [Lee94].
Software for the Clementine orbiter was developed through a spiral model pro-
cess, resulting in iterative releases. Code was developed using both the C and Ada
languages. In order to protect against random thruster firings, designers had implemented software safeguards; these safeguards, however, could not protect against a hangup within the firmware. Evidence from the vehicle indicates that following the hangup, a thruster fired for an extended period, exhausting fuel and imparting an 80 RPM spin on the craft. This thruster fired until the attitude control fuel was exhausted. A hardware watchdog mechanism which could have detected the runaway source code was not enabled.
A $433.1 million Lockheed Martin Titan IV B rocket was launched on April 30,
1999, from Cape Canaveral Air Station in Florida [Hal99] carrying the third Milstar
satellite destined for Geosynchronous orbit[Pav99]. The first nine minutes of the
launch were nominal. However, very quickly afterward an anomaly was detected in
the flight path as the vehicle was unstable in the roll orientation.
The instability against the roll axis returned during the second engine burn causing
excess roll commands that saturated pitch and yaw controls, making them unstable.
This resulted in vehicle tumble until engine shutdown. The vehicle then could not
obtain the correct velocity or the correct transfer orbit. Trying to stabilize the ve-
hicle, the RCS system exhausted all remaining propellant[Pav99]. The vehicle again
tumbled when the third engine firing occurred, resulting in the vehicle being placed into an incorrect and unusable orbit. An incorrect constant within an Inertial Navigation Unit (INU) software file caused the error. In one filter table, instead of −1.992476, a value of −0.1992476 had been entered. Thus, the values for
roll detection were effectively zeroed causing the instability within the roll system.
These tables were not under configuration management through a version control
system (VCS). This incorrect entry had been made in February of 1999, and had
gone undetected by both the independent quality assurance processes as well as the launch verification process.
As a footnote to this incident, and paralleling the Ariane 5 failure, the roll filter
which was mis-configured was not even necessary for correct flight behavior. The
filter had been requested early in the development of the first Milstar satellite when
there was a concern that fuel sloshing within the Milstar Satellite might affect launch
trajectory. Subsequently, this filter was deemed unnecessary. However, it was left in the flight software.
On January 15, 1990, the AT&T long distance network suffered the largest outage on record. Sixty thousand people were left without telephone service as 114 switching nodes within the AT&T system failed. All told, a net total of between $60 and $75 million was
lost. The culprit for this failure was a missing line of code within a failure recovery
routine[Dew90].
1: ...
2: switch (message)
3: {
4: case INCOMING_MESSAGE:
5: if (sending_switch == OUT_OF_SERVICE)
6: {
7: if (ring_write_buffer == EMPTY)
8: send_in_service_to_smm(3B);
9: else
10: break; /* Whoops */
11: }
12: process_incoming_message();
13: break;
14: ...
15: }
16: do_optional_database_work();
17: ...
Figure 2-1: Source code which caused AT&T Long Distance Outage[Hat99b]
When a single node crashed on the system, a message indicating that the given
node was out of service was sent to adjacent nodes so that the adjacent node could
route the traffic around the failed node. However, because of a misplaced “break” statement within a C “case” construct, the neighboring nodes themselves crashed upon
receiving an out of service message from an adjacent node. The second node then
transmitted an out of service message to its adjacent nodes, resulting in the domino effect which collapsed the network.
On January 21, 2004, NASA ground controllers became unable to establish communications with the Mars Spirit Rover. After 18 days of nearly flawless
operation on the Martian surface, a serious anomaly with the robotic vehicle had
developed, rendering the vehicle inoperable. While initial speculation on the Rover
failure pointed toward a hardware failure, the root cause for the failure turned out to
be software. The executable code that had been loaded into the Rover at launch had
serious shortcomings, so new code was uploaded via radio to the rover during flight. In doing so, a new directory structure was uploaded into the file system while leaving the old directory structure and files in place.
Eventually, the Rover attempted to allocate more files than RAM would allow,
raising an exception and resulting in a diagnostic code being written to the file system
before rebooting the system. This scenario continued, causing the rover to get stuck in an endless cycle of reboots.
Mars PathFinder touched down on the surface of Mars on July 4, 1997, but
all communication was suddenly lost on September 27, 1997. The system began
encountering periodic total system resets, resulting in lost data and the cancellation of scheduled operations. The root cause was eventually traced to a watchdog reset[Gan02a] caused by a priority inversion within the system. New code
was uploaded to PathFinder and the mission was able to recover and complete a successful mission.
The Mars PathFinder failure resulted in significant research into real time tasking, leading to the development of the Java PathFinder [HP00] and JLint [Art01] static
analysis tools.
The United States Navy launched the 370 kg GeoSAT Follow-On (GFO) satellite
on February 10, 1998 from Vandenberg Air Force Base on board a Taurus rocket.
Immediately following launch, there were serious problems with attitude control as
the vehicle simply tumbled in space. Subsequent analysis of the motion equations
programmed into the vehicle indicated that momentum and torque were being applied
incorrectly; the sign of a coefficient had been inverted[Hal03], resulting in forces being applied in the opposite direction from that intended.
The December, 2006 launch of the TacSat-2 satellite on board an Air Force Mino-
taur I rocket was delayed due to software issues. While the investigation is not com-
plete, indications are that the problem may be related to a missing minus sign within
a mathematical equation. If the TacSat had been launched, the software defect would
have prevented the satellite’s attitude control system from turning the solar panels
closer than 45 degrees to the sun’s rays resulting in an eventual loss of power to the
satellite.
The NASA Near Earth Asteroid Rendezvous (NEAR) mission was launched from
Cape Canaveral on board a Delta-2 rocket on February 17, 1996. Its main mission was
to rendezvous with the asteroid 433 Eros[SWA+00]. Problems occurred on December
20, 1998 when the spacecraft was to fire the main engine in order to place the vehicle
in orbit around Eros. The engine started successfully but the burn was aborted almost
immediately. Communications with the spacecraft was lost for 27 hours[Gan00] dur-
ing which the spacecraft performed 15 automatic momentum dumps, fired its thrusters thousands of times, and consumed a significant quantity of propellant.
The root cause of the engine abort was quickly discovered. Sensors on board the
space craft had detected a transient lateral acceleration which exceeded a defined
constant in the control software. The software did not appropriately filter the input
and thus the engine was shut down. The spacecraft then executed a set of automated
scripts intended to place the craft into safe mode. These scripts, however, did not
properly start the reaction control wheels used to control attitude in the absence of thruster control. Additional faults were encountered as the on-board error handlers executed. Both were exacerbated by low batteries. Over the next few
hours, 7900 seconds of thruster firings were logged before the craft reached sun-safe
mode[Gan00].
As a part of the investigation, some of the 80,000 lines of source code were in-
spected, and 17 faults were discovered[Hof99]. Complicating the situation was the
fact that there turned out to be two different versions of flight software 1.11, one which was onboard the craft and one which was readily available on the ground, as flight software configuration control had not been rigorously maintained.
The television transmission standard has changed little since its initial develop-
ment during the 1930’s. Recently, there has been a trend towards increased usage
of Extra Digital Services (XDS) services, such as closed captioning, automatic time
of day synchronization, program guides, and other such information, which is broad-
cast digitally during the vertical blanking interval. XDS services are decoded on the receiver by an embedded microcontroller, allowing television receivers to be mass produced cheaply. Problems, however, occur when the broadcast data stream is corrupted. In one instance, equipment generating the digital data stream for closed captioning on a transmitter had a periodic
yet random failure whereby two extra bits would be erroneously inserted into the data
stream[Sop01]. On certain models, receiving these erroneous bits caused a buffer over-
flow within software causing a complete loss of video image, color tint being set to
maximum green, or muted audio. In each case, the only mechanism for recovery was
to unplug the television set and allow the microcontroller to reset itself to the default
settings[Sop01].
Chapter 3
Static Analysis Tools and Techniques1
1 Portions of this chapter appeared in Schilling and Alam[SA06c].
Static analysis encompasses techniques for determining “properties of a program that hold for all possible execution paths of the program.”[BV03] It serves as an automated form of code inspection and review to detect software implementation errors. Static analysis has been
shown to reduce software defects by a factor of six [XP04], as well as detect 60% of
post-release failures[Sys02]. Static analysis has been shown to outperform other Quality Assurance techniques for certain classes of defects. Defects
caught with static analysis tools early in the development cycle, before testing commences, can be 5 to 10 times cheaper to repair than at a later phase[Hol04].
Static analysis of source code does not represent new technology. Static analy-
sis tools are highly regarded within certain segments of industry for being able to
quickly detect software faults [Gan04]. Static analysis is routinely used in mission
critical source code development, such as aircraft[Har99] and rail transit[Pol] areas.
Robert Glass reports that static analysis can remove upwards of 91% of errors within
source code [Gla99b] [Gla99a]. It has also been found effective at detecting pieces
of dead or unused source code in embedded systems [Ger04] and buffer overflow vulnerabilities. The remainder of this chapter introduces the concept of static analysis, including the philosophy and practical issues related to static analysis.
Recent papers dealing with static analysis tools have shown a statistically signif-
icant relationship between the faults detected during automated inspection and the
actual number of field failures occurring in a specific product[NWV+04]. Static analysis has been used to determine testing effort by Ostrand et al.[OWB04]. Nagappan et al.[NWV+04] and Zheng et al.[ZWN+06] discuss the application of static analysis
to large scale industrial projects, while Schilling and Alam[SA05b] cite the benefits of
using static analysis in an academic setting. Integration of static analysis tools into a
software development process have been discussed by Schilling and Alam[SA06c] and others.
Static analysis tools have two important characteristics: soundness and complete-
ness. A static analysis tool is defined to be complete if it detects all faults present
within a given source code module. A static analysis tool is deemed to be sound if it
never gives a spurious warning. A static analysis tool is said to generate a false positive if a spurious warning is reported for the source code, and to generate a false negative if a fault present within the source code goes unreported. In practice, all static analysis tools are unsound and incomplete, as most tools generate false positives and fail to detect some genuine faults.
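A small Java fragment makes the distinction concrete (a constructed example; the warnings described in the comments are hypothetical, not the output of any particular tool):

// Illustrative: one genuine fault and one spurious warning site.
public class WarningDemo {
    static int length(String s) {
        return s.length();        // genuine fault: s may be null at runtime;
    }                             // a tool that misses it is incomplete

    static int half(int x) {
        if (x > 0 && x < 0) {     // infeasible condition: never true
            return 1 / 0;         // a warning here is a false positive,
        }                         // so a tool that reports it is unsound
        return x / 2;
    }

    public static void main(String[] args) {
        System.out.println(half(4));
        System.out.println(length(null)); // input that activates the fault
    }
}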
For all of the advantages of static analysis tools, there have been very few inde-
pendent comparison studies between tools. Rutar et al. [RAF04] compare the results
of using Findbugs, JLint, and PMD tools on Java source code. Forristal[For05] com-
pares 12 commercial and open source tools for effectiveness, but the analysis is based
only on security aspects and security scanners, not the broader range of static analysis tools. Other published work proposes benchmarks for benchmarking bug detection tools, but does not specifically address the tools and fault classes studied here.
ware reliability modeling, it was important to obtain information about the currently
existing tools. Table 3.1 provides a summary of the tools which are to be discussed in this section.
Lint
Lint[Joh78] is one of the first and most widely used static analysis tools for the C and
C++ languages. Lint checks programs for a large set of syntax errors and seman-
tic errors. Newer versions of Lint include value tracking, which can detect subtle initialization and value misuse problems; inter-function value tracking, which tracks values across function calls during analysis; strong type checking; user-defined semantic checking; usage verification, which can detect unused macros, typedefs, classes, members, and declarations; and flow verification for uninitialized variables. Lint can also
handle the verification of common safer programming subsets, including the MISRA C guidelines and the Scott Meyers Effective C++ series of standards [Mey92]. Lint also supports code
portability checks which can be used to verify that there are no known portability
issues with a given set of source code[Rai05]. A handbook on using Lint to verify C programs has also been published.
In addition to the basic Lint tool, several add on companion programs exist to
aid in the execution of the Lint program. ALOA[Hol04] automatically collects a set
of metrics from the Lint execution which can be used to aid in quality analysis of
source code. ALOA provides an overall lint score, which is a weighted sum of all Lint
warnings encountered, as well as breakdowns by source code module of the number of warnings detected.
QAC, and its companion tools QA C++, QAJ, and QA Fortran, have been de-
veloped by Programming Research Limited. Each tool is a deep flow static analyzer
tailored to the given language. These tools are capable of detecting language implementation problems and coding standard transgressions through code analysis. Version 2.0 of the tool issues over 800 warning and error messages, including warnings regarding non-portable code constructs, overly complex code, or code which violates the ISO/IEC 14882:2003 C++ language standard.
The QAC and QAC++ family of tools are capable of validating several different
coding standards. QAC can validate the MISRA C coding standard [MIS98] [MIS04],
while QAC++ can validate against the High Integrity C++ Coding standard[Pro].
Polyspace C Verifier
The Polyspace Ada Verifier was developed as a result of the Ariane 501 launch fail-
ure and can analyze large Ada programs and reliably detect runtime errors. Polyspace
C++ and Polyspace C verifiers have subsequently been developed to analyze these
languages[VB04].
The underlying analysis framework supports building data-flow analyzers, designing new data-flow analyses, and handling particular infinite sets of properties[Pil03].
Polyspace C analysis has suffered from scalability issues. Venet and Brat[VB04]
indicate that the C Verifier was limited to analyzing 20 to 40 KLOC in a given in-
stance, and this analysis took upwards of 8 to 12 hours to obtain 20% of the total
warnings within the source code. This requires overnight runs and batch process-
ing, making it difficult for software developers to understand if their changes have
corrected the discovered problems[BDG+ 04]. Zitser et al. [ZLL04] discuss a case
whereby Polyspace executed for four days to analyze a 145,000 LOC program before
aborting with an internal error. This level of performance is problematic for large
programs. Aliasing has also posed a significant problem to the Polyspace tool[BK03].
PREfix
PREfix simulates program operation, catching runtime problems before the program actually executes, as well
as matching a list of common logic and syntactic errors. PREfix performs extensive
deep flow static analysis and requires significant installation of both client and server
packages in order to operate. Current versions include both a database server and a
graphical user interface and are typically integrated into the master build process.
PREfix mainly targets memory errors such as uninitialized memory, buffer overflows, and invalid pointer usage. It is a path sensitive static analyzer that employs symbolic evaluation of execution paths. Path sensitivity
ensures that program paths analyzed are only the paths that can be taken during
execution. This helps to reduce the number of false positives found during static
analysis. Path sensitivity, however, does cause problems: exponential path blowup can occur due to control constructs, and loops can yield a potentially unbounded number of paths, making exhaustive exploration impractical. To avoid this problem, PREfix only explores a representative set of paths which can be configured by the user[BPS00]. PREfix running time scales approximately linearly with the size of the code being analyzed.
The development of PREfix has shown that the bulk of errors come from inter-
actions between two or more procedures. Thus, for maximum effectiveness, static analysis must be inter-procedural. In commercial C and C++ code, approximately 90% of errors are attributable to the interaction
of multiple functions. Furthermore, these problems are only revealed under rare error
conditions.
PREfast is a simpler tool developed by Microsoft based upon the results of applying PREfix. PREfast performs a simpler intra-procedural analysis that detects fewer defects and has a higher noise factor, but executes far more quickly.
Nagappan and Ball[NB05] discuss the usage of PREfix and PREfast at Microsoft.
Microsoft has also developed the PREsharp defect detection tool for the C# language.
SofCheck
SofCheck Inspector is a static analysis tool produced by SofCheck, Inc. for ana-
lyzing Java source code. The tool is designed to detect a large array of programming
errors including the misuse of pointers, array indices which go out of bounds, and buffer overflows. For each program element, the tool derives preconditions and postconditions. Preconditions are based upon what needs to be present to prevent a run-time failure, and
post conditions are based upon every possible output when the element is executed.
Output is then provided as annotated source code in browsable html format. Included
within the annotated source code is the characterization of each method. SofCheck
also includes a built in history system which allows regression testing on source code
to be conducted such that the tool can actually verify that fixed faults are completely removed. SofCheck Inspector is advertised as achieving an approximately 1000 lines per minute analysis speed depending upon the CPU speed, available RAM,
and complexity of the source code. Currently, SofCheck Inspector only works with the Java language.
KlocWork K7
Klocwork was derived from a Nortel Networks tool developed to evaluate massive code bases. The K7 tool detects problems such as denial of service vulnerabilities, buffer overflows, injection flaws, DNS spoofing, ignored return
values, mobile code security injection flaws, broken session management, insecure
storage, cross-site scripting, un-validated user input, improper error handling, and
broken access control. K7 can perform analysis based on Java source code and byte-
codes. This allows even third party libraries to be analyzed for possible defects.
Code Sonar
CodeSonar from GrammaTech, Inc. is a deep flow static analyzer for C and C++ source code. The tool is capable of detecting many common programming errors. In related research, the companion CodeSurfer tool was extended to detect the presence of buffer overflows within C source code. In this work, CodeSurfer was
extended through the use of supplemental plugins to generate, analyze, and solve constraints within the implemented source code. This extended tool was applied to several open source code bases.
JiveLint
JiveLint by Sureshot Software is a static analysis tool for the Java programming
language. JiveLint has three fundamental goals: to improve source code quality by detecting faults; to ease maintenance and debugging through enforced coding and naming conventions; and to communicate knowledge between developers. JiveLint is a stand-alone Windows application which does not require the Java environment to be installed.
C Global Surveyor
The NASA C Global Surveyor Project (CGS) was intended to develop an efficient
static analysis tool. The tool was brought about to overcome deficiencies in current
static analysis tools, such as the Polyspace C Verifier which suffers significantly from
scalability issues[VB04]. This tool would then be used to reduce the occurrence of runtime errors within mission software. CGS was designed to scale to larger projects, in which the analysis can run on multiple computers. CGS results are
reported to a centralized SQL database. While CGS can analyze any ISO C program,
its analysis algorithms have been precisely tuned for the Mars PathFinder programs.
CGS has been applied to several NASA Jet Propulsion Laboratory projects, in-
cluding the Mars PathFinder mission (135K lines of code) and the Deep Space One
mission (280K lines of code). CGS is currently being extended to handle C++ pro-
grams for the Mars Science Laboratory mission, which requires significant advances in the underlying analysis algorithms.
ESP
ESP is a method developed by the Program Analysis group at the Center for
Software Excellence of Microsoft for static detection of protocol errors in large C/C++
programs. ESP requires the user to develop a specification for the high level protocol
that the existing source code is intended to satisfy. The tool then compares the
behavior of the source code as implemented with the requisite specification. The
output is either a guarantee that the code satisfies the protocol, or a browsable list of execution paths which violate the protocol.
ESP has been used by Microsoft to verify the I/O properties of the GNU C
compiler, approximately 150,000 lines of C code. ESP has also been used to validate other large code bases within Microsoft.
JLint
JLint is a static analysis program for the Java language initially written by Konstantin Knizhnik. JLint analyzes Java code through the use of data flow analysis, abstract interpretation, and the construction of lock graphs to detect threading and synchronization problems. JLint is designed as two separate programs which interact with each other during analysis, the AntiC syntax analyzer and the JLint semantic analyzer[KA].
JLint has been applied to Space Exploration Software by NASA Ames Research
Center, and shown to be effective in the applications attempted thus far. Details of this experience
are provided in Artho and Havelund[AH04]. Rutar et al. [RAF04] compare JLint
with several other analysis tools for Java. JLint has also been applied to large scale software systems.
Lint4j
Lint4j (“Lint for Java”) is a static analyzer that detects locking and threading
issues, performance and scalability problems, and checks complex contracts such as
Java serialization by performing type, data flow, and lock graph analysis. In many
regards, Lint4j is quite similar in scope to JLint. The checks within Lint4j represent
the most common problems encountered while implementing products designed for
performance and scalability. General areas of problems detected are based upon those documented in standard Java references such as Gosling et al.[GJSB00]. Lint4j is written in pure Java, and therefore will
execute on any platform on which the Java JDK or JRE 1.4 has been installed.
Java PathFinder
The Java PathFinder (JPF) program is a static analysis model checking tool
developed by the Robust Software Engineering Group (RSE) at NASA Ames Research
Center and available under Open Source Licensing agreement from Sourceforge. This
software is an explicit-state model checker which analyzes Java bytecode classes for
deadlocks, assertion violations and general linear-time temporal logic properties. The
user can provide custom property classes and write listener-extensions to implement
other property checks, such as race conditions. JPF uses a custom Java Virtual Machine to execute and analyze the program's bytecode.
Java PathFinder has been applied by NASA Ames to several projects. Havelund
and Pressburger [HP00] discuss the general application of an early version of the Java
PathFinder tool. Brat et al. [BDG+04] provide a detailed description of the results obtained in later applications of the tool.
ESC-Java
The Extended Static Checker for Java (ESC/Java) was developed at the Compaq
Systems Research Center (SRC) as a tool for detecting common errors in Java pro-
grams such as null dereference errors, array bounds errors, type cast errors, and race conditions. ESC/Java provides an annotation language with which programmers can express design decisions using light-weight
specifications. ESC/Java checks each class and each routine separately, allowing
ESC/Java to be applied to code that references libraries without the need for library
source code.
The initial version of ESC/Java supported the Java 1.2 language set. ESC/Java2
is based upon the initial ESC/Java tool, but has been modernized to support JML and
Java 1.4, as well as support for checking frame conditions and annotations containing method calls.
ESC/Java has proven very successful at analyzing programs which include annotations. However, annotating an existing program has proven to be an error prone and daunting task. To alleviate some of this burden, a companion tool named Houdini was developed to aid in annotating source code. Houdini infers a set of candidate ESC/Java annotations for a given program. ESC/Java is then run on each candidate annotation, and annotations which cannot be verified are discarded.
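A brief hedged sketch of the annotation style (simplified JML-style specifications; see the ESC/Java2 documentation for the authoritative syntax):

// Illustrative: light-weight specifications expressed as JML annotations.
public class Account {
    private int balance;

    //@ invariant balance >= 0;

    //@ requires amount > 0;
    //@ ensures balance == \old(balance) + amount;
    public void deposit(int amount) {
        balance += amount;
    }

    //@ requires amount > 0 && amount <= balance;
    //@ ensures balance == \old(balance) - amount;
    public void withdraw(int amount) {
        balance -= amount;
    }
}

The checker attempts to verify each routine against its specification; a call site that cannot be shown to satisfy a precondition is reported as a potential error.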
FindBugs
FindBugs is a lightweight static analysis tool for the Java language with a reputation for effectively finding genuine defects. FindBugs detects common programming mistakes through the use of “Bug Patterns”, which
are code idioms that commonly represent mistakes in software. General usage of the tool is straightforward, and according to the creators of the FindBugs program, the tool can be extended through the development of custom bug pattern detectors.
In practice, the rate of false warnings reported by FindBugs is generally less than
50%. Rutar et al. [RAF04] compare the results of using FindBugs versus other Java
tools and report similar results. Wagner et al. [WJKT05] generally concur with this
assessment as well.
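A representative bug pattern of the kind these tools detect is reference comparison of strings (a generic instance of the pattern class):

// Classic bug pattern: comparing string contents with ==.
public class BugPatternDemo {
    public static void main(String[] args) {
        String expected = "shutdown";
        String command = new String("shutdown"); // a distinct object
        if (command == expected) {                // compares references
            System.out.println("reference match"); // never reached
        }
        if (command.equals(expected)) {           // compares contents
            System.out.println("content match");   // printed
        }
    }
}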
ITS4
ITS4 is a static vulnerability scanner for the C and C++ languages developed by
Cigital. The tool was developed as a replacement for a series of grep scans of source code for potentially dangerous function calls. Output from the tool includes a complete report of results as well as suggested fixes for the detected problems. The tool, however, does suffer from its simplistic lexical nature, resulting in a significant number of
false positives. ITS4 has been applied to the Linux Kernel by Majors[Maj03].
Fortify SCA
Fortify SCA is a static analysis tool produced by Fortify Software aimed at aiding
in the validation of software from a security perspective. The core of the tool includes
the Fortify Global Analysis Engine. This consists of five different static analysis en-
gines which find violations of secure coding guidelines. The Data Flow analyzer is
responsible for tracking tainted input across application architecture tiers and program boundaries. A semantic analyzer identifies function calls deemed to be vulnerable as well as the context of their usage. The control flow analyzer tracks the sequencing of programming operations with the intent of detecting improper operation sequences. A structural analyzer examines the interactions between the structural configuration of the program and the code, identifying vulnerabilities introduced
rules.
While for the purposes of this dissertation the focus is on Java, Fortify SCA supports numerous other programming languages as well.
LCLint
LCLint was a product of the MIT Lab for Computer Science and the DEC Research Center and was designed to take a C program which has been annotated with
additional LCL formal specifications within the source code. In addition to detect-
ing many of the standard syntactical issues, LCLint detects violations of abstraction boundaries, modification of state visible to clients, and missing initialization for an actual parameter or use of an uninitialized value.
Splint
Splint is the successor to LCLint, as the focus was changed to include secure pro-
grams. The name is extracted from “SPecification Lint” and “Secure Programming
Lint”. Splint extends LCLint to include checking for de-referencing a null pointer, us-
ing possibly undefined storage or returning storage that is not properly defined, type mismatches, memory management errors, dangerous aliasing, modifications and global variable uses inconsistent with specified interfaces, problematic control flow (likely infinite loops), and fall through cases or incomplete switch statements.
Splint has been compared to other dynamic tools in Hewett and DiPalma [HD03].
Flawfinder
Flawfinder is a simple tool which scans C and C++ source code for potential security flaws. Flawfinder operates on a provided listing of target
files to be processed. Processing then generates a list of potential security flaws sorted
on the basis of their risk. As with most static analysis tools, Flawfinder generates
both false positives and false negatives as it scans the given source code[Whe04].
RATS
The Rough Auditing Tool for Security (RATS) is a basic lexical analysis tool for
C and C++, similar in operation to ITS4 and Flawfinder. As implied by its name,
RATS only performs a rough analysis of source code for security vulnerabilities, and
will not find all errors. It is also hampered by flagging a significant number of false
positives[CM04].
SLAM
The SLAM project from Microsoft is designed to allow the safety verification of
C code. The Microsoft tool accomplishes this by placing a strong emphasis upon
verifying API usage rules. SLAM does not require the programmer to annotate
the source program, and it minimizes false positive error messages through a process of iterative refinement.
SLAM has been extensively used within Microsoft for the verification of Windows
XP Device Drivers. Behavior has been checked using this tool, as well as the usage
of kernel API calls[BR02]. The SLAM analysis engine is the core of Microsoft’s
Static Driver Verifier (SDV), available in Beta form as part of the Windows Software
Developers Kit.
MOPS
MOPS was developed by Hao Chen in collaboration with David Wagner to find security bugs in C programs
and to verify compliance with rules of defensive programming. MOPS was targeted at the analysis of existing C code and was designed to check for violations of temporal safety properties.
PMD
PMD, like JLint and FindBugs, is a static analysis tool for Java. However, unlike
these other tools, it does not contain a dataflow component as part of its analysis. Instead, PMD performs syntactic pattern matching against the program's abstract syntax tree.
PMD allows users to create extensions to the tool to detect additional bug patterns. PMD is mainly concerned with infelicities of design or style. As such, it has a low hit rate for detecting bugs. Furthermore, enabling all rule sets in PMD generates a significant number of warnings, many of them stylistic in nature.
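A representative design infelicity that rule-based analyzers flag is the empty catch block (a generic example of the pattern):

import java.io.FileInputStream;
import java.io.IOException;

// Illustrative: an empty catch block silently discards a failure.
public class StyleDemo {
    static void load(String path) {
        try (FileInputStream in = new FileInputStream(path)) {
            System.out.println(in.read());
        } catch (IOException e) {
            // empty: the error vanishes without a log entry or recovery
        }
    }

    public static void main(String[] args) {
        load("missing.dat"); // the absent file is swallowed without a trace
    }
}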
Checkstyle
Checkstyle is a Java style analyzer that verifies if Java source code is compliant
with predefined stylistic rules. Similar to PMD, Checkstyle has a very low rate for
detecting bugs within Java software. However, it does spare code reviewers the tedious task of manually verifying stylistic compliance.
Safer C Toolkit
The Safer C toolkit was developed based upon extensive analysis of the failure modes of C code and the 1995 publication of Safer C [Hat95], as well as feedback from teaching 2500 practicing engineers the concepts of
safer programming subsets. The key intent was to provide a tool which was both
educational to the user as well as practical for use with development projects.
Gauntlet
The Gauntlet tool for Java has been developed by the United States Military
Academy. The purpose of the tool was to act as a pre-compiler, statically analyzing the source code before it
is sent to the Java compiler and translating the top 50 common errors into layman’s
terms for the students. Gauntlet was developed based upon four years of background data regarding common student programming errors.
Chapter 4
The Software Reliability Model1
1 Portions of this chapter appeared in Schilling and Alam[SA05a][SA06d][SA06b].
This dissertation has thus far provided justification for a new software reliability
model. The first chapter provided a brief introduction to the problem as well as an
overview of the key objectives for this research. The second Chapter provided nu-
merous case studies showcasing that catastrophic system failure can be attributed to
software faults. The third Chapter introduced the concept of static analysis and pro-
vided a literature survey of currently existing static analysis tools for the C, C++, and
Java languages. This chapter will present relevant details for the Software Reliability
Model.
As software does not suffer from age related failure in the traditional sense, all
faults which lead to failure are present when the software is released. In a theoreti-
cal sense, if all faults can be detected in the released software, and these faults can be assigned an appropriate probability of causing a failure, then the reliability of the released software can be estimated. The practical challenge is reliably detecting the software faults and assigning the appropriate failure probabilities. The model presented in this chapter combines static analysis, limited testing, and a series of Bayesian Belief Networks which can be used to produce this estimate.
¹Portions of this chapter appeared in Schilling and Alam [SA05a][SA06d][SA06b].
It is often the case that the terms fault and failure are used interchangeably. This is incorrect, as each term has a distinct and specific meaning. Unfortunately, sources are not in agreement on the relationship between the two, and different models for this relationship appear in the literature.
For the purposes of this dissertation, a human makes a mistake during software development, resulting in a software fault being injected into the source code. The fault represents a static property of the source code. Failures, in contrast, are dynamic: they are initiated when a fault is activated through program execution.
All programmers make mistakes, resulting in faults being injected during development into each and every software product. The majority of faults are injected during the imple-
mentation phase[NIS06]. The rate varies for each developer, the implementation
language chosen, and the Software Development process chosen. Boland[Bol02] re-
ports that the rate is approximately one defect for every ten lines of code developed,
and Hatton[Hat95] reports the best software as having approximately five defects
per thousand lines of code. These injected defects are removed through the software engineering processes of review and test, though some inevitably remain in the released product. If the code containing a fault never executes, then it can not cause a failure. Software failures occur due to the
presence of one or more software faults being activated through a certain set of input
stimuli[Pai]. Any fault can potentially cause a failure of a software package, but not
all faults will cause a failure, as is shown graphically in Figure 4-1. Adams[Ada84] reports that, on average, one third of all software faults only manifest themselves as a failure once every 5000 years of execution, and only two percent of all faults lead to a MTTF of less than 50 years. Downtime is not evenly distributed either, as it is suggested that about 90 percent of the downtime comes from at most 10 percent of the faults. From this, it
follows that finding and removing a large number of defects does not necessarily yield
the highest reliability. Instead, it is important to focus on the faults that have a short
MTTF associated with them. Malaiya et al. [MLB+94] indicate that rarely executed modules, such as error handlers and exception handlers, are notoriously difficult to test and are highly critical to the resultant reliability of the system.
Gray[Gra86] classifies software faults into two different categories, Bohrbugs and Heisenbugs. Bohrbugs are solid, deterministic faults which manifest themselves consistently and are easily reproduced. Provided that proper testing occurs during product development, most Bohrbugs can be detected and easily removed from the product. Heisenbugs represent a class of temporary faults which are random and intermittent in their occurrence. Heisenbugs include memory exhaustion, race conditions and other timing related issues, and exception handling.
Vaidyanathan and Trivedi [VT01] have extended this initial classification of soft-
ware faults to include a third classification for software faults, “Aging-related faults”.
These faults are similar to Heisenbugs in that they are random in occurrence. How-
ever, they are typically brought on by prolonged execution of a given software pro-
gram.
Grottke and Trivedi[GT05] have further refined the Vaidyanathan and Trivedi
model to better reflect the nature of software bugs. In this classification, there are
two major classifications for software faults, Bohrbugs and Mandelbugs. Bohrbugs are faults that are easily isolated and that manifest themselves consistently under a well-defined set of conditions. Mandelbugs are faults whose activation and/or error propagation are complex. Typically, Mandelbugs are difficult to isolate, as the failures caused by them are not systematically reproducible. Mandelbugs are divided into two subcategories, Heisenbugs and Aging-related bugs. Heisenbugs are faults that cease to cause failures or that manifest themselves differently when an attempt is made to probe or isolate them. Aging-related bugs are faults that lead to the accumulation of errors either inside the running application or in its system-internal environment, resulting in an increased failure rate and/or degraded performance with increasing time. This classification scheme makes it easier to understand what causes a fault to manifest itself as a failure and to predict the frequency of such manifestations.
In terms of faults and their density, little has been published classifying faults
by their occurrence. One of the most thorough studies in this area identified the top five errors occurring in Japanese embedded systems programming.
1: int32_t foo(int32_t a)
2: {
3: int32_t b;
4: if (a > 0)
5: {
6: b = a;
7: }
8: return ((b) ? 1 : 0);
9: }
Figure 4-3: Function containing a statically detectable uninitialized variable fault.
In C, automatic variables are not automatically initialized when defined. For an automatic variable which is allocated either on the stack or within a processor register, the value which was previously in that location will be the value of the variable. Figure 4-3 shows an example of a function in which b is assigned a value only when a > 0. However, if a ≤ 0, the value for b is indeterminate, and therefore, the return value of the function is also indeterminate. This behavior is statically detectable, yet it occurs once every 250 lines of source code for Japanese programs and once every 840 lines in other surveyed code, and the resulting run time behavior is entirely unpredictable.
Figure 4-4 provides another example of source code which contains statically de-
tectable faults. The intent of the code is to calculate the running average of an array of variables. Variables are stored in a circular buffer data_values of length NUM_VALUES. The running average of the values stored is kept in the variable average, the sum of all data values is stored in the variable data_sums, and the current offset into the circular buffer is stored in array_offset. Each time the routine is called, a 16 bit value is passed in representing the current value that is to be added to the average. The array offset is incremented, the previous value is removed from the data_sums variable, the new value is added to the array and the data_sums variable, and the updated average is stored in average.
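The following C sketch is a hypothetical reconstruction of the routine described for Figure 4-4; identifier names follow the text, but the exact statements and line numbering of the original are assumptions:

#include <stdint.h>

#define NUM_VALUES 12

static int16_t data_values[NUM_VALUES];
static int32_t data_sums;
static int16_t average;
static uint8_t array_offset;

void add_to_average(int16_t p_new_value)
{
    /* Undefined behavior: array_offset is modified twice (once by the
     * postfix ++ and once by the assignment) without an intervening
     * sequence point, so the resulting offset depends upon the compiler. */
    array_offset = array_offset++ % NUM_VALUES;

    /* If the offset is left out of range, the read below and the write
     * which follows both access memory beyond the end of data_values. */
    data_sums -= data_values[array_offset];
    data_values[array_offset] = p_new_value;
    data_sums += p_new_value;
    average = (int16_t)(data_sums / NUM_VALUES);
}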
However, there are several potential problems with this simple routine associated with
the array offset variable. The intent of the source code is to increment the offset by
one and then perform a modulus operation on this offset to place it within the range
of 0 to 11. Based upon the behavior of the compiler, this may or may not be the case.
If array_offset = 10, the value of array_offset can be either 0 or 10. The value will be 0 if the postfix increment operator (++) is executed before the modulus operation occurs. However, if the compiler chooses to implement the logic so that the postfix increment operator occurs after the modulus operation occurs, the array_offset variable will be left holding an out of range value.
If array offset is set to 11, the execution of line 15 results in an out of bounds
access for the array. In the C Language, reading from outside of the array will not, in
general, cause a processor exception. However, the value read is invalid, potentially
resulting in a very large negative number if the offset value being subtracted is larger
than the current data sum variable value. Line 18 may result in the average value
being overwritten. Depending upon how the compiler organizes RAM, the average
variable may be the next variable in RAM following the data values array. If this is
the case, writing to data values[11] will result in the average value being overwritten.
The exact behavior will depend upon the compiler word alignment, array padding,
and other implementation behaviors. This behavior can vary from one compiler to another, and may even vary for a single compiler depending upon compiler options passed on the command line, especially if compiler optimization is used.
Fortunately, the behavior of this construct is easy to verify through testing. So long as the code has been exercised
through this transition point and proper behavior has been obtained, proper behavior
will continue until the code is recompiled, a different compiler version is used, or the
compiler is changed. Code constructs like this, however, do make code portability
difficult.
Figure 4-6 exhibits another statically detectable fault, namely the potential to
Figure 4-6: Source exhibiting loop overflow and out of bounds array access.
de-reference outside of an array. In this case, if all of the connections slots are busy,
an error message will be printed out notating this condition. The slot pointer will be
pointing one element beyond the end of the array when the code returns to execute
line 15. When line 15 executes, a de-reference beyond the end of the array will occur. Since the memory layout is unknown based upon the code segment provided, this can result in data corruption outside of the given array. In a worst case scenario, this behavior could result in an infinite loop which never terminates. The fault present within this code can be detected statically.
To model reliability accurately, it is necessary to understand what causes these faults to become failures. In reliability growth modeling, one of the most important parameters is known as the fault exposure ratio (FER). The fault exposure ratio expresses the average probability that a fault will result in a failure on a given execution, in other words, the relationship between faults and failures. Naixin and Malaiya[NM96] and
von Mayrhauser and Srimani[MvMS93] discuss both the calculation of this parame-
ter as well as its meaning to a software module. This parameter, however, is entirely
black-box based, and does not help relate faults to failures at a detailed level.
There are many reasons why a fault lays dormant and does not manifest itself as
a failure. The first, and most obvious, deals with code coverage. If a fault does not
execute, it can not lead to failure. While this is intuitively obvious, determining if a
fault can be executed can be quite complicated and require significant analysis.
Figure 4-7 provides such an example of this complexity. The intent of the code
is to calculate the distance from a current number to the next prime number. These types of calculations are often used in random number generation, such as the Linear Congruential Generator, $X_{n+1} = (a X_n + c) \bmod m$. Depending upon the exact algorithm used, the values selected for a and c may depend upon prime numbers.
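For reference, a minimal linear congruential generator in C is shown below; the multiplier and increment are the widely published Numerical Recipes constants, used purely for illustration and not taken from the dissertation's example:

#include <stdint.h>

static uint32_t lcg_state = 1u;

/* X_{n+1} = (a * X_n + c) mod 2^32; the modulus is implicit in the
 * wrap-around of uint32_t arithmetic. */
uint32_t lcg_next(void)
{
    lcg_state = 1664525u * lcg_state + 1013904223u;
    return lcg_state;
}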
There is, however, a statically detectable problem with this implementation. The code begins on line 6 by checking to make certain that the number is greater than 0.
If this is the case, the code will step through a chain of if and else if statements, looking for the smallest prime number which is greater than the value passed in. Once this has been found, the t_next_prime_number variable is set to this value, and in line 69 the calculation

t_return_value = t_next_prime_number - p_number;

is executed. If p_number is 127 or greater, none of the branches executes, t_next_prime_number retains its initial value of 0, and the subtraction produces a negative result. Since t_return_value, however, is a uint8_t, t_return_value will take on a very large
positive value. It is important to note that this statically detectable fault escaped
testing even though 100% statement coverage had been achieved, as is shown in Figure
4-8.
There are several different ways we can assess the probability of this fault manifesting itself. If we base the probability on the size of the input space, there are 256 possible inputs to the function, ranging from 0 to 255. The failure will occur any time the input lies between 127 and 255, a total of 129 of the 256 possible values, resulting in

$p_f = \frac{129}{256} = 0.50390625. \quad (4.7)$

If, however, the input domain is restricted to

$D = \{x \in \mathbb{N} \mid x \le 127\}, \quad (4.8)$

then only the single input 127 triggers the failure, and

$p_f = \frac{1}{128} = 0.0078125. \quad (4.9)$

If the input domain is further restricted to

$D = \{x \in \mathbb{N} \mid x \le 100\}, \quad (4.10)$

then

$p_f = \frac{0}{100} = 0. \quad (4.11)$
This complexity partly explains why faults lie dormant for such long periods and why
many faults only surface when a change is made to the software. If an initial program
using this software never passes a value greater than 100 to the routine, it will never
fail. But if a change is made and a value of 128 can now be passed in, the failure is
more likely to surface. A change in the input domain to include values up to 255 virtually guarantees that the failure will surface.
To determine the fraction of paths which can cause failure, a control flow graph can be generated for the source code as is shown in Figure 4-9. From this graph, the static path count through the method can be calculated. In this case, there are 33 distinct paths through the function; one path will cause a failure, resulting in $p_f = \frac{1}{33} \approx 0.0303$ if all paths are assumed to execute equally.
The do_walk function, shown in full later in this chapter, exhibits the second most prevalent problem in Japanese source code. In this case, there are seven distinct paths through the source code, yielding a static path count of 7. Of these paths, six of them do not contain any statically detectable faults. However, the seventh path fails to initialize a function pointer, resulting in the program jumping to an unknown address, and most likely crashing the program. Assuming that these paths have an equal probability of executing, $p_f = \frac{1}{7} \approx 0.1428$.
The very presence of these problems may or may not immediately result in a failure. Returning a larger than expected number from a mathematical function may go unnoticed for some time. De-referencing an invalid pointer in C will most likely result in an immediate and noticeable failure; overwriting the stack return address is likely to result in the same behavior. Thus, for a fault to manifest itself as a failure, the code which contains the fault first must execute, and then the result of the fault must be used in a manner that will result in a failure occurring.
1: #include <stdint.h>
2: /* This routine will calculate the distance from the current number to the next prime number. */
3: uint8_t calculate_distance_to_next_prime_number(uint8_t p_number) {
4: int8_t t_next_prime_number = 0;
5: uint8_t t_return_value = 0;
6: if (p_number > 0) {
7: if (p_number < 2)
8: {t_next_prime_number = 2; }
9: else if (p_number < 3)
10: { t_next_prime_number = 3; }
11: else if (p_number < 5)
12: { t_next_prime_number = 5; }
13: else if (p_number < 7)
14: { t_next_prime_number = 7; }
15: else if (p_number < 11)
16: { t_next_prime_number = 11; }
17: else if (p_number < 13)
18: { t_next_prime_number = 13; }
19: else if (p_number < 17)
20: { t_next_prime_number = 17; }
21: else if (p_number < 19)
22: { t_next_prime_number = 19; }
23: else if (p_number < 23)
24: { t_next_prime_number = 23; }
25: else if (p_number < 29)
26: { t_next_prime_number = 29; }
27: else if (p_number < 31)
28: { t_next_prime_number = 31; }
29: else if (p_number < 37)
30: { t_next_prime_number = 37; }
31: else if (p_number < 41)
32: { t_next_prime_number = 41; }
33: else if (p_number < 43)
34: { t_next_prime_number = 43; }
35: else if (p_number < 47)
36: { t_next_prime_number = 47; }
37: else if (p_number < 53)
38: { t_next_prime_number = 53; }
39: else if (p_number < 59)
40: { t_next_prime_number = 59; }
41: else if (p_number < 61)
42: { t_next_prime_number = 61; }
43: else if (p_number < 67)
44: { t_next_prime_number = 67; }
45: else if (p_number < 71)
46: { t_next_prime_number = 71; }
47: else if (p_number < 73)
48: { t_next_prime_number = 73; }
49: else if (p_number < 79)
50: { t_next_prime_number = 79; }
51: else if (p_number < 83)
52: { t_next_prime_number = 83; }
53: else if (p_number < 89)
54: { t_next_prime_number = 89; }
55: else if (p_number < 97)
56: { t_next_prime_number = 97; }
57: else if (p_number < 101)
58: { t_next_prime_number = 101; }
59: else if (p_number < 103)
60: { t_next_prime_number = 103; }
61: else if (p_number < 107)
62: { t_next_prime_number = 107; }
63: else if (p_number < 109)
64: { t_next_prime_number = 109; }
65: else if (p_number < 113)
66: { t_next_prime_number = 113; }
67: else if (p_number < 127)
68: { t_next_prime_number = 127; }
69: t_return_value = t_next_prime_number - p_number;
70: }
71: return t_return_value;
72: }
Figure 4-7: Source code for the calculate_distance_to_next_prime_number function.
File ‘prime_number_example1.c’
Lines executed: 100.00% of 69
prime_number_example1.c: creating ‘prime_number_example1.c.gcov’
-: 1:#include <stdint.h>
-: 2:uint8_t calculate_distance_to_next_prime_number(uint8_t p_number);
function calculate_distance_to_next_prime_number called 1143 returned 100% blocks executed 100%
1143: 3:uint8_t calculate_distance_to_next_prime_number(uint8_t p_number) {
1143: 4: int8_t t_next_prime_number = 0;
1143: 5: uint8_t t_return_value;
1143: 6: if (p_number > 0) {
1142: 7: if (p_number < 2)
118: 8: {t_next_prime_number = 2; }
1024: 9: else if (p_number < 3)
119: 10: {t_next_prime_number = 3;}
905: 11: else if (p_number < 5)
220: 12: {t_next_prime_number = 5;}
685: 13: else if (p_number < 7)
152: 14: {t_next_prime_number = 7;}
533: 15: else if (p_number < 11)
134: 16: {t_next_prime_number = 11;}
399: 17: else if (p_number < 13)
55: 18: {t_next_prime_number = 13;}
344: 19: else if (p_number < 17)
72: 20: {t_next_prime_number = 17;}
272: 21: else if (p_number < 19)
23: 22: {t_next_prime_number = 19;}
249: 23: else if (p_number < 23)
29: 24: {t_next_prime_number = 23;}
220: 25: else if (p_number < 29)
55: 26: {t_next_prime_number = 29;}
165: 27: else if (p_number < 31)
17: 28: {t_next_prime_number = 31;}
148: 29: else if (p_number < 37)
29: 30: {t_next_prime_number = 37;}
119: 31: else if (p_number < 41)
13: 32: {t_next_prime_number = 41;}
106: 33: else if (p_number < 43)
6: 34: {t_next_prime_number = 43;}
100: 35: else if (p_number < 47)
14: 36: {t_next_prime_number = 47;}
86: 37: else if (p_number < 53)
16: 38: {t_next_prime_number = 53;}
70: 39: else if (p_number < 59)
13: 40: {t_next_prime_number = 59;}
57: 41: else if (p_number < 61)
8: 42: {t_next_prime_number = 61;}
49: 43: else if (p_number < 67)
6: 44: {t_next_prime_number = 67;}
43: 45: else if (p_number < 71)
5: 46: {t_next_prime_number = 71;}
38: 47: else if (p_number < 73)
6: 48: {t_next_prime_number = 73;}
32: 49: else if (p_number < 79)
5: 50: {t_next_prime_number = 79;}
27: 51: else if (p_number < 83)
4: 52: {t_next_prime_number = 83;}
23: 53: else if (p_number < 89)
8: 54: {t_next_prime_number = 89;}
15: 55: else if (p_number < 97)
5: 56: {t_next_prime_number = 97;}
10: 57: else if (p_number < 101)
1: 58: {t_next_prime_number = 101;}
9: 59: else if (p_number < 103)
2: 60: {t_next_prime_number = 103;}
7: 61: else if (p_number < 107)
2: 62: {t_next_prime_number = 107;}
5: 63: else if (p_number < 109)
2: 64: {t_next_prime_number = 109;}
3: 65: else if (p_number < 113)
1: 66: {t_next_prime_number = 113;}
2: 67: else if (p_number < 127)
2: 68: {t_next_prime_number = 127;}
1142: 69: t_return_value = t_next_prime_number - p_number;
-: 70: }
1143: 71: return t_return_value;
-: 72:}
Figure 4-8: GNU gcov output from testing prime number source code.
Figure 4-9: Control flow graph for calculate distance to next prime number method.
1 #include "interface.h"
2
3 static uint16_t test_active_flags;
4 static uint16_t test_done_flags;
5
6 void do_walk(void) {
7 uint8_t announce_param;
8 function_ptr_type test_param;
9 if (TEST_BIT(test_active_flags, DIAG_TEST)) {
10 if (check_for_expired_timer(TIME_IN_SPK_TEST) == EXP) {
11 start_timer();
12 if (TEST_BIT(test_active_flags, RF_TEST)) {
13 announce_param = LF_MESSAGE;
14 test_param = LF_TEST;
15 SETBIT_CLRBIT(test_active_flags, LF_TEST, RF_TEST);
16 } else if (TEST_BIT(test_active_flags, LF_TEST)) {
17 announce_param = LR_MESSAGE;
18 test_param = LR_TEST;
19 SETBIT_CLRBIT(test_active_flags, LR_TEST, LF_TEST);
20 } else if (TEST_BIT(test_active_flags, LR_TEST)) {
21 announce_param = RR_MESSAGE;
22 test_param = RR_TEST;
23 SETBIT_CLRBIT(test_active_flags, RR_TEST, LR_TEST);
24 } else if ((TEST_BIT(test_active_flags, RR_TEST)) &&
25 (get_ap_state(AUK_STATUS) != UNUSED_AUK)) {
26 announce_param = SUBWOOFER_MESSAGE;
27 test_param = AUX1_TEST;
28 SETBIT_CLRBIT(test_active_flags, SUBWOOFER1_TEST, RR_TEST);
29 } else {
30 announce_param = EXIT_TEST_MESSAGE;
31 CLRBIT(test_active_flags, DIAG_TEST);
32 SETBIT(test_done_flags, DIAG_TEST);
33 }
34 make_announcements(announce_param);
35 (*test_param)();
36 }
37 }
38 }
The first and most important starting point for determining if a fault is to become
a failure is related to source code coverage. If a fault is never encountered during ex-
ecution, it can not result in a failure. In many software systems, especially embedded
systems, the percentage of code which routinely executes is actually quite small, and
the majority of the execution time is spent covering the same lines over and over
again. Embedded systems are also designed with a few repetitive tasks that execute
periodically at similar rates. Thus, with limited testing covering the normal use cases
for the system, information about the “normal” execution paths can be obtained.
There are many different metrics and measurements associated with code coverage. Kaner [Kan95] lists 101 different coverage metrics that are available. Statement Coverage² reports whether each executable statement is encountered. Basic block coverage is an extension to statement coverage except that the unit of code is a sequence of non-branching statements. Decision Coverage³ reports whether boolean expressions tested in control structures evaluated to both true and false. Condition Coverage reports the true or false outcome of each individual condition, and Condition/Decision Coverage is a hybrid of condition coverage and decision coverage. Path Coverage⁴ reports whether each of the possible paths⁵ in each function have been followed. Data Flow Coverage, a
variation of path coverage, considers the sub-paths from variable assignments to sub-
sequent references of the variables. Function Coverage reports whether each function or procedure was invoked; it is useful during preliminary testing to assure at least some coverage in all areas of the software. Call Coverage⁶ reports whether each function call has been made. Loop Coverage measures whether each loop body is executed zero times, exactly once, and more than once (consecutively). Race Coverage reports whether multiple threads execute the same code at the same time and is used to detect failure to synchronize access to resources. In many cases, coverage definitions overlap each other, as Decision Coverage includes Statement Coverage, Path Coverage includes Decision Coverage, and Predicate Coverage includes Path Coverage.
²Also known as line coverage or segment coverage.
³Also known as branch coverage, all-edges coverage, basis path coverage, or decision-decision-path testing.
⁴Also known as predicate coverage.
⁵A path is a unique sequence of branches from the function entry to the exit.
⁶Also known as call pair coverage.
Marick [Mar99] cites some of the misuses for code coverage metrics. A certain
level of code coverage is often mandated by the software development process when
evaluating the effectiveness of the testing phase. The mandated level varies from organization to organization. Extreme
Programming advocates endorse 100% method coverage in order to ensure that all
methods are invoked at least once, though there are also exceptions given for small
functions which are smaller than the test cases would be[Agu02][JBl]. Piwowarski,
Ohba, and Caruso[POC93] indicate that 70% statement coverage is necessary to en-
sure sufficient test case coverage, 50% statement coverage is insufficient to exercise
the module, and beyond 70%-80% is not cost effective. Hutchins et al.[HFGO94]
indicate that even 100% coverage is not necessarily a good indication of testing adequacy: though more faults are discovered at 100% coverage than at 90% or 95% coverage, faults can still be uncovered even after testing has reached 100% coverage.
There has been significant study of the relationship between code coverage and
the resulting reliability of the source code. Garg [Gar95] [Gar94] and Del Frate
[FGMP95] indicate that there is a strong correlation between code coverage obtained
during testing and software reliability, especially in larger programs. The exact extent of this correlation, however, has not been precisely quantified.
The fundamental premise behind this model is that the resulting software relia-
bility can be related to the statically detectable faults present within the source code,
the number of paths which lead to the execution of the statically detectable faults, and the rate of execution of each path within the software package.
To model reliability, the source code is first divided into statement blocks. A statement block represents a contiguous set of source code statements containing no decision points.
Figure 4-12 provides example source code for an embedded system timer routine
which verifies if a timer has or has not expired. Figure 4-13 shows a language transla-
tion of the code into Java. This source code can be decomposed into the block format diagramed in Figure 4-14.
1 #include <stdint.h>
2 typedef enum {FALSE, TRUE} boolean;
3
4 extern uint32_t get_current_time(void);
5
6 typedef struct {
7 uint32_t starting_time; /* Starting time for the system */
8 uint32_t timer_delay; /* Number of ms to delay */
9 boolean enabled; /* True if timer is enabled */
10 boolean periodic_timer; /* TRUE if the timer is periodic. */
11 } timer_ctrl_struct;
12
13 boolean has_time_expired(timer_ctrl_struct p_timer)
14 {
15 boolean t_return_value = FALSE;
16 uint32_t t_current_time;
17 if (p_timer.enabled == TRUE)
18 {
19 t_current_time = get_current_time();
20 if ((t_current_time > p_timer.starting_time) &&
21 ((t_current_time - p_timer.starting_time) > p_timer.timer_delay))
22 {
23 /* The timer has expired. */
24 t_return_value = TRUE;
25 }
26 else if ((t_current_time < p_timer.starting_time) &&
27 ((t_current_time + (0xFFFFFFFFu - p_timer.starting_time)) > p_timer.timer_delay))
28 {
29 /* The timer has expired and wrapped around. */
30 t_return_value = TRUE;
31 }
32 else
33 {
34 /* The timer has not yet expired. */
35 t_return_value = FALSE;
36 }
37 if (t_return_value == TRUE)
38 {
39 if (p_timer.periodic_timer == TRUE )
40 {
41 p_timer.starting_time = t_current_time;
42 }
43 else
44 {
45 p_timer.enabled = FALSE;
46 p_timer.starting_time = 0;
47 p_timer.periodic_timer = FALSE;
48 }
49 }
50 }
51 else
52 {
53 /* Timer is not enabled. */
54 }
55 return t_return_value;
56 }
Figure 4-12: Source code for an embedded system timer expiration routine.
The reliability for each block is assessed using a Bayesian Belief Network, which is
described in detail in the next chapter. The Bayesian Belief network uses an analysis
of the fault locations, fault characteristics, historical data from past projects, fault
taxonomy data, and other parameters to determine if the given fault is either a valid fault or a false positive.
Figure 4-13: Translation of timer expiration routine from C to Java. Note that nothing
has changed other than the implementation language. The algorithm is exactly the same.
Simply using faults to model reliability is insufficient, for the faults must be activated through execution before they can cause failures.
The simplest method for establishing code coverage using the model given would be to assume that all paths through the method are executed with equal probability.
For example, the function diagramed in Figure 4-14 has ten possible paths through the source code, as is shown in Table 4.2. Using the simplest method, each path would have a probability $p_p = \frac{1}{10} = 0.10$ of executing. From this method, we can then calculate a reliability for the given function. However, empirically it is known that
this assumption of uniform path coverage is incorrect. Many functions contain fault
tolerance logic which rarely executes. Other functions contain source code which executes on nearly every invocation. A more refined method considers the discrete decisions which cause the execution of each path through the source code.
To use this methodology, we assume that each conditional statement used to make a
decision has an equal probability of being true or false. Thus, the statement

if (p_timer_enabled == TRUE)

has a probability of 0.50 of taking the if condition and a probability of 0.50 of taking the else condition. Using this same logic, the compound statement

if ((t_current_time > p_timer.starting_time) &&
    ((t_current_time - p_timer.starting_time) > p_timer.timer_delay))

has a probability of 0.25 of taking the if condition and a probability of 0.75 of taking the else condition. We refer to this measure as uniform conditional logic, and when applying it to Figure 4-14, each path probability becomes the product of the probabilities of the decisions along that path.
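As a worked illustration of the uniform conditional logic computation (using the block labels of Figure 4-14; the individual decision probabilities follow directly from the rules just stated), the probability of the path B → E → G is the product of the decisions encountered along it:

$p_{B \to E \to G} = 0.5 \times 0.25 \times 0.5 \times 0.5 = 0.015625,$

where 0.5 is the timer-enabled check, 0.25 is the compound expiration test, and the remaining two factors are the t_return_value and periodic_timer checks.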
While this method does generate a valid distribution for each path through the
source code, the method given above fails to take into account the dependencies
present within the function. For example, Blocks D and E of the source code execute only when the timer has expired, setting “t_return_value = TRUE;”. If we visit block D or E within the function, then we are guaranteed to visit block F or G. Block C sets “t_return_value = FALSE;”. Thus, if we ever visit node C, we are guaranteed not to visit node F or G. Making these changes yields the fourth and fifth columns in Table 4.2.
One problem with value tracking is that it can be difficult to reliably track variable
dependencies if different variables are used but the values are assigned elsewhere in the program. This method is also problematic in that paths which encounter fewer decisions are assigned higher probabilities of execution; it is known from empirical study of source code that this assumption is not always true. In many instances, the first logical
checks within a function are often for fault tolerance purposes (e.g., checking for NULL pointers, checking for invalid data parameters, etc.), and since these conditions rarely
exist, the paths resulting from this logic are rarely executed. Other methods that can be used include fuzzy logic reasoning for path probabilities and other advanced analyses of the source code. None of the methods profiled thus far, however, takes into account the user
provided data which has the greatest effect on which paths actually execute. De-
pending upon the user's preferences and use cases, the actual path behavior may vary
greatly from the theoretical values. Figure 4-15 provides output from the GNU gcov
tool showing a simple usage of the routine diagramed in Figure 4-14. From this infor-
mation, we can construct the experimental probability of each block being executed
and set up a system of equations from this information, as is shown in Equation 4.13.
Figure 4-15: GNU gcov output from limited testing of the has_time_expired routine (excerpt).
$\begin{aligned}
p_{B\to D\to F} + p_{B\to E\to F} &= \tfrac{9}{2537} \\
p_{B\to D\to G} + p_{B\to E\to G} &= \tfrac{113}{2537} \\
p_{B\to D\to F} + p_{B\to D\to G} &= \tfrac{5}{2537} \\
p_{B\to E\to F} + p_{B\to E\to G} &= \tfrac{117}{2537}
\end{aligned} \qquad (4.13)$
However, this set of equations does not have a unique solution: the four equations are linearly dependent, providing only three independent constraints on the four unknown path probabilities. Thus, the information captured by gcov is not entirely suitable for determining the paths which are executed during limited testing.
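The dependency is easy to verify by summing the reconstructed equations: the first pair and the second pair both total

$\frac{9}{2537} + \frac{113}{2537} = \frac{5}{2537} + \frac{117}{2537} = \frac{122}{2537},$

so the four equations supply only three independent constraints on the four unknown path probabilities.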
There are many tools that have been developed to aid in code coverage analysis,
both commercial and open source, besides the gcov program. A detailed discussion
of Java code coverage tools is available in [Agu02]. However, none of the existing analysis tools supports the path-level coverage metrics required here, requiring the development of our own approach.
It is possible to obtain better information on the branch coverage for the func-
tion if a subtle change is made to the source code. This conceptual change involves
placing a log point in each block of source code. For this primitive example, this
is accomplished through a simple printf statement in the source code printing a letter A through G corresponding to the block of code executing. Upon exit from the function, a
newline is printed, indicating that the given trace has completed. This modified code
is shown in Figure 4-16. In this figure, lines retain their initial numbering scheme
from the original code. Code which has been added is denoted with a ** symbol.
** #include <stdio.h>
1 #include <stdint.h>
2 typedef enum {FALSE, TRUE} boolean;
3
4 extern uint32_t get_current_time(void);
5
6 typedef struct {
7 uint32_t starting_time; /* Starting time for the system */
8 uint32_t timer_delay; /* Number of ms to delay */
9 boolean enabled; /* True if timer is enabled */
10 boolean periodic_timer; /* TRUE if the timer is periodic. */
11 } timer_ctrl_struct;
12
13 boolean has_time_expired(timer_ctrl_struct p_timer)
14 {
15 boolean t_return_value = FALSE;
16 uint32_t t_current_time;
17 if (p_timer.enabled == TRUE)
18 {
** printf("B");
19 t_current_time = get_current_time();
20 if ((t_current_time > p_timer.starting_time) &&
21 ((t_current_time - p_timer.starting_time) > p_timer.timer_delay))
22 {
23 /* The timer has expired. */
** printf("E");
24 t_return_value = TRUE;
25 }
26 else if ((t_current_time < p_timer.starting_time) &&
27 ((t_current_time + (0xFFFFFFFFu - p_timer.starting_time)) > p_timer.timer_delay))
28 {
29 /* The timer has expired and wrapped around. */
** printf("D");
30 t_return_value = TRUE;
31 }
32 else
33 {
34 /* The timer has not yet expired. */
** printf("C");
35 t_return_value = FALSE;
36 }
37 if (t_return_value == TRUE)
38 {
39 if (p_timer.periodic_timer == TRUE )
40 {
** printf("G");
41 p_timer.starting_time = t_current_time;
42 }
43 else
44 {
** printf("F");
45 p_timer.enabled = FALSE;
46 p_timer.starting_time = 0;
47 p_timer.periodic_timer = FALSE;
48 }
49 }
50 }
51 else
52 {
** printf("A");
53 /* Timer is not enabled. */
54 }
** printf("\n");
55 return t_return_value;
56 }
BC
BC
. . .
BC
BEF
BC
. . .
BC
BEG
BC
. . .
BC
BDF
Figure 4-17: Rudimentary trace output file.
By compiling and executing this modified code, a trace file can be captured match-
ing that shown in Figure 4-17 providing the behavioral trace for the program. By
postprocessing this file using an awk or Perl script, it is possible to determine the number of unique paths executed through the function as well as their occurrence counts, as is shown in Table 4.4. Notice that the actual path counts observed during limited testing are significantly different from the theoretical path count. One drawback to this method is that trace logging adds significant overhead to program execution.
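As an illustration of this postprocessing step, the short C program below tallies the occurrence count of each unique trace line in a captured trace file. The file name trace.txt and the fixed table sizes are assumptions of this sketch; a few lines of awk or Perl would serve equally well.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_PATHS 64

int main(void)
{
    char line[128];
    char paths[MAX_PATHS][128];   /* distinct paths seen so far */
    int  counts[MAX_PATHS] = {0};
    int  n_paths = 0;

    FILE *fp = fopen("trace.txt", "r");
    if (fp == NULL) {
        perror("trace.txt");
        return EXIT_FAILURE;
    }
    while (fgets(line, sizeof(line), fp) != NULL) {
        line[strcspn(line, "\r\n")] = '\0';    /* strip the newline */
        if (line[0] == '\0') {
            continue;                          /* skip blank lines */
        }
        int i;
        for (i = 0; i < n_paths; i++) {
            if (strcmp(paths[i], line) == 0) {
                break;                         /* path already seen */
            }
        }
        if (i == n_paths) {
            if (n_paths == MAX_PATHS) {
                continue;                      /* table full; ignore */
            }
            strcpy(paths[n_paths++], line);
        }
        counts[i]++;
    }
    fclose(fp);

    for (int i = 0; i < n_paths; i++) {
        printf("%-8s %d\n", paths[i], counts[i]);
    }
    return EXIT_SUCCESS;
}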
Through the use of the GNU debugging program (GDB), it is possible to obtain
the same path coverage information without modifying the original source code. To accomplish this, a debug script is created which defines a breakpoint for each block. When a breakpoint is reached, the corresponding block letter is printed and then the program continues execution. This will generate the same format output as the print method applied previously. However, this method requires no changes to the source code. The script used for this is shown in Figure 4-18. One disadvantage of this approach is that it is difficult to log when block A is encountered, as Block A does not contain any executable code upon which a breakpoint can be placed.
Figure 4-18: gdb script for generating path coverage output trace.
In the event that the breakpoint method described above is inappropriate for ob-
taining the execution trace, there are several other methods that can be used. In
certain applications, it is not feasible for the debugger to interrupt the program’s
execution. Delays introduced by a debugger might cause the program to change its
behavior drastically, or perhaps fail, even when the code itself is correct. In this
situation, GDB supports a feature referred to as tracepoints. Using GDB's trace
and collect commands, you can specify locations in the program, called tracepoints,
and arbitrary expressions to evaluate when those tracepoints are reached. The tra-
cepoint facility can only be used with remote targets. As a final (and extremely
difficult) method, a logic analyzer can be connected to the address bus, so long as the
microprocessor on the system has an accessible address bus. By setting the trigger-
ing systems appropriately, the logic analyzer can trigger on the instruction addresses
desired, and by storing this information in a logic buffer, a coverage trace can be cre-
ated. This method, however, is by far the most difficult method for obtaining path
coverage.
Chapter 5
Chapter 3 provided a survey of currently existing static analysis tools, an overview of their capabilities, and their availability. However, as the goal of our
research is to use static analysis to estimate the software reliability of existing software
packages, it is imperative that the real-world detection capabilities for existing static
analysis tools be investigated. As with all fields of software engineering, static analysis
tools are constantly evolving, incorporating new features and detection capabilities,
For all of the advantages of static analysis tools, there have been very few indepen-
dent comparison studies of Java static analysis tools. Rutar et al. [RAF04] compares
the results of using Findbugs, JLint, and PMD tools on Java source code. This study,
however, is somewhat flawed in that it only looks at open source tools and does not investigate the performance of commercial tools. Second, while the study itself was thorough in comparing warning overlap, it did not directly assess the capabilities of each tool when it comes to detecting statically detectable faults.
¹Portions of this chapter appeared in Schilling and Alam [SA07b].
Forristal[For05] compares 12 commercial and open source tools for effectiveness, but
the analysis is based only on security aspects and security scanners, not the broader
range of static analysis tools available. Furthermore, while the study included tools
which tested for Java faults, the majority of the study was aimed at C and C++
analysis tools, so it is unclear how applicable the results are to the Java programming
language.
Thus, a pivotal need for the success of our modeling is to experimentally determine the real-world detection capabilities of available tools. To accomplish this, a validation suite was created which would allow an assessment of the effectiveness of multiple static analysis tools when applied against a standard set of source code modules. The basic process consisted of:
1. Determining the scope of faults that would be included within the validation suite.
2. Developing source code modules exhibiting the selected faults.
3. Executing each static analysis tool against the validation suite.
4. Combining the tool results using the SoSART tool developed for this purpose.
Altogether, the validation package consisted of approximately 1200 lines of code,
broken down into small segments which demonstrated a single fault. Injected faults
were broken into eight different categories based upon those mistakes which commonly occur during software development.
Aliasing errors represented errors which occurred due to multiple references alias-
ing to a single variable instance. Array out of bounds errors were designed to check
that static analysis tools were capable of detecting instances in which an array refer-
ence falls outside the valid boundaries of the given array. Several different mechanisms
for indexing outside of an array were tested, including off-by-one errors when iter-
ating through a loop and fixed errors in which a definitive reference which is out of
the range of the array occurs. One test case involved de-referencing of a zero length
array, and one test involved the out of range reference into a two dimensional array.
Deadlocks and synchronization errors were tested using standard examples of code
which suffered from deadlocks, livelocks, and other synchronization issues. An at-
tempt to locate infinite loops using static analysis also occurred. Several examples of
commonly injected infinite loop scenarios were developed based upon PSP historical
data. One example used a case of a Vacuous Truth for the while condition, whereas
others tested the case in which a local variable is used to calculate the index value
yet the value never changes once inside of the loop construct.
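The validation suite itself was written in Java; the two patterns just described are sketched below in C, the language used for the other listings in this dissertation (the function names are illustrative only):

#include <stdint.h>

void vacuous_truth_loop(void)
{
    uint8_t i;
    /* Vacuous truth: an unsigned value is always >= 0, so this loop
     * condition can never become false and the loop never terminates. */
    for (i = 10u; i >= 0u; i--) {
        /* ... */
    }
}

void stagnant_index_loop(int p_limit)
{
    int index = 0;
    int step = 0;   /* intended to be recalculated inside the loop */
    while (index < p_limit) {
        /* step is never updated once inside the loop, so for any
         * positive p_limit the index never advances. */
        index += step;
    }
}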
In the area of logic, there were six subareas which had test cases developed.
Case statement tests were designed to validate that case statements missing break
statements were detected, impossible case statements (i.e. case statements whose
values could not be generated) were detected, and dead code within case statements
was detected. Operator precedence tests exercised the ability of the static analysis tool to detect code which exhibited problems with operator precedence and might produce outcomes differing from the desired outcomes. Logic conditions
which always evaluate in the same manner were also tested, as well as incorrect string
comparison uses.
Mathematical analysis consisted of two major areas, namely division by zero de-
tection and numerical overflow and underflow. Division by zero was tested for both
integer numbers and floating point numbers. Null variable dereferences were tested
using a set of logic which resulted in a null variable reference being dereferenced.
Lastly, a set of test conditions verified the ability of the static analysis tools to detect
uninitialized variables.
Code was developed in Eclipse and, except in cases where explicit faults were desired, was clean of all warnings at compilation time. In the cases where the Eclipse tool or the Java compiler issued a warning indicating an errant construct, this information was logged for future comparison. In many cases, it was found that even though a fault had been deliberately seeded, no compilation warning was issued.
Once the analysis suite had been developed, it was placed under configuration
management and archived in a CVS repository. The suite was re-reviewed in a PSP
style review, specifically looking for faults within the test suite as well as other im-
provements that could be made. While the initial analysis only included nine tools, additional tools were incorporated as the research progressed.
Following the development of the validation files, an automated process for exe-
cuting the analysis tools was developed. This ensured that the analysis of this suite
(as well as subsequent file sets) could occur in an automated and uniform manner.
An Apache Ant build file was created which automatically invoked the static analy-
sis tools. The output from the analysis tools was then combined using the SoSART static analysis meta data tool, which also provides a visual environment for reviewing the results.
When running the static analysis tools, each tool was run with all warnings en-
abled. While this maximized the number of warnings generated and resulted in a
significant number of false positives and other nuisance warnings being generated,
this also maximized the potential for each tool to detect the seeded faults.
The experiment consisted of analyzing our validation suite using ten different Java
static analysis tools. Five of the tools used were open source or other readily available
static analysis tools. The other five tools included in this experiment represented
commercially available static analysis tools. Due to licensing and other contractual issues with the commercial tools, the results have been obfuscated and all the tools are referred to only by number.
In analyzing the results, three fundamental pieces of data from each tool were
sought. The first goal was to determine if the tool itself issued a warning which
would lead a trained software engineer to detect the injected fault within the source
code. This was the first and most significant objective, for if the tools are not able to
detect real-world fault examples, then the tools will be of little benefit to practitioners.
However, beyond detecting injected faults, we were also interested in the other faults
that the tool found within our source code. These faults can be categorized into two
categories, valid faults which may pose a reliability issue for the source code and false
positive warnings. By definition, false positive warnings encompass both faults which
can not lead to failure given the implementation, as well as faults which detect a
problem which is unrelated to a potential failure. Using this definition, all stylistic
warnings are considered to be invalid, for a stylistic warning by its very nature can not directly lead to a failure.
The first objective of this experiment was to determine which of the tools actually detected the injected faults. This was accomplished by reviewing the tool outputs in
SOSART and designating those tools which successfully detected the injected static
Table 5.2: Summary of fault detection. A 1 indicates the tool detected the injected fault.
Fault Count Eclipse 1 2 3 4 5 6 7 8 9 10
Array Out of Bounds 1 1 0 0 0 0 0 0 0 0 0 1 0
Array Out of Bounds 2 0 0 0 0 0 0 0 0 0 0 0 0
Array Out of Bounds 3 2 0 1 0 0 0 0 0 0 0 1 0
Array Out of Bounds 4 3 0 1 1 0 0 0 0 0 0 1 0
Deadlock 1 2 0 1 0 0 0 0 1 0 0 0 0
Deadlock 2 3 0 1 0 0 1 0 1 0 0 0 0
Deadlock 3 1 0 1 0 0 0 0 0 0 0 0 0
Infinite Loop 1 3 0 1 0 0 0 0 1 1 0 0 0
Infinite Loop 2 1 0 1 0 0 0 0 0 0 0 0 0
Infinite Loop 3 2 0 1 1 0 0 0 0 0 0 0 0
Infinite Loop 4 2 0 1 0 0 1 0 0 0 0 0 0
Infinite Loop 5 0 0 0 0 0 0 0 0 0 0 0 0
Infinite Loop 6 1 0 0 0 0 0 0 0 0 0 1 0
Infinite Loop 7 2 0 1 0 0 0 0 0 0 0 1 0
logic 1 3 0 1 0 0 0 0 1 1 0 0 0
logic 2 2 1 1 0 0 0 0 0 0 0 0 0
logic 3 0 0 0 0 0 0 0 0 0 0 0 0
logic 4 1 0 1 0 0 0 0 0 0 0 0 0
logic 5 1 0 1 0 0 0 0 0 0 0 0 0
logic 6 1 0 1 0 0 0 0 0 0 0 0 0
logic 7 1 0 1 0 0 0 0 0 0 0 0 0
logic 8 0 0 0 0 0 0 0 0 0 0 0 0
logic 9 0 0 0 0 0 0 0 0 0 0 0 0
logic 10 0 0 0 0 0 0 0 0 0 0 0 0
logic 11 1 0 0 0 0 0 0 0 0 0 0 1
logic 12 2 0 1 0 0 0 0 0 0 0 1 0
logic 13 2 0 1 0 0 0 0 0 0 0 1 0
logic 14 4 0 1 0 0 0 0 0 1 1 0 1
logic 15 3 0 1 0 0 0 0 0 1 0 0 1
Math 1 2 0 0 1 0 0 0 0 0 0 1 0
Math 2 2 0 0 1 0 0 0 0 0 0 1 0
Math 3 1 0 0 0 0 0 0 0 0 0 1 0
Math 4 0 0 0 0 0 0 0 0 0 0 0 0
Math 5 3 0 0 0 0 0 0 1 1 0 0 1
Math 6 3 0 0 0 0 0 0 1 1 0 0 1
Math 7 3 0 0 0 0 0 0 1 1 0 0 1
Math 8 4 0 0 0 0 1 0 1 1 0 0 1
Math 9 0 0 0 0 0 0 0 0 0 0 0 0
Math 10 1 0 0 0 0 0 0 0 0 0 1 0
Math 11 1 0 0 0 0 0 0 0 0 0 1 0
Math 12 1 0 0 0 0 0 0 0 0 0 1 0
Math 13 1 0 0 0 0 0 0 0 0 0 1 0
Math 14 1 0 0 0 0 0 0 0 0 0 1 0
Math 15 1 0 0 0 0 0 0 0 0 0 1 0
Math 16 1 0 0 0 0 0 0 0 0 0 1 0
Null Dereferences 1 3 0 1 1 0 1 0 0 0 0 0 0
Null Dereferences 2 0 0 0 0 0 0 0 0 0 0 0 0
Null Dereferences 3 1 0 0 0 0 1 0 0 0 0 0 0
Uninitialized Variable 1 1 1 0 0 0 0 0 0 0 0 0 0
Uninitialized Variable 2 2 1 0 0 0 0 0 1 0 0 0 0
Total Detected 3 21 5 0 5 0 9 8 1 17 7
Percent Detected 6% 42% 10% 0% 10% 0% 18% 16% 2% 34% 14%
faults. These results are shown in Table 5.2. In each column, a 1 is present if the
given tool detected the fault. A 0 is present if the tool did not provide a meaningful
warning which would indicate the presence of a fault within the source code.
Beyond determining which tools detected the faults, we were also interested in
knowing which faults were detected by multiple tools. Rutar et al.[RAF04] indicated
that they found little overlap between tools in their research. We wanted to see if
this held true for our results. The simplest way of accomplishing this was to count
the number of tools which detected a given fault. This result is shown in the count
column of Table 5.3. In summary, of the 50 statically detectable faults present within
the validation suite, 22 of them, or 44%, were detected by two or more tools. To quantify the degree of overlap, the correlation between tool results was calculated. Perfect correlation between tools, in which case
the tools detected exactly the same set of faults, would be represented by a value of
1. Perfect inverse correlation, in which every fault that was detected by the first tool is not detected by
the second tool and every fault which is not detected by the first tool was detected
by the second tool would be represented as a -1 value. While the results of Table
5.4 do show some correlation between tools, the observed correlations were uniformly weak.
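The text does not specify the exact statistic behind the tool-correlation tables; one natural reading, sketched below in C, is a Pearson correlation computed over the binary detection vectors of two tools (+1 when the tools detect exactly the same faults, -1 when their detections are perfectly complementary):

#include <math.h>

/* Pearson correlation between two binary detection vectors, where
 * element i is 1 if the tool detected injected fault i and 0 otherwise. */
double detection_correlation(const int *a, const int *b, int n)
{
    double sum_a = 0.0, sum_b = 0.0, sum_ab = 0.0;
    double sum_a2 = 0.0, sum_b2 = 0.0;
    for (int i = 0; i < n; i++) {
        sum_a  += a[i];
        sum_b  += b[i];
        sum_ab += (double)a[i] * b[i];
        sum_a2 += (double)a[i] * a[i];
        sum_b2 += (double)b[i] * b[i];
    }
    double num = n * sum_ab - sum_a * sum_b;
    double den = sqrt(n * sum_a2 - sum_a * sum_a) *
                 sqrt(n * sum_b2 - sum_b * sum_b);
    /* A tool which detects nothing (or everything) has zero variance;
     * the correlation is then undefined, shown as N/A in the tables. */
    return (den == 0.0) ? 0.0 : num / den;
}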
Table 5.5: Static Analysis Tool False Positive and Stylistic Rule Detections
Fault Eclipse 1 2 3 4 5 6 7 8 9 10
Aliasing Error 0 0 2 13 0 5 20 0 41 0 0
Array Out of Bounds 1 0 0 0 0 0 0 9 0 7 0 0
Array Out of Bounds 2 0 0 0 2 0 0 7 0 9 0 0
Array Out of Bounds 3 0 0 0 1 0 0 27 0 24 0 0
Array Out of Bounds 4 0 0 0 1 0 0 7 0 5 0 0
Deadlock 1 0 0 2 1 0 0 5 1 5 0 0
Deadlock 2 0 0 0 2 0 0 11 0 11 0 0
Deadlock 3 0 0 0 4 0 0 0 0 10 0 0
Infinite Loop 1 0 0 0 1 0 1 11 0 13 0 0
Infinite Loop 2 0 1 3 1 0 0 13 0 10 0 0
Infinite Loop 3 0 1 2 2 0 0 15 0 11 0 0
Infinite Loop 4 0 0 0 2 0 0 8 0 8 0 0
Infinite Loop 5 0 0 0 0 0 0 7 0 7 0 0
Infinite Loop 6 0 0 0 0 0 0 7 0 7 0 0
Infinite Loop 7 0 0 0 2 0 0 11 0 9 0 0
logic 1 0 1 0 4 0 0 29 0 31 0 0
logic 2 0 0 0 4 0 0 19 0 14 0 0
logic 3 0 0 0 0 0 0 9 0 9 0 0
logic 4 0 0 0 4 0 0 19 0 15 0 0
logic 5 0 0 0 1 0 0 4 0 5 0 0
logic 6 0 0 0 5 0 0 4 0 5 0 0
logic 7 0 0 0 5 0 0 3 0 4 0 0
logic 8 0 1 0 4 0 0 8 0 7 0 0
logic 9 0 0 0 5 0 0 8 0 7 0 0
logic 10 0 0 0 4 0 0 10 0 7 0 0
logic 11 0 0 0 4 0 0 6 0 5 0 0
logic 12 0 0 0 3 0 0 8 0 8 0 0
logic 13 0 0 0 3 0 0 10 0 11 0 0
logic 14 0 0 0 4 0 0 9 0 6 0 0
logic 15 0 0 0 4 0 0 3 0 4 0 0
Math 1 0 0 0 0 0 0 7 0 5 0 0
Math 2 0 0 0 0 0 0 6 0 6 0 0
Math 3 0 0 0 0 0 0 8 0 5 0 0
Math 4 0 0 0 0 0 0 6 0 6 1 0
Math 5 0 0 0 2 0 0 9 0 8 0 0
Math 6 0 0 0 5 0 0 9 0 9 0 0
Math 7 0 0 0 2 0 0 10 0 9 0 0
Math 8 0 0 0 1 0 0 10 0 11 0 0
Math 9 0 0 0 1 0 0 5 0 3 0 0
Math 10 0 0 0 1 0 0 2 0 1 0 0
Math 11 0 0 0 1 0 0 11 0 8 1 0
Math 12 0 0 0 2 0 0 7 0 5 0 0
Math 13 0 0 0 1 0 0 12 0 9 0 0
Math 14 0 0 0 2 0 0 8 0 6 0 0
Math 15 0 0 0 0 0 0 13 0 8 0 0
Math 16 0 0 0 0 0 0 7 0 6 0 0
Null Dereferences 1 0 0 1 3 0 1 0 0 20 1 0
Null Dereferences 2 0 0 1 3 0 0 13 0 11 0 1
Null Dereferences 3 0 0 2 3 0 0 21 0 16 0 0
uninitialized variables 1 0 0 0 0 0 0 8 0 8 0 0
uninitialized variables 2 0 0 0 2 0 0 6 0 5 0 0
uninitialized variables 3 0 0 0 2 0 0 4 0 4 0 0
Total 0 4 13 117 0 7 489 1 484 3 1
As was stated previously, when executing the static analysis tools, each and every
rule was enabled for the static analysis tools. As would be expected, this method generated a significant number of false positive and stylistic warnings which had to be reviewed before the valid warnings could be addressed. Our purpose for this analysis, however,
was to attempt to understand the relationship between false positives and the overall
detection of faults. Table 5.5 provides raw information relating to the false positives
and stylistic warning issued by each of the tools during our analysis.
Table 5.6: Correlation between false positive and stylistic rule detections
Tool 1 Tool 2 Tool 3 Tool 4 Tool 5 Tool 6 Tool 7 Tool 8 Tool 9 Tool 10
Eclipse N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
Tool 1 1 0.43 0.07 N/A -0.05 0.34 -0.04 0.23 -0.07 0.49
Tool 2 1 0.23 N/A 0.37 0.25 0.36 0.36 0.03 -0.05
Tool 3 1 N/A 0.66 0.22 -0.08 0.55 -0.1 0.11
Tool 4 N/A N/A N/A N/A N/A N/A N/A
Tool 5 1 0.21 -0.03 0.69 0.07 -0.03
Tool 6 1 -0.11 0.73 -0.16 -0.03
Tool 7 1 -0.09 -0.03 -0.02
Tool 8 1 0.07 -0.05
Tool 9 1 -0.03
Tool 10 1
From the raw data, two tools, Tools 6 and 8, seemed to have an extremely high rate of false positives and stylistic warnings, with totals of 489 and 484 instances respectively. Tool 3 was also somewhat high, with 117 false positive and stylistic detections. This observation, coupled with anecdotal evidence collected during the analysis, led to a closer examination of the individual warnings. As is shown in Table 5.7, a significant portion of the false positives were generated by two warnings, Tool 8 Rule 1, with 399 instances, and Tool 6 Rule 4 with 375 instances. Together, these two warnings constituted nearly two-thirds of the false positive warnings observed. Both rules flagged a stylistic violation of combining spaces and tabs together within source code, which by the definition of our experiment was a false positive because it could not directly lead to a failure.
This study indicates that there is a large variance in the detection capabilities
between tools. The ability of a given tool to detect an injected fault varied between
0% and 42%. This does not mean that the tools that did not detect our injected
faults were ineffective. Rather, it simply means that they were not appropriate tools for detecting the classes of faults seeded in this experiment.
This experiment also showed a significant correlation between false positive warn-
ings. However, it was also found that a significant portion of the false positive de-
tections came from two rules which essentially detected the same conditions. It is
believed that through proper configuration of the tools and filtration of the rules de-
tected, it may be possible to significantly diminish the impact of the false positive
problem.
Based on our results, a correlation between tools and the faults which they de-
tect was observed. 44% of the injected faults were detected by two or more static
analysis tools. This seems to run contrary to the results of the Rutar experi-
ment, in which no significant correlation was found. The experiment reported here,
however, used a greater variety of tools, and more importantly, included commer-
cially developed tools which may be more capable than the open source tools used
in their experiment. Furthermore, the methods were slightly different, in that their
method involved starting with existing large projects and applying the tool to them
whereas our experiment used a smaller validation suite to test fault discovery. We
can conclude by our experiment that it is possible to use multiple independent tools
and correlation to help reduce the impact of false positive detections when running the analysis. It is likewise necessary to use multiple static analysis tools in order to effectively detect all statically detectable
faults. Each tool tested appeared to detect a varying subset of injected faults, and
while there was an overlap between tools, we are not yet able to effectively characterize the extent of that overlap in advance.
As a last observation, even though every attempt had been taken to control the
source code in a stylistic manner, including using Eclipse and its built-in style formatting tools, style warnings detected by the tools were still in great abundance
during the experiment. These false positives, as has been noted by Hatton[Hat07],
reduce the signal to noise ratio of the analysis. Stylistic warnings by their very nature
can not directly lead to program failure and therefore need to be carefully filtered from the results.
Table 5.7: Percentage of warnings detected as valid based upon tool and warning.
Tool Warning # valid # invalid % valid Tool Warning # valid # invalid % valid
1 1 1 0 100 6 1 18 0 100
1 2 1 0 100 6 2 1 0 100
1 3 1 0 100 6 3 1 0 100
1 4 1 0 100 6 4 0 375 0
1 5 1 0 100 6 5 0 2 0
1 6 1 0 100 6 6 1 0 100
1 7 2 0 100 6 7 28 0 100
1 8 0 3 0 6 8 1 0 100
1 9 2 0 100 6 9 0 11 0
1 10 3 0 100 6 10 0 20 0
1 11 4 0 100 6 11 2 0 100
1 12 8 0 100 6 12 4 3 57.14
1 13 3 0 100 6 13 3 20 13.04
1 14 1 0 100 6 14 2 3 40
1 15 2 0 100 6 15 1 0 100
1 16 2 0 100 6 16 1 3 25
2 1 0 1 0 6 17 2 0 100
2 2 2 0 100 6 18 1 0 100
2 3 4 0 100 6 19 0 21 0
2 4 5 1 83.33 6 20 0 1 0
2 5 3 8 27.27 6 21 0 8 0
2 6 1 6 14.29 6 22 1 0 100
3 1 0 44 0 6 23 0 7 0
3 2 0 1 0 6 24 0 4 0
3 3 0 4 0 6 25 0 3 0
3 4 1 1 50 6 26 4 0 100
3 5 0 1 0 6 27 0 2 0
3 6 2 40 4.76 6 28 51 1 98.08
3 7 0 4 0 6 29 0 3 0
3 8 1 0 100 6 30 0 1 0
3 9 0 2 0 6 31 3 0 100
3 10 1 0 100 6 32 0 3 0
3 11 5 0 100 6 33 1 0 100
3 12 0 1 0 6 34 3 0 100
3 13 0 2 0 7 1 2 0 100
3 14 0 1 0 7 2 4 0 100
3 15 3 18 14.29 7 3 2 0 100
4 1 1 0 100 7 4 0 1 0
4 2 1 0 100 7 5 0 1 0
4 3 1 0 100 7 6 1 0 100
4 4 1 0 100 7 7 2 0 100
4 5 1 0 100 8 1 0 399 0
4 6 1 0 100 8 2 1 0 100
4 7 1 0 100 8 3 1 0 100
4 8 1 0 100 8 4 0 3 0
5 1 0 1 0 8 5 0 26 0
5 2 0 2 0 8 6 1 42 2.33
10 1 1 0 100 8 7 0 1 0
10 2 1 0 100 8 8 1 0 100
10 3 1 0 100 8 9 0 16 0
10 4 0 1 0 8 10 0 1 0
10 5 4 0 100 8 11 1 0 100
10 6 1 0 100 10 9 1 0 100
10 7 3 0 100 10 10 1 0 100
10 8 4 0 100 10 11 2 0 100
Summary: 84 142 37.16 149 981 13.18
Overall 233 1123 17.18
Chapter 6
Bayesian Belief Networks (BBNs) are powerful tools which have been found to be
useful for numerous applications when general behavioral trends are known but the
data being analyzed is uncertain or incomplete. BBNs, through their usage of causal directed acyclic graphs, offer an intuitive visual representation for expert opinions yet retain a rigorous mathematical foundation.
Bayesian Belief networks have found wide acceptance within the medical field and
other areas. Haddaway et al. [Had99] provides an overview of both the existing BBN
software packages that are available as well as an extensive analysis of projects which
have successfully used BBNs. Within the software engineering field, many different
projects have used Bayesian Belief Networks to solve common software engineering
problems. Laskey et al.[LAW+ 04] discusses the usage of Bayesian Belief Networks for
the analysis of computer security. The quality of software architectures has been assessed using a BBN by van Gurp and Bosch[vGB99] and Neil and Fenton[NF96]. Software reliability has been assessed using Bayesian Belief Networks by Gran and others.
The fundamental premise behind this model is that the resulting software relia-
bility can be related to the number of statically detectable faults present within the
source code, the paths which lead to the execution of the statically detectable faults,
and the rate of execution of each path within the software package.
To model reliability, the source code is first divided into a set of methods or
functions. Each method or function is then divided further into a set of statement
blocks and decisions. A statement block represents a contiguous set of source code statements containing no decision points. In this manner, the source code is translated into a set of blocks connected by decisions. Statically detectable faults are then assigned to the appropriate block based upon their location in the code.
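A hypothetical C sketch of the data structures such a decomposition implies is shown below; the names and fields are illustrative assumptions, not the actual SOSART implementation:

typedef struct decision decision_t;

/* A statement block: a contiguous run of statements with no decisions. */
typedef struct block {
    int         id;
    int         fault_count;       /* statically detectable faults mapped here */
    double      exec_probability;  /* estimated from captured execution traces */
    decision_t *exit_decision;     /* NULL when the block falls through */
} block_t;

/* A decision connects blocks through its true and false outcomes. */
struct decision {
    block_t *on_true;
    block_t *on_false;
};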
Once the source code has been decomposed into blocks, the output from the ap-
propriate static analysis tools is linked into the decomposed source code. In order
to predict the reliability, the probability of execution for each branch must be determined. These probabilities are estimated from the actual program paths observed through execution trace capture during limited testing. The testing consists of a set of black box tests or functional tests which are observed at the white box level. For each method, a reliability for each block is assigned based upon the output of a Bayesian Belief network relating reliability to the statically detectable faults, code coverage during limited testing, and the code execution rates.
6.2.1 Overview
In order to accurately assess both the validity of a statically detected fault as well as the likelihood of its manifestation, a Bayesian Belief network has been developed. This Bayesian Belief network, shown in Figure 6-1, incorporates historical data as well as program execution traces to predict the probability that a given statically detectable fault will manifest itself as a failure.
The Bayesian Belief network can effectively be divided into three main segments.
The upper left half of the Bayesian Belief Network handles attributes related to the
validity and risk associated with a statically detected fault. The upper right half of
the Bayesian Belief Network assesses the probability that a given fault is exposed
through program execution. The bottom segment combines the results and provides
an overall estimate of the reliability for the given statically detectable fault.
Within the network, each input parameter is converted into a discrete state value. Based upon the work of Neil and Fenton [NF96], the majority of the variables are assigned the states of “Very High”, “High”, “Medium”, “Low”, and “Very Low”. In certain cases, an optional state of “None” exists; this state indicates that the attribute does not apply to the given fault.
Figure 6-1: Bayesian Belief Network relating statically detectable faults, code coverage
during limited testing, and the resulting net software reliability.
By definition, all static analysis tools have the capability of generating false pos-
itives. Some tools have a low false positive rate, while other tools have a high false
positive rate. Determining the validity of the fault is therefore the first required step
in assessing whether or not a statically detectable fault will cause a failure. Once the validity of a given statically detectable fault has been assessed, the probability of the fault manifesting itself as a failure can then be assessed. Thus, the upper left segment of the reliability network concerns itself with
determining the likelihood that a given statically detectable fault is either valid or a
false positive and assessing the fault risk assuming it is a valid statically detectable
fault.
Each detected fault type naturally has a raw false positive rate based upon the
algorithms and implementation. Certain static analysis fault types are nearly always false positives. Therefore, the states of “Very High”, “High”, “Medium”,
and “Very Low” have been selected to represent the false positive rate for the given
statically detectable fault. The model itself does not prescribe a specific translation
from percentages into state values, as this translation may change with the domain.
However, it is expected that this value will be collected from historical analysis of
previous projects.
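As an illustration, the translation from a historical false positive percentage into these discrete states might be implemented as follows. This is a minimal sketch: the threshold values are illustrative assumptions, since the model deliberately leaves the translation domain-specific.

/** Illustrative translation of a historical false positive rate into the
 *  discrete BBN states; the thresholds are assumptions, not model-mandated. */
public final class FalsePositiveState {
    public static String toState(double falsePositiveRate) {
        if (falsePositiveRate >= 0.80) return "Very High";
        if (falsePositiveRate >= 0.60) return "High";
        if (falsePositiveRate >= 0.40) return "Medium";
        if (falsePositiveRate >= 0.20) return "Low";
        return "Very Low";
    }

    public static void main(String[] args) {
        // e.g. a warning type observed to be a false positive 72% of the time
        System.out.println(toState(0.72));   // prints "High"
    }
}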
The validity of a static analysis fault is also impacted by the clustering of faults, as there is a relationship between fault locality and either valid or invalid faults. The rationale behind this clustering is that
programmers tend to make the same mistakes, and these mistakes will tend to be
localized at the method, class, file, or package level. However, clustering can also
occur if a tool enters into a run-away state and generates a significant number of
false positives. The clustering states can therefore be represented as “Valid Cluster Present”, “Invalid Cluster Present”, “No Cluster Present”, and “Unknown Cluster Present”. Statically detectable faults which are part of a cluster will be initialized to the state value “Unknown Cluster Present”, indicating that the cluster has not yet been shown to be either valid or
invalid. However, as the Software Engineer inspects the results and observes static
analysis faults to be valid or invalid, the cluster will shift to the appropriate states of
“Valid Cluster Present” or “Invalid Cluster Present” depending upon the results of
the inspection. This model recognizes two types of clustering: clustering at the file level and clustering at the method level.
Another input node contains information on whether the static analysis fault has
been correlated by a fault in a second tool at the same location. The usage of multiple
static analysis tools allows an extra degree of confidence that the detected static
analysis fault is valid, for if two independent tools have detected a comparable fault
at the same location, then there is a better chance that both faults are valid. This
statement assumes that the algorithms used to detect the fault are truly independent.
Even though Rutar et al.[RAF04] and Wagner[WJKT05] did not find a significant
overlap in rule checking capabilities between tools, these experiments only used a
limited set of static analysis tools. Since these articles were published, several new static analysis tools have been released. The results described in Chapter 5 and published in Schilling and Alam [SA07b] indicate that there is at least some form of correlation between tools when the faults represent the same taxonomical definition. The Independently Correlated state can have a value of “Yes” or “No” depending on whether the statically detectable fault has been independently detected by a second tool at the same location.
From these nodes, an overall summary node can be obtained, referred to as the fault validity node. This node can contain the states “Valid” or “False Positive”. When an analysis is first conducted, this node is estimated based upon the input states and their observed values. However, as the software engineer begins to inspect the static analysis warnings, the instances of this node will be observed to be either “Valid” or “False Positive”.
The immediate failure node represents whether the fault that has been detected is
likely to cause an immediate failure. A fault which, for example, indicates that a jump
to a null pointer may occur or that the return stack will be overwritten has a very
high probability of resulting in an immediate failure, and thus, this value will reflect
this case. However, a fault which is detected due to an operator precedence issue may
not be assigned as significant a value. Thus, for each statically detectable fault,
there is a potential that the given fault will result in a failure if the code is executed.
This node probability is directly defined by the characteristics of the fault detected,
and can be represented as “Very High”, “High”, “Medium”, “Low”, “Very Low”, and
“None”. In this case, the “None” state is reserved for static analysis warnings of a stylistic nature, such as the usage of tabs instead of spaces to indent code. While
these can be considered to be valid warnings, these faults can not directly lead to
program failure.
While there are certain statically detectable faults which do not directly lead to
a failure, there are cases in which a statically detectable fault may represent a fault
which will manifest itself through maintenance. For example, it is deemed to be good
coding practice to enclose all if, else, while, do, and other constructs within opening
and closing brackets. While not doing this does not directly lead to a failure, it can
lead to failures as maintenance occurs on the code segment. Thus, the maintenance
risk state can be set to “Yes” or “No”, indicating whether or not the given fault is a
maintenance risk. The maintenance risk only applies to those faults which are marked as valid.
These parameters all feed into the fault risk node. This node represents the
risk associated with a given fault and can take on the states “Very High”, “High”, “Medium”, “Low”, and “Very Low”.
In order for a fault to result in a failure, the fault itself must be executed in
a manner which will stimulate the fault to fail. Thus, the upper right half of the reliability Bayesian Belief Network deals with code coverage and the execution of the code block during program execution.
The first subnetwork assumes that the code block with the statically detectable
fault has executed during testing. If the code block has been executed, the probability
of a fault manifesting itself can be related to the number of discrete paths through the
block which have been executed versus the theoretical number of paths through the
block. A fault that is detected may only occur if certain conditions are present, and
these conditions may only be present if a certain execution path has been followed.
Even full path coverage is insufficient to guarantee that a fault will not manifest itself, for
the fault may be data dependent based upon a parameter passed into the method1.
There are four principal states that a code block with a static analysis warning can occupy: “Block Executed”, “Method Executed”, “Block Reachable”, and “Block Unreachable”. If a given code block has been executed, this means that at
least one path of execution has traversed the given code block during the testing
period, resulting in the node having the value “Block Executed”. A second state for
this node occurs when the method that contains the statically detectable fault has
been executed but the specific code block has not been executed by at least one path
1 Tracing the entire program state is the basis for automatic anomaly detection, as used in the DIDUCE [HL02] and AMPLE [DLZ05] tools.
through the method, resulting in the state “Method Executed”. This would indicate
that the state values for the class or the method parameters passed in have not been sufficient to cause the given code block to execute.
the statically detectable fault is reachable. For a Java method which has private scope
or a C method which has static visibility, this variable can only be “Block Reachable”
if there exists a direct call to the given method or function from within the scope of
the compilation unit. Otherwise, the method itself can not execute, and the value will
be “Block Unreachable”. For a Java method which has public or protected scope, or
a C method which has external linkage, it must be assumed that there is the potential
for the method to execute, and thus, by default, the node will have a value of “Block Reachable”.
The “Test Confidence” node serves to provide the capability to define the con-
fidence in the testing that has been used to obtain execution profiles. The testing
which is referred to reflects limited testing of the module for which the reliability is
being assessed. In the case of a new version of a software component delivered from a vendor, this would at best reflect black box testing of the interface or functional testing
of the module. However, through the usage of execution trace capture, a white box
view of the component and the paths taken is obtained. This parameter allows the
evaluating engineer to adjust their confidence in the testing results based upon the
expected usage of the module in the field. As the engineer performing the reliability
analysis has more confidence that the results match what will be seen by a produc-
tion module, this value will be increased, reflecting less variance between the observed
coverage and the actual field coverage. Less confidence would indicate that more of
the unexecuted paths within the module would be expected to execute in the field.
The Fault Exposure Potential relates the percentage of test paths covered and the test confidence. When the test confidence is lowest and the percentage of executed paths through the code block is lowest, this value will be highest. The value will be lowest if the test confidence is very high and the percentage of executed paths is also very high. In general, decreased test confidence will result in more variance in the estimated fault exposure.
The “Percent Paths Through Block Executed” node indicates what percentage of
the paths which pass through the code block containing the statically detectable fault
have been executed during testing. Based on an appropriate translation which scales
the number of paths through the code block relative to the percentages executed, this
node will have a value of “Very High”, “High”, “Medium”, “Low”, or “Very Low”.
The second subnetwork is based upon the premise that the code block containing
the statically detectable fault has not executed, but the containing method has exe-
cuted. In this case, the likelihood of this code block executing can be related to the distance to the nearest executed path.
The distance to the nearest path is measured in terms of the number of decisions between the given code block containing a statically detectable fault and the nearest executed path, with each decision carrying a probability that it will result in a given outcome. Thus, the number of decisions between the nearest executed path and the static analysis fault is effectively a measure of how likely the block is to execute. The states for this node are “Very Near”, in which the nearest executed path is only one decision away from the static analysis fault, “Near”, “Far”, and “Very Far”. “None” is a placeholder state which is used to indicate that the method itself has never been executed during program testing.
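The distance computation itself can be pictured as a breadth-first search over the block graph. The following is a hypothetical sketch: the class and method names are illustrative, the state thresholds are assumptions, and SOSART's actual implementation is not shown here.

import java.util.*;

/** Hypothetical sketch: classify the decision distance between a fault's
 *  block and the nearest executed block, per the states described above. */
public final class PathDistance {
    /** cfg: blockId -> neighboring blockIds reached via one decision */
    static int distanceInDecisions(Map<Integer, List<Integer>> cfg,
                                   int faultBlock, Set<Integer> executed) {
        Queue<Integer> queue = new ArrayDeque<>();
        Map<Integer, Integer> dist = new HashMap<>();
        queue.add(faultBlock);
        dist.put(faultBlock, 0);
        while (!queue.isEmpty()) {
            int b = queue.remove();
            if (executed.contains(b)) return dist.get(b);
            for (int n : cfg.getOrDefault(b, List.of()))
                if (dist.putIfAbsent(n, dist.get(b) + 1) == null) queue.add(n);
        }
        return -1;   // no executed block is reachable: method never executed
    }

    static String toState(int d) {
        if (d < 0) return "None";        // method not executed at all
        if (d <= 1) return "Very Near";  // one decision away
        if (d <= 2) return "Near";       // illustrative thresholds
        if (d <= 4) return "Far";
        return "Very Far";
    }
}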
The percentage of paths through code block node represents the percentage of paths through the method which pass through the code block containing the statically detectable fault. In the event that the method has not been executed, no assumption about the probability of any given path executing can be made. All paths must therefore be treated as equally likely, and this node reflects the percentage of paths which lead into this program block. This node can have the states “Very High”, “High”, “Medium”, “Low”, and “Very Low”.
The “Nearest Path Execution Percentage” node reflects the percentage of net execution paths which have gone through the nearest executed node. As this percentage increases, there is an indication that more of the execution paths through the code are within a few decisions of this code block. Since there are more execution paths nearby, the likelihood of reaching this code block is increased each time a nearby path is executed, for only a few decisions may be required to be different in order to reach this location. The Nearest Path Execution Percentage states are “Very High”, “High”, “Medium”, “Low”, and “Very Low”.
The Fault Execution Potential node represents the potential of a code block con-
taining a statically detectable fault to be executed. This value will be highest when the distance from the nearest path is lowest, the nearest execution path percentage is highest, and the test confidence is lowest. The value decreases as the distance increases, the nearby execution percentage decreases, and the test confidence increases.
The net fault exposure node is switched based upon whether the code block has
been executed or not. If the code has been executed, this value will mirror that of
the Fault Exposure Potential node. If the code has not been executed, then this
parameter will reflect that of the Fault Execution Potential. This node can have the
values of “Very High”, “High”, “Medium”, “Low”, “Very Low”, and “None”.
Reliability for the code block is determined by combining the Fault Risk and the
Net Fault Exposure nodes together to form the Estimated Reliability Node. As the
fault risk increases and the net fault exposure increases, the overall reliability for the
code block will decrease. The net reliability for the block can therefore be expressed as
“Perfect”, “Very High”, “High”, “Medium”, “Low”, and “Very Low”. By default, a
code block which has no statically detectable faults shall have an Estimated Reliability
of “Perfect”.
Net Reliability
The Fault Failed node will reflect whether or not the given fault has led to failure
during testing. Values for this node can be “Yes” or “No”, with the default value being “No”.
In the event that a statically detected fault has actually failed during the limited
testing period, the estimated values using the Bayesian Belief network are replaced
with actual reliability values from testing. The actual Tested Reliability node reflects
the observed reliability of the software as it is related to this specific fault. Values
which can occur include “Very High”, “High”, “Medium”, “Low”, and “Very Low”.
The net Reliability node serves as a switch between the Tested Reliability observed
if a failure occurs and the Estimated Reliability node. In the event that a failure has
occurred, the value here will represent that of the Tested Reliability node. Otherwise, the value will mirror that of the Estimated Reliability node.
The Calibrated Net Reliability node allows the user to calibrate the output of the basic network relative to the actual system being analyzed. In essence, this node allows the output probabilities to be adjusted in order that the appropriate final values are obtained. While this capability exists within the model, all testing thus far has used this node simply as a pass-through node in which no change to the output probabilities from the Net Reliability node occurs.
Determining the overall reliability for a code block is straight forward if there is
only a single statically detectable fault within the given code block. In this case, the
reliability of the code block would simply be the value output by the “Calibrated
Net Reliability” node of the Bayesian Belief Network. However, if there are multiple statically detectable faults within a single code block, their reliabilities must be combined.
In traditional reliability modeling with two independent faults, the probability of failure can be expressed as

P_f = P_f(F_1) + P_f(F_2) - P_f(F_1) \cdot P_f(F_2)

where P_f(F_1) represents the probability that the first fault will fail on any given execution and P_f(F_2) represents the probability that the second fault will fail. If the two faults are actually instances of the same underlying fault, they can not be treated as independent. In this case,

P_f = P_f(F_1) = P_f(F_2)
This core concept of independence must be translated into the Bayesian Belief Network model. Within this model, statically detected faults are grouped and referenced using the taxonomy defined previously, and independence is assumed at the level of the taxonomy. Thus, if two statically detectable faults are categorized into the same classification, it is assumed that they are multiple instances of the same core fault.
Figure 6-2: Simple Bayesian Belief Network combining two statically detectable faults.
If the faults are not of the same type, then an estimation of the combinatorial reliability is obtained by multiplying the reliabilities together for the two faults. Rather than performing this multiplication directly, this model uses another Bayesian Belief Network, as is shown in Figure 6-2.
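A minimal sketch of this combination rule follows. Treating two same-classification faults as one underlying fault by taking the minimum of their reliabilities is an assumption made here for illustration; independent faults multiply, as described above.

/** Sketch of the combination rule: independent faults multiply, while
 *  faults of the same taxonomy classification are treated as one underlying
 *  fault (here approximated by taking the minimum reliability). */
public final class CombineFaults {
    static double combine(double r1, double r2, boolean sameClassification) {
        return sameClassification ? Math.min(r1, r2) : r1 * r2;
    }

    public static void main(String[] args) {
        System.out.println(combine(0.999, 0.995, false)); // ~0.994
        System.out.println(combine(0.999, 0.995, true));  // 0.995
    }
}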
In this network, the nodes have the states shown in Table 6.2. In a basic system
with one fault present, the Previous Typical and Worst Case Reliability values will
be initialized to “Perfect” and the New Fault reliability value will be initialized to the
reliability value obtained by a single instance of the static analysis reliability network.
The Next Typical and Next Worst Case Reliabilities will be calculated by the network, which combines the reliabilities of the faults together. For each additional fault present within the system, there will be
one additional instance of this network, with the Previous Typical and Worst case
reliability values cascading from the previous instance of the network, as is shown in
Figure 6-3.
Method Reliability
Once the reliability has been obtained for each code block within the method, it
is possible to obtain the overall reliability for each method. To do this, another set of
instances of the combinatorial network defined in Figure 6-2 is created. There will be
one combinatorial network created for each code block within the method. Eventually,
this will result in a complete network similar to that shown in Figure 6-4. This figure
shows the combination of 4 statically detectable faults present on two different code
blocks, which represents a very simple network. The majority of analyzed networks are significantly larger.
Figure 6-3: Simple Bayesian Belief Network combining four statically detectable faults.
Figure 6-4: Method combinatorial network showing the determination of the reliability
for a network with two blocks and four statically detectable faults.
Chapter 7
7.1 Introduction
Markov Models have long been used in the study of systems reliability. As such,
they have also been applied to software reliability modeling. Publications by Musa
[MIO90], Lyu [Lyu95], Rook [Roo90], The Reliability Analysis Center [Cen96], Grot-
tke [Gro01], and Xie [Xie91], Gokhale and Trivedi[GT97] and Trivedi [Tri02] all in-
clude extensive discussion on the usage of Markov Models for the calculation of Soft-
ware Reliability. One of the most commonly used models is that proposed by Cheung [Che80], in which a Markov state is used to represent the execution of a software program. Each node of the model represents a program segment with a single point of entry and a single point of exit. The probability assigned to each edge represents the
probability that execution will follow the given path to the next node.
1
Portions of this chapter have appeared in Schilling [Sch07].
Figure 7-1 represents a basic program flow graph. In this case, there is one node
within the flow graph which makes a decision (S1 ), and two possible execution paths
based upon that decision, (S2 ) and (S3 ) respectively. Reaching state (S4 ) indicates
that program execution has completed successfully. The transitions t1,2 and t1,3 repre-
sent the probability that program execution will follow the path S1 → S2 and S1 → S3
respectively.
By constructing a matrix of the transition probabilities for the program, the average number of times that each state is visited can be obtained. Again
using the program flow exhibited in Figure 7-1, this matrix can be represented as

P = \begin{bmatrix} 0 & t_{1,2} & t_{1,3} & 0 \\ 0 & 0 & 0 & t_{2,4} \\ 0 & 0 & 0 & t_{3,4} \\ 0 & 0 & 0 & 1 \end{bmatrix} \qquad (7.1)

where t_{i,j} represents the probability that execution transitions from state S_i to state S_j.

The average number of times each statement set is executed can be calculated from the fundamental matrix of the Markov chain. A transition matrix in which the final state is an absorbing state, and all other states are transient, can be partitioned into the relationship

P = \begin{bmatrix} Q & C \\ O & 1 \end{bmatrix} \qquad (7.5)

where Q holds the transition probabilities between the transient states, C holds the transition probabilities from the transient states into the absorbing state, and O is a row of zeros. Raising this matrix to the k-th power yields

P^k = \begin{bmatrix} Q^k & C' \\ O & 1 \end{bmatrix} \qquad (7.6)

and the expected number of visits to each transient state is given by the fundamental matrix

M = (I - Q)^{-1} \qquad (7.7)
Returning to the initial problem, if one assigns the values t_{1,2} = 0.5 and t_{1,3} = 0.5, then

P = \begin{bmatrix} 0 & 0.5 & 0.5 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \end{bmatrix} \qquad (7.8)

Q = \begin{bmatrix} 0 & 0.5 & 0.5 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} \qquad (7.9)

M = (I - Q)^{-1} = \left( \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} - \begin{bmatrix} 0 & 0.5 & 0.5 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} \right)^{-1} = \begin{bmatrix} 1 & 0.5 & 0.5 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad (7.10)
From this information, it can be concluded that on average, for each execution of
the program, node S2 will be visited 0.5 times and node S3 will be visited 0.5 times.
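The computation of the fundamental matrix is straightforward to reproduce in code. The following sketch uses a hand-rolled Gauss-Jordan inversion, which is adequate for the small matrices in these examples.

/** Computes the fundamental matrix M = (I - Q)^-1 for the example above
 *  using Gauss-Jordan elimination with partial pivoting. */
public final class FundamentalMatrix {
    static double[][] inverse(double[][] a) {
        int n = a.length;
        double[][] aug = new double[n][2 * n];      // [A | I] augmented matrix
        for (int i = 0; i < n; i++) {
            System.arraycopy(a[i], 0, aug[i], 0, n);
            aug[i][n + i] = 1.0;
        }
        for (int col = 0; col < n; col++) {
            int pivot = col;                         // choose largest pivot
            for (int r = col + 1; r < n; r++)
                if (Math.abs(aug[r][col]) > Math.abs(aug[pivot][col])) pivot = r;
            double[] tmp = aug[col]; aug[col] = aug[pivot]; aug[pivot] = tmp;
            double p = aug[col][col];
            for (int c = 0; c < 2 * n; c++) aug[col][c] /= p;
            for (int r = 0; r < n; r++) {            // eliminate other rows
                if (r == col) continue;
                double f = aug[r][col];
                for (int c = 0; c < 2 * n; c++) aug[r][c] -= f * aug[col][c];
            }
        }
        double[][] inv = new double[n][n];
        for (int i = 0; i < n; i++) System.arraycopy(aug[i], n, inv[i], 0, n);
        return inv;
    }

    public static void main(String[] args) {
        double[][] q = {{0, 0.5, 0.5}, {0, 0, 0}, {0, 0, 0}};
        double[][] iMinusQ = new double[3][3];
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++)
                iMinusQ[i][j] = (i == j ? 1 : 0) - q[i][j];
        double[][] m = inverse(iMinusQ);             // first row: 1.0 0.5 0.5
        for (double[] row : m) System.out.println(java.util.Arrays.toString(row));
    }
}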
If the control flow is modified slightly, as is shown in Figure 7-2, the impact of a looping construct can be observed. With the loop present, the fundamental matrix becomes

M = \begin{bmatrix} 1 & 2.0 & 0.5 \\ 0 & 4 & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad (7.11)

indicating that on average node S_2 will be visited 2.0 times and node S_3 will be visited 0.5 times. Increasing the probability of repeating the loop yields

M = \begin{bmatrix} 1 & 50 & 0.5 \\ 0 & 4 & 0 \\ 0 & 0 & 1 \end{bmatrix} \qquad (7.12)

indicating that on average node S_2 will be visited 50 times and node S_3 will be visited 0.5 times.
This capability can then be used to calculate the reliability of the program using the relationships

R = \prod_j R_j^{V_j} \qquad (7.13)

\ln R = \sum_j V_j \ln R_j \qquad (7.14)

R = \exp\left( \sum_j V_j \ln R_j \right) \qquad (7.15)

where R_j represents the reliability of node j and V_j represents the expected number of visits to node j obtained from the fundamental matrix.
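Given the visit counts from M and the per-node reliabilities, Equation (7.15) can be evaluated directly. The reliability values in the following sketch are illustrative only.

/** Sketch of Equation (7.15): net reliability from per-node reliabilities
 *  R_j and expected visit counts V_j taken from the fundamental matrix. */
public final class CheungReliability {
    static double netReliability(double[] r, double[] visits) {
        double lnR = 0.0;
        for (int j = 0; j < r.length; j++)
            lnR += visits[j] * Math.log(r[j]);   // ln R = sum V_j ln R_j
        return Math.exp(lnR);
    }

    public static void main(String[] args) {
        // nodes S1..S4 of Figure 7-1 with t12 = t13 = 0.5 (visits from M)
        double[] r = {1.0, 0.999, 0.995, 1.0};
        double[] v = {1.0, 0.5, 0.5, 1.0};
        System.out.println(netReliability(r, v));   // ~0.997
    }
}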
Markov models pose two distinct problems as the number of nodes increases. First, the size of the transition matrix grows with the square of the number of nodes, resulting in extensive computation being necessary to compute the net reliability for the system.
Second, Markov Models also require accurate estimations for transition probabili-
ties and reliability values to be determined in order to construct the given model. In
many cases, it is not possible to accurately estimate these reliability values with an
appropriate degree of confidence in order for the Markov Model to be applied. What
is needed is a more general approach which, while providing reasonably accurate re-
sults, does not necessarily require the degree of precision necessary to use a Markov
model.
This chapter develops a Bayesian Belief Network which can be used to reliably predict the outcomes for a Markov Model. The specific model which is to be modeled using a BBN is the reliability
model proposed by Cheung[Che80]. The Cheung model relates the net reliability
of the system to three factors, namely the number of nodes within a program, the
reliability of each node within the program, and the frequency of execution for each
node.
The number of nodes is a function of the size of the software being analyzed. It is not uncommon for a real world software project to contain many thousands of nodes.
The reliability of each node represents the probability that when a given node
executes, execution will continue to completion without failure. Reliability can either be measured through testing or estimated.
The execution frequency for each node represents, on average, how many times the given node will be executed when the program is run. It can also represent the number of times a method is invoked per unit of time. This value can either be measured per program execution or per unit of time. In the first case, the program use case influences the results, which may result in a different execution frequency depending upon the execution environment. However, the second case provides a more stable long-run measure.
The basic BBN relating the reliability of two Markov model nodes is shown in Figure 7-3. The nodes Reliability A and Reliability B represent the reliability of the two program segments. Coverage A and Coverage B represent the average number of executions for the given node in one unit time period. The net reliability is directly related to the reliability of each of the two nodes as well as the execution frequency of each node.
In order to use this BBN, it is necessary that the continuous values for reliability and execution frequency be translated into discrete values which can then be further processed. Since reliability values are often quite high, usually .9 or higher, it is often more convenient to express the reliability in terms of the unreliability, defined as

U = 1 - R \qquad (7.16)

where R represents the reliability of the system and U represents the resulting unreliability of the system. As most systems typically have multiple nines within the reliability value, the U value will typically contain several leading zeros. It is therefore convenient to work with the logarithmic form

U = -1 \cdot \log_{10}(1 - R) \qquad (7.17)
As a general statement, the reliability for any properly tested system will be at least .99. Mission critical or safety critical avionics systems require failure rates of less than 10^-9 failures per hour of operation [Tha96]. Software, in general, by its very essence is typically limited to a minimum failure rate of 10^-4 failures per hour of operation. Rigorous development processes can improve this value, but even the best software typically has a minimum failure rate of 10^-5 failures per hour of operation [Tha96], or four orders of magnitude greater than that which is required for mission critical systems deployment. Based on this concept, the states shown in Table 7.1 have been defined for the Bayesian Belief Network.
Table 7.2: Bayesian Belief Network States Defined for Execution Rate

State Name   Abbreviation   V                    log10(V)
Very High    VH             31.6 ≤ V             1.5 ≤ log10(V)
High         H              3.16 ≤ V < 31.6      0.5 ≤ log10(V) < 1.5
Medium       M              0.316 ≤ V < 3.16     -0.5 ≤ log10(V) < 0.5
Low          L              0.0316 ≤ V < 0.316   -1.5 ≤ log10(V) < -0.5
Very Low     VL             V < 0.0316           log10(V) < -1.5
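A sketch of both translations follows: the U-domain transform of Equation (7.17) and the execution rate discretization of Table 7.2. The guard against a zero rate is an added assumption for numerical safety.

/** Sketch translating a reliability value into the log-scale U domain of
 *  Equation (7.17) and an execution rate into the Table 7.2 states. */
public final class StateMapping {
    static double uValue(double reliability) {
        return -Math.log10(1.0 - reliability);   // e.g. R = .999 -> U = 3
    }

    static String executionRateState(double v) {
        if (v <= 0) return "Very Low";           // guard for idle nodes
        double lg = Math.log10(v);
        if (lg >= 1.5) return "Very High";
        if (lg >= 0.5) return "High";
        if (lg >= -0.5) return "Medium";
        if (lg >= -1.5) return "Low";
        return "Very Low";
    }

    public static void main(String[] args) {
        System.out.println(uValue(0.999));          // 3.0 (three nines)
        System.out.println(executionRateState(10)); // "High"
    }
}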
Returning to the relationship

\ln R = \sum_j V_j \ln R_j \qquad (7.18)
it can be observed that the execution rate for the node is just as significant in the
net reliability value as the reliability of the nodes. A factor of ten difference in a
given Vj value will impact the net reliability by one order of magnitude. Because the
execution rate can vary significantly, and is not bounded by an upper or lower bound,
the values for the execution rate are best expressed in terms of the log10 value of the
execution rate. This behavior results in the state definitions shown in Table 7.2.
In order to validate the results of this model, an experiment was set up which
would use the existing Markov model to generate test cases. These test cases would
then be fed into the Bayesian Belief Network and evaluated against the expected results.
To accomplish this, a MatLab script was created which evaluated a Markov model
simulation using the Cheung[Che80] model and the program flow which is shown in
Figure 7-7. For simplicity, R1 and R4 were fixed at 1, indicating that there was no
probability of failure for the entry and exit nodes. R2 and R3 were independently
varied between the values of 0 and .999999 with a median value of .999. The param-
eters t3,2 , t2,3 , t3,4 , and t2,4 were also varied independently. Altogether, this resulted
in a total of 47730 test vectors being generated and the value ranges shown in Table
7.3.
To evaluate the accuracy of the Bayesian Belief Network, a Java application was
developed using the EBayes[Coz99] core. This application used the same input pa-
rameters that the MatLab script used. The outputs of the Bayesian Belief network
were then compared with the expected values from the Markov model, creating error
values. Comparisons with the Markov model were done in the U domain, as this
allowed an accurate assessment of error across all magnitudes. This resulted in the error values shown in Table 7.4.
While the raw error values are important and indicate that the average error is
less than .5, or one half of the resolution of the model, a more thorough analysis of
the error can be obtained by looking at the number of test instances and the error
Table 7.4: Differences between the Markov Model reliability values and the BBN Predicted
Values
Average 0.4156
Median 0.3799
STD 0.3212
Min 0.000031
Max 3.099
for those instances. Table 7.5 shows the number of test instances in which the error
fell within the documented bounds. 96.70% of the test cases had an error of less than
1.0 relative to the value calculated by the Markov Model in the U domain.
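A sketch of this error metric, assuming the comparison is a simple absolute difference of U values, is shown below.

/** Sketch of the error metric used above: absolute difference between the
 *  Markov and BBN reliabilities after both are mapped into the U domain. */
public final class UDomainError {
    static double error(double rMarkov, double rBbn) {
        double uMarkov = -Math.log10(1.0 - rMarkov);
        double uBbn = -Math.log10(1.0 - rBbn);
        return Math.abs(uMarkov - uBbn);
    }

    public static void main(String[] args) {
        // a half-order-of-magnitude disagreement yields an error of ~0.5
        System.out.println(error(0.999, 0.99683));
    }
}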
The error in the Bayesian Belief Network is normally distributed over the data set.
While the network presented previously has been shown to be effective at per-
forming accurate reliability calculations, the network itself suffers from significant
limitations. Because the network can only compare two nodes at once, the program
being analyzed must either be limited to two nodes or two “non-perfect” nodes. This
limits the model itself to being a proof of concept suitable mainly for academic study. However, it is possible to extend the model so that it has broader application. This extension, shown
in Figure 7-5 incorporates a node which combines the net execution rate for the two
nodes. This value is scaled in the same manner as the execution rates for the two
input nodes.
By adding this additional node to the Belief Network and assigning the appropriate probability tables, instances of the network can be cascaded to handle larger systems. To handle the case where the number of network nodes is not a power of two, it is necessary to add a phantom state to the BBN for execution rate which indicates that the given node never executes and the output of the network should only be dependent upon the other node values. This results in the modified state definitions shown in Table 7.7.
Table 7.7: Extended Bayesian Belief Network States Defined for Execution Rate

State Name   Abbreviation   V                    log10(V)
Very High    VH             31.6 ≤ V             1.5 ≤ log10(V)
High         H              3.16 ≤ V < 31.6      0.5 ≤ log10(V) < 1.5
Medium       M              0.316 ≤ V < 3.16     -0.5 ≤ log10(V) < 0.5
Low          L              0.0316 ≤ V < 0.316   -1.5 ≤ log10(V) < -0.5
Very Low     VL             0 < V < 0.0316       log10(V) < -1.5
Never        N              V = 0                N/A
In order to validate the results of this model, an experiment was set up which
would use the existing Markov model to generate test cases. These test cases would
then be fed into the Bayesian Belief Network and evaluated against the expected results.
To accomplish this, a MatLab script was created which evaluated a Markov model
simulation using the Cheung[Che80] model and the program flow which is shown in
Figure 7-7. For simplicity, R1 and R4 were fixed at 1, indicating that there was no
probability of failure for the entry and exit nodes. R2 and R3 were independently
varied between the values of 0 and .999999 with a median value of .999. The param-
eters t3,2 , t2,3 , t3,4 , and t2,4 were also varied independently. Altogether, this resulted
in a total of 47730 test vectors being generated and the value ranges shown in Table
7.8.
To evaluate the accuracy of the Bayesian Belief Network, a Java application was
developed using the EBayes[Coz99] core. This application used the same input pa-
rameters that the MatLab script used. The outputs of the Bayesian Belief network
were then compared with the expected values from the Markov model, creating error
values. Comparisons with the Markov model were done in the U domain, as this
allowed an accurate assessment of error across all magnitudes. This resulted in the error values shown in Table 7.9.
Table 7.9: Differences between the Markov Model reliability values and the BBN Predicted
Values
Average 0.4156
Median 0.3799
STD 0.3212
Min 0.000031
Max 3.099
While the raw error values are important and indicate that the average error is
less than .5, or one half of the resolution of the model, a more thorough analysis of
the error can be obtained by looking at the number of test instances and the error
for those instances. Table 7.10 shows the number of test instances in which the error
fell within the documented bounds. 96.70% of the test cases had an error of less than
1.0 relative to the value calculated by the Markov Model in the U domain.
The error in the Bayesian Belief Network is normally distributed over the data set.
7.8 Summary
This chapter has demonstrated that Bayesian Belief Networks can be used as a substitute for complete Markov Models when one is assessing the reliability of a software system. The Bayesian Belief Network approximates the reliability predicted by the Markov Model for software reliability within one order of magnitude when using five discrete states for the continuous reliability and execution probability variables. Greater precision could be obtained by increasing the number of states used when converting the continuous reliability and execution probability variables into discrete states. However, for systems in which the exact reliability parameters are not known, the resolution provided by this network is sufficient to provide an estimate of the net reliability. One area for future research certainly is to analyze the effect of increasing the number of states relative to the increased precision that would be obtained.
Chapter 8
The SOSART Software Static Analysis Reliability Tool1
In order to use the proposed software reliability model, a Software Static Analysis
Reliability Tool (SOSART) has been developed. The SOSART tool combines static
analysis results, coverage metrics, and source code into a readily understandable interface and uses them for reliability calculation. However, before discussing the details of SOSART itself, it is appropriate to review the tools which preceded it.
1 Portions of this chapter have appeared in Schilling and Alam [SA06a].
In the study of software reliability, there have been many tools developed. This section provides a
brief overview of the existing tools and their capabilities in order that they can be
compared with the SOSART analysis tool developed as part of this research.
The SMERFS tool estimates software reliability using a black-box approach. It provides a range of reliability models. The original tool used a textual interface and operated predominantly in the UNIX environment. However, the current version, SMERFS3, operates under Microsoft Windows and includes a Graphical User Interface. SMERFS3 also supports extended functionality in the form of additional models for both hardware and software reliability.

The CASRE tool is similar to SMERFS in that it is also a black-box tool for software reliability estimation. CASRE supports many of the same models supported by SMERFS. CASRE operates in a Windows environment and does have a GUI. One significant feature included in CASRE is the ability to combine the results of multiple reliability models.

The ATAC tool evaluates software tests using code coverage metrics. This represents the first tool discussed that is a white-box tool. The command line tool uses a specialized compiler (atacCC) which instruments compiled binaries to collect run-time trace information. The run-time trace file records block coverage, decision coverage, c-use, p-use, etc. The tool assesses test completeness and visually displays lines of code not exercised by tests.
The SREPT tool supports the assessment of software reliability across multiple lifecycle stages. Early reliability predictions are achieved through static complexity metric modeling, and later estimates include testing failure data. SREPT can estimate reliability as soon as the software's architecture has been developed, and the tool can also be used to estimate release times.
The ROBUST tool supports five different software reliability growth (SRG) models. Two of these models can be used with static metrics for estimation during the early stages of development, while one model includes test coverage metrics. ROBUST operates on data sets of failure times, intervals, or coverage. Data may be displayed in textual or graphical form.
The GERT tool is capable of calculating software reliability estimates and of quantifying the uncertainty in the estimate. The tool combines static source code metrics with dynamic test coverage information. The estimate and the confidence interval are built using the Software Testing and Reliability Early Warning (STREW) metric suite, and the tool provides color-coded feedback on the thoroughness of the testing effort relative to prior successful projects. GERT is available as an open source plug-in under the Common Public License (CPL) for the open source Eclipse development environment.
Thus far, each of the tools discussed has lacked the ability to interact with static
analysis tools. The AWARE tool[SWX05] [HW06], developed by North Carolina State
University, however, does interact with static analysis tools. Developed as a plug in
for the Eclipse development environment, AWARE interfaces with the Findbugs static
analysis tool. However, whereas the key intent of the tools discussed previously is to
directly aid in software reliability assessment, the AWARE tool is intended to help
software engineers in prioritizing statically detectable faults based upon the likelihood
of them being either valid or a false positive. AWARE is also limited in that it only
supports the Findbugs static analysis tool, and the user interface consists of a basic listing of the detected faults.
To effectively use the model previously developed for all but the smallest of pro-
grams requires the development of an appropriate analysis tool. This tool will be re-
sponsible for integrating source code analysis, test watchpoint generation, and static
analysis importation.
The first responsibility for the SoSART tool is to act as a bug finding meta tool
which automatically combines and correlates statically detectable faults from different
static analysis tools. It should be noted that, while many of the examples given previously involve C language tools, the first application for the tool involves the analysis of a Java application. Thus, the initial implementation of SoSART targets the Java language.
Beyond being a meta tool, however, SoSART is also an execution trace analysis
tool. The Schilling and Alam model requires detailed coverage information for each
method in order to assess the reliability of a given software package. The SoSART
tool thus includes a customized execution trace recording system which captures and records the paths executed within each method. While similar in nature to the ATAC tool discussed previously in Section 8.1, the SOSART trace tool
does not require code instrumentation during the compile phase. Instead, for Java
programs, it interfaces with the Java Platform Debugger Architecture (JPDA) and
the Java Debug Interface (JDI). This allows any Java program to be analyzed without
SoSART also consists of a complete language parser and analyzer which has been
implemented using the ANTLR[Par] toolkit. The parser breaks the source code into
the fundamental structural elements for the model, namely classes, methods, state-
ment blocks, and decisions. A class represents the highest level of organization and contains a set of methods. A series of statement blocks, coupled with the appropriate conditional decisions, makes up a method.
When determining the path coverage, the SoSART tool uses the parsed information
to determine which execution traces match a given path through a given method.
The parsed structure is also used to calculate the structural metrics reported for each method.
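The structural decomposition the parser produces can be pictured with the following hypothetical data structures; the class and field names are illustrative only, not SoSART's actual types.

import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the structural elements the parser produces. */
public final class ParsedStructure {
    static class StatementBlock {
        int startLine;                 // first line of the contiguous block
        int endLine;
        List<String> staticFaults = new ArrayList<>();  // assigned by location
    }
    static class Decision {
        StatementBlock from;
        StatementBlock taken;          // successor when the decision is true
        StatementBlock notTaken;       // successor when the decision is false
    }
    static class Method {
        String name;
        List<StatementBlock> blocks = new ArrayList<>();
        List<Decision> decisions = new ArrayList<>();
    }
    static class ParsedClass {         // highest level of organization
        String name;
        List<Method> methods = new ArrayList<>();
    }
}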
The SoSART user interface consists of two portions, a command line tool and
a graphic user interface. The command line toolkit allows users to execute Java
programs on top of the SOSART system while it collects coverage information for later analysis.
SoSART also includes a graphical user interface built using the JGraph toolkit[Ben06].
This Graphical User Interface allows the generation of pseudo-UML Activity diagrams
for each method, as well as displaying the static faults, branch coverage, and struc-
tural metrics for each method. Statically detectable faults are identified through a
four color scheme, with green characterizing a statement block with no known stati-
cally detectable faults, yellow indicating a slight risk of failure within that statement
block due to a statically detectable fault, orange indicating an increased risk over yel-
low, and red indicating a serious potential for failure within the code segment. Color
coding uses gradients to indicate both the most significant risk identified as well as
the typical risk. The overall reliability is calculated by combining the observed ex-
ecution paths with the potential execution paths and the statically detectable fault
locations.
General software requirements for the SOSART tool are provided in Appendix B.
The development of SOSART used a process derived from the PSP process for most areas of development. This process was applied to all areas of the tool which were not considered to be “research intensive”; excluded areas included the development of the specific Bayesian Belief Networks, the development of the ANTLR parser (which was viewed as a learning process), and other similar areas.
Altogether, the effort expended in the development of the SOSART tool is provided
in Table 8.1. Effort has been recorded to the nearest quarter hour.
Table 8.2 provides performance metrics regarding the implementation of the SOSART
tool in terms of the actual design and implementation complexity. With the exception
of several parameters that were contained within autogenerated code, all implemen-
tation metrics are within standard accepted ranges. Using the value of 30359 LOC, a total development effort of approximately 998 hours, and the relationship

productivity = LOC / time \qquad (8.1)
the productivity was 30.41 lines of code per hour. This number is extremely high
relative to the typical 10-12 LOC per hour expected for commercial grade production
code, but this can be explained by the nature and composition of the project. First,
since the tool was an experimentally developed tool, the amount of time spent devel-
oping test plans and executing test plans was significantly less than would be expected
for a production grade project. Overall, this would result in a decrease in produc-
tivity if a professional grade development process were being followed. Second, the
usage of the ANTLR tool automatically generated a significant portion of the source
code. The ANTLR package itself contains 14758 lines of code. While a significant
amount of development was required to create the language description for ANTLR,
if these lines of code are removed from consideration, the productivity drops to 15.62
LOC per hour, or much closer to industry accepted values. Third and finally, no formal peer reviews occurred during development. Properly reviewing a program of this size, assuming a review
rate of 100 LOC per hour, would require an additional 144 hours of effort, bringing
the net total effort to 1142 hours, and the effective productivity to 12.6 LOC / hour.
Bug tracking for the SOSART tool was handled using the SourceForge bug track-
ing system. Any bug which was discovered after testing and the completion of integra-
tion into the development tip was tracked using the bug tracking database. Overall, this yielded a post-release defect rate of 0.06 defects per KLOC. This number is extremely low, and it is suspected that further significant defects will be uncovered as the tool is further used. To help achieve an appropriate level of quality in the final delivered tool, three external components
were used within the development of the SOSART tool, namely the ANTLR parser
generator, the JGraph graphing routines, and the EBAYES Bayesian belief engine.
ANTLR, ANother Tool for Language Recognition, provides a framework for constructing recognizers, interpreters, compilers, and translators. The tool uses a grammar definition which contains Java, C#, Python, or C++
actions. ANTLR was chosen because of its availability under the BSD license as well
as a readily available grammar definition for the Java Programming language which
can be readily expanded upon. The ANTLR software was used to generate the parser
for the Java input code, appropriately separating Java files into classes and methods. JGraph is a graph visualization component written entirely in the Java language. It is fully Swing compatible in both its visual interface as well as
its design paradigm, and can run on any JVM 1.4 or later. JGraph is used principally
in the user interface area of the SOSART tool, allowing the visualization of method control flow diagrams. EBayes is a compact inference engine for Bayesian Belief Networks. Its goal was to develop an engine small enough and efficient enough to perform inference in resource constrained environments. EBayes is derived from the JavaBayes [Coz01] system which was used for the conceptual generation of the Bayesian Belief Networks used in this research and within the SOSART tool. The EBayes engine is used to calculate the Bayesian Belief Network values within SOSART.
The general operation for the SOSART tool begins with the user obtaining a pre-
viously existing module to analyze. The module should compile successfully without
any significant compiler warnings. Assuming the user is using the GUI version of the tool, the user will start the SOSART tool. Upon successful start, the user will be presented with the main user interface from which all subsequent selections are made.
Normal operation begins by the user selecting the Java files that are to be imported
and analyzed using the analysis menu, as is shown in Figure 8-1. Any number of files
can be selected for importation using the standard file dialog box so long as the
files reside within the same directory path. In addition to allowing the importation
of multiple files at one time, it is also possible to import multiple sets of files by
importing each set individually. The tool itself will protect against a single file being
imported multiple times into the project based upon the Java file name.
Figure 8-1: Analysis menu used to import Java source code files.
As the source files are imported into the given source tool, several things occur.
First and foremost, the class is parsed and separated into existing methods. Summary
panel data about each class is generated, and this information is used to populate the summary panel.
In addition to the summary panel, the key user interface for each imported class consists of a tabbed display, with one tab showing the source code, as is shown in
Figure 8-3. Importation will also generate a basic UML activity diagram or control flow diagram for each method.
In order to obtain the program execution profile for the given source code module,
it is necessary to obtain from the structure the line numbers which represent the start
of each code segment. The SOSART tool provides such capability. The watchpoints
are generated as a textual file by the SOSART GUI, and then these are fed to the
command line profiler which actually executes the program modules and obtains the
execution traces.
For the source code module loaded in Figure 8-3, there are many locations which
represent the start of a code block. Examples include Line 199, Line 216, Line 227,
etc. In order to obtain a listing of these locations, the tracepoint window is opened
from the analysis window. The Generate Tracepoints button will, based upon the parsed source code, generate the tracepoint listing shown in Figure 8-5.
Once the tracepoint locations have been defined, the Java Path Tracer portion of
the SOSART tool can be invoked from the command line. This tool uses a set of
command line parameters to indicate the classpath for the program which is to be
executed, the trace output file, the path output file, the tracepoint file which defines
the locations that are to be observed for program execution, as well as other requisite
parameters. These parameters are shown in detail in Figure 8-6, which represents the
manual page displayed when executing the tool from the command line if an improper set of parameters is supplied.
By supplying the appropriate tracepoint parameters on the command line, the tool
can be invoked to analyze a given program set. In the case of the example program, the invocation and its output are shown in Figure 8-7.
In this particular instance, the most important output is the XML gathered trace
information showing how many times each of the methods was executed and which
paths through the method were taken. A short example of this is shown in Figure
8-8. This figure represents the execution traces obtained during approximately one
minute (60046 ms to be exact) of program execution. The SOSART tool allows this trace data to be displayed graphically, as is shown in Figure 8-9. In addition to the number of executions for each path being shown numerically
and by color, the relative execution number is shown graphically through thicker
and thinner execution path traces. Those paths which are taken more often have a
thicker execution trace line. Those paths which are executed less often have a thinner execution trace line.
Figure 8-8: XML file showing program execution for HTTPString.java class.
Figure 8-9: Execution trace within the SOSART tool. Note that each path through the
code which has been executed is designated by a different color.
In order for SOSART to perform its intended purpose and act as a static analysis
Metadata tool, the tool must support the capability to import and analyze statically
detectable faults. Because of the vast variety of static analysis tools on the market,
and the many different manners in which they can be executed, the SOSART tool
does not automatically invoke the static analysis tools. Instead, it is expected that
the static analysis tools will be executed independent of the SOSART tool, through
the code compilation process or by another external tool, prior to analyzing the source code with SOSART.
Once the analysis tools have been run, importation of the faults begins by im-
porting the source code that is to be analyzed into the SOSART tool, as has been
described previously. Once this has been completed, the statically detectable faults
detected by each of the executed static analysis tools are imported into the program.
If an imported warning does not match the warnings which have been previously imported into the SOSART tool and assigned a definition, a dialog box prompts the user to assign the fault to an appropriate definition. An example of this dialog box is shown in Figure 8-10. In this particular
instance, the first instance of the PMD fault has been imported warning against a
method having multiple return statements. From a taxonomy standpoint, using the
SOSART taxonomy this fault can be classified as a General Logic Problem, as there is
no specific categorization defined for this type of fault. In general, this fault exhibits
a very low potential for immediate failure, though there is a maintenance risk associ-
ated with this fault, as methods which have multiple returns can be more difficult to
maintain over time versus methods which only contain a single return statement.
In the case of the fault defined in Figure 8-11, this represents a PMD fault where
the programmer may have confused the Java equality operator (==) with the assign-
ment operator (=). At a high level, this represents a medium risk of failure. Upon further review, certain cases may be found to have a more significant risk of failure.
Once all faults have been imported and assigned to the appropriate taxonomical
definitions, the Activity Diagrams / Control flow diagrams for each method are up-
dated to include the statically detectable faults. This results in a display similar to
that shown in Figure 8-12. This display shows the respective anticipated reliability
for each block as a color code. Red indicates the most significant risk of failure,
while green indicates that the code segment is relatively safe from the risk of failure.
Orange and yellow reflect intermediate risks in between green and red, with yellow
being slightly more risky than green and orange being slightly less risky than red. In
addition to marking the fault as valid or invalid, this panel can also be used to mod-
ify the immediate failure risk, the maintenance risk, whether the fault failed during
testing, and fault related reliability which have been set by the taxonomy when the
Each code segment block and listing of static analysis warnings can have a two
color gradient present. One color of the gradient represents the typical risk associated
with the given code block. The second color represents the maximum risk associated
with that code block. For example, a code block which contains both green and
orange colors indicates that the code block typically has very little risk associated
with it, but there is the potential that under certain circumstances, a significantly
high amount of risk exists. These colors are driven by the Bayesian Belief Network
By clicking on the static analysis listing, the software engineer can open a panel
which can then be used to mark a fault as either “Valid”, “Invalid”, or “Unverified”,
as is shown in Figure 8-13. Faults which are deemed to be invalid will be converted
into a grey background the next time the panel is opened. Faults which are valid
but have no risk will have a green background, and faults which are valid but have
a higher risk associated with them will be displayed with an appropriate background
color of either green, yellow, orange, or red, depending upon their inherent risk.
In addition to invoking this display panel from the method activity diagram /
program data flow diagrams, this same display panel can be invoked from the report
menu option. However, when invoking this display from the report menu, there are
a few differences. The report menu option has two potential selection values, namely
the class level faults option or the all file faults option. Whereas clicking on the code
segment on the activity diagram only shows the faults which are related to the given
code block, selecting the item from the menu displays either those faults which are
detectable at the class level (and thus, are not assigned to a given method) or all
statically detectable faults within the file. In either case, the behavior of the panel is
The report menu also allows the user to view report data about the project and
the distribution of statically detectable faults. For example, the report shown in
Figure 8-14 provides a complete report of the statically detectable faults which are
present within the HTTPString.java file of this project. The first three columns deal
with warning counts. The first column shows the number of faults of each type which
have been detected in the overall file. The second column indicates the number of
faults which have been deemed to be valid upon inspection of the fault, and the third
column indicates the number of faults which have been deemed to be invalid based
upon project inspection. The percent valid column indicates the number of faults of
each type which have been determined to be valid upon inspection. The number of
statements counts the number of executable statements found within this source code
module. The last two columns calculate the density of the detected warnings relative to the number of executable statements.
This report can be generated for three different sets of data. The first set, de-
scribed previously, generates this report for the currently opened file. This same
report can also be generated at the project scope which encompasses all files that
are being analyzed. Finally, this report can be generated based upon the historical data stored within the historical database.
In order to allow the appropriate data retention and historical profiling so that the
SOSART tool improves over time, SOSART includes a built in historical database
system. When used properly, the database system allows the user to store past
information regarding the validity of previously detected faults over multiple projects.
The database can be segmented to allow different database sets to be used based on the project. Even though it is referred to as a database, the implementation does not actually require the usage of a separately installed database engine, such as MySQL.
The analyze menu contains many of the parameters necessary to use the historical
database capabilities to analyze a given project, as is shown in Figure 8-15. From this
menu, the user can load and save the historical database to a given file. This allows
the user to control which historical database is used for assigning validity values to the
statically detectable faults. This capability also allows separate historical databases to be maintained for different families of projects.
By default, the warnings that are manipulated are not automatically transferred into the historical database. Instead, the user must explicitly force the analyzed warnings to transfer into the database. This is done for two reasons. First, this prevents the database from being contaminated by erroneous entries made when learning the tool. Second, and
most importantly, this allows the user to prevent the transfer into the database until
a project has been fully completed. Large projects may require more than one pro-
gram execution of the SOSART tool in order to fully complete the analysis, and it
is best for the transfer of warnings into the database to be delayed until the project
is completed. This is a commonly used paradigm for metrics collection tools within
software engineering.
The analyze menu also has the capability to allow the user to clear the database
of all previously analyzed faults. In general, it is expected that this capability would
rarely be used, but it is supported to allow the database to be reset if new program
families are analyzed or there is some other desire to reset the historical data back to a clean state.
The Program Configuration Panel, shown in Figure 8-16, also allows configuration of the historical database behavior. The tool, based upon its default configuration, will automatically load a given database upon program
start. This is a basic feature of the program. However, it is possible to change which
database is loaded based upon the user's preferences. It is also possible to configure
whether or not the database is automatically saved upon program exit. Under most
circumstances, it is desired for all changes to the historical database to be saved when
the program is exited. However, there may be certain circumstances where this is not
the appropriate action to take based upon the analysis being performed.
A related option allows analyzed faults to be automatically transferred into the historical database upon program exit. While this does prevent the occurrence of user error whereby the data is never transferred to the historical database, it is not the default behavior for the reasons noted above.
The Randomize Database on Load feature allows the user to randomize the pri-
mary key used within the database when a given database is loaded. By design, the
key used to uniquely identify a statically detectable fault includes the file name, the line number, the static analysis tool which detected the fault, and the fault which was detected. Randomization inserts a random component into the key definition. This allows multiple instances of the same warning to be stored within the database. The randomization also, to some extent, obfuscates the data, which may be important based upon the confidentiality of the project being analyzed.
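A hypothetical sketch of such a key follows; the actual SOSART key format is not documented here, so the field order and separator are assumptions.

/** Hypothetical sketch of the historical-database key described above. */
public final class WarningKey {
    static String key(String fileName, int lineNumber,
                      String toolName, String faultId, long randomComponent) {
        // randomComponent disambiguates repeated imports and obfuscates data
        return fileName + ":" + lineNumber + ":" + toolName + ":"
                + faultId + ":" + randomComponent;
    }

    public static void main(String[] args) {
        System.out.println(key("HTTPString.java", 199, "PMD",
                "GeneralLogicProblem", new java.util.Random().nextLong()));
    }
}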
To allow diagrams to be incorporated into external reports and other documents, SOSART supports the exportation of
program diagrams into graphics files. In order to avoid issues with royalties and
patent infringement, the Portable Network Graphic Format (PNG) was selected as
the only exportation format natively included with SOSART. The PNG format is a
raster image format intended to serve as a replacement for the GIF format. As such, it is free of patent encumbrances. A further reason for selecting the PNG format is that it is readily supported by the JGraph utilities used to render the diagrams.
Graphics exportation is only available if the currently selected tab on the display
is a Method activity diagram. The summary panels can not be exported, and the
code listing can not be exported in this manner. The resolution of the exportation is affected by the image zoom as well. A larger zoom factor will result in a larger
image, while a smaller zoom factor will result in a smaller image with less resolution.
Complex graphics may result in very large file sizes when exported.
8.5.6 Printing
To allow hard copies of the diagrams and reports generated with SOSART, a standard print capability has been integrated with the tool. Printing is accessed from the file menu, and opens a standard GUI print dialog series. Options
to be selected by the user include scaling features, which allow the graphs to be
printed to a normal size, to fit a given page size, to a specific scale size, or to fit a single page.
Other print configuration parameters include the capability to print either the
current graph or all loaded graphs. If the current graph is selected, only the currently
viewed method graph, summary panel, or source code panel will be printed. If all
loaded graphs is selected, then the summary panel, source code panel, and all graphs will be printed.
In order to facilitate the analysis of larger Java projects which can not be readily
analyzed in one sitting, as well as to protect the person doing the analysis from
random machine failure and retain results for future consultation, the SOSART tool
offers several mechanisms that can be used to load and save projects.
As with most analysis tools, SOSART offers the user the capability to create a new
project, save the current project, or load a previously saved project. When creating
a new project, the currently open project will first be closed before a new project is
created. The new project will not have any imported Java files, static warnings, or execution traces.
Saving a project will store all details related to the project, including loaded
source code modules, method activity diagrams, imported static analysis faults, and
execution traces. All data files are stored in XML using JavaBeans XML Persistence
mechanism.
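A minimal sketch of this persistence mechanism using the standard java.beans.XMLEncoder class is shown below; the project object itself is hypothetical, as the SOSART project classes are not reproduced in this text.

    import java.beans.XMLEncoder;
    import java.io.BufferedOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;

    public class ProjectSaver {
        public static void save(Object project, String fileName)
                throws IOException {
            // Serializes any JavaBean-conformant object graph to XML.
            try (XMLEncoder encoder = new XMLEncoder(
                    new BufferedOutputStream(new FileOutputStream(fileName)))) {
                encoder.writeObject(project);
            }
        }
    }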
Because of the large size of XML files created when saving an entire project, a
secondary method has been created to store projects. With this method, only the
static analysis warnings and their modified validity values and risks are stored for the project. This greatly reduces the file size and also results in a tremendous performance improvement when loading a large project. Using this mechanism, a user will work on a project assessing the risks associated with its warnings. When the time comes to save the project, only the static
analysis warnings are saved. To again work on the project, the user must re-import
the Java files and program execution traces before reloading the statically detectable faults.
The SOSART GUI interface supports all common graphical behaviors related to zooming and window management. There are three principal mechanisms for performing zoom operations. The “Zoom
In” and “Zoom Out” features zoom the graphic in or out by a factor of two, depending
on which menu item is selected. The zoom dialog box allows the user to zoom to one
of eight pre-configured zoom values, namely 200%, 175%, 150%, 125%, 100%, 75%,
50%, or 25%, as well as offering a drag bar which can set the zoom value to any arbitrary value.
Because of the possibility that there may be multiple Java files imported into one
analysis project, the SOSART tool includes the capability to tile horizontally and
vertically, as well as cascade the opened files. These behaviors follow standard GUI
practices.
The key functionality required by the SOSART tool is the ability to estimate the reliability of the loaded software and report this estimate to the software engineer. This is accomplished via the Reliability Report Panel, an example of which is shown in Figure 8-17.
The reliability report panel provides the user with the appropriate details relative
to the given reliability of the loaded modules and execution traces. The display itself
is also color coded, with green indicating very good reliability values, and yellow,
orange, and red indicating lesser reliability values. In the particular example provided, the overall reliability appears to be quite high.
In reviewing the report further, however, this high reliability is achieved because
the modules themselves very rarely execute. To facilitate these detailed reviews, the
reliability report can be exported to a text file for external review and processing.
Figure 8-18 represents a portion of the complete textual report detailing the reliability calculations for each method.
Method: setAddress
Posterior marginal for RawReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Posterior marginal for CalibratedReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Method: setBuffer
Posterior marginal for RawReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Posterior marginal for CalibratedReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Method: process
Posterior marginal for RawReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Posterior marginal for CalibratedReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Method: checkAuthorization
Posterior marginal for RawReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Posterior marginal for CalibratedReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Method: verifyClient
Posterior marginal for RawReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Posterior marginal for CalibratedReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Method: _getResponseMessage
Posterior marginal for RawReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Posterior marginal for CalibratedReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Method: sendFirstHeaders
Posterior marginal for RawReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Posterior marginal for CalibratedReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Method: sendContentHeaders
Posterior marginal for RawReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Posterior marginal for CalibratedReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Method: sendExtraHeader
Posterior marginal for RawReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Posterior marginal for CalibratedReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Method: sendResponse
Posterior marginal for RawReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Posterior marginal for CalibratedReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Method: processString
Posterior marginal for RawReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Posterior marginal for CalibratedReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Method: getLocalFile
Posterior marginal for RawReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Posterior marginal for CalibratedReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Method: replaceString
Posterior marginal for RawReliability: 1.000 0.000 0.000 0.000 0.000 0.000
Posterior marginal for CalibratedReliability: 1.000 0.000 0.000 0.000 0.000 0.000
##########################################################################################
Method Reliability Values
setAddress Posterior marginal for CoverageA: Very Often 0.000 Often 0.000 Normal 1.000
Rarely 0.000 Very Rarely 0.000 Never 0.000
setAddress Posterior marginal for ReliabilityA: Perfect 1.000 Very High 0.000 High 0.000
Medium 0.000 Low 0.000 Very Low 0.000
setBuffer Posterior marginal for CoverageB: Very Often 0.000 Often 0.000 Normal 0.000
Rarely 0.000 Very Rarely 1.000 Never 0.000
setBuffer Posterior marginal for ReliabilityB: Perfect 1.000 Very High 0.000 High 0.000
Medium 0.000 Low 0.000 Very Low 0.000
process Posterior marginal for CoverageA: Very Often 0.000 Often 0.000 Normal 1.000
Rarely 0.000 Very Rarely 0.000 Never 0.000
process Posterior marginal for ReliabilityA: Perfect 1.000 Very High 0.000 High 0.000
Medium 0.000 Low 0.000 Very Low 0.000
checkAuthorization Posterior marginal for CoverageB: Very Often 0.000 Often 0.000 Normal 1.000
Rarely 0.000 Very Rarely 0.000 Never 0.000
checkAuthorization Posterior marginal for ReliabilityB: Perfect 1.000 Very High 0.000 High 0.000
Medium 0.000 Low 0.000 Very Low 0.000
verifyClient Posterior marginal for CoverageA: Very Often 0.000 Often 0.000 Normal 1.000
Rarely 0.000 Very Rarely 0.000 Never 0.000
verifyClient Posterior marginal for ReliabilityA: Perfect 1.000 Very High 0.000 High 0.000
Medium 0.000 Low 0.000 Very Low 0.000
_getResponseMessage Posterior marginal for CoverageB: Very Often 0.000 Often 0.000 Normal 1.000
Rarely 0.000 Very Rarely 0.000 Never 0.000
_getResponseMessage Posterior marginal for ReliabilityB: Perfect 1.000 Very High 0.000 High 0.000
Medium 0.000 Low 0.000 Very Low 0.000
sendFirstHeaders Posterior marginal for CoverageA: Very Often 0.000 Often 0.000 Normal 1.000
Rarely 0.000 Very Rarely 0.000 Never 0.000
sendFirstHeaders Posterior marginal for ReliabilityA: Perfect 1.000 Very High 0.000 High 0.000
Medium 0.000 Low 0.000 Very Low 0.000
sendContentHeaders Posterior marginal for CoverageB: Very Often 0.000 Often 0.000 Normal 1.000
Rarely 0.000 Very Rarely 0.000 Never 0.000
sendContentHeaders Posterior marginal for ReliabilityB: Perfect 1.000 Very High 0.000 High 0.000
Medium 0.000 Low 0.000 Very Low 0.000
.
.
.
##########################################################################################
##########################################################################################
Final Results...
setAddress:setBuffer:process:checkAuthorization:verifyClient:_getResponseMessage:sendFirstHeaders:sendContentHeaders
Posterior marginal for CoverageA: Very Often 0.000 Often 0.059 Normal 0.941 Rarely 0.000 Very Rarely 0.000 Never 0.000
sendExtraHeader:sendResponse:processString:getLocalFile:replaceString:verifyClient:_getResponseMessage:sendFirstHeaders:...
Posterior marginal for CoverageB: Very Often 0.000 Often 0.030 Normal 0.970 Rarely 0.000 Very Rarely 0.000 Never 0.000
setAddress:setBuffer:process:checkAuthorization:verifyClient:_getResponseMessage:sendFirstHeaders:sendContentHeaders
Posterior marginal for ReliabilityA: Perfect 0.047 very High 0.421 High 0.457 Medium 0.073 Low 0.002 Very Low 0.000
setAddress:setBuffer:process:checkAuthorization:verifyClient:_getResponseMessage:sendFirstHeaders:sendContentHeaders:...
Posterior marginal for NetCoverage: Very Often 0.000 Often 0.096 Normal 0.904 Rarely 0.000 Very Rarely 0.000 Never 0.000
setAddress:setBuffer:process:checkAuthorization:verifyClient:_getResponseMessage:sendFirstHeaders:sendContentHeaders:...
Posterior marginal for NetReliability: Perfect 0.015 very High 0.230 High 0.529 Medium 0.202 Low 0.022 Very Low 0.001
Net Anticipated Reliability: 0.999691
##########################################################################################
Validation Using Open Source Software
9.1 Introduction
In order to provide the first set of experimental validations for the SoSART model, the reliability of each of three software packages was calculated using the SoSART model and then compared with the reliability estimated through the STREW metrics. The packages analyzed were the RealEstate program, developed at North Carolina State University and used for Software Engineering education, the open source jsunit program available from SourceForge, and the Jester program, also available from SourceForge.
Programs were selected with several criteria in mind. First off, due to the complexities of using the SoSART analysis tool, smaller projects needed to be analyzed,
as the tool suffered from technical difficulties as larger applications were analyzed. In particular, significant tool problems developed as programs larger than 5 KLOC were analyzed.
Second, in order to apply the STREW model, a set of JUnit test scripts was required.
The STREW metrics correlate expected software reliability with implementation met-
rics and testing metrics. Third, and finally, access to a set of pseudo requirements was needed. The STREW metrics were developed by Nagappan et al. [Nag05] [NWVO04] [NWV03] and have been shown to be effective at estimating software reliability.
Based on these metrics, the reliability of software can be estimated using the
equation
Reliability = C0 + C1 · R1 + C2 · R2 − C3 · R3 + C4 · R4 (9.1)
where, among the STREW measurement ratios,

R4 = Number of Assertions / SLOC

and the confidence interval for the reliability estimate is given by

CI = Zα/2 · sqrt( R(1 − R) / n ) (9.2)
where
Zα/2 represents the upper α/2 quantile of the standard normal distribution, R represents the estimated reliability, and n represents the sample size.
The STREW metrics are supported by the GERT [DZN+ 04] toolkit.
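As a purely numerical illustration of Equation 9.1 (all coefficient and ratio values here are hypothetical, since the fitted STREW coefficients are project specific):

Reliability = 0.85 + 0.02 · 1.5 + 0.01 · 2.0 − 0.03 · 0.5 + 0.04 · 1.0 = 0.925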
The RealEstate program is an example program developed as a part of the open courseware materials at North Carolina State University. The project documents the entire software development process for a simple game using the agile
development process. A complete suite of unit tests was constructed using JUnit, and
these test cases are readily available with the source code. The Real Estate program
was developed in the Java language and has the overview metrics provided in the accompanying table.
While the RealEstate program represents a slightly different domain than the
intended domain for this software model, it does provide a readily available convenient
package which can be used as a proof of concept application for the tool and model.
In order to provide a baseline reliability estimate for the RealEstate program, the
GERT analysis tool [DZN+ 04] [Nag05] was invoked from within the Eclipse environ-
ment. While this tool was intended to directly calculate the reliability of the software
given the input parameters, compatibility issues between the tool and available Eclipse
platforms and Java Run Time Environments limited application of the tool to data
collection, and the reliability was calculated externally. From the STREW metrics,
the reliability for the RealEstate program was calculated to have a lower confidence bound of 0.9185.
After the assessment using the GERT tool was completed, the source code was
statically analyzed using eight independent static analysis tools which were supported by the SOSART tool.1 This resulted in the detection of 889 statically detectable faults, 214 of which were deemed to be valid faults of varying risks upon review with the SOSART tool.
Based on the imported static analysis faults, an estimated reliability value was
calculated using the model and the assumption that all execution methods will be
called at a uniform normal rate. This resulted in the first reliability estimate of
0.8945.
Once this reliability was obtained, the program was executed and execution traces
were captured with the SOSART tool, providing accurate data on the branch coverage
under the tested use case. This data was then imported into the SOSART tool and the reliability was recalculated and found to be 0.9753. This reliability jump can be attributed to the fact that the
methods which were the most unreliable within the system were rarely (if ever) called
and the methods which were most often traversed are the most reliable within the
system.

1 While previous research included 10 static analysis tools, licensing issues only allowed 8 tools to be used for this portion of the experimentation.
These results, which are within the confidence interval as was calculated by the
STREW metrics, provide a preliminary proof of concept for the validity of using
statically detectable faults and the defined Bayesian Belief Network for assessing the
reliability of software.
JSUnit is an open-source unit testing framework which allows the testing of client-side JavaScript programs. Development began in 2001, and currently there are more than 275 subscribed members and over 10000 downloads. The tool is developed in Java and is available from SourceForge. Code metrics for JSUnit are provided in the accompanying tables.
Following the procedure applied previously, the reliability of the JSUnit program
was assessed using the STREW metrics, resulting in the data shown in Table 9.7.
Reliability was estimated to range between 0.6478 and 0.9596, with a typical reliability
value of 0.8037.
Using the same 8 static analysis tools, the source code was analyzed for statically
detectable faults, as is shown in Table 9.8. A total of 480 statically detectable faults
were discovered, 220 of which were deemed to be valid and of varying risks.
These results were then imported into the SOSART analysis tool which calculated
an anticipated reliability of 0.9082 if each and every method was executed with a
uniform normal rate. Adding execution traces to the model reduced the reliability
value to 0.8102. These values are slightly higher than would be expected by the
STREW metrics calculations, but are within the range of acceptable values.
Java Jester is an Open Source tool available from SourceForge which is intended to measure the effectiveness of unit test suites. It will automatically inject errors into source code and determine if those errors are detected by the developed test cases. Jester is developed in Java and can test Java code. Code
metrics for Jester are provided in Table 9.9 and Table 9.10.
The same procedure as was used for the RealEstate and JSUnit programs was
applied to the Jester program to estimate system reliability, resulting in the data
shown in Table 9.11. Reliability was estimated to range between 0.6478 and 0.9596.
Using the same 8 static analysis tools, the source code was analyzed for statically
detectable faults, as is shown in Table 9.12. A total of 652 statically detectable faults
were discovered, 144 of which were deemed to be valid and of varying risks.
These results were then imported into the SOSART analysis tool which calculated
an anticipated reliability of 0.9024 if each and every method was executed with a
uniform normal rate. Execution traces for the program were obtained by running
the acceptance test suite included within the source code module, as this test set
was deemed to be representative of the desired use case for the program. Adding
execution traces to the model reduced the reliability value to 0.9067. These values
are slightly higher than would be expected by the STREW metrics calculations, but are within the range of acceptable values.
While the previous sections of this chapter have discussed the accuracy of the reliability model, without a cost effective mechanism for applying the model it is difficult to establish its practical value. This section analyzes the effort required to apply the model relative to other means of ensuring software reliability.
One of the oldest and thus far most effective mechanisms for ensuring the reliability of software is the peer code review program. For code reviews to obtain their maximum effectiveness, the review rate for
the peer review meeting should be approximately 100 lines of code per hour[Gla79].
Furthermore, effective peer reviews require 3 to 4 meeting attendees. Thus, the effort required to completely review a source code package can be estimated using the
equation
ECR = (LOC / RR) · NR (9.3)
where
ECR represents the total effort necessary to review the source code package,
LOC represents the count of the lines of code within the package,
RR represents the review rate for the source code package in LOC per unit of time, and
NR represents the number of review meeting attendees.
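As an illustrative example using hypothetical numbers rather than one of the studied projects, a 2,000 LOC package reviewed at the recommended rate of 100 LOC per hour by four attendees requires

ECR = (2000 / 100) · 4 = 80 person-hours of review effort.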
For the three projects analyzed in this chapter, the review effort can be estimated in this manner.
The effort required to undertake the reliability analysis using the SOSART method
was measured during development to allow comparison with the estimated effort for
a complete source code review, resulting in the data shown in Table 9.14. The data
is broken into two portions, the effort required for the static analysis tools to analyze
the source code modules and the effort required to analyze the reliability of the
program. This second field includes the effort required to review the static analysis
tool detected faults using the SOSART tool as well as the time required to execute
limited module testing. These results clearly indicate that this method is cost effective relative to a complete peer review of the source code.
The validation for the software reliability model presented previously relied upon
comparing the reliability calculated using the SOSART tool with the reliability cal-
culated through the STREW metrics method. While these experiments provided
a preliminary proof of concept for the model, a more extensive experiment using a
more appropriate software package was necessary. In this particular instance, the
given software package would be operated in an experimental fixture and the reliability of the software would be measured directly. These reliability values would then
be compared with the values obtained through the usage of the SOSART tool.
This chapter describes the experiment which was used to validate the reliability
model. The first section describes the Tempest Web Server software which was used
to validate the software reliability model. The second section describes the setup
which was used to evaluate the Tempest software from a reliability standpoint. The
third section of this chapter discusses the results of measuring the reliability of the Tempest software experimentally. The fourth section discusses the process used to experimentally measure the reliability of the Tempest software using the software reliability model and the SoSART tool. The fifth and final section of this chapter discusses the economic costs which are incurred by applying this approach.
The Tempest web server, which will be analyzed with the SOSART tool, was developed by members of the NASA Glenn Research Center Flight Software Engineering Branch. It is an embedded real-time HTTP web server, accepting requests from standard browsers running on remote clients and returning HTML files. It is capable of serving Java applets, CORBA objects, virtual reality (VRML) content, audio files, video files, etc. NASA uses Tempest for the remote control and monitoring of real, physical systems via inter/intra-nets. The initial version of Tempest was developed for the VxWorks real time operating system using the C programming language and occupied approximately 34 kB of ROM.
Subsequently, the code has been ported to the Java language and can execute on any platform providing a Java Virtual Machine. Tempest is designed to be integrated into other products, as is shown in Figure 10-1. Tempest has been used for a variety of remote monitoring and control applications.

Figure 10-1: Flow diagram showing the relationship between Tempest, controlled experiments, and the laptop web browsers [YP99].

Future intended uses for Tempest include enabling near real-time communications with remote experiments, as well as developing new teaching aids for education, enabling students and teachers to perform experiments. Since being developed, Tempest has received several awards, notably the Team 2000 FLC Award for Excellence in Technology Transfer, the 1999 Research and Development 100 Award, and the 1998 NASA Software of the Year Award.
Tempest is implemented using the Java language. While the Java language does support object oriented implementation and design, the Tempest web server is constructed in a more structural manner, as is shown in Table 10.1. Standard class based decomposition is used only sparingly. Tempest supports a number of configuration parameters which can be passed to the software on the command line when an execution instance is started. Command line options are defined in Table 10.3.
To evaluate the reliability of the Tempest software and the accuracy of the proposed software reliability model, an experiment was constructed using the Tempest software and the Java Web Tester software package. In essence, one machine was configured as a Tempest web server and was given a test web site to serve. A second machine was configured to use the Java Web Tester software package. This tool allows the user to configure a set of web sites which are to be periodically monitored for availability.
The Java Web Tester software was previously developed for research into the re-
liability of web servers, as is detailed in Schilling and Alam[SA07a]. This tool was
developed in the Java programming language and allows the user to verify connectivity to remote web servers.
The tool consists of three tools bundled into a single jar file. The first tool, a
GUI based tool, is used to configure the website tester and can be used for short
duration tests. The GUI allows the user to configure the remote site which is to be
used as a test site, the port to connect to the remote site, and the test rate. The test
rate determines how often the remote site is polled for a connection and subsequently
downloaded. Test rates can range between 1 and 3600 seconds. The tool also allows
the remote server to be pinged before an attempt is made to open the http connection.
The web testing tool also allows the user to compare the file with a previously
downloaded file. The intent of this is to detect downloads in which the connection
is successful yet the actual material downloaded is corrupted. When comparing files,
an entry can be flagged if either the downloaded file is identical to the previously downloaded file or differs from the previously downloaded file, depending upon how the tool is configured.
The web site testing tool is not limited in the number of test sites that can
be tested. In testing the tool, up to 100 sites were tested simultaneously without
performance degradation. When enabled to run, each test operates as its own Java
Thread, thus preventing the behavior of one remote site from affecting other sites being monitored.
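The threading design described above can be illustrated with a minimal sketch; the class and field names below are hypothetical, as the Java Web Tester source is not reproduced in this text.

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class SiteMonitor implements Runnable {
        private final URL site;
        private final long periodSeconds;  // configurable from 1 to 3600 s

        public SiteMonitor(URL site, long periodSeconds) {
            this.site = site;
            this.periodSeconds = periodSeconds;
        }

        @Override
        public void run() {
            // Each monitored site runs in its own thread, so a slow or
            // failed site cannot delay the monitoring of other sites.
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    HttpURLConnection conn =
                        (HttpURLConnection) site.openConnection();
                    try (InputStream in = conn.getInputStream()) {
                        // Drain the page; a comparison against a previously
                        // downloaded copy could detect corrupted content.
                        byte[] buffer = new byte[4096];
                        while (in.read(buffer) != -1) { /* discard */ }
                    }
                    // A successful download would be logged as a success ...
                } catch (Exception e) {
                    // ... and a failed connection or download as a failure.
                }
                try {
                    Thread.sleep(periodSeconds * 1000L);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                }
            }
        }
    }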
The second portion of the web tester tool is a command line tool which allows the same monitoring to be performed without the GUI. This allows web testing to run in a background mode, either via a UNIX script or CRON job, without a required graphical user interface being visible. This background mode was used during extended test runs.
The third portion of the tool, also a command line tool, post processes results from the other segments of the web testing tool and creates summary data reports on website connectivity.
The experiment began with setting up the Tempest software to serve a set of
test web pages in the University of Toledo OCARNet lab. For the purposes of this
experiment, two machines were isolated from the rest of the lab (and the rest of the
University of Toledo domain) through a standard commercial firewall and hub setup.
Two Linux workstations were setup on the OCARNet lab using the topology shown
in Figure 10-3. One machine served to execute the Tempest web server software, serving out a sample web site. The second machine executed the Java Web Tester. This second machine automatically polled the web server once a minute and downloaded a series of the web pages from the sample web site.
In the first instance of testing, this setup executed continuously without software failure for 2 calendar months. However, this setup was flawed in several minor fash-
ions. First off, the Tempest software was only operated using one set of configuration
parameters. Second, and more importantly, the testing used very low bandwidth
utilization and did not stress the software from a performance perspective.
As part of the general web reliability study conducted by Schilling and Alam[SA07a],
a second test of the reliability of the software was conducted using a similar setup.
However, in this study, the web server operated using two different configuration parameters. Unfortunately, the results related to the Tempest web server ended up being flawed in that the computer used was inappropriately configured for the experiment.1
A third reliability study using the OCARNet equipment and a new Windows XP
machine was conducted. In this case, four different instances of the Tempest web
server executed simultaneously from one machine. Each of the four test instances ran
a different set of user configurable parameters, representing four different use cases
for the Tempest web server. This setup executed for one week before being abandoned due to performance problems with the machine and its required anti-virus software.
The final experimental setup used two Linux machines running in an independent
environment away from the OCARNet Lab. The experimental setup began by creat-
ing the network topology shown in Figure 10-4. In this topology, two Linux machines
were separated from the rest of the network by a router, effectively isolating them
from all traffic except for the web traffic between machines. One of the machines
served a set of web pages over the network. The second machine executed the Java Web Tester software.

1 While the Schilling and Alam[SA07a] article does include a Tempest Web server within its results, this is a separate machine. All data from the flawed experiment was removed before the analysis of results was presented in that paper.
The machine executing the Tempest web server actually executed four different instances of the web server in four different Unix processes. Each instance ran a different use case, manifested through the usage of different command line parameters. While executing four instances simultaneously had been impossible under the Windows XP operating environment due to resource constraints, the combination of a dual core microprocessor and the usage of the Ubuntu Linux operating system allowed for all four instances of Tempest to execute simultaneously.
The experiment began by adding one source code class, namely the DummyLogger class, which is provided in Figure 10-5. This class provides a single static method which can be called by any class within the Tempest project. This is
necessary due to a limitation of the JDI interface used by the SoSART tool. Under
certain circumstances, there will be method paths that will have code blocks either
optimized out of the final Java Byte Code or conditionals which do not contain ex-
ecutable code. In order for path tracing to function in a reliable fashion, each and
every code segment must contain at least one executable line on which a watchpoint
can be placed. In order to facilitate this, each and every code block as was parsed by
the SoSART tool was appended with a call to the DummyLogger.LogAccess static
method. While technically modifying the source code, this insertion was deemed not
to significantly change the behavior of the system, yet it did allow more accurate anal-
ysis with the SoSART tool and the JDI interface which it relies upon. An example of the modified code is shown in Figure 10-6.
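A minimal sketch in the spirit of the DummyLogger class described above is given below; the exact implementation shown in Figure 10-5 may differ in naming and detail.

    public final class DummyLogger {
        private DummyLogger() {
            // No instances are needed; only the static method is used.
        }

        // Called at the start of every parsed code block. The body is
        // intentionally trivial so that program behavior is not altered;
        // the call site itself provides the executable line on which a
        // JDI watchpoint can be placed.
        public static void LogAccess() {
            // Intentionally empty.
        }
    }

    // Hypothetical modified block, in the spirit of Figure 10-6:
    //
    //     if (file == null) {
    //         DummyLogger.LogAccess();   // inserted tracepoint anchor
    //         throw new NotFoundException();
    //     }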
Once the code modification was completed and appropriately archived into the
local configuration management system, a clean build of the source code from the
archive occurred. In this operation, all existing class files and generated modules were removed and rebuilt by the javac compiler. This ensured that all class files reflected the modified source code.
Figure 10-6: Modified NotFoundException.java class, showing lines added to call the
DummyLogger routine.
The code was then imported into the SoSART analysis tool. This importation was
used to generate a set of tracepoints which would be used to log the program execution
profile under each of the four use cases that would be tested. The tracepoints were placed on the first executable line of each code block.
The net goal for this series of experiments was to experimentally estimate the
reliability of the Tempest Web Server under four different use cases. In order to
do this, four different configurations of the Tempest Web Server were configured to
serve the same material. Each instance ran a different configuration. Over a 25 hour
period, the machines were then tested for operation and the number of failures and
mean time between failures was recorded for the test cases. This resulted in the data
provided in Table 10.5. Because the first three use cases did not fail in the first 24 hours of testing, the test was subsequently extended to 168 hours. However, the result remained substantially unchanged after 168 hours, as the first three use cases still had not experienced a failure.
Assuming an exponential probability density function, for the fourth use case the mean time between failures is related to the failure rate λ by

MTBF = 1 / λ (10.1)

which results in a λ of 0.2096. This can be translated into a reliability value of 0.8109 for a one hour period of operation.
For the other examples, we must estimate the reliability based upon the fact that there was no failure in the system. Using the relationship described in Hamlet and Voas[HV93], the reliability for the first three use cases can be estimated to be 0.9840. Again assuming an exponential probability density function, the MTBF for the software can be estimated to be 62.5 hours.
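These figures are mutually consistent under the exponential model R(t) = e^(−λt), taking a one hour period of operation (an assumption consistent with the reported values, not stated explicitly above):

R(1) = e^(−0.2096) ≈ 0.8109 for the fourth use case, and

λ ≈ −ln(R) ≈ 1 − R = 1 − 0.9840 = 0.016 per hour for the first three use cases, giving MTBF = 1 / 0.016 = 62.5 hours.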
Statically Detectable Faults
The initial intent of analyzing the Tempest source code was to start by enabling
all static analysis rules on all tools. Thus, each and every potential rule would be
output, and each and every rule could be pulled into the SoSART tool. This would
allow SoSART to have a complete picture of the occurrence rate for each rule. To expedite the process, the analysis was automated using an Apache Ant build script which automatically executed each of the static analysis tools against the source code.
Table 10.6: Tempest Rule Violation Count with All Rules Enabled
File 1 2 3 4 5 6 7 8 9 10
ContentTag.java 0 1 33 2 25 334 14 243 14 13
DummyLogger.java 0 1 2 1 0 12 0 5 0 1
GetString.java 7 3 3 2 3 52 1 25 0 1
HTTPFile.java 0 9 32 4 28 263 33 189 14 15
HTTPString.java 4 31 79 9 134 464 85 1074 32 38
HeadString.java 5 1 3 2 0 30 1 13 0 1
Logger.java 0 10 8 2 13 130 6 97 4 3
MessageBuffer.java 0 5 12 0 0 63 3 38 0 1
NotFoundException.java 1 3 2 1 0 27 1 16 0 1
ObjectTag.java 3 1 25 4 79 514 23 355 22 19
PostString.java 8 4 3 2 0 37 2 20 6 2
RuntimeFlags.java 4 1 9 4 0 65 4 41 0 0
Tempest.java 40 10 60 7 145 873 34 656 44 21
TimeRFC1123.java 0 1 11 1 0 48 1 25 2 1
SomeClass.java 0 4 9 3 2 72 4 45 0 4
Total 72 85 291 44 429 3984 212 2842 138 121
Using this approach, however, had one significant drawback. Because all of the tools had all rules enabled, a significant number of violations were flagged, as is shown in Table 10.6. The 8218 warnings which were flagged by the analysis tools unfortunately overwhelmed the internal SoSART database engine, and this complete
Table 10.7: Tempest Rule Violation Densities with All Rules Enabled
File LOC 1 2 3 4 5 6 7 8 9 10
ContentTag.java 300 0.000 0.003 0.110 0.006 0.083 1.11 0.046 0.810 0.046 0.043
DummyLogger.java 1 0.000 1.000 2.000 1.000 0.000 12.0 0.000 5.00 0.000 1.000
GetString.java 16 0.437 0.187 0.187 0.125 0.187 3.25 0.062 1.56 0.000 0.062
HTTPFile.java 145 0.000 0.062 0.220 0.027 0.193 1.81 0.227 1.30 0.096 0.103
HTTPString.java 797 0.005 0.038 0.099 0.011 0.168 1.83 0.106 1.34 0.040 0.047
HeadString.java 5 1.000 0.200 0.600 0.400 0.000 6.00 0.200 2.60 0.000 0.200
Logger.java 74 0.000 0.135 0.108 0.027 0.175 1.75 0.081 1.31 0.054 0.040
MessageBuffer.java 21 0.000 0.238 0.571 0.000 0.000 3.00 0.142 1.80 0.000 0.047
NotFoundException.java 7 0.142 0.428 0.285 0.142 0.000 3.85 0.142 2.28 0.000 0.142
ObjectTag.java 300 0.010 0.003 0.083 0.013 0.263 1.71 0.076 1.18 0.073 0.063
PostString.java 12 0.666 0.333 0.250 0.166 0.000 3.08 0.166 1.66 0.500 0.166
RuntimeFlags.java 16 0.250 0.062 0.562 0.250 0.000 4.06 0.250 2.56 0.000 0.000
Tempest.java 552 0.072 0.018 0.108 0.012 0.262 1.58 0.061 1.18 0.079 0.038
TimeRFC1123.java 17 0.000 0.058 0.647 0.058 0.000 2.82 0.058 1.47 0.117 0.058
SomeClass.java 27 0.000 0.148 0.333 0.111 0.074 2.66 0.148 1.66 0.000 0.148
Total 1989 0.036 0.042 0.146 0.022 0.215 2.00 0.106 1.42 0.069 0.060
analysis could not occur. On average, 4.1317 warnings were issued for each line of code (8218 warnings over 1989 lines of code).
To avoid this problem, it was necessary to configure each tool independently in or-
der to filter out those warnings which would not be capable of causing a direct system
failure. The exercise was conducted using the methodology described in Schilling and
Alam[SA06c]. All in all, once all ten of the tools were properly configured, 56.7% of
the rules had been disabled as being either stylistic in nature or otherwise represent-
ing faults which would not result in a system failure based upon the characteristics
of the detected fault. As is shown in Table 10.8, the percentage of rules disabled varied from tool to tool.
Once each of the tools had been properly configured and inappropriate warnings
had been removed from analysis, the static analysis tools were re-executed using the
newly created configuration profiles, resulting in a total of 1967 warnings being issued by the tools.
Once the statically detectable faults had been detected by the static analysis tools,
these outputs were analyzed using the SOSART tool and assessed for their validity.
Of the 1967 warnings detected, 456, or 23.1%, were deemed to be valid faults which
had the potential of causing some form of systemic operational degradation which
Using the SOSART tool, and assigning all execution paths the likelihood of “Normal” for their execution rate, the base reliability for the Tempest software was estimated. Then, using the execution traces captured under the usage options provided in Table 10.4, a set of estimated reliabilities was obtained,
as is shown in Table 10.10. It is important to note that the first three use cases result
in the exact same reliability estimation. This is caused by the fact that the execution profiles upon which the estimates are based are virtually identical. Execution
profiles for use cases 1 and 2 differ by only 9 execution points out of a total of 234
execution points, and do not include any different method invocations. Furthermore,
each of these differences can be attributed to a single logical change within a method
in that in the first profile one branch is taken for a given decision but in the second
example a different branch of execution occurs. When making the same comparison
between the first execution profile and the third execution profile, there is a net total
of 41 branch locations which are different out of a total of 301. However, the exact
same methods are still invoked as are invoked in the first two profiles.
In the fourth case, which has a lower reliability score, there are 69 execution point
locations which differ between the first and the fourth execution profiles. However,
more importantly, the fourth execution profile includes six method invocations in
classes which are not even used in the first two execution profiles. Thus, it can
clearly be justified that the first three execution profiles, given the granularity of
measurement for this experiment, will have identical reliability values while the fourth
use case will have a different reliability estimation due to the difference in execution
profiles.
Once all four reliabilities have been calculated, a comparison between the mech-
anisms can be obtained. In the first three use cases, the reliability estimated by
SOSART and the reliability calculated based upon field testing were quite similar,
with values of 0.9898 and 0.9840 respectively, representing a 0.5% difference. In the
case of the fourth use case, SOSART estimated a reliability of 0.9757 while the actual measured reliability was 0.8109.
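For reference, the relative differences implied by these figures are (0.9898 − 0.9840)/0.9840 ≈ 0.6% for the first three use cases and (0.9757 − 0.8109)/0.8109 ≈ 20% for the fourth.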
11.1 Conclusions
The problem of software reliability is vast and ever growing. As more and more
complex electronic devices rely further upon software for fundamental functionality,
the impact of software failure becomes greater. Market forces, however, have made
it more difficult to measure software reliability through traditional means. The reuse
of previously developed components, the emergence of open source software, and the
purchase of developed software has made delivering a reliable final product more
difficult.
The first chapter of this dissertation emphasized the need to investigate software
reliability. This need is urgent, as the cost of failure to the American economy is significant. Software problems have surpassed hardware as the principal source of system failure by a significant margin.
To provide justification for study, it was important to analyze past failures. Post-mortem analysis of failure has been a common mechanism in other engineering fields,
yet has generally been lacking in the area of software engineering. Because of this
failure to obtain a historical perspective, there have been many failure modes which
repeatedly recur in different products. To this end, numerous case studies of failure
were presented to introduce the subject. Part of this presentation included whether
software static analysis tools would have been capable of detecting the fault and thus
preventing the failure. It was found that in a significant number of cases the fault
that ultimately led to the failure of the system was statically detectable.
Next, the relationship between software reliability and statically detectable faults was examined. While static analysis can not detect every fault within a program, it has been shown to be effective at detecting many classes of faults.
A reliability model was then presented which targets the estimation of reliability for existing software. Traditional
software reliability models require significant data collection during development and
testing, including the operational time between failures, the severity of the failures,
code coverage during testing, and other metrics. In the case of COTS software purchases or open source code, this development data is often not readily available, making it impossible for traditional models to estimate the reliability of a software module. This reliability model does not suffer from this limitation, as it only requires black box testing and static analysis of the source code, incorporating the path coverage obtained during limited testing, the structure of the
source code, and results from multiple static analysis tools combined using a meta
tool.
Next, it was necessary to establish that static analysis tools can effectively find
faults within a Java program. This was established through the development of a
validation suite which proved that static analysis tools can be effective at finding
faults seeded within a validation suite. Overall, ten different analysis tools were used
to find 50 seeded faults, and 82% of the faults were detectable by one or more tools.
More importantly, 44% of the faults were detected by two or more static analysis
tools, indicating that multiple tools may aid in the reduction of false positives from the analysis.
The static analysis tool effectiveness experiment also emphasized the importance
of proper tool configuration. In this case, the number of valid warnings was dwarfed
by the number of false positives detected which were incapable of causing a software
failure. However, it was also found that these false positives were often limited to a small set of rules.
In order for this reliability model to be applied to software, a reliability toolkit was
constructed to allow the user to apply the reliability model to non-trivial projects.
An overview of the requirements for this tool as well as the tool usage was provided.
Proof of concept validation for the reliability model was presented in two exper-
iments. In the first experiment, the results of the SOSART reliability model were
compared with the results from the STREW metrics reliability model. In all cases,
the SOSART estimates were determined to be within the confidence interval for the
accuracy of the STREW metrics model, and typically less than 2% away from the es-
timate of the STREW metrics. In the second experiment, the results of applying the model to the Tempest web server were presented. In that experiment, an existing program is assessed for its operational reliability on the basis of four dif-
ferent sets of configuration parameters. Three of the parameter sets are found to have
identical reliabilities while the fourth is found to have a lesser reliability value. This
was both predicted by the SOSART tool and validated by the experimental results.
While the SOSART results exhibit slightly larger error than would be desired, this error is acceptable for a proof of concept.
These experiments provided the required proof of concept for the validity of this model, namely that it is possible to obtain a system level estimate of software reliability from static analysis tool execution coupled with limited execution tracing. The method was also shown to be cost effective in that the effort required to apply this method is less than what would be required for a complete peer review of the source code.
11.2 Future Work
This dissertation has put forth a proof of concept experiment demonstrating that static analysis
can be used to estimate the reliability of a given software package. However, while
this work has been successful as a proof of concept, there are many areas which need
to be further investigated.
First, in terms of the Bayesian Belief Network, we realize that our network is a simplification of reality.
While we considered clustering effects at the method and file level, it is known
that clustering occurs at the package and project levels as well. What we do not
know is the relationship between clustering at the various levels. For example, is
clustering at the method level more indicative of a valid cluster versus clustering at
the package level? How does clustering change as a module undergoes revision by maintenance programmers?
The impact of independent validation also needs further assessment. While our
results indicated that multiple tools often did detect the same faults, we also saw
multiple false positives being detected as well. Is the assumption that was made re-
garding the independence of algorithms truly valid? While we would like to believe
that independent commercial tools should be independent, the research of Knight and
Leveson [KL86] [KL90] indicates that N-version programming may not result in as much independence as would be anticipated, for while the versions are developed independently, the faults they contain are often correlated.
Our Bayesian Belief Network also simplifies the concept of a maintenance risk. It
is known that certain code constructs, such as missing braces for if constructs, often lead to faults being introduced during later maintenance. What is not known is, quantitatively, the risk that this poses over time. Our model simply indicates that a coded construct either is or is not a maintenance risk. Yet this parameter more
than likely has some degree of variability associated with it and should be modeled
in the same manner as the Immediate Failure Risk. Clearly additional research using
lessons learned databases and difference analysis of existing program modules across
revisions as field defects are fixed may be capable of addressing this issue.
Similar areas for research exist on the code coverage side of the Bayesian Belief
Network. We know that at a high level the assumptions used to generate this model are valid, but in many cases the exact relationship has not been definitively shown.
The SOSART tool clearly needs additional performance tuning and development.
Due to technical limitations, it was incapable of analyzing programs much larger than
approximately 2 KLOC. While that was acceptable for a proof of concept application
in which smaller programs were used, it is imperative that the tool be capable of
efficiently and reliably analyzing larger programs as well. It may be advisable that
the GUI used for fault analysis be separated from the mathematical model used to
calculate reliability, thus saving memory and allowing for distributed analysis with the tool.
The model also needs a significant analysis in terms of its granularity. While in
this experiment the software programs typically exhibited what would be considered
relatively low reliabilities, the tool itself suffered from numerical granularity problems
in calculation which seemingly preclude its utilization with higher reliability modules.
This may be an effect of the network itself, or it may be an effect of the definitions of the node states and the relationships present in the model. While each of the software packages assessed in this
dissertation has used the same network coefficients, it is highly probable that the
relationships between nodes are not necessarily the same for all developed software.
The work of Nagappan et al.[NBZ06] indicates that there is no single set of predictive
metrics which can be used to estimate field failure rates. We believe that these con-
clusions also apply to our model, and that there may be multiple relationships which
are specific to the project domain or project family. While our model is targeted at
Embedded Systems, the bulk of the validation occurred with non-embedded applications. Additional validation within the embedded domain is needed, as well as the appropriate calibration using the built in calibration parameters for the model. Furthermore, it is often a risk management issue which decides whether known statically detectable
faults are removed between releases of a given project. It may be possible to relate the
change in software reliability between releases with the change in statically detectable faults between those releases.
Lastly, for static analysis (or any other software engineering method) to be com-
mercially acceptable, it must be cost effective. While this research looked at cost in
terms of time, this is not entirely accurate. There are direct monetary costs asso-
ciated with static analysis unrelated to effort, including licensing fees, tool configu-
ration, training exercises, and others. In order for this method to be viable in the marketplace, it must fit within the existing practices and budgets of practicing software engineers. This has been one of the goals of this research. The proof of concept model presented here appears promising, although it has only been validated on a small set of programs.
[Agu02] Joy M. Agustin. JBlanket: Support for Extreme Coverage in Java Unit
[AH04] Cyrille Artho and Klaus Havelund. Applying JLint to space exploration
[All02] Eric Allen. Bug Patterns in Java. Apress, September 2002. ISBN: 1-
59059-061-9.
[Ana04] Charles River Analytics. About Bayesian Belief Networks. Charles River
[And96] Tom Anderson. Ariane 501. E-mail on safety critical mailing list., July
1996.
[Arn00] Douglas N. Arnold. The Patriot Missile Failure. Website, August 2000.
[BDG+ 04] Guillaume Brat, Doron Drusinsky, Dimitra Giannakopoulou, Allen Gold-
[Ben06] David Benson. JGraph and JGraph Layout Pro User Manual, December
2006.
[BK03] Guillaume Brat and Roger Klemm. Static analysis of the Mars explo-
2003.
[BL06] Steve Barriault and Marc Lalo. Tutorial: How to statically ensure soft-
[Blo01] Joshua Bloch. Effective Java programming Language Guide. Sun Mi-
[BR02] Thomas Ball and Sriram K. Rajamani. The SLAM project: debugging
[Bro04] Matthew Broersma. Microsoft server crash nearly causes 800-plane pile
[BV03] Guillaume Brat and Arnaud Venet. Static program analysis using Ab-
[Car92] Ralph V. Carlone. GAO report: Patriot missile defense - software problem
[CM04] Brian Chess and Gary McGraw. Static analysis for security. IEEE Secu-
http://www.bullseye.com/coverage.
http://www.cs.cmu.edu/˜javabayes/EBayes/Doc/, 1999.
2001.
[Dar88] Ian F. Darwin. Checking C Programs with Lint. O’Reilly and Associates,
[Dew90] Philip Elmer Dewitt. Ghost in the machine. Time, pages 58–59, January
29 1990.
[DZN+ 04] Martin Davidsson, Jiang Zheng, Nachiappan Nagappan, Laurie Williams,
[EGHT94] David Evans, John Guttag, James Horning, and Yang Meng Tan. LCLint:
[EKN98] William Everett, Samuel Keene, and Allen Nikora. Applying software
[EL03] David Evans and David Larochelle. Splint Manual. Secure Programming
2003.
[ELC04] The Economic Impacts of the August 2003 Blackout. Technical report,
[Eng05] Dawson R. Engler. Static analysis versus model checking for bug finding.
[FCJ04] Thomas Flowers, Curtis A. Carver, and James Jackson. Empowering stu-
[FGMP95] Fabio Del Frate, Praerit Garg, Aditya P. Mathur, and Alberto Pasquini.
[For05] Jeff Forristal. Source-code assessment tools kill bugs dead. Secure En-
[FPG94] Norman Fenton, Shari Lawrence Pfleeger, and Robert L. Glass. Science
95, 1994.
[Gan00] Jack Ganssle. Crash and burn: Disasters and what we can learn from
[Gan01] Jack Ganssle. The best ideas for developing better firmware faster. Tech-
2002.
November 11 2004.
[Gar95] Praerit Garg. On code coverage and software reliability. Master’s thesis,
[Gep04] Linda Geppert. Lost radio contact leaves pilots on their own. IEEE
[Ger04] Andy German. Software static code analysis lessons learned. Crosstalk,
16(11):13–17, 2004.
[GH01] Bjorn Axel Gran and Atte Helminen. A Bayesian Belief Network for
[GJC+ 03] Vinod Ganapathy, Somesh Jha, David Chandler, David Melski, and
[GJSB00] James Gosling, Bill Joy, Guy L. Steele, and Gilad Bracha. The Java
[Gle96] James Gleick. A bug and a crash. New York Times Magazine, December
1996.
[Gra86] Jim Gray. Why do computers stop and what can be done about it? Proc.
[Gri04a] Chris Grindstaff. Findbugs, part 1: Improve the quality of your code
1997.
[Hac04] Mark Hachman. NASA: DOS glitch nearly killed Mars rover. Extreme-
[Hal99] Todd Halvorson. Air Force Titan 4 rocket program suffers another failure.
[Hat99a] Les Hatton. Ariane 5: A smashing success. Software Testing and Quality
[Hat99b] Les Hatton. Software faults and failures: Avoiding the avoidable and
living with the rest. Draft text from “Safer Testing” Course, December
1999.
[HD03] Elise Hewett and Paul DiPalma. A survey of static and dynamic analyzer
[HFGO94] Monica Hutchins, Herb Foster, Tarak Goradia, and Thomas Ostrand.
[HJv00] Marieke Huisman, Bart Jacobs, and Joachim van den Berg. A case study
R0007, 2000.
[HL02] Sudheendra Hangal and Monica S. Lam. Tracking down software bugs
[HLL94] Joseph R. Horgan, Saul London, and Michael R. Lyu. Achieving software
September 1994.
[Hof99] Eric J. Hoffman. The NEAR rendezvous burn anomaly of december 1998.
[Hol99] C. Michael Holloway. From bridges to rockets: Lessons for software sys-
[Hol04] Ralf Holly. Lint metrics and ALOA. C/C++ Users Journal, pages 18–22,
June 2004.
[HP00] Klaus Havelund and Thomas Pressburger. Model checking JAVA pro-
[HV93] Dick Hamlet and Jeff Voas. Faults on its sleeve: amplifying software
[HW06] Sarah Heckman and Laurie Williams. Automated adaptive ranking and
can National Standards Institute, 25 West 43rd Street, New York, New
[Jel04] Rick Jelliffe. Mini-review of Java bug finders. The O’Reilly Network,
March 15 2004.
[JM97] Jean-Marc Jezequel and Bertrand Meyer. Design by contract: The lessons
[Kan95] Cem Kaner. Software negligence and testing coverage. Technical report,
[KAYE04] Ted Kremenek, Ken Ashcraft, Junfeng Yang, and Dawson Engler. Corre-
2004.
[KL90] John C. Knight and Nancy G. Leveson. A reply to the criticisms of the
35, 1990.
[Koc04] Christopher Koch. Bursting the CMM hype. Software Quality, March 1
2004.
[Lad96] Peter B. Ladkin. Excerpt from the Case Study of The Space Shuttle
[LAW+ 04] Kathryn Laskey, Ghazi Alghamdi, Xun Wang, Daniel Barbara, Tom
[LB05] Marc Lalo and Steve Barriault. Maximizing software reliability and de-
[LE01] David Larochelle and David Evans. Statically detecting likely buffer over-
[Lee94] S. C. Lee. How Clementine really failed and what NEAR can learn. John
1994.
[LG99] Craig Larman and Rhett Guthrie. Java 2 Performance and Idiom Guide.
1999.
[Lio96] J. L. Lions. Ariane 5 flight 501 failure report by the inquiry board.
Symposium, 2005.
[LLQ+ 05] Shan Lu, Zhenmin Li, Feng Qin, Lin Tan, Pin Zhou, and Yuanyuan Zhou.
June 2005.
[Mar99] Brian Marick. How to misuse code coverage. Technical report, Reliable
[MB06] Robert A. Martin and Sean Barnum. A status update: The common
[MCJ05] Robert A. Martin, Steven M. Christey, and Joe Jarzombek. The case for
[ME03] M. Musuvathi and D. Engler. Some lessons from using static analysis
[Mey92] Scott (Scott Douglas) Meyers. Effective C++: 50 specific ways to improve
MA 02116, 1992.
[MIO90] John D. Musa, Anthony Iannino, and Kazuhira Okumoto. Software Re-
[MIS98] MISRA-C guidelines for the use of the C language in critical systems. The
[MIS04] MISRA-C:2004 guidelines for the use of the C language in critical systems.
[MKBD00] Eric Monk, J. Paul Keller, Keith Bohnenberger, and Michael C. Daconta.
[MLB+ 94] Y.K. Malaiya, N. Li, J. Bieman, R. Karcich, and B. Skibbe. The relation-
ship between test coverage and reliability. In Proc. Int. Symp. Software
[MMZC06] Kevin Mattos, Christine Moreira, Mark Zingarelli, and Denis Coffey. The
[NB05] Nachiappan Nagappan and Thomas Ball. Static analysis tools as early
[NBZ06] Nachiappan Nagappan, Thomas Ball, and Andreas Zeller. Mining metrics
[Neu99] Peter G. Neumann. The risks digest. Online Digest of Computing Failures
[NF96] Martin Neil and Norman Fenton. Predicting software quality using
[NIS06] NIST. Source Code Analysis Tool Functional Specification. Technical re-
[NM96] Li Naixin and Y.K. Malaiya. Fault exposure ratio estimation and ap-
[NWV+ 04] Nachiappan Nagappan, Laurie Williams, Mladen Vouk, John Hudepohl,
[NWVO04] Nachiappan Nagappan, Laurie Williams, Mladen Vouk, and Jason Os-
[OS00] Emilie O’Connell and Hossein Saiedian. Can you trust software capability
[OWB04] Thomas J. Ostrand, Elaine J. Weyuker, and Robert M. Bell. Using static
[Pai01] Ganesh J Pai. Combining bayesian belief networks with fault trees to
[Pav99] J. G. Pavlovich. Formal report of investigation of the 30th April 1999 Ti-
[PD01] Ganesh J Pai and Joanne Bechta Dugan. Enhancing Software Relia-
[Pet94] Henry Petroski. Design Paradigms: Case Histories of Error and Judge-
[Pil03] Daniel Pilaud. Finding run time errors without testing in embedded
[POC93] Paul Piwowarski, Mitsuru Ohba, and Joe Caruso. Coverage measurement
[Pou03] Kevin Poulsen. Nachi worm infected Diebold ATMs. The Register,
February 2004.
[Pou04b] Kevin Poulsen. Tracking the blackout bug. The Register, April 2004.
[Pro] The Programming Research Group. High Integrity C++ Coding Standard
http://www.toyo.co.jp/ss/customersv/doc/qac clinic1.pdf.
of bug finding tools for Java. In Proceedings of the 15th IEEE Symposium
[Rai05] Abhishek Rai. On the role of static analysis in operating system checking
[Ric00] Debra J. Richardson. Static analysis. ICS 224: Software Testing and
[Roo90] Paul Rook, editor. Software Reliability Handbook. Centre for Software
[SA05a] Walter Schilling and Mansoor Alam. A methodology for estimating soft-
neering, Chicago, IL, November 2005. IEEE Computer Society and IEEE
Reliability Society.
[SA05b] Walter Schilling and Mansoor Alam. Work In Progress - Measuring the
[SA06a] Walter Schilling and Dr. Mansoor Alam. The software static analysis
NC, November 2006. IEEE Computer Society and IEEE Reliability So-
ciety.
[SA06b] Walter Schilling and Mansoor Alam. Estimating software reliability with
Applications (ISCA).
[SA06c] Walter Schilling and Mansoor Alam. Integrate static analysis into a
November 2006.
[SA06d] Walter Schilling and Mansoor Alam. Modeling the reliability of existing
[SA07a] Walter Schilling and Mansoor Alam. Measuring the reliability of exist-
IV.
[Sch04a] Walter Schilling. Issues effecting the readiness of the Java language for
usage in safety critical real time systems. Submitted to fulfill partial re-
[SDWV05] Michele Strom, Martin Davidson, Laurie Williams, and Mladen Vouk.
2005.
[Sha06] Lui Sha. The complexity challenge in modern avionics software. In Na-
1996.
[SK95] Hossein Saiedian and Richard Kuzara. SEI Capability Maturity Model’s
[Sla98b] Gregory Slabodkin. Software glitches leave navy smart ship dead in the
[Sop01] Joe Sopko. CTC195, 197, 203, No Sound. National Electronic Service
[SU99] Curt Smith and Craig Uber. Experience report on early software reli-
[SWA+ 00] Donald Savage, Helen Worth, Diane E. Ainsworth, George Diller, and
[SWX05] Sarah E. Smith, Laurie Williams, and Jun Xu. Expediting Program-
Illinois, 2005.
[Sys02] QA Systems. Overview large Java project code quality analysis. Technical
[Tha96] Henrik Thane. Safe and reliable computer control systems: Concepts and
[Tri02] Kishor S. Trivedi. Probability and Statistics with Reliability, Queuing and
[VB04] Arnaud Venet and Guillaume Brat. Precise and efficient static array
[vGB99] Jilles van Gurp and Jan Bosch. Using Bayesian Belief Networks in As-
München, 2004.
[Wal04] Matthew L. Wald. Maintenance lapse blamed for air traffic control problem.
[WJKT05] Stefan Wagner, Jan Jürjens, Claudia Koller, and Peter Trischberger. Com-
paring bug finding tools with reviews and tests. In Proceedings of Testing
Verlag GmbH.
[Woe99] Jack J. Woehr. A conversation with Glenn Reeves: Really remote debug-
[XNHA05] Yichen Xie, Mayur Naik, Brian Hackett, and Alex Aiken. Soundness and
[XP04] Shu Xiao and Christopher H. Pham. Performing high efficiency source
355, 2004.
[YB98] David York and Maria Babula. Virtual interactive classroom: A new
1998.
[YP99] David York and Joseph Ponyik. New Web Server - the Java version of
[ZLL04] Misha Zitser, Richard Lippmann, and Tim Leek. Testing static analysis tools using exploitable buffer overflows from open source code. SIGSOFT Software Engineering Notes, 2004.
[ZWN+06] Jiang Zheng, Laurie Williams, Nachiappan Nagappan, Will Snipes, John P. Hudepohl, and Mladen A. Vouk. On the value of static analysis for fault detection in software. IEEE Transactions on Software Engineering, 32(4), April 2006.
Fault Taxonomy
SOSART Requirements
F.1
Requirement: The SOSART analysis tool shall be capable of loading any Java 1.4.2 compliant source code module and creating UML based activity diagrams / control flow diagrams for each method within the source code module.
Rationale: The first project to be analyzed requires Java support.
F.2
Requirement: Based upon a loaded source code module, SOSART shall be capable of generating watchpoints which can be used to collect execution profiles. Watchpoints shall be located at the entry to each and every code block, as well as at all return statements.
Rationale: Execution traces require indications of the location of the start of each code block, and this must be obtained by structurally analyzing the source code module.
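To make the watchpoint placement concrete, the following sketch shows a method after hypothetical instrumentation. The TraceLog class and the watchpoint identifier strings are invented for illustration; they are not the actual SOSART interfaces. A watchpoint call marks the entry of every code block and every return statement.

    // Hypothetical example of a method after watchpoint insertion.
    // TraceLog and the identifier strings are illustrative only.
    public class Instrumented {
        static int classify(int x) {
            TraceLog.watch("classify:entry");       // method entry block
            if (x < 0) {
                TraceLog.watch("classify:then");    // entry of the then block
                TraceLog.watch("classify:return1"); // return statement
                return -1;
            }
            TraceLog.watch("classify:return2");     // return statement
            return 1;
        }
    }

    // Minimal logger used by the sketch; a real trace generator
    // would write to a file for later import (see F.18).
    class TraceLog {
        static void watch(String id) {
            System.out.println(id);
        }
    }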
F.3
Requirement: The SOSART tool shall be capable of visualizing execution traces which have been captured by the tool. Visualization shall occur on the activity diagram / flow diagram representation of the program.
Rationale: Aids in understanding the execution flow and its relationship to the various code blocks within the method.

F.3.1
Requirement: The number of times a given path has been executed shall be displayed on the activity diagram when an execution path is added to the display.
Rationale: Usability and understanding of the executed paths.
F.4
Requirement: The SOSART tool shall have the capability to store, external to the program, a historical database which represents all warnings which have been analyzed during program execution.
Rationale: Allows historical trending to be used to improve the accuracy of the model.

F.4.1
Requirement: Transfer to the historical database shall be at the operator's command.
Rationale: Prevents erroneous analysis results from being transferred into the historical database.

F.4.2
Requirement: The SOSART tool shall allow the operator to clear the historical database as is necessary.
Rationale: This allows projects of entirely different scope to be analyzed without data contamination by projects from different domains.

F.4.3
Requirement: The SOSART tool shall allow the operator to store and retrieve different historical databases as is necessary.
Rationale: This allows projects from different domains to be analyzed independently.
F.5
Requirement: SOSART shall be capable of calculating Cyclomatic Complexity and Static Path count on a per method basis.
Rationale: These are basic metrics which should be readily available when analyzing COTS and other developed software.
F.6
Requirement: SOSART shall be capable of importing static analysis warnings and displaying the warnings on the generated activity diagrams / control flow diagrams.
Rationale: Visualizes static analysis warnings relative to the execution profile obtained during program execution.
F.6.1
Requirement: SOSART shall be capable of interfacing at minimum with the following Java static analysis tools:
1. JLint(1)
2. ESC/Java
3. FindBugs
4. Fortify SCA
5. PMD
6. QAJ
7. Lint4J
8. JiveLint
9. Klocwork K7
Rationale: These are commonly available static analysis tools for Java which have been shown to be reliable and thus serve as a starting set for this analysis.
F.7
Requirement: The SOSART GUI tool shall support commonly existing document interface behavior, including but not limited to image zooming, tiling, and printing of generated graphics.
Rationale: These are standard GUI behaviors expected in a completed analysis tool.
F.7.1
Requirement: SOSART shall automatically re-lay out the graphics display when necessary, but shall also have a button to force the graphics to be re-laid out using the built-in algorithm.
Rationale: Allows the user to force a re-layout of the display if a program failure occurs which prevents the proper display of the activity diagram graph.
F.8
Requirement: The SOSART tool shall provide the capability to export generated graphics into a standard file format for graphics.
Rationale: This allows generated graphics to be imported into reports and other documents as is necessary.
F.9
Requirement: SOSART shall allow the user to save projects and generated metrics for future usage.
Rationale: Ease of use for long term projects.
F.10
Requirement: The SOSART tool shall allow the user to save analyzed static analysis warnings separate from the project. Warnings saved as such shall be recoverable with all attributes set to the values modified by the user during analysis.
Rationale: This will allow larger projects to be analyzed which may not be storable in their complete format due to limitations of XML persistence within Java.
F.11
Requirement: SOSART shall use a configuration file which shall store configuration parameters for the tool across multiple projects.
Rationale: Allows the user to store common parameters and commands within a configuration file so that they do not need to be set when invoking the tool from the command line.

F.11.1
Requirement: Configuration data shall be stored in the XML format.
Rationale: XML is a standard markup language.
F.12
Requirement: SOSART shall categorize imported static analysis warnings based upon a defined taxonomy.
Rationale: Required for the proper characterization of warnings.

F.12.1
Requirement: SOSART shall support the CWE and SAMATE taxonomies as well as a custom developed taxonomy for SOSART.
Rationale: CWE and SAMATE are two existing taxonomies for static analysis warnings. However, these taxonomies target security as opposed to the larger domain of static analysis tools.

F.12.2
Requirement: All static analysis warnings shall be categorized into the appropriate taxonomy upon importation if the categorization has not already occurred.
Rationale: This allows the most efficient categorization of faults to the taxonomy definitions, as they are categorized by the user when a new instance is detected.

F.12.3
Requirement: Fault taxonomy assignments shall be viewable within the SOSART tool by the operator.
Rationale: This allows the user to view existing taxonomy assignments as is necessary.
F.13
Requirement: The SOSART tool shall calculate the estimated software reliability using the reliability model developed by Schilling and Alam [SA06d] [SA06b] [SA05a].
Rationale: This is one of the fundamental purposes for the tool.

F.13.1
Requirement: The reliability shall be shown in a textual format which can be saved to a file.
Rationale: Allows storage of the reliability and importation into external reports.
F.14
Requirement: The SOSART tool shall be capable of generating reports based upon the statically detectable faults imported into the tool.
Rationale: Basic core functionality required for the tool to serve as a bug finding meta tool.

F.14.1
Requirement: The SOSART tool shall provide historical reports, project level reports, and file level reports.
Rationale: These represent the three major classifications of faults supported by SOSART, as a fault is either historical in nature (in that it is from a previous project), part of a project (which consists of multiple files), or part of a file.
F.15
Requirement: The SOSART tool shall allow for faults which are not directly attributable to a given method to be stored at the class level.
Rationale: Certain static analysis faults may be located in such a manner that they are not related to a given method but are directly connected with the class declaration. While these faults do not directly play into this reliability model, they should be kept for metrics purposes.
F.16
Requirement: The SOSART tool shall allow all file faults to be viewed in a listing separate from the individual method displays.
Rationale: Under certain circumstances, it may be beneficial to visualize faults as a list instead of on the activity diagrams.
F.17
Requirement: SOSART shall be capable of exporting data into XML format for importation into external programs.
Rationale: XML is a standard language for interfacing between data systems.
F.18
Requirement: The SOSART toolset shall contain a trace generator capable of logging branch execution traces in a manner which can be imported into the SOSART tool.
Rationale: This is necessary in order to log execution traces for model usage.
(1) Currently, the JLint tool does not include support for XML output. In order to interface properly with the SOSART tool, JLint will need to be improved to include XML output.
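A trace generator of the kind F.18 calls for can be sketched as a small logger that appends one line per branch event. The class and the comma-separated record format below are assumptions for illustration, not the actual SOSART trace format.

    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;

    // Illustrative branch-trace generator (see F.18 and F.2).  The
    // one-line-per-event text format is an assumption; the real
    // SOSART import format may differ.
    public class TraceGenerator {
        private PrintWriter out;

        public TraceGenerator(String fileName) throws IOException {
            out = new PrintWriter(new FileWriter(fileName));
        }

        // Called from each instrumented branch point.
        public void branchTaken(String method, int blockId) {
            out.println(System.currentTimeMillis() + "," + method + "," + blockId);
        }

        public void close() {
            out.close();
        }
    }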
D.1
Requirement: The SOSART tool, when practical, shall be developed using the Personal Software Process.
Rationale: The PSP represents a well documented approach to quality software development for small, individual projects. This is essential to ensure a reliably delivered analysis tool. However, being that this is a research tool being developed using other researchers' code as modules, it may not be feasible to follow a strict PSP process during development.
D.1.1
Requirement: Development data shall be collected using the Software Process Dashboard.
Rationale: Reliably following the PSP is simplified by the usage of tools which automatically collect the requisite data. By doing this, individual mistakes can be reduced.
D.2
Requirement: Design documentation for the SOSART tool shall be created using the UML design format.
Rationale: UML represents a standard approach for software design which is readily understood by practitioners.
D.3
Requirement: Source code developed for the SOSART tool shall be verified for coding standards compliance using the Checkstyle tool.
Rationale: Enforcement of coding standards during development has been shown to reduce the number of defects in a final delivered product. Checkstyle helps to prevent common Java programming mistakes as well as ensuring consistent style.
D.3.1
Requirement: The SOSART tool shall compile without any compiler warnings in the hand-coded segments.
Rationale: This ensures that any compiler detected problems have been removed. While it is desirable to remove compiler warnings from the automatically generated code, this may not be feasible given the limitations of automatic code generation.
D.4
Requirement: SOSART source code, design documentation, and other materials shall be kept under version management at all times through development.
Rationale: Appropriate software engineering practice.
D.4.1
Requirement: SOSART shall use the CVS version management system for all configuration management practices.
Rationale: CVS is readily available as an open source project and is well supported and extremely extensible.
D.5
Requirement: SOSART shall be released through the SourceForge site.
Rationale: SourceForge is a readily available distribution site which is commonly used for Open Source programs.
I.1
Requirement: The SOSART tool shall be implemented using the Java programming language.
Rationale: Java allows for easy development of a GUI, is portable, and is an appropriate language for research-based tool development(2).
I.2
Requirement: Source code parsing shall be accomplished through the use of the ANTLR parser.
Rationale: The ANTLR parser is a readily available tool which can be easily distributed. It is also well documented and has an extensive record of successful usage.
I.3
Requirement: The SOSART tool shall not use any constructs which would limit the portability of the tool to a given environment.
Rationale: It is important to develop a tool which is portable across multiple development platforms.
I.4
Requirement: Java 1.4.2 shall be used for tool development.
Rationale: Many higher end UNIX systems do not support Java versions newer than 1.4.2 and thus would be unable to run the tool if newer Java constructs are used.
(2) This is in contrast to embedded systems development in which Java is generally