
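Bonfring International Journal of Data Mining, Vol. 2, No. 2, June 2012, p. 52
ISSN 2277 - 5048 | 2012 Bonfring

Estimation of Area under the ROC Curve Using Exponential and Weibull Distributions
R. Vishnu Vardhan, Sudesh Pundir and G. Sameera
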
Abstract--- In recent years, Receiver Operating Characteristic (ROC) curves have received much attention in medical diagnosis for classifying subjects into one of two groups. Many researchers have provided the mathematical formulation of the curve by assuming a specific distribution; conventionally, most of this work assumes the normal distribution. In this paper, we focus on estimating the ROC curve and the Area Under the Curve (AUC) using the Exponential and Weibull distributions. As these distributions are important in life testing problems, the performance of their ROC forms is studied and the results are compared with the conventional Binormal ROC form. The entire study was carried out using real and simulated data sets. Overall, the results suggest that the Binormal ROC form is far better than the other two, and that the Biexponential form is better than the Biweibull form of the ROC curve.
Keywords--- ROC Curve, Binormal, Biexponential and
Biweibull Distributions, Area Under the Curve

I. INTRODUCTION
ROC curves are among the most widely used statistical techniques for classification, both to estimate the accuracy of a diagnostic test and to identify the best threshold. In recent years, they have received much attention in the field of medical diagnosis [3], [5], [7], [9], [10], [11], [13], [14]. The main step in diagnosing any disease is to evaluate the performance of the test and to identify the biomarker that helps in classifying individuals as healthy or diseased. It is equally important for researchers to understand how best a test's performance can be evaluated, and ROC curves are an important technique for this purpose. With appropriate use of ROC curves, the test performance can be improved further towards the ideal classification, although in most situations this ideal is not attained [6]. Statistically speaking, ROC curves are used to characterize the accuracy of a diagnostic test and to compare the accuracies of two diagnostic tests [4].
So far, many researchers have assumed that both the healthy and diseased groups follow the Normal distribution and have carried out the estimation and fitting procedure of ROC curve analysis on that basis [2], [8], [12]. In most cases, however, data collected in a realistic environment may or may not follow the Normal distribution. In this paper, we attempt to fit the ROC curve and to estimate its AUC by assuming that the individuals of both groups follow Exponential and Weibull distributions; we call the resulting ROC models the Biexponential and Biweibull models. We first discuss the estimation procedure for the Binormal model and then propose the fitting and estimation procedures for the Biexponential and Biweibull ROC models. In the results and discussions section, the performance and accuracy measures of these distributions are studied using simulated and real data sets.

R. Vishnu Vardhan, Assistant Professor, Department of Statistics, Pondicherry University, Puducherry, India. E-mail: rvvcrr@gmail.com
Sudesh Pundir, Assistant Professor, Department of Statistics, Pondicherry University, Puducherry, India. E-mail: sudeshpundir19@gmail.com
G. Sameera, M.Sc. Student, Department of Statistics, Pondicherry University, Puducherry, India. E-mail: samskutti@gmail.com

II. ESTIMATION OF ROC CURVE
Here, we briefly discuss the Binormal form of the ROC curve and then extend the discussion to the other two distributions considered in this work.
A. Binormal ROC Curve and AUC
Consider two populations, a diseased population (D) and a healthy population (H), together with a classification rule that assigns each individual to one of them. Two measures are needed to define an ROC curve: the true positive rate (TPR) and the false positive rate (FPR). In probabilistic terms, the TPR is the chance of correctly identifying subjects who actually have the disease; it is also referred to as the sensitivity of the diagnostic test. Similarly, the FPR is the chance of wrongly identifying subjects who do not have the disease as diseased; it is referred to as 1 - specificity. The ROC curve is a plot of TPR against FPR as the cutoff or threshold varies, so a fitted ROC curve has infinitely many cutoffs, each producing a pair (FPR, TPR). The curve has several statistical properties that are worth studying [6]. Firstly, the curve is a monotonically increasing function from (0, 0) to (1, 1). Secondly, the region below the curve is defined as the Area Under the ROC Curve (AUC); a curve lying above the chance diagonal has an AUC greater than 0.5. The AUC is another important accuracy measure used to assess the performance of a diagnostic test.
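
The construction of the curve from cutoffs can be illustrated with a minimal sketch. It is not the fitting procedure of this paper: it simply sweeps a cutoff over a small set of hypothetical scores (invented here for illustration), records the (FPR, TPR) pair at each cutoff, and integrates the resulting curve with the trapezoidal rule. The function names `empirical_roc` and `trapezoid_auc` are assumptions of the sketch.

```python
import numpy as np

def empirical_roc(scores_d, scores_h):
    """Return (fpr, tpr) arrays with one point per candidate cutoff."""
    scores_d = np.asarray(scores_d, dtype=float)
    scores_h = np.asarray(scores_h, dtype=float)
    observed = np.unique(np.concatenate((scores_d, scores_h)))
    # Cutoffs run from +inf down to -inf so that the curve starts at
    # (0, 0) and ends at (1, 1).
    cutoffs = np.concatenate(([np.inf], observed[::-1], [-np.inf]))
    # A subject is classified as diseased when its score exceeds the cutoff.
    tpr = np.array([np.mean(scores_d > t) for t in cutoffs])
    fpr = np.array([np.mean(scores_h > t) for t in cutoffs])
    return fpr, tpr

def trapezoid_auc(fpr, tpr):
    """Area under the piecewise-linear curve through the (fpr, tpr) points."""
    return float(np.sum(np.diff(fpr) * (tpr[:-1] + tpr[1:]) / 2.0))

# Hypothetical test scores, invented for illustration only.
diseased = [3.1, 3.6, 2.9, 4.2, 3.3]
healthy = [2.4, 2.7, 3.0, 2.1, 2.5]

fpr, tpr = empirical_roc(diseased, healthy)
auc = trapezoid_auc(fpr, tpr)
print(f"empirical AUC = {auc:.2f}")
```

In this toy example, 24 of the 25 diseased-healthy pairs are ordered correctly (only the diseased score 2.9 falls below the healthy score 3.0), so the empirical AUC is 0.96.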
If the curve approaches the top left-hand corner and lies far from the chance line, the corresponding cutoff gives a high percentage of accuracy for the diagnostic test. Assume that the test scores (S) follow normal distributions in both the diseased (D) and non-diseased (H) populations, with respective means $\mu_D$, $\mu_H$ and standard deviations $\sigma_D$, $\sigma_H$. Also assume that $\mu_D > \mu_H$, but no constraints are put on the standard deviations [1].
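The false positive rate with threshold t is given by

$$x(t) = P(S > t \mid H) = 1 - \Phi\left(\frac{t - \mu_H}{\sigma_H}\right) = \Phi\left(\frac{\mu_H - t}{\sigma_H}\right)$$

and the ROC curve is given as

$$y(x) = \Phi\left(a + b\,\Phi^{-1}(x)\right), \quad 0 \le x \le 1,$$

where $a = (\mu_D - \mu_H)/\sigma_D$, $b = \sigma_H/\sigma_D$ and $\Phi$ denotes the standard normal cumulative distribution function. It is clear that both a and b are non-negative.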

The accuracy of the test depends on how well it separates the group being tested into those with and without the disease. This accuracy is measured by the AUC, which is defined as

$$AUC = \int_0^1 y(x)\,dx$$

where y(x) denotes the ROC curve model for a particular distribution. An area of 1 represents a perfect test, and an area of 0.5 represents a worthless test, i.e., one with no discriminating power.

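The AUC of the ROC curve under the Binormal model is

$$AUC = \Phi\left(\frac{a}{\sqrt{1 + b^2}}\right)$$

with $\Phi$ the standard normal cumulative distribution function.
In the next subsections, we propose the ROC models for the Biexponential and Biweibull forms.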
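
As a hedged illustration, the Binormal ROC curve and its AUC can be evaluated with a few lines of Python. The sketch assumes the standard binormal parameterization a = (mu_D - mu_H)/sigma_D and b = sigma_H/sigma_D, so that y(x) = Phi(a + b Phi^{-1}(x)) and AUC = Phi(a / sqrt(1 + b^2)); the function names are chosen here and are not from the paper. The parameter values are those of the Binormal columns of Table 1, where both standard deviations equal 0.8.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1

def binormal_roc(x, mu_d, mu_h, sigma_d, sigma_h):
    """TPR at false positive rate x (0 < x < 1) under the Binormal model."""
    a = (mu_d - mu_h) / sigma_d
    b = sigma_h / sigma_d
    return STD_NORMAL.cdf(a + b * STD_NORMAL.inv_cdf(x))

def binormal_auc(mu_d, mu_h, sigma_d, sigma_h):
    """Closed-form AUC, Phi(a / sqrt(1 + b^2))."""
    a = (mu_d - mu_h) / sigma_d
    b = sigma_h / sigma_d
    return STD_NORMAL.cdf(a / sqrt(1.0 + b * b))

# Diseased means from Table 1; healthy mean 2.5, both sds 0.8.
binormal_aucs = {
    mu_d: binormal_auc(mu_d, 2.5, 0.8, 0.8) for mu_d in (3.4, 2.8, 4.6)
}
for mu_d, value in binormal_aucs.items():
    print(f"mu_D = {mu_d}: AUC = {value:.4f}")
```

The printed values can be compared with the Binormal AUC column of Table 1; they agree to about three decimal places, the remaining difference being a matter of rounding versus truncation in the last digit.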
B. Biexponential Model and AUC
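Assume that the scores (S) follow exponential distributions in both the diseased (D) and non-diseased (H) populations, with respective parameters $\lambda_D$ and $\lambda_H$.
The cumulative distribution function is given as

$$F(x) = 1 - e^{-\lambda x}$$

where $\lambda > 0$.
The false positive rate with threshold t is given by

$$x(t) = P(S > t \mid H) = e^{-\lambda_H t}$$

and the ROC curve is given as

$$y(t) = P(S > t \mid D) = e^{-\lambda_D t}.$$

Substituting $t = -\ln x / \lambda_H$ from the first equation into the second gives

$$y(x) = x^{\lambda_D / \lambda_H}, \quad 0 \le x \le 1.$$

The AUC for the Biexponential model is therefore

$$AUC = \int_0^1 x^{\lambda_D / \lambda_H}\,dx = \frac{\lambda_H}{\lambda_H + \lambda_D}.$$

Depending on the relationship between $\lambda_D$ and $\lambda_H$, the shape of the ROC curve changes.
In the next section, the ROC form under the Biweibull distribution is discussed.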
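
A short numerical check may help here. The sketch below assumes the exponential survival functions x = exp(-lambda_H t) and y = exp(-lambda_D t), which give the ROC curve y = x^(lambda_D/lambda_H) and the closed-form AUC lambda_H/(lambda_H + lambda_D); it then compares the closed form with a simple midpoint-rule integration of the curve. The helper `midpoint_integral` and the other function names are assumptions of the sketch, and the parameter values are the first Biexponential row of Table 1.

```python
def biexponential_roc(x, lam_d, lam_h):
    """TPR at false positive rate x under the Biexponential model."""
    return x ** (lam_d / lam_h)

def biexponential_auc(lam_d, lam_h):
    """Closed-form area under y = x ** (lam_d / lam_h) on [0, 1]."""
    return lam_h / (lam_h + lam_d)

def midpoint_integral(f, n=100_000):
    """Midpoint-rule approximation of the integral of f over [0, 1]."""
    h = 1.0 / n
    return h * sum(f((i + 0.5) * h) for i in range(n))

lam_d, lam_h = 0.2941, 0.4  # first Biexponential row of Table 1
auc_closed = biexponential_auc(lam_d, lam_h)
auc_numeric = midpoint_integral(lambda x: biexponential_roc(x, lam_d, lam_h))
print(f"closed form: {auc_closed:.4f}, numerical: {auc_numeric:.4f}")
```

Because the exponent lambda_D/lambda_H is below 1 when the diseased population has the larger mean, the curve lies above the chance diagonal and the AUC exceeds 0.5.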
C. Biweibull Model and AUC
Assume that the scores (S) follow Weibull distributions in both the diseased (D) and non-diseased (H) populations, with respective parameters $c_D$ and $c_H$.
The cumulative distribution function of the Weibull distribution involves the parameter c, where c > 0 and x > 0.
Now the false positive rate with threshold t is given by

$$x(t) = P(S > t \mid H) = 1 - P(S \le t \mid H)$$

and the ROC curve is given as

$$y(t) = P(S > t \mid D) = 1 - P(S \le t \mid D).$$

Substituting the value of t obtained from the first equation into the second expresses the ROC curve as a function of x.
Now the AUC for the Biweibull distribution is

$$AUC = \frac{c_D}{c_D + c_H}$$

Depending on the relationship between $c_D$ and $c_H$, the shape of the ROC curve changes.
III. RESULTS AND DISCUSSIONS
All computations in this paper were carried out on simulated as well as real datasets. We first give a detailed interpretation of the characteristics of the three distributions and the parameter values considered. To study the statistical properties of the Binormal, Biexponential and Biweibull ROC forms, a simulated environment was developed to observe the variability and performance of the three ROC forms (Figures 1, 2 and 3).
In the case of the Binormal ROC form, three typical possibilities are considered by fixing the mean of the healthy population and varying the mean of the diseased population, since variability in the response is usually observed in diseased rather than healthy subjects; no restrictions are placed on the standard deviations. Intuitively, as the distance between the means of D and H widens, better discrimination is achieved, and vice versa. If we consider $\mu_D$ = 3.4 and $\mu_H$ = 2.5, the AUC obtained is 0.786. Here the distance between the two means is moderate, and about 78.6% correct classification can be achieved.
Using the proposed functional forms of the AUC for the Biexponential and Biweibull ROC models, similar simulation studies were conducted. In the Biexponential model, the parameter of the healthy population is larger than that of the diseased population, since the mean of the exponential distribution is 1/$\lambda$. This does not affect the computation of the AUC in practice. For example, if we consider $\lambda_H$ = 0.4 and $\lambda_D$ = 0.294, the AUC expression produces a value of 0.576.
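
A simulation of this kind can be sketched as follows. This is not the authors' simulation code: it is a minimal Monte Carlo check, under the assumption that scores are drawn independently from exponential distributions with the stated parameters, that the proportion of correctly ordered diseased-healthy pairs (the Mann-Whitney estimate of the AUC) is close to lambda_H/(lambda_H + lambda_D). The seed, the sample size and the function name `mann_whitney_auc` are arbitrary choices made for the illustration.

```python
import numpy as np

def mann_whitney_auc(scores_d, scores_h):
    """Estimate P(S_D > S_H) over all pairs, counting ties as one half."""
    d = np.asarray(scores_d, dtype=float)[:, None]
    h = np.asarray(scores_h, dtype=float)[None, :]
    return float(np.mean((d > h) + 0.5 * (d == h)))

rng = np.random.default_rng(2012)  # arbitrary seed, for reproducibility
n = 2000                           # arbitrary sample size per group
lam_d, lam_h = 0.294, 0.4          # values quoted in the text

# NumPy parameterizes the exponential by its mean (scale = 1 / lambda).
s_d = rng.exponential(scale=1.0 / lam_d, size=n)
s_h = rng.exponential(scale=1.0 / lam_h, size=n)

auc_mc = mann_whitney_auc(s_d, s_h)
auc_closed = lam_h / (lam_h + lam_d)
print(f"simulated AUC = {auc_mc:.3f}, closed form = {auc_closed:.3f}")
```

The simulated value fluctuates from seed to seed, so only approximate agreement with the closed-form value should be expected.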
In the case of the Biweibull distribution, three situations are again considered. As the values of the parameters $c_D$ and $c_H$ vary, typical shapes of the ROC curve under the Biweibull distribution can be seen. An interesting point under the Biweibull model is that, for any values of $c_D$ and $c_H$, the ROC curve attains an S shape: after crossing the diagonal line, the curve is concave from that point to the upper right corner (1, 1). In the literature, curves of this kind are referred to as not-proper ROC curves [15]. With a larger distance between the parameters, the ROC curve under the Biweibull model becomes a zigzag S-shaped curve, in which the vertical segment of the zigzag meets the diagonal line at the value 0.5.
Table 1: ROC Curve Parameters and AUC Measure

Binormal ($\sigma_D = \sigma_H = 0.8$)    Biexponential                        Biweibull
$\mu_D$   $\mu_H$   AUC                   $\lambda_D$   $\lambda_H$   AUC      $c_D$   $c_H$   AUC
3.4       2.5       0.7868                0.2941        0.4           0.5762   10      2.5     0.8000
2.8       2.5       0.6045                0.3571        0.4           0.5283   5       2.5     0.6666
4.6       2.5       0.9682                0.2173        0.4           0.6478   6       2.5     0.7058

i. Breast Cancer Data
The data were collected from [16]. The dataset contains 1207 samples, of which 1135 are censored subjects and 72 are deaths. There are 9 variables in total, of which pathological tumor size is the influential variable for diagnosis; its range is (0.10, 7.00). P-P plots were drawn, and they indicate that pathological tumor size is compatible with all three distributions. Table 2 reports the values of the parameters for the three distributions considered. A pattern similar to that of the simulation study is observed in this dataset too. Comparing the three distributions through their respective ROC models (Figure 4), the Binormal model performs better than the other two. Although a slight difference exists between the Biexponential and Biweibull models, they provide nearly equal percentages of classification.


Figure 1: Typical Forms of Binormal ROC curve


Figure 2: Different Forms of Biexponential ROC Curves


Figure 3: Different Forms of Biweibull ROC Curves

ii. Tuberculosis Data
The data were collected from Sri Venkateswara Institute of Medical Sciences, a tertiary hospital in Tirupati. The data consist of 100 samples with 4 variables, of which Adenosine Deaminase (ADA) is the influential factor for diagnosis. Here too, P-P plots were drawn and outliers were identified. The outliers were removed from the dataset using the 5% trimmed mean, and all computations were carried out on the remaining samples (N = 80). Table 2 reports the parameter values for this dataset along with the accuracy measure AUC. Even though the variable follows all three distributions, the Binormal model provides better accuracy than the others: the accuracy obtained under the Binormal model is 99.8%, whereas it is just 58.9% under the exponential and 69.2% under the Weibull model. Thus, we claim that the Binormal model is far better than the other two forms and can be preferred for better classification. The ROC curves of the three distributions are shown in Figure 5.
Table 2: ROC Curve Parameters and AUC Measure

Distributions and            Breast Cancer Data           Tuberculosis Data
their Parameters             D        H        AUC        D        H        AUC
Binormal       $\mu$         2.4818   1.6867   0.6994     4.27     2.97     0.9989
               $\sigma$      1.176    0.965               0.254    0.345
Biexponential  $\lambda$     0.403    0.593    0.5953     0.234    0.336    0.5895
Biweibull      c             2.798    1.880    0.5981     20.185   8.955    0.6927
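
The Binormal and Biexponential AUC values in Table 2 can be approximately recomputed from the reported means and standard deviations alone. The sketch below is a plug-in calculation under two assumptions that are not stated explicitly in the paper: the binormal AUC is Phi(a / sqrt(1 + b^2)) with a = (mu_D - mu_H)/sigma_D and b = sigma_H/sigma_D, and the exponential parameter of each group is estimated as the reciprocal of its sample mean (which is consistent with the tabulated values to three decimals). The dictionary layout and function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def binormal_auc(mu_d, mu_h, sigma_d, sigma_h):
    """Binormal AUC from group means and standard deviations."""
    a = (mu_d - mu_h) / sigma_d
    b = sigma_h / sigma_d
    return NormalDist().cdf(a / sqrt(1.0 + b * b))

def biexponential_auc_from_means(mean_d, mean_h):
    """Biexponential AUC with each rate taken as 1 / mean (an assumption)."""
    lam_d, lam_h = 1.0 / mean_d, 1.0 / mean_h
    return lam_h / (lam_h + lam_d)

# Summary statistics copied from Table 2, as (diseased, healthy) pairs.
summary = {
    "breast_cancer": {"mean": (2.4818, 1.6867), "sd": (1.176, 0.965)},
    "tuberculosis": {"mean": (4.27, 2.97), "sd": (0.254, 0.345)},
}

results = {}
for name, stats in summary.items():
    (m_d, m_h), (s_d, s_h) = stats["mean"], stats["sd"]
    results[name] = {
        "binormal": binormal_auc(m_d, m_h, s_d, s_h),
        "biexponential": biexponential_auc_from_means(m_d, m_h),
    }
    print(name, {k: round(v, 4) for k, v in results[name].items()})
```

Small differences from the tabulated values in the last decimal place are to be expected, since the inputs here are themselves rounded.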


Figure 4: ROC Curves for Three Statistical Distributions
Using Breast Cancer Data

Figure 5: ROC Curves for Three Statistical Distributions
Using Tuberculosis Data
IV. CONCLUSIONS
In classification theory, the Binormal ROC model has become a landmark, and many researchers have proposed various mathematical procedures for handling ROC curve models. In this paper, the authors have attempted to observe the statistical properties of ROC curves under the Normal, Exponential and Weibull distributions. From the results obtained, it is clear that the Binormal model performs better than the other two, while its accuracy measure also varies considerably. The ROC curve of the Biweibull model attains an S-shaped pattern, indicating a not-proper ROC curve. In both the simulated and the real datasets, the AUC under the Biexponential model does not attain a value beyond 0.7. The authors therefore suggest considering the Binormal ROC form for better classification: even when datasets follow life distributions such as the exponential and Weibull, the Binormal model provides a better AUC.
ACKNOWLEDGEMENTS
The authors would like to acknowledge Dr. Alladi Mohan,
Professor, Department of General Medicine, Sri Venkateswara
Institute of Medical Sciences (SVIMS, Tirupathi) for
providing the Tuberculosis data to carry out this research
work.
REFERENCES
[1] Bamber, D, The area above the ordinal dominance graph and the area below the receiver operating characteristic graph, Journal of Mathematical Psychology, 12: 387-415, 1975.
[2] Dorfman and Alf, Maximum likelihood estimation of parameters of signal detection theory and determination of confidence intervals: rating method data, Journal of Mathematical Psychology, 6: 487-496, 1969.
[3] James A Hanley, Barbara J McNeil, The meaning and use of the area under a Receiver Operating Characteristic (ROC) curve, Radiology, 143: 29-36, 1982.
[4] James A Hanley, Barbara J McNeil, A method of comparing the areas under Receiver Operating Characteristic curves derived from the same cases, Radiology, 148: 839-843, 1983.
[5] John A Swets et al., Assessment of diagnostic technologies, Science, 205: 753-759, 1979.
[6] Krzanowski, WJ and Hand, DJ, ROC curves for continuous data, Monographs on Statistics and Applied Probability, CRC Press, Taylor and Francis Group, NY, 2009.
[7] Metz CE, Basic principles of ROC analysis, Seminars in Nuclear Medicine, 8: 283-298, 1978.
[8] Ogilvie and Creelman, Maximum likelihood estimation of Receiver Operating Characteristic curve parameters, Journal of Mathematical Psychology, 5: 377-391, 1968.
[9] Pepe MS, A regression modeling framework for receiver operating characteristic curves in medical diagnostic testing, Biometrika, 84(3): 595-608, 1997.
[10] Pepe MS, Three approaches to regression analysis of receiver operating characteristic curves for continuous test results, Biometrics, 54: 124-135, 1998.
[11] Pepe MS, An interpretation for the ROC curve and inference using GLM procedures, Biometrics, 56: 352-359, 2000.
[12] Pepe MS, The statistical evaluation of medical tests for classification and prediction, Oxford Statistical Science Series, Oxford University Press, 2003.
[13] Qin J and Zhang B, A goodness-of-fit test for logistic regression models based on case-control data, Biometrika, 84: 609-618, 1997.
[14] R Vishnu Vardhan and K.V.S Sarma, On the relationship between the odds ratio and the area under the ROC curve in the context of logistic regression for comparing several biomarkers, International Journal of Statistics and Systems, 5: 165-172, 2010.
[15] Stefano Parodi, Vito Pistoia, Marco Muselli, Not proper ROC curves as
new tool for the analysis of differentially expressed genes in microarray
experiments; BMC Bioinformatics 2008, 9:410, 2008
[16] www.esnips.com/web/spssdatafiles/
R Vishnu Vardhan is currently working as an Assistant Professor in the Department of Statistics, Pondicherry University, Puducherry. His areas of research are Biostatistics and Statistical Computing. He has published 11 research papers in reputed journals and has organized one national workshop. He has presented 25 research papers in 18 national and international conferences and seminars. He is a recipient of the Ms Bhargavi Rao and Padma Vibhushan Prof. C R Rao Award for best poster presentation in an international conference, and also of the Indian Society for Probability and Statistics (ISPS) Young Statistician Award during December 2011. He is a life member of the Indian Society for Probability and Statistics. He has written a book entitled One Some of Statistical Methods for Clinical Trials A Study of ROC Curves.

Dr. Sudesh Pundir is currently working as an Assistant Professor in the Department of Statistics, Pondicherry University, Puducherry. Her areas of research are Biostatistics, Applied Statistics and Reliability. She has published a number of research papers in reputed journals and has participated in many conferences in India as well as abroad. She has organized one international conference and served as an organizing committee member for many conferences. She has presented many research papers, given invited and special invited talks, and acted as a resource person in various national and international conferences and workshops. She is a life member of ISPS.

Sameera G completed her M.Sc. in Statistics with first class distinction at the Department of Statistics, Ramanujan School of Mathematical Sciences, Pondicherry University, R.V. Nagar, Kalapet, Puducherry-605 014, during 2010-2012. She has presented three research papers: one at a National Seminar on Stochastic Modelling and Analysis at Cochin, one at the National Conference on Recent Developments in the Applications of Reliability Theory and Survival Analysis at Pondicherry, and one at an International Biometric Society conference on Computational Statistics and Biosciences at Pondicherry. She has participated in 2 national conferences, one international conference and one seminar.
