Chang2014 Kappa

Publisher: Taylor & Francis
Journal of Biopharmaceutical Statistics

A Novel Maximizing Kappa Approach for Assessing the Ability of a Diagnostic
Marker and its Optimal Cutoff Value
Chia-Hao Chang (a), Jen-Tsung Yang (b) & Ming-Hsueh Lee (b)
(a) Department of Nursing, Chang Gung University of Science and Technology, Chiayi, Taiwan
(b) Department of Neurosurgery, Chang Gung Memorial Hospital, Chiayi, Taiwan
Accepted author version posted online: 11 Jun 2014. Published online: 11 Jun 2014.
To cite this article: Chia-Hao Chang, Jen-Tsung Yang & Ming-Hsueh Lee (2014): A Novel Maximizing
Kappa Approach for Assessing the Ability of a Diagnostic Marker and its Optimal Cutoff Value, Journal
of Biopharmaceutical Statistics, DOI: 10.1080/10543406.2014.920347
To link to this article: http://dx.doi.org/10.1080/10543406.2014.920347
Journal of Biopharmaceutical Statistics, 00: 1–15, 2015
Copyright Chang Gung University of Science and Technology
ISSN: 1054-3406 print/1520-5711 online
DOI: 10.1080/10543406.2014.920347
A NOVEL MAXIMIZING KAPPA APPROACH FOR ASSESSING THE ABILITY OF A DIAGNOSTIC MARKER
AND ITS OPTIMAL CUTOFF VALUE
Chia-Hao Chang (1), Jen-Tsung Yang (2), and Ming-Hsueh Lee (2)
(1) Department of Nursing, Chang Gung University of Science and Technology, Chiayi, Taiwan
(2) Department of Neurosurgery, Chang Gung Memorial Hospital, Chiayi, Taiwan

Threshold-dependent accuracy measures such as true classification rates in ordered
multiple-class (k > 3) receiver operating characteristic (ROC) hyper-surfaces have recently
been used to assist with medical decision making. However, because these measures have low
power in some circumstances, we construct a new method that relies on the kappa coefficient
to solve such diagnostic problems. Under the approach proposed in the present article, the
statistics depend strongly on the $C_{k-1}^{N-1}$ possible cutoff thresholds, which can be
chosen to maximize the kappa statistic between true disease status and the new biomarker.
The Monte Carlo simulation results confirm the effectiveness of the proposed method in terms
of its predictive power. The proposed design is then compared with the volume under the ROC
hyper-surface by applying it to intracerebral hemorrhagic patients classified into five
stroke classes using the National Institutes of Health Stroke Scale.
Key Words: Diagnostic testing cutoff values; Maximize kappa; Theory of extremes; Volume under the ROC
hyper-surface.
1. INTRODUCTION
The identification of cutoff values for the key biomarkers associated with the risk of
developing clinical diseases is among the most useful clinical tools in preventive epide-
miology. However, power performance is crucial for statistical testing procedures. This
article thus contributes to the body of knowledge on this topic by proposing a new
approach for addressing diagnostic problems of this kind.
The most common method for testing the quality of biomarkers is the receiver
operating characteristic (ROC) curve (Egan, 1975; Green and Swets, 1996; Pepe, 2003).
In simple terms, the ROC curve plots the fraction of positive results among the true
positives (the true positive rate, i.e., sensitivity) against the fraction of positive
results among the true negatives (the false positive rate, i.e., 1 − specificity) at
various cutoff values (c1). The area under the ROC curve (AUC) has been
extensively used for testing the quality of biomarkers in the literature. The hypotheses
can be formulated as follows:
Received October 15, 2013; Accepted April 23, 2014
Address correspondence to Chia-Hao Chang, Department of Nursing, Chang Gung University of Science
and Technology, Chiayi Campus, Chiayi 61363, Taiwan; E-mail: howellchang@gmail.com
\[ H_0: \text{True AUC} = 0.5 \]
\[ H_1: \text{True AUC} > 0.5 \]
ROC surfaces can be defined to identify new biomarkers in ordered three-class case
classification problems (Mossman, 1999; Dreiseitl et al., 2000; Heckerling, 2001; Nakas
and Yiannoutsos, 2004). In such problems, two decision cutoff values, namely c1 and c2,
are needed to classify biomarkers into three groups.
To allow us to generalize, we now consider classifying these groups into ordered k-class
cases (k > 3) based on a single biomarker. By assuming that $F_1(\cdot), F_2(\cdot), \ldots, F_k(\cdot)$
are k overlapping continuous distributions, we can define a decision rule as in the three-class
case presented above. To classify the groups into k classes, $k - 1$ decision cutoff values
are needed, namely $c_1, c_2, \ldots, c_{k-1}$. Therefore, the following hypotheses may be applied:
\[ H_0: F_1 = F_2 = \cdots = F_k \qquad \text{vs.} \qquad H_1: F_1 \le F_2 \le \cdots \le F_k \tag{1} \]
with at least one strict inequality.

Let $x_{i1}, x_{i2}, \ldots, x_{in_i}$, $i = 1, \ldots, k$ represent the independent random samples
from k overlapping continuous distributions. The ROC hyper-surface in a k-dimensional space
can be constructed by plotting the true classification rates $1 - F_1(c_1)$, $F_2(c_1) - F_2(c_2)$,
..., and $F_k(c_{k-1})$ at various cutoff values $(c_1, c_2, \ldots, c_{k-1})$, where
$c_1 > c_2 > \cdots > c_{k-1}$. A nonparametric unbiased estimate of the volume under the ROC
hyper-surface (VUS) can then be calculated as follows:
\[ \text{VUS} = \frac{1}{n_1 \cdots n_k} \sum_{j_1=1}^{n_1} \cdots \sum_{j_k=1}^{n_k} I(x_{1j_1} \ge x_{2j_2} \ge \cdots \ge x_{kj_k}) \tag{2} \]
where $I(x_{1j_1} \ge x_{2j_2} \ge \cdots \ge x_{kj_k}) = 1$ provided there exists at least one strict
inequality; otherwise, $I(x_{1j_1} \ge x_{2j_2} \ge \cdots \ge x_{kj_k})$ is equal to zero. The hypotheses
to be tested for the quality of the biomarkers are thus
\[ H_0: \text{True VUS} = 1/k! \]
\[ H_1: \text{True VUS} > 1/k! \]
which are defined in order to correspond to the three-class case.
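
As a minimal illustration of Equation (2), the following Python sketch counts the cross-class tuples that satisfy the ordering. It assumes that the classes are listed so that class 1 tends to have the largest marker values (consistent with the decreasing cutoffs c1 > c2 > ... > ck−1); the function name `vus_estimate` and the data are hypothetical and are not taken from the article.

```python
from itertools import product

def vus_estimate(samples):
    """Fraction of tuples (x_1, ..., x_k), one value per class, that are
    non-increasing with at least one strict inequality."""
    total = 0
    ordered = 0
    for combo in product(*samples):
        total += 1
        pairs = list(zip(combo, combo[1:]))
        non_increasing = all(a >= b for a, b in pairs)
        some_strict = any(a > b for a, b in pairs)
        if non_increasing and some_strict:
            ordered += 1
    return ordered / total

# Made-up marker values for three ordered classes (illustration only).
classes = [[15.2, 14.1, 16.0], [13.0, 14.5], [12.8, 12.1]]
print(vus_estimate(classes))
```

For an uninformative marker the estimate should be close to 1/k! (1/6 for three classes), which is the null value in the hypotheses above. The brute-force loop visits all n1 × ... × nk tuples, so it is only practical for small samples.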

In the present study, we propose the potential application of Cohen's kappa (κ)
statistic for evaluating biomarkers in both two-class and ordered multiple-class (k > 3)
diagnostic testing settings. Such an application is similar to that of the VUS. The κ
coefficient (Cohen, 1960; Fleiss, 1975; Kraemer, 1979; Brennan and Prediger, 1981;
Zwick, 1988; Warrens, 2008a, 2008b, 2010) is a popular descriptive statistical measure
for assessing the interrater agreement between two raters for ordinal categorical data. This
measure corrects the overall accuracy of biomarker predictions using the accuracy
expected to occur by chance. Other advantages of kappa are its simplicity, the fact that
both false positives and false negatives are accounted for in one parameter, and its relative
tolerance to zero values in the confusion matrix. Furthermore, it is more resistant to
prevalence than are positive predictive value, sensitivity, and specificity (Manel et al.,
2001).
This article studies the test of homogeneity using such a test statistic, which is
maximal over all possible values of the threshold. It often happens in epidemiological
studies that values of a continuous variable are divided into two groups by comparing
the measured values with a threshold (Miller and Siegmund, 1982). Maximally selected χ²
statistics in k × 2 contingency tables are investigated in Betensky and Rabinowitz (1999).
The κ statistic has been used extensively in map accuracy work (Congalton, 1991) and in
presence–absence mapping in applied ecology, wherein a threshold is selected to maximize
kappa (Guisan and Hofer, 2003; Hirzel et al., 2006; Moisen et al., 2006). The
proposed κ coefficient for k × k tables further calculates the agreement between the
biomarker and true disease status.
2. MATERIALS AND METHODS

2.1. Proposed Diagnostic Testing
Researchers in medicine, psychiatry, and epidemiology have recently begun to
recognize the degree to which observer error influences measurement. It is
thus crucial when evaluating observer agreement to understand both its possible
contribution to measurement error and its effect on the evaluation of testing new
instruments.
The κ coefficient involves two raters who are asked to classify a sample of subjects
independently using a scale that represents a certain construct. More specifically, the κ
coefficient tabulates the joint responses of the raters in a two-way table (i.e., with
the classification results of the first rater in the rows and those of the second in the
columns). Then, the degree of agreement that exceeds that expected by chance is
measured. The statistical theory of the sampling distribution of κ begins with the
derivation of its large-sample standard error, in line with the approach presented by
Fleiss et al. (1969).
Throughout this article, let $x_{i1}, x_{i2}, \ldots, x_{in_i}$, $i = 1, 2$ represent the independent
random samples from two overlapping continuous distributions. Our proposed method based on
the κ coefficient summarizes the cross classification of the two raters (true disease status
and the new biomarker, in this case) with two identical classes. Suppose $\pi_{ij}$ is the
probability of a subject being classified in the ith class according to true disease status
and in the jth class according to the new biomarker. Note that i = 1, 2, with the values 1
and 2 used to express the response of true disease status to the clinical study. To define j,
we suppose that
\[ j = \begin{cases} 1, & \text{if } c_1 > \{x_{i1}, \ldots, x_{in_i};\ i = 1, 2\} \\ 2, & \text{if } \{x_{i1}, \ldots, x_{in_i};\ i = 1, 2\} \ge c_1 \end{cases} \]
Thus, the κ coefficient is defined as
\[ \kappa = \frac{\pi_0 - \pi_e}{1 - \pi_e} \]
where:
$\pi_0$: the probability of an observation being classified in the same class by the true
disease status i and the new biomarker j.
$\pi_e$: if the true disease status and the new biomarker are independent, then the probability
of agreement is $\pi_e = \pi_{1\cdot}\pi_{\cdot 1} + \pi_{2\cdot}\pi_{\cdot 2}$, where $\pi_{1\cdot}$ and
$\pi_{\cdot 1}$ are the marginal classification probabilities of row 1 and column 1, respectively
(and likewise $\pi_{2\cdot}$ and $\pi_{\cdot 2}$ for row 2 and column 2).
The hypothesis to be tested for agreement between the true disease status and the new
biomarker beyond mere chance can be formulated as
\[ H_0: \kappa = 0 \quad \text{vs.} \quad H_1: \kappa > 0 \]
which indicates better-than-chance agreement.

In summary, the κ coefficient is used to measure the proportion of the agreement for
this two-way table. For example, if κ indicates perfect agreement (κ = 1), then x is a
perfect biomarker associated with the risk of the true disease status, as shown in Table 1.
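
To make the two-class definition concrete, the sketch below builds the 2 × 2 agreement table at a given cutoff c1 and computes κ from it. This is only an illustration under stated assumptions: the helper names `table_at_cutoff` and `cohen_kappa` and the toy data are hypothetical, and true status 1 is assumed to be the class with the smaller marker values, matching the rule j = 1 when the value is below c1.

```python
def table_at_cutoff(values, status, c1):
    """2 x 2 counts: rows are true status i (1 or 2), columns are marker class j."""
    table = [[0, 0], [0, 0]]
    for x, i in zip(values, status):
        j = 1 if x < c1 else 2  # j = 1 below the cutoff, j = 2 otherwise
        table[i - 1][j - 1] += 1
    return table

def cohen_kappa(table):
    """Cohen's kappa, (p0 - pe) / (1 - pe), from a square table of counts."""
    k = len(table)
    n = sum(sum(row) for row in table)
    p0 = sum(table[d][d] for d in range(k)) / n            # observed agreement
    rows = [sum(table[i]) / n for i in range(k)]            # row margins
    cols = [sum(table[i][j] for i in range(k)) / n for j in range(k)]  # column margins
    pe = sum(r * c for r, c in zip(rows, cols))             # chance agreement
    if pe == 1:
        return 0.0  # degenerate table: kappa is undefined, report no agreement
    return (p0 - pe) / (1 - pe)

# Toy data (illustration only): status 1 has the smaller marker values.
values = [1.0, 1.5, 2.0, 3.0, 3.5, 4.0]
status = [1, 1, 1, 2, 2, 2]
print(cohen_kappa(table_at_cutoff(values, status, 2.5)))   # cutoff separates the classes
print(cohen_kappa(table_at_cutoff(values, status, 1.75)))  # one subject misclassified
```

Moving the cutoff away from the separating value lowers κ, which is the behavior the maximization over cutoffs exploits.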
Table 1 Results of simulations to study size and power for sample size arrangements for k = 3

D.F. (2, 10)
Sample sizes                  (10, 10, 10)             (15, 5, 5)               (20, 5, 5)
Simulated critical values/(μ, σ, ξ)a
  MaxK: (μ, σ, ξ)             (0.179, 0.085, 0.185)    (0.18, 0.099, 0.155)     (0.164, 0.09, 0.102)
  VUSs: critical value        1.847                    1.813                    1.847
Location parameters           MaxK     VUSs            MaxK     VUSs            MaxK     VUSs
(0, 0, 0)                     0.045    0.052           0.058    0.065           0.043    0.046
(0.5, 0.25, 0)                0.155    0.184           0.138    0.146           0.124    0.162
(0.5, 0, 0)                   0.155    0.198           0.175    0.169           0.164    0.175
(0.75, 0.25, 0)               0.243    0.331           0.251    0.276           0.218    0.236

D.F. (4.5, 4.5)
Sample sizes                  (10, 10, 10)             (15, 5, 5)               (20, 5, 5)
Location parameters           MaxK     VUSs            MaxK     VUSs            MaxK     VUSs
(0, 0, 0)                     0.045    0.046           0.054    0.054           0.044    0.051
(0.5, 0.25, 0)                0.185    0.243           0.181    0.179           0.215    0.201
(0.5, 0, 0)                   0.205    0.241           0.246    0.213           0.236    0.209
(0.75, 0.25, 0)               0.327    0.415           0.312    0.31            0.348    0.329

D.F. (10, 2)
Sample sizes                  (10, 10, 10)             (15, 5, 5)               (20, 5, 5)
Location parameters           MaxK     VUSs            MaxK     VUSs            MaxK     VUSs
(0, 0, 0)                     0.04     0.051           0.046    0.049           0.046    0.046
(0.5, 0.25, 0)                0.124    0.169           0.166    0.158           0.182    0.171
(0.5, 0, 0)                   0.138    0.176           0.196    0.162           0.188    0.152
(0.75, 0.25, 0)               0.246    0.302           0.294    0.249           0.288    0.25

Note. Abbreviations: D.F., degree of freedom; VUSs, the volume under the ROC hyper-surface.
a Based on maximum likelihood estimation.

For the purposes of generalization, let $x_{i1}, \ldots, x_{in_i}$, $i = 1, \ldots, k$ represent the
independent random samples from k overlapping continuous distributions. The cutoffs
($c_1, \ldots, c_{k-1}$, where $c_1 > \cdots > c_{k-1}$) are the thresholds for the k classes. The proposed
method based on the κ coefficient summarizes the cross classification of two raters with k
identical classes. Then, for $i = k, k-1, \ldots, 1$, the values $k, k-1, \ldots, 1$ are used to
express the response of true disease status (k-class case) to the clinical study. To define j,
we suppose that
\[ j = \begin{cases}
1, & \text{if } c_{k-1} > \{x_{i1}, \ldots, x_{in_i};\ i = 1, \ldots, k\} \\
2, & \text{if } c_{k-2} > \{x_{i1}, \ldots, x_{in_i};\ i = 1, \ldots, k\} \ge c_{k-1} \\
\vdots & \\
k, & \text{if } \{x_{i1}, \ldots, x_{in_i};\ i = 1, \ldots, k\} \ge c_1
\end{cases} \]
where $c_0$ is infinity.
In practice, the $k - 1$ ordered decision cutoff values vary among the possible
biomarker values ($= C_{k-1}^{N-1}$ combinations, $N = n_1 + n_2 + \cdots + n_k$). Thus, the proposed
statistic is defined by ascertaining the optimal cutoff value (i.e., the one that provides the
maximum κ coefficient) for assigning probabilistic predictions into the k classes of the data
set. The maximum κ coefficient determines the extreme values rather than the average values.
Statistics for extreme values have been proposed and applied in different fields, such as
for analyzing high sea levels, wind speeds, air pollutant concentrations, and price changes
in share markets (Gumbel, 1958; Coles, 2001; de Haan and Ferreira, 2006). Furthermore,
the maximum κ coefficient is equivalent to the optimal cutoff value for calculating the κ
coefficient as follows (Guisan et al., 1999):
\[ \text{MaxK} = \max\left\{\kappa_1, \kappa_2, \ldots, \kappa_{C_{k-1}^{N-1}}\right\} \tag{3} \]
where $\left\{\kappa_1, \kappa_2, \ldots, \kappa_{C_{k-1}^{N-1}}\right\}$ is a sequence of random variables that
have a common distribution function G, while
\[ \kappa_m = \frac{\pi_0^m - \pi_e^m}{1 - \pi_e^m}, \qquad m = 1, 2, \ldots, C_{k-1}^{N-1}, \]
represent the κ coefficients under various possible cutoff values. Similarly, $\pi_0^m$ is the
probability of an observation being classified in the same class by true disease status i and
biomarker j for the mth cutoff value. If true disease status and the biomarker are
independent, then the probability of agreement is
\[ \pi_e^m = \sum_{l=1}^{k} \pi_{l\cdot}^m \, \pi_{\cdot l}^m , \]
where $\pi_{l\cdot}^m$ and $\pi_{\cdot l}^m$ are the marginal classification probabilities of row l and
column l, respectively.
Given that the number of possible permutations is equal to $C_{k-1}^{N-1}$, this number will be
quite large even for small sample sizes. Thus, in practice, only M random permutations are
generated in order to derive the reference distribution. The hypotheses for the agreement
between true disease status and the biomarker beyond mere chance can thus be formulated as
\[ H_0: \text{MaxK} = 0 \]
\[ H_1: \text{MaxK} > 0 \]
which indicates a better-than-chance outcome.
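
The search over cutoff combinations and the random-permutation reference values can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function names are hypothetical, candidate cutoffs are taken to be the observed values other than the smallest (which gives C(N−1, k−1) combinations when there are no ties), and true status is assumed to be coded so that a higher class index goes with larger marker values, as in the definition of j above.

```python
import random
from itertools import combinations

def kappa_for_cutoffs(values, status, cutoffs, k):
    """Cohen's kappa between true class (1..k) and the marker class j."""
    n = len(values)
    table = [[0] * k for _ in range(k)]
    for x, i in zip(values, status):
        j = 1 + sum(x >= c for c in cutoffs)  # j = 1 below every cutoff, k above all
        table[i - 1][j - 1] += 1
    p0 = sum(table[d][d] for d in range(k)) / n
    rows = [sum(r) / n for r in table]
    cols = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    pe = sum(r * c for r, c in zip(rows, cols))
    return 0.0 if pe == 1 else (p0 - pe) / (1 - pe)

def max_kappa(values, status, k):
    """Return (MaxK, best cutoffs) over all combinations of k - 1 cutoffs."""
    candidates = sorted(set(values))[1:]  # drop the minimum so class 1 is non-empty
    best, best_cut = float("-inf"), None
    for cutoffs in combinations(candidates, k - 1):
        kap = kappa_for_cutoffs(values, status, cutoffs, k)
        if kap > best:
            best, best_cut = kap, cutoffs
    return best, best_cut

def permuted_max_kappas(values, status, k, m, seed=0):
    """MaxK recomputed on m random relabelings of true status."""
    rng = random.Random(seed)
    labels = list(status)
    out = []
    for _ in range(m):
        rng.shuffle(labels)
        out.append(max_kappa(values, labels, k)[0])
    return out

# Toy data (illustration only): three perfectly separated classes.
values = [1, 2, 3, 4, 5, 6]
status = [1, 1, 2, 2, 3, 3]
print(max_kappa(values, status, 3))
print(max(permuted_max_kappas(values, status, 3, m=50, seed=1)))
```

The permuted values form the series to which an extreme value distribution is then fitted; that fitting step is not shown here.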

2.2. Asymptotic Null Distribution

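This section presents the asymptotic distributions for MaxK, the sample maxima for a
stationary series, namely the so-called extremal types theorem. To find a limiting distribution
of interest, we use sequences of location constants $\{a_n\}$ and of scale constants $\{b_n\}$ (> 0)
in order to obtain the linear normalization of MaxK such that the distribution of
standardized extremes satisfies
\[ \Pr\{(\text{MaxK} - a_n)/b_n \le \kappa\} \xrightarrow{D} G(\kappa) \quad \text{as } n \to \infty, \]
where $\xrightarrow{D}$ means convergence in distribution and G is non-degenerate. In fact, the
limiting distribution of the maximum has three classes (Gnedenko, 1943), namely
\[ \text{Gumbel (Type I):} \quad G_1(\kappa) = \exp\{-\exp[-(\kappa - a)/b]\}\, I_{(-\infty,\infty)}(\kappa) \]
\[ \text{Fréchet (Type II):} \quad G_2(\kappa) = \exp\{-[(\kappa - a)/b]^{-\alpha}\}\, I_{(a,\infty)}(\kappa), \quad \alpha > 0 \]
\[ \text{Weibull (Type III):} \quad G_3(\kappa) = \exp\{-[-(\kappa - a)/b]^{\alpha}\}\, I_{(-\infty,a)}(\kappa) + I_{[a,\infty)}(\kappa), \quad \alpha > 0 \]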

The main difference among these three limiting types is the behavior of the upper tail. The
shape parameter α reflects the weight of the tail of the distribution. The maximum tail
value is finite for the Weibull distribution and infinite for the Fréchet and Gumbel
distributions. Distributions that have upper tails that decay exponentially produce the
Gumbel distribution of the maximum, whereas those that decay polynomially produce the
Fréchet distribution of the maximum. Further, Jenkinson applied Gnedenko's results to
propose a generalized formula that combines the three types of extreme value distributions
into a single distribution, called the generalized extreme value distribution (GEV)
(Gnedenko, 1943; Jenkinson, 1955):
\[ G(\kappa) = \exp\left\{-\left[1 + \xi(\kappa - \mu)/\sigma\right]^{-1/\xi}\right\}, \quad \text{for } 1 + \xi(\kappa - \mu)/\sigma > 0 \]
where ξ = 0 for Gumbel, ξ > 0 for Fréchet, and ξ < 0 for Weibull. ξ is the shape parameter
and μ and σ are location and scale parameters, which are different from $\{a_n\}$ and $\{b_n\}$
defined above.
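
The Gumbel case of the GEV, with location μ, scale σ, and shape ξ, is understood as the limit of the formula as the shape parameter tends to zero. As a brief check, writing z = (κ − μ)/σ (a shorthand introduced here only for this illustration):

```latex
\lim_{\xi \to 0} (1 + \xi z)^{-1/\xi} = e^{-z}
\quad\Longrightarrow\quad
\lim_{\xi \to 0} \exp\left\{-(1 + \xi z)^{-1/\xi}\right\} = \exp\left\{-e^{-z}\right\},
```

which is the Gumbel Type I form with a = μ and b = σ.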
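By following the extremes of dependent sequences theorem (Coles, 2001) and the
generalized formula by Jenkinson (1955), the following theorem can be deduced.

Definition. A stationary series $\kappa_1, \kappa_2, \ldots$ is said to satisfy the
$D(u_{C_{k-1}^{N-1}})$ condition if, for all $i_1 < \cdots < i_p < j_1 < \cdots < j_q$ with $j_1 - i_p > l$,
\[ \left| \Pr\{\kappa_{i_1} \le u_{C_{k-1}^{N-1}}, \ldots, \kappa_{i_p} \le u_{C_{k-1}^{N-1}}, \kappa_{j_1} \le u_{C_{k-1}^{N-1}}, \ldots, \kappa_{j_q} \le u_{C_{k-1}^{N-1}}\} - \Pr\{\kappa_{i_1} \le u_{C_{k-1}^{N-1}}, \ldots, \kappa_{i_p} \le u_{C_{k-1}^{N-1}}\} \Pr\{\kappa_{j_1} \le u_{C_{k-1}^{N-1}}, \ldots, \kappa_{j_q} \le u_{C_{k-1}^{N-1}}\} \right| \le \alpha(C_{k-1}^{N-1}, l), \]
where $\alpha(C_{k-1}^{N-1}, l_{C_{k-1}^{N-1}}) \to 0$ for some sequence $l_{C_{k-1}^{N-1}}$ such that
$l_{C_{k-1}^{N-1}} / C_{k-1}^{N-1} \to 0$ as $C_{k-1}^{N-1} \to \infty$.

Theorem. Let $\kappa_1, \kappa_2, \ldots$ be a stationary process and MaxK be defined as in Equation
(3). If the constants $a_{C_{k-1}^{N-1}}$ and $b_{C_{k-1}^{N-1}} > 0$ and a non-degenerate distribution
function G exist such that
\[ \Pr\left\{(\text{MaxK} - a_{C_{k-1}^{N-1}}) / b_{C_{k-1}^{N-1}} \le \kappa\right\} \xrightarrow{D} G(\kappa) \quad \text{as } C_{k-1}^{N-1} \to \infty, \]
then G is a distribution of the GEV family:
\[ G(\kappa) = \exp\left\{-\left[1 + \xi(\kappa - \mu)/\sigma\right]^{-1/\xi}\right\} \]
for $1 + \xi(\kappa - \mu)/\sigma > 0$, with parameters $-\infty < \mu < \infty$, $\sigma > 0$, and
$-\infty < \xi < \infty$, provided that the $D(u_{C_{k-1}^{N-1}})$ condition is satisfied with
$u_{C_{k-1}^{N-1}} = b_{C_{k-1}^{N-1}} \kappa + a_{C_{k-1}^{N-1}}$ for every real κ.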
In practice, we do not have problems with the normalizing constants. The family of
extreme value distributions may be fitted directly to a series of observations of MaxK.
That is,
\[ \Pr\{\text{MaxK} \le \kappa\} \xrightarrow{D} G\left\{(\kappa - a_{C_{k-1}^{N-1}}) / b_{C_{k-1}^{N-1}}\right\} = G^*(\kappa) \quad \text{as } C_{k-1}^{N-1} \to \infty, \]
where G* is of the same type as G.

The maximum likelihood estimation of the parameters for the GEV distribution (μ,
σ, and ξ) considered by Prescott and Walden provides theoretical details for the asymptotic
behavior of the estimators (Prescott and Walden, 1980, 1983). These authors assume
that the realized extremes are drawn exactly from the GEV distribution. The maximum
likelihood method thus provides parameter estimators that are asymptotically unbiased,
asymptotically normal, and asymptotically efficient (minimum variance). The estimates for
the parameters are obtained by solving a set of nonlinear equations given by the first-order
conditions of the maximization problem (Tiago de Oliveira, 1973).
A detailed implementation algorithm is as follows:
a. Calculate MaxK, where $\text{MaxK} = \max\left\{\kappa_1, \kappa_2, \ldots, \kappa_{C_{k-1}^{N-1}}\right\}$ and
$C_{k-1}^{N-1}$ is the number of possible cutoff combinations among the biomarker values.
b. Estimate the parameters of the limiting distribution G; that is, fit a GEV
distribution to the series of MaxK observations using the VGAM package in R 2.15.0
(R Development Core Team, 2012) in order to obtain maximum likelihood estimates for the
limiting distribution of MaxK.
c. Reject H0 for large MaxK values. The GEV approximation for the procedure rejects H0 if
$\text{MaxK} \ge G^{-1}(1 - \alpha)$; otherwise, H0 is not rejected. Note that the critical value
$G^{-1}(1 - \alpha)$ is chosen to make the Type I error probability equal to α, that is,
$\Pr\{\text{MaxK} \ge G^{-1}(1 - \alpha) \mid H_0 \text{ true}\} = \alpha$.
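
Step c can be sketched in a few lines once GEV estimates are available. The sketch below is an illustration under stated assumptions, not the authors' code: it uses the standard closed-form GEV distribution and quantile functions with location mu, scale sigma, and shape xi, and the numerical parameter values are hypothetical rather than taken from the article. The fitting in step b is not reproduced.

```python
import math

def gev_cdf(x, mu, sigma, xi):
    """GEV distribution function with location mu, scale sigma, shape xi."""
    if xi == 0:
        return math.exp(-math.exp(-(x - mu) / sigma))  # Gumbel limit
    t = 1 + xi * (x - mu) / sigma
    if t <= 0:
        # Outside the support: below the lower end point (xi > 0)
        # or above the upper end point (xi < 0).
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1 / xi))

def gev_quantile(p, mu, sigma, xi):
    """Inverse of gev_cdf for 0 < p < 1."""
    if xi == 0:
        return mu - sigma * math.log(-math.log(p))
    return mu + sigma / xi * ((-math.log(p)) ** (-xi) - 1)

def reject_h0(max_k, mu, sigma, xi, alpha=0.05):
    """Reject H0 when the observed MaxK reaches the (1 - alpha) GEV quantile."""
    return max_k >= gev_quantile(1 - alpha, mu, sigma, xi)

# Hypothetical fitted values (not from the article), for illustration only.
mu, sigma, xi = 0.17, 0.08, -0.07
critical = gev_quantile(0.95, mu, sigma, xi)
print(critical, reject_h0(0.56, mu, sigma, xi), 1 - gev_cdf(0.56, mu, sigma, xi))
```

The last printed value is the upper-tail probability of the observed MaxK under the fitted distribution, which plays the role of a p-value.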
3. COMPARISON WITH RESPECT TO SIZE AND POWER

We next present the results of a finite sample simulation study that compared the
VUS and MaxK in terms of size and power for various underlying populations and
location parameters. The U-statistics theory for the VUS and the extremal types theorem for
MaxK were used in the simulation, the results of which are presented in Tables 1–3.
First, we assumed that the distribution of each population is generated under the
location model $X_{ij} = \theta_i + \varepsilon_{ij}$, where $\varepsilon_{ij}$ is iid log-F with combinations
of 2, 4.5, and 10 degrees of freedom, and $\theta_i$ is the location parameter. We examined balanced
and unbalanced designs with group sample sizes between 5 and 20. We simulated the
above-mentioned conditions using the software R 2.15.0 (R Development Core Team, 2012)
and fit a GEV distribution to them using the VGAM package in order to obtain maximum
likelihood estimates for the limiting distribution of MaxK.

Table 2 Results of simulations to study size and power for sample size arrangements for k = 4

D.F. (2, 10)
Sample sizes                  (10, 10, 10, 10)         (5, 5, 5, 10)            (5, 5, 5, 15)            (5, 5, 5, 20)
Simulated critical values/(μ, σ, ξ)a
  MaxK: (μ, σ, ξ)             (0.161, 0.062, 0.217)    (0.189, 0.081, 0.168)    (0.17, 0.073, 0.147)     (0.156, 0.069, 0.15)
  VUSs: critical value        1.885                    1.944                    1.924                    1.803
Location parameters           MaxK     VUSs            MaxK     VUSs            MaxK     VUSs            MaxK     VUSs
(0, 0, 0, 0)                  0.066    0.067           0.037    0.043           0.05     0.063           0.056    0.051
(0.75, 0.5, 0.25, 0)          0.29     0.293           0.203    0.227           0.265    0.253           0.25     0.231
(0.5, 0.5, 0.5, 0)            0.169    0.146           0.177    0.11            0.222    0.133           0.235    0.158
(1, 0.75, 0.5, 0)             0.386    0.387           0.313    0.296           0.374    0.298           0.44     0.346

D.F. (4.5, 4.5)
Sample sizes                  (10, 10, 10, 10)         (5, 5, 5, 10)            (5, 5, 5, 15)            (5, 5, 5, 20)
Location parameters           MaxK     VUSs            MaxK     VUSs            MaxK     VUSs            MaxK     VUSs
(0, 0, 0, 0)                  0.057    0.05            0.045    0.053           0.056    0.046           0.043    0.059
(0.75, 0.5, 0.25, 0)          0.376    0.379           0.237    0.244           0.296    0.288           0.322    0.322
(0.5, 0.5, 0.5, 0)            0.218    0.184           0.21     0.155           0.241    0.161           0.275    0.204
(1, 0.75, 0.5, 0)             0.502    0.526           0.405    0.361           0.51     0.429           0.537    0.419

D.F. (10, 2)
Sample sizes                  (10, 10, 10, 10)         (5, 5, 5, 10)            (5, 5, 5, 15)            (5, 5, 5, 20)
Location parameters           MaxK     VUSs            MaxK     VUSs            MaxK     VUSs            MaxK     VUSs
(0, 0, 0, 0)                  0.068    0.05            0.046    0.051           0.061    0.061           0.048    0.043
(0.75, 0.5, 0.25, 0)          0.268    0.291           0.213    0.224           0.22     0.232           0.265    0.279
(0.5, 0.5, 0.5, 0)            0.188    0.168           0.17     0.146           0.202    0.153           0.209    0.186
(1, 0.75, 0.5, 0)             0.437    0.424           0.32     0.32            0.381    0.308           0.388    0.355

a Based on maximum likelihood estimation.

This simulation was replicated 1000 times at α = 0.05 using M = 1000 random
permutations. The VUS is a discrete random variable (Terpstra and Magel, 2003), so its
test may not attain the nominal Type I error level exactly under H0. Thus, we used
simulated critical values for the VUS (denoted VUSs). Similarly, Monte Carlo
approximations at α = 0.05 and 1000 replications were conducted, and the quantile
function qgev from the evd package was used. Ideally, biomarkers that have
better quality should have higher power.
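
A location model with log-F errors of the kind used in the simulation can be sketched as below. This is an assumption-laden illustration rather than the authors' R code: the function names are hypothetical, and the log-F error is generated as the logarithm of a ratio of two scaled chi-square variates, each drawn as a gamma variate with shape d/2 and scale 2.

```python
import math
import random

def log_f_error(rng, df1, df2):
    """One log-F(df1, df2) variate: log of an F-distributed random variable."""
    num = rng.gammavariate(df1 / 2, 2) / df1  # chi-square(df1) / df1
    den = rng.gammavariate(df2 / 2, 2) / df2  # chi-square(df2) / df2
    return math.log(num / den)

def simulate_classes(locations, sizes, df1, df2, seed=0):
    """Samples X_ij = location_i + error_ij for each class i."""
    rng = random.Random(seed)
    return [[theta + log_f_error(rng, df1, df2) for _ in range(n)]
            for theta, n in zip(locations, sizes)]

# One null and one alternative data set for k = 3 with D.F. (2, 10).
null_data = simulate_classes((0, 0, 0), (10, 10, 10), 2, 10, seed=1)
alt_data = simulate_classes((0.5, 0.25, 0), (10, 10, 10), 2, 10, seed=1)
print([round(sum(g) / len(g), 3) for g in alt_data])
```

Repeating such draws many times and recording how often each test rejects H0 gives the simulated size (under equal locations) and power (under unequal locations).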
The simulated sizes of VUSs and MaxK are shown to be close to the nominal
level under the null hypothesis. Under k = 3, VUS outperforms MaxK when DF = (2, 10)
or when the sample sizes are (10, 10, 10). MaxK performs better only for
the case of unbalanced sample sizes with DF = (4.5, 4.5) or DF = (10, 2). For the case of
unbalanced sample sizes (15, 5, 5) or (20, 5, 5) with DF = (4.5, 4.5) or DF = (10, 2) in
Table 1, the gain percentage in power, DP = (MaxK − VUSs)/VUSs, ranges from 0.65% to
23.68% with the average gain percentage in power being 11.03% (difference of
percentage).

Table 3 Results of simulations to study size and power for sample size arrangements for k = 5

D.F. (2, 10)
Sample sizes                  (10, 10, 10, 10, 10)     (10, 5, 5, 5, 5)         (15, 5, 5, 5, 5)
Simulated critical values/(μ, σ, ξ)a
  MaxK: (μ, σ, ξ)             (0.138, 0.046, 0.171)    (0.18, 0.099, 0.155)     (0.152, 0.046, 0.105)
  VUSs: critical value        1.77                     1.813                    2.036
Location parameters           MaxK     VUSs            MaxK     VUSs            MaxK     VUSs
(0, 0, 0, 0, 0)               0.05     0.05            0.048    0.051           0.046    0.054
(1, 0.75, 0.5, 0.25, 0)       0.387    0.456           0.319    0.287           0.333    0.322
(0.5, 0, 0, 0, 0)             0.195    0.2             0.182    0.145           0.197    0.137
(1.25, 0.75, 0.5, 0.25, 0)    0.533    0.605           0.434    0.383           0.52     0.438

D.F. (4.5, 4.5)
Location parameters           MaxK     VUSs            MaxK     VUSs            MaxK     VUSs
(0, 0, 0, 0, 0)               0.059    0.055           0.049    0.051           0.052    0.066
(1, 0.75, 0.5, 0.25, 0)       0.547    0.604           0.44     0.418           0.462    0.422
(0.5, 0, 0, 0, 0)             0.205    0.184           0.204    0.147           0.246    0.135
(1.25, 0.75, 0.5, 0.25, 0)    0.692    0.741           0.591    0.497           0.64     0.506

D.F. (10, 2)
Location parameters           MaxK     VUSs            MaxK     VUSs            MaxK     VUSs
(0, 0, 0, 0, 0)               0.052    0.053           0.046    0.059           0.056    0.068
(1, 0.75, 0.5, 0.25, 0)       0.4      0.452           0.336    0.307           0.352    0.313
(0.5, 0, 0, 0, 0)             0.126    0.138           0.172    0.119           0.183    0.116
(1.25, 0.75, 0.5, 0.25, 0)    0.514    0.574           0.435    0.347           0.519    0.371

a Based on maximum likelihood estimation.

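
The gain percentage DP = (MaxK − VUSs)/VUSs quoted for k = 3 can be recomputed from the Table 1 powers. The short sketch below does this for the twelve unbalanced-design entries with DF = (4.5, 4.5) or DF = (10, 2) under non-null location parameters; the function name `gain_percentage` is introduced here for illustration.

```python
def gain_percentage(maxk_power, vus_power):
    """DP = (MaxK - VUSs) / VUSs, expressed in percent."""
    return 100 * (maxk_power - vus_power) / vus_power

# (MaxK, VUSs) power pairs from Table 1, unbalanced designs only.
pairs = [
    (0.181, 0.179), (0.246, 0.213), (0.312, 0.31),    # D.F. (4.5, 4.5), sizes (15, 5, 5)
    (0.215, 0.201), (0.236, 0.209), (0.348, 0.329),   # D.F. (4.5, 4.5), sizes (20, 5, 5)
    (0.166, 0.158), (0.196, 0.162), (0.294, 0.249),   # D.F. (10, 2), sizes (15, 5, 5)
    (0.182, 0.171), (0.188, 0.152), (0.288, 0.25),    # D.F. (10, 2), sizes (20, 5, 5)
]
gains = [gain_percentage(m, v) for m, v in pairs]
print(round(min(gains), 2), round(max(gains), 2), round(sum(gains) / len(gains), 2))
```

The minimum, maximum, and mean of these values agree with the 0.65%, 23.68%, and 11.03% reported in the text.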
For k = 4, the simulated power of MaxK is higher than that of VUSs when the groups
whose adjacent location parameters are closely spaced have comparatively small sample
sizes. The DP ranges from 0% to 66.92% with the average gain percentage
in power being 28.22%. Note that VUS also performs comparably, especially
for the case of location parameters (0.75, 0.5, 0.25, 0) (see Table 2).
For k = 5, the simulated power of MaxK is higher than that of VUSs when the groups
whose adjacent location parameters are closely spaced have comparatively small sample
sizes. The DP ranges from 13.32% to 82.22% with the average gain
percentage in power being 36.27%. Note that VUS performs better when the sample
sizes are (10, 10, 10, 10, 10) (see Table 3).
The performances of MaxK and VUS strongly depend on the underlying simulation
settings. Since there is no theoretical comparison of these two methods, readers can
judge from the results in Tables 1–3 which method should be applied in which situation.
Tables 1–3 represent just a small subset of the many different scenarios that we simulated.
For example, we also conducted simulations for numerous other alternative patterns.
Interested readers may contact the corresponding author for these simulated results.
4. DATA EXAMPLES
Hemoglobin (HGB) tetramers are the major oxygen-carrying molecules in the
blood. Kimberly et al. (2011) reported that lower HGB values are associated with larger
acute infarcts and with an increased degree of infarct growth during acute ischemic stroke.
While the National Institutes of Health Stroke Scale (NIHSS) is an attractive
candidate predictor for disposition because it is widely used, there has been reluctance to
adopt it within clinical settings, because some users consider scale completion too time
consuming compared with standard neurological assessments (Lai et al.,
1998). In addition, although many components of the NIHSS are part of standard
neurological assessments, training is required for the reliable use of the tool (André,
2002).
Previous randomized controlled trials have demonstrated that the administration of
recombinant tissue plasminogen activator (rt-PA) treatment for acute ischemic stroke
(within three hours of symptom onset) improves functional outcomes without increasing
severe disability and mortality. However, intracerebral hemorrhage remains the most
feared side effect of rt-PA (Wardlaw et al., 2003; Derex and Nighoghossian, 2008).
Indeed, studies have shown that the initial stroke severity assessed by the NIHSS score
is an independent marker for subsequent intracerebral hemorrhage (Tanne et al., 2002).
In December 2003, the Bureau of National Health Insurance in Taiwan implemented
a payment rule for the rt-PA treatment of acute ischemic stroke, with a clinical disease
severity of NIHSS > 25 the minimum requirement for treatment. In addition, it adopted
the advice of the Taiwan Stroke Association and excluded slight strokes (NIHSS < 6)
from the payment rule. Based on these payment criteria, NIHSS 6 and NIHSS 25 have
become operational criteria.
From November 2010 to October 2011, 31 patients that had suffered from
ischemic stroke were recruited at the Chang Gung Memorial Hospital, Taiwan for
the study. The inclusion criteria were as follows: (i) first ever stroke, (ii) obvious
weakness of affected limbs (muscle power < 3), (iii) supratentorial hematoma, and (iv)
Glasgow Coma Scale > 6. Blood samples were collected from all study participants.
The median HGB for NIHSS < 6 (n = 21), NIHSS 6–25 (n = 7), and NIHSS > 25 (n =
3) were 15.2, 12.9, and 12.8, respectively. The box plot in Fig. 1 shows a non-
increasing trend in the medians among these three classes. From Equations (2) and
(3), the VUS was 0.37 (p = 0.025) and MaxK was 0.56 (p < 0.003). We conclude that
the probability of a correct NIHSS classification based on HGB value is higher than
that expected from chance alone.
Figure 1 Box plots of hemoglobin vs. 3-level NIHSS.

We further classify the patients into one of five stroke classes with the NIHSS
representing true disease status. Of the 31 patients, 16 were classified by the NIHSS as no
stroke (NIHSS = 0), four as minor stroke (NIHSS = 1–4), three as moderate stroke
(NIHSS = 5–15), five as moderate/severe stroke (NIHSS = 16–20), and three as severe
stroke (NIHSS = 21–42). The investigators then quantified the ability of HGB values to
classify patients correctly into these five NIHSS classes. Smaller patient HGB values were
hypothesized to be more indicative of stroke, and HGB values were expected to be greater
for no-stroke patients. The median HGB values for the five classes were 15.6, 14.4, 13.1,
12.8, and 12.7, respectively. The box plot in Fig. 2 also shows a non-increasing trend in the
medians among these five classes. The VUS and MaxK were 0.03 (standard deviation =
0.013) and 0.52 (μ = 0.1582, σ = 0.0658, and ξ = 0.1561, which are based on the
maximum likelihood estimation), as indicated in Table 4. These two estimates were
contrasted with those at the uninformative levels VUS = 1/120 and MaxK = 0; the
associated p-values were p = 0.061 and p < 0.001, respectively. We conclude that the
only significant test is MaxK. Since a plot of this data set still displays a non-increasing
trend, it seems that the VUS has failed to detect the trend, which indicates a deficiency
with the VUS.
5. DISCUSSION
The present study proposed a new method for assessing the performance of
diagnostic tests in clinical settings. This approach, based on a statistic called maximum
kappa, relies on the measurement agreement between true disease status and the new
biomarker; specifically, the kernel approach applies the κ coefficient, except when ranking
the value of the biomarker into k classes. A high κ coefficient indicates stronger
agreement with the response at a specific cutoff value.
The κ coefficient has been used extensively in map accuracy work (Congalton,
1991) and in models that predict the spatial distribution of species (Boyce et al., 1999;
Guisan and Zimmerman, 2000; Manly et al., 2002; Pearce and Boyce, 2006), in which a
threshold can be selected to maximize κ (Guisan et al., 1998; Guisan and Hofer, 2003;
Hirzel et al., 2006; Moisen et al., 2006). Therefore, the maximum κ coefficient should be
determined based on various cutoff values among possible biomarker values. The cutoff
values that result in the maximum κ value indicate the optimal agreement between the
new biomarker and true disease status.

Figure 2 Box plots of hemoglobin vs. 5-level NIHSS.

Table 4 Some comparative examples
         Statistic   Estimate   SD/(μ, σ, ξ)a               p-value
k = 3    VUS         0.37       0.105                       0.025
         MaxK        0.56       (0.1711, 0.0801, 0.0683)    0.003
k = 5    VUS         0.03       0.013                       0.061
         MaxK        0.52       (0.1582, 0.0658, 0.1561)    <0.001
Note. Abbreviations: SD, standard deviations; VUS, the volume under the ROC hyper-surface.
a Based on maximum likelihood estimation.

After defining MaxK and explaining how the asymptotic GEV distribution can be
reached (Gnedenko, 1943; Guisan et al., 1999), a finite sample simulation study was used
to investigate the powers of MaxK, the VUS, and VUSs under various underlying
populations, location parameter configurations, and sample size arrangements. The VUS
was then nonparametrically estimated and its variance calculated using the U-statistics
methodology in the simulation.
The statistics were clearly shown to be discrete random variables and were unable to
fill out the level of the Type I errors completely under H0; thus, we used simulated critical
values for the VUS. From this analysis, we found that the simulated powers of VUSs and
MaxK were close to the nominal levels under the null hypothesis. Further, the simulated
power of MaxK was higher than that of VUSs for the investigated location parameters
when the sample sizes that correspond to smaller spaces in adjacent location parameters
are comparatively small. The practical case shows this deficiency with the VUS in these
specific circumstances. However, the proposed approach still suffers from two major
limitations. First, the number of possible cutoff values was computationally intensive
even for relatively small samples. Second, the two-class coefficient was discrete.
Combining certain of the k classes, for example when two classes are relatively hard to
distinguish, and then calculating the κ value of the collapsed (k − 1) × (k − 1)
agreement table may be desirable. This strategy increases the value of κ. Following
the theorem proposed by Schouten (1986), we can deduce that the power of MaxK can
be improved by merging two adjacent classes whose location parameters are easily confused. For
example, some NIHSS classes are not easily distinguished by their HGB values. Moreover,
other authors have previously considered the effect of combining classes by applying
the κ coefficient (Kraemer, 1980; James, 1983), which will be investigated in future
studies.
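The effect of collapsing two adjacent classes can be illustrated with a small sketch. The 3 × 3 table below is hypothetical and was chosen so that the last two classes are frequently confused; the function names are also hypothetical, and the example shows one favorable case rather than a general result.

```python
def cohen_kappa(table):
    """Unweighted Cohen's kappa for a square agreement table."""
    n = sum(sum(row) for row in table)
    k = len(table)
    p_obs = sum(table[i][i] for i in range(k)) / n
    row = [sum(table[i]) / n for i in range(k)]
    col = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    p_exp = sum(row[i] * col[i] for i in range(k))
    return (p_obs - p_exp) / (1.0 - p_exp)


def merge_adjacent(table, i):
    """Collapse classes i and i + 1 into one, giving a (k-1) x (k-1) table."""
    k = len(table)
    new_index = [j if j <= i else j - 1 for j in range(k)]
    merged = [[0] * (k - 1) for _ in range(k - 1)]
    for r in range(k):
        for c in range(k):
            merged[new_index[r]][new_index[c]] += table[r][c]
    return merged


# Hypothetical agreement table: rows are true classes, columns predictions.
full = [
    [8, 1, 1],
    [1, 4, 5],
    [1, 5, 4],
]
collapsed = merge_adjacent(full, 1)      # [[8, 2], [2, 18]]
kappa_full = cohen_kappa(full)           # approximately 0.3
kappa_collapsed = cohen_kappa(collapsed) # approximately 0.7
```

In this constructed example most disagreement lies between the two merged classes, so the collapsed table shows markedly higher agreement.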
REFERENCES
André, C. (2002). The NIH stroke scale is unreliable in untrained hands. Journal of Stroke and
Cerebrovascular Diseases 11:43–46.
Betensky, R. A., Rabinowitz, D. (1999). Maximally selected chi square statistics for k × 2 tables.
Biometrics 55:317–320.
Boyce, M. S., McDonald, L. L., Manly, B. F. J. (1999). Reply to Mysterud and Ims. Trends in
Ecology & Evolution 14:490.
Brennan, R. L., Prediger, D. J. (1981). Coefficient kappa: Some uses, misuses, and alternatives.
Educational and Psychological Measurement 41:687–699.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological
Measurement 20:213–220.
Coles, S. (2001). An Introduction to Statistical Modelling of Extreme Values. London: Springer
Verlag.
Congalton, R. G. (1991). A review of assessing the accuracy of classifications of remotely sensed
data. Remote Sensing of Environment 37:35–46.
de Haan, L., Ferreira, A. (2006). Extreme Value Theory: An Introduction. London: Springer Verlag.
Derex, L., Nighoghossian, N. (2008). Intracerebral haemorrhage after thrombolysis for acute
ischaemic stroke: An update. Journal of Neurology, Neurosurgery & Psychiatry 79:1093–1099.
Dreiseitl, S., Ohno-Machado, L., Binder, M. (2000). Comparing three-class diagnostic tests by
three-way ROC analysis. Medical Decision Making 20:323–331.
Egan, J. P. (1975). Signal Detection Theory and ROC Analysis. New York: Academic Press.
Fleiss, J. L. (1975). Measuring agreement between two judges on the presence or absence of a trait.
Biometrics 31:651–659.
Fleiss, J. L., Cohen, J., Everitt, B. S. (1969). Large sample standard errors of kappa and weighted
kappa. Psychological Bulletin 72:323–327.
Gnedenko, B. (1943). Sur la distribution limite du terme maximum d'une série aléatoire. Annals of
Mathematics 44:423–453.
Green, D. M., Swets, J. A. (1996). Signal Detection Theory and Psychophysics. New York: Wiley.
Guisan, A., Hofer, U. (2003). Predicting reptile distributions at mesoscale: Relation to climate and
topography. Journal of Biogeography 30:1233–1243.
Guisan, A., Theurillat, J. P., Kienast, F. (1998). Predicting the potential distribution of plant species
in an alpine environment. Journal of Vegetation Science 9:65–74.
Guisan, A., Weiss, S. B., Weiss, A. D. (1999). GLM versus CCA spatial modelling of plant species
distribution. Plant Ecology 143:107–122.
Guisan, A., Zimmermann, N. E. (2000). Predictive habitat distribution models in ecology. Ecological
Modelling 135:147–186.
Gumbel, E. J. (1958). Statistics of Extremes. New York: Columbia University Press.
Heckerling, P. S. (2001). Parametric three-way receiver operating characteristic surface analysis
using Mathematica. Medical Decision Making 21:409–417.
Hirzel, A. H., LeLay, G., Helfer, V. (2006). Evaluating the ability of habitat suitability models to
predict species presences. Ecological Modelling 199:142–152.
James, I. R. (1983). Analysis of nonagreements among multiple raters. Biometrics 39:651–657.
Jenkinson, A. F. (1955). The frequency distribution of the annual maximum (or minimum) values of
meteorological elements. Quarterly Journal of the Royal Meteorological Society 81:158–171.
Kimberly, W. T., Wu, O., Arsava, E. M., Garg, P., Ji, R., Vangel, M., Singhal, A. B., Ay, H.,
Sorensen, A. G. (2011). Lower hemoglobin correlates with larger stroke volumes in acute
ischemic stroke. Cerebrovascular Diseases Extra 1:44–53.
Kraemer, H. C. (1979). Ramifications of a population model for κ as a coefficient of reliability.
Psychometrika 44:461–472.
Kraemer, H. C. (1980). Extension of the kappa coefficient. Biometrics 36:207–216.
Lai, S. M., Duncan, P. W., Keighley, J. (1998). Prediction of functional outcome after stroke:
Comparison of the Orpington prognostic scale and the NIH stroke scale. Stroke 29:1838–1842.
Manel, S., Williams, H. C., Ormerod, S. J. (2001). Evaluating presence–absence models in ecology:
The need to account for prevalence. Journal of Applied Ecology 38:921–931.
Manly, B. F. J., McDonald, L. L., Thomas, D. L., Trent, L., McDonald, T. L., Erickson, W. P.
(2002). Resource Selection by Animals: Statistical Design and Analysis for Field Studies.
London: Kluwer Academic.
Miller, R., Siegmund, D. (1982). Maximally selected chi square statistics. Biometrics 38:1011–1016.
Moisen, G. G., Freeman, E. A., Blackard, J. A. (2006). Predicting tree species presence and basal
area in Utah: A comparison of stochastic gradient boosting, generalized additive models and
tree-based methods. Ecological Modelling 199:176–187.
Mossman, D. (1999). Three-way ROCs. Medical Decision Making 19:78–89.
Nakas, C. T., Yiannoutsos, C. T. (2004). Ordered multiple-class ROC analysis with continuous
measurements. Statistics in Medicine 23:3437–3449.
Pearce, J. L., Boyce, M. S. (2006). Modelling distribution and abundance with presence-only data.
Journal of Applied Ecology 43:405–412.
Pepe, M. S. (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction.
New York: Oxford University Press.
Prescott, P., Walden, A. T. (1980). Maximum likelihood estimation of the parameters of the
generalized extreme-value distribution. Biometrika 67:723–724.
Prescott, P., Walden, A. T. (1983). Maximum likelihood estimation of the parameters of the
three-parameter generalized extreme-value distribution from censored samples. Journal of
Statistical Computation and Simulation 16:241–250.
Schouten, H. J. A. (1986). Nominal scale agreement among observers. Psychometrika 9:453–466.
Tanne, D., Kasner, S. E., Demchuk, A. M., Koren-Morag, N., Hanson, S., Grond, M., Levine, S. R.
(2002). Markers of increased risk of intracerebral hemorrhage after intravenous recombinant
tissue plasminogen activator therapy for acute ischemic stroke in clinical practice: The
multicenter rt-PA stroke survey. Circulation 105:1679–1685.
Terpstra, J. T., Magel, R. C. (2003). A new nonparametric test for the ordered alternative problem.
Nonparametric Statistics 15:289–301.
Tiago de Oliveira, J. (1973). Statistical Extremes-A Survey. Lisbon: Center of Applied Mathematics.
Wardlaw, J. M., Zoppo, G., Yamaguchi, T. (2003). Thrombolysis for acute ischaemic stroke.
Cochrane Database of Systematic Reviews 3:CD000213.
Warrens, M. J. (2008a). On the equivalence of Cohen's kappa and the Hubert-Arabie adjusted Rand
index. Journal of Classification 25:177–183.
Warrens, M. J. (2008b). On similarity coefficients for 2 × 2 tables and correction for chance.
Warrens, M. J. (2010). Inequalities between kappa and kappa-like statistics for k × k tables.
Zwick, R. (1988). Another look at interrater agreement. Psychological Bulletin 103:374–378.

