regression model
P.J. Rousseeuw (1) and A. Christmann (2)

(1) Universitaire Instelling Antwerpen (UIA), Department of Mathematics and Computer Science, Universiteitsplein 1, B-2610 Wilrijk, Belgium
(2) University of Dortmund, SFB-475, HRZ, 44221 Dortmund, Germany
Abstract

The logistic regression model is commonly used to describe the effect of one or several explanatory variables on a binary response variable. Here we consider an alternative model under which the observed response is strongly related but not equal to the unobservable true response. We call this the hidden logistic regression (HLR) model because the unobservable true responses are comparable to a hidden layer in a feedforward neural net. We propose the maximum estimated likelihood method in this model, which is robust against separation, unlike existing methods for logistic regression. We also consider outlier-robust estimation in this setting.
1 Introduction
The logistic regression model assumes independent Bernoulli distributed response variables with success probabilities Λ(x_i'θ) where Λ is the logistic distribution function, x_i ∈ ℝ^p are vectors of explanatory variables, 1 ≤ i ≤ n, and θ ∈ ℝ^p is unknown. Under these assumptions, the classical maximum likelihood (ML) estimator has certain asymptotic optimality properties. However, even if the logistic regression assumptions are satisfied there are data sets for which the ML estimate does not exist. This occurs for exactly those data sets in which there is no overlap between successes and failures, cf. Albert and Anderson (1984) and Santner and Duffy (1986). This identification problem is not limited to the ML estimator but is shared by all estimators for logistic regression, such as that of Künsch et al. (1989).
One way to deal with this problem is to measure the amount of overlap. This can be done by exploiting a connection between the notion of overlap and the notion of regression depth proposed by Rousseeuw and Hubert (1999), leading to the algorithm of Christmann and Rousseeuw (2001). A comparison between this approach and the support vector machine is given in Christmann, Fischer and Joachims (2000).
In Section 2 we use an alternative model, which is an extension of the logistic regression model. We assume that due to an additional stochastic mechanism the true response of a logistic regression model is unobservable, but that there exists an observable variable which is strongly related to the true response. E.g., in a medical context there is often no perfect laboratory test procedure to detect whether a specific illness is present or not (i.e., misclassification errors may sometimes occur). In that case, the true response (whether the disease is present) is not observable, but the result of the laboratory test is.
It can be argued that the true unobservable responses are comparable to a hidden layer in a feedforward neural network model, which is why we call this the hidden logistic regression (HLR) model. In Section 3 we propose the maximum estimated likelihood (MEL) technique in this model, and show that it is immune to the identification problem described above. The MEL estimator is studied by simulations (Section 4) and on real data sets (Section 5). In Section 6 we also consider outlier-robust estimation in this setting, whereas Section 7 provides a discussion and an outlook to further research.
2 The hidden logistic regression model
The classical logistic regression model assumes n observable independent responses Y_i with Bernoulli distributions Bi(1, Λ(x_i'θ)), where i = 1, …, n and θ ∈ ℝ^p. Throughout this paper we assume that there is an intercept, so we put x_{i,1} = 1 for all i, and thus p ≥ 2.
The new model assumes that the true responses are unobservable (latent) due to an additional stochastic mechanism. In medical diagnosis there is typically no test procedure (e.g. a blood test) which is completely free of misclassification errors. Another possible cause of misclassifications is the occurrence of clerical errors.
To clarify the model, let us first consider a medical application with only n = 1 patient. His/her true status (e.g. presence or absence of the disease) has two possible values, typically denoted as success (s) and failure (f). We assume that the true status T is unobservable. However, we can observe the variable Y which is strongly related to T as in Figure 1. If the true status is T = s we observe Y = 1 with probability P(Y = 1 | T = s) = δ1, hence a misclassification occurs with probability P(Y = 0 | T = s) = 1 − δ1. Analogously, if the true status is f we observe Y = 1 with probability P(Y = 1 | T = f) = δ0 and we obtain Y = 0 with probability P(Y = 0 | T = f) = 1 − δ0. We of course assume that the probability of observing the true status is higher than 50%, i.e. 0 < δ0 < 0.5 < δ1 < 1.
Fig. 1. The relation between the unobservable true status T and the observable response Y, with P(Y = 1 | T = s) = δ1 and P(Y = 1 | T = f) = δ0.
Ekholm and Palmgren (1982) considered the general case with n observations. In our notation, there are n unobservable independent random variables T_i resulting from a classical logistic regression model with finite parameter vector θ = (θ1, …, θp)' = (α, β1, …, β_{p−1})'. Hence T_i has a Bernoulli distribution with success probability π_i = Λ(x_i'θ) where Λ(z) = 1/[1 + exp(−z)] and x_i ∈ ℝ^p. Furthermore, they assume that the observable responses Y_i are related to T_i as in Figure 1. For instance, when T_i = s we obtain Y_i = 1 with probability P(Y_i = 1 | T_i = s) = δ1 whereas Y_i = 0 occurs with the complementary probability P(Y_i = 0 | T_i = s) = 1 − δ1. (The plain logistic model assumes δ0 = 0 and δ1 = 1.) The entire mechanism in Figure 2 we call the hidden logistic regression model because the true status T_i is hidden by the stochastic structure in the top part of Figure 2. This model can be interpreted as a special kind of neural net, with a single hidden layer that corresponds to the latent variable T_i.
3 The maximum estimated likelihood method

a. Construction
We now need a way to fit data sets arising from the hidden logistic model. Two approaches already exist, by Ekholm and Palmgren (1982) and by Copas (1988), but here we will proceed in a different way.
Fig. 2. The hidden logistic regression model: the explanatory variables x1, x2, …, x_{p−1} determine the latent true status T through a logistic regression, and T determines the observed response Y through the probabilities δ0 and δ1.
Let us start by looking only at Figure 1, where Y is observed but T is not. How can we then estimate T? This is actually the smallest nontrivial estimation problem, because any such problem needs more than one possible value of the parameter and more than one possible outcome. Here we have exactly two values for both, and the only distributions on two possible outcomes are the Bernoulli distributions. Under f the likelihood of Y = 0 exceeds that of Y = 1, and under s the opposite holds. Therefore, the maximum likelihood estimator of T given (Y = y) becomes simply

    T̂_ML(Y = 0) = f  and  T̂_ML(Y = 1) = s    (1)

which conforms with intuition.
Let us now consider the conditional probability that Y = 1 given T̂_ML, yielding

    P(Y = 1 | T̂_ML) = δ0 if y = 0,  and  = δ1 if y = 1.    (2)

For each observation y_i this yields

    ỹ_i = (1 − y_i) δ0 + y_i δ1    (3)

which we will call the pseudo-observations. In words, the pseudo-observation ỹ_i is the success probability conditional on the most likely estimate of the true status t_i.
We now want to fit a logistic regression to the pseudo-observations ỹ_i. (In the classical case, ỹ_i = y_i.) There are several estimation methods, but here we will apply the maximum likelihood method, i.e. we maximize

    ∏_{i=1}^n [Λ(x_i'θ)]^{ỹ_i} [1 − Λ(x_i'θ)]^{1 − ỹ_i}    (4)

over θ ∈ ℝ^p. We call (4) the estimated likelihood because we don't know the true likelihood, which depends on the unobservable t1, …, tn. (We only know the true likelihood when δ0 = 0 and δ1 = 1.) The maximizer θ̂ of (4) can thus be called the maximum estimated likelihood (MEL) estimator.
In order to compute the MEL estimator we can take the logarithm of (4), yielding

    ∑_{i=1}^n [ ỹ_i ln(Λ(x_i'θ)) + (1 − ỹ_i) ln(1 − Λ(x_i'θ)) ]    (5)

which always exists since θ is finite. Differentiating with respect to θ yields the (p-variate) score function

    s(θ | (ỹ1, …, ỹn)) = ∑_{i=1}^n (ỹ_i − Λ(x_i'θ)) x_i    (6)

for all θ ∈ ℝ^p. Setting (6) equal to zero yields the desired estimate.
Property 1. For any data set whose design matrix has full rank p, the MEL estimator exists and is unique.

Proof. The derivative of the score function (6) equals

    s'(θ) = −∑_{i=1}^n Λ(x_i'θ)(1 − Λ(x_i'θ)) x_i x_i'    (7)

and is thus negative definite because the design matrix has rank p. Therefore the differentiable function (5) is strictly concave. Now let us take any θ ≠ 0 and replace θ in (5) by λθ. If we let λ → +∞ then (5) always tends to −∞ because there is at least one x_i in the data set with x_i'θ ≠ 0 due to full rank, and neither ỹ_i nor (1 − ỹ_i) can be zero. Therefore, there must be a finite maximizer θ̂_MEL of (5), which is unique because the concavity is strict. □
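Since (5) is smooth and strictly concave, the MEL estimate can be computed by ordinary Newton-Raphson scoring applied to the pseudo-observations. A Python sketch under these assumptions (names and data are ours, not the authors' code; this is standard logistic scoring with ỹ in place of y):

```python
import numpy as np

def mel_fit(X, y_tilde, n_iter=50):
    """Maximize (5) by Newton-Raphson: score (6), Hessian (7).
    X is the n x p design matrix (first column all ones),
    y_tilde are pseudo-observations strictly inside (0, 1)."""
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-X @ theta))   # Lambda(x_i' theta)
        score = X.T @ (y_tilde - pi)            # equation (6)
        W = pi * (1.0 - pi)
        hessian = -(X * W[:, None]).T @ X       # equation (7)
        step = np.linalg.solve(hessian, score)
        theta = theta - step
        if np.max(np.abs(step)) < 1e-10:
            break
    return theta

# A toy data set with NO overlap (x < 0 all failures, x > 0 all successes),
# so the classical ML estimate does not exist:
x = np.array([-2.0, -1.5, -1.0, 1.0, 1.5, 2.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([0, 0, 0, 1, 1, 1])
y_tilde = (1 - y) * 0.01 + y * 0.99   # pseudo-observations, eq. (3)
theta_mel = mel_fit(X, y_tilde)       # finite, by Property 1
```

Because ỹ_i and 1 − ỹ_i are bounded away from zero, the iteration converges to a finite θ̂_MEL even for this separated data set.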
This implies that the MEL estimator exists even when the data set has no overlap. Therefore also the resulting odds ratios OR_j = exp(θ̂_j) always exist, i.e. they are never zero or +∞.
A property shared by all logistic regression estimators is x-affine equivariance. This says that when the x_i are replaced by x_i* = A x_i where A is a nonsingular p × p matrix, then the regression coefficients transform accordingly.

Property 2. The MEL estimator is x-affine equivariant.

Proof. From (6) it follows that θ̂*_MEL = (A')^{−1} θ̂_MEL, hence (x_i*)' θ̂*_MEL = x_i' A' (A')^{−1} θ̂_MEL = x_i' θ̂_MEL. This also yields the same predicted values. □
In linear regression there exist two other types of equivariance: one about adding a linear function to the response (`regression equivariance') and one about multiplying the response by a constant factor (`y-scale equivariance'), but these obviously do not apply to logistic regression.
b. Choice of δ0 and δ1
If δ0 and δ1 are known from the context (e.g. from the type I and type II error probabilities of a blood test) then we can use these values. But in many cases, δ0 and δ1 are not given in advance. Copas (1988, page 241) found that accurate estimation of δ0 and δ1 from the data itself is very difficult, if not impossible unless n is extremely large. He essentially considers them as tuning constants that can be chosen, as do we.
The `symmetric' approach used by Copas is to choose a single constant ε > 0 and to set

    δ0 = ε  and  δ1 = 1 − ε.    (8)

His computations require that ε be small enough so that terms in ε² can be ignored. In his Table 1 the values ε = 0.01 and ε = 0.02 occur, whereas he considers ε = 0.05 to be unreasonably high (page 238). In most of Copas' examples ε = 0.01 performs well, and this turns out to be true also for our MEL method, so we could use ε = 0.01 as the default choice. This approach has the advantage of simplicity.
On the other hand, there is something to be said for an `asymmetric' choice which takes into account how many y_i's are 0 and 1 in the data set. Let us consider the marginal distribution of the y_i (that is, unconditional on the x_i) from which we construct some estimate π̂ of the marginal success probability P(Y = 1). It seems reasonable to constrain δ0 and δ1 such that the average of the pseudo-observations ỹ_i corresponds to π̂. This yields

    (1/n) ∑_{i=1}^n ỹ_i = (1 − π̂) δ0 + π̂ δ1 = π̂

hence

    δ0 / (π̂ − δ0) = (1 − δ1) / (δ1 − π̂).

Since it is natural to assume that δ0 < π̂ < δ1, the latter ratios equal a (small) positive number which we will denote by δ. Consequently we can write both δ0 and δ1 as functions of δ, as:

    δ0 = δ π̂ / (1 + δ)  and  δ1 = (1 + δ π̂) / (1 + δ).    (9)
However, since we have assumed that δ0 < π̂ < δ1 we have to construct π̂ accordingly. We cannot take the standard estimate π̄ = (1/n) ∑_{i=1}^n y_i = (number of y_i = 1)/n because π̄ can become 0 or 1. A natural idea is to bound π̄ away from 0 and 1 by putting

    π̂ = max(δ, min(1 − δ, π̄))    (10)

hence δ0 < π̂ < δ1. Note that both misclassification probabilities in Figure 1 are less than δ because

    δ0 = δ π̂ / (1 + δ) < δ / (1 + δ) < δ

and

    1 − δ1 = 1 − (1 + δ π̂)/(1 + δ) = δ (1 − π̂)/(1 + δ) < δ / (1 + δ) < δ.
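A small Python sketch of the asymmetric choice, combining (10) and (9) (function and variable names are ours):

```python
import numpy as np

def asymmetric_deltas(y, delta=0.01):
    """Compute pi_hat, delta0, delta1 from equations (10) and (9):
    pi_hat is the marginal success rate truncated to [delta, 1-delta],
    delta0 = delta*pi_hat/(1+delta), delta1 = (1+delta*pi_hat)/(1+delta)."""
    y = np.asarray(y, dtype=float)
    pi_bar = y.mean()                               # raw success rate
    pi_hat = max(delta, min(1.0 - delta, pi_bar))   # equation (10)
    delta0 = delta * pi_hat / (1.0 + delta)         # equation (9)
    delta1 = (1.0 + delta * pi_hat) / (1.0 + delta)
    return pi_hat, delta0, delta1

# Balanced data: identical misclassification probabilities delta0 = 1 - delta1.
pi_hat, d0, d1 = asymmetric_deltas([0, 0, 1, 1], delta=0.01)
```

As a check, both misclassification probabilities come out smaller than δ, and the average pseudo-observation (1 − π̂)δ0 + π̂δ1 reproduces π̂.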
Our default choice will be δ = 0.01, which implies smaller misclassification errors than by putting ε = δ in the symmetric formulas (8).
When the data are `balanced' in the sense that there are as many y_i = 1 as y_i = 0, expression (10) yields π̂ = 0.5, hence δ0 = 1 − δ1 by (9), yielding identical misclassification probabilities, as in the symmetric formulas (8). In all other, `unbalanced' cases, our asymmetric approach yields less biased predictions. An extreme case is when all y_i = 1. (This is a situation where the classical ML estimator does not exist.) The MEL estimator will put all ỹ_i = δ1, yielding a fit with all fitted values equal to δ1.
http://www.statistik.uni-dortmund.de/sfb475/berichte/rouschr2.zip
The ML estimator has the nice property under the logistic regression model that if θ̂ is the ML estimate for the data set {(x_i', y_i); 1 ≤ i ≤ n}, then −θ̂ is the ML estimate for the data set {(x_i', 1 − y_i); 1 ≤ i ≤ n}. Hence, recoding all response variables Y_i to 1 − Y_i affects the ML estimator only in the way that it changes the signs of the regression coefficients, and the odds ratios become exp(−θ̂_j) = 1/OR_j. We call this equivariance with respect to recoding the response variable. The MEL estimator has the same property, whether δ0 and δ1 are given by (8) or (9).

Property 3. The MEL estimator is equivariant with respect to recoding the response variable.

Proof. Writing y_i* = 1 − y_i and recomputing (10) and (9) [or (8)] yields ỹ_i* = 1 − ỹ_i by (3). Applying the ML estimator to the (x_i', ỹ_i*) yields the desired result. □
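The key step of the proof, ỹ_i* = 1 − ỹ_i after recoding, can be checked numerically. A Python sketch (the helper name is ours; it simply recomputes (10), (9) and (3) from scratch):

```python
import numpy as np

def pseudo_from_data(y, delta=0.01):
    """delta0, delta1 from (10) and (9), then pseudo-observations (3)."""
    y = np.asarray(y, dtype=float)
    pi_hat = max(delta, min(1.0 - delta, y.mean()))
    delta0 = delta * pi_hat / (1.0 + delta)
    delta1 = (1.0 + delta * pi_hat) / (1.0 + delta)
    return (1.0 - y) * delta0 + y * delta1

y = np.array([0, 1, 1, 1, 0, 1])
y_tilde = pseudo_from_data(y)            # original coding
y_tilde_star = pseudo_from_data(1 - y)   # recoded responses
# Recoding the responses flips the pseudo-observations: y~* = 1 - y~.
```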
4 Simulations
In this section we carry out a small simulation to compare the bias and the standard error of the usual ML estimator and the proposed MEL estimator with δ = 0.01 under the assumptions of the logistic regression model. We will estimate p = 3 coefficients, including the intercept term. Both explanatory variables are generated from the standard normal distribution. As true parameter vectors we use θ = (1, 0, 0)' (case A) and θ = (1, 1, 2)' (case B). The number of observations n will be 20, 50, and 100. For each situation 1,000 samples are generated.
We use the depth-based algorithm (Christmann and Rousseeuw 2001) to check whether the data set has overlap, i.e. whether the ML estimate exists. It turned out that there were 12 data sets without overlap for n = 20 in case A, and 129 data sets without overlap for n = 20 in case B. This contrasts sharply with the MEL estimate, which existed for all data sets.
Table 1 compares ML and MEL for the data sets with overlap. In situation A, where the true slopes are zero, there is not much difference between the estimators. But in situation B, the MEL estimator has a substantially smaller bias and standard error than the ML estimator. This can be explained by the well-known phenomenon that ML tends to overestimate the magnitude of nonzero coefficients, whereas MEL exhibits a kind of `shrinkage' behavior.
5 Examples
In this section we consider some benchmark data sets. Both the banknotes data set (Riedwyl 1997) and the hemophilia data set (Hermans and Habbema 1975) have no overlap, hence their ML estimate does not exist. The vasoconstriction data (Finney 1947, Pregibon 1981) and the food stamp data (Künsch et al. 1989) are well-known in the literature on outlier detection and robust logistic regression. They both have little overlap: it suffices to delete 3 (resp. 6) observations in these data sets to make the ML estimate nonexistent (see Christmann and Rousseeuw 2001). Some of these observations are considered as outliers in Künsch et al. (1989). The cancer remission data set (Lee 1974) is chosen because n/p ≈ 4 is small. The toxoplasmosis data set (Efron 1986) and the IVC data set (Jaeger et al. 1997, 1998) have a large n.
Table 1: Bias and standard error of the ML estimator and the MEL estimator with δ = 0.01.

                            ML                MEL
   n             Bias     SE        Bias     SE
 Case A with θ = (1, 0, 0)'
   20    α̂      0.291   0.032      0.272   0.028
         β̂1     0.010   0.031      0.009   0.029
         β̂2    -0.014   0.035     -0.004   0.030
   50    α̂      0.097   0.012      0.095   0.012
         β̂1    -0.015   0.011     -0.015   0.011
         β̂2    -0.021   0.012     -0.021   0.012
  100    α̂      0.053   0.008      0.052   0.008
         β̂1     0.004   0.008      0.004   0.008
         β̂2    -0.004   0.008     -0.004   0.008
 Case B with θ = (1, 1, 2)'
   20    α̂      0.586   0.067      0.360   0.039
         β̂1     0.652   0.083      0.364   0.045
         β̂2     1.372   0.159      0.780   0.057
   50    α̂      0.133   0.019      0.097   0.017
         β̂1     0.156   0.022      0.104   0.019
         β̂2     0.350   0.030      0.247   0.025
  100    α̂      0.061   0.011      0.038   0.010
         β̂1     0.085   0.012      0.050   0.011
         β̂2     0.154   0.016      0.084   0.015
The IVC data set describes an in vitro experiment to study possible risk factors of the thrombus capturing efficacy of inferior vena cava (IVC) filters. We focus on the study of a particular conical IVC filter, for which the design consisted of 48 different settings x_i. For each vector x_i there were m_i replications with m_i ∈ {50, 60, 90, 100}, yielding a total of n = 3200.
Table 2: Comparison between MEL estimates with δ = 0.01 and ML estimates.

 Data set (n; p)            Method     α̂      β̂1     β̂2     β̂3     β̂4    β̂5     β̂6
 Banknotes (200; 7)         ML      no overlap, ML does not exist
                            MEL     147.09    0.46  -1.02   1.33   2.20   2.32  -2.37
 Hemophilia (52; 3)         ML      no overlap, ML does not exist
                            MEL      -5.43  -56.59  47.39
 Vasoconstriction (39; 3)   ML       -2.92    5.22   4.63
                            MEL      -2.77    4.98   4.41
 Food stamp (150; 4)        ML        0.93   -1.85   0.90  -0.33
                            MEL       0.89   -1.83   0.88  -0.33
 Cancer remission (27; 7)   ML       58.04   24.66  19.29 -19.60   3.90   0.15 -87.43
                            MEL      58.51   18.20  12.20 -12.19   3.68   0.14 -81.42
 Toxoplasmosis (697; 4)     ML        0.10   -0.45  -0.19   0.21
                            MEL       0.10   -0.44  -0.19   0.21
 IVC (3200; 5)              ML       -1.79    0.67  -1.05  -1.25   1.83
                            MEL      -1.73    0.65  -1.03  -1.22   1.79
Table 2 shows that the MEL estimates with δ = 0.01 were quite similar to the ML estimates for the data sets with overlap. This is even true for the cancer remission data set, taking into account the huge standard errors of the ML coefficients, namely 71.23, 47.84, 57.95, 61.68, 2.34, 2.28, and 67.57. The odds ratios exp(θ̂_j) based on the ML and MEL estimates were quite similar too (see Table 3).
Figure 3 shows that the choice of δ has relatively little impact on the MEL estimates for the food stamp data set, which has overlap. Figure 4 shows the effect of δ for the banknotes data. Because the latter data set has no overlap we know that ||θ̂|| tends to +∞ as δ goes to 0 (since δ = 0 corresponds to the ML estimator). One could therefore use δ like a `ridge parameter' in Figure 4.
Table 3: Comparison of odds ratios based on ML and MEL.

 Data set (n; p)            Method   exp(α̂)  exp(β̂1)  exp(β̂2)  exp(β̂3)
 Vasoconstriction (39; 3)   ML        0.05    185.03   102.64
                            MEL       0.06    146.13    81.97
 Food stamp (150; 4)        ML        2.53      0.16     2.45     0.72
                            MEL       2.44      0.16     2.42     0.72
 Toxoplasmosis (697; 4)     ML        1.10      0.64     0.83     1.24
                            MEL       1.10      0.64     0.83     1.24
6 Outlier-robust estimation
In the literature on logistic regression, many robust alternatives to the maximum likelihood estimator have been proposed. They can easily be modified for the hidden logistic regression model in the same way that we constructed the MEL estimator, i.e. by applying them to the pseudo-observations (3).
As an example we will consider a modification of the least trimmed weighted squares (LTWS) estimator of Christmann (1994a) which is defined as follows. We assume large strata, i.e. each design point x_i has m_i responses Y_ij for j = 1, …, m_i. One then adds all the Y_ij corresponding to that x_i, yielding

    Z_i = ∑_{j=1}^{m_i} Y_ij ∈ {0, …, m_i}

and redefines n as the number of the x_i's (which is less than the total number of original responses Y_ij). The large strata assumption says that n and p are fixed while min_{1≤i≤n} m_i → ∞ and m_i / (∑_{l=1}^n m_l) → k_i ∈ (0, 1). One then puts π̂_i = Z_i/m_i and Z_i* = (m_i π̂_i (1 − π̂_i))^{1/2} Λ^{−1}(π̂_i) as well as X_i* = (m_i π̂_i (1 − π̂_i))^{1/2} x_i. For large values of m_i the Z_i* approximately follow a linear regression model in the X_i*. Christmann (1994a) defined the LTWS estimator of θ as the least trimmed squares estimator (Rousseeuw 1984) applied to the transformed variables Z_i* and X_i*, that is

    θ̂_LTWS = argmin_{θ ∈ ℝ^p} ∑_{i=1}^h r²_{i:n}(θ)

where r²_{1:n}(θ) ≤ … ≤ r²_{n:n}(θ) are the ordered squared residuals and r_i(θ) = Z_i* − θ'X_i*. The robustness aspects and asymptotic behavior of θ̂_LTWS were investigated in Christmann (1994a, 1998).
In the hidden logistic model, we apply the LTWS method to the pseudo-observations ỹ_ij defined in (3), with δ0 and δ1 given by (9) and (10). That is, we put

    Ỹ_ij = (1 − Y_ij) δ0 + Y_ij δ1

yielding the corresponding variable Z̃_i = ∑_{j=1}^{m_i} Ỹ_ij. Substituting Z̃_i for Z_i yields π̃_i = Z̃_i/m_i and

    Z̃_i* = (m_i π̃_i (1 − π̃_i))^{1/2} Λ^{−1}(π̃_i)
    X̃_i* = (m_i π̃_i (1 − π̃_i))^{1/2} x_i

to which we apply LTS regression. Like the MEL estimator, this modified LTWS estimator exists for all data sets (and it is still x-affine equivariant). In addition, it is also robust to outliers in Z̃_i and x_i. (The latter means that the modified LTWS estimator can resist the effect of leverage points, unlike some other robust approaches.)
Fig. 3. Graphs of the MEL coefficients β̂1, β̂2, β̂3 versus δ for the food stamp data set, for δ = 0.0001, 0.001, 0.005, 0.01, 0.05, and 0.1.
Fig. 4. Graphs of the first four MEL coefficients versus δ for the banknotes data set, for δ = 0.0001, 0.001, 0.005, 0.01, 0.05, and 0.1.
Let us illustrate the modified LTWS estimator on the toxoplasmosis data set (Efron 1986) of Section 5. In aggregated form this data set has n = 34 observations, with m_i ranging from 1 to 82 with a mean of 20.5. We ran the modified LTWS method with the default choices δ = 0.01 and h = [0.75n] = 25, which took only a few seconds because we used the FAST-LTS program (Rousseeuw and Van Driessen 1999b). The resulting coefficients were (−0.37, −1.26, −0.17, 0.42)' which clearly differ from the non-robust coefficients given in Table 2. Of course, the odds ratios 0.69, 0.28, 0.84, and 1.52 based on the outlier-robust approach also differ from the non-robust odds ratios in Table 3. The observations 27, 28, and 30 stick out in the robust residual plot (Figure 5), which agrees with findings based on a robust minimum Hellinger distance approach (Christmann 1994b).
Fig. 5. Robust residual plot (residual versus index) of the modified LTWS fit to the toxoplasmosis data; observations 27, 28, and 30 stand out.
7 Discussion and outlook

The main problem addressed in this paper is that the coefficients of the binary regression model (with logistic or probit link function) cannot be estimated when the x_i's of successes and failures don't overlap. This is a deficiency of the model itself, because the fit can be made perfect by letting ||θ|| tend to infinity. Therefore, this problem is shared by all reasonable estimators that operate under the logistic model.
Our approach to resolve this problem is to work with a generalized model, which we call the hidden logistic model. Here we compute the pseudo-observations ỹ_i, defined as the probability that y_i = 1 conditional on the maximum likelihood estimate of the true status t_i. The resulting MEL estimator always exists and is unique, even though the hypothetical misclassification probabilities (based on our default setting δ = 1%) are so small that they would not be visible in the observed data.
The hidden logistic model was previously used (under a different name) in an important paper by Copas (1988). However, his approach and ours are almost diametrically opposite. Copas' motivation is to reduce the effect of the outliers that matter, which are the observations (x_i, y_i) where x_i is far away from the bulk of the data and y_i has the value which is very unlikely under the logistic model. In the terminology of Rousseeuw and van Zomeren (1990) these are bad leverage points. In logistic regression their effect is always to flatten the fit, i.e. to bring the estimated slopes close to zero. Copas' approach shrinks the logistic distribution function Λ away from 0 and 1 (by letting it range between ε and 1 − ε), so that bad leverage points are no longer that unlikely under his model, which greatly reduces their effect. On the other hand, his approach aggravates the problems that arise when there is little overlap between successes and failures, as in his analysis of the vasoconstriction data.
Our approach goes into the other direction: rather than shrinking Λ while leaving the responses y_i unchanged, we leave Λ unchanged and shrink the y_i to the pseudo-observations ỹ_i which are slightly larger than zero or slightly less than 1. This completely eliminates the overlap problem. It does not help at all for the problem of bad leverage points, but for that problem we can use existing techniques from the robustness literature. For instance, for grouped data (i.e. tied x_i's) we saw in Section 6 that the fitting can be done by the LTS regression method, which is robust against leverage points.
In general, also other robust techniques can be applied to the (x_i, ỹ_i). For instance, note that the score function (6) is similar to an M-estimator equation. Since the (pseudo-)residual is always bounded due to

    |ỹ_i − Λ(x_i'θ)| < 1

the main problem comes from the factor x_i which need not be bounded (this corresponds to the leverage point issue). A straightforward remedy is to downweight leverage points, yielding the weighted maximum estimated likelihood (WEMEL) estimator defined as the solution θ̂ of

    ∑_{i=1}^n (ỹ_i − Λ(x_i'θ)) w_i x_i = 0    (11)
where the weights w_i only depend on how far away x_i is from the bulk of the data. For instance, we can put

    w_i = M / max{RD²(x_i), M}    (12)

where x_i = (x_{i,2}, …, x_{i,p}) ∈ ℝ^{p−1} denotes x_i without its intercept component, RD(x_i) is its robust distance, and M is the 75th percentile of all RD²(x_j), j = 1, …, n.
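The weights (12) can be sketched as follows. The paper uses robust distances from the MCD estimator; purely for illustration, this Python sketch substitutes classical Mahalanobis distances, which have the same form but are not robust (all names are ours):

```python
import numpy as np

def wemel_weights(X_nonintercept):
    """Weights w_i = M / max(RD_i^2, M) as in equation (12), where M is
    the 75th percentile of the squared distances. Classical Mahalanobis
    distances stand in here for the robust (MCD-based) distances RD."""
    X = np.asarray(X_nonintercept, dtype=float)
    diff = X - X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    d2 = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)  # squared distances
    M = np.percentile(d2, 75)
    return M / np.maximum(d2, M)   # equals 1 for the central 75% of points

rng = np.random.default_rng(0)
Xc = rng.normal(size=(40, 2))
Xc[0] = [8.0, 8.0]            # one leverage point far from the bulk
w = wemel_weights(Xc)         # w[0] is strongly downweighted
```

With genuinely robust distances the downweighting would be even sharper, since classical distances suffer from masking.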
When all regressor variables are continuous and there are not more than (say) 30 of them, we can use the robust distances that come out of the minimum covariance determinant (MCD) estimator of Rousseeuw (1984), for which the fast algorithm of Rousseeuw and Van Driessen (1999a) is available. This algorithm has been incorporated in the packages S-Plus (as the function cov.mcd) and SAS/IML (as the routine MCD), and both provide the robust distances in their output. In case that not all regressor variables are continuous or there are very many of them (even more than one thousand), we can use the robust distances provided by the robust principal components algorithm of Hubert, Rousseeuw and Verboven (2001).
We have not yet studied the WEMEL estimator in any detail, but we note that it is easy to compute because most GLM algorithms (including the one in S-Plus) allow the user to input prior weights w_i.
We also have not yet addressed the issue of bias correction for either MEL or WEMEL, which is a subject for further research. It may be possible to apply the same type of calculus as for formula (27) of Copas (1988).
Last but not least is the computation of influence functions and breakdown values. It would be interesting to connect our work in the hidden logistic model with the existing body of literature on outlier detection and robust estimation in the classical logistic model, including the work of Pregibon (1982), Stefanski et al. (1986), Künsch et al. (1989), and Müller and Neykov (2000).
Acknowledgement

The second author was supported by the Deutsche Forschungsgemeinschaft (SFB 475, "Reduction of complexity in multivariate data structures").
Referen es