
Communications in Statistics: Simulation and Computation, 38: 1714-1722, 2009
Copyright © Taylor & Francis Group, LLC
ISSN: 0361-0918 print/1532-4141 online
DOI: 10.1080/03610910903094676
Monte Carlo Approximations of the Quantiles
of a Sample Statistic
TAK K. MAK AND FASSIL NEBEBE
Department of Decision Sciences and M.I.S., JMSB, Concordia
University, Montreal, Quebec, Canada
We consider in this article the problem of numerically approximating the quantiles of
a sample statistic for a given population, a problem of interest in many applications,
such as bootstrap confidence intervals. The proposed Monte Carlo method can
be routinely applied to handle complex problems that lack analytical results.
Furthermore, the method yields estimates of the quantiles of a sample statistic of
any sample size though Monte Carlo simulations for only two optimally selected
sample sizes are needed. An analysis of the Monte Carlo design is performed to
obtain the optimal choices of these two sample sizes and the number of simulated
samples required for each sample size. Theoretical results are presented for the
bias and variance of the numerical method proposed. The results developed are
illustrated via simulation studies for the classical problem of estimating a bivariate
linear structural relationship. It is seen that the size of the simulated samples used in
the Monte Carlo method does not have to be very large and the method provides a
better approximation to quantiles than those based on an asymptotic normal theory
for skewed sampling distributions.
Keywords Bootstrap; Computer intensive methods; Monte Carlo simulation;
Quantile approximation.
Mathematics Subject Classification 62F40; 62H10.
1. Introduction
We consider in this article the use of Monte Carlo methods to numerically
approximate, for any $n$, the lower $\alpha$-level quantile $t_{n\alpha}$ of a sample
statistic $T_n$, a function of a sample of $n$ observations generated from a given
distribution $F_0$. It is assumed that $T_n$ converges in distribution to a known
distribution, typically a standard normal distribution. The methods studied are
potentially useful and relevant in a wide range of statistical problems, including the
applications of bootstrap to confidence interval construction and hypothesis testing
using asymptotic pivotal statistics. To see this, suppose $\hat\theta_n$ is an estimator
of a certain population parameter $\theta(F)$ representing a functional of the
distribution $F$. One way of constructing bootstrap confidence intervals for $\theta(F)$
is the percentile-$t$ method based on the quantiles of the bootstrap sample statistic
$T_n = \sqrt{n}\,(\hat\theta^*_n - \theta(\hat F))/\hat\sigma^*_n$, where $\hat F$ is an
estimator of $F$, $\hat\theta^*_n$ is the bootstrap counterpart of $\hat\theta_n$, and
$\hat\sigma^{*2}_n$ is the bootstrap estimate of the asymptotic variance, provided that
$T_n$ is asymptotically pivotal. In this case, we need to obtain numerically the
quantiles of $T_n$, a statistic computed on bootstrap samples generated from the
distribution $F_0 = \hat F$. Bootstrap interval estimation has been widely used in
practice to obtain confidence intervals with more accurate nominal coverage
probabilities than those based on an asymptotic normal theory (see, for example, Hall,
1992, Ch. 3). For example, bootstrap confidence intervals have been considered for the
estimation of intraclass correlation (Ukoumunne et al., 2003) and the kappa statistic
(Lee and Fung, 1993) for measuring rater agreement in clinical and psychological
studies. The results obtained in this article also provide the statistician with a
computationally efficient means of comparing the performance of competing estimators
for given population distributions.

Received April 18, 2008; Accepted June 5, 2009
Address correspondence to Tak K. Mak, Department of Decision Sciences and M.I.S.,
JMSB, Concordia University, 1455 De Maisonneuve West, Montreal H3G 1M8, Quebec,
Canada; E-mail: takmak@alcor.concordia.ca

In this article, it is seen that $t_{n\alpha}$ can be approximated by Monte Carlo
simulation with a numerical error of the order $O(n^{-1})$ for any $n$. There is,
however, no need to perform Monte Carlo simulation for each given $n$. Following Mak
(2004), simulation is conducted for only two selected sample sizes $n_1$ and $n_2$
(say, $n_1 < n_2$) with a different number of bootstrap samples simulated in each case.
Given $n_1$, we determine the optimal choice of $n_2$ and the number of bootstrap
samples for each of these two sample sizes so as to minimize the number of simulated
observations needed to meet a specified accuracy. Interestingly, the results derived
are quite different from those for estimating variances obtained in Mak (2004). The
choice of $n_1$ is, however, less straightforward and represents a trade-off between
smaller variance and larger bias. In the estimation of bootstrap distribution
functions, Bickel and Yahav (1988) also used extrapolation and the bootstrap at two
sample sizes smaller than $n$, but employed a different optimality criterion and did
not take into account the inherent errors of resampling. As pointed out by Booth and
Sarkar (1998), numerical errors resulting from resampling an estimated distribution
should be carefully examined, as it is not desirable to have the results of statistical
inferences determined by Monte Carlo errors. They also discussed the use of resampling
to obtain numerically the bootstrap variance and concluded that the number of
resamples needed is considerably larger than what is traditionally thought. A similar
observation was also made in Mak (2004).
The results in the present article provide an efficient way of computing, for any $n$,
both the critical value of a statistical test and the power of the test for any
parameter values in the alternative hypothesis. This greatly facilitates the
determination of a sample size to achieve a certain power as required in many clinical
trials (Browne, 1995). It takes advantage of modern computing power to relieve the
statistician of the burden of deriving results analytically when they are
mathematically complicated or intractable, as is often the case. The theoretical
results are presented in the next section. In Sec. 3, simulation studies are conducted
to demonstrate the results and the effectiveness of the numerical methods developed in
Sec. 2.
2. The Main Results
We consider in this section the use of Monte Carlo methods to estimate $t_{n\alpha}$.
It is assumed that as $n \to \infty$, $T_n$ converges in distribution to the standard
normal distribution. For a large class of estimation problems, such as maximum
likelihood estimation, the limiting distribution is typically normal. However, the
discussion below is actually applicable to any known form of limiting distribution.
Assuming that the distribution function of $T_n$ can be expanded as a power series in
$n^{-1/2}$ and using a Cornish-Fisher type of expansion (see, for example, Ch. 2 of
Hall, 1992), the $\alpha$-level quantile $t_{n\alpha}$ can be expressed, ignoring terms
of the order $O(n^{-3/2})$, approximately as
\[ t_{n\alpha} \approx z_\alpha + \frac{\delta_1}{\sqrt{n}} + \frac{\delta_2}{n}, \tag{2.1} \]
where $z_\alpha$ is the $\alpha$-level quantile of the standard normal distribution.
This gives a simple method for approximating $t_{n\alpha}$ for any $n$ provided that
the constants $\delta_1$ and $\delta_2$ are known or can be accurately estimated
numerically. Since in (2.1) $t_{n\alpha}$ is approximately a linear function of
$n^{-1/2}$ and $n^{-1}$, the unknown coefficients $\delta_1$ and $\delta_2$ can be
uniquely solved if $t_{n\alpha}$ is known for two values of $n$. In practice, the
analytical form of $t_{n\alpha}$ is rarely known but can be estimated to any desired
degree of accuracy using simulation or resampling. Specifically, let $S_1, \ldots, S_B$
be $B$ bootstrap samples, each of size $n$, obtained by sampling from the given
distribution $F_0$. For each bootstrap sample, the statistic $T_n$ is calculated,
yielding $B$ values $T_n(S_1), \ldots, T_n(S_B)$. Let $\hat t_{n\alpha}$ be the sample
$\alpha$-level quantile based on $T_n(S_1), \ldots, T_n(S_B)$. For large $B$,
$\hat t_{n\alpha}$ has an approximate variance equal to (see, for example, Gross, 1980;
Kuk and Mak, 1989)
\[ \frac{1}{B[\phi(z_\alpha)]^2}\,\alpha(1-\alpha) = \frac{v}{B}, \tag{2.2} \]
say, where $\phi$ is the probability density function (pdf) of the standard normal
distribution. Thus,
\[ \hat t_{n\alpha} \approx t_{n\alpha} + \varepsilon \approx z_\alpha + \delta_1 n^{-1/2} + \delta_2 n^{-1} + \varepsilon, \]
where $\varepsilon$ has approximately zero mean and variance $B^{-1}v$. We therefore
have the linear regression model
\[ y = \hat t_{n\alpha} - z_\alpha \approx \delta_1 n^{-1/2} + \delta_2 n^{-1} + \varepsilon \]
with independent variables $n^{-1/2}$ and $n^{-1}$, so that the regression coefficients
$\delta_1$ and $\delta_2$ can be estimated using the standard generalized least squares
theory, provided that the dependent variable $y$ is computed for several values of $n$.
As in Mak (2004), we consider only the case of two selected sample sizes $n_1$ and
$n_2$ with, say, $n_1 < n_2$. The number of bootstrap samples $B$ is allowed to be
different for the two sample sizes so as to optimize the total number of observations
needed to be simulated. Let $B_1$ and $B_2$ be the number of bootstrap samples for the
sample sizes $n_1$ and $n_2$, respectively. Let $\hat\delta_1$ and $\hat\delta_2$ be
the resulting estimators of $\delta_1$ and $\delta_2$, respectively. Also, let
$Y = (\hat t_{n_1\alpha} - z_\alpha,\ \hat t_{n_2\alpha} - z_\alpha)'$ and
$\hat A = (\hat\delta_1, \hat\delta_2)'$ so that $\hat A = X^{-1}Y$, where
\[ X = \begin{pmatrix} n_1^{-1/2} & n_1^{-1} \\ n_2^{-1/2} & n_2^{-1} \end{pmatrix}. \]
Then for any $n$, $t_{n\alpha}$ can be estimated by
\[ \hat t_{n\alpha} = z_\alpha + \hat\delta_1 n^{-1/2} + \hat\delta_2 n^{-1}. \]
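The two-sample-size procedure described so far can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the order-statistic rule used for the sample quantile, and the example statistic (a studentized mean of Exp(1) data) are all hypothetical choices made here, and the limiting distribution is taken to be standard normal.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # limiting distribution assumed to be N(0, 1)


def quantile_variance_constant(alpha):
    """v = alpha * (1 - alpha) / phi(z_alpha)^2, so Var(sample quantile) ~ v / B."""
    z = STD_NORMAL.inv_cdf(alpha)
    return alpha * (1.0 - alpha) / STD_NORMAL.pdf(z) ** 2


def sample_quantile(values, alpha):
    """Empirical alpha-level quantile using a simple order-statistic rule."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, max(0, math.ceil(alpha * len(ordered)) - 1))
    return ordered[index]


def solve_coefficients(t1, t2, n1, n2, alpha):
    """Solve the 2x2 system X A = Y for A = (delta1, delta2)."""
    z = STD_NORMAL.inv_cdf(alpha)
    y1, y2 = t1 - z, t2 - z
    x11, x12 = n1 ** -0.5, 1.0 / n1
    x21, x22 = n2 ** -0.5, 1.0 / n2
    det = x11 * x22 - x12 * x21
    delta1 = (x22 * y1 - x12 * y2) / det
    delta2 = (x11 * y2 - x21 * y1) / det
    return delta1, delta2


def extrapolate_quantile(delta1, delta2, n, alpha):
    """Quantile estimate z_alpha + delta1 / sqrt(n) + delta2 / n for any n."""
    return STD_NORMAL.inv_cdf(alpha) + delta1 / math.sqrt(n) + delta2 / n


def studentized_exponential_mean(n, rng):
    """Hypothetical statistic T_n: studentized mean of n Exp(1) draws."""
    xs = [rng.expovariate(1.0) for _ in range(n)]
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return math.sqrt(n) * (mean - 1.0) / math.sqrt(var)


def estimate_quantile(statistic, alpha, n, n1, n2, B1, B2, rng):
    """Simulate at n1 and n2 only, then extrapolate the quantile to sample size n."""
    t1 = sample_quantile([statistic(n1, rng) for _ in range(B1)], alpha)
    t2 = sample_quantile([statistic(n2, rng) for _ in range(B2)], alpha)
    delta1, delta2 = solve_coefficients(t1, t2, n1, n2, alpha)
    return extrapolate_quantile(delta1, delta2, n, alpha)


if __name__ == "__main__":
    rng = random.Random(0)
    # n2 = 53 is 5.295 * 10 rounded to an integer (a rounding assumption made here).
    print(estimate_quantile(studentized_exponential_mean, 0.1, 100, 10, 53, 867, 1992, rng))
```

With only two sample sizes the generalized least squares fit reduces to solving a 2x2 linear system, which is why `solve_coefficients` inverts the matrix directly instead of calling a regression routine.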
Furthermore, the covariance matrix of $\hat A$ is equal to $X^{-1}V(X^{-1})'$, where
$V$, a diagonal matrix with diagonal elements $B_1^{-1}v$ and $B_2^{-1}v$, is the
covariance matrix of $Y$. Let $n_2 = kn_1$ and $B_2 = \rho B_1$. Then ignoring terms of
the order $O(n^{-3/2})$, we obtain, by evaluating the matrix $X^{-1}V(X^{-1})'$,
\[ \mathrm{Var}(\hat t_{n\alpha}) \approx n^{-1}\mathrm{Var}(\hat\delta_1) = n^{-1}B_1^{-1}vkn_1(1-k^{-1/2})^{-2}(k^{-2}+\rho^{-1}). \tag{2.3} \]
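For readers who want to check (2.3), the evaluation of the first diagonal element of $X^{-1}V(X^{-1})'$ can be written out as follows. This is a worked sketch using only the definitions in the text, with $n_2 = kn_1$, $B_2 = \rho B_1$ (so $\rho$ denotes the ratio $B_2/B_1$), and $Y_1$, $Y_2$ denoting the two components of $Y$.

```latex
\begin{align*}
\det X &= n_1^{-1/2} n_2^{-1} - n_1^{-1} n_2^{-1/2}
        = k^{-1/2}\, n_1^{-3/2}\,\bigl(k^{-1/2} - 1\bigr), \\
\hat\delta_1 &= (\det X)^{-1}\bigl(n_2^{-1} Y_1 - n_1^{-1} Y_2\bigr), \\
\mathrm{Var}(\hat\delta_1)
  &= (\det X)^{-2}\Bigl(n_2^{-2}\,\frac{v}{B_1} + n_1^{-2}\,\frac{v}{B_2}\Bigr) \\
  &= k\, n_1^{3}\,\bigl(1 - k^{-1/2}\bigr)^{-2}\cdot
     \frac{v}{B_1 n_1^{2}}\,\bigl(k^{-2} + \rho^{-1}\bigr) \\
  &= B_1^{-1}\, v\, k\, n_1\,\bigl(1 - k^{-1/2}\bigr)^{-2}\bigl(k^{-2} + \rho^{-1}\bigr).
\end{align*}
```

Multiplying by $n^{-1}$, the square of the coefficient $n^{-1/2}$ attached to $\hat\delta_1$ in the estimator, gives the leading term of the variance; the terms involving $\hat\delta_2$ are of smaller order in $n$.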
For any values of $n_1$, $k$, and $\rho$, a value of $B_1$ can be determined to achieve
a specified value of the variance of $\hat t_{n\alpha}$, say $e$. Equating the right
hand side of (2.3) to $e$ and solving, we have
\[ B_1 = n^{-1}e^{-1}vkn_1(1-k^{-1/2})^{-2}(k^{-2}+\rho^{-1}). \tag{2.4} \]
The total number of observations to be simulated is then
\[ N = n_1B_1 + n_2B_2 = n_1B_1(1+\rho k) = n^{-1}e^{-1}vkn_1^{2}(1-k^{-1/2})^{-2}(k^{-2}+\rho^{-1})(1+\rho k). \tag{2.5} \]
Thus, choosing a smaller value of $n_1$ for estimating $\delta_1$ and $\delta_2$ seems
desirable in order to reduce $N$. However, a smaller $n_1$ would also lead to a larger
bias of $\hat t_{n\alpha}$ for estimating $t_{n\alpha}$, as will be seen below. On the
other hand, $k$ and $\rho$ can be chosen to minimize the value of $N$. Minimizing (2.5)
with respect to $k$ and $\rho$, one then obtains $k = 5.295$ and $\rho = 2.297$,
regardless of the values of $v$, $n_1$ and the sample size $n$ considered. These
results differ from those for estimating variances, in which case $k = 4$ and
$\rho = 2$ (Mak, 2004).
The biases in estimating $\delta_1$, $\delta_2$, and $t_{n\alpha}$ are given in the
following theorem.
Theorem 2.1. For given $k$ and $\rho$, we have
$E(\hat\delta_1) = \delta_1 + O(n_1^{-1}) + O(B_1^{-1})\,n_1^{1/2}$ and
$E(\hat\delta_2) = \delta_2 + O(n_1^{-1/2}) + O(B_1^{-1})\,n_1$.
Proof. Let $A = (\delta_1, \delta_2)'$. We have
$E(\hat A) = E(X^{-1}Y) = X^{-1}(XA + D) = A + X^{-1}D$, where $D$ is a vector with both
components of the form $O(n_1^{-3/2}) + O(B_1^{-1})$ since for any $n$,
$\hat t_{n\alpha}$ has a bias of the order $O(B^{-1})$. The theorem follows since
\[ X^{-1} = k^{1/2}n_1^{3/2}(k^{-1/2}-1)^{-1}
\begin{pmatrix} (kn_1)^{-1} & -n_1^{-1} \\ -(kn_1)^{-1/2} & n_1^{-1/2} \end{pmatrix}. \]
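The closed form of $X^{-1}$ used in the proof can be checked numerically. The sketch below assumes the reading $X^{-1} = k^{1/2} n_1^{3/2} (k^{-1/2} - 1)^{-1}$ times the matrix with rows $((kn_1)^{-1}, -n_1^{-1})$ and $(-(kn_1)^{-1/2}, n_1^{-1/2})$, and multiplies it by $X$; the helper names are hypothetical.

```python
import math


def design_matrix(n1, k):
    """X with rows (n^-1/2, n^-1) for n = n1 and n = k * n1."""
    n2 = k * n1
    return [[n1 ** -0.5, 1.0 / n1], [n2 ** -0.5, 1.0 / n2]]


def closed_form_inverse(n1, k):
    """Closed-form inverse of X: a scalar factor times a 2x2 matrix."""
    scale = math.sqrt(k) * n1 ** 1.5 / (k ** -0.5 - 1.0)
    return [
        [scale / (k * n1), -scale / n1],
        [-scale / math.sqrt(k * n1), scale / math.sqrt(n1)],
    ]


def matmul2(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][m] * b[m][j] for m in range(2)) for j in range(2)] for i in range(2)]


if __name__ == "__main__":
    # Should print a matrix numerically equal to the 2x2 identity.
    print(matmul2(closed_form_inverse(10, 5.295), design_matrix(10, 5.295)))
```

The first row of the inverse grows like $n_1^{1/2}$ and the second like $n_1$, which is where the factors multiplying $O(B_1^{-1})$ in the theorem come from.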
It follows immediately from the theorem that for any $n$,
\[ E(\hat t_{n\alpha}) = t_{n\alpha} + n^{-1/2}O(n_1^{-1}) + n^{-1/2}O(B_1^{-1})\,n_1^{1/2} + n^{-1}O(n_1^{-1/2}) + n^{-1}O(B_1^{-1})\,n_1 + O(n^{-3/2}). \]
Let $e = a^2/n$, where $a$ is a small constant. Then we have, by (2.4),
\[ B_1 = a^{-2}vkn_1(1-k^{-1/2})^{-2}(k^{-2}+\rho^{-1}) \]
and
\[ E(\hat t_{n\alpha}) = t_{n\alpha} + n^{-1/2}O(n_1^{-1/2}) + n^{-1}O(n_1^{-1/2}) + O(n^{-1}). \]
Thus, to reduce the number of bootstrap samples needed, one would like to choose
a smaller $n_1$. This, however, will be done at the expense of a larger bias. For large
$n_1$, we have, approximately,
\[ \hat t_{n\alpha} \approx t_{n\alpha} + a\,O_p(n^{-1/2}) + O(n^{-1}). \]
Since $a$ is small, $\hat t_{n\alpha}$ is approximately a second order correct endpoint
for bootstrap confidence intervals based on a pivotal statistic (Hall, 1992, Ch. 3) if
$T_n = \sqrt{n}\,(\hat\theta^*_n - \theta(\hat F))/\hat\sigma^*_n$ as defined in Sec. 1.
3. Simulation Studies
Mak and Nebebe (2009) studied numerically the efficiencies of using resampling in
a structural relationship to estimate the asymptotic variance of the slope estimate
and compared them with those derived analytically. We will use the same experimental
settings to demonstrate numerically the analytical results developed in the previous
section. The availability of an explicit expression of the asymptotic variance reduces
considerably the required amount of numerical computation. In a bivariate linear
structural relationship, two unobservable variables $\xi$ and $\eta$ are connected by
the linear relationship $\eta = \gamma + \beta\xi$. We observe only
$y = \eta + \varepsilon$ and $x = \xi + \delta$, where $\delta$ and $\varepsilon$ are
measurement errors with zero means. Furthermore, $\delta$ and $\varepsilon$ are assumed
to be independent and normally distributed with variances $\sigma^2_\delta$ and
$\sigma^2_\varepsilon$, respectively. It is well known that in the absence of
additional information about the error variances, the parameters are not identifiable.
We consider here the case where $\sigma^2_\delta$ is known. The slope parameter $\beta$
is then estimated by $\hat\beta = s_{xy}/(s_{xx} - \sigma^2_\delta)$. Consider now the
studentized statistic $T_n = (\hat\beta - \beta)/s_{\hat\beta}$, where
$s^2_{\hat\beta}$ is the sample estimator of the asymptotic variance of $\hat\beta$
explicitly given in Mak and Nebebe (2009).
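The data-generating model and the slope estimator can be illustrated with a short Python sketch. It is not the authors' code: the function names are hypothetical, the latent variable is assumed here to be normally distributed (the text specifies only its mean and variance), the sample moments use the divisor n - 1, and the default parameter values are the ones reported later in this section. The studentizing factor is omitted because its expression is given only in Mak and Nebebe (2009).

```python
import math
import random


def simulate_structural_sample(n, rng, intercept=10.0, slope=2.0, xi_mean=100.0,
                               xi_var=900.0, var_delta=100.0, var_eps=400.0):
    """Draw n pairs (x, y) with x = xi + delta and y = intercept + slope * xi + eps."""
    xs, ys = [], []
    for _ in range(n):
        xi = rng.gauss(xi_mean, math.sqrt(xi_var))  # normal latent variable: an assumption
        xs.append(xi + rng.gauss(0.0, math.sqrt(var_delta)))
        ys.append(intercept + slope * xi + rng.gauss(0.0, math.sqrt(var_eps)))
    return xs, ys


def slope_estimate(xs, ys, var_delta):
    """Slope estimate s_xy / (s_xx - var_delta) with the error variance of x known."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    return s_xy / (s_xx - var_delta)


if __name__ == "__main__":
    xs, ys = simulate_structural_sample(100, random.Random(0))
    print(slope_estimate(xs, ys, 100.0))
```

Subtracting the known error variance from $s_{xx}$ removes the attenuation that the measurement error in $x$ would otherwise cause in the ordinary least squares slope.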
Simulation studies were performed to examine the performance of the Monte
Carlo method in estimating the quantiles of $T_n$. We consider the estimation of
the $\alpha$-quantiles for $\alpha = 0.1, 0.5, 0.9$. For any $n$, these quantiles are
estimated by $\hat t_{\alpha n}$ with $\hat\delta_1$ and $\hat\delta_2$ obtained by the
method explained in the previous section. We consider $n_1 = 10$ and $n_1 = 15$. The
optimal choice $n_2 = 5.295n_1$ is used for the other sample size. The efficiency of
$\hat t_{\alpha n}$ is studied for $n = 30$ and $n = 100$. The true values of
$t_{0.1}, t_{0.5}, t_{0.9}$ were computed numerically using a very large number of
simulated samples. Since the implication of errors in quantile estimation is more
difficult to judge, we consider also errors in terms of the corresponding nominal
probabilities. Specifically, in addition to examining the error
$\hat t_{\alpha n} - t_\alpha$, we consider also the deviation
$F_n(\hat t_{\alpha n}) - F_n(t_\alpha) = F_n(\hat t_{\alpha n}) - \alpha$, where $F_n$
is the true distribution function of $T_n$. The asymptotic variance of
$F_n(\hat t_{\alpha n}) - \alpha$ is equal to
$\tau^2 = [\phi(z_\alpha)]^2 V(\hat t_{\alpha n})$. We also define the relative
standard deviation (error) as
$r = \tau/\sqrt{\alpha(1-\alpha)} = \phi(z_\alpha)\sqrt{V(\hat t_{\alpha n})}/\sqrt{\alpha(1-\alpha)}$,
the asymptotic standard error as a fraction of $\sqrt{\alpha(1-\alpha)}$. To achieve a
specified value of $r$ one can simply set in (2.4)
\[ e = \frac{r^2\alpha(1-\alpha)}{[\phi(z_\alpha)]^2}. \]
Consequently, (2.4) reduces to
\[ B_1 = n^{-1}r^{-2}kn_1(1-k^{-1/2})^{-2}(k^{-2}+\rho^{-1}). \]
In the simulation studies, $B_1$ is arbitrarily chosen to achieve a relative standard
deviation of $r = 0.03$ for quantile estimation when $n = 100$ (for $n > 100$, the
relative standard deviation is therefore less than 0.03). For $n_1 = 10$ the number of
bootstrap samples required is $B_1 = 867$, and $B_1 = 1300$ for $n_1 = 15$. The true
parameter values used to generate the bootstrap samples are as in Mak and Nebebe
(2009): $\gamma = 10$, $\beta = 2$, $E(\xi) = 100$, $\mathrm{Var}(\xi) = 900$,
$\sigma^2_\delta = 100$, $\sigma^2_\varepsilon = 400$.
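The two values of $B_1$ quoted here can be reproduced from the reduced form of (2.4), read as $B_1 = n^{-1} r^{-2} k n_1 (1 - k^{-1/2})^{-2} (k^{-2} + \rho^{-1})$ with $\rho = B_2/B_1$. The sketch below is a plain evaluation of that expression at the reported $k = 5.295$ and $\rho = 2.297$; the function names are hypothetical, and rounding to the nearest integer is an assumption made here.

```python
def bootstrap_samples_needed(n, n1, r, k=5.295, rho=2.297):
    """B1 = n^-1 r^-2 k n1 (1 - k^-1/2)^-2 (k^-2 + rho^-1)."""
    return k * n1 * (k ** -2 + 1.0 / rho) / (n * r ** 2 * (1.0 - k ** -0.5) ** 2)


def total_observations(n1, b1, k=5.295, rho=2.297):
    """N = n1 * B1 + n2 * B2 with n2 = k * n1 and B2 = rho * B1."""
    return n1 * b1 * (1.0 + rho * k)


if __name__ == "__main__":
    for n1 in (10, 15):
        b1 = round(bootstrap_samples_needed(n=100, n1=n1, r=0.03))
        print(n1, b1, round(total_observations(n1, b1)))
```

For $n = 100$ and $r = 0.03$ the expression evaluates to about 866.8 for $n_1 = 10$ and about 1300.2 for $n_1 = 15$, matching the 867 and 1300 bootstrap samples stated in the text.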
Thus, for given $(n_1, B_1, n_2, B_2)$, $B_1$ and $B_2$ bootstrap samples of sizes
$n_1$ and $n_2$ are generated and the estimate $\hat t_{n\alpha}$ as well as
$F_n(\hat t_{n\alpha})$ are calculated for $\alpha = 0.1, 0.5, 0.9$ and $n = 30, 100$.
The process is repeated 100 times to estimate the numerical mean, standard deviation
(and also relative standard deviation in the case of $F_n(\hat t_{n\alpha})$). The
results of the simulation studies are then summarized in Tables 1 and 2.
We first examine the numerical results for the case $n = 100$ (Table 1). For
$n = 100$, the true values of $t_{0.1}, t_{0.5}, t_{0.9}$ are $-1.264$, $0.046$, and
$1.322$, respectively. It is clear that the estimated relative standard deviations,
though consistently smaller, are in general close to the specified value of 0.03, for
both the cases $n_1 = 10$ and $n_1 = 15$, although the number of bootstrap samples
required is considerably higher for $n_1 = 15$. Thus, as shown theoretically in Sec. 2,
there are not necessarily any gains in efficiency in choosing a larger value for $n_1$.
It is also interesting to note that for all the values of $\alpha$ considered, there
are no marked differences in bias between the cases $n_1 = 10$ and $n_1 = 15$ (for
estimating the three quantiles or in terms of deviation from the nominal probability
$\alpha$). This suggests that although theoretically the bias is larger for smaller
$n_1$, the difference may not be practically significant unless the value of $n_1$ is
really small. In fact, most of the estimated biases are negligible compared to the
estimated standard deviations. If no bootstrap corrections are used so that $z_\alpha$
is used to approximate $t_\alpha$, then $F_n(z_\alpha) - \alpha$ is equal to $-0.0028$,
$-0.0182$, and $-0.0073$ for, respectively, $\alpha = 0.1, 0.5, 0.9$. Thus, using the
bootstrap improves on the results based on standard asymptotic theory unless
$F_n(z_\alpha) - \alpha$ is very small so that no correction is required.

Table 1
Mean, standard deviation, and relative standard deviation of $\hat t_{n\alpha}$ and
$F_n(\hat t_{n\alpha})$ for $n = 100$ (in the column headings, $\hat t$ stands for
$\hat t_{n\alpha}$)

                                        $\alpha = 0.10$          $\alpha = 0.50$          $\alpha = 0.90$
                                        $\hat t$   $F_n(\hat t)$ $\hat t$   $F_n(\hat t)$ $\hat t$   $F_n(\hat t)$
$n_1 = 10$  Mean                        -1.262     0.1006        0.046      0.5001        1.329      0.9010
            (bias)                      (0.002)    (0.0006)      (0.000)    (0.0001)      (0.007)    (0.0010)
            Standard deviation          0.035      0.0064        0.026      0.0103        0.032      0.0052
            Relative standard deviation            0.0212                   0.0206                   0.0175
$n_1 = 15$  Mean                        -1.263     0.1004        0.044      0.4993        1.330      0.9011
            (bias)                      (0.001)    (0.0004)      (-0.002)   (-0.0007)     (0.008)    (0.0011)
            Standard deviation          0.029      0.0052        0.023      0.0093        0.032      0.0052
            Relative standard deviation            0.01724                  0.0186                   0.0175

Table 2
Mean, standard deviation, and relative standard deviation of $\hat t_{n\alpha}$ and
$F_n(\hat t_{n\alpha})$ for $n = 30$ (in the column headings, $\hat t$ stands for
$\hat t_{n\alpha}$)

                                        $\alpha = 0.10$          $\alpha = 0.50$          $\alpha = 0.90$
                                        $\hat t$   $F_n(\hat t)$ $\hat t$   $F_n(\hat t)$ $\hat t$   $F_n(\hat t)$
$n_1 = 10$  Mean                        -1.276     0.1005        0.090      0.5006        1.382      0.8886
            (bias)                      (0.002)    (0.0005)      (0.002)    (0.0006)      (-0.008)   (-0.0014)
            Standard deviation          0.041      0.0072        0.030      0.0120        0.037      0.0056
            Relative standard deviation            0.0241                   0.0240                   0.0187
$n_1 = 15$  Mean                        -1.278     0.1001        0.086      0.4989        1.389      0.8997
            (bias)                      (0.000)    (0.0001)      (-0.002)   (-0.0011)     (-0.001)   (-0.0003)
            Standard deviation          0.032      0.0056        0.027      0.0108        0.032      0.0048
            Relative standard deviation            0.0187                   0.0216                   0.0161

Next, we examine the simulation results for $n = 30$ (Table 2). For $n = 30$,
the true values of $t_{0.1}, t_{0.5}, t_{0.9}$ are, respectively, $-1.278$, $0.088$,
and $1.390$. It is clear that similar conclusions for the mean (bias), standard
deviation, and relative standard deviation can be drawn as in the case $n = 100$.
However, the relative standard deviations are higher in the present case compared to
those obtained for the case $n = 100$. This is expected since for fixed $B_1$, the
variance of $\hat t_{n\alpha}$ is inversely proportional to $n$.
4. Conclusions
We address in this article the problem of estimating the quantile of a statistic which
converges in distribution to a known distribution function, typically, but not limited
to, the standard normal distribution. Note that when the limiting distribution is
different from the standard normal, $z_\alpha$ must be replaced everywhere by the lower
$\alpha$ quantile of this particular distribution, and $\phi$ in (2.2) by its pdf. It
is seen that a quantile estimate can be obtained for any sample size while simulation
is required for only two selected sample sizes. We study the issue of determining the
number of simulated samples required as well as the optimal allocation of the number of
samples to each sample size. The problem is effectively reduced to determining the
smaller of the two sample sizes, since once it is fixed, the other sample size and
the number of bootstrap samples for each sample size can be obtained based on a
certain optimality criterion and variance requirement. In general, the smaller of the
two sample sizes is chosen to minimize bias, but one has to strike a balance between
bias and resampling requirement. We do not investigate this issue in this article. In
practice, some preliminary numerical studies may have to be conducted to determine
a reasonable value. This can possibly be carried out along the lines proposed in Mak
(2004) for estimating variances, but the details have to be carefully worked out since
the results for estimating quantiles and variances are quite different, as seen in the
present article.
It would also be interesting to extend the results in this article to the case where
the statistic converges in distribution, but the limiting distribution is unknown. It is
conjectured that the numerical method suggested is still applicable, but the results of
the variance and bias studies, as well as the optimal allocation rules could be quite
different.
A potential area of application of the developed methods, as pointed out
in Sec. 1, is bootstrap confidence interval construction using asymptotic pivotal
statistics. It is seen that by choosing the variance specification requirement
carefully, one can attain approximately second order correct endpoints for the
bootstrap confidence intervals. While bootstrap confidence intervals using pivotal
statistics can be applied to a wide range of problems for which asymptotic normal
theory applies, some degree of caution is needed. For example, the application to the
estimation of the correlation coefficient is known to be less successful. In the
latter case considerable improvement can be obtained by using a transformation to
stabilize variance. Alternatively, Efron's (1987) bias-corrected and accelerated
bias-corrected approaches can be applied, and it would be interesting to examine the
application of the computational methods in this article to these approaches.
Finally, the computation of a pivotal statistic most likely would also require
that the asymptotic variance be computed. For maximum likelihood estimation, the
asymptotic variances can be computed numerically quite easily even if an analytical
expression for the expected information matrix is unavailable. For other estimation
problems, other techniques such as the delta method may be employed, but
their analytical derivation may be complicated. Mak and Nebebe (2009) proposed
an efficient method for computing the asymptotic variance using Monte Carlo
simulation even if the finite sample variance does not exist. Their proposed method
can in principle be applied here, but the amount of computation will be substantially
increased, and the details will not be pursued here.
Acknowledgment
The authors are grateful to the referee for the constructive comments leading to the
considerable improvement of the article.
References
Bickel, P. J., Yahav, J. A. (1988). Richardson extrapolation and the bootstrap. Journal of the
American Statistical Association 83:387-393.
Booth, J. G., Sarkar, S. (1998). Monte Carlo approximation of bootstrap variances. American
Statistician 52:354-357.
Browne, R. H. (1995). On the use of a pilot sample for sample size determination. Statistics
in Medicine 14:1933-1940.
Efron, B. (1987). Better bootstrap confidence intervals (with discussion). Journal of the
American Statistical Association 82:171-200.
Gross, S. (1980). Median estimation in sample survey. Proceedings of the Section on Survey
Research Methods, American Statistical Association, 181-184.
Hall, P. (1992). The Bootstrap and Edgeworth Expansion. New York: Springer.
Kuk, A., Mak, T. K. (1989). Median estimation in the presence of auxiliary information.
Journal of the Royal Statistical Society B 51:261-269.
Lee, J., Fung, K. P. (1993). Confidence interval of the kappa coefficient by bootstrap
resampling. Psychiatry Research 49:97-98.
Mak, T. K. (2004). Estimating variances for all sample sizes by the bootstrap. Computational
Statistics and Data Analysis 46:459-467.
Mak, T. K., Nebebe, F. (2009). Numerical approximation of conditional asymptotic
variances using Monte Carlo simulation. Computational Statistics 24:333-344.
Ukoumunne, O. C., Davison, A. C., Gulliford, M. C., Chinn, S. (2003). Non-parametric
bootstrap confidence intervals for the intraclass correlation coefficient. Statistics in
Medicine 24:3805-3821.