Bootstrap Approach To Compare the Slopes of Two Calibrations When Few Standards Are Available

Article
pubs.acs.org/ac
Graciela Estévez-Pérez,† Jose M. Andrade,*,‡ and Rand R. Wilcox§
†Department of Mathematics and ‡Group of Applied Analytical Chemistry, Department of Analytical Chemistry, University of A Coruña, Campus da Zapateira, 15071 A Coruña, Galicia, Spain
§Department of Psychology, University of Southern California, Los Angeles, California 90089, United States
Supporting Information
ABSTRACT: Comparing the slopes of aqueous-based and standard addition calibration procedures is almost a daily task in analytical laboratories. Because usual protocols involve very few standards, sound statistical inference and conclusions are hard to obtain with current classical tests (e.g., the t-test), which may greatly affect decision-making. Thus, there is a need for robust statistics that are not distorted by small samples of experimental values obtained from analytical studies. Several promising alternatives based on bootstrapping are studied in this paper under the typical constraints common in laboratory work. The impact of the number of standards, homoscedasticity or heteroscedasticity, three variance patterns, and three error distributions on least-squares fits was considered (in total, 144 simulation scenarios). The Student’s t-test is the most valuable procedure when the normality assumption is true and homoscedasticity is present, although it can be highly affected by outliers. A wild bootstrap method leads to average rejection percentages that are closer to the nominal level in almost every situation, and it is recommended for laboratories working with a small number of standards. The Theil−Sen percentile bootstrap statistic is very robust, but its rejection percentages depart from the nominal ones (<5%), so its use is not recommended when the number of standards is very small. Finally, a tutorial and free software are given to encourage analytical laboratories to apply bootstrap principles to compare the slopes of two calibration lines.
Received: October 28, 2015
Accepted: January 11, 2016
Analytical Chemistry. © XXXX American Chemical Society. DOI: 10.1021/acs.analchem.5b04004. Anal. Chem. XXXX, XXX, XXX−XXX

Analytical laboratories must decide daily whether a series of sample aliquots can be quantified appropriately by “traditional” standardization [external standard calibration, direct calibration (DC), or aqueous calibrations are common synonyms]. This is usually addressed after preliminary trials where the slope of the (first order) regression line obtained for a traditional standardization framework is compared against that derived from a series of additions of a standard to a set of sample aliquots (SAM, standard addition method). Should their slopes be comparable, the SAM method can be avoided, and so the laboratory workload will be simplified, the overall costs reduced, and the laboratory turnaround improved.

The general framework for straight-line standardization and comparison of the slopes of two regression lines were reviewed1−3 recently, with special emphasis on common laboratory practices and some misconceptions.4 How to correctly evaluate the statistical equality of the residual variances of the regression models constituted a difficulty.

A literature survey found that this issue has surprisingly not been considered broadly in recent years in the analytical chemistry literature,4 despite its importance in daily laboratory work worldwide. As an example, the ANCOVA approach (analysis of covariance, which elegantly circumvents some difficulties when the F-test is considered to compare the variances of two regression straight lines before Student’s t-test is calculated) is not generally known or, at least, is seldom applied.4 However, it is a routine method in biology,5 psychology, and many other scientific disciplines, although most applications focus on study of the regression coefficients.6,7 Indeed, outside the analytical chemistry field, several studies dealt with this topic and related aspects; some were reviewed elsewhere.4 Unfortunately, in many scientific fields it is hard or even impossible (or too costly) to prepare numerous specimens to deploy good statistical calibrations. Therefore, applying classical statistics to draw conclusions is not without concern.

Of particular relevance for the present paper is the work from Ng and Wilcox,8 which extends a previous study where bootstrap was presented as a reliable alternative to other approaches.9 In the former, the effects of nonnormality and heteroscedasticity were studied when testing the hypothesis that regression lines associated with two independent groups
have the same slopes. Problems associated with Student’s t-test were underlined, and promising alternatives based on bootstrapping studied. However, their simulations did not consider fewer than 20 standards, a value that may be almost impracticable for many analysts, not to mention the huge difficulties and high cost that may arise in the industrial arena to undertake such prolonged calibration times [e.g., in gas chromatography to analyze PAHs (polycyclic aromatic hydrocarbons), more than 30 h of instrumental time would be required, just to get the calibration data].

Bootstrap methods are statistical techniques that do not assume a probability distribution function for the population; they provide a nonparametric approach to testing hypotheses that deals effectively with nonnormal distributions and heteroscedasticity. In the event sampling is from a normal distribution, bootstrap methods perform nearly as well as standard methods that assume homoscedasticity. The practical advantage of bootstrap methods is that they continue to give accurate results for a much broader range of situations.10 In essence, bootstrap methods perform simulations on the observed data to determine confidence intervals or the distribution of some appropriate test statistic. Conceptually, the data are resampled with replacement many times in order to generate empirical estimates of the statistics and use them to make inferences about the populations. We will not present its basics here, although two user-oriented references are highly advisable.11,12 Here, we will only mention that bootstrapping is particularly useful when (i) the theoretical distribution of a statistic of interest is complicated or unknown (bootstrapping is distribution-independent) and when (ii) the sample size is insufficient to get a straightforward and sound statistical inference (this situation is very common in analytical chemistry, and therefore, it is required to estimate statistics that are not distorted by the specific values derived from a given study).

Bootstrap (resampling) methods are gaining interest within analytical chemistry and have already been proposed to compare analytical procedures13 and ordinates of two regression lines;12 to calculate the probability of correct assignments in KNN classification;14 to improve variable selection for manufacturing processes, combined with permutation tests and PLS-VIP (partial least-squares variable importance in projection);15,16 to calculate sample-specific prediction intervals;17 and to evaluate the reliability of classifications when evaluating counterfeit banknotes,18 just to cite some recent examples.

The aim of the present study was 2-fold. First, we aim to evaluate the behavior of nonparametric approaches discussed by Ng and Wilcox8 in typical situations that emerge in analytical laboratories, specifically a low number of standards, which, in a way, might be considered as an “extreme situation” in statistical terms. In particular, we are interested in evaluating the most promising options of that previous study, namely, the wild bootstrap quasi t-statistic and the Theil−Sen percentile bootstrap (a procedure related to the well-known nonparametric Theil’s method to calculate a regression line based on the median). Second, a pragmatic approach is presented so that laboratories can apply the most satisfactory methods derived from this study to their own data sets.

As a step forward, the methods recommended by Ng and Wilcox8 will be checked under nonnormality (in particular, simulating the presence of an undetected outlier, as may happen in laboratories) and heteroscedasticity contexts, when the sample sizes (number of calibrators) are small, on the order of {3, 5, 7, 9}. The traditional Student’s t-test will also be studied for benchmarking. It is also worth noting that in the present study the explanatory variables will not be random, which is the case in analytical laboratories. Please note that we tried to keep a pedagogic approach, as far as possible. Some statements might appear relatively trivial (it was considered important to reinforce them) whereas the most complex ones are not presented in detail (instead, interested readers are referred to seminal papers).

The design of the study and the simulations performed will be described in the Experimental Section. Results gathered from them will be discussed in the section Results and Discussion, and finally, practical examples will be presented under Worked Examples. A tutorial on how to use the free software is included in Supporting Information.

■ EXPERIMENTAL SECTION
Here we explain how the experiments were designed. The ANCOVA model constitutes a convenient way to compare two regression lines in a more general context than the classical Student’s t-test. A simple introduction was given elsewhere (ref 4 and other references cited therein). In brief, ANCOVA combines several calibrations (standardizations) simultaneously in a unique multiple regression model (like the one in eq 1) and it allows for testing different issues about its coefficients; for instance, if the slopes of the regression lines are comparable (with the alternative hypothesis that the regression lines were not derived from populations with equal slopes).

Throughout this paper, data sets were generated from the regression model depicted in eq 1:

Yi = β0 + β1xi + β2Gi + β3xiGi + τ(xi)εi    (1)

with i = 1, ..., n. βj (j = 0, ..., 3) are the unknown coefficients (calculated by a least-squares fit), xi is the explanatory variable (e.g., concentration), Gi is a dummy variable for two groups (Gi takes a value of 0 or 1 for each corresponding calibration), τ (variance pattern) is a function of xi used to model heteroscedasticity, and εi is the error term. The term τ is used in eq 1 to associate different variances with the error (ε) term (for instance, to make it dependent on the x values). Note that the regression model is Yi = β0 + β1xi + τ(xi)εi when Gi = 0 and Yi = (β0 + β2) + (β1 + β3)xi + τ(xi)εi when Gi = 1 (i.e., when each calibration is considered separately). To assess whether the slopes of the groups (calibration lines) differ, the hypothesis given by eq 2 is tested:

H0: β3 = 0; H1: β3 ≠ 0    (2)

Interest is in analyzing, through a Monte Carlo simulation study, the performance of several methods to test eq 2 when the normality and homoscedasticity assumptions are violated. Unlike Ng and Wilcox,8 the values of the explanatory variable are fixed by the researcher, and two typical situations for the explanatory variable were considered. In situation 1 (S1), the explanatory variable takes values on the same interval in both calibrations (noted by 1 and 2), a common, pragmatic working design in laboratories; that is, [I1, L1] = [I2, L2] = [0, 5]. Here n1, n2 ∈ {3, 5, 7, 9}. When n1 = n2 (n = number of standards), the values of the explanatory variable x are equal; otherwise, they are different. In situation 2 (S2), the explanatory variables take different values on different intervals [I1, L1] = [0, 5]; [I2, L2] = [0, 10]. Here n1, n2 ∈ {3, 5, 7, 9}.
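
As a minimal sketch of how eq 1 and eq 2 translate into a calculation, the following Python fragment fits each calibration separately (the Gi = 0 and Gi = 1 forms of eq 1) and computes the classical t statistic for the slope difference β3 with a pooled residual variance. It is not the authors' software: the function names, the equally spaced standards, and the noise standard deviation of 0.01 are assumptions made only for illustration.

```python
import math
import random


def fit_line(x, y):
    """Least-squares line: returns intercept, slope, Sxx and residual sum of squares."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return intercept, slope, sxx, rss


def slope_difference_t(x1, y1, x2, y2):
    """Classical t statistic for H0: beta3 = 0 (eq 2), pooled residual variance."""
    _, b1, sxx1, rss1 = fit_line(x1, y1)
    _, b2, sxx2, rss2 = fit_line(x2, y2)
    dof = len(x1) + len(x2) - 4            # four coefficients are fitted in eq 1
    s2 = (rss1 + rss2) / dof               # pooled residual variance
    se = math.sqrt(s2 * (1.0 / sxx1 + 1.0 / sxx2))
    beta3 = b2 - b1                        # slope difference equals beta3 in eq 1
    return beta3, beta3 / se, dof


# Situation S1, identical true lines (beta0 = 0.15, beta1 = 0.1, beta2 = beta3 = 0),
# n1 = n2 = 5 standards on [0, 5]; the noise level is a hypothetical choice.
rng = random.Random(1)
x = [0.0, 1.25, 2.5, 3.75, 5.0]
y_dc = [0.15 + 0.1 * xi + rng.gauss(0.0, 0.01) for xi in x]    # G = 0
y_sam = [0.15 + 0.1 * xi + rng.gauss(0.0, 0.01) for xi in x]   # G = 1
beta3, t_stat, dof = slope_difference_t(x, y_dc, x, y_sam)
print(beta3, t_stat, dof)
```

The two-sided p-value follows from comparing the returned statistic with a Student's t distribution with the returned degrees of freedom; this pooled-variance form presumes homoscedastic errors.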
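
To give a feeling for how a wild bootstrap test of eq 2 can be organized, the sketch below combines an HC4-weighted quasi-t statistic for the slope difference with sign-flipped (Rademacher) residuals from a fit under H0 (common slope, separate intercepts). This is a generic construction written for illustration only: the exact WB procedure is the one specified in ref 8 and may differ in its details, and all function names, the number of bootstrap samples, and the example noise level are assumptions.

```python
import math
import random


def center(x):
    """Return centered x values and their sum of squares (Sxx)."""
    xbar = sum(x) / len(x)
    d = [xi - xbar for xi in x]
    return d, sum(di * di for di in d)


def hc4_quasi_t(groups):
    """HC4-based quasi-t statistic for the slope difference (beta3 in eq 1).

    groups = [(x1, y1), (x2, y2)]; each group needs at least 3 distinct
    x values so that every leverage stays below 1.
    """
    n_total = sum(len(x) for x, _ in groups)
    slopes, variances = [], []
    for x, y in groups:
        d, sxx = center(x)
        ybar = sum(y) / len(y)
        slope = sum(di * (yi - ybar) for di, yi in zip(d, y)) / sxx
        var = 0.0
        for di, yi in zip(d, y):
            resid = yi - ybar - slope * di
            h = 1.0 / len(x) + di * di / sxx          # leverage of this standard
            delta = min(4.0, n_total * h / 4.0)       # HC4 exponent, 4 coefficients
            var += (di / sxx) ** 2 * resid ** 2 / (1.0 - h) ** delta
        slopes.append(slope)
        variances.append(var)
    return (slopes[1] - slopes[0]) / math.sqrt(variances[0] + variances[1])


def wild_bootstrap_p(groups, n_boot=999, seed=0):
    """Wild bootstrap p-value for H0: equal slopes (assumed generic scheme)."""
    rng = random.Random(seed)
    t_obs = hc4_quasi_t(groups)
    # Fit under H0: one common slope, separate intercepts.
    centered = [center(x) for x, _ in groups]
    ybars = [sum(y) / len(y) for _, y in groups]
    sxy = sum(di * (yi - yb)
              for (_, y), (d, _), yb in zip(groups, centered, ybars)
              for di, yi in zip(d, y))
    slope0 = sxy / sum(sxx for _, sxx in centered)
    fitted = [[yb + slope0 * di for di in d]
              for (d, _), yb in zip(centered, ybars)]
    resid = [[yi - fi for yi, fi in zip(y, f)]
             for (_, y), f in zip(groups, fitted)]
    exceed = 0
    for _ in range(n_boot):
        # Each residual keeps its size but gets a random sign.
        boot = [(x, [fi + rng.choice((-1.0, 1.0)) * ri for fi, ri in zip(f, r)])
                for (x, _), f, r in zip(groups, fitted, resid)]
        if abs(hc4_quasi_t(boot)) >= abs(t_obs):
            exceed += 1
    return t_obs, (exceed + 1) / (n_boot + 1)


# Hypothetical data: S1 design, n1 = n2 = 5, identical true lines.
rng = random.Random(2)
x = [0.0, 1.25, 2.5, 3.75, 5.0]
y_dc = [0.15 + 0.1 * xi + rng.gauss(0.0, 0.01) for xi in x]
y_sam = [0.15 + 0.1 * xi + rng.gauss(0.0, 0.01) for xi in x]
print(wild_bootstrap_p([(x, y_dc), (x, y_sam)]))
```

Because the residuals are only re-signed, never moved between standards, each bootstrap sample preserves whatever dependence of the error variance on x is present in the data, which is the reason wild schemes are attractive under heteroscedasticity.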
To simulate heteroscedasticity, we focused on two scenarios. First, within-groups heteroscedasticity (WG) occurs (this means that the variance of ε depends only on x) without differences between the groups (BG homoscedasticity). This corresponds to common experience in, for example, spectroscopic laboratories, where currently standard additions method (SAM) and direct method (DM) standardizations exhibit similar standard errors of the fits. A second scenario considers that heteroscedasticity occurs due to both WG and different variances among the standardizations (i.e., BG heteroscedasticity, meaning that the variance of ε depends on G). This might correspond to a situation where the added analyte and the sample-inherent one do not hold the same behavior. In our opinion, this should not occur in laboratories because it would denote a suboptimal analytical procedure, but it is worth considering.

Three variance patterns (VP) for τ(xi) were considered. They are thought to represent typical laboratory conditions and are depicted in Table S1 (Supporting Information), along with a comparison with the variance patterns employed in the original study.8 Numerical values in the mathematical expressions given in Table S1 for the variance patterns, τ(xi), are a consequence of applying the original functions presented in ref 8 to the range of experimental values in x (in this study, without loss of generality, [0, 5] or [0, 10]).

Variance patterns 1, 2, and 3 represent three disparate conditions. In VP1, the residuals of the regression fits follow normal homoscedastic distributions regardless of x. In VP2, the residuals are normal but their variance increases when the explanatory variable departs from the central value (to both higher and lower values). This corresponds to a common situation. Indeed, it resembles the typical Hotelling’s confidence region around the calibration line. Finally, VP3 “opposes” VP2, as the variance decreases when x values move away from the central value. The latter two situations are visualized in the graphical insets of Table S1.

With regard to the error distribution, for each variance pattern, error terms (εi) were generated randomly from the so-called g-and-h distribution, which is used currently to generate models with different distributions. Three distributions were considered for εi in this work: standard normal (g = 0, h = 0), asymmetric light-tailed (g = 0.5, h = 0), and standard normal (g = 0, h = 0) with an outlier. It is acknowledged that in a perfect situation the latter case should not occur; however, it is also true that routine work in stressful circumstances, combined with a very low number of standards, might well make it difficult to correctly detect outliers.

The two situations for the explanatory variable (S1 and S2) with various sample sizes {3, 5, 7, 9} were combined with all three variance patterns (τ) for each of the three εi distributions. Hence, a total of 144 simulation scenarios are considered. The probability of a type I error was based on 1000 replications and was estimated with α̂, the proportion of p-values less than or equal to 0.05 (at the α = 0.05 level), that is, the proportion of times that H0 is rejected under H0.

The studies were carried out by considering two experimental laboratory setups, typical of DM and SAM comparisons. In setup A, the regression lines are identical. For instance, without loss of generality, β0 = 0.15, β1 = 0.1, and β2 = β3 = 0. In setup B, the regression lines have the same slopes but different intercepts. For instance, without loss of generality, β0 = 0.15, β1 = 0.1, β2 = −0.15, and β3 = 0.

As a final note here, the statistics to be studied are T (Student’s t), Qt (HC4-based quasi-t test), WB (HC4-based wild bootstrap quasi-t test), MP (modified percentile bootstrap), and TS (Theil−Sen percentile bootstrap). Full technical descriptions of these statistics were given previously,8 and interested readers are referred to that seminal work.

■ RESULTS AND DISCUSSION
In the following, the simulations will be summarized and discussed separately for each scenario. First, a scenario where the variability depends on x (but not on the calibration; i.e., WG) will be explored for the two situations S1 and S2, the three types of residual variances, and three types of error distribution.

Scenario 1, Situation 1: Within-Group Heteroscedasticity and Between-Group Homoscedasticity, Same Calibration Ranges. This corresponds to a common design in laboratories (termed scenario 1 in the Experimental Section): sample sizes and values of the explanatory variables are equal in both calibrations.

Table 1. Comparison of Statistics under Variance Pattern 1 and Three Types of Error Distribution

          normal error               skewed errors              normal error + outlier
method    p-value   type I error     p-value   type I error     p-value   type I error
                    rate (%)                   rate (%)                   rate (%)
n = 3
T         0.5026    5.40             0.4710    4.40             0.5474    7.60
Qt        0.5579    3.30             0.5285    2.60             0.5889    0.00
WB        0.5231    3.10             0.4905    2.70             0.5184    5.70
MP        0.7105    0.00             0.6905    0.00             0.6916    0.00
TS        0.5451    0.00             0.5802    0.00             0.6287    0.00
n = 5
T         0.5016    5.50             0.4775    4.80             0.5726    2.90
Qt        0.5395    5.00             0.5235    2.90             0.6327    2.00
WB        0.5032    5.10             0.4592    7.30             0.6117    1.90
MP        0.5584    0.00             0.5294    0.00             0.6204    0.00
TS        0.5810    0.90             0.5851    1.40             0.5919    0.80
n = 7
T         0.5158    4.50             0.4917    4.20             0.5627    2.30
Qt        0.5373    4.50             0.5218    1.80             0.6360    0.90
WB        0.5181    4.40             0.4868    5.70             0.6072    0.80
MP        0.5232    1.80             0.4985    1.30             0.5883    0.70
TS        0.5601    2.00             0.5732    1.10             0.5963    1.20
n = 9
T         0.5079    5.00             0.4909    3.00             0.5725    1.60
Qt        0.5271    4.80             0.5110    1.30             0.6443    0.40
WB        0.5075    4.50             0.4751    5.80             0.6280    0.50
MP        0.5087    5.30             0.4826    4.50             0.5959    0.90
TS        0.5569    2.20             0.5860    1.30             0.5773    0.70

Table 1 reveals that whenever homoscedasticity occurs (VP1) and the residuals follow a normal distribution, the best statistic is the traditional Student’s t-test, as expected. In our context, “best” means that the p-value and the percentage of rejections approach the nominal values (i.e., 0.5 and 5%, respectively). This was so regardless of the sample size. Nevertheless, WB yielded very good results. When the distribution of the residuals is asymmetric, T, WB, and MP had p-values close to the nominal ones (although MP only with the largest number of standards). When an outlier is considered, the most robust statistic was TS, although it is not the most
precise because it yields p-values not very close to the nominal one. Therefore, if homoscedasticity occurs, the classical Student’s t-test can be used regardless of the error distribution, which is a highly advantageous result given the ubiquitous and widespread use of this test.

Table 2. Comparison of Statistics under Within-Group Heteroscedasticity Conditions (Variance Patterns 2 and 3) and Three Types of Error Distribution

variance pattern 2 (3)
          normal error                      skewed errors                     normal error + outlier
method    p-value          type I error     p-value          type I error     p-value          type I error
                           rates                             rates                             rates
n = 3
T         0.376 (0.616)    12.5 (2.7)       0.334 (0.581)    9.7 (2.5)        0.573 (0.685)    4.2 (0.4)
Qt        0.429 (0.667)    8.6 (1.9)        0.392 (0.632)    6.7 (1.7)        0.621 (0.729)    2.5 (0.4)
WB        0.492 (0.535)    1.3 (3.5)        0.453 (0.532)    2.8 (3.4)        0.646 (0.594)    0.0 (2.8)
MP        0.646 (0.766)    0.0 (0.0)        0.627 (0.741)    0.0 (0.0)        0.737 (0.786)    0.0 (0.0)
TS        0.533 (0.600)    0.0 (0.0)        0.531 (0.601)    0.0 (0.0)        0.606 (0.643)    0.0 (0.0)
n = 5
T         0.395 (0.559)    12.9 (2.1)       0.396 (0.544)    10.1 (1.6)       0.532 (0.641)    4.3 (1.2)
Qt        0.481 (0.559)    6.4 (3.2)        0.489 (0.570)    3.2 (0.8)        0.633 (0.659)    1.9 (1.7)
WB        0.472 (0.486)    5.3 (6.6)        0.457 (0.482)    4.8 (5.0)        0.661 (0.629)    0.6 (0.8)
MP        0.490 (0.585)    0.0 (0.0)        0.489 (0.570)    0.0 (0.0)        0.620 (0.656)    0.0 (0.0)
TS        0.546 (0.599)    1.1 (0.7)        0.583 (0.606)    0.7 (0.5)        0.581 (0.617)    1.1 (0.6)
n = 7
T         0.409 (0.565)    11.9 (1.9)       0.401 (0.539)    9.7 (2.7)        0.528 (0.610)    3.8 (1.1)
Qt        0.494 (0.551)    7.0 (2.5)        0.485 (0.535)    3.9 (2.1)        0.642 (0.636)    0.6 (1.2)
WB        0.487 (0.508)    4.3 (4.0)        0.469 (0.468)    6.3 (4.5)        0.633 (0.601)    0.2 (1.4)
MP        0.466 (0.544)    6.3 (0.9)        0.448 (0.514)    7.4 (1.5)        0.592 (0.600)    1.3 (1.0)
TS        0.547 (0.586)    3.4 (1.0)        0.545 (0.586)    1.6 (0.9)        0.562 (0.582)    1.6 (0.6)
n = 9
T         0.419 (0.558)    10.8 (2.4)       0.409 (0.535)    9.6 (2.3)        0.541 (0.591)    3.5 (1.1)
Qt        0.501 (0.532)    6.8 (4.2)        0.483 (0.526)    3.2 (1.5)        0.659 (0.619)    0.2 (0.6)
WB        0.491 (0.501)    4.8 (5.2)        0.468 (0.488)    6.7 (6.2)        0.650 (0.602)    0.1 (1.5)
MP        0.467 (0.521)    7.5 (2.8)        0.453 (0.505)    7.7 (3.1)        0.604 (0.578)    1.1 (1.7)
TS        0.542 (0.590)    3.2 (1.0)        0.551 (0.573)    2.4 (1.6)        0.564 (0.573)    1.5 (0.5)

On the contrary, when heteroscedasticity occurs (in particular, when the variance increases with x; VP2), Table 2 shows that the t-test leads clearly to higher percentages of rejection than the nominal level (for both normal and asymmetric distributions). Qt and MP worked satisfactorily only when the number of standards is at least 5. The best test statistics were WB and TS (in particular, the former). When an outlier is present, TS is still the most robust option (although with p-values > 0.55) because the p-values and/or percentages of rejections remain quite stable (e.g., compare the 2nd and 8th columns). The other statistics decrease the percentage of rejections and, a bit surprisingly, the T statistic is the best.

Table 2 presents the results for heteroscedasticity associated with a reduction of the variance of the residuals when the explanatory variable departs from the central value (VP3). In general the p-values clearly exceed the nominal one, both for normal and asymmetric errors, except for the WB statistic. When the fewest standards are considered (n = 3), results are really poor for T, Qt, and MP. When an outlier is present, the other tests improve their behavior but they do not outperform it. Therefore, WB should be the choice here.

An interesting conclusion from the simulations above is that the behavior of the statistics depends much more on the type of heteroscedasticity than on the distribution of errors (in which cases the behaviors were quite homogeneous). Thus, it is possible to calculate an “average behavior” of the statistics for each variance pattern considering the three error distributions. This is presented in Figure 1 and in Figure S1 (Supporting Information). They show that, roughly, the T statistic has very good behavior only with homoscedastic data (VP1). With heteroscedasticity (VP2 and VP3), WB is the best option because, despite the fact that the other statistics improve when the number of standards increases, they did not outperform it.

Figure 1. Relative performance of statistics, illustrated through averaged p-values calculated considering the three distributions of errors when sample sizes and values of the explanatory variable are equal in both calibrations.

Figure 2. Relative performance of statistics: average type I error probabilities considering the three error distributions when different types of WG heteroscedasticity (variance patterns: VP1, blue; VP2, red; and VP3, green) and sample sizes are considered (“reject” stands for rejection).

In the same way, the situation where the explanatory variable does not take the same values in the two calibrations, although it does span the same range, was studied. It was observed that T behaves very well under homoscedasticity conditions, regardless of the number of standards and error distribution. However, the statistic degrades under heteroscedasticity as the p-values are lower and higher than the nominal one (50%) for VP2 (increasing variance) and VP3 (decreasing variance), respectively. Also, T is very sensitive to the existence of a single outlier, which makes the p-values increase remarkably. The average p-values for each statistic, when VPs and error distributions and number of standards (sample size) are considered, can be seen in Figure S2 (Supporting Information). Type I error probability for the WB statistic is close to the nominal level (5%, when 95% confidence level is considered) for both normal and asymmetric errors, regardless of the type of heteroscedasticity and number of standards. It is, no doubt, the best statistic under these conditions. Qt and MP vary quite a lot depending on sample size, sometimes with good results. Finally, TS seems the most robust statistic, even though the
average p-values exceed systematically the nominal one (=0.5, or 50%, Figure S2). Figure 2 depicts the average percentage of rejections (average type I error estimated for the three error distributions, nominal value = 5%).

A final remark is in order here, because in our studies TS was far less satisfactory than for Ng and Wilcox.8 Further simulations with more standards were considered (n = 10, 20, 40, 60) and the statistic yielded average results close to the nominal ones, comparable to those of WB. In some cases TS outperformed WB, but we feel that TS is not suitable for typical laboratory situations where very few standards are used.

Scenario 1, Situation 2: Within-Group Heteroscedasticity and Between-Group Homoscedasticity, Different Calibration Ranges. In the Experimental Section, situation S2 was described as that when the explanatory variable has different ranges in both calibrations. Results were totally similar to those above, so they are not presented here for brevity.

Scenario 2: Both Within-Group and Between-Group Heteroscedasticity. Out of the many possibilities that might be simulated, three rather “extreme” situations were considered: normal error in group 1 plus asymmetric error in group 2; normal error in group 1 plus normal error with an outlier in group 2; and asymmetric error in group 1 plus normal error with an outlier in group 2.

Figure S3 shows that the most useful statistics were T, WB, and TS. Their behaviors agree largely with those in the two previous scenarios. T is the most valuable statistic when homoscedasticity is present, although it becomes very affected by an outlier. In contrast, WB leads to average rejection percentages that are very close to the nominal ones in almost every situation. Hence this would be the recommended choice for laboratories working with a small number of standards. Finally, TS is very robust but its rejection percentages depart from the nominal ones (<5%).

Statistical Power. In the analysis of a statistical test, it is common practice to focus on type I error (α error, or significance level); that is, the probability of rejecting the null hypothesis when, indeed, it is true. Nevertheless, the probability of a type II error (β), or the probability of accepting the null hypothesis when it is false, should not be neglected. Evaluation of this probability is made by studying the statistical power of the test (1 − β), that is, the test’s ability to detect a specific alternative hypothesis (the probability of correctly rejecting the null hypothesis when it is false or a specified alternative is true). Although the statistical power is also known as sensitivity, it should not be confused with analytical sensitivity, which is the slope of the regression line. Note that the higher the power, the lower the type II error will be.

The two most satisfactory statistics according to the previous simulations, T and WB, were subjected to a final analysis to determine their statistical power. For this, a situation was considered first where the true regression lines can differ just in their slopes (β0 = 0.15, β1 = 0.1, β2 = 0.0) and normality of their residuals and homoscedasticity are fulfilled by design. Then, their power was evaluated as a function of both the number of standards (sample size) and the values of β3. It is worth noting that a “perfect” test should have power = 1 and that power tends to 0.05 (type I error) when the null hypothesis (β3 = 0) is true; otherwise the power should be evaluated for different values of β3. In particular, for the test posed in eq 2, we investigated whether the methods considered in this paper can detect a difference between the two slopes (if it exists) by rejecting the null hypothesis. Such ability in general depends on sample sizes and β3 values. In effect, the more standards we have and the larger the value of β3 is, the greater the statistical power should be.

Theoretically, when homoscedasticity holds, T should be more powerful than WB, ex ante. In effect, it was found that both T and WB have very similar power, although for smaller sample sizes, T approaches 1 faster. As expected, both statistics clearly increase their power when the number of standards increases. Hence, WB performed very satisfactorily, even when compared with the optimal approach (under these conditions), T.

When heteroscedasticity and nonnormal residuals were considered [we selected increasing variances (VP2) and asymmetric error distribution (g = 0.5, h = 0) because in our experience this seems a common situation in laboratories], the T statistic had very unsatisfactory results, as expected, and only the WB statistic must be considered. However, its statistical power converged to 1 at a slower pace than when homoscedasticity and normal errors occurred: it was relatively low for small values of β3 and/or for very small sample sizes (n1 = n2 = 3). Nevertheless, when the sample size of a calibration is equal to or greater than 5, the statistical power increases to 1 quickly. This is a highly relevant result because, in analytical laboratories, to prepare a series of five standards is a de facto standard, and so this practice now has an additional benefit.

■ WORKED EXAMPLES
In Table S2 (Supporting Information), experimental data from four real case studies are presented. They show how awkward decision-making can be when classical tests are used with very few data points and/or when the structural characteristics are not carefully reviewed (which must be done before every calculation).

Example 1 was intended to check whether an ionization suppressor improved some spectrometric measurements in flame atomic absorption spectrometry4 and represents a measurement protocol where the standards are not replicated. Examples 2−4 correspond to laboratory studies carried out with a graphite furnace atomic spectrophotometer to assess the need for SAM calibration when different instrumental setups were considered. The standards were replicated three times (replicates mean full preparation).

In the first example, heteroscedasticity cannot be checked because of a lack of replicates. Although normality of the residuals is not fulfilled (p-value for the Shapiro−Wilk test = 0.04), T and WB procedures yield the same conclusion: the slopes are comparable (p-values for T and WB were 0.25 and 0.51, respectively).

In example 3, homoscedasticity and normality are fulfilled (nonsignificant tests at the 5% level) and both procedures lead to the same decision; that is, the slopes are comparable (p-values for T and WB were 0.24 and 0.59, respectively).

In example 4, neither heteroscedasticity nor normality occurs, and the T and WB tests showed that the slopes are not different (p-values for T and WB were 0.44 and 0.62, respectively).

Finally, in example 2, neither homoscedasticity nor normality occurs (p-values for Shapiro−Wilk and Breusch−Pagan tests not significant at the 95% confidence level), and the WB test accepts that the slopes are equal, while the Student’s one rejects this (p-values of 0.11 and 0.01, respectively).

These results clearly show that the behavior of the t-test depends hugely on the particular data under study and that
although it is quite reliable (according to the simulations described earlier), it may lead to different conclusions than the robust tests (in particular, the WB one). In particular, one should be aware that if the data present heteroscedasticity, its correct performance is by no means guaranteed.
These examples were used to develop the practical examples displayed in the tutorial on practical use of the WB statistic (including how to get and install the free software). The tutorial can be found in the Supporting Information.

■ CONCLUSIONS
The slopes of two regression lines were compared in 144 simulation scenarios. Ranges of experimental predictors, sample sizes, three variance patterns, and three error distributions were combined to assess how they influence the performance of Student’s t-test and four nonparametric tests.
A first interesting conclusion from the simulations is that the behavior of the statistics depends much more on the type of heteroscedasticity than on the distribution of errors (for which the behaviors of different setups were very similar). A second fact derived from the simulations is that the statistics behaved the same, regardless of whether the explanatory variable takes the same values in the two calibrations or not.
When homoscedasticity occurs, the classical Student’s t-test can be used regardless of error distribution, which is a highly advantageous result given the widespread use of this test. However, when heteroscedasticity occurs (no matter whether the variance increases or decreases), the t-test leads clearly to higher percentages of rejection than the nominal one (for both normal and asymmetric error distributions). The best statistics were WB and TS, WB in particular; when an outlier was present, TS was the most robust option, although it had p-values >0.55 and so its use is discouraged for a low number of standards.
In the most complex situation, heteroscedasticity was induced both within and between calibrations. Once more, the most useful statistics were T, WB, and TS. Their behaviors agree largely with those in the previous scenarios. T is the most valuable statistic when homoscedasticity is present, although it becomes strongly affected by an outlier. On the contrary, WB leads to average rejection percentages that are very close to the nominal ones in almost every situation. TS is very robust, but its rejection percentages depart from the nominal ones.
As a final conclusion, the wild bootstrap (WB) statistic seems a very convenient and useful choice to compare regression straight lines in laboratories working with a small number of standards. In addition, its power is comparable to that of Student’s t-test even in cases where optimal homoscedasticity occurs, and it does not need previous statistical tests to assess normality or homoscedasticity.

■ ASSOCIATED CONTENT
Supporting Information
The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.analchem.5b04004.
Two tables, listing variance patterns associated with residuals and practical examples to compare the behavior of T and WB tests; three figures, showing relative performance of statistics when sample sizes and values of explanatory variable are equal in both calibrations and average p-values and relative performance of statistics when different types of WG heteroscedasticity are considered; and a tutorial on use of the WB statistic (PDF)

■ AUTHOR INFORMATION
Corresponding Author
*E-mail andrade@udc.es; fax +34-981-167065.
Notes
The authors declare no competing financial interest.

■ ACKNOWLEDGMENTS
The Galician Government, “Xunta of Galicia”, is acknowledged for its support to the QANAP group (Programa de Consolidación y Estructuración de Unidades de Investigación Competitiva, GRC2013-047). The financial support of the Spanish Government (Ministerio de Economía y Competitividad) and Xunta de Galicia (research projects MTM2014-52876-R and CN2012/130) is also acknowledged.

■ REFERENCES
(1) Ortiz, M. C.; Sánchez, S.; Sarabia, L. Quality of analytical measurements: univariate regression. In Comprehensive Chemometrics: Chemical and Biochemical Data Analysis, Vol. 1; Brown, S. D.; Tauler, R.; Walczak, B., Eds.; Elsevier: Amsterdam, 2009; pp 127−169; DOI: 10.1016/B978-044452701-1.00091-0.
(2) Thompson, M.; Lowthian, P. J. Notes on Statistics and Data Quality for Analytical Chemists; Imperial College Press: London, 2011.
(3) Andrade-Garda, J. M.; Carlosena-Zubieta, V.; Soto-Ferreiro, R.; Terán-Baamonde, J.; Thompson, M. Classical Linear Regression by the Least Squares Method. In Basic Chemometric Techniques in Atomic Spectroscopy; Royal Society of Chemistry: London, 2013; Chapt. 2, pp 52−122; DOI: 10.1039/9781849739344-00052.
(4) Andrade, J. M.; Estévez-Pérez, G. Anal. Chim. Acta 2014, 838, 1−12.
(5) Draper, N. R.; Smith, H. Applied Regression Analysis; John Wiley & Sons: New York, 1998; DOI: 10.1002/9781118625590.
(6) Smithson, M. Frontiers in Psychology 2012, 3, No. 231.
(7) DeShon, R. P.; Alexander, R. A. Psychological Methods 1996, 1 (3), 261−277.
(8) Ng, M.; Wilcox, R. R. Br. J. Math. Stat. Psych. 2010, 63, 319−340.
(9) Wilcox, R. R. Br. J. Math. Stat. Psych. 1997, 50, 309−317.
(10) Wilcox, R. Introduction to Robust Estimation and Hypothesis Testing, 3rd ed.; Elsevier: New York, 2012.
(11) Mooney, C. Z.; Duval, R. D. Bootstrapping: A Nonparametric Approach to Statistical Inference; Sage Publications: Newbury Park, CA, 1993.
(12) Wehrens, R.; Putter, H.; Buydens, L. M. C. Chemom. Intell. Lab. Syst. 2000, 54, 35−52.
(13) Hartmann, C.; Smeyers-Verbeke, J.; Penninckx, W.; Vander Heyden, Y.; Vankeerberghen, P.; Massart, D. L. Anal. Chem. 1995, 67, 4491−4499.
(14) Villa, J. L.; Boqué, R.; Ferré, J. Chemom. Intell. Lab. Syst. 2008, 94 (1), 51−59.
(15) Afanador, N. L.; Tran, T. N.; Buydens, L. M. C. Anal. Chim. Acta 2013, 768, 49−56.
(16) Afanador, N. L.; Tran, T. N.; Buydens, L. M. C. Chemom. Intell. Lab. Syst. 2014, 137, 162−172.
(17) Pereira, A. C.; Reis, M. S.; Saraiva, P. M.; Marques, J. C. Chemom. Intell. Lab. Syst. 2011, 105 (1), 43−55.
(18) de Almeida, M. R.; Correa, D. N.; Rocha, W. F. C.; Scafi, F. J. O.; Poppi, R. J. Microchem. J. 2013, 109, 170−177.
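The power study and the wild bootstrap (WB) comparison of slopes discussed in this article can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the authors' implementation. The model y = β0 + β1·x + β2·z + β3·x·z + ε, with z a calibration indicator, is inferred from the coefficients quoted in the text (β0 = 0.15, β1 = 0.1, β2 = 0.0). The Rademacher (±1) bootstrap weights, the HC3 robust standard error, the range of x, the noise level `sigma`, and all function names are hypothetical choices made here; the tutorial in the Supporting Information describes the software actually used.

```python
import numpy as np


def design(x, z):
    """Design matrix: intercept, x, group indicator z, and x*z (slope difference)."""
    return np.column_stack([np.ones_like(x), x, z, x * z])


def robust_t(X, y, j=3):
    """Coefficient j divided by its HC3 heteroscedasticity-consistent standard error."""
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    leverage = np.einsum("ij,jk,ik->i", X, xtx_inv, X)
    w = (resid / (1.0 - leverage)) ** 2          # HC3 weights
    cov = xtx_inv @ (X.T * w) @ X @ xtx_inv      # sandwich covariance
    return beta[j] / np.sqrt(cov[j, j])


def wild_bootstrap_pvalue(x, y, z, n_boot=999, rng=None):
    """Two-sided p-value for H0: beta3 = 0 (equal slopes), assumed WB variant."""
    rng = np.random.default_rng(rng)
    X = design(x, z)
    t_obs = robust_t(X, y)
    X0 = X[:, :3]                                # null model: common slope
    beta0, *_ = np.linalg.lstsq(X0, y, rcond=None)
    fitted0 = X0 @ beta0
    resid0 = y - fitted0
    t_star = np.empty(n_boot)
    for b in range(n_boot):
        signs = rng.choice([-1.0, 1.0], size=y.size)   # Rademacher weights
        t_star[b] = robust_t(X, fitted0 + signs * resid0)
    return (1 + np.sum(np.abs(t_star) >= abs(t_obs))) / (n_boot + 1)


def rejection_rate(beta3, n=6, sigma=0.01, n_sim=200, alpha=0.05, seed=0):
    """Fraction of simulated data sets in which H0: beta3 = 0 is rejected."""
    rng = np.random.default_rng(seed)
    x = np.tile(np.linspace(0.0, 1.0, n), 2)     # same standards in both calibrations
    z = np.repeat([0.0, 1.0], n)
    hits = 0
    for _ in range(n_sim):
        noise = rng.normal(0.0, sigma, 2 * n)    # normal, homoscedastic errors
        y = 0.15 + 0.1 * x + 0.0 * z + beta3 * x * z + noise
        hits += wild_bootstrap_pvalue(x, y, z, n_boot=199, rng=rng) < alpha
    return hits / n_sim
```

Calling `rejection_rate(0.0)` estimates the type I error rate, which should stay near the nominal 0.05, whereas increasing `beta3` or `n` traces out a power curve of the kind evaluated in the simulations.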