
Fluid Phase Equilibria 164 (1999) 61–82
www.elsevier.nl/locate/fluid

Analysis of residuals: a useful tool for phase equilibrium data analysis

Jaime Wisniak *, Anna Polishuk
Department of Chemical Engineering, Ben-Gurion University of the Negev, Beer-Sheva, Israel
Received 21 January 1999; accepted 8 June 1999

Abstract
It is shown that judging the goodness of a fit using only the coefficient of determination $R^2$ or the mean
average deviation is not enough; such a judgment should be accompanied by a statistical study of the behavior of the
residuals. This study should not be merely visual but should use proper statistics. Different techniques such as normal
probability plots, half-normal probability plots, detrended probability plots, rankits and the Durbin–Watson statistic are
presented and illustrated with examples from the area of fluid behavior and fluid phase equilibrium. © 1999
Elsevier Science B.V. All rights reserved.
Keywords: Statistics; Regression; Residuals

1. Introduction
Data collection in the study of phase behavior or phase equilibrium is normally followed by
regression of the data for the purpose of (a) compressing a large set of information into an equation
possessing a few constants, (b) fitting an equation that describes the phenomena, (c) interpolation, (d)
testing a certain model, (e) selecting between different models, (f) calculating parameters related to

* Corresponding author. Tel.: +972-7-6461479; fax: +972-7-6472916; e-mail: wisniak@bgumail.bgu.ac.il

0378-3812/99/$ - see front matter © 1999 Elsevier Science B.V. All rights reserved.
PII: S0378-3812(99)00246-0


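the phenomena, etc. The overall goodness of the fit is usually expressed by statistics like the mean
average deviation (MAD), the standard deviation $s$ and the coefficient of determination $R^2$:

$$\mathrm{MAD} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \tag{1}$$

$$s = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \tag{2}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \tag{3}$$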

but seldom by an analysis of the residuals $e_i = y_i - \hat{y}_i$, where $y_i$ is the observed value and $\hat{y}_i$ the
predicted one, in a population of $n$ elements with average $\bar{y}$. The standard deviation measures the
dispersion of the sample about the mean while the coefficient of determination measures the
proportion of the variation of the regressand that is accounted for by the regressor. In general, a
regression equation will be valid only over the region of the regressor variables contained in the
observed data. It is the purpose of this work to show that every regression procedure should be
accompanied by a study of the behavior of the residuals of the model, and to illustrate the pitfalls that may occur
if this is not done.
A good method of fitting should [1]:
(a) use all the relevant data in estimating each constant;
(b) have reasonable economy in the number of constants required;
(c) provide some estimate of the uncontrolled error in y;
(d) provide some indication of the random error in each constant estimated;
(e) make it possible to find regions of systematic deviations from the equation if any exist;
(f) show whether the conclusions are unduly sensitive to the results of a small number of runs; and
(g) give some idea of how well the final equation can be expected to predict the response.
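As a concrete reference for the overall statistics of Eqs. (1)-(3), the following minimal sketch computes MAD, $s$ and $R^2$ from a list of observed and predicted values. The function name `fit_statistics` and the four data points are illustrative assumptions, not taken from the paper, and the divisor $n$ in $s$ follows the reading of Eq. (2) adopted here.

```python
import math

def fit_statistics(observed, predicted):
    """Return (MAD, s, R2) following the definitions of Eqs. (1)-(3)."""
    n = len(observed)
    residuals = [y - y_hat for y, y_hat in zip(observed, predicted)]
    y_mean = sum(observed) / n
    sse = sum(e * e for e in residuals)             # sum of squared residuals
    sst = sum((y - y_mean) ** 2 for y in observed)  # total sum of squares
    mad = sum(abs(e) for e in residuals) / n        # Eq. (1)
    s = math.sqrt(sse / n)                          # Eq. (2), divisor n assumed
    r2 = 1.0 - sse / sst                            # Eq. (3)
    return mad, s, r2

# Hypothetical observed and fitted values, for illustration only.
y_obs = [1.0, 2.0, 3.0, 4.0]
y_fit = [1.1, 1.9, 3.2, 3.8]
mad, s, r2 = fit_statistics(y_obs, y_fit)
print(f"MAD = {mad:.3f}, s = {s:.3f}, R2 = {r2:.3f}")
```

All three numbers are single summaries of the residuals; none of them says anything about how the residuals are distributed along the fitted range, which is the point developed in the rest of the paper.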
The fundamental assumptions in correlation are that the error terms have zero mean and constant
variance, and are not correlated. The assumption of normality is usually added to permit testing of
hypotheses and determining confidence intervals of the parameters. By this assumption, we write that
the errors are ND(0, $\sigma^2$), where the symbol ND(0, $\sigma^2$) stands for normally distributed with mean 0
and variance $\sigma^2$.
Whenever possible, the data should be checked before their regression using criteria like correctness of sign and magnitude, thermodynamic consistency, etc. Nevertheless, there is always the
possibility that large collections of data contain a few outliers, values that cannot be considered
typical of the physical behavior of the system. It is usually impossible to repeat the conditions that
made them unusual, but they must nevertheless be spotted, since their retention may invalidate the
judgments we make.
Regression analysis is a statistical technique for investigating and modeling the relationship
between variables. An important objective of the technique is to estimate the unknown parameters in


the regression model following a procedure called fitting the model to the data. There are several
parameter estimation techniques and these are well described in the literature [2,3]. The next step of a
regression analysis is checking the adequacy of the model. Here, we study the appropriateness of the
model and use suitable tools to verify the quality of the fit. The results of the check may indicate
whether the model is reasonable or must be modified.
After obtaining the proper fit of the model, a number of important questions arise, including: (a)
How well does the equation fit the data?; (b) Is the model likely to be useful as a predictor?; and (c)
Are any of the basic assumptions about the errors violated, and if so, how serious is this?

2. Goodness of fit and $R^2$


A very common procedure for choosing among alternative functions is to select the one that fits the
best, that is, the one that has the highest coefficient of determination $R^2$. We will discuss this concept
in some detail because the coefficient of determination is the tool most commonly used in qualifying
the fit and, unfortunately, is too often misinterpreted. Although a high $R^2$ value is an important
asset of the regressor, there are serious objections to relying exclusively or very heavily on this
empirical criterion. The purpose of sampling the population is to obtain reliable estimates of
its parameters, and not merely to discuss the properties of the sample. Thus, we should avoid
choosing functions that have a high $R^2$ but many nonsignificant coefficients, or coefficients whose
signs or magnitudes have no theoretical support. We can always increase the value of $R^2$ by adding
regressors, but we should then ask the questions: (a) Does the return justify the price?; and (b) After how
many terms does the return become negative? As mentioned before, $R^2$ measures the proportion of
the variation of the regressand that is accounted for by the regressor tested; thus the $R^2$ values of functions
that have different regressands are not directly comparable. For example, when correlating vapor–liquid
equilibrium (VLE) data we may use models where the adjustable parameters are independent of
the temperature (e.g., Redlich–Kister) or temperature dependent (e.g., NRTL).
A perfect fit to the data, for which $\hat{y}_i = y_i$, is an unlikely event and will give $R^2 = 1$. Nevertheless,
if there is no pure error, $R^2$ can be made unity by employing $n$ properly selected coefficients in the
model, including the free term. Consider, for example, fitting the response of a GC integrator with the
following polynomial:

$$A = a_1 x + a_2 x^2 + \dots + a_n x^n \tag{4}$$

where $A$ is the area and $x$ the molar fraction. Eq. (3) predicts that the more terms we pick for the
polynomial, the larger the value of $R^2$ will be. Assuming there are no repeat measurements, then for a
population of $n$ experiments, the value of $R^2$ will be exactly 1 and the predicted curve will pass
through all the experimental points (saturation). Obviously, this is a very poor choice because to pass
through all the points the curve will have to undulate, that is, it will present local minima or maxima that
do not fit the physical reality. Hence, we need an additional tool to decide between the different
possibilities (different number of terms). One such possibility is to use the so-called adjusted
coefficient of determination $R^2_{\mathrm{adj}}$, defined as:

$$R^2_{\mathrm{adj}} = R^2 - \frac{k-1}{n-k}\left(1 - R^2\right) \tag{5}$$


where $n$ is the total number of observations and $k$ is the number of regressors. Clearly, $R^2_{\mathrm{adj}}$ penalizes
functions having a higher $k$. Increasing the number of regressors will increase continuously the
value of $R^2$, but not so that of $R^2_{\mathrm{adj}}$; the latter will generally increase first and then start decreasing.
The practical significance of this result is that we should not use more terms in the expansion than the
number that yields the maximum value of $R^2_{\mathrm{adj}}$. As we will see later, there are simple techniques for
checking the possibility that a regression with even fewer terms will still give a better fit.
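The selection rule just described can be sketched in a few lines. The function below implements Eq. (5) as written; the sequence of $R^2$ values for 2 to 5 terms on 10 points is hypothetical and serves only to show $R^2$ rising monotonically while $R^2_{\mathrm{adj}}$ passes through a maximum.

```python
def adjusted_r2(r2, n, k):
    """Adjusted coefficient of determination, Eq. (5)."""
    if n <= k:
        raise ValueError("need more observations than regressors")
    return r2 - (k - 1) / (n - k) * (1.0 - r2)

# Hypothetical R2 values obtained with k = 2..5 terms on n = 10 points.
n_points = 10
r2_by_terms = {2: 0.9500, 3: 0.9800, 4: 0.9810, 5: 0.9812}

adj = {k: adjusted_r2(r2, n_points, k) for k, r2 in r2_by_terms.items()}
best_k = max(adj, key=adj.get)   # number of terms with the largest adjusted R2

for k in sorted(adj):
    print(f"k = {k}: R2 = {r2_by_terms[k]:.4f}, R2_adj = {adj[k]:.4f}")
print("terms to retain:", best_k)
```

In this made-up sequence the fourth and fifth terms raise $R^2$ only marginally, so the penalty term dominates and the adjusted value starts to fall after three terms.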

3. Residuals
The values of $R^2$ and $R^2_{\mathrm{adj}}$ indicate whether the overall fit of the model is satisfactory.
Important discrepancies may still exist, although the model may pass the above tests. These
discrepancies often can be detected through an analysis of residuals and must be investigated before
the model is finally adopted for use. As noted, the residuals play a key role in evaluating model
adequacy. Residuals can be viewed as realizations of the model errors. Thus, to check the assumptions
we must ask ourselves if the residuals look like a random sample from a distribution with these
properties. The residual should be distinguished from the model error value $\varepsilon_i = y_i - E\{y_i\}$, where
$E\{y_i\}$ is the expected value of $y_i$. The residual involves the vertical deviation of $y_i$ from the fitted
value $\hat{y}_i$ on the known estimated regression curve, while the error involves the vertical deviation of $y_i$
from the unknown true regression. Hence, we may say that the residuals from a regression represent
the part of the data that is not explained by the model.
Examination of the residuals should be an automatic part of any analysis of variance. If the model
is adequate, the residuals should be structureless; that is, they should contain no obvious pattern.
Through a study of residuals, many types of model inadequacies and violations of the underlying
assumptions can be discovered. Residuals are very useful in identifying outliers, observations that
behave very differently from the bulk of the observations. In the residual plots described below, any
point that is distant from most of the points on the plot is considered an outlier and its origin should
be investigated. If it is clear that the observation is in error (for example, an error in conducting the
measurement, measuring the wrong thing, a mistake in data transcription or entry, etc.), then it should
be corrected or deleted from the analysis. The mere fact that an observation does not fit with the other
observations in the analysis does not justify its removal. Before removing outliers, we should always
investigate the source of the outlier to justify its removal in the context of the data collection process.
Being unusual is not by itself a reason for deleting an observation; it is best to leave it in the model
and continue to investigate the cause of the unusual residual. Sometimes, these observations contain
important information about the data collected. In many situations, extreme outliers may seriously
bias results and lead to erroneous conclusions. Consider, for example, the situation given in Fig. 1,
where the data for the VLE of the system propanol + 1,1,2,2-tetrachloroethylene are analysed
using the following linearized version of the Van Laar model [4]:

$$Y = \frac{x_2}{x_1 \ln\gamma_1 + x_2 \ln\gamma_2} = \frac{1}{A} + \frac{1}{B}\,\frac{x_1}{x_2} \tag{6}$$

If the data were represented by a straight line, then the value of the parameters A and B would be
calculated from the slope and intercept of the line. Inspection of Fig. 1 shows that the last point is


Fig. 1. Plot of Eq. (6).

clearly an outlier and that it will determine a slope and intercept substantially different from those
suggested by the bulk of the information. In this particular case, we could safely eliminate the
outlier on the ground that it represents a very dilute sample, where the analytical error is the largest.
Usually, one should examine the pattern of the raw or standardized residuals even if it is just to
identify outliers. One possibility is to plot the residuals in terms of the number of their standard
deviations. If one or more cases fall outside the chosen limits, expressed as ± a multiple of sigma, one should
probably exclude the respective cases and run the analyses over to make sure that the key results are
not biased by these outliers.
Available residual values are raw residuals, standard residuals (z-scores), Mahalanobis distances,
deleted residuals and Cook's distances [3]. The standard residual is calculated as $(y_i - \hat{y}_i)/s$. The
Mahalanobis distance is the distance of a case from the centroid of all the cases, in the space defined
by the independent variables. Cook's distance is useful for assessing the changes that would result in
all the residuals if the respective case were omitted from the regression analysis. For most cases, it
will be enough to consider the raw and the standard residuals.
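A minimal sketch of screening by standard residuals follows. It uses the vapor-composition residuals listed later in Table 4; the helper names and the limit of 2 standard deviations are assumptions made for illustration (the text does not prescribe a specific multiple here), and $s$ is taken as the root mean square of the residuals.

```python
import math

def standardized_residuals(residuals):
    """Divide each raw residual by s, taken here as sqrt(sum(e^2) / n)."""
    n = len(residuals)
    s = math.sqrt(sum(e * e for e in residuals) / n)
    return [e / s for e in residuals]

def flag_outliers(residuals, limit):
    """Indices of residuals lying outside +/- limit standard deviations."""
    z = standardized_residuals(residuals)
    return [i for i, zi in enumerate(z) if abs(zi) > limit]

# Residuals (y1 - y1,calc) of Table 4, in the order listed there.
table4 = [-0.00152, -0.0023, -0.00368, -0.00295, 0.00208,
          0.00965, 0.01128, 0.01458, 0.04156]

suspects = flag_outliers(table4, limit=2.0)   # limit = 2.0 is an assumed choice
print("suspect points (0-based index):", suspects)
```

A flagged point is only a candidate for investigation; as stated above, being unusual is not by itself a reason for deleting an observation.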
In the following paragraphs, we will describe and illustrate the following features of residual
analysis:
Detection of an outlier.
Detection of a trend in the residuals.
Detection of changes in the error variance.
Examination of the residuals to ascertain if they are represented by a normal distribution.

4. Residual plots
One very useful tool is to plot the residuals against the fitted values. Fig. 2 illustrates different
shapes of this plot. If the assumptions of the regression are met, this plot will show a band of
relatively constant width, above and below the value 0, and independent of the fitted value (Fig. 2a).


Fig. 2. Different patterns of residual distribution.

The cone shape (Fig. 2b) is a common deviation from this pattern, where the spread of residuals is
wider for larger fitted values. The cone shape indicates that the variance of the observations increases
as the mean increases. When this behavior is observed, performing a logarithmic or square root
transformation of the variable before fitting the data will usually improve the distribution of the
residuals. If the residuals tend to lie in a band that curves either upward or downward, or follow a
sinusoidal pattern (Fig. 2c), addition of a new term (for example, the square of the regressor) may
improve the fit and hence the distribution of the residuals. A behavior like that in Fig. 2e
indicates that there is a time trend in the residuals. When preparing a plot of the residuals it is
important that the limits of the residual axis be not much different from the largest residual present;
larger limits will tend to agglomerate the residuals, obscure their real value, and deform the general
aspect of the distribution. For example, the DIPPR Standard Database [5] reports the residual plots for
the 104 systems that constitute the base. Unfortunately, in all the residual plots the scale runs from
+0.030 to -0.030, despite the fact that in many of the systems the residuals are much smaller than
these limits. As a consequence, in 29 (28%) of the systems, the residuals appear so close to the value

zero as to give the wrong impression that they are essentially zero, and randomly distributed. This
way of presentation was probably selected to standardize the graphical algorithm and hopefully will
be modified in a next edition.
In the following examples, we illustrate the concept of residual pattern distribution and their
analysis.
4.1. Example 1
Table 1 presents the experimental results of measuring the vapor pressure of ETBE as a function
of the temperature [6]. Let us assume that we do not have any previous theoretical background on the
possible relation between the vapor pressure and the temperature. Inspection of the data indicates that
higher temperatures yield higher vapor pressures, in excess of a possible linear relation. Let us assume
the following possible relation:

$$P/\mathrm{kPa} = A - \frac{B}{T/\mathrm{K}} \tag{7}$$

Using an optimization technique, we get $A = 660.57$, $B = -196\,446$ and $R^2 = 0.962$. On the basis
of $R^2$ alone, the fit seems very reasonable, but let us look at the plot of the vapor pressure against
$1/T$ (predicted and calculated values), as well as the plot of the residuals (Fig. 3a,b). It becomes clear
not only that the equation examined is inappropriate, but also that the residuals are very large and
dispersed in the shape of an inverted funnel. Obviously, the large value of $R^2$ is insufficient as a

Table 1
Vapor pressure of ETBE as a function of the temperature [6]

P [kPa]     T [K]
 20.335     302.31
 23.415     305.62
 26.755     308.81
 29.685     311.35
 32.595     313.68
 37.325     317.11
 40.625     319.32
 45.665     322.43
 50.395     325.10
 56.065     328.08
 61.535     330.71
 67.755     333.50
 70.885     334.83
 75.005     336.51
 81.255     338.86
 86.815     340.89
 92.145     342.75
 96.905     344.35
101.325     345.77


Fig. 3. Example 1. (a) Eq. (7), calculated against predicted values; (b) Eq. (7), plot of the residuals; (c) Eq. (8), calculated
against predicted values; (d) Eq. (8), plot of the residuals; and (e) Eq. (9), plot of the residuals.

criterion for the goodness of the fit. Let us now perform a transformation of the variable and write the
expression of Clapeyron:

$$\ln\left(P/\mathrm{kPa}\right) = A - \frac{B}{T/\mathrm{K}} \tag{8}$$

and repeat the procedure. We now find $A = 15.7925$, $B = 3861.4$ and $R^2 = 0.9999$. The overall fit is
improved compared to the first attempt, but again the graphs provide a much clearer picture

(Fig. 3c,d). Fig. 3d indicates that the value of the residuals has been reduced substantially (up to
three orders of magnitude), but the dispersion is still far from random and retains the form of a
funnel. We then perform an additional transformation, this time of the temperature, and use the
expression of Antoine:

$$\ln\left(P/\mathrm{kPa}\right) = A - \frac{B}{T/\mathrm{K} + C} \tag{9}$$

This time, we obtain $A = 13.5097$, $B = 2523.80$, $C = 211.220$ and $R^2 = 0.9999$. The pertinent
residual graph appears in Fig. 3e. We see that two transformations have produced not only an
excellent overall fit, but also a random distribution of the residuals. In addition, the last point (lowest
temperature) can now be clearly identified as an outlier.
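The first two steps of this example can be reproduced approximately with the data of Table 1. The sketch below is not the authors' optimization procedure: it assumes ordinary least squares on the linear forms $P$ against $1/T$ and $\ln P$ against $1/T$, and it evaluates $R^2$ in the variable that was fitted. The helper `linear_fit` is hypothetical.

```python
import math

P = [20.335, 23.415, 26.755, 29.685, 32.595, 37.325, 40.625, 45.665, 50.395,
     56.065, 61.535, 67.755, 70.885, 75.005, 81.255, 86.815, 92.145, 96.905,
     101.325]                                   # kPa, Table 1
T = [302.31, 305.62, 308.81, 311.35, 313.68, 317.11, 319.32, 322.43, 325.10,
     328.08, 330.71, 333.50, 334.83, 336.51, 338.86, 340.89, 342.75, 344.35,
     345.77]                                    # K, Table 1

def linear_fit(x, y):
    """Least-squares line y = a + b*x; returns (a, b, residuals, R2)."""
    n = len(x)
    xm, ym = sum(x) / n, sum(y) / n
    sxx = sum((xi - xm) ** 2 for xi in x)
    sxy = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = ym - b * xm
    res = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sst = sum((yi - ym) ** 2 for yi in y)
    r2 = 1.0 - sum(e * e for e in res) / sst
    return a, b, res, r2

inv_T = [1.0 / t for t in T]
_, _, res_lin, r2_lin = linear_fit(inv_T, P)                          # form of Eq. (7)
_, _, res_log, r2_log = linear_fit(inv_T, [math.log(p) for p in P])   # form of Eq. (8)

# Sign of each residual in order of increasing temperature.
signs_log = "".join("+" if e > 0 else "-" for e in res_log)
print(f"R2 (P vs 1/T)    = {r2_lin:.4f}")
print(f"R2 (ln P vs 1/T) = {r2_log:.5f}")
print("signs of ln P residuals:", signs_log)
```

Printing the sign sequence is a crude substitute for Fig. 3d: long stretches of equal sign indicate that the residuals of the logarithmic fit are still ordered along the temperature axis even though $R^2$ is very close to 1.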
4.2. Example 2
Artigas et al. [7] have measured the VLE of the system 2-methylpropanol + bromocyclohexane at 40
kPa (Table 2) and found that the data satisfy the Fredenslund criterion for thermodynamic consistency
with MAD = 0.0046. The Fredenslund criterion [8] consists of expressing $G^E$ as a series of Legendre
polynomials:

$$G^E = x_1 x_2 \sum_k A_k \left(2x_1 - 1\right)^k \tag{10}$$

and calculating the predicted value of the vapor composition. The MAD of the residuals of the vapor
composition is used as the overall criterion for thermodynamic consistency. The data are declared
thermodynamically consistent if MAD ≤ 0.01.

Table 2
VLE data and Fredenslund residuals for the system 2-methyl-1-propanol (1) + bromocyclohexane (2) at 40 kPa [7]

T [K]     x1       y1       (y1 - y1,calc)   (T - Tcalc)
396.43    0.0213   0.2497   -0.0132          -0.09
390.75    0.0390   0.3794   -0.0095           0.00
387.65    0.0485   0.4315    0.0066           0.62
380.95    0.0920   0.5797   -0.0035          -0.70
378.92    0.1073   0.6100   -0.0055          -0.54
373.29    0.1702   0.7020   -0.0010           0.22
368.25    0.2632   0.7702   -0.0007           1.00
365.67    0.3428   0.8066   -0.0007           0.34
362.92    0.4666   0.8343   -0.0084          -0.66
361.53    0.5592   0.8600   -0.0034          -0.46
360.31    0.6790   0.8826   -0.0036           0.22
359.27    0.7913   0.9047   -0.0031           0.30
359.11    0.8491   0.9230   -0.0013          -0.17
358.66    0.9035   0.9445   -0.0051          -0.14
358.32    0.9313   0.9550    0.0034           0.10


Fig. 4. Example 2. (a) Residuals of the Margules equation, and (b) plot of the residuals of Eq. (11).

In addition, the dependence of the temperature on the mole fraction has been adjusted according to:

$$T = x_1 T_1^0 + x_2 T_2^0 + x_1 x_2 \left[A + B\left(x_1 - x_2\right) + C\left(x_1 - x_2\right)^2 + D\left(x_1 - x_2\right)^3\right] \tag{11}$$

with $A = 0.73841$, $B = -0.51237$, $C = -0.04609$, $D = -0.04972$ and $R = 0.999$.


Let us now investigate the behavior of the residuals of the composition and the temperature (Fig.
4a,b). We see that although the overall fit is adequate and the data pass the Fredenslund test, the
dispersion of the residuals of the composition is not random (Fig. 4a), indicating a weakness in the
decision of declaring the data thermodynamically consistent. On the other hand, we see that the dispersion of
the residuals of the temperature is random. This example illustrates the fact that the Fredenslund
criterion, as used, is a necessary but not sufficient condition for thermodynamic consistency. The
complete test should include a residual analysis.
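The contrast between the MAD criterion and the behavior of the residuals can be checked directly from the $(y_1 - y_{1,\mathrm{calc}})$ column of Table 2. The sketch below recomputes the MAD and counts the signs of the residuals. The two-sided sign test at the end is an illustration added here, not a technique used in the paper: it gives the probability of so lopsided a split if positive and negative residuals were equally likely.

```python
from math import comb

# Residuals (y1 - y1,calc) of Table 2, in the order listed there.
dy = [-0.0132, -0.0095, 0.0066, -0.0035, -0.0055, -0.0010, -0.0007, -0.0007,
      -0.0084, -0.0034, -0.0036, -0.0031, -0.0013, -0.0051, 0.0034]

n = len(dy)
mad = sum(abs(e) for e in dy) / n          # criterion: MAD <= 0.01
n_neg = sum(1 for e in dy if e < 0)
n_pos = n - n_neg

# Two-sided sign test: chance of a split at least this uneven when each
# residual is positive or negative with probability 1/2.
k = max(n_neg, n_pos)
p_two_sided = min(1.0, 2 * sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n)

print(f"MAD = {mad:.4f}")
print(f"negative: {n_neg}, positive: {n_pos}, sign-test p = {p_two_sided:.4f}")
```

The data pass the MAD limit comfortably, yet almost all residuals share the same sign, which is the kind of structure the MAD alone cannot reveal.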
5. Frequency distribution
When the variable under consideration is continuous, its theoretical probability distribution, or
probability density function, can be represented by a continuous curve. The height of the curve gives
the density for a given value of the variable. In order to compare the theoretical with the observed
frequency, it is necessary to divide the two into corresponding classes. Probability density functions
are defined so that the expected frequency of observations between two class limits (vertical lines in
the graph) is represented by the area between the limits under the curve. The total area under the
curve is therefore equal to the sum of the expected frequencies (1.0 or n, depending on whether
relative or absolute frequencies were calculated).
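The expected frequency in a class is simply $n$ times the area under the density between the class limits. A minimal sketch for a normal density, using the standard library's `statistics.NormalDist`, is given below; the class limits and the sample size of 100 are assumptions chosen for illustration.

```python
from statistics import NormalDist

def expected_frequencies(class_limits, n, mean=0.0, sd=1.0):
    """Expected absolute frequency in each class [lo, hi) of a normal variable."""
    dist = NormalDist(mean, sd)
    return [n * (dist.cdf(hi) - dist.cdf(lo))
            for lo, hi in zip(class_limits, class_limits[1:])]

# Assumed class limits, in units of standard deviations, and sample size.
limits = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
expected = expected_frequencies(limits, n=100)

for (lo, hi), f in zip(zip(limits, limits[1:]), expected):
    print(f"[{lo:+.0f}, {hi:+.0f}): {f:6.2f}")
```

These expected counts would then be compared, class by class, with the observed counts of residuals falling between the same limits.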
6. Normal probability plots
The area under a normal distribution ND(0, 1) (with mean 0 and standard deviation 1) is given
by [3]:

$$y = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right) dz \tag{12}$$

If we plot $100y$ as ordinate against $x$ as abscissa we obtain an S-shaped curve called the cumulative
probability curve of the distribution. Some points of this curve are, for example, $(x, 100y) = (-1.96, 2.5)$,


(0, 50) and (1.96, 97.5), all of which are easily obtained from tables of the cumulative ND(0, 1)
distribution [3].
Normal probability paper is a specially constructed type of graph paper where the unnumbered
horizontal axis is marked by equal divisions in the usual way, but the vertical axis has a special scale.
The vertical scale goes from 0.01 to 99.99 but the spacing between the divisions becomes wider as we
move up from the 50 point to the 99.99 point and down from the 50 point to the 0.01 point, with
symmetry about the horizontal 50 line. The scaling is such that if 100 times the value of $y$ in Eq. (12)
is plotted against $x$, the resulting curve will be a straight line. Note that the points $(-\infty, 0)$ and
$(+\infty, 100)$ lie on the straight line, so the values 0 and 100 cannot be plotted, since the
horizontal scale is of limited length and cannot extend from $-\infty$ to $+\infty$. Normal probability paper can be
bought or can be drawn using graphic software packages like Sigmaplot and Statistica.
The procedure to test if a set of residuals comes from a normal population, that is, to test if the
residuals are randomly distributed, is the following. First, the residuals are arranged in ascending
order, from the most negative value to the most positive one. The accumulated frequency of each
residual is calculated using $100(j - 0.5)/n$, where $j$ is the sequential number of the residual and $n$ is
the total number of residuals. We then plot, on a normal probability graph, the cumulative frequency
against the value of the residual. The reason for calculating the cumulative frequency in the described
manner is that if we divide the unit area under the normal curve into $n$ equal areas, we should expect
that one observation lies in each section so marked out. The factor 100 adapts the value to the vertical
scale given on normal probability paper. If the sample is a normal sample, it will be possible to draw
a well-fitting straight line through the bulk of the points plotted. In visualizing the straight line, more
emphasis should be put on the central values of the plot than on the extremes. Instead of using the raw
residuals, we can use the z-scores, calculated as indicated before, $z = (y_i - \bar{y})/s$. If the residuals are
normally distributed, then approximately 95% of them should fall in the interval (-2, +2). An
alternative procedure to using normal probability paper is to use the calculated cumulative frequencies
to read the theoretical z-scores from a table of the normal distribution and to plot these scores against
the value of the residual, on regular graph paper. This is equivalent to plotting the actual value of a
logarithm on a regular scale, instead of the argument on a log scale.
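The alternative procedure just described (theoretical z-scores on regular graph paper) can be sketched as follows. The residuals are those of Table 3 (Example 3); the table lookup is replaced here by `statistics.NormalDist().inv_cdf`, and the function name is an assumption.

```python
from statistics import NormalDist

def normal_plot_points(residuals):
    """Return (residual, cumulative %, theoretical z) for each ranked residual."""
    ranked = sorted(residuals)                 # most negative to most positive
    n = len(ranked)
    std_normal = NormalDist()                  # ND(0, 1)
    points = []
    for j, e in enumerate(ranked, start=1):
        cum = 100.0 * (j - 0.5) / n            # accumulated frequency, percent
        points.append((e, cum, std_normal.inv_cdf(cum / 100.0)))
    return points

# Residuals (y1 - y1,calc) of Table 3, in the order listed there.
table3 = [0.01957, 0.00931, 0.00153, 0.00067, 0.00079, 0.00118, 0.00057,
          0.00158, 0.00129, 0.0007, 0.00046, 0.00058, 0.00074, 0.00093,
          -0.0005, -0.00037, -0.00114, -0.00131, -0.00013]

points = normal_plot_points(table3)
for e, cum, z in points:
    print(f"{e:+.5f}  {cum:6.2f}  {z:+.3f}")
```

Plotting the third column against the first gives the same picture as normal probability paper: a normal sample falls close to a straight line, and points that bend away at the upper end are candidate outliers.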
Outliers may also become evident in this plot. If there is a general lack of fit, and the data seem to
form a clear pattern (e.g., an S shape) around the zero line (Fig. 2b,c), then the dependent variable
may have to be transformed in some way.
7. Half-normal plot
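When it is known that a sample comes from a (possible) normal distribution with zero mean, a
useful alternative to the full-normal plot is the half-normal plot. If:

$$x \sim \frac{1}{\theta\sqrt{2\pi}} \exp\left(-\frac{x^2}{2\theta^2}\right) \qquad (-\infty \le x \le \infty) \tag{13}$$

then

$$|x| \sim \frac{1}{\theta\sqrt{\pi/2}} \exp\left(-\frac{x^2}{2\theta^2}\right) \qquad (|x| \ge 0) \tag{14}$$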


The variable $|x|$ follows a half-normal distribution that has a probability distribution of exactly the
same shape as the right-hand half of an ND(0, $\theta^2$) distribution but with every ordinate twice as high.
Suppose now:

$$y = \int_0^{|x|} \frac{1}{\theta\sqrt{\pi/2}} \exp\left(-\frac{x^2}{2\theta^2}\right) dx \tag{15}$$

If we plot $50(1 + y)$ as ordinate against $|x|$ as abscissa on normal probability paper for $|x| \ge 0$, the
result will be a straight line through the point (0, 50). In other words, this plot corresponds to the upper
half of the theoretical full-normal line. What we have done in practice is to fold over the full-normal
line so that its lower part now coincides with the upper part. Suppose again that we have a set of
residuals that we want to check for normality. We now repeat the procedure used with the normal
probability paper, except that first we take the absolute value of the residuals and then arrange them
in increasing order. For example, if the residuals had values -0.022, -0.013, -0.0032,
0.0028, 0.047, 0.05 (note that here the ascending order is the one that would be used in the
full-normal plot), the signs would be dropped and the numbers rearranged in the ascending order
0.0028, 0.0032, 0.013, 0.022, 0.047 and 0.05. If $z_1, z_2, \dots, z_n$ are the numbers obtained after the
rearrangement, then each value $z_j$ is plotted on normal probability paper against its cumulative
frequency $[50 + 50(j - 0.5)/n]$. Again, if the residuals are normally distributed, a well-fitting straight
line can be drawn through the bulk of the points plotted. The half-normal plot is basically used when
one wants to ignore the sign of the residual, that is, when one is mostly interested in the distribution
of absolute residuals, regardless of sign.
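The half-normal plotting positions for the six residuals of the example above can be generated as follows; the function name is an assumption, and the ordinate is the cumulative frequency $50 + 50(j - 0.5)/n$ quoted in the text.

```python
def half_normal_points(residuals):
    """Return (|residual|, cumulative %) pairs for a half-normal plot."""
    ranked = sorted(abs(e) for e in residuals)   # drop signs, then rank
    n = len(ranked)
    return [(z, 50.0 + 50.0 * (j - 0.5) / n)
            for j, z in enumerate(ranked, start=1)]

# The six residuals used as an example in the text.
example = [-0.022, -0.013, -0.0032, 0.0028, 0.047, 0.05]

points = half_normal_points(example)
for z, cum in points:
    print(f"{z:.4f}  {cum:6.2f}")
```

Because every cumulative frequency lies between 50 and 100, only the upper half of the probability scale is used, which is the folding described above.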

8. Detrended normal plot


The detrended normal probability plot is constructed in the same way as the standard normal
probability plot, except that before the plot, the linear trend is removed. The procedure involves
calculating the actual z-score of the residual and plotting it against the residual, on regular graph
paper. This often spreads out the plot, thereby enabling the user to detect patterns of deviation more
easily.
8.1. Example 3
Table 3 lists the VLE data for the system diethyl ether + ethanol, as reported in the DIPPR
Standard Database [5]. This publication is a very good analysis of the thermodynamic description of
VLE. According to this publication, VLE data should be analysed for thermodynamic consistency
using a four-suffix Margules equation (as explained below) and a visual examination of the
distribution of the residuals. On the basis of these criteria the data in Table 3 have been declared
thermodynamically consistent.
The four-suffix Margules equation is:

$$\frac{g^E}{RT} = x_1 x_2 \left(A x_2 + B x_1 - D x_1 x_2\right) \tag{16}$$


Table 3
VLE data and Margules residuals for the system diethyl ether (1) + ethanol (2) at 273.15 K, as reported in Ref. [5]

P [Pa]      x1     y1       (y1 - y1,calc)
 4910.26    0.05   0.6787    0.01957
 7650.04    0.1    0.8034    0.00931
10180.5     0.15   0.859     0.00153
12104.34    0.2    0.8866    0.00067
13701.54    0.25   0.9039    0.00079
15050.76    0.3    0.9159    0.00118
16213.33    0.35   0.9259    0.00057
17230.58    0.4    0.9323    0.00158
18179.84    0.45   0.9386    0.00129
19061.1     0.5    0.9442    0.0007
19799.7     0.55   0.9487    0.00046
20382.32    0.6    0.9523    0.00058
20859.62    0.65   0.9555    0.00074
21307.58    0.7    0.9585    0.00093
21779.54    0.75   0.9622   -0.0005
22326.16    0.8    0.9667   -0.00037
22931.45    0.85   0.972    -0.00114
23520.73    0.9    0.9783   -0.00131
24118.02    0.95   0.9861   -0.00013

The procedure recommended by the DIPPR publication is similar to that suggested by Fredenslund.
The experimental data are fitted by Eq. (16) and the resulting equation is used to predict the values of
the vapor composition. Again, the data are declared thermodynamically consistent if the MAD value
of the residuals is equal to or less than 0.01. A plot of the residuals against the liquid composition is also
examined visually. For the case in question, it is reported that MAD = 0.002.
Let us use the information to prepare normal, half-normal and detrended probability plots, along
with the residual plot (Fig. 5a,b,c,d). First, a visual inspection of the residual plot (Fig. 5d) shows that,
contrary to what is claimed in the DIPPR database, the residuals are not randomly distributed; this
fact is confirmed by the probability plots shown in Fig. 5a,b,c. The first two points in the residual
graph are outliers and they also appear as such in the other three graphs. In this particular case, we
can understand these observations on the basis of the large relative volatility of the ether with respect
to the alcohol. The vapor phase is composed mostly of the volatile compound; hence, the error in its
analysis is almost of the same order as the variations in the composition. Thus, in this case, to test for
goodness of fit using a criterion such as that of Fredenslund or Margules will introduce too much noise in the
statistic, and this will seriously affect the decision.
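The reported MAD and the lack of randomness can both be verified from the last column of Table 3. In the sketch below, counting runs of equal sign along increasing $x_1$ is an illustrative device added here rather than one of the plots of Fig. 5; a random sequence of 19 residuals would be expected to change sign many times.

```python
def count_sign_runs(residuals):
    """Number of consecutive stretches of equal sign (zeros are ignored)."""
    signs = [e > 0 for e in residuals if e != 0]
    return 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)

# Residuals (y1 - y1,calc) of Table 3, ordered by increasing x1.
table3 = [0.01957, 0.00931, 0.00153, 0.00067, 0.00079, 0.00118, 0.00057,
          0.00158, 0.00129, 0.0007, 0.00046, 0.00058, 0.00074, 0.00093,
          -0.0005, -0.00037, -0.00114, -0.00131, -0.00013]

mad = sum(abs(e) for e in table3) / len(table3)
runs = count_sign_runs(table3)
largest = max(range(len(table3)), key=lambda i: abs(table3[i]))

print(f"MAD = {mad:.4f}")
print("sign runs:", runs)
print("largest residual at point", largest + 1, "x1 = 0.05")
```

The MAD is well inside the 0.01 limit, yet the residuals form only two runs (all positive up to $x_1 = 0.7$, all negative afterwards) and the largest one belongs to the most dilute point.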

9. Normal rankit plot and nscores


The conclusions obtained from the three graphs described above become stronger as the size of the
sample increases. In smaller samples, a difference of one item per class can make a substantial
difference in the cumulative percentage of the tails. For small samples, it is recommended to use also


Fig. 5. Analysis of the residuals of the data of Table 3. (a) Normal probability plot, (b) half-normal probability plot, (c)
detrended probability plot, and (d) plot of the residuals.

the method of ranked normal deviates, or rankits. Suppose a simple random sample of size $n$ is
available from the population under study. These sample values are compared with $n$ specially
constructed values called normal scores and abbreviated nscores. If a scatter plot of the data values
against the nscores displays a very nearly straight-line relationship between them, then the investigator concludes that the sample was selected from a normal population. Tables of nscores are readily
available in the literature [9].

Table 4
VLE data and Margules residuals for the system furfural (1) + ethanol (2) at 338.15 K, as reported in Ref. [5]

P [Pa]      x1       y1       (y1 - y1,calc)   Ranked residuals   Rankits
55995.39    0.0201   0.0048   -0.00152         -0.00368           -1.485
54262.2     0.0506   0.01     -0.0023          -0.00295           -0.932
51595.76    0.1507   0.0229   -0.00368         -0.0023            -0.572
45329.61    0.35     0.04     -0.00295         -0.00152           -0.275
37463.59    0.5509   0.0538    0.00208          0.00208            0.000
29330.92    0.7501   0.073     0.00965          0.00965            0.275
18931.78    0.9011   0.1315    0.01128          0.01128            0.572
12398.98    0.9492   0.2052    0.01458          0.01458            0.932
 7466.05    0.98     0.351     0.04156          0.04156            1.485


Fig. 6. Analysis of the residuals of the data of Table 4. (a) Plot of the residuals, and (b) rankit plot.

9.1. Example 4
Table 4 lists the VLE data for the system furfural (1) + ethanol (2), as reported in the DIPPR
Standard Database [5]. According to this publication, the data in Table 4 have been declared
thermodynamically consistent, according to the criteria mentioned in Example 3. The nscores given in
Table 4 have been taken from Ref. [9]. Fig. 6a,b shows the residual plot and the rankit plot of the same
data. It is seen again that, contrary to what is claimed in the DIPPR database, the residuals are not
randomly distributed, a fact that is confirmed by the rankit plot (Fig. 6b).
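The rankit comparison of Example 4 can be sketched numerically. The following is a minimal illustration, not the authors' procedure: it approximates the nscores with Blom's formula (an assumption; the paper reads them from the tables of Ref. [9]) and uses a Pearson correlation coefficient as a rough measure of how straight the rankit plot is (also an assumption, since the paper judges the plot itself). The helper names `blom_scores` and `pearson_r` are hypothetical.

```python
from statistics import NormalDist

# Ranked residuals and tabulated rankits copied from Table 4 (n = 9).
ranked_residuals = [-0.00368, -0.00295, -0.0023, -0.00152, 0.00208,
                    0.00965, 0.01128, 0.01458, 0.04156]
table_rankits = [-1.485, -0.932, -0.572, -0.275, 0.000,
                 0.275, 0.572, 0.932, 1.485]


def blom_scores(n):
    """Approximate normal scores with Blom's formula (assumed here;
    tabulated rankits are exact expected order statistics)."""
    nd = NormalDist()
    return [nd.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


approx_rankits = blom_scores(len(ranked_residuals))
max_gap = max(abs(a - t) for a, t in zip(approx_rankits, table_rankits))
r = pearson_r(ranked_residuals, table_rankits)

print("largest gap between approximate and tabulated rankits:", round(max_gap, 3))
print("correlation of ranked residuals with rankits:", round(r, 3))
```

A correlation close to 1 would correspond to a nearly straight rankit plot; the largest residual in Table 4 pulls the points away from a straight line, in line with the conclusion drawn from Fig. 6b.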

10. Durbin–Watson statistic
The Durbin–Watson statistic is useful for evaluating the presence or absence of serial correlation
of residuals (i.e., whether or not residuals for adjacent cases are correlated, indicating that the
observations or cases in the file are not independent). We repeat that all the statistical significance
tests described above assume that the data consist of a random sample of independent observations. If
this is not the case, the estimates may be more unstable than the significance levels would lead one to
believe. Some applications of regression involve regressor and response variables that have a natural
sequence over time (kinetic data, for example). The assumption of uncorrelated or independent errors
for time series data is often not appropriate and the error series may exhibit serial correlation; that is,
the errors are autocorrelated. Some typical situations of time series data in phase equilibrium are the
following:
(a) Determination of the vapor pressure of a pure component as a function of the temperature,
where all the measurements are performed by changing the temperature always in the same
direction.
(b) Determination of binary VLE where the measurements are performed by increasing the
concentration of one component successively, from a very low value to a very high value.
(c) Calibration of a gas chromatograph using synthetic mixtures of increasing concentration of the
same component.


Residual plots can be useful for the detection of autocorrelation. The most meaningful display is a
plot of the residuals vs. time (order of extraction of the samples), such as shown in Fig. 3e. If there is
positive autocorrelation, residuals of identical sign occur in clusters and there are not enough changes
in sign in the pattern of residuals.
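The remark about clusters of identical sign can be made concrete by counting sign changes along the time-ordered residuals. The sketch below is only an illustration: both residual series are hypothetical, and `count_sign_changes` is a helper name introduced here, not part of the paper.

```python
def count_sign_changes(residuals):
    """Count changes of sign between consecutive nonzero residuals."""
    signs = [1 if e > 0 else -1 for e in residuals if e != 0]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)


# Hypothetical residuals, listed in the order in which the samples were taken.
clustered = [0.4, 0.3, 0.5, 0.2, -0.3, -0.4, -0.2, -0.5]    # long runs of one sign
alternating = [0.4, -0.3, 0.5, -0.2, 0.3, -0.4, 0.2, -0.5]  # sign flips every step

print("clustered series:", count_sign_changes(clustered), "sign changes")
print("alternating series:", count_sign_changes(alternating), "sign changes")
```

Very few sign changes relative to the number of points is the pattern described above for positive autocorrelation; a formal decision still requires a test such as Durbin–Watson.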
Various statistical tests can be used to detect the presence of autocorrelation. The test developed by
Durbin and Watson [10,11] is widely used. The test statistic is:
$$ d = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2} \qquad (17) $$

where $e_t$ are the residuals from an ordinary least squares analysis applied to the data. Durbin and
Watson have shown that the critical value of d lies between two bounds, $d_L$ and $d_U$, such that if d is
outside these limits a conclusion regarding the hypothesis can be reached. The decision procedure is:
(a) If $d < d_L$, reject the hypothesis that the residuals are not correlated.
(b) If $d > d_U$, do not reject the hypothesis.
(c) If $d_L \le d \le d_U$, the test is inconclusive.
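The statistic of Eq. (17) and the three-way decision rule can be written compactly. This is a minimal sketch under stated assumptions: the function names `durbin_watson` and `dw_decision` are introduced here for illustration, the residual series is hypothetical, and the bounds must still be read from published Durbin–Watson tables for the sample size and significance level at hand.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic d of Eq. (17) for time-ordered residuals."""
    numerator = sum((cur - prev) ** 2
                    for prev, cur in zip(residuals, residuals[1:]))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def dw_decision(d, d_lower, d_upper):
    """Apply the decision procedure (a)-(c) for positive autocorrelation."""
    if d < d_lower:
        return "reject: residuals are positively correlated"
    if d > d_upper:
        return "do not reject: no evidence of positive correlation"
    return "inconclusive"


# Hypothetical residuals that drift slowly from negative to positive.
example = [-0.30, -0.25, -0.20, -0.10, 0.05, 0.15, 0.25, 0.30]
d = durbin_watson(example)

# Placeholder bounds for illustration only; take real ones from the tables.
print(round(d, 3), dw_decision(d, d_lower=1.0, d_upper=1.4))
```

Keeping the bounds as arguments, rather than embedding a table, leaves the choice of sample size and significance level with the user.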

Table 5
Data for Example 5

                   Residuals
x        A         Order 2      Order 3
0.009    1.217     -0.00064     -0.000197
0.019    2.397     -0.00086     -0.000023
0.032    4.05      -0.00149     -0.000154
0.043    5.28      -0.00128      0.000390
0.068    8.27      -0.00236     -0.00028
0.105    12.68     -0.00354     -0.000556
0.129    15.34     -0.00271      0.00491
0.168    19.89     -0.00355     -0.000227
0.228    26.32     -0.00262      0.000376
0.269    30.72     -0.00281     -0.000290
0.354    39.49     -0.00128     -0.000144
0.361    40.17     -0.00004      0.000982
0.408    45.01     -0.000343    -0.000223
0.458    49.88      0.000762    -0.000027
0.504    54.37      0.00104     -0.000535
0.547    58.52      0.00188     -0.000340
0.608    64.18      0.00320      0.000305
0.672    70.27      0.00284     -0.000400
0.734    76.08      0.00214     -0.000965
0.805    82.14      0.00547      0.00312
0.855    87.18     -0.00030     -0.00148
0.943    94.99     -0.00249     -0.000670
0.960    96.44     -0.00194      0.000607
0.979    98.21      0.00364     -0.000114


In general, small values of the statistic d lead to a rejection of the hypothesis of lack of
autocorrelation because the adjacent error terms $\varepsilon_t$ and $\varepsilon_{t-1}$ tend to be of the same
magnitude when they are positively autocorrelated. Hence, the differences in the residuals
$(e_t - e_{t-1})$ would tend to be small, leading to a small numerator in Eq. (17) and, hence, to a small
test statistic d. Tables of values of $d_L$ and $d_U$ for different levels of significance can be found in the
standard literature [3,9].
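The link between a small d and positive autocorrelation can be made explicit by expanding the numerator of Eq. (17). The symbol r, the lag-1 sample autocorrelation of the residuals, is introduced here for illustration and does not appear in the original text; the approximation assumes that n is large enough for the end terms $e_1^2$ and $e_n^2$ to be negligible.

$$ d = \frac{\sum_{t=2}^{n} e_t^2 + \sum_{t=2}^{n} e_{t-1}^2 - 2\sum_{t=2}^{n} e_t e_{t-1}}{\sum_{t=1}^{n} e_t^2} \approx 2(1 - r), \qquad r = \frac{\sum_{t=2}^{n} e_t e_{t-1}}{\sum_{t=1}^{n} e_t^2} $$

Under this approximation, uncorrelated residuals (r near 0) give d near 2, while strongly positively correlated residuals (r near 1) give d near 0.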
10.1. Example 5
The data for calibration of the GC area response (Table 5) were fitted to the following polynomials
of second and third order:

$$ x = 0.008303A + 0.000017A^2, \qquad R^2 = 0.9999 $$

$$ x = 0.007924A + 0.000030A^2 - 0.0000002A^3, \qquad R^2 = 0.9999 $$

where x is the composition of the synthetic mixture and A is the area response. The samples were
injected in increasing order of x. Apply the Durbin–Watson test for serial correlation of the residuals
to both equations. The values of the residuals are given in Table 5.

Fig. 7. Example 5: Predicted against experimental values of (a) second-order polynomial, and (b) third-order polynomial.


Inspection of the coefficient of determination gives the impression that both equations fit the data
very well and, in addition, that it would seem preferable to use the first one since it contains fewer
terms. This global conclusion seems to be confirmed by an inspection of the plot of the predicted values
of x against the measured ones (Fig. 7a,b). A somewhat different picture arises when we examine the
dispersion of the residuals, either graphically or using the Durbin–Watson criterion. Application of
Eq. (17) and utilization of the Durbin–Watson statistic tables [3] for a sample of 24 observations with one
degree of freedom and a 0.05 level of significance gives the following results:

Order of polynomial   d        d_L    d_U
2                     0.4704   1.27   1.45
3                     2.941    1.27   1.45

Inspection of the results indicates that for a polynomial of order 2 the residuals are positively
correlated, but not so for a polynomial of order 3. This result can be reinforced by inspection of the
residuals given in Fig. 7c,d. For the polynomial of order 2, the residuals are clearly nonrandom, showing
a sinusoidal behavior typical of a model that is lacking terms. Increasing the order of the polynomial to
3 produces a random distribution.
10.2. Example 6
Table 6 lists the VLE data for the system chlorobenzene (1) + propionic acid (2) at 313.15 K, as
reported in the DIPPR Standard Database [5]. According to this publication, the data in Table 6 have

Table 6
VLE data and Margules residuals for the system chlorobenzene (1) + propionic acid (2) at 313.15 K, as reported in Ref. [5]

P [Pa]     x1      y1      y1 - y1,calc
1439.88    0.018   0.005    0.05355
1519.87    0.036   0.10     0.01261
1586.54    0.054   0.15     0.01206
1706.53    0.073   0.20     0.01049
1799.85    0.093   0.25     0.00777
1906.51    0.114   0.30     0.00332
2026.50    0.138   0.35     0.00098
2133.16    0.164   0.40    -0.00243
2266.48    0.194   0.45    -0.00378
2399.80    0.230   0.50    -0.00201
2533.12    0.270   0.55    -0.00174
2679.78    0.32    0.60     0.00252
2826.43    0.378   0.65     0.00475
2973.09    0.448   0.70     0.00612
3119.74    0.529   0.75     0.00364
3266.40    0.620   0.80    -0.00342
3386.39    0.722   0.85    -0.01023
3493.05    0.830   0.90    -0.01023
3573.04    0.921   0.95    -0.00544


Fig. 8. Example 6: Plot of the residuals in Table 6.

been declared thermodynamically consistent, according to the criteria mentioned in Examples 3 and
4.
Application of Eq. (17) and utilization of the Durbin–Watson statistic tables [3] for a sample of 19
observations with one degree of freedom and a 0.05 level of significance gives the following results:
d = 0.506, $d_L$ = 1.18 and $d_U$ = 1.40. Since $d < d_L$, the residuals are not randomly distributed, a fact
confirmed by Fig. 8.
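The value of d quoted for Example 6 can be checked directly from the residuals of Table 6. The sketch below assumes that the residuals are used in the order in which they are listed in the table and that no further weighting is applied; the bounds are the ones quoted in the text.

```python
# Residuals (y1 - y1,calc) copied from Table 6, in the order listed.
residuals = [0.05355, 0.01261, 0.01206, 0.01049, 0.00777, 0.00332, 0.00098,
             -0.00243, -0.00378, -0.00201, -0.00174, 0.00252, 0.00475,
             0.00612, 0.00364, -0.00342, -0.01023, -0.01023, -0.00544]

# Eq. (17): squared successive differences over the sum of squared residuals.
numerator = sum((cur - prev) ** 2 for prev, cur in zip(residuals, residuals[1:]))
denominator = sum(e ** 2 for e in residuals)
d = numerator / denominator

d_lower, d_upper = 1.18, 1.40  # bounds quoted in the text for 19 observations

if d < d_lower:
    verdict = "reject the hypothesis of uncorrelated residuals"
elif d > d_upper:
    verdict = "do not reject the hypothesis"
else:
    verdict = "inconclusive"

print(f"d = {d:.3f}: {verdict}")
```

Most of the numerator comes from the jump between the first and second residuals, so the first data point carries considerable weight in this particular value of d.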

11. Test of an additional term


Another important distribution is the $\chi^2$ (chi-square) distribution. This distribution is a
probability function whose values range from zero to positive infinity. Chi-square has many practical
applications, the most important of which are testing differences or relationships. In particular, it can
be used to measure the degree to which the frequencies in an actual distribution do not conform to the
corresponding frequencies in a theoretical distribution. The mathematics of the $\chi^2$ distribution is
complicated and beyond the purpose of this work, so we will limit ourselves to saying that populations
that follow the $\chi^2$ distribution have the interesting property of additivity. This means that if we have
two statistics (for example, residuals) that are distributed as $\chi^2$, then we can form a new $\chi^2$ statistic
by taking the difference of the two. The precise formulation of a $\chi^2$ test for our objective is as follows:
Let $F_1, F_2, \ldots, F_k$ be the sample frequencies of k classes, and let $f_1, f_2, \ldots, f_k$ be the expected
frequencies on the basis of a certain hypothesis (random normal). In our particular case, let us assume
that we first fit our set of data with a fitting function having (m - 1) terms plus a constant, and then
with a fitting function having m terms plus a constant. For each case we calculate the value of
chi-square as follows:

$$ \chi^2 = \sum_{i=1}^{k} \frac{(F_i - f_i)^2}{f_i} \qquad (19) $$

The resulting values of the chi-square statistic, $\chi^2(m-1)$ and $\chi^2(m)$, associated with the residuals
around each regression will have (n - m) and (n - m - 1) degrees of freedom, respectively. Thus, the
difference between these two will also follow the $\chi^2$ distribution, with one degree of freedom.


If we form the ratio of the difference $\chi^2(m-1) - \chi^2(m)$ over the new value $\chi^2(m)$, we can form a
statistic $F_\chi$ that can be shown to follow the F (Fisher) distribution with degrees of freedom $\nu_1 = 1$ and
$\nu_2 = n - m - 1$:

$$ F_\chi = \frac{[\chi^2(m-1) - \chi^2(m)]/\nu_1}{\chi^2(m)/\nu_2} \qquad (20) $$

The important practical point to realize is that the ratio given by Eq. (20) is a measure of how
much the additional term has improved the value of the reduced chi-square:

$$ F_\chi = \frac{\Delta\chi^2}{\chi_\nu^2} \qquad (21) $$

$F_\chi$ should be small when the regression function with m terms does not significantly improve the fit
over the function with (m - 1) terms. In other words, addition of a new term will be beneficial only
when the value of $F_\chi$ is large. To judge what is large and what is small, we use the F tables [3,9], as
shown in the following example.
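The test of Eq. (20) reduces to a few lines of arithmetic. The sketch below is illustrative only: the function names `f_chi` and `term_is_justified` are introduced here, the chi-square values are hypothetical, and the critical value (about 4.28 for 1 and 23 degrees of freedom at the 0.05 level, as read from standard F tables) should be verified against the tables being used.

```python
def f_chi(chi2_fewer, chi2_more, n_points, m_terms):
    """F statistic of Eq. (20) for the addition of one term.

    chi2_fewer: chi-square of the fit with (m - 1) terms plus a constant
    chi2_more:  chi-square of the fit with m terms plus a constant
    """
    nu1 = 1
    nu2 = n_points - m_terms - 1
    if nu2 <= 0:
        raise ValueError("n_points must exceed m_terms + 1")
    return ((chi2_fewer - chi2_more) / nu1) / (chi2_more / nu2)


def term_is_justified(f_value, f_critical):
    """The extra term is kept only if F exceeds the tabulated critical value."""
    return f_value > f_critical


# Hypothetical chi-square values for 26 data points and m = 2 terms.
f_value = f_chi(chi2_fewer=12.0, chi2_more=10.0, n_points=26, m_terms=2)
f_critical = 4.28  # assumed table value for (1, 23) degrees of freedom, 0.05 level

print(round(f_value, 2), term_is_justified(f_value, f_critical))
```

The critical value is passed in rather than computed so that the sketch needs nothing beyond the standard library and stays close to the table-lookup procedure described in the text.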


Table 7
Vapor–liquid equilibrium data and residuals for the system hexane (1) + ETBE (2) at 94 kPa [12]

x1      y1      T [K]    Pcalc,A   Pcalc,A+B   e_i,A    e_i,A+B
0.977   0.978   339.47   93.88     93.85       -0.12    -0.12
0.967   0.968   339.47   93.87     93.84       -0.13    -0.16
0.950   0.952   339.47   93.87     93.82       -0.13    -0.18
0.929   0.932   339.48   93.88     93.81       -0.12    -0.19
0.892   0.899   339.51   93.90     93.82       -0.10    -0.18
0.866   0.872   339.55   93.96     93.86       -0.04    -0.14
0.826   0.835   339.59   93.96     93.85       -0.04    -0.15
0.804   0.814   339.63   93.99     93.88       -0.01    -0.12
0.759   0.773   339.71   94.03     93.91        0.03    -0.09
0.725   0.742   339.79   94.09     93.97        0.09    -0.03
0.707   0.726   339.83   94.10     93.99        0.10    -0.01
0.674   0.695   339.90   94.09     93.99        0.09    -0.01
0.624   0.649   340.05   94.17     94.08        0.17     0.08
0.607   0.633   340.10   94.18     94.10        0.18     0.10
0.580   0.609   340.17   94.15     94.08        0.15     0.08
0.531   0.564   340.33   94.16     94.12        0.16     0.12
0.499   0.533   340.45   94.18     94.16        0.18     0.16
0.459   0.495   340.60   94.18     94.17        0.18     0.17
0.405   0.444   340.82   94.16     94.18        0.16     0.18
0.378   0.417   340.92   94.09     94.13        0.09     0.13
0.319   0.358   341.19   94.03     94.09        0.03     0.09
0.268   0.306   341.45   93.98     94.06       -0.02     0.06
0.218   0.253   341.73   93.92     94.02       -0.08     0.02
0.165   0.195   342.07   93.91     94.01       -0.09     0.01
0.124   0.149   342.30   93.73     93.82       -0.27    -0.18
0.064   0.079   342.74   93.66     93.73       -0.34    -0.27


11.1. Example 7
The data in Table 7 correspond to the VLE in the system hexane (1) + ETBE (2) at 94 kPa [12] and
have been correlated using the Redlich–Kister expansion, restricted to the first term (regular solution):

$$ \frac{G^E}{RT x_1 x_2} = A = \frac{\ln\gamma_1}{x_2} + \frac{\ln\gamma_2}{x_1} \qquad (22) $$

with A = 0.13103. We want to investigate if addition of a second term improves the fit:

$$ \frac{G^E}{RT x_1 x_2} = A + B(x_1 - x_2) = \frac{\ln\gamma_1}{x_2} + \frac{\ln\gamma_2}{x_1} \qquad (23) $$

A linear regression procedure yields A = 0.130478 and B = -0.012078. The pertinent form of the
equation is used to calculate the predicted value of the vapor composition. The pertinent values of
$P_{calc}$ and of the residuals are given in Table 7 for both equations. Applying Eq. (19) to both sets of
residuals yields $\chi^2(A) = 0.00548$ and $\chi^2(A,B) = 3.15 \times 10^{-8}$. From Eq. (20), we obtain
$F_\chi = 2.20 \times 10^{-4}$. From the F tables [9], we have F > 5; since $F_\chi$ is well below this value, the
second term does not improve the fit significantly and, hence, the system can be described as a regular
solution.

12. Conclusions
Examination of the goodness of fit of a regression using an overall statistic like the coefficient of
determination R 2 may overlook important details of the experimental results. The overall test must be
accompanied by a thorough analysis of the behavior of the residuals. Statistical tools like normal
probability plots, half-normal probability plots, detrended probability plots and rankit plots, may be
used for this purpose. Most of these techniques will also help identify outliers, that is, results that
present unusual behavior. In general, it is recommended to use at least two different methods to
establish the randomness or nonrandomness of the residuals.

"Although only a few may originate a policy, we are all able to judge it." (Pericles of Athens)

References
[1] C. Daniel, F.S. Wood, Fitting Equations to Data, Wiley-Interscience, New York, 1971.
[2] T.F. Edgar, D.M. Himmelblau, Optimization of Chemical Processes, McGraw-Hill, New York, 1988.
[3] J. Neter, W. Wasserman, M.H. Kutner, Applied Linear Statistical Models, 3rd edn., Irwin, Burr Ridge, IL, 1990.
[4] J. Wisniak, Fluid Phase Equilibria 89 (1993) 291–302.
[5] M.A. Gess, R.P. Danner, M. Nagvekar, Thermodynamic Analysis of Vapor–Liquid Equilibria: Recommended Models
and a Standard Data Base, Design Institute for Physical Property Data, American Institute of Chemical Engineers, New
York, 1991.
[6] R. Reich, M. Cartes, H. Segura, J. Wisniak, Fluid Phase Equilibria (1998) (in press).
[7] H. Artigas, C. Lafuente, M.C. Lopez, F.M. Royo, J.S. Urieta, Fluid Phase Equilibria 134 (1997) 163.
[8] A. Fredenslund, J. Gmehling, P. Rasmussen, Vapor–Liquid Equilibria Using UNIFAC, Elsevier, Amsterdam, 1977.
[9] F.J. Rohlf, R.R. Sokal, J. Rolf, Statistical Tables, 3rd edn., W.H. Freeman, New York, 1994.
[10] J. Durbin, G.S. Watson, Biometrika 37 (1950) 409.
[11] J. Durbin, G.S. Watson, Biometrika 38 (1951) 159.
[12] R. Reich, M. Cartes, H. Segura, J. Wisniak, Physics and Chemistry of Liquids (1998) (in press).
