
Mathl Comput. Modelling, Vol. 11, pp. 1073-1076, 1988
Proc. 6th Int. Conf. on Mathematical Modelling
Printed in Great Britain
0895-7177/88 $3.00+0.00
Pergamon Press plc
PREDICTING THM CONCENTRATION IN TREATED WATER WITH HIGHLY CORRELATED DATA
Paul J. Ossenbruggen, Department of Civil Engineering
Marie Gaudard, Department of Mathematics
M. Robin Collins, Department of Civil Engineering
University of New Hampshire, Durham, NH 03824
Abstract. Chlorine is an effective disinfectant in killing bacteria and
viruses; however, trihalomethanes (THM) and several chlorination by-products of
the drinking water treatment process are suspected carcinogens. In order to
develop treatment strategies to minimize THM production, mathematical models
containing appropriate raw water quality and control predictor variables are
needed. Some of these predictor variables are known to be highly correlated;
therefore, when collectively introduced into a regression model, questions
about the validity of the model arise. Condition indexes and variance
decompositions, as well as ridge regression, are used to identify the degree
and causes of ill-conditioning. Path analysis is used to support the inclusion
of predictor variables in the final model.
Keywords. Linear regression, trihalomethanes, control, water quality.
INTRODUCTION
Drinking water disinfection with chlorine is
probably the single most effective factor in
controlling typhoid fever and other waterborne
diseases. However, trihalomethane (THM) and
other by-products of the chlorination process
are suspected carcinogens. Plant operators
must judiciously use chlorine to maximize
bacteria and virus kill while, at the same
time, minimizing THM production. The purpose
of this paper is to describe the steps used in
developing a mathematical model that can be
used to evaluate treatment strategies under
different operating conditions.
The THM production process is a very complex
chemical process that is not entirely
understood. There are no mechanistic models
that explain the THM production process;
empirical studies based on correlation
analysis and simple regression methods have
been widely used to study the problem. As a
result of some of these efforts (Kavanaugh,
1980), we have learned that the concentration
of finished water THM, a mixture of CHCl3
(chloroform), CHCl2Br, CHClBr2, and CHBr3, is
a function of:
• type of raw water organic precursor
• pH
• temperature
• presence of selected inorganic species
• method of chlorination
• detention time
Raw water contains a mixture of dissolved and
suspended organic matter that generally gives
water its yellowish-to-brownish color. This
organic matter is generally formed by the
decomposition of plant and animal material.
Fulvic acid, humic acid, and other products
with complex chemical structure have been
isolated from raw water and have been shown to
have very different chemical properties. The
pH, the temperature, and the presence of
naturally occurring inorganic species, such as
bromide and calcium, have also been shown to
affect THM production. Consequently, THM production is
highly dependent upon the quality and source
of the raw water.
THM production is also dependent upon the
effectiveness of the treatment process in
removing organic matter and the location of
chlorine addition during treatment.
Prechlorination refers to the practice of
adding chlorine to raw water prior to the
removal of organic precursors. The
chlorinated raw water is subject to further
treatment to remove unwanted matter.
Postchlorination refers to the practice of
adding chlorine as the final treatment step.
For proper plant operation and maintenance,
both pre- and post-chlorination are used in
most plants. It has been found, however, that
reducing the prechlorination dose reduces THM
concentration in the finished water.
While correlation analysis and simple
regression models have been successful in
identifying the factors that are related to
THM production, a mathematical model that
contains the collective effects of these
factors is imperative for establishing better
treatment control strategies. An effective
control model must contain variables
reflecting the quality of the raw water and
the effect of chlorine dose.
EFFECTS OF ILL-CONDITIONING
Consider a multiple regression model, y = Xb +
e, where the columns of the X matrix are
highly dependent (or, equivalently, where
predictor variables are highly correlated).
In such a case, ordinary least squares
procedures can result in a final model which
is inappropriate for control. In our study of
data collected periodically (approximately
bimonthly) from July 1980 to March 1982 at a
water treatment plant in Canton, New York
(Edzwald, 1983), six of seven predictor
variables were eliminated from a multiple
regression model using the SAS backward
elimination procedure. A significance level
of 0.100 was used. The seven predictors were
raw water UV Absorbance, total organic carbon
(TOC), trihalomethane formation potential
(THMFP), temperature, turbidity and pre- and
post-chlorination dose. The final model
contained raw water temperature as its only
predictor variable. Clearly, this model is of
very limited use for control because variables
of raw water quality and chlorine dose are
absent.
In the Canton data set, UV Absorbance, TOC and
THMFP, all surrogate measures of organic
precursor concentration, are highly
correlated. All pairwise correlation
coefficients (r) are greater than 0.90. Raw
water temperature is also highly correlated
with these variables with all r > 0.8. These
high correlations are expected to cause the X
data matrix to be ill-conditioned. Models fit
by ordinary least squares when the X matrix is
ill-conditioned are often highly unstable; the
sample variances of regression parameters are
often larger than expected, resulting in
confidence intervals which are too wide and t-
statistics which are too low. For the Canton
data set, ill-conditioning may explain the
elimination of all surrogate measures of raw
water quality and chlorination dose variables
by the SAS backward elimination procedure.
For models consisting of 3 or more predictor
variables, an examination of pairwise
correlation coefficients may not always lead
to the identification of variables causing
ill-conditioning. It is possible that 3 or
more predictor variables taken collectively
may be highly dependent, whereas their
pairwise correlations may be small. Therefore,
a means of evaluating the degree and causes of
ill-conditioning, which is more effective than
the use of correlation coefficients alone, is
needed.
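The point that a collective dependency can hide behind modest pairwise correlations can be illustrated with a minimal NumPy sketch. The data below are synthetic (not the Canton data), and the variable names are chosen only for illustration: one predictor is nearly the sum of ten roughly independent ones, so no pairwise correlation is large, yet the correlation matrix is badly conditioned.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 200, 10

# m roughly independent synthetic predictors
parts = rng.normal(size=(n, m))
# one more predictor that is almost exactly the sum of the others
total = parts.sum(axis=1) + 0.01 * rng.normal(size=n)
X = np.column_stack([parts, total])

# Pairwise correlations among all m + 1 predictors
R = np.corrcoef(X, rowvar=False)
max_pairwise = np.abs(R - np.eye(m + 1)).max()

# Condition number of the centered, standardized predictor matrix:
# square root of the ratio of extreme eigenvalues of R
eig = np.linalg.eigvalsh(R)
condition_number = np.sqrt(eig.max() / eig.min())

print(f"largest |pairwise r| = {max_pairwise:.2f}")
print(f"condition number     = {condition_number:.0f}")
```

In this construction each pairwise correlation with the summed predictor is near 1/sqrt(10), while the condition number is far above the 30 to 100 range discussed later in the paper.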
MODEL DEVELOPMENT APPROACH
Our approach consists of the use of the following
statistical tools:
• Condition index and variance-decomposition proportions
• Ridge regression
• Ordinary least squares
• Path analysis
Condition indexes, variance-decomposition
proportions (Belsley, Kuh, and Welsch, 1980)
and ridge regression (Hoerl, 1962; Hoerl and
Kennard, 1970) are used to determine the degree of
ill-conditioning and to identify the variables
that cause ill-conditioning. After variables
are removed from the ill-conditioned model,
ordinary least squares is used to estimate the
parameters (or b's) of the reformulated model.
Path analysis (Li, 1975) is used to validate
the reformulated model by examining the
contribution that each variable has toward
prediction.
ILL-CONDITIONING DIAGNOSTICS
Condition indexes and variance-decomposition
proportions are derived from the
singular-value decomposition
$$X = U D V^T \qquad (1)$$
where the diagonal elements of D, $\mu_j$, are the
singular values of X. The jth condition index
is defined to be $\mu_{max}/\mu_j$, where
$\mu_{max}$ is the largest singular value. It can be
shown empirically that large condition
indexes, say from 30 to 100, are associated
with moderate to strong dependencies. The
notion of variance-decomposition proportion is
based on the relationship
$$Var(\hat{b}) = \sigma^2 (X^T X)^{-1} = \sigma^2 V D^{-2} V^T \qquad (2)$$
where $\hat{b}$ is the least squares estimate of b,
and $\sigma^2$ is the variance of e. For each
condition index, a variance-decomposition
proportion is associated with each predictor,
and reflects the proportion of its variance
which is associated with that condition index.
The Canton data exhibit two moderate to
strong dependencies because two condition
indexes exceed 30. Thus, since two dependencies
are involved, it is suspected that at least two
predictor variables will need to be eliminated
from the model in order to stabilize it. Analysis of
the variance-decomposition proportions and
auxiliary least squares models shows that UV
Absorbance, TOC and temperature are involved
in one near dependency, and THMFP and TOC are
involved in the other near dependency.
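The diagnostics described above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not the computation performed on the Canton data; the function name `collinearity_diagnostics` is hypothetical, and the unit-length column scaling is an assumption taken from the usual presentation of these diagnostics.

```python
import numpy as np

def collinearity_diagnostics(X):
    """Return condition indexes and variance-decomposition proportions.

    Columns of X are scaled to unit length (not centered); an intercept
    column, if wanted, must already be part of X.
    """
    X = np.asarray(X, dtype=float)
    Xs = X / np.linalg.norm(X, axis=0)           # unit column length
    _, mu, Vt = np.linalg.svd(Xs, full_matrices=False)
    cond_index = mu.max() / mu                   # mu_max / mu_j
    # phi[k, j] = v_kj^2 / mu_j^2: part of Var(b_k) tied to singular value j
    phi = (Vt.T ** 2) / mu ** 2
    props = phi / phi.sum(axis=1, keepdims=True)
    # Transpose so rows match condition indexes, columns match coefficients
    return cond_index, props.T

# Synthetic example: x3 is nearly x1 + x2, so one near dependency exists.
rng = np.random.default_rng(0)
n = 40
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + 0.01 * rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2, x3])

ci, vdp = collinearity_diagnostics(X)
print(np.round(ci, 1))
print(np.round(vdp, 3))
```

Each column of `vdp` sums to one. A row with a large condition index and high proportions for two or more coefficients points to the variables involved in that near dependency.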
Ridge regression, a biased least squares
estimation method, is used as a second means
of identifying the causes of ill-conditioning.
The ridge estimators are defined by
$$\hat{\beta}(k) = (Z^T Z + kI)^{-1} Z^T u \qquad (3)$$
where Z and u are obtained by standardizing X
and y, respectively. The ridge trace for the
Canton data set, Figure 1, shows that
temperature, UV Absorbance, and TOC are the
most unstable predictors.
[FIG. 1 Ridge trace of initial model. Horizontal axis: k, ridge coefficient.]
Additionally, as k approaches one, the
$\hat{\beta}(k)$ estimates for THMFP, UV
Absorbance and turbidity approach zero,
indicating little predicting power. Since the
findings from these two different diagnostic
techniques indicate that THMFP and UV
Absorbance are causing ill-conditioning and
providing little predicting power, they are
removed from the model. Turbidity, while not
considered to be a direct cause of ill-
conditioning, is also removed because it has
little predicting power. The ridge trace of
the reformulated model, Figure 2, exhibits
strong stability.
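A ridge trace of the kind discussed above can be computed as follows. This is a sketch on synthetic data, not the Canton data; the helper names `standardize` and `ridge_trace` are hypothetical, and standardization is assumed to mean centering each column and scaling it to unit length.

```python
import numpy as np

def standardize(A):
    """Center each column and scale it to unit length (correlation form)."""
    A = np.asarray(A, dtype=float)
    A = A - A.mean(axis=0)
    return A / np.linalg.norm(A, axis=0)

def ridge_trace(X, y, ks):
    """Ridge estimates (Z'Z + kI)^-1 Z'u for each k in ks, one row per k."""
    Z = standardize(X)
    u = standardize(np.asarray(y, dtype=float).reshape(-1, 1)).ravel()
    p = Z.shape[1]
    ZtZ, Ztu = Z.T @ Z, Z.T @ u
    return np.array([np.linalg.solve(ZtZ + k * np.eye(p), Ztu) for k in ks])

# Synthetic data: x2 nearly duplicates x1, x3 is unrelated to both.
rng = np.random.default_rng(1)
n = 30
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)
y = 2.0 * x1 + 1.0 * x3 + rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

ks = np.linspace(0.0, 1.0, 11)
trace = ridge_trace(X, y, ks)
for k, coefs in zip(ks, trace):
    print(f"k = {k:.1f}  " + "  ".join(f"{c:8.3f}" for c in coefs))
```

Coefficients that change sharply as k leaves zero are the unstable ones. The row for k = 0 is the ordinary least squares fit in standardized form.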
Ordinary least squares regression gives the
final model:
THM = -2.95 + 3.96 (TEMP) + 19.7 (PreCl2) +
12.9 (PostCl2) - 5.65 (TOC) (4)
[FIG. 2 Ridge trace of reformulated model. Horizontal axis: k, ridge coefficient.]
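As a small illustration of how Eq. (4) might be used to compare operating conditions, the sketch below evaluates it at two points. The function name `predict_thm` and the input values are hypothetical and chosen only to show the arithmetic; the text does not state the units of the variables, so none are assumed here.

```python
def predict_thm(temp, pre_cl2, post_cl2, toc):
    """Evaluate Eq. (4); units follow the original data and are not stated."""
    return -2.95 + 3.96 * temp + 19.7 * pre_cl2 + 12.9 * post_cl2 - 5.65 * toc

# Hypothetical operating point (illustrative values, not measured data)
base = predict_thm(temp=15.0, pre_cl2=2.0, post_cl2=1.0, toc=5.0)

# Same conditions with the prechlorination dose lowered by 0.5 units
reduced = predict_thm(temp=15.0, pre_cl2=1.5, post_cl2=1.0, toc=5.0)

# The change equals the PreCl2 coefficient times the dose change
change = reduced - base
print(base, reduced, change)
```

Because the model is linear, the predicted change depends only on the coefficient and the size of the dose change, and it should be read only within the range of conditions covered by the data.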
PATH ANALYSIS
Since the effects of ill-conditioning are
assumed to be eliminated from the reformulated
model, path analysis may be used to study the
predicting power of each variable of the
model. It is also used to validate the model
as a reasonable model for control.
Using path analysis, each predictor's
contribution to the coefficient of
determination will be divided into two
components, a direct effect and a joint
association. The direct effects $p_j$ and ridge
regression estimates for k=0 are the same
($p_j = \hat{\beta}_j(k=0)$) and are simply the least
squares estimates of a multiple regression
equation when the standardized forms of X and
y are used. The direct effect, $p_j$, of each
predictor measures its direct contribution to
predicting Y. The total association of that
predictor with Y is measured by its
correlation with Y, $r_j$. Then the joint
association is simply the difference, $r_j - p_j$.
See Tables 1 and 2.
Since the coefficient of determination is
$$R^2 = \sum_j R_j^2 = \sum_j r_j p_j, \qquad (5)$$
the contribution to determination of each
predictor can be determined as shown in Table 2.
In our final model for THM prediction,
predictors have respectable direct effects
and, with the exception of post-chlorination
dose, have significant correlations with THM
and contribute to $R^2$. Interestingly, among
the significantly correlated predictors, raw
water TOC has the weakest correlation with
treated water THMs (see Table 2).

TABLE 1 Path Analysis Results

Predictor        Direct Effect    Joint Association
                 p_j              r_j - p_j
Temperature       0.885           -0.126
Pre-Cl2 dose      0.486            0.100
Post-Cl2 dose     0.313           -0.338
TOC              -0.504            0.993
TABLE 2 Path Analysis Results

Predictor        Correlation      Contribution to
                 Coefficient      Determination
                 r_j              R_j^2 = r_j p_j
Temperature       0.759            0.672
Pre-Cl2 dose      0.586            0.285
Post-Cl2 dose    -0.025           -0.008
TOC               0.489           -0.246

R^2 = sum_j r_j p_j = 0.705
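The arithmetic behind Tables 1 and 2 can be checked with a short NumPy sketch. The direct effects and correlations are copied from the tables; the array names are chosen here for illustration.

```python
import numpy as np

predictors = ["Temperature", "Pre-Cl2 dose", "Post-Cl2 dose", "TOC"]
p = np.array([0.885, 0.486, 0.313, -0.504])   # direct effects, Table 1
r = np.array([0.759, 0.586, -0.025, 0.489])   # correlations with THM, Table 2

joint = r - p            # joint association, r_j - p_j
contrib = r * p          # contribution to determination, r_j * p_j
r_squared = contrib.sum()

for name, j, c in zip(predictors, joint, contrib):
    print(f"{name:14s} joint = {j:7.3f}  contribution = {c:7.3f}")
print(f"sum of contributions = {r_squared:.3f}")
```

The joint associations and contributions reproduce the tabulated entries to three decimals. The sum computed from the rounded table values comes out slightly below the reported 0.705, which is presumably a rounding effect.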
TABLE 3 Correlation Coefficients Between Predictors*

                 Temperature      Pre-Cl2 dose
Pre-Cl2 dose      0.602
Post-Cl2 dose     --              -0.570
TOC               0.823            0.519

*Only correlations between predictors that are
significant at 0.10 are shown.
The variables TOC and post-chlorination dose
were retained in the model because of their
joint associations with the other variables.
The correlation coefficients between the
dependent variable, THM, and TOC, UV
absorbance and THMFP, are respectively 0.480,
0.557, and 0.550. This relationship
contributes to the predicting power of the
model. Note that even though ill-conditioning
has been eliminated, there still remain
significant correlations among certain
predictor variables (see Table 3).
CONCLUSIONS
The two approaches used to diagnose ill-
conditioning and stabilize our model, namely
the diagnostics of Belsley, Kuh, and Welsch and
ridge regression, give different perspectives
on the problem and seem to complement one
another. Path analysis provides a "check" on
the model suggested by these two procedures.
Note that the joint association among
predictor variables can result in a common
causal effect; therefore, it plays an
important part in explaining the retention of
a predictor variable in the final model. High
correlation between a predictor variable and a
dependent variable does not guarantee the
retention of a predictor variable in using our
approach. This was illustrated by the
retention of TOC while excluding UV Absorbance
and THMFP from the final model. Likewise, a
low correlation between a predictor variable
and dependent variable does not guarantee its
exclusion as exhibited by our retention of
post-chlorination dose in the model.
Our final model contains a measure of raw
water quality (TOC) and the effects of pre-
and post-chlorination dose on THM production.
It also shows the importance of temperature;
therefore, it is a satisfactory model for
control. In three other data sets which were
examined (not reported here), this approach
led to models that are consistent with
theories of THM formation and that contain
appropriate predictor variables for control.
REFERENCES
Kavanaugh, M.C., Trussell, A.R., Cramer, J.
and Trussell, R.R. (1980) An Empirical
Kinetic Model of Trihalomethane
Formation: Applications to Meet the
Proposed Standard, Journal AWWA, 72,
578-582.
Edzwald, J.K. (1983) Removal of
Trihalomethane Precursors by Direct
Filtration and Conventional Treatment,
Department of Civil Engineering and
Environmental Engineering, Clarkson
College, Potsdam, NY 13676, 224 pp.
Belsley, D.A., Kuh, E., and Welsch, R.E. (1980)
Regression Diagnostics: Identifying
Influential Data and Sources of
Collinearity, John Wiley and Sons.
Hoerl, A.E. and Kennard, R.W. (1970) Ridge
Regression: Applications to
Nonorthogonal Problems, Technometrics,
12, 69-82.
Hoerl, A.E. and Kennard, R.W. (1970) Ridge
Regression: Biased Estimation for
Nonorthogonal Problems, Technometrics,
12, 55-67.
Hoerl, A.E. (1962) Applications of Ridge
Analysis to Regression Problems, Chemical
Engineering Progress, 58, 54-59.
Li, C.C. (1975) Path Analysis - A Primer, Boxwood
Press.
The research on which this paper is based was
financed in part by the U.S. Department of the
Interior as authorized by the Water Research
and Development Act of 1978 (P.L. 95-467).
