Executive Summary
The main purpose of this report is to learn how to use different methods to estimate
the future population of a given service area, and to analyze the accuracy and precision of
each method to determine which one represents the given data most accurately. The
methods employed to predict the population are linear, quadratic, exponential and
logistic growth regressions.
Based on several parameters (the sum of the mean squared errors and both the F and t
tests), we concluded that the best model to represent this population is the logistic model
(d=0), which predicts a population of 42,116 people by the end of 2034.
We are told that the per capita daily consumption increases constantly by 2% annually,
hence we developed an equation to predict the per capita water consumption by the end of 2034:
126.11 gpcd. This gives us a design parameter of 3.54 MGD that the water plant must produce to
keep up with the future demand and avoid an expensive expansion, the construction of another
facility, or the purchase of water from other water plants.
Theory
Water demand forecasting is a tool used by utility managers to predict the amount of
water that the area served by a particular water plant will need. This water demand must
account for domestic, industrial, commercial and public use, as well as leakage and wastage.
In this assignment we consider only the domestic water demand, that is, the demand of the
housing area.
To determine the amount of water needed we need to use several databases
collected from several sources. One of those sources is the Department of Transportation
(DOT). The DOT has divided areas into traffic analysis zones (TAZ); a TAZ is a group of
census blocks that has at least one major road going through it or touching the zone
boundary. Data on population or housing and on employment are collected per zone for the
traffic demand model.
Housing and population data are collected per TAZ, using the most recent data available
from the DOT or other sources such as the regional building permit data and population
estimates from the Municipalities Offices of Budget.
Several methods are employed to forecast water demand, but we will do a per capita
water demand forecast only. A per capita water demand forecast assumes each individual
uses the same average amount of water annually (q), and to that average annual amount of
water consumed we must apply an adjustment for future water demand; in our case we simply
assume there is a 2% increase in annual water demand. That average amount of water
consumed per capita is then multiplied by the total population of the area to be serviced (N)
to give us the total system water demand (Q):
$$Q(t) = q(t) \cdot N(t)$$
where t represents the calendar year
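The per capita relation can be checked against the 2014 row of Table 1 with a short Python sketch. The function names are illustrative only, and compounding the 2% annual increase from the 2014 per capita value over 20 years is an assumption about how the projection was made; it does reproduce the 126.11 gpcd quoted in the Executive Summary.

```python
def per_capita_demand(q_base, base_year, year, growth=0.02):
    """Per capita demand (gpcd), assuming compound annual growth from a base year."""
    return q_base * (1.0 + growth) ** (year - base_year)


def system_demand_mgd(q_gpcd, population):
    """Total system demand Q = q * N, converted from gal/day to MGD."""
    return q_gpcd * population / 1.0e6


# 2014 row of Table 1: 19,678 people at 84.87 gpcd
q_2014 = 84.87
Q_2014 = system_demand_mgd(q_2014, 19678)

# Assumed projection: 2% compounded annually from 2014 to 2034
q_2034 = per_capita_demand(q_2014, 2014, 2034)

print(f"Q(2014) = {Q_2014:.2f} MGD")
print(f"q(2034) = {q_2034:.2f} gpcd")
```

With these inputs the sketch prints 1.67 MGD for 2014, which matches the tabulated flow, and 126.11 gpcd for 2034.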
In order to calculate the population in the year 2034 we will use several modeling
methods: simple linear regression, quadratic regression, and exponential and logistic growth
regressions.
In a simple linear regression model we assume that population, the dependent variable,
depends on a single parameter: the year, which is the independent variable (Soon 2004
p. 335). This model is represented by the following equation:
$$P(t) = a + m \cdot t$$
In the above equation P represents the population at year t, and a and m are constants
that the statistical program, Excel, determines from the data collected over a period of
time.
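As a cross-check on what the spreadsheet does, the constants a and m can be estimated directly from the Table 1 data. The following is a minimal sketch in plain Python; `fit_line` is a hypothetical helper name, not part of the original analysis.

```python
# Table 1 data: calendar year and population
years = list(range(1984, 2015, 2))
population = [4525, 4902, 5457, 5564, 5806, 6312, 7300, 8307,
              10621, 11543, 12546, 13651, 14563, 15308, 18302, 19678]


def fit_line(t, p):
    """Least-squares estimates of a (intercept) and m (slope) in P(t) = a + m*t."""
    n = len(t)
    t_mean = sum(t) / n
    p_mean = sum(p) / n
    s_tp = sum((ti - t_mean) * (pi - p_mean) for ti, pi in zip(t, p))
    s_tt = sum((ti - t_mean) ** 2 for ti in t)
    m = s_tp / s_tt
    a = p_mean - m * t_mean  # the fitted line passes through the mean point
    return a, m


a, m = fit_line(years, population)
print(f"P(t) = {a:.0f} + {m:.2f} * t")
print(f"P(2034) = {a + m * 2034:.0f}")
```

The slope comes out at about 508.81 people per year, in agreement with the Excel linear trendline reported in the Analysis section.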
Another arithmetic linear regression that can be applied is explained in Reynolds and
Richards' book, in which we normalize the values by taking the natural log of the dependent
variable. In cases where the population vs. time graph follows an S-curve, this method
tends to give a function that represents the data better. It yields a function with the
following general equation:
$$\ln P(t) = \ln P_0 + a \cdot t$$
Quadratic modeling is still a linear regression model in the sense that the model is
linear in the unknown constants: none of the constants to be determined is the base of an
exponent containing a variable, as in a^x, where a is constant and x varies. This model is
represented by the following equation:
$$P(t) = a + b \cdot t + c \cdot t^2$$
In the above equation P and t represent population and time, respectively and a, b and
c represent constants to be determined by the statistical program.
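Because the quadratic model is linear in a, b and c, the constants follow from the least-squares normal equations. The sketch below is one way to solve them in plain Python; the helper names are hypothetical, and centering the years on their mean is a choice made here for numerical conditioning rather than something the report prescribes.

```python
years = list(range(1984, 2015, 2))
population = [4525, 4902, 5457, 5564, 5806, 6312, 7300, 8307,
              10621, 11543, 12546, 13651, 14563, 15308, 18302, 19678]


def solve3(A, y):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with partial pivoting."""
    M = [row[:] + [yi] for row, yi in zip(A, y)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(3):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * xc for x, xc in zip(M[r], M[col])]
    return [M[i][3] / M[i][i] for i in range(3)]


def fit_quadratic(t, p):
    """Least-squares a, b, c in P = a + b*u + c*u**2, with u = t - mean(t)."""
    t_mean = sum(t) / len(t)
    u = [ti - t_mean for ti in t]
    # Normal equations: power sums of u on the left, moments of p on the right
    S = [sum(ui ** k for ui in u) for k in range(5)]
    A = [[S[i + j] for j in range(3)] for i in range(3)]
    y = [sum(pi * ui ** k for pi, ui in zip(p, u)) for k in range(3)]
    a, b, c = solve3(A, y)
    return t_mean, a, b, c


t_mean, a, b, c = fit_quadratic(years, population)

# Expand back to calendar-year form P(t) = a_cal + b_cal*t + c*t**2
b_cal = b - 2.0 * c * t_mean
a_cal = a - b * t_mean + c * t_mean ** 2
print(f"P(t) = {c:.2f} t^2 + {b_cal:.0f} t + {a_cal:.3e}")
```

Expanded to calendar years, the coefficients should agree with the Excel quadratic trendline shown in the Analysis section (a t² coefficient of about 12.59).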
Exponential growth represents a change in the population relative to the initial
population, which increases at a consistent rate over time: $P(t) = P_0 \cdot e^{a \cdot t}$,
where $P_0$ represents the initial population and a is a constant to be determined. If we
compare the exponential growth equation and the arithmetic linear equation normalized by
the natural log, we see that both
equations are the same. Reynolds and Richards simply demonstrate how the equation is
derived and how it relates to a linear fit.
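The equivalence between the exponential model and a straight-line fit of ln P against t can be sketched in Python as follows. This is an illustration under the assumption that the constants are estimated by ordinary least squares on the log-transformed population; `fit_exponential` and `predict` are hypothetical names.

```python
import math

years = list(range(1984, 2015, 2))
population = [4525, 4902, 5457, 5564, 5806, 6312, 7300, 8307,
              10621, 11543, 12546, 13651, 14563, 15308, 18302, 19678]


def fit_exponential(t, p):
    """Fit ln P = ln P0 + a*t by least squares; return (ln_P0, a)."""
    n = len(t)
    y = [math.log(pi) for pi in p]
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    a = (sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
         / sum((ti - t_mean) ** 2 for ti in t))
    ln_p0 = y_mean - a * t_mean
    return ln_p0, a


ln_p0, a = fit_exponential(years, population)


def predict(t):
    """Evaluate P(t) = P0 * exp(a*t) in log space, since P0 is on the order of 1e-41."""
    return math.exp(ln_p0 + a * t)


print(f"a = {a:.4f}, P0 = {math.exp(ln_p0):.2e}")
print(f"P(2034) = {predict(2034):.0f}")
```

Because t is the calendar year, P_0 here is the value extrapolated back to year zero, which is why it is so small; the growth constant a is the physically meaningful quantity.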
Logistic growth represents a more realistic situation over a long period of time, since
every area has a limit to how many people it can support. This model represents rapid
growth followed by steady growth, after which the growth rate decreases gradually until the
population reaches its limiting capacity, the maximum number of people the current system
can support.
$$P(t) = \frac{c}{1 + a \cdot e^{-b \cdot t}} + d$$
Due to software constraints, the logistic regression will be developed with the TI-Nspire
CAS software instead, which is capable of doing logistic regression with both d=0 and
d≠0. I will do both logistic regressions to show that, for this set of data, since the
population within the time range analyzed is never zero, we should use the d≠0 logistic
regression. Other programs, such as Minitab, are also capable of this type of regression;
Excel is not.
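To illustrate the shape of the curve and the role of d, the sketch below evaluates a logistic function in the form the TI-Nspire regressions use, y = c / (1 + a·e^(−b·x)) + d. The parameter values are hypothetical and chosen only for illustration; they are not the fitted values from this report, and x is assumed to be measured in years since 1984.

```python
import math


def logistic(x, a, b, c, d=0.0):
    """Logistic curve y = c / (1 + a*exp(-b*x)) + d."""
    return c / (1.0 + a * math.exp(-b * x)) + d


# Hypothetical parameters (not fitted), x in years since 1984
a, b, c = 10.0, 0.1, 50000.0

start = logistic(0, a, b, c)                  # equals c / (1 + a)
late = logistic(200, a, b, c)                 # approaches the upper limit c
shifted = logistic(200, a, b, c, d=2000.0)    # approaches c + d

print(f"start = {start:.1f}, late = {late:.1f}, shifted = {shifted:.1f}")
```

With d=0 the curve rises from near zero toward the limit c; a nonzero d shifts both the lower and the upper asymptote by d.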
In order to evaluate which of the models above represents the data most accurately and
precisely, we will compare several parameters: standard error (s), mean square error (MSE),
hypothesis testing with t and F statistics based on a selected probability (α), correlation
coefficient (r), coefficient of determination (r²), and confidence and prediction intervals.
We will select the model with the least error and the highest correlation.
The method of least squares determines the linear fit by minimizing the error over all the
data points in the model $P_i = a + m \cdot t_i + e_i$, where the error $e_i$ is simply the
difference between the data and the value given by the regressed equation:
$e_i = P_i - P(t_i)$. The sum of the squared errors (SSE) is minimized by this method:
$$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left[ P_i - P(t_i) \right]^2$$
The mean square error derived from the SSE estimates the variance $\sigma^2$ (the square of
the standard deviation $\sigma$), and n is the number of samples or data points.
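The sum of squared errors and the coefficient of determination can be computed directly once a model has been fitted. The sketch below does this for the simple linear model on the Table 1 data; the helper names are hypothetical, and r² is computed here as 1 − SSE/SST, which is the usual definition and is assumed to match what Excel reports.

```python
years = list(range(1984, 2015, 2))
population = [4525, 4902, 5457, 5564, 5806, 6312, 7300, 8307,
              10621, 11543, 12546, 13651, 14563, 15308, 18302, 19678]


def fit_line(t, p):
    """Least-squares intercept a and slope m for P(t) = a + m*t."""
    n = len(t)
    t_mean, p_mean = sum(t) / n, sum(p) / n
    m = (sum((ti - t_mean) * (pi - p_mean) for ti, pi in zip(t, p))
         / sum((ti - t_mean) ** 2 for ti in t))
    return p_mean - m * t_mean, m


def sum_squared_errors(observed, predicted):
    """SSE: sum of squared differences between the data and the regressed values."""
    return sum((o - f) ** 2 for o, f in zip(observed, predicted))


a, m = fit_line(years, population)
fitted = [a + m * t for t in years]

sse = sum_squared_errors(population, fitted)
mean_pop = sum(population) / len(population)
sst = sum_squared_errors(population, [mean_pop] * len(population))  # total sum of squares
r_squared = 1.0 - sse / sst

print(f"SSE = {sse:.3e}")
print(f"r^2 = {r_squared:.5f}")
```

For the linear model this gives an r² of about 0.950, consistent with the value on the Excel linear trendline. The same two functions can be reused for the other models by swapping in their fitted values.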
For this assignment we will only do null hypothesis testing, in which the null hypothesis
states that two sets of data, the data given and the data generated from the regression, have
the same mean or variance and therefore represent the same population. We will compare both
the calculated F and t statistics against the corresponding 95% confidence value from
each respective table. Note that F statistical analysis is also known as the F test; likewise t
statistical analysis is also known as the t test. Excel is capable of computing both the F and
t tests, returning a probability that indicates how closely the two sets of data are related.
$$F_{calc} = \frac{\text{larger variance}}{\text{smaller variance}} \quad \text{and} \quad t_{calc} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
from Benefield and Randall
If F_calc is greater than F_α (the value from the table), then reject the null hypothesis and
conclude that the variances are not equal. Otherwise the test fails to reject the null
hypothesis: there is not enough evidence to conclude that the variances differ, hence we can
suppose that the regression does model the data.
If t_calc is greater than t_α,df, where df represents the degrees of freedom and is equal to
the total number of data points compared, from both the data given and the regressed
model, minus two, then reject the null hypothesis and conclude that the means are not
equal. Otherwise we can only conclude that there is not enough evidence to say that the
means are not equal.
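Both statistics can be computed by hand for the given data and the values generated by a regression. The sketch below uses the simple linear model and Python's standard `statistics` module. It assumes the t statistic takes the two-sample form with separate variances; for two samples of equal size, as here, the pooled-variance form gives the same value. The critical values F_α and t_α,df still have to be read from the tables.

```python
import math
import statistics

years = list(range(1984, 2015, 2))
population = [4525, 4902, 5457, 5564, 5806, 6312, 7300, 8307,
              10621, 11543, 12546, 13651, 14563, 15308, 18302, 19678]


def linear_fitted_values(t, p):
    """Values generated by the least-squares line P(t) = a + m*t."""
    t_mean, p_mean = statistics.mean(t), statistics.mean(p)
    m = (sum((ti - t_mean) * (pi - p_mean) for ti, pi in zip(t, p))
         / sum((ti - t_mean) ** 2 for ti in t))
    return [p_mean + m * (ti - t_mean) for ti in t]


def f_statistic(x1, x2):
    """Ratio of the larger sample variance to the smaller one."""
    v1, v2 = statistics.variance(x1), statistics.variance(x2)
    return max(v1, v2) / min(v1, v2)


def t_statistic(x1, x2):
    """Absolute difference of the means over its standard error (assumed form)."""
    se = math.sqrt(statistics.variance(x1) / len(x1)
                   + statistics.variance(x2) / len(x2))
    return abs(statistics.mean(x1) - statistics.mean(x2)) / se


fitted = linear_fitted_values(years, population)
f_calc = f_statistic(population, fitted)
t_calc = t_statistic(population, fitted)
df = len(population) + len(fitted) - 2

print(f"F_calc = {f_calc:.4f}, t_calc = {t_calc:.2e}, df = {df}")
```

For a least-squares line with an intercept, the fitted values have the same mean as the data, so t_calc is essentially zero, and the variance ratio equals 1/r². The F test is therefore the more informative of the two when comparing the linear and quadratic models.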
The coefficient of determination (r²) ranges from 0 to 1, and the correlation coefficient (r)
from -1 to 1. Both indicate how closely the equation represents a set of data: the closer the
magnitude is to one, the more accurate the representation. No equations need to be given for
these two parameters since Excel computes them for us.
Data
Table 1. Population, Flow and Per Capita Data
Year   Population (ppl)   Flow (MGD)   Flow per Capita (gpcd)
1984   4,525.00           0.24         53.04
1986   4,902.00           0.26         53.04
1988   5,457.00           0.31         56.81
1990   5,564.00           0.33         59.31
1992   5,806.00           0.35         60.28
1994   6,312.00           0.40         63.37
1996   7,300.00           0.48         65.75
1998   8,307.00           0.55         66.21
2000   10,621.00          0.75         70.61
2002   11,543.00          0.83         71.91
2004   12,546.00          0.92         73.33
2006   13,651.00          1.04         76.18
2008   14,563.00          1.15         78.97
2010   15,308.00          1.23         80.35
2012   18,302.00          1.52         83.05
2014   19,678.00          1.67         84.87
Analysis
I will do two linear regressions: a simple linear regression and a natural log
regression. For the simple linear regression we plot the year on the x-axis and the
population on the y-axis, and Excel computes the rest:
[Figure: Linear Regression. Population vs. Year, 1980 to 2015. Trendline: P(t) = 508.81t − 10^6, R² = 0.95006]
Linear regression as a function of a natural log:
[Figure: Natural Log Function. Population vs. Year, 1980 to 2015. Trendline: P(t) = 1.02×10^6 ln(t) − 7.72×10^6, R² = 0.94926]
As seen above, by doing so we lose accuracy (R² falls from 0.95006 to 0.94926), hence this
last linear regression method is not adequate to model this data.
[Figure: Quadratic Regression. Population vs. Year, 1980 to 2015. Trendline: y = 12.59x² − 49826x + 49.3E+06, R² = 0.98915]
[Figure: Exponential Regression. Population vs. Year, 1980 to 2015. Trendline: y = 1.37×10^-41 e^(0.0516x), R² = 0.98291]
Logistic Regressions:
[Figure: Logistic regression, d=0 (TI-Nspire CAS output). Horizontal axis: Year, 1984 to 2014. The fitted equation is not legible in the source.]
[Figure: Logistic regression, d≠0 (TI-Nspire CAS output). Horizontal axis: Year, 1984 to 2014. The fitted equation is not legible in the source.]