
Chapter 17
Least-Squares Regression


Where substantial error is associated with data, polynomial interpolation is inappropriate and may yield unsatisfactory results when used to predict intermediate values. Experimental data is often of this type. For example, figure (a) below shows seven experimentally derived data points exhibiting significant variability. The data indicate that higher values of y are associated with higher values of x.

Now, if a sixth-order interpolating polynomial is fitted to this data (figure (b)), it will pass exactly through all of the points. However, because of the variability in the data, the curve oscillates widely in the interval between the points. In particular, the interpolated values at x = 1.5 and x = 6.5 appear to be well beyond the range suggested by the data.

A more appropriate strategy is to derive an approximating function that fits the shape, or general trend, of the data. Figure (c) illustrates how a straight line can be used to generally characterize the trend of the data without passing through any particular point.
One way to determine the line in figure (c) is to visually inspect the plotted data and then sketch a "best" line through the points. Such approaches are inadequate because they are arbitrary. That is, unless the points define a perfect straight line (in which case interpolation would be appropriate), different analysts would draw different lines. To avoid this, some criterion must be devised to establish a basis for the fit. One way to do this is to derive a curve that minimizes the discrepancy between the data points and the curve. One technique for doing this is called least-squares regression.

17.1 Linear Regression

The simplest example of a least-squares approximation is fitting a straight line to a set of paired observations: (x_1, y_1), (x_2, y_2), ..., (x_n, y_n). The mathematical expression for the straight line is

y = a_0 + a_1 x + e

where a_0 and a_1 are coefficients representing the intercept and the slope, respectively, and e is the error, or residual, between the model and the observations, which can be represented by rearranging the previous equation as

e = y - a_0 - a_1 x

Thus, the error is the discrepancy between the true value of y and the approximate value, a_0 + a_1 x, predicted by the linear equation.
17.1.1 Criteria for the Best Fit

One strategy for fitting a "best" line through the data would be to minimize the sum of the residual errors for all the available data, as in

\sum_{i=1}^{n} e_i = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i)

where n = total number of points. However, this is an inadequate criterion, as illustrated by the next figure, which shows the fit of a straight line to two points.


Obviously, the best fit is the line connecting the points. However, any straight line passing through the midpoint of the connecting line results in a minimum value of the previous equation equal to zero because the positive and negative errors cancel.

Therefore, another logical criterion might be to minimize the sum of the absolute values of the discrepancies, as in

\sum_{i=1}^{n} |e_i| = \sum_{i=1}^{n} |y_i - a_0 - a_1 x_i|

The previous figure (b) demonstrates why this criterion is also inadequate. For the four points shown, any straight line falling within the dashed lines will minimize the sum of the absolute values. Thus, this criterion also does not yield a unique best fit.

A third strategy for fitting a best line is the minimax criterion. In this technique, the line is chosen that minimizes the maximum distance that an individual point falls from the line. As shown in the previous figure (c), this strategy is ill-suited for regression because it gives undue influence to an outlier, that is, a single point with a large error.


A strategy that overcomes the shortcomings of the previous approaches is to minimize the sum of the squares of the residuals between the measured y and the y calculated with the linear model:

S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_{i,measured} - y_{i,model})^2 = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i)^2

This criterion has a number of advantages, including the fact that it yields a unique line for a given set of data.


17.1.2 Least-Squares Fit of a Straight Line

To determine values of a_0 and a_1, the previous equation is differentiated with respect to each coefficient:

\frac{\partial S_r}{\partial a_0} = -2 \sum (y_i - a_0 - a_1 x_i)

\frac{\partial S_r}{\partial a_1} = -2 \sum [(y_i - a_0 - a_1 x_i) x_i]

Note that we have simplified the summation symbols; unless otherwise indicated, all summations are from i = 1 to n. Setting these derivatives equal to zero will result in a minimum S_r:

0 = \sum y_i - \sum a_0 - \sum a_1 x_i

0 = \sum x_i y_i - \sum a_0 x_i - \sum a_1 x_i^2

Now, realizing that \sum a_0 = n a_0, we can express the equations as a set of two simultaneous linear equations with two unknowns (a_0 and a_1):

n a_0 + (\sum x_i) a_1 = \sum y_i        (17.4)

(\sum x_i) a_0 + (\sum x_i^2) a_1 = \sum x_i y_i

These are called the normal equations. They can be solved simultaneously for

a_1 = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - (\sum x_i)^2}

This result can then be used in conjunction with Eq. (17.4) to solve for

a_0 = \bar{y} - a_1 \bar{x}

where \bar{y} and \bar{x} are the means of y and x, respectively.

Example 17.1 Linear Regression

Problem Statement:
Fit a straight line to the x and y values in the first two columns of the next table.

Solution:
The following quantities can be computed:

n = 7        \sum x_i y_i = 119.5        \sum x_i^2 = 140

\sum x_i = 28        \bar{x} = 28/7 = 4

\sum y_i = 24        \bar{y} = 24/7 = 3.428571

Using the previous two equations,

a_1 = \frac{7(119.5) - 28(24)}{7(140) - (28)^2} = 0.8392857

a_0 = 3.428571 - 0.8392857(4) = 0.07142857

Therefore, the least-squares fit is

y = 0.07142857 + 0.8392857 x

The line, along with the data, is shown in the first figure (c).
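
Because the normal-equation formulas use only the summary quantities n, \sum x_i, \sum y_i, \sum x_i y_i, and \sum x_i^2, the fit can be reproduced directly from the sums tabulated in the example. The following Python sketch does exactly that (the function name is illustrative, not from the original text):

    def fit_line_from_sums(n, sx, sy, sxy, sxx):
        """Linear least-squares fit from precomputed sums.

        Implements a1 = (n*Sxy - Sx*Sy) / (n*Sxx - Sx**2)
        and        a0 = ybar - a1*xbar.
        """
        a1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
        a0 = sy / n - a1 * sx / n
        return a0, a1

    # Sums from Example 17.1
    a0, a1 = fit_line_from_sums(n=7, sx=28, sy=24, sxy=119.5, sxx=140)
    print(a0, a1)  # 0.07142857..., 0.8392857...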


17.1.3 Quantification of Error of Linear Regression
Any line other than the one computed in the previous example results in a larger sum of the squares of the residuals. Thus, the line is unique and, in terms of our chosen criterion, is the "best" line through the points.

A number of additional properties of this fit can be explained by examining more closely the way in which the residuals were computed. Recall that the sum of the squares is defined as

S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i)^2

Notice the similarity between this equation and

S_t = \sum (y_i - \bar{y})^2

The similarity can be extended further for cases where (1) the spread of the points around the line is of similar magnitude along the entire range of the data and (2) the distribution of these points about the line is normal. It can be demonstrated that if these criteria are met, least-squares regression will provide the best (that is, the most likely) estimates of a_0 and a_1.

In addition, if these criteria are met, a "standard deviation" for the regression line can be determined as

s_{y/x} = \sqrt{\frac{S_r}{n - 2}}

where s_{y/x} is called the standard error of the estimate. The subscript notation y/x designates that the error is for a predicted value of y corresponding to a particular value of x.

Also, notice that we now divide by n - 2 because two data-derived estimates, a_0 and a_1, were used to compute S_r; thus, we have lost two degrees of freedom. Another justification for dividing by n - 2 is that there is no such thing as the spread of data around a straight line connecting two points.

The standard error of the estimate quantifies the spread of the data around the regression line, as shown in the next figure (b), in contrast to the original standard deviation s_y, which quantified the spread around the mean (figure (a)).




The above concepts can be used to quantify the "goodness" of our fit. This is particularly useful for comparing several regressions (next figure). To do this, we return to the original data and determine the total sum of the squares around the mean for the dependent variable (in our case, y). This quantity is designated S_t. This is the magnitude of the residual error associated with the dependent variable prior to regression. After performing the regression, we can compute S_r, the sum of the squares of the residuals around the regression line. This characterizes the residual error that remains after the regression. It is, therefore, sometimes called the unexplained sum of the squares.


The difference between the two quantities, S_t - S_r, quantifies the improvement or error reduction due to describing the data in terms of a straight line rather than as an average value. Because the magnitude of this quantity is scale-dependent, the difference is normalized to S_t to yield

r^2 = \frac{S_t - S_r}{S_t}

where r^2 is called the coefficient of determination and r is the correlation coefficient (r = \sqrt{r^2}). For a perfect fit, S_r = 0 and r = r^2 = 1, signifying that the line explains 100 percent of the variability of the data. For r = r^2 = 0, S_r = S_t and the fit represents no improvement.

An alternative formulation for r that is more convenient for computer implementation is

r = \frac{n \sum x_i y_i - (\sum x_i)(\sum y_i)}{\sqrt{n \sum x_i^2 - (\sum x_i)^2} \sqrt{n \sum y_i^2 - (\sum y_i)^2}}



Example 17.2 Estimation of Errors for the Linear Least-Squares Fit

Problem Statement:
Compute the total standard deviation, the standard error of the estimate, and the correlation coefficient for the data in Example 17.1.

Solution:
The summations are performed and presented in the previous example's table. The standard deviation is

s_y = \sqrt{\frac{S_t}{n - 1}} = \sqrt{\frac{22.7143}{7 - 1}} = 1.9457

and the standard error of the estimate is

s_{y/x} = \sqrt{\frac{S_r}{n - 2}} = \sqrt{\frac{2.9911}{7 - 2}} = 0.7735

Thus, because s_{y/x} < s_y, the linear regression model has merit. The extent of the improvement is quantified by

r^2 = \frac{S_t - S_r}{S_t} = \frac{22.7143 - 2.9911}{22.7143} = 0.868

or

r = \sqrt{0.868} = 0.932

These results indicate that 86.8 percent of the original uncertainty has been explained by the linear model.
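
As a quick check, these error statistics can be reproduced with a few lines of Python; this is a minimal sketch using only the values of S_t, S_r, and n quoted in the example (the helper name is illustrative):

    import math

    def regression_errors(st, sr, n):
        # Standard deviation about the mean, standard error of the
        # estimate, and coefficient of determination.
        sy = math.sqrt(st / (n - 1))
        syx = math.sqrt(sr / (n - 2))
        r2 = (st - sr) / st
        return sy, syx, r2

    sy, syx, r2 = regression_errors(st=22.7143, sr=2.9911, n=7)
    print(sy, syx, r2)  # 1.9457..., 0.7735..., 0.868...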

17.1.5 Linearization of Nonlinear Relationships
Linear regression provides a powerful technique for fitting a best line to data. However, it is predicated on the fact that the relationship between the dependent and independent variables is linear. This is not always the case, and the first step in any regression analysis should be to plot and visually inspect the data to determine whether a linear model applies. For example, the next figure shows some data that is obviously curvilinear. In some cases, techniques such as polynomial regression are appropriate. For others, transformations can be used to express the data in a form that is compatible with linear regression.



One example is the exponential model

y = \alpha_1 e^{\beta_1 x}        (17.12)

where \alpha_1 and \beta_1 are constants. As shown in the next figure, the equation represents a nonlinear relationship (for \beta_1 \neq 0) between x and y.

Another example of a nonlinear model is the simple power equation

y = \alpha_2 x^{\beta_2}        (17.13)

where \alpha_2 and \beta_2 are constant coefficients. As shown in the previous figure, the equation (for \beta_2 \neq 0 or 1) is nonlinear.
A third example of a nonlinear model is the saturation-growth-rate equation

y = \alpha_3 \frac{x}{\beta_3 + x}        (17.14)

where \alpha_3 and \beta_3 are constant coefficients. This model also represents a nonlinear relationship between y and x that levels off as x increases.
A simpler alternative is to use mathematical manipulations to transform the equations into a linear form. Then, simple linear regression can be employed to fit the equations to data.

Equation (17.12) can be linearized by taking its natural logarithm:

\ln y = \ln \alpha_1 + \beta_1 x \ln e

But because \ln e = 1,

\ln y = \ln \alpha_1 + \beta_1 x

Thus, a plot of \ln y versus x will yield a straight line with a slope of \beta_1 and an intercept of \ln \alpha_1 (previous figure (d)).
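
A brief Python sketch of this transform, using synthetic data generated from hypothetical constants \alpha_1 = 2 and \beta_1 = 0.5 (chosen only for illustration; they are not from the text):

    import numpy as np

    # Synthetic data from y = 2 * exp(0.5 * x) (hypothetical constants)
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = 2.0 * np.exp(0.5 * x)

    # Fit ln(y) = ln(alpha1) + beta1 * x with a first-degree polynomial
    beta1, ln_alpha1 = np.polyfit(x, np.log(y), 1)
    alpha1 = np.exp(ln_alpha1)
    print(alpha1, beta1)  # recovers 2.0 and 0.5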


Equation (17.13) is linearized by taking its base-10 logarithm to give

\log y = \beta_2 \log x + \log \alpha_2

Thus, a plot of \log y versus \log x will yield a straight line with a slope of \beta_2 and an intercept of \log \alpha_2 (previous figure (e)).

Equation (17.14) is linearized by inverting it to give

\frac{1}{y} = \frac{\beta_3}{\alpha_3} \frac{1}{x} + \frac{1}{\alpha_3}

Thus, a plot of 1/y versus 1/x will be linear, with a slope of \beta_3/\alpha_3 and an intercept of 1/\alpha_3 (previous figure (f)).

In their transformed forms, these models can be fit with linear regression to evaluate the constant coefficients. They can then be transformed back to their original state and used for predictive purposes. Example 17.4 illustrates this procedure for Eq. (17.13).


Example 17.4 Linearization of a Power Equation

Problem Statement:
Fit Eq. (17.13) to the data in the next table using a logarithmic transformation of the data.

Solution:
The next figure (a) is a plot of the original data in its untransformed state, and figure (b) shows the plot of the transformed data. A linear regression of the log-transformed data yields the result

\log y = 1.75 \log x - 0.300

Thus, the intercept, \log \alpha_2, equals -0.300, and therefore, by taking the antilogarithm, \alpha_2 = 10^{-0.3} = 0.5. The slope is \beta_2 = 1.75. Consequently, the power equation is

y = 0.5 x^{1.75}

This curve, as plotted in the next figure (a), indicates a good fit.
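
The same transform is easy to script. Since the example's data table did not survive, this sketch checks the method against synthetic points taken exactly on the fitted curve y = 0.5 x^{1.75} (the function name is illustrative):

    import numpy as np

    def power_fit(x, y):
        # Fit log10(y) = b2*log10(x) + log10(a2), then undo the transform.
        b2, log_a2 = np.polyfit(np.log10(x), np.log10(y), 1)
        return 10.0 ** log_a2, b2

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = 0.5 * x ** 1.75            # points on the reported curve
    a2, b2 = power_fit(x, y)
    print(a2, b2)                  # 0.5 and 1.75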


17.1.6 General Comments on Linear Regression

We have focused on the simple derivation and practical use of equations to fit data. Some statistical assumptions that are inherent in the linear least-squares procedures are:

1. Each x has a fixed value; it is not random and is known without error.
2. The y values are independent random variables and all have the same variance.
3. The y values for a given x must be normally distributed.

Such assumptions are relevant to the proper derivation and use of regression. For example, the first assumption means that (1) the x values must be error-free and (2) the regression of y versus x is not the same as that of x versus y.



17.2 Polynomial Regression

Some engineering data, although exhibiting a marked pattern, is poorly represented by a straight line. For these cases, a curve would be better suited to fit the data. One method to accomplish this objective is to use transformations. Another alternative is to fit polynomials to the data using polynomial regression.

The least-squares procedure can be readily extended to fit the data to a higher-order polynomial. For example, suppose that we fit a second-order polynomial or quadratic:

y = a_0 + a_1 x + a_2 x^2 + e

For this case the sum of the squares of the residuals is

S_r = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i - a_2 x_i^2)^2

Following the procedure of the previous section, we take the derivative of this equation with respect to each of the unknown coefficients of the polynomial, as in

\frac{\partial S_r}{\partial a_0} = -2 \sum (y_i - a_0 - a_1 x_i - a_2 x_i^2)

\frac{\partial S_r}{\partial a_1} = -2 \sum x_i (y_i - a_0 - a_1 x_i - a_2 x_i^2)

\frac{\partial S_r}{\partial a_2} = -2 \sum x_i^2 (y_i - a_0 - a_1 x_i - a_2 x_i^2)




These equations can be set equal to zero and rearranged to develop the following set of normal equations:

n a_0 + (\sum x_i) a_1 + (\sum x_i^2) a_2 = \sum y_i

(\sum x_i) a_0 + (\sum x_i^2) a_1 + (\sum x_i^3) a_2 = \sum x_i y_i

(\sum x_i^2) a_0 + (\sum x_i^3) a_1 + (\sum x_i^4) a_2 = \sum x_i^2 y_i

where all summations are from i = 1 through n. Note that the above three equations are linear and have three unknowns: a_0, a_1, and a_2. The coefficients of the unknowns can be calculated directly from the observed data. Thus, the problem of determining a least-squares second-order polynomial is equivalent to solving a system of three simultaneous linear equations, as sketched below.
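
A minimal Python sketch of this construction, building the normal-equation matrix from data arrays and solving it with NumPy (the function name is illustrative):

    import numpy as np

    def quadratic_fit(x, y):
        # Build the 3x3 normal equations for y = a0 + a1*x + a2*x^2
        s = lambda p: np.sum(x ** p)
        A = np.array([[len(x), s(1), s(2)],
                      [s(1),   s(2), s(3)],
                      [s(2),   s(3), s(4)]])
        b = np.array([np.sum(y), np.sum(x * y), np.sum(x ** 2 * y)])
        return np.linalg.solve(A, b)  # [a0, a1, a2]

Applied to the data of the following example, this reproduces a_0 = 2.47857, a_1 = 2.35929, and a_2 = 1.86071.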


Example: Polynomial Regression

Problem Statement:
Fit a second-order polynomial to the data in the first two columns of the next table.

Solution:
From the given data,

m = 2        n = 6        \bar{x} = 2.5        \bar{y} = 25.433

\sum x_i = 15        \sum y_i = 152.6        \sum x_i y_i = 585.6

\sum x_i^2 = 55        \sum x_i^3 = 225        \sum x_i^4 = 979        \sum x_i^2 y_i = 2488.8

Therefore, the simultaneous linear equations are

\begin{bmatrix} 6 & 15 & 55 \\ 15 & 55 & 225 \\ 55 & 225 & 979 \end{bmatrix} \begin{Bmatrix} a_0 \\ a_1 \\ a_2 \end{Bmatrix} = \begin{Bmatrix} 152.6 \\ 585.6 \\ 2488.8 \end{Bmatrix}

Solving these equations through a technique such as Gauss elimination gives a_0 = 2.47857, a_1 = 2.35929, and a_2 = 1.86071.



Continued:
Therefore, the least-squares quadratic equation for this case is

y = 2.47857 + 2.35929 x + 1.86071 x^2

The standard error of the estimate based on the regression polynomial is

s_{y/x} = \sqrt{\frac{S_r}{n - (m + 1)}} = \sqrt{\frac{3.74657}{6 - 3}} = 1.12

The coefficient of determination is

r^2 = \frac{S_t - S_r}{S_t} = \frac{2513.39 - 3.74657}{2513.39} = 0.99851

and the correlation coefficient is r = 0.99925.

These results indicate that 99.851 percent of the original uncertainty has been explained by the model. This result supports the conclusion that the quadratic equation represents an excellent fit, as is also evident from the next figure.





17.3 Multiple Linear Regression

A useful extension of linear regression is the case where y is a linear function of two or more independent variables. For example, y might be a linear function of x_1 and x_2, as in

y = a_0 + a_1 x_1 + a_2 x_2 + e

Such an equation is particularly useful when fitting experimental data where the variable being studied is often a function of two other variables. For this two-dimensional case, the regression "line" becomes a plane (next figure).





As with the previous cases, the best values of the coefficients are determined by setting up the sum of the squares of the residuals,

S_r = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_{1i} - a_2 x_{2i})^2

and differentiating with respect to each of the unknown coefficients:

\frac{\partial S_r}{\partial a_0} = -2 \sum (y_i - a_0 - a_1 x_{1i} - a_2 x_{2i})

\frac{\partial S_r}{\partial a_1} = -2 \sum x_{1i} (y_i - a_0 - a_1 x_{1i} - a_2 x_{2i})

\frac{\partial S_r}{\partial a_2} = -2 \sum x_{2i} (y_i - a_0 - a_1 x_{1i} - a_2 x_{2i})


The coefficients yielding the minimum sum of the squares of the residuals are obtained by setting the partial derivatives equal to zero and expressing the result in matrix form as

\begin{bmatrix} n & \sum x_{1i} & \sum x_{2i} \\ \sum x_{1i} & \sum x_{1i}^2 & \sum x_{1i} x_{2i} \\ \sum x_{2i} & \sum x_{1i} x_{2i} & \sum x_{2i}^2 \end{bmatrix} \begin{Bmatrix} a_0 \\ a_1 \\ a_2 \end{Bmatrix} = \begin{Bmatrix} \sum y_i \\ \sum x_{1i} y_i \\ \sum x_{2i} y_i \end{Bmatrix}
Example 17.6 Multiple Linear Regression

Problem Statement:
The following data was calculated from the equation y = 5 + 4x_1 - 3x_2. Use multiple linear regression to fit this data.

Solution:
The summations required to develop the normal equations are computed from the tabulated data. The result is

\begin{bmatrix} 6 & 16.5 & 14 \\ 16.5 & 76.25 & 48 \\ 14 & 48 & 54 \end{bmatrix} \begin{Bmatrix} a_0 \\ a_1 \\ a_2 \end{Bmatrix} = \begin{Bmatrix} 54 \\ 243.5 \\ 100 \end{Bmatrix}

which can be solved using a method such as Gauss elimination for

a_0 = 5        a_1 = 4        a_2 = -3

which is consistent with the original equation from which the data was derived.
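
The 3x3 system above is again a one-liner to solve; a minimal sketch using the matrix and right-hand side quoted in the example:

    import numpy as np

    A = np.array([[6.0, 16.5, 14.0],
                  [16.5, 76.25, 48.0],
                  [14.0, 48.0, 54.0]])
    b = np.array([54.0, 243.5, 100.0])
    a0, a1, a2 = np.linalg.solve(A, b)
    print(a0, a1, a2)  # 5.0, 4.0, -3.0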

The foregoing two-dimensional case can be easily extended to m dimensions, as in

y = a_0 + a_1 x_1 + a_2 x_2 + \cdots + a_m x_m + e

where the standard error is formulated as

s_{y/x} = \sqrt{\frac{S_r}{n - (m + 1)}}

and the coefficient of determination is computed as in Eq. (17.10).


Although there may be certain cases where a variable is linearly related to two or more other variables, multiple linear regression has additional utility in the derivation of power equations of the general form

y = a_0 x_1^{a_1} x_2^{a_2} \cdots x_m^{a_m}

Such equations are extremely useful when fitting experimental data. To use multiple linear regression, the equation is transformed by taking its logarithm to yield

\log y = \log a_0 + a_1 \log x_1 + a_2 \log x_2 + \cdots + a_m \log x_m

This transformation is similar in spirit to the one used to fit a power equation when y is a function of a single variable x.
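
A sketch of that transform for two predictors, using NumPy's least-squares solver on the log-transformed model (the function name and synthetic test values are illustrative only):

    import numpy as np

    def multi_power_fit(x1, x2, y):
        # Solve log10(y) = log10(a0) + a1*log10(x1) + a2*log10(x2)
        A = np.column_stack([np.ones_like(y), np.log10(x1), np.log10(x2)])
        log_a0, a1, a2 = np.linalg.lstsq(A, np.log10(y), rcond=None)[0]
        return 10.0 ** log_a0, a1, a2

    # Synthetic check: points generated from y = 2 * x1^1.5 * x2^-0.5
    x1 = np.array([1.0, 2.0, 3.0, 4.0])
    x2 = np.array([1.0, 2.0, 1.0, 3.0])
    y = 2.0 * x1 ** 1.5 * x2 ** -0.5
    print(multi_power_fit(x1, x2, y))  # ~ (2.0, 1.5, -0.5)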

Problem 17.5

Use least-squares regression to fit a straight line to the following data:

x | 11 | 15 | 17 | 21 | 23 | 29 | 29 | 37 | 39
y | 29 | 21 | 29 | 14 | 21 | 15 | .. | 13 | 3

Compute the standard error of the estimate and the correlation coefficient. Plot the data and the regression line. If someone made an additional measurement of x = 10, y = 10, would you suspect that the measurement was valid or faulty? Justify your conclusion.

Solution:
The results can be summarized as

y = 31.0589 - 0.78055 x        (s_{y/x} = 4.476306; r = 0.901489)

At x = 10, the best-fit equation gives 23.2543. The line and data can be plotted along with the point (10, 10). The value of y = 10 is nearly 3 times the standard error away from the line:

23.2543 - 10 = 13.2543 \approx 3(4.476306)

Thus, we can conclude that the value is probably erroneous.
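
The check is simple to express in code; a sketch using the fitted coefficients reported above:

    a0, a1, syx = 31.0589, -0.78055, 4.476306

    y_pred = a0 + a1 * 10        # ~23.25 at x = 10
    residual = abs(10 - y_pred)
    print(residual, 3 * syx)     # ~13.25 vs ~13.43: nearly 3 s_y/x
    if residual > 2 * syx:       # a common rule of thumb
        print("measurement is suspect")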
Problem 17.13

An investigator has reported the data tabulated below for an experiment to determine the growth rate of bacteria k (per d) as a function of oxygen concentration c (mg/L). It is known that such data can be modeled by the following equation:

k = \frac{k_{max} c^2}{c_s + c^2}

where c_s and k_{max} are parameters. Use a transformation to linearize this equation. Then use linear regression to estimate c_s and k_{max} and predict the growth rate at c = 2 mg/L.
c, mg/L | 0.5 | 0.8 | 1.5 | 2.5 | 4
k, /d   | 1.1 | 2.4 | 5.3 | 7.6 | 8.9

Solution:
The equation can be linearized by inverting it to yield

\frac{1}{k} = \frac{c_s}{k_{max}} \frac{1}{c^2} + \frac{1}{k_{max}}

Consequently, a plot of 1/k versus 1/c^2 should yield a straight line with an intercept of 1/k_{max} and a slope of c_s/k_{max}.
c, mg/L   k, /d   1/c^2       1/k        (1/c^2)(1/k)   (1/c^2)^2
0.5       1.1     4.000000    0.909091   3.636364       16.000000
0.8       2.4     1.562500    0.416667   0.651042       2.441406
1.5       5.3     0.444444    0.188679   0.083857       0.197531
2.5       7.6     0.160000    0.131579   0.021053       0.025600
4         8.9     0.062500    0.112360   0.007022       0.003906
Sum               6.229444    1.758375   4.399338       18.668444

Continued:
The slope and the intercept can then be computed as

a_1 = \frac{5(4.399338) - 6.229444(1.758375)}{5(18.668444) - (6.229444)^2} = 0.202489

a_0 = \frac{1.758375}{5} - 0.202489 \frac{6.229444}{5} = 0.099396

Therefore, k_{max} = 1/0.099396 = 10.06074 and c_s = 10.06074(0.202489) = 2.037189, and the fit is

k = \frac{10.06074 c^2}{2.037189 + c^2}

This equation can be plotted together with the data. The equation can then be used to compute the growth rate at c = 2 mg/L:

k = \frac{10.06074 (2)^2}{2.037189 + (2)^2} = 6.666
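
The whole solution can be scripted; a minimal sketch of the same transform-and-fit procedure using the tabulated data:

    import numpy as np

    c = np.array([0.5, 0.8, 1.5, 2.5, 4.0])
    k = np.array([1.1, 2.4, 5.3, 7.6, 8.9])

    # Linearize: 1/k = (cs/kmax)*(1/c**2) + 1/kmax
    slope, intercept = np.polyfit(1.0 / c ** 2, 1.0 / k, 1)
    kmax = 1.0 / intercept          # ~10.06
    cs = kmax * slope               # ~2.037

    print(kmax * 2.0 ** 2 / (cs + 2.0 ** 2))  # growth rate at c = 2: ~6.666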
