EEE 305 Lecture 10: Regression

Statistics Review
Measures of Location
Arithmetic mean: the sum of the individual data points (yi) divided by the number of points n:

\bar{y} = \frac{\sum y_i}{n}
Median: the midpoint of a group of data.
Mode: the value that occurs most frequently in a group of data.
Measures of spread
Standard deviation:

s_y = \sqrt{\frac{S_t}{n-1}}
where S_t is the sum of the squares of the data residuals:
S_t = \sum (y_i - \bar{y})^2
Variance:
s_y^2 = \frac{\sum (y_i - \bar{y})^2}{n-1} = \frac{\sum y_i^2 - (\sum y_i)^2 / n}{n-1}
Coefficient of variation:
c.v. = \frac{s_y}{\bar{y}} \times 100\%
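As a quick illustration in MATLAB (the data vector below is hypothetical), these measures can be computed directly; note that both forms of the variance formula give the same value:
y = [6.395; 6.435; 6.485; 6.495; 6.505; 6.515];   % hypothetical measurements
n = length(y);
ybar = sum(y)/n;                          % arithmetic mean
St = sum((y - ybar).^2);                  % sum of squares of the data residuals
sy = sqrt(St/(n - 1));                    % standard deviation
sy2 = (sum(y.^2) - sum(y)^2/n)/(n - 1);   % variance, using the alternative form
cv = sy/ybar*100;                         % coefficient of variation in percent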
Normal distribution
If we have a very large set of data, the histogram often can be approximated by a smooth curve. The
symmetric, bell-shaped curve superimposed on Fig. 1 is one such characteristic shape—the normal
distribution. Given enough additional measurements, the histogram for this particular case could
eventually approach the normal distribution.


Figure 1


Linear Least-Squares Regression


Linear least-squares regression is a method to determine the “best” coefficients in a linear model for a given data set.
“Best” for least-squares regression means minimizing the sum of the squares of the estimate
residuals. For a straight line model, this gives:
S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i)^2
This method will yield a unique line for a given set of data.
The simplest example of a least-squares approximation is fitting a straight line to a set of paired
observations: (x1, y1), (x2, y2), . . . , (xn, yn). The mathematical expression for the straight line is
y = a_0 + a_1 x + e
where a0 and a1 are coefficients representing the intercept and the slope, respectively, and e is the
error, or residual, between the model and the observations, which can be represented by
rearranging the last equation as
e = y - a_0 - a_1 x
Thus, the error, or residual, is the discrepancy between the true value of y and the approximate
value, a_0 + a_1 x, predicted by the linear equation.
One strategy for fitting a “best” line through the data would be to minimize the sum of the residual
errors for all the available data, as in

\sum_{i=1}^{n} e_i = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i) \qquad (ii)
where n = total number of points. However, this is an inadequate criterion, as illustrated by Fig. 2a, which depicts the fit of a straight line to two points. Obviously, the best fit is the line connecting the points.

Figure 2 Examples of some criteria for “best fit” that are inadequate for regression: (a) minimizes the sum of the
residuals, (b) minimizes the sum of the absolute values of the residuals, and (c) minimizes the maximum error of any
individual point.

However, any straight line passing through the midpoint of the connecting line (except a perfectly vertical line) results in a minimum value of Eq. (ii) equal to zero because the errors cancel.
Therefore, another logical criterion might be to minimize the sum of the absolute values of the discrepancies, as in
\sum_{i=1}^{n} |e_i| = \sum_{i=1}^{n} |y_i - a_0 - a_1 x_i|
Figure 2b demonstrates why this criterion is also inadequate. For the four points shown, any straight
line falling within the dashed lines will minimize the sum of the absolute values. Thus, this criterion
also does not yield a unique best fit.

A third strategy for fitting a best line is the minimax criterion. In this technique, the line is chosen
that minimizes the maximum distance that an individual point falls from the line. As depicted in Fig.
2c, this strategy is ill-suited for regression because it gives undue influence to an outlier, that is, a
single point with a large error.
A strategy that overcomes the shortcomings of the aforementioned approaches is to minimize the
sum of the squares of the residuals between the measured y and the y calculated with the linear
model

S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i)^2 \qquad (iii)
This criterion has a number of advantages, including the fact that it yields a unique line for a given
set of data.
Least-Squares Fit of a Straight Line
If we model
y = a_0 + a_1 x
The values of a_1 and a_0 can be found using the formulas
a_1 = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - (\sum x_i)^2}
a_0 = \bar{y} - a_1 \bar{x}
where \bar{y} and \bar{x} are the means of y and x, respectively.
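A minimal MATLAB sketch of these formulas, using hypothetical data vectors x and y:
x = [1; 2; 3; 4; 5; 6; 7];               % hypothetical independent data
y = [0.5; 2.5; 2.0; 4.0; 3.5; 6.0; 5.5]; % hypothetical dependent data
n = length(x);
a1 = (n*sum(x.*y) - sum(x)*sum(y))/(n*sum(x.^2) - sum(x)^2);  % slope
a0 = mean(y) - a1*mean(x);                                    % intercept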
Example
Find the straight line that best fits the points (x_i, y_i).


Quantification of Error
Recall that for a straight line, the sum of the squares of the estimate residuals is
S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i)^2
Figure 3 The residual in linear regression represents the vertical distance between a data point and the straight line.
Standard error of the estimate
s_{y/x} = \sqrt{\frac{S_r}{n-2}}
The figure below shows regression data with (a) the spread of the data around the mean of the dependent variable and (b) the spread of the data around the best-fit line. The reduction in spread in going from (a) to (b), as indicated by the bell-shaped curves at the right, represents the improvement due to linear regression.

Figure 4
The coefficient of determination r2 is the difference between the sum of the squares of the data
residuals and the sum of the squares of the estimate residuals, normalized by the sum of the squares
of the data residuals:
r^2 = \frac{S_t - S_r}{S_t}
• r2 represents the percentage of the original uncertainty explained by the model.
• For a perfect fit, Sr = 0 and r2 = 1.
• If r2 = 0, there is no improvement over simply picking the mean.
• If r2 < 0, the model is worse than simply picking the mean!
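A minimal MATLAB sketch of these error measures, assuming x, y, a0, and a1 are already available from a straight-line fit:
St = sum((y - mean(y)).^2);       % spread of the data around the mean
Sr = sum((y - a0 - a1*x).^2);     % spread of the data around the fitted line
syx = sqrt(Sr/(length(x) - 2));   % standard error of the estimate
r2 = (St - Sr)/St;                % coefficient of determination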


Example
Compute the total standard deviation, the standard error of the estimate, and the correlation coefficient for the data in the previous example.

Linearization of nonlinear relationship


Linear regression is predicated on the assumption that the relationship between the dependent and independent variables is linear, which is not always the case.
Three common examples are:

exponential: y = \alpha_1 e^{\beta_1 x}
power: y = \alpha_2 x^{\beta_2}
saturation-growth-rate: y = \alpha_3 \frac{x}{\beta_3 + x}
One option for finding the coefficients for a nonlinear fit is to linearize it. For the three common
models, this may involve taking logarithms or inversion:

Model                        Nonlinear                          Linearized
exponential                  y = \alpha_1 e^{\beta_1 x}         \ln y = \ln \alpha_1 + \beta_1 x
power                        y = \alpha_2 x^{\beta_2}           \log y = \log \alpha_2 + \beta_2 \log x
saturation-growth-rate       y = \alpha_3 x / (\beta_3 + x)     1/y = 1/\alpha_3 + (\beta_3/\alpha_3)(1/x)

The figure below shows graphically how linearization works.

Figure 5 (a) The exponential equation, (b) the power equation, and (c) the saturation-growth-rate equation. Parts (d),
(e), and (f) are linearized versions of these equations that result from simple transformations.

Polynomial Regression
The figure below shows an example of a situation where linear least-squares regression is not a good fit and higher-order polynomial regression is preferable.

Figure 6 (a) Data that is ill-suited for linear least-squares regression. (b) Indication that a parabola is preferable.

We can extend the idea of linear least-squares regression to derive equations for higher-order regression. The idea is to minimize the sum of the squares of the estimate residuals. For example,
suppose that we fit a second-order polynomial or quadratic:
y = a_0 + a_1 x + a_2 x^2 + e
For this case, the sum of the squares of the residuals that needs to be minimized is
S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i - a_2 x_i^2)^2


Following the procedure of the previous section, we take the derivative of the last equation with respect to each of the unknown coefficients of the polynomial, as in
\frac{\partial S_r}{\partial a_0} = -2 \sum (y_i - a_0 - a_1 x_i - a_2 x_i^2)
\frac{\partial S_r}{\partial a_1} = -2 \sum x_i (y_i - a_0 - a_1 x_i - a_2 x_i^2)
\frac{\partial S_r}{\partial a_2} = -2 \sum x_i^2 (y_i - a_0 - a_1 x_i - a_2 x_i^2)
These equations can be set equal to zero and rearranged to develop the following set of normal equations:
n a_0 + (\sum x_i) a_1 + (\sum x_i^2) a_2 = \sum y_i
(\sum x_i) a_0 + (\sum x_i^2) a_1 + (\sum x_i^3) a_2 = \sum x_i y_i
(\sum x_i^2) a_0 + (\sum x_i^3) a_1 + (\sum x_i^4) a_2 = \sum x_i^2 y_i
where all summations are from i = 1 through n. Note that the above three equations are linear and
have three unknowns: a0, a1, and a2. The coefficients of the unknowns can be calculated directly
from the observed data.
For this case, we see that the problem of determining a least-squares second-order polynomial is
equivalent to solving a system of three simultaneous linear equations.
In general, for an mth-order polynomial this would mean minimizing
S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_i - a_2 x_i^2 - \cdots - a_m x_i^m)^2

The standard error for fitting an mth-order polynomial to n data points is
s_{y/x} = \sqrt{\frac{S_r}{n - (m+1)}}
because the mth order polynomial has (m+1) coefficients.
The coefficient of determination r2 is still found using:
r^2 = \frac{S_t - S_r}{S_t}
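A minimal MATLAB sketch of a second-order fit, assuming hypothetical column vectors x and y; it assembles the normal equations above and solves them with the backslash operator:
N = length(x);
A = [N          sum(x)     sum(x.^2);
     sum(x)     sum(x.^2)  sum(x.^3);
     sum(x.^2)  sum(x.^3)  sum(x.^4)];      % coefficient matrix of the normal equations
b = [sum(y); sum(x.*y); sum((x.^2).*y)];    % right-hand-side vector
a = A\b;                                    % a(1) = a0, a(2) = a1, a(3) = a2
Sr = sum((y - a(1) - a(2)*x - a(3)*x.^2).^2);
syx = sqrt(Sr/(N - 3));                     % standard error with m = 2, so m + 1 = 3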
Example

Multiple Linear Regression [Additional topic]


Another useful extension of linear regression is the case where y is a linear function of two or more
independent variables:
y = a_0 + a_1 x_1 + a_2 x_2 + \cdots + a_m x_m
Again, the best fit is obtained by minimizing the sum of the squares of the estimate residuals:

S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a_0 - a_1 x_{1,i} - a_2 x_{2,i} - \cdots - a_m x_{m,i})^2

Figure 7 Graphical depiction of multiple linear regression where y is a linear function of x_1 and x_2.
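A minimal MATLAB sketch for two independent variables, assuming hypothetical column vectors x1, x2, and y of equal length:
Z = [ones(size(y)) x1 x2];   % one column per term of y = a0 + a1*x1 + a2*x2
a = (Z'*Z)\(Z'*y);           % solve the normal equations for [a0; a1; a2]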


General Linear Least Squares [Additional topic]


Linear, polynomial, and multiple linear regression all belong to the general linear least-squares
model:
y = a_0 z_0 + a_1 z_1 + a_2 z_2 + \cdots + a_m z_m + e
where z0, z1, …, zm are a set of m+1 basis functions and e is the error of the fit.
The basis functions can be any function of the data, but they cannot contain any of the coefficients a_0, a_1, etc.
Solving General Linear Least Squares Coefficients [Additional topic]
The equation:
y = a_0 z_0 + a_1 z_1 + a_2 z_2 + \cdots + a_m z_m + e
can be re-written for each data point as a matrix equation:
{y} = [Z]{a} + {e}
where {y} contains the dependent data, {a} contains the coefficients of the equation, {e} contains the
error at each point, and [Z] is:
[Z] = \begin{bmatrix} z_{01} & z_{11} & \cdots & z_{m1} \\ z_{02} & z_{12} & \cdots & z_{m2} \\ \vdots & \vdots & \ddots & \vdots \\ z_{0n} & z_{1n} & \cdots & z_{mn} \end{bmatrix}
with z_{ji} representing the value of the jth basis function calculated at the ith point.
Generally, [Z] is not a square matrix, so simple inversion cannot be used to solve for {a}. Instead the
sum of the squares of the estimate residuals is minimized:
S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( y_i - \sum_{j=0}^{m} a_j z_{ji} \right)^2
The outcome of this minimization yields:
[Z]^T [Z] {a} = [Z]^T {y}

Nonlinear regression [Additional topic]


Not all fits are linear equations of coefficients and basis functions.
One method to handle this is to transform the variables and solve for the best fit of the transformed
variables. There are two problems with this method:
– Not all equations can be transformed easily or at all
– The best fit line represents the best fit for the transformed variables, not the original
variables.
Another method is to perform nonlinear regression to directly determine the least-squares fit.

Descriptive Statistics in MATLAB


• MATLAB has several built-in commands to compute and display descriptive statistics.
Assuming some column vector s:
– mean(s), median(s), mode(s)
• Calculate the mean, median, and mode of s. mode is a part of the
statistics toolbox.
– min(s), max(s)
• Calculate the minimum and maximum value in s.
– var(s), std(s)
• Calculate the variance and standard deviation of s
• Note - if a matrix is given, the statistics will be returned for each column.
• [n, x] = hist(s, x)
– Determine the number of elements in each bin of data in s. x is a vector
containing the center values of the bins.
• [n, x] = hist(s, m)
– Determine the number of elements in each bin of data in s using m bins. x
will contain the centers of the bins. The default case is m=10
• hist(s, x) or hist(s, m) or hist(s)
– With no output arguments, hist will actually produce a histogram.
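A short usage sketch of these commands on a hypothetical column vector s:
s = randn(100, 1);          % hypothetical data: 100 normally distributed samples
mean(s), median(s), std(s), var(s)
[n, xc] = hist(s, 8);       % counts n and bin centers xc using 8 bins
hist(s, 8)                  % with no output arguments, plots the histogram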
Linear Regression in MATLAB
• MATLAB has a built-in function polyfit that fits a least-squares nth order polynomial to data:
– p = polyfit(x, y, n)
• x: independent data
• y: dependent data
• n: order of polynomial to fit
• p: coefficients of the polynomial f(x) = p_1 x^n + p_2 x^{n-1} + … + p_n x + p_{n+1}
• MATLAB’s polyval command can be used to compute a value using the coefficients.
– y = polyval(p, x)
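For example, with hypothetical x and y data, a straight-line fit and its evaluation might look like:
p = polyfit(x, y, 1);          % p(1) = slope a1, p(2) = intercept a0
yfit = polyval(p, x);          % model values at the original x points
plot(x, y, 'o', x, yfit, '-')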
• Given x and y data in columns, solve for the coefficients of the best-fit curve y = a_0 + a_1 x + a_2 x^2:
Z = [ones(size(x)) x x.^2]
a = (Z'*Z)\(Z'*y)
– Note also that MATLAB's left-divide will automatically include the [Z]^T terms if the matrix is not square, so
a = Z\y
would work as well.
• To calculate measures of fit:
St = sum((y-mean(y)).^2)
Sr = sum((y-Z*a).^2)
r2 = 1-Sr/St
syx = sqrt(Sr/(length(x)-length(a)))
Nonlinear Regression in MATLAB

• To perform nonlinear regression in MATLAB, write a function that returns the sum of the
squares of the estimate residuals for a fit and then use MATLAB’s fminsearch function to find
the values of the coefficients where a minimum occurs.
• The arguments to the function to compute Sr should be the coefficients, the independent
variables, and the dependent variables.
• Given dependent force data F for independent velocity data v, determine the coefficients for the power-model fit F = a_0 v^{a_1}:
• First - write a function called fSSR.m containing the following:
function f = fSSR(a, xm, ym)
yp = a(1)*xm.^a(2);
f = sum((ym-yp).^2);
• Then, use fminsearch in the command window to obtain the values of a that minimize fSSR:
a = fminsearch(@fSSR, [1, 1], [], v, F)
where [1, 1] is an initial guess for the [a0, a1] vector and [] is a placeholder for the options argument.
• The resulting coefficients will produce the largest r2 for the data and may be different from the coefficients produced by a transformation.

Algorithm for linear regression

The algorithm computes the summations of the data (sums of x, y, x^2, and x·y), solves the normal equations for the slope and intercept, and then evaluates the coefficient of determination; the MATLAB implementation below follows these steps.

MATLAB implementation of linear regression


function [a, r2] = linregr(x,y)
% linregr: linear regression curve fitting
% [a, r2] = linregr(x,y): Least squares fit of straight
% line to data by solving the normal equations
% input:
% x = independent variable
% y = dependent variable
% output:
% a = vector of slope, a(1), and intercept, a(2)
% r2 = coefficient of determination
n = length(x);
if length(y)~=n, error('x and y must be same length'); end
x = x(:); y = y(:); % convert to column vectors
sx = sum(x); sy = sum(y);
sx2 = sum(x.*x); sxy = sum(x.*y); sy2 = sum(y.*y);
a(1) = (n*sxy-sx*sy)/(n*sx2-sx^2);
a(2) = sy/n-a(1)*sx/n;
r2 = ((n*sxy-sx*sy)/sqrt(n*sx2-sx^2)/sqrt(n*sy2-sy^2))^2;
% create plot of data and best fit line
xp = linspace(min(x),max(x),2);
yp = a(1)*xp+a(2);
plot(x,y,'o',xp,yp)
grid on
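A possible usage example with hypothetical data vectors:
x = [10 20 30 40 50 60 70 80];          % hypothetical independent data
y = [25 70 380 550 610 1220 830 1450];  % hypothetical dependent data
[a, r2] = linregr(x, y)                 % a(1) = slope, a(2) = intercept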

Algorithm for polynomial regression


The steps for implementing polynomial and multiple regression follow the same pattern: assemble the normal equations from summations of the data, solve the resulting linear system for the coefficients, and then compute the error measures. A sketch of a possible MATLAB implementation is given below.
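The following is a minimal sketch rather than the original pseudocode; the hypothetical function polyregr fits an mth-order polynomial by building the normal-equation matrix from sums of powers of x:
function a = polyregr(x, y, m)
% polyregr: minimal sketch of a least-squares fit of an mth-order polynomial
% x, y = data vectors; a = coefficients [a0; a1; ...; am]
n = length(x);
if length(y) ~= n, error('x and y must be same length'); end
x = x(:); y = y(:);                    % convert to column vectors
A = zeros(m+1); b = zeros(m+1, 1);
for i = 0:m
    b(i+1) = sum((x.^i).*y);           % right-hand side: sum of x^i * y
    for j = 0:m
        A(i+1, j+1) = sum(x.^(i+j));   % matrix entries: sums of powers of x
    end
end
a = A\b;                               % solve the normal equations
end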


1. Chapra examples 17.1, 17.2, 17.4
2. Chapra chapter 17, exercises 17.1-17.6

Prepared by Shahadat Hussain Parvez