Statistics Review
Measures of Location
Arithmetic mean: the sum of the individual data points ($y_i$) divided by the number of points n:
$\bar{y} = \dfrac{\sum y_i}{n}$
Median: the midpoint of a group of data.
Mode: the value that occurs most frequently in a group of data.
Measures of spread
Standard deviation:
$s_y = \sqrt{\dfrac{S_t}{n-1}}$
where $S_t$ is the sum of the squares of the data residuals:
$S_t = \sum \left(y_i - \bar{y}\right)^2$
Variance:
$s_y^2 = \dfrac{\sum \left(y_i - \bar{y}\right)^2}{n-1} = \dfrac{\sum y_i^2 - \left(\sum y_i\right)^2 / n}{n-1}$
Coefficient of variation:
$c.v. = \dfrac{s_y}{\bar{y}} \times 100\%$
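These measures can be sketched in plain Python; the data values below are illustrative, not from the lecture:

```python
# Location and spread measures from the formulas above (illustrative data).
y = [6.495, 6.595, 6.615, 6.635, 6.485, 6.555]
n = len(y)

ybar = sum(y) / n                        # arithmetic mean
St = sum((yi - ybar) ** 2 for yi in y)   # sum of squared data residuals
sy = (St / (n - 1)) ** 0.5               # standard deviation
var = St / (n - 1)                       # variance
cv = sy / ybar * 100                     # coefficient of variation, in %
print(ybar, sy, var, cv)
```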
Normal distribution
If we have a very large set of data, the histogram often can be approximated by a smooth curve. The
symmetric, bell-shaped curve superimposed on Fig. 1 is one such characteristic shape—the normal
distribution. Given enough additional measurements, the histogram for this particular case could
eventually approach the normal distribution.
Figure 1
Prepared BY
Shahadat Hussain Parvez
EEE 305 Lecture 10: Regression
The simplest example of a least-squares approximation is fitting a straight line to a set of paired
observations: (x1, y1), (x2, y2), . . . , (xn, yn). The mathematical expression for the straight line is
𝑦 = 𝑎0 + 𝑎1 𝑥 + 𝑒
where a0 and a1 are coefficients representing the intercept and the slope, respectively, and e is the
error, or residual, between the model and the observations, which can be represented by
rearranging the last equation as
𝑒 = 𝑦 – 𝑎0 – 𝑎1 𝑥
Thus, the error, or residual, is the discrepancy between the true value of y and the approximate
value, a0 + a1x, predicted by the linear equation.
One strategy for fitting a “best” line through the data would be to minimize the sum of the residual
errors for all the available data, as in
$\sum_{i=1}^{n} e_i = \sum_{i=1}^{n} \left(y_i - a_0 - a_1 x_i\right)$   (ii)
where n = total number of points. However, this is an inadequate criterion, as illustrated by Fig. 2a, which depicts the fit of a straight line to two points. Obviously, the best fit is the line connecting the points. However, any straight line passing through the midpoint of the connecting line (except a perfectly vertical line) results in a minimum value of Eq. (ii) equal to zero because the errors cancel.
Figure 2 Examples of some criteria for “best fit” that are inadequate for regression: (a) minimizes the sum of the residuals, (b) minimizes the sum of the absolute values of the residuals, and (c) minimizes the maximum error of any individual point.
Therefore, another logical criterion might be to minimize the sum of the absolute values of the discrepancies, as in
$\sum_{i=1}^{n} |e_i| = \sum_{i=1}^{n} \left|y_i - a_0 - a_1 x_i\right|$
Figure 2b demonstrates why this criterion is also inadequate. For the four points shown, any straight line falling within the dashed lines will minimize the sum of the absolute values. Thus, this criterion also does not yield a unique best-fit line.
A third strategy for fitting a best line is the minimax criterion. In this technique, the line is chosen
that minimizes the maximum distance that an individual point falls from the line. As depicted in Fig.
2c, this strategy is ill-suited for regression because it gives undue influence to an outlier, that is, a
single point with a large error.
A strategy that overcomes the shortcomings of the aforementioned approaches is to minimize the
sum of the squares of the residuals between the measured y and the y calculated with the linear
model
$S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(y_i - a_0 - a_1 x_i\right)^2$   (iii)
This criterion has a number of advantages, including the fact that it yields a unique line for a given
set of data.
Least-Squares Fit of a Straight Line
If we model the data with the straight line
$y = a_0 + a_1 x$
the values of $a_1$ and $a_0$ can be found using the formulas
$a_1 = \dfrac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}$
$a_0 = \bar{y} - a_1 \bar{x}$
where $\bar{y}$ and $\bar{x}$ are the means of y and x, respectively.
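A minimal pure-Python sketch of these two formulas; the (x, y) pairs below are illustrative, not the lecture's own example data:

```python
# Least-squares straight-line fit using the slope/intercept formulas above.
x = [1, 2, 3, 4, 5, 6, 7]
y = [0.5, 2.5, 2.0, 4.0, 3.5, 6.0, 5.5]
n = len(x)

Sx, Sy = sum(x), sum(y)
Sxy = sum(xi * yi for xi, yi in zip(x, y))
Sxx = sum(xi * xi for xi in x)

a1 = (n * Sxy - Sx * Sy) / (n * Sxx - Sx ** 2)  # slope
a0 = Sy / n - a1 * Sx / n                       # intercept = ybar - a1 * xbar
print(a0, a1)   # ≈ 0.0714 and 0.8393
```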
Example
Find the straight line that fits the points (xi, yi).
Quantification of Error
Recall for a straight line, the sum of the squares of the estimate residuals:
$S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(y_i - a_0 - a_1 x_i\right)^2$
Figure 3 The residual in linear regression represents the vertical distance between a data point and the straight line.
Standard error of the estimate
$s_{y/x} = \sqrt{\dfrac{S_r}{n-2}}$
The figure below shows regression data with (a) the spread of the data around the mean of the dependent variable and (b) the spread of the data around the best-fit line. The reduction in spread in going from (a) to (b), as indicated by the bell-shaped curves at the right, represents the improvement due to linear regression.
Figure 4
The coefficient of determination r2 is the difference between the sum of the squares of the data
residuals and the sum of the squares of the estimate residuals, normalized by the sum of the squares
of the data residuals:
$r^2 = \dfrac{S_t - S_r}{S_t}$
• r2 represents the percentage of the original uncertainty explained by the model.
• For a perfect fit, Sr=0 and r2=1.
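These error measures can be sketched in pure Python. The data below are illustrative, and a0, a1 are assumed to come from a prior least-squares fit to those points:

```python
# Error quantification for a straight-line fit: St, Sr, s_{y/x}, and r^2.
x = [1, 2, 3, 4, 5, 6, 7]
y = [0.5, 2.5, 2.0, 4.0, 3.5, 6.0, 5.5]
n = len(x)
a0, a1 = 0.07142857142857, 0.83928571428571   # coefficients from a prior fit

ybar = sum(y) / n
St = sum((yi - ybar) ** 2 for yi in y)                       # spread about the mean
Sr = sum((yi - a0 - a1 * xi) ** 2 for xi, yi in zip(x, y))   # spread about the line
syx = (Sr / (n - 2)) ** 0.5    # standard error of the estimate
r2 = (St - Sr) / St            # coefficient of determination
print(St, Sr, syx, r2)
```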
Example
Compute the total standard deviation, the standard error of the estimate, and the correlation coefficient for the data in the previous example.
Linearization of Nonlinear Relationships
Three common nonlinear models are:
exponential: $y = \alpha_1 e^{\beta_1 x}$
power: $y = \alpha_2 x^{\beta_2}$
saturation-growth-rate: $y = \alpha_3 \dfrac{x}{\beta_3 + x}$
One option for finding the coefficients for a nonlinear fit is to linearize it. For the three common
models, this may involve taking logarithms or inversion:
Figure 5 (a) The exponential equation, (b) the power equation, and (c) the saturation-growth-rate equation. Parts (d),
(e), and (f) are linearized versions of these equations that result from simple transformations.
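The power-model transformation can be sketched in pure Python: taking base-10 logs gives log y = log α₂ + β₂ log x, so an ordinary straight-line fit on the transformed data recovers both coefficients. The data below are illustrative, generated from y = 2x^1.5:

```python
import math

# Linearized fit of the power model y = alpha2 * x^beta2 (illustrative data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0 * xi ** 1.5 for xi in x]

u = [math.log10(xi) for xi in x]   # transformed abscissa: log10 x
w = [math.log10(yi) for yi in y]   # transformed ordinate: log10 y
n = len(u)
su, sw = sum(u), sum(w)
suw = sum(ui * wi for ui, wi in zip(u, w))
suu = sum(ui * ui for ui in u)

beta2 = (n * suw - su * sw) / (n * suu - su ** 2)   # slope of the log-log line
alpha2 = 10 ** (sw / n - beta2 * su / n)            # 10^(intercept)
print(alpha2, beta2)   # ≈ 2.0 and 1.5
```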
Polynomial Regression
The figure below shows an example of a situation where linear least-squares regression is not a good fit and a higher-order polynomial regression is preferable.
Figure 6 (a) Data that is ill-suited for linear least-squares regression. (b) Indication that a parabola is preferable.
We can extend the idea of linear least-squares regression to derive equations for higher-order regression. The idea is to minimize the sum of the squares of the estimate residuals. For example, suppose that we fit a second-order polynomial, or quadratic:
$y = a_0 + a_1 x + a_2 x^2 + e$
For this case, the sum of the squares of the residuals that needs to be minimized is
$S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(y_i - a_0 - a_1 x_i - a_2 x_i^2\right)^2$
Following the procedure of the previous section, we take the derivative of the last equation with respect to each of the unknown coefficients of the polynomial, as in
$\dfrac{\partial S_r}{\partial a_0} = -2 \sum \left(y_i - a_0 - a_1 x_i - a_2 x_i^2\right)$
$\dfrac{\partial S_r}{\partial a_1} = -2 \sum x_i \left(y_i - a_0 - a_1 x_i - a_2 x_i^2\right)$
$\dfrac{\partial S_r}{\partial a_2} = -2 \sum x_i^2 \left(y_i - a_0 - a_1 x_i - a_2 x_i^2\right)$
These equations can be set to zero and rearranged to develop the following normal equations:
$n\,a_0 + \left(\sum x_i\right) a_1 + \left(\sum x_i^2\right) a_2 = \sum y_i$
$\left(\sum x_i\right) a_0 + \left(\sum x_i^2\right) a_1 + \left(\sum x_i^3\right) a_2 = \sum x_i y_i$
$\left(\sum x_i^2\right) a_0 + \left(\sum x_i^3\right) a_1 + \left(\sum x_i^4\right) a_2 = \sum x_i^2 y_i$
where all summations are from i = 1 through n. Note that the above three equations are linear and
have three unknowns: a0, a1, and a2. The coefficients of the unknowns can be calculated directly
from the observed data.
For this case, we see that the problem of determining a least-squares second-order polynomial is
equivalent to solving a system of three simultaneous linear equations.
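A sketch in pure Python of setting up and solving the three normal equations by naive Gaussian elimination; the data points are illustrative:

```python
# Quadratic least squares via the 3x3 normal equations (illustrative data).
x = [0, 1, 2, 3, 4, 5]
y = [2.1, 7.7, 13.6, 27.2, 40.9, 61.1]
n = len(x)

s = lambda p: sum(xi ** p for xi in x)   # power sums of x
A = [[n,    s(1), s(2)],
     [s(1), s(2), s(3)],
     [s(2), s(3), s(4)]]
b = [sum(yi * xi ** p for xi, yi in zip(x, y)) for p in (0, 1, 2)]

# Forward elimination, then back substitution.
for k in range(3):
    for i in range(k + 1, 3):
        f = A[i][k] / A[k][k]
        A[i] = [aij - f * akj for aij, akj in zip(A[i], A[k])]
        b[i] -= f * b[k]
a = [0.0, 0.0, 0.0]
for i in (2, 1, 0):
    a[i] = (b[i] - sum(A[i][j] * a[j] for j in range(i + 1, 3))) / A[i][i]
print(a)   # ≈ [2.4786, 2.3593, 1.8607]
```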
In general, for an mth-order polynomial this would mean minimizing
$S_r = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(y_i - a_0 - a_1 x_i - a_2 x_i^2 - \cdots - a_m x_i^m\right)^2$
The standard error for fitting an mth order polynomial to n data points is:
$s_{y/x} = \sqrt{\dfrac{S_r}{n-(m+1)}}$
because the mth order polynomial has (m+1) coefficients.
The coefficient of determination r2 is still found using:
$r^2 = \dfrac{S_t - S_r}{S_t}$
Example
Figure 7 Graphical depiction of multiple linear regression where y is a linear function of x1 and x2.
In MATLAB, the standard error of the estimate can be computed as
syx = sqrt(Sr/(length(x)-length(a)))   % length(a) = m+1 fitted coefficients
Nonlinear Regression
• To perform nonlinear regression in MATLAB, write a function that returns the sum of the
squares of the estimate residuals for a fit and then use MATLAB’s fminsearch function to find
the values of the coefficients where a minimum occurs.
• The arguments to the function to compute Sr should be the coefficients, the independent
variables, and the dependent variables.
• Given dependent force data F for independent velocity data v, determine the coefficients for the power fit F = a0*v^a1:
• First - write a function called fSSR.m containing the following:
function f = fSSR(a, xm, ym)
% sum of squared residuals for the power model y = a(1)*x^a(2)
yp = a(1)*xm.^a(2);     % model predictions at the data points
f = sum((ym-yp).^2);    % Sr
• Then, use fminsearch in the command window to obtain the values of a that minimize fSSR:
a = fminsearch(@fSSR, [1, 1], [], v, F)
where [1, 1] is an initial guess for the [a0, a1] vector, [] is a placeholder for the options, and v and F are the data passed through to fSSR.
• The resulting coefficients will produce the largest r2 for the data and may be different from the coefficients produced by a linearizing transformation.
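The objective being minimized can be sketched in pure Python; the v/F data below are illustrative, generated without noise from F = 0.3·v^1.8, so the generating coefficients give zero residual:

```python
# Pure-Python version of the SSR objective that fminsearch minimizes.
def f_ssr(a, xm, ym):
    # sum of squared residuals for the power model y = a[0] * x^a[1]
    yp = [a[0] * x ** a[1] for x in xm]
    return sum((y - p) ** 2 for y, p in zip(ym, yp))

v = [10.0, 20.0, 30.0, 40.0, 50.0]
F = [0.3 * vi ** 1.8 for vi in v]   # illustrative, noise-free data

print(f_ssr([0.3, 1.8], v, F))   # 0.0 at the generating coefficients
print(f_ssr([1.0, 1.0], v, F))   # much larger away from them
```

In Python this function would then be handed to a general-purpose minimizer, e.g. scipy.optimize.fmin(f_ssr, [1, 1], args=(v, F)), which uses the same Nelder-Mead simplex algorithm as MATLAB's fminsearch.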