
REGRESSION & CORRELATION ANALYSIS

DR. RICK EDGEMAN, PROFESSOR & CHAIR, SIX SIGMA BLACK BELT
DEPARTMENT OF STATISTICS
REDGEMAN@UIDAHO.EDU
OFFICE: +1-208-885-4410

Father of Regression Analysis: Carl F. Gauss (1777-1855)


German mathematician, noted for his wide-ranging contributions to physics, particularly the study of electromagnetism. Born in Braunschweig on April 30, 1777, Gauss studied ancient languages in college, but at the age of 17 he became interested in mathematics and attempted a solution of the classical problem of constructing a regular heptagon, or seven-sided figure, with ruler and compass. He not only succeeded in proving this construction impossible, but went on to give methods of constructing figures with 17, 257, and 65,537 sides. In so doing he proved that the construction, with compass and ruler, of a regular polygon with an odd number of sides was possible only when the number of sides was a prime number of the series 3, 5, 17, 257, and 65,537, or was a multiple of two or more of these numbers. With this discovery he gave up his intention to study languages and turned to mathematics.

He studied at the University of Göttingen from 1795 to 1798; for his doctoral thesis he submitted a proof that every algebraic equation has at least one root, or solution. This theorem, which had challenged mathematicians for centuries, is still called the fundamental theorem of algebra. His volume on the theory of numbers, Disquisitiones Arithmeticae (Inquiries into Arithmetic, 1801), is a classic work in the field of mathematics.

Gauss next turned his attention to astronomy. A faint planetoid, Ceres, had been discovered in 1801, and because astronomers thought it was a planet, they observed it with great interest until losing sight of it. From the early observations Gauss calculated its exact position, so that it was easily rediscovered. He also worked out a new method for calculating the orbits of heavenly bodies. In 1807 Gauss was appointed professor of mathematics and director of the observatory at Göttingen, holding both positions until his death there on February 23, 1855.

Although Gauss made valuable contributions to both theoretical and practical astronomy, his principal work was in mathematics and mathematical physics. In the theory of numbers, he developed the important prime-number theorem. He was the first to develop a non-Euclidean geometry, but Gauss failed to publish these important findings because he wished to avoid publicity. In probability theory, he developed the important method of least squares and the fundamental laws of probability distribution. The normal probability graph is still called the Gaussian curve. He made geodetic surveys and applied mathematics to geodesy. With the German physicist Wilhelm Eduard Weber, Gauss did extensive research on magnetism. His applications of mathematics to both magnetism and electricity are among his most important works; the unit of intensity of magnetic fields is today called the gauss. He also carried out research in optics, particularly in systems of lenses. Scarcely a branch of mathematics or mathematical physics was untouched by Gauss.

Introduction to Regression Analysis


Regression analysis is the most often applied technique of statistical analysis and modeling. In general, it is used to model a response variable (Y) as a function of one or more driver variables (X1, X2, ..., Xp). The functional form used is:

Yi = β0 + β1X1i + β2X2i + ... + βpXpi + εi

Introduction to Regression Analysis


If there is only one driver variable, X, then we usually speak of simple linear regression analysis.

When the model involves (a) multiple driver variables, (b) a driver variable in multiple forms, or (c) a mixture of these, then we speak of multiple linear regression analysis.

The "linear" portion of the terminology refers to the response variable being expressed as a linear combination of the driver variables (more precisely, the model is linear in the β coefficients).

Introduction to Regression Analysis


The ε term in the model is referred to as the random error term and may reflect a number of things, including the general idea that knowledge of the driver variables will not ordinarily lead to perfect reconstruction of the response.

Regression Analysis: Model Assumptions


Model assumptions are stated in terms of the random errors, ε, as follows:
the errors are normally distributed, with mean zero and constant variance σ² that does not depend on the settings of the driver variables, and the errors are independent of one another. This is often summarized symbolically as: ε is NID(0, σ²).

Model Estimation
Ordinarily the regression coefficients (the βs) are of unknown value and must be estimated from sample information. The estimate of a given coefficient, βi, is often symbolized by β̂i. Although there are well-established statistical/mathematical methods for determining these estimates, the calculations are generally tedious and are well-suited to performance by a computer. The resulting estimated model is:

Ŷi = β̂0 + β̂1X1i + β̂2X2i + ... + β̂pXpi

The random error term, εi, is then estimated by the residual ei = Yi - Ŷi.
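As an illustration, here is a minimal least-squares sketch in Python (numpy only; the five X, Y pairs are the first observations from the order-delivery example later in these slides):

```python
import numpy as np

# Sketch: estimate beta0_hat and beta1_hat by least squares.
X = np.array([31.0, 91.0, 13.0, 69.0, 70.0])   # driver variable
Y = np.array([-2.0, 9.0, -4.0, 15.0, 12.0])    # response variable

A = np.column_stack([np.ones_like(X), X])      # design matrix with intercept column
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)   # [beta0_hat, beta1_hat]

Y_hat = A @ coef        # fitted values
e = Y - Y_hat           # residuals: estimates of the random errors
```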

Interval Estimation
Estimates will vary from sample to sample, and it is useful to have estimates of the standard deviations of these estimates, Sβ̂i. These estimated standard deviations tend to be included in the regression output of most statistical packages and can be used in the formation of confidence intervals for the true value of βi, that is:

β̂i ± tα/2, n-(p+1) Sβ̂i

where tα/2, n-(p+1) is the value of Student's t distribution with n-(p+1) degrees of freedom that places a proportion α/2 in the upper tail of the t-distribution.
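As a sketch in Python (scipy), using for concreteness the slope estimate and standard error from the order-delivery example that appears later in these slides:

```python
from scipy import stats

n, p = 50, 1                  # sample size and number of driver variables
alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - (p + 1))   # t_{alpha/2, n-(p+1)}

b1_hat, s_b1 = 0.21538, 0.05340    # slope estimate and its standard error
ci = (b1_hat - t_crit * s_b1, b1_hat + t_crit * s_b1)  # 95% CI for beta1
```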

Interval Estimation: Key Concept


When examining a confidence interval for a particular regression coefficient, βj, we will want to know whether the interval includes the value zero. If zero is included in the interval then, conceivably, βj = 0, which would imply that the model could be simplified by dropping the corresponding term, βjXj, from the model. Otherwise, the corresponding variable, Xj, is considered to be a potentially important predictor or determinant of Y.

Analysis of Variance for Regression


An omnibus or global test of the overall contribution of the set of driver variables to the prediction of the response variable is carried out via the analysis of variance (ANOVA). A summary table for the ANOVA of regression follows:
Source of Variation (SV)   Degrees of Freedom (df)   Sum of Squares (SS)   Mean Square (MS)   Fcalc     Fcrit
Regression                 p                         SSR                   MSR                MSR/MSE   Fα, p, n-(p+1)
Residual                   n-(p+1)                   SSE                   MSE
Total                      n-1                       SST

where Fα, p, n-(p+1) is the value of F with p numerator df and n-(p+1) denominator df that places α in the upper tail of the distribution.

In the ANOVA table we have the following:
SSR = Sum of Squares due to Regression
SSE = Sum of Squares due to Error or Residual
SST = Sum of Squares Total
MSR = SSR/p = Mean Square Regression
MSE = SSE/[n-(p+1)] = Mean Square Error or Residual

The sums of squares are derived from the algebraic identity:

Σ(Yi - Ȳ)² = Σ(Ŷi - Ȳ)² + Σ(Yi - Ŷi)²

That is: SST = SSR + SSE, so that R² = SSR/SST represents the proportion of variation in Y that is explained by the behavior of the driver variables. R² is the coefficient of determination.
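A minimal sketch of this decomposition in Python, assuming y and y_hat are numpy arrays of observed and fitted values:

```python
import numpy as np

def r_squared(y, y_hat):
    """Return (SST, SSR, SSE, R^2) from observed and fitted values."""
    y_bar = y.mean()
    sst = np.sum((y - y_bar) ** 2)        # total variation in Y
    ssr = np.sum((y_hat - y_bar) ** 2)    # variation explained by regression
    sse = np.sum((y - y_hat) ** 2)        # residual (unexplained) variation
    return sst, ssr, sse, ssr / sst       # note: SST = SSR + SSE
```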

ANOVA for Regression Formulas

Regression Diagnostics: The Normality Assumption


Are the residuals (or errors) approximately normally distributed? A variety of methods are available for checking this regression assumption:
Anderson-Darling, Watson, Cramér-von Mises, Kolmogorov-Smirnov, Lilliefors, and chi-square tests; histograms or boxplots; correlation assessment of normal probability plots.
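Several of these tests are available in scipy. A sketch, with simulated stand-in residuals (note that estimating the mean and standard deviation from the data before a Kolmogorov-Smirnov test is exactly the situation the Lilliefors correction addresses):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
e = rng.normal(0.0, 10.0, size=50)   # stand-in residuals for illustration

print(stats.anderson(e, dist='norm'))                                # Anderson-Darling
print(stats.cramervonmises(e, 'norm', args=(e.mean(), e.std(ddof=1))))  # Cramér-von Mises
print(stats.kstest(e, 'norm', args=(e.mean(), e.std(ddof=1))))       # Kolmogorov-Smirnov
```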

Regression Diagnostics: Independence of Errors


Are the errors independent of one another, or autocorrelated? This assumption may be graphically examined by plotting the errors in time sequence and determining whether any patterns exist; a control chart for individuals could be used for this, with all eight PATs appropriate for use. Other plots are also available. This assumption is commonly evaluated via the Durbin-Watson test, which is based on the value of

D = Σ(ei - ei-1)² / SSE   (summing over i = 2, ..., n)

which may range in value from 0 to 4. Tables of lower and upper critical values of D, denoted by dL and dU, respectively, are widely available for significance levels of α = .01 and α = .05. The corresponding autocorrelation coefficient, which may range from -1 to +1, is given by: ra = 1 - (D/2).
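A sketch of the computation in Python, assuming e holds the residuals in time order:

```python
import numpy as np

def durbin_watson(e):
    """Durbin-Watson statistic D = sum_i (e_i - e_{i-1})^2 / SSE, 0 <= D <= 4."""
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Example use: D near 2 suggests no autocorrelation.
# d = durbin_watson(e); r_a = 1 - d / 2   # corresponding autocorrelation
```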

Regression Analysis Example: Timeliness of Order Delivery


The order fulfillment process of a major distribution center is having trouble delivering orders on time. It is conjectured that order volume is the root cause and that problems occur when order volumes are high. A new computer system has been requested to handle the increased volume of orders. Data is on the following slide. A negative response time indicates early delivery. Do the data support the request? Why or why not? What recommendations would you make based on your analysis of these data? What are some other possible solutions to this problem?

Day   X    Y      Day   X    Y      Day   X    Y      Day   X    Y
 1    31   -2     14    66   14     27    88   19     40     3   13
 2    91    9     15    73   10     28    97   12     41    35   25
 3    13   -4     16    45   -2     29    93   19     42    40   15
 4    69   15     17    21  -12     30    72   10     43    64   32
 5    70   12     18     7   -1     31     6   11     44    43   17
 6    64    6     19    69   11     32    55   23     45    30   28
 7    38    7     20    38    5     33    15   12     46    73   33
 8    50    4     21     2  -17     34    10   16     47    46   19
 9    94   23     22    99   18     35    21   20     48    82   32
10    82   24     23    36    8     36    88   43     49    35   22
11    15   -2     24    82    9     37    55   23     50     2    6
12    42   -4     25    58   21     38    27   16
13    27   -6     26    20   -5     39    66   32

ΣX = 2,448    ΣX² = 161,128    ΣY = 639    ΣY² = 15,731    ΣXY = 40,175

X = Order Volume

Y = Average Response Time / Order

NOTE: days are on an M-F rotation.

Descriptive Statistics: Volume, Res.Time

Variable    N   Mean    StDev   Variance   Sum       Sum of Squares
Volume     50   48.96   29.02   842.32     2448.00   161128.00
Res.Time   50   12.78   12.42   154.38      639.00    15731.00

Note:

SX² = (ΣX² - n X̄²)/(n-1) = (161,128 - 50(48.96²))/49 = 842.32, and SX = 29.02
SY² = (ΣY² - n Ȳ²)/(n-1) = (15,731 - 50(12.78²))/49 = 154.38, and SY = 12.42
SXY = (ΣXY - n X̄ Ȳ)/(n-1) = (40,175 - 50(48.96)(12.78))/49 = 181.42

Covariances: Volume, Res.Time

            Volume    Res.Time
Volume     842.325
Res.Time   181.420    154.379

Correlations: Volume, Res.Time
Pearson correlation of Volume and Res.Time = 0.503, P-Value = 0.000

Manual Regression Calculations


First, get the following:
Means: X̄ = 48.96 and Ȳ = 12.78.
Variances, standard deviations, and covariance: SX² = 842.32 (SX = 29.02), SY² = 154.38 (SY = 12.42), SXY = 181.42.

Second, get the correlation coefficient (rXY) and the coefficient of determination (r²). These are:
rXY = SXY/(SX SY) = 181.42/(29.02 × 12.42) = .503
r² = (.503)² = .253

Next, get the estimates of the slope (β̂1), the intercept (β̂0), and the regression equation Ŷ = β̂0 + β̂1X. These are:
β̂1 = SXY/SX² = 181.42/842.32 = 0.215
β̂0 = Ȳ - β̂1X̄ = 12.78 - (0.215)(48.96) = 2.24
Ŷ = β̂0 + β̂1X = 2.24 + 0.215X. BE ABLE TO USE THIS.
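The entire manual calculation can be reproduced from the summary sums alone; a sketch in Python:

```python
n = 50
sum_x, sum_x2 = 2448.0, 161128.0
sum_y, sum_y2 = 639.0, 15731.0
sum_xy = 40175.0

x_bar, y_bar = sum_x / n, sum_y / n                # 48.96, 12.78
s2_x = (sum_x2 - n * x_bar ** 2) / (n - 1)         # 842.32
s2_y = (sum_y2 - n * y_bar ** 2) / (n - 1)         # 154.38
s_xy = (sum_xy - n * x_bar * y_bar) / (n - 1)      # 181.42

r = s_xy / (s2_x ** 0.5 * s2_y ** 0.5)             # 0.503
b1 = s_xy / s2_x                                   # 0.215
b0 = y_bar - b1 * x_bar                            # 2.24
print(f"Y_hat = {b0:.2f} + {b1:.3f} X, r^2 = {r ** 2:.3f}")
```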

Regression Analysis: Res.Time versus Volume


The regression equation is: Res.Time = 2.24 + 0.215 Volume

Predictor   Coef      SE Coef   T      P
Constant    2.235     3.032     0.74   0.465
Volume      0.21538   0.05340   4.03   0.000

S = 10.8493   R-Sq = 25.3%   R-Sq(adj) = 23.8%

NOTE THAT:

r² = SSR/SST = 1914.6/7564.6 = 0.253

So that: r = √r² = 0.503, with the algebraic sign being the same as that for the estimated slope, β̂1.

Analysis of Variance
Source           DF   SS       MS       F       P
Regression        1   1914.6   1914.6   16.27   0.000
Residual Error   48   5650.0    117.7
Total            49   7564.6
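As a check, the F statistic and its critical value can be reproduced with scipy, using the values from the table above:

```python
from scipy import stats

msr, mse = 1914.6, 117.7
f_calc = msr / mse                           # 16.27
f_crit = stats.f.ppf(0.95, dfn=1, dfd=48)    # F_{.05, 1, 48}, about 4.04
print(f_calc > f_crit)                       # True: regression is significant
```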

Unusual Observations
Obs   Volume   Res.Time   Fit     SE Fit   Residual   St Resid
 36     88.0      43.00   21.19     2.59      21.81      2.07R
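The flagged observation can be verified from the simple-regression leverage formula; a sketch in Python using quantities already computed in this example (the leverage formula itself is standard but is not shown on these slides):

```python
n = 50
x_bar, s = 48.96, 10.8493            # mean of X; S = sqrt(MSE) from the output
sxx = (n - 1) * 842.32               # (n-1) * SX^2

x36, resid36 = 88.0, 21.81           # observation 36
h = 1 / n + (x36 - x_bar) ** 2 / sxx           # leverage of observation 36
se_fit = s * h ** 0.5                          # about 2.59
st_resid = resid36 / (s * (1 - h) ** 0.5)      # about 2.07
```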

Residual Plots for Res.Time


[Four-panel residual plots: Normal Probability Plot of the Residuals; Residuals Versus the Fitted Values; Histogram of the Residuals; Residuals Versus the Order of the Data.]

Sβ̂1 = √{MSE / [(n-1)SX²]}
Sβ̂0 = √{MSE [1/n + X̄² / ((n-1)SX²)]}

Verify that these are 0.0534 and 3.032, respectively, then construct and interpret 95% confidence intervals for each regression coefficient (e.g., for the slope and intercept).
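A sketch of this verification, and of the resulting 95% intervals, in Python:

```python
from scipy import stats

n, mse = 50, 117.7
x_bar, s2_x = 48.96, 842.32
b0, b1 = 2.235, 0.21538

s_b1 = (mse / ((n - 1) * s2_x)) ** 0.5                          # 0.0534
s_b0 = (mse * (1 / n + x_bar ** 2 / ((n - 1) * s2_x))) ** 0.5   # 3.032

t_crit = stats.t.ppf(0.975, df=n - 2)       # n-(p+1) = 48 degrees of freedom
ci_slope = (b1 - t_crit * s_b1, b1 + t_crit * s_b1)
ci_intercept = (b0 - t_crit * s_b0, b0 + t_crit * s_b0)
```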

Estimate of and confidence interval for μY|X=X* (the mean of Y given that X is equal to X*)

μ̂Y|X=X* = Ŷ = β̂0 + β̂1X*

SŶ|X=X* = √{MSE [1/n + (X* - X̄)² / ((n-1)SX²)]}

Estimate of and confidence interval for the mean of m new observations at X = X*

μ̂Y|X=X* = Ŷ = β̂0 + β̂1X*

SŶ|X=X* = √{MSE [1/m + 1/n + (X* - X̄)² / ((n-1)SX²)]}
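Both standard errors share the same core term; a sketch in Python, where x_star and m are hypothetical values chosen for illustration:

```python
n, mse = 50, 117.7
x_bar, s2_x = 48.96, 842.32
b0, b1 = 2.235, 0.21538

def se_mean_response(x_star, m=None):
    """Std. error for the estimated mean of Y at X = x_star; if m is given,
    for the mean of m new observations (adds the 1/m term)."""
    core = 1 / n + (x_star - x_bar) ** 2 / ((n - 1) * s2_x)
    if m is not None:
        core += 1 / m
    return (mse * core) ** 0.5

x_star = 70.0                          # hypothetical order volume
y_hat = b0 + b1 * x_star               # estimated mean response at x_star
se_mu = se_mean_response(x_star)       # for the mean of Y
se_m = se_mean_response(x_star, m=5)   # for the mean of 5 new observations
```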

Estimate of and confidence interval for σ² (the error variance)

(n-2)MSE / χ²n-2, big < σ² < (n-2)MSE / χ²n-2, small


where the "big" and "small" values of chi-square are the ones placing α/2 in the upper and lower tails, respectively, of the chi-square distribution with (n-2) degrees of freedom or, more generally, n-(p+1) degrees of freedom. Construct and interpret a 95% confidence interval for the error variance.
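A sketch of this interval in Python, using the MSE and sample size from the order-delivery example:

```python
from scipy import stats

n, mse = 50, 117.7
alpha = 0.05
df = n - 2                                    # n-(p+1) with p = 1

chi2_big = stats.chi2.ppf(1 - alpha / 2, df)  # places alpha/2 in the upper tail
chi2_small = stats.chi2.ppf(alpha / 2, df)    # places alpha/2 in the lower tail

lower = df * mse / chi2_big                   # lower bound for sigma^2
upper = df * mse / chi2_small                 # upper bound for sigma^2
```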

REGRESSION & CORRELATION ANALYSIS


End of Session

DR. RICK EDGEMAN, PROFESSOR & CHAIR, SIX SIGMA BLACK BELT
DEPARTMENT OF STATISTICS
REDGEMAN@UIDAHO.EDU
OFFICE: +1-208-885-4410
