
Lecture 11

Simple Regression

Covariance and Correlation (to see whether a relationship EXISTS)

Statistics for Management Decisions: Regression

Regression analysis enables us to estimate the strength and direction of relations between variables. Specifically, between a dependent variable (Y) and independent variables (x1, x2, etc.)

For example:
- The effect of years of education on income
- The effect of engine size on gas mileage
- The effect of house size on price

Example

Consider the following example comparing the returns of Consolidated Moose Pasture stock (CMP) and the TSX 300 Index. The next slide shows 25 monthly returns.

Example Data

[Data table: 25 monthly percentage returns, x = TSX and y = CMP; individual pairings lost in extraction]

Graph of Data

[Scatter plot of CMP returns against TSX returns, both axes running from -6 to 6]

Example (continued)

From the data, it appears that a positive relationship may exist. Most of the time when the TSX is up, CMP is up; likewise, when the TSX is down, CMP is usually down. Sometimes they move in opposite directions.

Let's graph this data.

Graph of Data

[Scatter plot of CMP returns (vertical axis, -6 to 6) against TSX returns (horizontal axis, -6 to 6)]

Example Summary Statistics

The data do appear to be positively related. Let's derive some summary statistics about these data:

        TSX    CMP
Mean    0.00   0.00
s^2     7.25   6.25
s       2.69   2.50

Observations

Both have means of zero and standard deviations just under 3. However, each data point does not have simply one deviation from the mean; it deviates from both means. Consider points A, B, C and D on the next graph.

Implications

When points in the upper-right and lower-left quadrants dominate, the sum of the products of the deviations will be positive. When points in the lower-right and upper-left quadrants dominate, the sum of the products of the deviations will be negative.

An Important Observation

The sum of the products of the deviations gives us the appropriate sign of the slope of our relationship.

Covariance

Covariance is in the same units as variance (if both variables are in the same units), i.e. units squared. It is a very important element of measuring portfolio risk in finance.

Using Covariance

Covariance is very useful in finance for measuring portfolio risk. Unfortunately, it is hard to interpret, for two reasons:
- What does the magnitude/size imply?
- The units are confusing

Population covariance: sigma_xy = SUM[(x_i - mu_x)(y_i - mu_y)] / N

Sample covariance: s_xy = SUM[(x_i - x-bar)(y_i - y-bar)] / (n - 1)
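The sample covariance formula above can be sketched directly in code. This is a minimal illustration using made-up return values, not the slides' CMP/TSX data:

```python
# Sketch: sample covariance from first principles, on illustrative data
def sample_covariance(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # sum of products of deviations from each mean, divided by n - 1
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

x = [3, -1, 2, 4, -3]   # hypothetical monthly returns
y = [4, -2, 1, 3, -5]
print(round(sample_covariance(x, y), 4))
```

A positive result indicates the upper-right/lower-left quadrants dominate, as described above.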

A More Useful Statistic: The Correlation Coefficient

We can simultaneously adjust for both of these shortcomings by dividing the covariance by the two relevant standard deviations. This operation:
- Removes the impact of size & scale
- Eliminates the units

The correlation coefficient measures the strength of the linear relationship between two variables:
- Coefficient = -1: perfect negative relation
- Coefficient = 0: no relation
- Coefficient = +1: perfect positive relation

Calculating Correlation

Population: rho = sigma_xy / (sigma_x * sigma_y)

Sample: r = s_xy / (s_x * s_y)

Correlation indicates a positive/negative relation between two variables: both variables move together, either in the same direction or in opposite directions. E.g. when one goes up so does the other.
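The "divide covariance by the two standard deviations" operation can be sketched as follows (a minimal illustration; the data values are made up):

```python
import math

# Sketch: sample correlation r = s_xy / (s_x * s_y)
def correlation(x, y):
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((a - x_bar) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - y_bar) ** 2 for b in y) / (n - 1))
    return s_xy / (s_x * s_y)   # unitless, always between -1 and +1
```

For a perfectly linear pair like x = [1, 2, 3], y = [2, 4, 6] this returns 1.0, and reversing y gives -1.0, matching the endpoints described above.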

Example

[Data table: paired X and Y observations with computed columns X^2, Y^2 and XY, and their column totals; individual values scrambled in extraction]

- Create a scatter plot; what type of relationship exists?
- Compute the correlation coefficient
- Test the significance of the correlation coefficient at the 0.05 level

Scatter plot

[Scatter plot of Y against X; X axis from 0 to 30, Y axis from 0 to 25]

Correlation coefficient

In Excel: use the CORREL function, =CORREL(A2:A8,B2:B8)

Significance

Hypothesis test on the true population parameter (rho), estimated from the sample by r:
H0: rho = 0
HA: rho <> 0

Test statistic (n - 2 degrees of freedom): t = r * sqrt(n - 2) / sqrt(1 - r^2)

Significance test

3.563 > 2.571 (t critical value, 5 df). Reject the null hypothesis and conclude that the correlation coefficient is significant (significantly different from 0).

Correlation vs. Regression

Correlation indicates a relation between two variables. Regression indicates causality between an independent and a dependent variable: changes in the independent variables are those causing the change in the dependent variable. We'll start with simple regression (one independent variable) and then look at multiple variables.

Simple regression
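The t-test for a correlation coefficient can be sketched as below. The value r = 0.847 is an assumed input chosen to be consistent with the reported t of about 3.56 at n = 7; it is not stated on the slides:

```python
import math

# Sketch: significance test for a correlation coefficient
def corr_t_stat(r, n):
    # t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r, n = 0.847, 7          # r is an assumed value, for illustration only
t = corr_t_stat(r, n)
t_crit = 2.571           # two-tailed critical value, alpha = 0.05, 5 df
print(round(t, 3), abs(t) > t_crit)
```

Since |t| exceeds the critical value, we reject H0: rho = 0.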

The story

We think that income affects consumption: the more you make, the more you buy. We are looking to study this relationship in more depth:
- Is there indeed a significant effect?
- What is the magnitude of this effect? (We limit our discussion to linear effects)

We'll create and test a regression model of the relationship between consumption and income.

The income/consumption example

Dependent variable (y): consumption
Independent variable (x): income

A scatter plot

[Scatter plot of consumption (0 to 4000) against income (0 to 12,000)]

Simple Linear Regression Model

Premise: there is a true relationship between income and consumption. This relationship can be described in a linear form:

y = beta_0 + beta_1 * x

or, more generally:

y = beta_0 + beta_1 * x + epsilon

where beta_0 is the y-intercept and beta_1 is the slope (= rise/run). Note that both beta_0 and beta_1 are population parameters which are usually unknown and hence estimated from the data.
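The population model above can be made concrete by simulating it. The intercept, slope and noise scale below are made-up illustration values, not estimates from the slides:

```python
import random

# Sketch: simulating y = b0 + b1*x + e for the income/consumption story
random.seed(42)
b0, b1 = 500.0, 0.3                 # assumed population parameters
incomes = [2000, 4000, 6000, 8000, 10000]
# epsilon drawn from a normal distribution with mean 0 (a required condition)
consumption = [b0 + b1 * x + random.gauss(0, 100) for x in incomes]
for x, y in zip(incomes, consumption):
    print(x, round(y, 1))
```

Each simulated point scatters around the true line, which is exactly the pattern regression tries to recover.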

Modeling the linear relationships

With simple linear regression we try to capture the true relationship between the two variables with a single line. The estimated regression model is: y-hat = b0 + b1 * x.

[Scatter plot of consumption against income with a fitted line; the vertical distance from each point to the line is the error (residual, deviation)]

[Scatter plot of traffic fatalities against population (thousands) with a fitted line]

A good regression line will be the one that minimizes the total of the squared errors (SSE).

Understanding the error term

No line can hit all the points in the scatter plot, or even most of the points. The amount we miss by is called the error, or residual. It is the difference between the predicted value (from the regression line) and the true value. More formally: e_i = y_i - y-hat_i.

Regression Analysis

A statistical technique for determining the best-fit line through a series of data. The regression line is the unique line that minimizes the total of the squared deviations (or errors). The statistical term is Sum of Squared Errors, or SSE. This line is called the least squares line.

Finding the line equation

The estimated equation is y-hat = b0 + b1 * x, where:

b1 = (SUM XY - SUM X * SUM Y / n) / (SUM X^2 - (SUM X)^2 / n)

b0 = Y-bar - b1 * X-bar

Example

Data on refining capacity of an oil company's sites:

Obs.    # of sites (X)   Capacity (Y)   XY        X^2
1       13               81.82          1063.66   169
2       10               81.82          818.20    100
3       13               58.18          756.34    169
4       8                43.64          349.12    64
5       5                40.00          200.00    25
6       7                36.36          254.52    49
7       4                34.55          138.20    16
8       8                32.73          261.84    64
9       7                29.09          203.63    49
10      3                25.45          76.35     9
Total   SX = 78          SY = 463.64    SXY = 4121.86   SX2 = 714

Required Conditions - e

- The probability distribution of e is normal
- E(e) = 0
- sigma_e is constant and independent of x, the independent variable
- The value of e associated with any particular value of y is independent of the value of e associated with any other value of y
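The least squares formulas can be applied directly to the summary sums in the table above; the slope matches the 4.787 quoted on the slides (the intercept is computed here for completeness):

```python
# Sketch: least squares coefficients from the oil-refining summary sums
n = 10
sum_x, sum_y = 78, 463.64
sum_xy, sum_x2 = 4121.86, 714

# b1 = (SUM XY - SUM X * SUM Y / n) / (SUM X^2 - (SUM X)^2 / n)
b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
b0 = sum_y / n - b1 * sum_x / n   # intercept from the two means
print(round(b1, 3), round(b0, 3))  # slope rounds to 4.787, as on the slide
```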

The least squares line

For the refining example, y-hat = 9.028 + 4.787x. Meaning: capacity will increase by 4.787 units for every site added.

Understanding and assessing the regression model

Example

The Harris Corporation has recently done a study of homes that have sold in the Detroit area within the past 18 months. Data were recorded for the asking price (x) and the number of weeks (y) each home was on the market before it sold.

Weeks on the Market   Asking Price
23                    $76,500
48                    $102,000
9                     $53,000
26                    $84,200
20                    $73,000
40                    $125,000
51                    $109,000
18                    $60,000
25                    $87,000
62                    $94,000
33                    $76,000
11                    $90,000
15                    $61,000
26                    $86,000
27                    $70,000
56                    $133,000
12                    $93,000

The model to be estimated

Excel: The Regression Tool

Tools > Data Analysis > choose Regression from the dialogue box menu.

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.705948422
R Square             0.498363174
Adjusted R Square    0.464920719
Standard Error       11.96417889
Observations         17

ANOVA
             df   SS            MS            F             Significance F
Regression   1    2133.111647   2133.111647   14.90211089   0.001541086
Residual     15   2147.123648   143.1415765
Total        16   4280.235294

               Coefficients    Standard Error   t Stat         P-value
Intercept      -16.22506178    12.20252667      -1.329647722   0.203501866
Asking Price   0.000528163     0.000136818      3.860325232    0.001541086

The estimated model

y-hat = -16.2251 + 0.00053x

Intercept (b0): -16.2251
Slope (b1): 0.00053

What is the meaning of -16.2251?
Is the effect of asking price on number of weeks significant?
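Excel's coefficients can be reproduced from the raw Harris data with the least squares formulas (a sketch, not a replacement for the Regression tool):

```python
# Sketch: simple-regression coefficients for the Harris housing data
x = [76500, 102000, 53000, 84200, 73000, 125000, 109000, 60000, 87000,
     94000, 76000, 90000, 61000, 86000, 70000, 133000, 93000]
y = [23, 48, 9, 26, 20, 40, 51, 18, 25, 62, 33, 11, 15, 26, 27, 56, 12]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
s_xx = sum((a - x_bar) ** 2 for a in x)
b1 = s_xy / s_xx                   # extra weeks per extra dollar of price
b0 = y_bar - b1 * x_bar            # intercept
print(round(b0, 4), round(b1, 9))  # Excel reports -16.2251 and 0.000528163
```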

Test 1: Testing the Slope

The hypotheses (on the true slope beta_1, estimated by b1):
H0: beta_1 = 0
HA: beta_1 <> 0

We follow a t-test: t = b1 / S_b1, with n - 2 degrees of freedom. Reject H0 if |t| > t[alpha/2, (n - 2) df].

From the regression output:

t = (0.000528163 - 0) / 0.000136818 = 3.86

=> We can conclude that the slope is different from zero.

The standard error of the estimate

The standard error of the estimate (Se or SEE) measures how the data vary around the regression line. It is similar to the concept of standard deviation:

Se = sqrt(SSE / (n - k - 1)), where k = 1 for simple regression.

We would like Se to be small: the smaller it is, the larger the t-statistic is, and the more likely we are to reject the null hypothesis that the slope is zero.
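Both quantities can be checked by plugging in the numbers from the Excel output above:

```python
import math

# Sketch: slope t-statistic and standard error of the estimate,
# using the slides' Excel output (n = 17, k = 1 for simple regression)
b1, s_b1 = 0.000528163, 0.000136818
t = (b1 - 0) / s_b1                 # t Stat for the slope

sse, n, k = 2147.123648, 17, 1      # Residual SS from the ANOVA table
se = math.sqrt(sse / (n - k - 1))   # matches Excel's "Standard Error"
print(round(t, 2), round(se, 4))
```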

Testing the Slope: one-tail tests

If we wish to test for positive or negative linear relationships, we conduct one-tail tests, i.e. our research hypothesis becomes:
H1: beta_1 < 0 (testing for a negative slope), or
H1: beta_1 > 0 (testing for a positive slope)

Of course, the null hypothesis remains H0: beta_1 = 0.

What do we have in the tables?

In the Excel output: "Standard Error" under Regression Statistics is S (the SEE); the Residual SS in the ANOVA table is the SSE; the Coefficients column gives b0 and b1, and the Standard Error column beside it gives S_b0 and S_b1.

Is the slope different from 1?

We can also test against values other than zero. The hypotheses:
H0: beta_1 = 1
HA: beta_1 <> 1

t = (0.000528163 - 1) / 0.000136818

Test 2: model fit

Testing the overall significance of the model:
H0: b1 = b2 = b3 = ... = 0
H1: at least one b is different from zero

We need to see that at least one of our independent variables has a significant effect. Note: we only have b1, so this test should give us the same results as the previous t-test (and we'll see that it does).

The test statistic is an F-ratio. We'll have an ANOVA table (from Excel). The general form for a simple regression (Mean Squares = SS/df):

             Degrees of Freedom   Sum of Squares   Mean Square       F-Statistic
Regression   1                    SSR              MSR = SSR/1       F = MSR/MSE
Residual     n-2                  SSE              MSE = SSE/(n-2)
Total        n-1                  SST

F Ratio

In our example, F = 2133.111647 / 143.1415765 = 14.90 with Significance F = 0.00154 < 0.05: a significant model.
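The F-ratio follows directly from the ANOVA sums of squares in the output above:

```python
# Sketch: overall F-test from the slides' ANOVA sums of squares
ssr, sse = 2133.111647, 2147.123648
n = 17
msr = ssr / 1            # regression mean square (1 df in simple regression)
mse = sse / (n - 2)      # residual mean square (15 df)
f = msr / mse
print(round(f, 2))
```

As the slide notes, with one predictor this test agrees with the slope t-test: F equals the square of the slope's t statistic (3.8603^2 is about 14.90).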

Test 3: R2 - Coefficient of Determination

As we did with analysis of variance, we can partition the variation in y into two parts:

SST = Variation in y = SSR + SSE

- SSE (Sum of Squares Error) measures the amount of variation in y that remains unexplained (i.e. due to error)
- SSR (Sum of Squares Regression) measures the amount of variation in y explained by variation in the independent variable x

R2 tells us the proportion of the variability in the dependent variable that is explained by the independent variable:

R2 = SSR / SST = 2133.111647 / 4280.235294

We would like to see high values (1 is the highest). Note: for simple regression, R-squared is the square of the correlation coefficient (r): R2 = (r)^2.

R2 has a value of .4984. This means 49.84% of the variation in the weeks on market (y) is explained by the variation in the asking price (x). The remaining 50.16% is unexplained, i.e. due to error. Unlike the value of a test statistic, the coefficient of determination does not have a critical value that enables us to draw conclusions. In general, the higher the value of R2, the better the model fits the data.
- R2 = 1: perfect match between the line and the data points
- R2 = 0: there is no linear relationship between x and y
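Both R2 and its square root (Excel's "Multiple R") can be checked from the ANOVA table:

```python
# Sketch: R^2 from the ANOVA sums of squares, and its link to r
ssr, sst = 2133.111647, 4280.235294
r_squared = ssr / sst            # proportion of variation in y explained by x
multiple_r = r_squared ** 0.5    # |r| for simple regression ("Multiple R")
print(round(r_squared, 4), round(multiple_r, 4))
```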

Summary of simple regression output

Learn the relationships between the three tables' components: Multiple R is the correlation coefficient between x and y; R Square is the coefficient of determination; Standard Error is S; Observations is n; the Coefficients column gives b0 and b1, and the Standard Error column gives S_b0 and S_b1.

Confidence and prediction intervals

Prediction

Suppose you wanted to know how many weeks it would take to sell a house priced at $100,000. The regression equation was: y-hat = -16.2251 + 0.00053x

Substitute x = 100,000: y-hat = -16.2251 + 0.00053*(100,000) = 36.7749

(Important side note: pay attention to the units of measurement in the data.)

y-hat = 36.7749 is a point estimate of the number of weeks. But what is the true value? Point estimates are subject to errors.
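The substitution above is just arithmetic on the fitted line, using the rounded coefficients quoted on the slide:

```python
# Sketch: point prediction from the estimated line at x = $100,000
b0, b1 = -16.2251, 0.00053   # rounded coefficients, as on the slide
x = 100_000
y_hat = b0 + b1 * x
print(round(y_hat, 4))       # point estimate of weeks on the market
```

Using the unrounded slope 0.000528163 instead gives about 36.59 weeks, which is the value that appears on the later prediction-interval slide.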

Prediction Interval

We need to construct a prediction interval around this estimate (textbook form, or the equivalent derived form):

y-hat +/- t[alpha/2, n-2] * s_e * sqrt(1 + 1/n + (x_g - x-bar)^2 / SUM(x_i - x-bar)^2)

Scatter plot

[Scatter plot of weeks on market (0 to 70) against asking price ($50,000 to $140,000)]

In our example

[Plot: fitted line with prediction interval bands around it; at x_p = 100,000, y-hat = 36.59126539 weeks]

A different question

Suppose I own several properties in Detroit and price them all at $100,000. What is the expected number of weeks for selling these homes? Instead of predicting an individual value, I am asking for an expected value (i.e. the mean number of weeks).

We can use a confidence interval for the estimation of the mean. The distinction between a confidence interval and a prediction interval is similar to the difference between the CI of the mean vs. the CI of an individual value.

Confidence Interval

y-hat +/- t[alpha/2, n-2] * s_e * sqrt(1/n + (x_g - x-bar)^2 / SUM(x_i - x-bar)^2)
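The two interval formulas differ only by the leading 1 under the square root. A sketch for the Harris data at x = $100,000, taking Se from the Excel output and assuming t_crit = 2.131 (alpha = 0.05 two-tailed, 15 df, from a t table):

```python
import math

# Sketch: 95% prediction vs. confidence interval at x_g = $100,000
x = [76500, 102000, 53000, 84200, 73000, 125000, 109000, 60000, 87000,
     94000, 76000, 90000, 61000, 86000, 70000, 133000, 93000]
n = len(x)
x_bar = sum(x) / n
s_xx = sum((a - x_bar) ** 2 for a in x)

se = 11.96417889                               # SEE from the Excel output
y_hat = -16.22506178 + 0.000528163 * 100_000   # point estimate
t_crit = 2.131                                 # assumed table value, 15 df
xg = 100_000

half_pred = t_crit * se * math.sqrt(1 + 1 / n + (xg - x_bar) ** 2 / s_xx)
half_conf = t_crit * se * math.sqrt(1 / n + (xg - x_bar) ** 2 / s_xx)
print(round(y_hat, 2), round(half_pred, 2), round(half_conf, 2))
```

The confidence half-width is much smaller than the prediction half-width, which is exactly the "narrower than the prediction interval" point made on the next slide.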

In our example

[Plot: confidence interval bands around the fitted line, narrower than the prediction interval]

Note: point, prediction and confidence intervals in Excel are obtained via Add-Ins > Data Analysis Plus > Prediction Interval.

The curve

Both intervals are curved, becoming narrower around the average value of x (x-bar). The closer x_g is to x-bar, the better our estimate, and thus the narrower the interval.

Example

The following summary statistics were obtained from a regression analysis. Provide a 90% CI for the average y, given x_g = 80 (alpha = 0.1, n - 2 = 18 degrees of freedom).

Solution

First compute the standard error of the estimate using SSE. From the t table: t(0.05, 18) = 1.734.

[Remaining worked numbers lost in extraction]

Residual Analysis

Recall that the deviations between the actual data points and the regression line were called residuals. Excel calculates residuals as part of its regression analysis:

[Table: for each observation, the asking price, weeks on market, fitted value, residual, and standardized residual; e.g. x = 60000, y = 18, fitted = 15.465, residual = 2.535, st. resid = 0.230]

We can use these residuals to determine whether the error variable is nonnormal, whether the error variance is constant, and whether the errors are independent.

Regression Diagnostics

There are three conditions that are required in order to perform a regression analysis:
- The error variable must be normally distributed,
- The error variable must have a constant variance, and
- The errors must be independent of each other.

How can we diagnose violations of these conditions? Residual analysis: examine the differences between the actual data points and those predicted by the linear equation.

Nonnormality

We can take the residuals and put them into a histogram to visually check for normality. We're looking for a bell-shaped histogram with the mean close to zero.

Heteroscedasticity

When the requirement of a constant variance is violated, we have a condition of heteroscedasticity. We can diagnose heteroscedasticity by plotting the residuals against the predicted y.

Nonindependence of the Error Variable (for time series data; not in this course)

If we were to observe the number of weeks houses stay on the market for, say, a year, that would constitute a time series. When the data are time series, the errors often are correlated. Error terms that are correlated over time are said to be autocorrelated or serially correlated. We can often detect autocorrelation by graphing the residuals against the time periods. If a pattern emerges, it is likely that the independence requirement is violated.

Nonindependence of the Error Variable

Patterns in the appearance of the residuals over time indicate that autocorrelation exists:

[Plot: residuals over time with runs of positive residuals replaced by runs of negative residuals]
[Plot: residuals over time oscillating around zero]

Heteroscedasticity

If the variance of the error variable is not constant, then we have heteroscedasticity. Here is the plot of the residuals against the predicted value of y:

[Plot: residuals vs. predicted y]

There doesn't appear to be a change in the spread of the plotted points; therefore, no heteroscedasticity.

Outliers

An outlier is an observation that is unusually small or unusually large. E.g. in our houses example the prices range from $53,000 to $133,000; suppose we had a value of $1,000,000. This point is an outlier.

Possible reasons for the existence of outliers include:
- There was an error in recording the value
- The point should not have been included in the sample
- Perhaps the observation is indeed valid

Outliers can be easily identified from a scatter plot. If the absolute value of the standardized residual is > 2, we suspect the point may be an outlier and investigate further. They need to be dealt with, since they can easily influence the least squares line.

Procedure for Regression Diagnostics

1. Develop a model that has a theoretical basis.
2. Gather data for the two variables in the model.
3. Draw the scatter diagram to determine whether a linear model appears to be appropriate. Identify possible outliers.
4. Determine the regression equation.
5. Calculate the residuals and check the required conditions (normality, homoscedasticity, independence).
6. Assess the model's fit (t-test for the slope, the overall F-ratio, R2).
7. If the model fits the data, use the regression equation to predict a particular value (confidence/prediction intervals) of the dependent variable.

Olga Kaminer, 2009
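The "|standardized residual| > 2" screening rule can be sketched for the Harris data. Excel's standardized residuals adjust for leverage; the version below simply scales each residual by Se, which is a rough approximation used here for illustration:

```python
# Sketch: flagging potential outliers by (approximate) standardized residual
x = [76500, 102000, 53000, 84200, 73000, 125000, 109000, 60000, 87000,
     94000, 76000, 90000, 61000, 86000, 70000, 133000, 93000]
y = [23, 48, 9, 26, 20, 40, 51, 18, 25, 62, 33, 11, 15, 26, 27, 56, 12]

se = 11.96417889                                     # SEE from the output
fitted = [-16.22506178 + 0.000528163 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
# rough standardization: residual / Se, flag values beyond +/- 2
suspects = [(xi, yi) for xi, yi, ri in zip(x, y, residuals)
            if abs(ri / se) > 2]
print(suspects)
```

Only the $94,000 house that took 62 weeks to sell is flagged, which matches the observation in the residual table that its standardized residual exceeds 2.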
