
Regression Analysis

Dr. P.K.Viswanathan, Professor(Analytics)


Great Lakes Institute of Management, Chennai
Illustrated using the Case Problem-Sigma Property
Part I-Simple Linear Regression

Need for Regression

The correlation coefficient gives you just the degree of relationship or association. It
cannot help you estimate or predict the response variable for a given independent
variable. The response variable is called the dependent variable.

In the present problem involving Sigma Property, % Occupancy is the independent
variable and Revenue is the dependent variable: Revenue depends on % Occupancy.
Using regression analysis, it is possible to predict revenue for a given % occupancy.

The following visual succinctly gives the objectives of regression.

Regression Model

Simple Linear Regression Model: In this model, the dependent variable is a linear function
of one independent variable. For the present case, Revenue may be structured as a linear
function of % occupancy. Based on sample data collected for the dependent and
independent variables, a model is postulated connecting the dependent variable with the
independent variable in the form of a linear equation. Symbolically, we write the sample
regression line as follows:

$$\hat{y} = b_0 + b_1 x_1$$

where
ŷ is the estimate of the dependent variable (revenue)
x1 is the independent variable (% occupancy)
b0 and b1 are determined by the statistical method of least squares. b1 is called the regression
coefficient (slope) and b0 is the constant term (intercept).

Historical Perspective

For the sake of completeness, it is worth pointing out that the estimates of b0 and b1
obtained by the method of least squares are called 'Best Linear Unbiased Estimates' (BLUE).
This property was established by Gauss and Markov (also spelled Markoff) in the context of
General Linear Models, which cover Multiple Linear Regression as well.

Values of b0 and b1 in the case of simple linear regression model

The values of b0 and b1 are obtained by solving the normal equations that are given
below:

$$\sum y = n b_0 + b_1 \sum x_1$$

$$\sum x_1 y = b_0 \sum x_1 + b_1 \sum x_1^2$$

Here n denotes the sample size.

Solving these two normal equations, you will find

$$b_1 = \frac{\sum (x_1 - \bar{x}_1)(y - \bar{y})}{\sum (x_1 - \bar{x}_1)^2}$$

$$b_0 = \bar{y} - b_1 \bar{x}_1$$
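The notes do not show the intermediate algebra. As a short sketch of how the two formulas follow from the normal equations, divide the first normal equation by n to obtain b0, then substitute it into the second:

$$
\begin{aligned}
\sum y = n b_0 + b_1 \sum x_1
  &\;\Rightarrow\; b_0 = \bar{y} - b_1 \bar{x}_1 \\
\sum x_1 y = (\bar{y} - b_1 \bar{x}_1) \sum x_1 + b_1 \sum x_1^2
  &\;\Rightarrow\; b_1 = \frac{\sum x_1 y - n \bar{x}_1 \bar{y}}{\sum x_1^2 - n \bar{x}_1^2}
   = \frac{\sum (x_1 - \bar{x}_1)(y - \bar{y})}{\sum (x_1 - \bar{x}_1)^2}
\end{aligned}
$$

The last equality uses the identities $\sum (x_1 - \bar{x}_1)(y - \bar{y}) = \sum x_1 y - n \bar{x}_1 \bar{y}$ and $\sum (x_1 - \bar{x}_1)^2 = \sum x_1^2 - n \bar{x}_1^2$.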

How does Simple Linear Regression work in practice?

To understand the nitty-gritty of simple regression, let us take the present problem for
which we give below the relevant data.
Revenue ($)    % Occupancy
514,440        65.7
463,115        61.1
598,182        78.2
454,924        65.4
453,803        63.5
502,228        70.6
626,262        81.2
498,703        72.0
514,458        72.9
623,291        81.7
454,768        62.1
385,573        53.4

You postulate the model for the population in the standard form as follows:

$$Y = \beta_0 + \beta_1 X_1$$

Y is the Revenue measured in $, β0 is the intercept, and β1 is the slope corresponding to the
independent variable X1 (% Occupancy).

The estimated regression model used to test the population model is

$$\hat{y} = b_0 + b_1 x_1$$

where ŷ is the estimated dependent variable (Revenue),
x1 is the independent variable (% Occupancy in the sample data), and
b0 and b1 are the intercept and slope to be determined by the statistical method of least squares.
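As an illustration of the least-squares step, the sketch below solves the two normal equations directly for the sample data above. It is a minimal example, not the spreadsheet procedure used in these notes; the function name `solve_normal_equations` is an assumption introduced here. The results should come out close to the coefficients reported in Table 1.

```python
# Sample data from the Sigma Property case (Part I).
revenue = [514440, 463115, 598182, 454924, 453803, 502228,
           626262, 498703, 514458, 623291, 454768, 385573]
occupancy = [65.7, 61.1, 78.2, 65.4, 63.5, 70.6,
             81.2, 72.0, 72.9, 81.7, 62.1, 53.4]


def solve_normal_equations(x, y):
    """Solve the two normal equations for (b0, b1) by Cramer's rule.

        sum(y)   = n*b0      + b1*sum(x)
        sum(x*y) = b0*sum(x) + b1*sum(x^2)
    """
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    det = n * sxx - sx * sx              # determinant of the 2x2 system
    b0 = (sy * sxx - sx * sxy) / det     # intercept
    b1 = (n * sxy - sx * sy) / det       # slope
    return b0, b1


b0, b1 = solve_normal_equations(occupancy, revenue)
print(f"intercept b0 = {b0:.2f}")
print(f"slope     b1 = {b1:.3f}")
```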

The complete output is given below after solving the equations:

SUMMARY OUTPUT (Table 1)

Regression Statistics
Multiple R           0.95806386
R Square             0.91788636
Adjusted R Square    0.909674996
Standard Error       22417.56601
Observations         12

ANOVA
              df    SS             MS             F           Significance F
Regression    1     56175962978    56175962978    111.7824    9.51905E-07
Residual      10    5025472657     502547265.7
Total         11    61201435635

              Coefficients    Standard Error    t Stat          P-value
Intercept     -60376.5        54097.94734       -1.116059834    0.29049627
%Occupancy    8231.778        778.5864232       10.57272182     9.51905E-07

1) Regression Equation

The fitted equation is ŷ = -60376.55 + 8231.78 x1. The equation implies that for every
1 percentage point increase in % occupancy, revenue is estimated to increase by $8231.78 on average.
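To make the use of the fitted equation concrete, here is a minimal sketch that plugs an occupancy level into it. The 75% occupancy figure is a hypothetical input chosen for illustration, and `predict_revenue` is a name introduced here, not part of the original notes.

```python
# Coefficients of the fitted line as quoted in the text (rounded to two decimals).
B0 = -60376.55   # intercept, in $
B1 = 8231.78     # slope, in $ per percentage point of occupancy


def predict_revenue(occupancy_pct):
    """Estimated revenue ($) for a given % occupancy."""
    return B0 + B1 * occupancy_pct


# 75% is a hypothetical occupancy level chosen for illustration.
estimate = predict_revenue(75.0)
print(f"Estimated revenue at 75% occupancy: ${estimate:,.2f}")

# Raising occupancy by one percentage point changes the estimate by the slope.
step = predict_revenue(76.0) - predict_revenue(75.0)
print(f"Change per percentage point: ${step:,.2f}")
```

Predictions of this kind are safest for occupancy levels inside the range observed in the sample (53.4% to 81.7%).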

2) Statistical Validation: Look at the Regression Statistics at the top of the output (Table 1).

Multiple R denotes the correlation coefficient between the two variables, namely %
occupancy and revenue. The value R = 0.9581 shows a very strong positive
correlation between revenue and % occupancy.

R Square (R²) has a value of 0.9179. R² is called the coefficient of determination.
It gives the contribution made by the regression in explaining the variation in the
dependent variable, and is worked out as the ratio of the regression sum of
squares to the total sum of squares. The closer the value of R² is to 1, the better the
model fits the sample data. In our case R² = 0.9179: 91.79% of the variation in
revenue is explained by % occupancy, and the remaining 8.21% is left unexplained
(it is attributed to the error or residual term). So, the fitted model describes the data well.
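The ratio described above can be checked directly from the sums of squares printed in the ANOVA part of Table 1. This is only an arithmetic sketch; the variable names are chosen here for readability.

```python
# Sums of squares from the ANOVA part of Table 1.
ss_regression = 56175962978
ss_residual = 5025472657
ss_total = 61201435635

r_squared = ss_regression / ss_total      # share of variation explained
unexplained = ss_residual / ss_total      # share left to the residual term
multiple_r = r_squared ** 0.5             # size of the correlation in simple regression

print(f"R Square    = {r_squared:.6f}")
print(f"Unexplained = {unexplained:.6f}")
print(f"Multiple R  = {multiple_r:.6f}")
```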

Adjusted R²: When more independent variables are added to the regression model,
the R² value never decreases, even if the added variables have little explanatory power.
It therefore needs to be corrected for the number of independent variables. This is achieved
by Adjusted R², which is computed by the formula

$$\text{Adjusted } R^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}$$

where n is the number of observations and k is the number of independent variables in the
regression equation. Here n = 12 and k = 1. If you substitute and simplify, you get the
adjusted R² value of 0.9097 that is given in the output (Table 1).
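Written out with the unrounded R Square from Table 1, the substitution is:

$$\text{Adjusted } R^2 = 1 - (1 - 0.91788636)\,\frac{12 - 1}{12 - 1 - 1} = 1 - 0.08211364 \times 1.1 \approx 0.909675$$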

Standard Error: The standard error of the estimate (the typical size of a residual around the
fitted line) is given by the square root of the mean square corresponding to the Residual term
in the ANOVA table that follows the Regression Statistics. Here it is the square root of
502547265.7, which is about 22417.57.

Items of interest in the ANOVA output:

• Regression Sum of Squares = $\sum (\hat{y} - \bar{y})^2$ = 56175962978
• Residual Sum of Squares = $\sum (y - \hat{y})^2$ = 5025472657 (same as Error Sum of
Squares)
• Total Sum of Squares = $\sum (y - \bar{y})^2$ = 61201435635
• Mean squares due to regression and error are worked out by dividing each sum of
squares by its corresponding degrees of freedom. The F statistic is the ratio of the
regression mean square to the residual mean square. Here the
calculated F = 111.7824.
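The mean squares, the F statistic, and the standard error can be reproduced from the sums of squares and degrees of freedom in Table 1. The snippet below is a small arithmetic sketch with variable names chosen here.

```python
import math

# Sums of squares and degrees of freedom from the ANOVA part of Table 1.
ss_regression, df_regression = 56175962978, 1
ss_residual, df_residual = 5025472657, 10

ms_regression = ss_regression / df_regression   # mean square due to regression
ms_residual = ss_residual / df_residual         # mean square due to error
f_statistic = ms_regression / ms_residual       # ratio of the two mean squares
standard_error = math.sqrt(ms_residual)         # standard error of the estimate

print(f"MS regression  = {ms_regression:.1f}")
print(f"MS residual    = {ms_residual:.1f}")
print(f"F              = {f_statistic:.4f}")
print(f"Standard error = {standard_error:.2f}")
```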
Null Hypothesis: There is no linear relationship between Y and X1 in the population
regression line, that is, the slope is zero (β1 = 0).

Alternative Hypothesis: There is a linear relationship between Y and X1 in the population
regression line (β1 is not 0).

The P-value of the F test (Significance F = 9.51905E-07) is less than α (5%), and hence the
null hypothesis is rejected. The conclusion is that revenue is linearly related to % occupancy
at the 5% level of significance.
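With a single independent variable, the F statistic equals the square of the slope's t statistic, which is why the P-value of %Occupancy and Significance F are identical in Table 1. The sketch below checks this and applies the decision rule; it takes the P-value from the table rather than computing it.

```python
# Values read from Table 1.
slope, slope_se = 8231.778, 778.5864232   # %Occupancy coefficient and its standard error
f_statistic = 111.7824                    # ANOVA F
significance_f = 9.51905e-07              # P-value of the F test
alpha = 0.05                              # 5% level of significance

t_stat = slope / slope_se                 # t statistic of the slope
reject_null = significance_f < alpha      # decision rule: reject H0 when P-value < alpha

print(f"t = {t_stat:.4f}, t squared = {t_stat ** 2:.4f}, F = {f_statistic}")
print("Reject the null hypothesis" if reject_null else "Do not reject the null hypothesis")
```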

Things to do in a Simple Linear Regression Model

• Postulate the model $\hat{y} = b_0 + b_1 x_1$.
• Enter the sample data for x and y in spreadsheet form.
• Perform the Regression Analysis and get the summary output.
• Write the Regression Equation using the intercept and the coefficient of x from the
summary output. Predict y for a given x.
• Validate the model statistically by looking at R² as well as the F statistic in the
ANOVA, which tests the null hypothesis of no linear relationship in the population.
• After statistical validation, use the model for estimation/prediction.
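The checklist above can be sketched end to end in a few lines. This is one possible arrangement under stated assumptions, not the spreadsheet workflow the notes use: `fit_simple_regression` is a hypothetical helper, and the 70% occupancy used for the prediction is an illustrative input.

```python
def fit_simple_regression(x, y):
    """Fit y = b0 + b1*x by least squares and return the key summary numbers."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_residual = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_regression = ss_total - ss_residual
    return {
        "b0": b0,
        "b1": b1,
        "r_squared": ss_regression / ss_total,
        "f": ss_regression / (ss_residual / (n - 2)),  # 1 and n-2 degrees of freedom
    }


# Step 2: enter the sample data (Part I of the case).
occupancy = [65.7, 61.1, 78.2, 65.4, 63.5, 70.6,
             81.2, 72.0, 72.9, 81.7, 62.1, 53.4]
revenue = [514440, 463115, 598182, 454924, 453803, 502228,
           626262, 498703, 514458, 623291, 454768, 385573]

# Steps 3 and 4: run the regression and write the equation.
summary = fit_simple_regression(occupancy, revenue)
print(f"y_hat = {summary['b0']:.2f} + {summary['b1']:.2f} * x1")

# Step 5: validate with R Square and F.
print(f"R Square = {summary['r_squared']:.4f}, F = {summary['f']:.2f}")

# Step 6: predict for a hypothetical 70% occupancy.
print(f"Predicted revenue at 70%: {summary['b0'] + summary['b1'] * 70.0:,.0f}")
```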

Regression Analysis - Sigma Property


Part II-Multiple Linear Regression

Multiple Linear Regression is an extension of the simple linear regression model in
which there is more than one independent variable. In the present context
of Sigma Property, the model uses two independent variables: % Individual Occupancy
and % Group Occupancy.

You postulate the model for the population in the standard form as follows:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2$$

Y is the Revenue measured in $, β0 is the intercept, β1 is the slope corresponding to the
independent variable X1 (% Individual Occupancy), and β2 is the slope corresponding to the
independent variable X2 (% Group Occupancy).

The estimated regression model may be written as

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2$$

where

ŷ is the estimate of the dependent variable (revenue)
x1 is the independent variable % individual occupancy
x2 is the independent variable % group occupancy
b0, b1, and b2 represent the intercept and the slopes of the two independent variables,
respectively.

The relevant data is given below:

Revenue ($)    % Individual Occupancy    % Group Occupancy
514,440        42.30                     23.44
463,115        36.82                     24.32
598,182        45.40                     32.79
454,924        38.78                     26.67
453,803        42.31                     21.22
502,228        40.65                     29.41
626,262        40.00                     39.76
498,703        37.66                     33.10
514,458        37.49                     34.20
623,291        41.96                     38.68
454,768        34.29                     27.32
385,573        36.04                     16.52

Output (Table 2)

Regression Statistics
Multiple R           0.963021796
R Square             0.92741098
Adjusted R Square    0.911280087
Standard Error       22217.49119
Observations         12

ANOVA
              df    SS          MS             F           Significance F
Regression    2     5.68E+10    28379441702    57.49285    7.48E-06
Residual      9     4.44E+09    493616914.7
Total         11    6.12E+10

                          Coefficients    Standard Error    t Stat          P-value     Lower 95%    Upper 95%
Intercept                 -110027.553     83255.78          -1.321560533    0.218921    -298365      78310.25
% Individual Occupancy    9649.624857     2166.151          4.454733798     0.001589    4749.448     14549.8
% Group Occupancy         8171.183638     983.4503          8.308690357     1.63E-05    5946.463     10395.9

1) Regression Equation

The fitted equation is ŷ = -110027.55 + 9649.62 x1 + 8171.18 x2. The equation implies
that for every 1 percentage point increase in individual occupancy (x1), revenue is estimated
to increase by $9649.62 on average, provided x2 is held at the same level. Likewise, for every
1 percentage point increase in group occupancy (x2), revenue is estimated to increase by
$8171.18 on average, provided x1 is held at the same level.
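The phrase "held at the same level" can be illustrated numerically: change one occupancy figure while keeping the other fixed, and the estimate moves by exactly that variable's coefficient. The occupancy mix of 40% individual and 30% group below is a hypothetical input, and `predict_revenue` is a name introduced for this sketch.

```python
# Coefficients of the fitted equation as quoted in the text (rounded to two decimals).
B0, B1, B2 = -110027.55, 9649.62, 8171.18


def predict_revenue(individual_pct, group_pct):
    """Estimated revenue ($) from % individual and % group occupancy."""
    return B0 + B1 * individual_pct + B2 * group_pct


# Hypothetical occupancy mix chosen for illustration.
base = predict_revenue(40.0, 30.0)
effect_individual = predict_revenue(41.0, 30.0) - base   # x2 held fixed
effect_group = predict_revenue(40.0, 31.0) - base        # x1 held fixed

print(f"Base estimate:                  ${base:,.2f}")
print(f"+1 point individual (x2 fixed): ${effect_individual:,.2f}")
print(f"+1 point group (x1 fixed):      ${effect_group:,.2f}")
```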

2) Statistical Validation: Look at the Regression Statistics at the top of the output (Table 2).

Multiple R denotes the multiple correlation coefficient, that is, the correlation between the
observed revenue and the revenue fitted from the two independent variables. The value
R = 0.9630 shows a very strong positive association between revenue and the two
occupancy variables taken together.

R Square (R²) has a value of 0.9274. R² is called the coefficient of determination.
It gives the contribution made by the regression in explaining the variation in the
dependent variable, and is worked out as the ratio of the regression sum of
squares to the total sum of squares. The closer the value of R² is to 1, the better the
model fits the sample data. In our case R² = 0.9274: 92.74% of the variation in
revenue is explained by % individual occupancy and % group occupancy, and the remaining
7.26% is left unexplained (it is attributed to the error or residual term). So, the fitted
model describes the data well.

Adjusted R²: When more independent variables are added to the regression model,
the R² value never decreases, even if the added variables have little explanatory power.
It therefore needs to be corrected for the number of independent variables. This is achieved
by Adjusted R², which is computed by the formula

$$\text{Adjusted } R^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}$$

where n is the number of observations and k is the number of independent variables in the
regression equation. Here n = 12 and k = 2. If you substitute and simplify, you get the
adjusted R² value of 0.9113 that is given in the output (Table 2).
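To see what the correction does, the sketch below applies the formula to the R Square values of both outputs (Table 1 with k = 1, Table 2 with k = 2). The function name `adjusted_r2` is introduced here for illustration.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R Square for n observations and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# R Square values taken from Table 1 (one predictor) and Table 2 (two predictors).
r2_simple, r2_multiple = 0.91788636, 0.92741098

adj_simple = adjusted_r2(r2_simple, n=12, k=1)
adj_multiple = adjusted_r2(r2_multiple, n=12, k=2)

print(f"Simple model:   R2 = {r2_simple:.4f}, adjusted R2 = {adj_simple:.4f}")
print(f"Multiple model: R2 = {r2_multiple:.4f}, adjusted R2 = {adj_multiple:.4f}")
print(f"Gain in R2 = {r2_multiple - r2_simple:.4f}, "
      f"gain in adjusted R2 = {adj_multiple - adj_simple:.4f}")
```

The gain in R Square from the second model is about 0.0095, while the gain in Adjusted R Square is only about 0.0016, which shows how the adjustment penalizes the extra independent variable.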

Standard Error: The standard error of the estimate (the typical size of a residual around the
fitted equation) is given by the square root of the mean square corresponding to the Residual
term in the ANOVA table that follows the Regression Statistics. Here it is the square root of
493616914.7, which is about 22217.49.

Null Hypothesis: There is no linear relationship between Y and the Xs in the population
regression line, that is, all the slopes are zero (β1 = 0 and β2 = 0).

Alternative Hypothesis: There is a linear relationship between Y and the Xs in the
population regression line (at least one of β1 and β2 is not 0).

The P-value of the F test (Significance F = 7.48E-06) is less than α (5%), and hence the null
hypothesis is rejected. The conclusion is that revenue is linearly related to % individual
occupancy and % group occupancy at the 5% level of significance. The P-values for the
individual coefficients (% Individual Occupancy and % Group Occupancy), based on the
t statistics, are 0.001589 and 1.63E-05 respectively. Both are well below 5%, so both
coefficients are statistically significant and
hence they are important predictors for Revenue of Sigma Property.
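Table 2 also reports Lower 95% and Upper 95% limits for each coefficient. As a sketch of where they come from, each limit is the coefficient plus or minus a critical t value times its standard error. The critical value 2.262157 (two-sided 95%, 9 residual degrees of freedom) is taken from standard t tables and is an assumption of this example, not a figure quoted in the notes.

```python
# Two-sided 95% critical value of Student's t with 9 degrees of freedom
# (assumed from standard t tables; 9 = residual df in Table 2).
T_CRIT = 2.262157

# (coefficient, standard error) pairs read from Table 2.
coefficients = {
    "Intercept": (-110027.553, 83255.78),
    "% Individual Occupancy": (9649.624857, 2166.151),
    "% Group Occupancy": (8171.183638, 983.4503),
}


def confidence_interval(estimate, std_error, t_crit=T_CRIT):
    """Return (lower, upper) limits: estimate +/- t_crit * std_error."""
    margin = t_crit * std_error
    return estimate - margin, estimate + margin


intervals = {name: confidence_interval(b, se) for name, (b, se) in coefficients.items()}
for name, (lower, upper) in intervals.items():
    excludes_zero = lower > 0 or upper < 0
    print(f"{name:24s} [{lower:12.2f}, {upper:12.2f}]  excludes 0: {excludes_zero}")
```

An interval that excludes zero corresponds to a coefficient that is significant at the 5% level, which matches the P-values of the two slopes; the intercept's interval contains zero, in line with its P-value of 0.218921.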
