
Regression Analysis

Introduction:
Regression analysis describes the linear relationship between two or more variables. One of them is
the dependent variable and the others are independent variables. For example, sales depend on
promotional expense. Using regression analysis, it is possible to predict sales for a given
promotional expense.

Regression Model:
Simple Linear Regression Model:
A linear relationship between two variables, one independent and the other
dependent, is called a simple linear regression model. For example, demand may be
modelled as a linear function of price.

Symbolically we write the sample regression line as follows

Y= a+bX

Where

Y = dependent variable

X = independent variable

a = constant term or intercept of line

b= regression coefficient or slope of the line

Multiple Linear Regression Model:

A relationship among more than two variables, one of them dependent and the others
independent, is called a multiple linear regression model.

Y = b0 + b1X1 + b2X2 + b3X3

Where

Y = dependent variable

X1 = independent variable

X2 = independent variable
X3 = independent variable

b0 = constant

b1 = regression coefficient of X1

b2 = regression coefficient of X2

b3 = regression coefficient of X3

Some Applications of Regression:


i. Relationship between Advertisement expenditure and sales
ii. Relationship between heights and weights
iii. Relationship between saving and income
iv. Relationship between prices and demand
v. Relationship between prices, demand and income
vi. Relationship between sales, sales promotional expense and prices


Application
1. A study was made to determine the relationship between advertising expenditure (X) and the increase in
sales (Y). The following data were obtained.

X Y XY X²

40 85 3400 1600

20 100 2000 400

25 95 2375 625

20 65 1300 400

30 175 5250 900

50 140 7000 2500

40 190 7600 1600

20 120 2400 400

50 260 13000 2500

40 225 9000 1600

25 180 4500 625

50 210 10500 2500

Total: ΣX = 410, ΣY = 1845, ΣXY = 68325, ΣX² = 15650

(i). Find the regression line to predict increase in sales from advertisement expenditure.

(ii). Estimate the increase in sales when advertising expenditure is Rs 35.

Solution:

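The regression line of Y on X is

Y = a + bX

where

$$b = \frac{\sum XY - (\sum X)(\sum Y)/n}{\sum X^2 - (\sum X)^2/n}, \qquad a = \bar{Y} - b\bar{X}$$

After putting the totals from the table (n = 12) into the formula, b is

b = (68325 - (410)(1845)/12) / (15650 - (410)²/12) = 5287.5 / 1641.67 = 3.22

a = 153.75 - (3.2208)(34.1667) = 43.71

Y = 43.71 + 3.22X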
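The calculation above can be checked with a short script. This is a minimal sketch in plain Python, assuming the twelve (X, Y) pairs from the table; the function name `fit_line` is chosen here for illustration only. It also evaluates the fitted line at X = 35, which is part (ii) of the question.

```python
# Advertising expenditure (X) and increase in sales (Y) from the table.
x = [40, 20, 25, 20, 30, 50, 40, 20, 50, 40, 25, 50]
y = [85, 100, 95, 65, 175, 140, 190, 120, 260, 225, 180, 210]


def fit_line(x, y):
    """Least-squares intercept a and slope b, using the sums form of the formulas."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    a = sum_y / n - b * sum_x / n
    return a, b


a, b = fit_line(x, y)
prediction = a + b * 35  # estimated increase in sales at X = 35
print(f"a = {a:.2f}, b = {b:.2f}, estimate at X = 35: {prediction:.1f}")
```

The estimate uses the unrounded coefficients, so it can differ in the second decimal place from a hand calculation with a = 43.71 and b = 3.22.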

2. Are good grades in college important for earning a good salary? A business statistics student has
taken a random sample of starting salaries and college grade point average for some recently
graduated friends of his. The data follow:

Starting salary ($ thousands): 36, 30, 30, 24, 27, 33, 21, 27

Grade-point average: 4.0, 3.0, 3.5, 2.0, 3.0, 3.5, 2.5, 2.5

i. Fit a regression line to predict starting salary from grade-point average.

3. A landlord is interested in seeing whether his apartment rents are typical. Thus, he has taken a
random sample of 11 rents and apartment sizes of similar apartment complexes. The data
follow:
Rent: 230, 190, 450, 310, 218, 185, 340, 245, 125, 350, 280

No. of bedrooms: 2 ,1, 3, 2, 2, 2, 2, 1, 1, 2, 2

i. Develop an estimating equation that best describes these data.

4. Many small companies buy advertising without considering its effect. “Hamburger wars”
(substantial price rivalry with special “value meals”) have cut the profits of Ethiopian Burgers of
Santa Cruz, California, a small regional chain. The marketing manager is trying to make the case
that “you have to spend money to make money.” Spending on billboard advertisements, in the
manager’s opinion, has a direct effect on sales. There are records for 7 months:

Monthly expenditure on billboards ($1,000): 25, 16, 42, 34, 10, 21, 19

Monthly sales revenue ($100,000): 34, 14, 48, 32, 26, 29, 20

Case Study: (Monthly Sales forecast)


i. A supermarket that has a chain of 15 retail outlets in a city wants to predict the monthly
sales of its entire operation based on the performance of all its retail outlets. The number of
customers who visited each outlet during one typical month was recorded along with
the money value of the purchases they made. Based on the data given below,
answer the following:

a. Identify the dependent and independent variable in this forecast model.

b. Fit a regression model to the data

Super Market Data on its Retail Outlets


Retail outlet: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15

Customers: 1814, 1852, 1012, 1482, 1578, 1778, 1748, 1020, 1058, 840, 1358, 1744, 1848, 1214, 904

Sales (lakh Rs): 22.4, 22.1, 13.68, 18.42, 18.84, 20.16, 18.9, 13.46, 14.48, 12.24, 15.26, 18.86, 18.92, 15.28, 13.84

Multiple Regression Model:


Applications:
1. Sam Spade, owner and general manager of the Campus Stationery Store, is concerned
about the sales behavior of a compact cassette tape recorder sold at the store. He
realizes that there are many factors that might help explain sales, but believes that
advertising and price are major determinants. Sam has collected the following data:

Sales (units sold)    Advertising (number of ads)    Price ($)

33 3 125

61 6 115

70 10 140

82 13 130

17 9 145

24 6 140

(a) Calculate the least squares equation to predict sales from advertising and price.

(b) If advertising is 7 and price is $132, what sales would you predict?

2. The Federal Reserve is performing a preliminary study to determine the relationship
between certain economic indicators and annual percentage change in the gross
national product (GNP). Two such indicators being examined are the amount of the
federal government’s deficit (in billions of dollars) and the Dow Jones Industrial Average
(the mean value over the year). Data for 6 years follow:

Y (change in GNP, %)    X1 (federal deficit, $ billions)    X2 (Dow Jones)

2.5 100 2850

-1.0 400 2100


4.0 120 3300

1.0 200 2400

1.5 180 2550

3.0 80 2700

(a) Calculate the least-squares equation that best describes the data.

(b) What percentage change in GNP would be expected in a year in which the federal
deficit was $240 billion and the mean Dow Jones value was 3000?

Case Study (Fuel Consumption for Car):


A market research firm, on behalf of a client who is planning to develop a new fuel-efficient car, is
studying the fuel consumption of a popular make of car when it runs on unleaded petrol. The
initial diagnostic data were collected from 10 trips of the same distance covered under similar road
conditions using the same car. The hypothesis was that the fuel consumption in km per litre
depends on the average speed in km per hour and the total weight (including passengers
and luggage loaded in the car, measured in 100 kg).

Initial Diagnostic data


Trip: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10

Km per litre: 14, 12, 15, 16, 12, 16, 13, 14, 13, 11

Average speed (km per hour): 50, 40, 45, 55, 35, 60, 55, 55, 40, 30

Total weight (000 kg): 2, 3, 2, 1.5, 3, 1, 5, 2, 2, 3

Fit a multiple regression model through SPSS by taking fuel consumption as the dependent
variable and average speed and total weight as the independent variables.

Regression Analysis By SPSS


Simple regression model
A tire manufacturing company is interested in removing pollutants from the exhaust at the factory, and
cost is a concern. The company has collected data from other companies concerning the amount of
money spent on environmental measures and the resulting amount of dangerous pollutants released (as
a percentage of total emissions).

Money spent ($ thousands)    Percentage of dangerous pollutants

8.4 35.9

10.2 31.8

16.5 24.7

21.7 25.2

9.4 36.8

8.3 35.8

11.5 33.4

18.4 25.4

16.7 31.4

19.3 27.4

28.4 15.8

4.7 31.5

12.3 28.9

Situation   Variable      No. of variable(s)   Nature         Technique
1           Dependent     1                    Quantitative   Simple regression
            Independent   1                    Quantitative

Model Summary

Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .879   .773       .752                2.919

R: in the case of a simple model, R is the simple correlation coefficient.

R Square: explanatory power of the model. It lies between 0 and 1, and a value closer to 1 indicates a better fit. It is the ratio of
explained variation to total variation.

Std. Error of the Estimate: S.D. of the residuals (S.D. of regression), the square root
of the residual mean square.

ANOVA

Model          Sum of Squares   df   Mean Square   F        Sig.
1 Regression   319.098          1    319.098       37.448   .000
  Residual     93.733           11   8.521
  Total        412.831          12

a Predictors: (Constant), money spent ($ thousand)
b Dependent Variable: percentage of dangerous pollutants

H0: model is insignificant. P-value < 0.05, so H0 is rejected and the model is significant.
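The Model Summary figures follow directly from the ANOVA sums of squares. The sketch below, in plain Python, shows that link; the variable names are chosen here for illustration, and the only inputs are the values printed in the ANOVA table.

```python
import math

# Values copied from the ANOVA table.
ss_regression, ss_residual = 319.098, 93.733
df_regression, df_residual = 1, 11

ss_total = ss_regression + ss_residual
n = df_regression + df_residual + 1  # number of observations (13 companies)

# R Square: ratio of explained variation to total variation.
r_square = ss_regression / ss_total
# Adjusted R Square corrects R Square for the number of predictors.
adjusted_r_square = 1 - (1 - r_square) * (n - 1) / df_residual
# Std. Error of the Estimate: square root of the residual mean square.
std_error = math.sqrt(ss_residual / df_residual)
# F: regression mean square divided by residual mean square.
f_statistic = (ss_regression / df_regression) / (ss_residual / df_residual)

print(f"R Square = {r_square:.3f}, Adjusted R Square = {adjusted_r_square:.3f}")
print(f"Std. Error = {std_error:.3f}, F = {f_statistic:.3f}")
```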

Coefficients

Model                        B (unstandardized)   Std. Error   Beta (standardized)   t        Sig.
1 (Constant)                 40.717               1.998                              20.378   .000
  money spent ($ thousand)   -.782                .128         -.879                 -6.119   .000

a Dependent Variable: percentage of dangerous pollutants

B: for a change of 1 unit in money spent, the pollutant percentage decreases by 0.782.

Beta: for a change of 1 S.D. in money spent, the pollutant percentage decreases
by 0.879 S.D.
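As an illustration of how the unstandardized coefficients are used, the sketch below writes the fitted line as a function and evaluates it. The spending level of $15 thousand is a hypothetical input chosen here, not a figure from the data set, and `predict_pollutants` is an illustrative name. The last lines recompute the t value as B divided by its standard error; because the printed coefficients are rounded, the ratio is only close to the value in the table.

```python
# Coefficients copied from the SPSS Coefficients table.
constant, slope = 40.717, -0.782
slope_std_error = 0.128


def predict_pollutants(money_spent):
    """Predicted percentage of dangerous pollutants for spending in $ thousands."""
    return constant + slope * money_spent


# Hypothetical spending level (assumption for illustration only).
estimate = predict_pollutants(15)

# t statistic of the slope: coefficient divided by its standard error.
t_slope = slope / slope_std_error

print(f"predicted pollutants at $15 thousand: {estimate:.2f}%")
print(f"t for money spent (from rounded values): {t_slope:.2f}")
```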
Multiple regression Model

Sales Forecasting

To measure the effect of advertising and prices, the following data were collected from a consumer
marketing company for the last 10 months. Figures in the following table are in $1000. Fit a regression
model with sales as the dependent variable and the other variables as independent variables, and also find
the correlation coefficient.

Months Sales (Y) Adv. Expense (x1) Prices (x2)

1 300 45 2400

2 350 50 2200

3 400 55 2100

4 500 85 2000

5 400 65 2150

6 450 60 2000

7 420 57 2150

8 550 68 1950

9 450 60 2050

10 500 70 2000
Situation   Variable      No. of variable(s)   Nature         Technique
2           Dependent     1                    Quantitative   Multiple regression
            Independent   >1                   Quantitative

Y: Sales

X1: Adv. Expense

X2: Prices

Model
Y = a + b1X1 + b2X2

Model Summary

Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .953   .908       .881                25.71

a Predictors: (Constant), price in Rs, advertisement expense rs in lacs

R: in the case of multiple regression, R is the multiple correlation between the dependent
variable and the combined effect of the
explanatory variables.

R Square: explanatory power of the model.

ANOVA

Model          Sum of Squares   df   Mean Square   F        Sig.
1 Regression   45533.520        2    22766.760     34.447   .000
  Residual     4626.480         7    660.926
  Total        50160.000        9

a Predictors: (Constant), price in Rs, advertisement expense rs in lacs
b Dependent Variable: sales in units

P-value (Sig. = .000) < 0.05 indicates that the
model is significant for predicting the
dependent variable.

Coefficients

Model                                B (unstandardized)   Std. Error   Beta (standardized)   t        Sig.
1 (Constant)                         1210.077             258.473                            4.682    .002
  advertisement expense rs in lacs   1.672                1.135        .253                  1.473    .184
  price in Rs                        -.419                .096         -.749                 -4.358   .003

a Dependent Variable: sales in units

B: rate of change in Y per
unit increase of X, keeping the
others constant.

Sig.: the value .003 for price indicates that the variable is
significant; the value .184 for advertisement expense indicates that the variable is
insignificant.
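The coefficients in the SPSS output can be reproduced by hand from the ten months of data. The sketch below is one way to do it in plain Python, assuming the closed-form least-squares solution for two independent variables written in terms of sums of squares and cross-products of deviations; the helper names `mean` and `cross` are illustrative.

```python
# Data from the sales forecasting table (10 months).
sales = [300, 350, 400, 500, 400, 450, 420, 550, 450, 500]
adv = [45, 50, 55, 85, 65, 60, 57, 68, 60, 70]
price = [2400, 2200, 2100, 2000, 2150, 2000, 2150, 1950, 2050, 2000]


def mean(values):
    return sum(values) / len(values)


def cross(u, v):
    """Sum of cross-products of deviations from the means."""
    mean_u, mean_v = mean(u), mean(v)
    return sum((ui - mean_u) * (vi - mean_v) for ui, vi in zip(u, v))


s11, s22, s12 = cross(adv, adv), cross(price, price), cross(adv, price)
s1y, s2y, syy = cross(adv, sales), cross(price, sales), cross(sales, sales)

# Solve the two normal equations for b1 and b2, then recover the constant.
det = s11 * s22 - s12 ** 2
b1 = (s1y * s22 - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
a = mean(sales) - b1 * mean(adv) - b2 * mean(price)

# R Square: explained variation divided by total variation.
r_square = (b1 * s1y + b2 * s2y) / syy

print(f"Y = {a:.3f} + {b1:.3f} X1 + {b2:.3f} X2, R Square = {r_square:.3f}")
```

For more than two independent variables the same normal equations are normally solved with matrix methods, which is what a package such as SPSS does internally.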
Correlation

The interdependence of two variables X and Y is called correlation. In other words, two variables
are said to be correlated if they tend to vary simultaneously in some direction. If both variables tend
to increase (or decrease) together, the correlation is said to be direct or positive. If one variable tends to
increase as the other decreases, the correlation is said to be negative or inverse. The product moment coefficient of correlation is

$$r = \frac{\sum XY - (\sum X)(\sum Y)/n}{\sqrt{\left[\sum X^2 - (\sum X)^2/n\right]\left[\sum Y^2 - (\sum Y)^2/n\right]}}$$

Properties of correlation

(i) $r_{xy} = r_{yx}$

(ii) $-1 \le r \le +1$

(iii) $r = \pm\sqrt{b_{xy} \cdot b_{yx}}$, where r takes the common sign of the two regression coefficients


Example

Calculate the product moment coefficient of correlation between X and Y from the following
data:

X Y X² Y² XY

1 2 1 4 2

2 5 4 25 10

3 3 9 9 9

4 8 16 64 32

5 7 25 49 35

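Total: ΣX = 15, ΣY = 25, ΣX² = 55, ΣY² = 151, ΣXY = 88

r = (88 - (15)(25)/5) / √((55 - 15²/5)(151 - 25²/5)) = 13 / √260 ≈ 0.81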
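The same coefficient can be computed from the raw pairs. The sketch below uses the deviations-from-the-mean form, which is algebraically equivalent to the sums form given earlier; `pearson_r` is an illustrative name. It also checks properties (i) and (ii) on this data set.

```python
import math


def pearson_r(x, y):
    """Product moment correlation, computed from deviations about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]
y = [2, 5, 3, 8, 7]

r = pearson_r(x, y)
print(f"r = {r:.3f}")
print("symmetric:", math.isclose(r, pearson_r(y, x)))  # property (i)
print("within [-1, 1]:", -1 <= r <= 1)                 # property (ii)
```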
Example:
Calculate the coefficient of correlation of the following data

Price (X) Demand (Y)

3 25

4 24

5 20

6 20

7 19

8 17

9 16

10 13

11 10

12 6

r ≈ -0.976
Example

The following data refer to two variables promotional expenses (Rs lakhs) and sales (1000 units)
collected in the context of a promotional study. Calculate the correlation coefficient and comment.

Promotional expenses (Rs lakhs)    Sales (1000 units)

7 12

10 14

9 13

4 5

11 15

5 7

3 4

r = 0.9787

Comments: Promotional expense is strongly and positively associated with sales; the correlation is very close to 1.


Example

Calculate the coefficient of correlation from each pair of regression coefficients.

(i) b12 = - 0.1 , b21 = - 0.4

(ii) b13 = 0.27, b31 = 0.6

(iii) b23 = 0.67, b32 = 0.38
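Each pair can be handled with property (iii): r is the square root of the product of the two regression coefficients, carrying their common sign. The following is a minimal sketch in plain Python; the function name and the dictionary keys are chosen here for illustration.

```python
import math


def r_from_regression_coefficients(b_xy, b_yx):
    """r = ±sqrt(b_xy * b_yx), with the common sign of the two coefficients."""
    product = b_xy * b_yx
    if product < 0:
        raise ValueError("the two regression coefficients must have the same sign")
    sign = -1.0 if b_xy < 0 else 1.0
    return sign * math.sqrt(product)


pairs = {
    "r12": (-0.1, -0.4),
    "r13": (0.27, 0.6),
    "r23": (0.67, 0.38),
}
results = {name: r_from_regression_coefficients(*pair) for name, pair in pairs.items()}

for name, value in results.items():
    print(f"{name} = {value:.4f}")
```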

Example

The following data were computed from personnel records of a manufacturing firm.

X = number of years of service, Y = weekly wage rate

n = 23, ∑X = 2433, ∑Y = 4245, ∑X² = 281019, ∑Y² = 841786, ∑XY = 482788.
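When only the totals are available, as here, the sums form of the correlation formula can be applied directly. The sketch below assumes the quantity wanted is the coefficient of correlation between years of service and weekly wage rate; `r_from_sums` is an illustrative name.

```python
import math


def r_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Product moment correlation from the six summary totals."""
    sxy = sum_xy - sum_x * sum_y / n
    sxx = sum_x2 - sum_x ** 2 / n
    syy = sum_y2 - sum_y ** 2 / n
    return sxy / math.sqrt(sxx * syy)


r = r_from_sums(
    n=23,
    sum_x=2433,
    sum_y=4245,
    sum_x2=281019,
    sum_y2=841786,
    sum_xy=482788,
)
print(f"r = {r:.4f}")
```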
Example

Find the coefficient of correlation between persons employed and cloth manufactured in a
textile mill.

Persons employed    Cloth manufactured (’000 yds)

137 23

209 47

113 22

189 40

176 39

200 51

219 49

$$r = \frac{\sum XY - n\bar{X}\bar{Y}}{n S_x S_y}$$

Example

(i) Given $\sum XY = 350$, $n = 10$, $\bar{X} = 5$, $\bar{Y} = 6$,

$S_x^2 = 4$, $S_y^2 = 9$,

Compute coefficient of correlation.

(ii) Given rxy = 0.8 and bxy = 0.45, what would be the value of byx ?

(iii) What would be the value of ryx, if rxy = 0.87 ?
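A worked outline of the three parts follows. It assumes that the values 5 and 6 in part (i) are the means of X and Y, and that the variances are computed with divisor n, so that the formula with $n S_x S_y$ in the denominator applies.

```latex
\begin{align*}
\text{(i)}\quad   & r = \frac{\sum XY - n\bar{X}\bar{Y}}{n S_x S_y}
                    = \frac{350 - (10)(5)(6)}{(10)(2)(3)}
                    = \frac{50}{60} \approx 0.83 \\
\text{(ii)}\quad  & r_{xy}^2 = b_{xy}\, b_{yx}
                    \;\Rightarrow\;
                    b_{yx} = \frac{(0.8)^2}{0.45} = \frac{0.64}{0.45} \approx 1.42 \\
\text{(iii)}\quad & r_{yx} = r_{xy} = 0.87
\end{align*}
```

Part (iii) is property (i) of correlation: the coefficient does not depend on which variable is called X.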

Example
Calculate the correlation coefficient for the following data on supply and demand

Supply Demand

400 50

200 60

700 20

100 70

500 40

300 30

600 10

Correlation by SPSS
Example

The following figures are index numbers of average weekly earnings and prices in a country.

Years Earnings Prices

1955 100 100

1956 108 107

1957 113 109

1958 117 112

1959 112 113

1960 130 114

1961 138 118

1962 143 123

1963 149 125

1964 163 129


Chi-Square Test

Chi-square analysis is widely used in research studies for testing hypotheses involving nominal
data. Nominal data are also known by two other names: categorical data and attribute data. The symbol χ²
designates the chi-square statistic, whose distribution depends on the number of degrees
of freedom (d.f.). A chi-square distribution is a skewed distribution, particularly with smaller d.f. As the
sample size and therefore the d.f. increase, the χ² distribution becomes a symmetrical distribution
approaching normality.

The chi-square test is a nonparametric test. Nonparametric means that no assumption needs to be
made about the form of the original probability distribution from which the samples are drawn. It is a classic
nonparametric test involving data measured on a nominal scale. Please note that all parametric tests
make the assumption that the samples are drawn from a specified or assumed population. Thus,
nonparametric methods are also called “distribution-free” methods.

Conditions for using the Chi-Square

• The sample observations drawn from a population must be independent and random.
• The data must be in frequency (count) form. If the original data are in percentages, they must
be converted into frequencies.
• No expected frequency in any cell/category should be less than 5. If the frequency is less than 5 for a
category, you have to do some regrouping.

Example

An educator is of the opinion that the grades high school students make depend on the amount
of time they spend listening to music. To test this theory, he has randomly given 400 students a
questionnaire. The questionnaire contains two questions: How many hours per week do you listen
to music? What is the average grade for all your classes? The data from the survey are in the following
table. Using a 5 percent significance level, test whether grades and time spent listening to music are
independent or dependent.

Average grade, by hours spent listening to music

Hours A B C D F Total

< 5 hrs 13 10 11 16 5 55

5-10 hrs 20 27 27 19 2 95

11-20 hrs 9 27 71 16 32 155

> 20 hrs 8 11 41 24 11 95

Total 50 75 150 75 50 400


Results:

Crosstabs
hours * average grade Crosstabulation

Count

average grade

A B C D F Total

hours <5 13 10 11 16 5 55

5-10 20 27 27 19 2 95

11-20 9 27 71 16 32 155

> 20 8 11 41 24 11 95

Total 50 75 150 75 50 400

Chi-Square Tests

Value df Asymp. Sig. (2-sided)

Pearson Chi-Square 63.830a 12 .000

Likelihood Ratio 67.534 12 .000

Linear-by-Linear Association 13.160 1 .000

N of Valid Cases 400

a. 0 cells (.0%) have expected count less than 5. The minimum expected
count is 6.88.
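The Pearson chi-square value in the output can be reproduced from the counts in the crosstabulation. The sketch below, in plain Python, computes the expected count of each cell as (row total × column total) / N and sums (observed - expected)² / expected over all cells; `chi_square` is an illustrative name, and the observed counts are those printed in the SPSS crosstabulation.

```python
# Observed counts from the crosstabulation (rows: hours, columns: grades A-F).
observed = [
    [13, 10, 11, 16, 5],   # < 5 hrs
    [20, 27, 27, 19, 2],   # 5-10 hrs
    [9, 27, 71, 16, 32],   # 11-20 hrs
    [8, 11, 41, 24, 11],   # > 20 hrs
]


def chi_square(observed):
    """Return the Pearson chi-square statistic, its d.f. and the expected counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected


statistic, df, expected = chi_square(observed)
min_expected = min(min(row) for row in expected)
print(f"chi-square = {statistic:.3f}, d.f. = {df}, minimum expected count = {min_expected:.2f}")
```

With 12 d.f. the tabulated 5 percent critical value of χ² is 21.026, so a statistic of this size leads to rejecting the hypothesis that grades and listening time are independent, in line with the Sig. value of .000.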

Example
A newspaper publisher, trying to pinpoint his market’s characteristics, wondered whether
newspaper readership in the community is related to readers’ educational achievement. A survey
questioned adults in the area on their level of education and their frequency of readership. The results
are shown in the following table.

Level of Educational Achievement

Frequency of readership   Professional or   College    High school   Did not complete   Total
                          postgraduate      graduate   graduate      high school

Never                     10                17         11            21                 59

Sometimes                 12                23         8             5                  48

Morning or evening        35                38         16            7                  96

Both editions             28                19         6             13                 66

Total                     85                97         41            46                 269

At the 0.05 significance level, does the frequency of newspaper readership in the community differ
according to the readers’ level of education?

Example
An advertising firm is trying to determine the demographics for a new product. They have
randomly selected 75 people in each of 5 different age groups and introduced the product to them. The
results of the survey are given below.

Future activity       Age group
                      18-29   30-39   40-49   50-59   60-69

Purchase frequently   12      18      17      22      32

Seldom purchase       18      25      29      24      30

Never purchase        45      32      29      29      13

Calculate Chi-square of the above data.
