Regression, Correlation Analysis and Chi-Square Analysis

Regression Analysis
Introduction:
Regression analysis studies the linear relationship between two or more variables. One of them is
the dependent variable and the others are independent variables. For example, sales depend on
promotional expense. Using regression analysis, it is possible to predict sales for a given
promotional expense.
Regression Model:
Simple Linear Regression Model:
A linear relationship between two variables, one independent and the other dependent, is called
a simple linear regression model. For example, demand may be modeled as a linear function of price.
Symbolically, we write the sample regression line as follows:
Y = a + bX
where
Y = dependent variable
X = independent variable
a = constant term or intercept of the line
b = regression coefficient or slope of the line
Multiple Linear Regression Model:
A relationship among more than two variables, one of them dependent and the others
independent, is called a multiple linear regression model.
Y = b0 + b1X1 + b2X2 + b3X3
where
Y = dependent variable
X1, X2, X3 = independent variables
b0 = constant
b1, b2, b3 = regression coefficients of X1, X2, X3
Some Applications of Regression:

i. Relationship between Advertisement expenditure and sales
ii. Relationship between heights and weights
iii. Relationship between saving and income
iv. Relationship between prices and demand
v. Relationship between prices, demand and income
vi. Relationship between sales, sales promotional expense and prices

Application
1. A study was made to determine the relation between advertising expenditure (X) and the increase in
sales (Y). The following data were obtained.
X       Y       XY       X²
40      85      3400     1600
20      100     2000     400
25      95      2375     625
20      65      1300     400
30      175     5250     900
50      140     7000     2500
40      190     7600     1600
20      120     2400     400
50      260     13000    2500
40      225     9000     1600
25      180     4500     625
50      210     10500    2500
Total:
410     1845    68325    15650
(i). Find the regression line to predict increase in sales from advertisement expenditure.
(ii). Estimate increase in sales when advertising expenditure is Rs 35/.
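Solution:
The regression line of Y on X is
Y = a + bX
where
$$b = \frac{\sum XY - (\sum X)(\sum Y)/n}{\sum X^2 - (\sum X)^2/n}, \qquad a = \bar{Y} - b\bar{X}$$
Substituting the column totals with n = 12:
b = (68325 - (410)(1845)/12) / (15650 - (410)²/12) = 5287.5 / 1641.67 = 3.22
a = 153.75 - (3.2208)(34.1667) = 43.71
(i) Y = 43.71 + 3.22X
(ii) For X = 35, the estimated increase in sales is Y = 43.71 + 3.22(35) = 156.41.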
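The hand calculation can be checked with a few lines of Python. This is a minimal sketch that applies the same sum-based formulas to the twelve observations in the table; the function name `fit_line` is only an illustrative choice.

```python
# Advertising expenditure (X) and increase in sales (Y) from the table above.
x = [40, 20, 25, 20, 30, 50, 40, 20, 50, 40, 25, 50]
y = [85, 100, 95, 65, 175, 140, 190, 120, 260, 225, 180, 210]


def fit_line(x, y):
    """Return (a, b) of the least-squares line Y = a + bX."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    # Slope: [sum(XY) - sum(X)sum(Y)/n] / [sum(X^2) - (sum X)^2/n]
    b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    # Intercept: mean(Y) - b * mean(X)
    a = sum_y / n - b * sum_x / n
    return a, b


a, b = fit_line(x, y)
print(f"Y = {a:.2f} + {b:.2f} X")
print(f"Estimated increase in sales at X = 35: {a + b * 35:.2f}")
```

Because the code keeps full precision for b, the estimate at X = 35 can differ in the second decimal place from a calculation that uses the rounded coefficients.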
2. Are good grades in college important for earning a good salary? A business statistics student has
taken a random sample of starting salaries and college grade-point averages for some recently
graduated friends of his. The data follow:
Starting salary ($ thousands): 36, 30, 30, 24, 27, 33, 21, 27
Grade-point average: 4.0, 3.0, 3.5, 2.0, 3.0, 3.5, 2.5, 2.5
i. Predict the starting salary from the grade-point average.
3. A landlord is interested in seeing whether his apartment rents are typical. Thus, he has taken a
random sample of 11 rents and apartment sizes of similar apartment complexes. The data
follow:
Rent: 230, 190, 450, 310, 218, 185, 340, 245, 125, 350, 280
No. of bedrooms: 2, 1, 3, 2, 2, 2, 2, 1, 1, 2, 2
i. Develop an estimating equation that best describes these data.
4. Many small companies buy advertising without considering its effect. “Hamburger wars”
(substantial price rivalry with special “value meals”) have cut the profits of Ethiopian Burgers of
Santa Cruz, California, a small regional chain. The marketing manager is trying to make the case
that “you have to spend money to make money.” Spending on billboard advertisements, in the
manager’s opinion, has a direct result on sales. There are records for 7 months:
Monthly expenditure on billboards ($1000): 25, 16, 42, 34, 10, 21, 19
Monthly sales revenue ($100,000): 34, 14, 48, 32, 26, 29, 20
Case Study: (Monthly Sales Forecast)

A supermarket that has a chain of 15 retail outlets in a city wants to predict the monthly
sales of its entire operation based on the performance of all its retail outlets. The number of
customers who visited each outlet during one typical month was recorded along with
the money value of the purchases made by them. Based on the data given below,
answer the following:
a. Identify the dependent and independent variables in this forecast model.
b. Fit a regression model to the data.
Supermarket Data on its Retail Outlets

Retail outlet    Customers    Sales (lakh Rs)
1                1814         22.4
2                1852         22.1
3                1012         13.68
4                1482         18.42
5                1578         18.84
6                1778         20.16
7                1748         18.9
8                1020         13.46
9                1058         14.48
10               840          12.24
11               1358         15.26
12               1744         18.86
13               1848         18.92
14               1214         15.28
15               904          13.84
Multiple Regression Model:

Applications:
1. Sam Spade, owner and general manager of the Campus Stationery Store, is concerned
about the sales behavior of a compact cassette tape recorder sold at the store. He
realizes that there are many factors that might help explain sales, but believes that
advertising and price are major determinants. Sam has collected the following data:
Sales (units sold)    Advertising (number of ads)    Price ($)
33                    3                              125
61                    6                              115
70                    10                             140
82                    13                             130
17                    9                              145
24                    6                              140
(a) Calculate the least squares equation to predict sales from advertising and price.
(b) If advertising is 7 and price is $132, what sales would you predict?
2. The Federal Reserve is performing a preliminary study to determine the relationship
between certain economic indicators and the annual percentage change in the gross
national product (GNP). Two such indicators being examined are the amount of the
federal government’s deficit (in billions of dollars) and the Dow Jones Industrial Average
(the mean value over the year). Data for 6 years follow:
Y (Change in GNP)    X1 (Federal Deficit)    X2 (Dow Jones)
2.5                  100                     2850
-1.0                 400                     2100
4.0                  120                     3300
1.0                  200                     2400
1.5                  180                     2550
3.0                  80                      2700
(a) Calculate the least squares equation that best describes the data.
(b) What percentage change in GNP would be expected in a year in which the federal
deficit was $240 billion and the mean Dow Jones value was 3000?
Case Study (Fuel Consumption for Car):

A market research firm, on behalf of a client who is planning to develop a new fuel-efficient car, is
studying the fuel consumption of a popular make of car when it runs on unleaded petrol. The
initial diagnostic data were collected from 10 trips of the same distance covered under similar road
conditions using the same car. The hypothesis was that the fuel consumption in km per litre
was dependent on the average speed in km per hour and the total weight (including passengers
and luggage loaded in the car, measured in 100 kg).
Initial Diagnostic Data

Trip    Km per litre    Average speed (km/h)    Total weight (000 kg)
1       14              50                      2
2       12              40                      3
3       15              45                      2
4       16              55                      1.5
5       12              35                      3
6       16              60                      1
7       13              55                      5
8       14              55                      2
9       13              40                      2
10      11              30                      3
Fit a multiple regression model in SPSS, taking fuel consumption as the dependent
variable and average speed and total weight as the independent variables.
Regression Analysis By SPSS

Simple regression model
A tire manufacturing company is interested in removing pollutants from the exhaust at the factory, and
cost is a concern. The company has collected data from other companies concerning the amount of
money spent on environmental measures and the resulting amount of dangerous pollutants released (as
a percentage of total emissions).
Money Spent ($ thousands)    Percentage of Dangerous Pollutants
8.4                          35.9
10.2                         31.8
16.5                         24.7
21.7                         25.2
9.4                          36.8
8.3                          35.8
11.5                         33.4
18.4                         25.4
16.7                         31.4
19.3                         27.4
28.4                         15.8
4.7                          31.5
12.3                         28.9
Situation    Variable       Number of variable(s)    Nature          Technique
1            Dependent      1                        Quantitative    Simple regression
             Independent    1                        Quantitative
Model Summary
Model    R       R Square    Adjusted R Square    Std. Error of the Estimate
1        .879    .773        .752                 2.919
Notes on the output:
R: in the case of a simple model, R is the simple correlation coefficient.
R Square: explanatory power of the model. It lies between 0 and 1, and a value closer to 1
indicates a better fit. It is the ratio of explained variation to total variation.
Std. Error of the Estimate: S.D. of the residuals (S.D. of regression), the square root
of the residual mean square.
ANOVA
Model           Sum of Squares    df    Mean Square    F         Sig.
1 Regression    319.098           1     319.098        37.448    .000
  Residual      93.733            11    8.521
  Total         412.831           12
a Predictors: (Constant), money spent ($ thousand)
b Dependent Variable: percentage of dangerous pollutants
H0: model is insignificant. The P-value (Sig.) is less than 0.05, so H0 is rejected and the
model is significant.
Coefficients
                              Unstandardized Coefficients    Standardized Coefficients
Model                         B          Std. Error          Beta                         t         Sig.
1 (Constant)                  40.717     1.998                                            20.378    .000
  money spent ($ thousand)    -.782      .128                -.879                        -6.119    .000
a Dependent Variable: percentage of dangerous pollutants
Notes on the output:
B: for a change of 1 unit in money spent, the pollutant percentage decreases by 0.782.
Beta: for a change of 1 S.D. in money spent, the pollutant percentage decreases by 0.879 S.D.
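The figures in the three SPSS tables can be reproduced directly from the 13 observations. The following is a minimal sketch in plain Python, not a description of how SPSS computes them internally; the variable names are illustrative.

```python
import math

# Money spent ($ thousands) and percentage of dangerous pollutants.
money = [8.4, 10.2, 16.5, 21.7, 9.4, 8.3, 11.5, 18.4, 16.7, 19.3, 28.4, 4.7, 12.3]
pollutants = [35.9, 31.8, 24.7, 25.2, 36.8, 35.8, 33.4, 25.4, 31.4, 27.4, 15.8, 31.5, 28.9]

n = len(money)
mean_x = sum(money) / n
mean_y = sum(pollutants) / n
sxx = sum((x - mean_x) ** 2 for x in money)
syy = sum((y - mean_y) ** 2 for y in pollutants)  # total sum of squares
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(money, pollutants))

b = sxy / sxx                  # unstandardized slope (B)
a = mean_y - b * mean_x        # constant
ss_regression = b * sxy        # explained variation
ss_residual = syy - ss_regression
r_square = ss_regression / syy
adj_r_square = 1 - (1 - r_square) * (n - 1) / (n - 2)
std_error = math.sqrt(ss_residual / (n - 2))   # sqrt of residual mean square
f_stat = ss_regression / (ss_residual / (n - 2))
beta = b * math.sqrt(sxx / syy)                # standardized coefficient

print(f"B = {b:.3f}, constant = {a:.3f}, Beta = {beta:.3f}")
print(f"R Square = {r_square:.3f}, Adjusted R Square = {adj_r_square:.3f}")
print(f"Std. Error of the Estimate = {std_error:.3f}, F = {f_stat:.3f}")
```

In a simple regression the standardized coefficient equals the correlation coefficient, which is why Beta and R agree in absolute value.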
Multiple regression Model
Sales Forecasting
To measure the effect of advertising and prices, the following data were collected from a consumer
marketing company for the last 10 months. Figures in the following table are in $1000. Fit a regression
model with sales as the dependent variable and the other variables as independent variables, and also
find the correlation coefficient.
Month    Sales (Y)    Adv. Expense (X1)    Prices (X2)
1        300          45                   2400
2        350          50                   2200
3        400          55                   2100
4        500          85                   2000
5        400          65                   2150
6        450          60                   2000
7        420          57                   2150
8        550          68                   1950
9        450          60                   2050
10       500          70                   2000
Situation    Variable       Number of variable(s)    Nature          Technique
2            Dependent      1                        Quantitative    Multiple regression
             Independent    >1                       Quantitative
Y: Sales
X1: Adv. Expense
X2: Prices
Model
Y = a + b1X1 + b2X2
Model Summary
Model    R       R Square    Adjusted R Square    Std. Error of the Estimate
1        .953    .908        .881                 25.71
a Predictors: (Constant), price in Rs, advertisement expense rs in lacs
Notes on the output:
R: in the case of multiple regression, R is the multiple correlation between the dependent
variable and the combined effect of the explanatory variables.
R Square: explanatory power of the model.
ANOVA
Model           Sum of Squares    df    Mean Square    F         Sig.
1 Regression    45533.520         2     22766.760      34.447    .000
  Residual      4626.480          7     660.926
  Total         50160.000         9
a Predictors: (Constant), price in Rs, advertisement expense rs in lacs
b Dependent Variable: sales in units
The P-value (Sig. = .000) is less than 0.05, which indicates that the model is significant for
predicting the dependent variable.
Coefficients
                                    Unstandardized Coefficients    Standardized Coefficients
Model                               B           Std. Error         Beta                         t         Sig.
(Constant)                          1210.077    258.473                                         4.682     .002
advertisement expense rs in lacs    1.672       1.135              .253                         1.473     .184
price in Rs                         -.419       .096               -.749                        -4.358    .003
a Dependent Variable: sales in units
Notes on the output:
B: rate of change in Y per unit increase of X, keeping the others constant.
Sig. = .184 (advertisement expense): indicates the variable is insignificant.
Sig. = .003 (price): indicates the variable is significant.
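For two predictors the least-squares coefficients have a closed form in terms of sums of squares and cross-products of deviations from the means. The sketch below applies that closed form to the 10 months of data and should agree with the B column above; the helper names (`mean`, `cross_dev`) are illustrative and are not SPSS terminology.

```python
sales = [300, 350, 400, 500, 400, 450, 420, 550, 450, 500]
adv = [45, 50, 55, 85, 65, 60, 57, 68, 60, 70]
price = [2400, 2200, 2100, 2000, 2150, 2000, 2150, 1950, 2050, 2000]


def mean(values):
    return sum(values) / len(values)


def cross_dev(u, v):
    """Sum of products of deviations from the means of u and v."""
    mu, mv = mean(u), mean(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))


s11 = cross_dev(adv, adv)
s22 = cross_dev(price, price)
s12 = cross_dev(adv, price)
s1y = cross_dev(adv, sales)
s2y = cross_dev(price, sales)

# Solve the two normal equations for b1 and b2 (Cramer's rule).
det = s11 * s22 - s12 ** 2
b1 = (s1y * s22 - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
a = mean(sales) - b1 * mean(adv) - b2 * mean(price)

ss_total = cross_dev(sales, sales)
ss_regression = b1 * s1y + b2 * s2y
r_square = ss_regression / ss_total

print(f"Y = {a:.3f} + {b1:.3f} X1 + {b2:.3f} X2")
print(f"R Square = {r_square:.3f}")
```

With more than two predictors the same normal equations are usually solved in matrix form rather than by hand.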
Correlation
The interdependence of two variables X and Y is called correlation. In other words, two variables
are said to be correlated if they tend to vary together in some direction. If both variables tend
to increase (or decrease) together, the correlation is said to be direct or positive; if one variable tends to
increase as the other variable decreases, the correlation is said to be negative or inverse.
$$r = \frac{\sum XY - (\sum X)(\sum Y)/n}{\sqrt{\left[\sum X^2 - (\sum X)^2/n\right]\left[\sum Y^2 - (\sum Y)^2/n\right]}}$$
Properties of correlation
(i) r_xy = r_yx
(ii) -1 ≤ r ≤ +1
(iii) r = ± (b_xy · b_yx)^(1/2), where r takes the common sign of the two regression coefficients

Example
Calculate the product-moment coefficient of correlation between X and Y from the following
data:
X     Y     X²    Y²     XY
1     2     1     4      2
2     5     4     25     10
3     3     9     9      9
4     8     16    64     32
5     7     25    49     35
Total:
15    25    55    151    88
r = (88 - (15)(25)/5) / ([55 - (15)²/5] [151 - (25)²/5])^(1/2) = 13 / (10 × 26)^(1/2) = 0.81
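When only the column totals are available, the computational formula for r can be applied to them directly. This is a small sketch using the totals from the table; the function name `pearson_r_from_sums` is an illustrative choice.

```python
import math


def pearson_r_from_sums(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy):
    """Product-moment correlation from the column totals of a work table."""
    numerator = sum_xy - sum_x * sum_y / n
    denominator = math.sqrt(
        (sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n)
    )
    return numerator / denominator


r = pearson_r_from_sums(n=5, sum_x=15, sum_y=25, sum_x2=55, sum_y2=151, sum_xy=88)
print(f"r = {r:.3f}")
```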
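Example
Calculate the coefficient of correlation for the following data.
Price (X)    Demand (Y)
3            25
4            24
5            20
6            20
7            19
8            17
9            16
10           13
11           10
12           6
Here n = 10, ∑X = 75, ∑Y = 170, ∑XY = 1116, ∑X² = 645 and ∑Y² = 3212, so
r = (1116 - 1275) / (82.5 × 322)^(1/2) = -159 / 162.99 = -0.976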
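The result for the price and demand data can be verified from the raw observations instead of from column totals. The sketch below uses the equivalent deviation form of the formula; the function name `pearson_r` is illustrative.

```python
import math

price = [3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
demand = [25, 24, 20, 20, 19, 17, 16, 13, 10, 6]


def pearson_r(x, y):
    """Correlation coefficient from deviations about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(price, demand)
print(f"r = {r:.4f}")
```

The negative sign reflects that demand falls as price rises.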
Example
The following data refer to two variables, promotional expenses (Rs lakhs) and sales (1000 units),
collected in the context of a promotional study. Calculate the correlation coefficient and comment.
Promotional Expenses    Sales
7                       12
10                      14
9                       13
4                       5
11                      15
5                       7
3                       4
r = 0.9787
Comment: promotional expense is strongly associated with sales, and the
correlation is very close to 1.

Example
Calculate the coefficient of correlation from each pair of regression coefficients.
(i) b12 = - 0.1 , b21 = - 0.4
(ii) b13 = 0.27, b31 = 0.6
(iii) b23 = 0.67, b32 = 0.38
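Each pair can be handled with the property that r is the geometric mean of the two regression coefficients and carries their common sign. A minimal sketch, with an illustrative function name and labels:

```python
import math


def r_from_regression_coefficients(b_xy, b_yx):
    """r = +/- sqrt(b_xy * b_yx), signed like the two coefficients."""
    if b_xy * b_yx < 0:
        raise ValueError("the two regression coefficients must have the same sign")
    sign = -1.0 if b_xy < 0 else 1.0
    return sign * math.sqrt(b_xy * b_yx)


pairs = {
    "r12": (-0.1, -0.4),
    "r13": (0.27, 0.6),
    "r23": (0.67, 0.38),
}
results = {name: r_from_regression_coefficients(*b) for name, b in pairs.items()}
for name, r in results.items():
    print(f"{name} = {r:.3f}")
```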
Example
The following data were computed from personnel records of a manufacturing firm.
X = number of years of service, Y = weekly wage rate
n = 23, ∑X = 2433, ∑Y = 4245, ∑X² = 281019, ∑Y² = 841786,
∑XY = 482788.
Example
Find the coefficient of correlation between persons employed and cloth manufactured in a
textile mill.
Persons employed    Cloth manufactured (‘000 yds)
137                 23
209                 47
113                 22
189                 40
176                 39
200                 51
219                 49
$$r = \frac{\sum XY - n\bar{X}\bar{Y}}{n S_x S_y}$$
Example
(i) Given ∑XY = 350, n = 10, X̄ = 5, Ȳ = 6,
S²x = 4, S²y = 9,
compute the coefficient of correlation.
(ii) Given rxy = 0.8 and bxy = 0.45, what would be the value of byx?
(iii) What would be the value of ryx, if rxy = 0.87?
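A worked sketch of the three parts follows. It assumes that the values 5 and 6 given in part (i) are the means of X and Y, and it uses the properties r² = b_xy · b_yx and r_xy = r_yx listed earlier.

```latex
\text{(i)}\quad r = \frac{\sum XY - n\bar{X}\bar{Y}}{n S_x S_y}
  = \frac{350 - 10(5)(6)}{10 \cdot 2 \cdot 3} = \frac{50}{60} \approx 0.83

\text{(ii)}\quad r_{xy}^2 = b_{xy}\, b_{yx}
  \;\Rightarrow\; b_{yx} = \frac{(0.8)^2}{0.45} = \frac{0.64}{0.45} \approx 1.42

\text{(iii)}\quad r_{yx} = r_{xy} = 0.87
```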
Example
Calculate the correlation coefficient for the following data on supply and demand
Supply Demand
400 50
200 60
700 20
100 70
500 40
300 30
600 10
Correlation by SPSS
Example
The following figures are index numbers of average weekly earnings and prices in a country.
Years Earnings Prices
1955 100 100
1956 108 107
1957 113 109
1958 117 112
1959 112 113
1960 130 114
1961 138 118
1962 143 123
1963 149 125
1964 163 129

Chi-Square Test
Chi-square analysis is widely used in research studies for testing hypotheses involving nominal
data. Nominal data are also known by two other names: categorical data and attribute data. The symbol χ²
designates the chi-square statistic, whose distribution depends on the number of degrees
of freedom (d.f.). A chi-square distribution is a skewed distribution, particularly with smaller d.f. As the
d.f. increase, the χ² distribution becomes more symmetrical,
approaching normality.
The chi-square test is a nonparametric test. Nonparametric means that no assumption needs to be
made about the form of the original probability distribution from which the samples are drawn. It is a classic
nonparametric test involving data measured on a nominal scale. Please note that all parametric tests
make the assumption that the samples are drawn from a specified or assumed population distribution. Thus,
nonparametric methods are also called “distribution-free” methods.
Conditions for using the Chi-Square
• The sample observations drawn from a population must be independent and random.
• The data must be in frequency (count) form. If the original data are in percentages, they must
be converted into frequencies.
• No expected frequency in any cell/category should be less than 5. If the frequency is less than 5 for a
category, you have to do some regrouping (combining categories).
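The notes do not write out the test statistic itself. For reference, the standard Pearson chi-square statistic for a contingency table with r rows and c columns is sketched below; the symbols O, E, R, C and N are notation introduced here, not taken from the notes.

```latex
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{R_i \, C_j}{N},
\qquad
\text{d.f.} = (r - 1)(c - 1)
```

Here O_ij is the observed frequency in cell (i, j), E_ij the expected frequency under independence, R_i and C_j the row and column totals, and N the grand total.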
Example
An educator has the opinion that the grades high school students make depend on the amount
of time they spend listening to music. To test this theory, he has randomly given 400 students a
questionnaire. Within the questionnaire are two questions: How many hours per week do you listen
to music? What is the average grade for all your classes? The data from the survey are in the following
table. Using a 5 percent significance level, test whether grades and time spent listening to music are
independent or dependent.
Hours spent listening        Average Grade
to music                     A     B     C      D     F     Total
< 5 hrs                      13    10    11     16    5     55
5-10 hrs                     20    27    27     19    2     95
11-20 hrs                    9     27    71     16    32    155
> 20 hrs                     8     11    41     24    11    95
Total                        50    75    150    75    50    400

Results:
Crosstabs
hours * average grade Crosstabulation (Count)
                 average grade
hours            A     B     C      D     F     Total
<5               13    10    11     16    5     55
5-10             20    27    27     19    2     95
11-20            9     27    71     16    32    155
> 20             8     11    41     24    11    95
Total            50    75    150    75    50    400
Chi-Square Tests
                                Value         df    Asymp. Sig. (2-sided)
Pearson Chi-Square              63.830 (a)    12    .000
Likelihood Ratio                67.534        12    .000
Linear-by-Linear Association    13.160        1     .000
N of Valid Cases                400
a. 0 cells (.0%) have expected count less than 5. The minimum expected
count is 6.88.
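The Pearson chi-square value in the output can be reproduced from the crosstabulation counts. The sketch below computes expected counts as row total × column total / N and sums (O − E)²/E over all cells. The critical value 21.026 is the standard table value for 12 d.f. at the 5 percent level and is supplied here; it does not appear in the SPSS output.

```python
# Observed counts from the SPSS crosstabulation (rows: hours, columns: grades A-F).
observed = [
    [13, 10, 11, 16, 5],    # < 5 hrs
    [20, 27, 27, 19, 2],    # 5-10 hrs
    [9, 27, 71, 16, 32],    # 11-20 hrs
    [8, 11, 41, 24, 11],    # > 20 hrs
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / N.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)
min_expected = min(min(row) for row in expected)

critical_value = 21.026  # chi-square table value for df = 12, alpha = 0.05
print(f"chi-square = {chi_square:.3f}, df = {df}, min expected = {min_expected:.2f}")
print("Reject independence" if chi_square > critical_value else "Do not reject independence")
```

Since the statistic is far above the critical value, the hypothesis that grades and listening time are independent is rejected at the 5 percent level, in line with the reported Sig. of .000.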
Example
A newspaper publisher, trying to pinpoint his market’s characteristics, wondered whether
newspaper readership in the community is related to readers’ educational achievement. A survey
questioned adults in the area on their level of education and their frequency of readership. The results
are shown in the following table.
Level of Educational Achievement
Frequency of          Professional or    College     High School    Did not Complete
Readership            Postgraduate       Graduate    Graduate       High School         Total
Never                 10                 17          11             21                  59
Sometimes             12                 23          8              5                   48
Morning or evening    35                 38          16             7                   96
Both editions         28                 19          6              13                  66
Total                 85                 97          41             46                  269
At the 0.05 significance level, does the frequency of newspaper readership in the community differ
according to the readers’ level of education?
Example
An advertising firm is trying to determine the demographics for a new product. They have
randomly selected 75 people in each of 5 different age groups and introduced the product to them. The
results of the survey are given below.
                       Age Group
Future Activity        18-29    30-39    40-49    50-59    60-69
Purchase frequently    12       18       17       22       32
Seldom purchase        18       25       29       24       30
Never purchase         45       32       29       29       13
Calculate the chi-square statistic for the above data.
