
Data Analysis

Univariate Analysis, Bivariate Analysis, Multivariate Analysis
Three Types of Analysis

We can classify analysis into three types:

1. Univariate, involving a single variable at a time,
2. Bivariate, involving two variables at a time, and
3. Multivariate, involving three or more variables simultaneously.
Revision – Application Areas: Correlation

1. Correlation and regression are generally performed together. Correlation analysis is applied to measure the degree of association between two sets of quantitative data. The correlation coefficient measures this association; its value ranges from 0 (no correlation) to +1 (perfect positive correlation) or -1 (perfect negative correlation).
2. For example, how are sales of product A correlated with sales of product B? How is advertising expenditure correlated with other promotional expenditure? Are daily ice cream sales correlated with the daily maximum temperature?
3. Correlation does not necessarily mean there is a causal effect. Given any two series of numbers, there will be some correlation between them; this does not imply that one variable is causing a change in the other, or is dependent upon the other.

4. Correlation is usually followed by regression analysis in many applications.
Application Areas: Regression
1. The main objective of regression analysis is to explain the variation in one variable (called the dependent variable), based on the variation in one or more other variables (called the independent variables).

2. Application example: 'explaining' variations in the sales of a product based on advertising expenses, the number of salespeople, the number of sales offices, or all of the above variables.

3. If only one independent variable is used to explain the variation in the dependent variable, it is called a simple (bivariate) regression model.

4. If multiple independent variables are used to explain the variation in a dependent variable, it is called a multiple regression model.

5. Even though the form of the regression equation could be either linear or non-linear, we will limit our discussion to linear (straight-line) models.
Purposes of Regression Analysis

 To establish the relationship between a dependent variable (outcome) and a set of independent (explanatory) variables
 To identify the relative importance of the different independent (explanatory) variables on the outcome
 To make predictions
Requirements for applying Multiple regression analysis

1. The variables used (independent and dependent) are assumed to be either interval scaled or ratio scaled.

2. Nominally scaled variables can be used as independent variables in a regression model, with dummy variable coding (a minimal sketch follows this list).

3. If the dependent variable happens to be a nominally scaled one, discriminant analysis should be used instead of regression.

4. The dependent variable must essentially be metric; the independent variables may be metric or dummy variables.
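As an illustration of dummy variable coding (point 2 above), here is a minimal sketch in Python using pandas; the column names and values are hypothetical and not taken from the worked example later in this material.

import pandas as pd

# Hypothetical data: a metric dependent variable and a nominal predictor.
df = pd.DataFrame({
    "sales":  [12, 18, 9, 21],
    "region": ["North", "South", "East", "North"],
})

# drop_first=True keeps k-1 dummies for k categories, avoiding perfect collinearity.
dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True)
X = pd.concat([df.drop(columns=["region"]), dummies], axis=1)
print(X)   # region_North and region_South columns can now be used as regressors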
The general (linear) regression model is of the type

Y = b0 + b1X1 + b2X2 + … + bkXk   (unstandardised)

where

Y is the dependent variable;
X1, X2, …, Xk are the independent variables expected to be related to Y and expected to explain or predict Y;
b1, b2, …, bk are the coefficients of the respective independent variables, which will be determined from the input data.
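As a sketch of how these coefficients are determined from input data, the following Python snippet estimates b0, b1, …, bk by ordinary least squares using numpy; the function name is illustrative only.

import numpy as np

def fit_ols(X, y):
    # X: n x k matrix of independent variables, y: n observations of the dependent variable.
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])     # column of 1s estimates the intercept b0
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coeffs[0], coeffs[1:]                        # b0, then b1..bk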
Steps of Regression Analysis

Step 1: Construct a regression model
Step 2: Estimate the regression and interpret the results
Step 3: Conduct diagnostic analysis of the results
Step 4: Change the original regression model, if necessary
Step 5: Make predictions
DATA (INPUT / OUTPUT)

1. Input data on Y and on each of the X variables are required to do a regression analysis. These data are entered into a computer package that performs the regression.

2. The output consists of the 'b' coefficients for all the independent variables in the model. It also gives the results of a 't' test for the significance of each variable in the model, and the results of an 'F' test for the significance of the model as a whole.

3. Assuming the model is statistically significant at the desired confidence level (usually 90 or 95%), the coefficient of determination, or R², of the model is an important part of the output. The R² value is the percentage (or proportion) of the total variance in Y explained by all the independent variables in the regression equation.
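As a sketch of what the R² figure represents, it can be computed from observed and predicted values as one minus the ratio of residual variation to total variation (numpy assumed available):

import numpy as np

def r_squared(y, y_hat):
    # Proportion of the total variance in y explained by the regression.
    ss_res = np.sum((y - y_hat) ** 2)        # residual (unexplained) variation
    ss_tot = np.sum((y - np.mean(y)) ** 2)   # total variation in y
    return 1.0 - ss_res / ss_tot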
Worked Example: Problem

A manufacturer and marketer of electric motors would like to build a regression model consisting of five or six independent variables to predict sales. Past data have been collected for 15 sales territories, on sales and six different independent variables. Build a regression model and recommend whether or not it should be used.
The data are for a particular year, in the different sales territories in which the company operates, and the variables on which data are collected are as follows:

Dependent Variable
Y = sales in Rs. lakhs in the territory

Independent Variables
X1 = market potential in the territory (in Rs. lakhs)
X2 = No. of dealers of the company in the territory
X3 = No. of salespeople in the territory
X4 = index of competitor activity in the territory on a 5-point scale (1 = low, 5 = high level of activity by competitors)
X5 = No. of service people in the territory
X6 = No. of existing customers in the territory
Territory   SALES   POTENTL   DEALERS   PEOPLE   COMPET   SERVICE   CUSTOM
1             5       25         1         6        5        2        20
2            60      150        12        30        4        5        50
3            20       45         5        15        3        2        25
4            11       30         2        10        3        2        20
5            45       75        12        20        2        4        30
6             6       10         3         8        2        3        16
7            15       29         5        18        4        5        30
8            22       43         7        16        3        6        40
9            29       70         4        15        2        5        39
10            3       40         1         6        5        2         5
11           16       40         4        11        4        2        17
12            8       25         2         9        3        3        10
13           18       32         7        14        3        4        31
14           23       73        10        10        4        3        43
15           81      150        15        35        4        7        70
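For readers who wish to reproduce the analysis, the table above can be entered as a pandas DataFrame (a sketch; the column names follow the labels used in the output below):

import pandas as pd

data = pd.DataFrame({
    "SALES":   [5, 60, 20, 11, 45, 6, 15, 22, 29, 3, 16, 8, 18, 23, 81],
    "POTENTL": [25, 150, 45, 30, 75, 10, 29, 43, 70, 40, 40, 25, 32, 73, 150],
    "DEALERS": [1, 12, 5, 2, 12, 3, 5, 7, 4, 1, 4, 2, 7, 10, 15],
    "PEOPLE":  [6, 30, 15, 10, 20, 8, 18, 16, 15, 6, 11, 9, 14, 10, 35],
    "COMPET":  [5, 4, 3, 3, 2, 2, 4, 3, 2, 5, 4, 3, 3, 4, 4],
    "SERVICE": [2, 5, 2, 2, 4, 3, 5, 6, 5, 2, 2, 3, 4, 3, 7],
    "CUSTOM":  [20, 50, 25, 20, 30, 16, 30, 40, 39, 5, 17, 10, 31, 43, 70],
})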
Regression

We will first run a regression model of the following form, entering all six 'x' variables in the model:

Y = b0 + b1X1 + b2X2 + b3X3 + b4X4 + b5X5 + b6X6   ……… Equation 1

and determine the values of b0, b1, b2, b3, b4, b5 and b6.
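The original analysis was run in a statistical package whose output is reproduced below. A sketch of an equivalent estimation of Equation 1 in Python with statsmodels, assuming the data DataFrame defined above, would be: the summary reports the 'b' coefficients, their 't' tests, the 'F' test and R² discussed next.

import statsmodels.api as sm

X = sm.add_constant(data[["POTENTL", "DEALERS", "PEOPLE",
                          "COMPET", "SERVICE", "CUSTOM"]])   # adds the intercept b0
full_model = sm.OLS(data["SALES"], X).fit()
print(full_model.summary())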

Regression Output:
MULTIPLE REGRESSION RESULTS:

All independent variables were entered in one block

Dependent Variable: SALES

Multiple R: .988531605
Multiple R-Square: .977194734
Adjusted R-Square: .960090784
Number of cases: 15
The ANOVA Table

Analysis of Variance; Dependent Variable: SALES (regdata1.sta), Multiple Regression

Effect      Sums of Squares    df    Mean Squares    F        p-level
Regress.        6609.484        6      1101.58       57.133   .000004
Residual         154.249        8        19.281
Total           6763.733

From the analysis of variance table, the last column indicates the p-level to be 0.000004. This indicates that the model is statistically significant at a confidence level of (1 - 0.000004) * 100, i.e. 99.9996 percent.
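As a quick cross-check (a sketch assuming scipy is available), the p-level can be recovered from the F statistic and its degrees of freedom:

from scipy import stats

# Upper-tail probability of F(6, 8) at the observed F statistic;
# this should agree with the p-level of about 0.000004 reported above.
p_level = stats.f.sf(57.133, 6, 8)
print(p_level)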
Regression Summary for Dependent Variable: SALES
Multiple R = .98853160   R² = .97719473   Adjusted R² = .96009078
F(6,8) = 57.133   p < .00000   Std. Error of Estimate: 4.3910
N = 15

Variable     BETA       St.Err. of BETA      B          St.Err. of B     t(8)        p-level
Intercept                                  -3.17298      5.813394       -.54581      .600084
POTENTL     .439073       .144411            .22685       .074611       3.04044      .016052
DEALERS     .164315       .126591            .81938       .631266       1.29800      .230457
PEOPLE      .413967       .158646           1.09104       .418122       2.60937      .031161
COMPET     -.084871       .060074          -1.89270      1.339712      -1.41276      .195427
SERVICE    -.040806       .116511           -.54925      1.568233       -.35024      .735204
CUSTOM      .050490       .149302            .06594       .095002        .33817      .743935
The column of the table titled 'B' lists all the coefficients for the model. These are:
a (intercept) = -3.17298
b1 = 0.22685
b2 = 0.81938
b3 = 1.09104
b4 = -1.89270
b5 = -0.54925
b6 = 0.06594

Substituting these values of a, b1, b2, … b6 into Equation 1, we can write the equation (rounding off all coefficients to 2 decimals) as

Sales = -3.17 + 0.23 (potential) + 0.82 (dealers) + 1.09 (salespeople) - 1.89 (competitor activity) - 0.55 (service people) + 0.07 (existing customers)

[Y = a + b1X1 + b2X2 + b3X3 + b4X4 + b5X5 + b6X6]
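A small helper (illustrative only) that evaluates this fitted equation for a given territory, using the unrounded coefficients listed above:

def predict_sales_full(potentl, dealers, people, compet, service, custom):
    # Equation 1 with the estimated (unstandardised) coefficients.
    return (-3.17298
            + 0.22685 * potentl
            + 0.81938 * dealers
            + 1.09104 * people
            - 1.89270 * compet
            - 0.54925 * service
            + 0.06594 * custom)

# Example: fitted sales for territory 1 of the data table.
print(predict_sales_full(25, 1, 6, 5, 2, 20))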
The estimated increase in sales for every unit increase or decrease in an independent variable is given by the coefficient of that variable. For instance, if the number of salespeople is increased by 1, sales (in Rs. lakhs) are estimated to increase by 1.09, if all other variables are unchanged. Similarly, if 1 more dealer is added, sales are expected to increase by 0.82, other variables remaining the same.

The SERVICE variable does not make much intuitive sense: if we increase the number of service people, sales are estimated to decrease, according to the -0.55 coefficient of the variable 'No. of Service People' (SERVICE).

Looking now at the individual variable 't' tests, we find that the coefficient of the variable SERVICE is statistically not significant (p-level 0.735204). Therefore, the coefficient for SERVICE should not be used in interpreting the regression, as it may lead to wrong conclusions.
Strictly speaking, only two variables, market potential (POTENTL) and number of salespeople (PEOPLE), are statistically significant at the 90 percent confidence level, since their p-levels are less than 0.10. One should therefore only look at the relationship of sales with one or both of these variables.

Different modes of entering independent variables into the model (a sketch of the stepwise approach follows this list):
 Enter
 Forward Stepwise Regression
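A sketch of the forward stepwise idea (the entry threshold of 0.10 and the helper name are assumptions, not taken from the original analysis): starting from an empty model, the candidate variable with the smallest p-value is added at each step, stopping when no remaining candidate qualifies for entry.

import statsmodels.api as sm

def forward_stepwise(data, target, candidates, p_enter=0.10):
    selected, remaining = [], list(candidates)
    while remaining:
        # p-value of each candidate when added to the currently selected set
        pvals = {}
        for var in remaining:
            X = sm.add_constant(data[selected + [var]])
            pvals[var] = sm.OLS(data[target], X).fit().pvalues[var]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= p_enter:
            break                      # no remaining variable is significant enough
        selected.append(best)
        remaining.remove(best)
    return selected

# e.g. forward_stepwise(data, "SALES",
#                       ["POTENTL", "DEALERS", "PEOPLE", "COMPET", "SERVICE", "CUSTOM"])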
The final unstandardised model is given by

Sales = -10.6164 + 0.2433 (POTENTL) + 1.4244 (PEOPLE)   ………… Equation 3

Predictions:
If the potential in a territory were Rs. 50 lakhs, and the territory had 6 salespeople, then expected sales, using the above equation, would be
= -10.6164 + 0.2433(50) + 1.4244(6)
= 10.095, i.e. Rs. 10.095 lakhs.
Similarly, we could use this model to make predictions regarding sales in any territory for which the market potential and the number of salespeople are known.
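The arithmetic above can be checked with a short helper (illustrative name):

def predict_sales_reduced(potentl, people):
    # Equation 3: the final model with only POTENTL and PEOPLE.
    return -10.6164 + 0.2433 * potentl + 1.4244 * people

print(predict_sales_reduced(50, 6))   # approximately 10.095 (Rs. lakhs)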
Recommended usage

1. It is recommended that for serious decision-making, there should be a priori knowledge of the variables which are likely to affect Y, and only such variables should be used in the regression analysis.

2. For exploratory research, the hit-and-trial approach may be used.

3. It is also recommended that unless the model itself is significant at the desired confidence level (as evidenced by the F-test results printed for the model), the R² value should not be interpreted.
Multicollinearity and how to tackle it

Multicollinearity: the interrelationship of the various independent variables.

It is essential to verify whether the independent variables are highly correlated with each other. If they are, this may indicate that they are not truly independent of each other, and we may be able to use only one or two of them to predict the dependent variable (see the sketch below).

Independent variables which are highly correlated with each other should not be included in the model together.
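A sketch of how this can be checked for the worked example (assuming the data DataFrame defined earlier): examine the pairwise correlation matrix and, if desired, the variance inflation factors (VIFs); a common rule of thumb treats VIFs above about 10 as a sign of serious multicollinearity.

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

predictors = ["POTENTL", "DEALERS", "PEOPLE", "COMPET", "SERVICE", "CUSTOM"]
print(data[predictors].corr())          # high pairwise correlations are a warning sign

X = sm.add_constant(data[predictors])
for i, name in enumerate(predictors, start=1):   # index 0 is the constant
    print(name, variance_inflation_factor(X.values, i))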
