
BIVARIATE DATA

Applied Statistics and Computing Lab Indian School of Business


Learning goals
- Understanding bivariate data
- Understanding the idea of correlation
- Understanding linear regression


Bivariate Data


Why study variables together


Variation in one variable may or may not affect the variation in another variable

Understanding the relationship
- When the value of one variable changes, compare the other variable for:
  - Direction of movement
  - Magnitude of movement

Prediction
If a new value of one variable is observed, can we predict the corresponding value of the other variable?

Statistics for bivariate data

Data type I: n paired observations $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$

$E(XY) = \frac{x_1 y_1 + x_2 y_2 + \dots + x_n y_n}{n}$

Data type II: tabulated relative frequencies, used when several observations share the same values of X and Y

[Table: values of X in rows, values of Y in columns, relative frequencies in the cells, row and column totals in the margins]

E(X), E(Y), V(X) and V(Y) are calculated as per the univariate mean and variance formulae


Covariance (denoted by Cov)


- We understand the variation in a single variable by looking at how its values deviate from a measure of central tendency
- For bivariate data, we want to look at the combined deviation of the two variables
- The sign of such a measure tells us whether the two variables tend to move in the same direction or in opposite directions
- Hence we take the product of the two sets of deviations
- The covariance calculates just this! It is defined as the expected value of the product of the deviation of X from its mean and the deviation of Y from its mean*

$\operatorname{cov}(X, Y) = E[(X - E(X))(Y - E(Y))]$

$\operatorname{cov}(X, Y) = E(XY) - E(X)\,E(Y)$

$\operatorname{cov}(X, Y) = \frac{\sum xy}{n} - \left(\frac{\sum x}{n}\right)\left(\frac{\sum y}{n}\right)$

A reasonable measure of joint variation
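
The definition and the shortcut formula for the covariance can be checked against each other numerically. The following is a minimal sketch in R; the vectors `x` and `y` are made-up illustrative values, not data from the slides. Note that R's built-in `cov()` divides by n - 1, whereas the formulas on this slide divide by n, so the built-in result is rescaled before comparing.

```r
# Hypothetical paired observations (illustrative values only).
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 7, 9, 15)
n <- length(x)

# Definition: mean of the products of deviations from the means.
cov_def <- mean((x - mean(x)) * (y - mean(y)))

# Shortcut: E(XY) - E(X) E(Y).
cov_short <- mean(x * y) - mean(x) * mean(y)

# R's cov() uses the divisor (n - 1); rescale it to the divisor n.
cov_builtin <- cov(x, y) * (n - 1) / n

print(c(definition = cov_def, shortcut = cov_short, builtin = cov_builtin))
```

All three numbers agree; the positive sign says that large values of `x` tend to occur with large values of `y`.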


*Aczel A., Sounderpandian J. Complete business statistics

Covariance (contd.)
- Covariance is independent of a change of origin but is affected by a change of scale

  For $U = \frac{X - a}{c}$ and $V = \frac{Y - b}{d}$ (with $c, d \neq 0$): $\operatorname{cov}(U, V) = \frac{1}{cd}\,\operatorname{cov}(X, Y)$

- The absolute value of the covariance of two variables never exceeds the product of their standard deviations

  $|\operatorname{cov}(X, Y)| \le \sqrt{\operatorname{Var}(X)\,\operatorname{Var}(Y)}$

- The unit of covariance is the product of the units of X and Y
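
Both properties are easy to verify numerically. The sketch below uses made-up vectors and arbitrary constants `a`, `b`, `c_scale`, `d_scale` (all hypothetical) and takes the transformation to be U = (X - a)/c and V = (Y - b)/d, under which the covariance is divided by cd.

```r
# Hypothetical data and constants (illustrative values only).
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 7, 9, 15)
a <- 3; b <- 5            # change of origin
c_scale <- 2; d_scale <- 4  # change of scale

u <- (x - a) / c_scale
v <- (y - b) / d_scale

# Shifting the origin has no effect; rescaling divides the covariance by c * d.
lhs <- cov(u, v)
rhs <- cov(x, y) / (c_scale * d_scale)

# The covariance is bounded by the product of the standard deviations.
bound_holds <- abs(cov(x, y)) <= sd(x) * sd(y)

print(c(lhs = lhs, rhs = rhs))
print(bound_holds)
```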



Covariance (contd.)
- cov(Waist circumference, Adipose tissue area) = 643.39
- Can we compare this with another covariance?
- For the Body measurement data, consider both the Weight and the Height of all the individuals
- What is the covariance between Height and Weight for each of the two genders?
  cov(Height, Weight) = 27.13 kg·cm for one gender and 40.38 kg·cm for the other
- What information do we obtain by comparing these two covariance values?


Standardization
- If we standardize both variables, the covariance becomes independent of the units of measurement
- This makes the covariances of the two categories comparable
- The standardized covariance lies in the interval [-1, 1]
  - A value close to 0 => the variables do not covary much
  - A value close to 1 or -1 => the variables covary highly
- The standardized covariances of Height and Weight for the two genders are 0.43 and 0.53
- Height and weight are moderately related to each other, for both genders
- We will see that this standardized covariance is the same as the measure we study next!
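
The effect of standardization can be illustrated with a short R sketch. The height and weight values below are invented for illustration and are not the Body measurement data from the slides. Changing the unit of height from metres to centimetres multiplies the covariance by 100, but the covariance of the standardized variables stays the same and matches `cor()`.

```r
# Hypothetical heights (metres) and weights (kg); illustrative values only.
height_m  <- c(1.60, 1.65, 1.70, 1.75, 1.80)
weight_kg <- c(55, 62, 60, 72, 80)

# The raw covariance depends on the unit of measurement.
cov_m  <- cov(height_m, weight_kg)
cov_cm <- cov(height_m * 100, weight_kg)

# Standardize: subtract the mean and divide by the standard deviation.
z_height <- (height_m - mean(height_m)) / sd(height_m)
z_weight <- (weight_kg - mean(weight_kg)) / sd(weight_kg)
cov_std  <- cov(z_height, z_weight)

print(c(cov_m = cov_m, cov_cm = cov_cm, cov_std = cov_std,
        cor = cor(height_m, weight_kg)))
```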


Correlation coefficient
- Denoted by $\rho$ (called rho)
- Defined as the measure of the degree of linear association between the two variables X and Y*
- Indicates the strength and the direction of the movement of the two variables in relation to each other
- Calculated as the ratio of the covariance between X and Y to the product of the standard deviations of X and Y

  $\rho(X, Y) = \frac{\operatorname{cov}(X, Y)}{\sigma_X\,\sigma_Y}$

- The correlation coefficient is also termed the Pearson product-moment correlation coefficient
- $\rho$(Waist circumference, Adipose tissue area) = 0.77
- $\rho$(Height, Weight) = 0.43 and 0.53 for the two genders

*Aczel A., Sounderpandian J. Complete business statistics


Properties of Correlation coefficient


- The correlation coefficient of two variables is equal to the covariance of their standardised forms
- It lies between -1 and 1 (extremes included): $-1 \le \rho \le 1$
- It is a dimension-free measure, i.e. a measure free of units
- It is independent of both change of origin and change of scale

  For $U = \frac{X - a}{c}$ and $V = \frac{Y - b}{d}$ (with $c, d > 0$): $\rho(U, V) = \rho(X, Y)$
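
The invariance property follows from the behaviour of the covariance and the standard deviation under a linear transformation. A short derivation, assuming $U = (X - a)/c$ and $V = (Y - b)/d$ with nonzero constants $c$ and $d$ (the symbols $\sigma_X$, $\sigma_Y$ for the standard deviations are notation introduced here):

```latex
\begin{aligned}
\rho(U, V)
  &= \frac{\operatorname{cov}(U, V)}{\sigma_U \, \sigma_V}
   = \frac{\operatorname{cov}(X, Y) / (cd)}{(\sigma_X / |c|)\,(\sigma_Y / |d|)} \\
  &= \frac{|c|\,|d|}{cd} \cdot \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}
   = \frac{|c|\,|d|}{cd} \, \rho(X, Y)
\end{aligned}
```

So $\rho(U, V) = \rho(X, Y)$ whenever $c$ and $d$ have the same sign; if they have opposite signs, only the sign of the coefficient flips while its magnitude is unchanged.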

[Scatter plot panels]
- Perfect positive correlation. If one of X or Y increases, the other must increase as per an exact linear relation. Similarly, if one decreases, the other decreases by the same rule.
- No linear relationship.
- Perfect negative correlation. If one of X or Y increases, the other must decrease as per an exact linear relation. Similarly, if one decreases, the other increases by the same rule.
- Moderate negative correlation. If one of X or Y increases, the other tends to decrease as per a moderately strong linear relation. Similarly, if one decreases, the other tends to increase.
- Strong negative correlation. If one of X or Y increases, the other tends to decrease as per a very strong linear relation. Similarly, if one decreases, the other tends to increase.
- Moderate positive correlation. If one of X or Y increases, the other tends to increase as per a moderately strong linear relation. Similarly, if one decreases, the other tends to decrease.
- Weak positive correlation. If one of X or Y increases, the other tends to increase as per a weak linear relation. Similarly, if one decreases, the other tends to decrease.
- No linear relationship.

Visuals from Aczel A., Sounderpandian J. Complete business statistics

Limitations of correlation coefficient


Example 1

X:   -3   -2   -1    0    1    2    3
Y:    9    4    1    0    1    4    9

Correlation coefficient = 0. Yet, there exists a perfect quadratic relation between X and Y ($Y = X^2$)

Example 2

X:   2.01   2     2     2     2     2     2     2      2      2
Y:   0.6    0.2   0.2   0.2   0.1   0.1   0.1   0.05   0.05   0

Correlation coefficient = 0.911! Yet nine of the ten X values are identical, so the high value is driven by a single point
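
Both examples can be reproduced in R from the values shown on the slide. This is a small sketch; the variable names `x1`, `y1`, `x2`, `y2` are chosen here for illustration.

```r
# Example 1: a perfect quadratic relation with zero linear correlation.
x1 <- -3:3
y1 <- x1^2
r1 <- cor(x1, y1)

# Example 2: X is constant except for one point, which also has the largest Y.
x2 <- c(2.01, rep(2, 9))
y2 <- c(0.6, 0.2, 0.2, 0.2, 0.1, 0.1, 0.1, 0.05, 0.05, 0)
r2 <- cor(x2, y2)

print(r1)
print(round(r2, 3))
```

Plotting each pair with `plot(x1, y1)` and `plot(x2, y2)` makes the point visually: the coefficient summarises only the linear part of a relationship and is sensitive to single unusual observations.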



Correlation and causality


- Consider a huge Roger Federer fan!
- He watches several Federer - Nadal matches live on television
- He has recorded that Federer loses approximately 80% of the matches that he watches live
- Does he cause Federer to lose by watching the match?


Other measures
Rank correlation
- To measure the degree of correlation between two ordinal variables or rankings
  - Company rankings given by two different publications
  - Ranks of universities published on two websites
- Consider two groups of women. They are grouped based on whether they use a particular brand of shampoo (say Shampoo A) or not. For each of the groups, responses are collated to indicate which of the five characteristics of their shampoo are most important to them.

Characteristics     Group 1 rankings   Group 2 rankings   D = (rank 2 - rank 1)
Characteristic 1           1                  5                    4
Characteristic 2           3                  3                    0
Characteristic 3           2                  4                    2
Characteristic 4           5                  1                   -4
Characteristic 5           4                  2                   -2

Other measures (contd.)


Spearman's rank correlation coefficient ($\rho_s$):

$\rho_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$

where d = difference between the two ranks of each object and n = number of objects
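
Applying the formula to the shampoo rankings from the earlier table gives a concrete value. The sketch below computes the coefficient by hand and with R's built-in `cor(..., method = "spearman")`; the vector names are chosen here for illustration.

```r
# Rankings of the five characteristics by the two groups (from the table).
group1 <- c(1, 3, 2, 5, 4)
group2 <- c(5, 3, 4, 1, 2)

d <- group2 - group1   # 4, 0, 2, -4, -2
n <- length(d)

# Spearman's formula: 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
rho_formula <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))

# Built-in equivalent.
rho_builtin <- cor(group1, group2, method = "spearman")

print(c(formula = rho_formula, builtin = rho_builtin))
```

Here the sum of squared differences is 40, so the coefficient is 1 - 240/120 = -1: the two groups rank the characteristics in exactly opposite order.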

- This rank correlation is also equal to the Pearson product-moment correlation computed on the ranks (when there are no tied ranks)
- It lies in the interval [-1, 1]
- The higher the positive correlation coefficient, the greater the degree of agreement between the two rankings
- The closer the coefficient is to -1, the greater the degree of disagreement between the two rankings
- A correlation coefficient of 0 indicates that there is no association between the two ranks given to the same object

Other measures (contd.)


Kendall's Tau ($\tau$):

$\tau = \frac{(\text{number of concordant pairs}) - (\text{number of discordant pairs})}{\frac{1}{2} n(n - 1)}$

For n objects with ranks $(x_i, y_i)$, $i = 1, 2, \dots, n$, a pair of observations $(x_i, y_i)$ and $(x_j, y_j)$ is said to be
- concordant if the ranks of both elements agree, i.e. both $x_i > x_j$ and $y_i > y_j$, OR both $x_i < x_j$ and $y_i < y_j$
- discordant if $x_i > x_j$ and $y_i < y_j$, OR $x_i < x_j$ and $y_i > y_j$
- neither concordant nor discordant if $x_i = x_j$ or $y_i = y_j$

- Lies in the interval [-1, 1]
- If the agreement between the two rankings is perfect, the coefficient = 1
- If the disagreement between the two rankings is perfect, the coefficient = -1
- If the rankings are independent, the coefficient would be close to 0
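
The pair-counting definition can be sketched in R for the shampoo rankings used earlier (group 1: 1, 3, 2, 5, 4; group 2: 5, 3, 4, 1, 2). The helper names below are chosen for illustration; `combn()` enumerates all pairs of objects.

```r
# Rankings of the five characteristics by the two groups.
group1 <- c(1, 3, 2, 5, 4)
group2 <- c(5, 3, 4, 1, 2)
n <- length(group1)

# Each column of `pairs` holds the indices (i, j) of one pair of objects.
pairs <- combn(n, 2)
sign_x <- sign(group1[pairs[1, ]] - group1[pairs[2, ]])
sign_y <- sign(group2[pairs[1, ]] - group2[pairs[2, ]])

# Same sign => concordant, opposite sign => discordant, zero => tie.
concordant <- sum(sign_x * sign_y > 0)
discordant <- sum(sign_x * sign_y < 0)

tau_formula <- (concordant - discordant) / (n * (n - 1) / 2)
tau_builtin <- cor(group1, group2, method = "kendall")

print(c(concordant = concordant, discordant = discordant,
        tau = tau_formula, builtin = tau_builtin))
```

All 10 pairs are discordant for these data, so the coefficient is -1, matching the perfect disagreement between the two rankings.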

Linear Regression
- Suppose now that the variation in one variable (X) influences the variation in the other variable (Y)
  - Is the adipose tissue area influenced by waist circumference?
  - Are ice-cream sales affected by the temperature in the city?
- The variable X, i.e. the variable that influences, is also referred to as the predictor variable, the independent variable or the explanatory variable
- The variable Y, i.e. the variable that is being influenced, is also referred to as the outcome variable, the dependent variable or the response variable
- Can we draw one line such that the equation of that line explains the relation between X and Y?
- Which line describes the relationship in a reasonable way?

Linear regression (contd.)

This line minimizes the sum of squared vertical distances


Visuals from Aczel A., Sounderpandian J. Complete business statistics

Linear regression (contd.)


Simple linear regression model:

$Y = \beta_0 + \beta_1 X + \varepsilon$

where Y = outcome variable, X = predictor variable, $\varepsilon$ = random component in the model
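$\hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$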

If we can safely assume a linear relationship between X and Y, this model gives the average amount by which Y changes for a one-unit change in X (the slope of the line)

Linear regression (contd.)

- The model is estimated using the method of least squares
- This method minimizes the sum of squared errors
- There are other methods of estimation
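
The method of least squares can be sketched in a few lines of R. The data below are invented for illustration. The slope and intercept are computed from the standard closed-form least-squares expressions and compared with the output of `lm()`; the helper `sse()` is a hypothetical name introduced here to show that moving away from the fitted line increases the sum of squared errors.

```r
# Hypothetical data (illustrative values only).
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Closed-form least-squares estimates of slope and intercept.
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# The same estimates from R's linear model function.
fit <- lm(y ~ x)

# Sum of squared vertical distances from the line a + b * x.
sse <- function(a, b) sum((y - (a + b * x))^2)

print(c(intercept = b0, slope = b1))
print(coef(fit))
print(c(fitted_line = sse(b0, b1), shifted_line = sse(b0 + 0.1, b1)))
```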
Visuals from Aczel A., Sounderpandian J. Complete business statistics

Linear regression (contd.)


- The goodness of the model depends on the strength of the linear relationship between X and Y
- The error could comprise factors other than X that may affect Y
- The coefficient of determination, or $R^2$, is a measure of the strength of linearity in the relationship
- It indicates the proportion of variation in Y that is explained by X
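
For a simple linear regression with one predictor, the coefficient of determination equals the square of the Pearson correlation coefficient. The sketch below checks this on made-up data and also computes the "proportion of variation explained" directly from the residuals.

```r
# Hypothetical data (illustrative values only).
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.3, 2.9, 4.8, 5.1, 7.2, 6.9)

fit <- lm(y ~ x)

# R-squared as reported by the model summary.
r2_model <- summary(fit)$r.squared

# R-squared as the squared correlation between X and Y.
r2_cor <- cor(x, y)^2

# R-squared as 1 - (unexplained variation) / (total variation).
r2_manual <- 1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)

print(c(model = r2_model, squared_cor = r2_cor, manual = r2_manual))
```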

Linear regression (contd.)


Fitting a linear regression for the waist circumference and adipose tissue data gives the following output in R:

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)
(Intercept)                71.26327    1.88565   37.79   <2e-16 ***
data.waist$Adipose.tissue   0.19796    0.01624   12.19   <2e-16 ***

Multiple R-squared: 0.5861

We get the following regression equation:

Waist circumference = 71.26 + 0.20 (Adipose tissue area)
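
The reported coefficients can be turned into a small prediction function. This is only a sketch: it assumes the response is waist circumference (inferred from the predictor name `data.waist$Adipose.tissue` in the output), the function name `predict_response` is hypothetical, and the input value of 100 is an arbitrary illustration rather than an observation from the data.

```r
# Coefficients copied from the regression output above.
intercept <- 71.26327
slope     <- 0.19796

# Predicted response for a given adipose tissue area.
predict_response <- function(adipose_area) intercept + slope * adipose_area

# Arbitrary illustrative input.
predicted <- predict_response(100)
print(predicted)

# The square root of R-squared recovers the magnitude of the correlation.
r_from_r2 <- sqrt(0.5861)
print(round(r_from_r2, 2))
```

The square root of the reported R-squared is about 0.77, which agrees with the correlation between waist circumference and adipose tissue area quoted earlier in the slides.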


Linear regression (contd.)


R-codes
Function                       R-code
Dotplot                        install.packages("TeachingDemos")
                               library(TeachingDemos)
                               dots(variable)
Scatter plot                   plot(variable1, variable2)
Covariance                     cov(variable1, variable2)
Correlation                    cor(variable1, variable2)
Spearman's rank correlation    cor(variable1, variable2, method = "spearman")
Kendall's tau                  cor(variable1, variable2, method = "kendall")
Linear regression              lm(response ~ explanatory)
Regression line                abline(lm(response ~ explanatory))

Thank you
