
MBB

Chapter 3 (section 4) &
Chapter 12 (section 8)

Correlation coefficient: first introduced by Galton in 1877

Sir Francis Galton (1822-1911)
Source: Wikipedia

1. Concept of Correlation
2. Interpreting sample r values (examples)
3. Estimation: using a sample (calculations)
4. Inference ... testing
5. Correlation vs. causality

The population correlation coefficient is ρ (Greek letter rho).

Correlation is a measure of the strength of the linear relationship between two measured variables.

Theory gives us that: -1 ≤ ρ ≤ +1

X and Y are random variables, both measured (either discrete or continuous). That is, we have bivariate (paired) data: two variables measured on each individual (but they are rarely measuring the same thing).

The sample correlation coefficient is r


(We also have that: -1 ≤ r ≤ +1)

Often, we just want to know if X and Y are related in a linear manner (and the strength of the relationship) without specifying which variable depends on which.

In Topic 1, we saw the appropriate visual display is the scatterplot.

e.g. Is there a relationship between students' maths marks and their science marks?

In this case, correlation is useful.

Interpreting the sample correlation

The Correlation Coefficient measures the degree to which X and Y are linearly related.
If ρ is positive: X & Y both increase or decrease together.
If ρ is negative: as one variable increases, the other decreases.
If ρ is zero: there is NO linear relationship between X and Y.
If ρ = 1 or -1:
  - There is an exact linear relationship between X and Y.
  - Knowing one variable means you know the other exactly.
  - There is no variability around the line that summarises their relationship.

Some more examples of relationships and correlation values


[Figure: scatterplot panels labelled Strong positive, Perfect positive, Weak positive, No corr, Strong negative, Weak negative, No corr, Perfect negative]

Non-linear patterns → zero correlation

Source: Wikipedia

Correlation measures the strength of the LINEAR relationship.

In this set:
* 4 plots show linearity (a, b, d, e);
* 1 plot shows no relation (c);
* 1 plot shows a non-linear relation (f).
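The point that correlation only picks up LINEAR relationships can be checked numerically. The sketch below is illustrative only: the x values and the three relationships are made up, and `pearson_r` is a hypothetical helper written for this example, using the sample formula r = S_xy / sqrt(S_xx S_yy).

```python
import math

def pearson_r(xs, ys):
    """Sample correlation r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xy / math.sqrt(s_xx * s_yy)

# Made-up x values, chosen to be symmetric about 0
xs = [-3, -2, -1, 0, 1, 2, 3]

r_up = pearson_r(xs, [2 * x + 1 for x in xs])    # exact increasing line
r_down = pearson_r(xs, [5 - 4 * x for x in xs])  # exact decreasing line
r_curve = pearson_r(xs, [x ** 2 for x in xs])    # exact relationship, but not linear

print(r_up, r_down, r_curve)
```

The two exact lines give r = +1 and r = -1. The quadratic gives r = 0 even though y is completely determined by x, because the relationship is not linear.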

The correlation coefficient measures LINEAR ASSOCIATION only, so:

Independence implies zero correlation
BUT
Zero correlation does not necessarily imply independence (as X and Y could have a non-linear relationship ... see previous slide.)

Independence → Zero correlation (but not the reverse)

As for any parameter, ρ is very rarely known for a population.

The aim is to use a sample to estimate the population relation.

The sample consists of n independent observations
$(x_i, y_i)$ ; i = 1, 2, ..., n
where $x_i$ = measured value of variable 1 for individual i
      $y_i$ = measured value of variable 2 for individual i

ρ is estimated by r, the sample correlation coefficient: $\hat{\rho} = r$

$$ r = \frac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}} $$

r is sometimes referred to as the Pearson sample correlation coefficient.

$S_{xy} / (n-1)$ is the sample covariance
[it is like an unscaled correlation] ... not really a useful measure (yet).
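To see why the sample covariance behaves like an "unscaled correlation", the following sketch compares both measures when one variable changes units. The data values and the helper names (`sum_cross_products`, `sample_covariance`, `pearson_r`) are assumptions made for illustration; they do not come from the notes.

```python
import math

def sum_cross_products(xs, ys):
    """S_xy = sum of (x_i - x_bar)(y_i - y_bar); gives S_xx when ys is xs."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

def sample_covariance(xs, ys):
    """Sample covariance S_xy / (n - 1)."""
    return sum_cross_products(xs, ys) / (len(xs) - 1)

def pearson_r(xs, ys):
    """Sample correlation r = S_xy / sqrt(S_xx * S_yy)."""
    return sum_cross_products(xs, ys) / math.sqrt(
        sum_cross_products(xs, xs) * sum_cross_products(ys, ys)
    )

x = [1.0, 2.0, 3.0, 4.0, 5.0]      # made-up values
y_kg = [2.0, 4.0, 5.0, 4.0, 5.0]   # made-up values, in kg
y_g = [1000.0 * v for v in y_kg]   # the same measurements in grams

cov_kg, cov_g = sample_covariance(x, y_kg), sample_covariance(x, y_g)
r_kg, r_g = pearson_r(x, y_kg), pearson_r(x, y_g)

print(cov_kg, cov_g)  # covariance changes with the units of y
print(r_kg, r_g)      # r does not
```

The covariance is 1000 times larger when y is in grams, while r is the same in both cases. Dividing by the square root of S_xx S_yy removes the units, which is what makes r comparable across data sets.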

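SS for X:
$$ S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} \quad\text{or}\quad \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 = (n-1)\, s_x^2 $$

SS for Y:
$$ S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} \quad\text{or}\quad \sum_{i=1}^{n} y_i^2 - n\bar{y}^2 = (n-1)\, s_y^2 $$

SCP:
$$ S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n} \quad\text{or}\quad \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} $$

Example (weight / age)

Six children of different ages (from 1 to 6 years).
X = age, Y = weight

Individual   X (yrs)   Y (kgs)
1.           1         6
2.           6         18
3.           3         13
4.           2         9
5.           4         12
6.           5         14

Relevant summary statistics:
$n = 6$
$\bar{x} = 3.5$, $s_x \approx 1.871$, $S_{xx} = 17.5$
$\bar{y} = 12$, $s_y \approx 4.147$, $S_{yy} = 86$
$S_{xy} = \sum x_i y_i - n\bar{x}\bar{y} = 289 - 6(3.5)(12) = 37$

Use your calculator in sd mode.
You need: $n$, $\bar{x}$, $s_x$, $s_y$

$S_{xx} = (n-1) \cdot s_x^2$
$S_{yy} = (n-1) \cdot s_y^2$
$S_{xy} = \sum x_i y_i - n\bar{x}\bar{y}$

$$ r = \frac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}} = \frac{37.0}{\sqrt{17.5 \times 86.0}} \approx 0.9537 $$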
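The age/weight calculation can be reproduced in a few lines as a check on the hand arithmetic. This is a minimal sketch using the six (age, weight) pairs from the example and the shortcut formulas S_xx = Σx² - n x̄², S_yy = Σy² - n ȳ² and S_xy = Σxy - n x̄ ȳ; the variable names are chosen for this illustration only.

```python
import math

ages = [1, 6, 3, 2, 4, 5]          # X (yrs) for individuals 1 to 6
weights = [6, 18, 13, 9, 12, 14]   # Y (kgs) for individuals 1 to 6

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(weights) / n

# Shortcut (calculator-friendly) forms of the sums of squares and cross-products
s_xx = sum(x * x for x in ages) - n * x_bar ** 2
s_yy = sum(y * y for y in weights) - n * y_bar ** 2
s_xy = sum(x * y for x, y in zip(ages, weights)) - n * x_bar * y_bar

# Sample standard deviations, as a calculator in sd mode would report them
s_x = math.sqrt(s_xx / (n - 1))
s_y = math.sqrt(s_yy / (n - 1))

r = s_xy / math.sqrt(s_xx * s_yy)

print(f"x_bar = {x_bar}, y_bar = {y_bar}")
print(f"S_xx = {s_xx}, S_yy = {s_yy}, S_xy = {s_xy}")
print(f"s_x = {s_x:.3f}, s_y = {s_y:.3f}, r = {r:.4f}")
```

The printed values agree with the summary statistics in the example: S_xx = 17.5, S_yy = 86, S_xy = 37 and r ≈ 0.9537.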

Testing the correlation coefficient

We may want to test the correlation coefficient:

H0: ρ = 0   [no linear relationship]
H1: ρ ≠ 0   [some linear relationship]

Problem:
r is an index between -1 and 1, randomly distributed about ρ, so
r can't be Normally distributed, and
it very often has a skewed distribution.

If ρ = 0, the distribution of r (the random variable) will be approximately Normal (but truncated at -1 and 1).

The further ρ is from 0, the more skewed the distribution of r will be, so we can only test specifically for ρ = 0 (and no other values).

For testing H0: ρ = 0 vs. H1: ρ ≠ 0 (or < or >)
the test statistic is

$$ t_{obs} = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} $$

compare with $t_{n-2}$
[still in the form r / est.se(r)]

For the age/weight example:

H0: ρ = 0 vs H1: ρ ≠ 0 at α = 0.05

$$ t_{obs} = \frac{0.9537\sqrt{6-2}}{\sqrt{1-0.9537^2}} \approx 6.342 $$

df = 6-2 = 4 → 0.001 < p-value < 0.005 → Reject H0

We can conclude that there is a significant positive linear relationship between age and weight.
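The test statistic for the age/weight example can be checked with a short sketch. The function name `t_statistic` is hypothetical, and the critical value 2.776 is an assumption taken from a standard t table (two-sided, α = 0.05, 4 df) rather than from these notes.

```python
import math

def t_statistic(r, n):
    """t_obs = r * sqrt(n - 2) / sqrt(1 - r^2), compared with t on n - 2 df."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r, n = 0.9537, 6           # rounded r and sample size from the age/weight example
t_obs = t_statistic(r, n)
df = n - 2

# Assumed table value: two-sided critical value for alpha = 0.05 with 4 df
T_CRIT_4DF = 2.776
reject_h0 = abs(t_obs) > T_CRIT_4DF

print(f"t_obs = {t_obs:.3f} on {df} df, reject H0: {reject_h0}")
```

Since t_obs is far larger than the critical value, the decision to reject H0 matches the conclusion reached from the p-value bounds.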

Obtaining a confidence interval for ρ

The distribution of the sample correlation coefficient (and hence the test statistic $t = r\sqrt{n-2}/\sqrt{1-r^2}$) is only approximately normally distributed when the population correlation coefficient IS zero.

Hence, we cannot use this method to obtain a c.i. for ρ.

Other methods exist, but are beyond the scope of STAT171.

Two variables may be related, without either one causing the other one to change.

Consider a case of two variables:
- Ice cream sales at a beach
- number of drowning deaths at the beach
- they have a positive relationship → YES!

Does eating ice cream cause drowning?
Do people eat ice cream to console themselves if someone drowns?

WHAT?????

Considering the two variables:
- Ice cream sales at a beach
- number of drowning deaths at the beach

There is what is called a latent variable here ... one that is causing the observed relationship.

An example of scary (but incorrect) interpretation

http://io9.com/on-correlation-causation-and-the-real-cause-of-auti-1494972271

Here we have plotted for USA over time from 1997 to 2009:
- $ sales of organic vegetables; &
- number of individuals diagnosed with AUTISM

Clearly, there is evidence that eating organic vegetables causes autism!

What is really going on ?????

Read the fine print ...

Over this time, the population of the USA grew ...
So of course you would expect:
more money spent on organic vegetables
AND
more people diagnosed with autism
→ The true causal variable is ...
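The latent-variable idea can be illustrated with a small simulation. Everything in this sketch is made up for illustration: a hypothetical latent variable z drives both x and y, neither x nor y has any effect on the other, and the coefficients, noise levels and seed are arbitrary assumptions.

```python
import math
import random

def pearson_r(xs, ys):
    """Sample correlation r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xy / math.sqrt(s_xx * s_yy)

def simulate(z_values, rng):
    """x and y each depend on z plus their own independent noise."""
    xs = [10.0 * z + rng.gauss(0, 20) for z in z_values]
    ys = [0.2 * z + rng.gauss(0, 1) for z in z_values]
    return xs, ys

rng = random.Random(171)   # fixed seed so the sketch is repeatable
n = 1000

# Case 1: the latent variable varies from observation to observation
z_varying = [rng.uniform(15, 35) for _ in range(n)]
x1, y1 = simulate(z_varying, rng)
r_varying = pearson_r(x1, y1)

# Case 2: the latent variable is held at one value for every observation
z_fixed = [25.0] * n
x2, y2 = simulate(z_fixed, rng)
r_fixed = pearson_r(x2, y2)

print(f"z varies:     r = {r_varying:.3f}")
print(f"z held fixed: r = {r_fixed:.3f}")
```

When z varies, x and y show a strong positive correlation even though neither causes the other. When z is held fixed, the correlation is close to zero, which is what we would expect if the latent variable alone produces the observed relationship.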
