
Describing the Relation Between Two Variables

Scatter Diagrams; Correlation
Bivariate data is data in which two variables
are measured on an individual.
The response variable is the variable
whose value can be explained or
determined based upon the value of the
predictor variable.

A lurking variable is one that is related to
the response and/or predictor variable, but is
excluded from the analysis.
A scatter diagram shows the relationship
between two quantitative variables measured
on the same individual. Each individual in
the data set is represented by a point in the
scatter diagram. The predictor variable is
plotted on the horizontal axis and the
response variable is plotted on the vertical
axis. Do not connect the points when
drawing a scatter diagram.
Do heavier people burn more energy?
Does wine consumption cause a
decrease in heart disease?
These questions reflect a desire to understand
the relationship between two variables.
What we need:
1. A plot/graph to view the relationship
2. Characteristics to describe
3. Measures of the characteristics
4. Method to make inferences about
the relationship
Correlation & Regression
The graph…a Scatter Plot
Response variable Y (dependent variable): vertical axis
Explanatory variable X (independent variable): horizontal axis

Do heavier people burn more energy?
Response: metabolic rate
Explanatory: weight or mass

Does wine consumption cause a decrease in heart disease?
Response: death rate from heart disease
Explanatory: wine consumption
Do heavier people burn more energy?
[Scatter plot: lean body mass vs. metabolic rate. Horizontal axis: Mass (kg), 30 to 60. Vertical axis: Rate (cal), 1000 to 2000.]

Is wine good for your heart?
[Scatter plot: wine consumption vs. heart disease rate (per 100,000). Horizontal axis: Alcohol (wine consumption), 0 to 9. Vertical axis: hrt_death rate, 100 to 300.]

Interpreting…characteristics to look for:

• Patterns:
  • Form (clusters, scatter, linear..)
  • Direction (positive, negative)
  • Strength (how closely points follow form)
• Deviations:
  • Outliers

Interpret the last two scatter plots….

Options to consider:
Adding a categorical variable

Scatter plot: relationship between quantitative variables

Form: Linear is probably the most common form

Strength: We can measure the strength of a linear relationship
…because our eyes can deceive us!!!
EXAMPLE Drawing a Scatter Diagram

The following data are based on a study for
drilling rock. The researchers wanted to
determine whether the time it takes to dry drill
a distance of 5 feet in rock increases with the
depth at which the drilling begins. So, depth
at which drilling begins is the predictor
variable, x, and time (in minutes) to drill five
feet is the response variable, y. Draw a
scatter diagram of the data.
Source: Penner, R., and Watts, D.G. “Mining Information.” The American Statistician, Vol.
45, No. 1, Feb. 1991, p. 6.
Two variables that are linearly related are said to
be positively associated when above-average
values of one variable are associated with above-average
values of the other variable.
That is, two variables are positively associated
if, as the values of the predictor variable increase,
the values of the response variable also increase.
Two variables that are linearly related are said to
be negatively associated when above-average
values of one variable are associated with below-average
values of the other variable.
That is, two variables are negatively associated
if, as the values of the predictor variable increase,
the values of the response variable decrease.
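The "above average with above average" idea can be checked numerically. The sketch below uses made-up numbers and a hypothetical helper, `association_direction`, that sums the products of deviations from the two means: a positive sum points to positive association, a negative sum to negative association.

```python
from statistics import mean

def association_direction(x, y):
    """Classify direction of association from the sum of products of deviations."""
    x_bar, y_bar = mean(x), mean(y)
    # A product is positive when both values lie on the same side of their means.
    total = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "none"

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 5, 4, 6]
y_down = [9, 7, 6, 4, 1]

print(association_direction(x, y_up))    # positive
print(association_direction(x, y_down))  # negative
```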
The linear correlation coefficient, or Pearson
product moment correlation coefficient, is a
measure of the strength of the linear relation between
two quantitative variables. We use the Greek letter
ρ (rho) to represent the population correlation
coefficient and r to represent the sample correlation
coefficient. We shall only present the formula for
the sample correlation coefficient.
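The sample correlation coefficient is commonly written as below. The notation is an assumption, since the formula itself did not survive on this slide: $\bar{x}$ and $\bar{y}$ are the sample means, $s_x$ and $s_y$ the sample standard deviations, and $n$ the number of individuals.

```latex
r = \frac{\displaystyle\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)}{n - 1}
```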
Properties of the Linear Correlation Coefficient
1. The linear correlation coefficient is always
between -1 and 1, inclusive. That is, -1 ≤ r ≤ 1.
2. If r = +1, there is a perfect positive linear relation
between the two variables.
3. If r = -1, there is a perfect negative linear relation
between the two variables.
4. The closer r is to +1, the stronger the evidence of
positive association between the two variables.
5. The closer r is to -1, the stronger the evidence of
negative association between the two variables.
6. If r is close to 0, there is evidence of no linear
relation between the two variables. Because the
linear correlation coefficient is a measure of
strength of linear relation, r close to 0 does not
imply no relation, just no linear relation.
7. It is a unitless measure of association. So, the
unit of measure for x and y plays no role in the
interpretation of r.
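Several of these properties can be checked on small examples. The sketch below uses invented data and a hypothetical helper, `pearson_r`, built from standardized values. It is an illustration under those assumptions, not output from any data set in these slides.

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Sample correlation: sum of products of standardized values over n - 1."""
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)
    n = len(x)
    total = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y) for xi, yi in zip(x, y))
    return total / (n - 1)

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
perfect_up = [2 * xi + 1 for xi in x]     # exact line, positive slope
perfect_down = [10 - 3 * xi for xi in x]  # exact line, negative slope
curved = [(xi - 3) ** 2 for xi in x]      # clear relation, but not linear

print(round(pearson_r(x, perfect_up), 6))    # 1.0
print(round(pearson_r(x, perfect_down), 6))  # -1.0
print(abs(pearson_r(x, curved)) < 1e-9)      # True: r near 0 despite a relation

# Property 7: changing the unit of x (say, metres to centimetres) leaves r unchanged.
x_cm = [100 * xi for xi in x]
print(abs(pearson_r(x, perfect_down) - pearson_r(x_cm, perfect_down)) < 1e-9)  # True
```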
EXAMPLE Drawing a Scatter Diagram and
Computing the Correlation Coefficient
For the following data
(a) Draw a scatter diagram and comment on the
type of relation that appears to exist between x
and y.
(b) By hand, compute the linear correlation
coefficient.
EXAMPLE Determining the Linear
Correlation Coefficient

Determine the linear correlation coefficient
of the drilling data.

Columns of the computation table:
$\frac{x_i - \bar{x}}{s_x}$,  $\frac{y_i - \bar{y}}{s_y}$,  $\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$
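A by-hand computation of r standardizes each value, multiplies the standardized pairs, adds the products and divides by n - 1. The sketch below walks through that table layout on invented numbers. They are not the drilling data, which are not reproduced here.

```python
from statistics import mean, stdev

# Hypothetical (x, y) pairs, NOT the drilling data from the study.
x = [2, 4, 6, 8, 10]
y = [3, 7, 5, 11, 14]

x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)  # sample standard deviations (divide by n - 1)

# One row per individual: x, y, standardized x, standardized y, their product.
rows = []
for xi, yi in zip(x, y):
    z_x = (xi - x_bar) / s_x
    z_y = (yi - y_bar) / s_y
    rows.append((xi, yi, z_x, z_y, z_x * z_y))

for xi, yi, z_x, z_y, product in rows:
    print(f"{xi:>4} {yi:>4} {z_x:>8.3f} {z_y:>8.3f} {product:>8.3f}")

# r is the sum of the last column divided by n - 1.
r = sum(row[4] for row in rows) / (len(x) - 1)
print(f"r = {r:.3f}")
```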
A linear correlation coefficient that implies
a strong positive or negative association
that is computed using observational data
does not imply causation among the
variables.
Correlation = r

• Quantitative variables
• Linear relationships
• r has no units
• r is always between -1 and 1, inclusive
• Positive r = positive association
• Negative r = negative association
• r = 0 = no linear association
• r is influenced by outliers
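The last bullet is easy to see numerically. In the sketch below (invented data, hypothetical helper `pearson_r`), a single point far from the rest is enough to flip the sign of r.

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Sample correlation: sum of products of standardized values over n - 1."""
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)
    total = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y) for xi, yi in zip(x, y))
    return total / (len(x) - 1)

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 5]

r_without = pearson_r(x, y)
# Add one point that lies far from the pattern of the others.
r_with = pearson_r(x + [20], y + [0])

print(f"r without outlier = {r_without:.3f}")  # r without outlier = 0.829
print(f"r with outlier    = {r_with:.3f}")     # r with outlier    = -0.418
```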
Do heavier people burn more energy?
[Scatter plot: lean body mass vs. metabolic rate. Horizontal axis: Mass (kg), 30 to 60. Vertical axis: Rate (cal), 1000 to 2000.]

Correlations: Mass (kg), Rate (cal)
Pearson correlation of Mass(kg) and Rate(cal) = 0.865 (r)
P-Value = 0.000

Weight (mass) vs. metabolic rate
[Scatter plot with a categorical variable added: Males (+), Females (o). Horizontal axis: Mass (kg), 30 to 60. Vertical axis: Rate (cal), 1000 to 2000.]

Correlations: Mass (kg)_F, Rate (cal)_F
Pearson correlation of Mass(kg)_F and Rate(cal)_F = 0.876
Correlations: Mass (kg)_M, Rate (cal)_M
Pearson correlation of Mass (kg)_M and Rate (cal)_M = 0.592
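Computing r separately within each category, as in the output above, can be sketched as follows. The records are invented and the helper `pearson_r` is hypothetical. The point is only the grouping step, not the numbers.

```python
from collections import defaultdict
from statistics import mean, stdev

def pearson_r(x, y):
    """Sample correlation: sum of products of standardized values over n - 1."""
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)
    total = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y) for xi, yi in zip(x, y))
    return total / (len(x) - 1)

# Hypothetical records (group, mass in kg, rate in cal); not the plotted data.
records = [
    ("F", 36, 1000), ("F", 42, 1150), ("F", 48, 1200), ("F", 54, 1400),
    ("M", 47, 1400), ("M", 52, 1700), ("M", 58, 1600), ("M", 62, 1800),
]

# Split the (mass, rate) pairs by the categorical variable.
groups = defaultdict(list)
for group, mass, rate in records:
    groups[group].append((mass, rate))

# One correlation per group.
r_by_group = {}
for group, pairs in groups.items():
    masses, rates = zip(*pairs)
    r_by_group[group] = pearson_r(masses, rates)

# Correlation with the groups pooled together, for comparison.
r_all = pearson_r([m for _, m, _ in records], [r for _, _, r in records])

for group, r in sorted(r_by_group.items()):
    print(f"{group}: r = {r:.3f}")
print(f"All: r = {r_all:.3f}")
```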

Is wine good for your heart?
[Scatter plot: wine consumption vs. heart disease rate (per 100,000). Horizontal axis: Alcohol (wine consumption), 0 to 9. Vertical axis: hrt_death rate, 100 to 300.]

Correlations: Alcohol, heart_death rate
Pearson correlation of Alcohol and hrt_death rate = -0.843

heart disease death rate vs. wine consumption
(outliers removed)
[Scatter plot. Horizontal axis: Alc wine consumption, 1 to 4. Vertical axis: hrt death rate, 150 to 300.]

Correlations: Alcohol Wine consumption, heart death rate
Pearson correlation of Alc Wine consumption and hrt death rate = -0.648


Linear relationships…using a LINE
Is wine good for your heart?
[Scatter plot: wine consumption vs. heart disease rate (per 100,000). Horizontal axis: Alcohol (wine consumption), 0 to 9. Vertical axis: hrt_death rate, 100 to 300.]

We can summarise an overall linear form with a line…the best line is called
the Regression Line

A regression line describes how a response variable changes as an explanatory
variable changes. We can now predict a value of y when given an x.

Fitted regression line: death rate vs. wine consumption
death rate = 260.563 - 22.9688 wine consumption
S = 37.8786   R-Sq = 71.0 %   R-Sq(adj) = 69.3 %

[Fitted line plot. Horizontal axis: wine consumption, 0 to 9. Vertical axis: death rate, 100 to 300.]

What would be the death rate due to heart disease if the average daily
consumption of wine was 3 glasses?
death rate = 260.563 - 22.9688 (3) = 191.66 deaths per 100,000
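The prediction is just the fitted equation evaluated at x = 3. A minimal sketch, using the intercept and slope from the fitted line above. The function name `predict_death_rate` is hypothetical.

```python
# Coefficients copied from the fitted regression line.
INTERCEPT = 260.563
SLOPE = -22.9688

def predict_death_rate(wine_consumption):
    """Predicted heart disease death rate (per 100,000) from the fitted line."""
    return INTERCEPT + SLOPE * wine_consumption

print(round(predict_death_rate(3), 2))  # 191.66
```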

How do we determine the regression line?

We want the vertical distances from the points (observed) to the line
(predicted) to be as small as possible…this means our
error in predicting y is small.

Calculating the line…
We will use the method of least squares to calculate the line.
The least-squares regression line is the line that makes the sum of the squares of
the vertical distances as small as possible.

$\hat{y} = a + bx$   Equation of the line (read “y hat”)

$b = r \frac{s_y}{s_x}$   b is the slope (rate of change in $\hat{y}$ when x increases)

$a = \bar{y} - b\bar{x}$   a is the y intercept (value of $\hat{y}$ when x is 0)
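The slope and intercept formulas can be applied directly. The sketch below uses invented data and hypothetical helper names (`least_squares_line`, `sum_squared_errors`). It also checks that nudging the line away from the fitted one makes the sum of squared vertical distances larger.

```python
from statistics import mean, stdev

def least_squares_line(x, y):
    """Return (a, b) for y-hat = a + b*x, using b = r*s_y/s_x and a = y-bar - b*x-bar."""
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)
    n = len(x)
    r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y) for xi, yi in zip(x, y)) / (n - 1)
    b = r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b

def sum_squared_errors(x, y, a, b):
    """Sum of squared vertical distances from the points to the line a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

a, b = least_squares_line(x, y)
print(f"y-hat = {a:.2f} + {b:.2f} x")  # y-hat = 1.80 + 0.80 x

# Any other line does worse on the least-squares criterion.
best = sum_squared_errors(x, y, a, b)
shifted = sum_squared_errors(x, y, a + 0.1, b)
print(best < shifted)  # True
```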

Fitted regression line: death rate vs. wine consumption

The regression equation is
death rate = 260.563 - 22.9688 wine consumption

S = 37.8786   R-Sq = 71.0 %   R-Sq(adj) = 69.3 %

Analysis of Variance

Source       DF   SS        MS        F         P
Regression    1   59813.6   59813.6   41.6881   0.000
Error        17   24391.4    1434.8
Total        18   84204.9
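The summary figures in this output can be reproduced from the sums of squares and degrees of freedom in the table. The sketch below does the arithmetic. The adjusted R-Sq line uses the usual definition, which is assumed here because the slides do not state it.

```python
from math import sqrt

# Values copied from the Analysis of Variance table.
ss_regression = 59813.6
ss_error = 24391.4
ss_total = 84204.9
df_regression, df_error, df_total = 1, 17, 18

ms_regression = ss_regression / df_regression  # mean square = SS / DF
ms_error = ss_error / df_error
f_stat = ms_regression / ms_error              # F = MS regression / MS error
s = sqrt(ms_error)                             # S = square root of MS error
r_sq = ss_regression / ss_total                # share of variation explained
# Assumed (standard) definition of adjusted R-Sq.
r_sq_adj = 1 - (ss_error / df_error) / (ss_total / df_total)

print(f"MS error  = {ms_error:.1f}")          # MS error  = 1434.8
print(f"F         = {f_stat:.2f}")            # F         = 41.69
print(f"S         = {s:.2f}")                 # S         = 37.88
print(f"R-Sq      = {100 * r_sq:.1f} %")      # R-Sq      = 71.0 %
print(f"R-Sq(adj) = {100 * r_sq_adj:.1f} %")  # R-Sq(adj) = 69.3 %
```

The square root of R-Sq is about 0.843, which matches the size of the correlation reported for the wine data. The sign is negative because the slope is negative.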

Facts about regression….

1. Clear distinction between the response variable and the explanatory
variable.
2. Correlation and slope…a change of one standard deviation in x corresponds to a
change of r standard deviations in the predicted y.
3. The least-squares regression line always passes through the point $(\bar{x}, \bar{y})$.
4. Some variation (spread) in y can be accounted for by changes in x
when there is a linear relationship. The square of the correlation
coefficient is the fraction of the variation in y values that is
explained by changes in x.

$r^2 = \frac{\text{variation in } y \text{ due to } x}{\text{total variation in observed } y}$
= coefficient of determination
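Facts 3 and 4 can be verified on a small example. The sketch below uses invented data. It fits the least-squares line, checks that it passes through the point of means, and compares r squared with the explained share of the variation in y.

```python
from statistics import mean, stdev

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)
n = len(x)

r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y) for xi, yi in zip(x, y)) / (n - 1)
b = r * s_y / s_x     # slope
a = y_bar - b * x_bar  # intercept

# Fact 3: the fitted value at x-bar equals y-bar.
print(abs((a + b * x_bar) - y_bar) < 1e-9)  # True

# Fact 4: split the total variation in y into explained and unexplained parts.
ss_total = sum((yi - y_bar) ** 2 for yi in y)
ss_error = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
ss_regression = ss_total - ss_error

print(round(r ** 2, 4), round(ss_regression / ss_total, 4))  # 0.7273 0.7273
```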

Attention!! Caution!!
1. Correlation and regression describe only linear relationships.
2. r and r² are not resistant: outliers can change them substantially.
3. Do not extrapolate!!! Extrapolation means using the regression line to
predict y for x values outside the range of the data used to fit it.
4. Correlations based on averages are too high when applied to
individuals…if the data have been “averaged”, the values of
correlation and regression cannot be used with un-averaged values
(i.e., average alcohol consumption per country…not individuals).
5. Lurking variables…like the male/female variable in the weight vs.
energy data and the possible Mediterranean variable in the wine data.
6. Correlation/association is not causation.
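One way to guard against caution 3 is to refuse predictions outside the observed x range. The sketch below uses the fitted wine line from these slides. The range limits `X_MIN` and `X_MAX` are assumptions read loosely off the plot axis, and the function name is hypothetical.

```python
# Coefficients copied from the fitted regression line.
INTERCEPT = 260.563
SLOPE = -22.9688

# Assumed range of observed wine consumption (hypothetical, based on the plot axis).
X_MIN, X_MAX = 0.0, 9.0

def predict_within_range(x_new):
    """Predict from the fitted line, but only inside the assumed observed x range."""
    if not (X_MIN <= x_new <= X_MAX):
        raise ValueError(f"x = {x_new} is outside [{X_MIN}, {X_MAX}]: extrapolation")
    return INTERCEPT + SLOPE * x_new

print(round(predict_within_range(3), 2))  # 191.66

# Outside the range the line gives nonsense: a negative death rate at x = 12.
print(round(INTERCEPT + SLOPE * 12, 2))   # -15.06
```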
