
E370

5/13/17
Two Variable Analysis

Descriptive Statistics
  Single Variable Methods
    Qualitative
    Quantitative: Center, Spread, Shape
  Estimating Probabilities
  Two Variable Methods
    At least 1 qualitative variable: Contingency Tables
    2 quantitative variables: Scatter plots, Covariance, Correlation, Least Squares Lines
Relationships Between Two
Variables

For Quantitative Data


Scatter Plots
Covariance
Correlation
Least Squares Line

For Nominal and Ordinal Level Data


Contingency Tables

Two-Variable Analysis
This graph plots Cartesian coordinate pairs.
Data must be paired, that is, the X value
and the Y value must be for the same
observation.
For example, if Ralph is 72 inches (6 feet)
tall and he weighs 180 lbs, his coordinate
pair would be (72, 180).
If George is 67 inches tall and weighs 135
lbs, his coordinate pair is (67, 135).
We would never put Ralph's height with George's weight.

Scatter Plots
[Figure: "Education and Income" scatter plot. Y-axis: Education in Years (0 to 25); X-axis: Income in $K (0 to 100).]
[Figure: "Income and Education" scatter plot. Y-axis: Income in $K (0 to 100); X-axis: Years of Education (6 to 22).]
[Figure: "Education and Income" and "Income and Education" shown side by side.]

Comparison with Switched Axes
They are an important first step in two-variable analysis.
Either variable can be on either axis.
In this class we look for linear associations.
A linear association is indicated by the
visual impression of a line.
The more the association resembles a
distinct line, the stronger is the association.
The more the association resembles a
random scattering of points, the weaker is
the association.

About Scatter Plots


Scatterplots are important, but
they can only take us so far.
The judgments we make are
relative and subjective.
A measure of the strength of an
apparent association can be
calculated.
The Covariance

Some objectivity?
It is a measure of strength of a possible linear relationship between 2 variables.
It is called Co-Variance because that is what it is: Variance, but with 2 variables.
It can be any real number: positive, negative or zero.
It is very sensitive to changes in magnitude of the variables, for example, reporting income in $ instead of $K.

Covariance Characteristics
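
The sensitivity to magnitude described above can be sketched in a few lines of Python. This is a minimal illustration, not the course's own calculation: the data values are hypothetical, and the function divides by n (a population covariance), which is an assumption; a sample covariance would divide by n - 1.

```python
def covariance(xs, ys):
    """Population covariance: average product of paired deviations from the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

# Hypothetical paired data (not from the slides): one (education, income) pair per person.
education_years = [10, 12, 14, 16, 18]
income_k = [25, 32, 41, 58, 64]                 # income in $K
income_dollars = [v * 1000 for v in income_k]   # the same incomes in $

cov_k = covariance(education_years, income_k)
cov_dollars = covariance(education_years, income_dollars)

# Same people, same relationship, but the covariance is 1000 times larger in $.
print(cov_k, cov_dollars)
```

The relationship between the two variables has not changed, only the units, which is why the covariance alone cannot be used to compare strengths across data sets.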
Its units are worse than variance's.
The covariance between incomes and heights of male executives is 77.72 dollar-inches.
The covariance between a golfer's total score and the time taken to complete 18 holes is 398.51 stroke-minutes.
In which of these relationships is there the stronger linear relationship?

What does it tell us?


Standing alone it can tell us . . .
If it is positive, both variables are
moving in the same direction at the
same time, or are both positive or
negative at the same time.
If it is negative, the variables are
moving in opposition to each other, as
one increases, the other decreases; or
one is negative when the other is
positive.
If it is zero (0), we say there is no
linear relationship indicated.

Covariance Interpreted
Correlation

398.51 stroke-minutes standardizes to a 0.69 correlation coefficient.
77.72 dollar-inches standardizes to a 0.65 correlation coefficient.

Covariance Transformed
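
The slides do not spell out how the covariance is standardized. A common definition, given here as an assumption about what was used, divides the covariance by the product of the two standard deviations (the symbols r, s_X and s_Y are notation introduced for this sketch):

```latex
r = \frac{\operatorname{cov}(X, Y)}{s_X \, s_Y}, \qquad -1 \le r \le 1
```

Under this definition the units cancel (for example, stroke-minutes divided by strokes times minutes), which is why the two coefficients above can be compared directly even though the two covariances cannot.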
The correlation coefficient, r, lies between -1 and 1.
When r = 1 or -1 there is perfect correlation.
What does perfect correlation mean?
o A deterministic relationship is one in which knowledge of the value of one variable determines the value of the other exactly.
o Imperfect correlation suggests that one MAY have a statistical relationship. A statistical relationship is one that is characterized by natural variability in both measurements.
Correlation
Correlation enables the interpretation of covariance to be meaningful, and provides a measure of how much of a line the variables make.
r = 0 generally implies no linear relationship; the scatter plot shows no linear pattern at all.
r > 0 means that something resembling a positively sloped line can be seen in the scatter plot; the closer to 1, the more the plot resembles a line.
r < 0 means that something resembling a negatively sloped line can be seen in the scatter plot; the closer to -1, the more the plot resembles a line.
Correlation is resistant to changes in units of measurement.
Correlation is very sensitive to outliers.

Correlation Characteristics
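
The last two characteristics can be checked with a short Python sketch. The data are hypothetical and the `correlation` helper is written out by hand for illustration; it is one standard way to compute the coefficient, not necessarily the method used in the course.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired data (not from the slides).
education_years = [10, 12, 14, 16, 18]
income_k = [25, 32, 41, 58, 64]                 # income in $K
income_dollars = [v * 1000 for v in income_k]   # the same incomes in $

r_k = correlation(education_years, income_k)
r_dollars = correlation(education_years, income_dollars)   # unit change: r is unchanged

# One unusual observation: 20 years of education but only $5K of income.
r_with_outlier = correlation(education_years + [20], income_k + [5])

print(round(r_k, 3), round(r_dollars, 3), round(r_with_outlier, 3))
```

A single added point is enough to pull a correlation that was close to 1 down to nearly 0, which is why the scatter plot should be inspected before the coefficient is trusted.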
Just because you can calculate a
covariance or correlation does not
mean you have a linear
relationship.
They assume you have checked to see
if the two variables have a linear
relationship.
All they measure is the strength of
LINEAR relationships.
Just because you have a linear relationship and have calculated a covariance or correlation does not mean you have a causal relationship.
But . . .
Causation exists when a change in an
explanatory variable is the direct cause
of a change in a response variable.
How do we know if we have causation?
Randomized, controlled experiment
A reasonable explanation
The connection appears under varying
conditions.
Everything else is ruled out.
Variables are "confounded" if their effects on a third variable cannot be separated from one another.
Lurking variables are variables which cause an effect but which were not included in the analysis.

Correlation is NOT causation.
Reasons for relationships between
variables other than causation:
o The explanatory variable is not the
"sole" cause of the response variable.
o Confounding variables exist.
o Both variables have a common cause.
o Both variables are changing over time.
o Coincidence.
Spurious Correlation
o A large r suggests an association between two variables that does not truly exist.

Correlation with no cause?


Data > Data Analysis > Covariance
Population

            Lottery   Education   Age       Children   Income
Lottery     14.198
Education   -7.804    11.152
Age         7.941     -7.097      142.281
Children    -0.114    0.472       1.683     1.732
Income      -34.609   38.212      -7.774    1.642      243.094

Data > Data Analysis > Correlation

            Lottery   Education   Age       Children   Income
Lottery     1.000
Education   -0.620    1.000
Age         0.177     -0.178      1.000
Children    -0.023    0.107       0.107     1.000
Income      -0.589    0.734       -0.042    0.080      1.000
Covariance, Correlation and Excel
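
The two Excel tables are linked: each correlation is the corresponding covariance divided by the square root of the product of the two variances on the diagonal. The sketch below is one way to check this in Python using the covariance values from the table; the variable names are illustrative only.

```python
import math

names = ["Lottery", "Education", "Age", "Children", "Income"]

# Lower triangle of the covariance table; the diagonal holds each variable's variance.
cov_lower = [
    [14.198],
    [-7.804, 11.152],
    [7.941, -7.097, 142.281],
    [-0.114, 0.472, 1.683, 1.732],
    [-34.609, 38.212, -7.774, 1.642, 243.094],
]

# Standardize each covariance by the two standard deviations.
corr_lower = [
    [cov_lower[i][j] / math.sqrt(cov_lower[i][i] * cov_lower[j][j]) for j in range(i + 1)]
    for i in range(len(names))
]

for name, row in zip(names, corr_lower):
    print(name.ljust(10), "  ".join(f"{value:7.3f}" for value in row))
```

To three decimal places the result agrees with the correlation table, for example -0.620 for Education with Lottery and 0.734 for Income with Education.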
IFF you have a causal relationship
(how will you know?), a least squares
line can be calculated.
It is a model of a linear relationship
with a specific characteristic that
allows us to say it is the line of best
fit.
Yi' = b0 + b1*Xi
You probably know this as y = mx + b.
b0 = b
b1 = m

Least Squares Lines


Of all possible lines, it is the line that has the smallest sum of squared error, SSE:
Error = Actual Y - Predicted Y
Error is squared, then summed.
This process generates an estimate of the y-intercept, b0, and the slope, b1.
b0 : The value of Y when X is 0.
b1 : The change in Y as a result of a 1 unit
change in X.

About the line of best fit
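
The slides do not show how b0 and b1 are computed. The sketch below uses the standard closed-form least squares formulas (slope = sum of cross-deviations divided by sum of squared X deviations; intercept = mean of Y minus slope times mean of X) on hypothetical data, then confirms that nudging the line in any direction increases SSE. The function names are illustrative.

```python
def least_squares(xs, ys):
    """Return (b0, b1) for the line Y' = b0 + b1*X that minimizes SSE."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

def sse(xs, ys, b0, b1):
    """Sum of squared errors, where Error = Actual Y - Predicted Y."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data (not from the slides).
xs = [10, 12, 14, 16, 18]
ys = [25, 32, 41, 58, 64]

b0, b1 = least_squares(xs, ys)
best = sse(xs, ys, b0, b1)

# Any other line, such as these slightly shifted ones, has a larger SSE.
shifted_intercept = sse(xs, ys, b0 + 1, b1)
shifted_slope = sse(xs, ys, b0, b1 + 0.1)
print(b0, b1, best, shifted_intercept, shifted_slope)
```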


Price' = -260.05 + 3866.6(Weight)

For a diamond with a carat weight of 0, the predicted price is -$260.05.

For each additional carat of weight, the predicted price of the diamond changes by $3866.60.

The predicted price of a diamond weighing 0.27 carats is -$260.05 + $3866.6*(0.27) = $783.93

Our Example
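
The diamond example can be reproduced directly from the fitted equation. This is a small sketch; the constant and function names are chosen here for illustration.

```python
B0 = -260.05   # intercept: predicted price at a carat weight of 0
B1 = 3866.6    # slope: change in predicted price per additional carat

def predicted_price(carats):
    """Predicted price from the line Price' = b0 + b1 * Weight."""
    return B0 + B1 * carats

price = predicted_price(0.27)
print(f"${price:.2f}")
```

The negative intercept is a reminder that the line is only meaningful over the range of weights actually observed; a diamond of weight 0 is outside that range.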
Objective and Subjective
Subjective
at least partly opinion based

Objective Probabilities
Classical
Empirical or *relative frequencies*

Marginal, Joint and Conditional


Marginal: Probability of a single event
Joint: Probability of 2 or more events at the same time
Conditional: Probability of an event given another event
has already happened.
Independence
No evidence that one variable causes the other to change.

Probability Concepts
Gender and Major of 200 Students
                    Major
Gender    Accounting   Econ   Stats   Total
Male      0.22         0.15   0.12    0.49
Female    0.28         0.15   0.08    0.51
Total     0.5          0.3    0.2     1
Simple or Marginal Probability
Joint Probability
Conditional Probability
Independence Rule: If P(A&B) = P(A)*P(B), then A and B are independent.

A Contingency Table
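
The three kinds of probability and the independence rule can be read off the gender and major table. The sketch below stores the joint probabilities from the table and derives the rest; the helper names are hypothetical.

```python
# Joint probabilities from the table: P(gender & major).
joint = {
    ("Male", "Accounting"): 0.22, ("Male", "Econ"): 0.15, ("Male", "Stats"): 0.12,
    ("Female", "Accounting"): 0.28, ("Female", "Econ"): 0.15, ("Female", "Stats"): 0.08,
}

def p_gender(gender):
    """Marginal probability of a gender: sum the joint probabilities across its row."""
    return sum(p for (g, _), p in joint.items() if g == gender)

def p_major(major):
    """Marginal probability of a major: sum the joint probabilities down its column."""
    return sum(p for (_, m), p in joint.items() if m == major)

def p_major_given_gender(major, gender):
    """Conditional probability: joint divided by the marginal of the given event."""
    return joint[(gender, major)] / p_gender(gender)

p_male = p_gender("Male")                                        # marginal, 0.49
p_male_and_acct = joint[("Male", "Accounting")]                  # joint, 0.22
p_acct_given_male = p_major_given_gender("Accounting", "Male")   # conditional

# Independence rule: compare the product of marginals with the joint probability.
product_of_marginals = p_male * p_major("Accounting")            # 0.49 * 0.5
print(p_male, p_male_and_acct, round(p_acct_given_male, 3), product_of_marginals)
```

Here the product of the marginals (0.245) differs from the joint probability (0.22), so this pair of events does not satisfy the independence rule.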
                    Economic Class
Survived?   I      II     III    Other   Total
No          122    167
Yes         203           178    212     711
Total                            885     2201

A Population at Risk:
Number of Yes            711
Yes in Class I           203
Total Number of Other    885
Yes in Class III         178
No in Class II           167
No in Class I            122
Population Size          2201
Yes in Other             212
Fill this contingency table.
                    Economic Class
Survived?   I         II        III       Other     Total
No          122       167       528 (5)   673 (6)   1490 (7)
Yes         203       118 (1)   178       212       711
Total       325 (2)   285 (3)   706 (4)   885       2201

(1) 118 = 711-203-178-212
(2) 325 = 122+203
(3) 285 = 167+118
(4) 706 = 2201-885-285-325
(5) 528 = 706-178
(6) 673 = 885-212
(7) 1490 = 122+167+528+673 OR 1490 = 2201-711

Calculations to fill the table
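
The same fill-in order can be written as a short Python sketch. Each missing cell is derived from the known values by subtraction or addition, because every row and column must add up to its total; the variable names are illustrative.

```python
# Known values given on the slide.
population = 2201
yes_total = 711
other_total = 885
no_I, no_II = 122, 167
yes_I, yes_III, yes_other = 203, 178, 212

# Derived cells, in the same order as the calculations above.
yes_II = yes_total - yes_I - yes_III - yes_other            # step 1
total_I = no_I + yes_I                                      # step 2
total_II = no_II + yes_II                                   # step 3
total_III = population - other_total - total_II - total_I   # step 4
no_III = total_III - yes_III                                # step 5
no_other = other_total - yes_other                          # step 6
no_total = no_I + no_II + no_III + no_other                 # step 7

# Cross-check: the "No" total must also equal the population minus the "Yes" total.
assert no_total == population - yes_total
print(yes_II, total_I, total_II, total_III, no_III, no_other, no_total)
```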


Convert the frequencies to probabilities
Economic Status of Population Exposed to Risk
Survived?   I      II     III    Other   Total
No          0.06   0.08   0.24   0.3     0.68
Yes         0.09   0.05   0.08   0.1     0.32
Total       0.15   0.13   0.32   0.4     1
For example, No in Class I: 122/2201 = 0.06
What is the probability that a person survived the risk?
What is the probability that a person had economic
status=II and did not survive?
Given that a person had economic status III, what is the
probability that person survived the risk?
Given that a person survived, what is the probability
that person had economic status I?
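
One way to work through these four questions is sketched below, using the rounded probabilities from the table. The first two answers are a marginal and a joint probability read straight from the table; the last two are conditional probabilities, computed as joint divided by the marginal of the given event. The dictionary layout and names are illustrative.

```python
# Rounded joint probabilities from the table: joint[survived][economic status].
joint = {
    "No":  {"I": 0.06, "II": 0.08, "III": 0.24, "Other": 0.3},
    "Yes": {"I": 0.09, "II": 0.05, "III": 0.08, "Other": 0.1},
}

# Marginal: P(survived) is the sum of the "Yes" row.
p_survived = sum(joint["Yes"].values())

# Joint: P(status II and did not survive) is a single cell.
p_II_and_no = joint["No"]["II"]

# Conditional: P(survived | status III) = P(Yes & III) / P(III).
p_III = joint["No"]["III"] + joint["Yes"]["III"]
p_survived_given_III = joint["Yes"]["III"] / p_III

# Conditional: P(status I | survived) = P(Yes & I) / P(Yes).
p_I_given_survived = joint["Yes"]["I"] / p_survived

print(round(p_survived, 2), p_II_and_no,
      round(p_survived_given_III, 2), round(p_I_given_survived, 2))
```

Because the table entries are rounded to two decimals, these answers differ slightly from what the raw counts would give.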
Is there evidence that suggests that economic status and whether or not a person survived the risk are independent?
If P(A)*P(B) = P(A&B), A & B are independent. (Special Law of Multiplication)
P(A) and P(B) are marginal, and P(A&B) is joint; let's calculate the product of some marginals and compare them to the related joint probability.

Independence
Expected Probabilities (Marginal Products)
Economic Status of Population Exposed to Risk
Survived?   I       II       III      Other   Total
No          0.102   0.0884   0.2176   0.272   0.68
Yes         0.048   0.0416   0.1024   0.128   0.32
Total       0.15    0.13     0.32     0.4     1
For example, No in Class I: 0.15*0.68 = 0.102; Yes in Other: 0.4*0.32 = 0.128

Observed Probabilities (Joint Probabilities)
Economic Status of Population Exposed to Risk
Survived?   I      II     III    Other   Total
No          0.06   0.08   0.24   0.3     0.68
Yes         0.09   0.05   0.08   0.1     0.32
Total       0.15   0.13   0.32   0.4     1

Independence?
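
The comparison between the two tables can be sketched in Python as follows. For every cell, the product of its row and column marginals (the expected probability under independence) is set against the observed joint probability. The names are illustrative, and because the inputs are rounded to two decimals, small gaps would not mean much on their own.

```python
# Observed joint probabilities and marginals, as rounded in the tables.
joint = {
    "No":  {"I": 0.06, "II": 0.08, "III": 0.24, "Other": 0.3},
    "Yes": {"I": 0.09, "II": 0.05, "III": 0.08, "Other": 0.1},
}
row_totals = {"No": 0.68, "Yes": 0.32}
col_totals = {"I": 0.15, "II": 0.13, "III": 0.32, "Other": 0.4}

# Expected joint probability under independence: product of the two marginals.
expected = {
    (s, c): row_totals[s] * col_totals[c]
    for s in row_totals for c in col_totals
}

# Gap between what was observed and what independence would predict.
gaps = {(s, c): joint[s][c] - expected[(s, c)] for (s, c) in expected}
largest_gap = max(abs(g) for g in gaps.values())

for (s, c), g in gaps.items():
    print(f"{s:3} {c:5} expected={expected[(s, c)]:.4f} observed={joint[s][c]:.2f} gap={g:+.4f}")
```

The largest gaps are in Class I, where the observed probability of not surviving (0.06) is well below the 0.102 that independence would predict, so the products of the marginals do not match the joint probabilities.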
At least one qualitative variable.
Contingency Table
* May be used for 2 quantitative variables.
* Mutually exclusive and collectively exhaustive categories
* This assures that the probabilities all will sum to exactly 1.
* If product of marginal probabilities = joint probabilities, independent
* Similar to the idea of a

No qualitative variables.
Scatter Plots
* Two quantitative variables
* Either variable may be on X or Y
* Reveals linear associations.
Covariance
* Measures the strength of a linear association.
* May be positive, 0 or negative
* Has very difficult units to interpret
* Because of units, the

Summarize Two-Variable Relationships

Correlation
* Measures the strength of a possible linear relationship
* Has no units
* Exists between -1 and +1
* Impervious to unit changes
* Sensitive to outliers
Least Squares Line
* Requires causation before it is estimated.
* Reports the equation of the line that fits the data best.
* Best fit means the smallest total SSE = sum over i of (Yi - Yi')^2
* Yi' = b0 + b1*Xi
* b0 is the Y-Intercept value, the value of Y when Xi = 0.
* b1 is the slope of the line, the amount that Y changes for a one unit change in X.
* Not only describes the line completely, but enables one to make predictions.

No Qualitative Variables, Continued
