
Week 3: Chapter 2.9
Graphing Bivariate Numerical Data

Terminology
Independent variable

The variable you have control over, what you can choose and
manipulate. It is usually what you think will affect the
dependent variable.

Dependent variable

What you measure in the experiment and what is affected
during the experiment. The dependent variable responds to
the independent variable.

Graphing Bivariate Relationships


Scatterplots
A two-dimensional plot that displays the general relationship
between 2 quantitative variables graphically.
One variable's values are plotted along the vertical axis (y-axis).
The other variable's values are plotted along the horizontal axis (x-axis).
The 2 variables are measured on the same individuals.


Each point determined by two values

Should provide a quick visual impression of the data

[Figure: scatterplot of Math SAT Score and HS GPA]

Example

Draw a scatterplot to represent the following dataset:

[Table: x and y values]

Figure: Scatterplot between x and y variables


4 Steps to Describe a Scatterplot


Step 1. Identify the Explanatory and Response Variables and the Cases

Independent (explanatory) variable lies along the x-axis
Dependent (response) variable lies along the y-axis
Cases are represented by each dot

[Figure: 2008 State Mean SAT Math Score vs State Participation Rate. y-axis: State Mean SAT Math Score; x-axis: Participation Rate (proportion of state's 2008 graduating seniors who took the SAT)]

Step 2. Describe the overall shape

Include a general pattern description, such as linear or curved
Obvious gaps that result in clusters (values grouped together)
Outliers (individual values which are an exception to the pattern)

[Figure: 2008 State Mean SAT Math Score vs State Participation Rate]

Step 3. Describe the trend

Describe the trend and what it means:
No trend - the dots are scattered all over or form a horizontal or vertical line
Positive - as the explanatory variable increases, so does the response variable
Negative - as the explanatory variable increases, the response variable decreases

[Figure: 2008 State Mean SAT Math Score vs State Participation Rate]

Step 4. Describe the strength

Strong - points close to an imaginary line or curve
Weak - points widely scattered from the line or curve; pattern difficult to discern
Moderate - somewhere in between; pattern obviously there but has exceptions or a lot of spread

[Figure: 2008 State Mean SAT Math Score vs State Participation Rate]

Week 3: Chapter 11.5
Correlation

Terminology
Correlation
Measures the strength of a certain type of
relationship between two measurement
variables.

Terminology

The Coefficient of Correlation (aka Pearson)

A measure of the strength of the linear
relationship between the two variables.
Indicates how closely the values fall to a
straight line.

Formulas

Pearson sample correlation coefficient (r):

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} $$

where (x1, y1), (x2, y2), ..., (xn, yn) denote a
sample of (x, y) pairs
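
As a rough illustration of the Pearson sample correlation coefficient, the sketch below computes r from paired data using sums of deviations from the means. The function name `pearson_r` and the toy data are assumptions made for this example and do not come from the slides.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient for paired data."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: sum of cross-products of deviations from the means.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator pieces: sums of squared deviations for x and for y.
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical toy data (not from the slides): points exactly on a line.
r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])    # rising line
r_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])  # falling line
print(r_up, r_down)
```

Points that sit exactly on a rising or falling straight line give the two extreme values of r, which is a quick sanity check for any implementation.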


Properties of r

1. -1 ≤ r ≤ 1

2. Correlation of +1 indicates a perfect positive
linear relationship between the two variables
As one increases, so does the other.
All individuals fall on the same straight line.

3. Correlation of -1 indicates a perfect negative
linear relationship between the two variables
As one increases, the other decreases.
All individuals fall on the same straight line.


Properties of r

5. Correlation of zero could indicate no linear
relationship between the two variables,
or
that the best straight line through the data on a
scatterplot is exactly horizontal

6. Correlations are unaffected if the units of
measurement are changed

7. The value of r is a measure of the extent to
which x and y are linearly related.
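
The claim that correlations are unaffected by a change of units can be checked numerically. The sketch below is only an illustration: the measurements are invented, and `pearson_r` is a hypothetical helper written for this example.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient for paired data."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical measurements, invented for illustration only.
weight_kg = [52.0, 60.5, 71.0, 80.2, 95.4]
height_cm = [158.0, 171.0, 169.0, 182.0, 180.0]

# Same individuals, different units: kilograms to pounds, centimetres to inches.
weight_lb = [w * 2.20462 for w in weight_kg]
height_in = [h / 2.54 for h in height_cm]

r_metric = pearson_r(weight_kg, height_cm)
r_imperial = pearson_r(weight_lb, height_in)
print(r_metric, r_imperial)  # the two values agree up to rounding error
```

Rescaling by a positive constant multiplies the numerator and the denominator of r by the same factor, so the ratio does not change.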

Visualizing Correlation Coefficients using Scatterplots

[Figure: six scatterplot panels with r = .1, .3, .5, .7, .9, 1]


Visualizing Correlation Coefficients using Scatterplots

[Figure: six scatterplot panels with r = -1, -.8, -.6, -.4, -.2, 0]

Strength of linear relationships

r from -1 to -0.8: Strong
r from -0.8 to -0.5: Moderate
r from -0.5 to 0.5: Weak
r from 0.5 to 0.8: Moderate
r from 0.8 to 1: Strong

Example #1: Mare & Foal Weight

Is foal weight related to mare weight?

Observation   Mare weight (x, in kg)   Foal weight (y, in kg)
1             556                      129
2             638                      119
3             588                      132
4             550                      123.5
5             580                      112
6             642                      113.5
7             568                      95
8             642                      104
9             556                      104
10            616                      93.5
11            549                      108.5
12            504                      95
13            515                      117.5
14            551                      128
15            594                      127.5


Correlation

[Figure: scatterplot of Foal Weight (vertical axis, 90 to 130) against Mare Weight (horizontal axis, 500 to 650)]

Correlation of Mare Weight and Foal Weight = 0.001

Interpretation: r is close to 0, which indicates no linear relationship
between mare weight and foal weight.
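
The reported correlation can be roughly reproduced from the mare and foal weights listed in Example #1. This is a sketch under the assumption that the fifteen mare weights and fifteen foal weights pair up in the order they are listed; the variable names are chosen for this example.

```python
import math

# Weights in kg, taken in the order listed in Example #1 (pairing assumed).
mare = [556, 638, 588, 550, 580, 642, 568, 642,
        556, 616, 549, 504, 515, 551, 594]
foal = [129, 119, 132, 123.5, 112, 113.5, 95, 104,
        104, 93.5, 108.5, 95, 117.5, 128, 127.5]

n = len(mare)
mare_bar = sum(mare) / n
foal_bar = sum(foal) / n

# Sums of cross-products and squared deviations from the means.
s_xy = sum((x - mare_bar) * (y - foal_bar) for x, y in zip(mare, foal))
s_xx = sum((x - mare_bar) ** 2 for x in mare)
s_yy = sum((y - foal_bar) ** 2 for y in foal)

r_mare_foal = s_xy / math.sqrt(s_xx * s_yy)
print(round(r_mare_foal, 3))  # a value very close to 0
```

A value this close to zero says only that there is no linear relationship in this sample; it does not rule out other kinds of relationship.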

Example #2: Verbal SAT and GPA


r = .485
moderate positive relationship

Example #3: Putting Success


r = -.94

Week 3: Chapter 11
Simple Linear Regression


Terminology
Independent variable
An independent variable is the variable you have
control over, what you can choose and manipulate.
It is usually what you think will affect the
dependent variable.

Dependent variable
A dependent variable is what you measure in the
experiment and what is affected during the experiment.
The dependent variable responds to the independent
variable.


Terminology

Regression Analysis
Used to describe the relationship between a dependent
variable and one or more independent variables.

Linear Regression
Used to construct a simple formula that will predict a
value or values for a variable given the value of another
variable,
OR
used to test whether and how a given variable is related
to another variable or variables.

Terminology

Deterministic model
The relationship between x and y is exact, and there is no
allowance for error.

Deterministic relationship
y = 1.5x
where y = reaction time (in seconds) and x = percentage of drug in the blood

Terminology

Probabilistic model
The relationship between x and y includes a deterministic
component and a random error component.
Accounts for unexplained variation caused by unknown
phenomena or other variables.

Probabilistic relationship
y = 1.5x + random error
where y = reaction time (in seconds) and x = percentage of drug in the blood

Terminology

Probabilistic Relationship

y = f(x) + random error


General Form of Probabilistic Models

y = Deterministic component + Random error
y is the variable of interest.
The mean value of the random error is assumed to be 0.

Therefore,
Mean value of y, E(y) = Deterministic component
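
The difference between the deterministic and probabilistic relationships can be sketched with a small simulation. The normally distributed error, the standard deviation of 0.5, and the function names are assumptions made for this illustration; the slides only state that the random error has mean 0.

```python
import random

def deterministic_y(x):
    """Deterministic component from the slides: y = 1.5x."""
    return 1.5 * x

def probabilistic_y(x, rng, sigma=0.5):
    """Deterministic component plus a random error with mean 0.

    The normal error and sigma=0.5 are assumptions for illustration.
    """
    return deterministic_y(x) + rng.gauss(0.0, sigma)

rng = random.Random(42)  # fixed seed so the sketch is repeatable
x = 2.0

# Many simulated observations of y at the same x.
draws = [probabilistic_y(x, rng) for _ in range(10_000)]
mean_of_draws = sum(draws) / len(draws)

print(deterministic_y(x))  # exactly 3.0, no allowance for error
print(mean_of_draws)       # close to 3.0, because the error averages out
```

Individual simulated values scatter around the deterministic value, while their average settles near it, which is the sense in which E(y) equals the deterministic component.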

Formulas

First-Order (Straight-Line) Probabilistic Model

$$ y = \beta_0 + \beta_1 x + \varepsilon $$

where
y = dependent variable
x = independent variable
$\beta_0 + \beta_1 x$ = E(y) = deterministic component
$\varepsilon$ (read as epsilon) = random error component
$\beta_0$ (read as beta zero) = y-intercept
$\beta_1$ (read as beta one) = slope of the line

Line of Means

In the probabilistic model, the deterministic
component is referred to as the line of means:
the mean of y, E(y), is equal to the straight-line
component of the model

$$ E(y) = \beta_0 + \beta_1 x $$


Section 11.2: Fitting the Model

The Least Squares Approach

Step 1
Hypothesize the deterministic component of the
probabilistic model

$$ E(y) = \beta_0 + \beta_1 x $$

Step 2
Use sample data to estimate the unknown parameters in
the model


Method of Least Squares

Values on the line are the predicted values.

The distances between the scattered dots and the
line are the errors of prediction.

[Figure: scatterplot with fitted line; y-axis: reaction time, in seconds; x-axis: percent of drug in bloodstream]


Formulas

In general, consider a sample of n data points
consisting of pairs of values of x and y:

x     y
x1    y1
x2    y2
.     .
.     .
.     .
xn    yn

The straight-line model between y and x is:

$$ E(y) = \beta_0 + \beta_1 x $$

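Formulas

Model: $E(y) = \beta_0 + \beta_1 x$

Estimates: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$

Deviation: $y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)$

SSE: $\sum_{i=1}^{n} [y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)]^2$

The least squares line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ is the line
that has the following two properties:
1. The sum of the errors (SE) equals 0.
2. The sum of squared errors (SSE) is smaller
than that for any other straight-line model.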

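Formulas

Least Squares Estimates

Slope:

$$ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} $$

OR

$$ \hat{\beta}_1 = r \, \frac{s_y}{s_x} $$

y-intercept:

$$ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} $$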

Example

Consider the straight-line model $E(y) = \beta_0 + \beta_1 x$, where
y = reaction time (in seconds) and x = percent of drug
received.
a) Use the method of least squares to
estimate the values of $\beta_0$ and $\beta_1$.
b) Predict the reaction time when
x = 2%.
c) Find SSE for the analysis.
d) Give practical interpretations of
$\hat{\beta}_0$ and $\hat{\beta}_1$.

Where:
$\bar{x} = 3$, $s_x = 1.5811$
$\bar{y} = 2$, $s_y = 1.2247$
$r = 0.9037$

Table 11.1 Reaction Time versus Drug Percentage

Subject   Percent x of Drug   Reaction Time y (seconds)
1         1                   1
2         2                   1
3         3                   2
4         4                   2
5         5                   4


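Answer a)

Slope:

$$ \hat{\beta}_1 = r \, \frac{s_y}{s_x} = .9037 \left( \frac{1.2247}{1.5811} \right) = .7 $$

y-intercept:

$$ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 2 - .7(3) = -.1 $$

The best fitted least squares equation or regression line is:

$$ \hat{y} = -.1 + .7x $$
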

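Answer b)

The predicted reaction time when x = 2% is

$$ \hat{y} = -.1 + .7(2) = 1.3 $$

Answer c)

$$ SSE = \sum_{i=1}^{5} [y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)]^2 = \sum_{i=1}^{5} [y_i - (-.1 + .7 x_i)]^2 $$

$$ = [1 - (-.1 + .7(1))]^2 + [1 - (-.1 + .7(2))]^2 + [2 - (-.1 + .7(3))]^2 + [2 - (-.1 + .7(4))]^2 + [4 - (-.1 + .7(5))]^2 = 1.10 $$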
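
The prediction and the SSE can be checked with a few lines of code. The five (x, y) pairs below are read off the SSE expansion in Answer c); the helper name `sse` and the alternative line used for comparison are assumptions made for this sketch.

```python
# Data pairs as they appear in the SSE expansion of Answer c).
xs = [1, 2, 3, 4, 5]  # percent of drug
ys = [1, 1, 2, 2, 4]  # reaction time, in seconds

def sse(b0, b1, xs, ys):
    """Sum of squared errors for the line y_hat = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

b0, b1 = -0.1, 0.7  # least squares estimates from Answer a)

predicted_at_2 = b0 + b1 * 2                     # Answer b)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
sum_of_errors = sum(residuals)                   # property 1: about 0
sse_fit = sse(b0, b1, xs, ys)                    # Answer c)

# An arbitrary alternative line (hypothetical) for comparison.
sse_other = sse(0.0, 0.65, xs, ys)

print(round(predicted_at_2, 2), round(sse_fit, 2), round(sse_other, 4))
```

The alternative line has a larger SSE than the fitted line, which illustrates (but does not prove) the second property of the least squares line.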

Answer d)

Slope, $\hat{\beta}_1 = .7$

For every 1% increase
in the amount of drug
in the bloodstream,
the mean reaction
time is estimated to
increase by .7 seconds.

y-intercept, $\hat{\beta}_0 = -.1$

The estimated mean
reaction time is equal
to -.1 seconds when
the percent x of drug
is equal to 0%.

[Figure: fitted line $\hat{y} = -.1 + .7x$; y-axis: reaction time, in seconds; x-axis: percent of drug in bloodstream]

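Coefficient of Determination, $r^2$

The proportion of variation in y that can
be explained by the linear relationship
between x and y.

$$ r^2 = \frac{SS_{yy} - SSE}{SS_{yy}} = 1 - \frac{SSE}{SS_{yy}}, \qquad SS_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2 $$

$r^2$ takes values between 0 and 1.
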

High $r^2$
x provides important
information about y
Predictions are more accurate
based on the model

Low $r^2$
Knowing values of x does not
substantially improve
predictions on y
There may be no relationship
between x and y, or it may be
more subtle than a linear
relationship

Example

Consider the straight-line model $E(y) = \beta_0 + \beta_1 x$, where
y = reaction time (in seconds) and x = percent of drug
received.
e) Calculate the value of the
coefficient of determination,
$r^2$, and interpret results.

Where:
$\bar{x} = 3$, $s_x = 1.5811$
$\bar{y} = 2$, $s_y = 1.2247$
$r = 0.9037$

Data: Table 11.1 Reaction Time versus Drug Percentage


Answer e)

$r^2 = (.9037)^2 = .817$

Interpretation:
About 82% of the sample variation in reaction time (y)
can be explained by the fitted linear relationship
between reaction time (in seconds) and the percent of
drug received.
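
As a cross-check on this value, the sketch below computes the coefficient of determination two ways: as the square of r, and as 1 - SSE/SS_yy. The second form is the standard identity for a least squares line and is an assumption here insofar as the slide's own formula did not survive extraction. The data pairs are the ones used in the SSE calculation.

```python
import math

xs = [1, 2, 3, 4, 5]  # percent of drug
ys = [1, 1, 2, 2, 4]  # reaction time, in seconds
b0, b1 = -0.1, 0.7    # fitted least squares line from Answer a)

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
ss_xx = sum((x - x_bar) ** 2 for x in xs)
ss_yy = sum((y - y_bar) ** 2 for y in ys)  # total variation in y
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

r = ss_xy / math.sqrt(ss_xx * ss_yy)
r_squared_from_r = r ** 2
r_squared_from_sse = 1 - sse / ss_yy  # share of variation explained

print(round(r, 4), round(r_squared_from_r, 3), round(r_squared_from_sse, 3))
```

The two routes agree, so about 82% of the sample variation in reaction time is accounted for by the fitted line.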
