[Figure: scatterplots of AG_C1_2 against AG_C1_1 for width, length, and weight of core]
scatterplot matrix
[Figure: scatterplot matrix of AG_C1_1, AG_C2_1, AG_C3_1, AG_C4_1, AG_C1_2, AG_C2_2, AG_C3_2, AG_C4_2; detail panels show AG_C2_1 against AG_C1_1 and AG_C1_2 against AG_C1_1]
scatterplots
scatterplots provide the most detailed summary of a bivariate relationship, but they are not concise, and there are limits to what else you can do with them
simpler kinds of summaries may be useful:
- more compact, though they often capture less detail
- may support more extended mathematical analyses
- may reveal fundamental relationships
y = a + bx
[Figure: a line through the points (x1,y1) and (x2,y2)]
b = slope = Δy/Δx = (y2-y1)/(x2-x1)
a = y intercept
y = a + bx
we can predict values of y from values of x
predicted values of y are called y-hat:  ŷ = a + bx
the predicted values (ŷ) are often regarded as dependent on the (independent) x values
try to assign the independent variable to the x-axis and the dependent variable to the y-axis
y = a + bx
becomes a concise summary of a point distribution, and a model of a relationship
may have important explanatory and predictive value
linear regression
linear regression and correlation analysis are generally concerned with fitting lines to real data
least squares regression is one of the main tools:
- attempts to minimize deviation of observed points from the regression line
- maximizes its potential for prediction
Note:
these are the vertical deviations (dyi = yi - ŷi); this is a sum-squared-error approach
the line is fit by minimizing:  Σ dyi² = Σ (yi - ŷi)²
calculating a line that minimizes this value is called regressing y on x
- appropriate when we are trying to predict y from x
- this is also called Model I Regression
the slope (b) is calculated as:

b = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²     (summing over i = 1 to n; the numerator is related to the covariance of x and y)

once you have the slope, you can calculate the y-intercept (a):

a = ȳ - b·x̄
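The slope and intercept formulas above can be sketched directly in Python (a minimal illustration; the function name and data are made up, not from the source):

```python
def least_squares(x, y):
    """Model I least-squares regression of y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    # slope: b = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
        sum((xi - xbar) ** 2 for xi in x)
    # intercept: a = ybar - b * xbar
    a = ybar - b * xbar
    return a, b

# points lying exactly on y = 1 + 2x are recovered exactly
a, b = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
```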
regression pathologies
things to avoid in regression analysis
Tukey Line
resistant to outliers
- divide cases into thirds, based on the x-axis
- identify the median x and y values in the upper and lower thirds
- slope: b = (My3 - My1) / (Mx3 - Mx1)
- intercept: a = median of all values yi - b·xi
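The Tukey-line recipe can be sketched as follows (an illustrative implementation under the assumption of a simple equal three-way split; names and data are not from the source). Note how one wild outlier leaves the fitted line unchanged:

```python
from statistics import median

def tukey_line(x, y):
    # sort cases by x and take the lower and upper thirds
    pts = sorted(zip(x, y))
    n = len(pts)
    third = n // 3
    lower, upper = pts[:third], pts[n - third:]
    mx1, my1 = median(p[0] for p in lower), median(p[1] for p in lower)
    mx3, my3 = median(p[0] for p in upper), median(p[1] for p in upper)
    # slope from the medians of the outer thirds
    b = (my3 - my1) / (mx3 - mx1)
    # intercept: median of all yi - b*xi
    a = median(yi - b * xi for xi, yi in zip(x, y))
    return a, b

# perfect line y = 1 + 2x with one wild outlier
xs = list(range(9))
ys = [1 + 2 * xi for xi in xs]
ys[4] = 100
a, b = tukey_line(xs, ys)
```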
Correlation
regression concerns fitting a linear model to observed data
correlation concerns the degree of fit between observed data and the model...
if most points lie near the line:
- the fit of the model is good
- the two variables are strongly correlated
- values of y can be well predicted from x
Pearson's r
this is assessed using the product-moment correlation coefficient:

r = Σ(xi - x̄)(yi - ȳ) / sqrt( Σ(xi - x̄)² · Σ(yi - ȳ)² )
r is symmetrical: correlating x with y gives the same value as correlating y with x
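A minimal sketch of the product-moment coefficient (function name and data are illustrative, not from the source):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's product-moment correlation coefficient."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    num = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    den = sqrt(sum((xi - xbar) ** 2 for xi in x) *
               sum((yi - ybar) ** 2 for yi in y))
    return num / den
```

A perfect increasing line gives r = 1, a perfect decreasing line gives r = -1, and swapping the arguments leaves r unchanged (the symmetry noted above).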
regression/correlation
one can assess the strength of a relationship by seeing how knowledge of one variable improves the ability to predict the other
if you ignore x, the best predictor of y will be the mean of all y values (ȳ)
if the y measurements are widely scattered, prediction errors will be greater than if they are close together
we can assess the dispersion of y values around their mean by:
Σ(yi - ȳ)²     (the total variation of y around its mean)

the variation remaining around the regression line is Σ(yi - ŷi)², so:

r² = 1 - Σ(yi - ŷi)² / Σ(yi - ȳ)²

the coefficient of determination (r²) describes the proportion of variation that is explained or accounted for by the regression line
e.g. r² = .5:
- half of the variation is explained by the regression
- half of the variation in y is explained by variation in x
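The r² computation can be sketched end to end: fit the least-squares line, then compare residual variation to total variation (the data and variable names here are made up for illustration):

```python
# illustrative data, not from the source
x = [1, 2, 3, 4]
y = [2, 4, 5, 9]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
# least-squares slope and intercept
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
    sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar

ss_total = sum((yi - ybar) ** 2 for yi in y)                      # total variation
ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # residual variation
r_squared = 1 - ss_resid / ss_total   # proportion of variation explained
```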
caution
these are different questions and have different implications for formal regression
percents will show at least some level of correlation even if the underlying counts do not
spurious correlation (negative): the closed-sum effect
case:   1   2   3   4   5   6   7   8   9  10
C_v1:  15  35  20  23  36  79  40  95  27  67
C_v2:  14   1  96  59  90   2  99  36   0  93
[Figures: distributions of r for the original counts and for percents computed over 5, 3, and 2 variables; scatterplots of C_V2 vs C_V1, P10_V2 vs P10_V1, T5_V2 vs T5_V1, T3_V2 vs T3_V1, and T2_V2 vs T2_V1]
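The closed-sum effect can be demonstrated by simulation (a sketch; the setup and names are made up): generate independent counts for three variables, convert each case to percents, and the percents acquire a negative correlation the counts never had.

```python
import random
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    num = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    den = sqrt(sum((xi - xbar) ** 2 for xi in x) *
               sum((yi - ybar) ** 2 for yi in y))
    return num / den

random.seed(1)
rs = []
for _ in range(200):
    # 10 cases, 3 independent count variables per case
    counts = [[random.uniform(1, 100) for _ in range(3)] for _ in range(10)]
    # percents within each case share a closed sum of 100
    totals = [sum(case) for case in counts]
    p1 = [case[0] / t * 100 for case, t in zip(counts, totals)]
    p2 = [case[1] / t * 100 for case, t in zip(counts, totals)]
    rs.append(pearson_r(p1, p2))

mean_r = sum(rs) / len(rs)   # clearly negative, though the counts were independent
```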
regression assumptions
- both variables are measured at the interval scale or above
- variation is the same at all points along the regression line (variation is homoscedastic)
residuals
vertical deviations of points around the regression; for case i, residual = yi - ŷi = yi - (a + bxi)
residuals in y should:
- not show patterned variation with either x or ŷ
- be normally distributed around the regression line
- not be autocorrelated (errors/residuals in y are independent)
the standard error of the estimate (SEE) measures the scatter of observed y values around the regression line:

SEE = sqrt( Σ(yi - ŷi)² / (n - 2) )
to the degree that the regression assumptions hold:
- there is a 68% probability that true values of y lie within 1 SEE of ŷ
- 95% lie within 2 SEE
can plot lines showing the SEE:  ŷ = a + bx ± SEE
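A minimal sketch of the SEE for a fitted line ŷ = a + bx (the divisor n - 2 is an assumption here; some texts divide by n):

```python
from math import sqrt

def see(x, y, a, b):
    """Standard error of the estimate for the fitted line y-hat = a + b*x."""
    # sum of squared vertical residuals around the regression line
    resid_sq = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    # n - 2 degrees of freedom (two parameters estimated: a and b)
    return sqrt(resid_sq / (len(x) - 2))

# bands at y-hat +/- 2*SEE bracket roughly 95% of the observations
s = see([1, 2, 3, 4], [2, 4, 5, 9], -0.5, 2.2)
```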
[Figures: scatterplots of VAR2 against VAR1, and of VAR2 against a transformed variable VAR1T]
distribution and fall-off models ex: density of obsidian vs. distance from the quarry:
[Figures: DENSITY vs DIST scatterplot showing a curved fall-off; RESIDUAL vs DIST; DENSITY vs ESTIMATE; LG_DENS = log(DENSITY) plotted against DIST, which straightens the relationship; fitted line log y = 1.70 - .05x; the back-transformed curve plotted over DENSITY vs DISTANCE]
fplot y = exp(1.70-.05*x)
fplot y = exp(1.70-.05*x) ; XLABEL='' YLABEL='' XTICK=0 XPIP=0 YTICK=0 YPIP=0 XMIN=0 XMAX=80 YMIN=0 YMAX=6
end
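The fall-off fit can be sketched in Python: regress log(density) on distance, then back-transform. The distances and densities below are made up, generated exactly from the slide's fitted line log y = 1.70 - .05x, so the regression recovers those coefficients:

```python
from math import exp, log

# illustrative data generated from the fitted line log y = 1.70 - .05x
dist = [5, 10, 20, 30, 40, 60, 80]
density = [exp(1.70 - 0.05 * d) for d in dist]

# regress log(density) on distance (least squares)
logdens = [log(v) for v in density]
n = len(dist)
dbar = sum(dist) / n
lbar = sum(logdens) / n
b = sum((d - dbar) * (l - lbar) for d, l in zip(dist, logdens)) / \
    sum((d - dbar) ** 2 for d in dist)
a = lbar - b * dbar

# back-transformed fall-off model: density ~ exp(a + b * distance)
```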
transformation summary
correcting left skew:                          correcting right skew:

   x⁴        x³        x²        x      log(x)     -1/x       -1/x²
stronger   strong     mild     weak      mild     strong     stronger
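The right-hand side of the ladder can be illustrated with a quick check that log(x) pulls in a long right tail (skewness() is an illustrative moment-based helper and the data are made up, not from the source):

```python
from math import log

def skewness(data):
    """Moment-based skewness: m3 / m2^1.5 (positive = right-skewed)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((v - mean) ** 2 for v in data) / n
    m3 = sum((v - mean) ** 3 for v in data) / n
    return m3 / m2 ** 1.5

right_skewed = [1, 2, 2, 3, 3, 4, 5, 8, 20, 50]   # long right tail
logged = [log(v) for v in right_skewed]            # log(x) compresses the tail
```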