[Figure: scatterplots of AG_C1_2 against AG_C1_1 for width, length, and weight of core]
scatterplot matrix
[Figure: scatterplot matrix of AG_C1_1, AG_C2_1, AG_C3_1, AG_C4_1, AG_C1_2, AG_C2_2, AG_C3_2, AG_C4_2; detail panels show AG_C2_1 against AG_C1_1 and AG_C1_2 against AG_C1_1]
scatterplots
scatterplots provide the most detailed summary of a bivariate relationship, but they are not concise, and there are limits to what else you can do with them
simpler kinds of summaries may be useful:
- more compact, though they often capture less detail
- may support more extended mathematical analyses
- may reveal fundamental relationships
y = a + bx
[Figure: a line through the points (x1,y1) and (x2,y2)]
b = slope = Δy/Δx = (y2-y1)/(x2-x1)
a = y intercept
y = a + bx
we can predict values of y from values of x
predicted values of y are called y-hat:  ŷ = a + bx
the predicted values (ŷ) are often regarded as dependent on the (independent) x values
try to assign the independent variable to the x-axis and the dependent variable to the y-axis
y = a + bx
becomes a concise summary of a point distribution, and a model of a relationship
may have important explanatory and predictive value
linear regression
linear regression and correlation analysis are generally concerned with fitting lines to real data
least squares regression is one of the main tools:
- attempts to minimize deviation of observed points from the regression line
- maximizes its potential for prediction
Note:
these are the vertical deviations (dyi = yi - ŷi); this is a sum-squared-error approach
the line is fit by minimizing:  Σ dyi² = Σ (yi - ŷi)²
calculating a line that minimizes this value is called regressing y on x
- appropriate when we are trying to predict y from x
- this is also called Model I Regression
the slope (b) is calculated as:

b = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²     (summing over i = 1 to n; the numerator is related to the covariance of x and y)

once you have the slope, you can calculate the y-intercept (a):

a = ȳ - b·x̄
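The slope and intercept formulas above can be sketched directly in Python (a minimal illustration; the function name and data are made up, not from the source):

```python
def least_squares(x, y):
    """Model I least-squares regression of y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    # slope: b = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
        sum((xi - xbar) ** 2 for xi in x)
    # intercept: a = ybar - b * xbar
    a = ybar - b * xbar
    return a, b

# points lying exactly on y = 1 + 2x are recovered exactly
a, b = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
```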
regression pathologies
things to avoid in regression analysis
Tukey Line
resistant to outliers
- divide cases into thirds, based on the x-axis
- identify the median x and y values in the upper and lower thirds
- slope: b = (My3 - My1) / (Mx3 - Mx1)
- intercept: a = median of all values yi - b·xi
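The Tukey-line recipe can be sketched as follows (an illustrative implementation under the assumption of a simple equal three-way split; names and data are not from the source). Note how one wild outlier leaves the fitted line unchanged:

```python
from statistics import median

def tukey_line(x, y):
    # sort cases by x and take the lower and upper thirds
    pts = sorted(zip(x, y))
    n = len(pts)
    third = n // 3
    lower, upper = pts[:third], pts[n - third:]
    mx1, my1 = median(p[0] for p in lower), median(p[1] for p in lower)
    mx3, my3 = median(p[0] for p in upper), median(p[1] for p in upper)
    # slope from the medians of the outer thirds
    b = (my3 - my1) / (mx3 - mx1)
    # intercept: median of all yi - b*xi
    a = median(yi - b * xi for xi, yi in zip(x, y))
    return a, b

# perfect line y = 1 + 2x with one wild outlier
xs = list(range(9))
ys = [1 + 2 * xi for xi in xs]
ys[4] = 100
a, b = tukey_line(xs, ys)
```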
Correlation
regression concerns fitting a linear model to observed data
correlation concerns the degree of fit between observed data and the model...
if most points lie near the line:
- the fit of the model is good
- the two variables are strongly correlated
- values of y can be well predicted from x
Pearson's r
this is assessed using the product-moment correlation coefficient:

r = Σ(xi - x̄)(yi - ȳ) / sqrt( Σ(xi - x̄)² · Σ(yi - ȳ)² )
r is symmetrical: correlating x with y gives the same value as correlating y with x
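A minimal sketch of the product-moment coefficient (function name and data are illustrative, not from the source):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's product-moment correlation coefficient."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    num = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    den = sqrt(sum((xi - xbar) ** 2 for xi in x) *
               sum((yi - ybar) ** 2 for yi in y))
    return num / den
```

A perfect increasing line gives r = 1, a perfect decreasing line gives r = -1, and swapping the arguments leaves r unchanged (the symmetry noted above).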
regression/correlation
one can assess the strength of a relationship by seeing how knowledge of one variable improves the ability to predict the other
if you ignore x, the best predictor of y will be the mean of all y values (ȳ)
if the y measurements are widely scattered, prediction errors will be greater than if they are close together
we can assess the dispersion of y values around their mean by:
Σ(yi - ȳ)²     (the total variation of y around its mean)

the variation remaining around the regression line is Σ(yi - ŷi)², so:

r² = 1 - Σ(yi - ŷi)² / Σ(yi - ȳ)²

the coefficient of determination (r²) describes the proportion of variation that is explained or accounted for by the regression line
e.g. r² = .5:
- half of the variation is explained by the regression
- half of the variation in y is explained by variation in x
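The r² computation can be sketched end to end: fit the least-squares line, then compare residual variation to total variation (the data and variable names here are made up for illustration):

```python
# illustrative data, not from the source
x = [1, 2, 3, 4]
y = [2, 4, 5, 9]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
# least-squares slope and intercept
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
    sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar

ss_total = sum((yi - ybar) ** 2 for yi in y)                      # total variation
ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # residual variation
r_squared = 1 - ss_resid / ss_total   # proportion of variation explained
```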
caution
these are different questions and have different implications for formal regression
percents will show at least some level of correlation even if the underlying counts do not
spurious correlation (negative): the closed-sum effect
case:   1   2   3   4   5   6   7   8   9  10
C_v1:  15  35  20  23  36  79  40  95  27  67
C_v2:  14   1  96  59  90   2  99  36   0  93
[Figures: distributions of r for the original counts and for percents computed over 5, 3, and 2 variables; scatterplots of C_V2 vs C_V1, P10_V2 vs P10_V1, T5_V2 vs T5_V1, T3_V2 vs T3_V1, and T2_V2 vs T2_V1]
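The closed-sum effect can be demonstrated by simulation (a sketch; the setup and names are made up): generate independent counts for three variables, convert each case to percents, and the percents acquire a negative correlation the counts never had.

```python
import random
from math import sqrt

def pearson_r(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    num = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    den = sqrt(sum((xi - xbar) ** 2 for xi in x) *
               sum((yi - ybar) ** 2 for yi in y))
    return num / den

random.seed(1)
rs = []
for _ in range(200):
    # 10 cases, 3 independent count variables per case
    counts = [[random.uniform(1, 100) for _ in range(3)] for _ in range(10)]
    # percents within each case share a closed sum of 100
    totals = [sum(case) for case in counts]
    p1 = [case[0] / t * 100 for case, t in zip(counts, totals)]
    p2 = [case[1] / t * 100 for case, t in zip(counts, totals)]
    rs.append(pearson_r(p1, p2))

mean_r = sum(rs) / len(rs)   # clearly negative, though the counts were independent
```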
regression assumptions
- both variables are measured at the interval scale or above
- variation is the same at all points along the regression line (variation is homoscedastic)
residuals
vertical deviations of points around the regression; for case i, residual = yi - ŷi = yi - (a + bxi)
residuals in y should:
- not show patterned variation with either x or ŷ
- be normally distributed around the regression line
- not be autocorrelated (errors/residuals in y are independent)
the standard error of the estimate (SEE) measures the scatter of observed y values around the regression line:

SEE = sqrt( Σ(yi - ŷi)² / (n - 2) )
to the degree that the regression assumptions hold:
- there is a 68% probability that true values of y lie within 1 SEE of ŷ
- 95% lie within 2 SEE
can plot lines showing the SEE:  ŷ = a + bx ± SEE
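A minimal sketch of the SEE for a fitted line ŷ = a + bx (the divisor n - 2 is an assumption here; some texts divide by n):

```python
from math import sqrt

def see(x, y, a, b):
    """Standard error of the estimate for the fitted line y-hat = a + b*x."""
    # sum of squared vertical residuals around the regression line
    resid_sq = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    # n - 2 degrees of freedom (two parameters estimated: a and b)
    return sqrt(resid_sq / (len(x) - 2))

# bands at y-hat +/- 2*SEE bracket roughly 95% of the observations
s = see([1, 2, 3, 4], [2, 4, 5, 9], -0.5, 2.2)
```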
[Figures: scatterplots of VAR2 against VAR1, and of VAR2 against a transformed variable VAR1T]
distribution and fall-off models ex: density of obsidian vs. distance from the quarry:
[Figures: DENSITY vs DIST scatterplot showing a curved fall-off; RESIDUAL vs DIST; DENSITY vs ESTIMATE; LG_DENS = log(DENSITY) plotted against DIST, which straightens the relationship; fitted line log y = 1.70 - .05x; the back-transformed curve plotted over DENSITY vs DISTANCE]
fplot y = exp(1.70-.05*x)
fplot y = exp(1.70-.05*x) ; XLABEL='' YLABEL='' XTICK=0 XPIP=0 YTICK=0 YPIP=0 XMIN=0 XMAX=80 YMIN=0 YMAX=6
end
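The fall-off fit can be sketched in Python: regress log(density) on distance, then back-transform. The distances and densities below are made up, generated exactly from the slide's fitted line log y = 1.70 - .05x, so the regression recovers those coefficients:

```python
from math import exp, log

# illustrative data generated from the fitted line log y = 1.70 - .05x
dist = [5, 10, 20, 30, 40, 60, 80]
density = [exp(1.70 - 0.05 * d) for d in dist]

# regress log(density) on distance (least squares)
logdens = [log(v) for v in density]
n = len(dist)
dbar = sum(dist) / n
lbar = sum(logdens) / n
b = sum((d - dbar) * (l - lbar) for d, l in zip(dist, logdens)) / \
    sum((d - dbar) ** 2 for d in dist)
a = lbar - b * dbar

# back-transformed fall-off model: density ~ exp(a + b * distance)
```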
transformation summary
correcting left skew:                          correcting right skew:

   x⁴        x³        x²        x      log(x)     -1/x       -1/x²
stronger   strong     mild     weak      mild     strong     stronger
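The right-hand side of the ladder can be illustrated with a quick check that log(x) pulls in a long right tail (skewness() is an illustrative moment-based helper and the data are made up, not from the source):

```python
from math import log

def skewness(data):
    """Moment-based skewness: m3 / m2^1.5 (positive = right-skewed)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((v - mean) ** 2 for v in data) / n
    m3 = sum((v - mean) ** 3 for v in data) / n
    return m3 / m2 ** 1.5

right_skewed = [1, 2, 2, 3, 3, 4, 5, 8, 20, 50]   # long right tail
logged = [log(v) for v in right_skewed]            # log(x) compresses the tail
```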