
STATISTICS II

Inference for Linear Regression

- In OLS the explanatory variable x is quantitative, defining a subpopulation for each value of x
- Assumption that the subpopulation means all lie on a line when plotted against x:
  o yi = β0 + β1·xi + εi (DATA = FIT + RESIDUAL)
  o Allows us to do inference not only for subpopulations for which we have data, but also for those corresponding to x-values not present in the data
- Prerequisites: Normal distributions, equal standard deviations for all x, linear association; Normally distributed residuals (assessed by plotting)
- Estimating the regression parameters (slope, intercept, and variation):
  o Least-squares line: ŷ = b0 + b1·x
  o Slope: b1 = r·(sy/sx)
  o Intercept: b0 = ȳ − b1·x̄
  o Residual: ei = yi − ŷi = yi − b0 − b1·xi; all ei together sum to zero; Normal distribution and equal standard deviations are assumed
  o Variation: s² = Σei²/(n − 2) = Σ(yi − ŷi)²/(n − 2); s = √s²; the variation of y around the least-squares regression line
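A minimal numerical sketch of these estimates, assuming numpy is available (the data are hypothetical):

    import numpy as np

    # Hypothetical sample data
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

    n = len(x)
    r = np.corrcoef(x, y)[0, 1]

    # Slope and intercept from the formulas above
    b1 = r * y.std(ddof=1) / x.std(ddof=1)   # b1 = r * (sy / sx)
    b0 = y.mean() - b1 * x.mean()            # b0 = y-bar - b1 * x-bar

    # Residuals sum to (numerically) zero
    y_hat = b0 + b1 * x
    e = y - y_hat

    # Regression standard error: s^2 = sum(ei^2) / (n - 2)
    s = np.sqrt(np.sum(e**2) / (n - 2))
    print(b0, b1, s, e.sum())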

ANOVA for simple linear regression:

    Source   DF      SS             MS          F
    Model    1       Σ(ŷi − ȳ)²     SSM / DFM   MSM / MSE
    Error    n − 2   Σ(yi − ŷi)²    SSE / DFE
    Total    n − 1   Σ(yi − ȳ)²     SST / DFT

NB: R² = SSM/SST = Σ(ŷi − ȳ)²/Σ(yi − ȳ)²; F = MSM/MSE
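Continuing the sketch above, the sums of squares, R², and F statistic follow directly:

    # Continues the previous sketch (x, y, y_hat, n as defined there)
    ssm = np.sum((y_hat - y.mean())**2)   # model sum of squares
    sse = np.sum((y - y_hat)**2)          # error sum of squares
    sst = np.sum((y - y.mean())**2)       # total; sst == ssm + sse

    msm = ssm / 1          # DFM = 1
    mse = sse / (n - 2)    # DFE = n - 2

    r_squared = ssm / sst
    f_stat = msm / mse     # in simple regression, equals the slope t statistic squared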
Confidence intervals and significance tests:
  o Parameter significance tests: t = bi/SE_bi; df = n − 2
  o Parameter intervals: bi ± t*·SE_bi; df = n − 2
  o Mean response intervals: ŷ ± t*·SE_μ̂; df = n − 2
  o Prediction intervals: ŷ ± t*·SE_ŷ; df = n − 2; intervals for a single future observation are larger than intervals for the mean of its subpopulation
Standard errors:

  o Slope: SE_b1 = s/√Σ(xi − x̄)² = s/√((n − 1)·sx²)
  o Intercept: SE_b0 = s·√(1/n + x̄²/Σ(xi − x̄)²)
  o Mean response: SE_μ̂ = s·√(1/n + (x* − x̄)²/Σ(xi − x̄)²)
  o Individual prediction: SE_ŷ = s·√(1 + 1/n + (x* − x̄)²/Σ(xi − x̄)²)
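A sketch of these standard errors and the resulting intervals, continuing the example above (scipy is assumed for the t critical value; x* = 3.5 is a hypothetical new observation):

    from scipy import stats

    sxx = np.sum((x - x.mean())**2)

    se_b1 = s / np.sqrt(sxx)
    se_b0 = s * np.sqrt(1/n + x.mean()**2 / sxx)

    t_crit = stats.t.ppf(0.975, df=n - 2)    # 95% two-sided critical value

    # Significance test and confidence interval for the slope
    t_b1 = b1 / se_b1
    ci_b1 = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)

    # Mean-response and prediction intervals at a new x*
    x_star = 3.5
    se_mu = s * np.sqrt(1/n + (x_star - x.mean())**2 / sxx)
    se_pred = s * np.sqrt(1 + 1/n + (x_star - x.mean())**2 / sxx)
    y_star = b0 + b1 * x_star
    ci_mean = (y_star - t_crit * se_mu, y_star + t_crit * se_mu)
    pi = (y_star - t_crit * se_pred, y_star + t_crit * se_pred)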

Correlation: r = (1/(n − 1))·Σ(yi − ȳ)(xi − x̄)/(sy·sx) = COV(y, x)/(SD(y)·SD(x))
  o Indicates direction and degree of the relationship
  o Standardized measure (between −1 and 1)
  o Captures only the linear relationship
  o Sensitive to outliers
Inference for correlation:
  o The correlation coefficient is a measure of the strength and direction of the linear association between two variables; the required condition is joint Normality
  o Test for a zero population correlation: t = r·√(n − 2)/√(1 − r²); df = n − 2
  o Significance tests for a correlation and for the slope in a linear regression yield identical t statistics; in fact b1/SE_b1 = r·√(n − 2)/√(1 − r²); this also applies to H0: R² = 0
  o Confidence intervals: r is not Normally distributed, but skewed (in testing for ρ = 0, the sampling distribution can be approximated by a Normal distribution)
  o Therefore:
    - Fisher's Z-transformation: rZ = (1/2)·ln((1 + r)/(1 − r)) with s_rZ = 1/√(n − 3)
    - Inverse transformation: r = (e^(2·rZ) − 1)/(e^(2·rZ) + 1)
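A sketch of the zero-correlation test and a Fisher-Z confidence interval, continuing with the same hypothetical data (numpy and scipy as above):

    # Test H0: rho = 0 (identical to the slope t statistic b1 / se_b1)
    t_r = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
    p_value = 2 * stats.t.sf(abs(t_r), df=n - 2)

    # 95% confidence interval for rho via Fisher's Z-transformation
    r_z = 0.5 * np.log((1 + r) / (1 - r))
    s_rz = 1 / np.sqrt(n - 3)
    z_crit = stats.norm.ppf(0.975)
    lo_z, hi_z = r_z - z_crit * s_rz, r_z + z_crit * s_rz

    # Back-transform the endpoints to the r scale
    lo = (np.exp(2 * lo_z) - 1) / (np.exp(2 * lo_z) + 1)
    hi = (np.exp(2 * hi_z) - 1) / (np.exp(2 * hi_z) + 1)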

One-Way ANOVA

- Statistical technique that assesses whether observed differences between group sample means are statistically significant
- For two populations the F statistic equals the square of the two-sample t statistic:

    t = (x̄ − ȳ)/(sp·√(1/n + 1/n)), so t² = n·(x̄ − ȳ)²/(2·sp²) = F

  (For linear regressions with only one explanatory variable the t form of the test is preferable, as it more easily allows testing one-sided alternatives, etc.)
- Prerequisites:
  o Only one way to classify the populations of interest; independent SRSs
  o Normally distributed data
  o Population standard deviations are equal: 2·sMIN > sMAX (rule of thumb)
  o If conditions are violated, a transformation can solve the problem (e.g. log)
- As the standard deviations are considered equal, we combine them into a single (pooled) estimate:
    sp² = [(n1 − 1)·s1² + (n2 − 1)·s2² + ... + (nI − 1)·sI²] / [(n1 − 1) + (n2 − 1) + ... + (nI − 1)]
    sp = √sp² = √MSE

- Pooling gives more weight to groups with larger sample sizes; if the sample sizes are all equal, the pooled variance is just the average of the I sample variances
- Sums of squares:
  o SST: Σ(xij − x̄)² (variation of the data around the overall mean)
  o SSG: Σni·(x̄i − x̄)² (variation of the group means around the overall mean)
  o SSE: Σ(xij − x̄i)² (variation of each observation around its group mean)


    Source   DF      SS                         MS          F
    Model    I − 1   Σ_groups ni·(x̄i − x̄)²      SSG / DFG   MSG / MSE
    Error    N − I   Σ_groups (ni − 1)·si²      SSE / DFE
    Total    N − 1   Σ_obs (xij − x̄)²

(R² = SSG/SST)
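A from-scratch sketch of the one-way ANOVA computation, with hypothetical groups (scipy's F distribution provides the P-value):

    import numpy as np
    from scipy import stats

    # Hypothetical groups
    groups = [np.array([4.1, 5.0, 4.6, 5.3]),
              np.array([5.9, 6.4, 6.1]),
              np.array([4.8, 5.2, 5.5, 5.1, 4.9])]

    I = len(groups)
    N = sum(len(g) for g in groups)
    grand_mean = np.concatenate(groups).mean()

    ssg = sum(len(g) * (g.mean() - grand_mean)**2 for g in groups)
    sse = sum(((g - g.mean())**2).sum() for g in groups)

    msg = ssg / (I - 1)
    mse = sse / (N - I)        # mse = sp^2, the pooled variance
    f_stat = msg / mse
    p_value = stats.f.sf(f_stat, I - 1, N - I)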

The F(I − 1, N − I) test:
  o F is approximately 1 if H0 is true and becomes very large if Ha is true
  o The P-value of the F test is the probability that a random variable having the F(I − 1, N − I) distribution is greater than or equal to the calculated value; the F test is always one-sided, because any differences among the group means tend to make F large
  o When H0 is true, both MSG and MSE are unbiased estimators of the population variance, so their values must be similar (i.e., F is close to 1)
Comparing the means (rejecting the ANOVA H0 alone does not produce useful results):
  o Visually: e.g. overlap of multiple confidence intervals (rule of thumb: overlap < width indicates a significant difference)
  o Contrasts: questions of interest are formulated before examining the data; inference about a contrast is valid whether or not the ANOVA H0 is rejected; the group sample means have to be known

    c = Σai·x̄i; SE_c = sp·√(Σai²/ni); the coefficients ai sum to zero
    t(DFE) = c/SE_c; confidence interval: c ± t*·SE_c

    Different kinds of contrasts (a worked sketch follows this list):
      - Simple contrast: compares the mean of each level to the mean of a specified level (reference category); useful for a control group
      - Deviation contrast: compares the mean of each level to the mean of all levels (grand mean)
      - Difference contrast: compares the mean of each level (except the first) to the mean of the previous levels
      - Helmert contrast: compares the mean of each level (except the last) to the mean of the subsequent levels
      - Repeated contrast: compares the mean of each level (except the last) to the mean of the respective subsequent level
    Advantages: more power (better probability of rejecting H0), less chance capitalization (because of fewer tests)
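A minimal sketch of a single contrast test, continuing the one-way example above; the coefficients, which compare group 1 against the average of groups 2 and 3, are a hypothetical choice:

    # Contrast coefficients sum to zero: group 1 vs. average of groups 2 and 3
    a = np.array([1.0, -0.5, -0.5])
    means = np.array([g.mean() for g in groups])
    ns = np.array([len(g) for g in groups])

    sp = np.sqrt(mse)                        # pooled sd from the ANOVA above
    c = (a * means).sum()                    # sample value of the contrast
    se_c = sp * np.sqrt((a**2 / ns).sum())

    t_c = c / se_c                           # t with DFE = N - I degrees of freedom
    t_crit = stats.t.ppf(0.975, df=N - I)
    ci = (c - t_crit * se_c, c + t_crit * se_c)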

o Multiple comparisons: post-hoc, i.e. only used after rejecting the ANOVA H0; computation of t statistics for all pairs of means, using the pooled standard deviation sp; always test two-sided

    tij = (x̄i − x̄j)/(sp·√(1/ni + 1/nj))

    t** is the critical threshold; its value depends on the chosen multiple-comparisons procedure. Methods:
      - LSD (least-significant differences) method: uses a t** of the upper α/2 critical value of the t(DFE) distribution; changing the dfs of the t distribution to n − 1; suffers from strong chance capitalization
      - Bonferroni: controls for chance capitalization and guarantees that the probability of any false rejection is no greater than α (each test uses α/k); can be approximated by hand
      - Tukey; Student-Newman-Keuls (SNK): based on more advanced mathematics

    Simultaneous confidence intervals: (x̄i − x̄j) ± t**·sp·√(1/ni + 1/nj); if a confidence interval contains 0, the corresponding test will not be significant (a Bonferroni sketch follows after this list)
  o Power: probability of rejecting H0 when Ha is in fact true
      - Specify: an important alternative, i.e. values for the population means; sample sizes; the significance level α; a guess of the sd
      - Find the critical value F* of the F(DFG, DFE) distribution that leads to the rejection of H0
      - Calculate the noncentrality parameter
      - The probability that the observed F is greater than F* is the power value
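A sketch of Bonferroni-adjusted pairwise comparisons for the same hypothetical example (means, ns, and sp come from the contrast sketch above):

    from itertools import combinations

    k = I * (I - 1) // 2       # number of pairwise comparisons
    alpha = 0.05
    t_star = stats.t.ppf(1 - alpha / (2 * k), df=N - I)   # Bonferroni t**

    for i, j in combinations(range(I), 2):
        diff = means[i] - means[j]
        se = sp * np.sqrt(1 / ns[i] + 1 / ns[j])
        print(i + 1, j + 1, round(diff, 3), abs(diff / se) > t_star)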

Two-Way ANOVA

- Advantages:
  o Including more factors (each with two or more levels) moves variation from RESIDUAL to FIT and therefore reduces the residual variation of the model
  o More efficient than one-way ANOVA
  o Interaction effects (differences between differences between means) in addition to main effects
- General two-way problem: I × J ANOVA
- Prerequisites:
  o Independent SRSs
  o Normal distributions
  o Equal standard deviations
  o Normally distributed residuals
- Main effects and interaction:
  o Differences between the marginal means relate to main effects
  o Presence of interaction does not render main effects uninformative
  o It is good practice to examine the test for an interaction first, because the presence of a strong interaction can influence the interpretation of main effects

    Source   DF               SS     MS            F
    A        I − 1            SSA    SSA / DFA     MSA / MSE
    B        J − 1            SSB    SSB / DFB     MSB / MSE
    AB       (I − 1)(J − 1)   SSAB   SSAB / DFAB   MSAB / MSE
    Error    N − IJ           SSE    SSE / DFE
    Total    N − 1            SST

- When the nij are not equal, some methods of analysis give sums of squares that do not add up to SST
- Hypothesis testing involves at least three individual tests (A, B, AB):
  o Visual inspection: e.g. means plot; CI plot
  o Inferential statistics: confidence intervals; post hoc tests only for main effects; contrasts (simple-main-effects analyses) will be introduced in Statistics III
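A balanced-design sketch of the two-way table's sums of squares (hypothetical 2 × 2 data with three replicates per cell; scipy again provides P-values):

    import numpy as np
    from scipy import stats

    # Hypothetical balanced design: shape (I, J, n_rep)
    data = np.array([[[5.1, 4.9, 5.4], [6.2, 6.0, 6.5]],
                     [[4.4, 4.7, 4.2], [7.1, 6.8, 7.3]]])
    I, J, n_rep = data.shape
    N = data.size

    grand = data.mean()
    mean_a = data.mean(axis=(1, 2))   # marginal means of factor A
    mean_b = data.mean(axis=(0, 2))   # marginal means of factor B
    cell = data.mean(axis=2)          # cell means

    ssa = n_rep * J * np.sum((mean_a - grand)**2)
    ssb = n_rep * I * np.sum((mean_b - grand)**2)
    ssab = n_rep * np.sum((cell - mean_a[:, None] - mean_b[None, :] + grand)**2)
    sse = np.sum((data - cell[:, :, None])**2)

    mse = sse / (N - I * J)
    f_ab = (ssab / ((I - 1) * (J - 1))) / mse
    p_ab = stats.f.sf(f_ab, (I - 1) * (J - 1), N - I * J)   # examine interaction first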

Formulas to Know by Heart

- Mean, variance, standard deviation, standard error, median
- Degrees of freedom for the various procedures
- Confidence interval = expected value ± critical value × standard error
- SS, MS, F
- SST = (n − 1) × var(Y)
- Sample value of a contrast
- Regression coefficients
- Correlation
- SE of rZ
