
STATISTICS II

Inference for Linear Regression

- In OLS the explanatory variable x is quantitative, defining a subpopulation for each value of x
- Assumption that the subpopulation means all lie on a line when plotted against x:
  o yi = β0 + β1·xi + εi (DATA = FIT + RESIDUAL)
  o Allows us to do inference not only for subpopulations for which we have data, but also for those corresponding to x-values not present in the data
- Prerequisites: Normal distributions, equal standard deviations for all x, linear association; Normally distributed residuals (assessed by plotting)
- Estimating the regression parameters (slope, intercept, and variation):
  o Least-squares line: ŷ = b0 + b1·x
  o Slope: b1 = r·(sy/sx)
  o Intercept: b0 = ȳ − b1·x̄
  o Residual: ei = yi − ŷi = yi − b0 − b1·xi; all ei together sum to zero; Normal distribution and equal standard deviations are assumed
  o Variation: s² = Σei²/(n − 2) = Σ(yi − ŷi)²/(n − 2); s = √s²; the variation of y around the least-squares regression line
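A minimal numerical sketch of these estimates, assuming numpy is available (the data are hypothetical):

    import numpy as np

    # Hypothetical sample data
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

    n = len(x)
    r = np.corrcoef(x, y)[0, 1]

    # Slope and intercept from the formulas above
    b1 = r * y.std(ddof=1) / x.std(ddof=1)   # b1 = r * (sy / sx)
    b0 = y.mean() - b1 * x.mean()            # b0 = y-bar - b1 * x-bar

    # Residuals sum to (numerically) zero
    y_hat = b0 + b1 * x
    e = y - y_hat

    # Regression standard error: s^2 = sum(ei^2) / (n - 2)
    s = np.sqrt(np.sum(e**2) / (n - 2))
    print(b0, b1, s, e.sum())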

ANOVA for simple linear regression:

    Source   DF      SS             MS          F
    Model    1       Σ(ŷi − ȳ)²     SSM / DFM   MSM / MSE
    Error    n − 2   Σ(yi − ŷi)²    SSE / DFE
    Total    n − 1   Σ(yi − ȳ)²     SST / DFT

NB: R² = SSM/SST = Σ(ŷi − ȳ)²/Σ(yi − ȳ)²; F = MSM/MSE
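Continuing the sketch above, the sums of squares, R², and F statistic follow directly:

    # Continues the previous sketch (x, y, y_hat, n as defined there)
    ssm = np.sum((y_hat - y.mean())**2)   # model sum of squares
    sse = np.sum((y - y_hat)**2)          # error sum of squares
    sst = np.sum((y - y.mean())**2)       # total; sst == ssm + sse

    msm = ssm / 1          # DFM = 1
    mse = sse / (n - 2)    # DFE = n - 2

    r_squared = ssm / sst
    f_stat = msm / mse     # in simple regression, equals the slope t statistic squared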
Confidence intervals and significance tests:
  o Parameter significance tests: t = bi/SE_bi; df = n − 2
  o Parameter intervals: bi ± t*·SE_bi; df = n − 2
  o Mean response intervals: ŷ ± t*·SE_μ̂; df = n − 2
  o Prediction intervals: ŷ ± t*·SE_ŷ; df = n − 2; intervals for a single future observation are larger than intervals for the mean of its subpopulation
Standard errors:

  o Slope: SE_b1 = s/√Σ(xi − x̄)² = s/√((n − 1)·sx²)
  o Intercept: SE_b0 = s·√(1/n + x̄²/Σ(xi − x̄)²)
  o Mean response: SE_μ̂ = s·√(1/n + (x* − x̄)²/Σ(xi − x̄)²)
  o Individual prediction: SE_ŷ = s·√(1 + 1/n + (x* − x̄)²/Σ(xi − x̄)²)
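A sketch of these standard errors and the resulting intervals, continuing the example above (scipy is assumed for the t critical value; x* = 3.5 is a hypothetical new observation):

    from scipy import stats

    sxx = np.sum((x - x.mean())**2)

    se_b1 = s / np.sqrt(sxx)
    se_b0 = s * np.sqrt(1/n + x.mean()**2 / sxx)

    t_crit = stats.t.ppf(0.975, df=n - 2)    # 95% two-sided critical value

    # Significance test and confidence interval for the slope
    t_b1 = b1 / se_b1
    ci_b1 = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)

    # Mean-response and prediction intervals at a new x*
    x_star = 3.5
    se_mu = s * np.sqrt(1/n + (x_star - x.mean())**2 / sxx)
    se_pred = s * np.sqrt(1 + 1/n + (x_star - x.mean())**2 / sxx)
    y_star = b0 + b1 * x_star
    ci_mean = (y_star - t_crit * se_mu, y_star + t_crit * se_mu)
    pi = (y_star - t_crit * se_pred, y_star + t_crit * se_pred)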

Correlation: r = (1/(n − 1))·Σ(yi − ȳ)(xi − x̄)/(sy·sx) = COV(y, x)/(SD(y)·SD(x))
  o Indicates direction and degree of the relationship
  o Standardized measure (between −1 and 1)
  o Captures only the linear relationship
  o Sensitive to outliers
Inference for correlation:
  o The correlation coefficient is a measure of the strength and direction of the linear association between two variables; the required condition is joint Normality
  o Test for a zero population correlation: t = r·√(n − 2)/√(1 − r²); df = n − 2
  o Significance tests for a correlation and for the slope in a linear regression yield identical t statistics; in fact b1/SE_b1 = r·√(n − 2)/√(1 − r²); this also applies to H0: R² = 0
  o Confidence intervals: r is not Normally distributed, but skewed (in testing for ρ = 0, the sampling distribution can be approximated by a Normal distribution)
  o Therefore:
    - Fisher's Z-transformation: rZ = (1/2)·ln((1 + r)/(1 − r)) with s_rZ = 1/√(n − 3)
    - Inverse transformation: r = (e^(2·rZ) − 1)/(e^(2·rZ) + 1)
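A sketch of the zero-correlation test and a Fisher-Z confidence interval, continuing with the same hypothetical data (numpy and scipy as above):

    # Test H0: rho = 0 (identical to the slope t statistic b1 / se_b1)
    t_r = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
    p_value = 2 * stats.t.sf(abs(t_r), df=n - 2)

    # 95% confidence interval for rho via Fisher's Z-transformation
    r_z = 0.5 * np.log((1 + r) / (1 - r))
    s_rz = 1 / np.sqrt(n - 3)
    z_crit = stats.norm.ppf(0.975)
    lo_z, hi_z = r_z - z_crit * s_rz, r_z + z_crit * s_rz

    # Back-transform the endpoints to the r scale
    lo = (np.exp(2 * lo_z) - 1) / (np.exp(2 * lo_z) + 1)
    hi = (np.exp(2 * hi_z) - 1) / (np.exp(2 * hi_z) + 1)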

One-Way ANOVA

- Statistical technique that assesses whether observed differences between group sample means are statistically significant
- For two populations the F statistic equals the square of the two-sample t statistic:

    t = (x̄ − ȳ)/(sp·√(1/n + 1/n)), so t² = n·(x̄ − ȳ)²/(2·sp²) = F

  (For linear regressions with only one explanatory variable the t form of the test is preferable, as it more easily allows testing one-sided alternatives, etc.)
- Prerequisites:
  o Only one way to classify the populations of interest; independent SRSs
  o Normally distributed data
  o Population standard deviations are equal: 2·sMIN > sMAX (rule of thumb)
  o If conditions are violated, a transformation can solve the problem (e.g. log)
- As the standard deviations are considered equal, we combine them into a single (pooled) estimate:
    sp² = [(n1 − 1)·s1² + (n2 − 1)·s2² + ... + (nI − 1)·sI²] / [(n1 − 1) + (n2 − 1) + ... + (nI − 1)]
    sp = √sp² = √MSE

- Pooling gives more weight to groups with larger sample sizes; if the sample sizes are all equal, the pooled variance is just the average of the I sample variances
- Sums of squares:
  o SST: Σ(xij − x̄)² (variation of the data around the overall mean)
  o SSG: Σni·(x̄i − x̄)² (variation of the group means around the overall mean)
  o SSE: Σ(xij − x̄i)² (variation of each observation around its group mean)


    Source   DF      SS                         MS          F
    Model    I − 1   Σ_groups ni·(x̄i − x̄)²      SSG / DFG   MSG / MSE
    Error    N − I   Σ_groups (ni − 1)·si²      SSE / DFE
    Total    N − 1   Σ_obs (xij − x̄)²

(R² = SSG/SST)
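A from-scratch sketch of the one-way ANOVA computation, with hypothetical groups (scipy's F distribution provides the P-value):

    import numpy as np
    from scipy import stats

    # Hypothetical groups
    groups = [np.array([4.1, 5.0, 4.6, 5.3]),
              np.array([5.9, 6.4, 6.1]),
              np.array([4.8, 5.2, 5.5, 5.1, 4.9])]

    I = len(groups)
    N = sum(len(g) for g in groups)
    grand_mean = np.concatenate(groups).mean()

    ssg = sum(len(g) * (g.mean() - grand_mean)**2 for g in groups)
    sse = sum(((g - g.mean())**2).sum() for g in groups)

    msg = ssg / (I - 1)
    mse = sse / (N - I)        # mse = sp^2, the pooled variance
    f_stat = msg / mse
    p_value = stats.f.sf(f_stat, I - 1, N - I)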

The F(I − 1, N − I) test:
  o F is approximately 1 if H0 is true and becomes very large if Ha is true
  o The P-value of the F test is the probability that a random variable having the F(I − 1, N − I) distribution is greater than or equal to the calculated value; the F test is always one-sided, because any differences among the group means tend to make F large
  o When H0 is true, both MSG and MSE are unbiased estimators of the population variance, so their values must be similar (i.e., F is close to 1)
Comparing the means (rejecting the ANOVA H0 alone does not produce useful results):
  o Visually: e.g. overlap of multiple confidence intervals (rule of thumb: overlap < width indicates a significant difference)
  o Contrasts: questions of interest are formulated before examining the data; inference about a contrast is valid whether or not the ANOVA H0 is rejected; the group sample means have to be known

    c = Σai·x̄i; SE_c = sp·√(Σai²/ni); the coefficients ai sum to zero
    t(DFE) = c/SE_c; confidence interval: c ± t*·SE_c

    Different kinds of contrasts (a worked sketch follows this list):
      - Simple contrast: compares the mean of each level to the mean of a specified level (reference category); useful for a control group
      - Deviation contrast: compares the mean of each level to the mean of all levels (grand mean)
      - Difference contrast: compares the mean of each level (except the first) to the mean of the previous levels
      - Helmert contrast: compares the mean of each level (except the last) to the mean of the subsequent levels
      - Repeated contrast: compares the mean of each level (except the last) to the mean of the respective subsequent level
    Advantages: more power (better probability of rejecting H0), less chance capitalization (because of fewer tests)
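A minimal sketch of a single contrast test, continuing the one-way example above; the coefficients, which compare group 1 against the average of groups 2 and 3, are a hypothetical choice:

    # Contrast coefficients sum to zero: group 1 vs. average of groups 2 and 3
    a = np.array([1.0, -0.5, -0.5])
    means = np.array([g.mean() for g in groups])
    ns = np.array([len(g) for g in groups])

    sp = np.sqrt(mse)                        # pooled sd from the ANOVA above
    c = (a * means).sum()                    # sample value of the contrast
    se_c = sp * np.sqrt((a**2 / ns).sum())

    t_c = c / se_c                           # t with DFE = N - I degrees of freedom
    t_crit = stats.t.ppf(0.975, df=N - I)
    ci = (c - t_crit * se_c, c + t_crit * se_c)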

o Multiple comparisons: post-hoc, i.e. only used after rejecting the ANOVA H0; computation of t statistics for all pairs of means, using the pooled standard deviation sp; always test two-sided

    tij = (x̄i − x̄j)/(sp·√(1/ni + 1/nj))

    t** is the critical threshold; its value depends on the chosen multiple-comparisons procedure. Methods:
      - LSD (least-significant differences) method: uses a t** of the upper α/2 critical value of the t(DFE) distribution; changing the dfs of the t distribution to n − 1; suffers from strong chance capitalization
      - Bonferroni: controls for chance capitalization and guarantees that the probability of any false rejection is no greater than α (each test uses α/k); can be approximated by hand
      - Tukey; Student-Newman-Keuls (SNK): based on more advanced mathematics

    Simultaneous confidence intervals: (x̄i − x̄j) ± t**·sp·√(1/ni + 1/nj); if a confidence interval contains 0, the corresponding test will not be significant (a Bonferroni sketch follows after this list)
  o Power: probability of rejecting H0 when Ha is in fact true
      - Specify: an important alternative, i.e. values for the population means; sample sizes; the significance level α; a guess of the sd
      - Find the critical value F* of the F(DFG, DFE) distribution that leads to the rejection of H0
      - Calculate the noncentrality parameter
      - The probability that the observed F is greater than F* is the power value
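A sketch of Bonferroni-adjusted pairwise comparisons for the same hypothetical example (means, ns, and sp come from the contrast sketch above):

    from itertools import combinations

    k = I * (I - 1) // 2       # number of pairwise comparisons
    alpha = 0.05
    t_star = stats.t.ppf(1 - alpha / (2 * k), df=N - I)   # Bonferroni t**

    for i, j in combinations(range(I), 2):
        diff = means[i] - means[j]
        se = sp * np.sqrt(1 / ns[i] + 1 / ns[j])
        print(i + 1, j + 1, round(diff, 3), abs(diff / se) > t_star)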

Two-Way ANOVA

- Advantages:
  o Including more factors (each with two or more levels) moves variation from RESIDUAL to FIT and therefore reduces the residual variation of the model
  o More efficient than one-way ANOVA
  o Interaction effects (differences between differences between means) in addition to main effects
- General two-way problem: I × J ANOVA
- Prerequisites:
  o Independent SRSs
  o Normal distributions
  o Equal standard deviations
  o Normally distributed residuals
- Main effects and interaction:
  o Differences between the marginal means relate to main effects
  o Presence of interaction does not render main effects uninformative
  o It is good practice to examine the test for an interaction first, because the presence of a strong interaction can influence the interpretation of main effects

    Source   DF               SS     MS            F
    A        I − 1            SSA    SSA / DFA     MSA / MSE
    B        J − 1            SSB    SSB / DFB     MSB / MSE
    AB       (I − 1)(J − 1)   SSAB   SSAB / DFAB   MSAB / MSE
    Error    N − IJ           SSE    SSE / DFE
    Total    N − 1            SST

- When the nij are not equal, some methods of analysis give sums of squares that do not add up to SST
- Hypothesis testing involves at least three individual tests (A, B, AB):
  o Visual inspection: e.g. means plot; CI plot
  o Inferential statistics: confidence intervals; post hoc tests only for main effects; contrasts (simple-main-effects analyses) will be introduced in Statistics III
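A balanced-design sketch of the two-way table's sums of squares (hypothetical 2 × 2 data with three replicates per cell; scipy again provides P-values):

    import numpy as np
    from scipy import stats

    # Hypothetical balanced design: shape (I, J, n_rep)
    data = np.array([[[5.1, 4.9, 5.4], [6.2, 6.0, 6.5]],
                     [[4.4, 4.7, 4.2], [7.1, 6.8, 7.3]]])
    I, J, n_rep = data.shape
    N = data.size

    grand = data.mean()
    mean_a = data.mean(axis=(1, 2))   # marginal means of factor A
    mean_b = data.mean(axis=(0, 2))   # marginal means of factor B
    cell = data.mean(axis=2)          # cell means

    ssa = n_rep * J * np.sum((mean_a - grand)**2)
    ssb = n_rep * I * np.sum((mean_b - grand)**2)
    ssab = n_rep * np.sum((cell - mean_a[:, None] - mean_b[None, :] + grand)**2)
    sse = np.sum((data - cell[:, :, None])**2)

    mse = sse / (N - I * J)
    f_ab = (ssab / ((I - 1) * (J - 1))) / mse
    p_ab = stats.f.sf(f_ab, (I - 1) * (J - 1), N - I * J)   # examine interaction first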

Formulas to Know by Heart

- Mean, variance, standard deviation, standard error, median
- Degrees of freedom for the various procedures
- Confidence interval = expected value ± critical value × standard error
- SS, MS, F
- SST = (n − 1) × var(Y)
- Sample value of a contrast
- Regression coefficients
- Correlation
- SE of rZ
