
Statistical inference

Statistical inference comprises methods for drawing conclusions about a population from sample data. Two key methods:
- Point estimate: calculate a single value (such as a mean or a proportion) from the sample.
- Confidence interval: calculate a range of values that is likely to contain the true value of the parameter.

Standard error of an estimate


The standard deviation of an estimate is called the standard error (SE). It describes the typical error or uncertainty associated with the estimate.

Computing the SE for the sample mean:

SE = σ / √n

where n is the number of independent observations and σ is the population standard deviation. Since the population standard deviation is typically unknown, if the sample size is n ≥ 30 we can use the sample standard deviation s instead.
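As a rough sketch of this computation (standard library only; the sample values below are made up for illustration):

```python
import math
import statistics

def standard_error(sample):
    # SE of the sample mean: s / sqrt(n), with s the sample standard deviation
    s = statistics.stdev(sample)
    return s / math.sqrt(len(sample))

sample = [4.0, 6.0, 8.0, 10.0]
se = standard_error(sample)
```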

Central Limit Theorem


If a sample consists of at least 30 independent observations and the data are not strongly skewed, then the distribution of the sample mean is well approximated by a normal model.

How to verify that sample observations are independent:
- a random sample consists of less than 10% of the population (rule of thumb)
- subjects in an experiment are randomly assigned.

Confidence interval
point estimate ± z* × SE

z* × SE is the margin of error. Interpreting a confidence interval: "We are XX% confident that the population parameter is between ... and ..."
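A minimal sketch of building such an interval (the data and the 95% critical value z* = 1.96 are illustrative):

```python
import math
import statistics

def confidence_interval(sample, z_star=1.96):
    # point estimate +/- z* x SE; z* = 1.96 gives a 95% interval
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    margin = z_star * se          # margin of error
    return mean - margin, mean + margin

lo, hi = confidence_interval([4.0, 6.0, 8.0, 10.0] * 10)
```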

Nontechnical Introduction to Statistical Inference Prepared by: Gabriela Hromis

Hypothesis testing

                    H0 true                         HA true
Do not reject H0    Correct decision (prob. 1 - α)  Type 2 error (prob. β)
Reject H0           Type 1 error (prob. α)          Correct decision (prob. 1 - β, the power)

α ... significance level (probability, under H0, that the test concludes HA)
1 - β ... power of the test (probability, under HA, that the test concludes HA)

The p-value quantifies how strongly the data favor HA over H0; that is, it is the probability, assuming H0 is true, of observing a difference between the two treatments at least as large as the one in the data purely by chance. A small p-value (usually < 0.05) corresponds to sufficient evidence to reject H0 in favor of HA.

Hypotheses must be set up before observing the data. If they are not, the test must be two-sided.

Test statistic: Z, when the point estimate is nearly normal.
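The two-sided p-value of a Z statistic can be sketched with the standard normal CDF, which Python's standard library supports via `math.erf`:

```python
import math

def normal_cdf(z):
    # standard normal cumulative distribution function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def two_sided_p(z):
    # probability, under H0, of a statistic at least this extreme in either direction
    return 2.0 * (1.0 - normal_cdf(abs(z)))

p = two_sided_p(1.96)   # close to 0.05, matching the usual cutoff
```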

Inference for numerical data


Paired Data
Two sets of observations are paired if each observation in one set has a special correspondence or connection with exactly one observation in the other data set.
s_diff² = s_x² + s_y² - 2 r s_x s_y

(r is the correlation between the two sets of observations)

SE = s_diff / √n

Z = (x̄_diff - 0) / SE
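A sketch of a paired Z statistic (the paired measurements are hypothetical; in practice the same result comes from taking the standard deviation of the differences directly, which is what the s_diff formula computes):

```python
import math
import statistics

# hypothetical paired measurements (e.g. before/after on the same subjects)
x = [10.0, 12.0, 9.0, 11.0, 14.0]
y = [8.0, 11.0, 9.0, 10.0, 12.0]

diffs = [a - b for a, b in zip(x, y)]
n = len(diffs)

s_diff = statistics.stdev(diffs)       # standard deviation of the differences
se = s_diff / math.sqrt(n)
z = (statistics.mean(diffs) - 0) / se  # H0: mean difference is 0
```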

Difference of two means


If the sample means x̄1 and x̄2 each meet the criteria for having nearly normal sampling distributions, and the observations in the two samples are independent, then the difference in sample means, x̄1 - x̄2, will have a sampling distribution that is nearly normal.


SE = √(σ1²/n1 + σ2²/n2) ≈ √(s1²/n1 + s2²/n2)

Since we usually don't know the population standard deviations σ1 and σ2, if each sample has at least 30 observations we can use the sample standard deviations s1 and s2 with a Z-test. For small samples, we use a t-test.
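A sketch of the two-sample Z statistic (the tiny samples here only illustrate the arithmetic; as noted above, a Z-test needs at least 30 observations per group):

```python
import math
import statistics

def two_sample_z(x1, x2):
    # Z = (x1_bar - x2_bar) / sqrt(s1^2/n1 + s2^2/n2), under H0: mu1 = mu2
    s1, s2 = statistics.stdev(x1), statistics.stdev(x2)
    se = math.sqrt(s1 ** 2 / len(x1) + s2 ** 2 / len(x2))
    return (statistics.mean(x1) - statistics.mean(x2)) / se

z = two_sample_z([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 6.0, 8.0, 10.0])
```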

One-sample means with the t distribution


We use it when the population is nearly normally distributed, the sample is small, and the population standard deviation is unknown. The t distribution lets us accurately estimate the standard error from a small sample.

df = n - 1

Confidence interval:

x̄ ± t_df × s/√n
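A sketch of a t interval. The critical value t* must be looked up (in a t table or with a statistics library) for the chosen confidence level and df = n - 1; the value 2.776 below is the 95% critical value for df = 4:

```python
import math
import statistics

def t_interval(sample, t_star):
    # x_bar +/- t_df x s / sqrt(n); t_star comes from a t table for df = n - 1
    n = len(sample)
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    return mean - t_star * se, mean + t_star * se

# 95% interval for n = 5 (df = 4), using t* = 2.776 from a t table
lo, hi = t_interval([2.0, 4.0, 6.0, 8.0, 10.0], t_star=2.776)
```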

t distribution for the difference of means


For small samples, when the data are independent and nearly normally distributed.

We use x̄1 - x̄2 as a point estimate for μ1 - μ2:

T = ((x̄1 - x̄2) - (μ1 - μ2)) / √(s1²/n1 + s2²/n2)

When the standard deviations of the two groups are nearly equal, we can use a pooled standard deviation (by pooling the data we improve the estimate of the variance):

s²_pooled = (s1²(n1 - 1) + s2²(n2 - 1)) / (n1 + n2 - 2)

df = n1 + n2 - 2
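A sketch of the pooled two-sample t statistic (illustrative data; the resulting t would then be compared against a t distribution with df = n1 + n2 - 2):

```python
import math
import statistics

def pooled_t(x1, x2):
    # pooled variance: (s1^2 (n1-1) + s2^2 (n2-1)) / (n1 + n2 - 2)
    n1, n2 = len(x1), len(x2)
    s1_sq = statistics.variance(x1)
    s2_sq = statistics.variance(x2)
    sp_sq = (s1_sq * (n1 - 1) + s2_sq * (n2 - 1)) / (n1 + n2 - 2)
    se = math.sqrt(sp_sq / n1 + sp_sq / n2)
    t = (statistics.mean(x1) - statistics.mean(x2)) / se
    return t, n1 + n2 - 2   # statistic and degrees of freedom

t, df = pooled_t([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.0, 6.0, 8.0, 10.0])
```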


Comparing many means with ANOVA


H0: μ1 = μ2 = ... = μk
HA: at least one mean is different

We simultaneously consider many groups and evaluate whether their sample means differ more than we would expect from natural variation.

Test statistic for ANOVA:

F = MSG / MSE

MSG ... mean square between groups (measures variability between the group means), dfG = k - 1
MSE ... mean square error (measures variability within the groups), dfE = n - k

If H0 is true, the variation in the sample means (MSG) should be relatively small compared to the within-group variation (MSE).

Conditions for an ANOVA analysis:
- independence of the data
- approximately normal distributions in each group
- approximately constant variance across the groups
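The F statistic can be computed directly from its definition, as in this sketch (the three small groups are made up for illustration):

```python
import statistics

def anova_f(groups):
    # F = MSG / MSE
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    # between-group sum of squares
    ssg = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # within-group sum of squares
    sse = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    msg = ssg / (k - 1)   # dfG = k - 1
    mse = sse / (n - k)   # dfE = n - k
    return msg / mse

f = anova_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]])
```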

Inference for categorical data


Inference for a single proportion
Conditions for normality:
- the sample observations are independent
- we expect to see at least 10 successes and 10 failures in our sample, i.e. np ≥ 10 and n(1 - p) ≥ 10. This is called the success-failure condition.

Standard error:

SE_p̂ = √(p(1 - p)/n)

Confidence interval:

p̂ ± z* × SE
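A sketch of the interval for a single proportion (z* = 1.96 for a 95% interval; p̂ = 0.5 with n = 100 is an illustrative input):

```python
import math

def proportion_interval(p_hat, n, z_star=1.96):
    # SE = sqrt(p(1 - p)/n), interval p_hat +/- z* x SE
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z_star * se, p_hat + z_star * se

lo, hi = proportion_interval(0.5, 100)   # SE = 0.05 here
```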


Test statistic:

Z = (point estimate - null value) / SE

Choosing a sample size when estimating a proportion:


We want a sample size n that ensures the margin of error is below some threshold m:

z* × √(p(1 - p)/n) ≤ m

Solving for n:

n ≥ (z*/m)² × p(1 - p)

If we have a good estimate of p, we use it. Otherwise, the standard error is largest when p = 0.5, so to cover the worst-case scenario when we are not sure about the true p, we choose p = 0.5:

z* × √(0.25/n) ≤ m

Solving for n:

n ≥ (z*/m)² × 0.25
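This calculation is a one-liner; the example below asks for a 95% interval (z* = 1.96) with a margin of error of at most 3 percentage points:

```python
import math

def required_sample_size(m, z_star=1.96, p=0.5):
    # smallest n with z* x sqrt(p(1 - p)/n) <= m, i.e. n >= (z*/m)^2 x p(1 - p)
    return math.ceil((z_star / m) ** 2 * p * (1 - p))

n = required_sample_size(0.03)   # worst-case p = 0.5
```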

Difference of two proportions


Conditions for normality: each sample proportion is nearly normally distributed, and the two samples are independent of each other.

SE(p̂1 - p̂2) = √(SE(p̂1)² + SE(p̂2)²) = √(p1(1 - p1)/n1 + p2(1 - p2)/n2)

Confidence interval:

(p̂1 - p̂2) ± z* × SE

Hypothesis testing:

H0: p1 - p2 = 0
HA: p1 - p2 ≠ 0

When the null hypothesis is p1 = p2, we can use a pooled estimate of the proportion (since we are assuming equal proportions):

p̂ = (p̂1 n1 + p̂2 n2) / (n1 + n2)

SE = √(p̂(1 - p̂)/n1 + p̂(1 - p̂)/n2)
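A sketch of the pooled two-proportion Z statistic (the success counts 40/100 and 30/100 are made up for illustration):

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    # pooled proportion under H0: p1 = p2;
    # (x1 + x2)/(n1 + n2) equals (p1_hat*n1 + p2_hat*n2)/(n1 + n2)
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) / n1 + p_pool * (1 - p_pool) / n2)
    return (x1 / n1 - x2 / n2) / se

z = two_proportion_z(40, 100, 30, 100)
```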

Chi-square Test
Use a chi-square test to:
- determine, given a sample of cases that can be classified into several groups, whether the sample is representative of the general population
- evaluate whether data resemble a particular distribution, such as a normal distribution or a geometric distribution.

(a) One-way table

A one-way table describes counts for each outcome of a single variable. We can put our data in a table like this:

Category:  C1  C2  C3  C4  Total
Observed:
Expected:

We want to establish whether the observed counts differ from the expected counts by more than chance alone would produce, i.e. whether or not the sample is representative of the population. The chi-square statistic is:

χ² = Σ_j (observed count_j - expected count_j)² / expected count_j

degrees of freedom: k - 1 (where k is the number of categories)
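The statistic itself is a short sum, as in this sketch (the observed and expected counts are hypothetical; the result would be compared against a chi-square distribution with k - 1 degrees of freedom):

```python
def chi_square_stat(observed, expected):
    # sum over categories of (observed - expected)^2 / expected
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# hypothetical counts over k = 3 categories
stat = chi_square_stat([50, 30, 20], [40, 40, 20])
df = 3 - 1
```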


(b) Two-way table

A two-way table describes counts for combinations of outcomes of two variables:

         S1   S2   S3   Total
C1
C2
Total

The expected count for each cell is:

Expected count = (Row Total × Column Total) / Table Total

and the test statistic is again:

χ² = Σ_j (observed count_j - expected count_j)² / expected count_j

degrees of freedom: (number of rows - 1) × (number of columns - 1)

Conditions for the chi-square test:
- independent observations
- each cell has at least 5 expected cases
- degrees of freedom of at least 2
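Computing the expected counts for a two-way table can be sketched as follows (the 2x2 table of counts is made up for illustration):

```python
def expected_counts(table):
    # expected count = row total x column total / table total
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

expected = expected_counts([[10, 20], [30, 40]])
```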

