
Chapter 13

Statistical Inference: An Overview


Testing the Null Hypothesis
Hypothesis Testing
A hypothesis test is a statistical method that uses sample data to evaluate a hypothesis about a
population

Purpose of Hypothesis Testing


The purpose of the hypothesis test is to decide between two competing explanations:
1. the difference between the sample and the population can be explained by sampling error (there
DOES NOT appear to be a treatment effect)
2. the difference between the sample and the population is too large to be explained by sampling
error (there DOES appear to be a treatment effect), i.e. it is not caused by chance variation within
the population

Hypothesis Test - Steps


1. State a hypothesis about the unknown population
2. Use the hypothesis to predict the characteristics the sample should have
3. Obtain a sample from the population
4. Compare the obtained sample data with the prediction from the hypothesis
*when we test hypotheses, we make predictions about POPULATION PARAMETERS

Null hypothesis
the general population has no change, no difference, or no relationship
in the context of an experiment, the IV (treatment) has NO EFFECT on the dependent variable
Alternative/Scientific Hypothesis
the general population has a change, a difference, or a relationship
in the context of an experiment, the IV (treatment) HAS AN EFFECT on the dependent variable
Null hypothesis + alternative hypothesis
mutually exclusive, exhaustive i.e. cannot both be true, and one of them must be true

Factors that Influence a Hypothesis Test


1. VARIABILITY of the scores
measured by either the standard deviation or the variance
smaller standard deviation = smaller standard error = larger z-score
A large variance decreases the likelihood of a significant test
2. NUMBER of scores in the sample / Sample Size
influences the size of the standard error in the denominator
a big sample size makes it easier to reject Ho, so even a very small effect can come out significant
increasing the number of scores in the sample produces a smaller standard error, which leads to a
larger value for the z-score

A larger sample size leads to a smaller standard error, which leads to a larger z-score, thus likelihood
of rejecting the null hypothesis INCREASES

A smaller standard deviation produces a smaller standard error, which leads to a larger z-score, thus
likelihood of rejecting the null hypothesis INCREASES
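
A minimal Python sketch of this relationship (the population values, sample means, and sample sizes below are assumed example numbers, not from the notes):

```python
# Sketch: the z-score for a sample mean grows as the standard error shrinks.
import math

def z_for_sample_mean(sample_mean, pop_mean, pop_sd, n):
    """z = (M - mu) / SE, where SE = sd / sqrt(n)."""
    standard_error = pop_sd / math.sqrt(n)
    return (sample_mean - pop_mean) / standard_error

# Same 3-point treatment effect, different sd and n:
print(z_for_sample_mean(53, 50, 15, 25))   # SE = 3.0 -> z = 1.0
print(z_for_sample_mean(53, 50, 15, 100))  # SE = 1.5 -> z = 2.0 (larger n, larger z)
print(z_for_sample_mean(53, 50, 5, 25))    # SE = 1.0 -> z = 3.0 (smaller sd, larger z)
```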

The Process of Statistical Inference
The more variability, the harder it will be to reject the null hypothesis

Applying Statistical Inference: An Example

Choosing a significance level


Critical region
extremely unlikely values in the tails of the distribution, as defined by the alpha level
sample outcomes that are very unlikely to occur if the null hypothesis is true
if data produced a sample mean located in the critical region, the null hypothesis is REJECTED
provides convincing evidence that the treatment really does have an effect
parts of each distribution that make up the most extreme 5% of the differences between means

Alpha Level / Level of Significance


probability value that is used to define the concept of very unlikely / probability of obtaining
sample data in the critical region even though Ho is true
a criterion for deciding whether to reject the null hypothesis or not
a larger alpha means the boundaries for the critical region move closer to the center of the
distribution
e.g. alpha = 0.01 vs alpha = 0.05
0.05: we will reject Ho if we get a pattern of data so unlikely that it could have
occurred by chance less than 5 times out of 100
0.01: the odds of getting such a large treatment effect by chance are less than 1 in 100, so
the boundaries are farther from the centre of the distribution than for 0.05, i.e. a
STRICTER criterion
if there is no treatment effect, an alpha level of 0.05 means there is a 5% risk of rejecting the null
hypothesis and committing a type I error
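
A small sketch of how the alpha level places those boundaries, assuming a two-tailed z-test (requires scipy):

```python
# Sketch: how alpha sets the critical region boundaries for a two-tailed z-test.
from scipy.stats import norm

for alpha in (0.05, 0.01):
    critical_z = norm.ppf(1 - alpha / 2)   # boundary of the critical region
    print(f"alpha = {alpha}: reject H0 if |z| > {critical_z:.2f}")

# alpha = 0.05: reject H0 if |z| > 1.96
# alpha = 0.01: reject H0 if |z| > 2.58  (farther from the centre -> stricter)
```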

Significant
a result is significant if it is very unlikely to occur when the null hypothesis is true
null hypothesis is thus REJECTED

Type 1 and Type 2 Errors


Type I Error FALSE POSITIVE
rejecting a TRUE null hypothesis, represented by ALPHA
i.e. saying the treatment has an effect, when it actually does NOT
falling in critical region even though treatment has NO EFFECT
Alpha Level: also the probability that the test will lead to a Type I error
if we are using a 0.05 significance level, the probability of a Type I error is 0.05
5 times out of 100 (or 5% of the time), we will reject the null hypothesis when we should not

Type I errors are caused by unusual, unrepresentative samples


Consequences of a Type I error are very serious: the researcher believes there is a real treatment
effect and may report or publish a result that is actually false

Type II Error FALSE NEGATIVE


the failure to reject a FALSE null hypothesis, represented by BETA
hypothesis test fails to detect that a real treatment effect really exists i.e. saying the treatment has no
effect, when it actually DOES
occurs when treatment effect is very small since research study fails to detect the effect
greater chance of a Type 2 error at p < 0.01 than there is at p < 0.05
How to reduce BETA?
using a more powerful statistical test, i.e. a parametric test
accept a less stringent significance level, i.e. an effect is more likely to be detected at p < 0.05 than at p < 0.01
consequences of a Type II error are not as serious as those of a Type I error

Statistical Power
the probability of correctly rejecting the H0 (null hypothesis) when Ho is false / when there is a real
treatment effect / when H1 is true
power = 1 - beta, where beta is the probability of committing a Type II error

Relationships
1. as effect size INCREASES, the probability of rejecting H0 increases, which means the power of
the test INCREASES
2. as the power of a test INCREASES, the probability of a type II error DECREASES
i.e. as the probability (of correctly rejecting a null hypothesis when there is a real treatment
effect) increases, the probability (of saying there is no treatment effect when there actually is)
decreases
3. as sample size INCREASES, the probability of rejecting the null hypothesis INCREASES, which
means the power of the test INCREASES

Factors that Affect Power


1. Effect Size
the larger the Cohen's d, the harder the effect is to miss, thus the larger the power
2. Sample Size
larger sample size, larger power
3. Alpha Level
the lower the alpha level, the lower the power
the larger the alpha level, the larger the power
(0.05, 0.01, 0.001)
a lower alpha level means smaller critical regions, and thus a lower chance of correctly rejecting H0
4. Type of Test
One-tailed - more power
a larger proportion of the treatment distribution will be in the critical region
Two tailed - less power
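
A rough sketch of how these four factors move power, using a normal-approximation power formula for a one-sample z-test (the effect sizes, sample sizes, and alpha values below are assumed illustration numbers; requires scipy):

```python
# Sketch: approximate power = P(reject H0) when the true effect size is Cohen's d.
from scipy.stats import norm

def power(d, n, alpha=0.05, tails=2):
    """Normal-approximation power for a one-sample z-test."""
    z_crit = norm.ppf(1 - alpha / tails)
    return 1 - norm.cdf(z_crit - d * (n ** 0.5))

print(power(0.5, 25))              # medium effect, n = 25, two-tailed
print(power(0.8, 25))              # larger effect size -> more power
print(power(0.5, 100))             # larger sample size -> more power
print(power(0.5, 25, alpha=0.01))  # stricter alpha     -> less power
print(power(0.5, 25, tails=1))     # one-tailed test    -> more power
```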

Going Beyond Testing the Null Hypothesis


Effect Size
estimate of the size or magnitude of a treatment effect
can be measured by Cohen's d
sample size does not influence effect size
increasing sample variance reduces measures of effect size
bigger magnitude, bigger effect (bigger mean difference in terms of standard deviations)

Small: d = 0.2, Medium: d = 0.5, Large: d = 0.8


* sign does not matter, look at absolute value!
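
A sketch of Cohen's d for two independent groups using the pooled standard deviation (the scores below are made up for illustration):

```python
# Sketch: Cohen's d = (mean difference) / (pooled standard deviation).
import statistics

def cohens_d(group1, group2):
    n1, n2 = len(group1), len(group2)
    s1, s2 = statistics.variance(group1), statistics.variance(group2)  # sample variances
    pooled_sd = (((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)) ** 0.5
    return (statistics.mean(group1) - statistics.mean(group2)) / pooled_sd

treatment = [12, 14, 15, 16, 18]
control   = [10, 11, 13, 13, 14]
print(round(cohens_d(treatment, control), 2))  # positive d -> treatment mean is higher
```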

Confidence Intervals
a range of values that we feel confident will include the true population mean

The Odds of Finding Significance


The Importance of Variability
LARGE DIFFERENCES are more likely in populations that have HIGH variability on the DV
the more variability, the farther the critical regions fall from the centre of the distribution
more variability means larger differences between sample means are needed to reject Ho

One Tailed and Two Tailed Tests


One-Tailed Test
the hypotheses specify either an increase or a decrease in the population mean
they make a statement about the DIRECTION of the effect i.e. directional hypothesis
the directional prediction is incorporated into the statement of the hypothesis
the (5%) critical region is located entirely in one tail of the distribution

Ho: Test scores are not increased.


H1: Test scores are increased.

Two-Tailed Test
the critical region of the distribution is divided between its two tails
whenever a two-tailed test is used, it means there is a nondirectional hypothesis: one that does not
predict the exact pattern of results
willing to accept extreme differences that go in either direction

One-tailed VS. Two-tailed test


The CRITERIA both tests use for rejecting H0
One-tailed test: reject the null hypothesis when the difference between the sample and the
population is relatively SMALL, provided that the difference is in the specified direction
size of the critical region is LARGER and CLOSER to the center of the distribution,
making it EASIER for differences between means to be large enough to fall there
statistical value needed to achieve p < 0.05, is SMALLER
we can get significant results more EASILY
Two-tailed test: requires relatively a LARGE mean difference to reject null hypothesis,
independent of direction
The two-tailed test requires a LARGER Z-SCORE for the sample to be in the critical
region, thus a two-tailed test is more choosy
When should they be used
One-tailed test: situations in which the directional prediction is made before the research is
conducted and there is strong justification for making the directional prediction
Two-tailed test: situations when there is NO strong directional expectation or when there are
two competing predictions
e.g. a study in which one theory predicts an increase in scores but another theory
predicts a decrease
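
A sketch comparing the critical values the two tests require at alpha = 0.05, assuming a t distribution with an arbitrary df = 20 (requires scipy):

```python
# Sketch: one-tailed vs two-tailed critical values at alpha = 0.05.
from scipy.stats import t

df = 20
one_tailed = t.ppf(1 - 0.05, df)       # all 5% in one tail
two_tailed = t.ppf(1 - 0.05 / 2, df)   # 2.5% in each tail
print(round(one_tailed, 3))   # ~1.725 -> easier to reach, in the predicted direction
print(round(two_tailed, 3))   # ~2.086 -> a larger statistic is required
```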

Test Statistics
inferential statistics: statistics that can be used as indicators of what is going on in the population
also called test statistics because they can be used to evaluate results

Organizing and Summarising Data


Summarizing Data: Using Descriptive Statistics
raw data: data recorded as you ran an experiment
summary data: data used when reporting results
descriptive statistics: shorthand ways of describing data
e.g. when we want to order a window shade, we don't carry the window frame to the hardware
store; instead we summarise its characteristics using the standard dimensions of length and
width

Measures of Central Tendency


Central Tendency
statistical measure to determine a single score that defines the centre of a distribution or is most
representative of the entire group

Measure of Central Tendencies


1. Mean
2. Median
3. Mode

Why do we need measures of central tendency


1. Allows us to describe a set of data by identifying the central position within that set of data
2. Allows us to consider an individual score relative to the entire sample or population
3. Allows us to compare multiple groups, using a single number that represents each group

The Mean
The sum of the scores divided by the number of scores
Requires scores that are numerical values measured on an INTERVAL or RATIO scale
Properties of the Mean
1. the mean is sensitive to the exact values of all the scores in the distribution (e.g. if one
score changes, the mean changes)
2. the mean is very sensitive to extreme scores (e.g. an extremely high-scoring outlier can pull
up the entire mean)
3. the mean is least subject to sampling variations
The mean won't work when
1. data was obtained using a NOMINAL or ORDINAL scale (i.e. discrete variables)
2. the distribution has a few extreme scores (or is skewed)

The Median
median is the midpoint of the list when scores are arranged from smallest to largest
50% of the scores in the distribution have values that are equal to or less than the median
Usually, the median can be found by a simple counting procedure
1. with an ODD # of scores, list the values in order, and median is the middle score on the list
2. with an EVEN # of scores, list the values in order, and the median is the halfway between
the middle two scores
The median won't work when:
1. data was obtained using a NOMINAL scale
2. The data set is very SMALL
The median is not affected by the presence of an outlier i.e. a resilient measure of central tendency

The Mode
The most frequently occurring score (or category, for nominal data) in the distribution
Corresponds to the PEAK in the frequency distribution graph
Primary value: it can be used for data measured on a nominal, ordinal, interval, or ratio scale
The mode won't work when:
1. the distribution is rectangular or bimodal/multimodal
The mode is the only measure of CT which produces a # that actually appears on the distribution

Normal Distribution
mean = median = mode

Positively Skewed
mode is smaller than the median, which is smaller than the mean
mode < median < mean

Negatively Skewed
mean is smaller than the median, which is smaller than the mode
mode > median > mean

When to use the mean
generally considered the best measure of central tendency
computed using every score in the distribution
useful when the data fit a normal distribution

When to use the median


when there are EXTREME scores (and the distribution is skewed) e.g. income data
when the distribution is open-ended (e.g. # of songs listened to: 0, 1, 2, 3, 4, 5, 6, 7 or more)
when the data were measured using an ORDINAL scale

When to use the mode


when data was measured using a NOMINAL scale
when the data represents a DISCRETE variable
when there are also extreme scores because it is not sensitive to extreme scores
when you want to describe the shape of the distribution
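
A small sketch of these three measures, showing how an extreme score pulls the mean but not the median (the scores are made up):

```python
# Sketch: mean vs median vs mode, with and without an outlier.
import statistics

scores = [20, 22, 23, 25, 26]
with_outlier = scores + [200]

print(statistics.mean(scores), statistics.median(scores))              # 23.2  23
print(statistics.mean(with_outlier), statistics.median(with_outlier))  # ~52.7 24 (mean pulled up)
print(statistics.mode([1, 2, 2, 2, 3, 4]))                             # 2 (most frequent score)
```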

Measures of Variability
Variability
Statistical measure of the differences between scores in a distribution
The degree to which scores are spread out or close together/clustered together
Variability: the extent to which the scores in a distribution differ from each other

Measures of Variability
1. Range
difference between the largest and smallest scores in a set of data
however it does not reflect the precise amount of variability
2. Standard deviation
measures the standard distance from the mean
average deviation of scores from their mean
3. Variance
transforming variability into a standard form
measures the average squared deviation of scores from their mean
tells us how much scores are spread out, or dispersed, around the mean of the data
SS = sum of squared deviations
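
A sketch computing these measures for a small made-up sample:

```python
# Sketch: range, SS, sample variance, and standard deviation.
import statistics

scores = [2, 4, 6, 8, 10]
mean = statistics.mean(scores)                  # 6
data_range = max(scores) - min(scores)          # 8
ss = sum((x - mean) ** 2 for x in scores)       # 40  (sum of squared deviations)
variance = ss / (len(scores) - 1)               # 10  (sample variance, n - 1)
sd = variance ** 0.5                            # ~3.16 (standard distance from the mean)
print(data_range, ss, variance, round(sd, 2))
```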

Chapter 14
Which test do I use?
Levels of Measurement
Nominal (Discrete)
a matter of distinguishing by name
classifies items into distinct categories with NO quantitative relationship to one another
provides the least information - refers to quality more than quantity
tells nothing about magnitude/order, and thus does not have equal intervals between values

e.g. binary category for computers of 0 and 1, course, religion, marital status, favorite TV show
e.g. 1 = male, 2 = female, height (short or not)

Ordinal (Discrete)
refers to order in measurement, order matters but the difference between values does not
indicates DIRECTION, in addition to providing nominal information

reflects differences only in magnitude, where magnitude is measured in the form of ranks
cannot be sure that the intervals between values are equal
scale has no true zero

e.g. movie ratings from * to *****, weight (underweight, average, overweight), Starbucks sizes (tall,
grande, venti), letter grades or UP GRADES(???)
e.g. (express the amount of pain patients feel on a scale of 1 to 10) - A score of 7 means more pain
than a score of 5, and that is more than a score of 3. But the difference between the 7 and the 5 may
not be the same as that between 5 and 3. The values simply express an order.

Interval (Continuous)
ORDERED categories, possess EQUAL intervals, distance between values has MEANING
e.g. The difference between a temperature of 100 degrees and 90 degrees is the same
difference as between 90 degrees and 80 degrees.
NO TRUE ZERO POINT - a unit of measurement exists (+, -), zero does not signify absence of the
characteristic
measures magnitude

e.g. time of day on a 12-hour clock, calendar dates, Celsius or Fahrenheit temperature (even if the temp
is zero, it doesn't mean there is no temperature)

Ratio (Continuous)
has an absolute zero point (a point where none of the quality being measured exists; zero, not a
negative value, marks the absence of the characteristic)
equal intervals between all of its values
ratios of #s reflect ratios of magnitude
express relationships between values on these scales as ratios: we can say 2 minutes is twice as
long as 1 minute

e.g. ruler: inches or centimetres, income: money earned last year, years of work experience

Selecting a Statistical Test


The Parameters of Data Analysis
1. How many IVs are there?
2. How many treatment conditions are there?
3. Is the experiment run between or within-subjects?
4. Are the subjects matched?
5. What is the level of measurement of the DV?

Parametric vs Non-Parametric
PARAMETRIC
rely on assumptions about (population represented by sample) parameters and hypothesis
about parameters
t-tests, ANOVAs
can only be used if your data allow you to compute means and variances
thus nominal/ordinal data won't work
NON-PARAMETRIC
do not require assumptions about population parameters nor do they test hypotheses about
population parameters; used when the parametric assumptions are not met
nominal and ordinal data can be used
we can use data in the form of frequencies rather than numerical scores
cannot compute means and variances
The Chi-Square Test

Chi-Square
chi-square (χ²): determines whether the frequencies of responses in our sample represent the
frequencies expected in the population
Not measuring a numerical score for each individual
Individuals are simply classified into categories
when Ho is true, χ² is near 0, i.e. the frequencies of responses in the sample do not differ in any
meaningful way from those we would expect in the population if the null hypothesis were true
as differences between expected and obtained frequencies become greater, the value of χ²
increases
when χ² is larger than the critical value, we reject H0 at p < 0.05

Observed frequency: the number of individuals from the sample who are classified in a particular
category. Each individual is counted in one and only one category.
expected frequency: for each category is the frequency value that is predicted from the proportions
in the null hypothesis and the sample size (n). The expected frequencies define an ideal,
hypothetical sample distribution that would be obtained if the sample proportions were in perfect
agreement with the proportions specified in the null hypothesis.
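
A sketch of a chi-square goodness-of-fit test comparing observed and expected frequencies (the frequencies and the equal-proportions null hypothesis are assumed for illustration; requires scipy):

```python
# Sketch: chi-square goodness-of-fit test; H0: the four categories are equally preferred.
from scipy.stats import chisquare

observed = [18, 30, 22, 10]          # observed frequencies, n = 80
expected = [20, 20, 20, 20]          # expected frequencies under H0 (equal proportions)
chi2, p = chisquare(observed, expected)
print(round(chi2, 2), round(p, 4))   # larger observed-expected gaps -> larger chi-square
if p < 0.05:
    print("Reject H0")
```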

Degrees of Freedom
tells us how many members of a set of data could vary or change value without changing the value
of a statistic we already know for those data
e.g. if we know the mean of the data, then degrees of freedom tells us how many members of that set of
data can change without altering the value of the mean

Interpreting the Chi Square


Cramer's coefficient phi: an estimate of the degree of association between the two categorical
variables tested by χ²
df_smaller = the smaller of (number of rows - 1) or (number of columns - 1)
df* = df_smaller
0.10 small degree, 0.30 medium degree, 0.50 large degree
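
A sketch of the coefficient, assuming the standard formula sqrt(χ² / (n × df_smaller)); the chi-square value, n, and table size below are illustrative, not from the notes:

```python
# Sketch: Cramer's coefficient from a chi-square value.
def cramers_coefficient(chi2, n, n_rows, n_cols):
    df_smaller = min(n_rows - 1, n_cols - 1)
    return (chi2 / (n * df_smaller)) ** 0.5

print(round(cramers_coefficient(chi2=10.4, n=80, n_rows=2, n_cols=4), 2))
# ~0.36 -> a medium-to-large degree of association by the guidelines above
```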

The t-test

Requirements
1. Data should be interval or ratio scale (since t-test is a PARAMETRIC test)
2. The values in the sample must consist of independent observations
3. The population that is sampled must be normally distributed

Effects of Sample Size


Sample Size
large sample size = small standard error
small sample size = large standard error = harder to reject Ho; the critical value of t needed to reject
Ho increases, since fewer degrees of freedom mean more variability between samples
greater sample size > easier to reject Ho i.e. easier to get a significant effect
the shape of the t distribution becomes more and more like the normal curve as the sample size
INCREASES
with small samples, the t distribution has a flatter and wider shape

Degrees of Freedom
the greater the df, the smaller the critical value of t needed before you can reject Ho
the critical value of t is LARGER when the degrees of freedom are SMALLER
the fewer degrees of freedom, the more difficult it will be to reject Ho
the fewer the subjects, the higher the critical value of t

Effect Size
transform t values and dfs into a correlation coefficient, r

Confidence Intervals
represents a range of values above and below our sample mean that is likely to contain the
population mean with the probability level (usually at 95% or 99%)
e.g. with a mean of 20 and a calculated 95% CI equal to +/- 3.20, we can be 95% confident that the true
mean falls somewhere within that range / 95% probability that the true population mean
would fall between 20 +/- 3.20
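
A sketch of a 95% confidence interval around a sample mean using the t distribution (the scores are made up; requires scipy):

```python
# Sketch: 95% CI for a population mean from a small sample.
from scipy.stats import t
import statistics

scores = [18, 19, 20, 21, 22, 20, 19, 21]
n = len(scores)
mean = statistics.mean(scores)
se = statistics.stdev(scores) / n ** 0.5     # estimated standard error of the mean
margin = t.ppf(0.975, df=n - 1) * se         # half-width of the 95% CI
print(f"{mean:.2f} +/- {margin:.2f}")        # we are 95% confident mu lies in this range
```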

The t-test for Matched Groups


fewer degrees of freedom, making it harder to reject Ho
lowers amount of variability in data
look at the DIFFERENCES between each subject's scores in the treatment conditions
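
A sketch of a matched (paired) t-test on made-up before/after scores, using scipy's ttest_rel, which works on each subject's difference scores:

```python
# Sketch: paired t-test on difference scores for matched/repeated measurements.
from scipy.stats import ttest_rel

before = [10, 12, 9, 14, 11, 13]
after  = [12, 14, 9, 17, 13, 15]
t_stat, p = ttest_rel(after, before)     # tests whether the mean difference is zero
print(round(t_stat, 2), round(p, 4))
if p < 0.05:
    print("Reject H0: scores changed after treatment")
```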

Analyzing Multiple Groups and Factorial Experiments

Analysis of Variance
evaluates the SIGNIFICANCE of the sample mean differences between two or more
treatments (or populations)
The outcome of an ANOVA tells us whether the variation in the scores comes from the treatment (independent
variable) or just from chance differences.
ANOVA - the Type I error rate is maintained at a manageable level (rather than inflated by running multiple t-tests)
within-groups variability: the degree to which the scores of subjects in the SAME treatment group
differ from one another i.e. how much subjects vary from others in the group
size of the differences that exist inside each of the samples (i.e. differences expected without
any treatment effect)
between-groups variability: degree to which the scores of different treatment groups differ from
one another i.e. how much subjects vary across each different level/conditions of the IV
the size of the difference between the sample means i.e. treatment effects
Notice that the three means are different; that is, they are variable.

The F-ratio thus becomes F = (treatment effect + random differences) / (random differences)


When the null hypothesis is true, and there are no differences between treatments, the F-ratio is
balanced (i.e. we expect a value near 1)
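
A sketch of a one-way between-subjects ANOVA on three made-up treatment groups, using scipy's f_oneway:

```python
# Sketch: one-way ANOVA comparing three treatment group means.
from scipy.stats import f_oneway

group_a = [4, 5, 6, 5, 4]
group_b = [6, 7, 8, 7, 6]
group_c = [9, 8, 10, 9, 9]
f_ratio, p = f_oneway(group_a, group_b, group_c)
print(round(f_ratio, 2), round(p, 4))    # an F near 1 would suggest no treatment effect
if p < 0.05:
    print("At least one treatment mean differs from the others")
```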

One way repeated measures ANOVA


Measuring performance on the same variable over time
e.g. looking at changes in performance during training or before and after a specific treatment
The same subject is measured multiple times under different conditions
e.g. performance when taking Drug A and performance when taking drug B
The same subjects provide measures/ratings on different characteristics
e.g. the desirability of red cars, green cars, and blue cars
Note how some repeated-measures designs could instead be run as regular between-subjects designs
e.g. randomly assign to drug A or drug B

Getting one sample from the population, and that sample is measured under different conditions
(Assumption: each treatment will now represent a certain sample)

What's new?
eliminates individual differences NATURALLY from the between-treatments variability - same
participant in every treatment condition
removes individual differences systematically by splitting within-treatments variability into between-subjects
(individual diffs) and ERROR (without individual diffs; just random, unsystematic factors)
the result: similar to the independent-measures F ratio but with all individual differences removed
F-ratio of repeated-measures differs from independent measures in that it includes NO
VARIABILITY CAUSED BY INDIVIDUAL DIFFERENCES

Individual differences in the F-ratio:


Individual differences: refers to participant characteristics, such as age, personality, and gender,
that vary from one person to another
Thus, this means if there is a mean difference between treatments, they cannot be explained
by individual differences
Individual differences are eliminated or removed from the variances in the F-ratio for the
repeated measures ANOVA

Sources of Variability
individual differences
result of procedures
all of these together are called ERROR: individual differences, undetected mistakes in recording
data, variations in testing conditions, and a host of extraneous variables

The Statistical Inference Process
assume that the sample of interest, or treatment groups, are DRAWN from the same population
assume that each of these samples are NORMALLY DISTRIBUTED on the DV (i.e. individual
scores are not too dispersed or different from each other)
Why? if too variable (i.e. std dev and variance too high), it's difficult to detect effects of IVs
large sample size is important to approximate normality
ie assume NO DIFFERENCE and NORMALITY

Treat the samples differentially, i.e. subject these samples to different levels of the IV
Afterwards, measure these samples on the DV, and compute statistics to test whether the samples are
now different from each other on this variable

Robust test: t-test


considered robust because it can tolerate small violations of the normality assumption
if they are significantly different from each other, we can conclude that these samples are so different
on the DV that they might as well have come from DIFFERENT populations

CHI-SQUARE TEST OF INDEPENDENCE
IV: presence or absence of a food sample
DV: purchase decision (to buy or not to buy)

Ho: There is no association or relationship between being given a prior taste of a food product and deciding to
buy it
Expected frequencies correspond to the NULL hypothesis, i.e. there is no relationship between the frequencies of
observations for the taste variable and the buy variable
Ergo, being given a taste does not make one more or less likely to buy

H1: There is an association or relationship between prior taste and decision to buy
Observed frequencies refer to ACTUAL frequencies obtained in study
alternative hypothesis: decision to buy is DEPENDENT ON being given prior taste of the food
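
A sketch of this test on made-up observed frequencies for the food-sample example, using scipy's chi2_contingency (which computes the expected frequencies from the null hypothesis of no association):

```python
# Sketch: chi-square test of independence on a 2x2 contingency table.
from scipy.stats import chi2_contingency

#                bought  did not buy
observed = [[30,    20],   # given a taste sample
            [15,    35]]   # not given a taste sample
chi2, p, df, expected = chi2_contingency(observed)
print(round(chi2, 2), round(p, 4), df)
if p < 0.05:
    print("Reject H0: buying is associated with being given a prior taste")
```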

Multiple ANOVA:
4 effect components
main effects of each IV
interaction effects of the IVs
effects due to individual differences (ID)
effects due to other sources of error

For a 3-factor ANOVA, how many F's are computed?


3 main effects
3 two-way interaction effects (between pairs of IVs)
1 higher-order (three-way) interaction
7 total

NHST (Null Hypothesis Significance Testing)
requires a specific, dichotomous, non-arbitrary hypothesis: If this (posited cause), then that (posited
effect)

Statistical significance: PROBABILITY THAT A PARTICULAR OBSERVATION (IN OUR CASE, THE
DIFFERENCE BETWEEN GROUPS IN THE DV) IS DUE TO CHANCE
Effect of IVs: true and reliable vs. false and untrustworthy

Practical or substantive significance: refers to the IMPACT of a particular finding


effect of IVs: big and important vs. small and immaterial

IV: nominal

DV: scale

Margin of Error:
via confidence interval
