
Stats Study Notes
Oliver Bogdanovski

Graphical and Numerical Summaries


Statistic - summary of data (which are measures of events)
Field of Statistics - collecting, analysing and understanding
data measured with uncertainty
When choosing which graph:
One quantitative variable: histogram (columns, no gaps), box plot
One categorical variable: bar graph (with gaps)
Two quantitative variables: scatterplot
Two categorical variables: clustered bar chart (two side-by-side charts on the same scale)
One of each: jittered scatterplot, comparative box plot, comparative histogram (side-by-side, same scale)
When looking at a graph observe:
Location - where most data is (similar to mode, also
mean/median)
Spread - variability (width of bulky part)
Shape - symmetric, left-skewed or right-skewed (skew = the direction it is pulled from symmetry)
Unusual observations
When choosing which method of numerical summary:
One categorical variable: table of frequencies/percentages
One quantitative variable:
o Location
Mean: x̄ = (Σxi)/n
Median: if n (number of values) is odd, M = the middle ((n+1)/2 th) ordered value; if n is even, M = the average of the two middle ordered values
o Spread
Standard Deviation: s = √( Σ(xi - x̄)² / (n-1) )
Interquartile Range: IQR = Q3 - Q1 (each calculated as the median of the top or bottom half)
Five number summary: (Min, Q1, M, Q3, Max). This is the data shown in a box plot, however the whiskers of a box plot may exclude outliers. These are found by adding 1.5×IQR beyond the outer ends of Q1 and Q3, then drawing the whiskers to the furthest data points within this range. Outlier points beyond the whiskers are marked individually (e.g. with a dot or asterisk).
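A minimal Python sketch of the five-number summary and the 1.5×IQR outlier fences (the data values are made up, and Q1/Q3 are computed as medians of the lower/upper halves, matching the convention above):

```python
from statistics import median

def five_number_summary(data):
    """Five-number summary (Min, Q1, M, Q3, Max), with Q1/Q3 computed
    as medians of the lower/upper halves, as in the notes."""
    xs = sorted(data)
    n = len(xs)
    m = median(xs)
    half = n // 2
    lower, upper = xs[:half], xs[-half:]   # excludes the middle value when n is odd
    return min(xs), median(lower), m, median(upper), max(xs)

def outlier_fences(q1, q3):
    """1.5*IQR fences: points outside these are flagged as outliers."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [1, 3, 4, 5, 7, 8, 9, 10, 30]       # 30 is a suspect point
mn, q1, m, q3, mx = five_number_summary(data)
lo, hi = outlier_fences(q1, q3)
outliers = [x for x in data if x < lo or x > hi]
```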
Transformations
Linear transformations change the units of x to xnew, for example time (min→h), length (km→mi) and temperature (°C→°F), altering location and spread, but not shape. They are found by the equation:
xnew = a + bx
Measures of location follow this:
x̄new = a + bx̄
Mnew = a + bM
Measures of spread are only affected by the size of b:
snew = |b|s
IQRnew = |b|IQR
Non-linear transformations change shape, and are good for correcting skewed data and working with outliers. To pull down the right tail (right-skewed) use log(x) [preferred], x^(1/4) or x^(1/2) (from strongest to weakest). These are monotonically increasing (keeps everything in order), and the base of the log only affects the scale, not the shape, and hence changing base will not make it more symmetrical. Because log(ab) = log(a) + log(b), logs change multiplicative relationships to additive ones. To pull down the left tail (left-skewed), treat it as -x then continue as for right-skewed (e.g. log(-x)). If dealing with zeros in right-skewed data, use log(x+1). To stretch the proportions of data where 0<x<1 to -∞<xnew<∞, use the logit transformation:
xnew = log( x/(1-x) )
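These transformations can be sketched in Python (the sample values below are made up for illustration):

```python
import math

def logit(x):
    """Logit transformation: stretches proportions in (0,1) to (-inf, inf)."""
    return math.log(x / (1 - x))

# Pulling down a right tail: log turns multiplicative spacing into equal spacing
right_skewed = [1, 2, 4, 8, 16, 32]
logged = [math.log(v, 2) for v in right_skewed]   # base only changes the scale

# Multiplicative -> additive: log(ab) = log(a) + log(b)
assert math.isclose(math.log(6), math.log(2) + math.log(3))
```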
Relationships between Two Quantitative Variables
Between two quantitative variables, the relationship can be:
existent or non-existent (random variation)
strong, mild or weak (deviation from line of best fit)
increasing or decreasing (direction)
linear or non-linear
Outliers (in scatterplots) could reveal possible systematic
structure worthy of investigations (e.g. possible external influences).
Correlation (r) - a measure of strength of a linear relationship
between two variables (sensitive to outliers - plot data to be aware
of them):

r = (1/(n-1)) Σ [ (xi - x̄)/sx ][ (yi - ȳ)/sy ]
where
xi = x-axis values (i = 1, 2, …, n)
yi = y-axis values
sx, sy = standard deviations of x and y
-1 ≤ r ≤ 1
close to 1 = strong positive/increasing linear relationship
close to -1 = strong negative/decreasing linear relationship
close to 0 = weak/non-existent linear relationship
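The definition above translates directly into Python (made-up data; a perfectly linear increasing relationship should give r = 1):

```python
from statistics import mean, stdev

def correlation(x, y):
    """Pearson correlation: r = (1/(n-1)) * sum of products of z-scores."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)
    return sum((xi - xbar) / sx * (yi - ybar) / sy
               for xi, yi in zip(x, y)) / (n - 1)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]          # exactly y = 2x: perfect increasing linear relationship
r = correlation(x, y)
```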
Least-Squares Regression
Regression is used to study causal relationships, when an explanatory variable (independent, x-axis) changes the response variable, related by a regression line (which doesn't extend beyond the limits of the current data). This is different to correlation, as in correlation each variable is on an equal footing (association does not mean causation). Least-squares finds the line that minimises the sum of the squared vertical distances from the line to the points (but only works for approximately straight-line data with no outliers - outliers can only be removed if they are the result of recording mistakes, not because of other factors; causality also cannot be assumed from this data, as lurking variables - other factors that could influence it - could be present). It is found by:
ŷ = b0 + b1x
where
ŷ = fitted values
For standardised data:
b1 = r
b0 = 0 (in general, b0 is the fitted y-value when x = 0)
For unstandardised data:
b1 = r(sy/sx)
b0 = ȳ - b1x̄
Data can be standardised using Z-scores:
zx = (x - x̄)/sx
zy = (y - ȳ)/sy
Estimating values within the data range is called prediction,
however this cannot be used outside the data range (an
extrapolation) as external factors may influence it.
To measure the strength of regression in linear least-squares regression, use r² (it is the variance of the fitted values ŷ divided by the variance of the y values - variance being the square of standard deviation - and hence results closer to 1 indicate a stronger fit). It can also be expressed as a percentage, and hence can be written as "the explanatory variable explains (r²×100)% of the variation in the (possibly transformed) response variable".
Residual plots plot the residual values vs. the fitted values, found by:
residual = y - ŷ
If the residual plot has no apparent shape whatsoever (no linear, arc or fan shapes, etc.), then the line is a good fit. Otherwise, a different model would be better suited.
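The slope/intercept formulas, fitted values and residuals above can be sketched together in Python (the data is made up so that y = 1 + 2x exactly, so the residuals should all be ~0 and r = 1):

```python
from statistics import mean, stdev

def least_squares(x, y):
    """b1 = r*(sy/sx), b0 = ybar - b1*xbar, as in the notes."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)
    r = sum((xi - xbar) / sx * (yi - ybar) / sy
            for xi, yi in zip(x, y)) / (n - 1)
    b1 = r * sy / sx
    b0 = ybar - b1 * xbar
    return b0, b1, r

x = [1, 2, 3, 4]
y = [3, 5, 7, 9]                       # exactly y = 1 + 2x
b0, b1, r = least_squares(x, y)
fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
r_squared = r * r                      # fraction of variation in y explained
```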
Design of Experiments
Strategy for using data in research:
1. Identify the research question (what you want to know)
2. Decide on the population to be studied (people/things)
3. Decide which variables to measure
4. Obtain data
Ways to collect data:
Anecdotal Data - haphazardly collected/own experience (don't use)
Available Data - previously produced (possibly for another purpose)
Collect your own

Notation for a census/population vs a sample:
Mean: μ (mu) vs x̄
Standard Deviation: σ (sigma) vs s
Correlation: ρ (rho) vs r

Observational Study - observe individuals and variables (don't influence response), provides association
Experiment - some treatment deliberately applied and response observed, demonstrates causation
Association explanations:
common response (ice cream sales correlate with heat stroke frequency; both are a common response to heat)
causation (moon position determines tides)
confounding (two factors each cause, and are associated with, a third, but between each other are entirely unrelated - parent BMI and child exercise may both correlate with child BMI and hence with each other, but neither causes the other)
Subjects - individuals on which the experiment is done
Treatment - specific experimental condition applied to subject
Factor - the explanatory variable manipulated in different
treatments
Levels - the different variations of factors
Response Variable - measured variable of primary interest
All experiments should follow the principles:
Compare - two or more treatments (one a control
group/placebo)
Randomise - assignment of subjects to treatments
(removes selection bias, ensures independence
assumption is reasonable, motivates probability use;
randomly assigning and sorting numbers)
Repeat - with many subjects to reduce chance variation
There are three main types of experimental design:
Random Comparative Experiment
Matched Pair Design - apply treatment to one of each
pair (identical twin studies allow for genetic control,
before-after studies take two measurements of each
subject, controls variation across subjects)
Random Block Design - like matched pair except with
a group/block of people with similar attributes and the
different treatments are assigned within each block (not
to the entire block; male and female, background, age)
Cautions about experiments:
Control - use appropriate control, ensure other factors
are constant
Beware of Bias - if the administrator of the treatment knows which treatment is which, it may affect results, so a double-blind experiment is used in which neither administrator nor subject knows
Repeat - entire treatment applied to different subjects
Realistic - need to duplicate real-world conditions to be
meaningful
Sampling Designs and Statistical Inference


Population (usually large, sometimes theoretical) - entire group of subjects we want information from
Sample (smaller in size) - part of the population we actually examine
Design - how we choose the sample (however subject to undercoverage, as some groups may not be represented, non-response, and response bias due to interviewer technique or questionnaire design)
Simple Random Sample (SRS) - every combination of
n individuals has equal chance to be selected
(calculated by assigning random numbers, sorting and
taking the first n subjects), which eliminates bias, data
is subject to probability (being independently and
identically distributed), eliminates sampling bias,
however is difficult to obtain as you need a list of all
subjects
Stratified Random Sample - individuals are grouped
by common characteristics, forming strata, and then
SRS takes place
Voluntary Sample - people choose themselves
Parameter - a number which describes some aspect of the
population (usually unknown but we wish to know it)
Statistic - a number which describes some aspect of the
sample, used to estimate the parameter
Samples result in sampling variability, however a computer can simulate, predict and account for this if a few samples have
been taken. The sampling distribution of a statistic is the
distribution of values of the statistic from all possible samples of the
same size within a given population. However, most of the time we
dont have the full population, and hence probability theory (later) is
used.
Bias is concerned with location: an unbiased statistic's sampling distribution has a mean equal to the true parameter. Variability is concerned with spread (lower variability meaning a narrower sampling distribution).
Probability
A phenomenon is random if individual outcomes are
uncertain, however the probability of a random phenomenon will tell
us the proportion of times the outcome will occur in a large number
of repetitions (long run frequency). As subjects are randomly
selected, they apply to probability models which require a sample
space (S; description of all possible outcomes) AND the probability
of each outcome or set of outcomes. An event (often A) is an
outcome or set of outcomes of a random phenomenon (calculated
mathematically or by long run empirical observation and doing),
and its probability is P(A). The probability rules are:
1) 0 ≤ P(A) ≤ 1
2) P(S) = 1
3) P(A^c) = 1 - P(A)  [complement rule]
4) P(A or B) = P(A) + P(B)  [addition rule, disjoint events]
5) P(1 of A, B, C) = P(A) + P(B) + P(C)  [and so on, disjoint]
6) P(A and B) = P(A) × P(B)  [multiplication rule, independent events]
7) P(A or B) = P(A) + P(B) - P(A and B)  [general addition rule; P(A and B) = 0 if disjoint, giving Rule 4]
Picture these as Venn diagrams to derive them. Two events are independent if knowing one does not change the probability of the other. If a random phenomenon has equally likely outcomes, then each event has:
P(A) = (#outcomes in A) / (#outcomes in S)
In binomial probability (remember: only two outcomes, independent trials, and a probability of success that remains constant - ensured by random sampling) we can use nCr to determine probability.
Conditional Probability
P(B|A) = P(A and B) / P(A)
Multiplication Rule
P(A and B) = P(A) × P(B|A)
For more than two events:
P(A, B and C) = P(A) × P(B|A) × P(C|A and B)
Both of these rules are mere rearrangements of the definition of conditional probability.
Total Law of Probabilities for Multiple Outcomes
P(B) = P(A1)P(B|A1) + P(A2)P(B|A2) + P(A3)P(B|A3) + …
where A1, A2, A3, … are disjoint and together cover all possibilities (basically sums all areas of B)
Bayes' Rule
P(A|B) = P(A and B)/P(B) = P(A)P(B|A) / [ P(A)P(B|A) + P(A^c)P(B|A^c) ]
The first equality is the definition of conditional probability; the second comes from applying the multiplication rule to the top and the total law of probabilities to the bottom (which is helpful in circumstances where P(B) is not directly known). This assumes P(B) ≠ 0.
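Bayes' rule with the total-law denominator can be sketched in Python (the prevalence and test-accuracy numbers below are hypothetical):

```python
def bayes(p_a, p_b_given_a, p_b_given_not_a):
    """Bayes' rule using the total law of probabilities for the denominator:
    P(A|B) = P(A)P(B|A) / (P(A)P(B|A) + P(A^c)P(B|A^c))."""
    p_b = p_a * p_b_given_a + (1 - p_a) * p_b_given_not_a
    return p_a * p_b_given_a / p_b

# Hypothetical example: a disease with 1% prevalence, a test with
# 90% sensitivity and a 5% false-positive rate.
posterior = bayes(0.01, 0.90, 0.05)    # P(disease | positive test)
```

Despite the accurate test, the posterior is only about 0.15, because true positives are swamped by false positives from the healthy majority.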
Random Variables
A random variables value is a numerical outcome of a
random phenomenon, usually represented by capital letters towards
the end of the alphabet, except Z (e.g. X, Y). They can be discrete
(where possible values are countable) or continuous (where values
are placed within some interval of real numbers, taking an infinite
number of possibilities). These are represented in tables listing values (or ranges) and their respective probabilities (pi), called a probability distribution or just distribution. They must follow two rules: 0 ≤ pi ≤ 1 and p1 + p2 + … = 1.
A binomial distribution is a special case of a discrete random variable in which an experiment repeated n times has two outcomes: success (p) or failure (1-p). The probability of a certain number of successes (X, an integer where 0 ≤ X ≤ n) is:
P(X = k) = nCk × p^k × (1-p)^(n-k)

We say X has a binomial distribution where n is the number of repeats and p is the probability of success, written as X ~ B(n,p).
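As a quick Python check of the formula (n = 10 fair-coin flips is a made-up example; the pmf should also sum to 1 over k = 0…n):

```python
from math import comb

def binomial_pmf(n, p, k):
    """P(X = k) for X ~ B(n, p): nCk * p^k * (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# X ~ B(10, 0.5): probability of exactly 5 heads in 10 fair-coin flips
p5 = binomial_pmf(10, 0.5, 5)                              # 252/1024
total = sum(binomial_pmf(10, 0.5, k) for k in range(11))   # whole distribution
```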
Discrete random variables can be graphed in a probability
histogram, whilst continuous random variables are graphed on a
density curve which shows overall pattern of a distribution. In a
given range of possible values, probability is found by the area
standing above this section in a density curve. Density curves are
mathematical functions used to describe the probabilistic behaviour
of measurements of interest, never having negative values and
producing a total area of 1.
Means and Variances of Random Variables
When talking about a set of data we use x̄, whilst when talking about a probability distribution or density curve we use μ. In discrete random variables, this is found by:
μX = Σ xi pi
The law of large numbers states the more independent observations (the larger the sample size), the closer x̄ will be to μ (also thought of as the law of averages). As before, if the random variable undergoes a linear transformation, so does the mean (μ(a+bX) = a + bμX), and if random variables are transformed and added, their means follow (μ(aX+bY) = aμX + bμY).
When studying probability the long-run average (also called the mean, μX, or expected value of X) is not the only thing determining if you would make an investment in something, the other being risk, which is studied by variance, a measure of spread (the square of the standard deviation, but using n on the bottom). It is represented by s² for samples and σ² for populations and found by:
σX² = Σ (xi - μX)² pi
Rule 1 - When linearly transforming a random variable, the variance is altered by the coefficient squared (as standard deviation is directly related):
σ²(a+bX) = b²σX²
Rule 2 - When two random variables (X and Y) are independent and the sum or difference is found (note both produce the same result, as squaring makes it positive):
σ²(X+Y) = σX² + σY²
σ²(X-Y) = σX² + σY²
Rule 3 - When two random variables are correlated by ρ:
σ²(X+Y) = σX² + σY² + 2ρσXσY
σ²(X-Y) = σX² + σY² - 2ρσXσY
Rule 4 - If they are independent, linearly transformed and summed:
σ²(aX+bY) = a²σX² + b²σY²
However, we must recompute mean and variance if they undergo non-linear transformations. In continuous random variables, we use integration (∫xf(x) dx for the mean, ∫(x-μ)²f(x) dx for the variance) across all possible values of x. The same rules as above can be applied.
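The mean/variance definitions and Rule 1 can be verified numerically (the three-value distribution below is made up):

```python
def rv_mean(values, probs):
    """mu_X = sum of x_i * p_i"""
    return sum(x * p for x, p in zip(values, probs))

def rv_var(values, probs):
    """sigma^2_X = sum of (x_i - mu_X)^2 * p_i"""
    mu = rv_mean(values, probs)
    return sum((x - mu) ** 2 * p for x, p in zip(values, probs))

values, probs = [0, 1, 2], [0.25, 0.5, 0.25]
mu, var = rv_mean(values, probs), rv_var(values, probs)

# Rule 1 check: Y = a + bX has mean a + b*mu and variance b^2 * var
a, b = 3, 2
y_values = [a + b * x for x in values]
mu_y, var_y = rv_mean(y_values, probs), rv_var(y_values, probs)
```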


Normal Distribution
General Equation:
f(x) = (1/(σ√(2π))) e^( -(x-μ)² / (2σ²) )
The shorthand for the normal distribution is: X ~ N(μ, σ)
The normality assumption is how much a data set looks like
a normal curve, often based on a somewhat helpful (however
sometimes misleading) histogram. However a normal quantile
plot is made specifically to check this, having normal quantiles
(or z-scores; expected values for a normally distributed data set)
plotted horizontally against the vertical sample quantiles from the
actual data set. If proportional (straight, increasing line with minimal
deviation - small amount allowed near edges) then the normality
assumption is reasonable, however if the data is right-skewed it will
be concave up, if left-skewed it will be concave down (but still
increasing). Other shapes are possible. The similarity can be
described with: excellent, good, fair, poor, hopeless. For normal measurements, 68% of the data falls within 1σ of μ, whilst 95% is roughly within 2σ and 99.7% within 3σ.
In a standard normal distribution we use the letter Z, where Z ~ N(0, 1). These values can be looked up directly in a standard normal probability table, in which the left axis provides the first two digits, the top axis provides the third digit, together making the horizontal value on the normal distribution, and the corresponding value between these is the area to the left: P(Z<z). To find P(Z>z) we use 1-P(Z<z), as the total area sums to 1. If using discrete values and "less than", we must jump a value below (for P(X<10), use P(X≤9), but for P(X>10) use 1-P(X≤10)). Being "equal to" doesn't make a difference as the area at a single point is 0 (unless looking at discrete values). To find P(z1<Z<z2) or P(|Z|<z) = P(-z<Z<z), simply subtract areas (z is a constant representing a specific value of Z). Similarly, if we know the probability (the area), the value for which this is true can be deduced. If the distribution is a normal distribution but not standard, we can standardise it to calculate values with tables by the linear transformation:
Z = (X - μ)/σ
Hence P(X<c) = P( (X-μ)/σ < (c-μ)/σ ) = P( Z < (c-μ)/σ ), which can then be found in the table.
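Instead of a printed table, the same standardisation can be done in Python using the error function (math.erf); the N(100, 15) example is made up:

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X < x) for X ~ N(mu, sigma): standardise Z = (X - mu)/sigma,
    then use erf in place of the standard normal table."""
    z = (x - mu) / sigma
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# P(X < 110) for X ~ N(100, 15): standardise to P(Z < 0.667)
p = normal_cdf(110, mu=100, sigma=15)

# The 68% rule: P(mu - sigma < X < mu + sigma)
within_1sd = normal_cdf(1) - normal_cdf(-1)
```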
The sampling distribution of a statistic is the probability distribution of values taken by the statistic (it tells us how the statistic will behave from one sample to another). In a binomial variable X ~ B(n,p):
μX = np
σX² = np(1-p)
σX = √(np(1-p))
Often we know n, but not p, so it's estimated with the sample proportion: p̂ = X/n
The mean and variance of a sample proportion can be expressed as:
μp̂ = p
σp̂² = p(1-p)/n
σp̂ = √(p(1-p)/n)
If n is large enough a binomial distribution can be approximated by a normal distribution, where X ≈ Y ~ N(np, √(np(1-p))) - a similar thing can be done with p̂ (using its own values):
p̂ ≈ N(p, √(p(1-p)/n))
To determine if n is large enough it must satisfy:
np > 10 AND n(1-p) > 10

The Central Limit Theorem
For a random variable X, taking random samples of size n we can look at the mean of these samples (derived from the random variable rules):
μX̄ = μ
σX̄² = σ²/n
σX̄ = σ/√n
σX̄ is called the standard error of X̄. If X and Y are independent normally distributed random variables, then their sum also has a normal distribution. This can be extended to a sum of n independent normal variables. We also find that if n is large enough, then X̄ will be normally distributed irrespective of the distribution of X, such that:
X̄ ≈ N(μ, σ/√n)
This is the Central Limit Theorem. As n gets bigger, the approximation is more accurate, and the sample size n required for this approximation depends on how close X was to a normal distribution (use a normal quantile plot: if the approximation is good, we don't need many observations; if fair (not noticeably skewed), at least 15; if poor, at least 40). This also applies to discrete populations (like binomials).
Confidence intervals allow us to make inferences (general statements) about a whole population based on a sample. We define a confidence level C (e.g. 95%), and then use the interval:
(x̄ - z*σ/√n, x̄ + z*σ/√n)
where P(-z*<Z<z*) = C (our confidence, so 0.95). Remember Z is a random variable for a standard normal distribution (use the tables). To find z*, do 1-C (gives the area under the normal curve not included by our range), then divide this by 2 (gives the area to the left not in our range), and then find the corresponding value that would give this area in the table. This is -z*, and z* is just the positive version. In doing this we assume we have independent observations with a known standard deviation (look at the method: SRS is best), and that X̄ is normally distributed (either through normal distributions or the central limit theorem). Even though we have a 95% confidence, we cannot say we have a 95% probability.
This interval can be rewritten as (x̄ - m, x̄ + m), and we call this m value the margin of error (it controls the width of the confidence interval):
m = z*σ/√n
To reduce the margin of error, we can:
reduce the confidence level - but you produce a less confident result (less likely to capture the true mean within this range)
increase the sample size - best option
measure variables more precisely - decreases σ (in some cases)

If given the margin of error desired, the variability to expect and the confidence required, we can calculate an appropriate sample size from the above expression: n = (z*σ/m)².
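A sketch of both calculations in Python, assuming a known σ and z* = 1.96 for 95% confidence (the numbers are made up):

```python
import math

def confidence_interval(xbar, sigma, n, z_star=1.96):
    """(xbar - z* sigma/sqrt(n), xbar + z* sigma/sqrt(n)); z* = 1.96 for C = 95%."""
    m = z_star * sigma / math.sqrt(n)      # margin of error
    return xbar - m, xbar + m

def sample_size(m, sigma, z_star=1.96):
    """Smallest n giving margin of error at most m: n >= (z* sigma / m)^2."""
    return math.ceil((z_star * sigma / m) ** 2)

lo, hi = confidence_interval(xbar=50, sigma=10, n=100)
n_needed = sample_size(m=1, sigma=10)
```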
Hypothesis Testing
In hypothesis testing, we have a particular claim about a population parameter (a null hypothesis, H0, e.g. "the average body temperature is 37°C") that we want to test, and we have sample data we can use to test this claim. We use it to check if there is evidence against the null hypothesis (as you can't ever prove something definitively right, only wrong). The data is used to support an alternate hypothesis (Ha). We use probability (a P-value) to determine whether observing this sample is likely or extreme, and if it is highly unlikely (assuming our null hypothesis is true) then we should re-evaluate our null hypothesis (which may result in changing it to a value that makes our sample seem less of an outlier). To do a hypothesis test:
1. State the null and alternate hypothesis (H0 and Ha)
2. Calculate the test statistic and its null distribution
3. Calculate the P-value
4. Conclude
A test statistic is a standardised measure of the difference
between the sample estimate and the value of the population
parameter in the null hypothesis. Its null distribution is the
sampling distribution (how likely each outcome is) of the test
statistic in comparison to H0 (so we assume H0 is true). It works in a
similar way to a normal distribution. Hence we can calculate a P-value, or probability. In the conclusion, if our P-value < 0.05, we have a case against the null hypothesis (our results are significant, as the sample is extreme). However, if greater than this, they are not significant (as it could be by statistical chance) and we give it less credence and may even ignore this sample (as it may be consistent with the null hypothesis being true, although we can never definitively prove that). However this cutoff is defined arbitrarily, and we could use <0.001 as very strong, <0.01 as strong, <0.1 as inconclusive (but possibly some evidence against it), and >0.1 (to 1) as no evidence against.
Assuming we have n independent observations, we know the standard deviation σ, and X̄ is approximately normal, we use the Wald statistic:
test statistic = (estimate - hypothesised value) / SE(estimate)
To compare means (and evaluate if our null hypothesis (the currently assumed mean) is true) we can use the test statistic:
z = (x̄ - μ0) / (σ/√n)
Assuming H0 is true, z comes from the standard normal distribution (the null distribution). The null hypothesis is H0: μ = μ0 (μ0 is the current value to test).
To determine the P-value use:
if Ha: μ > μ0    P-value = P(Z ≥ z)    (one-sided)
if Ha: μ < μ0    P-value = P(Z ≤ z)    (one-sided)
if Ha: μ ≠ μ0    P-value = 2P(Z ≥ |z|)    (two-sided)
The two-sided case arises because you're standardising the values into a standard distribution (the data has been moved and stretched), and hence you need to find both outer tails. This is for the mean, however. If we are looking at the proportion of successes (p), then we use the sample proportion p̂, which has:
standard error = σp̂ = √( p0(1-p0)/n )
z = (p̂ - p0) / √( p0(1-p0)/n )
Note the p0's on the bottom, as we are assuming these values are true and working based off this, just as we assumed the standard deviation in the mean was true to compare. Our assumptions are that we have n independent measurements of a binary variable, and that p̂ is approximately normal (or CLT, and in the case of binary variables np̂ ≥ 10, n(1-p̂) ≥ 10). In the case of a confidence interval, we use p̂ instead of p0, as the interval is based purely off our sample data:
( p̂ - z*√(p̂(1-p̂)/n), p̂ + z*√(p̂(1-p̂)/n) )
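A one-proportion z-test can be sketched in Python (60 successes in 100 trials against H0: p = 0.5 is a made-up example; erf replaces the printed table):

```python
import math

def proportion_z_test(x, n, p0):
    """z = (phat - p0) / sqrt(p0(1-p0)/n); two-sided P-value from the
    standard normal distribution via erf."""
    phat = x / n
    se0 = math.sqrt(p0 * (1 - p0) / n)     # note: p0 on the bottom, under H0
    z = (phat - p0) / se0
    p_two_sided = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return phat, z, p_two_sided

# H0: p = 0.5 vs Ha: p != 0.5, with 60 successes in 100 trials
phat, z, pval = proportion_z_test(60, 100, 0.5)
```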
Hypothesis testing and confidence intervals are related. For
example, declaring significance at the 0.05 level for a two-sided test
is the equivalent of a value being outside of your 95% confidence
interval. However confidence intervals focus on estimating an
unknown parameter, whilst hypothesis testing is used to see how
much evidence we have against the current (hypothesised)
parameter.
Hypothesis Testing when σ isn't known (t-distribution)
Often we don't know what σ is, so we use the estimate s (the sample standard deviation). However now our test statistic does not have a normal distribution, as s varies between samples (only being approximately normal for large n). Its distribution will be slightly more flat and spread out, even when a small sample size (small n) is taken 1000 times, and thus we lose confidence (a nominal 95% confidence interval would only be 89.4% confident). If n ≥ 30, the approximation is fairly good.
Instead, to make inferences about μ we use the t-distribution:
t = (x̄ - μ) / (s/√n)
This has (n-1) degrees of freedom (that is, once the sample mean is fixed, only n-1 values are free to vary - the last is determined exactly). We write this as t(n-1) or tn-1. We use slightly different distributions for each tn-1, and hence the t-distribution is more of a family of distributions, the density of which depends on our degrees of freedom (df or ν). They are symmetric around 0, and as df increases the t-distribution becomes less flat, more short-tailed, and more like a normal distribution.


If we don't have any knowledge of σ (which occurs most of the time), then assuming we have independence (SRS) and a normal approximation (CLT if large enough n, otherwise check the distribution of the data if small n), we use the sample standard deviation s in our test statistics and confidence intervals:
t = (x̄ - μ0) / (s/√n)    (under the hypothesis H0: μ = μ0)
x̄ ± t* s/√n
where t* is the value in the tn-1 distribution where the area between -t* and t* is C. We can use the same hypothesis tests as before:
if Ha: μ > μ0    P-value = P(T ≥ t)    (one-sided)
if Ha: μ < μ0    P-value = P(T ≤ t)    (one-sided)
if Ha: μ ≠ μ0    P-value = 2P(T ≥ |t|)    (two-sided)
For non-normal data our ranges for acceptable times to use a
t-test are:
if n<15, only use if no outliers and straight or slightly
skewed (good normal approximation)
if 15n40, only use if no outliers and fair
approximation to normality (only not if strongly skewed)
if n>40, it works well even if there is a poor approximation (strongly skewed), however it may still be affected if there are GROSS outliers
If not normal, use either a different family of distributions (not
in the course), transform it or use a distribution-free procedure like
the sign test (not in the course).
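The t statistic itself is easy to compute; the P-value then comes from a t(n-1) table. A Python sketch (the temperature data and the critical value t* = 2.571 for df = 5, two-sided 0.05, are illustrative):

```python
from statistics import mean, stdev
import math

def one_sample_t(data, mu0):
    """t = (xbar - mu0) / (s / sqrt(n)), with df = n - 1.
    The P-value would then be looked up in a t(n-1) table."""
    n = len(data)
    xbar, s = mean(data), stdev(data)
    t = (xbar - mu0) / (s / math.sqrt(n))
    return t, n - 1

# H0: mu = 37 (e.g. body temperature), made-up small sample
data = [36.8, 37.2, 36.9, 37.1, 36.7, 36.9]
t, df = one_sample_t(data, 37.0)
# compare |t| with the table value t*(df=5) = 2.571 for a two-sided 0.05 test
significant = abs(t) > 2.571
```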
Misuse of Hypothesis Testing
Statistical significance is achieved when the P-value is below some chosen significance level, whilst practical significance is when the departure from the null hypothesis (detected by statistical significance) is large enough to have practical importance. For example, if assessing how long workers work (with a prediction they should work 8 hours/day) and finding they all work 8.01 hours/day, then with a large enough sample this is statistically significant, but in reality it doesn't mean much. Something could also be practically significant even if it isn't statistically significant.
Misuse of hypothesis testing:
concluding the null hypothesis is true
searching for significance (testing lots of hypotheses on the same data; statistically likely to find something eventually anyway)
testing a hypothesis on the same data the hypothesis was generated from (obviously the data is going to support the hypothesis)
testing claims we know aren't true anyway


Comparing Two Means
When comparing two samples from two different populations we use two-sample t-tests to compare means. Our hypotheses are:
H0: μ1 = μ2 (which means μ1 - μ2 = 0)
Ha: one of μ1 > μ2, μ1 ≠ μ2, μ1 < μ2
We have four assumptions for two-sample t-tests:
independence within a sample (within each)
independence between samples (of each other)
distribution is approximately normal (CLT for n1+n2)
both X1 and X2 have the same σ
o if s1/s2 is within a factor of two, it's ok
o if not, if n1/n2 is within a factor of two, then it's ok anyway
o if not, then our assumption is not satisfied - we can then either transform the data to ensure they have the same σ, or use a different test statistic (not covered)
The test statistic can be calculated by:
t = (x̄1 - x̄2) / ( sp √(1/n1 + 1/n2) ) ~ t(n1+n2-2)
where sp = √( ((n1-1)s1² + (n2-1)s2²) / (n1+n2-2) )
Note: under H0, μ1 - μ2 = 0, and the degrees of freedom are (n1-1) and (n2-1) summed. This formula is derived by using μ1-μ2 as the parameter of interest, and subbing in the appropriate distributions (as they are normally distributed, with their means and standard errors (standard deviation divided by root n)).
Thus the confidence interval becomes:
(x̄1 - x̄2) ± t* sp √(1/n1 + 1/n2)
where the area between -t* and t* (in the t(n1+n2-2) distribution) is C (our confidence percentage).
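The pooled standard deviation and test statistic can be sketched in Python (the two small samples are made up):

```python
from statistics import mean, stdev
import math

def two_sample_t(x1, x2):
    """Pooled two-sample t: sp^2 = ((n1-1)s1^2 + (n2-1)s2^2)/(n1+n2-2),
    t = (xbar1 - xbar2) / (sp * sqrt(1/n1 + 1/n2)), df = n1 + n2 - 2."""
    n1, n2 = len(x1), len(x2)
    s1, s2 = stdev(x1), stdev(x2)
    sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    t = (mean(x1) - mean(x2)) / (sp * math.sqrt(1/n1 + 1/n2))
    return t, n1 + n2 - 2, sp

x1 = [10, 12, 14, 16]
x2 = [9, 11, 13, 15]
t, df, sp = two_sample_t(x1, x2)
```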
If our two samples are not independent of each other, then it is not a two-sample t-test, but a paired t-test, which we do as a normal t-test where our null hypothesis is that the mean difference between the two measurements is zero:
H0: μ (= μ1 - μ2) = 0
Ha: as before
Both samples have the same n (as we use the same individuals), and then we can calculate the s and x̄ for the DIFFERENCE between the two samples (make a new line in a table to calculate the difference). Hence:
t = (x̄ - μ0) / (s/√n) where μ0 = 0
Note that if the standard deviations are identical and n is large enough, then we could use a z-test (unlikely to come up though). As we are dealing with one sample (the differences) our assumptions are the usual independence of the pairs from other pairs (or of the rest of the sample) and an approximately normal distribution (although in practice, this works well even if the difference isn't normally distributed, as long as X̄ is). Our degrees of freedom are t(n-1). The confidence interval becomes:
x̄ ± t* s/√n
Using a matched pairs design allows us to account for other variables (the effects of each individual on the results before, so we know the effect of the treatment after).
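A paired t-test is just a one-sample t-test on the per-subject differences; a sketch with made-up before/after measurements:

```python
from statistics import mean, stdev
import math

def paired_t(before, after):
    """Paired t-test: run a one-sample t on the differences, H0: mu_d = 0."""
    d = [a - b for a, b in zip(after, before)]   # per-subject differences
    n = len(d)
    t = (mean(d) - 0) / (stdev(d) / math.sqrt(n))
    return t, n - 1

before = [120, 130, 125, 140, 135]
after  = [118, 127, 124, 136, 133]
t, df = paired_t(before, after)     # negative t: "after" is lower on average
```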
Relationships between Categorical Variables
In two-way tables (of frequencies) each row or column represents a variable (listing each of its categories), whilst in between is the count of each combination occurring. They are summaries of joint distributions of the two categorical variables (as opposed to marginal distributions, which only look at one variable - e.g. only the males compared to the course they are enrolled in - rather than both - e.g. females and males compared to course; if asked to find these, add the values as row and column sums on each end/margin). A conditional distribution shows the effect of one variable upon the other (as with conditional probability before), and is represented the same way as a joint distribution.
Two-way tables can be visualised as multiple bar charts
(placed side-by-side), clustered bar charts (within the same chart,
each category has a set of columns of different colours, explained
through a legend), or bar charts of the conditional distribution
(considers one aspect like if female, then has different columns that
show the distribution across each course, as a decimal out of 1).
Simpson's paradox is when lurking variables influence the results, so it may appear two things are linked, however in reality the linkage occurs between the response variable and some lurking variable. This can only be fixed by altering the experimental design.
To make inferences for two-way tables with categorical data, we use χ² tests. Starting with:
H0: No association (they are independent)
Ha: An association (one is dependent upon the other)
Having the observed counts in the table, we can compute expected counts under H0 (expected count = row total × column total / n), and produce:
X² = Σ (observed - expected)² / expected
Our assumptions are that our n observations are independent and that the sample size is large enough so that all expected counts > 10. We can then look up the respective P-values in the chi-square distribution table, with degrees of freedom (r-1)(c-1) [r = number of rows, c = number of columns]. The P-values are calculated as:
P(χ² > X²)
Unlike t and Z, this distribution is not symmetric, but right-skewed (having larger proportions to the left), and can also only be positive. χ²(df) has mean df, and χ²(1) = Z². At low df (around 2), the density on the left never decreases to zero, however as df increases the distribution shifts more and more towards the centre.
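Computing expected counts and the X² statistic by hand can be sketched in Python (the 2×2 counts are made up; the P-value would then come from a χ² table with df = (r-1)(c-1)):

```python
def chi_square_stat(table):
    """X^2 = sum of (observed - expected)^2 / expected, with
    expected count = row total * column total / n, df = (r-1)(c-1)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n
            x2 += (obs - exp) ** 2 / exp
    df = (len(table) - 1) * (len(table[0]) - 1)
    return x2, df

# made-up 2x2 table of observed counts
observed = [[30, 20],
            [20, 30]]
x2, df = chi_square_stat(observed)
```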


We can also use χ²-tests to compare the proportions (e.g. unemployment rate; our first variable) of two different samples (e.g. year; the sample being the second variable). If given the proportion of each and the sample size, calculate the count X (proportion × sample size) and its complement for each sample, and express this in a table. This can then be used for a χ²-test.
We can also compare paired proportions, and for this simply
use the differences.
Inference for Regression
Each value can be found from the linear regression line by:
y = b0 + b1x + error
The error is random scatter around the line (as the fitted values are the only ones that actually occur on the line). Making inferences about regression can tell us if two quantitative variables are related, and we must take into account the sampling error (the amount of error due to us only having a sample) in estimating the true regression line (y = β0 + β1x, which we estimate with our fitted values ŷ = b0 + b1x, themselves based on our sample values y = b0 + b1x + error). β1 is what we are most interested in, as the slope is what tells us the relationship (β1 = 0 means no relationship, just random variation above or below the line; β1 > 0 means an increasing relationship; β1 < 0 means a decreasing relationship). We assume the linear regression model:
yi = β0 + β1xi + εi
where each εi is independently sampled from a normal distribution with mean 0 and standard deviation σ. If true, we can use:
t = (b1 - β1) / SE(b1)    (sub in β1 = 0)
SE(b1) is the estimated standard error (also involving the residual standard error) for b1 (we don't need to calculate it by hand). To test if there is a relationship (starting with the assumption there is no relationship) we use:
H0: β1 = 0
Ha: β1 > 0    P-value = P(T ≥ t)
    β1 < 0    P-value = P(T ≤ t)
    β1 ≠ 0    P-value = 2P(T ≥ |t|)
Our confidence interval becomes:
(b1 - t*SE(b1), b1 + t*SE(b1))
Our assumptions are that:
μy (the mean of Y) has a linear relationship with X
o check the residual plot
εi (the errors from the line) are normally distributed, with mean 0 and standard deviation σ (or if n is large, the CLT means b1 will be normal anyway)
o this can be checked by plotting the regression residuals against theoretical quantiles in a normal quantile plot (for regression residuals), and checking for an increasing linear line
o the CLT works because b1 is a weighted sum of the observations, and the CLT applies not only to averages and sums, but also to weighted sums
each εi has the same variance at each x-value (regardless of n)
o check the distances from 0 on the residual plot
Yi observations are independent - SRS
A formula for SE(b1) is:
SE(b1) = s / ( sx√(n-1) )
Notice SE(b1) decreases as sample size (n) increases, as variability of X (sx) increases, and as variability around the fitted line (s) decreases.
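SE(b1) and the resulting t statistic can be sketched in Python (the data is made up to lie roughly on y = 2x; note that √(Σ(xi - x̄)²) = sx√(n-1), so the formula above is used directly):

```python
from statistics import mean
import math

def slope_se(x, y):
    """Least-squares slope b1 and its standard error
    SE(b1) = s / sqrt(sum of (xi - xbar)^2), where s is the residual
    standard error sqrt(SSE / (n-2))."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))           # residual standard error
    return b1, s / math.sqrt(sxx)

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.0]             # roughly y = 2x
b1, se = slope_se(x, y)
t = b1 / se                                 # test statistic for H0: beta1 = 0
```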
