Applied Statistics - Summary Block 4, Academic Year 09-10

Chapter 1 Examining distributions
Statistics is a way to get information from data: by collecting, analyzing and
interpreting data we gain insight into real-world phenomena and support
decision-making processes.
The overall pattern of a distribution can be described by its center, spread and shape:
[Figure: four example density curves illustrating the center, high
variability, low variability, and a bimodal shape (two modes).]
Another way to describe the shape is symmetry versus skewness:

[Figure: three density curves - skewed to the left, symmetric, and skewed to
the right.]
Different ways to graph the distribution of quantitative variables are:
Stemplot: a stemplot separates each observation into a stem (ordered from low
to high) and a leaf.
Timeplot: a timeplot plots each observation against the time at which it was
measured.
Pie chart: a pie chart shows relative frequencies as slices of a circle of 100%.
Barchart: a barchart shows frequencies as bars, where the sum of the bars does
not have to equal 100%.
Pareto chart: a Pareto chart is a barchart sorted by frequency.
Histogram: a histogram shows the distribution of a quantitative variable by
counting observations in intervals (bins).
To interpret these graphs, look at the overall pattern, described by center,
spread and shape, and watch out for outliers in these data.
Describing distributions with numbers:
1. Location: mean (average) or median (odd number of observations: the middle
observation; even number of observations: the average of the two middle
observations).
2. Spread: range (maximum - minimum), variance* (average squared deviation
around the mean), standard deviation (square root of the variance) and the
interquartile range** (IQR = Q3 - Q1).
* See the sample-variance formula below.
** Q1 is the median of the first half, Q3 is the median of the second half;
the IQR is the difference between them.
A five-number summary gives a quick overview of the distribution. It shows the
minimum, the first quartile (Q1), the median, the third quartile (Q3) and the
maximum. It can be shown in a box plot.
A density curve is a mathematical description of the distribution; it is an
idealized description of data. The most frequently used distribution is the
normal distribution, which is symmetric and unimodal. Calculations about this
distribution can be made via the z-score (z = (x - μ)/σ) or with the function
normalcdf(...) on a graphical calculator.
The sample variance is

    s² = (1/(n-1)) · Σ (xᵢ - x̄)²,  summing over i = 1, ..., n.
To determine whether the distribution is normal, we look at the normal
quantile plot, which plots the observed values against their z-scores. It is
(close to) a straight line if the distribution is normal.
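The z-score and normalcdf calculations above can be sketched in Python with the standard library's NormalDist; the observation x = 130, mean 100 and standard deviation 15 are made-up illustrative values:

```python
from statistics import NormalDist

# z-score for an illustrative observation x = 130, assuming mu = 100 and
# sigma = 15 (made-up values).
z = (130 - 100) / 15

# Probability below that z-score under the standard normal curve -- the same
# quantity normalcdf(...) returns on a graphical calculator.
p_below = NormalDist().cdf(z)
```

Here z works out to 2.0, and p_below is the area to the left of z = 2 under the standard normal curve.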
Chapter 2 Examining relationships
A scatterplot shows the relationship between two quantitative variables
measured on the same individuals. Each observation is depicted as one point
with, as x-coordinate, the value of one variable and, as y-coordinate, the
value of the other variable. We can use a scatterplot to determine the shape
of the relationship (line, cluster, etc.).
The correlation measures the direction and strength of the linear
relationship between two quantitative variables. The correlation is the mean
product of the z-scores of x and y. A positive relationship pairs high x with
high y (and low x with low y); a negative relationship pairs low x with high
y (and high x with low y). The correlation only captures linear relationships
and is always between -1 and 1.
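The "mean product of z-scores" definition can be computed directly; this is a minimal sketch with made-up data, using n-1 in the average so it matches the sample standard deviation:

```python
from statistics import mean, stdev

def correlation(xs, ys):
    # r = average product of z-scores, averaging with n-1 to stay
    # consistent with the sample standard deviation.
    n = len(xs)
    mx, my, sx, sy = mean(xs), mean(ys), stdev(xs), stdev(ys)
    return sum((x - mx) / sx * (y - my) / sy for x, y in zip(xs, ys)) / (n - 1)

xs = [1, 2, 3, 4, 5]   # illustrative data
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)
```

For this toy data r is about 0.77: a fairly strong positive linear relationship.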
A regression line is a straight line that describes how a response variable y
changes as an explanatory variable x changes. The most commonly used
regression line is the least-squares regression line, the straight line
ŷ = b0 + b1·x that minimizes the sum of the squares of the vertical distances
of the observed points from the line. The slope b1 gives the rate of change
of y if x changes by one unit. The formula is b1 = r·(sy/sx), where r is the
correlation, sy the standard deviation of y and sx the standard deviation of
x. The intercept b0 is the value of ŷ at x = 0.
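The slope formula b1 = r·(sy/sx) can be sketched as follows; the data are made up, and the intercept uses the (standard) fact that the least-squares line passes through (x̄, ȳ):

```python
from statistics import mean, stdev

# Illustrative data (made up).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
mx, my = mean(xs), mean(ys)
sx, sy = stdev(xs), stdev(ys)

# Correlation r, then slope b1 = r * (sy / sx) as in the text.
r = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)
b1 = r * sy / sx
# Intercept from the fact that the line passes through (xbar, ybar).
b0 = my - b1 * mx

def predict(x):
    return b0 + b1 * x
```

For this data the fitted line is ŷ = 2.2 + 0.6x.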
A residual is the difference between an observed value and the value
predicted by the regression line. A residual plot is a scatterplot of the
regression residuals against the explanatory variable. Residual plots help
us assess the fit of a regression line.
In the first residual plot we see that the regression line fits quite well,
but the second residual plot shows a curved pattern, suggesting a curved,
non-linear relationship instead of a straight regression line.
Outliers are often influential observations, but influential observations
need not have large residuals.
Differences between correlation and regression (both measures are sensitive
to outliers):

                 Correlation                     Regression
Goal:            Measure of strength and         Prediction of one variable
                 direction of the relationship   from another using a
                 between two quantitative        straight line.
                 variables.
Role of          Both variables have the         There is one response
variables:       same role.                      variable y and one
                                                 explanatory variable x.
Extrapolation is the use of a regression line for prediction far outside the
range of values of the explanatory variable x used to obtain the line. Such
predictions are often not accurate.
A lurking variable is a variable that is not among the explanatory or
response variables in a study and yet may influence the interpretation of
relationships among those variables.
Be aware: association is not causation. A high correlation does not mean
that one variable "causes" the other to be high as well.
A two-way table lists the frequencies of co-occurrences of two variables.
There is a difference between the marginal distribution and the conditional
distribution. The marginal distribution describes the distribution of
variable A alone and the distribution of variable B alone, while the
conditional distribution describes the distribution of variable A within a
specific category of variable B. There is no relation between the two
variables if the conditional distributions of variable A, given variable B,
are all the same as the marginal distribution of variable A; the same holds
with A and B interchanged, so it does not matter which of the two statements
we inspect.
An association or comparison that holds for each of several groups can
reverse direction when the data are combined to form a single group. This
reversal is called Simpson's paradox, and is caused by lurking variables.
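Marginal and conditional distributions can be read off a two-way table of counts; this sketch uses made-up counts with generic category labels A1/A2 and B1/B2:

```python
# Two-way table of counts (made-up data): rows are categories of
# variable A, columns are categories of variable B.
table = {("A1", "B1"): 20, ("A1", "B2"): 40,
         ("A2", "B1"): 60, ("A2", "B2"): 80}
rows = ["A1", "A2"]
cols = ["B1", "B2"]
total = sum(table.values())

# Marginal distribution of A: ignore B entirely.
marginal_A = {r: sum(table[(r, c)] for c in cols) / total for r in rows}

# Conditional distribution of A, given B = "B1".
n_b1 = sum(table[(r, "B1")] for r in rows)
conditional_A_given_B1 = {r: table[(r, "B1")] / n_b1 for r in rows}
```

Here the marginal proportion of A1 (0.3) differs from its conditional proportion given B1 (0.25), so in this table A and B are associated.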
Chapter 3 Producing data
In business and economics we are often stuck with observational data: data
recording what consumers do, want and need, without any intervention. For
proving causation, experiments are best, but often not feasible. In
observational studies it is often difficult to separate the effects of two
(or more) variables. We call these variables confounding variables.
Designing samples:
The population in a statistical study is the entire group of individuals
about which we want information. A sample is the part of the population from
which we actually collect information, used to draw conclusions about the
whole. We distinguish four types of samples:
Simple random sample (SRS): individuals are drawn at random; each individual
has the same chance of being selected.
Voluntary response sample: respondents choose to provide the data.
Probability sample: each respondent has a known, prior determined probability
of being selected.
Stratified random sample: the population is divided into groups of similar
individuals: strata. In each stratum an SRS is drawn and these are then
combined to form the full sample.
The design of a study is biased if it systematically favors certain outcomes.
If the bias is selection bias, the sample does not give a good representation
of the population. If the bias is information bias, the method of gathering
the data is inappropriate and yields systematic errors in measurement.
Finally, if the bias is confounding bias, there is confusion of effects: an
observed effect is actually caused or influenced by another, unobserved
factor (a lurking variable).
Designing experiments:
The individuals studied in an experiment are often called subjects,
especially if they are people. The explanatory variables in an experiment are
often called factors. A treatment is any specific experimental condition
applied to the subjects. If an experiment has several factors, a treatment is
a combination of a specific value (level) of each of the factors.
When submitted to an experiment, people tend to "find" results, even if they
did not actually receive treatment! To overcome this problem, we introduce a
control group that enters the experiment but does not receive a treatment.
But groups may differ in other respects; to counter that problem, we need to
assign individuals randomly to groups.
In a double-blind experiment neither experimenter nor subject knows which
treatment is received. This avoids any unconscious bias. If the effect of the
treatment is much larger than could be expected on the basis of chance alone,
we say that there is a statistically significant effect.
A parameter is a number that describes the population. A parameter is a fixed
number, but in practice we don't know its value. A statistic is a number that
describes a sample. The value of a statistic is known once we have taken a
sample, but it can change from sample to sample.
Two important concepts when we try to estimate a population parameter with a
sample statistic:
1. Bias concerns the center of the sampling distribution. A statistic used to
estimate a parameter is unbiased if the mean of its sampling distribution is
equal to the true value of the parameter being estimated.
2. The variability of a statistic is described by the spread of its sampling
distribution. The spread is determined by the sampling design and the sample
size n. Statistics from larger samples have smaller spreads.
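The claim that larger samples give smaller spreads can be checked by simulation; the population below is made up (normal with mean 50, sd 10), and the function estimates the spread of the sampling distribution of the mean empirically:

```python
import random
from statistics import mean, stdev

random.seed(7)
# Made-up population: 10,000 values with mean 50 and sd 10.
population = [random.gauss(50, 10) for _ in range(10_000)]

def spread_of_sample_mean(n, reps=500):
    # Empirical sd of the sampling distribution of the mean
    # for repeated samples of size n.
    return stdev(mean(random.sample(population, n)) for _ in range(reps))

spread_small = spread_of_sample_mean(5)    # roughly 10/sqrt(5)
spread_large = spread_of_sample_mean(50)   # roughly 10/sqrt(50)
```

The size-50 samples produce a sampling distribution with a clearly smaller spread than the size-5 samples.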
Chapter 4 +ro"a"ility ad sampli! distri"utios
0e call an event radom, if individual outcomes are uncertain, but in the long
run a pattern can be observed in the outcomes. 7andom e3periments have:
*he sample space S are all the possible outcomes)
% particular e%et A is the number or serial of numbers which you/re going to
observe.
*he pro"a"ility P(A) is the proportion of times that this outcome occurs if we
were to repeat the e3periment in:nitely.
% radom %aria"le is a variable whose value is a numerical outcome of a
random phenomenon. 0e distinguish two types of variables: &iscrete random
variables #the outcomes are :nite #countable$$ and continuous random variables
#in:nite outcomes$. % probability distribution of a random variable J assigns
probabilities to all values that J can take on.
The mean or expected value of a random variable is defined as the weighted
average of the possible values xᵢ of X, where the weights are the
corresponding probabilities. The variance of a random variable is the
weighted average of the squared deviations of the possible values of X from
the expected value, where the weights again are the corresponding
probabilities. The standard deviation is the square root of the variance.

    μ = E(X) = Σ x·P(x)
    σ² = E((X - μ)²) = Σ (x - μ)²·P(x)
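The weighted-average definitions translate directly into code; a fair six-sided die serves as the illustrative discrete distribution:

```python
from math import sqrt

# Fair six-sided die: each outcome 1..6 has probability 1/6.
dist = {x: 1 / 6 for x in range(1, 7)}

# Mean: weighted average of the values.
mu = sum(x * p for x, p in dist.items())
# Variance: weighted average of squared deviations from the mean.
var = sum((x - mu) ** 2 * p for x, p in dist.items())
sigma = sqrt(var)
```

For the die, μ = 3.5 and σ² = 35/12 ≈ 2.92.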
If we take a linear combination Y = aX + b, with X a random variable, the
mean will be μY = aμX + b and the variance will be σY² = a²σX². If X and Y
are both random variables, means simply add: μ(X+Y) = μX + μY, and for a
combination such as R = 0.2X + 0.8Y the mean is μR = 0.2μX + 0.8μY.
With two random variables it is important, for calculating the variance,
whether one variable depends on the other: if X and Y are independent,
σ²(X+Y) = σX² + σY²; in general σ²(X+Y) = σX² + σY² + 2·covXY.
If we have two random variables, we can consider the joint distribution of
these variables. Instead of looking at P(X) and P(Y) separately, we consider
P(X,Y): the probability that two events occur simultaneously. The covariance
covXY is the expected value of (X - μX)(Y - μY). The relationship between the
covariance covXY and the correlation ρ is covXY = ρ·sX·sY. This yields the
following properties:
The covariance of X with X is equal to the variance of X: sX²;
Zero covariance gives zero correlation;
The size of the covariance is difficult to interpret: it depends on the
measurement units;
The sign of the covariance is the same as the sign of the correlation.
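The identity covXY = ρ·sX·sY can be verified numerically on toy data; here the sample covariance is computed from its definition and then divided back out:

```python
from statistics import mean, stdev

# Toy paired data (made up).
xs = [2, 4, 6, 8]
ys = [1, 3, 2, 6]

n = len(xs)
mx, my = mean(xs), mean(ys)
# Sample covariance: average product of deviations (n-1 in the denominator).
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
# Correlation recovered from cov_XY = rho * s_X * s_Y.
r = cov / (stdev(xs) * stdev(ys))
```

The sign of cov matches the sign of r (both positive here), while the size of cov depends on the units of xs and ys; r does not.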
A statistic from a random sample will take different values if we take more
samples from the same population. Sample statistics are random variables! The
sample mean x̄ is an important one: it will differ from sample to sample and
is not equal to the population mean μ. The law of large numbers tells us that
the sample mean x̄ converges to the population mean μ as the number of drawn
observations increases. For this, the observations have to be independent and
randomly drawn.
If a population has the N(μ, σ) distribution, then the sample mean x̄ of n
independent observations has the N(μ, σ/√n) distribution. Regardless of the
shape of the original distribution, the distribution of the sample mean will
be approximately normal if n is large enough (the central limit theorem).
Chapter / +ro"a"ility .heory
5t is often helpful to draw pictures that display relations
among several events. % 0e dia!ram is a useful tool. 5n
a Genn diagram, the total area represents the sample space
S, and the events are drawn into this area.
5f the occurrence of event % does not change anything about the probability for
event ", the events % and " are independent events. *he probability is
+#%?"$8+#%$4+#"$, if the events are independence.
Rniform "ernoulli "inomial +oisson
Sample Space
S
,, 2, =, n -,, ,, 2, =, n ,, 2, =, n
+#J8k$
,1n for k in S,
else -
p for k8,,
#,-p$ for k8-,
else -
for k in S,
else -
for k in S,
else -
Bean m #n?,$12 + n4p ;
Stand.
deviation s
> Rniform: %ll events are e(ually liked.
12 / ) 1 (
2
N
) 1 ( p p ) 1 ( p np
k n k
p p
k
n

) 1 (
! k
e
k


> "ernoulli: *wo possible events #success or failure$) +robability for success is p,
failure is ,-p.
> "inomial: n independent repetitions of a "ernoulli e3periment. *he probability
p for success is the same in each e3periment. *he random variable J is
de:ned as: Cthe number of successes k out of n trials #repetitions of the
e3periment$D.
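The binomial and Poisson probability formulas from the table can be sketched directly with standard-library functions; the example values (2 successes in 4 fair trials) are our own:

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    # P(X = k) = C(n, k) * p^k * (1-p)^(n-k)
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, mu):
    # P(X = k) = mu^k * e^(-mu) / k!
    return mu**k * exp(-mu) / factorial(k)

# Illustrative: probability of exactly 2 successes in 4 fair trials.
p_two_of_four = binom_pmf(2, 4, 0.5)
```

For n = 4 and p = 0.5, P(X = 2) = 6·(0.5)⁴ = 0.375, and the four-trial probabilities sum to 1 as they should.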
The Poisson distribution is used to determine the probability of the number
of occurrences (successes) during a fixed time interval or on a fixed area in
space. Some assumptions to be made are:
1. The number of successes that occur in any unit of measure is independent
of the number of successes that occur in any non-overlapping unit of measure.
2. The probability that a success will occur in a unit of measure is the same
for all units of equal size and is proportional to the size of the unit.
3. The probability that 2 or more successes will occur in a unit approaches 0
as the size of the unit becomes smaller.
The probability of k successes in a Poisson distribution is
P(X = k) = μ^k·e^(-μ)/k!, where the parameter μ is the mean number of
successes per unit of measure.
If X is Poisson distributed with mean μX, Y is Poisson distributed with mean
μY, and X and Y are independent, then S = X + Y is Poisson distributed with
μS = μX + μY.
Coditioal pro"a"ility occurs when the probability of two events % and "
happen together. 0hen +#%$S-, the conditional probability of " is +#"T%$8 +#%
and "$1+#%$. *his yields that +#% and "$ is e(ual to +#%$4 +#"T%$. ere, +#"T%$ is the
probability that " occurs, given the information that % occurs.
Example: a table of students, classified by (non-)smoking and gender.

              A: Male    A: Female    Total
B: Smoke          20           40        60
B: Non-smoke      60           80       140
Total             80          120       200

P(Male) = 80/200 = 0.4, P(Smoke) = 0.3 and P(Male and Smoke) = 0.1. The
probability that a student smokes, given the information that the student is
a male, is 0.1/0.4 = 0.25.
5f +#"T%$8+#"$, two events % and ", that have both positive probabilities, are
independent.
Bayes)s $ule states that if % and " are events whose probability are not - or ,,
then:
5f we know the conditional probability of ", given %, and the marginal probability
+#%$, we can use "ayes/s rule to calculate the conditional probability of %, given
".
Chapter 6 Introduction to inference
Through statistical inference, we aim to make statements about the population
based on data obtained from a sample. This involves estimating population
parameters using sample statistics. For the actual μ, the sample mean is a
good approximation, but it is very unlikely that the actual value of the
population parameter will be exactly equal to the sample mean.
Therefore, it is important to consider the spread of the estimator. Instead
of just reporting one number, we can give a range of plausible values for the
population parameter. Such a range is called a confidence interval. A range
of plausible values can be constructed as:
[estimate - margin of error, estimate + margin of error].
Recalling the central limit theorem, there is an approximately 95%
probability that the sample mean falls within two standard deviations of μ:
P(μ - 2σ/√n < x̄ < μ + 2σ/√n) ≈ 0.95. After rearranging this to
P(x̄ - 2σ/√n < μ < x̄ + 2σ/√n) ≈ 0.95, we can say that approximately 95% of
samples yield an interval that includes the true value of μ. So we can say
with 95% confidence that μ is between x̄ - 2σ/√n and x̄ + 2σ/√n.
A general way of obtaining confidence intervals for the population mean is to
establish, first of all, how confident we want to be: we determine the
confidence level C. Then, we choose an SRS of size n from a population having
unknown mean μ and known standard deviation σ. A level C confidence interval
for μ is:
x̄ ± z*·(σ/√n).
Here, z* is the critical value with area C between -z* and +z* under the
standard normal curve (for z*-values, see book table C). The quantity
z*·(σ/√n) is the margin of error.
The width of the interval is affected by the sample size n: the larger n, the
smaller the interval.
The width of the interval is affected by the confidence level C: the larger
C, the wider the interval.
Often, the confidence level C is set. If the margin of error m is also set,
we can determine how large our sample should be: n ≥ (z*·σ/m)².
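The interval x̄ ± z*·(σ/√n) can be sketched as a small helper; the sample values (x̄ = 64.5, σ = 2.5, n = 100) are made-up, and z* is obtained from the inverse normal cdf instead of table C:

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(xbar, sigma, n, level=0.95):
    # Level-C interval xbar ± z* · sigma/sqrt(n), with sigma known.
    # z* has area (1+C)/2 to its left, i.e. area C between -z* and +z*.
    z_star = NormalDist().inv_cdf((1 + level) / 2)
    m = z_star * sigma / sqrt(n)     # margin of error
    return xbar - m, xbar + m

# Illustrative values: xbar = 64.5, sigma = 2.5, n = 100.
low, high = z_confidence_interval(64.5, 2.5, 100)
```

With these numbers z* ≈ 1.96 and the 95% interval is roughly (64.01, 64.99).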
The null hypothesis is typically conservative: it is often a statement you
want to disprove. The alternative hypothesis is often the thing that you want
to prove, for which you have some evidence. You have to show that the results
are statistically significant to have evidence that is strong enough.
To test a certain hypothesis we need a test statistic: a function of your
sample for which you can evaluate how likely its value is if the null
hypothesis is true.
The probability, computed assuming that H0 is true, that the test statistic
would take a value as extreme or more extreme than actually observed, is
called the P-value of the test. The smaller the P-value, the stronger the
evidence against H0 provided by the data:
Ha: μ > μ0 gives P(Z ≥ z) (one-sided);
Ha: μ < μ0 gives P(Z ≤ z) (one-sided);
Ha: μ ≠ μ0 gives 2·P(Z ≥ |z|) (two-sided).
We reject the null hypothesis if the P-value is smaller than a certain
significance level α, where α should be selected before doing the test; α is
typically 0.10, 0.05 or 0.01. The advantage of such a significance level α is
that it gives a clear decision: reject or do not reject.
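The three P-value cases above can be sketched as one helper function (a hypothetical helper of our own, with made-up example values x̄ = 52, μ0 = 50, σ = 6, n = 36):

```python
from math import sqrt
from statistics import NormalDist

def z_test_pvalue(xbar, mu0, sigma, n, alternative="two-sided"):
    # z statistic and P-value for H0: mu = mu0, with sigma known.
    z = (xbar - mu0) / (sigma / sqrt(n))
    if alternative == "greater":        # Ha: mu > mu0
        p = 1 - NormalDist().cdf(z)
    elif alternative == "less":         # Ha: mu < mu0
        p = NormalDist().cdf(z)
    else:                               # Ha: mu != mu0
        p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

z, p = z_test_pvalue(xbar=52, mu0=50, sigma=6, n=36, alternative="greater")
```

Here z = 2.0 and the one-sided P-value is about 0.023, so at α = 0.05 we would reject H0.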
If we reject H0 when in fact H0 is true, this is called a type I error. If we
do not reject H0 when in fact Ha is true (so H0 is not true), this is called
a type II error.
The power of a test is the probability that a test with significance level α
rejects H0 for a specific parameter value in Ha.
Chapter 7 Inference for distributions
Until now, we assumed that the standard deviation σ was known when
calculating the Z-value of the sample, which made it possible to draw
conclusions about the unknown μ. But a known σ is very unlikely in practice.
From now on, σ is unknown. If σ is unknown, we replace it by its estimator s.
Instead of the Z-value, this gives a new value, the T-value:
T = (x̄ - μ)/(s/√n). This yields a level C confidence interval of
x̄ ± t*·(s/√n), where t* is the value for the t(n-1) density curve with area
C between -t* and t*. The margin of error is t*·(s/√n).
A t statistic is only valid if the population is normally distributed. Since
the sample mean and sample variance are sensitive to outliers, the t
statistic is sensitive to outliers too.
Example: a sample of 50 employees with sample mean 460.38 and variance
1503.55. Is μ then also greater than 450?
H0: μ = 450, Ha: μ > 450. t = (460.38 - 450)/√(1503.55/50) = 1.89. The
P-value is around 0.03, from which you can conclude that, with α = 0.05,
μ > 450.
If there is reason to suspect that the population distribution is not normal
(for example, skewed or bimodal), we can construct an alternative test that
ignores the size of the differences between observations and focuses on the
direction of the difference: plus or minus (up or down). Such a test is
called a sign test. The idea is simple: under the null hypothesis the
difference between the two compared groups is equally likely to be positive
or negative, so p = 0.5. The hypotheses are H0: the two population locations
are the same (p = 0.5) and Ha: the two population locations differ (p ≠ 0.5,
or < or >). The test statistic is the number of times that the difference is
positive. We can calculate the probability of observing this number (or an
even less likely number) given that the null hypothesis of no change is true.
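The sign test reduces to binomial tail probabilities with p = 0.5; this is a minimal sketch (our own helper, with a made-up example of 9 positive differences out of 10):

```python
from math import comb

def sign_test_pvalue(n_plus, n, alternative="two-sided"):
    # P-value for the sign test: n_plus positive differences out of n,
    # under H0: P(positive) = 0.5, i.e. X ~ Binomial(n, 0.5).
    def p_at_least(k):  # P(X >= k)
        return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

    p_upper = p_at_least(n_plus)          # P(X >= n_plus)
    p_lower = 1 - p_at_least(n_plus + 1)  # P(X <= n_plus)
    if alternative == "greater":
        return p_upper
    if alternative == "less":
        return p_lower
    return min(1.0, 2 * min(p_upper, p_lower))

# Illustrative: 9 of 10 differences positive, one-sided test.
p = sign_test_pvalue(9, 10, alternative="greater")
```

Here P(X ≥ 9) = 11/1024 ≈ 0.011, strong one-sided evidence against "no change".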
If we draw two random samples of sizes n1 and n2, with means μ1 and μ2
respectively, the confidence interval for μ1 - μ2 is given by
(x̄1 - x̄2) ± t*·√(s1²/n1 + s2²/n2). Here, t* is the value for the t(k)
density curve, where the smaller of n1 - 1 and n2 - 1 is the degrees of
freedom k.
To test the hypothesis H0: μ1 = μ2, we compute the t statistic (see book) and
use the P-value or critical values, where again the smaller of n1 - 1 and
n2 - 1 is the degrees of freedom, to reject or not reject the null
hypothesis.
If the variances are the same, we have to use the formulas for pooled
two-sample t procedures. Watch out for the fact that the degrees of freedom
is n1 + n2 - 2 in this case!
To test whether the variances are the same, we use the F-test:
F = s²(larger)/s²(smaller). Then, compare the obtained value with critical
values of an F-distribution with n(larger) - 1 and n(smaller) - 1 degrees of
freedom.
For a two-sided test use the critical value corresponding to α/2; for a
one-sided test use the critical value corresponding to α. If the observed
value exceeds the critical value, reject the null hypothesis in favour of the
alternative.
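The two-sample t statistic and the conservative degrees-of-freedom rule from this chapter can be sketched as follows; the summary statistics are made-up example values:

```python
from math import sqrt

def two_sample_t(x1bar, s1, n1, x2bar, s2, n2):
    # Unpooled two-sample t statistic for H0: mu1 = mu2, with the
    # conservative df = min(n1, n2) - 1 rule used in the text.
    se = sqrt(s1**2 / n1 + s2**2 / n2)
    t = (x1bar - x2bar) / se
    df = min(n1, n2) - 1
    return t, df

# Illustrative summary statistics for two groups.
t, df = two_sample_t(x1bar=120.0, s1=10.0, n1=25,
                     x2bar=114.0, s2=12.0, n2=20)
```

Here t ≈ 1.79 with df = 19; the P-value would then come from the t(19) table.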
Chapter 8 Proportions
The sample proportion of successes p̂ is the count of successes divided by
the total number of observations. The mean of its sampling distribution is p
and the standard deviation is √(p(1-p)/n). To make a confidence interval we
use the standard deviation, the mean and z*. The confidence interval becomes
CI = p̂ ± z*·√(p̂(1-p̂)/n).
Often, it is desired to obtain an interval of specified width. The question
is then how many individuals must be included in the sample. We know that the
margin of error m equals z*·√(p(1-p)/n). The number of individuals needed
then becomes n = (z*)²·p(1-p)/m². The maximum width is obtained when p = 0.5,
so that is a safe planning value.
Some requirements for proportions are that there must be a large sample and a
large population. If the population is too small, the independence assumption
doesn't hold and the number of successes doesn't follow a binomial
distribution.
For population proportions, hypothesis testing can be done, too. To do this,
we use the z statistic. We can obtain the z-value with the formula (see book)
and calculate the P-value with that z-value.
Often we are interested in the difference between two proportions. The two
sample proportions are approximately
p̂1 ~ N(p1, √(p1(1-p1)/n1)) and p̂2 ~ N(p2, √(p2(1-p2)/n2)).
If the two are independent, it follows that the difference between the two
proportions is also approximately normally distributed. The mean of the
sampling distribution of the difference D = p̂1 - p̂2 is p1 - p2 and the
standard deviation is √(p1(1-p1)/n1 + p2(1-p2)/n2). To make the confidence
interval for the difference D, we use the usual formula D ± z*·SE.
We can also test a hypothesis for the difference. Usually, H0: p1 = p2 and
Ha: p1 ≠ p2. When constructing a test statistic for this hypothesis, we use
the assumption that the null is true. This means that if we estimate the
variance of the difference, we should do so by assuming that p1 = p2. This
yields:

    p̂ = (x1 + x2)/(n1 + n2)
    SE = √(p̂(1-p̂)·(1/n1 + 1/n2))
    z = (p̂1 - p̂2)/SE.

The P-value can then be found in the table in the book with the obtained
z-value.
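The pooled two-proportion test above can be sketched end-to-end; the counts (45 of 100 versus 30 of 100) are made-up, and the normal cdf replaces the book's table:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(x1, n1, x2, n2):
    # Pooled two-proportion z test for H0: p1 = p2 (two-sided P-value).
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)                      # pooled estimate
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Illustrative counts: 45/100 successes versus 30/100.
z, p = two_proportion_ztest(45, 100, 30, 100)
```

For these counts z ≈ 2.19 and the two-sided P-value is about 0.029, so at α = 0.05 we would reject H0: p1 = p2.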