STIRLING’S APPROXIMATION TO N!
For large n: n! ≈ √(2πn) · nⁿ · e⁻ⁿ
PERMUTATIONS
A permutation is an ordered sequence of elements selected from a given finite set, without repetitions, and not
necessarily using all elements of the given set. For example, given the set of letters {C, E, G, I, N, R}, some permutations
are ICE, RING, RICE, NICER, REIGN and CRINGE. ENGINE, on the other hand, is not a permutation, because it uses
the elements E and N twice.
The number of permutations of r elements chosen from n possible (n possible choices for the first element, n − 1 for the second, …) is
Pⁿᵣ = P(n, r) = nPr = n! / (n − r)!
The number of permutations of n objects consisting of groups of which n1 are alike, n2 are alike, and so on, is
n! / (n1! n2! … nk!),   n = n1 + n2 + … + nk
If repetition is allowed, the number of ordered sequences of length r from n elements is n · n · … · n (r times) = nʳ, because there are n possibilities for the first choice, then n possibilities for the second choice, and so on.
COMBINATION
A combination is an unordered collection of distinct elements taken from a given set.
The number of k-combinations (each of size k) from a set S with n elements (size n) is the binomial coefficient
C(n, k) = (n choose k) = n! / (k! (n − k)!)
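These counting formulas can be checked directly with Python's standard library (a quick sketch; the numbers refer to the {C, E, G, I, N, R} example above):

```python
# Counting permutations and combinations with the standard library.
import math

# P(n, r) = n! / (n - r)!  -- ordered selections without repetition
assert math.perm(6, 3) == 120   # e.g. 3-letter "words" from {C, E, G, I, N, R}

# C(n, k) = n! / (k! (n - k)!)  -- unordered selections
assert math.comb(6, 3) == 20

# Permutations of a multiset: n! / (n1! n2! ... nk!)
# e.g. distinct arrangements of the letters of ENGINE (E and N twice each)
arrangements = math.factorial(6) // (math.factorial(2) * math.factorial(2))
assert arrangements == 180
```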
ARITHMETIC MEAN
The arithmetic mean (the mean) of a set of N numbers X1,. . . ,XN:
X̄ = ΣXi / N   (read “X bar”)
WEIGHTED ARITHMETIC MEAN
X̄ = Σwi Xi / Σwi
For data grouped into classes with frequencies fi and class means mi,
X̄ = Σfi mi / Σfi, that is, a weighted arithmetic mean of all the means.
If A is any guessed or assumed arithmetic mean (which may be any number) and if dj = Xj − A are the deviations of Xj from A, then
X̄ = A + Σdj/N = A + Σ(Xj − A)/N = A + d̄
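A small numeric check of these definitions (the sample values and weights are illustrative, not from the text):

```python
# Arithmetic mean, weighted mean, and the assumed-mean identity.
data = [5.0, 8.0, 11.0, 12.0]
weights = [1, 2, 3, 4]

mean = sum(data) / len(data)                       # X-bar = (sum X_i) / N
wmean = sum(w * x for w, x in zip(weights, data)) / sum(weights)

# X-bar = A + (sum of deviations from any assumed mean A) / N
A = 10.0
assert mean == A + sum(x - A for x in data) / len(data)
```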
THE MEDIAN
The median of a set of numbers arranged in order of magnitude is either
the middle value or
the arithmetic mean of the two middle values
Geometrically the median is the value of X corresponding to the vertical line which divides a histogram into two parts
having equal areas.
THE MODE
The mode of a set of numbers is that value which occurs with the greatest frequency; that is, it is the most common
value. The mode may not exist, and even if it does exist it may not be unique.
DISPERSION OR VARIATION
The degree to which numerical data tend to spread about an average value is called the dispersion, or variation, of the
data. Various measures of this dispersion (or variation) are available, the most common being the:
• range
• mean deviation
• semi-interquartile range
• 10–90 percentile range
• standard deviation
THE RANGE
The range of a set of numbers is the difference between the largest and smallest numbers in the set.
The mean deviation is
MD = Σ|Xj − X̄| / N, the mean of the absolute deviations |X − X̄|.
The standard deviation is
s = √( Σ(Xj − X̄)² / N )
Sometimes the standard deviation of a sample’s data is defined with (N – 1) replacing N in the denominator because the
resulting value represents a better estimate of the standard deviation of a population from which the sample is taken. For
large values of N (N > 30), there is practically no difference between the two definitions.
The variance of a set of data is defined as the square of the standard deviation, s².
When it is necessary to distinguish the standard deviation of a population from the standard deviation of a sample drawn
from this population, we often use
• s - standard deviation of a sample
• σ - standard deviation of a population
• s 2 - sample variance
• σ 2 - population variance
s² = Σ(Xj − X̄)²/N = Σ(Xj² − 2Xj X̄ + X̄²)/N = ΣXj²/N − X̄²
that is, the variance is the mean of the squares minus the square of the mean.
Similarly, in terms of the deviations dj = Xj − A from an assumed mean A,
s² = Σ(dj − d̄)²/N = Σ(dj² − 2dj d̄ + d̄²)/N = Σdj²/N − 2d̄(Σdj)/N + d̄² = Σdj²/N − d̄²
which leads to
s = √( Σd²/N − (Σd/N)² )
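The shortcut formulas above can be verified numerically (the data set is illustrative):

```python
# Verify the shortcut s^2 = mean(X^2) - mean(X)^2 on sample data.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
N = len(data)

mean = sum(data) / N
var_definition = sum((x - mean) ** 2 for x in data) / N
var_shortcut = sum(x * x for x in data) / N - mean ** 2
assert abs(var_definition - var_shortcut) < 1e-12

# The same identity in coded form with deviations d_j = X_j - A:
A = 5.0
d = [x - A for x in data]
var_coded = sum(dj * dj for dj in d) / N - (sum(d) / N) ** 2
assert abs(var_definition - var_coded) < 1e-12
```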
If two sets of N1 and N2 values have the same mean and variances s1² and s2², the combined variance of both sets is
s² = (N1 s1² + N2 s2²) / (N1 + N2)
Note that this is a weighted arithmetic mean of the variances. This result can be generalized to three or more sets.
Chebyshev’s theorem states that for k > 1, at least ((k² − 1)/k²) · 100% = (1 − 1/k²) · 100% of the data for any distribution lies within k standard deviations of the mean. In particular, when
k = 2 there is at least 75% of the data in the interval (x̄ − 2s, x̄ + 2s),
k = 3 there is at least 89% of the data in the interval (x̄ − 3s, x̄ + 3s).
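A quick empirical check of Chebyshev's bound on a deliberately skewed data set (the data are illustrative):

```python
# Empirical check of Chebyshev's bound: at least 1 - 1/k^2 of any data set
# lies within k standard deviations of its mean.
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 30]   # deliberately skewed sample

n = len(data)
mean = sum(data) / n
sd = (sum((x - mean) ** 2 for x in data) / n) ** 0.5

for k in (2, 3):
    within = sum(1 for x in data if abs(x - mean) < k * sd) / n
    bound = 1 - 1 / k ** 2
    assert within >= bound
```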
EVENTS
If A and B are events, then
1. Union – A ∪ B is the event “either A or B or both”
2. Intersection – A ∩ B is the event “both A and B”
3. Complement – A′ is the event “not A”
4. A – B = A ∩ B′ is the event “A but not B”. In particular, A′ = S – A.
5. Mutually exclusive – the sets corresponding to A and B are disjoint, A ∩ B = ∅
6. Independent events – A and B are independent if and only if P(A ∩ B) = P(A)P(B) (their joint probability is the product of their individual probabilities), or equivalently P(B | A) = P(B) and P(A | B) = P(A)
Axiom 3. For any number of mutually exclusive events A1, A2, …, in the class C,
P(A1 ∪ A2 ∪ …) = P(A1) + P(A2) + …
In particular, for two mutually exclusive events A1 and A2,
P(A1 ∪ A2) = P(A1) + P(A2)
SOME IMPORTANT THEOREMS ON PROBABILITY
Theorem 1: If A1 ⊂ A2, then
P(A1) ≤ P(A2) and P(A2 − A1) = P(A2) − P(A1)
Theorem 5: If A = A1 ∪ A2 ∪ … ∪ An, where A1, A2, …, An are mutually exclusive events, then P(A) = P(A1) + P(A2) + … + P(An)
Theorem 6: If A and B are any two events, then P(A ∪ B) = P(A) + P(B) – P(A ∩ B)
More generally, if A1, A2, A3 are any three events, then
P(A1 ∪ A2 ∪ A3) = P(A1) + P(A2) + P(A3) –
P(A1 ∩ A2) – P(A2 ∩ A3) – P(A3 ∩ A1) +
P(A1 ∩ A2 ∩ A3).
CONDITIONAL PROBABILITY
P(B | A) denotes the probability of B given that A has occurred - the conditional probability of B given A.
Since A is known to have occurred, it becomes the new sample space replacing the original S.
P(B | A) ≡ P(A ∩ B) / P(A)   or   P(A ∩ B) ≡ P(A) P(B | A)
Consider the simple scenario of rolling two fair six-sided dice, labeled die 1 and die 2. Define the following three events:
A: die 1 lands on 3; B: die 2 lands on 1; C: the dice sum to 8.
The prior probabilities are P(A) = P(B) = 1/6. Of the 36 possible ways that a pair of dice can land, just 5 result in a sum of 8 (namely 2 and 6, 3 and 5, 4 and 4, 5 and 3, and 6 and 2), so P(C) = 5/36.
The probability of both A and C occurring is called the joint probability of A and C –
P(A ∩ C) = 1/36. On the other hand P(B ∩ C) = 0.
Now suppose we roll the dice and cover up die 2, so we can only see die 1, and observe that die 1 landed on 3. Given this
partial information, the probability that the dice sum to 8 is no longer 5/36; instead it is 1/6, since die 2 must land on 5 to
achieve this result. This is called the conditional probability, because it is the probability of C under the condition that A
is observed, and is written P(C | A), which is read "the probability of C given A" –
P(C | A) = P(A ∩ C) / P(A) = (1/36) / (1/6) = 1/6
Similarly, P(C | B) = 0, since if we observe die 2 landed on 1, we already know the dice can't sum to 8, regardless of what
the other die landed on.
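The dice probabilities above can be confirmed by enumerating all 36 outcomes (a sketch using exact rational arithmetic):

```python
# Enumerate the 36 equally likely outcomes of two dice and check the
# conditional probability from the text (A: die 1 shows 3, C: sum is 8).
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # 36 equally likely pairs
total = len(outcomes)

p_A = Fraction(sum(1 for d1, d2 in outcomes if d1 == 3), total)
p_A_and_C = Fraction(sum(1 for d1, d2 in outcomes if d1 == 3 and d1 + d2 == 8), total)

assert p_A == Fraction(1, 6)
assert p_A_and_C == Fraction(1, 36)
assert p_A_and_C / p_A == Fraction(1, 6)   # P(C | A) = P(A ∩ C) / P(A)
```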
joint probability ≠ conditional probability
P(B | A) = P(A ∩ B) / P(A)
P(A | B) = P(A ∩ B) / P(B)
P ( A | B ) P( B ) = P ( A ∩ B ) = P ( B | A) P( A)
Dividing by P(B) we obtain Bayes’ theorem.
Theorems on Conditional Probability
Theorem 8: For any three events A1, A2, A3, we have
P ( A1 ∩ A2 ∩ A3 ) = P ( A1 ) P ( A2 | A1 ) P( A3 | A1 ∩ A2 )
In words, the probability that A1 and A2 and A3 all occur is equal to the probability that A1 occurs times the probability
that A2 occurs given that A1 has occurred times the probability that A3 occurs given that both A1 and A2 have occurred.
Theorem 9: If an event A must result in one of the mutually exclusive events A1 , A2 , … , An , then
P ( A) = P ( A1 ) P ( A | A1 ) + P( A2 ) P( A | A2 ) + ... + P( An ) P( A | An )
INDEPENDENT EVENTS
A and B are independent events if the probability of B occurring is not affected by the occurrence or nonoccurrence of A
and vice versa
Or their joint probability is a product of their individual probabilities
P(B | A) = P(B) and P(A | B) = P(A)
or
P(A ∩ B) = P(A) P(B)
Three events A1, A2, A3 are independent if they are pairwise independent and, in addition, P(A1 ∩ A2 ∩ A3) = P(A1) P(A2) P(A3); pairwise independence alone is not sufficient.
The conditional probability fallacy is the assumption that P(A|B) is approximately equal to P(B|A). It can be overcome
by describing the data in actual numbers rather than probabilities.
The relation between P(A|B) and P(B|A) is given by Bayes' theorem:
P( B | A) P( A)
P( A | B) =
P( B)
And P(A|B) is approximately equal to P(B|A) if the prior probabilities P(A) and P(B) are also approximately equal.
An example: In order to identify individuals having a serious disease in an early curable form, one may consider screening
a large group of people. While the benefits are obvious, an argument against such screenings is the disturbance caused
by false positive screening results. Suppose:
• 1% of the group suffer from the disease: P(ill) = 0.01 and P(well) = 0.99.
• when the screening test is applied to a healthy person, there is a 1% chance of getting a false positive result:
P(positive | well) = 1%, and P(negative | well) = 99%.
• when the test is applied to an ill person, there is a 1% chance of a false negative result: P(negative | ill) = 1% and
P(positive | ill) = 99%.
Now, one may calculate the following:
• P(well ∩ negative) = P(well) P(negative | well) = 99% × 99% = 98.01%.
• P(ill ∩ positive) = P(ill) P(positive | ill) = 1% × 99% = 0.99%.
• P(well ∩ positive) = P(well) P(positive | well) = 99% × 1% = 0.99%.
• P(ill ∩ negative) = P(ill) P(negative | ill) = 1% × 1% = 0.01%.
And
P(positive) = P(well ∩ positive) + P(ill ∩ positive) = 0.99% + 0.99% = 1.98%
P(ill | positive) = P(ill ∩ positive) / P(positive) = 0.99% / 1.98% = 50%.
In this example, it should be easy to relate to the difference between the conditional probabilities
• P(positive | ill) = 99% - is the probability that an individual who has the disease tests positive
• P(ill | positive) = 50% - is the probability that an individual who tests positive actually has the disease.
With the numbers chosen here, the last result is likely to be deemed unacceptable: half the people testing positive are
actually false positives.
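The screening computation can be reproduced in a few lines (same numbers as above):

```python
# Bayes' theorem for the screening example: P(ill | positive).
p_ill = 0.01
p_well = 0.99
p_pos_given_ill = 0.99
p_pos_given_well = 0.01

# total probability of a positive result
p_pos = p_ill * p_pos_given_ill + p_well * p_pos_given_well
p_ill_given_pos = p_ill * p_pos_given_ill / p_pos

assert abs(p_pos - 0.0198) < 1e-12
assert abs(p_ill_given_pos - 0.5) < 1e-12   # half of positives are false alarms
```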
Another type of fallacy is interpreting conditional probabilities of events as unconditional probabilities, or seeing them as being of the same order of magnitude. This fallacy of viewing P(A|B) as P(A), or as being close to P(A), is often related to some forms of statistical bias.
Here is an example: one of the conditions for the legendary wild-west hero Wyatt Earp to have become a legend was having survived all of his duels. Indeed, it is reported that he was never wounded, not even scratched by a bullet. The probability of this happening is very small, which contributed to his fame, because events of very small probability attract attention. However, the degree of attention depends very much on the observer. Somebody impressed by a specific event (here, seeing a "hero") is prone to view the effects of randomness differently from others who are less impressed.
In general, it makes little sense to ask after observing a remarkable series of events, "What is the probability of this?", because this is a conditional probability, conditioned on the observation. The distinction between conditional and unconditional probabilities can be intricate when the observer who asks "What is the probability?" is himself the outcome of a random selection.
Bayes’ Theorem
P(A | B) = P(B | A) P(A) / P(B)
Bayes’ theorem is often referred to as a theorem on the probability of causes; it relates the conditional and marginal probabilities of events A and B. Each term in Bayes' theorem has a conventional name:
• P(A) is the prior probability or marginal probability of A. It is "prior" in the sense that it does not take into
account any information about B.
• P(A|B) is the conditional probability of A, given B. It is also called the posterior probability because it is
derived from or depends upon the specified value of B.
• P(B|A) is the conditional probability of B given A.
• P(B) is the prior or marginal probability of B, and acts as a normalizing constant.
Intuitively, Bayes' theorem in this form describes the way in which one's beliefs about observing 'A' are updated by having
observed 'B'.
Suppose there is a co-ed school having
60% boys - all wear trousers
40% girls - wear trousers or skirts in equal numbers
An observer sees a student from a distance; all they can see is that this student is wearing trousers. What is the
probability this student is a girl? The correct answer can be computed using Bayes' theorem. The event A is that the
student observed is a girl, and the event B is that the student observed is wearing trousers.
            Girls            Boys             Total
Trousers    20               60               80,  P(B) = 0.8
Skirts      20               0                20,  P(B′) = 0.2
Total       40, P(A) = 0.4   60, P(A′) = 0.6  100
P(B|A) = 0.5 - the probability of the student wearing trousers given that the student is a girl.
P(B|A') = 1 - the probability of the student wearing trousers given that the student is a boy.
P(A|B) = P(B|A)P(A)/P(B) = 0.5 * 0.4 / 0.8 = 0.25
Another way of obtaining the same result is: there are 80 trouser-wearers, of which 20 are girls. Therefore the chance that
a random trouser-wearer is a girl equals 20/80 = 0.25.
Consider the classic game-show setting: a prize hides behind one of three doors (red, green, blue); you pick a door, and the presenter, who knows where the prize is, opens one of the other doors to show it is empty. You might think that, with two doors left unopened, you have a 50:50 chance with either door, and so there is no point in changing doors. However, this is not the case. Let us denote by Ar, Ag, and Ab the events that the prize is behind the red, green, and blue door, respectively.
We shall assume P(Ar) = P(Ag) = P(Ab) = 1/3 and that we have already picked the red door.
Event B = "the presenter opens the blue door". Without any prior knowledge, we would assign this a probability of 50%.
• if the prize is behind the red door, the presenter is free to pick between the green or the blue door at random -
P(B | Ar) = 1 / 2
• if the prize is behind the green door, presenter must pick the blue door - P(B | Ag) = 1
• if the prize is behind the blue door, presenter must pick the green door - P(B | Ab) = 0
Thus,
P(Ar | B) = P(B | Ar) P(Ar) / P(B) = (1/2 × 1/3) / (1/2) = 1/3
P(Ag | B) = P(B | Ag) P(Ag) / P(B) = (1 × 1/3) / (1/2) = 2/3
P(Ab | B) = P(B | Ab) P(Ab) / P(B) = (0 × 1/3) / (1/2) = 0
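A Monte Carlo simulation confirms the 2/3 result for switching (the set-up mirrors the text; the simulation itself is an illustrative sketch):

```python
# Monte Carlo check of the three-door result: switching wins 2/3 of the time.
import random

random.seed(0)
doors = ["red", "green", "blue"]
trials = 100_000
switch_wins = 0

for _ in range(trials):
    prize = random.choice(doors)
    pick = "red"                                    # we always pick the red door
    # presenter opens a door that is neither our pick nor the prize
    opened = random.choice([d for d in doors if d != pick and d != prize])
    # switching means taking the remaining closed door
    switched = next(d for d in doors if d != pick and d != opened)
    if switched == prize:
        switch_wins += 1

assert abs(switch_wins / trials - 2 / 3) < 0.01
```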
For a discrete random variable, the distribution function is the step function
F(x) = Σ_{xk ≤ x} f(xk)
(so F(x) = f(x1) for x1 ≤ x < x2, f(x1) + f(x2) for x2 ≤ x < x3, …, f(x1) + … + f(xn) for x ≥ xn), and the probability function can be recovered from it as
f(x) = F(x) − lim_{u→x⁻} F(u)
For a continuous random variable, the distribution function is
F(x) = ∫_{−∞}^{x} f(u) du
EXPECTATION
For a discrete random variable X ( X = x1 , x2 ,...., xn ) with probability function P ( X = xk ) = f ( xk ) the expectation is
defined as
E ( X ) = x1 f ( x1 ) + x2 f ( x2 ) + ... + xn f ( xn ) = ∑ xk f ( xk )
With equal probabilities f(xk) = 1/n,
E(X) = (x1 + x2 + … + xn) / n = mean(X)
The expected value of a discrete random variable is its measure of central tendency!
For a continuous random variable,
E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx
where f(x) is the probability density function and g(x) is the function whose expectation we want to calculate.
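For a concrete discrete case, the expectation of a fair six-sided die (using exact rational arithmetic):

```python
# Expectation of a discrete random variable: a fair six-sided die.
from fractions import Fraction

values = range(1, 7)
f = Fraction(1, 6)          # equal probabilities f(x_k) = 1/6

expectation = sum(x * f for x in values)
assert expectation == Fraction(7, 2)   # E(X) = 3.5, the mean of 1..6
```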
VARIANCE
For a discrete random variable,
σ_X² = E[(X − μ)²] = Σ_{j=1}^{n} (xj − μ)² f(xj)
For a continuous random variable,
σ_X² = E[(X − μ)²] = ∫_{−∞}^{∞} (x − μ)² f(x) dx
Theorem 4:
σ 2 = E[( X − μ ) 2 ] = E ( X 2 ) − μ 2 = E ( X 2 ) − [ E ( X )]2
where μ = E(X).
The cumulative distribution function of a probability distribution is the probability of the event that a random variable X is
less than or equal to x:
Φ_{μ,σ²}(x) = (1 / (σ√(2π))) ∫_{−∞}^{x} exp( −(u − μ)² / (2σ²) ) du
For μ = 0, σ = 1 this becomes
Φ(x) = (1 / √(2π)) ∫_{−∞}^{x} exp( −u² / 2 ) du
The 68-95-99.7 rule, or three sigma rule, or empirical rule, states that for a normal distribution, almost all values lie within
3 standard deviations of the mean.
Because of the exponential tails of the normal distribution, odds of higher deviations decrease very quickly:
range     fraction in range   expected frequency outside range   approximate frequency for daily event
μ ± 1σ    0.682689492137      1 in 3                             weekly
μ ± 2σ    0.954499736104      1 in 22                            monthly
μ ± 3σ    0.997300203937      1 in 370                           yearly
μ ± 4σ    0.999936657516      1 in 15,787                        every 60 years (once in a lifetime)
μ ± 5σ    0.999999426697      1 in 1,744,278                     every 5,000 years (once in history)
μ ± 6σ    0.999999998027      1 in 506,842,372                   every 1.5 million years (essentially never)
Thus for a daily process, a 6σ (Six Sigma) event is expected to happen less than once in a million years.
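The "fraction in range" column can be reproduced from the error function, since for a normal distribution P(|X − μ| < kσ) = erf(k/√2):

```python
# Fraction of a normal distribution within k standard deviations of the mean.
import math

def fraction_within(k: float) -> float:
    return math.erf(k / math.sqrt(2))

assert abs(fraction_within(1) - 0.682689492137) < 1e-9
assert abs(fraction_within(2) - 0.954499736104) < 1e-9
assert abs(fraction_within(3) - 0.997300203937) < 1e-9
```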
THE POISSON DISTRIBUTION
The Poisson distribution is the discrete probability distribution that expresses the probability of a given number of events occurring in a fixed time interval, if these events occur with a known average rate and independently of the time since the last event. The Poisson distribution can also be used for the number of events in other specified intervals such as distance, area, or volume. If the expected number of occurrences in this interval is λ, then the probability that there are exactly X occurrences is equal to
p(X) = λ^X e^(−λ) / X!
μ = λ   (mean)
σ² = λ   (variance)
σ = √λ   (standard deviation)
SAMPLE SPACES
A set S that consists of all possible outcomes of a random experiment is called a sample space, and each outcome is
called a sample point. Often there will be more than one sample space that can describe outcomes of an experiment, but
there is usually only one that will provide the most information.
Example. If we toss a die, then one sample space is given by {1, 2, 3, 4, 5, 6} while another is {even, odd}. It is clear,
however, that the latter would not be adequate to determine whether an outcome is divisible by 3.
Sample space can be:
• discrete
o finite (finite number of points)
o countably infinite (as many points as there are natural numbers)
• non-discrete
o non-countably infinite (as many points as there are in some interval on the x axis)
EFFICIENT ESTIMATES
If the sampling distributions of two statistics have the same mean (or expectation), then the statistic with the smaller
variance is called an efficient estimator of the mean, while the other statistic is called an inefficient estimator.
If we consider all possible statistics whose sampling distributions have the same mean, the one with the smallest
variance is sometimes called the most efficient, or best, estimator of this mean.
If the sampling distribution of a statistic S is approximately normal (as we have seen is true for many statistics if the sample size N >= 30), we can expect to find an actual sample statistic S lying in the intervals
( μS − σ S , μS + σ S ) about 68.27% of the time
( μS − 2σ S , μS + 2σ S ) about 95.45% of the time
( μS − 3σ S , μS + 3σ S ) about 99.73% of the time
Equivalently, we can expect to find (or we can be confident of finding) μS in the intervals
( S − σ S , S + σ S ) about 68.27% of the time
( S − 2σ S , S + 2σ S ) about 95.45% of the time
( S − 3σ S , S + 3σ S ) about 99.73% of the time
Because of this, we call these respective intervals the 68.27%, 95.45%, and 99.73% confidence intervals for estimating μS. The end numbers of these intervals are called the 68.27%, 95.45%, and 99.73% confidence limits, or fiducial limits.
Similarly, (S ± 1.96σS) and (S ± 2.58σS) are the 95% and 99% (or 0.95 and 0.99) confidence limits for estimating μS. The percentage confidence is often called the confidence level. The numbers 1.96, 2.58, etc., in the confidence limits are called confidence coefficients, or critical values, and are denoted by zc. From confidence levels we can find confidence coefficients, and vice versa.
Confidence level   99.73%   99%    98%    96%    95.45%   95%    90%     80%    68.27%   50%
zc                 3.00     2.58   2.33   2.05   2.00     1.96   1.645   1.28   1.00     0.6745
Confidence Intervals for Means
If the statistic S is the sample mean X , then the 95% and 99% confidence limits for estimating the population mean μ,
are given by X ± 1.96σ X and X ± 2.58σ X , respectively. More generally, the confidence limits are given by X ± zc ⋅ σ X ,
where zc can be read from the table. Using the values of σ X we see that the confidence limits for the population mean
are given by
X̄ ± zc σ/√N   for sampling either from an infinite population or with replacement from a finite population, and
X̄ ± zc (σ/√N) √((Np − N)/(Np − 1))   for sampling without replacement from a population of finite size Np.
Generally, the population standard deviation σ is unknown; thus, to obtain the above confidence limits, we use the
sample estimate s. This will prove satisfactory when N >= 30. For N < 30, the approximation is poor and small sampling
theory must be employed.
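A sketch of the large-sample interval X̄ ± zc·s/√N (the sample below is illustrative, and s is used in place of the unknown σ as described above):

```python
# Large-sample 95% confidence interval for a population mean:
# X-bar +/- z_c * s / sqrt(N), using s to estimate sigma (valid for N >= 30).
import math

def mean_confidence_interval(data, z_c=1.96):
    n = len(data)
    xbar = sum(data) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in data) / n)
    half = z_c * s / math.sqrt(n)
    return xbar - half, xbar + half

# illustrative sample of N = 36 values (not from the text)
sample = [10 + (i % 6) for i in range(36)]
low, high = mean_confidence_interval(sample)
assert low < sum(sample) / len(sample) < high
```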
Confidence Intervals for Standard Deviations
The confidence limits for the standard deviation σ of a normally distributed population, as estimated from a sample with
standard deviation s, are given by
s ± zc σs = s ± zc σ/√(2N)
STATISTICAL DECISION THEORY
Decisions about populations made on the basis of sample information are called statistical decisions.
Assumptions about the populations involved are called statistical hypotheses.
Null Hypotheses
In many instances we formulate a statistical hypothesis for the sole purpose of rejecting or nullifying it. For example, if we
want to decide whether a given coin is loaded, we formulate the hypothesis that the coin is fair (i.e., p = 0.5, where p is the
probability of heads). Similarly, if we want to decide whether one procedure is better than another, we formulate the
hypothesis that there is no difference between the procedures (i.e., any observed differences are due merely to
fluctuations in sampling from the same population). Such hypotheses are often called null hypotheses and are denoted by
H0.
Alternative Hypotheses
Any hypothesis that differs from a given hypothesis is called an alternative hypothesis. For example, if one hypothesis is p
= 0.5, alternative hypotheses might be p = 0.7, or p > 0.5. A hypothesis alternative to the null hypothesis is denoted by H1.
Problem. The probability of getting between 40 and 60 heads (inclusive) in 100 tosses of a fair coin is
Σ_{k=40}^{60} C(N, k) p^k q^(N−k). The normal approximation can be used here with
μ = Np = 100 · 0.5 = 50,  σ = √(Npq) = √(100 · 0.5 · 0.5) = 5.
On a continuous scale, between 40 and 60 heads inclusive is the same as between 39.5 and 60.5 heads. Thus
39.5 → (39.5 − 50)/5 = −2.1,  60.5 → (60.5 − 50)/5 = 2.1
and the probability is 0.9642 (area under the normal curve between −2.1 and 2.1).
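Comparing the exact binomial probability with the normal approximation used above:

```python
# Exact binomial probability of 40-60 heads in 100 fair tosses, versus the
# normal approximation with continuity correction.
import math

exact = sum(math.comb(100, k) for k in range(40, 61)) * 0.5 ** 100

# normal approximation: area between z = -2.1 and z = 2.1
mu, sigma = 50.0, 5.0
z = (60.5 - mu) / sigma
approx = math.erf(z / math.sqrt(2))    # P(|Z| < 2.1) by symmetry

assert abs(approx - 0.9642) < 5e-4
assert abs(exact - approx) < 0.005     # the approximation is quite close
```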
Problem. To test the hypothesis that a coin is fair, adopt the following decision rule:
Accept the hypothesis if the number of heads in a single sample of 100 tosses is between 40 and 60 inclusive. Reject the
hypothesis otherwise.
Find the probability of rejecting the hypothesis when it is actually correct.
What conclusions would you draw if the sample of 100 tosses yielded 53 heads? 60 heads?
Solution. The probability of getting outside 40 to 60 heads is 1 − 0.9642 = 0.0358. This is the probability of rejecting the hypothesis when it is actually correct.
According to the decision rule, we would have to accept the hypothesis that the coin is fair in both cases (53 or 60 heads).
Problem. In an experiment on extrasensory perception, a subject in one room is asked to state the color (red or blue) of a
card chosen from a deck of 50 well-shuffled cards by an individual in another room. If the subject identifies 32 cards
correctly, determine whether the results are significant at the (a) 0.05 and (b) 0.01 levels.
Solution. If p is the probability of the subject choosing the color of a card correctly, then we have to decide between two
hypotheses:
H0: p = 0.5, and the subject is simply guessing.
H1: p > 0.5, and the subject has powers of extrasensory perception.
Since we are testing the subject’s extrasensory perception (p > 0.5), we choose a one-tailed test. If hypothesis H0 is true, then the mean and standard deviation of the number of cards identified correctly are given by
μ = Np = 50 · 0.5 = 25,  σ = √(Npq) = √(50 · 0.5 · 0.5) = 3.54
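The solution breaks off before the z score is computed; the remaining arithmetic is sketched below (1.645 and 2.33 are the one-tailed 0.05 and 0.01 critical values, read from the confidence-coefficient table above):

```python
# Finishing the arithmetic: z score for 32 correct out of 50 under H0 (p = 0.5).
import math

N, p = 50, 0.5
mu = N * p                           # 25
sigma = math.sqrt(N * p * (1 - p))   # ~3.54

z = (32 - mu) / sigma
assert z > 1.645    # significant at the one-tailed 0.05 level
assert z < 2.33     # not significant at the one-tailed 0.01 level
```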
The set of z scores outside the range -1.96 to 1.96 constitutes what is called the critical region of the hypothesis, the
region of rejection of the hypothesis, or the region of significance.
The set of z scores inside the range -1.96 to 1.96 is thus called the region of acceptance of the hypothesis, or the
region of non-significance.
On the basis of the above remarks, we can formulate the following decision rule:
• Reject the hypothesis at the 0.05 significance level if the z score of the statistic S lies outside the range -1.96 to
1.96. This is equivalent to saying that the observed sample statistic is significant at the 0.05 level.
• Accept the hypothesis otherwise (or, if desired, make no decision at all).
Because the z score plays such an important part in tests of hypotheses, it is also called a test statistic.
It should be noted that other significance levels could have been used. For example, if the 0.01 level were used, we
would replace 1.96 everywhere above with 2.58.
P is "the probability, if the test statistic really were distributed as it would be under the null hypothesis, of observing a test
statistic [as extreme as, or more extreme than] the one actually observed."
The smaller the P value, the more strongly the test rejects the null hypothesis, that is, the hypothesis being tested.
A p-value of .05 or less rejects the null hypothesis “at the 5% level”; that is, the statistical assumptions used imply that only 5% of the time would the supposed statistical process produce a finding this extreme if the null hypothesis were true.
1. The p-value is not the probability that the null hypothesis is true (a claim sometimes used to justify the “rule” of considering p-values close to 0 (zero) as significant).
In fact, frequentist statistics does not, and cannot, attach probabilities to hypotheses. Comparison of Bayesian and
classical approaches shows that a p-value can be very close to zero while the posterior probability of the null is very close
to unity. This is the Jeffreys-Lindley paradox.
2. The p-value is not the probability that a finding is "merely a fluke" (again, justifying the "rule" of considering small p-
values as "significant").
As the calculation of a p-value is based on the assumption that the finding is the product of chance alone, it patently cannot simultaneously be used to gauge the probability of that assumption being true. The real meaning is subtly different: the p-value is the chance that the null hypothesis would produce a result this extreme; the result might not be “merely a fluke,” and might be explicable by the null hypothesis.
3. The p-value is not the probability of falsely rejecting the null hypothesis. This error is a version of the so-called
prosecutor's fallacy.
4. The p-value is not the probability that a replicating experiment would not yield the same conclusion.
5. 1 − (p-value) is not the probability of the alternative hypothesis being true (see (1)).
6. The significance level of the test is not determined by the p-value.
The significance level of a test is a value that should be decided upon by the agent interpreting the data before the
data are viewed, and is compared against the p-value or any other statistic calculated after the test has been performed.
7. The p-value does not indicate the size or importance of the observed effect (compare with effect size).
SPECIAL TESTS
For large samples, the sampling distributions of many statistics are normal distributions, and the above tests can be
applied to the corresponding z scores. The following special cases are just a few of the statistics of practical interest. In
each case the results hold for infinite populations or for sampling with replacement. For sampling without replacement
from finite populations, the results must be modified.
1. Means. Here S = X̄ is the sample mean; μS = μX̄ = μ is the population mean; σS = σX̄ = σ/√N (σ is the population standard deviation and N is the sample size). The z score is given by
z = (X̄ − μ) / (σ/√N)
When necessary, the sample deviation s is used to estimate σ.
2. Proportions. Here S = P, the proportion of “successes” in a sample; μS = μP = p, where p is the population proportion of successes and N is the sample size; and σS = σP = √(pq/N).
The z score is given by
z = (P − p) / √(pq/N)
In case P = X/N, where X is the actual number of successes in a sample, the z score becomes
z = (X − Np) / √(Npq)
That is, μS = μ = Np, σS = σ = √(Npq), and S = X.
SMALL SAMPLING THEORY
SMALL SAMPLES
In previous chapters we often made use of the fact that for samples of size N > 30, called large samples, the sampling
distributions of many statistics are approximately normal, the approximation becoming better with increasing N. For
samples of size N < 30, called small samples, this approximation is not good and becomes worse with decreasing N, so
that appropriate modifications must be made.
STUDENT’S T DISTRIBUTION
Let us define the statistic
t = (X̄ − μ) √(N − 1) / s = (X̄ − μ) / (ŝ/√N)
If we consider samples of size N drawn from a normal (or approximately normal) population with mean μ and if for each
sample we compute t, using the sample mean X and sample standard deviation s or ŝ , the sampling distribution for t
can be obtained:
Y = Y0 / (1 + t²/(N − 1))^(N/2) = Y0 / (1 + t²/ν)^((ν+1)/2)
where Y0 is a constant depending on N such that the total area under the curve is 1, and where the constant ν = N − 1 is called the number of degrees of freedom. For large values of ν or N (certainly N >= 30) the distribution closely approximates the standardized normal.
As done with normal distributions, we can define 95%, 99%, or other confidence intervals by using the table of the t
distribution. In this manner we can estimate within specified limits of confidence the population mean μ .
For example, if −t.975 and t.975 are the values of t for which 2.5% of the area lies in each tail of the t distribution, then the 95% confidence interval for t is
−t.975 < (X̄ − μ) √(N − 1) / s < t.975
from which we see that μ is estimated to lie in the interval
X̄ − t.975 s/√(N − 1) < μ < X̄ + t.975 s/√(N − 1)
with 95% confidence. Note that t.975 represents the 97.5 percentile value, while t.025 = −t.975 represents the 2.5 percentile value.
In general, we can represent confidence limits for population means by
X̄ ± tc s/√(N − 1)
where the values tc, called critical values or confidence coefficients, depend on the level of confidence desired and on the sample size.
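A sketch of the small-sample interval X̄ ± tc·s/√(N − 1); the sample data are illustrative, and tc = 2.262 (t.975 for ν = 9) is assumed from a standard t table rather than computed:

```python
# Small-sample confidence interval X-bar +/- t_c * s / sqrt(N - 1).
import math

sample = [9.8, 10.2, 10.4, 9.8, 10.0, 10.2, 9.6, 10.9, 10.1, 10.3]
N = len(sample)
xbar = sum(sample) / N
s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / N)

t_c = 2.262                      # t.975 with nu = N - 1 = 9, from a t table
half = t_c * s / math.sqrt(N - 1)
low, high = xbar - half, xbar + half
assert low < xbar < high
```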