
BASICS

STIRLING’S APPROXIMATION TO N!
For large n: n! ≈ √(2πn) · n^n · e^(−n)
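A quick numerical check (an illustrative sketch in Python; the relative error shrinks as n grows):

```python
import math

# Compare n! with Stirling's approximation sqrt(2*pi*n) * n^n * e^(-n)
for n in (5, 10, 20, 50):
    exact = math.factorial(n)
    approx = math.sqrt(2 * math.pi * n) * n**n * math.exp(-n)
    print(n, f"relative error = {(exact - approx) / exact:.4%}")
```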

PERMUTATIONS
A permutation is an ordered sequence of elements selected from a given finite set, without repetitions, and not
necessarily using all elements of the given set. For example, given the set of letters {C, E, G, I, N, R}, some permutations
are ICE, RING, RICE, NICER, REIGN and CRINGE. ENGINE, on the other hand, is not a permutation, because it uses
the elements E and N twice.
Permutations of r elements chosen from n possible (n choices for the first element, n − 1 for the second, …):
P(n, r) = nPr = n! / (n − r)!
The number of permutations of n objects consisting of groups of which n1 are alike, n2 are alike, and so on, is
n! / (n1! · n2! · … · nk!),   where n = n1 + n2 + … + nk
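Both formulas can be checked with a short script (a sketch using Python's standard library; math.perm and math.prod require Python 3.8+):

```python
import math
from itertools import permutations

# P(n, r) = n!/(n - r)!: ordered selections without repetition,
# e.g. 3-letter arrangements from {C, E, G, I, N, R}
n, r = 6, 3
print(math.perm(n, r))                          # 120
print(len(set(permutations("CEGINR", r))))      # brute-force count, also 120

# Permutations of a multiset: n!/(n1! n2! ... nk!), e.g. ENGINE (E and N twice)
word = "ENGINE"
denom = math.prod(math.factorial(word.count(c)) for c in set(word))
print(math.factorial(len(word)) // denom)       # 180 distinct arrangements
print(len(set(permutations(word))))             # brute-force count, also 180
```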

PERMUTATIONS WITH REPETITION


If you have n things to choose from, and you choose r of them, then the permutations are:
n · n · n · … (r times) = n^r
(Because there are n possibilities for the first choice, then there are n possibilities for the second choice, and so on.)

COMBINATION
A combination is an unordered collection of distinct elements taken from a given set.
The number of k-combinations (each of size k) from a set S with n elements (size n) is the binomial coefficient
nCk = C(n, k) = n! / (k! (n − k)!)
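A minimal check of the binomial coefficient (a sketch; math.comb requires Python 3.8+):

```python
import math
from itertools import combinations

n, k = 6, 3
print(math.comb(n, k))                                                     # 20
print(math.factorial(n) // (math.factorial(k) * math.factorial(n - k)))    # 20
print(len(list(combinations("CEGINR", k))))                                # brute-force count, also 20
```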

COMBINATION WITH REPETITION


Elements of the given set may be taken more than once, e.g. choosing 3 donuts from a menu with 10 types of donuts (each type may be selected more than once; the order is irrelevant). The number of such selections is
(n + k − 1)! / (k! (n − 1)!) = C(n + k − 1, k) = C(n + k − 1, n − 1)
Suppose we have n distinct objects: 1, 2, …, n. Now consider an unordered selection of k integers, possibly with repetitions, i1, i2, i3, …, ik, and suppose this sequence is arranged in non-decreasing order. Then the sequence i1, i2 + 1, i3 + 2, …, ik + (k − 1) is a strictly increasing sequence of k integers from [1, n + k − 1]. The number of such sequences is the number of k-combinations, without repetition, from n + k − 1 distinct objects.
There is another way: suppose we have n distinct cells and k indistinguishable objects to place in the cells. Lay out the
objects in a line. Place dividers between them to divide them into lots that go in successive cells. We will require n-1
dividers. A configuration of objects and dividers looks like xxx|xx|xxxx|||xx|xxx . Here 3 x's go in the first cell, 2 x's in
the second, 4 x's in the third, none in the fourth, and so on. Each such configuration is completely determined by the
location of the dividers. Thus we must choose n-1 locations out of a total of n+k-1 locations for the dividers.
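The donut example and the stars-and-bars count can be verified by brute force (a sketch; itertools.combinations_with_replacement enumerates the unordered selections directly):

```python
import math
from itertools import combinations_with_replacement

n, k = 10, 3   # 10 donut types, choose 3 with repetition allowed
print(math.comb(n + k - 1, k))                                                   # 220
print(math.factorial(n + k - 1) // (math.factorial(k) * math.factorial(n - 1)))  # 220
print(len(list(combinations_with_replacement(range(n), k))))                     # brute-force count, also 220
```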

ARITHMETIC MEAN
The arithmetic mean (the mean) of a set of N numbers X1,. . . ,XN:

X̄ = (Σ Xi) / N      (read “X bar”)
WEIGHTED ARITHMETIC MEAN

X̄ = (Σ wi Xi) / (Σ wi)

PROPERTIES OF THE ARITHMETIC MEAN


1. The algebraic sum of the deviations of a set of numbers from their arithmetic mean is zero.
2. The sum of the squares of the deviations of a set of numbers Xj from any number a is a minimum if and only if a = X̄.
3. If f1 numbers have mean m1, f2 numbers have mean m2, . . . , fK numbers have mean mK, then the mean of all the
numbers is

X̄ = (Σ fi mi) / (Σ fi), that is, a weighted arithmetic mean of all the means.

4. If A is any guessed or assumed arithmetic mean (which may be any number) and if d j = X j − A are the deviations of
Xj from A, then

X = A+
∑d i
= A+
∑(X i − A)
= A+ d
N N

THE MEDIAN
The median of a set of numbers arranged in order of magnitude is either
the middle value or
the arithmetic mean of the two middle values
Geometrically the median is the value of X corresponding to the vertical line which divides a histogram into two parts
having equal areas.

THE MODE
The mode of a set of numbers is that value which occurs with the greatest frequency; that is, it is the most common
value. The mode may not exist, and even if it does exist it may not be unique.

THE EMPIRICAL RELATION BETWEEN THE MEAN, MEDIAN, AND MODE


For symmetrical curves, the mean, mode, and median all coincide.
For unimodal frequency curves that are moderately skewed (asymmetrical), we have the empirical relation
Mean – mode = 3(mean – median)
QUARTILES, DECILES, AND PERCENTILES
If a set of data is arranged in order of magnitude, the middle value (or arithmetic mean of the two middle values) that
divides the set into two equal parts is the median. By extending this idea, we can think of those values which divide the
set into four equal parts. These values denoted by Q1, Q2, and Q3, are called the first, second, and third quartiles,
respectively, the value Q2 being equal to the median.
Similarly, the values that divide the data into 10 equal parts are called deciles and are denoted by D1, D2, . . . , D9, while
the values dividing the data into 100 equal parts are called percentiles and are denoted by P1, P2, . . . , P99. The fifth
decile and the 50th percentile correspond to the median. The 25th and 75th percentiles correspond to the first and third
quartiles, respectively.
Collectively, quartiles, deciles, percentiles, and other values obtained by equal subdivisions of the data are called
quantiles.

DISPERSION OR VARIATION
The degree to which numerical data tend to spread about an average value is called the dispersion, or variation, of the
data. Various measures of this dispersion (or variation) are available, the most common being the:
• range
• mean deviation
• semi-interquartile range
• 10–90 percentile range
• standard deviation

THE RANGE
The range of a set of numbers is the difference between the largest and smallest numbers in the set.

THE MEAN DEVIATION


The mean deviation, or average deviation, of a set of N numbers is defined by

MD = (Σ |Xj − X̄|) / N = mean of |Xj − X̄|

THE STANDARD DEVIATION AND THE VARIANCE


The standard deviation of a set of N numbers is defined by

s = √( Σ (Xj − X̄)² / N )
Sometimes the standard deviation of a sample’s data is defined with (N – 1) replacing N in the denominator because the
resulting value represents a better estimate of the standard deviation of a population from which the sample is taken. For
large values of N (N > 30), there is practically no difference between the two definitions.
The variance of a set of data is defined as the square of the standard deviation, s².

When it is necessary to distinguish the standard deviation of a population from the standard deviation of a sample drawn
from this population, we often use
• s - standard deviation of a sample
• σ - standard deviation of a population
• s 2 - sample variance
• σ 2 - population variance

s = √( Σ(Xj − X̄)²/N ) = √( Σ(Xj² − 2 Xj X̄ + X̄²)/N ) = √( Σ Xj²/N − X̄² ),
i.e. the variance is the mean of the squares minus the square of the mean.

With dj = Xj − A the deviations from some arbitrary constant A, and X̄ = A + (Σ dj)/N:
using d̄ = X̄ − A, Xj = A + dj, X̄ = A + d̄, Xj − X̄ = dj − d̄,
Σ(dj − d̄)²/N = Σ(dj² − 2 dj d̄ + d̄²)/N = Σ dj²/N − 2 d̄ (Σ dj)/N + d̄² = Σ dj²/N − d̄²,
which leads to s = √( Σ dj²/N − d̄² ).
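The shortcut formulas can be verified numerically on arbitrary data (a sketch; the sample values are made up):

```python
data = [4.0, 7.0, 8.0, 10.0, 12.0, 15.0]   # made-up example values
N = len(data)
mean = sum(data) / N

# Definition: s^2 = sum (X - mean)^2 / N
var_def = sum((x - mean) ** 2 for x in data) / N

# Shortcut: s^2 = mean of the squares - square of the mean
var_short = sum(x * x for x in data) / N - mean ** 2

# Coded deviations d = X - A from an arbitrary constant A
A = 10.0
d = [x - A for x in data]
d_bar = sum(d) / N
var_coded = sum(di * di for di in d) / N - d_bar ** 2

print(var_def, var_short, var_coded)   # all three agree (up to rounding)
```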

PROPERTIES OF THE STANDARD DEVIATION


1. For normal distributions, it turns out that:
(a) 68.27% of the cases are included between (X̄ − s, X̄ + s)
(b) 95.45% of the cases are included between (X̄ − 2s, X̄ + 2s)
(c) 99.73% of the cases are included between (X̄ − 3s, X̄ + 3s)
2. Suppose that two sets consisting of N1 and N2 numbers have variances given by s12 and s22 , respectively, and have
the same mean X . Then the combined, or pooled, variance of both sets is given by
s² = (N1 s1² + N2 s2²) / (N1 + N2)
Note that this is a weighted arithmetic mean of the variances. This result can be generalized to three or more sets.
3. Chebyshev’s theorem states that for k > 1, at least ((k² − 1)/k²) · 100% of the probability distribution for any variable lies within k standard deviations of the mean. In particular, when
k = 2, at least 75% of the data lies in the interval (x̄ − 2s, x̄ + 2s),
k = 3, at least 89% of the data lies in the interval (x̄ − 3s, x̄ + 3s),
k = 4, at least 93.75% of the data lies in the interval (x̄ − 4s, x̄ + 4s).
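Chebyshev's bound holds for any distribution; a quick empirical check on a deliberately skewed sample (a sketch):

```python
import random

random.seed(0)
data = [random.expovariate(1.0) for _ in range(100_000)]   # skewed, non-normal data

n = len(data)
mean = sum(data) / n
sd = (sum((x - mean) ** 2 for x in data) / n) ** 0.5

for k in (2, 3, 4):
    inside = sum(1 for x in data if abs(x - mean) < k * sd) / n
    print(f"k={k}: observed {inside:.4f} >= bound {1 - 1 / k**2:.4f}")
```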

EVENTS
If A and B are events, then
1. Union – A ∪ B is the event “either A or B or both”
2. Intersection – A ∩ B is the event “both A and B”
3. Complement – A' is the event “not A”
4. A – B = A ∩ B' is the event “A but not B”. In particular, A' = S – A.
5. Mutually exclusive – the sets corresponding to A and B are disjoint, A ∩ B = ∅
6. Independent events – A and B are independent if and only if P(A ∩ B) = P(A)P(B) (their joint probability is the product of their individual probabilities)

THE AXIOMS OF PROBABILITY


Axiom 1. For every event A in class C, P(A) ≥ 0

Axiom 2. For the sure or certain event S in the class C, P(S) = 1

Axiom 3. For any number of mutually exclusive events A1, A2, …, in the class C,
P(A1 ∪ A2 ∪ …) = P(A1) + P(A2) + …
In particular, for two mutually exclusive events A1 and A2,
P(A1 ∪ A2) = P(A1) + P(A2)
SOME IMPORTANT THEOREMS ON PROBABILITY
Theorem 1: If A1 ⊂ A2, then
P(A1) ≤ P(A2) and
P(A2 − A1) = P(A2) − P(A1)

Theorem 2: For every event A, 0 ≤ P(A) ≤ 1.

Theorem 3: For the empty set ∅, P(∅) = 0

Theorem 4: If A’ is the complement of A, then P (A’) = 1 – P(A)

Theorem 5: If A = A1 ∪ A2 ∪ … ∪ An, where A1, A2, … , An are mutually exclusive events, then P(A) = P(A1) + P(A2) + … + P(An)

Theorem 6: If A and B are any two events, then P(A ∪ B) = P(A) + P(B) – P(A ∩ B)
More generally, if A1, A2, A3 are any three events, then
P(A1 ∪ A2 ∪ A3) = P(A1) + P(A2) + P(A3) – P(A1 ∩ A2) – P(A2 ∩ A3) – P(A3 ∩ A1) + P(A1 ∩ A2 ∩ A3).

Theorem 7: For any events A and B, P(A) = P(A ∩ B) + P(A ∩ B′)

CONDITIONAL PROBABILITY
P(B | A) denotes the probability of B given that A has occurred - the conditional probability of B given A.
Since A is known to have occurred, it becomes the new sample space replacing the original S.
P(B | A) ≡ P(A ∩ B) / P(A), or
P(A ∩ B) ≡ P(A) P(B | A)
Consider the simple scenario of rolling two fair six-sided dice, labeled die 1 and die 2. Define the following three events:

A: Die 1 lands on 3 – P(A) = 1/6


B: Die 2 lands on 1 – P(B) = 1/6
C: The dice sum to 8 – P(C) = 5/36

The prior probabilities of A and B are each 1/6. Of the 36 possible ways that a pair of dice can land, just 5 result in a sum of 8 (namely 2 and 6, 3 and 5, 4 and 4, 5 and 3, and 6 and 2), so P(C) = 5/36.

The probability of both A and C occurring is called the joint probability of A and C –
P(A ∩ C) = 1/36. On the other hand P(B ∩ C) = 0.

Now suppose we roll the dice and cover up die 2, so we can only see die 1, and observe that die 1 landed on 3. Given this
partial information, the probability that the dice sum to 8 is no longer 5/36; instead it is 1/6, since die 2 must land on 5 to
achieve this result. This is called the conditional probability, because it is the probability of C under the condition that A
is observed, and is written P(C | A), which is read "the probability of C given A" –
P(C | A) = P(A ∩ C) / P(A) = (1/36) / (1/6) = 1/6
Similarly, P(C | B) = 0, since if we observe die 2 landed on 1, we already know the dice can't sum to 8, regardless of what
the other die landed on.
joint probability ≠ conditional probability
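The dice probabilities above can be confirmed by enumerating all 36 equally likely outcomes (a sketch):

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # all 36 (die1, die2) pairs

def prob(event):
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

A = lambda o: o[0] == 3          # die 1 lands on 3
B = lambda o: o[1] == 1          # die 2 lands on 1
C = lambda o: sum(o) == 8        # dice sum to 8

print(prob(C))                                    # 5/36
print(prob(lambda o: A(o) and C(o)))              # joint P(A and C) = 1/36
print(prob(lambda o: A(o) and C(o)) / prob(A))    # conditional P(C | A) = 1/6
print(prob(lambda o: B(o) and C(o)))              # P(B and C) = 0
```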
P(B | A) = P(A ∩ B) / P(A)
P(A | B) = P(A ∩ B) / P(B)
so that P(A | B) P(B) = P(A ∩ B) = P(B | A) P(A)
Dividing by P(B) we obtain Bayes’ theorem.
Theorems on Conditional Probability
Theorem 8: For any three events A1, A2, A3, we have
P ( A1 ∩ A2 ∩ A3 ) = P ( A1 ) P ( A2 | A1 ) P( A3 | A1 ∩ A2 )
In words, the probability that A1 and A2 and A3 all occur is equal to the probability that A1 occurs times the probability
that A2 occurs given that A1 has occurred times the probability that A3 occurs given that both A1 and A2 have occurred.

Theorem 9: If an event A must result in one of the mutually exclusive events A1 , A2 , … , An , then
P ( A) = P ( A1 ) P ( A | A1 ) + P( A2 ) P( A | A2 ) + ... + P( An ) P( A | An )

INDEPENDENT EVENTS
A and B are independent events if the probability of B occurring is not affected by the occurrence or nonoccurrence of A
and vice versa
Or their joint probability is a product of their individual probabilities
P(B | A) = P(B) and P(A | B) = P(A)
or
P(A ∩ B) = P(A) P(B)
Three events A1, A2, A3 are (mutually) independent if they are pairwise independent and, in addition, P(A1 ∩ A2 ∩ A3) = P(A1)P(A2)P(A3); pairwise independence alone is not sufficient.

The conditional probability fallacy is the assumption that P(A|B) is approximately equal to P(B|A). It can be overcome
by describing the data in actual numbers rather than probabilities.
The relation between P(A|B) and P(B|A) is given by Bayes' theorem:
P(A | B) = P(B | A) P(A) / P(B)
And P(A|B) is approximately equal to P(B|A) if the prior probabilities P(A) and P(B) are also approximately equal.
An example: In order to identify individuals having a serious disease in an early curable form, one may consider screening
a large group of people. While the benefits are obvious, an argument against such screenings is the disturbance caused
by false positive screening results. Suppose:
• 1% of the group suffer from the disease: P(ill) = 0.01 and P(well) = 0.99.
• when the screening test is applied to a healthy person, there is a 1% chance of getting a false positive result:
P(positive | well) = 1%, and P(negative | well) = 99%.
• when the test is applied to an ill person, there is a 1% chance of a false negative result: P(negative | ill) = 1% and
P(positive | ill) = 99%.
Now, one may calculate the following:
• P(well ∩ negative) = P(well) P(negative | well) = 99% * 99% = 98.01%.
• P(ill ∩ positive) = P(ill) P(positive | ill) = 1% * 99% = 0.99%.
• P(well ∩ positive) = P(well) P(positive | well) = 99% * 1% = 0.99%.
• P(ill ∩ negative) = P(ill) P(negative | ill) = 1% * 1% = 0.01%.
And
P(positive) = P(well ∩ positive) + P(ill ∩ positive) = 0.99%+0.99%=1.98%
P(ill | positive) =P(ill ∩ positive) / P(positive) = 0.99% / 1.98% = 50%.

In this example, it should be easy to relate to the difference between the conditional probabilities
• P(positive | ill) = 99% - is the probability that an individual who has the disease tests positive
• P(ill | positive) = 50% - is the probability that an individual who tests positive actually has the disease.
With the numbers chosen here, the last result is likely to be deemed unacceptable: half the people testing positive are
actually false positives.
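The screening numbers follow directly from the probabilities stated above; a sketch that reproduces P(ill | positive) = 50%:

```python
p_ill = 0.01
p_well = 1 - p_ill
p_pos_given_well = 0.01      # false positive rate
p_pos_given_ill = 0.99       # 1 - false negative rate

p_pos = p_well * p_pos_given_well + p_ill * p_pos_given_ill
p_ill_given_pos = p_ill * p_pos_given_ill / p_pos

print(f"P(positive)       = {p_pos:.4f}")            # 0.0198
print(f"P(ill | positive) = {p_ill_given_pos:.2f}")  # 0.50
```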
Another type of fallacy is interpreting conditional probabilities of events as unconditional probabilities, or seeing them as being of the same order of magnitude. This fallacy of viewing P(A|B) as P(A), or as being close to P(A), is often related to some forms of statistical bias.

Here is an example: one of the conditions for the legendary wild-west hero Wyatt Earp to have become a legend was that he survived all of his duels. Indeed, it is reported that he was never wounded, not even scratched by a bullet. The probability of this happening is very small, which contributed to his fame, because events of very small probability attract attention. However, the point is that the degree of attention depends very much on the observer. Somebody impressed by a specific event (here, seeing a "hero") is prone to view the effects of randomness differently from others who are less impressed.
In general, it makes little sense to ask after observing a remarkable series of events "What is the probability of this?", because this is a probability conditional on the observation. The distinction between conditional and unconditional probabilities can be intricate if the observer who asks "What is the probability?" is himself the outcome of a random selection.

Bayes’ Theorem
P(A | B) = P(B | A) P(A) / P(B)
Bayes’ theorem is often referred to as a theorem on the probability of causes and relates the conditional and
marginal probabilities of events A and B. Each term in Bayes' theorem has a conventional name:
• P(A) is the prior probability or marginal probability of A. It is "prior" in the sense that it does not take into
account any information about B.
• P(A|B) is the conditional probability of A, given B. It is also called the posterior probability because it is
derived from or depends upon the specified value of B.
• P(B|A) is the conditional probability of B given A.
• P(B) is the prior or marginal probability of B, and acts as a normalizing constant.
Intuitively, Bayes' theorem in this form describes the way in which one's beliefs about observing 'A' are updated by having
observed 'B'.
Suppose there is a co-ed school having
60% boys - all wear trousers
40% girls - wear trousers or skirts in equal numbers
An observer sees a student from a distance; all they can see is that this student is wearing trousers. What is the
probability this student is a girl? The correct answer can be computed using Bayes' theorem. The event A is that the
student observed is a girl, and the event B is that the student observed is wearing trousers.
            Girls            Boys             Total
Trousers    20               60               80,  P(B) = 0.8
Skirts      20               0                20,  P(B') = 0.2
Total       40, P(A) = 0.4   60, P(A') = 0.6  100
P(B|A) = 0.5 - the probability of the student wearing trousers given that the student is a girl.
P(B|A') = 1 - the probability of the student wearing trousers given that the student is a boy.
P(A|B) = P(B|A)P(A)/P(B) = 0.5 * 0.4 / 0.8 = 0.25
Another way of obtaining the same result is: there are 80 trouser-wearers, of which 20 are girls. Therefore the chance that
a random trouser-wearer is a girl equals 20/80 = 0.25.

Using P(B) = P(A ∩ B) + P(A' ∩ B) = P(B|A)P(A) + P(B|A')P(A'), we can rewrite Bayes’ theorem as

P(A | B) = P(B | A) P(A) / [ P(B | A) P(A) + P(B | A') P(A') ]

Or, more generally, for A1, A2, … , An mutually exclusive and complementary events whose union is the sample space S (one of the events must occur):

P(Ak | B) = P(B | Ak) P(Ak) / Σj P(B | Aj) P(Aj)

An Intuitive Explanation of Bayes’ Theorem:


• Tests are not the event. We have a test for spam, separate from the event of actually having a spam message.
• Tests are flawed. Tests detect things that don’t exist (false positive), and miss things that do exist (false negative).
• Tests give us test probabilities, not the real probabilities. People often consider the test results directly, without
considering the errors in the tests.
• False positives skew results. Suppose you are searching for something really rare (1 in a million). Even with a
good test, it’s likely that a positive result is really a false positive on somebody in the 999,999.
• Even science is a test. Scientific experiments can be considered “potentially flawed tests” and need to be treated
accordingly. Tests and measuring equipment have some inherent rate of error.
Bayes’ theorem gives you the actual probability of an event given the measured test probabilities. For example, you can:
• Correct for measurement errors. If you know the real probabilities and the chance of a false positive and false
negative, you can correct for measurement errors.
• Relate the actual probability to the measured test probability. Bayes’ theorem lets you relate
o P(A|X), the chance that an event A happened given the indicator X, and
o P (X|A), the chance the indicator X happened given that event A occurred.

One clever application of Bayes’ Theorem is in spam filtering. We have


Event A: The message is spam.
Test X: The message contains certain words (X)
P(spam | words) = P(words | spam) P(spam) / P(words)
Bayesian filtering allows us to predict the chance a message is really spam given the “test results” (the presence of
certain words). Clearly, words like “viagra” have a higher chance of appearing in spam messages than in normal ones.
Spam filtering based on a blacklist is flawed — it’s too restrictive and false positives are too great. But Bayesian filtering
gives us a middle ground — we use probabilities. As we analyze the words in a message, we can compute the chance it
is spam (rather than making a yes/no decision). As the filter gets trained with more and more messages, it updates the
probabilities that certain words lead to spam messages. Advanced Bayesian filters can examine multiple words in a row,
as another data point.
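A toy single-word spam score along these lines (purely illustrative; the message counts are made up, and real filters combine many words and apply smoothing):

```python
# Hypothetical training counts for one word, e.g. "viagra"
spam_msgs, ham_msgs = 400, 600            # messages seen so far
spam_with_word, ham_with_word = 120, 3    # how many of each contained the word

p_spam = spam_msgs / (spam_msgs + ham_msgs)
p_ham = 1 - p_spam
p_word_given_spam = spam_with_word / spam_msgs
p_word_given_ham = ham_with_word / ham_msgs

# Bayes' theorem: P(spam | word) = P(word | spam) P(spam) / P(word)
p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham
print(f"P(spam | word) = {p_word_given_spam * p_spam / p_word:.3f}")   # about 0.976
```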

The Monty Hall problem


We are presented with three doors to choose from - red, green, and blue - one of which has a prize hidden behind it. We choose the red door. The presenter, who knows where the prize is, opens the blue door and reveals that there is no prize behind it. He then asks if we wish to change our mind about our initial selection of red. Will changing our mind at this point improve our chances of winning the prize?

You might think that, with two doors left unopened, you have a 50:50 chance with either door, and so there is no point in
changing doors. However, this is not the case. Let us call the situation that the prize is behind a given door Ar, Ag, and
Ab.
We shall assume P(Ar) = P(Ag) = P(Ab) = 1/3 and that we have already picked the red door.
Event B = "the presenter opens the blue door". Without any prior knowledge, we would assign this a probability of 50%.
• if the prize is behind the red door, the presenter is free to pick between the green or the blue door at random -
P(B | Ar) = 1 / 2
• if the prize is behind the green door, presenter must pick the blue door - P(B | Ag) = 1
• if the prize is behind the blue door, presenter must pick the green door - P(B | Ab) = 0

Thus,
P(Ar | B) = P(B | Ar) P(Ar) / P(B) = (1/2 · 1/3) / (1/2) = 1/3
P(Ag | B) = P(B | Ag) P(Ag) / P(B) = (1 · 1/3) / (1/2) = 2/3
P(Ab | B) = P(B | Ab) P(Ab) / P(B) = (0 · 1/3) / (1/2) = 0

So, we should always switch.
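A simulation confirms the 1/3 vs. 2/3 split (a sketch):

```python
import random

random.seed(1)
trials = 100_000
stay_wins = switch_wins = 0

for _ in range(trials):
    doors = ["red", "green", "blue"]
    prize = random.choice(doors)
    pick = "red"                      # our initial choice
    # the presenter opens a door that is neither our pick nor the prize
    opened = random.choice([d for d in doors if d != pick and d != prize])
    switched = next(d for d in doors if d not in (pick, opened))
    stay_wins += (pick == prize)
    switch_wins += (switched == prize)

print(f"stay:   {stay_wins / trials:.3f}")    # about 1/3
print(f"switch: {switch_wins / trials:.3f}")  # about 2/3
```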


THE BINOMIAL, NORMAL, AND POISSON DISTRIBUTIONS
DISCRETE PROBABILITY DISTRIBUTION
Let X be a discrete random variable taking the values x1, x2, …, xn. Suppose also that these values are assumed with probabilities given by
P(X = xk) = f(xk),  k = 1, 2, …
In general, f(x) is a probability function if
1. f(x) ≥ 0
2. Σx f(x) = 1

(CUMULATIVE) DISTRIBUTION FUNCTION


The cumulative distribution function determines the probability that the random variable X takes on any value x or less:
F ( x) = P ( X ≤ x)
1. The distribution function is non-decreasing (monotonically increasing): F(x) ≤ F(y) if x ≤ y
2. lim{x→−∞} F(x) = 0 and lim{x→+∞} F(x) = 1
3. The distribution function is continuous from the right: lim{h→0+} F(x + h) = F(x) for all x

DISTRIBUTION FUNCTIONS FOR DISCRETE RANDOM VARIABLES


X = x1 , x2 ,...., xn
f(x) – Probability function
F(x) – Distribution function
F(x) = 0                          for x < x1
     = f(x1)                      for x1 ≤ x < x2
     = f(x1) + f(x2)              for x2 ≤ x < x3
     …
     = f(x1) + … + f(xn)          for x ≥ xn
f(x) = F(x) − lim{u→x−} F(u)
For a continuous random variable, the distribution function is
F(x) = ∫ from −∞ to x of f(u) du

EXPECTATION
For a discrete random variable X ( X = x1 , x2 ,...., xn ) with probability function P ( X = xk ) = f ( xk ) the expectation is
defined as
E ( X ) = x1 f ( x1 ) + x2 f ( x2 ) + ... + xn f ( xn ) = ∑ xk f ( xk )
With equal probabilities
E(X) = (x1 + x2 + … + xn) / n = mean(X)
The expected value of a discrete random variable is its measure of central tendency!
For a continuous random variable,
E[g(X)] = ∫ from −∞ to ∞ of g(x) f(x) dx
where f(x) is the probability (density) function and g(x) is the function whose expectation we want to calculate.
VARIANCE
For discrete random variable
σx² = E[(X − μ)²] = Σ from j = 1 to n of (xj − μ)² f(xj)
For a continuous random variable,
σx² = E[(X − μ)²] = ∫ from −∞ to ∞ of (x − μ)² f(x) dx

THEOREMS ON EXPECTATION AND VARIANCE


For any random variable (not just discrete)

Theorem 1: If c is any constant, then


E (cX ) = cE ( X )
Var (cX ) = c 2Var ( X )

Theorem 2: If X and Y are any random variables, then


E ( X + Y ) = E ( X ) + E (Y )

Theorem 3: If X and Y are independent random variables, then


E ( XY ) = E ( X ) E (Y )

Theorem 4:
σ 2 = E[( X − μ ) 2 ] = E ( X 2 ) − μ 2 = E ( X 2 ) − [ E ( X )]2
where μ = E(X).

Theorem 5: The quantity E[(X − a)²] is a minimum when a = μ = E(X)

Theorem 6: If X and Y are independent random variables,


Var(X ± Y) = Var(X) + Var(Y), i.e. σ²(X ± Y) = σX² + σY²
THE BINOMIAL DISTRIBUTION
If p is the probability that an event will happen in any single trial and q = 1 − p is the probability that it will fail to happen in any single trial, then the probability that the event will happen exactly k times in N trials is given by
P(X = k) = C(N, k) p^k q^(N−k) = N! / (k!(N − k)!) · p^k q^(N−k)
Binomial distribution properties:
μ = Np          Mean
σ² = Npq        Variance
σ = √(Npq)      Standard deviation
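The pmf and its mean and variance can be checked directly (a sketch with arbitrary N and p):

```python
import math

N, p = 10, 0.3
q = 1 - p
pmf = [math.comb(N, k) * p**k * q**(N - k) for k in range(N + 1)]

print(sum(pmf))                                          # ≈ 1.0: probabilities sum to one
mean = sum(k * pk for k, pk in enumerate(pmf))
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))
print(mean, N * p)                                       # both ≈ 3.0
print(var, N * p * q)                                    # both ≈ 2.1
```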

THE NORMAL DISTRIBUTION


The normal distribution (Gaussian distribution) is a continuous probability distribution defined by:
Y = 1 / (σ√(2π)) · exp( −(X − μ)² / (2σ²) )
The mean characterizes the location and the variance the scale.
When the variable X is expressed in terms of standard units z = (X − μ)/σ, the normal distribution has the standard form:
Y = 1/√(2π) · exp(−z²/2),  with mean 0 and variance 1.

The cumulative distribution function of a probability distribution is the probability of the event that a random variable X is
less than or equal to x:
Φ{μ,σ²}(x) = 1/(σ√(2π)) · ∫ from −∞ to x of exp( −(u − μ)² / (2σ²) ) du
For μ = 0, σ = 1:
Φ(x) = 1/√(2π) · ∫ from −∞ to x of exp(−u²/2) du
The 68-95-99.7 rule, or three sigma rule, or empirical rule, states that for a normal distribution, almost all values lie within
3 standard deviations of the mean.
Because of the exponential tails of the normal distribution, odds of higher deviations decrease very quickly:
range      fraction in range     expected frequency     approximate frequency
                                 outside range          for daily event
μ ± 1σ     0.682689492137        1 in 3                 weekly
μ ± 2σ     0.954499736104        1 in 22                monthly
μ ± 3σ     0.997300203937        1 in 370               yearly
μ ± 4σ     0.999936657516        1 in 15,787            every 60 years (once in a lifetime)
μ ± 5σ     0.999999426697        1 in 1,744,278         every 5,000 years (once in history)
μ ± 6σ     0.999999998027        1 in 506,842,372       every 1.5 million years (essentially never)

Thus for a daily process, a 6σ (Six Sigma) event is expected to happen less than once in a million years.
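The “fraction in range” column follows from the standard normal CDF, since P(|X − μ| ≤ kσ) = erf(k/√2); a sketch using only the standard library (the extreme tail odds are sensitive to rounding):

```python
import math

def fraction_within(k):
    """P(|X - mu| <= k*sigma) for a normal distribution."""
    return math.erf(k / math.sqrt(2))

for k in range(1, 7):
    frac = fraction_within(k)
    odds_outside = 1 / math.erfc(k / math.sqrt(2))
    print(f"mu +/- {k} sigma: {frac:.12f}   about 1 in {odds_outside:,.0f}")
```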
THE POISSON DISTRIBUTION
The discrete probability distribution that expresses the probability of a given number of events occurring in a fixed time interval, if these events occur with a known average rate and independently of the time since the last event. The Poisson distribution can also be used for the number of events in other specified intervals such as distance, area or volume. If the expected number of occurrences in this interval is λ, then the probability that there are exactly X occurrences is equal to
p(X) = λ^X e^(−λ) / X!
μ = λ          Mean
σ² = λ         Variance
σ = √λ         Standard deviation

RELATION BETWEEN THE BINOMIAL AND NORMAL DISTRIBUTIONS


If N is large and if neither p nor q is too close to zero, the binomial distribution can be closely approximated by a
normal distribution with standardized variable given by
z = (X − Np) / √(Npq)

RELATION BETWEEN THE BINOMIAL AND POISSON DISTRIBUTIONS


In the binomial distribution if N is large while the probability p of the occurrence of an event is close to 0, the event is
called a rare event.
In practice we shall consider an event to be rare if the number of trials is at least 50 (N >= 50) while Np is less than 5 (p <
0.1). In such case the binomial distribution is very closely approximated by the Poisson distribution with λ = Np .

RELATION BETWEEN THE POISSON AND NORMAL DISTRIBUTIONS


The Poisson distribution approaches a normal distribution with standardized variable (X − λ)/√λ as λ increases indefinitely.

THE GEOMETRIC DISTRIBUTION


Either of two discrete probability distributions:
• the probability distribution of the number X of Bernoulli trials needed to get the first success
  o supported on the set {1, 2, 3, ...}
  o P(X = k) = (1 − p)^(k−1) p
  o mean = 1/p, variance = (1 − p)/p²
• the probability distribution of the number Y = X − 1 of failures before the first success
  o supported on the set {0, 1, 2, 3, ...}
  o P(Y = k) = (1 − p)^k p
  o mean = (1 − p)/p, variance = (1 − p)/p²
Which of these one calls "the" geometric distribution is a matter of convention and convenience.
ELEMENTARY SAMPLING THEORY
SAMPLING THEORY
Sampling theory is a study of relationships between a population and samples from that population.
It estimates unknown population parameters from knowledge of corresponding sample statistics.

SAMPLE SPACES
A set S that consists of all possible outcomes of a random experiment is called a sample space, and each outcome is
called a sample point. Often there will be more than one sample space that can describe outcomes of an experiment, but
there is usually only one that will provide the most information.
Example. If we toss a die, then one sample space is given by {1, 2, 3, 4, 5, 6} while another is {even, odd}. It is clear,
however, that the latter would not be adequate to determine whether an outcome is divisible by 3.
Sample space can be:
• discrete
o finite (finite number of points)
o countably infinite (as many points as there are natural numbers)
• non-discrete
o non-countably infinite (as many points as there are in some interval on the x axis)

SAMPLING WITH AND WITHOUT REPLACEMENT


A finite population in which sampling is with replacement can theoretically be considered infinite.
Sampling from a very large finite population can be treated as sampling from an infinite population.

SAMPLING DISTRIBUTION OF MEANS


Suppose that all possible samples of size N are drawn without replacement from a finite population of size Np > N. If
• μ_X̄ and σ_X̄ are the mean and standard deviation of the sampling distribution of means
• μ and σ are the mean and standard deviation of the population
then
μ_X̄ = μ
σ_X̄ = (σ/√N) · √( (Np − N) / (Np − 1) )
If the population is infinite or if sampling is with replacement, then
(*)   μ_X̄ = μ,   σ_X̄ = σ/√N
For large sample sizes (N ≥ 30), the sampling distribution of means is approximately a normal distribution with mean μ_X̄ and variance σ_X̄², irrespective of the population (so long as the population mean and variance are finite and the population size is at least twice the sample size).
This result for an infinite population is a special case of the central limit theorem, which shows that the accuracy of the
approximation improves as N gets larger (sampling distribution is asymptotically normal).
Let X1, X2, X3, ..., Xn be a sequence of n independent and identically distributed random variables, each having finite expectation µ and variance σ². The central limit theorem states that as the sample size n increases, the distribution of the sample average of these random variables approaches the normal distribution with mean µ and variance σ²/n, irrespective of the shape of the original distribution.
Let Sn be the sum of the n random variables, Sn = X1 + ... + Xn. Then, defining a new random variable
Zn = (Sn − nμ) / (σ√n),
the distribution of Zn converges towards the standard normal distribution N(0, 1) as n → ∞.
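A quick simulation of the central limit theorem (a sketch): averages of n draws from a strongly skewed exponential distribution (μ = 1, σ² = 1) already look roughly normal for moderate n, with variance close to σ²/n.

```python
import random
import statistics

random.seed(0)
n, reps = 30, 20_000        # sample size and number of repeated samples

sample_means = [statistics.fmean(random.expovariate(1.0) for _ in range(n))
                for _ in range(reps)]

print(statistics.fmean(sample_means))              # close to mu = 1
print(statistics.pvariance(sample_means), 1 / n)   # close to sigma^2 / n ≈ 0.0333
```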

SAMPLING DISTRIBUTION OF PROPORTIONS


Suppose that a population is infinite and that the probability of occurrence of an event is p. For example, the population
may be all possible tosses of a fair coin in which the probability of ‘‘heads’’ is p =0.5. Consider all possible samples of
size N drawn from this population, and for each sample determine the proportion P of successes. In the case of the
coin, P would be the proportion of heads turning up in N tosses. We thus obtain a sampling distribution of
proportions whose mean and standard deviation are given by
μ_P = p
σ_P = √(pq/N)      (from (*) with μ = p, σ = √(pq))
Note that the population is binomially distributed.

SAMPLING DISTRIBUTIONS OF DIFFERENCES AND SUMS


Suppose that we are given two populations. For each sample of size N1 drawn from the first population, let us compute a statistic S1; this yields a sampling distribution for the statistic S1 with mean μ_S1 and variance σ_S1². Similarly, for each sample of size N2 drawn from the second population, let us compute a statistic S2; this yields a sampling distribution for the statistic S2 with mean μ_S2 and variance σ_S2². From all possible combinations of these samples from the two populations we can obtain a distribution of the differences, S1 − S2, which is called the sampling distribution of differences of the statistics. The mean and standard deviation of this sampling distribution are given by
μ(S1 − S2) = μ_S1 − μ_S2
σ(S1 − S2) = √( σ_S1² + σ_S2² )
provided that the samples are independent.
STATISTICAL ESTIMATION THEORY
UNBIASED ESTIMATES
If the mean of the sampling distribution of a statistic equals the corresponding population parameter, the statistic is
called an unbiased estimator of the parameter.

EFFICIENT ESTIMATES
If the sampling distributions of two statistics have the same mean (or expectation), then the statistic with the smaller
variance is called an efficient estimator of the mean, while the other statistic is called an inefficient estimator.
If we consider all possible statistics whose sampling distributions have the same mean, the one with the smallest
variance is sometimes called the most efficient, or best, estimator of this mean.

POINT ESTIMATES AND INTERVAL ESTIMATES; THEIR RELIABILITY


An estimate of a population parameter given by a single number is called a point estimate of the parameter. An estimate
of a population parameter given by two numbers between which the parameter may be considered to lie is called an
interval estimate of the parameter.
Interval estimates indicate the precision, or accuracy, of an estimate and are therefore preferable to point estimates.

CONFIDENCE-INTERVAL ESTIMATES OF POPULATION PARAMETERS


Given the sampling distribution of a statistic S with mean μ_S and variance σ_S². If the sampling distribution of S is approximately normal (which, as we have seen, is true for many statistics if the sample size N ≥ 30), we can expect to find an actual sample statistic S lying in the intervals
( μS − σ S , μS + σ S ) about 68.27% of the time
( μS − 2σ S , μS + 2σ S ) about 95.45% of the time
( μS − 3σ S , μS + 3σ S ) about 99.73% of the time
Equivalently, we can expect to find (or we can be confident of finding) μS in the intervals
( S − σ S , S + σ S ) about 68.27% of the time
( S − 2σ S , S + 2σ S ) about 95.45% of the time
( S − 3σ S , S + 3σ S ) about 99.73% of the time
Because of this, we call these respective intervals the 68.27%, 95.45%, and 99.73% confidence intervals for estimating μ_S. The end numbers of these intervals are called the 68.27%, 95.45%, and 99.73% confidence limits, or fiducial limits.
Similarly, ( S ± 1.96σ S ) and ( S ± 2.58σ S ) are the 95% and 99% (or 0.95 and 0.99) confidence limits for S. The
percentage confidence is often called the confidence level. The numbers 1.96, 2.58, etc., in the confidence limits are
called confidence coefficients, or critical values, and are denoted by zc. From confidence levels we can find confidence
coefficients, and vice versa.
Confidence level   99.73%   99%    98%    96%    95.45%   95%    90%    80%    68.27%   50%
zc                 3.00     2.58   2.33   2.05   2.00     1.96   1.645  1.28   1.00     0.6745
Confidence Intervals for Means
If the statistic S is the sample mean X̄, then the 95% and 99% confidence limits for estimating the population mean μ are given by X̄ ± 1.96 σ_X̄ and X̄ ± 2.58 σ_X̄, respectively. More generally, the confidence limits are given by X̄ ± zc · σ_X̄, where zc can be read from the table. Using the values of σ_X̄ given above, we see that the confidence limits for the population mean are given by
X̄ ± zc · σ/√N                                       for sampling either from an infinite population or with replacement from a finite population, and
X̄ ± zc · (σ/√N) · √( (Np − N)/(Np − 1) )            for sampling without replacement from a population of finite size Np.
Generally, the population standard deviation σ is unknown; thus, to obtain the above confidence limits, we use the
sample estimate s. This will prove satisfactory when N >= 30. For N < 30, the approximation is poor and small sampling
theory must be employed.
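A sketch of a large-sample 95% confidence interval for a mean, using made-up sample summary numbers:

```python
import math

N, x_bar, s = 100, 67.45, 2.93   # hypothetical sample size, mean, standard deviation
z_c = 1.96                       # 95% confidence coefficient

half_width = z_c * s / math.sqrt(N)
print(f"95% CI for the mean: {x_bar - half_width:.2f} to {x_bar + half_width:.2f}")
# roughly 66.88 to 68.02
```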
Confidence Intervals for Standard Deviations
The confidence limits for the standard deviation σ of a normally distributed population, as estimated from a sample with
standard deviation s, are given by
s ± zc · σ_s = s ± zc · σ/√(2N)
STATISTICAL DECISION THEORY
Decisions about populations made on the basis of sample information are called statistical decisions.
Assumptions about the populations involved are called statistical hypotheses.
Null Hypotheses
In many instances we formulate a statistical hypothesis for the sole purpose of rejecting or nullifying it. For example, if we
want to decide whether a given coin is loaded, we formulate the hypothesis that the coin is fair (i.e., p = 0.5, where p is the
probability of heads). Similarly, if we want to decide whether one procedure is better than another, we formulate the
hypothesis that there is no difference between the procedures (i.e., any observed differences are due merely to
fluctuations in sampling from the same population). Such hypotheses are often called null hypotheses and are denoted by
H0.
Alternative Hypotheses
Any hypothesis that differs from a given hypothesis is called an alternative hypothesis. For example, if one hypothesis is p
= 0.5, alternative hypotheses might be p = 0.7, or p > 0.5. A hypothesis alternative to the null hypothesis is denoted by H1.

TESTS OF HYPOTHESES AND SIGNIFICANCE - DECISION RULES


If we suppose that a particular hypothesis is true, but find that the results observed in a random sample differ markedly
from the results expected under the hypothesis, then we would say that the observed differences are significant and
would thus be inclined to reject the hypothesis (or at least not accept it on the basis of the evidence obtained).
Procedures that help us decide whether to accept or reject hypotheses are called tests of hypotheses, tests of significance, rules of decision, or simply decision rules.

Problem. The probability of getting between 40 and 60 heads (inclusive) in 100 tosses of a fair coin is
Σ from k = 40 to 60 of C(N, k) p^k q^(N−k), with N = 100 and p = q = 0.5. The normal approximation can be used here with
μ = Np = 100 · 0.5 = 50,  σ = √(Npq) = √(100 · 0.5 · 0.5) = 5.
On a continuous scale, between 40 and 60 heads inclusive is the same as between 39.5 and 60.5 heads. Thus 39.5 → (39.5 − 50)/5 = −2.1 and 60.5 → (60.5 − 50)/5 = 2.1, and the probability is 0.9642 (the area under the normal curve between −2.1 and 2.1).
Problem. To test the hypothesis that a coin is fair, adopt the following decision rule:
Accept the hypothesis if the number of heads in a single sample of 100 tosses is between 40 and 60 inclusive. Reject the
hypothesis otherwise.
Find the probability of rejecting the hypothesis when it is actually correct.
What conclusions would you draw if the sample of 100 tosses yielded 53 heads? 60 heads?
Solution. The probability of getting fewer than 40 or more than 60 heads is 1 − 0.9642 = 0.0358. This is the probability of rejecting the hypothesis when it is actually correct.
According to the decision rule, we would have to accept the hypothesis that the coin is fair in both cases (53 or 60 heads).
Problem. In an experiment on extrasensory perception, a subject in one room is asked to state the color (red or blue) of a
card chosen from a deck of 50 well-shuffled cards by an individual in another room. If the subject identifies 32 cards
correctly, determine whether the results are significant at the (a) 0.05 and (b) 0.01 levels.
Solution. If p is the probability of the subject choosing the color of a card correctly, then we have to decide between two
hypotheses:
H0: p = 0.5, and the subject is simply guessing.
H1: p > 0.5, and the subject has powers of extrasensory perception.
Since we are testing the subject’s extrasensory perception (p>0.5), we choose a one-tailed test. If hypothesis H0 is true,
then the mean and standard deviation of the number of cards identified correctly are given by
μ = Np = 50 · 0.5 = 25,  σ = √(Npq) = √(50 · 0.5 · 0.5) ≈ 3.54
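Completing the computation (a sketch, using the continuity correction and the one-tailed critical values 1.645 and 2.33 from the table given later):

```python
import math

N, p = 50, 0.5
mu = N * p                               # 25
sigma = math.sqrt(N * p * (1 - p))       # about 3.54

# "32 or more correct" on a continuous scale starts at 31.5 (continuity correction)
z = (31.5 - mu) / sigma
print(f"z = {z:.2f}")                    # about 1.84

# One-tailed test: significant at the 0.05 level (z > 1.645) but not at 0.01 (z < 2.33)
print("significant at 0.05:", z > 1.645)
print("significant at 0.01:", z > 2.33)
```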

TYPE I AND TYPE II ERRORS. LEVEL OF SIGNIFICANCE


• If we reject a hypothesis when it should be accepted, we say that a Type I error has been made. The
maximum probability with which we would be willing to risk a Type I error is called the level of
significance, or significance level, of the test. This probability is generally specified before any samples are
drawn so that the results obtained will not influence our choice. A significance level of 0.05 or 0.01 is customary
• If we accept a hypothesis when it should be rejected, we say that a Type II error has been made.
The only way to reduce both types of error is to increase the sample size.
TESTS INVOLVING NORMAL DISTRIBUTIONS
Suppose that under a given hypothesis the sampling distribution of a statistic S is a normal distribution with mean μ_S and variance σ_S².
We can be 95% confident that if the hypothesis is true, then the z score of an actual sample statistic S will lie between -
1.96 and 1.96. However, if on choosing a single sample at random we find that the z score of its statistic lies outside the
range -1.96 to 1.96, we would conclude that such an event could happen with a probability of only 0.05 if the given
hypothesis were true. We would then say that this z score differed significantly from what would be expected under the
hypothesis, and we would then be inclined to reject the hypothesis.
The total shaded area 0.05 is the significance level of the test. It represents the probability of our being wrong in rejecting
the hypothesis (i.e., the probability of making a Type I error).

The set of z scores outside the range -1.96 to 1.96 constitutes what is called the critical region of the hypothesis, the
region of rejection of the hypothesis, or the region of significance.
The set of z scores inside the range -1.96 to 1.96 is thus called the region of acceptance of the hypothesis, or the
region of non-significance.
On the basis of the above remarks, we can formulate the following decision rule:
• Reject the hypothesis at the 0.05 significance level if the z score of the statistic S lies outside the range -1.96 to
1.96. This is equivalent to saying that the observed sample statistic is significant at the 0.05 level.
• Accept the hypothesis otherwise (or, if desired, make no decision at all).
Because the z score plays such an important part in tests of hypotheses, it is also called a test statistic.
It should be noted that other significance levels could have been used. For example, if the 0.01 level were used, we
would replace 1.96 everywhere above with 2.58.

TWO-TAILED AND ONE-TAILED TESTS


In the above test we were interested in extreme values of the statistic S or its corresponding z score on both sides of the
mean (i.e., in both tails of the distribution). Such tests are thus called two-sided tests, or two-tailed tests.
Often, however, we may be interested only in extreme values to one side of the mean, such as when we are testing the
hypothesis that one process is better than another (which is different from testing whether one process is better or worse
than the other). Such tests are called one-sided tests, or one-tailed tests. In such cases the critical region is a region to
one side of the distribution, with area equal to the level of significance.
Level of significance                      0.10               0.05                0.01               0.005              0.002
Critical values of z (one-tailed tests)    −1.28 or 1.28      −1.645 or 1.645     −2.33 or 2.33      −2.58 or 2.58      −2.88 or 2.88
Critical values of z (two-tailed tests)    −1.645 and 1.645   −1.96 and 1.96      −2.58 and 2.58     −2.81 and 2.81     −3.08 and 3.08

P-VALUES FOR HYPOTHESES TESTS


The p-value is the probability of observing a sample statistic as extreme or more extreme than the one observed under
the assumption that the null hypothesis is true. When testing a hypothesis, state the level of significance α . Calculate
your p-value:
• if the p-value <= α , reject H0
• otherwise, do not reject H0
For testing means, using large samples (n>30), calculate the p-value as follows:
1. H0 : μ = μ 0 and H1 : μ < μ 0 , p-value = P(Z < computed test statistic),
2. H0 : μ = μ0 and H1 : μ > μ0 , p-value = P(Z > computed test statistic),
3. H0 : μ = μ0 and H1 : μ ≠ μ0 ,
p-value = P(Z < -|computed test statistic|) + P(Z > |computed test statistic|).
The computed test statistic is (x̄ − μ0) / (s/√N), where x̄ is the mean of the sample, s is the standard deviation of the sample, and μ0 is the value specified for μ in the null hypothesis. Note that if σ is unknown, it is estimated from the sample by using s. This method of testing hypotheses is equivalent to the method of finding a critical value (or values) and rejecting the null hypothesis if the computed test statistic falls in the rejection region.
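A sketch of the two-sided large-sample computation, using the standard normal CDF Φ(z) = (1 + erf(z/√2))/2 and made-up sample numbers:

```python
import math

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Hypothetical sample: N = 64, mean 10.3, s = 1.2; H0: mu = 10, H1: mu != 10
N, x_bar, s, mu0 = 64, 10.3, 1.2, 10.0
z = (x_bar - mu0) / (s / math.sqrt(N))
print(f"test statistic z = {z:.2f}")          # 2.00

p_value = phi(-abs(z)) + (1 - phi(abs(z)))
print(f"two-sided p-value = {p_value:.4f}")   # about 0.0455 -> reject H0 at alpha = 0.05
```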

P is "the probability, if the test statistic really were distributed as it would be under the null hypothesis, of observing a test
statistic [as extreme as, or more extreme than] the one actually observed."
The smaller the P value, the more strongly the test rejects the null hypothesis, that is, the hypothesis being tested.
A p-value of .05 or less rejects the null hypothesis "at the 5% level" that is, the statistical assumptions used imply that only
5% of the time would the supposed statistical process produce a finding this extreme if the null hypothesis were true.

There are several common misunderstandings about p-values.[2][3]

1. The p-value is not the probability that the null hypothesis is true (claimed to justify the "rule" of considering as
significant p-values closer to 0 (zero)).
In fact, frequentist statistics does not, and cannot, attach probabilities to hypotheses. Comparison of Bayesian and
classical approaches shows that a p-value can be very close to zero while the posterior probability of the null is very close
to unity. This is the Jeffreys-Lindley paradox.
2. The p-value is not the probability that a finding is "merely a fluke" (again, justifying the "rule" of considering small p-
values as "significant").
As the calculation of a p-value is based on the assumption that a finding is the product of chance alone, it patently cannot simultaneously be used to gauge the probability of that assumption being true. This is subtly different from the real meaning, which is that the p-value is the chance that the null hypothesis explains the result: the result might not be "merely a fluke," and be explicable by the null hypothesis with confidence equal to the p-value.
3. The p-value is not the probability of falsely rejecting the null hypothesis. This error is a version of the so-called
prosecutor's fallacy.
4. The p-value is not the probability that a replicating experiment would not yield the same conclusion.
5. 1 − (p-value) is not the probability of the alternative hypothesis being true (see (1)).
6. The significance level of the test is not determined by the p-value.
The significance level of a test is a value that should be decided upon by the agent interpreting the data before the
data are viewed, and is compared against the p-value or any other statistic calculated after the test has been performed.
7. The p-value does not indicate the size or importance of the observed effect (compare with effect size).

SPECIAL TESTS
For large samples, the sampling distributions of many statistics are normal distributions, and the above tests can be
applied to the corresponding z scores. The following special cases are just a few of the statistics of practical interest. In
each case the results hold for infinite populations or for sampling with replacement. For sampling without replacement
from finite populations, the results must be modified.
1. Means. Here S = X̄ is the sample mean; μ_S = μ_X̄ = μ is the population mean; σ_S = σ_X̄ = σ/√N (σ is the population standard deviation and N is the sample size). The z score is given by
z = (X̄ − μ) / (σ/√N).
When necessary, the sample deviation s is used to estimate σ.
2. Proportions. Here S = P, the proportion of “successes” in a sample; μ_S = μ_P = p, where p is the population proportion of successes and N is the sample size; and σ_S = σ_P = √(pq/N).
The z score is given by z = (P − p) / √(pq/N).
In case P = X/N, where X is the actual number of successes in a sample, the z score becomes z = (X − Np) / √(Npq).
That is, μ = Np, σ = √(Npq), and S = X.
SMALL SAMPLING THEORY
SMALL SAMPLES
In previous chapters we often made use of the fact that for samples of size N > 30, called large samples, the sampling
distributions of many statistics are approximately normal, the approximation becoming better with increasing N. For
samples of size N < 30, called small samples, this approximation is not good and becomes worse with decreasing N, so
that appropriate modifications must be made.

STUDENT’S T DISTRIBUTION
Let us define the statistic
t = ((X̄ − μ)/s) · √(N − 1) = (X̄ − μ) / (ŝ/√N)
If we consider samples of size N drawn from a normal (or approximately normal) population with mean μ and if for each
sample we compute t, using the sample mean X and sample standard deviation s or ŝ , the sampling distribution for t
can be obtained:
Y = Y0 / (1 + t²/(N − 1))^(N/2) = Y0 / (1 + t²/ν)^((ν+1)/2)

where Y0 is a constant depending on N such that the total area under the curve is 1, and where the constant ν = N − 1 is called the number of degrees of freedom. For large values of ν or N (certainly N ≥ 30) the distribution closely approximates the standardized normal distribution.
As done with normal distributions, we can define 95%, 99%, or other confidence intervals by using the table of the t
distribution. In this manner we can estimate within specified limits of confidence the population mean μ .

For example, if −t.975 and t.975 are the values of t for which 2.5% of the area lies in each tail of the t distribution, then the 95% confidence interval for t is
−t.975 < (X̄ − μ) √(N − 1) / s < t.975
from which we see that μ is estimated to lie in the interval
X̄ − t.975 · s/√(N − 1) < μ < X̄ + t.975 · s/√(N − 1)
with 95% confidence. Note that t.975 represents the 97.5th percentile value, while t.025 = −t.975 represents the 2.5th percentile value.
In general, we can represent confidence limits for population means by
X̄ ± tc · s/√(N − 1)
where the values tc, called critical values or confidence coefficients, depend on the level of confidence desired and on the
sample size.
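A small-sample sketch with made-up data: for N = 10 there are ν = 9 degrees of freedom, and t.975 ≈ 2.262 (read from a t table).

```python
import math

data = [9.8, 10.2, 10.4, 9.9, 10.0, 10.1, 9.7, 10.3, 10.0, 10.1]   # hypothetical sample
N = len(data)
x_bar = sum(data) / N
s = math.sqrt(sum((x - x_bar) ** 2 for x in data) / N)   # s with N in the denominator

t_c = 2.262                                  # t.975 for nu = N - 1 = 9 (from a t table)
half_width = t_c * s / math.sqrt(N - 1)
print(f"95% CI for mu: {x_bar - half_width:.3f} to {x_bar + half_width:.3f}")
```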

TESTS OF HYPOTHESES AND SIGNIFICANCE


Tests of hypotheses and significance, or decision rules, are easily extended to problems involving small samples, the only
difference being that the z score, or z statistic, is replaced by a suitable t score, or t statistic.
1. Means. To test the hypothesis H0 that a normal population has mean μ , we use the t score
(or t statistic)
t = ((X̄ − μ)/s) · √(N − 1) = (X̄ − μ) / (ŝ/√N)
where X is the mean of a sample of size N.
2. Differences of Means. Suppose that two random samples of sizes N1 and N2 are drawn from normal populations
whose standard deviations are equal.
