Probability Theory was developed from the study of games of chance by Fermat
and Pascal and is the mathematical study of randomness. This theory deals
with the possible outcomes of an event and was put onto a firm mathematical
basis by Kolmogorov.
The Kolmogorov axioms
1. $P(A) \geq 0$ for any event $A$.
2. $P(\Omega) = 1$.
3. $P\left(\bigcup_{j \in J} A_j\right) = \sum_{j \in J} P(A_j)$ if $\{A_j : j \in J\}$ is a countable set of incompatible events.
The sample space and events in probability obey the same rules as sets and
subsets in set theory. Of particular importance are the distributive laws
$$A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$$
$$A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$$
and De Morgan's laws
$$\overline{A \cup B} = \bar{A} \cap \bar{B}, \qquad \overline{A \cap B} = \bar{A} \cup \bar{B}.$$
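These identities are easy to check on a small example (a minimal sketch; the sample space and events below are arbitrary choices, not taken from the notes), using Python's built-in set operations with complements taken relative to the sample space:

import itertools

# Arbitrary illustrative sample space and events
omega = set(range(1, 11))
A, B, C = {1, 2, 3, 4}, {3, 4, 5, 6}, {4, 6, 8, 10}

def comp(S):
    """Complement of S relative to the sample space omega."""
    return omega - S

# Distributive laws
assert A | (B & C) == (A | B) & (A | C)
assert A & (B | C) == (A & B) | (A & C)

# De Morgan's laws
assert comp(A | B) == comp(A) & comp(B)
assert comp(A & B) == comp(A) | comp(B)
print("All set identities hold for this example.")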
Laws of probability
The basic laws of probability can be derived directly from set theory and the
Kolmogorov axioms. For example, for any two events A and B, we have the
addition law,
$$P(A \cup B) = P(A) + P(B) - P(A \cap B).$$
Proof
\begin{align*}
A &= A \cap \Omega \\
  &= A \cap (B \cup \bar{B}) \\
  &= (A \cap B) \cup (A \cap \bar{B}) \quad \text{by the second distributive law, so}
\end{align*}
$P(A) = P(A \cap B) + P(A \cap \bar{B})$ and similarly for $B$.
\begin{align*}
A \cup B &= (A \cup B) \cap (B \cup \bar{B}) \\
 &= (A \cap \bar{B}) \cup B \quad \text{by the first distributive law} \\
 &= (A \cap \bar{B}) \cup \left(B \cap (A \cup \bar{A})\right) \\
 &= (A \cap \bar{B}) \cup (B \cap \bar{A}) \cup (A \cap B), \text{ so} \\
P(A \cup B) &= P(A \cap \bar{B}) + P(B \cap \bar{A}) + P(A \cap B) \\
 &= P(A) - P(A \cap B) + P(B) - P(A \cap B) + P(A \cap B) \\
 &= P(A) + P(B) - P(A \cap B).
\end{align*}
The addition law can be extended to $n$ events $A_1, \ldots, A_n$ which do not necessarily satisfy
$$\bigcup_{i=1}^{n} A_i = \Omega, \qquad A_i \cap A_j = \emptyset \text{ for all } i \neq j.$$
In this case,
$$P\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{i=1}^{n} P(A_i) - \sum_{j>i=1}^{n} P(A_i \cap A_j) + \sum_{k>j>i=1}^{n} P(A_i \cap A_j \cap A_k) - \ldots$$
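As a quick numerical illustration (a sketch with arbitrary events on one roll of a fair die, not an example from the notes), both the two-event addition law and the general inclusion-exclusion formula can be verified directly:

from fractions import Fraction
from itertools import combinations

omega = set(range(1, 7))            # one roll of a fair die

def prob(S):
    """Classical probability of an event S."""
    return Fraction(len(S & omega), len(omega))

A = {1, 2, 3}    # "at most 3"
B = {2, 4, 6}    # "even"
C = {3, 6}       # "multiple of 3"

# Addition law for two events
assert prob(A | B) == prob(A) + prob(B) - prob(A & B)

# Inclusion-exclusion for three events
events = [A, B, C]
total = Fraction(0)
for r in range(1, len(events) + 1):
    sign = (-1) ** (r + 1)
    for subset in combinations(events, r):
        total += sign * prob(set.intersection(*subset))
assert total == prob(A | B | C)
print("Inclusion-exclusion verified:", total)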
The Kolmogorov axioms provide a mathematical basis for probability but don't provide a real-life interpretation. Various ways of interpreting probability in real-life situations have been proposed.
Frequentist probability: the idea comes from Venn (1876) and von Mises (1919).
Subjective probability: associated with, for example, Keynes and Ramsey.
This derives from the ideas of Jakob Bernoulli (1713), contained in the principle of insufficient reason (or principle of indifference) developed by Laplace (1812), which can be used to provide a way of assigning epistemic or subjective probabilities.
If we are ignorant of the ways an event can occur (and therefore have no reason to believe that one way will occur preferentially to another), then the event is equally likely to occur in any way. Thus, for an event $S$ in a finite sample space $\Omega$ of equally likely outcomes,
$$P(S) = \frac{|S|}{|\Omega|}.$$
Variations
The number of ordered ways of drawing $n$ cards from a pack of $N$, without replacement, is
$$V_N^n = N(N-1)\cdots(N-n+1) = \frac{N!}{(N-n)!}.$$
Note that one variation is different from another if the order in which the cards
are drawn is different.
We can also consider the case of drawing cards with replacement. In this case, the number of possible results is $VR_N^n = N^n$.
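As a small illustration (a sketch; the pack size and number of draws below are arbitrary), these counts can be computed directly in Python:

import math

N, n = 40, 3                       # e.g. a 40-card pack, drawing 3 cards
v = math.perm(N, n)                # variations without replacement: N! / (N - n)!
vr = N ** n                        # ordered draws with replacement

print("V_N^n  =", v)               # 40 * 39 * 38 = 59280
print("VR_N^n =", vr)              # 40^3 = 64000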
To simplify the problem, assume there are 365 days in a year and that the
probability of being born is the same for every day.
Let Sn be the event that at least 2 people have the same birthday.
\begin{align*}
P(S_n) &= 1 - P(\bar{S}_n) \\
&= 1 - \frac{\#\ \text{elementary events where nobody has the same birthday}}{\#\ \text{elementary events}} \\
&= 1 - \frac{\#\ \text{elementary events where nobody has the same birthday}}{365^n} \\
&= 1 - \frac{V_{365}^n}{365^n}.
\end{align*}
Permutations
If we deal all the cards in a pack of size $N$, then there are $P_N = N!$ possible deals.
If we assume that the pack contains $R_1$ cards of type 1, $R_2$ of type 2, ..., $R_k$ of type $k$, then there are
$$PR_N^{R_1,\ldots,R_k} = \frac{N!}{R_1! \cdots R_k!}$$
different deals.
Combinations
If we flip a coin N times, how many ways are there that we can get n heads
and N n tails?
$$C_N^n = \binom{N}{n} = \frac{N!}{n!\,(N-n)!}.$$
Example: The probability of winning the Primitiva
In the Primitiva, each player chooses six numbers between one and forty-nine. If these numbers all match the six winning numbers, then the player wins the first prize. What is the probability of winning?
The probability of winning is
$$\frac{1}{\binom{49}{6}} = \frac{1}{13983816}$$
or almost 1 in 14 million.
The player must match 5 of the six winning numbers and there are $C_6^5 = 6$ ways of doing this. Also, they must match the bonus ball exactly and there is $C_1^1 = 1$ way of doing this. Thus, the probability of winning the second prize is
$$\frac{6 \times 1}{13983816}$$
which is just under one in two million.
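These counts are straightforward to reproduce (a minimal sketch of the calculation described above):

from math import comb

total = comb(49, 6)                          # all possible player choices
p_first = 1 / total                          # match all six winning numbers
p_second = comb(6, 5) * comb(1, 1) / total   # match 5 winning numbers plus the bonus ball

print(total)        # 13983816
print(p_first)      # about 7.15e-08, almost 1 in 14 million
print(p_second)     # about 4.29e-07, just under 1 in 2 million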
Conditional probability
The conditional probability of an event $B$ given another event $A$, with $P(A) > 0$, is defined as
$$P(B|A) = \frac{P(A \cap B)}{P(A)}.$$
Equivalently, we have the multiplication law
$$P(A \cap B) = P(B|A)\,P(A).$$
Example 12
What is the probability of getting two cups in two draws from a Spanish pack
of cards?
Write Ci for the event that draw i is a cup for i = 1, 2. Enumerating all
the draws with two cups is not entirely trivial. However, the conditional
probabilities are easy to calculate:
$$P(C_1 \cap C_2) = P(C_2|C_1)\,P(C_1) = \frac{9}{39} \times \frac{10}{40} = \frac{3}{52}.$$
The multiplication law can be extended to more than two events. For example,
$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1)\,P(A_2|A_1)\,P(A_3|A_1 \cap A_2) \cdots P(A_n|A_1 \cap \cdots \cap A_{n-1}).$$
We can also solve the birthday problem using conditional probability. Let bi be
the birthday of student i, for i = 1, . . . , n. Then it is easiest to calculate the
probability that all birthdays are distinct
$$P(b_1 \neq b_2) = \frac{364}{365}, \qquad P(b_3 \notin \{b_1, b_2\} \mid b_1 \neq b_2) = \frac{363}{365}$$
and similarly
$$P(b_i \notin \{b_1, \ldots, b_{i-1}\} \mid b_1, \ldots, b_{i-1} \text{ all distinct}) = \frac{366 - i}{365}$$
for $i = 3, \ldots, n$.
Thus, the probability that at least two students have the same birthday is, for
n < 365,
$$1 - \frac{364}{365} \times \cdots \times \frac{366-n}{365} = 1 - \frac{365!}{365^n\,(365-n)!}.$$
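For a quick check of these numbers (a minimal sketch; the class sizes used below are arbitrary):

def p_shared_birthday(n):
    """Probability that at least two of n people share a birthday,
    assuming 365 equally likely birthdays."""
    p_distinct = 1.0
    for i in range(n):
        p_distinct *= (365 - i) / 365
    return 1 - p_distinct

print(round(p_shared_birthday(23), 4))   # about 0.5073: already better than 1/2
print(round(p_shared_birthday(50), 4))   # about 0.9704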
We can also extend the law to the case where A1, . . . , An form a partition. In
this case, we have
$$P(B) = \sum_{i=1}^{n} P(B|A_i)\,P(A_i).$$
Theorem 4
For any two events $A$ and $B$,
$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}.$$
Supposing that $A_1, \ldots, A_n$ form a partition, using the law of total probability, we can write Bayes' theorem as
$$P(A_j|B) = \frac{P(B|A_j)\,P(A_j)}{\sum_{i=1}^{n} P(B|A_i)\,P(A_i)} \qquad \text{for } j = 1, \ldots, n.$$
Example 13
The following statement of the problem was given in Marilyn vos Savant's column in Parade magazine in 1990.
Suppose you're on a game show, and you're given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what's behind the doors, opens another door, say No. 3, which has a goat. He then says to you, "Do you want to pick door No. 2?" Is it to your advantage to switch your choice?
Simulating the game
http://www.stat.sc.edu/~west/javahtml/LetsMakeaDeal.html
http://en.wikipedia.org/wiki/Monty_Hall_problem
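If the applet is unavailable, the game is easy to simulate; the following is a minimal sketch that estimates the probability of winning under each strategy:

import random

def play(switch, trials=100_000):
    """Estimate the probability of winning the car when always switching
    (switch=True) or always staying (switch=False)."""
    wins = 0
    for _ in range(trials):
        car = random.randrange(3)                 # door hiding the car
        pick = random.randrange(3)                # contestant's first pick
        # Host opens a door that is neither the pick nor the car.
        opened = next(d for d in range(3) if d != pick and d != car)
        if switch:
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += (pick == car)
    return wins / trials

print("stay  :", play(switch=False))   # about 1/3
print("switch:", play(switch=True))    # about 2/3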
Random variables may be classified as discrete, continuous, or mixed.
Discrete variables are those which take values in a discrete set, say $\{x_1, x_2, \ldots\}$. For such variables, we can define the cumulative distribution function
$$F_X(x) = P(X \le x) = \sum_{i\,:\,x_i \le x} P(X = x_i)$$
where $P(X = x)$ is the probability function or mass function.
For a discrete variable, the mode is defined to be the point, $\hat{x}$, with maximum probability, i.e. such that $P(X = \hat{x}) \ge P(X = x)$ for all $x$.
It is interesting to analyze the probability of being close to or far away from the mean of a distribution. Chebyshev's inequality provides loose bounds which are valid for any distribution with finite mean and variance.
Theorem 6
For any random variable, $X$, with finite mean, $\mu$, and variance, $\sigma^2$, and for any $k > 0$,
$$P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}.$$
Therefore, for any random variable $X$, we have, for example, that $P(\mu - 2\sigma < X < \mu + 2\sigma) \ge \frac{3}{4}$.
\begin{align*}
P(|X - \mu| \ge k\sigma) &= P\left((X - \mu)^2 \ge k^2\sigma^2\right) \\
&\le \frac{E\left[(X - \mu)^2\right]}{k^2\sigma^2} \qquad \text{by Markov's inequality} \\
&= \frac{\sigma^2}{k^2\sigma^2} = \frac{1}{k^2}.
\end{align*}
Chebyshev's inequality shows us, for example, that $P(\mu - \sqrt{2}\sigma \le X \le \mu + \sqrt{2}\sigma) \ge 0.5$ for any variable $X$.
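To see how loose the bound can be (a sketch using the normal distribution purely as an example, and scipy, neither of which is prescribed by the notes):

from scipy.stats import norm

# Chebyshev guarantees P(|X - mu| >= k*sigma) <= 1/k^2 for any distribution.
# For a normal variable the true tail probability is much smaller.
for k in (1.5, 2, 3):
    chebyshev_bound = 1 / k**2
    normal_tail = 2 * norm.sf(k)        # P(|Z| >= k) for Z ~ N(0, 1)
    print(f"k={k}: bound={chebyshev_bound:.4f}, normal tail={normal_tail:.4f}")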
Chebyshev's inequality is not very tight. For the binomial distribution, a much stronger result is available.
Theorem 7
Let $X \sim BI(n, p)$. Then, for any $\epsilon > 0$,
$$P(|X - np| > n\epsilon) \le 2e^{-2n\epsilon^2}.$$
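A quick check of the bound (a sketch; the parameters are arbitrary and scipy is used for the exact binomial probabilities):

from math import exp
from scipy.stats import binom

n, p, eps = 100, 0.3, 0.1
lo, hi = n * p - n * eps, n * p + n * eps
# Exact P(|X - np| > n*eps): subtract a tiny amount so that the cdf covers X < lo strictly.
exact = binom.cdf(lo - 1e-9, n, p) + binom.sf(hi, n, p)
bound = 2 * exp(-2 * n * eps**2)
print(f"exact={exact:.4f}, bound={bound:.4f}")    # the bound should exceed the exact tail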
Suppose that $Y$ is defined to be the number of tails observed before the first head occurs for the same coin. Then $Y$ has a geometric distribution with parameter $p$, i.e. $Y \sim GE(p)$, and
$$P(Y = y) = p(1-p)^y \quad \text{for } y = 0, 1, 2, \ldots$$
The mean and variance of $Y$ are $\frac{1-p}{p}$ and $\frac{1-p}{p^2}$ respectively.
The negative binomial distribution reduces to the geometric model for the case $r = 1$.
Theorem 8
Let $X \sim HG(N, R, n)$ and suppose that $R, N \to \infty$ with $R/N \to p$. Then
$$P(X = x) \to \binom{n}{x} p^x (1-p)^{n-x} \quad \text{for } x = 0, 1, \ldots, n.$$
In the example, $p = 500/2000 = 0.25$ and, using the binomial approximation,
$$P(X = 5) \approx \binom{20}{5} 0.25^5\, 0.75^{15} = 0.2023.$$
The exact answer, from Matlab, is 0.2024.
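These two numbers are easy to reproduce (a sketch using scipy in place of the Matlab computation mentioned above):

from scipy.stats import hypergeom, binom

N, R, n, x = 2000, 500, 20, 5
exact = hypergeom.pmf(x, N, R, n)      # X ~ HG(N, R, n): exact probability
approx = binom.pmf(x, n, R / N)        # binomial approximation with p = R/N = 0.25
print(round(exact, 4), round(approx, 4))   # approximately 0.2024 and 0.2023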
The Poisson distribution
Assume that rare events occur on average at a rate of $\lambda$ per hour. Then we can often assume that the number of rare events $X$ that occur in a time period of length $t$ has a Poisson distribution with parameter (mean and variance) $\lambda t$, i.e. $X \sim \mathcal{P}(\lambda t)$. Then
$$P(X = x) = \frac{(\lambda t)^x e^{-\lambda t}}{x!} \quad \text{for } x = 0, 1, 2, \ldots$$
Formally, the conditions for a Poisson distribution are that events occur singly, independently of one another, and at a constant average rate.
Continuous variables are those which can take values in a continuum. For a continuous variable $X$, we can still define the distribution function $F_X(x) = P(X \le x)$, but we cannot define a probability function $P(X = x)$. Instead, we have the density function
$$f_X(x) = \frac{dF_X(x)}{dx}.$$
Thus, the distribution function can be derived from the density as $F_X(x) = \int_{-\infty}^{x} f_X(u)\,du$. In a similar way, moments of continuous variables can be defined as integrals,
$$E[X] = \int_{-\infty}^{\infty} x f_X(x)\,dx,$$
and the mode is defined to be the point of maximum density.
The uniform distribution
$$f_X(x) = \frac{1}{b-a} \quad \text{for } a < x < b.$$
In this case, we write $X \sim U(a, b)$ and the mean and variance of $X$ are $\frac{a+b}{2}$ and $\frac{(b-a)^2}{12}$ respectively.
Remember that the Poisson distribution models the number of rare events occurring at rate $\lambda$ in a given time period. In this scenario, consider the distribution of the time between any two successive events. This is an exponential random variable, $Y \sim E(\lambda)$, with density function
$$f_Y(y) = \lambda e^{-\lambda y} \quad \text{for } y > 0.$$
The mean and variance of $Y$ are $\frac{1}{\lambda}$ and $\frac{1}{\lambda^2}$ respectively.
A generalization is the gamma distribution, $Y \sim G(\alpha, \lambda)$, with density function
$$f_Y(y) = \frac{\lambda^\alpha}{\Gamma(\alpha)}\, y^{\alpha-1} e^{-\lambda y} \quad \text{for } y > 0.$$
The mean and variance of $Y$ are $\frac{\alpha}{\lambda}$ and $\frac{\alpha}{\lambda^2}$ respectively.
Suppose that $X \sim N(2, 4)$. Then
\begin{align*}
P(3 < X < 4) &= P\left(\frac{3-2}{\sqrt{4}} < \frac{X-2}{\sqrt{4}} < \frac{4-2}{\sqrt{4}}\right) \\
&= P(0.5 < Z < 1) \quad \text{where } Z \sim N(0, 1) \\
&= P(Z < 1) - P(Z < 0.5) = 0.8413 - 0.6915 \\
&\approx 0.1499
\end{align*}
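This probability is easy to check numerically (a minimal sketch, assuming $X \sim N(2, 4)$ as in the calculation above, using scipy rather than tables):

from scipy.stats import norm

mu, sigma = 2, 2                    # X ~ N(2, 4), so the standard deviation is 2
p = norm.cdf(4, mu, sigma) - norm.cdf(3, mu, sigma)
print(round(p, 4))                  # approximately 0.1499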
One of the main reasons for the importance of the normal distribution is that
it can be shown to approximate many real life situations due to the central
limit theorem.
Theorem 9
Given a random sample $X_1, \ldots, X_n$ of size $n$ from some distribution, then, under certain conditions, the sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ approximately follows a normal distribution for large $n$.
http://cnx.rice.edu/content/m11186/latest/
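A simple simulation illustrates the theorem (a sketch; the exponential distribution and the sample size are arbitrary choices, and numpy is used here although it is not part of the notes):

import numpy as np

rng = np.random.default_rng(0)
n, reps = 30, 10_000

# Sample means of n exponential(1) variables; by the central limit theorem these
# should be approximately N(1, 1/n) even though the underlying distribution is skewed.
means = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
print("mean of sample means    :", round(means.mean(), 3))     # close to 1
print("variance of sample means:", round(means.var(), 4))      # close to 1/30 = 0.0333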
For a discrete random variable $X$ taking values in some subset of the non-negative integers, the probability generating function, $G_X(s)$, is defined as
$$G_X(s) = E\left[s^X\right] = \sum_{x=0}^{\infty} P(X = x)\,s^x.$$
This function has a number of useful properties:
$G(0) = P(X = 0)$ and, more generally, $P(X = x) = \frac{1}{x!}\left.\frac{d^x G(s)}{ds^x}\right|_{s=0}$.
$G(1) = 1$, $E[X] = \left.\frac{dG(s)}{ds}\right|_{s=1}$ and, more generally, the $k$th factorial moment, $E[X(X-1)\cdots(X-k+1)]$, is
$$E\left[\frac{X!}{(X-k)!}\right] = \left.\frac{d^k G(s)}{ds^k}\right|_{s=1}.$$
Example 17
Consider a negative binomial variable, $X \sim NB(r, p)$.
$$P(X = x) = \binom{r+x-1}{x}\, p^r (1-p)^x \quad \text{for } x = 0, 1, 2, \ldots$$
\begin{align*}
E\left[s^X\right] &= \sum_{x=0}^{\infty} s^x \binom{r+x-1}{x}\, p^r (1-p)^x \\
&= p^r \sum_{x=0}^{\infty} \binom{r+x-1}{x} \{(1-p)s\}^x = \left(\frac{p}{1-(1-p)s}\right)^r
\end{align*}
Similarly, the moment generating function of a random variable $X$ is
$$M_X(s) = E\left[e^{sX}\right] = E\left[\sum_{i=0}^{\infty} \frac{(sX)^i}{i!}\right]$$
so that
$$\left.\frac{d^i M_X(s)}{ds^i}\right|_{s=0} = E\left[X^i\right].$$
Consider a gamma variable, $X \sim G(\alpha, \lambda)$, with density
$$f_X(x) = \frac{\lambda^\alpha}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\lambda x} \quad \text{for } x > 0.$$
\begin{align*}
M_X(s) &= \int_0^\infty e^{sx}\, \frac{\lambda^\alpha}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\lambda x}\, dx \\
&= \int_0^\infty \frac{\lambda^\alpha}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-(\lambda - s)x}\, dx \\
&= \left(\frac{\lambda}{\lambda - s}\right)^\alpha \quad \text{for } s < \lambda
\end{align*}
$$\frac{dM}{ds} = \frac{\alpha \lambda^\alpha}{(\lambda - s)^{\alpha+1}}, \qquad \left.\frac{dM}{ds}\right|_{s=0} = \frac{\alpha}{\lambda} = E[X].$$
If $Y = g(X)$, where $g$ has a unique inverse, then the density of $Y$ is
$$f_Y(y) = f_X\left(g^{-1}(y)\right) \left|\frac{dg^{-1}(y)}{dy}\right|, \qquad \text{where } \frac{dg^{-1}(y)}{dy} = \frac{dx}{dy} = \frac{1}{\frac{dy}{dx}} = \frac{1}{g'(x)}.$$
If $g$ does not have a unique inverse, then we can divide the support of $X$ up into regions, $i$, where a unique inverse, $g_i^{-1}$, does exist and then
$$f_Y(y) = \sum_i f_X\left(g_i^{-1}(y)\right) \left|\frac{dx}{dy}\right|_i$$
where the derivative is that of the inverse function over the relevant region.
For example, if $X \sim N(0, 1)$ and $Y = X^2$, then $g(x) = x^2$ has two inverse regions, $x = \pm\sqrt{y}$, and
\begin{align*}
f_Y(y) &= 2 \cdot \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{y}{2}\right) \cdot \frac{1}{2\sqrt{y}} \quad \text{for } y > 0 \\
&= \frac{1}{\sqrt{2\pi}}\, y^{-\frac{1}{2}} \exp\left(-\frac{y}{2}\right)
\end{align*}
so that
$$Y \sim G\left(\frac{1}{2}, \frac{1}{2}\right).$$
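A quick simulation check of this result (a minimal sketch; numpy and scipy are used here as convenient tools):

import numpy as np
from scipy.stats import gamma, kstest

rng = np.random.default_rng(1)
y = rng.standard_normal(100_000) ** 2          # Y = X^2 with X ~ N(0, 1)

# Compare with G(1/2, 1/2), i.e. shape 1/2 and rate 1/2 (scale 2 in scipy's parametrization).
print(kstest(y, gamma(a=0.5, scale=2).cdf).statistic)   # small value indicates good agreement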
If $Y = a + bX$ for constants $a$ and $b$, then
$$E[Y] = a + bE[X], \qquad V[Y] = b^2 V[X],$$
so that, in particular, if we make the standardizing transformation, $Y = \frac{X - \mu_X}{\sigma_X}$, then $\mu_Y = 0$ and $\sigma_Y = 1$.
Note, however, that in general $E[g(X)] \neq g(E[X])$.
The conditional density of $Y$ given $X = x$ is
$$f_{Y|x}(y|x) = \frac{f_{X,Y}(x, y)}{f_X(x)}.$$
Two variables are said to be independent if, for all $x, y$, $f_{X,Y}(x, y) = f_X(x) f_Y(y)$ or, equivalently, if $f_{Y|x}(y|x) = f_Y(y)$ or $f_{X|y}(x|y) = f_X(x)$.
Obviously, the units of the covariance are the product of the units of $X$ and $Y$. A scale-free measure is the correlation,
$$\rho_{X,Y} = \mathrm{Corr}[X, Y] = \frac{\sigma_{X,Y}}{\sigma_X \sigma_Y},$$
which satisfies
$$-1 \le \rho_{X,Y} \le 1.$$
To see this, note that for any constants $a$ and $b$,
$$0 \le E\left[(aX - bY)^2\right] = a^2 E[X^2] - 2ab\,E[XY] + b^2 E[Y^2]$$
and the right-hand side is a quadratic in $a$ with at most one real root. Thus, its discriminant must be non-positive, so that if $b \neq 0$,
$$E[XY]^2 \le E[X^2]\,E[Y^2].$$
The discriminant is zero iff the quadratic has a real root, which occurs iff $E[(aX - bY)^2] = 0$ for some $a$ and $b$.
Theorem 12
For two variables, $X$ and $Y$,
$$E[Y] = E\left[E[Y|X]\right].$$
Proof
\begin{align*}
E\left[E[Y|X]\right] &= E\left[\int y f_{Y|X}(y|X)\, dy\right] = \int f_X(x) \int y f_{Y|X}(y|x)\, dy\, dx \\
&= \int y \int f_{Y|X}(y|x) f_X(x)\, dx\, dy \\
&= \int y \int f_{X,Y}(x, y)\, dx\, dy \\
&= \int y f_Y(y)\, dy = E[Y]
\end{align*}
A beta random variable, $X$, with parameters $\alpha$ and $\beta$ has density
$$f_X(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, x^{\alpha-1} (1-x)^{\beta-1} \quad \text{for } 0 < x < 1.$$
The mean of $X$ is $E[X] = \frac{\alpha}{\alpha+\beta}$.
For example, if $Y \sim BI(n, p)$, so that $P(Y = y) = \binom{n}{y} p^y (1-p)^{n-y}$ for $y = 0, 1, \ldots, n$, then
$$G_Y(s) = (1 - p + sp)^n.$$
Now suppose that $Y = \sum_{i=1}^{N} X_i$, where the $X_i$ are independent and identically distributed with generating function $G_X(s)$ and $N$ is a non-negative, integer-valued variable, independent of the $X_i$, with generating function $G_N(s)$. Then
\begin{align*}
G_Y(s) &= E\left[E\left[s^{\sum_{i=1}^{N} X_i} \,\middle|\, N\right]\right] \\
&= E\left[G_X(s)^N\right] \\
&= G_N(G_X(s)).
\end{align*}
This result is useful in the study of branching processes. See the course in
Stochastic Processes.
If $Y = \sum_{i=1}^{n} X_i$ where the $X_i$ are independent with moment generating functions $M_i(s)$, then
$$M_Y(s) = \prod_{i=1}^{n} M_i(s)$$
and if the variables are identically distributed with common mgf $M_X(s)$, then $M_Y(s) = M_X(s)^n$.
For example, if $X_1, \ldots, X_n \sim N(0, 1)$ are independent, then each $X_i^2 \sim G\left(\frac{1}{2}, \frac{1}{2}\right)$, so $Y = \sum_{i=1}^{n} X_i^2$ has mgf
$$M_Y(s) = \left(\frac{1/2}{1/2 - s}\right)^{n/2},$$
which is the mgf of a gamma distribution, $G\left(\frac{n}{2}, \frac{1}{2}\right)$, which is the $\chi^2_n$ density.
For any variable, $Y$, with zero mean and unit variance and such that all moments exist, the moment generating function is
$$M_Y(s) = E\left[e^{sY}\right] = 1 + \frac{s^2}{2} + o(s^2).$$
Now assume that $X_1, \ldots, X_n$ are a random sample from a distribution with mean $\mu$ and variance $\sigma^2$. Then we can define the standardized variables, $Y_i = \frac{X_i - \mu}{\sigma}$, which have mean 0 and variance 1 for $i = 1, \ldots, n$, and then
$$Z_n = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{\sum_{i=1}^{n} Y_i}{\sqrt{n}}$$
so that
$$M_{Z_n}(s) = M_Y\left(\frac{s}{\sqrt{n}}\right)^n = \left(1 + \frac{s^2}{2n} + o\left(\frac{s^2}{n}\right)\right)^n \to e^{s^2/2} \quad \text{as } n \to \infty,$$
which is the mgf of an $N(0, 1)$ variable.
To make this result valid for variables that do not necessarily possess all their moments, we can use essentially the same arguments but defining the characteristic function $C_X(s) = E\left[e^{isX}\right]$ instead of the moment generating function.
For a random sample $X_1, \ldots, X_n$ from a distribution with mean $\mu$ and variance $\sigma^2$, the sample mean $\bar{X}$ and sample variance $S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$ satisfy
$$E[\bar{X}] = \mu, \qquad V[\bar{X}] = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}, \qquad E[S^2] = \sigma^2.$$
To see the last result,
\begin{align*}
E[S^2] &= E\left[\frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2\right] \\
&= \frac{1}{n-1}\, E\left[\sum_{i=1}^{n}(X_i - \mu + \mu - \bar{X})^2\right] \\
&= \frac{1}{n-1} \sum_{i=1}^{n} E\left[(X_i - \mu)^2 + 2(X_i - \mu)(\mu - \bar{X}) + (\mu - \bar{X})^2\right] \\
&= \frac{1}{n-1}\left(n\sigma^2 - 2nE\left[(\bar{X} - \mu)^2\right] + n\,\frac{\sigma^2}{n}\right) \\
&= \frac{1}{n-1}\,(n-1)\sigma^2 = \sigma^2.
\end{align*}
A further important extension can be made if we assume that the data are
normally distributed.
Theorem 14
If $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ then we have that $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$.
Proof
We can prove this using moment generating functions. First recall that if $Z \sim N(0, 1)$, then $X = \mu + \sigma Z \sim N(\mu, \sigma^2)$, so that $M_X(s) = e^{\mu s + \frac{\sigma^2 s^2}{2}}$. Then
\begin{align*}
M_{\bar{X}}(s) &= E\left[e^{s\bar{X}}\right] \\
&= E\left[e^{\frac{s}{n}\sum_{i=1}^{n} X_i}\right] \\
&= M_X\left(\frac{s}{n}\right)^n \\
&= \left(e^{\mu\frac{s}{n} + \frac{s^2\sigma^2}{2n^2}}\right)^n \\
&= e^{\mu s + \frac{s^2\sigma^2/n}{2}},
\end{align*}
which is the mgf of an $N\left(\mu, \frac{\sigma^2}{n}\right)$ random variable.
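A simulation check of this result (a minimal sketch with arbitrary parameter choices, using numpy):

import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, reps = 2.0, 3.0, 25, 20_000

# Sample means of n observations from N(mu, sigma^2); by Theorem 14 these are
# exactly N(mu, sigma^2 / n).
xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
print(round(xbar.mean(), 3))          # close to mu = 2
print(round(xbar.var(), 3))           # close to sigma^2 / n = 9 / 25 = 0.36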