
INTRODUCTION TO PROBABILITY

THEORY AND STATISTICS


HEINRICH MATZINGER
Georgia Tech
E-mail: matzi@math.gatech.edu
October 7, 2014

Contents

1 Definition and basic properties
1.1 Events
1.2 Frequencies
1.3 Definition of probability
1.4 Direct consequences
1.5 Some inequalities
2 Conditional probability and independence
2.1 Law of total probability
2.2 Bayes rule
3 Expectation
4 Dispersion, average fluctuation and standard deviation
4.1 Matzinger's rule of thumb
5 Calculation with the Variance
5.1 Getting the big picture with the help of Matzinger's rule of thumb
6 Covariance and correlation
6.1 Correlation
7 Chebyshev's and Markov's inequalities
8 Combinatorics
9 Important discrete random variables
9.1 Bernoulli variable
9.2 Binomial random variable
9.3 Geometric random variable
10 Continuous random variables
11 Normal random variables
12 Distribution functions
13 Expectation and variance for continuous random variables
14 Central limit theorem
15 Statistical testing
15.1 Looking up probabilities for the standard normal in a table
15.2 Two sample testing
16 Statistical estimation
16.1 An example
16.2 Estimation of variance and standard deviation
16.3 Maximum Likelihood estimation
16.4 Estimation of parameter for geometric random variables
17 Linear Regression
17.1 The case where the exact linear model is known
17.2 When α and β are not known
17.3 Where the formulas for the estimates of α and β come from
17.4 Expectation and variance of β̂
17.5 How precise are our estimates
17.6 Multiple factors and/or polynomial regression
17.7 Other applications

1 Definition and basic properties

1.1 Events

Imagine that we throw a die which has 4 sides. The outcome of this experiment will be one of the four numbers: 1, 2, 3 or 4. The set of all possible outcomes in this case is:
Ω = {1, 2, 3, 4} .
Ω is called the outcome space or sample space. Before doing the experiment we don't know what the outcome will be. Each possible outcome has a certain probability to occur. This die-experiment is a random experiment.


We can use our die to make bets. Somebody might bet that the number will be even. We throw the die: if the number we see is 2 or 4 we say that the event "even" has occurred or has been observed. We can identify the event "even" with the set {2, 4}. This might seem a little bit abstract, but by identifying the event with a set, events become easier to handle: sets are well known mathematical objects, whilst events as we know them from everyday language are not.
In a similar way one might bet that the outcome is a number greater than or equal to 3. This event is realized when we observe a 3 or a 4. The event "greater or equal 3" can thus be viewed as the set {3, 4}.
Another example is the event "odd". This is the set {1, 3}.
With this way of looking at things, events are simply subsets of Ω. Take another example: a coin with a side 0 and a side 1. The outcome space or sample space in that case is:
Ω = {0, 1} .
The events are the subsets of Ω; in this case there are 4 of them:
∅, {0}, {1}, {0, 1}.
Example 1.1 It might at first seem very surprising that events can be viewed as sets. Consider for example the following sets:
the set of bicycles which belong to a Georgia Tech student, the set of all sky-scrapers in Atlanta, the set of all one dollar bills which are currently in the US.
Let us give a couple of events:
the event that after X-mas the unemployment rate is lower than now, the event that our favorite pet dies from a heart attack, the event that I go down with the flu next week.
At first, it seems that events are something very different from sets. Let us see in a real world example how mathematicians view events as sets:
Assume that we are interested in where the American economy is going to stand exactly one year from now. More specifically, we look at unemployment and inflation and wonder if they will be above or below their current level. To describe the situation which we encounter in a year from now, we introduce a two digit variable Z = XY . Let X be equal to one if unemployment is higher in a year than its current level. If it is lower, let X be equal to 0. Similarly, let Y be equal to one if inflation is higher in a year from now. If it is lower, let Y be equal to zero. The possible outcomes for Z are:
00, 01, 10, 11.
This is the situation of a random experiment, where the outcome is one of the four possible numbers: 00, 01, 10, 11. We don't know what the outcome will be. But each possibility can occur with a certain probability. Let A be the event that unemployment is higher in a year. This corresponds to the outcomes 10 and 11. We thus identify the event A with the set:
{10, 11} .
Let B be the event that inflation is higher in a year from now. This corresponds to the outcomes 01 and 11. We thus view the event B as the set:
{01, 11} .
Recall that the intersection A ∩ B of two sets A and B is the set consisting of all elements contained in both A and B. In our example, the intersection of A and B is equal to A ∩ B = {11}. Let C designate the event that unemployment goes up and that inflation goes up at the same time. This corresponds to the outcome 11. Thus, C is identified with the set {11}. In other words, C = A ∩ B. The general rule which we must remember is:
For any events A and B, if C designates the event that A and B both occur at the same time, then C = A ∩ B.
Let D be the event that unemployment or inflation will be up in a year from now. (By "or" we mean that at least one of them is up.) This corresponds to the outcomes 01, 10, 11. Thus D gets identified with the set:
D = {01, 10, 11} .
Recall that the union of two sets A and B is defined to be the set consisting of all elements which are in A or in B. We see in our example that D = A ∪ B. This is true in general. We must thus remember the following rule:
For any events A and B, if D designates the event that A or B occurs, then D = A ∪ B.

1.2 Frequencies

Assume that we have a six sided die. In this case the outcome space is
Ω = {1, 2, 3, 4, 5, 6}.
The event "even" in this case is the set
{2, 4, 6}
whilst the event "odd" is equal to
{1, 3, 5} .
Instead of throwing the die only once, we throw it several times. As a result, instead of
just a number, we get a sequence of numbers. When throwing the six-sided die I obtained
the sequence:
1, 4, 3, 5, 2, 6, 3, 4, 5, 3, . . .
When repeating the same experiment which consists in throwing the die a couple of times,
we are likely to obtain another sequence. The sequence we observe is a random sequence.

In this example we observe one 3 within the first 5 trials and three 3's occurring within the first 10 trials. We write
n_3
for the number of times we observe a 3 among the first n trials. In our example thus: for n = 5 we have n_3 = 1 whilst for n = 10 we find n_3 = 3.
Let A be an event. We denote by n_A the number of times A occurred up to time n. Take for example A to be the event "even". In the above sequence within the first 5 trials we obtained 2 even numbers. Thus for n = 5 we have that n_A = 2. Within the first 10 trials we found 4 even numbers. Thus, for n = 10 we have n_A = 4. The proportion of even numbers n_A/n for the first 5 trials is equal to 2/5 = 40%. For the first 10 trials, this proportion is 4/10 = 40%.
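In code, this counting is a one-line sum; a minimal Python sketch using the observed sequence from above (the event "even" is the set A = {2, 4, 6}):

    # Count how often the event A = "even" occurs among the first n throws.
    rolls = [1, 4, 3, 5, 2, 6, 3, 4, 5, 3]         # the sequence observed above
    A = {2, 4, 6}                                   # the event "even" as a set
    for n in (5, 10):
        n_A = sum(1 for x in rolls[:n] if x in A)   # number of times A occurred
        print(n, n_A, n_A / n)                      # 5 2 0.4 and 10 4 0.4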

1.3 Definition of probability

The basic definition of probability which we use is based on frequencies. For our definition of probability we need an assumption about the world surrounding us:
Let A designate an event. When we repeat the same random experiment independently many times we observe that on the long run the proportion of times A occurs tends to stabilize. Whenever we repeat this experiment, the proportion n_A/n on the long run tends to be the same number. A more mathematical way of formulating this is to say that n_A/n converges to a number only depending on A, as n tends to infinity. This is our basic assumption.
Assumption: As we keep repeating the same random experiment under the same conditions and such that each trial is independent of the previous ones, we find that the proportion n_A/n tends to a number which only depends on A, as n → ∞.
We are now ready to give our definition of probability:
Definition 1.1 Let A be an event. Assume that we repeat the same random experiment under exactly the same conditions independently many times. Let n_A designate the number of times the event A occurred within the n first repeats of the experiment. We define the probability of the event A to be the real number:
P (A) := lim_{n→∞} n_A/n.

Thus, P (A) designates the probability of the event A. Take for example a four-sided perfectly symmetric die. Because of symmetry, each side must have the same probability. On the long run we will see a fourth of the time a 1, a fourth of the time a 2, a fourth of the time a 3 and a fourth of the time a 4. Thus, for the symmetric die the probability of each side is 0.25.
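This stabilization is easy to watch numerically; a minimal simulation sketch (seed and checkpoints are arbitrary choices):

    import random

    random.seed(0)
    n_A = 0                                # occurrences of the event A = {1}
    for n in range(1, 100001):
        if random.randint(1, 4) == 1:      # one throw of a fair 4-sided die
            n_A += 1
        if n in (100, 10000, 100000):
            print(n, n_A / n)              # proportions settle near 0.25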

1.4 Direct consequences

From our definition of probability there are several useful facts which follow immediately:
1. For any event A, we have that P (A) ≥ 0.
2. For any event A, we have that P (A) ≤ 1.
3. Let Ω designate the state space. Then P (Ω) = 1.
Let us prove these elementary facts:
1. By definition n_A/n ≥ 0. However, the limit of a sequence which is ≥ 0 is also ≥ 0. Since P (A) is by definition equal to the limit of the sequence n_A/n, we find that P (A) ≥ 0.
2. By definition n_A ≤ n. It follows that n_A/n ≤ 1. The limit of a sequence which is always less than or equal to one must also be less than or equal to one. Thus, P (A) = lim_{n→∞} n_A/n ≤ 1.
3. By definition n_Ω = n. Thus:
P (Ω) = lim_{n→∞} n_Ω/n = lim_{n→∞} n/n = 1.

The next two theorems are essential for solving many problems:
Theorem 1.1 Let A and B be disjoint events. Then:
P (A ∪ B) = P (A) + P (B).
Proof. Let C be the event C = A ∪ B. C is the event that A or B has occurred. Because A and B are disjoint, we have that A and B cannot occur at the same time. Thus, when we count up to time n how many times C has occurred, we find that this is exactly equal to the number of times A has occurred plus the number of times B has occurred. In other words,
n_C = n_A + n_B .          (1.1)
From this it follows that:
P (C) = lim_{n→∞} n_C/n = lim_{n→∞} (n_A + n_B)/n = lim_{n→∞} (n_A/n + n_B/n).
We know that the sum of limits is equal to the limit of the sum. Applying this to the right side of the last equality above yields:
lim_{n→∞} (n_A/n + n_B/n) = lim_{n→∞} n_A/n + lim_{n→∞} n_B/n = P (A) + P (B).
This finishes the proof that
P (C) = P (A ∪ B) = P (A) + P (B).
Let us give an example which might help us to understand why equation 1.1 holds. Imagine we are using a 6-sided die. Let A be the event that we observe a 2 or a 3. Thus A = {2, 3}. Let B be the event that we observe a 1 or a 5. Thus, B = {1, 5}. The two events A and B are disjoint: it is not possible to observe A and B at the same time since A ∩ B = ∅. Assume that we throw the die 10 times and obtain the sequence of numbers:
1, 3, 4, 6, 3, 4, 2, 5, 1, 2.
We have seen the event A four times: at the second, fifth, seventh and tenth trial. The event B is observed at the first trial and at the eighth and ninth trials. C = A ∪ B = {1, 2, 3, 5} is observed at the trials number 2, 5, 7, 10 and 1, 8, 9. We thus find in this case that n_A = 4, n_B = 3 and n_C = 7, which confirms equation 1.1.
Example 1.2 Assume that we are flipping a fair coin with sides 0 and 1. Let X_i designate the number which we obtain when we flip the coin for the i-th time. Let A be the event that we observe right at the beginning the number 111. In other words:
A = {X_1 = 1, X_2 = 1, X_3 = 1}.
Let B designate the event that we observe the number 101 when we read our random sequence starting from the second trial. Thus:
B = {X_2 = 1, X_3 = 0, X_4 = 1}.
Assume that we want to calculate the probability to observe that at least one of the two events A or B holds. In other words we want to calculate the probability of the event C = A ∪ B.
Note that A and B cannot both occur at the same time. The reason is that for A to hold it is necessary that X_3 = 1, and for B to hold it is necessary that X_3 = 0. X_3 however cannot be equal at the same time to 0 and to 1. Thus, A and B are disjoint events, so we are allowed to use theorem 1.1. We find, applying theorem 1.1, that:
P (A ∪ B) = P (A) + P (B).
With a fair coin, each 3-digit binary number has the same probability. There are 8 such numbers, so each one has probability 1/8. It follows that P (A) = 1/8 and P (B) = 1/8. Thus
P (A ∪ B) = 1/8 + 1/8 = 1/4 = 25%.

The next theorem is useful for any pair of events A and B and not just disjoint events:
Theorem 1.2 Let A and B be two events. Then:
P (A ∪ B) = P (A) + P (B) − P (A ∩ B).
Proof. Let C = A ∪ B. Let D = B \ A, that is, D consists of all the elements that are in B but not in A. We have by definition that C = D ∪ A and that D and A are disjoint. Thus we can apply theorem 1.1 and find:
P (C) = P (A) + P (D)          (1.2)
Furthermore (A ∩ B) and D are disjoint, and we have B = (A ∩ B) ∪ D. We can thus apply theorem 1.1 and find that:
P (B) = P (A ∩ B) + P (D)          (1.3)
Subtracting equation 1.3 from equation 1.2 yields:
P (C) − P (B) = P (A) − P (A ∩ B).
By adding P (B) on both sides of the last equation, we find:
P (C) = P (A ∪ B) = P (A) + P (B) − P (A ∩ B).
This finishes the proof.
Problem 1.1 Let a and b designate two genes. Let the probability that a randomly picked person in the US has gene a be 20%. Let the probability for gene b be 30%. Finally, let the probability that the person has both genes at the same time be 10%. What is the probability to have at least one of the two genes?
Let us explain how we solve the above problem: Let A, resp. B, designate the event that the randomly picked person has gene a, resp. b. We know that:
P (A) = 20%
P (B) = 30%
P (A ∩ B) = 10%
The event to have at least one gene is the event A ∪ B. By theorem 1.2 we have that P (A ∪ B) = P (A) + P (B) − P (A ∩ B). Thus in our case: P (A ∪ B) = 20% + 30% − 10% = 40%. This solves the above problem.
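The arithmetic can be cross-checked by simulation; the sketch below assumes one particular joint distribution consistent with the given numbers (both genes 10%, a only 10%, b only 20%, neither 60%):

    import random

    random.seed(1)
    trials, union = 100000, 0
    for _ in range(trials):
        u = random.random()
        has_a = u < 0.20                        # P(A) = 20%
        has_b = u < 0.10 or 0.20 <= u < 0.40    # P(B) = 30%, P(A and B) = 10%
        if has_a or has_b:
            union += 1
    print(union / trials)                       # close to 0.40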
In many situations we will be considering the union of 3 or more events. The next theorem gives the formula for three events:
Theorem 1.3 Let A, B and C be three events. Then we have
P (A ∪ B ∪ C) = P (A) + P (B) + P (C) − P (A ∩ B) − P (A ∩ C) − P (B ∩ C) + P (A ∩ B ∩ C).
Proof. We already know the formula for the probability of the union of two events. So, we are going to use this formula. Let D denote the union D := B ∪ C. Then we find
A ∪ B ∪ C = A ∪ D
and hence
P (A ∪ B ∪ C) = P (A ∪ D).          (1.4)
By theorem 1.2, the right side of the last equation above is equal to:
P (A ∪ D) = P (A) + P (D) − P (A ∩ D) = P (A) + P (B ∪ C) − P (A ∩ (B ∪ C))          (1.5)
Note that by theorem 1.2 we have:
P (B ∪ C) = P (B) + P (C) − P (B ∩ C).
We have:
A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
and hence
P (A ∩ (B ∪ C)) = P ((A ∩ B) ∪ (A ∩ C)).          (1.6)
But the right side of the last equation above is the probability of the union of two events and hence theorem 1.2 applies:
P ((A ∩ B) ∪ (A ∩ C)) = P (A ∩ B) + P (A ∩ C) − P ((A ∩ B) ∩ (A ∩ C)) = P (A ∩ B) + P (A ∩ C) − P (A ∩ B ∩ C).          (1.7)
Combining now equations 1.4, 1.5, 1.6 with 1.7, we find
P (A ∪ B ∪ C) = P (A) + P (B) + P (C) − P (A ∩ B) − P (A ∩ C) − P (B ∩ C) + P (A ∩ B ∩ C).
Often it is easier to calculate the probability of a complement than the probability of the event itself. In such a situation, the following theorem is useful:
Theorem 1.4 Let A be an event and let A^c denote its complement. Then:
P (A) = 1 − P (A^c).
Proof. Note that the events A and A^c are disjoint. Furthermore, by definition A ∪ A^c = Ω. Recall that for the sample space Ω we have that P (Ω) = 1. We can thus apply theorem 1.1 and find that:
1 = P (Ω) = P (A ∪ A^c) = P (A) + P (A^c).
This implies that
P (A) = 1 − P (A^c),
which finishes the proof.

1.5 Some inequalities

Theorem 1.5 Let A and B be two events. Then:
P (A ∪ B) ≤ P (A) + P (B).
Proof. We know by theorem 1.2 that
P (A ∪ B) = P (A) + P (B) − P (A ∩ B).
Since P (A ∩ B) ≥ 0 we have that
P (A) + P (B) − P (A ∩ B) ≤ P (A) + P (B).
It follows that
P (A ∪ B) ≤ P (A) + P (B).
For several events a similar theorem holds:
Theorem 1.6 Let A_1, . . . , A_n be a collection of n events. Then
P (A_1 ∪ A_2 ∪ . . . ∪ A_n) ≤ P (A_1) + . . . + P (A_n).
Proof. By induction.
Another often used inequality is:
Theorem 1.7 Let A ⊂ B. Then:
P (A) ≤ P (B).
Proof. If A ⊂ B, then for every n we have that
n_A ≤ n_B
hence also
n_A/n ≤ n_B/n.
Thus:
lim_{n→∞} n_A/n ≤ lim_{n→∞} n_B/n.
Hence
P (A) ≤ P (B).

2 Conditional probability and independence

Imagine the following situation: in a population there are two illnesses a and b. We assume that 20% suffer from b, 15% suffer from a, whilst 10% suffer from both. Let A be the event that a person suffers from a and let B be the event that a person suffers from b. If a patient comes to a doctor and says that he suffers from illness b, how likely is he to also have illness a? (We assume that the patient has been tested for b but not yet tested for a.) We note that half the population group suffering from b also suffers from a. Hence, when the doctor meets such a patient suffering from b, there is a chance of 1 out of 2 that the person also suffers from a. This is called the conditional probability of A given B and denoted by P (A|B). The formula we used is 10%/20% = P (A ∩ B)/P (B).
Definition 2.1 Let A, B be two events. Then we define the probability of A conditional on the event B, and write P (A|B), for the number:
P (A|B) := P (A ∩ B)/P (B).
Definition 2.2 Let A, B be two events. We say that A and B are independent of each other iff
P (A ∩ B) = P (A) · P (B).
Note that A and B are independent of each other if and only if P (A|B) = P (A). In other words, A and B are independent of each other if and only if the realization of one of the events does not affect the conditional probability of the other.
Assume that we perform two random experiments independently of each other, in the sense that the two experiments do not interact. That is, the experiments have no influence on each other. Let A denote an event related to the first experiment, and let B denote an event related to the second experiment. We saw in class that in this situation the equation P (A ∩ B) = P (A) · P (B) must hold. And thus, A and B are independent in the sense of the above definition. To show this we used an argument where we simulated the two random experiments by picking marbles from two bags.
There are also many cases where events related to the same experiment are independent in the sense of the above definition. For example, for a fair die, the events A = {1, 2} and B = {2, 4, 6} are independent.
There can also be more than two independent events at a time:
Definition 2.3 Let A_1, A_2, . . . , A_n be a finite collection of events. We say that A_1, A_2, . . . , A_n are all independent of each other iff
P (∩_{i∈I} A_i) = ∏_{i∈I} P (A_i)
for every subset I ⊂ {1, 2, . . . , n}.
The next example is very important for the test on Wednesday.

Example 2.1 Assume we flip the same coin independently three times. Let the coin be biased, so that side 1 has probability 60% and side 0 has probability 40%. What is the probability to observe the number 101? (By this we mean: what is the probability to first get a 1, then a 0 and eventually, at the third trial, a 1 again?)
To solve this problem let A_1, resp. A_3, be the event that at the first, resp. third, trial we get a one. Let A_2 be the event that at the second trial we get a zero. Observing a 101 is thus equal to the event A := A_1 ∩ A_2 ∩ A_3. Because the trials are performed in an independent manner, it follows that the events A_1, A_2, A_3 are independent of each other. Thus we have that:
P (A_1 ∩ A_2 ∩ A_3) = P (A_1) · P (A_2) · P (A_3).
We have that:
P (A_1) = 60%, P (A_2) = 40%, P (A_3) = 60%.
It follows that:
P (A_1 ∩ A_2 ∩ A_3) = 60% · 40% · 60% = 0.144.
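A quick simulation confirms this value; a minimal sketch (seed and trial count arbitrary):

    import random

    random.seed(2)
    trials, hits = 100000, 0
    for _ in range(trials):
        flips = [1 if random.random() < 0.6 else 0 for _ in range(3)]
        if flips == [1, 0, 1]:           # first a 1, then a 0, then a 1
            hits += 1
    print(hits / trials)                 # close to 0.6 * 0.4 * 0.6 = 0.144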

2.1 Law of total probability

Lemma 2.1 Let A and B be two events. Then
P (A) = P (A ∩ B) + P (A ∩ B^c).          (2.1)
Furthermore, if B and B^c both have probabilities that are not equal to zero, then
P (A) = P (A|B) P (B) + P (A|B^c) P (B^c).          (2.2)
Proof. Let D be the event D := A ∩ B. Let E be the event E := A ∩ B^c. Then we have that D and E are disjoint. Furthermore, A = E ∪ D, so that by theorem 1.1 we find:
P (A) = P (E ∪ D) = P (E) + P (D).
Replacing E and D by A ∩ B^c and A ∩ B yields equation 2.1.
We can use 2.1 to find
P (A) = [P (A ∩ B)/P (B)] P (B) + [P (A ∩ B^c)/P (B^c)] P (B^c).
The right side of the last equality above is equal to
P (A|B) P (B) + P (A|B^c) P (B^c),
which finishes the proof of equation 2.2.
Let us give an example which should show that intuitively this law is very clear. Assume that in a town 90% of women are blond but only 20% of men. Assume we choose a person at random from this town. Each person is equally likely to be drawn. Let W be the event that the person is a woman and B be the event that the person is blond. The law of total probability can be written as
P (B) = P (B|W ) P (W ) + P (B|W^c) P (W^c).          (2.3)
In our case, the conditional probability of blond conditional on woman is P (B|W ) = 0.9. On the other hand, W^c is the event to draw a male and P (B|W^c) is the conditional probability to have a blond given that the person is a man. In our case, P (B|W^c) = 0.2. So, when we put the numerical values into equation 2.3, we find
P (B) = 0.9 P (W ) + 0.2 P (W^c).          (2.4)
Here P (W ) is the probability that the chosen person is a woman. This is then the percentage of women in this population. Similarly, P (W^c) is the proportion of men. In other words, equation 2.4 can be read as follows: the total proportion of blonds in the population is the weighted average between the proportion of blonds among the female and the male population.
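In code, the weighted average looks as follows; the proportion of women is not given in the text, so this sketch assumes P(W) = 0.5 just to have a concrete number:

    # Law of total probability: P(B) = P(B|W)P(W) + P(B|W^c)P(W^c).
    p_blond_given_woman = 0.9
    p_blond_given_man = 0.2
    p_woman = 0.5                       # assumed; not specified in the text

    p_blond = (p_blond_given_woman * p_woman
               + p_blond_given_man * (1 - p_woman))
    print(p_blond)                      # 0.55 under the assumed P(W)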

2.2 Bayes rule

Bayes rule is useful when one would like to calculate a conditional probability of A given B, but one is given the opposite, that is, the probability of B given A. Let us next state Bayes rule:
Lemma 2.2 Let A and B be events both having non-zero probabilities. Then
P (A|B) = P (B|A) P (A)/P (B).          (2.5)
Proof. By definition of conditional probability we have P (B|A) = P (B ∩ A)/P (A). We are now going to plug the last expression into the right side of equation 2.5. We find:
P (B|A) P (A)/P (B) = [P (A ∩ B)/P (A)] · P (A)/P (B) = P (A ∩ B)/P (B) = P (A|B),
which establishes equation 2.5.
Let us give an example. Assume that 30% of men are interested in car races, but only 10% of women are. If I know that a person is interested in car races, what is the probability that it is a man? Again we imagine that we pick a person at random in the population. Let M be the event that the person is a man and C the event that she/he is interested in car races. We know P (C|M) = 0.3 and P (C|M^c) = 0.1. Now by Bayes rule the conditional probability that the person is a man given that he/she is interested in car races is:
P (M|C) = P (C|M) P (M)/P (C).          (2.6)
We have that P (C) = P (C|M) P (M) + P (C|M^c) P (M^c), which we can plug into 2.6 to find
P (M|C) = P (C|M) P (M)/[P (C|M) P (M) + P (C|M^c) P (M^c)].
In the present numerical example, we find
P (M|C) = 0.3 P (M)/(0.3 P (M) + 0.1 P (M^c)),
where P (M) represents the proportion of men in the population, whilst P (M^c) represents the proportion of women.
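Plugging in a concrete population split makes the formula tangible; the sketch below assumes P(M) = 0.5, which is not specified in the text:

    # Bayes rule with an assumed P(M) = 0.5.
    p_c_given_m, p_c_given_w, p_m = 0.3, 0.1, 0.5

    p_c = p_c_given_m * p_m + p_c_given_w * (1 - p_m)   # total probability
    p_m_given_c = p_c_given_m * p_m / p_c               # Bayes rule
    print(p_m_given_c)                                  # 0.75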

3 Expectation

Imagine a firm which every year makes a profit. It is not known in advance what the profit of the firm is going to be. This means that the profit is random: we can assign to each possible outcome a certain probability. Assume that from year to year the probabilities for the profit of our firm do not change. Assume also that from one year to the next the profits are independent. What is the long term average yearly profit equal to?
For this let us look at a specific model. Assume the firm could make 1, 2, 3 or 4 million profit with the following probabilities:

x           1     2     3     4
P (X = x)   0.1   0.4   0.3   0.2

(The model here is not very realistic since there are only a few possible outcomes. We chose it merely to be able to illustrate our point.) Let X_i denote the profit in year i. Hence, we have that X, X_1, X_2, ... are i.i.d. random variables.
To calculate the long term average yearly profit consider the following. In 10% of the years on the long run we get 1 million. If we take a period of n years, where n is large, we thus find that in about 0.1n years we make 1 million. In 40% of the years we make 2 millions on the long run. Hence, in a period of n years, this means that in about 0.4n years we make 2 millions. This corresponds to an amount of money equal to about 0.4n times 2 millions. Similarly, for n large the money made during the years where we earned 3 millions is about 3 · 0.3n, whilst for the years where we made 4 millions we get 4 · 0.2n. The average yearly profit during this n year period is thus about
1 · 0.1 + 2 · 0.4 + 3 · 0.3 + 4 · 0.2 = 1 · P (X = 1) + 2 · P (X = 2) + 3 · P (X = 3) + 4 · P (X = 4) = 2.6.
Hence, on the long run the yearly average profit is 2.6 millions. This long term average is called expected value or expectation and is denoted by E[X]. Let us formalize this concept:
In general, if X denotes the outcome of a random experiment, then we call X a random variable.

Definition 3.1 Let us consider a random experiment with a finite number of possible outcomes, where the state space is
Ω = {x_1, x_2, . . . , x_s} .
(In the profit example above, we would have Ω = {1, 2, 3, 4}.) Let X denote the outcome of this random experiment. For x ∈ Ω, let p_x denote the probability that the outcome of our random experiment is x. That is:
p_x := P (X = x).
(In the last example above, we have for example p_1 = 0.1 and p_2 = 0.4.) We define the expected value E[X]:
E[X] := ∑_{x∈Ω} x p_x .
In other words, to calculate the expected value of a random variable, we simply multiply the probabilities with the corresponding values and then take the sum over all possible outcomes. Let us see yet another example for expectation.
Example 3.1 Let X denote the value which we obtain when we throw a fair coin with side 0 and side 1. Then we find that:
E[X] = 0.5 · 1 + 0.5 · 0 = 0.5.
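As a sketch in code, the definition is a one-line sum over the probability table (here the profit model from above):

    # Expectation: E[X] = sum of x * p_x over all possible outcomes x.
    pmf = {1: 0.1, 2: 0.4, 3: 0.3, 4: 0.2}
    expectation = sum(x * p for x, p in pmf.items())
    print(expectation)                  # 2.6 (millions)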
When we keep repeating the same random experiment independently and under the same conditions, on the long run we will see that the average value which we observe converges to the expectation. This is what we saw in the firm/profit example above. Let us formalize this. This fact is actually a theorem which is called the Law of Large Numbers. This theorem goes as follows:
Theorem 3.1 Assume we repeat the same random experiment under the same conditions independently many times. Let X_i denote the random variable which is the outcome of the i-th experiment. Then:
lim_{n→∞} (X_1 + X_2 + . . . + X_n)/n = E[X_1].          (3.1)
This simply means that on the long run, the average is going to be equal to the expectation.
Proof. Let Ω denote the state space of the random variables X_i:
Ω = {x_1, x_2, . . . , x_s} .
By regrouping the same terms together, we find:
X_1 + X_2 + . . . + X_n = x_1 n_{x_1} + x_2 n_{x_2} + . . . + x_s n_{x_s}.
(Remember that n_{x_i} denotes the number of times we observe the value x_i in the finite sequence X_1, X_2, . . . , X_n.) Thus:
lim_{n→∞} (X_1 + X_2 + . . . + X_n)/n = lim_{n→∞} (x_1 n_{x_1}/n + . . . + x_s n_{x_s}/n).
By definition
P (X_1 = x_i) = lim_{n→∞} n_{x_i}/n.
Since the limit of a sum is the sum of the limits we find
lim_{n→∞} (x_1 n_{x_1}/n + . . . + x_s n_{x_s}/n) = x_1 lim_{n→∞} n_{x_1}/n + . . . + x_s lim_{n→∞} n_{x_s}/n
= x_1 P (X = x_1) + . . . + x_s P (X = x_s) = E[X_1].

So, we can now generalize our firm profit example. Imagine for this that the profit a firm makes every month is random. Imagine also that the earnings from month to month are independent of each other and also have the same probabilities. In this case we can view the sequence of earnings, month by month, as a sequence of repeats of the same random experiment. Because of theorem 3.1, on the long run the average monthly income will be equal to the expectation.
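A minimal simulation sketch of the Law of Large Numbers for the profit model (seed and sample size arbitrary):

    import random

    random.seed(3)
    values, probs = [1, 2, 3, 4], [0.1, 0.4, 0.3, 0.2]
    profits = random.choices(values, weights=probs, k=100000)  # i.i.d. draws
    print(sum(profits) / len(profits))   # running average, close to E[X] = 2.6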
Let us next give a few useful lemmas in connection with expectation. The first lemma deals with the situation where we take an i.i.d. sequence of random outcomes X_1, X_2, X_3, . . . and multiply each one of them with a constant a. Let Y_i denote the number X_i multiplied by a: hence Y_i := a X_i. Then the long term average of the X_i's multiplied by a equals the long term average of the Y_i's. Let us state this fact in a formal way:
Lemma 3.1 Let X denote the outcome of a random experiment. (Thus X is a so-called random variable.) Let a be a real (non-random) number. Then:
E[aX] = aE[X].
Proof. Let us repeat the same experiment independently many times. Let X_i denote the outcome of the i-th trial. Let Y_i be equal to Y_i := a X_i. Then by the law of large numbers, we have that
lim_{n→∞} (Y_1 + . . . + Y_n)/n = E[Y_1] = E[aX_1].
However:
lim_{n→∞} (Y_1 + . . . + Y_n)/n = lim_{n→∞} (aX_1 + . . . + aX_n)/n = lim_{n→∞} a (X_1 + . . . + X_n)/n = a lim_{n→∞} (X_1 + . . . + X_n)/n = aE[X_1].
This proves that E[aX_1] = aE[X_1] and finishes the proof.


The next lemma is extremely important when dealing with the expectation of sums of random variables. It states that the sum of the expectations is equal to the expectation of the sum. We can think of a simple real life example which shows why this should be true. Imagine that Matzinger is the owner of two firms (wishful thinking since Matzinger is a poor professor). Let X_i denote the profit made by his first firm in year i. Let Y_i denote the profit made by his second firm in year i. We assume that for both firms the probabilities do not change from year to year and the profits are independent (from year to year). In other words, X, X_1, X_2, X_3, . . . are i.i.d. variables and so are Y, Y_1, Y_2, . . .. Let Z_i denote the total profit Matzinger makes in year i, so that Z_i = X_i + Y_i. Now obviously the long term average yearly profit of Matzinger is the long term average yearly profit from the first firm plus the long term average profit from the second firm. In mathematical writing this gives:
E[X + Y ] = E[X] + E[Y ].
As a matter of fact, E[X + Y ] denotes the long term average profit of Matzinger. On the other hand, E[X] denotes the average profit of the first firm, whilst E[Y ] denotes the average profit of the second firm.
Let us next formalize all of this:
Lemma 3.2 Let X, Y denote the outcomes of two random experiments. Then:
E[X + Y ] = E[X] + E[Y ].
Proof. Let us repeat the two random experiments independently many times. Let X_i denote the outcome of the i-th trial of the first random experiment. Let Y_i be equal to the outcome of the i-th trial of the second random experiment. For all i ∈ N, let Z_i := X_i + Y_i. Then by the law of large numbers, we have that:
lim_{n→∞} (Z_1 + . . . + Z_n)/n = E[Z_1] = E[X_1 + Y_1].
However:
lim_{n→∞} (Z_1 + . . . + Z_n)/n = lim_{n→∞} (X_1 + Y_1 + X_2 + Y_2 + . . . + X_n + Y_n)/n
= lim_{n→∞} [(X_1 + . . . + X_n) + (Y_1 + . . . + Y_n)]/n
= lim_{n→∞} (X_1 + . . . + X_n)/n + lim_{n→∞} (Y_1 + . . . + Y_n)/n = E[X_1] + E[Y_1].
This proves that E[X_1 + Y_1] = E[X_1] + E[Y_1] and finishes the proof. It is very important to note that for the above theorem we do not need X and Y to be independent of each other.

Lemma 3.3 Let X, Y denote the outcomes of two independent random experiments. Then:
E[X Y ] = E[X] E[Y ].
Proof. We assume that X takes values in a countable set Ω_X, whilst Y takes on values from the countable set Ω_Y. We have that
E[XY ] = ∑_{x∈Ω_X, y∈Ω_Y} x y P (X = x, Y = y).          (3.2)
By independence of X and Y, we have that P (X = x, Y = y) = P (X = x) P (Y = y). Plugging the last equality into 3.2, we find
E[XY ] = ∑_{x∈Ω_X, y∈Ω_Y} x y P (X = x) P (Y = y) = (∑_{x∈Ω_X} x P (X = x)) (∑_{y∈Ω_Y} y P (Y = y)) = E[X] E[Y ].
So we have proven that E[XY ] = E[X] E[Y ].

4 Dispersion, average fluctuation and standard deviation

In some problems we are only interested in the expectation of a random variable. For example, consider insurance policies for mobile telephones sold by a big phone company. Say X_i is the amount which will be paid during the coming year to the i-th customer due to his/her phone breaking down. It seems reasonable to assume that the X_i's are independent of each other. (We assume no phone viruses.) We also assume that they all follow the same random model. So, by the Law of Large Numbers we have that for n large, the average is approximately equal to the expectation:
(X_1 + X_2 + . . . + X_n)/n ≈ E[X_i].
Hence, when n is really large, there is no risk involved for the phone company: they know how much they will have to pay in total: on a per customer basis, they will have to spend an amount very close to E[X_1]. In other words, they only need one real number from the probability model for the claims: that is the expectation E[X_i]. Now, in many other applications knowing only the expected value will not be enough: we will also need a measure of the dispersion. This means that we will also want to know how much on average the variables fluctuate from their long term average E[X_1].
Let us give an example. Matzinger as a child used to walk with his mother every day on the shores of Lake Geneva. Now, there is a place where there is a scale to measure the height of the water. So, hydrologists measure the water level and then analyze this data. Assume that X_i denotes the water level on a specific day in year i. (We assume that we always measure on the same day of the year, like for example on the first of January.) For the current discussion we assume that the model does not change

over time (no global warming). We furthermore assume that from one year to the next the values are independent. Say the random model would be given as follows:

x           4     5     6     7     8     9
P (X = x)   1/6   1/6   1/6   1/6   1/6   1/6

How much does the water level fluctuate on average from year to year? Note that the long term average, that is, the expectation, is equal to
E[X_i] = 4 · (1/6) + 5 · (1/6) + 6 · (1/6) + 7 · (1/6) + 8 · (1/6) + 9 · (1/6) = 6.5.

Now, when the water level is 6 or 7, then we are 0.5 away from the long term average of 6.5. In such a year i, we will say that the fluctuation f_i is 0.5. In other words, we measure for each year i how far we are from E[X_i]. This observed fluctuation in year i is then equal to
f_i := |X_i − 6.5| = |X_i − E[X_i]|.
In our model, f_i = 0.5 happens with a probability of 1/3, that is, on the long run, in one third of the years. When the water level is either at 8 or 5, then we are 1.5 away from the long term average of 6.5. This also has a probability of 1/3. Finally, with water levels of 4 or 9, we are 2.5 away from the long term average, and again this will happen in a third of the years on the long run. So, the long term average fluctuation, if this model holds, will always tend to be about
E[f_i] = E[|X_i − E[X_i]|] = 2.5 · (1/3) + 1.5 · (1/3) + 0.5 · (1/3) = 1.5
after many years. To understand why, simply consider the fluctuations f_1, f_2, f_3, . . .. By the Law of Large Numbers applied to them we get that for n large, the average fluctuation is approximately equal to its expectation:
(f_1 + f_2 + . . . + f_n)/n ≈ E[f_i] = E[|X_i − E[X_i]|].          (4.1)
So, no matter what, after many years we will always know approximately what the average fluctuation is equal to: the expression on the right side of 4.1.

The real number
Long term average fluctuation = E[|X_i − E[X_i]|]          (4.2)
is a measure of the dispersion (around the expectation) in our model. It should be obvious why this dispersion is important: if it is small, the people of Geneva will be safe. If it is big, they will often have to deal with flooding. So, in some sense, we can view the value given in 4.2 as a measure of risk: if the dispersion is 0, then there is no risk and the random number is not random but always equal to the fixed value E[X_1]!
In modern statistics, one considers however most often a number which represents the same idea, but can be slightly different from 4.2. The number we will use most often is not the average fluctuation, but instead the square root of the average fluctuation squared. This number is called the standard deviation of a random variable. We usually denote it by σ, so
σ_X := √(E[(X − E[X])²]).
The long term average fluctuation squared of a random variable X is also called variance and will be denoted by VAR[X], so that
VAR[X] := E[(X − E[X])²].
With this definition the standard deviation is simply the square root of the variance:
σ_X = √(VAR[X]).
In most cases, σ_X and our other measure of dispersion given by E[|X − E[X]|] are almost equal.
Let us go back to our example. The variance is the average fluctuation squared. We thus get:
VAR[X_i] = E[f_i²] = 2.5² · (1/3) + 1.5² · (1/3) + 0.5² · (1/3) ≈ 2.92
and hence the standard deviation is
σ_{X_i} = √(VAR[X_i]) = √2.92 ≈ 1.7.
So, we see the average fluctuation size was E[|X_i − E[X_i]|] = 1.5 whilst the standard deviation is (only) about 13% bigger.
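The numbers above can be reproduced directly from the model; a minimal sketch:

    # Water-level model: levels 4..9, each with probability 1/6.
    levels = [4, 5, 6, 7, 8, 9]
    mean = sum(levels) / 6                              # E[X] = 6.5
    avg_fluct = sum(abs(x - mean) for x in levels) / 6  # E|X - E[X]| = 1.5
    variance = sum((x - mean) ** 2 for x in levels) / 6
    print(mean, avg_fluct, variance, variance ** 0.5)   # 6.5 1.5 2.92 1.71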

Now, the standard deviation is most often used for determining the order of magnitude of a random imprecision. So, we don't care about knowing that number absolutely exactly: instead we just want the order of magnitude. In other words, in most applications, E[|X_i − E[X_i]|] and σ_{X_i} are sufficiently close to each other that for applications it does not matter which one of the two we take! But, it will turn out that the standard deviation allows for certain calculations which the other measure of dispersion in 4.2 does not allow for. So, we will work more often with the standard deviation than with the other.

4.1 Matzinger's rule of thumb

A rule of thumb is that:
most variables most of the time take values not further than two standard deviations from their expected values. We could thus write in a loose way:
X ≈ E[X] ± 2σ_X.
To understand where this rule comes from, simply think of the following: for example, the average American household income is around $70,000. How many households make more than twice that much, that is, above $140,000? Certainly not a very large portion of the population. Now, in our case the argument is not about an average, but about the average fluctuation. Still, it is an average. So, what is true for averages should also be true for an average of fluctuations.
We will see below Chebyshev's inequality, which covers the worst possible scenario. The probability for any random variable to be further than 2 standard deviations from its expected value could be as much as 25% but never more:
P (|Z − E[Z]| ≥ 2σ_Z) ≤ 0.25.
The above inequality holds for any random variable, so it represents in some sense the worst case. This inequality will be proven in our section on Chebyshev.
For normal variables, the probability to be further than two standard deviations is much smaller: it is about 0.05. Now, we will see in the section on the central limit theorem that any sum of many independent random contributions is approximately normal as soon as they follow about the same model. Now, 0.05 is much smaller than 0.25. In real life, in many cases, one will be in between these two possibilities. This rule of thumb is extremely useful when analyzing data and trying to get the big picture!

5 Calculation with the Variance

Let X be the outcome of a random experiment. We define the variance of X to be equal to:
VAR[X] := E[(X − E[X])²].
The square root of the variance is called standard deviation:
σ_X := √(VAR[X]).
The standard deviation is a measure for the typical order of magnitude of how far away the value we get after doing the experiment once is from E[X].
Lemma 5.1 Let a be a non-random number and X the outcome of a random experiment. Then:
VAR[aX] = a² VAR[X].
Proof. We have:
VAR[aX] = E[(aX − E[aX])²] = E[(aX − aE[X])²] = E[a²(X − E[X])²] = a² E[(X − E[X])²] = a² VAR[X],
which proves that VAR[aX] = a² VAR[X].
Lemma 5.2 Let X be the outcome of a random experiment (in other words, a random variable). Then:
VAR[X] = E[X²] − (E[X])².
Proof. We have that
E[(X − E[X])²] = E[X² − 2XE[X] + E[X]²] = E[X²] − 2E[XE[X]] + E[E[X]²].          (5.1)
Now E[X] is a constant and constants can be taken out of the expectation. This implies that
E[XE[X]] = E[X]E[X] = E[X]².          (5.2)
On the other hand, the expectation of a constant is the constant itself. Thus, since E[X]² is a constant, we find:
E[E[X]²] = E[X]².          (5.3)
Using equations 5.2 and 5.3 with 5.1 we find
E[(X − E[X])²] = E[X²] − 2E[X]² + E[X]² = E[X²] − E[X]².
This finishes the proof that VAR[X] = E[X²] − E[X]².
Lemma 5.3 Let X and Y be the outcomes of two random experiments which are independent of each other. Then:
VAR[X + Y ] = VAR[X] + VAR[Y ].
Proof. We have:
VAR[X + Y ] = E[((X + Y ) − E[X + Y ])²] = E[(X + Y − E[X] − E[Y ])²]
= E[((X − E[X]) + (Y − E[Y ]))²]
= E[(X − E[X])² + 2(X − E[X])(Y − E[Y ]) + (Y − E[Y ])²]
= E[(X − E[X])²] + 2E[(X − E[X])(Y − E[Y ])] + E[(Y − E[Y ])²].
Since X and Y are independent, we have that (X − E[X]) is also independent from (Y − E[Y ]). Thus, we can use lemma 3.3, which says that the expectation of a product equals the product of the expectations in case the variables are independent. We find:
E[(X − E[X])(Y − E[Y ])] = E[X − E[X]] · E[Y − E[Y ]].
Furthermore:
E[X − E[X]] = E[X] − E[E[X]] = E[X] − E[X] = 0.
Thus
E[(X − E[X])(Y − E[Y ])] = 0.
Applying this to the above formula for VAR[X + Y ], we get:
VAR[X + Y ] = E[(X − E[X])²] + 2E[(X − E[X])(Y − E[Y ])] + E[(Y − E[Y ])²]
= E[(X − E[X])²] + E[(Y − E[Y ])²] = VAR[X] + VAR[Y ].
This finishes our proof.

5.1 Getting the big picture with the help of Matzinger's rule of thumb

We mentioned that most of the time, any random variable takes values no further than two times its standard deviation from its expectation. We can apply this and our calculation rules for the variance to understand how insurance, hedged investments, and even statistical estimation work. Let X_1, X_2, . . . be a sequence of random variables which all follow the same model and are independent of each other. Let Z be the sum of n such variables:
Z = X_1 + X_2 + . . . + X_n.
We find that
E[Z] = E[X_1 + X_2 + . . . + X_n] = E[X_1] + E[X_2] + . . . + E[X_n] = nE[X_1].
Similarly, we can use the fact that the variance of a sum of independent variables is the sum of the variances to find:
VAR[Z] = VAR[X_1 + X_2 + . . . + X_n] = VAR[X_1] + VAR[X_2] + . . . + VAR[X_n] = n VAR[X_1].
Using the last equation above with the fact that the standard deviation is the square root of the variance, we find:
σ_Z = √(VAR[Z]) = √n · √(VAR[X_1]) = √n · σ_{X_1}.
In other words: the sum of n independents has its expectation grow like n times a constant, but the standard deviation grows only like the square root of n times a constant! This is everything you need to know for understanding how insurance and other risk reducers work.
Let us see different examples of what these random numbers X_i could represent:

Say you are an insurance company specializing in providing life insurance. Let X_i be the claim in the current year of the i-th client. You have n clients, so the total claim which you as a company will have to pay is Z = X_1 + X_2 + . . . + X_n.
You buy houses which you flip and then try to sell at a profit. You have bought houses all over the US. Assuming the economy and real estate market stay very stable, we can assume that the selling prices will be independent of each other. So, let X_i represent the profit (or loss) for the i-th house which you are currently renovating. This profit or loss is random, since you don't know exactly what it will be until you sell. Assume that you currently have n houses which you are renovating. Then, Z = X_1 + . . . + X_n is your total profit or loss with the n houses you are currently holding. This is a random variable since its outcome is not known in advance.

So, again it is all based on the following two equations, which hold when the X_i's are independent and follow the same model:
σ_{X_1 + ... + X_n} = σ_{X_1} √n
E[X_1 + . . . + X_n] = nE[X_1].
So for example with n = 1000000, we get
σ_Z = 1000 σ_{X_1}
whilst
E[Z] = 1000000 E[X_1],
so σ_Z becomes negligible compared to E[Z]. So, if we think that most of the time a variable is within two standard deviations of its expectation, we find
Z ≈ 1000000 E[X_1] ± 2 · 1000 σ_{X_1},
so, compared to the order of magnitude of Z, the fluctuation becomes almost negligible!
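A quick simulation sketch showing the √n scaling (uniform X_i on [0, 1] as an arbitrary test case, so σ_{X_1} ≈ 0.289):

    import random, statistics

    random.seed(4)
    n = 10000                                   # number of summed variables
    sums = [sum(random.random() for _ in range(n)) for _ in range(200)]
    print(statistics.mean(sums))                # about n * E[X1] = 5000
    print(statistics.stdev(sums))               # about sqrt(n) * 0.289 ~ 28.9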

6 Covariance and correlation

Two random variables are dependent when their joint distribution is not simply the product of their marginal distributions. But the degree of dependence can vary from strong dependence to loose dependence. One measure of the degree of dependence of random variables is the covariance. For random variables X and Y we define the covariance as follows:
COV[X, Y ] = E[(X − E[X])(Y − E[Y ])].
Lemma:
For random variables X and Y there is also another, equivalent formula for the covariance:
COV[X, Y ] = E[XY ] − E[X]E[Y ].
Proof:
E[(X − E[X])(Y − E[Y ])] = E[XY − Y E[X] − XE[Y ] + E[X]E[Y ]]
= E[XY ] − E[X]E[Y ] − E[X]E[Y ] + E[X]E[Y ]
= E[XY ] − E[X]E[Y ].
Lemma:
For independent random variables X and Y,
COV[X, Y ] = 0.
Proof:
COV[X, Y ] = E[XY ] − E[X]E[Y ].
For independent X and Y, E[XY ] = E[X]E[Y ]. Hence COV[X, Y ] = 0.
Lemma:
COV[X, X] = VAR[X].
Proof:
COV[X, X] = E[X²] − E[X]² = VAR[X].
Lemma:
Assume that a is a constant and let X and Y be two random variables. Then
COV[X + a, Y ] = COV[X, Y ].
Proof:
COV[X + a, Y ] = E[(X + a − E[X + a])(Y − E[Y ])]
= E[Y X + Y a − Y E[X + a] − XE[Y ] − aE[Y ] + E[Y ]E[X + a]]
= E[XY ] + aE[Y ] − E[Y ]E[X + a] − E[X]E[Y ] − aE[Y ] + E[Y ]E[X + a]
= E[XY ] − E[X]E[Y ]
= COV[X, Y ].
Lemma:
Let a be a constant and let X and Y be random variables. Then
COV[aX, Y ] = a COV[X, Y ].
Proof:
COV[aX, Y ] = E[(aX − E[aX])(Y − E[Y ])]
= E[aXY − Y E[aX] − aXE[Y ] + E[aX]E[Y ]]
= aE[XY ] − aE[X]E[Y ] − aE[X]E[Y ] + aE[X]E[Y ]
= aE[XY ] − aE[X]E[Y ]
= a(E[XY ] − E[X]E[Y ])
= a COV[X, Y ].

Lemma:
For any random variables X, Y and Z we have:
COV[Z + X, Y ] = COV[Z, Y ] + COV[X, Y ].
Proof:
COV[Z + X, Y ] = E[(X + Z − E[X + Z])(Y − E[Y ])]
= E[Y X + Y Z − Y E[X + Z] − XE[Y ] − ZE[Y ] + E[X + Z]E[Y ]]
= E[Y X] + E[Y Z] − E[Y ]E[X + Z] − E[X]E[Y ] − E[Z]E[Y ] + E[X + Z]E[Y ].
Since E[A + B] = E[A] + E[B], we get
= E[Y X] + E[Y Z] − E[Y ]E[X] − E[Y ]E[Z] − E[X]E[Y ] − E[Z]E[Y ] + E[X]E[Y ] + E[Z]E[Y ]
= E[Y X] + E[Y Z] − E[X]E[Y ] − E[Z]E[Y ]
= COV[X, Y ] + COV[Z, Y ].
NOTE THAT COV[X, Y ] = COV[Y, X].

6.1 Correlation

We define the correlation as follows:
COR[X, Y ] = COV[X, Y ]/√(VAR[X] VAR[Y ]).
One can prove that the correlation is always between −1 and 1. When the variables are independent, the correlation is zero. The correlation is one when Y can be written as Y = a + bX, where a, b are constants such that b > 0. If the correlation is −1, then the variable Y can be written as Y = a + bX, where b is a negative constant and a is any constant.
An important property of the correlation is that when we multiply the variable by a positive constant, the correlation does not change: COR[aX, Y ] = COR[X, Y ] for a > 0. This implies that a change of units does not affect the correlation.
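These quantities are easy to estimate from samples; a minimal sketch (requires Python 3.10 for statistics.covariance/correlation; the model Y = 2X + noise is an arbitrary choice):

    import random, statistics

    random.seed(5)
    xs = [random.gauss(0, 1) for _ in range(50000)]
    ys = [2 * x + random.gauss(0, 1) for x in xs]     # Y depends on X

    print(statistics.covariance(xs, ys))              # close to 2
    print(statistics.correlation(xs, ys))             # close to 2/sqrt(5) ~ 0.89
    # Rescaling X by a positive constant leaves the correlation unchanged:
    print(statistics.correlation([10 * x for x in xs], ys))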

7 Chebyshev's and Markov's inequalities

Let us first explain the Markov inequality with an example.
For this, assume the dividend paid (per share) next year to be a random variable. Let the expected amount of money paid be equal to E[X] = 2 dollars. Then the probability that the dividend pays more than 100 dollars cannot be more than 2/100 = E[X]/100, since otherwise the expectation would have to be bigger than 2. In other words, for a random variable X which cannot take negative values, the probability that the random variable is bigger than a is at most E[X]/a. This is the content of the next lemma:

Lemma 7.1 Assume that a > 0 is a constant and let X be a random variable taking on only non-negative values, i.e. P (X ≥ 0) = 1. Then,
P (X ≥ a) ≤ E[X]/a.
Proof. To simplify the notation, we assume that the variable takes on only integer values. The result remains valid otherwise. We have that
E[X] = 0 · P (X = 0) + 1 · P (X = 1) + 2 · P (X = 2) + 3 · P (X = 3) + . . .          (7.1)
Note that the sum on the right side of the above equality contains only non-negative terms. If we leave out some of these terms, the value can only decrease or stay equal. We are going to keep just the terms x · P (X = x) for x greater than or equal to a. This way equation 7.1 becomes
E[X] ≥ x_a · P (X = x_a) + (x_a + 1) · P (X = x_a + 1) + (x_a + 2) · P (X = x_a + 2) + . . .          (7.2)
where x_a denotes the smallest natural number which is larger than or equal to a. Note that x_a + i ≥ a for any natural number i. With this we obtain that the right side of 7.2 is larger than or equal to
a (P (X = x_a) + P (X = x_a + 1) + P (X = x_a + 2) + . . .) = a P (X ≥ a),
and hence
E[X] ≥ a P (X ≥ a).
The last inequality above implies:
P (X ≥ a) ≤ E[X]/a.

The inequality given in the last lemma is called the Markov inequality. It is very useful: in many real world situations it is difficult to estimate all the probabilities (the probability distribution) for a random variable. However, it might be easier to estimate the expectation, since that is just one number. If we know the expectation of a random variable, we can at least get upper bounds on the probability to be far away from the expectation.
Lemma 7.2 If X is a random variable with expectation E[X] and variance VAR[X] and
a 0 is a non-random number, then
P (|X E[X]| a)

27

V AR[X]
a2

Proof.
Note that |X E[X]| a implies (X E[X])2 a2 and vice versa. Hence,
P (|X E[X]| a) = P ((X E[X])2 a2 )

(7.3)

If Y = (X E[X])2 , then Y is a non-negative variable, and


P (|X E[X]| a) = P (Y a2 ).

(7.4)

Since Y is non-negative, Markov inequality applies. Hence,


P (Y a2 )

E[Y ]
E[(X E[X])2 ]
V AR[X]
=
=
2
2
a
a
a2

Using the last chain of inequalities above with equalities 7.3 and 7.4 we find
P (|X E[X]| a)

V AR[X]
a2

Let us consider one more example. Assume the total expected claim at the end of next year for an insurance company is $1,000,000. What is the risk that the insurance company has to pay more than $5,000,000 as total claim at the end of next year? The answer goes as follows:
Let Z be the total claim at the end of next year. By the Markov inequality, we find
P (Z ≥ 5,000,000) ≤ E[Z]/5,000,000 = 1/5 = 20%.
Hence, we know that the probability to have to pay more than five millions is at most 20%. To derive this, the only information needed was the expectation of Z.
When the standard deviation is also available, one can usually get better bounds using the Chebyshev inequality. Assume in the example above that the expected total claim is as before, but let the standard deviation of the total claim be one million. Then we have
VAR[Z] = (1,000,000)².
Note that for Z to be above 5,000,000 we need Z − E[Z] to be above 4,000,000. Hence,
P (Z ≥ 5,000,000) = P (Z − E[Z] ≥ 4,000,000) ≤ P (|Z − E[Z]| ≥ 4,000,000).
Using Chebyshev, we get
P (|Z − 1,000,000| ≥ 4,000,000) ≤ VAR[Z]/(4,000,000)² = 1/16 = 0.0625.
It follows that the probability that the total claim is above five millions is less than 6.25 percent. This is a lot less than the bound we had found using Markov's inequality.
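Both bounds are easy to check against a concrete distribution; a minimal sketch using an exponential variable with E[X] = VAR[X] = 1 (an arbitrary test case):

    import random

    random.seed(6)
    xs = [random.expovariate(1.0) for _ in range(100000)]
    a = 3.0
    print(sum(1 for x in xs if x >= a) / len(xs))  # true tail, ~ exp(-3) ~ 0.05
    print(1 / a)                                   # Markov bound E[X]/a ~ 0.33
    print(1 / (a - 1) ** 2)                        # Chebyshev bound on
                                                   # P(|X - 1| >= a - 1) = 0.25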

8 Combinatorics

Theorem 8.1 Let
Ω = {x_1, x_2, . . . , x_s}
denote the state space of a random experiment. Let each possible outcome have the same probability. Let E be an event. Then,
P (E) = (number of outcomes in E)/(total number of outcomes) = |E|/s.
Proof. We know that
P (Ω) = 1.
Now
P (Ω) = P (X ∈ {x_1, . . . , x_s}) = P ({X = x_1} ∪ . . . ∪ {X = x_s}) = ∑_{t=1,...,s} P (X = x_t).
Since all the outcomes have equal probability, we have that
∑_{t=1,...,s} P (X = x_t) = s P (X = x_1).
Thus,
P (X = x_1) = 1/s.
Now if
E = {y_1, . . . , y_j},
we find that:
P (E) = P (X ∈ E) = P ({X = y_1} ∪ . . . ∪ {X = y_j}) = ∑_{i=1}^{j} P (X = y_i) = j/s,
which finishes the proof.


Next we present one of the main principles used in combinatorics:
Lemma 8.1 Let m_1, m_2, . . . , m_r denote a given finite sequence of natural numbers. Assume that we have to make a sequence of r choices. At the s-th choice, assume that we have m_s possibilities to choose from. Then the total number of possibilities is:
m_1 · m_2 · . . . · m_r.
Why this lemma holds can best be understood by thinking of a tree, where at each node reached after the first s − 1 choices we have m_s new branches.

Example 8.1 Assume we first throw a coin with a side 0 and a side 1. Then we throw a four sided die. Eventually we throw the coin again. For example we could get the number 031. How many different numbers are there which we could get? The answer is: first we have two possibilities. For the second choice we have four, and eventually we have again two. Thus, m_1 = 2, m_2 = 4, m_3 = 2. This implies that the total number of possibilities is:
m_1 · m_2 · m_3 = 2 · 4 · 2 = 16.
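The same count can be obtained by brute-force enumeration; a minimal sketch:

    from itertools import product

    # coin (2) x four-sided die (4) x coin (2): 2 * 4 * 2 = 16 outcomes
    outcomes = list(product([0, 1], [1, 2, 3, 4], [0, 1]))
    print(len(outcomes))    # 16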
Recall that the product of all natural numbers which are less than or equal to k is denoted by k!. k! is called k-factorial.
Lemma 8.2 There are
k!
possibilities to put k different objects in a linear order. Thus there are k! permutations of k elements.
To realize why the last lemma above holds we use lemma 8.1. To place k different objects in a row, we first choose the first object which we will place down. For this we have k possibilities. For the second object, there remain k − 1 objects to choose from. For the third, there are k − 2 possibilities to choose from. And so on and so forth. This then gives that the total number of possibilities is equal to k · (k − 1) · . . . · 2 · 1.
Lemma 8.3 There are
n!/(n − k)!
possibilities to pick k out of n different objects, when the order in which we pick them matters.
For the first object, we have n possibilities. For the second object we pick, we have n − 1 remaining objects to choose from. For the last object which we pick (that is, the k-th which we pick), we have n − k + 1 remaining objects to choose from. Thus the total number of possibilities is equal to
n · (n − 1) · . . . · (n − k + 1),
which is equal to
n!/(n − k)!.
The number n!/(n − k)! is also equal to the number of words of length k written with an n-letter alphabet, when we require that the words never contain the same letter twice.

Lemma 8.4 There are
n!/(k!(n − k)!)
subsets of size k in a set of size n.
The reason why the last lemma holds is the following: there are k! ways of putting a given subset of size k into different orders. Thus, there are k! times more ways to pick k elements in order than there are subsets of size k.
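These counting formulas are available in the Python standard library; a minimal sketch cross-checking them by enumeration:

    from itertools import combinations, permutations
    from math import comb, factorial, perm

    n, k = 5, 2
    print(factorial(k))                                       # orderings: 2
    print(perm(n, k), len(list(permutations(range(n), k))))   # 20 20
    print(comb(n, k), len(list(combinations(range(n), k))))   # 10 10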
Lemma 8.5 There are
2^n
subsets of any size in a set of size n.
The reason why the last lemma above holds is the following: we can identify the subsets with binary vectors with n entries. For example, let n = 5. Let the set we consider be {1, 2, 3, 4, 5}. Take the binary vector:
(1, 1, 1, 0, 0).
This vector would correspond to the subset containing the first three elements of the set, thus to the subset
{1, 2, 3}.
So, for every non-zero entry in the vector we pick the corresponding element in the set. It is clear that this correspondence between subsets of a set of size n and binary vectors of dimension n is one to one. Thus, there is the same number of subsets as there are binary vectors of length n. The total number of binary vectors of dimension n however is 2^n.

9 Important discrete random variables

9.1 Bernoulli variable

Let a coin have a side 0 and a side 1. Let p be the probability of side 1 and 1 − p be the probability of side 0. Let X designate the random number we obtain when we flip this coin. Thus, with probability p the random variable X takes on the value 1 and with probability 1 − p it takes on the value 0. The random variable X is called a Bernoulli variable with parameter p. It is named after the famous Swiss mathematician Bernoulli.
For a Bernoulli variable X with parameter p we have:
E[X] = p,
VAR[X] = p(1 − p).
Let us show this:
E[X] = 1 · p + 0 · (1 − p) = p.
For the variance we find:
VAR[X] = E[X²] − (E[X])² = 1² · p + 0² · (1 − p) − (E[X])² = p − p² = p(1 − p).

9.2 Binomial random variable

Again, let a coin have a side 0 and a side 1. Let p be the probability of side 1 and 1 − p be the probability of side 0. We toss this coin independently n times and count the number of 1s observed. The number Z of 1s observed after n coin-tosses is equal to

Z := X1 + X2 + . . . + Xn,

where Xi designates the result of the i-th toss. (Hence the Xi's are independent Bernoulli variables with parameter p.) The random variable Z is called a binomial variable with parameters p and n. For the binomial random variable we find:

E[Z] = np

VAR[Z] = np(1 − p)

For k ≤ n, we have: P(Z = k) = (n choose k) p^k (1 − p)^(n−k).

Let us show the above statements:

E[Z] = E[X1 + . . . + Xn] = E[X1] + . . . + E[Xn] = n · E[X1] = n · p.
Also:

VAR[Z] = VAR[X1 + . . . + Xn] = VAR[X1] + . . . + VAR[Xn] = n · VAR[X1] = np(1 − p).

Let us next calculate the probability P(Z = k). We start with an example. Take n = 3 and k = 2. We want to calculate the probability to observe exactly two 1s among the first three coin tosses. To observe exactly two 1s out of three successive trials there are exactly three possibilities:

Let A be the event: X1 = 1, X2 = 1, X3 = 0.
Let B be the event: X1 = 1, X2 = 0, X3 = 1.
Let C be the event: X1 = 0, X2 = 1, X3 = 1.

Each of these possibilities has probability p^2(1 − p). As a matter of fact, since the trials are independent we have, for example:

P(X1 = 1, X2 = 1, X3 = 0) = P(X1 = 1)P(X2 = 1)P(X3 = 0) = p^2(1 − p).

The three different possibilities are pairwise disjoint. Thus,

P(Z = 2) = P(A ∪ B ∪ C) = P(A) + P(B) + P(C) = 3p^2(1 − p).

Here 3 is the number of realizations where we have exactly two 1s within the first three coin tosses. This is equal to the number of different ways there are to choose two objects out of three items. In other words, the number three stands in our formula for 3 choose 2.

We can now generalize to n trials and a number k ≤ n. There are n choose k possible outcomes for which among the first n coin tosses there appear exactly k 1s. Each of these outcomes has probability:

p^k (1 − p)^(n−k).

This gives then:

P(Z = k) = (n choose k) p^k (1 − p)^(n−k).
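The pmf is easy to evaluate directly; below is a small Python sketch (added here for illustration) that computes P(Z = k) with math.comb and checks that the probabilities sum to 1 for an arbitrary choice of n and p.

```python
from math import comb

def binomial_pmf(k, n, p):
    # P(Z = k) = (n choose k) p^k (1-p)^(n-k)
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.4                       # illustrative values
print(binomial_pmf(2, 3, 0.5))       # the worked example: 3 * 0.25 * 0.5 = 0.375
print(sum(binomial_pmf(k, n, p) for k in range(n + 1)))  # should be 1.0
```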

9.3 Geometric random variable
Again, let a coin have a side 0 and a side 1. Let p be the probability of side 1 and 1 − p be the probability of side 0. We toss this coin independently over and over. Let Xi designate the result of the i-th coin-toss. Let T designate the number of trials it takes until we first observe a 1. For example, if we have

X1 = 0, X2 = 0, X3 = 1,

we have that T = 3. If we observe on the other hand

X1 = 0, X2 = 1,

we have that T = 2. T is a random variable. As we are going to show, we have:

For k > 0: P(T = k) = p(1 − p)^(k−1).

E[T] = 1/p

VAR[T] = (1 − p)/p^2

A random variable T for which P(T = k) = p(1 − p)^(k−1), k ∈ N, is called a geometric random variable with parameter p. Let us next prove the above statements. For T to be equal to k we need to observe k − 1 times a zero, followed by a one. Thus:

P(T = k) = P(X1 = 0, X2 = 0, . . . , X(k−1) = 0, Xk = 1) = P(X1 = 0) · P(X2 = 0) · · · P(X(k−1) = 0) · P(Xk = 1) = (1 − p)^(k−1) p.
Let us calculate the expectation of T. We find:

E[T] = Σ_{k=1}^∞ k p (1 − p)^(k−1).

Let f(x) be the function:

x → f(x) = Σ_{k=1}^∞ k x^(k−1).

We have that

f(x) = Σ_{k=1}^∞ d(x^k)/dx = d( Σ_{k=1}^∞ x^k )/dx.

This shows that:

f(x) = d( x/(1 − x) )/dx = 1/(1 − x) + x/(1 − x)^2 = 1/(1 − x)^2.   (9.1)

Thus,

Σ_{k=1}^∞ k(1 − p)^(k−1) = f(1 − p) = 1/p^2,

and hence

E[T] = p Σ_{k=1}^∞ k(1 − p)^(k−1) = p · 1/p^2 = 1/p.
Let us next calculate the variance of a geometric random variable. We find:

E[T^2] = Σ_{k=1}^∞ k^2 p (1 − p)^(k−1).

Let g(.) be the map:

x → g(x) = Σ_{k=1}^∞ k^2 x^(k−1).

We find:

g(x) = Σ_{k=1}^∞ k · d(x^k)/dx = d( Σ_{k=1}^∞ k x^k )/dx = d( x Σ_{k=1}^∞ k x^(k−1) )/dx.

Using equation 9.1, we find:

g(x) = d( x/(1 − x)^2 )/dx = (1 + x)/(1 − x)^3.

This implies that

E[T^2] = p · g(1 − p) = (2 − p)/p^2.

Now,

VAR[T] = E[T^2] − (E[T])^2 = (2 − p)/p^2 − (1/p)^2 = (1 − p)/p^2.
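A simulation makes these formulas tangible; the Python sketch below (an addition, with p = 0.25 picked arbitrarily) repeatedly counts trials until the first 1 and compares the empirical mean and variance to 1/p and (1 − p)/p^2.

```python
import random

p = 0.25        # arbitrary illustrative parameter
N = 100_000     # number of simulated geometric experiments

def one_T(p):
    # count coin tosses until the first 1 shows up
    k = 1
    while random.random() >= p:
        k += 1
    return k

samples = [one_T(p) for _ in range(N)]
mean = sum(samples) / N
var = sum((t - mean) ** 2 for t in samples) / N

print(mean, 1 / p)             # should be close to E[T] = 1/p = 4
print(var, (1 - p) / p ** 2)   # should be close to VAR[T] = 12
```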
10 Continuous random variables

So far we have only been studying discrete random variables. Let us see how continuous random variables are defined.
Definition 10.1 Let X be a number generated by a random experiment. (Such a random number is also called a random variable.) X is a continuous random variable if there exists a non-negative piecewise continuous function

f : R → R+, x → f(x),

such that for any interval I = [i1, i2] ⊂ R we have that:

P(X ∈ I) = ∫_I f(x) dx.

The function f(.) is called the density function of X or simply the density of X.

Note that the notation ∫_I f(x) dx stands for:

∫_I f(x) dx = ∫_{i1}^{i2} f(x) dx.   (10.1)
Recall also that integrals like the one appearing in equation 10.1 are defined to be equal to the area under the curve f(.) and above the interval I.
Remark 10.1 Let f(.) be a piecewise continuous function from R into R. Then there exists a continuous random variable X such that f(.) is the density of X, if and only if both of the following conditions are satisfied:

1. f is everywhere non-negative.

2. ∫_R f(x) dx = 1.
Let us next give some important examples of continuous random variables:

• The uniform variable on the interval I = [i1, i2], where i1 < i2. The density f(.) is equal to 1/(i2 − i1) everywhere in the interval I. Anywhere outside the interval I, f(.) is equal to zero.

• The standard normal variable has density:

f(x) := (1/√(2π)) e^(−x²/2).

A standard normal random variable is often denoted by N(0, 1).

• Let μ ∈ R, σ > 0 be given numbers. The density of the normal variable with expectation μ and standard deviation σ is defined to be equal to:

f(x) := (1/(σ√(2π))) e^(−(x−μ)²/(2σ²)).
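One can verify condition 2 of Remark 10.1 numerically for these densities; here is a small Python sketch (an added illustration) that integrates the standard normal density on a wide grid with a simple Riemann sum.

```python
from math import exp, pi, sqrt

def std_normal_density(x):
    # density of N(0, 1)
    return exp(-x ** 2 / 2) / sqrt(2 * pi)

# crude Riemann sum over [-10, 10]; the tails beyond are negligible
dx = 0.001
total = sum(std_normal_density(-10 + i * dx) * dx for i in range(20_000))
print(total)  # approximately 1.0
```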
11 Normal random variables
The probability law of a normal random variable is given by the probability density:

f(x) = (1/(σ√(2π))) e^(−(x−μ)²/(2σ²)).

Hence there are two parameters which determine a normal distribution: μ and σ. We write N(μ, σ) for a normal variable with parameters μ and σ.

If we analyze the density function f(x), we see that for any value a, we have that f(μ + a) = f(μ − a). Hence the function f(.) is symmetric about the point μ. This implies that the expected value has to be μ:

E[N(μ, σ)] = μ.

One could also show this by verifying that

E[N(μ, σ)] − μ = ∫ (x − μ) (1/(σ√(2π))) e^(−(x−μ)²/(2σ²)) dx = 0.

By integration by parts, one can show that

∫ (x − μ)² (1/(σ√(2π))) e^(−(x−μ)²/(2σ²)) dx = σ²,

and hence the variance is σ²:

VAR[N(μ, σ)] = σ².

Note that the function f(x) decreases when we go away from μ: the shape is a bell shape with maximum at μ and width σ. (Go onto the Internet to see the graph of a normal density plotted.)

Let us give next a few very useful facts about normal variables:

• Let a and b be two constants such that a ≠ 0. Let X be a normal variable with parameters μX and σX. Let Y be the random variable defined by affine transformation from X in the following way: Y := aX + b. Then Y is also normal. The parameters of Y are

E[Y] = μY = aμX + b

and

σY = |a|σX.

This we obtain simply from the fact that these parameters are the expectation and standard deviation of their respective variables.

• Let X and Y be normal variables independent of each other. Let Z := X + Y. Then Z is also normal. The same result is true for sums of more than two independent normal variables.
• If X is normal, then

Z := (X − E[X]) / √(VAR[X])

is a standard normal.

For the last point above, note that for any random variable X (not necessarily normal) we have that if Z = (X − E[X])/σX, then Z has expectation zero and standard deviation 1. This is a simple, straightforward calculation:

E[Z] = E[ (X − E[X])/σX ] = (1/σX)(E[X] − E[E[X]]),   (11.1)

but since E[E[X]] = E[X], equality 11.1 implies that E[Z] = 0. Also,

VAR[Z] = VAR[ (X − E[X])/σX ] = VAR[X]/σX² = 1.

Now if X is normal, then we saw that Z = (X − E[X])/σX is also normal, since Z is just obtained from X by multiplying and adding constants. But Z has expectation 0 and standard deviation 1, and hence it is standard normal.
One can use normal variables to model financial processes and many others. Let us consider an example. Assume that a portfolio consists of three stocks. Let Xi denote the value of stock number i one year from now. We assume that the three stocks in the portfolio are all independent of each other and normally distributed, so that μi = E[Xi] and σi = √(VAR[Xi]) for i = 1, 2, 3. Let

μ1 = 100, μ2 = 110, μ3 = 120

and let

σ1 = 10, σ2 = 20, σ3 = 20.

The value of the portfolio after one year is Z = X1 + X2 + X3 and E[Z] = E[X1] + E[X2] + E[X3] = 330.

Question: What is the probability that the value of the portfolio after a year is above 360?

Answer: We have that

VAR[Z] = VAR[X1] + VAR[X2] + VAR[X3] = 100 + 400 + 400 = 900

and hence

σZ = √(VAR[Z]) = 30.

We are now going to calculate

P(Z ≥ 360).

For this we want to transform the probability into a probability involving a standard normal, since for the standard normal we have tables available. We find

P(Z ≥ 360) = P( (Z − E[Z])/σZ ≥ (360 − E[Z])/σZ ).   (11.2)

Note that

(360 − E[Z])/σZ = 1,

and also that (Z − E[Z])/σZ is standard normal. Using this in equation 11.2, we find that the probability that the portfolio after a year is above 360 is equal to

P(Z ≥ 360) = P(N(0, 1) ≥ 1) = 1 − Φ(1),

where Φ(1) = P(N(0, 1) ≤ 1) = 0.8413 can be found in a table for the standard normal.
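Instead of a table, one can evaluate Φ with the error function; the Python sketch below (an addition to the notes) redoes the portfolio computation this way.

```python
from math import erf, sqrt

def Phi(z):
    # distribution function of the standard normal, via the error function
    return 0.5 * (1 + erf(z / sqrt(2)))

mu = 100 + 110 + 120                    # E[Z] = 330
sigma = sqrt(10**2 + 20**2 + 20**2)     # sqrt(900) = 30

z = (360 - mu) / sigma                  # standardized threshold = 1
print(1 - Phi(z))                       # P(Z >= 360) = 1 - Phi(1) ≈ 0.1587
```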
12 Distribution functions
Definition 12.1 Let X be a random variable. The distribution function FX : R → R of X is defined in the following way:

FX(s) := P(X ≤ s)

for all s ∈ R.
Let us next mention a few properties of the distribution function:

• FX is an increasing function. This means that for any two numbers s < t in R, we have that FX(s) ≤ FX(t).

• lim_{s→∞} FX(s) = 1.

• lim_{s→−∞} FX(s) = 0.
We leave the proof of the three facts above to the reader.
Imagine next that X is a continuous random variable with density function fX. Then we have, for all s ∈ R, that:

FX(s) = P(X ≤ s) = ∫_{−∞}^{s} fX(t) dt.

Taking the derivative on both sides of the above equation, we find that:

dFX(s)/ds = fX(s).

In other words, for a continuous random variable X, the derivative of the distribution function is equal to the density of X. Hence, in this case, the distribution function is differentiable and thus also continuous. Another implication is: the distribution function uniquely determines the density function fX. This implies that the distribution function uniquely determines all the probabilities of events which can be defined in terms of X.
Assume next that the random variable X has a finite state space

{s1, s2, . . . , sr}

such that s1 < s2 < . . . < sr. Then the distribution function FX is a step function. Left of s1, we have that FX is equal to zero. Right of sr it is equal to one. Between si and si+1, that is on the interval [si, si+1[, the distribution function is constantly equal to:

Σ_{j≤i} P(X = sj).

(This holds for all i between 1 and r − 1.)
To sum up: for continuous random variables the distribution functions are differentiable functions, whilst for discrete random variables the distribution functions are step functions.

Let us next show how we can use the distribution function to simulate random variables. The situation is the following: our computer can generate a uniform random variable U in the interval [0, 1]. (This is a random variable with density equal to 1 in [0, 1] and 0 everywhere else.) We want to generate a random variable with a given probability density function fX, using U. We do this in the following manner: we plug the random number U into the map invFX. (Here invFX designates the inverse map of FX(.).) The next lemma says that this method really produces a random variable with the desired density function.

Lemma 12.1 Let fX denote the density function of a continuous random variable and let FX designate its distribution function. Let Y designate the random variable obtained by plugging the uniform random variable U into the inverse distribution function:

Y := invFX(U).

Then the density of Y is equal to fX.
Proof. FX(.) is an increasing function. Thus for any number s we have that

Y ≤ s

is equivalent to

FX(Y) ≤ FX(s).

Hence:

P(Y ≤ s) = P(FX(Y) ≤ FX(s)).

Now, FX(Y) = U, thus

P(Y ≤ s) = P(U ≤ FX(s)).

We know that FX(s) ∈ [0, 1]. Using the fact that U has density function equal to one in the interval [0, 1], we find:

P(U ≤ FX(s)) = ∫_0^{FX(s)} 1 dt = FX(s).

Thus

P(Y ≤ s) = FX(s).

This shows that the distribution function FY of Y is equal to FX. Taking the derivative with respect to s of both FY(s) and FX(s) yields:

fY(s) = fX(s).

Hence, X and Y have the same density function. This finishes the proof.
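Lemma 12.1 is exactly the inverse-transform sampling method. As an added illustration, the sketch below samples an exponential variable with rate λ, whose distribution function F(x) = 1 − e^(−λx) has the explicit inverse invF(u) = −ln(1 − u)/λ; the rate λ = 2 is an arbitrary choice.

```python
import random
from math import log

lam = 2.0       # rate of the exponential distribution (illustrative choice)
N = 100_000

def inv_F(u):
    # inverse of F(x) = 1 - exp(-lam * x)
    return -log(1 - u) / lam

samples = [inv_F(random.random()) for _ in range(N)]

# For an exponential with rate lam, the mean is 1/lam.
print(sum(samples) / N, 1 / lam)
```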
13 Expectation and variance for continuous random variables
Definition 13.1 Let X be a continuous random variable with density function fX(.). Then we define the expectation E[X] of X to be:

E[X] := ∫_{−∞}^{∞} s fX(s) ds.
Next we are going to prove that the law of large numbers also holds for continuous random variables.

Theorem 13.1 Let X1, X2, . . . be a sequence of i.i.d. continuous random variables, all with the same density function fX(.). Then,

lim_{n→∞} (X1 + X2 + . . . + Xn)/n = E[X1].
Proof. Let δ > 0 be a fixed number. Let us approximate the continuous variables Xi by discrete variables Xi^δ. For this we let Xi^δ be the largest integer multiple of δ which is still smaller than or equal to Xi. In this way, we always get that

|Xi − Xi^δ| < δ.

This implies that:

| (X1 + . . . + Xn)/n − (X1^δ + . . . + Xn^δ)/n | < δ.

However, the variables Xi^δ are discrete. So for them the law of large numbers has already been proven, and we find:

lim_{n→∞} (X1^δ + . . . + Xn^δ)/n = E[X1^δ].

We have that

E[Xi^δ] = Σ_{z∈Z} zδ · P(Xi^δ = zδ).   (13.1)

However, by definition:

P(Xi^δ = zδ) = P(Xi ∈ [zδ, (z + 1)δ[).

The expression on the right side of the last equality is equal to

∫_{zδ}^{(z+1)δ} fX(s) ds.

Thus

E[Xi^δ] = Σ_{z∈Z} zδ ∫_{zδ}^{(z+1)δ} fX(s) ds.

As δ tends to zero, this expression tends to:

∫ s fX(s) ds.

This implies that by taking δ fixed and sufficiently small, we have that, for large enough n, the fraction

(X1 + . . . + Xn)/n

is as close as we want to

∫ s fX(s) ds.

This implies that

(X1 + . . . + Xn)/n

actually converges to

∫ s fX(s) ds.
The linearity of expectation holds in the same way as for discrete random variables. This is the content of the next lemma.

Lemma 13.1 Let X and Y be two continuous random variables and let a be a number. Then

E[X + Y] = E[X] + E[Y]

and

E[aX] = aE[X].

Proof. The proof goes like in the discrete case: the only thing used for the proof in the discrete case is the law of large numbers. Since the law of large numbers also holds in the continuous case, exactly the same proof holds for the continuous case.
14 Central limit theorem
The Central Limit Theorem (CLT) is one of the most important theorems in probability. Roughly speaking, it says that if we build the sum of many independent random variables, no matter what these little contributions are, we will always get approximately a normal distribution. This is very important in everyday life, because you often have situations where a lot of little independent things add up, so you end up observing something which is approximately a normal random variable. For example, when you make a measurement you are most of the time in this situation, provided you don't make one big measurement error. In that case, you have a lot of little imprecisions which add up to give you your measurement error. Most of the time, these imprecisions can be seen as close to independent of each other. This then implies: unless you make one big error, you will always end up having your measurement error close to a normal variable.
Let X1, X2, X3, . . . be a sequence of independent, identically distributed random variables. (This means that they are the outcome of the same random experiment repeated several times independently.) Let μ denote the expectation μ := E[X1] and let σ denote the standard deviation σ := √(VAR[X1]). Let Z denote the sum

Z := X1 + X2 + X3 + . . . + Xn.

Then, by the calculation rules we learned for expectation and variance, it follows that:

E[Z] = nμ

and the standard deviation σZ of Z is equal to:

σZ = √n σ.

When you subtract from a random variable its mean and divide by the standard deviation, you always get a new variable with zero expectation and variance equal to one. Thus the standardized sum

(Z − nμ)/(√n σ)

has expectation zero and standard deviation 1. The central limit theorem says that, on top of this, for large n the expression

(Z − nμ)/(√n σ)

is close to being a standard normal variable. Let us now formulate the central limit theorem:
Theorem 14.1 Let

X1, X2, X3, . . .

be a sequence of independent, identically distributed random variables. Then we have that for large n, the normalized sum Y,

Y := (X1 + . . . + Xn − nμ)/(√n σ),

is close to being a standard normal random variable.
This version of the Central Limit Theorem is not yet very precise. As a matter of fact, what does "close to being a standard normal random variable" mean? We certainly understand what it means for two points to be close to each other. But we have not yet discussed the concept of closeness for random variables. Let us do this by using the example of a six-sided die. Let us assume that we have a six-sided die which is not perfectly symmetric. For i ∈ {1, 2, . . . , 6}, let pi denote the probability of side i:

P(X = i) = pi,

where X denotes the number which we get when we throw this die once. A perfectly symmetric die would have the probabilities pi all equal to 1/6. Say our die is not exactly symmetric, but close to a perfectly symmetric die. What does this mean? It means that for all i ∈ {1, 2, . . . , 6} we have that pi is close to 1/6.

For the die example we have a finite number of outcomes. For a continuous random variable, on the other hand, we are interested in the probabilities of intervals. By this I mean that we are interested, for a given interval I, in the probability that the random experiment gives a result in I. If X denotes our continuous random variable, this means that we are interested in probabilities of the type:

P(X ∈ I).

We are now ready to explain what we mean by: two continuous random variables X and Y have their probability laws close to each other. By "X and Y are close" (have probability laws which are close to each other) we mean: for each interval I, the real number P(Y ∈ I) is close to the real number P(X ∈ I). For the interval I = [i1, i2] with i1 < i2, we have that

P(X ∈ I) = P(X ≤ i2) − P(X < i1).

It follows that if we know all the probabilities for semi-infinite intervals, we can determine the probabilities of type P(X ∈ I). Thus, for two continuous random variables X and Y to be close to each other (with respect to their probability law), it is enough to ask that for all x ∈ R the real number P(X ≤ x) is close to the real number P(Y ≤ x).
Now that we have clarified the concept of closeness in distribution for continuous random variables, we are ready to formulate the CLT in a more precise way. Saying that

Z := (X1 + . . . + Xn − nμ)/(σ√n)

is close to a standard normal random variable N(0, 1) means that for every z ∈ R we have that

P(Z ≤ z)

is close to

P(N(0, 1) ≤ z).

In other words, as n goes to infinity, P(Z ≤ z) converges to P(N(0, 1) ≤ z). Let us give a more precise version of the CLT than what we have done so far:
Theorem 14.2 Let

X1, X2, X3, . . .

be a sequence of independent, identically distributed random variables. Let E[X1] = μ and √(VAR[X1]) = σ. Then, for any z ∈ R, we have that:

lim_{n→∞} P( (X1 + . . . + Xn − nμ)/(σ√n) ≤ z ) = P(N(0, 1) ≤ z).
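To see the theorem at work, the following Python sketch (an added illustration) standardizes sums of i.i.d. uniform variables on [0, 1] (for which μ = 1/2 and σ = √(1/12)) and compares P(Z ≤ 1) estimated by simulation with Φ(1) ≈ 0.8413.

```python
import random
from math import sqrt

n = 50           # number of summands per experiment
trials = 20_000  # number of simulated sums
mu, sigma = 0.5, sqrt(1 / 12)   # mean and sd of a uniform on [0, 1]

count = 0
for _ in range(trials):
    s = sum(random.random() for _ in range(n))
    z = (s - n * mu) / (sigma * sqrt(n))  # standardized sum
    if z <= 1:
        count += 1

print(count / trials)  # should be close to Phi(1) = 0.8413
```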
15 Statistical testing

Let us first give an example:
Assume that you read in the newspaper that 50% of the population in Atlanta smokes. You don't believe that number, so you start a survey. You ask 100 randomly chosen people and find that 70 out of the hundred smoke. Now you want to know if the result of your survey constitutes strong evidence against the 50% claimed by the newspaper.

If the true percentage of the population of Atlanta which smokes were 50%, you would expect to find in your survey a number closer to 50 people. However, it could be that although the true percentage is 50%, you still observe a figure as high as 70, just by chance. So the procedure is the following: determine the probability of getting 70 or more people in your survey who smoke, given that the percentage really is 50%. If that probability is very small, you decide to reject the idea that 50% of the population in Atlanta smokes. In general one takes a fixed level α > 0 and rejects the idea one wants to test if the probability is smaller than α. Most of the time statisticians work with α equal to 0.05 or 0.1. So, if the probability of getting 70 or more smokers in our survey (given that 50% of the population smokes) is smaller than α = 0.05, then statisticians will say: we reject the hypothesis that 50% of the population in Atlanta smokes. We do this on the confidence level α = 0.05, based on the evidence of our survey.

How do we calculate the probability to observe 70 or more people in our survey who smoke if the percentage really were 50% of the Atlanta population? For this it is important how we choose the people for our survey. The correct way to choose them is the following: take a complete list of the inhabitants of Atlanta. Number them. Choose 100 of them with replacement and with equal probability. This means that a person could appear twice.

Let Xi be equal to one if the i-th person chosen is a smoker. Then, if we chose the people following the procedure above, we find that the Xi's are i.i.d. and that P(Xi = 1) = p, where p designates the true percentage of people in Atlanta who smoke. Then also E[Xi] = p. The total number Z of people in our survey who smoke can now be expressed as

Z := X1 + X2 + . . . + X100.

Let P50%(.) designate the probability given that the true percentage of smokers really is 50%. Testing if 50% in Atlanta smoke can now be described as follows:

• Calculate the probability:

P50%(X1 + . . . + X100 ≥ 70).

• If the above probability is smaller than α = 0.05, we reject the hypothesis that 50% of the population smokes in Atlanta (we reject it on the α = 0.05 level). Otherwise, we keep the hypothesis. When we keep the hypothesis, this means that the result of our survey does not constitute strong evidence against the hypothesis: the result of the survey does not contradict the hypothesis.

Note that we could also have done the test on the α = 0.1 level. In that case we would reject the hypothesis if that probability is smaller than 0.1.
Next we explain how we can calculate, approximately, the probability P50%(Z ≥ 70) using the CLT. Simply note that, by basic algebra, the inequality

Z ≥ 70

is equivalent to

Z − nμ ≥ 70 − nμ,

which is itself equivalent to:

(Z − nμ)/(σ√n) ≥ (70 − nμ)/(σ√n).

Equivalent inequalities must also have the same probability. Hence:

P50%(Z ≥ 70) = P50%(Z − nμ ≥ 70 − nμ) = P50%( (Z − nμ)/(σ√n) ≥ (70 − nμ)/(σ√n) ).   (15.1)

By the CLT we have that

(Z − nμ)/(σ√n)

is close to being a standard normal random variable N(0, 1). Thus, the probability on the right side of equation 15.1 is approximately equal to

P( N(0, 1) ≥ (70 − nμ)/(σ√n) ).   (15.2)

If the probability in expression 15.2 is smaller than 0.05, then we reject the hypothesis that 50% of the Atlanta population smokes (on the α = 0.05 level). We can look up in a table the probability that the standard normal N(0, 1) is smaller than the number (70 − nμ)/(σ√n); the probability 15.2 is then one minus that value. We have tables for the standard normal variable N(0, 1).
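Under the newspaper's hypothesis we have μ = p = 0.5 and σ = √(p(1 − p)) = 0.5, so the threshold is (70 − 50)/(0.5 · 10) = 4; the Python sketch below (an addition to the notes) finishes the computation with the error-function form of Φ.

```python
from math import erf, sqrt

def Phi(z):
    # standard normal distribution function
    return 0.5 * (1 + erf(z / sqrt(2)))

n, p = 100, 0.5
mu, sigma = p, sqrt(p * (1 - p))     # mean and sd of one Bernoulli Xi

threshold = (70 - n * mu) / (sigma * sqrt(n))   # = 4.0
p_value = 1 - Phi(threshold)

print(p_value)   # about 3.2e-05, far below alpha = 0.05 -> reject
```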
15.1 Looking up probabilities for the standard normal in a table
Let z ∈ R. Let Φ(z) denote the probability that a standard normal variable is smaller than or equal to z. Thus:

Φ(z) := P(N(0, 1) ≤ z) = ∫_{−∞}^{z} (1/√(2π)) e^(−x²/2) dx.

For example, let z > 0 be a number. Say we want to find the probability

P(N(0, 1) ≥ z).   (15.3)

The table for the standard normal gives the values of Φ(z) for z > 0, thus we have to try to express probability 15.3 in terms of Φ(z). For this, note that:

P(N(0, 1) ≥ z) = 1 − P(N(0, 1) < z).

Furthermore, P(N(0, 1) < z) is equal to P(N(0, 1) ≤ z) = Φ(z), since the normal is a continuous variable. Thus we find that:

P(N(0, 1) ≥ z) = 1 − Φ(z).

Let us next explain how, if z < 0, we can find the probability

P(N(0, 1) ≤ z).

Note that N(0, 1) is symmetric around the origin. Thus,

P(N(0, 1) ≤ z) = P(N(0, 1) ≥ |z|).

This brings us back to the previously studied case. We find

P(N(0, 1) ≤ z) = 1 − Φ(|z|).

Eventually, let z > 0 again. What is the probability

P(−z ≤ N(0, 1) ≤ z)

equal to? For this problem, note that

P(−z ≤ N(0, 1) ≤ z) = 1 − P(N(0, 1) ≥ z) − P(N(0, 1) ≤ −z).

Thus we find that:

P(−z ≤ N(0, 1) ≤ z) = 1 − (1 − Φ(z)) − (1 − Φ(z)) = 2Φ(z) − 1.
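These three identities are easy to confirm numerically; the short Python sketch below (an added check, with z = 1.5 chosen arbitrarily) evaluates them with the same erf-based Φ as above.

```python
from math import erf, sqrt

Phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))

z = 1.5
print(1 - Phi(z))         # P(N(0,1) >= z)
print(1 - Phi(abs(-z)))   # P(N(0,1) <= -z), via symmetry
print(2 * Phi(z) - 1)     # P(-z <= N(0,1) <= z)
```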
15.2 Two sample testing
Let us give an example to introduce this subject. Assume that we are testing a new fuel for a certain type of rocket. We would like to know if the new fuel gives a different initial velocity to the rocket. The initial velocity with the old fuel is denoted by X, whilst Y is the initial velocity with the new fuel. We fire the rocket five times with the old fuel and measure the initial velocity each time. We find:

X1 = 100, X2 = 102, X3 = 97, X4 = 100, X5 = 101   (15.4)

(here Xi denotes the initial velocity measured whilst firing the rocket for the i-th time with the old fuel). Then we fire the rocket five times with the new fuel. Every time we measure the initial velocity. We find:

Y1 = 101, Y2 = 103, Y3 = 99, Y4 = 102, Y5 = 100.   (15.5)
We calculate the averages:

X̄ := (X1 + X2 + X3 + X4 + X5)/5 = 100

and

Ȳ := (Y1 + Y2 + Y3 + Y4 + Y5)/5 = 101.

When we measure the initial velocities we find different values even when we use the same fuel. The reason is that our measurement instruments are not very precise, so we get the true value plus a measurement error. The model is as follows:

Xi = μX + εi^X

and

Yi = μY + εi^Y.

Furthermore, ε1^X, ε2^X, . . . are i.i.d. random errors and so are ε1^Y, ε2^Y, . . .. We assume that the measurement instrument is well calibrated, so that

E[εi^X] = E[εi^Y] = 0

for all i = 1, 2, . . .. Here μX and μY are unknown constants (in our example μX is the initial speed when we use the old fuel, whilst μY is the initial speed when we use the new fuel). We find that

E[Xi] = E[μX + εi^X] = μX + E[εi^X] = μX + 0 = μX,

and similarly

E[Yi] = μY.
So our testing problem can be described as follows: we want to figure out, based on our data 15.4 and 15.5, if the second fuel gives a different initial speed than the old fuel. We observed

Ȳ − X̄ = 1 > 0.

This means that in the second sample, obtained with the new fuel, the initial speed is higher on average by one unit compared to the initial speed in the first sample obtained with the old fuel. But is this evidence enough to conclude that the new fuel provides higher initial speed, or could this difference just be due to the measurement errors? As a matter of fact, since we make measurement errors, it could be that even if the second fuel does not provide higher initial speed (i.e. μX = μY), due to the random errors and bad luck the second average is higher than the first. In our present setting we can never be absolutely sure, but we try to see if there is statistically significant evidence for arguing that μX and μY are not equal.

The exact method to do this depends on whether we know the standard deviation of the errors or not, and on whether they are identical for the two samples. We will need the expectation and standard deviation of the means. This is what we calculate in the next paragraph.
Expectation and standard deviation of the means. Let the standard deviations of the errors be denoted by

σX := √(VAR[εi^X]),  σY := √(VAR[εi^Y]).

Let Z := Ȳ − X̄. We find that the standard deviation of Z is given by

σZ = σ_{Ȳ−X̄} = √(VAR[Ȳ − X̄]) = √(VAR[Ȳ] + VAR[X̄]),   (15.6)

where the last equality above was obtained using the facts that the εi^X and the εi^Y are independent of each other, and that the variance of a sum of independent variables is equal to the sum of the variances. Now

VAR[X̄] = VAR[ (X1 + . . . + Xn)/n ] = VAR[X1 + . . . + Xn]/n² = (VAR[X1] + VAR[X2] + . . . + VAR[Xn])/n² = nVAR[X1]/n² = VAR[X1]/n = σX²/n,

and similarly

VAR[Ȳ] = σY²/n.

Using this in equation 15.6, we find

σZ = σ_{Ȳ−X̄} = √(σX²/n + σY²/n).   (15.7)
If σX = σY (which should be the case when we use the same measurement instrument), then equation 15.7 can be rewritten as

σ_{Ȳ−X̄} = √(σ²/n + σ²/n) = σ√(2/n),   (15.8)

where σ = σX = σY. If the two samples had different sizes, we would find by a similar calculation

σZ = √(σX²/n1 + σY²/n2),   (15.9)

where n1 is the size of the first sample and n2 is the size of the second sample. Furthermore, we have for the expectation

E[Ȳ − X̄] = E[Ȳ] − E[X̄] = E[ (Y1 + Y2 + . . . + Yn)/n ] − E[ (X1 + . . . + Xn)/n ] = (E[Y1 + . . . + Yn] − E[X1 + . . . + Xn])/n = (E[Y1] + . . . + E[Yn] − E[X1] − . . . − E[Xn])/n = E[Y1] − E[X1] = μY − μX.

To summarize, we found that

E[Ȳ − X̄] = μY − μX.   (15.10)
A simplified method. Let us first explain a rough method, in order to convey the idea in a simple way. Up to a small detail, this method is the same as what is really used in practice. At this stage we are ready to explain how we could proceed to know if we have strong evidence against μY − μX = 0. We are going to use the rule of thumb which says that in most cases, for most variables, the values we typically observe are within a distance of at most 2 times the standard deviation from the expected value. We apply this rule to Z = Ȳ − X̄. If there were no difference between the new and the old fuel, then μY − μX would be equal to zero and hence E[Z] = 0 (see equation 15.10). We can then check if the value we observe for Z is within 2 times the standard deviation σZ. Thus, in our case, we check if the value 1 is within 2 times the standard deviation σZ. This is the same as checking if

(Ȳ − X̄)/σZ   (15.11)

is not more than 2 in absolute value. If it is, we would think that μY is probably not equal to μX. In that case, we say that we reject the hypothesis that μX = μY. The expression 15.11 is called the test statistic. What we did here is check if the value taken by the test statistic is within the interval [−cr, cr], where we took cr = 2. The number cr is called the critical value for our test. If we do not know σX and σY, we estimate them and replace them by their estimates in the formulas 15.7, 15.8, 15.9. (To see how to estimate a standard deviation, go to subsection 16.2.) We then use that value for the test statistic instead of 15.11.

The method described here differs from the one really used only as far as the critical value is concerned. However, even with the way the test is usually done in practice, the critical value will not be very far from 2. Let us next explain in detail the different methods used in practice. They depend on whether the standard deviation is known or not. Also, to perform a statistical test in a precise way, we need to specify the level of confidence for the test. The higher the level of confidence, the bigger the critical value will be. Let us see the details in the next paragraphs:
The case with identical, known standard deviation. Assume that the standard deviations σX and σY are known to us and identical. This is typically the case when the measurement instruments used for both samples are identical. In this case, we denote by σ the value σ = σX = σY. If we work often with the same measurement instruments, we will know the typical size of the measurement error, hence we will know σ from experience. Assume here that the measurement errors are normal. Then the test statistic

(Ȳ − X̄)/σZ = (Y1 + . . . + Yn − X1 − . . . − Xn)/(n σZ)   (15.12)

is also normal. As a matter of fact, as can be seen in 15.12, the test statistic can be written as a sum of independent normal variables divided by a constant. We know that sums of independent normal variables are again normal. Furthermore, dividing a normal by a constant gets you a normal again. If μX = μY, then the expectation of the test statistic is zero:

E[ (Ȳ − X̄)/σZ ] = E[Ȳ − X̄]/σZ = (μY − μX)/σZ = 0.

Similarly, the variance of the test statistic is one. This can be seen from:

VAR[ (Ȳ − X̄)/σZ ] = VAR[Ȳ − X̄]/σZ² = 1.

Hence, if μX = μY, then the test statistic is a normal variable with expectation 0 and variance 1. In other words, the test statistic is a standard normal variable. So, in this case, the critical value cr at a confidence level p is the number cr > 0 satisfying

P(−cr ≤ N(0, 1) ≤ cr) = p.

By symmetry around the origin, this implies (see subsection 15.1) that

Φ(cr) = (1 + p)/2,   (15.13)

where Φ(x) = P(N(0, 1) ≤ x) is the distribution function of a standard normal. Which value satisfies equation 15.13 can be found in a table for standard normal variables. For example, for p = 95% the corresponding critical value is cr = 1.96.
Let us get back to our example with the rocket. Assume that the typical measurement error when we make one measurement is 3. In other words, let σ = 3. Assume we want to test on the 95%-level if there is a statistically significant difference between the means in our samples. That is, we want to test the hypothesis μY = μX against the hypothesis μY ≠ μX on the 95%-level. For this we simply need to check if the test statistic lies between −cr and +cr. The test statistic is (Ȳ − X̄)/σZ. The constant σZ = σ_{Ȳ−X̄} has been calculated in 15.8, where it was found:

σZ := σ√(2/n).

Hence, with our value σ = 3 and n = 5, the test statistic takes on the value

(Ȳ − X̄)/σZ = (Ȳ − X̄)/(σ√(2/n)) = 1/(3√0.4) ≈ 1/1.90 ≈ 0.53.

The value of the test statistic lies within the interval [−cr, cr], where cr = 1.96. Hence, in this situation, we can not reject the hypothesis that μX = μY on the 95%-confidence level. In other words, we do not have enough statistical evidence to reject the idea that μX = μY. This means that our data does not seem to imply that the new fuel is better or worse than the old. Note that this does not necessarily mean that μX and μY must be identical. It could be that the difference is so small that it gets masked by our measurement errors.
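Here is a compact Python sketch (an addition to the notes) of this two-sample z-test with known, equal σ; the data and σ = 3 are taken from the example above.

```python
from math import sqrt

X = [100, 102, 97, 100, 101]   # old fuel
Y = [101, 103, 99, 102, 100]   # new fuel
sigma = 3.0                    # known sd of one measurement error
n = len(X)

xbar, ybar = sum(X) / n, sum(Y) / n
sigma_Z = sigma * sqrt(2 / n)          # sd of Ybar - Xbar, equation 15.8

stat = (ybar - xbar) / sigma_Z
print(stat)                            # about 0.53

cr = 1.96                              # two-sided critical value at 95%
print(abs(stat) > cr)                  # False -> cannot reject mu_X = mu_Y
```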
The way the test was done here is called a two-sided test. If we were interested only in knowing whether the new fuel is better, then we would do a one-sided test. (It could be that a company might change to a new fuel, but only if it is proven to be better. In that case, the only interesting thing is to know if the new fuel is better, not if it is different.) In the case of a one-sided test the acceptance region would be (−∞, cr], where the critical value cr is determined by

P(N(0, 1) ≤ cr) = Φ(cr) = p

on the confidence level p. Here, as before, Φ(.) designates the distribution function of a standard normal variable.

In this example, we assumed the measurement errors to be normal. If this is not the case, but we have many measurements, the above method still applies thanks to the Central Limit Theorem.
The case when the standard deviations are known but not equal, or different sample sizes. If in the two samples the standard deviations are different (because of different measurement instruments, maybe), then all of the above remains the same, except that we use a different formula for σZ: formula 15.7. The same goes when the samples have different sizes from each other; the formula used in that case is 15.9.
The case with unknown, but equal standard deviation. Assume that σ = σX = σY, but σ is unknown to us. Then instead of σZ we use an estimate for σZ. For this, note that

σZ = √(σX²/n + σY²/n).   (15.14)

We will estimate σX² and σY² and plug the values into formula 15.14 instead of the real values. The estimates we use (see subsection 16.2) are

sX² := ( (X1 − X̄)² + (X2 − X̄)² + . . . + (Xn − X̄)² )/(n − 1)

for σX², and

sY² := ( (Y1 − Ȳ)² + (Y2 − Ȳ)² + . . . + (Yn − Ȳ)² )/(n − 1)

for σY². This then gives as estimate for σZ the following expression:

√( (sX² + sY²)/n ).

Our test statistic is obtained by replacing σZ by its estimate in the previously used test statistic. Hence the test statistic for the case of unknown standard deviation is:

(Ȳ − X̄)/√( (sX² + sY²)/n ).   (15.15)

The distribution of the test statistic is no longer normal. It is slightly modified. One can prove that when μX = μY and the measurement errors are normal, the test statistic has a Student t-distribution with 2n − 2 degrees of freedom. So our testing procedure is almost as before, only that we have to find the critical value cr in a different table. This time we have to find it in a table for the t-distribution with 2n − 2 degrees of freedom. So, if we test on the confidence level p, then cr is defined to be the number such that

P(−cr ≤ T(2n−2) ≤ cr) = p.
We reject the hypothesis μX = μY on the level p if the test statistic 15.15 takes a value outside [−cr, cr]. Let us get back to our rocket example. Say we want to test μX = μY on the level p = 95%. We find

sX² = (0 + 2² + 3² + 0 + 1)/4 = 14/4 = 3.5

and

sY² = (0 + 2² + 2² + 1 + 1)/4 = 10/4 = 2.5.

With n = 5, the test statistic takes on the value

(Ȳ − X̄)/√( (sX² + sY²)/n ) = 1/√(6/5) = 1/√1.2 ≈ 0.91.

Now we have to look at the t-distribution with 2n − 2 = 8 degrees of freedom. Upon reading the table we find the critical value for a two-sided test on the 95% level to be cr ≈ 2.31. We see that the value taken by the test statistic is well within the interval [−cr, cr], and hence we can not reject the hypothesis that the new fuel has no effect. More precisely, our data does not contain significant evidence on the 95%-level that there is a difference between the two fuels, i.e. that μX ≠ μY.
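The same computation in Python (an addition to the notes); the critical value 2.306 is the standard two-sided 95% quantile of the t-distribution with 8 degrees of freedom.

```python
from math import sqrt

X = [100, 102, 97, 100, 101]
Y = [101, 103, 99, 102, 100]
n = len(X)

xbar, ybar = sum(X) / n, sum(Y) / n
sX2 = sum((x - xbar) ** 2 for x in X) / (n - 1)   # 3.5
sY2 = sum((y - ybar) ** 2 for y in Y) / (n - 1)   # 2.5

stat = (ybar - xbar) / sqrt((sX2 + sY2) / n)
print(stat)                     # about 0.91

cr = 2.306                      # t-table value, 8 df, two-sided 95%
print(abs(stat) > cr)           # False -> cannot reject
```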
16 Statistical estimation

16.1 An example
Imagine that we want to measure the distance d between two points y and z. Every time we repeat the measurement we make a measurement error. In order to improve the precision, we make several measurements and then take the average value measured. Let Xi designate measurement number i and εi the error number i. We have that:

Xi = d + εi.

We assume that the measurement errors are i.i.d. such that

E[εi] = 0

and

VAR[εi] = σ².

The standard deviation σ of the measurement instrument is supposed to be known to us. Imagine that we make 4 measurements and find, in meters, the four values:

100, 102, 99, 101.

We see that the distance d must be around 101 meters. However, the exact value of the distance d remains unknown to us, since each of the four measurements above contains an error. So we can only estimate what the true distance is equal to. Typically we take the average of the measurements as estimate for d. We write d̂ for our estimate of d. In case we decide to use the average of our measurements as estimate for d, we have that:

d̂ = (X1 + X2 + X3 + X4)/4.

The advantage of taking four measurements of the same distance instead of only one is that the probability to have a large error is reduced. The errors in the different measurements tend to even each other out when we compute the average. As a matter of fact, assume we make n measurements and then take the average. In this case:

d̂ := (X1 + . . . + Xn)/n.
We find:

E[d̂] = (1/n)(E[X1] + . . . + E[Xn]) = (1/n)(nE[X1]) = E[X1] = E[d + ε1] = d + E[ε1] = d + 0 = d.

An estimator which has its expectation equal to the true value we want to estimate is called an unbiased estimator.

Let us calculate:

VAR[d̂] = VAR[ (X1 + . . . + Xn)/n ] = (1/n²)(VAR[X1] + . . . + VAR[Xn]) = (1/n²)(nVAR[X1]) = VAR[X1]/n.

Thus, the standard deviation of d̂ is equal to

√(VAR[X1]/n) = σ/√n.

The standard deviation of the average d̂ is thus √n times smaller than the standard deviation of the error when we make one measurement. This justifies taking several measurements and taking the average, since it reduces the size of a typical error by a factor √n.
When we make a measurement and give an estimate of what the distance is, it is important that we know the order of magnitude of the error. Imagine, for example, that the order of magnitude of the error is 100 meters. The situation would then be: our estimate of the distance is 101 meters, and the precision of this estimate is plus/minus 100 meters. In this case our estimate of the distance is almost useless because of the huge imprecision. This is why we always try to give the precision of the estimate. Since the errors are random, theoretically even very large errors are always possible. Very large errors, however, have small probability. Hence one tries to be able to give an upper bound on the size of the error which holds with a given probability. Typically one uses the probabilities 95% or 99%. The type of statement one wishes to make is, for example: our estimate for the distance is 101 meters; furthermore, with 95% probability the true distance is within 2 meters of our estimate. In this case the interval [99, 103] is called the 95% confidence interval for d. With 95% probability, d should lie within this interval. More precisely, we look for a real number a > 0 such that:

P(d̂ − a ≤ d ≤ d̂ + a) = 95%,

or equivalently:

P(−a ≤ d̂ − d ≤ a) = 95%.
Hence we are looking for a number a such that:

95% = P( −a ≤ (X1 + . . . + Xn)/n − d ≤ a ) = P( −a ≤ (d + ε1 + . . . + d + εn − nd)/n ≤ a ) = P( −a ≤ (ε1 + . . . + εn)/n ≤ a ).

Now, either we assume that the errors εi are normal, or we assume that n is big enough so that the sum ε1 + . . . + εn is approximately normal due to the central limit theorem. Dividing the sum ε1 + . . . + εn by σ√n, we get (approximately) a standard normal variable. This then gives:

95% = P( −(a√n)/σ ≤ (ε1 + . . . + εn)/(σ√n) ≤ (a√n)/σ ) ≈ P( −(a√n)/σ ≤ N(0, 1) ≤ (a√n)/σ ).

We thus find the number b > 0 from the table for the standard normal random variable such that:

95% = P(−b ≤ N(0, 1) ≤ b).

Hence:

95% = Φ(b) − (1 − Φ(b)) = 2Φ(b) − 1,

where Φ(.) designates the distribution function of the standard normal variable. Then we find a > 0 by solving:

b = (a√n)/σ.

The confidence interval on the 95% level is then:

[d̂ − a, d̂ + a].

This means that although we don't know the exact value of d, we can say that with 95% probability d lies in the interval [d̂ − a, d̂ + a].
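For the four measurements above, the following Python sketch (an addition) computes the 95% confidence interval; the instrument precision σ = 1.5 is an assumed, purely illustrative value, since the text leaves σ generic.

```python
from math import sqrt

data = [100, 102, 99, 101]
n = len(data)
sigma = 1.5            # assumed known instrument sd (illustrative)

d_hat = sum(data) / n  # 100.5
b = 1.96               # P(-b <= N(0,1) <= b) = 0.95
a = b * sigma / sqrt(n)

print(d_hat - a, d_hat + a)   # the 95% confidence interval for d
```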
16.2 Estimation of variance and standard deviation
Assume that we are in the same situation as in the previous subsection. The only difference is that instead of trying to determine the distance, we want to find out how precise our measurement instrument is. In other words, we try to determine the standard deviation σ = √(VAR[εi]). For this we make several measurements of the distance between two points y and z. We choose the points so that we know the distance d between them. Again, if Xi designates the i-th measurement, we have Xi = d + εi. Define the random variable Zi in the following way:

Zi := (Xi − d)² = εi².

Thus:

E[Zi] = VAR[εi].

We have argued that if we have a number of independent copies of the same random variable, a good way to estimate the expectation is to take the average. Thus, to estimate the expectation E[Zi], we take the average:

Ê[Zi] := (Z1 + . . . + Zn)/n.

In other words, as an estimate for VAR[εi] = E[Zi] = σ², we take:

(Z1 + . . . + Zn)/n = ( (X1 − d)² + . . . + (Xn − d)² )/n.

The estimate for σ is then simply the square root of the estimate for the variance. Thus, our estimator for σ = √(VAR[εi]) is:

σ̂ = √( ( (X1 − d)² + . . . + (Xn − d)² )/n ).

If the distance d should not be known, we simply take an estimate d̂ instead of d. In that case our estimate for σ is

σ̂ = √( ( (X1 − d̂)² + . . . + (Xn − d̂)² )/(n − 1) ),

where

d̂ := (X1 + . . . + Xn)/n.

(Note that in the case where d is unknown, instead of dividing by n we usually divide by n − 1. This is a little detail which I am not going to explain. For large n it is not important, since then n/(n − 1) is close to 1.)
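A short Python sketch (an addition) of both estimators, reusing the four measurements from before and pretending, for the known-d variant, that the true distance is d = 100; both choices are purely illustrative.

```python
from math import sqrt

data = [100, 102, 99, 101]
n = len(data)

# Variant 1: true distance d known (illustrative value)
d = 100
sigma_known_d = sqrt(sum((x - d) ** 2 for x in data) / n)

# Variant 2: d unknown, replaced by the average; divide by n - 1
d_hat = sum(data) / n
sigma_unknown_d = sqrt(sum((x - d_hat) ** 2 for x in data) / (n - 1))

print(sigma_known_d, sigma_unknown_d)
```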
16.3 Maximum Likelihood estimation
Imagine the following situation: we have two 6-sided dice. Let X designate the number we obtain when we throw the first die. Let Y designate the number we obtain when we throw the second one. Assume that the first die is regular whilst the second is skewed. We have:

(P(X = 1), P(X = 2), . . . , P(X = 6)) = (1/6, 1/6, 1/6, 1/6, 1/6, 1/6).

(Note that 1/6 ≈ 0.16.) Assume furthermore that:

(P(Y = 1), . . . , P(Y = 6)) = (0.01, 0.3, 0.2, 0.1, 0.1, 0.29).

Imagine that we are playing the following game: I choose from a bag one of the two dice. Then I throw it and get a number between 1 and 6. I don't tell you which die I used, but I tell you the number obtained. You have to guess which die I used, based on the number which I tell you. (This guessing is what statisticians call estimating.) For example, I tell you that I obtained the number 1. With the first die, the probability to obtain a 1 is about 0.16, whilst with the second die it is 0.01. The probability to obtain a 1 is thus much smaller with the second die. Having obtained a 1 makes us thus think that it is likelier that the die used is the first die. Our guess will thus be the first die. Of course you could be wrong, but based on what you know, the first die appears to be likelier.

If, on the other hand, after throwing the die we obtain a 2, we guess that it was the second die which got used. The reason is that with the second die a 2 has a probability of 0.3, which is larger than the probability to see a 2 with the first die. Again, our guess might be wrong, but when we observe a 2, the second die seems likelier. The method of guessing described here is called Maximum Likelihood estimation. It consists in guessing (estimating) the possibility which makes the observed result most likely. In other words, we choose the possibility for which the probability of the observed outcome is highest.

Let us look at this in a slightly more abstract way. Let I designate the first die and II the second. For x = 1, 2, . . . , 6, let P(x, I) designate the probability that the number we obtain by throwing the first die equals x. Thus:

P(x, I) := P(X = x).

Let P(x, II) designate the probability that the number we obtain by throwing the second die equals x. Thus:

P(x, II) := P(Y = x).

For example, P(1, I) is the probability that the first die gives a 1 and P(1, II) is the probability that the second die gives a 1, whilst P(2, II) designates the probability that the second die gives a 2.

Let θ be a (non-random) variable which can take one out of two values: I or II. Statisticians call θ the parameter. In this example, guessing which die we are using is the same as trying to figure out if θ equals I or II. We consider the probability function P(., .) with two entries:

(x, θ) → P(x, θ).

Formally, what we did can be described as follows: given that we observe an outcome x, we take the θ which maximizes P(x, θ) as our guess for which die was used. Our maximum likelihood estimate of θ is the θ maximizing P(x, θ), where x is the observed outcome. This is a general method and can be used in many different settings. Let us give another example of maximum likelihood estimation, based on the same principle.
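Before that, note that the guessing rule for the dice fits in a few lines of Python (an added sketch): for an observed face x, return the θ with the larger P(x, θ).

```python
# probabilities P(x, theta) for the two dice
P = {
    "I":  {x: 1 / 6 for x in range(1, 7)},                           # fair die
    "II": dict(zip(range(1, 7), [0.01, 0.3, 0.2, 0.1, 0.1, 0.29])),  # skewed die
}

def ml_estimate(x):
    # pick the parameter value that maximizes the probability of x
    return max(P, key=lambda theta: P[theta][x])

print(ml_estimate(1))  # 'I'  -> a 1 is likelier with the fair die
print(ml_estimate(2))  # 'II' -> a 2 is likelier with the skewed die
```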
16.4 Estimation of parameter for geometric random variables
Let T1, T2, . . . be a sequence of i.i.d. geometric random variables with parameter p > 0. Assume that p is unknown. We want to estimate p (in other words, we want to try to guess what p is approximately equal to). Say we observe:

(T1, T2, T3, T4, T5) = (6, 7, 5, 8, 8).

Based on this evidence, what should our estimate p̂ for p be? (Hence, what should our guess for the unknown p be?) We can use the Maximum Likelihood method. For this, the estimate p̂ is the p ∈ [0, 1] for which the probability to observe

(6, 7, 5, 8, 8)

is maximal. Since we assumed the Ti's to be independent, we find that

P( (T1, T2, T3, T4, T5) = (6, 7, 5, 8, 8) )   (16.1)

is equal to

P(T1 = 6) · P(T2 = 7) · · · P(T5 = 8).

For a geometric random variable T with parameter p we have that:

P(T = k) = p(1 − p)^(k−1).

Thus the probability 16.1 is equal to:

p(1−p)^5 · p(1−p)^6 · · · p(1−p)^7 = exp( ln(p) + 5 ln(1−p) + . . . + ln(p) + 7 ln(1−p) ).   (16.2)

We want to find the p maximizing the last expression. This is the same as maximizing the expression

ln(p) + 5 ln(1−p) + . . . + ln(p) + 7 ln(1−p),

since exp(.) is an increasing function. To find the maximum, we take the derivative with respect to p and set it equal to 0. This gives:

0 = d( ln(p) + 5 ln(1−p) + . . . + ln(p) + 7 ln(1−p) )/dp = 1/p − 5/(1−p) + . . . + 1/p − 7/(1−p).

The last equality leads to:

n(1 − p) = (5 + . . . + 7)p,

where n designates the number of observations. (In the special example considered here, n = 5.) We find:

n = (6 + . . . + 8)p = p(T1 + T2 + . . . + Tn)

and hence:

1/p = (T1 + T2 + . . . + Tn)/n = (6 + 7 + 5 + 8 + 8)/5.   (16.3)

Our estimate p̂ of p is the p which maximizes expression 16.2. This is the p which satisfies equation 16.3. Thus our estimate:

p̂ := ( (T1 + T2 + . . . + Tn)/n )^(−1) = ( (6 + 7 + 5 + 8 + 8)/5 )^(−1).
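Numerically, p̂ = n/ΣTi = 5/34 ≈ 0.147; the Python sketch below (an addition) also checks the closed-form answer against a brute-force scan of the log-likelihood.

```python
from math import log

T = [6, 7, 5, 8, 8]
n = len(T)

p_hat = n / sum(T)          # closed-form ML estimate, 5/34 ≈ 0.147
print(p_hat)

def log_likelihood(p):
    return sum(log(p) + (t - 1) * log(1 - p) for t in T)

# brute-force check on a fine grid of p values
grid = [k / 10_000 for k in range(1, 10_000)]
best = max(grid, key=log_likelihood)
print(best)                 # essentially the same value as p_hat
```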
17 Linear Regression

17.1 The case where the exact linear model is known

Imagine a situation where you have a chain of shops. The shops can have different sizes, and the profit seems to be, to some extent, a function of the size. The chain owns n shops.
Let xi denote the size of the i-th shop and Yi its profit. Now you assume that there are two constants α and β so that the following relationship holds:

Yi = α + βxi + εi,

where we assume that ε1, ε2, . . . are i.i.d. random variables with expectation zero:

E[εi] = 0.

Let σ denote the standard deviation of the variables εi. Often it will also be assumed that the variables εi are normal. Now we have that the expected profit is equal to

E[Yi] = E[α + βxi + εi] = E[α] + E[βxi] + E[εi] = α + βxi.

In other words, the expected profit is a linear function of the size: E[Y] = α + βx, where x is the size and Y is the profit of a shop. So, if you draw a curve representing expected profit as a function of size, you get a straight line.
Say for your chain of shops you had the relationship Yi = 3 + 4xi + εi. So in this case α = 3 and β = 4. Say you own a shop of size 5. Then for that shop, the expected profit given that the size is 5 would be E[Y|x = 5] = 3 + 4 · 5 = 23. Here we denote by E[Y|x = 5] the expectation given that the size is 5. Now, why would the profit of that shop be random? Very simple: it could be that you own many shops, but this one shop with size 5 is going to open next month. So nobody knows the exact profit in advance. One can forecast it, maybe give a confidence interval, but nobody knows the exact value in advance! Hence the profit behaves like a random variable. If you are told to predict (estimate) what the profit will be, you will give the expected value α + 5β = 23. Of course, this requires that you know the constants α and β. Now, if you also know the standard deviation σ of ε, then you can give a confidence interval. First, using Matzinger's rule of thumb, you could simply say that most probably the profit of the shop will be within two standard deviations of the expected profit. In our case, we could thus say that typically the profit is 23 ± 2σ, and hence most likely to be between 23 − 2σ and 23 + 2σ. If, for example, σ = 3, then most likely the profit for our shop will be between 17 and 29. Now, this is using a rule of thumb which says that random variables typically take values not further than twice the standard deviation from their expectation most of the time. But this is not very precise. So we could actually give a confidence interval; that is, we could give an interval such that with, for example, 95% probability the profit will be in that interval. If we assume that the errors are normal, then we have that ε/σ is standard normal. Hence (Y − α − βx)/σ = ε/σ is standard normal. Hence,

P( −c ≤ (Y − α − βx)/σ ≤ c ) = P(−c ≤ N(0, 1) ≤ c).

The above allows us to give an interval (think of a confidence interval) such that the profit of the new shop will be in that interval with a given probability. For example, on the 95%-confidence level the interval is going to be

[α + 5β − c0.95 σ, α + 5β + c0.95 σ] = [23 − c0.95 σ, 23 + c0.95 σ],

where c0.95 denotes the constant such that a standard normal is between plus/minus that constant with 0.95 probability. We have seen how to calculate such a constant.
Imagine next a situation where α and β are known, and σ is not known. Then we want to estimate σ based on our data. Note that Yi = α + βxi + εi, and hence:

εi = Yi − α − βxi.   (17.1)

When the data (Yi, xi) is known and α, β are known as well, then we can figure out the values of the εi's using formula 17.1. Note that σ designates the standard deviation of the errors εi. But in previous chapters we have learned how to estimate a standard deviation. So this is what we are going to do, using the εi's to estimate the standard deviation σ:

σ̂ := √( (ε1² + . . . + εn²)/n ).   (17.2)
Let us give an example. Say we have five shops and, as before, α = 3 and β = 4. The data for the shops is given in the table below:

xi:  1   2   3   4   6
Yi:  8  10  17  17  27

Now, for example, ε1 = Y1 − 3 − 4x1 = 8 − 3 − 4 = 1. So for each i = 1, 2, . . . , 5 we can calculate the corresponding εi. We get the values:

xi:  1   2   3   4   6
εi:  1  −1   2  −2   0

So our estimate of the standard deviation becomes:

σ̂ := √( (1² + (−1)² + 2² + (−2)² + 0²)/5 ) = √2 ≈ 1.41.   (17.3)
We can now use Matzinger's rule of thumb, which says that a random variable most of the time takes values not further than two standard deviations from its expectation. That tells us that for our shop the profit should be within 23 ± 2σ̂ = 23 ± 2.82, so typically the profit would be in the interval

[20.18, 25.82].

The above interval is just to get a rough idea of which area the profit will most likely be in. For a more precise approach with an explicit confidence level γ, we would take the interval

[23 − cγ σ, 23 + cγ σ],   (17.4)

where cγ is the constant such that a standard normal is with probability γ between −cγ and +cγ:

γ = P(−cγ ≤ N(0, 1) ≤ cγ).

Now, if we do not know the standard deviation, we can replace the true standard deviation by its estimate σ̂. The coefficient cγ from the normal table then has to be replaced by a coefficient from the Student t table. So the confidence interval, if we have to estimate the standard deviation, becomes:

[23 − t σ̂, 23 + t σ̂],   (17.5)

where t is the corresponding two-sided quantile from the Student t table.
17.2 When α and β are not known
If α and β are not known, then we estimate them using least squares. The estimates are given by the two following equations:

ȳ = α̂ + β̂x̄

and

β̂ = Σ_{i=1}^n (xi − x̄)yi / Σ_{i=1}^n (xi − x̄)².

Put the value of β̂ from the second equation into the first to calculate α̂.
Now, in principle, all the things we did in the last subsection, where α and β were known, will be done here. The difference is mainly that instead of α and β we use the estimates α̂ and β̂. We then act as if the estimates were the true values. (For the confidence interval there will be a small adjustment.) In other words, given some real data (xi, Yi) for i = 1, 2, . . . , n, you first estimate α and β. Then forget that your estimates α̂ and β̂ are only estimates: act as if they were the true α and β and do everything we did in the section above. In this way you can figure out how to estimate the standard deviation, get a confidence interval, and so on and so forth. Let us summarize:
1. To estimate the expected profit of a new shop of size x0, we used α + βx0 in the previous section. Now, however, α and β are not known. So we simply take the estimates for α and β and act as if they were the true values. Our estimate for the expected profit of a shop of size x0, when α and β are not known, is:

Ê[Y|x0] := α̂ + β̂x0.
2. To estimate the standard deviation, we had used the εi's, which are equal to εi = Yi − α − βxi. Now α and β are not known, so we replace them by their respective estimates. Our estimated random errors are

ε̂i := Yi − α̂ − β̂xi.

For estimating the standard deviation σ, we now simply replace εi by the estimate ε̂i in formula 17.2. Hence the estimated σ is defined to be:

σ̂ := √( (ε̂1² + ε̂2² + . . . + ε̂n²)/(n − 2) ).   (17.6)
3. Let us see how we give a rough confidence interval using Matzinger's rule of thumb. (That rule of thumb is: mostly, variables take values not further than two times the standard deviation from their expectation.) So in the formula α + βx0 ± 2σ we simply replace α, β, σ by their respective estimates: the rough confidence interval for the profit of a shop with size x0 would be

[α̂ + β̂x0 − 2σ̂, α̂ + β̂x0 + 2σ̂],

where σ̂ is our estimate given in 17.6.

4. For an exact confidence interval we take the same as in 17.5, but replacing again α, β and σ by their respective estimates. (Here, for estimating σ, we take 17.6.) Also, there is an additional factor equal to

√( 1 + 1/n + (x0 − x̄)²/Σ_i (xi − x̄)² ).

This factor is needed because we have additional uncertainty: we do not know α + βx0, but only have an estimate for it. For large n, this factor becomes close to 1. So, all this being said, our confidence interval on the confidence level γ is

[ α̂ + β̂x0 − t σ̂ √( 1 + 1/n + (x0 − x̄)²/Σ_i (xi − x̄)² ),  α̂ + β̂x0 + t σ̂ √( 1 + 1/n + (x0 − x̄)²/Σ_i (xi − x̄)² ) ],

where t is the corresponding two-sided quantile from the Student t table. A sketch putting items 1–4 together in code follows below.
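The Python sketch below (an addition) fits the least-squares line to the five-shop data from subsection 17.1 and prints the estimates α̂, β̂ and σ̂.

```python
from math import sqrt

x = [1, 2, 3, 4, 6]
y = [8, 10, 17, 17, 27]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n

# least-squares slope and intercept (formulas 17.8 / 17.9)
beta_hat = (sum((xi - xbar) * yi for xi, yi in zip(x, y))
            / sum((xi - xbar) ** 2 for xi in x))
alpha_hat = ybar - beta_hat * xbar

# estimated residuals and standard deviation (formula 17.6)
resid = [yi - alpha_hat - beta_hat * xi for xi, yi in zip(x, y)]
sigma_hat = sqrt(sum(e ** 2 for e in resid) / (n - 2))

print(alpha_hat, beta_hat, sigma_hat)
print(alpha_hat + beta_hat * 5)   # estimated expected profit of a shop of size 5
```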
17.3 Where the formulas for the estimates of α and β come from
So, a typical situation is that we have data:

x1 x2 . . . xn
y1 y2 . . . yn

We can assume that these points were generated by a model like the one described at the beginning of this section:

yi = α + βxi + εi

for all i = 1, 2, . . . , n, where α, β do not depend on i. Again, ε1, ε2, . . . are i.i.d. with expectation 0 and standard deviation σ. Typically, α, β and σ are not known to us. So how can we figure them out? When we have many data points, we want to try to find a straight line which is close to all the points. Consider any straight line y = a + bx. We could try to find such a line so that the sum of the distances to all the points (xi, yi) is small. This would correspond to searching for a straight line which minimizes:

Σ_{i=1}^n |yi − a − bxi|.

Note that in the above sum the yi's and the xi's are given numbers, so we only need to find a and b minimizing the above expression. Now, absolute values are a mess to calculate with. So, instead, we will take the sum of the squared distances:

d²(a, b) := Σ_{i=1}^n (yi − a − bxi)²
and find a and b minimizing d²(a, b). This will yield very nice explicit formulas. To find those formulas we simply take the derivative with respect to a and with respect to b and set them equal to 0. This yields:

d( Σ_{i=1}^n (yi − a − bxi)² )/da = −2 Σ_{i=1}^n (yi − a − bxi).

Setting the expression on the right side of the last equation above equal to 0, we find:

ȳ = a + bx̄,

where

ȳ := (y1 + . . . + yn)/n

and

x̄ := (x1 + . . . + xn)/n.

Then we take the derivative with respect to b and set it equal to 0:

d( Σ_{i=1}^n (yi − a − bxi)² )/db = −2 Σ_{i=1}^n xi(yi − a − bxi).

So, setting the expression on the right side of the last equation above equal to 0 yields:

Σ_{i=1}^n xi yi − a Σ_{i=1}^n xi − b Σ_{i=1}^n xi² = 0.   (17.7)
Take a change of variables: let x′_i := x_i − x̄. For a shift in the x-coordinate, the slope b is not changed. Also, the squared distances are not affected by a shift in the x-coordinates. So the same formula as 17.7 must hold for the values x′_i. Hence, formula 17.7 is equivalent to

$$\sum_{i=1}^{n} x'_iy_i - a\sum_{i=1}^{n} x'_i - b\sum_{i=1}^{n} (x'_i)^2 = 0.$$

Note however that ∑_i x′_i = 0. Hence, we have

$$\sum_{i=1}^{n} x'_iy_i - b\sum_{i=1}^{n} (x'_i)^2 = 0,$$

which implies that

$$b = \frac{\sum_{i=1}^{n} x'_iy_i}{\sum_{i=1}^{n} (x'_i)^2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})y_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2}.$$
We have now found a system of two equations for a and b, which determines which straight line y = a + bx gets closest to the data points (x_1, y_1), (x_2, y_2), . . . , (x_n, y_n). By closest, we mean that the sum of the squared vertical distances between the points and the line should be minimal. So, the system of two equations is:

$$\bar{y} = a + b\bar{x} \qquad (17.8)$$

$$b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})y_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2}. \qquad (17.9)$$
Solving the above system of two equations in a and b yields the straight line y = a + bx which is closest (in our sense of sum of squared distances) to our points. We will use these values of a and b which minimize the sum of squared distances as our estimates for α and β. An explanation of why this is a good idea can be found below in the subsection entitled "How precise are our estimates". So, we have that the estimates α̂ and β̂ are the only solution to 17.8 and 17.9. Hence, they are given by the following two equations:

$$\bar{y} = \hat{\alpha} + \hat{\beta}\bar{x} \qquad\text{and}\qquad \hat{\beta} := \frac{\sum_{i=1}^{n} (x_i - \bar{x})y_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2}.$$

A short sketch checking these formulas numerically follows.
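As a quick sanity check (with made-up numbers): numpy's polyfit minimizes the very same sum of squared vertical distances, so it must agree with our closed-form estimates.

    import numpy as np

    x = np.array([1., 2., 3., 4., 5.])
    y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

    # formula 17.9: slope from the centered x-values
    beta_hat = np.sum((x - x.mean()) * y) / np.sum((x - x.mean()) ** 2)
    # formula 17.8 solved for the intercept
    alpha_hat = y.mean() - beta_hat * x.mean()

    # np.polyfit(x, y, 1) returns (slope, intercept) of the least-squares line
    slope, intercept = np.polyfit(x, y, 1)
    assert np.allclose([alpha_hat, beta_hat], [intercept, slope])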
17.4 Expectation and variance of β̂

We can calculate the expectation and the variance of our estimate β̂. We have seen that the estimate β̂ is given by:

$$\hat{\beta} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})y_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2}.$$
We are going to take the variance on both sides of the last equation above, and use the fact that the x_i's are constants and not random. Recall that constants which multiply a random variable can be taken out of the variance after squaring; also, since the y_i's are independent, the variance of their sum is the sum of the variances. This leads to

$$VAR[\hat{\beta}] = VAR\left[\frac{\sum_{i=1}^{n}(x_i-\bar{x})y_i}{\sum_{i=1}^{n}(x_i-\bar{x})^2}\right] = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2\,VAR[y_i]}{\left(\sum_{i=1}^{n}(x_i-\bar{x})^2\right)^2} = \frac{VAR[y_i]}{\sum_{i=1}^{n}(x_i-\bar{x})^2}.$$

Since y_i = α + βx_i + ε_i and α + βx_i is a constant, we have VAR[y_i] = VAR[ε_i] = σ². So, we get finally:

$$VAR[\hat{\beta}] = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2} \qquad\text{and}\qquad \sigma_{\hat{\beta}} = \frac{\sigma}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}}. \qquad (17.10)$$
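For instance (made-up numbers): if σ = 2 and the design points are x_i = 1, 2, 3, 4, 5, then x̄ = 3 and ∑(x_i − x̄)² = 4 + 1 + 0 + 1 + 4 = 10, so

$$\sigma_{\hat{\beta}} = \frac{2}{\sqrt{10}} \approx 0.63.$$

Note that formula 17.10 also tells us that spreading the x_i's further apart makes the denominator larger, and hence the slope estimate more precise.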

Next, we want to calculate the expectation of the estimate β̂. Recall that the error terms ε_i have zero expectation: E[ε_i] = 0, and hence

$$E[Y_i] = E[\alpha + \beta x_i + \varepsilon_i] = E[\alpha] + E[\beta x_i] + E[\varepsilon_i] = \alpha + \beta x_i.$$

We are now ready to calculate the expectation of our estimate:

$$E[\hat{\beta}] = E\left[\frac{\sum_{i=1}^{n}(x_i-\bar{x})y_i}{\sum_{i=1}^{n}(x_i-\bar{x})^2}\right] = \frac{\sum_{i=1}^{n}(x_i-\bar{x})E[y_i]}{\sum_{i=1}^{n}(x_i-\bar{x})^2} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(\alpha+\beta x_i)}{\sum_{i=1}^{n}(x_i-\bar{x})^2}$$

$$= \alpha\,\frac{\sum_{i=1}^{n}(x_i-\bar{x})}{\sum_{i=1}^{n}(x_i-\bar{x})^2} + \beta\,\frac{\sum_{i=1}^{n}(x_i-\bar{x})x_i}{\sum_{i=1}^{n}(x_i-\bar{x})^2} = \beta,$$

since ∑_i(x_i − x̄) = 0 and ∑_i(x_i − x̄)x_i = ∑_i(x_i − x̄)².

In other words, the expectation of the estimator β̂ is β itself. This has a very important application: β̂ is itself a random number, since it depends on the ε_i's, which we have assumed to be random. Now, for any random variable Z, we measure the approximate average distance from its expectation (= dispersion) by the standard deviation of the variable. So, how far β̂ is from E[β̂] = β on average, when we keep repeating the experiment, is given by σ_β̂. But the distance between β̂ and β is the estimation error of our estimate. So, in other words, the average size of the estimation error (when we estimate β) is given by σ_β̂, for which we have a closed expression given in equation 17.10 above.
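Both facts — E[β̂] = β and formula 17.10 for σ_β̂ — can be checked by simulation: fix made-up values of α, β, σ and the design points x_i, generate many data sets from the model, re-estimate β each time, and compare the empirical mean and standard deviation of the β̂'s with the theory. A minimal sketch in Python:

    import numpy as np

    rng = np.random.default_rng(0)
    alpha, beta, sigma = 1.0, 0.5, 2.0          # made-up "true" parameters
    x = np.arange(1., 11.)                      # fixed design points x_1, ..., x_10
    denom = np.sum((x - x.mean()) ** 2)

    beta_hats = []
    for _ in range(100_000):
        eps = rng.normal(0.0, sigma, size=len(x))   # i.i.d. errors, mean 0, sd sigma
        y = alpha + beta * x + eps
        beta_hats.append(np.sum((x - x.mean()) * y) / denom)

    print(np.mean(beta_hats))          # close to beta = 0.5  (unbiasedness)
    print(np.std(beta_hats))           # close to sigma / sqrt(denom)  (17.10)
    print(sigma / np.sqrt(denom))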

17.5 How precise are our estimates

17.6 Multiple factors and/or polynomial regression

17.7 Other applications