
Review of Probability and

Statistics

Aswini Kumar Mishra


Faculty, Department of Economics

BITS-PILANI, K.K. BIRLA GOA CAMPUS


Some Important Definitions
Random Experiments
Sample Space
Sample Points
Events
Types of Events: Mutually Exclusive,
Equally Likely, Collectively Exhaustive

2
Probability Definitions
Classical Definition
Empirical Definition
Probability Properties: Probability of an Event
Mutually Exclusive and Exhaustive Events
Statistically Independent Events

3
PROBABILITY DISTRIBUTION EXAMPLE: X IS THE SUM OF TWO DICE

[Table: possible outcomes of the red die: 1, 2, 3, 4, 5, 6]

This sequence provides an example of a discrete random variable. Suppose that you have
a red die which, when thrown, takes the numbers from 1 to 6 with equal probability.
4
PROBABILITY DISTRIBUTION EXAMPLE: X IS THE SUM OF TWO DICE

[Table: columns = red die outcomes 1–6, rows = green die outcomes 1–6; the cells are still empty]

Suppose that you also have a green die that can take the numbers from 1 to 6 with equal
probability.
5
PROBABILITY DISTRIBUTION EXAMPLE: X IS THE SUM OF TWO DICE

[Table: columns = red die outcomes 1–6, rows = green die outcomes 1–6; the cells will hold X, the sum]

We will define a random variable X as the sum of the numbers when the dice are thrown.

6
PROBABILITY DISTRIBUTION EXAMPLE: X IS THE SUM OF TWO DICE

[Table: as above, with the cell for red = 4, green = 6 filled in with X = 10]

For example, if the red die is 4 and the green one is 6, X is equal to 10.

7
PROBABILITY DISTRIBUTION EXAMPLE: X IS THE SUM OF TWO DICE

[Table: as above, with the cell for red = 2, green = 5 filled in with X = 7]

Similarly, if the red die is 2 and the green one is 5, X is equal to 7.

8
PROBABILITY DISTRIBUTION EXAMPLE: X IS THE SUM OF TWO DICE

            red   1    2    3    4    5    6
green   1         2    3    4    5    6    7
        2         3    4    5    6    7    8
        3         4    5    6    7    8    9
        4         5    6    7    8    9   10
        5         6    7    8    9   10   11
        6         7    8    9   10   11   12

The table shows all the possible outcomes.

9
PROBABILITY DISTRIBUTION EXAMPLE: X IS THE SUM OF TWO DICE

[Table: the full table of sums, as above, with a column listing the possible values X = 2, 3, ..., 12]

If you look at the table, you can see that X can be any of the numbers from 2 to 12.

10
PROBABILITY DISTRIBUTION EXAMPLE: X IS THE SUM OF TWO DICE

[Table: the table of sums, with columns for X (2 to 12) and its frequency f, still to be filled in]

We will now define f, the frequencies associated with the possible values of X.

11
PROBABILITY DISTRIBUTION EXAMPLE: X IS THE SUM OF TWO DICE

[Table: as above, with the frequency f = 4 entered for X = 5]

For example, there are four outcomes which make X equal to 5.

12
PROBABILITY DISTRIBUTION EXAMPLE: X IS THE SUM OF TWO DICE

 X    f
 2    1
 3    2
 4    3
 5    4
 6    5
 7    6
 8    5
 9    4
10    3
11    2
12    1

Similarly you can work out the frequencies for all the other values of X.

13
PROBABILITY DISTRIBUTION EXAMPLE: X IS THE SUM OF TWO DICE

[Table: X and f as above, with a column p for the probabilities still to be filled in]

Finally we will derive the probability of obtaining each value of X.

14
PROBABILITY DISTRIBUTION EXAMPLE: X IS THE SUM OF TWO DICE

[Table: X, f, and the empty probability column p, as above]

If there is 1/6 probability of obtaining each number on the red die, and the same on the
green die, each outcome in the table will occur with 1/36 probability.
15
PROBABILITY DISTRIBUTION EXAMPLE: X IS THE SUM OF TWO DICE

 X    f     p
 2    1    1/36
 3    2    2/36
 4    3    3/36
 5    4    4/36
 6    5    5/36
 7    6    6/36
 8    5    5/36
 9    4    4/36
10    3    3/36
11    2    2/36
12    1    1/36

Hence to obtain the probabilities associated with the different values of X, we divide the
frequencies by 36.
16
PROBABILITY DISTRIBUTION EXAMPLE: X IS THE SUM OF TWO DICE

[Figure: bar chart of the probabilities 1/36, 2/36, 3/36, 4/36, 5/36, 6/36, 5/36, 4/36, 3/36, 2/36, 1/36 against X = 2, 3, ..., 12]

The distribution is shown graphically. In this example it is symmetrical, highest for X equal
to 7 and declining on either side.
17
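As a cross-check on the table and the bar chart, here is a minimal Python sketch that enumerates the 36 equally likely outcomes and tabulates the frequencies and probabilities; it assumes nothing beyond two fair dice.

```python
from fractions import Fraction
from collections import Counter

# Enumerate all 36 equally likely (red, green) outcomes and tally X = red + green.
freq = Counter(red + green for red in range(1, 7) for green in range(1, 7))

for x in range(2, 13):
    p = Fraction(freq[x], 36)
    print(f"X = {x:2d}: frequency {freq[x]}, probability {p} = {float(p):.4f}")

# The 36 outcomes account for all the probability.
assert sum(freq.values()) == 36
```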
Some Important Concepts
PDF of a discrete random variable
PDF of a continuous random variable
Joint PDF of discrete and continuous random variables
Marginal PDF of discrete and continuous random variables
Statistical Independence
Conditional PDF

18
Characteristics of Probability
Distributions
A probability distribution can often be
summarized in terms of a few of its
characteristics, known as the moments of the
distribution.

Two of the most widely used moments are


the mean or expected value, and the
variance.

19
EXPECTED VALUE OF A DISCRETE RANDOM VARIABLE

Definition of E(X), the expected value of d.r.v.X:

E(X) = x1 p1 + ... + xn pn = Σ xi pi = Σ xi f(xi)   (sums running from i = 1 to n)

The expected value of a random variable, also known as its population mean, is the
weighted average of its possible values, the weights being the probabilities attached to the
values.
Note that the sum of the probabilities must be unity, so there is no need to divide by the
sum of the weights.
20
EXPECTED VALUE OF A RANDOM VARIABLE

xi
x1
x2
x3
x4
x5
x6
x7
x8
x9
x10
x11

This sequence shows how the expected value is calculated, first in abstract and then with
the random variable defined in the first sequence. We begin by listing the possible values
of X. 21
EXPECTED VALUE OF A RANDOM VARIABLE

xi pi
x1 p1
x2 p2
x3 p3
x4 p4
x5 p5
x6 p6
x7 p7
x8 p8
x9 p9
x10 p10
x11 p11

Next we list the probabilities attached to the different possible values of X.

22
EXPECTED VALUE OF A RANDOM VARIABLE

xi pi xi pi
x1 p1 x1 p1
x2 p2
x3 p3
x4 p4
x5 p5
x6 p6
x7 p7
x8 p8
x9 p9
x10 p10
x11 p11

Then we define a column in which the values are weighted by the corresponding
probabilities.
23
EXPECTED VALUE OF A RANDOM VARIABLE

xi pi xi pi
x1 p1 x1 p1
x2 p2 x2 p2
x3 p3
x4 p4
x5 p5
x6 p6
x7 p7
x8 p8
x9 p9
x10 p10
x11 p11

We do this for each value separately.

24
EXPECTED VALUE OF A RANDOM VARIABLE

xi pi xi pi
x1 p1 x1 p1
x2 p2 x2 p2
x3 p3 x3 p3
x4 p4 x4 p4
x5 p5 x5 p5
x6 p6 x6 p6
x7 p7 x7 p7
x8 p8 x8 p8
x9 p9 x9 p9
x10 p10 x10 p10
x11 p11 x11 p11

Here we are assuming that n, the number of possible values, is equal to 11, but it could be
any number.
25
EXPECTED VALUE OF A RANDOM VARIABLE

xi pi xi pi
x1 p1 x1 p1
x2 p2 x2 p2
x3 p3 x3 p3
x4 p4 x4 p4
x5 p5 x5 p5
x6 p6 x6 p6
x7 p7 x7 p7
x8 p8 x8 p8
x9 p9 x9 p9
x10 p10 x10 p10
x11 p11 x11 p11
Σ xi pi = E(X)
The expected value is the sum of the entries in the third column.

26
EXPECTED VALUE OF A RANDOM VARIABLE

xi pi xi pi xi pi
x1 p1 x1 p1 2 1/36
x2 p2 x2 p2 3 2/36
x3 p3 x3 p3 4 3/36
x4 p4 x4 p4 5 4/36
x5 p5 x5 p5 6 5/36
x6 p6 x6 p6 7 6/36
x7 p7 x7 p7 8 5/36
x8 p8 x8 p8 9 4/36
x9 p9 x9 p9 10 3/36
x10 p10 x10 p10 11 2/36
x11 p11 x11 p11 12 1/36
Σ xi pi = E(X)
The random variable X defined in the previous sequence could be any of the integers from 2
to 12 with probabilities as shown.
27
EXPECTED VALUE OF A RANDOM VARIABLE

xi pi xi pi xi pi xi pi
x1 p1 x1 p1 2 1/36 2/36
x2 p2 x2 p2 3 2/36
x3 p3 x3 p3 4 3/36
x4 p4 x4 p4 5 4/36
x5 p5 x5 p5 6 5/36
x6 p6 x6 p6 7 6/36
x7 p7 x7 p7 8 5/36
x8 p8 x8 p8 9 4/36
x9 p9 x9 p9 10 3/36
x10 p10 x10 p10 11 2/36
x11 p11 x11 p11 12 1/36
Σ xi pi = E(X)
X could be equal to 2 with probability 1/36, so the first entry in the calculation of the
expected value is 2/36.
28
EXPECTED VALUE OF A RANDOM VARIABLE

xi pi xi pi xi pi xi pi
x1 p1 x1 p1 2 1/36 2/36
x2 p2 x2 p2 3 2/36 6/36
x3 p3 x3 p3 4 3/36
x4 p4 x4 p4 5 4/36
x5 p5 x5 p5 6 5/36
x6 p6 x6 p6 7 6/36
x7 p7 x7 p7 8 5/36
x8 p8 x8 p8 9 4/36
x9 p9 x9 p9 10 3/36
x10 p10 x10 p10 11 2/36
x11 p11 x11 p11 12 1/36
Σ xi pi = E(X)
The probability of X being equal to 3 is 2/36, so the second entry is 6/36.

29
EXPECTED VALUE OF A RANDOM VARIABLE

xi pi xi pi xi pi xi pi
x1 p1 x1 p1 2 1/36 2/36
x2 p2 x2 p2 3 2/36 6/36
x3 p3 x3 p3 4 3/36 12/36
x4 p4 x4 p4 5 4/36 20/36
x5 p5 x5 p5 6 5/36 30/36
x6 p6 x6 p6 7 6/36 42/36
x7 p7 x7 p7 8 5/36 40/36
x8 p8 x8 p8 9 4/36 36/36
x9 p9 x9 p9 10 3/36 30/36
x10 p10 x10 p10 11 2/36 22/36
x11 p11 x11 p11 12 1/36 12/36
Σ xi pi = E(X)
Similarly for the other 9 possible values.

30
EXPECTED VALUE OF A RANDOM VARIABLE

xi pi xi pi xi pi xi pi
x1 p1 x1 p1 2 1/36 2/36
x2 p2 x2 p2 3 2/36 6/36
x3 p3 x3 p3 4 3/36 12/36
x4 p4 x4 p4 5 4/36 20/36
x5 p5 x5 p5 6 5/36 30/36
x6 p6 x6 p6 7 6/36 42/36
x7 p7 x7 p7 8 5/36 40/36
x8 p8 x8 p8 9 4/36 36/36
x9 p9 x9 p9 10 3/36 30/36
x10 p10 x10 p10 11 2/36 22/36
x11 p11 x11 p11 12 1/36 12/36
Σ xi pi = E(X) = 252/36

To obtain the expected value, we sum the entries in this column.

31
EXPECTED VALUE OF A RANDOM VARIABLE

xi pi xi pi xi pi xi pi
x1 p1 x1 p1 2 1/36 2/36
x2 p2 x2 p2 3 2/36 6/36
x3 p3 x3 p3 4 3/36 12/36
x4 p4 x4 p4 5 4/36 20/36
x5 p5 x5 p5 6 5/36 30/36
x6 p6 x6 p6 7 6/36 42/36
x7 p7 x7 p7 8 5/36 40/36
x8 p8 x8 p8 9 4/36 36/36
x9 p9 x9 p9 10 3/36 30/36
x10 p10 x10 p10 11 2/36 22/36
x11 p11 x11 p11 12 1/36 12/36
Σ xi pi = E(X) = 252/36 = 7

The expected value turns out to be 7. Actually, this was obvious anyway. We saw in the
previous sequence that the distribution is symmetrical about 7.
32
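The same calculation can be sketched in a few lines of Python, using the distribution tabulated earlier: weight each value of X by its probability and sum.

```python
from fractions import Fraction

# Probability distribution of X, the sum of two fair dice.
p = {x: Fraction(6 - abs(x - 7), 36) for x in range(2, 13)}

expected_value = sum(x * p[x] for x in p)
print(expected_value)  # 7
```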
EXPECTED VALUE OF A RANDOM VARIABLE

Alternative notation for E(X):

E(X) = μX

Very often the expected value of a random variable is represented by μ, the Greek letter mu. If
there is more than one random variable, their expected values are differentiated by adding
subscripts to μ. 33
EXPECTED VALUE OF A FUNCTION OF A RANDOM VARIABLE

Definition of E[g(X)], the expected value of a function of X:


E[g(X)] = g(x1) p1 + ... + g(xn) pn = Σ g(xi) pi   (sum running from i = 1 to n)

To find the expected value of a function of a random variable, you calculate all the possible
values of the function, weight them by the corresponding probabilities, and sum the
results. 34
EXPECTED VALUE OF A FUNCTION OF A RANDOM VARIABLE

Definition of E[g(X)], the expected value of a function of X:


E[g(X)] = g(x1) p1 + ... + g(xn) pn = Σ g(xi) pi

Example:

E(X²) = x1² p1 + ... + xn² pn = Σ xi² pi

For example, the expected value of X2 is found by calculating all its possible values,
multiplying them by the corresponding probabilities, and summing.
35
EXPECTED VALUE OF A FUNCTION OF A RANDOM VARIABLE

xi pi
x1 p1
x2 p2
x3 p3







xn pn

The calculation of the expected value of a function of a random variable will be outlined in
general and then illustrated with an example.
36
EXPECTED VALUE OF A FUNCTION OF A RANDOM VARIABLE

xi pi
x1 p1
x2 p2
x3 p3







xn pn

First you list the possible values of X and the corresponding probabilities.

37
EXPECTED VALUE OF A FUNCTION OF A RANDOM VARIABLE

xi pi g(xi)
x1 p1 g(x1)
x2 p2 g(x2)
x3 p3 g(x3)
...
...
...
...
...
...
...
xn pn g(xn)

Next you calculate the function of X for each possible value of X.

38
EXPECTED VALUE OF A FUNCTION OF A RANDOM VARIABLE

xi pi g(xi) g(xi ) pi
x1 p1 g(x1) g(x1) p1
x2 p2 g(x2)
x3 p3 g(x3)
...
...
...
...
...
...
...
xn pn g(xn)

Then, one at a time, you weight the value of the function by its corresponding probability.

39
EXPECTED VALUE OF A FUNCTION OF A RANDOM VARIABLE

xi pi g(xi) g(xi ) pi
x1 p1 g(x1) g(x1) p1
x2 p2 g(x2) g(x2) p2
x3 p3 g(x3) g(x3) p3
... ...
... ...
... ...
... ...
... ...
... ...
... ...
xn pn g(xn) g(xn) pn

You do this individually for each possible value of X.

40
EXPECTED VALUE OF A FUNCTION OF A RANDOM VARIABLE

xi pi g(xi) g(xi ) pi
x1 p1 g(x1) g(x1) p1
x2 p2 g(x2) g(x2) p2
x3 p3 g(x3) g(x3) p3
... ...
... ...
... ...
... ...
... ...
... ...
... ...
xn pn g(xn) g(xn) pn
Σ g(xi) pi
The sum of the weighted values is the expected value of the function of X.

41
EXPECTED VALUE OF A FUNCTION OF A RANDOM VARIABLE

xi pi g(xi) g(xi ) pi xi pi
x1 p1 g(x1) g(x1) p1 2 1/36
x2 p2 g(x2) g(x2) p2 3 2/36
x3 p3 g(x3) g(x3) p3 4 3/36
... ... 5 4/36
... ... 6 5/36
... ... 7 6/36
... ... 8 5/36
... ... 9 4/36
... ... 10 3/36
... ... 11 2/36
xn pn g(xn) g(xn) pn 12 1/36
Σ g(xi) pi
The process will be illustrated for X2, where X is the random variable defined in the first
sequence. The 11 possible values of X and the corresponding probabilities are listed.
42
EXPECTED VALUE OF A FUNCTION OF A RANDOM VARIABLE

xi pi g(xi) g(xi ) pi xi pi xi2


x1 p1 g(x1) g(x1) p1 2 1/36 4
x2 p2 g(x2) g(x2) p2 3 2/36 9
x3 p3 g(x3) g(x3) p3 4 3/36 16
... ... 5 4/36 25
... ... 6 5/36 36
... ... 7 6/36 49
... ... 8 5/36 64
... ... 9 4/36 81
... ... 10 3/36 100
... ... 11 2/36 121
xn pn g(xn) g(xn) pn 12 1/36 144
Σ g(xi) pi
First you calculate the possible values of X2.

43
EXPECTED VALUE OF A FUNCTION OF A RANDOM VARIABLE

xi pi g(xi) g(xi ) pi xi pi xi2 xi2 pi


x1 p1 g(x1) g(x1) p1 2 1/36 4 0.11
x2 p2 g(x2) g(x2) p2 3 2/36 9
x3 p3 g(x3) g(x3) p3 4 3/36 16
... ... 5 4/36 25
... ... 6 5/36 36
... ... 7 6/36 49
... ... 8 5/36 64
... ... 9 4/36 81
... ... 10 3/36 100
... ... 11 2/36 121
xn pn g(xn) g(xn) pn 12 1/36 144
Σ g(xi) pi
The first value is 4, which arises when X is equal to 2. The probability of X being equal to 2
is 1/36, so the weighted function is 4/36, which we shall write in decimal form as 0.11.
44
EXPECTED VALUE OF A FUNCTION OF A RANDOM VARIABLE

xi pi g(xi) g(xi ) pi xi pi xi2 xi2 pi


x1 p1 g(x1) g(x1) p1 2 1/36 4 0.11
x2 p2 g(x2) g(x2) p2 3 2/36 9 0.50
x3 p3 g(x3) g(x3) p3 4 3/36 16 1.33
... ... 5 4/36 25 2.78
... ... 6 5/36 36 5.00
... ... 7 6/36 49 8.17
... ... 8 5/36 64 8.89
... ... 9 4/36 81 9.00
... ... 10 3/36 100 8.33
... ... 11 2/36 121 6.72
xn pn g(xn) g(xn) pn 12 1/36 144 4.00
Σ g(xi) pi
Similarly for all the other possible values of X.

45
EXPECTED VALUE OF A FUNCTION OF A RANDOM VARIABLE

xi pi g(xi) g(xi ) pi xi pi xi2 xi2 pi


x1 p1 g(x1) g(x1) p1 2 1/36 4 0.11
x2 p2 g(x2) g(x2) p2 3 2/36 9 0.50
x3 p3 g(x3) g(x3) p3 4 3/36 16 1.33
... ... 5 4/36 25 2.78
... ... 6 5/36 36 5.00
... ... 7 6/36 49 8.17
... ... 8 5/36 64 8.89
... ... 9 4/36 81 9.00
... ... 10 3/36 100 8.33
... ... 11 2/36 121 6.72
xn pn g(xn) g(xn) pn 12 1/36 144 4.00
Σ g(xi) pi = 54.83

The expected value of X2 is the sum of its weighted values in the final column. It is equal to
54.83. It is the average value of the figures in the previous column, taking the differing
probabilities into account. 46
EXPECTED VALUE OF A FUNCTION OF A RANDOM VARIABLE

xi pi g(xi) g(xi ) pi xi pi xi2 xi2 pi


x1 p1 g(x1) g(x1) p1 2 1/36 4 0.11
x2 p2 g(x2) g(x2) p2 3 2/36 9 0.50
x3 p3 g(x3) g(x3) p3 4 3/36 16 1.33
... ... 5 4/36 25 2.78
... ... 6 5/36 36 5.00
... ... 7 6/36 49 8.17
... ... 8 5/36 64 8.89
... ... 9 4/36 81 9.00
... ... 10 3/36 100 8.33
... ... 11 2/36 121 6.72
xn pn g(xn) g(xn) pn 12 1/36 144 4.00
Σ g(xi) pi = 54.83

Note that E(X²) is not the same thing as E(X) squared. In the previous sequence we saw
that E(X) for this example was 7, and its square is 49, not 54.83.
47
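A corresponding sketch for E(X²), using the same hard-coded distribution; it also shows that E(X²) differs from the square of E(X).

```python
from fractions import Fraction

p = {x: Fraction(6 - abs(x - 7), 36) for x in range(2, 13)}

e_x2 = sum(x**2 * p[x] for x in p)   # E(X^2)
e_x = sum(x * p[x] for x in p)       # E(X)

print(float(e_x2))       # 54.833..., the 54.83 in the table
print(float(e_x) ** 2)   # 49.0 -- not the same as E(X^2)
```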
POPULATION VARIANCE OF A DISCRETE RANDOM VARIABLE

Population variance of X:  σX² = E[(X − μ)²]

E[(X − μ)²] = (x1 − μ)² p1 + ... + (xn − μ)² pn = Σ (xi − μ)² pi   (sum from i = 1 to n)

The third sequence defined the expected value of a function of a random variable X. There
is only one function that is of much interest to us, at least initially: the squared deviation
from the population mean. 48
POPULATION VARIANCE OF A DISCRETE RANDOM VARIABLE

Population variance of X:  σX² = E[(X − μ)²]

E[(X − μ)²] = (x1 − μ)² p1 + ... + (xn − μ)² pn = Σ (xi − μ)² pi   (sum from i = 1 to n)

The expected value of the squared deviation is known as the population variance of X. It is
a measure of the dispersion of the distribution of X about its population mean.
49
POPULATION VARIANCE OF A DISCRETE RANDOM VARIABLE

xi      pi      xi − μ      (xi − μ)²      (xi − μ)² pi

2 1/36 5 25 0.69
3 2/36 4 16 0.89
4 3/36 3 9 0.75
5 4/36 2 4 0.44
6 5/36 1 1 0.14
7 6/36 0 0 0.00
8 5/36 1 1 0.14
9 4/36 2 4 0.44
10 3/36 3 9 0.75
11 2/36 4 16 0.89
12 1/36 5 25 0.69
5.83

We will calculate the population variance of the random variable X defined in the first
sequence. We start as usual by listing the possible values of X and the corresponding
probabilities. 50
POPULATION VARIANCE OF A DISCRETE RANDOM VARIABLE

xi      pi      xi − μ      (xi − μ)²      (xi − μ)² pi

2 1/36 5 25 0.69
3 2/36 4 16 0.89
4 3/36 3 9 0.75
5 4/36 2 4 0.44
6 5/36 1 1 0.14
7 6/36 0 0 0.00        μX = E(X) = 7
8 5/36 1 1 0.14
9 4/36 2 4 0.44
10 3/36 3 9 0.75
11 2/36 4 16 0.89
12 1/36 5 25 0.69
5.83

Next we need a column giving the deviations of the possible values of X about its
population mean. In the second sequence we saw that the population mean of X was 7.
51
POPULATION VARIANCE OF A DISCRETE RANDOM VARIABLE

xi      pi      xi − μ      (xi − μ)²      (xi − μ)² pi

2 1/36 5 25 0.69
3 2/36 4 16 0.89
4 3/36 3 9 0.75
5 4/36 2 4 0.44
6 5/36 1 1 0.14
7 6/36 0 0 0.00        μX = E(X) = 7
8 5/36 1 1 0.14
9 4/36 2 4 0.44
10 3/36 3 9 0.75
11 2/36 4 16 0.89
12 1/36 5 25 0.69
5.83

When X is equal to 2, the deviation is 5.

52
POPULATION VARIANCE OF A DISCRETE RANDOM VARIABLE

xi      pi      xi − μ      (xi − μ)²      (xi − μ)² pi

2 1/36 5 25 0.69
3 2/36 4 16 0.89
4 3/36 3 9 0.75
5 4/36 2 4 0.44
6 5/36 1 1 0.14
7 6/36 0 0 0.00        μX = E(X) = 7
8 5/36 1 1 0.14
9 4/36 2 4 0.44
10 3/36 3 9 0.75
11 2/36 4 16 0.89
12 1/36 5 25 0.69
5.83

Similarly for all the other possible values.

53
POPULATION VARIANCE OF A DISCRETE RANDOM VARIABLE

xi      pi      xi − μ      (xi − μ)²      (xi − μ)² pi

2 1/36 5 25 0.69
3 2/36 4 16 0.89
4 3/36 3 9 0.75
5 4/36 2 4 0.44
6 5/36 1 1 0.14
7 6/36 0 0 0.00
8 5/36 1 1 0.14
9 4/36 2 4 0.44
10 3/36 3 9 0.75
11 2/36 4 16 0.89
12 1/36 5 25 0.69
5.83

Next we need a column giving the squared deviations. When X is equal to 2, the squared
deviation is 25.
54
POPULATION VARIANCE OF A DISCRETE RANDOM VARIABLE

xi      pi      xi − μ      (xi − μ)²      (xi − μ)² pi

2 1/36 5 25 0.69
3 2/36 4 16 0.89
4 3/36 3 9 0.75
5 4/36 2 4 0.44
6 5/36 1 1 0.14
7 6/36 0 0 0.00
8 5/36 1 1 0.14
9 4/36 2 4 0.44
10 3/36 3 9 0.75
11 2/36 4 16 0.89
12 1/36 5 25 0.69
5.83

Similarly for the other values of X.

55
POPULATION VARIANCE OF A DISCRETE RANDOM VARIABLE

xi      pi      xi − μ      (xi − μ)²      (xi − μ)² pi

2 1/36 5 25 0.69
3 2/36 4 16 0.89
4 3/36 3 9 0.75
5 4/36 2 4 0.44
6 5/36 1 1 0.14
7 6/36 0 0 0.00
8 5/36 1 1 0.14
9 4/36 2 4 0.44
10 3/36 3 9 0.75
11 2/36 4 16 0.89
12 1/36 5 25 0.69
5.83

Now we start weighting the squared deviations by the corresponding probabilities. What do
you think the weighted average will be? Have a guess.
56
POPULATION VARIANCE OF A DISCRETE RANDOM VARIABLE

xi      pi      xi − μ      (xi − μ)²      (xi − μ)² pi

2 1/36 5 25 0.69
3 2/36 4 16 0.89
4 3/36 3 9 0.75
5 4/36 2 4 0.44
6 5/36 1 1 0.14
7 6/36 0 0 0.00
8 5/36 1 1 0.14
9 4/36 2 4 0.44
10 3/36 3 9 0.75
11 2/36 4 16 0.89
12 1/36 5 25 0.69
5.83

A reason for making an initial guess is that it may help you to identify an arithmetical error,
if you make one. If the initial guess and the outcome are very different, that is a warning.
57
POPULATION VARIANCE OF A DISCRETE RANDOM VARIABLE

xi      pi      xi − μ      (xi − μ)²      (xi − μ)² pi

2 1/36 5 25 0.69
3 2/36 4 16 0.89
4 3/36 3 9 0.75
5 4/36 2 4 0.44
6 5/36 1 1 0.14
7 6/36 0 0 0.00
8 5/36 1 1 0.14
9 4/36 2 4 0.44
10 3/36 3 9 0.75
11 2/36 4 16 0.89
12 1/36 5 25 0.69
5.83

We calculate all the weighted squared deviations.

58
POPULATION VARIANCE OF A DISCRETE RANDOM VARIABLE

xi      pi      xi − μ      (xi − μ)²      (xi − μ)² pi

2 1/36 5 25 0.69
3 2/36 4 16 0.89
4 3/36 3 9 0.75
5 4/36 2 4 0.44
6 5/36 1 1 0.14
7 6/36 0 0 0.00
8 5/36 1 1 0.14
9 4/36 2 4 0.44
10 3/36 3 9 0.75
11 2/36 4 16 0.89
12 1/36 5 25 0.69
5.83

The sum is the population variance of X.

59
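A short sketch of the same variance calculation, weighting each squared deviation by its probability; the distribution is hard-coded from the earlier table.

```python
from fractions import Fraction

p = {x: Fraction(6 - abs(x - 7), 36) for x in range(2, 13)}

mu = sum(x * p[x] for x in p)                   # population mean, 7
var = sum((x - mu) ** 2 * p[x] for x in p)      # E[(X - mu)^2]
print(mu, float(var))                           # 7 5.8333...
```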
POPULATION VARIANCE OF A DISCRETE RANDOM VARIABLE

Population variance of X

σX² = E[(X − μX)²]

In equations, the population variance of X is usually written σX², σ being the Greek letter sigma.

60
POPULATION VARIANCE OF A DISCRETE RANDOM VARIABLE

Standard deviation of X

σX = √E[(X − μX)²]

The standard deviation of X is the square root of its population variance. Usually written σX,
it is an alternative measure of dispersion. It has the same units as X.
61
EXPECTED VALUE RULES

1. E(X + Y) = E(X) + E(Y)

This sequence states the rules for manipulating expected values. First, the additive rule.
The expected value of the sum of two random variables is the sum of their expected values.
62
EXPECTED VALUE RULES

1. E(X + Y) = E(X) + E(Y)


Example generalization:
E(W + X + Y + Z) = E(W) + E(X) + E(Y) + E(Z)

This generalizes to any number of variables. An example is shown.

63
EXPECTED VALUE RULES

1. E(X + Y) = E(X) + E(Y)


2. E(bX) = bE(X)

The second rule is the multiplicative rule. The expected value of a variable multiplied by a
constant is equal to the constant multiplied by the expected value of the variable.
64
EXPECTED VALUE RULES

1. E(X + Y) = E(X) + E(Y)


2. E(bX) = bE(X)
Example:
E(3X) = 3E(X)

For example, the expected value of 3X is three times the expected value of X.

65
EXPECTED VALUE RULES

1. E(X + Y) = E(X) + E(Y)


2. E(bX) = bE(X)
3. E(b) = b

Finally, the expected value of a constant is just the constant. Of course this is obvious.

66
EXPECTED VALUE RULES

1. E(X + Y) = E(X) + E(Y)


2. E(bX) = bE(X)
3. E(b) = b

Y = b1 + b2X
E(Y) = E(b1 + b2X)

As an exercise, we will use the rules to simplify the expected value of an expression.
Suppose that we are interested in the expected value of a variable Y, where Y = b1 + b2X.
67
EXPECTED VALUE RULES

1. E(X + Y) = E(X) + E(Y)


2. E(bX) = bE(X)
3. E(b) = b

Y = b1 + b2X
E(Y) = E(b1 + b2X)
= E(b1) + E(b2X)

We use the first rule to break up the expected value into its two components.

68
EXPECTED VALUE RULES

1. E(X + Y) = E(X) + E(Y)


2. E(bX) = bE(X)
3. E(b) = b

Y = b1 + b2X
E(Y) = E(b1 + b2X)
= E(b1) + E(b2X)
= b1 + b2E(X)

Then we use the second rule to replace E(b2X) by b2E(X) and the third rule to simplify E(b1)
to just b1. This is as far as we can go in this example.
69
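A quick numerical check of this result, using the two-dice distribution and illustrative constants b1 = 2 and b2 = 3 (chosen here purely for the example).

```python
from fractions import Fraction

p = {x: Fraction(6 - abs(x - 7), 36) for x in range(2, 13)}
b1, b2 = 2, 3   # arbitrary constants for the illustration

e_x = sum(x * p[x] for x in p)
e_y = sum((b1 + b2 * x) * p[x] for x in p)   # E(b1 + b2 X) computed directly

print(e_y, b1 + b2 * e_x)   # both 23: E(Y) = b1 + b2 E(X)
```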
INDEPENDENCE OF TWO RANDOM VARIABLES

Two random variables X and Y are said to be


independent if and only if
E[f(X)g(Y)] = E[f(X)] E[g(Y)]
for any functions f(X) and g(Y).

This very short sequence presents an important definition, that of the independence of two
random variables.
70
INDEPENDENCE OF TWO RANDOM VARIABLES

Two random variables X and Y are said to be


independent if and only if
E[f(X)g(Y)] = E[f(X)] E[g(Y)]
for any functions f(X) and g(Y).

Two variables X and Y are independent if and only if, given any functions f(X) and g(Y), the
expected value of the product f(X)g(Y) is equal to the expected value of f(X) multiplied by
the expected value of g(Y). 71
INDEPENDENCE OF TWO RANDOM VARIABLES

Two random variables X and Y are said to be


independent if and only if
E[f(X)g(Y)] = E[f(X)] E[g(Y)]
for any functions f(X) and g(Y).

Special case: if X and Y are independent,


E(XY) = E(X) E(Y)

As a special case, the expected value of XY is equal to the expected value of X multiplied by
the expected value of Y if and only if X and Y are independent.
72
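A quick check of the special case with the two dice, which are independent by construction: the expected value of the product of the red and green scores equals the product of their expected values.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # 36 equally likely (red, green) pairs
p = Fraction(1, 36)

e_red = sum(r * p for r, g in outcomes)
e_green = sum(g * p for r, g in outcomes)
e_product = sum(r * g * p for r, g in outcomes)

print(e_product, e_red * e_green)   # both 49/4: E(RG) = E(R) E(G)
```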
ALTERNATIVE EXPRESSION FOR POPULATION VARIANCE

σX² = E(X²) − μ²

σX² = E[(X − μ)²]

= E(X² − 2μX + μ²)

= E(X²) + E(−2μX) + E(μ²)

= E(X²) − 2μE(X) + μ²

= E(X²) − 2μ² + μ² = E(X²) − μ²

This sequence derives an alternative expression for the population variance of a random
variable. It provides an opportunity for practising the use of the expected value rules.
73
ALTERNATIVE EXPRESSION FOR POPULATION VARIANCE

σX² = E(X²) − μ²

σX² = E[(X − μ)²]

= E(X² − 2μX + μ²)

= E(X²) + E(−2μX) + E(μ²)

= E(X²) − 2μE(X) + μ²

= E(X²) − 2μ² + μ² = E(X²) − μ²

We start with the definition of the population variance of X.

74
ALTERNATIVE EXPRESSION FOR POPULATION VARIANCE

σX² = E(X²) − μ²

σX² = E[(X − μ)²]

= E(X² − 2μX + μ²)

= E(X²) + E(−2μX) + E(μ²)

= E(X²) − 2μE(X) + μ²

= E(X²) − 2μ² + μ² = E(X²) − μ²

We expand the quadratic.

75
ALTERNATIVE EXPRESSION FOR POPULATION VARIANCE

σX² = E(X²) − μ²

σX² = E[(X − μ)²]

= E(X² − 2μX + μ²)

= E(X²) + E(−2μX) + E(μ²)

= E(X²) − 2μE(X) + μ²

= E(X²) − 2μ² + μ² = E(X²) − μ²

Now the first expected value rule is used to decompose the expression into three separate
expected values.
76
ALTERNATIVE EXPRESSION FOR POPULATION VARIANCE

σX² = E(X²) − μ²

σX² = E[(X − μ)²]

= E(X² − 2μX + μ²)

= E(X²) + E(−2μX) + E(μ²)

= E(X²) − 2μE(X) + μ²

= E(X²) − 2μ² + μ² = E(X²) − μ²

The second expected value rule is used to simplify the middle term and the third rule is
used to simplify the last one.
77
ALTERNATIVE EXPRESSION FOR POPULATION VARIANCE

σX² = E(X²) − μ²

σX² = E[(X − μ)²]

= E(X² − 2μX + μ²)

= E(X²) + E(−2μX) + E(μ²)

= E(X²) − 2μE(X) + μ²

= E(X²) − 2μ² + μ² = E(X²) − μ²

The middle term is rewritten, using the fact that E(X) and μ are just different ways of writing
the population mean of X.
78
ALTERNATIVE EXPRESSION FOR POPULATION VARIANCE

σX² = E(X²) − μ²

σX² = E[(X − μ)²]

= E(X² − 2μX + μ²)

= E(X²) + E(−2μX) + E(μ²)

= E(X²) − 2μE(X) + μ²

= E(X²) − 2μ² + μ² = E(X²) − μ²

Hence we get the result.

79
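A short sketch confirming that the two expressions for the variance agree for the two-dice distribution (54.83 − 49 = 5.83).

```python
from fractions import Fraction

p = {x: Fraction(6 - abs(x - 7), 36) for x in range(2, 13)}

mu = sum(x * p[x] for x in p)
var_direct = sum((x - mu) ** 2 * p[x] for x in p)   # E[(X - mu)^2]
var_alt = sum(x**2 * p[x] for x in p) - mu**2       # E(X^2) - mu^2

print(float(var_direct), float(var_alt))            # both 5.8333...
```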
THE FIXED AND RANDOM COMPONENTS OF A RANDOM VARIABLE

Population mean of X: E(X) = μX

In observation i, the random component is given by ui = xi − μX

Hence xi can be decomposed into fixed and random components: xi = μX + ui

Note that the expected value of ui is zero:

E(ui) = E(xi − μX) = E(xi) − E(μX) = μX − μX = 0

In this short sequence we shall decompose a random variable X into its fixed and random
components. Let the population mean of X be mX.
80
THE FIXED AND RANDOM COMPONENTS OF A RANDOM VARIABLE

Population mean of X: E(X) = μX

In observation i, the random component is given by ui = xi − μX

Hence xi can be decomposed into fixed and random components: xi = μX + ui

Note that the expected value of ui is zero:

E(ui) = E(xi − μX) = E(xi) − E(μX) = μX − μX = 0

The actual value of X in any observation will in general be different from μX. We will call the
difference ui, so ui = xi − μX.
81
THE FIXED AND RANDOM COMPONENTS OF A RANDOM VARIABLE

Population mean of X: E(X) = μX

In observation i, the random component is given by ui = xi − μX

Hence xi can be decomposed into fixed and random components: xi = μX + ui

Note that the expected value of ui is zero:

E(ui) = E(xi − μX) = E(xi) − E(μX) = μX − μX = 0

Re-arranging this equation, we can write xi as the sum of its fixed component, μX, which is
the same for all observations, and its random component, ui.
82
THE FIXED AND RANDOM COMPONENTS OF A RANDOM VARIABLE

Population mean of X: E(X) = μX

In observation i, the random component is given by ui = xi − μX

Hence xi can be decomposed into fixed and random components: xi = μX + ui

Note that the expected value of ui is zero:

E(ui) = E(xi − μX) = E(xi) − E(μX) = μX − μX = 0

The expected value of the random component is zero. It does not systematically tend to
increase or decrease X. It just makes it deviate from its population mean.
83
CONTINUOUS RANDOM VARIABLES

[Figure: bar chart of the probabilities 1/36, 2/36, 3/36, 4/36, 5/36, 6/36, 5/36, 4/36, 3/36, 2/36, 1/36 against X = 2, 3, ..., 12]

A discrete random variable is one that can take only a finite set of values. The sum of the
numbers when two dice are thrown is an example.
84
CONTINUOUS RANDOM VARIABLES

[Figure: bar chart of the probabilities 1/36, 2/36, 3/36, 4/36, 5/36, 6/36, 5/36, 4/36, 3/36, 2/36, 1/36 against X = 2, 3, ..., 12]

Each value has associated with it a finite probability, which you can think of as a packet of
probability. The packets sum to unity because the variable must take one of the values.
85
CONTINUOUS RANDOM VARIABLES

[Figure: height plotted against X over the range 55 to 75]

However, most random variables encountered in econometrics are continuous. They can
take any one of an infinite set of values defined over a range (or possibly, ranges).
86
CONTINUOUS RANDOM VARIABLES

[Figure: height plotted against X over the range 55 to 75]

As a simple example, take the temperature in a room. We will assume that it can be
anywhere from 55 to 75 degrees Fahrenheit with equal probability within the range.
87
CONTINUOUS RANDOM VARIABLES

[Figure: height plotted against X over the range 55 to 75]

In the case of a continuous random variable, the probability of it being equal to a given finite
value (for example, temperature equal to 55.473927) is always infinitesimal.
88
CONTINUOUS RANDOM VARIABLES

[Figure: height plotted against X over the range 55 to 75]

For this reason, you can only talk about the probability of a continuous random variable
lying between two given values. The probability is represented graphically as an area.
89
CONTINUOUS RANDOM VARIABLES

[Figure: the interval from 55 to 56 marked on the X axis]

For example, you could measure the probability of the temperature being between 55 and
56, both measured exactly.
90
CONTINUOUS RANDOM VARIABLES

[Figure: rectangle of height 0.05 over the interval 55–56]

Given that the temperature lies anywhere between 55 and 75 with equal probability, the
probability of it lying between 55 and 56 must be 0.05.
91
CONTINUOUS RANDOM VARIABLES

[Figure: rectangle of height 0.05 over the interval 56–57]

Similarly, the probability of the temperature lying between 56 and 57 is 0.05.

92
CONTINUOUS RANDOM VARIABLES

[Figure: unit-interval rectangles of height 0.05 along the range 55–75]

And similarly for all the other one-degree intervals within the range.

93
CONTINUOUS RANDOM VARIABLES

[Figure: unit-interval rectangles of height 0.05 along the range 55–75]

The probability per unit interval is 0.05 and accordingly the area of the rectangle
representing the probability of the temperature lying in any given unit interval is 0.05.
94
CONTINUOUS RANDOM VARIABLES

[Figure: unit-interval rectangles of height 0.05 along the range 55–75]

The probability per unit interval is called the probability density and it is equal to the height
of the unit-interval rectangle.
95
CONTINUOUS RANDOM VARIABLES

f(X) = 0.05 for 55 ≤ X ≤ 75
f(X) = 0 for X < 55 and X > 75

[Figure: the density, of height 0.05, over the range 55–75]

Mathematically, the probability density is written as a function of the variable, for example
f(X). In this example, f(X) is 0.05 for 55 < X < 75 and it is zero elsewhere.
96
CONTINUOUS RANDOM VARIABLES

probability density f(X):
f(X) = 0.05 for 55 ≤ X ≤ 75
f(X) = 0 for X < 55 and X > 75

[Figure: the probability density function, height 0.05, over the range 55–75]

The vertical axis is given the label probability density, rather than height. f(X) is known as
the probability density function and is shown graphically in the diagram as the thick black
line. 97
CONTINUOUS RANDOM VARIABLES

probability density f(X):
f(X) = 0.05 for 55 ≤ X ≤ 75
f(X) = 0 for X < 55 and X > 75

[Figure: the probability density function, height 0.05, over the range 55–75]

Suppose that you wish to calculate the probability of the temperature lying between 65 and
70 degrees.
98
CONTINUOUS RANDOM VARIABLES

probability density f(X):
f(X) = 0.05 for 55 ≤ X ≤ 75
f(X) = 0 for X < 55 and X > 75

[Figure: the probability density function, height 0.05, over the range 55–75]

To do this, you should calculate the area under the probability density function between 65
and 70.
99
CONTINUOUS RANDOM VARIABLES

probability density f(X):
f(X) = 0.05 for 55 ≤ X ≤ 75
f(X) = 0 for X < 55 and X > 75

[Figure: the probability density function, height 0.05, over the range 55–75]

Typically you have to use the integral calculus to work out the area under a curve, but in
this very simple example all you have to do is calculate the area of a rectangle.
100
CONTINUOUS RANDOM VARIABLES

probability density f(X):
f(X) = 0.05 for 55 ≤ X ≤ 75
f(X) = 0 for X < 55 and X > 75

[Figure: the shaded rectangle between 65 and 70 has width 5 and height 0.05, so its area is 0.25]

The height of the rectangle is 0.05 and its width is 5, so its area is 0.25.

101
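A small sketch that approximates these areas numerically by summing thin rectangles under the density; the function names are illustrative, not part of the slides.

```python
def f(x):
    """Uniform density for the temperature example: 0.05 on [55, 75], 0 elsewhere."""
    return 0.05 if 55 <= x <= 75 else 0.0

def prob(a, b, n=100_000):
    """Approximate P(a <= X <= b) as the area under f between a and b."""
    width = (b - a) / n
    return sum(f(a + (i + 0.5) * width) for i in range(n)) * width

print(prob(65, 70))   # ~0.25, the area of the 5-by-0.05 rectangle
print(prob(55, 75))   # ~1.0, the total probability
```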
CONTINUOUS RANDOM VARIABLES

[Figure: probability density f(X) over the range 65 to 75]

Now suppose that the temperature can lie in the range 65 to 75 degrees, with uniformly
decreasing probability as the temperature gets higher.
102
CONTINUOUS RANDOM VARIABLES

[Figure: triangular probability density over 65–75; vertical axis marked from 0.05 to 0.20]

The total area of the triangle is unity because the probability of the temperature lying in the
65 to 75 range is unity. Since the base of the triangle is 10, its height must be 0.20.
103
CONTINUOUS RANDOM VARIABLES

probability density f(X):
f(X) = 1.50 − 0.02X for 65 ≤ X ≤ 75
f(X) = 0 for X < 65 and X > 75

[Figure: the triangular density, falling from 0.20 at X = 65 to 0 at X = 75]

In this example, the probability density function is a line of the form f(X) = b1 + b2X. To pass
through the points (65, 0.20) and (75, 0), b1 must equal 1.50 and b2 must equal -0.02.
104
CONTINUOUS RANDOM VARIABLES

probability density f(X):
f(X) = 1.50 − 0.02X for 65 ≤ X ≤ 75
f(X) = 0 for X < 65 and X > 75

[Figure: the triangular density, falling from 0.20 at X = 65 to 0 at X = 75]

Suppose that we are interested in finding the probability of the temperature lying between
65 and 70 degrees.
105
CONTINUOUS RANDOM VARIABLES

probability density f(X):
f(X) = 1.50 − 0.02X for 65 ≤ X ≤ 75
f(X) = 0 for X < 65 and X > 75

[Figure: the triangular density, falling from 0.20 at X = 65 to 0 at X = 75]

We could do this by evaluating the integral of the function over this range, but there is no
need.
106
CONTINUOUS RANDOM VARIABLES

probability density f(X):
f(X) = 1.50 − 0.02X for 65 ≤ X ≤ 75
f(X) = 0 for X < 65 and X > 75

[Figure: the triangular density, falling from 0.20 at X = 65 to 0 at X = 75]

It is easy to show geometrically that the answer is 0.75. This completes the introduction to
continuous random variables.
107
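The same numerical-integration sketch, applied to the triangular density, reproduces the 0.75 result.

```python
def f(x):
    """Density for the second example: f(X) = 1.5 - 0.02 X on [65, 75], 0 elsewhere."""
    return 1.5 - 0.02 * x if 65 <= x <= 75 else 0.0

def prob(a, b, n=100_000):
    """Approximate P(a <= X <= b) as the area under f between a and b."""
    width = (b - a) / n
    return sum(f(a + (i + 0.5) * width) for i in range(n)) * width

print(prob(65, 70))   # ~0.75
print(prob(65, 75))   # ~1.0
```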
EXPECTED VALUE OF A CONTINUOUS RANDOM VARIABLE

Definition of E(X), the expected value of c.r.v.X:


E(X) = ∫ x f(x) dx   (integrated over the range of X)

The only difference between this case and the expected value of a d.r.v. is that we replace
the summation symbol by the integral symbol.

108
EXPECTED VALUE OF A CONTINUOUS RANDOM VARIABLE

Given the continuous PDF

f(x) = x²/9  where 0 ≤ x ≤ 3

E(X) = ∫₀³ x (x²/9) dx = 2.25
Here the definition is applied to a specific PDF: integrating x f(x) over the range 0 to 3 gives
an expected value of 2.25.

109
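A short sketch that approximates the integral numerically for the PDF in this example.

```python
def f(x):
    """PDF from the slide: f(x) = x^2 / 9 for 0 <= x <= 3, 0 elsewhere."""
    return x**2 / 9 if 0 <= x <= 3 else 0.0

def expected_value(a, b, n=100_000):
    """Approximate E(X) = integral of x f(x) dx over [a, b]."""
    width = (b - a) / n
    total = 0.0
    for i in range(n):
        x = a + (i + 0.5) * width
        total += x * f(x) * width
    return total

print(expected_value(0, 3))   # ~2.25
```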
COVARIANCE, COVARIANCE AND VARIANCE RULES, AND CORRELATION

Covariance

cov(X, Y) = σXY = E[(X − μX)(Y − μY)]

The covariance of two random variables X and Y, often written σXY, is defined to be the
expected value of the product of their deviations from their population means.
110
COVARIANCE, COVARIANCE AND VARIANCE RULES, AND CORRELATION

Covariance

cov(X, Y) = σXY = E[(X − μX)(Y − μY)]

E[(X − μX)(Y − μY)] = E(X − μX) E(Y − μY)
= [E(X) − E(μX)] [E(Y) − E(μY)]
= [μX − μX] [μY − μY] = 0 × 0 = 0

If two variables are independent, their covariance is zero. To show this, start by rewriting
the covariance as the product of the expected values of its factors.
111
COVARIANCE, COVARIANCE AND VARIANCE RULES, AND CORRELATION

Covariance

cov(X, Y) = σXY = E[(X − μX)(Y − μY)]

E[(X − μX)(Y − μY)] = E(X − μX) E(Y − μY)
= [E(X) − E(μX)] [E(Y) − E(μY)]
= [μX − μX] [μY − μY] = 0 × 0 = 0

We are allowed to do this because (and only because) X and Y are independent (see the
earlier sequence on independence).
112
COVARIANCE, COVARIANCE AND VARIANCE RULES, AND CORRELATION

Covariance

cov(X, Y) = σXY = E[(X − μX)(Y − μY)]

E[(X − μX)(Y − μY)] = E(X − μX) E(Y − μY)
= [E(X) − E(μX)] [E(Y) − E(μY)]
= [μX − μX] [μY − μY] = 0 × 0 = 0

The expected values of both factors are zero because E(X) = μX and E(Y) = μY. E(μX) = μX
and E(μY) = μY because μX and μY are constants. Thus the covariance is zero.
113
COVARIANCE, COVARIANCE AND VARIANCE RULES, AND CORRELATION

Covariance rules

1. If Y = V + W,
cov(X, Y) = cov(X, V) + cov(X,W).

2. If Y = bZ, where b is a constant


cov(X, Y) = bcov(X, Z)

3. If Y = b, where b is a constant,
cov(X, Y) = 0

There are some rules that follow in a perfectly straightforward way from the definition of
covariance, and since they are going to be used frequently in later chapters it is worthwhile
establishing them immediately. First, the addition rule. 114
COVARIANCE, COVARIANCE AND VARIANCE RULES, AND CORRELATION

Covariance rules

1. If Y = V + W,
cov(X, Y) = cov(X, V) + cov(X,W).

2. If Y = bZ, where b is a constant


cov(X, Y) = bcov(X, Z)

3. If Y = b, where b is a constant,
cov(X, Y) = 0

Next, the multiplication rule, for cases where a variable is multiplied by a constant.

115
COVARIANCE, COVARIANCE AND VARIANCE RULES, AND CORRELATION

Covariance rules

1. If Y = V + W,
cov(X, Y) = cov(X, V) + cov(X,W).

2. If Y = bZ, where b is a constant


cov(X, Y) = bcov(X, Z)

3. If Y = b, where b is a constant,
cov(X, Y) = 0

Finally, a primitive rule that is often useful.

116
COVARIANCE, COVARIANCE AND VARIANCE RULES, AND CORRELATION

Covariance rules

1. If Y = V + W,
cov(X, Y) = cov(X, V) + cov(X,W).
Proof:
Since Y = V + W, μY = μV + μW
cov(X, Y) = E[(X − μX)(Y − μY)]
= E[(X − μX)([V + W] − [μV + μW])]
= E[(X − μX)(V − μV) + (X − μX)(W − μW)]
= cov(X, V) + cov(X, W).

The proofs of the rules are straightforward. In each case the proof starts with the definition
of cov(X, Y).
117
COVARIANCE, COVARIANCE AND VARIANCE RULES, AND CORRELATION

Covariance rules

1. If Y = V + W,
cov(X, Y) = cov(X, V) + cov(X,W).
Proof:
Since Y = V + W, μY = μV + μW
cov(X, Y) = E[(X − μX)(Y − μY)]
= E[(X − μX)([V + W] − [μV + μW])]
= E[(X − μX)(V − μV) + (X − μX)(W − μW)]
= cov(X, V) + cov(X, W).

We now substitute for Y and re-arrange.

118
COVARIANCE, COVARIANCE AND VARIANCE RULES, AND CORRELATION

Covariance rules

1. If Y = V + W,
cov(X, Y) = cov(X, V) + cov(X,W).
Proof:
Since Y = V + W, μY = μV + μW
cov(X, Y) = E[(X − μX)(Y − μY)]
= E[(X − μX)([V + W] − [μV + μW])]
= E[(X − μX)(V − μV) + (X − μX)(W − μW)]
= cov(X, V) + cov(X, W).

This gives us the result.

119
COVARIANCE, COVARIANCE AND VARIANCE RULES, AND CORRELATION

Covariance rules

2. If Y = bZ,
cov(X, Y) = bcov(X, Z).
Proof:
Since Y = bZ, μY = bμZ
cov(X, Y) = E[(X − μX)(Y − μY)]
= E[(X − μX)(bZ − bμZ)]
= b E[(X − μX)(Z − μZ)]
= b cov(X, Z).

Next, the multiplication rule, for cases where a variable is multiplied by a constant. The Y
terms have been replaced by the corresponding bZ terms.
120
COVARIANCE, COVARIANCE AND VARIANCE RULES, AND CORRELATION

Covariance rules

2. If Y = bZ,
cov(X, Y) = bcov(X, Z).
Proof:
Since Y = bZ, μY = bμZ
cov(X, Y) = E[(X − μX)(Y − μY)]
= E[(X − μX)(bZ − bμZ)]
= b E[(X − μX)(Z − μZ)]
= b cov(X, Z).

b is a common factor and can be taken out of the expression, giving us the result that we
want.
121
COVARIANCE, COVARIANCE AND VARIANCE RULES, AND CORRELATION

Covariance rules

3. If Y = b,
cov(X, Y) = 0.
Proof:
Since Y = b, μY = b
cov(X, Y) = E[(X − μX)(Y − μY)]
= E[(X − μX)(b − b)]
= E(0)
= 0.

The proof of the third rule is trivial.

122
COVARIANCE, COVARIANCE AND VARIANCE RULES, AND CORRELATION

Example use of covariance rules

Suppose Y = b1 + b2Z

cov(X, Y) = cov(X, [b1 + b2Z])


= cov(X, b1) + cov(X, b2Z)
= 0 + cov(X, b2Z)
= b2cov(X, Z)

Here is an example of the use of the covariance rules. Suppose that Y is a linear function of
Z and that we wish to use this to decompose cov(X, Y). We substitute for Y (first line) and
then use covariance rule 1 (second line). 123
COVARIANCE, COVARIANCE AND VARIANCE RULES, AND CORRELATION

Example use of covariance rules

Suppose Y = b1 + b2Z

cov(X, Y) = cov(X, [b1 + b2Z])


= cov(X, b1) + cov(X, b2Z)
= 0 + cov(X, b2Z)
= b2cov(X, Z)

Next we use covariance rule 3 (third line), and finally covariance rule 2 (fourth line).

124
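A simulation-style sketch of this decomposition: with simulated data for X and Z, and arbitrary illustrative constants b1 and b2, cov(X, b1 + b2Z) agrees with b2 cov(X, Z).

```python
import random

random.seed(0)
n = 200_000

# Simulated X and a Z correlated with it; b1 and b2 are arbitrary constants for the illustration.
X = [random.gauss(0, 1) for _ in range(n)]
Z = [x + random.gauss(0, 1) for x in X]
b1, b2 = 5.0, 3.0
Y = [b1 + b2 * z for z in Z]

def cov(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (c - mv) for a, c in zip(u, v)) / len(u)

# cov(X, b1 + b2 Z) equals b2 cov(X, Z); the constant b1 contributes nothing.
print(round(cov(X, Y), 3), round(b2 * cov(X, Z), 3))
```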
COVARIANCE, COVARIANCE AND VARIANCE RULES, AND CORRELATION

Variance rules

1. If Y = V + W,
var(Y) = var(V) + var(W) + 2cov(V, W).

2. If Y = bZ, where b is a constant,


var(Y) = b2var(Z).

3. If Y = b, where b is a constant,
var(Y) = 0.

4. If Y = V + b, where b is a constant,
var(Y) = var(V).

Corresponding to the covariance rules, there are parallel rules for variances. First the
addition rule.
125
COVARIANCE, COVARIANCE AND VARIANCE RULES, AND CORRELATION

Variance rules

1. If Y = V + W,
var(Y) = var(V) + var(W) + 2cov(V, W).

2. If Y = bZ, where b is a constant,


var(Y) = b2var(Z).

3. If Y = b, where b is a constant,
var(Y) = 0.

4. If Y = V + b, where b is a constant,
var(Y) = var(V).

Next, the multiplication rule, for cases where a variable is multiplied by a constant.

126
COVARIANCE, COVARIANCE AND VARIANCE RULES, AND CORRELATION

Variance rules

1. If Y = V + W,
var(Y) = var(V) + var(W) + 2cov(V, W).

2. If Y = bZ, where b is a constant,


var(Y) = b2var(Z).

3. If Y = b, where b is a constant,
var(Y) = 0.

4. If Y = V + b, where b is a constant,
var(Y) = var(V).

A third rule to cover the special case where Y is a constant.

127
COVARIANCE, COVARIANCE AND VARIANCE RULES, AND CORRELATION

Variance rules

1. If Y = V + W,
var(Y) = var(V) + var(W) + 2cov(V, W).

2. If Y = bZ, where b is a constant,


var(Y) = b2var(Z).

3. If Y = b, where b is a constant,
var(Y) = 0.

4. If Y = V + b, where b is a constant,
var(Y) = var(V).

Finally, it is useful to state a fourth rule. It depends on the first three, but it is so often of
practical value that it is worth keeping it in mind separately.
128
COVARIANCE, COVARIANCE AND VARIANCE RULES, AND CORRELATION

Variance rules

1. If Y = V + W,
var(Y) = var(V) + var(W) + 2cov(V, W).

Proof:
var(Y) = cov(Y, Y) = cov([V + W], Y)
= cov(V, Y) + cov(W, Y)
= cov(V, [V + W]) + cov(W, [V + W])
= cov(V, V) + cov(V, W) + cov(W, V) + cov(W, W)
= var(V) + 2cov(V, W) + var(W)

Note: var(X) = E[(X − μX)²] = E[(X − μX)(X − μX)] = cov(X, X).

The proofs of these rules can be derived from the results for covariances, noting that the
variance of Y is equivalent to the covariance of Y with itself.
129
COVARIANCE, COVARIANCE AND VARIANCE RULES, AND CORRELATION

Variance rules

1. If Y = V + W,
var(Y) = var(V) + var(W) + 2cov(V, W).

Proof:
var(Y) = cov(Y, Y) = cov([V + W], Y)
= cov(V, Y) + cov(W, Y)
= cov(V, [V + W]) + cov(W, [V + W])
= cov(V, V) + cov(V,W) + cov(W, V) + cov(W, W)
= var(V) + 2cov(V, W) + var(W)

We start by replacing one of the Y arguments by V + W.

130
COVARIANCE, COVARIANCE AND VARIANCE RULES, AND CORRELATION

Variance rules

1. If Y = V + W,
var(Y) = var(V) + var(W) + 2cov(V, W).

Proof:
var(Y) = cov(Y, Y) = cov([V + W], Y)
= cov(V, Y) + cov(W, Y)
= cov(V, [V + W]) + cov(W, [V + W])
= cov(V, V) + cov(V,W) + cov(W, V) + cov(W, W)
= var(V) + 2cov(V, W) + var(W)

We then use covariance rule 1.

131
COVARIANCE, COVARIANCE AND VARIANCE RULES, AND CORRELATION

Variance rules

1. If Y = V + W,
var(Y) = var(V) + var(W) + 2cov(V, W).

Proof:
var(Y) = cov(Y, Y) = cov([V + W], Y)
= cov(V, Y) + cov(W, Y)
= cov(V, [V + W]) + cov(W, [V + W])
= cov(V, V) + cov(V,W) + cov(W, V) + cov(W, W)
= var(V) + 2cov(V, W) + var(W)

We now substitute for the other Y argument in both terms and use covariance rule 1 a
second time.
132
COVARIANCE, COVARIANCE AND VARIANCE RULES, AND CORRELATION

Variance rules

1. If Y = V + W,
var(Y) = var(V) + var(W) + 2cov(V, W).

Proof:
var(Y) = cov(Y, Y) = cov([V + W], Y)
= cov(V, Y) + cov(W, Y)
= cov(V, [V + W]) + cov(W, [V + W])
= cov(V, V) + cov(V,W) + cov(W, V) + cov(W, W)
= var(V) + 2cov(V, W) + var(W)

This gives us the result. Note that the order of the arguments does not affect a covariance
expression and hence cov(W, V) is the same as cov(V, W).
133
COVARIANCE, COVARIANCE AND VARIANCE RULES, AND CORRELATION

Variance rules

2. If Y = bZ, where b is a constant,


var(Y) = b2var(Z).

Proof:
var(Y) = cov(Y, Y) = cov(bZ, bZ)
= b2cov(Z, Z)
= b2var(Z).

The proof of variance rule 2 is even more straightforward. We start by writing var(Y) as
cov(Y, Y). We then substitute bZ for both of the Y arguments and take the b terms outside as
common factors. 134
COVARIANCE, COVARIANCE AND VARIANCE RULES, AND CORRELATION

Variance rules

3. If Y = b, where b is a constant,
var(Y) = 0.

Proof:
var(Y) = cov(b, b) = 0.

The third rule is trivial. We make use of covariance rule 3. Obviously if a variable is
constant, it has zero variance.
135
COVARIANCE, COVARIANCE AND VARIANCE RULES, AND CORRELATION

Variance rules

4. If Y = V + b, where b is a constant,
var(Y) = var(V).

Proof:
var(Y) = var(V) + 2cov(V, b) + var(b)
= var(V)

The fourth variance rule starts by using the first. The second term on the right side is zero
by covariance rule 3. The third is also zero by variance rule 3.
136
COVARIANCE, COVARIANCE AND VARIANCE RULES, AND CORRELATION

Variance rules

4. If Y = V + b, where b is a constant,
var(Y) = var(V).

Proof:
var(Y) = var(V) + 2cov(V, b) + var(b)
= var(V)

[Figure: the distribution of V, centred on μV, and the distribution of V + b, shifted to be centred on μV + b]

The intuitive reason for this result is easy to understand. If you add a constant to a
variable, you shift its entire distribution by that constant. The expected value of the
squared deviation from the mean is unaffected. 137
COVARIANCE, COVARIANCE AND VARIANCE RULES, AND CORRELATION

Correlation

ρXY = σXY / √(σX² σY²)

cov(X, Y) is unsatisfactory as a measure of association between two variables X and Y


because it depends on the units of measurement of X and Y.
138
COVARIANCE, COVARIANCE AND VARIANCE RULES, AND CORRELATION

Correlation

ρXY = σXY / √(σX² σY²)

A better measure of association is the population correlation coefficient because it is


dimensionless. The numerator possesses the units of measurement of both X and Y.
139
COVARIANCE, COVARIANCE AND VARIANCE RULES, AND CORRELATION

Correlation

ρXY = σXY / √(σX² σY²)

The variances of X and Y in the denominator possess the squared units of measurement of
those variables.
140
COVARIANCE, COVARIANCE AND VARIANCE RULES, AND CORRELATION

Correlation

ρXY = σXY / √(σX² σY²)

However, once the square root has been taken into account, the units of measurement are
the same as those of the numerator, and the expression as a whole is unit free.
141
COVARIANCE, COVARIANCE AND VARIANCE RULES, AND CORRELATION

Correlation

ρXY = σXY / √(σX² σY²)

If X and Y are independent, ρXY will be equal to zero because σXY will be zero.

142
COVARIANCE, COVARIANCE AND VARIANCE RULES, AND CORRELATION

Correlation

ρXY = σXY / √(σX² σY²)

If there is a positive association between them, σXY, and hence ρXY, will be positive. If there
is an exact positive linear relationship, ρXY will assume its maximum value of 1. Similarly, if
there is a negative relationship, ρXY will be negative, with minimum value of −1. 143
COVARIANCE, COVARIANCE AND VARIANCE RULES, AND CORRELATION

Correlation

ρXY = σXY / √(σX² σY²)

If X and Y are independent, ρXY will be equal to zero because σXY will be zero. If there is a
positive association between them, σXY, and hence ρXY, will be positive. If there is an exact
positive linear relationship, ρXY will assume its maximum value of 1. Similarly, if there is a
negative relationship, ρXY will be negative, with minimum value of −1.
144
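A simulation sketch illustrating that the correlation is unit-free: rescaling X changes the covariance but leaves ρ unchanged. The simulated relationship between X and Y is illustrative only.

```python
import random

random.seed(1)
n = 200_000

X = [random.gauss(10, 2) for _ in range(n)]
Y = [0.5 * x + random.gauss(0, 1) for x in X]

def cov(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (c - mv) for a, c in zip(u, v)) / len(u)

def corr(u, v):
    return cov(u, v) / (cov(u, u) * cov(v, v)) ** 0.5

# Rescaling X (changing its units) alters the covariance but not the correlation.
X_rescaled = [100 * x for x in X]
print(round(cov(X, Y), 3), round(cov(X_rescaled, Y), 3))
print(round(corr(X, Y), 4), round(corr(X_rescaled, Y), 4))
```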
SAMPLING AND ESTIMATORS

Suppose we have a random variable X and we wish to


estimate its unknown population mean mX.
Planning (beforehand concepts)
Our first step is to take a sample of n observations {X1, ..., Xn}.
Before we take the sample, while we are still at the
planning stage, the Xi are random quantities. We know
that they will be generated randomly from the
distribution for X, but we do not know their values in
advance.
So now we are thinking about random variables on two
levels: the random variable X, and its random sample
components.

145
SAMPLING AND ESTIMATORS

Suppose we have a random variable X and we wish to


estimate its unknown population mean mX.
Planning (beforehand concepts)
Our first step is to take a sample of n observations {X1,
, Xn}.
Before we take the sample, while we are still at the
planning stage, the Xi are random quantities. We know
that they will be generated randomly from the
distribution for X, but we do not know their values in
advance.
So now we are thinking about random variables on two
levels: the random variable X, and its random sample
components.

146
SAMPLING AND ESTIMATORS

Suppose we have a random variable X and we wish to


estimate its unknown population mean mX.
Planning (beforehand concepts)
Our first step is to take a sample of n observations {X1,
, Xn}.
Before we take the sample, while we are still at the
planning stage, the Xi are random quantities. We know
that they will be generated randomly from the
distribution for X, but we do not know their values in
advance.
So now we are thinking about random variables on two
levels: the random variable X, and its random sample
components.

147
SAMPLING AND ESTIMATORS

Suppose we have a random variable X and we wish to


estimate its unknown population mean mX.
Realization (afterwards concepts)
Once we have taken the sample we will have a set of
numbers {x1, ..., xn}.
This is called by statisticians a realization. The lower
case is to emphasize that these are numbers, not
variables.

148
SAMPLING AND ESTIMATORS

Suppose we have a random variable X and we wish to


estimate its unknown population mean mX.
Planning (beforehand concepts)
Back to the plan. Having generated a sample of n
observations {X1, , Xn}, we plan to use them with a
mathematical formula to estimate the unknown
population mean mX.
This formula is known as an estimator. In this context,
the standard (but not only) estimator is the sample mean

X̄ = (1/n)(X1 + ... + Xn)
An estimator is a random variable because it depends on
the random quantities {X1, , Xn}.

149
SAMPLING AND ESTIMATORS

Suppose we have a random variable X and we wish to


estimate its unknown population mean mX.
Planning (beforehand concepts)
Back to the plan. Having generated a sample of n
observations {X1, , Xn}, we plan to use them with a
mathematical formula to estimate the unknown
population mean mX.
This formula is known as an estimator. In this context,
the standard (but not only) estimator is the sample mean

X̄ = (1/n)(X1 + ... + Xn)
An estimator is a random variable because it depends on
the random quantities {X1, , Xn}.

150
SAMPLING AND ESTIMATORS

Suppose we have a random variable X and we wish to


estimate its unknown population mean mX.
Realization (afterwards concepts)
The actual number that we obtain, given the realization
{x1, ..., xn}, is known as our estimate.

151
SAMPLING AND ESTIMATORS

[Figure: probability density functions of X and of X̄, both centred on μX]

We will see why these distinctions are useful and important in a comparison of the
distributions of X and X̄. We will start by showing that X̄ has the same mean as X.
152
SAMPLING AND ESTIMATORS

E(X̄) = E[(1/n)(X1 + ... + Xn)]

= (1/n) E(X1 + ... + Xn)

= (1/n) [E(X1) + ... + E(Xn)]

= (1/n) [μX + ... + μX] = μX

We start by replacing X̄ by its definition and then using expected value rule 2 to take 1/n out
of the expression as a common factor.
153
SAMPLING AND ESTIMATORS

E(X̄) = E[(1/n)(X1 + ... + Xn)]

= (1/n) E(X1 + ... + Xn)

= (1/n) [E(X1) + ... + E(Xn)]

= (1/n) [μX + ... + μX] = μX

Next we use expected value rule 1 to replace the expectation of a sum with a sum of
expectations.
154
SAMPLING AND ESTIMATORS

E(X̄) = E[(1/n)(X1 + ... + Xn)]

= (1/n) E(X1 + ... + Xn)

= (1/n) [E(X1) + ... + E(Xn)]

= (1/n) [μX + ... + μX] = μX

Now we come to the bit that requires thought. Start with X1. When we are still at the
planning stage, X1 is a random variable and we do not know what its value will be.
155
SAMPLING AND ESTIMATORS

E(X̄) = E[(1/n)(X1 + ... + Xn)]

= (1/n) E(X1 + ... + Xn)

= (1/n) [E(X1) + ... + E(Xn)]

= (1/n) [μX + ... + μX] = μX

All we know is that it will be generated randomly from the distribution of X. The expected
value of X1, as a beforehand concept, will therefore be mX. The same is true for all the other
sample components, thinking about them beforehand. Hence we write this line.
156
SAMPLING AND ESTIMATORS

E(X̄) = E[(1/n)(X1 + ... + Xn)]

= (1/n) E(X1 + ... + Xn)

= (1/n) [E(X1) + ... + E(Xn)]

= (1/n) [μX + ... + μX] = μX

Thus we have shown that the mean of the distribution of X̄ is μX.

157
SAMPLING AND ESTIMATORS

[Figure: probability density functions of X and of X̄, both centred on μX]

We will next demonstrate that the variance of the distribution of X̄ is smaller than that of X,
as depicted in the diagram.
158
SAMPLING AND ESTIMATORS

σX̄² = var[(1/n)(X1 + ... + Xn)]

= (1/n²) var(X1 + ... + Xn)

= (1/n²) [var(X1) + ... + var(Xn)]

= (1/n²) [σX² + ... + σX²]

= (1/n²) (n σX²) = σX² / n

We start by replacing X̄ by its definition and then using variance rule 2 to take 1/n out of the
expression as a common factor (it comes out squared, as 1/n²).
159
SAMPLING AND ESTIMATORS

σX̄² = var[(1/n)(X1 + ... + Xn)]

= (1/n²) var(X1 + ... + Xn)

= (1/n²) [var(X1) + ... + var(Xn)]

= (1/n²) [σX² + ... + σX²]

= (1/n²) (n σX²) = σX² / n

Next we use variance rule 1 to replace the variance of a sum with a sum of variances. In
principle there are many covariance terms as well, but they are zero if we assume that the
sample values are generated independently. 160
SAMPLING AND ESTIMATORS

σX̄² = var[(1/n)(X1 + ... + Xn)]

= (1/n²) var(X1 + ... + Xn)

= (1/n²) [var(X1) + ... + var(Xn)]

= (1/n²) [σX² + ... + σX²]

= (1/n²) (n σX²) = σX² / n

Now we come to the bit that requires thought. Start with X1. When we are still at the
planning stage, we do not know what the value of X1 will be.
161
SAMPLING AND ESTIMATORS

σX̄² = var[(1/n)(X1 + ... + Xn)]

= (1/n²) var(X1 + ... + Xn)

= (1/n²) [var(X1) + ... + var(Xn)]

= (1/n²) [σX² + ... + σX²]

= (1/n²) (n σX²) = σX² / n

All we know is that it will be generated randomly from the distribution of X. The variance of
X1, as a beforehand concept, will therefore be sX2. The same is true for all the other sample
components, thinking about them beforehand. Hence we write this line. 162
SAMPLING AND ESTIMATORS

σX̄² = var[(1/n)(X1 + ... + Xn)]

= (1/n²) var(X1 + ... + Xn)

= (1/n²) [var(X1) + ... + var(Xn)]

= (1/n²) [σX² + ... + σX²]

= (1/n²) (n σX²) = σX² / n

Thus we have demonstrated that the variance of the sample mean is equal to the variance of
X divided by n, a result with which you will be familiar from your statistics course.
163
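A simulation sketch of these two results: repeated samples of size n from the two-dice distribution have sample means that average about 7, with a variance close to σX²/n. The sample size and number of replications are arbitrary illustrative choices.

```python
import random

random.seed(2)

def draw_x():
    # One draw of X, the sum of two fair dice (population mean 7, variance 5.83).
    return random.randint(1, 6) + random.randint(1, 6)

n = 25          # sample size
reps = 50_000   # number of samples drawn "at the planning stage"

means = [sum(draw_x() for _ in range(n)) / n for _ in range(reps)]

grand_mean = sum(means) / reps
var_of_means = sum((m - grand_mean) ** 2 for m in means) / reps

print(round(grand_mean, 3))                            # close to mu_X = 7
print(round(var_of_means, 4), round(5.8333 / n, 4))    # close to sigma_X^2 / n
```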
UNBIASEDNESS AND EFFICIENCY

Unbiasedness of X̄:

E(X̄) = E[(1/n)(X1 + ... + Xn)] = (1/n) E(X1 + ... + Xn)

= (1/n) [E(X1) + ... + E(Xn)] = (1/n)(n μX) = μX

Much of the analysis in this course will be concerned with three properties of estimators:
unbiasedness, efficiency, and consistency. The first two, treated here, relate to finite
sample analysis: analysis where the sample has a finite number of observations. 164
UNBIASEDNESS AND EFFICIENCY

Unbiasedness of X̄:

E(X̄) = E[(1/n)(X1 + ... + Xn)] = (1/n) E(X1 + ... + Xn)

= (1/n) [E(X1) + ... + E(Xn)] = (1/n)(n μX) = μX

Consistency, a property that relates to analysis when the sample size tends to infinity, is
treated in a later slideshow.
165
UNBIASEDNESS AND EFFICIENCY

Unbiasedness of X̄:

E(X̄) = E[(1/n)(X1 + ... + Xn)] = (1/n) E(X1 + ... + Xn)

= (1/n) [E(X1) + ... + E(Xn)] = (1/n)(n μX) = μX

Suppose that you wish to estimate the population mean mX of a random variable X given a
sample of observations. We will demonstrate that the sample mean is an unbiased
estimator, but not the only one. 166
UNBIASEDNESS AND EFFICIENCY

Unbiasedness of X̄:

E(X̄) = E[(1/n)(X1 + ... + Xn)] = (1/n) E(X1 + ... + Xn)

= (1/n) [E(X1) + ... + E(Xn)] = (1/n)(n μX) = μX

We will start with the proof in the previous sequence. We use the second expected value
rule to take the 1/n factor out of the expectation expression.
167
UNBIASEDNESS AND EFFICIENCY

Unbiasedness of X̄:

E(X̄) = E[(1/n)(X1 + ... + Xn)] = (1/n) E(X1 + ... + Xn)

= (1/n) [E(X1) + ... + E(Xn)] = (1/n)(n μX) = μX

Next we use the first expected value rule to break up the expression into the sum of the
expectations of the observations.
168
UNBIASEDNESS AND EFFICIENCY

Unbiasedness of X̄:

E(X̄) = E[(1/n)(X1 + ... + Xn)] = (1/n) E(X1 + ... + Xn)

= (1/n) [E(X1) + ... + E(Xn)] = (1/n)(n μX) = μX

Thinking about the sample values {X1, ..., Xn} at the planning stage, each expectation is
equal to μX, and hence the expected value of the sample mean, before we actually generate
the sample, is μX. 169
UNBIASEDNESS AND EFFICIENCY

Unbiasedness of X̄:

E(X̄) = E[(1/n)(X1 + ... + Xn)] = (1/n) E(X1 + ... + Xn)

= (1/n) [E(X1) + ... + E(Xn)] = (1/n)(n μX) = μX
Generalized estimator Z = l1X1 + l2X2

However, the sample mean is not the only unbiased estimator of the population mean. We
will demonstrate this supposing that we have a sample of two observations (to keep it
simple). 170
UNBIASEDNESS AND EFFICIENCY

Unbiasedness of X̄:

E(X̄) = E[(1/n)(X1 + ... + Xn)] = (1/n) E(X1 + ... + Xn)

= (1/n) [E(X1) + ... + E(Xn)] = (1/n)(n μX) = μX
Generalized estimator Z = l1X1 + l2X2

We will define a generalized estimator Z which is the weighted sum of the two observations,
l1 and l2 being the weights.
171
UNBIASEDNESS AND EFFICIENCY

Unbiasedness of X̄:

E(X̄) = E[(1/n)(X1 + ... + Xn)] = (1/n) E(X1 + ... + Xn)

= (1/n) [E(X1) + ... + E(Xn)] = (1/n)(n μX) = μX
Generalized estimator Z = l1X1 + l2X2
E(Z) = E(l1X1 + l2X2) = E(l1X1) + E(l2X2)
= l1 E(X1) + l2 E(X2) = (l1 + l2) μX
= μX if (l1 + l2) = 1

We will analyze the expected value of Z and find out what condition the weights have to
satisfy for Z to be an unbiased estimator.
172
UNBIASEDNESS AND EFFICIENCY

Unbiasedness of X̄:

E(X̄) = E[(1/n)(X1 + ... + Xn)] = (1/n) E(X1 + ... + Xn)

= (1/n) [E(X1) + ... + E(Xn)] = (1/n)(n μX) = μX
Generalized estimator Z = l1X1 + l2X2
E(Z) = E(l1X1 + l2X2) = E(l1X1) + E(l2X2)
= l1 E(X1) + l2 E(X2) = (l1 + l2) μX
= μX if (l1 + l2) = 1

We begin by decomposing the expectation using the first expected value rule.

173
UNBIASEDNESS AND EFFICIENCY

Unbiasedness of X̄:

E(X̄) = E[(1/n)(X1 + ... + Xn)] = (1/n) E(X1 + ... + Xn)

= (1/n) [E(X1) + ... + E(Xn)] = (1/n)(n μX) = μX
Generalized estimator Z = l1X1 + l2X2
E(Z) = E(l1X1 + l2X2) = E(l1X1) + E(l2X2)
= l1 E(X1) + l2 E(X2) = (l1 + l2) μX
= μX if (l1 + l2) = 1

Now we use the second expected value rule to bring l1 and l2 out of the expected value
expressions.
174
UNBIASEDNESS AND EFFICIENCY

Unbiasedness of X̄:

E(X̄) = E[(1/n)(X1 + ... + Xn)] = (1/n) E(X1 + ... + Xn)

= (1/n) [E(X1) + ... + E(Xn)] = (1/n)(n μX) = μX
Generalized estimator Z = l1X1 + l2X2
E(Z) = E(l1X1 + l2X2) = E(l1X1) + E(l2X2)
= l1 E(X1) + l2 E(X2) = (l1 + l2) μX
= μX if (l1 + l2) = 1

The expected value of X in each observation, before we generate the sample, is mX.

175
UNBIASEDNESS AND EFFICIENCY

Unbiasedness of X̄:

E(X̄) = E[(1/n)(X1 + ... + Xn)] = (1/n) E(X1 + ... + Xn)

= (1/n) [E(X1) + ... + E(Xn)] = (1/n)(n μX) = μX
Generalized estimator Z = l1X1 + l2X2
E(Z) = E(l1X1 + l2X2) = E(l1X1) + E(l2X2)
= l1 E(X1) + l2 E(X2) = (l1 + l2) μX
= μX if (l1 + l2) = 1

Thus Z is an unbiased estimator of mX if the sum of the weights is equal to one. An infinite
number of combinations of l1 and l2 satisfy this condition, not just the sample mean.
176
UNBIASEDNESS AND EFFICIENCY

[Figure: probability density functions of estimator A and estimator B, both centred on μX]

How do we choose among them? The answer is to use the most efficient estimator, the one
with the smallest population variance, because it will tend to be the most accurate.
177
UNBIASEDNESS AND EFFICIENCY

[Figure: probability density functions of two unbiased estimators, A and B, both centred on μX]

In the diagram, A and B are both unbiased estimators but B is superior because it is more
efficient.
178
UNBIASEDNESS AND EFFICIENCY

Generalized estimator: Z = \lambda_1 X_1 + \lambda_2 X_2

\sigma_Z^2 = \mathrm{var}(\lambda_1 X_1 + \lambda_2 X_2)
           = \mathrm{var}(\lambda_1 X_1) + \mathrm{var}(\lambda_2 X_2) + 2\,\mathrm{cov}(\lambda_1 X_1, \lambda_2 X_2)
           = \lambda_1^2 \sigma_{X_1}^2 + \lambda_2^2 \sigma_{X_2}^2
           = (\lambda_1^2 + \lambda_2^2)\sigma_X^2
           = (\lambda_1^2 + [1 - \lambda_1]^2)\sigma_X^2 \quad \text{if } \lambda_1 + \lambda_2 = 1
           = (2\lambda_1^2 - 2\lambda_1 + 1)\sigma_X^2

We will analyze the variance of the generalized estimator and find out what condition the
weights must satisfy in order to minimize it.
179
UNBIASEDNESS AND EFFICIENCY

Generalized estimator: Z = \lambda_1 X_1 + \lambda_2 X_2

\sigma_Z^2 = \mathrm{var}(\lambda_1 X_1 + \lambda_2 X_2)
           = \mathrm{var}(\lambda_1 X_1) + \mathrm{var}(\lambda_2 X_2) + 2\,\mathrm{cov}(\lambda_1 X_1, \lambda_2 X_2)
           = \lambda_1^2 \sigma_{X_1}^2 + \lambda_2^2 \sigma_{X_2}^2
           = (\lambda_1^2 + \lambda_2^2)\sigma_X^2
           = (\lambda_1^2 + [1 - \lambda_1]^2)\sigma_X^2 \quad \text{if } \lambda_1 + \lambda_2 = 1
           = (2\lambda_1^2 - 2\lambda_1 + 1)\sigma_X^2

The first variance rule is used to decompose the variance.

180
UNBIASEDNESS AND EFFICIENCY

Generalized estimator: Z = \lambda_1 X_1 + \lambda_2 X_2

\sigma_Z^2 = \mathrm{var}(\lambda_1 X_1 + \lambda_2 X_2)
           = \mathrm{var}(\lambda_1 X_1) + \mathrm{var}(\lambda_2 X_2) + 2\,\mathrm{cov}(\lambda_1 X_1, \lambda_2 X_2)
           = \lambda_1^2 \sigma_{X_1}^2 + \lambda_2^2 \sigma_{X_2}^2
           = (\lambda_1^2 + \lambda_2^2)\sigma_X^2
           = (\lambda_1^2 + [1 - \lambda_1]^2)\sigma_X^2 \quad \text{if } \lambda_1 + \lambda_2 = 1
           = (2\lambda_1^2 - 2\lambda_1 + 1)\sigma_X^2

Note that we are assuming that X1 and X2 are independent observations and so their
covariance is zero. The second variance rule is used to bring λ1 and λ2 out of the variance
expressions. 181
UNBIASEDNESS AND EFFICIENCY

Generalized estimator: Z = \lambda_1 X_1 + \lambda_2 X_2

\sigma_Z^2 = \mathrm{var}(\lambda_1 X_1 + \lambda_2 X_2)
           = \mathrm{var}(\lambda_1 X_1) + \mathrm{var}(\lambda_2 X_2) + 2\,\mathrm{cov}(\lambda_1 X_1, \lambda_2 X_2)
           = \lambda_1^2 \sigma_{X_1}^2 + \lambda_2^2 \sigma_{X_2}^2
           = (\lambda_1^2 + \lambda_2^2)\sigma_X^2
           = (\lambda_1^2 + [1 - \lambda_1]^2)\sigma_X^2 \quad \text{if } \lambda_1 + \lambda_2 = 1
           = (2\lambda_1^2 - 2\lambda_1 + 1)\sigma_X^2

The variance of X1, at the planning stage, is σX². The same goes for the variance of X2.

182
UNBIASEDNESS AND EFFICIENCY

Generalized estimator: Z = \lambda_1 X_1 + \lambda_2 X_2

\sigma_Z^2 = \mathrm{var}(\lambda_1 X_1 + \lambda_2 X_2)
           = \mathrm{var}(\lambda_1 X_1) + \mathrm{var}(\lambda_2 X_2) + 2\,\mathrm{cov}(\lambda_1 X_1, \lambda_2 X_2)
           = \lambda_1^2 \sigma_{X_1}^2 + \lambda_2^2 \sigma_{X_2}^2
           = (\lambda_1^2 + \lambda_2^2)\sigma_X^2
           = (\lambda_1^2 + [1 - \lambda_1]^2)\sigma_X^2 \quad \text{if } \lambda_1 + \lambda_2 = 1
           = (2\lambda_1^2 - 2\lambda_1 + 1)\sigma_X^2

Now we take account of the condition for unbiasedness and re-write the variance of Z,
substituting for λ2.
183
UNBIASEDNESS AND EFFICIENCY

Generalized estimator: Z = \lambda_1 X_1 + \lambda_2 X_2

\sigma_Z^2 = \mathrm{var}(\lambda_1 X_1 + \lambda_2 X_2)
           = \mathrm{var}(\lambda_1 X_1) + \mathrm{var}(\lambda_2 X_2) + 2\,\mathrm{cov}(\lambda_1 X_1, \lambda_2 X_2)
           = \lambda_1^2 \sigma_{X_1}^2 + \lambda_2^2 \sigma_{X_2}^2
           = (\lambda_1^2 + \lambda_2^2)\sigma_X^2
           = (\lambda_1^2 + [1 - \lambda_1]^2)\sigma_X^2 \quad \text{if } \lambda_1 + \lambda_2 = 1
           = (2\lambda_1^2 - 2\lambda_1 + 1)\sigma_X^2

The quadratic is expanded. To minimize the variance of Z, we must choose λ1 so as to
minimize the final expression.
184
UNBIASEDNESS AND EFFICIENCY

Generalized estimator: Z = \lambda_1 X_1 + \lambda_2 X_2

\sigma_Z^2 = \mathrm{var}(\lambda_1 X_1 + \lambda_2 X_2)
           = \mathrm{var}(\lambda_1 X_1) + \mathrm{var}(\lambda_2 X_2) + 2\,\mathrm{cov}(\lambda_1 X_1, \lambda_2 X_2)
           = \lambda_1^2 \sigma_{X_1}^2 + \lambda_2^2 \sigma_{X_2}^2
           = (\lambda_1^2 + \lambda_2^2)\sigma_X^2
           = (\lambda_1^2 + [1 - \lambda_1]^2)\sigma_X^2 \quad \text{if } \lambda_1 + \lambda_2 = 1
           = (2\lambda_1^2 - 2\lambda_1 + 1)\sigma_X^2

\frac{d\sigma_Z^2}{d\lambda_1} = 0 \quad \Rightarrow \quad 4\lambda_1 - 2 = 0 \quad \Rightarrow \quad \lambda_1 = \lambda_2 = 0.5
We differentiate with respect to λ1 to obtain the first-order condition.

185
UNBIASEDNESS AND EFFICIENCY

Generalized estimator: Z = \lambda_1 X_1 + \lambda_2 X_2

\sigma_Z^2 = \mathrm{var}(\lambda_1 X_1 + \lambda_2 X_2)
           = \mathrm{var}(\lambda_1 X_1) + \mathrm{var}(\lambda_2 X_2) + 2\,\mathrm{cov}(\lambda_1 X_1, \lambda_2 X_2)
           = \lambda_1^2 \sigma_{X_1}^2 + \lambda_2^2 \sigma_{X_2}^2
           = (\lambda_1^2 + \lambda_2^2)\sigma_X^2
           = (\lambda_1^2 + [1 - \lambda_1]^2)\sigma_X^2 \quad \text{if } \lambda_1 + \lambda_2 = 1
           = (2\lambda_1^2 - 2\lambda_1 + 1)\sigma_X^2

\frac{d\sigma_Z^2}{d\lambda_1} = 0 \quad \Rightarrow \quad 4\lambda_1 - 2 = 0 \quad \Rightarrow \quad \lambda_1 = \lambda_2 = 0.5
The expression is minimized for λ1 = 0.5. It follows that λ2 = 0.5 as well. So we have
demonstrated that the sample mean is the most efficient unbiased estimator, at least in this
example. (Note that the second differential is positive, confirming that we have a minimum.)
186
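The minimization can also be checked numerically. The sketch below (same assumed, illustrative population as before) compares the simulated variance of Z with the theoretical value (2λ1² − 2λ1 + 1)σX² for a grid of weights; both are smallest at λ1 = 0.5, where the variance equals σX²/2, the variance of the sample mean of two observations.

```python
import numpy as np

mu_X, sigma_X = 10.0, 2.0        # assumed population parameters
replications = 200_000
rng = np.random.default_rng(1)

X1 = rng.normal(mu_X, sigma_X, replications)
X2 = rng.normal(mu_X, sigma_X, replications)

for lam1 in [0.1, 0.3, 0.5, 0.7, 0.9]:
    Z = lam1 * X1 + (1.0 - lam1) * X2
    theory = (2 * lam1**2 - 2 * lam1 + 1) * sigma_X**2   # theoretical var(Z)
    print(f"lambda1 = {lam1:.1f}: simulated var(Z) = {Z.var():.3f}, "
          f"theoretical = {theory:.3f}")
```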
UNBIASEDNESS AND EFFICIENCY

[Figure: the variance factor f(λ1) = 2λ1² − 2λ1 + 1 plotted for λ1 between 0 and 1]

Alternatively, we could find the minimum graphically. Here is a graph of the expression as a
function of λ1.
187
UNBIASEDNESS AND EFFICIENCY

[Figure: the variance factor f(λ1) = 2λ1² − 2λ1 + 1 plotted for λ1 between 0 and 1, with its minimum at λ1 = 0.5]

Again we see that the variance is minimized for λ1 = 0.5 and so the sample mean is the most
efficient unbiased estimator.
188
CONFLICTS BETWEEN UNBIASEDNESS AND MINIMUM VARIANCE

[Figure: probability density functions of estimator A (unbiased) and estimator B (biased, with a smaller variance)]

Suppose that you have alternative estimators of a population characteristic θ, one unbiased,
the other biased but with a smaller variance. How do you choose between them?
189
CONFLICTS BETWEEN UNBIASEDNESS AND MINIMUM VARIANCE

[Figure: a loss function plotted against the size of the estimation error, negative or positive]

One way is to define a loss function which reflects the cost to you of making errors, positive
or negative, of different sizes.
190
CONFLICTS BETWEEN UNBIASEDNESS AND MINIMUM VARIANCE

\mathrm{MSE}(Z) = E[(Z - \theta)^2] = \sigma_Z^2 + (\mu_Z - \theta)^2

[Figure: probability density function of estimator B]

A widely-used loss function is the mean square error of the estimator, defined as the
expected value of the square of the deviation of the estimator about the true value of the
population characteristic. 191
CONFLICTS BETWEEN UNBIASEDNESS AND MINIMUM VARIANCE

\mathrm{MSE}(Z) = E[(Z - \theta)^2] = \sigma_Z^2 + (\mu_Z - \theta)^2

[Figure: probability density function of estimator B, with the bias shown as the distance between θ and μZ]

The mean square error involves a trade-off between the variance of the estimator and its
bias. Suppose you have a biased estimator like estimator B above, with expected value μZ.
192
CONFLICTS BETWEEN UNBIASEDNESS AND MINIMUM VARIANCE

\mathrm{MSE}(Z) = E[(Z - \theta)^2] = \sigma_Z^2 + (\mu_Z - \theta)^2

[Figure: probability density function of estimator B, with the bias shown as the distance between θ and μZ]

The mean square error can be shown to be equal to the sum of the variance of the estimator
and the square of the bias.
193
CONFLICTS BETWEEN UNBIASEDNESS AND MINIMUM VARIANCE

\mathrm{MSE}(Z) = E[(Z - \theta)^2]
              = E[(Z - \mu_Z + \mu_Z - \theta)^2]
              = E[(Z - \mu_Z)^2 + (\mu_Z - \theta)^2 + 2(Z - \mu_Z)(\mu_Z - \theta)]
              = E[(Z - \mu_Z)^2] + E[(\mu_Z - \theta)^2] + E[2(Z - \mu_Z)(\mu_Z - \theta)]
              = \sigma_Z^2 + (\mu_Z - \theta)^2 + 2(\mu_Z - \theta) E(Z - \mu_Z)
              = \sigma_Z^2 + (\mu_Z - \theta)^2 + 2(\mu_Z - \theta)(\mu_Z - \mu_Z)
              = \sigma_Z^2 + (\mu_Z - \theta)^2

To demonstrate this, we start by subtracting and adding μZ.

194
CONFLICTS BETWEEN UNBIASEDNESS AND MINIMUM VARIANCE

\mathrm{MSE}(Z) = E[(Z - \theta)^2]
              = E[(Z - \mu_Z + \mu_Z - \theta)^2]
              = E[(Z - \mu_Z)^2 + (\mu_Z - \theta)^2 + 2(Z - \mu_Z)(\mu_Z - \theta)]
              = E[(Z - \mu_Z)^2] + E[(\mu_Z - \theta)^2] + E[2(Z - \mu_Z)(\mu_Z - \theta)]
              = \sigma_Z^2 + (\mu_Z - \theta)^2 + 2(\mu_Z - \theta) E(Z - \mu_Z)
              = \sigma_Z^2 + (\mu_Z - \theta)^2 + 2(\mu_Z - \theta)(\mu_Z - \mu_Z)
              = \sigma_Z^2 + (\mu_Z - \theta)^2

We expand the quadratic using the rule (a + b)² = a² + b² + 2ab, where a = Z − μZ and
b = μZ − θ.
195
CONFLICTS BETWEEN UNBIASEDNESS AND MINIMUM VARIANCE

\mathrm{MSE}(Z) = E[(Z - \theta)^2]
              = E[(Z - \mu_Z + \mu_Z - \theta)^2]
              = E[(Z - \mu_Z)^2 + (\mu_Z - \theta)^2 + 2(Z - \mu_Z)(\mu_Z - \theta)]
              = E[(Z - \mu_Z)^2] + E[(\mu_Z - \theta)^2] + E[2(Z - \mu_Z)(\mu_Z - \theta)]
              = \sigma_Z^2 + (\mu_Z - \theta)^2 + 2(\mu_Z - \theta) E(Z - \mu_Z)
              = \sigma_Z^2 + (\mu_Z - \theta)^2 + 2(\mu_Z - \theta)(\mu_Z - \mu_Z)
              = \sigma_Z^2 + (\mu_Z - \theta)^2

We use the first expected value rule to break up the expectation into its three components.

196
CONFLICTS BETWEEN UNBIASEDNESS AND MINIMUM VARIANCE

\mathrm{MSE}(Z) = E[(Z - \theta)^2]
              = E[(Z - \mu_Z + \mu_Z - \theta)^2]
              = E[(Z - \mu_Z)^2 + (\mu_Z - \theta)^2 + 2(Z - \mu_Z)(\mu_Z - \theta)]
              = E[(Z - \mu_Z)^2] + E[(\mu_Z - \theta)^2] + E[2(Z - \mu_Z)(\mu_Z - \theta)]
              = \sigma_Z^2 + (\mu_Z - \theta)^2 + 2(\mu_Z - \theta) E(Z - \mu_Z)
              = \sigma_Z^2 + (\mu_Z - \theta)^2 + 2(\mu_Z - \theta)(\mu_Z - \mu_Z)
              = \sigma_Z^2 + (\mu_Z - \theta)^2

The first term in the expression is by definition the variance of Z.

197
CONFLICTS BETWEEN UNBIASEDNESS AND MINIMUM VARIANCE

\mathrm{MSE}(Z) = E[(Z - \theta)^2]
              = E[(Z - \mu_Z + \mu_Z - \theta)^2]
              = E[(Z - \mu_Z)^2 + (\mu_Z - \theta)^2 + 2(Z - \mu_Z)(\mu_Z - \theta)]
              = E[(Z - \mu_Z)^2] + E[(\mu_Z - \theta)^2] + E[2(Z - \mu_Z)(\mu_Z - \theta)]
              = \sigma_Z^2 + (\mu_Z - \theta)^2 + 2(\mu_Z - \theta) E(Z - \mu_Z)
              = \sigma_Z^2 + (\mu_Z - \theta)^2 + 2(\mu_Z - \theta)(\mu_Z - \mu_Z)
              = \sigma_Z^2 + (\mu_Z - \theta)^2

(μZ − θ) is a constant, so the second term, (μZ − θ)², is also a constant and its expected value is just itself.

198
CONFLICTS BETWEEN UNBIASEDNESS AND MINIMUM VARIANCE

\mathrm{MSE}(Z) = E[(Z - \theta)^2]
              = E[(Z - \mu_Z + \mu_Z - \theta)^2]
              = E[(Z - \mu_Z)^2 + (\mu_Z - \theta)^2 + 2(Z - \mu_Z)(\mu_Z - \theta)]
              = E[(Z - \mu_Z)^2] + E[(\mu_Z - \theta)^2] + E[2(Z - \mu_Z)(\mu_Z - \theta)]
              = \sigma_Z^2 + (\mu_Z - \theta)^2 + 2(\mu_Z - \theta) E(Z - \mu_Z)
              = \sigma_Z^2 + (\mu_Z - \theta)^2 + 2(\mu_Z - \theta)(\mu_Z - \mu_Z)
              = \sigma_Z^2 + (\mu_Z - \theta)^2

In the third term, (μZ − θ) may be brought out of the expectation, again because it is a
constant, using the second expected value rule.
199
CONFLICTS BETWEEN UNBIASEDNESS AND MINIMUM VARIANCE

\mathrm{MSE}(Z) = E[(Z - \theta)^2]
              = E[(Z - \mu_Z + \mu_Z - \theta)^2]
              = E[(Z - \mu_Z)^2 + (\mu_Z - \theta)^2 + 2(Z - \mu_Z)(\mu_Z - \theta)]
              = E[(Z - \mu_Z)^2] + E[(\mu_Z - \theta)^2] + E[2(Z - \mu_Z)(\mu_Z - \theta)]
              = \sigma_Z^2 + (\mu_Z - \theta)^2 + 2(\mu_Z - \theta) E(Z - \mu_Z)
              = \sigma_Z^2 + (\mu_Z - \theta)^2 + 2(\mu_Z - \theta)(\mu_Z - \mu_Z)
              = \sigma_Z^2 + (\mu_Z - \theta)^2

Now E(Z) is μZ, and E(μZ) is μZ.

200
CONFLICTS BETWEEN UNBIASEDNESS AND MINIMUM VARIANCE

\mathrm{MSE}(Z) = E[(Z - \theta)^2]
              = E[(Z - \mu_Z + \mu_Z - \theta)^2]
              = E[(Z - \mu_Z)^2 + (\mu_Z - \theta)^2 + 2(Z - \mu_Z)(\mu_Z - \theta)]
              = E[(Z - \mu_Z)^2] + E[(\mu_Z - \theta)^2] + E[2(Z - \mu_Z)(\mu_Z - \theta)]
              = \sigma_Z^2 + (\mu_Z - \theta)^2 + 2(\mu_Z - \theta) E(Z - \mu_Z)
              = \sigma_Z^2 + (\mu_Z - \theta)^2 + 2(\mu_Z - \theta)(\mu_Z - \mu_Z)
              = \sigma_Z^2 + (\mu_Z - \theta)^2

Hence the third term is zero and the mean square error of Z is shown to be the sum of the
variance of Z and the bias squared.
201
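A small simulation confirms the decomposition. The sketch below is illustrative: the true value θ, the population, and the deliberately biased estimator (the sample mean shrunk by a factor of 0.9) are all assumptions chosen to create a visible bias.

```python
import numpy as np

theta, sigma_X = 10.0, 2.0       # assumed true value and population s.d.
n, replications = 25, 200_000
rng = np.random.default_rng(2)

samples = rng.normal(theta, sigma_X, size=(replications, n))
Z = 0.9 * samples.mean(axis=1)   # a deliberately biased estimator of theta

mse = np.mean((Z - theta) ** 2)          # E[(Z - theta)^2]
variance = Z.var()                       # sigma_Z^2
bias_sq = (Z.mean() - theta) ** 2        # (mu_Z - theta)^2

print(f"MSE          = {mse:.4f}")
print(f"var + bias^2 = {variance + bias_sq:.4f}")
```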
CONFLICTS BETWEEN UNBIASEDNESS AND MINIMUM VARIANCE

[Figure: probability density functions of estimator A (unbiased, larger variance) and estimator B (biased, smaller variance)]

In the case of the estimators shown, estimator B is probably a little better than estimator A
according to the MSE criterion.
202
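The comparison in the diagram can be made concrete with invented numbers. In the sketch below the variances and bias are assumptions chosen so that A is unbiased with a larger variance and B is biased with a smaller variance; B then wins under the MSE criterion.

```python
# Stylized comparison of two estimators by mean square error.
# All numbers are assumed for illustration; they are not taken from the diagram.

var_A, bias_A = 4.0, 0.0      # estimator A: unbiased, larger variance
var_B, bias_B = 1.5, 1.0      # estimator B: biased, smaller variance

mse_A = var_A + bias_A ** 2   # = 4.0
mse_B = var_B + bias_B ** 2   # = 2.5  -> B preferred under the MSE criterion

print(f"MSE(A) = {mse_A:.2f}, MSE(B) = {mse_B:.2f}")
```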
ESTIMATORS OF VARIANCE, COVARIANCE, AND CORRELATION

Variance:   \mathrm{var}(X) = \sigma_X^2 = E[(X - \mu_X)^2]

We have seen that the variance of a random variable X is given by the expression above.

203
ESTIMATORS OF VARIANCE, COVARIANCE, AND CORRELATION

Variance:   \mathrm{var}(X) = \sigma_X^2 = E[(X - \mu_X)^2]

Estimator:  s_X^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2

Given a sample of n observations, the usual estimator of the variance is the sum of the
squared deviations around the sample mean divided by n − 1, typically denoted sX².
204
ESTIMATORS OF VARIANCE, COVARIANCE, AND CORRELATION

Variance:   \mathrm{var}(X) = \sigma_X^2 = E[(X - \mu_X)^2]

Estimator:  s_X^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2

Since the variance is the expected value of the squared deviation of X about its mean, it
makes intuitive sense to use the average of the sample squared deviations as an estimator.
But why divide by n − 1 rather than by n? 205
ESTIMATORS OF VARIANCE, COVARIANCE, AND CORRELATION

Variance:   \mathrm{var}(X) = \sigma_X^2 = E[(X - \mu_X)^2]

Estimator:  s_X^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2

The reason is that the sample mean is by definition in the middle of the sample, while the
unknown population mean is not, except by coincidence.
206
ESTIMATORS OF VARIANCE, COVARIANCE, AND CORRELATION

Variance:   \mathrm{var}(X) = \sigma_X^2 = E[(X - \mu_X)^2]

Estimator:  s_X^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2

As a consequence, the sum of the squared deviations from the sample mean tends to be
slightly smaller than the sum of the squared deviations from the population mean.
207
ESTIMATORS OF VARIANCE, COVARIANCE, AND CORRELATION

Variance:   \mathrm{var}(X) = \sigma_X^2 = E[(X - \mu_X)^2]

Estimator:  s_X^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2

Hence a simple average of the squared sample deviations is a downwards biased estimator
of the variance. However, the bias can be shown to be a factor of (n − 1)/n. Thus one can
allow for the bias by dividing the sum of the squared deviations by n − 1 instead of n. 208
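A simulation shows the effect of the divisor directly. The sketch below (assumed normal population, small sample size so that the bias is visible) averages the two estimators over many samples: the n-divisor version understates σX² by roughly the factor (n − 1)/n, while the (n − 1)-divisor version is unbiased.

```python
import numpy as np

mu_X, sigma_X = 10.0, 2.0        # assumed population parameters
n, replications = 10, 200_000
rng = np.random.default_rng(3)

samples = rng.normal(mu_X, sigma_X, size=(replications, n))

var_n = samples.var(axis=1, ddof=0)           # divide by n
var_n_minus_1 = samples.var(axis=1, ddof=1)   # divide by n - 1

print(f"true variance                   = {sigma_X**2:.3f}")
print(f"average of the /n estimator     = {var_n.mean():.3f} "
      f"(about (n-1)/n * sigma^2 = {(n - 1) / n * sigma_X**2:.3f})")
print(f"average of the /(n-1) estimator = {var_n_minus_1.mean():.3f}")
```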
ESTIMATORS OF VARIANCE, COVARIANCE, AND CORRELATION

Variance:   \mathrm{var}(X) = \sigma_X^2 = E[(X - \mu_X)^2]

Estimator:  s_X^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2

Covariance: \mathrm{cov}(X, Y) = \sigma_{XY} = E[(X - \mu_X)(Y - \mu_Y)]

Estimator:  s_{XY} = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})

A similar adjustment has to be made when estimating a covariance. For two random variables X
and Y an unbiased estimator of the covariance σXY is given by the sum of the products of the
deviations around the sample means divided by n − 1. 209
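The sketch below (with an invented bivariate sample) computes the estimator exactly as written and checks it against numpy's np.cov, which also divides by n − 1 by default.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
# Assumed illustrative data: Y is related to X plus noise
X = rng.normal(10.0, 2.0, n)
Y = 3.0 + 0.5 * X + rng.normal(0.0, 1.0, n)

# Unbiased covariance estimator: sum of products of deviations, divided by n - 1
s_XY = np.sum((X - X.mean()) * (Y - Y.mean())) / (n - 1)

print(f"manual s_XY = {s_XY:.4f}")
print(f"np.cov      = {np.cov(X, Y)[0, 1]:.4f}")
```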
ESTIMATORS OF VARIANCE, COVARIANCE, AND CORRELATION

Correlation: \rho_{XY} = \frac{\sigma_{XY}}{\sqrt{\sigma_X^2 \sigma_Y^2}}

The population correlation coefficient ρXY for two variables X and Y is defined to be their
covariance divided by the square root of the product of their variances.
210
ESTIMATORS OF VARIANCE, COVARIANCE, AND CORRELATION

Correlation: \rho_{XY} = \frac{\sigma_{XY}}{\sqrt{\sigma_X^2 \sigma_Y^2}}

Estimator:

r_{XY} = \frac{s_{XY}}{\sqrt{s_X^2 s_Y^2}}
       = \frac{\frac{1}{n-1}\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\frac{1}{n-1}\sum (X_i - \bar{X})^2 \cdot \frac{1}{n-1}\sum (Y_i - \bar{Y})^2}}
       = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}

The sample correlation coefficient, rXY, is obtained from this by replacing the covariance and
variances by their estimators.
211
ESTIMATORS OF VARIANCE, COVARIANCE, AND CORRELATION

Correlation: \rho_{XY} = \frac{\sigma_{XY}}{\sqrt{\sigma_X^2 \sigma_Y^2}}

Estimator:

r_{XY} = \frac{s_{XY}}{\sqrt{s_X^2 s_Y^2}}
       = \frac{\frac{1}{n-1}\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\frac{1}{n-1}\sum (X_i - \bar{X})^2 \cdot \frac{1}{n-1}\sum (Y_i - \bar{Y})^2}}
       = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}

The 1/(n − 1) terms in the numerator and the denominator cancel and one is left with a
straightforward expression.
212
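Finally, the sketch below (reusing the same kind of invented data) evaluates rXY from this last expression and checks that it matches np.corrcoef.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50
X = rng.normal(10.0, 2.0, n)                      # assumed illustrative data
Y = 3.0 + 0.5 * X + rng.normal(0.0, 1.0, n)

dx, dy = X - X.mean(), Y - Y.mean()
# r_XY = sum of cross-deviations / sqrt(product of the sums of squared deviations)
r_XY = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

print(f"manual r_XY = {r_XY:.4f}")
print(f"np.corrcoef = {np.corrcoef(X, Y)[0, 1]:.4f}")
```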
