
CONTENTS

1. Introduction

(a) Structure of the course

2. Probability Theory

(a) Algebra of sets
(b) Random variable
(c) Distribution function of a random variable
(d) Probability mass and density functions
(e) Conditional probability distribution
(f) Bayes theorem and its applications
(g) More on conditional probability distribution
(h) Mathematical expectation
(i) Bivariate moments
(j) Generating functions
(k) Distribution of a function of a random variable

3. Univariate Discrete and Continuous Distributions

(a) The basic distribution: hypergeometric
(b) Binomial distribution (as a limit of hypergeometric)
(c) Poisson distribution (as a limit of binomial)
(d) Normal distribution
(e) Properties of normal distribution
(f) Distributions derived from normal (χ², t and F)

Introduction to Statistics for Econometricians

by Anil K. Bera
1.1 Introduction.

If you look around, you will notice the world is full of uncertainty. Even with the enormous amount of past information, we can never tell the exact weather condition of tomorrow. The same is true for many economic variables, such as stock prices, exchange rates, inflation, unemployment, interest rates, mortgage rates, etc. [If you knew the exact future price of a major stock, you could make a million! In that case, of course, you wouldn't be taking this course.] Then, what is the role of statistics in this uncertain world? The basic foundation of statistics is the idea that there is an underlying principle or common rule in the midst of all the chaos and irregularities. Statistics is a science to formulate these common rules in a systematic way. Econometrics is that field of science which deals with the application of statistics to economics. Statistics is applicable to all branches of science and the humanities. You might have heard of fields like sociometry, psychometry, cliometrics and biometrics. These are applications of statistics to sociology, psychology, history and biology, respectively. The application of statistics in economics is somewhat controversial, since unlike the physical or biological sciences, in economics we can't conduct purely random experiments. In most cases what we have is historical data on certain economic variables. For all practical purposes, we can view these data as the result of some random experiment and then use statistical tools to analyze the data. For example, regarding stock price movements, based on the available data we can try to find the underlying probability distribution. This distribution will depend on some unknown parameters which can be estimated using the data. We can also test hypotheses regarding the parameters, or we can even test whether the (assumed) probability distribution is valid or not.

Just like any other science, in statistics there are many approaches: Classical and Bayesian, parametric and nonparametric, etc. These are not always substitutes for each other, and in many cases they can be successfully used as complements of each other. However, in this course we will concentrate on the classical parametric approach.


2.1 Basic Set Theory.

The objective of econometrics is "advancement of economic theory in its relation to statistics and mathematics." This course is concerned with the "statistics" part, and the foundation of statistics is in probability theory. Again, "probability" is defined for events, and these events can be described as sets or subsets. To see the link:

Econometrics → Statistics → Probability → Event → Set.

Let us start with a definition of a "set".

Definition 2.1.1: A set is any (well defined) collection of objects.

Example 2.1.1:

i) C = {1, 2, 3, ...}, set of all positive integers.

ii) D = {2, 4, 6, ...}, set of all positive even integers.

iii) F = {Students attending Econ 472}

iv) G = {Students attending Econ 402}

An object in a set is an element; e.g., 1 is an element of the set C. We will denote this as 1 ∈ C, where "∈" means "belongs to." Note that the set C contains more elements than the set D. We can say that D is a "subset" of C and will denote this as D ⊂ C. Formally,

Definition 2.1.2: Set B is a subset of set A, denoted by B ⊂ A, if x ∈ B implies x ∈ A.

You know that with real numbers we can do a lot of operations like addition (+), subtraction (-), multiplication (×), etc. Similar operations can also be done with sets; e.g., we can "add" (sort of) two sets, subtract one set from another, etc. Two very important operations are "union" and "intersection" of sets.

Definition 2.1.3: Union of two sets A and B is C, defined by C = A ∪ B, if

C = {x | x ∈ A and/or x ∈ B}.


In other words, by union of two sets we mean the collection of elements which belong to at least one of the two sets.

Example 2.1.2: In Example 2.1.1, C ∪ D = C. If we define another set E = {1, 3, 5, ...}, set of all positive odd integers, then C = D ∪ E.

The operation can be defined for more than two sets. Suppose we have n sets A_1, A_2, ..., A_n. Then A_1 ∪ A_2 ∪ ... ∪ A_n, denoted by ∪_{i=1}^n A_i, is defined as

∪_{i=1}^n A_i = {x | x ∈ at least one A_i, i = 1, 2, ..., n}.

In a similar fashion, we can define the union of an infinite (but countable) number of sets A_1, A_2, A_3, ... as ∪_{i=1}^∞ A_i = A_1 ∪ A_2 ∪ A_3 ∪ ....

Example 2.1.3: Let A_i = {i}, i.e., A_1 = {1}, A_2 = {2}, ... etc. Then ∪_{i=1}^∞ A_i = C of Example 2.1.1. Or let A_i = [-i, i], an interval in the real line R; then ∪_{i=1}^∞ A_i = R.

The next concept we discuss is "intersection". The intersection of two sets A and B, denoted by A ∩ B, consists of all the common elements of A and B. Formally,

Definition 2.1.4: Intersection of two sets A and B is C, denoted by C = A ∩ B, if C = {x | x ∈ A and x ∈ B}.

Example 2.1.4: In Example 2.1.1, C ∩ D = D, and F ∩ G = {students attending both Econ 472 and 402}.

As in the case of "union", we can also define the operation "∩" for more than two sets. For example,

∩_{i=1}^n A_i = A_1 ∩ A_2 ∩ ... ∩ A_n = {x | x ∈ A_i for all i = 1, 2, ..., n}

∩_{i=1}^∞ A_i = A_1 ∩ A_2 ∩ A_3 ∩ ... = {x | x ∈ A_i for all i = 1, 2, 3, ...}

It is easy to represent the above two concepts diagrammatically in a so-called Venn diagram [see Figure 2.1.1].

[Figure 2.1.1]

[Figure 2.1.2]

Continuing with Example 2.1.1, suppose those students taking Econ 472 have already attended Econ 402, i.e., there is no student in the Econ 472 class who is taking Econ 402 now. Then if we talk about F ∩ G, the set will be empty. We will call such a set a null set and will denote it by φ. By definition, for any set A, A ∪ φ = A and A ∩ φ = φ.

Example 2.1.5: In Examples 2.1.1 and 2.1.2, D ∩ E = φ.

Earlier we noted, in Example 2.1.1, D ⊂ C. Now remove the elements of D from the set C; what we are left with is the set E in Example 2.1.2. We write this as E = C - D, i.e., the difference between sets consists of the elements of one set after removing the elements that belong to the other set. Formally,

Definition 2.1.5: The difference between two sets A and B, denoted by A - B, is defined as C = A - B = {x | x ∈ A and x ∉ B}. Note, "∉" means "does not belong to". In a Venn diagram, A - B can be represented as in Figure 2.1.2.

Now it is clear that a set consists of elements satisfying certain properties. We can imagine a big set which consists of elements with very little restriction. For example, in Example 2.1.1, regarding sets C and D, we can think of R, the set of all real numbers. We will vaguely call such a big (reference) set a space and will denote it by S. Note here, C ⊂ S, D ⊂ S. So let S = R. Define Q = {set of all rational numbers}; then S - Q = {set of all irrational numbers}. Another way to think about S - Q is as the "complement" of Q in S, which is denoted as Q^c|S.

Definition 2.1.6: The complement of a set A with respect to a space S is denoted by

A^c|S = {x ∈ S | x ∉ A}.

In most cases, the reference set S will be obvious from the context; we will then omit S from the notation and write A^c|S simply as A^c.

Example 2.1.6: In Examples 2.1.1 and 2.1.2, D^c|C = E and E^c|C = D. See the Venn diagrams in Figure 2.1.3.

Consider the identity (A ∪ B)^c = A^c ∩ B^c. Even without a diagram, we can easily prove this. The trick is: if we want to show that a set C is equal to another set D, show the following:

[Figure 2.1.3]

If for every x, x ∈ C then x ∈ D ⟹ C ⊂ D.

If for every x, x ∈ D then x ∈ C ⟹ D ⊂ C.

Combine these two and obtain C = D.

Let us prove the above identity. Let x ∈ (A ∪ B)^c, so x ∉ (A ∪ B), i.e., x ∉ A and x ∉ B. In other words, x ∈ A^c and x ∈ B^c, i.e., x ∈ A^c ∩ B^c. Therefore, (A ∪ B)^c ⊂ A^c ∩ B^c. Next assume x ∈ A^c ∩ B^c; reversing the above arguments, we see that x ∈ (A ∪ B)^c. So we have A^c ∩ B^c ⊂ (A ∪ B)^c. Hence (A ∪ B)^c = A^c ∩ B^c.

Now try to prove the following identity:

(A ∩ B)^c = A^c ∪ B^c.

These identities are known as De Morgan's laws. Try to prove the following generalizations:

(∪_i A_i)^c = ∩_i A_i^c and (∩_i A_i)^c = ∪_i A_i^c.

Let us now link up set theory with the concepts of "event" and "probability". Suppose we throw one coin twice. The coin has two sides, head (H) and tail (T). What are the possible outcomes?

Both tails (T T)

Both heads (H H)

Tail head (T H)

Head tail (H T).

Collect these together in a set Ω = {(TT), (HH), (TH), (HT)}; this is the collection of all possible outcomes. We may be interested in the following special outcomes:

A_1 = {outcomes with first head} = {(HH), (HT)}.

A_2 = {outcomes with at least one head} = {(HH), (HT), (TH)}.

A_3 = {outcomes with no tail} = {(HH)}.


A_1, A_2, A_3, ... are all events, and note A_1, A_2, A_3 ⊂ Ω. We can think of a collection of subsets of Ω, and a particular event will be an element of that collection. Under this framework we can define the probabilities of different events.

So far we have considered sets which are collections of single elements; e.g., we had a set C = {1, 2, 3, ...}. We can think of a set whose elements are also sets, i.e., a set of sets. We can call this a collection or a class of sets. By giving a different structure to this class of sets, we can define many concepts, such as ring and field. For our future purpose, all we need is the concept of a σ-field (sigma-field). This will be denoted by A (script A). A σ-field is nothing but a collection of sets A_1, A_2, A_3, ... satisfying the following properties:

(i) A_1, A_2, ... ∈ A ⟹ ∪_{i=1}^∞ A_i ∈ A.

(ii) If A ∈ A then A^c ∈ A.

In other words, A is closed under the formation of countable unions and under complementation. From the above two conditions, it is clear that for A to be a σ-field, the null set φ and the space Ω must belong to A.

Example 2.1.7: Ω = {1, 2, 3, 4}. A σ-field on Ω can be written as

A = {φ, {1, 2}, {3, 4}, {1, 2, 3, 4}}.

Example 2.1.8:

Ω = R (real line)
A = {countable unions of intervals like (a, b]}

A is called the Borel field, and members of A are called Borel sets in R.
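To make the closure requirements concrete, here is a minimal sketch that checks, by brute force, whether a finite collection of subsets is a σ-field on a finite Ω; the collection from Example 2.1.7 is the input, and the function name `is_sigma_field` is our own, not from any library.

```python
from itertools import combinations

def is_sigma_field(omega, collection):
    """Brute-force check of the sigma-field axioms on a finite space."""
    sets = [frozenset(s) for s in collection]
    # closure under complementation
    for s in sets:
        if frozenset(omega) - s not in sets:
            return False
    # closure under unions of every sub-collection (finite = countable here)
    for r in range(1, len(sets) + 1):
        for combo in combinations(sets, r):
            if frozenset().union(*combo) not in sets:
                return False
    return True

omega = {1, 2, 3, 4}
A = [set(), {1, 2}, {3, 4}, {1, 2, 3, 4}]             # Example 2.1.7
print(is_sigma_field(omega, A))                        # True
print(is_sigma_field(omega, [set(), {1, 2}, omega]))   # False: {3,4} is missing
```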

2.2 Random Variable.

As you can guess, the word "random" is associated with some sort of uncertainty. If we toss a coin, we know the possibilities: head (H) or tail (T); but we are uncertain about exactly which one will appear. Therefore, "tossing a coin" can be regarded as a random experiment where the possibilities are known but not the exact outcome. In probability theory, the collection of all possible outcomes is known as the sample space.

Example 2.2.1:

(i) Toss a coin. The sample space is Ω = {H, T}.

(ii) Toss two coins or one coin twice; the sample space is

Ω = {(HH), (TT), (HT), (TH)}.

(iii) Throw a die,

Ω = {(·), (:), (:·), (::), (::·), (:::)}, the six faces of the die.

Instead of assigning symbols, we can give these outcomes some numbers (real numbers). For example, for Example (i) above, we can define

X = 0 if the outcome is T
  = 1 if the outcome is H.

For Example (iii) above, X can take values 1, 2, 3, 4, 5, 6. An X defined in such a way is called a random variable. Once a random variable is defined, we can talk about the probability distribution of the random variable.

Let us first formally define "probability". For Example (i), we have the sample space Ω = {H, T}. The σ-field defined on Ω is A = {φ, Ω, (H), (T)}. Elements of A are called the events. "Probability" is nothing but an assignment of real numbers (satisfying some conditions) to each of these events.

Definition 2.2.1: Probability, denoted P, is a function from A to [0, 1],

P : A → [0, 1],

satisfying the following axioms:

(i) P(Ω) = 1.

(ii) If A_1, A_2, A_3, ... ∈ A are disjoint (i.e., A_i ∩ A_j = φ for all i ≠ j), then

P(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ P(A_i).


Example 2.2.2:

Ω = {H, T}

A = {φ, Ω, (H), (T)}

P(φ) = 0, P(Ω) = 1, P(H) = 1/2, P(T) = 1/2.


Earlier we indicated that a random variable can be defined by assigning real numbers to the elements of Ω. Now define a σ-field on the real line R and denote it by B. Formally, we can define a random variable X as

Definition 2.2.2: A random variable X is a function X : Ω → R such that for all B ∈ B, X^{-1}(B) ∈ A.

Note that here X^{-1}(B) = {ω ∈ Ω | X(ω) ∈ B}. For a diagrammatic representation of the random variable X as a function, see Figure 2.2.1.

In other words, X(·) is a measurable function from the sample space to the real line. "Measurability" is defined by requiring that the inverse image under X of any Borel set is an element of the σ-field, i.e., an event. Recall that probability is defined only for events. By requiring that X is measurable, in a sense, we are assuring its probability distribution.

Example 2.2.3: Toss a coin twice; then the sample space Ω and a σ-field A can be defined as

Ω = {(HH), (TT), (HT), (TH)}

A = {φ, Ω, {(HH)}, {(TT)}, {(HT)}, {(TH)}, {(HH),(TT)}, {(TT),(HT)}, {(HT),(TH)}, {(HH),(TH)}, {(HH),(HT)}, {(TT),(TH)}, {(HH),(TT),(HT)}, {(TT),(HT),(TH)}, {(HH),(TT),(TH)}, {(HH),(HT),(TH)}}.

Define X = number of heads. Then X takes 3 values:

X = 0, 1, 2.

[Figure 2.2.1]

First assign the following probabilities:

P(HH) = 1/4, P(TT) = 1/4, P(HT) = 1/4, P(TH) = 1/4.

The triplet (Ω, A, P) is called a probability space, and P(A) is the probability of the event A.

Corresponding to (Ω, A, P), there exists another probability space (R, B, P^X), where P^X is defined as

P^X(B) = P[X^{-1}(B)] for B ∈ B.

In the above example, take B = {1}; then

P^X({1}) = P[X^{-1}({1})]
         = P[(HT), (TH)]
         = P[(HT) ∪ (TH)]
         = P(HT) + P(TH)   (why?)
         = 1/4 + 1/4 = 1/2.

Similarly, we can show that

P^X({0}) = 1/4 and P^X({2}) = 1/4.

P^X(·) is called the probability measure induced by X. To summarize, we have defined two functions:

X : Ω → R
P^X : B → [0, 1],

where B is a σ-field defined on R [see Example 2.1.8].

For the above example, the two functions can be described as

ω            X(ω)   P^X
(TT)          0     1/4
(HT), (TH)    1     1/2
(HH)          2     1/4


The last two columns describe the probability distribution of the random
variable X. Sometimes we will simply denote it by P(X).

x P(X)
0 1/4
1 1/2
2 1/4

Most of the time, probability distributions (of discrete random variables) are presented this way. From the above discussion, it is clear that each such probability distribution originates from an Ω, the sample space of a random experiment.

Definition 2.2.3: The listing of the values along with the corresponding probabilities is called the probability distribution of a random variable.

Note: Strictly speaking, this definition applies to "discrete" random variables only. Later, we will define "discrete" and "continuous" random variables.

2.3 Distribution Function of a Random Variable.

Sometimes it is also called the cumulative probability distribution and is denoted by F(·). Let us denote by "x" the value(s) X can take; then F(·) is simply defined as

F(x) = Probability of the event X ≤ x = Pr(X ≤ x).

Note: We will use "Pr(·)" to denote the probability of an event without defining the set explicitly, and P(·) or P^X(·) when the set is explicitly stated in the argument. Also note that the probability spaces for P and P^X are, respectively, (Ω, A, P) and (R, B, P^X).

Let us now provide a formal definition of the distribution function. Let

W(x) = {ω ∈ Ω | X(ω) ≤ x}.

Since X is measurable, W(x) ∈ A. In the probability space (R, B, P^X), we can write the probability of W(x) as

P(W(x)) = P^X[(-∞, x]].


This is well defined since (-∞, x] ∈ B. This probability is called the distribution function of X, i.e.,

F(x) = Pr(X ≤ x) = P(W(x)) = P^X[(-∞, x]].

For our example:

ω            x    P^X = Pr(X = x)   F(x) = Pr(X ≤ x)
(TT)         0    1/4               1/4
(HT), (TH)   1    1/2               1/4 + 1/2 = 3/4
(HH)         2    1/4               3/4 + 1/4 = 1

Or simply

x    F(x)
0    1/4
1    3/4
2    1

If we plot it, F(x) will look as in Figure 2.3.1. Note that it is a step function. Also notice the discontinuities at x = 0, 1 and 2.
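As a quick illustration of how the induced distribution and its step-function CDF arise from the underlying sample space, here is a small Python sketch for the two-coin-toss example (all names are our own):

```python
from fractions import Fraction

# Sample space of two tosses, each point with probability 1/4
omega = ["HH", "TT", "HT", "TH"]
P = {w: Fraction(1, 4) for w in omega}

X = lambda w: w.count("H")           # X = number of heads

# Induced probability measure P^X: collect mass over inverse images
pmf = {}
for w in omega:
    pmf[X(w)] = pmf.get(X(w), Fraction(0)) + P[w]

# Distribution function F(x) = Pr(X <= x): a right-continuous step function
F = lambda x: sum(p for v, p in pmf.items() if v <= x)

print(pmf)                                     # {2: 1/4, 0: 1/4, 1: 1/2}
print([F(x) for x in (-1, 0, 0.5, 1, 2, 3)])   # [0, 1/4, 1/4, 3/4, 1, 1]
```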

2.3.1 Properties of the Distribution Function.

(i) 0 ≤ F(x) ≤ 1. Since F(x) is nothing but a probability, the result follows from the definition of probability.

(ii) F(x) is a nondecreasing function of x, i.e., if x_1 > x_2, then F(x_1) ≥ F(x_2).

Proof:

F(x_1) = Pr(X ≤ x_1) = P^X[(-∞, x_1]] = P^X(A_1) (say)

F(x_2) = Pr(X ≤ x_2) = P^X[(-∞, x_2]] = P^X(A_2) (say)

Since A_2 ⊂ A_1, we have

P^X(A_1) ≥ P^X(A_2)  (why?)

i.e., F(x_1) ≥ F(x_2).

[Figure 2.3.1]

(iii) F(-∞) = 0, where F(-∞) = lim_{n→∞} F(-n).

Proof: Define the event

A_n = {ω ∈ Ω | X(ω) ≤ -n}.

Note that P(A_n) = Pr(X ≤ -n) = P^X[(-∞, -n]] = F(-n).

Now lim_{n→∞} A_n = φ, so

F(-∞) = lim_{n→∞} F(-n) = lim_{n→∞} P(A_n)
      = P(lim_{n→∞} A_n)  (why?)
      = P(φ) = 0.  (why?)

Note: The first (why?) follows from the "continuity" property of P(·). It says: if {A_n} is a monotone sequence of events, then P(lim_{n→∞} A_n) = lim_{n→∞} P(A_n). (Try to prove this; see Workout Examples-I, Question 6.)

(iv) F(∞) = 1, where F(∞) = lim_{n→∞} F(n).

The proof is similar to (iii). Define

A_n = {ω ∈ Ω | X(ω) ≤ n}.

F(∞) = lim_{n→∞} P(A_n) = P(lim_{n→∞} A_n) = P(Ω) = 1.

(v) For all x, F(x) is continuous to the right, or right continuous. [What this really means is that F(x + 0) = F(x), where F(x + 0) = lim_{ε↓0} F(x + ε).]

Proof: Define the set

A_n = {ω ∈ Ω | X(ω) ≤ x + 1/n}.

F(x + 1/n) = P(A_n)

lim_{n→∞} F(x + 1/n) = lim_{n→∞} P(A_n) = P(lim_{n→∞} A_n) = P^X[(-∞, x]] = F(x).

F(x + 0) = lim_{ε↓0} F(x + ε) = lim_{n→∞} F(x + 1/n).

Therefore, F(x + 0) = F(x).


We can show that F(x) may not be continuous to the left. i.e., F(x-O) F(x)
where F(x 0) lime:to F(x + c:). To prove this, define

1
Bn = {w E nIX(w) :::; x - -}
n

F(x - 0) = lim F(x -


n-too
~)
n
lim P(Bn)
n-too
P( lim Bn) = pew E QIX(w) < x) = Pr(X < x).
n-too

However,

F(x) Pr(X:::; x) = Pr(X < x) + P(X = x) (why?)

Hence,
F(x) - F(x - 0) Pr(X = x)

Therefore, whenever Pr(X x) > 0, there will be a jump in F(x) at X = x, or


discontinuity at X x. In the Figure 2.3.1, we noted the discontinuity at x = 0,1,
and 2. Also note that

Pr(X = 0) = ~ > 0

Pr(X = 1) = ! >0

Pre X = 2) = ~ > 0

If Pr(X = x) = 0 for all x then F(x) will be continuous since, in that case
F(x) F(x + 0) = F(x - 0) for all x.

2.4 Probability Mass and Density Functions.

Once we have defined the distribution function, we can talk about the "probability mass function" (for discrete variables) and the "probability density function" (for continuous variables).

Let Ω contain a finite (or countably infinite) number of elements. Here by countably infinite we mean in one-to-one correspondence with the set of positive integers, N = {1, 2, 3, ...}. To see an example, consider an experiment of tossing a coin until we get a head. Then Ω = {H, TH, TTH, ...}. If we define X as the number of trials to get a head, then X = 1, 2, 3, .... Denote the sample space as Ω = {ω_1, ω_2, ω_3, ...}. Therefore, Ω contains discrete points. For any event A ∈ A, we define the probability

P(A) = Σ_{ω_i ∈ A} P(ω_i).

A random variable X constructed on such an Ω will also take discrete values. Let us now denote the range of X as 𝒳 and the associated probability space as (𝒳, B, P^X). Therefore, we will have a discrete random variable X with discrete probability distribution P^X. Given that

P^X(𝒳) = 1,

the total mass will be distributed on a discrete number of points. Therefore, sometimes the probability distribution of X, P^X, is called the probability mass function (pmf).

Example 2.4.1:

Ω = {(HH), (TT), (HT), (TH)}

X = # heads.

Then

x    P^X
0    1/4
1    1/2
2    1/4

i.e.,

P^X(𝒳) = Σ_{i=1}^3 Pr(X = x_i) = 1.

Example 2.4.2:

(i) Toss a coin n times and let X = # heads. Then X takes (n+1) values, namely, X = 0, 1, 2, ..., n. The probability distribution of X, with the corresponding points in the sample space, can be written as

ω                                        X      P^X
TTTT...TTT                               0      (1/2)^n
HTTT...TTT, THTT...TTT, ..., TTTT...TTH  1      (1/2)^n each; n points, total n(1/2)^n
HHTT...TTT, THHT...TTT, ..., TTTT...THH  2      (1/2)^n each; C(n,2) points, total C(n,2)(1/2)^n
...                                      ...    ...
THHH...HHH, HTHH...HHH, ..., HHHH...HHT  n-1    (1/2)^n each; n points, total n(1/2)^n
HHHH...HHH                               n      (1/2)^n

So here Pr(X = 1) = n(1/2)^n, Pr(X = 2) = C(n,2)(1/2)^n, and so on, where C(n,k) denotes the binomial coefficient. Later we will derive this probability distribution simply as a special case of the binomial distribution. Check here that if we add P^X over all the values of X, it is equal to one.

(ii) Let us now consider our earlier example of tossing a coin until we get a head, and define X = # of tosses required. Then X will take a countably infinite number of values with the following probability distribution:

ω      x    P^X
H      1    1/2
TH     2    (1/2)²
TTH    3    (1/2)³
...

It is easy to check that here the total probability is equal to 1/2 + (1/2)² + (1/2)³ + ... = 1.

(iii) Now suppose X takes n values, (x_1, x_2, ..., x_n) = {x_i, i = 1, 2, ..., n}. Let

Pr(X = x_i) = p_i, i = 1, 2, ..., n.

The distribution function for this probability mass function is

F(x) = Pr(X ≤ x) = Σ_{x_i ≤ x} p_i.

Any set of p_i's can serve our purpose. All we need is to satisfy the following two conditions (see the sketch below):

(i) p_i ≥ 0 ∀i.

(ii) Σ_i p_i = 1.
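A minimal sketch of this construction: any nonnegative weights summing to one define a valid pmf, and the CDF is the running sum (numpy assumed; the helper names are our own):

```python
import numpy as np

x = np.array([0.0, 1.5, 2.0, 4.0])   # support points x_1 < ... < x_n
w = np.array([3.0, 1.0, 4.0, 2.0])   # any nonnegative weights
p = w / w.sum()                       # normalize: p_i >= 0, sum p_i = 1

def F(t):
    """Distribution function F(t) = sum of p_i over x_i <= t."""
    return p[x <= t].sum()

assert np.all(p >= 0) and np.isclose(p.sum(), 1.0)
print([F(t) for t in (-1, 0, 1.9, 4, 10)])   # steps from 0.0 up to 1.0
```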
As we noted before, when the distribution is discrete there will be jumps in F(x), and therefore it will not be continuous, and hence not differentiable. Now suppose F(x) is continuous and differentiable except at a few points, and

f(x) = dF(x)/dx,

where f(x) is a continuous function (except at a few points). We will then call X a continuous random variable with probability density function (p.d.f.) f(x). Therefore, the relation between f(x) and F(x) can also be written as

F(x) = ∫_{-∞}^x f(t)dt.

Recall F(∞) = 1; therefore

∫_{-∞}^∞ f(x)dx = 1.

Also, we noted earlier that F(x) is nondecreasing; therefore we should have f(x) ≥ 0 ∀x. We define f(x) to be a pdf of a continuous random variable X if the following two conditions are satisfied:

(i) f(x) ≥ 0 ∀x ∈ 𝒳

(ii) ∫_{-∞}^∞ f(x)dx = ∫_𝒳 f(x)dx = 1.

Note: Here 𝒳 denotes the range of X.

For a continuous variable X,

Pr(a ≤ X ≤ b) = Pr[X ≤ b] - Pr[X ≤ a]
             = F(b) - F(a)
             = ∫_{-∞}^b f(x)dx - ∫_{-∞}^a f(x)dx
             = ∫_a^b f(x)dx.

Note that for the discrete case, this probability can be written as

Pr(a ≤ X ≤ b) = Σ_{a ≤ x_i ≤ b} Pr(X = x_i).

When F is continuous, Pr(X = a) = F(a) - F(a-) = 0. Therefore, for the continuous case, Pr(a ≤ X ≤ b) = Pr(a < X ≤ b) = Pr(a ≤ X < b) = Pr(a < X < b). [See Figure 2.4.1.]

[Figure 2.4.1]

Example 2.4.3:

Let F(x) = 0 for x < 0
         = x for x ∈ [0, 1]
         = 1 for x > 1,

as given in Figure 2.4.2.

Here F(x) is "differentiable," therefore we can construct f(x) as

f(x) = 0 for x < 0
     = 1 for x ∈ [0, 1]
     = 0 for x > 1.

Simply, we can write this as [see Figure 2.4.3]

f(x) = 1 for x ∈ [0, 1]
     = 0 elsewhere.

Here X is a continuous random variable; however, note the discontinuities of f(x) at 0 and 1. This distribution is known as the uniform distribution [since for x ∈ [0, 1], f(x) is constant].
So far we have talked about variables which are either discrete or continuous. A random variable, however, could be of mixed type. Let

X = expenditure on cars.

If we assume X is continuous, then Pr(X = 0) = 0. But there will be many individuals who do not have any expenditure on cars. Suppose half of the people do not have any expenditure on cars during a certain period; then it is reasonable to put Pr(X = 0) = 0.5. Suppose we assume F(x) = 0.5 + 0.5(1 - e^{-x}) for x > 0, and F(x) = 0 for x < 0. The corresponding probability function is [see Figure 2.4.4]

Pr(X < 0) = 0
Pr(X = 0) = 0.5
f(x) = 0.5e^{-x} for x > 0.

[Figure 2.4.2]

[Figure 2.4.3]

[Figure 2.4.4]

Note that here f(x) ≥ 0 and the total mass is

Pr(X = 0) + ∫_0^∞ f(x)dx = 0.5 + 0.5 ∫_0^∞ e^{-x}dx = 1.0.

Hence, this is a well defined probability distribution.
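A quick numerical check of this mixed distribution (a sketch; scipy is assumed to be available):

```python
import math
from scipy.integrate import quad

point_mass_at_zero = 0.5                  # Pr(X = 0)
f = lambda x: 0.5 * math.exp(-x)          # continuous part, x > 0

continuous_mass, _ = quad(f, 0, math.inf)  # integrates to 0.5
print(point_mass_at_zero + continuous_mass)  # 1.0: a proper mixed distribution

# F(x) = 0.5 + 0.5(1 - e^{-x}) for x > 0, e.g.:
F = lambda x: 0.0 if x < 0 else 0.5 + 0.5 * (1 - math.exp(-x))
print(F(1.0))                              # about 0.816
```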

2.5 Conditional Probability Distribution.

Let us consider two events A, B ∈ A. We are interested in evaluating the probability of A only for those cases when B also occurs. We will denote this probability as P(A|B) and will assume P(B) > 0. We can treat B as the (total) sample space. First note that

P(A|B) = P(A ∩ B|B).

This is true because when Ω is the sample space,

P(A) = P(A|Ω) = P(A ∩ Ω|Ω).

Here B is our sample space. Also note that P(B|B) = 1. Now,

P(A|B) = P(A ∩ B|B) = P(A ∩ B|B)/P(B|B) = P(A ∩ B|Ω)/P(B|Ω)   (why?)
       = P(A ∩ B)/P(B).

We will write this conditional probability simply as P(A|B) = P(A ∩ B)/P(B). This is called the conditional probability of (event) A given (event) B.

Note: (Above why?) Use the old (classical) definition of probability:

P(A ∩ B|B) = (#cases for A ∩ B)/(#cases for B)
           = (#cases for A ∩ B / #cases in Ω)/(#cases for B / #cases in Ω)
           = P(A ∩ B|Ω)/P(B|Ω).

Example 2.5.1: Let

Ω = {(HH), (TT), (HT), (TH)}

and A = {(HT)}, B = {(HT), (TH)}, so A ∩ B = {(HT)}.

Therefore, P(A) = 1/4, P(B) = 1/2, P(A ∩ B) = 1/4.

Let us first intuitively find the conditional probabilities. For (A|B), we know that either (HT) or (TH) has appeared, and we want to find the probability that (HT) has occurred. Since all the elements of Ω have equal probability, P(A|B) = 1/2. Similarly, P(B|A) = 1, since (HT) has already occurred. Now let us use the formula to get the conditional probabilities:

P(A|B) = P(A ∩ B)/P(B) = (1/4)/(1/2) = 1/2 ≠ P(A)

P(B|A) = P(A ∩ B)/P(A) = (1/4)/(1/4) = 1 ≠ P(B)   (Interpret this result.)

Here the probability of A (or B) changes after it has been given that B (or A) has appeared. In such a case we say that the two events A and B are dependent.

Example 2.5.2: Let us continue with the same sample space

Ω = {(HH), (TT), (HT), (TH)}

but now assume A = {(TT), (HT)}, B = {(HT), (TH)}.

We have A ∩ B = {(HT)}. Therefore, P(A) = 1/2, P(B) = 1/2, P(A ∩ B) = 1/4.

P(A|B) = P(A ∩ B)/P(B) = (1/4)/(1/2) = 1/2 = P(A)   (Interpret this result.)

P(B|A) = P(A ∩ B)/P(A) = (1/4)/(1/2) = 1/2 = P(B)

Therefore, we have P(A|B) = P(A ∩ B)/P(B) = P(A),

i.e., P(A ∩ B) = P(A)·P(B).

In this case, we say that A and B are independent.

Result 2.5.1: Conditional probability satisfies the axioms of probability.

Proof:

(i) P(A|B) = P(A ∩ B)/P(B) ≥ 0.

(ii) P(Ω|B) = P(Ω ∩ B)/P(B) = P(B)/P(B) = 1.

(iii) Let A_1, A_2, A_3, ... be a sequence of disjoint events; then

P(∪_{i=1}^∞ A_i | B) = P(∪_{i=1}^∞ (A_i ∩ B))/P(B)
                     = Σ_{i=1}^∞ P(A_i ∩ B)/P(B)
                     = Σ_{i=1}^∞ P(A_i|B).

Note that the (A_i ∩ B)'s are disjoint, since (A_i ∩ B) ∩ (A_j ∩ B) = A_i ∩ A_j ∩ B = φ for i ≠ j.
Note: Conditional distributions are very useful in many practical applications, such as:

(i) Forecasting: Given data on T periods, X_1, X_2, ..., X_T, if we want to forecast the value in the (T+1)th period, that could be obtained from the conditional distribution P(X_{T+1}|X_1, X_2, ..., X_T).

(ii) Duration dependence: We can consider the conditional probability of getting a job given the duration of unemployment.

(iii) Wage differential: Wage distributions could be different for unionized and non-unionized workers.


2.6 Bayes Theorem and Applications.

As in any other science, in statistics we have many approaches to tackle a particular problem. Two main approaches are the Classical and Bayesian methods. [This is just like two rival approaches in economics: the Keynesian and monetarist approaches.] In the classical approach, all analysis is based on the available data (sample information). In Bayesian analysis we combine the sample information and prior information, and then make an inference. Here we ask the question how our prior notion changes in the light of the sample information. Essentially, we look at the conditional distribution given the sample. This is called the posterior distribution. In Bayesian analysis, statistical inference is based on this posterior distribution, whereas in classical analysis all inferences are drawn from the sample information. We will see that the posterior distribution is nothing but a combination of the prior distribution and the sampling distribution. This combination is achieved by Bayes theorem. It is no exaggeration to say that all of Bayesian analysis is based on this simple Bayes theorem. Let us state and prove this theorem.

Let us have a probability space (Ω, A, P). Let {B_i}, i = 1, 2, ..., n, be n disjoint events in A with P(B_i) > 0 and ∪_{i=1}^n B_i = B. Note that B ∈ A (why?).

Lemma 2.6.1: For any A ∈ A, we have

P(A ∩ B) = Σ_{i=1}^n P(B_i)P(A|B_i).

Proof: We have

A ∩ B = A ∩ (∪_{i=1}^n B_i) = ∪_{i=1}^n (A ∩ B_i).

Note that (A ∩ B_i), i = 1, 2, ..., n, are disjoint events (why?). Therefore,

P(A ∩ B) = Σ_{i=1}^n P(A ∩ B_i) = Σ_{i=1}^n P(B_i)·P(A|B_i)

[by the definition of conditional probability].

Note: In this result, if we put B = Ω, we have

P(A) = P(A ∩ Ω) = Σ_{i=1}^n P(B_i)P(A|B_i).    (1)

Theorem (Bayes) 2.6.1: Let {B_i}, i = 1, 2, ..., n, be n disjoint events in A, i.e., B_i ∩ B_j = φ ∀i ≠ j, and let ∪_{i=1}^n B_i = Ω. Then for any event A ∈ A for which P(A|B_j) is defined,

P(B_j|A) = P(A|B_j)P(B_j) / Σ_{i=1}^n P(A|B_i)P(B_i).    (2)

Proof:

P(B_j|A) = P(B_j ∩ A)/P(A)
         = P(A|B_j)P(B_j)/P(A)   (using the definition of conditional probability)
         = P(A|B_j)P(B_j) / Σ_{i=1}^n P(A|B_i)P(B_i)   (using (1) above).

Now let us discuss a simple application of the Bayes theorem.
Now let us discuss a simple application of the Bayes theorem.

Example 2.6.1: The customer service manager for PROTRAC is responsible for expediting late orders. To do this job effectively, when an order is late he must determine if the lateness is caused by an ordering error or a delivery error. If an order is late, one or the other of these two types of errors must have occurred. Because of the way in which this system is designed, both errors cannot occur on the same order. From past experience, he knows that an ordering error will cause 8 out of 20 deliveries to be late, whereas a delivery error will cause 8 out of 10 deliveries to be late. Historically, out of 1000 orders, 30 ordering errors and 10 delivery errors have occurred.

Assume that an order is late. If the customer service manager wishes to look first for the type of error that has the larger probability of having occurred, should he look for an ordering error or a delivery error?

Solution:

Let

A = the event that an order is late
B_1 = the event that an ordering error is made
B_2 = the event that a delivery error is made
B_3 = the event that no error is made.

The problem is to find the maximum of P(B_1|A) and P(B_2|A). From the data in the problem it seems reasonable to assess the following probabilities: P(B_1) = 0.03, P(B_2) = 0.01, P(B_3) = 0.96, P(A|B_1) = 0.40, P(A|B_2) = 0.80, P(A|B_3) = 0.

From Bayes theorem,

P(B_1|A) = P(A|B_1)P(B_1) / [P(A|B_1)P(B_1) + P(A|B_2)P(B_2) + P(A|B_3)P(B_3)]

P(B_2|A) = P(A|B_2)P(B_2) / [P(A|B_2)P(B_2) + P(A|B_1)P(B_1) + P(A|B_3)P(B_3)].

Since P(A|B_3) = 0, we have

P(B_1|A) = (0.4 × 0.03)/(0.4 × 0.03 + 0.8 × 0.01) = 0.6

P(B_2|A) = (0.8 × 0.01)/(0.8 × 0.01 + 0.4 × 0.03) = 0.4.

Thus the customer service manager should first check on whether an ordering error has been made. Here we should note that P(A|B_2) > P(A|B_1); however, P(B_1|A) > P(B_2|A), i.e., although the probability of being late is higher when there is a delivery error, the probability that a lateness will be caused by a delivery error is lower.
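The arithmetic of these posterior probabilities is easy to script; here is a minimal sketch for the PROTRAC numbers (the variable names are ours):

```python
# Priors and likelihoods from the example
prior = {"ordering": 0.03, "delivery": 0.01, "none": 0.96}
like  = {"ordering": 0.40, "delivery": 0.80, "none": 0.00}   # P(late | error type)

# Bayes theorem: P(B_j | A) = P(A|B_j)P(B_j) / sum_i P(A|B_i)P(B_i)
evidence = sum(like[b] * prior[b] for b in prior)            # P(A) = 0.02
posterior = {b: like[b] * prior[b] / evidence for b in prior}

print(posterior)   # {'ordering': 0.6, 'delivery': 0.4, 'none': 0.0}
```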
Note: Q.7 of Workout Examples-I is almost the same as the above Bayes theorem. The only difference is that in that question we consider a sequence of disjoint events B_1, B_2, B_3, ... ∈ A with ∪_{i=1}^∞ B_i = Ω.

Note: The Bayes theorem was originally proved by Thomas Bayes in 1763 (although many people doubt this, who think Bayes theorem is not due to Bayes!). The theorem did not have much influence until the appearance of the classic book, Harold Jeffreys (1950), "Theory of Probability".

In the Bayes theorem, if we interpret "A" as the sample and "B" as our "prior information", then the theorem tells us how to revise our prior opinion in the light of the occurrence of the sample.

For more on Bayesian inference see:

Berger, J. O. (1985), Statistical Decision Theory and Bayesian Analysis (Springer-Verlag).

Leamer, E. E. (1978), Specification Searches: Ad Hoc Inference with Nonexperimental Data (John Wiley & Sons).

Zellner, A. (1971), An Introduction to Bayesian Inference in Econometrics (John Wiley & Sons).

Rev. T. Bayes (1763), "An Essay Towards Solving a Problem in the Doctrine of Chances," Philosophical Transactions of the Royal Society of London, Vol. 53, pp. 370-418.

2.7 Bivariate Conditional Probability Distribution.

We now want to discuss the conditional probability distribution of one random variable given (in a certain sense) another random variable. Let us first discuss the joint distribution of two random variables. For simplicity, consider our earlier example of tossing a coin twice.

Example 2.7.1: Now define two random variables:

X = # of heads (as before)
Y = # of tails.

It is clear that both X and Y take three values, namely 0, 1, 2. We still have the same sample space Ω, i.e.,

Ω = {(HH), (TT), (HT), (TH)}.

For X and Y, we can think of the following probability distribution:

ω       X(ω)   Y(ω)   P^XY
φ       0      0      0
φ       0      1      0
(TT)    0      2      1/4
φ       1      0      0
(HT)    1      1      1/4
(TH)    1      1      1/4
φ       1      2      0
(HH)    2      0      1/4
φ       2      1      0
φ       2      2      0

We can also present the joint probability distribution of X and Y as

X\Y    0       1       2       P^X
0      0       0       1/4     1/4 = Pr(X = 0)
1      0       1/2     0       1/2 = Pr(X = 1)
2      1/4     0       0       1/4 = Pr(X = 2)
P^Y    1/4     1/2     1/4     1.0
    =Pr(Y=0) =Pr(Y=1) =Pr(Y=2)

The joint probability distribution P^XY is graphically presented in Figure 2.7.1.
In general, we can define

P^X(B) = P[X^{-1}(B)], P^Y(B) = P[Y^{-1}(B)], and P^XY(B_1 × B_2) = P[X^{-1}(B_1) ∩ Y^{-1}(B_2)],

so that P^X and P^Y are induced probabilities defined on (R, B). The joint induced probability P^XY is then defined on (R², B²); therefore, the induced probability space is (R², B², P^XY).

[Figure 2.7.1]

Note that

(X, Y) : Ω → R × R = R².

As we defined in the univariate case, for the bivariate situation the joint distribution function of X and Y can be defined as

F(x, y) = Pr[-∞ < X ≤ x, -∞ < Y ≤ y].

In other words, if [see Figure 2.7.2]

A = (-∞, x] × (-∞, y],

then

F(x, y) = P^XY(A).

2.7.1 Properties of the Bivariate Distribution Function.

Earlier we showed that in the univariate case F(x) is a nondecreasing function, i.e., for a > 0, F(x + a) ≥ F(x). The counterpart of this property for the bivariate case can be stated as:

Property (i): Let a, b > 0; then

F(x + a, y + b) - F(x + a, y) - F(x, y + b) + F(x, y) ≥ 0.

Proof: The left-hand-side quantity is [see Figure 2.7.3]

Pr[x < X ≤ x + a, y < Y ≤ y + b],   (why?)

which is the same as P^XY(B), where

B = (x, x + a] × (y, y + b].

As before, we can prove the following properties:

(i) F(-∞, y) = F(x, -∞) = 0.


[Figure 2.7.2]

(ii) F(∞, ∞) = 1.

(iii) F(x + 0, y + 0) = F(x + 0, y) = F(x, y + 0) = F(x, y) (continuity to the right). F(x, y) may not be continuous to the left (as in the discrete case).

If we denote the distribution functions of X and Y by G(x) and H(y) respectively, then

G(x) = Pr(X ≤ x) = F(x, ∞), and H(y) = Pr(Y ≤ y) = F(∞, y).

Now we are in a position to define conditional distributions involving two random variables. Recall the definition of conditional probability: if A, B ∈ A and P(B) > 0, then the conditional probability of A given B is

P(A|B) = P(A ∩ B)/P(B).

Similarly, for two random variables we can define the "conditional probability" of X ∈ A_1 given Y ∈ A_2, which we will denote as P^{X|Y}(A_1|A_2), as

P^{X|Y}(A_1|A_2) = P^XY(A_1 × A_2)/P^Y(A_2), P^Y(A_2) > 0.

Here by A_1 × A_2 we mean the set {(x, y) | x ∈ A_1, y ∈ A_2}. Similarly,

P^{Y|X}(A_2|A_1) = P^XY(A_1 × A_2)/P^X(A_1), P^X(A_1) > 0.

If P^{X|Y}(A_1|A_2) is the same as P^X(A_1) for all A_1 and A_2, we say X and Y are independent. In that case

P^XY(A_1 × A_2) = P^X(A_1)·P^Y(A_2).

Another way to express this result is:

Theorem 2.7.1: X and Y are independent if and only if

F(x, y) = G(x)·H(y),

where

F(x, y) : joint distribution function of X and Y
G(x) : distribution function of X
H(y) : distribution function of Y.

Proof: Sufficiency: Assume F(x, y) = G(x)·H(y). For any given intervals, we can write [see Figure 2.7.4]

P^XY[(a_1, b_1] × (a_2, b_2]] = F(b_1, b_2) - F(a_1, b_2) - F(b_1, a_2) + F(a_1, a_2)
  = G(b_1)H(b_2) - G(a_1)H(b_2) - G(b_1)H(a_2) + G(a_1)H(a_2)
  = [G(b_1) - G(a_1)]·[H(b_2) - H(a_2)]
  = P^X[(a_1, b_1]]·P^Y[(a_2, b_2]].

Thus we have

P^XY[I_1 × I_2] = P^X[I_1]·P^Y[I_2].

This is true for any intervals I_1 and I_2, and therefore for any A_1 and A_2 ∈ B (why?), i.e.,

P^XY(A_1 × A_2) = P^X(A_1)·P^Y(A_2).

Hence X and Y are independent.

Necessity: Assume X and Y are independent; therefore, for any A_1 and A_2,

P^XY(A_1 × A_2) = P^X(A_1)·P^Y(A_2).

Choose A_1 = (-∞, x], A_2 = (-∞, y].

[Figure 2.7.4]

Then

P^XY[(-∞, x] × (-∞, y]] = P^X[(-∞, x]]·P^Y[(-∞, y]],

i.e., F(x, y) = G(x)·H(y). Q.E.D.

Now let us assume that Ω is discrete. Then both X and Y will be discrete. Suppose X takes values {x_i}, i = 1, 2, 3, ..., and Y takes values {y_j}, j = 1, 2, 3, .... That is, X and Y can take a finite or countably infinite number of values. We denote

p_ij = Pr(X = x_i, Y = y_j).

Then the joint distribution of X and Y can be represented as

X\Y    y_1    y_2    y_3    ...   y_j    ...
x_1    p_11   p_12   p_13   ...   p_1j   ...   p_10 = Σ_{j=1}^∞ p_1j
x_2    p_21   p_22   p_23   ...   p_2j   ...   p_20 = Σ_{j=1}^∞ p_2j
x_3    p_31   p_32   p_33   ...   p_3j   ...   p_30 = Σ_{j=1}^∞ p_3j
...
x_i    p_i1   p_i2   p_i3   ...   p_ij   ...   p_i0
...
       p_01   p_02   p_03   ...   p_0j   ...   1 = Σ_i Σ_j p_ij = Σ_i p_i0 = Σ_j p_0j

This joint distribution should satisfy

(i) p_ij ≥ 0 ∀i and j

(ii) Σ_i Σ_j p_ij = 1.

For the above bivariate discrete distribution we can also define

F(x, y) = Σ_{x_i ≤ x} Σ_{y_j ≤ y} p_ij.

The marginal distribution of X is defined as

Pr(X = x_i) = Pr(X = x_i, Y = y_1 or y_2 or y_3 ...)
            = Σ_{j=1}^∞ P(A_j) = Σ_{j=1}^∞ p_ij = p_i0,

where A_j = {ω ∈ Ω × Ω | X = x_i, Y = y_j}. Similarly,

Pr(Y = y_j) = Σ_{i=1}^∞ p_ij = p_0j.

We define the conditional probability as

Pr(X = x_i | Y = y_j) = p_ij/p_0j.

According to our earlier definition, X and Y are independent if

Pr(X = x_i | Y = y_j) = Pr(X = x_i) = p_i0,

i.e., p_ij/p_0j = p_i0,

i.e., p_ij = p_i0 × p_0j for all i and j.

Example 2.7.1 (continued): The joint distribution of X and Y was given as

X\Y           0      1      2      Mar. Prob. of X: P^X
0             0      0      1/4    1/4
1             0      1/2    0      1/2
2             1/4    0      0      1/4
Mar. Prob.    1/4    1/2    1/4    1.0
of Y: P^Y

We can easily verify that

Pr(X = 1 | Y = 1) = Pr(X = 1, Y = 1)/P(Y = 1) = (1/2)/(1/2) = 1.


Q. Check whether X and Y are independent.

Q. Check F(1, 1) = Pr(X ≤ 1, Y ≤ 1) = 1/2.

Q. Find F(1, 2), F(2, 1), F(2, 2), and F(3, 3). (A numerical sketch follows.)
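Here is a small sketch that stores this joint table as a matrix, computes the marginals and a conditional, and tests independence (numpy assumed; all names are ours):

```python
import numpy as np

# Joint pmf p[i, j] = Pr(X = i, Y = j) for the two-coin example
p = np.array([[0, 0, 0.25],
              [0, 0.50, 0],
              [0.25, 0, 0]])

pX = p.sum(axis=1)            # marginal of X: [1/4, 1/2, 1/4]
pY = p.sum(axis=0)            # marginal of Y: [1/4, 1/2, 1/4]

print(p[1, 1] / pY[1])        # Pr(X=1 | Y=1) = 1.0
print(np.allclose(p, np.outer(pX, pY)))   # False: X and Y are dependent

# F(x, y) = sum of p[i, j] over i <= x, j <= y
F = lambda x, y: p[:x + 1, :y + 1].sum()
print(F(1, 1))                # 1/2
```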

For the continuous case, we define the joint probability density function (pdf) as

f(x, y) = ∂²F(x, y)/∂x∂y,

if F(x, y) is differentiable except on a set of points with measure zero. Conversely,

F(x, y) = ∫_{-∞}^x ∫_{-∞}^y f(u, v) dv du, ∀ x, y ∈ R.

For f(x, y) to be a joint pdf, we should have

f(x, y) ≥ 0

∫_{-∞}^∞ ∫_{-∞}^∞ f(x, y) dx dy = 1.

We can define the marginal distribution of X as

g(x) = ∫_{-∞}^∞ f(x, y) dy.

(Compare this with the definition of the marginal distribution for the discrete case.)

Distribution function of X:

G(x) = ∫_{-∞}^x g(u) du = ∫_{-∞}^x ∫_{-∞}^∞ f(u, y) dy du.

Similarly, the marginal pdf and distribution function of Y are respectively given by

h(y) = ∫_{-∞}^∞ f(x, y) dx and H(y) = ∫_{-∞}^y h(v) dv.

To define the conditional distribution function, we need to be careful, since for the continuous case the probability of a particular point is zero [see Cramer

(1946, p. 268)]. We have, for δ > 0,

Pr[X ≤ x | y - δ ≤ Y ≤ y + δ] = Pr[X ≤ x, y - δ ≤ Y ≤ y + δ] / Pr[y - δ ≤ Y ≤ y + δ]
  = ∫_{-∞}^x ∫_{y-δ}^{y+δ} f(u, v) dv du / ∫_{y-δ}^{y+δ} h(v) dv.

Now if we let δ → 0, we have

lim_{δ→0} (1/2δ) ∫_{y-δ}^{y+δ} h(v) dv = lim_{δ→0} [H(y + δ) - H(y - δ)]/2δ = dH(y)/dy = h(y).

Similarly,

lim_{δ→0} (1/2δ) ∫_{y-δ}^{y+δ} f(u, v) dv = f(u, y).

Therefore, as δ → 0, the above probability reduces to

∫_{-∞}^x f(u, y) du / h(y).

This is defined as the conditional distribution function of X, i.e.,

F(x|y) = ∫_{-∞}^x f(u, y) du / h(y).

By differentiating this we get the conditional pdf of X, i.e.,

f(x|y) = ∂F(x|y)/∂x = f(x, y)/h(y).

Similarly, f(y|x) = f(x, y)/g(x).

We can also show that X and Y are independent iff

f(x, y) = g(x) × h(y)   (why?).

Example 2.7.2: Let

F(x, y) = (xy/2)(x + y), x ∈ [0, 1], y ∈ [0, 1].

Differentiating,

f(x, y) = ∂²F(x, y)/∂x∂y = ∂/∂x [∂F(x, y)/∂y]
        = ∂/∂x [x²/2 + xy] = x + y.

Therefore, we can write the joint pdf of X and Y as

f(x, y) = x + y, x ∈ [0, 1], y ∈ [0, 1]
        = 0 otherwise.

Check that

(i) f(x, y) ≥ 0

(ii) ∫_0^1 ∫_0^1 f(x, y) dx dy = 1.

Now, given f(x, y), we can find F(x, y) by integrating f(x, y), i.e.,

F(x, y) = ∫_0^x ∫_0^y f(u, v) dv du = ∫_0^x ∫_0^y (u + v) dv du
        = ∫_0^x [uv + v²/2]_0^y du = ∫_0^x (uy + y²/2) du
        = [u²y/2 + uy²/2]_0^x = (xy/2)(x + y).

This is where we started.

Check: F(0, 0) = 0, F(1, 1) = 1, etc.

Marginal pdf of X:

g(x) = ∫_0^1 f(x, y) dy = ∫_0^1 (x + y) dy = [xy + y²/2]_0^1 = x + 1/2,

i.e., the marginal pdf of X is

g(x) = x + 1/2, x ∈ [0, 1]
     = 0 otherwise.

Check: g(x) ≥ 0, ∫_0^1 g(x) dx = 1.

Similarly, the marginal pdf of Y is

h(y) = ∫_0^1 f(x, y) dx = y + 1/2.

Conditional pdf of X given Y:

f(x|y) = f(x, y)/h(y) = (x + y)/(y + 1/2).

Note: f(x|y) ≥ 0 and ∫_0^1 f(x|y) dx = ∫_0^1 (x + y)/(y + 1/2) dx = 1   (why?).

In a similar way, the conditional pdf of Y given X is

f(y|x) = (x + y)/(x + 1/2)   (why?).

Note that X and Y are not independent (why?). Given f(x, y) we can also find probabilities like

Pr[0 ≤ X ≤ 1/2, 3/4 ≤ Y ≤ 1] = ∫_{3/4}^1 [∫_0^{1/2} f(x, y) dx] dy

Pr[0 ≤ X ≤ 1/2] = ∫_0^{1/2} g(x) dx

Pr[0 ≤ X ≤ 1/2 | 3/4 ≤ Y ≤ 1] = Pr[0 ≤ X ≤ 1/2, 3/4 ≤ Y ≤ 1] / Pr[3/4 ≤ Y ≤ 1].

Q: Find the above three values.
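These three values are easy to verify symbolically; a sketch with sympy (the library calls are standard, the variable names are ours):

```python
import sympy as sp

x, y = sp.symbols("x y", nonnegative=True)
f = x + y                                   # joint pdf on [0,1] x [0,1]

joint = sp.integrate(f, (x, 0, sp.Rational(1, 2)), (y, sp.Rational(3, 4), 1))
marg  = sp.integrate(x + sp.Rational(1, 2), (x, 0, sp.Rational(1, 2)))
py    = sp.integrate(y + sp.Rational(1, 2), (y, sp.Rational(3, 4), 1))

print(joint)        # Pr[0<=X<=1/2, 3/4<=Y<=1]
print(marg)         # Pr[0<=X<=1/2] = 3/8
print(joint / py)   # the conditional probability
```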

3.1 Mathematical Expectation, Part I.

Consider our experiment of tossing a coin twice. Let us set up the following game. As before, let X = # of heads. If

X = 0, you get $10.00
  = 1, you pay $12.00
  = 2, you get $10.00.

Will you agree to play this game? The answer really depends on whether you expect to lose or gain. The question is how we compute this expectation. Let us formally define a random variable X on the probability space (Ω, A, P) with distribution function F(x). Then the expectation (or mathematical expectation) of X is said to exist if and only if

∫_{-∞}^∞ |x| dF(x) < ∞,

or, in other words, X is an integrable function. The expectation is then defined by

E(X) = ∫_{-∞}^∞ x dF(x).

For a continuous distribution with pdf f(x), this expectation can be written as

E(X) = ∫_{-∞}^∞ x f(x) dx.

[Recall f(x) = dF(x)/dx, i.e., f(x)dx = dF(x).] For a discrete distribution with pmf Pr(X = x_i) = p_i, i = 1, 2, ..., ∞, E(X) can be expressed as

E(X) = Σ_{i=1}^∞ x_i Pr(X = x_i) = Σ_{i=1}^∞ x_i p_i.

For our example, the expected value of playing the game can be calculated using the above formula. Let Z = payoff. Then

Z = 10 with prob. 1/4
  = -12 with prob. 1/2
  = 10 with prob. 1/4.

Hence

E(Z) = 10 × 1/4 - 12 × 1/2 + 10 × 1/4 = -1.

Therefore, on average you would lose $1.00. That is, every time you play this game you are not really going to lose a dollar, but if you play a large number (N) of times, you can expect to lose N dollars overall.
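A sketch that computes E(Z) and backs it up with a simulation of many plays (numpy assumed; names ours):

```python
import numpy as np

payoffs = np.array([10.0, -12.0, 10.0])   # Z for X = 0, 1, 2 heads
probs   = np.array([0.25, 0.50, 0.25])

print(payoffs @ probs)                     # E(Z) = -1.0

# The long-run average payoff over N plays approaches E(Z)
rng = np.random.default_rng(0)
plays = rng.choice(payoffs, size=1_000_000, p=probs)
print(plays.mean())                        # close to -1.0
```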


Some Results on Expectation:

(i) If a random variable X takes the value C with probability 1, then E(X) = C.

Proof: Here X = C w.p. 1, and X ≠ C w.p. 0. Hence E(X) = C·1 = C.

(ii) If E(X) exists and C is a finite constant, then E(CX) exists and E(CX) = C·E(X).

Proof: E(X) exists, therefore ∫|x|dF(x) < ∞. Now E(CX) will exist if ∫|Cx|dF(x) < ∞.

∫|Cx|dF(x) = |C| ∫|x|dF(x) < ∞,

i.e., E(CX) exists, and

E(CX) = ∫ Cx dF(x) = C ∫ x dF(x) = C·E(X).

(iii) Let E(X) and E(Y) exist for two random variables X and Y defined on the same probability space (Ω, A, P). Then E(X + Y) exists and E(X + Y) = E(X) + E(Y).

Proof: We have ∫|x|dF < ∞ and ∫|y|dF < ∞. Since |X + Y| ≤ |X| + |Y|,

∫|x + y|dF ≤ ∫|x|dF + ∫|y|dF < ∞.

Therefore E(X + Y) exists, and

E(X + Y) = ∫(x + y)dF = ∫x dF + ∫y dF = E(X) + E(Y).

(iv) If E(X) exists, then for any finite real numbers a and b, E(a + bX) exists and E(a + bX) = a + bE(X).

Proof: Left as an exercise.

Note: All the above cases can be obtained as special cases of E(aX + bY) = aE(X) + bE(Y).
Now consider a continuous function g(X). Recall X : Ω → R and g : R → R. If we denote the composition by g(X(ω)), then

g(X) : Ω → R.

Now the question is: if X is a random variable, is g(X) a random variable? The answer is yes. Pick a set A ∈ B [recall the σ-field (R, B)] and define the set

{ω | g(X(ω)) ∈ A}
= {ω | X(ω) ∈ g^{-1}(A)}
= {ω | ω ∈ X^{-1}(g^{-1}(A))} ∈ A,

since X is a random variable. Therefore, g(X) is a random variable. Now if E[g(X)] exists, then

E(g(X)) = ∫ g(x)dF(x)
        = ∫ g(x)f(x)dx if X is continuous
        = Σ_{i=1}^∞ g(x_i)Pr(X = x_i) if X is discrete.

3.2 Moments.

Moments are a special kind of expectation. Define g(X) = X^r. This is a measurable function,

g : R → R.

If E(X^r) exists, we call it the rth raw moment of X, or the rth moment around zero, and denote it by μ'_r. Therefore, we have

μ'_r = E(X^r) = ∫ x^r dF(x).

For r = 0: μ'_0 = ∫ dF(x) = ∫ f(x)dx = 1

For r = 1: μ'_1 = ∫ x f(x)dx = E(X)

For r = 2: μ'_2 = E(X²), etc.

Theorem 3.2.1: If r < s, then the existence of μ'_s implies the existence of μ'_r.

Proof: When |x| ≥ 1, |x|^r ≤ |x|^s; and when 0 ≤ |x| < 1, |x|^r ≤ 1. Hence, for all x, |x|^r ≤ |x|^s + 1, so that

∫|x|^r dF(x) ≤ ∫|x|^s dF(x) + 1 < ∞,

and hence E(X^r) exists.

Next we define g(X) = (X - a)^r. If E[g(X)] exists, we call it the rth moment of X around a. If we take a = E(X), then

E[g(X)] = E[X - E(X)]^r = E(X - μ'_1)^r.

If this exists, we call it the rth central moment of X, and we will denote it by μ_r. Note that

μ_0 = 1
μ_1 = 0   (why?)
μ_2 = E(X - μ'_1)² = E(X²) - [E(X)]²   (why?)
    = μ'_2 - μ'_1².

μ_2 is called the variance of X and is denoted by V(X). We can find a relationship between the raw and the central moments.

Theorem 3.2.2: If μ'_r exists, then μ'_r(a) = E(X - a)^r also exists, and

μ'_r(a) = μ'_r - C(r,1)μ'_{r-1}a + C(r,2)μ'_{r-2}a² - ... + (-1)^{r-1}C(r,r-1)μ'_1 a^{r-1} + (-1)^r a^r,

where C(r,k) denotes the binomial coefficient.

Proof: Since μ'_r exists, so do μ'_1, μ'_2, ..., μ'_{r-1}. Now

(x - a)^r = x^r - C(r,1)x^{r-1}a + C(r,2)x^{r-2}a² - ... + (-1)^{r-1}C(r,r-1)x a^{r-1} + (-1)^r a^r.    (1)

This implies

|x - a|^r ≤ |x|^r + C(r,1)|x|^{r-1}|a| + C(r,2)|x|^{r-2}|a|² + ... + C(r,r-1)|x||a|^{r-1} + |a|^r,

∫|x - a|^r dF(x) ≤ ∫|x|^r dF(x) + C(r,1)|a| ∫|x|^{r-1} dF(x) + ... + |a|^r < ∞.

Therefore μ'_r(a) exists. Now, taking expectations of (1), we have

μ'_r(a) = μ'_r - C(r,1)μ'_{r-1}a + C(r,2)μ'_{r-2}a² - ... + (-1)^{r-1}C(r,r-1)μ'_1 a^{r-1} + (-1)^r a^r.

In the above theorem, if we put a = E(X) = μ'_1, then

μ_r = E(X - μ'_1)^r
    = μ'_r - C(r,1)μ'_{r-1}μ'_1 + C(r,2)μ'_{r-2}μ'_1² - ... + (-1)^{r-1}(r-1)μ'_1^r.   (why?)    (2)

In (2), let us put r = 1, 2, 3, ...:

r = 1: μ_1 = μ'_1 - C(1,1)μ'_0 μ'_1 = 0
r = 2: μ_2 = μ'_2 - μ'_1²
r = 3: μ_3 = μ'_3 - 3μ'_2 μ'_1 + 2μ'_1³   (why?)
r = 4: μ_4 = μ'_4 - 4μ'_3 μ'_1 + 6μ'_2 μ'_1² - 3μ'_1⁴   (why?)

Similarly, we can express the raw moments in terms of the central moments.

Similarly, we can express the raw moments in terms of central moments.

Theorem 3.2.3: If p~ exists, then so does p~, and

p~ = p~(a) + (~) P~-l (a)a + (;)p~_2(a)a2 + ...


+ C: rI r
Jp;(a)a - + G)a .
Proof:

xr = [(x - a) + ar
(x-ay+G)(x aY-Ia+(;)(x ay- 2a2+ ...

+ C: l)(X - a)a r- I + G)a r, (3)

Therefore,

Ixlr S; Ix - ar + (~) Ix - alr-1lal + (;) Ix - al r- 2lal 2 + . ,.

+ (r: 1) Ix - allar- 1
+ lair,

JIxlr dF(x) JIx - air dF(x) +.,. + C: 1) lal r- I JIx - aldF(x) + lair
S; < 00.

41
52

Taking expectations of (3), we have

p~ = p~(a) + (~)p~_l(a)a + (;)p~-2(a) + ...


+ (r r )pI(a)ar-1 +a r .
-1 1

If we put a = E(X) = p;, then

Pr1 = pr + (r)
1 pr-IPI1 + (r)
2 pr-2Pl12 + ...

Noting p~ = po = 1, for r = 1,2,3, ... , we have


1 1
PI = PI
1
P2 = P2 + PI 12

P; = P3 + 3P2P~ + p;3
P4 = P4 + 4 P3PI + 6P2Pl + PI
1 1 12 14
.
Q.: Check the above results.
Lastly, we put

g(X) = X^{(r)} = X(X - 1)(X - 2)...(X - r + 1).

Then E[g(X)] is called the rth factorial moment of X and is denoted by μ_(r), if it exists. It is called a factorial moment since x^{(r)} = x!/(x - r)!. We can show that if the raw moments μ'_r exist, so does μ_(r). This is because

x^{(1)} = x
x^{(2)} = x² - x
x^{(3)} = x³ - 3x² + 2x
x^{(4)} = x⁴ - 6x³ + 11x² - 6x.

From the above, taking expectations,

μ_(1) = μ'_1
μ_(2) = μ'_2 - μ'_1
μ_(3) = μ'_3 - 3μ'_2 + 2μ'_1
μ_(4) = μ'_4 - 6μ'_3 + 11μ'_2 - 6μ'_1.

The reverse relationship is also easily established, since

x = x^{(1)}
x² = x^{(2)} + x^{(1)}
x³ = x^{(3)} + 3x^{(2)} + x^{(1)}
x⁴ = x^{(4)} + 6x^{(3)} + 7x^{(2)} + x^{(1)}.

We have

μ'_1 = μ_(1)
μ'_2 = μ_(2) + μ_(1)
μ'_3 = μ_(3) + 3μ_(2) + μ_(1)
μ'_4 = μ_(4) + 6μ_(3) + 7μ_(2) + μ_(1).
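These polynomial identities between powers and falling factorials are mechanical to verify; a small sympy sketch:

```python
import sympy as sp

x = sp.symbols("x")
ff = lambda r: sp.prod([x - k for k in range(r)])   # x^{(r)} = x(x-1)...(x-r+1)

print(sp.expand(ff(4)))                   # x**4 - 6*x**3 + 11*x**2 - 6*x
# the reverse relation: x^4 = x^{(4)} + 6 x^{(3)} + 7 x^{(2)} + x^{(1)}
print(sp.expand(ff(4) + 6*ff(3) + 7*ff(2) + ff(1) - x**4))   # 0
```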

Extra Note:

1. You might have noticed in Section 3.1 that we say the expectation of X exists iff ∫_{-∞}^∞ |x|dF(x) < ∞, while E(X) is defined as ∫_{-∞}^∞ x dF(x). The question is: why do we need this stronger (absolute) condition for the existence of the expectation of X? Consider E(X) when X is a discrete random variable taking a countably infinite number of values x_1, x_2, ... with probabilities p_1, p_2, ... respectively; then

E(X) = Σ_{i=1}^∞ x_i p_i.

This sum may be convergent, yet it might take different values if we change the order of summation. For example, let X take the values ±1, ±2, ±3, ... with probabilities Pr(X = i) = Pr(X = -i) = 3/(π²i²). First note,

(3/π²)·2·(1 + 1/2² + 1/3² + ...) = (6/π²)(π²/6) = 1.

Now if we calculate E(X) as Σ_{i=1}^∞ (i - i)·3/(π²i²), then it is exactly 0. However, if we calculate E(X) as (Σ_{i=1}^∞ i·3/(π²i²)) - (Σ_{i=1}^∞ i·3/(π²i²)), then it does not exist, since each of the two sums diverges. In fact, it is easily checked that Σ_i |x_i|p_i = (6/π²)Σ_{i=1}^∞ 1/i for this case is not < ∞.

2. At this stage you might be wondering what the uses of all these kinds of moments are. From the moments we can get a very good idea of the probability distribution of a random variable. As we will see later, different measures of central tendency, dispersion, skewness, and peakedness of distributions can be described by moments.

4. Mathematical Expectation, Part II

4.1 Riemann-Stieltjes Integral.

Recall we defined

E[g(X)] = ∫ g(x)dF(x).

What does it really mean when we integrate a function g(x) with respect to another function F(x)? This can be done in the following way:

∫_a^b g(x)dF(x) = lim_{n→∞} Σ_{i=1}^n g(ξ_i)[F(x_i) - F(x_{i-1})],

where the interval (a, b] has been divided into n subintervals (x_{i-1}, x_i], i.e., a = x_0 < x_1 < x_2 < ... < x_n = b, and ξ_i is an arbitrary point in the subinterval (x_{i-1}, x_i]. This is a generalization of the standard Riemann integral you already know, namely

∫_a^b g(x)dx = lim_{n→∞} Σ_{i=1}^n g(ξ_i)(x_i - x_{i-1}).

If F(x) is differentiable with dF(x)/dx = f(x), then we can write

∫_a^b g(x)dF(x) = ∫_a^b g(x)f(x)dx,

which is a standard Riemann integral. On the other hand, if F(·) is a step function (like our distribution function in the discrete case) with jumps at x_1, x_2, ..., then

∫_a^b g(x)dF(x) = Σ_i g(x_i)[F(x_i) - F(x_{i-1})] = Σ_i g(x_i)Pr(X = x_i).
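A numerical illustration of the Riemann-Stieltjes sum: for a differentiable F the sum should approach ∫g(x)f(x)dx. Here is a rough sketch (numpy assumed) with F(x) = x² on [0, 1], so f(x) = 2x, and g(x) = x:

```python
import numpy as np

g = lambda x: x
F = lambda x: x**2        # dF = 2x dx, so the integral over [0,1] is 2/3

n = 100_000
grid = np.linspace(0.0, 1.0, n + 1)
xi = grid[:-1]            # left endpoint of each subinterval as xi_i
rs_sum = np.sum(g(xi) * (F(grid[1:]) - F(grid[:-1])))
print(rs_sum)             # about 0.6667
```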
5.3.1 Probability Generating Function (p.g.f.).

Take g(X, t) = t^X; then we have the p.g.f. of X,

P_X(t) = E(t^X) = ∫ t^x dF(x).

In some sense P_X(t) generates probability. Consider a random variable X that takes values 0, 1, 2, 3, .... Then
P_X(t) = Σ_{i=0}^∞ t^i Pr(X = i).

Therefore, the coefficient of t^i gives Pr(X = i). From P_X(t) we can also get the factorial moments. We have

dP_X(t)/dt]_{t=1} = ∫ x t^{x-1} dF(x)]_{t=1} = ∫ x dF(x) = μ_(1)

d²P_X(t)/dt²]_{t=1} = ∫ x(x-1) t^{x-2} dF(x)]_{t=1} = ∫ x(x-1) dF(x) = μ_(2)

d³P_X(t)/dt³]_{t=1} = ∫ x(x-1)(x-2) t^{x-3} dF(x)]_{t=1} = ∫ x(x-1)(x-2) dF(x) = μ_(3)

[Recall our definition of μ_(r) (Section 3.2):

μ_(r) = E[X(X-1)(X-2)...(X-r+1)] = ∫ x(x-1)(x-2)...(x-r+1) dF(x).]

In general,

d^r P_X(t)/dt^r]_{t=1} = μ_(r).

Result 5.3.1: If X_1 and X_2 are independent, then

P_{X_1+X_2}(t) = P_{X_1}(t)·P_{X_2}(t).

Proof:

P_{X_1+X_2}(t) = E(t^{X_1+X_2}) = E(t^{X_1}·t^{X_2}) = E(t^{X_1})·E(t^{X_2}) = P_{X_1}(t)·P_{X_2}(t). Q.E.D.

The above proof uses the fact that, since X_1 and X_2 are independent,

E(t^{X_1}·t^{X_2}) = E(t^{X_1})·E(t^{X_2}).

5.3.2 Moment Generating Function (m.g.f.).

Here g(X, t) = e^{tX}, and we denote the m.g.f. by M_X(t):

M_X(t) = E(e^{tX}) = ∫ e^{tx} dF(x), provided the integral exists,

= ∫ (1 + tx + (t²/2!)x² + (t³/3!)x³ + ...) dF(x)

= Σ_{j=0}^∞ (t^j/j!) ∫ x^j dF(x)

= Σ_{j=0}^∞ (t^j/j!) μ'_j.

M_X(t) is called the moment generating function (m.g.f.) since, as we see from above, the coefficient of t^j/j! gives μ'_j. Another way to find μ'_j is

d^j M_X(t)/dt^j]_{t=0} = μ'_j   (check).

Example 5.3.1: Let

f(x) = 1/(Γ(α)β^α) · x^{α-1} e^{-x/β}, x ∈ (0, ∞).

This is called the gamma distribution. For this distribution,

M_X(t) = E(e^{tX}) = ∫_0^∞ e^{tx} · 1/(Γ(α)β^α) · x^{α-1} e^{-x/β} dx

= 1/(Γ(α)β^α) ∫_0^∞ x^{α-1} e^{-x(1-βt)/β} dx

= 1/(Γ(α)β^α) · Γ(α)·[β/(1-βt)]^α    [since ∫_0^∞ e^{-ax} x^{n-1} dx = Γ(n)/a^n]

= 1/(1-βt)^α, t < 1/β.

Therefore,

dM_X(t)/dt]_{t=0} = (-α)(1-βt)^{-α-1}(-β)]_{t=0} ⟹ μ'_1 = αβ

d²M_X(t)/dt²]_{t=0} = (-α)(-α-1)(1-βt)^{-α-2}(-β)²]_{t=0} ⟹ μ'_2 = α(α+1)β².

Hence, μ_2 = μ'_2 - μ'_1² = αβ².
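A quick symbolic check of these gamma moments from the m.g.f. (a sympy sketch):

```python
import sympy as sp

t, a, b = sp.symbols("t alpha beta", positive=True)
M = (1 - b*t)**(-a)                  # gamma m.g.f., valid for t < 1/beta

mu1 = sp.diff(M, t).subs(t, 0)       # alpha*beta
mu2 = sp.diff(M, t, 2).subs(t, 0)    # alpha*(alpha+1)*beta**2
print(sp.simplify(mu1), sp.simplify(mu2 - mu1**2))   # alpha*beta, alpha*beta**2
```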
Unfortunately, the m.g.f. of many distributions does not exist, as the following example shows:

Example 5.3.2: Define the probability distribution

Pr(X = j) = 6/(π²j²), j = 1, 2, 3, ....

Note that Pr(X = j) ≥ 0 ∀j = 1, 2, 3, ... and

Σ_{j=1}^∞ 6/(π²j²) = (6/π²)(1 + 1/2² + 1/3² + ...) = (6/π²)(π²/6) = 1.

Therefore, this is a proper probability distribution. Here

M_X(t) = E(e^{tX}) = Σ_{j=1}^∞ e^{tj} · 6/(π²j²) = (6/π²) Σ_{j=1}^∞ e^{tj}/j².

For t > 0 this sum does not exist, and therefore M_X(t) is not defined.

However, there is a function which always exists and from which we can find the moments. That function is called the characteristic function.

5.3.3 Characteristic Function (c.f.).

Here g(X, t) = e^{itX} with i = √-1, and we have the c.f. as

φ_X(t) = E(e^{itX}) = ∫ e^{itx} dF(x).

Note that this integral always exists. We can write

|φ_X(t)| ≤ ∫ |e^{itx}| dF(x).

But |e^{itx}| = |cos tx + i sin tx| = √(cos²tx + sin²tx) = 1.

Hence |φ_X(t)| ≤ ∫ dF(x) = 1.

Let us consider the examples of the Cauchy and Laplace distributions. A general form of the Cauchy distribution is [Cramer (1946, p. 246)]

f(x) = (1/π) · λ/(λ² + (x - μ)²), -∞ < x < ∞, 0 < λ < ∞,

but for simplicity we will work with the form

f(x) = (1/π) · 1/(1 + x²), -∞ < x < ∞.

It is easy to see that

∫_{-∞}^∞ f(x)dx = (1/π) ∫_{-∞}^∞ dx/(1 + x²) = (2/π) tan^{-1}x]_0^∞ = 1.

However, the moments of this distribution do not exist. For example,

μ'_1 = ∫_{-∞}^∞ x f(x)dx = (1/π) ∫_{-∞}^∞ x/(1 + x²) dx
     = (1/π) [∫_{-∞}^0 x/(1 + x²) dx + ∫_0^∞ x/(1 + x²) dx].

However,

∫_{-∞}^0 x/(1 + x²) dx = -∞   (does not exist)

and

∫_0^∞ x/(1 + x²) dx = ∞   (does not exist).

It is interesting to note that for any finite "a",

∫_{-a}^a x/(1 + x²) dx = 0.

The characteristic function of the Cauchy distribution, however, does exist. It can be shown that

φ_X(t) = (1/π) ∫_{-∞}^∞ e^{itx}/(1 + x²) dx = e^{-|t|}.

To see this, note that Result 5.3.3 below implies

f(x) = (1/2π) ∫_{-∞}^∞ e^{-itx} φ_X(t) dt,    (*)

and there is a one-to-one correspondence between c.f. and p.d.f. Look at the integral

(1/2) ∫_{-∞}^∞ e^{itx-|t|} dt = ∫_0^∞ cos tx · e^{-t} dt
= [e^{-t}(x sin tx - cos tx)/(1 + x²)]_0^∞ = 1/(1 + x²).

Now think about a p.d.f.

f(x) = (1/2) e^{-|x|}, -∞ < x < ∞.

Then its c.f. must be

φ_X(t) = 1/(1 + t²).

This p.d.f. is the Laplace distribution. The reciprocal Fourier integral (*) connects the Cauchy distribution with the Laplace distribution [see Cramer (1946, p. 247)].
To see it directly, let us derive the above c.f.:

φ_X(t) = ∫_{-∞}^∞ (1/2) e^{itx} e^{-|x|} dx = (1/2) ∫_{-∞}^∞ (cos tx + i sin tx) e^{-|x|} dx
       = ∫_0^∞ cos tx · e^{-x} dx
       = [e^{-x}(t sin tx - cos tx)/(1 + t²)]_0^∞,

i.e.,

φ_X(t) = 1/(1 + t²).

In a similar way, we can show that the moment generating function for the Laplace distribution is given by

M_X(t) = E(e^{tX}) = (1/2) ∫_{-∞}^∞ e^{tx} e^{-|x|} dx = 1/(1 - t²), |t| < 1.

All the moments exist for this distribution:

μ_r = 0 if r is odd
    = r! if r is even.

Hence √β_1 = 0, and β_2 = μ_4/μ_2² = 24/4 = 6.

Some results on characteristic functions:

Result 5.3.2: φ_X(t) = M_X(it).

Result 5.3.3: There is a one-to-one correspondence between the c.f. and the distribution function. In other words, every distribution has a unique c.f., and to each c.f. there corresponds a unique probability distribution. This is achieved by a result known as the inversion theorem, which can be stated as follows:

If φ_X(t) is the c.f. corresponding to F(x) and (x_0 - a, x_0 + a) is a continuity interval of F(x), then

F(x_0 + a) - F(x_0 - a) = lim_{T→∞} (1/π) ∫_{-T}^T (sin at / t) e^{-itx_0} φ_X(t) dt.

Result 5.3.4: If μ'_r exists, then

μ'_r = (1/i^r) d^r φ_X(t)/dt^r]_{t=0}.

Note: d^r φ_X(t)/dt^r denotes the rth derivative of φ_X(t) with respect to t.

Proof: We have

φ_X(t) = ∫ e^{itx} dF(x).

Therefore, d^r φ_X(t)/dt^r = i^r ∫ x^r e^{itx} dF(x), so that

(1/i^r) d^r φ_X(t)/dt^r]_{t=0} = ∫ x^r dF(x) = μ'_r. Q.E.D.

Result 5.3.5: φ_{a+bX}(t) = e^{iat} φ_X(bt).

Proof: φ_{a+bX}(t) = E(e^{it(a+bX)}) = e^{iat} E(e^{i(tb)X}) = e^{iat} φ_X(bt).

Result 5.3.6: If X and Y are independent, then

φ_{X,Y}(t_1, t_2) = φ_X(t_1) × φ_Y(t_2).

Note: Here t_1, t_2 are two real numbers.

Proof:

φ_{X,Y}(t_1, t_2) = E(e^{it_1 X + it_2 Y})
= ∫∫ e^{it_1 x + it_2 y} dF(x, y)
= ∫∫ e^{it_1 x}·e^{it_2 y} d(G(x)·H(y))    [since X and Y are independent]
= ∫ e^{it_1 x} dG(x) · ∫ e^{it_2 y} dH(y) = φ_X(t_1) × φ_Y(t_2). Q.E.D.


Result 5.3.7: φ_X(t) is real iff X follows a symmetric distribution around zero.

Proof:

φ_{-X}(t) = E(e^{-itX}) = φ_X(-t) = φ̄_X(t); [φ̄ is the complex conjugate.]

Suppose X has a symmetric distribution; then

φ_X(t) = φ_{-X}(t) = φ̄_X(t).

Hence φ_X(t) is real (why?).

Conversely, suppose φ_X(t) is real; then

φ_X(t) = φ̄_X(t) = φ_{-X}(t),

i.e., X and -X have the same c.f., i.e., the same distribution. Hence X is symmetrically distributed around zero.

Example 5.3.3: Let

f(x) = 1, x ∈ [0, 1].

φ_X(t) = ∫_0^1 e^{itx} dx = (e^{it} - 1)/(it).

Use L'Hospital's theorem to show that

(1/i) dφ_X(t)/dt]_{t=0} = 1/2 = μ'_1.

6. Distribution of a Function of a Random Variable.

Let U = g(X) be a measurable function of X; hence U is also a random variable. We know the p.d.f. of X, f(x). We want to find the p.d.f. of g(X). There are three approaches to do that:

1. Distribution function approach

2. Transformation of variable method

3. Using generating functions.

In many statistical and econometric applications we need to transform the original variable X. The most popular transformations are

U = log X (log transformation)

U = √X (square-root transformation)

U = (X^λ - 1)/λ (Box-Cox transformation, where λ is a constant).
6.1 Distribution Function Approach.

Here the basic principle is: first find the distribution function of U, F(u). Then differentiate it with respect to u and obtain

f(u) = dF(u)/du.

The approach is best demonstrated through an example.

Example 6.1.1: Let

f(x) = (3/2)x², -1 ≤ x ≤ 1
     = 0 otherwise.

Let U = X²; then

F(u) = Pr(U ≤ u) = Pr(X² ≤ u) = Pr(-√u ≤ X ≤ √u)
     = ∫_{-√u}^{√u} f(x)dx = (3/2) ∫_{-√u}^{√u} x² dx = (1/2) x³]_{-√u}^{√u} = u^{3/2}.

Therefore,

f(u) = dF(u)/du = (3/2)u^{1/2}.

Now the range of u can be obtained from the relationship U = X². Since -1 ≤ x ≤ 1, we should have 0 ≤ u ≤ 1. Therefore,

f(u) = (3/2)u^{1/2}, u ∈ [0, 1]
     = 0 otherwise.

Check that

∫_0^1 f(u)du = (3/2) ∫_0^1 u^{1/2} du = 1.
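A Monte Carlo sanity check of this transformed density (a sketch; numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw X with pdf (3/2)x^2 on [-1,1] by inverse CDF: F(x) = (x^3 + 1)/2
v = rng.uniform(size=1_000_000)
x = np.cbrt(2 * v - 1)

u = x**2
# Compare the empirical Pr(U <= 1/4) with F(1/4) = (1/4)^{3/2} = 1/8
print((u <= 0.25).mean())        # about 0.125
```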

6.2 Transformation of Variable Approach.

First, for simplicity, we assume that the function g(x) is one-to-one and has a continuous first derivative. Since g(x) is one-to-one, the inverse function X = g^{-1}(U) = h(U) exists. Now

F(u) = Pr(U ≤ u) = Pr(g(X) ≤ u) = ∫_A dF(x), where A = {x | g(x) ≤ u}.

Now, to evaluate the above integral, consider a change of variable. We can write

∫_A dF(x) = ∫_A f(x)dx = ∫_B f(h(u)) |dh(u)/du| du = ∫_B f(u)du,

where B = {u | u = g(x), x ∈ A}. By analogy, f(u) must be the p.d.f. of U and is given by

f(u) = f(h(u)) |dh(u)/du|.

Here |dh(u)/du| = |J|, where J is called the Jacobian of the transformation.

Suppose now that we do not have a one-to-one function. Therefore, no unique inverse exists. We, however, assume that the domain of g(X) can be partitioned into a finite number, say P, of disjoint subdomains, denoted by I_1, I_2, ..., I_P, on each of which g(X) is strictly monotone (decreasing or increasing). Hence, on each I_i (i = 1, 2, ..., P), g(X) has a unique inverse. Let g_i denote the restriction of the function g to I_i, i.e., g_i(x) is defined only on I_i and g_i(x) = g(x) for x ∈ I_i, i = 1, 2, ..., P. By assumption, g_i(x) has a unique inverse x = g_i^{-1}(u) = h_i(u), i = 1, 2, ..., P. Then we can show that the p.d.f. of U is given by

f(u) = Σ_{i=1}^P δ_i(u) f[g_i^{-1}(u)] |dg_i^{-1}(u)/du|
     = Σ_{i=1}^P δ_i(u) f[h_i(u)] |dh_i(u)/du|,

where δ_i(u) = 1 if u ∈ g_i(I_i)
             = 0 otherwise, i = 1, 2, ..., P.

For a proof of this result see Andre I. Khuri (1993), Advanced Calculus with Applications in Statistics, John Wiley & Sons, pp. 246-249.

Example 6.2.1: Let

f(x) = (1/√(2π)σ) e^{-(x-μ)²/(2σ²)}, -∞ < x < ∞.

Let U = a + bX. Therefore, the inverse function h(u) is

X = -a/b + U/b = h(U).

Now |dh(u)/du| = 1/|b|; therefore,

f(u) = f(h(u)) |dh(u)/du| = f(-a/b + u/b) · (1/|b|)
     = (1/√(2π)σ) · (1/|b|) · e^{-((u-a)/b - μ)²/(2σ²)}
     = (1/√(2π)σ|b|) · e^{-(u-a-bμ)²/(2b²σ²)}.

Now fix the range of u: since -∞ < x < ∞ and u = a + bx, we should have -∞ < u < ∞. Therefore,

f(u) = (1/√(2π)σ|b|) e^{-(u-a-bμ)²/(2b²σ²)}, -∞ < u < ∞.

Let us now consider the case when the transformation is not one-to-one. For example, let X ~ N(0, 1), and we want to find the probability densities of |X| and X². These functions are not one-to-one.

First consider the case U = g(X) = |X|. We can partition the range of X into two parts so that the function is one-to-one separately on those two (P = 2) intervals, and we can have unique inverse functions, namely

U = -X for -∞ < X < 0
  = X for 0 < X < ∞,

i.e., X = -U for X < 0
       = U for X > 0,

and the Jacobian |J| = |dx/du| = 1.

Then, using the formula,

f(u) = δ_1(u)f(-u)·1 + δ_2(u)f(u)·1
     = (1/√(2π)) e^{-u²/2} + (1/√(2π)) e^{-u²/2}
     = (2/√(2π)) e^{-u²/2}, 0 < u < ∞.

For the second case, U = X², and we have

X = -√u for -∞ < X < 0
  = √u for 0 < X < ∞,

and |J| = |dx/du| = (1/2)u^{-1/2}.

Therefore,

f(u) = f(-√u)·(1/2)u^{-1/2} + f(√u)·(1/2)u^{-1/2}
     = (1/√(2π)) e^{-u/2}·(1/2)u^{-1/2} + (1/√(2π)) e^{-u/2}·(1/2)u^{-1/2}
     = (1/√(2π)) e^{-u/2} u^{-1/2}, 0 < u < ∞.
The transformation of variable approach can easily be generalized to the multivariate case. Consider the transformation (X_1, X_2) → (U_1, U_2), where

U_1 = g_1(X_1, X_2) and U_2 = g_2(X_1, X_2).

We know the joint pdf of X_1 and X_2, and we are interested in finding the joint pdf of U_1 and U_2. We denote the inverse functions as

X_1 = h_1(U_1, U_2) and X_2 = h_2(U_1, U_2).

Suppose the joint pdf of X_1 and X_2 is given by f(x_1, x_2). The first step is to find the Jacobian

J = |∂(x_1, x_2)/∂(u_1, u_2)| = |∂(h_1(u_1, u_2), h_2(u_1, u_2))/∂(u_1, u_2)|,    (1)

i.e., J is the determinant of the 2 × 2 matrix of partial derivatives ∂x_i/∂u_j. Once we know J, the second step is to write the joint pdf of U_1 and U_2 as

f(u_1, u_2) = f(h_1(u_1, u_2), h_2(u_1, u_2)) |J|,

where |J| is the absolute value of the determinant in equation (1).

The third and last step is to find the ranges of U_1 and U_2 using the relationships U_1 = g_1(X_1, X_2), U_2 = g_2(X_1, X_2) and the known ranges of X_1 and X_2.

So, to summarize the three steps (for the one-variable case):

(1) Find the Jacobian |J| = |dx/du|.

(2) In f(x), replace x by x = g^{-1}(u) = h(u) and write

f(u) = f[h(u)] |J|.

(3) Find the range of U using U = g(X) and the range of X, which is already known.

Example 6.2.2: Let X_1 and X_2 be independent with joint p.d.f.

f(x_1, x_2) = e^{-x_1-x_2}, x_1 > 0, x_2 > 0.

Suppose we want to find the distribution of

U_1 = X_1 + X_2.

First we will find the joint distribution of

U_1 = X_1 + X_2 and U_2 = X_2,

and then from the joint p.d.f. f(u_1, u_2) we will obtain the p.d.f. of U_1 by

f(u_1) = ∫ f(u_1, u_2) du_2.

The inverse functions are

X_1 = U_1 - U_2 and X_2 = U_2.

Therefore,

J = |∂x_1/∂u_1  ∂x_1/∂u_2|   |1  -1|
    |∂x_2/∂u_1  ∂x_2/∂u_2| = |0   1| = 1,

and |J| = 1. The joint density of U_1 and U_2 is then given by

f(u_1, u_2) = e^{-u_1}·1 = e^{-u_1}.

To find the ranges of U_1 and U_2, note that

0 < U_2 < U_1 < ∞   (why?).

Therefore,

f(u_1) = ∫_0^{u_1} e^{-u_1} du_2 = u_1 e^{-u_1}, 0 < u_1 < ∞.

To check whether f(u_1) is a proper density function:

(i) f(u_1) ≥ 0, (ii) ∫_0^∞ f(u_1)du_1 = ∫_0^∞ u_1 e^{-u_1} du_1 = Γ(2) = 1.

6.3 Generating Function Approach.

We can use either the characteristic function (c.f.) or the moment generating function (m.g.f.) if it exists. In this approach we find the g.f. of U = g(X), compare it with the g.f.'s of well-known distributions, and try to find a match. The best way to demonstrate this approach is through an example.

Example 6.3.1: Consider the normal distribution

f(x) = (1/√(2π)σ) e^{-(x-μ)²/(2σ²)}, -∞ < x < ∞.

We want to find the p.d.f. of

U = a + bX,

where a and b are constants.

Let us first find the c.f. of X:

φ_X(t) = E(e^{itX}) = ∫_{-∞}^∞ e^{itx} f(x)dx

= (1/√(2π)σ) ∫_{-∞}^∞ e^{-(x-μ)²/(2σ²) + itx} dx

= (1/√(2π)σ) ∫_{-∞}^∞ e^{-[x² - 2x(μ+σ²it) + μ²]/(2σ²)} dx

= e^{-[μ² - (μ+σ²it)²]/(2σ²)} · (1/√(2π)σ) ∫_{-∞}^∞ e^{-[x - (μ+σ²it)]²/(2σ²)} dx

= e^{iμt - σ²t²/2}.

Above we have used the result that

(1/√(2π)σ) ∫_{-∞}^∞ e^{-[x - (μ+σ²it)]²/(2σ²)} dx = 1.

As we noted before, there is a one-to-one correspondence between the c.f. and the distribution function or the p.d.f. (see Result 5.3.3). In our case,

f(x) = (1/√(2π)σ) e^{-(x-μ)²/(2σ²)}, -∞ < x < ∞,

if and only if

φ_X(t) = e^{itμ - σ²t²/2}.

Now let us find the c.f. of U = a + bX:

φ_U(t) = E(e^{itU}) = E(e^{ita + itbX})
       = e^{ita} E(e^{itbX}) = e^{ita} E(e^{i(tb)X})
       = e^{ita} [e^{iμtb - σ²(tb)²/2}]   (why?)
       = e^{i(a+bμ)t - b²σ²t²/2}.

Because of the one-to-one correspondence between c.f. and p.d.f., f(u) can be obtained from f(x) by replacing μ by a + bμ and σ² by b²σ², i.e., σ by |b|σ. Therefore, the p.d.f. of U is given by

f(u) = (1/√(2π)|b|σ) e^{-(u-a-bμ)²/(2b²σ²)}, -∞ < u < ∞.

Note that we obtained the same result as in Example 6.2.1.
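A quick empirical check that a + bX is again normal with mean a + bμ and standard deviation |b|σ (a sketch; numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, a, b = 1.0, 2.0, 3.0, -0.5

x = rng.normal(mu, sigma, size=1_000_000)
u = a + b * x

print(u.mean(), u.std())   # about a + b*mu = 2.5 and |b|*sigma = 1.0
```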

This concludes our discussion of Part II of the course, Probability Theory. Part III of the course is on Univariate Discrete and Continuous Distributions.

7. UNIVARIATE DISCRETE AND CONTINUOUS DISTRIBUTIONS

7.1 Hypergeometric Distribution.

Let us think of an experiment where we have N items fresh from the factory. These items are either defective or non-defective. We would like to see whether the lot is acceptable or not, that is, whether or not the number of defectives exceeds a preassigned limit. Let us assume that the number of defectives is r (≤ N), so the rest (N - r) are non-defectives. If N is very large, we cannot inspect all the
