
Course 003: Basic Econometrics


Rohini Somanathan - Part I
Sunil Kanwar - Part II
Delhi School of Economics, 2014-2015


Outline of Part 1


Main text: Morris H. DeGroot and Mark J. Schervish, Probability and Statistics, fourth edition.
1. Probability Theory: Chapters 1-6
Probability basics: The definition of probability, combinatorial methods, independent
events, conditional probability.
Random variables: Distribution functions, marginal and conditional distributions,
distributions of functions of random variables, moments of a random variable,
properties of expectations.
Some special distributions, laws of large numbers, central limit theorems
2. Statistical Inference: Chapters 7-10
Estimation: definition of an estimator, maximum likelihood estimation, sufficient
statistics, sampling distributions of estimators.
Hypothesis Testing: simple and composite hypotheses, tests for differences in means,
test size and power, uniformly most powerful tests.
Nonparametric Methods


Administrative Information
Internal Assessment: 25% for Part 1
1. Midterm: 20%
2. Lab assignments, Tutorial attendance and class participation: 5%
Problem Sets: Do as many problems from the book as you can. All odd-numbered exercises have solutions, so focus on these.
Tutorials: Check the notice board in front of the lecture theatre for lists.
Punctuality is critical - coming in late disturbs the rest of the class and me.


Why is this course useful?


We (as economists, citizens, consumers, exam-takers) are often faced with situations in which we have to make decisions in the face of uncertainty. This may be caused by:
randomness in the world (a farmer making planting decisions does not know how much it will rain during the season, we do not know how many days we'll be sick next year, what the chances are of an economic crisis or recovery)
incomplete information about the realized state of the world (Is a politician's promise sincere? Is a firm telling us the truth about a product? Has our opponent been dealt a better hand of cards? Is a prisoner guilty or innocent?...)
By putting structure on this uncertainty, we can arrive at
decision rules: firms choose techniques, doctors choose drug regimes, electors choose politicians - these rules have to tell us how best to incorporate new information.
estimates: of empirical relationships (wages and education, drugs and health...)
tests: how likely is it that population parameters take particular values, based on the estimates we've obtained?
Probability theory puts structure on uncertain events and allows us to derive systematic decision rules. The field of statistics shows us how we can collect and use data to estimate empirical models and test hypotheses about the population based on our estimates.


A motivating example: gender ratios


We are interested in whether the gender ratio in a population reflects discrimination, either before or after birth.
Suppose it is equally likely for a child of either sex to be conceived.
We visit a small village with 10 children under the age of 1. If each birth is independent, we would get considerable variation in the sex ratio in the absence of discrimination.

[Figure: the binomial(10, .5) probabilities for the number of girls k among 10 births - P(0)=.001, P(1)=.010, P(2)=.044, P(3)=.117, P(4)=.205, P(5)=.246, and symmetrically for k > 5.]

When should we conclude that there is gender bias? Can we get an estimate of this bias?
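A minimal Python sketch (an illustrative addition, not part of the original slides) that reproduces the probabilities plotted above using the binomial formula:

    from math import comb

    n, p = 10, 0.5
    # P(k girls out of n births), each birth independent and equally likely to be a girl
    for k in range(n + 1):
        print(k, round(comb(n, k) * p**k * (1 - p)**(n - k), 3))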


Origins of probability theory


A probability is a number attached to some event which expresses the likelihood of the event occurring.
A theory of probability was first exposited by European mathematicians in the 16th century studying gambling problems.
How are probabilities assigned to events?
By thinking about all possible outcomes. If there are n of these, all equally likely, we can attach the number 1/n to each of them. If an event contains k of these outcomes, we attach a probability k/n to the event. This is the classical interpretation of probability.
Alternatively, imagine the event as a possible outcome of an experiment. Its probability is the fraction of times it occurs when the experiment is repeated a large number of times. This is the frequency interpretation of probability.
In many cases events cannot be thought of in terms of repeated experiments or equally likely outcomes. We could base likelihoods in this case on what we believe about the world: subjective probabilities. The subjective probability of an event A is a real number in the interval [0, 1] which reflects a subjective belief in the validity or occurrence of event A. Different people might attach different probabilities to the same events. Examples?
We formalize this subjective interpretation by imposing certain consistency conditions on combinations of events.


Definitions
An experiment is any process whose outcome is not known in advance with certainty. These
outcomes may be random or non-random, but we should be able to specify all of them and
attach probabilities to them.
Experiment                   Event
10 coin tosses               4 heads
select 10 LS MPs             one is female
go to your bus-stop at 8     bus arrives within 5 min.

A sample space is the collection of all possible outcomes of an experiment.


An event is a certain subset of possible outcomes in the space S.
The complement of an event A is the event that contains all outcomes in the sample space that do not belong to A. We denote this event by Ac.
The subsets A1, A2, A3, ... of sample space S are called mutually disjoint sets if no two of these sets have an element in common. The corresponding events A1, A2, A3, ... are said to be mutually exclusive events.
If A1, A2, A3, ... are mutually exclusive events such that S = A1 ∪ A2 ∪ A3 ∪ ..., these are called exhaustive events.


Example: 3 tosses of a coin


The experiment has 2³ = 8 possible outcomes and we can define the sample space S = {s1, ..., s8} where
s1 = HHH, s2 = HHT, s3 = HTH, s4 = HTT, s5 = THH, s6 = THT, s7 = TTH, s8 = TTT
Any subset of this sample space is an event.
If we have a fair coin, each of the listed outcomes is equally likely and we attach probability 1/8 to each of them.
Let us define the event A as at least one head. Then A = {s1, ..., s7}, Ac = {s8}. A and Ac are exhaustive events.
The events exactly one head and exactly two heads are mutually exclusive events.
Notice that there are lots of different ways in which we can define a sample space, and the most useful way to do so depends on the event we are interested in (# heads, or, when picking from a deck of cards, we may be interested in the suit, the number or both).
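As an illustrative aside (assuming the eight equally likely outcomes above), a short Python sketch that enumerates this sample space and checks P(at least one head) = 7/8:

    from itertools import product

    # The 2**3 equally likely outcomes of three tosses of a fair coin
    sample_space = list(product("HT", repeat=3))

    # Event A: at least one head
    A = [s for s in sample_space if "H" in s]
    print(len(A), "/", len(sample_space))  # 7 / 8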


The definition of probability


Definition: Let 𝒮 be the collection of all events in S. A probability distribution is a function P : 𝒮 → [0, 1] which satisfies the following axioms:
1. The probability of every event must be non-negative: P(A) ≥ 0 for all events A ∈ 𝒮
2. If an event is certain to occur, its probability is 1: P(S) = 1
3. For any sequence of disjoint events A1, A2, A3, ...: P(A1 ∪ A2 ∪ A3 ∪ ...) = P(A1) + P(A2) + P(A3) + ...
Note:
We will typically write P(A) or Pr(A) for the probability of the event A.
For finite sample spaces, 𝒮 is straightforward to define. For any S which is a subset of the real line (and therefore infinite), let 𝒮 be the set of all intervals in S.


Probability measures... some useful results


We can use our three axioms to derive some useful results:
Result 1: For each A ∈ 𝒮, P(A) = 1 − P(Ac)
Proof: A ∪ Ac = S. By our second axiom, P(S) = 1 and by axiom 3, P(A ∪ Ac) = P(A) + P(Ac).
Result 2: P(∅) = 0
Proof: Let A = ∅ so Ac = S. Since P(S) = 1, P(∅) = 0 using the first result above.
Result 3: If A1 and A2 are subsets of S such that A1 ⊆ A2, then P(A1) ≤ P(A2)
Proof: Let's write A2 as: A2 = A1 ∪ (A1c ∩ A2). Since these are disjoint, we can use axiom 3 to get P(A2) = P(A1) + P(A1c ∩ A2). The second term on the RHS is non-negative (by axiom 1), so P(A2) ≥ P(A1).
Result 4: For each A ∈ 𝒮, 0 ≤ P(A) ≤ 1
Proof: Since ∅ ⊆ A ⊆ S, we can directly apply the previous result to obtain P(∅) ≤ P(A) ≤ P(S), or 0 ≤ P(A) ≤ 1.


Some useful results..


Result 5: If A1 and A2 are subsets of S then P(A1 ∪ A2) = P(A1) + P(A2) − P(A1 ∩ A2)
Proof: As before, the trick is to write A1 ∪ A2 as a union of disjoint sets and then add the probabilities associated with them. Drawing a Venn diagram helps to do this.
A1 ∪ A2 = (A1 ∩ A2c) ∪ (A1 ∩ A2) ∪ (A2 ∩ A1c)     (1)
but A1 = (A1 ∩ A2c) ∪ (A1 ∩ A2) and A2 = (A2 ∩ A1c) ∪ (A1 ∩ A2), so
P(A1) + P(A2) = P(A1 ∩ A2c) + P(A1 ∩ A2) + P(A2 ∩ A1c) + P(A1 ∩ A2)
Subtracting P(A1 ∩ A2) from both sides gives the probability of the union in (1), which establishes the result.


Examples using the probability axioms


1. Consider two events A and B such that Pr(A) = 1/3 and Pr(B) = 1/2. Determine the value of P(B ∩ Ac) for each of the following conditions: (a) A and B are disjoint (b) A ⊂ B (c) Pr(A ∩ B) = 1/8
2. Consider two events A and B, where P(A) = .4 and P(B) = .7. Determine the minimum and maximum values of Pr(A ∩ B) and the conditions under which they are obtained.
3. A point (x, y) is to be selected from the square S containing all points (x, y) such that 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1. Suppose that the probability that the point will belong to any specified subset of S is equal to the area of that subset. Find the following probabilities:
(a) (x − 1/2)² + (y − 1/2)² ≥ 1/4
(b) 1/2 < x + y < 3/2
(c) y ≤ 1 − x²
(d) x = y
answers: (1) 1/2, 1/6, 3/8 (2) .1, .4 (3) 1 − π/4, 3/4, 2/3, 0
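An illustrative Monte Carlo sketch (Python, not part of the original problem set) that approximates the three area probabilities in question 3:

    import random

    random.seed(0)
    n = 200_000
    a = b = c = 0
    for _ in range(n):
        x, y = random.random(), random.random()    # uniform point in the unit square
        a += (x - 0.5)**2 + (y - 0.5)**2 >= 0.25   # part (a)
        b += 0.5 < x + y < 1.5                     # part (b)
        c += y <= 1 - x**2                         # part (c)
    print(a / n, b / n, c / n)  # roughly 1 - pi/4 ≈ .215, 3/4, 2/3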


Finite sample spaces


If a sample space S contains a finite number of points s1, ..., sn, we can specify a probability distribution on S by assigning a probability pi to each point si ∈ S. These probabilities must satisfy the two conditions:
1. pi ≥ 0 for i = 1, 2, ..., n and
2. p1 + p2 + ... + pn = 1
The probability of any event A can now be found as the sum of pi for all outcomes si that belong to A.
A sample space containing n outcomes is called a simple sample space if the probability assigned to each of the outcomes s1, ..., sn is 1/n. Probability measures are easy to define in such spaces. If the event A contains exactly m outcomes, then P(A) = m/n.
Notice that for the same experiment, we can define the sample space in multiple ways depending on the events of interest. For example, suppose we're interested in obtaining a given number of heads in the tossing of 3 coins; our sample space can either comprise all the 8 possible outcomes (a simple space) or just four outcomes (0, 1, 2 and 3 heads).
We can arrive at the total number of elements in a sample space by listing all possible outcomes. A simple sample space for a coin-tossing experiment with 3 fair coins would have eight possible outcomes, a roll of two dice would have 36, etc. We then just calculate the number of elements contained in our event A and divide this by the total number of outcomes to get our probability (P(2 heads) = 3/8 and P(sum of 7) = 1/6).
Listing outcomes can take a long time, and we can use a number of counting methods to make things easier and avoid mistakes.


Counting methods..the multiplication rule


Sometimes it is useful to think of an experiment as being performed in multiple stages
(tossing coins, picking cards, questions on an exam, going from one city to another via a
third).
If the first stage has m possible outcomes and the second n outcomes, then we can define a simple sample space with exactly mn outcomes. Each element in this space will be a pair (xi, yj).
Example: the experiment of tossing 5 fair coins will have 32 elements in the simple sample space; the probability of five heads is 1/32 and of one head is 5/32.


Permutations
Suppose we are sampling k objects from a total of n distinct objects without replacement.
We are interested in the total number of different arrangements of these objects we can
obtain.
We first pick one object - this can happen in n different ways. Since we are now left with n − 1 objects, the second one can be picked in (n − 1) different ways, and so on.
The total number of permutations of n objects taken k at a time is given by
Pn,k = n(n − 1) ··· (n − k + 1), and Pn,n = n!
Pn,k can alternatively be written as:
Pn,k = n(n − 1) ··· (n − k + 1) = n(n − 1) ··· (n − k + 1) · (n − k)!/(n − k)! = n!/(n − k)!
In the case with replacement, we can apply the multiplication rule derived above. In this case there are n outcomes possible for each of the k selections, so the number of elements in S is n^k.


The birthday problem


You go to watch an India-Australia cricket match with a friend.
He would like to bet Rs. 100 that among the group of 23 players on the field (2 teams plus a referee) at least two people share a birthday.
Should you take the bet?
What is the probability that out of a group of k, at least two share a birthday?
The total number of possible assignments of birthdays is 365^k.
The number of different ways in which each of them has a different birthday is 365!/(365 − k)! (because the second person has only 364 days to choose from, etc.). The required probability is therefore p = 1 − 365!/((365 − k)! 365^k).
It turns out that for k = 23 this number is .507, so the odds are slightly in your friend's favour: the bet is close to fair, and you might take it only if you are not risk-averse.
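A short illustrative Python sketch (assuming 365 equally likely birthdays and ignoring leap years) that computes this probability for any group size k:

    def p_shared_birthday(k):
        """Probability that at least two of k people share a birthday."""
        p_distinct = 1.0
        for i in range(k):
            p_distinct *= (365 - i) / 365   # the (i+1)-th person avoids the first i birthdays
        return 1 - p_distinct

    print(round(p_shared_birthday(23), 3))  # 0.507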


Combinatorial methods..the binomial coefficient


How many different subsets of k elements can be chosen from a set of n distinct elements?
We are not interested in the order in which the k elements are arranged.
Each such subset is called a combination, denoted by Cn,k
We derived above the number of permutations of n elements, taken k at a time. We can
think of these permutations as being derived by the following process:
First pick a set or combination of k elements.
Since there are k! permutations of these elements, this particular combination will give
rise to k! permutations.
This is true of each such combination, therefore the number of permutations is given by
Pn,k = k! Cn,k, or
Cn,k = Pn,k/k! = n!/(k!(n − k)!)
This is also denoted by (n choose k) and called the binomial coefficient.


The multinomial coefficient


Suppose we have elements of multiple types (jobs, modes of transport, methods of water filtration...) and want to find the number of ways that n distinct elements (individuals, trips...) can be divided into k groups such that, for j = 1, 2, ..., k, the jth group contains exactly nj elements.
The n1 elements for the first group can be chosen in (n choose n1) ways; the second group is chosen out of the remaining (n − n1) elements and this can be done in (n − n1 choose n2) ways... The total number of ways of dividing the n elements into k groups is therefore
(n choose n1)(n − n1 choose n2)(n − n1 − n2 choose n3) ··· (n_{k−1} + n_k choose n_{k−1})
This can be simplified to n!/(n1! n2! ... nk!). This expression is known as the multinomial coefficient.
Examples:
A student organization of 1000 people is picking 4 office-bearers and 8 members for its managing council. The total number of ways of picking these groups is given by 1000!/(4! 8! 988!).
105 students have to be organized into 4 tutorial groups, 3 with 25 students each and one with the remaining 30 students. How many ways can students be assigned to groups? (See the sketch below.)
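An illustrative Python check of these counts (the exact integers are enormous, so only their computation is shown):

    from math import factorial

    def multinomial(n, *groups):
        """n!/(n1! n2! ... nk!) for group sizes that sum to n."""
        assert sum(groups) == n
        result = factorial(n)
        for g in groups:
            result //= factorial(g)
        return result

    # 1000 people: 4 office-bearers, 8 council members, 988 others
    print(multinomial(1000, 4, 8, 988))
    # 105 students into tutorial groups of 25, 25, 25 and 30
    print(multinomial(105, 25, 25, 25, 30))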


Unions of finite numbers of events


We can extend our formula for the probability of a union of events to the case where the number of events is greater than 2 but finite:
For any three events A1, A2, and A3,
P(A1 ∪ A2 ∪ A3) = P(A1) + P(A2) + P(A3) − [P(A1 ∩ A2) + P(A1 ∩ A3) + P(A2 ∩ A3)] + P(A1 ∩ A2 ∩ A3)
The easiest way to see this is to draw a Venn diagram and express the desired set in terms of 7 disjoint sets, p1, ..., p7.
For a finite number of events, we have:
P(A1 ∪ A2 ∪ ... ∪ An) = Σi P(Ai) − Σ_{i<j} P(Ai ∩ Aj) + Σ_{i<j<k} P(Ai ∩ Aj ∩ Ak) − ... + (−1)^{n+1} P(A1 ∩ A2 ∩ ... ∩ An)


Independent Events
Definition: Let A and B be two events in a sample space S. Then A and B are independent iff P(A ∩ B) = P(A)P(B). If A and B are not independent, A and B are said to be dependent.
Events may be independent because they are physically unrelated - tossing a coin and rolling a die, two different people falling sick with some non-infectious disease, etc.
This need not be the case, however; it may just be that one event provides no relevant information on the likelihood of occurrence of the other.
Example:
The event A is getting an even number on a roll of a die.
The event B is getting one of the first four numbers.
The intersection of these two events is the event of rolling the number 2 or 4, which we know has probability 1/3.
Are A and B independent? Yes, because P(A)P(B) = (1/2)(2/3) = 1/3 = P(A ∩ B).
This is because the occurrence of A does not affect the likelihood that B will occur, or vice-versa. Why?
If A and B are independent, then A and Bc are also independent, as are Ac and Bc. (We require P(A ∩ Bc) = P(A)P(Bc). But A = (A ∩ B) ∪ (A ∩ Bc), so with A and B independent, P(A ∩ Bc) = P(A) − P(A)P(B) = P(A)[1 − P(B)] = P(A)P(Bc). Starting now with A and Bc, we can use the same argument to show that Ac and Bc are independent.)


Independent Events..examples and special cases


1. A company has 100 employees, 40 men and 60 women. There are 6 male executives. How many female executives should there be for gender and rank to be independent?
solution: If gender and rank are independent, then P(M ∩ E) = P(M)P(E). We can solve for P(E) as P(E) = P(M ∩ E)/P(M) = .06/.4 = .15, i.e. 15 executives in all. So there must be 9 female executives.
2. The experiment involves flipping two coins. A is the event that the coins match and B is the event that the first coin is heads. Are these events independent?
solution: In this case P(B) = P(A) = 1/2 ({H,H} or {T,T}) and P(A ∩ B) = 1/4, so yes, the events are independent.

3. Suppose A and B are disjoint sets in S. Does it tell us anything about the independence of
events A and B?
4. Remember that disjointness is a property of sets whereas independence is a property of the
associated probability measure and the dependence of events will depend on the probability
measure that is being used.


Independence with 3 or more events


Definition: Let A1, A2, A3, ..., An be events in the sample space S. Then A1, A2, A3, ..., An are mutually independent iff P(Ai1 ∩ Ai2 ∩ ... ∩ Aik) = P(Ai1)P(Ai2)...P(Aik) for every collection of k of these events, where 2 ≤ k ≤ n. These events are pairwise independent iff P(Ai ∩ Aj) = P(Ai)P(Aj) for all i ≠ j.
Clearly mutual independence implies pairwise independence but not vice-versa.
Examples:
One ticket is chosen at random from a box containing 4 lottery tickets with numbers 112, 121, 211, 222.
The event Ai is that a 1 occurs in the ith place of the chosen number.
P(Ai) = 1/2 for i = 1, 2, 3, and P(A1 ∩ A2) = P({112}) = 1/4. Similarly for A1 ∩ A3 and A2 ∩ A3. These 3 events are therefore pairwise independent.
Are they mutually independent? Clearly not: P(A1 ∩ A2 ∩ A3) = 0 ≠ P(A1)P(A2)P(A3).
Toss two dice, white and black. The sample space consists of all ordered pairs (i, j), i, j = 1, 2, ..., 6. Define the following events:
A1: first die ∈ {1, 2, 3}, P(A1) = 1/2
A2: first die ∈ {3, 4, 5}, P(A2) = 1/2
A3: the sum of the faces equals 9, P(A3) = 1/9
In this case, P(A1 ∩ A2 ∩ A3) = P({(3, 6)}) = 1/36 = (1/2)(1/2)(1/9) = P(A1)P(A2)P(A3), but P(A1 ∩ A3) = P({(3, 6)}) = 1/36 ≠ P(A1)P(A3) = 1/18, so the events are not mutually independent, nor pairwise independent.


Conditional probability
When we conduct an experiment, we are absolutely sure that the event S will occur.
Suppose now we have some additional information about the outcome, say that it is an element of B ⊂ S.
What effect does this have on the probabilities of events in S? How exactly can we use such additional information to compute conditional probabilities?
Example: The experiment involves tossing two fair coins in succession. What is the probability of two tails? Suppose you know the first one is a head? What if it is a tail?
We denote the conditional probability of event A, given B, by P(A|B).
B is now the conditional sample space and, since B is certain to occur, P(B|B) = 1.
Event A will now occur iff A ∩ B occurs.
Definition: Let A and B be two events in a sample space S. If P(B) ≠ 0, then the conditional probability of event A given event B is given by
P(A|B) = P(A ∩ B)/P(B)
Notice that P(·|B) is now a probability set function (probability measure) defined for subsets of B.
For independent events A and B, the conditional and unconditional probabilities are equal: P(A|B) = P(A)P(B)/P(B) = P(A).


Conditional probability...the multiplication rule


The above definition of conditional probability can be manipulated to arrive at a set of rules that are useful for computing probabilities in particular types of problems.
We defined the conditional probability of event A given event B as P(A|B) = P(A ∩ B)/P(B).
Multiplying both sides by P(B), we have the multiplication rule for probabilities:
P(A ∩ B) = P(A|B)P(B)
This is especially useful in cases where an experiment can be interpreted as being conducted in two stages. In such cases, P(A|B) and P(B) can often be very easily assigned.
Examples:
Two cards are drawn successively, without replacement, from an ordinary deck of playing cards. What is the probability of drawing two aces?
Here the event B is that the first card drawn is an ace and the event A is that the second card is an ace. P(B) is clearly 4/52 = 1/13 and P(A|B) = 3/51 = 1/17. The required probability P(A ∩ B) is therefore (1/13)(1/17) = 1/221.
There are two types of candidates, competent and incompetent (C and I). The share of I-type candidates seeking admission is 0.3. All candidates are interviewed by a committee and the committee rejects incompetent candidates with probability 0.9. What is the probability that an incompetent candidate is admitted?
Here we're interested in P(A ∩ I), the probability that a randomly chosen candidate is incompetent and gets admitted, where P(I) = .3 and P(A|I) = .1, so the required probability is .03.


The law of total probability


Let S denote the sample space of an experiment and consider k events A1, A2, ..., Ak in S such that A1, A2, ..., Ak are disjoint and A1 ∪ A2 ∪ ... ∪ Ak = S. These events are said to form a partition of the sample space S.
If B is any other event, then the events A1 ∩ B, A2 ∩ B, ..., Ak ∩ B form a partition of B: B = (A1 ∩ B) ∪ (A2 ∩ B) ∪ ... ∪ (Ak ∩ B) and, since these are disjoint events, P(B) = Σ_{i=1}^{k} P(Ai ∩ B).
If P(Ai) > 0 for all i, then using the multiplication rule derived above, this can be written as:
P(B) = Σ_{i=1}^{k} P(Ai)P(B|Ai)
This is known as the law of total probability.
Example: You're playing a game in which your score is equally likely to take any integer value between 1 and 50. If your score the first time you play is equal to X, and you play until you score Y ≥ X, what is the probability that Y = 50?
Solution: For each value xi, P(X = xi) = 1/50. We can compute the conditional probability of Y = 50 for each of these values. The event Ai is that X = xi and the event B is getting a 50 to end the game. Given X = xi, the final score Y is equally likely to be any of the 51 − xi values xi, xi + 1, ..., 50, so P(B|Ai) = 1/(51 − xi). The probability of getting xi in the first round and 50 to end the game is given by the product P(B|Ai)P(Ai). The required probability is the sum of these products over all possible values of i:
P(Y = 50) = Σ_{x=1}^{50} [1/(51 − x)] · (1/50) = (1/50)(1 + 1/2 + 1/3 + ... + 1/50) ≈ .09
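A one-line numerical check of this sum (illustrative Python):

    # P(Y = 50) = (1/50) * (1 + 1/2 + ... + 1/50)
    p = sum(1 / (51 - x) for x in range(1, 51)) / 50
    print(round(p, 3))  # 0.09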


Bayes Theorem
Bayes' Theorem (or Bayes' Rule): Let the events A1, A2, ..., Ak form a partition of S such that P(Aj) > 0 for all j = 1, 2, ..., k, and let B be any event such that P(B) > 0. Then for i = 1, ..., k,
P(Ai|B) = P(B|Ai)P(Ai) / Σ_{j=1}^{k} P(Aj)P(B|Aj)
Proof:
By the definition of conditional probability,
P(Ai|B) = P(Ai ∩ B)/P(B)
The denominators in these two expressions are the same by the law of total probability, and the numerators are the same by the multiplication rule.
In the case where the partition of S consists of only two events,
P(A|B) = P(B|A)P(A) / [P(B|A)P(A) + P(B|Ac)P(Ac)]


Bayes Rule...remarks
Bayes' rule provides us with a method of updating the probabilities of events in the partition based on the new information provided by the occurrence of the event B.
Since P(Aj) is the probability of event Aj prior to the occurrence of event B, it is referred to as the prior probability of event Aj.
P(Aj|B) is the updated probability of the same event after the occurrence of B and is called the posterior probability of event Aj.
Bayes' rule is very commonly used in game-theoretic models. For example, in political economy models a Bayes-Nash equilibrium is a standard equilibrium concept: players (say voters) start with beliefs about politicians and update these beliefs when politicians take actions. Beliefs are constrained to be updated based on Bayes' conditional probability formula.
In Bayesian estimation, prior distributions on population parameters are updated given
information contained in a sample. This is in contrast to more standard procedures where
only the sample information is used. The sample would now lead to different estimates,
depending on the prior distribution of the parameter that is used.
A word about Bayes: He was a non-conformist clergyman (1702-1761), with no formal
mathematics degree. He studied logic and theology at the University of Edinburgh.


Bayes Rule ...examples


C1, C2 and C3 are plants producing 10, 50 and 40 per cent of a company's output. The percentage of defective pieces produced by each of these is 1, 3 and 4 respectively. Given that a randomly selected piece is defective, what is the probability that it is from the first plant?
Let C denote the event that the selected piece is defective. Then
P(C1|C) = P(C|C1)P(C1)/P(C) = (.01)(.1) / [(.01)(.1) + (.03)(.5) + (.04)(.4)] = 1/32 ≈ .03
How do the prior and posterior probabilities of the event C1 compare? What does this tell you about the difference between the priors and posteriors for the other events?
Suppose that there is a new blood test to detect a virus. Only 1 in every thousand people in the population has the virus. The test is 98 per cent effective in detecting the disease in people who have it and gives a false positive for one per cent of disease-free persons tested. What is the probability that the person actually has the disease given a positive test result?
P(Disease|Positive) = P(Positive|Disease)P(Disease)/P(Positive) = (.98)(.001) / [(.98)(.001) + (.01)(.999)] = .089
So in spite of the test being very effective in catching the disease, we have a large number of false positives.
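A tiny illustrative Python sketch of both posterior calculations (the helper function posterior is hypothetical, written here only for this check):

    def posterior(prior, likelihood, priors, likelihoods):
        """Bayes' rule: P(A|B) = P(B|A)P(A) / sum_j P(B|Aj)P(Aj)."""
        return likelihood * prior / sum(p * l for p, l in zip(priors, likelihoods))

    # Defective piece: probability it came from plant C1
    print(posterior(0.1, 0.01, [0.1, 0.5, 0.4], [0.01, 0.03, 0.04]))  # 0.03125 = 1/32
    # Blood test: P(disease | positive)
    print(posterior(0.001, 0.98, [0.001, 0.999], [0.98, 0.01]))       # ≈ 0.089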


Bayes Rule ... priors, posteriors and politics


To understand the relationship between prior and posterior probabilities a little better, consider
the following example:
A politician, on entering parliament, has a reasonably good reputation. A citizen attaches a prior probability of 3/4 to his being honest (undertaking policies to maximize social welfare, rather than his bank balance).
At the end of his tenure, the citizen finds a very large number of potholes on roads in the politician's constituency. While these do not leave the citizen with a favorable impression of the incumbent, it is possible that the unusually heavy rainfall over these years was responsible.
Elections are coming up. How does the citizen use this information on road conditions to update his assessment of the moral standing of the politician? Let us compute the posterior probability of the politician's being honest, given the event that the roads are in bad condition.
Suppose that the probability of bad roads is 1/3 if the politician is honest and is 2/3 if he/she is dishonest.
The posterior probability of the politician being honest is now given by
P(honest|bad roads) = P(bad roads|honest)P(honest)/P(bad roads) = (1/3)(3/4) / [(1/3)(3/4) + (2/3)(1/4)] = 3/5
What would the posterior be if the prior is equal to 1? What if the prior is zero? What if the probability of bad roads was equal to 1/2 for both types of politicians? When are differences between priors and posteriors going to be large?


The Monty Hall problem


A game show host leads the contestant to a wall with three closed doors.
Behind one of these is a fancy car; behind the other two, a consolation prize (a bag of sweets).
The contestant must first choose a door without any prior knowledge of what is behind each door.
The host then opens one of the remaining doors that hides a bag of sweets.
The contestant is given an opportunity to switch doors and wins whatever is behind the door that is finally chosen by him.
Does he raise his chances of winning the car by switching?
Suppose that the contestant chooses door 1 and the host opens door 3. Denote by A1, A2 and A3 the events that the car is behind doors 1, 2 and 3 respectively. Let B be the event that the host opens door 3.
We'd like to compare P(A1|B) and P(A2|B).
By Bayes' rule, the denominator of both these expressions is P(B); we therefore need to compare P(B|A1)P(A1) and P(B|A2)P(A2).
The first expression is (1/2)(1/3), the second is (1/3)·1 (because if the car is behind door 2 then door 3 will certainly be opened, so P(B|A2) = 1).
The contestant can therefore double his probability of being correct by switching. The posterior probability of A2 is 2/3 while that of A1 remains 1/3.
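An illustrative simulation sketch (Python, assuming the host always opens an unchosen door with sweets behind it):

    import random

    random.seed(1)
    trials = 100_000
    wins_stay = wins_switch = 0
    for _ in range(trials):
        car = random.randint(1, 3)
        choice = random.randint(1, 3)
        # The host opens a door that is neither the contestant's choice nor the car
        opened = next(d for d in (1, 2, 3) if d != choice and d != car)
        switched = next(d for d in (1, 2, 3) if d != choice and d != opened)
        wins_stay += (choice == car)
        wins_switch += (switched == car)
    print(wins_stay / trials, wins_switch / trials)  # ≈ 1/3 and 2/3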


Bayes Rule: The Sally Clark case


Sally Clark was a British solicitor who became the victim of one of the great miscarriages of justice in modern British legal history.
Her first son died within a few weeks of his birth in 1996 and her second one died similarly in 1998, after which she was arrested and tried for their murder.
A well-known paediatrician, Professor Sir Roy Meadow, testified that the chance of two children from an affluent family suffering sudden infant death syndrome was 1 in 73 million, a figure arrived at by squaring 1 in 8,500, the likelihood of a cot death in similar circumstances.
Clark was convicted in November 1999. In 2001 the Royal Statistical Society issued a public statement expressing its concern at the misuse of statistics in the courts and arguing that there was no statistical basis for Meadow's claim.
In January 2003, she was released from prison having served more than three years of her sentence, after it emerged that the prosecution's pathologist had failed to disclose microbiological reports that suggested one of her sons had died of natural causes.
RSS statement excerpts:

In the recent highly-publicised case of R v. Sally Clark, a medical expert

witness drew on published studies to obtain a figure for the frequency of sudden infant death syndrome
(SIDS, or cot death) in families having some of the characteristics of the defendant's family. He went on
to square this figure to obtain a value of 1 in 73 million for the frequency of two cases of SIDS in such a
family. ... This approach is, in general, statistically invalid. It would only be valid if SIDS cases arose
independently within families... there are very strong a priori reasons for supposing that the assumption
will be false. There may well be unknown genetic or environmental factors that predispose families to SIDS,
so that a second case within the family becomes much more likely. The true frequency of families with two
cases of SIDS may be very much less incriminating than the figure presented to the jury at trial.


Topic 2: Random Variables and Probability Distributions


Rohini Somanathan
Course 003, 2014-2015


Sample spaces and random variables


The outcomes of some experiments inherently take the form of real numbers:
crop yields with the application of a new type of fertiliser
students' scores on an exam
miles per litre of an automobile
Other experiments have a sample space that is not inherently a subset of Euclidean space:
Outcomes from a series of coin tosses
The character of a politician
The modes of transport taken by a city's population
The degree of satisfaction respondents report for a service provider - patients in a hospital may be asked whether they are very satisfied, satisfied or dissatisfied with the quality of treatment. Our sample space would consist of arrays of the form (VS, S, S, DS, ...)
The caste composition of elected politicians.
The gender composition of children attending school.
A random variable is a function that assigns a real number to each possible outcome s ∈ S.


Random variables
Definition: Let (S, 𝒮, P) be a probability space. If X : S → ℝ is a real-valued function having as its domain the elements of S, then X is called a random variable.
A random variable is therefore a real-valued function defined on the space S. Typically x is used to denote the image value, i.e. x = X(s).
If the outcomes of an experiment are inherently real numbers, they are directly interpretable as values of a random variable, and we can think of X as the identity function, so X(s) = s.
We choose random variables based on what we are interested in getting out of the experiment. For example, we may be interested in the number of students passing an exam, and not the identities of those who pass. A random variable would assign each element in the sample space a number corresponding to the number of passes associated with that outcome.
We therefore begin with a probability space (S, 𝒮, P) and arrive at an induced probability space (R(X), ℬ, PX).
How exactly do we arrive at the function PX(·)? As long as every set A ⊂ R(X) is associated with an event in our original sample space S, PX(A) is just the probability assigned to that event by P.


Random variables..examples
1. Tossing a coin ten times.
The sample space consists of the 2^10 possible sequences of heads and tails.
There are many different random variables that could be associated with this experiment: X1 could be the number of heads, X2 the longest run of heads divided by the longest run of tails, X3 the number of times we get two heads immediately before a tail, etc.
For s = HTTHHHHTTH, what are the values of these random variables?
2. Choosing a point in a rectangle within a plane
An experiment involves choosing a point s = (x, y) at random from the rectangle S = {(x, y) : 0 ≤ x ≤ 2, 0 ≤ y ≤ 1/2}.
The random variable X could be the x-coordinate of the point, and an event is X taking values in [1, 2].
Another random variable Z could be the distance of the point from the origin, Z(s) = √(x² + y²).
3. Heights, weights, distances, temperature, scores, incomes... In these cases, we can have X(s) = s since these are already expressed as real numbers.


Induced probability spaces..examples


Let's look at some examples of how we arrive at our probability measure PX(A).
A coin is tossed once and we're interested in the number of heads, X. The probability assigned to the set A = {1} in our new space is just the probability associated with one head in our original space. So Pr(X = x) = 1/2, x ∈ {0, 1}.
With two tosses, the probability attached to the set A = {1} is the sum of the probabilities associated with the disjoint sets {H, T} and {T, H} whose union forms this event. In this case Pr(X = x) = (2 choose x)(1/2)², x ∈ {0, 1, 2}.
Now consider a sequence of flips of an unbiased coin, and let our random variable X be the number of flips needed for the first head. We now have
Pr(X = x) = f(x) = (1/2)^(x−1) (1/2) = (1/2)^x,  x = 1, 2, 3, ...
Is this a valid probability measure? (See the check below.)
How is the nature of the sample space in the first two coin-flipping examples different from the third?
In all these cases we have a discrete random variable.
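A quick numerical check (illustrative Python) that these geometric probabilities sum to one:

    # Partial sums of P(X = x) = (1/2)**x approach 1 as more terms are added
    for n in (5, 10, 20, 50):
        print(n, sum(0.5**x for x in range(1, n + 1)))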


The distribution function


Once we've assigned real numbers to all the subsets of our sample space S that are of interest, we can restrict our attention to the probabilities associated with the occurrence of sets of real numbers.
Consider the set A = (−∞, x].
Now P(A) = Pr(X ≤ x).
F(x) is used to denote the probability Pr(X ≤ x) and is called the distribution function of X.
Definition: The distribution function F of a random variable X is a function defined for each real number x as follows:
F(x) = P(X ≤ x) for −∞ < x < ∞
If there are a finite number of elements w in A receiving positive probability, this probability can be computed as
F(x) = Σ_{w ≤ x} f(w)
In this case, the distribution function will be a step function, jumping at all points x in R(X) which are assigned positive probability.
Consider the experiment of tossing two fair coins. Describe the probability space induced by the random variable X, the number of heads, and derive the distribution function of X.


Discrete distributions
Definition: A random variable X has a discrete distribution if X can take only a finite number k of different values x1, x2, ..., xk or an infinite sequence of different values x1, x2, ....
The function f(x) = P(X = x) is the probability function of X. We define it to be f(x) for all values x in our sample space R(X) and zero elsewhere.
If X has a discrete distribution, the probability of any subset A of the real line is given by P(X ∈ A) = Σ_{xi ∈ A} f(xi).
Examples:
1. The discrete uniform distribution: picking one of the first k positive integers at random
f(x) = 1/k for x = 1, 2, ..., k; 0 otherwise
2. The binomial distribution: the probability of x successes in n trials
f(x) = (n choose x) p^x q^(n−x) for x = 0, 1, 2, ..., n; 0 otherwise
Derive the distribution functions for each of these.


Continuous distributions
The sample space associated with our random variable often has an infinite number of points.

Example: A point is randomly selected inside a circle of unit radius with origin (0, 0), where the probability assigned to being in a set A ⊂ S is P(A) = (area of A)/π, and X is the distance of the selected point from the origin. In this case F(x) = Pr(X ≤ x) = (area of circle with radius x)/π = x², so the distribution function of X is given by
F(x) = 0 for x < 0; x² for 0 ≤ x < 1; 1 for x ≥ 1
Definition: A random variable X has a continuous distribution if there exists a nonnegative function f defined on the real line, such that for any interval A,
P(X ∈ A) = ∫_A f(x) dx
The function f is called the probability density function or p.d.f. of X and must satisfy the conditions below:
1. f(x) ≥ 0
2. ∫_{−∞}^{∞} f(x) dx = 1
What is f(x) for the above example? How can you use this to compute P(1/4 < X ≤ 1/2)? How would you use F(x) instead?


Continuous distributions..examples
1. The uniform distribution on an interval: Suppose a and b are two real numbers with a < b. A point x is selected from the interval S = {x : a ≤ x ≤ b} and the probability that it belongs to any subinterval of S is proportional to the length of that subinterval. It follows that the p.d.f. must be constant on S and zero outside it:
f(x) = 1/(b − a) for a ≤ x ≤ b; 0 otherwise
Notice that the value of the p.d.f. is the reciprocal of the length of the interval, these values can be greater than one, and the assignment of probabilities does not depend on whether the distribution is defined over the closed interval [a, b] or the open interval (a, b).
2. Unbounded random variables: It is sometimes convenient to define a p.d.f. over unbounded sets, because such functions may be easier to work with and may approximate the actual distribution of a random variable quite well. An example is:
f(x) = 0 for x ≤ 0; 1/(1 + x)² for x > 0
3. Unbounded densities: The following function is unbounded around zero but still represents a valid density:
f(x) = (2/3) x^(−1/3) for 0 < x < 1; 0 otherwise


Mixed distributions
Often the process of collecting or recording data leads to censoring, and instead of obtaining
a sample from a continuous distribution, we obtain one from a mixed distribution.
Examples:
The weight of an object is a continuous random variable, but our weighing scale only
records weights up to a certain level.
Households with very high incomes often underreport their income; for incomes above a certain level (say $250,000), surveys often club all households together - this variable is therefore top-censored.
In each of these examples, we can derive the probability distribution for the new random variable, given the distribution for the continuous variable. In the example we've just considered,
f(x) = 0 for x ≤ 0; 1/(1 + x)² for x > 0
suppose we record X = 3 for all values of X ≥ 3. The probability function for our new random variable Y is given by the same p.d.f. for values less than 3 and by P(Y = 3) = 1/4.
Some variables, such as the number of hours worked per week, have a mixed distribution in the population, with mass points at 0 and 40.


Properties of the distribution function


Recall that the distribution function or cumulative distribution function (c.d.f.) for a random variable X is defined as
F(x) = P(X ≤ x) for −∞ < x < ∞.
It follows that for any random variable (discrete, continuous or mixed), the domain of F is the real line and the values of F(x) must lie in [0, 1]. We can also establish that all distribution functions have the following three properties:
1. F(x) is a nondecreasing function of x, i.e. if x1 < x2 then F(x1) ≤ F(x2).
(The occurrence of the event {X ≤ x1} implies the occurrence of {X ≤ x2}, so P(X ≤ x1) ≤ P(X ≤ x2).)
2. lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1
(As x → ∞, the events {X ≤ x} expand to the entire sample space; as x → −∞, they shrink to the null set.)
3. F(x) is right-continuous, i.e. F(x) = F(x⁺) at every point x, where F(x⁺) is the right-hand limit of F(x).
(For discrete random variables, there will be a jump at values that are taken with positive probability.)


Computing probabilities using the distribution function


RESULT 1: For any given value of x, P(X > x) = 1 − F(x)
RESULT 2: For any values x1 and x2 where x1 < x2, P(x1 < X ≤ x2) = F(x2) − F(x1)
Proof: Let A be the event X ≤ x1 and B be the event X ≤ x2. B can be written as the union of two disjoint events, B = (A ∩ B) ∪ (Ac ∩ B). Since A ⊂ B, P(A ∩ B) = P(A). The event we're interested in is Ac ∩ B, whose probability is given by P(B) − P(A), or P(x1 < X ≤ x2) = P(X ≤ x2) − P(X ≤ x1). Now apply the definition of a d.f.
RESULT 3: For any given value x, P(X < x) = F(x⁻)
RESULT 4: For any given value x, P(X = x) = F(x⁺) − F(x⁻)
The distribution function of a continuous random variable will be continuous and, since F(x) = ∫_{−∞}^{x} f(t) dt, F′(x) = f(x).
For discrete and mixed discrete-continuous random variables, F(x) will exhibit a countable number of discontinuities at jump points, reflecting the assignment of positive probabilities to a countable number of events.


Examples of distribution functions


Consider the experiment of rolling a die or tossing a fair coin, with X in the first case being
the number of dots and in the second case the number of heads. Graph the distribution
function of X in each of these cases.
What about the experiment of picking a point in the unit interval [0, 1] with X as the
distance from the origin?
What type of probability function corresponds to the following distribution function?

[Figure 3.6 of DeGroot and Schervish: an example of a c.d.f. - a step-like function F(x) that rises from 0 towards 1, with jumps at x1 and x3, flat stretches in between, and F(x) = 1 from x4 onwards.]

The quantile function

The distribution function of X gives us the probability that X ≤ x for all real numbers x.
Suppose we are given a probability p and want to know the value of x corresponding to this value of the distribution function.
If F is a one-to-one function, then it has an inverse, and the value we are looking for is given by F⁻¹(p).
Examples: median income would be found by F⁻¹(1/2), where F is the distribution function of income.
Definition: When the distribution function of a random variable X is continuous and one-to-one over the whole set of possible values of X, we call the function F⁻¹ the quantile function of X. The value of F⁻¹(p) is called the pth quantile of X, or the 100p-th percentile of X, for each 0 < p < 1.
Example: If X has a uniform distribution over the interval [a, b], F(x) = (x − a)/(b − a) over this interval, 0 for x ≤ a and 1 for x > b. Given a value p, we solve p = (x − a)/(b − a) for the pth quantile: x = pb + (1 − p)a. Compute this for p = .5, .25, .9, . . .


Examples: computing quantiles, etc.


1. The p.d.f. of a random variable is given by:
f(x) = x/8 for 0 ≤ x ≤ 4; 0 otherwise
Find the value of t such that
(a) P(X ≤ t) = 1/4
(b) P(X ≥ t) = 1/2
2. The p.d.f. of a random variable is given by:
f(x) = cx² for 1 ≤ x ≤ 2; 0 otherwise
Find the value of the constant c and Pr(X > 3/2). (A symbolic check follows below.)
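An illustrative symbolic check of both problems using sympy (assuming the supports given above; not part of the original slides):

    from sympy import symbols, integrate, solve, Rational

    x, t, c = symbols('x t c', positive=True)

    # Problem 1: f(x) = x/8 on [0, 4], so F(t) = t**2/16
    print(solve(integrate(x / 8, (x, 0, t)) - Rational(1, 4), t))      # [2]
    print(solve(1 - integrate(x / 8, (x, 0, t)) - Rational(1, 2), t))  # [2*sqrt(2)]

    # Problem 2: f(x) = c*x**2 on [1, 2] must integrate to one
    c_val = solve(integrate(c * x**2, (x, 1, 2)) - 1, c)[0]            # 3/7
    print(c_val, integrate(c_val * x**2, (x, Rational(3, 2), 2)))      # 3/7 37/56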


Bivariate distributions
Social scientists are typically interested in the manner in which multiple attributes of people and the societies they live in are related. The object of interest is then a multivariate probability distribution (examples: education and earnings, days ill per month and age, sex-ratios and areas under rice cultivation).
This involves dealing with the joint distribution of two or more random variables. Bivariate distributions attach probabilities to events that are defined by values taken by two random variables (say X and Y).
Values taken by these random variables are now ordered pairs, (xi, yi), and an event A is a set of such values.
If both X and Y are discrete random variables, the probability function is f(x, y) = P(X = x and Y = y) and P((X, Y) ∈ A) = Σ_{(xi, yi) ∈ A} f(xi, yi).


Representing a discrete bivariate distribution


If both X and Y are discrete, this function takes only a finite number of values.
If there are only a small number of these values, they can be usefully presented in a table.
The table below could represent the probabilities of receiving different levels of education.
X is the highest level of education and Y is gender:
education             male    female
none                  .05     .2
primary               .25     .1
middle                .15     .04
high                  .1      .03
senior secondary      .03     .02
graduate and above    .02     .01

What are some features of a table like this one? In particular, how would we obtain
probabilities associated with the following events:
receiving no education
becoming a female graduate
completing primary school
What else do you learn from the table about the population of interest?


Continuous bivariate distributions


We can extend our definition of a continuous univariate distribution to the bivariate case:
Definition: Two random variables X and Y have a continuous joint distribution if there exists a nonnegative function f defined over the xy-plane such that for any subset A of the plane
P[(X, Y) ∈ A] = ∫∫_A f(x, y) dx dy
f is now called the joint probability density function and must satisfy
1. f(x, y) ≥ 0 for −∞ < x < ∞ and −∞ < y < ∞
2. ∫∫ f(x, y) dx dy = 1
Example 1: Given the following joint density function on X and Y, we'll calculate P(X ≥ Y):
f(x, y) = cx²y for x² ≤ y ≤ 1; 0 otherwise
First find c to make this a valid joint density (notice the limits of integration here) - it will turn out to be 21/4. Then integrate the density over y ∈ (x², x) and x ∈ (0, 1). Using this density, P(X ≥ Y) = 3/20.
Example 2: A point (X, Y) is selected at random from inside the circle x² + y² ≤ 9. Determine the joint density function, f(x, y).
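An illustrative symbolic check of Example 1 using sympy (under the region x² ≤ y ≤ 1 stated above):

    from sympy import symbols, integrate, solve

    x, y, c = symbols('x y c', real=True)

    # Normalise c*x**2*y over the region x**2 <= y <= 1, with -1 <= x <= 1
    c_val = solve(integrate(c * x**2 * y, (y, x**2, 1), (x, -1, 1)) - 1, c)[0]
    print(c_val)  # 21/4

    # P(X >= Y): the region x**2 <= y <= x is non-empty only for 0 <= x <= 1
    print(integrate(c_val * x**2 * y, (y, x**2, x), (x, 0, 1)))  # 3/20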


Bivariate distribution functions


Definition: The joint distribution function of two random variables X and Y is defined as the function F such that for all values of x and y (−∞ < x < ∞ and −∞ < y < ∞)
F(x, y) = P(X ≤ x and Y ≤ y)
The probability that (X, Y) will lie in a specified rectangle in the xy-plane is given by
Pr(a < X ≤ b and c < Y ≤ d) = F(b, d) − F(a, d) − F(b, c) + F(a, c)
Note: The distinction between weak and strict inequalities is important when points on the boundary of the rectangle occur with positive probability.
The distribution functions of X and Y can be derived as:
Pr(X ≤ x) = F1(x) = lim_{y→∞} F(x, y) and Pr(Y ≤ y) = F2(y) = lim_{x→∞} F(x, y)
If F(x, y) is continuously differentiable in both its arguments, the joint density is derived as:
f(x, y) = ∂²F(x, y)/∂x∂y
and given the density, we can integrate w.r.t. x and y over the appropriate limits to get the distribution function.
Example: Suppose that, for x and y ∈ [0, 2], we have F(x, y) = (1/16)xy(x + y). Derive the distribution functions of X and Y and their joint density. Notice the (x, y) range over which F(x, y) is strictly increasing.


Marginal distributions
A distribution of X derived from the joint distribution of X and Y is known as the marginal
distribution of X. For a discrete random variable:
f1(x) = P(X = x) = Σ_y P(X = x and Y = y) = Σ_y f(x, y)
and analogously
f2(y) = P(Y = y) = Σ_x P(X = x and Y = y) = Σ_x f(x, y)
For a continuous joint density f(x, y), the marginal densities for X and Y are given by:
f1(x) = ∫_{−∞}^{∞} f(x, y) dy and f2(y) = ∫_{−∞}^{∞} f(x, y) dx

Go back to our table representing the joint distribution of gender and education and find
the marginal distribution of education.
Can one construct the joint distribution from one of the marginal distributions?
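An illustrative Python sketch that computes these marginals from the education-gender table given earlier:

    import numpy as np

    # Rows: none, primary, middle, high, senior secondary, graduate and above
    # Columns: male, female
    joint = np.array([[.05, .20],
                      [.25, .10],
                      [.15, .04],
                      [.10, .03],
                      [.03, .02],
                      [.02, .01]])

    print(joint.sum(axis=1))  # marginal distribution of education
    print(joint.sum(axis=0))  # marginal distribution of gender: [0.6, 0.4]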


Independent random variables


Definition: The two random variables X and Y are independent if, for any two sets A and B of real numbers,
P(X ∈ A and Y ∈ B) = P(X ∈ A)P(Y ∈ B)
In other words, if A is an event whose occurrence depends only on values taken by X and B's occurrence depends only on values taken by Y, then the random variables X and Y are independent only if the events A and B are independent, for all such events A and B.
The condition for independence can be alternatively stated in terms of the joint and marginal distribution functions of X and Y by letting the sets A and B be the intervals (−∞, x] and (−∞, y] respectively:
F(x, y) = F1(x)F2(y)
For discrete distributions, we simply define the sets A and B as the points x and y and require f(x, y) = f1(x)f2(y).
In terms of the density functions, we say that X and Y are independent if it is possible to choose functions f1 and f2 such that the following factorization holds for (−∞ < x < ∞ and −∞ < y < ∞):
f(x, y) = f1(x)f2(y)


Independent random variables..examples


There are two independent measurements X and Y of rainfall at a certain location, each with density
g(x) = 2x for 0 ≤ x ≤ 1; 0 otherwise
Find the probability that X + Y ≤ 1. (A check follows below.)
The joint density 4xy is obtained by multiplying the marginal densities because these variables are independent. The required probability of 1/6 is then obtained by integrating over y ∈ (0, 1 − x) and x ∈ (0, 1).
How might we use a table of probabilities to determine whether two random variables are independent?
Given the following density, can we tell whether the variables X and Y are independent?
f(x, y) = ke^(−(x+2y)) for x ≥ 0 and y ≥ 0; 0 otherwise
Notice that we can factorize the joint density as the product of k1e^(−x) and k2e^(−2y) where k1k2 = k. To obtain the marginal densities of X and Y, we multiply these functions by appropriate constants which make them integrate to unity. This gives us
f1(x) = e^(−x) for x ≥ 0 and f2(y) = 2e^(−2y) for y ≥ 0
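The promised check that P(X + Y ≤ 1) = 1/6 for the rainfall example above (illustrative sympy sketch):

    from sympy import symbols, integrate

    x, y = symbols('x y', positive=True)

    # Joint density 4xy; integrate over y in (0, 1 - x) and x in (0, 1)
    print(integrate(4 * x * y, (y, 0, 1 - x), (x, 0, 1)))  # 1/6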


Dependent random variables..examples


Given the following densities, let's see why the variables X and Y are dependent:
1. f(x, y) = x + y for 0 < x < 1 and 0 < y < 1; 0 otherwise
Notice that we cannot factorize the joint density as the product of a non-negative function of x and another non-negative function of y. Computing the marginals gives us
f1(x) = x + 1/2 for 0 < x < 1 and f2(y) = y + 1/2 for 0 < y < 1
so the product of the marginals is not equal to the joint density.
2. Suppose we have
f(x, y) = kx²y² for x² + y² ≤ 1; 0 otherwise
In this case the possible values X can take depend on Y and therefore, even though the joint density can be factorized, the same factorization cannot work for all values of (x, y).
More generally, whenever the space of positive probability density of X and Y is bounded by a curve, rather than a rectangle, the two random variables are dependent.


Dependent random variables..a result


Whenever the space of positive probability density of X and Y is bounded by a curve, rather
than a rectangle, the two random variables are dependent. If, on the other hand, the support of
f(x, y) is a rectangle and the joint density is of the form f(x, y) = kg(x)h(y), then X and Y are
independent.
Proof: For the latter part of the result, suppose the support of f(x, y) is the rectangle a ≤ x ≤ b, c ≤ y ≤ d, where
a < b and c < d. The joint density f(x, y) = k g(x)h(y) can be written as k1 g(x) · k2 h(y), where
k1 = 1 / ∫_a^b g(x)dx and k2 = 1 / ∫_c^d h(y)dy.
The marginal densities are f1(x) = k1 g(x) ∫_c^d k2 h(y)dy = k1 g(x) and f2(y) = k2 h(y) ∫_a^b k1 g(x)dx = k2 h(y),
whose product k1 k2 g(x)h(y) = k g(x)h(y) gives us the joint density.

Now to show that if the support is not a rectangle, the variables are dependent: start with a point (x, y) outside
the domain where f(x, y) > 0. If X and Y were independent, we would have f(x, y) = f1(x)f2(y), so one of these
factors must be zero. Now as we move due south and enter the set where f(x, y) > 0, our value of x has not
changed, so it could not be that f1(x) was zero at the original point. Similarly, if we move west, y is unchanged,
so it could not be that f2(y) was zero at the original point. So we have a contradiction.


Conditional distributions
Definition: Consider two discrete random variables X and Y with a joint probability function
f(x, y) and marginal probability functions f1(x) and f2(y). After the value Y = y has been
observed, we can write the probability that X = x using our definition of conditional
probability:
P(X = x | Y = y) = P(X = x and Y = y) / P(Y = y) = f(x, y) / f2(y)
g1(x|y) = f(x, y)/f2(y) is called the conditional probability function of X given that Y = y. Notice that:
1. for each fixed value of y, g1(x|y) is a probability function over all possible values of X
because it is non-negative and
Σ_x g1(x|y) = (1/f2(y)) Σ_x f(x, y) = f2(y)/f2(y) = 1
2. conditional probabilities are proportional to joint probabilities because they just divide
these by a constant.
We cannot use the definition of conditional probability to derive the conditional density for
continuous random variables because the probability that Y takes any particular value y is zero.
We simply define the conditional probability density function of X given Y = y as
g1(x|y) = f(x, y) / f2(y) for (−∞ < x < ∞ and −∞ < y < ∞)


Conditional versus joint densities


The numerator in g1(x|y) = f(x, y)/f2(y) is a section of the surface representing the joint density, and
the denominator is the constant by which we need to divide the numerator to get a valid density
(one which integrates to unity).


Deriving conditional distributions... the discrete case


For the education-gender example, we can find the distribution of educational achievement
conditional on being male, the distribution of gender conditional on completing college, or any
other conditional distribution we are interested in :
education              male   female   f(education | gender = male)
none                   .05    .2       .08
primary                .25    .1       .42
middle                 .15    .04      .25
high                   .1     .03      .17
senior secondary       .03    .02      .05
graduate and above     .02    .01      .03
f(gender | graduate)   .67    .33
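A minimal sketch of how these conditional distributions are computed from the joint probabilities (Python/NumPy; the array and variable names are illustrative only):

```python
import numpy as np

# Joint probabilities f(education, gender) from the table above; columns: male, female.
education = ["none", "primary", "middle", "high", "senior secondary", "graduate and above"]
joint = np.array([[.05, .20],
                  [.25, .10],
                  [.15, .04],
                  [.10, .03],
                  [.03, .02],
                  [.02, .01]])

p_male = joint[:, 0].sum()                   # marginal f2(male) = 0.60
print(joint[:, 0] / p_male)                  # f(education | gender = male)

grad = education.index("graduate and above")
print(joint[grad] / joint[grad].sum())       # f(gender | graduate) = [.67, .33]
```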


Deriving conditional distributions... the continuous case


For the continuous joint distribution we've looked at before,
f(x, y) = c x² y for x² ≤ y ≤ 1 (with c = 21/4), and 0 otherwise,
the marginal distribution of X is given by
f1(x) = ∫_{x²}^{1} (21/4) x² y dy = (21/8) x² (1 − x⁴) for −1 ≤ x ≤ 1
and the conditional distribution g2(y|x) = f(x, y)/f1(x) is
g2(y|x) = 2y / (1 − x⁴) for x² ≤ y ≤ 1, and 0 otherwise
If X = 1/2, we can compute P(Y ≥ 1/4 | X = 1/2) = 1 and
P(Y ≥ 3/4 | X = 1/2) = ∫_{3/4}^{1} g2(y | 1/2) dy = 7/15
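The conditional probability 7/15 is easy to verify symbolically; a minimal SymPy sketch, assuming c = 21/4 so that the joint density integrates to one:

```python
import sympy as sp

x, y = sp.symbols("x y", positive=True)
f = sp.Rational(21, 4) * x**2 * y                 # joint density on x**2 <= y <= 1

f1 = sp.integrate(f, (y, x**2, 1))                # marginal of X: (21/8) x^2 (1 - x^4)
g2 = sp.simplify(f / f1)                          # conditional density of Y given X = x

print(sp.integrate(g2.subs(x, sp.Rational(1, 2)), (y, sp.Rational(3, 4), 1)))  # 7/15
```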


Construction of the joint distribution


We can use conditional and marginal distributions to arrive at a joint distribution:
f(x, y) = g1 (x|y)f2 (y) = g2 (y|x)f1 (x)

(1)

Notice that the conditional distribution is not defined for a value y0 at which f2 (y) = 0, but this is irrelevant
because at any such value f(x, y0 ) = 0.
Example: X is first chosen from a uniform distribution on (0, 1) and then Y is chosen from a uniform distribution
on (x, 1). The marginal distribution of X is straightforward:
f1(x) = 1 for 0 < x < 1, and 0 otherwise
Given a value of X = x, the conditional distribution is
g2(y|x) = 1/(1 − x) for x < y < 1, and 0 otherwise
Using (1), the joint distribution is
f(x, y) = 1/(1 − x) for 0 < x < y < 1, and 0 otherwise
and the marginal distribution for Y can now be derived as:
f2(y) = ∫_0^y f(x, y)dx = ∫_0^y 1/(1 − x) dx = −log(1 − y) for 0 < y < 1
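A simulation check of the marginal f2(y) = −log(1 − y): draw X uniformly on (0, 1), then Y uniformly on (X, 1), and compare the empirical density of Y with −log(1 − y) at a few points (a NumPy sketch, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
x = rng.uniform(0, 1, size=n)
y = rng.uniform(x, 1)                      # Y | X = x  ~  Uniform(x, 1)

# Empirical density of Y near y0 versus f2(y0) = -log(1 - y0).
for y0 in (0.2, 0.5, 0.8):
    h = 0.01
    empirical = ((y > y0 - h) & (y < y0 + h)).mean() / (2 * h)
    print(y0, round(empirical, 3), round(-np.log(1 - y0), 3))
```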


Multivariate distributions
Our definitions of joint, conditional and marginal distributions can be easily extended to an
arbitrary finite number of random variables. Such a distribution is now called a multivariate
distribution.
The joint distribution function is defined as the function F whose value at any point
(x1, x2, . . . , xn) ∈ ℝⁿ is given by:
F(x1, . . . , xn) = P(X1 ≤ x1, X2 ≤ x2, . . . , Xn ≤ xn)
For a discrete joint distribution, the probability function at any point (x1, x2, . . . , xn) ∈ ℝⁿ is given
by:
f(x1, . . . , xn) = P(X1 = x1, X2 = x2, . . . , Xn = xn)    (2)
and the random variables X1, . . . , Xn have a continuous joint distribution if there is a nonnegative
function f defined on ℝⁿ such that for any subset A ⊂ ℝⁿ,
P[(X1, . . . , Xn) ∈ A] = ∫· · ·∫_A f(x1, . . . , xn) dx1 . . . dxn    (3)
The marginal distribution of any single random variable Xi can now be derived by integrating
over the other variables, e.g.
f1(x1) = ∫· · ·∫ f(x1, . . . , xn) dx2 . . . dxn    (4)
and the conditional probability density function of X1 given values of the other variables is:
g1(x1 | x2, . . . , xn) = f(x1, . . . , xn) / f0(x2, . . . , xn)    (5)


Independence for the multivariate case


Independence: The n random variables X1, . . . , Xn are independent if for any n sets
A1, A2, . . . , An of real numbers,
P(X1 ∈ A1, X2 ∈ A2, . . . , Xn ∈ An) = P(X1 ∈ A1)P(X2 ∈ A2) . . . P(Xn ∈ An)
If the joint distribution function of X1, . . . , Xn is given by F and the marginal d.f. for Xi by
Fi, it follows that X1, . . . , Xn will be independent if and only if, for all points (x1, . . . , xn) ∈ ℝⁿ,
F(x1, . . . , xn) = F1(x1)F2(x2) . . . Fn(xn)
and, if these random variables have a continuous joint distribution with joint density
f(x1, . . . , xn):
f(x1, . . . , xn) = f1(x1)f2(x2) . . . fn(xn)
In the case of a discrete joint distribution the above equality holds for the probability
function f.
Random samples: The n random variables X1, . . . , Xn form a random sample if these
variables are independent and the marginal p.f. or p.d.f. of each of them is the same function f.
It follows that for all points (x1, . . . , xn), their joint p.f. or p.d.f. is given by
g(x1, . . . , xn) = f(x1) . . . f(xn)
The variables that form a random sample are said to be independent and identically
distributed (i.i.d.) and n is the sample size.


Multivariate distributions..example
Suppose we start with the following density function for a variable X1:
f1(x) = e^(−x) for x > 0, and 0 otherwise
and are told that for any given value of X1 = x1, two other random variables X2 and X3 are
independently and identically distributed with the following conditional p.d.f.:
g(t | x1) = x1 e^(−x1 t) for t > 0, and 0 otherwise
The conditional p.d.f. of (X2, X3) given X1 = x1 is then g23(x2, x3 | x1) = x1² e^(−x1(x2+x3)) for non-negative values of
x2, x3 (and zero otherwise), and the joint p.d.f. of the three random variables is given by:
f(x1, x2, x3) = f1(x1) g23(x2, x3 | x1) = x1² e^(−x1(1+x2+x3))
for non-negative values of each of these variables. We can now obtain the marginal joint p.d.f. of
X2 and X3 by integrating over X1.


Distributions of functions of random variables


We'd like to derive the distribution of Y = X², knowing that X has a uniform distribution on (−1, 1):
the density f(x) of X over this interval is 1/2
we know further that Y takes values in [0, 1).
The distribution function of Y is therefore given by
G(y) = P(Y ≤ y) = P(X² ≤ y) = P(−√y ≤ X ≤ √y) = ∫_{−√y}^{√y} f(x)dx = √y
The density is obtained by differentiating this:
g(y) = 1/(2√y) for 0 < y < 1, and 0 otherwise


The Probability Integral Transformation


RESULT: Let X be a continuous random variable with the distribution function F and let
Y = F(X). Then Y must be uniformly distributed on [0, 1]. The transformation from X to Y is
called the probability integral transformation.
We know that the distribution function must take values between 0 and 1. If we pick any of
these values, y, the yth quantile of the distribution of X will be given by some number x and
Pr(Y ≤ y) = Pr(X ≤ x) = F(x) = y
which is the distribution function of a uniform random variable.
This result helps us generate random numbers from various distributions, because it allows
us to transform a sample from a uniform distribution into a sample from some other
distribution provided we can find F⁻¹.
Example: Suppose we want a sample from an exponential distribution. The density is e^(−x)
defined over all x > 0 and the distribution function is 1 − e^(−x). If we pick from a uniform
between 0 and 1, and get (say) .3, we can invert the distribution function to get
x = log(10/7) ≈ .36 as an observation of an exponential random variable.
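A minimal sketch of the probability integral transformation in practice (Python/NumPy), generating unit-exponential draws from uniforms via X = F⁻¹(U) = −log(1 − U):

```python
import numpy as np

rng = np.random.default_rng(3)
u = rng.uniform(size=1_000_000)

# If U ~ Uniform(0, 1), then X = F^{-1}(U) = -log(1 - U) has distribution F(x) = 1 - exp(-x).
x = -np.log(1 - u)

print(-np.log(1 - 0.3))            # the single draw discussed above, ≈ 0.36
print(x.mean(), x.var())           # both should be close to 1 for the unit exponential
```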


Random number generators


Historically, tables of random digits were used to generate a sample from a uniform
distribution. For example, consider the following series of digits
553617280595580771997955130480651347088612
If we want 10 numbers between 1 and 9, we start at a random digit in the table, and pick
the next 10 numbers. What about numbers between 1 and 100?
Today, we would never do this; we would use a statistical package to generate these. In Stata,
for example:
runiform() returns uniformly distributed random variates on the interval [0,1).
Many packages also allow us to draw directly from the distribution we are interested in:
rnormal(m, s) returns normal(m, s) random variates, where m is the mean and s is the
standard deviation.


Topic 3: The Expectation and other Moments of a Random Variable



Expectation of a discrete random variable


Definition: The expected value of a discrete random variable X, when it exists, is defined by
E(X) = Σ_{x∈R(X)} x f(x)
The expectation is simply a weighted average of possible outcomes, with the weights being
assigned by f(x).
In general E(X) need not belong to R(X). Consider the experiment of rolling a die where the random variable
is the number of dots on the die. The density function is given by f(x) = (1/6) I_{1,2,...,6}(x).
The expectation is given by Σ_{x=1}^{6} (x/6) = 3.5
If X can take only a finite number of different values, this expectation always exists.
If there is an infinite sequence of possible values of X, then this expectation exists if and
only if Σ_{x∈R(X)} |x| f(x) < ∞ (the series defining the expectation is absolutely convergent).
We can think of the expectation as a point of balance: if there were various weights placed
on a weightless rod, where should a fulcrum be placed so that the distribution of weights
balances?
The expectation of X is also called the expected value or the mean of X.


Expectation of a continuous random variable


Definition: The expected value of a continuous random variable X, when it exists, is defined by
E(X) = ∫ x f(x) dx, which exists iff ∫ |x| f(x) dx < ∞
Suppose we have a distribution f which is symmetric with respect to a given point x0 on the
x axis, so that f(x0 + δ) = f(x0 − δ) for all δ. If the expectation exists, it will be equal to x0.
The expectation will always exist if the set of values taken by X is bounded.
When does it not exist? We need sufficiently small weights attached to large values of X
when the set of possible values is not bounded. The tails of a distribution may fall off fast
enough for the area under it to integrate to 1, but the function x f(x) may not have this
property if the tails are thick.
The Cauchy distribution, f(x) = 1/(π(1 + x²)), is symmetric (as are the Normal and the Student's
t distributions) but the expectation of the Cauchy distribution does not exist.


[Figure from DeGroot, Chapter 4: the p.d.f. of the Cauchy distribution, which is symmetric around 0 with peak height 1/π and thick tails, shown alongside a comparison density.]


Expectation of functions of random variable


We may be interested in the expectation of a function Y = g(X) of a random variable X.
Examples:
Agricultural yields may be given by the random variable X; revenue, for any given value
x, is given by the function p(x)x.
Our random variable might be food availability on a farm; child health would be a
function of such availability.
Scores on an aptitude test may be the random variable and performance in a course
could be a function of these.
Suppose that the density function h(y) of Y was available to us. We could directly compute the
expectation as E(Y) = ∫ y h(y) dy (if continuous). But we don't need this:
RESULT: Let X be a random variable having density function f(x). Then the expectation of
Y = g(X) (in the discrete and continuous case respectively) is given by:
E g(X) = Σ_{x∈R(X)} g(x) f(x)
E g(X) = ∫ g(x) f(x) dx


Expectation of functions- examples


The density of X is f(x) = 2x for 0 < x < 1 (and 0 otherwise), and g(x) = √x:
E(√X) = ∫_0^1 x^(1/2) (2x) dx = 4/5
A point (X, Y) is chosen at random from the unit square: 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1. The joint
density over all points (x, y) in the square is 1 and E(X² + Y²) = ∫_0^1 ∫_0^1 (x² + y²) dx dy = 2/3
(X1, X2) forms a random sample of size 2 from a uniform distribution on (0, 1) and
Y = min(X1, X2). We'll show that E(Y) = 2 ∫_0^1 ∫_0^{x2} x1 dx1 dx2 = ∫_0^1 x2² dx2 = 1/3
Suppose we are interested in the expectation of a random variable Y = g(X), defined over a set Ω. This
would be given by ∫_Ω g(x) f(x) dx. If Ω1 and Ω2 form a partition of Ω, we can write this integral as
∫_Ω g(x) f(x) dx = ∫_{Ω1} g(x) f(x) dx + ∫_{Ω2} g(x) f(x) dx
In this case, we either have X1 < X2 or X1 ≥ X2, so E(Y) is the sum of the integrals of min(x1, x2) over these
two regions. The first is given by integrating x1 over the triangle above the 45 degree line:
∫_0^1 ∫_0^{x2} x1 dx1 dx2 = ∫_0^1 (x2²/2) dx2 = 1/6
We double this to account for the symmetric case where X1 ≥ X2, giving E(Y) = 1/3.
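The last two expectations above can be checked quickly by simulation; a small NumPy sketch (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000
x1, x2 = rng.uniform(size=(2, n))

print((x1**2 + x2**2).mean())          # E(X^2 + Y^2) on the unit square, ≈ 2/3
print(np.minimum(x1, x2).mean())       # E(min(X1, X2)) for a uniform sample of size 2, ≈ 1/3
```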


Expectation properties
RESULT 1: If Y = aX + b, then E(Y) = aE(X) + b.
Proof (for a continuous random variable X):
E(aX + b) = ∫ (ax + b) f(x) dx = a ∫ x f(x) dx + b ∫ f(x) dx = aE(X) + b
Example: If E(X) = 5 then E(3X − 5) = 10.
RESULT 2: The expectation of a sum is the sum of the expectations:
E[ Σ_{i=1}^k ui(X) ] = Σ_{i=1}^k E[ui(X)]
Proof: E[ Σ_{i=1}^k ui(X) ] = ∫ [ Σ_{i=1}^k ui(x) ] f(x) dx = Σ_{i=1}^k ∫ ui(x) f(x) dx = Σ_{i=1}^k E[ui(X)]
RESULT 3: For independent random variables, the expectation of a product is the product of the
expectations: if X1, . . . , Xn are n independent random variables such that each expectation
E(Xi) exists, then E( ∏_{i=1}^n Xi ) = ∏_{i=1}^n E(Xi).
Proof (for continuous random variables): Since the random variables are independent,
their joint density is the product of the marginals, i.e. f(x1, . . . , xn) = ∏_{i=1}^n fi(xi), and
E( ∏_{i=1}^n Xi ) = ∫· · ·∫ ( ∏_{i=1}^n xi ) f(x1, . . . , xn) dx1 . . . dxn = ∫· · ·∫ ∏_{i=1}^n [xi fi(xi)] dx1 . . . dxn = ∏_{i=1}^n ∫ xi fi(xi) dxi = ∏_{i=1}^n E(Xi)
(Notice that this third property applies only to independent random variables, whereas the
second property holds for dependent variables as well.)


Expectation properties...examples
Expected number of successes: n balls are selected from a box containing a fraction p of red
balls. The random variable Xi takes the value 1 if the ith ball picked is red and zero
otherwise. We're interested in the expected value of the number of red balls picked.
This is simply X = X1 + X2 + · · · + Xn. The expectation of X (using our theorem) is equal to
E(X1) + E(X2) + · · · + E(Xn), where E(Xi) = p·1 + (1 − p)·0 = p. We therefore have E(X) = np.
Expected number of matches: If n letters are randomly placed in n envelopes, how many
matches would we expect? Let Xi = 1 if the ith letter is placed in the correct envelope, and
zero otherwise.
P(Xi = 1) = 1/n and P(Xi = 0) = 1 − 1/n
It is therefore the case that E(Xi) = 1/n for each i and E(X) = 1/n + · · · + 1/n = 1.
Suppose the random variables X1, . . . , Xn form a random sample of size n from a given
continuous distribution on the real line for which the p.d.f. is f. Find the expectation of the
number of observations in the sample that fall in a specified interval [a, b]. This is just like
the first problem, except the probability of success is now ∫_a^b f(x)dx, so the answer is
n ∫_a^b f(x)dx
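A quick simulation of the matching problem (an illustrative NumPy sketch): the expected number of correctly placed letters is 1 regardless of n.

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 10, 100_000

# Count letters landing in their own envelopes under a random assignment.
matches = [(rng.permutation(n) == np.arange(n)).sum() for _ in range(reps)]
print(np.mean(matches))                 # ≈ 1
```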


More examples
The density function for X is given by f(x) = 2(1 − x) I_(0,1)(x).
E(X) = ∫_0^1 x f(x)dx = 2 ∫_0^1 (x − x²)dx = 2[x²/2 − x³/3]_0^1 = 2(1/6) = 1/3 and
E(X²) = 2 ∫_0^1 (x² − x³)dx = 2[x³/3 − x⁴/4]_0^1 = 2(1/3 − 1/4) = 1/6. We can use these to compute
E(6X + 3X²) = 6(1/3) + 3(1/6) = 5/2. We could also have computed this directly using the
formula for the expectation of a function r(X).
A horizontal line segment of length 5 is divided at a randomly selected point and X is the
length of the left-hand part. Let us find the expectation of the product of the lengths.
We are picking a point from a uniform distribution on [0, 5], so the density is f(x) = (1/5) I_(0,5)(x). E(X) = 5/2 and
E(5 − X) = 5/2 (why?). The expected value of the product of the lengths is given by
E[X(5 − X)] = (1/5) ∫_0^5 x(5 − x)dx = 25/6 ≠ (5/2)² = E(X)E(5 − X)
A bowl contains 5 chips, 3 marked $1 and 2 marked $4. A player draws 2 chips at random
and is paid the sum of the values of the chips. If it costs $4.75 to play, is his expected gain
positive?
Let the random variable X be the number of $1 chips. The probability function is f(x) = C(3, x)C(2, 2−x)/C(5, 2)
for x = 0, 1, 2 (a hypergeometric distribution). Compute f(0) = 1/10, f(1) = 6/10 and f(2) = 3/10. In this case the
payoff is u(x) = x + 4(2 − x) = 8 − 3x,
so E[u(X)] = (1/10)·8 + (6/10)·5 + (3/10)·2 = 4.4. Alternatively, compute E(X) = 0·f(0) + 1·f(1) + 2·f(2) = 12/10 and find the desired
expectation as 8 − 3E(X) = 4.4. Since 4.4 < 4.75, the expected gain is negative.


Variance of a random variable


Definition: If X is a random variable with E(X) = μ, the variance of X is defined as follows:
Var(X) = E[(X − μ)²]
Since (X − μ)² ≥ 0, as long as μ exists, the variance must be non-negative, if it exists.
The expectation E[(X − μ)²] will always exist if the values of X are bounded, but need not
exist in general.
A small value of the variance indicates a distribution that is concentrated around μ.
The variance is denoted by σ² and its non-negative square root is called the standard
deviation and is denoted by σ.


Variance properties
1. Var(X) = 0 if and only if there exists a constant c such that P(X = c) = 1.
2. For any constants a and b, Var(aX + b) = a² Var(X). It follows that Var(−X) = Var(X).
Proof: Var(aX + b) = E[(aX + b − aμ − b)²] = E[(a(X − μ))²] = a² E[(X − μ)²] = a² Var(X)
3. Var(X) = E(X²) − [E(X)]²
Proof: expand the square in the definition of the variance and take expectations.
4. If X1, . . . , Xn are independent random variables, then
Var(X1 + · · · + Xn) = Var(X1) + · · · + Var(Xn).
Proof: For n = 2, E(X1 + X2) = μ1 + μ2 and therefore
Var(X1 + X2) = E[(X1 + X2 − μ1 − μ2)²] = E[(X1 − μ1)² + (X2 − μ2)² + 2(X1 − μ1)(X2 − μ2)]
Taking expectations term by term, we get
Var(X1) + Var(X2) + 2E[(X1 − μ1)(X2 − μ2)]
But since X1 and X2 are independent,
E[(X1 − μ1)(X2 − μ2)] = E(X1 − μ1)E(X2 − μ2) = (μ1 − μ1)(μ2 − μ2) = 0
It therefore follows that
Var(X1 + X2) = Var(X1) + Var(X2)
Using an induction argument, this can be established for any n.


Moments of a random variable


Moments of a random variable are special types of expectations that capture characteristics of
the distribution that we may be interested in (its shape and position). Moments are defined
either around the origin or around the mean.
Definition: Let X be a random variable with density function f(x). Then the kth moment of X is
the expectation E(X^k). This moment is denoted by μ'_k and is said to exist if and only if
E(|X|^k) < ∞.
μ'_0 = 1
μ'_1 is called the mean of X and is denoted by μ
If a random variable is bounded, all moments exist, and if the kth moment exists, all lower
order moments exist.
Definition: Let X be a random variable for which E(X) = μ. Then for any positive integer k, the
expectation E[(X − μ)^k] is called the kth central moment of X and denoted by μ_k.
μ_1 is clearly zero.
The variance is the second central moment of X.
If the distribution of X is symmetric with respect to its mean μ, and the kth central moment
exists for a given odd integer k, then it must be zero because the positive and negative
terms of the corresponding expectation will cancel one another.


Moment generating functions


Given a random variable X, consider for each real number t the following function, known as the
moment generating function (MGF) of X:
ψ(t) = E(e^(tX))
If X is bounded, the above expectation exists for all values of t; if not, it may only exist for
some values of t.
ψ(t) is always defined at t = 0 and ψ(0) = E(1) = 1.
If the MGF exists for all values of t in an interval around t = 0, then the derivative of ψ(t)
exists at t = 0 and
ψ'(0) = [d/dt E(e^(tX))]_{t=0} = E[(d/dt e^(tX))]_{t=0} = E[X e^(tX)]_{t=0} = E(X)
The derivative of the MGF at t = 0 is the mean of X.
More generally, the kth derivative evaluated at t = 0 gives us the kth moment of X.
Proof. The function e^x can be expressed as the sum of the series 1 + x + x²/2! + . . . , so e^(tx) can be expressed as
the sum 1 + tx + t²x²/2! + . . . , and the expectation is E(e^(tX)) = Σ_x (1 + tx + t²x²/2! + . . . ) f(x). If we differentiate
this w.r.t. t and then set t = 0, we are left with only the second term in parentheses, so we have Σ_x x f(x), which
is defined as the expectation of X. Similarly, if we differentiate twice and set t = 0, we are left with Σ_x x² f(x),
which is the second moment. For continuous distributions, we replace the sum with an integral.


Moment generating functions ..an example


Suppose a random variable X has the density function f(x) = e^(−x) I_(0,∞)(x). We can use its MGF to
compute the mean and the variance of X as follows:
ψ(t) = ∫_0^∞ e^(tx) e^(−x) dx = ∫_0^∞ e^(x(t−1)) dx = [e^(x(t−1))/(t − 1)]_0^∞ = 0 − 1/(t − 1) = 1/(1 − t) for t < 1
Taking the derivative of this function with respect to t, we get ψ'(t) = 1/(1 − t)², and
differentiating again, we get ψ''(t) = 2/(1 − t)³.
Evaluating the first derivative at t = 0, we get μ = 1/(1 − 0)² = 1.
The variance σ² = μ'_2 − μ² = 2(1 − 0)^(−3) − 1 = 1.
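A minimal SymPy sketch that differentiates the MGF derived above to recover the mean and variance of the unit exponential:

```python
import sympy as sp

t = sp.symbols("t")
psi = 1 / (1 - t)                          # MGF of the unit exponential, valid for t < 1

mean = sp.diff(psi, t).subs(t, 0)          # psi'(0)  = 1
second = sp.diff(psi, t, 2).subs(t, 0)     # psi''(0) = 2
print(mean, second - mean**2)              # mean = 1, variance = 1
```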


Properties of moment generating functions


RESULT 1: Let X be a random variable for which the MGF is ψ1 and consider the random
variable Y = aX + b, where a and b are given constants. Let the MGF of Y be denoted by
ψ2. Then for any value of t such that ψ1(at) exists,
ψ2(t) = e^(bt) ψ1(at)
RESULT 2: Suppose that X1, . . . , Xn are n independent random variables and that ψi is the
MGF of Xi. Let Y = X1 + · · · + Xn and the MGF of Y be given by ψ. Then for any value of t
such that ψi(t) exists for all i = 1, 2, . . . , n,
ψ(t) = ∏_{i=1}^n ψi(t)
RESULT 3: If the MGFs of two random variables X1 and X2 are identical for all values of t
in an interval around the point t = 0, then the probability distributions of X1 and X2 must
be identical.
Examples:
If f(x) = e^(−x) I_(0,∞)(x) as in the above example, the MGF of the random variable
Y = X − 1 is e^(−t)/(1 − t) for t < 1 (using the first result above, setting a = 1 and b = −1), and if Y = 3 − 2X,
the MGF of Y is given by e^(3t)/(1 + 2t) for t > −1/2.


An Illustration: the binomial distribution


Suppose that there is a probability p of a girl child being born and this probability does not
vary by the birth-order of the child.
A family has n children. The random variable Xi = 1 if the ith child is a girl and 0 otherwise.
The total number of girls in the family, X = X1 + · · · + Xn, follows a binomial distribution
with parameters n and p.
We know from the properties of the variance that Var(X) = Σ_{i=1}^n Var(Xi).
E(Xi) = 1·p + 0·(1 − p) = p and E(Xi²) = 1²·p + 0²·(1 − p) = p, so Var(Xi) = p − p² and Var(X) = np(1 − p).
We can get the same expression using the MGF for the binomial:
The MGF for each of the Xi variables is given by
e^t P(Xi = 1) + e^0 P(Xi = 0) = pe^t + q, where q = 1 − p.
Using the multiplicative property of MGFs for independent random variables, we get the
MGF for X as
ψ(t) = (pe^t + q)^n
For two independent binomial random variables with parameters (n1, p) and (n2, p), the MGF of their sum
is given by the product of the MGFs, (pe^t + q)^(n1+n2), so the sum is binomial with parameters n1 + n2 and p.


The median of a distribution


The mean gives us the centre of gravity of a distribution and is one way of summarizing it. A
disadvantage in some contexts is that it is influenced by every observation.
An alternative measure of the centre of a distribution is the median:
Definition: For any random variable X, a median of the distribution of X is defined as a
point m such that P(X ≤ m) ≥ 1/2 and P(X ≥ m) ≥ 1/2.
RESULT: Let m be a median of the distribution of X and let d be any other number. Then
E(|X − m|) ≤ E(|X − d|)
Every distribution has at least one median and may have multiple medians, as seen in the
following examples:
1. P(X = 1) = .1, P(X = 2) = .2, P(X = 3) = .3, P(X = 4) = .4
2. P(X = 1) = .1, P(X = 2) = .4, P(X = 3) = .3, P(X = 4) = .2
3. f(x) = 4x³ for 0 < x < 1, and 0 otherwise
4. f(x) = 1/2 for 0 ≤ x ≤ 1, f(x) = 1 for 2.5 ≤ x ≤ 3, and 0 otherwise


Covariance and correlation


Definition: Let X and Y be random variables with E(X) = μX and E(Y) = μY and variances
Var(X) = σX² and Var(Y) = σY².
The covariance of X and Y is defined as E[(X − μX)(Y − μY)] and is denoted by σXY or Cov(X, Y).
The value of the covariance will be finite if each of the above variances is finite. It can be
positive, negative or zero. It can conveniently be computed as E(XY) − E(X)E(Y) (just expand
the expression above and take expectations).
Definition: If 0 < σX² < ∞ and 0 < σY² < ∞ then the correlation of X and Y is given by
ρ(X, Y) = Cov(X, Y) / (σX σY)
Result: For any two random variables U and V, it is always the case that (E UV)² ≤ E(U²) E(V²).
This is known as the Cauchy-Schwarz Inequality.
This provides us with bounds on the value of the covariance, |σXY| ≤ σX σY (let U = (X − EX) and
V = (Y − EY) in the statement of the Cauchy-Schwarz inequality above), and in turn implies a
correlation bound
−1 ≤ ρXY ≤ 1
Example: Let f(x, y) = (x + y) I_(0,1)(x) I_(0,1)(y). In this case
E(X) = ∫_0^1 ∫_0^1 x(x + y) dx dy = 7/12 = E(Y), E(XY) = ∫_0^1 ∫_0^1 xy(x + y) dx dy = 1/3,
σX² = E(X²) − E(X)² = 5/12 − 49/144 = 11/144 = σY²,
Cov(X, Y) = 1/3 − (7/12)(7/12) = −1/144 and ρ = (−1/144) / √((11/144)(11/144)) = −1/11
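The covariance and correlation in this example are easy to verify symbolically; a short SymPy sketch for the density f(x, y) = x + y on the unit square:

```python
import sympy as sp

x, y = sp.symbols("x y")
f = x + y                                    # joint density on the unit square

E = lambda g: sp.integrate(g * f, (x, 0, 1), (y, 0, 1))
cov = E(x * y) - E(x) * E(y)
rho = cov / sp.sqrt((E(x**2) - E(x)**2) * (E(y**2) - E(y)**2))
print(cov, rho)                              # -1/144 and -1/11
```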


Properties of covariance and correlation


Result 1: Let X and Y be random variables with σX² < ∞ and σY² < ∞; then
Cov(X, Y) = E(XY) − E(X)E(Y)
Proof: expand the expression for the covariance and take expectations.
Result 2: If X and Y are independent random variables, each with finite variance, then
Cov(X, Y) = ρ(X, Y) = 0.
Proof: If X and Y are independent, E(XY) = E(X)E(Y). Now apply the expression for the covariance in the above
result.
Note: zero correlation does not imply independence. Example: if X takes the values −1, 0, 1 with equal
probability and Y = X², then E(XY) = E(X³) = 0. Since E(XY) = E(X)E(Y), ρ = 0 and the variables are uncorrelated but
clearly dependent. A zero correlation only tells us that the variables are not linearly dependent.
Result 3: Suppose X is a random variable with finite variance and Y = aX + b for some
constants a and b. If a > 0, then ρ(X, Y) = 1. If a < 0, then ρ(X, Y) = −1.
Proof: Y − μY = a(X − μX), so Cov(X, Y) = aE[(X − μX)²] = aσX² and σY = |a|σX; plug these values into the
expression for ρ to get the result.
Result 4: If X and Y are random variables with finite variance then
Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y).
Proof:
Var(X + Y) = E[(X + Y − μX − μY)²] = E[(X − μX)² + (Y − μY)² + 2(X − μX)(Y − μY)] = Var(X) + Var(Y) + 2Cov(X, Y)
Result 5: If X1, . . . , Xn are random variables each with finite variance, then
Var( Σ_{i=1}^n Xi ) = Σ_{i=1}^n Var(Xi) + 2 ΣΣ_{i<j} Cov(Xi, Xj)


Conditional Expectations
The conditional expectation of a random variable is defined using conditional probability
density functions rather than their unconditional counterparts.
Suppose that X and Y are random variables with a joint density function f(x, y), with the
marginal p.d.f. of X denoted by f1(x).
For any value of x such that f1(x) > 0, let g(y|x) denote the conditional p.d.f. of Y given that
X = x.
The conditional expectation of Y given X = x is E(Y|x) = ∫ y g(y|x) dy for continuous X and Y,
and E(Y|x) = Σ_y y g(y|x) if X and Y have a discrete distribution.


Topic 4: Some Special Distributions



Parametric Families of Distributions


There are a few classes of functions that are frequently used as probability distributions,
because they are easy to work with (have a small number of parameters) and attach
reasonable values to the types of uncertain events we are interested in analyzing.
The choice among these families depends on the question of interest.
For modeling the distribution of income or consumption expenditure, we want a density
which is skewed to the right ( gamma, weibull, lognormal..)
IQs, heights, weights, arm circumference are quite symmetric around a mode (normal
or truncated normal)
number of successes in a given number of trials (binomial)
the time to failure for a machine or person (gamma, exponential)
We refer to these probability density functions by f(x; θ), where θ refers to a parameter
vector.
A given choice of θ therefore leads to a given probability density function.
Θ is used to denote the parameter space.


Discrete Distributions: Uniform


Parameter: N
Probability function: f(x; N) = (1/N) I_{1,2,...,N}(x)
Moments:
μ = Σ x f(x) = (1/N) · N(N + 1)/2 = (N + 1)/2
σ² = Σ x² f(x) − μ² = (1/N) · N(N + 1)(2N + 1)/6 − [(N + 1)/2]² = (N² − 1)/12
MGF: Σ_{j=1}^N e^(jt)/N
Applications: experiments or situations in which each outcome is equally likely (dice,
coins...). Can you think of applications in economics?


Discrete Distributions: Bernoulli


Parameter: p, 0 ≤ p ≤ 1
Probability function: f(x; p) = p^x (1 − p)^(1−x) I_{0,1}(x)
Moments:
μ = Σ x f(x) = 1·p¹(1 − p)⁰ + 0·p⁰(1 − p)¹ = p
σ² = Σ x² f(x) − μ² = p(1 − p)
MGF: e^t p + e^0 (1 − p) = pe^t + (1 − p)
Applications: experiments or situations in which there are two possible outcomes: success
or failure, defective or not defective, male or female, etc.


Discrete Distributions: Binomial


Parameters: (n, p), 0 ≤ p ≤ 1 and n a positive integer
Probability function: An observed sequence of n Bernoulli trials can be represented by an
n-tuple of zeros and ones. The number of ways to achieve x ones is given by C(n, x) = n!/(x!(n − x)!).
The probability of x successes in n trials is therefore:
f(x; n, p) = C(n, x) p^x (1 − p)^(n−x) for x = 0, 1, 2, . . . , n, and 0 otherwise
Notice that since Σ_{x=0}^n C(n, x) a^x b^(n−x) = (a + b)^n, we have Σ_{x=0}^n f(x) = [p + (1 − p)]^n = 1, so we have a valid
density function.
MGF: The MGF is given by:
Σ_{x=0}^n e^(tx) f(x) = Σ_{x=0}^n C(n, x) (pe^t)^x (1 − p)^(n−x) = [(1 − p) + pe^t]^n
Moments: The MGF can be used to derive μ = np and σ² = np(1 − p)
Result: If X1, . . . , Xk are independent random variables and if each Xi has a binomial
distribution with parameters ni and p, then the sum X1 + · · · + Xk has a binomial
distribution with parameters n = n1 + · · · + nk and p.


Multinomial Distributions
Suppose there are a small number of different outcomes (methods of public transport, water
purification etc. ) The Multinomial distribution gives us the probability associated with a
particular vector of these outcomes:
Parameters: (n, p1, . . . , pm), 0 ≤ pi ≤ 1, Σ_i pi = 1 and n a positive integer
Probability function:
f(x1, . . . , xm; n, p1, . . . , pm) = [n!/(x1! · · · xm!)] ∏_{i=1}^m pi^(xi) for xi = 0, 1, 2, . . . , n with Σ_{i=1}^m xi = n,
and 0 otherwise


Geometric and Negative Binomial distributions


The Negative Binomial (or Pascal) distribution gives us the probability that x failures will
occur before r successes are achieved. This means that the rth success occurs on the
(x + r)th trial.
Parameters: (r, p), 0 ≤ p ≤ 1 and r a positive integer
Density: For the rth success to occur on the (x + r)th trial, we require (r − 1) successes in
the first (x + r − 1) trials. We therefore obtain the density:
f(x; r, p) = C(r + x − 1, x) p^r q^x for x = 0, 1, 2, 3, . . . , where q = 1 − p
The geometric distribution is a special case of the negative binomial with r = 1.
The density in this case takes the form f(x | 1, p) = pq^x over all non-negative integers x.
The MGF is given by E(e^(tX)) = p Σ_{x=0}^∞ (qe^t)^x = p/(1 − qe^t) for t < log(1/q)
We can use this function to get the mean and variance, μ = q/p and σ² = q/p²
The negative binomial is just a sum of r independent geometric variables, and the MGF is therefore
[p/(1 − qe^t)]^r, with corresponding mean μ = rq/p and variance σ² = rq/p²
The geometric distribution is memoryless, so the conditional probability of k + t
failures given at least k failures is the unconditional probability of t failures:
P(X = k + t | X ≥ k) = P(X = t)


Discrete Distributions: Poisson


Parameter: λ, λ > 0
Probability function:
f(x; λ) = e^(−λ) λ^x / x! for x = 0, 1, 2, . . . , and 0 otherwise
Using the result that the series 1 + λ + λ²/2! + λ³/3! + . . . converges to e^λ,
Σ_x f(x) = Σ_{x=0}^∞ e^(−λ) λ^x/x! = e^(−λ) Σ_{x=0}^∞ λ^x/x! = e^(−λ) e^λ = 1, so we have a valid density.
Moments: μ = σ² = λ
MGF: E(e^(tX)) = Σ_{x=0}^∞ e^(tx) e^(−λ) λ^x/x! = e^(−λ) Σ_{x=0}^∞ (λe^t)^x/x! = e^(λ(e^t − 1))
The MGF can be used to get the first and second moments about the origin, λ and λ² + λ,
so the mean and the variance are both λ.
We can also use the product of k such MGFs to show that the sum of k independently
distributed Poisson variables has a Poisson distribution with mean λ1 + . . . + λk.


A Poisson process
Suppose that the number of type A outcomes that occur over a fixed interval of time [0, t]
follows a process in which
1. The probability that precisely one type A outcome will occur in a small interval of time Δt
is approximately proportional to the length of the interval:
g(1, Δt) = λΔt + o(Δt)
where o(Δt) denotes a function of Δt having the property that lim_{Δt→0} o(Δt)/Δt = 0.
2. The probability that two or more type A outcomes will occur in a small interval of time Δt
is negligible:
Σ_{x=2}^∞ g(x, Δt) = o(Δt)
3. The numbers of type A outcomes that occur in nonoverlapping time intervals are
independent events.
These conditions imply a process which is stationary over the period of observation, i.e. the
probability of an occurrence must be the same over the entire period, with neither busy nor quiet
intervals.


Poisson densities representing poisson processes


RESULT: Consider a Poisson process with rate λ per unit of time. The number of events in
a time interval of length t has a Poisson density with mean μ = λt.
Applications:
the number of weaving defects in a yard of handloom cloth or stitching defects in a shirt
the number of traffic accidents on a motorway in an hour
the number of particles of a noxious substance that come out of a chimney in a given period
of time
the number of times a machine breaks down each week
Example:
Let the probability of exactly one blemish in a foot of wire be 1/1000 and that of two or more
blemishes be (approximately) zero.
We're interested in the number of blemishes in 3,000 feet of wire.
If the numbers of blemishes in non-overlapping intervals are assumed to be independently
distributed, then our random variable X follows a Poisson distribution with μ = λt = 3 and
P(X = 5) = 3⁵ e^(−3)/5!
You can plug this into a computer, or alternatively use tables, to compute f(5; 3) = .101
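Computing f(5; 3) directly rather than from tables (a plain Python sketch):

```python
from math import exp, factorial

mu = 3.0                                   # mean number of blemishes in 3,000 feet
pmf = lambda x: exp(-mu) * mu**x / factorial(x)
print(round(pmf(5), 3))                    # ≈ 0.101
```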


The Poisson as a limiting distribution


We can show that a binomial distribution with large n and small p can be approximated by a
Poisson (which is computationally easier).
Useful result: e^v = lim_{n→∞} (1 + v/n)^n
We can rewrite the binomial density for non-zero values as
f(x; n, p) = [ ∏_{i=1}^x (n − i + 1) / x! ] p^x (1 − p)^(n−x)
If np = λ, we can substitute λ/n for p to get
lim_{n→∞} f(x; n, p) = lim_{n→∞} [ ∏_{i=1}^x (n − i + 1) / x! ] (λ/n)^x (1 − λ/n)^(n−x)
= lim_{n→∞} [ (n/n) ((n − 1)/n) · · · ((n − x + 1)/n) ] (λ^x/x!) (1 − λ/n)^n (1 − λ/n)^(−x)
= e^(−λ) λ^x / x!
(using the above result and the property that the limit of a product is the product of the
limits)


Poisson as a limiting distribution...example


We have a 300 page novel with 1,500 letters on each page.
Typing errors are as likely to occur for one letter as for another, and the probability of such
an error is given by p = 10⁻⁵.
The total number of letters is n = (300)(1500) = 450,000.
Using λ = np = 4.5, the Poisson distribution function gives us the probability of the number of
errors being less than or equal to 10 as:
P(X ≤ 10) ≈ Σ_{x=0}^{10} e^(−4.5) (4.5)^x / x! = .9933
Rules of thumb: the approximation is close to binomial probabilities when n ≥ 20 and p ≤ .05, and excellent
when n ≥ 100 and np ≤ 10.
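A quick comparison of the exact binomial probability with its Poisson approximation (a sketch assuming SciPy is available):

```python
from scipy import stats

n, p = 450_000, 1e-5
print(stats.binom.cdf(10, n, p))           # exact binomial probability
print(stats.poisson.cdf(10, n * p))        # Poisson approximation with mean np = 4.5, ≈ 0.9933
```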


Discrete distributions: Hypergeometric


Suppose, as in the case of the binomial, there are two possible outcomes and we're
interested in the probability of x values of a particular outcome, but we are drawing
randomly without replacement, so our trials are not independent.
In particular, suppose there are A + B objects from which we pick n; A of the total number
available are of one type (red balls) and the rest are of the other (blue balls).
If the random variable is the total number of red balls selected, then, for appropriate values
of x, we have
f(x; A, B, n) = C(A, x) C(B, n − x) / C(A + B, n)
Over what values of x is this defined? max{0, n − B} ≤ x ≤ min{n, A}
The multivariate extension is (for xi ∈ {0, 1, 2, . . . , n}, Σ_{i=1}^m xi = n and Σ_{i=1}^m Ki = M):
f(x1, . . . , xm; K1, . . . , Km, n) = [ ∏_{j=1}^m C(Kj, xj) ] / C(M, n)


Continuous distributions: uniform or rectangular


Parameters: (a, b), −∞ < a < b < ∞
Density: f(x; a, b) = 1/(b − a) I_(a,b)(x) (hence the name rectangular)
Moments: μ = (a + b)/2, σ² = (b − a)²/12
MGF: MX(t) = (e^(bt) − e^(at)) / ((b − a)t) for t ≠ 0 and MX(t) = 1 for t = 0 (use L'Hopital's rule)

Applications:
to construct the probability space of an experiment in which any outcome in the
interval [a, b] is equally likely.
to generate random samples from other distributions (based on the probability integral
transformation). This is part of your first lab assignment.


The gamma function


The gamma function is a special mathematical function that is widely used in statistics. The
gamma function of α is defined as
Γ(α) = ∫_0^∞ y^(α−1) e^(−y) dy    (1)
If α = 1, Γ(α) = ∫_0^∞ e^(−y) dy = [−e^(−y)]_0^∞ = 1
If α > 1, we can integrate (1) by parts, setting u = y^(α−1) and dv = e^(−y)dy and using the formula
∫ u dv = uv − ∫ v du to get:
Γ(α) = [−y^(α−1) e^(−y)]_0^∞ + (α − 1) ∫_0^∞ y^(α−2) e^(−y) dy
The first term in the above expression is zero because the exponential function goes to zero
faster than any polynomial grows, and we obtain
Γ(α) = (α − 1)Γ(α − 1)
and for any integer α > 1, we have
Γ(α) = (α − 1)(α − 2)(α − 3) . . . (3)(2)(1)Γ(1) = (α − 1)!


The gamma distribution


Define the variable x by y = x/β, where β > 0. Then dy = (1/β)dx and we can rewrite Γ(α) as
Γ(α) = ∫_0^∞ (x/β)^(α−1) e^(−x/β) (1/β) dx
or as
1 = ∫_0^∞ [1/(Γ(α) β^α)] x^(α−1) e^(−x/β) dx
This shows that for α, β > 0,
f(x; α, β) = [1/(Γ(α) β^α)] x^(α−1) e^(−x/β) I_(0,∞)(x)
is a valid density and is known as a gamma-type probability density function.


Features of the gamma density


This is a valuable distribution because it can take a variety of shapes depending on the values of
the parameters α and β:
It is skewed to the right.
It is strictly decreasing when α ≤ 1.
If α = 1, we have the exponential density, which is memoryless.
For α > 1 the density attains its maximum at x = β(α − 1).

[Figure: graphs of the gamma p.d.f. for several different parameter combinations, all with mean 1: (0.1, 0.1), (1, 1), (2, 2), (3, 3).]


Moments of the gamma distribution
Parameters: (α, β), α > 0, β > 0
Moments: μ = αβ, σ² = αβ². More generally, for k = 1, 2, . . . ,
E(X^k) = β^k Γ(α + k)/Γ(α) = β^k α(α + 1) · · · (α + k − 1)
MGF: MX(t) = (1 − βt)^(−α) for t < 1/β, which can be derived as follows:
MX(t) = ∫_0^∞ e^(tx) [1/(Γ(α)β^α)] x^(α−1) e^(−x/β) dx = [1/(Γ(α)β^α)] ∫_0^∞ x^(α−1) e^(−(1/β − t)x) dx
= [1/(Γ(α)β^α)] · Γ(α)/(1/β − t)^α = (1 − βt)^(−α)
(by setting y = (1/β − t)x in the expression for Γ(α)).


Gamma applications
Survival analysis:
We can use the gamma to model the waiting time till the rth event/success. If X is the time that
passes until the first success, then X has a gamma distribution with α = 1 and β = 1/λ.
This is known as an exponential distribution. If, instead, we are interested in the time
taken for the rth success, this has a gamma density with α = r and β = 1/λ.
Related to the Poisson distribution: If the variable Y is the number of successes
(deaths, for example) in a given time period t and has a Poisson density with parameter
μ, the rate of success is given by λ = μ/t.
Example: A bottling plant breaks down, on average, twice every four weeks. We want the
probability that the number of breakdowns X ≤ 3 in the next four weeks. We have μ = 2
and the breakdown rate λ = 1/2 per week.
P(X ≤ 3) = Σ_{i=0}^3 e^(−2) 2^i/i! = .135 + .271 + .271 + .180 = .857
Suppose we wanted the probability that the machine does not break down in the next four
weeks. The time taken until the first breakdown, x, must therefore be more than four
weeks. This follows a gamma (exponential) distribution with α = 1 and β = 1/λ = 2:
P(X ≥ 4) = ∫_4^∞ (1/2) e^(−x/2) dx = [−e^(−x/2)]_4^∞ = e^(−2) = .135
(see the sketch after this list)
Income distributions that are uni-modal
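A small sketch computing both probabilities in the breakdown example above (plain Python; the numbers match .857 and .135):

```python
from math import exp, factorial

mu = 2.0                                                          # mean breakdowns in four weeks
print(sum(exp(-mu) * mu**i / factorial(i) for i in range(4)))     # P(X <= 3) ≈ 0.857
print(exp(-2.0))                                                  # P(no breakdown in four weeks) ≈ 0.135
```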


Gamma distributions: some useful properties


Gamma additivity: Let X1, . . . , Xn be independently distributed random variables with
respective gamma densities Gamma(αi, β). Then
Y = Σ_{i=1}^n Xi ~ Gamma( Σ_{i=1}^n αi, β )
Scaling gamma random variables: Let X be distributed with gamma density
Gamma(α, β) and let c > 0. Then
Y = cX ~ Gamma(α, cβ)
Both these can be easily proved using the gamma MGF and applying the MGF uniqueness
theorem. In the first case the MGF of Y is the product of the individual MGFs, i.e.
MY(t) = ∏_{i=1}^n MXi(t) = ∏_{i=1}^n (1 − βt)^(−αi) = (1 − βt)^(−Σαi) for t < 1/β
For the second result, MY(t) = McX(t) = MX(ct) = (1 − cβt)^(−α) for t < 1/(cβ)


The gamma family: exponential distributions


An exponential distribution is simply a gamma distribution with α = 1.
Parameters: β, β > 0
Density: f(x; β) = (1/β) e^(−x/β) I_(0,∞)(x)
Moments: μ = β, σ² = β²
MGF: MX(t) = (1 − βt)^(−1) for t < 1/β
Applications: As discussed above, the most important application is the representation of
operating lives. The exponential is memoryless and so, if failure hasn't occurred, the object
(or person, or animal) is as good as new. The risk of failure at any point t is given by the
hazard rate,
h(t) = f(t)/S(t)
where S(t) is the survival function, 1 − F(t). Verify that the hazard rate in this case is a
constant, 1/β.
If we would like wear-out effects, we should use a gamma with α > 1, and for work-hardening
effects, use a gamma with α < 1.


The gamma family: chi-square distributions


A chi-square distribution is simply a gamma distribution with α = v/2 and β = 2.
Parameters: v, a positive integer (referred to as the degrees of freedom)
Density: f(x; v) = [1/(2^(v/2) Γ(v/2))] x^(v/2 − 1) e^(−x/2) I_(0,∞)(x)
Moments: μ = v, σ² = 2v
MGF: MX(t) = (1 − 2t)^(−v/2) for t < 1/2
Applications:
Notice that for v = 2, the chi-square density is equivalent to the exponential density with
β = 2. It is therefore decreasing for this value of v and hump-shaped for higher values.
The χ² is especially useful in problems of statistical inference because if we have v
independent random variables Xi ~ N(0, 1), their sum of squares Σ_{i=1}^v Xi² ~ χ²_v. Many of the estimators we
use in our models fit this case (i.e. they can be expressed as the sum of squares of independent normal
variables).


The Normal (or Gaussian) distribution


This symmetric bell-shaped density is widely used because:
1. Outcomes of certain types of continuous random variables can be shown to follow this type of
distribution; this is the motivation we've used for most parametric distributions we've
considered so far (heights of humans, animals and plants, weights, strength of physical
materials, the distance from the centre of a target if errors in both directions are
independent).
2. It has nice mathematical properties: many functions of a set of normally distributed random
variables have distributions that take simple forms.
3. Central Limit Theorems: The sample mean of a random sample from any distribution with
finite variance is approximately normal.


The Normal density


Parameters: (μ, σ²), μ ∈ (−∞, ∞), σ > 0
Density: f(x; μ, σ²) = [1/(σ√(2π))] e^(−(1/2)((x−μ)/σ)²) I_(−∞,+∞)(x)
MGF: MX(t) = e^(μt + σ²t²/2)
The MGF can be used to derive the moments: E(X) = μ and the variance is σ².
As can be seen from the p.d.f., the distribution is symmetric around μ, where it achieves its
maximum value; μ is therefore also the median and the mode of the distribution.
The normal distribution with zero mean and unit variance is known as the standard normal
distribution and its p.d.f. is f(x; 0, 1) = (1/√(2π)) e^(−x²/2) I_(−∞,+∞)(x)
The tails of the distribution are thin: 68% of the total probability lies within one σ of the
mean, 95.4% within 2σ and 99.7% within 3σ.


The Normal distribution: deriving the MGF


By the definition of the MGF:
M(t) = ∫_{−∞}^{∞} e^(tx) [1/(σ√(2π))] e^(−(x−μ)²/(2σ²)) dx = ∫_{−∞}^{∞} [1/(σ√(2π))] e^(tx − (x−μ)²/(2σ²)) dx
We can rewrite the term in the exponent to obtain:
tx − (x − μ)²/(2σ²) = μt + σ²t²/2 − [x − (μ + σ²t)]²/(2σ²)
The MGF can now be written as:
MX(t) = C e^(μt + σ²t²/2)
where C = ∫_{−∞}^{∞} [1/(σ√(2π))] e^(−[x − (μ + σ²t)]²/(2σ²)) dx = 1 because the integrand is a normal p.d.f. with the parameter
μ replaced by (μ + σ²t).


The Normal distribution: computing moments


First taking derivatives of the MGF:
M(t) = e^(μt + σ²t²/2)
M'(t) = M(t)(μ + σ²t)
M''(t) = M(t)σ² + M(t)(μ + σ²t)²
(obtained by differentiating M'(t) with respect to t and substituting for M'(t))
Evaluating these at t = 0, we get M'(0) = μ and M''(0) = σ² + μ², so the variance is
M''(0) − [M'(0)]² = σ².


Transformations of Normally Distributed Variables...1


RESULT 1: Let X ~ N(μ, σ²). Then Z = (X − μ)/σ ~ N(0, 1).
Proof: Z is of the form aX + b with a = 1/σ and b = −μ/σ. Therefore
MZ(t) = e^(bt) MX(at) = e^(−μt/σ) e^(μt/σ + σ²t²/(2σ²)) = e^(t²/2)
which is the MGF of a standard normal distribution.
An important implication of the above result is that if we are interested in any distribution in
this class of normal distributions, we only need to be able to compute integrals for the standard
normal; these are the tables you'll see at the back of most textbooks.
Example: The kilometres per litre of fuel achieved by a new Maruti model, X ~ N(17, .25). What
is the probability that a new car will achieve between 16 and 18 kilometres per litre?
Answer: P(16 ≤ X ≤ 18) = P( (16 − 17)/.5 ≤ Z ≤ (18 − 17)/.5 ) = P(−2 ≤ Z ≤ 2) = 1 − 2(.0228) = .9544
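The same probability computed with the standard normal c.d.f. rather than tables (a sketch assuming SciPy):

```python
from scipy.stats import norm

mu, sigma = 17, 0.5
print(norm.cdf(18, mu, sigma) - norm.cdf(16, mu, sigma))   # ≈ 0.9545
print(norm.cdf(2) - norm.cdf(-2))                          # same answer after standardizing
```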


Transformations of Normals...2
RESULT 2: Let X ~ N(μ, σ²) and Y = aX + b, where a and b are given constants and a ≠ 0.
Then Y has a normal distribution with mean aμ + b and variance a²σ².
Proof: The MGF of Y can be expressed as MY(t) = e^(bt) e^(aμt + σ²a²t²/2) = e^((aμ+b)t + (aσ)²t²/2).
This is simply the MGF for a normal distribution with mean aμ + b and variance a²σ².
RESULT 3: If X1, . . . , Xk are independent and Xi has a normal distribution with mean μi
and variance σi², then Y = X1 + · · · + Xk has a normal distribution with mean μ1 + · · · + μk
and variance σ1² + · · · + σk².
Proof: Write the MGF of Y as the product of the MGFs of the Xi's and gather linear and
squared terms separately to get the desired result.
We can combine these two results to derive the distribution of the sample mean:
RESULT 4: Suppose that the random variables X1, . . . , Xn form a random sample from a
normal distribution with mean μ and variance σ², and let X̄n denote the sample mean.
Then X̄n has a normal distribution with mean μ and variance σ²/n.


Transformations of Normals to 2 distributions


RESULT 5: If X ~ N(0, 1), then Y = X² has a χ² distribution with one degree of freedom.
Proof:
MY(t) = ∫_{−∞}^{∞} e^(tx²) (1/√(2π)) e^(−x²/2) dx = ∫_{−∞}^{∞} (1/√(2π)) e^(−(1/2)(1−2t)x²) dx
= [1/√(1 − 2t)] ∫_{−∞}^{∞} [√(1 − 2t)/√(2π)] e^(−(1/2)(1−2t)x²) dx = 1/√(1 − 2t) for t < 1/2
(the integrand in the last integral is a normal density with μ = 0 and σ² = 1/(1 − 2t)).
The MGF obtained is that of a χ² random variable with v = 1, since the χ² MGF is given by
MX(t) = (1 − 2t)^(−v/2) for t < 1/2


Normals and 2 distributions...


RESULT 6: Let X1, . . . , Xn be independent random variables with each Xi ~ N(0, 1). Then
Y = Σ_{i=1}^n Xi² has a χ² distribution with n degrees of freedom.
Proof:
MY(t) = ∏_{i=1}^n MXi²(t) = ∏_{i=1}^n (1 − 2t)^(−1/2) = (1 − 2t)^(−n/2) for t < 1/2
which is the MGF of a χ² random variable with v = n. This is the reason that the parameter v is
called the degrees of freedom: there are n freely varying random variables whose sum of squares
represents a χ²_v-distributed random variable.


The Bivariate Normal distribution

The bivariate normal has the density:
f(x1, x2) = [1/(2πσ1σ2√(1 − ρ²))] e^(−q/2)
where
q = [1/(1 − ρ²)] [ ((x1 − μ1)/σ1)² − 2ρ((x1 − μ1)/σ1)((x2 − μ2)/σ2) + ((x2 − μ2)/σ2)² ]
E(Xi) = μi, Var(Xi) = σi² and the correlation coefficient ρ(X1, X2) = ρ.
Verify that in this case, X1 and X2 are independent iff they are uncorrelated.
Applications: heights of couples, scores on tests...


Topic 5: Sample statistics and their properties



The Inference Problem


So far, our starting point has been a given probability space. The likelihood of different
outcomes in this space is determined by our probability measure; we've discussed different
types of sample spaces and measures that can be used to assign probabilities to events.
We'll now look at how we can generate information about the probability space by
analyzing a sample of outcomes. This process is referred to as statistical inference.
Inference procedures are parametric when we make assumptions about the probability
space from which our sample is drawn (for example, each sample observation represents an
outcome of a normally distributed random variable with unknown mean and unit variance).
If we make no such assumptions our procedures are nonparametric.
We'll discuss parametric inference. This involves both the estimation of population
parameters and testing hypotheses about them.


Defining a Statistic
Definition: Any real-valued function T = r(X1, . . . , Xn) of the sample observations is called a statistic.
Notice that:
a statistic is itself a random variable
we've considered several functions of random variables whose distributions are well defined,
such as:
Y = (X − μ)/σ where X ~ N(μ, σ²). We showed that Y ~ N(0, 1).
Y = Σ_{i=1}^n Xi, where each Xi has a Bernoulli distribution with parameter p, was shown to have a
binomial distribution with parameters n and p.
Y = Σ_{i=1}^n Xi², where each Xi has a standard normal distribution, was shown to have a χ²_n distribution
etc...
Only some of these functions of random variables are statistics (why?). This distinction is
important because statistics have sample counterparts.
In a problem of estimating an unknown parameter θ, our estimator will be a statistic
whose value can be regarded as an estimate of θ.
It turns out that for large samples, the distributions of some statistics, such as the sample
mean, are well known.


Markov's Inequality
We begin with some useful inequalities which provide us with distribution-free bounds on the
probability of certain events and are useful in proving the law of large numbers, one of our two
main large-sample theorems.
Markov's Inequality: Let X be a random variable with density function f(x) such that
P(X ≥ 0) = 1. Then for any given number t > 0,
P(X ≥ t) ≤ E(X)/t
Proof (for discrete distributions):
E(X) = Σ_x x f(x) = Σ_{x<t} x f(x) + Σ_{x≥t} x f(x). All terms in these summations are non-negative by
assumption, so we have
E(X) ≥ Σ_{x≥t} x f(x) ≥ Σ_{x≥t} t f(x) = t P(X ≥ t)
This inequality obviously holds for t ≤ E(X) (why?). Its main interest is in bounding the
probability in the tails. For example, if the mean of X is 1, the probability of X taking values
bigger than 100 is at most .01. This is true irrespective of the distribution of X; this is what
makes the result powerful.


Chebyshev's Inequality
This is a special case of Markov's inequality and relates the variance of a distribution to the
probability associated with deviations from the mean.
Chebyshev's Inequality: Let X be a random variable such that the distribution of X has a finite
variance σ² and mean μ. Then, for every t > 0,
P(|X − μ| ≥ t) ≤ σ²/t²
or equivalently,
P(|X − μ| < t) ≥ 1 − σ²/t²
Proof: Use Markov's inequality with Y = (X − μ)² and use t² in place of the constant t. Then Y takes only
non-negative values and E(Y) = Var(X) = σ².
In particular, this tells us that for any random variable, the probability that values taken by the
variable will be more than 3 standard deviations away from the mean cannot exceed 1/9:
P(|X − μ| ≥ 3σ) ≤ 1/9
For most distributions, this upper bound is considerably higher than the actual probability of
this event.


Probability bounds: an example
Chebyshev's Inequality can, in principle, be used for computing bounds for the probabilities of certain events. In practice this is not often done because the bounds it provides are quite different from actual probabilities, as seen in the following example:
Let the density function of X be given by f(x) = (1/(2√3)) I₍₋√3, √3₎(x). In this case μ = 0 and σ² = (b − a)²/12 = 1. If t = 3/2, then
Pr(|X − μ| ≥ 3/2) = Pr(|X| ≥ 3/2) = 1 − ∫₋₃/₂^{3/2} (1/(2√3)) dx = 1 − √3/2 ≈ .13
Chebyshev's inequality gives us a bound of 1/t² = 4/9, which is much higher. If t = 2, the exact probability is 0, while the bound is 1/4.
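This comparison is easy to check numerically. The following short Python sketch (not part of the original slides; it assumes numpy and scipy are available) computes the exact tail probability and the Chebyshev bound for the uniform example above.

import numpy as np
from scipy import stats
X = stats.uniform(loc=-np.sqrt(3), scale=2 * np.sqrt(3))   # uniform on (-sqrt(3), sqrt(3)): mean 0, variance 1
for t in (1.5, 2.0):
    exact = X.cdf(-t) + (1 - X.cdf(t))    # Pr(|X| >= t), computed exactly
    bound = min(1.0, 1.0 / t**2)          # Chebyshev bound sigma^2 / t^2 with sigma^2 = 1
    print(t, round(exact, 3), round(bound, 3))
# t = 1.5: exact ~ 0.134 vs bound ~ 0.444;  t = 2.0: exact = 0 vs bound = 0.25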


The sample mean and its properties


Our estimate for the mean of a population is typically the sample mean. We now define this
formally and derive its distribution. We will further justify the use of this estimator when we
move on to discuss estimation.
Definition: Suppose the random variables X₁, ..., Xₙ form a random sample of size n from a distribution with mean μ and variance σ². The arithmetic average of these sample observations, X̄ₙ, is known as the sample mean:
X̄ₙ = (1/n)(X₁ + ... + Xₙ)
Since X̄ₙ is (1/n times) a sum of i.i.d. random variables, it is itself a random variable, and
E(X̄ₙ) = (1/n) Σᵢ₌₁ⁿ E(Xᵢ) = (1/n)·nμ = μ
Var(X̄ₙ) = (1/n²) Var(Σᵢ₌₁ⁿ Xᵢ) = (1/n²) Σᵢ₌₁ⁿ Var(Xᵢ) = (1/n²)·nσ² = σ²/n
We've therefore learned something about the distribution of the sample mean, irrespective of the distribution from which the sample is drawn:
Its expectation is equal to that of the population.
It is more concentrated around its mean value than was the original distribution.
The larger the sample, the lower the variance of X̄ₙ.
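These two moments can be illustrated by simulation. The sketch below is my own illustration (assuming numpy), using an exponential parent distribution with mean 2 and variance 4; it is not taken from the course materials.

import numpy as np
rng = np.random.default_rng(0)
reps, n = 100_000, 25
draws = rng.exponential(scale=2.0, size=(reps, n))   # parent: mean 2, variance 4
xbar = draws.mean(axis=1)                            # 100,000 realizations of the sample mean
print(xbar.mean())   # close to 2      (E[Xbar_n] = mu)
print(xbar.var())    # close to 0.16   (Var[Xbar_n] = sigma^2/n = 4/25)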


Sample size and precision of the sample mean


We can use Chebyshev's Inequality to ask how big a sample we should take if we want to ensure a certain level of precision in the sample mean as an estimate of the population mean.
Suppose the random sample is picked from a distribution with unknown mean and variance equal to 4, and we want an estimate which is within 1 unit of the real mean with probability .99, i.e. Pr(|X̄ − μ| ≥ 1) ≤ .01.
Applying Chebyshev's Inequality we get Pr(|X̄ − μ| ≥ 1) ≤ σ²_X̄/1² = 4/n. Since we want 4/n = .01, we take n = 400.
This calculation does not use any information on the distribution of X and therefore often gives us a much larger number than we would get if this information was available.
Example: Suppose each Xᵢ follows a Bernoulli distribution with p = 1/2; then the total number of successes T = Σᵢ₌₁ⁿ Xᵢ follows a binomial distribution, X̄ₙ = T/n, E(T) = n/2 and Var(T) = n/4. We'd like our sample mean to lie within .1 of the population mean, i.e. in the interval [.4, .6], with probability equal to .7. Using Chebyshev's Inequality, we have
P(|X̄ₙ − μ| ≤ .1) ≥ 1 − σ²_{X̄ₙ}/(.1)² = 1 − (1/(4n))/(1/100) = 1 − 25/n
We therefore need 1 − 25/n ≥ 0.7, which gives us n = 84.
If we compute these probabilities directly from the binomial distribution, we get Pr(6 ≤ T ≤ 9) = F(9) − F(5) ≈ .70 when n = 15, so if we knew that Xᵢ followed a Bernoulli distribution we would take this much smaller sample size for the desired level of precision in our estimate of X̄ₙ.
This illustrates the trade-off between more efficient parametric procedures and more robust non-parametric ones.
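The two sample-size calculations can be compared directly. A minimal Python sketch (my own, assuming scipy) for the Bernoulli example above:

import numpy as np
from scipy.stats import binom
def prob_within(n, eps=0.1, p=0.5):
    # Pr(|Xbar_n - p| <= eps), computed from the binomial distribution of the total T
    lo, hi = np.ceil(n * (p - eps)), np.floor(n * (p + eps))
    return binom.cdf(hi, n, p) - binom.cdf(lo - 1, n, p)
print(prob_within(15))   # ~0.70: n = 15 already achieves the target exactly
print(1 - 25 / 84)       # ~0.70: the distribution-free Chebyshev bound needs n = 84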


Convergence of Real Sequences


We would like our estimators to be well-behaved. What does this mean?
One desirable property is that our estimates get closer to the parameter that we are trying to estimate as our sample gets larger. We're going to make precise this notion of "getting closer".
Recall that a sequence is just a function from the set of natural numbers N to any set A (Examples: yₙ = 2ⁿ, yₙ = 1/n).
A real number sequence {yₙ} converges to y if for every ε > 0 there exists N(ε) for which n ≥ N(ε) ⟹ |yₙ − y| < ε. In such a case we say that {yₙ} → y. Which of the above sequences converge?
If we have a sequence of functions {fₙ}, the sequence is said to converge to a function f if fₙ(x) → f(x) for all x in the domain of f.
In the case of matrices, a sequence of matrices converges if each of the sequences formed by the (i, j)th elements converges, i.e. Yₙ[i, j] → Y[i, j].


Sequences of Random Variables


A sequence of random variables is a sequence for which the set A is a collection of random
variables and the function defining the sequence puts these random variables in a specific
order.
Examples: Yₙ = (1/n) Σᵢ₌₁ⁿ Xᵢ, where Xᵢ ~ N(μ, σ²), or Yₙ = Σᵢ₌₁ⁿ Xᵢ, where Xᵢ ~ Bernoulli(p).

We now need to modify our notion of convergence, since the sequence {Yn } no longer defines
a given sequence of real numbers, but rather, many different real number sequences,
depending on the realizations of X1 , . . . , Xn .
Convergence questions can no longer be verified unequivocally, since we are not referring to a given real sequence, but they can be assigned a probability of occurrence based on the probability space for the random variables involved.
There are several types of random variable convergence discussed in the literature. We'll focus on two of these:
Convergence in Distribution
Convergence in Probability


Convergence in Distribution
Definition: Let {Yₙ} be a sequence of random variables, and let {Fₙ} be the associated sequence of cumulative distribution functions. If there exists a cumulative distribution function F such that Fₙ(y) → F(y) for every y at which F is continuous, then F is called the limiting CDF of {Yₙ}. Letting Y have the distribution function F, we say that Yₙ converges in distribution to the random variable Y and denote this by Yₙ →ᵈ Y.
The notation Yₙ →ᵈ F is also used to denote Yₙ →ᵈ Y ~ F.
Convergence in distribution holds if there is convergence in the sequence of densities (fₙ(y) → f(y)) or in the sequence of MGFs (M_{Yₙ}(t) → M_Y(t)). In some cases it may be easier to use these to show convergence in distribution.
Result: Let Xₙ →ᵈ X, and let the random variable g(X) be defined by a continuous function g(·). Then g(Xₙ) →ᵈ g(X).
Example: Suppose Zₙ →ᵈ Z ~ N(0, 1); then 2Zₙ + 5 →ᵈ 2Z + 5 ~ N(5, 4) (why?)


Convergence in Probability
This concept formalizes the idea that we can bring the outcomes of the random variable Yn
arbitrarily close to the outcomes of the random variable Y for large enough n.
Definition: The sequence of random variables {Yₙ} converges in probability to the random variable Y iff, for every ε > 0,
limₙ→∞ P(|Yₙ − Y| < ε) = 1
We denote this by Yₙ →ᵖ Y or plim Yₙ = Y. This justifies using outcomes of Y as an approximation for outcomes of Yₙ, since the two are very close for large n.
Notice that convergence in distribution is a statement about the distribution functions of Yₙ and Y, whereas convergence in probability is a statement about the joint distribution of the outcomes yₙ and y.
Distribution functions of very different experiments may be the same: an even number on a fair die and a head on a fair coin have the same distribution function, but the outcomes of these random variables are unrelated.
Therefore Yₙ →ᵖ Y implies Yₙ →ᵈ Y. In the special case where Yₙ →ᵈ c, a constant, we also have Yₙ →ᵖ c and the two are equivalent.


Properties of the plim operator


plim AXₙ = A plim Xₙ
the plim of a sum is the sum of the plims
the plim of a product is the product of the plims
Example: Yₙ = (2 + 1/n)X + 3 and X ~ N(1, 2). Using the properties of the plim operator, we have
plim Yₙ = plim(2 + 1/n)·X + plim(3) = 2X + 3 ~ N(5, 8)
Since convergence in probability implies convergence in distribution, Yₙ →ᵈ N(5, 8).


The Weak Law of Large Numbers


Consider now the convergence of the random variable sequence whose nth term is given by:
X̄ₙ = (1/n) Σᵢ₌₁ⁿ Xᵢ
WLLN: Let {Xₙ} be a sequence of i.i.d. random variables with finite mean μ and variance σ². Then X̄ₙ →ᵖ μ.
Proof. Using Chebyshev's Inequality, for every ε > 0,
P(|X̄ₙ − μ| < ε) ≥ 1 − σ²/(nε²)
Hence limₙ→∞ P(|X̄ₙ − μ| < ε) = 1, or plim X̄ₙ = μ.
The WLLN will allow us to use the sample mean as an estimate of the population mean, under very general conditions.


Central Limit Theorems


Central limit theorems specify conditions under which sequences of random variables
converge in distribution to known families of distributions.
These are very useful in deriving asymptotic distributions of test statistics whose exact
distributions are either cumbersome or difficult to derive.
There are a large number of theorems which vary by the assumptions they place on the random variables: scalar or multivariate, dependent or independent, identically or non-identically distributed.
The Lindeberg-Levy CLT: Let {Xₙ} be a sequence of i.i.d. random variables with E(Xᵢ) = μ and Var(Xᵢ) = σ² ∈ (0, ∞) for all i. Then
√n (X̄ₙ − μ)/σ →ᵈ N(0, 1)


Lindeberg-Levy CLT: applications
Approximating binomial probabilities via the normal distribution: Let {Xₙ} be a sequence of i.i.d. Bernoulli random variables. Then, by the LLCLT:
(Σᵢ₌₁ⁿ Xᵢ − np)/√(np(1 − p)) →ᵈ N(0, 1), so Σᵢ₌₁ⁿ Xᵢ is asymptotically N(np, np(1 − p))
In this case Σᵢ₌₁ⁿ Xᵢ is of the form aZₙ + b with a = √(np(1 − p)) and b = np, and since Zₙ converges to Z in distribution, the asymptotic distribution of Σᵢ₌₁ⁿ Xᵢ is normal with the mean and variance given above (based on our results on normally distributed variables).
Approximating χ² probabilities via the normal distribution: Let {Xₙ} be a sequence of i.i.d. chi-square random variables with 1 degree of freedom. Using the additivity property of variables with gamma distributions, we have Σᵢ₌₁ⁿ Xᵢ ~ χ²ₙ. Recall that the mean of a gamma distribution is α/β and its variance is α/β²; for a χ²ₙ random variable, α = n/2 and β = 1/2, so the mean is n and the variance is 2n. Then, by the LLCLT:
(Σᵢ₌₁ⁿ Xᵢ − n)/√(2n) →ᵈ N(0, 1), so Σᵢ₌₁ⁿ Xᵢ is asymptotically N(n, 2n)
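The quality of the binomial approximation is easy to inspect. A short Python sketch (my own illustration, assuming scipy; the specific n, p and cutoff are arbitrary choices):

import numpy as np
from scipy.stats import binom, norm
n, p = 100, 0.3
mean, sd = n * p, np.sqrt(n * p * (1 - p))
exact = binom.cdf(35, n, p)                    # Pr(sum of the X_i <= 35), exact
approx = norm.cdf((35 + 0.5 - mean) / sd)      # CLT approximation with a continuity correction
print(exact, approx)                           # both roughly 0.88-0.89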


Topic 6: Estimation

Random Samples
We cannot usually look at the population as a whole because
it may be too big and therefore expensive and time-consuming
analysis may destroy the product/organism (you need brain cells to figure out how the brain works, and you have to crash cars to know how sturdy they are)
We would like to choose a sample which is representative of the population or process that
interests us. A common procedure with many desirable properties is random sampling - all
objects in the population have an equal chance of being selected
Haphazard sampling procedures often result in non-random samples.
Example: We have a bag of sweets and chocolates of different types (eclairs, five-stars, gems...) and want to
estimate the average weight of the items in the bag. If we pass the bag around and each student puts their hand in and picks 5 items, how do you think these sample averages would compare with the true average?

Definition: Let f(x) be the density function of a continuous random variable X. Consider a sample of size n from this distribution. We can think of the first value drawn as a realization of the random variable X₁, similarly for X₂, ..., Xₙ. (x₁, ..., xₙ) is a random sample if
f(x₁, ..., xₙ) = f(x₁)f(x₂)...f(xₙ) = ∏ᵢ₌₁ⁿ f(xᵢ).


Statistical Models
Definition: A statistical model for a random sample consists of a parametric functional form f(x; θ) together with a parameter space Ω which defines the potential candidates for θ.
Examples: We may specify that our sample comes from
a Bernoulli distribution with Ω = {p : p ∈ [0, 1]}
a Normal distribution where Ω = {(μ, σ²) : μ ∈ (−∞, ∞), σ² > 0}
Note that Ω could be much more restrictive in each of these examples. What matters for our purposes is that
Ω contains the true value of the parameter
the parameters are identifiable, meaning that the probability of generating the given sample is different under two distinct parameter values. If, for a given sample x and two distinct parameter values θ₁ and θ₂, f(x; θ₁) = f(x; θ₂), we'll never be able to use the sample to reach a conclusion on which of these values is the true parameter.


Estimators and Estimates


Definition: An estimator of the parameter θ, based on the random variables X₁, ..., Xₙ, is a real-valued function δ(X₁, ..., Xₙ) which specifies the estimated value of θ for each possible set of values of X₁, ..., Xₙ.
Since an estimator δ(X₁, ..., Xₙ) is a function of the random variables X₁, ..., Xₙ, the estimator is itself a random variable and its probability distribution can be derived from the joint distribution of X₁, ..., Xₙ.
A point estimate is a specific value of the estimator, δ(x₁, ..., xₙ), determined by using the observed values x₁, ..., xₙ.
There are lots of potential functions of the random sample; what criteria should we use to choose among these?


Desirable Properties of Estimators


Unbiasedness: E(θ̂ₙ) = θ.
Consistency: limₙ→∞ P(|θ̂ₙ − θ| > ε) = 0 for every ε > 0.
Minimum MSE: E(θ̂ₙ − θ)² ≤ E(θ̂ₙ* − θ)² for any other estimator θ̂ₙ*.
Using the MSE criterion may often lead us to choose biased estimators because
MSE(θ̂) = E(θ̂ − θ)² = E[(θ̂ − E(θ̂)) + (E(θ̂) − θ)]²
= E(θ̂ − E(θ̂))² + (E(θ̂) − θ)² + 2(E(θ̂) − θ)E(θ̂ − E(θ̂))
= Var(θ̂) + Bias(θ̂, θ)² + 0
A Minimum Variance Unbiased Estimator (MVUE) is an estimator which has the smallest variance among the class of unbiased estimators.
A Best Linear Unbiased Estimator (BLUE) is an estimator which has the smallest variance among the class of linear unbiased estimators (the estimates must be linear functions of sample values).


Maximum Likelihood Estimators


Definition: Suppose that the random variables X₁, ..., Xₙ form a random sample from a discrete or continuous distribution for which the p.f. or p.d.f. is f(x|θ), where θ belongs to some parameter space Ω. For any observed vector x = (x₁, ..., xₙ), fₙ(x|θ) is a function of θ and is called the likelihood function.
For each possible observed vector x, let δ(x) denote a value of θ ∈ Ω for which the likelihood function fₙ(x|θ) is a maximum, and let θ̂ = δ(X) be the estimator of θ defined in this way. The estimator θ̂ is called the maximum likelihood estimator (M.L.E.) of θ.


M.L.E. of a Bernoulli parameter
Suppose that the random variables X₁, ..., Xₙ form a random sample from a Bernoulli distribution for which the parameter θ is unknown. We can derive the M.L.E. of θ as follows:
The Bernoulli density can be written as f(x; θ) = θˣ(1 − θ)¹⁻ˣ, x = 0, 1.
For any observed values x₁, ..., xₙ, where each xᵢ is either 0 or 1, the likelihood function is given by:
fₙ(x|θ) = ∏ᵢ₌₁ⁿ θ^xᵢ (1 − θ)^(1−xᵢ) = θ^Σxᵢ (1 − θ)^(n−Σxᵢ)
The value of θ that maximizes this is the same as the value that maximizes the log of the likelihood function, L(θ), which is given by:
L(θ) = (Σᵢ₌₁ⁿ xᵢ) ln θ + (n − Σᵢ₌₁ⁿ xᵢ) ln(1 − θ)
The first-order condition for an extreme point is (Σxᵢ)/θ − (n − Σxᵢ)/(1 − θ) = 0, and solving this we get
θ̂_MLE = (1/n) Σᵢ₌₁ⁿ xᵢ
Confirm that the second derivative of L(θ) is in fact negative, so we do have a maximum.
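As a quick check, maximizing the log-likelihood numerically reproduces the sample mean. A minimal Python sketch (my own, assuming numpy/scipy; the sample and true parameter are arbitrary):

import numpy as np
from scipy.optimize import minimize_scalar
rng = np.random.default_rng(1)
x = rng.binomial(1, 0.3, size=200)                       # a Bernoulli(0.3) sample
def neg_loglik(theta):
    return -(x.sum() * np.log(theta) + (len(x) - x.sum()) * np.log(1 - theta))
opt = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(opt.x, x.mean())                                   # the two estimates coincide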


Sampling from a normal distribution: unknown mean
Suppose that the random variables X₁, ..., Xₙ form a random sample from a normal distribution for which the mean μ is unknown, but the variance σ² is known. Recall the normal density
f(x; μ, σ²) = (1/(σ√(2π))) e^(−½((x−μ)/σ)²) I₍₋∞,∞₎(x)
For any observed values x₁, ..., xₙ, the likelihood function is given by:
fₙ(x|μ) = (2πσ²)^(−n/2) exp(−(1/(2σ²)) Σᵢ₌₁ⁿ (xᵢ − μ)²)
fₙ(x|μ) will be maximized by the value of μ which minimizes the following expression in μ:
Q(μ) = Σᵢ₌₁ⁿ (xᵢ − μ)² = Σᵢ₌₁ⁿ xᵢ² − 2μ Σᵢ₌₁ⁿ xᵢ + nμ²
The first-order condition is −2 Σᵢ₌₁ⁿ xᵢ + 2nμ = 0, so μ̂_MLE = (1/n) Σᵢ₌₁ⁿ xᵢ and our M.L.E. is once again the sample mean.
The second derivative of Q(μ) is positive, so we have a minimum value of Q(μ).


Sampling from a normal distribution: unknown μ and σ²
Now suppose that the random variables X₁, ..., Xₙ form a random sample from a normal distribution for which both parameters μ and σ² are unknown. Now the likelihood function is a function of both parameters:
fₙ(x|μ, σ²) = (2πσ²)^(−n/2) exp(−(1/(2σ²)) Σᵢ₌₁ⁿ (xᵢ − μ)²)
To find the M.L.E.s of both μ and σ², it is easiest to maximize the log of the likelihood function:
L(μ, σ²) = −(n/2) ln(2π) − (n/2) ln(σ²) − (1/(2σ²)) Σᵢ₌₁ⁿ (xᵢ − μ)²
We now have two first-order conditions obtained by setting each of the following partial derivatives equal to zero:
∂L/∂μ = (1/σ²)(Σᵢ₌₁ⁿ xᵢ − nμ)    (1)
∂L/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σᵢ₌₁ⁿ (xᵢ − μ)²    (2)
The first equation can be solved to obtain μ̂ = x̄ₙ and, substituting this into the second equation, we obtain σ̂² = (1/n) Σᵢ₌₁ⁿ (xᵢ − x̄ₙ)².
The maximum likelihood estimators are therefore μ̂ = X̄ₙ and σ̂² = (1/n) Σᵢ₌₁ⁿ (Xᵢ − X̄ₙ)².


Sampling from a uniform distribution


Now suppose X₁, ..., Xₙ are draws from a uniform distribution on [0, θ] for which the parameter θ is unknown.
The density function is given by f(x; θ) = (1/θ) I_[0,θ](x).
The likelihood function is therefore fₙ(x; θ) = 1/θⁿ whenever all xᵢ lie in [0, θ] (and 0 otherwise).
This function is decreasing in θ and is therefore maximized at the smallest admissible value of θ, which is given by θ̂ = max(X₁, ..., Xₙ).
Notice that if we modify the domain of the density to be (0, θ) instead of [0, θ], then no M.L.E. exists since the maximum sample value is no longer an admissible candidate for θ.
Now suppose the random sample is from a uniform distribution on [θ, θ + 1]. Now θ could lie anywhere in the interval [max(x₁, ..., xₙ) − 1, min(x₁, ..., xₙ)] and the method of maximum likelihood does not provide us with a unique estimate.


Computation of MLEs
The form of a likelihood function is often complicated enough to make numerical computation necessary.
Consider, for example, a sample of size n from the following Gamma distribution and suppose we would like an MLE of α:
f(x; α) = (1/Γ(α)) x^(α−1) e^(−x),  x > 0
fₙ(x|α) = (1/Γ(α)ⁿ) (∏ᵢ₌₁ⁿ xᵢ)^(α−1) e^(−Σᵢ₌₁ⁿ xᵢ)
Setting the derivative of log L with respect to α to zero, we get the first-order condition
Γ′(α)/Γ(α) = (1/n) Σᵢ₌₁ⁿ ln xᵢ
The LHS is the digamma function, which is tabulated and is now embedded in software packages.
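For illustration, the first-order condition can be solved numerically with a root-finder. A sketch (my own, assuming scipy; the simulated sample and true α = 3 are arbitrary):

import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq
rng = np.random.default_rng(2)
x = rng.gamma(shape=3.0, scale=1.0, size=500)            # data from Gamma(alpha = 3, beta = 1)
target = np.log(x).mean()
alpha_hat = brentq(lambda a: digamma(a) - target, 1e-3, 100.0)   # solve digamma(alpha) = mean(log x)
print(alpha_hat)                                         # close to 3 in large samples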


Properties of Maximum Likelihood Estimators


1. Invariance: If θ̂ is the maximum likelihood estimator of θ, and g(θ) is a one-to-one function of θ, then g(θ̂) is a maximum likelihood estimator of g(θ).
This allows us to easily compute M.L.E.s of various statistics once we have a few of them.
Example: we have shown that the sample mean and the sample variance are the M.L.E.s of the mean and variance of a random sample from a normal distribution. We can use the invariance property to conclude that
the M.L.E. of the standard deviation is the square root of the sample variance
the M.L.E. of E(X²) is equal to the sample variance plus the square of the sample mean, i.e. since E(X²) = σ² + μ², the M.L.E. of E(X²) is σ̂² + μ̂².
2. Consistency: If there exists a unique M.L.E. θ̂ₙ of a parameter θ for a sample of size n, then plim θ̂ₙ = θ.
Note: MLEs are not, in general, unbiased.
Example: The MLE of the variance of a normally distributed variable is given by σ̂²ₙ = (1/n) Σᵢ₌₁ⁿ (Xᵢ − X̄ₙ)².
E(σ̂²ₙ) = E[(1/n) Σ(Xᵢ − X̄ₙ)²] = E[(1/n)(Σ Xᵢ² − 2X̄ₙ Σ Xᵢ + nX̄ₙ²)] = E[(1/n)(Σ Xᵢ² − nX̄ₙ²)]
= (1/n)[Σ E(Xᵢ²) − n E(X̄ₙ²)] = (1/n)[n(σ² + μ²) − n(σ²/n + μ²)] = σ² (n − 1)/n
An unbiased estimate would therefore be Σᵢ₌₁ⁿ (Xᵢ − X̄ₙ)²/(n − 1).


Sufficient Statistics
We have seen that M.L.E.s may not exist, or may not be unique. Where should our search for other estimators start? We'll see that a natural starting point is the set of sufficient statistics for the sample.
Suppose that in a specific estimation problem, two statisticians A and B would like to estimate θ; A observes the realized values of X₁, ..., Xₙ, while B only knows the value of a certain statistic T = r(X₁, ..., Xₙ).
A can now choose any function of the observations (X₁, ..., Xₙ) whereas B can choose only functions of T. In general, A will be able to choose a better estimator than B. Suppose, however, that B does just as well as A because the single function T summarizes all the relevant information in the sample for choosing a suitable estimator of θ. A statistic T with this property is called a sufficient statistic.
In this case, given T = t, B can generate an alternative sample X′₁, ..., X′ₙ from the conditional joint distribution of the sample given T = t (which, for a sufficient statistic, does not depend on θ); this is an auxiliary randomization. Suppose A uses δ(X₁, ..., Xₙ) as an estimator. Then B could just use δ(X′₁, ..., X′ₙ), which has the same probability distribution as A's estimator.


Sufficient statistics: their detection and importance


The Factorization Criterion (Fisher (1922); Neyman (1935)): Let (X₁, ..., Xₙ) form a random sample from either a continuous or discrete distribution for which the p.d.f. or p.f. is f(x|θ), where the value of θ is unknown and belongs to a given parameter space Ω. A statistic T = r(X₁, ..., Xₙ) is a sufficient statistic for θ if and only if, for all values of x = (x₁, ..., xₙ) ∈ Rⁿ and all values of θ ∈ Ω, the joint p.d.f. or p.f. fₙ(x|θ) of (X₁, ..., Xₙ) can be factored as follows:
fₙ(x|θ) = u(x)v[r(x), θ]
The functions u and v are nonnegative; the function u may depend on x but does not depend on θ; and the function v depends on θ but depends on the observed value x only through the value of the statistic r(x).
Rao-Blackwell Theorem: An estimator that is not a function of a sufficient statistic is dominated by one that is (in terms of having a lower MSE).


Sufficient Statistics: examples


1. Sampling from a Poisson distribution: Let (X₁, ..., Xₙ) form a random sample from a Poisson distribution for which the value of the mean θ is unknown and belongs to the parameter space Ω = {θ : θ > 0}. For any set of nonnegative integers x₁, ..., xₙ, the joint p.f. fₙ(x|θ) of X₁, ..., Xₙ is as follows:
fₙ(x|θ) = ∏ᵢ₌₁ⁿ e^(−θ) θ^(xᵢ)/xᵢ! = (∏ᵢ₌₁ⁿ 1/xᵢ!) e^(−nθ) θ^y
where y = Σᵢ₌₁ⁿ xᵢ. We've expressed fₙ(x|θ) as the product of a function that does not depend on θ and a function that depends on θ but depends on the observed vector x only through the value of y. It follows that T = Σᵢ₌₁ⁿ Xᵢ is a sufficient statistic for θ.
2. Sampling from a normal distribution with known variance and unknown mean: Let (X₁, ..., Xₙ) form a random sample from a normal distribution for which the value of the mean μ is unknown and the variance σ² is known. The joint p.d.f. fₙ(x|μ) of X₁, ..., Xₙ has already been derived as:
fₙ(x|μ) = (2πσ²)^(−n/2) exp(−(1/(2σ²)) Σᵢ₌₁ⁿ xᵢ²) · exp((μ/σ²) Σᵢ₌₁ⁿ xᵢ − nμ²/(2σ²))
The above expression is a product of a function that does not depend on μ and a function that depends on μ and on x only through the value of Σᵢ₌₁ⁿ xᵢ. It follows that T = Σᵢ₌₁ⁿ Xᵢ is a sufficient statistic for μ.


Jointly Sufficient Statistics


If our parameter space is multi-dimensional, and often even when it is not, there may not exist a single sufficient statistic T, but we may be able to find a set of statistics T₁, ..., T_k which are jointly sufficient statistics for estimating our parameter.
The corresponding factorization criterion is now
fₙ(x|θ) = u(x)v[r₁(x), ..., r_k(x), θ]
The functions u and v are nonnegative; the function u may depend on x but does not depend on θ; and the function v depends on θ but depends on the observed value x only through the values of the statistics r₁(x), ..., r_k(x).
Example: If both the mean and the variance of a normal distribution are unknown, the joint p.d.f.
fₙ(x|μ, σ²) = (2πσ²)^(−n/2) exp(−(1/(2σ²)) Σᵢ₌₁ⁿ xᵢ² + (μ/σ²) Σᵢ₌₁ⁿ xᵢ − nμ²/(2σ²))
can be seen to depend on x only through the statistics T₁ = Σ Xᵢ and T₂ = Σ Xᵢ². These are therefore jointly sufficient statistics for μ and σ².
If T₁, ..., T_k are jointly sufficient for some parameter vector and the statistics T′₁, ..., T′_k are obtained from these by a one-to-one transformation, then T′₁, ..., T′_k are also jointly sufficient. So the sample mean and sample variance are also jointly sufficient in the above example, since T₁ = nT′₁ and T₂ = n(T′₂ + T′₁²).


Minimal Sufficient Statistics


Definition: A statistic T is a minimal sufficient statistic if T is a sufficient statistic and is a
function of every other sufficient statistic.
Minimal jointly sufficient statistics are defined in an analogous manner.
Let Y1 denote the smallest value in the sample, Y2 the next smallest, and so on, with Yn the
largest value in the sample. We call Y1 , . . . Yn the order statistics of a sample.
Order statistics are always jointly sufficient. To see this, note that the likelihood function is given by
fₙ(x|θ) = ∏ᵢ₌₁ⁿ f(xᵢ|θ)
Since the order of the terms in this product is irrelevant (we need to know only the values obtained, not which one was X₁, which was X₂, and so on), we could as well write this expression as
fₙ(x|θ) = ∏ᵢ₌₁ⁿ f(yᵢ|θ).
For some distributions the order statistics may be the simplest set of jointly sufficient statistics, and they are then minimal jointly sufficient statistics.
Notice that if a sufficient statistic r(x) exists, the M.L.E. must be a function of it (this follows from the factorization criterion). It turns out that if the M.L.E. is a sufficient statistic, it is minimal sufficient.


Implications
Suppose we are picking a sample from a normal distribution. We may be tempted to use Y₍ₙ₊₁₎/₂ (the sample median) as an estimate of the median m and the range Yₙ − Y₁ as an estimate of dispersion. Yet we know that we would do better using the sample mean for m, and that the sample variance must be a function of Σ Xᵢ and Σ Xᵢ².
A statistic is always sufficient with respect to a particular probability distribution f(x|θ) and may not be sufficient w.r.t. another, say g(x|θ). Instead of choosing functions of the sufficient statistic we obtain in one case, we may want to find a robust estimator that does well for many possible distributions.
In non-parametric inference, we do not know the likelihood function, and so our estimators
are based on functions of the order statistics.


Topic 7: Sampling Distributions of Estimators



Sampling distributions of estimators


Since our estimators are statistics (particular functions of random variables), their
distribution can be derived from the joint distribution of X1 . . . Xn . It is called the sampling
distribution because it is based on the joint distribution of the random sample.
Given this distribution, we can
calculate the probability that an estimator will not differ from the parameter by more
than a specified amount
obtain interval estimates rather than point estimates after we have a sample; an interval estimate is a random interval such that the true parameter lies within this interval with a given probability (say 95%).
choose between two estimators: we can, for instance, calculate the mean-squared error of the estimator, E_θ[(θ̂ − θ)²], using the distribution of θ̂.
Sampling distributions of estimators depend on sample size, and we want to know exactly
how the distribution changes as we change this size so that we can make the right trade-offs
between cost and accuracy.
We begin with a set of results which help us derive the sampling distributions of the
estimators we are interested in.


Joint distribution of sample mean and sample variance


For a random sample from a normal distribution, we already know that the MLEs of the population mean and variance are X̄ₙ and Σᵢ₌₁ⁿ (Xᵢ − X̄ₙ)²/n respectively.
The sample mean is itself normally distributed with mean μ and variance σ²/n.
Σᵢ₌₁ⁿ ((Xᵢ − μ)/σ)² has a χ²ₙ distribution since it is the sum of squares of n standard normal random variables.
Theorem: If X₁, ..., Xₙ form a random sample from a normal distribution with mean μ and variance σ², then the sample mean X̄ₙ and the sample variance (1/n) Σᵢ₌₁ⁿ (Xᵢ − X̄ₙ)² are independent random variables, and
X̄ₙ ~ N(μ, σ²/n)
Σᵢ₌₁ⁿ (Xᵢ − X̄ₙ)²/σ² ~ χ²ₙ₋₁


The t-distribution
Let Z ~ N(0, 1), let Y ~ χ²ᵥ, and let Z and Y be independent random variables. Then
X = Z/√(Y/v) ~ tᵥ
The p.d.f. of the t-distribution is given by:
f(x; v) = [Γ((v+1)/2) / (Γ(v/2)√(vπ))] (1 + x²/v)^(−(v+1)/2)
Features of the t-distribution:
One can see from the above density function that the t-density is symmetric with a maximum value at x = 0.
The shape of the density is similar to that of the standard normal (bell-shaped) but with fatter tails.


Relation to random normal samples


RESULT 1: Define S²ₙ = Σᵢ₌₁ⁿ (Xᵢ − X̄ₙ)². The random variable
U = √n (X̄ₙ − μ) / √(S²ₙ/(n − 1))
has a t distribution with (n − 1) degrees of freedom.
Proof: We know that √n(X̄ₙ − μ)/σ ~ N(0, 1) and that S²ₙ/σ² ~ χ²ₙ₋₁. Dividing the first random variable by the square root of the second divided by its degrees of freedom, the σ in the numerator and denominator cancels and we obtain U.
Implication: We may not be able to make statements about the difference between the population mean μ and the sample mean X̄ₙ using the normal distribution because, even though √n(X̄ₙ − μ)/σ ~ N(0, 1), σ may be unknown. This result allows us to use its estimate σ̂² = Σᵢ₌₁ⁿ (Xᵢ − X̄ₙ)²/n, since (X̄ₙ − μ)/(σ̂/√(n − 1)) ~ tₙ₋₁.
RESULT 2: Given X, Z, Y, n as above, as n → ∞, X →ᵈ Z ~ N(0, 1).
As the sample size gets larger, the t-density looks more and more like a standard normal distribution. For instance, the value of x for which the distribution function is equal to .55 is .129 for t₁₀, .127 for t₂₀ and .126 for the standard normal distribution. The differences between these values increase for higher values of their distribution functions (why?).
To see why this might happen, consider the variable we just derived: U = √(n − 1)(X̄ₙ − μ)/σ̂ ~ tₙ₋₁. As n gets large, σ̂ gets very close to σ and √((n − 1)/n) is close to 1, so U behaves like √n(X̄ₙ − μ)/σ ~ N(0, 1).


Interval estimates for the mean


Let us now see how, given σ², we can obtain an interval estimate for μ, i.e. an interval which is likely to contain μ with a pre-specified probability.
Since (X̄ₙ − μ)/(σ/√n) ~ N(0, 1), Pr(−2 < (X̄ₙ − μ)/(σ/√n) < 2) = .955
But this event is equivalent to the events μ − 2σ/√n < X̄ₙ < μ + 2σ/√n and X̄ₙ − 2σ/√n < μ < X̄ₙ + 2σ/√n.
With known σ, each of the random variables X̄ₙ − 2σ/√n and X̄ₙ + 2σ/√n is a statistic. Therefore, we have derived a random interval within which the population parameter lies with probability .955, i.e.
Pr(X̄ₙ − 2σ/√n < μ < X̄ₙ + 2σ/√n) = .955
Notice that there are many intervals with this same coverage probability; this is the shortest one.
Now, given our sample, our statistics take particular values and the resulting interval either contains μ or does not contain μ. We can therefore no longer talk about the probability that it contains μ, because the experiment has already been performed.
We say that (x̄ₙ − 2σ/√n, x̄ₙ + 2σ/√n) is a 95.5% confidence interval for μ. Alternatively, we may say that μ lies in the above interval with confidence .955, or that the above interval is a confidence interval for μ with confidence coefficient .955.


Confidence Intervals for means: examples
Example 1: X₁, ..., Xₙ form a random sample from a normal distribution with unknown μ and σ² = 10. x̄ₙ is found to be 7.164 with n = 40. An 80% confidence interval for the mean is given by
(7.164 − 1.282√(10/40), 7.164 + 1.282√(10/40)) or (6.523, 7.805). The confidence coefficient is .8.
Example 2: Let X̄ denote the sample mean of a random sample of size 25 from a distribution with variance 100 and mean μ. In this case σ/√n = 2 and, making use of the central limit theorem, the following statement is approximately true:
Pr(−1.96 < (X̄ₙ − μ)/2 < 1.96) = .95, or Pr(X̄ₙ − 3.92 < μ < X̄ₙ + 3.92) = .95
If the sample mean is given by x̄ₙ = 67.53, an approximate 95% confidence interval for μ is given by (63.61, 71.45).
Example 3: Suppose we are interested in a confidence interval for the mean of a normal distribution but do not know σ². We know that (X̄ₙ − μ)/(σ̂/√(n − 1)) ~ tₙ₋₁ and can use the t-distribution with (n − 1) degrees of freedom to construct our interval estimate. With n = 10, x̄ₙ = 3.22, σ̂ = 1.17, a 95% interval is given by
(3.22 − (2.262)(1.17)/√9, 3.22 + (2.262)(1.17)/√9) = (2.34, 4.10)
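These intervals are easy to reproduce in software. A minimal Python sketch (my own, assuming scipy) for Examples 1 and 3:

import numpy as np
from scipy.stats import norm, t
z = norm.ppf(0.9)                       # ~1.282 for an 80% interval with known variance
half = z * np.sqrt(10 / 40)
print(7.164 - half, 7.164 + half)       # ~ (6.523, 7.805)
tval = t.ppf(0.975, df=9)               # ~2.262 for a 95% interval with estimated variance
half = tval * 1.17 / np.sqrt(9)
print(3.22 - half, 3.22 + half)         # ~ (2.34, 4.10)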


Confidence Intervals for differences in means


Let X₁, ..., Xₙ and Y₁, ..., Yₘ denote, respectively, independent random samples from two distributions, N(μ₁, σ²) and N(μ₂, σ²), with sample means denoted by X̄, Ȳ and sample variances by σ̂₁² and σ̂₂². We've established that:
X̄ and Ȳ are normally and independently distributed with means μ₁ and μ₂ and variances σ²/n and σ²/m.
Using our results on the distribution of linear combinations of normally distributed variables, we know that X̄ₙ − Ȳₘ is normally distributed with mean μ₁ − μ₂ and variance σ²/n + σ²/m. The random variable
[(X̄ₙ − Ȳₘ) − (μ₁ − μ₂)] / √(σ²/n + σ²/m)
has a standard normal distribution and will form the numerator of the T random variable that we are going to use.
We also know that nσ̂₁²/σ² and mσ̂₂²/σ² have χ² distributions with (n − 1) and (m − 1) degrees of freedom respectively, so their sum (nσ̂₁² + mσ̂₂²)/σ² has a χ² distribution with (n + m − 2) degrees of freedom, and the random variable
[(nσ̂₁² + mσ̂₂²)/σ²] / (n + m − 2)
can appear (via its square root) in the denominator of a random variable which has a t-distribution with (n + m − 2) degrees of freedom.


Confidence Intervals for differences in means..contd


We have therefore established that
X = [(X̄ₙ − Ȳₘ) − (μ₁ − μ₂)] / [√(1/n + 1/m) · √((nσ̂₁² + mσ̂₂²)/(n + m − 2))]
has a t-distribution with (n + m − 2) degrees of freedom. To simplify notation, denote the denominator of the above expression by R.
Given our samples X₁, ..., Xₙ and Y₁, ..., Yₘ, we can now construct confidence intervals for differences in the means of the corresponding populations, μ₁ − μ₂. We do this in the usual way:
Suppose we want a 95% confidence interval for the difference in the means; we find a number b such that, using the t-distribution with (n + m − 2) degrees of freedom,
Pr(−b < X < b) = .95
The random interval ((X̄ − Ȳ) − bR, (X̄ − Ȳ) + bR) will now contain the true difference in means with 95% probability.
A confidence interval is then based on the sample values (x̄ₙ − ȳₘ) and the corresponding
sample variances.
Based on the CLT, we can use the same procedure even when our samples are not normal.


The F-distribution
RESULT: Let Y ~ χ²ₘ, Z ~ χ²ₙ, and let Y and Z be independent random variables. Then
F = (Y/m)/(Z/n) = nY/(mZ)
has an F-distribution with m and n degrees of freedom. The F-density is given by:
f(x) = [Γ((m+n)/2) m^(m/2) n^(n/2) / (Γ(m/2)Γ(n/2))] · x^(m/2 − 1) (mx + n)^(−(m+n)/2) I₍₀,∞₎(x)
m and n are sometimes referred to as the numerator and denominator degrees of freedom respectively.
It turns out that the square of a random variable with a t-distribution with n degrees of freedom has an F distribution with (1, n) degrees of freedom.
The F test turns out to be useful in testing for differences in variances between two distributions.
Many important specification tests rely on the F-distribution (example: testing for a set of coefficients in a linear model being equal to zero).


Computing probabilities with the F-distribution


While the t-density is symmetric, the F density is defined on the positive real numbers and is skewed to the right.
From the definition of the F distribution (as a ratio of two scaled χ² variables), it follows that
1/F_{v₁,v₂} ~ F_{v₂,v₁}
So if a random variable X has an F-distribution with (m, n) degrees of freedom, then 1/X has an F-distribution with (n, m) degrees of freedom.
This allows us to easily construct lower-tail events having probability α from upper-tail events having probability α. For the normal and t-distributions we use the symmetry of those distributions to do this. In practice we rarely do this by hand anymore, since most statistical packages give us the required output.
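The reciprocal property can be verified directly. A short Python sketch (my own, assuming scipy; the degrees of freedom are arbitrary):

from scipy.stats import f
m, n, alpha = 5, 10, 0.05
upper = f.ppf(1 - alpha, n, m)      # upper-alpha quantile of F(n, m)
lower = f.ppf(alpha, m, n)          # lower-alpha quantile of F(m, n)
print(lower, 1 / upper)             # identical, since 1/F(m, n) ~ F(n, m)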


Topic 8: Hypothesis Testing



The Problem of Hypothesis Testing


A statistical hypothesis is an assertion or conjecture about the probability distribution of
one or more random variables.
A test of a statistical hypothesis is a rule or procedure for deciding whether to reject that
assertion.
Suppose we have a sample x = (x1 , . . . , xn ) from a density f. We have two hypotheses about
f. On the basis of our sample, one of the hypotheses is accepted and the other is rejected.
The two hypotheses have different status:
the null hypothesis, H0 , is the hypothesis under test. It is the conservative hypothesis,
not to be rejected unless the evidence is clear
the alternative hypothesis H1 specifies the kind of departure from the null hypothesis
that is of interest to us
Two types of tests:
tests of a parametric hypothesis: we partition Ω into two subsets Ω₀ and Ω₁ and let Hᵢ be the hypothesis that θ ∈ Ωᵢ (we consider only these hypotheses).
goodness-of-fit tests: H₀: f = f₀ against H₁: f ≠ f₀
A hypothesis is simple if it completely specifies the probability distribution, else it is composite.
Examples: (i) Income is log-normally distributed with known variance but unknown mean μ: H₀: μ ≥ 8,000 rupees per month, H₁: μ < 8,000. (ii) We would like to know whether parents are more likely to have boys than girls. Denoting the probability of a boy child by the Bernoulli parameter p: H₀: p = 1/2 and H₁: p > 1/2.


Statistical tests
Before deciding whether or not to accept H₀, we observe a random sample. Denote by S the set of all possible outcomes X of the random sample.
A test procedure δ partitions the set S into two subsets, one containing the values that will lead to the acceptance of H₀ and the other containing the values that lead to its rejection.
A statistical test is defined by the critical region R, which is the subset of S for which H₀ is rejected. The complement of this region must therefore contain all outcomes for which H₀ is not rejected.
Most tests are based on values taken by a test statistic (the sample mean, the sample variance, or functions of these). In this case the critical region R is a subset of values of the test statistic for which H₀ is rejected. The critical values of a test statistic are the bounds of R.
When arriving at a decision based on a sample and a test, we may make two types of errors:
H₀ may be rejected when it is true: a Type I error
H₀ may be accepted when it is false: a Type II error
A test is any rule which we use to make a decision, so we have to think of rules that help us make better decisions (in terms of these errors). We will discuss how to evaluate a test and, for some problems, we will characterize optimal tests.


The power function


One way to characterize a test δ is to specify, for each value of θ ∈ Ω, the probability π(θ|δ) that the test procedure will lead to the rejection of H₀. The function π(θ|δ) is called the power function of the test:
π(θ|δ) = Pr(X ∈ R) for θ ∈ Ω
If our critical region is specified in terms of values taken by the statistic T, we have
π(θ|δ) = Pr(T ∈ R) for θ ∈ Ω
Since the power function of a test specifies the probability of rejecting H₀ as a function of the real parameter value, we can evaluate our test by asking how often it leads to mistakes.
What is the power function of an ideal test? Think of examples when such a test exists.
It is common to specify an upper bound α₀ on π(θ|δ) for every value θ ∈ Ω₀. This bound α₀ is the level of significance of the test.
The size α(δ) of a test is the maximum probability, among all values of θ ∈ Ω₀, of making an incorrect decision:
α(δ) = sup_{θ∈Ω₀} π(θ|δ)
Given a level of significance α₀, only tests for which α(δ) ≤ α₀ are admissible.


The power function..example 1


We want to test a hypothesis about the number of defective bolts in a shipment.
We assume that the probability of a bolt being defective is given by the parameter p in a Bernoulli distribution.
The null hypothesis is that defects are at most 2%, i.e. p ≤ .02, and the alternative is that p > .02.
We pick a sample of 200 items and our test takes the form of rejecting the null hypothesis if more than a certain number of defective items are found.
Suppose that we want to find a test for which α₀ = .05. The probability of observing more than x defective items is increasing in p, so for any such rejection rule and all values of p ∈ Ω₀ the probability of rejecting the null hypothesis is highest at p = .02. We can therefore restrict our attention to this value of p when finding a test with significance level α₀ = 0.05.
It turns out that for p = .02, the probability that the number of defective items is greater than 7 is .049. This is therefore the test we choose. (The sizes of tests which reject for more than 4, 5 and 6 defective pieces are .37, .21 and .11 respectively.)
The size of the above test (R = {x : x > 7}) is 0.049. With discrete distributions, the size will often be strictly smaller than α₀.
What does the power function look like?
Note: The stata 12 command you can use to verify this is:

display 1-binomial(200,7,.02).
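The same numbers can be cross-checked in Python (a sketch, assuming scipy), looping over candidate cutoffs:

from scipy.stats import binom
n, p0 = 200, 0.02
for c in (4, 5, 6, 7):
    print(c, 1 - binom.cdf(c, n, p0))   # sizes ~0.37, 0.21, 0.11, 0.049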


The power function graph..example 1


The power function..example 2


Suppose a random sample is taken from a uniform distribution on [0, θ] and the null and alternative hypotheses are as follows:
H₀: 3 ≤ θ ≤ 4    H₁: θ < 3 or θ > 4
Suppose that our test procedure uses the M.L.E. of θ, Yₙ = max(X₁, ..., Xₙ), and rejects the null hypothesis whenever Yₙ lies outside [2.9, 4].
(What might be the rationale for this type of test?)
The power function for this test is given by
π(θ|δ) = Pr(Yₙ < 2.9 | θ) + Pr(Yₙ > 4 | θ)
What is the power of the test if θ < 2.9?
When θ takes values between 2.9 and 4, the probability that any sample value is less than 2.9 is 2.9/θ, and therefore Pr(Yₙ < 2.9 | θ) = (2.9/θ)ⁿ and Pr(Yₙ > 4 | θ) = 0. The power function is therefore π(θ|δ) = (2.9/θ)ⁿ.
When θ > 4, π(θ|δ) = (2.9/θ)ⁿ + 1 − (4/θ)ⁿ.


The power graph..example 2



The power function..example 3


Let X be N(μ, 100). To test H₀: μ = 80 against H₁: μ > 80, let the critical region be defined by R = {(x₁, x₂, ..., x₂₅): x̄ > 83}, where x̄ is the sample mean of a random sample of size n = 25 from this distribution.
1. How is the power function π(μ) defined for this test?
The probability of rejecting the null is
π(μ) = P(X̄ > 83) = P((X̄ − μ)/(σ/√n) > (83 − μ)/2) = 1 − Φ((83 − μ)/2)
2. What is the size of this test?
This is simply the probability of a Type I error: α = 1 − Φ(3/2) = .067 = π(80)
3. What are the values of π(83) and π(86)?
π(80) is given above, π(83) = 0.5, π(86) = 1 − Φ(−3/2) = Φ(3/2) = .933
stata 12: display normal(1.5)
4. Sketch the graph of the power function
stata 12: twoway function 1-normal((83-x)/2), range (70 90)
5. What is the p-value corresponding to x̄ = 83.41? This is the smallest level of significance, α₀, at which the null hypothesis would be rejected based on the observed outcome of X̄.
Solution: The p-value is given by Pr(X̄ ≥ 83.41) = 1 − Φ(3.41/2) = .044.
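A Python sketch (my own, assuming scipy) reproducing these numbers:

from scipy.stats import norm
def power(mu):                            # pi(mu) = Pr(Xbar > 83 | mu), with sigma/sqrt(n) = 2
    return 1 - norm.cdf((83 - mu) / 2)
print(power(80), power(83), power(86))    # ~0.067, 0.5, 0.933
print(1 - norm.cdf((83.41 - 80) / 2))     # p-value ~0.044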


Testing simple hypotheses


Suppose that Ω₀ and Ω₁ contain only a single element each and our null and alternative hypotheses are given by
H₀: θ = θ₀ and H₁: θ = θ₁
Denote by fᵢ(x) the joint density function or p.f. of the observations in our sample under Hᵢ:
fᵢ(x) = f(x₁|θᵢ)f(x₂|θᵢ)...f(xₙ|θᵢ)
Let the probabilities of type I and type II error under a test δ be denoted by α(δ) and β(δ) respectively:
α(δ) = Pr(Rejecting H₀ | θ = θ₀) and β(δ) = Pr(Not Rejecting H₀ | θ = θ₁)
By always accepting H₀, we achieve α(δ) = 0 but then β(δ) = 1. The converse is true if we always reject H₀.
It turns out that we can find an optimal test which minimizes any linear combination of α(δ) and β(δ).


Optimal tests for simple hypotheses


The following result gives us the test procedure that minimizes aα(δ) + bβ(δ) for specified constants a and b:
THEOREM: Let δ* denote a test procedure such that the hypothesis H₀ is accepted if af₀(x) > bf₁(x) and H₁ is accepted if af₀(x) < bf₁(x). Either H₀ or H₁ may be accepted if af₀(x) = bf₁(x). Then for any other test procedure δ,
aα(δ*) + bβ(δ*) ≤ aα(δ) + bβ(δ)
So if we are minimizing the sum of the errors, we would reject whenever the likelihood ratio f₁(x)/f₀(x) > 1.
Proof (for discrete distributions):
aα(δ) + bβ(δ) = a Σ_{x∈R} f₀(x) + b Σ_{x∈Rᶜ} f₁(x) = a Σ_{x∈R} f₀(x) + b[1 − Σ_{x∈R} f₁(x)] = b + Σ_{x∈R} [af₀(x) − bf₁(x)]
The desired function aα(δ) + bβ(δ) will be minimized when the above summation is minimized. This will happen if the critical region includes only those points for which af₀(x) − bf₁(x) < 0. We therefore reject when the likelihood ratio exceeds a/b.


Minimizing β(δ), given α₀
If we fix a level of significance α₀, we want a test procedure that minimizes β(δ), the probability of type II error, subject to α(δ) ≤ α₀.
The Neyman-Pearson Lemma: Let δ* denote a test procedure such that, for some constant k, the hypothesis H₀ is accepted if f₀(x) > kf₁(x) and H₁ is accepted if f₀(x) < kf₁(x). Either H₀ or H₁ may be accepted if f₀(x) = kf₁(x). If δ is any other test procedure such that α(δ) ≤ α(δ*), then it follows that β(δ) ≥ β(δ*). Furthermore, if α(δ) < α(δ*) then β(δ) > β(δ*).
This result implies that if we set a level of significance α₀ = .05, we should try to find a value of k for which α(δ*) = .05. This procedure will then have the minimum possible value of β(δ).
Proof (for discrete distributions): From the previous theorem we know that α(δ*) + kβ(δ*) ≤ α(δ) + kβ(δ). So if α(δ) ≤ α(δ*), it follows that β(δ) ≥ β(δ*).


Neyman Pearson Lemma: example 1
Let X₁, ..., Xₙ be a sample from a normal distribution with unknown mean θ and variance 1.
H₀: θ = 0 and H₁: θ = 1
We will find a test procedure which keeps α = .05 and minimizes β. We have
f₀(x) = (2π)^(−n/2) e^(−½ Σxᵢ²)  and  f₁(x) = (2π)^(−n/2) e^(−½ Σ(xᵢ − 1)²)
f₁(x)/f₀(x) = e^(n(x̄ₙ − ½))
The lemma tells us to use a procedure which rejects H₀ when the likelihood ratio is greater than a constant k. This condition can be re-written in terms of our sample mean: x̄ₙ > ½ + (1/n) log k = k′.
We want to find a value of k′ for which Pr(X̄ₙ > k′ | θ = 0) = .05 or, alternatively, Pr(Z > k′√n) = .05 (why?). This gives us k′√n = 1.645, or k′ = 1.645/√n. Under this procedure, the type II error β(δ*) is given by
β(δ*) = Pr(X̄ₙ < 1.645/√n | θ = 1) = Pr(Z < 1.645 − √n)
For n = 9, we have β(δ*) = 0.0877.
If instead we are interested in choosing δ to minimize 2α(δ) + β(δ), we choose k′ = ½ + (1/n) log 2, so with n = 9 our optimal procedure rejects H₀ when X̄ₙ > 0.577. In this case α(δ′) = 0.0417 (display 1-normal(.577*3)) and β(δ′) = 0.1022 (display normal((.577-1)*3)), and the minimized value of 2α(δ) + β(δ) is 0.186.
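These quantities can be verified with a short Python sketch (my own, assuming scipy), for n = 9:

import numpy as np
from scipy.stats import norm
n = 9
k1 = norm.ppf(0.95) / np.sqrt(n)            # ~0.548: cutoff with alpha = .05
beta1 = norm.cdf((k1 - 1) * np.sqrt(n))     # ~0.0877
k2 = 0.5 + np.log(2) / n                    # ~0.577: cutoff minimizing 2*alpha + beta
alpha2 = 1 - norm.cdf(k2 * np.sqrt(n))      # ~0.0417
beta2 = norm.cdf((k2 - 1) * np.sqrt(n))     # ~0.1022
print(beta1, alpha2, beta2, 2 * alpha2 + beta2)   # last value ~0.186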


Neyman Pearson Lemma: example 2
Let X₁, ..., Xₙ be a sample from a Bernoulli distribution.
H₀: p = 0.2 and H₁: p = 0.4
We will find a test procedure which keeps α ≤ .05 and minimizes β. Let Y = Σ Xᵢ and y its realization. We have
f₀(x) = (0.2)^y (0.8)^(n−y) and f₁(x) = (0.4)^y (0.6)^(n−y)
f₁(x)/f₀(x) = (3/4)ⁿ (8/3)^y
The lemma tells us to use a procedure which rejects H₀ when the likelihood ratio is greater than a constant k. This condition can be re-written in terms of y:
y > [log k + n log(4/3)] / log(8/3) = k′
Now we would like to find k′ such that Pr(Y > k′ | p = 0.2) = .05. We may not be able to do this exactly, given that Y is discrete. If n = 10, we find that Pr(Y > 3 | p = 0.2) = .121 and Pr(Y > 4 | p = 0.2) = .033 (display 1-binomial(10,4,.2)), so we can decide to set one of these probabilities as the value of α(δ) and use the corresponding value of k′ for our test.
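A Python sketch (my own, assuming scipy) of the attainable sizes for this discrete test, along with the corresponding type II errors:

from scipy.stats import binom
n, p0, p1 = 10, 0.2, 0.4
for c in (3, 4):
    alpha = 1 - binom.cdf(c, n, p0)     # Pr(Y > c | p = .2): ~0.121 and ~0.033
    beta = binom.cdf(c, n, p1)          # Pr(Y <= c | p = .4): the type II error
    print(c, round(alpha, 3), round(beta, 3))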


Optimal tests for composite hypotheses


Suppose now that Ω₁ contains multiple elements and we are considering tests at the level of significance α₀, i.e. test procedures for which
π(θ|δ) ≤ α₀ for all θ ∈ Ω₀, or equivalently α(δ) ≤ α₀
If θ₁ and θ₂ are two different values of θ in Ω₁, there may be no single test procedure that maximizes the power function for all values of θ ∈ Ω₁.
When such a test does exist, it is called the uniformly most powerful test, or a UMP test.
Definition: A test procedure δ* is a UMP test at the level of significance α₀ if α(δ*) ≤ α₀ and, for every other test procedure δ with α(δ) ≤ α₀, π(θ|δ) ≤ π(θ|δ*) for every value of θ ∈ Ω₁.
We will now look at a sufficient condition for such a test to exist.


Monotone likelihood ratios


Definition: Let fₙ(x|θ) denote the joint density or joint p.f. of the observations X₁, ..., Xₙ, and let T = r(X) be some function of the vector X. Then fₙ(x|θ) has a monotone likelihood ratio in the statistic T if, for any two values θ₁, θ₂ ∈ Ω with θ₁ < θ₂, the ratio fₙ(x|θ₂)/fₙ(x|θ₁) depends on the vector x only through the function r(x), and this ratio is an increasing function of r(x) over the range of possible values of r(x).
Example 1: Consider a sample of size n from a Bernoulli distribution for which the parameter p is unknown. Let y = Σ xᵢ. Then fₙ(x|p) = p^y (1 − p)^(n−y) and, for p₁ < p₂, the ratio
fₙ(x|p₂)/fₙ(x|p₁) = [p₂(1 − p₁)/(p₁(1 − p₂))]^y [(1 − p₂)/(1 − p₁)]ⁿ
is increasing in y, so fₙ(x|p) has a monotone likelihood ratio in the statistic Y = Σᵢ₌₁ⁿ Xᵢ.
Example 2: Consider a sample of size n from a normal distribution for which the mean μ is unknown and the variance σ² is known. The joint p.d.f. is:
fₙ(x|μ) = (2πσ²)^(−n/2) exp(−(1/(2σ²)) Σᵢ₌₁ⁿ (xᵢ − μ)²)
fₙ(x|μ₂)/fₙ(x|μ₁) = exp{ (n(μ₂ − μ₁)/σ²) [x̄ₙ − ½(μ₂ + μ₁)] }
is increasing in x̄ₙ; therefore fₙ(x|μ) has an MLR in the statistic X̄ₙ.
&
Page 15

%
Rohini Somanathan

Tests of one-sided alternatives
Suppose that θ₀ is an element of the parameter space Ω and consider the following hypotheses:
H₀: θ ≤ θ₀    H₁: θ > θ₀
We have the following result:
Theorem: Suppose that fₙ(x|θ) has a monotone likelihood ratio in the statistic T = r(X), and let c be a constant such that
Pr(T ≥ c | θ = θ₀) = α₀
Then the test procedure which rejects H₀ if T ≥ c is a UMP test of the above hypotheses at the level of significance α₀.
If instead of the above hypotheses we have
H₀: θ ≥ θ₀    H₁: θ < θ₀
then our UMP test rejects H₀ if T ≤ c, where c now satisfies Pr(T ≤ c | θ = θ₀) = α₀.
In the first case the power function will be monotonically increasing in θ, while in the second case it will be decreasing.

Power functions of UMP tests

The following figures show power functions for one-sided alternative hypotheses discussed above
for the case of a sample from a normal distribution with unknown mean:

[Figures: the power functions of the two UMP tests, for H₀: θ ≤ θ₀ vs H₁: θ > θ₀ (increasing in θ) and for H₀: θ ≥ θ₀ vs H₁: θ < θ₀ (decreasing in θ); each power function equals α₀ at θ = θ₀.]

Two-sided alternatives
No UMP test exists in these cases: a test which does very well for an alternative θ₂ > θ₀ may do very badly for an alternative θ₁ < θ₀.
[Figure 9.8 in DeGroot and Schervish: the power functions of four test procedures for a two-sided alternative, sketched around θ₀.]
As the critical values c₁ and c₂ that define a two-sided rejection region are decreased, the power function becomes smaller for θ < θ₀ and larger for θ > θ₀; as they are increased, the reverse happens. For α₀ = 0.05, the two limiting cases are just the one-sided tests, so something between these two extremes seems appropriate for a two-sided alternative.
Selection of the test procedure: for a given sample size n, the constants c₁ and c₂ should be chosen so that the size and shape of the power function are appropriate for the particular problem. In some problems it is important not to reject the null hypothesis unless the data strongly indicate that θ differs greatly from θ₀; in such problems a small value of α₀ should be used. In other problems, not rejecting H₀ when θ is slightly larger than θ₀ may be a more serious error than not rejecting it when θ is slightly less than θ₀, in which case an asymmetric power function is preferable to a symmetric one. In general, the choice of a particular test should be based both on the cost of rejecting H₀ when θ = θ₀ and on the cost, for each possible value of θ, of not rejecting H₀ when θ ≠ θ₀, as well as on the relative likelihoods of different values of θ.