
Slides

Advanced Statistics
Summer Term 2011
(April 5, 2011 – May 17, 2011)
Tuesdays, 14.15 – 15.45 and 16.00 – 17.30
Room: J 498

Prof. Dr. Bernd Wilfling


Westfälische Wilhelms-Universität Münster

Contents

1 Introduction
1.1 Syllabus
1.2 Why Advanced Statistics?

2 Random Variables, Distribution Functions, Expectation, Moment Generating Functions
2.1 Basic Terminology
2.2 Random Variable, Cumulative Distribution Function, Density Function
2.3 Expectation, Moments and Moment Generating Functions
2.4 Special Parametric Families of Univariate Distributions

3 Joint and Conditional Distributions, Stochastic Independence
3.1 Joint and Marginal Distribution
3.2 Conditional Distribution and Stochastic Independence
3.3 Expectation and Joint Moment Generating Functions
3.4 The Multivariate Normal Distribution

4 Distributions of Functions of Random Variables
4.1 Expectations of Functions of Random Variables
4.2 Cumulative-distribution-function Technique
4.3 Moment-generating-function Technique
4.4 General Transformations

5 Methods of Estimation
5.1 Sampling, Estimators, Limit Theorems
5.2 Properties of Estimators
5.3 Methods of Estimation
5.3.1 Least-Squares Estimators
5.3.2 Method-of-moments Estimators
5.3.3 Maximum-Likelihood Estimators

6 Hypothesis Testing
6.1 Basic Terminology
6.2 Classical Testing Procedures
6.2.1 Wald Test
6.2.2 Likelihood-Ratio Test
6.2.3 Lagrange-Multiplier Test

References and Related Reading


In German:
Mosler, K. und F. Schmid (2008). Wahrscheinlichkeitsrechnung und schließende Statistik (3. Auflage). Springer Verlag, Heidelberg.
Schira, J. (2009). Statistische Methoden der VWL und BWL: Theorie und Praxis (3. Auflage). Pearson Studium, München.
Wilfling, B. (2010). Statistik I. Skript zur Vorlesung Deskriptive Statistik im Wintersemester 2010/2011 an der Westfälischen Wilhelms-Universität Münster.
Wilfling, B. (2011). Statistik II. Skript zur Vorlesung Wahrscheinlichkeitsrechnung und schließende Statistik im Sommersemester 2011 an der Westfälischen Wilhelms-Universität Münster.

In English:
Chiang, A. (1984). Fundamental Methods of Mathematical Economics, 3. edition. McGraw-Hill, Singapore.
Feller, W. (1968). An Introduction to Probability Theory and its Applications, Vol. 1. John
Wiley & Sons, New York.
Feller, W. (1971). An Introduction to Probability Theory and its Applications, Vol. 2. John
Wiley & Sons, New York.
Garthwaite, P.H., Jolliffe, I.T. and B. Jones (2002). Statistical Inference, 3. edition. Oxford
University Press, Oxford.
Mood, A.M., Graybill, F.A. and D.C. Boes (1974). Introduction to the Theory of Statistics,
3. edition. McGraw-Hill, Tokyo.


1. Introduction
1.1 Syllabus
Aim of this course:
Consolidation of
probability calculus
statistical inference
(on the basis of previous Bachelor courses)
Preparatory course to Econometrics, Empirical Economics
1

Web-site:
http://www1.wiwi.uni-muenster.de/oeew/
→ Study → Courses summer term 2011 → Advanced Statistics
Style:
Lecture is based on slides
Slides are downloadable as PDF-files from the web-site
References:
See Contents
2

How to get prepared for the exam:


Courses
Class in Advanced Statistics
(Thu, 14.00 – 16.00 and 16.00 – 18.00, J 498,
April 7, 2011 – May 19, 2011)

Auxiliary material to be used in the exam:


Pocket calculator (non-programmable)
All course-slides and solutions to class-exercises
No textbooks
3

Class teacher:
Dipl.-Mathem. Marc Lammerding
(see personal web-site)

1.2 Why Advanced Statistics ?


Contents of the BA course Statistics II:
Random experiments, events, probability
Random variables, distributions
Samples, statistics
Estimators
Tests of hypothesis
Aim of the BA course Statistics II:
Elementary understanding of statistical concepts
(sampling, estimation, hypothesis-testing)

Now:
Course in Advanced Statistics
(probability calculus and mathematical statistics)

Aim of this course:


Better understanding of distribution theory
How can we find good estimators?
How can we construct good tests of hypothesis?

Preliminaries:
BA courses
Mathematics
Statistics I
Statistics II
The slides for the BA courses Statistics I+II are downloadable
from the web-site
(in German)
Later courses based on Advanced Statistics:
All courses belonging to the three modules Econometrics
and Empirical Economics
(Econometrics I+II, Analysis of Time Series, ...)
7

2. Random Variables, Distribution Functions, Expectation, Moment generating Functions


Aim of this section:
Mathematical definition of the concepts
random variable
(cumulative) distribution function
(probability) density function
expectation and moments
moment generating function

Preliminaries:
Repetition of the notions
random experiment
outcome (sample point) and sample space
event
probability

(see Wilfling (2011), Chapter 2)

2.1 Basic Terminology


Definition 2.1: (Random experiment)
A random experiment is an experiment
(a) for which we know in advance all conceivable outcomes that it can take on, but
(b) for which we do not know in advance the actual outcome that it eventually takes on.

Random experiments are performed in controllable trials.


10

Examples of random experiments:


Drawing of lottery numbers
Roulette, tossing a coin, tossing a die
Technical experiments
(testing the hardness of lots from steel production etc.)

In economics:
Random experiments (according to Def. 2.1) are rare
(historical data, trials are not controllable)
Modern discipline: Experimental Economics
11

Definition 2.2: (Sample point, sample space)


Each conceivable outcome $\omega$ of a random experiment is called a sample point. The totality of conceivable outcomes (or sample points) is defined as the sample space and is denoted by $\Omega$.
Examples:
Random experiment of tossing a single die:
$\Omega = \{1, 2, 3, 4, 5, 6\}$
Random experiment of tossing a coin until HEAD shows up:
$\Omega = \{H, TH, TTH, TTTH, TTTTH, \ldots\}$
Random experiment of measuring tomorrow's exchange rate between the euro and the US dollar:
$\Omega = [0, \infty)$
12

Obviously:
The number of elements in $\Omega$ can be either (1) finite, (2) infinite but countable, or (3) infinite and uncountable

Now:
Definition of the notion Event based on mathematical sets
Definition 2.3: (Event)
An event of a random experiment is a subset of the sample space $\Omega$. We say 'the event $A$ occurs' if the random experiment has an outcome $\omega \in A$.
13

Remarks:
Events are typically denoted by $A, B, C, \ldots$ or $A_1, A_2, \ldots$
$A = \Omega$ is called the sure event
(since for every sample point $\omega$ we have $\omega \in A$)
$A = \emptyset$ (empty set) is called the impossible event
(since for every $\omega$ we have $\omega \notin A$)
If the event $A$ is a subset of the event $B$ ($A \subset B$) we say that the occurrence of $A$ implies the occurrence of $B$
(since for every $\omega \in A$ we also have $\omega \in B$)
Obviously:
Events are represented by mathematical sets
$\longrightarrow$ application of set operations to events
14

Combining events (set operations):
Intersection: $\bigcap_{i=1}^{n} A_i$ occurs, if all $A_i$ occur
Union: $\bigcup_{i=1}^{n} A_i$ occurs, if at least one $A_i$ occurs
Set difference: $C = A \setminus B$ occurs, if $A$ occurs and $B$ does not occur
Complement: $C = \Omega \setminus A \equiv \bar{A}$ occurs, if $A$ does not occur
The events $A$ and $B$ are called disjoint, if $A \cap B = \emptyset$
(both events cannot occur simultaneously)
15

Now:
For any arbitrary event A we are looking for a number P (A)
which represents the probability that A occurs
Formally:
$P : A \longrightarrow P(A)$
($P(\cdot)$ is a set function)

Question:
Which properties should the probability function (set function) $P(\cdot)$ have?
16

Definition 2.4: (Kolmogorov-axioms)


The following axioms for $P(\cdot)$ are called Kolmogorov-axioms:
Nonnegativity: $P(A) \geq 0$ for every $A$
Standardization: $P(\Omega) = 1$
Additivity: For two disjoint events $A$ and $B$ (i.e. for $A \cap B = \emptyset$) $P(\cdot)$ satisfies
$P(A \cup B) = P(A) + P(B)$
17

Easy to check:
The three axioms imply several additional properties and rules
when computing with probabilities
Theorem 2.5: (General properties)
The Kolmogorov-axioms imply the following properties:
Probability of the complementary event:
$P(\bar{A}) = 1 - P(A)$
Probability of the impossible event:
$P(\emptyset) = 0$
Range of probabilities:
$0 \leq P(A) \leq 1$
18

Next:
General rules when computing with probabilities
Theorem 2.6: (Calculation rules)
The Kolmogorov-axioms imply the following calculation rules
(A, B, C are arbitrary events):

Addition rule (I):
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$
(probability that $A$ or $B$ occurs)
19

Addition rule (II):
$P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(B \cap C) - P(A \cap C) + P(A \cap B \cap C)$
(probability that $A$ or $B$ or $C$ occurs)

Probability of the difference event:
$P(A \setminus B) = P(A \cap \bar{B}) = P(A) - P(A \cap B)$
20

Notice:
If $B$ implies $A$ (i.e. if $B \subset A$) it follows that
$P(A \setminus B) = P(A) - P(B)$
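
The calculation rules of Theorem 2.6 can be checked on a small finite sample space. The following is a minimal sketch, assuming the single-die sample space with equally likely outcomes; the events `A` and `B` and the helper `prob` are hypothetical choices made for illustration only.

```python
from fractions import Fraction

# Sample space of tossing a single die (assumption: all outcomes equally likely).
OMEGA = frozenset(range(1, 7))

def prob(event):
    """P(A) = |A| / |Omega| under the equal-likelihood assumption."""
    return Fraction(len(event & OMEGA), len(OMEGA))

A = frozenset({2, 4, 6})   # hypothetical event: even number
B = frozenset({4, 5, 6})   # hypothetical event: at least 4

# Addition rule (I): P(A u B) = P(A) + P(B) - P(A n B)
union_lhs = prob(A | B)
union_rhs = prob(A) + prob(B) - prob(A & B)

# Difference event: P(A \ B) = P(A) - P(A n B)
diff_lhs = prob(A - B)
diff_rhs = prob(A) - prob(A & B)

print(union_lhs, union_rhs)
print(diff_lhs, diff_rhs)
```

Exact fractions are used so that both sides of each rule can be compared without rounding error.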

21

2.2 Random Variable, Cumulative Distribution Function, Density Function
Frequently:
Instead of being interested in a concrete sample point $\omega \in \Omega$ itself, we are rather interested in a number depending on $\omega$
Examples:
Profit in euro when playing roulette
Profit earned when selling a stock
Monthly salary of a randomly selected person
Intuitive meaning of a random variable:
Rule translating the abstract $\omega$ into a number
22

Definition 2.7: (Random variable [rv])


A random variable, denoted by $X$ or $X(\cdot)$, is a mathematical function of the form
$X : \Omega \longrightarrow \mathbb{R}, \quad \omega \longmapsto X(\omega).$
Remarks:
A random variable relates each sample point $\omega \in \Omega$ to a real number
Intuitively:
A random variable X characterizes a number that is a priori
unknown
23

When the random experiment is carried out, the random variable $X$ takes on the value $x$
x is called realization or value of the random variable X after
the random experiment has been carried out
Random variables are denoted by capital letters, realizations
are denoted by small letters
The rv X describes the situation ex ante, i.e. before carrying
out the random experiment
The realization x describes the situation ex post, i.e. after
having carried out the random experiment

24

Example 1:
Consider the experiment of tossing a single coin (H = Head, T = Tail). Let the rv $X$ represent the 'Number of Heads'
We have
$\Omega = \{H, T\}$
The random variable $X$ can take on two values:
$X(T) = 0, \quad X(H) = 1$

25

Example 2:
Consider the experiment of tossing a coin three times. Let $X$ represent the 'Number of Heads'
We have
$\Omega = \{\underbrace{(H, H, H)}_{=\omega_1}, \underbrace{(H, H, T)}_{=\omega_2}, \ldots, \underbrace{(T, T, T)}_{=\omega_8}\}$
The rv $X$ is defined by
$X(\omega) =$ number of H in $\omega$
Obviously:
$X$ relates distinct $\omega$'s to the same number, e.g.
$X((H, H, T)) = X((H, T, H)) = X((T, H, H)) = 2$
26

Example 3:
Consider the experiment of randomly selecting 1 person from a group of people. Let $X$ represent the person's status of employment
We have
$\Omega = \{\underbrace{\text{employed}}_{=\omega_1}, \underbrace{\text{unemployed}}_{=\omega_2}\}$
$X$ can be defined as
$X(\omega_1) = 1, \quad X(\omega_2) = 0$
27

Example 4:
Consider the experiment of measuring tomorrow's price of a specific stock. Let $X$ denote the stock price
We have $\Omega = [0, \infty)$, i.e. $X$ is defined by
$X(\omega) = \omega$

Conclusion:
The random variable X can take on distinct values with specific probabilities

28

Question:
How can we determine these specific probabilities and how
can we calculate with them?
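Simplifying notation: ($a, b, x \in \mathbb{R}$)
$P(X = a) \equiv P(\{\omega \mid X(\omega) = a\})$
$P(a < X < b) \equiv P(\{\omega \mid a < X(\omega) < b\})$
$P(X \leq x) \equiv P(\{\omega \mid X(\omega) \leq x\})$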
Solution:
We can compute these probabilities via the so-called cumulative distribution function of X
29

Intuitively:
The cumulative distribution function of the random variable
X characterizes the probabilities according to which the possible values x are distributed along the real line
(the so-called distribution of X)

Definition 2.8: (Cumulative distribution function [cdf])


The cumulative distribution function of a random variable $X$, denoted by $F_X$, is defined to be the function
$F_X : \mathbb{R} \longrightarrow [0, 1], \quad x \longmapsto F_X(x) = P(\{\omega \mid X(\omega) \leq x\}) = P(X \leq x).$
30

Example:
Consider the experiment of tossing a coin three times. Let $X$ represent the 'Number of Heads'
We have
$\Omega = \{\underbrace{(H, H, H)}_{=\omega_1}, \underbrace{(H, H, T)}_{=\omega_2}, \ldots, \underbrace{(T, T, T)}_{=\omega_8}\}$
For the probabilities of $X$ we find
$P(X = 0) = P(\{(T, T, T)\}) = 1/8$
$P(X = 1) = P(\{(T, T, H), (T, H, T), (H, T, T)\}) = 3/8$
$P(X = 2) = P(\{(T, H, H), (H, T, H), (H, H, T)\}) = 3/8$
$P(X = 3) = P(\{(H, H, H)\}) = 1/8$
31

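Thus, the cdf is given by
$F_X(x) = \begin{cases} 0.000 & \text{for } x < 0 \\ 0.125 & \text{for } 0 \leq x < 1 \\ 0.5 & \text{for } 1 \leq x < 2 \\ 0.875 & \text{for } 2 \leq x < 3 \\ 1 & \text{for } x \geq 3 \end{cases}$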
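
The probabilities and the cdf of this example can be reproduced by enumerating the sample space. The following is a minimal sketch, assuming a fair coin (all 8 sample points equally likely); the names `outcomes`, `number_of_heads`, `pmf` and `cdf` are hypothetical and chosen for illustration only.

```python
from fractions import Fraction
from itertools import product

# All 8 sample points of tossing a coin three times (assumed equally likely).
outcomes = list(product("HT", repeat=3))

def number_of_heads(omega):
    """The random variable X: maps a sample point to its number of heads."""
    return omega.count("H")

# Probabilities P(X = x), collected by summing over the sample points.
pmf = {}
for omega in outcomes:
    x = number_of_heads(omega)
    pmf[x] = pmf.get(x, Fraction(0)) + Fraction(1, len(outcomes))

def cdf(x):
    """F_X(x) = P(X <= x): sum of P(X = k) over all values k <= x."""
    return sum((p for k, p in pmf.items() if k <= x), Fraction(0))

for x in (-1, 0, 1, 2, 3):
    print(x, cdf(x))
```

The function `cdf` may be evaluated at any real number, which makes the step-function shape of $F_X$ visible (for instance `cdf(1.5)` equals `cdf(1)`).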

Remarks:
In practice, it will be sufficient to only know the cdf FX of X
In many situations, it will appear impossible to exactly specify the sample space $\Omega$ or the explicit function $X : \Omega \longrightarrow \mathbb{R}$. However, often we may derive the cdf $F_X$ from other factual considerations
32

General properties of FX :
$F_X(x)$ is a monotone, nondecreasing function
We have
$\lim_{x \to -\infty} F_X(x) = 0 \quad \text{and} \quad \lim_{x \to +\infty} F_X(x) = 1$
$F_X$ is continuous from the right; that is,
$\lim_{z \to x,\, z > x} F_X(z) = F_X(x)$

33

Summary:
Via the cdf FX (x) we can answer the following question:
What is the probability that the random variable X takes
on a value that does not exceed x?
Now:
Consider the question:
What is the value which $X$ does not exceed with a prespecified probability $p \in (0, 1)$?
$\longrightarrow$ quantile function of $X$
34

Definition 2.9: (Quantile function)


Consider the rv $X$ with cdf $F_X$. For every $p \in (0, 1)$ the quantile function of $X$, denoted by $Q_X(p)$, is defined as
$Q_X : (0, 1) \longrightarrow \mathbb{R}, \quad p \longmapsto Q_X(p) = \min\{x \mid F_X(x) \geq p\}.$
The value of the quantile function $x_p = Q_X(p)$ is called the $p$th quantile of $X$.
Remarks:
The $p$th quantile $x_p$ of $X$ is defined as the smallest number $x$ satisfying $F_X(x) \geq p$
In other words: The $p$th quantile $x_p$ is the smallest value that $X$ does not exceed with a probability of at least $p$
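
For a random variable that takes only finitely many values, Definition 2.9 can be evaluated by accumulating probabilities in ascending order. The following is a minimal sketch, assuming the 'Number of Heads' distribution from the three-coin-toss example; the function name `quantile` and its arguments are hypothetical.

```python
from fractions import Fraction

# Probabilities P(X = x) of the number of heads in three fair coin tosses.
pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}

def quantile(p, pmf):
    """Q_X(p) = min{x | F_X(x) >= p} for a finitely supported X."""
    if not 0 < p < 1:
        raise ValueError("p must lie in the open interval (0, 1)")
    cumulative = Fraction(0)
    for x in sorted(pmf):          # walk through the values in ascending order
        cumulative += pmf[x]       # cumulative now equals F_X(x)
        if cumulative >= p:
            return x
    return max(pmf)                # only reached if the probabilities sum to < p

median = quantile(Fraction(1, 2), pmf)
upper_quartile = quantile(Fraction(3, 4), pmf)
print(median, upper_quartile)
```

Because $F_X(1) = 0.5$, the median is 1, whereas any $p$ slightly above 0.5 already yields the quantile 2.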
35

Special quantiles:
Median: p = 0.5
Quartiles: p = 0.25, 0.5, 0.75
Quintiles: p = 0.2, 0.4, 0.6, 0.8
Deciles: p = 0.1, 0.2, . . . , 0.9
Now:
Consideration of two distinct classes of random variables
(discrete vs. continuous rvs)

36

Reason:
Each class requires a specific mathematical treatment
Mathematical tools for analyzing discrete rvs:
Finite and infinite sums
Mathematical tools for analyzing continuous rvs:
Differential- and integral calculus
Remarks:
Some rvs are partly discrete and partly continuous
Such rvs are not treated in this course
37

Definition 2.10: (Discrete random variable)


A random variable $X$ will be defined to be discrete if it can take on either
(a) only a finite number of values $x_1, x_2, \ldots, x_J$ or
(b) an infinite, but countable number of values $x_1, x_2, \ldots$
each with strictly positive probability; that is, if for all $j = 1, \ldots, J, \ldots$ we have
$P(X = x_j) > 0 \quad \text{and} \quad \sum_{j=1}^{J, \ldots} P(X = x_j) = 1.$
38

Examples of discrete variables:


Countable variables (X = Number of . . .)
Encoded qualitative variables
Further definitions:

Definition 2.11: (Support of a discrete random variable)


The support of a discrete rv X, denoted by supp(X), is defined
to be the totality of all values that X can take on with a strictly
positive probability:
$\text{supp}(X) = \{x_1, \ldots, x_J\} \quad \text{or} \quad \text{supp}(X) = \{x_1, x_2, \ldots\}.$


39

Definition 2.12: (Discrete density function)


For a discrete random variable X the function
fX (x) = P (X = x)
is defined to be the discrete density function of X.

Remarks:
The discrete density function $f_X(\cdot)$ takes on strictly positive values only for elements of the support of $X$. For realizations of $X$ that do not belong to the support of $X$, i.e. for $x \notin \text{supp}(X)$, we have $f_X(x) = 0$:
$f_X(x) = \begin{cases} P(X = x_j) > 0 & \text{for } x = x_j \in \text{supp}(X) \\ 0 & \text{for } x \notin \text{supp}(X) \end{cases}$
40

The discrete density function $f_X(\cdot)$ has the following properties:
$f_X(x) \geq 0$ for all $x$
$\sum_{x_j \in \text{supp}(X)} f_X(x_j) = 1$
For any arbitrary set $A \subset \mathbb{R}$ the probability of the event $\{\omega \mid X(\omega) \in A\} = \{X \in A\}$ is given by
$P(X \in A) = \sum_{x_j \in A} f_X(x_j)$
41

Example:
Consider the experiment of tossing a coin three times and
let X = Number of Heads
(see slide 31)
Obviously: X is discrete and has the support
supp(X) = {0, 1, 2, 3}
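The discrete density function of $X$ is given by
$f_X(x) = \begin{cases} P(X = 0) = 0.125 & \text{for } x = 0 \\ P(X = 1) = 0.375 & \text{for } x = 1 \\ P(X = 2) = 0.375 & \text{for } x = 2 \\ P(X = 3) = 0.125 & \text{for } x = 3 \\ 0 & \text{for } x \notin \text{supp}(X) \end{cases}$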
42

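The cdf of $X$ is given by (see slide 32)
$F_X(x) = \begin{cases} 0.000 & \text{for } x < 0 \\ 0.125 & \text{for } 0 \leq x < 1 \\ 0.5 & \text{for } 1 \leq x < 2 \\ 0.875 & \text{for } 2 \leq x < 3 \\ 1 & \text{for } x \geq 3 \end{cases}$

Obviously:
The cdf $F_X(\cdot)$ can be obtained from $f_X(\cdot)$:
$F_X(x) = P(X \leq x) = \sum_{\{x_j \in \text{supp}(X) \mid x_j \leq x\}} \underbrace{f_X(x_j)}_{= P(X = x_j)}$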

43

Conclusion:
The cdf of a discrete random variable $X$ is a step function with steps at the points $x_j \in \text{supp}(X)$. The height of the step at $x_j$ is given by
$F_X(x_j) - \lim_{x \to x_j,\, x < x_j} F_X(x) = P(X = x_j) = f_X(x_j),$
i.e. the step height is equal to the value of the discrete density function at $x_j$
(relationship between cdf and discrete density function)

44

Now:
Definition of continuous random variables
Intuitively:
In contrast to discrete random variables, continuous random
variables can take on an uncountable number of values
(e.g. every real number on a given interval)

In fact:
Definition of a continuous random variable is quite technical
45

Definition 2.13: (Continuous rv, probability density function)


A random variable $X$ is called continuous if there exists a function $f_X : \mathbb{R} \longrightarrow [0, \infty)$ such that the cdf of $X$ can be written as
$F_X(x) = \int_{-\infty}^{x} f_X(t)\, dt \quad \text{for all } x \in \mathbb{R}.$
The function $f_X(x)$ is called the probability density function (pdf) of $X$.

Remarks:
The cdf $F_X(\cdot)$ of a continuous random variable $X$ is a primitive function of the pdf $f_X(\cdot)$
$F_X(x) = P(X \leq x)$ is equal to the area under the pdf $f_X(\cdot)$ between the limits $-\infty$ and $x$
46

Cdf $F_X(\cdot)$ and pdf $f_X(\cdot)$
[Figure: graph of the pdf $f_X(t)$; the marked area under the curve equals $P(X \leq x) = F_X(x)$]

47

Properties of the pdf fX ():


1. A pdf $f_X(\cdot)$ cannot take on negative values, i.e.
$f_X(x) \geq 0 \quad \text{for all } x \in \mathbb{R}$
2. The area under a pdf is equal to one, i.e.
$\int_{-\infty}^{+\infty} f_X(x)\, dx = 1$
3. If the cdf $F_X(x)$ is differentiable we have
$f_X(x) = F_X'(x) \equiv dF_X(x)/dx$

48

Example: (Uniform distribution over [0, 10])


Consider the random variable $X$ with pdf
$f_X(x) = \begin{cases} 0 & \text{for } x \notin [0, 10] \\ 0.1 & \text{for } x \in [0, 10] \end{cases}$
Derivation of the cdf $F_X$:
For $x < 0$ we have
$F_X(x) = \int_{-\infty}^{x} f_X(t)\, dt = \int_{-\infty}^{x} 0\, dt = 0$

49

For $x \in [0, 10]$ we have
$F_X(x) = \int_{-\infty}^{x} f_X(t)\, dt = \underbrace{\int_{-\infty}^{0} 0\, dt}_{=0} + \int_{0}^{x} 0.1\, dt$
$= [0.1 \cdot t]_0^x$
$= 0.1 \cdot x - 0.1 \cdot 0$
$= 0.1 \cdot x$
50

For $x > 10$ we have
$F_X(x) = \int_{-\infty}^{x} f_X(t)\, dt = \underbrace{\int_{-\infty}^{0} 0\, dt}_{=0} + \underbrace{\int_{0}^{10} 0.1\, dt}_{=1} + \underbrace{\int_{10}^{x} 0\, dt}_{=0}$
$= 1$

51

Now:
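Interval probabilities, i.e. (for $a, b \in \mathbb{R}$, $a < b$)
$P(X \in (a, b]) = P(a < X \leq b)$
We have
$P(a < X \leq b) = P(\{\omega \mid a < X(\omega) \leq b\})$
$= P(\{\omega \mid X(\omega) > a\} \cap \{\omega \mid X(\omega) \leq b\})$
$= 1 - P(\overline{\{\omega \mid X(\omega) > a\} \cap \{\omega \mid X(\omega) \leq b\}})$
$= 1 - P(\overline{\{\omega \mid X(\omega) > a\}} \cup \overline{\{\omega \mid X(\omega) \leq b\}})$
$= 1 - P(\{\omega \mid X(\omega) \leq a\} \cup \{\omega \mid X(\omega) > b\})$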
52

$= 1 - [P(X \leq a) + P(X > b)]$
$= 1 - [F_X(a) + (1 - P(X \leq b))]$
$= 1 - [F_X(a) + 1 - F_X(b)]$
$= F_X(b) - F_X(a)$
$= \int_{-\infty}^{b} f_X(t)\, dt - \int_{-\infty}^{a} f_X(t)\, dt$
$= \int_{a}^{b} f_X(t)\, dt$
53

Interval probability between the limits $a$ and $b$
[Figure: graph of the pdf $f_X(x)$; the marked area under the curve equals $P(a < X \leq b)$]

54

Important result for a continuous rv X:


$P(X = a) = 0 \quad \text{for all } a \in \mathbb{R}$

Proof:
$P(X = a) = \lim_{b \to a} P(a < X \leq b) = \lim_{b \to a} \int_{a}^{b} f_X(x)\, dx = \int_{a}^{a} f_X(x)\, dx = 0$

Conclusion:
The probability that a continuous random variable X takes
on a single explicit value is always zero
55

Probability of a single value
[Figure: graph of the pdf $f_X(x)$ with the interval limits $b_1$, $b_2$, $b_3$ marked on the horizontal axis]

56

Notice:
This does not imply that the event {X = a} cannot occur
Consequence:
Since for continuous random variables we always have $P(X = a) = 0$ for all $a \in \mathbb{R}$, it follows that
$P(a < X < b) = P(a \leq X < b) = P(a \leq X \leq b) = P(a < X \leq b) = F_X(b) - F_X(a)$
(when computing interval probabilities for continuous rvs, it does not matter if the interval is open or closed)
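
The relation between interval probabilities, the cdf and the pdf can be illustrated with the uniform distribution over [0, 10] from the example above. The following is a minimal sketch; the function names and the midpoint-rule integration (with a hypothetical `steps` parameter) are illustrative assumptions, not part of the course material.

```python
def uniform_pdf(x, lower=0.0, upper=10.0):
    """pdf of the uniform distribution over [lower, upper]."""
    return 1.0 / (upper - lower) if lower <= x <= upper else 0.0

def uniform_cdf(x, lower=0.0, upper=10.0):
    """cdf as derived piecewise in the example: 0, 0.1 * x, or 1."""
    if x < lower:
        return 0.0
    if x > upper:
        return 1.0
    return (x - lower) / (upper - lower)

def interval_probability(a, b):
    """P(a < X <= b) = F_X(b) - F_X(a)."""
    return uniform_cdf(b) - uniform_cdf(a)

def integrate_pdf(a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of the pdf over [a, b]."""
    width = (b - a) / steps
    return sum(uniform_pdf(a + (i + 0.5) * width) for i in range(steps)) * width

p_from_cdf = interval_probability(2.0, 5.0)
p_from_pdf = integrate_pdf(2.0, 5.0)
p_single_point = integrate_pdf(4.0, 4.0)   # integral over a degenerate interval
print(p_from_cdf, p_from_pdf, p_single_point)
```

Both routes give a probability of about 0.3 for the interval from 2 to 5, while the integral over the single point 4 is zero, in line with $P(X = a) = 0$.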
57

2.3 Expectation, Moments and Moment Generating Functions


Repetition:
Expectation of an arbitrary random variable X
Definition 2.14: (Expectation)
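The expectation of the random variable $X$, denoted by $E(X)$, is defined by
$E(X) = \begin{cases} \sum_{\{x_j \in \text{supp}(X)\}} x_j \cdot P(X = x_j) & \text{if } X \text{ is discrete} \\ \int_{-\infty}^{+\infty} x \cdot f_X(x)\, dx & \text{if } X \text{ is continuous} \end{cases}$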
58

Remarks:
The expectation of the random variable X is approximately
equal to the sum of all realizations each weighted by the
probability of its occurrence
Instead of $E(X)$ we often write $\mu_X$
There exist random variables that do not have an expectation
(see class)

59

Example 1: (Discrete random variable)


Consider the experiment of tossing two dice. Let X represent the absolute difference of the two dice. What is the
expectation of X?
The support of X is given by
supp(X) = {0, 1, 2, 3, 4, 5}

60

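The discrete density function of $X$ is given by
$f_X(x) = \begin{cases} P(X = 0) = 6/36 & \text{for } x = 0 \\ P(X = 1) = 10/36 & \text{for } x = 1 \\ P(X = 2) = 8/36 & \text{for } x = 2 \\ P(X = 3) = 6/36 & \text{for } x = 3 \\ P(X = 4) = 4/36 & \text{for } x = 4 \\ P(X = 5) = 2/36 & \text{for } x = 5 \\ 0 & \text{for } x \notin \text{supp}(X) \end{cases}$
This gives
$E(X) = 0 \cdot \frac{6}{36} + 1 \cdot \frac{10}{36} + 2 \cdot \frac{8}{36} + 3 \cdot \frac{6}{36} + 4 \cdot \frac{4}{36} + 5 \cdot \frac{2}{36} = \frac{70}{36} = 1.9444$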
61

Example 2: (Continuous random variable)


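Consider the continuous random variable $X$ with pdf
$f_X(x) = \begin{cases} \frac{x}{4} & \text{for } 1 \leq x \leq 3 \\ 0 & \text{elsewhere} \end{cases}$
To calculate the expectation we split up the integral:
$E(X) = \int_{-\infty}^{+\infty} x \cdot f_X(x)\, dx = \int_{-\infty}^{1} 0\, dx + \int_{1}^{3} x \cdot \frac{x}{4}\, dx + \int_{3}^{+\infty} 0\, dx$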

62

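$= \int_{1}^{3} \frac{x^2}{4}\, dx = \frac{1}{4} \cdot \left[\frac{1}{3} x^3\right]_1^3 = \frac{1}{4} \cdot \left(\frac{27}{3} - \frac{1}{3}\right) = \frac{26}{12} = 2.1667$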

Frequently:
Random variable X plus discrete density or pdf fX is known
We have to find the expectation of the transformed random
variable
Y = g(X)
63

Theorem 2.15: (Expectation of a transformed rv)


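Let $X$ be a random variable with discrete density or pdf $f_X(\cdot)$. For any Baire-function $g : \mathbb{R} \longrightarrow \mathbb{R}$ the expectation of the transformed random variable $Y = g(X)$ is given by
$E(Y) = E[g(X)] = \begin{cases} \sum_{\{x_j \in \text{supp}(X)\}} g(x_j) \cdot P(X = x_j) & \text{if } X \text{ is discrete} \\ \int_{-\infty}^{+\infty} g(x) \cdot f_X(x)\, dx & \text{if } X \text{ is continuous} \end{cases}$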

64

Remarks:
All functions considered in this course are Baire-functions
For the special case g(x) = x (the identity function) Theorem
2.15 coincides with Definition 2.14

Next:
Some important rules for calculating expected values

65

Theorem 2.16: (Properties of expectations)


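Let $X$ be an arbitrary random variable (discrete or continuous), $c, c_1, c_2 \in \mathbb{R}$ constants and $g, g_1, g_2 : \mathbb{R} \longrightarrow \mathbb{R}$ functions. Then:
1. $E(c) = c$.
2. $E[c \cdot g(X)] = c \cdot E[g(X)]$.
3. $E[c_1 \cdot g_1(X) + c_2 \cdot g_2(X)] = c_1 \cdot E[g_1(X)] + c_2 \cdot E[g_2(X)]$.
4. If $g_1(x) \leq g_2(x)$ for all $x \in \mathbb{R}$ then $E[g_1(X)] \leq E[g_2(X)]$.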
Proof: Class
66

Now:
Consider the random variable $X$ (discrete or continuous) and the explicit function $g(x) = [x - E(X)]^2$
$\longrightarrow$ variance and standard deviation of $X$
Definition 2.17: (Variance, standard deviation)
For any random variable $X$ the variance, denoted by $\text{Var}(X)$, is defined as the expected quadratic distance between $X$ and its expectation $E(X)$; that is
$\text{Var}(X) = E[(X - E(X))^2].$
The standard deviation of $X$, denoted by $\text{SD}(X)$, is defined to be the (positive) square root of the variance:
$\text{SD}(X) = +\sqrt{\text{Var}(X)}.$
67

Remark:
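Setting $g(X) = [X - E(X)]^2$ in Theorem 2.15 (on slide 64) yields the following explicit formulas for discrete and continuous random variables:
$\text{Var}(X) = E[g(X)] = \begin{cases} \sum_{\{x_j \in \text{supp}(X)\}} [x_j - E(X)]^2 \cdot P(X = x_j) & \text{if } X \text{ is discrete} \\ \int_{-\infty}^{+\infty} [x - E(X)]^2 \cdot f_X(x)\, dx & \text{if } X \text{ is continuous} \end{cases}$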

68

Example: (Discrete random variable)


Consider again the experiment of tossing two dice with X
representing the absolute difference of the two dice (see Example 1 on slide 60). The variance is given by
$\text{Var}(X) = (0 - 70/36)^2 \cdot 6/36 + (1 - 70/36)^2 \cdot 10/36 + (2 - 70/36)^2 \cdot 8/36 + (3 - 70/36)^2 \cdot 6/36 + (4 - 70/36)^2 \cdot 4/36 + (5 - 70/36)^2 \cdot 2/36$
$= 2.05247$
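
The expectation and the variance of this example can be recomputed by enumerating all 36 outcomes of the two dice. The following is a minimal sketch, assuming fair dice; the variable names are hypothetical. It also evaluates $E(X^2) - [E(X)]^2$ as an alternative route to the variance.

```python
from fractions import Fraction
from itertools import product

# Distribution of X = absolute difference of two fair dice (36 equally likely pairs).
pmf = {}
for d1, d2 in product(range(1, 7), repeat=2):
    x = abs(d1 - d2)
    pmf[x] = pmf.get(x, Fraction(0)) + Fraction(1, 36)

# E(X): realizations weighted by their probabilities.
mean = sum(x * p for x, p in pmf.items())

# Var(X) = E[(X - E(X))^2], evaluated directly from the definition.
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())

# Alternative route: E(X^2) - [E(X)]^2.
second_moment = sum(x ** 2 * p for x, p in pmf.items())
variance_alt = second_moment - mean ** 2

print(float(mean), float(variance), float(variance_alt))
```

Both routes give the same exact fraction, whose decimal value agrees with the 2.05247 reported above.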
Notice:
The variance is an expectation by definition
$\longrightarrow$ rules for expectations are applicable

69

Theorem 2.18: (Rules for variances)


Let X be an arbitrary random variable (discrete or continuous)
and $a, b \in \mathbb{R}$ real constants; then
1. $\text{Var}(X) = E(X^2) - [E(X)]^2$.
2. $\text{Var}(a + b \cdot X) = b^2 \cdot \text{Var}(X)$.
Proof: Class
Next:
Two important inequalities dealing with expectations and
transformed random variables
70

Theorem 2.19: (Chebyshev inequality)


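Let $X$ be an arbitrary random variable and $g : \mathbb{R} \longrightarrow \mathbb{R}^+$ a nonnegative function. Then, for every $k > 0$ we have
$P[g(X) \geq k] \leq \frac{E[g(X)]}{k}.$
Special case:
Consider
$g(x) = [x - E(X)]^2 \quad \text{and} \quad k = r^2 \cdot \text{Var}(X) \quad (r > 0)$
Theorem 2.19 implies
$P\{[X - E(X)]^2 \geq r^2 \cdot \text{Var}(X)\} \leq \frac{\text{Var}(X)}{r^2 \cdot \text{Var}(X)} = \frac{1}{r^2}$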
71

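Now:
$P\{[X - E(X)]^2 \geq r^2 \cdot \text{Var}(X)\} = P\{|X - E(X)| \geq r \cdot \text{SD}(X)\}$
$= 1 - P\{|X - E(X)| < r \cdot \text{SD}(X)\}$
It follows that
$P\{|X - E(X)| < r \cdot \text{SD}(X)\} \geq 1 - \frac{1}{r^2}$
(specific Chebyshev inequality)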

72

Remarks:
The specific Chebyshev inequality provides a minimal probability of the event that any arbitrary random variable $X$ takes on a value from the following interval:
$[E(X) - r \cdot \text{SD}(X), E(X) + r \cdot \text{SD}(X)]$
For example, for $r = 3$ we have
$P\{|X - E(X)| < 3 \cdot \text{SD}(X)\} \geq 1 - \frac{1}{3^2} = \frac{8}{9}$
which is equivalent to
$P\{E(X) - 3 \cdot \text{SD}(X) < X < E(X) + 3 \cdot \text{SD}(X)\} \geq 0.8889$
or
$P\{X \in (E(X) - 3 \cdot \text{SD}(X), E(X) + 3 \cdot \text{SD}(X))\} \geq 0.8889$
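
The specific Chebyshev inequality can be compared with exact probabilities for a concrete distribution. The following is a minimal sketch, assuming the absolute difference of two fair dice from the earlier example; the helper names `prob_within` and `chebyshev_bound` are hypothetical.

```python
import math
from fractions import Fraction
from itertools import product

# Distribution of X = absolute difference of two fair dice.
pmf = {}
for d1, d2 in product(range(1, 7), repeat=2):
    x = abs(d1 - d2)
    pmf[x] = pmf.get(x, Fraction(0)) + Fraction(1, 36)

mean = sum(x * p for x, p in pmf.items())
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())
sd = math.sqrt(variance)

def prob_within(r):
    """Exact P{|X - E(X)| < r * SD(X)} for the dice example."""
    return float(sum((p for x, p in pmf.items() if abs(x - mean) < r * sd), Fraction(0)))

def chebyshev_bound(r):
    """Lower bound 1 - 1/r^2 from the specific Chebyshev inequality."""
    return 1.0 - 1.0 / r ** 2

for r in (1.5, 2.0, 3.0):
    print(r, prob_within(r), chebyshev_bound(r))
```

In this example the exact probabilities lie well above the bound, which illustrates that the inequality holds for every distribution but is usually not sharp.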
73

Theorem 2.20: (Jensen inequality)


Let $X$ be a random variable with mean $E(X)$ and let $g : \mathbb{R} \longrightarrow \mathbb{R}$ be a convex function, i.e. for all $x$ we have $g''(x) \geq 0$; then
$E[g(X)] \geq g(E[X]).$
Remarks:
If the function $g$ is concave (i.e. if $g''(x) \leq 0$ for all $x$) then Jensen's inequality states that $E[g(X)] \leq g(E[X])$
Notice that in general we have
$E[g(X)] \neq g(E[X])$
74

Example:
Consider the random variable $X$ and the function $g(x) = x^2$
We have $g''(x) = 2 \geq 0$ for all $x$, i.e. $g$ is convex
It follows from Jensen's inequality that
$\underbrace{E[g(X)]}_{= E(X^2)} \geq \underbrace{g(E[X])}_{= [E(X)]^2}$
i.e.
$E(X^2) - [E(X)]^2 \geq 0$
This implies
$\text{Var}(X) = E(X^2) - [E(X)]^2 \geq 0$
(the variance of an arbitrary rv cannot be negative)
75

Now:
Consider the random variable $X$ with expectation $E(X) = \mu_X$, the integer number $n \in \mathbb{N}$ and the functions
$g_1(x) = x^n$
$g_2(x) = [x - \mu_X]^n$

Definition 2.21: (Moments, central moments)
(a) The $n$-th moment of $X$, denoted by $\mu_n'$, is defined as
$\mu_n' \equiv E[g_1(X)] = E(X^n).$
(b) The $n$-th central moment of $X$ about $\mu_X$, denoted by $\mu_n$, is defined as
$\mu_n \equiv E[g_2(X)] = E[(X - \mu_X)^n].$
76

Relations:
$\mu_1' = E(X) = \mu_X$
(the 1st moment coincides with $E(X)$)
$\mu_1 = E[X - \mu_X] = E(X) - \mu_X = 0$
(the 1st central moment is always equal to 0)
$\mu_2 = E[(X - \mu_X)^2] = \text{Var}(X)$
(the 2nd central moment coincides with $\text{Var}(X)$)

77

Remarks:
The first four moments of a random variable X are important
measures of the probability distribution
(expectation, variance, skewness, kurtosis)
The moments of a random variable X play an important role
in theoretical and applied statistics
In some cases, when all moments are known, the cdf of a
random variable X can be determined

78

Question:
Can we find a function that gives us a representation of all
moments of a random variable X?

Definition 2.22: (Moment generating function)


Let $X$ be a random variable with discrete density or pdf $f_X(\cdot)$. The expected value of $e^{tX}$ is defined to be the moment generating function of $X$ if the expected value exists for every value of $t$ in some interval $-h < t < h$, $h > 0$. That is, the moment generating function of $X$, denoted by $m_X(t)$, is defined as
$m_X(t) = E\left[e^{tX}\right].$
79

Remarks:
The moment generating function mX (t) is a function in t
There are rvs X for which mX (t) does not exist
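If $m_X(t)$ exists it can be calculated as
$m_X(t) = E\left[e^{tX}\right] = \begin{cases} \sum_{\{x_j \in \text{supp}(X)\}} e^{t x_j} \cdot P(X = x_j) & \text{if } X \text{ is discrete} \\ \int_{-\infty}^{+\infty} e^{tx} \cdot f_X(x)\, dx & \text{if } X \text{ is continuous} \end{cases}$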

80

Question:
Why is mX (t) called the moment generating function?
Answer:
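Consider the $n$th derivative of $m_X(t)$ with respect to $t$:
$\frac{d^n}{dt^n} m_X(t) = \begin{cases} \sum_{\{x_j \in \text{supp}(X)\}} (x_j)^n \cdot e^{t x_j} \cdot P(X = x_j) & \text{for discrete } X \\ \int_{-\infty}^{+\infty} x^n \cdot e^{tx} \cdot f_X(x)\, dx & \text{for continuous } X \end{cases}$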

81

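Now, evaluate the $n$th derivative at $t = 0$:
$\frac{d^n}{dt^n} m_X(0) = \begin{cases} \sum_{\{x_j \in \text{supp}(X)\}} (x_j)^n \cdot P(X = x_j) & \text{for discrete } X \\ \int_{-\infty}^{+\infty} x^n \cdot f_X(x)\, dx & \text{for continuous } X \end{cases}$
$= E(X^n) = \mu_n'$
(see Definition 2.21(a) on slide 76)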

82

Example:
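Let $X$ be a continuous random variable with pdf
$f_X(x) = \begin{cases} 0 & \text{for } x < 0 \\ \lambda e^{-\lambda x} & \text{for } x \geq 0 \end{cases}$
(exponential distribution with parameter $\lambda > 0$)
We have
$m_X(t) = E\left[e^{tX}\right] = \int_{-\infty}^{+\infty} e^{tx} \cdot f_X(x)\, dx = \int_{0}^{+\infty} \lambda e^{(t - \lambda) x}\, dx = \frac{\lambda}{\lambda - t}$
for $t < \lambda$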

83

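It follows that
$m_X'(t) = \frac{\lambda}{(\lambda - t)^2} \quad \text{and} \quad m_X''(t) = \frac{2\lambda}{(\lambda - t)^3}$
and thus
$m_X'(0) = E(X) = \frac{1}{\lambda} \quad \text{and} \quad m_X''(0) = E(X^2) = \frac{2}{\lambda^2}$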
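
The moments obtained from the moment generating function can be checked numerically. The following is a minimal sketch, assuming $\lambda = 2$ as an arbitrary illustrative value and using central finite differences (with hypothetical step sizes `h`) in place of the analytical derivatives.

```python
def mgf_exponential(t, lam):
    """m_X(t) = lam / (lam - t), which exists only for t < lam."""
    if t >= lam:
        raise ValueError("the moment generating function exists only for t < lam")
    return lam / (lam - t)

def first_derivative_at_zero(m, h=1e-5):
    """Central finite-difference approximation of m'(0)."""
    return (m(h) - m(-h)) / (2.0 * h)

def second_derivative_at_zero(m, h=1e-4):
    """Central finite-difference approximation of m''(0)."""
    return (m(h) - 2.0 * m(0.0) + m(-h)) / h ** 2

lam = 2.0                                   # illustrative parameter value (assumption)
m = lambda t: mgf_exponential(t, lam)

first_moment = first_derivative_at_zero(m)      # approximates E(X) = 1 / lam
second_moment = second_derivative_at_zero(m)    # approximates E(X^2) = 2 / lam^2
variance = second_moment - first_moment ** 2    # approximates 1 / lam^2

print(first_moment, second_moment, variance)
```

Up to the discretization error of the finite differences, the results agree with $1/\lambda$ and $2/\lambda^2$, and the implied variance is $1/\lambda^2$.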

Now:
Important result on moment generating functions

84

Theorem 2.23: (Identification property)


Let $X$ and $Y$ be two random variables with densities $f_X(\cdot)$ and $f_Y(\cdot)$, respectively. Suppose that $m_X(t)$ and $m_Y(t)$ both exist and that $m_X(t) = m_Y(t)$ for all $t$ in the interval $-h < t < h$ for some $h > 0$. Then the two cdfs $F_X(\cdot)$ and $F_Y(\cdot)$ are equal; that is $F_X(x) = F_Y(x)$ for all $x$.
Remarks:
Theorem 2.23 states that there is a unique cdf $F_X(x)$ for a given moment generating function $m_X(t)$
$\longrightarrow$ if we can find $m_X(t)$ for $X$ then, at least theoretically, we can find the distribution of $X$
We will make use of this property in Section 4
85

Example:
Suppose that a random variable $X$ has the moment generating function
$m_X(t) = \frac{1}{1 - t} \quad \text{for } -1 < t < 1$
Then the pdf of $X$ is given by
$f_X(x) = \begin{cases} 0 & \text{for } x < 0 \\ e^{-x} & \text{for } x \geq 0 \end{cases}$
(exponential distribution with parameter $\lambda = 1$)

86

2.4 Special Parametric Families of Univariate Distributions


Up to now:
General mathematical properties of arbitrary distributions
Distinction: discrete vs continuous distributions
Consideration of
the cdf FX (x)
the discrete density or the pdf fX (x)
expectations of the form E[g(X)]
the moment generating function mX (t)
87

Central result:
The distribution of a random variable X is (essentially) determined by fX (x) or FX (x)
FX (x) can be determined by fX (x)
(cf. slide 46)
fX (x) can be determined by FX (x)
(cf. slide 48)

Question:
How many different distributions are known to exist?
88

Answer:
Infinitely many
But:
In practice, there are some important parametric families of distributions that provide good models for representing real-world random phenomena
These families of distributions are described in detail in all textbooks on mathematical statistics
(see e.g. Mosler & Schmid (2008), Mood et al. (1974))

89

Important families of discrete distributions


Bernoulli distribution
Binomial distribution
Geometric distribution
Poisson distribution
Important families of continuous distributions
Uniform or rectangular distribution
Exponential distribution
Normal distribution

90

Remark:
The most important family of distributions of all is the normal distribution

Definition 2.24: (Normal distribution)


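A continuous random variable $X$ is defined to be normally distributed with parameters $\mu \in \mathbb{R}$ and $\sigma^2 > 0$, denoted by $X \sim N(\mu, \sigma^2)$, if its pdf is given by
$f_X(x) = \frac{1}{\sqrt{2\pi} \cdot \sigma} \cdot e^{-\frac{1}{2} \left(\frac{x - \mu}{\sigma}\right)^2}, \quad x \in \mathbb{R}.$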

91

PDFs of the normal distribution
[Figure: graphs of the pdf $f_X(x)$ for $N(0,1)$, $N(5,1)$, $N(5,3)$ and $N(5,5)$]

92

Remarks:
The special normal distribution $N(0, 1)$ is called standard normal distribution, the pdf of which is denoted by $\varphi(x)$
The properties as well as calculation rules for normally distributed random variables are important pre-conditions for this course
(see Wilfling (2011), Section 3.4)

93

3. Joint and Conditional Distributions, Stochastic Independence
Aim of this section:
Multidimensional random variables (random vectors)
(joint and marginal distributions)
Stochastic (in)dependence and conditional distribution
Multivariate normal distribution
(definition, properties)
Literature:
Mood, Graybill, Boes (1974), Chapter IV, pp. 129-174
Wilfling (2011), Chapter 4
94

3.1 Joint and Marginal Distribution


Now:
Consider several random variables simultaneously
Applications:
Several economic applications
Statistical inference

95

Definition 3.1: (Random vector)


Let $X_1, \ldots, X_n$ be a set of $n$ random variables each representing the same random experiment, i.e.
$X_i : \Omega \longrightarrow \mathbb{R} \quad \text{for } i = 1, \ldots, n.$
Then $X = (X_1, \ldots, X_n)'$ is called an $n$-dimensional random variable or an $n$-dimensional random vector.
Remark:
In the literature random vectors are often denoted by
$X = (X_1, \ldots, X_n)$ or more simply by $X_1, \ldots, X_n$

96

For $n = 2$ it is common practice to write
$X = (X, Y)'$ or $(X, Y)$ or $X, Y$
Realizations are denoted by small letters:
$x = (x_1, \ldots, x_n)' \in \mathbb{R}^n$ or $x = (x, y)' \in \mathbb{R}^2$

Now:
Characterization of the probability distribution of the random
vector X

97

Definition 3.2: (Joint cumulative distribution function)


Let $X = (X_1, \ldots, X_n)'$ be an $n$-dimensional random vector. The function
$F_{X_1,\ldots,X_n} : \mathbb{R}^n \longrightarrow [0, 1]$
defined by
$F_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = P(X_1 \leq x_1, X_2 \leq x_2, \ldots, X_n \leq x_n)$
is called the joint cumulative distribution function of $X$.
Remark:
Definition 3.2 applies to discrete as well as to continuous
random variables X1, . . . , Xn
98

Some properties of the bivariate cdf ($n = 2$):
$F_{X,Y}(x, y)$ is monotone increasing in $x$ and $y$
$\lim_{x \to -\infty} F_{X,Y}(x, y) = 0$
$\lim_{y \to -\infty} F_{X,Y}(x, y) = 0$
$\lim_{x \to +\infty,\, y \to +\infty} F_{X,Y}(x, y) = 1$

Remark:
Analogous properties hold for the $n$-dimensional cdf $F_{X_1,\ldots,X_n}(x_1, \ldots, x_n)$

99

Now:
Joint discrete versus joint continuous random vectors
Definition 3.3: (Joint discrete random vector)
The random vector $X = (X_1, \ldots, X_n)'$ is defined to be a joint discrete random vector if it can assume only a finite (or a countably infinite) number of realizations $x = (x_1, \ldots, x_n)'$ such that
$P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) > 0$
and
$\sum P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = 1,$
where the summation is over all possible realizations of $X$.


100

Definition 3.4: (Joint continuous random vector)


The random vector $\mathbf{X} = (X_1, \ldots, X_n)'$ is defined to be a joint continuous random vector if and only if there exists a nonnegative function $f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)$ such that
\[ F_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = \int_{-\infty}^{x_n} \cdots \int_{-\infty}^{x_1} f_{X_1,\ldots,X_n}(u_1, \ldots, u_n)\, du_1 \ldots du_n \]
for all $(x_1, \ldots, x_n)$. The function $f_{X_1,\ldots,X_n}$ is defined to be a joint probability density function of $\mathbf{X}$.

Example:
Consider $\mathbf{X} = (X, Y)'$ with joint pdf
\[ f_{X,Y}(x, y) = \begin{cases} x + y, & \text{for } (x, y) \in [0, 1] \times [0, 1] \\ 0, & \text{elsewise} \end{cases} \]
101

[Figure: surface plot of the joint pdf $f_{X,Y}(x, y)$ over $x, y \in [0, 1]$]

102

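The joint cdf can be obtained by
\[ F_{X,Y}(x, y) = \int_{-\infty}^{y} \int_{-\infty}^{x} f_{X,Y}(u, v)\, du\, dv = \int_{0}^{y} \int_{0}^{x} (u + v)\, du\, dv = \ldots \]
\[ = \begin{cases} 0.5(x^2 y + x y^2), & \text{for } (x, y) \in [0, 1] \times [0, 1] \\ 0.5(x^2 + x), & \text{for } (x, y) \in [0, 1] \times [1, \infty) \\ 0.5(y^2 + y), & \text{for } (x, y) \in [1, \infty) \times [0, 1] \\ 1, & \text{for } (x, y) \in [1, \infty) \times [1, \infty) \end{cases} \]
(Proof: Class)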
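The piecewise cdf of the density $x + y$ can be checked numerically. The following is a minimal sketch, not part of the original slides: the function names and the midpoint-rule integrator are illustrative assumptions, and the closed form is evaluated by clipping both arguments to $[0, 1]$.

```python
def joint_pdf(x, y):
    """Joint pdf f(x, y) = x + y on the unit square, 0 elsewhere."""
    inside = 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0
    return x + y if inside else 0.0


def joint_cdf_numeric(x, y, steps=200):
    """Midpoint-rule approximation of the integral of the pdf over [0, x] x [0, y]."""
    if x <= 0.0 or y <= 0.0:
        return 0.0
    hx, hy = x / steps, y / steps
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) * hx
        for j in range(steps):
            total += joint_pdf(u, (j + 0.5) * hy)
    return total * hx * hy


def joint_cdf_closed(x, y):
    """Closed form 0.5 * (x^2 y + x y^2); clipping to [0, 1] covers the other cases."""
    x = min(max(x, 0.0), 1.0)
    y = min(max(y, 0.0), 1.0)
    return 0.5 * (x * x * y + x * y * y)


# Compare both versions at one point per region (hypothetical test points).
for point in [(0.5, 0.5), (0.3, 2.0), (1.5, 2.0)]:
    print(point, joint_cdf_numeric(*point), joint_cdf_closed(*point))
```

The numerical values agree with the closed form up to the discretization error of the grid.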

103

Remarks:
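If $\mathbf{X} = (X_1, \ldots, X_n)'$ is a joint continuous random vector, then
\[ \frac{\partial^n F_{X_1,\ldots,X_n}(x_1, \ldots, x_n)}{\partial x_1 \cdots \partial x_n} = f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) \]
The volume under the joint pdf represents probabilities (with lower bounds $a_i^u$ and upper bounds $a_i^o$):
\[ P(a_1^u < X_1 \le a_1^o, \ldots, a_n^u < X_n \le a_n^o) = \int_{a_n^u}^{a_n^o} \cdots \int_{a_1^u}^{a_1^o} f_{X_1,\ldots,X_n}(u_1, \ldots, u_n)\, du_1 \ldots du_n \]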

104

In this course:
Emphasis on joint continuous random vectors
Analogous results for joint discrete random vectors
(see Mood, Graybill, Boes (1974), Chapter IV)

Now:
Determination of the distribution of a single random variable $X_i$ from the joint distribution of the random vector $(X_1, \ldots, X_n)'$
→ marginal distribution

105

Definition 3.5: (Marginal distribution)


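Let $\mathbf{X} = (X_1, \ldots, X_n)'$ be a continuous random vector with joint cdf $F_{X_1,\ldots,X_n}$ and joint pdf $f_{X_1,\ldots,X_n}$. Then
\[ F_{X_1}(x_1) = F_{X_1,\ldots,X_n}(x_1, +\infty, +\infty, \ldots, +\infty, +\infty) \]
\[ F_{X_2}(x_2) = F_{X_1,\ldots,X_n}(+\infty, x_2, +\infty, \ldots, +\infty, +\infty) \]
\[ \vdots \]
\[ F_{X_n}(x_n) = F_{X_1,\ldots,X_n}(+\infty, +\infty, +\infty, \ldots, +\infty, x_n) \]
are called marginal cdfs while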

106

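\[ f_{X_1}(x_1) = \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} f_{X_1,\ldots,X_n}(x_1, x_2, \ldots, x_n)\, dx_2 \ldots dx_n \]
\[ f_{X_2}(x_2) = \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} f_{X_1,\ldots,X_n}(x_1, x_2, \ldots, x_n)\, dx_1\, dx_3 \ldots dx_n \]
\[ \vdots \]
\[ f_{X_n}(x_n) = \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} f_{X_1,\ldots,X_n}(x_1, x_2, \ldots, x_n)\, dx_1\, dx_2 \ldots dx_{n-1} \]
are called marginal pdfs of the one-dimensional (univariate) random variables $X_1, \ldots, X_n$.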
107

Example:
Consider the bivariate pdf
\[ f_{X,Y}(x, y) = \begin{cases} 40(x - 0.5)^2 y^3 (3 - 2x - y), & \text{for } (x, y) \in [0, 1] \times [0, 1] \\ 0, & \text{elsewise} \end{cases} \]

108

[Figure: surface plot of the bivariate pdf $f_{X,Y}(x, y)$ over $x, y \in [0, 1]$]

109

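The marginal pdf of X obtains as
\[
\begin{aligned}
f_X(x) &= \int_0^1 40(x - 0.5)^2 y^3 (3 - 2x - y)\, dy \\
&= 40(x - 0.5)^2 \int_0^1 (3y^3 - 2xy^3 - y^4)\, dy \\
&= 40(x - 0.5)^2 \left[ \frac{3}{4} y^4 - \frac{2x}{4} y^4 - \frac{1}{5} y^5 \right]_0^1 \\
&= 40(x - 0.5)^2 \left( \frac{3}{4} - \frac{2x}{4} - \frac{1}{5} \right) \\
&= -20x^3 + 42x^2 - 27x + 5.5
\end{aligned}
\]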


110

[Figure: plot of the marginal pdf $f_X(x)$ for $x \in [0, 1]$]

111

The marginal pdf of Y obtains as
\[
\begin{aligned}
f_Y(y) &= \int_0^1 40(x - 0.5)^2 y^3 (3 - 2x - y)\, dx \\
&= 40 y^3 \int_0^1 (x - 0.5)^2 (3 - 2x - y)\, dx \\
&= -\frac{10}{3} y^3 (y - 2)
\end{aligned}
\]
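Both marginal densities can be cross-checked by integrating the joint pdf numerically. This is a minimal sketch under stated assumptions: the helper `integrate` (a midpoint rule) and all function names are illustrative and not part of the slides.

```python
def joint_pdf(x, y):
    """Bivariate pdf 40 (x - 0.5)^2 y^3 (3 - 2x - y) on the unit square."""
    if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0:
        return 40.0 * (x - 0.5) ** 2 * y ** 3 * (3.0 - 2.0 * x - y)
    return 0.0


def integrate(func, a, b, steps=2000):
    """Midpoint rule on [a, b]."""
    h = (b - a) / steps
    return sum(func(a + (i + 0.5) * h) for i in range(steps)) * h


def marginal_x_numeric(x):
    # Integrate out y.
    return integrate(lambda y: joint_pdf(x, y), 0.0, 1.0)


def marginal_y_numeric(y):
    # Integrate out x.
    return integrate(lambda x: joint_pdf(x, y), 0.0, 1.0)


def marginal_x_closed(x):
    return -20.0 * x ** 3 + 42.0 * x ** 2 - 27.0 * x + 5.5


def marginal_y_closed(y):
    return -(10.0 / 3.0) * y ** 3 * (y - 2.0)


for v in (0.1, 0.5, 0.9):
    print(v, marginal_x_numeric(v), marginal_x_closed(v),
          marginal_y_numeric(v), marginal_y_closed(v))
```

As a further sanity check, each closed-form marginal should integrate to one over $[0, 1]$.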

112

[Figure: plot of the marginal pdf $f_Y(y)$ for $y \in [0, 1]$]

113

Remarks:
When considering the marginal instead of the joint distributions, we are faced with an information loss
(the joint distribution uniquely determines all marginal distributions, but the converse does not hold in general)
Besides the respective univariate marginal distributions, there
are also multivariate distributions which can be obtained from
the joint distribution of X = (X1, . . . , Xn)0

114

Example:
For $n = 5$ consider $\mathbf{X} = (X_1, \ldots, X_5)'$ with joint pdf $f_{X_1,\ldots,X_5}$
Then the marginal pdf of $\mathbf{Z} = (X_1, X_3, X_5)'$ obtains as
\[ f_{X_1,X_3,X_5}(x_1, x_3, x_5) = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} f_{X_1,\ldots,X_5}(x_1, x_2, x_3, x_4, x_5)\, dx_2\, dx_4 \]
(integrate out the irrelevant components)

115

3.2 Conditional Distribution and Stochastic Independence


Now:
Distribution of a random variable X under the condition that
another random variable Y has already taken on the realization y
(conditional distribution of X given Y = y)

116

Definition 3.6: (Conditional distribution)


Let $\mathbf{X} = (X, Y)'$ be a bivariate continuous random vector with joint pdf $f_{X,Y}(x, y)$. The conditional density of X given $Y = y$ is defined to be
\[ f_{X|Y=y}(x) = \frac{f_{X,Y}(x, y)}{f_Y(y)}. \]
Analogously, the conditional density of Y given $X = x$ is defined to be
\[ f_{Y|X=x}(y) = \frac{f_{X,Y}(x, y)}{f_X(x)}. \]

117

Remark:
Conditional densities of random vectors are defined analogously, e.g.
\[ f_{X_1,X_2,X_4|X_3=x_3,X_5=x_5}(x_1, x_2, x_4) = \frac{f_{X_1,X_2,X_3,X_4,X_5}(x_1, x_2, x_3, x_4, x_5)}{f_{X_3,X_5}(x_3, x_5)} \]

118

Example:
Consider the bivariate pdf
\[ f_{X,Y}(x, y) = \begin{cases} 40(x - 0.5)^2 y^3 (3 - 2x - y), & \text{for } (x, y) \in [0, 1] \times [0, 1] \\ 0, & \text{elsewise} \end{cases} \]
with marginal pdf
\[ f_Y(y) = -\frac{10}{3} y^3 (y - 2) \]
(cf. Slides 108-112)

119

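It follows that
\[
\begin{aligned}
f_{X|Y=y}(x) &= \frac{f_{X,Y}(x, y)}{f_Y(y)} \\
&= \frac{40(x - 0.5)^2 y^3 (3 - 2x - y)}{-\frac{10}{3} y^3 (y - 2)} \\
&= \frac{12(x - 0.5)^2 (3 - 2x - y)}{2 - y}
\end{aligned}
\]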

120

[Figure: conditional pdf $f_{X|Y=0.01}(x)$ of X given $Y = 0.01$, plotted for $x \in [0, 1]$; vertical axis: conditional density]

121

[Figure: conditional pdf $f_{X|Y=0.95}(x)$ of X given $Y = 0.95$, plotted for $x \in [0, 1]$; vertical axis: conditional density]

122

Now:
Combine the concepts of joint distribution and conditional distribution to define the notion of stochastic independence
(for two random variables first)

Definition 3.7: (Stochastic Independence [I])


Let $(X, Y)'$ be a bivariate continuous random vector with joint pdf $f_{X,Y}(x, y)$. X and Y are defined to be stochastically independent if and only if
\[ f_{X,Y}(x, y) = f_X(x) \cdot f_Y(y) \quad \text{for all } x, y \in \mathbb{R}. \]
123

Remarks:
Alternatively, stochastic independence can be defined via the cdfs:
X and Y are stochastically independent if and only if
\[ F_{X,Y}(x, y) = F_X(x) \cdot F_Y(y) \quad \text{for all } x, y \in \mathbb{R}. \]
If X and Y are independent, we have
\[ f_{X|Y=y}(x) = \frac{f_{X,Y}(x, y)}{f_Y(y)} = \frac{f_X(x) \cdot f_Y(y)}{f_Y(y)} = f_X(x) \]
\[ f_{Y|X=x}(y) = \frac{f_{X,Y}(x, y)}{f_X(x)} = \frac{f_X(x) \cdot f_Y(y)}{f_X(x)} = f_Y(y) \]
If X and Y are independent and g and h are two continuous functions, then $g(X)$ and $h(Y)$ are also independent
124

Now:
Extension to n random variables
Definition 3.8: (Stochastic independence [II])
Let $(X_1, \ldots, X_n)'$ be a continuous random vector with joint pdf $f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)$ and joint cdf $F_{X_1,\ldots,X_n}(x_1, \ldots, x_n)$. $X_1, \ldots, X_n$ are defined to be stochastically independent if and only if for all $(x_1, \ldots, x_n)' \in \mathbb{R}^n$
\[ f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = f_{X_1}(x_1) \cdot \ldots \cdot f_{X_n}(x_n) \]
or
\[ F_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = F_{X_1}(x_1) \cdot \ldots \cdot F_{X_n}(x_n). \]
125

Remarks:
For discrete random vectors we define: $X_1, \ldots, X_n$ are stochastically independent if and only if for all $(x_1, \ldots, x_n)' \in \mathbb{R}^n$
\[ P(X_1 = x_1, \ldots, X_n = x_n) = P(X_1 = x_1) \cdot \ldots \cdot P(X_n = x_n) \]
or
\[ F_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = F_{X_1}(x_1) \cdot \ldots \cdot F_{X_n}(x_n) \]
In the case of independence, the joint distribution results from the marginal distributions
If $X_1, \ldots, X_n$ are stochastically independent and $g_1, \ldots, g_n$ are continuous functions, then $Y_1 = g_1(X_1), \ldots, Y_n = g_n(X_n)$ are also stochastically independent
126

3.3 Expectation and Joint Moment Generating


Functions
Now:
Definition of the expectation of a function
\[ g : \mathbb{R}^n \to \mathbb{R}, \quad (x_1, \ldots, x_n) \mapsto g(x_1, \ldots, x_n) \]
of a continuous random vector $\mathbf{X} = (X_1, \ldots, X_n)'$

127

Definition 3.9: (Expectation of a function)


Let $(X_1, \ldots, X_n)'$ be a continuous random vector with joint pdf $f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)$ and $g : \mathbb{R}^n \to \mathbb{R}$ a real-valued continuous function. The expectation of the function g of the random vector is defined to be
\[ E[g(X_1, \ldots, X_n)] = \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} g(x_1, \ldots, x_n) \cdot f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)\, dx_1 \ldots dx_n. \]

128

Remarks:
For a discrete random vector $(X_1, \ldots, X_n)'$ the analogous definition is
\[ E[g(X_1, \ldots, X_n)] = \sum g(x_1, \ldots, x_n) \cdot P(X_1 = x_1, \ldots, X_n = x_n), \]
where the summation is over all realizations of the vector
Definition 3.9 includes the expectation of a univariate random variable X:
Set $n = 1$ and $g(x) = x$
\[ E(X_1) \equiv E(X) = \int_{-\infty}^{+\infty} x f_X(x)\, dx \]
Definition 3.9 includes the variance of X:
Set $n = 1$ and $g(x) = [x - E(X)]^2$
\[ \operatorname{Var}(X_1) \equiv \operatorname{Var}(X) = \int_{-\infty}^{+\infty} [x - E(X)]^2 f_X(x)\, dx \]
129

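Definition 3.9 includes the covariance of two variables:
Set $n = 2$ and $g(x_1, x_2) = [x_1 - E(X_1)] \cdot [x_2 - E(X_2)]$
\[ \operatorname{Cov}(X_1, X_2) = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} [x_1 - E(X_1)][x_2 - E(X_2)]\, f_{X_1,X_2}(x_1, x_2)\, dx_1\, dx_2 \]
Via the covariance we define the correlation coefficient:
\[ \operatorname{Corr}(X_1, X_2) = \frac{\operatorname{Cov}(X_1, X_2)}{\sqrt{\operatorname{Var}(X_1)} \sqrt{\operatorname{Var}(X_2)}} \]
General properties of expected values, variances, covariances and the correlation coefficient
→ Class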
130

Now:
Expectation and variances of random vectors
Definition 3.10: (Expected vector, covariance matrix)
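Let $\mathbf{X} = (X_1, \ldots, X_n)'$ be a random vector. The expected vector of $\mathbf{X}$ is defined to be
\[ E(\mathbf{X}) = \begin{pmatrix} E(X_1) \\ \vdots \\ E(X_n) \end{pmatrix}. \]
The covariance matrix of $\mathbf{X}$ is defined to be
\[ \operatorname{Cov}(\mathbf{X}) = \begin{pmatrix} \operatorname{Var}(X_1) & \operatorname{Cov}(X_1, X_2) & \ldots & \operatorname{Cov}(X_1, X_n) \\ \operatorname{Cov}(X_2, X_1) & \operatorname{Var}(X_2) & \ldots & \operatorname{Cov}(X_2, X_n) \\ \vdots & \vdots & \ddots & \vdots \\ \operatorname{Cov}(X_n, X_1) & \operatorname{Cov}(X_n, X_2) & \ldots & \operatorname{Var}(X_n) \end{pmatrix} \]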

131

Remark:
Obviously, the covariance matrix is symmetric by definition
Now:
Expected vectors and covariance matrices under linear transformations of random vectors

Let
$\mathbf{X} = (X_1, \ldots, X_n)'$ be an n-dimensional random vector
$\mathbf{A}$ be an $(m \times n)$ matrix of real numbers
$\mathbf{b}$ be an $(m \times 1)$ column vector of real numbers
132

Obviously:
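$\mathbf{Y} = \mathbf{A}\mathbf{X} + \mathbf{b}$ is an $(m \times 1)$ random vector:
\[ \mathbf{Y} = \begin{pmatrix} a_{11} & a_{12} & \ldots & a_{1n} \\ a_{21} & a_{22} & \ldots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \ldots & a_{mn} \end{pmatrix} \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{pmatrix} + \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{pmatrix} = \begin{pmatrix} a_{11}X_1 + a_{12}X_2 + \ldots + a_{1n}X_n + b_1 \\ a_{21}X_1 + a_{22}X_2 + \ldots + a_{2n}X_n + b_2 \\ \vdots \\ a_{m1}X_1 + a_{m2}X_2 + \ldots + a_{mn}X_n + b_m \end{pmatrix} \]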

133

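The expected vector of $\mathbf{Y}$ is given by
\[ E(\mathbf{Y}) = \begin{pmatrix} a_{11}E(X_1) + a_{12}E(X_2) + \ldots + a_{1n}E(X_n) + b_1 \\ a_{21}E(X_1) + a_{22}E(X_2) + \ldots + a_{2n}E(X_n) + b_2 \\ \vdots \\ a_{m1}E(X_1) + a_{m2}E(X_2) + \ldots + a_{mn}E(X_n) + b_m \end{pmatrix} = \mathbf{A} E(\mathbf{X}) + \mathbf{b} \]
The covariance matrix of $\mathbf{Y}$ (an $m \times m$ matrix) is given by
\[ \operatorname{Cov}(\mathbf{Y}) = \begin{pmatrix} \operatorname{Var}(Y_1) & \operatorname{Cov}(Y_1, Y_2) & \ldots & \operatorname{Cov}(Y_1, Y_m) \\ \operatorname{Cov}(Y_2, Y_1) & \operatorname{Var}(Y_2) & \ldots & \operatorname{Cov}(Y_2, Y_m) \\ \vdots & \vdots & \ddots & \vdots \\ \operatorname{Cov}(Y_m, Y_1) & \operatorname{Cov}(Y_m, Y_2) & \ldots & \operatorname{Var}(Y_m) \end{pmatrix} = \mathbf{A} \operatorname{Cov}(\mathbf{X}) \mathbf{A}' \]
(Proof: Class)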
134

Remark:
Cf. the analogous results for univariate variables:
\[ E(a \cdot X + b) = a \cdot E(X) + b \]
\[ \operatorname{Var}(a \cdot X + b) = a^2 \cdot \operatorname{Var}(X) \]
Up to now:
Expected values for unconditional distributions
Now:
Expected values for conditional distributions
(cf. Definition 3.6, Slide 117)
135

Definition 3.11: (Conditional expected value of a function)


Let $(X, Y)'$ be a continuous random vector with joint pdf $f_{X,Y}(x, y)$ and let $g : \mathbb{R}^2 \to \mathbb{R}$ be a real-valued function. The conditional expected value of the function g given $X = x$ is defined to be
\[ E[g(X, Y)|X = x] = \int_{-\infty}^{+\infty} g(x, y) \cdot f_{Y|X=x}(y)\, dy. \]

136

Remarks:
An analogous definition applies to a discrete random vector
(X, Y )0
Definition 3.11 naturally extends to higher-dimensional distributions
For g(x, y) = y we obtain the special case E[g(X, Y )|X = x] =
E(Y |X = x)
Note that E[g(X, Y )|X = x] is a function of x

137

Example:
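Consider the joint pdf
\[ f_{X,Y}(x, y) = \begin{cases} x + y, & \text{for } (x, y) \in [0, 1] \times [0, 1] \\ 0, & \text{elsewise} \end{cases} \]
The conditional distribution of Y given $X = x$ is given by
\[ f_{Y|X=x}(y) = \begin{cases} \dfrac{x + y}{x + 0.5}, & \text{for } (x, y) \in [0, 1] \times [0, 1] \\ 0, & \text{elsewise} \end{cases} \]
For $g(x, y) = y$ the conditional expectation is given as
\[ E(Y|X = x) = \int_0^1 y \cdot \frac{x + y}{x + 0.5}\, dy = \frac{1}{x + 0.5} \cdot \left( \frac{x}{2} + \frac{1}{3} \right) \]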

138

Remarks:
Consider the function g(x, y) = g(y)
(i.e. g does not depend on x)
Denote h(x) = E[g(Y )|X = x]
We calculate the unconditional expectation of the transformed variable h(X)
We have

139

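\[
\begin{aligned}
E\{E[g(Y)|X = x]\} = E[h(X)] &= \int_{-\infty}^{+\infty} h(x) \cdot f_X(x)\, dx \\
&= \int_{-\infty}^{+\infty} E[g(Y)|X = x] \cdot f_X(x)\, dx \\
&= \int_{-\infty}^{+\infty} \left[ \int_{-\infty}^{+\infty} g(y) \cdot f_{Y|X=x}(y)\, dy \right] f_X(x)\, dx \\
&= \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} g(y) \cdot f_{Y|X=x}(y) \cdot f_X(x)\, dy\, dx \\
&= \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} g(y) \cdot f_{X,Y}(x, y)\, dy\, dx \\
&= E[g(Y)]
\end{aligned}
\]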
140

Theorem 3.12:
Let (X, Y )0 be an arbitrary discrete or continuous random vector.
Then

E[g(Y )] = E {E[g(Y )|X = x]}


and, in particular,

E[Y ] = E {E[Y |X = x]} .
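Theorem 3.12 can be illustrated with the joint pdf $x + y$ on the unit square. The sketch below is illustrative only: it assumes the marginals $f_X(x) = x + 0.5$ and $f_Y(y) = y + 0.5$ (obtained by integrating out the other variable), and the midpoint-rule helper `integrate` is a hypothetical name, not something from the slides.

```python
def integrate(func, a, b, steps=1000):
    """Midpoint rule on [a, b]."""
    h = (b - a) / steps
    return sum(func(a + (i + 0.5) * h) for i in range(steps)) * h


def marginal_x(x):
    # f_X(x) = integral of (x + y) over y in [0, 1]
    return x + 0.5


def cond_exp_y_given_x(x):
    # E(Y | X = x) = integral of y * (x + y) / (x + 0.5) over y in [0, 1]
    return integrate(lambda y: y * (x + y) / (x + 0.5), 0.0, 1.0)


def cond_exp_closed(x):
    # Closed form of the conditional expectation: (x/2 + 1/3) / (x + 0.5)
    return (x / 2.0 + 1.0 / 3.0) / (x + 0.5)


# Outer expectation E{E[Y | X]}: weight the conditional mean by f_X.
iterated = integrate(lambda x: cond_exp_y_given_x(x) * marginal_x(x), 0.0, 1.0, steps=200)

# Direct expectation E[Y] using the marginal f_Y(y) = y + 0.5.
direct = integrate(lambda y: y * (y + 0.5), 0.0, 1.0)

print(iterated, direct)  # both are close to 7/12
```

Both routes give $E(Y) = 7/12$, as the theorem requires.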


Now:
Three important rules for conditional and unconditional expected values

141

Theorem 3.13:
Let $(X, Y)'$ be an arbitrary discrete or continuous random vector and $g_1(\cdot), g_2(\cdot)$ two unidimensional functions. Then
1. $E[g_1(Y) + g_2(Y)|X = x] = E[g_1(Y)|X = x] + E[g_2(Y)|X = x]$,
2. $E[g_1(Y) \cdot g_2(X)|X = x] = g_2(x) \cdot E[g_1(Y)|X = x]$.
3. If X and Y are stochastically independent we have
\[ E[g_1(X) \cdot g_2(Y)] = E[g_1(X)] \cdot E[g_2(Y)]. \]

142

Finally:
Moment generating function for random vectors
Definition 3.14: (Joint moment generating function)
Let $\mathbf{X} = (X_1, \ldots, X_n)'$ be an arbitrary discrete or continuous random vector. The joint moment generating function of $\mathbf{X}$ is defined to be
\[ m_{X_1,\ldots,X_n}(t_1, \ldots, t_n) = E\left[ e^{t_1 X_1 + \ldots + t_n X_n} \right] \]
if this expectation exists for all $t_1, \ldots, t_n$ with $-h < t_j < h$ for some value $h > 0$ and for all $j = 1, \ldots, n$.
143

Remarks:
Via the joint moment generating function mX1,...,Xn (t1, . . . , tn)
we can derive the following mathematical objects:
the marginal moment generating functions mX1 (t1), . . . ,
mXn (tn)
the moments of the marginal distributions
the so-called joint moments

144

Important result: (cf. Theorem 2.23, Slide 85)

For any given joint moment generating function


mX1,...,Xn (t1, . . . , tn) there exists a unique joint cdf
FX1,...,Xn (x1, . . . , xn)

145

3.4 The Multivariate Normal Distribution


Now:
Extension of the univariate normal distribution
Definition 3.15: (Multivariate normal distribution)
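Let $\mathbf{X} = (X_1, \ldots, X_n)'$ be a continuous random vector. $\mathbf{X}$ is defined to have a multivariate normal distribution with parameters
\[ \boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \vdots \\ \mu_n \end{pmatrix} \quad \text{and} \quad \boldsymbol{\Sigma} = \begin{pmatrix} \sigma_1^2 & \cdots & \sigma_{1n} \\ \vdots & \ddots & \vdots \\ \sigma_{n1} & \cdots & \sigma_n^2 \end{pmatrix}, \]
if for $\mathbf{x} = (x_1, \ldots, x_n)' \in \mathbb{R}^n$ its joint pdf is given by
\[ f_{\mathbf{X}}(\mathbf{x}) = (2\pi)^{-n/2} [\det(\boldsymbol{\Sigma})]^{-1/2} \cdot \exp\left\{ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right\}. \]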
146

Remarks:
See Chang (1984, p. 92) for a definition and the properties
of the determinant det(A) of the matrix A
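Notation:
\[ \mathbf{X} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \]
$\boldsymbol{\mu}$ is a column vector with $\mu_1, \ldots, \mu_n \in \mathbb{R}$
$\boldsymbol{\Sigma}$ is a regular, positive definite, symmetric $(n \times n)$ matrix
Role of the parameters:
\[ E(\mathbf{X}) = \boldsymbol{\mu} \quad \text{and} \quad \operatorname{Cov}(\mathbf{X}) = \boldsymbol{\Sigma} \]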

147

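Joint pdf of the multivariate standard normal distribution $N(\mathbf{0}, \mathbf{I}_n)$:
\[ \phi(\mathbf{x}) = (2\pi)^{-n/2} \exp\left\{ -\frac{1}{2} \mathbf{x}'\mathbf{x} \right\} \]
Cf. the analogy to the univariate pdf in Definition 2.24, Slide 91
Properties of the $N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ distribution:
Partial vectors (marginal distributions) of $\mathbf{X}$ also have multivariate normal distributions, i.e. if
\[ \mathbf{X} = \begin{bmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{bmatrix} \sim N\left( \begin{bmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{bmatrix}, \begin{bmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{bmatrix} \right) \]
then
\[ \mathbf{X}_1 \sim N(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11}) \]
\[ \mathbf{X}_2 \sim N(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_{22}) \]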
148

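Thus, all univariate variables of $\mathbf{X} = (X_1, \ldots, X_n)'$ have univariate normal distributions:
\[ X_1 \sim N(\mu_1, \sigma_1^2) \]
\[ X_2 \sim N(\mu_2, \sigma_2^2) \]
\[ \vdots \]
\[ X_n \sim N(\mu_n, \sigma_n^2) \]
The conditional distributions are also (univariately or multivariately) normal:
\[ \mathbf{X}_1 | \mathbf{X}_2 = \mathbf{x}_2 \sim N\left( \boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} (\mathbf{x}_2 - \boldsymbol{\mu}_2),\; \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} \boldsymbol{\Sigma}_{21} \right) \]
Linear transformations:
Let $\mathbf{A}$ be an $(m \times n)$ matrix, $\mathbf{b}$ an $(m \times 1)$ vector of real numbers and $\mathbf{X} = (X_1, \ldots, X_n)' \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. Then
\[ \mathbf{A}\mathbf{X} + \mathbf{b} \sim N(\mathbf{A}\boldsymbol{\mu} + \mathbf{b}, \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}') \]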

149

Example:
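Consider
\[ \mathbf{X} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \sim N\left( \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \begin{bmatrix} 1 & 0.5 \\ 0.5 & 2 \end{bmatrix} \right) \]
Find the distribution of $\mathbf{Y} = \mathbf{A}\mathbf{X} + \mathbf{b}$ where
\[ \mathbf{A} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, \quad \mathbf{b} = \begin{bmatrix} 1 \\ 2 \end{bmatrix} \]
It follows that $\mathbf{Y} \sim N(\mathbf{A}\boldsymbol{\mu} + \mathbf{b}, \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}')$
In particular,
\[ \mathbf{A}\boldsymbol{\mu} + \mathbf{b} = \begin{bmatrix} 3 \\ 6 \end{bmatrix} \quad \text{and} \quad \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}' = \begin{bmatrix} 11 & 24 \\ 24 & 53 \end{bmatrix} \]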
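The parameters of the transformed distribution follow from two matrix products. A minimal sketch in plain Python, using the values of $\boldsymbol{\mu}$, $\boldsymbol{\Sigma}$, $\mathbf{A}$ and $\mathbf{b}$ from the example (the helper names `mat_mul` and `transpose` are illustrative assumptions):

```python
def mat_mul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


mu = [[0.0], [1.0]]                  # expected vector of X
sigma = [[1.0, 0.5], [0.5, 2.0]]     # covariance matrix of X
A = [[1.0, 2.0], [3.0, 4.0]]
b = [[1.0], [2.0]]

# E(Y) = A mu + b
mean_y = [[m[0] + c[0]] for m, c in zip(mat_mul(A, mu), b)]

# Cov(Y) = A Sigma A'
cov_y = mat_mul(mat_mul(A, sigma), transpose(A))

print(mean_y)  # [[3.0], [6.0]]
print(cov_y)   # [[11.0, 24.0], [24.0, 53.0]]
```

The diagonal entries can be cross-checked by hand: $\operatorname{Var}(X_1 + 2X_2) = 1 + 4 \cdot 2 + 2 \cdot 2 \cdot 0.5 = 11$.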
150

Now:
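Consider the bivariate case ($n = 2$), i.e.
\[ \mathbf{X} = (X, Y)', \quad E(\mathbf{X}) = \begin{bmatrix} \mu_X \\ \mu_Y \end{bmatrix}, \quad \boldsymbol{\Sigma} = \begin{bmatrix} \sigma_X^2 & \sigma_{XY} \\ \sigma_{YX} & \sigma_Y^2 \end{bmatrix} \]
We have
\[ \sigma_{XY} = \sigma_{YX} = \operatorname{Cov}(X, Y) = \sigma_X \cdot \sigma_Y \cdot \operatorname{Corr}(X, Y) = \sigma_X \cdot \sigma_Y \cdot \rho \]
The joint pdf follows from Definition 3.15 with $n = 2$
\[ f_{X,Y}(x, y) = \frac{1}{2\pi \sigma_X \sigma_Y \sqrt{1 - \rho^2}} \exp\left\{ -\frac{1}{2(1 - \rho^2)} \left[ \frac{(x - \mu_X)^2}{\sigma_X^2} - \frac{2\rho (x - \mu_X)(y - \mu_Y)}{\sigma_X \sigma_Y} + \frac{(y - \mu_Y)^2}{\sigma_Y^2} \right] \right\} \]
(Derivation: Class)
151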

[Figure: surface plot of $f_{X,Y}(x, y)$ for $\mu_X = \mu_Y = 0$, $\sigma_X = \sigma_Y = 1$ and $\rho = 0$]

152

[Figure: surface plot of $f_{X,Y}(x, y)$ for $\mu_X = \mu_Y = 0$, $\sigma_X = \sigma_Y = 1$ and $\rho = 0.9$]

153

Remarks:
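The marginal distributions are given by
\[ X \sim N(\mu_X, \sigma_X^2) \quad \text{and} \quad Y \sim N(\mu_Y, \sigma_Y^2) \]
→ Interesting result for the normal distribution:
If $(X, Y)'$ has a bivariate normal distribution, then X and Y are independent if and only if $\rho = \operatorname{Corr}(X, Y) = 0$
The conditional distributions are given by
\[ X | Y = y \sim N\left( \mu_X + \rho \frac{\sigma_X}{\sigma_Y} (y - \mu_Y),\; \sigma_X^2 (1 - \rho^2) \right) \]
\[ Y | X = x \sim N\left( \mu_Y + \rho \frac{\sigma_Y}{\sigma_X} (x - \mu_X),\; \sigma_Y^2 (1 - \rho^2) \right) \]
(Proof: Class)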
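The bivariate normal pdf and the role of $\rho$ can be explored with a few lines of code. This is a sketch under stated assumptions: the function names and default arguments are illustrative, and the evaluation points are arbitrary.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Univariate normal density."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def bivariate_normal_pdf(x, y, mu_x=0.0, mu_y=0.0, sigma_x=1.0, sigma_y=1.0, rho=0.0):
    """Bivariate normal density with correlation rho, where |rho| < 1."""
    zx = (x - mu_x) / sigma_x
    zy = (y - mu_y) / sigma_y
    quad = (zx * zx - 2.0 * rho * zx * zy + zy * zy) / (1.0 - rho * rho)
    norm = 2.0 * math.pi * sigma_x * sigma_y * math.sqrt(1.0 - rho * rho)
    return math.exp(-0.5 * quad) / norm


# For rho = 0 the joint pdf factorizes into the product of the marginals.
print(bivariate_normal_pdf(0.3, -1.2), normal_pdf(0.3) * normal_pdf(-1.2))

# Height of the density at the mean for rho = 0 and rho = 0.9.
print(bivariate_normal_pdf(0.0, 0.0, rho=0.0), bivariate_normal_pdf(0.0, 0.0, rho=0.9))
```

The peak at the mean equals $1 / (2\pi \sigma_X \sigma_Y \sqrt{1 - \rho^2})$, so it rises as $|\rho|$ approaches one.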

154

4. Distributions of Functions of Random Variables


Setup:
Consider as given the joint distribution of $X_1, \ldots, X_n$
(i.e. consider as given $f_{X_1,\ldots,X_n}$ and $F_{X_1,\ldots,X_n}$)
Consider k functions
\[ g_1 : \mathbb{R}^n \to \mathbb{R}, \; \ldots, \; g_k : \mathbb{R}^n \to \mathbb{R} \]
Find the joint distribution of the k random variables
\[ Y_1 = g_1(X_1, \ldots, X_n), \; \ldots, \; Y_k = g_k(X_1, \ldots, X_n) \]
(i.e. find $f_{Y_1,\ldots,Y_k}$ and $F_{Y_1,\ldots,Y_k}$)
155

Example:
Consider as given $X_1, \ldots, X_n$ with $f_{X_1,\ldots,X_n}$
Consider the functions
\[ g_1(X_1, \ldots, X_n) = \sum_{i=1}^{n} X_i \quad \text{and} \quad g_2(X_1, \ldots, X_n) = \frac{1}{n} \sum_{i=1}^{n} X_i \]
Find $f_{Y_1,Y_2}$ with $Y_1 = \sum_{i=1}^{n} X_i$ and $Y_2 = \frac{1}{n} \sum_{i=1}^{n} X_i$
Remark:
From the joint distribution $f_{Y_1,\ldots,Y_k}$ we can derive the k marginal distributions $f_{Y_1}, \ldots, f_{Y_k}$
(cf. Chapter 3, Slides 106, 107)
156

Aim of this chapter:


Techniques for finding the (marginal) distribution(s)
of (Y1, . . . , Yk )0

157

4.1 Expectations of Functions of Random Variables


Simplification:
In a first step, we are not interested in the exact distributions,
but merely in certain expected values of Y1, . . . , Yk
Expectation two ways:
Consider as given the (continuous) random variables $X_1, \ldots, X_n$ and the function $g : \mathbb{R}^n \to \mathbb{R}$
Consider the random variable $Y = g(X_1, \ldots, X_n)$ and find the expectation $E[g(X_1, \ldots, X_n)]$
158

Two ways of calculating $E(Y)$:
\[ E(Y) = \int_{-\infty}^{+\infty} y \cdot f_Y(y)\, dy \]
or
\[ E(Y) = \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} g(x_1, \ldots, x_n) \cdot f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)\, dx_1 \ldots dx_n \]
(cf. Definition 3.9, Slide 128)
It can be proved that
Both ways of calculating $E(Y)$ are equivalent
→ choose the most convenient calculation
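The equivalence of the two ways can be seen in a small numerical example. The setup below is a hypothetical illustration, not taken from the slides: X is assumed uniform on $[0, 1]$ and $Y = X^2$, so that $F_Y(y) = P(X \le \sqrt{y}) = \sqrt{y}$ and $f_Y(y) = 1 / (2\sqrt{y})$ on $(0, 1]$.

```python
import math


def integrate(func, a, b, steps=20000):
    """Midpoint rule on [a, b]; the midpoints never hit the endpoint y = 0."""
    h = (b - a) / steps
    return sum(func(a + (i + 0.5) * h) for i in range(steps)) * h


# Way 1: integrate y * f_Y(y) with f_Y(y) = 1 / (2 sqrt(y)).
expectation_via_fy = integrate(lambda y: y / (2.0 * math.sqrt(y)), 0.0, 1.0)

# Way 2: integrate g(x) * f_X(x) with g(x) = x^2 and f_X(x) = 1 on [0, 1].
expectation_via_fx = integrate(lambda x: x ** 2 * 1.0, 0.0, 1.0)

print(expectation_via_fy, expectation_via_fx)  # both are close to 1/3
```

Here the second way is the more convenient one, since it avoids deriving $f_Y$ first.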
159

Now:
Calculation rules for expected values, variances, covariances
of sums of random variables

Setting:
$X_1, \ldots, X_n$ are given continuous or discrete random variables with joint density $f_{X_1,\ldots,X_n}$
The (transforming) function $g : \mathbb{R}^n \to \mathbb{R}$ is given by
\[ g(x_1, \ldots, x_n) = \sum_{i=1}^{n} x_i \]

160

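In a first step, find the expectation and the variance of
\[ Y = g(X_1, \ldots, X_n) = \sum_{i=1}^{n} X_i \]
Theorem 4.1: (Expectation and variance of a sum)
For the given random variables $X_1, \ldots, X_n$ we have
\[ E\left( \sum_{i=1}^{n} X_i \right) = \sum_{i=1}^{n} E(X_i) \]
and
\[ \operatorname{Var}\left( \sum_{i=1}^{n} X_i \right) = \sum_{i=1}^{n} \operatorname{Var}(X_i) + 2 \cdot \sum_{i=1}^{n} \sum_{j=i+1}^{n} \operatorname{Cov}(X_i, X_j). \]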
161

Implications:
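For given constants $a_1, \ldots, a_n \in \mathbb{R}$ we have
\[ E\left( \sum_{i=1}^{n} a_i \cdot X_i \right) = \sum_{i=1}^{n} a_i \cdot E(X_i) \]
(why?)
For two random variables $X_1$ and $X_2$ we have
\[ E(X_1 \pm X_2) = E(X_1) \pm E(X_2) \]
If $X_1, \ldots, X_n$ are stochastically independent, it follows that $\operatorname{Cov}(X_i, X_j) = 0$ for all $i \ne j$ and hence
\[ \operatorname{Var}\left( \sum_{i=1}^{n} X_i \right) = \sum_{i=1}^{n} \operatorname{Var}(X_i) \]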
162

Now:
Calculating the covariance of two sums of random variables
Theorem 4.2: (Covariance of two sums)
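Let $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_m$ be two sets of random variables and let $a_1, \ldots, a_n$ and $b_1, \ldots, b_m$ be two sets of constants. Then
\[ \operatorname{Cov}\left( \sum_{i=1}^{n} a_i \cdot X_i, \sum_{j=1}^{m} b_j \cdot Y_j \right) = \sum_{i=1}^{n} \sum_{j=1}^{m} a_i \cdot b_j \cdot \operatorname{Cov}(X_i, Y_j). \]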

163

Implications:
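The variance of a weighted sum of random variables is given by
\[
\begin{aligned}
\operatorname{Var}\left( \sum_{i=1}^{n} a_i \cdot X_i \right) &= \operatorname{Cov}\left( \sum_{i=1}^{n} a_i \cdot X_i, \sum_{j=1}^{n} a_j \cdot X_j \right) \\
&= \sum_{i=1}^{n} \sum_{j=1}^{n} a_i \cdot a_j \cdot \operatorname{Cov}(X_i, X_j) \\
&= \sum_{i=1}^{n} a_i^2 \cdot \operatorname{Var}(X_i) + \sum_{i=1}^{n} \sum_{j=1, j \ne i}^{n} a_i \cdot a_j \cdot \operatorname{Cov}(X_i, X_j) \\
&= \sum_{i=1}^{n} a_i^2 \cdot \operatorname{Var}(X_i) + 2 \cdot \sum_{i=1}^{n} \sum_{j=i+1}^{n} a_i \cdot a_j \cdot \operatorname{Cov}(X_i, X_j)
\end{aligned}
\]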
164

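For two random variables $X_1$ and $X_2$ we have
\[ \operatorname{Var}(X_1 \pm X_2) = \operatorname{Var}(X_1) + \operatorname{Var}(X_2) \pm 2 \cdot \operatorname{Cov}(X_1, X_2), \]
and if $X_1$ and $X_2$ are independent we have
\[ \operatorname{Var}(X_1 \pm X_2) = \operatorname{Var}(X_1) + \operatorname{Var}(X_2) \]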
Finally:
Important result concerning the expectation of a product of
two random variables

165

Setting:
Let $X_1, X_2$ be both continuous or both discrete random variables with joint density $f_{X_1,X_2}$
Let $g : \mathbb{R}^2 \to \mathbb{R}$ be defined as $g(x_1, x_2) = x_1 \cdot x_2$
Find the expectation of
\[ Y = g(X_1, X_2) = X_1 \cdot X_2 \]
Theorem 4.3: (Expectation of a product)
For the random variables $X_1, X_2$ we have
\[ E(X_1 \cdot X_2) = E(X_1) \cdot E(X_2) + \operatorname{Cov}(X_1, X_2). \]


166

Implication:
If $X_1$ and $X_2$ are stochastically independent, we have
\[ E(X_1 \cdot X_2) = E(X_1) \cdot E(X_2) \]


Remarks:
A formula for Var(X1 X2) also exists
In many cases, there are no explicit formulas for expected
values and variances of other transformations (e.g. for ratios
of random variables)

167

4.2 The Cumulative-distribution-function Technique


Motivation:
Consider as given the random variables $X_1, \ldots, X_n$ with joint density $f_{X_1,\ldots,X_n}$
Find the joint distribution of $Y_1, \ldots, Y_k$ where $Y_j = g_j(X_1, \ldots, X_n)$ for $j = 1, \ldots, k$
The joint cdf of $Y_1, \ldots, Y_k$ is defined to be
\[ F_{Y_1,\ldots,Y_k}(y_1, \ldots, y_k) = P(Y_1 \le y_1, \ldots, Y_k \le y_k) \]
(cf. Definition 3.2, Slide 98)
168

Now, for each $y_1, \ldots, y_k$ the event
\[ \{Y_1 \le y_1, \ldots, Y_k \le y_k\} = \{g_1(X_1, \ldots, X_n) \le y_1, \ldots, g_k(X_1, \ldots, X_n) \le y_k\}, \]
i.e. the latter event is an event described in terms of the given functions $g_1, \ldots, g_k$ and the given random variables $X_1, \ldots, X_n$
→ since the joint distribution of $X_1, \ldots, X_n$ is assumed given, presumably the probability of the latter event can be calculated and consequently $F_{Y_1,\ldots,Y_k}$ determined

169

Example 1:
Consider $n = 1$ (i.e. consider $X_1 \equiv X$ with cdf $F_X$) and $k = 1$ (i.e. $g_1 \equiv g$ and $Y_1 \equiv Y$)
Consider the function
\[ g(x) = a \cdot x + b, \quad b \in \mathbb{R},\; a > 0 \]
Find the distribution of
\[ Y = g(X) = a \cdot X + b \]

170

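The cdf of Y is given by
\[
\begin{aligned}
F_Y(y) &= P(Y \le y) \\
&= P[g(X) \le y] \\
&= P(a \cdot X + b \le y) \\
&= P\left( X \le \frac{y - b}{a} \right) \\
&= F_X\left( \frac{y - b}{a} \right)
\end{aligned}
\]
If X is continuous, the pdf of Y is given by
\[ f_Y(y) = F_Y'(y) = F_X'\left( \frac{y - b}{a} \right) \cdot \frac{1}{a} = \frac{1}{a} \cdot f_X\left( \frac{y - b}{a} \right) \]
(cf. Slide 48)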

171

Example 2:
Consider $n = 1$ and $k = 1$ and the function
\[ g(x) = e^x \]
The cdf of $Y = g(X) = e^X$ is given by
\[
\begin{aligned}
F_Y(y) &= P(Y \le y) \\
&= P(e^X \le y) \\
&= P[X \le \ln(y)] \\
&= F_X[\ln(y)]
\end{aligned}
\]
If X is continuous, the pdf of Y is given by
\[ f_Y(y) = F_Y'(y) = F_X'[\ln(y)] \cdot \frac{1}{y} = \frac{f_X[\ln(y)]}{y} \]
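The cdf technique of Example 2 can be verified numerically for a concrete choice of X. In this sketch X is assumed to be standard normal (the slides leave the distribution of X open), and the derivative of $F_Y$ is approximated by a central difference; all names are illustrative.

```python
import math
from statistics import NormalDist

X = NormalDist(0.0, 1.0)  # assumed distribution of X, for illustration only


def cdf_y(y):
    """F_Y(y) = F_X(ln y) for Y = exp(X); zero for y <= 0."""
    return X.cdf(math.log(y)) if y > 0.0 else 0.0


def pdf_y(y):
    """f_Y(y) = f_X(ln y) / y, the derivative of F_Y."""
    return X.pdf(math.log(y)) / y if y > 0.0 else 0.0


def numeric_derivative(func, y, h=1e-5):
    """Central-difference approximation of func'(y)."""
    return (func(y + h) - func(y - h)) / (2.0 * h)


for y in (0.5, 1.0, 2.5):
    print(y, numeric_derivative(cdf_y, y), pdf_y(y))
```

With a normal X, the resulting Y follows a lognormal distribution.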

172

Now:
Consider $n = 2$ and $k = 2$, i.e. consider $X_1$ and $X_2$ with joint density $f_{X_1,X_2}(x_1, x_2)$
Consider the functions
\[ g_1(x_1, x_2) = x_1 + x_2 \quad \text{and} \quad g_2(x_1, x_2) = x_1 - x_2 \]
Find the distributions of the sum and the difference of two random variables
Derivation via the two-dimensional cdf-technique

173

Theorem 4.4: (Distribution of a sum / difference)


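Let $X_1$ and $X_2$ be two continuous random variables with joint pdf $f_{X_1,X_2}(x_1, x_2)$. Then the pdfs of $Y_1 = X_1 + X_2$ and $Y_2 = X_1 - X_2$ are given by
\[ f_{Y_1}(y_1) = \int_{-\infty}^{+\infty} f_{X_1,X_2}(x_1, y_1 - x_1)\, dx_1 = \int_{-\infty}^{+\infty} f_{X_1,X_2}(y_1 - x_2, x_2)\, dx_2 \]
and
\[ f_{Y_2}(y_2) = \int_{-\infty}^{+\infty} f_{X_1,X_2}(x_1, x_1 - y_2)\, dx_1 = \int_{-\infty}^{+\infty} f_{X_1,X_2}(y_2 + x_2, x_2)\, dx_2. \]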

174

Implication:
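If $X_1$ and $X_2$ are independent, then
\[ f_{Y_1}(y_1) = \int_{-\infty}^{+\infty} f_{X_1}(x_1) \cdot f_{X_2}(y_1 - x_1)\, dx_1 \]
\[ f_{Y_2}(y_2) = \int_{-\infty}^{+\infty} f_{X_1}(x_1) \cdot f_{X_2}(x_1 - y_2)\, dx_1 \]
Example:
Let $X_1$ and $X_2$ be independent random variables both with pdf
\[ f_{X_1}(x) = f_{X_2}(x) = \begin{cases} 1, & \text{for } x \in [0, 1] \\ 0, & \text{elsewise} \end{cases} \]
Find the pdf of $Y = X_1 + X_2$
(Class)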
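The convolution formula for independent summands can be evaluated numerically for the uniform example. The sketch below is an illustration, not the derivation done in class: it approximates the integral by a midpoint rule and compares the result with the triangular density ($y$ on $[0, 1]$, $2 - y$ on $(1, 2]$), which is the known density of the sum of two independent uniform variables on $[0, 1]$. All function names are assumptions.

```python
def uniform_pdf(x):
    """Density of the uniform distribution on [0, 1]."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0


def sum_pdf_numeric(y, steps=10000):
    """Midpoint-rule approximation of the convolution integral over x1 in [0, 1]."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        x1 = (i + 0.5) * h
        total += uniform_pdf(x1) * uniform_pdf(y - x1)
    return total * h


def triangular_pdf(y):
    """Known density of the sum of two independent U(0, 1) variables."""
    if 0.0 <= y <= 1.0:
        return y
    if 1.0 < y <= 2.0:
        return 2.0 - y
    return 0.0


for y in (0.3, 0.5, 1.0, 1.5, 2.5):
    print(y, sum_pdf_numeric(y), triangular_pdf(y))
```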
175

Now:
Analogous results for the product and the ratio of two random variables

Theorem 4.5: (Distribution of a product / ratio)


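Let $X_1$ and $X_2$ be continuous random variables with joint pdf $f_{X_1,X_2}(x_1, x_2)$. Then the pdfs of $Y_1 = X_1 \cdot X_2$ and $Y_2 = X_1 / X_2$ are given by
\[ f_{Y_1}(y_1) = \int_{-\infty}^{+\infty} \frac{1}{|x_1|} f_{X_1,X_2}\left( x_1, \frac{y_1}{x_1} \right) dx_1 \]
and
\[ f_{Y_2}(y_2) = \int_{-\infty}^{+\infty} |x_2| \cdot f_{X_1,X_2}(y_2 \cdot x_2, x_2)\, dx_2. \]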


176

4.3 The Moment-generating-function Technique

Motivation:
Consider as given the random variables X1, . . . , Xn with joint
pdf fX1,...,Xn
Again, find the joint distribution of Y1, . . . , Yk where Yj =
gj (X1, . . . , Xn) for j = 1, . . . , k

177

According to Definition 3.14, Slide 143, the joint moment generating function of the $Y_1, \ldots, Y_k$ is defined to be
\[
\begin{aligned}
m_{Y_1,\ldots,Y_k}(t_1, \ldots, t_k) &= E\left[ e^{t_1 Y_1 + \ldots + t_k Y_k} \right] \\
&= \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} e^{t_1 g_1(x_1,\ldots,x_n) + \ldots + t_k g_k(x_1,\ldots,x_n)} \cdot f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)\, dx_1 \ldots dx_n
\end{aligned}
\]
If $m_{Y_1,\ldots,Y_k}(t_1, \ldots, t_k)$ can be recognized as the joint moment generating function of some known joint distribution, it will follow that $Y_1, \ldots, Y_k$ has that joint distribution by virtue of the identification property
(cf. Slide 145)
178

Example:
Consider $n = 1$ and $k = 1$ where the random variable $X_1 \equiv X$ has a standard normal distribution
Consider the function $g_1(x) \equiv g(x) = x^2$
Find the distribution of $Y = g(X) = X^2$
The moment generating function of Y is given by
\[ m_Y(t) = E\left[ e^{tY} \right] = E\left[ e^{tX^2} \right] = \int_{-\infty}^{+\infty} e^{tx^2} \cdot f_X(x)\, dx \]

179

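\[ = \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}} e^{tx^2 - \frac{1}{2}x^2}\, dx = \ldots = \left( \frac{1}{1 - 2t} \right)^{1/2} \quad \text{for } t < \frac{1}{2} \]
This is the moment generating function of a gamma distribution with parameters $\lambda = \frac{1}{2}$ and $r = \frac{1}{2}$
(see Mood, Graybill, Boes (1974), pp. 540/541)
→ $Y = X^2 \sim \Gamma(0.5, 0.5)$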
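The closed form of the moment generating function of $X^2$ can be checked by numerical integration. This is a minimal sketch: the truncation of the integral to $[-12, 12]$, the step count and the function names are assumptions chosen for illustration.

```python
import math


def mgf_x_squared_numeric(t, limit=12.0, steps=20000):
    """Midpoint-rule approximation of E[exp(t X^2)] for standard normal X, t < 1/2."""
    h = 2.0 * limit / steps
    total = 0.0
    for i in range(steps):
        x = -limit + (i + 0.5) * h
        # exp(t x^2) times the standard normal density, without the constant factor
        total += math.exp((t - 0.5) * x * x)
    return total * h / math.sqrt(2.0 * math.pi)


def mgf_x_squared_closed(t):
    """Closed form (1 - 2t)^(-1/2), valid for t < 1/2."""
    return (1.0 - 2.0 * t) ** -0.5


for t in (-1.0, 0.0, 0.25):
    print(t, mgf_x_squared_numeric(t), mgf_x_squared_closed(t))
```

For values of t close to $1/2$ the integrand decays slowly, so the truncation limit would have to be increased.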
180

Now:
Distribution of sums of independent random variables
Preliminaries:
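Consider the moment generating function of such a sum
Let $X_1, \ldots, X_n$ be independent random variables and let $Y = \sum_{i=1}^{n} X_i$
The moment generating function of Y is given by
\[
\begin{aligned}
m_Y(t) &= E\left[ e^{tY} \right] = E\left[ e^{t \cdot \sum_{i=1}^{n} X_i} \right] \\
&= E\left[ e^{tX_1} \cdot e^{tX_2} \cdot \ldots \cdot e^{tX_n} \right] \\
&= E\left[ e^{tX_1} \right] \cdot E\left[ e^{tX_2} \right] \cdot \ldots \cdot E\left[ e^{tX_n} \right] \quad \text{[Theorem 3.13, part 3]} \\
&= m_{X_1}(t) \cdot m_{X_2}(t) \cdot \ldots \cdot m_{X_n}(t)
\end{aligned}
\]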

181

Theorem 4.6: (Moment generating function of a sum)


Let $X_1, \ldots, X_n$ be stochastically independent random variables with existing moment generating functions $m_{X_1}(t), \ldots, m_{X_n}(t)$ for all $t \in (-h, h)$, $h > 0$. Then the moment generating function of the sum $Y = \sum_{i=1}^{n} X_i$ is given by
\[ m_Y(t) = \prod_{i=1}^{n} m_{X_i}(t) \quad \text{for } t \in (-h, h). \]
Hopefully:
The distribution of the sum $Y = \sum_{i=1}^{n} X_i$ may be identified from the moment generating function of the sum $m_Y(t)$

182

Example 1:
Assume that $X_1, \ldots, X_n$ are independent and identically distributed exponential random variables with parameter $\lambda > 0$
The moment generating function of each $X_i$ ($i = 1, \ldots, n$) is given by
\[ m_{X_i}(t) = \frac{\lambda}{\lambda - t} \quad \text{for } t < \lambda \]
(cf. Mood, Graybill, Boes (1974), pp. 540/541)
So the moment generating function of the sum $Y = \sum_{i=1}^{n} X_i$ is given by
\[ m_Y(t) = m_{\sum X_i}(t) = \prod_{i=1}^{n} m_{X_i}(t) = \left( \frac{\lambda}{\lambda - t} \right)^n \]

183

This is the moment generating function of a $\Gamma(n, \lambda)$ distribution
(cf. Mood, Graybill, Boes (1974), pp. 540/541)
→ the sum of n independent, identically distributed exponential random variables with parameter $\lambda$ has a $\Gamma(n, \lambda)$ distribution

184

Example 2:
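Assume that $X_1, \ldots, X_n$ are independent random variables and that $X_i \sim N(\mu_i, \sigma_i^2)$
Furthermore, let $a_1, \ldots, a_n \in \mathbb{R}$ be constants
Then the distribution of the weighted sum is given by
\[ Y = \sum_{i=1}^{n} a_i \cdot X_i \sim N\left( \sum_{i=1}^{n} a_i \cdot \mu_i, \; \sum_{i=1}^{n} a_i^2 \cdot \sigma_i^2 \right) \]
(Proof: Class)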

185

4.4 General Transformations


Up to now:
Techniques that allow us, under special circumstances, to
find the distributions of the transformed variables
Y1 = g1(X1, . . . , Xn), . . . , Yk = gk (X1, . . . , Xn)
However:
These methods do not necessarily hit the mark
(e.g. if calculations get too complicated)

186

Resort:
There are constructive methods by which it is generally possible (under rather mild conditions) to find the distributions of transformed random variables
→ transformation theorems
Here:
We restrict attention to the simplest case where $n = 1$, $k = 1$, i.e. we consider the transformation $Y = g(X)$
For multivariate extensions (i.e. for $n \ge 1$, $k \ge 1$) see Mood, Graybill, Boes (1974), pp. 203-212
187

Theorem 4.7: (Transformation theorem for densities)


Suppose X is a continuous random variable with pdf $f_X(x)$. Set $D = \{x : f_X(x) > 0\}$. Furthermore, assume that
(a) the transformation $g : D \to W$ with $y = g(x)$ is a one-to-one transformation of D onto W,
(b) the derivative with respect to y of the inverse function $g^{-1} : W \to D$ with $x = g^{-1}(y)$ is continuous and nonzero for all $y \in W$.
Then $Y = g(X)$ is a continuous random variable with pdf
\[ f_Y(y) = \begin{cases} \left| \dfrac{dg^{-1}(y)}{dy} \right| \cdot f_X\left( g^{-1}(y) \right), & \text{for } y \in W \\ 0, & \text{elsewise} \end{cases} \]
188

Remark:
The transformation $g : D \to W$ with $y = g(x)$ is called one-to-one, if for every $y \in W$ there exists exactly one $x \in D$ with $y = g(x)$
Example:
Suppose X has the pdf
\[ f_X(x) = \begin{cases} \alpha \cdot x^{-\alpha - 1}, & \text{for } x \in [1, +\infty) \\ 0, & \text{elsewise} \end{cases} \]
(Pareto distribution with parameter $\alpha > 0$)
Find the distribution of $Y = \ln(X)$
We have $D = [1, +\infty)$, $g(x) = \ln(x)$, $W = [0, +\infty)$
189

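Furthermore, $g(x) = \ln(x)$ is a one-to-one transformation of $D = [1, +\infty)$ onto $W = [0, +\infty)$ with inverse function
\[ x = g^{-1}(y) = e^y \]
Its derivative with respect to y is given by
\[ \frac{dg^{-1}(y)}{dy} = e^y, \]
i.e. the derivative is continuous and nonzero for all $y \in [0, +\infty)$
Hence, the pdf of $Y = \ln(X)$ is given by
\[ f_Y(y) = \begin{cases} e^y \cdot \alpha \cdot (e^y)^{-\alpha - 1}, & \text{for } y \in [0, +\infty) \\ 0, & \text{elsewise} \end{cases} = \begin{cases} \alpha \cdot e^{-\alpha y}, & \text{for } y \in [0, +\infty) \\ 0, & \text{elsewise} \end{cases} \]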
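The result of the transformation theorem can be cross-checked with the cdf technique of Section 4.2. The sketch below assumes the Pareto cdf $F_X(x) = 1 - x^{-\alpha}$ for $x \ge 1$ (obtained by integrating the pdf) and compares a numerical derivative of $F_Y(y) = F_X(e^y)$ with the density $\alpha e^{-\alpha y}$; function names and the chosen $\alpha$ are illustrative.

```python
import math


def pareto_cdf(x, alpha):
    """F_X(x) = 1 - x^(-alpha) for x >= 1, obtained by integrating the Pareto pdf."""
    return 1.0 - x ** (-alpha) if x >= 1.0 else 0.0


def cdf_y(y, alpha):
    """cdf technique: P(ln X <= y) = P(X <= e^y)."""
    return pareto_cdf(math.exp(y), alpha)


def pdf_y(y, alpha):
    """Density from the transformation theorem: alpha * exp(-alpha * y) for y >= 0."""
    return alpha * math.exp(-alpha * y) if y >= 0.0 else 0.0


def numeric_derivative(func, y, h=1e-5):
    """Central-difference approximation of func'(y)."""
    return (func(y + h) - func(y - h)) / (2.0 * h)


alpha = 2.5  # hypothetical parameter value
for y in (0.2, 1.0, 3.0):
    print(y, numeric_derivative(lambda v: cdf_y(v, alpha), y), pdf_y(y, alpha))
```

The density $\alpha e^{-\alpha y}$ is that of an exponential distribution with parameter $\alpha$.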
190

5. Methods of Estimation
Setting:
Let X be a random variable (or let X be a random vector)
representing a random experiment
We are interested in the actual distribution of X (or X)
Notice:
In practice the actual distribution of X is a priori unknown

191

Therefore:
Collect information on the unknown distribution by repeatedly observing the random experiment (and thus the random
variable X)
random sample
statistic
estimator

192

5.1 Sampling, Estimators, Limit Theorems


Setting:
Let X represent the random experiment under consideration
(X is a univariate random variable)
We intend to observe the random experiment (i.e. X) n times
Prior to the explicit realizations we may consider the potential
observations as a set of n random variables X1, . . . , Xn

193

Definition 5.1: (Random sample)


The random variables X1, . . . , Xn are defined to be a random
sample from X if
(a) each Xi, i = 1, . . . , n, has the same distribution as X,
(b) X1, . . . , Xn are stochastically independent.

The number n is called the sample size.

194

Remarks:
We assume that, in principle, the random experiment can be
repeated as often as desired
We call the realizations x1, . . . , xn of the random sample
X1, . . . , Xn the observed or the concrete sample
Considering the random sample $X_1, \ldots, X_n$ as a random vector, we see that its joint density is given by
\[ f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = \prod_{i=1}^{n} f_{X_i}(x_i) \]
(since the $X_i$'s are independent; cf. Definition 3.8, Slide 125)

195

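Model of a random sample
[Diagram: the random experiment X gives rise to the random variables $X_1, X_2, \ldots, X_n$; each random variable $X_i$ is mapped to its realization $x_i$ in the i-th experiment, i.e. $x_1$ (realization, 1st experiment), $x_2$ (realization, 2nd experiment), ..., $x_n$ (realization, n-th experiment); the $x_i$ are drawn from the set of possible realizations]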

196

Now:
Consider functions of the sampling variables X1, . . . , Xn
statistic
estimator
Definition 5.2: (Statistic)
Let $X_1, \ldots, X_n$ be a random sample from X and let $g : \mathbb{R}^n \to \mathbb{R}$ be a real-valued function with n arguments that does not contain any unknown parameters. Then the random variable
\[ T = g(X_1, \ldots, X_n) \]
is called a statistic.
197

Examples:
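Sample mean:
\[ \bar{X} = g_1(X_1, \ldots, X_n) = \frac{1}{n} \cdot \sum_{i=1}^{n} X_i \]
Sample variance:
\[ S^2 = g_2(X_1, \ldots, X_n) = \frac{1}{n} \cdot \sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2 \]
Sample standard deviation:
\[ S = g_3(X_1, \ldots, X_n) = \sqrt{ \frac{1}{n} \cdot \sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2 } \]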
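Evaluated on a concrete sample, the three statistics yield plain numbers. A minimal sketch, using a made-up sample and illustrative function names; note that the sample variance here divides by $n$, as in the definition above.

```python
import math


def sample_mean(xs):
    return sum(xs) / len(xs)


def sample_variance(xs):
    """Sample variance with divisor n."""
    mean = sample_mean(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


def sample_std(xs):
    return math.sqrt(sample_variance(xs))


observed = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical concrete sample
print(sample_mean(observed))      # 5.0
print(sample_variance(observed))  # 4.0
print(sample_std(observed))       # 2.0
```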
198

Remarks:
All these concepts can be extended to the multivariate case
The statistic T = g(X1, . . . , Xn) is a function of random variables and hence it is itself a random variable
a statistic has a distribution
(and, in particular, an expectation and a variance)

Purposes of statistics:
Statistics provide information on the distribution of X
Statistics are central tools for
estimating parameters
hypothesis-testing on parameters
199

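Random samples and statistics
[Diagram: the sample $(X_1, \ldots, X_n)$ is mapped by g to the statistic $g(X_1, \ldots, X_n)$; measurement turns the sample into the sample realization $(x_1, \ldots, x_n)$, which is mapped by g to the realization of the statistic $g(x_1, \ldots, x_n)$]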

200

Now:
Let X be a random variable with unknown cdf $F_X(x)$
We may be interested in one or several unknown parameters of X
Let $\boldsymbol{\theta}$ denote this unknown vector of parameters, e.g.
\[ \boldsymbol{\theta} = \begin{bmatrix} E(X) \\ \operatorname{Var}(X) \end{bmatrix} \]
Frequently, the distribution family of X is known, e.g. $X \sim N(\mu, \sigma^2)$, but we do not know the specific parameters. Then
\[ \boldsymbol{\theta} = \begin{bmatrix} \mu \\ \sigma^2 \end{bmatrix} \]
We will estimate the unknown parameter vector on the basis of statistics from a random sample $X_1, \ldots, X_n$
201

Definition 5.3: (Estimator, estimate)


The statistic $\hat{\boldsymbol{\theta}}(X_1, \ldots, X_n)$ is called estimator (or point estimator) of the unknown parameter vector $\boldsymbol{\theta}$. After having observed the concrete sample $x_1, \ldots, x_n$, we call the realization of the estimator $\hat{\boldsymbol{\theta}}(x_1, \ldots, x_n)$ an estimate.

Remarks:
The estimator $\hat{\boldsymbol{\theta}}(X_1, \ldots, X_n)$ is a random variable or a random vector
→ an estimator has a (joint) distribution, an expected value (or vector) and a variance (or a covariance matrix)
The estimate $\hat{\boldsymbol{\theta}}(x_1, \ldots, x_n)$ is a number (or a vector of numbers)
202

Example:
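Let $X \sim N(\mu, \sigma^2)$ with unknown parameters $\mu$ and $\sigma^2$
The vector of parameters to be estimated is given by
\[ \boldsymbol{\theta} = \begin{bmatrix} \mu \\ \sigma^2 \end{bmatrix} = \begin{bmatrix} E(X) \\ \operatorname{Var}(X) \end{bmatrix} \]
Potential estimators of $\mu$ and $\sigma^2$ are
\[ \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} X_i \quad \text{and} \quad \hat{\sigma}^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \hat{\mu})^2 \]
→ an estimator of $\boldsymbol{\theta}$ is given by
\[ \hat{\boldsymbol{\theta}} = \begin{bmatrix} \hat{\mu} \\ \hat{\sigma}^2 \end{bmatrix} = \begin{bmatrix} \frac{1}{n} \sum_{i=1}^{n} X_i \\ \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \hat{\mu})^2 \end{bmatrix} \]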
203

Question:
Why do we need this seemingly complicated concept of an
estimator in the form of a random variable?

Answer:
To establish a comparison between alternative estimators of
the parameter vector

Example:
Let $\theta = \operatorname{Var}(X)$ denote the unknown variance of X
204

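Two alternative estimators of $\theta$ are
\[ \hat{\theta}_1(X_1, \ldots, X_n) = \frac{1}{n} \sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2 \]
\[ \hat{\theta}_2(X_1, \ldots, X_n) = \frac{1}{n - 1} \sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2 \]
Question:
Which estimator is better and for what reasons?
→ properties (goodness criteria) of point estimators
(see Section 5.2)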
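Because estimators are random variables, their distributions can be compared by simulation. The following sketch is an illustration under stated assumptions: samples of size 5 are drawn from a standard normal distribution (true variance 1), and the sample size, replication count, seed and function names are arbitrary choices.

```python
import random


def variance_estimators(sample):
    """Return the two estimators with divisor n and divisor n - 1."""
    n = len(sample)
    mean = sum(sample) / n
    sum_sq = sum((x - mean) ** 2 for x in sample)
    return sum_sq / n, sum_sq / (n - 1)


def average_estimates(n=5, replications=20000, seed=1):
    """Average both estimators over many simulated N(0, 1) samples."""
    rng = random.Random(seed)
    total_1 = total_2 = 0.0
    for _ in range(replications):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        est_1, est_2 = variance_estimators(sample)
        total_1 += est_1
        total_2 += est_2
    return total_1 / replications, total_2 / replications


avg_1, avg_2 = average_estimates()
print(avg_1, avg_2)
```

In such a run the average of the first estimator tends to lie near $(n - 1)/n = 0.8$ and the average of the second near the true variance 1, which hints at one of the criteria discussed in Section 5.2.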

205

Notice:
Some of these criteria qualify estimators in terms of their properties when the sample size becomes large
($n \to \infty$, large-sample properties)
Therefore:
Explanation of the concept of stochastic convergence:
Central-limit theorem
Weak law of large numbers
Convergence in probability
Convergence in distribution
206

Theorem 5.4: (Univariate central-limit theorem)


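Let $X$ be any arbitrary random variable with $E(X) = \mu$ and $\mathrm{Var}(X) = \sigma^2$. Let $X_1, \ldots, X_n$ be a random sample from $X$ and let
$$\bar X_n = \frac{1}{n}\sum_{i=1}^{n} X_i$$
denote the arithmetic sample mean. Then, for $n \to \infty$, we have
$$\bar X_n \overset{\text{appr.}}{\sim} N\left(\mu, \frac{\sigma^2}{n}\right) \qquad\text{and}\qquad \sqrt{n}\,\frac{\bar X_n - \mu}{\sigma} \overset{\text{appr.}}{\sim} N(0, 1).$$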
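The statement of Theorem 5.4 can be checked by simulation. The following is a minimal sketch, assuming an Exp(1) population (so $\mu = \sigma = 1$); the function names, the sample size $n = 50$, the number of replications and the seed are illustrative choices, not part of the lecture. If the standardized mean is approximately $N(0, 1)$, roughly 95% of the simulated values should fall inside $[-1.96, 1.96]$.

```python
import math
import random

def standardized_mean(n, rng):
    """Draw n Exp(1) variates (mu = 1, sigma = 1) and return sqrt(n) * (mean - mu) / sigma."""
    mean = sum(rng.expovariate(1.0) for _ in range(n)) / n
    return math.sqrt(n) * (mean - 1.0) / 1.0

def coverage(n, reps=4000, seed=1):
    """Fraction of standardized means inside [-1.96, 1.96]; about 0.95 under N(0, 1)."""
    rng = random.Random(seed)
    hits = sum(abs(standardized_mean(n, rng)) <= 1.96 for _ in range(reps))
    return hits / reps

share = coverage(n=50)
print(share)
```

The population is deliberately skewed: the approximation concerns the sample mean, not the distribution of $X$ itself.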

Next:
Generalization to the multivariate case
207

Theorem 5.5: (Multivariate central-limit theorem)


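Let $X = (X_1, \ldots, X_m)'$ be any arbitrary random vector with $E(X) = \mu$ and $\mathrm{Cov}(X) = \Sigma$. Let $X_1, \ldots, X_n$ be a (multivariate) random sample from $X$ and let
$$\bar X_n = \frac{1}{n}\sum_{i=1}^{n} X_i$$
denote the multivariate arithmetic sample mean. Then, for $n \to \infty$, we have
$$\bar X_n \overset{\text{appr.}}{\sim} N\left(\mu, \frac{1}{n}\Sigma\right) \qquad\text{and}\qquad \sqrt{n}\left(\bar X_n - \mu\right) \overset{\text{appr.}}{\sim} N(0, \Sigma).$$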

208

Remarks:
A multivariate random sample from the random vector X
arises naturally by replacing all univariate random variables
in Definition 5.1 (Slide 194) by corresponding multivariate
random vectors
Note the formal analogy to the univariate case in Theorem
5.4
(be aware of matrix-calculus rules!)

Next:
Famous theorem on the arithmetic sample mean
209

Theorem 5.6: (Weak law of large numbers)


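Let $X_1, X_2, \ldots$ be a sequence of independent and identically distributed random variables with
$$E(X_i) = \mu < \infty, \qquad \mathrm{Var}(X_i) = \sigma^2 < \infty.$$
Consider the random variable
$$\bar X_n = \frac{1}{n}\sum_{i=1}^{n} X_i$$
(arithmetic sample mean). Then, for any $\varepsilon > 0$ we have
$$\lim_{n \to \infty} P\left(\left|\bar X_n - \mu\right| \ge \varepsilon\right) = 0.$$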
210

Remarks:
Theorem 5.6 is known as the weak law of large numbers
Irrespective of how small we choose $\varepsilon > 0$, the probability
that $\bar X_n$ deviates by more than $\varepsilon$ from its expectation $\mu$ tends
to zero when the sample size increases
Notice the analogy between a sequence of independent and
identically distributed random variables and the definition of
a random sample from X on Slide 194
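The weak law can be made concrete with an exact calculation. The sketch below assumes Bernoulli(1/2) draws and $\varepsilon = 0.1$ (both are illustrative assumptions, as is the function name `tail_prob`). It computes $P(|\bar X_n - \mu| \ge \varepsilon)$ exactly from the binomial distribution for growing $n$.

```python
from fractions import Fraction
from math import comb

def tail_prob(n, eps=Fraction(1, 10)):
    """Exact P(|mean - 1/2| >= eps) for n i.i.d. Bernoulli(1/2) draws."""
    mu = Fraction(1, 2)
    # Count the sequences whose number of ones k gives a sample mean k/n
    # that is at least eps away from mu (exact rational comparison).
    favourable = sum(comb(n, k) for k in range(n + 1)
                     if abs(Fraction(k, n) - mu) >= eps)
    return favourable / 2 ** n  # each of the 2^n sequences is equally likely

probs = [tail_prob(n) for n in (10, 100, 1000)]
print(probs)
```

Exact fractions are used in the comparison so that boundary cases such as $k/n = 0.4$ are not misclassified by floating-point rounding.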

Next:
The first important concept of limiting behaviour
211

Definition 5.7: (Convergence in probability)


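Let $Y_1, Y_2, \ldots$ be a sequence of random variables. We say that the sequence $Y_1, Y_2, \ldots$ converges in probability to $\theta$, if for any $\varepsilon > 0$ we have
$$\lim_{n \to \infty} P\left(|Y_n - \theta| \ge \varepsilon\right) = 0.$$
We denote convergence in probability by
$$\mathrm{plim}\; Y_n = \theta \qquad\text{or}\qquad Y_n \xrightarrow{p} \theta.$$

Remarks:
Specific case: Weak law of large numbers
$$\mathrm{plim}\; \bar X_n = \mu \qquad\text{or}\qquad \bar X_n \xrightarrow{p} \mu$$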
212

Typically (but not necessarily) a sequence of random variables converges in probability to a constant $\theta \in \mathbb{R}$
For multivariate sequences of random vectors $Y_1, Y_2, \ldots$, Definition 5.7 has to be applied element by element
The concept of convergence in probability is important for assessing estimators

Next:
Alternative concepts of stochastic convergence
213

Definition 5.8: (Convergence in distribution)


Let $Y_1, Y_2, \ldots$ be a sequence of random variables and let $Z$ also be a random variable. We say that the sequence $Y_1, Y_2, \ldots$ converges in distribution to the distribution of $Z$ if
$$\lim_{n \to \infty} F_{Y_n}(y) = F_Z(y) \quad \text{for any } y \in \mathbb{R} \text{ at which } F_Z \text{ is continuous}.$$
We denote convergence in distribution by
$$Y_n \xrightarrow{d} Z.$$
Remarks:
Specific case: central-limit theorem
$$Y_n = \sqrt{n}\,\frac{\bar X_n - \mu}{\sigma} \xrightarrow{d} U \sim N(0, 1)$$
In the case of convergence in distribution, the sequence of rvs always converges to a limiting random variable
214

Theorem 5.9: (Rules for probability limits)


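Let $X_1, X_2, \ldots$ and $Y_1, Y_2, \ldots$ be sequences of random variables with $\mathrm{plim}\; X_n = a$ and $\mathrm{plim}\; Y_n = b$. Then
(a) $\mathrm{plim}\,(X_n \pm Y_n) = a \pm b$,
(b) $\mathrm{plim}\,(X_n \cdot Y_n) = a \cdot b$,
(c) $\mathrm{plim}\,\dfrac{X_n}{Y_n} = \dfrac{a}{b}$ (for $b \ne 0$).
(d) (Slutsky-Theorem) If $g : \mathbb{R} \to \mathbb{R}$ is a function that is continuous at $a \in \mathbb{R}$, then
$$\mathrm{plim}\; g(X_n) = g(a).$$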
215

Remark:
There is a property similar to Slutsky's theorem that holds
for convergence in distribution

Theorem 5.10: (Rule for limiting distributions)

Let $X_1, X_2, \ldots$ be a sequence of random variables and let $Z$ be a random variable such that $X_n \xrightarrow{d} Z$. If $h : \mathbb{R} \to \mathbb{R}$ is a continuous function, then
$$h(X_n) \xrightarrow{d} h(Z).$$
Next:
Connection of both convergence concepts
216

Theorem 5.11: (Cramér-Theorem)
Let $X_1, X_2, \ldots$ and $Y_1, Y_2, \ldots$ be sequences of random variables, let $Z$ be a random variable and $a \in \mathbb{R}$ a constant. Assume that $\mathrm{plim}\; X_n = a$ and $Y_n \xrightarrow{d} Z$. Then
(a) $X_n + Y_n \xrightarrow{d} a + Z$,
(b) $X_n \cdot Y_n \xrightarrow{d} a \cdot Z$.
Example:
Let $X_1, \ldots, X_n$ be a random sample from $X$ with $E(X) = \mu$ and $\mathrm{Var}(X) = \sigma^2$

217

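It can be shown that
$$\mathrm{plim}\; S_n^2 = \mathrm{plim}\; \frac{1}{n-1}\sum_{i=1}^{n} \left(X_i - \bar X_n\right)^2 = \sigma^2$$
$$\mathrm{plim}\; \tilde S_n^2 = \mathrm{plim}\; \frac{1}{n}\sum_{i=1}^{n} \left(X_i - \bar X_n\right)^2 = \sigma^2$$

For $g_1(x) = x/\sigma^2$ Slutsky's theorem yields
$$\mathrm{plim}\; g_1\left(S_n^2\right) = \mathrm{plim}\; \frac{S_n^2}{\sigma^2} = g_1(\sigma^2) = 1$$
$$\mathrm{plim}\; g_1\left(\tilde S_n^2\right) = \mathrm{plim}\; \frac{\tilde S_n^2}{\sigma^2} = g_1(\sigma^2) = 1$$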

218

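For $g_2(x) = \sigma/\sqrt{x}$ Slutsky's theorem yields
$$\mathrm{plim}\; g_2\left(S_n^2\right) = \mathrm{plim}\; \frac{\sigma}{S_n} = g_2(\sigma^2) = 1$$
$$\mathrm{plim}\; g_2\left(\tilde S_n^2\right) = \mathrm{plim}\; \frac{\sigma}{\tilde S_n} = g_2(\sigma^2) = 1$$

From the central-limit theorem we know that
$$\sqrt{n}\,\frac{\bar X_n - \mu}{\sigma} \xrightarrow{d} U \sim N(0, 1)$$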

219

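Now, Cramér's theorem yields
$$g_2\left(S_n^2\right) \cdot \sqrt{n}\,\frac{\bar X_n - \mu}{\sigma} = \frac{\sigma}{S_n} \cdot \sqrt{n}\,\frac{\bar X_n - \mu}{\sigma} = \sqrt{n}\,\frac{\bar X_n - \mu}{S_n} \xrightarrow{d} 1 \cdot U = U \sim N(0, 1)$$
Analogously, Cramér's theorem yields
$$\sqrt{n}\,\frac{\bar X_n - \mu}{\tilde S_n} \xrightarrow{d} U \sim N(0, 1)$$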
220

5.2 Properties of Estimators


Content of Definition 5.3 (Slide 202):
An estimator is defined to be a statistic
(a function of the random sample)
→ there are several alternative estimators of the unknown parameter vector $\theta$
Example:
Assume that $X \sim N(0, \sigma^2)$ with unknown variance $\sigma^2$ and let $X_1, \ldots, X_n$ be a random sample from $X$
Alternative estimators of $\theta = \sigma^2$ are
$$\hat\theta_1 = \frac{1}{n}\sum_{i=1}^{n} \left(X_i - \bar X\right)^2 \qquad\text{and}\qquad \hat\theta_2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(X_i - \bar X\right)^2$$
221

Important questions:
Are there reasonable criteria according to which we can select
a good estimator?
How can we construct good estimators?
First goodness property of point estimators:
Concept of repeated sampling:
Draw several random samples from X
Consider the estimator for each random sample
An average of the estimates should be close to the
unknown parameter
(no systematic bias)
unbiasedness of an estimator
222

Definition 5.12: (Unbiasedness, bias)


An estimator $\hat\theta(X_1, \ldots, X_n)$ of the unknown parameter $\theta$ is defined to be an unbiased estimator if its expectation coincides with the parameter to be estimated, i.e. if
$$E\left[\hat\theta(X_1, \ldots, X_n)\right] = \theta.$$
The bias of the estimator is defined as
$$\mathrm{Bias}(\hat\theta) = E(\hat\theta) - \theta.$$
Remarks:
Definition 5.12 easily generalizes to the multivariate case
The bias of an unbiased estimator is equal to zero
223

Now:
Important and very general result
Theorem 5.13: (Unbiased estimators of E(X) and Var(X))
Let $X_1, \ldots, X_n$ be a random sample from $X$ where $X$ may be arbitrarily distributed with unknown expectation $\mu = E(X)$ and unknown variance $\sigma^2 = \mathrm{Var}(X)$. Then the estimators
$$\hat\mu(X_1, \ldots, X_n) = \bar X = \frac{1}{n}\sum_{i=1}^{n} X_i$$
and
$$\hat\sigma^2(X_1, \ldots, X_n) = S^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(X_i - \bar X\right)^2$$
are always unbiased estimators of the parameters $\mu = E(X)$ and $\sigma^2 = \mathrm{Var}(X)$, respectively.
224

Remarks:
Proof: Class
Note that no explicit distribution of X is required
Unbiasedness does not, in general, carry over to parameter transformations.
For example,
$$\hat\sigma = S = \sqrt{S^2}$$
is not an unbiased estimator of $\sigma = \mathrm{SD}(X) = \sqrt{\mathrm{Var}(X)}$
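These remarks can be verified exactly in a small case. The sketch below assumes a Bernoulli(1/2) population with $n = 3$ (so $\mathrm{Var}(X) = 1/4$ and $\mathrm{SD}(X) = 1/2$); the population, sample size and variable names are illustrative assumptions. It enumerates all $2^3$ equally likely samples and computes the expectations of $S^2$, of the $1/n$ variant and of $S$.

```python
from fractions import Fraction
from itertools import product
from math import sqrt

def sum_sq_dev(sample):
    """Sum of squared deviations from the sample mean (exact arithmetic)."""
    mean = Fraction(sum(sample), len(sample))
    return sum((x - mean) ** 2 for x in sample)

n = 3
samples = list(product((0, 1), repeat=n))   # all 2^n equally likely samples
weight = Fraction(1, len(samples))

# Expectations obtained by averaging over every possible sample.
e_s2 = sum(weight * sum_sq_dev(s) / (n - 1) for s in samples)      # E(S^2)
e_theta1 = sum(weight * sum_sq_dev(s) / n for s in samples)        # E of the 1/n variant
e_s = sum(float(weight) * sqrt(sum_sq_dev(s) / (n - 1)) for s in samples)  # E(S)

print(e_s2, e_theta1, e_s)
```

In this setting $E(S^2)$ equals the true variance, the $1/n$ variant has expectation $(n-1)/n \cdot \sigma^2$, and $E(S)$ lies below $\sigma$.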
Question:
How can we compare two alternative unbiased estimators of
the parameter ?
225

Definition 5.14: (Relative efficiency)


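Let $\hat\theta_1$ and $\hat\theta_2$ be two unbiased estimators of the unknown parameter $\theta$. $\hat\theta_1$ is defined to be relatively more efficient than $\hat\theta_2$ if
$$\mathrm{Var}(\hat\theta_1) \le \mathrm{Var}(\hat\theta_2)$$
for all possible parameter values of $\theta$ and
$$\mathrm{Var}(\hat\theta_1) < \mathrm{Var}(\hat\theta_2)$$
for at least one possible parameter value of $\theta$.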

226

Example:
Assume $\theta = E(X)$
Consider the estimators
$$\hat\theta_1(X_1, \ldots, X_n) = \frac{1}{n}\sum_{i=1}^{n} X_i$$
$$\hat\theta_2(X_1, \ldots, X_n) = \frac{X_1}{2} + \frac{1}{2(n-1)}\sum_{i=2}^{n} X_i$$

Which estimator is relatively more efficient?


(Class)
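One possible way to work through this comparison is sketched here, assuming $n \ge 2$, i.i.d. sampling and a finite variance $\sigma^2 = \mathrm{Var}(X) > 0$ (the symbol $\sigma^2$ is introduced for this sketch). Both estimators are unbiased, so Definition 5.14 applies:

$$
\begin{aligned}
E(\hat\theta_2) &= \frac{\theta}{2} + \frac{(n-1)\,\theta}{2(n-1)} = \theta, \\
\mathrm{Var}(\hat\theta_1) &= \frac{\sigma^2}{n}, \\
\mathrm{Var}(\hat\theta_2) &= \frac{\sigma^2}{4} + \frac{(n-1)\,\sigma^2}{4(n-1)^2} = \frac{n\,\sigma^2}{4(n-1)}, \\
\mathrm{Var}(\hat\theta_2) - \mathrm{Var}(\hat\theta_1) &= \sigma^2\,\frac{n^2 - 4(n-1)}{4n(n-1)} = \sigma^2\,\frac{(n-2)^2}{4n(n-1)} \ge 0.
\end{aligned}
$$

Under these assumptions the difference is strictly positive for $n > 2$, so $\hat\theta_1$ is relatively more efficient; for $n = 2$ the two estimators coincide.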
Question:
How can we compare two estimators if (at least) one estimator is biased?
227

Definition 5.15: (Mean-squared error)


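Let $\hat\theta$ be an estimator of the parameter $\theta$. The mean-squared error of the estimator $\hat\theta$ is defined to be
$$\mathrm{MSE}(\hat\theta) = E\left[\left(\hat\theta - \theta\right)^2\right] = \mathrm{Var}(\hat\theta) + \left[\mathrm{Bias}(\hat\theta)\right]^2$$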

Remarks:
If an estimator is unbiased, then its MSE is equal to the
variance of the estimator
The MSE of an estimator depends on the value of the
unknown parameter
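The decomposition of the MSE into variance and squared bias follows by adding and subtracting $E(\hat\theta)$; a short derivation, using only Definitions 5.12 and 5.15:

$$
\begin{aligned}
E\left[(\hat\theta - \theta)^2\right]
&= E\left[\left(\hat\theta - E(\hat\theta) + E(\hat\theta) - \theta\right)^2\right] \\
&= E\left[\left(\hat\theta - E(\hat\theta)\right)^2\right]
 + 2\left(E(\hat\theta) - \theta\right)\underbrace{E\left[\hat\theta - E(\hat\theta)\right]}_{=\,0}
 + \left(E(\hat\theta) - \theta\right)^2 \\
&= \mathrm{Var}(\hat\theta) + \left[\mathrm{Bias}(\hat\theta)\right]^2 .
\end{aligned}
$$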
228

Next:
Comparison of alternative estimators via their MSEs
Definition 5.16: (MSE efficiency)
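Let $\hat\theta_1$ and $\hat\theta_2$ be two alternative estimators of the unknown parameter $\theta$. $\hat\theta_1$ is defined to be more MSE efficient than $\hat\theta_2$ if
$$\mathrm{MSE}(\hat\theta_1) \le \mathrm{MSE}(\hat\theta_2)$$
for all possible parameter values of $\theta$ and
$$\mathrm{MSE}(\hat\theta_1) < \mathrm{MSE}(\hat\theta_2)$$
for at least one possible parameter value of $\theta$.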
229

[Figure: Unbiased vs biased estimator; sampling densities of $\hat\theta_1(X_1, \ldots, X_n)$ and $\hat\theta_2(X_1, \ldots, X_n)$]

230

Remarks:
Frequently, two estimators of $\theta$ are not comparable with respect to MSE efficiency since their respective MSE curves cross
There is no general mathematical principle for constructing MSE efficient estimators
However, there are methods for finding the estimator with uniformly minimum variance among all unbiased estimators
→ restriction to the class of all unbiased estimators
These specific methods are not discussed here
(Rao-Blackwell-Theorem, Lehmann-Scheffé-Theorem)
Here, we consider only one important result
231

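Theorem 5.17: (Cramér-Rao lower bound for variance)
Let $X_1, \ldots, X_n$ be a random sample from $X$ and let $\theta$ be a parameter to be estimated. Consider the joint density of the random sample $f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)$ and define the value
$$\mathrm{CR}(\theta) \equiv \left\{ E\left[\left(\frac{\partial \ln f_{X_1,\ldots,X_n}(X_1, \ldots, X_n)}{\partial \theta}\right)^2\right] \right\}^{-1}.$$
Under certain (regularity) conditions we have for any unbiased estimator $\hat\theta(X_1, \ldots, X_n)$
$$\mathrm{Var}(\hat\theta) \ge \mathrm{CR}(\theta).$$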

232

Remarks:
The value $\mathrm{CR}(\theta)$ is the minimal variance that any unbiased estimator can take on
→ goodness criterion for unbiased estimators
If for an unbiased estimator $\hat\theta(X_1, \ldots, X_n)$
$$\mathrm{Var}(\hat\theta) = \mathrm{CR}(\theta),$$
then $\hat\theta$ is called UMVUE
(Uniformly Minimum-Variance Unbiased Estimator)
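A worked illustration of the bound, assuming $X \sim N(\mu, \sigma^2)$ with known $\sigma^2$ and $\theta = \mu$ (this specific setting is an assumption chosen for the sketch):

$$
\begin{aligned}
\ln f_{X_1,\ldots,X_n}(X_1, \ldots, X_n) &= -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(X_i - \mu)^2, \\
\frac{\partial \ln f_{X_1,\ldots,X_n}}{\partial \mu} &= \frac{1}{\sigma^2}\sum_{i=1}^{n}(X_i - \mu), \\
E\left[\left(\frac{\partial \ln f_{X_1,\ldots,X_n}}{\partial \mu}\right)^2\right] &= \frac{1}{\sigma^4}\,\mathrm{Var}\left(\sum_{i=1}^{n} X_i\right) = \frac{n}{\sigma^2}, \\
\mathrm{CR}(\mu) &= \frac{\sigma^2}{n} = \mathrm{Var}(\bar X).
\end{aligned}
$$

In this setting the unbiased estimator $\bar X$ attains the lower bound.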

233

Second goodness property of point estimators:


Consider an increasing sample size ($n \to \infty$)
Notation: $\hat\theta_n(X_1, \ldots, X_n) = \hat\theta(X_1, \ldots, X_n)$
Analysis of the asymptotic distribution properties of $\hat\theta_n$
→ consistency of an estimator
Definition 5.18: ((Weak) consistency)
The estimator $\hat\theta_n(X_1, \ldots, X_n)$ is called (weakly) consistent for $\theta$ if it converges in probability to $\theta$, i.e. if
$$\mathrm{plim}\; \hat\theta_n(X_1, \ldots, X_n) = \theta.$$
234

Example:
Assume that $X \sim N(\mu, \sigma^2)$ with known $\sigma^2$ (e.g. $\sigma^2 = 1$)
Consider the following two estimators of $\mu$:
$$\hat\mu_n(X_1, \ldots, X_n) = \frac{1}{n}\sum_{i=1}^{n} X_i$$
$$\hat\mu_n^*(X_1, \ldots, X_n) = \frac{1}{n}\sum_{i=1}^{n} X_i + \frac{2}{n}$$
$\hat\mu_n$ is (weakly) consistent for $\mu$
(Theorem 5.6, Slide 210: weak law of large numbers)

235

$\hat\mu_n^*$ is (weakly) consistent for $\mu$
(this follows from Theorem 5.9(a), Slide 215)
Exact distribution of $\hat\mu_n$:
$$\hat\mu_n \sim N(\mu, \sigma^2/n)$$
(linear transformation of the normal distribution)
Exact distribution of $\hat\mu_n^*$:
$$\hat\mu_n^* \sim N(\mu + 2/n, \sigma^2/n)$$
(linear transformation of the normal distribution)
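Because both exact distributions are normal, the probability that each estimator misses $\mu$ by at least $\varepsilon$ can be evaluated directly. The sketch below assumes $\mu = 0$, $\sigma^2 = 1$ and $\varepsilon = 0.25$; these values and the function name `miss_prob` are illustrative assumptions.

```python
from statistics import NormalDist

def miss_prob(n, shift, mu=0.0, sigma=1.0, eps=0.25):
    """P(|estimator - mu| >= eps) when estimator ~ N(mu + shift, sigma^2 / n)."""
    dist = NormalDist(mu + shift, sigma / n ** 0.5)
    return 1.0 - (dist.cdf(mu + eps) - dist.cdf(mu - eps))

for n in (2, 10, 20, 200, 2000):
    plain = miss_prob(n, shift=0.0)        # sample mean
    shifted = miss_prob(n, shift=2.0 / n)  # sample mean plus 2/n (biased for finite n)
    print(n, round(plain, 4), round(shifted, 4))
```

Both probabilities shrink towards zero as $n$ grows, which is the content of weak consistency; the shifted estimator is merely worse for small $n$ because of its bias $2/n$.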

236

[Figure: Pdfs of the estimator $\hat\mu_n$ for $n = 2, 10, 20$ ($\mu = 0$, $\sigma^2 = 1$)]

237

[Figure: Pdfs of the estimator $\hat\mu_n^*$ for $n = 2, 10, 20$ ($\mu = 0$, $\sigma^2 = 1$)]

238

Remarks:
Sufficient (but not necessary) condition for consistency:
$\lim_{n \to \infty} E(\hat\theta_n) = \theta$ (asymptotic unbiasedness) and
$\lim_{n \to \infty} \mathrm{Var}(\hat\theta_n) = 0$
Possible properties of an estimator:
consistent and unbiased
inconsistent and unbiased
consistent and biased
inconsistent and biased

239

Next:
Application of the central-limit theorem to estimators
asymptotic normality of an estimator
Definition 5.19: (Asymptotic normality)
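An estimator $\hat\theta_n(X_1, \ldots, X_n)$ of the parameter $\theta$ is called asymptotically normal if there exist (1) a sequence of real constants $\theta_1, \theta_2, \ldots$ and (2) a function $V(\theta)$ such that
$$\sqrt{n}\left(\hat\theta_n - \theta_n\right) \xrightarrow{d} U \sim N(0, V(\theta)).$$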
240

Remarks:
Alternative notation:
$$\hat\theta_n \overset{\text{appr.}}{\sim} N(\theta_n, V(\theta)/n)$$
The concept of asymptotic normality naturally extends to multivariate settings

241

5.3 Methods of Estimation


Up to now:
Definitions + properties of estimators
Next:
Construction of estimators
Three classical methods:
Method of Least Squares (LS)
Method of Moments (MM)
Maximum-Likelihood method (ML)
242

Remarks:
There are further methods
(e.g. the Generalized Method-of-Moments, GMM)
Here: focus on ML estimation

243

5.3.1 Least-Squares Estimators


History:
Introduced by
A.M. Legendre (1752-1833)
C.F. Gauß (1777-1855)
Idea:
Approximate the (noisy) observations $x_1, \ldots, x_n$ by functions $g_i(\theta_1, \ldots, \theta_m)$, $i = 1, \ldots, n$, $m < n$, such that
$$S(x_1, \ldots, x_n; \theta) = \sum_{i=1}^{n} \left[x_i - g_i(\theta)\right]^2 \longrightarrow \min_{\theta}$$
The LS-estimator is then defined to be
$$\hat\theta(X_1, \ldots, X_n) = \arg\min_{\theta}\; S(X_1, \ldots, X_n; \theta)$$
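A minimal worked case, assuming the simplest possible choice $g_i(\theta) = \theta$ for all $i$ (so $m = 1$; this choice is an illustration, not taken from the slides):

$$
\begin{aligned}
S(x_1, \ldots, x_n; \theta) &= \sum_{i=1}^{n} (x_i - \theta)^2, \\
\frac{\partial S}{\partial \theta} &= -2\sum_{i=1}^{n} (x_i - \theta) = 0
\;\Longrightarrow\; \hat\theta = \frac{1}{n}\sum_{i=1}^{n} x_i, \\
\frac{\partial^2 S}{\partial \theta^2} &= 2n > 0 .
\end{aligned}
$$

Under this assumption the LS-estimator is the arithmetic sample mean, $\hat\theta(X_1, \ldots, X_n) = \bar X$.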

244

Remark:
The LS-method is central to the linear regression model
(cf. the courses Econometrics I + II)

245

5.3.2 Method-of-moments Estimators


History:
Introduced by K. Pearson (1857-1936)
Definition 5.20: (Theoretical and sample moments)
(a) Let $X$ be a random variable with expectation $E(X)$. The theoretical p-th moment of $X$, denoted by $\mu'_p$, is defined as
$$\mu'_p = E(X^p).$$
The theoretical p-th central moment of $X$, denoted by $\mu_p$, is defined as
$$\mu_p = E\left\{[X - E(X)]^p\right\}.$$
246

(b) Let $X_1, \ldots, X_n$ be a random sample from $X$ and let $\bar X$ denote the arithmetic sample mean. Then the p-th sample moment, denoted by $\hat\mu'_p$, is defined as
$$\hat\mu'_p = \frac{1}{n}\sum_{i=1}^{n} X_i^p.$$
The p-th central sample moment, denoted by $\hat\mu_p$, is defined as
$$\hat\mu_p = \frac{1}{n}\sum_{i=1}^{n} \left(X_i - \bar X\right)^p.$$

247

Remarks:
The theoretical moments $\mu'_p$ and $\mu_p$ had already been introduced in Definition 2.21 (Slide 76)
The sample moments $\hat\mu'_p$ and $\hat\mu_p$ are estimators of the theoretical moments $\mu'_p$ and $\mu_p$
The arithmetic sample mean is the 1st sample moment of
X1, . . . , Xn
The sample variance is the 2nd central sample moment of
X1 , . . . , X n

248

General setting:
Based on the random sample $X_1, \ldots, X_n$ from $X$ estimate the $r$ unknown parameters $\theta_1, \ldots, \theta_r$

Basic idea of the method of moments:

1. Express the $r$ theoretical moments as functions of the $r$ unknown parameters:
$$\mu'_1 = g_1(\theta_1, \ldots, \theta_r), \quad \ldots, \quad \mu'_r = g_r(\theta_1, \ldots, \theta_r)$$

249

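2. Express the $r$ unknown parameters as functions of the $r$ theoretical moments:
$$\theta_1 = h_1(\mu_1, \ldots, \mu_r, \mu'_1, \ldots, \mu'_r), \quad \ldots, \quad \theta_r = h_r(\mu_1, \ldots, \mu_r, \mu'_1, \ldots, \mu'_r)$$
3. Replace the theoretical moments by the sample moments:
$$\hat\theta_1(X_1, \ldots, X_n) = h_1(\hat\mu_1, \ldots, \hat\mu_r, \hat\mu'_1, \ldots, \hat\mu'_r), \quad \ldots, \quad \hat\theta_r(X_1, \ldots, X_n) = h_r(\hat\mu_1, \ldots, \hat\mu_r, \hat\mu'_1, \ldots, \hat\mu'_r)$$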

250

Example: (Exponential distribution)


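Let the random variable $X$ have an exponential distribution with parameter $\lambda > 0$ and pdf
$$f_X(x) = \begin{cases} \lambda e^{-\lambda x}, & \text{for } x > 0 \\ 0, & \text{otherwise} \end{cases}$$
The expectation and the variance of $X$ are given by
$$E(X) = \frac{1}{\lambda}, \qquad \mathrm{Var}(X) = \frac{1}{\lambda^2}$$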

251

Method-of-moments estimator via the expectation:


1. We know that
$$E(X) = \mu'_1 = \frac{1}{\lambda}$$
2. This implies
$$\lambda = \frac{1}{\mu'_1}$$
3. Method-of-moments estimator for $\lambda$:
$$\hat\lambda(X_1, \ldots, X_n) = \frac{1}{\frac{1}{n}\sum_{i=1}^{n} X_i}$$
252

Method-of-moments estimator via the variance:


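1. We know that
$$\mathrm{Var}(X) = \mu_2 = \frac{1}{\lambda^2}$$
2. This implies
$$\lambda = \sqrt{\frac{1}{\mu_2}}$$
3. Method-of-moments estimator for $\lambda$:
$$\hat\lambda(X_1, \ldots, X_n) = \sqrt{\frac{1}{\frac{1}{n}\sum_{i=1}^{n} \left(X_i - \bar X\right)^2}}$$
→ Method-of-moments estimators of an unknown parameter are not unique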
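The two method-of-moments estimators for $\lambda$ can be computed side by side. This is a minimal sketch; the function names and the five data values are hypothetical and serve only to show that the two estimates generally differ.

```python
from math import sqrt

def mom_rate_from_mean(xs):
    """Estimate lambda via E(X) = 1/lambda: invert the first sample moment."""
    return 1.0 / (sum(xs) / len(xs))

def mom_rate_from_variance(xs):
    """Estimate lambda via Var(X) = 1/lambda^2: invert the second central sample moment."""
    mean = sum(xs) / len(xs)
    m2 = sum((x - mean) ** 2 for x in xs) / len(xs)
    return sqrt(1.0 / m2)

data = [0.5, 1.5, 1.0, 2.0, 1.0]  # hypothetical observations
lam_mean = mom_rate_from_mean(data)
lam_var = mom_rate_from_variance(data)
print(lam_mean, lam_var)
```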
253

Remarks:
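Method-of-moments estimators are (weakly) consistent since
$$\begin{aligned}
\mathrm{plim}\; \hat\theta_1 &= \mathrm{plim}\; h_1(\hat\mu_1, \ldots, \hat\mu_r, \hat\mu'_1, \ldots, \hat\mu'_r) \\
&= h_1(\mathrm{plim}\; \hat\mu_1, \ldots, \mathrm{plim}\; \hat\mu_r, \mathrm{plim}\; \hat\mu'_1, \ldots, \mathrm{plim}\; \hat\mu'_r) \\
&= h_1(\mu_1, \ldots, \mu_r, \mu'_1, \ldots, \mu'_r) \\
&= \theta_1
\end{aligned}$$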
In general, method-of-moments estimators are not unbiased
Method-of-moments estimators typically are asymptotically
normal
The asymptotic variances are often hard to determine
254

5.3.3 Maximum-Likelihood Estimators


History:
Introduced by Ronald Fisher (1890-1962)
Basic idea behind ML estimation:
We estimate the unknown parameters $\theta_1, \ldots, \theta_r$ in such a
manner that the likelihood of the observed sample $x_1, \ldots, x_n$,
which we express as a function of the unknown parameters,
becomes maximal

255

Example:
Consider an urn containing black and white balls
The ratio of numbers is known to be 3 : 1
It is not known if the black or the white balls are more numerous
Draw n balls with replacement
Let X denote the number of black balls in the sample
Discrete density of $X$:
$$P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}, \qquad x \in \{0, 1, \ldots, n\}, \quad p \in \{0.25, 0.75\}$$
(binomial distribution)

256

$p \in \{0.25, 0.75\}$ is the parameter to be estimated
Consider a particular sample of size $n = 3$
→ potential realizations:

Number of black balls: x      0        1        2        3
P(X = x; p = 0.25)          27/64    27/64    9/64     1/64
P(X = x; p = 0.75)          1/64     9/64     27/64    27/64

Intuitive estimation:
We estimate $p$ by that value which ex-ante maximizes the probability of observing the actual realization $x$
$$\hat p = \begin{cases} 0.25, & \text{for } x = 0, 1 \\ 0.75, & \text{for } x = 2, 3 \end{cases}$$
→ Maximum-Likelihood (ML) estimation
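The urn example can be reproduced with exact arithmetic. The sketch below (function and variable names are illustrative) evaluates the binomial probabilities for both admissible values of $p$ and picks, for each possible $x$, the value with the larger likelihood.

```python
from fractions import Fraction
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p), computed exactly."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

candidates = (Fraction(1, 4), Fraction(3, 4))  # the two admissible values of p
n = 3

def ml_estimate(x):
    """Return the candidate p that maximizes the likelihood of observing x."""
    return max(candidates, key=lambda p: binom_pmf(x, n, p))

for x in range(n + 1):
    row = [binom_pmf(x, n, p) for p in candidates]
    print(x, row, ml_estimate(x))
```

Because the parameter space contains only two points, maximization reduces to a direct comparison; no derivative is needed.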

257

Next:
Formalization of the ML estimation technique
Notions:
Likelihood-, Loglikelihood function
ML estimator
Definition 5.21: (Likelihood function)
The likelihood function of $n$ random variables $X_1, \ldots, X_n$ is defined to be the joint density of the $n$ random variables, say $f_{X_1,\ldots,X_n}(x_1, \ldots, x_n; \theta)$, which is considered to be a function of the parameter vector $\theta$.
258

Remarks:
If $X_1, \ldots, X_n$ is a random sample from the continuous random variable $X$ with pdf $f_X(x; \theta)$, then
$$f_{X_1,\ldots,X_n}(x_1, \ldots, x_n; \theta) = \prod_{i=1}^{n} f_{X_i}(x_i; \theta) = \prod_{i=1}^{n} f_X(x_i; \theta)$$
The likelihood function is often denoted by $L(\theta; x_1, \ldots, x_n)$ or $L(\theta)$, i.e. in the above-mentioned case
$$L(\theta; x_1, \ldots, x_n) = L(\theta) = \prod_{i=1}^{n} f_X(x_i; \theta)$$

259

If $X_1, \ldots, X_n$ are a sample from a discrete random variable $X$, the likelihood function is given by
$$L(\theta; x_1, \ldots, x_n) = P(X_1 = x_1, \ldots, X_n = x_n; \theta) = \prod_{i=1}^{n} P(X = x_i; \theta)$$
(likelihood = probability that the observed sample occurs)


Example:
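Let $X_1, \ldots, X_n$ be a random sample from $X \sim N(\mu, \sigma^2)$. Then $\theta = (\mu, \sigma^2)'$ and the likelihood function is given by
$$\begin{aligned}
L(\theta; x_1, \ldots, x_n) &= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left((x_i - \mu)/\sigma\right)^2} \\
&= \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n} (x_i - \mu)^2\right\}
\end{aligned}$$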
260

Definition 5.22: (Maximum-likelihood estimator)


Let $L(\theta; x_1, \ldots, x_n)$ be the likelihood function of the random variables $X_1, \ldots, X_n$. If $\hat\theta$ [where $\hat\theta = \hat\theta(x_1, \ldots, x_n)$ is a function of the observations $x_1, \ldots, x_n$] is the value of $\theta$ which maximizes $L(\theta; x_1, \ldots, x_n)$, then $\hat\theta(X_1, \ldots, X_n)$ is the maximum-likelihood estimator of $\theta$.
Remarks:
We obtain the ML estimator via (1) maximizing the likelihood function
$$L(\hat\theta; x_1, \ldots, x_n) = \max_{\theta}\; L(\theta; x_1, \ldots, x_n),$$
and (2) replacing the realizations $x_1, \ldots, x_n$ by the random variables $X_1, \ldots, X_n$
261

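It is often easier to maximize the loglikelihood function
$$\ln[L(\theta; x_1, \ldots, x_n)]$$
($L(\theta)$ and $\ln[L(\theta)]$ have their maxima at the same value of $\theta$)
We derive $\hat\theta = (\hat\theta_1, \ldots, \hat\theta_r)'$ by solving the system of equations
$$\frac{\partial}{\partial \theta_1} \ln[L(\theta; x_1, \ldots, x_n)] = 0, \quad \ldots, \quad \frac{\partial}{\partial \theta_r} \ln[L(\theta; x_1, \ldots, x_n)] = 0$$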

262

Example:
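Let $X_1, \ldots, X_n$ be a random sample from $X \sim N(\mu, \sigma^2)$ with the likelihood function
$$L(\mu, \sigma^2) = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n} (x_i - \mu)^2\right\}$$
The loglikelihood function is given by
$$L^*(\mu, \sigma^2) = \ln[L(\mu, \sigma^2)] = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (x_i - \mu)^2$$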

263

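The partial derivatives are given by
$$\frac{\partial L^*(\mu, \sigma^2)}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n} (x_i - \mu)$$
and
$$\frac{\partial L^*(\mu, \sigma^2)}{\partial \sigma^2} = -\frac{n}{2}\,\frac{1}{\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n} (x_i - \mu)^2$$
Setting these equal to zero, solving the system of equations and replacing the realizations by the random variables yields the ML estimators
$$\hat\mu(X_1, \ldots, X_n) = \frac{1}{n}\sum_{i=1}^{n} X_i = \bar X$$
$$\hat\sigma^2(X_1, \ldots, X_n) = \frac{1}{n}\sum_{i=1}^{n} \left(X_i - \bar X\right)^2$$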
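The closed-form ML estimators for the normal model can be checked numerically. In the sketch below the data values and function names are hypothetical; the loglikelihood is evaluated at the closed-form estimates and at a few nearby parameter values, none of which should give a larger value.

```python
from math import log, pi

def loglik(mu, sigma2, xs):
    """Normal loglikelihood: -n/2 ln(2 pi) - n/2 ln(sigma2) - sum (x - mu)^2 / (2 sigma2)."""
    n = len(xs)
    ss = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * log(2 * pi) - 0.5 * n * log(sigma2) - ss / (2 * sigma2)

def ml_normal(xs):
    """Closed-form ML estimates: sample mean and the 1/n variance."""
    n = len(xs)
    mu_hat = sum(xs) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n
    return mu_hat, sigma2_hat

data = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical observations
mu_hat, sigma2_hat = ml_normal(data)
best = loglik(mu_hat, sigma2_hat, data)
# The loglikelihood at nearby parameter values should not exceed the maximum.
neighbours = [loglik(mu_hat + dm, sigma2_hat + ds, data)
              for dm in (-0.1, 0.0, 0.1) for ds in (-0.1, 0.0, 0.1)]
print(mu_hat, sigma2_hat, best >= max(neighbours))
```

Note that the ML estimator of $\sigma^2$ uses the divisor $n$, not $n - 1$, so it differs from the unbiased estimator $S^2$.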
264

General properties of ML estimators:


Distributional assumptions are necessary
Under rather mild regularity conditions ML estimators have
nice properties:
1. If $\hat\theta$ is the ML estimator of $\theta$, then $g(\hat\theta)$ is the ML estimator of $g(\theta)$
(equivariance property)
2. (Weak) consistency:
$$\mathrm{plim}\; \hat\theta_n = \theta$$

265

3. Asymptotic normality:
$$\sqrt{n}\left(\hat\theta_n - \theta\right) \xrightarrow{d} U \sim N(0, V(\theta))$$
4. Asymptotic efficiency:
$V(\theta)$ coincides with the Cramér-Rao lower bound
5. Direct computation (numerical methods)
6. Quasi-ML estimation:
ML estimators computed on the basis of normally distributed random samples are robust even if the random
sample actually is not normally distributed
(robustness against distribution misspecification)
266

6. Hypothesis Testing
Setting:
Let X represent the random experiment under consideration
Let X have the unknown cdf FX (x)
We are interested in an unknown parameter $\theta$ in the distribution of $X$

Now:
Testing of a statistical hypothesis on the unknown $\theta$ on the
basis of a random sample $X_1, \ldots, X_n$
267

Example 1:
In our local pub the glasses are said to contain 0.4 litres
of beer. We suspect that in many cases the glasses actually
contain less than 0.4 litres of beer
Let $X$ represent the process of filling a glass of beer
Let $\theta = E(X)$ denote the expected amount of beer filled in
one glass
On the basis of a random sample $X_1, \ldots, X_n$ we would like
to test
$$\theta = 0.4 \qquad\text{versus}\qquad \theta < 0.4$$

268

Example 2:
We know from past data that the risk of a specific stock
(measured by the standard deviation of the stock return) has
been equal to 25%. Now, there is a change in the managerial
board of the firm. Does this change affect the risk of the
stock?
Let $X$ represent the stock return
Let $\theta = \sqrt{\mathrm{Var}(X)} = \mathrm{SD}(X)$ denote the standard deviation of the return
On the basis of a random sample $X_1, \ldots, X_n$ we would like
to test
$$\theta = 0.25 \qquad\text{versus}\qquad \theta \ne 0.25$$
269

6.1 Basic Terminology


Definition 6.1: (Parameter test)
Let $X$ be a random variable and let $\theta$ be an unknown parameter in the distribution of $X$. A parameter test constitutes a statistical procedure for deciding on a hypothesis concerning the unknown parameter $\theta$ on the basis of a random sample $X_1, \ldots, X_n$ from $X$.
Statistical hypothesis-testing problem:
Let $\Theta$ denote the set of all possible parameter values
(i.e. $\theta \in \Theta$; we call $\Theta$ the parameter space)
Let $\Theta_0 \subset \Theta$ be a subset of the parameter space
270

Consider the following statements:
$$H_0 : \theta \in \Theta_0 \qquad\text{versus}\qquad H_1 : \theta \in \Theta \setminus \Theta_0 = \Theta_1$$
$H_0$ is called the null hypothesis, $H_1$ is called the alternative hypothesis

Types of hypotheses:
If $|\Theta_0| = 1$ (i.e. $\Theta_0 = \{\theta_0\}$) and $H_0 : \theta = \theta_0$, then $H_0$ is called simple
Otherwise $H_0$ is called composite
An analogous terminology applies to $H_1$
271

Types of hypothesis tests:


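Let $\theta_0 \in \Theta$ be a real constant. Then
$$H_0 : \theta = \theta_0 \qquad\text{versus}\qquad H_1 : \theta \ne \theta_0$$
is called a two-sided test
The tests
$$H_0 : \theta \le \theta_0 \qquad\text{versus}\qquad H_1 : \theta > \theta_0$$
and
$$H_0 : \theta \ge \theta_0 \qquad\text{versus}\qquad H_1 : \theta < \theta_0$$
are called one-sided tests (right- and left-sided tests)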

272

Next:
Consider the general testing problem
$$H_0 : \theta \in \Theta_0 \qquad\text{versus}\qquad H_1 : \theta \in \Theta_1 = \Theta \setminus \Theta_0$$

General procedure:
Based on a random sample $X_1, \ldots, X_n$ from $X$ decide on
whether to reject $H_0$ in favor of $H_1$ or not
Explicit procedure:
Select an appropriate test statistic $T(X_1, \ldots, X_n)$ and determine an appropriate critical region $K \subset \mathbb{R}$
Decision:
$T(X_1, \ldots, X_n) \in K \;\Longrightarrow\;$ reject $H_0$
$T(X_1, \ldots, X_n) \notin K \;\Longrightarrow\;$ do not reject (accept) $H_0$
273

Notice:
$T(X_1, \ldots, X_n)$ is a random variable
→ the decision is random
→ possibility of wrong decisions
Types of errors:

                  Decision based on test
Reality       reject H0           accept H0
H0 true       type I error        correct decision
H0 false      correct decision    type II error

Conclusion:
Type I error: test rejects H0 when H0 is true
Type II error: test accepts H0 when H0 is false
274

When do wrong decisions occur?


The type I error occurs if
$$T(X_1, \ldots, X_n) \in K$$
when for the true parameter $\theta$ we have $\theta \in \Theta_0$
The type II error occurs if
$$T(X_1, \ldots, X_n) \notin K,$$
when for the true parameter $\theta$ we have $\theta \in \Theta_1$

275

Question:
When does a hypothesis test of the form
$$H_0 : \theta \in \Theta_0 \qquad\text{versus}\qquad H_1 : \theta \in \Theta_1 = \Theta \setminus \Theta_0$$
have good properties?


Intuitively:
A test is good if it possesses low probabilities of committing
type I and type II errors
Next:
Formal instrument for measuring type I and type II error
probabilities
276

Definition 6.2: (Power function of a test)


Consider a hypothesis test of the general form given on Slide 276 with the test statistic $T(X_1, \ldots, X_n)$ and an appropriately chosen critical region $K$. The power function of the test, denoted by $G(\theta)$, is defined to be the probability that the test rejects $H_0$ when $\theta$ is the true (unknown) parameter. Formally,
$$G : \Theta \longrightarrow [0, 1]$$
with
$$G(\theta) = P\left(T(X_1, \ldots, X_n) \in K\right).$$

277

Remark:
Using the power function of a test, we can express the probabilities of the type I error as
$$G(\theta) \quad \text{for all } \theta \in \Theta_0$$
and the probabilities of the type II error as
$$1 - G(\theta) \quad \text{for all } \theta \in \Theta_1$$

Question:
What should an ideal test look like?
Intuitively:
A test would be ideal if the probabilities of both the type I
and the type II errors were constantly equal to zero
→ the test would yield the correct decision with probability 1
278

Example:
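For $\theta_0 \in \Theta$ consider the testing problem
$$H_0 : \theta \le \theta_0 \qquad\text{versus}\qquad H_1 : \theta > \theta_0$$
[Figure: Power function of an ideal test; $G(\theta) = 0$ for $\theta \le \theta_0$ and $G(\theta) = 1$ for $\theta > \theta_0$]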

279

Unfortunately:
It can be shown mathematically that, in general, such an
ideal test does not exist

Way out:
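For the selected test statistic $T(X_1, \ldots, X_n)$ consider the maximal type-I-error probability
$$\alpha = \max_{\theta \in \Theta_0} \left\{P\left(T(X_1, \ldots, X_n) \in K\right)\right\} = \max_{\theta \in \Theta_0} \left\{G(\theta)\right\}$$
Now, fix the critical region $K$ in such a way that $\alpha$ takes on a prespecified small value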
280

→ all type-I-error probabilities are less than or equal to $\alpha$
Frequently used $\alpha$-values: $\alpha = 0.01$, $\alpha = 0.05$, $\alpha = 0.1$
Definition 6.3: (Size of test)
Consider a hypothesis test of the general form given on Slide 276 with the test statistic $T(X_1, \ldots, X_n)$ and an appropriately chosen critical region $K$. The size of the test (also known as the significance level of the test) is defined to be the maximal type-I-error probability
$$\alpha = \max_{\theta \in \Theta_0} \left\{P\left(T(X_1, \ldots, X_n) \in K\right)\right\} = \max_{\theta \in \Theta_0} \left\{G(\theta)\right\}.$$
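A power function and the resulting size can be computed explicitly in a simple case. The sketch below assumes a left-sided test $H_0 : \theta \ge \theta_0$ versus $H_1 : \theta < \theta_0$ with $\theta_0 = 0.4$, a normal population with known standard deviation $0.05$, $n = 25$, the test statistic $T = \sqrt{n}(\bar X - \theta_0)/\sigma$ and the critical region $K = (-\infty, -z_{1-\alpha})$. All of these settings are illustrative assumptions and are not taken from the slides.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def power(theta, theta0=0.4, sigma=0.05, n=25, alpha=0.05):
    """G(theta) = P(T in K) for T = sqrt(n) (mean - theta0) / sigma and K = (-inf, -z_{1-alpha})."""
    z = STD_NORMAL.inv_cdf(1 - alpha)
    shift = n ** 0.5 * (theta - theta0) / sigma   # mean of T when theta is the true value
    return STD_NORMAL.cdf(-z - shift)             # T ~ N(shift, 1)

for theta in (0.36, 0.38, 0.40, 0.42):
    print(theta, round(power(theta), 4))
```

Under these assumptions $G$ is decreasing in $\theta$, so the maximal type-I-error probability over $\Theta_0 = [0.4, \infty)$ is attained at the boundary $\theta_0$ and equals $\alpha$.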

281

Implications of this test construction:


The probability of the test rejecting $H_0$ when in fact $H_0$ is true (i.e. the type-I-error probability) is at most $\alpha$
→ if, for a concrete sample, the test rejects $H_0$, we can be quite sure that $H_0$ is in fact false
(we say that $H_1$ is statistically significant)
By contrast, we cannot control the type-II-error probability (i.e. the probability of the test accepting $H_0$ when in fact $H_0$ is false)
→ if, for a concrete sample, the test accepts $H_0$, then there is no probability assessment of a potentially wrong decision
(acceptance of $H_0$ simply means: the data are not inconsistent with $H_0$)
282

Therefore:
It is crucial how to formulate H0 and H1
We formulate our research hypothesis in H1
(hoping that, for a concrete sample, our test rejects H0)
Example:
Consider Example 1 on Slide 268
If, for a concrete sample, our test rejects H0 we can be quite
sure that (on average) the glasses contain less than 0.4 litres
of beer
If our test accepts H0 we cannot make a statistically significant statement
(the data are not inconsistent with H0)
283

6.2 Classical Testing Procedures


Next:
Three general classical testing procedures based on the loglikelihood function of a random sample

Setting:
Let X1, . . . , Xn be a random sample from X
Let $\theta \in \mathbb{R}$ be an unknown parameter
Let $L(\theta) = L(\theta; x_1, \ldots, x_n)$ denote the likelihood function
284

Let $\ln[L(\theta)]$ denote the loglikelihood function
Assume $g : \mathbb{R} \to \mathbb{R}$ to be a continuous function
Consider the testing problem:
$$H_0 : g(\theta) = q \qquad\text{versus}\qquad H_1 : g(\theta) \ne q$$

Fundamental to all three tests:
Maximum-Likelihood estimator $\hat\theta_{ML}$ of $\theta$

285

6.2.1 Wald Test


History:
Suggested by A. Wald (1902-1950)
Idea behind this test:
If $H_0 : g(\theta) = q$ is true, then the random variable $g(\hat\theta_{ML}) - q$
should not be significantly different from zero

286

Previous knowledge:
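Equivariance property of the ML estimator (Slide 265)
→ $g(\hat\theta_{ML})$ is the ML estimator of $g(\theta)$
Asymptotic normality (Slide 266)
$$\left(g(\hat\theta_{ML}) - g(\theta)\right) \xrightarrow{d} U \sim N\left(0, \mathrm{Var}(g(\hat\theta_{ML}))\right)$$
The asymptotic variance $\mathrm{Var}(g(\hat\theta_{ML}))$ needs to be estimated from the data

Wald test statistic:
$$W = \frac{\left[g(\hat\theta_{ML}) - q\right]^2}{\widehat{\mathrm{Var}}\left[g(\hat\theta_{ML})\right]} \xrightarrow{d} U \sim \chi^2_1 \qquad (\text{under } H_0)$$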

287

Test decision:
Reject $H_0$ at the significance level $\alpha$ if $W > \chi^2_{1;1-\alpha}$
Remarks:
The Wald test is a pure test against H0
(it is not necessary to exactly specify H1)
The Wald principle can be applied to any consistent, asymptotically normally distributed estimator

288

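[Figure: Wald test statistic for $H_0 : g(\theta) = 0$ versus $H_1 : g(\theta) \ne 0$; curves $g(\theta)$ and $\ln[L(\theta)]$ plotted against $\theta$, with $\hat\theta_{ML}$ marked]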

289

6.2.2 Likelihood-Ratio Test (LR Test)


Idea behind this test:
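Consider the likelihood function $L(\theta)$ at 2 points:
$$\max_{\{\theta : g(\theta) = q\}} L(\theta) \qquad (= L(\hat\theta_{H_0}))$$
$$\max_{\theta \in \Theta} L(\theta) \qquad (= L(\hat\theta_{ML}))$$
Consider the quantity
$$\lambda = \frac{L(\hat\theta_{H_0})}{L(\hat\theta_{ML})}$$
Properties of $\lambda$:
$0 \le \lambda \le 1$
If $H_0$ is true, then $\lambda$ should be close to one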
290

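LR test statistic:
$$LR = -2\ln(\lambda) = 2\left\{\ln\left[L(\hat\theta_{ML})\right] - \ln\left[L(\hat\theta_{H_0})\right]\right\} \xrightarrow{d} U \sim \chi^2_1 \qquad (\text{under } H_0)$$
Properties of the LR test statistic:
$0 \le LR < \infty$
If $H_0$ is true, then $LR$ should be close to zero
Test decision:
Reject $H_0$ at the significance level $\alpha$ if $LR > \chi^2_{1;1-\alpha}$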
291

Remarks:
The LR test verifies whether the distance between the loglikelihood values, $\ln[L(\hat\theta_{ML})] - \ln[L(\hat\theta_{H_0})]$, is significantly larger than 0
The LR test does not require the computation of any asymptotic variance

292

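[Figure: LR test statistic for $H_0 : g(\theta) = 0$ versus $H_1 : g(\theta) \ne 0$; curves $g(\theta)$ and $\ln[L(\theta)]$ plotted against $\theta$, with $\hat\theta_{H_0}$, $\hat\theta_{ML}$, $\ln[L(\hat\theta_{H_0})]$, $\ln[L(\hat\theta_{ML})]$ and the distance $LR$ marked]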

293

6.2.3 Lagrange-Multiplier Test (LM Test)


History:
Named after J.L. Lagrange (1736-1813)
Idea behind this test:
For the ML estimator $\hat\theta_{ML}$ we have
$$\left.\frac{\partial \ln[L(\theta)]}{\partial \theta}\right|_{\theta = \hat\theta_{ML}} = 0$$
If $H_0 : g(\theta) = q$ is true, then the slope of the loglikelihood function at the point $\hat\theta_{H_0}$ should not be significantly different from zero
294

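LM test statistic:
$$LM = \left(\left.\frac{\partial \ln[L(\theta)]}{\partial \theta}\right|_{\theta = \hat\theta_{H_0}}\right)^2 \cdot \widehat{\mathrm{Var}}\left(\hat\theta_{H_0}\right) \xrightarrow{d} U \sim \chi^2_1 \qquad (\text{under } H_0)$$
where $\widehat{\mathrm{Var}}(\hat\theta_{H_0})$ is the estimated asymptotic variance of the estimator, i.e. the inverse of the estimated Fisher information

Test decision:
Reject $H_0$ at the significance level $\alpha$ if $LM > \chi^2_{1;1-\alpha}$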

295

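[Figure: LM test statistic for $H_0 : g(\theta) = 0$ versus $H_1 : g(\theta) \ne 0$; curves $g(\theta)$, $\ln[L(\theta)]$ and the slope $\partial \ln[L(\theta)]/\partial\theta$ plotted against $\theta$, with $\hat\theta_{H_0}$, $\hat\theta_{ML}$ and $LM$ marked]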

296

Remarks:
The test statistics of both the Wald and the LM tests contain estimated variances (of $g(\hat\theta_{ML})$ and of $\hat\theta_{H_0}$, respectively)
These unknown variances can be estimated consistently by means of the so-called Fisher information
Many econometric tests are based on these three construction principles
The three tests are asymptotically equivalent, i.e. for large sample sizes they produce identical test decisions
The three principles can be extended to the testing of hypotheses on a parameter vector $\theta$
If $\theta \in \mathbb{R}^m$, then all 3 test statistics have a $\chi^2_m$ distribution under $H_0$
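The three statistics can be compared on a single data set. The sketch below assumes a Bernoulli model with success probability $p$, the hypothesis $H_0 : p = p_0$ (so $g$ is the identity and $q = p_0$), and hypothetical data of 60 successes in 100 trials; the function name `three_tests` is illustrative. The $\chi^2_1$ critical value is obtained as the squared standard normal quantile.

```python
from math import log
from statistics import NormalDist

def three_tests(k, n, p0):
    """Wald, LR and LM statistics for H0: p = p0 from k successes in n Bernoulli trials (0 < k < n)."""
    p_hat = k / n                                    # unrestricted ML estimate

    def loglik(p):
        return k * log(p) + (n - k) * log(1 - p)

    wald = (p_hat - p0) ** 2 / (p_hat * (1 - p_hat) / n)   # variance estimated at p_hat
    lr = 2 * (loglik(p_hat) - loglik(p0))                  # twice the loglikelihood distance
    score = k / p0 - (n - k) / (1 - p0)              # slope of the loglikelihood at p0
    info = n / (p0 * (1 - p0))                       # Fisher information at p0
    lm = score ** 2 / info                           # squared slope times estimated variance 1/info
    return wald, lr, lm

alpha = 0.05
critical = NormalDist().inv_cdf(1 - alpha / 2) ** 2  # chi-square(1) quantile = squared normal quantile
wald, lr, lm = three_tests(k=60, n=100, p0=0.5)
print(round(wald, 3), round(lr, 3), round(lm, 3), round(critical, 3))
```

In this example the three values differ slightly in the finite sample but lead to the same decision at $\alpha = 0.05$; the Wald statistic uses the variance estimated at $\hat\theta_{ML}$, the LM statistic uses it at the restricted value, and the LR statistic needs no variance estimate.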
297

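[Figure: The 3 tests in one graph; curves $g(\theta)$ and $\ln L(\theta)$ plotted against $\theta$, with $\hat\theta_{H_0}$, $\hat\theta_{ML}$, $\ln[L(\hat\theta_{H_0})]$, $\ln[L(\hat\theta_{ML})]$ and the quantities $LR$ and $LM$ marked]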

298