
Department of Electrical and Computer Engineering Brigham Young University Provo, Utah

2009

Detection and Estimation Theory Lecture Notes For ECEn 672


Prepared by Wynn Stirling Winter Semester, 2009 Section 001

Copyright © 2009, Wynn C. Stirling


Contents

1 The Formalism of Statistical Decision Theory
  1.1 Game Theory and Decision Theory
  1.2 The Mathematical Structure of Decision Theory
    1.2.1 The Formalism of Statistical Decision Theory
    1.2.2 Special Cases

2 The Multivariate Normal Distribution
  2.1 The Univariate Normal Distribution
  2.2 Development of The Multivariate Distribution
  2.3 Transformation of Variables
  2.4 The Multivariate Normal Density

3 Introductory Estimation Theory Concepts
  3.1 Notational Conventions
  3.2 Populations and Statistics
    3.2.1 Sufficient Statistics
    3.2.2 Complete Sufficient Statistics
  3.3 Exponential Families
  3.4 Minimum Variance Unbiased Estimators

4 Neyman-Pearson Theory
  4.1 Hypothesis Testing
  4.2 Simple Hypothesis versus Simple Alternative
  4.3 The Neyman-Pearson Lemma
  4.4 The Likelihood Ratio
  4.5 Receiver Operating Characteristic
  4.6 Composite Binary Hypotheses

5 Bayes Decision Theory
  5.1 The Bayes Principle
  5.2 Bayes Risk
  5.3 Bayes Tests of Simple Binary Hypotheses
  5.4 Bayes Envelope Function
  5.5 Posterior Distributions
  5.6 Randomized Decision Rules
  5.7 Minimax Rules
  5.8 Summary of Binary Decision Problems
  5.9 Multiple Decision Problems
  5.10 An Important Class of M-Ary Problems

6 Maximum Likelihood Estimation
  6.1 The Maximum Likelihood Principle
  6.2 Maximum Likelihood for Continuous Distributions
  6.3 Comments on Estimation Quality
  6.4 The Cramér-Rao Bound
  6.5 Asymptotic Properties of Maximum Likelihood Estimators
  6.6 The Multivariate Normal Case
  6.7 Appendix: Matrix Derivatives

7 Conditioning
  7.1 Conditional Densities
  7.2 $\sigma$-fields
  7.3 Conditioning on a $\sigma$-field
  7.4 Conditional Expectations and Least-Squares Estimation

8 Bayes Estimation Theory
  8.1 Bayes Risk
  8.2 MAP Estimates
  8.3 Conjugate Prior Distributions
  8.4 Improper Prior Distributions
  8.5 Sequential Bayes Estimation

9 Linear Estimation Theory
  9.1 Introduction
  9.2 Minimum Mean Square Estimation (MMSE)
  9.3 Estimation Given a Single Random Variable
  9.4 Estimation Given Two Random Variables
  9.5 Estimation Given N Random Variables
  9.6 Mean Square Estimation for Random Vectors
  9.7 Hilbert Space of Random Variables
  9.8 Geometric Interpretation of Mean Square Estimation
  9.9 Gram-Schmidt Procedure
  9.10 Estimation Given the Innovations Process
  9.11 Innovations and Matrix Factorizations
  9.12 LDU Decomposition
  9.13 Cholesky Decomposition
  9.14 White Noise Interpretations
  9.15 More On Modeling

10 Estimation of State Space Systems
  10.1 Innovations for Processes with State Space Models
  10.2 Innovations Representations
  10.3 A Recursion for $P_{i|i-1}$
  10.4 The Discrete-Time Kalman Filter
  10.5 Perspective
  10.6 Kalman Filter Example
    10.6.1 Model Equations
  10.7 Interpretation of the Kalman Gain
  10.8 Smoothing
    10.8.1 A Word About Notation
    10.8.2 Fixed-Lag and Fixed-Point Smoothing
    10.8.3 The Rauch-Tung-Striebel Fixed-Interval Smoother
  10.9 Extensions to Nonlinear Systems
    10.9.1 Linearization
    10.9.2 The Extended Kalman Filter


List of Figures

1-1 Loss function (or matrix) for Odd or Even game
1-2 Structure of a Statistical Game
1-3 Risk Matrix for Statistical Odd or Even Game
4-1 Illustration of threshold for Neyman-Pearson test
4-2 Error probabilities for normal variables with different means and equal variances: (a) $P_{FA}$ calculation, (b) $P_D$ calculation
4-3 Receiver operating characteristic: normal variables with unequal means and equal variances
4-4 Receiver operating characteristic: normal variables with equal means and unequal variances
4-5 Demonstration of convexity property of ROC
5-1 Bayes envelope function
5-2 Bayes envelope function: normal variables with unequal means and equal variances
5-3 Loss Function
5-4 Bayes envelope function
5-5 Geometrical interpretation of the risk set
5-6 Geometrical interpretation of the minimax rule
5-7 Loss Function for Statistical Odd or Even Game
5-8 Risk set for odd or even game
5-9 Decision space for M = 3
6-1 Empiric Distribution Function
7-1 The family of rectangles $\{X \in [x - \Delta x, x + \Delta x],\ Y \in [y - \Delta y, y + \Delta y]\}$
7-2 The family of trapezoids $\{X \in [x - \Delta x, x + \Delta x],\ Y \in [y - X\Delta y, y + X\Delta y]\}$
9-1 Geometric interpretation of conditional expectation
9-2 Geometric illustration of Gram-Schmidt procedure


1 The Formalism of Statistical Decision Theory

1.1 Game Theory and Decision Theory

This course is primarily focused on the engineering topics of detection and estimation. These topics have their roots in probability theory, and fit in the general area of statistical decision theory. In fact, the component of statistical decision theory that we will be concerned with fits in an even larger mathematical construct, that of game theory. Therefore, to establish these connections and to provide a useful context for future development, we will begin our discussion of this topic with a brief detour into the general area of mathematical games.

A two-person, zero-sum mathematical game, which we will refer to from now on simply as a game, consists of three basic components:

1. A nonempty set, $\Theta_1$, of possible actions available to Player 1.
2. A nonempty set, $\Theta_2$, of possible actions available to Player 2.
3. A loss function, $L : \Theta_1 \times \Theta_2 \to \mathbb{R}$, representing the loss incurred by Player 2 (which, under the zero-sum condition, corresponds to the gain obtained by Player 1).

Any such triple $(\Theta_1, \Theta_2, L)$ defines a game. Here is a simple example taken from [3, Page 2].

Example: Odd or Even. Two contestants simultaneously put up either one or two fingers. Player 1 wins if the sum of the digits showing is odd, and Player 2 wins if the sum of the digits showing is even. The winner in all cases receives in dollars the sum of the digits showing, this being paid to him by the loser. To create a triple $(\Theta_1, \Theta_2, L)$ for this game we define $\Theta_1 = \Theta_2 = \{1, 2\}$ and define the loss function by

$$L(1, 1) = -2, \qquad L(1, 2) = 3, \qquad L(2, 1) = 3, \qquad L(2, 2) = -4.$$

A negative loss is a gain: when the sum is even, Player 2 wins and so loses a negative amount. It is customary to arrange the loss function into a loss matrix as depicted in Figure 1-1.


                      $\theta_2 = 1$    $\theta_2 = 2$
    $\theta_1 = 1$         -2                3
    $\theta_1 = 2$          3               -4

Figure 1-1: Loss function (or matrix) for Odd or Even game

We won't get into the details of how to develop a strategy for this game and many others similar in structure to it; that is a topic in its own right. For those who may be interested in general game theory, [10] is a reasonable introduction.

Exercise 1-1 Consider the well-known game of Prisoner's Dilemma. Two agents, denoted $X_1$ and $X_2$, are accused of a crime. They are interrogated separately, but the sentences that are passed are based upon the joint outcome. If they both confess, they are both sentenced to a jail term of three years. If neither confesses, they are both sentenced to a jail term of one year. If one confesses and the other refuses to confess, then the one who confesses is set free and the one who refuses to confess is sentenced to a jail term of five years. This payoff matrix is illustrated in Table 1-1. The first entry in each quadrant of the payoff matrix corresponds to $X_1$'s payoff, and the second entry corresponds to $X_2$'s payoff. This particular game represents a slight extension to our original definition, since it is not a zero-sum game. When playing such a game, a reasonable strategy is for each agent to make a choice such that, once chosen, neither player would have an incentive to depart unilaterally from the outcome. Such a decision pair is called a Nash equilibrium point. In other words, at the Nash equilibrium point, each player can only hurt himself by departing unilaterally from his decision. What is the Nash equilibrium point for the Prisoner's Dilemma game? Explain why this problem is considered a dilemma.

                        $X_2$ silent    $X_2$ confesses
    $X_1$ silent            1,1              5,0
    $X_1$ confesses         0,5              3,3

Table 1-1: A typical payoff matrix for the Prisoner's Dilemma (entries are jail terms in years).

Exercise 1-2 In his delightful book, Superior Beings: If They Exist, How Would We Know?, Steven J. Brams introduces a game called the Revelation Game. In this game, there are two agents. Player 1 we will term the superior being (SB), and Player 2 is a person (P). SB has two strategies:

1. Reveal himself
2. Don't reveal himself

Agent P also has two strategies:

1. Believe in SB's existence
2. Don't believe in SB's existence

Table 1-2 provides the payoff matrix for this game. What is the Nash equilibrium point for this game?

                                P: Believe in SB's existence           P: Don't believe in SB's existence
    SB: Reveal himself          P faithful with evidence (3,4)         P unfaithful despite evidence (1,1)
    SB: Don't reveal himself    P faithful without evidence (4,2)      P unfaithful without evidence (2,3)

Table 1-2: Payoff for Revelation Game: 4 = best, 3 = next best, 2 = next worst, 1 = worst.

We will view decision theory as a game between the decision-maker, or agent, and nature, where nature takes the role of, say, Player 1, and the agent becomes Player 2. The components of this game, which we will denote by $(\Theta, \Delta, L)$, become

1. A nonempty set, $\Theta$, of possible states of nature, sometimes referred to as the parameter space.

2. A nonempty set, $\Delta$, of possible decisions available to the agent, sometimes called the decision space.
3. A loss function, $L : \Theta \times \Delta \to \mathbb{R}$, representing the loss incurred by the agent (which corresponds to the gain obtained by nature). This function is also sometimes called the cost function.

Let's take a minute and detail some of the important differences between game theory and decision theory. In a two-person game, it is usually assumed that the players are simultaneously trying to maximize their winnings (or minimize their losses), whereas with decision theory, nature assumes essentially a neutral role and only the agent is trying to extremize anything. Of course, if you are paranoid, you might want to consider nature your opponent, but most people feel content to think of nature as being neutral. If we do so, we might be willing to be a little more bold in the decision strategies we choose, since we don't need to be so careful about protecting ourselves.

In a game, we usually assume that each player makes its decision based on exactly the same information (cheating is not allowed), whereas in decision theory, the agent may have available additional information, via observations, that may be used to gain an advantage on nature. This difference is more apparent than real, because there is nothing about game theory that says a game has to be fair. In fact, decision problems can be viewed as simply more complex games. Decision theory is really a subset of the larger body of game theory, but there are enough special issues and structure involved in the way the agent may use observations to warrant its being treated as a theory on its own, considered apart from game theory proper.

1.2 The Mathematical Structure of Decision Theory

In its most straightforward expression, the agent's job is to guess the state of nature. A good job means small loss, so the agent is motivated to get the most out of any information available in the form of observations. We suppose that before making a decision the agent is permitted to look at the observed value of a random variable or vector, $X$, whose distribution depends upon the true state of nature, $\theta$.

Before presenting the mathematical development, we need a preliminary definition. Let $(\Omega_1, \mathcal{T}_1)$ and $(\Omega_2, \mathcal{T}_2)$ be two measurable spaces. A transition probability is a mapping $P : \Omega_1 \times \mathcal{T}_2 \to [0, 1]$ such that¹

1. For every $\omega_1 \in \Omega_1$, $P(\omega_1, \cdot)$ is a probability on $(\Omega_2, \mathcal{T}_2)$.
2. For every $T_2 \in \mathcal{T}_2$, $P(\cdot, T_2)$ is a measurable function on $(\Omega_1, \mathcal{T}_1)$.

1.2.1 The Formalism of Statistical Decision Theory

Let $(\Omega, \mathcal{F})$ and $(\Theta, \mathcal{T})$ be measurable spaces, and let $P$ be a transition probability such that $P : \Theta \times \mathcal{F} \to [0, 1]$. Let $X$ be a random variable defined over $(\Omega, \mathcal{F}, P(\theta, \cdot))$. Recall that this means that $X : \Omega \to \mathbb{R}$ such that, for any Borel set $A \subset \mathbb{R}$, the inverse image $X^{-1}(A) \in \mathcal{F}$; that is, the inverse image of the Borel set $A$ is an element of the $\sigma$-field $\mathcal{F}$. Since it is awkward to operate in this space, we choose to work with the derived transition probability $P_X$ such that, for each $\theta \in \Theta$ and each Borel set $A$,

$$P_X(\theta, A) = P(\theta, X^{-1}(A)).$$

We may generalize the definition of the derived distribution slightly by permitting the Borel set $A$ to be a subset of $n$-dimensional Euclidean space $\mathbb{R}^n$. In particular, let $\mathcal{B}$ be the Borel field defined over $\mathbb{R}^n$, let $\mathcal{X} \subset \mathbb{R}^n$, and let us define the following measurable spaces:

$(\mathcal{X}, \mathcal{B})$ = the space of observations (sample space)
$(\Theta, \mathcal{T})$ = the space of parameters
$(\Delta, \mathcal{D})$ = the space of decisions

$P_X$ is a transition probability, $P_X : \Theta \times \mathcal{B} \to [0, 1]$. The probability $P_X(\theta, \cdot)$ governs the observation $X = x \in \mathcal{X}$ when $\theta$ is the value of the parameter (unknown to the observer).
¹ See, for example, [12].


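Example: Coin Toss. Suppose a coin is tossed, and the agent observes the value $X = 1$ if it lands heads, and $X = 0$ if it lands tails. Then

$$\Omega = \{H, T\}, \qquad \mathcal{F} = \{\emptyset, \{H\}, \{T\}, \Omega\}.$$

The derived probability space contains the elements

$$\mathcal{X} = \{0, 1\}, \qquad \mathcal{B} = \{\emptyset, \{0\}, \{1\}, \mathcal{X}\}.$$

For a parameter space, let us suppose the coin is either fair or biased towards heads:

$$\Theta = \{\tfrac{1}{2}, p\}, \quad p \neq \tfrac{1}{2}, \qquad \mathcal{T} = \{\emptyset, \{\tfrac{1}{2}\}, \{p\}, \Theta\}.$$

Then, for each fixed $\theta$, $P_X(\theta, \cdot)$ is a probability on $(\mathcal{X}, \mathcal{B})$:

$$P_X(\tfrac{1}{2}, A) = \begin{cases} \tfrac{1}{2} & A = \{0\} \text{ or } A = \{1\} \\ 1 & A = \mathcal{X} \\ 0 & A = \emptyset \end{cases}
\qquad
P_X(p, A) = \begin{cases} p & A = \{1\} \\ 1 - p & A = \{0\} \\ 1 & A = \mathcal{X} \\ 0 & A = \emptyset \end{cases}$$

and, for each fixed $A \in \mathcal{B}$, $P_X(\cdot, A)$ is a function on $\Theta$:

$$P_X(\theta, \emptyset) = 0 \text{ for all } \theta, \qquad P_X(\theta, \mathcal{X}) = 1 \text{ for all } \theta,$$

$$P_X(\theta, \{1\}) = \begin{cases} \tfrac{1}{2} & \theta = \tfrac{1}{2} \\ p & \theta = p \end{cases}
\qquad
P_X(\theta, \{0\}) = \begin{cases} \tfrac{1}{2} & \theta = \tfrac{1}{2} \\ 1 - p & \theta = p. \end{cases}$$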
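The coin-toss transition probability can be sketched in a few lines of Python. This is only an illustration: the function name `p_x` and the numerical value chosen for $p$ are assumptions made here, not part of the notes. The parameter $\theta$ is taken to be the probability of heads, and an event $A$ is represented as a Python set.

```python
from fractions import Fraction

# Sample space of the observation X (1 = heads, 0 = tails).
X_SPACE = frozenset({0, 1})


def p_x(theta, event):
    """Derived transition probability P_X(theta, A) for the coin toss.

    theta is the probability of heads (X = 1); event is a subset of {0, 1}.
    """
    event = frozenset(event)
    if not event <= X_SPACE:
        raise ValueError("event must be a subset of {0, 1}")
    pmf = {1: theta, 0: 1 - theta}
    return sum((pmf[x] for x in event), Fraction(0))


fair = Fraction(1, 2)
biased = Fraction(3, 4)  # hypothetical value of p, chosen only for illustration

for theta in (fair, biased):
    # For each fixed theta, P_X(theta, .) is a probability on the Borel field.
    assert p_x(theta, set()) == 0
    assert p_x(theta, X_SPACE) == 1
    assert p_x(theta, {0}) + p_x(theta, {1}) == 1
```

Exact rational arithmetic is used so that the probability axioms can be checked with equality rather than with a floating-point tolerance.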

We continue with the development of our formalism; for brevity, we will assume that $\mathcal{B}$ is the Borel field over $\mathbb{R}$. The extension of the concepts to the multivariate case is straightforward but lengthy². For each value of $\theta$ the probability measure $P_X(\theta, \cdot)$ induces a cumulative distribution function, defined as

$$F_X(x|\theta) = P_X(\theta, (-\infty, x]) = P(\theta, X^{-1}((-\infty, x])).$$

² The definition of a distribution function in the multivariate case is somewhat technical; we won't dwell on it in this class since we will usually be working with well-known densities. For a detailed treatment of the theory, the reader is referred to [17].

$F_X(x|\theta)$ represents the distribution of the random variable $X$ when $\theta$ is the true value of the parameter. Note that with this development, we have not specified whether or not $\theta$ is a random variable. We will have more to say about that later on. (We will see that if we adopt a Bayesian attitude, then we will model $\theta$ as a random variable, but that's not the only way to think about the parameters.)

Let $L : \Theta \times \Delta \to \mathbb{R}$ be a measurable function. $L(\theta, \delta)$ represents the loss following a decision $\delta$ when $\theta$ is the value of the parameter (the true state of nature). A strategy, or decision rule, or decision function, $d : \mathcal{X} \to \Delta$ is a rule for deciding $\delta = d(X)$ after having observed $X$. If the agent chooses this rule, then his loss becomes $L(\theta, d(X))$, which, for fixed $\theta$, is a random variable (i.e., it is a function of the random variable $X$). The expected value of this loss is the risk function, which is a function of the parameter $\theta$ and the decision rule $d$, and may be expressed by the Riemann-Stieltjes integral

$$R(\theta, d) = E L(\theta, d(X)) = \int L(\theta, d(x))\, dF_X(x|\theta).$$

If a probability density function (pdf) $f_X(x|\theta) = \frac{d}{dx} F_X(x|\theta)$ exists, then the risk function may be written as the Riemann integral

$$R(\theta, d) = \int L(\theta, d(x)) f_X(x|\theta)\, dx.$$
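If the probability is purely discrete, so that a probability mass function (pmf) $p_X(x_k|\theta) = F_X(x_k|\theta) - F_X(x_k^-|\theta)$, $k = 1, \ldots, N$, exists (the pmf at $x_k$ is the jump of the distribution function there), then the risk function may be expressed as

$$R(\theta, d) = \sum_{k=1}^{N} L(\theta, d(x_k))\, p_X(x_k|\theta).$$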

The risk represents the average loss to the agent when the true state of nature is $\theta$ and the agent uses the decision rule $d$. Any function $d : \mathcal{X} \to \Delta$ is called a (nonrandomized) decision rule³ or decision function provided the risk function $R(\theta, d)$ exists and is finite for all $\theta \in \Theta$. We will denote the class of all nonrandomized decision rules by $D$. We state without proof that $D$ contains only functions $d$ for which $L(\theta, d(\cdot))$ is continuous with probability one for each $\theta \in \Theta$.

³ There also exist randomized decision rules, which correspond to probability distributions over a space of decision rules. A nonrandomized decision rule is a degenerate randomized decision rule where all of the mass is placed on one rule. We won't need to worry about randomized decision rules in this class, but it's nice to know that they exist.

With the introduction of the risk function, $R$, and the class of decision functions, $D$, we may replace the original game $(\Theta, \Delta, L)$ by a new game, which we will denote by the triple $(\Theta, D, R)$, in which the space $D$ and the function $R$ have an underlying structure, depending on $\Delta$ and $L$ and the distribution of $X$, whose exploitation is the main objective of decision theory. Sometimes the triple $(\Theta, D, R)$ is called a statistical game.

Figure 1-2 illustrates the structure of the decision problem. The parameter space is linked to the decision space through the risk function, which is the expectation of the loss function. The parameter space is also linked to the sample space through the transition probability function, and the sample space is linked to the decision space through the decision function.

[Figure: three nodes, Parameter Space $(\Theta, \mathcal{T})$, Decision Space $(\Delta, \mathcal{D})$, and Sample Space $(\mathcal{X}, \mathcal{B})$. The parameter space is joined to the decision space by $R = E(L)$, the parameter space to the sample space by $F_X(\cdot|\theta)$, and the sample space to the decision space by $d(X)$, $d \in D$.]

Figure 1-2: Structure of a Statistical Game

Example: Odd or Even. The game of odd or even mentioned earlier may be extended to a statistical decision problem. Suppose that before the game is played the agent is allowed to ask nature how many fingers it intends to put up and that nature must answer truthfully with probability 3/4 (hence untruthfully with probability 1/4). The agent therefore observes a random variable $X$ (the answer nature gives) taking the values of 1 or 2. If $\theta = 1$ is the true state of nature, the probability that $X = 1$ is 3/4; that is, $P(1, \{1\}) = 3/4$. Similarly, $P(2, \{1\}) = 1/4$. There are exactly four possible functions from $\mathcal{X} = \{1, 2\}$ into $\Delta = \{1, 2\}$. These are the four decision rules

$$\begin{array}{ll}
d_1(1) = 1, & d_1(2) = 1; \\
d_2(1) = 1, & d_2(2) = 2; \\
d_3(1) = 2, & d_3(2) = 1; \\
d_4(1) = 2, & d_4(2) = 2.
\end{array}$$

Rules $d_1$ and $d_4$ ignore the value of $X$. Rule $d_2$ reflects the agent's belief that nature is telling the truth, and rule $d_3$, that nature is not telling the truth. The risk matrix, given in Figure 1-3, characterizes this statistical game.
                      $d_1$     $d_2$     $d_3$     $d_4$
    $\theta = 1$       -2       -3/4       7/4        3
    $\theta = 2$        3       -9/4       5/4       -4

Figure 1-3: Risk Matrix for Statistical Odd or Even Game
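The entries of the risk matrix follow from the discrete form of the risk, $R(\theta, d) = \sum_x L(\theta, d(x))\, P(\theta, \{x\})$. The Python sketch below tabulates them under the assumption that $L(\theta, \delta)$ is the loss to the agent, so that $L(1,1) = -2$, $L(1,2) = L(2,1) = 3$ and $L(2,2) = -4$. The names `loss`, `prob`, `rules` and `risk_matrix` are illustrative choices, not notation from the notes.

```python
from fractions import Fraction
from itertools import product

STATES = (1, 2)        # theta: number of fingers nature puts up
ACTIONS = (1, 2)       # delta: number of fingers the agent puts up
OBSERVATIONS = (1, 2)  # x: the answer nature gives


def loss(theta, delta):
    """Loss to the agent (assumed convention): the agent wins on an even sum."""
    total = theta + delta
    return -total if total % 2 == 0 else total


def prob(x, theta):
    """P(theta, {x}): nature answers truthfully with probability 3/4."""
    return Fraction(3, 4) if x == theta else Fraction(1, 4)


# The four decision rules d: {1, 2} -> {1, 2}, stored as dicts x -> d(x).
# product() yields (1,1), (1,2), (2,1), (2,2), i.e. d1, d2, d3, d4 in order.
rules = {}
for idx, (a1, a2) in enumerate(product(ACTIONS, repeat=2), start=1):
    rules[f"d{idx}"] = {1: a1, 2: a2}


def risk(theta, rule):
    """R(theta, d) = sum over x of L(theta, d(x)) * P(theta, {x})."""
    return sum(loss(theta, rule[x]) * prob(x, theta) for x in OBSERVATIONS)


risk_matrix = {
    theta: {name: risk(theta, rule) for name, rule in rules.items()}
    for theta in STATES
}

for theta in STATES:
    print(theta, [str(risk_matrix[theta][name]) for name in rules])
```

Enumerating the rules with `itertools.product` makes it easy to extend the sketch to larger observation or decision spaces, where the number of nonrandomized rules grows as $|\Delta|^{|\mathcal{X}|}$.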

Exercise 1-3 Verify the contents of the risk matrix for the statistical odd or even game.

1.2.2 Special Cases

The above framework provides a formalism for much of the statistical analysis we will do in this course. Only a part of statistics is represented by this formalism. We will not discuss such topics as the choice of experiments, the design of experiments, or sequential analysis. In each case, however, additional structure could be added to the basic framework to include these topics, and the problem could be reduced again to a simple game. For example, in sequential analysis the agent may take observations one at a time, paying $c$ units each time he does so. Therefore a decision rule will have to tell him both when to stop taking observations and what action to take once he has stopped. He will try to choose a decision rule that will minimize in some sense his new risk, which is defined now as the expected value of the loss plus the cost.

Most of the body of statistical decision making involves three special cases of the general game formulation.

1. $\Delta$ consists of two points, $\Delta = \{\delta_1, \delta_2\}$. If the decision space consists of only two elements, the resulting problem is called a hypothesis testing problem. Suppose $\Theta = \mathbb{R}$ and the loss function is

$$L(\theta, \delta_1) = \begin{cases} \ell_1 & \text{if } \theta > \theta_0 \\ 0 & \text{if } \theta \le \theta_0 \end{cases}
\qquad
L(\theta, \delta_2) = \begin{cases} 0 & \text{if } \theta > \theta_0 \\ \ell_2 & \text{if } \theta \le \theta_0, \end{cases}$$

where $\theta_0$ is some fixed number and $\ell_1$ and $\ell_2$ are positive numbers. With this example, we would like to take action $\delta_1$ if $\theta \le \theta_0$, and action $\delta_2$ if $\theta > \theta_0$. As a specific example, suppose $\theta$ represents the return energy of a radar signal, and $\theta_0$ is the minimum return energy that would correspond to the presence of a target. Suppose the observed return is of the form $X = \theta + \nu$, where $\nu$ is receiver noise. The essence of our decision problem is to decide whether or not a target is present. Our decision problem can be stated as follows:

Choose $\delta_1 = H_0$: No Target Present
Choose $\delta_2 = H_1$: Target Present.

In statistical parlance, $H_0$ is termed the null hypothesis, and $H_1$ the alternative hypothesis. With this simple problem, four things can happen:

$H_0$ true, choose $\delta_1$: Target not present, decide target not present: correct decision.
$H_1$ true, choose $\delta_2$: Target present, decide target present: correct decision.

$H_1$ true, choose $\delta_1$: Target present, decide target not present: missed detection.
$H_0$ true, choose $\delta_2$: Target not present, decide target present: false alarm.

The space $D$ of decision rules consists of those functions $d : \mathcal{X} \to \{\delta_1, \delta_2\}$ with the property that $P_X(\theta, \{d(X) = \delta_i\})$, $i = 1, 2$, is well-defined for all values of $\theta \in \Theta$. With this structure in place, the problem, then, is to determine the function $d$. This is where most of the effort of detection theory is placed. It involves the statistical description of the random variable as well as the criterion one would wish to employ for penalizing errors. For example, if the cost of missed detections is very high, we might have to live with a high false alarm rate. Conversely, if the cost of false alarms is high, we may have to design a detector that gives us a lot of missed detections.

Exercise 1-4 Show that the risk function for this case is

$$R(\theta, d) = \begin{cases} \ell_1 P(\theta, \{d(X) = \delta_1\}) & \text{if } \theta > \theta_0 \\ \ell_2 P(\theta, \{d(X) = \delta_2\}) & \text{if } \theta \le \theta_0. \end{cases}$$

As noted, there are two types of error possible with this problem. First, if $\theta > \theta_0$, $P(\theta, \{d(X) = \delta_1\})$ is the probability of making the error of taking action $\delta_1$ when the true state of nature is greater than $\theta_0$. In our radar signal detection context, for example, this error occurs if a target is present, but the decision rule decides that it is not present (a missed detection). With $H_0$ as the null hypothesis, such an error is called a Type II error. Similarly, for $\theta \le \theta_0$, $P(\theta, \{d(X) = \delta_2\}) = 1 - P(\theta, \{d(X) = \delta_1\})$ is the probability of making the error of taking action $\delta_2$ when we should take action $\delta_1$. This error occurs if the decision rule claims that a target is present when it is not (a false alarm). Such an error is termed a Type I error.

2. $\Delta$ consists of $k$ points, $\Delta = \{\delta_1, \delta_2, \ldots, \delta_k\}$, $k \ge 3$. These problems are called multiple decision problems, or multiple hypothesis testing problems.

3. $\Delta$ consists of the real line, $\Delta = \mathbb{R}$. Such decision problems are referred to as point estimation of a real parameter. Consider the case where $\Theta = \mathbb{R}$ and the loss function is given by

$$L(\theta, \delta) = c(\theta - \delta)^2,$$

where $c$ is some positive constant. A decision function, $d$, in this case is a real-valued function defined on the sample space, and is often called an estimate of the true unknown state of nature, $\theta$. It is the agent's desire to choose the function $d$ to minimize the risk function

$$R(\theta, d) = c E(\theta - d(X))^2,$$

which is $c$ times the mean squared error of the estimate $d(X)$. More generally, we may wish to estimate some function $f(\theta)$ that depends on the value of the parameter $\theta$, in which case the loss function may assume the form

$$L(\theta, \delta) = w(\theta)(f(\theta) - \delta)^2.$$

This criterion is one of the most widely used loss functions in all of classical statistical engineering analysis, and is the basis for such well-known estimation techniques as the Wiener filter and the Kalman filter.
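To make the squared-error risk concrete, the sketch below estimates $R(\theta, d) = cE(\theta - d(X))^2$ by simulation and compares it with a closed form. Several things here are assumptions introduced only for illustration: the observation model $X = \theta + \nu$ with zero-mean Gaussian noise of standard deviation `sigma`, the linear estimator $d(x) = ax$, and the function names. Under those assumptions, $\theta - aX = (1 - a)\theta - a\nu$, so the risk is $c[(1 - a)^2\theta^2 + a^2\sigma^2]$.

```python
import random


def mse_risk(theta, estimator, c=1.0, sigma=1.0, trials=100_000, seed=0):
    """Monte Carlo estimate of R(theta, d) = c * E[(theta - d(X))^2].

    Assumes X = theta + noise with noise ~ N(0, sigma^2); this model is an
    illustrative assumption, not something the loss function requires.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x = theta + rng.gauss(0.0, sigma)
        total += c * (theta - estimator(x)) ** 2
    return total / trials


def closed_form_risk(theta, a, c=1.0, sigma=1.0):
    """Exact risk of d(x) = a * x under the same Gaussian observation model."""
    return c * ((1 - a) ** 2 * theta ** 2 + a ** 2 * sigma ** 2)


if __name__ == "__main__":
    theta, a = 2.0, 0.5
    print("simulated  :", mse_risk(theta, lambda x: a * x))
    print("closed form:", closed_form_risk(theta, a))
```

The closed form shows that the risk depends on the unknown $\theta$ unless $a = 1$: shrinking the observation lowers the variance term $a^2\sigma^2$ at the price of a bias term $(1 - a)^2\theta^2$, so no single choice of $a$ is best for every state of nature.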


2 The Multivariate Normal Distribution

The normal distribution is probably the most important one for this course. We first present the univariate normal (Gaussian) distribution, then use that to derive the multivariate distribution.

2.1 The Univariate Normal Distribution

We begin our discussion with a brief review of the univariate normal distribution. Let $X$ be a random variable with a univariate normal distribution. This is an absolutely continuous distribution whose density, with mean $\mu$ and variance $\sigma^2$, is

$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right),$$

where $\sigma > 0$. This distribution is denoted $N(\mu, \sigma^2)$. Recall that the characteristic function of a random variable is defined as the Fourier transform of the density function. The characteristic function of the univariate normally distributed random variable is

$$\phi_X(\omega) = E \exp(j\omega X) = \int_{-\infty}^{\infty} e^{j\omega x} f_X(x)\, dx = \exp(j\omega\mu - \omega^2\sigma^2/2). \qquad (2\text{-}1)$$
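The closed form in (2-1) can be checked numerically by approximating the defining integral. The sketch below uses a plain trapezoidal rule over $\mu \pm 10\sigma$. The truncation width, the step count, and the function names are illustrative assumptions; the rule itself is a generic quadrature, not a method taken from the notes.

```python
import cmath
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)


def char_fn_closed_form(omega, mu, sigma):
    """Right-hand side of (2-1): exp(j*omega*mu - omega^2 * sigma^2 / 2)."""
    return cmath.exp(1j * omega * mu - (omega ** 2) * (sigma ** 2) / 2)


def char_fn_numeric(omega, mu, sigma, half_width=10.0, steps=20_000):
    """Trapezoidal approximation of the integral of exp(j*omega*x) * f_X(x)
    over the interval mu +/- half_width * sigma."""
    lo = mu - half_width * sigma
    hi = mu + half_width * sigma
    h = (hi - lo) / steps
    total = 0j
    for k in range(steps + 1):
        x = lo + k * h
        weight = 0.5 if k in (0, steps) else 1.0  # endpoints get half weight
        total += weight * cmath.exp(1j * omega * x) * normal_pdf(x, mu, sigma)
    return total * h


if __name__ == "__main__":
    omega, mu, sigma = 1.3, 0.5, 2.0
    print(char_fn_numeric(omega, mu, sigma))
    print(char_fn_closed_form(omega, mu, sigma))
```

Because the Gaussian density decays so quickly, truncating the integral at ten standard deviations discards a negligible amount of mass, and the two values should agree to many digits.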

2.2 Development of The Multivariate Distribution

We now turn our attention to the multivariate case. Let $\mathbf{X} = [X_1, \ldots, X_n]^T$ denote a random vector (i.e., each element $X_i$ of this vector is a random variable). The expectation of a random vector is the vector of expectations: $E\mathbf{X} = [EX_1, \ldots, EX_n]^T$.⁴ The covariance matrix of a random vector $\mathbf{X} = [X_1, \ldots, X_n]^T$ is defined as the matrix of covariances $[\mathrm{Cov}(X_i, X_j)]$, or

$$\mathrm{Cov}\,\mathbf{X} = E(\mathbf{X} - E\mathbf{X})(\mathbf{X} - E\mathbf{X})^T.$$

⁴ More generally, the expectation of a random matrix is defined as the matrix of expectations.

We have the following fact:

Theorem 1 Every covariance matrix is symmetric and nonnegative definite. Every symmetric and nonnegative definite matrix is a covariance matrix. If $\mathrm{Cov}\,\mathbf{X}$ is not positive definite, then with probability one, $\mathbf{X}$ lies in some hyperplane $\mathbf{b}^T\mathbf{X} = c$, with $\mathbf{b} \neq 0$.

Proof: $\mathrm{Cov}\,\mathbf{X}$ is symmetric because $\mathrm{Cov}(X_i, X_j) = \mathrm{Cov}(X_j, X_i)$. Furthermore,

$$\mathbf{b}^T(\mathrm{Cov}\,\mathbf{X})\mathbf{b} = \mathbf{b}^T\left(E(\mathbf{X} - E\mathbf{X})(\mathbf{X} - E\mathbf{X})^T\right)\mathbf{b} = E\,\mathbf{b}^T(\mathbf{X} - E\mathbf{X})(\mathbf{X} - E\mathbf{X})^T\mathbf{b} = E[(\mathbf{b}^T(\mathbf{X} - E\mathbf{X}))^2] \ge 0,$$

which proves that $\mathrm{Cov}\,\mathbf{X}$ is nonnegative definite. If, for some $\mathbf{b} \neq 0$, $E[(\mathbf{b}^T(\mathbf{X} - E\mathbf{X}))^2] = 0$, then $P[\mathbf{b}^T\mathbf{X} = \mathbf{b}^T E\mathbf{X}] = 1$, so that with probability one $\mathbf{X}$ lies in the hyperplane $\mathbf{b}^T\mathbf{X} = c$, where $c = \mathbf{b}^T E\mathbf{X}$.

Now let $R$ be an arbitrary symmetric nonnegative definite matrix. Let $A = R^{1/2}$ be the nonnegative square root of $R$. Let $\mathbf{X}$ be a vector of independent random variables with zero means and unit variances. Then $\mathrm{Cov}\,\mathbf{X} = I$. Now let $\mathbf{Y} = A\mathbf{X}$. Then $E\mathbf{Y} = A(E\mathbf{X}) = 0$ and

$$\mathrm{Cov}\,\mathbf{Y} = E\mathbf{Y}\mathbf{Y}^T = E(A\mathbf{X})(A\mathbf{X})^T = A\,E[\mathbf{X}\mathbf{X}^T]\,A^T = AA^T = R. \qquad \Box$$

We are now in a position to define the multivariate normal distribution. Our development follows [3]. An $n$-dimensional random vector $\mathbf{X}$ is said to have a multivariate or $n$-dimensional normal distribution if for every $n$-dimensional vector $\boldsymbol{\omega}$ the random variable $\boldsymbol{\omega}^T\mathbf{X}$ has a (univariate) normal distribution (possibly degenerate) on the real line. The normal distribution on the real line is consistent with this definition, for if a random variable $X$ has a normal distribution then so has the random variable $\alpha X$ for any real number $\alpha$.

One advantage of defining the multivariate normal distribution in this way is that we obtain, as an immediate consequence, the fact that linear transformations of multivariate normal random variables are also multivariate normal.

Theorem 2 If $\mathbf{X}$ has an $n$-dimensional normal distribution, then for any $k$-dimensional vector $\mathbf{n}$ of constants and any $k \times n$ matrix $A$ of constants, the random vector $\mathbf{Y} = A\mathbf{X} + \mathbf{n}$ has a $k$-dimensional normal distribution.

Proof: Let $\boldsymbol{\omega}$ be an arbitrary $k$-dimensional vector. We are to show that $\boldsymbol{\omega}^T\mathbf{Y}$ is normally distributed; but because $\mathbf{X}$ has a multivariate normal distribution, $(\boldsymbol{\omega}^T A)\mathbf{X}$ is normally distributed and so is $\boldsymbol{\omega}^T A\mathbf{X} + \boldsymbol{\omega}^T\mathbf{n} = \boldsymbol{\omega}^T\mathbf{Y}$, completing the proof. $\Box$

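To compute the characteristic function of the multivariate normal distribution, recall that the joint characteristic function of random variables $X_1, \ldots, X_n$ is defined as

\[ \phi_{\mathbf{X}}(\boldsymbol{\lambda}) = E \exp[j(\lambda_1 X_1 + \cdots + \lambda_n X_n)] = E \exp(j\boldsymbol{\lambda}^T \mathbf{X}). \tag{2-2} \]

We now observe that $\boldsymbol{\lambda}^T \mathbf{X}$ may be thought of as a function of the random vector $\mathbf{X}$, and that (2-2) can be viewed as the characteristic function of the random variable $\boldsymbol{\lambda}^T \mathbf{X}$ evaluated at 1, i.e., the right side of (2-2) is simply $\phi_{\boldsymbol{\lambda}^T \mathbf{X}}(1)$. Thus the characteristic function of the multivariate normal random vector $\mathbf{X}$ is the same as the characteristic function of the univariate normal random variable $\boldsymbol{\lambda}^T \mathbf{X}$ evaluated at 1, namely,

\[ \phi_{\mathbf{X}}(\boldsymbol{\lambda}) = \phi_{\boldsymbol{\lambda}^T \mathbf{X}}(1) = \exp[jE(\boldsymbol{\lambda}^T \mathbf{X}) \cdot 1 - \mathrm{Var}(\boldsymbol{\lambda}^T \mathbf{X}) \cdot 1^2/2] = \exp[jE(\boldsymbol{\lambda}^T \mathbf{X}) - \mathrm{Var}(\boldsymbol{\lambda}^T \mathbf{X})/2]. \]

With the notation $\mathbf{m} = E\mathbf{X}$ and $R = \mathrm{Cov}\,\mathbf{X}$, it follows that $E\boldsymbol{\lambda}^T \mathbf{X} = \boldsymbol{\lambda}^T \mathbf{m}$ and $\mathrm{Var}(\boldsymbol{\lambda}^T \mathbf{X}) = E\boldsymbol{\lambda}^T (\mathbf{X} - \mathbf{m})(\mathbf{X} - \mathbf{m})^T \boldsymbol{\lambda} = \boldsymbol{\lambda}^T R \boldsymbol{\lambda}$. Hence,

\[ \phi_{\mathbf{X}}(\boldsymbol{\lambda}) = \exp(j\boldsymbol{\lambda}^T \mathbf{m} - \boldsymbol{\lambda}^T R \boldsymbol{\lambda}/2). \tag{2-3} \]

We can show that each characteristic function of the form given by (2-3) corresponds uniquely to a multivariate normal distribution. To see this, note that $\phi_{\boldsymbol{\lambda}^T \mathbf{X}}(t) = \exp[j(\boldsymbol{\lambda}^T \mathbf{m})t - (\boldsymbol{\lambda}^T R \boldsymbol{\lambda})t^2/2]$ is of the form (2-1). Because the characteristic function determines the distribution uniquely, the multivariate normal distribution is determined once its mean vector $\mathbf{m}$ and covariance matrix $R$ are given.

To see that (2-3) actually does represent a characteristic function if $R$ is a covariance matrix, let $\mathbf{Z} = [Z_1, \ldots, Z_n]^T$ be a vector of independent random variables, each having a normal distribution with mean zero and variance one. The characteristic function for $\mathbf{Z}$ is

\[ \phi_{\mathbf{Z}}(\boldsymbol{\lambda}) = E \exp\left( j \sum_{j=1}^n \lambda_j Z_j \right) = \prod_{j=1}^n E \exp(j\lambda_j Z_j) = \prod_{j=1}^n \exp(-\lambda_j^2/2) = \exp(-\boldsymbol{\lambda}^T \boldsymbol{\lambda}/2). \]

Now let $A$ be the symmetric nonnegative definite square root of $R$ and let $\mathbf{Y} = A\mathbf{Z} + \mathbf{m}$. Then, since $A$ is symmetric,

\[ \phi_{\mathbf{Y}}(\boldsymbol{\lambda}) = E \exp(j\boldsymbol{\lambda}^T A\mathbf{Z} + j\boldsymbol{\lambda}^T \mathbf{m}) = \exp(j\boldsymbol{\lambda}^T \mathbf{m})\,\phi_{\mathbf{Z}}(A\boldsymbol{\lambda}) = \exp(j\boldsymbol{\lambda}^T \mathbf{m} - \boldsymbol{\lambda}^T AA\boldsymbol{\lambda}/2) = \exp(j\boldsymbol{\lambda}^T \mathbf{m} - \boldsymbol{\lambda}^T R \boldsymbol{\lambda}/2), \]

which shows that (2-3) is indeed a characteristic function if $R$ is a covariance matrix. We have thus proved the following theorem.

Theorem 3 Functions of the form $\phi_{\mathbf{X}}(\boldsymbol{\lambda}) = \exp(j\boldsymbol{\lambda}^T \mathbf{m} - \boldsymbol{\lambda}^T R \boldsymbol{\lambda}/2)$, where $R$ is a symmetric nonnegative definite matrix, are characteristic functions of multivariate normal distributions. Every multivariate normal distribution has a characteristic function of this form, where $\mathbf{m}$ is the mean vector and $R$ is the covariance matrix of the distribution. We denote this distribution by $N(\mathbf{m}, R)$.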
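As a sanity check on (2-3), one can compare the closed form with a Monte Carlo estimate of $E \exp(j\boldsymbol{\lambda}^T \mathbf{Y})$ for $\mathbf{Y} = A\mathbf{Z} + \mathbf{m}$. The sketch below is illustrative only: it assumes the two-dimensional case, the names `gram`, `normal_cf`, and `empirical_cf` are hypothetical, and $A$ is allowed to be any matrix with $AA^T = R$ rather than the symmetric square root used in the text (the variance of $\boldsymbol{\lambda}^T A\mathbf{Z}$ is $\boldsymbol{\lambda}^T AA^T \boldsymbol{\lambda}$ either way).

```python
import cmath
import random


def gram(a):
    """Return R = A A^T for a 2x2 matrix A."""
    return [[sum(a[i][k] * a[j][k] for k in range(2)) for j in range(2)]
            for i in range(2)]


def normal_cf(lam, m, r):
    """Closed form (2-3): exp(j lam^T m - lam^T R lam / 2)."""
    lin = sum(l * mi for l, mi in zip(lam, m))
    quad = sum(lam[i] * r[i][j] * lam[j] for i in range(2) for j in range(2))
    return cmath.exp(1j * lin - quad / 2)


def empirical_cf(lam, m, a, n_samples=200_000, seed=1):
    """Monte Carlo estimate of E exp(j lam^T Y) for Y = A Z + m, Z ~ N(0, I)."""
    rng = random.Random(seed)
    total = 0j
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
        y = [a[i][0] * z[0] + a[i][1] * z[1] + m[i] for i in range(2)]
        total += cmath.exp(1j * (lam[0] * y[0] + lam[1] * y[1]))
    return total / n_samples
```

For example, with `a = [[1.0, 0.0], [0.5, 1.0]]`, `m = [1.0, -2.0]`, and `lam = [0.3, -0.4]`, the values of `empirical_cf(lam, m, a)` and `normal_cf(lam, m, gram(a))` should agree up to sampling error.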

2.3 Transformation of Variables

Before deriving the multivariate normal density, it may be wise to pause and review some material from basic probability theory about the transformation of variables. Specifically, we will review the technique required to calculate the distribution of a function of a random variable. Rather than prove the general case directly, let's first prove the univariate case, then state the general multivariate case.

Theorem 4 Let $X$ and $Y$ be continuous random variables with $Y = g(X)$. Suppose $g$ is one-to-one and both $g$ and its inverse function, $g^{-1}$, are continuously differentiable. Then

\[ f_Y(y) = f_X[g^{-1}(y)] \left| \frac{dg^{-1}(y)}{dy} \right|. \tag{2-4} \]

Proof. Since $g$ is one-to-one and continuous, it is either increasing or decreasing; suppose it is increasing. Let $a$ and $b$ be real numbers such that $a < b$; we have

\[ P[Y \in (a, b)] = P[g(X) \in (a, b)] = P[X \in (g^{-1}(a), g^{-1}(b))]. \]

But

\[ P[Y \in (a, b)] = \int_a^b f_Y(y)\,dy \]

and

\[ P[X \in (g^{-1}(a), g^{-1}(b))] = \int_{g^{-1}(a)}^{g^{-1}(b)} f_X(x)\,dx = \int_a^b f_X[g^{-1}(y)] \frac{dg^{-1}(y)}{dy}\,dy. \]

Thus, for all intervals $(a, b)$, we have that

\[ \int_a^b \left\{ f_Y(y) - f_X[g^{-1}(y)] \frac{dg^{-1}(y)}{dy} \right\} dy = 0. \tag{2-5} \]

Suppose that (2-4) is not true. Then there exists some $y^*$ such that equality does not hold; but by the continuity of the density functions $f_X$ and $f_Y$, the left side of (2-5) must be nonzero for some small interval containing $y^*$. This yields a contradiction, so (2-4) is true if $g$ is increasing. To show that it holds for decreasing $g$, we simply note that the change of variable reverses the limits of integration as well as the sign of the slope. Thus, the absolute value is required. □

Theorem 5 Let $\mathbf{X}$ and $\mathbf{Y}$ be continuous random vectors with $\mathbf{Y} = g(\mathbf{X})$. Suppose $g$ is one-to-one and both $g$ and its inverse function, $g^{-1}$, are continuously differentiable. Then

\[ f_{\mathbf{Y}}(\mathbf{y}) = f_{\mathbf{X}}[g^{-1}(\mathbf{y})] \left| \frac{\partial g^{-1}(\mathbf{y})}{\partial \mathbf{y}} \right|, \tag{2-6} \]

where $\left| \frac{\partial g^{-1}(\mathbf{y})}{\partial \mathbf{y}} \right|$ is the absolute value of the Jacobian determinant.

The proof of this theorem is similar to the proof for the univariate case, and we will not repeat it here.
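To see formula (2-4) at work, including the role of the absolute value, consider the decreasing map $Y = e^{-X}$ with $X \sim N(0, 1)$, so that $g^{-1}(y) = -\log y$ and $dg^{-1}(y)/dy = -1/y$. The sketch below (a hypothetical example chosen for illustration, not one from these notes) integrates the density given by (2-4) over an interval and compares the result with the same probability computed directly from the distribution of $X$.

```python
import math


def std_normal_pdf(x):
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def density_of_y(y):
    """f_Y(y) for Y = exp(-X), X ~ N(0, 1), using (2-4)."""
    g_inverse = -math.log(y)
    abs_derivative = abs(-1.0 / y)  # |d g^{-1}(y) / dy|
    return std_normal_pdf(g_inverse) * abs_derivative


def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f over (a, b)."""
    h = (b - a) / steps
    return h * sum(f(a + (k + 0.5) * h) for k in range(steps))


def prob_y_in(a, b):
    """P[a < Y < b] = P[-log b < X < -log a], computed from the cdf of X."""
    return std_normal_cdf(-math.log(a)) - std_normal_cdf(-math.log(b))
```

Comparing `integrate(density_of_y, 0.5, 2.0)` with `prob_y_in(0.5, 2.0)` should show agreement to within the quadrature error. Dropping the `abs` would flip the sign of the integral, which is the point made at the end of the proof.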

2.4 The Multivariate Normal Density

It remains to determine the probability density function of a multivariate normally distributed random vector. We first observe that if $R$ is not positive definite, then all of the probability mass lies in some hyperplane, and the probability density does not exist. In such a case, we say that the multivariate normal distribution is singular. When $R > 0$, however, the multivariate probability density does exist and is given by the following theorem.

Theorem 6 If the covariance matrix $R$ is nonsingular, the density of the multivariate normal distribution with characteristic function $\phi_{\mathbf{Y}}(\boldsymbol{\lambda}) = \exp(j\boldsymbol{\lambda}^T \mathbf{m} - \boldsymbol{\lambda}^T R \boldsymbol{\lambda}/2)$ exists and is given by

\[ f_{\mathbf{Y}}(\mathbf{y}) = (2\pi)^{-n/2} (\det R)^{-1/2} \exp\left[ -\tfrac{1}{2} (\mathbf{y} - \mathbf{m})^T R^{-1} (\mathbf{y} - \mathbf{m}) \right]. \tag{2-7} \]

Proof: The distribution with characteristic function (2-3) is the distribution of $\mathbf{Y} = A\mathbf{Z} + \mathbf{m}$, where $A$ is the symmetric positive definite square root of $R$ and where $\mathbf{Z} \sim N(\mathbf{0}, I)$. The density of $\mathbf{Z}$ is the product of the marginal densities,

\[ f_{\mathbf{Z}}(\mathbf{z}) = f_{Z_1}(z_1) \cdots f_{Z_n}(z_n) = (2\pi)^{-n/2} \exp(-\mathbf{z}^T \mathbf{z}/2). \]

The next step is to determine the density of $\mathbf{Y}$ in terms of the density of $\mathbf{Z}$. To do this, recall the transformation of variables formula, and note that the inverse transform for this problem is $\mathbf{Z} = A^{-1}(\mathbf{Y} - \mathbf{m})$, whose Jacobian is

\[ \det J = \det\left[ \frac{\partial Z_i}{\partial Y_j} \right] = \det A^{-1} = \det R^{-1/2} = (\det R)^{-1/2}. \]

Hence, the density of $\mathbf{Y}$ is

\[ f_{\mathbf{Y}}(\mathbf{y}) = f_{\mathbf{Z}}(A^{-1}(\mathbf{y} - \mathbf{m}))\,|\det J| = (2\pi)^{-n/2} (\det R)^{-1/2} \exp\left[ -\tfrac{1}{2} (\mathbf{y} - \mathbf{m})^T A^{-1} A^{-1} (\mathbf{y} - \mathbf{m}) \right], \]

which reduces to (2-7). □
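The density (2-7) is simple to evaluate directly. The following sketch, restricted to $n = 2$ so that the inverse and determinant can be written out by hand, evaluates (2-7) and checks numerically that it integrates to approximately one. The function names and the grid parameters are assumptions made for this illustration.

```python
import math


def mvn_pdf_2d(y, m, r):
    """Evaluate the density (2-7) for n = 2, mean m, and covariance r."""
    det = r[0][0] * r[1][1] - r[0][1] * r[1][0]
    if r[0][0] <= 0.0 or det <= 0.0:
        # Singular or indefinite R: no density exists (see the text above).
        raise ValueError("R must be symmetric positive definite")
    inv = [[r[1][1] / det, -r[0][1] / det],
           [-r[1][0] / det, r[0][0] / det]]
    d = [y[0] - m[0], y[1] - m[1]]
    quad = sum(d[i] * inv[i][j] * d[j] for i in range(2) for j in range(2))
    n = 2
    return (2.0 * math.pi) ** (-n / 2) * det ** (-0.5) * math.exp(-0.5 * quad)


def grid_integral(m, r, half_width=8.0, steps=400):
    """Midpoint-rule integral of the density over a square centered at m."""
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps):
        y0 = m[0] - half_width + (i + 0.5) * h
        for j in range(steps):
            y1 = m[1] - half_width + (j + 0.5) * h
            total += mvn_pdf_2d([y0, y1], m, r)
    return total * h * h
```

With $R = I$ and $\mathbf{y} = \mathbf{m}$, the function returns $1/(2\pi)$, and `grid_integral([1.0, -2.0], [[2.0, 0.6], [0.6, 1.0]])` should be very close to 1.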

We complete our discussion of multivariate normal distributions by noting that, while the covariance of two independent random variables is zero, a zero covariance does not generally imply that the variables are independent. For the multivariate normal distribution, however, zero covariance does imply independence.

Theorem 7 If $\mathbf{Y} \sim N(\mathbf{m}, R)$, then the component random variables $Y_1, \ldots, Y_n$ are mutually independent if and only if $R$ is a diagonal matrix.

Proof: If $R$ is not diagonal, then there is a nonzero off-diagonal element which gives a nonzero covariance between two of the elements of $\mathbf{Y}$; therefore they cannot be independent. Conversely, if $R = \mathrm{diag}\{\sigma_1^2, \ldots, \sigma_n^2\}$, the characteristic function factors as

\[ \phi_{\mathbf{Y}}(\boldsymbol{\lambda}) = \exp\left( j \sum_{j=1}^n \lambda_j m_j - \frac{1}{2} \sum_{j=1}^n \sigma_j^2 \lambda_j^2 \right) = \prod_{j=1}^n \exp\left( j\lambda_j m_j - \tfrac{1}{2} \sigma_j^2 \lambda_j^2 \right), \]

which proves the independence. □

Exercise 2-1 Let $\mathbf{Y} = A\mathbf{X} + \boldsymbol{\beta}$ be a linear transformation from $\mathbf{X}$ to $\mathbf{Y}$, where $A$ is a nonsingular square matrix and $\boldsymbol{\beta}$ is a constant vector. Show that the Jacobian $\det\left[ \frac{\partial y_i}{\partial x_j} \right]$ of this transformation is the determinant of $A$.

Exercise 2-2 (Ferguson) Random variables may be univariate normal but not jointly normal. Here is an example of two normal random variables that are uncorrelated but not independent. Let $X$ have a normal distribution with mean zero and variance one. Let $c$ be a nonnegative number and let $Y = -X$ if $|X| \leq c$ and $Y = X$ if $|X| > c$. Then $Y$ also has a normal distribution with zero mean and variance one. Show that the covariance of $X$ and $Y$ is a continuous function of $c$, going from $+1$ when $c = 0$ to $-1$ when $c \to \infty$. Therefore, for some value of $c$, $X$ and $Y$ are uncorrelated, yet $X$ and $Y$ are as far from being independent as possible, each being a function of the other. ($c = 1.538\ldots$)
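The value of $c$ quoted in Exercise 2-2 can be reproduced numerically. Integrating by parts gives $E[X^2; |X| \leq c] = (2\Phi(c) - 1) - 2c\,\varphi(c)$, where $\Phi$ and $\varphi$ are the standard normal cdf and pdf, so that $\mathrm{Cov}(X, Y) = 1 - 2E[X^2; |X| \leq c]$. The sketch below assumes the sign convention $Y = -X$ for $|X| \leq c$ and $Y = X$ otherwise (the convention consistent with the covariance running from $+1$ to $-1$), and locates the zero by bisection. It is a numerical companion to the exercise, not a substitute for the requested argument.

```python
import math


def std_normal_pdf(x):
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def covariance_xy(c):
    """Cov(X, Y) when Y = -X for |X| <= c and Y = X for |X| > c."""
    # E[X^2 ; |X| <= c], obtained by integrating x^2 * pdf(x) by parts.
    inner = (2.0 * std_normal_cdf(c) - 1.0) - 2.0 * c * std_normal_pdf(c)
    return 1.0 - 2.0 * inner


def find_uncorrelated_c(lo=0.0, hi=5.0, tol=1e-10):
    """Bisection for the c at which Cov(X, Y) = 0 (covariance decreases in c)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if covariance_xy(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

Here `covariance_xy(0.0)` equals 1, the covariance approaches $-1$ for large $c$, and `find_uncorrelated_c()` lands near the value 1.538 given in the exercise.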


3 Introductory Estimation Theory Concepts

3.1 Notational Conventions

In an earlier discussion we introduced the transition probability $P(\theta, \cdot)$, and observed that, for every value of $\theta$, this function is a probability. For a given random variable $X$, we then formed the derived distribution, which we expressed as $P_X(\theta, \cdot)$, and we expressed the associated distribution function as $F_X(x \mid \theta)$, with corresponding notational conventions for the probability density function (pdf) and probability mass function (pmf).

Although most of our work will involve the distribution and, as appropriate, the pdf or pmf, it will often be necessary to refer to the transition probability function $P(\theta, \cdot)$. When there is no chance of confusion concerning the random variable under consideration, it is customary to adopt abbreviated notation. Two such notational conventions are common. Sometimes we will write $P_\theta(\cdot)$ to denote this probability, and sometimes we will write it as $P(\cdot \mid \theta)$. In both cases, we depend on the identity of the random variable to be understood from the context of the problem. You will need to get used to both notations, as they will both appear in the literature and in these notes. (I think it one of the unspoken prerogatives of probabilists and statisticians to use arcane, inconsistent and, sometimes, abusive notation.)

When there is no likelihood of confusion, we may also sometimes denote the distribution function $F_X(x \mid \theta)$ by the abbreviated form $F_\theta(x)$. For discrete random variables, the probability mass function (pmf) will be denoted by $f_X(x \mid \theta)$ or $f_\theta(x)$ and, similarly, for continuous random variables, the probability density function (pdf) will also be denoted by $f_X(x \mid \theta)$ or $f_\theta(x)$.

We will also be required to take the mathematical expectation of various random variables. As usual, we let $E(\cdot)$ denote the expectation operator (with or without parentheses, depending upon the chances of confusion). When we write $EX$ it is understood that this expectation is performed using the distribution function of $X$, but when this distribution function is parameterized by $\theta$, we must augment this notation by writing $E_\theta X$.

3.2 Populations and Statistics

As we have described earlier, the problem of estimation is, essentially, to obtain a set of data, or observations, and use this information in some way to fashion a guess for the value of an unknown parameter (the parameter may be a vector). One of the ways to achieve this goal is through the method of random sampling. Our starting point for this discussion is the concept of a population.

Definition. A population, or parent population, is the probability space $(\mathbb{R}, \mathcal{B}, P_X(\theta, \cdot))$ induced on $\mathbb{R}$ by a random variable $X$. The random variable $X$ is called the population random variable. The distribution of the population is the distribution of $X$. The population is discrete or continuous according as $X$ is discrete or continuous. This definition extends to the vector case in the obvious way.

By sampling, we mean that we repeat a given experiment a number of times; the $i$th repetition involves the creation, mathematically, of a replica, or copy, of the population on which a random variable $X_i$ is defined. The distribution of the random variable $X_i$ is the same as the distribution of $X$, the parent population random variable. The random variables $X_1, X_2, \ldots$ are called sample random variables or, sometimes, the sample values of $X$. In general, a function of the sample values of a random variable $X$ is called a statistic of $X$.

The act of sampling can take many forms, some of which will be discussed in this course. Perhaps the simplest sampling procedure is that of sampling with replacement. A more complicated sampling example involves the sampling of an entire stochastic process.

For an example of sampling with replacement, let $X$ be a random variable with unknown mean value. Suppose we have a collection of independent samples of $X$, which we will denote as $X_1, \ldots, X_n$. The sample mean, written as the random variable $\bar{X}$, is given by

\[ \bar{X} = \frac{1}{n} \sum_{i=1}^n X_i, \]

and is an example of an estimator of the population mean; that is, $\bar{X}$ is a statistic.

Before continuing with this discussion, it is important to make a distinction between random variables and the values they may take. Once the observations have been taken, the sample random variables become evaluated at the points $X_i = x_i$, and the array $\{x_1, \ldots, x_n\}$ is a collection of real numbers. After the observations, therefore, the sample mean may be evaluated as

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i. \]

The real number $\bar{x}$ is not a random variable, nor are the quantities $x_1, \ldots, x_n$. When we talk about quantities such as the mean or variance, they are associated with random variables, and not the values they assume. We can certainly talk about the average of the numbers $x_1, \ldots, x_n$, but this average is not the mathematical expectation of the random variable $X$. The only way we can think of $\bar{x}$ as a random variable is in a degenerate sense, where all of the mass is located at the number $\bar{x}$. Outside this context, it is meaningless to speak of the mean or variance of $\bar{x}$, but it is highly relevant to speak of the mean and variance of the random variable $\bar{X}$.

3.2.1 Sufficient Statistics

The random variable $\bar{X}$ is one of many possible statistics to be obtained from the samples $X_1, \ldots, X_n$. Suppose our objective in collecting the observations is to determine the mean value of the random variable $X$. Let us ask ourselves, "What information about $X$ is furnished by the sample values?" Or perhaps better, "What is the best estimate of the mean value of $X$ that we can make on the basis of the sample values alone?" This question is not yet really mathematically meaningful, since the notion of "best" has not been defined. Yet, with the above example, there is a strong compulsion to suppose that the random variable $\bar{X}$ captures everything there is to learn from the random variables $X_1, \ldots, X_n$ about the expectation of $X$. As we will see, the random variable $\bar{X}$ possesses some special properties that qualify it as a sufficient statistic for the mean of the random variable $X$.

Definition. Let $X$ be a random variable whose distribution depends on a parameter $\theta$. A real-valued function $T$ of $X$ is said to be sufficient for $\theta$ if the conditional distribution of $X$, given $T = t$, is independent of $\theta$. That is, $T$ is sufficient for $\theta$ if

\[ F_{X|T}(x \mid t, \theta) = F_{X|T}(x \mid t). \]

The above definition remains unchanged if $X$, $\theta$, and $T$ are vector-valued, rather than scalar-valued.

Example 3-1 A coin with unknown probability $p$, $0 \leq p \leq 1$, of heads is tossed independently $n$ times. If we let $X_i$ be zero if the outcome of the $i$th toss is tails and one if the outcome is heads, the random variables $X_1, \ldots, X_n$ are independent and identically distributed with common probability mass function

\[ f_X(x_i \mid p) = P(X_i = x_i \mid p) = p^{x_i} (1 - p)^{1 - x_i} \quad \text{for } x_i = 0, 1. \]

If we are looking at the outcome of this sequence of tosses in order to make a guess of the value of $p$, it is clear that the important thing to consider is the total number of heads and tails. It is hard to see how the information concerning the order of heads and tails can help us once we know the total number of heads. In fact, if we let $T$ denote the total number of heads, $T = \sum_{i=1}^n X_i$, then intuitively the conditional distribution of $X_1, \ldots, X_n$, given $T = j$, is uniform over the $\binom{n}{j}$ $n$-tuples which have $j$ ones and $n - j$ zeros; that is, given that $T = j$, the distribution of $X_1, \ldots, X_n$ may be obtained by choosing completely at random the $j$ places in which ones go and putting zeros in the other locations. This may be done not knowing $p$. Thus, once we know the total number of heads, being given the rest of the information about $X_1, \ldots, X_n$ is like being told the value of a random variable whose distribution does not depend on $p$ at all. In other words, the total number of heads carries all the information the sample has to give about the unknown parameter $p$.

We claim that the total number of heads is a sufficient statistic for $p$. To prove that fact, we need to show that the conditional distribution of $\{X_1, \ldots, X_n\}$, given $T = t$, is independent of $p$. This conditional distribution is

\[ f_{X_1, \ldots, X_n \mid T}(x_1, \ldots, x_n \mid t, p) = \frac{P(X_1 = x_1, \ldots, X_n = x_n, T = t \mid p)}{P(T = t \mid p)}. \tag{3-1} \]

The denominator of this expression is the binomial probability

\[ P(T = t \mid p) = \binom{n}{t} p^t (1 - p)^{n - t}. \tag{3-2} \]

We now examine the numerator. Since $t$ represents the sum of the values the $X_i$ take, the numerator must be zero whenever $x_1 + \cdots + x_n \neq t$; otherwise we would be assigning probability to an inconsistent event. Thus, the numerator is zero except when $x_1 + \cdots + x_n = t$ and each $x_i = 0$ or 1, and then

\[ P(X_1 = x_1, \ldots, X_n = x_n, T = t \mid p) = P(X_1 = x_1, \ldots, X_n = x_n \mid p) = p^{x_1}(1 - p)^{1 - x_1} \cdots p^{x_n}(1 - p)^{1 - x_n} = p^{\sum x_i} (1 - p)^{n - \sum x_i}. \tag{3-3} \]

But $t = \sum x_i$; thus, substituting (3-2) and (3-3) into (3-1), we obtain

\[ f_{X_1, \ldots, X_n \mid T}(x_1, \ldots, x_n \mid t, p) = \binom{n}{t}^{-1}, \]

where $t = \sum x_i$ and each $x_i = 0$ or 1. This distribution is independent of $p$ for all $t = 0, 1, \ldots, n$, which proves the sufficiency of $T$.

The results of this example are likely no surprise to you; they make intuitive sense without requiring a rigorous mathematical proof. We do learn from this example, however, that the notion of sufficiency is central to the study of statistics. But it would be tedious to establish sufficiency by essentially proving a new theorem for every application. Fortunately, we won't have to do so. The factorization theorem gives us a convenient mechanism for testing sufficiency of a statistic. We state and prove this theorem for discrete variables, and sketch a proof for absolutely continuous variables as well.

Theorem 1 (The Factorization Theorem). Let $X$ be a discrete random variable whose probability mass function $f_X(x \mid \theta)$ depends on a parameter $\theta$. The statistic $T = t(X)$ is sufficient for $\theta$ if, and only if, the probability mass function factors into a product of a function of $t(x)$ and $\theta$ and a function of $x$ alone; that is,

\[ f_X(x \mid \theta) = b[t(x), \theta]\,a(x). \tag{3-4} \]

Proof. Suppose $T = t(X)$, and note that, due to this constraint, the joint probability mass function $f_{X,T}(x, \tau \mid \theta)$ must be zero whenever $\tau \neq t(x)$. Furthermore, this joint probability must equal the marginal probability of $X$ whenever the constraint is satisfied. To see this, observe that we may write $f_X(x \mid \theta) = \sum_\tau f_{X,T}(x, \tau \mid \theta)\,I_{\{t(x)\}}(\tau)$, and since there is only one such $\tau$, we have that $f_X(x \mid \theta) = f_{X,T}[x, t(x) \mid \theta]$, as claimed.

Assume that $T$ is sufficient for $\theta$, and that $T = t(X)$. Then the conditional distribution of $X$ given $T$ is independent of $\theta$, and we may write

\[ f_X(x \mid \theta) = f_{X,T}[x, t(x) \mid \theta] = f_{X|T}[x \mid t(x), \theta]\,f_T[t(x) \mid \theta] = f_{X|T}[x \mid t(x)]\,f_T[t(x) \mid \theta], \]

provided the conditional probability is well defined. Hence, we define $a(x)$ by

\[ a(x) = \begin{cases} 0 & \text{if } f_X(x \mid \theta) = 0 \text{ for all } \theta, \\ f_{X|T}[x \mid t(x)] & \text{if } f_X(x \mid \theta) > 0 \text{ for some } \theta. \end{cases} \]

With

\[ b[t(x), \theta] = f_T[t(x) \mid \theta], \]

the factorization is established.

To establish the converse, suppose a factorization of the form (3-4) holds, and let $t_0$ be chosen such that $f_T(t_0 \mid \theta) > 0$ for some $\theta$. Then

\[ f_{X|T}(x \mid t_0, \theta) = \frac{f_{X,T}(x, t_0 \mid \theta)}{f_T(t_0 \mid \theta)}. \tag{3-5} \]

The numerator is zero for all $\theta$ whenever $t(x) \neq t_0$, and when $t(x) = t_0$, the numerator is simply $f_X(x \mid \theta)$, by our previous argument. The denominator may be written

\[ f_T(t_0 \mid \theta) = \sum_{x \in A(t_0)} f_X(x \mid \theta) = \sum_{x \in A(t_0)} b[t(x), \theta]\,a(x), \tag{3-6} \]

where $A(t_0) = \{x : t(x) = t_0\}$. Hence, substituting (3-4) and (3-6) into (3-5) and setting the pmf to zero otherwise, we obtain

\[ f_{X|T}(x \mid t_0, \theta) = \begin{cases} 0 & \text{if } t(x) \neq t_0, \\ \dfrac{b(t_0, \theta)\,a(x)}{b(t_0, \theta) \sum_{x' \in A(t_0)} a(x')} = \dfrac{a(x)}{\sum_{x' \in A(t_0)} a(x')} & \text{if } t(x) = t_0. \end{cases} \]

Thus, $f_{X|T}(x \mid t_0, \theta)$ is independent of $\theta$ for all $t_0$ and $\theta$ for which it is defined. □
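The coin-tossing case of Example 3-1 gives a small concrete check of sufficiency. In the sketch below (function names are hypothetical), the conditional probability of a particular sequence given its total is computed by brute-force enumeration of all $2^n$ sequences, following (3-1), and can be compared across several values of $p$ with the value $\binom{n}{t}^{-1}$ obtained there. We assume $0 < p < 1$ so that the conditioning event has positive probability.

```python
import itertools
import math


def joint_pmf(x, p):
    """P(X_1 = x_1, ..., X_n = x_n | p) for independent Bernoulli(p) tosses."""
    return math.prod(p ** xi * (1.0 - p) ** (1 - xi) for xi in x)


def conditional_given_total(x, p):
    """P(X = x | T = sum(x), p), computed by enumerating all sequences."""
    n, t = len(x), sum(x)
    prob_t = sum(joint_pmf(s, p)
                 for s in itertools.product((0, 1), repeat=n)
                 if sum(s) == t)  # P(T = t | p), the denominator of (3-1)
    return joint_pmf(x, p) / prob_t


def factorization(x, p):
    """Return (b, a) with f(x | p) = b[t(x), p] * a(x), as in (3-4)."""
    n, t = len(x), sum(x)
    return p ** t * (1.0 - p) ** (n - t), 1.0
```

For the sequence `(1, 0, 1, 1, 0)`, `conditional_given_total` gives $1/\binom{5}{3} = 0.1$ (up to rounding) whether $p$ is 0.2, 0.5, or 0.9, and the product of the two factors returned by `factorization` matches `joint_pmf`.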

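The factorization theorem is also true for a large family of continuous random variables. A completely rigorous proof is outside the scope of this class, but we will give a sketch of the proof, which will hopefully illuminate the key things that go on, and give you confidence that the result is true. Armed with an understanding of the transformation of variables theorem, we may now sketch a proof of the factorization theorem for the continuous case.

Sketch of the proof in the absolutely continuous case. For this development we recognize that the statistic may be multi-dimensional, so we generalize the treatment to permit vector-valued statistics, which we will denote by $\mathbf{T}$. We first observe that the statistic $\mathbf{T}$ may not be a one-to-one mapping of the random vector $\mathbf{X}$, since the dimension of $\mathbf{T}$ may be different from the dimension of $\mathbf{X}$. A standard trick when dealing with problems of this type is to include some additional functions, which we collect in a vector $\mathbf{U}$, in order to fill out the dimension of the transformation. For example, if $\mathbf{X}$ is $n$-dimensional and $\mathbf{T}$ is $r$-dimensional, then the dimension of $\mathbf{U}$ would be $n - r$. We then prove the theorem with the aid of these auxiliary functions and finally show that the choice of functions does not matter to the result we want. This approach may seem a little messy, but unless we do something to enable us to use our standard transformation of variables formula the proof is likely to be even more messy.

So, let $\mathbf{U}(\mathbf{X})$ be an auxiliary statistic so that the mapping $\mathbf{X} = g(\mathbf{T}, \mathbf{U})$ is one-to-one and therefore invertible. Further, suppose $\mathbf{U}$ is smooth enough for the Jacobian to exist. Notation becomes a problem with manipulations of this kind, and it will be convenient to write $\mathbf{x}(\mathbf{t}, \mathbf{u})$ for $g(\mathbf{t}, \mathbf{u})$, and $(\mathbf{t}(\mathbf{x}), \mathbf{u}(\mathbf{x}))$ for $g^{-1}(\mathbf{x})$. The densities transform as follows:

\[ \begin{aligned} f_{\mathbf{X}}(\mathbf{x} \mid \theta) &= f_{g^{-1}(\mathbf{X})}[g^{-1}(\mathbf{x}) \mid \theta] \left| \frac{\partial g^{-1}(\mathbf{x})}{\partial \mathbf{x}} \right| \\ &= f_{\mathbf{T},\mathbf{U}}[\mathbf{t}(\mathbf{x}), \mathbf{u}(\mathbf{x}) \mid \theta] \left| \frac{\partial (\mathbf{t}(\mathbf{x}), \mathbf{u}(\mathbf{x}))}{\partial \mathbf{x}} \right| \\ &= f_{\mathbf{T}}[\mathbf{t}(\mathbf{x}) \mid \theta]\,f_{\mathbf{U}|\mathbf{T}}[\mathbf{u}(\mathbf{x}) \mid \mathbf{t}(\mathbf{x}), \theta] \left| \frac{\partial (\mathbf{t}(\mathbf{x}), \mathbf{u}(\mathbf{x}))}{\partial \mathbf{x}} \right|, \end{aligned} \]

where $\left| \frac{\partial (\mathbf{t}(\mathbf{x}), \mathbf{u}(\mathbf{x}))}{\partial \mathbf{x}} \right|$ is the absolute value of the Jacobian determinant. If $\mathbf{T}$ is sufficient, then $f_{\mathbf{U}|\mathbf{T}}(\mathbf{u} \mid \mathbf{t}, \theta)$ is independent of $\theta$, giving the required factorization analogous to the earlier proof.

Conversely, if a factorization exists, then

\[ \begin{aligned} f_{\mathbf{T},\mathbf{U}}(\mathbf{t}, \mathbf{u} \mid \theta) &= f_{\mathbf{X}}[g(\mathbf{t}, \mathbf{u}) \mid \theta] \left| \frac{\partial g(\mathbf{t}, \mathbf{u})}{\partial (\mathbf{t}, \mathbf{u})} \right| \\ &= f_{\mathbf{X}}[\mathbf{x}(\mathbf{t}, \mathbf{u}) \mid \theta] \left| \frac{\partial \mathbf{x}(\mathbf{t}, \mathbf{u})}{\partial (\mathbf{t}, \mathbf{u})} \right| \\ &= b(\mathbf{t}, \theta)\,a[\mathbf{x}(\mathbf{t}, \mathbf{u})] \left| \frac{\partial \mathbf{x}(\mathbf{t}, \mathbf{u})}{\partial (\mathbf{t}, \mathbf{u})} \right|, \end{aligned} \tag{3-7} \]

so that, integrating out $\mathbf{u}$, the marginal of $\mathbf{T}$ becomes

\[ f_{\mathbf{T}}(\mathbf{t} \mid \theta) = b(\mathbf{t}, \theta) \int a[\mathbf{x}(\mathbf{t}, \mathbf{u})] \left| \frac{\partial \mathbf{x}(\mathbf{t}, \mathbf{u})}{\partial (\mathbf{t}, \mathbf{u})} \right| d\mathbf{u}, \tag{3-8} \]

and we have, taking the ratio of (3-7) and (3-8),

\[ f_{\mathbf{U}|\mathbf{T}}(\mathbf{u} \mid \mathbf{t}, \theta) = \frac{f_{\mathbf{T},\mathbf{U}}(\mathbf{t}, \mathbf{u} \mid \theta)}{f_{\mathbf{T}}(\mathbf{t} \mid \theta)} = \frac{a[\mathbf{x}(\mathbf{t}, \mathbf{u})] \left| \dfrac{\partial \mathbf{x}(\mathbf{t}, \mathbf{u})}{\partial (\mathbf{t}, \mathbf{u})} \right|}{\displaystyle\int a[\mathbf{x}(\mathbf{t}, \mathbf{u}')] \left| \frac{\partial \mathbf{x}(\mathbf{t}, \mathbf{u}')}{\partial (\mathbf{t}, \mathbf{u}')} \right| d\mathbf{u}'}, \]

independent of $\theta$. Thus the distribution of $\mathbf{U}$ given $\mathbf{T}$ is independent of $\theta$; hence the distribution of $(\mathbf{T}, \mathbf{U})$ given $\mathbf{T}$ is independent of $\theta$, and the distribution of $\mathbf{X}$, given $\mathbf{T}$, is independent of $\theta$. □

Example 3-2 Consider a sample $X_1, \ldots, X_n$ from $N(\mu, \sigma^2)$. The joint density of $X_1, \ldots, X_n$ is

\[ f_{X_1, \ldots, X_n}(x_1, \ldots, x_n \mid \mu, \sigma) = (2\pi\sigma^2)^{-n/2} \exp\left[ -(2\sigma^2)^{-1} \sum_{i=1}^n (x_i - \mu)^2 \right]. \tag{3-9} \]

If $\mu$ is a known quantity, then from the factorization theorem $t(\mathbf{X}) = \sum_{i=1}^n (X_i - \mu)^2$ is a sufficient statistic for $\sigma^2$. (In this case the function $a(\mathbf{x})$ may be taken identically equal to one.) Let $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$ and $s^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2$, so that the density (3-9) may be written

\[ f_{X_1, \ldots, X_n}(x_1, \ldots, x_n \mid \mu, \sigma) = (2\pi\sigma^2)^{-n/2} \exp[-ns^2/2\sigma^2] \exp[-n(\bar{x} - \mu)^2/2\sigma^2]. \]

If $\sigma^2$ is a known quantity, then from the factorization theorem, $\bar{X}$ is a sufficient statistic for $\mu$. If both $\mu$ and $\sigma^2$ are unknown, the pair $(\bar{X}, S^2)$ is a sufficient statistic for $(\mu, \sigma^2)$. (We adopt the notation that $\bar{X}$ and $S^2$ are the random variables corresponding to the realizations $\bar{x}$ and $s^2$.)

Example 3-3 Consider a sample $X_1, \ldots, X_n$ from the uniform distribution over the interval $[\alpha, \beta]$. The joint density is

\[ f_{X_1, \ldots, X_n}(x_1, \ldots, x_n \mid \alpha, \beta) = (\beta - \alpha)^{-n} \prod_{i=1}^n I_{(\alpha, \beta)}(x_i), \]

where $I_A$ is the indicator function: $I_A(x) = 1$ if $x \in A$, $I_A(x) = 0$ if $x \notin A$. This joint density may be rewritten as

\[ f_{X_1, \ldots, X_n}(x_1, \ldots, x_n \mid \alpha, \beta) = (\beta - \alpha)^{-n}\,I_{(\alpha, \infty)}(\min x_i)\,I_{(-\infty, \beta)}(\max x_i). \]

We examine three cases. First, if $\alpha$ is known, then $\max X_i$ is a sufficient statistic for $\beta$. Second, if $\beta$ is known, then $\min X_i$ is a sufficient statistic for $\alpha$. Third, if both $\alpha$ and $\beta$ are unknown, then $(\min X_i, \max X_i)$ is a sufficient statistic for $(\alpha, \beta)$.

3.2.2 Complete Sufficient Statistics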

As we have seen, the concept of a sufficient statistic is useful for simplifying the structure of estimators. It leads to economy in the design of algorithms to compute the estimates, and may simplify the requirements for data acquisition and storage. Clearly, not all sufficient statistics are created equal. As an extreme case, the mapping $T_1(X_1, \ldots, X_n) = (X_1, \ldots, X_n)$ is always a sufficient statistic, but no reduction in complexity is obtained. At the other extreme, if the random variables $X_i$ are i.i.d., then, as we have seen, a sufficient statistic for the mean is the average, $T_2(X_1, \ldots, X_n) = \bar{X}$, and it is hard to see how complexity could be reduced further. What about the vector-valued statistic $T_3(X_1, \ldots, X_n) = \left( \sum_{i=1}^{n-1} X_i, X_n \right)$? It is straightforward to show that this statistic is also sufficient for the mean. Obviously, $T_2$ would require less bandwidth to transmit, less memory to store, and would be simpler to use, but all three are sufficient for the mean. In fact, it is easy to see that $T_3$ can be expressed as a function of $T_1$ but not vice versa, and that $T_2$ can be expressed as a function of $T_3$ (and, consequently, of $T_1$). This leads to a useful definition.

Definition. A sufficient statistic for a parameter $\theta$ that is a function of all other sufficient statistics for $\theta$ is said to be a minimal sufficient statistic, or necessary and sufficient statistic, for $\theta$. Such a sufficient statistic represents the smallest amount of information that is still sufficient for the parameter.

There are a number of questions one might ask about minimal sufficient statistics: (a) Does one always exist? (b) If so, is it unique? (c) If it exists, how do I find it? Rather than try to answer these questions directly, we sidestep them slightly and introduce a related concept, that of completeness.

Definition. A sufficient statistic $T$ for a parameter $\theta$ is said to be complete if every real-valued function of $T$ is zero with probability one whenever the mathematical expectation of that function of $T$ is zero for all values of the parameter. In other words, let $W$ be a real-valued function. Then $T$ is complete if

\[ E_\theta W(T) = 0 \text{ for all } \theta \quad \text{implies} \quad P_\theta[W(T) = 0] = 1 \text{ for all } \theta. \]

Example 3-4 Let $X_1, \ldots, X_n$ be a sample from the uniform distribution over the interval $[0, \theta]$, $\theta > 0$. Then $T = \max_j X_j$ is sufficient for $\theta$. We may compute the density of $T$ as follows. For any real number $t$, the event $[\max_i X_i \leq t]$ occurs if and only if $[X_i \leq t]$ for all $i = 1, \ldots, n$. Thus, using the independence of the $X_i$, we have

\[ P_\theta[T \leq t] = \prod_{i=1}^n P_\theta[X_i \leq t] = \begin{cases} 0 & \text{if } t \leq 0, \\ t^n/\theta^n & \text{if } 0 < t \leq \theta, \\ 1 & \text{if } \theta < t, \end{cases} \]

and the density is

\[ f_T(t \mid \theta) = n\,\frac{t^{n-1}}{\theta^n}\,I_{(0, \theta)}(t). \]

Hence, if

\[ E_\theta W(T) = n\theta^{-n} \int_0^\theta W(t)\,t^{n-1}\,dt \]

is identically zero for $\theta > 0$, we must have that $\int_0^\theta W(t)\,t^{n-1}\,dt = 0$ for all $\theta > 0$. This implies that $W(t) = 0$ for all $t > 0$ except for a set of Lebesgue measure zero. (Roughly speaking, Lebesgue measure corresponds to length; so this means that $W$ must be zero except, perhaps, on a set whose total length is zero.) At all points of continuity, the fundamental theorem of calculus shows that $W(t)$ is zero. Hence, $P_\theta[W(T) = 0] = 1$ for all $\theta > 0$, so that $T$ is a complete sufficient statistic.

Our interest in forming the notion of completeness is that it has some useful consequences. In particular, we present two of the most important properties of complete sufficient statistics. We precede these properties by an important definition.

Definition. Let $X$ be a random variable whose sample values are used to estimate a parameter $\theta$ of the distribution of $X$. An estimate $\hat{\theta}(X)$ of $\theta$ is said to be unbiased if, when $\theta$ is the true value of the parameter, the mean of the distribution of $\hat{\theta}(X)$ is $\theta$, i.e.,

\[ E_\theta \hat{\theta}(X) = \theta \quad \text{for all } \theta. \]
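The distribution of $T = \max_i X_i$ derived in Example 3-4, and the definition of unbiasedness, can both be explored by simulation. Integrating $t\,f_T(t \mid \theta)$ over $(0, \theta)$ gives $E_\theta T = n\theta/(n+1)$, so $T$ itself is a biased estimate of $\theta$, whereas $(n+1)T/n$ is unbiased. The sketch below is a rough Monte Carlo illustration; the function names, the sample sizes, and the seed are arbitrary choices made here.

```python
import random


def cdf_of_max(t, theta, n):
    """P[T <= t] for T = max of n independent uniform(0, theta) samples."""
    if t <= 0.0:
        return 0.0
    if t <= theta:
        return (t / theta) ** n
    return 1.0


def simulate_max(theta, n, n_trials=100_000, seed=2):
    """Draw n_trials realizations of T = max(X_1, ..., X_n)."""
    rng = random.Random(seed)
    return [max(rng.uniform(0.0, theta) for _ in range(n))
            for _ in range(n_trials)]


def empirical_cdf(samples, t):
    return sum(1 for s in samples if s <= t) / len(samples)


def mean(values):
    return sum(values) / len(values)
```

With `theta = 3.0` and `n = 5`, the empirical cdf of the simulated maxima should track `cdf_of_max`, their average should be near $n\theta/(n+1) = 2.5$, and the average of $(n+1)T/n$ should be near $\theta = 3$.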

Theorem 2 (Lehmann-Scheffé). Let $T$ be a complete sufficient statistic for a parameter $\theta$, and let $W$ be a function of $T$ that produces an unbiased estimate of $\theta$; then $W$ is unique with probability one.

Proof. Let $W_1$ and $W_2$ be two functions of $T$ that produce unbiased estimates of $\theta$. Thus, $E_\theta W_1(T) = E_\theta W_2(T) = \theta$ for all $\theta$. But then

\[ E_\theta[W_1(T) - W_2(T)] = 0 \quad \text{for all } \theta. \]

We note, however, that $W_1(T) - W_2(T)$ is a function of $T$, so by the completeness of $T$, we must have $W_1(T) - W_2(T) = 0$ with probability one for all $\theta$. □

Theorem 3 A complete sufficient statistic for a parameter $\theta$ is minimal.

Before proving this theorem, we need the following background material.

Definition. Let $\mathcal{F}$ be a $\sigma$-field, and let $X$ be a random variable such that $E|X| < \infty$. The conditional expectation of $X$ given $\mathcal{F}$ is a random variable, written as $E^{\mathcal{F}} X$ or $E(X \mid \mathcal{F})$, that possesses the following attributes:

(a) $E(X \mid \mathcal{F})$ is an $\mathcal{F}$-measurable random variable, and
(b) $E\{[X - E(X \mid \mathcal{F})]\,Z\} = 0$ for all $\mathcal{F}$-measurable random variables $Z$.

In particular, if $Y$ is a random variable, and $\mathcal{F} = \sigma\{Y\}$ is the $\sigma$-field generated by $Y$, that is, the $\sigma$-field containing the inverse images under $Y$ of all Borel sets, then we write the conditional expectation as $E(X \mid Y)$.

Attribute (b) of the conditional expectation is the one that makes it useful. It says that the random variable $X - E(X \mid \mathcal{F})$ is orthogonal to all random variables that are measurable with respect to $\mathcal{F}$. Hence, if $\mathcal{F} = \sigma\{Y\}$, then the difference between the random variable $X$ and its conditional expectation given $Y$ is orthogonal to $Y$. We will develop these ideas more fully later in the course. The following list enumerates the main properties of conditional expectations.

1. $E(X \mid Y) = EX$ if $X$ and $Y$ are independent.
2. $EX = E[E(X \mid Y)]$.
3. $E(X \mid Y) = f(Y)$, where $f(\cdot)$ is a function.
4. $E[g(Y)X \mid Y] = g(Y)E(X \mid Y)$, where $g(\cdot)$ is a function.
5. If $Z$ is a random variable and $\sigma\{Y\} \subset \sigma\{Z\}$, then $E(X \mid Y) = E[E(X \mid Z) \mid Y]$.
6. If $Z$ is a random variable and $\sigma\{Y\} \subset \sigma\{Z\}$, then $E(X \mid Y) = E[E(X \mid Y) \mid Z]$.
7. $E(c \mid Y) = c$ for any constant $c$.
8. $E[g(Y) \mid Y] = g(Y)$.
9. $E[(cX + dZ) \mid Y] = cE(X \mid Y) + dE(Z \mid Y)$ for any constants $c$ and $d$.

Proof. Let $T$ be a complete sufficient statistic and let $S$ be another sufficient statistic, and suppose that $S$ is minimal. By Property 2, we know that $ET = E[E(T \mid S)]$. By Property 3, we know that the conditional expectation $E(T \mid S)$ is a function of $S$ (and, because $S$ is sufficient, this function does not depend on $\theta$). But, because $S$ is minimal, we also know that $S$ is a function of $T$. Thus, the random variable $T - E(T \mid S)$ is a function of $T$, and this function has zero expectation for all $\theta$. Therefore, since $T$ is complete, it follows that $T = E(T \mid S)$ with probability one. This makes $T$ a function of $S$, and since $S$ is minimal, $T$ is therefore a function of all other sufficient statistics, and $T$ is itself minimal. □
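Property 2 and attribute (b) of the conditional expectation, both of which enter the argument above, are easy to check by hand for discrete random variables, where $E(X \mid Y = y) = \sum_x x\,P(X = x, Y = y)/P(Y = y)$. The following sketch does this for a small joint pmf. The table `EXAMPLE_JOINT` is made-up data for illustration only, and the function names are hypothetical.

```python
# Hypothetical joint pmf of (X, Y): keys are (x, y), values are probabilities.
EXAMPLE_JOINT = {
    (0, 0): 0.10,
    (1, 0): 0.30,
    (0, 1): 0.15,
    (2, 1): 0.20,
    (3, 2): 0.25,
}


def conditional_expectation(joint):
    """Return a dict mapping each y to E(X | Y = y)."""
    prob_y, weighted_x = {}, {}
    for (x, y), p in joint.items():
        prob_y[y] = prob_y.get(y, 0.0) + p
        weighted_x[y] = weighted_x.get(y, 0.0) + x * p
    return {y: weighted_x[y] / prob_y[y] for y in prob_y}


def expectation(joint, func):
    """E[func(X, Y)] under the joint pmf."""
    return sum(p * func(x, y) for (x, y), p in joint.items())
```

With `cond = conditional_expectation(EXAMPLE_JOINT)`, Property 2 says that `expectation(EXAMPLE_JOINT, lambda x, y: x)` equals `expectation(EXAMPLE_JOINT, lambda x, y: cond[y])`, and attribute (b) says that `expectation(EXAMPLE_JOINT, lambda x, y: (x - cond[y]) * g(y))` vanishes for any function `g` of `y`.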

3.3 Exponential Families

It is evident from what we have proven thus far that it is desirable to use complete sufficient statistics when possible. The fact is, however, that complete sufficient statistics do not always exist. We have seen that for the family of normal distributions, the two-dimensional statistic $\left( \sum X_i, \sum X_i^2 \right)$ (or, equivalently, the sample mean and the sample variance) is sufficient for $(\mu, \sigma^2)$, and it is at least intuitively obvious that this statistic is also minimal. This motivates us to look for properties of the distribution that would be conducive to completeness and, hence, to minimality. One family of distributions worth considering is the so-called exponential family.

Definition. A family of distributions on the real line with probability mass function or density $f(x \mid \theta)$ is said to be a $k$-parameter exponential family if $f(x \mid \theta)$ has the form

\[ f(x \mid \theta) = c(\theta)\,a(x) \exp\left[ \sum_{i=1}^k \pi_i(\theta)\,t_i(x) \right]. \tag{3-10} \]

Because $f(x \mid \theta)$ is a probability mass function or density function of a distribution, the function $c(\theta)$ is determined by the functions $a(x)$, $\pi_i(\theta)$, and $t_i(x)$ by means of the formulas

\[ c(\theta) = \frac{1}{\sum_x a(x) \exp\left[ \sum_{i=1}^k \pi_i(\theta)\,t_i(x) \right]} \]

in the discrete case and

\[ c(\theta) = \frac{1}{\int a(x) \exp\left[ \sum_{i=1}^k \pi_i(\theta)\,t_i(x) \right] dx} \]

in the continuous case.

Now let $X_1, \ldots, X_n$ be a sample of size $n$ from an exponential family of distributions with either mass function or density given by (3-10). Then the joint probability mass or density is

\[ f_{X_1, \ldots, X_n}(x_1, \ldots, x_n \mid \theta) = c^n(\theta) \left[ \prod_{j=1}^n a(x_j) \right] \exp\left[ \sum_{i=1}^k \pi_i(\theta) \sum_{j=1}^n t_i(x_j) \right], \tag{3-11} \]

and from the factorization theorem applied to this function it is clear that

\[ \mathbf{T} = [T_1, \ldots, T_k]^T = \left[ \sum_{j=1}^n t_1(X_j), \ldots, \sum_{j=1}^n t_k(X_j) \right]^T \]

is a sufficient statistic.

Example 3-5 The probability mass function for the binomial distribution for the number of successes in $m$ independent trials when $\theta$ is the probability of success at each trial is

\[ f_X(x \mid \theta) = \binom{m}{x} \theta^x (1 - \theta)^{m - x} = (1 - \theta)^m \binom{m}{x} \exp\{x[\log\theta - \log(1 - \theta)]\}, \]

for $x = 0, 1, \ldots, m$, so that this family of distributions is a one-parameter exponential family with

\[ c(\theta) = (1 - \theta)^m, \quad a(x) = \binom{m}{x}, \quad \pi_1(\theta) = \log\theta - \log(1 - \theta), \quad t_1(x) = x. \]

Hence, for sample size $n$, $\sum_{j=1}^n X_j$ is sufficient for $\theta$.

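Example 3-6 Consider the Poisson distribution for the number of events that occur in a unit-time interval when the events are occurring in a Poisson process at rate $\theta > 0$ per unit time. The probability mass function is

\[ f_X(x \mid \theta) = \frac{\theta^x e^{-\theta}}{x!} = e^{-\theta}\,\frac{1}{x!}\,e^{(\log\theta)x}, \]

for $x = 0, 1, \ldots$. This is a one-parameter exponential family with

\[ c(\theta) = e^{-\theta}, \quad a(x) = \frac{1}{x!}, \quad \pi_1(\theta) = \log\theta, \quad t_1(x) = x. \]

Hence, the number of events that occur during the specified time interval is a sufficient statistic for $\theta$.

Example 3-7 The normal probability density function is

\[ f_X(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{(x - \mu)^2}{2\sigma^2} \right] = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{\mu^2}{2\sigma^2} \right) \exp\left[ -\frac{1}{2\sigma^2}\,x^2 + \frac{\mu}{\sigma^2}\,x \right]. \]

This is a two-parameter exponential family with

\[ c(\mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{\mu^2}{2\sigma^2} \right), \quad a(x) = 1, \quad \pi_1(\mu, \sigma^2) = -\frac{1}{2\sigma^2}, \quad \pi_2(\mu, \sigma^2) = \frac{\mu}{\sigma^2}, \quad t_1(x) = x^2, \quad t_2(x) = x. \]

Hence, for sample size $n$, $\left( \sum_{i=1}^n X_i, \sum_{i=1}^n X_i^2 \right)$ are sufficient for $(\mu, \sigma^2)$.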

Example 3-8 An important family of distributions that is not exponential is the family of uniform distributions. We will not digress to prove this fact (we don't need to because we already have identified a complete sufficient statistic for that distribution).

If $X_1, \ldots, X_n$ is a sample from the exponential family (3-10), the marginal distributions of the sufficient statistic
\[
T = [T_1, \ldots, T_k]^T = \left[ \sum_{j=1}^{n} t_1(X_j), \ldots, \sum_{j=1}^{n} t_k(X_j) \right]^T
\]
also form an exponential family, as indicated by the following theorem.

Theorem 4 Let $X_1, \ldots, X_n$ be a sample from the exponential family (3-10), either continuous or discrete. (We assume, in the continuous case, that a density exists.) Then the distribution of the sufficient statistic $T = [T_1, \ldots, T_k]^T$ has the form
\[
f_T(t \mid \theta) = c(\theta)\, a_0(t) \exp\left\{ \sum_{i=1}^{k} \pi_i(\theta)\, t_i \right\}, \tag{3-12}
\]
where $t = [t_1, \ldots, t_k]^T$.

Proof in the continuous case. From the proof of the factorization theorem (see (3-8)), we may write the marginal distribution of T as
\[
f_T(t \mid \theta) = b(t, \theta) \int a[x(t, u)] \left| \frac{\partial x(t, u)}{\partial (t, u)} \right| du.
\]
Also, by the factorization theorem, we know that
\[
b[t(x), \theta] = \frac{f_X(x \mid \theta)}{a(x)}
\]
and, when $f_X$ is exponential, we may write
\[
b[t(x), \theta] = \frac{c(\theta)\, a(x) \exp\left\{ \sum_{i=1}^{k} \pi_i(\theta)\, t_i(x) \right\}}{a(x)},
\]
so, substituting this into the marginal for T, we obtain
\[
f_T(t \mid \theta) = c_0(\theta) \left[ \int a[x(t, u)] \left| \frac{\partial x(t, u)}{\partial (t, u)} \right| du \right] \exp\left\{ \sum_{i=1}^{k} \pi_i(\theta)\, t_i \right\},
\]
which is of the desired form if we set
\[
a_0(t) = \int a[x(t, u)] \left| \frac{\partial x(t, u)}{\partial (t, u)} \right| du. \qquad \Box
\]

We are now in a position to state a key result, which in large measure justifies our attention to exponential families of distributions.

Theorem 5 For a k-parameter exponential family, the sufficient statistic
\[
T = \left[ \sum_{j=1}^{n} t_1(X_j), \ldots, \sum_{j=1}^{n} t_k(X_j) \right]^T
\]
is complete, and therefore a minimal sufficient statistic.

Proof. To establish completeness, we need to show that, for any function W of T, the condition $E_\theta W(T) = 0$ for all $\theta$ implies $P_\theta[W(T) = 0] = 1$. But the expectation is
\[
E_\theta W(T) = \int W(t)\, c(\theta)\, a_0(t) \exp\left\{ \sum_{i=1}^{k} \pi_i(\theta)\, t_i \right\} dt.
\]
We observe that this is the Laplace transform of a function of the vector t, and by the uniqueness of the Laplace transform, we must have $W(t) = 0$ for almost all t (that is, all t except possibly on a set of Lebesgue measure zero). $\Box$

3.4 Minimum Variance Unbiased Estimators

Thus far in our development, we have identified some desirable properties of estimators. We introduced the concept of sufficiency to encapsulate the notion that there may be ways to reduce the complexity of an estimate by combining the observations in various ways, and we introduced the ideas of completeness and minimality in recognition that there are ways to formulate sufficient statistics that reduce the complexity of the statistic to a minimum. What we have not done, thus far, is to attribute any notion of quality to an estimate in terms of a loss function. Intuitively, we might draw the conclusion that a desirable property of an estimator is unbiasedness, and that is indeed the case. Unbiasedness, however, is still not a quantifiable metric, so we still need to address the question: If more than one estimator for a parameter exists, how can it be determined whether one is better than another?

One measure of the quality of an estimator is its variance. If X is a vector of sample values and is used to estimate a parameter $\theta$, then, denoting this estimate by $\hat\theta(X)$, its variance is
\[
\sigma_{\hat\theta}^2 = E_\theta (\hat\theta(X) - \theta)^2.
\]
In the sequel, when there is no chance for confusion, we will shorten the notation for this estimate to simply $\hat\theta$.

Definition. An estimator $\hat\theta$ is said to be a minimum variance unbiased estimate of $\theta$ if

(a) $E_\theta \hat\theta(X) = \theta$,

(b) $\sigma_{\hat\theta}^2 = \min_{\tilde\theta \in \hat\Theta} E_\theta (\tilde\theta(X) - \theta)^2$, where $\hat\Theta$ is the set of all possible unbiased estimates, given X, of $\theta$.

The notion of minimum variance is a conceptually powerful one. From our Hilbert space background, we know that variance has a valid interpretation as squared distance, and a minimum variance estimate thus possesses the property that this measure of distance between the estimate and the true parameter is minimized. This appears to be desirable. Let's explore this in more detail; we begin by establishing the famous Rao-Blackwell theorem.

Theorem 6 (Rao-Blackwell). Let Y be a random variable such that $E_\theta Y = \theta$ for all $\theta \in \Theta$, and let $\sigma_Y^2 = E_\theta (Y - \theta)^2$. Let Z be a random variable that is sufficient for $\theta$, and let $g(Z)$ be the conditional expectation of Y given Z, i.e., $g(Z) = E(Y \mid Z)$. Then

(a) $E g(Z) = \theta$, and

(b) $E(g(Z) - \theta)^2 \le \sigma_Y^2$.

Proof. The proof of (a) is immediate from Property 2 of conditional expectation:
\[
E g(Z) = E[E(Y \mid Z)] = E Y = \theta.
\]
To establish (b), we write
\[
\sigma_Y^2 = E(Y - \theta)^2 = E[Y - g(Z) + g(Z) - \theta]^2 = E[Y - g(Z)]^2 + \underbrace{E[g(Z) - \theta]^2}_{\sigma_{g(Z)}^2} + 2E[Y - g(Z)][g(Z) - \theta].
\]
We next examine the cross term $E[Y - g(Z)][g(Z) - \theta]$, and note that, by Properties 2 and 4 of conditional expectations,
\begin{align*}
E[Y - g(Z)][g(Z) - \theta] &= E\left( E\{ [Y - g(Z)] \underbrace{[g(Z) - \theta]}_{\text{function of } Z} \mid Z \} \right) \\
&= E\left( [g(Z) - \theta]\, E\{ Y - g(Z) \mid Z \} \right) \\
&= E\left( [g(Z) - \theta] \underbrace{[E(Y \mid Z) - g(Z)]}_{=0} \right) = 0.
\end{align*}
Thus,
\[
\sigma_Y^2 = E[Y - g(Z)]^2 + \sigma_{g(Z)}^2 \ge \sigma_{g(Z)}^2,
\]
which establishes (b). $\Box$

The relevance of this theorem to us is as follows: Let $X = \{X_1, \ldots, X_n\}$ be sample values of a random variable X whose distribution is parameterized by $\theta \in \Theta$, and let $Z = T(X)$

be a sufficient statistic for $\theta$. Let $Y = \hat\theta$ be any unbiased estimator of $\theta$. The Rao-Blackwell theorem states that the estimate $E[\hat\theta \mid T(X)]$ is unbiased and has variance at least as small as that of the estimate $\hat\theta$. Since the Rao-Blackwellized estimator is unbiased, if the sufficient statistic is also complete, then the Lehmann-Scheffé theorem establishes that it is unique, and hence by default is the minimum variance unbiased estimator (thus, to say it is minimum variance doesn't add anything).

Example 3-9 Consider a telephone operator who, after working for n time intervals of 10 minutes each, wonders if he would be missed if he took a 10-minute break. He assumes that calls are coming in to his switchboard as a Poisson process at the unknown rate of $\lambda$ calls per 10 minutes. To assess his chances of missing calls, the operator wants to estimate the probability that no calls will be received during a 10-minute interval. Clearly, the probability of no calls being received is given according to the Poisson distribution as $\theta = e^{-\lambda}$.

We will address this problem in two ways. First, we will find an estimate of $\lambda$, and then we will find an estimate for $\theta = e^{-\lambda}$. It may seem obvious, given an estimate $\hat\lambda$ of $\lambda$, that the estimate of $\theta$ should be $\hat\theta = e^{-\hat\lambda}$. Although the latter certainly is an estimate of $\theta$, we take this opportunity to raise an important point: the estimate of a function of an unknown quantity is not necessarily the same thing as the function of the estimate of the quantity. As we will subsequently see, this relationship is guaranteed to hold only in the case of affine functions.

Although direct observation of the unknown parameters is not possible, the operator can observe the number of calls that arrive during any time interval. Let $X_i$ denote the number of calls received within the ith interval. As we have seen, $\sum_{i=1}^{n} X_i$ is a sufficient statistic for $\lambda$, and it is not hard to show that it is also sufficient for $\theta$. Let us suppose that, on the basis of observing $X_1$ only (the number of calls during only the first time interval), the operator wishes to estimate the parameters.

1. Estimating $\lambda$. Using only his first observation, the operator defines an estimator for $\lambda$ as $Y = X_1$.

Now, suppose that he were to Rao-Blackwellize this estimator based on the sufficient statistic $Z = X_1 + \cdots + X_n$, the total number of calls received during the n time intervals. (As we have seen, Z is sufficient for $\lambda$.) He would then compute $g(Z) = E(Y \mid Z)$, the conditional expectation of his crude estimator given the sufficient statistic Z. To proceed, we first notice that
\[
\sum_{i=1}^{n} E\left[ X_i \,\middle|\, \sum_{j=1}^{n} X_j = z \right] = E\left[ \sum_{i=1}^{n} X_i \,\middle|\, \sum_{j=1}^{n} X_j = z \right] = z;
\]
that is, given that the total number of calls is $Z = z$, the expected value of Z is z. Furthermore, since the $X_i$ are independent and identically distributed, each term in the sum on the left-hand side must be the same; hence
\[
E\left[ X_i \,\middle|\, \sum_{j=1}^{n} X_j = z \right] = \frac{z}{n}.
\]
Thus, the Rao-Blackwellized estimate of $\lambda$ is
\[
\hat\lambda = E[X_1 \mid Z = z] = \frac{z}{n}.
\]

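2. Estimating $\theta$. Now let us estimate the probability that no calls will be received during the (n + 1)st time interval. Let $\theta = e^{-\lambda}$ (that is, we estimate the probability of no calls occurring directly, rather than constructing it from our estimate of $\lambda$). Again using only the first interval, we define the estimate of $\theta$ as
\[
Y = \begin{cases} 1 & \text{if } X_1 = 0 \\ 0 & \text{otherwise.} \end{cases}
\]
Notice that this is also a very crude estimator. If no calls are received, he simply sets the probability of no calls being received to unity, but if one or more calls are received, then he sets the probability of no calls to zero. The Rao-Blackwellized estimator is
\begin{align*}
g(z) = E(Y \mid Z = z) &= 1 \cdot P[X_1 = 0 \mid Z = z] + 0 \cdot P[X_1 \ne 0 \mid Z = z] \\
&= \frac{P\left( X_1 = 0, \sum_{i=2}^{n} X_i = z \right)}{P(Z = z)} \\
&= \frac{P(X_1 = 0)\, P\left( \sum_{i=2}^{n} X_i = z \right)}{P(Z = z)} \\
&= \frac{e^{-\lambda}\, ((n-1)\lambda)^z e^{-(n-1)\lambda} / z!}{(n\lambda)^z e^{-n\lambda} / z!} = \left( \frac{n-1}{n} \right)^z.
\end{align*}
Thus, the Rao-Blackwellized estimator is
\[
\hat\theta = \left( 1 - \frac{1}{n} \right)^{X_1 + \cdots + X_n}.
\]
This example illustrates that the estimate of $e^{-\lambda}$ does not equal e raised to the power $-\hat\lambda$. However, it is well known that
\[
\lim_{n \to \infty} \left( 1 - \frac{1}{n} \right)^n = e^{-1},
\]
so for large values of n,
\[
\hat\theta = \left( 1 - \frac{1}{n} \right)^{n\hat\lambda} \approx e^{-\hat\lambda}.
\]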
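The difference between the two estimates of the no-call probability can be checked numerically. The sketch below computes the exact expectation of each estimate when the total count is Poisson with mean $n\lambda$, by summing the Poisson series directly. The helper name, the truncation point of the series, and the values of the rate and of n are illustrative assumptions.

```python
from math import exp

def poisson_mean(g, mean, kmax=200):
    """E[g(Z)] for Z ~ Poisson(mean), from the series truncated after kmax terms."""
    total, p = 0.0, exp(-mean)  # p starts at P(Z = 0)
    for k in range(kmax):
        total += g(k) * p
        p *= mean / (k + 1)     # P(Z = k + 1) = P(Z = k) * mean / (k + 1)
    return total

lam, n = 1.5, 5  # illustrative rate and number of intervals
target = exp(-lam)  # the probability of no calls in one interval

# Rao-Blackwellized estimate (1 - 1/n)^Z and plug-in estimate exp(-Z/n)
rao_blackwell = poisson_mean(lambda z: (1.0 - 1.0 / n) ** z, n * lam)
plug_in = poisson_mean(lambda z: exp(-z / n), n * lam)

print(f"target           {target:.6f}")
print(f"E[(1 - 1/n)^Z]   {rao_blackwell:.6f}")
print(f"E[exp(-Z/n)]     {plug_in:.6f}")
```

Under these assumptions the first expectation matches the target to rounding error, while the plug-in estimate is biased upward; its bias shrinks as n grows.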

Thus, in this case the two estimates are asymptotically equivalent.

Let's review what we have done with all of our analysis. We started with the assumption of minimum variance unbiasedness as our criterion for optimality. The Rao-Blackwell theorem showed us that the minimum variance estimate was based upon a sufficient statistic. We recognized, completely justifiably, that if we are going to base our estimate on a sufficient statistic, then we should use a complete sufficient statistic. But the Lehmann-Scheffé theorem tells us that there is at most one unbiased estimate based on a complete sufficient statistic. So what? Well, we thought we were going after optimality, and we established that the set of optimal estimates, according to our criterion, contains at most one member. Thus, if you have found an unbiased estimate based on a complete sufficient statistic, not only is it the best one, it is the only one. What we really have done is to establish one and only one useful fact: The minimum variance unbiased estimate of a parameter is a function of a complete sufficient statistic. Nothing more, nothing less.

Example 3-10 Let $X = \{X_1, \ldots, X_n\}$ be a sample from the distribution $N(\mu, \sigma^2)$. We know that $T_1(X) = \sum_{i=1}^{n} X_i$ and $T_2(X) = \sum_{i=1}^{n} X_i^2$ are sufficient for $(\mu, \sigma^2)$. By virtue of the fact that the normal distribution is an exponential family, we have immediately that $(T_1, T_2)$ are also complete. Since
\[
\hat\mu = \frac{1}{n} T_1 \qquad \text{and} \qquad \hat\sigma^2 = \frac{1}{n-1} \left( T_2 - n\hat\mu^2 \right)
\]
are unbiased estimates of $\mu$ and $\sigma^2$, respectively, they represent the minimum variance unbiased estimates of the mean and variance of the normally distributed population random variable X.

Example 3-11 (From Ferguson) This example illustrates the dubious optimality of minimum variance unbiasedness. Continuing with the telephone operator, suppose that, after working for only 10 minutes, he wonders if he would be missed if he took a 20-minute break. As before, we assume that calls are coming in to his switchboard as a Poisson process at the unknown rate of $\lambda$ calls per 10 minutes. Let X denote the number of calls received within the first 10 minutes. As we have seen, X is a sufficient statistic for $\lambda$. On the basis of observing X, the operator wishes to estimate the probability that no calls will be received within the next 20 minutes. Since the probability of no calls in any 10-minute interval is
\[
f_X(0) = \frac{\lambda^0 e^{-\lambda}}{0!} = e^{-\lambda},
\]
the probability of no calls in a 20-minute interval is $\theta = e^{-2\lambda}$. If the operator is enamored with unbiased estimates, he will look for an estimate $\hat\theta(X)$ for which
\[
E_\lambda \hat\theta(X) = \sum_{x=0}^{\infty} \hat\theta(x) \frac{e^{-\lambda} \lambda^x}{x!} = e^{-2\lambda}.
\]

After multiplying both sides by $e^{\lambda}$ and expanding $e^{-\lambda}$ in a power series, he would obtain
\[
\sum_{x=0}^{\infty} \hat\theta(x) \frac{\lambda^x}{x!} = \sum_{x=0}^{\infty} (-1)^x \frac{\lambda^x}{x!}.
\]
Two convergent power series can be equal only if corresponding coefficients are equal. The only unbiased estimate of $\theta = e^{-2\lambda}$ is therefore $\hat\theta(x) = (-1)^x$. Thus he would estimate the probability of receiving no calls in the next 20 minutes as +1 if he received an even number of calls in the last 10 minutes, and as -1 if he received an odd number of calls in the last 10 minutes. This ridiculous estimate is nonetheless a minimum-variance unbiased estimate.

At first glance the results of Examples 3-9 and 3-11 seem to be incongruous. On the one hand, the estimate of Example 3-9 has an intuitively pleasing structure, while the other is patently ridiculous. Yet both are claimed to be minimum-variance unbiased estimates of the probability of no phone calls occurring. But it must be remembered that the two estimators use the data differently. The estimator of Example 3-9 uses the actual number of calls, while the estimator of Example 3-11 uses only the odd/even properties of the number of calls.
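A short numerical sketch makes the behavior of the estimator in Example 3-11 concrete. Assuming a Poisson count with an illustrative rate, it sums the Poisson series to confirm that $(-1)^X$ has expectation $e^{-2\lambda}$, and it also computes the variance of this estimate. The helper name and the chosen rate are assumptions made for the illustration.

```python
from math import exp

def poisson_mean(g, lam, kmax=200):
    """E[g(X)] for X ~ Poisson(lam), from the series truncated after kmax terms."""
    total, p = 0.0, exp(-lam)  # p starts at P(X = 0)
    for k in range(kmax):
        total += g(k) * p
        p *= lam / (k + 1)     # P(X = k + 1) = P(X = k) * lam / (k + 1)
    return total

lam = 0.7                  # illustrative rate (calls per 10 minutes)
theta = exp(-2.0 * lam)    # probability of no calls in 20 minutes

mean_of_estimate = poisson_mean(lambda x: (-1) ** x, lam)
variance = poisson_mean(lambda x: ((-1) ** x - theta) ** 2, lam)

print(f"theta = {theta:.6f}, E[(-1)^X] = {mean_of_estimate:.6f}")
print(f"variance of (-1)^X = {variance:.6f}")
```

The estimate is unbiased, yet because it only ever takes the values +1 and -1, its variance equals $1 - \theta^2$, which is close to one whenever the probability being estimated is small.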

4 Neyman-Pearson Theory

We now focus on the hypothesis testing, or binary decision, problem, where the decision space consists of only two points. This decision problem, although perhaps the simplest of decision problems, possesses a surprising depth of structure and mathematical sophistication. There are two major approaches to this problem: (a) the Bayesian approach, and (b) the Neyman-Pearson approach. With the Bayesian approach, we assume that the parameter space is actually a probability space $(\Theta, \mathcal{T}, \tau)$, where $\tau$ is a probability measure over a $\sigma$-field of the states of nature, and is called the a priori probability. The Neyman-Pearson approach, on the other hand, does not use prior probabilities; rather, it focuses on the use of probabilities, sometimes called likelihoods, of success or failure, given the state of nature. We rely heavily on [3, Chapter 5] and on [16]. Also, [13] is useful reading.

The Neyman-Pearson approach has had great utility for detection using radar signals, and some of the terminology used in that context has permeated the general field. Notions such as false alarm, missed detection, receiver operating characteristic, etc., owe their origins to radar. Statisticians have coined their own vocabulary for these concepts, however, and we will find it desirable to become familiar with both the engineering and statistics terminology. The fact that more than one discipline has embraced these concepts is a testimony to their great utility.

4.1 Hypothesis Testing

Let $(\Theta, \Delta, L)$ be a statistical game with $\Delta = (\delta_0, \delta_1)$. We observe a random variable X taking values in a space $\mathcal{X}$. The distribution of X is given by $F_X(\cdot \mid \theta)$, where $\theta$ is a parameter lying in a parameter space $\Theta$. We desire to fashion a decision rule, or test, $\phi : \mathcal{X} \to \{0, 1\}$, such that, when $X = x$ is observed,
\[
\phi(x) = \begin{cases} 1 & \text{if } x \in R \\ 0 & \text{if } x \in A, \end{cases}
\]
where R and A are measurable subsets of $\mathcal{X}$, and $\mathcal{X} = R \cup A$. We interpret this decision rule as follows: If $x \in R$ we take action $\delta_1$, and if $x \in A$ we take action $\delta_0$. The next step in the development of this problem is to determine the sets R and A. We begin by calculating the expectation of the decision rule. We observe that
\[
E_\theta \phi(X) = 1 \cdot P(R \mid \theta) + 0 \cdot [1 - P(R \mid \theta)] = P(R \mid \theta).
\]
The expectation $E_\theta \phi(X)$ is called the power function corresponding to the decision rule (or test) $\phi$. We will assume that $\Theta$ can be written $\Theta = \Theta_0 \cup \Theta_1$, for some disjoint sets $\Theta_0$ and $\Theta_1$, and define the hypotheses $H_0$ and $H_1$ as
\[
H_0 : \theta \in \Theta_0, \qquad H_1 : \theta \in \Theta_1.
\]
This classical decision problem gives rise to the following terminology: $H_0$ is called the null hypothesis, to mean that $\theta \in \Theta_0$, and $H_1$ the alternative hypothesis, to mean that $\theta \in \Theta_1$. Only one of these disjoint hypotheses is true, and our job is to guess which one. If we guess correctly, the loss is zero, and if we guess incorrectly, the loss is one. The decision $\delta_0$ may be considered as taking the action "accept $H_0$", and the decision $\delta_1$ the action "accept $H_1$", or "reject $H_0$".

4.2 Simple Hypothesis versus Simple Alternative

We first look at the case where $H_0$ and $H_1$ are simple, that is, $\Theta_0$ and $\Theta_1$ each contain exactly one element, $\Theta_0 = \{\theta_0\}$ and $\Theta_1 = \{\theta_1\}$. Then, if $\theta_0$ is the true value of the parameter, we prefer to take action $\delta_0$, whereas if $\theta_1$ is the true value we prefer $\delta_1$.

Definition. The probability of rejecting the null hypothesis $H_0$ when it is true is called the size of the rule $\phi$, and is denoted $\alpha$. This is called a type I error, or false alarm. We thus have
\[
\alpha = P[\phi(X) = 1 \mid \theta_0] = E_{\theta_0} \phi(X) = P_{FA},
\]
where $P_{FA}$ is standard notation for the probability of a false alarm. This latter terminology stems from radar applications, where a pulsed electromagnetic signal is transmitted. If a return signal is reflected from the target, we say a target is detected. But due to receiver noise, atmospheric disturbances, spurious reflections from the ground and other objects, and other signal distortions, it is not possible to determine with absolute certainty whether or not a target is present.

Definition. The power, or detection probability, of a decision rule $\phi$ is the probability of correctly accepting the alternative hypothesis, $H_1$, when it is true, and is denoted by $\beta$. One minus the power is the probability of accepting $H_0$ when $H_1$ is true, resulting in a type II error, or missed detection. We thus have
\[
\beta = P[\phi(X) = 1 \mid \theta_1] = E_{\theta_1} \phi(X) = P_D,
\]
where $P_D$ is standard notation for the probability of a detection, and
\[
P_{MD} = E_{\theta_1}[1 - \phi(X)]
\]
is the probability of a missed detection.

Definition. A test $\phi$ is said to be best of size $\alpha$ for testing $H_0$ against $H_1$ if $E_{\theta_0} \phi(X) = \alpha$ and if, for every test $\phi'$ for which $E_{\theta_0} \phi'(X) \le \alpha$, we have
\[
\beta = E_{\theta_1} \phi(X) \ge E_{\theta_1} \phi'(X) = \beta';
\]
that is, a test $\phi$ is best of size $\alpha$ if, out of all tests with $P_{FA}$ not greater than $\alpha$, $\phi$ has the largest probability of detection, that is, it is the most powerful test.

4.3 The Neyman-Pearson Lemma

We now give a general method for finding the best tests of a simple hypothesis against a simple alternative. This test is provided by the fundamental lemma of Neyman and Pearson.

Lemma 1 (Neyman-Pearson Lemma). Suppose that $\Theta = \{\theta_0, \theta_1\}$ and that the distributions of X have densities (or mass functions) $f_X(x \mid \theta)$.

(a) Any test $\phi(X)$ of the form
\[
\phi(x) = \begin{cases} 1 & \text{if } f_X(x \mid \theta_1) > k f_X(x \mid \theta_0) \\ \gamma(x) & \text{if } f_X(x \mid \theta_1) = k f_X(x \mid \theta_0) \\ 0 & \text{if } f_X(x \mid \theta_1) < k f_X(x \mid \theta_0) \end{cases} \tag{4-1}
\]
for some $k \ge 0$ and $0 \le \gamma(x) \le 1$, is best of its size for testing $H_0 : \theta = \theta_0$ against $H_1 : \theta = \theta_1$. Corresponding to $k = \infty$, the test
\[
\phi(x) = \begin{cases} 1 & \text{if } f_X(x \mid \theta_0) = 0 \\ 0 & \text{if } f_X(x \mid \theta_0) > 0 \end{cases} \tag{4-2}
\]
is best of size zero for testing $H_0$ against $H_1$.

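(b) (Existence). For every $\alpha$, $0 \le \alpha \le 1$, there exists a test of the form above with $\gamma(x) = \gamma$, a constant, for which $E_{\theta_0} \phi(X) = \alpha$.

(c) (Uniqueness). If $\phi'$ is a best test of size $\alpha$ for testing $H_0$ against $H_1$, then it has the form given by (4-1), except perhaps for a set of x with probability zero under $H_0$ and $H_1$.

Proof. (a) Choose any $\phi(X)$ of the form (4-1) and let $\phi'(X)$, $0 \le \phi'(X) \le 1$, be any test whose size is not greater than the size of $\phi(X)$, that is, for which $E_{\theta_0} \phi'(X) \le E_{\theta_0} \phi(X)$. We are to show that $E_{\theta_1} \phi'(X) \le E_{\theta_1} \phi(X)$, i.e., that the power of $\phi'(X)$ is not greater than the power of $\phi(X)$. Note that
\begin{align*}
\int [\phi(x) - \phi'(x)][f_X(x \mid \theta_1) - k f_X(x \mid \theta_0)]\,dx
&= \int_{A^+} [1 - \phi'(x)][f_X(x \mid \theta_1) - k f_X(x \mid \theta_0)]\,dx \\
&\quad + \int_{A^-} [0 - \phi'(x)][f_X(x \mid \theta_1) - k f_X(x \mid \theta_0)]\,dx \\
&\quad + \int_{A^0} [\gamma(x) - \phi'(x)][f_X(x \mid \theta_1) - k f_X(x \mid \theta_0)]\,dx,
\end{align*}
where
\begin{align*}
A^+ &= \{x : f_X(x \mid \theta_1) - k f_X(x \mid \theta_0) > 0\} \\
A^- &= \{x : f_X(x \mid \theta_1) - k f_X(x \mid \theta_0) < 0\} \\
A^0 &= \{x : f_X(x \mid \theta_1) - k f_X(x \mid \theta_0) = 0\}.
\end{align*}
Since $\phi'(x) \le 1$, the first integral is nonnegative. Also, the second integral is nonnegative by inspection, and the third integral is identically zero. Thus,
\[
\int [\phi(x) - \phi'(x)][f_X(x \mid \theta_1) - k f_X(x \mid \theta_0)]\,dx \ge 0. \tag{4-3}
\]
This implies that
\[
E_{\theta_1} \phi(X) - E_{\theta_1} \phi'(X) \ge k E_{\theta_0} \phi(X) - k E_{\theta_0} \phi'(X) \ge 0, \tag{4-4}
\]
where the last inequality is a consequence of the hypothesis that $E_{\theta_0} \phi'(X) \le E_{\theta_0} \phi(X)$. This proves that $\phi(X)$ is more powerful than $\phi'(X)$, i.e., $\beta - \beta' \ge k(\alpha - \alpha') \ge 0$.

For the case $k = \infty$, any test $\phi'$ of size $\alpha' = 0$ must satisfy
\[
\alpha' = \int \phi'(x) f_X(x \mid \theta_0)\,dx = 0,
\]
hence $\phi'(x)$ must be zero almost everywhere on the set $\{x : f_X(x \mid \theta_0) > 0\}$. Thus, using this result and (4-2),
\begin{align*}
E_{\theta_1}[\phi(X) - \phi'(X)] &= \underbrace{\int_{\{x : f_X(x \mid \theta_0) > 0\}} (\phi(x) - \phi'(x)) f_X(x \mid \theta_1)\,dx}_{=0} + \int_{\{x : f_X(x \mid \theta_0) = 0\}} (\phi(x) - \phi'(x)) f_X(x \mid \theta_1)\,dx \\
&= \int_{\{x : f_X(x \mid \theta_0) = 0\}} (1 - \phi'(x)) f_X(x \mid \theta_1)\,dx \ge 0,
\end{align*}
since $\phi(x) = 1$ whenever the density $f_X(x \mid \theta_0) = 0$ by (4-2), and $\phi'(x) \le 1$. This completes the proof of (a).

(b) Since a best test of size $\alpha = 0$ is given by (4-2), we may restrict attention to $0 < \alpha \le 1$. The size of the test (4-1), when $\gamma(x) = \gamma$, is
\begin{align*}
E_{\theta_0} \phi(X) &= P_{\theta_0}[f_X(X \mid \theta_1) > k f_X(X \mid \theta_0)] + \gamma P_{\theta_0}[f_X(X \mid \theta_1) = k f_X(X \mid \theta_0)] \\
&= 1 - P_{\theta_0}[f_X(X \mid \theta_1) \le k f_X(X \mid \theta_0)] + \gamma P_{\theta_0}[f_X(X \mid \theta_1) = k f_X(X \mid \theta_0)]. \tag{4-5}
\end{align*}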

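For fixed $\alpha$, $0 < \alpha \le 1$, we are to find k and $\gamma$ so that $E_{\theta_0} \phi(X) = \alpha$, or equivalently, using the representation (4-5),
\[
1 - P_{\theta_0}[f_X(X \mid \theta_1) \le k f_X(X \mid \theta_0)] + \gamma P_{\theta_0}[f_X(X \mid \theta_1) = k f_X(X \mid \theta_0)] = \alpha
\]
or
\[
P_{\theta_0}[f_X(X \mid \theta_1) \le k f_X(X \mid \theta_0)] - \gamma P_{\theta_0}[f_X(X \mid \theta_1) = k f_X(X \mid \theta_0)] = 1 - \alpha. \tag{4-6}
\]
If there exists a $k_0$ for which $P_{\theta_0}[f_X(X \mid \theta_1) \le k_0 f_X(X \mid \theta_0)] = 1 - \alpha$, we take $\gamma = 0$ and $k = k_0$. If not, then there is a discontinuity in $P_{\theta_0}[f_X(X \mid \theta_1) \le k f_X(X \mid \theta_0)]$, when viewed as a function of k, that brackets the particular value $1 - \alpha$; that is, there exists a $k_0$ such that
\[
P_{\theta_0}[f_X(X \mid \theta_1) < k_0 f_X(X \mid \theta_0)] < 1 - \alpha \le P_{\theta_0}[f_X(X \mid \theta_1) \le k_0 f_X(X \mid \theta_0)]. \tag{4-7}
\]
Figure 4-1 illustrates this situation. Setting $k = k_0$ in (4-6) and solving for $\gamma$ yields
\[
\gamma = \frac{P_{\theta_0}[f_X(X \mid \theta_1) \le k_0 f_X(X \mid \theta_0)] - (1 - \alpha)}{P_{\theta_0}[f_X(X \mid \theta_1) = k_0 f_X(X \mid \theta_0)]}.
\]
By (4-7), this $\gamma$ satisfies (4-6) and $0 \le \gamma \le 1$, so letting $k = k_0$, (b) is proved.

[Figure 4-1: Illustration of threshold for Neyman-Pearson test. The plot shows $P_{\theta_0}[f_X(X \mid \theta_1) \le k f_X(X \mid \theta_0)]$ as a function of k, with the levels $1 - \alpha$ and 1 and the point $k_0$ marked.]

(c) If $\alpha = 0$, the argument in (a) shows that $\phi'(x) = 0$ almost everywhere on the set $\{x : f_0(x) > 0\}$. If $\phi'$ has a minimum probability of the second kind of error, then $1 - \phi'(x) = 0$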

almost everywhere on the set $\{x : f_1(x) > 0\} \setminus \{x : f_0(x) > 0\}$. Thus $\phi'$ differs from the $\phi$ of (4-2) by a set of probability zero under either hypothesis. If $\alpha > 0$, let $\phi$ be the best test of size $\alpha$ of the form (4-1). Then, because $E_{\theta_i} \phi(X) = E_{\theta_i} \phi'(X)$, $i = 0, 1$, the integral (4-3) must be equal to zero. But because the integrand is nonnegative it must be zero almost everywhere; that is to say, on the set for which $f_X(x \mid \theta_1) \ne k f_X(x \mid \theta_0)$ we have $\phi'(x) = \phi(x)$ almost everywhere. Thus, except for a set of probability zero, $\phi'(x)$ has the same form as (4-1) with the same value for k as $\phi(x)$; thus the function $\phi'(x)$ satisfies the uniqueness requirement. $\Box$

The Neyman-Pearson lemma thus gives us a general decision rule for a simple hypothesis versus a simple alternative. We would apply it as follows:

1. For a given binary decision problem, determine which hypothesis is to be the null, and which is to be the alternative. This choice is at the discretion of the analyst. As a practical issue, it would be wise to choose as the null hypothesis the one that has the most serious consequences if wrongly rejected, because the analyst is able to choose the size of the test, which enables control of the probability of rejecting the null hypothesis when it is true.

2. Select the size of the test. It seems to be the tradition for many applications to set $\alpha = 0.05$ or $\alpha = 0.01$, which correspond to common significance levels used in statistics. The main issue, however, is to choose the size relevant to the problem at hand. For example, in a radar target detection problem, if the null hypothesis is "no target present", setting $\alpha = 0.05$ means that we are willing to accept a 5% chance that our test tells us a target is present when no target is there. The smaller the size, in general, the smaller also is the power, as will be made more evident in the discussion of the receiver operating characteristic.

3. Calculate the threshold, k. The way to do this is not obvious from the theorem. Clearly, k must be a function of the size, $\alpha$, but until specific distributions are used, there is no obvious formula for determining k. That will be one of the tasks examined in the examples to follow.
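For a discrete observation, the construction in part (b) of the lemma can be carried out mechanically. The following sketch is a hypothetical illustration, not an example from these notes: it tests a binomial success probability of 0.5 against 0.7 with 10 trials at size 0.05, searches the possible likelihood-ratio values for the threshold k, and solves for the constant randomization probability, written gamma here. All function names and numerical values are assumptions chosen for the illustration.

```python
from math import comb

def binom_pmf(x, m, p):
    """Binomial(m, p) probability of x successes."""
    return comb(m, x) * p**x * (1.0 - p) ** (m - x)

def np_randomized_test(f0, f1, alpha):
    """Find (k, gamma) so that the test (4-1) with constant gamma has size alpha.

    f0, f1 map each outcome x to its probability under H0 and H1.
    Assumes 0 < alpha <= 1 and f0[x] > 0 for every outcome x.
    """
    ratio = {x: f1[x] / f0[x] for x in f0}
    p_above = 0.0  # P0[ratio > k] for the current candidate k
    for k in sorted(set(ratio.values()), reverse=True):
        p_equal = sum(f0[x] for x in f0 if ratio[x] == k)
        if p_above + p_equal >= alpha:
            # Solve P0[ratio > k] + gamma * P0[ratio = k] = alpha for gamma.
            return k, (alpha - p_above) / p_equal
        p_above += p_equal
    return 0.0, 1.0  # reached only through rounding when alpha is 1

m, p0, p1, alpha = 10, 0.5, 0.7, 0.05  # illustrative problem
f0 = {x: binom_pmf(x, m, p0) for x in range(m + 1)}
f1 = {x: binom_pmf(x, m, p1) for x in range(m + 1)}
k, gamma = np_randomized_test(f0, f1, alpha)

def phi(x):
    """Probability of deciding H1 when x is observed."""
    r = f1[x] / f0[x]
    return 1.0 if r > k else gamma if r == k else 0.0

size = sum(phi(x) * f0[x] for x in f0)
power = sum(phi(x) * f1[x] for x in f1)
print(f"k = {k:.4f}, gamma = {gamma:.4f}, size = {size:.4f}, power = {power:.4f}")
```

In this illustration the test rejects outright for 9 or 10 successes and rejects with probability gamma when exactly 8 successes are observed; without the randomization, no threshold would give a size of exactly 0.05.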

4.4 The Likelihood Ratio

The key quantities in the Neyman-Pearson theory are the density functions $f_X(x \mid \theta_1)$ and $f_X(x \mid \theta_0)$. These quantities are sometimes viewed as the conditional pdfs (or pmfs) of X given $\theta$. The concept of conditioning, however, requires that the quantity $\theta$ be a random variable. But nothing in the Neyman-Pearson theory requires $\theta$ to be so viewed; in fact, the Neyman-Pearson approach is often considered to be an alternative to the Bayesian approach, where $\theta$ is viewed as a random variable. Since the purists insist that the Neyman-Pearson approach not be confused with the Bayesian approach, they have coined the term likelihood function for $f_X(x \mid \theta_1)$ and $f_X(x \mid \theta_0)$. To keep with tradition and to keep any rabid anti-Bayesians in the crowd from getting too overworked, we will respect this convention and call these things likelihood functions, or likelihoods, when required (or when we think about it; engineers don't usually get too worked up over these types of issues, but perhaps they should).

The inequality $f_X(x \mid \theta_1) \{> = <\} k f_X(x \mid \theta_0)$ has emerged as a natural expression in the statement and proof of the Neyman-Pearson lemma. This inequality may be expressed as a ratio:
\[
\Lambda(x) = \frac{f_X(x \mid \theta_1)}{f_X(x \mid \theta_0)} \{> = <\} k.
\]
The quantity $\Lambda(x)$ is called the likelihood ratio, and the test (4-1) may be rewritten
\[
\phi(x) = \begin{cases} 1 & \text{if } \Lambda(x) > k \\ \gamma & \text{if } \Lambda(x) = k \\ 0 & \text{if } \Lambda(x) < k. \end{cases} \tag{4-8}
\]
You may have noticed in the proof of the lemma that we have used expressions such as $f_X(X \mid \theta_1)$, where we have used the random variable X as an argument of the density function. When we do this, the function $f_X(X \mid \theta_1)$ is, of course, a random variable since it becomes a function of a random variable. The likelihood ratio
\[
\Lambda(X) = \frac{f_X(X \mid \theta_1)}{f_X(X \mid \theta_0)}
\]
is then also a random variable. A false alarm occurs (accepting $H_1$ when $H_0$ is true) if $\Lambda(x) > k$ when $\theta = \theta_0$ and $X = x$. Let $f_\Lambda(l \mid \theta_0)$ denote the density of $\Lambda$ given $\theta = \theta_0$; then
\[
\alpha = P_{FA} = P_{\theta_0}[\Lambda(X) > k] = \int_k^\infty f_\Lambda(l \mid \theta_0)\,dl,
\]
so long as $P_{\theta_0}[\Lambda(X) = k] = 0$. Thus, if we could compute the density of $\Lambda$ given $\theta = \theta_0$, we would have a convenient method of computing the value of the threshold, k.

Example 4-1 Let us assume that, under hypothesis $H_1$, a source output is a constant voltage m, and under $H_0$ the source output is zero. Before observation the voltage is corrupted by an additive noise; the sample random variables are
\[
X_i = \theta + Z_i, \tag{4-9}
\]
where $\theta \in \{\theta_0, \theta_1\}$ with $\theta_0 = 0$ and $\theta_1 = m$. The random variables $Z_i$ are independent zero-mean normal random variables with known variance $\sigma^2$, and are also independent of the source output, $\theta$. We sample the output waveform each second and obtain n samples. In other words,
\begin{align*}
H_0 &: X_i = Z_i, & i &= 1, \ldots, n \\
H_1 &: X_i = m + Z_i, & i &= 1, \ldots, n
\end{align*}
with
\[
f_Z(z) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{z^2}{2\sigma^2} \right\}.
\]
The probability density of $X_i$ under each hypothesis is
\begin{align*}
f_X(x \mid \theta_0) &= \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{x^2}{2\sigma^2} \right\} \\
f_X(x \mid \theta_1) &= \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(x - m)^2}{2\sigma^2} \right\}.
\end{align*}
Because the $Z_i$ are statistically independent, the joint probability density of $X_1, \ldots, X_n$ is simply the product of the individual probability density functions. Thus
\begin{align*}
f_{X_1,\ldots,X_n}(x_1, \ldots, x_n \mid \theta_0) &= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{x_i^2}{2\sigma^2} \right\} \\
f_{X_1,\ldots,X_n}(x_1, \ldots, x_n \mid \theta_1) &= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(x_i - m)^2}{2\sigma^2} \right\}.
\end{align*}
The likelihood ratio becomes
\[
\Lambda(x_1, \ldots, x_n) = \frac{f_{X_1,\ldots,X_n}(x_1, \ldots, x_n \mid \theta_1)}{f_{X_1,\ldots,X_n}(x_1, \ldots, x_n \mid \theta_0)} = \frac{\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(x_i - m)^2}{2\sigma^2} \right\}}{\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{x_i^2}{2\sigma^2} \right\}}.
\]
After canceling common terms and taking the logarithm$^7$, we have
\[
\log \Lambda(x_1, \ldots, x_n) = \frac{m}{\sigma^2} \sum_{i=1}^{n} x_i - \frac{n m^2}{2\sigma^2}, \tag{4-10}
\]
resulting in the log likelihood ratio. It is interesting to see, in this example, that the data appear in the likelihood ratio only through the sum $\sum_{i=1}^{n} x_i$, which is consistent with our knowledge that $\sum_{i=1}^{n} X_i$ is a sufficient statistic for the mean.

The log likelihood ratio test then becomes
\[
\phi(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } \log \Lambda(x_1, \ldots, x_n) > \log \nu \\ \gamma & \text{if } \Lambda(x_1, \ldots, x_n) = \nu \\ 0 & \text{if } \log \Lambda(x_1, \ldots, x_n) < \log \nu, \end{cases} \tag{4-11}
\]
where $\nu$ is the threshold we need to calculate. Viewing the log likelihood ratio as a random variable and multiplying (4-10) by $\sigma / (\sqrt{n}\, m)$ yields
\[
\frac{\sigma}{\sqrt{n}\, m} \log \Lambda(X_1, \ldots, X_n) = \frac{1}{\sqrt{n}\,\sigma} \sum_{i=1}^{n} X_i - \frac{\sqrt{n}\, m}{2\sigma}.
\]
Define the new random variable
\[
L(X_1, \ldots, X_n) = \frac{\sigma}{\sqrt{n}\, m} \log \Lambda(X_1, \ldots, X_n) + \frac{\sqrt{n}\, m}{2\sigma} = \frac{1}{\sqrt{n}\,\sigma} \sum_{i=1}^{n} X_i.
\]
Under hypothesis $H_0$, L is obtained by adding n independent zero-mean normal random variables with variance $\sigma^2$ and then dividing by $\sqrt{n}\,\sigma$, yielding $L \sim N(0, 1)$, and under hypothesis $H_1$, $L \sim N(\sqrt{n}\, m / \sigma, 1)$. Thus, for this example, we are able to calculate the densities of the log likelihood ratio. The test becomes
\[
\phi(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } L(x_1, \ldots, x_n) > \frac{1}{d} \log \nu + \frac{d}{2} \\ 0 & \text{if } L(x_1, \ldots, x_n) \le \frac{1}{d} \log \nu + \frac{d}{2}, \end{cases} \tag{4-12}
\]
where $d = \sqrt{n}\, m / \sigma$. Note, in (4-12), that we have set $\gamma = 0$ without loss of generality, since $P[L(X_1, \ldots, X_n) = \frac{1}{d} \log \nu + \frac{d}{2}] = 0$.

$^7$ We know we can do this without disturbing the structure of the test, since the logarithm is a monotonic function.

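The size, $P_{FA}$, is the integral of the density $f_L(l \mid \theta_0)$ over the interval $\left( \frac{1}{d} \log \nu + \frac{d}{2}, \infty \right)$, or
\begin{align*}
P_{FA} &= \int_{\frac{1}{d} \log \nu + \frac{d}{2}}^{\infty} f_L(l \mid \theta_0)\,dl \\
&= \int_{\frac{1}{d} \log \nu + \frac{d}{2}}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left\{ -\frac{l^2}{2} \right\} dl \\
&= 1 - \Phi\left( \frac{\log \nu + d^2/2}{d} \right),
\end{align*}
where
\[
\Phi(z) = \int_{-\infty}^{z} (2\pi)^{-\frac{1}{2}} \exp[-x^2/2]\,dx
\]
is the normal integral, corresponding to the area under the normal curve from $-\infty$ to the point z. Thus, to compute the threshold for a given size $\alpha$, we solve (using normal tables or a computer) $1 - \Phi(z) = \alpha$ for z, that is, compute $z = \Phi^{-1}(1 - \alpha)$, then calculate
\[
\log \nu = d z - \frac{d^2}{2}.
\]
Similarly, the power, $P_D$, is the integral of the density $f_L(l \mid \theta_1)$ over the interval $\left( \frac{1}{d} \log \nu + \frac{d}{2}, \infty \right)$, or
\begin{align*}
P_D &= \int_{\frac{1}{d} \log \nu + \frac{d}{2}}^{\infty} f_L(l \mid \theta_1)\,dl \\
&= \int_{\frac{1}{d} \log \nu + \frac{d}{2}}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left\{ -\frac{(l - d)^2}{2} \right\} dl \\
&= \int_{\frac{1}{d} \log \nu - \frac{d}{2}}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left\{ -\frac{y^2}{2} \right\} dy \\
&= 1 - \Phi\left( \frac{\log \nu - d^2/2}{d} \right).
\end{align*}
Figure 4-2 illustrates the normal curves for the two hypotheses under question, and the regions corresponding to $P_{FA}$ and $P_D$ are indicated in the figure.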
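The threshold and power calculations of Example 4-1 are easy to carry out by computer in place of normal tables. The sketch below uses the standard-library normal distribution; the function name, the returned tuple, and the numerical values of the size, m, the noise standard deviation, and n are illustrative assumptions, and the likelihood-ratio threshold is written nu.

```python
from math import sqrt
from statistics import NormalDist

def gaussian_mean_test(alpha, m, sigma, n):
    """Threshold and power for H0: mean 0 versus H1: mean m > 0,
    with n independent N(., sigma^2) samples and test size alpha."""
    std_normal = NormalDist()             # zero mean, unit variance
    d = sqrt(n) * m / sigma
    z = std_normal.inv_cdf(1.0 - alpha)   # threshold on L = sum(x_i) / (sqrt(n) sigma)
    log_nu = d * z - d * d / 2.0          # log of the likelihood-ratio threshold
    p_d = 1.0 - std_normal.cdf(z - d)     # power; z - d equals (log_nu - d^2/2) / d
    return d, z, log_nu, p_d

# Illustrative values: size 0.05, m = 1, sigma = 2, n = 16 samples
d, z, log_nu, p_d = gaussian_mean_test(0.05, 1.0, 2.0, 16)
print(f"d = {d:.3f}, z = {z:.4f}, log nu = {log_nu:.4f}, P_D = {p_d:.4f}")
```

With these values d = 2, so the two normal curves are separated by two standard deviations; increasing n or m, or decreasing the noise level, increases d and therefore the power at a fixed size.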

[Figure 4-2: Error probabilities for normal variables with different means and equal variances: (a) $P_{FA}$ calculation, (b) $P_D$ calculation. Each panel shows the densities $f_\Lambda(l \mid \theta_0)$ and $f_\Lambda(l \mid \theta_1)$ with the threshold marked.]

4.5 Receiver Operating Characteristic

For a Neyman-Pearson test, the size and power, as specified by $P_{FA}$ and $P_D$, completely specify the test performance. We can gain some valuable insight by cross-plotting these parameters for a given test; the resulting plot is called the Receiver Operating Characteristic, or ROC curve, borrowing from radar terminology. ROC curves are perhaps the most useful single method of evaluation of performance of a binary detection system.

Example 4-2 The plot of $P_D$ versus $P_{FA}$ for various values of d, with $\nu$ the varying parameter, is given in Figure 4-3. For $\nu = 0$, $\log \nu = -\infty$, and the test always chooses $H_1$. Thus $P_{FA} = 1$ and $P_D = 1$. As $\nu$ increases, $P_{FA}$ and $P_D$ decrease. When $\nu = \infty$, the test always chooses $H_0$ and $P_{FA} = P_D = 0$.

Example 4-3 We modify the previous example, and assume that, under hypothesis $H_1$, a source output is normal zero-mean with variance $\sigma_1^2$, and under $H_0$ the source output is normal zero-mean with variance $\sigma_0^2$. Under both hypotheses, we assume the variables are independent. We sample the output waveform each second and obtain n samples, thus
\begin{align*}
H_0 &: \{X_1, \ldots, X_n\} \sim N(0, \sigma_0^2 I) \\
H_1 &: \{X_1, \ldots, X_n\} \sim N(0, \sigma_1^2 I),
\end{align*}

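that is, with $\Theta = \{\theta_0, \theta_1\} = \{\sigma_0^2, \sigma_1^2\}$,
\begin{align*}
f_{X_1,\ldots,X_n}(x_1, \ldots, x_n \mid \theta_0) &= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_0} \exp\left\{ -\frac{x_i^2}{2\sigma_0^2} \right\} \\
f_{X_1,\ldots,X_n}(x_1, \ldots, x_n \mid \theta_1) &= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_1} \exp\left\{ -\frac{x_i^2}{2\sigma_1^2} \right\}.
\end{align*}

[Figure 4-3: Receiver operating characteristic: normal variables with unequal means and equal variances. $P_D$ versus $P_{FA}$ for d = 1, 2, 3, with the threshold increasing along each curve.]

The likelihood ratio becomes
\[
\Lambda(x_1, \ldots, x_n) = \frac{f_{X_1,\ldots,X_n}(x_1, \ldots, x_n \mid \theta_1)}{f_{X_1,\ldots,X_n}(x_1, \ldots, x_n \mid \theta_0)} = \frac{\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_1} \exp\left\{ -\frac{x_i^2}{2\sigma_1^2} \right\}}{\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_0} \exp\left\{ -\frac{x_i^2}{2\sigma_0^2} \right\}}.
\]
After canceling common terms and taking the logarithm, we have
\[
\log \Lambda(x_1, \ldots, x_n) = \frac{1}{2} \left( \frac{1}{\sigma_0^2} - \frac{1}{\sigma_1^2} \right) \sum_{i=1}^{n} x_i^2 + n \log \frac{\sigma_0}{\sigma_1}. \tag{4-13}
\]
The log likelihood ratio test then becomes
\[
\phi(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } \log \Lambda(x_1, \ldots, x_n) > \log \nu \\ \gamma & \text{if } \Lambda(x_1, \ldots, x_n) = \nu \\ 0 & \text{if } \log \Lambda(x_1, \ldots, x_n) < \log \nu, \end{cases} \tag{4-14}
\]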
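where $\nu$ is the threshold. Assume $\sigma_1^2 > \sigma_0^2$, and define the new random variable

$$L(X_1, \ldots, X_n) = \frac{2\sigma_0^2\sigma_1^2}{\sigma_1^2 - \sigma_0^2}\left[\log \ell(X_1, \ldots, X_n) - n \log \frac{\sigma_0}{\sigma_1}\right] = \sum_{i=1}^{n} X_i^2.$$

We may then replace the test (4-14) by the test

$$\phi(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } L(x_1, \ldots, x_n) > \lambda \\ 0 & \text{if } L(x_1, \ldots, x_n) < \lambda, \end{cases} \tag{4-15}$$

where

$$\lambda = \frac{2\sigma_0^2\sigma_1^2}{\sigma_1^2 - \sigma_0^2}\left[\log \nu - n \log \frac{\sigma_0}{\sigma_1}\right].$$

This problem is slightly more involved than the previous example, since the random variable $L(X)$, given $\theta$, is not normally distributed. We can simplify things a lot, however, if we deal with the special case $n = 2$. Then

$$P_{FA} = P_{\theta_0}(L \geq \lambda) = P_{\theta_0}(X_1^2 + X_2^2 \geq \lambda).$$

To evaluate the expression on the right, we change to polar coordinates:

$$x_1 = u \cos v, \quad x_2 = u \sin v, \qquad u = \sqrt{x_1^2 + x_2^2}, \quad v = \tan^{-1}\frac{x_2}{x_1}.$$

Then

$$P_{\theta_0}(U^2 \geq \lambda) = \int_0^{2\pi}\!\!\int_{\sqrt{\lambda}}^{\infty} \frac{u}{2\pi\sigma_0^2} \exp\left(-\frac{u^2}{2\sigma_0^2}\right) du\,dv.$$

Integrating with respect to $v$, we have

$$P_{FA} = \int_{\sqrt{\lambda}}^{\infty} \frac{u}{\sigma_0^2} \exp\left(-\frac{u^2}{2\sigma_0^2}\right) du.$$

Since $L = U^2$, changing variables $l = u^2$ yields

$$P_{FA} = \int_{\lambda}^{\infty} \frac{1}{2\sigma_0^2} \exp\left(-\frac{l}{2\sigma_0^2}\right) dl = \exp\left(-\frac{\lambda}{2\sigma_0^2}\right).$$

Similarly,

$$P_D = \exp\left(-\frac{\lambda}{2\sigma_1^2}\right).$$

The threshold for (4-15) is then

$$\lambda = -2\sigma_0^2 \log P_{FA}.$$

We observe that the threshold does not depend upon $\sigma_1^2$; the power of the test, however, does depend on this quantity. To construct the ROC we combine these expressions, eliminate $\lambda$, and obtain

$$P_D = (P_{FA})^{\sigma_0^2/\sigma_1^2},$$

or, in terms of logarithms,

$$\log P_D = \frac{\sigma_0^2}{\sigma_1^2} \log P_{FA}.$$

As expected, the performance improves monotonically as the ratio $r = \sigma_1^2/\sigma_0^2$ increases. Figure 4-4 illustrates this case.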
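The closed-form expressions of Example 4-3 are easy to check numerically. The following is a minimal sketch, assuming the special case $n = 2$ treated above; the function name `roc_unequal_variance` and its argument names are illustrative choices, not part of the notes.

```python
import math


def roc_unequal_variance(p_fa, var0, var1):
    """Return (threshold, P_D) for the n = 2 test of Example 4-3.

    p_fa : desired false-alarm probability, 0 < p_fa <= 1
    var0 : variance under H0 (sigma_0^2)
    var1 : variance under H1 (sigma_1^2), assumed larger than var0
    """
    # Threshold that achieves the requested size: lambda = -2 sigma_0^2 log P_FA.
    threshold = -2.0 * var0 * math.log(p_fa)
    # Power at that threshold: P_D = exp(-lambda / (2 sigma_1^2)).
    p_d = math.exp(-threshold / (2.0 * var1))
    return threshold, p_d


if __name__ == "__main__":
    for r in (1.0, 2.0, 3.0, 4.0):
        _, p_d = roc_unequal_variance(0.1, 1.0, r)
        print(f"r = {r}: P_D = {p_d:.4f}")
```

Note that the threshold is computed from `var0` alone, while the power depends on both variances, in agreement with the observation above.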

Figure 4-4: Receiver operating characteristic ($P_D$ versus $P_{FA}$): normal variables with equal means and unequal variances. Curves are shown for $r = 1, 2, 3, 4$.

We now develop some important properties of the ROC.

Property 1 All continuous likelihood ratio tests have ROC curves that are concave downward.

Proof. Suppose the ROC has a segment that is convex. To be specific, suppose $(P_{FA}^a, P_D^a)$ and $(P_{FA}^b, P_D^b)$ are points on the ROC curve, but the curve is convex between these two points, as illustrated in Figure 4-5. Let $\phi_a(x)$ and $\phi_b(x)$ be the decision rules obtained for the corresponding sizes and powers, as given by the Neyman-Pearson lemma.

Figure 4-5: Demonstration of convexity property of ROC (a convex segment of the curve between the points $(P_{FA}^a, P_D^a)$ and $(P_{FA}^b, P_D^b)$, with $P_D$ plotted against $P_{FA}$).

Now form a new rule by choosing $\phi_a$ with probability $q$ and $\phi_b$ with probability $1 - q$, for any $0 < q < 1$; i.e.,

$$\phi(x) = \begin{cases} \phi_a(x) & \text{with probability } q \\ \phi_b(x) & \text{with probability } 1 - q. \end{cases}$$

Such a rule is termed a randomized rule, because the rule is actually a probability distribution over a set of actions, rather than a deterministic rule corresponding to a single action. Essentially, a decision maker who chose a randomized rule would toss a coin whose probability of landing heads is $q$, and would take the action corresponding to $\phi_a$ if the coin landed heads; otherwise he would take the action corresponding to rule $\phi_b$. The probability of detection, $P_D$, for this randomized rule is

$$P_D = q P_D^a + (1 - q) P_D^b,$$

a convex combination of $P_D^a$ and $P_D^b$; likewise, its size is $P_{FA} = q P_{FA}^a + (1 - q) P_{FA}^b$. The set of all such convex combinations must lie on the line connecting $(P_{FA}^a, P_D^a)$ and $(P_{FA}^b, P_D^b)$, which lies above the convex segment of the curve. Hence the rule $\phi(x)$, of size $P_{FA}$, has greater power than the rule of the same size provided by the Neyman-Pearson test, thus contradicting the optimality of the Neyman-Pearson test. Thus, the ROC curve cannot be convex. □

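Property 2 All continuous likelihood ratio tests have ROC curves that are above the $P_D = P_{FA}$ line.

This is just a special case of Property 1, because the points (0, 0) and (1, 1) are contained on all ROC curves.

Property 3 The slope of the ROC curve at a particular point is equal to the value of the threshold $k$ required to achieve the $P_D$ and $P_{FA}$ of that point.

Proof. Let $\ell$ be the likelihood ratio, and suppose $k$ is a given threshold. Then

$$P_D = \int_k^{\infty} f_\ell(l \mid \theta_1)\,dl, \qquad P_{FA} = \int_k^{\infty} f_\ell(l \mid \theta_0)\,dl.$$

Let $\delta$ be a small perturbation in the threshold; then

$$\Delta P_D = \int_k^{k+\delta} f_\ell(l \mid \theta_1)\,dl, \qquad \Delta P_{FA} = \int_k^{k+\delta} f_\ell(l \mid \theta_0)\,dl$$

represent the changes in $P_D$ and $P_{FA}$, respectively, as a result of the change in threshold. Then the slope of the ROC curve is given by

$$\lim_{\delta \to 0} \frac{\Delta P_D}{\Delta P_{FA}} = \lim_{\delta \to 0} \frac{\delta f_\ell(k \mid \theta_1)}{\delta f_\ell(k \mid \theta_0)} = \frac{f_\ell(k \mid \theta_1)}{f_\ell(k \mid \theta_0)}. \tag{4-16}$$

To establish that this ratio equals $k$, we observe that, in general,

$$E_{\theta_1} \ell^n(X) = \int \ell^n(x) f_X(x \mid \theta_1)\,dx = \int \left[\frac{f_X(x \mid \theta_1)}{f_X(x \mid \theta_0)}\right]^n f_X(x \mid \theta_1)\,dx = \int \left[\frac{f_X(x \mid \theta_1)}{f_X(x \mid \theta_0)}\right]^{n+1} f_X(x \mid \theta_0)\,dx = \int \ell^{n+1}(x) f_X(x \mid \theta_0)\,dx = E_{\theta_0} \ell^{n+1}(X).$$

But the condition $E_{\theta_1} \ell^n = E_{\theta_0} \ell^{n+1}$ requires that

$$\int l^n f_\ell(l \mid \theta_1)\,dl = \int l^{n+1} f_\ell(l \mid \theta_0)\,dl$$

must hold for all $n$, which implies that

$$f_\ell(l \mid \theta_1) = l\, f_\ell(l \mid \theta_0) \tag{4-17}$$

must hold for all values of $l$. Thus, applying (4-17) to (4-16), we obtain the desired result:

$$\frac{dP_D}{dP_{FA}} = \frac{f_\ell(k \mid \theta_1)}{f_\ell(k \mid \theta_0)} = k. \qquad \square$$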
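Property 3 can be checked numerically on the equal-variance normal example. The sketch below assumes that example's form: the test statistic is $N(0,1)$ under $H_0$ and $N(d,1)$ under $H_1$, and a likelihood ratio threshold $k$ corresponds to the statistic threshold $T = d/2 + (\log k)/d$. The function names and the finite-difference step are illustrative assumptions.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # zero mean, unit variance


def roc_point(k, d):
    """(P_FA, P_D) of the likelihood ratio test with threshold k."""
    t = d / 2.0 + math.log(k) / d          # threshold on the N(., 1) statistic
    p_fa = 1.0 - STD_NORMAL.cdf(t)         # area above t under H0
    p_d = 1.0 - STD_NORMAL.cdf(t - d)      # area above t under H1
    return p_fa, p_d


def roc_slope(k, d, rel_step=1e-4):
    """Central-difference estimate of dP_D / dP_FA at threshold k."""
    fa_lo, pd_lo = roc_point(k * (1.0 - rel_step), d)
    fa_hi, pd_hi = roc_point(k * (1.0 + rel_step), d)
    return (pd_hi - pd_lo) / (fa_hi - fa_lo)


if __name__ == "__main__":
    for k in (0.5, 1.0, 2.0):
        print(f"k = {k}: estimated slope = {roc_slope(k, d=1.5):.5f}")
```

In each case the estimated slope should agree with the threshold $k$ to within the accuracy of the finite difference.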

4.6 Composite Binary Hypotheses

Thus far, we have dealt with the simplest form of binary hypothesis testing: a simple hypothesis versus a simple alternative. We now generalize our thinking to composite hypotheses.

Definition. A hypothesis $H: \theta \in \Theta_0$ is said to be composite if $\Theta_0$ consists of at least two elements.

We are interested in testing a composite hypothesis $H_0: \theta \in \Theta_0$ against a composite alternative $H_1: \theta \in \Theta_1$. Before pursuing the development of a theory for composite hypotheses, we need to generalize the notions of size and power for this situation.

Definition. A test $\phi$ of $H_0: \theta \in \Theta_0$ against $H_1: \theta \in \Theta_1$ is said to have size $\alpha$ if

$$\sup_{\theta \in \Theta_0} E_\theta \phi(X) = \alpha.$$

Definition. A test $\phi_0$ is said to be uniformly most powerful (UMP) of size $\alpha$ for testing $H_0: \theta \in \Theta_0$ against $H_1: \theta \in \Theta_1$ if $\phi_0$ is of size $\alpha$ and if, for any other test $\phi$ of size at most $\alpha$,

$$E_\theta \phi_0(X) \geq E_\theta \phi(X)$$

for each $\theta \in \Theta_1$.

For a test to be UMP, it must maximize the power $E_\theta \phi(X)$ for each $\theta \in \Theta_1$. This is a very stringent condition, and the existence of a uniformly most powerful test is not guaranteed in all cases. For example, although the Neyman-Pearson lemma tells us that there exists a most powerful test of size $\alpha$ for fixed $\theta_1 \in \Theta_1$, there is no reason why this same test should also be most powerful of size $\alpha$ for $\theta_2 \neq \theta_1$, with $\theta_2 \in \Theta_1$.

Our goal in this section is to arrive at conditions for which the existence of a UMP test can indeed be guaranteed. That is, we want to establish conditions under which there exists a test such that the probability of false alarm is no greater than a given $\alpha$ for all $\theta \in \Theta_0$, but which at the same time has maximum probability of detection for all $\theta \in \Theta_1$. We will approach this development through an example; this result will motivate the characterization of the conditions for the existence of a UMP test.

Illustrative Example. Let $X$ be a unit-variance normal random variable with unknown mean $\theta$. Let $\Theta_0 = (-\infty, \theta_0]$, and let $\Theta_1 = (\theta_0, \infty)$. We wish to test $H_0: \theta \in \Theta_0$ against $H_1: \theta \in \Theta_1$. We desire the test to be uniformly most powerful out of the class of all tests $\phi$ for which

$$E_\theta \phi(X) \leq \alpha \quad \forall\, \theta \leq \theta_0. \tag{4-18}$$

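To solve this problem we first solve a related problem, and seek the best test $\phi_0$ of size $\alpha$ for testing the simple hypothesis $H_0: \theta = \theta_0$ against the simple alternative $H_1: \theta = \theta_1$, where $\theta_1 > \theta_0$. By the Neyman-Pearson lemma, this test is of the form

$$\phi_0(x) = \begin{cases} 1 & \text{if } \frac{1}{\sqrt{2\pi}} \exp[-(x - \theta_1)^2/2] > \frac{k}{\sqrt{2\pi}} \exp[-(x - \theta_0)^2/2] \\ \gamma & \text{if } \frac{1}{\sqrt{2\pi}} \exp[-(x - \theta_1)^2/2] = \frac{k}{\sqrt{2\pi}} \exp[-(x - \theta_0)^2/2] \\ 0 & \text{if } \frac{1}{\sqrt{2\pi}} \exp[-(x - \theta_1)^2/2] < \frac{k}{\sqrt{2\pi}} \exp[-(x - \theta_0)^2/2]. \end{cases}$$

After taking logarithms and rearranging, this test assumes an equivalent form

$$\phi_0(x) = \begin{cases} 1 & \text{if } x > k' \\ 0 & \text{otherwise,} \end{cases} \tag{4-19}$$

where

$$k' = \frac{(\theta_1^2/2 - \theta_0^2/2) + \log k}{\theta_1 - \theta_0}.$$

Note that we may set $\gamma = 0$ since the probability that $X = k'$ is zero. With this test, we see that

$$P_{\theta_0}[X > k'] = \int_{k'}^{\infty} \frac{1}{\sqrt{2\pi}} \exp[-(x - \theta_0)^2/2]\,dx = \int_{k' - \theta_0}^{\infty} \frac{1}{\sqrt{2\pi}} \exp[-x^2/2]\,dx = \alpha$$

implies that $k' - \theta_0 = \Phi^{-1}(1 - \alpha)$, or

$$k' = \theta_0 + \Phi^{-1}(1 - \alpha). \tag{4-20}$$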
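As a quick numerical illustration of (4-20), the sketch below computes the threshold $k'$ for a unit-variance normal observation and evaluates the power $P_\theta[X > k']$ at several means. It uses `statistics.NormalDist` from the Python standard library; the function names `ump_threshold` and `power` are illustrative.

```python
from statistics import NormalDist


def ump_threshold(theta0, alpha):
    """Threshold k' = theta0 + Phi^{-1}(1 - alpha) from (4-20)."""
    return theta0 + NormalDist().inv_cdf(1.0 - alpha)


def power(theta, k_prime):
    """P_theta[X > k'] for X ~ N(theta, 1)."""
    return 1.0 - NormalDist(mu=theta, sigma=1.0).cdf(k_prime)


if __name__ == "__main__":
    k_prime = ump_threshold(theta0=0.0, alpha=0.05)
    print(f"k' = {k_prime:.4f}")
    for theta in (-1.0, 0.0, 1.0, 2.0):
        print(f"theta = {theta:+.1f}: P[X > k'] = {power(theta, k_prime):.4f}")
```

The threshold depends only on $\theta_0$ and $\alpha$, and the printed probabilities increase with $\theta$: they stay at or below $\alpha$ for $\theta \leq \theta_0$ and exceed it for $\theta > \theta_0$.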

It is important to note that $k'$ depends only on $\theta_0$ and $\alpha$, but not otherwise on $\theta_1$. In fact, exactly the same test as given by (4-19), with $k'$ determined by (4-20), is best, according to the Neyman-Pearson lemma, for all $\theta_1 \in (\theta_0, \infty)$. Thus, $\phi_0$ given by (4-19) is UMP out of the class of all tests for which

$$E_{\theta_0} \phi(X) \leq \alpha.$$

We have thus established that $\phi_0$ is UMP for $H_0: \theta = \theta_0$ (simple) and $H_1: \theta > \theta_0$ (composite). To complete the development, we need to extend the discussion to permit $H_0: \theta \leq \theta_0$ (composite). We may do this by establishing that $\phi_0$ satisfies the condition given by (4-18). Fix $k'$ by (4-20) for the given $\alpha$. Now examine

$$E_\theta \phi_0(X) = P_\theta[X > k'] = \int_{k'}^{\infty} \frac{1}{\sqrt{2\pi}} \exp[-(x - \theta)^2/2]\,dx,$$

and note that this quantity is an increasing function of $\theta$ ($k'$ being fixed). Hence,

$$E_\theta \phi_0(X) < E_{\theta_0} \phi_0(X) \leq \alpha \quad \forall\, \theta < \theta_0$$

and, consequently,

$$\sup_{\theta \in (-\infty, \theta_0]} E_\theta \phi_0(X) \leq \alpha.$$

Hence, $\phi_0$ is uniformly best out of all tests satisfying (4-18), i.e., it is UMP.

Summarizing, we have established that there does indeed exist a uniformly most powerful test for testing the hypothesis $H_0: \theta \leq \theta_0$ against the alternatives $H_1: \theta > \theta_0$, for any $\theta_0$, where $\theta_0$ is the mean of a normal random variable $X$ with known variance. Such a test is said to be one-sided, and has a very simple form: reject $H_0$ if $X > k'$ and accept $H_0$ if $X \leq k'$, where $k'$ is chosen to make the size of the test equal to $\alpha$.

We now turn attention to the issue of determining what conditions on the distribution are sufficient to guarantee the existence of a UMP test.

Definition. A real parameter family of distributions is said to have monotone likelihood ratio if densities (or probability mass functions) $f(x \mid \theta)$ exist such that, whenever $\theta_1 < \theta_2$, the likelihood ratio

$$\ell(x) = \frac{f(x \mid \theta_2)}{f(x \mid \theta_1)}$$

is a nondecreasing function of $x$ in the set of its existence; that is, for $x$ in the set of points for which at least one of $f(x \mid \theta_1)$ and $f(x \mid \theta_2)$ is positive. If $f(x \mid \theta_1) = 0$ and $f(x \mid \theta_2) > 0$, the likelihood ratio is defined as $+\infty$.

Thus, if the distribution has monotone likelihood ratio, the larger $x$, the more likely the alternative, $H_1$, is to be true.

Theorem 1 (Karlin and Rubin). If the distribution of $X$ has monotone likelihood ratio, then any test of the form

$$\phi(x) = \begin{cases} 1 & \text{if } x > x_0 \\ \gamma & \text{if } x = x_0 \\ 0 & \text{if } x < x_0 \end{cases} \tag{4-21}$$

has nondecreasing power. Any test of the form (4-21) is UMP of its size for testing $H_0: \theta \leq \theta_0$ against $H_1: \theta > \theta_0$ for any $\theta_0 \in \Theta$, provided its size is not zero. For every $0 < \alpha \leq 1$ and every $\theta_0 \in \Theta$, there exist numbers $-\infty < x_0 < \infty$ and $0 \leq \gamma \leq 1$ such that the test (4-21) is UMP of size $\alpha$ for testing $H_0: \theta \leq \theta_0$ against $H_1: \theta > \theta_0$.

Proof. Let $\theta_1$ and $\theta_2$ be any points of $\Theta$ with $\theta_1 < \theta_2$. By the Neyman-Pearson lemma, any test of the form

$$\phi(x) = \begin{cases} 1 & \text{if } f_X(x \mid \theta_2) > k f_X(x \mid \theta_1) \\ \gamma & \text{if } f_X(x \mid \theta_2) = k f_X(x \mid \theta_1) \\ 0 & \text{if } f_X(x \mid \theta_2) < k f_X(x \mid \theta_1) \end{cases} \tag{4-22}$$

for $0 \leq k < \infty$, is best of its size for testing $\theta = \theta_1$ against $\theta = \theta_2$. Because the distribution has monotone likelihood ratio, any test of the form (4-21) is also of the form (4-22). To see this, note that if $x' < x_0$, then $\ell(x') \leq \ell(x_0)$. For any $k$ in the range of $\ell$ there exists an $x_0$ such that if $\ell(x) = k$, then $x = x_0$. Thus, (4-21) is best of size $\alpha > 0$ for testing $\theta = \theta_1$ against $\theta = \theta_2$. The remainder of the proof is essentially the same as the proof for the normal distribution, and will be omitted. □

Example 4-4 The one-parameter exponential family of distributions with density (or probability mass function)

$$f(x \mid \theta) = c(\theta) h(x) \exp[\pi(\theta) t(x)]$$

has a monotone likelihood ratio provided that both $\pi$ and $t$ are nondecreasing. To see this, simply write, with $\theta_1 < \theta_2$,

$$\frac{f(x \mid \theta_2)}{f(x \mid \theta_1)} = \frac{c(\theta_2)}{c(\theta_1)} \exp\{[\pi(\theta_2) - \pi(\theta_1)]\,t(x)\},$$

which is nondecreasing in $x$.

5 Bayes Decision Theory

Thus far, our treatment of decision theory has been to consider the parameter $\theta$ as an unknown quantity, but not a random variable, and to formulate a decision rule on the basis of maximizing the probability of correct detection (the power) while at the same time attempting to keep the probability of false alarm (the size) to an acceptably low level. The result was the likelihood ratio test and the receiver operating characteristic.

Decision theory is nothing more than the art of guessing, and as with any art, there is no absolute, or objective, measure of quality. In fact, we are free to invent any principle we like by which to act in making our choice of decision rule. In our study of Neyman-Pearson theory, we have seen one attempt at the invention of a principle by which to order decision rules, namely, the notions of power and size. The Bayesian approach constitutes another approach, and there are still others.

5.1 The Bayes Principle

The Bayes theory requires that the parameter $\theta$ be viewed as a random variable, rather than just an unknown quantity. This assumption is a major leap, and should not be glossed over lightly. Making it requires us to accept the premise that nature has specified a particular probability distribution, called the prior, or a priori, distribution of $\theta$. Furthermore, strictly speaking, Bayesianism requires that we know what this distribution is. These are large pills for some people to swallow, particularly for those of the so-called "objectivist" school, which includes those of the Neyman-Pearson persuasion. Bayesianism has been subjected to much criticism from this quarter over the years. But the more modern school of subjective probability has gone a long way towards the development of a rationale for Bayesianism (footnote 8). Briefly, subjectivists argue that it is not necessary to believe that nature actually chooses a state according to a prior distribution; rather, the prior distribution is viewed merely as a reflection of the belief of the decision-maker (sometimes called an agent) about where the true state of nature lies, and the acquisition of new information, usually in the form of observations, acts to change the agent's belief about the state of nature. In fact, it can be shown that, in general, every really good decision rule is essentially a Bayes rule with respect to some prior distribution.

Footnote 8: An interesting discussion of this topic is found in [7].

To characterize $\theta$ as a random variable, we must be able to define the joint distribution of $X$ and $\theta$. Let this distribution be represented by $F_{X,\theta}(x, \vartheta)$, where we use the notation $\vartheta$ to represent values that may be assumed by the random variable $\theta$; that is, we can write $[\theta = \vartheta]$ to mean the event that the random variable $\theta$ takes on the parameter value $\vartheta$, similar to the way we write $[X = x]$ to mean the event that the random variable $X$ takes on the value $x$. Usually, textbooks and papers are not so careful, and rely upon context to determine when $\theta$ is viewed as being a random variable and when it is viewed as a value, but we will try to make this distinction in these notes. We will assume, for our treatment, that such a joint distribution exists, and recall that

$$F_{X,\theta}(x, \vartheta) = F_{X|\theta}(x \mid \vartheta) F_\theta(\vartheta) = F_{\theta|X}(\vartheta \mid x) F_X(x).$$

Note a slight notational change here. Before, with the Neyman-Pearson approach, we did not explicitly include $\theta$ in the subscript of the distribution function; we merely carried it along as a parameter in the argument list of the function. While that notation was suggestive of conditioning, it was not required that we interpret it in that light. Within the Bayesian context, however, we wish to emphasize that the parameter is viewed as a random variable and $F_{X|\theta}$ is a conditional distribution, so we will be careful to carry it in the subscript of the distribution function as well as in its argument list.

5.2 Bayes Risk

For this development we rely on [2, 3]. Let $(\Theta, \mathcal{T}, \tau)$ be a probability space, where $\Theta$ is the by now familiar parameter set, $\mathcal{T}$ is a $\sigma$-field over $\Theta$, and $\tau$ is a probability defined over this $\sigma$-field. Let $(\Theta, \Delta, L)$ be a statistical game. Let $X$ be a random variable (or vector) taking values in $\mathcal{X}$ ($\mathcal{X}$ may be a subset of $\mathbb{R}$ [or of $\mathbb{R}^k$] for continuous random variables, or it may be a countable set for discrete random variables).

We earlier introduced $(\Theta, D, R)$ as an equivalent form of the statistical game, where $D$ is the space of decision functions and $R$ is the risk function, defined as the expected value of the loss function:

$$R(\vartheta, \phi) = \int_{\mathcal{X}} L[\vartheta, \phi(x)]\, f_{X|\theta}(x \mid \vartheta)\,dx$$

when $f_{X|\theta}(x \mid \vartheta)$ is a density function, and

$$R(\vartheta, \phi) = \sum_{x \in \mathcal{X}} L[\vartheta, \phi(x)]\, f_{X|\theta}(x \mid \vartheta)$$

when $f_{X|\theta}(x \mid \vartheta)$ is a probability mass function.

The risk represents the average loss to the statistician when the true state of nature is $\vartheta$ and the statistician uses the decision rule $\phi$. We might suppose that a reasonable decision criterion would be to choose $\phi$ such that the risk is minimized, but this is not generally possible since the value $\theta$ assumes is unknown, so we cannot unilaterally minimize the risk as long as the loss function depends on $\theta$ (and that takes in just about all interesting cases). Application of the Bayes principle, however, permits us to view $R(\theta, \phi)$ as a random variable, since it is a function of the random variable $\theta$. So the natural thing to do now is to compute the average risk and then find a decision rule that minimizes this average risk.

Definition. The distribution of the random variable $\theta$ is called the prior, or a priori, distribution. The set of all possible prior distributions is denoted by the set $\Theta^*$. We will assume that this set of prior distributions (a) contains all finite distributions, i.e., all distributions that give all their mass to a finite number of points of $\Theta$; and (b) is convex, i.e., if $\tau_1 \in \Theta^*$ and $\tau_2 \in \Theta^*$, then $a\tau_1 + (1 - a)\tau_2 \in \Theta^*$ for all $0 \leq a \leq 1$ (this is the set of so-called convex combinations).

Definition. The Bayes risk function with respect to a prior distribution $F_\theta$, denoted $r(F_\theta, \phi)$, is given by

$$r(F_\theta, \phi) = E R(\theta, \phi),$$

where the expectation is taken over the space $\Theta$ of values that $\theta$ may assume:

$$r(F_\theta, \phi) = \int_\Theta R(\vartheta, \phi)\, f_\theta(\vartheta)\,d\vartheta$$

when $F_\theta$ has a density function $f_\theta(\vartheta)$, and

$$r(F_\theta, \phi) = \sum_{\vartheta_i \in \Theta} R(\vartheta_i, \phi)\, f_\theta(\vartheta_i)$$

when $F_\theta$ has a probability mass function $f_\theta(\vartheta)$.

We note that, whereas the risk $R$ is defined as the average of the loss function obtained by averaging over all values $X = x$ for a fixed $\vartheta$, the Bayes risk, $r$, is the average value of the loss function obtained by averaging over all values $X = x$ and $\theta = \vartheta$. For example, when both $X$ and $\theta$ are continuous,

$$r(F_\theta, \phi) = E L[\theta, \phi(X)] = \int_\Theta R(\vartheta, \phi)\, f_\theta(\vartheta)\,d\vartheta = \int_\Theta \int_{\mathcal{X}} L[\vartheta, \phi(x)]\, f_{X|\theta}(x \mid \vartheta)\, f_\theta(\vartheta)\,dx\,d\vartheta. \tag{5-1}$$

If $X$ is continuous and $\theta$ is discrete, then

$$r(F_\theta, \phi) = E L[\theta, \phi(X)] = \sum_{\vartheta \in \Theta} R(\vartheta, \phi)\, f_\theta(\vartheta) = \sum_{\vartheta \in \Theta} \int_{\mathcal{X}} L[\vartheta, \phi(x)]\, f_{X|\theta}(x \mid \vartheta)\, f_\theta(\vartheta)\,dx. \tag{5-2}$$

The remaining constructions when $X$ is discrete are also easily obtained.

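5.3 Bayes Tests of Simple Binary Hypotheses

In the statistical game $(\Theta, \Delta, L)$, let $\Theta = \{\theta_0, \theta_1\}$, and let $\Delta = \{\delta_0, \delta_1\}$. We observe a random variable $X$ taking values in a space $\mathcal{X}$. The distribution of $X$ is given by $F_{X|\theta}(\cdot \mid \vartheta)$, where $\theta$ is a random variable with prior distribution function $F_\theta(\vartheta)$. As before, we desire to fashion a decision rule, or test, $\phi: \mathcal{X} \to \Delta$ such that, when $X = x$ is observed,

$$\phi(x) = \begin{cases} 1 & \text{if } x \in R \\ 0 & \text{if } x \in A, \end{cases}$$

where $R$ and $A$ are measurable subsets of $\mathcal{X}$, and $\mathcal{X} = R \cup A$. We interpret this decision rule as follows: if $x \in R$ we take action $\delta_1$, and if $x \in A$ we take action $\delta_0$. The next step in the development of this problem is to determine the sets $R$ and $A$. The risk function for such a rule is

$$R(\vartheta, \phi) = [1 - P(R \mid \vartheta)]\,L(\vartheta, \delta_0) + P(R \mid \vartheta)\,L(\vartheta, \delta_1) = L(\vartheta, \delta_0) + P(R \mid \vartheta)[L(\vartheta, \delta_1) - L(\vartheta, \delta_0)],$$

where by $P(R \mid \vartheta)$ we mean the conditional probability that $X$ will take values in $R$, given $\theta = \vartheta$. For our particular choice of decision rule, we observe that the conditional expectation of $\phi(X)$ given $\theta = \vartheta$ is

$$E[\phi(X) \mid \vartheta] = 1 \cdot P(R \mid \vartheta) + 0 \cdot [1 - P(R \mid \vartheta)] = P(R \mid \vartheta),$$

so we may write

$$R(\vartheta, \phi) = L(\vartheta, \delta_0) + E[\phi(X) \mid \vartheta]\,[L(\vartheta, \delta_1) - L(\vartheta, \delta_0)].$$

We will define the loss function as

$$L(\vartheta, \delta_0) = a I_{\{\theta_1\}}(\vartheta) = \begin{cases} a & \text{if } \vartheta = \theta_1 \\ 0 & \text{if } \vartheta = \theta_0, \end{cases} \qquad L(\vartheta, \delta_1) = b I_{\{\theta_0\}}(\vartheta), \tag{5-3}$$

where $a$ and $b$ are arbitrary positive constants. Thus, if $\theta = \theta_1$ but we wrongly guess $\theta = \theta_0$ we incur a penalty or loss of $a$ units, and if $\theta = \theta_0$ and we guess that $\theta = \theta_1$ we lose $b$ units. The risk function becomes

$$R(\vartheta, \phi) = a I_{\{\theta_1\}}(\vartheta) + E[\phi(X) \mid \vartheta]\,[b I_{\{\theta_0\}}(\vartheta) - a I_{\{\theta_1\}}(\vartheta)] = \begin{cases} b\,E[\phi(X) \mid \theta = \theta_0] & \text{for } \vartheta = \theta_0 \\ a\,(1 - E[\phi(X) \mid \theta = \theta_1]) & \text{for } \vartheta = \theta_1. \end{cases} \tag{5-4}$$

The smaller the values of $R(\theta_0, \phi)$ and $R(\theta_1, \phi)$, the better the decision rule $\phi$.

Definition. Let $p$ be a real number such that $0 \leq p \leq 1$, and suppose that

$$p = f_\theta(\theta_1) = P[\theta = \theta_1], \qquad 1 - p = f_\theta(\theta_0) = P[\theta = \theta_0]. \tag{5-5}$$

Then $p$ characterizes the prior probability distribution for $\theta$, and the Bayes risk is

$$r(p, \phi) = (1 - p) R(\theta_0, \phi) + p R(\theta_1, \phi). \tag{5-6}$$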

Any decision function that, for fixed $p$, minimizes the value of $r(p, \phi)$ is said to be Bayes with respect to $p$, and will be denoted $\phi_p$, which satisfies

$$\phi_p = \arg\min_\phi r(p, \phi). \tag{5-7}$$

The usual intuitive meaning associated with (5-6) is the following. Suppose that you know (or believe) that the unknown parameter $\theta$ is in fact a random variable with specified prior probabilities of $p$ and $1 - p$ of taking values $\theta_1$ and $\theta_0$, respectively. Then for any decision function $\phi$, the "global" expected loss will be given by (5-6), and hence it will be reasonable to use the decision function $\phi_p$ which minimizes $r(p, \phi)$.

We now proceed to find $\phi_p$. To do so requires us to evaluate the conditional expectation $E[\phi(X) \mid \vartheta]$. We will assume that the two conditional distributions of $X$, for $\theta = \theta_0$ and $\theta = \theta_1$, are given in terms of density functions $f_{X|\theta}(x \mid \theta_0)$ and $f_{X|\theta}(x \mid \theta_1)$. Then from (5-4) and (5-6), we have

$$r(p, \phi) = pa\,(1 - E[\phi(X) \mid \theta = \theta_1]) + (1 - p)\,b\,E[\phi(X) \mid \theta = \theta_0]$$
$$= pa\left[1 - \int_{\mathcal{X}} f_{X|\theta}(x \mid \theta_1)\phi(x)\,dx\right] + (1 - p)\,b \int_{\mathcal{X}} f_{X|\theta}(x \mid \theta_0)\phi(x)\,dx$$
$$= pa + \int_{\mathcal{X}} \left[-pa\,f_{X|\theta}(x \mid \theta_1) + (1 - p)\,b\,f_{X|\theta}(x \mid \theta_0)\right] \phi(x)\,dx. \tag{5-8}$$

This last expression is minimized by minimizing the integrand for each $x$, that is, by defining $\phi(x)$ to be

$$\phi_p(x) = \begin{cases} 1 & \text{if } (1 - p)\,b\,f_{X|\theta}(x \mid \theta_0) < pa\,f_{X|\theta}(x \mid \theta_1) \\ 0 & \text{if } (1 - p)\,b\,f_{X|\theta}(x \mid \theta_0) > pa\,f_{X|\theta}(x \mid \theta_1) \\ \text{arbitrary} & \text{if } (1 - p)\,b\,f_{X|\theta}(x \mid \theta_0) = pa\,f_{X|\theta}(x \mid \theta_1). \end{cases}$$

We may simplify this to

$$\phi_p(x) = \begin{cases} 1 & \text{if } (1 - p)\,b\,f_{X|\theta}(x \mid \theta_0) < pa\,f_{X|\theta}(x \mid \theta_1) \\ 0 & \text{otherwise.} \end{cases}$$

We may define the sets $R$ and $A$ as

$$R = \{x : (1 - p)\,b\,f_{X|\theta}(x \mid \theta_0) < pa\,f_{X|\theta}(x \mid \theta_1)\}, \qquad A = \{x : (1 - p)\,b\,f_{X|\theta}(x \mid \theta_0) \geq pa\,f_{X|\theta}(x \mid \theta_1)\};$$

then (5-8) becomes

$$r(p, \phi_p) = pa\left[1 - \int_{\mathcal{X}} f_{X|\theta}(x \mid \theta_1) I_R(x)\,dx\right] + (1 - p)\,b \int_{\mathcal{X}} f_{X|\theta}(x \mid \theta_0) I_R(x)\,dx$$
$$= pa \int_{\mathcal{X}} f_{X|\theta}(x \mid \theta_1) I_A(x)\,dx + (1 - p)\,b \int_{\mathcal{X}} f_{X|\theta}(x \mid \theta_0) I_R(x)\,dx. \tag{5-9}$$

Since we decide $\theta = \theta_1$ if $x \in R$ and $\theta = \theta_0$ if $x \in A$, we observe that, using (5-5) and setting $a = b = 1$, the Bayes risk (5-9) becomes the total probability of error:

$$r(p, \phi_p) = \underbrace{P[R \mid \theta = \theta_0]}_{P_{FA}}\, P[\theta = \theta_0] + \underbrace{P[A \mid \theta = \theta_1]}_{P_{MD}}\, P[\theta = \theta_1]. \tag{5-10}$$

Observe that $\phi_p(x)$ is a likelihood ratio test:

$$\phi_p(x) = \begin{cases} 1 & \text{if } \dfrac{f_{X|\theta}(x \mid \theta_1)}{f_{X|\theta}(x \mid \theta_0)} > \dfrac{b(1 - p)}{ap} \\ 0 & \text{otherwise.} \end{cases} \tag{5-11}$$

It is important to note that this test is identical in form to the solution to the Neyman-Pearson test; only the threshold is changed. Whereas for the Neyman-Pearson test the threshold was determined by the size of the test, the Bayesian formulation provides the threshold as a function of the prior distribution on $\theta$. We leave it to the users to determine which of these criteria is more applicable to their specific problem.

Example 5-1 This is the same problem as Example 1 of the notes on Neyman-Pearson theory. We repeat the entire problem statement to maintain completeness of these notes. Let us assume that, under hypothesis $H_1$, a source output is a constant voltage $m$, and under $H_0$ the source output is zero. Before observation the voltage is corrupted by an additive noise; the $n$ sample random variables are

$$X_i = \theta + Z_i, \quad i = 1, \ldots, n, \tag{5-12}$$

where $\theta \in \{\theta_0, \theta_1\}$ with $\theta_0 = 0$ and $\theta_1 = m$. The random variables $Z_i$ are independent zero-mean normal random variables with known variance $\sigma^2$, and are also independent of the source output, $\theta$. We assume $\theta$ is a random variable with distribution

$$P[\theta = m] = p, \qquad P[\theta = 0] = 1 - p.$$

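We sample the output waveform each second and obtain $n$ samples. In other words,

$$H_0: X_i = Z_i, \quad i = 1, \ldots, n, \qquad H_1: X_i = m + Z_i, \quad i = 1, \ldots, n,$$

with

$$f_Z(z) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{z^2}{2\sigma^2}\right).$$

The probability density of $X_i$ under each hypothesis is

$$f_{X|\theta}(x \mid 0) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{x^2}{2\sigma^2}\right), \qquad f_{X|\theta}(x \mid m) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - m)^2}{2\sigma^2}\right).$$

Because the $Z_i$ are statistically independent, the joint probability density of $X_1, \ldots, X_n$ is simply the product of the individual probability density functions. Thus

$$f_{X_1,\ldots,X_n}(x_1, \ldots, x_n \mid \theta_0) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{x_i^2}{2\sigma^2}\right),$$

$$f_{X_1,\ldots,X_n}(x_1, \ldots, x_n \mid \theta_1) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x_i - m)^2}{2\sigma^2}\right).$$

The likelihood ratio becomes

$$\ell(x_1, \ldots, x_n) = \frac{f_{X_1,\ldots,X_n}(x_1, \ldots, x_n \mid \theta_1)}{f_{X_1,\ldots,X_n}(x_1, \ldots, x_n \mid \theta_0)} = \frac{\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x_i - m)^2}{2\sigma^2}\right)}{\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{x_i^2}{2\sigma^2}\right)}.$$

After canceling common terms and taking the logarithm, we have

$$\log \ell(x_1, \ldots, x_n) = \frac{m}{\sigma^2} \sum_{i=1}^{n} x_i - \frac{nm^2}{2\sigma^2}, \tag{5-13}$$

resulting in the log likelihood ratio. From (5-11), we have, with $a = b = 1$,

$$\phi_p(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } \dfrac{f_{X|\theta}(x_1, \ldots, x_n \mid \theta_1)}{f_{X|\theta}(x_1, \ldots, x_n \mid \theta_0)} > \dfrac{1 - p}{p} \\ 0 & \text{otherwise,} \end{cases}$$

from which the log likelihood ratio test is

$$\phi_p(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } \log \ell(x_1, \ldots, x_n) > \log \dfrac{1 - p}{p} \\ 0 & \text{otherwise.} \end{cases} \tag{5-14}$$

Viewing the log likelihood ratio as a random variable and multiplying (5-13) by $\sigma/(\sqrt{n}\,m)$ yields

$$\frac{\sigma}{\sqrt{n}\,m} \log \ell(X_1, \ldots, X_n) = \frac{1}{\sqrt{n}\,\sigma} \sum_{i=1}^{n} X_i - \frac{\sqrt{n}\,m}{2\sigma}.$$

Define the new random variable

$$L(X_1, \ldots, X_n) = \frac{\sigma}{\sqrt{n}\,m} \log \ell(X_1, \ldots, X_n) + \frac{\sqrt{n}\,m}{2\sigma} = \frac{1}{\sqrt{n}\,\sigma} \sum_{i=1}^{n} X_i. \tag{5-15}$$

Under hypothesis $H_0$, $L$ is obtained by adding $n$ independent zero-mean normal random variables with variance $\sigma^2$ and then dividing by $\sqrt{n}\,\sigma$, yielding $L \sim N(0, 1)$, and under hypothesis $H_1$, $L \sim N(\sqrt{n}\,m/\sigma, 1)$. Thus, for this example, we are able to calculate the conditional densities of the log likelihood ratio. The test becomes

$$\phi_p(x_1, \ldots, x_n) = \begin{cases} 1 & \text{if } L(x_1, \ldots, x_n) > \dfrac{d}{2} + \dfrac{1}{d} \log \dfrac{1 - p}{p} \\ 0 & \text{otherwise,} \end{cases} \tag{5-16}$$

where $d = \sqrt{n}\,m/\sigma$. It is convenient to define the threshold function

$$T(p, d) = \frac{d}{2} + \frac{1}{d} \log \frac{1 - p}{p}. \tag{5-17}$$

Then $P_{FA}$ is the integral of the conditional density $f_L(l \mid \theta_0)$ over the interval $(T(p, d), \infty)$, or

$$P_{FA} = \int_{T(p,d)}^{\infty} f_L(l \mid \theta_0)\,dl = \int_{T(p,d)}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{l^2}{2}\right) dl \tag{5-18}$$
$$= 1 - \Phi(T(p, d)), \tag{5-19}$$

where

$$\Phi(z) = \int_{-\infty}^{z} (2\pi)^{-1/2} \exp[-x^2/2]\,dx$$

is the normal integral, corresponding to the area under the normal curve from $-\infty$ to the point $z$. The probability of missed detection, $P_{MD}$, is the integral of the conditional density $f_L(l \mid \theta_1)$ over the interval $(-\infty, T(p, d))$, or

$$P_{MD} = \int_{-\infty}^{T(p,d)} f_L(l \mid \theta_1)\,dl = \int_{-\infty}^{T(p,d)} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(l - d)^2}{2}\right) dl \tag{5-20}$$
$$= \int_{-\infty}^{T(p,d) - d} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{y^2}{2}\right) dy \tag{5-21}$$
$$= \Phi(T(p, d) - d). \tag{5-22}$$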
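The quantities of Example 5-1 can be evaluated directly. The following is a minimal sketch with $a = b = 1$, using `statistics.NormalDist` from the Python standard library for the normal integral $\Phi$; the function names are illustrative, and `total_error` evaluates the total probability of error in (5-10).

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # Phi is STD_NORMAL.cdf


def threshold(p, d):
    """T(p, d) = d/2 + (1/d) log((1 - p)/p), as in (5-17); requires 0 < p < 1."""
    return d / 2.0 + math.log((1.0 - p) / p) / d


def error_probabilities(p, d):
    """Return (P_FA, P_MD) from (5-19) and (5-22)."""
    t = threshold(p, d)
    p_fa = 1.0 - STD_NORMAL.cdf(t)      # 1 - Phi(T)
    p_md = STD_NORMAL.cdf(t - d)        # Phi(T - d)
    return p_fa, p_md


def total_error(p, d):
    """Total probability of error (5-10): (1 - p) P_FA + p P_MD."""
    p_fa, p_md = error_probabilities(p, d)
    return (1.0 - p) * p_fa + p * p_md


if __name__ == "__main__":
    for p in (0.1, 0.5, 0.9):
        print(f"p = {p}: total error = {total_error(p, d=2.0):.4f}")
```

For $p = 1/2$ the threshold reduces to $T = d/2$, so the two error probabilities are equal.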

5.4 Bayes Envelope Function

Definition. The function $\rho(\cdot)$ defined by

$$\rho(p) = r(p, \phi_p) = \min_\phi r(p, \phi) \tag{5-23}$$

is called the Bayes envelope function. It represents the minimal global expected loss attainable by any decision function when $\theta$ is a random variable with a priori distribution $P[\theta = \theta_1] = p$ and $P[\theta = \theta_0] = 1 - p$. We observe that, for $p = 0$, $\rho(p) = 0$, and for $p = 1$, it is also true that $\rho(p) = 0$.

It is useful to plot the Bayes envelope function; see Figure 5-1. This curve is the envelope of the one-parameter family of straight lines, as $\pi$ varies from 0 to 1,

$$y = r(p, \phi_\pi) = p R(\theta_1, \phi_\pi) + (1 - p) R(\theta_0, \phi_\pi)$$

for $0 \leq p \leq 1$.

Theorem 1 (Concavity of Bayes risk). For any distributions $p_1$ and $p_2$ of $\theta$ and for any number $q$ such that $0 \leq q \leq 1$,

$$\rho(q p_1 + (1 - q) p_2) \geq q \rho(p_1) + (1 - q) \rho(p_2).$$

Figure 5-1: Bayes envelope function (the curve $y = \rho(p) = r(p, \phi_p)$ together with a tangent line $y = r(p, \phi_\pi)$, which meets the vertical axes at $r(0, \phi_\pi)$ and $r(1, \phi_\pi)$).

Proof. Since (5-6) is linear in $p$, it follows that for any decision $\phi$,

$$r(q p_1 + (1 - q) p_2, \phi) = q\,r(p_1, \phi) + (1 - q)\,r(p_2, \phi).$$

To obtain the Bayes envelope, we must minimize this expression over all decision rules $\phi$. But the minimum of the sum of two quantities can never be smaller than the sum of their individual minima; hence

$$\min_\phi r(q p_1 + (1 - q) p_2, \phi) = \min_\phi \left[q\,r(p_1, \phi) + (1 - q)\,r(p_2, \phi)\right] \geq \min_\phi q\,r(p_1, \phi) + \min_\phi (1 - q)\,r(p_2, \phi) = q \rho(p_1) + (1 - q) \rho(p_2). \qquad \square$$

We thus see that, for each fixed $\pi$, the curve $y = \rho(p)$ lies entirely below the straight line $y = r(p, \phi_\pi)$. The quantity $r(p, \phi_\pi)$ may be regarded as the expected loss incurred by a decision maker who assumes that $P[\theta = \theta_1] = \pi$, and hence uses the decision rule $\phi_\pi$, when in fact $P[\theta = \theta_1] = p$; the excess of $r(p, \phi_\pi)$ over $\rho(p)$ is the cost of the error in incorrectly estimating the true value of the a priori probability $p = P[\theta = \theta_1]$.

Example 5-2 Consider the above example involving the normal distribution with unequal means and equal variances. Setting $a = b = 1$ and using (5-18) and (5-20), the Bayes risk becomes the total probability of error, and is of the form

$$r(p, \phi_p) = (1 - p) \int_{T(p,d)}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{l^2}{2}\right) dl + p \int_{-\infty}^{T(p,d)} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(l - d)^2}{2}\right) dl$$
$$= p\,\Phi(T(p, d) - d) + (1 - p)\left[1 - \Phi(T(p, d))\right].$$

Figure 5-2 illustrates the corresponding Bayes envelope functions for various values of $d$.

Figure 5-2: Bayes envelope function ($r(p, \phi_p)$ versus $p$): normal variables with unequal means and equal variances. Curves are shown for $d = 1, 2, 3$.

5.5 Posterior Distributions

If the distribution of the parameter $\theta$ before observations are made is called the prior distribution, then it is natural to consider defining a posterior distribution as the distribution of the parameter after observations are taken and processed. Let us proceed with this development as follows. We first consider the case for $X$ and $\theta$ both continuous. Assuming we can reverse the order of integration in (5-1), we obtain

$$r(F_\theta, \phi) = \int_\Theta \int_{\mathcal{X}} L[\vartheta, \phi(x)]\, f_{X|\theta}(x \mid \vartheta)\, f_\theta(\vartheta)\,dx\,d\vartheta$$
$$= \int_{\mathcal{X}} \int_\Theta L[\vartheta, \phi(x)]\, \underbrace{f_{X|\theta}(x \mid \vartheta)\, f_\theta(\vartheta)}_{f_{X\theta}(x, \vartheta)}\,d\vartheta\,dx$$
$$= \int_{\mathcal{X}} \left[\int_\Theta L[\vartheta, \phi(x)]\, f_{\theta|X}(\vartheta \mid x)\,d\vartheta\right] f_X(x)\,dx, \tag{5-24}$$

where we have used the fact that

$$f_{X|\theta}(x \mid \vartheta)\, f_\theta(\vartheta) = f_{X\theta}(x, \vartheta) = f_{\theta|X}(\vartheta \mid x)\, f_X(x).$$

In other words, a choice of $\theta$ by the marginal distribution $f_\theta(\vartheta)$, followed by a choice of $X$ from the conditional distribution $f_{X|\theta}(x \mid \vartheta)$, determines a joint distribution of $\theta$ and $X$, which in turn can be determined by first choosing $X$ according to its marginal distribution $f_X(x)$ and then choosing $\theta$ according to the conditional distribution $f_{\theta|X}(\vartheta \mid x)$ of $\theta$ given $X = x$.

With this change in order of integration, some very useful insight may be obtained. We see that we may minimize the Bayes risk given by (5-24) by finding a decision function $\phi(x)$ that minimizes the inside integral separately for each $x$; that is, we may find for each $x$ a rule, call it $\phi(x)$, that minimizes

$$\int_\Theta L[\vartheta, \phi(x)]\, f_{\theta|X}(\vartheta \mid x)\,d\vartheta. \tag{5-25}$$

Definition. The conditional distribution of $\theta$, given $X$, denoted $f_{\theta|X}(\vartheta \mid x)$, is called the posterior, or a posteriori, distribution of $\theta$.

The expression given in (5-25) is the expected loss given that $X = x$, and we may, therefore, interpret a Bayes decision rule as one that minimizes the posterior conditional expected loss, given the observation.

The above results need be modified only in notation for the case where $X$ and $\theta$ are discrete. For example, if $\theta$ is discrete, say $\Theta = \{\vartheta_1, \ldots, \vartheta_k\}$, we reverse the order of summation and integration in (5-2) to obtain

$$r(F_\theta, \phi) = \sum_{i=1}^{k} \int_{\mathcal{X}} L[\vartheta_i, \phi(x)]\, f_{X|\theta}(x \mid \vartheta_i)\, f_\theta(\vartheta_i)\,dx$$
$$= \int_{\mathcal{X}} \sum_{i=1}^{k} L[\vartheta_i, \phi(x)]\, f_{X|\theta}(x \mid \vartheta_i)\, f_\theta(\vartheta_i)\,dx$$
$$= \int_{\mathcal{X}} \left[\sum_{i=1}^{k} L[\vartheta_i, \phi(x)]\, f_{\theta|X}(\vartheta_i \mid x)\right] f_X(x)\,dx. \tag{5-26}$$

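Example 5-3 Let us consider the simple hypothesis versus simple alternative problem formulation, and let $\Theta = \{\theta_0, \theta_1\}$ and $\Delta = \{0, 1\}$. Assume we observe a random variable $X$ taking values in $\{x_0, x_1\}$, with the following conditional distributions:

$$f_{X|\theta}(x_1 \mid \theta_0) = P[X = x_1 \mid \theta = \theta_0] = \tfrac{3}{4}, \qquad f_{X|\theta}(x_0 \mid \theta_0) = P[X = x_0 \mid \theta = \theta_0] = \tfrac{1}{4},$$
$$f_{X|\theta}(x_1 \mid \theta_1) = P[X = x_1 \mid \theta = \theta_1] = \tfrac{1}{3}, \qquad f_{X|\theta}(x_0 \mid \theta_1) = P[X = x_0 \mid \theta = \theta_1] = \tfrac{2}{3}.$$

The loss function for this problem is given by the matrix in Figure 5-3.

                 theta_0   theta_1
    delta = 0       0        10
    delta = 1       5         0

Figure 5-3: Loss Function

Let $P[\theta = \theta_1] = p$ and $P[\theta = \theta_0] = 1 - p$ be the prior distribution for $\theta$, for $0 \leq p \leq 1$. We will address this problem by solving for the a posteriori pmf. The posterior pmf is given, via Bayes theorem, as

$$f_{\theta|X}(\theta_1 \mid x) = \frac{f_{X|\theta}(x \mid \theta_1)\, f_\theta(\theta_1)}{f_{X|\theta}(x \mid \theta_0)\, f_\theta(\theta_0) + f_{X|\theta}(x \mid \theta_1)\, f_\theta(\theta_1)} = \begin{cases} \dfrac{\frac{1}{3}p}{\frac{3}{4}(1 - p) + \frac{1}{3}p} & \text{if } x = x_1 \\[2ex] \dfrac{\frac{2}{3}p}{\frac{1}{4}(1 - p) + \frac{2}{3}p} & \text{if } x = x_0. \end{cases}$$

Note that $f_{\theta|X}(\theta_0 \mid x) = 1 - f_{\theta|X}(\theta_1 \mid x)$.

After the value $X = x$ has been observed, a choice must be made between the two actions $\delta = 0$ and $\delta = 1$. The Bayes decision rule is

$$\phi_p(x) = \arg\min_\delta \left\{L(\theta_1, \delta)\, f_{\theta|X}(\theta_1 \mid x) + L(\theta_0, \delta)\, f_{\theta|X}(\theta_0 \mid x)\right\}$$
$$= \begin{cases} \arg\min_\delta \left\{L(\theta_1, \delta)\, \dfrac{\frac{1}{3}p}{\frac{3}{4}(1 - p) + \frac{1}{3}p} + L(\theta_0, \delta)\, \dfrac{\frac{3}{4}(1 - p)}{\frac{3}{4}(1 - p) + \frac{1}{3}p}\right\} & \text{if } x = x_1 \\[2ex] \arg\min_\delta \left\{L(\theta_1, \delta)\, \dfrac{\frac{2}{3}p}{\frac{1}{4}(1 - p) + \frac{2}{3}p} + L(\theta_0, \delta)\, \dfrac{\frac{1}{4}(1 - p)}{\frac{1}{4}(1 - p) + \frac{2}{3}p}\right\} & \text{if } x = x_0 \end{cases} \tag{5-27}$$

for $\delta \in \{0, 1\}$. Evaluating the expressions in braces in (5-27) yields, after using the values for the loss function and some arithmetic,

$$\phi_p(x_1) = \begin{cases} 0 & \text{if } p \leq \frac{9}{17} \\ 1 & \text{if } p > \frac{9}{17} \end{cases} \qquad \text{and} \qquad \phi_p(x_0) = \begin{cases} 0 & \text{if } p \leq \frac{3}{19} \\ 1 & \text{if } p > \frac{3}{19}. \end{cases}$$

We may compute the Bayes risk function as follows. If $0 \leq p \leq \frac{3}{19}$, then it follows that $\phi_p(x) \equiv 0$ will be the Bayes rule whatever the value of $x$. The corresponding Bayes risk is $0 \cdot (1 - p) + 10p = 10p$. If $\frac{3}{19} < p \leq \frac{9}{17}$, then $\phi_p(x_0) = 1$ and $\phi_p(x_1) = 0$ is the Bayes decision function, and the corresponding risk is

$$r(p, \phi_p) = p R(\theta_1, \phi_p) + (1 - p) R(\theta_0, \phi_p) = p\left[10 \cdot \tfrac{1}{3} + 0 \cdot \tfrac{2}{3}\right] + (1 - p)\left[0 \cdot \tfrac{3}{4} + 5 \cdot \tfrac{1}{4}\right] = \tfrac{10}{3}p + \tfrac{5}{4}(1 - p).$$

If $\frac{9}{17} < p \leq 1$, then $\phi_p(x) \equiv 1$ is the Bayes rule, and the Bayes risk is $5(1 - p)$. The Bayes envelope function is provided in Figure 5-4.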
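The arithmetic of Example 5-3 can be reproduced exactly with rational numbers. The sketch below is one way to do so; the dictionary layout and the names `bayes_rule` and `bayes_risk` are illustrative choices. It minimizes the posterior expected loss for each observation (the normalizing constant $f_X(x)$ is omitted because it does not affect the minimizer), breaking ties in favor of $\delta = 0$ as in the example.

```python
from fractions import Fraction as F

# Conditional pmf P[X = x | theta] from Example 5-3.
LIKELIHOOD = {
    "theta0": {"x0": F(1, 4), "x1": F(3, 4)},
    "theta1": {"x0": F(2, 3), "x1": F(1, 3)},
}

# Loss L(theta, delta) from Figure 5-3.
LOSS = {
    ("theta0", 0): 0, ("theta0", 1): 5,
    ("theta1", 0): 10, ("theta1", 1): 0,
}


def bayes_rule(p):
    """Map each observation to the action minimizing posterior expected loss."""
    prior = {"theta0": 1 - p, "theta1": p}
    rule = {}
    for x in ("x0", "x1"):
        def expected_loss(action):
            # Unnormalized posterior expected loss: sum of L * f(x|theta) * prior.
            return sum(LOSS[(th, action)] * LIKELIHOOD[th][x] * prior[th]
                       for th in prior)
        rule[x] = min((0, 1), key=expected_loss)  # ties resolve to action 0
    return rule


def bayes_risk(p, rule):
    """r(p, phi) = sum over theta and x of prior * f(x|theta) * L(theta, phi(x))."""
    prior = {"theta0": 1 - p, "theta1": p}
    return sum(prior[th] * LIKELIHOOD[th][x] * LOSS[(th, rule[x])]
               for th in prior for x in rule)


if __name__ == "__main__":
    for p in (F(1, 10), F(1, 2), F(9, 10)):
        rule = bayes_rule(p)
        print(f"p = {p}: rule = {rule}, Bayes risk = {bayes_risk(p, rule)}")
```

The three priors printed fall in the three intervals identified in the example, so the rule changes from always choosing 0, to choosing 1 only on $x_0$, to always choosing 1.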

Figure 5-4: Bayes envelope function ($r(p, \phi_p)$ versus $p$).

5.6 Randomized Decision Rules

We have previously alluded to the existence of randomized decision rules, which we now discuss in more detail. Suppose, rather than invoking a rule that assigns a specific action for a given $x$, we instead invoke a rule that attaches a specific probability distribution to the actions, and the decision-maker then chooses its action by sampling the action space according to that distribution. For example, let $\delta_0$ and $\delta_1$ be two candidate actions, and let $\phi$ be a rule that yields, for each $x$, a probability $\pi$, such that the decision maker chooses action $\delta_1$ with probability $\pi$ and chooses action $\delta_0$ with probability $1 - \pi$. Indeed, it is easy to see that any finite convex combination of actions corresponds to a randomized rule. In fact, even the deterministic rules we have been discussing can be viewed as degenerate randomized rules, where we have set $\pi = 1$ for some action $\delta$.

Let $D^*$ denote the set of all randomized decision rules. Let $\phi \in D^*$ and $\phi' \in D^*$ be two rules, and let $\phi_\pi$ be the randomized decision rule corresponding to choosing $\phi$ with probability $\pi$, where $\pi \in (0, 1)$, and choosing $\phi'$ with probability $1 - \pi$. Then $\phi_\pi \in D^*$ and

$$R(\vartheta, \phi_\pi) = \pi R(\vartheta, \phi) + (1 - \pi) R(\vartheta, \phi').$$


5.7 Minimax Rules

An interesting approach to decision making is to consider ordering decision rules according to the worst that could happen. Consider the value $\pi = \pi_M$ on the Bayes envelope plot given in Figure 5-1. At this value, we have that

$$ r(0, \delta_{\pi_M}) = r(1, \delta_{\pi_M}) = \max_{\pi} r(\pi, \delta_\pi). $$

Thus, for $\pi = \pi_M$, the maximum possible expected loss due to ignorance of the true state of nature is minimized by using $\delta_{\pi_M}$. This observation motivates the introduction of the so-called minimax decision rules.

Definition. We say that a decision rule $\delta_1$ is preferred to rule $\delta_2$ if

$$ \max_{\theta\in\Theta} R(\theta, \delta_1) < \max_{\theta\in\Theta} R(\theta, \delta_2). $$

Recall that $D^*$ is the set of all possible randomized decision rules; then this notion of preference leads to a linear ordering of the rules in $D^*$. A rule that is most preferred in this ordering is called a minimax decision rule. That is, a rule $\delta_0$ is said to be minimax if

$$ \max_{\theta\in\Theta} R(\theta, \delta_0) = \min_{\delta\in D^*} \max_{\theta\in\Theta} R(\theta, \delta). \tag{5-28} $$

The value on the right side of (5-28) is called the minimax value, or upper value of the game. In words, (5-28) means, essentially, that if we first find the value of $\theta$ that maximizes the risk for each rule $\delta \in D^*$, then find the rule $\delta_0 \in D^*$ that minimizes the resulting set of risks, we have the minimax decision rule. This rule corresponds to an attitude of "cutting our losses." We first determine what state nature would take if we were to take action $\delta$ and it were perverse, then we take the action that minimizes the amount of damage that nature can do to us.

If I am paranoid, I would be inclined toward a minimax rule. But, as they say, "Just because I'm paranoid doesn't mean they're not out to get me," and indeed nature may have it in for me. In such a situation, nature would search through the family of possible prior distributions, and would choose one that does me the most damage, even if I adopt a minimax stance.

Definition. A distribution $\pi_0$ is said to be a least favorable prior if

$$ \min_{\delta\in D^*} r(\pi_0, \delta) = \max_{\pi} \min_{\delta\in D^*} r(\pi, \delta). \tag{5-29} $$

The value on the right side of (5-29) is called the maximin value, or lower value of the game. The terminology, "least favorable," derives from the fact that, if I were told which prior nature was using, I would like least to be told a distribution $\pi_0$ satisfying (5-29), because that would mean that nature had taken a stance that would allow me to cut my losses by the least amount.

5.8 Summary of Binary Decision Problems

The following observations summarize the results we have obtained for the binary decision problem.

1. Using either a Neyman-Pearson or a Bayes criterion, we see that the optimum test is a likelihood ratio test. Thus, regardless of the dimensionality of the observation space, the test consists of comparing a scalar variable $\ell(x)$ with a threshold.

2. In many cases construction of the likelihood ratio test can be simplified by using a sufficient statistic.

3. A complete description of the likelihood ratio test performance can be obtained by plotting the conditional probabilities $P_D$ versus $P_{FA}$ as the threshold is varied. The resulting ROC curve can be used to calculate either the power for a given size (and vice versa) or the Bayes risk (the probability of error).

4. The minimax criterion is a special case of a Bayes rule with a least favorable prior.

5. A Bayes rule minimizes the expected loss under the posterior distribution.

5.9 Multiple Decision Problems

Thus far, we have focused our discussion mainly on the binary hypothesis testing problem, but we now turn our attention to the M-ary problem. Although [16, Page 46] claims the generalization of Neyman-Pearson theory to multiple hypotheses exists but is not widely used, I have never seen another reference to it$^{9}$. From its very construction, the Neyman-Pearson theory is designed to deal with binary hypotheses; there does not seem to be a natural extension to the problem of selecting from among $M > 2$ choices. Even granting that an M-ary Neyman-Pearson theory exists, I suspect that it loses some of its elegance when it is extended to more than the binary case. At any rate, we will not be attempting such a generalization in this class; instead, we will pursue the Bayesian approach for arbitrary finite $\Theta$.

Suppose that $\Theta$ consists of $k \ge 2$ points, $\Theta = \{\theta_1, \dots, \theta_k\}$, and consider the set $S$, called the risk set, contained in $k$-dimensional Euclidean space $\mathbb{R}^k$, of points of the form $\{R(\theta_1, \delta), \dots, R(\theta_k, \delta)\}$, where $\delta$ ranges through $D^*$, the set of all randomized decisions. In other words, $S$ is the set of all $k$-tuples $\{y_1, \dots, y_k\}$ such that $y_i = R(\theta_i, \delta)$, $i = 1, \dots, k$, for some $\delta \in D^*$.

Theorem 2 The risk set $S$ is a convex subset of $\mathbb{R}^k$.

Proof. Let $y = [y_1, \dots, y_k]^T$ and $y' = [y'_1, \dots, y'_k]^T$ be arbitrary points in $S$. According to the definition of $S$, there exist decision rules $\delta$ and $\delta'$ in $D^*$ for which $y_i = R(\theta_i, \delta)$ and $y'_i = R(\theta_i, \delta')$ for $i = 1, \dots, k$. Let $\alpha$ be arbitrary such that $0 \le \alpha \le 1$ and consider the decision rule $\delta_\alpha$ which chooses rule $\delta$ with probability $\alpha$ and rule $\delta'$ with probability $1 - \alpha$. Clearly, $\delta_\alpha \in D^*$, and $R(\theta_i, \delta_\alpha) = \alpha R(\theta_i, \delta) + (1 - \alpha) R(\theta_i, \delta')$ for $i = 1, \dots, k$. If $z$ denotes the point whose $i$-th coordinate is $R(\theta_i, \delta_\alpha)$, then $z = \alpha y + (1 - \alpha) y'$, thus $z \in S$.
A prior distribution for nature is a $k$-tuple of nonnegative numbers $\{\pi_1, \dots, \pi_k\}$ such that $\sum_{i=1}^{k} \pi_i = 1$, with the understanding that $\pi_i$ represents the probability that nature chooses $\theta_i$. Let $\pi = [\pi_1, \dots, \pi_k]^T$. For any point $y \in S$, the Bayes risk is then the inner product

$$ \pi^T y = \sum_{i=1}^{k} \pi_i y_i = \sum_{i=1}^{k} \pi_i R(\theta_i, \delta). $$

$^{9}$ I think Van Trees is being kind with the phrase "not widely used."


(The existence of a randomized decision rule is guaranteed by the convexity of the risk set.) We make the following observations:

1. There may be multiple points with the same Bayes risk (for example, suppose one or more entries in $\pi$ is zero). Consider the set of all vectors $y$ that satisfy, for a given $\pi$, the relationship

$$ \pi^T y = b \tag{5-30} $$

for any real number $b$. Then all of these points (and the corresponding decision rules) are equivalent.

2. The set of points $y$ that satisfy (5-30) lie in a hyperplane; this plane is perpendicular to the vector from the origin to the point $(\pi_1, \dots, \pi_k)$. To see this, consider Figure 5-5, where, for $k = 2$, the risk set and sets of equivalent points are displayed (the concepts carry over to the general case for $k > 2$, but the graphical display is not as convenient, or possible).

3. The quantity $b$ can be visualized by noting that the point of intersection of the diagonal line $y_1 = \cdots = y_k$ with the plane $\pi^T y = \sum_i \pi_i y_i = b$ must occur at $[b, \dots, b]^T$.

4. To find the Bayes rules we find the minimum of those values of $b$, call it $b_0$, for which the plane $\pi^T y = b_0$ intersects the set $S$. Decision rules corresponding to points in this intersection are Bayes with respect to the prior $\pi$.

Figure 5-5: Geometrical interpretation of the risk set. (Axes $R(\theta_1, \delta)$ and $R(\theta_k, \delta)$; the plot shows the risk set $S$, the line $y_1 = y_k$, the prior vector $(\pi_1, \pi_k)$, the lines of equivalent points, and the Bayes point at level $b_0$.)

We may also use the risk set to graphically depict the minimax point. The maximum risk for a fixed rule $\delta$ is given by

$$ \max_i R(\theta_i, \delta). $$

All points $y \in S$ that yield this same value of $\max_i y_i$ are equivalent with respect to the minimax principle. Thus, all points on the boundary of the set

$$ Q_c = \{(y_1, \dots, y_k) : y_i \le c \text{ for } i = 1, \dots, k\} $$

for any real number $c$ are equivalent. To find the minimax rules we find the minimum of those values of $c$, call it $c_0$, such that the set $Q_{c_0}$ intersects $S$. Any decision rule $\delta$ whose associated risk point $[R(\theta_1, \delta), \dots, R(\theta_k, \delta)]^T$ is an element of $Q_{c_0} \cap S$ is a minimax decision rule. Figure 5-6 depicts a minimax rule for $k = 2$.

This figure also depicts the least favorable prior, which is visualized as follows. As we have seen, a strategy for nature is a prior distribution $\pi = [\pi_1, \dots, \pi_k]^T$ which represents the family of planes perpendicular to $\pi$. In using a Bayes rule to minimize the risk, we must find the plane out of this family that is tangent to and below $S$. Because the minimum Bayes risk is $b_0$, where $[b_0, \dots, b_0]^T$ is the intersection of the line $y_1 = \cdots = y_k$ and the plane tangent to and below $S$ and perpendicular to $\pi$, a least favorable prior distribution is the choice of $\pi$ that makes the intersection as far up the line as possible. Thus, under the least favorable prior (lfp), the Bayes risk attains $b_0 = c_0$, and the minimax rule is Bayes with respect to that prior.

Example 5-4 We now can develop solutions to the "odd or even" game we introduced earlier in the course. As you recall, nature and you simultaneously put up either one or two fingers. Nature wins if the sum of the digits showing is odd, and you win if the sum of the digits showing is even. The winner in all cases receives in dollars the sum of the digits showing, this being paid to him by the loser. Before the game is played you are allowed to ask nature how many fingers it intends to put up and nature must answer truthfully with probability 3/4 (hence untruthfully with probability 1/4). You therefore observe a random variable $X$ (the answer nature gives) taking the values of 1 or 2. If $\theta = 1$ is the true

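state of nature, the probability that $X = 1$ is 3/4; that is, $P_{\theta=1}(X = 1) = 3/4$. Similarly, $P_{\theta=2}(X = 1) = 1/4$.

Figure 5-6: Geometrical interpretation of the minimax rule. (Axes $R(\theta_1, \delta)$ and $R(\theta_k, \delta)$; the plot shows the risk set $S$, the line $y_1 = y_k$, the corner sets $Q_c$ and $Q_{c_0}$ with their boundaries of equivalent points, the minimax point, and the direction of the lfp.)

The four nonrandomized decision rules are

$$
\begin{array}{llll}
d_1(1) = 1, & d_2(1) = 1, & d_3(1) = 2, & d_4(1) = 2,\\
d_1(2) = 1; & d_2(2) = 2; & d_3(2) = 1; & d_4(2) = 2.
\end{array}
$$

The risk matrix, given in Figure 5-7, characterizes this statistical game.

$$
\begin{array}{c|cccc}
\theta \backslash d & d_1 & d_2 & d_3 & d_4\\ \hline
1 & -2 & -3/4 & 7/4 & 3\\
2 & 3 & -9/4 & 5/4 & -4
\end{array}
$$

Figure 5-7: Loss Function for Statistical Odd or Even Game

The risk set for this example is given in Figure 5-8, which must contain all of the lines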


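between any two of the points $(-2, 3)$, $(-3/4, -9/4)$, $(7/4, 5/4)$, $(3, -4)$. According to our earlier analysis, the minimax point corresponds to the point indicated in the figure, which is on the line $L$ connecting $(R(1, d_1), R(2, d_1))$ with $(R(1, d_2), R(2, d_2))$. The parametric equation for this line is

$$ y_1 = \tfrac{5}{4} q - 2, \qquad y_2 = -\tfrac{21}{4} q + 3 $$

as $q$ ranges over the interval $[0, 1]$. This line intersects the line $y_1 = y_2$ at $\frac{5}{4} q - 2 = -\frac{21}{4} q + 3$, that is, when $q = \frac{10}{13}$, and the minimax risk is $\frac{5}{4}\cdot\frac{10}{13} - 2 = -\frac{27}{26}$.

Figure 5-8: Risk set for odd or even game. (The risk set $S$ is the convex hull of $(-2, 3)$, $(-\frac{3}{4}, -\frac{9}{4})$, $(\frac{7}{4}, \frac{5}{4})$ and $(3, -4)$; the line $L$ joins $(-2, 3)$ and $(-\frac{3}{4}, -\frac{9}{4})$, the minimax point lies on $L$, and the lfp direction is perpendicular to $L$.)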


We may compute the least favorable prior as follows. Let nature take action $\theta = 1$ with probability $\pi$ and $\theta = 2$ with probability $1 - \pi$. If the vector $[\pi, 1 - \pi]^T$ is perpendicular to $L$, the slope of this vector must be the negative of the reciprocal of the slope of $L$. Since $L$ has slope $(-\frac{21}{4})/(\frac{5}{4}) = -\frac{21}{5}$, we require $\frac{1-\pi}{\pi} = \frac{5}{21}$, or $\pi = \frac{21}{26}$. Thus, if nature chooses to hold up one finger with probability $\frac{21}{26}$ (about 81% of the time), it will maintain your expected loss to at least $-\frac{27}{26}$, and if you apply rule $d_2$ with probability $\frac{10}{13}$ and $d_1$ with probability $\frac{3}{13}$, you will restrict your average loss to no more than $-\frac{27}{26}$. It seems reasonable to call $-\frac{27}{26}$ the value of the game. If a referee were to arbitrate this game, it would seem fair to require nature to pay you $\frac{27}{26}$ dollars in lieu of playing the game.
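The arithmetic of this example can be reproduced with exact fractions. The sketch below is illustrative only: the names `LOSS`, `RULES` and `risk` are hypothetical, and the loss values are an assumption read off from the description of the game (your loss is minus the sum when the sum is even, plus the sum when it is odd). It computes the four risk points, the mixture of $d_1$ and $d_2$ that equalizes the two risks, and the prior that makes $d_1$ and $d_2$ equally good for you.

```python
from fractions import Fraction as F

# Loss to you: LOSS[theta][a], theta = nature's fingers, a = your fingers.
LOSS = {1: {1: F(-2), 2: F(3)}, 2: {1: F(3), 2: F(-4)}}
P_TRUTH = F(3, 4)  # nature answers truthfully with probability 3/4

# Nonrandomized rules d(x): map nature's answer x in {1, 2} to your action.
RULES = {"d1": {1: 1, 2: 1}, "d2": {1: 1, 2: 2},
         "d3": {1: 2, 2: 1}, "d4": {1: 2, 2: 2}}

def risk(theta, rule):
    """R(theta, d) = E[L(theta, d(X))], where X = theta with probability 3/4."""
    other = 3 - theta
    return (P_TRUTH * LOSS[theta][rule[theta]]
            + (1 - P_TRUTH) * LOSS[theta][rule[other]])

points = {name: (risk(1, d), risk(2, d)) for name, d in RULES.items()}

# Mix d2 (weight q) with d1 (weight 1 - q) and equalize the two risks.
(a1, a2), (b1, b2) = points["d1"], points["d2"]
q = (a2 - a1) / ((b1 - a1) - (b2 - a2))
minimax_risk = a1 + q * (b1 - a1)

# Prior p = P(theta = 1) under which d1 and d2 have equal Bayes risk.
p = (a2 - b2) / ((a2 - b2) + (b1 - a1))
bayes_risk_at_p = p * a1 + (1 - p) * a2

print(points, q, minimax_risk, p, bayes_risk_at_p)
```

Because the Bayes risk at this prior equals the minimax risk, the computation also illustrates the equality of the upper and lower values of the game.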

The above example demonstrates a situation in which the best you can do in response to the worst nature can do yields the same expected loss as would be obtained if nature did its worst in response to the best you can do. This result is summarized in the following theorem (which we will not prove here).

Theorem 3 (The Minimax Theorem). If for a given decision problem $(\Theta, D, R)$ with finite $\Theta = \{\theta_1, \dots, \theta_k\}$, the risk set $S$ is bounded below, then

$$ \min_{\delta\in D^*} \max_{\pi} r(\pi, \delta) = \max_{\pi} \min_{\delta\in D^*} r(\pi, \delta), $$

and there exists a least favorable distribution $\pi_0$.

This example demonstrates still another property of Bayes decision theory, which is, essentially, that if we use a Bayes decision rule (that is, a rule that minimizes the Bayes risk), we may restrict ourselves to nonrandomized rules. From our rules describing the construction of the Bayes point for this problem, we see that every point on the line $L$ is a Bayes point with respect to the least favorable prior; consequently the vertices $(-2, 3)$ and $(-\frac{3}{4}, -\frac{9}{4})$ are Bayes points, corresponding to nonrandomized decision rules. Can you construct the set of Bayes points corresponding to every possible prior?

5.10 An Important Class of M-Ary Problems

Suppose there are $M \ge 2$ possible source outputs, each of which corresponds to one of the $M$ hypotheses. We observe the output and are required to decide which source was used to generate it. Put in the light of the radar detection problem we discussed earlier, suppose there are $M$ different target possibilities, and we not only have to detect the presence of a target, but to classify it as well. For example, we may be required to choose between three alternatives: $H_0$: no target present, $H_1$: target is present and hostile, $H_2$: target is present and friendly.

Formally, the parameter space $\Theta$ is of the form $\Theta = \{\theta_0, \theta_1, \dots, \theta_{M-1}\}$. Let $H_0: \theta = \theta_0$, $H_1: \theta = \theta_1$, $\dots$, $H_{M-1}: \theta = \theta_{M-1}$ denote the $M$ hypotheses to test. We will employ the Bayes criterion to address this problem, and assume that $\pi = [\pi_0, \dots, \pi_{M-1}]^T$ is the corresponding a priori probability vector. We will denote the cost of each course of action as $C_{ij}$, where the first subscript $i$ signifies that the $i$-th hypothesis is chosen, and the second subscript $j$ signifies that the $j$-th hypothesis is true. In words, $C_{ij}$ is the cost of choosing $H_i$ when $H_j$ is true.

We observe a random variable $X$ taking values in $\mathcal{X} \subset \mathbb{R}^k$. We wish to generalize the notion of a threshold test that was so useful for the binary case. Our approach will be to compute the posterior conditional expected loss for $X = x$. The natural generalization of the binary case is to partition the observation space into $M$ disjoint regions $S_0, \dots, S_{M-1}$, that is, $\mathcal{X} = S_0 \cup \cdots \cup S_{M-1}$, and to invoke a decision rule of the form

$$ \delta(x) = n \quad \text{if } x \in S_n, \quad n = 0, \dots, M-1. \tag{5-31} $$

The loss function then assumes the form

$$ L[\theta_j, \delta(x)] = \sum_{i=0}^{M-1} C_{ij} I_{S_i}(x). $$

From (5-26), the Bayes risk is

$$
\begin{aligned}
r(\pi, \delta) &= \int_{\mathcal{X}} \left\{ \sum_{j=0}^{M-1} L[\theta_j, \delta(x)]\, f_{\theta|X}(\theta_j \mid x) \right\} f_X(x)\,dx\\
&= \int_{\mathcal{X}} \left\{ \sum_{j=0}^{M-1} \sum_{i=0}^{M-1} C_{ij} I_{S_i}(x)\, f_{\theta|X}(\theta_j \mid x) \right\} f_X(x)\,dx,
\end{aligned}
$$

and we may minimize this quantity by minimizing the quantity in braces for each $x$. It suffices to minimize the posterior conditional expected loss,

$$ r'(\pi, \delta) = \sum_{j=0}^{M-1} \sum_{i=0}^{M-1} C_{ij} I_{S_i}(x)\, f_{\theta|X}(\theta_j \mid x). \tag{5-32} $$


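The problem reduces to determining the sets $S_i$, $i = 0, \dots, M-1$, that result in the minimization of $r'$. From Bayes rule, we have

$$ f_{\theta|X}(\theta_j \mid x) = \frac{f_{X|\theta}(x \mid \theta_j) f_\theta(\theta_j)}{f_X(x)}, $$

which when substituted into (5-32) yields

$$ r'(\pi, \delta) = \sum_{j=0}^{M-1} \sum_{i=0}^{M-1} C_{ij} I_{S_i}(x) \frac{f_{X|\theta}(x \mid \theta_j) f_\theta(\theta_j)}{f_X(x)}. $$

We now make a very important observation: Given $X = x$, we can minimize the posterior conditional expected loss by minimizing

$$ \sum_{j=0}^{M-1} \sum_{i=0}^{M-1} C_{ij} I_{S_i}(x) f_{X|\theta}(x \mid \theta_j) f_\theta(\theta_j), $$

that is, $f_X(x)$ is simply a scale factor for this minimization problem, since $x$ is assumed to be fixed. Since

$$ \sum_{j=0}^{M-1} \sum_{i=0}^{M-1} C_{ij} I_{S_i}(x) f_{X|\theta}(x \mid \theta_j) f_\theta(\theta_j) = \sum_{i=0}^{M-1} I_{S_i}(x) \sum_{j=0}^{M-1} C_{ij} f_{X|\theta}(x \mid \theta_j) f_\theta(\theta_j), $$

we may now ascertain the structure of the sets $S_i$ that result in the Bayes decision rule $\delta(x)$ given by (5-31):

$$ S_k = \left\{ x \in \mathcal{X} : \sum_{j=0}^{M-1} C_{kj} f_{X|\theta}(x \mid \theta_j) f_\theta(\theta_j) \le \sum_{j=0}^{M-1} C_{ij} f_{X|\theta}(x \mid \theta_j) f_\theta(\theta_j) \quad \forall\, i \ne k \right\}. $$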
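The structure of the sets $S_k$ translates directly into a rule: for the observed $x$, compute $\sum_j C_{kj} f_{X|\theta}(x \mid \theta_j) f_\theta(\theta_j)$ for each candidate $k$ and pick the smallest. The following is a minimal sketch under assumptions that are not in the notes: the three unit-variance Gaussian densities, the equal priors, and the function names `bayes_decision` and `gaussian_pdf` are all hypothetical and chosen only for illustration.

```python
import math

def bayes_decision(x, costs, priors, likelihoods):
    """Return the index k minimizing sum_j C[k][j] * f(x | theta_j) * pi_j."""
    weighted = [f(x) * p for f, p in zip(likelihoods, priors)]
    scores = [sum(c * w for c, w in zip(row, weighted)) for row in costs]
    return min(range(len(scores)), key=scores.__getitem__)

def gaussian_pdf(mean, sigma=1.0):
    """Hypothetical class-conditional density N(mean, sigma^2)."""
    def pdf(x):
        z = (x - mean) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    return pdf

# Illustrative setup (not from the notes): three Gaussian hypotheses.
likelihoods = [gaussian_pdf(m) for m in (-2.0, 0.0, 2.0)]
priors = [1 / 3, 1 / 3, 1 / 3]
zero_one = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]  # C_ii = 0, C_ij = 1 for i != j

decisions = [bayes_decision(x, zero_one, priors, likelihoods)
             for x in (-3.0, 0.2, 1.5)]
print(decisions)
```

With equal priors and the zero-one cost, the rule reduces to picking the hypothesis whose mean is nearest to $x$; making one kind of error much more expensive shifts the regions accordingly.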

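The general structure of these decision regions is rather messy to visualize and lengthy to compute, but we can learn almost all there is to know about this problem by simplifying it a bit. We first set

$$ C_{ii} = 0, \qquad C_{ij} = 1, \quad i \ne j. $$

Second, we consider only the case $M = 3$. Then, with $\pi_j = f_\theta(\theta_j)$,

$$
\begin{aligned}
S_0 &= \{x : f_{X|\theta}(x|\theta_1)\pi_1 + f_{X|\theta}(x|\theta_2)\pi_2 \le \min\{f_{X|\theta}(x|\theta_0)\pi_0 + f_{X|\theta}(x|\theta_2)\pi_2,\; f_{X|\theta}(x|\theta_0)\pi_0 + f_{X|\theta}(x|\theta_1)\pi_1\}\},\\
S_1 &= \{x : f_{X|\theta}(x|\theta_0)\pi_0 + f_{X|\theta}(x|\theta_2)\pi_2 \le \min\{f_{X|\theta}(x|\theta_0)\pi_0 + f_{X|\theta}(x|\theta_1)\pi_1,\; f_{X|\theta}(x|\theta_1)\pi_1 + f_{X|\theta}(x|\theta_2)\pi_2\}\},\\
S_2 &= \{x : f_{X|\theta}(x|\theta_0)\pi_0 + f_{X|\theta}(x|\theta_1)\pi_1 \le \min\{f_{X|\theta}(x|\theta_0)\pi_0 + f_{X|\theta}(x|\theta_2)\pi_2,\; f_{X|\theta}(x|\theta_1)\pi_1 + f_{X|\theta}(x|\theta_2)\pi_2\}\}.
\end{aligned}
$$

We find it convenient to define the two likelihood ratios

$$ \ell_1(x) = \frac{f_{X|\theta}(x|\theta_1)}{f_{X|\theta}(x|\theta_0)}, \qquad \ell_2(x) = \frac{f_{X|\theta}(x|\theta_2)}{f_{X|\theta}(x|\theta_0)}. $$

Then

$$ S_0 = \{x : \ell_1(x)\pi_1 + \ell_2(x)\pi_2 \le \min\{\pi_0 + \ell_2(x)\pi_2,\; \pi_0 + \ell_1(x)\pi_1\}\} \tag{5-33} $$

$$ S_1 = \{x : \pi_0 + \ell_2(x)\pi_2 \le \min\{\pi_0 + \ell_1(x)\pi_1,\; \ell_1(x)\pi_1 + \ell_2(x)\pi_2\}\} \tag{5-34} $$

$$ S_2 = \{x : \pi_0 + \ell_1(x)\pi_1 \le \min\{\pi_0 + \ell_2(x)\pi_2,\; \ell_1(x)\pi_1 + \ell_2(x)\pi_2\}\}. \tag{5-35} $$

Geometrically, this decision function corresponds to three lines in the $(\ell_1, \ell_2)$ plane. To see this, observe that (5-33), (5-34) and (5-35) may be expressed as

$$
\begin{aligned}
S_0 &= \left\{x : \ell_1(x) \le \frac{\pi_0}{\pi_1} \;\&\; \ell_2(x) \le \frac{\pi_0}{\pi_2}\right\},\\
S_1 &= \left\{x : \ell_1(x) \ge \frac{\pi_0}{\pi_1} \;\&\; \ell_2(x) \le \frac{\pi_1}{\pi_2}\,\ell_1(x)\right\},\\
S_2 &= \left\{x : \ell_2(x) \ge \frac{\pi_0}{\pi_2} \;\&\; \ell_2(x) \ge \frac{\pi_1}{\pi_2}\,\ell_1(x)\right\}.
\end{aligned}
$$

Figure 5-9 illustrates these regions in the $(\ell_1, \ell_2)$ plane. These decision regions may be interpreted as follows: Sample $X = x$ and evaluate the likelihood ratios $\ell_1(x)$ and $\ell_2(x)$. Determine in which of the three possible regions the point $(\ell_1(x), \ell_2(x))$ lies, and render the decision according to the rule

$$
\delta(x) =
\begin{cases}
0 & \text{if } (\ell_1(x), \ell_2(x)) \in H_0,\\
1 & \text{if } (\ell_1(x), \ell_2(x)) \in H_1,\\
2 & \text{if } (\ell_1(x), \ell_2(x)) \in H_2.
\end{cases}
$$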
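The three-line description of the decision regions can be compared against a direct comparison of $\pi_0$, $\ell_1(x)\pi_1$ and $\ell_2(x)\pi_2$, which is what the zero-one cost reduces to. The sketch below is only a consistency check: the function names and the prior values are hypothetical, and the grid of test points is chosen so that no point falls exactly on a boundary.

```python
from fractions import Fraction

def decide_from_ratios(l1, l2, priors):
    """Classify the point (l1, l2) using the three boundary lines for M = 3."""
    p0, p1, p2 = priors
    if l1 <= p0 / p1 and l2 <= p0 / p2:
        return 0
    if l1 >= p0 / p1 and l2 <= (p1 / p2) * l1:
        return 1
    return 2

def decide_map(l1, l2, priors):
    """Reference rule: pick the largest of pi_0, l1 * pi_1, l2 * pi_2."""
    p0, p1, p2 = priors
    scores = [p0, l1 * p1, l2 * p2]
    return scores.index(max(scores))

# Illustrative priors (an assumption, not taken from the notes).
priors = (Fraction(1, 2), Fraction(3, 10), Fraction(1, 5))
grid = [Fraction(13, 100) + Fraction(37, 100) * k for k in range(12)]
agree = all(decide_from_ratios(a, b, priors) == decide_map(a, b, priors)
            for a in grid for b in grid)
print(agree)
```

Exact fractions are used so that the agreement check is not affected by floating-point rounding near the boundary lines.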

Figure 5-9: Decision space for M = 3. (Axes $\ell_1(x)$ and $\ell_2(x)$; region $H_0$ is the rectangle bounded by $\ell_1 = \pi_0/\pi_1$ and $\ell_2 = \pi_0/\pi_2$, and regions $H_1$ and $H_2$ are separated by the line of slope $\pi_1/\pi_2$.)

Exercise 5-1 Consider two boxes A and B, each of which contains both red balls and green balls. It is known that, in one of the boxes, $\frac{1}{2}$ of the balls are red and $\frac{1}{2}$ are green, and that, in the other box, $\frac{1}{4}$ of the balls are red and $\frac{3}{4}$ are green. Let the box in which $\frac{1}{2}$ are red be denoted box $W$, and suppose $P(W = A) = \xi$ and $P(W = B) = 1 - \xi$. Suppose you may select one ball at random from either box A or box B and that, after observing its color, must decide whether $W = A$ or $W = B$. Prove that if $\frac{1}{2} < \xi < \frac{2}{3}$, then in order to maximize the probability of making a correct decision, you should select the ball from box B. Prove also that if $\frac{2}{3} \le \xi \le 1$, then it does not matter from which box the ball is selected.

Exercise 5-2 A wildcat oilman must decide how to finance the drilling of a well. It costs $100,000 to drill the well. The oilman has available three options: $H_0$: finance the drilling himself and retain all the profits; $H_1$: accept $70,000 from investors in return for paying them 50% of the oil profits; $H_2$: accept $120,000 from investors in return for paying them 90% of the oil profits. The oil profits will be \$3$\theta$, where $\theta$ is the number of barrels of oil in the well.

From past data, it is believed that $\theta = 0$ with probability 0.9, and the density for $\theta > 0$ is

$$ g(\theta) = \frac{0.1}{300{,}000}\, e^{-\theta/300{,}000}\, I_{(0,\infty)}(\theta). $$

A seismic test is performed to determine the likelihood of oil in the given area. The test tells which type of geological structure, $x_1$, $x_2$, or $x_3$, is present. It is known that the probabilities of the $x_i$ given $\theta$ are

$$
\begin{aligned}
f_{X|\theta}(x_1|\theta) &= 0.8\,e^{-\theta/100{,}000},\\
f_{X|\theta}(x_2|\theta) &= 0.2,\\
f_{X|\theta}(x_3|\theta) &= 0.8\,(1 - e^{-\theta/100{,}000}).
\end{aligned}
$$

For monetary loss, what is the Bayes action if $X = x_1$ is observed? For monetary loss, what is the Bayes action if $X = x_2$ is observed? For monetary loss, what is the Bayes action if $X = x_3$ is observed?

Exercise 5-3 A device has been created which can supposedly classify blood as type A, B, AB, or O. The device measures a quantity $X$, which has density

$$ f_{X|\theta}(x|\theta) = e^{-(x-\theta)} I_{(\theta,\infty)}(x). $$

If $0 < \theta < 1$, the blood is of type AB; if $1 < \theta < 2$, the blood is of type A; if $2 < \theta < 3$, the blood is of type B; and if $\theta > 3$, the blood is of type O. In the population as a whole, $\theta$ is distributed according to the density

$$ f_\theta(\theta) = e^{-\theta} I_{(0,\infty)}(\theta). $$

The loss in misclassifying the blood is given by the following table (rows: true type; columns: classification).

$$
\begin{array}{c|cccc}
\text{True type} & \text{AB} & \text{A} & \text{B} & \text{O}\\ \hline
\text{AB} & 0 & 1 & 1 & 2\\
\text{A} & 1 & 0 & 2 & 2\\
\text{B} & 1 & 2 & 0 & 2\\
\text{O} & 3 & 3 & 3 & 0
\end{array}
$$

If $X = 4$ is observed, what is the Bayes action?

6 Maximum Likelihood Estimation

As we have stated earlier, estimation is the process of making decisions over a continuum of parameters. The same dichotomy exists here as with the detection problem, however, since we may view the unknown parameter as either an unknown, but deterministic quantity, or as a random variable. Consequently, there are multiple schools of thought regarding estimation. In this section, we present the classical approach, based upon the principle of maximum likelihood [1, 6, 16]. In a subsequent section we present an approach based upon Bayesian assumptions.

6.1 The Maximum Likelihood Principle

The essential feature of the principle of maximum likelihood as it applies to estimation theory is that it requires one to choose, as an estimate of a parameter, that value for which the probability of obtaining the sample actually observed is as large as possible. That is, having obtained observations, one "looks back" and computes the probability, from the point of view of one about to perform the experiment, that the given sample values will be observed. This probability will in general depend on the parameter, which is then given that value for which this probability is maximized$^{10}$.

Suppose that the random variable $X$ has a probability distribution which depends on a parameter $\theta$. Let $f_X(x \mid \theta)$ denote, say, a pmf (it could be a pdf; we don't really care for now). We suppose that the form of $f_X$ is known, but not the value of $\theta$. The joint pmf of $m$ sample random variables evaluated at the sample points $x_1, \dots, x_m$ is

$$ l(\theta, x_1, \dots, x_m) = f_{X_1 \cdots X_m}(x_1, \dots, x_m \mid \theta) = \prod_{i=1}^{m} f_X(x_i \mid \theta). \tag{6-1} $$

This function is also known as the likelihood function of the sample; we are particularly interested in it as a function of $\theta$ when the sample values $x_1, \dots, x_m$ are fixed. The principle of maximum likelihood requires us to choose as an estimate of the unknown parameter that value of $\theta$ for which the likelihood function assumes its largest value.

$^{10}$ This is reminiscent of the story about the crafty politician who, once he observes which way the crowd is going, hurries to the front of the group as if to lead the parade.

If the parameter is a vector, say $\theta = [\theta_1, \dots, \theta_k]^T$, then the likelihood function will be a function of all of the components of $\theta$. Thus, we are free to regard $\theta$ as a vector in (6-1), and the maximum likelihood estimate of $\theta$ is then the vector of numbers which render the likelihood function a maximum.

Example 6-1 (A Maximum Likelihood Detector). Suppose you are given a coin and told that it is biased, with one side four times as likely to turn up as the other; you are allowed three tosses and must then guess whether it is biased in favor of heads or in favor of tails. Let $\theta$ be the probability of heads ($H$, with $T$ corresponding to tails) on a single toss. Define the random variable $X : \{H, T\} \to \{0, 1\}$ by $X(H) = 1$ and $X(T) = 0$. The pmf for $X$ is given by

$$
\begin{aligned}
f_X(0 \mid 4/5) &= 1/5, & f_X(1 \mid 4/5) &= 4/5;\\
f_X(0 \mid 1/5) &= 4/5, & f_X(1 \mid 1/5) &= 1/5.
\end{aligned}
$$

Suppose you throw the coin three times, resulting in the samples $HTH$. The sample values are $x_1 = 1$, $x_2 = 0$, $x_3 = 1$. The likelihood function is

$$ l(\theta, x_1, x_2, x_3) = f_{X_1 X_2 X_3}(x_1, x_2, x_3 \mid \theta) = f_{X_1 X_2 X_3}(1, 0, 1 \mid \theta) = f_{X_1}(1 \mid \theta) f_{X_2}(0 \mid \theta) f_{X_3}(1 \mid \theta), $$

or

$$
\begin{aligned}
l(4/5, 1, 0, 1) &= (4/5)(1/5)(4/5) = 16/125,\\
l(1/5, 1, 0, 1) &= (1/5)(4/5)(1/5) = 4/125.
\end{aligned}
$$

Clearly, $\theta = 4/5$ yields the larger value of the likelihood function, so by the maximum likelihood principle we are compelled to decide that the coin is biased in favor of heads.

Although, as this example demonstrates, the principle of maximum likelihood may be applied to discrete decision problems, it has found greater utility for problems where the distribution is continuous and differentiable in $\theta$. The reason for this is that we will usually be taking derivatives in order to find maxima. But it is important to remember that general decision problems can, in principle, be addressed via the principle of maximum likelihood. Notice, for this example, that neither cost functions nor a priori knowledge of the distribution of the parameters is needed to fashion a maximum likelihood estimate.
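The coin example amounts to evaluating the likelihood (6-1) at each admissible parameter value and keeping the larger. A minimal sketch follows; the function name `likelihood` and the variable names are hypothetical, and exact fractions are used so the values can be compared with those in the example.

```python
from fractions import Fraction as F

def likelihood(theta, samples):
    """l(theta, x_1..x_m) for independent tosses with P(heads) = theta."""
    value = F(1)
    for x in samples:
        value *= theta if x == 1 else 1 - theta
    return value

samples = [1, 0, 1]                  # the observed sequence H T H
candidates = [F(4, 5), F(1, 5)]      # the two admissible biases
theta_ml = max(candidates, key=lambda t: likelihood(t, samples))
print(theta_ml, [likelihood(t, samples) for t in candidates])
```

Only the likelihood is needed here; no cost function or prior enters the computation.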

Example 6-2 (Empiric Distributions). Let $X$ be a random variable of unknown distribution, and suppose that $X_1, \dots, X_m$ are sample random variables from the population of $X$. Suppose we are required to estimate the distribution function of $X$. There are many ways to approach this problem. One way would be to assume some general structure, such as an exponential family, and try to estimate the parameters of this family. But then one has the simultaneous problems of (a) estimating the parameters and (b) justifying the structure. Although there are many ways of doing both of these problems, it is not easy. The maximum likelihood method gives us a fairly simple approach that, if for no other reason, would be valuable as a baseline for evaluating other, more sophisticated approaches.

To apply the principle of maximum likelihood to this problem, we must first define the parameters. We do this by setting

$$ \theta_i = P[X_i = x_i], \quad i = 1, \dots, m. $$

The event $[X_1 = x_1, \dots, X_m = x_m]$ is observed, and, according to the maximum likelihood principle, we wish to choose the values of $\theta_i$ that maximize the probability that this event will occur. Since the events $[X_i = x_i]$, $i = 1, \dots, m$ are independent, we have

$$ P[X_1 = x_1, \dots, X_m = x_m] = \prod_{i=1}^{m} P[X_i = x_i] = \prod_{i=1}^{m} \theta_i, $$

which we wish to maximize subject to the constraint $\sum_{i=1}^{m} \theta_i = 1$. The standard way to extremize a function subject to constraints is to formulate it as a Lagrange multiplier problem. Let

$$ J = \prod_{i=1}^{m} \theta_i + \lambda \left( \sum_{i=1}^{m} \theta_i - 1 \right), $$

and set the gradient of $J$ with respect to $\theta_i$, $i = 1, \dots, m$ and with respect to $\lambda$ to zero:

$$
\begin{aligned}
\frac{\partial J}{\partial \theta_j} &= \prod_{i \ne j} \theta_i + \lambda = 0, \quad j = 1, \dots, m,\\
\frac{\partial J}{\partial \lambda} &= \sum_{i=1}^{m} \theta_i - 1 = 0.
\end{aligned}
$$

But the only way all of the products $\prod_{i \ne j} \theta_i$ can be equal is if $\theta_1 = \cdots = \theta_m$, and the constraint therefore requires that $\theta_i = 1/m$, $i = 1, \dots, m$.

We define the maximum likelihood estimate for the distribution as follows. Let $\hat{X}$ be a random variable, called the empiric random variable, whose distribution function is

$$ F_{\hat{X}}(x) = P[\hat{X} \le x] = \frac{1}{m} \sum_{i=1}^{m} I_{[x_i, \infty)}(x). $$

Figure 6-1 illustrates the structure of the empiric distribution function.

Figure 6-1: Empiric Distribution Function. (A staircase $F_{\hat{X}}(x)$ that rises by $1/m$ at each of the sample values and reaches $m/m$ at the largest one.)

For large samples, it is convenient to quantize the observations and construct the empiric density function by building a histogram. Thus, the empiric distribution is precisely that distribution for which the influence of the sample values actually observed is maximized at the expense of other possible values of $X$. Of course, the actual utility of this distribution is limited since the number of parameters may be very large. But it is a maximum likelihood estimate of the distribution function.
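The empiric distribution function is simple to compute: it is the fraction of sample values that do not exceed $x$. The sketch below is illustrative; the name `empiric_cdf` and the five sample values are hypothetical.

```python
def empiric_cdf(samples):
    """Return F(x) = (1/m) * #{i : x_i <= x}, the empiric distribution function."""
    data = sorted(samples)
    m = len(data)

    def cdf(x):
        # Each sample contributes a step of height 1/m at its own value.
        return sum(1 for xi in data if xi <= x) / m

    return cdf

F_hat = empiric_cdf([2.0, 1.0, 4.0, 5.0, 3.0])
print([F_hat(x) for x in (0.5, 1.0, 3.5, 10.0)])
```

Each sample value carries probability mass $1/m$, which is exactly the maximum likelihood assignment derived above.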


6.2 Maximum Likelihood for Continuous Distributions

Suppose now that the random variable $X$ is continuous and has a probability density function $f_X(x \mid \theta)$ which depends on the parameter $\theta$ ($\theta$ may be a vector). The joint probability density function of the sample random variables, evaluated at the sample points $x_1, \dots, x_m$, is given by

$$ l(\theta, x_1, \dots, x_m) = f_{X_1 \cdots X_m}(x_1, \dots, x_m \mid \theta) = \prod_{i=1}^{m} f_X(x_i \mid \theta). $$

For small $dx_1, \dots, dx_m$, the $(m+1)$-dimensional volume $f_{X_1 \cdots X_m}(x_1, \dots, x_m \mid \theta)\,dx_1 \cdots dx_m$ represents, approximately, the probability that a sample will be chosen for which the sample points lie within an $m$-dimensional rectangle at $x_1, \dots, x_m$, with sides $dx_1, \dots, dx_m$. Conceptually, we can consider calculating this volume, for fixed $x_i$ and $dx_i$, as $\theta$ is varied over its range of permissible values. According to the maximum likelihood principle, we take, as the maximum likelihood estimate of $\theta$, that value $\hat{\theta}_{ML}$ that maximizes the volume, the idea being that, if that were the actual value of $\theta$ that nature used, it would correspond to the distribution that yields the largest probability of producing samples near the observed values $x_1, \dots, x_m$. Since the rectangle is fixed, the volume, and hence the probability, is maximized by maximizing the likelihood function $l(\theta, x_1, \dots, x_m)$.

It must be stressed that the likelihood function $l(\theta, x)$ is to be viewed as a function of $\theta$, with $x$ being a fixed quantity, rather than a variable. This is in contradistinction to the way we view the density function $f_X(x \mid \theta)$, where $\theta$ is a fixed quantity and $x$ is viewed as a variable. So remember, even though we may write $l(\theta, x) = f_X(x \mid \theta)$, we view the roles of $x$ and $\theta$ in the two expressions entirely differently.

It is actually more convenient, for many applications, to consider the logarithm of the likelihood function, which we denote $L(\theta, x) = \log f_X(x \mid \theta)$, and call the log-likelihood function. Since the logarithm is a monotonic function, the maximization of the likelihood and log-likelihood functions is equivalent, that is, $\hat{\theta}_{ML}$ maximizes the likelihood function if and only if it also maximizes the log-likelihood function. Thus, in this development we will deal mainly with the log-likelihood function.

If the log-likelihood function is differentiable in $\theta$, a necessary but not sufficient condition for $\theta$ to be a maximum of the log-likelihood function is for the gradient of the log-likelihood function to vanish at that value of $\theta$, that is, we require

$$ \frac{\partial}{\partial \theta} L(\theta, x) = \frac{\partial}{\partial \theta} \log f_X(x \mid \theta) = 0. $$

The major issue before us is to find a way to maximize the likelihood function. If the maximum is interior to the range of $\theta$, and $L(\theta, x)$ has a continuous first derivative, then a necessary condition for $\hat{\theta}_{ML}$ to be the maximum likelihood estimate for $\theta$ is that

$$ \left. \frac{\partial}{\partial \theta} L(\theta, x) \right|_{\theta = \hat{\theta}_{ML}} = 0. \tag{6-2} $$

This equation is called the likelihood equation. We now give some examples to illustrate the maximization process.

Example 6-3 Let $X_1, \dots, X_m$ denote a random sample of size $m$ from a uniform distribution over $[0, \theta]$. We wish to find the maximum likelihood estimate of $\theta$. The likelihood function is

$$
\begin{aligned}
l(\theta, x_1, \dots, x_m) &= \prod_{i=1}^{m} \frac{1}{\theta}\, I_{(0,\theta)}(x_i)\\
&= \theta^{-m}\, I_{(0,\max_i\{x_i\})}(\min_i x_i)\, I_{(\min_i\{x_i\},\theta)}(\max_i\{x_i\})\\
&= \theta^{-m}\, I_{(\min_i\{x_i\},\theta)}(\max_i\{x_i\})\\
&= \theta^{-m}\, I_{(\max_i\{x_i\},\infty)}(\theta),
\end{aligned}
$$

where the first indicator in the second line equals one for any sample actually drawn from this distribution. Since the maximum of this quantity does not occur on the interior of the range of $\theta$, we can't take derivatives and set to zero. But we don't need to do that for this example, since $\theta^{-m}$ is monotonically decreasing in $\theta$. Consequently, the likelihood function is maximized at

$$ \hat{\theta}_{ML} = \max_i \{x_i\}. $$
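The boundary maximum can be seen by evaluating the likelihood on either side of the largest sample. This is a minimal sketch with hypothetical names and sample values; as an assumption made only so that the maximum is attained, the support is treated as closed on the right, $(0, \theta]$.

```python
def uniform_likelihood(theta, samples):
    """l(theta) = theta**(-m) if every sample lies in (0, theta], else 0."""
    if theta <= 0 or any(x <= 0 or x > theta for x in samples):
        return 0.0
    return theta ** (-len(samples))

samples = [0.8, 2.7, 1.9, 0.3]       # illustrative data
theta_ml = max(samples)              # the estimate from Example 6-3

# The likelihood vanishes below the largest sample and decays above it.
print(uniform_likelihood(2.6, samples),
      uniform_likelihood(theta_ml, samples),
      uniform_likelihood(3.0, samples))
```

No derivative is involved; the estimate sits at the edge of the set of parameter values that are compatible with the data.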


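Example 6-4 Let $X_1, \dots, X_m$ denote a random sample of size $m$ from the normal distribution $N(\mu, \sigma^2)$. We wish to find the maximum likelihood estimates for $\mu$ and $\sigma^2$. The density function is

$$ f_{X_1, \dots, X_m}(x_1, \dots, x_m \mid \mu, \sigma) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{(x_i - \mu)^2}{2\sigma^2} \right], $$

and the log-likelihood function is then

$$ L(\mu, \sigma, x_1, \dots, x_m) = -\frac{m}{2} \log 2\pi - m \log \sigma - \frac{1}{2\sigma^2} \sum_{i=1}^{m} (x_i - \mu)^2. $$

Taking the gradient and equating to zero yields

$$ \frac{\partial L}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{m} (x_i - \mu) = 0 \quad\Longrightarrow\quad \hat{\mu}_{ML} = \frac{1}{m} \sum_{i=1}^{m} x_i, $$

and

$$ \frac{\partial L}{\partial \sigma} = -\frac{m}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^{m} (x_i - \mu)^2 = 0 \quad\Longrightarrow\quad \hat{\sigma}^2_{ML} = \frac{1}{m} \sum_{i=1}^{m} (x_i - \hat{\mu}_{ML})^2. $$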
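The two closed-form estimates can be checked against the log-likelihood itself. The sketch below uses hypothetical function names and a small illustrative sample; it computes the sample mean and the $1/m$ (not $1/(m-1)$) sample variance, then confirms that nearby parameter values give a smaller log-likelihood.

```python
import math

def normal_mle(samples):
    """Return (mu_hat, sigma2_hat), the maximum likelihood estimates."""
    m = len(samples)
    mu = sum(samples) / m
    sigma2 = sum((x - mu) ** 2 for x in samples) / m   # divides by m, not m - 1
    return mu, sigma2

def log_likelihood(mu, sigma, samples):
    """L(mu, sigma, x) for an i.i.d. N(mu, sigma^2) sample."""
    m = len(samples)
    ss = sum((x - mu) ** 2 for x in samples)
    return -0.5 * m * math.log(2 * math.pi) - m * math.log(sigma) - ss / (2 * sigma ** 2)

samples = [1.0, 2.0, 3.0, 4.0]       # illustrative data
mu_hat, sigma2_hat = normal_mle(samples)
best = log_likelihood(mu_hat, math.sqrt(sigma2_hat), samples)
print(mu_hat, sigma2_hat, best)
```

Perturbing either argument of `log_likelihood` away from the estimates lowers its value, consistent with the likelihood equation being satisfied at an interior maximum.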

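Before we get too euphoric over the simplicity and seemingly magical powers of the maximum likelihood approach, consider the following example.

Example 6-5 Let $X_1 \sim N(\theta, 1)$ and $X_2 \sim N(-\theta, 1)$ and define

$$
Y = \begin{cases} X_1 & \text{with probability } 1/2,\\ X_2 & \text{with probability } 1/2; \end{cases}
$$

then

$$ f_Y(y \mid \theta) = \frac{1}{2} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(y - \theta)^2} + \frac{1}{2} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(y + \theta)^2}. $$

Now let $Y = y$ be a given sample value. According to our procedure, we would evaluate the likelihood function at $y$, yielding

$$ l(\theta, y) = \frac{1}{2} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(y - \theta)^2} + \frac{1}{2} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(y + \theta)^2}, $$

and choose, as the maximum likelihood estimate of $\theta$, that value that maximizes $l(\theta, y)$. But this function is symmetric in $\theta$, $l(\theta, y) = l(-\theta, y)$, so it does not in general have a unique maximum, and there is not a unique estimate. Setting the derivative of $\log l$ to zero gives $\theta = y \tanh(y\theta)$, which for $|y| > 1$ has a pair of nonzero solutions $\pm\theta^*$; for large $|y|$ these are close to $\pm y$. Both $\hat{\theta}_{ML} = \theta^*$ and $\hat{\theta}_{ML} = -\theta^*$ qualify as maximum likelihood estimates for $\theta$.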
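The non-uniqueness can be observed numerically by searching a symmetric grid of candidate values of $\theta$. This is a rough sketch, not a proper optimizer: the function names, the grid spacing of 0.005, the search range and the tolerance are all arbitrary assumptions.

```python
import math

def mixture_likelihood(theta, y):
    """l(theta, y) for the equal-weight mixture of N(theta, 1) and N(-theta, 1)."""
    c = 0.5 / math.sqrt(2 * math.pi)
    return c * (math.exp(-0.5 * (y - theta) ** 2) + math.exp(-0.5 * (y + theta) ** 2))

def grid_maximizers(y, half_width=1000, step=0.005, tol=1e-12):
    """Return every grid point whose likelihood is within tol of the grid maximum."""
    grid = [k * step for k in range(-half_width, half_width + 1)]
    values = [mixture_likelihood(t, y) for t in grid]
    best = max(values)
    return [t for t, v in zip(grid, values) if best - v <= tol]

print(grid_maximizers(2.0))   # a symmetric pair of maximizers
print(grid_maximizers(0.5))   # a single maximizer at the origin
```

For an observation well away from zero the search returns two maximizers of opposite sign, which is the ambiguity described in the example; for a small observation the two bumps merge and the grid maximum sits at zero.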

6.3 Comments on Estimation Quality

In the immortal words of A. Jazwinski, "An estimate is meaningless unless one knows how good it is" [8, Page 150]. Thus, estimation theorists are sometimes consumed, not only with devising and understanding various algorithms for estimation, but with evaluations of how reliable they are. We usually ask the question in the superlative: "What is the best estimate?"

We might be tempted to answer that the best estimate is the one closest to the true value of the parameter to be estimated. But every estimate is a function of the sample values, and thus is the observed value of some random variable. There is no means of predicting just what the individual values are to be for any given experiment, so the goodness of an estimate cannot be judged reliably from individual values. As we repeatedly sample the population, however, we may form statistics, such as the sample mean and variance, whose distributions we may calculate. If we are able to form estimators from these statistics, then the best we can hope for is that the bulk of the mass in the distribution is concentrated in some small neighborhood of the true value. In such circumstances, there is a high probability that the estimate will only differ from the true value by a small amount. From this point of view, we may order the quality of estimators as a function of how the sample distribution is concentrated about the true value.

If the distribution is such that the mathematical expectation of the estimate is exactly the true value, then the estimator is, of course, unbiased. In general, we would prefer unbiased estimates, and will restrict our attention primarily to such estimators in the sequel. One measure of the dispersion of a distribution is its variance (or covariance, in the multidimensional case). Most estimation techniques use this measure exclusively as a means of evaluating the quality of the estimate. This choice is motivated strongly by the important case when the sampling distributions of the estimates are at least approximately normal, since the second-order moment is then the unique measure of dispersion.


Based upon the above arguments, we should feel justified in focusing primarily on the variance of the estimation error as the measure of dispersion and, hence, of goodness. But I want to sensitize you to the fact that this is a somewhat arbitrary, albeit very reasonable, measure of goodness. Later in this course I hope to revisit these issues in a little more depth, and build a case for measures other than dispersion as valid measures of quality. For now, we will follow the conventional development and take the measure of quality to be a measure of dispersion, that is, the variance of the estimation error.

6.4 The Cramér-Rao Bound

The maximum likelihood method of estimation does not provide, as a byproduct of calculating the estimate, any measure of the concentration (that is, the variance) of the estimation error. Although the variance can be calculated for many important examples, it is difficult for others. Rather than approach the problem of calculating the variance for an estimate directly, therefore, we will first calculate a lower bound for the variance of the estimation error for any unbiased estimator, then we will see how the variance of the maximum likelihood estimation error compares with this lower bound.

Before stating the main result of this section, we need to establish some new notation and terminology and prove some preliminary results. A good modern reference for this material is [11], from whom the following development is borrowed. I think it is a better development, since it proves the main results for the vector case in a very nice way. You may contrast this development with the more conventional proofs given in [16].

The Score Function and Fisher Information

Definition. Let $X = [X_1, \ldots, X_n]^T$ denote an $n$-dimensional random vector, and $\theta = [\theta_1, \ldots, \theta_p]^T$ denote a $p$-dimensional parameter vector. The score function $s(\theta, X)$ of a likelihood function $l(\theta, X)$ is defined as
\[ s^T(\theta, X) = \frac{\partial}{\partial\theta} L(\theta, X) = \frac{1}{l(\theta, X)} \frac{\partial}{\partial\theta} l(\theta, X), \]
where $L(\theta, X) = \log l(\theta, X)$ is the log-likelihood function.

More on notation: since the likelihood and log-likelihood functions are scalars and $\theta$ is a $p$-dimensional vector, $s$ is the $p$-dimensional column matrix
\[ s(\theta, X) = \left[ \frac{\partial}{\partial\theta_1} L(\theta, X), \ldots, \frac{\partial}{\partial\theta_p} L(\theta, X) \right]^T. \]

Before continuing, we prove some useful facts about the score function. We begin with the following theorem.

Theorem 1 If $s(\theta, X)$ is the score of a likelihood function $l(\theta, X)$ and if $t$ is any vector-valued function of $X$ and $\theta$, then (under certain regularity conditions; see footnote 11)
\[ E\, s(\theta, X) t^T(\theta, X) = \frac{\partial}{\partial\theta} E\, t^T(\theta, X) - E \frac{\partial}{\partial\theta} t^T(\theta, X). \tag{6-3} \]
Here $\frac{\partial}{\partial\theta} t^T$ denotes the matrix whose $(i,j)$ element is $\partial t_j / \partial\theta_i$.

Proof. We have
\[ E\, t^T(\theta, X) = \int t^T(\theta, x) f_X(x \mid \theta)\,dx = \int t^T(\theta, x)\, l(\theta, x)\,dx. \]
Upon differentiating both sides with respect to $\theta$ and taking the differentiation under the integral sign on the right-hand side (this is where the regularity conditions come into play), we obtain
\[ \frac{\partial}{\partial\theta} E\, t^T(\theta, X) = \int \left[ \frac{\partial}{\partial\theta} \log l(\theta, x) \right]^T t^T(\theta, x)\, l(\theta, x)\,dx + \int \left[ \frac{\partial}{\partial\theta} t^T(\theta, x) \right] l(\theta, x)\,dx. \tag{6-4} \]
The result follows on simplifying and rearranging this expression. □

We may quickly obtain three useful corollaries of this theorem.

Corollary. If $s(\theta, X)$ is the score corresponding to a regular likelihood function $l(\theta, X)$, then
\[ E\, s(\theta, X) = 0. \tag{6-5} \]

Proof. Choose $t$ as any constant vector. Then, since $t$ is not a function of $\theta$, its derivative vanishes, so by (6-3), $E\, s(\theta, X) t^T = E[s(\theta, X)]\, t^T = 0$, which can happen for arbitrary $t$ only if $E[s(\theta, X)] = 0$. □

Footnote 11: This is a nice way of saying that we will assume whatever additional assumptions may be required to accomplish all of the steps outlined in the proof. This isn't too bad of a cop-out, since the regularity conditions turn out to be quite mild.
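The zero-mean property (6-5) and the identity (6-6) for unbiased estimators can be checked numerically. The following is a minimal Monte Carlo sketch, not part of the original development: it assumes i.i.d. scalar samples from $N(\theta, \sigma^2)$ with known $\sigma$, for which the score is $\sum_i (x_i - \theta)/\sigma^2$, and uses the sample mean as the unbiased estimator. The function names and parameter values are illustrative choices.

```python
import random

def score_gaussian_mean(xs, theta, sigma):
    """Score for i.i.d. N(theta, sigma^2) data: d/dtheta of the log-likelihood."""
    return sum(x - theta for x in xs) / sigma**2

def monte_carlo_score_moments(theta=1.5, sigma=2.0, n=5, trials=100_000, seed=0):
    """Estimate E[s], E[s*t] and E[s^2], where t is the sample mean."""
    rng = random.Random(seed)
    sum_s = sum_st = sum_ss = 0.0
    for _ in range(trials):
        xs = [rng.gauss(theta, sigma) for _ in range(n)]
        s = score_gaussian_mean(xs, theta, sigma)
        t = sum(xs) / n  # sample mean: an unbiased estimator of theta
        sum_s += s
        sum_st += s * t
        sum_ss += s * s
    return sum_s / trials, sum_st / trials, sum_ss / trials

mean_s, mean_st, mean_ss = monte_carlo_score_moments()
print("E[s]   ~", mean_s)    # should be near 0, as in (6-5)
print("E[s t] ~", mean_st)   # should be near 1, the scalar case of (6-6)
print("E[s^2] ~", mean_ss)   # should be near n / sigma^2 = 1.25
```

The last quantity is the variance of the score, which for this assumed model equals $n/\sigma^2$.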

Corollary. If $s(\theta, X)$ is the score corresponding to a regular likelihood function $l(\theta, X)$ and if $t(X)$ is any unbiased estimator of $\theta$, then
\[ E[s(\theta, X) t^T(X)] = I. \tag{6-6} \]

Proof. Since the estimate is unbiased, we have $E\, t(X) = \theta$, and since $t$ is not a function of $\theta$, we have $\frac{\partial t^T}{\partial\theta} = 0$; thus by (6-3)
\[ E[s(\theta, X) t^T(X)] = \frac{\partial \theta^T}{\partial\theta} = I. \]
□

The Cramér-Rao Lower Bound

Definition. The covariance matrix of the score function is the Fisher information matrix, denoted $J(\theta)$. Since by (6-5) the score function is zero-mean, we have
\[ J(\theta) = E\, s(\theta, X) s^T(\theta, X). \tag{6-7} \]

Theorem 2 (Cramér-Rao). If $t(X)$ is any unbiased estimator of $\theta$ based on a regular likelihood function, then
\[ E[t(X) - \theta][t(X) - \theta]^T \ge J^{-1}(\theta), \tag{6-8} \]
where $J(\theta)$ is the Fisher information matrix.

Proof. For brevity, let $\mathrm{Var}[t] = E[t(X) - \theta][t(X) - \theta]^T$. Let $a$ and $c$ be two $p$-dimensional vectors and let $s(\theta, X)$ be the score function. Form the two random variables $\xi = a^T t(X)$ and $\eta = c^T s(\theta, X)$. Since the correlation coefficient
\[ \rho(\xi, \eta) = \frac{E\,\xi\eta}{\sqrt{\mathrm{Var}[\xi]\,\mathrm{Var}[\eta]}} \]
(here $E\,\xi\eta$ is the covariance, because $\eta$ is zero-mean) is bounded in magnitude by one, we have that
\[ \frac{E^2(\xi\eta)}{\mathrm{Var}[\xi]\,\mathrm{Var}[\eta]} \le 1. \tag{6-9} \]
But since the score function is zero mean, it is immediate that
\[ \mathrm{Var}[\eta] = E\, c^T s(\theta, X) s^T(\theta, X) c = c^T \mathrm{Var}[s(\theta, X)]\, c = c^T J(\theta) c. \]
Also, $\mathrm{Var}[\xi] = a^T \mathrm{Var}[t]\, a$. Furthermore, by (6-6) we have that
\[ E\,\xi\eta = a^T E[t(X) s^T(\theta, X)]\, c = a^T I c = a^T c. \]
Substituting these expressions into (6-9),
\[ \frac{E^2(\xi\eta)}{\mathrm{Var}[\xi]\,\mathrm{Var}[\eta]} = \frac{(a^T c)^2}{a^T \mathrm{Var}[t]\, a \; c^T J(\theta) c} \le 1. \tag{6-10} \]

The reason we have set up this equation is that we want to exploit a little trick (one that about fifty years of hindsight have provided the community when solving problems of this type). The trick is to apply a certain neat little result that we now develop (this is worth remembering). Suppose you are given two vectors, $a$ and $c$, and wish to maximize the projection of one vector onto the other, subject to a constraint on one of the vectors, say $c$, of the form $c^T J c = 1$, where $J$ is a positive definite matrix. This particular quadratic form corresponds to the Mahalanobis length of the vector $c$. In other words, we want to constrain, in some general sense, the length of $c$ and still align it as best we can along the direction defined by $a$. The answer is provided by the following lemma.

Lemma 2 Let $J$ be a positive definite matrix, and let $a$ be a fixed vector. The maximum of $a^T c$ subject to the constraint $c^T J c = 1$ is attained at
\[ c = \frac{J^{-1} a}{(a^T J^{-1} a)^{1/2}}. \tag{6-11} \]

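Proof of Lemma. We formulate this maximization problem as a Lagrange multiplier problem:
\[ C(c, \lambda) = a^T c + \lambda (c^T J c - 1), \]
and differentiate $C$ with respect to $c$ and $\lambda$, set the results to zero, and solve for the unknowns:
\[ \frac{\partial C}{\partial c} = a^T + 2\lambda c^T J = 0 \tag{6-12} \]
\[ \frac{\partial C}{\partial \lambda} = c^T J c - 1 = 0. \tag{6-13} \]
Solving (6-12) for $c$ yields
\[ c = -\frac{J^{-1} a}{2\lambda}, \]
and substituting this into (6-13) yields
\[ 2\lambda = -\sqrt{a^T J^{-1} a} \]
(taking the root that gives a maximum rather than a minimum). Thus, the extremizing value is given by
\[ c = \frac{J^{-1} a}{\sqrt{a^T J^{-1} a}}. \]
This proves the lemma.

Substituting this result into (6-10) and applying the constraint $c^T J c = 1$ yields
\[ \frac{\left( \dfrac{a^T J^{-1}(\theta)\, a}{\sqrt{a^T J^{-1}(\theta)\, a}} \right)^2}{a^T \mathrm{Var}[t]\, a} = \frac{a^T J^{-1}(\theta)\, a}{a^T \mathrm{Var}[t]\, a} \le 1. \tag{6-14} \]
We now observe that this inequality must hold for all $a$, so
\[ a^T \left[ \mathrm{Var}[t] - J^{-1}(\theta) \right] a \ge 0 \]
for all $a$, which is equivalent to (6-8). □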
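Lemma 2 is easy to sanity-check numerically. The sketch below is an illustration under assumed values (a particular $2 \times 2$ positive definite $J$ and vector $a$, neither taken from the notes): it forms the claimed maximizer $c = J^{-1}a/\sqrt{a^T J^{-1} a}$, confirms that it satisfies the constraint $c^T J c = 1$, and compares $a^T c$ against many randomly drawn feasible vectors.

```python
import math
import random

def inv2(J):
    """Inverse of a 2x2 matrix given as nested lists."""
    (p, q), (r, s) = J
    det = p * s - q * r
    return [[s / det, -q / det], [-r / det, p / det]]

def matvec(M, v):
    return [M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def lemma2_maximizer(J, a):
    """c = J^{-1} a / sqrt(a^T J^{-1} a), the claimed constrained maximizer."""
    Jinv_a = matvec(inv2(J), a)
    scale = math.sqrt(dot(a, Jinv_a))
    return [x / scale for x in Jinv_a]

J = [[2.0, 0.5], [0.5, 1.0]]   # assumed positive definite example
a = [1.0, -3.0]                # assumed fixed vector
c_star = lemma2_maximizer(J, a)
best = dot(a, c_star)          # equals sqrt(a^T J^{-1} a)

# Draw random directions, rescale each onto the constraint set c^T J c = 1,
# and record the smallest margin by which c_star beats them.
rng = random.Random(1)
worst_gap = float("inf")
for _ in range(10_000):
    d = [rng.gauss(0, 1), rng.gauss(0, 1)]
    norm = math.sqrt(dot(d, matvec(J, d)))
    c = [x / norm for x in d]
    worst_gap = min(worst_gap, best - dot(a, c))

print("constraint value:", dot(c_star, matvec(J, c_star)))
print("smallest margin over random feasible c:", worst_gap)
```

A nonnegative margin for every feasible draw is what the lemma predicts; a random search cannot prove the lemma, it only fails to contradict it.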

The inverse of the Fisher information matrix is therefore a lower bound on the variance that may be attained by any unbiased estimator of the parameter $\theta$ given the observations $X$. It is important to determine conditions under which the Cramér-Rao lower bound may be achieved. From (6-9) we see that equality is possible if
\[ E^2(\xi\eta) = \mathrm{Var}[\xi]\,\mathrm{Var}[\eta], \]
or
\[ E(\xi\eta) = \sqrt{\mathrm{Var}[\xi]}\,\sqrt{\mathrm{Var}[\eta]}. \]
But from the Schwarz inequality, equality is possible if and only if $\xi$ and $\eta$ are linearly related, that is, if $t(X) - \theta = k(\theta)\, s(\theta, X)$.

Efficiency

Definition. An estimator is said to be efficient if it is unbiased and the covariance of the estimation error equals the Cramér-Rao lower bound; that is, let $\hat\theta = t(X)$ be an estimator for $\theta$. Then $\hat\theta$ is efficient if
\[ E\,\hat\theta = \theta \quad \text{and} \quad E[\hat\theta - \theta][\hat\theta - \theta]^T = J^{-1}(\theta). \]

Theorem 3 (Efficiency) An unbiased estimator $\hat\theta$ is efficient if and only if
\[ J(\theta)(\hat\theta - \theta) = s(\theta, X). \tag{6-15} \]
Furthermore, any unbiased efficient estimator is a maximum likelihood estimator.

Proof. Suppose $J(\theta)(\hat\theta - \theta) = s(\theta, X)$. Then from the definition
\[ J(\theta) = E\, s(\theta, X) s^T(\theta, X) = J(\theta)\, E[\hat\theta - \theta][\hat\theta - \theta]^T J(\theta). \]
But this result implies $E[\hat\theta - \theta][\hat\theta - \theta]^T J(\theta) = I$, which yields efficiency.

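Conversely, suppose $\hat\theta$ is efficient. From (6-5) and (6-6), it follows that
\[ E\, s(\theta, X)(\hat\theta - \theta)^T = I, \]
so by the Schwarz inequality,
\[ I = \left[ E\, s(\theta, X)(\hat\theta - \theta)^T \right]^2 \le E[s(\theta, X) s^T(\theta, X)]\, E[(\hat\theta - \theta)(\hat\theta - \theta)^T] = J(\theta)\, E(\hat\theta - \theta)(\hat\theta - \theta)^T = I, \]
where the last equality holds by the efficiency assumption. Equality can hold in the Schwarz inequality if and only if
\[ s(\theta, X) = K(\theta)(\hat\theta - \theta) \]
for some constant $K(\theta)$. Multiplying both sides of this expression by $(\hat\theta - \theta)^T$ and taking expectations yields $K(\theta) = J(\theta)$.

To show that any unbiased efficient estimator is a maximum likelihood estimator, let $\hat\theta$ be efficient and unbiased, and let $\hat\theta_{ML}$ be a maximum likelihood estimate of $\theta$. Evaluating (6-15) at $\theta = \hat\theta_{ML}$ yields
\[ J(\hat\theta_{ML})(\hat\theta - \hat\theta_{ML}) = s(\hat\theta_{ML}, X). \]
But the score function is zero when evaluated at the maximum likelihood estimate; consequently, $\hat\theta = \hat\theta_{ML}$. □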

6.5 Asymptotic Properties of Maximum Likelihood Estimators

Unfortunately, it is the exception rather than the rule that an unbiased efficient estimator can be found for problems of practical importance. This fact motivates us to analyze just how close we can get to the ideal of an efficient estimate. Our approach will be to examine the large sample properties of maximum likelihood estimates. In our preceding development we have considered the size of the sample as a fixed integer $m \ge 1$. Let us now suppose that an unbiased estimate can be defined for all $m$, and consider the asymptotic behavior of $\hat\theta_{ML}$ as $m$ tends to infinity.

In this section we establish three key results (subject to sufficient regularity of the distributions): (a) maximum likelihood estimates are consistent, (b) maximum likelihood estimates are asymptotically normally distributed, and (c) maximum likelihood estimates are asymptotically efficient. In the interest of clarity, we will treat only the case of scalar $\theta$. We assume, in the statement of the following three theorems, that all of the appropriate regularity conditions are satisfied.

Definition. Let $\hat\theta_m$ be an estimator based on $m$ samples of a random variable. The sequence $\{\hat\theta_m, m = 1, \ldots, \infty\}$ is said to be a consistent sequence of estimators of $\theta$ if $\lim_{m \to \infty} \hat\theta_m = \theta$ almost surely (that is, with probability one), written $\hat\theta_m \xrightarrow{\text{a.s.}} \theta$.

Theorem 4 (Consistency) Let $\hat\theta_m$ designate the maximum likelihood estimate of $\theta$ based on $m$ independent, identically distributed random variables $X_1, \ldots, X_m$. Then, if $\theta_0$ is the true value of the parameter, $\hat\theta_m$ converges almost surely to $\theta_0$.

Proof. Although this theorem is true in a very general setting, its rigorous proof is beyond the scope of our preparation. Consequently, we will content ourselves with a heuristic demonstration, based on [4]. For our demonstration, we will proceed through all of the major steps of the proof, but will assume sufficient regularity and other nice properties, when needful, to make life bearable. To simplify things, let $x = \{x_1, \ldots, x_m\}$, and introduce the following notation:
\[ f_m(x \mid \theta) \stackrel{\text{def}}{=} f_{X_1, \ldots, X_m}(x_1, \ldots, x_m \mid \theta). \]
We can get away with this since the quantities $x_1, \ldots, x_m$ do not change throughout the proof. Rather, the parameter that changes is $\theta$. From Theorem 1 and its corollaries,
\[ E \frac{\partial}{\partial\theta} \log f_m(X \mid \theta) = 0, \tag{6-16} \]
where $X = \{X_1, \ldots, X_m\}$. Suppose the true value of the parameter is $\theta_0$. Now let us expand $\frac{\partial}{\partial\theta} \log f_m(x \mid \theta)$ in a Taylor series about $\theta_0$ to obtain
\[ \frac{\partial}{\partial\theta} \log f_m(x \mid \theta) = \left. \frac{\partial}{\partial\theta} \log f_m(x \mid \theta) \right|_{\theta = \theta_0} + \left. \frac{\partial^2}{\partial\theta^2} \log f_m(x \mid \theta) \right|_{\theta = \theta^*} (\theta - \theta_0), \tag{6-17} \]
where $\theta^*$ is chosen to force equality. Let $\hat\theta_m$ be the maximum likelihood estimate based on $X_1, \ldots, X_m$, which consequently satisfies
\[ \left. \frac{\partial}{\partial\theta} \log f_m(x \mid \theta) \right|_{\theta = \hat\theta_m} = 0. \]

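Hence, evaluating (6-17) at $\theta = \hat\theta_m$, we obtain
\[ \left. \frac{\partial^2}{\partial\theta^2} \log f_m(x \mid \theta) \right|_{\theta = \theta^*} (\hat\theta_m - \theta_0) = - \left. \frac{\partial}{\partial\theta} \log f_m(x \mid \theta) \right|_{\theta = \theta_0}. \tag{6-18} \]
Since $X_1, \ldots, X_m$ are i.i.d., we have, with $f_X(x \mid \theta)$ the common density function,
\[ \frac{\partial}{\partial\theta} \log f_m(x \mid \theta) = \frac{\partial}{\partial\theta} \log \prod_{i=1}^m f_X(x_i \mid \theta) = \frac{\partial}{\partial\theta} \sum_{i=1}^m \log f_X(x_i \mid \theta) = \sum_{i=1}^m \frac{\partial}{\partial\theta} \log f_X(x_i \mid \theta). \]
By a similar argument,
\[ \frac{\partial^2}{\partial\theta^2} \log f_m(x \mid \theta) = \sum_{i=1}^m \frac{\partial^2}{\partial\theta^2} \log f_X(x_i \mid \theta). \]
From the strong law of large numbers (see footnote 12) it follows that
\[ \frac{1}{m} \sum_{i=1}^m \left. \frac{\partial}{\partial\theta} \log f_X(x_i \mid \theta) \right|_{\theta = \theta_0} \xrightarrow{\text{a.s.}} E \left. \frac{\partial}{\partial\theta} \log f_X(X \mid \theta) \right|_{\theta = \theta_0} = 0, \tag{6-19} \]
where the last equality holds from (6-16). Similarly,
\[ \frac{1}{m} \sum_{i=1}^m \left. \frac{\partial^2}{\partial\theta^2} \log f_X(x_i \mid \theta) \right|_{\theta = \theta^*} \xrightarrow{\text{a.s.}} E \left. \frac{\partial^2}{\partial\theta^2} \log f_X(X \mid \theta) \right|_{\theta = \theta^*}. \tag{6-20} \]
We now make the assumption that
\[ E \left. \frac{\partial^2}{\partial\theta^2} \log f_X(X \mid \theta) \right|_{\theta = \theta^*} \ne 0. \]
This assumption is essentially equivalent to the condition that the likelihood function be a concave function for all values of $\theta$. We might suspect that most of the common distributions we would use satisfy this condition, but we will not expend the effort to prove it. Given the above assumption, dividing both sides of (6-18) by $m$ and substituting (6-19) and (6-20), we obtain that
\[ (\hat\theta_m - \theta_0) \xrightarrow{\text{a.s.}} - \frac{E \left. \frac{\partial}{\partial\theta} \log f_X(X \mid \theta) \right|_{\theta = \theta_0}}{E \left. \frac{\partial^2}{\partial\theta^2} \log f_X(X \mid \theta) \right|_{\theta = \theta^*}} = 0. \tag{6-21} \]
□

Footnote 12: The strong law of large numbers says that for $\{X_i\}$ a sequence of i.i.d. random variables with common expectation $\mu$, $\frac{1}{n} \sum_{i=1}^n x_i \xrightarrow{\text{a.s.}} \mu$.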
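Consistency can be observed in a small simulation. The sketch below is an illustration under an assumed model that does not appear in the notes: i.i.d. exponential samples with rate $\lambda_0$, for which the maximum likelihood estimate of the rate is the reciprocal of the sample mean. It prints the estimation error for growing sample sizes $m$ drawn from one long sample path.

```python
import random

def mle_exponential_rate(xs):
    """ML estimate of the rate of an exponential density: 1 / (sample mean)."""
    return len(xs) / sum(xs)

rng = random.Random(42)
true_rate = 2.0   # assumed "true" parameter theta_0 for this illustration
samples = [rng.expovariate(true_rate) for _ in range(100_000)]

# Error of the estimate using only the first m samples of the same path.
errors = {m: abs(mle_exponential_rate(samples[:m]) - true_rate)
          for m in (10, 1_000, 100_000)}
for m, err in errors.items():
    print(f"m = {m:>7d}   |estimate - true rate| = {err:.4f}")
```

A single path cannot demonstrate almost-sure convergence, but the error for large $m$ should be small, in line with Theorem 4.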

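The above theorem shows that, as $m \to \infty$, the maximum likelihood estimate $\hat\theta_m$ tends to $\theta_0$, the true value of the parameter, with probability one. The next theorem shows us that, for large $m$, the values of $\hat\theta_m$ from different trials are clustered around $\theta_0$ with a normal distribution.

Theorem 5 (Asymptotic normality) Let $\hat\theta_m$ designate the maximum likelihood estimate of $\theta$ based on $m$ independent, identically distributed random variables $X_1, \ldots, X_m$. Then if $\theta_0$ is the true value of the parameter, $\sqrt{m}(\hat\theta_m - \theta_0)$ converges in law to a normal random variable, that is,
\[ \sqrt{m}(\hat\theta_m - \theta_0) \xrightarrow{\text{law}} Y, \]
where $Y \sim N(0, J^{-1}(\theta_0))$ and $J(\theta)$ is the Fisher information.

Proof. Due to the complexity of the proof of this result, we content ourselves with a heuristic demonstration of this result also. First, we form a Taylor expansion about the true parameter value, $\theta_0$:
\[ \left. \frac{\partial}{\partial\theta} \log f_m(x \mid \theta) \right|_{\theta = \hat\theta_m} = \left. \frac{\partial}{\partial\theta} \log f_m(x \mid \theta) \right|_{\theta = \theta_0} + \left. \frac{\partial^2}{\partial\theta^2} \log f_m(x \mid \theta) \right|_{\theta = \theta_0} (\hat\theta_m - \theta_0) + \text{h.o.t.} \tag{6-22} \]
Since $\hat\theta_m \xrightarrow{\text{a.s.}} \theta_0$, we assume sufficient regularity to neglect the higher order terms. Also, since $\hat\theta_m$ is the maximum likelihood estimate, the left-hand side of (6-22) is zero, and therefore
\[ \frac{1}{\sqrt{m}} \left. \frac{\partial}{\partial\theta} \log f_m(x \mid \theta) \right|_{\theta = \theta_0} = - \frac{1}{m} \left. \frac{\partial^2}{\partial\theta^2} \log f_m(x \mid \theta) \right|_{\theta = \theta_0} \sqrt{m}\,(\hat\theta_m - \theta_0). \tag{6-23} \]
But from the strong law of large numbers,
\[ \frac{1}{m} \frac{\partial^2}{\partial\theta^2} \log f_m(x \mid \theta) \xrightarrow{\text{a.s.}} E \frac{\partial^2}{\partial\theta^2} \log f_X(X \mid \theta). \]
From Theorem 1 with $t = s$, we obtain
\[ E\, s(\theta, X) s^T(\theta, X) = -E \frac{\partial}{\partial\theta} s^T(\theta, X) = -E \frac{\partial^2}{\partial\theta^2} L(\theta, X), \tag{6-24} \]
or, rewriting,
\[ E \left[ \frac{\partial}{\partial\theta} \log f_X(X \mid \theta) \right]^2 = -E \frac{\partial^2}{\partial\theta^2} \log f_X(X \mid \theta) = J(\theta). \tag{6-25} \]
We have thus established that the random variable
\[ \left. \frac{\partial}{\partial\theta} \log f_X(X_i \mid \theta) \right|_{\theta = \theta_0} \]
is a zero-mean random variable with variance $J(\theta_0)$. Thus, by the central limit theorem (see footnote 13), the left-hand side of (6-23) converges to a normal random variable, that is,
\[ \frac{1}{\sqrt{m}} \sum_{i=1}^m \left. \frac{\partial}{\partial\theta} \log f_X(X_i \mid \theta) \right|_{\theta = \theta_0} \xrightarrow{\text{law}} W, \]
where $W \sim N[0, J(\theta_0)]$. Consequently, the right-hand side of (6-23) also converges to $W$, that is,
\[ \sqrt{m}\, J(\theta_0)(\hat\theta_m - \theta_0) \xrightarrow{\text{law}} W. \]
Finally, it is evident, therefore, that
\[ \sqrt{m}\,(\hat\theta_m - \theta_0) \xrightarrow{\text{law}} \frac{1}{J(\theta_0)} W \sim N[0, J^{-1}(\theta_0)]. \tag{6-26} \]
□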
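The limiting variance in (6-26) can be compared with simulation. The following sketch again assumes an exponential model with rate $\lambda_0$ (an illustrative choice, not from the notes). For that model $\log f_X(x \mid \lambda) = \log\lambda - \lambda x$, so the per-sample Fisher information is $J(\lambda) = 1/\lambda^2$ and the predicted limiting variance of $\sqrt{m}(\hat\lambda_m - \lambda_0)$ is $\lambda_0^2$.

```python
import math
import random
import statistics

def scaled_mle_errors(true_rate, m, trials, seed=0):
    """Return sqrt(m) * (rate_hat - true_rate) over many independent trials."""
    rng = random.Random(seed)
    out = []
    for _ in range(trials):
        total = sum(rng.expovariate(true_rate) for _ in range(m))
        rate_hat = m / total               # ML estimate of the rate
        out.append(math.sqrt(m) * (rate_hat - true_rate))
    return out

true_rate = 2.0                            # assumed theta_0
errs = scaled_mle_errors(true_rate, m=200, trials=4000)
empirical_var = statistics.pvariance(errs)
predicted_var = true_rate**2               # 1 / J(theta_0) for this model

print("empirical variance of sqrt(m)(estimate - truth):", empirical_var)
print("predicted limiting variance 1/J:", predicted_var)
```

For finite $m$ the estimate is slightly biased and the empirical variance sits a little above the limit, so only approximate agreement should be expected.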

Theorem 6 (Asymptotic efficiency) Within the class of consistent uniformly asymptotically normal estimators, $\hat\theta_m$ is asymptotically efficient in the sense that asymptotically it attains the Cramér-Rao lower bound as $m \to \infty$.

Proof. This result is an immediate consequence of the previous theorem and the Cramér-Rao lower bound. □

This theorem is of great practical significance, since it shows that the maximum likelihood estimator makes efficient use of all the available data for large samples.

Footnote 13: The version of the central limit theorem we need is: Let $\{X_n\}$ be a sequence of i.i.d. random variables with common expectation $\mu$ and common variance $\sigma^2$. Let $Z_n = \frac{X_1 + \cdots + X_n - n\mu}{\sigma\sqrt{n}}$. Then $Z_n \xrightarrow{\text{law}} Z$, where $Z$ is distributed $N(0, 1)$. Stated another way, let $W_n = \frac{X_1 + \cdots + X_n - n\mu}{\sqrt{n}}$. Then $W_n \xrightarrow{\text{law}} W$, where $W$ is $N(0, \sigma^2)$.

6.6 The Multivariate Normal Case

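Because of its general importance to engineering, we develop the maximum likelihood estimate for the mean and covariance of the multivariate normal distribution. Suppose $X_1, \ldots, X_m$ is a random sample of $n$-dimensional vectors from $N(\mathbf{m}, R)$, where $\mathbf{m}$ is an $n$-vector and $R$ is an $n \times n$ covariance matrix. The likelihood function for this sample is
\[ l(\mathbf{m}, R, X_1, \ldots, X_m) = (2\pi)^{-\frac{mn}{2}} |R|^{-\frac{m}{2}} \exp\left\{ -\frac{1}{2} \sum_{i=1}^m (x_i - \mathbf{m})^T R^{-1} (x_i - \mathbf{m}) \right\}, \tag{6-27} \]
and, taking logarithms,
\[ L(\mathbf{m}, R, X_1, \ldots, X_m) = -\frac{mn}{2} \log(2\pi) - \frac{m}{2} \log |R| - \frac{1}{2} \sum_{i=1}^m (x_i - \mathbf{m})^T R^{-1} (x_i - \mathbf{m}). \tag{6-28} \]
Equation (6-28) can be simplified as follows. First, let
\[ \bar{x} = \frac{1}{m} \sum_{i=1}^m x_i. \]
We then write
\[ (x_i - \mathbf{m})^T R^{-1} (x_i - \mathbf{m}) = (x_i - \bar{x} + \bar{x} - \mathbf{m})^T R^{-1} (x_i - \bar{x} + \bar{x} - \mathbf{m}). \]
Thus, expanding, we obtain
\[ (x_i - \mathbf{m})^T R^{-1} (x_i - \mathbf{m}) = (x_i - \bar{x})^T R^{-1} (x_i - \bar{x}) + (\bar{x} - \mathbf{m})^T R^{-1} (\bar{x} - \mathbf{m}) + 2 (\bar{x} - \mathbf{m})^T R^{-1} (x_i - \bar{x}). \]
Summing over the index $i = 1, \ldots, m$, the final term on the right-hand side vanishes, and we are left with
\[ \sum_{i=1}^m (x_i - \mathbf{m})^T R^{-1} (x_i - \mathbf{m}) = \sum_{i=1}^m (x_i - \bar{x})^T R^{-1} (x_i - \bar{x}) + m (\bar{x} - \mathbf{m})^T R^{-1} (\bar{x} - \mathbf{m}). \tag{6-29} \]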

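Since each term $(x_i - \bar{x})^T R^{-1} (x_i - \bar{x})$ is a scalar, it equals the trace of itself. Hence, since the trace of the product of matrices is invariant under any cyclic permutation of the matrices,
\[ (x_i - \bar{x})^T R^{-1} (x_i - \bar{x}) = \mathrm{tr}\, R^{-1} (x_i - \bar{x})(x_i - \bar{x})^T. \tag{6-30} \]
Summing (6-30) over the index $i$ and substituting into (6-29) yields
\[ \sum_{i=1}^m (x_i - \mathbf{m})^T R^{-1} (x_i - \mathbf{m}) = \mathrm{tr}\, R^{-1} \sum_{i=1}^m (x_i - \bar{x})(x_i - \bar{x})^T + m (\bar{x} - \mathbf{m})^T R^{-1} (\bar{x} - \mathbf{m}). \tag{6-31} \]
Now define
\[ S = \frac{1}{m} \sum_{i=1}^m (x_i - \bar{x})(x_i - \bar{x})^T, \]
and using (6-31) in (6-28) gives
\[ L(\mathbf{m}, R, X_1, \ldots, X_m) = -\frac{mn}{2} \log(2\pi) - \frac{m}{2} \log |R| - \frac{m}{2} \mathrm{tr}\, R^{-1} S - \frac{m}{2} (\bar{x} - \mathbf{m})^T R^{-1} (\bar{x} - \mathbf{m}). \tag{6-32} \]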

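Calculation of the Score Function

To facilitate the calculation of the score function, it is convenient to parameterize the log-likelihood equation in terms of $V = R^{-1}$, yielding
\[ L(\mathbf{m}, V, X_1, \ldots, X_m) = -\frac{mn}{2} \log(2\pi) + \frac{m}{2} \log |V| - \frac{m}{2} \mathrm{tr}\, V S - \frac{m}{2} \mathrm{tr}\, V (\bar{x} - \mathbf{m})(\bar{x} - \mathbf{m})^T. \tag{6-33} \]
To calculate the score function, we must evaluate $\frac{\partial L}{\partial \mathbf{m}}$ and $\frac{\partial L}{\partial V}$. First,
\[ \frac{\partial L}{\partial \mathbf{m}} = -\frac{m}{2} \frac{\partial}{\partial \mathbf{m}} (\bar{x} - \mathbf{m})^T V (\bar{x} - \mathbf{m}) = m (\bar{x} - \mathbf{m})^T V. \tag{6-34} \]
To calculate $\frac{\partial L}{\partial V}$, we first calculate $\frac{\partial \log |V|}{\partial V}$. We have
\[ \frac{\partial \log |V|}{\partial V} = \frac{\partial \log |V|}{\partial |V|} \frac{\partial |V|}{\partial V} = \frac{1}{|V|} \frac{\partial |V|}{\partial V}. \]
An important identity worth remembering, which we will not prove here (see, for example, [5]), is given in the following lemma.

Lemma 3 Let $V$ be a symmetric matrix. Then
\[ \frac{\partial |V|}{\partial V} = 2 \{V_{ij}\} - \mathrm{diag}\, \{V_{ii}\}, \]
where $V_{ij}$ is the $ij$-th cofactor of $V$.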

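Consequently,
\[ \frac{\partial \log |V|}{\partial V} = 2 \left\{ \frac{V_{ij}}{|V|} \right\} - \mathrm{diag} \left\{ \frac{V_{ii}}{|V|} \right\}. \]
But, since $\frac{V_{ij}}{|V|}$ is the $ij$-th element of $R$, we have
\[ \frac{\partial \log |V|}{\partial V} = 2R - \mathrm{diag}\, R. \tag{6-35} \]
We next must calculate $\frac{\partial\, \mathrm{tr}\, V S}{\partial V}$. Another important identity worth remembering, which we also will not prove here (see, for example, [5]), is given in the following lemma.

Lemma 4 Let $V$ and $S$ be symmetric matrices. Then
\[ \frac{\partial\, \mathrm{tr}\, V S}{\partial V} = 2S - \mathrm{diag}\, S. \tag{6-36} \]

To complete the calculation of $\frac{\partial L}{\partial V}$, we must compute $\frac{\partial}{\partial V} (\bar{x} - \mathbf{m})^T V (\bar{x} - \mathbf{m})$. Since $(\bar{x} - \mathbf{m})^T V (\bar{x} - \mathbf{m}) = \mathrm{tr}\, V (\bar{x} - \mathbf{m})(\bar{x} - \mathbf{m})^T$, we may apply the previous lemma to obtain
\[ \frac{\partial}{\partial V} (\bar{x} - \mathbf{m})^T V (\bar{x} - \mathbf{m}) = \frac{\partial}{\partial V} \mathrm{tr}\, V (\bar{x} - \mathbf{m})(\bar{x} - \mathbf{m})^T = 2 (\bar{x} - \mathbf{m})(\bar{x} - \mathbf{m})^T - \mathrm{diag}\, (\bar{x} - \mathbf{m})(\bar{x} - \mathbf{m})^T. \tag{6-37} \]
Combining (6-35), (6-36), and (6-37), we obtain
\[ \frac{\partial L}{\partial V} = \frac{m}{2} (2M - \mathrm{diag}\, M), \tag{6-38} \]
where
\[ M = R - S - (\bar{x} - \mathbf{m})(\bar{x} - \mathbf{m})^T. \tag{6-39} \]
To find the maximum likelihood estimate of $\mathbf{m}$ and $R$, we must solve
\[ \frac{\partial L}{\partial \mathbf{m}} = 0, \qquad \frac{\partial L}{\partial V} = 0. \tag{6-40} \]
From (6-34) we see that the maximum likelihood estimate of $\mathbf{m}$ is
\[ \hat{\mathbf{m}}_{ML} = \bar{x}. \tag{6-41} \]
To obtain the maximum likelihood estimate of $R$ we require, from (6-38), that $M = 0$, which yields
\[ \hat{R}_{ML} = S + (\bar{x} - \mathbf{m})(\bar{x} - \mathbf{m})^T, \]
but since the solutions for $\mathbf{m}$ and $R$ must satisfy (6-40), we must have $\mathbf{m} = \hat{\mathbf{m}}_{ML} = \bar{x}$, hence we obtain
\[ \hat{R}_{ML} = S. \tag{6-42} \]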
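The two estimates above are simple to compute. The following is a minimal sketch using plain Python lists; the function name `mvn_mle` and the toy data are illustrative assumptions. Note that the covariance estimate divides by the sample size $m$, as in the definition of $S$, and not by $m - 1$.

```python
def mvn_mle(samples):
    """Return (sample mean, S) for a list of equal-length vectors.

    S is the 1/m-normalized scatter matrix about the sample mean, which is
    the maximum likelihood covariance estimate for the multivariate normal.
    """
    m = len(samples)
    n = len(samples[0])
    mean = [sum(x[j] for x in samples) / m for j in range(n)]
    cov = [[sum((x[j] - mean[j]) * (x[k] - mean[k]) for x in samples) / m
            for k in range(n)]
           for j in range(n)]
    return mean, cov

# Toy data: three 2-dimensional observations (assumed values).
data = [[1.0, 2.0], [3.0, 0.0], [5.0, 4.0]]
mean_ml, cov_ml = mvn_mle(data)
print("mean estimate:", mean_ml)
print("covariance estimate:", cov_ml)
```

For this toy data the sample mean is $[3, 2]^T$ and the scatter matrix has diagonal entries $8/3$ and off-diagonal entries $4/3$.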

6.7 Appendix: Matrix Derivatives

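When one does matrix calculus, one quickly finds that there are two kinds of people in the world: those who think that a gradient is a row vector, and those who think it is a column vector. The text is of the column-vector persuasion, while I am a row-vector man. It really doesn't matter very much, but since there are different conventions, you should become aware of that fact and learn to appreciate both of them.

Let $\theta = [\theta_1, \ldots, \theta_p]^T$ be a vector (unless explicitly stated otherwise, all vectors are considered to be column vectors). Let $a: \mathbb{R}^p \to \mathbb{R}$ be a scalar-valued function of the $p$-dimensional vector $\theta$. Then the gradient of $a$ with respect to $\theta$ is
\[ \frac{\partial}{\partial\theta} a(\theta) = \left[ \frac{\partial}{\partial\theta_1} a(\theta), \ldots, \frac{\partial}{\partial\theta_p} a(\theta) \right]. \]
Let $a: \mathbb{R}^p \to \mathbb{R}^k$ be a $k$-dimensional vector-valued function of the $p$-dimensional vector $\theta$. Then the gradient of $a(\theta) = [a_1(\theta), \ldots, a_k(\theta)]^T$ with respect to $\theta$ is
\[ \frac{\partial}{\partial\theta} a(\theta) = \begin{bmatrix} \frac{\partial}{\partial\theta_1} a_1(\theta) & \cdots & \frac{\partial}{\partial\theta_p} a_1(\theta) \\ \vdots & & \vdots \\ \frac{\partial}{\partial\theta_1} a_k(\theta) & \cdots & \frac{\partial}{\partial\theta_p} a_k(\theta) \end{bmatrix}. \]
Thus, the derivative of a vector with respect to a vector is obtained by stacking up the gradients of each component of the vector in the obvious way. Some basic results that follow include:

1. $\frac{\partial}{\partial\theta} \theta = I$.

2. $\frac{\partial}{\partial\theta} b^T \theta = b^T$.

3. $\frac{\partial}{\partial\theta} a^T(\theta) b(\theta) = a^T(\theta) \frac{\partial}{\partial\theta} b(\theta) + b^T(\theta) \frac{\partial}{\partial\theta} a(\theta)$.

4. $\frac{\partial}{\partial\theta} \theta^T Q \theta = 2 \theta^T Q$ if $Q$ is symmetric, and $\theta^T (Q + Q^T)$ otherwise.

5. $\frac{\partial}{\partial \mathbf{m}} \mathbf{m}^T Q \mathbf{m} = 2 \mathbf{m}^T Q$ (for symmetric $Q$).

6. $\frac{\partial}{\partial\theta} \exp\left\{ -\frac{1}{2} \theta^T Q \theta \right\} = -\exp\left\{ -\frac{1}{2} \theta^T Q \theta \right\} \theta^T Q$ (for symmetric $Q$).

7. $\frac{\partial}{\partial\theta} \log(\theta^T Q \theta) = \frac{2}{\theta^T Q \theta}\, \theta^T Q$ (for symmetric $Q$).

It is also possible to take the derivative of quantities with respect to matrices. The following results are useful:

1. $\frac{\partial}{\partial Q} \log \det Q = Q^{-1}$ (for symmetric $Q$, ignoring the symmetry constraint on the entries).

2. $\frac{\partial}{\partial Q} (\mathrm{trace}\, A Q^{-1} B) = -(Q^{-1} B A Q^{-1})^T$.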
Exercise 6-1 Justify (6-4) and show how it leads to (6-3).

Exercise 6-2 Consider an $m$-dimensional normal random vector $Y$ with mean value $c\theta$ (where $c$ is a constant $m$-dimensional vector) and covariance matrix $\Sigma$ (an $m \times m$ known matrix). Show that the maximum likelihood estimate of $\theta$ is
\[ \hat\theta = (c^T \Sigma^{-1} c)^{-1} c^T \Sigma^{-1} Y. \]

Exercise 6-3 Consider the same system as presented in Exercise 6-2, except that $\Sigma$ has the special form $\Sigma = \sigma^2 I$, where $\sigma^2$ is to be estimated. Show that the maximum likelihood estimators for $\theta$ and $\sigma^2$ are
\[ \hat\theta = (c^T c)^{-1} c^T Y, \qquad \hat\sigma^2 = (1/m)(Y - c\hat\theta)^T (Y - c\hat\theta). \]

Exercise 6-4 Consider $N$ independent observations of an $m$-variate random vector $\{Y_k, k \in (1, 2, \ldots, N)\}$ such that each $Y_k$ has a normal distribution with mean $c_k\theta$ and common covariance $\Sigma$. Show that a necessary condition for $\hat\theta$ and $\hat\Sigma$ to be maximum likelihood estimators of $\theta$ and $\Sigma$, respectively, is that they simultaneously satisfy
\[ \hat\theta = \left[ \sum_{k=1}^N c_k^T \hat\Sigma^{-1} c_k \right]^{-1} \sum_{k=1}^N c_k^T \hat\Sigma^{-1} Y_k \tag{6-43} \]
\[ \hat\Sigma = \frac{1}{N} \sum_{k=1}^N (Y_k - c_k\hat\theta)(Y_k - c_k\hat\theta)^T. \tag{6-44} \]
(To establish this result, you may need some of the matrix differentiation identities presented above.)

Exercise 6-5 Equations (6-43) and (6-44) do not have simple closed form solutions. However, they can be solved by a relaxation algorithm as follows:

1. Pick any value of $\hat\Sigma$ (say $I$).

2. Solve (6-43) for $\hat\theta$ using $\hat\Sigma$.

3. Solve (6-44) for $\hat\Sigma$ using $\hat\theta$.

4. Stop if converged, otherwise go to (2).

Unfortunately, no one seems to be aware of the existence of a proof of global convergence of the above relaxation algorithm. Computational studies, however, indicate that it works well in practice. What can be shown, however, is that regardless of the value of $\hat\Sigma$, the estimate $\hat\theta$ given by (6-43) is an unbiased estimate of $\theta$. Prove this fact. For extra credit (and perhaps a Ph.D.) show that the relaxation algorithm is globally convergent :-)

7 Conditioning

The notion of conditioning is central to estimation theory. It is the vehicle that connects the things we observe to the things we cannot directly observe but need to learn about. Suppose X and Y are two random variables such that direct observation of X is not possible, but it is possible to observe Y . Given that Y = y, what can this knowledge tell us about X? One possibility is to compute the expected value of X conditioned on the event Y = y. In this section we explore this candidate and assess its attributes as an estimator of the value assumed by X.

7.1 Conditional Densities

The most obvious way to compute the conditional expectation is to first compute the conditional density function and compute
\[ E(X \mid Y = y) = \int_{-\infty}^{\infty} x f_{X|Y}(x \mid y)\,dx, \]
where $f_{X|Y}(x \mid y)$ is the conditional density of $X$ given $Y = y$. The problem is how to obtain this conditional density. If $Y$ may assume a finite number of values, each with positive probability, this is not a difficult task, for then we have
\[ f_{X|Y}(x \mid y) = \lim_{\Delta x \to 0} \frac{P[X \in [x - \Delta x, x + \Delta x],\, Y = y]}{2\Delta x\, P[Y = y]}. \]
Writing this expression in terms of the joint distribution function, we obtain
\[ f_{X|Y}(x \mid y) = \lim_{\Delta x \to 0} \frac{F_{XY}(x + \Delta x, y) - F_{XY}(x - \Delta x, y)}{2\Delta x\, P[Y = y]} = \frac{f_{XY}(x, y)}{f_Y(y)}, \]
where $f_Y$ is the probability mass function for $Y$ and $f_{XY}$ is the joint density/mass function of $X$ and $Y$. As we let $\Delta x$ tend to zero, this expression is well-defined.

However, what if $Y$ assumes a continuum of values? Then the event $Y = y$ has zero probability of occurrence, and we need to be very careful in the formulation of our limit. Perhaps the most obvious way to proceed is to define the conditional density as
\[ f_{X|Y}(x \mid y) = \lim_{\Delta x, \Delta y \to 0} \frac{P[X \in [x - \Delta x, x + \Delta x],\, Y \in [y - \Delta y, y + \Delta y]] \,/\, (2\Delta x\, 2\Delta y)}{P[Y \in [y - \Delta y, y + \Delta y]] \,/\, (2\Delta y)} \tag{7-1} \]
\[ = \lim_{\Delta x, \Delta y \to 0} \frac{P[X \in [x - \Delta x, x + \Delta x],\, Y \in [y - \Delta y, y + \Delta y]] \,/\, (2\Delta x\, 2\Delta y)}{P[X \in (-\infty, \infty),\, Y \in [y - \Delta y, y + \Delta y]] \,/\, (2\Delta y)}. \tag{7-2} \]

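Let's pay close attention to the way this limit is obtained. Note that this conditional density is defined for points $(x, y)$ that are the limits of rectangles of the form
\[ \{X \in [x - \Delta x, x + \Delta x],\, Y \in [y - \Delta y, y + \Delta y]\} \tag{7-3} \]
as $\Delta x$ and $\Delta y$ both approach zero independently. Without loss of generality, we assume that $\Delta x > 0$ and $\Delta y > 0$. Figure 7-1 illustrates a typical rectangle. To facilitate the limiting procedure it is convenient to express the probability associated with rectangles in terms of the distribution function. We do this by means of what are called partial difference operators. The partial difference operator of step $\Delta_i$ in the $i$-th coordinate, denoted $\Delta_{x_i - \Delta_i}^{x_i + \Delta_i}$, is defined by
\[ \Delta_{x_i - \Delta_i}^{x_i + \Delta_i} F_{X_1, \ldots, X_n} = F_{X_1, \ldots, X_n}(x_1, \ldots, x_{i-1}, x_i + \Delta_i, x_{i+1}, \ldots, x_n) - F_{X_1, \ldots, X_n}(x_1, \ldots, x_{i-1}, x_i - \Delta_i, x_{i+1}, \ldots, x_n). \]
Clearly, this difference is nonnegative. Composing the operators for the two coordinates yields, for $n = 2$,
\[ \Delta_{x - \Delta x}^{x + \Delta x} \Delta_{y - \Delta y}^{y + \Delta y} F_{XY}(x, y) = F_{XY}(x + \Delta x, y + \Delta y) - F_{XY}(x + \Delta x, y - \Delta y) + F_{XY}(x - \Delta x, y - \Delta y) - F_{XY}(x - \Delta x, y + \Delta y). \]
Using the fact that the probability associated with the cell $[x - \Delta x, x + \Delta x] \times [y - \Delta y, y + \Delta y]$ is expressed in terms of the distribution function as
\[ P[X \in [x - \Delta x, x + \Delta x],\, Y \in [y - \Delta y, y + \Delta y]] = \Delta_{x - \Delta x}^{x + \Delta x} \Delta_{y - \Delta y}^{y + \Delta y} F_{XY}(x, y), \]
the numerator of the ratio in (7-2) is
\[ \frac{F_{XY}(x + \Delta x, y + \Delta y) - F_{XY}(x + \Delta x, y - \Delta y) + F_{XY}(x - \Delta x, y - \Delta y) - F_{XY}(x - \Delta x, y + \Delta y)}{2\Delta x\, 2\Delta y}, \]
which becomes, as $\Delta x$ and $\Delta y$ both approach zero, the joint density function $f_{XY}(x, y)$. The limit of the denominator of (7-2) becomes, as $\Delta y$ approaches zero, the marginal density of $Y$, which may be expressed as
\[ \int_{-\infty}^{\infty} f_{XY}(\alpha, y)\,d\alpha. \]
Thus, we may conclude, for this case, that
\[ f_{X|Y}(x \mid Y = y) = \frac{f_{XY}(x, y)}{f_Y(y)} = \frac{f_{XY}(x, y)}{\int_{-\infty}^{\infty} f_{XY}(\alpha, y)\,d\alpha}. \tag{7-4} \]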

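The conditional density defined by (7-4) is what we often think of when we go about defining such things. But we must remember that we arrived at this result by a very carefully constructed limit, namely, we viewed the point $(x, y)$ as the limit of rectangles. This is not the only way to express the point $(x, y)$ as the limit of sets.

[Figure 7-1: The family of rectangles $\{X \in [x - \Delta x, x + \Delta x],\, Y \in [y - \Delta y, y + \Delta y]\}$.]

Here's another way [18, p. 88]. Consider sets of the form
\[ \left\{ \frac{y}{X} - \Delta y \le \frac{Y}{X} \le \frac{y}{X} + \Delta y \right\}, \]
or, equivalently, $\{y - X\Delta y \le Y \le y + X\Delta y\}$. Now consider sets of the form
\[ \{X \in [x - \Delta x, x + \Delta x],\, Y \in [y - X\Delta y, y + X\Delta y]\}. \]
These sets are trapezoids, as illustrated in Figure 7-2. Note that the lines defining the $Y$ component have slope $\pm\Delta y$, but as $\Delta x$ and $\Delta y$ both tend to zero, the trapezoid converges to the limit point $(x, y)$, just as was the case with rectangular sets. With this model, the conditional density becomes
\[ f_{X|Y}(x \mid y) = \lim_{\Delta x, \Delta y \to 0} \frac{P[X \in [x - \Delta x, x + \Delta x],\, Y \in [y - X\Delta y, y + X\Delta y]] \,/\, (2\Delta x\, 2\Delta y)}{P[Y \in [y - X\Delta y, y + X\Delta y]] \,/\, (2\Delta y)} \tag{7-5} \]
\[ = \lim_{\Delta x, \Delta y \to 0} \frac{P[X \in [x - \Delta x, x + \Delta x],\, Y \in [y - X\Delta y, y + X\Delta y]] \,/\, (2\Delta x\, 2\Delta y)}{P[X \in (-\infty, \infty),\, Y \in [y - X\Delta y, y + X\Delta y]] \,/\, (2\Delta y)}. \tag{7-6} \]
The numerator of the ratio in (7-6) may be expressed in terms of the joint distribution function as
\[ \frac{F_{XY}(x + \Delta x, y + x\Delta y) - F_{XY}(x + \Delta x, y - x\Delta y) + F_{XY}(x - \Delta x, y - x\Delta y) - F_{XY}(x - \Delta x, y + x\Delta y)}{2\Delta x\, 2\Delta y}. \]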
[Figure 7-2: The family of trapezoids $\{X \in [x - \Delta x, x + \Delta x],\, Y \in [y - X\Delta y, y + X\Delta y]\}$.]

Now suppose we take the limit as $\Delta y \to 0$. Let us examine the quantity
\[ F_{XY}(x + \Delta x, y + x\Delta y) - F_{XY}(x + \Delta x, y - x\Delta y), \]
and note that we can re-write this expression as
\[ \Delta F = F_{XY}(x + \Delta x, y + z) - F_{XY}(x + \Delta x, y - z), \]
where $z = x\Delta y$. Let us first assume that $x > 0$. We may then form the ratio
\[ \frac{\Delta F}{\Delta y} = \frac{\Delta F}{z} \frac{z}{\Delta y}, \]
or, since $\frac{z}{\Delta y} = x$, we have
\[ \frac{\Delta F}{2\Delta y} = \frac{[F_{XY}(x + \Delta x, y + z) - F_{XY}(x + \Delta x, y - z)]\, x}{2z}. \]
If $x < 0$, we have $z = -|z|$ and $x = -|x|$, so
\[ \frac{\Delta F}{2\Delta y} = \frac{[F_{XY}(x + \Delta x, y + z) - F_{XY}(x + \Delta x, y - z)]\,(-|x|)}{-2|z|}, \]
so in general, we obtain
\[ \frac{\Delta F}{2\Delta y} = \frac{[F_{XY}(x + \Delta x, y + z) - F_{XY}(x + \Delta x, y - z)]\,|x|}{|2z|}. \]
We have thus succeeded in reducing this problem to the previous case, except for the addition of the extra term $|x|$. Passing to the limit as $\Delta x$ and $z$ (and hence $\Delta y$) tend to zero, we obtain the conditional density function
\[ f_{X|Y}(x \mid Y = y) = \frac{f_{XY}(x, y)\,|x|}{\int_{-\infty}^{\infty} f_{XY}(\alpha, y)\,|\alpha|\,d\alpha}. \]

This is a very different conditional distribution than the one obtained with the rectangle structure! What's going on here? We have competing definitions for the conditional density. This is because there are many ways in which limiting operations can take place, and there is no mathematical reason to prefer one over the other. This suggests that we must pay very careful attention to the relationships between X and Y when computing conditional expectations. This prompts us to ask a very significant question: Is there a way to define the conditional expectation without first computing the conditional density function? To answer this question, we need to discuss σ-fields.
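To make the discrepancy concrete, here is a short worked example. It is an assumed illustration, not taken from the notes: let $X$ and $Y$ be independent standard normal random variables and condition on $Y = 0$. The rectangle limit returns the standard normal density, while the trapezoid limit (density weighted by $|x|$ and renormalized) returns a different density, even though both are "the conditional density of $X$ given $Y = 0$".

```latex
% Assumed example: X, Y independent N(0,1), conditioning on Y = 0.
\[
  f_{XY}(x,0) = \frac{1}{2\pi} e^{-x^2/2}, \qquad
  f_Y(0) = \frac{1}{\sqrt{2\pi}} .
\]
% Rectangle limit: joint density divided by the marginal of Y.
\[
  f^{\mathrm{rect}}_{X|Y}(x \mid 0)
    = \frac{f_{XY}(x,0)}{f_Y(0)}
    = \frac{1}{\sqrt{2\pi}} e^{-x^2/2} .
\]
% Trapezoid limit: weight by |x|, then renormalize.
\[
  \int_{-\infty}^{\infty} f_{XY}(\alpha,0)\,|\alpha|\,d\alpha
    = \frac{1}{2\pi} \cdot 2 \int_0^{\infty} \alpha e^{-\alpha^2/2}\,d\alpha
    = \frac{1}{\pi},
  \qquad
  f^{\mathrm{trap}}_{X|Y}(x \mid 0)
    = \frac{\frac{1}{2\pi} e^{-x^2/2}\,|x|}{\frac{1}{\pi}}
    = \frac{|x|}{2}\, e^{-x^2/2} .
\]
```

Both results integrate to one, but the second vanishes at $x = 0$, where the first attains its maximum.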

7.2 σ-fields

Fundamental to all of probability theory is the concept of an event. An event is the outcome of an experiment. For example, if I roll a die, the sure event is $\Omega = \{1, 2, 3, 4, 5, 6\}$, the null event is the empty set, $\emptyset$, and some other examples of events are: "even and not 4" $= \{2, 6\}$, "less than 5" $= \{1, 2, 3, 4\}$, and "not 5" $= \{1, 2, 3, 4, 6\}$. The power set, denoted $2^\Omega$, is the set of all subsets of $\Omega$. Probability theory involves the basic Boolean set operations of union, intersection, and complementation. Any collection of sets that is closed under these operations is called a field (if the collection of sets is finite, this collection is also called a Boolean algebra). For example, consider the real line, $\mathbb{R}$, and let $A$ be any subset. The collection $\{\mathbb{R}, \emptyset, A, A^c\}$ is a field, where $A^c$ is the complement of $A$. A sigma field (usually written σ-field) is a field that is closed under countable (not just finite, but still enumerable) unions. Thus, formally, a σ-field $\mathcal{F}$ is a collection of sets (events) such that
\[ A \in \mathcal{F} \Rightarrow A^c \in \mathcal{F} \]
\[ A_1, A_2, \ldots \in \mathcal{F} \Rightarrow \bigcup_{i=1}^{\infty} A_i \in \mathcal{F}. \]

Let $\Omega$ be a sample space and let $\mathcal{F}$ be a σ-field defined over $\Omega$. The pair $\{\Omega, \mathcal{F}\}$ is called a measurable space.

Examples such as coin-flips and dice-rolls are nice ways to introduce the concept of events and fields, but we now need to move to a more sophisticated level, and discuss σ-fields in the context of random variables. Before doing so, however, we need to introduce some new terminology. Let $I$ be an arbitrary index set (countable or uncountable) and let $\mathcal{C} = \{A_\alpha, \alpha \in I\}$ be an arbitrary collection of sets indexed by $I$. The σ-field generated by $\mathcal{C}$ is the smallest σ-field that contains all of the members of $\mathcal{C}$. In particular, suppose $\mathcal{C}$ is the set of all open intervals on the real line. The σ-field generated by this collection is called the Borel field. We will reserve the notation $\mathcal{B}$ for the Borel field.

The Borel field has great intuitive appeal, because it is the smallest σ-field that contains subsets of the real line that we can describe with English sentences. It contains all singleton sets $\{x\}$, it contains all open sets, all closed sets, all countable unions of such sets, their complements, intersections, and so forth. Just about any subset of the real line that you can describe in a finite number of words (and many that cannot be so easily described) is a member of the Borel field. As a point of terminology, the elements of the Borel field are called Borel sets.

Let $\{\Omega, \mathcal{F}\}$ be a measurable space, and consider a function $X$ that maps a sample space to the real line; that is, $X: \Omega \to \mathbb{R}$. We say that $X$ is measurable with respect to $\mathcal{F}$ if, and only if, the inverse images of all Borel sets in $\mathbb{R}$ are elements of $\mathcal{F}$; that is, if and only if
\[ A \in \mathcal{B} \Rightarrow X^{-1}(A) = \{\omega : X(\omega) \in A\} \in \mathcal{F}. \]
If a random variable is measurable with respect to a σ-field $\mathcal{F}$, we denote this fact by the notation $X \in \mathcal{F}$ (see footnote 14). Thus, a function is a random variable if and only if it is a measurable function. We emphasize this point because, in general, a σ-field may be smaller than the power set of the sample space. In particular, if the sample space is $\Omega = \mathbb{R}$, the real line, the power set is huge, and not relevant to the experiment: every possible situation would be an event. We need to deal with σ-fields that are relevant to the experiment at hand, otherwise we don't have much chance of making meaningful interpretations.

Often, we will be dealing with more than one σ-field. Let $\mathcal{F}$ and $\mathcal{G}$ be two σ-fields. If every element of $\mathcal{F}$ is also an element of $\mathcal{G}$, we express this situation by the notation $\mathcal{F} \subset \mathcal{G}$. Furthermore, if $X$ is $\mathcal{F}$-measurable, that is, $X \in \mathcal{F}$, then $X \in \mathcal{G}$.

Footnote 14: This is clearly an abuse of notation, because $\mathcal{F}$ is a collection of sets and $X$ is a function, not a set. However, this, like many well-known abuses, is standard in the theory. Such abuses are cherished attributes of probability theory. I have often said that notation abuse is one of the distinguishing characteristics of probability theory; you get used to it.


In most applications in signal detection and estimation, the $\sigma$-fields of interest will be generated by one or more random variables. Given a random variable $Y$, the $\sigma$-field generated by $Y$, denoted $\sigma\{Y\}$, is defined as the smallest $\sigma$-field with respect to which $Y$ is measurable, that is, the smallest $\sigma$-field containing sets of the form $\{\omega : a < Y(\omega) < b\}$, the inverse images under $Y$ of open intervals. By contrast, consider the smallest $\sigma$-field containing sets generated by the random variable $Y/X$, which contains sets of the form

$$\left\{\omega : \frac{a}{X} - c < Y(\omega) < \frac{a}{X} + c\right\}.$$

Furthermore, the $\sigma$-field generated by $Y$ should be distinguished from the $\sigma$-field generated by the pair of random variables $(X, Y)$, denoted $\sigma\{X, Y\}$, which is the smallest $\sigma$-field containing sets of the form $\{\omega : c < X(\omega) < d,\ a < Y(\omega) < b\}$.

It is an important fact that if one $\sigma$-field is a subset of another, say $\sigma\{Y\} \subset \sigma\{W\}$, then the random variable $Y$ must be a function of the random variable $W$. We will not prove this result, but refer the truly interested reader to [17, p. 12]. The converse is also true, namely, if there exists a function $f$ such that $Y(\omega) = f[W(\omega)]$, then $\sigma\{Y\} \subset \sigma\{W\}$.

So, after the dust settles, what's the big deal with $\sigma$-fields? In general, a $\sigma$-field is a complete description of all of the possible events that can be detected as a result of some experiment. In particular, the $\sigma$-field generated by a random variable is a complete description of the events that can be detected as a result of observing the random variable or any function of the random variable.

Consider again the die-throwing problem. The sample space is $\Omega = \{1, 2, 3, 4, 5, 6\}$, and let the function $Y$ be defined as

$$Y(\omega) = \begin{cases} 1 & \text{if } \omega \in \{2, 6\} \\ 0 & \text{if } \omega \in \{1, 3, 4\} \\ -1 & \text{if } \omega = 5. \end{cases}$$


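Recall that the $\sigma$-field generated by a function is the smallest $\sigma$-field that contains the inverse images of all possible open sets on the real line. Let $A$ be any open set in $\mathbb{R}$. The inverse image of $A$ under $Y$ is:

$$Y^{-1}[A] = \begin{cases}
\{2, 6\} & \text{if } 1 \in A,\ 0 \notin A,\ -1 \notin A \\
\{1, 2, 3, 4, 6\} & \text{if } 1 \in A,\ 0 \in A,\ -1 \notin A \\
\{2, 5, 6\} & \text{if } 1 \in A,\ 0 \notin A,\ -1 \in A \\
\{1, 3, 4\} & \text{if } 1 \notin A,\ 0 \in A,\ -1 \notin A \\
\{1, 3, 4, 5\} & \text{if } 1 \notin A,\ 0 \in A,\ -1 \in A \\
\{5\} & \text{if } 1 \notin A,\ 0 \notin A,\ -1 \in A \\
\{1, 2, 3, 4, 5, 6\} & \text{if } 1 \in A,\ 0 \in A,\ -1 \in A \\
\emptyset & \text{if } 1 \notin A,\ 0 \notin A,\ -1 \notin A.
\end{cases}$$

Since this collection of events is closed under complementation and union, it is the $\sigma$-field generated by $Y$:

$$\sigma\{Y\} = \big\{\emptyset,\ \{2, 6\},\ \{1, 2, 3, 4, 6\},\ \{2, 5, 6\},\ \{1, 3, 4\},\ \{1, 3, 4, 5\},\ \{5\},\ \{1, 2, 3, 4, 5, 6\}\big\}.$$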
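As a quick check of this construction, the following Python sketch enumerates the $\sigma$-field generated by $Y$ for the die example. It relies on the fact that, on a finite sample space, the generated $\sigma$-field consists of all unions of the level sets of $Y$. The names `omega`, `Y`, and `sigma_field` are illustrative choices, not part of the notes.

```python
from itertools import combinations

# Sample space and the function Y from the die example.
omega = frozenset({1, 2, 3, 4, 5, 6})
Y = {1: 0, 2: 1, 3: 0, 4: 0, 5: -1, 6: 1}


def sigma_field(Y, omega):
    """Return the sigma-field generated by Y on a finite sample space.

    On a finite space this is the set of all unions of the level sets
    (atoms) of Y, including the empty union.
    """
    atoms = [frozenset(w for w in omega if Y[w] == v) for v in set(Y.values())]
    events = set()
    for r in range(len(atoms) + 1):
        for combo in combinations(atoms, r):
            events.add(frozenset().union(*combo))
    return events


events = sigma_field(Y, omega)
for event in sorted(events, key=lambda e: (len(e), sorted(e))):
    print(sorted(event))
```

An event can be detected from $Y$ exactly when it appears in `events`; membership can be tested with, for example, `frozenset({2, 6}) in events`.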

Now if we are given this $\sigma$-field, what events can be detected? The event "even and not 4" $= \{2, 6\}$ is a member of $\sigma\{Y\}$, and so is the event "not 5" $= \{1, 2, 3, 4, 6\}$, but the event "less than 5" $= \{1, 2, 3, 4\}$ is not in $\sigma\{Y\}$, and cannot be detected. In other words, no matter what value $Y$ assumes, there is no way for us to ascertain that the event "less than 5" occurred (although we can know whether or not the event "less than 5 but not 2" $= \{1, 3, 4\}$ occurred). Let $X$ be the function given by

$$X(\omega) = k \quad \text{if } \omega = k, \qquad k = 1, 2, \ldots, 6.$$

For this problem, the $\sigma$-field generated by $X$ is the power set of $\Omega$. Now suppose $Y$ is observed. What can we say about $X$? In other words, what are the values you would expect $X$ to assume, given that you knew what values $Y$ assumed? Clearly, this would be a function of $Y$; for the time being, let us call this function $\phi(Y)$. Since $\phi(Y)$ is a function of $Y$, the $\sigma$-field generated by this function must be a subset of the $\sigma$-field generated by $Y$, that is, $\sigma\{\phi(Y)\} \subset \sigma\{Y\}$.

The above example involves only finitely many events both in $X$ and $Y$, so it is straightforward to calculate the conditional expectation of $X$ given $Y$ via Bayes rule:

$$f_{X|Y}(x \mid y) = \frac{f_{Y|X}(y \mid x)\, f_X(x)}{f_Y(y)}.$$

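Let us assume that the die is fair, that is, $f_X(i) = \frac{1}{6}$, $i = 1, \ldots, 6$. The conditional probability of $Y = y$ given $X = x$ is easy to obtain:

$$f_{Y|X}(1 \mid x) = \begin{cases} 1 & x \in \{2, 6\} \\ 0 & \text{otherwise,} \end{cases} \qquad
f_{Y|X}(0 \mid x) = \begin{cases} 1 & x \in \{1, 3, 4\} \\ 0 & \text{otherwise,} \end{cases} \qquad
f_{Y|X}(-1 \mid x) = \begin{cases} 1 & x = 5 \\ 0 & \text{otherwise.} \end{cases}$$

Since the die is fair, it is easy to see that

$$f_Y(1) = \frac{1}{3}, \qquad f_Y(0) = \frac{1}{2}, \qquad f_Y(-1) = \frac{1}{6},$$

and that, consequently, the conditional expectation of $X$ given $Y = y$ is obtained as

$$E[X \mid Y = y] = \sum_{i=1}^{6} i\, f_{X|Y}(i \mid y),$$

yielding

$$E(X \mid Y = 1) = \sum_{i=1}^{6} i\, f_{X|Y}(i \mid Y = 1) = 2 \cdot \frac{1}{2} + 6 \cdot \frac{1}{2} = 4,$$

$$E(X \mid Y = 0) = \sum_{i=1}^{6} i\, f_{X|Y}(i \mid Y = 0) = 1 \cdot \frac{1}{3} + 3 \cdot \frac{1}{3} + 4 \cdot \frac{1}{3} = \frac{8}{3},$$

$$E(X \mid Y = -1) = \sum_{i=1}^{6} i\, f_{X|Y}(i \mid Y = -1) = 1 \cdot 5 = 5.$$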
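The three conditional expectations above can be reproduced with a few lines of Python. This is only a sketch of the calculation for the fair-die example, using exact rational arithmetic; the names `Y`, `p`, and `cond_expectation` are illustrative and do not come from the notes.

```python
from fractions import Fraction

# Y(omega) from the die example; X(omega) = omega and the die is fair.
Y = {1: 0, 2: 1, 3: 0, 4: 0, 5: -1, 6: 1}
p = {w: Fraction(1, 6) for w in Y}


def cond_expectation(y):
    """E[X | Y = y] = sum_i i * f_{X|Y}(i | y), computed via Bayes rule."""
    f_y = sum(p[w] for w in Y if Y[w] == y)           # marginal f_Y(y)
    joint_sum = sum(w * p[w] for w in Y if Y[w] == y)  # sum_i i * f_{XY}(i, y)
    return joint_sum / f_y


for y in (1, 0, -1):
    print(y, cond_expectation(y))
```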


Calculations such as this are fine for situations involving probability mass functions, because we don't have to take limits. But, as we saw earlier, taking limits can be a problem. This motivates us to consider an alternative way to define conditional expectation: a definition that does not require the specification of a conditional distribution function.

7.3 Conditioning on a $\sigma$-field

Given a random variable $X$ satisfying the condition $E|X| < \infty$ (this condition can be relaxed in various ways, but we don't need to worry about that now), the conditional expectation of $X$ given the $\sigma$-field $\mathcal{F} = \sigma\{Y\}$ is defined as a random variable, written variously as $E^{\mathcal{F}} X$, $E[X \mid \mathcal{F}]$, or $E[X \mid Y]$, such that

1. $E[X \mid \mathcal{F}]$ is an $\mathcal{F}$-measurable function; that is, sets of the form $\{\omega : a < E[X \mid \mathcal{F}](\omega) < b\}$ are elements of $\mathcal{F}$.

2. The random variable $X - E[X \mid \mathcal{F}]$ is orthogonal (see footnote 15) to all $\mathcal{F}$-measurable functions; that is, $E[(X - E[X \mid \mathcal{F}])Z] = 0 \quad \forall\, Z \in \mathcal{F}$.

This second property is the one that makes conditional expectations useful, and we will have quite a bit to say about this as we progress through the course.

Viewed as a random variable (that is, a function of $\omega$), the conditional expectation for the six-sided die example is easily seen to be

$$E[X \mid Y](\omega) = \begin{cases} 4 & \omega \in \{2, 6\} \\ \frac{8}{3} & \omega \in \{1, 3, 4\} \\ 5 & \omega = 5. \end{cases}$$

Let's pause a moment and examine some differences between this and the definition of conditional expectation in terms of conditional distributions.

The definition in terms of a conditional distribution is constructive, in that one is actually able to compute the conditional expectation with the conditional distribution.

Footnote 15: Recall that orthogonality is defined in terms of the inner product of two random variables as $\langle X, Y \rangle = E[XY]$.


The definition in terms of $\sigma$-fields is not constructive. The definition is provided in terms of properties that the conditional expectation must possess, but does not point to a way to compute the conditional expectation. This situation is somewhat similar, at least in spirit, to the situation with differential equations. You may recall that, when considering equations of the form $\dot{x} = f(x, u)$, all the theory provides is theorems regarding existence and uniqueness; it does not tell us how to find the solution. This is not to say, however, that the properties of conditional expectations cannot be used to identify solutions; they just cannot generally be used to construct them. Take, for example, the Wiener filter. Recall that orthogonality is the key property used to identify the solution, but Wiener and Hopf had to be very creative to find a way to solve the resulting equation.

Of course, if one can construct the conditional density or mass function, one certainly may use it to compute the conditional expectation. But, by exploiting the properties of conditional expectations, one may be able to develop ways to construct the conditional expectation without first constructing the conditional density. Remember, the conditional expectation is just the first moment of the conditional density, and one may not need all of the information that the conditional density provides in order to compute the conditional expectation. Sometimes we can obtain all of the information we need by exploiting the properties of moments of distributions, rather than requiring complete knowledge of the distribution.

The conditional expectation defined in terms of a conditional distribution is, fundamentally, a number; that is, it is computed for each event $Y = y$. It may be viewed as a function by computing its value for each possible value of $y$. With this extension, we can think of conditional expectation as a function of $Y$, and thus as a random variable.

The conditional expectation defined in terms of a $\sigma$-field is, fundamentally, a random variable. If that $\sigma$-field is generated by a random variable $Y$, then the conditional expectation is a function of $Y$, and can be evaluated for each event $Y = y$. With this restriction, conditional expectation may be viewed as a number (that is, it assumes the value corresponding to the inverse image of the event $Y = y$).

Theoretically speaking, conditional expectations are generally more significant than conditional densities (whose existence often requires stronger conditions). To show that conditional expectations exist requires some deeper theory (specifically, the Radon-Nikodym theorem), but for most applications it is enough to know the main properties of conditional expectations, which are

1. If $X \in \mathcal{F}$ then $E[X \mid \mathcal{F}] = X$.

2. $E[E[X \mid \mathcal{F}]] = EX$.

3. If $Z \in \mathcal{F}$ then $E[ZX \mid \mathcal{F}] = Z\,E[X \mid \mathcal{F}]$.

4. If $\mathcal{F} \subset \mathcal{G}$ then $E[X \mid \mathcal{F}] = E[E[X \mid \mathcal{F}] \mid \mathcal{G}]$.

5. If $\mathcal{F} \subset \mathcal{G}$ then $E[X \mid \mathcal{F}] = E[E[X \mid \mathcal{G}] \mid \mathcal{F}]$.

6. Jensen's inequality: if $f(\cdot)$ is a convex function, then $E[f(X) \mid \mathcal{F}] \ge f(E[X \mid \mathcal{F}])$.

It is helpful in appreciating these properties to think of the conditional expectation $E[X \mid Y]$ as the projection of $X$ onto the subspace generated by all functions of the random variable $Y$, the projection being carried out via the inner product $\langle X, Y \rangle = E[XY]$. When conditional densities exist, these properties can also be verified by elementary calculations using Bayes rule. The important thing is that these properties also hold when densities do not exist and the definition of conditional expectations has to be less constructive. Essentially, what is done is to isolate certain important properties and then to define the conditional expectation as a random variable that has those properties.
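Several of these properties can be checked numerically on the die example. The Python sketch below verifies property 2, property 3, the orthogonality condition from the definition, and Jensen's inequality for the convex function $f(x) = x^2$. It assumes the fair die, so conditioning on $Y$ amounts to averaging over the level set of $Y$ that contains $\omega$; the helper names `E`, `cond_E`, and the test function `Z` are illustrative choices.

```python
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]
p = Fraction(1, 6)                      # fair die
Y = {1: 0, 2: 1, 3: 0, 4: 0, 5: -1, 6: 1}


def E(g):
    """Expectation of the random variable g(omega)."""
    return sum(p * g(w) for w in outcomes)


def cond_E(g):
    """Return omega -> E[g | Y](omega).

    Because all outcomes are equally likely, this is the plain average
    of g over the level set of Y that contains omega.
    """
    def h(w):
        level = [v for v in outcomes if Y[v] == Y[w]]
        return Fraction(sum(g(v) for v in level), len(level))
    return h


X = lambda w: w
EXY = cond_E(X)                         # E[X | Y] as a function of omega
Z = lambda w: Y[w] ** 2 + 3 * Y[w]      # an arbitrary function of Y

iterated = E(EXY) == E(X)                                           # property 2
pull_out = all(cond_E(lambda v: Z(v) * X(v))(w) == Z(w) * EXY(w)
               for w in outcomes)                                   # property 3
orthogonal = E(lambda w: (X(w) - EXY(w)) * Z(w)) == 0               # orthogonality
jensen = all(cond_E(lambda v: X(v) ** 2)(w) >= EXY(w) ** 2
             for w in outcomes)                                     # property 6

print(iterated, pull_out, orthogonal, jensen)
```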

7.4 Conditional Expectations and Least-Squares Estimation

As an example, we establish the fact that $E[X \mid Y]$ is the least-squares estimate of $X$ given $Y$. To do this, suppose $X_0$ is any other estimate of $X$, also based on information in $\sigma\{Y\}$. Then

$$E[(X - X_0)^2] = E[(X - E[X \mid Y] + E[X \mid Y] - X_0)^2]$$
$$= E[(X - E[X \mid Y])^2] + E[(E[X \mid Y] - X_0)^2] + 2E\big[(X - E[X \mid Y])(E[X \mid Y] - X_0)\big].$$

But, since both $E[X \mid Y]$ and $X_0$ are $\sigma\{Y\}$-measurable, so is their difference, and the orthogonality property ensures that the last term of the above expression is zero. It is now obvious that $E[(X - X_0)^2]$ will be minimized by choosing $X_0 = E[X \mid Y]$.

This is a very powerful result. To appreciate its value, we might contrast this result with the usual concept of least squares estimation. It is highly likely that your exposure to least squares estimation has thus far been restricted to linear least squares. By linear least squares, we mean that we deal with estimators that are linear functions of the observed quantities. Suppose we want to estimate $X$, and we observe $Y_1, \ldots, Y_n$. The linear least squares estimate of $X$ given $Y_1, \ldots, Y_n$ is a function of the form

$$\hat{X} = \sum_{i=1}^{n} a_i Y_i,$$

and the problem is to determine the values of the coefficients $a_1, \ldots, a_n$ such that

$$E\left[\left(X - \sum_{i=1}^{n} a_i Y_i\right)^2\right]$$

is minimized. By taking the derivative of this quantity with respect to the coefficients $a_1, \ldots, a_n$, setting the results to zero, and solving for the coefficients, we may obtain the linear least squares estimate (llse) of $X$. This quantity, however, is not generally the same thing as the conditional expectation, where we have relaxed the linearity constraint. In general, the mean-square error of the nonlinear (unconstrained) least-squares estimate will be no larger than that of the linear (constrained) least-squares estimate. This is an important result when linear estimates are not adequate. Perhaps even more importantly, however, the fact that the conditional expectation is the least-squares estimate is an important theoretical result that will guide our search for the construction of high-quality estimates.

To drive this point home, let's compute the llse of $X$ given $Y$ for the six-sided die problem discussed above. We first must compute the coefficient $a$ that minimizes the quantity $E[(X - aY)^2]$. Differentiating and equating the result to zero yields

$$a = \frac{E[XY]}{E[Y^2]}.$$

(This result is extremely important and will be seen many times throughout this course.) The numerator of this expression is given by

$$E[XY] = \sum_x \sum_y x y\, f_{XY}(x, y) = \sum_x \sum_y x y\, f_{Y|X}(y \mid x) f_X(x) = \frac{1}{6}[2 + 6 - 5] = \frac{1}{2},$$

and the denominator is

$$E[Y^2] = 1^2 \cdot \frac{1}{3} + 0^2 \cdot \frac{1}{2} + (-1)^2 \cdot \frac{1}{6} = \frac{1}{2}.$$

Thus we have $a = 1$, and the linear least squares estimate of $X$ given $Y$ is $\hat{X}_{\text{llse}} = Y$, or

$$\hat{X}_{\text{llse}} = \begin{cases} 1 & \text{if } Y = 1 \\ 0 & \text{if } Y = 0 \\ -1 & \text{if } Y = -1. \end{cases}$$

Compare this with the unconstrained least-squares estimate of $X$ given $Y$ (namely, the conditional expectation) and draw your own conclusion as to which is more reasonable!
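To put numbers on that comparison, the following Python sketch computes the llse coefficient $a = E[XY]/E[Y^2]$ for the die example and the mean-square errors of both estimates. It is a hedged illustration only: the dictionary `cond_mean` simply restates the conditional expectations computed earlier, and all variable names are illustrative.

```python
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]           # X(omega) = omega
p = Fraction(1, 6)                      # fair die
Y = {1: 0, 2: 1, 3: 0, 4: 0, 5: -1, 6: 1}

# E[X | Y = y], restated from the die example.
cond_mean = {1: Fraction(4), 0: Fraction(8, 3), -1: Fraction(5)}

# Linear least squares coefficient a = E[XY] / E[Y^2].
a = sum(p * w * Y[w] for w in outcomes) / sum(p * Y[w] ** 2 for w in outcomes)

# Mean-square error of the linear estimate a*Y and of E[X | Y].
mse_llse = sum(p * (w - a * Y[w]) ** 2 for w in outcomes)
mse_cond = sum(p * (w - cond_mean[Y[w]]) ** 2 for w in outcomes)

print("a =", a)
print("MSE of llse    =", mse_llse)
print("MSE of E[X|Y]  =", mse_cond)
```

Under these assumptions the linear estimate has mean-square error 44/3, while the conditional expectation achieves 19/9.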

8 Bayes Estimation Theory

Suppose you are to observe a random variable $X$, whose distribution depends on a parameter $\theta$. The maximum likelihood approach to estimation says that you should take as your estimate of an unknown parameter that value that is the most likely, out of all possible values of the parameter, to have given rise to the observed data. Before observations are taken, therefore, the maximum likelihood method is silent as to any predictions it would make about either the value of the parameter or the values future observations would take. Rather, the attitude of a rabid "max-like" enthusiast would be: "Wait until all of the data are collected, give them to me, be patient, and soon I will give you an estimate of what the values of the parameters were that generated the data." If you were to ask him for his best guess, before you collected the data, as to what values would be assumed by either the data or the parameters, his response would simply be: "Don't be ridiculous."

On the other hand, a Bayesian would be all too happy to give you estimates, both before and after the data have been obtained. Before the observation, she would give you, perhaps, the mean value of the a priori distribution of the parameter, and after the data were collected she would give you the mean value of the a posteriori distribution of the parameter. She would offer, as predicted values of the observations, the mean value of the conditional distribution of $X$ given the expected value of $\theta$ (based on the a priori distribution).

Some insight may be gained into how the prior distribution enters into the problem of estimation through the following example.

Example 8-1 Let $X_1, \ldots, X_m$ denote a random sample of size $m$ from the normal distribution $N(\theta, \sigma^2)$. Suppose $\sigma$ is known, and we wish to estimate $\theta$. We are given the prior density $N(\vartheta_0, \sigma_\theta^2)$, that is,

$$f_\theta(\vartheta) = \frac{1}{\sqrt{2\pi}\,\sigma_\theta} \exp\left[-\frac{(\vartheta - \vartheta_0)^2}{2\sigma_\theta^2}\right].$$

Before getting involved in deep Bayesian principles, let's just think about ways we could use this prior information.

1. We could consider computing the maximum likelihood estimate of $\theta$ (which we saw earlier is just the sample average) and then simply averaging this result with the mean value of the prior distribution, yielding

$$\hat{\theta}_a = \frac{\vartheta_0 + \hat{\theta}_{ML}}{2}.$$

This naive approach, while it factors in the prior information, gives equal weight to the prior information as compared to all of the direct observations. Such a result might be hard to justify, especially if the data quality is high.

2. We could treat $\vartheta_0$ as one extra "data" point and average it in with all of the other $x_i$'s, yielding

$$\hat{\theta}_b = \frac{\vartheta_0 + \sum_{i=1}^{m} x_i}{m + 1}.$$

This approach has a very nice intuitive appeal; we simply treat the a priori information in exactly the same way as we do the real data. $\hat{\theta}_b$ is therefore perhaps more reasonable than $\hat{\theta}_a$, but it still suffers a drawback: $\vartheta_0$ is treated as being exactly equal in informational content to each of the $x_i$'s, whether or not $\sigma_\theta^2$ equals $\sigma^2$.

3. We could take a weighted average of the a priori mean and the maximum likelihood estimate, each weighted in inverse proportion to its variance, yielding

$$\hat{\theta}_c = \frac{\dfrac{\vartheta_0}{\sigma_\theta^2} + \dfrac{\hat{\theta}_{ML}}{\sigma_{ML}^2}}{\dfrac{1}{\sigma_\theta^2} + \dfrac{1}{\sigma_{ML}^2}},$$

where $\sigma_{ML}^2$ is the variance of $\hat{\theta}_{ML}$, and is given by

$$\sigma_{ML}^2 = E\left[\left(\frac{1}{m}\sum_{i=1}^{m} X_i - \theta\right)^2\right].$$

To calculate the above expectation, we temporarily take off our Bayesian hat and put on our max-like hat, view $\theta$ as simply an unknown parameter, and take the expectation with respect to the random variables $X_i$ only. In so doing, it follows after some manipulations that $\sigma_{ML}^2 = \sigma^2/m$. Consequently,

$$\hat{\theta}_c = \frac{\sigma^2/m}{\sigma_\theta^2 + \sigma^2/m}\,\vartheta_0 + \frac{\sigma_\theta^2}{\sigma_\theta^2 + \sigma^2/m}\,\hat{\theta}_{ML}.$$ (8-1)

The estimate $\hat{\theta}_c$ seems to incorporate all of the information, both a priori and a posteriori, that we have about $\theta$. We see that, as $m$ becomes large, the a priori information is forgotten, and the maximum likelihood portion of the estimator dominates. We also see that if $\sigma_\theta^2 \ll \sigma^2$, then the a priori information tends to dominate.

The estimate provided by $\hat{\theta}_c$ appears to be, of the three we have presented, the one most worthy of our attention. We shall eventually see that it is indeed a Bayesian estimate.
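The three candidate estimates of Example 8-1 are easy to compare side by side. The following Python sketch is a minimal illustration under the assumptions of the example (known $\sigma^2$, normal prior with mean $\vartheta_0$ and variance $\sigma_\theta^2$); the function name `three_estimates` and its argument names are hypothetical.

```python
def three_estimates(x, theta0, sigma2, sigma2_theta):
    """Return (theta_a, theta_b, theta_c) from Example 8-1.

    x            : list of observations x_1, ..., x_m
    theta0       : prior mean
    sigma2       : known variance of each observation
    sigma2_theta : prior variance
    """
    m = len(x)
    theta_ml = sum(x) / m                       # sample average
    theta_a = (theta0 + theta_ml) / 2           # equal weight on prior and ML
    theta_b = (theta0 + sum(x)) / (m + 1)       # prior mean as one extra data point
    sigma2_ml = sigma2 / m                      # variance of the sample average
    theta_c = ((theta0 / sigma2_theta + theta_ml / sigma2_ml)
               / (1 / sigma2_theta + 1 / sigma2_ml))
    return theta_a, theta_b, theta_c


print(three_estimates([1.0, 2.0, 3.0], theta0=0.0, sigma2=1.0, sigma2_theta=1.0))
print(three_estimates([1.0, 2.0, 3.0], theta0=0.0, sigma2=1.0, sigma2_theta=100.0))
```

Note that when $\sigma_\theta^2 = \sigma^2$ the weighted estimate $\hat{\theta}_c$ reduces to $\hat{\theta}_b$, which is consistent with the remark that $\hat{\theta}_b$ treats the prior mean as carrying exactly the information of one observation.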

8.1 Bayes Risk

The starting point for Bayesian estimation, as it was for Bayesian detection, is the specification of a loss function and the calculation of the Bayes risk. Recall that the cost function is a function of the state of nature and the decision function, that is, it is of the general form $L[\theta, \delta(X)]$. For our development, we will restrict the structure of the loss function to be a function of the difference, that is, to be of the form $L[\theta - \delta(X)]$. Although this restricts us to only a small subset of all possible loss functions, we will see that it still leads us to some very interesting and useful results.

We will examine three different cost functionals: (a) squared error, (b) absolute value of error, and (c) uniform cost. Of these, the squared error criterion will emerge as being the most important and deserving of study.

We saw earlier (see (5-24)) that, under appropriate regularity conditions, we may reverse the order of integration in the calculation of the Bayes risk function to obtain

$$r(F_\theta, \delta) = \int_{\mathcal{X}} \left[\int_{\Theta} L[\vartheta, \delta(x)]\, f_{\theta|X}(\vartheta \mid x)\, d\vartheta\right] f_X(x)\, dx,$$

and noted that we could minimize the Bayes risk by minimizing the inner integral for each $x$ separately; that is, we may find, for each $x$, the action, call it $\delta(x)$, that minimizes

$$\int_{\Theta} L[\vartheta, \delta(x)]\, f_{\theta|X}(\vartheta \mid x)\, d\vartheta.$$

In other words, the Bayes decision rule minimizes the posterior conditional expected loss, given the observations. Let us now examine the structure of the Bayes rule under the three cost functionals we have defined.

Squared Error Loss

Let us first consider squared error loss, and introduce the concept via the following example.


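Example 8-2 Consider the estimation problem in which $\Theta = \Delta = (0, \infty)$ and $L(\theta, \delta) = (\theta - \delta)^2$. Suppose we observe the value of a random variable $X$ having a uniform distribution on the interval $(0, \theta)$ with density

$$f_{X|\theta}(x \mid \vartheta) = \begin{cases} 1/\vartheta & \text{if } 0 < x < \vartheta \\ 0 & \text{otherwise.} \end{cases}$$

Note that we may write

$$f_{X|\theta}(x \mid \vartheta) = \frac{1}{\vartheta} I_{(0,\vartheta)}(x) = \frac{1}{\vartheta} I_{(x,\infty)}(\vartheta).$$

We are to find a Bayes rule with respect to the prior distribution $F_\theta$ with density

$$f_\theta(\vartheta) = \begin{cases} \vartheta e^{-\vartheta} & \text{if } \vartheta > 0 \\ 0 & \text{otherwise.} \end{cases}$$

The joint density of $X$ and $\theta$ is, therefore,

$$f_{X\theta}(x, \vartheta) = f_{X|\theta}(x \mid \vartheta) f_\theta(\vartheta) = \frac{1}{\vartheta} I_{(x,\infty)}(\vartheta)\, \vartheta e^{-\vartheta} = I_{(x,\infty)}(\vartheta)\, e^{-\vartheta},$$

and the marginal distribution of $X$ has the density

$$f_X(x) = \int_0^\infty f_{X\theta}(x, \vartheta)\, d\vartheta = \begin{cases} e^{-x} & \text{if } x > 0 \\ 0 & \text{otherwise.} \end{cases}$$

Hence, the posterior distribution of $\theta$, given $X = x$, has the density

$$f_{\theta|X}(\vartheta \mid x) = \frac{f_{X\theta}(x, \vartheta)}{f_X(x)} = \begin{cases} e^{x - \vartheta} & \text{if } \vartheta > x \\ 0 & \text{otherwise,} \end{cases}$$

where $x > 0$. The posterior expected loss, given $X = x$, is

$$E[L(\theta, \delta) \mid X = x] = e^{x} \int_x^\infty (\vartheta - \delta)^2 e^{-\vartheta}\, d\vartheta.$$

To find the $\delta$ that minimizes this expected loss, we may set the derivative with respect to $\delta$ to zero:

$$\frac{d}{d\delta} E[L(\theta, \delta) \mid X = x] = -2e^{x} \int_x^\infty (\vartheta - \delta) e^{-\vartheta}\, d\vartheta = 0.$$

This implies

$$\delta(x) = \frac{\int_x^\infty \vartheta e^{-\vartheta}\, d\vartheta}{\int_x^\infty e^{-\vartheta}\, d\vartheta} = \frac{(x + 1)e^{-x}}{e^{-x}} = x + 1.$$

This, therefore, is a Bayes decision rule with respect to $F_\theta$: if $X = x$ is observed, then the estimate of $\theta$ is $x + 1$.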
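The closed-form answer $\delta(x) = x + 1$ can be sanity-checked numerically. The Python sketch below approximates the mean of the posterior density $e^{x - \vartheta}$, $\vartheta > x$, with a trapezoidal rule on a truncated interval. The truncation width `upper` and the step count `n` are arbitrary choices made for this illustration, and the function name `posterior_mean` is hypothetical.

```python
import math


def posterior_mean(x, upper=60.0, n=200_000):
    """Approximate the integral of t * exp(x - t) over t in (x, x + upper).

    Uses the trapezoidal rule; the tail beyond x + upper is negligible
    because the integrand decays like exp(-upper).
    """
    h = upper / n
    total = 0.0
    for k in range(n + 1):
        t = x + k * h
        weight = 0.5 if k in (0, n) else 1.0   # trapezoid end-point weights
        total += weight * t * math.exp(x - t) * h
    return total


for x in (0.5, 2.0, 10.0):
    print(x, posterior_mean(x), x + 1)
```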


The problem of point estimation of a real parameter, using quadratic loss, occurs so frequently in engineering applications that it is worthwhile to make the following observation. The posterior expected loss, given $X = x$, for a quadratic loss function at $\delta$ is the second moment about $\delta$ of the posterior distribution of $\theta$ given $x$. That is,

$$E[L(\theta, \delta) \mid X = x] = \int_{-\infty}^{\infty} (\vartheta - \delta)^2 f_{\theta|X}(\vartheta \mid x)\, d\vartheta.$$

Exercise 8-1 Show that

$$E[L(\theta, \delta) \mid X = x] = \int_{-\infty}^{\infty} (\vartheta - \delta)^2 f_{\theta|X}(\vartheta \mid x)\, d\vartheta$$

is minimized by taking $\delta$ as the mean of the posterior distribution, that is, $\delta(x) = \hat{\theta} = E(\theta \mid X = x)$.

This result is important enough to state as a general rule:

Rule. In the problem of estimating a real parameter $\theta$ with quadratic loss, a Bayes decision rule with respect to a given prior distribution for $\theta$ is the mean of the posterior distribution of $\theta$, given the observations. The resulting estimate is termed the mean square estimate of $\theta$, and is denoted $\hat{\theta}_{MS}$.

Absolute Error Loss

Another important loss function is the absolute value of the difference, $L(\theta, \delta) = |\theta - \delta|$. The Bayes risk is minimized by minimizing

$$E[L(\theta, \delta) \mid X = x] = \int_{-\infty}^{\infty} |\vartheta - \delta|\, f_{\theta|X}(\vartheta \mid x)\, d\vartheta.$$

Exercise 8-2 Show that

$$E[L(\theta, \delta) \mid X = x] = \int_{-\infty}^{\infty} |\vartheta - \delta|\, f_{\theta|X}(\vartheta \mid x)\, d\vartheta$$

is minimized by taking $\delta(x) = \hat{\theta} = \operatorname{median} f_{\theta|X}(\vartheta \mid x)$; that is, the Bayes rule corresponding to the absolute error criterion is to take $\delta$ as the median of the posterior distribution of $\theta$, given $X = x$.

This result is also important enough to state as a general rule:

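Rule. In the problem of estimating a real parameter $\theta$ with absolute error loss, a Bayes decision rule with respect to a given prior distribution for $\theta$ is the median of the posterior distribution of $\theta$, given the observations. The resulting estimate is termed the absolute error estimate of $\theta$, and is denoted $\hat{\theta}_{ABS}$.

Uniform Cost

The loss function associated with uniform cost is defined as

$$L(\theta, \delta) = \begin{cases} 0 & \text{if } |\theta - \delta| \le \epsilon/2 \\ 1 & \text{if } |\theta - \delta| > \epsilon/2. \end{cases}$$

In other words, an error less than $\epsilon/2$ is as good as no error, and if the error is greater than $\epsilon/2$, we assign a uniform cost. The Bayes risk is minimized by minimizing

$$\int_{-\infty}^{\infty} L(\vartheta, \delta) f_{\theta|X}(\vartheta \mid x)\, d\vartheta = \int_{-\infty}^{\delta - \epsilon/2} f_{\theta|X}(\vartheta \mid x)\, d\vartheta + \int_{\delta + \epsilon/2}^{\infty} f_{\theta|X}(\vartheta \mid x)\, d\vartheta = 1 - \int_{\delta - \epsilon/2}^{\delta + \epsilon/2} f_{\theta|X}(\vartheta \mid x)\, d\vartheta.$$

Consequently, the Bayes risk is minimized when the integral

$$\int_{\delta - \epsilon/2}^{\delta + \epsilon/2} f_{\theta|X}(\vartheta \mid x)\, d\vartheta$$

is maximized.

Exercise 8-3 Show that

$$\int_{\delta - \epsilon/2}^{\delta + \epsilon/2} f_{\theta|X}(\vartheta \mid x)\, d\vartheta$$

is maximized when $\delta$ is the midpoint of what we might call the "modal interval of length $\epsilon$." Define "modal interval of length $\epsilon$" so that this makes sense, and state a rule for finding Bayes rules using this loss function.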

8.2 MAP Estimates

Of particular interest with the uniform cost function is the case in which $\epsilon$ is arbitrarily small but nonzero. In this case, it is evident that this integral is maximized when $\delta$ assumes the value at which the posterior density $f_{\theta|X}(\vartheta \mid x)$ is maximized.


Definition. The mode of a distribution is that value that maximizes the probability density function.

Definition. The value of $\vartheta$ that maximizes the a posteriori density (that is, the mode of the posterior density) is called the maximum a posteriori probability (MAP) estimate of $\theta$.

If the posterior density of $\theta$ given $X$ is unimodal and symmetric, then it is easy to see that the MAP estimate and the mean square estimate coincide, for then the posterior density attains its maximum value at its expectation. Furthermore, under these circumstances, the median also coincides with the mode and the expectation. Thus, if we are lucky enough to be dealing with such distributions, the various estimates all coincide.

Although we eschewed, in the development of maximum likelihood estimation theory, the characterization of $\theta$ as being random, we may gain some valuable understanding of the maximum likelihood estimate by considering $\theta$ to be a random variable whose prior distribution is so dispersed (that is, has such a large variance) that the information provided by the prior is vanishingly small. If the theory is consistent, we would have a right to expect that the maximum likelihood estimate would be the limiting case of such a Bayesian estimate.

Let $\theta$ be considered as a random variable distributed according to the a priori density $f_\theta(\vartheta)$. The a posteriori distribution for $\theta$, then, is given by

$$f_{\theta|X}(\vartheta \mid x) = \frac{f_{X|\theta}(x \mid \vartheta)\, f_\theta(\vartheta)}{f_X(x)}.$$ (8-2)

If the logarithm of the a posteriori density is differentiable with respect to $\vartheta$, then the MAP estimate is given by the solution to

$$\left.\frac{\partial \log f_{\theta|X}(\vartheta \mid x)}{\partial \vartheta}\right|_{\vartheta = \hat{\theta}_{MAP}} = 0.$$ (8-3)

This equation is called the MAP equation. Taking the logarithm of (8-2) yields

$$\log f_{\theta|X}(\vartheta \mid x) = \log f_{X|\theta}(x \mid \vartheta) + \log f_\theta(\vartheta) - \log f_X(x),$$

and since $f_X(x)$ is not a function of $\vartheta$, the MAP equation becomes

$$\frac{\partial \log f_{\theta|X}(\vartheta \mid x)}{\partial \vartheta} = \frac{\partial \log f_{X|\theta}(x \mid \vartheta)}{\partial \vartheta} + \frac{\partial \log f_\theta(\vartheta)}{\partial \vartheta} = 0.$$ (8-4)

Comparing (8-4) to the standard maximum likelihood equation

$$\left.\frac{\partial L(\vartheta, x)}{\partial \vartheta}\right|_{\vartheta = \hat{\theta}_{ML}} = 0,$$

we see that the two expressions differ by

$$\frac{\partial \log f_\theta(\vartheta)}{\partial \vartheta}.$$

If $f_\theta(\vartheta)$ is sufficiently "flat" (that is, if the variance is very large), its logarithm will also be flat, so the gradient of the logarithm will be nearly zero, and the a posteriori density will be maximized, in the limiting case, at the maximum likelihood estimate.

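Example 8-3 Let $X_1, \ldots, X_m$ denote a random sample of size $m$ from the normal distribution $N(\theta, \sigma^2)$. Suppose $\sigma$ is known, and we wish to find the MAP estimate for the mean, $\theta$. The joint density function for $X_1, \ldots, X_m$ is

$$f_{X_1, \ldots, X_m|\theta}(x_1, \ldots, x_m \mid \vartheta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x_i - \vartheta)^2}{2\sigma^2}\right].$$

Suppose $\theta$ is distributed $N(0, \sigma_\theta^2)$, that is,

$$f_\theta(\vartheta) = \frac{1}{\sqrt{2\pi}\,\sigma_\theta} \exp\left[-\frac{\vartheta^2}{2\sigma_\theta^2}\right].$$

Straightforward manipulation yields

$$\frac{\partial \log f_{\theta|X}(\vartheta \mid x)}{\partial \vartheta} = \frac{1}{\sigma^2} \sum_{i=1}^{m} (x_i - \vartheta) - \frac{\vartheta}{\sigma_\theta^2}.$$

Equating this expression to zero and solving for $\vartheta$ yields

$$\hat{\theta}_{MAP} = \frac{\sigma_\theta^2}{\sigma_\theta^2 + \sigma^2/m} \cdot \frac{1}{m} \sum_{i=1}^{m} x_i.$$

Now, it is clear that as $\sigma_\theta^2 \to \infty$, the limiting expression is the maximum likelihood estimate $\hat{\theta}_{ML}$. It is also true that, as $m \to \infty$, the MAP estimate asymptotically approaches the ML estimate. Thus, as the knowledge about $\theta$ from the prior distribution tends to zero, or as the amount of data becomes overwhelming, the MAP estimate converges to the maximum likelihood estimate.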
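The two limiting behaviors described in Example 8-3 can be observed directly. The following Python sketch evaluates the closed-form MAP estimate for a zero-mean normal prior; the function name `theta_map` and its argument names are illustrative assumptions, not notation from the notes.

```python
def theta_map(x, sigma2, sigma2_theta):
    """MAP estimate of a normal mean under a N(0, sigma2_theta) prior.

    Implements theta_MAP = sigma2_theta / (sigma2_theta + sigma2/m) * mean(x).
    """
    m = len(x)
    theta_ml = sum(x) / m                      # maximum likelihood estimate
    shrink = sigma2_theta / (sigma2_theta + sigma2 / m)
    return shrink * theta_ml


x = [1.0, 2.0, 3.0]
# A tight prior pulls the estimate toward the prior mean (zero) ...
print(theta_map(x, sigma2=1.0, sigma2_theta=0.01))
# ... while a very diffuse prior leaves essentially the sample mean.
print(theta_map(x, sigma2=1.0, sigma2_theta=1e12))
```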

8.3 Conjugate Prior Distributions

In general, the marginal density $f_X(x)$ and the posterior density $f_{\theta|X}(\vartheta \mid x)$ are not easily calculated. We are interested in establishing conditions on the structure of the distributions involved that ensure tractability in the calculation of the posterior distribution.

Definition. Let $\mathcal{F}$ denote a class of conditional density functions $f_{X|\theta}$, indexed by $\vartheta$ as $\vartheta$ ranges over all the values in $\Theta$. A class $\mathcal{P}$ of distributions is said to be a conjugate family for $\mathcal{F}$ if $f_{\theta|X} \in \mathcal{P}$ for all $f_{X|\theta} \in \mathcal{F}$ and all $f_\theta \in \mathcal{P}$. In other words, a family of distributions is a conjugate family if it contains both the prior and the posterior density for all possible conditional densities. A conjugate family is said to be closed under sampling.

A significant part of the Bayesian literature has been devoted to finding conjugate families. We give some examples of conjugate families, stated without proof (for proofs, see [2]), except for the most important conjugate family, at least insofar as engineering is concerned: the normal distribution.

Example 8-4 Suppose that $X_1, \ldots, X_m$ is a random sample from a Bernoulli distribution with parameter $0 \le \theta \le 1$ with density

$$f_{X|\theta}(x \mid \vartheta) = \begin{cases} \vartheta^{x}(1 - \vartheta)^{1 - x} & x \in \{0, 1\} \\ 0 & \text{otherwise.} \end{cases}$$

Suppose also that the prior distribution of $\theta$ is a beta distribution with parameters $\alpha > 0$ and $\beta > 0$, with density

$$f_\theta(\vartheta) = \begin{cases} \dfrac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \vartheta^{\alpha - 1}(1 - \vartheta)^{\beta - 1} & 0 < \vartheta < 1 \\ 0 & \text{otherwise.} \end{cases}$$

Then the posterior distribution of $\theta$ when $X_i = x_i$, $i = 1, \ldots, m$, is a beta distribution with parameters $\alpha + y$ and $\beta + m - y$, where $y = \sum_{i=1}^{m} x_i$.

Example 8-5 Suppose that $X_1, \ldots, X_m$ is a random sample from a Poisson distribution with parameter $\theta > 0$ with density

$$f_{X|\theta}(x \mid \vartheta) = \begin{cases} \dfrac{e^{-\vartheta}\vartheta^{x}}{x!} & x = 0, 1, 2, \ldots \\ 0 & \text{otherwise.} \end{cases}$$

Suppose also that the prior distribution of $\theta$ is a gamma distribution with parameters $\alpha > 0$ and $\beta > 0$, with density

$$f_\theta(\vartheta) = \begin{cases} \dfrac{\beta^{\alpha}}{\Gamma(\alpha)}\, \vartheta^{\alpha - 1} e^{-\beta\vartheta} & \vartheta > 0 \\ 0 & \text{otherwise.} \end{cases}$$

Then the posterior distribution of $\theta$ when $X_i = x_i$, $i = 1, \ldots, m$, is a gamma distribution with parameters $\alpha + y$ and $\beta + m$, where $y = \sum_{i=1}^{m} x_i$.

Example 8-6 Suppose that $X_1, \ldots, X_m$ is a random sample from an exponential distribution with parameter $\theta > 0$ with density

$$f_{X|\theta}(x \mid \vartheta) = \begin{cases} \vartheta e^{-\vartheta x} & x > 0 \\ 0 & \text{otherwise.} \end{cases}$$

Suppose also that the prior distribution of $\theta$ is a gamma distribution with parameters $\alpha > 0$ and $\beta > 0$, with density

$$f_\theta(\vartheta) = \begin{cases} \dfrac{\beta^{\alpha}}{\Gamma(\alpha)}\, \vartheta^{\alpha - 1} e^{-\beta\vartheta} & \vartheta > 0 \\ 0 & \text{otherwise.} \end{cases}$$

Then the posterior distribution of $\theta$ when $X_i = x_i$, $i = 1, \ldots, m$, is a gamma distribution with parameters $\alpha + m$ and $\beta + y$, where $y = \sum_{i=1}^{m} x_i$.
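Because the posterior stays in the same family, each of Examples 8-4 through 8-6 reduces to a simple update of the two hyperparameters. The Python sketch below records those three update rules exactly as stated in the examples; the function names are hypothetical, and the gamma distribution is assumed to use the rate parameterization shown in the densities above.

```python
def beta_bernoulli_update(alpha, beta, x):
    """Example 8-4: beta prior, Bernoulli samples x (each 0 or 1)."""
    y = sum(x)
    return alpha + y, beta + len(x) - y


def gamma_poisson_update(alpha, beta, x):
    """Example 8-5: gamma prior, Poisson samples x (nonnegative integers)."""
    return alpha + sum(x), beta + len(x)


def gamma_exponential_update(alpha, beta, x):
    """Example 8-6: gamma prior, exponential samples x (positive reals)."""
    return alpha + len(x), beta + sum(x)


print(beta_bernoulli_update(2, 3, [1, 0, 1, 1]))
print(gamma_poisson_update(2, 1, [3, 0, 4]))
print(gamma_exponential_update(2, 1, [0.5, 1.5]))
```

Since each update depends on the data only through $m$ and $y$, processing the sample in pieces and chaining the updates gives the same hyperparameters as processing it all at once.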

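Example 8-7 Suppose that $X_1, \ldots, X_m$ is a random sample from a normal distribution with unknown mean $\theta$ and known variance $\sigma^2$. Suppose also that the prior distribution of $\theta$ is a normal distribution with mean $\vartheta_0$ and variance $\sigma_\theta^2$. Then the posterior distribution of $\theta$ when $X_i = x_i$, $i = 1, \ldots, m$, is a normal distribution with mean

$$\hat{\theta}_c = \frac{\dfrac{\vartheta_0}{\sigma_\theta^2} + \dfrac{\bar{x}}{\sigma_m^2}}{\dfrac{1}{\sigma_\theta^2} + \dfrac{1}{\sigma_m^2}}$$ (8-5)

and variance

$$\sigma_c^2 = \frac{\sigma_\theta^2 \sigma_m^2}{\sigma_\theta^2 + \sigma_m^2},$$ (8-6)

where

$$\bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i \qquad \text{and} \qquad \sigma_m^2 = \sigma^2/m.$$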

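Due to its importance, we provide a demonstration of the above claim. For $-\infty < \vartheta < \infty$, the conditional density of $X_1, \ldots, X_m$ satisfies

$$f_{X_1 \ldots X_m|\theta}(x_1, \ldots, x_m \mid \vartheta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x_i - \vartheta)^2}{2\sigma^2}\right]$$
$$= (2\pi\sigma^2)^{-m/2} \exp\left[-\frac{1}{2\sigma^2} \sum_{i=1}^{m} (x_i - \bar{x})^2\right] \exp\left[-\frac{m}{2\sigma^2}(\vartheta - \bar{x})^2\right].$$ (8-7)

The prior density of $\theta$ satisfies

$$f_\theta(\vartheta) = \frac{1}{\sqrt{2\pi}\,\sigma_\theta} \exp\left[-\frac{(\vartheta - \vartheta_0)^2}{2\sigma_\theta^2}\right],$$ (8-8)

and the posterior density function of $\theta$ will be proportional to the product of (8-7) and (8-8). Letting the symbol $\propto$ denote proportionality, we have

$$f_{\theta|X_1, \ldots, X_m}(\vartheta \mid x_1, \ldots, x_m) \propto \exp\left[-\frac{m}{2\sigma^2}(\vartheta - \bar{x})^2\right] \exp\left[-\frac{(\vartheta - \vartheta_0)^2}{2\sigma_\theta^2}\right] = \exp\left[-\frac{(\vartheta - \bar{x})^2}{2\sigma_m^2} - \frac{(\vartheta - \vartheta_0)^2}{2\sigma_\theta^2}\right].$$

Simplifying the exponent, we obtain

$$\frac{(\vartheta - \bar{x})^2}{\sigma_m^2} + \frac{(\vartheta - \vartheta_0)^2}{\sigma_\theta^2} = \frac{\sigma_\theta^2 + \sigma_m^2}{\sigma_m^2 \sigma_\theta^2}(\vartheta - \hat{\theta}_c)^2 + \frac{1}{\sigma_\theta^2 + \sigma_m^2}(\bar{x} - \vartheta_0)^2,$$

where $\hat{\theta}_c$ is given by (8-5). Thus,

$$f_{\theta|X_1, \ldots, X_m}(\vartheta \mid x_1, \ldots, x_m) \propto \exp\left[-\frac{1}{2} \cdot \frac{\sigma_\theta^2 + \sigma_m^2}{\sigma_\theta^2 \sigma_m^2}(\vartheta - \hat{\theta}_c)^2\right].$$

Consequently, suitably normalized, we see that the posterior density of $\theta$ given $X_1, \ldots, X_m$ is normal with mean given by (8-5) and variance given by (8-6).

Upon rearranging (8-5) we see that

$$\hat{\theta}_c = \frac{\sigma_m^2}{\sigma_\theta^2 + \sigma_m^2}\,\vartheta_0 + \frac{\sigma_\theta^2}{\sigma_\theta^2 + \sigma_m^2}\,\bar{x},$$

which is exactly the same as the estimate given by (8-1). Thus, the weighted average, proposed as a reasonable way to incorporate prior information into the estimate, turns out to be exactly a Bayes estimate for the parameter, given that the prior is a member of the normal conjugate family.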
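The completing-the-square step in this demonstration is the one most prone to algebra slips, so it is worth checking numerically. The Python sketch below evaluates both sides of the exponent identity at several values of $\vartheta$ and also compares (8-5) with its rearranged weighted-average form. The function names and the particular test values are arbitrary choices for illustration.

```python
import math


def theta_c(xbar, theta0, s2m, s2t):
    """Posterior mean (8-5) with sigma_m^2 = s2m and sigma_theta^2 = s2t."""
    return (theta0 / s2t + xbar / s2m) / (1 / s2t + 1 / s2m)


def exponent_lhs(t, xbar, theta0, s2m, s2t):
    """Left side: (t - xbar)^2 / s2m + (t - theta0)^2 / s2t."""
    return (t - xbar) ** 2 / s2m + (t - theta0) ** 2 / s2t


def exponent_rhs(t, xbar, theta0, s2m, s2t):
    """Right side: completed square in t plus a term that is free of t."""
    tc = theta_c(xbar, theta0, s2m, s2t)
    return ((s2t + s2m) / (s2m * s2t) * (t - tc) ** 2
            + (xbar - theta0) ** 2 / (s2t + s2m))


xbar, theta0, s2m, s2t = 1.7, -0.4, 0.25, 2.0
for t in (-3.0, 0.0, 0.9, 5.5):
    print(t, exponent_lhs(t, xbar, theta0, s2m, s2t),
          exponent_rhs(t, xbar, theta0, s2m, s2t))

# Rearranged form of (8-5): a convex combination of theta0 and xbar.
weighted = s2m / (s2t + s2m) * theta0 + s2t / (s2t + s2m) * xbar
print(theta_c(xbar, theta0, s2m, s2t), weighted)
```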

8.4 Improper Prior Distributions

As we saw with the example developed for the MAP estimate, sometimes the prior knowledge available about a parameter is very slight when compared to the information we expect to acquire from observations. Consequently, it may not be worthwhile for us to spend a great deal of time and effort in determining a specific prior distribution. Rather, it might be useful in some circumstances to make use of a standard prior that would be suitable in many situations for which it is desirable to represent vague or uncertain prior information.

Definition. A proper density function is one whose integral over the parameter space is unity. This is the only type of density function with which we have had anything to do thus far. In fact, we know that virtually any continuous, nonnegative function whose integral over the parameter space is finite can be turned into a proper density function by dividing it by the integral.

Definition. An improper density function is a nonnegative function whose integral over the whole parameter space is infinite.

For example, if $\Theta$ is the real line and, because of vagueness, the prior distribution of $\theta$ is smooth and very widely spread out over the line, then we might find it convenient to assume a uniform, or constant, density over the whole line in order to represent this prior information. Even though this is not a proper density, we might consider formally carrying out the calculations of Bayes theorem and attempt to compute a posterior distribution. Suppose $\Theta = (-\infty, \infty)$, let $f_\theta(\vartheta) = 1$ be an improper prior for $\theta$, and suppose $X = x$ is observed. Formally applying Bayes theorem, we obtain

$$f_{\theta|X}(\vartheta \mid x) = \frac{f_{X|\theta}(x \mid \vartheta) f_\theta(\vartheta)}{\int_{-\infty}^{\infty} f_{X|\theta}(x \mid \vartheta') f_\theta(\vartheta')\, d\vartheta'} = \frac{f_{X|\theta}(x \mid \vartheta)}{\int_{-\infty}^{\infty} f_{X|\theta}(x \mid \vartheta')\, d\vartheta'}.$$

We see that, if

$$\int_{-\infty}^{\infty} f_{X|\theta}(x \mid \vartheta')\, d\vartheta' < \infty,$$ (8-9)

then the posterior density $f_{\theta|X}(\vartheta \mid x)$ is at least defined.

Example 8-8 Suppose $X_1, \ldots, X_m$ are samples from a normal population with mean $\theta$ and variance $\sigma^2$. Let $\theta$ be distributed according to an improper prior $f_\theta(\vartheta) = 1$. The conditional density of $X_1, \ldots, X_m$ given $\theta = \vartheta$ is

$$f_{X_1 \ldots X_m|\theta}(x_1, \ldots, x_m \mid \vartheta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x_i - \vartheta)^2}{2\sigma^2}\right]$$
$$= (2\pi\sigma^2)^{-m/2} \exp\left[-\frac{1}{2\sigma^2} \sum_{i=1}^{m} (x_i - \bar{x})^2\right] \exp\left[-\frac{m}{2\sigma^2}(\vartheta - \bar{x})^2\right],$$

where $\bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i$. The first exponential term in this expression is independent of $\vartheta$, and since the integral of the entire expression with respect to $\vartheta$ over $(-\infty, \infty)$ is finite, we may normalize this quantity to obtain a posterior density for $\theta$ of the form

$$f_{\theta|X_1 \ldots X_m}(\vartheta \mid x_1, \ldots, x_m) = \frac{1}{\sqrt{2\pi}\,\sigma_m} \exp\left[-\frac{(\vartheta - \bar{x})^2}{2\sigma_m^2}\right],$$

where $\sigma_m = \sigma/\sqrt{m}$. Thus, the posterior distribution of $\theta$ when $X_i = x_i$, $i = 1, \ldots, m$, is a normal distribution with mean $\bar{x}$ and variance $\sigma^2/m$. Although the prior distribution is improper, the posterior distribution is a proper normal distribution after just one observation has been made.

Under squared error loss, therefore, the Bayes estimate for $\theta$, using an improper prior, is the sample mean. Comparing this with previous results, we see that this estimate also coincides with the maximum likelihood estimate. Consequently, we may view the maximum likelihood estimate as (a) the limit of a MAP estimate as the variance of the prior distribution tends to infinity, or (b) the mean square estimate associated with an improper prior distribution.

8.5 Sequential Bayes Estimation

Thus far in our treatment of estimation, we have assumed that all of the information to be used to make a decision or estimate is available at one time. More generally, we are interested in addressing problems where the data become available as a function of time, that is, sequentially. To introduce this topic, we will consider first the case of estimating $\theta$ given two measurements, obtained at different times.

Let $\theta$ be the parameter to be estimated, and suppose $X_1$ and $X_2$ are two observed random variables. Suppose that $X_1$ and $X_2$ have a joint conditional probability density function $f_{X_1 X_2|\theta}(x_1, x_2 \mid \vartheta)$, for each $\vartheta \in \Theta$. The posterior density function of $\theta$ conditioned on $X_1 = x_1$ and $X_2 = x_2$ is

$$f_{\theta|X_1 X_2}(\vartheta \mid x_1, x_2) = \frac{f_{X_1 X_2|\theta}(x_1, x_2 \mid \vartheta) f_\theta(\vartheta)}{\int_\Theta f_{X_1 X_2|\theta}(x_1, x_2 \mid \vartheta') f_\theta(\vartheta')\, d\vartheta'}. \tag{8-10}$$

If we had both $X_1$ and $X_2$ at our disposal, then we would simply use this posterior density to form our estimate according to the loss function we choose, say, for example, squared error loss. But suppose we first observe $X_1$, and at some future time have the prospect of observing $X_2$. There are two ways we might proceed: (a) we could put $X_1$ on the shelf and wait until $X_2$ is obtained to calculate our estimate; (b) we could use $X_1$ as soon as it is obtained to estimate $\theta$ using that information only, then update that estimate once $X_2$ becomes available. Our goal is to show that these two approaches yield the same result.

We first compute the posterior distribution of $\theta$ given $X_1$ only:

$$f_{\theta|X_1}(\vartheta \mid x_1) = \frac{f_{X_1|\theta}(x_1 \mid \vartheta) f_\theta(\vartheta)}{\int_\Theta f_{X_1|\theta}(x_1 \mid \vartheta') f_\theta(\vartheta')\, d\vartheta'}. \tag{8-11}$$

We next compute the conditional distribution of $X_2$ given $\theta = \vartheta$ and $X_1 = x_1$, yielding

$$f_{X_2|\theta X_1}(x_2 \mid \vartheta, x_1) = \frac{f_{X_1 X_2|\theta}(x_1, x_2 \mid \vartheta)}{f_{X_1|\theta}(x_1 \mid \vartheta)}, \tag{8-12}$$

and compute the corresponding posterior density of $\theta$, treating $f_{\theta|X_1}$ as the prior:

$$f_{\theta|X_1 X_2}(\vartheta \mid x_1, x_2) = \frac{f_{X_2|\theta X_1}(x_2 \mid \vartheta, x_1) f_{\theta|X_1}(\vartheta \mid x_1)}{\int_\Theta f_{X_2|\theta X_1}(x_2 \mid \vartheta', x_1) f_{\theta|X_1}(\vartheta' \mid x_1)\, d\vartheta'}. \tag{8-13}$$

Substituting (8-11) and (8-12) into (8-13) yields, after some simplification, the conditional density given in (8-10). Thus we see that if the observations are received sequentially, the posterior distribution can also be computed sequentially; that is, the two-stage posterior (8-13) coincides with the batch posterior (8-10). It also follows from this derivation that if the posterior distribution of $\theta$ when $X_1 = x_1$ and $X_2 = x_2$ is computed in two stages, the final result is the same regardless of whether $X_1$ or $X_2$ is observed first.

Exercise 8-4 Show that substituting (8-11) and (8-12) into (8-13) yields (8-10).

It is straightforward to generalize this result to the case of observing $X_1, X_2, X_3, \ldots$, and sequentially updating the estimate of $\theta$ as time progresses. There is a general theory of sequential sampling, which we will not develop in this class, that treats this problem in detail. For details, see [2, 3]. Although we will not pursue sequential detection theory further in this course, we will develop the closely related subject of sequential estimation theory.
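The equivalence of batch and two-stage updating can be illustrated on a grid. This is only a sketch under assumptions that are not in the notes: a Gaussian prior with standard deviation 2, unit-variance Gaussian observations that are conditionally independent given $\theta$ (so that $f_{X_2|\theta X_1} = f_{X_2|\theta}$), and illustrative values for $x_1$ and $x_2$. The helper names `gauss` and `bayes_update` are hypothetical.

```python
import math

def gauss(x, mean, sd):
    """Normal density with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def bayes_update(prior, likelihood, step):
    """Multiply prior by likelihood pointwise and renormalize on the grid."""
    weights = [l * p for l, p in zip(likelihood, prior)]
    total = sum(weights) * step
    return [w / total for w in weights]

step = 0.01
grid = [-8.0 + step * k for k in range(1601)]
prior = [gauss(t, 0.0, 2.0) for t in grid]          # assumed prior for theta
x1, x2 = 0.8, 1.5                                   # illustrative observations
like1 = [gauss(x1, t, 1.0) for t in grid]           # f(x1 | theta) on the grid
like2 = [gauss(x2, t, 1.0) for t in grid]           # f(x2 | theta) on the grid

# Batch posterior, as in (8-10): use both observations at once.
batch = bayes_update(prior, [a * b for a, b in zip(like1, like2)], step)

# Two-stage posterior: (8-11) first, then (8-13) with the stage-one posterior as prior.
stage1 = bayes_update(prior, like1, step)
sequential = bayes_update(stage1, like2, step)

# Same computation with the observations taken in the opposite order.
reversed_order = bayes_update(bayes_update(prior, like2, step), like1, step)

gap = max(abs(a - b) for a, b in zip(batch, sequential))
gap_reversed = max(abs(a - b) for a, b in zip(batch, reversed_order))
print(gap, gap_reversed)    # both at floating-point rounding level
```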

9 Linear Estimation Theory

9.1 Introduction

The concept of least squares is probably the oldest of all mathematically based estimation techniques. The history begins with Gauss. As the story goes, in 1801 an Italian astronomer named Piazzi discovered a new planet (actually an asteroid), named Ceres, and began plotting its path against the fixed stars. His work was interrupted, however, and when he returned to it he was unable to relocate the new object. The problem came to the attention of Gauss, who wondered if there was some way to predict the location of the object from the scanty data available. To address this problem, he took the few data points at his disposal and devised a way to fit an orbital model to them. His brilliance and intuition led him to compute the values of the observations as a function of the orbital model, and then adjust the model parameters to minimize the square of the difference between these values and the actual observations. Needless to say, his scheme was successful, thus giving us another reason to admire and respect this giant of mathematics.

Although this application of least squares is perhaps the most famous one, Gauss was not the only one to discover it (his experience just makes the best story). He claimed to have discovered the technique in 1795, but the technique was also discovered independently in that same time frame by Legendre, in France, and by Robert Adrain, in the United States. Also, there is evidence that the German-Swiss physicist Johann Heinrich Lambert (1728-1777) discovered and used the method of least squares before Gauss was born. This is another example of Hutchings' Rule: "Originality usually dissolves upon inspection" (footnote 16). So if you think you are the first to discover something, it may only be a matter of time before others lay claim to having gotten there first (of course, you still may get the credit).

Footnote 16: After Brad Hutchings.

A major motivation for the additional attention to least squares and related ideas is the space program that began in earnest in the 1950s. This program developed the requirement to track satellites, predict orbital trajectories, etc. The major successes in this development are due to Kalman [9], Stratonovich [14] (Russian), and Swerling [15]. These methods are based upon the so-called Riccati equation; other approaches are possible, the most notable of which are the so-called square-root method and the Chandrasekhar method. We will not spend much time on non-Riccati methods, but you should know that the square-root method leads to perhaps the most numerically stable way to implement the Kalman filter.

Example 9-9 (Curve fitting). One way to apply least squares is as a method of fitting a curve to a given set of data. For example, suppose data pairs $(x_1, y_1), \ldots, (x_m, y_m)$ are observed, and we suspect, for physical reasons, that these values should correspond to the curve generated by the function $y = g(x)$. We may attribute deviations from this equation to measurement errors or some unmodeled phenomenon such as disturbances. If the function $g$ is parameterized by some quantity $\theta$, then we would write $y = g(x; \theta)$, and a natural course of action (provided we are endowed with some of Gauss's insight) would be to determine that value, $\hat{\theta}$, such that the squared error of the deviation from the proposed curve is minimized. That is, we want to minimize the loss function
$$L(\theta) = \sum_{i=1}^{m} \left(y_i - g(x_i; \theta)\right)^2. \tag{9-14}$$

The estimate, $\hat{\theta}$, is called the least squares estimate of $\theta$. To be specific, suppose $x_i = I_i$ represents the input current to a resistor at time $i$, and $y_i = V_i$ represents the voltage drop across the resistor. Let $\theta = R$, the resistance of the device. Since measurement errors may occur when measuring both the voltage and current, we would not expect that all (or even any) of the observational pairs $(x_i, y_i)$ would lie exactly on the line $V = RI$, even if $R$ were precisely known and Ohm's law were exactly obeyed to arbitrary precision. If we compute the least squares estimate of $R$, then (9-14) becomes

$$L(R) = \sum_{i=1}^{m} (V_i - R I_i)^2.$$

We may obtain the global minimum of this function by differentiating with respect to $R$, setting the result to zero, and solving for the corresponding value of the resistance, denoted $\hat{R}$:

$$\hat{R} = \frac{\sum_{i=1}^{m} V_i I_i}{\sum_{i=1}^{m} I_i^2}.$$
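The closed-form estimate of Example 9-9 is easy to evaluate. Below is a minimal sketch with made-up current and voltage readings for a resistor of roughly 10 ohms; the function names are hypothetical. It also evaluates the loss $L(R)$ on either side of $\hat{R}$ to confirm that the stationary point is a minimum.

```python
def least_squares_resistance(currents, voltages):
    """Return the R that minimizes sum_i (V_i - R * I_i)^2."""
    numerator = sum(v * i for v, i in zip(voltages, currents))
    denominator = sum(i * i for i in currents)
    if denominator == 0.0:
        raise ValueError("at least one nonzero current is required")
    return numerator / denominator

def loss(resistance, currents, voltages):
    """Sum of squared deviations from the line V = R * I."""
    return sum((v - resistance * i) ** 2 for v, i in zip(voltages, currents))

currents = [0.10, 0.20, 0.30, 0.40]    # amperes (illustrative values)
voltages = [1.02, 1.97, 3.05, 3.96]    # volts (illustrative values)

r_hat = least_squares_resistance(currents, voltages)
print(r_hat)                                     # about 9.98 ohms for this data
print(loss(r_hat - 0.1, currents, voltages),
      loss(r_hat, currents, voltages),
      loss(r_hat + 0.1, currents, voltages))     # the middle value is the smallest
```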


Although the method of least squares does not, strictly speaking, require any appeal to probability or statistics, the modern developments of this theory are almost always couched in a probabilistic framework. In our development, we will follow the probabilistic framework, and view the pairs (xi , yi) as samples from the population (X, Y ) where X and Y are random variables with known joint distribution, and perform the minimization in terms of the expected loss.

9.2 Minimum Mean Square Estimation (MMSE)

Suppose we have two real random variables $X$, $Y$, with a known joint density function $f_{XY}(x, y)$, and assume $Y$ is observed (measured). What can be said about the value $X$ takes? In other words, we wish to estimate $X$ given that $Y = y$. To be specific, we desire to invoke an estimation rule $\hat{X} = h(Y)$, where the random variable $\hat{X}$ is an estimate of $X$. The mapping $h: \mathbb{R} \to \mathbb{R}$ is some function only of $Y$. Thus, given $Y = y$, we will assign the value $\hat{x} = h(y)$ to the estimate of $X$.

Let us define the estimation error as $\tilde{X} = X - \hat{X}$, the difference between the true (but unknown) value of $X$ and the estimate, $\hat{X}$. Ideally, we would like $\tilde{X} \equiv 0$, but this is usually too much to hope for, since $X$ is not generally a 1-1 function of $Y$. So the best we can do is to choose $h(\cdot)$ such that $\tilde{X}$ is small in average value. Precisely, let $L$ be some nonnegative functional of $\tilde{X}$, taking values in $[0, \infty)$. Then we will attempt to choose the estimator $h$ to minimize $L(\tilde{X})$. Some candidates:

• $L(\tilde{X}) = E|\tilde{X}|$ (absolute value) weights all errors equally;

• $L(\tilde{X}) = E|\tilde{X}|^2$ (squared error) weights small errors less than large ones;

• $L(\tilde{X}) = \begin{cases} 0 & \text{if } |\tilde{X}| \le \epsilon \\ K & \text{if } |\tilde{X}| > \epsilon \end{cases}$ where $K$ and $\epsilon$ are some positive quantities;

• Lots of other rather arbitrary error functions.

A remarkable fact: the squared error function is the one deserving of the most study.

1. The mean-square estimate can be interpreted as a conditional expectation: $\hat{x} = E(X \mid Y = y)$, where $E$ denotes mathematical expectation.

2. For Gaussian random variables the mean-square estimate is a linear function of the observables, leading to easy computations.

3. Sub-optimum estimates are easy to obtain (only first and second moments, that is, mean and covariance, are required).

4. Stochastic linear mean-squares theory has many illuminating connections with control theory, including Riccati equations, matrix inversions, and observers.

5. There are also connections with martingale theory, likelihood ratios, and nonlinear estimation.

9.3 Estimation Given a Single Random Variable

The general problem is to observe one set of random variables and then infer the value of other random variables. This procedure generally requires knowledge of the joint distributions of all random variables. But we will see how to get something without all of this knowledge. We will assume a linear relationship among random variables and will employ the mean-square error criterion. Then knowledge of the joint pdf can be replaced by knowledge of only the first and second order statistics.

Let $X$ and $Y$ be two zero-mean real random variables. Suppose we wish to find an estimator of the form $\hat{X} = hY$, where $h$ is a constant chosen such that $E(X - \hat{X})^2$ is minimized. (Thus, the function $h(Y) = hY$ is linear.) To solve this problem, we expand the cost functional to obtain

$$L = E(X - \hat{X})^2 = E(X - hY)^2 = EX^2 - 2hEXY + h^2 EY^2$$

and set the derivative with respect to $h$ to zero and solve for the resulting value of $h$:

$$\frac{\partial L}{\partial h} = 2hEY^2 - 2EXY = 0$$

or

$$h = \frac{EXY}{EY^2}.$$

(The structure of $h$ is significant: it is the ratio of the cross-correlation of $X$ and $Y$ to the auto-correlation of $Y$. This structure permeates much of estimation theory.) Thus, $\hat{X} = EXY\,(EY^2)^{-1}\,Y$ and the minimum mean-square error is

$$E(X - \hat{X})^2 = EX^2 - 2\frac{EXY}{EY^2}EXY + \frac{(EXY)^2}{(EY^2)^2}EY^2 = EX^2 - \frac{(EXY)^2}{EY^2}. \tag{9-15}$$
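As a small numerical illustration of (9-15), the sketch below assumes the second moments $EX^2 = 4$, $EXY = 4$, $EY^2 = 5$ (these would arise, for instance, from $Y = X + N$ with independent zero-mean $X$ and $N$ of variances 4 and 1; the numbers are illustrative only). It computes the gain $h$, evaluates the expanded cost, and compares it with the closed form.

```python
def linear_mmse_gain(exy, eyy):
    """Optimal scalar gain h = E[XY] / E[Y^2]."""
    return exy / eyy

def mean_square_error(h, exx, exy, eyy):
    """E(X - hY)^2 expanded in terms of second moments."""
    return exx - 2.0 * h * exy + h * h * eyy

exx, exy, eyy = 4.0, 4.0, 5.0                 # assumed second moments
h = linear_mmse_gain(exy, eyy)
mmse = mean_square_error(h, exx, exy, eyy)
closed_form = exx - exy ** 2 / eyy            # right-hand side of (9-15)

print(h, mmse, closed_form)
# Any other gain gives a larger error:
print(mean_square_error(h + 0.05, exx, exy, eyy), mean_square_error(h - 0.05, exx, exy, eyy))
```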

9.4 Estimation Given Two Random Variables

Suppose we have two measurements $Y_1$ and $Y_2$ and wish to estimate $X$. We look for an estimator that is a linear combination of the observations $Y_1$ and $Y_2$, that is, has the form $\hat{X} = h_1 Y_1 + h_2 Y_2$, where $h_1$, $h_2$ are chosen to minimize the expected squared error. To solve this problem, observe that we have

$$L = E(X - h_1 Y_1 - h_2 Y_2)^2 = EX^2 + h_1^2 EY_1^2 + h_2^2 EY_2^2 - 2h_1 EXY_1 - 2h_2 EXY_2 + 2h_1 h_2 EY_1 Y_2.$$

Differentiating with respect to $h_1$ and $h_2$ and equating to zero yields

$$EXY_1 = h_1 EY_1^2 + h_2 EY_1 Y_2$$
$$EXY_2 = h_2 EY_2^2 + h_1 EY_1 Y_2$$

which, in matrix form, is

$$[EXY_1 \;\; EXY_2] = [h_1 \;\; h_2] \begin{bmatrix} EY_1^2 & EY_1 Y_2 \\ EY_1 Y_2 & EY_2^2 \end{bmatrix}.$$

Now let $\mathbf{h}^T = [h_1 \;\; h_2]$ and $\mathbf{Y} = [Y_1 \;\; Y_2]^T$, and we see that

$$\mathbf{h}^T = EX\mathbf{Y}^T \left[E\mathbf{Y}\mathbf{Y}^T\right]^{-1}.$$

Thus

$$\hat{X} = \mathbf{h}^T \mathbf{Y} = EX\mathbf{Y}^T \left[E\mathbf{Y}\mathbf{Y}^T\right]^{-1} \mathbf{Y}. \tag{9-16}$$

A natural generalization (for $Y_1, Y_2, \ldots, Y_N$) by direct proof (for example, by differentiation) is straightforward but tedious. We will investigate an alternative way that will lead to further insight.

9.5 Estimation Given N Random Variables

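Suppose we have $N$ measurements $Y_1, \ldots, Y_N$ and wish to estimate $X$. We look for an estimator of the form $\hat{X}_0 = \mathbf{k}^T \mathbf{Y}$, where $\mathbf{Y} = [Y_1 \;\; Y_2 \;\; \cdots \;\; Y_N]^T$ and $\mathbf{k}^T = [k_1 \;\; k_2 \;\; \cdots \;\; k_N]$ is chosen to minimize the expected squared error $E(X - \hat{X}_0)^2$. To solve this problem, we invoke the completion-of-squares method, and write

$$L = E(X - \hat{X}_0)^2 = E(X - \mathbf{h}^T\mathbf{Y} + \mathbf{h}^T\mathbf{Y} - \hat{X}_0)^2 = E(X - \mathbf{h}^T\mathbf{Y})^2 + E(\mathbf{h}^T\mathbf{Y} - \hat{X}_0)^2 - 2E(X - \mathbf{h}^T\mathbf{Y})(\hat{X}_0 - \mathbf{h}^T\mathbf{Y}), \tag{9-17}$$

where

$$\mathbf{h}^T = EX\mathbf{Y}^T \left[E\mathbf{Y}\mathbf{Y}^T\right]^{-1}. \tag{9-18}$$

Let $\mathbf{Z} = A\mathbf{Y}$ be any linear combination of $\mathbf{Y}$ (that is, $A$ is an arbitrary $m \times N$ matrix, with $m \ge 1$). Then

$$E(X - \mathbf{h}^T\mathbf{Y})\mathbf{Z}^T = EX\mathbf{Y}^T A^T - E\left\{EX\mathbf{Y}^T [E\mathbf{Y}\mathbf{Y}^T]^{-1} \mathbf{Y}\mathbf{Y}^T\right\} A^T = EX\mathbf{Y}^T A^T - EX\mathbf{Y}^T [E\mathbf{Y}\mathbf{Y}^T]^{-1} E\mathbf{Y}\mathbf{Y}^T A^T = 0, \tag{9-19}$$

and we see that this condition holds for all matrices $A$. Therefore, $X - \mathbf{h}^T\mathbf{Y}$ is uncorrelated with all linear combinations of $\mathbf{Y}$. In particular, $\hat{X}_0 - \mathbf{h}^T\mathbf{Y} = (\mathbf{k}^T - \mathbf{h}^T)\mathbf{Y}$ is a linear combination of the elements of $\mathbf{Y}$ and, therefore, $E(X - \mathbf{h}^T\mathbf{Y})(\hat{X}_0 - \mathbf{h}^T\mathbf{Y}) = 0$, so the third term on the right-hand side of (9-17) vanishes and, consequently,

$$E(X - \hat{X}_0)^2 = E(X - \mathbf{h}^T\mathbf{Y})^2 + E(\mathbf{h}^T\mathbf{Y} - \hat{X}_0)^2. \tag{9-20}$$

The first term on the right side of this equation does not depend on $\mathbf{k}$, and the second is nonnegative and vanishes when $\mathbf{k} = \mathbf{h}$, so the right side is minimized by setting $\mathbf{k} = \mathbf{h}$. Thus, the general solution for this problem is

$$\hat{X} = \mathbf{h}^T\mathbf{Y} = EX\mathbf{Y}^T \left[E\mathbf{Y}\mathbf{Y}^T\right]^{-1} \mathbf{Y}. \tag{9-21}$$

Equation (9-19) is a characteristic or defining property of linear mean-square estimation, and is called the orthogonality property of mean-square estimation. Equation (9-18) implies the relationship of Equation (9-19); that is, by choosing the proper linear combination, the error is orthogonal to all linear combinations of the observations. The converse is also true: if $\mathbf{h}$ is such that $E(X - \mathbf{h}^T\mathbf{Y})\mathbf{Y}^T A^T = 0$ for all compatible matrices $A$, then $\mathbf{h}$ is given by (9-18). To establish this result, let $A = [0, \cdots, 0, 1, 0, \cdots, 0]$, where the 1 occurs in the $i$th slot. Then

$$0 = E(X - \mathbf{h}^T\mathbf{Y})\mathbf{Y}^T A^T = E(X - \mathbf{h}^T\mathbf{Y})Y_i, \quad i = 1, \ldots, N;$$

rearranging,

$$EXY_i = \mathbf{h}^T E\mathbf{Y}Y_i, \quad i = 1, \ldots, N.$$

Combining this result for $i = 1, \ldots, N$, we obtain

$$[EXY_1 \;\; EXY_2 \;\; \cdots \;\; EXY_N] = \mathbf{h}^T E\mathbf{Y}[Y_1 \;\; Y_2 \;\; \cdots \;\; Y_N],$$

or $EX\mathbf{Y}^T = \mathbf{h}^T E\mathbf{Y}\mathbf{Y}^T$, which is Equation (9-18).

The notion of orthogonality is a central concept in linear mean-square estimation. We shall soon give a geometrical interpretation of this important characteristic property; it is by far the best method to use in deriving linear mean-square estimates.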
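The normal equations and the orthogonality property can be checked numerically. The sketch below assumes an illustrative $3 \times 3$ correlation matrix $E\mathbf{Y}\mathbf{Y}^T$, cross-correlations $EX\mathbf{Y}^T$, and $EX^2 = 1$ (none of these numbers come from the notes). It solves $E\mathbf{Y}\mathbf{Y}^T\,\mathbf{h} = EX\mathbf{Y}$ with a small hand-written elimination routine, verifies that the error is uncorrelated with each $Y_i$, and confirms that perturbing the weights increases the error, as (9-20) predicts.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]     # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                          # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def mse(k, exx, r_xy, r_yy):
    """E(X - k^T Y)^2 = EX^2 - 2 k^T E[XY] + k^T E[YY^T] k."""
    n = len(k)
    cross = sum(k[i] * r_xy[i] for i in range(n))
    quad = sum(k[i] * r_yy[i][j] * k[j] for i in range(n) for j in range(n))
    return exx - 2.0 * cross + quad

r_yy = [[1.0, 0.5, 0.25],
        [0.5, 1.0, 0.5],
        [0.25, 0.5, 1.0]]        # assumed E[Y Y^T]
r_xy = [0.9, 0.6, 0.3]           # assumed E[X Y_i]
exx = 1.0                        # assumed E[X^2]

h = solve(r_yy, r_xy)            # symmetric R, so R h = E[XY] is the transpose of (9-18)
# Orthogonality (9-19): E[(X - h^T Y) Y_i] should vanish for every i.
error_corr = [r_xy[i] - sum(h[j] * r_yy[j][i] for j in range(3)) for i in range(3)]
best = mse(h, exx, r_xy, r_yy)
worse = mse([h[0] + 0.1, h[1], h[2]], exx, r_xy, r_yy)
print(h, error_corr, best, worse)
```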

9.6 Mean Square Estimation for Random Vectors

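Suppose we have two random vectors $\mathbf{X}$, $\mathbf{Y}$, where $\mathbf{X} = [X_1 \;\; X_2 \;\; \cdots \;\; X_n]^T$ and $\mathbf{Y} = [Y_0 \;\; Y_1 \;\; \cdots \;\; Y_N]^T$, and we wish to determine the linear mean-square estimate, $\hat{\mathbf{X}}$, of $\mathbf{X}$. (Note: For the next while we will start the $Y$ sequences at 0 rather than at 1. This convention is standard, and is motivated by the common circumstance where the subscript on the observations is due to time, in which case we often start at $t = 0$ and proceed.) We can do this by estimating each component of $\mathbf{X}$ separately by the previous method, yielding

$$\hat{X}_i = EX_i\mathbf{Y}^T \left[E\mathbf{Y}\mathbf{Y}^T\right]^{-1} \mathbf{Y},$$

and collecting them together to form the vector

$$\hat{\mathbf{X}} = \begin{bmatrix} \hat{X}_1 \\ \vdots \\ \hat{X}_n \end{bmatrix} = E\begin{bmatrix} X_1 \\ \vdots \\ X_n \end{bmatrix}\mathbf{Y}^T \left[E\mathbf{Y}\mathbf{Y}^T\right]^{-1} \mathbf{Y},$$

or

$$\hat{\mathbf{X}} = E\mathbf{X}\mathbf{Y}^T \left[E\mathbf{Y}\mathbf{Y}^T\right]^{-1} \mathbf{Y}. \tag{9-22}$$

It is convenient to introduce compact notation, and define the matrices

$$R_{XX} = E\mathbf{X}\mathbf{X}^T, \quad R_{XY} = E\mathbf{X}\mathbf{Y}^T, \quad R_{YX} = E\mathbf{Y}\mathbf{X}^T, \quad R_{YY} = E\mathbf{Y}\mathbf{Y}^T.$$

The matrices $R_{XX}$ and $R_{YY}$ are the auto-correlation matrices of the random vectors $\mathbf{X}$ and $\mathbf{Y}$, respectively, and $R_{XY}$, $R_{YX}$ are the cross-correlation matrices of the random vectors $\mathbf{X}$ and $\mathbf{Y}$. Then we can write

$$\hat{\mathbf{X}} = R_{XY} R_{YY}^{-1} \mathbf{Y}$$

and the mean-square error matrix is

$$E(\mathbf{X} - \hat{\mathbf{X}})(\mathbf{X} - \hat{\mathbf{X}})^T = R_{XX} - R_{XY} R_{YY}^{-1} R_{YX}. \tag{9-23}$$

Exercise 9-5 Prove Equation (9-23).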


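Exercise 9-6 Let $\mathbf{X}$ and $\mathbf{Y}$ be random vectors with $E\mathbf{X} = \mathbf{m}_X$ and $E\mathbf{Y} = \mathbf{m}_Y$. Show that the minimum mean square estimate of $\mathbf{X}$ given $\mathbf{Y}$ is

$$\hat{\mathbf{X}} = \mathbf{m}_X + E[\mathbf{X} - \mathbf{m}_X][\mathbf{Y} - \mathbf{m}_Y]^T \left\{E[\mathbf{Y} - \mathbf{m}_Y][\mathbf{Y} - \mathbf{m}_Y]^T\right\}^{-1} [\mathbf{Y} - \mathbf{m}_Y].$$

Exercise 9-7 Let $X$ and $N$ be two independent, zero-mean, Gaussian random variables with variances $\sigma_X^2$ and $\sigma_N^2$, respectively. Let $Y = X + N$, and suppose $Y$ is observed, yielding the value $Y = y$. Show that the mean-square estimate of $X$ given $Y = y$ is

$$\hat{x} = \frac{\sigma_X^2}{\sigma_X^2 + \sigma_N^2}\, y.$$

Exercise 9-8 Let $\mathbf{X}$ and $\mathbf{Y}$ be random $n$- and $m$-dimensional vectors, respectively (assume zero mean), with joint density $f_{XY}(\mathbf{x}, \mathbf{y})$. We define a minimum variance estimate, $\hat{\mathbf{x}}$, of $\mathbf{X}$ as one for which

$$E\{\|\mathbf{X} - \hat{\mathbf{x}}\|^2 \mid \mathbf{Y} = \mathbf{y}\} \le E\{\|\mathbf{X} - \mathbf{z}\|^2 \mid \mathbf{Y} = \mathbf{y}\}$$

for all vectors $\mathbf{z}$, where $\mathbf{z}$ is allowed to be a function of $\mathbf{y}$ only. Show that $\hat{\mathbf{x}}$ is also uniquely specified as the conditional mean of $\mathbf{X}$ given that $\mathbf{Y} = \mathbf{y}$, that is,

$$\hat{\mathbf{x}} = E[\mathbf{X} \mid \mathbf{Y} = \mathbf{y}] = \int_{\mathbb{R}^n} \mathbf{x} f_{X|Y}(\mathbf{x} \mid \mathbf{y})\, d\mathbf{x}.$$

Hint:

$$E\{\|\mathbf{X} - \mathbf{z}\|^2 \mid \mathbf{Y} = \mathbf{y}\} = \int_{\mathbb{R}^n} (\mathbf{x} - \mathbf{z})^T(\mathbf{x} - \mathbf{z}) f_{X|Y}(\mathbf{x} \mid \mathbf{y})\, d\mathbf{x}$$
$$= \int_{\mathbb{R}^n} \mathbf{x}^T\mathbf{x} f_{X|Y}(\mathbf{x} \mid \mathbf{y})\, d\mathbf{x} - 2\mathbf{z}^T \int_{\mathbb{R}^n} \mathbf{x} f_{X|Y}(\mathbf{x} \mid \mathbf{y})\, d\mathbf{x} + \mathbf{z}^T\mathbf{z}$$
$$= \left[\mathbf{z} - \int_{\mathbb{R}^n} \mathbf{x} f_{X|Y}(\mathbf{x} \mid \mathbf{y})\, d\mathbf{x}\right]^T \left[\mathbf{z} - \int_{\mathbb{R}^n} \mathbf{x} f_{X|Y}(\mathbf{x} \mid \mathbf{y})\, d\mathbf{x}\right] + \int_{\mathbb{R}^n} \mathbf{x}^T\mathbf{x} f_{X|Y}(\mathbf{x} \mid \mathbf{y})\, d\mathbf{x} - \left\|\int_{\mathbb{R}^n} \mathbf{x} f_{X|Y}(\mathbf{x} \mid \mathbf{y})\, d\mathbf{x}\right\|^2.$$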

9.7 Hilbert Space of Random Variables

Consider the space $H$ of random variables defined over a probability space $(\Omega, \mathcal{B}, P)$. It is clear that this space is a vector space. Let $X$ and $Y$ be two random variables defined over this probability space, and define the function

$$\langle X, Y \rangle = EXY. \tag{9-24}$$

It is easy to see that this function satisfies the symmetry and linearity properties of an inner product, but we need to take a closer look at the nondegeneracy condition. According to this condition, if $d(\cdot, \cdot)$ is a metric, then $d(X, Y) = 0$ should imply that $X \equiv Y$, that is, for all $\omega \in \Omega$, we should have $X(\omega) = Y(\omega)$. It is not true, however, that $E(X - Y)^2 = 0$ implies $X \equiv Y$. But we can prove something almost as good.

Lemma 5 Let $Z$ be a random variable with density function $f_Z(\cdot)$, and suppose $E(Z - c)^2 = 0$ for some constant $c$. Then $Z = c$ almost surely (a.s.), that is, $P[Z = c] = 1$.

Proof: Suppose there exists an $\epsilon > 0$ such that

$$P[|Z - c| > \epsilon] = \int_{|z - c| > \epsilon} f_Z(z)\, dz > 0.$$

But then

$$E(Z - c)^2 = \int_{|z - c| > \epsilon} (z - c)^2 f_Z(z)\, dz + \int_{|z - c| \le \epsilon} (z - c)^2 f_Z(z)\, dz \ge \int_{|z - c| > \epsilon} (z - c)^2 f_Z(z)\, dz \ge \epsilon^2 \int_{|z - c| > \epsilon} f_Z(z)\, dz > 0,$$

which contradicts the hypothesis. Hence $P[|Z - c| > \epsilon] = 0$ for every $\epsilon > 0$. Thus, if $E(Z - c)^2 = 0$, then $Z = c$ a.s. $\Box$

Thus, by Lemma 5, if $d^2(X, Y) = E(X - Y)^2 = 0$, we have $X = Y$ a.s., so it is possible for $X$ and $Y$ to differ on a set of probability zero. This is a technicality, and we overcome it, formally, by defining the space $H$ to be the space of equivalence classes of random variables, where we say that two random variables are in the same equivalence class if they differ only on a set of probability zero. With this generalization, (9-24) defines an inner product and $\|X\| = \sqrt{\langle X, X \rangle}$ defines a norm for the vector $X$.

Once we have a distance metric defined for random variables, we have some very powerful machinery at our disposal for analysis. For example, the inner product allows us to define the notion of orthogonality between random variables. Two random variables are said to be orthogonal if $\langle X, Y \rangle = EXY = 0$. Orthogonality is so important that we introduce some special notation for it. If $\langle X, Y \rangle = 0$, we write $X \perp Y$, and say, "$X$ is perpendicular to $Y$."


With the concept of distance defined, we may introduce the notion of mean-square convergence. We say that the sequence $\{X_n, n = 0, 1, \ldots\}$ of random variables converges in mean-square if there is a random variable $X$ such that $d(X_n, X) \to 0$ as $n \to \infty$, and we write

$$X = \mathop{\mathrm{l.i.m.}}_{n \to \infty} X_n \tag{9-25}$$

for the condition

$$\lim_{n \to \infty} E(X_n - X)^2 = 0.$$
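As a simple illustration of this definition, assume $U_1, U_2, \ldots$ are uncorrelated random variables with common mean $m$ and common finite variance $\sigma^2$ (these symbols are introduced here only for the example), and let $X_n$ be the average of the first $n$ of them. Then

$$X_n = \frac{1}{n}\sum_{i=1}^{n} U_i, \qquad E(X_n - m)^2 = \frac{1}{n^2}\sum_{i=1}^{n} E(U_i - m)^2 = \frac{\sigma^2}{n} \to 0 \quad \text{as } n \to \infty,$$

where the cross terms vanish because the $U_i$ are uncorrelated. Hence $\mathop{\mathrm{l.i.m.}}_{n \to \infty} X_n = m$; that is, the sample average converges in mean-square to the constant random variable $m$.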

Theorem 1 Let $(\Omega, \mathcal{B}, P)$ be a probability space, and let $H$ be the set of all equivalence classes of random variables defined on this space with finite second moments, that is, $X \in H$ if $EX^2 < \infty$. With the inner product defined by $\langle X, Y \rangle = EXY$, $H$ is a Hilbert space.

We have already established that the inner product satisfies the algebraic requirements, but to prove completeness is more difficult, and is the content of the famous Riesz-Fischer theorem, which is found in many texts. We will not include the proof in these notes.

The squared length of a random variable $X$ is $\langle X, X \rangle = EX^2 = \|X\|^2$. For zero-mean random variables, the squared length is the variance. When dealing with random vectors, however, we come to a slight complication. If we view each random variable as a vector in the Hilbert space, how then do we treat random vectors, that is, finite-dimensional arrays of random variables of the form

$$\mathbf{X} \stackrel{\mathrm{def}}{=} \begin{bmatrix} X_1 \\ \vdots \\ X_n \end{bmatrix}?$$

It is easy to get confused here, since we are so familiar with the notion of inner product for finite-dimensional vector spaces (in which case the inner, or dot, product is $\mathbf{x} \cdot \mathbf{y} = \sum_i x_i y_i$). This is not the inner product we are using to define the Hilbert space! In our Hilbert space context, a random vector is a finite-dimensional vector of abstract vectors. This really isn't very complicated at all; we only need to be sure to keep our bookkeeping straight. So let's define an inner product of two random vectors as the matrix obtained by forming the two-dimensional array of inner products (in the Hilbert space context) between every pair of elements for the two random vectors. Thus, let $\mathbf{X}$ and $\mathbf{Y}$ be $n$- and $m$-dimensional random vectors, respectively. Then

$$\langle \mathbf{X}, \mathbf{Y} \rangle = E\mathbf{X}\mathbf{Y}^T = E\begin{bmatrix} X_1 \\ \vdots \\ X_n \end{bmatrix}[Y_1, \cdots, Y_m] \tag{9-26}$$

is an $n \times m$ matrix. This construction does not preserve the property of symmetry, since in general $E\mathbf{X}\mathbf{Y}^T \ne E\mathbf{Y}\mathbf{X}^T$. The properties of linearity and nondegeneracy are, however, preserved by this operation. But we can easily modify the definition of symmetry to permit the definition of a matrix inner product. All we have to do is to redefine symmetry to become

$$\langle \mathbf{y}, \mathbf{x} \rangle = \langle \mathbf{x}, \mathbf{y} \rangle^T$$

and all the nice results obtained regarding the standard scalar definition apply.

We have restricted our attention to real-valued random variables in this development. We won't take the time to develop the theory for complex random variables here, since (a) such quantities are not very important to our development and (b) the extension is a simple one (just use conjugates in the right places).

The notion of orthogonality is preserved with matrix inner products, and we say, if $\mathbf{X}$ and $\mathbf{Y}$ are two random vectors and $E\mathbf{X}\mathbf{Y}^T = 0$ (the zero matrix), that $\mathbf{X}$ is perpendicular to $\mathbf{Y}$, and we write $\mathbf{X} \perp \mathbf{Y}$ to mean that every component of $\mathbf{X}$ is orthogonal, in the usual Hilbert space sense, to every element of $\mathbf{Y}$. We emphasize that this perpendicularity is not Euclidean perpendicularity in finite-dimensional space; it is perpendicularity in the Hilbert space (but we often draw pictures as if we are in finite-dimensional Euclidean space).

9.8 Geometric Interpretation of Mean Square Estimation

The linear mean-square estimation problem is one of finding a vector $\hat{X}$ in the linear space spanned by $\{Y_i\}$ such that the squared length of the error vector $\tilde{X} = X - \hat{X}$ is as small as possible. Figure 9-1 illustrates this geometric interpretation of mean square estimation.

[Figure 9-1: Geometric interpretation of conditional expectation. The figure shows the vector $X$, its estimate $\hat{X}$, and the space spanned by $\{Y_i\}$.]

The distance $\|X - \hat{X}\|$ is smallest when $\hat{X}$ is the foot of the perpendicular from $X$ to the subspace spanned by $\{Y_i\}$. That is,

$$X - \hat{X} \perp Y_i, \quad i = 1, \ldots, N.$$

Thus, writing the estimate as $\hat{\mathbf{X}} = H\mathbf{Y}$,

$$\mathbf{X} - H\mathbf{Y} \perp Y_i, \quad i = 1, \ldots, N,$$

or

$$E(\mathbf{X} - H\mathbf{Y})\mathbf{Y}^T = 0.$$

We have seen that for a vector random variable $\mathbf{X}$, the mean square estimate given $\mathbf{Y}$ is

$$\hat{\mathbf{X}} = R_{XY} R_{YY}^{-1} \mathbf{Y},$$

so the solution is simply one of inverting a positive-definite matrix. What could be more simple? But there is more to be said!

1. $O(N^3)$ operations (footnote 17) are required to invert an $N \times N$ matrix. For $N$ on the order of thousands, this is a problem. Question: Does $R_{YY}$ possess any structure that could be exploited to simplify the calculations?

2. What if $N$ is growing, and we wish to update our estimate $\hat{\mathbf{X}}$ sequentially as data are obtained? We might not be able to invert a sequence of large matrices.

Footnote 17: We will use the symbol $O$ to denote the computational order.

So what can we do? The nicest thing would be if $R_{YY}$ were diagonal. Then the components of $\mathbf{Y}$ would be uncorrelated, that is, $EY_iY_j = EY_i EY_j = 0$ for $i \ne j$. Unfortunately, this condition rarely holds, but it does suggest the possibility of transforming $\{Y_0, \ldots, Y_N\}$ to an equivalent set of uncorrelated random variables, denoted $\{\epsilon_0, \epsilon_1, \ldots, \epsilon_N\}$, with the following properties:

(i) The $\epsilon_i$'s should be uncorrelated: $E\epsilon_i\epsilon_j = 0$ for $i \ne j$.

(ii) The transformation should be causal and linear: $\epsilon_k$ should be a linear combination of $\{Y_0, Y_1, \ldots, Y_k\}$.

(iii) The transformation should be causally invertible: $Y_k$ should be a linear combination of $\{\epsilon_0, \epsilon_1, \ldots, \epsilon_k\}$.

(iv) The calculations should be recursive: for each $k$, $\epsilon_k$ should be a function of the new observation $Y_k$ and the old transformed variables $\{\epsilon_0, \ldots, \epsilon_{k-1}\}$.

(v) The transformation should simplify calculations: it should take many fewer calculations than inverting $R_{YY}$ by standard methods.

We will shortly develop a transformation that has these properties, the well-known Gram-Schmidt orthogonalization procedure, which, we will demonstrate, meets Properties (i)-(iv). Whether or not we can achieve Property (v) will depend upon additional structure of the $Y_i$'s.

9.9 Gram-Schmidt Procedure

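As introduced earlier, let us view the random variables $Y_i$ as vectors in some abstract space with a suitable definition of inner products. We can sequentially orthogonalize the sequence $\{Y_0, Y_1, \ldots\}$ as follows:

1. Set $\epsilon_0 = Y_0$.

2. Subtract from $Y_1$ its projection on the space spanned by $\epsilon_0$:

$$\epsilon_1 = Y_1 - \left\langle Y_1, \frac{\epsilon_0}{\|\epsilon_0\|} \right\rangle \frac{\epsilon_0}{\|\epsilon_0\|} = Y_1 - \langle Y_1, \epsilon_0 \rangle \frac{\epsilon_0}{\|\epsilon_0\|^2}.$$

Figure 9-2 provides a geometric illustration of this procedure.

[Figure 9-2: Geometric illustration of Gram-Schmidt procedure. The figure shows $Y_0 = \epsilon_0$, the unit vector $\epsilon_0/\|\epsilon_0\|$, the vector $Y_1$, its projection $\langle Y_1, \epsilon_0/\|\epsilon_0\| \rangle\, \epsilon_0/\|\epsilon_0\|$, and the residual $\epsilon_1$.]

3. Subtract from $Y_2$ its projection on the space spanned by $\{Y_0, Y_1\}$ or, equivalently, the space spanned by $\{\epsilon_0, \epsilon_1\}$:

$$\epsilon_2 = Y_2 - \left\langle Y_2, \frac{\epsilon_0}{\|\epsilon_0\|} \right\rangle \frac{\epsilon_0}{\|\epsilon_0\|} - \left\langle Y_2, \frac{\epsilon_1}{\|\epsilon_1\|} \right\rangle \frac{\epsilon_1}{\|\epsilon_1\|}.$$

4. The general form:

$$\epsilon_i = Y_i - \sum_{j=0}^{i-1} \left\langle Y_i, \frac{\epsilon_j}{\|\epsilon_j\|} \right\rangle \frac{\epsilon_j}{\|\epsilon_j\|}.$$

Some remarks: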

• Suppose some $\epsilon_i$ has zero length? Then $Y_i$ is linearly dependent on $\{Y_0, \ldots, Y_{i-1}\}$ and $R_{YY}$ is singular. Hence, the random variable $Y_i$ may be omitted. We could employ a pseudo-inverse and not worry about this potential problem, but for this development we will assume that such problems have already been eliminated.

• It is sometimes useful to normalize the quantities and generate the sequence

$$\nu_i = \frac{\epsilon_i}{\|\epsilon_i\|}, \quad i = 0, 1, \ldots;$$

then we can write

$$\epsilon_i = Y_i - \sum_{j=0}^{i-1} \langle Y_i, \nu_j \rangle \nu_j.$$

• The projection of $Y_i$ onto the subspace spanned by $\{Y_0, \ldots, Y_{i-1}\}$ is the mean square estimate of $Y_i$ given $\{Y_0, \ldots, Y_{i-1}\}$. We denote this estimate by $\hat{Y}_{i|i-1}$. Then

$$\epsilon_i = Y_i - \hat{Y}_{i|i-1}.$$

The random variable $\epsilon_i$ can be regarded as the new information brought by $Y_i$ after $\{Y_0, \ldots, Y_{i-1}\}$ are known. Recall that $\{\epsilon_i\}$ forms an uncorrelated sequence (a consequence of the Gram-Schmidt construction). The process $\{\epsilon_i\}$ is termed the new information or innovations process (or just the innovations). The process $\{\nu_i\}$ is the normalized innovations process.

• The sequences

$\{\epsilon_0, \epsilon_1, \ldots, \epsilon_i\}$ (orthogonal),
$\{\nu_0, \nu_1, \ldots, \nu_i\}$ (orthonormal),
$\{Y_0, Y_1, \ldots, Y_i\}$ (arbitrary)

all span the same vector space.

• The Gram-Schmidt procedure is not unique. There are many such orthogonalizing sets and they can be obtained in many different ways. The important concept is not the Gram-Schmidt procedure, but the properties (i)-(iv) of the innovations that we noted earlier.
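The procedure above can be sketched in code by representing each innovation through its coefficients on $Y_0, \ldots, Y_N$, so that the Hilbert-space inner product of two linear combinations $\mathbf{a}^T\mathbf{Y}$ and $\mathbf{b}^T\mathbf{Y}$ becomes $\mathbf{a}^T R_{YY} \mathbf{b}$. The correlation matrix below is an arbitrary positive-definite example, and the function names are hypothetical; this is an illustration of the general form in step 4, not a numerically optimized routine.

```python
def inner(a, b, r):
    """<a^T Y, b^T Y> = E[(a^T Y)(b^T Y)] = a^T R b for coefficient vectors a and b."""
    n = len(r)
    return sum(a[i] * r[i][j] * b[j] for i in range(n) for j in range(n))

def innovations(r):
    """Gram-Schmidt on Y_0..Y_{n-1}; returns coefficient rows of eps_i and E[eps_i^2]."""
    n = len(r)
    rows = []        # rows[i][j] is the coefficient of Y_j in eps_i
    energies = []    # energies[i] = ||eps_i||^2
    for i in range(n):
        y_i = [1.0 if j == i else 0.0 for j in range(n)]
        w = y_i[:]
        for eps, energy in zip(rows, energies):
            coeff = inner(y_i, eps, r) / energy      # <Y_i, eps_j> / ||eps_j||^2
            w = [wk - coeff * ek for wk, ek in zip(w, eps)]
        rows.append(w)
        energies.append(inner(w, w, r))
    return rows, energies

r_yy = [[4.0, 2.0, 1.0],
        [2.0, 3.0, 1.0],
        [1.0, 1.0, 2.0]]        # assumed correlation matrix E[Y Y^T]

rows, energies = innovations(r_yy)
cross = [inner(rows[i], rows[j], r_yy) for i in range(3) for j in range(i)]
print(rows)        # lower-triangular coefficients (causal, Property (ii))
print(energies)    # squared lengths of the innovations
print(cross)       # all zero up to rounding (Property (i))
```

Because row $i$ only involves $Y_0, \ldots, Y_i$ and has a unit coefficient on $Y_i$, the rows form a unit lower-triangular matrix, which is what makes the transformation causally invertible.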


Exercise 9-9 Let $\{Y_i\}$ be a scalar zero-mean stationary random process such that $EY_iY_j = \rho^{|i-j|}$ for some $\rho \in (0, 1)$ ($\{Y_i\}$ is an exponentially correlated process). Verify that

$$\epsilon_i = Y_i - \rho Y_{i-1}, \qquad \|\epsilon_i\|^2 = 1 - \rho^2$$

for $i = 2, 3$.

Exercise 9-10 Suppose the process $\{Y_i\}$ admits the model

$$Y_i - \rho Y_{i-1} = u_i, \quad i = 1, 2, \ldots,$$

where

$$Eu_i = 0, \qquad Eu_iu_j = \begin{cases} 1 - \rho^2 & i = j \\ 0 & i \ne j \end{cases}, \quad i, j > 0,$$
$$EY_0 = 0, \qquad EY_0^2 = 1, \qquad EY_0u_i = 0, \quad i > 0.$$

Verify that this model yields the correlation function of the form $EY_iY_j = \rho^{|i-j|}$.

Exercise 9-11 Suppose $N$ independent samples of a random variable, $X$, are taken, denoted by $x_1, x_2, \ldots, x_N$. Find the value $\hat{x}$ that minimizes the quantity

$$\sum_{i=1}^{N} (\hat{x} - x_i)^2$$

and interpret the result.

Suppose you take $N$ observations of a certain quantity and wish to compute the mean-square fit to a constant. Write a conventional expression to do this. Now, suppose you take one more observation. Show how to express the new average (over $N + 1$ observations) in terms of the old average (over $N$ observations) and the $(N + 1)$st observation. Comment on the computational complexity of your new approach versus the conventional approach.

Exercise 9-12 Let $\mathcal{Y}_0^N$ denote the linear subspace spanned by the set of observations $\{Y_i\}_{i=0}^{N} = \{Y_0, Y_1, \ldots, Y_N\}$. Let $\hat{X}$ be such that

$$E(X - \hat{X})Z = 0$$

for all $Z \in \mathcal{Y}_0^N$ (that is, $\hat{X}$ is the orthogonal projection of $X$ onto $\mathcal{Y}_0^N$). Also, let $\bar{X}$ be any other estimate in $\mathcal{Y}_0^N$. Prove that

$$E(X - \hat{X})^2 \le E(X - \bar{X})^2$$

and, therefore, that $\hat{X}$ is the mean square estimate of $X$ given $\{Y_i\}_{i=0}^{N}$. This proves that orthogonality is a sufficient condition for minimum mean-square estimation.

To prove that orthogonality is also a necessary condition, suppose that $\bar{X} \in \mathcal{Y}_0^N$ is another estimate for $X$ and that there is a $Z \in \mathcal{Y}_0^N$ such that $E(X - \bar{X})Z \ne 0$. Define a new estimator $\check{X}$ by

$$\check{X} = \bar{X} + \frac{E(X - \bar{X})Z}{EZ^2}\, Z.$$

(Does this remind you of anything? Also, why is $EZ^2 > 0$?) Show that

$$E(X - \check{X})^2 < E(X - \bar{X})^2,$$

which implies that $\check{X}$ is a better estimator than $\bar{X}$, so $\bar{X}$ cannot minimize the mean-square error.

9.10 Estimation Given the Innovations Process

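Since the observed process $\{Y_i\}$ and the innovations process $\{\epsilon_i\}$ span the same space, it follows from the projection concept that the estimate of $X$ given $\{Y_i\}$ must be identical with the estimate of $X$ given $\{\epsilon_i\}$. Let us define

$$\hat{X}_{|N} \stackrel{\mathrm{def}}{=} \text{the estimate of } X \text{ given } \{Y_0, \ldots, Y_N\}.$$

Then, equivalently,

$$\hat{X}_{|N} = \text{the estimate of } X \text{ given } \{\epsilon_0, \ldots, \epsilon_N\}.$$

Let $\boldsymbol{\epsilon}_N = [\epsilon_0 \;\; \cdots \;\; \epsilon_N]^T$. Since the process $\{\epsilon_i\}$ is orthogonal (that is, uncorrelated), the correlation matrix $R_{\epsilon\epsilon} = E\boldsymbol{\epsilon}_N\boldsymbol{\epsilon}_N^T$ is diagonal; we obtain

$$\hat{X}_{|N} = R_{X\epsilon} R_{\epsilon\epsilon}^{-1} \boldsymbol{\epsilon}_N = [EX\epsilon_0 \;\; \cdots \;\; EX\epsilon_N] \left[\mathrm{diag}\{E\epsilon_0^2, E\epsilon_1^2, \ldots, E\epsilon_N^2\}\right]^{-1} \begin{bmatrix} \epsilon_0 \\ \vdots \\ \epsilon_N \end{bmatrix} = \sum_{i=0}^{N} EX\epsilon_i\, (E\epsilon_i^2)^{-1} \epsilon_i.$$

Thus, if we have an additional observation $Y_{N+1}$, we may update the estimate of $X$ given this new information by computing the new innovation $\epsilon_{N+1}$ and projecting $X$ onto this new vector:

$$\hat{X}_{|N+1} = \hat{X}_{|N} + (\text{estimate of } X \text{ given } \epsilon_{N+1}) = \hat{X}_{|N} + EX\epsilon_{N+1} \left[E\epsilon_{N+1}^2\right]^{-1} \epsilon_{N+1},$$

where

$$\epsilon_{N+1} = Y_{N+1} - \hat{Y}_{N+1|N} = Y_{N+1} - \sum_{j=0}^{N} \underbrace{EY_{N+1}\epsilon_j}_{\text{inner product}}\; \underbrace{[E\epsilon_j^2]^{-1}}_{\text{normalizing factor}}\, \epsilon_j.$$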

With this formulation, we see that we can avoid inverting large matrices, and sequential updating is easy. To be useful, however, we must also obtain some savings in effort, so we need to explore structure in the process $\{Y_i\}$. The following is an example of such structure.

Example 9-10 Let $\{Y_i\}$ be a scalar zero-mean stationary random process such that

$$EY_iY_j = \rho^{|i-j|} \tag{9-27}$$

for some $\rho \in (0, 1)$ ($\{Y_i\}$ is an exponentially correlated process). Let us compute the innovations process:

$$\epsilon_0 = Y_0, \qquad \|\epsilon_0\|^2 = EY_0^2 = \rho^0 = 1,$$
$$\epsilon_1 = Y_1 - \langle Y_1, \epsilon_0 \rangle \frac{\epsilon_0}{\|\epsilon_0\|^2} = Y_1 - \rho Y_0, \qquad \|\epsilon_1\|^2 = E(Y_1^2 - 2\rho Y_0Y_1 + \rho^2 Y_0^2) = 1 - \rho^2.$$

And, in general, it is true that

$$\epsilon_i = Y_i - \rho Y_{i-1}, \qquad \|\epsilon_i\|^2 = 1 - \rho^2 \tag{9-28}$$

for $i > 0$. Thus, for any random variable $X$ defined over the same probability space as the $Y_i$'s,

$$\hat{X}_{|N} = E(XY_0)Y_0 + \sum_{i=1}^{N} \frac{EXY_i - \rho EXY_{i-1}}{1 - \rho^2}\,(Y_i - \rho Y_{i-1}).$$

Why is this rule so simple? The answer lies in the fact that the $\{Y_i\}$ process admits a model:

$$Y_i - \rho Y_{i-1} = u_i, \quad i = 1, 2, \ldots, \tag{9-29}$$

where

$$Eu_i = 0, \qquad Eu_iu_j = \begin{cases} 1 - \rho^2 & i = j \\ 0 & i \ne j \end{cases}, \quad i, j > 0,$$
$$EY_0 = 0, \qquad EY_0^2 = 1, \qquad EY_0u_i = 0, \quad i > 0.$$

Now project $u_i$ onto the space spanned by $\{Y_0, \ldots, Y_{i-1}\}$. We have

$$Eu_iY_j = E(Y_i - \rho Y_{i-1})Y_j = \rho^{i-j} - \rho\,\rho^{i-j-1} = 0$$

for $i > j$. Thus, $u_i \perp$ this space and, therefore, $\hat{u}_{i|i-1} = 0$. By linearity,

$$0 = \hat{u}_{i|i-1} = \hat{Y}_{i|i-1} - \rho \hat{Y}_{i-1|i-1},$$

but $\hat{Y}_{i-1|i-1}$ is the projection of $Y_{i-1}$ onto the space spanned by $\{Y_0, \ldots, Y_{i-1}\}$ and, since this space contains $Y_{i-1}$, this projection is simply $Y_{i-1}$. Thus, $\hat{Y}_{i-1|i-1} = Y_{i-1}$ and we have $\hat{Y}_{i|i-1} = \rho Y_{i-1}$ so, therefore,

$$\epsilon_i = Y_i - \hat{Y}_{i|i-1} = Y_i - \rho Y_{i-1}.$$

Thus, for this problem, the innovations are the white-noise inputs of the model; that is, we may take $\epsilon_i = u_i$.

This example is that of a first-order auto-regressive (AR) process. In general, an $n$th order AR process is of the form

$$Y_i - \rho_1 Y_{i-1} - \cdots - \rho_n Y_{i-n} = u_i, \quad i > 0,$$

with

$$Eu_iu_j = Q\delta_{ij}, \qquad Eu_iY_j = 0, \quad j \in \{0, \ldots, i - n\}.$$
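The recursive estimate of Example 9-10 can be sketched as follows. The value $\rho = 0.5$ and the cross-correlations $EXY_i$ are assumed for illustration, and `sequential_estimate` is a hypothetical name. Each new observation costs one innovation and one gain computation. As a consistency check, the weights that the recursion implicitly places on $Y_0, \ldots, Y_N$ are extracted by feeding in unit vectors and are then tested against the normal equations $\sum_j h_j EY_jY_i = EXY_i$.

```python
def sequential_estimate(y, r_xy, rho):
    """Return [X_hat|0, X_hat|1, ...] for the exponentially correlated process (9-27)."""
    estimate = r_xy[0] * y[0]            # eps_0 = Y_0 with E[eps_0^2] = 1
    history = [estimate]
    for i in range(1, len(y)):
        innovation = y[i] - rho * y[i - 1]                        # eps_i from (9-28)
        gain = (r_xy[i] - rho * r_xy[i - 1]) / (1.0 - rho ** 2)   # E[X eps_i] / E[eps_i^2]
        estimate += gain * innovation                             # add the projection on eps_i
        history.append(estimate)
    return history

rho = 0.5                       # assumed correlation parameter
r_xy = [0.9, 0.6, 0.3]          # assumed cross-correlations E[X Y_i]
n = len(r_xy)

observed = [1.0, 0.4, -0.2]     # illustrative realization of Y_0, Y_1, Y_2
print(sequential_estimate(observed, r_xy, rho))

# The estimator is linear in y, so unit-vector inputs reveal its weights on each Y_j.
weights = [sequential_estimate([1.0 if k == j else 0.0 for k in range(n)], r_xy, rho)[-1]
           for j in range(n)]
normal_eq_gap = max(
    abs(sum(weights[j] * rho ** abs(i - j) for j in range(n)) - r_xy[i])
    for i in range(n)
)
print(weights, normal_eq_gap)   # gap at rounding level: same answer as inverting R_YY
```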

9.11 Innovations and Matrix Factorizations

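Let us continue with the previous example, and compute $R_{YY}^{-1}$ as an exercise and see what develops. First, note that we can arrange the innovations as

$$\begin{bmatrix} \epsilon_0 \\ \epsilon_1 \\ \vdots \\ \epsilon_N \end{bmatrix} = \begin{bmatrix} 1 & 0 & \cdots & \cdots & 0 \\ -\alpha & 1 & 0 & & \vdots \\ 0 & -\alpha & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & -\alpha & 1 \end{bmatrix} \begin{bmatrix} Y_0 \\ Y_1 \\ \vdots \\ Y_N \end{bmatrix}$$

or

$$\epsilon = WY$$

where $W$ is a lower-triangular matrix. Let

$$R_{\epsilon\epsilon} = E\epsilon\epsilon^T = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1-\alpha^2 & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & 1-\alpha^2 \end{bmatrix},$$

a diagonal matrix. But

$$R_{\epsilon\epsilon} = W\,EYY^T\,W^T = WR_{YY}W^T \tag{9-30}$$

and we know that

$$R_{YY} = \begin{bmatrix} 1 & \alpha & \alpha^2 & \cdots & \alpha^N \\ \alpha & 1 & \alpha & \cdots & \alpha^{N-1} \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ \alpha^{N-1} & \cdots & \alpha & 1 & \alpha \\ \alpha^N & \cdots & \alpha^2 & \alpha & 1 \end{bmatrix}.$$

This is the matrix we need to invert to obtain the mean-square estimate; we want to do it efficiently. From (9-30),

$$R_{YY} = W^{-1}R_{\epsilon\epsilon}W^{-T} \tag{9-31}$$

where we have invoked the notation $W^{-T} \stackrel{\text{def}}{=} [W^{-1}]^T$. Now we know that we need $R_{YY}^{-1}$ to compute the mean-square estimate, and we can do so by inverting (9-31) to obtain

$$R_{YY}^{-1} = W^TR_{\epsilon\epsilon}^{-1}W.$$

We know all components of this matrix product and, upon multiplication, it becomes

$$R_{YY}^{-1} = \frac{1}{1-\alpha^2} \begin{bmatrix} 1 & -\alpha & 0 & \cdots & 0 \\ -\alpha & 1+\alpha^2 & -\alpha & \ddots & \vdots \\ 0 & \ddots & \ddots & \ddots & 0 \\ \vdots & \ddots & -\alpha & 1+\alpha^2 & -\alpha \\ 0 & \cdots & 0 & -\alpha & 1 \end{bmatrix},$$

a tri-diagonal matrix that is easily implemented. For this example, it is possible to obtain an exact solution. Explicit results, however, are not always to be expected, but the estimation problem corresponds to a particular way of inverting a matrix: the so-called LDU decomposition.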
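The tri-diagonal form of $R_{YY}^{-1}$ can be checked numerically. The following is a minimal sketch in plain Python, using exact rational arithmetic so that no tolerance is needed; the value $\alpha = 1/2$, the size $n = 5$, and the helper names `matmul`, `ar1_correlation`, and `ar1_inverse` are illustrative assumptions, not part of the notes.

```python
from fractions import Fraction

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def ar1_correlation(alpha, n):
    """R_YY for the exponentially correlated process: entry (i, j) is alpha**|i-j|."""
    return [[alpha ** abs(i - j) for j in range(n)] for i in range(n)]

def ar1_inverse(alpha, n):
    """Tri-diagonal matrix claimed to equal the inverse of R_YY (needs n >= 2)."""
    scale = 1 / (1 - alpha ** 2)
    T = [[Fraction(0)] * n for _ in range(n)]
    for i in range(n):
        interior = 0 < i < n - 1
        # Corner diagonal entries are 1, interior ones are 1 + alpha^2.
        T[i][i] = scale * (1 + alpha ** 2 if interior else 1)
        if i + 1 < n:
            T[i][i + 1] = T[i + 1][i] = -alpha * scale
    return T

alpha = Fraction(1, 2)   # illustrative value in (0, 1)
n = 5                    # N + 1 samples
R = ar1_correlation(alpha, n)
T = ar1_inverse(alpha, n)
product = matmul(R, T)
identity = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
print(product == identity)
```

Because the arithmetic is exact, the comparison with the identity matrix is an equality test rather than an approximate one.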

9.12

LDU Decomposition

We now know how to solve the estimation problem by transforming the observed data to a white noise innovations process. One way of inverting the auto-correlation matrix is by finding its so-called LDU (lower-diagonal-upper) decomposition. In general, for a square matrix $R$, we can find three matrices, $L$, $D$, $U$, such that

$$R = LDU$$

where $L$ is lower-triangular, $D$ is diagonal, and $U$ is upper-triangular in structure. Furthermore, if the matrix $R$ is symmetric, we may find a decomposition such that $U = L^T$. Thus, since the auto-correlation matrix $R_{YY}$ is symmetric, there exist matrices $L$ and $D$ such that

$$R_{YY} = LDL^T. \tag{9-32}$$

In the previous example, we may identify

$$L = W^{-1}, \qquad D = R_{\epsilon\epsilon}.$$

9.13

Cholesky Decomposition

Since the matrix we are inverting is positive-definite (that is, has all strictly positive eigenvalues), it is possible to refine the LDU decomposition further. Since $R_{YY} > 0$ implies $R_{\epsilon\epsilon} > 0$, we can define a triangular square-root matrix, denoted $D^{1/2}$, such that $D = D^{1/2}D^{T/2}$, where we have introduced a common notational shorthand: $D^{T/2} \stackrel{\text{def}}{=} [D^{1/2}]^T$. We may then write (9-32) as

$$R_{YY} = \bar{L}\bar{L}^T,$$

where $\bar{L} = LD^{1/2}$, which is known as the Cholesky decomposition of $R_{YY}$. Then

$$R_{YY}^{-1} = \bar{L}^{-T}\bar{L}^{-1} = \bar{W}^T\bar{W}$$

where we define

$$\bar{W} \stackrel{\text{def}}{=} \bar{L}^{-1} = R_{\epsilon\epsilon}^{-1/2}W.$$
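For the AR(1) example, both factorizations can be written down explicitly and verified. The sketch below is an illustration under assumed values ($\alpha = 0.5$, four samples); the helper names `matmul`, `transpose`, and `close` are hypothetical, and the closed form $L_{ij} = \alpha^{i-j}$ for $i \ge j$ is the inverse of the bidiagonal $W$ of the example.

```python
import math

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def close(A, B, tol=1e-12):
    """Entrywise comparison of two matrices up to a small tolerance."""
    return all(abs(a - b) < tol for ra, rb in zip(A, B) for a, b in zip(ra, rb))

alpha, n = 0.5, 4   # illustrative values
R = [[alpha ** abs(i - j) for j in range(n)] for i in range(n)]
I = [[float(i == j) for j in range(n)] for i in range(n)]

# W maps Y to the innovations: ones on the diagonal, -alpha just below it.
W = [[1.0 if i == j else (-alpha if i == j + 1 else 0.0) for j in range(n)]
     for i in range(n)]
# L = W^{-1} is lower triangular with entries alpha**(i - j).
L = [[alpha ** (i - j) if i >= j else 0.0 for j in range(n)] for i in range(n)]
# D = R_eps_eps = diag(1, 1 - alpha^2, ..., 1 - alpha^2).
d = [1.0] + [1.0 - alpha ** 2] * (n - 1)
D = [[d[i] if i == j else 0.0 for j in range(n)] for i in range(n)]

ldl = matmul(matmul(L, D), transpose(L))                    # L D L^T
L_bar = [[L[i][j] * math.sqrt(d[j]) for j in range(n)] for i in range(n)]  # L D^{1/2}
chol = matmul(L_bar, transpose(L_bar))                      # L_bar L_bar^T

print(close(matmul(W, L), I), close(ldl, R), close(chol, R))
```

Since $D$ is diagonal here, its square root is simply the entrywise square root of the diagonal, which is why `L_bar` only needs a column scaling of `L`.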

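The stochastic interpretation of the Cholesky decomposition is as follows: Since $\epsilon = WY$, we have $Y = W^{-1}\epsilon$, thus

$$R_{Y\epsilon} = EY\epsilon^T = W^{-1}R_{\epsilon\epsilon},$$

so $W^{-1} = R_{Y\epsilon}R_{\epsilon\epsilon}^{-1}$. Thus, we obtain $L = R_{Y\epsilon}R_{\epsilon\epsilon}^{-1}$ and the $LDL^T$ decomposition is

$$R_{YY} = R_{Y\epsilon}R_{\epsilon\epsilon}^{-1}R_{\epsilon\epsilon}R_{\epsilon\epsilon}^{-1}R_{Y\epsilon}^T = R_{Y\epsilon}R_{\epsilon\epsilon}^{-1}R_{Y\epsilon}^T.$$

Also, the matrix $\bar{W}$ gives the normalized innovations $\bar{\epsilon} = \bar{W}Y$. Consequently,

$$\bar{L} = R_{Y\bar{\epsilon}} = R_{Y\epsilon}R_{\epsilon\epsilon}^{-T/2}$$

and the Cholesky decomposition is also

$$R_{YY} = R_{Y\bar{\epsilon}}R_{Y\bar{\epsilon}}^T.$$

Reasons for Discussion.

A standard method for inverting a positive-definite matrix is to make a Cholesky decomposition $R_{YY} = \bar{L}\bar{L}^T$ and to compute $R_{YY}^{-1} = \bar{L}^{-T}\bar{L}^{-1}$. This is because it is easy to invert triangular matrices. Then the general estimation formula becomes

$$\begin{aligned}
\hat{X}_{|N} &= R_{XY}R_{YY}^{-1}Y \\
&= R_{XY}\bar{W}^T\bar{W}Y \\
&= R_{XY}W^TR_{\epsilon\epsilon}^{-T/2}R_{\epsilon\epsilon}^{-1/2}WY \\
&= E(XY^T)W^TR_{\epsilon\epsilon}^{-1}\epsilon \\
&= E(X\epsilon^T)R_{\epsilon\epsilon}^{-1}\epsilon \\
&= R_{X\epsilon}R_{\epsilon\epsilon}^{-1}\epsilon,
\end{aligned}$$

the innovations form of the estimator. So the innovations method is equivalent to the Cholesky factorization method of computing $R_{YY}^{-1}$ and $\hat{X}_{|N}$.

Often, additional information is available in the form of a process model, which yields a fast way of computing the innovations. A fast algorithm is one which will invert $R_{YY}$ in fewer than $O(N^3)$ operations. If we assume stationarity, we can get the number of operations down to $O(N^2)$, and if we can invoke an AR model, we can get the number of operations down to $O(N \log N)$ (similar in speed to the FFT).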
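The equivalence between the direct formula $R_{XY}R_{YY}^{-1}Y$ and the innovations form can be illustrated on the AR(1) example. The sketch below is only an illustration: it assumes that the quantity to be estimated is the next sample, $X = Y_{N+1}$ (so that $EXY_i = \alpha^{N+1-i}$), and the observation values and the helper `solve` are made up for the purpose. Under that assumption the innovations sum telescopes to $\alpha Y_N$.

```python
from fractions import Fraction

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination (exact when given Fractions)."""
    n = len(A)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        p = next(r for r in range(c, n) if M[r][c] != 0)   # pivot row
        M[c], M[p] = M[p], M[c]
        M[c] = [v / M[c][c] for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[-1] for row in M]

alpha = Fraction(1, 3)                       # illustrative value in (0, 1)
y = [Fraction(v) for v in (2, -1, 4, 3)]     # made-up observations Y_0..Y_N
N = len(y) - 1

# Assumed target: X = Y_{N+1}, hence E[X Y_i] = alpha**(N + 1 - i).
r_xy = [alpha ** (N + 1 - i) for i in range(N + 1)]

# Direct form: X_hat = R_XY R_YY^{-1} Y, computed by solving R_YY w = R_XY^T.
R = [[alpha ** abs(i - j) for j in range(N + 1)] for i in range(N + 1)]
w = solve(R, r_xy)
x_hat_direct = sum(wi * yi for wi, yi in zip(w, y))

# Innovations form: eps_0 = Y_0, eps_i = Y_i - alpha Y_{i-1}.
eps = [y[0]] + [y[i] - alpha * y[i - 1] for i in range(1, N + 1)]
var_eps = [Fraction(1)] + [1 - alpha ** 2] * N
r_xeps = [r_xy[0]] + [r_xy[i] - alpha * r_xy[i - 1] for i in range(1, N + 1)]
x_hat_innov = sum(c / v * e for c, v, e in zip(r_xeps, var_eps, eps))

print(x_hat_direct, x_hat_innov, alpha * y[-1])
```

The direct route solves an $(N+1) \times (N+1)$ system, whereas the innovations route needs only one multiply-add per sample once the model is known, which is the computational point made above.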


9.14

White Noise Interpretations

It may seem counterintuitive that a white noise process could contain information, but we are claiming that the innovations process, which is a zero-mean uncorrelated process and, therefore, a white noise process, contains exactly the same information as does the original data set. Perhaps the best way to convince yourself that this is so is to review how the innovations are obtained. Recall that we set forth five desirable properties when introducing the possibility of transforming the data. Of those, the notions of causal and causally invertible are of central importance. To restate:

The notion of causality is perhaps best viewed in the context of a process for which the indexing set is associated with some physical parameter. We have been dealing with an arbitrary process $\{Y_i\}$, where the indexing parameter $i$ is simply a non-negative integer. For many applications, this indexing set will correspond to time. For example, the sequence $\{Y_i\}$ might be derived by sampling a waveform at times $t_i$, $i = 0, 1, 2, \ldots$. To say that the transformation from $\{Y_i\}$ to $\{\epsilon_i\}$ is causal is to say that we do not need future values of $Y_i$ to compute past or present values of $\epsilon_i$. This makes intuitive sense; there are many physical processes which admit a causal model, and we get a lot of mileage out of this notion. But what does causality mean in, say, spatial coordinates, such as with imagery? It does not make intuitive sense to claim that the right-hand pixel precedes, in any sense, the left-hand pixel. Causal transforms, in this context, might be much less desirable than non-causal ones that do not force irrelevant structure onto the data. Perhaps the best way to develop an intuition about innovations is to think of them in the Gram-Schmidt context, where they may be viewed as a way of orthogonalizing an oblique coordinate system. The abstract vector space idea is very powerful; it will surface many times in our development of estimation theory.

The notion of causal invertibility is also of central importance. This concept simply means that it is possible to reverse the transformation; that is, to recover $\{Y_i\}$ from $\{\epsilon_i\}$.

Thus, it is in this sense of causality and causal invertibility that we may claim that the innovations sequence is informationally equivalent to the original process. It can be shown, but we will not attempt it here, that, in the context of information theory, the mutual information between $X$ and $\{Y_i\}$ is exactly the same as the mutual information between $X$ and $\{\epsilon_i\}$. Thus, the innovations transformation is one case where the information processing lemma leads to strict equality. That lemma, incidentally, will tell us that it is not possible to increase the information content by any operations on the data; we may at best preserve it. But the fact that such an information-preserving transform exists and is useful is an important theoretical fact in and of itself!

9.15

More On Modeling

Thus far, we have developed the notion of innovations and have shown that they are nothing more than a very special coordinate transform of the data. This fact would be of only academic interest if we could not exploit additional structure in the data to speed up the calculations. We have given one example to show that, if the process admits an autoregressive (AR) model, then significant computational savings can be achieved. This is one reason why such models are so common in data analysis. For example, they have long been used in statistics, economics, etc. But AR models work only for stationary, linear processes (although this can be overcome to some extent; witness the work done in speech processing, which has long made use of such models). But that gets more to the practitioner's art, and we have yet to develop the theory.

There is another large class of models that attracts our attention: state-space models. These models, though usually linear, need not be stationary. It can be easily shown that all AR models can be re-cast in a state-space formulation, so we really give up nothing by concentrating on the latter class of models. The relaxation of the stationarity constraint makes it well worth our while to do so, since, as we will see, state-space is a natural and very rich place to do our analysis. In fact, the constant parameter estimation problem that we have just solved can also be easily formulated in state-space. So we have a lot to gain and little to lose in concentrating, for the rest of this development, on state-space models.

10

Estimation of State Space Systems

10.1

Innovations for Processes with State Space Models

Suppose the observed process admits a model of the form

$$Y_i = H_iX_i + V_i, \qquad i = 0, 1, 2, \ldots, \tag{10-33}$$

where $X_i$ is an $n$-dimensional state vector obeying the difference equation

$$X_{i+1} = F_iX_i + G_iU_i, \qquad i = 0, 1, 2, \ldots. \tag{10-34}$$

Here, the observations consist of a sequence of $m$-dimensional vectors $\{Y_i\}$, as opposed to the sequence of scalars that we have thus far encountered. Also, rather than just one parameter vector $X$ to estimate, there is an entire sequence of them, $\{X_i\}$. The matrices $F_i$, $G_i$, and $H_i$ are termed system matrices and are assumed known. The processes $\{V_i\}$ and $\{U_i\}$ are termed observation noise and process noise, respectively. They are stochastic components of the system.

Notation Change. In the sequel, we shall be considering states, inputs, and outputs as random variables unless otherwise explicitly stated. To simplify notation and come into conformity with 30 years of engineering usage in estimation theory, we will use lower-case symbols to denote these random variables. In statistics, the standard notation for a random variable is to use a capital symbol, and we have retained that usage up to this point, mainly to reinforce the concept that we are dealing with random variables and not their actual values (we have had very little to say about actual values of these random variables). But we will now depart from the traditional notation of statistics. Thus, we may rewrite (10-33) and (10-34) as

$$y_i = H_ix_i + v_i, \qquad i = 0, 1, 2, \ldots, \tag{10-35}$$

$$x_{i+1} = F_ix_i + G_iu_i, \qquad i = 0, 1, 2, \ldots, \tag{10-36}$$

where we assume that $y_i$ is an $m$-dimensional vector, $x_i$ is an $n$-dimensional vector, $H_i$ is an $m \times n$ matrix, $v_i$ is an $m$-dimensional vector, $u_i$ is a $p$-dimensional vector, $F_i$ is an $n \times n$ matrix, and $G_i$ is an $n \times p$ matrix. We will refer to these equations as a state-space model, and will assume that the only portion of this model that is available to us is the process $\{y_i\}$. All other random processes are unobserved.

It is necessary to impose some statistical structure onto this model. We assume:

The process $\{v_i\}$ is a vector zero-mean white noise with covariance matrices

$$Ev_iv_j^T = R_i\delta_{ij},$$

where $\delta_{ij}$ is the Kronecker delta function.

The process $\{u_i\}$ is a vector zero-mean white noise with covariance matrices

$$Eu_iu_j^T = Q_i\delta_{ij}.$$

The cross-correlation matrices of $\{u_i\}$ and $\{v_i\}$ are of the form

$$Eu_iv_j^T = C_i\delta_{ij}.$$

The initial condition, or initial state vector, $x_0$, is a random variable with mean $m_x(0)$ and covariance

$$E[x_0 - m_x(0)][x_0 - m_x(0)]^T = \Pi_0,$$

and we must assume that the mean value and covariance are known. Without loss of generality, we will often assume that the mean is zero, since it is easy to include it after the theory has been developed. Thus, unless we state otherwise, we will assume that $m_x(0) = 0$ in the sequel. (Actually, there are some things that can be said if $m_x(0)$ is not known, and this is a central issue of set-valued estimation.)

We must also assume that the initial state vector is uncorrelated with all noise, that is,

$$Eu_ix_0^T = 0, \quad i \ge 0, \qquad Ev_ix_0^T = 0, \quad i \ge 0.$$

We must assume that $F_i$, $G_i$, $H_i$, $C_i$, $Q_i$, and $R_i$ are all known for all $i \ge 0$.

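Exercise 10-13 Verify each of the following relationships:

$$Ex_ju_k^T = 0, \quad k \ge j, \qquad Ex_jv_k^T = 0, \quad k \ge j,$$

$$Ey_ju_k^T = 0, \quad k > j, \qquad Ey_jv_k^T = 0, \quad k > j,$$

$$Ey_ku_k^T = C_k^T, \qquad Ey_kv_k^T = R_k.$$

Let us denote the state covariance matrix as

$$\Pi_i = E[x_i - m_x(i)][x_i - m_x(i)]^T, \qquad i \ge 0,$$

with $\Pi_0$ given (note we have assumed $m_x(0) = 0$; we will stop reminding the reader of this fact). The innovations are expressed as

$$\epsilon_i = y_i - \hat{y}_{i|i-1},$$

where $\hat{y}_{i|i-1} = H_i\hat{x}_{i|i-1} + \hat{v}_{i|i-1}$, with $\hat{x}_{i|i-1}$ and $\hat{v}_{i|i-1}$ the mmse estimates of $x_i$ and $v_i$, respectively, given $\{y_0, \ldots, y_{i-1}\}$. Thus,

$$\hat{v}_{i|i-1} = \sum_{j=0}^{i-1} Ev_i\epsilon_j^T\left[E\epsilon_j\epsilon_j^T\right]^{-1}\epsilon_j,$$

and, since $\epsilon_j$ is a linear function of $\{y_0, \ldots, y_j\}$, we have $Ev_i\epsilon_j^T = 0$ for $j < i$. Hence, $\hat{v}_{i|i-1} = 0$ and

$$\hat{y}_{i|i-1} = H_i\hat{x}_{i|i-1}.$$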

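(Recall that the subscript notation $\hat{x}_{i|j}$ means that the first index $i$ corresponds to the time of the state $x_i$, and the second index corresponds to the amount of data that is used in the calculation of the estimate; in this case, the data set $\{y_0, \ldots, y_j\}$.) The dynamics equation is

$$\hat{x}_{i+1|i} = F_i\hat{x}_{i|i} + G_i\hat{u}_{i|i}, \tag{10-37}$$

with

$$\hat{u}_{i|i} = \sum_{j=0}^{i} Eu_i\epsilon_j^T\left[R^\epsilon_j\right]^{-1}\epsilon_j = \sum_{j=0}^{i} Eu_i\left[H_jx_j + v_j - H_j\hat{x}_{j|j-1}\right]^T\left[R^\epsilon_j\right]^{-1}\epsilon_j,$$

where

$$R^\epsilon_j \stackrel{\text{def}}{=} E\epsilon_j\epsilon_j^T.$$

But $Eu_ix_j^T = 0$ for $j \le i$ and $Eu_iv_j^T = C_i\delta_{ij}$ by the modeling hypothesis, and $Eu_i\hat{x}_{j|j-1}^T = 0$ for $j \le i$ since $\hat{x}_{j|j-1}$ depends only upon $\{y_0, \ldots, y_{j-1}\}$, which is orthogonal to $u_i$, also by the modeling hypothesis. Thus,

$$\hat{u}_{i|i} = \sum_{j=0}^{i} Eu_iv_j^T\left[R^\epsilon_j\right]^{-1}\epsilon_j = C_i\left[R^\epsilon_i\right]^{-1}\epsilon_i.$$

Consequently, (10-37) becomes

$$\hat{x}_{i+1|i} = F_i\hat{x}_{i|i} + G_iC_i\left[R^\epsilon_i\right]^{-1}\epsilon_i. \tag{10-38}$$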

As a point of terminology, we often refer to $\hat{x}_{i|i}$ as the filtered estimate of $x_i$, and to $\hat{x}_{i|i-1}$ as the (one-step) predicted estimate of $x_i$. Also, we adopt the convention that observations begin at $i = 0$ (usually, the index $i$ will correspond to time).

We may view the predicted estimate (10-38) as a time-update equation, since it shows how the state evolves in time from $i$ to $i + 1$ in the absence of data. We can also think of obtaining a measurement-update equation to tell us how to convert the predicted estimate, $\hat{x}_{i+1|i}$, into a filtered estimate, $\hat{x}_{i+1|i+1}$. Recall the basic formula for the estimation of any random variable $X$ given the innovations:

$$\hat{X}_{|N} = \hat{X}_{|N-1} + EX\epsilon_N^T\left[R^\epsilon_N\right]^{-1}\epsilon_N.$$

Now set $N = i + 1$ and $X = x_{i+1}$ to obtain

$$\hat{x}_{i+1|i+1} = \hat{x}_{i+1|i} + Ex_{i+1}\epsilon_{i+1}^T\left[R^\epsilon_{i+1}\right]^{-1}\epsilon_{i+1}. \tag{10-39}$$

Equations (10-38) and (10-39) constitute a set of recursive equations, and indicate the way the state estimates evolve as observations are made as time progresses. It remains to compute $Ex_i\epsilon_i^T$ and $R^\epsilon_i$. We will shortly obtain expressions for these quantities, but before doing so, let us discuss the recursive nature of these equations. Assuming we can initialize the estimates, we can process a set of observations by computing a sequence of time updates and measurement updates by toggling between the time update and measurement update equations as we increment $i$. So one key question is that of initialization: How do we specify $\hat{x}_{i|i-1}$ for $i = 0$? To answer this question, let's recall that we want $\hat{x}_{i|i-1}$ to be an estimate of $x_i$ conditioned upon the observations sequence $\{y_0, \ldots, y_{i-1}\}$. But at time $i = 0$, this estimate becomes $\hat{x}_{0|-1}$, and there are no observations at negative time. Thus, $\hat{x}_{0|-1}$ must be an a priori estimate of $x_0$. A logical choice for $\hat{x}_{0|-1}$ is to equate it to the expected value of $x_0$, which is $m_x(0)$. But $\hat{x}_{0|-1}$ must be a random variable, and $m_x(0)$ is a known constant, and is not random. We can get around this problem by defining $\hat{x}_{0|-1}$ to be a zero-variance random variable, that is,

$$E\hat{x}_{0|-1} = m_x(0)$$

and

$$E\left[\hat{x}_{0|-1} - m_x(0)\right]\left[\hat{x}_{0|-1} - m_x(0)\right]^T = 0.$$

$\hat{x}_{0|-1}$ is termed the a priori estimate of $x_0$. Of course, we may assume $m_x(0) = 0$ without loss of generality.

To see how the recursion defined by (10-38) and (10-39) works, it is instructive to write out the terms for $i = 0$ and $i = 1$. Thus, the measurement update at time $i = 0$ becomes

$$\hat{x}_{0|0} = \hat{x}_{0|-1} + Ex_0\epsilon_0^T\left[R^\epsilon_0\right]^{-1}\epsilon_0,$$

with $\epsilon_0 = y_0$. But

$$R^\epsilon_0 = Ey_0y_0^T = E[H_0x_0 + v_0][H_0x_0 + v_0]^T = H_0\Pi_0H_0^T + R_0,$$

and

$$Ex_0\epsilon_0^T = E[x_0x_0^T]H_0^T = \Pi_0H_0^T.$$

Thus, putting these pieces together, we obtain

$$\hat{x}_{0|0} = \hat{x}_{0|-1} + \Pi_0H_0^T\left[H_0\Pi_0H_0^T + R_0\right]^{-1}\epsilon_0.$$

We can then predict to $i = 1$, yielding

$$\hat{x}_{1|0} = F_0\hat{x}_{0|0} + G_0C_0\left[R^\epsilon_0\right]^{-1}\epsilon_0 = F_0\hat{x}_{0|0} + G_0C_0\left[H_0\Pi_0H_0^T + R_0\right]^{-1}\epsilon_0.$$

One can continue in this way to obtain the general case, but a little experimentation will convince you that things get a bit messy, although it should be clear enough, in principle, how to proceed. Before developing the general solution to this recursive system, we will continue our digression and generate some variations on the time- and measurement-update equations. Substitute (10-38) into (10-39) to obtain

$$\hat{x}_{i+1|i+1} = F_i\hat{x}_{i|i} + Ex_{i+1}\epsilon_{i+1}^T\left[R^\epsilon_{i+1}\right]^{-1}\epsilon_{i+1} + G_iC_i\left[R^\epsilon_i\right]^{-1}\epsilon_i \tag{10-40}$$

where

$$\epsilon_{i+1} = y_{i+1} - H_{i+1}\hat{x}_{i+1|i} = y_{i+1} - H_{i+1}F_i\hat{x}_{i|i} - H_{i+1}G_iC_i\left[R^\epsilon_i\right]^{-1}\epsilon_i$$

with $\epsilon_0 = y_0$. Note that (10-40) employs only filtered estimates.

Alternatively, multiply both sides of (10-39) by $F_{i+1}$ and substitute into (10-38) to get (with $i + 1 \to i$)

$$\hat{x}_{i+1|i} = F_i\hat{x}_{i|i-1} + K_i\left[R^\epsilon_i\right]^{-1}\epsilon_i \tag{10-41}$$

where

$$\epsilon_i = y_i - H_i\hat{x}_{i|i-1}$$

with $\hat{x}_{0|-1} = 0$ and

$$K_i = Ex_{i+1}\epsilon_i^T = F_iEx_i\epsilon_i^T + G_iC_i.$$

Several other variations on these equations are possible. Equation (10-41) involves only predicted estimates, and is termed the one-step predictor. This formulation is very useful for theoretical development.

The parameters $Ex_i\epsilon_i^T$ and $E\epsilon_i\epsilon_i^T$ do not depend upon the actual observations $\{y_i\}$ but, rather, upon the model parameters $F$, $G$, $H$, $Q$, $R$, $C$, and $\Pi_0$. Thus, they may, if desired, be calculated in advance of the actual data collection. Explicit, closed-form solutions for $Ex_i\epsilon_i^T$ and $R^\epsilon_i$ are not available except in a few very special cases. Recursive ways of computing them, however, are known; we shall shortly develop the most famous way of doing so.

10.2

Innovations Representations

We recall that the model for the observed process $\{y_i, i = 0, 1, \ldots\}$ is

$$x_{i+1} = F_ix_i + G_iu_i \tag{10-42}$$

$$y_i = H_ix_i + v_i. \tag{10-43}$$

We will assume $Ex_0 = 0$ and $Ex_0x_0^T = \Pi_0$. This representation of the observations $\{y_i, i = 0, 1, \ldots\}$ will be called, for lack of a better term, the true model, since we presumably base it upon physical principles, and $x_i$ is intended to represent the actual, or true, state of the system. But don't forget that the state is a random process; what does it mean to be the true random state? (Probably not much unless the density function is degenerate.) We may express the covariance of $x_i$ as $Ex_ix_i^T = \Pi_i$, and observe that

$$\begin{aligned}
\Pi_{i+1} = Ex_{i+1}x_{i+1}^T &= E[F_ix_i + G_iu_i][F_ix_i + G_iu_i]^T \\
&= F_iEx_ix_i^TF_i^T + F_iEx_iu_i^TG_i^T + G_iEu_ix_i^TF_i^T + G_i\underbrace{Eu_iu_i^T}_{Q_i}G_i^T.
\end{aligned}$$

Since, by the modeling assumptions, we have $Ex_iu_i^T = 0$,

$$\Pi_{i+1} = F_i\Pi_iF_i^T + G_iQ_iG_i^T, \qquad \Pi_0 \text{ given}. \tag{10-44}$$
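The state covariance recursion (10-44) can be iterated directly, without reference to any data. The following is a minimal NumPy sketch for a time-invariant model; the function name `propagate_state_covariance` and the scalar model values are made up for illustration. For a stable scalar model with $|F| < 1$ the recursion should approach $Q/(1 - F^2)$.

```python
import numpy as np

def propagate_state_covariance(F, G, Q, Pi0, steps):
    """Iterate Pi_{i+1} = F Pi_i F^T + G Q G^T for a time-invariant model."""
    Pi = np.asarray(Pi0, dtype=float)
    history = [Pi]
    for _ in range(steps):
        Pi = F @ Pi @ F.T + G @ Q @ G.T
        history.append(Pi)
    return history

# Illustrative scalar model (values are assumptions): x_{i+1} = 0.5 x_i + u_i, Q = 3.
F = np.array([[0.5]])
G = np.array([[1.0]])
Q = np.array([[3.0]])
Pi0 = np.array([[0.0]])

history = propagate_state_covariance(F, G, Q, Pi0, steps=60)
print(history[1], history[2], history[-1])
```

Time-varying $F_i$, $G_i$, $Q_i$ would only require indexing the matrices inside the loop.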

Now let's think about (10-43) for a moment. In actuality, $y_i$ is the only directly observable quantity; it is the one we measure. The processes $x_i$, $u_i$, and $v_i$ are never directly available. Equations (10-42) and (10-43) constitute one way to characterize the observations process $\{y_i\}$, but we might rightly ask the question: Are there other ways to model the observations? The answer is: yes, and it is the so-called innovations model. We have already seen it:

$$\hat{x}_{i+1|i} = F_i\hat{x}_{i|i-1} + K_i\left[R^\epsilon_i\right]^{-1}\epsilon_i \tag{10-45}$$

$$y_i = H_i\hat{x}_{i|i-1} + \epsilon_i. \tag{10-46}$$

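As far as an observer is concerned, this signal model is just as valid as the true signal model given by (10-42) and (10-43). But there is at least one very big advantage of the innovations model over the true one: we have access to both $\hat{x}_{i|i-1}$ and $\epsilon_i$. So let's explore the innovations representation a bit further. We may calculate the covariance of $\{\hat{x}_{i+1|i}\}$ as

$$\begin{aligned}
E\hat{x}_{i+1|i}\hat{x}_{i+1|i}^T &= E\left[F_i\hat{x}_{i|i-1} + K_i[R^\epsilon_i]^{-1}\epsilon_i\right]\left[F_i\hat{x}_{i|i-1} + K_i[R^\epsilon_i]^{-1}\epsilon_i\right]^T \\
&= F_iE\hat{x}_{i|i-1}\hat{x}_{i|i-1}^TF_i^T + F_i\underbrace{E\hat{x}_{i|i-1}\epsilon_i^T}_{0}[R^\epsilon_i]^{-1}K_i^T \\
&\quad + K_i[R^\epsilon_i]^{-1}\underbrace{E\epsilon_i\hat{x}_{i|i-1}^T}_{0}F_i^T + K_i[R^\epsilon_i]^{-1}\underbrace{E\epsilon_i\epsilon_i^T}_{R^\epsilon_i}[R^\epsilon_i]^{-1}K_i^T.
\end{aligned} \tag{10-47}$$

But the innovations $\epsilon_i$ is orthogonal to the subspace spanned by $\{y_0, \ldots, y_{i-1}\}$, and since $\hat{x}_{i|i-1}$ lies in this subspace, we have

$$E\hat{x}_{i|i-1}\epsilon_i^T = 0.$$

If we define

$$\Sigma_{i+1|i} = E\hat{x}_{i+1|i}\hat{x}_{i+1|i}^T$$

as the covariance of $\hat{x}_{i+1|i}$, then (10-47) becomes

$$\Sigma_{i+1|i} = F_i\Sigma_{i|i-1}F_i^T + K_i\left[R^\epsilon_i\right]^{-1}K_i^T \tag{10-48}$$

with $\Sigma_{0|-1} = E\hat{x}_{0|-1}\hat{x}_{0|-1}^T = 0$.

Since we have the state $x_i$ associated with the true representation of the signal and the state $\hat{x}_{i|i-1}$ associated with the innovations representation of the signal, we may wish to compare them. Define the predicted state estimation error as

$$\tilde{x}_{i|i-1} = x_i - \hat{x}_{i|i-1},$$

and let

$$P_{i|i-1} \stackrel{\text{def}}{=} E\tilde{x}_{i|i-1}\tilde{x}_{i|i-1}^T$$

denote the estimation error covariance matrix.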

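We can write the innovations as

$$\epsilon_i = y_i - H_i\hat{x}_{i|i-1} = H_ix_i + v_i - H_i\hat{x}_{i|i-1} = H_i\tilde{x}_{i|i-1} + v_i. \tag{10-49}$$

Then we can express

$$R^\epsilon_i = E\epsilon_i\epsilon_i^T = H_i\underbrace{E\tilde{x}_{i|i-1}\tilde{x}_{i|i-1}^T}_{P_{i|i-1}}H_i^T + H_iE\tilde{x}_{i|i-1}v_i^T + Ev_i\tilde{x}_{i|i-1}^TH_i^T + \underbrace{Ev_iv_i^T}_{R_i}.$$

But

$$E\tilde{x}_{i|i-1}v_i^T = 0$$

since $v_i$ is orthogonal to both $x_i$ and $\hat{x}_{i|i-1}$. Consequently,

$$R^\epsilon_i = H_iP_{i|i-1}H_i^T + R_i. \tag{10-50}$$

Also,

$$\begin{aligned}
Ex_i\epsilon_i^T &= Ex_i\left(\tilde{x}_{i|i-1}^TH_i^T + v_i^T\right) \\
&= Ex_i\tilde{x}_{i|i-1}^TH_i^T + \underbrace{Ex_iv_i^T}_{0} \\
&= E\left[\hat{x}_{i|i-1} + \tilde{x}_{i|i-1}\right]\tilde{x}_{i|i-1}^TH_i^T \\
&= \underbrace{E\hat{x}_{i|i-1}\tilde{x}_{i|i-1}^T}_{0}H_i^T + \underbrace{E\tilde{x}_{i|i-1}\tilde{x}_{i|i-1}^T}_{P_{i|i-1}}H_i^T.
\end{aligned}$$

Thus,

$$Ex_i\epsilon_i^T = P_{i|i-1}H_i^T. \tag{10-51}$$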

10.3

A Recursion for $P_{i|i-1}$

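Since $\hat{x}_{i|i-1}$ and $\tilde{x}_{i|i-1}$ are orthogonal, we have an orthogonal decomposition of $x_i$:

$$x_i = \hat{x}_{i|i-1} + \tilde{x}_{i|i-1}.$$

(Recall that orthogonality means uncorrelated.) Consequently, taking the variance of both sides of this expression (assuming all random variables are zero-mean), we obtain

$$\begin{aligned}
\Pi_i = Ex_ix_i^T &= E\left[\hat{x}_{i|i-1} + \tilde{x}_{i|i-1}\right]\left[\hat{x}_{i|i-1} + \tilde{x}_{i|i-1}\right]^T \\
&= \underbrace{E\hat{x}_{i|i-1}\hat{x}_{i|i-1}^T}_{\Sigma_{i|i-1}} + \underbrace{E\hat{x}_{i|i-1}\tilde{x}_{i|i-1}^T}_{0} + \underbrace{E\tilde{x}_{i|i-1}\hat{x}_{i|i-1}^T}_{0} + \underbrace{E\tilde{x}_{i|i-1}\tilde{x}_{i|i-1}^T}_{P_{i|i-1}},
\end{aligned}$$

or

$$\Pi_i = \Sigma_{i|i-1} + P_{i|i-1}. \tag{10-52}$$

Since $\Sigma_{0|-1} = 0$, we have $P_{0|-1} = \Pi_0$. Rearranging (10-52) and applying (10-44) and (10-48), we have

$$P_{i+1|i} = \Pi_{i+1} - \Sigma_{i+1|i} = F_i\Pi_iF_i^T + G_iQ_iG_i^T - F_i\Sigma_{i|i-1}F_i^T - K_i\left[R^\epsilon_i\right]^{-1}K_i^T$$

or

$$P_{i+1|i} = F_iP_{i|i-1}F_i^T + G_iQ_iG_i^T - K_i\left[R^\epsilon_i\right]^{-1}K_i^T.$$

Since

$$K_i = F_iP_{i|i-1}H_i^T + G_iC_i$$

and

$$R^\epsilon_i = H_iP_{i|i-1}H_i^T + R_i,$$

we obtain

$$P_{i+1|i} = F_iP_{i|i-1}F_i^T + G_iQ_iG_i^T - \left[F_iP_{i|i-1}H_i^T + G_iC_i\right]\left[H_iP_{i|i-1}H_i^T + R_i\right]^{-1}\left[F_iP_{i|i-1}H_i^T + G_iC_i\right]^T \tag{10-53}$$

with

$$P_{0|-1} = \Pi_0. \tag{10-54}$$

Equation (10-53) is known as a matrix Riccati difference equation, after the Italian mathematician who first analyzed nonlinear differential equations of this form. This difference equation is nonlinear, but can be easily solved by recursive means.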
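The recursive solution of (10-53) can be sketched in a few lines of NumPy. The function name `riccati_step` and the two-state model values below are assumptions chosen only for illustration (time-invariant matrices, a small nonzero $C$). Alongside the Riccati recursion, the sketch also propagates $\Pi_i$ via (10-44) and $\Sigma_{i|i-1}$ via (10-48) and checks the decomposition (10-52) at every step.

```python
import numpy as np

def riccati_step(P, F, G, H, Q, R, C):
    """One step of (10-53): returns P_{i+1|i}, K_i and R^eps_i given P_{i|i-1}."""
    R_eps = H @ P @ H.T + R                 # innovations covariance (10-50)
    K = F @ P @ H.T + G @ C                 # K_i = F P H^T + G C
    # K R_eps^{-1} K^T is formed with a linear solve instead of an explicit inverse.
    P_next = F @ P @ F.T + G @ Q @ G.T - K @ np.linalg.solve(R_eps, K.T)
    return P_next, K, R_eps

# Made-up two-state model with scalar input and scalar observation.
F = np.array([[1.0, 1.0], [0.0, 1.0]])
G = np.array([[0.5], [1.0]])
H = np.array([[1.0, 0.0]])
Q = np.array([[0.1]])
R = np.array([[1.0]])
C = np.array([[0.05]])

Pi = np.eye(2)              # Pi_0
Sigma = np.zeros((2, 2))    # Sigma_{0|-1} = 0
P = Pi.copy()               # P_{0|-1} = Pi_0

decomposition_holds = True
for _ in range(20):
    P_next, K, R_eps = riccati_step(P, F, G, H, Q, R, C)
    Sigma = F @ Sigma @ F.T + K @ np.linalg.solve(R_eps, K.T)   # (10-48)
    Pi = F @ Pi @ F.T + G @ Q @ G.T                             # (10-44)
    P = P_next
    decomposition_holds &= bool(np.allclose(Pi, Sigma + P))     # (10-52)

print(decomposition_holds)
```

None of these quantities depends on the observations, so the whole sequence $P_{i|i-1}$ can be computed before any data are collected.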


10.4

The Discrete-Time Kalman Filter

With the development of the matrix Riccati equation (10-53), we have completed every step needed for the celebrated Kalman filter. We will present two useful ways to express the Kalman filter; in fact, we have already introduced them. One is the one-step predictor equation, and the other is the time-update/measurement-update formulation. Let's see them both now that we know how to evaluate all of the expectations.

The One-Step Predictor Form

Substitution of (10-50) and (10-51) into (10-41) yields the one-step predictor form of the Kalman filter:

$$\hat{x}_{i+1|i} = F_i\hat{x}_{i|i-1} + \left[F_iP_{i|i-1}H_i^T + G_iC_i\right]\left[H_iP_{i|i-1}H_i^T + R_i\right]^{-1}\left[y_i - H_i\hat{x}_{i|i-1}\right]$$

with $\hat{x}_{0|-1} = m_x(0)$, and $P_{i|i-1}$ is given by (10-53) and (10-54).

Time-Update/Measurement-Update Form

Since both the state estimate and the associated error covariance need to be updated, we will derive time- and measurement-update equations for both of these quantities. First, we consider the time-update equation for the state. Substitution of (10-50) into (10-38) yields the time-update equation:

$$\hat{x}_{i+1|i} = F_i\hat{x}_{i|i} + G_iC_i\left[H_iP_{i|i-1}H_i^T + R_i\right]^{-1}\underbrace{\left[y_i - H_i\hat{x}_{i|i-1}\right]}_{\epsilon_i}.$$

Also, substitution of (10-50) and (10-51) into (10-39) yields the measurement-update equation:

$$\hat{x}_{i+1|i+1} = \hat{x}_{i+1|i} + P_{i+1|i}H_{i+1}^T\left[H_{i+1}P_{i+1|i}H_{i+1}^T + R_{i+1}\right]^{-1}\underbrace{\left[y_{i+1} - H_{i+1}\hat{x}_{i+1|i}\right]}_{\epsilon_{i+1}}.$$

The covariance matrix, $P_{i|i-1}$, is obtained via (10-53) and (10-54).

Alternate Time-Update/Measurement-Update Form with $C_i \equiv 0$

The time-update/measurement-update formulation given above is not quite as convenient as it might be, since both expressions involve the predicted covariance, $P_{i|i-1}$. A more useful representation of the estimator may be obtained when $C_i \equiv 0$ by developing separate expressions for the time update and measurement update of the estimation error covariance, as well as of the estimated state.

We already have introduced the predicted state estimation error covariance, $P_{i|i-1} = E\tilde{x}_{i|i-1}\tilde{x}_{i|i-1}^T$. What we also need to develop is an expression for the filtered state estimation error covariance, $P_{i|i} = E\tilde{x}_{i|i}\tilde{x}_{i|i}^T$, where $\tilde{x}_{i|i} = x_i - \hat{x}_{i|i}$. Define the Kalman gain matrix

$$W_i = P_{i|i-1}H_i^T\left[R^\epsilon_i\right]^{-1} = P_{i|i-1}H_i^T\left[H_iP_{i|i-1}H_i^T + R_i\right]^{-1}. \tag{10-55}$$

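From the measurement update equation we have

$$\hat{x}_{i+1|i+1} = \hat{x}_{i+1|i} + P_{i+1|i}H_{i+1}^T\left[R^\epsilon_{i+1}\right]^{-1}\epsilon_{i+1} = \hat{x}_{i+1|i} + W_{i+1}\epsilon_{i+1}. \tag{10-56}$$

Now let us formulate the filtered state error covariance matrix

$$P_{i|i} = E\left[x_i - \hat{x}_{i|i}\right]\left[x_i - \hat{x}_{i|i}\right]^T. \tag{10-57}$$

Substituting (10-56) into (10-57), we obtain

$$\begin{aligned}
P_{i|i} &= E\left[\tilde{x}_{i|i-1} - W_i\epsilon_i\right]\left[\tilde{x}_{i|i-1} - W_i\epsilon_i\right]^T \\
&= \underbrace{E\tilde{x}_{i|i-1}\tilde{x}_{i|i-1}^T}_{P_{i|i-1}} - E\tilde{x}_{i|i-1}\epsilon_i^TW_i^T - W_iE\epsilon_i\tilde{x}_{i|i-1}^T + W_iE\epsilon_i\epsilon_i^TW_i^T.
\end{aligned}$$

But, from (10-49),

$$E\tilde{x}_{i|i-1}\epsilon_i^T = \underbrace{E\tilde{x}_{i|i-1}\tilde{x}_{i|i-1}^T}_{P_{i|i-1}}H_i^T + \underbrace{E\tilde{x}_{i|i-1}v_i^T}_{0}.$$

Hence, using (10-55), we obtain

$$\begin{aligned}
P_{i|i} &= P_{i|i-1} - P_{i|i-1}H_i^TW_i^T - W_iH_iP_{i|i-1} + W_i\left[H_iP_{i|i-1}H_i^T + R_i\right]W_i^T \\
&= P_{i|i-1} - P_{i|i-1}H_i^T\underbrace{\left[H_iP_{i|i-1}H_i^T + R_i\right]^{-1}H_iP_{i|i-1}}_{W_i^T} - \underbrace{P_{i|i-1}H_i^T\left[H_iP_{i|i-1}H_i^T + R_i\right]^{-1}}_{W_i}H_iP_{i|i-1} \\
&\quad + \underbrace{P_{i|i-1}H_i^T\left[H_iP_{i|i-1}H_i^T + R_i\right]^{-1}}_{W_i}\left[H_iP_{i|i-1}H_i^T + R_i\right]\underbrace{\left[H_iP_{i|i-1}H_i^T + R_i\right]^{-1}H_iP_{i|i-1}}_{W_i^T} \\
&= P_{i|i-1} - \underbrace{P_{i|i-1}H_i^T\left[H_iP_{i|i-1}H_i^T + R_i\right]^{-1}}_{W_i}H_iP_{i|i-1} \\
&= \left[I - W_iH_i\right]P_{i|i-1}.
\end{aligned}$$

Exercise 10-14 Show that an equivalent form for $P_{i|i}$ is

$$P_{i|i} = P_{i|i-1} - W_iR^\epsilon_iW_i^T = P_{i|i-1} - W_i\left[H_iP_{i|i-1}H_i^T + R_i\right]W_i^T.$$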

We will complete the time-update/measurement-update structure for the Riccati equation by obtaining an expression for $P_{i+1|i}$ in terms of $P_{i|i}$. Rearranging (10-53) with $C_i \equiv 0$,

$$P_{i+1|i} = F_i\underbrace{\left[P_{i|i-1} - P_{i|i-1}H_i^T\left[H_iP_{i|i-1}H_i^T + R_i\right]^{-1}H_iP_{i|i-1}\right]}_{P_{i|i}}F_i^T + G_iQ_iG_i^T.$$

Thus, summarizing, we have the following result, which is the traditional formulation of the Kalman filter:

Theorem 2 Let

$$x_{i+1} = F_ix_i + G_iu_i$$

$$y_i = H_ix_i + v_i$$

for $i = 0, 1, \ldots$, with $Ex_0 = m_x(0)$, $E[x_0 - m_x(0)][x_0 - m_x(0)]^T = \Pi_0$, $\{v_i\}$ a vector zero-mean white noise with $Ev_iv_j^T = R_i\delta_{ij}$, and $\{u_i\}$ a vector zero-mean white noise with $Eu_iu_j^T = Q_i\delta_{ij}$. Also, assume that $Eu_iv_j^T = 0$ for all $i$ and $j$, $Eu_ix_0^T = 0$ for $i \ge 0$, and $Ev_ix_0^T = 0$ for $i \ge 0$. Then the linear mean-square estimate of the state, given observations $\{y_i, i \ge 0\}$, is obtained from the measurement update

$$\hat{x}_{i+1|i+1} = \hat{x}_{i+1|i} + W_{i+1}\left[y_{i+1} - H_{i+1}\hat{x}_{i+1|i}\right]$$

$$P_{i+1|i+1} = P_{i+1|i} - W_{i+1}R^\epsilon_{i+1}W_{i+1}^T = \left[I - W_{i+1}H_{i+1}\right]P_{i+1|i}$$

and the time update

$$\hat{x}_{i+1|i} = F_i\hat{x}_{i|i}$$

$$P_{i+1|i} = F_iP_{i|i}F_i^T + G_iQ_iG_i^T \tag{10-58}$$

where the initial conditions are

$$\hat{x}_{0|-1} = m_x(0), \qquad P_{0|-1} = \Pi_0 \tag{10-59}$$

and

$$W_i = P_{i|i-1}H_i^T\left[H_iP_{i|i-1}H_i^T + R_i\right]^{-1}.$$
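The recursion of Theorem 2 translates almost line for line into code. The following is a minimal NumPy sketch, not a production filter: it assumes time-invariant system matrices and $C \equiv 0$, and the function name `kalman_filter` and the test values are made up. The check at the end uses a scalar constant state ($F = 1$, $Q = 0$, $H = 1$, $R = 1$, $\Pi_0 = 1$, $m_x(0) = 0$), for which the filtered estimate after $k$ measurements should be their sum divided by $k + 1$, with error variance $1/(k+1)$.

```python
import numpy as np

def kalman_filter(ys, F, G, H, Q, R, m0, Pi0):
    """Time-invariant Kalman filter with C = 0.

    Returns the filtered estimates x_{i|i} and covariances P_{i|i}.
    """
    x_pred = np.asarray(m0, dtype=float)     # x_{0|-1} = m_x(0)
    P_pred = np.asarray(Pi0, dtype=float)    # P_{0|-1} = Pi_0
    I = np.eye(P_pred.shape[0])
    estimates, covariances = [], []
    for y in ys:
        # Measurement update.
        R_eps = H @ P_pred @ H.T + R
        # W = P H^T R_eps^{-1}; uses the symmetry of P_pred and R_eps.
        W = np.linalg.solve(R_eps, H @ P_pred).T
        x_filt = x_pred + W @ (y - H @ x_pred)
        P_filt = (I - W @ H) @ P_pred
        estimates.append(x_filt)
        covariances.append(P_filt)
        # Time update (10-58).
        x_pred = F @ x_filt
        P_pred = F @ P_filt @ F.T + G @ Q @ G.T
    return estimates, covariances

# Scalar constant-state check (all values are illustrative).
F = np.array([[1.0]]); G = np.array([[0.0]]); H = np.array([[1.0]])
Q = np.array([[0.0]]); R = np.array([[1.0]])
ys = [np.array([3.0]), np.array([1.0]), np.array([2.0])]
estimates, covariances = kalman_filter(ys, F, G, H, Q, R, m0=[0.0], Pi0=[[1.0]])
print([float(x[0]) for x in estimates], [float(P[0, 0]) for P in covariances])
```

The form $[I - W H]P$ is used for the covariance update because it is the shortest; the equivalent form $P - W R^\epsilon W^T$ from the theorem is often preferred in practice because it keeps the result symmetric by construction.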

Exercise 10-15 Show that an alternative form of the Riccati equation (assume $C_i \equiv 0$) is

$$P_{i+1|i} = F_iP_{i|i-1}\left[I + H_i^TR_i^{-1}H_iP_{i|i-1}\right]^{-1}F_i^T + G_iQ_iG_i^T.$$

To establish this result, you may wish to use the following identity:

$$\left[I + H^TR^{-1}HP\right]^{-1} = I - H^T\left[R + HPH^T\right]^{-1}HP,$$

which in turn is a special case of the famous identity

$$[A + BCD]^{-1} = A^{-1} - A^{-1}B\left[DA^{-1}B + C^{-1}\right]^{-1}DA^{-1}.$$
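Both identities are easy to sanity-check numerically before attempting a proof. The sketch below uses NumPy with randomly generated matrices; the sizes, the seed, the helper `random_spd`, and the choice $D = B^T$ (which guarantees that every inverse involved exists when $A$ and $C$ are positive-definite) are assumptions made only for this check.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 2
inv = np.linalg.inv

def random_spd(k):
    """Random symmetric positive-definite k x k matrix."""
    M = rng.standard_normal((k, k))
    return M @ M.T + k * np.eye(k)

# General identity: [A + BCD]^{-1} = A^{-1} - A^{-1} B [D A^{-1} B + C^{-1}]^{-1} D A^{-1}.
A, C = random_spd(n), random_spd(m)
B = rng.standard_normal((n, m))
D = B.T
lhs = inv(A + B @ C @ D)
rhs = inv(A) - inv(A) @ B @ inv(D @ inv(A) @ B + inv(C)) @ D @ inv(A)
lemma_holds = np.allclose(lhs, rhs)

# Special case: [I + H^T R^{-1} H P]^{-1} = I - H^T [R + H P H^T]^{-1} H P.
P, R = random_spd(n), random_spd(m)
H = rng.standard_normal((m, n))
I = np.eye(n)
special_lhs = inv(I + H.T @ inv(R) @ H @ P)
special_rhs = I - H.T @ inv(R + H @ P @ H.T) @ H @ P
special_holds = np.allclose(special_lhs, special_rhs)

print(lemma_holds, special_holds)
```

The special case follows from the general identity with $A = I$, $B = H^T$, $C = R^{-1}$, and $D = HP$.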

This identity is proven in many places; see, for example, the appendix of Kailath's Linear Systems. It has great utility in linear estimation theory, and we will see it from time to time.

Exercise 10-16 We are given observations

$$y_i = x_i + n_i, \qquad i = 0, \pm 1, \pm 2, \ldots$$

where $\{x_i\}$ and $\{n_i\}$ are stationary processes with power spectral densities $S_x(z)$ and $S_n(z)$, respectively (here, $z$ is the $z$-transform variable). We shall use a noncausal linear filter with impulse response $h_i$ to estimate $x_i$, that is,

$$\hat{x}_i = \sum_{j=-\infty}^{\infty} h_{i-j}\,y_j.$$

Show that the mean-square error, $E[x_i - \hat{x}_i]^2$, will be minimized by choosing

$$H(z) = \frac{S_x(z)}{S_x(z) + S_n(z)}.$$

Exercise 10-17 Define a random process $\{y_0, y_1, \ldots\}$ by the following recursive procedure: Let $y_0$ be a random variable uniformly distributed over $(0, 1)$ and define $y_k$ as the fractional part of $2y_{k-1}$, $k = 1, 2, \ldots$. Show that

$$Ey_k = 0.5, \qquad \operatorname{cov}(y_k, y_i) = \frac{1}{12}\,2^{-|k-i|}.$$

Show that

$$\hat{y}_{k|k-1} = \frac{1}{4} + \frac{1}{2}\,y_{k-1},$$

where $\hat{y}_{k|k-1}$ is the linear least squares predictor of $y_k$ given $\{y_0, \ldots, y_{k-1}\}$. Demonstrate that

$$E(y_k - \hat{y}_{k|k-1})^2 = \frac{1}{16}.$$

Can you find a better nonlinear predictor? If so, what is it?

NOTE: If $y_0 = 0.a_1a_2a_3\cdots$ (in binary), observe that the $\{a_k\}$ will be independent random variables taking values $\{0, 1\}$, each with probability $\frac{1}{2}$, and that we shall have

$$y_k = 0.a_{k+1}a_{k+2}\cdots = \sum_{i=1}^{\infty} \frac{a_{k+i}}{2^i}.$$

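Exercise 10-18 Consider a process $\{y_k\}$ with a state-space model

$$x_{k+1} = Fx_k + Gu_k, \qquad y_k = Hx_k + v_k, \qquad k \ge 0,$$

where

$$E\begin{bmatrix} u_i \\ v_i \\ x_0 \end{bmatrix}\begin{bmatrix} u_j^T & v_j^T & x_0^T \end{bmatrix} = \begin{bmatrix} Q\delta_{ij} & C\delta_{ij} & 0 \\ C^T\delta_{ij} & R\delta_{ij} & 0 \\ 0 & 0 & \Pi_0 \end{bmatrix},$$

where $\delta_{ij}$ is the Kronecker delta function. Define $\Pi_k = Ex_kx_k^T$. Show that we can write

$$Ey_iy_j^T = \begin{cases} HF^{i-j}N_j + R\delta_{ij}, & i \ge j, \\ N_i^T\left[F^{j-i}\right]^TH^T, & i < j, \end{cases}$$

where

$$N_j = \Pi_jH^T + GC.$$

Exercise 10-19 A process $\{y_k\}$ is called wide-sense Markov if the linear least squares estimate of $y_{k+j}$, $j > 0$, given $\{y_i, i \le k\}$, depends only upon the value of $y_k$. Show that a process is wide-sense Markov if and only if

$$f(i, k) = f(i, j)f(j, k), \qquad i \ge j \ge k,$$

where

$$f(i, j) \stackrel{\text{def}}{=} \frac{r(i, j)}{r(j, j)}, \qquad r(i, j) \stackrel{\text{def}}{=} Ey_iy_j.$$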

10.5

Perspective

We see that the Kalman filter is a solution to the general problem of estimating the state of a linear system. Such restrictions as stationarity or time-invariance are not important to the derivation. What is important, however, is the assumption that the noise processes are uncorrelated. Also, we do not need to know the complete distribution of the noise, only its first and second moments. This is a big simplification to the problem, and one of the nice things about linear estimation theory, and is not true of general nonlinear systems.

There are many ways to derive the Kalman filter. I have chosen the method that, in my opinion, gives the most insight into the structure of the problem, namely, the orthogonal projections concept. This is essentially the way Kalman first derived the filter; it is not the only way to prove it. Some alternative ways from some other backgrounds:

Control Theory. The problem of building an asymptotic observer for estimating the state of a system to be used for full state feedback is extremely important in control theory. From that perspective, what is required is an optimal stochastic observer. It is well known that, if the observer gains are chosen very large (so that convergence will be fast), then the observation noise will be amplified and the state estimate will have a very high variance and, hence, will be of little value for state feedback applications. To see how one might formulate this problem in optimal control theory, let us define the cost functional

$$\begin{aligned}
J &= \frac{1}{2}[x_0 - \mu_0]^T\Pi_0^{-1}[x_0 - \mu_0] + \frac{1}{2}[x_N - \mu_f]^T\Pi_f^{-1}[x_N - \mu_f] \\
&\quad + \frac{1}{2}\sum_{k=1}^{N-1}[y_k - H_kx_k]^TR_k^{-1}[y_k - H_kx_k] + \frac{1}{2}\sum_{k=0}^{N-1}u_k^TQ_k^{-1}u_k,
\end{aligned}$$

where $k = 0, 1, \ldots, N$ and $\mu_0$ and $\mu_f$ are some initial and terminal constraints on the state. The solution is obtained by a classical calculus of variations argument, which yields the minimization of $J$ subject to the system state model constraints. We will not pursue this discussion here in any detail. The solution is in the form of a so-called two-point boundary-value problem (TPBVP). The resulting solution is, however, not exactly the Kalman filter. Recall that the Kalman filter is causal, in that the innovations are causally and inversely causally related to the observations sequence. Here, however, we are using all of the data simultaneously to determine the optimal state estimates (optimal in the sense that they minimize the cost functional $J$). The solution turns out to be what is called the optimal smoother and, by careful identification of terms, one can see that the Kalman filter is embedded in it. We will not develop these equations in this class; I mention them only to provide a cultural background for this very important body of theory that we are developing.

Probability theory. The orthogonal projections approach we have taken for the development of the Kalman filter does not rely on anything more than knowledge of the first and second moments of the distributions of all the random processes involved. If we do indeed have complete knowledge of all distributions involved, we should perhaps wonder if we might do better than with just partial knowledge. This is a realistic question to address, and the answer is that, for linear Gaussian systems, we do not buy anything more! The reason is, succinctly, that the first and second moments completely specify the Gaussian distribution.

Statistical methods. One might consider the estimation problem from a couple of other aspects. For example, techniques such as minimum variance and maximum likelihood have great utility in classical statistics; perhaps they will lead to a different (and, maybe, better) estimator. Fond hope. It is not too hard to see that the Kalman filter admits interpretations as both a minimum variance estimator and a maximum likelihood estimator.

The fact is that, under fairly wide and applicable conditions, the least-squares, conditional expectations, maximum likelihood, minimum variance, and optimal control interpretations of the Kalman filter are all equivalent. This is quite remarkable to me, and I do not pretend to fathom the deepest meanings of this happy circumstance. I believe the answer lies in the basic structure of linear systems, and I think that the orthogonality principle is the most basic mathematical foundation, but who knows . . .

10.6 Kalman Filter Example

The purpose of this example is to gain some intuition and experience in the operation of the Kalman filter. Consider a six-state linear system with a three-dimensional observations vector corresponding to three-dimensional equations of motion of a moving vehicle. The observations consist of noisy samples of the vehicle position.

10.6.1 Model Equations

Let $\mathbf{x} = [x, y, z, \dot{x}, \dot{y}, \dot{z}]^T$ denote the kinematic state of a target in some convenient coordinate system. The dynamics equation is
\[
\mathbf{x}_i =
\underbrace{\begin{bmatrix}
1 & 0 & 0 & \Delta & 0 & 0\\
0 & 1 & 0 & 0 & \Delta & 0\\
0 & 0 & 1 & 0 & 0 & \Delta\\
0 & 0 & 0 & 1 & 0 & 0\\
0 & 0 & 0 & 0 & 1 & 0\\
0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}}_{F}
\mathbf{x}_{i-1} +
\underbrace{\begin{bmatrix}
u_{x_i}\\ u_{y_i}\\ u_{z_i}\\ u_{\dot{x}_i}\\ u_{\dot{y}_i}\\ u_{\dot{z}_i}
\end{bmatrix}}_{u_i},
\]
where $\Delta$ is the sample interval. (We assume that $G \equiv I$.)

For a physical observation system, we will not usually be able to measure position directly. Let us assume, however, that an optical angle-of-arrival sensor system is available (for example, from infra-red sensors), yielding azimuth and elevation angles of the vehicle. Further, we assume that the sensor is sufficiently far from the target that a linearized model is adequate. For convenience we also assume that the measurement units are scaled such that, say, one unit of angle corresponds to one meter of displacement, and that the coordinate system is resolved along the azimuth and elevation angles. Then the observations vector is
\[
y_i = \begin{bmatrix} y_{i1}\\ y_{i2}\\ y_{i3} \end{bmatrix}
= \underbrace{\begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0\\
0 & 1 & 0 & 0 & 0 & 0\\
0 & 0 & 1 & 0 & 0 & 0
\end{bmatrix}}_{H} \mathbf{x}_i + v_i .
\]

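Now let's set up the $Q$, $\Pi_0$, and $R$ matrices. The only slightly tricky thing is the $Q$ matrix, so let's tackle it first. You might have been wondering why we set $G = I$ and set up six different process noise components in $u_i$. The reason has to do with the continuous-discrete conversion of the dynamics equations. The discrete-time model given above is derived from the continuous-time dynamics equation
\[
\underbrace{\begin{bmatrix}
\dot{x}_t\\ \dot{y}_t\\ \dot{z}_t\\ \ddot{x}_t\\ \ddot{y}_t\\ \ddot{z}_t
\end{bmatrix}}_{\dot{\mathbf{x}}_t}
=
\underbrace{\begin{bmatrix}
0 & 0 & 0 & 1 & 0 & 0\\
0 & 0 & 0 & 0 & 1 & 0\\
0 & 0 & 0 & 0 & 0 & 1\\
0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix}}_{\mathcal{F}}
\underbrace{\begin{bmatrix}
x_t\\ y_t\\ z_t\\ \dot{x}_t\\ \dot{y}_t\\ \dot{z}_t
\end{bmatrix}}_{\mathbf{x}_t}
+
\underbrace{\begin{bmatrix}
0 & 0 & 0\\
0 & 0 & 0\\
0 & 0 & 0\\
1 & 0 & 0\\
0 & 1 & 0\\
0 & 0 & 1
\end{bmatrix}}_{\mathcal{G}}
\underbrace{\begin{bmatrix}
w_{x_t}\\ w_{y_t}\\ w_{z_t}
\end{bmatrix}}_{w_t}
\tag{10-60}
\]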

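where the continuous-time system matrices $\mathcal{F}$ and $\mathcal{G}$ are defined in (10-60) and $w_t$ is a continuous-time white noise. To simplify the following development, let's assume a constant sampling interval, $\Delta$. To convert this equation to discrete time, we first must calculate the state transition matrix, $F$. This is easily done by setting
\[
\Phi(t) = \exp\{\mathcal{F} t\}
= \exp\left\{
\begin{bmatrix}
0 & 0 & 0 & 1 & 0 & 0\\
0 & 0 & 0 & 0 & 1 & 0\\
0 & 0 & 0 & 0 & 0 & 1\\
0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix} t \right\}
=
\begin{bmatrix}
1 & 0 & 0 & t & 0 & 0\\
0 & 1 & 0 & 0 & t & 0\\
0 & 0 & 1 & 0 & 0 & t\\
0 & 0 & 0 & 1 & 0 & 0\\
0 & 0 & 0 & 0 & 1 & 0\\
0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}.
\]
The exponential series terminates after the linear term because $\mathcal{F}^2 = 0$. For a step size of $\Delta = t_{i+1} - t_i$, the transition matrix becomes, therefore, $F = \Phi(t_{i+1} - t_i) = \Phi(\Delta)$.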

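Now to calculate $Q$. Let us express the covariance of the continuous-time process noise $w_t$ as
\[
E\, w_t w_s^T = \mathcal{Q}\,\delta(t-s) =
\begin{bmatrix}
q_x^2 & 0 & 0\\
0 & q_y^2 & 0\\
0 & 0 & q_z^2
\end{bmatrix} \delta(t-s).
\]
The discrete-time process noise is obtained via the superposition integral as
\[
u_i = \int_{t_i}^{t_{i+1}} \Phi(t_{i+1} - t)\,\mathcal{G}\, w_t \, dt .
\]
Clearly, $E\, u_i u_j^T = 0$ for $i \neq j$, and
\[
\begin{aligned}
E\, u_i u_i^T
&= \int_{t_i}^{t_{i+1}} \int_{t_i}^{t_{i+1}} \Phi(t_{i+1}-t)\,\mathcal{G}\, \underbrace{E\, w_t w_s^T}_{\mathcal{Q}\,\delta(t-s)}\, \mathcal{G}^T \Phi^T(t_{i+1}-s)\, dt\, ds \\
&= \int_{t_i}^{t_{i+1}} \Phi(t_{i+1}-t)\,\mathcal{G} \mathcal{Q} \mathcal{G}^T \Phi^T(t_{i+1}-t)\, dt .
\end{aligned}
\]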

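Substituting in the values for $\Phi(t)$ and $\mathcal{G}$, we obtain
\[
\begin{aligned}
Q = E\, u_i u_i^T
&= \int_{t_i}^{t_{i+1}}
\begin{bmatrix}
t_{i+1}-t & 0 & 0\\
0 & t_{i+1}-t & 0\\
0 & 0 & t_{i+1}-t\\
1 & 0 & 0\\
0 & 1 & 0\\
0 & 0 & 1
\end{bmatrix}
\begin{bmatrix}
q_x^2 & 0 & 0\\
0 & q_y^2 & 0\\
0 & 0 & q_z^2
\end{bmatrix}
\begin{bmatrix}
t_{i+1}-t & 0 & 0 & 1 & 0 & 0\\
0 & t_{i+1}-t & 0 & 0 & 1 & 0\\
0 & 0 & t_{i+1}-t & 0 & 0 & 1
\end{bmatrix} dt \\
&= \int_{t_i}^{t_{i+1}}
\begin{bmatrix}
q_x^2 (t_{i+1}-t)^2 & 0 & 0 & q_x^2 (t_{i+1}-t) & 0 & 0\\
0 & q_y^2 (t_{i+1}-t)^2 & 0 & 0 & q_y^2 (t_{i+1}-t) & 0\\
0 & 0 & q_z^2 (t_{i+1}-t)^2 & 0 & 0 & q_z^2 (t_{i+1}-t)\\
q_x^2 (t_{i+1}-t) & 0 & 0 & q_x^2 & 0 & 0\\
0 & q_y^2 (t_{i+1}-t) & 0 & 0 & q_y^2 & 0\\
0 & 0 & q_z^2 (t_{i+1}-t) & 0 & 0 & q_z^2
\end{bmatrix} dt \\
&= \int_{0}^{\Delta}
\begin{bmatrix}
q_x^2 t^2 & 0 & 0 & q_x^2 t & 0 & 0\\
0 & q_y^2 t^2 & 0 & 0 & q_y^2 t & 0\\
0 & 0 & q_z^2 t^2 & 0 & 0 & q_z^2 t\\
q_x^2 t & 0 & 0 & q_x^2 & 0 & 0\\
0 & q_y^2 t & 0 & 0 & q_y^2 & 0\\
0 & 0 & q_z^2 t & 0 & 0 & q_z^2
\end{bmatrix} dt \\
&=
\begin{bmatrix}
q_x^2 \frac{\Delta^3}{3} & 0 & 0 & q_x^2 \frac{\Delta^2}{2} & 0 & 0\\
0 & q_y^2 \frac{\Delta^3}{3} & 0 & 0 & q_y^2 \frac{\Delta^2}{2} & 0\\
0 & 0 & q_z^2 \frac{\Delta^3}{3} & 0 & 0 & q_z^2 \frac{\Delta^2}{2}\\
q_x^2 \frac{\Delta^2}{2} & 0 & 0 & q_x^2 \Delta & 0 & 0\\
0 & q_y^2 \frac{\Delta^2}{2} & 0 & 0 & q_y^2 \Delta & 0\\
0 & 0 & q_z^2 \frac{\Delta^2}{2} & 0 & 0 & q_z^2 \Delta
\end{bmatrix}.
\end{aligned}
\]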

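The advantage of this way of implementing the $Q$ matrix is that it permits us to account for the kinematic relationships between position and velocity; the process noise induced on the position components is due to the accumulation of acceleration error over the integration interval. This formulation also automatically incorporates changes in the $Q$ matrix due to changes in the sampling interval, $\Delta$.

The observation noise covariance matrix is of the form
\[
R = \begin{bmatrix}
r_{y_1}^2 & 0 & 0\\
0 & r_{y_2}^2 & 0\\
0 & 0 & r_{z_2}^2
\end{bmatrix},
\]
and the initial state covariance matrix is of the form
\[
\Pi_0 = \begin{bmatrix}
\sigma_x^2 & 0 & 0 & 0 & 0 & 0\\
0 & \sigma_y^2 & 0 & 0 & 0 & 0\\
0 & 0 & \sigma_z^2 & 0 & 0 & 0\\
0 & 0 & 0 & \sigma_{\dot{x}}^2 & 0 & 0\\
0 & 0 & 0 & 0 & \sigma_{\dot{y}}^2 & 0\\
0 & 0 & 0 & 0 & 0 & \sigma_{\dot{z}}^2
\end{bmatrix}.
\]
Note that the dynamics model assumes that
\[
\ddot{x} = w_x, \qquad \ddot{y} = w_y, \qquad \ddot{z} = w_z,
\]
that is, that acceleration is white noise. When we convert to discrete time, the model becomes, essentially (that is, apart from the process noise $u_i$),
\[
\begin{aligned}
x_{i+1} &= x_i + \Delta\, \dot{x}_i, & \dot{x}_{i+1} &= \dot{x}_i,\\
y_{i+1} &= y_i + \Delta\, \dot{y}_i, & \dot{y}_{i+1} &= \dot{y}_i,\\
z_{i+1} &= z_i + \Delta\, \dot{z}_i, & \dot{z}_{i+1} &= \dot{z}_i.
\end{aligned}
\]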
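The model matrices of this example are easy to build and check numerically. The following is a minimal sketch, assuming NumPy is available; the function names `cv_model` and `q_by_quadrature` and the numerical values of the sample interval and noise intensities are illustrative choices, not part of the notes. It forms the discrete-time $F$ and $Q$ from the closed-form expressions above and compares $Q$ against a midpoint-rule evaluation of the superposition integral.

```python
import numpy as np

def cv_model(delta, q):
    """Discrete-time F and Q for the six-state constant-velocity model.

    delta : sample interval
    q     : (q_x, q_y, q_z); the continuous-time noise covariance is diag(q**2)
    """
    I3, Z3 = np.eye(3), np.zeros((3, 3))
    F = np.block([[I3, delta * I3], [Z3, I3]])
    D = np.diag(np.square(q))
    Q = np.block([[D * delta**3 / 3, D * delta**2 / 2],
                  [D * delta**2 / 2, D * delta]])
    return F, Q

def q_by_quadrature(delta, q, n=1000):
    """Midpoint-rule value of the integral of Phi(t) G Qc G^T Phi(t)^T over [0, delta]."""
    I3, Z3 = np.eye(3), np.zeros((3, 3))
    G = np.vstack([Z3, I3])                  # continuous-time input matrix (6 x 3)
    Qc = np.diag(np.square(q))               # continuous-time noise covariance
    h = delta / n
    Q = np.zeros((6, 6))
    for j in range(n):
        t = (j + 0.5) * h
        Phi = np.block([[I3, t * I3], [Z3, I3]])
        B = Phi @ G
        Q += (B @ Qc @ B.T) * h
    return Q

# Illustrative values only.
F, Q = cv_model(0.5, (1.0, 2.0, 0.5))
Q_num = q_by_quadrature(0.5, (1.0, 2.0, 0.5))
print(np.allclose(Q, Q_num, atol=1e-6))
```

Changing `delta` changes both the position and velocity blocks of `Q` consistently, which is the practical benefit of deriving $Q$ from the continuous-time model rather than choosing it by hand.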

10.7 Interpretation of the Kalman Gain

The Kalman filter measurement update equation is
\[
\hat{x}_{i+1|i+1} = \hat{x}_{i+1|i} + W_{i+1}\left[ y_{i+1} - H_{i+1} \hat{x}_{i+1|i} \right],
\]
which may be rewritten as
\[
\delta\hat{x}_{i+1|i} = W_{i+1}\, \delta y_{i+1},
\]
where
\[
\delta\hat{x}_{i+1|i} = \hat{x}_{i+1|i+1} - \hat{x}_{i+1|i}
\]
is the difference between the filtered state estimate and the predicted state estimate, and
\[
\delta y_{i+1} = y_{i+1} - H_{i+1} \hat{x}_{i+1|i}
\]
is the difference between the actual observation and the predicted observation. We note that the Kalman gain matrix, $W_{i+1}$, maps changes in data space into changes in state space, and is expressed in state-space units per data-space unit.

To gain some insight into the operation of the Kalman gain, consider the hypothetical case of $H$ square and invertible, and $R_{i+1} = 0$. In that case, we have
\[
W_{i+1} = P_{i+1|i} H_{i+1}^T \left[ H_{i+1} P_{i+1|i} H_{i+1}^T \right]^{-1} = H_{i+1}^{-1}.
\]
But observe that under these conditions, the observation equation assumes the form
\[
y_{i+1} = H_{i+1} x_{i+1},
\]
so the function of the Kalman filter would be, after all the dust settled, simply to invert the observation matrix. In the general case, it is therefore apparent that the Kalman filter serves as a kind of generalized inverting function that acts very much like a true inverse (or pseudo-inverse) in a low-noise environment.
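The limiting behavior of the gain can be checked numerically. The following is a small sketch, assuming NumPy; the particular matrices `H` and `P` are arbitrary illustrative values (an invertible observation matrix and a positive definite predicted covariance), and `kalman_gain` is a hypothetical helper name. It evaluates the gain for a decreasing observation noise level and reports its distance from the inverse of the observation matrix.

```python
import numpy as np

# Illustrative values only: H is square and invertible (det = 7),
# P is symmetric positive definite.
H = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 3.0]])
P = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.2],
              [0.0, 0.2, 1.5]])

def kalman_gain(P, H, R):
    """W = P H^T [H P H^T + R]^{-1}."""
    return P @ H.T @ np.linalg.inv(H @ P @ H.T + R)

H_inv = np.linalg.inv(H)
errors = []
for r in (1.0, 1e-2, 1e-4, 0.0):
    W = kalman_gain(P, H, r * np.eye(3))
    errors.append(np.linalg.norm(W - H_inv))
    print(f"r = {r:g}: ||W - H^-1|| = {errors[-1]:.3e}")
```

The printed distance shrinks with the noise level and is at rounding-error level when the noise covariance is exactly zero, which is the sense in which the gain acts as an inverse in a low-noise environment.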

10.8 Smoothing

The Kalman filter provides the estimate of the state conditioned on the past and present observations, and so is a causal estimator. Such an estimator is appropriate for real-time operation, but in many applications it is possible to delay the calculation of the estimate until future data are obtained. In such a post-processing environment, we ought to consider constructing a smoothed, or non-causal, estimator that uses the future, as well as the past, data. We consider three general smoothing situations: (a) fixed-lag smoothing, (b) fixed-point smoothing, and (c) fixed-interval smoothing.

10.8.1 A Word About Notation

In our discussions of filtering, we have employed a double-subscript notation of the form $\hat{x}_{j|k}$ to denote the estimate of the state $x_j$ given data up to time $k$, where we have assumed that the data set is of the form $\{y_0, y_1, \ldots, y_k\}$. For the ensuing discussions, however, it will be convenient, though a bit cumbersome, to modify this notation as follows: let $\hat{x}_{j|i:k}$, $i \le k$, denote the estimate of $x_j$ given data $\{y_i, y_{i+1}, \ldots, y_k\}$. In this notation, the filtered estimate $\hat{x}_{j|k}$ becomes $\hat{x}_{j|0:k}$. The estimation error covariance for these estimates will be denoted by
\[
P_{j|i:k} = E\,[x_j - \hat{x}_{j|i:k}][x_j - \hat{x}_{j|i:k}]^T .
\]

10.8.2 Fixed-Lag and Fixed-Point Smoothing

Fixed-lag smoothing is appropriate if a constant delay can be tolerated before the estimate is obtained. We may denote such an estimate by $\hat{x}_{i|i:i+N}$, where $N$ is the number of time increments into the future for which data are available. Fixed-point smoothing is appropriate when we want to estimate the state at one fixed time only, and wish to use all of the data to do so. For fixed $t_0$, the fixed-point smoother is denoted $\hat{x}_{t_0|0:T}$, where $\{y_0, \ldots, y_T\}$ is the entire collection of data, and $0 \le t_0 \le T$. Fixed-lag and fixed-point smoothing are specialized applications that are found in various texts and will not be developed in these notes. Fixed-point smoothing may actually be viewed as a special case of fixed-interval smoothing, which is developed in the next section.

10.8.3 The Rauch-Tung-Striebel Fixed-Interval Smoother

If data are collected over the entire extent of the problem, then a fixed-interval smoother is appropriate. We may denote this estimate as $\hat{x}_{i|0:T}$, where $T$ is the total number of samples over the full extent of the problem, corresponding to the data set $\{y_0, \ldots, y_T\}$. There are at least three approaches to the development of the fixed-interval smoother: (a) the forward-backward smoother, (b) the two-point boundary-value approach, and (c) the Rauch-Tung-Striebel smoother. We present only the Rauch-Tung-Striebel approach in these notes.

Assume that for each time $k$ the filtered estimate and covariance, $\hat{x}_{k|0:k}$ and $P_{k|0:k}$, and the predicted estimate and covariance, $\hat{x}_{k+1|0:k}$ and $P_{k+1|0:k}$, have been computed. We want to use these quantities to obtain a recursion for the fixed-interval smoothed estimate and covariance, $\hat{x}_{k|0:T}$ and $P_{k|0:T}$.

We begin by assuming that $x_k$ and $x_{k+1}$ are jointly normal, given $\{y_0, \ldots, y_T\}$. We consider the conditional joint density function
\[
f_{x_k, x_{k+1} | y_0, \ldots, y_T}(x_k, x_{k+1} \mid y_0, \ldots, y_T)
\]
and seek the values of $x_k$ and $x_{k+1}$ that maximize this joint conditional density, resulting in the maximum likelihood estimates for $x_k$ and $x_{k+1}$ given all of the data available over the full extent of the problem. (We will eventually show that the maximum likelihood estimate is indeed the orthogonal projection of the state onto the space spanned by all of the data, although we will not attack the derivation initially from that point of view.)

For the remainder of this derivation, we will suspend the subscripts, and let the reader infer the structure of the densities involved from the argument list (this is a standard, though somewhat regrettable, practice in probability theory, although it does significantly streamline the notation; once you figure out the context, you can't go wrong). We write
\[
\begin{aligned}
f(x_k, x_{k+1} \mid y_0, \ldots, y_T)
&= \frac{f(x_k, x_{k+1}, y_0, \ldots, y_T)}{f(y_0, \ldots, y_T)} \\
&= \frac{f(x_k, x_{k+1}, y_0, \ldots, y_k, y_{k+1}, \ldots, y_T)}{f(y_0, \ldots, y_T)} \\
&= \frac{f(x_k, x_{k+1}, y_{k+1}, \ldots, y_T \mid y_0, \ldots, y_k)\, f(y_0, \ldots, y_k)}{f(y_0, \ldots, y_T)} && \text{(10-61)} \\
&= \frac{f(y_{k+1}, \ldots, y_T \mid x_k, x_{k+1}, y_0, \ldots, y_k)\, f(x_k, x_{k+1} \mid y_0, \ldots, y_k)\, f(y_0, \ldots, y_k)}{f(y_0, \ldots, y_T)}. && \text{(10-62)}
\end{aligned}
\]

But, conditioned on $x_{k+1}$, the distribution of $\{y_{k+1}, \ldots, y_T\}$ is independent of all previous values of the state and the observations, so
\[
f(y_{k+1}, \ldots, y_T \mid x_k, x_{k+1}, y_0, \ldots, y_k) = f(y_{k+1}, \ldots, y_T \mid x_{k+1}). \tag{10-63}
\]
Furthermore,
\[
\begin{aligned}
f(x_k, x_{k+1} \mid y_0, \ldots, y_k)
&= f(x_{k+1} \mid x_k, y_0, \ldots, y_k)\, f(x_k \mid y_0, \ldots, y_k) \\
&= f(x_{k+1} \mid x_k)\, f(x_k \mid y_0, \ldots, y_k),
\end{aligned}
\tag{10-64}
\]
where the last equality obtains since $x_{k+1}$ conditioned on $x_k$ is independent of all previous observations. Substituting (10-63) and (10-64) into (10-62) yields
\[
f(x_k, x_{k+1} \mid y_0, \ldots, y_T) = f(x_{k+1} \mid x_k)\, f(x_k \mid y_0, \ldots, y_k)\,
\underbrace{\frac{f(y_{k+1}, \ldots, y_T \mid x_{k+1})\, f(y_0, \ldots, y_k)}{f(y_0, \ldots, y_T)}}_{\text{independent of } x_k} .
\]
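Now suppose the maximum likelihood estimate of $x_{k+1}$ is available, yielding $\hat{x}_{k+1|0:T}$. Then we may restrict attention to the densities $f(x_{k+1} \mid x_k)\, f(x_k \mid y_0, \ldots, y_k)$. Assuming normal distributions, (10-42) and (10-59), these densities are
\[
\begin{aligned}
f(x_{k+1} \mid x_k) &= \mathcal{N}(F_k x_k,\; G_k Q_k G_k^T), \\
f(x_k \mid y_0, \ldots, y_k) &= \mathcal{N}(\hat{x}_{k|0:k},\; P_{k|0:k}),
\end{aligned}
\]
and the problem of maximizing the conditional probability density function $f(x_k, x_{k+1} \mid y_0, \ldots, y_T)$ with respect to $x_k$, assuming $x_{k+1}$ is given as the smoothed estimate at time $k+1$, is equivalent to the problem of minimizing
\[
\frac{1}{2} [x_{k+1} - F_k x_k]^T \left[G_k Q_k G_k^T\right]^{-1} [x_{k+1} - F_k x_k]
+ \frac{1}{2} [x_k - \hat{x}_{k|0:k}]^T P_{k|0:k}^{-1} [x_k - \hat{x}_{k|0:k}]
\]
evaluated at $x_{k+1} = \hat{x}_{k+1|0:T}$.

Exercise 10-20 Set
\[
J(x_k) = \frac{1}{2} [\hat{x}_{k+1|0:T} - F_k x_k]^T \left[G_k Q_k G_k^T\right]^{-1} [\hat{x}_{k+1|0:T} - F_k x_k]
+ \frac{1}{2} [x_k - \hat{x}_{k|0:k}]^T P_{k|0:k}^{-1} [x_k - \hat{x}_{k|0:k}],
\]
set the derivative of $J$ to zero, and show that the solution is of the form
\[
\hat{x}_{k|0:T} = \left[ P_{k|0:k}^{-1} + F_k^T \left[G_k Q_k G_k^T\right]^{-1} F_k \right]^{-1}
\left[ P_{k|0:k}^{-1} \hat{x}_{k|0:k} + F_k^T \left[G_k Q_k G_k^T\right]^{-1} \hat{x}_{k+1|0:T} \right].
\]
Next, use the well-known identities
\[
\begin{aligned}
\left[ P^{-1} + M^T R^{-1} M \right]^{-1} &= P - P M^T \left[ M P M^T + R \right]^{-1} M P, \\
\left[ P^{-1} + M^T R^{-1} M \right]^{-1} M^T R^{-1} &= P M^T \left[ M P M^T + R \right]^{-1},
\end{aligned}
\]
to show that
\[
\hat{x}_{k|0:T} = \hat{x}_{k|0:k} + S_k \left[ \hat{x}_{k+1|0:T} - F_k \hat{x}_{k|0:k} \right], \tag{10-65}
\]
where
\[
S_k = P_{k|0:k} F_k^T \left[ F_k P_{k|0:k} F_k^T + G_k Q_k G_k^T \right]^{-1} = P_{k|0:k} F_k^T P_{k+1|0:k}^{-1}. \tag{10-66}
\]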

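Equation (10-65) is the Rauch-Tung-Striebel smoother. Note that it operates in backward time with $\hat{x}_{T|0:T}$, the final filtered estimate, as the initial condition for the smoother.

We next seek an expression for the covariance of the smoothing error, $\tilde{x}_{k|0:T} = x_k - \hat{x}_{k|0:T}$:
\[
P_{k|0:T} = E\, \tilde{x}_{k|0:T} \tilde{x}_{k|0:T}^T .
\]
From (10-65),
\[
x_k - \hat{x}_{k|0:T} = x_k - \hat{x}_{k|0:k} - S_k \left[ \hat{x}_{k+1|0:T} - F_k \hat{x}_{k|0:k} \right],
\]
or
\[
\tilde{x}_{k|0:T} + S_k \hat{x}_{k+1|0:T} = \tilde{x}_{k|0:k} + S_k F_k \hat{x}_{k|0:k} .
\]
Multiplying both sides by the transpose and taking expectations yields
\[
\begin{aligned}
& E\, \tilde{x}_{k|0:T} \tilde{x}_{k|0:T}^T + E\, \tilde{x}_{k|0:T} \hat{x}_{k+1|0:T}^T S_k^T + S_k E\, \hat{x}_{k+1|0:T} \tilde{x}_{k|0:T}^T + S_k E\, \hat{x}_{k+1|0:T} \hat{x}_{k+1|0:T}^T S_k^T \\
&\quad = E\, \tilde{x}_{k|0:k} \tilde{x}_{k|0:k}^T + E\, \tilde{x}_{k|0:k} \hat{x}_{k|0:k}^T F_k^T S_k^T + S_k F_k E\, \hat{x}_{k|0:k} \tilde{x}_{k|0:k}^T + S_k F_k E\, \hat{x}_{k|0:k} \hat{x}_{k|0:k}^T F_k^T S_k^T .
\end{aligned}
\tag{10-67}
\]
Examining the cross terms of these expressions yields, for example,
\[
\begin{aligned}
E\, \tilde{x}_{k|0:T} \hat{x}_{k+1|0:T}^T
&= E\, \tilde{x}_{k|0:T} \left[ F_k \hat{x}_{k|0:T} + G_k \hat{u}_{k|0:T} \right]^T \\
&= E\, \tilde{x}_{k|0:T} \hat{x}_{k|0:T}^T F_k^T \\
&= E \left\{ E \left[ \tilde{x}_{k|0:T} \hat{x}_{k|0:T}^T \mid y_0, \ldots, y_T \right] \right\} F_k^T \\
&= E \left\{ E \left[ \tilde{x}_{k|0:T} \mid y_0, \ldots, y_T \right] \hat{x}_{k|0:T}^T \right\} F_k^T \\
&= E \Big\{ \Big[ \underbrace{E [x_k \mid y_0, \ldots, y_T]}_{\hat{x}_{k|0:T}} - \hat{x}_{k|0:T} \Big] \hat{x}_{k|0:T}^T \Big\} F_k^T \\
&= 0 .
\end{aligned}
\]
By a similar argument (or from previous orthogonality results),
\[
E\, \tilde{x}_{k|0:k} \hat{x}_{k|0:k}^T = 0,
\]
and so all cross terms in (10-67) vanish, leaving the expression
\[
P_{k|0:T} + S_k E\, \hat{x}_{k+1|0:T} \hat{x}_{k+1|0:T}^T S_k^T = P_{k|0:k} + S_k F_k E\, \hat{x}_{k|0:k} \hat{x}_{k|0:k}^T F_k^T S_k^T . \tag{10-68}
\]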

An important byproduct of the above derivations is the result
\[
E\, \tilde{x}_{k|0:T} \hat{x}_{k|0:T}^T = 0 . \tag{10-69}
\]
This result establishes the fact that the smoothed estimation error is orthogonal to the smoothed estimate, which is equivalent to the claim that the smoothed estimate is the projection of the state onto the space spanned by the entire set of observations. Thus smoothing preserves orthogonality.

Continuing, we next compute the term $E\, \hat{x}_{k+1|0:T} \hat{x}_{k+1|0:T}^T$. To solve for this term, we use the just-established fact that
\[
x_{k+1} = \hat{x}_{k+1|0:T} + \tilde{x}_{k+1|0:T}
\]
is an orthogonal decomposition, so
\[
\begin{aligned}
E\, x_{k+1} x_{k+1}^T &= E\, \hat{x}_{k+1|0:T} \hat{x}_{k+1|0:T}^T + E\, \tilde{x}_{k+1|0:T} \tilde{x}_{k+1|0:T}^T \\
&= E\, \hat{x}_{k+1|0:T} \hat{x}_{k+1|0:T}^T + P_{k+1|0:T} .
\end{aligned}
\]
Similarly,
\[
E\, x_k x_k^T = E\, \hat{x}_{k|0:k} \hat{x}_{k|0:k}^T + P_{k|0:k} .
\]
Furthermore, from (10-44),
\[
E\, x_{k+1} x_{k+1}^T = F_k E\, x_k x_k^T F_k^T + G_k Q_k G_k^T .
\]
Substituting these results into (10-68) yields
\[
P_{k|0:T} + S_k \left[ E\, x_{k+1} x_{k+1}^T - P_{k+1|0:T} \right] S_k^T = P_{k|0:k} + S_k F_k \left[ E\, x_k x_k^T - P_{k|0:k} \right] F_k^T S_k^T , \tag{10-70}
\]
or
\[
P_{k|0:T} + S_k \left[ F_k E\, x_k x_k^T F_k^T + G_k Q_k G_k^T - P_{k+1|0:T} \right] S_k^T = P_{k|0:k} + S_k F_k \left[ E\, x_k x_k^T - P_{k|0:k} \right] F_k^T S_k^T , \tag{10-71}
\]
which simplifies to
\[
\begin{aligned}
P_{k|0:T} &= P_{k|0:k} + S_k \left[ P_{k+1|0:T} - G_k Q_k G_k^T - F_k P_{k|0:k} F_k^T \right] S_k^T \\
&= P_{k|0:k} + S_k \left[ P_{k+1|0:T} - P_{k+1|0:k} \right] S_k^T .
\end{aligned}
\tag{10-72}
\]
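The smoother recursions (10-65), (10-66), and (10-72) translate almost line for line into code. The following is a minimal sketch, assuming NumPy and a time-invariant model; the function names `kalman_forward` and `rts_smooth`, the two-state position-velocity test model, and all numerical values are illustrative assumptions rather than anything prescribed in these notes. A forward Kalman filter pass stores the filtered and predicted quantities, and a backward pass applies the smoother.

```python
import numpy as np

def kalman_forward(ys, F, GQGt, H, R, x0, P0):
    """Forward pass. Returns lists of filtered (x_{k|0:k}, P_{k|0:k}) and
    predicted (x_{k+1|0:k}, P_{k+1|0:k}) estimates and covariances."""
    xf, Pf, xp, Pp = [], [], [], []
    x_pred, P_pred = x0, P0                      # a priori values
    for y in ys:
        W = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
        x_filt = x_pred + W @ (y - H @ x_pred)
        P_filt = (np.eye(len(x0)) - W @ H) @ P_pred
        x_pred = F @ x_filt
        P_pred = F @ P_filt @ F.T + GQGt
        xf.append(x_filt); Pf.append(P_filt)
        xp.append(x_pred); Pp.append(P_pred)
    return xf, Pf, xp, Pp

def rts_smooth(xf, Pf, xp, Pp, F):
    """Backward pass, started from the final filtered estimate."""
    xs, Ps = list(xf), list(Pf)
    for k in range(len(xf) - 2, -1, -1):
        S_k = Pf[k] @ F.T @ np.linalg.inv(Pp[k])             # (10-66)
        xs[k] = xf[k] + S_k @ (xs[k + 1] - xp[k])            # (10-65)
        Ps[k] = Pf[k] + S_k @ (Ps[k + 1] - Pp[k]) @ S_k.T    # (10-72)
    return xs, Ps

# Illustrative one-dimensional position/velocity model.
rng = np.random.default_rng(1)
d, q, r = 1.0, 0.1, 1.0
F = np.array([[1.0, d], [0.0, 1.0]])
Q = q**2 * np.array([[d**3 / 3, d**2 / 2], [d**2 / 2, d]])
H = np.array([[1.0, 0.0]])
R = np.array([[r**2]])
x = np.array([0.0, 1.0])
ys = []
for _ in range(50):
    ys.append(H @ x + rng.normal(scale=r, size=1))
    x = F @ x + rng.multivariate_normal(np.zeros(2), Q)

xf, Pf, xp, Pp = kalman_forward(ys, F, Q, H, R, np.zeros(2), 10.0 * np.eye(2))
xs, Ps = rts_smooth(xf, Pf, xp, Pp, F)
print("filtered trace:", np.trace(Pf[25]), "smoothed trace:", np.trace(Ps[25]))
```

Note that the backward pass uses only stored quantities from the forward pass; the observations themselves are not revisited. The explicit matrix inverses are kept for readability; a production implementation would more likely use a linear solver.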

10.9 Extensions to Nonlinear Systems

Consider a general nonlinear system of the form
\[
\begin{aligned}
x_{k+1} &= f(x_k, k) + G_k u_k && \text{(10-73)} \\
y_k &= h(x_k, k) + v_k, && \text{(10-74)}
\end{aligned}
\]
for $k = 0, 1, \ldots$, with $\{u_k, k = 0, 1, \ldots\}$ and $\{v_k, k = 0, 1, \ldots\}$ uncorrelated, zero-mean process and observation noise sequences, respectively.

The general nonlinear estimation problem is extremely difficult, and no general solution to the nonlinear filtering problem is available. One reason the linear problem is easy to solve is that, if the process noise, observation noise, and initial condition, $x_0$, are normally distributed, then the state $x_k$ is Gaussian, and so is the conditional expectation $\hat{x}_{k|j}$. But if $f$ is nonlinear, then the state is no longer guaranteed to be normally distributed, and if either $f$ or $h$ is nonlinear, then the conditional expectation $\hat{x}_{k|j}$ is not guaranteed to be normally distributed. Thus, we cannot, in general, obtain the estimate as a function of only the first two moments of the conditional distribution. The general solution would require the propagation of the entire conditional distribution. Thus, we cannot easily get an exact solution, and we resort to the time-honored method of obtaining a solution by means of linearization.

10.9.1 Linearization

Suppose a nominal, or reference, trajectory is somehow made available. Denote this trajectory $\{\bar{x}_k, k = 0, 1, \ldots, T\}$. We assume that this trajectory satisfies the dynamics equation, that is,
\[
\bar{x}_{k+1} = f(\bar{x}_k, k) \tag{10-75}
\]
with initial condition $\bar{x}_0$. The reference trajectory must be deterministic; no noise may be introduced into the dynamics. The observations associated with this reference trajectory may be computed as
\[
\bar{y}_k = h(\bar{x}_k, k).
\]
The purpose of the reference trajectory is to provide a path about which to linearize the nonlinear system described by (10-73) and (10-74). The linearization procedure is as follows.

Define the deviation, $\delta x_k$, as the difference between the actual state and the reference state:
\[
\delta x_k = x_k - \bar{x}_k . \tag{10-76}
\]
Expanding the dynamics $f(x_k, k)$ and the observations $h(x_k, k)$ about the reference trajectory at time $k$ yields
\[
\begin{aligned}
f(x_k, k) &= f(\bar{x}_k + \delta x_k, k) = f(\bar{x}_k, k) + F_k\, \delta x_k + \text{higher-order terms}, \\
h(x_k, k) &= h(\bar{x}_k + \delta x_k, k) = h(\bar{x}_k, k) + H_k\, \delta x_k + \text{higher-order terms},
\end{aligned}
\]
where
\[
F_k = \left. \frac{\partial f(x, k)}{\partial x} \right|_{x = \bar{x}_k} \tag{10-77}
\]
\[
H_k = \left. \frac{\partial h(x, k)}{\partial x} \right|_{x = \bar{x}_k} . \tag{10-78}
\]
Neglecting higher-order terms, we may approximate (10-73) by
\[
x_{k+1} = f(x_k, k) + G_k u_k \approx f(\bar{x}_k, k) + F_k\, \delta x_k + G_k u_k . \tag{10-79}
\]
Using (10-75) to subtract $\bar{x}_{k+1} = f(\bar{x}_k, k)$ from both sides, we rearrange (10-79) to obtain the deviation dynamics equation (replacing the $\approx$ with $=$ from here on)
\[
\delta x_{k+1} = F_k\, \delta x_k + G_k u_k . \tag{10-80}
\]
We see that (10-80) is a linear dynamics model in the deviation variable, $\delta x_k$. Also, we may approximate (10-74) by
\[
y_k = h(x_k, k) + v_k \approx h(\bar{x}_k, k) + H_k\, \delta x_k + v_k . \tag{10-81}
\]
Defining $\delta y_k = y_k - \bar{y}_k$, we rearrange (10-81) to obtain the deviation observation equation
\[
\delta y_k = H_k\, \delta x_k + v_k , \tag{10-82}
\]
a linear observations model in the deviations.
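To make the linearization step concrete, the following sketch, assuming NumPy, evaluates the partial-derivative matrix of (10-77) for a hypothetical two-state discretized pendulum model and checks how the neglected higher-order terms behave. The model `f`, its constants, the reference point `x_bar`, and the helper `numerical_jacobian` are illustrative assumptions only; none of them comes from the notes.

```python
import numpy as np

DT, G_OVER_L = 0.1, 9.81   # hypothetical step size and pendulum constant

def f(x, k=0):
    """Hypothetical nonlinear dynamics: Euler-discretized pendulum."""
    return np.array([x[0] + DT * x[1],
                     x[1] - DT * G_OVER_L * np.sin(x[0])])

def jac_f(x, k=0):
    """Analytic partial-derivative matrix of f, as in (10-77)."""
    return np.array([[1.0, DT],
                     [-DT * G_OVER_L * np.cos(x[0]), 1.0]])

def numerical_jacobian(func, x, eps=1e-6):
    """Central-difference approximation, useful for checking analytic partials."""
    x = np.asarray(x, dtype=float)
    cols = []
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        cols.append((func(x + dx) - func(x - dx)) / (2 * eps))
    return np.column_stack(cols)

x_bar = np.array([0.3, -0.2])      # a point on the reference trajectory
F_k = jac_f(x_bar)
print(np.allclose(F_k, numerical_jacobian(f, x_bar), atol=1e-6))

# Size of the neglected higher-order terms for shrinking deviations.
errs = []
for scale in (1e-1, 1e-2, 1e-3):
    dx = scale * np.array([1.0, 1.0])
    errs.append(np.linalg.norm(f(x_bar + dx) - (f(x_bar) + F_k @ dx)))
    print(f"|dx| scale {scale:g}: linearization error {errs[-1]:.3e}")
```

The error falls roughly a hundredfold for each tenfold reduction in the deviation, which is the second-order behavior that justifies dropping the higher-order terms when the deviation from the reference trajectory is small.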

Once the linearized dynamics and observations equations given by (10-80) and (10-82) are obtained, we may apply the Kalman filter to this system in $\delta x_k$ in the standard way. The algorithm consists of the following steps:

1. Obtain a reference trajectory $\{\bar{x}_k, k = 0, 1, \ldots, T\}$.

2. Evaluate the partials of $f$ and $h$ at $\bar{x}_k$; identify these quantities as $F_k$ and $H_k$, respectively.

3. Compute the reference observations, $\bar{y}_k$, and calculate $\delta y_k$.

4. Apply the Kalman filter to the linearized model
\[
\begin{aligned}
\delta x_{k+1} &= F_k\, \delta x_k + G_k u_k \\
\delta y_k &= H_k\, \delta x_k + v_k
\end{aligned}
\]
to obtain the deviation estimates
\[
\begin{aligned}
&\delta\hat{x}_{k|0:k} && \text{(filtered)} \\
&\delta\hat{x}_{k|0:T} && \text{(smoothed)}.
\end{aligned}
\]

5. Add the deviation estimates to the nominal trajectory to obtain the trajectory estimates:
\[
\begin{aligned}
\hat{x}_{k|0:k} &= \bar{x}_k + \delta\hat{x}_{k|0:k} && \text{(filtered)} \\
\hat{x}_{k|0:T} &= \bar{x}_k + \delta\hat{x}_{k|0:T} && \text{(smoothed)}.
\end{aligned}
\]

The approach outlined above is called global linearization, and it has several potential problems. First of all, it assumes that a reliable nominal trajectory is available, so that the $F$ and $H$ matrices are valid. But many important estimation problems do not enjoy the luxury of having foreknowledge sufficient to generate a reference trajectory. Also, even if the $F$ and $H$ matrices are not grossly in error, the approach is predicated on the assumption that higher-order terms in the Taylor expansion may be safely ignored. It would be highly fortuitous if the nominal trajectory were of such high quality that neither of these concerns were manifest.

In the general case, however, the development of a nominal trajectory is problematic. In some special cases it may be possible to generate such a trajectory via computer simulations; in other cases, experience and intuition may guide its development. Often, however, one may simply have to rely on guesses and hope for the best. But bad things may happen. The estimates may diverge, but even if they do not, the results may be suspect because of their sensitivity to the operating point. Of course, one could perturb the operating point and evaluate the sensitivity of the estimates to this perturbation, but that would be a tedious procedure, certainly not possible in real-time applications.

10.9.2 The Extended Kalman Filter

Global linearization about a pre-determined reference trajectory is not the only way to approach the linearization problem. Another approach is to calculate a local nominal trajectory on the fly, and update it as information becomes available.

Following the timely injunction of Lewis Carroll to "Begin at the beginning, . . . go on until you come to the end; then stop," our first order of business will be to get the estimation process started. We wish to construct a recursive estimator, and regardless of its linearity properties, we are under obligation to provide the estimator with initial conditions in the form of $\hat{x}_{0|-1}$ and $P_{0|-1}$, the a priori state estimate and covariance. The state $\hat{x}_{0|-1}$ represents the best information we have concerning the value $x_0$, so it makes sense to use this value as the first point in the nominal trajectory; that is, to define
\[
\bar{x}_0 = \hat{x}_{0|-1},
\]
and use this value to compute the $H_0$ matrix as
\[
H_0 = \left. \frac{\partial h(x, 0)}{\partial x} \right|_{x = \hat{x}_{0|-1}},
\]
and the deviation observation equation is
\[
\delta y_0 = y_0 - h(\bar{x}_0, 0) = y_0 - h(\hat{x}_{0|-1}, 0).
\]
Using these values, we may process $y_0$ using a standard Kalman filter applied to (10-80) and (10-82). The resulting measurement update is
\[
\begin{aligned}
\delta\hat{x}_{0|0} &= \delta\hat{x}_{0|-1} + W_0 \left[ \delta y_0 - H_0\, \delta\hat{x}_{0|-1} \right] && \text{(10-83)} \\
P_{0|0} &= [I - W_0 H_0]\, P_{0|-1}, && \text{(10-84)}
\end{aligned}
\]
where
\[
W_0 = P_{0|-1} H_0^T \left[ H_0 P_{0|-1} H_0^T + R_0 \right]^{-1} .
\]
But note that $\hat{x}_{0|-1}$ fulfills two roles: (a) it is the initial value of the state estimate, and (b) it is the nominal trajectory about which we linearize, namely $\bar{x}_0$. Consequently,
\[
\delta\hat{x}_{0|-1} = \hat{x}_{0|-1} - \bar{x}_0 = 0.
\]
Furthermore,
\[
\delta\hat{x}_{0|0} = \hat{x}_{0|0} - \bar{x}_0 = \hat{x}_{0|0} - \hat{x}_{0|-1},
\]
so (10-83) becomes
\[
\hat{x}_{0|0} = \hat{x}_{0|-1} + W_0 \left[ y_0 - h(\hat{x}_{0|-1}, 0) \right]. \tag{10-85}
\]

Consequently, (10-85) and (10-84) constitute the measurement update equations at time $k = 0$.

Going on, the next order of business is to predict to the time of the next observation and then update. We will need to compute the predicted state, $\hat{x}_{1|0}$, and the predicted covariance, $P_{1|0}$. To predict the state, we simply apply the nonlinear dynamics equation:
\[
\hat{x}_{1|0} = f(\hat{x}_{0|0}, 0). \tag{10-86}
\]
To predict the covariance, we need to obtain a linear model, which will enable us to predict the covariance as
\[
P_{1|0} = F_0 P_{0|0} F_0^T + G_0 Q_0 G_0^T . \tag{10-87}
\]
The question is, what should we use as a nominal trajectory at which to evaluate (10-77)? According to our philosophy, we should use the best information we currently have about $x_0$, and this is our filtered estimate. Thus, we take, for the calculation of $F_0$, the value $\bar{x}_0 = \hat{x}_{0|0}$. Using this value, the prediction step at time $k = 0$ is given by (10-86) and (10-87).

The next order of business is, of course, to perform the observation update at time $k = 1$, yielding
\[
\begin{aligned}
\delta\hat{x}_{1|1} &= \delta\hat{x}_{1|0} + W_1 \left[ \delta y_1 - H_1\, \delta\hat{x}_{1|0} \right] \\
P_{1|1} &= [I - W_1 H_1]\, P_{1|0},
\end{aligned}
\]
which requires us to employ a reference trajectory $\bar{x}_1$. Following our philosophy, we simply use the best information we have at time $k = 1$, namely, the predicted estimate, so we set $\bar{x}_1 = \hat{x}_{1|0}$. Consequently, $\delta\hat{x}_{1|0} = \hat{x}_{1|0} - \bar{x}_1 = 0$, and $\delta\hat{x}_{1|1} = \hat{x}_{1|1} - \hat{x}_{1|0}$, which yields
\[
\hat{x}_{1|1} = \hat{x}_{1|0} + W_1 \left[ y_1 - h(\hat{x}_{1|0}, 1) \right],
\]
where
\[
W_1 = P_{1|0} H_1^T \left[ H_1 P_{1|0} H_1^T + R_1 \right]^{-1}
\]
with
\[
H_1 = \left. \frac{\partial h(x, 1)}{\partial x} \right|_{x = \hat{x}_{1|0}} .
\]

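The pattern should now be quite clear. The resulting algorithm is called the extended Kalman filter, summarized as follows:

Measurement Update
\[
\begin{aligned}
\hat{x}_{k+1|k+1} &= \hat{x}_{k+1|k} + W_{k+1} \left[ y_{k+1} - h(\hat{x}_{k+1|k}, k+1) \right] && \text{(10-88)} \\
P_{k+1|k+1} &= [I - W_{k+1} H_{k+1}]\, P_{k+1|k}, && \text{(10-89)}
\end{aligned}
\]
where
\[
W_{k+1} = P_{k+1|k} H_{k+1}^T \left[ H_{k+1} P_{k+1|k} H_{k+1}^T + R_{k+1} \right]^{-1} \tag{10-90}
\]
with
\[
H_{k+1} = \left. \frac{\partial h(x, k+1)}{\partial x} \right|_{x = \hat{x}_{k+1|k}} . \tag{10-91}
\]

Time Update
\[
\begin{aligned}
\hat{x}_{k+1|k} &= f(\hat{x}_{k|k}, k) && \text{(10-92)} \\
P_{k+1|k} &= F_k P_{k|k} F_k^T + G_k Q_k G_k^T, && \text{(10-93)}
\end{aligned}
\]
where
\[
F_k = \left. \frac{\partial f(x, k)}{\partial x} \right|_{x = \hat{x}_{k|k}} . \tag{10-94}
\]

Initialization

The extended Kalman filter is initialized in exactly the same way as a standard Kalman filter, namely, by supplying the a priori estimate and covariance, $\hat{x}_{0|-1}$ and $P_{0|-1}$, respectively.

Nonlinear Smoothing

The Rauch-Tung-Striebel smoother equations are unchanged from those of the standard linear filter.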
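The summary above can be sketched as two small functions, one per update. This is a minimal illustration assuming NumPy; the function names, the argument order, and the range-only test measurement are hypothetical choices made here, not part of the notes. The user supplies the nonlinear functions and their partial-derivative matrices as callables.

```python
import numpy as np

def ekf_measurement_update(x_pred, P_pred, y, h, jac_h, R, k):
    """Measurement update at time index k: linearize h about the predicted
    estimate, then use the nonlinear h itself in the residual."""
    H = jac_h(x_pred, k)
    W = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
    x_filt = x_pred + W @ (y - h(x_pred, k))
    P_filt = (np.eye(x_pred.size) - W @ H) @ P_pred
    return x_filt, P_filt

def ekf_time_update(x_filt, P_filt, f, jac_f, GQGt, k):
    """Time update from k to k+1: propagate the estimate through the nonlinear
    dynamics and the covariance through the linearization about x_filt."""
    F = jac_f(x_filt, k)
    return f(x_filt, k), F @ P_filt @ F.T + GQGt

# Hypothetical example: two-dimensional position observed through range only.
h = lambda x, k: np.array([np.linalg.norm(x)])
jac_h = lambda x, k: (x / np.linalg.norm(x)).reshape(1, -1)
f = lambda x, k: x                      # static state, for illustration
jac_f = lambda x, k: np.eye(2)

x_pred, P_pred = np.array([3.0, 4.0]), np.eye(2)
x_filt, P_filt = ekf_measurement_update(
    x_pred, P_pred, np.array([5.5]), h, jac_h, np.array([[0.01]]), k=0)
x_next, P_next = ekf_time_update(x_filt, P_filt, f, jac_f, 0.1 * np.eye(2), k=0)
print(x_filt, np.linalg.norm(x_filt))
```

In this example the estimate moves radially so that its range sits between the predicted range of 5 and the measured range of 5.5, close to the measurement because the observation noise is small. When `f` and `h` are linear, the two functions reduce to the standard Kalman filter updates.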

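Exercise 10-21 Using the model provided in Section 10.6, rework the problem using the nonlinear observations vector
\[
y_k = \begin{bmatrix} R \\ A \\ E \end{bmatrix}
= \begin{bmatrix} h_R(\mathbf{x}_k) \\ h_A(\mathbf{x}_k) \\ h_E(\mathbf{x}_k) \end{bmatrix} + v_k,
\]
where $R$ denotes the range from a receiver to the vehicle, $A$ denotes the azimuth angle, and $E$ is the elevation angle. The mathematical models for these observations are as follows.

Range
\[
h_R(\mathbf{x}) = \sqrt{(x - x_r)^2 + (y - y_r)^2 + (z - z_r)^2}
\]
\[
\frac{\partial R}{\partial \mathbf{x}} = \frac{1}{R} \left[ (x - x_r),\; (y - y_r),\; (z - z_r) \right],
\]
where $\mathbf{x} = [x, y, z]^T$ is the position of the vehicle and $[x_r, y_r, z_r]^T$ is the position of the radar.

Azimuth

For these calculations we assume that the vectors are resolved into an East-North-Up coordinate system:
\[
h_A(\mathbf{x}) = \tan^{-1} \frac{x - x_r}{y - y_r}, \qquad -\pi \le A \le \pi
\]
\[
\frac{\partial A}{\partial \mathbf{x}} = \frac{1}{y - y_r} \left[ \cos^2 A,\; -\sin A \cos A,\; 0 \right].
\]

Elevation

For these calculations we assume that the vectors are resolved into an East-North-Up coordinate system:
\[
h_E(\mathbf{x}) = \sin^{-1} \frac{z - z_r}{R}, \qquad -\frac{\pi}{2} \le E \le \frac{\pi}{2}
\]
\[
\frac{\partial E}{\partial \mathbf{x}} = \frac{1}{R} \left[ -\sin E \sin A,\; -\sin E \cos A,\; \cos E \right].
\]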
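Before the partial derivatives above are used in an extended Kalman filter, it is worth checking them numerically. The sketch below, assuming NumPy, codes the three observation functions and their position partials and compares the result with central differences. The function names `h_rae` and `jac_rae`, the test positions, and the use of the two-argument arctangent for the azimuth are choices made here for illustration; it is a check on the formulas, not a solution to the exercise.

```python
import numpy as np

def h_rae(p, p_r):
    """Range, azimuth, elevation of position p seen from p_r (East-North-Up axes)."""
    d = p - p_r
    R = np.linalg.norm(d)
    A = np.arctan2(d[0], d[1])          # azimuth measured from North toward East
    E = np.arcsin(d[2] / R)
    return np.array([R, A, E])

def jac_rae(p, p_r):
    """Partials of [R, A, E] with respect to position (3 x 3).
    The azimuth row divides by (y - y_r), so it is undefined when y = y_r."""
    d = p - p_r
    R, A, E = h_rae(p, p_r)
    dR = d / R
    dA = np.array([np.cos(A)**2, -np.sin(A) * np.cos(A), 0.0]) / d[1]
    dE = np.array([-np.sin(E) * np.sin(A), -np.sin(E) * np.cos(A), np.cos(E)]) / R
    return np.vstack([dR, dA, dE])

def numeric_jac(p, p_r, eps=1e-4):
    """Central-difference partials, one column per position component."""
    cols = []
    for j in range(3):
        dp = np.zeros(3)
        dp[j] = eps
        cols.append((h_rae(p + dp, p_r) - h_rae(p - dp, p_r)) / (2 * eps))
    return np.column_stack(cols)

# Illustrative geometry only.
p, p_r = np.array([100.0, 200.0, 50.0]), np.zeros(3)
J = jac_rae(p, p_r)
print(np.allclose(J, numeric_jac(p, p_r), atol=1e-7))

# For the six-state model of Section 10.6 the velocity partials are zero.
H = np.hstack([J, np.zeros((3, 3))])
```

The last line shows how the position partials would be embedded in the observation matrix for the six-state model, since range and angles do not depend on velocity.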
