
Parameter Estimation:

Maximum Likelihood Estimation



The discriminant function is taken as the log of the probability density function.

The probability density function is specified by a number of parameters.

To find the decision surface between two different classes, the nature of the
probability density function, which is specified by the parameter vector, is
very important.
For example, take the Gaussian distribution: p(x|ω_j) ~ N(μ_j, Σ_j).
It is described by two parameters: the mean vector μ_j and the covariance matrix Σ_j.

Depending upon the nature of these parameter vectors, we can have different
types of decision boundaries between two classes (linear or non-linear), as
illustrated in the sketch below.
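
As a minimal sketch (assuming NumPy; all names are illustrative), the parameter
vector θ_j = (μ_j, Σ_j) fully specifies the Gaussian class-conditional density,
and hence the decision boundary it induces:

    import numpy as np

    def gaussian_pdf(x, mu, sigma):
        # Multivariate Gaussian density N(mu, sigma) evaluated at x
        d = len(mu)
        diff = x - mu
        inv = np.linalg.inv(sigma)
        norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))
        return norm * np.exp(-0.5 * diff @ inv @ diff)

    # Two classes with equal covariances give a linear boundary;
    # unequal covariances give a quadratic (non-linear) boundary.
    mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 2.0])
    sigma = np.eye(2)
    x = np.array([1.0, 1.0])
    print(gaussian_pdf(x, mu1, sigma), gaussian_pdf(x, mu2, sigma))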

If the probability density function is other than Gaussian, we have to find
out which parameter vector identifies that probability density function.

Maximum likelihood estimation is a technique for estimating the parameter
vector when the parametric form of the probability density function is known.

For example, if p(x|ω_j) ~ N(μ_j, Σ_j), then θ_j consists of the components
of the mean vector μ_j and the covariance matrix Σ_j.


Let there be c classes, and let D_1, D_2, ..., D_c be the sets of samples for
each class.
Assume the samples are independent and identically distributed (i.i.d.).

The samples in D_j have been drawn independently according to the probability
law p(x|ω_j). We also assume that p(x|ω_j) has a known parametric form, and is
therefore determined uniquely by the value of a parameter vector θ_j.

To show the dependence of p(x|ω_j) on θ_j, we write it as p(x|ω_j, θ_j).


Our problem is to use the information provided by the training samples to
obtain good estimates for the unknown parameter vectors θ_1, ..., θ_c
associated with each category.

The estimation of the parameter vector θ_j from the information available in
the sample set D_j is called maximum likelihood estimation.
So, we have to use the information from the training samples in the set D_j to
obtain a good estimate of the parameter vector θ_j.

To simplify the treatment of this problem, we shall assume that samples in D_i
provide no information about θ_j if i ≠ j.

That is, we shall assume that the parameters for the different classes are
functionally independent. This permits us to work with each class separately,
and thus we have c separate problems, each of the same form (see the sketch
below).
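
A minimal sketch of this decomposition (assuming NumPy; estimate_theta is an
illustrative placeholder for any per-class MLE routine): each class's data is
used on its own, giving c independent estimation problems.

    import numpy as np

    def estimate_theta(D_j):
        # Placeholder single-class MLE: here, a Gaussian fit
        mu_hat = D_j.mean(axis=0)
        sigma_hat = np.cov(D_j, rowvar=False, bias=True)  # MLE uses 1/n
        return mu_hat, sigma_hat

    # D maps class label -> array of samples for that class;
    # each class is handled separately, using only its own samples.
    def estimate_all(D):
        return {j: estimate_theta(D_j) for j, D_j in D.items()}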

Use a set D of training samples drawn independently from the probability
density p(x|θ) to estimate the unknown parameter vector θ.

Suppose that D contains n samples, x_1, x_2, ..., x_n. Then, since the samples
were drawn independently, we have

    p(D|θ) = ∏_{k=1}^{n} p(x_k|θ)

p(D|θ) is called the likelihood of θ with respect to the set of samples.

The maximum likelihood estimate θ̂ is, by definition, the value of θ that maximizes p(D|θ).

Instead of taking the likelihood p(D|θ) itself, we can take its logarithm for
analysis; since the logarithm is monotonically increasing, the θ that
maximizes the log-likelihood also maximizes the likelihood.

    Log-likelihood: l(θ) = ln p(D|θ) = ∑_{k=1}^{n} ln p(x_k|θ)
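
A minimal sketch (assuming NumPy; all names are illustrative) of why the
logarithm is preferred in practice: a product of many densities underflows
numerically, while the log-likelihood sum stays well behaved.

    import numpy as np

    def log_likelihood(samples, mu, var):
        # l(theta) = sum_k ln p(x_k|theta) for a 1-D Gaussian N(mu, var)
        return np.sum(-0.5 * np.log(2 * np.pi * var)
                      - (samples - mu) ** 2 / (2 * var))

    samples = np.random.normal(loc=1.0, scale=2.0, size=1000)
    dens = np.exp(-0.5 * (samples - 1.0) ** 2 / 4.0) / np.sqrt(8 * np.pi)
    print(np.prod(dens))                             # underflows to 0.0
    print(log_likelihood(samples, mu=1.0, var=4.0))  # finite and usable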


To maximize the likelihood, differentiate l(θ) and equate the derivative to
zero.

Since θ is a vector, apply the gradient operator ∇_θ instead of a simple
derivative.

Let θ be a p-component vector:

    θ = (θ_1, θ_2, ..., θ_p)^t

and let ∇_θ denote the gradient operator:

    ∇_θ = (∂/∂θ_1, ∂/∂θ_2, ..., ∂/∂θ_p)^t

The value θ̂ that maximizes the log-likelihood can be obtained by making

    ∇_θ l(θ) = 0

where

    ∇_θ l(θ) = ∑_{k=1}^{n} ∇_θ ln p(x_k|θ)

Thus, a set of necessary conditions for the maximum likelihood estimate of θ
can be obtained from this set of p equations.
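
For the Gaussian case these p equations can be solved in closed form: setting
the gradient to zero yields the sample mean for μ̂ and the 1/n sample
covariance for Σ̂. A minimal numerical check (assuming NumPy; names and the
chosen parameters are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.multivariate_normal(mean=[1.0, -1.0],
                                cov=[[2.0, 0.3], [0.3, 1.0]], size=500)

    mu_hat = X.mean(axis=0)                             # solves grad_mu l = 0
    sigma_hat = (X - mu_hat).T @ (X - mu_hat) / len(X)  # 1/n, not 1/(n-1)
    print(mu_hat)
    print(sigma_hat)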
The graph above shows several training points in one dimension, assumed to be
drawn from a Gaussian of a particular variance but unknown mean.

From the graph we can observe that if we have a larger number of training
points, the likelihood p(D|θ) becomes very narrow and our confidence in
locating its maximum is high. Hence our confidence in the estimate θ̂ that
maximizes the likelihood depends markedly on the number of samples: with more
samples, the estimation of θ is more accurate.
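
A small sketch of this effect (assuming NumPy; names and the 2-nat threshold
are illustrative): for a unit-variance Gaussian with unknown mean, the region
of μ values whose log-likelihood is within 2 nats of the peak shrinks as n
grows, so the peak is pinned down more confidently.

    import numpy as np

    rng = np.random.default_rng(1)
    grid = np.linspace(-1.0, 3.0, 401)     # candidate values of mu
    for n in (5, 50, 500):
        x = rng.normal(loc=1.0, scale=1.0, size=n)
        ll = np.array([np.sum(-0.5 * (x - m) ** 2) for m in grid])
        # Width of the region within 2 nats of the peak shrinks with n
        width = np.ptp(grid[ll > ll.max() - 2.0])
        print(f"n={n:4d}  mu_hat={grid[ll.argmax()]:.3f}  width={width:.3f}")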
