
ECE 8443 – Pattern Recognition

LECTURE 14: SUFFICIENT STATISTICS

• Objectives:
Sufficient Statistics
Dimensionality
Complexity
Overfitting
• Resources:
DHS – Chap. 3 (Part 2)
Rice – Sufficient Statistics
Ellem – Sufficient Statistics
TAMU – Dimensionality

• URL: .../publications/courses/ece_8443/lectures/current/lecture_14.ppt
14: SUFFICIENT STATISTICS
DEFINITION
• Direct computation of p(D|θ) and p(θ|D) for large data sets
is challenging (e.g., neural networks).
• We need a parametric form for p(x|θ) (e.g., Gaussian).
• Gaussian case: computation of the sample mean and
covariance, which was straightforward, contained all the
information relevant to estimating the unknown population
mean and covariance.
• This property exists for other distributions.
• A sufficient statistic is a function s of the samples D that
contains all the information relevant to a parameter, θ.
• A statistic, s, is said to be sufficient for θ if p(D|s,θ) is
independent of θ:

$$p(\theta \mid s, D) = \frac{p(D \mid s, \theta)\,p(\theta \mid s)}{p(D \mid s)} = p(\theta \mid s)$$
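As a quick illustration of this definition (an added, standard example): for n independent Bernoulli(θ) trials with s = Σₖ xₖ, every dataset with s successes is equally likely once s is known:

$$p(D \mid s, \theta) = \frac{p(D \mid \theta)}{p(s \mid \theta)} = \frac{\theta^{s}(1-\theta)^{n-s}}{\binom{n}{s}\theta^{s}(1-\theta)^{n-s}} = \binom{n}{s}^{-1},$$

which does not depend on θ, so s is sufficient for θ.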
14: SUFFICIENT STATISTICS
FACTORIZATION THEOREM

• Theorem: A statistic, s, is sufficient for θ, if and only if
p(D|θ) can be written as:

$$p(D \mid \theta) = g(s, \theta)\,h(D)$$

• There are many ways to formulate sufficient statistics
(e.g., define a vector of the samples themselves).
• Useful only when the function g() and the sufficient
statistic are simple (e.g., sample mean calculation).
• The factoring of p(D|θ) is not unique:

$$g'(s, \theta) = f(s)\,g(s, \theta), \qquad h'(D) = h(D)/f(s)$$

• Define a kernel density invariant to scaling:

$$\bar{g}(s, \theta) = \frac{g(s, \theta)}{\int g(s, \theta)\,d\theta}$$
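For instance (an added example), the theorem factors the likelihood of n i.i.d. Poisson(θ) samples cleanly, exposing s = Σₖ xₖ as a sufficient statistic:

$$p(D \mid \theta) = \prod_{k=1}^{n} \frac{e^{-\theta}\theta^{x_k}}{x_k!} = \underbrace{e^{-n\theta}\theta^{s}}_{g(s,\theta)}\;\underbrace{\prod_{k=1}^{n}\frac{1}{x_k!}}_{h(D)}$$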
14: SUFFICIENT STATISTICS
GAUSSIAN DISTRIBUTION
n 1 1 t 1
p( D |  )   exp[  ( x k   )  ( x k   )]
d 2 12
k 1( 2 )  2
1 1 n t 1 t 1 1
 exp[       2  x k  x t
k  xk ]
d 2 12
( 2 )  2 k 1

t 1  
n t 1 n
 exp[          x k  ]
2  k 1 
1 1 n t 1
 exp[   x k  x k ]
d 2 12
( 2 )  2 k 1
• This isolates the  dependence in the first term, and
hence, the sample mean is a sufficient statistic.
• The kernel is: ~ 1 1  1 1 
g( 
ˆ n , )  exp[ (   
ˆ n )t   (   
ˆ n) ]
1
12 2 n 
( 2 )d 2

n
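A minimal numerical sketch of this result (illustrative code; the covariance and datasets are made up): if two datasets of equal size share the same sample mean, their log-likelihood difference should not depend on μ, since μ enters only through the first factor.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 50
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 1.5]])          # known covariance
Sigma_inv = np.linalg.inv(Sigma)

def log_likelihood(D, mu):
    """log p(D | mu) for i.i.d. N(mu, Sigma) samples (rows of D)."""
    diff = D - mu
    quad = np.einsum('ki,ij,kj->', diff, Sigma_inv, diff)
    n, d = D.shape
    return (-0.5 * quad
            - 0.5 * n * d * np.log(2 * np.pi)
            - 0.5 * n * np.log(np.linalg.det(Sigma)))

D1 = rng.normal(size=(n, d))
D2 = rng.normal(size=(n, d))
D2 += D1.mean(axis=0) - D2.mean(axis=0)      # force identical sample means

# The gap below equals log h(D1) - log h(D2) and must not change with mu.
for mu in [np.zeros(d), np.ones(d), rng.normal(size=d)]:
    print(log_likelihood(D1, mu) - log_likelihood(D2, mu))
```

All three printed differences are identical: with Σ known, the μ-dependence of the likelihood is carried entirely by the sample mean.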
14: SUFFICIENT STATISTICS
EXPONENTIAL FAMILY
• This can be generalized:

$$p(x \mid \theta) = \alpha(x) \exp\!\left[a(\theta) + b(\theta)^t c(x)\right]$$

and:

$$p(D \mid \theta) = \exp\!\left[n\,a(\theta) + b(\theta)^t \sum_{k=1}^{n} c(x_k)\right] \prod_{k=1}^{n} \alpha(x_k) = g(s, \theta)\,h(D)$$
• Examples:
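For reference, two standard members of this family, written in the α(x), a(θ), b(θ), c(x) form above:

$$\text{Exponential: } p(x \mid \theta) = \theta e^{-\theta x} \;\Rightarrow\; \alpha(x) = 1,\; a(\theta) = \ln\theta,\; b(\theta) = -\theta,\; c(x) = x$$

$$\text{Poisson: } p(x \mid \theta) = \frac{e^{-\theta}\theta^{x}}{x!} \;\Rightarrow\; \alpha(x) = \frac{1}{x!},\; a(\theta) = -\theta,\; b(\theta) = \ln\theta,\; c(x) = x$$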
14: PROBLEMS OF DIMENSIONALITY
DIRECTIONS OF DISCRIMINATION
• If features are statistically independent, in theory we can
get excellent performance.
• Recall the Bayes error rate for a two-class multivariate
normal problem:

$$P(e) = \frac{1}{\sqrt{2\pi}} \int_{r/2}^{\infty} e^{-u^2/2}\,du$$

where r² is the squared Mahalanobis distance:

$$r^2 = (\mu_1 - \mu_2)^t \Sigma^{-1}(\mu_1 - \mu_2)$$

• For conditionally independent features:

$$r^2 = \sum_{i=1}^{d} \left(\frac{\mu_{i1} - \mu_{i2}}{\sigma_i}\right)^2$$

• The most useful features are those for which the difference
of the means is large with respect to the standard deviation.
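A short sketch (with hypothetical per-feature separations) shows how each conditionally independent feature adds to r² and drives P(e) down; it uses the identity (1/√(2π))∫ₐ^∞ e^{−u²/2} du = ½ erfc(a/√2).

```python
import math

def bayes_error(r):
    # P(e) = (1/sqrt(2*pi)) * integral from r/2 to inf of exp(-u^2/2) du
    return 0.5 * math.erfc((r / 2) / math.sqrt(2))

# hypothetical per-feature mean differences and standard deviations
mean_diff = [2.0, 1.0, 0.5, 0.25]
sigma     = [1.0, 1.0, 1.0, 1.0]

r2 = 0.0
for i, (dm, s) in enumerate(zip(mean_diff, sigma), start=1):
    r2 += (dm / s) ** 2          # independent features add to r^2
    r = math.sqrt(r2)
    print(f"d={i}  r={r:.3f}  P(e)={bayes_error(r):.4f}")
```

Each added feature lowers the error, but features with small (μᵢ₁ − μᵢ₂)/σᵢ contribute little, which is the point of the final bullet above.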
14: PROBLEMS OF DIMENSIONALITY
COMPUTATIONAL COMPLEXITY
• “Big Oh” notation used to describe complexity:
if f(x) = 2 + 3x + 4x², f(x) has computational complexity O(x²).
• Recall:

$$g(x) = -\frac{1}{2}(x - \hat{\mu})^t \hat{\Sigma}^{-1}(x - \hat{\mu}) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\hat{\Sigma}| + \ln P(\omega)$$

The costs of obtaining each term from n training samples are,
respectively, O(dn) for μ̂, O(nd²) for Σ̂⁻¹, O(1) for the
constant, O(d²n) for ln|Σ̂|, and O(n) for P(ω).
• Watch those constants of proportionality (e.g., O(nd²)).
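A minimal sketch of where these costs arise (illustrative code, assuming the class-conditional Gaussian model above): the expensive estimates are computed once per class at training time, while each evaluation of g(x) is a quadratic form costing O(d²).

```python
import numpy as np

def train_class(X, prior):
    """Estimate per-class quantities from n x d training data X."""
    mu_hat = X.mean(axis=0)                  # sample mean: O(dn)
    Sigma_hat = np.cov(X, rowvar=False)      # sample covariance: O(nd^2)
    sign, logdet = np.linalg.slogdet(Sigma_hat)
    return mu_hat, np.linalg.inv(Sigma_hat), logdet, np.log(prior)

def discriminant(x, mu_hat, Sigma_inv, logdet, log_prior):
    """g(x) for one class; the quadratic form is O(d^2) per test point."""
    d = x.shape[0]
    diff = x - mu_hat
    return (-0.5 * diff @ Sigma_inv @ diff
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * logdet
            + log_prior)

# usage with made-up data:
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
print(discriminant(rng.normal(size=4), *train_class(X, prior=0.5)))
```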


• If the number of data samples is inadequate, we can
experience overfitting (which implies poor generalization).
• Hence, later in the course, we will study ways to control
generalization and to smooth estimates of key parameters
such as the mean and covariance (see textbook).
