4
Maximum entropy distributions
Probability assignment is tied together by Bayes' theorem. Marginalization connects joint and marginal probabilities,

p(x) = Σ_{i=1}^{n} p(x, y_i)    and    p(y) = Σ_{j=1}^{n} p(x_j, y)

and Bayes' theorem gives the posterior distribution

p(θ | y) = p(y | θ) p(θ) / p(y)
For Bayesian analyses, we often find it convenient to suppress the normalizing factor, p(y), and write the posterior distribution as proportional to the product of the sampling distribution, or likelihood function, and the prior distribution:

p(θ | y) ∝ p(y | θ) p(θ)

or, for a particular draw y = y₀,

p(θ | y = y₀) ∝ ℓ(θ | y = y₀) p(θ)

where p(y | θ) is the sampling distribution, ℓ(θ | y = y₀) is the likelihood function evaluated at y = y₀, and p(θ) is the prior distribution for θ. Bayes' theorem is the glue that holds consistent probability assignment together.
Example 1 Consider the following joint distribution:

p(y = y₁, θ = θ₁) = 0.1    p(y = y₂, θ = θ₁) = 0.4    p(θ = θ₁) = 0.5
p(y = y₁, θ = θ₂) = 0.2    p(y = y₂, θ = θ₂) = 0.3    p(θ = θ₂) = 0.5
p(y = y₁) = 0.3            p(y = y₂) = 0.7

The sampling distributions are

                 y₁     y₂
p(y | θ = θ₁)    0.2    0.8
p(y | θ = θ₂)    0.4    0.6

and the posterior distributions are

                 θ₁     θ₂
p(θ | y = y₁)    1/3    2/3
p(θ | y = y₂)    4/7    3/7
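A few lines of Python confirm the arithmetic (the labels y1, y2, t1, t2 stand in for y₁, y₂, θ₁, θ₂):

# Recover the sampling distributions and posteriors from the joint
# distribution of Example 1 by conditioning (divide by the marginal).
joint = {("y1", "t1"): 0.1, ("y2", "t1"): 0.4,
         ("y1", "t2"): 0.2, ("y2", "t2"): 0.3}

p_y = {y: sum(v for (yy, _), v in joint.items() if yy == y) for y in ("y1", "y2")}
p_t = {t: sum(v for (_, tt), v in joint.items() if tt == t) for t in ("t1", "t2")}

for t in ("t1", "t2"):   # sampling distributions p(y | theta)
    print(t, {y: joint[(y, t)] / p_t[t] for y in ("y1", "y2")})
for y in ("y1", "y2"):   # posteriors p(theta | y) via Bayes' theorem
    print(y, {t: joint[(y, t)] / p_y[y] for t in ("t1", "t2")})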
4.3 Entropy

Shannon defines entropy as

h = −Σ_{i=1}^{n} p_i log p_i

where Σ_{i=1}^{n} p_i = 1. For a continuous random variable with ∫ p(x) dx = 1, entropy is defined relative to a reference measure m(x),

h = −∫ p(x) log [p(x)/m(x)] dx

Maximum entropy probability assignment maximizes entropy subject to what we know (normalization plus any moment conditions). Suppose we know only that support involves three outcomes. The Lagrangian is

L = max_{p_i} −Σ_{i=1}^{3} p_i log p_i − (λ₀ − 1)(Σ_{i=1}^{3} p_i − 1)

The first-order conditions give −log p_i − λ₀ = 0, or

p_i = exp[−λ₀]    for i = 1, 2, 3

and normalization requires λ₀ = log 3. Hence, as expected, the maximum entropy probability assignment is a discrete uniform distribution with p_i = 1/3 for i = 1, 2, 3.
Now, suppose support is {1, 2, 3} and we also know the mean, E[x] = 2.5. The Lagrangian becomes

L = max_{p_i} −Σ_{i=1}^{3} p_i log p_i − (λ₀ − 1)(Σ_{i=1}^{3} p_i − 1) − λ₁(Σ_{i=1}^{3} p_i x_i − 2.5)

The first-order conditions give

p_i = exp[−λ₀ − λ₁x_i]    for i = 1, 2, 3

and solving the normalization and moment conditions yields λ₀ = 2.987 and λ₁ = −0.834, so that

p₁ = 0.116    p₂ = 0.268    p₃ = 0.616
More generally, suppose we know support {x₁, …, x_n} and m moment conditions, E[f_j(x)] = F_j for j = 1, …, m. The first-order conditions give p(x_i) = exp[−λ₀] exp[−Σ_{j=1}^{m} λ_j f_j(x_i)], and normalization requires

Σ_{k=1}^{n} exp[−λ₀] exp[−Σ_{j=1}^{m} λ_j f_j(x_k)] = 1

Hence, λ₀ cancels:7

p(x_i) = exp[−λ₀] exp[−Σ_{j=1}^{m} λ_j f_j(x_i)] / (exp[−λ₀] Σ_{k=1}^{n} exp[−Σ_{j=1}^{m} λ_j f_j(x_k)])
       = exp[−Σ_{j=1}^{m} λ_j f_j(x_i)] / Σ_{k=1}^{n} exp[−Σ_{j=1}^{m} λ_j f_j(x_k)]
       = k(x_i) / Z(λ₁, …, λ_m)

where

k(x_i) = exp[−Σ_{j=1}^{m} λ_j f_j(x_i)]

is a kernel, and

Z(λ₁, …, λ_m) = Σ_{k=1}^{n} exp[−Σ_{j=1}^{m} λ_j f_j(x_k)]

is the partition function.8

7 λ₀ simply ensures the probabilities sum to unity and, since the partition function assures this, we can define the partition function without λ₀. That is, λ₀ cancels as demonstrated above.

8 In physical statistical mechanics, the partition function describes the partitioning among different microstates and serves as a generator function for all manner of results regarding a process. The notation, Z, refers to the German word for sum over states, Zustandssumme. An example with relevance for our purposes is −∂ log Z/∂λ_j = E[f_j(x)].
Return to the example above. Since we know support and the mean, n = 3 and the function f(x_i) = x_i. This implies

Z(λ₁) = Σ_{i=1}^{3} exp[−λ₁x_i]

and

p_i = k(x_i)/Z(λ₁) = exp[−λ₁x_i] / Σ_{k=1}^{3} exp[−λ₁x_k]

The moment condition then determines λ₁:

Σ_{i=1}^{3} p_i x_i − 2.5 = Σ_{i=1}^{3} (exp[−λ₁x_i] / Σ_{k=1}^{3} exp[−λ₁x_k]) x_i − 2.5 = 0

Solving again yields λ₁ = −0.834 and

p₁ = 0.116    p₂ = 0.268    p₃ = 0.616

Equivalently, the partition function generates the moment condition directly:

−∂ log Z/∂λ₁ = (3 + 2e^{λ₁} + e^{2λ₁}) / (1 + e^{λ₁} + e^{2λ₁}) = 2.5
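As a sketch of these mechanics, the following Python solves the moment condition numerically (the bracketing interval for the root search is an assumption):

import numpy as np
from scipy.optimize import brentq

x = np.array([1.0, 2.0, 3.0])

def mean_gap(lam):
    w = np.exp(-lam * x)        # kernel exp[-lambda_1 * x_i]
    p = w / w.sum()             # p_i = k(x_i) / Z(lambda_1)
    return p @ x - 2.5          # moment condition E[x] = 2.5

lam = brentq(mean_gap, -10.0, 10.0)
p = np.exp(-lam * x)
p /= p.sum()
print(lam, p)                   # lam ~ -0.834, p ~ [0.116, 0.268, 0.616]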
For binary support, x ∈ {0, 1}, with a mean condition, the partition function approach is

p(x = 1) = exp[−λ(1)] / (exp[−λ(0)] + exp[−λ(1)])
         = exp[−λ] / (1 + exp[−λ])
         = 1 / (1 + exp[λ])

This is the shape of the density for a logistic distribution (and would be a logistic density if support were unbounded rather than binary).9 Solving a mean condition E[x] = 3/5, say, for λ = log(2/3), or p(x = 1) = 3/5, reveals the assigned probability of success or characteristic parameter for a Bernoulli distribution.
4.6.1 Binomial

Suppose we know there are binary (Bernoulli or "success"/"failure") outcomes associated with each of n draws and the expected value of success equals np. The expected value of failure is redundant, hence there is only one moment condition. Then, the maximum entropy assignment includes the number of combinations which produce x₁ "successes" and x₀ "failures", m_i = n!/(x₁ᵢ! x₀ᵢ!), as a measure in generalized entropy

s = −Σ_i p_i log (p_i/m_i)

Now, the kernel (drawn from the Lagrangian) is

k_i = m_i exp[−λ₁x₁ᵢ]

Satisfying the moment condition yields the maximum entropy or binomial probability assignment

p(x, n) = k_i/Z = (n!/(x₁! x₀!)) p^{x₁} (1 − p)^{x₀} for x₁ + x₀ = n, and 0 otherwise
9 This suggests logistic regression is a natural (as well as the most common) strategy
for modeling discrete choice.
Example 5 (balanced coin) Suppose we flip a balanced coin n = 12 times; there are 2¹² = 4,096 combinations of heads and tails. Solving, λ₁ = 0, and the maximum entropy probability assignment associated with x equal to x₁ heads and x₀ tails in 12 coin flips is

p(x, n = 12) = (12!/(x₁! x₀!)) (1/2)^{x₁} (1/2)^{x₀} for x₁ + x₀ = 12, and 0 otherwise
Example 6 (unbalanced coin) Continue the coin flip example above except heads are twice as likely as tails. In other words, the expected values are E[x₁] = 8 and E[x₀] = 4 in 12 coin flips. Solving, λ₁ = −log 2, and the maximum entropy probability assignment associated with x₁ heads in 12 coin flips is

p(x, n = 12) = (12!/(x₁! x₀!)) (2/3)^{x₁} (1/3)^{x₀} for x₁ + x₀ = 12, and 0 otherwise
The two assignments are tabulated below.

[x₁, x₀]     p(x, n = 12)
             balanced coin     unbalanced coin
[0, 12]      1/4,096           1/531,441
[1, 11]      12/4,096          24/531,441
[2, 10]      66/4,096          264/531,441
[3, 9]       220/4,096         1,760/531,441
[4, 8]       495/4,096         7,920/531,441
[5, 7]       792/4,096         25,344/531,441
[6, 6]       924/4,096         59,136/531,441
[7, 5]       792/4,096         101,376/531,441
[8, 4]       495/4,096         126,720/531,441
[9, 3]       220/4,096         112,640/531,441
[10, 2]      66/4,096          67,584/531,441
[11, 1]      12/4,096          24,576/531,441
[12, 0]      1/4,096           4,096/531,441
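A minimal sketch reproducing the table with scipy's binomial pmf:

from scipy.stats import binom

for x1 in range(13):
    bal = binom.pmf(x1, 12, 1/2)     # balanced coin, p = 1/2
    unb = binom.pmf(x1, 12, 2/3)     # unbalanced coin, p = 2/3
    print(f"[{x1:2d}, {12 - x1:2d}]  "
          f"{bal * 4096:7.0f}/4,096  {unb * 531441:9.0f}/531,441")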
Canonical analysis
The above is consistent with a canonical analysis based on relative entropy, −Σ_i p_i log (p_i/p_old,i), where p_old,i reflects probabilities assigned based purely on the number of exchangeable ways of generating x_i; hence,

p_old,i = (n!/(x₁ᵢ! x₀ᵢ!)) / Σ_i (n!/(x₁ᵢ! x₀ᵢ!)) = m_i / Σ_i m_i

Since the denominator of p_old,i is absorbed via normalization it can be dropped; then entropy reduces to −Σ_i p_i log (p_i/m_i) and the kernel is the same as above,

k_i = m_i exp[−λ₁x₁ᵢ]
4.6.2 Multinomial

The multinomial is the multivariate analog to the binomial accommodating k rather than two nominal outcomes. Like the binomial, the sum of the outcomes equals n, Σ_{i=1}^{k} x_i = n, where x_k is incremented by one for each occurrence of event k. We know there are

Σ_{x₁+⋯+x_k=n} n!/(x₁! ⋯ x_k!) = kⁿ

exchangeable sequences of outcomes, and, since the moment conditions are redundant, only k − 1 are employed, E[x_j] = np_j for j = 1, …, k − 1, with p_k = 1 − Σ_{j=1}^{k−1} p_j. The kernel is

(n!/(x₁! ⋯ x_k!)) exp[−λ₁x₁ − ⋯ − λ_{k−1}x_{k−1}]

which leads to the standard multinomial probability assignment when the moment conditions are resolved:

p(x, n) = (n!/(x₁! ⋯ x_k!)) p₁^{x₁} ⋯ p_k^{x_k} for Σ_{i=1}^{k} x_i = n and Σ_{i=1}^{k} p_i = 1, and 0 otherwise
Example 7 (one balanced die) Suppose we roll a balanced die (k = 6) one time (n = 1); the moment conditions are E[x₁] = ⋯ = E[x₆] = 1/6. Incorporating the number of exchangeable ways to generate n = 1 results, the kernel is

(1!/(x₁! ⋯ x₆!)) exp[−λ₁x₁ − ⋯ − λ₅x₅]

and resolving the moment conditions yields

p(x, n = 1) = (1!/(x₁! ⋯ x₆!)) (1/6)^{x₁} ⋯ (1/6)^{x₆} for Σ_{j=1}^{6} x_j = 1, and 0 otherwise

For an unbalanced die with E[x_j] = j/21 for j = 1, …, 6, the same development yields

p(x, n = 1) = (1!/(x₁! ⋯ x₆!)) (1/21)^{x₁} (2/21)^{x₂} (3/21)^{x₃} (4/21)^{x₄} (5/21)^{x₅} (6/21)^{x₆} for Σ_{j=1}^{6} x_j = 1, and 0 otherwise

For n = 2 rolls of the balanced die,

p(x, n = 2) = (2!/(x₁! ⋯ x₆!)) (1/6)^{x₁} ⋯ (1/6)^{x₆} for Σ_{j=1}^{6} x_j = 2, and 0 otherwise

while for n = 2 rolls of the unbalanced die,

p(x, n = 2) = (2!/(x₁! ⋯ x₆!)) (1/21)^{x₁} (2/21)^{x₂} (3/21)^{x₃} (4/21)^{x₄} (5/21)^{x₅} (6/21)^{x₆} for Σ_{j=1}^{6} x_j = 2, and 0 otherwise

The two-roll assignments are tabulated below.
[x₁, x₂, x₃, x₄, x₅, x₆]     p(x, n = 2)
                             balanced dice     unbalanced dice
[2, 0, 0, 0, 0, 0]           1/36              1/441
[0, 2, 0, 0, 0, 0]           1/36              4/441
[0, 0, 2, 0, 0, 0]           1/36              9/441
[0, 0, 0, 2, 0, 0]           1/36              16/441
[0, 0, 0, 0, 2, 0]           1/36              25/441
[0, 0, 0, 0, 0, 2]           1/36              36/441
[1, 1, 0, 0, 0, 0]           1/18              4/441
[1, 0, 1, 0, 0, 0]           1/18              6/441
[1, 0, 0, 1, 0, 0]           1/18              8/441
[1, 0, 0, 0, 1, 0]           1/18              10/441
[1, 0, 0, 0, 0, 1]           1/18              12/441
[0, 1, 1, 0, 0, 0]           1/18              12/441
[0, 1, 0, 1, 0, 0]           1/18              16/441
[0, 1, 0, 0, 1, 0]           1/18              20/441
[0, 1, 0, 0, 0, 1]           1/18              24/441
[0, 0, 1, 1, 0, 0]           1/18              24/441
[0, 0, 1, 0, 1, 0]           1/18              30/441
[0, 0, 1, 0, 0, 1]           1/18              36/441
[0, 0, 0, 1, 1, 0]           1/18              40/441
[0, 0, 0, 1, 0, 1]           1/18              48/441
[0, 0, 0, 0, 1, 1]           1/18              60/441
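A sketch reproducing the two-dice table with scipy's multinomial pmf (pair outcomes print as 2/36, equivalently the table's 1/18):

from itertools import product
from scipy.stats import multinomial

balanced = [1/6] * 6
unbalanced = [j / 21 for j in range(1, 7)]    # p_j = j/21

for counts in product(range(3), repeat=6):
    if sum(counts) != 2:
        continue
    b = multinomial.pmf(counts, n=2, p=balanced)
    u = multinomial.pmf(counts, n=2, p=unbalanced)
    print(list(counts), f"{b * 36:.0f}/36", f"{u * 441:.0f}/441")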
4.6.3 Hypergeometric

The hypergeometric probability assignment is a purely combinatoric exercise. Suppose we take n draws without replacement from a finite population of N items of which m are the target items (or events) and x is the number of target items drawn. There are C(m, x) ways to draw the targets times C(N − m, n − x) ways to draw nontargets out of C(N, n) ways to make n draws, where C(a, b) = a!/(b!(a − b)!) denotes the binomial coefficient. Hence, our combinatoric measure is10

p(x) = C(m, x) C(N − m, n − x) / C(N, n) for x ∈ {max(0, m + n − N), …, min(m, n)}

and

Σ_{x=max(0,m+n−N)}^{min(m,n)} C(m, x) C(N − m, n − x) / C(N, n) = 1
For instance, suppose N = 6, m = 2, and n = 2. Then

p(x = 0, n = 2) = C(2, 0) C(4, 2) / C(6, 2) = (4/6)(3/5) = 6/15

p(x = 1, n = 2) = C(2, 1) C(4, 1) / C(6, 2) = (2/6)(4/5) + (4/6)(2/5) = 8/15

p(x = 2, n = 2) = C(2, 2) C(4, 0) / C(6, 2) = (2/6)(1/5) = 1/15

where the second expression in each line confirms the assignment by tracking sequential draws.
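A quick check with scipy's hypergeometric pmf (scipy's argument order is pmf(x, N, m, n) for population N, targets m, and draws n):

from scipy.stats import hypergeom

N, m, n = 6, 2, 2
for x in range(3):
    print(x, f"{hypergeom.pmf(x, N, m, n) * 15:.0f}/15")   # 6/15, 8/15, 1/15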
10 If we omit C(N, n) in the denominator, it would be captured via normalization. In other words, the kernel is

k = C(m, x) C(N − m, n − x) exp[0]

and the partition function is

Z = Σ_{x=max(0,m+n−N)}^{min(m,n)} k = C(N, n)

The same apparatus applies to continuous distributions. For support [a, b] and moment conditions E[f_j(x)], the maximum entropy density is

p(x) = exp[−Σ_j λ_j f_j(x)] / ∫_a^b exp[−Σ_j λ_j f_j(x)] dx

Suppose we know only that support is 0 ≤ x ≤ 3. Then

p(x) = exp[0] / ∫_0^3 exp[0] dx = 1/3

Of course, this is the density function for a uniform with support from 0 to 3.
Now, suppose we also know the mean, E[x] = ∫_0^3 x p(x) dx = 1.35. The density takes the form

p(x) = exp[−λ₁x] / ∫_0^3 exp[−λ₁x] dx

and the moment condition is

∫_0^3 x (exp[−λ₁x] / ∫_0^3 exp[−λ₁x] dx) dx − 1.35 = 0

so that λ₁ = 0.2012, Z = ∫_0^3 exp[−λ₁x] dx = 2.25225, and the density function is a truncated exponential distribution with support from 0 to 3,

p(x) = 0.444 exp[−0.2012x], 0 ≤ x ≤ 3

The base (non-truncated) distribution is exponential with mean approximately equal to 5 (4.9699):

p(x) = (1/4.9699) exp[−x/4.9699] = 0.2012 exp[−0.2012x], 0 ≤ x < ∞
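A sketch solving the moment condition by quadrature and root-finding (the root bracket is an assumption):

import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

def mean_gap(lam):
    Z, _ = quad(lambda x: np.exp(-lam * x), 0, 3)          # partition function
    m, _ = quad(lambda x: x * np.exp(-lam * x) / Z, 0, 3)  # E[x] under p(x)
    return m - 1.35

lam = brentq(mean_gap, 1e-6, 5.0)
Z, _ = quad(lambda x: np.exp(-lam * x), 0, 3)
print(lam, Z, 1 / Z)    # ~0.2012, ~2.25225, ~0.444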
Next, suppose we know the variance, E[(x − μ)²] = σ² = 100, and wish to find the maximum entropy density function for arbitrary mean, μ. Using the partition function approach above we have

p(x) = exp[−λ₂(x − μ)²] / ∫_{−∞}^{∞} exp[−λ₂(x − μ)²] dx

and the moment condition is

∫_{−∞}^{∞} (x − μ)² (exp[−λ₂(x − μ)²] / ∫_{−∞}^{∞} exp[−λ₂(x − μ)²] dx) dx − 100 = 0

so that λ₂ = 1/(2σ²) = 1/200 and

p(x) = (1/√(2πσ²)) exp[−(x − μ)²/(2σ²)] = (1/√(2π(100))) exp[−(x − μ)²/(2(100))]

the normal or Gaussian density with mean μ and variance 100.
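A quick numerical check of the normalization and variance (taking μ = 0 for convenience):

import numpy as np
from scipy.integrate import quad

lam2 = 1 / 200
Z, _ = quad(lambda x: np.exp(-lam2 * x**2), -np.inf, np.inf)
print(Z, np.sqrt(2 * np.pi * 100))   # Z = sqrt(2*pi*sigma^2)
v, _ = quad(lambda x: x**2 * np.exp(-lam2 * x**2) / Z, -np.inf, np.inf)
print(v)                             # ~100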
Suppose instead we know E[log x] = μ = 1 as well as E[(log x − μ)²] = σ² = 10; then the maximum entropy probability assignment is the lognormal distribution

p(x) = (1/(xσ√(2π))) exp[−(log x − μ)²/(2σ²)], 0 < x < ∞, −∞ < μ < ∞, σ > 0
Again, we utilize the partition function approach to demonstrate. The density takes the form

p(x) = exp[−λ₁ log x − λ₂(log x)²] / ∫_0^∞ exp[−λ₁ log x − λ₂(log x)²] dx

and the constraints are

E[log x] = ∫_0^∞ log x p(x) dx − 1 = 0

and

E[(log x − 1)²] = ∫_0^∞ (log x − 1)² p(x) dx − 10 = 0

Solving yields λ₁ = 9/10 and λ₂ = 1/20, so

p(x) ∝ exp[−(9/10) log x − (1/20)(log x)²]

Completing the square and adding in the constant (from normalization) exp[1/20] gives

p(x) ∝ exp[−log x] exp[1/20 + (2/20) log x − (1/20)(log x)²]

Rewriting produces

p(x) ∝ exp[−log x] exp[−(log x − 1)²/20]

which simplifies as

p(x) ∝ (1/x) exp[−(log x − 1)²/20]

Including the normalizing constants yields the probability assignment asserted above,

p(x) = (1/(x√(2π(10)))) exp[−(log x − 1)²/(2(10))], 0 < x < ∞
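A numerical check via the substitution t = log x, under which the asserted density reduces to a normal with mean 1 and variance 10:

import numpy as np
from scipy.integrate import quad

f = lambda t: np.exp(-(t - 1)**2 / 20) / np.sqrt(2 * np.pi * 10)
print(quad(f, -np.inf, np.inf)[0])                            # ~1
print(quad(lambda t: t * f(t), -np.inf, np.inf)[0])           # E[log x] ~ 1
print(quad(lambda t: (t - 1)**2 * f(t), -np.inf, np.inf)[0])  # ~10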
Next, consider probability assignment for the parameters (θ₁, θ₂, θ₃) of a three-outcome process, where Σ_{i=1}^{3} θ_i = 1. We'll refer to this as a die generating process where θ₁ refers to outcome one or two, θ₂ to outcome three or four, and θ₃ corresponds to die face five or six. Suppose support is the 36 grid points with θ_j ∈ {0.1, 0.2, …, 0.8} and θ₁ + θ₂ + θ₃ = 1. The maximum entropy prior with no moment conditions is uniform,

p_old(θ₁, θ₂, θ₃) = 1/36

Now add the moment condition E[θ₁ − 2θ₃] = 0. The program is

max_{p(θ)} h = −Σ p(θ₁, θ₂, θ₃) log p(θ₁, θ₂, θ₃)
s.t. Σ p(θ₁, θ₂, θ₃)(θ₁ − 2θ₃) = 0

(plus normalization), so that

p(θ) = exp[λ(θ₁ − 2θ₃)] / Z    where    Z = Σ exp[λ(θ₁ − 2θ₃)]

and λ solves ∂ log Z/∂λ = 0. Solving yields λ = 1.44756. In other words,

p(θ) = exp[1.44756(θ₁ − 2θ₃)] / 28.7313

and

p(x, θ) = p(x | θ) p(θ) = ((n₁ + n₂ + n₃)!/(n₁! n₂! n₃!)) θ₁^{n₁} θ₂^{n₂} θ₃^{n₃} exp[1.44756(θ₁ − 2θ₃)] / 28.7313

Hence, prior to collection and evaluation of evidence, expected values of θ are

E[θ₁] = Σ_x Σ_θ θ₁ p(x, θ) = Σ_θ θ₁ p(θ) = 0.444554
E[θ₂] = Σ_θ θ₂ p(θ) = 0.333169
E[θ₃] = Σ_θ θ₃ p(θ) = 0.222277

Suppose the evidence is x = m = (n₁ = 5, n₂ = 3, n₃ = 2) in ten draws. Then

p(x = m) = Σ_θ p(θ, x = m) = 0.0286946
Hence,

E[θ₁ | x = m] = Σ_θ p(θ | x = m) θ₁ = 0.505373
E[θ₂ | x = m] = Σ_θ p(θ | x = m) θ₂ = 0.302243
E[θ₃ | x = m] = Σ_θ p(θ | x = m) θ₃ = 0.192384

and

E[θ₁ + θ₂ + θ₃ | x = m] = 1

However, notice E[θ₁ − 2θ₃ | x = m] ≠ 0. This is because our priors including the moment condition refer to the process and here we're investigating a specific "die".
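The grid computations can be reproduced in a few lines. The sketch below assumes the 36-point support described above and recovers λ, Z, the prior means, and the posterior means:

import numpy as np
from scipy.optimize import brentq
from math import factorial

grid = np.array([(i/10, j/10, (10 - i - j)/10)
                 for i in range(1, 9) for j in range(1, 10 - i)])  # 36 points
g = grid[:, 0] - 2 * grid[:, 2]                  # theta_1 - 2*theta_3

def moment(lam):                                 # E[theta_1 - 2*theta_3] under tilt
    w = np.exp(lam * g)
    return (w @ g) / w.sum()

lam = brentq(moment, 0.0, 5.0)
Z = np.exp(lam * g).sum()
prior = np.exp(lam * g) / Z
print(lam, Z)                                    # ~1.44756, ~28.7313
print(grid.T @ prior)                            # ~[0.444554, 0.333169, 0.222277]

# posterior for a specific die given counts m = (5, 3, 2) in ten draws
coef = factorial(10) // (factorial(5) * factorial(3) * factorial(2))  # 2520
like = coef * grid[:, 0]**5 * grid[:, 1]**3 * grid[:, 2]**2
joint = like * prior
print(joint.sum())                               # p(x = m) ~ 0.0286946
post = joint / joint.sum()
print(grid.T @ post)                             # ~[0.505373, 0.302243, 0.192384]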
The analysis is summarized below for two samples (each of ten draws) and the two priors.

                                  uninformed prior          moment-conditioned prior, E[θ₁ − 2θ₃] = 0
p(θ)                              1/36                      exp[1.44756(θ₁ − 2θ₃)] / 28.7313
E[θ₁], E[θ₂], E[θ₃]               1/3, 1/3, 1/3             0.444554, 0.333169, 0.222277
p(θ | n₁ = 5, n₂ = 3, n₃ = 2)     3337.45 θ₁⁵θ₂³θ₃²         3056.65 θ₁⁵θ₂³θ₃² exp[1.44756(θ₁ − 2θ₃)]
E[θ | n₁ = 5, n₂ = 3, n₃ = 2]     0.4609, 0.3067, 0.2324    0.5054, 0.3022, 0.1924
p(θ | n₁ = 3, n₂ = 4, n₃ = 3)     5527.55 θ₁³θ₂⁴θ₃³         7768.49 θ₁³θ₂⁴θ₃³ exp[1.44756(θ₁ − 2θ₃)]
E[θ | n₁ = 3, n₂ = 4, n₃ = 3]     0.3075, 0.3850, 0.3075    0.3500, 0.3922, 0.2578
A canonical analysis applies to the joint distribution as well. Maximize relative entropy

s = −Σ p(x, θ) log [p(x, θ)/p_old(x, θ)]

subject to

Σ p(x, θ)(θ₁ − 2θ₃) = 0

Since p_old(θ) = 1/36, Lagrangian methods yield

p(x, θ) = exp[λ(θ₁ − 2θ₃)] p_old(x, θ) / Z(λ)

where

Z(λ) = Σ exp[λ(θ₁ − 2θ₃)] p_old(x, θ)

and λ solves ∂ log Z(λ)/∂λ = 0. This again gives λ = 1.44756,

p(θ) = exp[1.44756(θ₁ − 2θ₃)] / 28.7313

and p(x = m) = Σ_θ p(θ, x = m) = 0.0286946. Hence,

E[θ₁ | x = m] = Σ_θ p(θ | x = m) θ₁ = 0.505373
E[θ₂ | x = m] = Σ_θ p(θ | x = m) θ₂ = 0.302243
E[θ₃ | x = m] = Σ_θ p(θ | x = m) θ₃ = 0.192384

and

E[θ₁ + θ₂ + θ₃ | x = m] = 1

as above. This maximum relative entropy analysis for the joint distribution helps set up simultaneous evaluation of moment and data conditions when we're evaluating the process rather than a specific die.
Now maximize relative entropy subject to both the moment and the data conditions:

max_{p(x,θ)} s = −Σ p(x, θ) log [p(x, θ)/p_old(x, θ)]
s.t. Σ p(x, θ)(θ₁ − 2θ₃) = 0 and x = m

Since the likelihood function remains the same, Lagrangian methods yield

p(x = m, θ) = exp[λ_{(x,θ)}(θ₁ − 2θ₃)] p_old(x = m, θ) / Z(m, λ)

where the partition function is

Z(m, λ) = Σ exp[λ_{(x,θ)}(θ₁ − 2θ₃)] p_old(x = m, θ)

and λ_{(x,θ)} solves ∂ log Z(m, λ)/∂λ_{(x,θ)} = 0. In other words, the joint probability given the moment and data conditions is

p(x = m, θ) = (2520/36) θ₁⁵θ₂³θ₃² exp[0.0420198(θ₁ − 2θ₃)]

Since p(x = m) = Σ_θ p(x = m, θ) = 0.0209722, the posterior probability of θ is

p(θ | x = m, E[θ₁ − 2θ₃] = 0) = 70 θ₁⁵θ₂³θ₃² exp[0.0420198(θ₁ − 2θ₃)] / 0.0209722

Hence,

E[θ₁ | x = m, E[θ₁ − 2θ₃] = 0] = Σ_θ p(θ | x = m, E[θ₁ − 2θ₃] = 0) θ₁ = 0.462225
E[θ₂ | x = m, E[θ₁ − 2θ₃] = 0] = Σ_θ p(θ | x = m, E[θ₁ − 2θ₃] = 0) θ₂ = 0.306663
E[θ₃ | x = m, E[θ₁ − 2θ₃] = 0] = Σ_θ p(θ | x = m, E[θ₁ − 2θ₃] = 0) θ₃ = 0.231113

E[θ₁ + θ₂ + θ₃ | x = m, E[θ₁ − 2θ₃] = 0] = 1

and, unlike the previous analysis of a specific die, for the process the moment condition is maintained,

E[θ₁ − 2θ₃ | x = m, E[θ₁ − 2θ₃] = 0] = 0

What we've done here is add a canonical term, exp[λ_{(x,θ)}(θ₁ − 2θ₃)]/Z(m, λ), to the standard Bayesian posterior for θ given the data, p_old(θ | x = m), to account for the moment condition. The partition function, Z(m, λ), serves to normalize the moment-conditioned posterior distribution.
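Continuing the earlier grid sketch (reusing grid, g, like, and brentq from above), the joint moment-and-data conditioning amounts to re-tilting the uninformed joint distribution:

like_pold = like / 36                            # p_old(x = m, theta)
lam2 = brentq(lambda l: (like_pold * np.exp(l * g)) @ g, -5.0, 5.0)
joint2 = like_pold * np.exp(lam2 * g)
print(lam2, joint2.sum())                        # ~0.0420198, ~0.0209722
post2 = joint2 / joint2.sum()
print(grid.T @ post2)                            # ~[0.462225, 0.306663, 0.231113]
print(post2 @ g)                                 # ~0: moment condition maintained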
A measure proportional to 1/x is scale invariant: for y = bx,

∫ dy/y = ∫ b dx/(bx) = ∫ dx/x

so the assignment is unaffected by a change in units.
leaves their posterior beliefs largely unaffected. Then, collectively, individuals are dissuaded from engaging in (private) information search and an information cascade results, characterized by herding behavior. Such bubbles are sustained until some compelling information event bursts them. Changing herding behavior or bursting an information bubble involves a likelihood so powerful it overwhelms the public (common) information prior.
A summary of discrete maximum entropy probability assignments follows. For each distribution class, the moment constraints, kernel form,12 and mass function are listed.

uniform
  constraints: none
  kernel: exp[0] = 1
  mass function: Pr(x = i) = 1/n, i = 1, …, n

nonuniform
  constraints: E[x] = μ
  kernel: exp[−λx_i]
  mass function: Pr(x = i) = p_i, i = 1, …, n

Bernoulli
  constraints: E[x] = p
  kernel: exp[−λx]
  mass function: Pr(x = 1) = p, x ∈ {0, 1}

binomial13
  constraints: E[x | n] = np, Pr(x = 1 | n = 1) = p
  kernel: C(n, x) exp[−λx]
  mass function: Pr(x = s | n) = C(n, s) p^s (1 − p)^{n−s}, s = 0, …, n

multinomial14
  constraints: E[x_i | n] = np_i, Pr(x_i = 1 | n = 1) = p_i, i = 1, …, k − 1
  kernel: (n!/(x₁! ⋯ x_k!)) exp[−Σ_{i=1}^{k−1} λ_i x_i]
  mass function: Pr(x₁, …, x_k | n) = (n!/(x₁! ⋯ x_k!)) p₁^{x₁} ⋯ p_k^{x_k}, x_i = 0, …, n

Poisson15
  constraints: E[x] = μ
  kernel: (1/x!) exp[−λx]
  mass function: Pr(x = s) = (μ^s/s!) exp[−μ], s = 0, 1, …

geometric16
  constraints: E[x] = 1/p, Pr(x = 1) = p
  kernel: exp[−λx]
  mass function: Pr(x = r) = p(1 − p)^{r−1}, r = 1, 2, …

negative binomial17
  constraints: E[x] = rp/(1 − p), Pr(x = 1) = p
  kernel: C(x + r − 1, x) exp[−λx]
  mass function: Pr(x = s; r) = C(s + r − 1, s) p^s (1 − p)^r, s = 0, 1, …

logarithmic18
  constraints: E[x] = −p/((1 − p) log(1 − p)), E[log x] = −Σ_{x=1}^{∞} (log x) p^x/(x log(1 − p))
  kernel: exp[−λ₁x − λ₂ log x]
  mass function: Pr(x) = −p^x/(x log(1 − p)), x = 1, 2, …

hypergeometric19
  constraints: none (note: E[x] = nm/N)
  kernel: C(m, x) C(N − m, n − x)
  mass function: Pr(x = s; m, n, N) = C(m, s) C(N − m, n − s)/C(N, n)

where Γ(z) = ∫_0^∞ e^{−t} t^{z−1} dt, Γ(n) = (n − 1)! for n a positive integer, and B(a, b) = Γ(a)Γ(b)/Γ(a + b).

12 Excluding the partition function (normalization).

13 The kernel for the binomial includes a measure of the number of exchangeable ways to generate x successes in n trials, C(n, x). The measure, say m(x), derives from generalized entropy, S = −Σ_i p_i log(p_i/m_i), where m_i reflects a measure that ensures entropy is invariant under transformation (or change in units) as required to consistently capture background information including complete ignorance (see Jeffreys [1939], Jaynes [2003], or Sivia and Skilling [2006]). The first order condition from the Lagrangian with mean moment condition yields the kernel m_i exp[−λx_i], as reflected for the binomial probability assignment. Since m_i = C(n, x_i) for the binomial distribution, m_i is not absorbed through normalization. Notice, generalization of entropy can be treated in canonical form by adding a moment condition, E[−log m], whose Lagrange multiplier is unity. Then, the kernel assignment with mean moment condition is as indicated above, m_i exp[−λx_i].

14 Analogous to the binomial, the kernel for the multinomial includes a measure of the number of exchangeable ways to generate x₁, …, x_k events in n trials, n!/(x₁! ⋯ x_k!), where events are mutually exclusive (as with the binomial) and draws are with replacement.

15 Like the binomial, the kernel for the Poisson includes an invariance measure based on the number of exchangeable permutations for generating x occurrences in a given interval, n^x/x!. Since a Poisson process resembles a binomial process with a large number of trials within a fixed interval, we can think of n^x as the large-n approximation to n!/(n − x)!. n^x is absorbed via n^x p^x = μ^x, where expected values for the binomial and Poisson are related by np = μ, and the assignment is normalized via the series expansion of exp[μ] around zero (which is equal to Σ_{x=0}^{∞} μ^x/x!).

16 The geometric distribution indicates the likelihood success occurs in the rth trial. Hence, the measure for the number of ways for this to occur is one.

17 Like the binomial, the kernel for the negative binomial includes a measure of the number of exchangeable ways to generate the outcomes, C(x + r − 1, x).

19 The hypergeometric assignment employs the combinatoric measure C(m, x) C(N − m, n − x)/C(N, n). However, no moment conditions are imposed.
The continuous maximum entropy probability assignments follow, with density functions in place of mass functions.

uniform
  constraints: none
  kernel: exp[0] = 1
  density: f(x) = 1/(b − a), a < x < b

exponential
  constraints: E[x] = μ > 0
  kernel: exp[−λx]
  density: f(x) = (1/μ) exp[−x/μ], 0 < x < ∞

gamma
  constraints: E[x] = a > 0, E[log x] = Γ′(a)/Γ(a)
  kernel: exp[−λ₁x − λ₂ log x]
  density: f(x) = (1/Γ(a)) exp[−x] x^{a−1}, 0 < x < ∞

chi-squared20
  constraints: ν d.f.: E[x] = ν > 0, E[log x] = Γ′(ν/2)/Γ(ν/2) + log 2
  kernel: exp[−λ₁x − λ₂ log x]
  density: f(x) = (1/(2^{ν/2} Γ(ν/2))) x^{ν/2−1} exp[−x/2], 0 < x < ∞

beta
  constraints: E[log x] = Γ′(a)/Γ(a) − Γ′(a + b)/Γ(a + b), E[log(1 − x)] = Γ′(b)/Γ(b) − Γ′(a + b)/Γ(a + b), a, b > 0
  kernel: exp[−λ₁ log x − λ₂ log(1 − x)]
  density: f(x) = (1/B(a, b)) x^{a−1} (1 − x)^{b−1}, 0 < x < 1

normal or Gaussian
  constraints: E[(x − μ)²] = σ²
  kernel: exp[−λ(x − μ)²]
  density: f(x) = (1/√(2πσ²)) exp[−(x − μ)²/(2σ²)], −∞ < x < ∞

lognormal
  constraints: E[log x] = μ, E[(log x)²] = σ² + μ²
  kernel: exp[−λ₁ log x − λ₂(log x)²]
  density: f(x) = (1/(xσ√(2π))) exp[−(log x − μ)²/(2σ²)], 0 < x < ∞

Pareto
  constraints: E[log x] = 1/λ + log x_m, λ > 0
  kernel: exp[−λ log x]
  density: f(x) = λ x_m^λ / x^{λ+1}, 0 < x_m ≤ x < ∞

Laplace
  constraints: E[|x|] = 1/λ, λ > 0
  kernel: exp[−λ|x|]
  density: f(x) = (λ/2) exp[−λ|x|], −∞ < x < ∞

Wishart
  constraints: n d.f.
  kernel: exp[−(1/2) tr(Σ⁻¹X) − λ log |X|]
  density: f(X) = |X|^{(n−p−1)/2} exp[−(1/2) tr(Σ⁻¹X)] / (2^{np/2} |Σ|^{n/2} π^{p(p−1)/4} Π_{i=1}^{p} Γ((n + 1 − i)/2)), X positive definite (p × p)

20 A special case of the gamma: a chi-squared variate with ν degrees of freedom is gamma distributed with shape ν/2 and scale 2.