For probability measures $P$ and $Q$, the Hellinger distance $H(P,Q)$, the relative entropy $D(P\|Q)$, and the $L^1$ (total variation) distance $\|P-Q\|_1$ are related by
$$H^2(P, Q) \le D(P\|Q) \qquad\text{and}\qquad \tfrac12\|P - Q\|_1^2 \le D(P\|Q).$$
For product measures,
$$H\bigl(P_1\otimes\cdots\otimes P_n,\; Q_1\otimes\cdots\otimes Q_n\bigr)^2 \le \sum_{i=1}^{n} H(P_i, Q_i)^2
\qquad\text{and}\qquad
D\bigl(P_1\otimes\cdots\otimes P_n \,\|\, Q_1\otimes\cdots\otimes Q_n\bigr) = \sum_{i=1}^{n} D(P_i\|Q_i).$$
1.
Section 7 describes a simple method for bounding total variation distance via calculations of variances and covariances.
In my opinion, the total variation distance is the most important of the three, for
reasons that will be explained in Chapter 14. I will use the other two distances mostly
as a way of bounding total variation.
<1> Example. Let $P_{n,\theta}$ denote the joint distribution for $n$ independent observations from the $N(\theta, 1)$ distribution, with $\theta \in \mathbb{R}$. Under $P_{n,\theta_0}$, the sample average, $\bar X_n$, converges to $\theta_0$ at a $n^{-1/2}$ rate. The parametrization is reasonable.

What happens if we reparametrize, replacing the $N(\theta, 1)$ by a $N(\theta^3, 1)$? We are still fitting the same model; same probability measures, only the labelling has changed. The maximum likelihood estimator, $\bar X_n^{1/3}$, still converges at an $n^{-1/2}$ rate if $\theta_0 \neq 0$, but for $\theta_0 = 0$ we get an $n^{-1/6}$ rate, as an artifact of the reparametrization.

More imaginative reparametrizations can produce even stranger behaviour for the maximum likelihood estimator. For example, define the one-to-one reparametrization
$$\psi(\theta) = \begin{cases} \theta & \text{if } \theta \text{ is rational}\\ \theta + 1 & \text{if } \theta \text{ is irrational}\end{cases}$$
Now let $P_{n,\theta}$ denote the joint distribution for $n$ independent observations from the $N(\psi(\theta), 1)$ distribution. If $\theta_0$ is rational, the maximum likelihood estimator, $\psi^{-1}(\bar X_n)$, gets very confused: it concentrates around $\theta_0 - 1$ as $n$ gets larger.
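As a quick numerical illustration (not part of the original text; the helper name and sample sizes below are mine), the following Python sketch checks the $n^{-1/6}$ rate for the cube-root reparametrization at $\theta_0 = 0$: the rescaled error $n^{1/6}\,E|\hat\theta_n|$ stays roughly constant as $n$ grows.

```python
# Sketch (not from the text): Monte Carlo check of the n^{-1/6} rate in Example <1>
# for the cube-root reparametrization, at the true value theta_0 = 0.
import numpy as np

rng = np.random.default_rng(0)

def rescaled_error(n, reps=20_000):
    """Average of n^{1/6} * |theta_hat| where theta_hat = (sample mean)^{1/3}."""
    xbar = rng.normal(0.0, 1.0 / np.sqrt(n), size=reps)       # exact law of the sample mean
    theta_hat = np.sign(xbar) * np.abs(xbar) ** (1.0 / 3.0)   # real cube root
    return n ** (1.0 / 6.0) * np.mean(np.abs(theta_hat))

for n in [100, 10_000, 1_000_000]:
    print(n, rescaled_error(n))   # roughly constant in n, confirming the n^{-1/6} rate
```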
2. Total variation
Let $\mu$ be a finite signed measure on a measure space $(\mathcal{X}, \mathcal{A})$. Remember that $\mathcal{M}$ denotes the set of all real, $\mathcal{A}$-measurable functions on $\mathcal{X}$, and that $\mathcal{M}^+$ is the cone of nonnegative functions in $\mathcal{M}$.
Write $m$ for the density of $\mu$ with respect to a dominating ($\sigma$-finite) measure $\lambda$. The total variation of $\mu$ is defined as
$$v(\mu) := \sup \sum\nolimits_i |\mu A_i|,$$
where the supremum runs over all finite partitions $\{A_i\}$ of $\mathcal{X}$ into disjoint, $\mathcal{A}$-measurable sets. In fact, there is no need to consider partitions into more than two sets: for if $A = \cup_i\{A_i : \mu A_i \ge 0\}$ then
$$\sum\nolimits_i |\mu A_i| = \mu A - \mu A^c = |\mu A| + |\mu A^c| \le \lambda|m|,$$
with equality when $A_1 = \{m \ge 0\} = A_2^c$. That is, $v(\mu) = \mu\{m \ge 0\} - \mu\{m < 0\} = \lambda|m|$.
The total variation is often called the $L^1$ norm of $\mu$, and is also denoted by $\|\mu\|_1$. Note that $v(\mu)$, and hence $\|\mu\|_1$, does not depend on the choice of the dominating measure.
The total variation $v(\mu)$ is also equal to $\sup_{|f| \le 1} |\mu f|$, the supremum running over all $\mathcal{A}$-measurable functions $f$ bounded in absolute value by 1:
$$|\mu f| = |\lambda(mf)| \le \lambda|m| \qquad\text{if } |f| \le 1,$$
with equality when $f = \{m \ge 0\} - \{m < 0\}$. In particular, for the difference of two probability measures $P_1$ and $P_2$,
<2>
$$\|P_1 - P_2\|_1 = 2\sup_A |P_1 A - P_2 A|,$$
the supremum running over all sets $A$ in $\mathcal{A}$.
Many authors, no doubt with the special case foremost in their minds, define the total variation as $\sup_A |P_1 A - P_2 A|$. Beware! An unexpected extra factor of 2 can cause confusion.
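A minimal numerical sketch (not from the text) of the factor-of-2 point for two discrete distributions: the $L^1$ distance equals twice the supremum of $|P_1 A - P_2 A|$, the supremum being attained at $A = \{P_1 \ge P_2\}$.

```python
# Sketch (not from the text): sum_x |P1{x} - P2{x}| equals 2 * sup_A |P1(A) - P2(A)|,
# the supremum being attained at A = {x : P1{x} >= P2{x}}.
import numpy as np

rng = np.random.default_rng(1)
P1 = rng.dirichlet(np.ones(6))      # two random distributions on six points
P2 = rng.dirichlet(np.ones(6))

l1 = np.abs(P1 - P2).sum()
A = P1 >= P2
print(l1, 2 * abs(P1[A].sum() - P2[A].sum()))   # the two numbers agree
```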
<3> Example. [...] because $u' Z$, with $Z$ distributed $N(0, I_n)$ and $u$ a unit vector, is $N(0, 1)$ distributed, and $P|N(0, 1)| = \sqrt{2/\pi}$.
Exact calculation of total variation distances in closed form can be difficult,
if not impossible. Bounds and approximations often suffice.
<4> Let $P$ denote the Bin$(n, p)$ distribution and let $Q$ denote the Poisson$(np)$ distribution. Then
$$\|P - Q\|_1 = 2\sum_{x=0}^{n}\bigl(P\{x\} - Q\{x\}\bigr)^+ = 2\,P\bigl(1 - g(x)\bigr)^+, \qquad\text{where } g(x) := Q\{x\}/P\{x\}.$$
It suffices to show that $g(x) \ge 1 - (x/n)$, for then the last expected value is bounded by $2P(x/n) = 2p$.

The lower bound for $g(x)$ is trivial when $x = n$. For other values of $x$ note that
$$\log\frac{g(x)}{1 - (x/n)} = -np - (n - x)\log(1 - p) - \sum_{i=1}^{x}\log\Bigl(1 - \frac{i}{n}\Bigr).$$
Bound the sum of logarithmic terms from below, using the following inequality from Prerequisites [5]: Let $h$ be a (necessarily convex) function defined on $[0, 1]$ with increasing derivative $h'$. For each integer $x$ with $1 \le x \le n - 1$,
<5>
$$h\Bigl(\frac{x}{n}\Bigr) - h(0) \le \frac1n\sum_{i=1}^{x} h'\Bigl(\frac{i}{n}\Bigr)$$
<6>
$$\frac1n\sum_{i=1}^{x} h'\Bigl(\frac{i}{n}\Bigr) \le h\Bigl(\frac{x + 1}{n}\Bigr) - h\Bigl(\frac1n\Bigr).$$
Specialize to the function
$$h(t) = t + (1 - t)\log(1 - t) = \sum_{k=2}^{\infty}\frac{t^k}{k(k - 1)} \qquad\text{for } 0 \le t \le 1,$$
for which $h'(t) = -\log(1 - t)$.
From <5>,
$$\sum_{i=1}^{x}\log\Bigl(1 - \frac{i}{n}\Bigr) = -\sum_{i=1}^{x} h'\Bigl(\frac{i}{n}\Bigr) \le -n\,h\Bigl(\frac{x}{n}\Bigr).$$
The inequality is also valid when $x = 0$, for then both sides equal zero. Thus, for $x = 0, 1, \ldots, n - 1$,
$$\log\frac{g(x)}{1 - (x/n)} \ge -np - (n - x)\log(1 - p) + n\,h\Bigl(\frac{x}{n}\Bigr).$$
As a convex function of a continuous variable $x$ in $[0, n]$, the lower bound achieves its minimum when $h'(x/n) = -\log(1 - p)$, that is, when $x = np$. The lower bound is therefore everywhere at least
$$-np - n(1 - p)\log(1 - p) + n h(p) = -n h(p) + n h(p) = 0.$$
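The following sketch (mine, assuming scipy's standard binomial and Poisson mass functions) checks the bound $\|P - Q\|_1 \le 2p$ from <4> for a few parameter choices.

```python
# Sketch (not from the text): check of the bound ||Bin(n,p) - Poisson(np)||_1 <= 2p from <4>.
import numpy as np
from scipy import stats

def tv_bin_poisson(n, p):
    x = np.arange(0, n + 200)                  # covers all but a negligible Poisson tail
    return np.abs(stats.binom.pmf(x, n, p) - stats.poisson.pmf(x, n * p)).sum()

for n, p in [(10, 0.05), (100, 0.05), (100, 0.2), (1000, 0.01)]:
    print(n, p, tv_bin_poisson(n, p), "<=", 2 * p)
```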
3. Affinity

Let $P$ and $Q$ be probability measures with densities $p$ and $q$ with respect to a dominating measure $\lambda$. The affinity between $P$ and $Q$ is defined as $\alpha_1(P, Q) := \lambda(p \wedge q)$. For all $f$ and $g$ in $\mathcal{M}^+$ with $f + g \ge 1$,
$$Pf + Qg = \lambda(pf + qg) \ge \lambda\bigl((f + g)(p \wedge q)\bigr) \ge \lambda(p \wedge q) = \alpha_1(P, Q),$$
with equality when $f = \{p \le q\}$ and $g = \{p > q\}$, which shows that the definition of the affinity does not depend on the particular choice of dominating measure. The form of the minimizing $f$ and $g$ also shows that the affinity equals the minimum of $PA + QA^c$ over all $A$ in $\mathcal{A}$. The minimizing set has the statistical interpretation of the rejection region for the (nonrandomized) test between the two hypotheses $P$ and $Q$ that minimizes the sum of the type one and type two errors.

The affinity is also closely related to the total variation distance: integrating the equality $p \wedge q = \tfrac12\bigl(p + q - |p - q|\bigr)$ we get
$$\alpha_1(P, Q) = \tfrac12\lambda\bigl(p + q - |p - q|\bigr) = 1 - \tfrac12\|P - Q\|_1.$$
That is,
<7>
$$\lambda\bigl(f(p \wedge q)\bigr) = \lambda\bigl(fq\{p \ge q\} + fp\{p < q\}\bigr) = Q\bigl(f\{p \ge q\}\bigr) + P\bigl(f\{p < q\}\bigr) \qquad\text{for all } f \in \mathcal{M}^+.$$
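A small numerical sketch (not from the text) of these identities for discrete distributions: $\sum_x \min(P\{x\}, Q\{x\})$ agrees with the minimum of $PA + QA^c$ over all rejection regions $A$ and with $1 - \tfrac12\|P - Q\|_1$.

```python
# Sketch (not from the text): affinity = min over rejection regions of (type I + type II
# error) = 1 - 0.5 * ||P - Q||_1, for discrete P and Q.
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(5))
Q = rng.dirichlet(np.ones(5))

affinity = np.minimum(P, Q).sum()
# brute force over all 2^5 rejection regions A: P(A) + Q(A^c)
best = min(P[list(A)].sum() + Q[[not a for a in A]].sum()
           for A in product([True, False], repeat=5))
print(affinity, best, 1 - 0.5 * np.abs(P - Q).sum())   # all three agree
```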
4. Hellinger distance
Let $P$ and $Q$ be probability measures with densities $p$ and $q$ with respect to a dominating measure $\lambda$. The square roots of the densities, $\sqrt p$ and $\sqrt q$, are both square integrable; they both belong to $L^2(\lambda)$. The Hellinger distance between the two measures is defined as the $L^2$ distance between the square roots of their densities,
<8>
$$H(P, Q)^2 = \lambda\bigl(\sqrt p - \sqrt q\bigr)^2 = \lambda\bigl(p + q - 2\sqrt{pq}\bigr) = 2 - 2\lambda\sqrt{pq}.$$
It is easy to show that the integral defining the Hellinger distance does not depend on the choice of dominating measure. Because
$$\bigl(\sqrt p - \sqrt q\bigr)^2 \le \bigl|\sqrt p - \sqrt q\bigr|\bigl(\sqrt p + \sqrt q\bigr) = |p - q|$$
and, by the Cauchy–Schwarz inequality, $\lambda|p - q| = \lambda\bigl|\sqrt p - \sqrt q\bigr|\bigl(\sqrt p + \sqrt q\bigr) \le H(P, Q)\bigl(\lambda(\sqrt p + \sqrt q)^2\bigr)^{1/2} \le 2 H(P, Q)$, by <8>,
<9>
$$H^2(P, Q) \le \|P - Q\|_1 \le 2 H(P, Q).$$
Both the Hellinger distance and the total variation norm define bounded metrics on the space of all probability measures on $\mathcal{A}$. From <9>, they define the same topology for convergence of probability measures.
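A short numerical check (mine) of the inequalities in <9> for a random pair of discrete distributions.

```python
# Sketch (not from the text): check of <9>, H^2(P,Q) <= ||P - Q||_1 <= 2 H(P,Q).
import numpy as np

rng = np.random.default_rng(3)
P = rng.dirichlet(np.ones(8))
Q = rng.dirichlet(np.ones(8))

H2 = ((np.sqrt(P) - np.sqrt(Q)) ** 2).sum()
L1 = np.abs(P - Q).sum()
print(H2, "<=", L1, "<=", 2 * np.sqrt(H2))
```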
5. Relative entropy

Let $P$ and $Q$ be two probability measures with densities $p$ and $q$ with respect to some dominating measure $\lambda$. The relative entropy (also known as the Kullback–Leibler distance) between $P$ and $Q$ is defined as $D(P\|Q) := \lambda\bigl(p\log(p/q)\bigr)$. By Taylor's theorem,
<10>
$$x\log x = x - 1 + \tfrac12 (x - 1)^2/x^* \qquad\text{for some } x^* \text{ between } 1 \text{ and } x.$$
When $p > 0$ and $q > 0$, put $x = p/q$, discard the nonnegative quadratic remainder term, then multiply through by $q$ to get
$$p\log(p/q) \ge p - q.$$
The same inequality also holds at points where $q = 0$ and $p > 0$, with the left-hand side interpreted as $+\infty$; and at points where $p = 0$ we get no contribution to the defining integral. It follows that $\lambda\bigl(p\log(p/q)\bigr)^- < \infty$ and $D(P\|Q) \ge \lambda(p - q) = 0$. That is, the relative entropy is well defined and nonnegative. These conclusions would also follow, perhaps more directly, via Jensen's inequality. I prefer to argue via the Taylor expansion because, as you will soon see, with refinements on the remainder term we get better lower bounds for the relative entropy.
If $\lambda\{p > 0 = q\} > 0$ then $p\log(p/q)$ is infinite on a set of positive $\lambda$-measure, which forces $D(P\|Q) = \infty$. That is, the relative entropy is infinite unless $P$ is absolutely continuous with respect to $Q$. It can also be infinite even if $P$ and $Q$ are mutually absolutely continuous (Problem [4]).
As with the total variation and Hellinger distances, the relative entropy
does not depend on the choice of the dominating measure (Problem [3]).
It is easy to deduce from the conditions for equality in Jensen's inequality that $D(P\|Q) = 0$ if and only if $P = Q$. An even stronger assertion follows
from an inequality relating relative entropy to Hellinger distance.
<11> Lemma. For all probability measures $P$ and $Q$,
$$H^2(P, Q) \le D(P\|Q).$$
In a similar vein, there is a lower bound for the relative entropy involving the $L^1$-distance, due to Csiszar (1967), Kullback (1967), and Kemperman (1969).
<12> Lemma. For all probability measures $P$ and $Q$,
$$\|P - Q\|_1^2 \le 2\,D(P\|Q).$$
Proof. For $x \ge -1$,
$$(1 + x)\log(1 + x) - x \ge \tfrac12\,\frac{x^2}{1 + x/3}.$$
To establish the inequality asserted by the Lemma, we may assume, with no loss of generality (compare with the proof of Lemma <11>), that $P$ is absolutely continuous with respect to $Q$. This time write $1 + \Delta$ for the density $dP/dQ$. Notice that $Q\Delta = 0$ and thus
$$D(P\|Q) = Q\bigl((1 + \Delta)\log(1 + \Delta) - \Delta\bigr) \ge \tfrac12\, Q\Bigl(\frac{\Delta^2}{1 + \Delta/3}\Bigr).$$
By the Cauchy–Schwarz inequality, and because $Q(1 + \Delta/3) = 1$,
$$\|P - Q\|_1^2 = \bigl(Q|\Delta|\bigr)^2 \le Q\Bigl(\frac{\Delta^2}{1 + \Delta/3}\Bigr)\,Q\bigl(1 + \Delta/3\bigr) \le 2\,D(P\|Q).$$
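A numerical sketch (not from the text, and assuming the Lemma statements as reconstructed above) checking $H^2(P,Q) \le D(P\|Q)$ and $\|P - Q\|_1^2 \le 2D(P\|Q)$ for discrete distributions with full support.

```python
# Sketch (not from the text): check of H^2(P,Q) <= D(P||Q) and ||P - Q||_1^2 <= 2 D(P||Q).
import numpy as np

rng = np.random.default_rng(4)
P = rng.dirichlet(np.ones(7))
Q = rng.dirichlet(np.ones(7))

D = (P * np.log(P / Q)).sum()
H2 = ((np.sqrt(P) - np.sqrt(Q)) ** 2).sum()
L1 = np.abs(P - Q).sum()
print(H2, "<=", D, "and", L1 ** 2, "<=", 2 * D)
```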
<13> Exercise. Let $P$ be the Bin$(n, p)$ distribution and $Q$ the Poisson$(np)$ distribution, as in <4>. Show that
$$D(P\|Q) \le \tfrac12 p^2 - p - \log(1 - p) = p^2 + \tfrac13 p^3 + \cdots.$$
Solution: With $g(x) = Q\{x\}/P\{x\}$ as in <4>,
$$D(P\|Q) = -P\log g(x) = P\Bigl(np + (n - x)\log(1 - p) + \sum_{i=1}^{x-1}\log\bigl(1 - \tfrac{i}{n}\bigr)\Bigr) = n h(p) - P\sum_{i=1}^{x-1} h'(i/n).$$
The sum inside the last expectation should be interpreted as zero if $x$ equals 0 or 1. The lower bound <5>, that is,
<14>
$$\frac1n\sum_{i=1}^{x-1} h'(i/n) \ge h\Bigl(\frac{x - 1}{n}\Bigr),$$
together with Jensen's inequality and the convexity of $h$, gives
$$D(P\|Q) \le n h(p) - n\,P\,h\Bigl(\frac{x - 1}{n}\Bigr) \le n\Bigl(h(p) - h\bigl(p - \tfrac1n\bigr)\Bigr).$$
By the Mean Value Theorem, the last difference equals $h'(p^*)$ for some $p^*$ between $p - n^{-1}$ and $p$. Because $h'$ is increasing, the bound is less than
$$h'(p) = -\log(1 - p) = p + \frac{p^2}{2} + \cdots.$$
The bound decreases at the rate $p$ as $p$ tends to zero, more slowly than the asserted $p^2$ rate. What went wrong?

The source of the problem is <14>. It approximates the sum of logarithmic terms too crudely. We need to break out the linear contributions to $h'$. Define
$$h_1(t) = h(t) - \frac{t^2}{2} = \sum_{k=3}^{\infty}\frac{t^k}{k(k - 1)},$$
so that $h_1'(t) = -\log(1 - t) - t$. The linear terms $\sum_{i=1}^{x-1} i/n = x(x-1)/(2n)$ have expectation $(n-1)p^2/2$ under $P$, while $n h(p) = n p^2/2 + n h_1(p)$; repeating the argument with $h_1$ in place of $h$ then gives
$$D(P\|Q) \le \frac{p^2}{2} + n\Bigl(h_1(p) - h_1\bigl(p - \tfrac1n\bigr)\Bigr) \le \frac{p^2}{2} + h_1'(p) = \frac{p^2}{2} - \log(1 - p) - p = p^2 + \frac{p^3}{3} + \cdots,$$
as asserted.
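A quick check (mine, relying on the bound just derived) that $D(\mathrm{Bin}(n,p)\,\|\,\mathrm{Poisson}(np))$ stays below $\tfrac12 p^2 - p - \log(1-p)$ for several parameter choices.

```python
# Sketch (not from the text): check that D(Bin(n,p) || Poisson(np)) stays below
# p^2/2 - p - log(1-p) = p^2 + p^3/3 + ...
import numpy as np
from scipy import stats

def kl_bin_poisson(n, p):
    x = np.arange(0, n + 1)
    b = stats.binom.pmf(x, n, p)
    q = stats.poisson.pmf(x, n * p)
    return (b * np.log(b / q)).sum()

for n, p in [(20, 0.1), (100, 0.1), (1000, 0.05)]:
    print(n, p, kl_bin_poisson(n, p), "<=", p ** 2 / 2 - p - np.log(1 - p))
```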
From the bound on $D(P\|Q)$ in the last exercise and from Lemma <11> we also get
$$H^2(P, Q) \le \psi(p) := \tfrac12 p^2 - p - \log(1 - p) = p^2 + \tfrac13 p^3 + \cdots.$$
As $p$ increases, the upper bound $\psi(p)$ eventually exceeds 2, the largest possible value for a squared Hellinger distance. In fact $\psi(p_0) = 2$ for $p_0 \approx 0.918$. Using the fact that $\psi(p)/p^2$ is also increasing in $p$, we can replace the upper bound by $p^2\psi(p_0)/p_0^2$, leaving
$$H^2(P, Q) \le \frac{\psi(p_0)}{p_0^2}\,p^2 = \frac{2 p^2}{p_0^2} \qquad\text{for all } 0 \le p < 1.$$
6. Product measures

<15> Lemma. Let $P = P_1\otimes\cdots\otimes P_n$ and $Q = Q_1\otimes\cdots\otimes Q_n$ be product measures. Then
$$H(P, Q)^2 = 2\Bigl(1 - \prod_{i\le n}\bigl(1 - \tfrac12 H(P_i, Q_i)^2\bigr)\Bigr) \le \sum_{i=1}^{n} H(P_i, Q_i)^2.$$
Proof. The product term comes from the factorization of the affinity between the two product measures. To establish the upper bound, write $y_i$ for $H(P_i, Q_i)^2/2$. We need to show that the function
<16>
$$G_n(y_1, \ldots, y_n) = \sum_{i=1}^{n} y_i + \prod_{i=1}^{n}(1 - y_i) - 1$$
is nonnegative on $[0, 1]^n$.
<17> Lemma. For product measures $P = P_1\otimes\cdots\otimes P_n$ and $Q = Q_1\otimes\cdots\otimes Q_n$,
$$D(P\|Q) = \sum_{i=1}^{n} D(P_i\|Q_i).$$
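A numerical sketch (not from the text) of Lemma <15>: the affinity of a product measure factorizes, so the Hellinger distance between products can be computed coordinate-wise and compared with the sum of the coordinate-wise squared distances. The additivity in Lemma <17> could be checked the same way.

```python
# Sketch (not from the text): H^2 between product measures, via the affinity
# factorization, is at most the sum of coordinate-wise squared Hellinger distances.
import numpy as np

rng = np.random.default_rng(5)
ps = [rng.dirichlet(np.ones(3)) for _ in range(4)]
qs = [rng.dirichlet(np.ones(3)) for _ in range(4)]

h2 = [((np.sqrt(p) - np.sqrt(q)) ** 2).sum() for p, q in zip(ps, qs)]
prod_affinity = np.prod([np.sqrt(p * q).sum() for p, q in zip(ps, qs)])
H2_product = 2 * (1 - prod_affinity)    # squared Hellinger distance between the products
print(H2_product, "<=", sum(h2))
```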
7.

Let $P$ and $Q$ have densities $p$ and $q$ with respect to a dominating probability measure $\lambda$. From the Cauchy–Schwarz inequality,
<18>
$$\|P - Q\|_1^2 = \bigl(\lambda|p - q|\bigr)^2 \le \lambda(p - q)^2.$$
Notice that the right-hand side depends on the choice of $\lambda$, whereas the left-hand side does not. Often it will be convenient to choose $\lambda = P$ or $\lambda = (P + Q)/2$.
The second-moment upper bound is often of the correct order of magnitude for a well chosen $\lambda$. For example, suppose $dQ/dP = 1 + \delta$, with $\delta$ small enough to justify integration of the expansion
$$(1 + \delta)^{1/2} = 1 + \tfrac12\delta - \tfrac18\delta^2 + \cdots$$
to give $P(1 + \delta)^{1/2} \approx 1 - \tfrac18 P\delta^2$. Then
$$H^2(P, Q) = 2 - 2P(1 + \delta)^{1/2} \approx \tfrac14 P\delta^2,$$
the same order in $\delta$ as the bound $P\delta^2$ given by <18> with $\lambda = P$.
Because $(\sqrt p - \sqrt q)^2 = |p - q|^2/(\sqrt p + \sqrt q)^2$, if $\sqrt p + \sqrt q \ge c$ everywhere then
$$H^2(P, Q) \le \frac{\lambda|p - q|^2}{c^2}.$$
With $\lambda = (P + Q)/2$ we have $p + q = 2$ everywhere, so that $2 \le (\sqrt p + \sqrt q)^2 \le 4$ and
<19>
$$\frac{\lambda|p - q|^2}{4} \le H^2(P, Q) \le \frac{\lambda|p - q|^2}{2} \qquad\text{if } \lambda = \frac{P + Q}{2}.$$
See Problem [5] for a comparison between $P$ and $(Q + P)/2$ as dominating measures.
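A short check (mine) of <18> and <19> with the average $(P + Q)/2$ as dominating measure, for discrete $P$ and $Q$.

```python
# Sketch (not from the text): check of <18> and <19> with lambda = (P + Q)/2.
import numpy as np

rng = np.random.default_rng(6)
P = rng.dirichlet(np.ones(9))
Q = rng.dirichlet(np.ones(9))

lam = (P + Q) / 2
p, q = P / lam, Q / lam                        # densities with respect to lambda
L1 = np.abs(P - Q).sum()
second_moment = (lam * (p - q) ** 2).sum()
H2 = ((np.sqrt(P) - np.sqrt(Q)) ** 2).sum()
print(L1 ** 2, "<=", second_moment)                              # <18>
print(second_moment / 4, "<=", H2, "<=", second_moment / 2)      # <19>
```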
<20> Example. As shown in Example <3>, and the explanation that follows that Example, the total variation distance between the $N(\theta, 1)$ and the $N(0, 1)$ distributions decreases like $\sqrt{2/\pi}\,|\theta|$ as $\theta \to 0$. More precisely,
$$\phi(x - \theta) = \phi(x) + \theta x\phi(x) + \tfrac12\theta^2(x^2 - 1)\phi(x) + \cdots$$
so that
$$\int\bigl|\phi(x - \theta) - \phi(x)\bigr|\,dx = |\theta|\int|x|\phi(x)\,dx + O(\theta^2).$$
For the symmetric mixture $\tfrac12 N(\theta, 1) + \tfrac12 N(-\theta, 1)$ the linear terms cancel, so that
$$\int\Bigl|\tfrac12\phi(x - \theta) + \tfrac12\phi(x + \theta) - \phi(x)\Bigr|\,dx = \tfrac12\theta^2\int|x^2 - 1|\phi(x)\,dx + O(\theta^4).$$
Integration by parts gives $\tfrac12\int|x^2 - 1|\phi(x)\,dx = 2\phi(1) \approx 0.48$.
It is not too difficult to make these calculations rigorous. The second moment bound gives the same rate of convergence even more easily. Write $P_\theta$ for the mixture $\tfrac12 N(\theta, 1) + \tfrac12 N(-\theta, 1)$ and $P_0$ for $N(0, 1)$. Then
$$\|P_\theta - P_0\|_1^2 \le P_0\Bigl(\frac{dP_\theta}{dP_0} - 1\Bigr)^2 = P_0\Bigl(\frac{dP_\theta}{dP_0}\Bigr)^2 - 1
= \tfrac14 P_0\Bigl(\exp\bigl(\theta x - \tfrac12\theta^2\bigr) + \exp\bigl(-\theta x - \tfrac12\theta^2\bigr)\Bigr)^2 - 1
= \tfrac12\bigl(\exp(\theta^2) + \exp(-\theta^2)\bigr) - 1
= \frac{\theta^4}{2!} + \frac{\theta^8}{4!} + \cdots.$$
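A numerical sketch (not from the text) comparing the exact total variation distance for the mixture of Example <20> with the $2\phi(1)\theta^2 \approx 0.48\,\theta^2$ approximation and with the second-moment bound $\sqrt{\cosh(\theta^2) - 1}$ derived above.

```python
# Sketch (not from the text): exact TV distance for the mixture of Example <20>,
# compared with 2*phi(1)*theta^2 and with the bound sqrt(cosh(theta^2) - 1).
import numpy as np
from scipy import stats, integrate

def tv_mixture(theta):
    f = lambda x: abs(0.5 * stats.norm.pdf(x - theta) + 0.5 * stats.norm.pdf(x + theta)
                      - stats.norm.pdf(x))
    val, _ = integrate.quad(f, -20, 20, limit=200)
    return val

for theta in [0.5, 0.2, 0.1]:
    print(theta, tv_mixture(theta),
          2 * stats.norm.pdf(1) * theta ** 2, np.sqrt(np.cosh(theta ** 2) - 1))
```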
<21> Exercise. Suppose $P = \otimes_{i\le n} P_i$ is a product probability measure and $Q$ is another product measure with $dQ/dP = \prod_{i\le n}\bigl(1 + \delta_i(x_i)\bigr)$, where $P\delta_i = 0$ for each $i$. Bound $\|Q - P\|_1^2$ by the second moment
$$P\Bigl(\frac{dQ}{dP} - 1\Bigr)^2 = P\Bigl(\sum_i \delta_i + \sum_{\{i,j\}}\delta_i\delta_j + \sum_{\{i,j,k\}}\delta_i\delta_j\delta_k + \cdots\Bigr)^2.$$
The notation $\sum_{\{i,j\}}$ indicates that the sum runs over all pairs of distinct integers $i, j$ with $1 \le i \le n$ and $1 \le j \le n$; the notation $\sum_{\{i,j,k\}}$ refers to summation over triples of integers, all different; and so on.
When the squared term is expanded, any unpaired $\delta_i$ factor is annihilated by the corresponding $P_i$ marginal of $P$. Only the squared terms survive, reducing the expansion to
$$\sum_i P\delta_i^2 + \sum_{\{i,j\}} P\delta_i^2\delta_j^2 + \cdots.$$
The upper bound in the Exercise decreases like $\sum_i P\delta_i^2$ when the sum is small. In situations where $P\delta_i^2$ behaves like $H^2(P_i, Q_i)$, the second moment bound is comparable to the analogous bound for Hellinger distance:
$$H^2(P, Q) \le \sum_{i=1}^{n} H(P_i, Q_i)^2.$$
The second moment method also works for situations where it becomes exceedingly difficult to calculate Hellinger distances directly. The calculation of distances between mixtures of product measures, which will be the key to finding minimax rates of convergence in Chapter 14, is a good illustration.

Suppose $P$ is a product probability measure on $\mathcal{X}^n$. For each $\alpha$ in a finite set $A$, suppose the probability $Q_\alpha$ is also a product measure, obtained by a small perturbation of $P$,
$$\frac{dQ_\alpha}{dP} = \prod_{i\le n}\bigl(1 + \delta_{\alpha,i}(x_i)\bigr),$$
where $P\delta_{\alpha,i}(x_i) = 0$ for each $\alpha$ and $i$, to give $Q_\alpha$ total mass one. Let $\{w_\alpha : \alpha \in A\}$ be a finite set of nonnegative weights that sum to one. To a Bayesian, the weights $w_\alpha$ would define a prior distribution on $A$. Let $Q_0$ be another product probability measure, with
$$\frac{dQ_0}{dP} = \prod_{i\le n}\bigl(1 + \delta_{0,i}(x_i)\bigr).$$
<22> Exercise. Write $\psi_\alpha$ for the density $dQ_\alpha/dP$ and $\psi_0$ for $dQ_0/dP$, and define $\gamma(\alpha, \beta)$ by
$$1 + \gamma(\alpha, \beta) = P\bigl(\psi_\alpha\psi_\beta\bigr) = \prod_{i\le n}\bigl(1 + P\delta_{\alpha,i}\delta_{\beta,i}\bigr).$$
Show that the mixture $\bar Q = \sum_{\alpha\in A} w_\alpha Q_\alpha$ satisfies
$$\|\bar Q - Q_0\|_1^2 \le \sum_{\alpha\in A}\sum_{\beta\in A} w_\alpha w_\beta\,\gamma(\alpha, \beta) - 2\sum_{\alpha\in A} w_\alpha\,\gamma(\alpha, 0) + \gamma(0, 0).$$
The measure $\bar Q - Q_0$ has density $\sum_{\alpha\in A} w_\alpha\bigl(\psi_\alpha - \psi_0\bigr)$ with respect to $P$. The quadratic bound for the total variation distance equals
$$P\Bigl(\sum_{\alpha\in A} w_\alpha\bigl(\psi_\alpha - \psi_0\bigr)\Bigr)^2 = \sum_{\alpha,\beta} w_\alpha w_\beta\, P\bigl[\psi_\alpha\psi_\beta - \psi_0\psi_\beta - \psi_\alpha\psi_0 + \psi_0^2\bigr].$$
Consider the contribution to the expectation from the first term inside the square brackets. It expands to the product
$$P\Bigl(1 + \sum_i \delta_{\alpha,i} + \sum_{\{i,j\}}\delta_{\alpha,i}\delta_{\alpha,j} + \sum_{\{i,j,k\}}\delta_{\alpha,i}\delta_{\alpha,j}\delta_{\alpha,k} + \cdots\Bigr)\Bigl(1 + \sum_i \delta_{\beta,i} + \sum_{\{i,j\}}\delta_{\beta,i}\delta_{\beta,j} + \sum_{\{i,j,k\}}\delta_{\beta,i}\delta_{\beta,j}\delta_{\beta,k} + \cdots\Bigr).$$
Unpaired factors are again annihilated by the expectation; only terms in which each $\delta_{\alpha,i}$ is matched with the corresponding $\delta_{\beta,i}$ survive, which leaves
$$1 + \sum_i P\delta_{\alpha,i}\delta_{\beta,i} + \sum_{\{i,j\}} P\delta_{\alpha,i}\delta_{\beta,i}\,P\delta_{\alpha,j}\delta_{\beta,j} + \cdots = \prod_{i\le n}\bigl(1 + P\delta_{\alpha,i}\delta_{\beta,i}\bigr) = 1 + \gamma(\alpha, \beta),$$
with each $\bigl|P\delta_{\alpha,i}\delta_{\beta,i}\bigr|$ at most $\bigl(P\delta_{\alpha,i}^2\bigr)^{1/2}\bigl(P\delta_{\beta,i}^2\bigr)^{1/2}$, by the Cauchy–Schwarz inequality.
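The factorization that drives this expansion, $P\bigl(\prod_i(1+\delta_{\alpha,i})\prod_i(1+\delta_{\beta,i})\bigr) = \prod_i\bigl(1 + P\delta_{\alpha,i}\delta_{\beta,i}\bigr)$ for a product measure $P$ and mean-zero perturbations, can be checked by simulation; the perturbations in the sketch below are arbitrary illustrative choices, not from the text.

```python
# Sketch (not from the text): Monte Carlo check of
# P[prod_i(1 + d_a_i) * prod_i(1 + d_b_i)] = prod_i(1 + P(d_a_i * d_b_i))
# for a product measure P and mean-zero perturbations.
import numpy as np

rng = np.random.default_rng(7)
n, reps = 3, 400_000
x = rng.uniform(size=(reps, n))                 # P = product of three Uniform(0,1) laws

d_a = lambda t: t - 0.5                         # P d_a = 0
d_b = lambda t: 0.3 * np.sin(2 * np.pi * t)     # P d_b = 0

lhs = np.mean(np.prod(1 + d_a(x), axis=1) * np.prod(1 + d_b(x), axis=1))
rhs = np.prod([1 + np.mean(d_a(x[:, i]) * d_b(x[:, i])) for i in range(n)])
print(lhs, rhs)                                 # agree up to Monte Carlo error
```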
<23> Example. [...] of magnitude as $H^2(P, Q)$, by Problem [5]) and $H(P, Q)$ is of order $O(1/\sqrt{n})$, [...]
8. Problems
[1]

[2]
[3]
Adapt the argument from the previous Problem to show that the relative entropy $D(P\|Q)$ does not depend on the choice of dominating measure.
[4]
Let $P$ be the standard Cauchy distribution on the real line, and let $Q$ be the standard normal distribution. Show that $D(P\|Q) = \infty$, even though $P$ and $Q$ are mutually absolutely continuous.
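A small numerical sketch (mine) of Problem [4]: the truncated integral of $p\log(p/q)$ grows roughly linearly in the truncation point, so the relative entropy is infinite.

```python
# Sketch (not from the text): the truncated integral of p*log(p/q) over [-T, T],
# with P standard Cauchy and Q standard normal, grows roughly linearly in T,
# so D(P||Q) = infinity.
import numpy as np
from scipy import stats

def truncated_kl(T, grid=200_001):
    x = np.linspace(-T, T, grid)
    f = stats.cauchy.pdf(x) * (stats.cauchy.logpdf(x) - stats.norm.logpdf(x))
    return np.trapz(f, x)

for T in [10, 100, 1_000, 10_000]:
    print(T, truncated_kl(T))
```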
[5]

[6]
For $P$ and $Q$ as in Example <23>, bound $\|\bar Q - P^{\otimes n}\|_1^2$ using the second moment method for densities with respect to $\lambda = (P + Q)/2$ (that is, with $\lambda^{\otimes n}$ playing the role of $P$ in Exercise <22>), by following these steps. Write $q = dQ/d\lambda = 1 + \delta$, so that $p = dP/d\lambda = 1 - \delta$. Write $\bar\Delta$ for $\lambda\delta^2$. Note that, by <19>, $\bar\Delta \le H^2(P, Q)$.

(i) Show that $\delta_{0,i} = -\delta$ for all $i$, and that $\delta_{\alpha,i} = \delta$ if $i = \alpha$, and $-\delta$ otherwise.

(ii) Deduce that $\lambda(\delta_{0,i}\delta_{0,i}) = \bar\Delta$ for all $i$; that $\lambda(\delta_{\alpha,i}\delta_{\beta,i}) = -\bar\Delta$ if $\alpha = i \ne \beta$ or $\beta = i \ne \alpha$, and $\bar\Delta$ otherwise; and that $\lambda(\delta_{\alpha,i}\delta_{0,i}) = -\bar\Delta$ if $\alpha = i$, and $\bar\Delta$ otherwise.

(iii) Deduce that $1 + \gamma(0, 0) = (1 + \bar\Delta)^n$; that $1 + \gamma(\alpha, 0) = (1 + \bar\Delta)^{n-1}(1 - \bar\Delta)$; and that
$$1 + \gamma(\alpha, \beta) = \begin{cases} (1 + \bar\Delta)^{n-2}(1 - \bar\Delta)^2 & \text{if } \alpha \ne \beta\\ (1 + \bar\Delta)^{n} & \text{if } \alpha = \beta.\end{cases}$$

(iv) Conclude that, with uniform weights $w_\alpha = 1/n$, the quadratic bound from <22> equals
$$\lambda^n(\bar q - p)^2 = 4\bar\Delta\Bigl(\frac1n + \bar\Delta\Bigr)(1 + \bar\Delta)^{n-2},$$
where $\bar q$ denotes the density of the mixture $\bar Q$ with respect to the product measure $\lambda^{\otimes n}$.
9. Notes
My definition of total variation follows Dunford & Schwartz (1958,
Section III.1).
I adapted the results on the total variation and relative entropy distances
between Binomial and Poisson distributions from Reiss (1993, p 25). He credited Barbour & Hall (1984) with the first result, and Falk & Reiss (1992) with the second result.
Barbour, Holst & Janson (1992) have devoted a whole book to the topic
of Poisson approximation.
References