
Chapter 3

Distances and affinities between measures
At first reading you might find this Chapter a mere catalog of strange inequalities involving several strange-looking distances between probability measures: the Hellinger distance H(P, Q); the total variation, or L^1, distance ‖P − Q‖_1; and the relative entropy D(P‖Q). In later chapters you will discover that these intrinsic measures of distance play a key role in the study of general statistical models. In particular, the following inequalities will turn out to be most important.

In Section 4:

    ‖P − Q‖_1 ≤ 2 H(P, Q) ≤ 2 ‖P − Q‖_1^{1/2}

In Section 5:

    H^2(P, Q) ≤ D(P‖Q)        and        ½ ‖P − Q‖_1^2 ≤ D(P‖Q)

In Section 6, for product measures:

    H(P_1 ⊗ ⋯ ⊗ P_n, Q_1 ⊗ ⋯ ⊗ Q_n)^2 ≤ Σ_{i≤n} H(P_i, Q_i)^2
    D(P_1 ⊗ ⋯ ⊗ P_n ‖ Q_1 ⊗ ⋯ ⊗ Q_n) = Σ_{i≤n} D(P_i ‖ Q_i)

Section 7 describes a simple method for bounding total variation distance via calculations of variances and covariances.
In my opinion, the total variation distance is the most important of the three, for reasons that will be explained in Chapter 14. I will use the other two distances mostly as a way of bounding total variation.

[why]

1.  Why bother with different distances?


When we work with a family of probability measures, {P : }, indexed
by a metric space , there would seem to be an obvious way to calculate
the distance between measures: use the metric on . For many problems of
estimation, the obvious is what we want. We ask how close (in the metric)
we can come to guessing 0 , based on an observation from P0 ; we compare
estimators based on rates of convergence, or based on expected values of loss
functions involving the distance from 0 .
When the parametrization is reasonable (whatever that means), distances
measured by the  metric are reasonable. (What else could I say?) However
it is not hard to concoct examples where the  metric is misleading.

Asymptopia: 17 October 2000

c David Pollard

Section 3.1

misleading

< 1>

Why bother with different distances?

misleading
<1>    Example. Let P_{n,θ} denote the joint distribution for n independent observations from the N(θ, 1) distribution, with θ ∈ R. Under P_{n,θ_0} the sample average, X̄_n, converges to θ_0 at an n^{−1/2} rate. The parametrization is reasonable.
What happens if we reparametrize, replacing the N(θ, 1) by a N(θ^3, 1)? We are still fitting the same model; same probability measures, only the labelling has changed. The maximum likelihood estimator, X̄_n^{1/3}, still converges at an n^{−1/2} rate if θ_0 ≠ 0, but for θ_0 = 0 we get an n^{−1/6} rate, as an artifact of the reparametrization.
More imaginative reparametrizations can produce even stranger behaviour for the maximum likelihood estimator. For example, define the one-to-one reparametrization

    ψ(θ) = θ        if θ is rational
    ψ(θ) = θ + 1    if θ is irrational

Now let P_{n,θ} denote the joint distribution for n independent observations from the N(ψ(θ), 1) distribution. If θ_0 is rational, the maximum likelihood estimator, ψ^{−1}(X̄_n), gets very confused: it concentrates around θ_0 − 1 as n gets larger.

You would be right to scoff at the second reparametrization in the Example, yet it does make the point that distances measured in the Θ metric, for some parametrization picked out of the air, might not be particularly informative about the behaviour of estimators. Less ridiculous examples arise routinely in nonparametric problems, that is, in problems where infinite dimensional parameters enter, making the choice of metric less obvious.
Fortunately, there are intrinsic ways to measure distances between probability measures, distances that don't depend on the parametrization. The rest of this Chapter will set forth a few of the basic definitions and facts. The total variation distance has properties that will be familiar to students of the Neyman-Pearson approach to hypothesis testing. The Hellinger distance is closely related to the total variation distance (for example, both distances define the same topology on the space of probability measures), but it has several technical advantages derived from properties of inner products. (Hilbert spaces have nicer properties than general Banach spaces.) For example, Hellinger distances are very well suited for the study of product measures (Section 6). Also, Hellinger distance is closely related to the concept called Hellinger differentiability (Chapter 4), an elegant alternative to the traditional assumptions of pointwise differentiability in some asymptotic problems. Relative entropy, which is also known as Kullback-Leibler distance, emerges naturally from the study of maximum likelihood estimation. The relative entropy is not a metric, but it is closely related to the other two distances, and it too is well suited for use with product measures.
The intrinsic measures of distance are the key to understanding minimax rates of convergence, as you will learn in Part III.
For reasonable parametrizations, in classical finite-dimensional settings, the intrinsic measures usually tell the same story as the Θ metric, as explained in Chapter 4.

[tot.var]

2.  Total variation

Let μ be a finite signed measure on a measure space (X, A). Remember that M denotes the set of all real, A-measurable functions on X, and that M^+ is the cone of nonnegative functions in M.


The total variation norm of μ is defined as

    v(μ) = sup Σ_i |μ A_i|,

where the supremum runs over all finite partitions {A_i} of X into disjoint, A-measurable sets. In fact, there is no need to consider partitions into more than two sets: for if A = ∪{A_i : μ A_i ≥ 0} then

    Σ_i |μ A_i| ≤ μA − μA^c = |μA| + |μA^c|.

If both μ_1 and μ_2 are absolutely continuous with respect to a measure λ, such as λ = μ_1 + μ_2, with densities m_1 and m_2, then μ = μ_1 − μ_2 has density m = m_1 − m_2 and v(μ) = λ|m|:

    Σ_i |μ A_i| = Σ_i |λ(m A_i)| ≤ Σ_i λ(|m| A_i) = λ|m|,

with equality when A_1 = {m ≥ 0} = A_2^c. That is, v(μ) = μ{m ≥ 0} − μ{m < 0}.
The total variation is often called the L^1 norm of μ, and is also denoted by ‖μ‖_1. Note that v(μ), and hence ‖μ‖_1, does not depend on the choice of the dominating measure λ.
The total variation v(μ) is also equal to sup_{|f|≤1} |μ f|, the supremum running over all A-measurable functions f bounded in absolute value by 1:

    |μ f| = |λ(m f)| ≤ λ|m|    if |f| ≤ 1,

with equality when f = {m ≥ 0} − {m < 0} (sets being identified with their indicator functions).


When (X) = 0, there are some slight simplifications in the formulae
for v(). In that case, 0 = m = m + m and hence
v() = 1 = 2m + 2m = 2{m 0} = 2 sup A
A

RA

As a special case, for probability measures P1 and P2 , with densities p1 and p2


with respect to ,
v() = 1 = 2( p1 p2 )+ = 2( p2 p1 )+
= 2 sup (P1 A P2 A) = 2 sup (P1 A P2 A)
A

tot.var.special

<2>

= 2 sup |P1 A P2 A|
A

multiv.normal

Many authors, no doubt with the special case foremost in their minds, define
the total variation as sup A |P1 A P2 A|. Beware! An unexpected extra factor
of 2 can cause confusion.
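To see the factor of 2 concretely, here is a small numerical sketch (mine, not part of the text; it assumes numpy is available). It compares Σ_x |P{x} − Q{x}| with 2 sup_A |PA − QA| for two distributions on a six-point set, the supremum being computed by brute force over all subsets.

    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    p = rng.dirichlet(np.ones(6))    # pmf of P
    q = rng.dirichlet(np.ones(6))    # pmf of Q

    l1 = np.sum(np.abs(p - q))       # ||P - Q||_1, with lambda = counting measure
    sup_diff = max(abs(p[list(A)].sum() - q[list(A)].sum())
                   for r in range(7) for A in itertools.combinations(range(6), r))
    print(l1, 2 * sup_diff)          # the two numbers coincide, as in <2>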

multiv.normal
<3>    Example. Let P_1 denote the N(μ_1, I_n) multivariate normal distribution and P_2 denote the N(μ_2, I_n), with μ_1 ≠ μ_2. Define δ = |μ_1 − μ_2|/2 and u = (μ_2 − μ_1)/|μ_2 − μ_1|. Note that u'(x − μ_1) has a N(0, 1) distribution under P_1, and a N(2δ, 1) distribution under P_2. The density of P_1 − P_2 with respect to Lebesgue measure is nonnegative in the halfspace A_0 = {x : u'(x − μ_1) ≤ δ}. Thus

    ‖P_1 − P_2‖_1 = 2(P_1 − P_2)A_0
        = 2 P{N(0, 1) ≤ δ} − 2 P{N(2δ, 1) ≤ δ}
        = 2 P{|N(0, 1)| ≤ δ}
        = 4δ/√(2π) + O(δ^2)    as δ → 0.

For small |μ_1 − μ_2|, the total variation distance is approximately √(2/π) |μ_1 − μ_2|.


The rate of convergence in the last Example is typical. Consider for example a family of probability measures {P_θ : θ ∈ R^k} with densities {f_θ} with respect to a measure λ. Suppose the family of densities is differentiable in L^1(λ) norm at θ. That is, suppose there is an integrable (vector-valued) function ḟ_θ for which

    λ| f_{θ+t} − f_θ − t'ḟ_θ | = o(|t|)    as t → 0.

Then, writing u for the unit vector t/|t|, we have

    ‖P_{θ+t} − P_θ‖_1 / |t| = λ| f_{θ+t} − f_θ | / |t| = λ|u'ḟ_θ| + o(1).

For the N(θ, I_n) densities, φ(x − θ), the pointwise derivative (x − θ)φ(x − θ) is also the derivative in L^1 norm, which gives

    ‖N(θ + t, I_n) − N(θ, I_n)‖_1 / |t| = o(1) + λ|u'(x − θ)φ(x − θ)|
        = o(1) + P|u'N(0, I_n)|
        = o(1) + √(2/π),

because u'N(0, I_n) is N(0, 1) distributed, and P|N(0, 1)| = √(2/π).
Exact calculation of total variation distances in closed form can be difficult, if not impossible. Bounds and approximations often suffice.
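For a quick numerical illustration of these rates (my sketch, not part of the text; it assumes scipy is available), the exact one-dimensional distance from Example <3>, ‖N(θ, 1) − N(0, 1)‖_1 = 2 P{|N(0, 1)| ≤ θ/2}, can be compared with the approximation √(2/π) θ:

    import numpy as np
    from scipy.stats import norm

    for theta in (0.5, 0.1, 0.01):
        exact = 2 * (norm.cdf(theta / 2) - norm.cdf(-theta / 2))   # Example <3> with n = 1
        approx = np.sqrt(2 / np.pi) * theta
        print(theta, exact, approx)    # the ratio tends to 1 as theta -> 0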
bin.poisson.L1
<4>    Exercise. Let P denote the Bin(n, p) distribution and Q denote the Poisson(np) distribution. Show that ‖P − Q‖_1 ≤ 2p.
Solution: Without loss of generality assume 0 < p < 1. For x = 0, 1, . . . , n define

    g(x) := Q{x}/P{x} = [ e^{−np}(np)^x / x! ] / [ (n choose x) p^x (1 − p)^{n−x} ]
          = e^{−np} (1 − p)^{x−n} Π_{i=1}^{x−1} (1 − i/n)^{−1}.

Then

    ‖P − Q‖_1 = 2 Σ_{x=0}^{n} (P{x} − Q{x})^+ = 2 P^x (1 − g(x))^+,

the last expectation taken over x distributed according to P. It suffices to show that g(x) ≥ 1 − (x/n), for then the last expected value is bounded by 2 P^x(x/n) = 2p.
The lower bound for g(x) is trivial when x = n. For other values of x note that

    log[ g(x) / (1 − x/n) ] = −np − (n − x) log(1 − p) − Σ_{i=1}^{x} log(1 − i/n).

Bound the sum of logarithmic terms from below, using the following inequality from Prerequisites [5]: Let h be a (necessarily convex) function defined on [0, 1] with increasing derivative h'. For each integer x with 1 ≤ x ≤ n − 1,

lower.integral
<5>    h(x/n) − h(0) ≤ (1/n) Σ_{i=1}^{x} h'(i/n)
upper.integral
<6>    (1/n) Σ_{i=1}^{x} h'(i/n) ≤ h((x + 1)/n) − h(1/n).

To get h'(t) = −log(1 − t) for 0 ≤ t < 1 define

    h(t) = t + (1 − t) log(1 − t) = Σ_{k=2}^{∞} t^k / (k(k − 1))    for 0 ≤ t ≤ 1.

From <5>,

    −Σ_{i=1}^{x} log(1 − i/n) = Σ_{i=1}^{x} h'(i/n) ≥ n[ h(x/n) − h(0) ] = n h(x/n).

The inequality is also valid when x = 0, for then both sides equal zero. Thus, for x = 0, 1, . . . , n − 1,

    log[ g(x) / (1 − x/n) ] ≥ −np − (n − x) log(1 − p) + n h(x/n).

As a convex function of a continuous variable x in [0, n], the lower bound achieves its minimum when h'(x/n) = −log(1 − p), that is, when x = np. The lower bound is therefore everywhere no smaller than

    −np − n(1 − p) log(1 − p) + n h(p) = −n h(p) + n h(p) = 0.

It follows that g(x) ≥ 1 − (x/n), as asserted.
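A brute-force check of the bound (my sketch, not part of the text; it assumes scipy): compute ‖Bin(n, p) − Poisson(np)‖_1 by summing |P{x} − Q{x}| far into the Poisson tail and compare with 2p.

    import numpy as np
    from scipy.stats import binom, poisson

    def tv_bin_poisson(n, p, tail=2000):
        xs = np.arange(0, n + tail)
        return np.sum(np.abs(binom.pmf(xs, n, p) - poisson.pmf(xs, n * p)))

    for n, p in [(10, 0.3), (50, 0.1), (200, 0.02)]:
        print(n, p, tv_bin_poisson(n, p), 2 * p)   # the distance stays below 2p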

[affinity]

3.  The affinity between two probabilities

[Margin note: also need affinity for finite measures in Chapter 7.]

Let P and Q be probability measures with densities p and q with respect to a dominating measure λ. The affinity α_1(P, Q) between P and Q is defined as λ(p ∧ q). For each pair of functions f, g in M^+ for which f + g ≥ 1,

    P f + Q g = λ(p f + q g) ≥ λ( (f + g)(p ∧ q) ) ≥ λ(p ∧ q) = α_1(P, Q),

with equality when f = {p ≤ q} and g = {p > q}, because then p f + q g = p ∧ q. Thus

    α_1(P, Q) = inf{ P f + Q g : f, g ∈ M^+, f + g ≥ 1 },

which shows that the definition of the affinity does not depend on the particular choice of dominating measure. The form of the minimizing f and g also shows that the affinity equals the minimum of PA + QA^c over all A in A. The minimizing set has the statistical interpretation of the rejection region for the (nonrandomized) test between the two hypotheses P and Q that minimizes the sum of the type one and type two errors.
The affinity is also closely related to the total variation distance: integrating the equality p ∧ q = ½(p + q − |p − q|) we get

    α_1(P, Q) = ½ λ(p + q − |p − q|) = 1 − ½ ‖P − Q‖_1.

The function p ∧ q, when interpreted as a density with respect to λ, defines a nonnegative measure ν for which ν f ≤ min(P f, Q f) for all f ∈ M^+. That is, ν ≤ P and ν ≤ Q as measures. In fact ν is the largest measure with this property. For if μ, with density m, is another measure with the same property then, for all f ∈ M^+,

    μ f = μ( f {p ≥ q} ) + μ( f {p < q} ) ≤ Q( f {p ≥ q} ) + P( f {p < q} ).

That is,

    λ( f m ) ≤ λ( f q{p ≥ q} + f p{p < q} ) = λ( f (p ∧ q) )    for all f ∈ M^+.

It follows that m ≤ p ∧ q almost everywhere mod[λ], whence μ ≤ ν. The measure ν is also denoted by P ∧ Q, and is called the (lattice theoretic) minimum of P and Q. In summary,

aff.L1
<7>    α_1(P, Q) = ‖P ∧ Q‖_1 = 1 − ½ ‖P − Q‖_1


for all probability measures P and Q defined on the same sigma-field.
Note that (P ∧ Q)(A) is, in general, strictly smaller than min(PA, QA). Indeed, the set function A ↦ min(PA, QA) is not even finitely additive. The lattice theoretic minimum is not the same as the setwise minimum.
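The testing interpretation is easy to check numerically on a finite set. The sketch below (mine, not part of the text; numpy only) computes α_1(P, Q) three ways: as λ(p ∧ q), as 1 − ½‖P − Q‖_1, and as the minimum of PA + QA^c over all rejection regions A.

    import itertools
    import numpy as np

    rng = np.random.default_rng(1)
    p = rng.dirichlet(np.ones(5))    # density of P w.r.t. counting measure
    q = rng.dirichlet(np.ones(5))    # density of Q

    aff_min = np.minimum(p, q).sum()                 # lambda(p ^ q)
    aff_l1 = 1 - 0.5 * np.abs(p - q).sum()           # 1 - ||P - Q||_1 / 2
    aff_test = min(p[list(A)].sum() + 1 - q[list(A)].sum()
                   for r in range(6) for A in itertools.combinations(range(5), r))
    print(aff_min, aff_l1, aff_test)                 # all three agree, as in <7>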
[hellinger]

4.  Hellinger distance

Let P and Q be probability measures with densities p and q with respect to a dominating measure λ. The square roots of the densities, √p and √q, are both square integrable; they both belong to L^2(λ). The Hellinger distance between the two measures is defined as the L^2 distance between the square roots of their densities,

    H(P, Q)^2 = λ(√p − √q)^2 = λ(p + q − 2√(pq)) = 2 − 2λ√(pq).
It is easy to show that the integral defining the Hellinger distance does not depend on the choice of dominating measure (Problem [2]). The quantity λ√(pq) is called the Hellinger affinity between the two measures, and is denoted by α_2(P, Q).
The Hellinger distance satisfies the inequality 0 ≤ H(P, Q) ≤ √2. Some authors prefer to have an upper bound of 1; they include an extra factor of one half in the definition of H(P, Q)^2. The equality at 0 occurs when p = q almost surely mod[λ], that is, when P = Q as measures on A. Equality at √2 occurs when the Hellinger affinity is zero, that is, when pq = 0 almost surely mod[λ], which is the condition that P and Q be supported by disjoint subsets of X. For example, discrete distributions (concentrated on a countable set) are always at the maximum Hellinger distance from nonatomic distributions (zero mass at each point).
From the pointwise inequality √(pq) ≥ p ∧ q it follows that α_2(P, Q) ≥ α_1(P, Q), and hence, via <7>, that

H2.L1
<8>    H(P, Q)^2 ≤ ‖P − Q‖_1.

The last bound also follows directly from the inequality

    λ(√p − √q)^2 ≤ λ( |√p − √q| (√p + √q) ) = λ|p − q|.

The Cauchy-Schwarz inequality gives a companion lower bound:

    ‖P − Q‖_1^2 = ( λ( |√p − √q| |√p + √q| ) )^2 ≤ λ(√p − √q)^2 λ(√p + √q)^2 = H(P, Q)^2 ( 2 + 2α_2(P, Q) ).

Substitute for the Hellinger affinity to get

L1.hellinger
<9>    ‖P − Q‖_1 ≤ H(P, Q) √(4 − H(P, Q)^2) ≤ 2 H(P, Q) ≤ 2 ‖P − Q‖_1^{1/2}    by <8>.

Both the Hellinger distance and the total variation norm define bounded metrics on the space of all probability measures on A. From <9>, they define the same topology for convergence of probability measures.
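The chain of inequalities in <8> and <9> can be tested numerically. A small sketch (mine, not part of the text; numpy only), using random discrete distributions on the same finite set:

    import numpy as np

    rng = np.random.default_rng(2)
    for _ in range(3):
        p = rng.dirichlet(np.ones(8))
        q = rng.dirichlet(np.ones(8))
        l1 = np.abs(p - q).sum()                          # ||P - Q||_1
        h2 = np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)       # H(P, Q)^2
        h = np.sqrt(h2)
        print(h2 <= l1 <= h * np.sqrt(4 - h2) <= 2 * h <= 2 * np.sqrt(l1))   # always True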
[KL]

5.  Relative entropy

Let P and Q be two probability measures with densities p and q with respect to some dominating measure λ. The relative entropy (also known as the Kullback-Leibler distance, even though it is not a metric) between P and Q is defined as D(P‖Q) = λ( p log(p/q) ).
At first sight, it is not obvious that the definition cannot suffer from the ∞ − ∞ problem. Indeed, it is not immediately obvious that the negative part of p log(p/q) must be integrable. A Taylor expansion comes to the rescue: for x > 0,

xlogx
<10>    x log x = x − 1 + ½ (x − 1)^2 / x*    for some x* between 1 and x.

When p > 0 and q > 0, put x = p/q, discard the nonnegative quadratic remainder term, then multiply through by q to get

    p log(p/q) ≥ p − q.

The same inequality also holds at points where q = 0 and p > 0, with the left-hand side interpreted as +∞; and at points where p = 0 we get no contribution to the defining integral. It follows that λ( ( p log(p/q) )^− ) < ∞ and D(P‖Q) ≥ λ(p − q) = 0. That is, the relative entropy is well defined and nonnegative. These conclusions would also follow, perhaps more directly, via Jensen's inequality. I prefer to argue via the Taylor expansion because, as you will soon see, with refinements on the remainder term we get better lower bounds for the relative entropy.
If λ{p > 0 = q} > 0 then p log(p/q) is infinite on a set of positive λ measure, which forces D(P‖Q) = ∞. That is, the relative entropy is infinite unless P is absolutely continuous with respect to Q. It can also be infinite even if P and Q are mutually absolutely continuous (Problem [4]).
As with the total variation and Hellinger distances, the relative entropy does not depend on the choice of the dominating measure (Problem [3]).
It is easy to deduce from the conditions for equality in Jensen's inequality that D(P‖Q) = 0 if and only if P = Q. An even stronger assertion follows from an inequality relating relative entropy to Hellinger distance.

re.hellinger
<11>    Lemma. For probabilities P and Q on the same space, D(P‖Q) ≥ H^2(P, Q).

Proof. This inequality is trivially true unless P is absolutely continuous with respect to Q, in which case we can take the dominating measure equal to Q. Define ξ = √p − 1. Note that Qξ^2 = H^2(P, Q) and

    1 = Qp = Q(1 + ξ)^2 = 1 + 2Qξ + Qξ^2,

which implies that 2Qξ = −H^2(P, Q). Hence

    D(P‖Q) = Q( p log p )
           = 2 Q( (1 + ξ)^2 log(1 + ξ) )
           ≥ 2 Q( (1 + ξ)^2 ξ/(1 + ξ) )        because log(1 + t) ≥ t/(1 + t) for t > −1
           = 2Qξ + 2Qξ^2
           = H^2(P, Q),

as asserted.
In a similar vein, there is a lower bound for the relative entropy involving the L^1-distance, due to Csiszár (1967), Kullback (1967), and Kemperman (1969). [Margin note: check citation.]

re.L1
<12>    Lemma. For probabilities P and Q on the same space, D(P‖Q) ≥ ½ ‖P − Q‖_1^2.

Proof. Recall from Prerequisites [5] that

    (1 + x) log(1 + x) − x ≥ ½ x^2 / (1 + x/3)    for x ≥ −1.

To establish the inequality asserted by the Lemma, we may assume, with no loss of generality (compare with the proof of Lemma <11>), that P is absolutely continuous with respect to Q. This time write 1 + ξ for the density. Notice that Qξ = 0 and thus

    D(P‖Q) = Q( (1 + ξ) log(1 + ξ) − ξ ) ≥ ½ Q( ξ^2 / (1 + ξ/3) ).

Multiply the right-hand side by 1 = Q(1 + ξ/3), then invoke the Cauchy-Schwarz inequality to bound the product from below by half the square of

    Q( ( |ξ| / √(1 + ξ/3) ) √(1 + ξ/3) ) = Q|ξ| = ‖P − Q‖_1.

The asserted inequality <12> follows.
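Both lower bounds are easily confirmed numerically. A short sketch (mine, not part of the text; numpy only) for random discrete P and Q supported on the same finite set:

    import numpy as np

    rng = np.random.default_rng(3)
    p = rng.dirichlet(np.ones(10))
    q = rng.dirichlet(np.ones(10))

    kl = np.sum(p * np.log(p / q))                  # D(P||Q)
    h2 = np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)     # H^2(P, Q)
    l1 = np.sum(np.abs(p - q))                      # ||P - Q||_1
    print(kl >= h2, kl >= 0.5 * l1 ** 2)            # both True, as in <11> and <12>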


bin.poisson.re
<13>    Exercise. Let P denote the Bin(n, p) distribution and Q denote the Poisson(np) distribution, as in Exercise <4>. Show that

    D(P‖Q) ≤ −log(1 − p) − p + p^2/2 = p^2 + O(p^3).

Solution: Using the same notation as in Exercise <4>, we have

    D(P‖Q) = −P^x log g(x)
           = P^x( np + (n − x) log(1 − p) + Σ_{i=1}^{x−1} log(1 − i/n) )
           = n h(p) − P^x Σ_{i=1}^{x−1} h'(i/n).

The sum inside the last expectation should be interpreted as zero if x equals 0 or 1. The lower bound <5>, that is,

hprime.lower
<14>    (1/n) Σ_{i=1}^{x−1} h'(i/n) ≥ h( (x − 1)/n ),

is also valid for x = 0, 1 if we extend h to have h(t) = 0 for t < 0. Take expectations, then invoke Jensen's inequality:

    D(P‖Q) ≤ n h(p) − n P^x h( (x − 1)/n ) ≤ n h(p) − n h( p − 1/n ).

By the Mean Value Theorem, the last difference equals h'(p*) for some p* between p − 1/n and p. Because h' is increasing, the bound is less than

    h'(p) = −log(1 − p) = p + p^2/2 + ⋯

The bound decreases at the rate p as p tends to zero, more slowly than the asserted p^2 rate. What went wrong?
The source of the problem is <14>. It approximates the sum of logarithmic terms too crudely. We need to break out the linear contribution to h'. Define

    h_1(t) = h(t) − t^2/2 = Σ_{k=3}^{∞} t^k / (k(k − 1)),

which has derivative h_1'(t) = h'(t) − t = −log(1 − t) − t. Invoke the analog of <14>, with h_1 in place of h, to refine the lower bound to

    (1/n) Σ_{i=1}^{x−1} h'(i/n) = (1/n) Σ_{i=1}^{x−1} h_1'(i/n) + (1/n) Σ_{i=1}^{x−1} (i/n)
                                ≥ h_1( (x − 1)/n ) + x(x − 1)/(2n^2).

Direct calculation of moments gives P x(x − 1) = n(n − 1) p^2. Arguing once again via Jensen's inequality we now get

    D(P‖Q) ≤ n h_1(p) + n p^2/2 − n h_1( p − 1/n ) − (n − 1) p^2/2
           ≤ h_1'(p) + p^2/2
           = −log(1 − p) − p + p^2/2
           = p^2 + p^3/3 + ⋯,

as asserted.
From the bound on D(P‖Q) in the last Exercise and from Lemma <11> we also get

    H^2( Bin(n, p), Poisson(np) ) ≤ ψ(p) := −log(1 − p) − p + p^2/2.

As p increases, the upper bound ψ(p) eventually exceeds 2, the largest possible value for a squared Hellinger distance. In fact ψ(p_0) = 2 for p_0 ≈ 0.918. Using the fact that ψ(p)/p^2 is also increasing in p, we can replace the upper bound by p^2 ψ(p_0)/p_0^2, leaving

    H^2( Bin(n, p), Poisson(np) ) ≤ 2 p^2 / p_0^2 ≈ (1.5 p)^2,

a slightly neater expression than ψ(p).
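A numerical check of the Exercise (my sketch, not part of the text; it assumes scipy): compute D(Bin(n, p) ‖ Poisson(np)) directly and compare it with ψ(p) = −log(1 − p) − p + p^2/2.

    import numpy as np
    from scipy.stats import binom, poisson

    def kl_bin_poisson(n, p):
        xs = np.arange(0, n + 1)
        px = binom.pmf(xs, n, p)
        qx = poisson.pmf(xs, n * p)
        keep = px > 0
        return np.sum(px[keep] * np.log(px[keep] / qx[keep]))

    for n, p in [(20, 0.05), (100, 0.05), (100, 0.2)]:
        psi = -np.log1p(-p) - p + p ** 2 / 2
        print(n, p, kl_bin_poisson(n, p), psi)   # the relative entropy stays below psi(p)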

[product.measure]

6.  Product measures

Suppose P = P_1 ⊗ P_2 and Q = Q_1 ⊗ Q_2, product measures on X_1 × X_2. If both P_i and Q_i are dominated by λ_i, with corresponding densities p_i and q_i, then

    α_2(P, Q) = λ_1 λ_2 √( p_1(x_1) p_2(x_2) q_1(x_1) q_2(x_2) )
              = λ_1 √( p_1(x_1) q_1(x_1) ) · λ_2 √( p_2(x_2) q_2(x_2) )
              = α_2(P_1, Q_1) α_2(P_2, Q_2).

A similar factorization holds for products of more than two measures. This factorization is the chief reason for the great usefulness of Hellinger distance when working with product measures. In particular, it gives a most convenient way to bound total variation distances. By contrast, the affinity α_1(P, Q) enjoys no comparable factorization, because minima of products do not factorize into products of minima. It is awkward to deal directly with total variation distance between product measures.

hell.product
<15>    Lemma. For probability measures {P_i} and {Q_i},

    H(P_1 ⊗ ⋯ ⊗ P_n, Q_1 ⊗ ⋯ ⊗ Q_n)^2 = 2 − 2 Π_{i≤n} ( 1 − ½ H(P_i, Q_i)^2 ) ≤ Σ_{i=1}^{n} H(P_i, Q_i)^2.

Proof. The product term comes from the factorization of the affinity between the two product measures.
To establish the upper bound, write y_i for H(P_i, Q_i)^2 / 2. We need to show that the function

    G_n(y_1, . . . , y_n) = Σ_{i=1}^{n} y_i + Π_{i=1}^{n} (1 − y_i) − 1

is nonnegative for all 0 ≤ y_i ≤ 1. The lower bound of 0 is achieved when n = 1. For fixed y_1, . . . , y_{n−1}, the function G_n is linear in y_n, achieving its minimum at either y_n = 0 or y_n = 1. Thus G_n is at least

    min( G_n(y_1, . . . , y_{n−1}, 0), G_n(y_1, . . . , y_{n−1}, 1) ) ≥ G_{n−1}(y_1, . . . , y_{n−1}).

An inductive argument completes the proof.

product.limit
<16>    Corollary. If P_n^{⊗n} and Q_n^{⊗n} denote the n-fold products of identical factors P_n and Q_n, and if √n H(P_n, Q_n) → c as n → ∞, then H(P_n^{⊗n}, Q_n^{⊗n})^2 → 2 − 2 exp(−c^2/2).

The Corollary is the basis for a minor industry in the calculation of minimax rates of convergence of estimators, as will be explained in Chapter 14.
Relative entropies between product measures also factorize.

KL.product
<17>    Lemma. For probability measures {P_i} and {Q_i},

    D( P_1 ⊗ ⋯ ⊗ P_n ‖ Q_1 ⊗ ⋯ ⊗ Q_n ) = Σ_{i≤n} D(P_i ‖ Q_i).

Proof. Without loss of generality assume that P_i is absolutely continuous with respect to Q_i, with density p_i, for each i. Then the left-hand side of the asserted equality equals

    (Q_1 ⊗ ⋯ ⊗ Q_n)( Π_{i≤n} p_i(x_i) · Σ_{i≤n} log p_i(x_i) ),

which factorizes to give the right-hand side.
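Both factorizations can be verified directly for small discrete examples by building the product measures explicitly. A sketch (mine, not part of the text; numpy and functools only):

    import numpy as np
    from functools import reduce

    rng = np.random.default_rng(4)
    Ps = [rng.dirichlet(np.ones(3)) for _ in range(4)]
    Qs = [rng.dirichlet(np.ones(3)) for _ in range(4)]

    def product_pmf(ms):
        # joint pmf of the product measure, flattened into a single vector
        return reduce(lambda a, b: np.outer(a, b).ravel(), ms)

    P, Q = product_pmf(Ps), product_pmf(Qs)
    h2 = lambda p, q: np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)
    kl = lambda p, q: np.sum(p * np.log(p / q))

    h2_marg = [h2(p, q) for p, q in zip(Ps, Qs)]
    print(h2(P, Q), 2 - 2 * np.prod([1 - h / 2 for h in h2_marg]), sum(h2_marg))  # Lemma <15>
    print(kl(P, Q), sum(kl(p, q) for p, q in zip(Ps, Qs)))                        # Lemma <17>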

[boundL1]

7.  Second-moment bounds on total variation distance

Particularly for probability measures P and Q that are close, we often need only upper bounds on total variation distance. If both measures are dominated by a probability measure μ, then

L1.v.L2
<18>    ‖P − Q‖_1^2 ≤ μ(p − q)^2.

Notice that the right-hand side depends on the choice of μ, whereas the left-hand side does not. Often it will be convenient to choose μ = P or μ = (P + Q)/2.
The second-moment upper bound is often of the correct order of magnitude for a well chosen μ. For example, suppose dQ/dP = 1 + Δ, with Δ small enough to justify integration of the expansion

    (1 + Δ)^{1/2} = 1 + ½Δ − ⅛Δ^2 + ⋯

to give P(1 + Δ)^{1/2} ≈ 1 − PΔ^2/8, because PΔ = 0. Then

    H^2(P, Q) = 2 − 2P(1 + Δ)^{1/2} ≈ ¼ PΔ^2.

More precisely, if there exists a constant C such that √p + √q ≤ C everywhere then

    H^2(P, Q) = μ( |p − q|^2 / |√p + √q|^2 ) ≥ μ|p − q|^2 / C^2,

and if there exists a constant c such that √p + √q ≥ c everywhere then

    H^2(P, Q) ≤ μ|p − q|^2 / c^2.

In particular, if μ = (P + Q)/2 then p + q = 2, so that √2 ≤ √p + √q ≤ 2 and

quadratic.average
<19>    μ|p − q|^2 / 4 ≤ H^2(P, Q) ≤ μ|p − q|^2 / 2    if μ = (P + Q)/2.

See Problem [5] for a comparison between P and (Q + P)/2 as dominating measures.
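These bounds are easy to test on discrete examples. In the sketch below (mine, not part of the text; numpy only) the densities p and q are taken with respect to μ = (P + Q)/2, so that p + q = 2 pointwise:

    import numpy as np

    rng = np.random.default_rng(5)
    P = rng.dirichlet(np.ones(7))
    Q = rng.dirichlet(np.ones(7))
    M = (P + Q) / 2
    p, q = P / M, Q / M                 # densities w.r.t. mu = (P + Q)/2

    l1 = np.abs(P - Q).sum()
    h2 = np.sum((np.sqrt(P) - np.sqrt(Q)) ** 2)
    quad = np.sum(M * (p - q) ** 2)     # mu(p - q)^2

    print(l1 ** 2 <= quad)              # <18>
    print(quad / 4 <= h2 <= quad / 2)   # <19>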
normal.mixture
<20>    Example. As shown in Example <3>, and the explanation that follows that Example, the total variation distance between the N(θ, 1) and the N(0, 1) distributions decreases like √(2/π) |θ| as θ → 0. More precisely,

    φ(x − θ) = φ(x) + θ x φ(x) + ½ θ^2 (x^2 − 1) φ(x) + ⋯

so that

    ∫ |φ(x − θ) − φ(x)| dx = |θ| ∫ |x| φ(x) dx + O(θ^2).

A similar argument suggests that the mixture P_θ = ½ N(θ, 1) + ½ N(−θ, 1) converges to the N(0, 1) at an even faster rate:

    ½( φ(x − θ) + φ(x + θ) ) = φ(x) + ½ θ^2 (x^2 − 1) φ(x) + ⋯

so that

    ∫ | ½ φ(x − θ) + ½ φ(x + θ) − φ(x) | dx = ½ θ^2 ∫ |x^2 − 1| φ(x) dx + O(θ^4).

Integration by parts gives ½ ∫ |x^2 − 1| φ(x) dx = 2φ(1) ≈ 0.48.
It is not too difficult to make these calculations rigorous. The second moment bound gives the same rate of convergence even more easily:

    ‖P_θ − P_0‖_1^2 ≤ P_0 | dP_θ/dP_0 − 1 |^2 = P_0 ( dP_θ/dP_0 )^2 − 1
        = ¼ P_0 ( exp(θx − θ^2/2) + exp(−θx − θ^2/2) )^2 − 1
        = ½ ( exp(θ^2) + exp(−θ^2) ) − 1
        = θ^4/2! + θ^8/4! + ⋯

The bound on the distance ‖P_θ − P_0‖_1 decreases like θ^2/√2, an overestimate by a constant factor of approximately 1.5.
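A numerical check of the mixture calculation (my sketch, not part of the text; scipy for the normal density): the exact distance behaves like 2φ(1)θ^2 ≈ 0.48θ^2, while the second-moment bound gives θ^2/√2 ≈ 0.71θ^2.

    import numpy as np
    from scipy.stats import norm

    x = np.linspace(-12, 12, 200001)
    dx = x[1] - x[0]
    for theta in (0.5, 0.2, 0.1):
        mix = 0.5 * norm.pdf(x - theta) + 0.5 * norm.pdf(x + theta)
        tv = np.sum(np.abs(mix - norm.pdf(x))) * dx        # ||P_theta - P_0||_1 by quadrature
        print(theta, tv, 2 * norm.pdf(1) * theta ** 2, theta ** 2 / np.sqrt(2))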

Often the second moment method reduces calculations of bounds on total variation distances to calculations of variances and covariances.

L2.product
<21>    Exercise. Let P = ⊗_{i≤n} P_i and Q = ⊗_{i≤n} Q_i be finite products of probability measures such that Q_i has density 1 + δ_i(x_i) with respect to P_i. Show that

    ‖P − Q‖_1^2 ≤ Π_{i≤n} ( 1 + P_i δ_i^2 ) − 1.

Solution: Notice that P_i δ_i = 0 because both P_i and Q_i are probabilities. We may assume each P_i δ_i^2 finite, for otherwise the asserted inequality is trivial. From <18>,

    ‖P − Q‖_1^2 ≤ P( Π_{i≤n} (1 + δ_i) − 1 )^2
               = P( Σ_i δ_i + Σ_{{i,j}} δ_i δ_j + Σ_{{i,j,k}} δ_i δ_j δ_k + ⋯ )^2.

The notation {i, j} indicates that the sum runs over all pairs of distinct integers i and j between 1 and n, each pair appearing once; the notation {i, j, k} refers to summation over triples of integers, all different; and so on.
When the squared term is expanded, any unpaired δ_i factor is annihilated by the corresponding P_i marginal of P. Only the squared terms survive, reducing the expansion to

    Σ_i P_i δ_i^2 + Σ_{{i,j}} P_i δ_i^2 P_j δ_j^2 + ⋯,

the asserted upper bound.

The upper bound in the Exercise decreases like Σ_i P_i δ_i^2 when the sum is small. In situations where P_i δ_i^2 behaves like H^2(P_i, Q_i), the second moment bound is comparable to the analogous bound for Hellinger distance:

    H^2(P, Q) ≤ Σ_{i=1}^{n} H(P_i, Q_i)^2.
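The bound of Exercise <21> can be checked by brute force for a small product of discrete measures (my sketch, not part of the text; numpy and functools only). The perturbations δ_i are centered so that P_i δ_i = 0.

    import numpy as np
    from functools import reduce

    rng = np.random.default_rng(6)
    Ps, Qs, deltas = [], [], []
    for _ in range(4):                       # four coordinates, three points each
        Pi = rng.dirichlet(np.ones(3))
        d = rng.uniform(-0.4, 0.4, size=3)
        d -= np.sum(Pi * d)                  # center: P_i delta_i = 0, and 1 + delta_i > 0
        Ps.append(Pi); deltas.append(d); Qs.append(Pi * (1 + d))

    product_pmf = lambda ms: reduce(lambda a, b: np.outer(a, b).ravel(), ms)
    P, Q = product_pmf(Ps), product_pmf(Qs)
    lhs = np.abs(P - Q).sum() ** 2                                             # ||P - Q||_1^2
    rhs = np.prod([1 + np.sum(Pi * d ** 2) for Pi, d in zip(Ps, deltas)]) - 1
    print(lhs <= rhs, lhs, rhs)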

The second moment method also works for situations where it becomes exceedingly difficult to calculate Hellinger distances directly. The calculation of distances between mixtures of product measures, which will be the key to finding minimax rates of convergence in Chapter 14, is a good illustration.
Suppose P is a product probability measure on X^n. For each α in a finite set A, suppose the probability Q_α is also a product measure, obtained by a small perturbation of P,

    dQ_α/dP = Π_{i≤n} ( 1 + δ_{α,i}(x_i) ),

where P δ_{α,i}(x_i) = 0 for each α and i, to give Q_α total mass one. Let {w_α : α ∈ A} be a finite set of nonnegative weights that sum to one. To a Bayesian, the weights w_α would define a prior distribution on A.
Let Q_0 be another product probability measure, with

    dQ_0/dP = Π_{i≤n} ( 1 + δ_{0,i}(x_i) ).

The index 0 should be understood as not belonging to A, even if Q_0 happens to coincide with Q_α for some α in A.
A small extension of the method from Exercise <21> gives a very useful bound on the total variation distance between the mixture Σ_{α∈A} w_α Q_α and Q_0. The bound again involves the function

    S(x_1, . . . , x_n) = −1 + Π_{i≤n} (1 + x_i) = Σ_i x_i + Σ_{{i,j}} x_i x_j + ⋯ + x_1 x_2 ⋯ x_n.

L2.product.mixture
<22>    Lemma. With notation as above, define γ_{α,β}, for α, β ∈ A ∪ {0}, to be the vector in R^n with ith component γ_{α,β}(i) = P( δ_{α,i} δ_{β,i} ), which is assumed finite. Then

    ‖ Σ_{α∈A} w_α Q_α − Q_0 ‖_1^2 ≤ Σ_{α∈A} Σ_{β∈A} w_α w_β ( S(γ_{α,β}) − 2 S(γ_{α,0}) + S(γ_{0,0}) )
                                  = Σ_{α∈A} Σ_{β∈A} w_α w_β S(γ_{α,β}) − 2 Σ_{α∈A} w_α S(γ_{α,0}) + S(γ_{0,0}).

Proof. Write δ_α for the random vector with components δ_{α,i}(x_i). Then Σ_{α∈A} w_α Q_α − Q_0 has density Σ_{α∈A} w_α ( S(δ_α) − S(δ_0) ) with respect to P. The quadratic bound <18> for the total variation distance equals

    P Σ_α Σ_β w_α w_β ( S(δ_α) S(δ_β) − S(δ_0) S(δ_β) − S(δ_α) S(δ_0) + S(δ_0)^2 ).


Consider the contribution to the expectation from the first term inside the brackets. It expands to the product

    P( ( Σ_i δ_{α,i} + Σ_{{i,j}} δ_{α,i} δ_{α,j} + Σ_{{i,j,k}} δ_{α,i} δ_{α,j} δ_{α,k} + ⋯ )
       × ( Σ_i δ_{β,i} + Σ_{{i,j}} δ_{β,i} δ_{β,j} + Σ_{{i,j,k}} δ_{β,i} δ_{β,j} δ_{β,k} + ⋯ ) ).

As in the proof of Exercise <21>, the expectation annihilates (because P δ_{α,i} = 0 for all α and i) most of the cross product terms, leaving only the terms where the i, j, . . . subscripts pair up exactly. Also P is a product measure, so expectations like

    P( δ_{α,i} δ_{α,j} δ_{α,k} δ_{β,i} δ_{β,j} δ_{β,k} )

factorize as

    P( δ_{α,i} δ_{β,i} ) P( δ_{α,j} δ_{β,j} ) P( δ_{α,k} δ_{β,k} ) = γ_{α,β}(i) γ_{α,β}(j) γ_{α,β}(k).

The sum over all such products equals S(γ_{α,β}). The other contributions are handled similarly.

For each α, Exercise <21> gives the bound

    ‖Q_α − Q_0‖_1^2 ≤ S(γ_{α,α}) ≈ Σ_i P δ_{α,i}^2

when the sum of the P δ_{α,i}^2 terms is small. Similarly, the first term in the bound from Lemma <22> is approximately

    Σ_α Σ_β w_α w_β Σ_i P( δ_{α,i} δ_{β,i} ) = Σ_i P( ( Σ_α w_α δ_{α,i} )^2 ).

With appropriate choices of perturbations {δ_{α,i}} and weights {w_α} we might hope to achieve some cancellations, similar to those for the normal mixture in Example <20>.

mammen
<23>    Example. A key step in a beautiful calculation by Mammen (1986) was the bounding of the total variation distance between a product measure P^n and a mixture

    Q̄ = (1/n) Σ_{α=1}^{n} P^{α−1} ⊗ Q ⊗ P^{n−α}.

Mammen used the second moment method with dominating measure (P + Q)/2 to obtain an upper bound in terms of H^2(P, Q). (See Problem [6] for his bound.) A similar bound is even easier to derive when Q has density 1 + δ with respect to P, with Pδ^2 < ∞.
In the notation of Lemma <22>, take the base product measure to be P^n, write Q_α for the αth term in the sum defining Q̄, and take Q_0 = P^n. Then δ_{0,i} = 0 for every i, and for 1 ≤ α ≤ n,

    δ_{α,i} = δ(x_i)    if i = α,
    δ_{α,i} = 0         otherwise.

Thus γ_{α,β}(i) = Pδ^2 if i = α = β, and it is zero otherwise. Each S(γ_{α,β}) simplifies to Pδ^2 {α = β}. From the Lemma,

    ‖Q̄ − P^n‖_1^2 ≤ (1/n^2) Σ_{α=1}^{n} Pδ^2 = Pδ^2 / n.

In the typical case where δ is bounded (so that Pδ^2 is of the same order of magnitude as H^2(P, Q), by Problem [5]) and H(P, Q) is of order O(1/√n), the bound √(Pδ^2/n) for ‖Q̄ − P^n‖_1 converges to zero at a 1/n rate. By a direct calculation of L^1 norms,

    ‖Q_α − P^n‖_1 = ‖Q − P‖_1,

which in typical parametric situations converges to zero at only a O(1/√n) rate. The mixing greatly improves the rate of convergence.
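For small n the mixture Q̄ can be built explicitly and the bound of Example <23> checked by brute force (my sketch, not part of the text; numpy and functools only):

    import numpy as np
    from functools import reduce

    rng = np.random.default_rng(7)
    n = 6
    P = rng.dirichlet(np.ones(3))            # a distribution on three points
    d = rng.uniform(-0.4, 0.4, size=3)
    d -= np.sum(P * d)                       # P(delta) = 0
    Q = P * (1 + d)                          # Q has density 1 + delta w.r.t. P

    product_pmf = lambda ms: reduce(lambda a, b: np.outer(a, b).ravel(), ms)
    Pn = product_pmf([P] * n)
    Qbar = sum(product_pmf([P] * (a - 1) + [Q] + [P] * (n - a))
               for a in range(1, n + 1)) / n

    lhs = np.abs(Qbar - Pn).sum() ** 2       # ||Qbar - P^n||_1^2
    rhs = np.sum(P * d ** 2) / n             # P(delta^2)/n, the bound from Example <23>
    print(lhs <= rhs, lhs, rhs, np.abs(Q - P).sum())   # the mixture is much closer than Q is to P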

8.  Problems

Prob.hellinger

[1]  Suppose P_1 and P_2 are probability measures with densities p_1 and p_2 with respect to a dominating measure λ. Let μ be another dominating measure. Write ℓ for the density of λ with respect to λ + μ.
    (i) Show that P_i has density p_i ℓ with respect to λ + μ.
    (ii) Show that (λ + μ)( √(p_1 ℓ) − √(p_2 ℓ) )^2 = λ( √p_1 − √p_2 )^2.
    (iii) Deduce that the integral that defines the Hellinger distance H(P_1, P_2) does not depend on the choice of dominating measure.

change.of.measure

[2]  Let P and Q be probability measures with densities p and q with respect to a sigma-finite measure λ. For fixed α ≥ 1, show that Ψ_α(P, Q) := λ| p^{1/α} − q^{1/α} |^α does not depend on the choice of dominating measure. Hint: Let μ be another sigma-finite dominating measure. Write ℓ for the density of λ with respect to λ + μ. Show that dP/d(λ + μ) = pℓ and dQ/d(λ + μ) = qℓ. Express Ψ_α(P, Q) as an integral with respect to λ + μ. Argue similarly for μ.

com.re

[3]  Adapt the argument from the previous Problem to show that the relative entropy D(P‖Q) does not depend on the choice of dominating measure.

infinite.re

[4]  Let P be the standard Cauchy distribution on the real line, and let Q be the standard normal distribution. Show that D(P‖Q) = ∞, even though P and Q are mutually absolutely continuous.

two.quadratics

[5]  Suppose a probability measure Q has density 1 + δ with respect to a probability measure P. Define M = (P + Q)/2. Write p and q for the densities of P and Q with respect to M.
    (i) Show that p = (1 + δ/2)^{−1} = 2 − q.
    (ii) Deduce that M|√p − √q|^2 ≤ 2Pδ^2. Hint: δ ≥ −1.
    (iii) If δ/2 is bounded above by a constant C, show that M|√p − √q|^2 ≥ Pδ^2 / (4(1 + C)).

orig.mammen

[6]  [Margin note: calculations need checking.] For P and Q as in Example <23>, bound ‖Q̄ − P^n‖_1^2 using the second moment method for densities with respect to μ = (P + Q)/2, by following these steps. Write q = dQ/dμ = 1 + η, so that p = dP/dμ = 1 − η. Write β for μη^2. Note that, by <19>, β ≤ H^2(P, Q).
    (i) Show that δ_{0,i} = −η for all i, and that δ_{α,i} = η if i = α, and −η otherwise.
    (ii) Deduce that γ_{0,0}(i) = β for all i; that γ_{α,β}(i) = β if α = β = i or α ≠ i ≠ β, and −β otherwise; and that γ_{α,0}(i) = −β if α = i, and β otherwise.
    (iii) Deduce that 1 + S(γ_{0,0}) = (1 + β)^n; that 1 + S(γ_{α,0}) = (1 + β)^{n−1}(1 − β); and that

        1 + S(γ_{α,β}) = (1 + β)^{n−2}(1 − β)^2    if α ≠ β
                       = (1 + β)^n                 if α = β.

    (iv) Deduce that

        ‖Q̄ − P^n‖_1^2 ≤ μ^n( q̄ − p̄ )^2 ≤ 4β( β + 1/n )(1 + β)^{n−2},

where q̄ and p̄ denote the densities of Q̄ and P^n with respect to the product measure μ^n. Compare with Mammen (1986, inequality 3.7).

9.  Notes

My definition of total variation follows Dunford & Schwartz (1958, Section III.1).
I adapted the results on the total variation and relative entropy distances between Binomial and Poisson distributions from Reiss (1993, page 25), who credited Barbour & Hall (1984) with the first result, and Falk & Reiss (1992) with the second result. [Margin note: check Barbour and Hall.]
Barbour, Holst & Janson (1992) have devoted a whole book to the topic of Poisson approximation.
References

Barbour, A. D. & Hall, P. (1984), 'On the rate of Poisson convergence', Proceedings of the Cambridge Philosophical Society 95, 473-480.
Barbour, A. D., Holst, L. & Janson, S. (1992), Poisson Approximation, Oxford University Press.
Csiszár, I. (1967), 'Information-type measures of difference of probability distributions and indirect observations', Studia Scientiarum Mathematicarum Hungarica 2, 299-318.
Dunford, N. & Schwartz, J. T. (1958), Linear Operators, Part I: General Theory, Wiley.
Falk, M. & Reiss, R.-D. (1992), 'Poisson approximation of empirical processes', Statistics & Probability Letters 14, 39-48.
Kemperman, J. H. B. (1969), 'On the optimum rate of transmitting information', in Probability and Information Theory, Springer-Verlag (Lecture Notes in Mathematics 89), pages 126-169.
Kullback, S. (1967), 'A lower bound for discrimination information in terms of variation', IEEE Transactions on Information Theory 13, 126-127.
Mammen, E. (1986), 'The statistical information contained in additional observations', Annals of Statistics 14, 665-678.
Reiss, R.-D. (1993), A Course on Point Processes, Springer-Verlag.
