
Chapter 3

Distances and affinities between measures
At first reading you might find this Chapter a mere catalog of strange inequalities involving several strange-looking distances between probability measures: the Hellinger distance H(P, Q); the total variation, or L^1, distance ‖P − Q‖_1; and the relative entropy D(P‖Q). In later chapters you will discover that these intrinsic measures of distance play a key role in the study of general statistical models. In particular, the following inequalities will turn out to be most important.

In Section 4:

    ‖P − Q‖_1 ≤ 2 H(P, Q) ≤ 2 ‖P − Q‖_1^{1/2}

In Section 5:

    H^2(P, Q) ≤ D(P‖Q)        and        ½ ‖P − Q‖_1^2 ≤ D(P‖Q)

In Section 6, for product measures:

    H(P_1 ⊗ ⋯ ⊗ P_n, Q_1 ⊗ ⋯ ⊗ Q_n)^2 ≤ Σ_{i≤n} H(P_i, Q_i)^2
    D(P_1 ⊗ ⋯ ⊗ P_n ‖ Q_1 ⊗ ⋯ ⊗ Q_n) = Σ_{i≤n} D(P_i ‖ Q_i)

Section 7 describes a simple method for bounding total variation distance via calculations of variances and covariances.
In my opinion, the total variation distance is the most important of the three, for reasons that will be explained in Chapter 14. I will use the other two distances mostly as a way of bounding total variation.

[why]

1.  Why bother with different distances?


When we work with a family of probability measures, {P : }, indexed
by a metric space , there would seem to be an obvious way to calculate
the distance between measures: use the metric on . For many problems of
estimation, the obvious is what we want. We ask how close (in the metric)
we can come to guessing 0 , based on an observation from P0 ; we compare
estimators based on rates of convergence, or based on expected values of loss
functions involving the distance from 0 .
When the parametrization is reasonable (whatever that means), distances
measured by the  metric are reasonable. (What else could I say?) However
it is not hard to concoct examples where the  metric is misleading.

Asymptopia: 17 October 2000

c David Pollard

Section 3.1

misleading

< 1>

Why bother with different distances?

misleading
<1>    Example. Let P_{n,θ} denote the joint distribution for n independent observations from the N(θ, 1) distribution, with θ ∈ R. Under P_{n,θ_0} the sample average, X̄_n, converges to θ_0 at an n^{−1/2} rate. The parametrization is reasonable.
What happens if we reparametrize, replacing the N(θ, 1) by a N(θ^3, 1)? We are still fitting the same model; same probability measures, only the labelling has changed. The maximum likelihood estimator, X̄_n^{1/3}, still converges at an n^{−1/2} rate if θ_0 ≠ 0, but for θ_0 = 0 we get an n^{−1/6} rate, as an artifact of the reparametrization.
More imaginative reparametrizations can produce even stranger behaviour for the maximum likelihood estimator. For example, define the one-to-one reparametrization

    ψ(θ) = θ        if θ is rational
    ψ(θ) = θ + 1    if θ is irrational

Now let P_{n,θ} denote the joint distribution for n independent observations from the N(ψ(θ), 1) distribution. If θ_0 is rational, the maximum likelihood estimator, ψ^{−1}(X̄_n), gets very confused: it concentrates around θ_0 − 1 as n gets larger.

You would be right to scoff at the second reparametrization in the Example, yet it does make the point that distances measured in the Θ metric, for some parametrization picked out of the air, might not be particularly informative about the behaviour of estimators. Less ridiculous examples arise routinely in nonparametric problems, that is, in problems where infinite dimensional parameters enter, making the choice of metric less obvious.
Fortunately, there are intrinsic ways to measure distances between probability measures, distances that don't depend on the parametrization. The rest of this Chapter will set forth a few of the basic definitions and facts. The total variation distance has properties that will be familiar to students of the Neyman-Pearson approach to hypothesis testing. The Hellinger distance is closely related to the total variation distance (for example, both distances define the same topology on the space of probability measures), but it has several technical advantages derived from properties of inner products. (Hilbert spaces have nicer properties than general Banach spaces.) For example, Hellinger distances are very well suited for the study of product measures (Section 6). Also, Hellinger distance is closely related to the concept called Hellinger differentiability (Chapter 4), an elegant alternative to the traditional assumptions of pointwise differentiability in some asymptotic problems. Relative entropy, which is also known as Kullback-Leibler distance, emerges naturally from the study of maximum likelihood estimation. The relative entropy is not a metric, but it is closely related to the other two distances, and it too is well suited for use with product measures.
The intrinsic measures of distance are the key to understanding minimax rates of convergence, as you will learn in Part III.
For reasonable parametrizations, in classical finite-dimensional settings, the intrinsic measures usually tell the same story as the Θ metric, as explained in Chapter 4.

[tot.var]

2.  Total variation

Let μ be a finite signed measure on a measure space (X, A). Remember that M denotes the set of all real, A-measurable functions on X, and that M^+ is the cone of nonnegative functions in M.


The total variation norm of μ is defined as

    v(μ) = sup Σ_i |μ A_i|,

where the supremum runs over all finite partitions {A_i} of X into disjoint, A-measurable sets. In fact, there is no need to consider partitions into more than two sets: for if A = ∪{A_i : μ A_i ≥ 0} then

    Σ_i |μ A_i| ≤ μA − μA^c = |μA| + |μA^c|.

If both μ_1 and μ_2 are absolutely continuous with respect to a measure λ, such as λ = μ_1 + μ_2, with densities m_1 and m_2, then μ = μ_1 − μ_2 has density m = m_1 − m_2 and v(μ) = λ|m|:

    Σ_i |μ A_i| = Σ_i |λ(m A_i)| ≤ Σ_i λ(|m| A_i) = λ|m|,

with equality when A_1 = {m ≥ 0} = A_2^c. That is, v(μ) = μ{m ≥ 0} − μ{m < 0}.
The total variation is often called the L^1 norm of μ, and is also denoted by ‖μ‖_1. Note that v(μ), and hence ‖μ‖_1, does not depend on the choice of the dominating measure λ.
The total variation v(μ) is also equal to sup_{|f|≤1} |μ f|, the supremum running over all A-measurable functions f bounded in absolute value by 1:

    |μ f| = |λ(m f)| ≤ λ|m|    if |f| ≤ 1,

with equality when f = {m ≥ 0} − {m < 0} (sets being identified with their indicator functions).


When (X) = 0, there are some slight simplifications in the formulae
for v(). In that case, 0 = m = m + m and hence
v() = 1 = 2m + 2m = 2{m 0} = 2 sup A
A

RA

As a special case, for probability measures P1 and P2 , with densities p1 and p2


with respect to ,
v() = 1 = 2( p1 p2 )+ = 2( p2 p1 )+
= 2 sup (P1 A P2 A) = 2 sup (P1 A P2 A)
A

tot.var.special

<2>

= 2 sup |P1 A P2 A|
A

multiv.normal

Many authors, no doubt with the special case foremost in their minds, define
the total variation as sup A |P1 A P2 A|. Beware! An unexpected extra factor
of 2 can cause confusion.
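To see the factor of 2 concretely, here is a small numerical sketch (mine, not part of the text; it assumes numpy is available). It compares Σ_x |P{x} − Q{x}| with 2 sup_A |PA − QA| for two distributions on a six-point set, the supremum being computed by brute force over all subsets.

    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    p = rng.dirichlet(np.ones(6))    # pmf of P
    q = rng.dirichlet(np.ones(6))    # pmf of Q

    l1 = np.sum(np.abs(p - q))       # ||P - Q||_1, with lambda = counting measure
    sup_diff = max(abs(p[list(A)].sum() - q[list(A)].sum())
                   for r in range(7) for A in itertools.combinations(range(6), r))
    print(l1, 2 * sup_diff)          # the two numbers coincide, as in <2>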

multiv.normal
<3>    Example. Let P_1 denote the N(μ_1, I_n) multivariate normal distribution and P_2 denote the N(μ_2, I_n), with μ_1 ≠ μ_2. Define δ = |μ_1 − μ_2|/2 and u = (μ_2 − μ_1)/|μ_2 − μ_1|. Note that u'(x − μ_1) has a N(0, 1) distribution under P_1, and a N(2δ, 1) distribution under P_2. The density of P_1 − P_2 with respect to Lebesgue measure is nonnegative in the halfspace A_0 = {x : u'(x − μ_1) ≤ δ}. Thus

    ‖P_1 − P_2‖_1 = 2(P_1 − P_2)A_0
        = 2 P{N(0, 1) ≤ δ} − 2 P{N(2δ, 1) ≤ δ}
        = 2 P{|N(0, 1)| ≤ δ}
        = 4δ/√(2π) + O(δ^2)    as δ → 0.

For small |μ_1 − μ_2|, the total variation distance is approximately √(2/π) |μ_1 − μ_2|.


The rate of convergence in the last Example is typical. Consider for example a family of probability measures {P_θ : θ ∈ R^k} with densities {f_θ} with respect to a measure λ. Suppose the family of densities is differentiable in L^1(λ) norm at θ. That is, suppose there is an integrable (vector-valued) function ḟ_θ for which

    λ| f_{θ+t} − f_θ − t'ḟ_θ | = o(|t|)    as t → 0.

Then, writing u for the unit vector t/|t|, we have

    ‖P_{θ+t} − P_θ‖_1 / |t| = λ| f_{θ+t} − f_θ | / |t| = λ|u'ḟ_θ| + o(1).

For the N(θ, I_n) densities, φ(x − θ), the pointwise derivative (x − θ)φ(x − θ) is also the derivative in L^1 norm, which gives

    ‖N(θ + t, I_n) − N(θ, I_n)‖_1 / |t| = o(1) + λ|u'(x − θ)φ(x − θ)|
        = o(1) + P|u'N(0, I_n)|
        = o(1) + √(2/π),

because u'N(0, I_n) is N(0, 1) distributed, and P|N(0, 1)| = √(2/π).
Exact calculation of total variation distances in closed form can be difficult, if not impossible. Bounds and approximations often suffice.
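For a quick numerical illustration of these rates (my sketch, not part of the text; it assumes scipy is available), the exact one-dimensional distance from Example <3>, ‖N(θ, 1) − N(0, 1)‖_1 = 2 P{|N(0, 1)| ≤ θ/2}, can be compared with the approximation √(2/π) θ:

    import numpy as np
    from scipy.stats import norm

    for theta in (0.5, 0.1, 0.01):
        exact = 2 * (norm.cdf(theta / 2) - norm.cdf(-theta / 2))   # Example <3> with n = 1
        approx = np.sqrt(2 / np.pi) * theta
        print(theta, exact, approx)    # the ratio tends to 1 as theta -> 0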
bin.poisson.L1
<4>    Exercise. Let P denote the Bin(n, p) distribution and Q denote the Poisson(np) distribution. Show that ‖P − Q‖_1 ≤ 2p.
Solution: Without loss of generality assume 0 < p < 1. For x = 0, 1, . . . , n define

    g(x) := Q{x}/P{x} = [ e^{−np}(np)^x / x! ] / [ (n choose x) p^x (1 − p)^{n−x} ]
          = e^{−np} (1 − p)^{x−n} Π_{i=1}^{x−1} (1 − i/n)^{−1}.

Then

    ‖P − Q‖_1 = 2 Σ_{x=0}^{n} (P{x} − Q{x})^+ = 2 P^x (1 − g(x))^+,

the last expectation taken over x distributed according to P. It suffices to show that g(x) ≥ 1 − (x/n), for then the last expected value is bounded by 2 P^x(x/n) = 2p.
The lower bound for g(x) is trivial when x = n. For other values of x note that

    log[ g(x) / (1 − x/n) ] = −np − (n − x) log(1 − p) − Σ_{i=1}^{x} log(1 − i/n).

Bound the sum of logarithmic terms from below, using the following inequality from Prerequisites [5]: Let h be a (necessarily convex) function defined on [0, 1] with increasing derivative h'. For each integer x with 1 ≤ x ≤ n − 1,

lower.integral
<5>    h(x/n) − h(0) ≤ (1/n) Σ_{i=1}^{x} h'(i/n)
upper.integral
<6>    (1/n) Σ_{i=1}^{x} h'(i/n) ≤ h((x + 1)/n) − h(1/n).

To get h'(t) = −log(1 − t) for 0 ≤ t < 1 define

    h(t) = t + (1 − t) log(1 − t) = Σ_{k=2}^{∞} t^k / (k(k − 1))    for 0 ≤ t ≤ 1.

From <5>,

    −Σ_{i=1}^{x} log(1 − i/n) = Σ_{i=1}^{x} h'(i/n) ≥ n[ h(x/n) − h(0) ] = n h(x/n).

The inequality is also valid when x = 0, for then both sides equal zero. Thus, for x = 0, 1, . . . , n − 1,

    log[ g(x) / (1 − x/n) ] ≥ −np − (n − x) log(1 − p) + n h(x/n).

As a convex function of a continuous variable x in [0, n], the lower bound achieves its minimum when h'(x/n) = −log(1 − p), that is, when x = np. The lower bound is therefore everywhere no smaller than

    −np − n(1 − p) log(1 − p) + n h(p) = −n h(p) + n h(p) = 0.

It follows that g(x) ≥ 1 − (x/n), as asserted.
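A brute-force check of the bound (my sketch, not part of the text; it assumes scipy): compute ‖Bin(n, p) − Poisson(np)‖_1 by summing |P{x} − Q{x}| far into the Poisson tail and compare with 2p.

    import numpy as np
    from scipy.stats import binom, poisson

    def tv_bin_poisson(n, p, tail=2000):
        xs = np.arange(0, n + tail)
        return np.sum(np.abs(binom.pmf(xs, n, p) - poisson.pmf(xs, n * p)))

    for n, p in [(10, 0.3), (50, 0.1), (200, 0.02)]:
        print(n, p, tv_bin_poisson(n, p), 2 * p)   # the distance stays below 2p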

[affinity]

3.  The affinity between two probabilities

[Margin note: also need affinity for finite measures in Chapter 7.]

Let P and Q be probability measures with densities p and q with respect to a dominating measure λ. The affinity α_1(P, Q) between P and Q is defined as λ(p ∧ q). For each pair of functions f, g in M^+ for which f + g ≥ 1,

    P f + Q g = λ(p f + q g) ≥ λ( (f + g)(p ∧ q) ) ≥ λ(p ∧ q) = α_1(P, Q),

with equality when f = {p ≤ q} and g = {p > q}, because then p f + q g = p ∧ q. Thus

    α_1(P, Q) = inf{ P f + Q g : f, g ∈ M^+, f + g ≥ 1 },

which shows that the definition of the affinity does not depend on the particular choice of dominating measure. The form of the minimizing f and g also shows that the affinity equals the minimum of PA + QA^c over all A in A. The minimizing set has the statistical interpretation of the rejection region for the (nonrandomized) test between the two hypotheses P and Q that minimizes the sum of the type one and type two errors.
The affinity is also closely related to the total variation distance: integrating the equality p ∧ q = ½(p + q − |p − q|) we get

    α_1(P, Q) = ½ λ(p + q − |p − q|) = 1 − ½ ‖P − Q‖_1.

The function p ∧ q, when interpreted as a density with respect to λ, defines a nonnegative measure ν for which ν f ≤ min(P f, Q f) for all f ∈ M^+. That is, ν ≤ P and ν ≤ Q as measures. In fact ν is the largest measure with this property. For if μ, with density m, is another measure with the same property then, for all f ∈ M^+,

    μ f = μ( f {p ≥ q} ) + μ( f {p < q} ) ≤ Q( f {p ≥ q} ) + P( f {p < q} ).

That is,

    λ( f m ) ≤ λ( f q{p ≥ q} + f p{p < q} ) = λ( f (p ∧ q) )    for all f ∈ M^+.

It follows that m ≤ p ∧ q almost everywhere mod[λ], whence μ ≤ ν. The measure ν is also denoted by P ∧ Q, and is called the (lattice theoretic) minimum of P and Q. In summary,

aff.L1
<7>    α_1(P, Q) = ‖P ∧ Q‖_1 = 1 − ½ ‖P − Q‖_1


for all probability measures P and Q defined on the same sigma-field.
Note that (P ∧ Q)(A) is, in general, strictly smaller than min(PA, QA). Indeed, the set function A ↦ min(PA, QA) is not even finitely additive. The lattice theoretic minimum is not the same as the setwise minimum.
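The testing interpretation is easy to check numerically on a finite set. The sketch below (mine, not part of the text; numpy only) computes α_1(P, Q) three ways: as λ(p ∧ q), as 1 − ½‖P − Q‖_1, and as the minimum of PA + QA^c over all rejection regions A.

    import itertools
    import numpy as np

    rng = np.random.default_rng(1)
    p = rng.dirichlet(np.ones(5))    # density of P w.r.t. counting measure
    q = rng.dirichlet(np.ones(5))    # density of Q

    aff_min = np.minimum(p, q).sum()                 # lambda(p ^ q)
    aff_l1 = 1 - 0.5 * np.abs(p - q).sum()           # 1 - ||P - Q||_1 / 2
    aff_test = min(p[list(A)].sum() + 1 - q[list(A)].sum()
                   for r in range(6) for A in itertools.combinations(range(5), r))
    print(aff_min, aff_l1, aff_test)                 # all three agree, as in <7>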
[hellinger]

4.  Hellinger distance

Let P and Q be probability measures with densities p and q with respect to a dominating measure λ. The square roots of the densities, √p and √q, are both square integrable; they both belong to L^2(λ). The Hellinger distance between the two measures is defined as the L^2 distance between the square roots of their densities,

    H(P, Q)^2 = λ(√p − √q)^2 = λ(p + q − 2√(pq)) = 2 − 2λ√(pq).
It is easy to show that the integral defining the Hellinger distance does not depend on the choice of dominating measure (Problem [2]). The quantity λ√(pq) is called the Hellinger affinity between the two measures, and is denoted by α_2(P, Q).
The Hellinger distance satisfies the inequality 0 ≤ H(P, Q) ≤ √2. Some authors prefer to have an upper bound of 1; they include an extra factor of one half in the definition of H(P, Q)^2. The equality at 0 occurs when p = q almost surely mod[λ], that is, when P = Q as measures on A. Equality at √2 occurs when the Hellinger affinity is zero, that is, when pq = 0 almost surely mod[λ], which is the condition that P and Q be supported by disjoint subsets of X. For example, discrete distributions (concentrated on a countable set) are always at the maximum Hellinger distance from nonatomic distributions (zero mass at each point).
From the pointwise inequality √(pq) ≥ p ∧ q it follows that α_2(P, Q) ≥ α_1(P, Q), and hence, via <7>, that

H2.L1
<8>    H(P, Q)^2 ≤ ‖P − Q‖_1.

The last bound also follows directly from the inequality

    λ(√p − √q)^2 ≤ λ( |√p − √q| (√p + √q) ) = λ|p − q|.

The Cauchy-Schwarz inequality gives a companion lower bound:

    ‖P − Q‖_1^2 = ( λ( |√p − √q| |√p + √q| ) )^2 ≤ λ(√p − √q)^2 λ(√p + √q)^2 = H(P, Q)^2 ( 2 + 2α_2(P, Q) ).

Substitute for the Hellinger affinity to get

L1.hellinger
<9>    ‖P − Q‖_1 ≤ H(P, Q) √(4 − H(P, Q)^2) ≤ 2 H(P, Q) ≤ 2 ‖P − Q‖_1^{1/2}    by <8>.

Both the Hellinger distance and the total variation norm define bounded metrics on the space of all probability measures on A. From <9>, they define the same topology for convergence of probability measures.
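The chain of inequalities in <8> and <9> can be tested numerically. A small sketch (mine, not part of the text; numpy only), using random discrete distributions on the same finite set:

    import numpy as np

    rng = np.random.default_rng(2)
    for _ in range(3):
        p = rng.dirichlet(np.ones(8))
        q = rng.dirichlet(np.ones(8))
        l1 = np.abs(p - q).sum()                          # ||P - Q||_1
        h2 = np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)       # H(P, Q)^2
        h = np.sqrt(h2)
        print(h2 <= l1 <= h * np.sqrt(4 - h2) <= 2 * h <= 2 * np.sqrt(l1))   # always True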
[KL]

5.  Relative entropy

Let P and Q be two probability measures with densities p and q with respect to some dominating measure λ. The relative entropy (also known as the Kullback-Leibler distance, even though it is not a metric) between P and Q is defined as D(P‖Q) = λ( p log(p/q) ).
At first sight, it is not obvious that the definition cannot suffer from the ∞ − ∞ problem. Indeed, it is not immediately obvious that the negative part of p log(p/q) must be integrable. A Taylor expansion comes to the rescue: for x > 0,

xlogx
<10>    x log x = x − 1 + ½ (x − 1)^2 / x*    for some x* between 1 and x.

When p > 0 and q > 0, put x = p/q, discard the nonnegative quadratic remainder term, then multiply through by q to get

    p log(p/q) ≥ p − q.

The same inequality also holds at points where q = 0 and p > 0, with the left-hand side interpreted as +∞; and at points where p = 0 we get no contribution to the defining integral. It follows that λ( ( p log(p/q) )^− ) < ∞ and D(P‖Q) ≥ λ(p − q) = 0. That is, the relative entropy is well defined and nonnegative. These conclusions would also follow, perhaps more directly, via Jensen's inequality. I prefer to argue via the Taylor expansion because, as you will soon see, with refinements on the remainder term we get better lower bounds for the relative entropy.
If λ{p > 0 = q} > 0 then p log(p/q) is infinite on a set of positive λ measure, which forces D(P‖Q) = ∞. That is, the relative entropy is infinite unless P is absolutely continuous with respect to Q. It can also be infinite even if P and Q are mutually absolutely continuous (Problem [4]).
As with the total variation and Hellinger distances, the relative entropy does not depend on the choice of the dominating measure (Problem [3]).
It is easy to deduce from the conditions for equality in Jensen's inequality that D(P‖Q) = 0 if and only if P = Q. An even stronger assertion follows from an inequality relating relative entropy to Hellinger distance.

re.hellinger
<11>    Lemma. For probabilities P and Q on the same space, D(P‖Q) ≥ H^2(P, Q).

Proof. This inequality is trivially true unless P is absolutely continuous with respect to Q, in which case we can take the dominating measure equal to Q. Define ξ = √p − 1. Note that Qξ^2 = H^2(P, Q) and

    1 = Qp = Q(1 + ξ)^2 = 1 + 2Qξ + Qξ^2,

which implies that 2Qξ = −H^2(P, Q). Hence

    D(P‖Q) = Q( p log p )
           = 2 Q( (1 + ξ)^2 log(1 + ξ) )
           ≥ 2 Q( (1 + ξ)^2 ξ/(1 + ξ) )        because log(1 + t) ≥ t/(1 + t) for t > −1
           = 2Qξ + 2Qξ^2
           = H^2(P, Q),

as asserted.
In a similar vein, there is a lower bound for the relative entropy involving the L^1-distance, due to Csiszár (1967), Kullback (1967), and Kemperman (1969). [Margin note: check citation.]

re.L1
<12>    Lemma. For probabilities P and Q on the same space, D(P‖Q) ≥ ½ ‖P − Q‖_1^2.

Proof. Recall from Prerequisites [5] that

    (1 + x) log(1 + x) − x ≥ ½ x^2 / (1 + x/3)    for x ≥ −1.

To establish the inequality asserted by the Lemma, we may assume, with no loss of generality (compare with the proof of Lemma <11>), that P is absolutely continuous with respect to Q. This time write 1 + ξ for the density. Notice that Qξ = 0 and thus

    D(P‖Q) = Q( (1 + ξ) log(1 + ξ) − ξ ) ≥ ½ Q( ξ^2 / (1 + ξ/3) ).

Multiply the right-hand side by 1 = Q(1 + ξ/3), then invoke the Cauchy-Schwarz inequality to bound the product from below by half the square of

    Q( ( |ξ| / √(1 + ξ/3) ) √(1 + ξ/3) ) = Q|ξ| = ‖P − Q‖_1.

The asserted inequality <12> follows.
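Both lower bounds are easily confirmed numerically. A short sketch (mine, not part of the text; numpy only) for random discrete P and Q supported on the same finite set:

    import numpy as np

    rng = np.random.default_rng(3)
    p = rng.dirichlet(np.ones(10))
    q = rng.dirichlet(np.ones(10))

    kl = np.sum(p * np.log(p / q))                  # D(P||Q)
    h2 = np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)     # H^2(P, Q)
    l1 = np.sum(np.abs(p - q))                      # ||P - Q||_1
    print(kl >= h2, kl >= 0.5 * l1 ** 2)            # both True, as in <11> and <12>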


bin.poisson.re
<13>    Exercise. Let P denote the Bin(n, p) distribution and Q denote the Poisson(np) distribution, as in Exercise <4>. Show that

    D(P‖Q) ≤ −log(1 − p) − p + p^2/2 = p^2 + O(p^3).

Solution: Using the same notation as in Exercise <4>, we have

    D(P‖Q) = −P^x log g(x)
           = P^x( np + (n − x) log(1 − p) + Σ_{i=1}^{x−1} log(1 − i/n) )
           = n h(p) − P^x Σ_{i=1}^{x−1} h'(i/n).

The sum inside the last expectation should be interpreted as zero if x equals 0 or 1. The lower bound <5>, that is,

hprime.lower
<14>    (1/n) Σ_{i=1}^{x−1} h'(i/n) ≥ h( (x − 1)/n ),

is also valid for x = 0, 1 if we extend h to have h(t) = 0 for t < 0. Take expectations, then invoke Jensen's inequality:

    D(P‖Q) ≤ n h(p) − n P^x h( (x − 1)/n ) ≤ n h(p) − n h( p − 1/n ).

By the Mean Value Theorem, the last difference equals h'(p*) for some p* between p − 1/n and p. Because h' is increasing, the bound is less than

    h'(p) = −log(1 − p) = p + p^2/2 + ⋯

The bound decreases at the rate p as p tends to zero, more slowly than the asserted p^2 rate. What went wrong?
The source of the problem is <14>. It approximates the sum of logarithmic terms too crudely. We need to break out the linear contribution to h'. Define

    h_1(t) = h(t) − t^2/2 = Σ_{k=3}^{∞} t^k / (k(k − 1)),

which has derivative h_1'(t) = h'(t) − t = −log(1 − t) − t. Invoke the analog of <14>, with h_1 in place of h, to refine the lower bound to

    (1/n) Σ_{i=1}^{x−1} h'(i/n) = (1/n) Σ_{i=1}^{x−1} h_1'(i/n) + (1/n) Σ_{i=1}^{x−1} (i/n)
                                ≥ h_1( (x − 1)/n ) + x(x − 1)/(2n^2).

Direct calculation of moments gives P x(x − 1) = n(n − 1) p^2. Arguing once again via Jensen's inequality we now get

    D(P‖Q) ≤ n h_1(p) + n p^2/2 − n h_1( p − 1/n ) − (n − 1) p^2/2
           ≤ h_1'(p) + p^2/2
           = −log(1 − p) − p + p^2/2
           = p^2 + p^3/3 + ⋯,

as asserted.
From the bound on D(P‖Q) in the last Exercise and from Lemma <11> we also get

    H^2( Bin(n, p), Poisson(np) ) ≤ ψ(p) := −log(1 − p) − p + p^2/2.

As p increases, the upper bound ψ(p) eventually exceeds 2, the largest possible value for a squared Hellinger distance. In fact ψ(p_0) = 2 for p_0 ≈ 0.918. Using the fact that ψ(p)/p^2 is also increasing in p, we can replace the upper bound by p^2 ψ(p_0)/p_0^2, leaving

    H^2( Bin(n, p), Poisson(np) ) ≤ 2 p^2 / p_0^2 ≈ (1.5 p)^2,

a slightly neater expression than ψ(p).
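A numerical check of the Exercise (my sketch, not part of the text; it assumes scipy): compute D(Bin(n, p) ‖ Poisson(np)) directly and compare it with ψ(p) = −log(1 − p) − p + p^2/2.

    import numpy as np
    from scipy.stats import binom, poisson

    def kl_bin_poisson(n, p):
        xs = np.arange(0, n + 1)
        px = binom.pmf(xs, n, p)
        qx = poisson.pmf(xs, n * p)
        keep = px > 0
        return np.sum(px[keep] * np.log(px[keep] / qx[keep]))

    for n, p in [(20, 0.05), (100, 0.05), (100, 0.2)]:
        psi = -np.log1p(-p) - p + p ** 2 / 2
        print(n, p, kl_bin_poisson(n, p), psi)   # the relative entropy stays below psi(p)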

[product.measure]

6.  Product measures

Suppose P = P_1 ⊗ P_2 and Q = Q_1 ⊗ Q_2, product measures on X_1 × X_2. If both P_i and Q_i are dominated by λ_i, with corresponding densities p_i and q_i, then

    α_2(P, Q) = λ_1 λ_2 √( p_1(x_1) p_2(x_2) q_1(x_1) q_2(x_2) )
              = λ_1 √( p_1(x_1) q_1(x_1) ) · λ_2 √( p_2(x_2) q_2(x_2) )
              = α_2(P_1, Q_1) α_2(P_2, Q_2).

A similar factorization holds for products of more than two measures. This factorization is the chief reason for the great usefulness of Hellinger distance when working with product measures. In particular, it gives a most convenient way to bound total variation distances. By contrast, the affinity α_1(P, Q) enjoys no comparable factorization, because minima of products do not factorize into products of minima. It is awkward to deal directly with total variation distance between product measures.

hell.product
<15>    Lemma. For probability measures {P_i} and {Q_i},

    H(P_1 ⊗ ⋯ ⊗ P_n, Q_1 ⊗ ⋯ ⊗ Q_n)^2 = 2 − 2 Π_{i≤n} ( 1 − ½ H(P_i, Q_i)^2 ) ≤ Σ_{i=1}^{n} H(P_i, Q_i)^2.

Proof. The product term comes from the factorization of the affinity between the two product measures.
To establish the upper bound, write y_i for H(P_i, Q_i)^2 / 2. We need to show that the function

    G_n(y_1, . . . , y_n) = Σ_{i=1}^{n} y_i + Π_{i=1}^{n} (1 − y_i) − 1

is nonnegative for all 0 ≤ y_i ≤ 1. The lower bound of 0 is achieved when n = 1. For fixed y_1, . . . , y_{n−1}, the function G_n is linear in y_n, achieving its minimum at either y_n = 0 or y_n = 1. Thus G_n is at least

    min( G_n(y_1, . . . , y_{n−1}, 0), G_n(y_1, . . . , y_{n−1}, 1) ) ≥ G_{n−1}(y_1, . . . , y_{n−1}).

An inductive argument completes the proof.

product.limit
<16>    Corollary. If P_n^{⊗n} and Q_n^{⊗n} denote the n-fold products of identical factors P_n and Q_n, and if √n H(P_n, Q_n) → c as n → ∞, then H(P_n^{⊗n}, Q_n^{⊗n})^2 → 2 − 2 exp(−c^2/2).

The Corollary is the basis for a minor industry in the calculation of minimax rates of convergence of estimators, as will be explained in Chapter 14.
Relative entropies between product measures also factorize.

KL.product
<17>    Lemma. For probability measures {P_i} and {Q_i},

    D( P_1 ⊗ ⋯ ⊗ P_n ‖ Q_1 ⊗ ⋯ ⊗ Q_n ) = Σ_{i≤n} D(P_i ‖ Q_i).

Proof. Without loss of generality assume that P_i is absolutely continuous with respect to Q_i, with density p_i, for each i. Then the left-hand side of the asserted equality equals

    (Q_1 ⊗ ⋯ ⊗ Q_n)( Π_{i≤n} p_i(x_i) · Σ_{i≤n} log p_i(x_i) ),

which factorizes to give the right-hand side.
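Both factorizations can be verified directly for small discrete examples by building the product measures explicitly. A sketch (mine, not part of the text; numpy and functools only):

    import numpy as np
    from functools import reduce

    rng = np.random.default_rng(4)
    Ps = [rng.dirichlet(np.ones(3)) for _ in range(4)]
    Qs = [rng.dirichlet(np.ones(3)) for _ in range(4)]

    def product_pmf(ms):
        # joint pmf of the product measure, flattened into a single vector
        return reduce(lambda a, b: np.outer(a, b).ravel(), ms)

    P, Q = product_pmf(Ps), product_pmf(Qs)
    h2 = lambda p, q: np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)
    kl = lambda p, q: np.sum(p * np.log(p / q))

    h2_marg = [h2(p, q) for p, q in zip(Ps, Qs)]
    print(h2(P, Q), 2 - 2 * np.prod([1 - h / 2 for h in h2_marg]), sum(h2_marg))  # Lemma <15>
    print(kl(P, Q), sum(kl(p, q) for p, q in zip(Ps, Qs)))                        # Lemma <17>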

[boundL1]

7.  Second-moment bounds on total variation distance

Particularly for probability measures P and Q that are close, we often need only upper bounds on total variation distance. If both measures are dominated by a probability measure μ, then

L1.v.L2
<18>    ‖P − Q‖_1^2 ≤ μ(p − q)^2.

Notice that the right-hand side depends on the choice of μ, whereas the left-hand side does not. Often it will be convenient to choose μ = P or μ = (P + Q)/2.
The second-moment upper bound is often of the correct order of magnitude for a well chosen μ. For example, suppose dQ/dP = 1 + Δ, with Δ small enough to justify integration of the expansion

    (1 + Δ)^{1/2} = 1 + ½Δ − ⅛Δ^2 + ⋯

to give P(1 + Δ)^{1/2} ≈ 1 − PΔ^2/8, because PΔ = 0. Then

    H^2(P, Q) = 2 − 2P(1 + Δ)^{1/2} ≈ ¼ PΔ^2.

More precisely, if there exists a constant C such that √p + √q ≤ C everywhere then

    H^2(P, Q) = μ( |p − q|^2 / |√p + √q|^2 ) ≥ μ|p − q|^2 / C^2,

and if there exists a constant c such that √p + √q ≥ c everywhere then

    H^2(P, Q) ≤ μ|p − q|^2 / c^2.

In particular, if μ = (P + Q)/2 then p + q = 2, so that √2 ≤ √p + √q ≤ 2 and

quadratic.average
<19>    μ|p − q|^2 / 4 ≤ H^2(P, Q) ≤ μ|p − q|^2 / 2    if μ = (P + Q)/2.

See Problem [5] for a comparison between P and (Q + P)/2 as dominating measures.
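These bounds are easy to test on discrete examples. In the sketch below (mine, not part of the text; numpy only) the densities p and q are taken with respect to μ = (P + Q)/2, so that p + q = 2 pointwise:

    import numpy as np

    rng = np.random.default_rng(5)
    P = rng.dirichlet(np.ones(7))
    Q = rng.dirichlet(np.ones(7))
    M = (P + Q) / 2
    p, q = P / M, Q / M                 # densities w.r.t. mu = (P + Q)/2

    l1 = np.abs(P - Q).sum()
    h2 = np.sum((np.sqrt(P) - np.sqrt(Q)) ** 2)
    quad = np.sum(M * (p - q) ** 2)     # mu(p - q)^2

    print(l1 ** 2 <= quad)              # <18>
    print(quad / 4 <= h2 <= quad / 2)   # <19>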
normal.mixture
<20>    Example. As shown in Example <3>, and the explanation that follows that Example, the total variation distance between the N(θ, 1) and the N(0, 1) distributions decreases like √(2/π) |θ| as θ → 0. More precisely,

    φ(x − θ) = φ(x) + θ x φ(x) + ½ θ^2 (x^2 − 1) φ(x) + ⋯

so that

    ∫ |φ(x − θ) − φ(x)| dx = |θ| ∫ |x| φ(x) dx + O(θ^2).

A similar argument suggests that the mixture P_θ = ½ N(θ, 1) + ½ N(−θ, 1) converges to the N(0, 1) at an even faster rate:

    ½( φ(x − θ) + φ(x + θ) ) = φ(x) + ½ θ^2 (x^2 − 1) φ(x) + ⋯

so that

    ∫ | ½ φ(x − θ) + ½ φ(x + θ) − φ(x) | dx = ½ θ^2 ∫ |x^2 − 1| φ(x) dx + O(θ^4).

Integration by parts gives ½ ∫ |x^2 − 1| φ(x) dx = 2φ(1) ≈ 0.48.
It is not too difficult to make these calculations rigorous. The second moment bound gives the same rate of convergence even more easily:

    ‖P_θ − P_0‖_1^2 ≤ P_0 | dP_θ/dP_0 − 1 |^2 = P_0 ( dP_θ/dP_0 )^2 − 1
        = ¼ P_0 ( exp(θx − θ^2/2) + exp(−θx − θ^2/2) )^2 − 1
        = ½ ( exp(θ^2) + exp(−θ^2) ) − 1
        = θ^4/2! + θ^8/4! + ⋯

The bound on the distance ‖P_θ − P_0‖_1 decreases like θ^2/√2, an overestimate by a constant factor of approximately 1.5.
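A numerical check of the mixture calculation (my sketch, not part of the text; scipy for the normal density): the exact distance behaves like 2φ(1)θ^2 ≈ 0.48θ^2, while the second-moment bound gives θ^2/√2 ≈ 0.71θ^2.

    import numpy as np
    from scipy.stats import norm

    x = np.linspace(-12, 12, 200001)
    dx = x[1] - x[0]
    for theta in (0.5, 0.2, 0.1):
        mix = 0.5 * norm.pdf(x - theta) + 0.5 * norm.pdf(x + theta)
        tv = np.sum(np.abs(mix - norm.pdf(x))) * dx        # ||P_theta - P_0||_1 by quadrature
        print(theta, tv, 2 * norm.pdf(1) * theta ** 2, theta ** 2 / np.sqrt(2))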

Often the second moment method reduces calculations of bounds on total variation distances to calculations of variances and covariances.

L2.product
<21>    Exercise. Let P = ⊗_{i≤n} P_i and Q = ⊗_{i≤n} Q_i be finite products of probability measures such that Q_i has density 1 + δ_i(x_i) with respect to P_i. Show that

    ‖P − Q‖_1^2 ≤ Π_{i≤n} ( 1 + P_i δ_i^2 ) − 1.

Solution: Notice that P_i δ_i = 0 because both P_i and Q_i are probabilities. We may assume each P_i δ_i^2 finite, for otherwise the asserted inequality is trivial. From <18>,

    ‖P − Q‖_1^2 ≤ P( Π_{i≤n} (1 + δ_i) − 1 )^2
               = P( Σ_i δ_i + Σ_{{i,j}} δ_i δ_j + Σ_{{i,j,k}} δ_i δ_j δ_k + ⋯ )^2.

The notation {i, j} indicates that the sum runs over all pairs of distinct integers i and j between 1 and n, each pair appearing once; the notation {i, j, k} refers to summation over triples of integers, all different; and so on.
When the squared term is expanded, any unpaired δ_i factor is annihilated by the corresponding P_i marginal of P. Only the squared terms survive, reducing the expansion to

    Σ_i P_i δ_i^2 + Σ_{{i,j}} P_i δ_i^2 P_j δ_j^2 + ⋯,

the asserted upper bound.

The upper bound in the Exercise decreases like Σ_i P_i δ_i^2 when the sum is small. In situations where P_i δ_i^2 behaves like H^2(P_i, Q_i), the second moment bound is comparable to the analogous bound for Hellinger distance:

    H^2(P, Q) ≤ Σ_{i=1}^{n} H(P_i, Q_i)^2.
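The bound of Exercise <21> can be checked by brute force for a small product of discrete measures (my sketch, not part of the text; numpy and functools only). The perturbations δ_i are centered so that P_i δ_i = 0.

    import numpy as np
    from functools import reduce

    rng = np.random.default_rng(6)
    Ps, Qs, deltas = [], [], []
    for _ in range(4):                       # four coordinates, three points each
        Pi = rng.dirichlet(np.ones(3))
        d = rng.uniform(-0.4, 0.4, size=3)
        d -= np.sum(Pi * d)                  # center: P_i delta_i = 0, and 1 + delta_i > 0
        Ps.append(Pi); deltas.append(d); Qs.append(Pi * (1 + d))

    product_pmf = lambda ms: reduce(lambda a, b: np.outer(a, b).ravel(), ms)
    P, Q = product_pmf(Ps), product_pmf(Qs)
    lhs = np.abs(P - Q).sum() ** 2                                             # ||P - Q||_1^2
    rhs = np.prod([1 + np.sum(Pi * d ** 2) for Pi, d in zip(Ps, deltas)]) - 1
    print(lhs <= rhs, lhs, rhs)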

The second moment method also works for situations where it becomes exceedingly difficult to calculate Hellinger distances directly. The calculation of distances between mixtures of product measures, which will be the key to finding minimax rates of convergence in Chapter 14, is a good illustration.
Suppose P is a product probability measure on X^n. For each α in a finite set A, suppose the probability Q_α is also a product measure, obtained by a small perturbation of P,

    dQ_α/dP = Π_{i≤n} ( 1 + δ_{α,i}(x_i) ),

where P δ_{α,i}(x_i) = 0 for each α and i, to give Q_α total mass one. Let {w_α : α ∈ A} be a finite set of nonnegative weights that sum to one. To a Bayesian, the weights w_α would define a prior distribution on A.
Let Q_0 be another product probability measure, with

    dQ_0/dP = Π_{i≤n} ( 1 + δ_{0,i}(x_i) ).

The index 0 should be understood as not belonging to A, even if Q_0 happens to coincide with Q_α for some α in A.
A small extension of the method from Exercise <21> gives a very useful bound on the total variation distance between the mixture Σ_{α∈A} w_α Q_α and Q_0. The bound again involves the function

    S(x_1, . . . , x_n) = −1 + Π_{i≤n} (1 + x_i) = Σ_i x_i + Σ_{{i,j}} x_i x_j + ⋯ + x_1 x_2 ⋯ x_n.

L2.product.mixture
<22>    Lemma. With notation as above, define γ_{α,β}, for α, β ∈ A ∪ {0}, to be the vector in R^n with ith component γ_{α,β}(i) = P( δ_{α,i} δ_{β,i} ), which is assumed finite. Then

    ‖ Σ_{α∈A} w_α Q_α − Q_0 ‖_1^2 ≤ Σ_{α∈A} Σ_{β∈A} w_α w_β ( S(γ_{α,β}) − 2 S(γ_{α,0}) + S(γ_{0,0}) )
                                  = Σ_{α∈A} Σ_{β∈A} w_α w_β S(γ_{α,β}) − 2 Σ_{α∈A} w_α S(γ_{α,0}) + S(γ_{0,0}).

Proof. Write δ_α for the random vector with components δ_{α,i}(x_i). Then Σ_{α∈A} w_α Q_α − Q_0 has density Σ_{α∈A} w_α ( S(δ_α) − S(δ_0) ) with respect to P. The quadratic bound <18> for the total variation distance equals

    P Σ_α Σ_β w_α w_β ( S(δ_α) S(δ_β) − S(δ_0) S(δ_β) − S(δ_α) S(δ_0) + S(δ_0)^2 ).


Consider the contribution to the expectation from the first term inside the brackets. It expands to the product

    P( ( Σ_i δ_{α,i} + Σ_{{i,j}} δ_{α,i} δ_{α,j} + Σ_{{i,j,k}} δ_{α,i} δ_{α,j} δ_{α,k} + ⋯ )
       × ( Σ_i δ_{β,i} + Σ_{{i,j}} δ_{β,i} δ_{β,j} + Σ_{{i,j,k}} δ_{β,i} δ_{β,j} δ_{β,k} + ⋯ ) ).

As in the proof of Exercise <21>, the expectation annihilates (because P δ_{α,i} = 0 for all α and i) most of the cross product terms, leaving only the terms where the i, j, . . . subscripts pair up exactly. Also P is a product measure, so expectations like

    P( δ_{α,i} δ_{α,j} δ_{α,k} δ_{β,i} δ_{β,j} δ_{β,k} )

factorize as

    P( δ_{α,i} δ_{β,i} ) P( δ_{α,j} δ_{β,j} ) P( δ_{α,k} δ_{β,k} ) = γ_{α,β}(i) γ_{α,β}(j) γ_{α,β}(k).

The sum over all such products equals S(γ_{α,β}). The other contributions are handled similarly.

For each α, Exercise <21> gives the bound

    ‖Q_α − Q_0‖_1^2 ≤ S(γ_{α,α}) ≈ Σ_i P δ_{α,i}^2

when the sum of the P δ_{α,i}^2 terms is small. Similarly, the first term in the bound from Lemma <22> is approximately

    Σ_α Σ_β w_α w_β Σ_i P( δ_{α,i} δ_{β,i} ) = Σ_i P( ( Σ_α w_α δ_{α,i} )^2 ).

With appropriate choices of perturbations {δ_{α,i}} and weights {w_α} we might hope to achieve some cancellations, similar to those for the normal mixture in Example <20>.

mammen
<23>    Example. A key step in a beautiful calculation by Mammen (1986) was the bounding of the total variation distance between a product measure P^n and a mixture

    Q̄ = (1/n) Σ_{α=1}^{n} P^{α−1} ⊗ Q ⊗ P^{n−α}.

Mammen used the second moment method with dominating measure (P + Q)/2 to obtain an upper bound in terms of H^2(P, Q). (See Problem [6] for his bound.) A similar bound is even easier to derive when Q has density 1 + δ with respect to P, with Pδ^2 < ∞.
In the notation of Lemma <22>, take the base product measure to be P^n, write Q_α for the αth term in the sum defining Q̄, and take Q_0 = P^n. Then δ_{0,i} = 0 for every i, and for 1 ≤ α ≤ n,

    δ_{α,i} = δ(x_i)    if i = α,
    δ_{α,i} = 0         otherwise.

Thus γ_{α,β}(i) = Pδ^2 if i = α = β, and it is zero otherwise. Each S(γ_{α,β}) simplifies to Pδ^2 {α = β}. From the Lemma,

    ‖Q̄ − P^n‖_1^2 ≤ (1/n^2) Σ_{α=1}^{n} Pδ^2 = Pδ^2 / n.

In the typical case where δ is bounded (so that Pδ^2 is of the same order of magnitude as H^2(P, Q), by Problem [5]) and H(P, Q) is of order O(1/√n), the bound √(Pδ^2/n) for ‖Q̄ − P^n‖_1 converges to zero at a 1/n rate. By a direct calculation of L^1 norms,

    ‖Q_α − P^n‖_1 = ‖Q − P‖_1,

which in typical parametric situations converges to zero at only a O(1/√n) rate. The mixing greatly improves the rate of convergence.
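For small n the mixture Q̄ can be built explicitly and the bound of Example <23> checked by brute force (my sketch, not part of the text; numpy and functools only):

    import numpy as np
    from functools import reduce

    rng = np.random.default_rng(7)
    n = 6
    P = rng.dirichlet(np.ones(3))            # a distribution on three points
    d = rng.uniform(-0.4, 0.4, size=3)
    d -= np.sum(P * d)                       # P(delta) = 0
    Q = P * (1 + d)                          # Q has density 1 + delta w.r.t. P

    product_pmf = lambda ms: reduce(lambda a, b: np.outer(a, b).ravel(), ms)
    Pn = product_pmf([P] * n)
    Qbar = sum(product_pmf([P] * (a - 1) + [Q] + [P] * (n - a))
               for a in range(1, n + 1)) / n

    lhs = np.abs(Qbar - Pn).sum() ** 2       # ||Qbar - P^n||_1^2
    rhs = np.sum(P * d ** 2) / n             # P(delta^2)/n, the bound from Example <23>
    print(lhs <= rhs, lhs, rhs, np.abs(Q - P).sum())   # the mixture is much closer than Q is to P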

8.  Problems

Prob.hellinger

[1]  Suppose P_1 and P_2 are probability measures with densities p_1 and p_2 with respect to a dominating measure λ. Let μ be another dominating measure. Write ℓ for the density of λ with respect to λ + μ.
    (i) Show that P_i has density p_i ℓ with respect to λ + μ.
    (ii) Show that (λ + μ)( √(p_1 ℓ) − √(p_2 ℓ) )^2 = λ( √p_1 − √p_2 )^2.
    (iii) Deduce that the integral that defines the Hellinger distance H(P_1, P_2) does not depend on the choice of dominating measure.

change.of.measure

[2]  Let P and Q be probability measures with densities p and q with respect to a sigma-finite measure λ. For fixed α ≥ 1, show that Ψ_α(P, Q) := λ| p^{1/α} − q^{1/α} |^α does not depend on the choice of dominating measure. Hint: Let μ be another sigma-finite dominating measure. Write ℓ for the density of λ with respect to λ + μ. Show that dP/d(λ + μ) = pℓ and dQ/d(λ + μ) = qℓ. Express Ψ_α(P, Q) as an integral with respect to λ + μ. Argue similarly for μ.

com.re

[3]  Adapt the argument from the previous Problem to show that the relative entropy D(P‖Q) does not depend on the choice of dominating measure.

infinite.re

[4]  Let P be the standard Cauchy distribution on the real line, and let Q be the standard normal distribution. Show that D(P‖Q) = ∞, even though P and Q are mutually absolutely continuous.

two.quadratics

[5]  Suppose a probability measure Q has density 1 + δ with respect to a probability measure P. Define M = (P + Q)/2. Write p and q for the densities of P and Q with respect to M.
    (i) Show that p = (1 + δ/2)^{−1} = 2 − q.
    (ii) Deduce that M|√p − √q|^2 ≤ 2Pδ^2. Hint: δ ≥ −1.
    (iii) If δ/2 is bounded above by a constant C, show that M|√p − √q|^2 ≥ Pδ^2 / (4(1 + C)).

orig.mammen

[6]  [Margin note: calculations need checking.] For P and Q as in Example <23>, bound ‖Q̄ − P^n‖_1^2 using the second moment method for densities with respect to μ = (P + Q)/2, by following these steps. Write q = dQ/dμ = 1 + η, so that p = dP/dμ = 1 − η. Write β for μη^2. Note that, by <19>, β ≤ H^2(P, Q).
    (i) Show that δ_{0,i} = −η for all i, and that δ_{α,i} = η if i = α, and −η otherwise.
    (ii) Deduce that γ_{0,0}(i) = β for all i; that γ_{α,β}(i) = β if α = β = i or α ≠ i ≠ β, and −β otherwise; and that γ_{α,0}(i) = −β if α = i, and β otherwise.
    (iii) Deduce that 1 + S(γ_{0,0}) = (1 + β)^n; that 1 + S(γ_{α,0}) = (1 + β)^{n−1}(1 − β); and that

        1 + S(γ_{α,β}) = (1 + β)^{n−2}(1 − β)^2    if α ≠ β
                       = (1 + β)^n                 if α = β.

    (iv) Deduce that

        ‖Q̄ − P^n‖_1^2 ≤ μ^n( q̄ − p̄ )^2 ≤ 4β( β + 1/n )(1 + β)^{n−2},

where q̄ and p̄ denote the densities of Q̄ and P^n with respect to the product measure μ^n. Compare with Mammen (1986, inequality 3.7).

9.  Notes

My definition of total variation follows Dunford & Schwartz (1958, Section III.1).
I adapted the results on the total variation and relative entropy distances between Binomial and Poisson distributions from Reiss (1993, page 25), who credited Barbour & Hall (1984) with the first result, and Falk & Reiss (1992) with the second result. [Margin note: check Barbour and Hall.]
Barbour, Holst & Janson (1992) have devoted a whole book to the topic of Poisson approximation.
References

Barbour, A. D. & Hall, P. (1984), 'On the rate of Poisson convergence', Proceedings of the Cambridge Philosophical Society 95, 473-480.
Barbour, A. D., Holst, L. & Janson, S. (1992), Poisson Approximation, Oxford University Press.
Csiszár, I. (1967), 'Information-type measures of difference of probability distributions and indirect observations', Studia Scientiarum Mathematicarum Hungarica 2, 299-318.
Dunford, N. & Schwartz, J. T. (1958), Linear Operators, Part I: General Theory, Wiley.
Falk, M. & Reiss, R.-D. (1992), 'Poisson approximation of empirical processes', Statistics & Probability Letters 14, 39-48.
Kemperman, J. H. B. (1969), 'On the optimum rate of transmitting information', in Probability and Information Theory, Springer-Verlag (Lecture Notes in Mathematics 89), pages 126-169.
Kullback, S. (1967), 'A lower bound for discrimination information in terms of variation', IEEE Transactions on Information Theory 13, 126-127.
Mammen, E. (1986), 'The statistical information contained in additional observations', Annals of Statistics 14, 665-678.
Reiss, R.-D. (1993), A Course on Point Processes, Springer-Verlag.
