
15-359: Probability and Computing

Inequalities
"The worst form of inequality is to try to make unequal things equal." - Aristotle
1. Introduction
We have already seen several times the need to approximate event probabilities. Recall the birthday
problem, for example, or balls and bins calculations where we approximated the probability of the
tail of the binomial distribution P(X ≥ k), which is explicitly given by

    P(X \ge k) = \sum_{j=k}^{n} \binom{n}{j} p^j (1-p)^{n-j}

The need to approximate this type of probability comes up in many applications. The difficulty is
that the sum involving binomial coefficients is unwieldy; we'd like to replace it by a single factor
that we can use in further analysis.
We'll now develop bounds more systematically, leading to methods that hold for general random
variables, not just the binomial. There is a tradition of naming inequalities after their inventors.
To be preserved in eponymity, devise an effective probability bound: there is still room for more!
2. Markov
We'll begin with one of the simplest inequalities, called the Markov
inequality after Andrei A. Markov. Markov was actually a student of
Chebyshev, whom we'll hear from in a moment. He is best known for
initiating the study of sequences of dependent random variables, now
known as Markov processes.

In spite of its simplicity, the Markov inequality is a very important
bound because it is used as a subroutine to derive more sophisticated
and effective bounds.

Let X be a non-negative, discrete random variable, and let c > 0 be
a positive constant. We want to derive a bound on the tail probability
P(X ≥ c); this is the total probability mass after (and including) the
point X = c. Since X is discrete, we can write E[X] = \sum_x x P_X(x), where the sum is over the set of
values x ≥ 0 taken by the random variable X. Now, we can bound this expectation from below as
    E[X] = \sum_{x} x P_X(x)
         = \sum_{0 \le x < c} x P_X(x) + \sum_{x \ge c} x P_X(x)
         \ge \sum_{x \ge c} x P_X(x)
         \ge \sum_{x \ge c} c P_X(x)
         = c \sum_{x \ge c} P_X(x)
         = c P(X \ge c)

This gives us the Markov inequality

    P(X \ge c) \le \frac{E[X]}{c}
Although we haven't yet covered continuous random variables, you may well be able to see that
this same inequality holds more generally.
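As a quick numerical sanity check (this code is my addition, not part of the original notes; the function name and parameters are illustrative), the following Python sketch compares the Markov bound E[X]/c with a Monte Carlo estimate of the tail probability for the number of heads in n fair coin flips.

```python
import random

# Monte Carlo check of Markov's inequality P(X >= c) <= E[X]/c,
# for X = number of heads in n fair coin flips (illustrative sketch).
def simulate_heads(n, trials=100_000):
    """Return head counts from independent experiments of n fair flips."""
    return [sum(random.random() < 0.5 for _ in range(n)) for _ in range(trials)]

n, c = 20, 15
samples = simulate_heads(n)
empirical_tail = sum(x >= c for x in samples) / len(samples)
markov_bound = (n / 2) / c          # E[X] = n/2 for a fair coin

print(f"empirical P(X >= {c}) ~ {empirical_tail:.4f}")
print(f"Markov bound E[X]/c   = {markov_bound:.4f}")
```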
2.1. Example
Suppose we flip a fair coin n times: what is the probability of getting more than (3/4)n heads? By
the Markov inequality, we have that

    P\left(X \ge \frac{3n}{4}\right) \le \frac{E[X]}{3n/4} = \frac{n/2}{3n/4} = \frac{2}{3}
There is something obviously wrong with this bound. In particular, it is independent of the
number of flips n. Intuitively, the probability should go to zero as n gets large.
But note that the Markov inequality involves only the mean E[X] of the random variable. It does
not depend at all on the shape or spread of the distribution, which must be related to how
fast the tail probabilities thin out. So, the inequality is typically quite weak. (This should not be
confused with the fact that it is tight in certain cases, as seen on the homework.)
Our next inequality does depend on the spread of the distribution.
3. Chebyshev
According to Kolmogorov [2], Pafnuty Chebyshev was one of the first
mathematicians to make use of random quantities and expected values,
and to study rigorous inequalities for random variables that were
valid under general conditions. In 1867 he published the paper "On
mean values" that presented the inequality commonly associated with
his name.

Unlike the Markov inequality, the Chebyshev inequality involves the
shape of a distribution through its variance. The idea is to apply
the Markov inequality to the deviation of a random variable from its
mean.
For a general random variable X we wish to bound the probability of
the event {|X - E[X]| > a}. Note that this is the same as the event
{(X - E[X])^2 > a^2}. Since Y = f(X) = (X - E[X])^2 is a non-negative random variable, we can
apply the Markov inequality to obtain

    P(|X - E[X]| > a) = P\left((X - E[X])^2 > a^2\right) \le \frac{E[(X - E[X])^2]}{a^2} = \frac{Var(X)}{a^2}
As a special case, it follows that

    P(|X - E[X]| \ge a\,\sigma(X)) \le \frac{1}{a^2}

In particular, for an arbitrary random variable, no more than 1/4 of the total probability mass can
lie two or more standard deviations away from the mean. An alternative formulation is the following:

    P(|X - E[X]| \ge a\,E[X]) \le \frac{Var(X)}{a^2 E[X]^2}
3.1. Example
Let's return to our coin flipping example. Applying the Chebyshev inequality we get

    P\left(X \ge \frac{3n}{4}\right) = P\left(X - \frac{n}{2} \ge \frac{n}{4}\right)
        \le P\left(|X - E[X]| \ge \frac{1}{2} E[X]\right)
        \le \frac{n/4}{(1/4)(n/2)^2} = \frac{4}{n}
This is better: unlike the Markov inequality, it suggests that the probability is becoming concentrated
around the mean.
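To see how much sharper this is, here is a small comparison (my own addition, not part of the notes) of the exact tail P(X ≥ 3n/4) for X ~ Binomial(n, 1/2) against the Markov and Chebyshev bounds derived above.

```python
from math import comb

# Exact binomial tail versus the Markov (2/3) and Chebyshev (4/n) bounds
# on P(X >= 3n/4) for X ~ Binomial(n, 1/2).
def binomial_upper_tail(n, p, k):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

for n in (20, 100, 400):
    k = 3 * n // 4
    exact = binomial_upper_tail(n, 0.5, k)
    print(f"n={n:4d}  exact={exact:.2e}  Markov={2/3:.3f}  Chebyshev={4/n:.4f}")
```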
4. Chernoff
We now come to one of the most powerful bounding techniques, due to
Herman Chernoff. Chernoff is known for many contributions to both
practical and theoretical statistics. The so-called Chernoff bounds
originated in a paper on statistical hypothesis testing [1], but the
resulting bounds are very widely used in computer science.

Chernoff is also known for a method called Chernoff faces for visualizing
high-dimensional data. Each variable is associated with a particular
attribute in a cartoon face: for example, one variable might be associated
with eyebrow slant, one with eye spacing, one with mouth shape,
etc. The characteristics of different data sets (or points) are then seen
at a glance.¹ On a personal note, Herman Chernoff was very kind to
me when I was an undergraduate, giving his time for weekly private
tutoring in probability and stochastic processes one spring semester.
He was extremely generous, and so it's a pleasure for me to see his name remembered so often in
computer science.

Figure 1: Chernoff faces
Recall that we introduced a dependence on the shape of a distribution by squaring the random
variable and applying the Markov inequality. Here we exponentiate. By the Markov inequality, if
λ > 0, we have that

    P(X \ge c) = P\left(e^{\lambda X} \ge e^{\lambda c}\right) \le \frac{E[e^{\lambda X}]}{e^{\lambda c}} = e^{-\lambda c} M_X(\lambda)

where M_X(λ) = E[e^{λX}] is the moment generating function of X.
¹ It's interesting to note that in his 1973 article on this topic Chernoff says that "At this time the cost of drawing
these faces is about 20 to 25 cents per face on the IBM 360-67 at Stanford University using the Calcomp Plotter.
Most of this cost is in the computing, and I believe that it should be possible to reduce it considerably."
It is best to consider this inequality in the log domain. Taking logarithms gives

    \log P(X \ge c) \le -\lambda c + \log M_X(\lambda)

Now, it's possible to show that log M_X(λ) is a convex function in λ. Since this inequality holds for
any λ > 0, we can think of λ as a variational parameter and select the value of λ that gives the tightest
possible upper bound. See Figure 2.
[Plot omitted: bound on log P(X ∈ C_δ) as a function of the variational parameter λ.]

Figure 2: Chernoff bounds for n = 30 independent Bernoulli trials for the event
C_δ = {X | \sum_i X_i > np(1 + δ)}, with p = 1/2 and δ = 1/2. The classical Chernoff bound
log P(X ∈ C_δ) < -npδ²/4 is the top horizontal line, the tighter bound
log P(X ∈ C_δ) < np(δ - (1 + δ) log(1 + δ)) is the second horizontal line, and the true log
probability is the lowest horizontal line. The curve shows the variational approximation
log P(X ∈ C_δ) < -λnp(1 + δ) + log M_X(λ).
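The following sketch (my addition; the grid search, threshold, and variable names are my own choices) reproduces the spirit of Figure 2: for X a sum of n Bernoulli(p) trials it scans the variational parameter λ for the tightest value of the bound -λc + log M_X(λ), using M_X(λ) = (1 - p + p e^λ)^n, and compares the result with the true log tail probability.

```python
from math import comb, exp, log

# Variational Chernoff bound for X = X_1 + ... + X_n, X_i ~ Bernoulli(p):
# log P(X >= c) <= -lambda*c + log M_X(lambda),  M_X(lambda) = (1-p+p*e^lambda)^n.
n, p, delta = 30, 0.5, 0.5
c = n * p * (1 + delta)                      # threshold np(1 + delta) = 22.5

def log_bound(lam):
    return -lam * c + n * log(1 - p + p * exp(lam))

grid = [i / 1000 for i in range(1, 3001)]    # lambda in (0, 3]
best = min(grid, key=log_bound)

# exact log P(X > c) for comparison
tail = sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(int(c) + 1, n + 1))

print(f"optimal lambda  ~ {best:.3f}")
print(f"optimized bound = {log_bound(best):.3f}")
print(f"true log P      = {log(tail):.3f}")
```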
4.1. Example: Bounds for the Poisson
Let X ~ Poisson(μ). Then by the general Chernoff bound, P(X ≥ n) ≤ e^{-λn} M_X(λ). Now the
moment generating function for the Poisson is computed as

    M_X(\lambda) = \sum_{k=0}^{\infty} e^{\lambda k} p_X(k)
                 = \sum_{k=0}^{\infty} e^{\lambda k} e^{-\mu} \frac{\mu^k}{k!}
                 = e^{-\mu} e^{\mu e^{\lambda}}
                 = e^{\mu (e^{\lambda} - 1)}

leading to the bound

    \log P(X \ge n) \le -\lambda n + \mu (e^{\lambda} - 1)

Minimizing over λ gives λ = log(n/μ) and the bound

    \log P(X \ge n) \le -n \log\left(\frac{n}{\mu}\right) + n - \mu

or, after exponentiating,

    P(X \ge n) \le e^{-\mu} \left(\frac{e\mu}{n}\right)^n
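As a numerical check (my addition, keeping μ for the Poisson mean as above), the following compares the optimized bound e^{-μ}(eμ/n)^n with the exact Poisson tail probability for a few values of n > μ.

```python
from math import exp, factorial

# Exact Poisson tail P(X >= n) versus the optimized Chernoff bound
# exp(-mu) * (e*mu/n)**n, for X ~ Poisson(mu).
def poisson_upper_tail(mu, n):
    """Exact P(X >= n) for X ~ Poisson(mu)."""
    return 1.0 - sum(exp(-mu) * mu**k / factorial(k) for k in range(n))

mu = 5.0
for n in (10, 15, 20):
    bound = exp(-mu) * (exp(1) * mu / n) ** n
    print(f"n={n:2d}  exact={poisson_upper_tail(mu, n):.3e}  bound={bound:.3e}")
```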
4.2. Chernoff bounds for the binomial

More typically, Chernoff bounds are derived after making a convex upper bound to the convex
function log M_X(λ), for which the minimum can be computed analytically in a convenient form.
The following bounds for the binomial illustrate this.

If X ~ Binomial(n, p) then we can derive the bounds

    P(X - np \ge \epsilon) \le e^{-2\epsilon^2/n}
    P(X - np \le -\epsilon) \le e^{-2\epsilon^2/n}

for ε > 0. To prove the first of these (the second is similar) we follow the usual protocol: exponentiate and apply the Markov inequality. This gives
    P(X - np \ge \epsilon) = P\left(e^{\lambda(X - np)} \ge e^{\lambda \epsilon}\right)
        \le e^{-\lambda \epsilon} E\left[e^{\lambda(\sum_{i=1}^{n} X_i - np)}\right]
        = e^{-\lambda \epsilon} E\left[e^{\lambda \sum_{i=1}^{n} (X_i - p)}\right]
        = e^{-\lambda \epsilon} E\left[\prod_{i=1}^{n} e^{\lambda(X_i - p)}\right]
        = e^{-\lambda \epsilon} \prod_{i=1}^{n} E\left[e^{\lambda(X_i - p)}\right]    (by independence)
        = e^{-\lambda \epsilon} \prod_{i=1}^{n} \left(p e^{\lambda(1-p)} + (1-p) e^{-\lambda p}\right)
Now, using the following inequality (which we won't prove)

    p\, e^{\lambda(1-p)} + (1-p)\, e^{-\lambda p} \le e^{\lambda^2/8}

we get that

    P(X - np \ge \epsilon) \le e^{-\lambda \epsilon} \prod_{i=1}^{n} \left(p\, e^{\lambda(1-p)} + (1-p)\, e^{-\lambda p}\right) \le e^{-\lambda \epsilon + n \lambda^2/8}

This gives the convex function -λε + nλ²/8 as a second, weaker upper bound to log P(X - np ≥ ε).
Minimizing over the parameter λ then gives λ = 4ε/n and the bound

    P(X - np \ge \epsilon) \le e^{-2\epsilon^2/n}
4.3. Example
Let's now return again to the simple binomial example. Recall that the Markov inequality gave
P(X ≥ 3n/4) ≤ 2/3, and the Chebyshev inequality gave P(X ≥ 3n/4) ≤ 4/n. Now, we apply the
Chernoff bound just derived to get

    P(X \ge 3n/4) = P(X - n/2 \ge n/4) \le e^{-2(n/4)^2/n} = e^{-n/8}

This bound goes to zero exponentially fast in n. This is as we expect intuitively, since the binomial
should become concentrated around its mean as n → ∞.
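For concreteness, here is a short table (my addition) showing how the three bounds on P(X ≥ 3n/4) for X ~ Binomial(n, 1/2) compare as n grows.

```python
from math import exp

# The three bounds on P(X >= 3n/4) for X ~ Binomial(n, 1/2) derived above.
print(f"{'n':>6} {'Markov':>8} {'Chebyshev':>10} {'Chernoff':>12}")
for n in (16, 64, 256, 1024):
    print(f"{n:>6} {2/3:>8.3f} {4/n:>10.4f} {exp(-n/8):>12.3e}")
```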
4.4. Alternative forms of Chernoff bounds

There are other forms of the Chernoff bounds for binomial or Bernoulli trials that we'll state here.
If X_1, ..., X_n are independent Bernoulli trials with X_i ~ Bernoulli(p_i), let X = \sum_{i=1}^{n} X_i. Then

    P(X > (1 + \delta) E[X]) \le \left(\frac{e^{\delta}}{(1 + \delta)^{(1+\delta)}}\right)^{E[X]}

If, moreover, 0 < δ < 1, then

    P(X > (1 + \delta) E[X]) \le e^{-\delta^2 E[X]/3}
    P(X < (1 - \delta) E[X]) \le e^{-\delta^2 E[X]/2}

The bounds are slightly different but the idea of the proof is the same as we've already seen.
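A quick numeric sanity check of these multiplicative forms (my addition; the parameter values are arbitrary, and the trials are taken identically distributed so the exact tail is easy to compute):

```python
from math import comb, exp

# Exact P(X > (1+delta)*E[X]) for X ~ Binomial(n, p), versus the two
# multiplicative Chernoff forms stated above.
def upper_tail(n, p, t):
    """Exact P(X > t) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j)
               for j in range(int(t) + 1, n + 1))

n, p, delta = 200, 0.3, 0.5
mu = n * p                                   # E[X]
exact = upper_tail(n, p, (1 + delta) * mu)
tight = (exp(delta) / (1 + delta) ** (1 + delta)) ** mu
simple = exp(-delta**2 * mu / 3)
print(f"exact                 = {exact:.3e}")
print(f"(e^d/(1+d)^(1+d))^mu  = {tight:.3e}")
print(f"exp(-d^2 mu / 3)      = {simple:.3e}")
```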
4.5. Inverse bounds

As an example of the use of Chernoff bounds, we can carry out an inverse calculation to see how
much beyond the mean a random variable must be to have a sufficiently small tail probability.
Specifically, for X ~ Binomial(n, 1/2), we can ask how big m has to be so that

    P\left(X - \frac{n}{2} > m\right) \le \frac{1}{n}

Using the Chernoff bound we have that

    P\left(X - \frac{n}{2} > m\right) \le e^{-2m^2/n}

In order for the right-hand side to be equal to 1/n, we need that

    \frac{2m^2}{n} = \log n

which we solve to get m = \sqrt{\frac{n \log n}{2}}.
4.6. Application to Algorithms

We now apply the Chernoff bound to the analysis of randomized quicksort. Recall that earlier
we showed randomized quicksort has expected running time O(n log n). Here we'll show something
much stronger: that with high probability the running time is O(n log n). In particular, we will
show the following.

Theorem. Randomized quicksort runs in O(n log n) time with probability at least 1 - 1/n^b for
some constant b > 1.

To sketch the proof of this, suppose that we run the algorithm and stop when the depth of the
recursion tree is c log n for some constant c. The question we need to ask is: what is the probability
that there is a leaf that is not a singleton set {s_i}?
Call a split (induced by some pivot) good if it breaks up the set S into two pieces S_1 and S_2 with

    \min(|S_1|, |S_2|) \ge \frac{1}{3}|S|

Otherwise the split is bad. What is the probability of getting a good split? Assuming that all
elements are distinct, it's easy to see that the probability of a good split is 1/3.

Now, for each good split, the size of S reduces by a factor of (at least) 2/3. How many good splits
are needed in the worst case to get to a singleton set? A little calculation shows that we require

    x = \frac{\log n}{\log(3/2)} = a \log n

good splits. We can next reason about how large the constant c needs to be so that the probability
that we get fewer than a log n good splits is small.
Consider a single path from the root to a leaf in the recursion tree. The average number of good
splits is (1/3) c log n. By the Chernoff bound, we have that

    P\left(\text{number of good splits} < (1 - \delta)\,\tfrac{1}{3} c \log n\right) \le e^{-\frac{1}{3} c \log n \, \delta^2/2}

We can then choose c sufficiently large so that this right-hand side is no larger than 1/n².
We arranged this so that the probability of having too few good splits on a single path was less
than 1/n². Thus, by the union bound over the at most n root-to-leaf paths, the probability that there
is some path from root to a leaf that has too few good splits is less than 1/n. It follows that with
probability at least 1 - 1/n, the algorithm finishes with a singleton set {s_i} at each leaf in the tree,
meaning that the set S is successfully sorted in time O(n log n).

This is a much stronger guarantee on the running time of the algorithm, which explains the
effectiveness of randomized quicksort in practice.
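The guarantee is easy to observe empirically. The sketch below (my addition, a simple list-based variant rather than the in-place algorithm analyzed above) runs randomized quicksort and reports the depth of its recursion tree, which stays within a small constant factor of log n.

```python
import random
from math import log2

# Randomized quicksort that also reports the depth of its recursion tree;
# with high probability the depth is O(log n), so the running time is O(n log n).
def rquicksort(a, depth=0):
    """Return (sorted copy of a, depth of the recursion tree below this call)."""
    if len(a) <= 1:
        return a, depth
    pivot = random.choice(a)
    left = [x for x in a if x < pivot]
    middle = [x for x in a if x == pivot]
    right = [x for x in a if x > pivot]
    sorted_left, dl = rquicksort(left, depth + 1)
    sorted_right, dr = rquicksort(right, depth + 1)
    return sorted_left + middle + sorted_right, max(dl, dr)

for n in (1_000, 10_000, 100_000):
    data = random.sample(range(10 * n), n)
    _, depth = rquicksort(data)
    print(f"n={n:7d}  recursion depth={depth:3d}  log2(n)={log2(n):.1f}")
```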
4.7. Pointer to recent research

As a final note, in recent research in machine learning we have attempted to extend the scope
of Chernoff bounds to non-independent random variables, using some of the machinery of convex
optimization [4].
5. Jensen
Johan Ludwig Jensen of Denmark was self-taught in mathematics, and
never had an academic or full research position. Instead, he worked
beginning in 1880 for a telephone company to support himself sufficiently
in order to be able to work on mathematics in his spare time.
(Only much later did telephone workers at such companies as AT&T
get paid directly to do mathematics!)

Jensen studied convex functions, and is given credit for a particularly
important and useful inequality involving convex functions. A convex
function is one such as f(x) = x² + ax + b that "holds water". Jensen's
inequality is best viewed geometrically. If f(x) and f(y) are two points
on the graph of f and we connect them by a chord, this line is traced
out by λf(x) + (1 - λ)f(y) for 0 ≤ λ ≤ 1. By convexity, this line lies above the graph of the
function. Thus²,

    \lambda f(x) + (1 - \lambda) f(y) \ge f(\lambda x + (1 - \lambda) y)
By induction, it follows easily that

    \lambda_1 f(x_1) + \cdots + \lambda_n f(x_n) \ge f(\lambda_1 x_1 + \cdots + \lambda_n x_n)

where λ_i ≥ 0 and \sum_{i=1}^{n} λ_i = 1.
If X is a finite random variable, this shows that for f convex,

    f(E[X]) \le E[f(X)]

Similarly, for g a concave function,

    g(E[X]) \ge E[g(X)]

These inequalities are best remembered by simply drawing a picture of a convex function and a
chord between two points on the curve.
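A tiny numerical illustration (my addition) of f(E[X]) ≤ E[f(X)] for the convex function f(x) = x², estimated from samples of an exponential distribution, where both sides are known in closed form (E[X] = 1, E[X²] = 2 for Exp(1)).

```python
import random

# Jensen's inequality f(E[X]) <= E[f(X)] for the convex f(x) = x^2,
# estimated from Exp(1) samples.
random.seed(0)
xs = [random.expovariate(1.0) for _ in range(100_000)]

mean = sum(xs) / len(xs)
mean_sq = sum(x * x for x in xs) / len(xs)

print(f"f(E[X]) ~ {mean**2:.3f}")   # approximately 1
print(f"E[f(X)] ~ {mean_sq:.3f}")   # approximately 2
```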
5.1. Example
As a simple use of Jensen's inequality, we note the arithmetic/geometric mean inequality. Using
convexity of -log(x), if a and b are positive then

    \frac{1}{2}\log(a) + \frac{1}{2}\log(b) \le \log\left(\frac{1}{2}a + \frac{1}{2}b\right)

which implies

    \sqrt{ab} \le \frac{1}{2}(a + b)
² It appears that Jensen only, in fact, studied the inequality f\left(\frac{x+y}{2}\right) \le \frac{f(x) + f(y)}{2}.
5.2. Example: Entropy
If X is a finite r.v. taking values v_i with probability p_i = P(X = v_i), the entropy of X, denoted
H(X), is defined as

    H(X) = -\sum_{i=1}^{n} p_i \log p_i

Clearly H(X) ≥ 0 since log(1/p_i) ≥ 0. Now, since -log is convex, we have by Jensen's inequality
that

    H(X) = -\sum_{i} p_i \log p_i = \sum_{i} p_i \log \frac{1}{p_i} \le \log\left(\sum_{i=1}^{n} p_i \cdot \frac{1}{p_i}\right) = \log n

so that the entropy lies in the interval [0, log n].

The entropy is a lower bound on the amount of compression possible for a sequence of characters
v_i that are independent and identically distributed according to the pmf of X.
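To make this concrete, here is a short helper (my addition; the function and example pmfs are illustrative) that computes H(X) for a pmf given as a list and shows that it falls in [0, log n], as the Jensen argument above guarantees.

```python
from math import log

# Entropy H(X) = -sum_i p_i log p_i (in nats) of a pmf given as a list;
# the Jensen argument above shows 0 <= H(X) <= log n.
def entropy(pmf):
    assert abs(sum(pmf) - 1.0) < 1e-9, "pmf must sum to 1"
    return -sum(p * log(p) for p in pmf if p > 0)

uniform = [1 / 8] * 8
skewed = [0.7, 0.1, 0.1, 0.05, 0.05]

print(f"H(uniform over 8 values) = {entropy(uniform):.3f}   (log 8 = {log(8):.3f})")
print(f"H(skewed over 5 values)  = {entropy(skewed):.3f}   (log 5 = {log(5):.3f})")
```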
References
[1] Herman Chernoff (1952). A measure of asymptotic efficiency for tests of a hypothesis based
on the sum of observations. Annals of Mathematical Statistics, 23, 493-507.

[2] The MacTutor History of Mathematics Archive (2004). http://www-gap.dcs.st-and.ac.uk/~history.

[3] Rajeev Motwani and Prabhakar Raghavan (1995). Randomized Algorithms. Cambridge University Press.

[4] Pradeep Ravikumar and John Lafferty (2004). Variational Chernoff bounds for graphical
models. Uncertainty in Artificial Intelligence.

[5] Sheldon M. Ross (2002). Probability Models for Computer Science. Harcourt/Academic Press.