
Lecture Notes in Information Theory

Part I
by
Fady Alajaji

and Po-Ning Chen

Department of Mathematics & Statistics,


Queen's University, Kingston, ON K7L 3N6, Canada
Email: fady@mast.queensu.ca

Department of Electrical Engineering


Institute of Communication Engineering
National Chiao Tung University
1001, Ta Hsueh Road
Hsin Chu, Taiwan 30056
Republic of China
Email: poning@faculty.nctu.edu.tw
December 10, 2012
© Copyright by
Fady Alajaji

and Po-Ning Chen

December 10, 2012


Preface
This is a work in progress. Comments are welcome; please send them to fady@mast.queensu.ca.
Acknowledgements
Many thanks are due to our families for their endless support.
Table of Contents
Chapter Page
List of Tables vii
List of Figures viii
1 Introduction 1
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Communication system model . . . . . . . . . . . . . . . . . . . . 2
2 Information Measures for Discrete Systems 5
2.1 Entropy, joint entropy and conditional entropy . . . . . . . . . . . 5
2.1.1 Self-information . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.2 Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.3 Properties of entropy . . . . . . . . . . . . . . . . . . . . . 10
2.1.4 Joint entropy and conditional entropy . . . . . . . . . . . . 12
2.1.5 Properties of joint entropy and conditional entropy . . . . 14
2.2 Mutual information . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.1 Properties of mutual information . . . . . . . . . . . . . . 16
2.2.2 Conditional mutual information . . . . . . . . . . . . . . . 17
2.3 Properties of entropy and mutual information for multiple random
variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4 Data processing inequality . . . . . . . . . . . . . . . . . . . . . . 20
2.5 Fano's inequality . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.6 Divergence and variational distance . . . . . . . . . . . . . . . . . 26
2.7 Convexity/concavity of information measures . . . . . . . . . . . . 36
2.8 Fundamentals of hypothesis testing . . . . . . . . . . . . . . . . . 39
3 Lossless Data Compression 43
3.1 Principles of data compression . . . . . . . . . . . . . . . . . . . . 43
3.2 Block codes for asymptotically lossless compression . . . . . . . . 45
3.2.1 Block codes for discrete memoryless sources . . . . . . . . 45
3.2.2 Block codes for stationary ergodic sources . . . . . . . . . 54
3.2.3 Redundancy for lossless block data compression . . . . . . 58
3.3 Variable-length codes for lossless data compression . . . . . . . . . 60
3.3.1 Non-singular codes and uniquely decodable codes . . . . . 60
3.3.2 Prefix or instantaneous codes . . . . . . . . . . . . . . . . 64
3.3.3 Examples of binary prefix codes . . . . . . . . . . . . . . . 70
A) Huffman codes: optimal variable-length codes . . . . . 70
B) Shannon-Fano-Elias code . . . . . . . . . . . . . . . . . 74
3.3.4 Examples of universal lossless variable-length codes . . . . 76
A) Adaptive Huffman code . . . . . . . . . . . . . . . . . . 76
B) Lempel-Ziv codes . . . . . . . . . . . . . . . . . . . . . 78
4 Data Transmission and Channel Capacity 82
4.1 Principles of data transmission . . . . . . . . . . . . . . . . . . . . 82
4.2 Discrete memoryless channels . . . . . . . . . . . . . . . . . . . . 84
4.3 Block codes for data transmission over DMCs . . . . . . . . . . . 91
4.4 Calculating channel capacity . . . . . . . . . . . . . . . . . . . . . 102
4.4.1 Symmetric, weakly-symmetric and quasi-symmetric channels . . 103
4.4.2 Channel capacity: Karush-Kuhn-Tucker condition . . . . . 109
5 Differential Entropy and Gaussian Channels 113
5.1 Differential entropy . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.2 Joint and conditional differential entropies, divergence and mutual
information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.3 AEP for continuous memoryless sources . . . . . . . . . . . . . . . 131
5.4 Capacity and channel coding theorem for the discrete-time mem-
oryless Gaussian channel . . . . . . . . . . . . . . . . . . . . . . . 132
5.5 Capacity of uncorrelated parallel Gaussian channels: The water-
filling principle . . . . . . . . . . . . . . . . . . . . . . . . . . 143
5.6 Capacity of correlated parallel Gaussian channels . . . . . . . . . 145
5.7 Non-Gaussian discrete-time memoryless channels . . . . . . . . . 147
5.8 Capacity of the band-limited white Gaussian channel . . . . . . . 148
A Overview on Suprema and Limits 154
A.1 Supremum and maximum . . . . . . . . . . . . . . . . . . . . . . 154
A.2 Infimum and minimum . . . . . . . . . . . . . . . . . . . . . . 156
A.3 Boundedness and suprema operations . . . . . . . . . . . . . . . . 157
A.4 Sequences and their limits . . . . . . . . . . . . . . . . . . . . . . 159
A.5 Equivalence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
B Overview of Probability and Random Processes 165
B.1 Probability space . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
B.2 Random variable and random process . . . . . . . . . . . . . . . . 166
B.3 Central limit theorem . . . . . . . . . . . . . . . . . . . . . . . . . 166
B.4 Convexity, concavity and Jensen's inequality . . . . . . . . . . 166
List of Tables

Number Page

3.1 An example of the δ-typical set with n = 2 and δ = 0.4, where T_2(0.4) = {AB, AC, BA, BB, BC, CA, CB}. The codeword set is {001(AB), 010(AC), 011(BA), 100(BB), 101(BC), 110(CA), 111(CB), 000(AA, AD, BD, CC, CD, DA, DB, DC, DD)}, where the parenthesis following each binary codeword indicates those sourcewords that are encoded to this codeword. The source distribution is P_X(A) = 0.4, P_X(B) = 0.3, P_X(C) = 0.2 and P_X(D) = 0.1. . . . 50

5.1 Quantized random variable q_n(X) under an n-bit accuracy: H(q_n(X)) and H(q_n(X)) − n versus n. . . . 118
List of Figures

Number Page

1.1 Block diagram of a general communication system. . . . 2
2.1 Binary entropy function h_b(p). . . . 10
2.2 Relation between entropy and mutual information. . . . 17
2.3 Communication context of the data processing lemma. . . . 21
2.4 Permissible (P_e, H(X|Y)) region due to Fano's inequality. . . . 23
3.1 Block diagram of a data compression system. . . . 45
3.2 Possible codebook C_n and its corresponding S_n. The solid box indicates the decoding mapping from C_n back to S_n. . . . 53
3.3 (Ultimate) Compression rate R versus source entropy H_D(X) and behavior of the probability of block decoding error as block length n goes to infinity for a discrete memoryless source. . . . 54
3.4 Classification of variable-length codes. . . . 65
3.5 Tree structure of a binary prefix code. The codewords are those residing on the leaves, which in this case are 00, 01, 10, 110, 1110 and 1111. . . . 66
3.6 Example of the Huffman encoding. . . . 73
3.7 Example of the sibling property based on the code tree from P_X^(16). The arguments inside the parenthesis following a_j respectively indicate the codeword and the probability associated with a_j. The symbol b is used to denote the internal nodes of the tree with the assigned (partial) code as its subscript. The number in the parenthesis following b is the probability sum of all its children. . . . 78
3.8 (Continuation of Figure 3.7) Example of violation of the sibling property after observing a new symbol a_3 at n = 17. Note that node a_1 is not adjacent to its sibling a_2. . . . 79
3.9 (Continuation of Figure 3.8) Updated Huffman code. The sibling property holds now for the new code. . . . 80
4.1 A data transmission system, where W represents the message for transmission, X^n denotes the codeword corresponding to message W, Y^n represents the received word due to channel input X^n, and Ŵ denotes the reconstructed message from Y^n. . . . 82
4.2 Binary symmetric channel. . . . 87
4.3 Binary erasure channel. . . . 89
4.4 Binary symmetric erasure channel. . . . 89
4.5 Ultimate channel coding rate R versus channel capacity C and behavior of the probability of error as blocklength n goes to infinity for a discrete memoryless channel. . . . 102
5.1 The water-pouring scheme for uncorrelated parallel Gaussian channels. The horizontal dashed line, which indicates the level where the water rises to, indicates the value of θ for which Σ_{i=1}^{k} P_i = P. . . . 145
5.2 Band-limited waveform channel with additive white Gaussian noise. . . . 150
5.3 Water-pouring for the band-limited colored Gaussian channel. . . . 153
A.1 Illustration of Lemma A.17. . . . 160
B.1 The support line y = ax + b of the convex function f(x). . . . 168
Chapter 1
Introduction
1.1 Overview
Since its inception, the main role of Information Theory has been to provide the engineering and scientific communities with a mathematical framework for the theory of communication by establishing the fundamental limits on the performance of various communication systems. The birth of Information Theory was initiated with the publication of the groundbreaking works [39, 41] of Claude Elwood Shannon (1916-2001), who asserted that it is possible to send information-bearing signals at a fixed positive rate through a noisy communication channel with an arbitrarily small probability of error as long as the transmission rate is below a certain fixed quantity that depends on the channel's statistical characteristics; he baptized this quantity with the name of channel capacity. He further proclaimed that random (stochastic) sources, representing data, speech or image signals, can be compressed distortion-free at a minimal rate given by the source's intrinsic amount of information, which he called source entropy and defined in terms of the source statistics. He went on to prove that if a source has an entropy that is less than the capacity of a communication channel, then the source can be reliably transmitted (with asymptotically vanishing probability of error) over the channel. He further generalized these coding theorems from the lossless (distortionless) to the lossy context, where the source can be compressed and reproduced (possibly after channel transmission) within a tolerable distortion threshold [40].

Inspired and guided by the pioneering ideas of Shannon,[1] information theorists gradually expanded their interests beyond communication theory, and investigated fundamental questions in several other related fields. Among them we cite:
• statistical physics (thermodynamics, quantum information theory);
• computer science (algorithmic complexity, resolvability);
• probability theory (large deviations, limit theorems);
• statistics (hypothesis testing, multi-user detection, Fisher information, estimation);
• economics (gambling theory, investment theory);
• biology (biological information theory);
• cryptography (data security, watermarking);
• data networks (self-similarity, traffic regulation theory).

[1] See [43] for accessing most of Shannon's works, including his yet untapped doctoral dissertation on an algebraic framework for population genetics.
In this textbook, we focus our attention on the study of the basic theory of
communication for single-user (point-to-point) systems for which Information
Theory was originally conceived.
1.2 Communication system model
A simple block diagram of a general communication system is depicted in Fig. 1.1.
[Figure 1.1 appears here: the transmitter part consists of Source → Source Encoder → Channel Encoder → Modulator; the modulated waveform passes through the Physical Channel to the Demodulator → Channel Decoder → Source Decoder → Destination (receiver part). The discrete channel, the focus of this text, comprises the modulator, the physical channel and the demodulator.]

Figure 1.1: Block diagram of a general communication system.
Let us briefly describe the role of each block in the figure.

• Source: The source, which usually represents data or multimedia signals, is modelled as a random process (the necessary background regarding random processes is introduced in Appendix B). It can be discrete (finite or countable alphabet) or continuous (uncountable alphabet) in value and in time.

• Source Encoder: Its role is to represent the source in a compact fashion by removing its unnecessary or redundant content (i.e., by compressing it).

• Channel Encoder: Its role is to enable the reliable reproduction of the source encoder output after its transmission through a noisy communication channel. This is achieved by adding redundancy (usually via an algebraic structure) to the source encoder output.

• Modulator: It transforms the channel encoder output into a waveform suitable for transmission over the physical channel. This is typically accomplished by varying the parameters of a sinusoidal signal in proportion with the data provided by the channel encoder output.

• Physical Channel: It consists of the noisy (or unreliable) medium that the transmitted waveform traverses. It is usually modelled via a sequence of conditional (or transition) probability distributions of receiving an output given that a specific input was sent.

• Receiver Part: It consists of the demodulator, the channel decoder and the source decoder, where the reverse operations are performed. The destination represents the sink where the source estimate provided by the source decoder is reproduced.

In this text, we will model the concatenation of the modulator, physical channel and demodulator via a discrete-time[2] channel with a given sequence of conditional probability distributions. Given a source and a discrete channel, our objectives will include determining the fundamental limits of how well we can construct a (source/channel) coding scheme so that:

• the smallest number of source encoder symbols can represent each source symbol, either distortion-free or within a prescribed distortion level D (with D > 0), when the channel is noiseless;

• the largest rate of information can be transmitted over a noisy channel between the channel encoder input and the channel decoder output with an arbitrarily small probability of decoding error;

• we can guarantee that the source is transmitted over a noisy channel and reproduced at the destination within distortion D, where D > 0.

[2] Except for a brief interlude with the continuous-time (waveform) Gaussian channel in Chapter 5, we will consider discrete-time communication systems throughout the text.
Chapter 2

Information Measures for Discrete Systems

In this chapter, we define information measures for discrete-time discrete-alphabet[1] systems from a probabilistic standpoint and develop their properties. Elucidating the operational significance of probabilistically defined information measures vis-a-vis the fundamental limits of coding constitutes a main objective of this book; this will be seen in the subsequent chapters.

[1] By discrete alphabets, one usually means finite or countably infinite alphabets. We however mostly focus on finite-alphabet systems, although the presented information measures allow for countable alphabets (when they exist).
2.1 Entropy, joint entropy and conditional entropy

2.1.1 Self-information

Let $E$ be an event belonging to a given event space and having probability $\Pr(E) \triangleq p_E$, where $0 \leq p_E \leq 1$. Let $I(E)$, called the self-information of $E$, represent the amount of information one gains when learning that $E$ has occurred (or equivalently, the amount of uncertainty one had about $E$ prior to learning that it has happened). A natural question to ask is: what properties should $I(E)$ have? Although the answer to this question may vary from person to person, here are some common properties that $I(E)$ is reasonably expected to have.

1. $I(E)$ should be a decreasing function of $p_E$. In other words, this property first states that $I(E) = I(p_E)$, where $I(\cdot)$ is a real-valued function defined over $[0, 1]$. Furthermore, one would expect that the less likely event $E$ is, the more information is gained when one learns it has occurred; i.e., $I(p_E)$ is a decreasing function of $p_E$.

2. $I(p_E)$ should be continuous in $p_E$. Intuitively, one should expect that a small change in $p_E$ corresponds to a small change in the amount of information carried by $E$.

3. If $E_1$ and $E_2$ are independent events, then $I(E_1 \cap E_2) = I(E_1) + I(E_2)$, or equivalently, $I(p_{E_1} p_{E_2}) = I(p_{E_1}) + I(p_{E_2})$. This property declares that when events $E_1$ and $E_2$ are independent from each other (i.e., when they do not affect each other probabilistically), the amount of information one gains by learning that both events have jointly occurred should be equal to the sum of the amounts of information of each individual event.
Next, we show that the only function that satisfies properties 1-3 above is the logarithmic function.

Theorem 2.1 The only function defined over $p \in [0, 1]$ and satisfying

1. $I(p)$ is monotonically decreasing in $p$;

2. $I(p)$ is a continuous function of $p$ for $0 \leq p \leq 1$;

3. $I(p_1 p_2) = I(p_1) + I(p_2)$;

is $I(p) = -c \cdot \log_b(p)$, where $c$ is a positive constant and the base $b$ of the logarithm is any number larger than one.
Proof:

Step 1: Claim. For $n = 1, 2, 3, \ldots$,
$$I\left(\tfrac{1}{n}\right) = -c \cdot \log_b\left(\tfrac{1}{n}\right),$$
where $c > 0$ is a constant.

Proof: First note that for $n = 1$, condition 3 directly shows the claim, since it yields $I(1) = I(1) + I(1)$; thus $I(1) = 0 = -c \cdot \log_b(1)$.

Now let $n$ be a fixed positive integer greater than 1. Conditions 1 and 3 respectively imply
$$n < m \;\Longrightarrow\; I\left(\tfrac{1}{n}\right) < I\left(\tfrac{1}{m}\right) \qquad (2.1.1)$$
and
$$I\left(\tfrac{1}{mn}\right) = I\left(\tfrac{1}{m}\right) + I\left(\tfrac{1}{n}\right), \qquad (2.1.2)$$
where $n, m = 1, 2, 3, \ldots$. Now using (2.1.2), we can show by induction (on $k$) that
$$I\left(\tfrac{1}{n^k}\right) = k \cdot I\left(\tfrac{1}{n}\right) \qquad (2.1.3)$$
for all non-negative integers $k$.

Now for any positive integer $r$, there exists a non-negative integer $k$ such that
$$n^k \leq 2^r < n^{k+1}.$$
By (2.1.1), we obtain
$$I\left(\tfrac{1}{n^k}\right) \leq I\left(\tfrac{1}{2^r}\right) < I\left(\tfrac{1}{n^{k+1}}\right),$$
which together with (2.1.3) yields
$$k \cdot I\left(\tfrac{1}{n}\right) \leq r \cdot I\left(\tfrac{1}{2}\right) < (k+1) \cdot I\left(\tfrac{1}{n}\right).$$
Hence, since $I(1/n) > I(1) = 0$,
$$\frac{k}{r} \leq \frac{I(1/2)}{I(1/n)} \leq \frac{k+1}{r}.$$
On the other hand, by the monotonicity of the logarithm, we obtain
$$\log_b n^k \leq \log_b 2^r \leq \log_b n^{k+1} \;\Longrightarrow\; \frac{k}{r} \leq \frac{\log_b(2)}{\log_b(n)} \leq \frac{k+1}{r}.$$
Therefore,
$$\left| \frac{\log_b(2)}{\log_b(n)} - \frac{I(1/2)}{I(1/n)} \right| \leq \frac{1}{r}.$$
Since $n$ is fixed and $r$ can be made arbitrarily large, we can let $r \to \infty$ to get
$$I\left(\tfrac{1}{n}\right) = c \cdot \log_b(n),$$
where $c = I(1/2)/\log_b(2) > 0$. This completes the proof of the claim.

Step 2: Claim. $I(p) = -c \cdot \log_b(p)$ for every positive rational number $p$, where $c > 0$ is a constant.

Proof: A positive rational number $p$ (with $p \leq 1$) can be represented by a ratio of two integers, i.e., $p = r/s$, where $r$ and $s$ are both positive integers. Then condition 3 yields that
$$I\left(\tfrac{1}{s}\right) = I\left(\tfrac{r}{s} \cdot \tfrac{1}{r}\right) = I\left(\tfrac{r}{s}\right) + I\left(\tfrac{1}{r}\right),$$
which, from Step 1, implies that
$$I(p) = I\left(\tfrac{r}{s}\right) = I\left(\tfrac{1}{s}\right) - I\left(\tfrac{1}{r}\right) = c \cdot \log_b s - c \cdot \log_b r = -c \cdot \log_b p.$$

Step 3: For any $p \in [0, 1]$, it follows by continuity and the density of the rationals in the reals that
$$I(p) = \lim_{q \to p,\ q\ \text{rational}} I(q) = -c \cdot \log_b(p). \qquad \Box$$
The constant $c$ above is by convention normalized to $c = 1$. Furthermore, the base $b$ of the logarithm determines the type of units used in measuring information. When $b = 2$, the amount of information is expressed in bits (i.e., binary digits). When $b = e$, i.e., when the natural logarithm ($\ln$) is used, information is measured in nats (i.e., natural units or digits). For example, if the event $E$ concerns a Heads outcome from the toss of a fair coin, then its self-information is $I(E) = -\log_2(1/2) = 1$ bit, or $-\ln(1/2) = 0.693$ nats.

More generally, under base $b > 1$, information is measured in $b$-ary units or digits. For the sake of simplicity, we will throughout use the base-2 logarithm unless otherwise specified. Note that one can easily convert information units from bits to $b$-ary units by dividing the former by $\log_2(b)$.
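As a quick numerical illustration of these unit conventions, the following Python sketch evaluates the self-information of an event and converts between bits, nats and general $b$-ary units; the probabilities used are arbitrary examples.

```python
import math

def self_information(p, base=2):
    """Self-information I(p) = -log_base(p) of an event with probability p (0 < p <= 1)."""
    return -math.log(p) / math.log(base)

# A fair-coin Heads event: 1 bit, or ln 2 ~ 0.693 nats.
p_heads = 0.5
print(self_information(p_heads, base=2))        # 1.0 bit
print(self_information(p_heads, base=math.e))   # ~0.6931 nats

# Converting bits to b-ary units: divide by log2(b), e.g. ternary units for b = 3.
bits = self_information(0.1, base=2)
ternary_units = bits / math.log2(3)
print(bits, ternary_units)
```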
2.1.2 Entropy

Let $X$ be a discrete random variable taking values in a finite alphabet $\mathcal{X}$ under a probability distribution or probability mass function (pmf) $P_X(x) \triangleq P[X = x]$ for all $x \in \mathcal{X}$. Note that $X$ generically represents a memoryless source, i.e., a random process $\{X_n\}_{n=1}^{\infty}$ with independent and identically distributed (i.i.d.) random variables (cf. Appendix B).

Definition 2.2 (Entropy) The entropy of a discrete random variable $X$ with pmf $P_X(\cdot)$ is denoted by $H(X)$ or $H(P_X)$ and defined by
$$H(X) \triangleq -\sum_{x \in \mathcal{X}} P_X(x) \log_2 P_X(x) \quad \text{(bits)}.$$

Thus $H(X)$ represents the statistical average (mean) amount of information one gains when learning that one of its $|\mathcal{X}|$ outcomes has occurred, where $|\mathcal{X}|$ denotes the size of the alphabet $\mathcal{X}$. Indeed, we directly note from the definition that
$$H(X) = E[-\log_2 P_X(X)] = E[I(X)],$$
where $I(x) \triangleq -\log_2 P_X(x)$ is the self-information of the elementary event $[X = x]$.

When computing the entropy, we adopt the convention
$$0 \cdot \log_2 0 = 0,$$
which can be justified by a continuity argument since $x \log_2 x \to 0$ as $x \to 0$. Also note that $H(X)$ only depends on the probability distribution of $X$ and is not affected by the symbols that represent the outcomes. For example, when tossing a fair coin, we can denote Heads by 2 (instead of 1) and Tails by 100 (instead of 0), and the entropy of the random variable representing the outcome would remain equal to $\log_2(2) = 1$ bit.

Example 2.3 Let $X$ be a binary (valued) random variable with alphabet $\mathcal{X} = \{0, 1\}$ and pmf given by $P_X(1) = p$ and $P_X(0) = 1 - p$, where $0 \leq p \leq 1$ is fixed. Then $H(X) = -p \log_2 p - (1-p) \log_2(1-p)$. This entropy is conveniently called the binary entropy function and is usually denoted by $h_b(p)$; it is illustrated in Fig. 2.1. As shown in the figure, $h_b(p)$ is maximized for a uniform distribution (i.e., $p = 1/2$).

The units for $H(X)$ above are in bits as the base-2 logarithm is used. Setting
$$H_D(X) \triangleq -\sum_{x \in \mathcal{X}} P_X(x) \log_D P_X(x)$$
yields the entropy in $D$-ary units, where $D > 1$. Note that we abbreviate $H_2(X)$ as $H(X)$ throughout the book since bits are the common measure units for a coding system; hence
$$H_D(X) = \frac{H(X)}{\log_2 D}.$$
Thus
$$H_e(X) = \frac{H(X)}{\log_2(e)} = (\ln 2) \, H(X)$$
gives the entropy in nats, where $e$ is the base of the natural logarithm.
[Figure 2.1 appears here: plot of $h_b(p)$ over $0 \leq p \leq 1$, rising from 0 at $p = 0$ to its maximum of 1 bit at $p = 0.5$ and falling back to 0 at $p = 1$.]

Figure 2.1: Binary entropy function $h_b(p)$.
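The following Python sketch (an illustrative aside; the pmf values are arbitrary examples) computes $H(X)$ from a pmf, evaluates the binary entropy function $h_b(p)$ of Example 2.3, and checks the unit conversion $H_D(X) = H(X)/\log_2 D$.

```python
import math

def entropy(pmf, base=2):
    """H(X) = -sum_x p(x) log_base p(x), with the convention 0 log 0 = 0."""
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

def binary_entropy(p):
    """h_b(p) = -p log2 p - (1-p) log2(1-p)."""
    return entropy([p, 1.0 - p], base=2)

# h_b is maximized at p = 1/2, where it equals 1 bit = log2|X| for |X| = 2.
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(p, round(binary_entropy(p), 4))

# Unit conversion: H_D(X) = H(X)/log2(D); in particular H_e(X) = (ln 2) H(X).
pmf = [0.4, 0.3, 0.2, 0.1]
print(entropy(pmf, base=2), entropy(pmf, base=math.e), math.log(2) * entropy(pmf, base=2))
```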
2.1.3 Properties of entropy

When developing or proving the basic properties of entropy (and other information measures), we will often use the following fundamental inequality on the logarithm (its proof is left as an exercise).

Lemma 2.4 (Fundamental inequality (FI)) For any $x > 0$ and $D > 1$, we have that
$$\log_D x \leq \log_D(e) \, (x - 1)$$
with equality if and only if (iff) $x = 1$.

Setting $y = 1/x$ and using the FI above directly yields that for any $y > 0$, we also have that
$$\log_D y \geq \log_D(e) \left(1 - \frac{1}{y}\right),$$
also with equality iff $y = 1$. In the above the base-$D$ logarithm was used. Specifically, for a logarithm with base 2, the above inequalities become
$$\log_2(e) \left(1 - \frac{1}{x}\right) \leq \log_2 x \leq \log_2(e)(x - 1)$$
with equality iff $x = 1$.
Lemma 2.5 (Non-negativity) $H(X) \geq 0$. Equality holds iff $X$ is deterministic (when $X$ is deterministic, the uncertainty of $X$ is obviously zero).

Proof: $0 \leq P_X(x) \leq 1$ implies that $\log_2[1/P_X(x)] \geq 0$ for every $x \in \mathcal{X}$. Hence,
$$H(X) = \sum_{x \in \mathcal{X}} P_X(x) \log_2 \frac{1}{P_X(x)} \geq 0,$$
with equality holding iff $P_X(x) = 1$ for some $x \in \mathcal{X}$. □
Lemma 2.6 (Upper bound on entropy) If a random variable $X$ takes values from a finite set $\mathcal{X}$, then
$$H(X) \leq \log_2 |\mathcal{X}|,$$
where $|\mathcal{X}|$ denotes the size of the set $\mathcal{X}$. Equality holds iff $X$ is equiprobable or uniformly distributed over $\mathcal{X}$ (i.e., $P_X(x) = \frac{1}{|\mathcal{X}|}$ for all $x \in \mathcal{X}$).

Proof:
$$\log_2 |\mathcal{X}| - H(X) = \log_2 |\mathcal{X}| \left( \sum_{x \in \mathcal{X}} P_X(x) \right) - \left( -\sum_{x \in \mathcal{X}} P_X(x) \log_2 P_X(x) \right)$$
$$= \sum_{x \in \mathcal{X}} P_X(x) \log_2 |\mathcal{X}| + \sum_{x \in \mathcal{X}} P_X(x) \log_2 P_X(x) = \sum_{x \in \mathcal{X}} P_X(x) \log_2 \left[ |\mathcal{X}| \, P_X(x) \right]$$
$$\geq \sum_{x \in \mathcal{X}} P_X(x) \log_2(e) \left( 1 - \frac{1}{|\mathcal{X}| \, P_X(x)} \right) = \log_2(e) \sum_{x \in \mathcal{X}} \left( P_X(x) - \frac{1}{|\mathcal{X}|} \right) = \log_2(e)\,(1 - 1) = 0,$$
where the inequality follows from the FI Lemma, with equality iff $(\forall\, x \in \mathcal{X})\ |\mathcal{X}| \, P_X(x) = 1$, which means that $P_X(\cdot)$ is the uniform distribution on $\mathcal{X}$. □

Intuitively, $H(X)$ tells us how random $X$ is. Indeed, $X$ is deterministic (not random at all) iff $H(X) = 0$; and if $X$ is uniform (equiprobable), $H(X)$ is maximized and is equal to $\log_2 |\mathcal{X}|$.
Lemma 2.7 (Log-sum inequality) For non-negative numbers $a_1, a_2, \ldots, a_n$ and $b_1, b_2, \ldots, b_n$,
$$\sum_{i=1}^{n} a_i \log_D \frac{a_i}{b_i} \;\geq\; \left( \sum_{i=1}^{n} a_i \right) \log_D \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i} \qquad (2.1.4)$$
with equality holding iff $(\forall\, 1 \leq i \leq n)$ $(a_i/b_i) = (a_1/b_1)$, a constant independent of $i$. (By convention, $0 \cdot \log_D(0) = 0$, $0 \cdot \log_D(0/0) = 0$ and $a \cdot \log_D(a/0) = \infty$ if $a > 0$. Again, this can be justified by continuity.)

Proof: Let $a \triangleq \sum_{i=1}^{n} a_i$ and $b \triangleq \sum_{i=1}^{n} b_i$. Then
$$\sum_{i=1}^{n} a_i \log_D \frac{a_i}{b_i} - a \log_D \frac{a}{b} = a \left[ \sum_{i=1}^{n} \frac{a_i}{a} \log_D \frac{a_i}{b_i} - \underbrace{\left( \sum_{i=1}^{n} \frac{a_i}{a} \right)}_{=1} \log_D \frac{a}{b} \right]$$
$$= a \sum_{i=1}^{n} \frac{a_i}{a} \log_D \left( \frac{a_i}{b_i} \cdot \frac{b}{a} \right) \;\geq\; a \log_D(e) \sum_{i=1}^{n} \frac{a_i}{a} \left( 1 - \frac{b_i}{a_i} \cdot \frac{a}{b} \right) = a \log_D(e) \left( \sum_{i=1}^{n} \frac{a_i}{a} - \sum_{i=1}^{n} \frac{b_i}{b} \right) = a \log_D(e)\,(1 - 1) = 0,$$
where the inequality follows from the FI Lemma, with equality holding iff
$$\frac{a_i}{b_i} \cdot \frac{b}{a} = 1 \text{ for all } i, \quad \text{i.e.,} \quad \frac{a_i}{b_i} = \frac{a}{b} \text{ for all } i.$$

We also provide another proof using Jensen's inequality (cf. Theorem B.6 in Appendix B). Without loss of generality, assume that $a_i > 0$ and $b_i > 0$ for every $i$. Jensen's inequality states that
$$\sum_{i=1}^{n} \alpha_i f(t_i) \geq f\left( \sum_{i=1}^{n} \alpha_i t_i \right)$$
for any strictly convex function $f(\cdot)$, $\alpha_i \geq 0$, and $\sum_{i=1}^{n} \alpha_i = 1$; equality holds iff $t_i$ is a constant for all $i$. Hence by setting $\alpha_i = b_i / \sum_{j=1}^{n} b_j$, $t_i = a_i / b_i$, and $f(t) = t \log_D(t)$, we obtain the desired result. □
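A small numerical check of the log-sum inequality (Lemma 2.7) can be written as follows; the vectors a and b are arbitrary illustrative choices, with the second pair chosen so that $a_i/b_i$ is constant and equality holds.

```python
import math

def log_sum_lhs(a, b, base=2):
    """Left side of the log-sum inequality: sum_i a_i log(a_i / b_i)."""
    return sum(ai * math.log(ai / bi, base) for ai, bi in zip(a, b) if ai > 0)

def log_sum_rhs(a, b, base=2):
    """Right side: (sum_i a_i) log( sum_i a_i / sum_i b_i )."""
    A, B = sum(a), sum(b)
    return A * math.log(A / B, base)

a = [0.2, 0.5, 0.3]
b = [0.4, 0.4, 0.2]
print(log_sum_lhs(a, b), ">=", log_sum_rhs(a, b))   # strict inequality here

# Equality case: a_i / b_i constant for all i.
a2 = [0.2, 0.4, 0.4]
b2 = [0.1, 0.2, 0.2]
print(log_sum_lhs(a2, b2), "==", log_sum_rhs(a2, b2))
```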
2.1.4 Joint entropy and conditional entropy

Given a pair of random variables $(X, Y)$ with a joint pmf $P_{X,Y}(\cdot, \cdot)$ defined on $\mathcal{X} \times \mathcal{Y}$, the self-information of the (two-dimensional) elementary event $[X = x, Y = y]$ is defined by
$$I(x, y) \triangleq -\log_2 P_{X,Y}(x, y).$$
This leads us to the definition of joint entropy.

Definition 2.8 (Joint entropy) The joint entropy $H(X, Y)$ of random variables $(X, Y)$ is defined by
$$H(X, Y) \triangleq -\sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} P_{X,Y}(x, y) \log_2 P_{X,Y}(x, y) = E[-\log_2 P_{X,Y}(X, Y)].$$

The conditional entropy can also be similarly defined as follows.

Definition 2.9 (Conditional entropy) Given two jointly distributed random variables $X$ and $Y$, the conditional entropy $H(Y|X)$ of $Y$ given $X$ is defined by
$$H(Y|X) \triangleq \sum_{x \in \mathcal{X}} P_X(x) \left( -\sum_{y \in \mathcal{Y}} P_{Y|X}(y|x) \log_2 P_{Y|X}(y|x) \right), \qquad (2.1.5)$$
where $P_{Y|X}(\cdot|\cdot)$ is the conditional pmf of $Y$ given $X$.

Equation (2.1.5) can be written in three different but equivalent forms:
$$H(Y|X) = -\sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} P_{X,Y}(x, y) \log_2 P_{Y|X}(y|x) = E[-\log_2 P_{Y|X}(Y|X)] = \sum_{x \in \mathcal{X}} P_X(x) \, H(Y|X = x),$$
where $H(Y|X = x) \triangleq -\sum_{y \in \mathcal{Y}} P_{Y|X}(y|x) \log_2 P_{Y|X}(y|x)$.

The relationship between joint entropy and conditional entropy is exhibited by the fact that the entropy of a pair of random variables is the entropy of one plus the conditional entropy of the other.

Theorem 2.10 (Chain rule for entropy)
$$H(X, Y) = H(X) + H(Y|X). \qquad (2.1.6)$$

Proof: Since
$$P_{X,Y}(x, y) = P_X(x) \, P_{Y|X}(y|x),$$
we directly obtain that
$$H(X, Y) = E[-\log_2 P_{X,Y}(X, Y)] = E[-\log_2 P_X(X)] + E[-\log_2 P_{Y|X}(Y|X)] = H(X) + H(Y|X). \qquad \Box$$

By its definition, joint entropy is commutative; i.e., $H(X, Y) = H(Y, X)$. Hence,
$$H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y) = H(Y, X),$$
which implies that
$$H(X) - H(X|Y) = H(Y) - H(Y|X). \qquad (2.1.7)$$
The above quantity is exactly equal to the mutual information, which will be introduced in the next section.
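The chain rule (2.1.6) and the identity (2.1.7) are easy to verify numerically. The Python sketch below does so for an arbitrary illustrative joint pmf on $\{0,1\} \times \{0,1\}$ (the numbers are not taken from the notes).

```python
import math

def H(probs):
    """Entropy in bits of a list of probabilities (zeros are skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# An arbitrary joint pmf P_{X,Y} on {0,1} x {0,1} (illustrative numbers only).
P_XY = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

P_X = {x: sum(p for (a, _), p in P_XY.items() if a == x) for x in (0, 1)}
P_Y = {y: sum(p for (_, b), p in P_XY.items() if b == y) for y in (0, 1)}

H_XY = H(P_XY.values())
H_X, H_Y = H(P_X.values()), H(P_Y.values())
# H(Y|X) = sum_x P_X(x) H(Y|X=x), and similarly for H(X|Y).
H_Y_given_X = sum(P_X[x] * H([P_XY[(x, y)] / P_X[x] for y in (0, 1)]) for x in (0, 1))
H_X_given_Y = sum(P_Y[y] * H([P_XY[(x, y)] / P_Y[y] for x in (0, 1)]) for y in (0, 1))

print(abs(H_XY - (H_X + H_Y_given_X)) < 1e-12)                 # chain rule (2.1.6)
print(abs((H_X - H_X_given_Y) - (H_Y - H_Y_given_X)) < 1e-12)  # identity (2.1.7)
```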
The conditional entropy can be thought of in terms of a channel whose input is the random variable $X$ and whose output is the random variable $Y$. $H(X|Y)$ is then called the equivocation[2] and corresponds to the uncertainty in the channel input from the receiver's point of view. For example, suppose that the set of possible outcomes of the random vector $(X, Y)$ is $\{(0,0), (0,1), (1,0), (1,1)\}$, where none of the elements has zero probability mass. When the receiver $Y$ receives 1, he still cannot determine exactly what the sender $X$ observes (it could be either 1 or 0); therefore, the uncertainty, from the receiver's viewpoint, depends on the probabilities $P_{X|Y}(0|1)$ and $P_{X|Y}(1|1)$.

Similarly, $H(Y|X)$, which is called prevarication,[3] is the uncertainty in the channel output from the transmitter's point of view. In other words, the sender knows exactly what he sends, but is uncertain about what the receiver will finally obtain.

A case that is of specific interest is when $H(X|Y) = 0$. By its definition, $H(X|Y) = 0$ if $X$ becomes deterministic after observing $Y$. In such a case, the uncertainty of $X$ after being given $Y$ is completely zero.
The next corollary can be proved similarly to Theorem 2.10.

Corollary 2.11 (Chain rule for conditional entropy)
$$H(X, Y|Z) = H(X|Z) + H(Y|X, Z).$$

[2] Equivocation is an ambiguous statement one uses deliberately in order to deceive or avoid speaking the truth.
[3] Prevarication is the deliberate act of deviating from the truth (it is a synonym of equivocation).

2.1.5 Properties of joint entropy and conditional entropy

Lemma 2.12 (Conditioning never increases entropy) Side information $Y$ decreases the uncertainty about $X$:
$$H(X|Y) \leq H(X)$$
with equality holding iff $X$ and $Y$ are independent. In other words, conditioning reduces entropy.
Proof:
$$H(X) - H(X|Y) = \sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} P_{X,Y}(x, y) \log_2 \frac{P_{X|Y}(x|y)}{P_X(x)} = \sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} P_{X,Y}(x, y) \log_2 \frac{P_{X|Y}(x|y) P_Y(y)}{P_X(x) P_Y(y)}$$
$$= \sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} P_{X,Y}(x, y) \log_2 \frac{P_{X,Y}(x, y)}{P_X(x) P_Y(y)} \geq \left( \sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} P_{X,Y}(x, y) \right) \log_2 \frac{\sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} P_{X,Y}(x, y)}{\sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} P_X(x) P_Y(y)} = 0,$$
where the inequality follows from the log-sum inequality, with equality holding iff
$$\frac{P_{X,Y}(x, y)}{P_X(x) P_Y(y)} = \text{constant} \quad \forall\, (x, y) \in \mathcal{X} \times \mathcal{Y}.$$
Since probabilities must sum to 1, the above constant equals 1, which is exactly the case where $X$ and $Y$ are independent. □
Lemma 2.13 Entropy is additive for independent random variables; i.e.,
$$H(X, Y) = H(X) + H(Y) \quad \text{for independent } X \text{ and } Y.$$

Proof: By the previous lemma, independence of $X$ and $Y$ implies $H(Y|X) = H(Y)$. Hence
$$H(X, Y) = H(X) + H(Y|X) = H(X) + H(Y). \qquad \Box$$

Since conditioning never increases entropy, it follows that
$$H(X, Y) = H(X) + H(Y|X) \leq H(X) + H(Y). \qquad (2.1.8)$$
The above lemma tells us that equality holds in (2.1.8) only when $X$ is independent of $Y$.

A result similar to (2.1.8) also applies to conditional entropy.

Lemma 2.14 Conditional entropy is lower additive; i.e.,
$$H(X_1, X_2 | Y_1, Y_2) \leq H(X_1|Y_1) + H(X_2|Y_2).$$
Equality holds iff
$$P_{X_1, X_2 | Y_1, Y_2}(x_1, x_2 | y_1, y_2) = P_{X_1|Y_1}(x_1|y_1) \, P_{X_2|Y_2}(x_2|y_2)$$
for all $x_1, x_2, y_1$ and $y_2$.
Proof: Using the chain rule for conditional entropy and the fact that conditioning reduces entropy, we can write
$$H(X_1, X_2 | Y_1, Y_2) = H(X_1 | Y_1, Y_2) + H(X_2 | X_1, Y_1, Y_2)$$
$$\leq H(X_1 | Y_1, Y_2) + H(X_2 | Y_1, Y_2) \qquad (2.1.9)$$
$$\leq H(X_1 | Y_1) + H(X_2 | Y_2). \qquad (2.1.10)$$
For (2.1.9), equality holds iff $X_1$ and $X_2$ are conditionally independent given $(Y_1, Y_2)$:
$$P_{X_1, X_2 | Y_1, Y_2}(x_1, x_2 | y_1, y_2) = P_{X_1 | Y_1, Y_2}(x_1 | y_1, y_2) \, P_{X_2 | Y_1, Y_2}(x_2 | y_1, y_2).$$
For (2.1.10), equality holds iff $X_1$ is conditionally independent of $Y_2$ given $Y_1$ (i.e., $P_{X_1 | Y_1, Y_2}(x_1 | y_1, y_2) = P_{X_1 | Y_1}(x_1 | y_1)$), and $X_2$ is conditionally independent of $Y_1$ given $Y_2$ (i.e., $P_{X_2 | Y_1, Y_2}(x_2 | y_1, y_2) = P_{X_2 | Y_2}(x_2 | y_2)$). Hence, the desired equality condition of the lemma is obtained. □
2.2 Mutual information

For two random variables $X$ and $Y$, the mutual information between $X$ and $Y$ is the reduction in the uncertainty of $Y$ due to the knowledge of $X$ (or vice versa). A dual definition of mutual information states that it is the average amount of information that $Y$ has (or contains) about $X$, or that $X$ has (or contains) about $Y$.

We can think of the mutual information between $X$ and $Y$ in terms of a channel whose input is $X$ and whose output is $Y$. Thereby the reduction of the uncertainty is by definition the total uncertainty of $X$ (i.e., $H(X)$) minus the uncertainty of $X$ after observing $Y$ (i.e., $H(X|Y)$). Mathematically, it is
$$\text{mutual information} = I(X; Y) \triangleq H(X) - H(X|Y). \qquad (2.2.1)$$
It can be easily verified from (2.1.7) that mutual information is symmetric; i.e., $I(X; Y) = I(Y; X)$.
2.2.1 Properties of mutual information

Lemma 2.15

1. $I(X; Y) = \displaystyle\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} P_{X,Y}(x, y) \log_2 \frac{P_{X,Y}(x, y)}{P_X(x) P_Y(y)}$.

2. $I(X; Y) = I(Y; X)$.

3. $I(X; Y) = H(X) + H(Y) - H(X, Y)$.

4. $I(X; Y) \leq H(X)$ with equality holding iff $X$ is a function of $Y$ (i.e., $X = f(Y)$ for some function $f(\cdot)$).

5. $I(X; Y) \geq 0$ with equality holding iff $X$ and $Y$ are independent.

6. $I(X; Y) \leq \min\{\log_2 |\mathcal{X}|, \log_2 |\mathcal{Y}|\}$.

Proof: Properties 1, 2, 3, and 4 follow immediately from the definition. Property 5 is a direct consequence of Lemma 2.12. Property 6 holds iff $I(X; Y) \leq \log_2 |\mathcal{X}|$ and $I(X; Y) \leq \log_2 |\mathcal{Y}|$. To show the first inequality, we write $I(X; Y) = H(X) - H(X|Y)$, use the fact that $H(X|Y)$ is non-negative and apply Lemma 2.6. A similar proof can be used to show that $I(X; Y) \leq \log_2 |\mathcal{Y}|$. □

The relationships between $H(X)$, $H(Y)$, $H(X, Y)$, $H(X|Y)$, $H(Y|X)$ and $I(X; Y)$ can be illustrated by the Venn diagram in Figure 2.2.

[Figure 2.2 appears here: $H(X, Y)$ spans the whole region, $H(X)$ and $H(Y)$ are overlapping parts whose intersection is $I(X; Y)$, and the non-overlapping parts are $H(X|Y)$ and $H(Y|X)$.]

Figure 2.2: Relation between entropy and mutual information.
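To connect the definition (2.2.1) with Properties 1 and 3 of Lemma 2.15, the following Python sketch computes $I(X;Y)$ in three equivalent ways for an arbitrary illustrative joint pmf.

```python
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# An arbitrary illustrative joint pmf (not taken from the notes).
P_XY = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}
xs = ys = (0, 1)
P_X = {x: sum(P_XY[(x, y)] for y in ys) for x in xs}
P_Y = {y: sum(P_XY[(x, y)] for x in xs) for y in ys}

# Definition (2.2.1): I(X;Y) = H(X) - H(X|Y).
H_X = H(P_X.values())
H_X_given_Y = sum(P_Y[y] * H([P_XY[(x, y)] / P_Y[y] for x in xs]) for y in ys)
I1 = H_X - H_X_given_Y

# Property 3: I(X;Y) = H(X) + H(Y) - H(X,Y).
I2 = H_X + H(P_Y.values()) - H(P_XY.values())

# Property 1: I(X;Y) = sum_{x,y} P(x,y) log2 [ P(x,y) / (P(x)P(y)) ].
I3 = sum(p * math.log2(p / (P_X[x] * P_Y[y])) for (x, y), p in P_XY.items())

print(I1, I2, I3)  # all three agree up to rounding
```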
2.2.2 Conditional mutual information

The conditional mutual information, denoted by $I(X; Y|Z)$, is defined as the common uncertainty between $X$ and $Y$ under the knowledge of $Z$. It is mathematically defined by
$$I(X; Y|Z) \triangleq H(X|Z) - H(X|Y, Z). \qquad (2.2.2)$$

Lemma 2.16 (Chain rule for mutual information)
$$I(X; Y, Z) = I(X; Y) + I(X; Z|Y) = I(X; Z) + I(X; Y|Z).$$

Proof: Without loss of generality, we only prove the first equality:
$$I(X; Y, Z) = H(X) - H(X|Y, Z) = H(X) - H(X|Y) + H(X|Y) - H(X|Y, Z) = I(X; Y) + I(X; Z|Y). \qquad \Box$$

The above lemma can be read as: the information that $(Y, Z)$ has about $X$ is equal to the information that $Y$ has about $X$ plus the information that $Z$ has about $X$ when $Y$ is already known.
2.3 Properties of entropy and mutual information for
multiple random variables
Theorem 2.17 (Chain rule for entropy) Let $X_1, X_2, \ldots, X_n$ be drawn according to $P_{X^n}(x^n) \triangleq P_{X_1, \ldots, X_n}(x_1, \ldots, x_n)$, where we use the common superscript notation to denote an $n$-tuple: $X^n \triangleq (X_1, \ldots, X_n)$ and $x^n \triangleq (x_1, \ldots, x_n)$. Then
$$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i | X_{i-1}, \ldots, X_1),$$
where $H(X_i | X_{i-1}, \ldots, X_1) \triangleq H(X_1)$ for $i = 1$. (The above chain rule can also be written as
$$H(X^n) = \sum_{i=1}^{n} H(X_i | X^{i-1}),$$
where $X^i \triangleq (X_1, \ldots, X_i)$.)

Proof: From (2.1.6),
$$H(X_1, X_2, \ldots, X_n) = H(X_1, X_2, \ldots, X_{n-1}) + H(X_n | X_{n-1}, \ldots, X_1). \qquad (2.3.1)$$
Once again, applying (2.1.6) to the first term of the right-hand side of (2.3.1), we have
$$H(X_1, X_2, \ldots, X_{n-1}) = H(X_1, X_2, \ldots, X_{n-2}) + H(X_{n-1} | X_{n-2}, \ldots, X_1).$$
The desired result can then be obtained by repeatedly applying (2.1.6). □
Theorem 2.18 (Chain rule for conditional entropy)
$$H(X_1, X_2, \ldots, X_n | Y) = \sum_{i=1}^{n} H(X_i | X_{i-1}, \ldots, X_1, Y).$$

Proof: The theorem can be proved similarly to Theorem 2.17. □

Theorem 2.19 (Chain rule for mutual information)
$$I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y | X_{i-1}, \ldots, X_1),$$
where $I(X_i; Y | X_{i-1}, \ldots, X_1) \triangleq I(X_1; Y)$ for $i = 1$.

Proof: This can be proved by first expressing mutual information in terms of entropy and conditional entropy, and then applying the chain rules for entropy and conditional entropy. □

Theorem 2.20 (Independence bound on entropy)
$$H(X_1, X_2, \ldots, X_n) \leq \sum_{i=1}^{n} H(X_i).$$
Equality holds iff all the $X_i$'s are independent from each other.[4]

Proof: By applying the chain rule for entropy,
$$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i | X_{i-1}, \ldots, X_1) \leq \sum_{i=1}^{n} H(X_i).$$
Equality holds iff each conditional entropy is equal to its associated entropy, that is, iff $X_i$ is independent of $(X_{i-1}, \ldots, X_1)$ for all $i$. □
Theorem 2.21 (Bound on mutual information) If $\{(X_i, Y_i)\}_{i=1}^{n}$ is a process satisfying the conditional independence assumption $P_{Y^n|X^n} = \prod_{i=1}^{n} P_{Y_i|X_i}$, then
$$I(X_1, \ldots, X_n; Y_1, \ldots, Y_n) \leq \sum_{i=1}^{n} I(X_i; Y_i)$$
with equality holding iff $\{X_i\}_{i=1}^{n}$ are independent.

[4] This condition is equivalent to saying that $X_i$ is independent of $(X_{i-1}, \ldots, X_1)$ for all $i$. Their equivalence can be easily proved by the chain rule for probabilities, i.e., $P_{X^n}(x^n) = \prod_{i=1}^{n} P(x_i | x_1^{i-1})$, which is left to the reader as an exercise.
Proof: From the independence bound on entropy, we have
$$H(Y_1, \ldots, Y_n) \leq \sum_{i=1}^{n} H(Y_i).$$
By the conditional independence assumption, we have
$$H(Y_1, \ldots, Y_n | X_1, \ldots, X_n) = E\left[ -\log_2 P_{Y^n|X^n}(Y^n | X^n) \right] = E\left[ -\sum_{i=1}^{n} \log_2 P_{Y_i|X_i}(Y_i | X_i) \right] = \sum_{i=1}^{n} H(Y_i | X_i).$$
Hence
$$I(X^n; Y^n) = H(Y^n) - H(Y^n | X^n) \leq \sum_{i=1}^{n} H(Y_i) - \sum_{i=1}^{n} H(Y_i | X_i) = \sum_{i=1}^{n} I(X_i; Y_i)$$
with equality holding iff $\{Y_i\}_{i=1}^{n}$ are independent, which holds iff $\{X_i\}_{i=1}^{n}$ are independent. □
2.4 Data processing inequality

Lemma 2.22 (Data processing inequality) (This is also called the data processing lemma.) If $X \to Y \to Z$, then $I(X; Y) \geq I(X; Z)$.

Proof: The Markov chain relationship $X \to Y \to Z$ means that $X$ and $Z$ are conditionally independent given $Y$ (cf. Appendix B); we directly have that $I(X; Z|Y) = 0$. By the chain rule for mutual information,
$$I(X; Z) + I(X; Y|Z) = I(X; Y, Z) \qquad (2.4.1)$$
$$= I(X; Y) + I(X; Z|Y) = I(X; Y). \qquad (2.4.2)$$
Since $I(X; Y|Z) \geq 0$, we obtain that $I(X; Y) \geq I(X; Z)$ with equality holding iff $I(X; Y|Z) = 0$. □
[Figure 2.3 appears here: Source → (U) → Encoder → (X) → Channel → (Y) → Decoder → (V), with $I(U; V) \leq I(X; Y)$. By processing, we can only reduce (mutual) information, but the processed information may be in a more useful form!]

Figure 2.3: Communication context of the data processing lemma.
The data processing inequality means that the mutual information will not increase after processing. This result is somewhat counter-intuitive since, given two random variables $X$ and $Y$, we might believe that applying a well-designed processing scheme to $Y$, which can be generally represented by a mapping $g(Y)$, could possibly increase the mutual information. However, for any $g(\cdot)$, $X \to Y \to g(Y)$ forms a Markov chain, which implies that data processing cannot increase mutual information. A communication context for the data processing lemma is depicted in Figure 2.3, and summarized in the next corollary.

Corollary 2.23 For jointly distributed random variables $X$ and $Y$ and any function $g(\cdot)$, we have $X \to Y \to g(Y)$ and
$$I(X; Y) \geq I(X; g(Y)).$$

We also note that if $Z$ obtains all the information about $X$ through $Y$, then knowing $Z$ will not help increase the mutual information between $X$ and $Y$; this is formalized in the following.

Corollary 2.24 If $X \to Y \to Z$, then
$$I(X; Y|Z) \leq I(X; Y).$$

Proof: The proof follows directly from (2.4.1) and (2.4.2). □
It is worth pointing out that it is possible that $I(X; Y|Z) > I(X; Y)$ when $X$, $Y$ and $Z$ do not form a Markov chain. For example, let $X$ and $Y$ be independent equiprobable binary random variables, and let $Z = X + Y$. Then
$$I(X; Y|Z) = H(X|Z) - H(X|Y, Z) = H(X|Z)$$
$$= P_Z(0) H(X|Z = 0) + P_Z(1) H(X|Z = 1) + P_Z(2) H(X|Z = 2) = 0 + 0.5 + 0 = 0.5 \text{ bits},$$
which is clearly larger than $I(X; Y) = 0$.
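This counterexample is easy to check numerically; the Python sketch below reproduces the 0.5-bit value of $I(X;Y|Z)$ for independent equiprobable bits $X$, $Y$ and $Z = X + Y$.

```python
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# X, Y independent equiprobable bits; Z = X + Y takes values 0, 1, 2.
P_XYZ = {}
for x in (0, 1):
    for y in (0, 1):
        P_XYZ[(x, y, x + y)] = 0.25

# I(X;Y|Z) = H(X|Z) - H(X|Y,Z); here H(X|Y,Z) = 0 because X = Z - Y.
P_Z = {z: sum(p for (x, y, zz), p in P_XYZ.items() if zz == z) for z in (0, 1, 2)}
H_X_given_Z = 0.0
for z, pz in P_Z.items():
    cond = [sum(p for (x, y, zz), p in P_XYZ.items() if zz == z and x == xv) / pz
            for xv in (0, 1)]
    H_X_given_Z += pz * H(cond)

print(H_X_given_Z)  # 0.5 bits = I(X;Y|Z) > I(X;Y) = 0
```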
Finally, we observe that we can extend the data processing inequality to a sequence of random variables forming a Markov chain:

Corollary 2.25 If $X_1 \to X_2 \to \cdots \to X_n$, then for any $i, j, k, l$ such that $1 \leq i \leq j \leq k \leq l \leq n$, we have that
$$I(X_i; X_l) \leq I(X_j; X_k).$$
2.5 Fano's inequality

Fano's inequality is a quite useful tool widely employed in Information Theory to prove converse results for coding theorems (as we will see in the following chapters).

Lemma 2.26 (Fano's inequality) Let $X$ and $Y$ be two random variables, correlated in general, with alphabets $\mathcal{X}$ and $\mathcal{Y}$, respectively, where $\mathcal{X}$ is finite but $\mathcal{Y}$ can be countably infinite. Let $\hat{X} \triangleq g(Y)$ be an estimate of $X$ from observing $Y$, where $g : \mathcal{Y} \to \mathcal{X}$ is a given estimation function. Define the probability of error as
$$P_e \triangleq \Pr[\hat{X} \neq X].$$
Then the following inequality holds:
$$H(X|Y) \leq h_b(P_e) + P_e \log_2(|\mathcal{X}| - 1), \qquad (2.5.1)$$
where $h_b(x) \triangleq -x \log_2 x - (1-x) \log_2(1-x)$ for $0 \leq x \leq 1$ is the binary entropy function.
Observation 2.27

• Note that when $P_e = 0$, we obtain that $H(X|Y) = 0$ (see (2.5.1)), as intuition suggests, since if $P_e = 0$, then $\hat{X} = g(Y) = X$ (with probability 1) and thus $H(X|Y) = H(g(Y)|Y) = 0$.

• Fano's inequality yields upper and lower bounds on $P_e$ in terms of $H(X|Y)$. This is illustrated in Figure 2.4, where we plot the region for the pairs $(P_e, H(X|Y))$ that are permissible under Fano's inequality. In the figure, the boundary of the permissible (dashed) region is given by the function
$$f(P_e) \triangleq h_b(P_e) + P_e \log_2(|\mathcal{X}| - 1),$$
the right-hand side of (2.5.1). We obtain that when
$$\log_2(|\mathcal{X}| - 1) < H(X|Y) \leq \log_2(|\mathcal{X}|),$$
$P_e$ can be upper and lower bounded as follows:
$$0 < \inf\{a : f(a) \geq H(X|Y)\} \leq P_e \leq \sup\{a : f(a) \geq H(X|Y)\} < 1.$$
Furthermore, when
$$0 < H(X|Y) \leq \log_2(|\mathcal{X}| - 1),$$
only the lower bound holds:
$$P_e \geq \inf\{a : f(a) \geq H(X|Y)\} > 0.$$
Thus for all non-zero values of $H(X|Y)$, we obtain a lower bound (of the same form as above) on $P_e$; the bound implies that if $H(X|Y)$ is bounded away from zero, $P_e$ is also bounded away from zero.

[Figure 2.4 appears here: the permissible $(P_e, H(X|Y))$ region lies under the curve $f(P_e)$, which rises from 0 at $P_e = 0$ to its maximum $\log_2(|\mathcal{X}|)$ at $P_e = (|\mathcal{X}| - 1)/|\mathcal{X}|$; the levels $\log_2(|\mathcal{X}| - 1)$ and $\log_2(|\mathcal{X}|)$ are marked on the $H(X|Y)$ axis.]

Figure 2.4: Permissible $(P_e, H(X|Y))$ region due to Fano's inequality.

• A weaker but simpler version of Fano's inequality can be directly obtained from (2.5.1) by noting that $h_b(P_e) \leq 1$:
$$H(X|Y) \leq 1 + P_e \log_2(|\mathcal{X}| - 1), \qquad (2.5.2)$$
which in turn yields that
$$P_e \geq \frac{H(X|Y) - 1}{\log_2(|\mathcal{X}| - 1)} \quad (\text{for } |\mathcal{X}| > 2),$$
which is weaker than the above lower bound on $P_e$.
Proof of Lemma 2.26: Define a new random variable
$$E \triangleq \begin{cases} 1, & \text{if } g(Y) \neq X \\ 0, & \text{if } g(Y) = X. \end{cases}$$
Then using the chain rule for conditional entropy, we obtain
$$H(E, X|Y) = H(X|Y) + H(E|X, Y) = H(E|Y) + H(X|E, Y).$$
Observe that $E$ is a function of $X$ and $Y$; hence, $H(E|X, Y) = 0$. Since conditioning never increases entropy, $H(E|Y) \leq H(E) = h_b(P_e)$. The remaining term, $H(X|E, Y)$, can be bounded as follows:
$$H(X|E, Y) = \Pr[E = 0] H(X|Y, E = 0) + \Pr[E = 1] H(X|Y, E = 1) \leq (1 - P_e) \cdot 0 + P_e \cdot \log_2(|\mathcal{X}| - 1),$$
since $X = g(Y)$ for $E = 0$, and given $E = 1$, we can upper bound the conditional entropy by the log of the number of remaining outcomes, i.e., $(|\mathcal{X}| - 1)$. Combining these results completes the proof. □
Fano's inequality cannot be improved in the sense that the lower bound, $H(X|Y)$, can be achieved for some specific cases. Any bound that can be achieved in some cases is often referred to as sharp.[5] From the proof of the above lemma, we can observe that equality holds in Fano's inequality if $H(E|Y) = H(E)$ and $H(X|Y, E = 1) = \log_2(|\mathcal{X}| - 1)$. The former is equivalent to $E$ being independent of $Y$, and the latter holds iff $P_{X|Y}(\cdot|y)$ is uniformly distributed over the set $\mathcal{X} \setminus \{g(y)\}$. We can therefore create an example in which equality holds in Fano's inequality.

Example 2.28 Suppose that $X$ and $Y$ are two independent random variables which are both uniformly distributed on the alphabet $\{0, 1, 2\}$. Let the estimating function be given by $g(y) = y$. Then
$$P_e = \Pr[g(Y) \neq X] = \Pr[Y \neq X] = 1 - \sum_{x=0}^{2} P_X(x) P_Y(x) = \frac{2}{3}.$$
In this case, equality is achieved in Fano's inequality, i.e.,
$$h_b\left(\tfrac{2}{3}\right) + \tfrac{2}{3} \log_2(3 - 1) = H(X|Y) = H(X) = \log_2 3.$$

[5] Definition: A bound is said to be sharp if the bound is achievable for some specific cases. A bound is said to be tight if the bound is achievable for all cases.
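The equality claimed in Example 2.28 can be confirmed numerically with the short Python sketch below; it simply evaluates both sides of (2.5.1) for the uniform-on-{0,1,2} setting of the example.

```python
import math

def hb(p):
    """Binary entropy function in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

alphabet = (0, 1, 2)
# X and Y independent, each uniform on {0,1,2}; estimator g(y) = y.
P_e = sum((1/3) * (1/3) for x in alphabet for y in alphabet if x != y)   # = 2/3

rhs = hb(P_e) + P_e * math.log2(len(alphabet) - 1)   # right side of (2.5.1)
H_X_given_Y = math.log2(len(alphabet))               # = H(X) by independence

print(P_e, rhs, H_X_given_Y)   # bound met with equality: both equal log2(3)
```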
To conclude this section, we present an alternative proof of Fano's inequality to illustrate the use of the data processing inequality and the FI Lemma.

Alternative Proof of Fano's inequality: Noting that $X \to Y \to \hat{X}$ form a Markov chain, we directly obtain via the data processing inequality that
$$I(X; Y) \geq I(X; \hat{X}),$$
which implies that
$$H(X|Y) \leq H(X|\hat{X}).$$
Thus, if we show that $H(X|\hat{X})$ is no larger than the right-hand side of (2.5.1), the proof of (2.5.1) is complete.

Noting that
$$P_e = \sum_{x \in \mathcal{X}} \sum_{\hat{x} \in \mathcal{X}:\, \hat{x} \neq x} P_{X,\hat{X}}(x, \hat{x}) \quad \text{and} \quad 1 - P_e = \sum_{x \in \mathcal{X}} \sum_{\hat{x} \in \mathcal{X}:\, \hat{x} = x} P_{X,\hat{X}}(x, \hat{x}) = \sum_{x \in \mathcal{X}} P_{X,\hat{X}}(x, x),$$
we obtain that
$$H(X|\hat{X}) - h_b(P_e) - P_e \log_2(|\mathcal{X}| - 1)$$
$$= \sum_{x \in \mathcal{X}} \sum_{\hat{x} \neq x} P_{X,\hat{X}}(x, \hat{x}) \log_2 \frac{1}{P_{X|\hat{X}}(x|\hat{x})} + \sum_{x \in \mathcal{X}} P_{X,\hat{X}}(x, x) \log_2 \frac{1}{P_{X|\hat{X}}(x|x)}$$
$$\quad - \left( \sum_{x \in \mathcal{X}} \sum_{\hat{x} \neq x} P_{X,\hat{X}}(x, \hat{x}) \right) \log_2 \frac{|\mathcal{X}| - 1}{P_e} + \left( \sum_{x \in \mathcal{X}} P_{X,\hat{X}}(x, x) \right) \log_2 (1 - P_e)$$
$$= \sum_{x \in \mathcal{X}} \sum_{\hat{x} \neq x} P_{X,\hat{X}}(x, \hat{x}) \log_2 \frac{P_e}{P_{X|\hat{X}}(x|\hat{x}) (|\mathcal{X}| - 1)} + \sum_{x \in \mathcal{X}} P_{X,\hat{X}}(x, x) \log_2 \frac{1 - P_e}{P_{X|\hat{X}}(x|x)} \qquad (2.5.3)$$
$$\leq \log_2(e) \sum_{x \in \mathcal{X}} \sum_{\hat{x} \neq x} P_{X,\hat{X}}(x, \hat{x}) \left( \frac{P_e}{P_{X|\hat{X}}(x|\hat{x}) (|\mathcal{X}| - 1)} - 1 \right) + \log_2(e) \sum_{x \in \mathcal{X}} P_{X,\hat{X}}(x, x) \left( \frac{1 - P_e}{P_{X|\hat{X}}(x|x)} - 1 \right)$$
$$= \log_2(e) \left( \frac{P_e}{|\mathcal{X}| - 1} \sum_{x \in \mathcal{X}} \sum_{\hat{x} \neq x} P_{\hat{X}}(\hat{x}) - \sum_{x \in \mathcal{X}} \sum_{\hat{x} \neq x} P_{X,\hat{X}}(x, \hat{x}) \right) + \log_2(e) \left( (1 - P_e) \sum_{x \in \mathcal{X}} P_{\hat{X}}(x) - \sum_{x \in \mathcal{X}} P_{X,\hat{X}}(x, x) \right)$$
$$= \log_2(e) \left( \frac{P_e}{|\mathcal{X}| - 1} \, (|\mathcal{X}| - 1) - P_e \right) + \log_2(e) \left[ (1 - P_e) - (1 - P_e) \right] = 0,$$
where the inequality follows by applying the FI Lemma to each logarithm term in (2.5.3). □
2.6 Divergence and variational distance

In addition to the probabilistically defined entropy and mutual information, another measure that is frequently considered in information theory is divergence or relative entropy. In this section, we define this measure and study its statistical properties.

Definition 2.29 (Divergence) Given two discrete random variables $X$ and $\hat{X}$ defined over a common alphabet $\mathcal{X}$, the divergence (other names are Kullback-Leibler divergence or distance, relative entropy and discrimination) is denoted by $D(X \| \hat{X})$ or $D(P_X \| P_{\hat{X}})$ and defined by[6]
$$D(X \| \hat{X}) = D(P_X \| P_{\hat{X}}) \triangleq E_X\left[ \log_2 \frac{P_X(X)}{P_{\hat{X}}(X)} \right] = \sum_{x \in \mathcal{X}} P_X(x) \log_2 \frac{P_X(x)}{P_{\hat{X}}(x)}.$$

In other words, the divergence $D(P_X \| P_{\hat{X}})$ is the expectation (with respect to $P_X$) of the log-likelihood ratio $\log_2 [P_X / P_{\hat{X}}]$ of distribution $P_X$ against distribution $P_{\hat{X}}$. $D(X \| \hat{X})$ can be viewed as a measure of distance or dissimilarity between distributions $P_X$ and $P_{\hat{X}}$. $D(X \| \hat{X})$ is also called relative entropy since it can be regarded as a measure of the inefficiency of mistakenly assuming that the distribution of a source is $P_{\hat{X}}$ when the true distribution is $P_X$. For example, if we know the true distribution $P_X$ of a source, then we can construct a lossless data compression code with average codeword length achieving entropy $H(X)$ (this will be studied in the next chapter). If, however, we mistakenly thought that the true distribution is $P_{\hat{X}}$ and employ the best code corresponding to $P_{\hat{X}}$, then the resultant average codeword length becomes
$$\sum_{x \in \mathcal{X}} \left[ -P_X(x) \log_2 P_{\hat{X}}(x) \right].$$
As a result, the relative difference between the resultant average codeword length and $H(X)$ is the relative entropy $D(X \| \hat{X})$. Hence, divergence is a measure of the system cost (e.g., storage consumed) paid due to mis-classifying the system statistics.

Note that when computing divergence, we follow the convention that
$$0 \cdot \log_2 \frac{0}{p} = 0 \quad \text{and} \quad p \cdot \log_2 \frac{p}{0} = \infty \quad \text{for } p > 0.$$
We next present some properties of the divergence and discuss its relation with entropy and mutual information.

[6] In order to be consistent with the units (in bits) adopted for entropy and mutual information, we will also use the base-2 logarithm for divergence unless otherwise specified.
Lemma 2.30 (Non-negativity of divergence)
$$D(X \| \hat{X}) \geq 0$$
with equality iff $P_X(x) = P_{\hat{X}}(x)$ for all $x \in \mathcal{X}$ (i.e., the two distributions are equal).

Proof:
$$D(X \| \hat{X}) = \sum_{x \in \mathcal{X}} P_X(x) \log_2 \frac{P_X(x)}{P_{\hat{X}}(x)} \geq \left( \sum_{x \in \mathcal{X}} P_X(x) \right) \log_2 \frac{\sum_{x \in \mathcal{X}} P_X(x)}{\sum_{x \in \mathcal{X}} P_{\hat{X}}(x)} = 0,$$
where the second step follows from the log-sum inequality, with equality holding iff for every $x \in \mathcal{X}$,
$$\frac{P_X(x)}{P_{\hat{X}}(x)} = \frac{\sum_{a \in \mathcal{X}} P_X(a)}{\sum_{b \in \mathcal{X}} P_{\hat{X}}(b)},$$
or equivalently $P_X(x) = P_{\hat{X}}(x)$ for all $x \in \mathcal{X}$. □
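As a numerical illustration of Definition 2.29 and Lemma 2.30 (the distributions below are arbitrary examples), the following Python sketch computes $D(P\|Q)$, confirms its non-negativity, and shows that it is not symmetric in its arguments.

```python
import math

def kl_divergence(P, Q):
    """D(P||Q) = sum_x P(x) log2 [P(x)/Q(x)]; infinite if Q(x) = 0 while P(x) > 0."""
    d = 0.0
    for p, q in zip(P, Q):
        if p > 0:
            if q == 0:
                return math.inf
            d += p * math.log2(p / q)
    return d

P = [0.5, 0.3, 0.2]
Q = [0.2, 0.3, 0.5]

print(kl_divergence(P, P))   # 0: equal distributions
print(kl_divergence(P, Q))   # > 0 (Lemma 2.30)
print(kl_divergence(Q, P))   # generally != D(P||Q): divergence is not symmetric
```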
Lemma 2.31 (Mutual information and divergence)
$$I(X; Y) = D(P_{X,Y} \| P_X P_Y),$$
where $P_{X,Y}(\cdot, \cdot)$ is the joint distribution of the random variables $X$ and $Y$, and $P_X(\cdot)$ and $P_Y(\cdot)$ are the respective marginals.

Proof: The observation follows directly from the definitions of divergence and mutual information. □
Definition 2.32 (Refinement of a distribution) Given a distribution $P_X$ on $\mathcal{X}$, divide $\mathcal{X}$ into $k$ mutually disjoint sets $\mathcal{U}_1, \mathcal{U}_2, \ldots, \mathcal{U}_k$ satisfying
$$\mathcal{X} = \bigcup_{i=1}^{k} \mathcal{U}_i.$$
Define a new distribution $P_U$ on $\mathcal{U} = \{1, 2, \ldots, k\}$ as
$$P_U(i) = \sum_{x \in \mathcal{U}_i} P_X(x).$$
Then $P_X$ is called a refinement (or more specifically, a $k$-refinement) of $P_U$.

Let us briefly discuss the relation between the processing of information and its refinement. Processing of information can be modeled as a (many-to-one) mapping, and refinement is actually the reverse operation. Recall that the data processing lemma shows that mutual information can never increase due to processing. Hence, if one wishes to increase mutual information, one should simultaneously anti-process (or refine) the involved statistics.

From Lemma 2.31, the mutual information can be viewed as the divergence of a joint distribution against the product distribution of the marginals. It is therefore reasonable to expect that a similar effect due to processing (or a reverse effect due to refinement) should also apply to divergence. This is shown in the next lemma.
Lemma 2.33 (Refinement cannot decrease divergence) Let $P_X$ and $P_{\hat{X}}$ be refinements ($k$-refinements) of $P_U$ and $P_{\hat{U}}$, respectively. Then
$$D(P_X \| P_{\hat{X}}) \geq D(P_U \| P_{\hat{U}}).$$

Proof: By the log-sum inequality, we obtain that for any $i \in \{1, 2, \ldots, k\}$,
$$\sum_{x \in \mathcal{U}_i} P_X(x) \log_2 \frac{P_X(x)}{P_{\hat{X}}(x)} \geq \left( \sum_{x \in \mathcal{U}_i} P_X(x) \right) \log_2 \frac{\sum_{x \in \mathcal{U}_i} P_X(x)}{\sum_{x \in \mathcal{U}_i} P_{\hat{X}}(x)} = P_U(i) \log_2 \frac{P_U(i)}{P_{\hat{U}}(i)}, \qquad (2.6.1)$$
with equality iff
$$\frac{P_X(x)}{P_{\hat{X}}(x)} = \frac{P_U(i)}{P_{\hat{U}}(i)}$$
for all $x \in \mathcal{U}_i$. Hence,
$$D(P_X \| P_{\hat{X}}) = \sum_{i=1}^{k} \sum_{x \in \mathcal{U}_i} P_X(x) \log_2 \frac{P_X(x)}{P_{\hat{X}}(x)} \geq \sum_{i=1}^{k} P_U(i) \log_2 \frac{P_U(i)}{P_{\hat{U}}(i)} = D(P_U \| P_{\hat{U}}),$$
with equality iff
$$(\forall\, i)(\forall\, x \in \mathcal{U}_i) \quad \frac{P_X(x)}{P_{\hat{X}}(x)} = \frac{P_U(i)}{P_{\hat{U}}(i)}. \qquad \Box$$
Observation 2.34 One drawback of adopting the divergence as a measure between two distributions is that it does not meet the symmetry requirement of a true distance,[7] since interchanging its two arguments may yield different quantities. In other words, $D(P_X \| P_{\hat{X}}) \neq D(P_{\hat{X}} \| P_X)$ in general. (It also does not satisfy the triangle inequality.) Thus divergence is not a true distance or metric. Another measure which is a true distance, called the variational distance, is sometimes used instead.

Definition 2.35 (Variational distance) The variational distance (or $L_1$-distance) between two distributions $P_X$ and $P_{\hat{X}}$ with common alphabet $\mathcal{X}$ is defined by
$$\| P_X - P_{\hat{X}} \| \triangleq \sum_{x \in \mathcal{X}} | P_X(x) - P_{\hat{X}}(x) |.$$

Lemma 2.36 The variational distance satisfies
$$\| P_X - P_{\hat{X}} \| = 2 \sup_{E \subseteq \mathcal{X}} | P_X(E) - P_{\hat{X}}(E) | = 2 \sum_{x \in \mathcal{X}:\, P_X(x) > P_{\hat{X}}(x)} \left[ P_X(x) - P_{\hat{X}}(x) \right].$$

[7] Given a non-empty set $\mathcal{A}$, the function $d : \mathcal{A} \times \mathcal{A} \to [0, \infty)$ is called a distance or metric if it satisfies the following properties:
1. Non-negativity: $d(a, b) \geq 0$ for every $a, b \in \mathcal{A}$, with equality holding iff $a = b$.
2. Symmetry: $d(a, b) = d(b, a)$ for every $a, b \in \mathcal{A}$.
3. Triangle inequality: $d(a, b) + d(b, c) \geq d(a, c)$ for every $a, b, c \in \mathcal{A}$.
Proof: We first show that $\| P_X - P_{\hat{X}} \| = 2 \sum_{x \in \mathcal{X}: P_X(x) > P_{\hat{X}}(x)} [P_X(x) - P_{\hat{X}}(x)]$. Setting $\mathcal{A} \triangleq \{x \in \mathcal{X} : P_X(x) > P_{\hat{X}}(x)\}$, we have
$$\| P_X - P_{\hat{X}} \| = \sum_{x \in \mathcal{X}} | P_X(x) - P_{\hat{X}}(x) | = \sum_{x \in \mathcal{A}} [P_X(x) - P_{\hat{X}}(x)] + \sum_{x \in \mathcal{A}^c} [P_{\hat{X}}(x) - P_X(x)]$$
$$= \sum_{x \in \mathcal{A}} [P_X(x) - P_{\hat{X}}(x)] + P_{\hat{X}}(\mathcal{A}^c) - P_X(\mathcal{A}^c) = \sum_{x \in \mathcal{A}} [P_X(x) - P_{\hat{X}}(x)] + P_X(\mathcal{A}) - P_{\hat{X}}(\mathcal{A}) = 2 \sum_{x \in \mathcal{A}} [P_X(x) - P_{\hat{X}}(x)],$$
where $\mathcal{A}^c$ denotes the complement of $\mathcal{A}$.

We next prove that $\| P_X - P_{\hat{X}} \| = 2 \sup_{E \subseteq \mathcal{X}} | P_X(E) - P_{\hat{X}}(E) |$ by showing that each quantity is greater than or equal to the other. For any set $E \subseteq \mathcal{X}$, we can write
$$\| P_X - P_{\hat{X}} \| = \sum_{x \in E} | P_X(x) - P_{\hat{X}}(x) | + \sum_{x \in E^c} | P_X(x) - P_{\hat{X}}(x) |$$
$$\geq \left| \sum_{x \in E} [P_X(x) - P_{\hat{X}}(x)] \right| + \left| \sum_{x \in E^c} [P_X(x) - P_{\hat{X}}(x)] \right| = | P_X(E) - P_{\hat{X}}(E) | + | P_X(E^c) - P_{\hat{X}}(E^c) |$$
$$= | P_X(E) - P_{\hat{X}}(E) | + | P_{\hat{X}}(E) - P_X(E) | = 2 \, | P_X(E) - P_{\hat{X}}(E) |.$$
Thus $\| P_X - P_{\hat{X}} \| \geq 2 \sup_{E \subseteq \mathcal{X}} | P_X(E) - P_{\hat{X}}(E) |$. Conversely, we have that
$$2 \sup_{E \subseteq \mathcal{X}} | P_X(E) - P_{\hat{X}}(E) | \geq 2 \, | P_X(\mathcal{A}) - P_{\hat{X}}(\mathcal{A}) | = | P_X(\mathcal{A}) - P_{\hat{X}}(\mathcal{A}) | + | P_{\hat{X}}(\mathcal{A}^c) - P_X(\mathcal{A}^c) |$$
$$= \left| \sum_{x \in \mathcal{A}} [P_X(x) - P_{\hat{X}}(x)] \right| + \left| \sum_{x \in \mathcal{A}^c} [P_{\hat{X}}(x) - P_X(x)] \right| = \sum_{x \in \mathcal{A}} | P_X(x) - P_{\hat{X}}(x) | + \sum_{x \in \mathcal{A}^c} | P_X(x) - P_{\hat{X}}(x) | = \| P_X - P_{\hat{X}} \|.$$
Therefore, $\| P_X - P_{\hat{X}} \| = 2 \sup_{E \subseteq \mathcal{X}} | P_X(E) - P_{\hat{X}}(E) |$. □
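Lemma 2.36 can be checked numerically by brute force over all subsets $E$ of a small alphabet; the distributions in the Python sketch below are arbitrary examples.

```python
from itertools import chain, combinations

P = {0: 0.5, 1: 0.3, 2: 0.2}
Q = {0: 0.2, 1: 0.3, 2: 0.5}

# L1 form of the variational distance.
l1 = sum(abs(P[x] - Q[x]) for x in P)

# 2 * sup_E |P(E) - Q(E)| over all subsets E of the alphabet.
def subsets(xs):
    return chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))

sup_form = 2 * max(abs(sum(P[x] for x in E) - sum(Q[x] for x in E)) for E in subsets(P))

# 2 * sum over {x : P(x) > Q(x)} of [P(x) - Q(x)].
half_form = 2 * sum(P[x] - Q[x] for x in P if P[x] > Q[x])

print(l1, sup_form, half_form)   # all three coincide (Lemma 2.36)
```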
Lemma 2.37 (Variational distance vs divergence: Pinsker's inequality)
$$D(X \| \hat{X}) \geq \frac{\log_2(e)}{2} \, \| P_X - P_{\hat{X}} \|^2.$$
This result is referred to as Pinsker's inequality.

Proof:

1. With $\mathcal{A} \triangleq \{x \in \mathcal{X} : P_X(x) > P_{\hat{X}}(x)\}$, we have from the previous lemma that
$$\| P_X - P_{\hat{X}} \| = 2 \, [P_X(\mathcal{A}) - P_{\hat{X}}(\mathcal{A})].$$

2. Define two random variables $U$ and $\hat{U}$ as
$$U = \begin{cases} 1, & \text{if } X \in \mathcal{A}; \\ 0, & \text{if } X \in \mathcal{A}^c, \end{cases} \qquad \hat{U} = \begin{cases} 1, & \text{if } \hat{X} \in \mathcal{A}; \\ 0, & \text{if } \hat{X} \in \mathcal{A}^c. \end{cases}$$
Then $P_X$ and $P_{\hat{X}}$ are refinements (2-refinements) of $P_U$ and $P_{\hat{U}}$, respectively. From Lemma 2.33, we obtain that
$$D(P_X \| P_{\hat{X}}) \geq D(P_U \| P_{\hat{U}}).$$

3. The proof is complete if we show that
$$D(P_U \| P_{\hat{U}}) \geq 2 \log_2(e) \, [P_X(\mathcal{A}) - P_{\hat{X}}(\mathcal{A})]^2 = 2 \log_2(e) \, [P_U(1) - P_{\hat{U}}(1)]^2.$$
For ease of notation, let $p = P_U(1)$ and $q = P_{\hat{U}}(1)$. Then proving the above inequality is equivalent to showing that
$$p \ln \frac{p}{q} + (1 - p) \ln \frac{1 - p}{1 - q} \geq 2 (p - q)^2.$$
Define
$$f(p, q) \triangleq p \ln \frac{p}{q} + (1 - p) \ln \frac{1 - p}{1 - q} - 2 (p - q)^2,$$
and observe that
$$\frac{d f(p, q)}{d q} = (p - q) \left( 4 - \frac{1}{q (1 - q)} \right) \leq 0 \quad \text{for } q \leq p.$$
Thus, $f(p, q)$ is non-increasing in $q$ for $q \leq p$. Also note that $f(p, q) = 0$ for $q = p$. Therefore,
$$f(p, q) \geq 0 \quad \text{for } q \leq p.$$
The proof is completed by noting that
$$f(p, q) \geq 0 \quad \text{for } q \geq p,$$
since $f(1 - p, 1 - q) = f(p, q)$. □
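Pinsker's inequality (Lemma 2.37) can likewise be spot-checked numerically; the Python sketch below compares $D(P\|Q)$ in bits with $\frac{\log_2(e)}{2}\,\|P - Q\|^2$ for arbitrary example distributions (it assumes $Q(x) > 0$ wherever $P(x) > 0$).

```python
import math

def kl(P, Q):
    """D(P||Q) in bits, assuming Q(x) > 0 whenever P(x) > 0."""
    return sum(p * math.log2(p / q) for p, q in zip(P, Q) if p > 0)

def variational(P, Q):
    """Variational (L1) distance between two pmfs given as aligned lists."""
    return sum(abs(p - q) for p, q in zip(P, Q))

P = [0.5, 0.3, 0.2]
Q = [0.25, 0.25, 0.5]

lhs = kl(P, Q)
rhs = (math.log2(math.e) / 2) * variational(P, Q) ** 2
print(lhs, ">=", rhs)   # Pinsker's inequality holds
```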
Observation 2.38 The above lemma tells us that for a sequence of distribution pairs $\{(P_{X_n}, P_{\hat{X}_n})\}_{n \geq 1}$, if $D(P_{X_n} \| P_{\hat{X}_n})$ goes to zero as $n$ goes to infinity, then $\| P_{X_n} - P_{\hat{X}_n} \|$ goes to zero as well. But the converse does not necessarily hold. For a quick counterexample, let
$$P_{X_n}(0) = 1 - P_{X_n}(1) = \frac{1}{n} > 0 \quad \text{and} \quad P_{\hat{X}_n}(0) = 1 - P_{\hat{X}_n}(1) = 0.$$
In this case,
$$D(P_{X_n} \| P_{\hat{X}_n}) = \infty,$$
since by convention $(1/n) \log_2 \big( (1/n)/0 \big) = \infty$. However,
$$\| P_{X_n} - P_{\hat{X}_n} \| = 2 \left[ P_{X_n}\big( \{x : P_{X_n}(x) > P_{\hat{X}_n}(x)\} \big) - P_{\hat{X}_n}\big( \{x : P_{X_n}(x) > P_{\hat{X}_n}(x)\} \big) \right] = \frac{2}{n} \to 0.$$
We can however upper bound $D(P_X \| P_{\hat{X}})$ by the variational distance between $P_X$ and $P_{\hat{X}}$ when $D(P_X \| P_{\hat{X}}) < \infty$.

Lemma 2.39 If $D(P_X \| P_{\hat{X}}) < \infty$, then
$$D(P_X \| P_{\hat{X}}) \leq \frac{\log_2(e)}{\displaystyle \min_{\{x :\, P_X(x) > 0\}} \min\{ P_X(x), P_{\hat{X}}(x) \}} \, \| P_X - P_{\hat{X}} \|.$$
Proof: Without loss of generality, we assume that $P_X(x) > 0$ for all $x \in \mathcal{X}$. Since $D(P_X \| P_{\hat{X}}) < \infty$, we have that for any $x \in \mathcal{X}$, $P_X(x) > 0$ implies $P_{\hat{X}}(x) > 0$. Let
$$t \triangleq \min_{\{x \in \mathcal{X}:\, P_X(x) > 0\}} \min\{ P_X(x), P_{\hat{X}}(x) \}.$$
Then for all $x \in \mathcal{X}$,
$$\ln \frac{P_X(x)}{P_{\hat{X}}(x)} \leq \left| \ln \frac{P_X(x)}{P_{\hat{X}}(x)} \right| \leq \left( \max_{\min\{P_X(x), P_{\hat{X}}(x)\} \leq s \leq \max\{P_X(x), P_{\hat{X}}(x)\}} \frac{d \ln(s)}{d s} \right) | P_X(x) - P_{\hat{X}}(x) |$$
$$= \frac{1}{\min\{ P_X(x), P_{\hat{X}}(x) \}} \, | P_X(x) - P_{\hat{X}}(x) | \leq \frac{1}{t} \, | P_X(x) - P_{\hat{X}}(x) |.$$
Hence,
$$D(P_X \| P_{\hat{X}}) = \log_2(e) \sum_{x \in \mathcal{X}} P_X(x) \ln \frac{P_X(x)}{P_{\hat{X}}(x)} \leq \frac{\log_2(e)}{t} \sum_{x \in \mathcal{X}} P_X(x) \, | P_X(x) - P_{\hat{X}}(x) | \leq \frac{\log_2(e)}{t} \sum_{x \in \mathcal{X}} | P_X(x) - P_{\hat{X}}(x) | = \frac{\log_2(e)}{t} \, \| P_X - P_{\hat{X}} \|. \qquad \Box$$
The next lemma discusses the effect of side information on divergence. As stated in Lemma 2.12, side information usually reduces entropy; it, however, increases divergence. One interpretation of these results is that side information is useful. Regarding entropy, side information provides us with more information, so uncertainty decreases. As for divergence, it is the measure or index of how easily one can differentiate the source between two candidate distributions. The larger the divergence, the easier it is to tell these two distributions apart and make the right guess. In the extreme case when the divergence is zero, one can never tell which distribution is the right one, since both produce the same source. So, when we obtain more information (side information), we should be able to make a better decision on the source statistics, which implies that the divergence should be larger.
Definition 2.40 (Conditional divergence) Given three discrete random variables $X$, $\hat{X}$ and $Z$, where $X$ and $\hat{X}$ have a common alphabet $\mathcal{X}$, we define the conditional divergence between $X$ and $\hat{X}$ given $Z$ by
$$D(X \| \hat{X} \,|\, Z) = D(P_{X|Z} \| P_{\hat{X}|Z}) \triangleq \sum_{z \in \mathcal{Z}} \sum_{x \in \mathcal{X}} P_{X,Z}(x, z) \log_2 \frac{P_{X|Z}(x|z)}{P_{\hat{X}|Z}(x|z)}.$$
In other words, it is the expected value, with respect to $P_{X,Z}$, of the log-likelihood ratio $\log_2 \frac{P_{X|Z}}{P_{\hat{X}|Z}}$.

Lemma 2.41 (Conditional mutual information and conditional divergence) Given three discrete random variables $X$, $Y$ and $Z$ with alphabets $\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{Z}$, respectively, and joint distribution $P_{X,Y,Z}$, we have
$$I(X; Y|Z) = D(P_{X,Y|Z} \| P_{X|Z} P_{Y|Z}) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \sum_{z \in \mathcal{Z}} P_{X,Y,Z}(x, y, z) \log_2 \frac{P_{X,Y|Z}(x, y|z)}{P_{X|Z}(x|z) P_{Y|Z}(y|z)},$$
where $P_{X,Y|Z}$ is the conditional joint distribution of $X$ and $Y$ given $Z$, and $P_{X|Z}$ and $P_{Y|Z}$ are the conditional distributions of $X$ and $Y$, respectively, given $Z$.

Proof: The proof follows directly from the definition of conditional mutual information (2.2.2) and the above definition of conditional divergence. □
Lemma 2.42 (Chain rule for divergence) For three discrete random variables $X$, $\hat{X}$ and $Z$, where $X$ and $\hat{X}$ have a common alphabet $\mathcal{X}$, we have that
$$D(P_{X,Z} \| P_{\hat{X},Z}) = D(P_X \| P_{\hat{X}}) + D(P_{Z|X} \| P_{Z|\hat{X}}),$$
where, as in Definition 2.40, the conditional divergence $D(P_{Z|X} \| P_{Z|\hat{X}}) \triangleq \sum_{x \in \mathcal{X}} \sum_{z \in \mathcal{Z}} P_{X,Z}(x, z) \log_2 \frac{P_{Z|X}(z|x)}{P_{Z|\hat{X}}(z|x)}$ is taken with the expectation under the true joint distribution $P_{X,Z}$.

Proof: The proof readily follows from the divergence definitions. □
Lemma 2.43 (Conditioning never decreases divergence) For three discrete random variables $X$, $\hat{X}$ and $Z$, where $X$ and $\hat{X}$ have a common alphabet $\mathcal{X}$, we have that
$$D(P_{X|Z} \| P_{\hat{X}|Z}) \geq D(P_X \| P_{\hat{X}}).$$

Proof:
$$D(P_{X|Z} \| P_{\hat{X}|Z}) - D(P_X \| P_{\hat{X}}) = \sum_{z \in \mathcal{Z}} \sum_{x \in \mathcal{X}} P_{X,Z}(x, z) \log_2 \frac{P_{X|Z}(x|z)}{P_{\hat{X}|Z}(x|z)} - \sum_{x \in \mathcal{X}} \left( \sum_{z \in \mathcal{Z}} P_{X,Z}(x, z) \right) \log_2 \frac{P_X(x)}{P_{\hat{X}}(x)}$$
$$= \sum_{z \in \mathcal{Z}} \sum_{x \in \mathcal{X}} P_{X,Z}(x, z) \log_2 \frac{P_{X|Z}(x|z) \, P_{\hat{X}}(x)}{P_{\hat{X}|Z}(x|z) \, P_X(x)}$$
$$\geq \sum_{z \in \mathcal{Z}} \sum_{x \in \mathcal{X}} P_{X,Z}(x, z) \log_2(e) \left( 1 - \frac{P_{\hat{X}|Z}(x|z) \, P_X(x)}{P_{X|Z}(x|z) \, P_{\hat{X}}(x)} \right) \quad \text{(by the FI Lemma)}$$
$$= \log_2(e) \left( 1 - \sum_{x \in \mathcal{X}} \frac{P_X(x)}{P_{\hat{X}}(x)} \sum_{z \in \mathcal{Z}} P_Z(z) \, P_{\hat{X}|Z}(x|z) \right) = \log_2(e) \left( 1 - \sum_{x \in \mathcal{X}} \frac{P_X(x)}{P_{\hat{X}}(x)} \, P_{\hat{X}}(x) \right) = \log_2(e) \left( 1 - \sum_{x \in \mathcal{X}} P_X(x) \right) = 0,$$
with equality holding iff for all $x$ and $z$,
$$\frac{P_X(x)}{P_{\hat{X}}(x)} = \frac{P_{X|Z}(x|z)}{P_{\hat{X}|Z}(x|z)}. \qquad \Box$$
Note that it is not necessarily true that
$$
D(P_{X|Z}\|P_{\hat{X}|\hat{Z}}) \;\ge\; D(P_X\|P_{\hat{X}}).
$$
In other words, side information is helpful for divergence only when it provides information on the similarity or difference of the two distributions. In the above case, $Z$ only provides information about $X$, and $\hat{Z}$ provides information about $\hat{X}$; so the divergence certainly cannot be expected to increase. The next lemma shows that if $(Z,\hat{Z})$ is independent of $(X,\hat{X})$, then the side information $(Z,\hat{Z})$ does not help in improving the divergence of $X$ against $\hat{X}$.
Lemma 2.44 (Independent side information does not change divergence) If $(X,\hat{X})$ is independent of $(Z,\hat{Z})$, then
$$
D(P_{X|Z}\|P_{\hat{X}|\hat{Z}}) = D(P_X\|P_{\hat{X}}),
$$
where
$$
D(P_{X|Z}\|P_{\hat{X}|\hat{Z}}) \triangleq \sum_{x\in\mathcal{X}}\sum_{\hat{x}\in\mathcal{X}}\sum_{z\in\mathcal{Z}}\sum_{\hat{z}\in\hat{\mathcal{Z}}}
P_{X,\hat{X},Z,\hat{Z}}(x,\hat{x},z,\hat{z})\,\log_2\frac{P_{X|Z}(x|z)}{P_{\hat{X}|\hat{Z}}(\hat{x}|\hat{z})}.
$$
Proof: This can be easily justified by the definition of divergence. □
Lemma 2.45 (Additivity of divergence under independence)
$$
D(P_{X,Z}\|P_{\hat{X},\hat{Z}}) = D(P_X\|P_{\hat{X}}) + D(P_Z\|P_{\hat{Z}}),
$$
provided that $(X,\hat{X})$ is independent of $(Z,\hat{Z})$.

Proof: This can be easily proved from the definition. □
2.7 Convexity/concavity of information measures
We next address the convexity/concavity properties of information measures
with respect to the distributions on which they are defined. Such properties will
be useful when optimizing the information measures over distribution spaces.
Lemma 2.46

1. $H(P_X)$ is a concave function of $P_X$, namely
$$
H(\lambda P_X + (1-\lambda)P_{\widetilde{X}}) \;\ge\; \lambda H(P_X) + (1-\lambda)H(P_{\widetilde{X}}).
$$

2. Noting that $I(X;Y)$ can be re-written as $I(P_X, P_{Y|X})$, where
$$
I(P_X, P_{Y|X}) \triangleq \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_{Y|X}(y|x)P_X(x)\,\log_2\frac{P_{Y|X}(y|x)}{\sum_{a\in\mathcal{X}} P_{Y|X}(y|a)P_X(a)},
$$
then $I(X;Y)$ is a concave function of $P_X$ (for fixed $P_{Y|X}$), and a convex function of $P_{Y|X}$ (for fixed $P_X$).

3. $D(P_X\|P_{\hat{X}})$ is convex with respect to both the first argument $P_X$ and the second argument $P_{\hat{X}}$. It is also convex in the pair $(P_X, P_{\hat{X}})$; i.e., if $(P_X, P_{\hat{X}})$ and $(Q_X, Q_{\hat{X}})$ are two pairs of probability mass functions, then
$$
D\big(\lambda P_X + (1-\lambda)Q_X \,\big\|\, \lambda P_{\hat{X}} + (1-\lambda)Q_{\hat{X}}\big)
\;\le\; \lambda\, D(P_X\|P_{\hat{X}}) + (1-\lambda)\, D(Q_X\|Q_{\hat{X}}), \qquad (2.7.1)
$$
for all $\lambda\in[0,1]$.
Proof:

1. The proof uses the log-sum inequality:
$$
H(\lambda P_X + (1-\lambda)P_{\widetilde{X}}) - \big[\lambda H(P_X) + (1-\lambda)H(P_{\widetilde{X}})\big]
$$
$$
= \lambda\sum_{x\in\mathcal{X}} P_X(x)\log_2\frac{P_X(x)}{\lambda P_X(x)+(1-\lambda)P_{\widetilde{X}}(x)}
+ (1-\lambda)\sum_{x\in\mathcal{X}} P_{\widetilde{X}}(x)\log_2\frac{P_{\widetilde{X}}(x)}{\lambda P_X(x)+(1-\lambda)P_{\widetilde{X}}(x)}
$$
$$
\ge \lambda\Big(\sum_{x\in\mathcal{X}} P_X(x)\Big)\log_2\frac{\sum_{x\in\mathcal{X}} P_X(x)}{\sum_{x\in\mathcal{X}}[\lambda P_X(x)+(1-\lambda)P_{\widetilde{X}}(x)]}
+ (1-\lambda)\Big(\sum_{x\in\mathcal{X}} P_{\widetilde{X}}(x)\Big)\log_2\frac{\sum_{x\in\mathcal{X}} P_{\widetilde{X}}(x)}{\sum_{x\in\mathcal{X}}[\lambda P_X(x)+(1-\lambda)P_{\widetilde{X}}(x)]}
= 0,
$$
with equality holding iff $P_X(x) = P_{\widetilde{X}}(x)$ for all $x$.

2. We first show the concavity of $I(P_X, P_{Y|X})$ with respect to $P_X$. Let $\bar\lambda \triangleq 1-\lambda$. Then
$$
I(\lambda P_X + \bar\lambda P_{\widetilde{X}},\, P_{Y|X}) - \lambda\, I(P_X, P_{Y|X}) - \bar\lambda\, I(P_{\widetilde{X}}, P_{Y|X})
$$
$$
= \lambda\sum_{y\in\mathcal{Y}}\sum_{x\in\mathcal{X}} P_X(x)P_{Y|X}(y|x)\,\log_2
\frac{\sum_{x\in\mathcal{X}} P_X(x)P_{Y|X}(y|x)}{\sum_{x\in\mathcal{X}}[\lambda P_X(x)+\bar\lambda P_{\widetilde{X}}(x)]P_{Y|X}(y|x)}
+ \bar\lambda\sum_{y\in\mathcal{Y}}\sum_{x\in\mathcal{X}} P_{\widetilde{X}}(x)P_{Y|X}(y|x)\,\log_2
\frac{\sum_{x\in\mathcal{X}} P_{\widetilde{X}}(x)P_{Y|X}(y|x)}{\sum_{x\in\mathcal{X}}[\lambda P_X(x)+\bar\lambda P_{\widetilde{X}}(x)]P_{Y|X}(y|x)}
$$
$$
\ge 0 \qquad\text{(by the log-sum inequality)},
$$
with equality holding iff
$$
\sum_{x\in\mathcal{X}} P_X(x)P_{Y|X}(y|x) = \sum_{x\in\mathcal{X}} P_{\widetilde{X}}(x)P_{Y|X}(y|x) \quad\text{for all } y\in\mathcal{Y}.
$$
We now turn to the convexity of $I(P_X, P_{Y|X})$ with respect to $P_{Y|X}$. For ease of notation, let
$$
P_{Y_\lambda}(y) \triangleq \lambda P_Y(y) + \bar\lambda P_{\widetilde{Y}}(y)
\quad\text{and}\quad
P_{Y_\lambda|X}(y|x) \triangleq \lambda P_{Y|X}(y|x) + \bar\lambda P_{\widetilde{Y}|X}(y|x),
$$
where $P_Y(y) = \sum_{x\in\mathcal{X}} P_X(x)P_{Y|X}(y|x)$ and $P_{\widetilde{Y}}(y) = \sum_{x\in\mathcal{X}} P_X(x)P_{\widetilde{Y}|X}(y|x)$. Then
$$
\lambda\, I(P_X, P_{Y|X}) + \bar\lambda\, I(P_X, P_{\widetilde{Y}|X}) - I(P_X,\, \lambda P_{Y|X} + \bar\lambda P_{\widetilde{Y}|X})
$$
$$
= \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} \lambda\, P_X(x)P_{Y|X}(y|x)\log_2\frac{P_{Y|X}(y|x)}{P_Y(y)}
+ \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} \bar\lambda\, P_X(x)P_{\widetilde{Y}|X}(y|x)\log_2\frac{P_{\widetilde{Y}|X}(y|x)}{P_{\widetilde{Y}}(y)}
- \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_X(x)P_{Y_\lambda|X}(y|x)\log_2\frac{P_{Y_\lambda|X}(y|x)}{P_{Y_\lambda}(y)}
$$
$$
= \lambda\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_X(x)P_{Y|X}(y|x)\log_2\frac{P_{Y|X}(y|x)\,P_{Y_\lambda}(y)}{P_Y(y)\,P_{Y_\lambda|X}(y|x)}
+ \bar\lambda\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_X(x)P_{\widetilde{Y}|X}(y|x)\log_2\frac{P_{\widetilde{Y}|X}(y|x)\,P_{Y_\lambda}(y)}{P_{\widetilde{Y}}(y)\,P_{Y_\lambda|X}(y|x)}
$$
$$
\ge \lambda\log_2(e)\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_X(x)P_{Y|X}(y|x)\Big(1 - \frac{P_Y(y)\,P_{Y_\lambda|X}(y|x)}{P_{Y|X}(y|x)\,P_{Y_\lambda}(y)}\Big)
+ \bar\lambda\log_2(e)\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_X(x)P_{\widetilde{Y}|X}(y|x)\Big(1 - \frac{P_{\widetilde{Y}}(y)\,P_{Y_\lambda|X}(y|x)}{P_{\widetilde{Y}|X}(y|x)\,P_{Y_\lambda}(y)}\Big)
= 0,
$$
where the inequality follows from the FI Lemma, with equality holding iff
$$
\frac{P_Y(y)}{P_{Y|X}(y|x)} = \frac{P_{\widetilde{Y}}(y)}{P_{\widetilde{Y}|X}(y|x)} \qquad\text{for all } x\in\mathcal{X},\ y\in\mathcal{Y}.
$$

3. For ease of notation, let $P_{X_\lambda}(x) \triangleq \lambda P_X(x) + (1-\lambda)P_{\widetilde{X}}(x)$. Then
$$
\lambda D(P_X\|P_{\hat{X}}) + (1-\lambda)D(P_{\widetilde{X}}\|P_{\hat{X}}) - D(P_{X_\lambda}\|P_{\hat{X}})
= \lambda\sum_{x\in\mathcal{X}} P_X(x)\log_2\frac{P_X(x)}{P_{X_\lambda}(x)} + (1-\lambda)\sum_{x\in\mathcal{X}} P_{\widetilde{X}}(x)\log_2\frac{P_{\widetilde{X}}(x)}{P_{X_\lambda}(x)}
$$
$$
= \lambda D(P_X\|P_{X_\lambda}) + (1-\lambda)D(P_{\widetilde{X}}\|P_{X_\lambda}) \;\ge\; 0
$$
by the non-negativity of the divergence, with equality holding iff $P_X(x) = P_{\widetilde{X}}(x)$ for all $x$. This establishes convexity in the first argument.

Similarly, by letting $P_\lambda(x) \triangleq \lambda P_{\hat{X}}(x) + (1-\lambda)P_{\widetilde{X}}(x)$, we obtain
$$
\lambda D(P_X\|P_{\hat{X}}) + (1-\lambda)D(P_X\|P_{\widetilde{X}}) - D(P_X\|P_\lambda)
= \lambda\sum_{x\in\mathcal{X}} P_X(x)\log_2\frac{P_\lambda(x)}{P_{\hat{X}}(x)} + (1-\lambda)\sum_{x\in\mathcal{X}} P_X(x)\log_2\frac{P_\lambda(x)}{P_{\widetilde{X}}(x)}
$$
$$
\ge \frac{\lambda}{\ln 2}\sum_{x\in\mathcal{X}} P_X(x)\Big(1 - \frac{P_{\hat{X}}(x)}{P_\lambda(x)}\Big)
+ \frac{1-\lambda}{\ln 2}\sum_{x\in\mathcal{X}} P_X(x)\Big(1 - \frac{P_{\widetilde{X}}(x)}{P_\lambda(x)}\Big)
= \log_2(e)\Big(1 - \sum_{x\in\mathcal{X}} P_X(x)\,\frac{\lambda P_{\hat{X}}(x) + (1-\lambda)P_{\widetilde{X}}(x)}{P_\lambda(x)}\Big) = 0,
$$
where the inequality follows from the FI Lemma, with equality holding iff $P_{\widetilde{X}}(x) = P_{\hat{X}}(x)$ for all $x$. This establishes convexity in the second argument.

Finally, by the log-sum inequality, for each $x\in\mathcal{X}$ we have
$$
\big(\lambda P_X(x) + (1-\lambda)Q_X(x)\big)\log_2\frac{\lambda P_X(x) + (1-\lambda)Q_X(x)}{\lambda P_{\hat{X}}(x) + (1-\lambda)Q_{\hat{X}}(x)}
\;\le\; \lambda P_X(x)\log_2\frac{\lambda P_X(x)}{\lambda P_{\hat{X}}(x)} + (1-\lambda)Q_X(x)\log_2\frac{(1-\lambda)Q_X(x)}{(1-\lambda)Q_{\hat{X}}(x)}.
$$
Summing over $x$ yields (2.7.1).

Note that the last result (convexity of $D(P_X\|P_{\hat{X}})$ in the pair $(P_X, P_{\hat{X}})$) actually implies the first two results: just set $P_{\hat{X}} = Q_{\hat{X}}$ to show convexity in the first argument $P_X$, and set $P_X = Q_X$ to show convexity in the second argument $P_{\hat{X}}$. □
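For a quick numerical illustration of (2.7.1), the following sketch (with arbitrarily chosen pmf pairs, purely illustrative) verifies the joint convexity of the divergence over a grid of values of $\lambda$.

    import math

    def D(P, Q):
        """Divergence D(P||Q) in bits (assumes all masses are positive)."""
        return sum(p * math.log2(p / q) for p, q in zip(P, Q))

    def mix(P, Q, lam):
        return [lam * p + (1 - lam) * q for p, q in zip(P, Q)]

    P_X, P_Xhat = [0.7, 0.2, 0.1], [0.5, 0.3, 0.2]   # first pair (illustrative)
    Q_X, Q_Xhat = [0.2, 0.5, 0.3], [0.3, 0.3, 0.4]   # second pair (illustrative)

    for k in range(11):
        lam = k / 10
        lhs = D(mix(P_X, Q_X, lam), mix(P_Xhat, Q_Xhat, lam))
        rhs = lam * D(P_X, P_Xhat) + (1 - lam) * D(Q_X, Q_Xhat)
        assert lhs <= rhs + 1e-12, (lam, lhs, rhs)
    print("Inequality (2.7.1) holds on the tested grid of lambda values.")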
2.8 Fundamentals of hypothesis testing
One of the fundamental problems in statistics is to decide between two alternative
explanations for the observed data. For example, when gambling, one may wish
to test whether it is a fair game or not. Similarly, a sequence of observations on
the market may reveal whether or not a new product is successful. This is the
simplest form of the hypothesis testing problem, which is usually called simple
hypothesis testing.
It has quite a few applications in information theory. One of the frequently
cited examples is the alternative interpretation of the law of large numbers.
Another example is the computation of the true coding error (for universal codes)
by testing the empirical distribution against the true distribution. All of these
cases will be discussed subsequently.
The simple hypothesis testing problem can be formulated as follows:
Problem: Let $X_1, \ldots, X_n$ be a sequence of observations which is possibly drawn according to either a null hypothesis distribution $P_{X^n}$ or an alternative hypothesis distribution $P_{\hat{X}^n}$. The hypotheses are usually denoted by:
$$
H_0:\ P_{X^n} \qquad\qquad H_1:\ P_{\hat{X}^n}
$$
Based on one sequence of observations $x^n$, one has to decide which of the hypotheses is true. This is denoted by a decision mapping $\phi(\cdot)$, where
$$
\phi(x^n) = \begin{cases} 0, & \text{if the distribution of } X^n \text{ is classified to be } P_{X^n};\\ 1, & \text{if the distribution of } X^n \text{ is classified to be } P_{\hat{X}^n}.\end{cases}
$$
Accordingly, the possible observed sequences are divided into two groups:
$$
\text{Acceptance region for } H_0:\ \{x^n\in\mathcal{X}^n : \phi(x^n) = 0\}
$$
$$
\text{Acceptance region for } H_1:\ \{x^n\in\mathcal{X}^n : \phi(x^n) = 1\}.
$$
Hence, depending on the true distribution, there are possibly two types of probabilities of error:
$$
\text{Type I error: } \alpha_n = \alpha_n(\phi) \triangleq P_{X^n}\big(\{x^n\in\mathcal{X}^n : \phi(x^n) = 1\}\big)
$$
$$
\text{Type II error: } \beta_n = \beta_n(\phi) \triangleq P_{\hat{X}^n}\big(\{x^n\in\mathcal{X}^n : \phi(x^n) = 0\}\big).
$$
The choice of the decision mapping depends on the optimization criterion. Two of the most frequently used criteria in information theory are:

1. Bayesian hypothesis testing. Here, $\phi(\cdot)$ is chosen so that the Bayesian cost
$$\pi_0\,\alpha_n + \pi_1\,\beta_n$$
is minimized, where $\pi_0$ and $\pi_1$ are the prior probabilities for the null and alternative hypotheses, respectively. The mathematical expression for Bayesian testing is:
$$\min_{\{\phi\}}\big[\pi_0\,\alpha_n(\phi) + \pi_1\,\beta_n(\phi)\big].$$

2. Neyman-Pearson hypothesis testing subject to a fixed test level. Here, $\phi(\cdot)$ is chosen so that the type II error $\beta_n$ is minimized subject to a constant bound on the type I error, i.e., $\alpha_n \le \varepsilon$, where $\varepsilon > 0$ is fixed. The mathematical expression for Neyman-Pearson testing is:
$$\min_{\{\phi\,:\,\alpha_n(\phi)\le\varepsilon\}}\beta_n(\phi).$$
The set considered in the minimization operation could have two different ranges: the range of deterministic rules and the range of randomized rules. The main difference between a randomized rule and a deterministic rule is that the former allows the mapping $\phi(x^n)$ to be random on $\{0, 1\}$ for some $x^n$, while the latter only accepts deterministic assignments in $\{0, 1\}$ for all $x^n$. For example, a randomized rule for a specific observation $\bar{x}^n$ can be
$$
\phi(\bar{x}^n) = 0, \text{ with probability } 0.2; \qquad
\phi(\bar{x}^n) = 1, \text{ with probability } 0.8.
$$
The Neyman-Pearson lemma shows the well-known fact that the likelihood
ratio test is always the optimal test.
Lemma 2.47 (Neyman-Pearson Lemma) For a simple hypothesis testing problem, define an acceptance region for the null hypothesis through the likelihood ratio as
$$
\mathcal{A}_n(\tau) \triangleq \Big\{x^n\in\mathcal{X}^n : \frac{P_{X^n}(x^n)}{P_{\hat{X}^n}(x^n)} > \tau\Big\},
$$
and let
$$
\alpha_n^* \triangleq P_{X^n}\big(\mathcal{A}_n^c(\tau)\big) \quad\text{and}\quad \beta_n^* \triangleq P_{\hat{X}^n}\big(\mathcal{A}_n(\tau)\big).
$$
Then for the type I error $\alpha_n$ and type II error $\beta_n$ associated with any other choice of acceptance region for the null hypothesis, we have
$$
\alpha_n \le \alpha_n^* \;\Longrightarrow\; \beta_n \ge \beta_n^*.
$$
Proof: Let $B$ be a choice of acceptance region for the null hypothesis. Then
$$
\tau\,\alpha_n + \beta_n = \tau\sum_{x^n\in B^c} P_{X^n}(x^n) + \sum_{x^n\in B} P_{\hat{X}^n}(x^n)
= \tau\sum_{x^n\in B^c} P_{X^n}(x^n) + \Big(1 - \sum_{x^n\in B^c} P_{\hat{X}^n}(x^n)\Big)
$$
$$
= 1 + \sum_{x^n\in B^c}\big[\tau\,P_{X^n}(x^n) - P_{\hat{X}^n}(x^n)\big]. \qquad (2.8.1)
$$
Observe that (2.8.1) is minimized by choosing $B = \mathcal{A}_n(\tau)$. Hence,
$$
\tau\,\alpha_n + \beta_n \;\ge\; \tau\,\alpha_n^* + \beta_n^*,
$$
which immediately implies the desired result. □
The Neyman-Pearson lemma indicates that no other choice of acceptance region can simultaneously improve both the type I and type II errors of the likelihood ratio test. Indeed, from (2.8.1), it is clear that for any $\alpha_n$ and $\beta_n$, one can always find a likelihood ratio test that performs at least as well. Therefore, the likelihood ratio test is an optimal test. The statistical properties of the likelihood ratio thus become essential in hypothesis testing. Note that, when the observations are i.i.d. under both hypotheses, the divergence, which is the statistical expectation of the log-likelihood ratio, plays an important role in hypothesis testing (for non-memoryless observations, one is then concerned with the divergence rate, an extended notion of divergence for systems with memory which will be defined in a following chapter).
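To make the likelihood ratio test concrete, here is a small Monte Carlo sketch for i.i.d. observations under two Bernoulli hypotheses (the distributions, threshold and sample size are illustrative choices, not values from the text): it estimates the two error probabilities of the test that accepts $H_0$ when $P_{X^n}(x^n)/P_{\hat{X}^n}(x^n) > \tau$.

    import random
    import math

    random.seed(0)
    p0, p1 = 0.5, 0.7        # H0: Bernoulli(p0) i.i.d., H1: Bernoulli(p1) i.i.d. (illustrative)
    n, tau, trials = 50, 1.0, 20000

    def log_lr(xs):
        """log of P_{X^n}(x^n) / P_{Xhat^n}(x^n) for the two i.i.d. Bernoulli hypotheses."""
        return sum(math.log(p0 / p1) if x else math.log((1 - p0) / (1 - p1)) for x in xs)

    def accept_H0(xs):
        return log_lr(xs) > math.log(tau)

    type1 = sum(not accept_H0([random.random() < p0 for _ in range(n)])
                for _ in range(trials)) / trials
    type2 = sum(accept_H0([random.random() < p1 for _ in range(n)])
                for _ in range(trials)) / trials
    print(f"estimated type I error  alpha_n ~ {type1:.3f}")
    print(f"estimated type II error beta_n  ~ {type2:.3f}")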
Chapter 3
Lossless Data Compression
3.1 Principles of data compression
As mentioned in Chapter 1, data compression describes methods of representing
a source by a code whose average codeword length (or code rate) is acceptably
small. The representation can be: lossless (or asymptotically lossless) where
the reconstructed source is identical (or asymptotically identical) to the original
source; or lossy where the reconstructed source is allowed to deviate from the
original source, usually within an acceptable threshold. We herein focus on
lossless data compression.
Since a memoryless source is modelled as a random variable, the average codeword length of a codebook is calculated based on the probability distribution of that random variable. For example, consider a ternary memoryless source $X$ with three possible outcomes, where
$$
P_X(x = \text{outcome}_A) = 0.5,\quad P_X(x = \text{outcome}_B) = 0.25,\quad P_X(x = \text{outcome}_C) = 0.25.
$$
Suppose that a binary codebook is designed for this source, in which outcome$_A$, outcome$_B$ and outcome$_C$ are respectively encoded as 0, 10 and 11. Then the average codeword length (in bits/source outcome) is
$$
\text{length}(0)\cdot P_X(\text{outcome}_A) + \text{length}(10)\cdot P_X(\text{outcome}_B) + \text{length}(11)\cdot P_X(\text{outcome}_C)
= 1\times 0.5 + 2\times 0.25 + 2\times 0.25 = 1.5 \text{ bits}.
$$
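The same computation in code (a minimal sketch; the outcome labels and codewords are exactly those of the example above):

    codebook = {"A": "0", "B": "10", "C": "11"}          # outcome -> binary codeword
    pmf      = {"A": 0.5, "B": 0.25, "C": 0.25}          # P_X for the ternary source

    avg_len = sum(pmf[s] * len(codebook[s]) for s in pmf)
    print(f"average codeword length = {avg_len} bits/source outcome")   # 1.5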
There are usually no constraints on the basic structure of a code. In the case where the codeword length for each source outcome can be different, the code is called a variable-length code. When the codeword lengths of all source outcomes are equal, the code is referred to as a fixed-length code. It is obvious that the minimum average codeword length among all variable-length codes is no greater than that among all fixed-length codes, since the latter form a subclass of the former. We will see in this chapter that the smallest achievable average code rates for variable-length and fixed-length codes coincide for sources with good probabilistic characteristics, such as stationarity and ergodicity. But for more general sources with memory, the two quantities are different (cf. Part II of the book).

For fixed-length codes, the sequence of adjacent codewords is concatenated together for storage or transmission purposes, and some punctuation mechanism (such as marking the beginning of each codeword or delineating internal sub-blocks for synchronization between encoder and decoder) is normally considered an implicit part of the codewords. Due to constraints on space or processing capability, the sequence of source symbols may be too long for the encoder to deal with all at once; therefore, segmentation before encoding is often necessary. For example, suppose that we need to encode, using a binary code, the grades of a class with 100 students. There are three grade levels: A, B and C. By observing that there are $3^{100}$ possible grade combinations for 100 students, a straightforward code design requires $\lceil\log_2(3^{100})\rceil = 159$ bits to encode these combinations. Now suppose that the encoder facility can only process 16 bits at a time. Then the above code design becomes infeasible and segmentation is unavoidable. Under such a constraint, we may encode the grades of 10 students at a time, which requires $\lceil\log_2(3^{10})\rceil = 16$ bits. As a consequence, for a class of 100 students, the code requires 160 bits in total.

In the above example, the letters in the grade set {A, B, C} and the letters from the code alphabet {0, 1} are often called source symbols and code symbols, respectively. When the code alphabet is binary (as in the previous two examples), the code symbols are referred to as code bits or simply bits (as already used). A tuple (or grouped sequence) of source symbols is called a sourceword and the resulting encoded tuple consisting of code symbols is called a codeword. (In the above example, each sourceword consists of 10 source symbols (students) and each codeword consists of 16 bits.)

Note that, during the encoding process, the sourceword lengths do not have to be equal. In this text, we however only consider the case where the sourcewords have a fixed length throughout the encoding process (except for the Lempel-Ziv code briefly discussed at the end of this chapter), but we will allow the codewords to have fixed or variable lengths as defined earlier.^1 The block diagram of a source coding system is depicted in Figure 3.1.

[Figure 3.1: Block diagram of a data compression system: source (sourcewords) to source encoder (codewords) to source decoder (sourcewords).]

^1 In other words, our fixed-length codes are actually fixed-to-fixed length codes and our variable-length codes are fixed-to-variable length codes since, in both cases, a fixed number of source symbols is mapped onto codewords with fixed and variable lengths, respectively.
When adding segmentation mechanisms to fixed-length codes, the codes can be loosely divided into two groups. The first consists of block codes, in which the encoding (or decoding) of the next segment of source symbols is independent of the previous segments. If the encoding/decoding of the next segment somehow retains and uses some knowledge of earlier segments, the code is called a fixed-length tree code. As we will not investigate such codes in this text, we can use "block codes" and "fixed-length codes" as synonyms.

In this chapter, we first consider data compression via block codes in Section 3.2. Data compression via variable-length codes is then addressed in Section 3.3.
3.2 Block codes for asymptotically lossless compression
3.2.1 Block codes for discrete memoryless sources
We first focus on the study of asymptotically lossless data compression of discrete memoryless sources via block (fixed-length) codes. Such sources were already defined in Appendix B and the previous chapter; we nevertheless recall their definition.

Definition 3.1 (Discrete memoryless source) A discrete memoryless source (DMS) $\{X_n\}_{n=1}^{\infty}$ consists of a sequence of independent and identically distributed (i.i.d.) random variables $X_1, X_2, X_3, \ldots$, all taking values in a common finite alphabet $\mathcal{X}$. In particular, if $P_X(\cdot)$ is the common distribution or probability mass function (pmf) of the $X_i$'s, then
$$
P_{X^n}(x_1, x_2, \ldots, x_n) = \prod_{i=1}^n P_X(x_i).
$$
Definition 3.2 An $(n, M)$ block code of blocklength $n$ and size $M$ (which can be a function of $n$ in general,^2 i.e., $M = M_n$) for a discrete source $\{X_n\}_{n=1}^{\infty}$ is a set $\{c_1, c_2, \ldots, c_M\} \subseteq \mathcal{X}^n$ consisting of $M$ reproduction (or reconstruction) words, where each reproduction word is a sourceword (an $n$-tuple of source symbols).^3

The block code's operation can be symbolically represented as^4
$$
(x_1, x_2, \ldots, x_n) \;\to\; c_m \in \{c_1, c_2, \ldots, c_M\}.
$$
This procedure is repeated for each consecutive block of length $n$, i.e.,
$$
\cdots (x_{3n}, \ldots, x_{31})(x_{2n}, \ldots, x_{21})(x_{1n}, \ldots, x_{11}) \;\to\; \cdots\, |\, c_{m_3}\, |\, c_{m_2}\, |\, c_{m_1},
$$
where "|" reflects the necessity of a punctuation (or synchronization) mechanism between consecutive source blocks.

The next theorem provides a key tool for proving Shannon's source coding theorem.

^2 In the literature, both $(n, M)$ and $(M, n)$ have been used to denote a block code with blocklength $n$ and size $M$. For example, [47, p. 149] adopts the former, while [12, p. 193] uses the latter. We use the $(n, M)$ notation since $M = M_n$ is a function of $n$ in general.

^3 One can binary-index the reproduction words in $\{c_1, c_2, \ldots, c_M\}$ using $k \triangleq \lceil\log_2 M\rceil$ bits. As such $k$-bit words in $\{0,1\}^k$ are usually stored for retrieval at a later date, the $(n, M)$ block code can be represented by an encoder-decoder pair of functions $(f, g)$, where the encoding function $f : \mathcal{X}^n \to \{0,1\}^k$ maps each sourceword $x^n$ to a $k$-bit word $f(x^n)$ which we call a codeword. The decoding function $g : \{0,1\}^k \to \{c_1, c_2, \ldots, c_M\}$ is then a retrieving operation that produces the reproduction words. Since the codewords are binary-valued, such a block code is called a binary code. More generally, a D-ary block code (where $D > 1$ is an integer) would use an encoding function $f : \mathcal{X}^n \to \{0, 1, \ldots, D-1\}^k$, where each codeword $f(x^n)$ contains $k$ D-ary code symbols.
Furthermore, since the behavior of block codes is investigated for sufficiently large $n$ and $M$ (tending to infinity), it is legitimate to replace $\lceil\log_2 M\rceil$ by $\log_2 M$ for the case of binary codes. With this convention, the data compression rate or code rate is
$$\text{bits required per source symbol} = \frac{k}{n} = \frac{1}{n}\log_2 M.$$
Similarly, for D-ary codes, the rate is
$$\text{D-ary code symbols required per source symbol} = \frac{k}{n} = \frac{1}{n}\log_D M.$$
For computational convenience, nats (under the natural logarithm) can be used instead of bits or D-ary code symbols; in this case, the code rate becomes
$$\text{nats required per source symbol} = \frac{1}{n}\log M.$$

^4 When one uses an encoder-decoder pair $(f, g)$ to describe the block code, the code's operation can be expressed as $c_m = g(f(x^n))$.
Theorem 3.3 (Shannon-McMillan) (Asymptotic equipartition property or AEP^5) If $\{X_n\}_{n=1}^{\infty}$ is a DMS with entropy $H(X)$, then
$$
-\frac{1}{n}\log_2 P_{X^n}(X_1, \ldots, X_n) \;\to\; H(X) \quad\text{in probability}.
$$
In other words, for any $\delta > 0$,
$$
\lim_{n\to\infty}\Pr\Big\{\Big|-\frac{1}{n}\log_2 P_{X^n}(X_1, \ldots, X_n) - H(X)\Big| > \delta\Big\} = 0.
$$
Proof: This theorem follows by first observing that for an i.i.d. sequence $\{X_n\}_{n=1}^{\infty}$,
$$
-\frac{1}{n}\log_2 P_{X^n}(X_1, \ldots, X_n) = \frac{1}{n}\sum_{i=1}^n\big[-\log_2 P_X(X_i)\big]
$$
and that the sequence $\{-\log_2 P_X(X_i)\}_{i=1}^{\infty}$ is i.i.d., and then applying the weak law of large numbers (WLLN) to the latter sequence. □

^5 This is also called the entropy stability property.
The AEP indeed constitutes an information-theoretic analog of the WLLN, as it states that if $\{-\log_2 P_X(X_i)\}_{i=1}^{\infty}$ is an i.i.d. sequence, then for any $\delta > 0$,
$$
\Pr\Big\{\Big|\frac{1}{n}\sum_{i=1}^n\big[-\log_2 P_X(X_i)\big] - H(X)\Big| \le \delta\Big\} \;\to\; 1 \quad\text{as } n\to\infty.
$$
As a consequence of the AEP, all the probability mass will ultimately be placed on the weakly $\delta$-typical set, which is defined as
$$
\mathcal{F}_n(\delta) \triangleq \Big\{x^n\in\mathcal{X}^n : \Big|-\frac{1}{n}\log_2 P_{X^n}(x^n) - H(X)\Big| \le \delta\Big\}
= \Big\{x^n\in\mathcal{X}^n : \Big|-\frac{1}{n}\sum_{i=1}^n\log_2 P_X(x_i) - H(X)\Big| \le \delta\Big\}.
$$
Note that since the source is memoryless, for any $x^n\in\mathcal{F}_n(\delta)$, the normalized self-information $-(1/n)\log_2 P_{X^n}(x^n)$ of $x^n$ is equal to $(1/n)\sum_{i=1}^n[-\log_2 P_X(x_i)]$, which is the empirical (arithmetic) average self-information or apparent entropy of the source. Thus, a sourceword $x^n$ is $\delta$-typical if it yields an apparent source entropy within $\delta$ of the true source entropy $H(X)$. Note that the sourcewords in $\mathcal{F}_n(\delta)$ are nearly equiprobable or equally surprising (cf. Property 1 of Theorem 3.4); this justifies naming Theorem 3.3 the AEP.
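The convergence in Theorem 3.3 is easy to observe empirically. The sketch below (an illustrative simulation; the pmf, blocklengths and $\delta$ are arbitrary choices) draws i.i.d. blocks and compares the normalized self-information with $H(X)$, estimating the probability of landing in the weakly $\delta$-typical set.

    import random
    import math

    random.seed(1)
    alphabet = ["A", "B", "C", "D"]
    pmf = [0.4, 0.3, 0.2, 0.1]                      # illustrative DMS
    H = -sum(p * math.log2(p) for p in pmf)         # source entropy in bits
    delta, trials = 0.1, 5000

    def apparent_entropy(n):
        xs = random.choices(alphabet, weights=pmf, k=n)
        return -sum(math.log2(pmf[alphabet.index(x)]) for x in xs) / n

    for n in (10, 100, 1000):
        inside = sum(abs(apparent_entropy(n) - H) <= delta for _ in range(trials)) / trials
        print(f"n = {n:4d}: P{{X^n is delta-typical}} ~ {inside:.3f}  (H(X) = {H:.3f} bits)")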
Theorem 3.4 (Consequence of the AEP) Given a DMS $\{X_n\}_{n=1}^{\infty}$ with entropy $H(X)$ and any $\delta$ greater than zero, the weakly $\delta$-typical set $\mathcal{F}_n(\delta)$ satisfies the following.

1. If $x^n\in\mathcal{F}_n(\delta)$, then
$$2^{-n(H(X)+\delta)} \le P_{X^n}(x^n) \le 2^{-n(H(X)-\delta)}.$$

2. $P_{X^n}(\mathcal{F}_n^c(\delta)) < \delta$ for sufficiently large $n$, where the superscript "c" denotes the complementary set operation.

3. $|\mathcal{F}_n(\delta)| > (1-\delta)\,2^{n(H(X)-\delta)}$ for sufficiently large $n$, and $|\mathcal{F}_n(\delta)| \le 2^{n(H(X)+\delta)}$ for every $n$, where $|\mathcal{F}_n(\delta)|$ denotes the number of elements in $\mathcal{F}_n(\delta)$.

Note: The above theorem also holds if we define the typical set using the base-D logarithm $\log_D$ for any $D > 1$ instead of the base-2 logarithm; in this case, one just needs to appropriately change the base of the exponential terms in the above theorem (by replacing $2^x$ terms with $D^x$ terms) and also substitute $H(X)$ with $H_D(X)$.
Proof: Property 1 is an immediate consequence of the definition of $\mathcal{F}_n(\delta)$.

Property 2 is a direct consequence of the AEP, since the AEP states that for a fixed $\delta > 0$, $\lim_{n\to\infty} P_{X^n}(\mathcal{F}_n(\delta)) = 1$; i.e., for every $\varepsilon > 0$, there exists $n_0 = n_0(\varepsilon)$ such that for all $n \ge n_0$,
$$P_{X^n}(\mathcal{F}_n(\delta)) > 1 - \varepsilon.$$
In particular, setting $\varepsilon = \delta$ yields the result. We nevertheless provide a direct proof of Property 2, as it gives an explicit expression for $n_0$: observe that by Chebyshev's inequality,
$$
P_{X^n}(\mathcal{F}_n^c(\delta)) = P_{X^n}\Big\{x^n\in\mathcal{X}^n : \Big|-\frac{1}{n}\log_2 P_{X^n}(x^n) - H(X)\Big| > \delta\Big\}
\le \frac{\sigma_X^2}{n\delta^2} < \delta
$$
for $n > \sigma_X^2/\delta^3$, where the variance
$$
\sigma_X^2 \triangleq \mathrm{Var}[-\log_2 P_X(X)] = \sum_{x\in\mathcal{X}} P_X(x)\,[\log_2 P_X(x)]^2 - (H(X))^2
$$
is a constant^6 independent of $n$.

To prove Property 3, we have from Property 1 that
$$
1 \ge \sum_{x^n\in\mathcal{F}_n(\delta)} P_{X^n}(x^n) \ge \sum_{x^n\in\mathcal{F}_n(\delta)} 2^{-n(H(X)+\delta)} = |\mathcal{F}_n(\delta)|\,2^{-n(H(X)+\delta)},
$$
and, using Properties 2 and 1, we have that
$$
1 - \delta < 1 - \frac{\sigma_X^2}{n\delta^2} \le \sum_{x^n\in\mathcal{F}_n(\delta)} P_{X^n}(x^n)
\le \sum_{x^n\in\mathcal{F}_n(\delta)} 2^{-n(H(X)-\delta)} = |\mathcal{F}_n(\delta)|\,2^{-n(H(X)-\delta)}
$$
for $n \ge \sigma_X^2/\delta^3$. □

^6 In the proof, we assume that the variance $\sigma_X^2 = \mathrm{Var}[-\log_2 P_X(X)] < \infty$. This holds since the source alphabet is finite:
$$
\mathrm{Var}[-\log_2 P_X(X)] \le E\big[(\log_2 P_X(X))^2\big] = \sum_{x\in\mathcal{X}} P_X(x)\,(\log_2 P_X(x))^2
\le \sum_{x\in\mathcal{X}} \frac{4}{e^2}\,[\log_2(e)]^2 = \frac{4}{e^2}\,[\log_2(e)]^2\,|\mathcal{X}| < \infty.
$$
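For small $n$ one can enumerate $\mathcal{F}_n(\delta)$ exhaustively and check the size bounds of Theorem 3.4 directly; the sketch below does this for an illustrative pmf (all parameter values are arbitrary).

    import itertools
    import math

    pmf = {"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1}    # illustrative DMS
    H = -sum(p * math.log2(p) for p in pmf.values())
    n, delta = 6, 0.3

    typical, prob_typical = [], 0.0
    for xn in itertools.product(pmf, repeat=n):
        p = math.prod(pmf[x] for x in xn)
        if abs(-math.log2(p) / n - H) <= delta:
            typical.append(xn)
            prob_typical += p

    print(f"|F_n(delta)| = {len(typical)}, upper bound 2^(n(H+delta)) = {2**(n*(H+delta)):.1f}")
    print(f"P(F_n(delta)) = {prob_typical:.3f}")
    assert len(typical) <= 2 ** (n * (H + delta))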
Note that for any $n > 0$, a block code $\mathcal{C}_n = (n, M)$ is said to be uniquely decodable or completely lossless if its set of reproduction words is trivially equal to the set of all source $n$-tuples: $\{c_1, c_2, \ldots, c_M\} = \mathcal{X}^n$. In this case, if we are binary-indexing the reproduction words using an encoding-decoding pair $(f, g)$, every sourceword $x^n$ will be assigned a distinct binary codeword $f(x^n)$ of length $k = \lceil\log_2 M\rceil$, and all the binary $k$-tuples are the image under $f$ of some sourceword. In other words, $f$ is a bijective (injective and surjective) map and hence invertible, with decoding map $g = f^{-1}$ and $M = |\mathcal{X}|^n = 2^k$. Thus the code rate is $(1/n)\log_2 M = \log_2|\mathcal{X}|$ bits/source symbol.

Now the question becomes: can we achieve a better (i.e., smaller) compression rate? The answer is affirmative: we can achieve a compression rate equal to the source entropy $H(X)$ (in bits), which can be significantly smaller than $\log_2|\mathcal{X}|$ when the source is strongly non-uniformly distributed, if we give up unique decodability (for every $n$) and allow $n$ to be sufficiently large so as to asymptotically achieve lossless reconstruction with an arbitrarily small (but positive) probability of decoding error
$$
P_e(\mathcal{C}_n) \triangleq P_{X^n}\big\{x^n\in\mathcal{X}^n : g(f(x^n)) \ne x^n\big\}.
$$
Thus, the block codes considered herein can perform data compression that is asymptotically lossless with respect to blocklength; this contrasts with variable-length codes, which can be completely lossless (uniquely decodable) for every finite blocklength.

We can now formally state and prove Shannon's asymptotically lossless source coding theorem for block codes. The theorem is stated for general D-ary block codes, presenting the source entropy $H_D(X)$ (in D-ary code symbols/source symbol) as the smallest (infimum) possible compression rate for asymptotically lossless D-ary block codes. Without loss of generality, the theorem is proved for the case $D = 2$. The idea behind the proof of the forward (achievability) part is basically to binary-index the source sequences in the weakly $\delta$-typical set $\mathcal{F}_n(\delta)$ with binary codewords (starting from index one, with corresponding $k$-tuple codeword $0\cdots01$), and to encode all sourcewords outside $\mathcal{F}_n(\delta)$ to a default all-zero binary codeword, which certainly cannot be reproduced without distortion due to the many-to-one nature of this mapping. The resultant code rate is $(1/n)\lceil\log_2(|\mathcal{F}_n(\delta)|+1)\rceil$ bits per source symbol. As revealed in the Shannon-McMillan AEP theorem and its Consequence, almost all the probability mass will be on $\mathcal{F}_n(\delta)$ for $n$ sufficiently large, and hence the probability of non-reconstructable source sequences can be made arbitrarily small. A simple example of the above coding scheme is illustrated in Table 3.1. The converse part of the proof establishes (by expressing the probability of correct decoding in terms of the $\delta$-typical set and again using the Consequence of the AEP) that for any sequence of D-ary codes with rate strictly below the source entropy, the probability of error cannot asymptotically vanish (it is bounded away from zero). Actually, a stronger result is proven: the probability of error not only does not asymptotically vanish, it ultimately grows to 1 (this is why we call this part a strong converse).

    Sourceword   |-(1/2) sum_i log2 P_X(x_i) - H(X)|   In F_2(0.4)?   Codeword   Reconstruction
    AA           0.525 bits                            no             000        ambiguous
    AB           0.317 bits                            yes            001        AB
    AC           0.025 bits                            yes            010        AC
    AD           0.475 bits                            no             000        ambiguous
    BA           0.317 bits                            yes            011        BA
    BB           0.109 bits                            yes            100        BB
    BC           0.183 bits                            yes            101        BC
    BD           0.683 bits                            no             000        ambiguous
    CA           0.025 bits                            yes            110        CA
    CB           0.183 bits                            yes            111        CB
    CC           0.475 bits                            no             000        ambiguous
    CD           0.975 bits                            no             000        ambiguous
    DA           0.475 bits                            no             000        ambiguous
    DB           0.683 bits                            no             000        ambiguous
    DC           0.975 bits                            no             000        ambiguous
    DD           1.475 bits                            no             000        ambiguous

Table 3.1: An example of the weakly $\delta$-typical set with $n = 2$ and $\delta = 0.4$, where $\mathcal{F}_2(0.4) = \{AB, AC, BA, BB, BC, CA, CB\}$. The codeword set is $\{$001(AB), 010(AC), 011(BA), 100(BB), 101(BC), 110(CA), 111(CB), 000(AA, AD, BD, CC, CD, DA, DB, DC, DD)$\}$, where the parenthesis following each binary codeword indicates those sourcewords that are encoded to this codeword. The source distribution is $P_X(A) = 0.4$, $P_X(B) = 0.3$, $P_X(C) = 0.2$ and $P_X(D) = 0.1$.
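The coding scheme of Table 3.1 is short enough to generate programmatically. The sketch below rebuilds the $\delta$-typical set for $n = 2$, $\delta = 0.4$ and assigns the binary indices following the construction described above (the particular index-assignment order is one illustrative choice).

    import itertools
    import math

    pmf = {"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1}
    H = -sum(p * math.log2(p) for p in pmf.values())
    n, delta = 2, 0.4

    typical = [xn for xn in itertools.product(sorted(pmf), repeat=n)
               if abs(-sum(math.log2(pmf[x]) for x in xn) / n - H) <= delta]

    k = math.ceil(math.log2(len(typical) + 1))        # codeword length in bits
    encoder = {xn: format(i + 1, f"0{k}b") for i, xn in enumerate(typical)}

    for xn in itertools.product(sorted(pmf), repeat=n):
        print("".join(xn), encoder.get(xn, "0" * k))  # atypical words share the all-zero codeword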
Theorem 3.5 (Shannon's source coding theorem) Given integer $D > 1$, consider a discrete memoryless source $\{X_n\}_{n=1}^{\infty}$ with entropy $H_D(X)$. Then the following hold.

Forward part (achievability): For any $0 < \varepsilon < 1$, there exist $\delta$ with $0 < \delta < \varepsilon$ and a sequence of D-ary block codes $\{\mathcal{C}_n = (n, M_n)\}_{n=1}^{\infty}$ with
$$
\limsup_{n\to\infty}\frac{1}{n}\log_D M_n \le H_D(X) + \delta \qquad (3.2.1)
$$
satisfying
$$
P_e(\mathcal{C}_n) < \varepsilon \qquad (3.2.2)
$$
for all sufficiently large $n$, where $P_e(\mathcal{C}_n)$ denotes the probability of decoding error for block code $\mathcal{C}_n$.^7

Strong converse part: For any $0 < \varepsilon < 1$, any sequence of D-ary block codes $\{\mathcal{C}_n = (n, M_n)\}_{n=1}^{\infty}$ with
$$
\limsup_{n\to\infty}\frac{1}{n}\log_D M_n < H_D(X) \qquad (3.2.3)
$$
satisfies
$$
P_e(\mathcal{C}_n) > 1 - \varepsilon
$$
for all $n$ sufficiently large.

^7 (3.2.2) is equivalent to $\limsup_{n\to\infty} P_e(\mathcal{C}_n) \le \varepsilon$. Since $\varepsilon$ can be made arbitrarily small, the forward part actually indicates the existence of a sequence of D-ary block codes $\{\mathcal{C}_n\}_{n=1}^{\infty}$ satisfying (3.2.1) such that $\limsup_{n\to\infty} P_e(\mathcal{C}_n) = 0$. Based on this, the converse should be that any sequence of D-ary block codes satisfying (3.2.3) satisfies $\limsup_{n\to\infty} P_e(\mathcal{C}_n) > 0$. However, the so-called strong converse actually gives a stronger consequence: $\limsup_{n\to\infty} P_e(\mathcal{C}_n) = 1$ (as $\varepsilon$ can be made arbitrarily small).
Proof:

Forward Part: Without loss of generality, we prove the result for the case of binary codes (i.e., $D = 2$). Also recall that the subscript $D$ in $H_D(X)$ is dropped (i.e., omitted) when $D = 2$.

Given $0 < \varepsilon < 1$, fix $\delta$ such that $0 < \delta < \varepsilon$ and choose $n > 2/\delta$. Now construct a binary $\mathcal{C}_n$ block code by simply mapping the $\delta/2$-typical sourcewords $x^n$ onto distinct, not all-zero, binary codewords of length $k \triangleq \lceil\log_2 M_n\rceil$ bits. In other words, binary-index (cf. the footnote in Definition 3.2) the sourcewords in $\mathcal{F}_n(\delta/2)$ with the following encoding map:
$$
\begin{cases}
x^n \to \text{binary index of } x^n, & \text{if } x^n\in\mathcal{F}_n(\delta/2);\\
x^n \to \text{all-zero codeword}, & \text{if } x^n\notin\mathcal{F}_n(\delta/2).
\end{cases}
$$
Then by the Shannon-McMillan AEP theorem, we obtain that
$$
M_n = |\mathcal{F}_n(\delta/2)| + 1 \le 2^{n(H(X)+\delta/2)} + 1 < 2\cdot 2^{n(H(X)+\delta/2)} < 2^{n(H(X)+\delta)}
$$
for $n > 2/\delta$. Hence, a sequence of $\mathcal{C}_n = (n, M_n)$ block codes satisfying (3.2.1) is established. It remains to show that the error probability for this sequence of $(n, M_n)$ block codes can be made smaller than $\varepsilon$ for all sufficiently large $n$. By the Shannon-McMillan AEP theorem,
$$
P_{X^n}(\mathcal{F}_n^c(\delta/2)) < \frac{\delta}{2} \quad\text{for all sufficiently large } n.
$$
Consequently, for those $n$ satisfying the above inequality and bigger than $2/\delta$,
$$
P_e(\mathcal{C}_n) \le P_{X^n}(\mathcal{F}_n^c(\delta/2)) < \varepsilon.
$$
(For the last step, the reader can refer to Table 3.1 to confirm that only the ambiguous sequences outside the typical set contribute to the probability of error.)

Strong Converse Part: Fix any sequence of block codes $\{\mathcal{C}_n\}_{n=1}^{\infty}$ with
$$
\limsup_{n\to\infty}\frac{1}{n}\log_2|\mathcal{C}_n| < H(X).
$$
Let $\mathcal{S}_n$ be the set of sourcewords that can be correctly decoded through the $\mathcal{C}_n$ coding system. (A quick example is depicted in Figure 3.2.) Then $|\mathcal{S}_n| = |\mathcal{C}_n|$. By choosing $\delta$ small enough with $\varepsilon/2 > \delta > 0$, and also by definition of the limsup operation, we have
$$
(\exists\, N_0)(\forall\, n > N_0)\quad \frac{1}{n}\log_2|\mathcal{S}_n| = \frac{1}{n}\log_2|\mathcal{C}_n| < H(X) - 2\delta,
$$
which implies
$$
|\mathcal{S}_n| < 2^{n(H(X)-2\delta)}.
$$
Furthermore, from Property 2 of the Consequence of the AEP, we obtain that
$$
(\exists\, N_1)(\forall\, n > N_1)\quad P_{X^n}(\mathcal{F}_n^c(\delta)) < \delta.
$$

[Figure 3.2: Possible codebook $\mathcal{C}_n$ and its corresponding $\mathcal{S}_n$. The solid box indicates the decoding mapping from $\mathcal{C}_n$ back to $\mathcal{S}_n$.]

Consequently, for $n > N \triangleq \max\{N_0,\, N_1,\, \log_2(2/\varepsilon)/\delta\}$, the probability of correct block decoding satisfies
$$
1 - P_e(\mathcal{C}_n) = \sum_{x^n\in\mathcal{S}_n} P_{X^n}(x^n)
= \sum_{x^n\in\mathcal{S}_n\cap\mathcal{F}_n^c(\delta)} P_{X^n}(x^n) + \sum_{x^n\in\mathcal{S}_n\cap\mathcal{F}_n(\delta)} P_{X^n}(x^n)
$$
$$
\le P_{X^n}(\mathcal{F}_n^c(\delta)) + |\mathcal{S}_n\cap\mathcal{F}_n(\delta)|\max_{x^n\in\mathcal{F}_n(\delta)} P_{X^n}(x^n)
< \delta + |\mathcal{S}_n|\max_{x^n\in\mathcal{F}_n(\delta)} P_{X^n}(x^n)
$$
$$
< \frac{\varepsilon}{2} + 2^{n(H(X)-2\delta)}\,2^{-n(H(X)-\delta)}
= \frac{\varepsilon}{2} + 2^{-n\delta} < \varepsilon,
$$
which is equivalent to $P_e(\mathcal{C}_n) > 1 - \varepsilon$ for $n > N$. □
Observation 3.6 The results of the above theorem are illustrated in Figure 3.3, where $R \triangleq \limsup_{n\to\infty}(1/n)\log_D M_n$ is usually called the ultimate (or asymptotic) code rate of block codes for compressing the source. It is clear from the figure that the (ultimate) rate of any block code with arbitrarily small decoding error probability must be greater than the source entropy. Conversely, the probability of decoding error for any block code of rate smaller than the entropy ultimately approaches 1 (and hence is bounded away from zero). Thus, for a DMS, the source entropy $H_D(X)$ is the infimum of all achievable source (block) coding rates; i.e., it is the infimum of all rates for which there exists a sequence of D-ary block codes with asymptotically vanishing (as the blocklength goes to infinity) probability of decoding error.

[Figure 3.3: (Ultimate) compression rate $R$ versus source entropy $H_D(X)$ and behavior of the probability of block decoding error as blocklength $n$ goes to infinity for a discrete memoryless source: $P_e^{(n)} \to 1$ for all block codes with $R < H_D(X)$, while $P_e^{(n)} \to 0$ for the best block codes with $R > H_D(X)$.]

For a source with (statistical) memory, the Shannon-McMillan theorem cannot be directly applied in its original form, and thereby Shannon's source coding theorem appears restricted to memoryless sources only. However, by exploring the concept behind these theorems, one finds that the key to the validity of Shannon's source coding theorem is actually the existence of a set $\mathcal{A}_n = \{x^n_1, x^n_2, \ldots, x^n_M\}$ with $M \approx D^{nH_D(X)}$ and $P_{X^n}(\mathcal{A}_n^c) \to 0$, namely, the existence of a typical-like set $\mathcal{A}_n$ whose size is comparatively small and whose probability mass is asymptotically large. Thus, if we can find such a typical-like set for a source with memory, the source coding theorem for block codes can be extended to this source. Indeed, with appropriate modifications, the Shannon-McMillan theorem can be generalized to the class of stationary ergodic sources and hence a block source coding theorem for this class can be established; this is considered in the next subsection. The block source coding theorem for general (e.g., non-stationary non-ergodic) sources in terms of a generalized entropy measure (see the end of the next subsection for a brief description) will be studied in detail in Part II of the book.
3.2.2 Block codes for stationary ergodic sources
In practice, a stochastic source used to model data often exhibits memory or statistical dependence among its random variables; its joint distribution is hence not a product of its marginal distributions. In this subsection, we consider the asymptotically lossless data compression theorem for the class of stationary ergodic sources.

Before proceeding to generalize the block source coding theorem, we need to first generalize the entropy measure for a sequence of dependent random variables $X^n$ (which certainly should be backward compatible with the discrete memoryless case). A straightforward generalization is to examine the limit of the normalized block entropy of a source sequence, resulting in the concept of entropy rate.

Definition 3.7 (Entropy rate) The entropy rate for a source $\{X_n\}_{n=1}^{\infty}$ is denoted by $H(\mathcal{X})$ and defined by
$$
H(\mathcal{X}) \triangleq \lim_{n\to\infty}\frac{1}{n}H(X^n),
$$
provided the limit exists, where $X^n = (X_1, \ldots, X_n)$.
Next we show that the entropy rate exists for stationary sources (here, we do not need ergodicity for the existence of the entropy rate).

Lemma 3.8 For a stationary source $\{X_n\}_{n=1}^{\infty}$, the conditional entropy $H(X_n|X_{n-1},\ldots,X_1)$ is non-increasing in $n$ and also bounded from below by zero. Hence by Lemma A.20, the limit
$$
\lim_{n\to\infty} H(X_n|X_{n-1},\ldots,X_1)
$$
exists.

Proof: We have
$$
H(X_n|X_{n-1},\ldots,X_1) \le H(X_n|X_{n-1},\ldots,X_2) \qquad (3.2.4)
$$
$$
= H(X_n,\ldots,X_2) - H(X_{n-1},\ldots,X_2)
$$
$$
= H(X_{n-1},\ldots,X_1) - H(X_{n-2},\ldots,X_1) \qquad (3.2.5)
$$
$$
= H(X_{n-1}|X_{n-2},\ldots,X_1),
$$
where (3.2.4) follows since conditioning never increases entropy, and (3.2.5) holds because of the stationarity assumption. Finally, recall that each conditional entropy $H(X_n|X_{n-1},\ldots,X_1)$ is non-negative. □
Lemma 3.9 (Cesàro-mean theorem) If $a_n \to a$ as $n\to\infty$ and $b_n = (1/n)\sum_{i=1}^n a_i$, then $b_n \to a$ as $n\to\infty$.

Proof: $a_n\to a$ implies that for any $\varepsilon > 0$, there exists $N$ such that for all $n > N$, $|a_n - a| < \varepsilon$. Then
$$
|b_n - a| = \Big|\frac{1}{n}\sum_{i=1}^n (a_i - a)\Big|
\le \frac{1}{n}\sum_{i=1}^n |a_i - a|
= \frac{1}{n}\sum_{i=1}^N |a_i - a| + \frac{1}{n}\sum_{i=N+1}^n |a_i - a|
\le \frac{1}{n}\sum_{i=1}^N |a_i - a| + \frac{n-N}{n}\,\varepsilon.
$$
Hence, $\lim_{n\to\infty}|b_n - a| \le \varepsilon$. Since $\varepsilon$ can be made arbitrarily small, the lemma holds. □
Theorem 3.10 For a stationary source $\{X_n\}_{n=1}^{\infty}$, the entropy rate always exists and is equal to
$$
H(\mathcal{X}) = \lim_{n\to\infty} H(X_n|X_{n-1},\ldots,X_1).
$$
Proof: The result directly follows by writing
$$
\frac{1}{n}H(X^n) = \frac{1}{n}\sum_{i=1}^n H(X_i|X_{i-1},\ldots,X_1) \qquad\text{(chain rule for entropy)}
$$
and applying the Cesàro-mean theorem. □

Observation 3.11 It can also be shown that for a stationary source, $(1/n)H(X^n)$ is non-increasing in $n$ and $(1/n)H(X^n) \ge H(X_n|X_{n-1},\ldots,X_1)$ for all $n \ge 1$. (The proof is left as an exercise.)

It is obvious that when $\{X_n\}_{n=1}^{\infty}$ is a discrete memoryless source, $H(X^n) = n\,H(X)$ for every $n$. Hence,
$$
H(\mathcal{X}) = \lim_{n\to\infty}\frac{1}{n}H(X^n) = H(X).
$$
For a first-order stationary Markov source,
$$
H(\mathcal{X}) = \lim_{n\to\infty}\frac{1}{n}H(X^n) = \lim_{n\to\infty}H(X_n|X_{n-1},\ldots,X_1) = H(X_2|X_1),
$$
where
$$
H(X_2|X_1) \triangleq -\sum_{x_1\in\mathcal{X}}\sum_{x_2\in\mathcal{X}} \pi(x_1)\,P_{X_2|X_1}(x_2|x_1)\,\log P_{X_2|X_1}(x_2|x_1),
$$
and $\pi(\cdot)$ is the stationary distribution of the Markov source. Furthermore, if the Markov source is binary with $P_{X_2|X_1}(0|1) = \alpha$ and $P_{X_2|X_1}(1|0) = \beta$, then
$$
H(\mathcal{X}) = \frac{\beta}{\alpha+\beta}\,h_b(\alpha) + \frac{\alpha}{\alpha+\beta}\,h_b(\beta),
$$
where $h_b(\gamma) \triangleq -\gamma\log\gamma - (1-\gamma)\log(1-\gamma)$ is the binary entropy function.
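The entropy-rate formula for the binary Markov source is easy to evaluate numerically; the sketch below (with illustrative transition probabilities $\alpha$ and $\beta$) also cross-checks it against $\pi(0)h_b(\beta) + \pi(1)h_b(\alpha)$ computed directly from the stationary distribution.

    import math

    def h_b(p):
        """Binary entropy function in bits."""
        return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    alpha, beta = 0.1, 0.3          # P(0|1) = alpha, P(1|0) = beta (illustrative values)
    pi0, pi1 = alpha / (alpha + beta), beta / (alpha + beta)

    rate_formula = beta / (alpha + beta) * h_b(alpha) + alpha / (alpha + beta) * h_b(beta)
    rate_direct = pi0 * h_b(beta) + pi1 * h_b(alpha)   # H(X_2|X_1) via the stationary pmf

    print(f"H(X) = {rate_formula:.4f} bits/symbol (formula), {rate_direct:.4f} bits/symbol (direct)")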
Theorem 3.12 (Generalized AEP or Shannon-McMillan-Breiman Theorem [12]) If $\{X_n\}_{n=1}^{\infty}$ is a stationary ergodic source, then
$$
-\frac{1}{n}\log_2 P_{X^n}(X_1,\ldots,X_n) \;\xrightarrow{\ \text{a.s.}\ }\; H(\mathcal{X}).
$$
Since the AEP theorem (law of large numbers) is valid for stationary ergodic sources, all consequences of the AEP follow, including Shannon's lossless source coding theorem.

Theorem 3.13 (Shannon's source coding theorem for stationary ergodic sources) Given integer $D > 1$, let $\{X_n\}_{n=1}^{\infty}$ be a stationary ergodic source with entropy rate (in base D)
$$
H_D(\mathcal{X}) \triangleq \lim_{n\to\infty}\frac{1}{n}H_D(X^n).
$$
Then the following hold.

Forward part (achievability): For any $0 < \varepsilon < 1$, there exist $\delta$ with $0 < \delta < \varepsilon$ and a sequence of D-ary block codes $\{\mathcal{C}_n = (n, M_n)\}_{n=1}^{\infty}$ with
$$
\limsup_{n\to\infty}\frac{1}{n}\log_D M_n < H_D(\mathcal{X}) + \delta,
$$
and probability of decoding error satisfying
$$
P_e(\mathcal{C}_n) < \varepsilon
$$
for all sufficiently large $n$.

Strong converse part: For any $0 < \varepsilon < 1$, any sequence of D-ary block codes $\{\mathcal{C}_n = (n, M_n)\}_{n=1}^{\infty}$ with
$$
\limsup_{n\to\infty}\frac{1}{n}\log_D M_n < H_D(\mathcal{X})
$$
satisfies
$$
P_e(\mathcal{C}_n) > 1 - \varepsilon
$$
for all $n$ sufficiently large.
A discrete memoryless (i.i.d.) source is stationary and ergodic (so Theorem 3.5 is clearly a special case of Theorem 3.13). In general, it is hard to check whether a stationary process is ergodic or not. It is known, though, that if a stationary process is a mixture of two or more stationary ergodic processes, i.e., its n-fold distribution can be written as the mean (with respect to some distribution) of the n-fold distributions of stationary ergodic processes, then it is not ergodic.^8

For example, let $P$ and $Q$ be two distributions on a finite alphabet $\mathcal{X}$ such that the process $\{X_n\}_{n=1}^{\infty}$ is i.i.d. with distribution $P$ and the process $\{Y_n\}_{n=1}^{\infty}$ is i.i.d. with distribution $Q$. Flip a biased coin (with Heads probability equal to $\theta$, $0 < \theta < 1$) once and let
$$
Z_i = \begin{cases} X_i & \text{if Heads}\\ Y_i & \text{if Tails}\end{cases}
$$
for $i = 1, 2, \ldots$. Then the resulting process $\{Z_i\}_{i=1}^{\infty}$ has its n-fold distribution given by a mixture of the n-fold distributions of $\{X_n\}_{n=1}^{\infty}$ and $\{Y_n\}_{n=1}^{\infty}$:
$$
P_{Z^n}(a^n) = \theta\,P_{X^n}(a^n) + (1-\theta)\,P_{Y^n}(a^n)
$$
for all $a^n\in\mathcal{X}^n$, $n = 1, 2, \ldots$. Hence the process $\{Z_i\}_{i=1}^{\infty}$ is stationary but not ergodic.

A specific case for which ergodicity can be easily verified (other than the case of i.i.d. sources) is that of stationary Markov sources. Specifically, if a (finite-alphabet) stationary Markov source is irreducible, then it is ergodic and hence the Generalized AEP holds for this source. Note that irreducibility can be verified in terms of the source's transition probability matrix.

^8 The converse is also true; i.e., if a stationary process cannot be represented as a mixture of stationary ergodic processes, then it is ergodic.
In more complicated situations, such as when the source is non-stationary (with time-varying statistics) and/or non-ergodic, the source entropy rate $H(\mathcal{X})$ (if the limit exists; otherwise one can look at the liminf/limsup of $(1/n)H(X^n)$) no longer has an operational meaning as the smallest possible compression rate. This creates the need to establish new entropy measures that appropriately characterize the operational limits of an arbitrary stochastic system with memory. This is achieved in [21], where Han and Verdú introduce the notions of inf/sup-entropy rates and illustrate the key role these entropy measures play in proving a general lossless block source coding theorem. More specifically, they demonstrate that for an arbitrary finite-alphabet source $\mathbf{X} = \{X^n = (X_1, X_2, \ldots, X_n)\}_{n=1}^{\infty}$ (not necessarily stationary and ergodic), the minimum achievable (block) source coding rate is given by the sup-entropy rate $\bar{H}(\mathbf{X})$, defined by
$$
\bar{H}(\mathbf{X}) \triangleq \inf\Big\{\theta : \limsup_{n\to\infty}\Pr\Big\{-\frac{1}{n}\log P_{X^n}(X^n) > \theta\Big\} = 0\Big\}.
$$
More details will be provided in Part II of the book.
3.2.3 Redundancy for lossless block data compression
Shannon's block source coding theorem establishes that the smallest data compression rate for achieving an arbitrarily small error probability for stationary ergodic sources is given by the entropy rate. Thus one can define the source redundancy as the reduction in coding rate one can achieve via asymptotically lossless block source coding versus just using uniquely decodable (completely lossless for any value of the sourceword blocklength $n$) block source coding. In light of the fact that the former approach yields a source coding rate equal to the entropy rate while the latter approach provides a rate of $\log_2|\mathcal{X}|$, we therefore define the total block source-coding redundancy $\rho_t$ (in bits/source symbol) for a stationary ergodic source $\{X_n\}_{n=1}^{\infty}$ as
$$
\rho_t \triangleq \log_2|\mathcal{X}| - H(\mathcal{X}).
$$
Hence $\rho_t$ represents the amount of useless (or superfluous) statistical source information one can eliminate via binary^9 block source coding.

If the source is i.i.d. and uniformly distributed, then its entropy rate is equal to $\log_2|\mathcal{X}|$ and, as a result, its redundancy is $\rho_t = 0$. This means that the source is incompressible, as expected, since in this case every sourceword $x^n$ belongs to the $\delta$-typical set $\mathcal{F}_n(\delta)$ for every $n > 0$ and $\delta > 0$ (i.e., $\mathcal{F}_n(\delta) = \mathcal{X}^n$), and hence there are no superfluous sourcewords that can be dispensed with via source coding. If the source has memory or has a non-uniform marginal distribution, then its redundancy is strictly positive and can be classified into two parts:

- Source redundancy due to the non-uniformity of the source marginal distribution, $\rho_d$:
$$\rho_d \triangleq \log_2|\mathcal{X}| - H(X_1).$$
- Source redundancy due to the source memory, $\rho_m$:
$$\rho_m \triangleq H(X_1) - H(\mathcal{X}).$$

As a result, the source total redundancy $\rho_t$ can be decomposed into two parts:
$$\rho_t = \rho_d + \rho_m.$$
We summarize the redundancy of some typical stationary ergodic sources in the following table (a computation sketch follows the table).

    Source                            rho_d                     rho_m                    rho_t
    i.i.d. uniform                    0                         0                        0
    i.i.d. non-uniform                log2|X| - H(X_1)          0                        rho_d
    1st-order symmetric Markov^10     0                         H(X_1) - H(X_2|X_1)      rho_m
    1st-order non-symmetric Markov    log2|X| - H(X_1)          H(X_1) - H(X_2|X_1)      rho_d + rho_m

^9 Since we are measuring $\rho_t$ in code bits/source symbol, all logarithms in its expression are in base 2, and hence this redundancy can be eliminated via asymptotically lossless binary block codes (one can also change the units to D-ary code symbols/source symbol by using base-D logarithms for the case of D-ary block codes).

^10 A first-order Markov process is symmetric if for any $x_1$ and $\hat{x}_1$,
$$
\{a : a = P_{X_2|X_1}(y|x_1)\ \text{for some } y\} = \{a : a = P_{X_2|X_1}(y|\hat{x}_1)\ \text{for some } y\}.
$$
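As an illustration of the table, the following sketch computes $\rho_d$, $\rho_m$ and $\rho_t$ for a binary first-order Markov source (the transition probabilities are illustrative), using $H(\mathcal{X}) = H(X_2|X_1)$ from the previous subsection.

    import math

    def h_b(p):
        return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    alpha, beta = 0.1, 0.3                       # P(0|1)=alpha, P(1|0)=beta (illustrative)
    pi0, pi1 = alpha / (alpha + beta), beta / (alpha + beta)

    H1 = h_b(pi0)                                # H(X_1), entropy of the stationary marginal
    H_rate = pi0 * h_b(beta) + pi1 * h_b(alpha)  # H(X) = H(X_2|X_1) for a stationary Markov source

    rho_d = 1.0 - H1                             # log2|X| = 1 bit for a binary alphabet
    rho_m = H1 - H_rate
    print(f"rho_d = {rho_d:.4f}, rho_m = {rho_m:.4f}, rho_t = {rho_d + rho_m:.4f} bits/symbol")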
3.3 Variable-length codes for lossless data compression
3.3.1 Non-singular codes and uniquely decodable codes
We next study variable-length (completely) lossless data compression codes.
Definition 3.14 Consider a discrete source $\{X_n\}_{n=1}^{\infty}$ with finite alphabet $\mathcal{X}$, along with a D-ary code alphabet $\mathcal{B} = \{0, 1, \ldots, D-1\}$, where $D > 1$ is an integer. Fix an integer $n \ge 1$; then a D-ary $n$-th order variable-length code (VLC) is a map
$$
f : \mathcal{X}^n \to \mathcal{B}^*
$$
mapping (fixed-length) sourcewords of length $n$ to D-ary codewords in $\mathcal{B}^*$ of variable lengths, where $\mathcal{B}^*$ denotes the set of all finite-length strings from $\mathcal{B}$ (i.e., $c\in\mathcal{B}^*$ iff there exists an integer $l \ge 1$ such that $c\in\mathcal{B}^l$).

The codebook $\mathcal{C}$ of a VLC is the set of all codewords:
$$
\mathcal{C} = f(\mathcal{X}^n) = \{f(x^n)\in\mathcal{B}^* : x^n\in\mathcal{X}^n\}.
$$

A variable-length lossless data compression code is a code in which the source symbols can be completely reconstructed without distortion. In order to achieve this goal, the source symbols have to be encoded unambiguously in the sense that any two different source symbols (with positive probabilities) are represented by different codewords. Codes satisfying this property are called non-singular codes. In practice, however, the encoder often needs to encode a sequence of source symbols, which results in a concatenated sequence of codewords. If any concatenation of codewords can also be unambiguously reconstructed without punctuation, then the code is said to be uniquely decodable. In other words, a VLC is uniquely decodable if all finite sequences of sourcewords ($x^n\in\mathcal{X}^n$) are mapped onto distinct strings of codewords; i.e., for any $m$ and $m'$, $(x^n_1, x^n_2, \ldots, x^n_m) \ne (y^n_1, y^n_2, \ldots, y^n_{m'})$ implies that
$$
(f(x^n_1), f(x^n_2), \ldots, f(x^n_m)) \ne (f(y^n_1), f(y^n_2), \ldots, f(y^n_{m'})).
$$
Note that a non-singular VLC is not necessarily uniquely decodable. For example, consider a binary (first-order) code for the source with alphabet $\mathcal{X} = \{A, B, C, D, E, F\}$ given by
    code of A = 0,
    code of B = 1,
    code of C = 00,
    code of D = 01,
    code of E = 10,
    code of F = 11.
The above code is clearly non-singular; it is, however, not uniquely decodable because the codeword sequence 010 can be reconstructed as ABA, DA or AE (i.e., $(f(A), f(B), f(A)) = (f(D), f(A)) = (f(A), f(E))$ even though $(A, B, A)$, $(D, A)$ and $(A, E)$ are all distinct).

One important objective is to find out how efficiently we can represent a given discrete source via a uniquely decodable $n$-th order VLC and to provide a construction technique that (at least asymptotically, as $n\to\infty$) attains the optimal efficiency. In other words, we want to determine the smallest possible average code rate (or equivalently, average codeword length) that an $n$-th order uniquely decodable VLC can have when (losslessly) representing a given source, and we want to give an explicit code construction that can attain this smallest possible rate (at least asymptotically in the sourceword length $n$).
Definition 3.15 Let $\mathcal{C}$ be a D-ary $n$-th order VLC
$$
f : \mathcal{X}^n \to \{0, 1, \ldots, D-1\}^*
$$
for a discrete source $\{X_n\}_{n=1}^{\infty}$ with alphabet $\mathcal{X}$ and distribution $P_{X^n}(x^n)$, $x^n\in\mathcal{X}^n$. Denoting by $\ell(c_{x^n})$ the length of the codeword $c_{x^n} = f(x^n)$ associated with sourceword $x^n$, the average codeword length for $\mathcal{C}$ is given by
$$
\bar\ell \triangleq \sum_{x^n\in\mathcal{X}^n} P_{X^n}(x^n)\,\ell(c_{x^n}),
$$
and its average code rate (in D-ary code symbols/source symbol) is given by
$$
\bar{R}_n \triangleq \frac{\bar\ell}{n} = \frac{1}{n}\sum_{x^n\in\mathcal{X}^n} P_{X^n}(x^n)\,\ell(c_{x^n}).
$$

The following theorem provides a strong condition which a uniquely decodable code must satisfy.
Theorem 3.16 (Kraft inequality for uniquely decodable codes) Let $\mathcal{C}$ be a uniquely decodable D-ary $n$-th order VLC for a discrete source $\{X_n\}_{n=1}^{\infty}$ with alphabet $\mathcal{X}$. Let the $M = |\mathcal{X}|^n$ codewords of $\mathcal{C}$ have lengths $\ell_1, \ell_2, \ldots, \ell_M$, respectively. Then the following inequality must hold:
$$
\sum_{m=1}^M D^{-\ell_m} \le 1.
$$
Proof: Suppose that we use the codebook $\mathcal{C}$ to encode $N$ sourcewords ($x^n_i\in\mathcal{X}^n$, $i = 1, \ldots, N$) arriving in a sequence; this yields a concatenated codeword sequence
$$
c_1 c_2 c_3 \cdots c_N.
$$
Let the lengths of these codewords be respectively denoted by $\ell(c_1), \ell(c_2), \ldots, \ell(c_N)$. Consider
$$
\sum_{c_1\in\mathcal{C}}\sum_{c_2\in\mathcal{C}}\cdots\sum_{c_N\in\mathcal{C}} D^{-[\ell(c_1)+\ell(c_2)+\cdots+\ell(c_N)]}.
$$
It is obvious that the above expression is equal to
$$
\Big(\sum_{c\in\mathcal{C}} D^{-\ell(c)}\Big)^N = \Big(\sum_{m=1}^M D^{-\ell_m}\Big)^N.
$$
(Note that $|\mathcal{C}| = M$.) On the other hand, all the code sequences with total length
$$
i = \ell(c_1) + \ell(c_2) + \cdots + \ell(c_N)
$$
contribute equally to the sum, namely $D^{-i}$ each. Let $A_i$ denote the number of $N$-codeword sequences that have length $i$. Then the above identity can be re-written as
$$
\Big(\sum_{m=1}^M D^{-\ell_m}\Big)^N = \sum_{i=1}^{LN} A_i\,D^{-i},
$$
where
$$
L \triangleq \max_{c\in\mathcal{C}}\ell(c).
$$
Since $\mathcal{C}$ is by assumption a uniquely decodable code, the codeword sequence must be unambiguously decodable. Observe that a code sequence of length $i$ has at most $D^i$ unambiguous combinations. Therefore, $A_i \le D^i$, and
$$
\Big(\sum_{m=1}^M D^{-\ell_m}\Big)^N = \sum_{i=1}^{LN} A_i\,D^{-i} \le \sum_{i=1}^{LN} D^i\,D^{-i} = LN,
$$
which implies that
$$
\sum_{m=1}^M D^{-\ell_m} \le (LN)^{1/N}.
$$
The proof is completed by noting that the above inequality holds for every $N$, and the upper bound $(LN)^{1/N}$ goes to 1 as $N$ goes to infinity. □
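A direct computational check of the Kraft inequality for a given set of codeword lengths is a one-liner; the sketch below also flags the earlier non-uniquely-decodable example for the alphabet {A, ..., F}, whose lengths violate the inequality (the prefix-code lengths used for comparison are an illustrative choice).

    def kraft_sum(lengths, D=2):
        """Left-hand side of the Kraft inequality for the given codeword lengths."""
        return sum(D ** (-l) for l in lengths)

    prefix_code_lengths = [2, 2, 2, 3, 4, 4]          # lengths of a valid binary prefix code
    non_ud_lengths = [1, 1, 2, 2, 2, 2]               # code {0, 1, 00, 01, 10, 11} from the example

    print(kraft_sum(prefix_code_lengths))             # 1.0 -> inequality satisfied
    print(kraft_sum(non_ud_lengths))                  # 2.0 -> violated, so not uniquely decodable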
The Kraft inequality is a very useful tool, especially for showing that the fundamental lower bound on the average rate of uniquely decodable VLCs for discrete memoryless sources is given by the source entropy.

Theorem 3.17 The average code rate of every uniquely decodable D-ary $n$-th order VLC for a discrete memoryless source $\{X_n\}_{n=1}^{\infty}$ is lower-bounded by the source entropy $H_D(X)$ (measured in D-ary code symbols/source symbol).

Proof: Consider a uniquely decodable D-ary $n$-th order VLC for the source $\{X_n\}_{n=1}^{\infty}$,
$$
f : \mathcal{X}^n \to \{0, 1, \ldots, D-1\}^*,
$$
and let $\ell(c_{x^n})$ denote the length of the codeword $c_{x^n} = f(x^n)$ for sourceword $x^n$. Then
$$
\bar{R}_n - H_D(X) = \frac{1}{n}\sum_{x^n\in\mathcal{X}^n} P_{X^n}(x^n)\,\ell(c_{x^n}) - \frac{1}{n}H_D(X^n)
$$
$$
= \frac{1}{n}\Big[\sum_{x^n\in\mathcal{X}^n} P_{X^n}(x^n)\,\ell(c_{x^n}) - \sum_{x^n\in\mathcal{X}^n}\big(-P_{X^n}(x^n)\log_D P_{X^n}(x^n)\big)\Big]
$$
$$
= \frac{1}{n}\sum_{x^n\in\mathcal{X}^n} P_{X^n}(x^n)\,\log_D\frac{P_{X^n}(x^n)}{D^{-\ell(c_{x^n})}}
$$
$$
\ge \frac{1}{n}\Big(\sum_{x^n\in\mathcal{X}^n} P_{X^n}(x^n)\Big)\log_D\frac{\sum_{x^n\in\mathcal{X}^n} P_{X^n}(x^n)}{\sum_{x^n\in\mathcal{X}^n} D^{-\ell(c_{x^n})}} \qquad\text{(log-sum inequality)}
$$
$$
= -\frac{1}{n}\log_D\Big(\sum_{x^n\in\mathcal{X}^n} D^{-\ell(c_{x^n})}\Big) \ge 0,
$$
where the last inequality follows from the Kraft inequality for uniquely decodable codes and the fact that the logarithm is a strictly increasing function. □

From the above theorem, we know that the average code rate is no smaller than the source entropy. Indeed, a lossless data compression code whose average code rate achieves the entropy is optimal (since if its average code rate were below the entropy, the Kraft inequality would be violated and the code would no longer be uniquely decodable). We summarize:

1. Unique decodability implies that the Kraft inequality holds.
2. Unique decodability implies that the average code rate of VLCs for memoryless sources is lower bounded by the source entropy.

Exercise 3.18
1. Find a non-singular and non-uniquely-decodable code that violates the Kraft inequality. (Hint: The answer is already provided in this subsection.)
2. Find a non-singular and non-uniquely-decodable code that beats the entropy lower bound.
3.3.2 Prefix or instantaneous codes

A prefix code is a VLC which is self-punctuated in the sense that there is no need to append extra symbols for differentiating adjacent codewords. A more precise definition follows.

Definition 3.19 (Prefix code) A VLC is called a prefix code or an instantaneous code if no codeword is a prefix of any other codeword.

A prefix code is also named an instantaneous code because the codeword sequence can be decoded instantaneously (it is immediately recognizable) without reference to future codewords in the same sequence. Note that a uniquely decodable code is not necessarily prefix-free and may not be decodable instantaneously. The relationship between the different classes of codes encountered thus far is depicted in Figure 3.4.

[Figure 3.4: Classification of variable-length codes: prefix codes form a subclass of uniquely decodable codes, which in turn form a subclass of non-singular codes.]

A D-ary prefix code can be represented graphically as an initial segment of a D-ary tree. An example of a tree representation for a binary (D = 2) prefix code is shown in Figure 3.5.
Theorem 3.20 (Kraft inequality for prefix codes) There exists a D-ary $n$-th order prefix code for a discrete source $\{X_n\}_{n=1}^{\infty}$ with alphabet $\mathcal{X}$ iff the codeword lengths $\ell_m$, $m = 1, \ldots, M$, satisfy the Kraft inequality, where $M = |\mathcal{X}|^n$.

Proof: Without loss of generality, we provide the proof for the case of $D = 2$ (binary codes).

1. [The forward part] Prefix codes satisfy the Kraft inequality.

The codewords of a prefix code can always be put on a tree. Pick a length
$$
\ell_{\max} \triangleq \max_{1\le m\le M}\ell_m.
$$
A tree originally has $2^{\ell_{\max}}$ nodes on level $\ell_{\max}$. Each codeword of length $\ell_m$ obstructs $2^{\ell_{\max}-\ell_m}$ nodes on level $\ell_{\max}$. In other words, when any node is chosen as a codeword, all its children are excluded from being codewords (since for a prefix code, no codeword can be a prefix of any other codeword); there are exactly $2^{\ell_{\max}-\ell_m}$ such excluded nodes on level $\ell_{\max}$ of the tree. We therefore say that each codeword of length $\ell_m$ obstructs $2^{\ell_{\max}-\ell_m}$ nodes on level $\ell_{\max}$. Note that no two codewords obstruct the same nodes on level $\ell_{\max}$. Hence the total number of obstructed nodes on level $\ell_{\max}$ must be no larger than $2^{\ell_{\max}}$, i.e.,
$$
\sum_{m=1}^M 2^{\ell_{\max}-\ell_m} \le 2^{\ell_{\max}},
$$
which immediately implies the Kraft inequality:
$$
\sum_{m=1}^M 2^{-\ell_m} \le 1.
$$
(This part can also be proven by noting that a prefix code is a uniquely decodable code; the purpose of the above argument is to illustrate the tree-like characteristics of prefix codes.)

[Figure 3.5: Tree structure of a binary prefix code. The codewords are those residing on the leaves, which in this case are 00, 01, 10, 110, 1110 and 1111.]

2. [The converse part] The Kraft inequality implies the existence of a prefix code.

Suppose that $\ell_1, \ell_2, \ldots, \ell_M$ satisfy the Kraft inequality. We will show that there exists a binary tree with $M$ selected nodes, where the $i$-th node resides on level $\ell_i$.

Let $n_i$ be the number of nodes (among the $M$ nodes) residing on level $i$ (namely, $n_i$ is the number of codewords of length $i$, or $n_i = |\{m : \ell_m = i\}|$), and let
$$
\ell_{\max} \triangleq \max_{1\le m\le M}\ell_m.
$$
Then from the Kraft inequality, we have
$$
n_1 2^{-1} + n_2 2^{-2} + \cdots + n_{\ell_{\max}} 2^{-\ell_{\max}} \le 1.
$$
The above inequality can be re-written in a form that is more suitable for this proof:
$$
n_1 2^{-1} \le 1
$$
$$
n_1 2^{-1} + n_2 2^{-2} \le 1
$$
$$
\cdots
$$
$$
n_1 2^{-1} + n_2 2^{-2} + \cdots + n_{\ell_{\max}} 2^{-\ell_{\max}} \le 1.
$$
Hence,
$$
n_1 \le 2
$$
$$
n_2 \le 2^2 - n_1 2^1
$$
$$
\cdots
$$
$$
n_{\ell_{\max}} \le 2^{\ell_{\max}} - n_1 2^{\ell_{\max}-1} - \cdots - n_{\ell_{\max}-1} 2^1,
$$
which can be interpreted in terms of a tree model as follows: the first inequality says that the number of codewords of length 1 is no more than the number of available nodes on the first level, which is 2. The second inequality says that the number of codewords of length 2 is no more than the total number of nodes on the second level, which is $2^2$, minus the number of nodes obstructed by the first-level nodes already occupied by codewords. The succeeding inequalities demonstrate the availability of a sufficient number of nodes at each level after the nodes blocked by shorter-length codewords have been removed. Because this is true at every codeword length up to the maximum codeword length, the assertion of the theorem is proved. □
Theorems 3.16 and 3.20 unveil the following relation between variable-length uniquely decodable codes and prefix codes.

Corollary 3.21 A uniquely decodable D-ary $n$-th order code can always be replaced by a D-ary $n$-th order prefix code with the same average codeword length (and hence the same average code rate).

The following theorem interprets the relationship between the average code rate of a prefix code and the source entropy.

Theorem 3.22 Consider a discrete memoryless source $\{X_n\}_{n=1}^{\infty}$.

1. For any D-ary $n$-th order prefix code for the source, the average code rate is no less than the source entropy $H_D(X)$.
2. There must exist a D-ary $n$-th order prefix code for the source whose average code rate is no greater than $H_D(X) + \frac{1}{n}$, namely,
$$
\bar{R}_n \triangleq \frac{1}{n}\sum_{x^n\in\mathcal{X}^n} P_{X^n}(x^n)\,\ell(c_{x^n}) \le H_D(X) + \frac{1}{n}, \qquad (3.3.1)
$$
where $c_{x^n}$ is the codeword for sourceword $x^n$, and $\ell(c_{x^n})$ is the length of codeword $c_{x^n}$.
Proof: A prefix code is uniquely decodable, and hence it directly follows from Theorem 3.17 that its average code rate is no less than the source entropy.

To prove the second part, we design a prefix code satisfying both (3.3.1) and the Kraft inequality, which immediately implies the existence of the desired code by Theorem 3.20. Choose the codeword length for sourceword $x^n$ as
$$
\ell(c_{x^n}) = \big\lfloor -\log_D P_{X^n}(x^n)\big\rfloor + 1. \qquad (3.3.2)
$$
Then
$$
D^{-\ell(c_{x^n})} \le P_{X^n}(x^n).
$$
Summing both sides over all sourcewords, we obtain
$$
\sum_{x^n\in\mathcal{X}^n} D^{-\ell(c_{x^n})} \le 1,
$$
which is exactly the Kraft inequality. On the other hand, (3.3.2) implies
$$
\ell(c_{x^n}) \le -\log_D P_{X^n}(x^n) + 1,
$$
which in turn implies
$$
\sum_{x^n\in\mathcal{X}^n} P_{X^n}(x^n)\,\ell(c_{x^n})
\le \sum_{x^n\in\mathcal{X}^n}\big(-P_{X^n}(x^n)\log_D P_{X^n}(x^n)\big) + \sum_{x^n\in\mathcal{X}^n} P_{X^n}(x^n)
= H_D(X^n) + 1 = n\,H_D(X) + 1,
$$
where the last equality holds since the source is memoryless. □
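The length assignment (3.3.2) is easy to experiment with. The sketch below applies it to a small memoryless source (single-letter case, $n = 1$, $D = 2$; the pmf is the one revisited in the example right below), confirming the Kraft inequality and the bound $H_D(X) \le \bar{R}_1 \le H_D(X) + 1$.

    import math

    pmf = {"A": 0.8, "B": 0.1, "C": 0.1}            # memoryless source, n = 1, D = 2
    H = -sum(p * math.log2(p) for p in pmf.values())

    lengths = {s: math.floor(-math.log2(p)) + 1 for s, p in pmf.items()}   # rule (3.3.2)
    kraft = sum(2 ** (-l) for l in lengths.values())
    avg_rate = sum(pmf[s] * lengths[s] for s in pmf)

    print(f"lengths = {lengths}, Kraft sum = {kraft:.3f}")
    print(f"H(X) = {H:.3f} <= average rate = {avg_rate:.3f} <= H(X) + 1 = {H + 1:.3f}")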
We note that $n$-th order prefix codes (which encode sourcewords of length $n$) for memoryless sources can yield an average code rate arbitrarily close to the source entropy when $n$ is allowed to grow without bound. For example, a memoryless source with alphabet $\{A, B, C\}$ and probability distribution
$$
P_X(A) = 0.8, \quad P_X(B) = P_X(C) = 0.1
$$
has entropy equal to
$$
-0.8\log_2 0.8 - 0.1\log_2 0.1 - 0.1\log_2 0.1 = 0.92 \text{ bits}.
$$
One of the best binary first-order or single-letter (with $n = 1$) prefix codes for this source is given by $c(A) = 0$, $c(B) = 10$ and $c(C) = 11$, where $c(\cdot)$ is the encoding function. The resultant average code rate for this code is
$$
0.8\times 1 + 0.2\times 2 = 1.2 \text{ bits} \ge 0.92 \text{ bits}.
$$
Now if we consider a second-order (with $n = 2$) prefix code by encoding two consecutive source symbols at a time, the new source alphabet becomes
$$
\{AA, AB, AC, BA, BB, BC, CA, CB, CC\},
$$
and the resulting probability distribution is calculated via
$$
P_{X^2}(x_1, x_2) = P_X(x_1)\,P_X(x_2), \qquad x_1, x_2\in\{A, B, C\},
$$
as the source is memoryless. One of the best binary prefix codes for this second-order source is given by
    c(AA) = 0
    c(AB) = 100
    c(AC) = 101
    c(BA) = 110
    c(BB) = 111100
    c(BC) = 111101
    c(CA) = 1110
    c(CB) = 111110
    c(CC) = 111111.
The average code rate of this code now becomes
$$
\frac{0.64\,(1\times 1) + 0.08\,(3\times 3 + 4\times 1) + 0.01\,(6\times 4)}{2} = 0.96 \text{ bits},
$$
which is closer to the source entropy of 0.92 bits. As $n$ increases, the average code rate is brought ever closer to the source entropy.
From Theorems 3.17 and 3.22, we obtain the lossless variable-length source coding theorem for discrete memoryless sources.

Theorem 3.23 (Lossless variable-length source coding theorem) Fix integer $D > 1$ and consider a discrete memoryless source $\{X_n\}_{n=1}^{\infty}$ with distribution $P_X$ and entropy $H_D(X)$ (measured in D-ary units). Then the following hold.

Forward part (achievability): For any $\varepsilon > 0$, there exists a D-ary $n$-th order prefix (hence uniquely decodable) code
$$
f : \mathcal{X}^n \to \{0, 1, \ldots, D-1\}^*
$$
for the source with an average code rate $\bar{R}_n$ satisfying
$$
\bar{R}_n \le H_D(X) + \varepsilon
$$
for $n$ sufficiently large.

Converse part: Every uniquely decodable code
$$
f : \mathcal{X}^n \to \{0, 1, \ldots, D-1\}^*
$$
for the source has an average code rate $\bar{R}_n \ge H_D(X)$.

Thus, for a discrete memoryless source, its entropy $H_D(X)$ (measured in D-ary units) represents the smallest variable-length lossless compression rate for $n$ sufficiently large.

Proof: The forward part follows directly from Theorem 3.22 by choosing $n$ large enough such that $1/n < \varepsilon$, and the converse part is already given by Theorem 3.17. □
Observation 3.24 Theorem 3.23 actually also holds for the class of stationary sources by replacing the source entropy $H_D(X)$ with the source entropy rate
$$H_D(\mathcal{X}) \triangleq \lim_{n\to\infty} \frac{1}{n} H_D(X^n),$$
measured in D-ary units. The proof is very similar to the proofs of Theorems 3.17 and 3.22 with slight modifications (such as using the fact that $\frac{1}{n} H_D(X^n)$ is non-increasing in n for stationary sources).
3.3.3 Examples of binary prefix codes

A) Huffman codes: optimal variable-length codes

Given a discrete source with alphabet $\mathcal{X}$, we next construct an optimal binary first-order (single-letter) uniquely decodable variable-length code
$$f : \mathcal{X} \to \{0, 1\}^*,$$
where optimality is in the sense that the code's average codeword length (or equivalently, its average code rate) is minimized over the class of all binary uniquely decodable codes for the source. Note that finding optimal n-th order codes with n > 1 follows directly by considering $\mathcal{X}^n$ as a new source with expanded alphabet (i.e., by mapping n source symbols at a time).

By Corollary 3.21, we remark that in our search for optimal uniquely decodable codes, we can restrict our attention to the (smaller) class of optimal prefix codes. We thus proceed by observing the following necessary conditions of optimality for binary prefix codes.
Lemma 3.25 Let $\mathcal{C}$ be an optimal binary prefix code with codeword lengths $\ell_i$, $i = 1, \cdots, M$, for a source with alphabet $\mathcal{X} = \{a_1, \ldots, a_M\}$ and symbol probabilities $p_1, \ldots, p_M$. We assume, without loss of generality, that
$$p_1 \ge p_2 \ge p_3 \ge \cdots \ge p_M,$$
and that any group of source symbols with identical probability is listed in order of increasing codeword length (i.e., if $p_i = p_{i+1} = \cdots = p_{i+s}$, then $\ell_i \le \ell_{i+1} \le \cdots \le \ell_{i+s}$). Then the following properties hold.

1. Higher probability source symbols have shorter codewords: $p_i > p_j$ implies $\ell_i \le \ell_j$, for $i, j = 1, \cdots, M$.

2. The two least probable source symbols have codewords of equal length: $\ell_{M-1} = \ell_M$.

3. Among the codewords of length $\ell_M$, two of the codewords are identical except in the last digit.
Proof:

1) If $p_i > p_j$ and $\ell_i > \ell_j$, then it is possible to construct a better code $\mathcal{C}'$ by interchanging (swapping) codewords i and j of $\mathcal{C}$, since
$$\bar{\ell}(\mathcal{C}') - \bar{\ell}(\mathcal{C}) = p_i \ell_j + p_j \ell_i - (p_i \ell_i + p_j \ell_j) = (p_i - p_j)(\ell_j - \ell_i) < 0.$$
Hence code $\mathcal{C}'$ is better than code $\mathcal{C}$, contradicting the fact that $\mathcal{C}$ is optimal.

2) We first know that $\ell_{M-1} \le \ell_M$, since:
• If $p_{M-1} > p_M$, then $\ell_{M-1} \le \ell_M$ by result 1) above.
• If $p_{M-1} = p_M$, then $\ell_{M-1} \le \ell_M$ by our assumption about the ordering of codewords for source symbols with identical probability.
Now, if $\ell_{M-1} < \ell_M$, we may delete the last digit of codeword M, and the deletion cannot result in another codeword since $\mathcal{C}$ is a prefix code. Thus the deletion forms a new prefix code with a smaller average codeword length than $\mathcal{C}$, contradicting the fact that $\mathcal{C}$ is optimal. Hence, we must have $\ell_{M-1} = \ell_M$.

3) Among the codewords of length $\ell_M$, if no two codewords agree in all digits except the last, then we may delete the last digit in all such codewords to obtain a better code. □
The above observations suggest that if we can construct an optimal code for the entire source except for its two least likely symbols, then we can construct an optimal overall code. Indeed, the following lemma due to Huffman follows from Lemma 3.25.

Lemma 3.26 (Huffman) Consider a source with alphabet $\mathcal{X} = \{a_1, \ldots, a_M\}$ and symbol probabilities $p_1, \ldots, p_M$ such that $p_1 \ge p_2 \ge \cdots \ge p_M$. Consider the reduced source alphabet $\mathcal{X}'$ obtained from $\mathcal{X}$ by combining the two least likely source symbols $a_{M-1}$ and $a_M$ into an equivalent symbol $a_{M-1,M}$ with probability $p_{M-1} + p_M$. Suppose that $\mathcal{C}'$, given by $f' : \mathcal{X}' \to \{0, 1\}^*$, is an optimal code for the reduced source $\mathcal{X}'$. We now construct a code $\mathcal{C}$, $f : \mathcal{X} \to \{0, 1\}^*$, for the original source $\mathcal{X}$ as follows:

• The codewords for symbols $a_1, a_2, \cdots, a_{M-2}$ are exactly the same as the corresponding codewords in $\mathcal{C}'$:
$$f(a_1) = f'(a_1),\quad f(a_2) = f'(a_2),\quad \cdots,\quad f(a_{M-2}) = f'(a_{M-2}).$$

• The codewords associated with symbols $a_{M-1}$ and $a_M$ are formed by appending a 0 and a 1, respectively, to the codeword $f'(a_{M-1,M})$ associated with the letter $a_{M-1,M}$ in $\mathcal{C}'$:
$$f(a_{M-1}) = [f'(a_{M-1,M})\,0] \quad \text{and} \quad f(a_M) = [f'(a_{M-1,M})\,1].$$

Then code $\mathcal{C}$ is optimal for the original source $\mathcal{X}$.

Hence the problem of finding the optimal code for a source of alphabet size M is reduced to the problem of finding an optimal code for the reduced source of alphabet size M−1. In turn we can reduce the problem to that of size M−2, and so on. Indeed, the above lemma yields a recursive algorithm for constructing optimal binary prefix codes.

Huffman encoding algorithm: Repeatedly apply the above lemma until one is left with a reduced source with two symbols. An optimal binary prefix code for this source consists of the codewords 0 and 1. Then proceed backwards, constructing (as outlined in the above lemma) optimal codes for each reduced source until one arrives at the original source.
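As an illustration, a compact (if not particularly efficient) rendering of this recursive procedure in Python might look as follows; it is only a sketch of the textbook algorithm, not an optimized implementation, and ties may be resolved differently than in Figure 3.6.

```python
import heapq

def huffman_code(probs):
    """Build a binary Huffman code for a dict {symbol: probability}.

    Returns {symbol: codeword string}. Ties are broken by heap order, so the
    code (but not its average length) may differ from other Huffman codes.
    """
    # Each heap entry: (probability, tie-breaking counter, {symbol: partial codeword}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p0, _, group0 = heapq.heappop(heap)   # least likely "reduced symbol"
        p1, _, group1 = heapq.heappop(heap)   # second least likely
        # Prepend a bit to each group's codewords (the backward step of Lemma 3.26).
        merged = {s: "0" + c for s, c in group0.items()}
        merged.update({s: "1" + c for s, c in group1.items()})
        heapq.heappush(heap, (p0 + p1, counter, merged))
        counter += 1
    return heap[0][2]

probs = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.1, 5: 0.1, 6: 0.05}
code = huffman_code(probs)
avg_len = sum(p * len(code[s]) for s, p in probs.items())
print(code, avg_len)   # average length 2.4 bits, matching Example 3.27
```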
Example 3.27 Consider a source with alphabet {1, 2, 3, 4, 5, 6} and symbol probabilities 0.25, 0.25, 0.25, 0.1, 0.1 and 0.05, respectively. By following the Huffman encoding procedure as shown in Figure 3.6, we obtain the Huffman code
{00, 01, 10, 110, 1110, 1111}.
[Figure 3.6: Example of the Huffman encoding procedure, showing the successive source reductions (probabilities 0.05, 0.1, 0.1, 0.25, 0.25, 0.25 merged stage by stage up to 1.0) and the resulting codewords 1111, 1110, 110, 10, 01, 00.]
Observation 3.28

• Huffman codes are not unique for a given source distribution; e.g., by inverting all the code bits of a Huffman code, one gets another Huffman code, or by resolving ties in different ways in the Huffman algorithm, one also obtains different Huffman codes (but all of these codes have the same minimal $\bar{R}_n$).

• One can obtain optimal codes that are not Huffman codes; e.g., by interchanging two codewords of the same length of a Huffman code, one can get another non-Huffman (but optimal) code. Furthermore, one can construct an optimal suffix code (i.e., a code in which no codeword can be a suffix of another codeword) from a Huffman code (which is a prefix code) by reversing the Huffman codewords.

• Binary Huffman codes always satisfy the Kraft inequality with equality (their code tree is saturated); e.g., see [13, p. 72].

• Any n-th order binary Huffman code $f : \mathcal{X}^n \to \{0, 1\}^*$ for a stationary source $\{X_n\}_{n=1}^\infty$ with finite alphabet $\mathcal{X}$ satisfies
$$H(\mathcal{X}) \le \frac{1}{n} H(X^n) \le \bar{R}_n < \frac{1}{n} H(X^n) + \frac{1}{n}.$$
Thus, as n increases to infinity, $\bar{R}_n \to H(\mathcal{X})$, but the complexity as well as the encoding-decoding delay grows exponentially with n.

• Finally, note that non-binary (i.e., for D > 2) Huffman codes can also be constructed in a mostly similar way as in the binary case by designing a D-ary tree and iteratively applying Lemma 3.26, where now the D least likely source symbols are combined at each stage. The only difference from the binary case is that we have to ensure that we are ultimately left with D symbols at the last stage of the algorithm to guarantee the code's optimality. This is remedied by expanding the original source alphabet $\mathcal{X}$ by adding dummy symbols (each with zero probability) so that the size $|\mathcal{X}'|$ of the expanded source alphabet is the smallest positive integer greater than or equal to $|\mathcal{X}|$ with
$$|\mathcal{X}'| \equiv 1 \pmod{D-1}.$$
For example, if $|\mathcal{X}| = 6$ and D = 3 (ternary codes), we obtain $|\mathcal{X}'| = 7$, meaning that we need to enlarge the original source $\mathcal{X}$ by adding one dummy (zero-probability) source symbol. We thus obtain that the necessary conditions for optimality of Lemma 3.25 also hold for D-ary prefix codes when replacing $\mathcal{X}$ with the expanded source $\mathcal{X}'$ and replacing "two" with "D" in the statement of the lemma. The resulting D-ary Huffman code is an optimal code for the original source $\mathcal{X}$ (e.g., see [18, Chap. 3] and [33, Chap. 11]).
B) Shannon-Fano-Elias code

Assume $\mathcal{X} = \{1, \ldots, M\}$ and $P_X(x) > 0$ for all $x \in \mathcal{X}$. Define
$$F(x) \triangleq \sum_{a \le x} P_X(a)$$
and
$$\bar{F}(x) \triangleq \sum_{a < x} P_X(a) + \frac{1}{2} P_X(x).$$

Encoder: For any $x \in \mathcal{X}$, express $\bar{F}(x)$ in binary (fractional) form, say
$$\bar{F}(x) = .c_1 c_2 \ldots c_k \ldots,$$
and take the first k (fractional) bits as the codeword of source symbol x, i.e.,
$$(c_1, c_2, \ldots, c_k),$$
where $k \triangleq \lceil \log_2(1/P_X(x)) \rceil + 1$.
Decoder: Given codeword $(c_1, \ldots, c_k)$, compute the cumulative sum $F(\cdot)$ starting from the smallest element in $\{1, 2, \ldots, M\}$ until the first x satisfying
$$F(x) \ge .c_1 \ldots c_k.$$
Then x is the original source symbol.

Proof of decodability: For any number $a \in [0, 1]$, let $[a]_k$ denote the operation that chops the binary representation of a after k bits (i.e., removes the (k+1)-th bit, the (k+2)-th bit, etc.). Then
$$\bar{F}(x) - \big[\bar{F}(x)\big]_k < \frac{1}{2^k}.$$
Since $k = \lceil \log_2(1/P_X(x)) \rceil + 1$,
$$\frac{1}{2^k} \le \frac{1}{2} P_X(x) = \left(\sum_{a < x} P_X(a) + \frac{P_X(x)}{2}\right) - \sum_{a \le x-1} P_X(a) = \bar{F}(x) - F(x-1).$$
Hence,
$$F(x-1) = \left(F(x-1) + \frac{1}{2^k}\right) - \frac{1}{2^k} \le \bar{F}(x) - \frac{1}{2^k} < \big[\bar{F}(x)\big]_k.$$
In addition,
$$F(x) > \bar{F}(x) \ge \big[\bar{F}(x)\big]_k.$$
Consequently, x is the first element satisfying
$$F(x) \ge .c_1 c_2 \ldots c_k.$$
Average codeword length:
$$\sum_{x \in \mathcal{X}} P_X(x)\left(\left\lceil \log_2 \frac{1}{P_X(x)} \right\rceil + 1\right) < \sum_{x \in \mathcal{X}} P_X(x)\log_2 \frac{1}{P_X(x)} + 2 = (H(X) + 2) \text{ bits}.$$
Observation 3.29 The Shannon-Fano-Elias code is a prefix code.
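A direct transcription of this encoder into Python, assuming only the definitions of $\bar{F}(x)$ and k given above (the pmf used in the example call is an arbitrary toy choice), could read:

```python
from math import ceil, log2

def sfe_encode(pmf):
    """Shannon-Fano-Elias codewords for a pmf given as an ordered list of
    (symbol, probability) pairs with strictly positive probabilities."""
    codewords = {}
    F = 0.0                               # cumulative probability of smaller symbols
    for symbol, p in pmf:
        F_bar = F + p / 2                 # midpoint of the symbol's interval
        k = ceil(log2(1 / p)) + 1         # codeword length
        bits, frac = "", F_bar
        for _ in range(k):                # first k fractional bits of F_bar
            frac *= 2
            bit, frac = divmod(frac, 1)
            bits += str(int(bit))
        codewords[symbol] = bits
        F += p
    return codewords

print(sfe_encode([('a', 0.5), ('b', 0.25), ('c', 0.125), ('d', 0.125)]))
# e.g. 'a' -> '01' (k = 2), 'b' -> '101' (k = 3), 'c' -> '1101', 'd' -> '1111'
```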
3.3.4 Examples of universal lossless variable-length codes

In Section 3.3.3, we assumed that the source distribution is known; thus we can use either Huffman codes or Shannon-Fano-Elias codes to compress the source. What if the source distribution is not known a priori? Is it still possible to establish a completely lossless data compression code which is universally good (or asymptotically optimal) for all sources of interest? The answer is affirmative. Two examples are the adaptive Huffman codes and the Lempel-Ziv codes (which, unlike Huffman and Shannon-Fano-Elias codes, map variable-length sourcewords onto codewords).

A) Adaptive Huffman code

A straightforward universal coding scheme is to use the empirical distribution (or relative frequencies) in place of the true distribution, and then apply the optimal Huffman code for that empirical distribution. If the source is i.i.d., the relative frequencies will converge to its true marginal probabilities; therefore, such universal codes should be good for all i.i.d. sources. However, in order to get an accurate estimate of the true distribution, one must observe a sufficiently long sourceword sequence, during which the coder suffers a long delay. This can be improved by using the adaptive universal Huffman code [19].

The working procedure of the adaptive Huffman code is as follows. Start with an initial guess of the source distribution (based on the assumption that the source is a DMS). As a new source symbol arrives, encode the data with the Huffman coding scheme corresponding to the current estimated distribution, and then update the estimated distribution and the Huffman codebook according to the newly arrived source symbol.
To be specific, let the source alphabet be $\mathcal{X} \triangleq \{a_1, \ldots, a_M\}$. Define
$$N(a_i|x^n) \triangleq \text{number of occurrences of } a_i \text{ in } x_1, x_2, \ldots, x_n.$$
Then the (current) relative frequency of $a_i$ is $N(a_i|x^n)/n$. Let $c_n(a_i)$ denote the Huffman codeword of source symbol $a_i$ with respect to the distribution
$$\left(\frac{N(a_1|x^n)}{n}, \frac{N(a_2|x^n)}{n}, \cdots, \frac{N(a_M|x^n)}{n}\right).$$
Now suppose that $x_{n+1} = a_j$. The codeword $c_n(a_j)$ is output, and the relative frequency of each source outcome becomes
$$\frac{N(a_j|x^{n+1})}{n+1} = \frac{n\,(N(a_j|x^n)/n) + 1}{n+1}$$
and
$$\frac{N(a_i|x^{n+1})}{n+1} = \frac{n\,(N(a_i|x^n)/n)}{n+1} \quad \text{for } i \ne j.$$
This observation results in the following distribution update policy:
$$P_X^{(n+1)}(a_j) = \frac{n P_X^{(n)}(a_j) + 1}{n+1}$$
and
$$P_X^{(n+1)}(a_i) = \frac{n}{n+1} P_X^{(n)}(a_i) \quad \text{for } i \ne j,$$
where $P_X^{(n+1)}$ represents the estimate of the true distribution $P_X$ at time n+1.
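For concreteness, the update rule can be written in a few lines of Python; this sketch only tracks the estimated distribution and leaves the re-derivation of the Huffman codebook (needed when the sibling property discussed below is violated) to a routine such as the `huffman_code` sketch given earlier.

```python
def update_estimate(P_est, n, observed_symbol):
    """One step of the adaptive relative-frequency update: P_est is a dict
    {symbol: estimated probability} after n observations, and observed_symbol
    is the (n+1)-th source symbol."""
    new_est = {}
    for a, p in P_est.items():
        if a == observed_symbol:
            new_est[a] = (n * p + 1) / (n + 1)
        else:
            new_est[a] = (n * p) / (n + 1)
    return new_est

# Example: starting from the estimate at n = 16 used in Figure 3.7 and
# observing a3 at time 17 reproduces the fractions computed in the text.
P16 = {'a1': 3/8, 'a2': 1/4, 'a3': 1/8, 'a4': 1/8, 'a5': 1/16, 'a6': 1/16}
P17 = update_estimate(P16, 16, 'a3')   # {'a1': 6/17, 'a2': 4/17, 'a3': 3/17, ...}
```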
Note that in the adaptive Huffman coding scheme, the encoder and decoder need not be re-designed at every time instant, but only when a sufficient change in the estimated distribution occurs such that the so-called sibling property is violated.

Definition 3.30 (Sibling property) A prefix code is said to have the sibling property if its code tree satisfies:

1. every node in the code tree (except for the root node) has a sibling (i.e., the code tree is saturated), and

2. the nodes can be listed in non-decreasing order of probability with each node being adjacent to its sibling.

The next observation indicates that the Huffman code is the only prefix code satisfying the sibling property.

Observation 3.31 A prefix code is a Huffman code if and only if it satisfies the sibling property.

An example of a code tree satisfying the sibling property is shown in Figure 3.7. The first requirement is satisfied since the tree is saturated. The second requirement can be checked via the node list in Figure 3.7.
If the next observation (say at time n = 17) is $a_3$, then its codeword 100 is output (using the Huffman code corresponding to $P_X^{(16)}$). The estimated distribution is updated as
$$P_X^{(17)}(a_1) = \frac{16 \times (3/8)}{17} = \frac{6}{17}, \qquad P_X^{(17)}(a_2) = \frac{16 \times (1/4)}{17} = \frac{4}{17},$$
$$P_X^{(17)}(a_3) = \frac{16 \times (1/8) + 1}{17} = \frac{3}{17}, \qquad P_X^{(17)}(a_4) = \frac{16 \times (1/8)}{17} = \frac{2}{17},$$
$$P_X^{(17)}(a_5) = \frac{16 \times (1/16)}{17} = \frac{1}{17}, \qquad P_X^{(17)}(a_6) = \frac{16 \times (1/16)}{17} = \frac{1}{17}.$$
[Figure 3.7: Example of the sibling property based on the code tree induced by $P_X^{(16)}$. Each leaf $a_j$ is labeled with its codeword and probability ($a_1$: 00, 3/8; $a_2$: 01, 1/4; $a_3$: 100, 1/8; $a_4$: 101, 1/8; $a_5$: 110, 1/16; $a_6$: 111, 1/16). Internal nodes are denoted by b with the assigned (partial) code as subscript and are labeled with the probability sum of their children; the node list can be arranged so that every node is adjacent to its sibling.]
The sibling property is then violated (cf. Figure 3.8). Hence, the codebook needs to be updated according to the new estimated distribution, and the observation at n = 18 shall be encoded using the new codebook in Figure 3.9. Details about adaptive Huffman codes can be found in [19].
[Figure 3.8: (Continuation of Figure 3.7) Example of the violation of the sibling property after observing a new symbol $a_3$ at n = 17: with the updated probabilities (in seventeenths), node $a_1$ is no longer adjacent to its sibling $a_2$ in the ordered node list.]

B) Lempel-Ziv codes

We now introduce a well-known and feasible universal coding scheme, which is named after its inventors, Lempel and Ziv (e.g., cf. [12]). These codes, unlike Huffman and Shannon-Fano-Elias codes, map variable-length sourcewords (as opposed to fixed-length sourcewords) onto codewords.
Suppose the source alphabet is binary. Then the Lempel-Ziv encoder can be described as follows.

Encoder:

1. Parse the input sequence into strings that have never appeared before. For example, if the input sequence is 1011010100010..., the algorithm first grabs the first letter 1 and finds that it has never appeared before, so 1 becomes the first string. The algorithm then takes the second letter 0, determines that it also has not appeared before, and makes it the next string. It then moves on to the next letter 1 and finds that this string has already appeared; hence it appends the following letter 1 to obtain the new string 11, and so on. Under this procedure, the source sequence is parsed into the strings
1, 0, 11, 01, 010, 00, 10.
[Figure 3.9: (Continuation of Figure 3.8) Updated Huffman code tree for $P_X^{(17)}$, with codewords $a_1 \to 10$, $a_2 \to 00$, $a_3 \to 01$, $a_4 \to 110$, $a_5 \to 1110$, $a_6 \to 1111$; the sibling property holds again for the new code.]
2. Let L be the number of distinct strings of the parsed source. Then we need $\lceil \log_2 L \rceil$ bits to index these strings (starting from one). In the above example, the indices are:

parsed source :  1    0    11   01   010  00   10
index         :  001  010  011  100  101  110  111

The codeword of each string is then the index of its prefix concatenated with the last bit of its source string. For example, the codeword of source string 010 is the index of 01, i.e., 100, concatenated with the last bit of the source string, i.e., 0. Through this procedure, encoding the above parsed strings (with $\lceil \log_2 L \rceil = 3$) yields the codeword sequence
(000, 1)(000, 0)(001, 1)(010, 1)(100, 0)(010, 0)(001, 0),
or equivalently,
0001000000110101100001000010.
(A short code sketch of this parsing-and-indexing step follows.)
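The following Python sketch reproduces the two-pass procedure on the example sequence; it is a bare-bones illustration of the parsing and indexing rule described above, not a production Lempel-Ziv implementation (index 0 is reserved here for the empty prefix, and any incomplete final phrase is simply dropped).

```python
from math import ceil, log2

def lz_parse(bits):
    """Parse a binary string into phrases that have never appeared before."""
    phrases, current = [], ""
    for b in bits:
        current += b
        if current not in phrases:
            phrases.append(current)
            current = ""
    return phrases                      # an incomplete trailing phrase is dropped

def lz_encode(bits):
    phrases = lz_parse(bits)                              # first pass: find L
    width = max(1, ceil(log2(len(phrases) + 1)))          # bits per index (0..L)
    codewords = []
    for phrase in phrases:                                # second pass: emit codewords
        prefix, last = phrase[:-1], phrase[-1]
        idx = phrases.index(prefix) + 1 if prefix else 0  # 0 denotes the empty prefix
        codewords.append(format(idx, f"0{width}b") + last)
    return "".join(codewords)

print(lz_parse("1011010100010"))   # ['1', '0', '11', '01', '010', '00', '10']
print(lz_encode("1011010100010"))  # 0001000000110101100001000010
```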
Note that the conventional Lempel-Ziv encoder requires two passes: the first pass to determine L, and the second pass to generate the codewords. The algorithm, however, can be modified so that it requires only one pass over the entire source string. Also note that the above algorithm assigns an equal number of bits, namely $\lceil \log_2 L \rceil$, to all the location indices; this too can be relaxed by proper modification.

Decoder: The decoding is straightforward from the encoding procedure.

Theorem 3.32 The above algorithm asymptotically achieves the entropy rate of any stationary ergodic source (with unknown statistics).

Proof: Refer to [12, Sec. 13.5]. □
Chapter 4

Data Transmission and Channel Capacity
4.1 Principles of data transmission

A noisy communication channel is an input-output medium in which the output is not completely or deterministically specified by the input. The channel is instead stochastically modeled: given channel input x, the channel output y is governed by a transition (conditional) probability distribution denoted by $P_{Y|X}(y|x)$. Since two different inputs may give rise to the same output, the receiver, upon receipt of an output, needs to guess the most probable sent input. In general, words of length n are sent and received over the channel; in this case, the channel is characterized by a sequence of n-dimensional transition distributions $P_{Y^n|X^n}(y^n|x^n)$, for $n = 1, 2, \cdots$. A block diagram depicting a data transmission or channel coding system (with no feedback¹) is given in Figure 4.1.
[Figure 4.1: A data transmission system, W → Channel Encoder → $X^n$ → Channel $P_{Y^n|X^n}(\cdot|\cdot)$ → $Y^n$ → Channel Decoder → $\hat{W}$, where W represents the message for transmission, $X^n$ denotes the codeword corresponding to message W, $Y^n$ represents the received word due to channel input $X^n$, and $\hat{W}$ denotes the reconstructed message from $Y^n$.]
¹The capacity of channels with (output) feedback will be studied in Part II of the book.
The designer of a data transmission (or channel) code needs to carefully select codewords from the set of channel input words (of a given length) so that minimal ambiguity is obtained at the channel receiver. For example, suppose that a channel has binary input and output alphabets and that its transition probability distribution induces the following conditional probabilities on its output symbols given that input words of length 2 are sent:
$$P_{Y|X^2}(y = 0\,|\,x^2 = 00) = 1, \qquad P_{Y|X^2}(y = 0\,|\,x^2 = 01) = 1,$$
$$P_{Y|X^2}(y = 1\,|\,x^2 = 10) = 1, \qquad P_{Y|X^2}(y = 1\,|\,x^2 = 11) = 1$$
(i.e., inputs 00 and 01 are mapped with certainty to output 0, while inputs 10 and 11 are mapped with certainty to output 1). Suppose a binary message (either event A or event B) is required to be transmitted from the sender to the receiver. Then the data transmission code with (codeword 00 for event A, codeword 10 for event B) obviously induces less ambiguity at the receiver than the code with (codeword 00 for event A, codeword 01 for event B).
In short, the objective in designing a data transmission (or channel) code is to transform a noisy channel into a reliable medium for sending messages and recovering them at the receiver with minimal loss. To achieve this goal, the designer of a data transmission code needs to take advantage of the common parts between the sender and the receiver sites that are least affected by the channel noise. We will see that these common parts are probabilistically captured by the mutual information between the channel input and the channel output.

As illustrated in the previous example, if a least-noise-affected subset of the channel input words is appropriately selected as the set of codewords, the messages intended to be transmitted can be reliably sent to the receiver with arbitrarily small error. One then raises the question:

What is the maximum amount of information (per channel use) that can be reliably transmitted over a given noisy channel?
In the above example, we can transmit a binary message error-free, and hence the amount of information that can be reliably transmitted is at least 1 bit per channel use (or channel symbol). It can be expected that the amount of information that can be reliably transmitted over a highly noisy channel should be less than that over a less noisy channel. But such a comparison requires a good measure of the noisiness of channels.

From an information-theoretic viewpoint, channel capacity provides a good measure of the noisiness of a channel; it represents the maximal amount of information (per channel use) that can be transmitted via a data transmission code over the channel and recovered with arbitrarily small probability of error at the receiver. In addition to its dependence on the channel transition distribution, channel capacity also depends on the coding constraint imposed on the channel input, such as allowing only block (fixed-length) codes. In this chapter, we will study channel capacity for block codes (namely, only block transmission codes can be used).² Throughout the chapter, the noisy channel is assumed to be memoryless (as defined in the next section).
4.2 Discrete memoryless channels

Definition 4.1 (Discrete channel) A discrete communication channel is characterized by:

• A finite input alphabet $\mathcal{X}$.

• A finite output alphabet $\mathcal{Y}$.

• A sequence of n-dimensional transition distributions $\{P_{Y^n|X^n}(y^n|x^n)\}_{n=1}^\infty$ such that $\sum_{y^n\in\mathcal{Y}^n} P_{Y^n|X^n}(y^n|x^n) = 1$ for every $x^n \in \mathcal{X}^n$, where $x^n = (x_1, \cdots, x_n) \in \mathcal{X}^n$ and $y^n = (y_1, \cdots, y_n) \in \mathcal{Y}^n$. We assume that the above sequence of n-dimensional distributions is consistent, i.e.,
$$P_{Y^i|X^i}(y^i|x^i) = \frac{\sum_{x_{i+1}\in\mathcal{X}}\sum_{y_{i+1}\in\mathcal{Y}} P_{X^{i+1}}(x^{i+1})\,P_{Y^{i+1}|X^{i+1}}(y^{i+1}|x^{i+1})}{\sum_{x_{i+1}\in\mathcal{X}} P_{X^{i+1}}(x^{i+1})}$$
$$= \sum_{x_{i+1}\in\mathcal{X}}\sum_{y_{i+1}\in\mathcal{Y}} P_{X_{i+1}|X^i}(x_{i+1}|x^i)\,P_{Y^{i+1}|X^{i+1}}(y^{i+1}|x^{i+1})$$
for every $x^i$, $y^i$, $P_{X_{i+1}|X^i}$ and $i = 1, 2, \cdots$.

²See [44] for recent results regarding channel capacity when no coding constraints are applied on the channel input (so that variable-length codes can be employed).
In general, real-world communication channels exhibit statistical memory in the sense that current channel outputs statistically depend on past outputs as well as on past, current and (possibly) future inputs. However, for the sake of simplicity, we restrict our attention in this chapter to the class of memoryless channels (channels with memory will be treated later in Volume II).

Definition 4.2 (Discrete memoryless channel) A discrete memoryless channel (DMC) is a channel whose sequence of transition distributions $P_{Y^n|X^n}$ satisfies
$$P_{Y^n|X^n}(y^n|x^n) = \prod_{i=1}^{n} P_{Y|X}(y_i|x_i) \qquad (4.2.1)$$
for every $n = 1, 2, \cdots$, $x^n \in \mathcal{X}^n$ and $y^n \in \mathcal{Y}^n$. In other words, a DMC is fully described by the channel's transition distribution matrix $Q \triangleq [p_{x,y}]$ of size $|\mathcal{X}| \times |\mathcal{Y}|$, where
$$p_{x,y} \triangleq P_{Y|X}(y|x)$$
for $x \in \mathcal{X}$, $y \in \mathcal{Y}$. Furthermore, the matrix Q is stochastic; i.e., the sum of the entries in each of its rows is equal to 1 (since $\sum_{y\in\mathcal{Y}} p_{x,y} = 1$ for all $x \in \mathcal{X}$).
Observation 4.3 We note that the DMC condition (4.2.1) is actually equivalent to each of the following two sets of conditions:
$$\begin{cases} P_{Y_n|X^n,Y^{n-1}}(y_n|x^n,y^{n-1}) = P_{Y|X}(y_n|x_n), & n = 1, 2, \cdots,\ \forall\, x^n, y^n; & (4.2.2\text{a})\\[4pt] P_{Y^{n-1}|X^n}(y^{n-1}|x^n) = P_{Y^{n-1}|X^{n-1}}(y^{n-1}|x^{n-1}), & n = 2, 3, \cdots,\ \forall\, x^n, y^{n-1}; & (4.2.2\text{b})\end{cases}$$
$$\begin{cases} P_{Y_n|X^n,Y^{n-1}}(y_n|x^n,y^{n-1}) = P_{Y|X}(y_n|x_n), & n = 1, 2, \cdots,\ \forall\, x^n, y^n; & (4.2.3\text{a})\\[4pt] P_{X_n|X^{n-1},Y^{n-1}}(x_n|x^{n-1},y^{n-1}) = P_{X_n|X^{n-1}}(x_n|x^{n-1}), & n = 1, 2, \cdots,\ \forall\, x^n, y^{n-1}. & (4.2.3\text{b})\end{cases}$$
Condition (4.2.2a) (also (4.2.3a)) implies that the current output $Y_n$ only depends on the current input $X_n$ and not on past inputs $X^{n-1}$ and past outputs $Y^{n-1}$. Condition (4.2.2b) indicates that the past outputs $Y^{n-1}$ do not depend on the current input $X_n$. These two conditions together give
$$P_{Y^n|X^n}(y^n|x^n) = P_{Y^{n-1}|X^n}(y^{n-1}|x^n)\,P_{Y_n|X^n,Y^{n-1}}(y_n|x^n,y^{n-1}) = P_{Y^{n-1}|X^{n-1}}(y^{n-1}|x^{n-1})\,P_{Y|X}(y_n|x_n);$$
hence, (4.2.1) holds recursively for $n = 1, 2, \cdots$. The converse (i.e., that (4.2.1) implies both (4.2.2a) and (4.2.2b)) is a direct consequence of
$$P_{Y_n|X^n,Y^{n-1}}(y_n|x^n,y^{n-1}) = \frac{P_{Y^n|X^n}(y^n|x^n)}{\sum_{y_n\in\mathcal{Y}} P_{Y^n|X^n}(y^n|x^n)}$$
and
$$P_{Y^{n-1}|X^n}(y^{n-1}|x^n) = \sum_{y_n\in\mathcal{Y}} P_{Y^n|X^n}(y^n|x^n).$$
Similarly, (4.2.3b) states that the current input $X_n$ is independent of past outputs $Y^{n-1}$, which together with (4.2.3a) again implies
$$P_{Y^n|X^n}(y^n|x^n) = \frac{P_{X^n,Y^n}(x^n,y^n)}{P_{X^n}(x^n)} = \frac{P_{X^{n-1},Y^{n-1}}(x^{n-1},y^{n-1})\,P_{X_n|X^{n-1},Y^{n-1}}(x_n|x^{n-1},y^{n-1})\,P_{Y_n|X^n,Y^{n-1}}(y_n|x^n,y^{n-1})}{P_{X^{n-1}}(x^{n-1})\,P_{X_n|X^{n-1}}(x_n|x^{n-1})} = P_{Y^{n-1}|X^{n-1}}(y^{n-1}|x^{n-1})\,P_{Y|X}(y_n|x_n),$$
hence recursively yielding (4.2.1). The converse for (4.2.3b), i.e., that (4.2.1) implies (4.2.3b), can be analogously proved by noting that
$$P_{X_n|X^{n-1},Y^{n-1}}(x_n|x^{n-1},y^{n-1}) = \frac{P_{X^n}(x^n)\sum_{y_n\in\mathcal{Y}} P_{Y^n|X^n}(y^n|x^n)}{P_{X^{n-1}}(x^{n-1})\,P_{Y^{n-1}|X^{n-1}}(y^{n-1}|x^{n-1})}.$$

Note that the above definition of a DMC in (4.2.1) prohibits the use of channel feedback, as feedback allows the current channel input to be a function of past channel outputs (therefore, conditions (4.2.2b) and (4.2.3b) cannot hold with feedback). Instead, a causality condition generalizing condition (4.2.2a) (e.g., see Definition 7.4 in [47]) is needed to define a channel with feedback (feedback will be considered in Part II of this book).
Examples of DMCs:
1. Identity (noiseless) channels: An identity channel has equal-size input and output alphabets ($|\mathcal{X}| = |\mathcal{Y}|$) and channel transition probability satisfying
$$P_{Y|X}(y|x) = \begin{cases} 1 & \text{if } y = x,\\ 0 & \text{if } y \ne x.\end{cases}$$
This is a noiseless or perfect channel, as the channel input is received error-free at the channel output.
2. Binary symmetric channels: A binary symmetric channel (BSC) is a channel with binary input and output alphabets such that each input has a (conditional) probability $\varepsilon$ of being received inverted at the output, where $\varepsilon \in [0, 1]$ is called the channel's crossover probability or bit error rate. The channel's transition distribution matrix is given by
$$Q = [p_{x,y}] = \begin{bmatrix} p_{0,0} & p_{0,1}\\ p_{1,0} & p_{1,1}\end{bmatrix} = \begin{bmatrix} P_{Y|X}(0|0) & P_{Y|X}(1|0)\\ P_{Y|X}(0|1) & P_{Y|X}(1|1)\end{bmatrix} = \begin{bmatrix} 1-\varepsilon & \varepsilon\\ \varepsilon & 1-\varepsilon\end{bmatrix} \qquad (4.2.4)$$
and can be graphically represented via a transition diagram as shown in Figure 4.2.

[Figure 4.2: Binary symmetric channel: input 0 is received as 0 with probability $1-\varepsilon$ and as 1 with probability $\varepsilon$; input 1 is received as 1 with probability $1-\varepsilon$ and as 0 with probability $\varepsilon$.]

If we set $\varepsilon = 0$, then the BSC reduces to the binary identity (noiseless) channel. The channel is called symmetric since $P_{Y|X}(1|0) = P_{Y|X}(0|1)$; i.e., it has the same probability of flipping an input bit into a 0 or a 1. A detailed discussion of DMCs with various symmetry properties is given at the end of this chapter.

Despite its simplicity, the BSC is rich enough to capture most of the complexity of coding problems over more general channels. For example, it can exactly model the behavior of practical channels with additive memoryless Gaussian noise used in conjunction with binary symmetric modulation and hard-decision demodulation (e.g., see [46, p. 240]). It is also worth pointing out that the BSC can be explicitly represented via a binary modulo-2 additive noise channel whose output at time i is the modulo-2 sum of its input and noise variables:
$$Y_i = X_i \oplus Z_i \quad \text{for } i = 1, 2, \cdots,$$
where $\oplus$ denotes addition modulo 2, $Y_i$, $X_i$ and $Z_i$ are the channel output, input and noise, respectively, at time i, the alphabets $\mathcal{X} = \mathcal{Y} = \mathcal{Z} = \{0, 1\}$ are all binary, and it is assumed that $X_i$ and $Z_j$ are independent of each other for any $i, j = 1, 2, \cdots$, and that the noise process is a Bernoulli($\varepsilon$) process, i.e., a binary i.i.d. process with $\Pr[Z = 1] = \varepsilon$. (A short simulation of this additive-noise representation is included in the code sketch following this list of examples.)
3. Binary erasure channels: In the BSC, some input bits are received perfectly and others are received corrupted (flipped) at the channel output. In some channels, however, input bits may be lost during transmission rather than received corrupted (for example, packets in data networks may get dropped or blocked due to congestion or bandwidth constraints). In this case, the receiver knows the exact location of these bits in the received bitstream or codeword, but not their actual value. Such bits are then declared as erased during transmission and are called erasures. This gives rise to the so-called binary erasure channel (BEC), as illustrated in Figure 4.3, with input alphabet $\mathcal{X} = \{0, 1\}$ and output alphabet $\mathcal{Y} = \{0, E, 1\}$, where E represents an erasure, and channel transition matrix given by
$$Q = [p_{x,y}] = \begin{bmatrix} p_{0,0} & p_{0,E} & p_{0,1}\\ p_{1,0} & p_{1,E} & p_{1,1}\end{bmatrix} = \begin{bmatrix} P_{Y|X}(0|0) & P_{Y|X}(E|0) & P_{Y|X}(1|0)\\ P_{Y|X}(0|1) & P_{Y|X}(E|1) & P_{Y|X}(1|1)\end{bmatrix} = \begin{bmatrix} 1-\alpha & \alpha & 0\\ 0 & \alpha & 1-\alpha\end{bmatrix} \qquad (4.2.5)$$
where $0 \le \alpha \le 1$ is called the channel's erasure probability.
4. Binary channels with errors and erasures: One can combine the BSC with the BEC to obtain a binary channel with both errors and erasures, as shown in Figure 4.4. We will call such a channel the binary symmetric erasure channel (BSEC). In this case, the channel's transition matrix is given by
$$Q = [p_{x,y}] = \begin{bmatrix} p_{0,0} & p_{0,E} & p_{0,1}\\ p_{1,0} & p_{1,E} & p_{1,1}\end{bmatrix} = \begin{bmatrix} 1-\varepsilon-\alpha & \alpha & \varepsilon\\ \varepsilon & \alpha & 1-\varepsilon-\alpha\end{bmatrix} \qquad (4.2.6)$$
where $\varepsilon, \alpha \in [0, 1]$ are the channel's crossover and erasure probabilities, respectively. Clearly, setting $\alpha = 0$ reduces the BSEC to the BSC, and setting $\varepsilon = 0$ reduces the BSEC to the BEC.

[Figure 4.3: Binary erasure channel: each input bit is received correctly with probability $1-\alpha$ and erased (mapped to E) with probability $\alpha$.]

More generally, the channel need not have a symmetric property in the sense of having identical transition distributions when input bits 0 or 1 are sent. For example, the channel's transition matrix can be given by
$$Q = [p_{x,y}] = \begin{bmatrix} p_{0,0} & p_{0,E} & p_{0,1}\\ p_{1,0} & p_{1,E} & p_{1,1}\end{bmatrix} = \begin{bmatrix} 1-\varepsilon-\alpha & \alpha & \varepsilon\\ \varepsilon' & \alpha' & 1-\varepsilon'-\alpha'\end{bmatrix} \qquad (4.2.7)$$
where in general $\varepsilon \ne \varepsilon'$ and $\alpha \ne \alpha'$. We call such a channel an asymmetric channel with errors and erasures (this model might be useful to represent practical channels using asymmetric or non-uniform modulation constellations).

[Figure 4.4: Binary symmetric erasure channel: each input bit is received correctly with probability $1-\varepsilon-\alpha$, flipped with probability $\varepsilon$, and erased with probability $\alpha$.]
5. q-ary symmetric channels: Given an integer $q \ge 2$, the q-ary symmetric channel is a non-binary extension of the BSC; it has alphabets $\mathcal{X} = \mathcal{Y} = \{0, 1, \cdots, q-1\}$ of size q and channel transition matrix given by
$$Q = [p_{x,y}] = \begin{bmatrix} p_{0,0} & p_{0,1} & \cdots & p_{0,q-1}\\ p_{1,0} & p_{1,1} & \cdots & p_{1,q-1}\\ \vdots & \vdots & \ddots & \vdots\\ p_{q-1,0} & p_{q-1,1} & \cdots & p_{q-1,q-1}\end{bmatrix} = \begin{bmatrix} 1-\varepsilon & \frac{\varepsilon}{q-1} & \cdots & \frac{\varepsilon}{q-1}\\ \frac{\varepsilon}{q-1} & 1-\varepsilon & \cdots & \frac{\varepsilon}{q-1}\\ \vdots & \vdots & \ddots & \vdots\\ \frac{\varepsilon}{q-1} & \frac{\varepsilon}{q-1} & \cdots & 1-\varepsilon\end{bmatrix} \qquad (4.2.8)$$
where $0 \le \varepsilon \le 1$ is the channel's symbol error rate (or probability). When q = 2, the channel reduces to the BSC with bit error rate $\varepsilon$, as expected.

As with the BSC, the q-ary symmetric channel can be expressed as a modulo-q additive noise channel with common input, output and noise alphabets $\mathcal{X} = \mathcal{Y} = \mathcal{Z} = \{0, 1, \cdots, q-1\}$ and whose output $Y_i$ at time i is given by $Y_i = X_i \oplus_q Z_i$, for $i = 1, 2, \cdots$, where $\oplus_q$ denotes addition modulo q, and $X_i$ and $Z_i$ are the channel's input and noise variables, respectively, at time i. Here, the noise process $\{Z_n\}_{n=1}^\infty$ is assumed to be an i.i.d. process with distribution
$$\Pr[Z = 0] = 1-\varepsilon \quad \text{and} \quad \Pr[Z = a] = \frac{\varepsilon}{q-1}, \quad a \in \{1, \cdots, q-1\}.$$
It is also assumed that the input and noise processes are independent of each other.
6. q-ary erasure channels: Given an integer $q \ge 2$, one can also consider a non-binary extension of the BEC, yielding the so-called q-ary erasure channel. Specifically, this channel has input and output alphabets given by $\mathcal{X} = \{0, 1, \cdots, q-1\}$ and $\mathcal{Y} = \{0, 1, \cdots, q-1, E\}$, respectively, where E denotes an erasure, and channel transition distribution given by
$$P_{Y|X}(y|x) = \begin{cases} 1-\alpha & \text{if } y = x,\ x \in \mathcal{X},\\ \alpha & \text{if } y = E,\ x \in \mathcal{X},\\ 0 & \text{if } y \ne x,\ y \ne E,\ x \in \mathcal{X},\end{cases} \qquad (4.2.9)$$
where $0 \le \alpha \le 1$ is the erasure probability. As expected, setting q = 2 reduces this channel to the BEC.
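None of the following is in the original text, but the transition matrices above are easy to construct and sanity-check in code, and the modulo-2 additive-noise view of the BSC is equally easy to simulate; the sketch below does both (the alphabet orderings and parameter values are arbitrary illustrative choices).

```python
import random

def bsc_matrix(eps):
    return [[1 - eps, eps],
            [eps, 1 - eps]]                       # output order (0, 1)

def bec_matrix(alpha):
    return [[1 - alpha, alpha, 0.0],
            [0.0, alpha, 1 - alpha]]              # output order (0, E, 1)

def q_ary_symmetric_matrix(q, eps):
    return [[1 - eps if y == x else eps / (q - 1) for y in range(q)]
            for x in range(q)]

# Every channel transition matrix must be stochastic: each row sums to 1.
for Q in (bsc_matrix(0.1), bec_matrix(0.25), q_ary_symmetric_matrix(4, 0.3)):
    assert all(abs(sum(row) - 1.0) < 1e-12 for row in Q)

# BSC as a modulo-2 additive-noise channel: Y_i = X_i XOR Z_i, Z_i ~ Bernoulli(eps).
rng, eps = random.Random(0), 0.1
x = [rng.randrange(2) for _ in range(100000)]
y = [xi ^ (rng.random() < eps) for xi in x]
print(sum(xi != yi for xi, yi in zip(x, y)) / len(x))   # empirical bit error rate ~ 0.1
```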
4.3 Block codes for data transmission over DMCs

Definition 4.4 (Fixed-length data transmission code) Given positive integers n and M, and a discrete channel with input alphabet $\mathcal{X}$ and output alphabet $\mathcal{Y}$, a fixed-length data transmission code (or block code) for this channel with blocklength n and rate $\frac{1}{n}\log_2 M$ message bits per channel symbol (or channel use) is denoted by $\mathcal{C}_n = (n, M)$ and consists of:

1. M information messages intended for transmission.

2. An encoding function
$$f : \{1, 2, \ldots, M\} \to \mathcal{X}^n$$
yielding codewords $f(1), f(2), \cdots, f(M) \in \mathcal{X}^n$, each of length n. The set of these M codewords is called the codebook, and we also usually write $\mathcal{C}_n = \{f(1), f(2), \cdots, f(M)\}$ to list the codewords.

3. A decoding function $g : \mathcal{Y}^n \to \{1, 2, \ldots, M\}$.

The set $\{1, 2, \ldots, M\}$ is called the message set and we assume that a message W follows a uniform distribution over the set of messages: $\Pr[W = w] = \frac{1}{M}$ for all $w \in \{1, 2, \ldots, M\}$. A block diagram for the channel code is given at the beginning of this chapter; see Figure 4.1. As depicted in the diagram, to convey message W over the channel, the encoder sends its corresponding codeword $X^n = f(W)$ at the channel input. Finally, $Y^n$ is received at the channel output (according to the memoryless channel distribution $P_{Y^n|X^n}$) and the decoder yields $\hat{W} = g(Y^n)$ as the message estimate.
Definition 4.5 (Average probability of error) The average probability of error for a channel block code $\mathcal{C}_n = (n, M)$ with encoder $f(\cdot)$ and decoder $g(\cdot)$ used over a channel with transition distribution $P_{Y^n|X^n}$ is defined as
$$P_e(\mathcal{C}_n) \triangleq \frac{1}{M}\sum_{w=1}^{M} \lambda_w(\mathcal{C}_n),$$
where
$$\lambda_w(\mathcal{C}_n) \triangleq \Pr[\hat{W} \ne W \,|\, W = w] = \Pr[g(Y^n) \ne w \,|\, X^n = f(w)] = \sum_{y^n\in\mathcal{Y}^n :\, g(y^n) \ne w} P_{Y^n|X^n}(y^n|f(w))$$
is the code's conditional probability of decoding error given that message w is sent over the channel.

Note that, since we have assumed that the message W is drawn uniformly from the set of messages, we have that
$$P_e(\mathcal{C}_n) = \Pr[\hat{W} \ne W].$$
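As a toy illustration of this definition (not part of the original text), the average error probability of a very small code can be estimated by Monte-Carlo simulation over a BSC, using minimum-Hamming-distance decoding as the decoder g:

```python
import random

def simulate_Pe(codebook, eps, trials=20000, rng=random.Random(1)):
    """Monte-Carlo estimate of P_e(C_n) for a small binary block code used
    over a BSC(eps), with minimum-Hamming-distance decoding."""
    def decode(y):
        return min(range(len(codebook)),
                   key=lambda m: sum(a != b for a, b in zip(codebook[m], y)))
    errors = 0
    for _ in range(trials):
        w = rng.randrange(len(codebook))               # uniform message
        y = [x ^ (rng.random() < eps) for x in codebook[w]]
        errors += (decode(y) != w)
    return errors / trials

# Rate-1/3 repetition code: two messages mapped to codewords 000 and 111.
print(simulate_Pe([(0, 0, 0), (1, 1, 1)], eps=0.1))    # ~ 3*eps^2 - 2*eps^3 = 0.028
```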
Observation 4.6 Another, more conservative, error criterion is the so-called maximal probability of error
$$\lambda(\mathcal{C}_n) \triangleq \max_{w\in\{1,2,\cdots,M\}} \lambda_w(\mathcal{C}_n).$$
Clearly, $P_e(\mathcal{C}_n) \le \lambda(\mathcal{C}_n)$, so one might expect that $P_e(\mathcal{C}_n)$ behaves differently than $\lambda(\mathcal{C}_n)$. However, it can be shown that from a code $\mathcal{C}_n = (n, M)$ with arbitrarily small $P_e(\mathcal{C}_n)$, one can construct (by throwing away from $\mathcal{C}_n$ the half of its codewords with the largest conditional probabilities of error) a code $\mathcal{C}'_n = (n, \frac{M}{2})$ with arbitrarily small $\lambda(\mathcal{C}'_n)$ at essentially the same code rate as n grows to infinity (e.g., see [12, p. 204], [47, p. 163]).³ Hence, we will only use $P_e(\mathcal{C}_n)$ as our criterion when evaluating the goodness or reliability⁴ of channel block codes.

³Note that this fact holds for single-user channels with known transition distributions (as given in Definition 4.1) that remain constant throughout the transmission of a codeword. It does not, however, hold for single-user channels whose statistical descriptions may vary in an unknown manner from symbol to symbol during a codeword transmission; such channels, which include the class of arbitrarily varying channels (see [13, Chapter 2, Section 6]), will not be considered in this textbook.

⁴We interchangeably use the terms goodness and reliability for a block code to mean that its (average) probability of error asymptotically vanishes with increasing blocklength.

Our target is to find a good channel block code (or to show the existence of a good channel block code). From the perspective of the (weak) law of large numbers, a good choice is to draw the code's codewords based on the jointly typical set between the input and the output of the channel, since all the probability mass is ultimately placed on the jointly typical set. A decoding failure then occurs only when the channel input-output pair does not lie in the jointly typical set, which implies that the probability of decoding error is ultimately small. We next define the jointly typical set.
Definition 4.7 (Jointly typical set) The set $\mathcal{F}_n(\delta)$ of jointly δ-typical n-tuple pairs $(x^n, y^n)$ with respect to the memoryless distribution $P_{X^n,Y^n}(x^n, y^n) = \prod_{i=1}^n P_{X,Y}(x_i, y_i)$ is defined by
$$\mathcal{F}_n(\delta) \triangleq \Big\{ (x^n, y^n) \in \mathcal{X}^n \times \mathcal{Y}^n : \Big|{-\frac{1}{n}\log_2 P_{X^n}(x^n)} - H(X)\Big| < \delta,$$
$$\Big|{-\frac{1}{n}\log_2 P_{Y^n}(y^n)} - H(Y)\Big| < \delta, \ \text{ and } \ \Big|{-\frac{1}{n}\log_2 P_{X^n,Y^n}(x^n, y^n)} - H(X, Y)\Big| < \delta \Big\}.$$
In short, a pair $(x^n, y^n)$ generated by independently drawing n times under $P_{X,Y}$ is jointly δ-typical if its joint and marginal empirical entropies are respectively δ-close to the true joint and marginal entropies.
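As a concrete illustration of this definition (not part of the original text), the membership test for $\mathcal{F}_n(\delta)$ can be written directly from the three inequalities above; the joint distribution used here is an arbitrary toy example.

```python
from math import log2

def entropy(pmf):
    return -sum(p * log2(p) for p in pmf if p > 0)

def jointly_typical(xs, ys, Pxy, delta):
    """Check whether the pair of sequences (xs, ys) is jointly delta-typical
    with respect to the single-letter joint pmf Pxy = {(x, y): probability}."""
    n = len(xs)
    Px, Py = {}, {}
    for (x, y), p in Pxy.items():
        Px[x] = Px.get(x, 0) + p
        Py[y] = Py.get(y, 0) + p
    HX, HY, HXY = entropy(Px.values()), entropy(Py.values()), entropy(Pxy.values())
    ex  = -sum(log2(Px[x]) for x in xs) / n            # empirical -(1/n) log P(x^n)
    ey  = -sum(log2(Py[y]) for y in ys) / n
    exy = -sum(log2(Pxy[(x, y)]) for x, y in zip(xs, ys)) / n
    return abs(ex - HX) < delta and abs(ey - HY) < delta and abs(exy - HXY) < delta

Pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}   # toy joint pmf
print(jointly_typical([0, 1, 0, 1], [0, 1, 1, 1], Pxy, delta=0.5))
```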
With the above definition, we directly obtain the joint AEP theorem.

Theorem 4.8 (Joint AEP) If $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n), \ldots$ are i.i.d., i.e., $\{(X_i, Y_i)\}_{i=1}^\infty$ is a dependent pair of DMSs, then
$$-\frac{1}{n}\log_2 P_{X^n}(X_1, X_2, \ldots, X_n) \to H(X) \quad \text{in probability},$$
$$-\frac{1}{n}\log_2 P_{Y^n}(Y_1, Y_2, \ldots, Y_n) \to H(Y) \quad \text{in probability},$$
and
$$-\frac{1}{n}\log_2 P_{X^n,Y^n}((X_1, Y_1), \ldots, (X_n, Y_n)) \to H(X, Y) \quad \text{in probability}$$
as $n \to \infty$.

Proof: By the weak law of large numbers, we have the desired result. □
Theorem 4.9 (Shannon-McMillan theorem for pairs) Given a dependent pair of DMSs with joint entropy $H(X, Y)$ and any δ greater than zero, we can choose n big enough so that the jointly δ-typical set satisfies:

1. $P_{X^n,Y^n}(\mathcal{F}_n^c(\delta)) < \delta$ for sufficiently large n.

2. The number of elements in $\mathcal{F}_n(\delta)$ is at least $(1-\delta)\,2^{n(H(X,Y)-\delta)}$ for sufficiently large n, and at most $2^{n(H(X,Y)+\delta)}$ for every n.

3. If $(x^n, y^n) \in \mathcal{F}_n(\delta)$, its probability of occurrence satisfies
$$2^{-n(H(X,Y)+\delta)} < P_{X^n,Y^n}(x^n, y^n) < 2^{-n(H(X,Y)-\delta)}.$$

Proof: The proof is quite similar to that of the Shannon-McMillan theorem for a single memoryless source presented in the previous chapter; we hence leave it as an exercise. □
We herein arrive at the main result of this chapter: Shannon's channel coding theorem for DMCs. It basically states that a quantity C, termed the channel capacity and defined as the maximum of the channel's mutual information over the set of its input distributions (see below), is the supremum of all achievable channel block code rates; i.e., it is the supremum of all rates for which there exists a sequence of block codes for the channel with asymptotically decaying (as the blocklength grows to infinity) probability of decoding error. In other words, for a given DMC, its capacity C, which can be calculated by solely using the channel's transition matrix Q, constitutes the largest rate at which one can reliably transmit information via a block code over this channel. Thus, it is possible to communicate reliably over an inherently noisy DMC at a fixed rate (without decreasing it) as long as this rate is below C and the code's blocklength is allowed to be large.

Theorem 4.10 (Shannon's channel coding theorem) Consider a DMC with finite input alphabet $\mathcal{X}$, finite output alphabet $\mathcal{Y}$ and transition distribution $P_{Y|X}(y|x)$, $x \in \mathcal{X}$ and $y \in \mathcal{Y}$. Define the channel capacity⁵
$$C \triangleq \max_{P_X} I(X; Y) = \max_{P_X} I(P_X, P_{Y|X}),$$
where the maximum is taken over all input distributions $P_X$. Then the following hold.

• Forward part (achievability): For any $0 < \varepsilon < 1$, there exist $\gamma > 0$ and a sequence of data transmission block codes $\{\mathcal{C}_n = (n, M_n)\}_{n=1}^\infty$ with
$$\liminf_{n\to\infty} \frac{1}{n}\log_2 M_n \ge C - \gamma$$
and
$$P_e(\mathcal{C}_n) < \varepsilon \quad \text{for sufficiently large } n,$$
where $P_e(\mathcal{C}_n)$ denotes the (average) probability of error for block code $\mathcal{C}_n$.

• Converse part: Any sequence of data transmission block codes $\{\mathcal{C}_n = (n, M_n)\}_{n=1}^\infty$ with
$$\liminf_{n\to\infty} \frac{1}{n}\log_2 M_n > C$$
satisfies
$$\liminf_{n\to\infty} P_e(\mathcal{C}_n) > 0;$$
i.e., the code's probability of error is bounded away from zero for all n sufficiently large.

⁵First note that the mutual information $I(X; Y)$ is actually a function of the input statistics $P_X$ and the channel statistics $P_{Y|X}$. Hence, we may write it as
$$I(P_X, P_{Y|X}) = \sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}} P_X(x) P_{Y|X}(y|x)\log_2\frac{P_{Y|X}(y|x)}{\sum_{x'\in\mathcal{X}} P_X(x') P_{Y|X}(y|x')}.$$
Such an expression is more suitable for calculating the channel capacity. Note also that the channel capacity C is well-defined since, for a fixed $P_{Y|X}$, $I(P_X, P_{Y|X})$ is concave and continuous in $P_X$ (with respect to both the variational distance and the Euclidean distance (i.e., $L_2$-distance) [47, Chapter 2]), and since the set of all input distributions $P_X$ is a compact (closed and bounded) subset of $\mathbb{R}^{|\mathcal{X}|}$ due to the finiteness of $\mathcal{X}$. Hence there exists a $P_X$ that achieves the supremum of the mutual information, and the maximum is attainable.
Proof of the forward part: It suffices to prove the existence of a good block code sequence (satisfying the rate condition, i.e., $\liminf_{n\to\infty}(1/n)\log_2 M_n \ge C - \gamma$ for some $\gamma > 0$) whose average error probability is ultimately less than $\varepsilon$.

We will use Shannon's original random coding proof technique, in which the good block code sequence is not deterministically constructed; instead, its existence is implicitly proven by showing that, for a class (ensemble) of block code sequences $\{\mathcal{C}_n\}_{n=1}^\infty$ and a code-selecting distribution $\Pr[\mathcal{C}_n]$ over these block codes, the expectation of the average error probability, evaluated under the code-selecting distribution, can be made smaller than $\varepsilon$ for n sufficiently large:
$$E_{\mathcal{C}_n}[P_e(\mathcal{C}_n)] = \sum_{\mathcal{C}_n}\Pr[\mathcal{C}_n]\,P_e(\mathcal{C}_n) \to 0 \quad \text{as } n \to \infty.$$
Hence, there must exist at least one desired good code sequence $\{\mathcal{C}_n^*\}_{n=1}^\infty$ among them (with $P_e(\mathcal{C}_n^*) \to 0$ as $n \to \infty$).
Fix $\varepsilon \in (0, 1)$ and some $\gamma \in (0, 4\varepsilon)$. Observe that there exists $N_0$ such that for $n > N_0$, we can choose an integer $M_n$ with
$$C - \frac{\gamma}{2} \ge \frac{1}{n}\log_2 M_n > C - \gamma.$$
(Since we are only concerned with the case of sufficiently large n, it suffices to consider only those n satisfying $n > N_0$ and to ignore those $n \le N_0$.) Define $\delta \triangleq \gamma/8$. Let $P_X^*$ be the probability distribution achieving the channel capacity:
$$C \triangleq \max_{P_X} I(P_X, P_{Y|X}) = I(P_X^*, P_{Y|X}).$$
Denote by $P_{\tilde{Y}^n}$ the channel output distribution due to the channel input product distribution $P_{\tilde{X}^n}$ (with $P_{\tilde{X}^n}(x^n) = \prod_{i=1}^n P_X^*(x_i)$), i.e.,
$$P_{\tilde{Y}^n}(y^n) = \sum_{x^n\in\mathcal{X}^n} P_{\tilde{X}^n,\tilde{Y}^n}(x^n, y^n),$$
where
$$P_{\tilde{X}^n,\tilde{Y}^n}(x^n, y^n) \triangleq P_{\tilde{X}^n}(x^n)\,P_{Y^n|X^n}(y^n|x^n)$$
for all $x^n \in \mathcal{X}^n$ and $y^n \in \mathcal{Y}^n$. Note that since $P_{\tilde{X}^n}(x^n) = \prod_{i=1}^n P_X^*(x_i)$ and the channel is memoryless, the resulting joint input-output process $\{(\tilde{X}_i, \tilde{Y}_i)\}_{i=1}^\infty$ is also memoryless, with
$$P_{\tilde{X}^n,\tilde{Y}^n}(x^n, y^n) = \prod_{i=1}^n P_{\tilde{X},\tilde{Y}}(x_i, y_i)$$
and
$$P_{\tilde{X},\tilde{Y}}(x, y) = P_X^*(x)\,P_{Y|X}(y|x) \quad \text{for } x \in \mathcal{X},\ y \in \mathcal{Y}.$$
We next present the proof in three steps.
Step 1: Code construction.
For any blocklength n, independently select $M_n$ channel inputs with replacement⁶ from $\mathcal{X}^n$ according to the distribution $P_{\tilde{X}^n}(x^n)$. For the selected $M_n$ channel inputs, yielding codebook $\mathcal{C}_n \triangleq \{c_1, c_2, \ldots, c_{M_n}\}$, define the encoder $f_n(\cdot)$ and decoder $g_n(\cdot)$, respectively, as follows:
$$f_n(m) = c_m \quad \text{for } 1 \le m \le M_n,$$
and
$$g_n(y^n) = \begin{cases} m, & \text{if } c_m \text{ is the only codeword in } \mathcal{C}_n \text{ satisfying } (c_m, y^n) \in \mathcal{F}_n(\delta);\\ \text{any one in } \{1, 2, \ldots, M_n\}, & \text{otherwise},\end{cases}$$
where $\mathcal{F}_n(\delta)$ is defined in Definition 4.7 with respect to the distribution $P_{\tilde{X}^n,\tilde{Y}^n}$. (We evidently assume that the codebook $\mathcal{C}_n$ and the channel distribution $P_{Y|X}$ are known at both the encoder and the decoder.) Hence, the code $\mathcal{C}_n$ operates as follows. A message W is chosen according to the uniform distribution from the set of messages.

⁶Here, the channel inputs are selected with replacement; that means it is possible and acceptable that all of the selected $M_n$ channel inputs are identical.
The encoder $f_n$ then transmits the W-th codeword $c_W$ in $\mathcal{C}_n$ over the channel. Then $Y^n$ is received at the channel output and the decoder guesses the sent message via $\hat{W} = g_n(Y^n)$.

Note that there is a total of $|\mathcal{X}|^{n M_n}$ possible randomly generated codebooks $\mathcal{C}_n$, and the probability of selecting each codebook is given by
$$\Pr[\mathcal{C}_n] = \prod_{m=1}^{M_n} P_{\tilde{X}^n}(c_m).$$
For each (randomly generated) data transmission code (
n
, the conditional
probability of error given that message m was sent,
m
( (
n
), can be upper
bounded by:

m
( (
n
)

y
n
Y
n
: (cm,y
n
)Fn()
P
Y
n
|X
n(y
n
[c
m
)
+
Mn

=1
m

=m

y
n
Y
n
: (c
m
,y
n
)Fn()
P
Y
n
|X
n(y
n
[c
m
), (4.3.1)
where the rst term in (4.3.1) considers the case that the received channel
output y
n
is not jointly -typical with c
m
, (and hence, the decoding rule
g
n
() would possibly result in a wrong guess), and the second term in
(4.3.1) reects the situation when y
n
is jointly -typical with not only the
transmitted codeword c
m
, but also with another codeword c
m
(which may
cause a decoding error).
By taking expectation in (4.3.1) with respect to the m
th
codeword-
selecting distribution P

X
n
(c
m
), we obtain

cmX
n
P

X
n
(c
m
)
m
( (
n
)

cmX
n

y
n
Fn(|cm)
P

X
n
(c
m
)P
Y
n
|X
n(y
n
[c
m
)
+

cmX
n
Mn

=1
m

=m

y
n
Fn(|c
m
)
P

X
n
(c
m
)P
Y
n
|X
n(y
n
[c
m
)
= P

X
n
,

Y
n
(T
c
n
())
+
Mn

=1
m

=m

cmX
n

y
n
Fn(|c
m
)
P

X
n
,

Y
n
(c
m
, y
n
),
(4.3.2)
where
T
n
([x
n
) y
n

n
: (x
n
, y
n
) T
n
() .
Step 3: Average error probability.
We can now analyze the expectation of the average error probability, $E_{\mathcal{C}_n}[P_e(\mathcal{C}_n)]$, over the ensemble of all codebooks $\mathcal{C}_n$ generated at random according to $\Pr[\mathcal{C}_n]$, and show that it asymptotically vanishes as n grows without bound. Writing $\sum_{\sim m}$ as shorthand for the sum over all codewords $c_i \in \mathcal{X}^n$ with $i \ne m$, weighted by $\prod_{i\ne m} P_{\tilde{X}^n}(c_i)$, we obtain the following series of (in)equalities:
$$E_{\mathcal{C}_n}[P_e(\mathcal{C}_n)] = \sum_{\mathcal{C}_n}\Pr[\mathcal{C}_n]\,P_e(\mathcal{C}_n) = \sum_{c_1\in\mathcal{X}^n}\cdots\sum_{c_{M_n}\in\mathcal{X}^n} P_{\tilde{X}^n}(c_1)\cdots P_{\tilde{X}^n}(c_{M_n})\left[\frac{1}{M_n}\sum_{m=1}^{M_n}\lambda_m(\mathcal{C}_n)\right]$$
$$= \frac{1}{M_n}\sum_{m=1}^{M_n}\ \sum_{\sim m}\left[\sum_{c_m\in\mathcal{X}^n} P_{\tilde{X}^n}(c_m)\,\lambda_m(\mathcal{C}_n)\right]$$
$$\le \frac{1}{M_n}\sum_{m=1}^{M_n}\ \sum_{\sim m}\left[P_{\tilde{X}^n,\tilde{Y}^n}(\mathcal{F}_n^c(\delta)) + \sum_{\substack{m'=1\\ m'\ne m}}^{M_n}\ \sum_{c_m\in\mathcal{X}^n}\ \sum_{y^n\in\mathcal{F}_n(\delta|c_{m'})} P_{\tilde{X}^n,\tilde{Y}^n}(c_m, y^n)\right] \qquad (4.3.3)$$
$$= P_{\tilde{X}^n,\tilde{Y}^n}(\mathcal{F}_n^c(\delta)) + \frac{1}{M_n}\sum_{m=1}^{M_n}\ \sum_{\substack{m'=1\\ m'\ne m}}^{M_n}\ \sum_{\sim m}\ \sum_{c_m\in\mathcal{X}^n}\ \sum_{y^n\in\mathcal{F}_n(\delta|c_{m'})} P_{\tilde{X}^n,\tilde{Y}^n}(c_m, y^n),$$
where (4.3.3) follows from (4.3.2), and the last step holds since $P_{\tilde{X}^n,\tilde{Y}^n}(\mathcal{F}_n^c(\delta))$ is a constant independent of $c_1, \ldots, c_{M_n}$ and m. Observe that for $n > N_0$,
$$\sum_{\substack{m'=1\\ m'\ne m}}^{M_n}\ \sum_{\sim m}\ \sum_{c_m\in\mathcal{X}^n}\ \sum_{y^n\in\mathcal{F}_n(\delta|c_{m'})} P_{\tilde{X}^n,\tilde{Y}^n}(c_m, y^n) = \sum_{\substack{m'=1\\ m'\ne m}}^{M_n}\ \sum_{c_m\in\mathcal{X}^n}\ \sum_{c_{m'}\in\mathcal{X}^n}\ \sum_{y^n\in\mathcal{F}_n(\delta|c_{m'})} P_{\tilde{X}^n}(c_{m'})\,P_{\tilde{X}^n,\tilde{Y}^n}(c_m, y^n)$$
$$= \sum_{\substack{m'=1\\ m'\ne m}}^{M_n}\ \sum_{c_{m'}\in\mathcal{X}^n}\ \sum_{y^n\in\mathcal{F}_n(\delta|c_{m'})} P_{\tilde{X}^n}(c_{m'})\left[\sum_{c_m\in\mathcal{X}^n} P_{\tilde{X}^n,\tilde{Y}^n}(c_m, y^n)\right] = \sum_{\substack{m'=1\\ m'\ne m}}^{M_n}\ \sum_{c_{m'}\in\mathcal{X}^n}\ \sum_{y^n\in\mathcal{F}_n(\delta|c_{m'})} P_{\tilde{X}^n}(c_{m'})\,P_{\tilde{Y}^n}(y^n)$$
$$= \sum_{\substack{m'=1\\ m'\ne m}}^{M_n}\ \sum_{(c_{m'}, y^n)\in\mathcal{F}_n(\delta)} P_{\tilde{X}^n}(c_{m'})\,P_{\tilde{Y}^n}(y^n) \le \sum_{\substack{m'=1\\ m'\ne m}}^{M_n} |\mathcal{F}_n(\delta)|\;2^{-n(H(\tilde{X})-\delta)}\,2^{-n(H(\tilde{Y})-\delta)}$$
$$\le \sum_{\substack{m'=1\\ m'\ne m}}^{M_n} 2^{n(H(\tilde{X},\tilde{Y})+\delta)}\,2^{-n(H(\tilde{X})-\delta)}\,2^{-n(H(\tilde{Y})-\delta)} = (M_n - 1)\,2^{n(H(\tilde{X},\tilde{Y})+\delta)}\,2^{-n(H(\tilde{X})-\delta)}\,2^{-n(H(\tilde{Y})-\delta)}$$
$$\le M_n\,2^{n(H(\tilde{X},\tilde{Y})+\delta)}\,2^{-n(H(\tilde{X})-\delta)}\,2^{-n(H(\tilde{Y})-\delta)} \le 2^{n(C-4\delta)}\,2^{-n(I(\tilde{X};\tilde{Y})-3\delta)} = 2^{-n\delta},$$
where the first inequality follows from the definition of the jointly typical
set $\mathcal{F}_n(\delta)$, the second inequality holds by the Shannon-McMillan theorem for pairs (Theorem 4.9), and the last inequality follows since $C = I(\tilde{X}; \tilde{Y})$ by definition of $\tilde{X}$ and $\tilde{Y}$, and since $(1/n)\log_2 M_n \le C - (\gamma/2) = C - 4\delta$. Consequently,
$$E_{\mathcal{C}_n}[P_e(\mathcal{C}_n)] \le P_{\tilde{X}^n,\tilde{Y}^n}(\mathcal{F}_n^c(\delta)) + 2^{-n\delta},$$
which for sufficiently large n (and $n > N_0$) can be made smaller than $2\delta = \gamma/4 < \varepsilon$ by the Shannon-McMillan theorem for pairs. □
Before proving the converse part of the channel coding theorem, let us recall Fano's inequality in a channel coding context. Consider an $(n, M_n)$ channel block code $\mathcal{C}_n$ with encoding and decoding functions given by
$$f_n : \{1, 2, \cdots, M_n\} \to \mathcal{X}^n$$
and
$$g_n : \mathcal{Y}^n \to \{1, 2, \cdots, M_n\},$$
respectively. Let message W, which is uniformly distributed over the set of messages $\{1, 2, \cdots, M_n\}$, be sent via codeword $X^n(W) = f_n(W)$ over the DMC, and let $Y^n$ be received at the channel output. At the receiver, the decoder estimates the sent message via $\hat{W} = g_n(Y^n)$, and the probability of estimation error is given by the code's average error probability:
$$\Pr[W \ne \hat{W}] = P_e(\mathcal{C}_n),$$
since W is uniformly distributed. Then Fano's inequality (2.5.2) yields
$$H(W|Y^n) \le 1 + P_e(\mathcal{C}_n)\log_2(M_n - 1) \le 1 + P_e(\mathcal{C}_n)\log_2 M_n. \qquad (4.3.4)$$
We next proceed with the proof of the converse part.
Proof of the converse part: For any $(n, M_n)$ block channel code $\mathcal{C}_n$ as described above, $W \to X^n \to Y^n$ form a Markov chain; we thus obtain by the data processing inequality that
$$I(W; Y^n) \le I(X^n; Y^n). \qquad (4.3.5)$$
We can also upper bound $I(X^n; Y^n)$ in terms of the channel capacity C as follows:
$$I(X^n; Y^n) \le \max_{P_{X^n}} I(X^n; Y^n) \le \max_{P_{X^n}}\sum_{i=1}^n I(X_i; Y_i) \quad \text{(by Theorem 2.21)}$$
$$\le \sum_{i=1}^n \max_{P_{X^n}} I(X_i; Y_i) = \sum_{i=1}^n \max_{P_{X_i}} I(X_i; Y_i) = nC. \qquad (4.3.6)$$
Consequently, code $\mathcal{C}_n$ satisfies the following:
$$\log_2 M_n = H(W) \quad \text{(since W is uniformly distributed)}$$
$$= H(W|Y^n) + I(W; Y^n)$$
$$\le H(W|Y^n) + I(X^n; Y^n) \quad \text{(by (4.3.5))}$$
$$\le H(W|Y^n) + nC \quad \text{(by (4.3.6))}$$
$$\le 1 + P_e(\mathcal{C}_n)\log_2 M_n + nC. \quad \text{(by (4.3.4))}$$
This implies that
$$P_e(\mathcal{C}_n) \ge 1 - \frac{C}{(1/n)\log_2 M_n} - \frac{1}{\log_2 M_n}.$$
So if $\liminf_{n\to\infty}(1/n)\log_2 M_n > C$, then there exist $\nu > 0$ and an integer N such that for $n \ge N$,
$$\frac{1}{n}\log_2 M_n > C + \nu.$$
Hence, for $n \ge N_0 \triangleq \max\{N, 2/\nu\}$,
$$P_e(\mathcal{C}_n) > 1 - \frac{C}{C+\nu} - \frac{1}{n(C+\nu)} \ge \frac{\nu}{2(C+\nu)} > 0;$$
i.e., $P_e(\mathcal{C}_n)$ is bounded away from zero for n sufficiently large. □
The results of the above channel coding theorem are illustrated in Figure 4.5, where $R \triangleq \liminf_{n\to\infty}(1/n)\log_2 M_n$ (measured in message bits/channel use) is usually called the ultimate (or asymptotic) coding rate of channel block codes. As indicated in the figure, the ultimate rate of any good block code for the DMC must be smaller than its capacity C. Conversely, any block code with (ultimate) rate greater than C will have its probability of error bounded away from zero. Thus, for a DMC, its capacity C is the supremum of all achievable channel block coding rates; i.e., it is the supremum of all rates for which there exists a sequence of channel block codes with asymptotically vanishing (as the blocklength goes to infinity) probability of error.

[Figure 4.5: Ultimate channel coding rate R versus channel capacity C for a discrete memoryless channel: for R < C, $\lim_{n\to\infty} P_e = 0$ for the best channel block code, while for R > C, $\limsup_{n\to\infty} P_e > 0$ for all channel block codes.]
Shannon's channel coding theorem, established in 1948 [39], provides the ultimate limit for reliable communication over a noisy channel. However, it does not provide an explicit efficient construction for good codes, since searching for a good code within the ensemble of randomly generated codes is prohibitively complex, as the ensemble's size grows double-exponentially with the blocklength (see Step 1 of the proof of the forward part). It thus spurred the entire area of coding theory, which flourished over the last 60 years with the aim of constructing powerful error-correcting codes operating close to the capacity limit. Particular advances were made for the class of linear codes (also known as group codes), whose rich⁷ yet elegantly simple algebraic structures made them amenable to efficient, practically implementable encoding and decoding. Examples of such codes include Hamming codes, Golay codes, BCH and Reed-Solomon codes and convolutional codes. In 1993, the so-called Turbo codes were introduced by Berrou et al. [3, 4] and shown experimentally to perform close to the channel capacity limit for the class of memoryless channels. Similar near-capacity-achieving linear codes were later established with the re-discovery of Gallager's low-density parity-check codes [16, 17, 29, 30]. Many of these codes are used with increased sophistication in today's ubiquitous communication, information and multimedia technologies. For detailed studies on coding theory, see the following texts: [8, 10, 23, 28, 31, 37, 46].

⁷Indeed, there exist linear codes that can achieve the capacity of memoryless channels with additive noise (e.g., see [13, p. 114]). Such channels include the BSC and the q-ary symmetric channel.
4.4 Calculating channel capacity

Given a DMC with finite input alphabet $\mathcal{X}$, finite output alphabet $\mathcal{Y}$ and channel transition matrix $Q = [p_{x,y}]$ of size $|\mathcal{X}| \times |\mathcal{Y}|$, where $p_{x,y} \triangleq P_{Y|X}(y|x)$ for $x \in \mathcal{X}$ and $y \in \mathcal{Y}$, we would like to calculate
$$C \triangleq \max_{P_X} I(X; Y),$$
where the maximization (which is well-defined) is carried out over the set of input distributions $P_X$, and $I(X; Y)$ is the mutual information between the channel's input and output.
Note that C can be determined numerically via non-linear optimization techniques, such as the iterative algorithms developed by Arimoto [1] and Blahut [7, 9]; see also [14] and [47, Chap. 9]. In general, there is no closed-form (single-letter) analytical expression for C. However, for many simplified channels, it is possible to determine C analytically under some symmetry properties of their channel transition matrix.
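For reference, a bare-bones version of the Blahut-Arimoto iteration (alternating between the input distribution and the induced output distribution) can be sketched as follows; this is only an illustrative implementation of the general idea, with a fixed iteration count rather than a principled stopping rule, and is not taken from the text.

```python
import numpy as np

def blahut_arimoto(Q, iterations=200):
    """Estimate the capacity (in bits) of a DMC with transition matrix Q,
    where Q[x][y] = P(y|x), via the Blahut-Arimoto alternating optimization."""
    Q = np.asarray(Q, dtype=float)
    p = np.full(Q.shape[0], 1.0 / Q.shape[0])       # start from the uniform input
    for _ in range(iterations):
        q = p @ Q                                    # output distribution P(y)
        ratio = np.where(Q > 0, Q / q, 1.0)
        # Multiplicative update: p(x) <- p(x) * exp( sum_y P(y|x) ln[P(y|x)/P(y)] )
        D = np.exp(np.sum(np.where(Q > 0, Q * np.log(ratio), 0.0), axis=1))
        p = p * D
        p /= p.sum()                                 # re-normalize the input law
    q = p @ Q
    ratio = np.where(Q > 0, Q / q, 1.0)
    return np.sum(p[:, None] * np.where(Q > 0, Q * np.log2(ratio), 0.0))

bsc = [[0.9, 0.1], [0.1, 0.9]]
print(blahut_arimoto(bsc))   # close to 1 - h_b(0.1) ~ 0.531 bits
```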
4.4.1 Symmetric, weakly-symmetric and quasi-symmetric channels

Definition 4.11 A DMC with finite input alphabet $\mathcal{X}$, finite output alphabet $\mathcal{Y}$ and channel transition matrix $Q = [p_{x,y}]$ of size $|\mathcal{X}| \times |\mathcal{Y}|$ is said to be symmetric if the rows of Q are permutations of each other and the columns of Q are permutations of each other. The channel is said to be weakly-symmetric if the rows of Q are permutations of each other and all the column sums in Q are equal.

It directly follows from the definition that symmetry implies weak-symmetry. Examples of symmetric DMCs include the BSC, the q-ary symmetric channel and the following ternary channel with $\mathcal{X} = \mathcal{Y} = \{0, 1, 2\}$ and transition matrix
$$Q = \begin{bmatrix} P_{Y|X}(0|0) & P_{Y|X}(1|0) & P_{Y|X}(2|0)\\ P_{Y|X}(0|1) & P_{Y|X}(1|1) & P_{Y|X}(2|1)\\ P_{Y|X}(0|2) & P_{Y|X}(1|2) & P_{Y|X}(2|2)\end{bmatrix} = \begin{bmatrix} 0.4 & 0.1 & 0.5\\ 0.5 & 0.4 & 0.1\\ 0.1 & 0.5 & 0.4\end{bmatrix}.$$
The following DMC with $|\mathcal{X}| = |\mathcal{Y}| = 4$ and
$$Q = \begin{bmatrix} 0.5 & 0.25 & 0.25 & 0\\ 0.5 & 0.25 & 0.25 & 0\\ 0 & 0.25 & 0.25 & 0.5\\ 0 & 0.25 & 0.25 & 0.5\end{bmatrix} \qquad (4.4.1)$$
is weakly-symmetric (but not symmetric). Noting that all of the above channels involve square transition matrices, we emphasize that Q can be rectangular while satisfying the symmetry or weak-symmetry properties. For example, the DMC with $|\mathcal{X}| = 2$, $|\mathcal{Y}| = 4$ and
$$Q = \begin{bmatrix} \frac{1-\varepsilon}{2} & \frac{\varepsilon}{2} & \frac{\varepsilon}{2} & \frac{1-\varepsilon}{2}\\[4pt] \frac{\varepsilon}{2} & \frac{1-\varepsilon}{2} & \frac{1-\varepsilon}{2} & \frac{\varepsilon}{2}\end{bmatrix} \qquad (4.4.2)$$
is symmetric (where $\varepsilon \in [0, 1]$), while the DMC with $|\mathcal{X}| = 2$, $|\mathcal{Y}| = 3$ and
$$Q = \begin{bmatrix} \frac{1}{3} & \frac{1}{6} & \frac{1}{2}\\[4pt] \frac{1}{3} & \frac{1}{2} & \frac{1}{6}\end{bmatrix}$$
is weakly-symmetric.
Lemma 4.12 The capacity of a weakly-symmetric channel Q is achieved by a uniform input distribution and is given by
$$C = \log_2 |\mathcal{Y}| - H(q_1, q_2, \cdots, q_{|\mathcal{Y}|}), \qquad (4.4.3)$$
where $(q_1, q_2, \cdots, q_{|\mathcal{Y}|})$ denotes any row of Q and
$$H(q_1, q_2, \cdots, q_{|\mathcal{Y}|}) \triangleq -\sum_{i=1}^{|\mathcal{Y}|} q_i \log_2 q_i$$
is the row entropy.
Proof: The mutual information between the channel's input and output is given by
$$I(X; Y) = H(Y) - H(Y|X) = H(Y) - \sum_{x\in\mathcal{X}} P_X(x) H(Y|X = x),$$
where $H(Y|X = x) = -\sum_{y\in\mathcal{Y}} P_{Y|X}(y|x)\log_2 P_{Y|X}(y|x) = -\sum_{y\in\mathcal{Y}} p_{x,y}\log_2 p_{x,y}$.

Noting that every row of Q is a permutation of every other row, we obtain that $H(Y|X = x)$ is independent of x and can be written as
$$H(Y|X = x) = H(q_1, q_2, \cdots, q_{|\mathcal{Y}|}),$$
where $(q_1, q_2, \cdots, q_{|\mathcal{Y}|})$ is any row of Q. Thus
$$H(Y|X) = \sum_{x\in\mathcal{X}} P_X(x) H(q_1, q_2, \cdots, q_{|\mathcal{Y}|}) = H(q_1, q_2, \cdots, q_{|\mathcal{Y}|})\left(\sum_{x\in\mathcal{X}} P_X(x)\right) = H(q_1, q_2, \cdots, q_{|\mathcal{Y}|}).$$
Thus
$$I(X; Y) = H(Y) - H(q_1, q_2, \cdots, q_{|\mathcal{Y}|}) \le \log_2 |\mathcal{Y}| - H(q_1, q_2, \cdots, q_{|\mathcal{Y}|}),$$
with equality achieved iff Y is uniformly distributed over $\mathcal{Y}$. We next show that choosing a uniform input distribution, $P_X(x) = \frac{1}{|\mathcal{X}|}$ for all $x \in \mathcal{X}$, yields a uniform output distribution, hence maximizing the mutual information. Indeed, under a uniform input distribution, we obtain that for any $y \in \mathcal{Y}$,
$$P_Y(y) = \sum_{x\in\mathcal{X}} P_X(x) P_{Y|X}(y|x) = \frac{1}{|\mathcal{X}|}\sum_{x\in\mathcal{X}} p_{x,y} = \frac{A}{|\mathcal{X}|},$$
where $A \triangleq \sum_{x\in\mathcal{X}} p_{x,y}$ is a constant given by the sum of the entries in any column of Q, since by the weak-symmetry property all column sums in Q are identical. Noting that $\sum_{y\in\mathcal{Y}} P_Y(y) = 1$ yields
$$\sum_{y\in\mathcal{Y}} \frac{A}{|\mathcal{X}|} = 1$$
and thus
$$A = \frac{|\mathcal{X}|}{|\mathcal{Y}|}. \qquad (4.4.4)$$
Thus
$$P_Y(y) = \frac{A}{|\mathcal{X}|} = \frac{|\mathcal{X}|}{|\mathcal{Y}|}\cdot\frac{1}{|\mathcal{X}|} = \frac{1}{|\mathcal{Y}|}$$
for any $y \in \mathcal{Y}$; thus the uniform input distribution induces a uniform output distribution and achieves the channel capacity as given by (4.4.3). □
Observation 4.13 Note that if the weakly-symmetric channel has a square (i.e., with $|\mathcal{X}| = |\mathcal{Y}|$) transition matrix Q, then Q is a doubly-stochastic matrix; i.e., both its row sums and its column sums are equal to 1. Note, however, that having a square transition matrix does not necessarily make a weakly-symmetric channel symmetric; e.g., see (4.4.1).

Example 4.14 (Capacity of the BSC) Since the BSC with crossover probability (or bit error rate) $\varepsilon$ is symmetric, we directly obtain from Lemma 4.12 that its capacity is achieved by a uniform input distribution and is given by
$$C = \log_2(2) - H(1-\varepsilon, \varepsilon) = 1 - h_b(\varepsilon), \qquad (4.4.5)$$
where $h_b(\cdot)$ is the binary entropy function.
Example 4.15 (Capacity of the q-ary symmetric channel) Similarly, the q-ary symmetric channel with symbol error rate $\varepsilon$ described in (4.2.8) is symmetric; hence, by Lemma 4.12, its capacity is given by
$$C = \log_2 q - H\!\left(1-\varepsilon, \frac{\varepsilon}{q-1}, \cdots, \frac{\varepsilon}{q-1}\right) = \log_2 q + \varepsilon\log_2\frac{\varepsilon}{q-1} + (1-\varepsilon)\log_2(1-\varepsilon).$$
Note that when q = 2, the channel capacity is equal to that of the BSC, as expected. Furthermore, when $\varepsilon = 0$, the channel reduces to the identity (noiseless) q-ary channel and its capacity is given by $C = \log_2 q$.
We next note that one can further weaken the weak-symmetry property and
dene a class of quasi-symmetric channels for which the uniform input distri-
bution still achieves capacity and yields a simple closed-form formula for capacity.
Denition 4.16 A DMC with nite input alphabet A, nite output alphabet
and channel transition matrix Q = [p
x,y
] of size [A[ [[ is said to be quasi-
symmetric
8
if Q can be partitioned along its columns into m weakly-symmetric
sub-matrices Q
1
, Q
2
, , Q
m
for some integer m 1, where each Q
i
sub-matrix
has size [A[ [
i
[ for i = 1, 2, , m with
1

m
= and
i

j
=
i ,= j, i, j = 1, 2, , m.
Hence, quasi-symmetry is our weakest symmetry notion, since a weakly-
symmetric channel is clearly quasi-symmetric (just set m = 1 in the above
denition); we thus have: symmetry = weak-symmetry = quasi-symmetry.
Lemma 4.17 The capacity of a quasi-symmetric channel Q as dened above is
achieved by a uniform input distribution and is given by
C =
m

i=1
a
i
C
i
(4.4.6)
where
a
i

yY
i
p
x,y
= sum of any row in Q
i
, i = 1, , m,
and
C
i
= log
2
[
i
[ H
_
any row in the matrix
1
a
i
Q
i
_
, i = 1, , m
is the capacity of the ith weakly-symmetric sub-channel whose transition ma-
trix is obtained by multiplying each entry of Q
i
by
1
a
i
(this normalization renders
sub-matrix Q
i
into a stochastic matrix and hence a channel transition matrix).
8
This notion of quasi-symmetry is slightly more general that Gallagers notion [18, p. 94],
as we herein allow each sub-matrix to be weakly-symmetric (instead of symmetric as in [18]).
106
Proof: We rst observe that for each i = 1, , m, a
i
is independent of the
input value x, since sub-matrix i is weakly-symmetric (so any row in Q
i
is a
permutation of any other row); and hence a
i
is the sum of any row in Q
i
.
For each i = 1, , m, dene
P
Y
i
|X
(y[x)
_
px,y
a
i
if y
i
and x A;
0 otherwise
where Y
i
is a random variable taking values in
i
. It can be easily veried that
P
Y
i
|X
(y[x) is a legitimate conditional distribution. Thus [P
Y
i
|X
(y[x)] =
1
a
i
Q
i
is the transition matrix of the weakly-symmetric sub-channel i with input
alphabet A and output alphabet
i
. Let I(X; Y
i
) denote its mutual information.
Since each such sub-channel i is weakly-symmetric, we know that its capacity
C
i
is given by
C
i
= max
P
X
I(X; Y
i
) = log
2
[
i
[ H
_
any row in the matrix
1
a
i
Q
i
_
,
where the maximum is achieved by a uniform input distribution.
Now, the mutual information between the input and the output of our original
quasi-symmetric channel Q can be written as
I(X; Y ) =

yY

xX
P
X
(x) p
x,y
log
2
p
x,y

X
P
X
(x

)p
x

,y
=
m

i=1

yY
i

xX
a
i
P
X
(x)
p
x,y
a
i
log
2
px,y
a
i

X
P
X
(x

)
p
x

,y
a
i
=
m

i=1
a
i

yY
i

xX
P
X
(x)P
Y
i
|X
(y[x) log
2
P
Y
i
|X
(y[x)

X
P
X
(x

)P
Y
i
|X
(y[x

)
=
m

i=1
a
i
I(X; Y
i
).
Therefore, the capacity of channel Q is
C = max
P
X
I(X; Y )
= max
P
X
m

i=1
a
i
I(X; Y
i
)
=
m

i=1
a
i
max
P
X
I(X; Y
i
) (as the same uniform P
X
maximizes each I(X; Y
i
))
=
m

i=1
a
i
C
i
.
2
107
Example 4.18 (Capacity of the BEC) The BEC with erasure probability
as given in (4.2.5) is quasi-symmetric (but neither weakly-symmetric nor sym-
metric). Indeed, its transition matrix Q can be partitioned along its columns
into two symmetric (hence weakly-symmetric) sub-matrices
Q
1
=
_
1 0
0 1
_
and
Q
2
=
_

_
.
Thus applying the capacity formula for quasi-symmetric channels of Lemma 4.17
yields that the capacity of the BEC is given by
C = a
1
C
1
+ a
2
C
2
where a
1
= 1 , a
2
= ,
C
1
= log
2
(2) H
_
1
1
,
0
1
_
= 1 H(1, 0) = 1 0 = 1,
and
C
2
= log
2
(1) H
_

_
= 0 0 = 0.
Therefore, the BEC capacity is given by
C = (1 )(1) + ()(0) = 1 . (4.4.7)
Example 4.19 (Capacity of the BSEC) Similarly, the BSEC with crossover
probability and erasure probability as described in (4.2.6) is quasi-symmetric;
its transition matrix can be partitioned along its columns into two symmetric
sub-matrices
Q
1
=
_
1
1
_
and
Q
2
=
_

_
.
Hence by Lemma 4.17, the channel capacity is given by C = a
1
C
1
+a
2
C
2
where
a
1
= 1 , a
2
= ,
C
1
= log
2
(2) H
_
1
1
,

1
_
= 1 h
b
_
1
1
_
,
and
C
2
= log
2
(1) H
_

_
= 0.
108
We thus obtain that
C = (1 )
_
1 h
b
_
1
1
__
+ ()(0)
= (1 )
_
1 h
b
_
1
1
__
. (4.4.8)
As already noted, the BSEC is a combination of the BSC with bit error rate
and the BEC with erasure probability . Indeed, setting = 0 in (4.4.8) yields
that C = 1 h
b
(1 ) = 1 h
b
() which is the BSC capacity. Furthermore,
setting = 0 results in C = 1 , the BEC capacity.
4.4.2 Channel capacity Karuch-Kuhn-Tucker condition
When the channel does not satisfy any symmetry property, the following neces-
sary and sucient Karuch-Kuhn-Tucker (KKT) condition (e.g., cf. [18, pp. 87-
91], [5, 11]) for calculating channel capacity can be quite useful.
Denition 4.20 (Mutual information for a specic input symbol) The
mutual information for a specic input symbol is dened as:
I(x; Y )

yY
P
Y |X
(y[x) log
2
P
Y |X
(y[x)
P
Y
(y)
.
From the above denition, the mutual information becomes:
I(X; Y ) =

xX
P
X
(x)

yY
P
Y |X
(y[x) log
2
P
Y |X
(y[x)
P
Y
(y)
=

xX
P
X
(x)I(x; Y ).
Lemma 4.21 (KKT condition for channel capacity) For a given DMC, an
input distribution P
X
achieves its channel capacity i there exists a constant C
such that
_
I(x : Y ) = C x A with P
X
(x) > 0;
I(x : Y ) C x A with P
X
(x) = 0.
(4.4.9)
Furthermore, the constant C is the channel capacity (justifying the choice of
notation).
Proof: The forward (if) part holds directly; hence, we only prove the converse
(only-if) part.
109
Without loss of generality, we assume that P
X
(x) < 1 for all x A, since
P
X
(x) = 1 for some x implies that I(X; Y ) = 0. The problem of calculating the
channel capacity is to maximize
I(X; Y ) =

xX

yY
P
X
(x)P
Y |X
(y[x) log
2
P
Y |X
(y[x)

X
P
X
(x

)P
Y |X
(y[x

)
, (4.4.10)
subject to the condition

xX
P
X
(x) = 1 (4.4.11)
for a given channel distribution P
Y |X
. By using the Lagrange multiplier method
(e.g., see [5]), maximizing (4.4.10) subject to (4.4.11) is equivalent to maximize:
f(P
X
)

xX
yY
P
X
(x)P
Y |X
(y[x) log
2
P
Y |X
(y[x)

X
P
X
(x

)P
Y |X
(y[x

)
+
_

xX
P
X
(x) 1
_
.
We then take the derivative of the above quantity with respect to P
X
(x

), and
obtain that
9
f(P
X
)
P
X
(x

)
= I(x

; Y ) log
2
(e) + .
9
The details for taking the derivative are as follows:

P
X
(x

)
_
_
_

xX

yY
P
X
(x)P
Y |X
(y[x) log
2
P
Y |X
(y[x)

xX

yY
P
X
(x)P
Y |X
(y[x) log
2
_

X
P
X
(x

)P
Y |X
(y[x

)
_
+
_

xX
P
X
(x) 1
_
_
_
_
=

yY
P
Y |X
(y[x

) log
2
P
Y |X
(y[x

)
_
_

yY
P
Y |X
(y[x

) log
2
_

X
P
X
(x

)P
Y |X
(y[x

)
_
+log
2
(e)

xX

yY
P
X
(x)P
Y |X
(y[x)
P
Y |X
(y[x

X
P
X
(x

)P
Y |X
(y[x

)
_
_
+
= I(x

; Y ) log
2
(e)

yY
_

xX
P
X
(x)P
Y |X
(y[x)
_
P
Y |X
(y[x

X
P
X
(x

)P
Y |X
(y[x

)
+
= I(x

; Y ) log
2
(e)

yY
P
Y |X
(y[x

) +
= I(x

; Y ) log
2
(e) +.
110
By Property 2 of Lemma 2.46, I(X; Y ) = I(P
X
, P
Y |X
) is a concave function in
P
X
(for a xed P
Y |X
). Therefore, the maximum of I(P
X
, P
Y |X
) occurs for a zero
derivative when P
X
(x) does not lie on the boundary, namely 1 > P
X
(x) > 0.
For those P
X
(x) lying on the boundary, i.e., P
X
(x) = 0, the maximum occurs i
a displacement from the boundary to the interior decreases the quantity, which
implies a non-positive derivative, namely
I(x; Y ) + log
2
(e), for those x with P
X
(x) = 0.
To summarize, if an input distribution P
X
achieves the channel capacity, then
_
I(x

; Y ) = + log
2
(e), for P
X
(x

) > 0;
I(x

; Y ) + log
2
(e), for P
X
(x

) = 0.
for some . With the above result, setting C = + 1 yields (4.4.9). Finally,
multiplying both sides of each equation in (4.4.9) by P
X
(x) and summing over
x yields that max
P
X
I(X; Y ) on the left and the constant C on the right, thus
proving that the constant C is indeed the channels capacity. 2
Example 4.22 (Quasi-symmetric channels) For a quasi-symmetric channel,
one can directly verify that the uniform input distribution satises the KKT con-
dition of Lemma 4.21 and yields that the channel capacity is given by (4.4.6);
this is left as an exercise. As we already saw, the BSC, the q-ary symmetric
channel, the BEC and the BSEC are all quasi-symmetric.
Example 4.23 Consider a DMC with a ternary input alphabet A = 0, 1, 2,
binary output alphabet = 0, 1 and the following transition matrix
Q =
_
_
1 0
1
2
1
2
0 1
_
_
.
This channel is not quasi-symmetric. However, one may guess that the capacity
of this channel is achieved by the input distribution (P
X
(0), P
X
(1), P
X
(2)) =
(
1
2
, 0,
1
2
) since the input x = 1 has an equal conditional probability of being re-
ceived as 0 or 1 at the output. Under this input distribution, we obtain that
I(x = 0; Y ) = I(x = 2; Y ) = 1 and that I(x = 1; Y ) = 0. Thus the KKT con-
dition of (4.4.9) is satised; hence conrming that the above input distribution
achieves channel capacity and that channel capacity is equal to 1 bit.
Observation 4.24 (Capacity achieved by a uniform input distribution)
We close this chapter by noting that there is a class of DMCs that is larger
than that of quasi-symmetric channels for which the uniform input distribution
111
achieves capacity. It concerns the class of so-called T-symmetric channels [36,
Section V, Denition 1] for which
T(x) I(x; Y ) log
2
[A[ =

yY
P
Y |X
(y[x) log
2
P
Y |X
(y[x)

X
P
Y |X
(y[x

)
is a constant function of x (i.e., independent of x), where I(x; Y ) is the mu-
tual information for input x under a uniform input distribution. Indeed the
T-symmetry condition is equivalent to the property of having the uniform input
distribution achieve capacity. This directly follows from the KKT condition of
Lemma 4.21. An example of a T-symmetric channel that is not quasi-symmetric
is the binary-input ternary-output channel with the following transition matrix
Q =
_
1
3
1
3
1
3
1
6
1
6
2
3
_
.
Hence its capacity is achieved by the uniform input distribution. See [36, Fig. 2]
for (innitely-many) other examples of T-symmetric channels. However, unlike
quasi-symmetric channels, T-symmetric channels do not admit in general a sim-
ple closed-form expression for their capacity (such as the one given in (4.4.6)).
112
Chapter 5
Dierential Entropy and Gaussian
Channels
We have so far examined information measures and their operational character-
ization for discrete-time discrete-alphabet systems. In this chapter, we turn our
focus to continuous-alphabet (real-valued) systems. Except for a brief interlude
with the continuous-time (waveform) Gaussian channel, we consider discrete-
time systems, as treated throughout the book.
We rst recall that a real-valued (continuous) random variable X is described
by its cumulative distribution function (cdf)
F
X
(x) Pr[X x]
for x R, the set of real numbers. The distribution of X is called absolutely con-
tinuous (with respect to the Lebesgue measure) if a probability density function
(pdf) f
X
() exists such that
F
X
(x) =
_
x

f
X
(t)dt
where f
X
(t) 0 t and
_
+

f
X
(t)dt = 1. If F
X
() is dierentiable everywhere,
then the pdf f
X
() exists and is given by the derivative of F
X
(): f
X
(t) =
dF
X
(t)
dt
.
The support of a random variable X with pdf f
X
() is denoted by S
X
and can
be conveniently given as
S
X
= x R : f
X
(x) > 0.
We will deal with random variables that admit a pdf.
1
1
A rigorous (measure-theoretic) study for general continuous systems, initiated by Kol-
mogorov [25], can be found in [34, 22].
113
5.1 Dierential entropy
Recall that the denition of entropy for a discrete random variable X represent-
ing a DMS is
H(X)

xX
P
X
(x) log
2
P
X
(x) (in bits).
As already seen in Shannons source coding theorem, this quantity is the mini-
mum average code rate achievable for the lossless compression of the DMS. But if
the random variable takes on values in a continuum, the minimum number of bits
per symbol needed to losslessly describe it must be innite. This is illustrated
in the following example, where we take a discrete approximation (quantization)
of a random variable uniformly distributed on the unit interval and study the
entropy of the quantized random variable as the quantization becomes ner and
ner.
Example 5.1 Consider a real-valued random variable X that is uniformly dis-
tributed on the unit interval, i.e., with pdf given by
f
X
(x) =
_
1 if x [0, 1);
0 otherwise.
Given a positive integer m, we can discretize X by uniformly quantizing it into
m levels by partitioning the support of X into equal-length segments of size
=
1
m
( is called the quantization step-size) such that:
q
m
(X) =
i
m
, if
i 1
m
X <
i
m
,
for 1 i m. Then the entropy of the quantized random variable q
m
(X) is
given by
H(q
m
(X)) =
m

i=1
1
m
log
2
_
1
m
_
= log
2
m (in bits).
Since the entropy H(q
m
(X)) of the quantized version of X is a lower bound to
the entropy of X (as q
m
(X) is a function of X) and satises in the limit
lim
m
H(q
m
(X)) = lim
m
log
2
m = ,
we obtain that the entropy of X is innite.
The above example indicates that to compress a continuous source without
incurring any loss or distortion indeed requires an innite number of bits. Thus
114
when studying continuous sources, the entropy measure is limited in its eective-
ness and the introduction of a new measure is necessary. Such new measure is
indeed obtained upon close examination of the entropy of a uniformly quantized
real-valued random-variable minus the quantization accuracy as the accuracy
increases without bound.
Lemma 5.2 Consider a real-valued random variable X with support [a, b) and
pdf f
X
such that f
X
log
2
f
X
is integrable
2
(where
_
b
a
f
X
(x) log
2
f
X
(x)dx is
nite). Then a uniform quantization of X with an n-bit accuracy (i.e., with
a quantization step-size of = 2
n
) yields an entropy approximately equal to

_
b
a
f
X
(x) log
2
f
X
(x)dx + n bits for n suciently large. In other words,
lim
n
[H(q
n
(X)) n] =
_
b
a
f
X
(x) log
2
f
X
(x)dx
where q
n
(X) is the uniformly quantized version of X with quantization step-size
= 2
n
.
Proof:
Step 1: Mean-value theorem.
Let = 2
n
be the quantization step-size, and let
t
i

_
a + i, i = 0, 1, , j 1
b, i = j
where j = (b a)2
n
. From the mean-value theorem (e.g., cf [32]), we
can choose x
i
[t
i1
, t
i
] for 1 i j such that
p
i

_
t
i
t
i1
f
X
(x)dx = f(x
i
)(t
i
t
i1
) = f
X
(x
i
).
Step 2: Denition of h
(n)
(X). Let
h
(n)
(X)
j

i=1
[f
X
(x
i
) log
2
f
X
(x
i
)]2
n
.
Since f
X
(x) log
2
f
X
(x) is integrable,
h
(n)
(X)
_
b
a
f
X
(x) log
2
f
X
(x)dx as n .
2
By integrability, we mean the usual Riemann integrability (e.g, see [38]).
115
Therefore, given any > 0, there exists N such that for all n > N,

_
b
a
f
X
(x) log
2
f
X
(x)dx h
(n)
(X)

< .
Step 3: Computation of H(q
n
(X)). The entropy of the (uniformly) quan-
tized version of X, q
n
(X), is given by
H(q
n
(X)) =
j

i=1
p
i
log
2
p
i
=
j

i=1
(f
X
(x
i
)) log
2
(f
X
(x
i
))
=
j

i=1
(f
X
(x
i
)2
n
) log
2
(f
X
(x
i
)2
n
)
where the p
i
s are the probabilities of the dierent values of q
n
(X).
Step 4: H(q
n
(X)) h
(n)
(X) .
From Steps 2 and 3,
H(q
n
(X)) h
(n)
(X) =
j

i=1
[f
X
(x
i
)2
n
] log
2
(2
n
)
= n
j

i=1
_
t
i
t
i1
f
X
(x)dx
= n
_
b
a
f
X
(x)dx = n.
Hence, we have that for n > N,
_

_
b
a
f
X
(x) log
2
f
X
(x)dx + n
_
< H(q
n
(X))
= h
(n)
(X) +n
<
_

_
b
a
f
X
(x) log
2
f
X
(x)dx + n
_
+ ,
yielding that
lim
n
[H(q
n
(X)) n] =
_
b
a
f
X
(x) log
2
f
X
(x)dx.
2
116
More generally, the following result due to Renyi [35] can be shown for (absolutely
continuous) random variables with arbitrary support.
Theorem 5.3 [35, Theorem 1] For any real-valued random variable with pdf f
X
,
if

j
i=1
p
i
log
2
p
i
is nite, where the p
i
s are the probabilities of the dierent
values of uniformly quantized q
n
(X) over support S
X
, then
lim
n
[H(q
n
(X)) n] =
_
S
X
f
X
(x) log
2
f
X
(x)dx
provided the integral on the right-hand side exists.
In light of the above results, we can dene the following information measure.
Denition 5.4 (Dierential entropy) The dierential entropy (in bits) of a
continuous random variable X with pdf f
X
and support S
X
is dened as
h(X)
_
S
X
f
X
(x) log
2
f
X
(x)dx = E[log
2
f
X
(X)],
when the integral exists.
Thus the dierential entropy h(X) of a real-valued random variable X has
an operational meaning in the following sense. Since H(q
n
(X)) is the minimum
average number of bits needed to losslessly describe q
n
(X), we thus obtain that
h(X) + n is approximately needed to describe X when uniformly quantizing it
with an n-bit accuracy. Therefore, we may conclude that the larger h(X) is, the
larger is the average number of bits required to describe a uniformly quantized
X within a xed accuracy.
Example 5.5 A continuous random variable X with support S
X
= [0, 1) and
pdf f
X
(x) = 2x for x S
X
has dierential entropy equal to
_
1
0
2x log
2
(2x)dx =
x
2
(log
2
e 2 log
2
(2x))
2

1
0
=
1
2 ln 2
log
2
(2) = 0.278652 bits.
We herein illustrate Lemma 5.2 by uniformly quantizing X to an n-bit accuracy
and computing the entropy H(q
n
(X)) and H(q
n
(X)) n for increasing values of
n, where q
n
(X) is the quantized version of X.
We have that q
n
(X) is given by
q
n
(X) =
i
2
n
, if
i 1
2
n
X <
i
2
n
,
117
n H(q
n
(X)) H(q
n
(X)) n
1 0.811278 bits 0.188722 bits
2 1.748999 bits 0.251000 bits
3 2.729560 bits 0.270440 bits
4 3.723726 bits 0.276275 bits
5 4.722023 bits 0.277977 bits
6 5.721537 bits 0.278463 bits
7 6.721399 bits 0.278600 bits
8 7.721361 bits 0.278638 bits
9 8.721351 bits 0.278648 bits
Table 5.1: Quantized random variable q
n
(X) under an n-bit accuracy:
H(q
n
(X)) and H(q
n
(X)) n versus n.
for 1 i 2
n
. Hence,
Pr
_
q
n
(X) =
i
2
n
_
=
(2i 1)
2
2n
,
which yields
H(q
n
(X)) =
2
n

i=1
2i 1
2
2n
log
2
_
2i 1
2
2n
_
=
_

1
2
2n
2
n

i=1
(2i 1) log
2
(2i 1) + 2 log
2
(2
n
)
_
.
As shown in Table 5.1, we indeed observe that as n increases, H(q
n
(X)) tends
to innity while H(q
n
(X)) n converges to h(X) = 0.278652 bits.
Thus a continuous random variable X contains an innite amount of infor-
mation; but we can measure the information contained in its n-bit quantized
version q
n
(X) as: H(q
n
(X)) h(X) + n (for n large enough).
Example 5.6 Let us determine the minimum average number of bits required
to describe the uniform quantization with 3-digit accuracy of the decay time
(in years) of a radium atom assuming that the half-life of the radium (i.e., the
median of the decay time) is 80 years and that its pdf is given by f
X
(x) = e
x
,
where x > 0.
Since the median of the decay time is 80, we obtain:
_
80
0
e
x
dx = 0.5,
118
which implies that = 0.00866. Also, 3-digit accuracy is approximately equiv-
alent to log
2
999 = 9.96 10 bits accuracy. Therefore, by Theorem 5.3, the
number of bits required to describe the quantized decay time is approximately
h(X) + 10 = log
2
e

+ 10 = 18.29 bits.
We close this section by computing the dierential entropy for two common
real-valued random variables: the uniformly distributed random variable and
the Gaussian distributed random variable.
Example 5.7 (Dierential entropy of a uniformly distributed random
variable) Let X be a continuous random variable that is uniformly distributed
over the interval (a, b), where b > a; i.e., its pdf is given by
f
X
(x) =
_
1
ba
if x (a, b);
0 otherwise.
So its dierential entropy is given by
h(X) =
_
b
a
1
b a
log
2
1
b a
= log
2
(b a) bits.
Note that if (b a) < 1 in the above example, then h(X) is negative, unlike
entropy. The above example indicates that although dierential entropy has a
form analogous to entropy (in the sense that summation and pmf for entropy are
replaced by integration and pdf, respectively, for dierential entropy), dieren-
tial entropy does not retain all the properties of entropy (one such operational
dierence was already highlighted in the previous lemma and theorem).
Example 5.8 (Dierential entropy of a Gaussian random variable) Let
X ^(,
2
); i.e., X is a Gaussian (or normal) random variable with nite mean
, variance Var(X) =
2
> 0 and pdf
f
X
(x) =
1

2
2
e

(x)
2
2
2
for x R. Then its dierential entropy is given by
h(X) =
_
R
f
X
(x)
_
1
2
log
2
(2
2
) +
(x )
2
2
2
log
2
e
_
dx
=
1
2
log
2
(2
2
) +
log
2
e
2
2
E[(X )
2
]
119
=
1
2
log
2
(2
2
) +
1
2
log
2
e
=
1
2
log
2
(2e
2
) bits. (5.1.1)
Note that for a Gaussian random variable, its dierential entropy is only a func-
tion of its variance
2
(it is independent from its mean ).
5.2 Joint and conditional dierential entropies, divergence
and mutual information
Denition 5.9 (Joint dierential entropy) If X
n
= (X
1
, X
2
, , X
n
) is a
continuous random vector of size n (i.e., a vector of n continuous random vari-
ables) with joint pdf f
X
n and support S
X
n R
n
, then its joint dierential
entropy is dened as
h(X
n
)
_
S
X
n
f
X
n(x
1
, x
2
, , x
n
) log
2
f
X
n(x
1
, x
2
, , x
n
) dx
1
dx
2
dx
n
= E[log
2
f
X
n(X
n
)]
when the n-dimensional integral exists.
Denition 5.10 (Conditional dierential entropy) Let X and Y be two
jointly distributed continuous random variables with joint pdf f
X,Y
and support
S
X,Y
R
2
such that the conditional pdf of Y given X, given by f
Y |X
(y[x) =
f
X,Y
(x,y)
f
X
(x)
, is well dened for all (x, y) S
X,Y
, where f
X
is the marginal pdf of X.
Then the conditional entropy of Y given X is dened as
h(Y [X)
_
S
X,Y
f
X,Y
(x, y) log
2
f
Y |X
(y[x) dx dy = E[log
2
f
Y |X
(Y [X)],
when the integral exists.
Note that as in the case of (discrete) entropy, the chain rule holds for dier-
ential entropy:
h(X, Y ) = h(X) + h(Y [X) = h(Y ) + h(X[Y ).
Denition 5.11 (Divergence or relative entropy) Let X and Y be two con-
tinuous random variables with marginal pdfs f
X
and f
Y
, respectively, such that
their supports satisfy S
X
S
Y
R. Then the divergence (or relative entropy or
120
Kullback-Leibler distance) between X and Y is written as D(X|Y ) or D(f
X
|f
Y
)
and dened by
D(X|Y )
_
S
X
f
X
(x) log
2
f
X
(x)
f
Y
(x)
dx = E
_
f
X
(X)
f
Y
(X)
_
when the integral exists. The denition carries over similarly in the multivariate
case: for X
n
= (X
1
, X
2
, , X
n
) and Y
n
= (Y
1
, Y
2
, , Y
n
) two random vectors
with joint pdfs f
X
n and f
Y
n, respectively, and supports satisfying S
X
n S
Y
n
R
n
, then the divergence between X
n
and Y
n
is dened as
D(X
n
|Y
n
)
_
S
X
n
f
X
n(x
1
, x
2
, , x
n
) log
2
f
X
n(x
1
, x
2
, , x
n
)
f
Y
n(x
1
, x
2
, , x
n
)
dx
1
dx
2
dx
n
when the integral exists.
Denition 5.12 (Mutual information) Let X and Y be two jointly distributed
continuous random variables with joint pdf f
X,Y
and support S
XY
R
2
, then
the mutual information between X and Y is dened by
I(X; Y ) D(f
X,Y
|f
X
f
Y
) =
_
S
X,Y
f
X,Y
(x, y) log
2
f
X,Y
(x, y)
f
X
(x)f
Y
(y)
dx dy,
assuming the integral exists, where f
X
and f
Y
are the marginal pdfs of X and
Y , respectively.
Observation 5.13 For two jointly distributed continuous random variables X
and Y with joint pdf f
X,Y
, support S
XY
R
2
and joint dierential entropy
h(X, Y ) =
_
S
XY
f
X,Y
(x, y) log
2
f
X,Y
(x, y) dx dy,
then as in Lemma 5.2 and the ensuing discussion, one can write
H(q
n
(X), q
m
(Y )) h(X, Y ) + n + m
for n and m suciently large, where q
k
(Z) denotes the (uniformly) quantized
version of random variable Z with an k-bit accuracy.
On the other hand, for the above continuous X and Y ,
I(q
n
(X); q
m
(Y )) = H(q
n
(X)) + H(q
m
(Y )) H(q
n
(X), q
m
(Y ))
[h(X) + n] + [h(Y ) + m] [h(X, Y ) +n + m]
= h(X) + h(Y ) h(X, Y )
121
=
_
S
X,Y
f
X,Y
(x, y) log
2
f
X,Y
(x, y)
f
X
(x)f
Y
(y)
dx dy
for n and m suciently large; in other words,
lim
n,m
I(q
n
(X); q
m
(Y )) = h(X) +h(Y ) h(X, Y ).
Furthermore, it can be shown that
lim
n
D(q
n
(X)|q
n
(Y )) =
_
S
X
f
X
(x) log
2
f
X
(x)
f
Y
(x)
dx.
Thus mutual information and divergence can be considered as the true tools
of Information Theory, as they retain the same operational characteristics and
properties for both discrete and continuous probability spaces (as well as general
spaces where they can be dened in terms of Radon-Nikodym derivatives (e.g.,
cf. [22]).
3
The following lemma illustrates that for continuous systems, I(; ) and D(|)
keep the same properties already encountered for discrete systems, while dier-
ential entropy (as already seen with its possibility if being negative) satises
some dierent properties from entropy. The proof is left as an exercise.
Lemma 5.14 The following properties hold for the information measures of
continuous systems.
1. Non-negativity of divergence: Let X and Y be two continuous ran-
dom variables with marginal pdfs f
X
and f
Y
, respectively, such that their
supports satisfy S
X
S
Y
R. Then
D(f
X
|f
Y
) 0
with equality i f
X
(x) = f
Y
(x) for all x S
X
(i.e., X = Y almost surely).
2. Non-negativity of mutual information: For any two continuous ran-
dom variables X and Y ,
I(X; Y ) 0
with equality i X and Y are independent.
3
This justies using identical notations for both I(; ) and D(|) as opposed to the dis-
cerning notations of H() for entropy and h() for dierential entropy.
122
3. Conditioning never increases dierential entropy: For any two con-
tinuous random variables X and Y with joint pdf f
X,Y
and well-dened
conditional pdf f
X|Y
,
h(X[Y ) h(X)
with equality i X and Y are independent.
4. Chain rule for dierential entropy: For a continuous random vector
X
n
= (X
1
, X
2
, , X
n
),
h(X
1
, X
2
, . . . , X
n
) =
n

i=1
h(X
i
[X
1
, X
2
, . . . , X
i1
),
where h(X
i
[X
1
, X
2
, . . . , X
i1
) h(X
1
) for i = 1.
5. Chain rule for mutual information: For continuous random vector
X
n
= (X
1
, X
2
, , X
n
) and random variable Y with joint pdf f
X
n
,Y
and
well-dened conditional pdfs f
X
i
,Y |X
i1, f
X
i
|X
i1 and f
Y |X
i1 for i = 1, , n,
we have that
I(X
1
, X
2
, , X
n
; Y ) =
n

i=1
I(X
i
; Y [X
i1
, , X
1
),
where I(X
i
; Y [X
i1
, , X
1
) I(X
1
; Y ) for i = 1.
6. Data processing inequality: For continuous random variables X, Y
and Z such that X Y Z, i.e., X and Z are conditional independent
given Y (cf. Appendix B),
I(X; Y ) I(X; Z).
7. Independence bound for dierential entropy: For a continuous ran-
dom vector X
n
= (X
1
, X
2
, , X
n
),
h(X
n
)
n

i=1
h(X
i
)
with equality i all the X
i
s are independent from each other.
8. Invariance of dierential entropy under translation: For continuous
random variables X and Y with joint pdf f
X,Y
and well-dened conditional
pdf f
X|Y
,
h(X + c) = h(X) for any constant c R,
123
and
h(X + Y [Y ) = h(X[Y ).
The results also generalize in the multivariate case: for two continuous
random vectors X
n
= (X
1
, X
2
, , X
n
) and Y
n
= (Y
1
, Y
2
, , Y
n
) with
joint pdf f
X
n
,Y
n and well-dened conditional pdf f
X
n
|Y
n,
h(X
n
+ c
n
) = h(X
n
)
for any constant n-tuple c
n
= (c
1
, c
2
, , c
n
) R
n
, and
h(X
n
+ Y
n
[Y
n
) = h(X
n
[Y
n
),
where the addition of two n-tuples is performed component-wise.
9. Dierential entropy under scaling: For any continuous random vari-
able X and any non-zero real constant a,
h(aX) = h(X) + log
2
[a[.
10. Joint dierential entropy under linear mapping: Consider the ran-
dom (column) vector X = (X
1
, X
2
, , X
n
)
T
with joint pdf f
X
n, where T
denotes transposition, and let Y = (Y
1
, Y
2
, , Y
n
)
T
be a random (column)
vector obtained from the linear transformation Y = AX, where A is an
invertible (non-singular) n n real-valued matrix. Then
h(Y ) = h(Y
1
, Y
2
, , Y
n
) = h(X
1
, X
2
, , X
n
) + log
2
[det(A)[,
where det(A) is the determinant of the square matrix A.
11. Joint dierential entropy under nonlinear mapping: Consider the
random (column) vector X = (X
1
, X
2
, , X
n
)
T
with joint pdf f
X
n, and
let Y = (Y
1
, Y
2
, , Y
n
)
T
be a random (column) vector obtained from the
nonlinear transformation
Y = g(X) (g
1
(X
1
), g
2
(X
2
), , g
n
(X
n
))
T
,
where each g
i
: R
n
R is a dierentiable function, i = 1, 2, , n. Then
h(Y ) = h(Y
1
, Y
2
, , Y
n
)
= h(X
1
, , X
n
) +
_
R
n
f
X
n(x
1
, , x
n
) log
2
[det(J)[ dx
1
dx
n
,
where J is the n n Jacobian matrix given by
J
_

_
g
1
x
1
g
1
x
2

g
1
xn
g
2
x
1
g
2
x
2

g
2
xn
.
.
.
.
.
.
.
.
.
gn
x
1
gn
x
2

gn
xn
_

_
.
124
Observation 5.15 Property 9 of the above Lemma indicates that for a contin-
uous random variable X, h(X) ,= h(aX) (except for the trivial case of a = 1)
and hence dierential entropy is not in general invariant under invertible maps.
This is in contrast to entropy, which is always invariant under invertible maps:
given a discrete random variable X with alphabet A,
H(f(X)) = H(X)
for all invertible maps f : A , where is a discrete set; in particular
H(aX) = H(X) for all non-zero reals a.
On the other hand, for both discrete and continuous systems, mutual infor-
mation and divergence are invariant under invertible maps:
I(X; Y ) = I(g(X); Y ) = I(g(X); h(Y ))
and
D(X|Y ) = D(g(X)|g(Y ))
for all invertible maps g and h properly dened on the alphabet/support of the
concerned random variables. This reinforces the notion that mutual information
and divergence constitute the true tools of Information Theory.
Denition 5.16 (Multivariate Gaussian) A continuous random vector X =
(X
1
, X
2
, , X
n
)
T
is called a size-n (multivariate) Gaussian random vector with
a nite mean vector (
1
,
2
, ,
n
)
T
, where
i
E[X
i
] < for i =
1, 2, , n, and an n n invertible (real-valued) covariance matrix
K
X
= [K
i,j
]
E[(X )(X )
T
]
=
_

_
Cov(X
1
, X
1
) Cov(X
1
, X
2
) Cov(X
1
, X
n
)
Cov(X
2
, X
1
) Cov(X
2
, X
2
) Cov(X
2
, X
n
)
.
.
.
.
.
.
.
.
.
Cov(X
n
, X
1
) Cov(X
n
, X
2
) Cov(X
n
, X
n
)
_

_
,
where K
i,j
= Cov(X
i
, X
j
) E[(X
i

i
)(X
j

j
)] is the covariance
4
between X
i
and X
j
for i, j = 1, 2, , n, if its joint pdf is given by the multivariate Gaussian
pdf
f
X
n(x
1
, x
2
, , x
n
) =
1
(

2)
n
_
det(K
X
)
e

1
2
(x)
T
K
1
X
(x)
for any (x
1
, x
2
, , x
n
) R
n
, where x = (x
1
, x
2
, , x
n
)
T
. As in the scalar case
(i.e., for n = 1), we write X ^
n
(, K
X
) to denote that X is a size-n Gaussian
random vector with mean vector and covariance matrix K
X
.
4
Note that the diagonal components of K
X
yield the variance of the dierent random
variables: K
i,i
= Cov(X
i
, X
i
) = Var(X
i
) =
2
Xi
, i = 1, , n.
125
Observation 5.17 In light of the above denition, we make the following re-
marks.
1. Note that a covariance matrix K is always symmetric (i.e., K
T
= K)
and positive-semidenite.
5
But as we require K
X
to be invertible in the
denition of the multivariate Gaussian distribution above, we will hereafter
assume that the covariance matrix of Gaussian random vectors is positive-
denite (which is equivalent to having all the eigenvalues of K
X
positive),
thus rendering the matrix invertible.
2. If a random vector X = (X
1
, X
2
, , X
n
)
T
has a diagonal covariance ma-
trix K
X
(i.e., all the o-diagonal components of K
X
are zero: K
i,j
= 0
for all i ,= j, i, j = 1, , n), then all its component random variables are
uncorrelated but not necessarily independent. However, if X is Gaussian
and have a diagonal covariance matrix, then all its component random
variables are independent from each other.
3. Any linear transformation of a Gaussian random vector yields another
Gaussian random vector. Specically, if X ^
n
(, K
X
) is a size-n Gaus-
sian random vector with mean vector and covariance matrix K
X
, and if
Y = A
mn
X, where A
mn
is a given mn real-valued matrix, then
Y ^
m
(A
mn
, A
mn
K
X
A
T
mn
)
is a size-m Gaussian random vector with mean vector A
mn
and covariance
matrix A
mn
K
X
A
T
mn
.
More generally, any ane transformation of a Gaussian random vector
yields another Gaussian random vector: if X ^
n
(, K
X
) and Y =
A
mn
X +b
m
, where A
mn
is a mn real-valued matrix and b
m
is a size-m
real-valued vector, then
Y ^
m
(A
mn
+ b
m
, A
mn
K
X
A
T
mn
).
5
An nn real-valued symmetric matrix K is positive-semidenite (e.g., cf. [15]) if for every
real-valued vector x = (x
1
, x
2
, x
n
)
T
,
x
T
Kx = (x
1
, , x
n
)K
_
_
_
x
1
.
.
.
x
n
_
_
_ 0,
with equality holding only when x
i
= 0 for i = 1, 2, , n. Furthermore, the matrix is positive-
denite if x
T
Kx > 0 for all real-valued vectors x ,= 0, where 0 is the all-zero vector of size
n.
126
Theorem 5.18 (Joint dierential entropy of the multivariate Gaussian)
If X ^
n
(, K
X
) is a Gaussian random vector with mean vector and (positive-
denite) covariance matrix K
X
, then its joint dierential entropy is given by
h(X) = h(X
1
, X
2
, , X
n
) =
1
2
log
2
[(2e)
n
det(K
X
)] . (5.2.1)
In particular, in the univariate case of n = 1, (5.2.1) reduces to (5.1.1).
Proof: Without loss of generality we assume that X has a zero mean vec-
tor since its dierential entropy is invariant under translation by Property 8 of
Lemma 5.14:
h(X) = h(X );
so we assume that = 0.
Since the covariance matrix K
X
is a real-valued symmetric matrix, then
it is orthogonally diagonizable; i.e., there exits a square (n n) orthogonal
matrix A (i.e., satisfying A
T
= A
1
) such that AK
X
A
T
is a diagonal ma-
trix whose entries are given by the eigenvalues of K
X
(A is constructed using
the eigenvectors of K
X
; e.g., see [15]). As a result the linear transformation
Y = AX ^
n
_
0, AK
X
A
T
_
is a Gaussian vector with the diagonal covariance
matrix K
Y
= AK
X
A
T
and has therefore independent components (as noted in
Observation 5.17). Thus
h(Y ) = h(Y
1
, Y
2
, , Y
n
)
= h(Y
1
) + h(Y
2
) + + h(Y
n
) (5.2.2)
=
n

i=1
1
2
log
2
[2eVar(Y
i
)] (5.2.3)
=
n
2
log
2
(2e) +
1
2
log
2
_
n

i=1
Var(Y
i
)
_
=
n
2
log
2
(2e) +
1
2
log
2
[det (K
Y
)] (5.2.4)
=
1
2
log
2
(2e)
n
+
1
2
log
2
[det (K
X
)] (5.2.5)
=
1
2
log
2
[(2e)
n
det (K
X
)] , (5.2.6)
where (5.2.2) follows by the independence of the random variables Y
1
, . . . , Y
n
(e.g., see Property 7 of Lemma 5.14), (5.2.3) follows from (5.1.1), (5.2.4) holds
since the matrix K
Y
is diagonal and hence its determinant is given by the product
of its diagonal entries, and (5.2.5) holds since
det (K
Y
) = det
_
AK
X
A
T
_
127
= det(A)det (K
X
) det(A
T
)
= det(A)
2
det (K
X
)
= det (K
X
) ,
where the last equality holds since det(A)
2
= 1, as the matrix A is orthogonal
(A
T
= A
1
= det(A) = det(A
T
) = 1/[det(A)]; thus, det(A)
2
= 1).
Now invoking Property 10 of Lemma 5.14 and noting that [det(A)[ = 1 yield
that
h(Y
1
, Y
2
, , Y
n
) = h(X
1
, X
2
, , X
n
) + log
2
[det(A)[
. .
=0
= h(X
1
, X
2
, , X
n
).
We therefore obtain using (5.2.6) that
h(X
1
, X
2
, , X
n
) =
1
2
log
2
[(2e)
n
det (K
X
)] ,
hence completing the proof.
An alternate (but rather mechanical) proof to the one presented above con-
sists of directly evaluating the joint dierential entropy of X by integrating
f
X
n(x
n
) log
2
f
X
n(x
n
) over R
n
; it is left as an exercise. 2
Corollary 5.19 (Hadamards inequality) For any real-valued nn positive-
denite matrix K = [K
i,j
]
i,j=1, ,n
,
det(K)
n

i=1
K
i,i
with equality i K is a diagonal matrix, where K
i,i
are the diagonal entries of
K.
Proof: Since every positive-denite matrix is a covariance matrix (e.g., see
[20]), let X = (X
1
, X
2
, , X
n
)
T
^
n
(0, K) be a jointly Gaussian random
vector with zero mean vector and covariance matrix K. Then
1
2
log
2
[(2e)
n
det(K)] = h(X
1
, X
2
, , X
n
) (5.2.7)

i=1
h(X
i
) (5.2.8)
=
n

i=1
1
2
log
2
[2eVar(X
i
)] (5.2.9)
128
=
1
2
log
2
_
(2e)
n
n

i=1
K
i,i
_
, (5.2.10)
where (5.2.7) follows from Theorem 5.18, (5.2.8) follows from Property 7 of
Lemma 5.14 and (5.2.9)-(5.2.10) hold using (5.1.1) along with the fact that
each random variable X
i
^(0, K
i,i
) is Gaussian with zero mean and variance
Var(X
i
) = K
i,i
for i = 1, 2, , n (as the marginals of a multivariate Gaussian
are also Gaussian (e,g., cf.[20])).
Finally, from (5.2.10), we directly obtain that
det(K)
n

i=1
K
i,i
,
with equality i the jointly Gaussian random variables X
1
, X
2
, . . ., X
n
are inde-
pendent from each other, or equivalently i the covariance matrix K is diagonal.
2
The next theorem states that among all real-valued size-n random vectors (of
support R
n
) with identical mean vector and covariance matrix, the Gaussian
random vector has the largest dierential entropy.
Theorem 5.20 (Maximal dierential entropy for real-valued random
vectors) Let X = (X
1
, X
2
, , X
n
)
T
be a real-valued random vector with
support S
X
n = R
n
, mean vector and covariance matrix K
X
. Then
h(X
1
, X
2
, , X
n
)
1
2
log
2
[(2e)
n
det(K
X
)] , (5.2.11)
with equality i X is Gaussian; i.e., X ^
n
_
, K
X
_
.
Proof: We will present the proof in two parts: the scalar or univariate case,
and the multivariate case.
(i) Scalar case (n = 1): For a real-valued random variable with support S
X
= R,
mean and variance
2
, let us show that
h(X)
1
2
log
2
_
2e
2
_
, (5.2.12)
with equality i X ^(,
2
).
For a Gaussian random variable Y ^(,
2
), using the non-negativity of
divergence, can write
0 D(X|Y )
129
=
_
R
f
X
(x) log
2
f
X
(x)
1

2
2
e

(x)
2
2
2
dx
= h(X) +
_
R
f
X
(x)
_
log
2
_

2
2
_
+
(x )
2
2
2
log
2
e
_
dx
= h(X) +
1
2
log
2
(2
2
) +
log
2
e
2
2
_
R
(x )
2
f
X
(x) dx
. .
=
2
= h(X) +
1
2
log
2
_
2e
2

.
Thus
h(X)
1
2
log
2
_
2e
2

,
with equality i X = Y (almost surely); i.e., X ^(,
2
).
(ii). Multivariate case (n > 1): As in the proof of Theorem 5.18, we can use an
orthogonal square matrix A (i.e., satisfying A
T
= A
1
and hence [det(A)[ = 1)
such that AK
X
A
T
is diagonal. Therefore, the random vector generated by the
linear map
Z = AX
will have a covariance matrix given by K
Z
= AK
X
A
T
and hence have uncorre-
lated (but not necessarily independent) components. Thus
h(X) = h(Z) log
2
[det(A)[
. .
=0
(5.2.13)
= h(Z
1
, Z
2
, , Z
n
)

i=1
h(Z
i
) (5.2.14)

i=1
1
2
log
2
[2eVar(Z
i
)] (5.2.15)
=
n
2
log
2
(2e) +
1
2
log
2
_
n

i=1
Var(Z
i
)
_
=
1
2
log
2
(2e)
n
+
1
2
log
2
[det (K
Z
)] (5.2.16)
=
1
2
log
2
(2e)
n
+
1
2
log
2
[det (K
X
)] (5.2.17)
=
1
2
log
2
[(2e)
n
det (K
X
)] ,
where (5.2.13) holds by Property 10 of Lemma 5.14 and since [det(A)[ = 1,
(5.2.14) follows from Property 7 of Lemma 5.14, (5.2.15) follows from (5.2.12)
130
(the scalar case above), (5.2.16) holds since K
Z
is diagonal, and (5.2.17) follows
from the fact that det (K
Z
) = det (K
X
) (as A is orthogonal). Finally, equality is
achieved in both (5.2.14) and (5.2.15) i the random variables Z
1
, Z
2
, . . ., Z
n
are
Gaussian and independent from each other, or equivalently i X ^
n
_
, K
X
_
.
2
Observation 5.21 The following two results can also be shown (the proof is
left as an exercise):
1. Among all continuous random variables admitting a pdf with support the
interval (a, b), where b > a are real numbers, the uniformly distributed
random variable maximizes dierential entropy.
2. Among all continuous random variables admitting a pdf with support the
interval [0, ) and nite mean , the exponential distribution with param-
eter (or rate parameter) = 1/ maximizes dierential entropy.
A systematic approach to nding distributions that maximize dierential entropy
subject to various support and moments constraints can be found in [12, 47].
5.3 AEP for continuous memoryless sources
The AEP theorem and its consequence for discrete memoryless (i.i.d.) sources
reveal to us that the number of elements in the typical set is approximately
2
nH(X)
, where H(X) is the source entropy, and that the typical set carries al-
most all the probability mass asymptotically (see Theorems 3.3 and 3.4). An
extension of this result from discrete to continuous memoryless sources by just
counting the number of elements in a continuous (typical) set dened via a law-
of-large-numbers argument is not possible, since the total number of elements
in a continuous set is innite. However, when considering the volume of that
continuous typical set (which is a natural analog to the size of a discrete set),
such an extension, with dierential entropy playing a similar role as entropy,
becomes straightforward.
Theorem 5.22 (AEP for continuous memoryless sources) Let X
i

i=1
be
a continuous memoryless source (i.e., an innite sequence of continuous i.i.d. ran-
dom variables) with pdf f
X
() and dierential entropy h(X). Then

1
n
log f
X
(X
1
, . . . , X
n
) E[log
2
f
X
(X)] = h(X) in probability.
Proof: The proof is an immediate result of the law of large numbers (e.g., see
Theorem 3.3). 2
131
Denition 5.23 (Typical set) For > 0 and any n given, dene the typical
set for the above continuous source as
T
n
()
_
x
n
R
n
:

1
n
log
2
f
X
(X
1
, . . . , X
n
) h(X)

<
_
.
Denition 5.24 (Volume) The volume of a set / R
n
is dened as
Vol(/)
_
A
dx
1
dx
n
.
Theorem 5.25 (Consequence of the AEP for continuous memoryless
sources) For a continuous memoryless source X
i

i=1
with dierential entropy
h(X), the following hold.
1. For n suciently large, P
X
n T
n
() > 1 .
2. Vol(T
n
()) 2
n(h(X)+)
for all n.
3. Vol(T
n
()) (1 )2
n(h(X))
for n suciently large.
Proof: The proof is quite analogous to the corresponding theorem for discrete
memoryless sources (Theorem 3.4) and is left as an exercise. 2
5.4 Capacity and channel coding theorem for the discrete-
time memoryless Gaussian channel
We next study the fundamental limits for error-free communication over the
discrete-time memoryless Gaussian channel, which is the most important cont-
inuous-alphabet channel and is widely used to model real-world wired and wire-
less channels. We rst state the denition of discrete-time continuous-alphabet
memoryless channels.
Denition 5.26 (Discrete-time continuous memoryless channels) Con-
sider a discrete-time channel with continuous input and output alphabets given
by A R and R, respectively, and described by a sequence of n-dimensional
transition (conditional) pdfs f
Y
n
|X
n(y
n
[x
n
)

n=1
that govern the reception of
y
n
= (y
1
, y
2
, , y
n
)
n
at the channel output when x
n
= (x
1
, x
2
, , x
n
)
A
n
is sent as the channel input.
The channel (without feedback) is said to be memoryless with a given (marginal)
transition pdf f
Y |X
if its sequence of transition pdfs f
Y
n
|X
n satises
f
Y
n
|X
n(y
n
[x
n
) =
n

i=1
f
Y |X
(y
i
[x
i
) (5.4.1)
132
for every n = 1, 2, , x
n
A
n
and y
n

n
.
In practice, the real-valued input to a continuous channel satises a certain
constraint or limitation on its amplitude or power; otherwise, one would have a
realistically implausible situation where the input can take on any value from the
uncountably innite set of real numbers. We will thus impose an average cost
constraint (t, P) on any input n-tuple x
n
= (x
1
, x
2
, , x
n
) transmitted over the
channel by requiring that
1
n
n

i=1
t(x
i
) P, (5.4.2)
where t() is a given non-negative real-valued function describing the cost for
transmitting an input symbol, and P is a given positive number representing
the maximal average amount of available resources per input symbol.
Denition 5.27 The capacity (or capacity-cost function) of a discrete-time con-
tinuous memoryless channel with input average cost constraint (t, P) is denoted
by C(P) and dened as
C(P) sup
F
X
:E[t(X)]P
I(X; Y ) (in bits/channel use) (5.4.3)
where the supremum is over all input distributions F
X
.
Lemma 5.28 (Concavity of capacity) If C(P) as dened in (5.4.3) is nite
for any P > 0, then it is concave, continuous and strictly increasing in P.
Proof: Fix P
1
> 0 and P
2
> 0. Then since C(P) is nite for any P > 0, then
by Property A.4.3, there exist two input distributions F
X
1
and F
X
2
such that for
all > 0,
I(X
i
; Y
i
) C(P
i
) (5.4.4)
and
E[t(X
i
)] P
i
(5.4.5)
where X
i
denotes the input with distribution F
X
i
and a corresponding channel
output given by Y
i
, for i = 1, 2. Now, for 0 1, let X

be a random variable
with distribution F
X

F
X
1
+ (1 )F
X
2
. Then by (5.4.5)
E
X

[t(X)] = E
X
1
[t(X)] + (1 )E
X
2
[t(X)] P
1
+ (1 )P
2
. (5.4.6)
Furthermore,
C(P
1
+ (1 )P
2
) = sup
{F
X
: E[t(X)]P
1
+(1)P
2
}
I(F
X
, f
Y |X
)
133
I(F
X

, f
Y |X
)
I(F
X
1
, f
Y |X
) + (1 )I(F
X
2
, f
Y |X
)
= I(X
1
; Y
1
) + (1 )I(X
2
: Y
2
)
C(P
1
) + (1 )C(P
2
) ,
where the rst inequality holds by (5.4.6), the second inequality follows from the
concavity of the mutual information with respect to its rst argument (cf. Lemma 2.46)
and the third inequality follows from (5.4.4). Letting 0 yields that
C(P
1
+ (1 )P
2
) C(P
1
) + (1 )C(P
2
)
and hence C(P) is concave in P.
Finally, it can directly be seen by denition that C() is non-decreasing,
which, together with its concavity, imply that it is continuous and strictly in-
creasing. 2
The most commonly used cost function is the power cost function, t(x) = x
2
,
resulting in the average power constraint P for each transmitted input n-tuple:
1
n
n

i=1
x
2
i
P. (5.4.7)
Throughout this chapter, we will adopt this average power constraint on the
channel input.
We herein focus on the discrete-time memoryless Gaussian channel with av-
erage input power constraint P and establish an operational meaning for the
channels capacity C(P) as the largest coding rate for achieving reliable com-
munication over the channel. The channel is described by the following additive
noise equation:
Y
i
= X
i
+ Z
i
, for i = 1, 2, , (5.4.8)
where Y
i
, X
i
and Z
i
are the channels output, input and noise at time i. The
input and noise processes are assumed to be independent from each other and
the noise source Z
i

i=1
is i.i.d. Gaussian with each Z
i
having mean zero and
variance
2
, Z
i
^(0,
2
). Since the noise process is i.i.d, we directly get
that the channel satises (5.4.1) and is hence memoryless, where the channels
transition pdf is explicitly given in terms of the noise pdf as follows:
f
Y |X
(y[x) = f
Z
(y x) =
1

2
2
e

(yx)
2
2
2
.
As mentioned above, we impose the average power constraint (5.4.7) on the
channel input.
134
Observation 5.29 The memoryless Gaussian channel is a good approximating
model for many practical channels such as radio, satellite and telephone line
channels. The additive noise is usually due to a multitude of causes, whose
cumulative eect can be approximated via the Gaussian distribution. This is
justied by the Central Limit Theorem which states that for an i.i.d. process
U
i
with mean and variance
2
,
1

n
i=1
(U
i
) converges in distribution as
n to a Gaussian distributed random variable with mean zero and variance

2
(see Appendix B).
Before proving the channel coding theorem for the above memoryless Gaus-
sian channel with input power constraint P, we rst show that its capacity C(P)
as dened in (5.4.3) with t(x) = x
2
admits a simple expression in terms of P
and the channel noise variance
2
. Indeed, we can write the channels mutual
information I(X; Y ) between its input and output as follows:
I(X; Y ) = h(Y ) h(Y [X)
= h(Y ) h(X + Z[X) (5.4.9)
= h(Y ) h(Z[X) (5.4.10)
= h(Y ) h(Z) (5.4.11)
= h(Y )
1
2
log
2
_
2e
2
_
, (5.4.12)
where (5.4.9) follows from (5.4.8), (5.4.10) holds since dierential entropy is
invariant under translation (see Property 8 of Lemma 5.14), (5.4.11) follows
from the independence of X and Z, and (5.4.12) holds since Z ^(0,
2
) is
Gaussian (see (5.1.1)). Now since Y = X + Z, we have that
E[Y
2
] = E[X
2
] + E[Z
2
] + 2E[X]E[Z] = E[X
2
] +
2
+ 2E[X](0) P +
2
since the input in (5.4.3) is constrained to satisfy E[X
2
] P. Thus the variance
of Y satises Var(Y ) E[Y
2
] P +
2
, and
h(Y )
1
2
log
2
(2eVar(Y ))
1
2
log
2
_
2e(P +
2
)
_
where the rst inequality follows by Theorem 5.20 since Y is real-valued (with
support R). Noting that equality holds in the rst inequality above i Y is
Gaussian and in the second inequality i Var(Y ) = P +
2
, we obtain that
choosing the input X as X ^(0, P) yields Y ^(0, P +
2
) and hence max-
imizes I(X; Y ) over all inputs satisfying E[X
2
] P. Thus, the capacity of the
discrete-time memoryless Gaussian channel with input average power constraint
P and noise variance (or power)
2
is given by
C(P) =
1
2
log
2
_
2e(P +
2
)
_

1
2
log
2
_
2e
2
_
135
=
1
2
log
2
_
1 +
P

2
_
. (5.4.13)
Denition 5.30 Given positive integers n and M, and a discrete-time memory-
less Gaussian channel with input average power constraint P, a xed-length data
transmission code (or block code) (
n
= (n, M) for this channel with blocklength
n and rate
1
n
log
2
M message bits per channel symbol (or channel use) consists
of:
1. M information messages intended for transmission.
2. An encoding function
f : 1, 2, . . . , M R
n
yielding real-valued codewords c
1
= f(1), c
2
= f(2), , c
M
= f(M),
where each codeword c
m
= (c
m1
, . . . , c
mn
) is of length n and satises the
power constraint P
1
n
n

i=1
c
2
i
P,
for m = 1, 2, , M. The set of these M codewords is called the codebook
and we usually write (
n
= c
1
, c
2
, , c
M
to list the codewords.
3. A decoding function g : R
n
1, 2, . . . , M.
As in Chapter 4, we assume that a message W follows a uniform distribution
over the set of messages: Pr[W = w] =
1
M
for all w 1, 2, . . . , M. Similarly,
to convey message W over the channel, the encoder sends its corresponding
codeword X
n
= f(W) (
n
at the channel input. Finally, Y
n
is received at
the channel output and the decoder yields

W = g(Y
n
) as the message estimate.
Also, the average probability of error for this block code used over the memoryless
Gaussian channel is dened as
P
e
( (
n
)
1
M
M

w=1

w
( (
n
),
where

w
( (
n
) Pr[

W ,= W[W = w] = Pr[g(Y
n
) ,= w[X
n
= f(w)]
=
_
y
n
R
n
: g(y
n
)=w
f
Y
n
|X
n(y
n
[f(w)) dy
n
136
is the codes conditional probability of decoding error given that message w is
sent over the channel. Here f
Y
n
|X
n(y
n
[x
n
) =

n
i=1
f
Y |X
(y
i
[x
i
) as the channel is
memoryless, where f
Y |X
is the channels transition pdf.
We next prove that for a memoryless Gaussian channel with input average
power constraint P, its capacity C(P) has an operational meaning in the sense
that it is the supremum of all rates for which there exists a sequence of data
transmission block codes satisfying the power constraint and having a probability
of error that vanishes with increasing blocklength.
Theorem 5.31 (Shannons coding theorem for the memoryless Gaus-
sian channel) Consider a discrete-time memoryless Gaussian channel with
input average power constraint P, channel noise variance
2
and capacity C(P)
as given by (5.4.13).
Forward part (achievability): For any (0, 1), there exist 0 < < 2 and
a sequence of data transmission block code (
n
= (n, M
n
)

n=1
satisfying
1
n
log
2
M
n
> C(P)
with each codeword c = (c
1
, c
2
, . . . , c
n
) in (
n
satisfying
1
n
n

i=1
c
2
i
P (5.4.14)
such that the probability of error P
e
( (
n
) < for suciently large n.
Converse part: If for any sequence of data transmission block codes (
n
=
(n, M
n
)

n=1
whose codewords satisfy (5.4.14), we have that
liminf
n
1
n
log
2
M
n
> C(P),
then the codes probability of error P
e
( (
n
) is bounded away from zero for
all n suciently large.
Proof of the forward part: The theorem holds trivially when C(P) = 0
because we can choose M
n
= 1 for every n and have P
e
( (
n
) = 0. Hence, we
assume without loss of generality C(P) > 0.
Step 0:
Take a positive satisfying < min2, C(P). Pick > 0 small enough
such that 2[C(P) C(P )] < , where the existence of such is assured
137
by the strictly increasing property of C(P). Hence, we have C(P )
/2 > C(P) > 0. Choose M
n
to satisfy
C(P )

2
>
1
n
log
2
M
n
> C(P) ,
for which the choice should exist for all suciently large n. Take = /8.
Let F
X
be the distribution that achieves C(P ), where C(P) is given
by (5.4.13). In this case, F
X
is the Gaussian distribution with mean zero
and variance P and admits a pdf f
X
. Hence, E[X
2
] P and
I(X; Y ) = C(P ).
Step 1: Random coding with average power constraint.
Randomly draw M
n
codewords according to pdf f
X
n with
f
X
n(x
n
) =
n

i=1
f
X
(x
i
).
By law of large numbers, each randomly selected codeword
c
m
= (c
m1
, . . . , c
mn
)
satises
lim
n
1
n
n

i=1
c
2
mi
= E[X
2
] P almost surely
for m = 1, 2, . . . , M
n
1.
Step 2: Code construction.
For M
n
selected codewords c
1
, . . . , c
Mn
, replace the codewords that vio-
late the power constraint (i.e., (5.4.14)) by an all-zero (default) codeword
0. Dene the encoder as
f
n
(m) = c
m
for 1 m M
n
.
Given a received output sequence y
n
, the decoder g
n
() is given by
g
n
(y
n
) =
_
_
_
m, if (c
m
, y
n
) T
n
()
and ( m

,= m) (c
m
, y
n
) , T
n
(),
arbitrary, otherwise,
where the set
T
n
()
_
(x
n
, y
n
) A
n

n
:

1
n
log
2
f
X
n
Y
n(x
n
, y
n
) h(X, Y )

< ,
138

1
n
log
2
f
X
n(x
n
) h(X)

< ,
and

1
n
log
2
f
Y
n(y
n
) h(Y )

<
_
is generated by f
X
n
Y
n(x
n
, y
n
) =

n
i=1
f
XY
(x
i
, y
i
) where f
X
n
Y
n(x
n
, y
n
) is
the joint input-output pdf realized when the memoryless Gaussian channel
(with n-fold transition pdf f
Y
n
|X
n(y
n
[x
n
) =

n
i=1
f
Y |X
(y
i
[x
i
)) is driven by
input X
n
with pdf f
X
n(x
n
) =

n
i=1
f
X
(x
i
) (where f
X
achieves C(P )).
Step 3: Conditional probability of error.
Let
m
denote the conditional error probability given codeword m is trans-
mitted. Dene
c
0

_
x
n
A
n
:
1
n
n

i=1
x
2
i
> P
_
.
Then by following similar argument as (4.3.2), we get:
E[
m
] P
X
n(c
0
) + P
X
n
,Y
n (T
c
n
())
+
Mn

=1
m

=m
_
cmX
n
_
y
n
Fn(|c
m
)
f
X
n
,Y
n(c
m
, y
n
) dc
m
dy
n
, (5.4.15)
where
T
n
([x
n
) y
n

n
: (x
n
, y
n
) T
n
() .
Note that the additional term P
X
n(c
0
) in (5.4.15) is to cope with the
errors due to all-zero codeword replacement, which will be less than for
all suciently large n by the law of large numbers. Finally, by carrying
out a similar procedure as in the proof of the channel coding theorem for
discrete channels (cf. page 99), we obtain:
E[P
e
(C
n
)] P
X
n(c
0
) + P
X
n
,Y
n (T
c
n
())
+M
n
2
n(h(X,Y )+)
2
n(h(X))
2
n(h(Y ))
P
X
n(c
0
) + P
X
n
,Y
n (T
c
n
()) + 2
n(C(P)4)
2
n(I(X;Y )3)
= P
X
n(c
0
) + P
X
n
,Y
n (T
c
n
()) + 2
n
.
Accordingly, we can make the average probability of error, E[P
e
(C
n
)], less
than 3 = 3/8 < 3/4 < for all suciently large n. 2
Proof of the converse part: Consider an (n, M
n
) block data transmission
code satisfying the power constraint (5.4.14) with encoding function
f
n
: 1, 2, . . . , M
n
A
n
139
and decoding function
g
n
:
n
1, 2, . . . , M
n
.
Since the message W is uniformly distributed over 1, 2, . . . , M
n
, we have
H(W) = log
2
M
n
. Since W X
n
= f
n
(W) Y
n
forms a Markov chain
(as Y
n
only depends on X
n
), we obtain by the data processing lemma that
I(W; Y
n
) I(X
n
; Y
n
). We can also bound I(X
n
; Y
n
) by C(P) as follows:
I(X
n
; Y
n
) sup
{F
X
n : (1/n)
P
n
i=1
E[X
2
i
]P}
I(X
n
; Y
n
)
sup
{F
X
n : (1/n)
P
n
i=1
E[X
2
i
]P}
n

j=1
I(X
j
; Y
j
), (by the bound on mutual
information for memoryless channels as in Theorem 2.21)
= sup
{(P
1
,P
2
,...,Pn) : (1/n)
P
n
i=1
P
i
=P}
sup
{F
X
n : ( i) E[X
2
i
]P
i
}
n

j=1
I(X
j
; Y
j
)
sup
{(P
1
,P
2
,...,Pn) : (1/n)
P
n
i=1
P
i
=P}
n

j=1
sup
{F
X
n : ( i) E[X
2
i
]P
i
}
I(X
j
; Y
j
)
sup
{(P
1
,P
2
,...,Pn) : (1/n)
P
n
i=1
P
i
=P}
n

j=1
sup
{F
X
j
: E[X
2
j
]P
j
}
I(X
j
; Y
j
)
= sup
{(P
1
,P
2
,...,Pn):(1/n)
P
n
i=1
P
i
=P}
n

j=1
C(P
j
)
= sup
{(P
1
,P
2
,...,Pn):(1/n)
P
n
i=1
P
i
=P}
n
n

j=1
1
n
C(P
j
)
sup
{(P
1
,P
2
,...,Pn):(1/n)
P
n
i=1
P
i
=P}
nC
_
1
n
n

j=1
P
j
_
(by concavity of C(P))
= nC(P).
Consequently, recalling that P
e
( (
n
) is the average error probability incurred by
guessing W from observing Y
n
via the decoding function g
n
:
n
1, 2, . . . , M
n
,
we get
log
2
M
n
= H(W)
= H(W[Y
n
) + I(W; Y
n
)
H(W[Y
n
) + I(X
n
; Y
n
)
h
b
(P
e
( (
n
)) + P
e
( (
n
) log
2
([[ 1) + nC(P),
(by Fano

s inequality)
140
1 + P
e
( (
n
) log
2
(M
n
1) + nC(P),
(by the fact that ( t [0, 1]) h
b
(t) 1)
1 + P
e
( (
n
) log
2
M
n
+ nC(P),
which implies that
P
e
( (
n
) 1
C(P)
(1/n) log
2
M
n

1
log
2
M
n
.
So if liminf
n
(1/n) log
2
M
n
> C(P), then there exists > 0 and an integer N
such that for n N,
1
n
log
2
M
n
> C(P) + .
Hence, for n N
0
maxN, 2/,
P
e
( (
n
) 1
C(P)
C(P) +

1
n(C(P) + )


2(C(P) +)
.
2
We next show that among all power-constrained continuous memoryless chan-
nels with additive noise admitting a pdf, choosing a Gaussian distributed noise
yields the smallest channel capacity. In other words, the memoryless Gaussian
model results in the most pessimistic (smallest) capacity within the class of
additive-noise continuous memoryless channels.
Theorem 5.32 (Gaussian noise minimizes capacity of additive-noise
channels) Every discrete-time continuous memoryless channel with additive
noise (admitting a pdf) of mean zero and variance
2
and input average power
constraint P has its capacity C(P) lower bounded by the capacity of the mem-
oryless Gaussian channel with identical input constraint and noise variance:
C(P)
1
2
log
2
_
1 +
P

2
_
.
Proof: Let f
Y |X
and f
Yg|Xg
denote the transition pdfs of the additive-noise chan-
nel and the Gaussian channel, respectively, where both channels satisfy input
average power constraint P. Let N and N
g
respectively denote their zero-mean
noise variables of identical variance
2
. Writing the mutual information in terms
of the channels transition pdf and input distribution as in Lemma 2.46, then
for any Gaussian input with pdf f
Xg
with corresponding outputs Y and Y
g
when
applied to channels f
Y |X
and f
Yg|Xg
, respectively, we have that
$$\begin{aligned}
&I(f_{X_g}, f_{Y|X}) - I(f_{X_g}, f_{Y_g|X_g})\\
&= \int_{\mathcal{X}}\int_{\mathcal{Y}} f_{X_g}(x) f_N(y-x) \log_2 \frac{f_N(y-x)}{f_Y(y)}\, dy\, dx
 - \int_{\mathcal{X}}\int_{\mathcal{Y}} f_{X_g}(x) f_{N_g}(y-x) \log_2 \frac{f_{N_g}(y-x)}{f_{Y_g}(y)}\, dy\, dx\\
&= \int_{\mathcal{X}}\int_{\mathcal{Y}} f_{X_g}(x) f_N(y-x) \log_2 \frac{f_N(y-x)}{f_Y(y)}\, dy\, dx
 - \int_{\mathcal{X}}\int_{\mathcal{Y}} f_{X_g}(x) f_N(y-x) \log_2 \frac{f_{N_g}(y-x)}{f_{Y_g}(y)}\, dy\, dx\\
&= \int_{\mathcal{X}}\int_{\mathcal{Y}} f_{X_g}(x) f_N(y-x) \log_2 \frac{f_N(y-x)\, f_{Y_g}(y)}{f_{N_g}(y-x)\, f_Y(y)}\, dy\, dx\\
&\ge \int_{\mathcal{X}}\int_{\mathcal{Y}} f_{X_g}(x) f_N(y-x)\,(\log_2 e)\left[1 - \frac{f_{N_g}(y-x)\, f_Y(y)}{f_N(y-x)\, f_{Y_g}(y)}\right] dy\, dx\\
&= (\log_2 e)\left[1 - \int_{\mathcal{Y}} \frac{f_Y(y)}{f_{Y_g}(y)}\left(\int_{\mathcal{X}} f_{X_g}(x) f_{N_g}(y-x)\, dx\right) dy\right]\\
&= 0,
\end{aligned}$$
where the second equality holds since $\log_2 f_{N_g}(\cdot)$ and $\log_2 f_{Y_g}(\cdot)$ are quadratic functions and the pairs $(N, N_g)$ and $(Y, Y_g)$ have identical first and second moments, the inequality follows from the fundamental inequality $\log_2 z \ge (\log_2 e)(1 - 1/z)$ for $z > 0$, and equality holds in the inequality iff $f_Y(y)/f_{Y_g}(y) = f_N(y-x)/f_{N_g}(y-x)$ for all $x$. Therefore,
$$\begin{aligned}
\frac{1}{2}\log_2\left(1 + \frac{P}{\sigma^2}\right)
&= \sup_{\{F_X :\, E[X^2] \le P\}} I(F_X, f_{Y_g|X_g})\\
&= I(f^*_{X_g}, f_{Y_g|X_g})\\
&\le I(f^*_{X_g}, f_{Y|X})\\
&\le \sup_{\{F_X :\, E[X^2] \le P\}} I(F_X, f_{Y|X})\\
&= C(P). \qquad \Box
\end{aligned}$$
Observation 5.33 (Channel coding theorem for continuous memoryless channels) We close this section by noting that Theorem 5.31 can be generalized to a wide class of discrete-time continuous memoryless channels with input cost constraint (5.4.2), where the cost function $t(\cdot)$ is arbitrary, by showing that $C(P) \triangleq \sup_{F_X :\, E[t(X)] \le P} I(X; Y)$ is the largest rate for which there exist block codes for the channel satisfying (5.4.2) which are reliably good (i.e., with asymptotically vanishing error probability). The proof is quite similar to that of Theorem 5.31, except that some modifications are needed in the forward part since, for a general (non-Gaussian) channel, the input distribution $F_X$ used to construct the random code may not admit a pdf (e.g., cf. [18, Chapter 7], [47, Theorem 11.14]).
5.5 Capacity of uncorrelated parallel Gaussian channels: The water-filling principle

Consider a network of $k$ mutually-independent discrete-time memoryless Gaussian channels with respective positive noise powers (variances) $\sigma_1^2, \sigma_2^2, \ldots, \sigma_k^2$. If one wants to transmit information using these channels simultaneously (in parallel), what will be the system's channel capacity, and how should the signal powers for each channel be apportioned given a fixed overall power budget? The answer to this question lies in the so-called water-filling or water-pouring principle.
Theorem 5.34 (Capacity of uncorrelated parallel Gaussian channels) The capacity of $k$ uncorrelated parallel Gaussian channels under an overall input power constraint $P$ is given by
$$C(P) = \sum_{i=1}^k \frac{1}{2}\log_2\left(1 + \frac{P_i}{\sigma_i^2}\right),$$
where $\sigma_i^2$ is the noise variance of channel $i$,
$$P_i = \max\{0,\ \theta - \sigma_i^2\},$$
and $\theta$ is chosen to satisfy $\sum_{i=1}^k P_i = P$. This capacity is achieved by a tuple of independent Gaussian inputs $(X_1, X_2, \ldots, X_k)$, where $X_i \sim \mathcal{N}(0, P_i)$ is the input to channel $i$, for $i = 1, 2, \ldots, k$.
Proof: By definition,
$$C(P) = \sup_{\{F_{X^k} :\, \sum_{i=1}^k E[X_i^2] \le P\}} I(X^k; Y^k).$$
Since the noise random variables $N_1, \ldots, N_k$ are independent of each other,
$$\begin{aligned}
I(X^k; Y^k) &= h(Y^k) - h(Y^k|X^k)\\
&= h(Y^k) - h(N^k + X^k|X^k)\\
&= h(Y^k) - h(N^k|X^k)\\
&= h(Y^k) - h(N^k)\\
&= h(Y^k) - \sum_{i=1}^k h(N_i)\\
&\le \sum_{i=1}^k h(Y_i) - \sum_{i=1}^k h(N_i)\\
&\le \sum_{i=1}^k \frac{1}{2}\log_2\left(1 + \frac{P_i}{\sigma_i^2}\right),
\end{aligned}$$
where the first inequality follows from the chain rule for differential entropy and the fact that conditioning cannot increase differential entropy, and the second inequality holds since the output $Y_i$ of channel $i$ due to input $X_i$ with $E[X_i^2] = P_i$ has its differential entropy maximized if it is Gaussian distributed with zero mean and variance $P_i + \sigma_i^2$. Equalities hold above if all the inputs $X_i$ are independent of each other with each $X_i \sim \mathcal{N}(0, P_i)$ such that $\sum_{i=1}^k P_i = P$.
Thus the problem is reduced to finding the power allotment that maximizes the overall capacity subject to the constraint $\sum_{i=1}^k P_i = P$. By using the Lagrange multiplier technique and verifying the KKT conditions (see Example ?? in Appendix ??), the maximizer of
$$\sum_{i=1}^k \frac{1}{2}\log_2\left(1 + \frac{P_i}{\sigma_i^2}\right) + \lambda\left(\sum_{i=1}^k P_i - P\right)$$
can be found by taking the derivative of the above expression (with respect to $P_i \ge 0$) and setting it to zero, which yields
$$\begin{cases}
\dfrac{1}{2}\,\dfrac{\log_2 e}{P_i + \sigma_i^2} + \lambda = 0, & \text{if } P_i > 0;\\[2mm]
\dfrac{1}{2}\,\dfrac{\log_2 e}{P_i + \sigma_i^2} + \lambda \le 0, & \text{if } P_i = 0.
\end{cases}$$
Hence,
$$\begin{cases}
P_i = \theta - \sigma_i^2, & \text{if } P_i > 0;\\
P_i \ge \theta - \sigma_i^2, & \text{if } P_i = 0,
\end{cases}$$
i.e., $P_i = \max\{0,\ \theta - \sigma_i^2\}$, where $\theta \triangleq -(\log_2 e)/(2\lambda)$ is chosen to satisfy $\sum_{i=1}^k P_i = P$. $\Box$
We illustrate the above result in Fig. 5.1 and elucidate why the $P_i$ power allotments form a water-filling (or water-pouring) scheme. In the figure, we have a vessel where the height of each of the solid bins represents the noise power of each channel (while the width is set to unity, so that the area of each bin equals the noise power of the corresponding Gaussian channel). We can thus visualize the system as a vessel with an uneven bottom, where the optimal input signal allocation $P_i$ to each channel is realized by pouring $P$ units of water into the vessel (with the resulting overall area of filled water equal to $P$). Since the vessel has an uneven bottom, the water is unevenly distributed among the bins: noisier channels are allotted less signal power (note that in this example, channel 3, whose noise power is largest, is given no input power at all and is hence not used).
Figure 5.1: The water-pouring scheme for uncorrelated parallel Gaussian channels (noise powers $\sigma_1^2, \sigma_2^2, \sigma_3^2, \sigma_4^2$; allotted powers $P = P_1 + P_2 + P_4$, with channel 3 unused). The horizontal dashed line, which indicates the level to which the water rises, corresponds to the value of $\theta$ for which $\sum_{i=1}^k P_i = P$.
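To make the water-filling allotment concrete, here is a minimal Python sketch of the computation in Theorem 5.34; the bisection search on the water level $\theta$ and all function names are our own illustrative choices, not part of the text.

```python
import math

def water_filling(noise_vars, P, tol=1e-12):
    """Return the power allotments P_i = max(0, theta - sigma_i^2) with sum(P_i) = P,
    found by bisection on the water level theta."""
    lo, hi = min(noise_vars), max(noise_vars) + P   # theta lies in this interval
    while hi - lo > tol:
        theta = 0.5 * (lo + hi)
        used = sum(max(0.0, theta - s2) for s2 in noise_vars)
        if used > P:
            hi = theta
        else:
            lo = theta
    theta = 0.5 * (lo + hi)
    return [max(0.0, theta - s2) for s2 in noise_vars], theta

def parallel_capacity(noise_vars, powers):
    """Capacity sum_i 0.5*log2(1 + P_i/sigma_i^2) in bits/channel use."""
    return sum(0.5 * math.log2(1.0 + p / s2) for p, s2 in zip(powers, noise_vars))

# Example resembling Fig. 5.1: the noisiest channel receives no power.
noise_vars = [1.0, 2.0, 6.0, 0.5]
powers, theta = water_filling(noise_vars, P=4.0)
print("theta =", round(theta, 4), " powers =", [round(p, 4) for p in powers])
print("capacity =", round(parallel_capacity(noise_vars, powers), 4), "bits/channel use")
```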
5.6 Capacity of correlated parallel Gaussian channels

In the previous section, we considered a network of $k$ parallel discrete-time memoryless Gaussian channels in which the noise samples from different channels are independent of each other. We found that the power allocation strategy that maximizes the system's capacity is given by the water-filling scheme. We next study a network of $k$ parallel memoryless Gaussian channels where the noise variables from different channels are correlated. Perhaps surprisingly, water-filling again provides the optimal power allotment policy, now applied along the eigenvalues of the noise covariance matrix.
Let $K_N$ denote the covariance matrix of the noise tuple $(N_1, N_2, \ldots, N_k)$, and let $K_X$ denote the covariance matrix of the system input $(X_1, \ldots, X_k)$, where we assume (without loss of generality) that each $X_i$ has zero mean. We assume that $K_N$ is positive definite. The input power constraint becomes
$$\sum_{i=1}^k E[X_i^2] = \operatorname{tr}(K_X) \le P,$$
where $\operatorname{tr}(\cdot)$ denotes the trace of the $k \times k$ matrix $K_X$. Since in each channel the input and noise variables are independent of each other, we have
$$\begin{aligned}
I(X^k; Y^k) &= h(Y^k) - h(Y^k|X^k)\\
&= h(Y^k) - h(N^k + X^k|X^k)\\
&= h(Y^k) - h(N^k|X^k)\\
&= h(Y^k) - h(N^k).
\end{aligned}$$
Since $h(N^k)$ is not determined by the input, determining the system's capacity reduces to maximizing $h(Y^k)$ over all possible inputs $(X_1, \ldots, X_k)$ satisfying the power constraint.
Now observe that the covariance matrix of $Y^k$ is equal to $K_Y = K_X + K_N$, which implies by Theorem 5.20 that the differential entropy of $Y^k$ is upper bounded by
$$h(Y^k) \le \frac{1}{2}\log_2\left[(2\pi e)^k \det(K_X + K_N)\right],$$
with equality iff $Y^k$ is Gaussian. It remains to find out whether we can find inputs $(X_1, \ldots, X_k)$ satisfying the power constraint which achieve the above upper bound and maximize it.
As in the proof of Theorem 5.18, we can orthogonally diagonalize $K_N$ as
$$K_N = A \Lambda A^T,$$
where $A A^T = I_k$ (and thus $\det(A)^2 = 1$), $I_k$ is the $k \times k$ identity matrix, and $\Lambda$ is a diagonal matrix with positive diagonal components consisting of the eigenvalues of $K_N$ (as $K_N$ is positive definite). Then
$$\begin{aligned}
\det(K_X + K_N) &= \det(K_X + A \Lambda A^T)\\
&= \det(A A^T K_X A A^T + A \Lambda A^T)\\
&= \det(A)\det(A^T K_X A + \Lambda)\det(A^T)\\
&= \det(A^T K_X A + \Lambda)\\
&= \det(B + \Lambda),
\end{aligned}$$
where $B \triangleq A^T K_X A$. Since for any two matrices $C$ and $D$, $\operatorname{tr}(CD) = \operatorname{tr}(DC)$, we have that $\operatorname{tr}(B) = \operatorname{tr}(A^T K_X A) = \operatorname{tr}(A A^T K_X) = \operatorname{tr}(I_k K_X) = \operatorname{tr}(K_X)$. Thus the capacity problem is further transformed to maximizing $\det(B + \Lambda)$ subject to $\operatorname{tr}(B) \le P$.
By observing that $B + \Lambda$ is positive definite (because $\Lambda$ is positive definite) and using Hadamard's inequality given in Corollary 5.19, we have
$$\det(B + \Lambda) \le \prod_{i=1}^k (B_{ii} + \lambda_i),$$
where $\lambda_i$ is the component of the matrix $\Lambda$ located at the $i$-th row and $i$-th column, which is exactly the $i$-th eigenvalue of $K_N$. Thus, the maximum value of $\det(B + \Lambda)$ under $\operatorname{tr}(B) \le P$ is realized by a diagonal matrix $B$ (to achieve equality in Hadamard's inequality) with
$$\sum_{i=1}^k B_{ii} = P.$$
Finally, as in the proof of Theorem 5.34, we obtain a water-filling allotment for the optimal diagonal elements of $B$:
$$B_{ii} = \max\{0,\ \theta - \lambda_i\},$$
where $\theta$ is chosen to satisfy $\sum_{i=1}^k B_{ii} = P$. We summarize this result in the next theorem.
Theorem 5.35 (Capacity of correlated parallel Gaussian channels) The capacity of $k$ correlated parallel Gaussian channels with positive-definite noise covariance matrix $K_N$ under overall input power constraint $P$ is given by
$$C(P) = \sum_{i=1}^k \frac{1}{2}\log_2\left(1 + \frac{P_i}{\lambda_i}\right),$$
where $\lambda_i$ is the $i$-th eigenvalue of $K_N$,
$$P_i = \max\{0,\ \theta - \lambda_i\},$$
and $\theta$ is chosen to satisfy $\sum_{i=1}^k P_i = P$. This capacity is achieved by a zero-mean Gaussian input vector $(X_1, X_2, \ldots, X_k)$ with covariance matrix $K_X = A\,\mathrm{diag}(P_1, \ldots, P_k)\,A^T$, where $A$ orthogonally diagonalizes $K_N$ as above; equivalently, the transformed inputs $(A^T X^k)_i \sim \mathcal{N}(0, P_i)$, $i = 1, 2, \ldots, k$, are independent.
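As a hedged illustration of Theorem 5.35 (our own sketch, analogous to the earlier water-filling example and not from the text), the code below diagonalizes $K_N$ with NumPy, water-fills over its eigenvalues, and reports the resulting capacity and input covariance.

```python
import numpy as np

def correlated_capacity(K_N, P, tol=1e-12):
    """Capacity of k correlated parallel Gaussian channels (Theorem 5.35):
    water-fill over the eigenvalues of the noise covariance matrix K_N."""
    eigvals, A = np.linalg.eigh(K_N)          # K_N = A diag(eigvals) A^T
    lo, hi = float(eigvals.min()), float(eigvals.max()) + P
    while hi - lo > tol:                      # bisection on the water level theta
        theta = 0.5 * (lo + hi)
        if np.maximum(0.0, theta - eigvals).sum() > P:
            hi = theta
        else:
            lo = theta
    theta = 0.5 * (lo + hi)
    powers = np.maximum(0.0, theta - eigvals)
    C = 0.5 * np.log2(1.0 + powers / eigvals).sum()
    K_X = A @ np.diag(powers) @ A.T           # capacity-achieving input covariance
    return C, powers, K_X

K_N = np.array([[2.0, 0.8],
                [0.8, 1.0]])                  # positive-definite noise covariance
C, powers, K_X = correlated_capacity(K_N, P=3.0)
print("eigen-domain powers:", np.round(powers, 4))
print("capacity:", round(C, 4), "bits/channel use")
```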
5.7 Non-Gaussian discrete-time memoryless channels

If a discrete-time channel has additive but non-Gaussian memoryless noise and an input power constraint, then it is often hard to calculate its capacity. Hence, in this section, we introduce an upper bound and a lower bound on the capacity of such a channel (we assume that the noise admits a pdf).

Definition 5.36 (Entropy power) For a continuous random variable $N$ with (well-defined) differential entropy $h(N)$ (measured in bits), its entropy power is denoted by $N_e$ and defined as
$$N_e \triangleq \frac{1}{2\pi e}\, 2^{2 h(N)}.$$
Lemma 5.37 For a discrete-time continuous-alphabet memoryless additive-noise channel with input power constraint $P$ and noise variance $\sigma^2$, its capacity satisfies
$$\frac{1}{2}\log_2\frac{P + \sigma^2}{N_e} \ \ge\ C(P)\ \ge\ \frac{1}{2}\log_2\frac{P + \sigma^2}{\sigma^2}. \qquad (5.7.1)$$

Proof: The lower bound in (5.7.1) was already proved in Theorem 5.32. The upper bound follows from
$$I(X; Y) = h(Y) - h(N) \le \frac{1}{2}\log_2\left[2\pi e (P + \sigma^2)\right] - \frac{1}{2}\log_2\left[2\pi e\, N_e\right]. \qquad \Box$$
The entropy power of $N$ can be viewed as the variance of a Gaussian random variable having the same differential entropy as $N$. Indeed, if $N$ is Gaussian, then its entropy power is equal to
$$N_e = \frac{1}{2\pi e}\, 2^{2 h(N)} = \mathrm{Var}(N),$$
as expected.
Whenever two independent Gaussian random variables $N_1$ and $N_2$ are added, the power (variance) of the sum is equal to the sum of the powers (variances) of $N_1$ and $N_2$. This relationship can then be written as
$$2^{2h(N_1 + N_2)} = 2^{2h(N_1)} + 2^{2h(N_2)},$$
or equivalently
$$\mathrm{Var}(N_1 + N_2) = \mathrm{Var}(N_1) + \mathrm{Var}(N_2).$$
However, for general independent random variables (not necessarily Gaussian), the relationship becomes
$$2^{2h(N_1 + N_2)} \ge 2^{2h(N_1)} + 2^{2h(N_2)}, \qquad (5.7.2)$$
or equivalently
$$N_e(N_1 + N_2) \ge N_e(N_1) + N_e(N_2). \qquad (5.7.3)$$
Inequality (5.7.2) (or equivalently (5.7.3)), whose proof can be found in [12, Section 17.8] or [9, Theorem 7.10.4], is called the entropy-power inequality. It reveals that the entropy power of the sum of two independent random variables may exceed the sum of their individual entropy powers, except in the Gaussian case.
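As a small numerical illustration (our own, not from the text), the snippet below computes the entropy power of a uniform noise variable and evaluates the two capacity bounds of Lemma 5.37; the closed-form differential entropy of the uniform distribution, $h(N) = \log_2(b - a)$ bits, is standard.

```python
import math

def entropy_power(h_bits):
    """Entropy power N_e = 2^{2 h(N)} / (2 pi e), with h(N) given in bits."""
    return (2.0 ** (2.0 * h_bits)) / (2.0 * math.pi * math.e)

# Uniform noise on [-c, c]: h(N) = log2(2c) bits, variance c^2/3.
c = math.sqrt(3.0)                  # chosen so that the noise variance equals 1
h_N = math.log2(2.0 * c)
sigma2 = c * c / 3.0
Ne = entropy_power(h_N)
print("variance:", round(sigma2, 4), " entropy power:", round(Ne, 4))   # Ne < variance (non-Gaussian)

P = 1.0
lower = 0.5 * math.log2((P + sigma2) / sigma2)   # Gaussian-noise capacity (Theorem 5.32)
upper = 0.5 * math.log2((P + sigma2) / Ne)       # upper bound of Lemma 5.37
print("bounds on C(P):", round(lower, 4), "<= C(P) <=", round(upper, 4), "bits/channel use")
```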
5.8 Capacity of the band-limited white Gaussian channel
We have so far considered discrete-time channels (with discrete or continuous
alphabets). We close this chapter by briefly presenting the capacity expression of the continuous-time (waveform) band-limited channel with additive white Gaussian noise. The reader is referred to [?], [18, Chapter 8], [2, Sections 8.2 and 8.3] and [22, Chapter 6] for rigorous and detailed treatments (including coding theorems) of waveform channels.
The continuous-time band-limited channel with additive white Gaussian noise
is a common model for a radio network or a telephone line. For such a channel,
illustrated in Fig. 5.2, the output waveform is given by
$$Y(t) = (X(t) + Z(t)) * h(t), \qquad t \ge 0,$$
where $*$ represents the convolution operation (recall that the convolution between two signals $a(t)$ and $b(t)$ is defined as $a(t) * b(t) = \int_{-\infty}^{\infty} a(\tau)\, b(t - \tau)\, d\tau$).
Here $X(t)$ is the channel input waveform with average power constraint
$$\lim_{T \to \infty} \frac{1}{T} \int_0^T X(t)^2\, dt \le P$$
and bandwidth $W$ cycles per second or Hertz (Hz); i.e., its spectrum or Fourier transform $X(f) \triangleq \mathcal{F}[X(t)] = \int_{-\infty}^{+\infty} X(t)\, e^{-j 2\pi f t}\, dt = 0$ for all frequencies $|f| > W$, where $j = \sqrt{-1}$ is the imaginary unit. $Z(t)$ is the noise waveform of a zero-mean stationary white Gaussian process with power spectral density $N_0/2$; i.e., its power spectral density $\mathrm{PSD}_Z(f)$, which is the Fourier transform of the process covariance function $K_Z(\tau) \triangleq E[Z(s+\tau)Z(s)] - E[Z(s+\tau)]E[Z(s)] = E[Z(s+\tau)Z(s)]$, $s, \tau \in \mathbb{R}$, is given by
$$\mathrm{PSD}_Z(f) = \mathcal{F}[K_Z(t)] = \int_{-\infty}^{+\infty} K_Z(t)\, e^{-j 2\pi f t}\, dt = \frac{N_0}{2} \qquad \forall\, f.$$
Finally, $h(t)$ is the impulse response of an ideal bandpass filter with cutoff frequencies at $\pm W$ Hz:
$$H(f) = \mathcal{F}[h(t)] = \begin{cases} 1 & \text{if } -W \le f \le W,\\ 0 & \text{otherwise.} \end{cases}$$
Recall that one can recover $h(t)$ by taking the inverse Fourier transform of $H(f)$; this yields
$$h(t) = \mathcal{F}^{-1}[H(f)] = \int_{-\infty}^{+\infty} H(f)\, e^{j 2\pi f t}\, df = 2W \operatorname{sinc}(2Wt),$$
where
$$\operatorname{sinc}(t) \triangleq \frac{\sin(\pi t)}{\pi t}$$
is the sinc function, defined to equal 1 at $t = 0$ by continuity.
Note that we can write the channel output as
$$Y(t) = X(t) + \tilde{Z}(t),$$
Figure 5.2: Band-limited waveform channel with additive white Gaussian noise.
where $\tilde{Z}(t) \triangleq Z(t) * h(t)$ is the filtered noise waveform. The input $X(t)$ is not affected by the ideal unit-gain bandpass filter since it has the same bandwidth as $h(t)$. Note also that the power spectral density of the filtered noise is given by
$$\mathrm{PSD}_{\tilde{Z}}(f) = \mathrm{PSD}_Z(f)\, |H(f)|^2 = \begin{cases} \dfrac{N_0}{2} & \text{if } -W \le f \le W,\\[1mm] 0 & \text{otherwise.} \end{cases}$$
Taking the inverse Fourier transform of $\mathrm{PSD}_{\tilde{Z}}(f)$ yields the covariance function of the filtered noise process:
$$K_{\tilde{Z}}(\tau) = \mathcal{F}^{-1}[\mathrm{PSD}_{\tilde{Z}}(f)] = N_0 W \operatorname{sinc}(2W\tau), \qquad \tau \in \mathbb{R}. \qquad (5.8.1)$$
To determine the capacity (in bits per second) of this continuous-time band-limited white Gaussian channel with parameters $P$, $W$ and $N_0$, we convert it to an equivalent discrete-time channel with power constraint $P$ by using the well-known sampling theorem (due to Nyquist, Kotelnikov and Shannon), which states that sampling a band-limited signal of bandwidth $W$ every $1/(2W)$ seconds (i.e., at a rate of $2W$ samples per second) is sufficient to reconstruct the signal from its samples. Since $X(t)$, $\tilde{Z}(t)$ and $Y(t)$ are all band-limited to $[-W, W]$, we can thus represent these signals by their samples taken $\frac{1}{2W}$ seconds apart and model the channel by a discrete-time channel described by
$$Y_n = X_n + \tilde{Z}_n, \qquad n = 1, 2, \ldots,$$
where $X_n \triangleq X\!\left(\frac{n}{2W}\right)$ are the input samples and $\tilde{Z}_n$ and $Y_n$ are the corresponding samples of the noise $\tilde{Z}(t)$ and output $Y(t)$ signals, respectively.
Since $\tilde{Z}(t)$ is a filtered version of $Z(t)$, which is a zero-mean stationary Gaussian process, we obtain that $\tilde{Z}(t)$ is also zero-mean, stationary and Gaussian. This directly implies that the samples $\tilde{Z}_n$, $n = 1, 2, \ldots$, are zero-mean, identically distributed Gaussian random variables. Now an examination of the expression of $K_{\tilde{Z}}(\tau)$ in (5.8.1) reveals that $K_{\tilde{Z}}(\tau) = 0$ for $\tau = \frac{n}{2W}$, $n = 1, 2, \ldots$, since $\operatorname{sinc}(t) = 0$ for all non-zero integer values of $t$. Hence, the random variables $\tilde{Z}_n$, $n = 1, 2, \ldots$, are uncorrelated and hence independent (since they are Gaussian), and their variance is given by $E[\tilde{Z}_n^2] = K_{\tilde{Z}}(0) = N_0 W$. We conclude that the discrete-time process $\{\tilde{Z}_n\}_{n=1}^{\infty}$ is i.i.d. Gaussian with each $\tilde{Z}_n \sim \mathcal{N}(0, N_0 W)$. As a result, the above discrete-time channel is a discrete-time memoryless Gaussian channel with power constraint $P$ and noise variance $N_0 W$; thus the capacity of the band-limited white Gaussian channel in bits/channel use is given using (5.4.13) by
$$\frac{1}{2}\log_2\left(1 + \frac{P}{N_0 W}\right) \quad \text{bits/channel use.}$$
Given that we are using the channel (with inputs $X_n$) every $\frac{1}{2W}$ seconds, we obtain that the capacity in bits/second of the band-limited white Gaussian channel is given by
$$C(P) = W \log_2\left(1 + \frac{P}{N_0 W}\right) \quad \text{bits/second,} \qquad (5.8.2)$$
where $\frac{P}{N_0 W}$ is typically referred to as the signal-to-noise ratio (SNR).
We emphasize that the above derivation of (5.8.2) is heuristic as we have not
rigorously shown the equivalence between the original band-limited Gaussian
channel and its discrete-time version and we have not established a coding the-
orem for the original channel. We point the reader to the references mentioned
at the beginning of the section for a full development of this subject.
Example 5.38 (Telephone line channel) Suppose telephone signals are band-limited to 4 kHz. Given an SNR of 40 decibels (dB), i.e., $10\log_{10}\frac{P}{N_0 W} = 40$ dB, then from (5.8.2) we calculate that the capacity of the telephone line channel (when modeled via the band-limited white Gaussian channel) is given by
$$4000 \log_2(1 + 10000) = 53151.4 \quad \text{bits/second.}$$
Example 5.39 (Infinite bandwidth white Gaussian channel) As the channel bandwidth $W$ grows without bound, we obtain from (5.8.2) that
$$\lim_{W \to \infty} C(P) = \frac{P}{N_0}\log_2 e \quad \text{bits/second,}$$
which indicates that in the infinite-bandwidth regime, capacity grows linearly with power.
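The two examples above are easy to reproduce numerically; the short sketch below (our own) evaluates (5.8.2) for the telephone-line parameters and checks the infinite-bandwidth limit.

```python
import math

def bandlimited_capacity(P_over_N0, W):
    """C(P) = W log2(1 + P/(N0 W)) in bits/second, parameterized by P/N0 (in Hz)."""
    return W * math.log2(1.0 + P_over_N0 / W)

# Example 5.38: W = 4 kHz and SNR = P/(N0 W) = 10^4 (i.e., 40 dB), so P/N0 = 10^4 * W.
W = 4000.0
P_over_N0 = 1.0e4 * W
print(round(bandlimited_capacity(P_over_N0, W), 1), "bits/second")   # about 53151.4

# Example 5.39: as W grows, C(P) approaches (P/N0) * log2(e).
for W in (1e6, 1e8, 1e10):
    print(W, round(bandlimited_capacity(P_over_N0, W), 1))
print("limit:", round(P_over_N0 * math.log2(math.e), 1), "bits/second")
```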
Observation 5.40 (Band-limited colored Gaussian channel) If the above band-limited channel has stationary colored (non-white) additive Gaussian noise, then it can be shown (e.g., see [18]) that the capacity of this channel becomes
$$C(P) = \frac{1}{2}\int_{-W}^{W} \max\left\{0,\ \log_2\frac{\theta}{\mathrm{PSD}_Z(f)}\right\} df,$$
where $\theta$ is the solution of
$$P = \int_{-W}^{W} \max\left[0,\ \theta - \mathrm{PSD}_Z(f)\right] df.$$
The above capacity formula is indeed reminiscent of the water-pouring scheme we saw in Sections 5.5 and 5.6, albeit it is herein applied in the spectral domain. In other words, we can view the curve of $\mathrm{PSD}_Z(f)$ as a bowl, and water is imagined being poured into the bowl up to a level $\theta$ under which the area of the water is equal to $P$ (see Fig. 5.3(a)). Furthermore, the distributed water indicates the shape of the optimum transmission power spectrum (see Fig. 5.3(b)).
Figure 5.3: Water-pouring for the band-limited colored Gaussian channel. (a) The spectrum $\mathrm{PSD}_Z(f)$, where the horizontal line represents $\theta$, the level to which the water rises. (b) The input spectrum that achieves capacity.
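For completeness, the colored-noise formula above can be evaluated numerically by discretizing the band; the sketch below (our own, with an arbitrarily chosen noise spectrum) approximates $\theta$ and the resulting capacity by water-pouring over frequency bins.

```python
import numpy as np

W = 4000.0                                   # one-sided bandwidth (Hz)
f = np.linspace(-W, W, 4001)                 # frequency grid over [-W, W]
df = f[1] - f[0]
psd = 1e-6 * (1.0 + (f / W) ** 2)            # an arbitrary colored noise PSD (W/Hz)
P = 0.01                                     # input power budget (W)

lo, hi = psd.min(), psd.max() + P / (2 * W) + 1.0
for _ in range(200):                         # bisection on the water level theta
    theta = 0.5 * (lo + hi)
    if np.maximum(0.0, theta - psd).sum() * df > P:
        hi = theta
    else:
        lo = theta

C = 0.5 * np.maximum(0.0, np.log2(theta / psd)).sum() * df
print("water level theta ~", theta, " capacity ~", round(C, 1), "bits/second")
```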
Appendix A
Overview on Suprema and Limits
We herein review basic results on suprema and limits which are useful for the
development of information theoretic coding theorems; they can be found in
standard real analysis texts (e.g., see [32, 45]).
A.1 Supremum and maximum

Throughout, we work on subsets of $\mathbb{R}$, the set of real numbers.

Definition A.1 (Upper bound of a set) A real number $u$ is called an upper bound of a non-empty subset $\mathcal{A}$ of $\mathbb{R}$ if every element of $\mathcal{A}$ is less than or equal to $u$; we then say that $\mathcal{A}$ is bounded above. Symbolically, the definition becomes:
$$\mathcal{A} \subseteq \mathbb{R} \text{ is bounded above} \iff (\exists\, u \in \mathbb{R}) \text{ such that } (\forall\, a \in \mathcal{A}),\ a \le u.$$
Definition A.2 (Least upper bound or supremum) If $\mathcal{A}$ is a non-empty subset of $\mathbb{R}$, then we say that a real number $s$ is a least upper bound or supremum of $\mathcal{A}$ if $s$ is an upper bound of the set $\mathcal{A}$ and if $s \le s'$ for each upper bound $s'$ of $\mathcal{A}$. In this case, we write $s = \sup \mathcal{A}$; other notations are $s = \sup_{x \in \mathcal{A}} x$ and $s = \sup\{x : x \in \mathcal{A}\}$.
Completeness Axiom (Least upper bound property): Let $\mathcal{A}$ be a non-empty subset of $\mathbb{R}$ that is bounded above. Then $\mathcal{A}$ has a least upper bound.

It follows directly that if a non-empty set in $\mathbb{R}$ has a supremum, then this supremum is unique. Furthermore, note that the empty set ($\emptyset$) and any set not bounded above do not admit a supremum in $\mathbb{R}$. However, when working in the set of extended real numbers given by $\mathbb{R} \cup \{-\infty, \infty\}$, we can define the supremum of the empty set as $-\infty$ and that of a set not bounded above as $\infty$. These extended definitions will be adopted in the text.
We now distinguish between two situations: (i) the supremum of a set $\mathcal{A}$ belongs to $\mathcal{A}$, and (ii) the supremum of a set $\mathcal{A}$ does not belong to $\mathcal{A}$. It is quite easy to create examples for both situations. A quick example for (i) is the set $(0, 1]$, while the set $(0, 1)$ can be used for (ii). In both examples, the supremum is equal to 1; however, in the former case the supremum belongs to the set, while in the latter case it does not. When a set contains its supremum, we call the supremum the maximum of the set.

Definition A.3 (Maximum) If $\sup \mathcal{A} \in \mathcal{A}$, then $\sup \mathcal{A}$ is also called the maximum of $\mathcal{A}$, and is denoted by $\max \mathcal{A}$. However, if $\sup \mathcal{A} \notin \mathcal{A}$, then we say that the maximum of $\mathcal{A}$ does not exist.
Property A.4 (Properties of the supremum)
1. The supremum of any set in $\mathbb{R} \cup \{-\infty, \infty\}$ always exists.
2. $(\forall\, a \in \mathcal{A})\ a \le \sup \mathcal{A}$.
3. If $-\infty < \sup \mathcal{A} < \infty$, then $(\forall\, \epsilon > 0)(\exists\, a_0 \in \mathcal{A})\ a_0 > \sup \mathcal{A} - \epsilon$.
(The existence of $a_0 \in (\sup \mathcal{A} - \epsilon, \sup \mathcal{A}]$ for any $\epsilon > 0$ under the condition $|\sup \mathcal{A}| < \infty$ is called the approximation property for the supremum.)
4. If $\sup \mathcal{A} = \infty$, then $(\forall\, L \in \mathbb{R})(\exists\, B_0 \in \mathcal{A})\ B_0 > L$.
5. If $\sup \mathcal{A} = -\infty$, then $\mathcal{A}$ is empty.
Observation A.5 In information theory, a typical channel coding theorem establishes that a (finite) real number $\alpha$ is the supremum of a set $\mathcal{A}$. Thus, to prove such a theorem, one must show that $\alpha$ satisfies both properties 3 and 2 above, i.e.,
$$(\forall\, \epsilon > 0)(\exists\, a_0 \in \mathcal{A})\ a_0 > \alpha - \epsilon \qquad (A.1.1)$$
and
$$(\forall\, a \in \mathcal{A})\ a \le \alpha, \qquad (A.1.2)$$
where (A.1.1) and (A.1.2) are called the achievability (or forward) part and the converse part, respectively, of the theorem. Specifically, (A.1.2) states that $\alpha$ is an upper bound of $\mathcal{A}$, and (A.1.1) states that no number less than $\alpha$ can be an upper bound for $\mathcal{A}$.
Property A.6 (Properties of the maximum)
1. $(\forall\, a \in \mathcal{A})\ a \le \max \mathcal{A}$, if $\max \mathcal{A}$ exists in $\mathbb{R} \cup \{-\infty, \infty\}$.
2. $\max \mathcal{A} \in \mathcal{A}$.

From the above property, in order to obtain $\alpha = \max \mathcal{A}$, one needs to show that $\alpha$ satisfies both
$$(\forall\, a \in \mathcal{A})\ a \le \alpha \quad \text{and} \quad \alpha \in \mathcal{A}.$$
A.2 Infimum and minimum

The concepts of infimum and minimum are dual to those of supremum and maximum.

Definition A.7 (Lower bound of a set) A real number $\ell$ is called a lower bound of a non-empty subset $\mathcal{A}$ of $\mathbb{R}$ if every element of $\mathcal{A}$ is greater than or equal to $\ell$; we then say that $\mathcal{A}$ is bounded below. Symbolically, the definition becomes:
$$\mathcal{A} \subseteq \mathbb{R} \text{ is bounded below} \iff (\exists\, \ell \in \mathbb{R}) \text{ such that } (\forall\, a \in \mathcal{A}),\ a \ge \ell.$$

Definition A.8 (Greatest lower bound or infimum) If $\mathcal{A}$ is a non-empty subset of $\mathbb{R}$, then we say that a real number $\ell$ is a greatest lower bound or infimum of $\mathcal{A}$ if $\ell$ is a lower bound of $\mathcal{A}$ and if $\ell \ge \ell'$ for each lower bound $\ell'$ of $\mathcal{A}$. In this case, we write $\ell = \inf \mathcal{A}$; other notations are $\ell = \inf_{x \in \mathcal{A}} x$ and $\ell = \inf\{x : x \in \mathcal{A}\}$.
Completeness Axiom (Greatest lower bound property): Let $\mathcal{A}$ be a non-empty subset of $\mathbb{R}$ that is bounded below. Then $\mathcal{A}$ has a greatest lower bound.

As in the case of the supremum, it directly follows that if a non-empty set in $\mathbb{R}$ has an infimum, then this infimum is unique. Furthermore, working in the set of extended real numbers, the infimum of the empty set is defined as $\infty$ and that of a set not bounded below as $-\infty$.

Definition A.9 (Minimum) If $\inf \mathcal{A} \in \mathcal{A}$, then $\inf \mathcal{A}$ is also called the minimum of $\mathcal{A}$, and is denoted by $\min \mathcal{A}$. However, if $\inf \mathcal{A} \notin \mathcal{A}$, we say that the minimum of $\mathcal{A}$ does not exist.
Property A.10 (Properties of the infimum)
1. The infimum of any set in $\mathbb{R} \cup \{-\infty, \infty\}$ always exists.
2. $(\forall\, a \in \mathcal{A})\ a \ge \inf \mathcal{A}$.
3. If $\infty > \inf \mathcal{A} > -\infty$, then $(\forall\, \epsilon > 0)(\exists\, a_0 \in \mathcal{A})\ a_0 < \inf \mathcal{A} + \epsilon$.
(The existence of $a_0 \in [\inf \mathcal{A}, \inf \mathcal{A} + \epsilon)$ for any $\epsilon > 0$ under the assumption $|\inf \mathcal{A}| < \infty$ is called the approximation property for the infimum.)
4. If $\inf \mathcal{A} = -\infty$, then $(\forall\, L \in \mathbb{R})(\exists\, B_0 \in \mathcal{A})\ B_0 < L$.
5. If $\inf \mathcal{A} = \infty$, then $\mathcal{A}$ is empty.
Observation A.11 Analogously to Observation A.5, a typical source coding theorem in information theory establishes that a (finite) real number $\alpha$ is the infimum of a set $\mathcal{A}$. Thus, to prove such a theorem, one must show that $\alpha$ satisfies both properties 3 and 2 above, i.e.,
$$(\forall\, \epsilon > 0)(\exists\, a_0 \in \mathcal{A})\ a_0 < \alpha + \epsilon \qquad (A.2.1)$$
and
$$(\forall\, a \in \mathcal{A})\ a \ge \alpha. \qquad (A.2.2)$$
Here, (A.2.1) is called the achievability or forward part of the coding theorem; it specifies that no number greater than $\alpha$ can be a lower bound for $\mathcal{A}$. Also, (A.2.2) is called the converse part of the theorem; it states that $\alpha$ is a lower bound of $\mathcal{A}$.

Property A.12 (Properties of the minimum)
1. $(\forall\, a \in \mathcal{A})\ a \ge \min \mathcal{A}$, if $\min \mathcal{A}$ exists in $\mathbb{R} \cup \{-\infty, \infty\}$.
2. $\min \mathcal{A} \in \mathcal{A}$.
A.3 Boundedness and suprema operations

Definition A.13 (Boundedness) A subset $\mathcal{A}$ of $\mathbb{R}$ is said to be bounded if it is both bounded above and bounded below; otherwise, it is called unbounded.

Lemma A.14 (Condition for boundedness) A subset $\mathcal{A}$ of $\mathbb{R}$ is bounded iff $(\exists\, k \in \mathbb{R})$ such that $(\forall\, a \in \mathcal{A})\ |a| \le k$.

Lemma A.15 (Monotone property) Suppose that $\mathcal{A}$ and $\mathcal{B}$ are non-empty subsets of $\mathbb{R}$ such that $\mathcal{A} \subseteq \mathcal{B}$. Then
1. $\sup \mathcal{A} \le \sup \mathcal{B}$.
2. $\inf \mathcal{A} \ge \inf \mathcal{B}$.

Lemma A.16 (Supremum for set operations) Define the addition of two sets $\mathcal{A}$ and $\mathcal{B}$ as
$$\mathcal{A} + \mathcal{B} \triangleq \{c \in \mathbb{R} : c = a + b \text{ for some } a \in \mathcal{A} \text{ and } b \in \mathcal{B}\}.$$
Define the scalar multiplication of a set $\mathcal{A}$ by a scalar $k \in \mathbb{R}$ as
$$k \cdot \mathcal{A} \triangleq \{c \in \mathbb{R} : c = k a \text{ for some } a \in \mathcal{A}\}.$$
Finally, define the negation of a set $\mathcal{A}$ as
$$-\mathcal{A} \triangleq \{c \in \mathbb{R} : c = -a \text{ for some } a \in \mathcal{A}\}.$$
Then the following hold.
1. If $\mathcal{A}$ and $\mathcal{B}$ are both bounded above, then $\mathcal{A} + \mathcal{B}$ is also bounded above and $\sup(\mathcal{A} + \mathcal{B}) = \sup \mathcal{A} + \sup \mathcal{B}$.
2. If $0 < k < \infty$ and $\mathcal{A}$ is bounded above, then $k \cdot \mathcal{A}$ is also bounded above and $\sup(k \cdot \mathcal{A}) = k \sup \mathcal{A}$.
3. $\sup \mathcal{A} = -\inf(-\mathcal{A})$ and $\inf \mathcal{A} = -\sup(-\mathcal{A})$.

Property 1 does not hold for the product of two sets, where the product of sets $\mathcal{A}$ and $\mathcal{B}$ is defined as
$$\mathcal{A} \cdot \mathcal{B} \triangleq \{c \in \mathbb{R} : c = ab \text{ for some } a \in \mathcal{A} \text{ and } b \in \mathcal{B}\}.$$
In this case, both of the following situations can occur:
$$\sup(\mathcal{A} \cdot \mathcal{B}) > (\sup \mathcal{A}) \cdot (\sup \mathcal{B}),$$
$$\sup(\mathcal{A} \cdot \mathcal{B}) = (\sup \mathcal{A}) \cdot (\sup \mathcal{B}).$$
Lemma A.17 (Supremum/infimum for monotone functions)
1. If $f : \mathbb{R} \to \mathbb{R}$ is a non-decreasing function, then
$$\sup\{x \in \mathbb{R} : f(x) < \epsilon\} = \inf\{x \in \mathbb{R} : f(x) \ge \epsilon\}$$
and
$$\sup\{x \in \mathbb{R} : f(x) \le \epsilon\} = \inf\{x \in \mathbb{R} : f(x) > \epsilon\}.$$
2. If $f : \mathbb{R} \to \mathbb{R}$ is a non-increasing function, then
$$\sup\{x \in \mathbb{R} : f(x) > \epsilon\} = \inf\{x \in \mathbb{R} : f(x) \le \epsilon\}$$
and
$$\sup\{x \in \mathbb{R} : f(x) \ge \epsilon\} = \inf\{x \in \mathbb{R} : f(x) < \epsilon\}.$$
The above lemma is illustrated in Figure A.1.
A.4 Sequences and their limits

Let $\mathbb{N}$ denote the set of natural numbers (positive integers) $\{1, 2, 3, \ldots\}$. A sequence drawn from a real-valued function is denoted by
$$f : \mathbb{N} \to \mathbb{R}.$$
In other words, $f(n)$ is a real number for each $n = 1, 2, 3, \ldots$. It is usual to write $f(n) = a_n$, and we often indicate the sequence by any one of these notations:
$$a_1, a_2, a_3, \ldots, a_n, \ldots \quad \text{or} \quad \{a_n\}_{n=1}^{\infty}.$$
One important question that arises with a sequence is what happens when $n$ gets large. To be precise, we want to know whether, when $n$ is large enough, every $a_n$ is close to some fixed number $L$ (which is the limit of $a_n$).

Definition A.18 (Limit) The limit of $\{a_n\}_{n=1}^{\infty}$ is the real number $L$ satisfying: $(\forall\, \epsilon > 0)(\exists\, N)$ such that $(\forall\, n > N)$,
$$|a_n - L| < \epsilon.$$
In this case, we write $L = \lim_{n \to \infty} a_n$. If no such $L$ satisfies the above statement, we say that the limit of $\{a_n\}_{n=1}^{\infty}$ does not exist.
Figure A.1: Illustration of Lemma A.17.
Property A.19 If $\{a_n\}_{n=1}^{\infty}$ and $\{b_n\}_{n=1}^{\infty}$ both have a limit in $\mathbb{R}$, then the following hold.
1. $\lim_{n \to \infty}(a_n + b_n) = \lim_{n \to \infty} a_n + \lim_{n \to \infty} b_n$.
2. $\lim_{n \to \infty}(\alpha\, a_n) = \alpha \lim_{n \to \infty} a_n$ for any scalar $\alpha \in \mathbb{R}$.
3. $\lim_{n \to \infty}(a_n b_n) = (\lim_{n \to \infty} a_n)(\lim_{n \to \infty} b_n)$.
Note that in the above definition, $\infty$ and $-\infty$ cannot be a legitimate limit for any sequence. In fact, if $(\forall\, L)(\exists\, N)$ such that $(\forall\, n > N)\ a_n > L$, then we say that $a_n$ diverges to $\infty$ and write $a_n \to \infty$. A similar argument applies to $a_n$ diverging to $-\infty$. For convenience, we will work in the set of extended real numbers and thus state that a sequence $\{a_n\}_{n=1}^{\infty}$ that diverges to either $\infty$ or $-\infty$ has a limit in $\mathbb{R} \cup \{-\infty, \infty\}$.
Lemma A.20 (Convergence of monotone sequences) If $\{a_n\}_{n=1}^{\infty}$ is non-decreasing in $n$, then $\lim_{n \to \infty} a_n$ exists in $\mathbb{R} \cup \{\infty\}$. If $\{a_n\}_{n=1}^{\infty}$ is also bounded from above, i.e., $a_n \le L$ for all $n$ for some $L$ in $\mathbb{R}$, then $\lim_{n \to \infty} a_n$ exists in $\mathbb{R}$.

Likewise, if $\{a_n\}_{n=1}^{\infty}$ is non-increasing in $n$, then $\lim_{n \to \infty} a_n$ exists in $\mathbb{R} \cup \{-\infty\}$. If $\{a_n\}_{n=1}^{\infty}$ is also bounded from below, i.e., $a_n \ge L$ for all $n$ for some $L$ in $\mathbb{R}$, then $\lim_{n \to \infty} a_n$ exists in $\mathbb{R}$.
As stated above, the limit of a sequence may not exist. For example, let $a_n = (-1)^n$; then $a_n$ is close to either $-1$ or $1$ for $n$ large. Hence, more general definitions that can describe the limiting behavior of a sequence are required.
Definition A.21 (limsup and liminf) The limit supremum of $\{a_n\}_{n=1}^{\infty}$ is the extended real number in $\mathbb{R} \cup \{-\infty, \infty\}$ defined by
$$\limsup_{n \to \infty} a_n \triangleq \lim_{n \to \infty}\left(\sup_{k \ge n} a_k\right),$$
and the limit infimum of $\{a_n\}_{n=1}^{\infty}$ is the extended real number defined by
$$\liminf_{n \to \infty} a_n \triangleq \lim_{n \to \infty}\left(\inf_{k \ge n} a_k\right).$$
Some also use the notations $\overline{\lim}$ and $\underline{\lim}$ to denote limsup and liminf, respectively.
Note that the limit supremum and the limit infimum of a sequence are always defined in $\mathbb{R} \cup \{-\infty, \infty\}$, since the sequences $\sup_{k \ge n} a_k = \sup\{a_k : k \ge n\}$ and $\inf_{k \ge n} a_k = \inf\{a_k : k \ge n\}$ are monotone in $n$ (cf. Lemma A.20). An immediate result follows from the definitions of limsup and liminf.
Lemma A.22 (Limit) For a sequence $\{a_n\}_{n=1}^{\infty}$,
$$\lim_{n \to \infty} a_n = L \iff \limsup_{n \to \infty} a_n = \liminf_{n \to \infty} a_n = L.$$

Some properties regarding the limsup and liminf of sequences (which parallel Properties A.4 and A.10) are listed below.
Property A.23 (Properties of the limit supremum)
1. The limit supremum always exists in $\mathbb{R} \cup \{-\infty, \infty\}$.
2. If $|\limsup_{m \to \infty} a_m| < \infty$, then $(\forall\, \epsilon > 0)(\exists\, N)$ such that $(\forall\, n > N)$ $a_n < \limsup_{m \to \infty} a_m + \epsilon$. (Note that this holds for every $n > N$.)
3. If $|\limsup_{m \to \infty} a_m| < \infty$, then $(\forall\, \epsilon > 0$ and integer $K)(\exists\, N > K)$ such that $a_N > \limsup_{m \to \infty} a_m - \epsilon$. (Note that this holds only for one $N$, which is larger than $K$.)
Property A.24 (Properties of the limit infimum)
1. The limit infimum always exists in $\mathbb{R} \cup \{-\infty, \infty\}$.
2. If $|\liminf_{m \to \infty} a_m| < \infty$, then $(\forall\, \epsilon > 0$ and $K)(\exists\, N > K)$ such that $a_N < \liminf_{m \to \infty} a_m + \epsilon$. (Note that this holds only for one $N$, which is larger than $K$.)
3. If $|\liminf_{m \to \infty} a_m| < \infty$, then $(\forall\, \epsilon > 0)(\exists\, N)$ such that $(\forall\, n > N)$ $a_n > \liminf_{m \to \infty} a_m - \epsilon$. (Note that this holds for every $n > N$.)
The last two items in Properties A.23 and A.24 can be stated using the terminology of "sufficiently large" and "infinitely often", which is often adopted in information theory.

Definition A.25 (Sufficiently large) We say that a property holds for a sequence $\{a_n\}_{n=1}^{\infty}$ almost always or for all sufficiently large $n$ if the property holds for every $n > N$ for some $N$.

Definition A.26 (Infinitely often) We say that a property holds for a sequence $\{a_n\}_{n=1}^{\infty}$ infinitely often or for infinitely many $n$ if for every $K$, the property holds for one (specific) $N$ with $N > K$.
Then properties 2 and 3 of Property A.23 can be respectively re-phrased as: if $|\limsup_{m \to \infty} a_m| < \infty$, then $(\forall\, \epsilon > 0)$
$$a_n < \limsup_{m \to \infty} a_m + \epsilon \quad \text{for all sufficiently large } n$$
and
$$a_n > \limsup_{m \to \infty} a_m - \epsilon \quad \text{for infinitely many } n.$$
Similarly, properties 2 and 3 of Property A.24 become: if $|\liminf_{m \to \infty} a_m| < \infty$, then $(\forall\, \epsilon > 0)$
$$a_n < \liminf_{m \to \infty} a_m + \epsilon \quad \text{for infinitely many } n$$
and
$$a_n > \liminf_{m \to \infty} a_m - \epsilon \quad \text{for all sufficiently large } n.$$
Lemma A.27
1. $\liminf_{n \to \infty} a_n \le \limsup_{n \to \infty} a_n$.
2. If $a_n \le b_n$ for all sufficiently large $n$, then
$$\liminf_{n \to \infty} a_n \le \liminf_{n \to \infty} b_n \quad \text{and} \quad \limsup_{n \to \infty} a_n \le \limsup_{n \to \infty} b_n.$$
3. $\limsup_{n \to \infty} a_n < r \implies a_n < r$ for all sufficiently large $n$.
4. $\limsup_{n \to \infty} a_n > r \implies a_n > r$ for infinitely many $n$.
5.
$$\begin{aligned}
\liminf_{n \to \infty} a_n + \liminf_{n \to \infty} b_n
&\le \liminf_{n \to \infty}(a_n + b_n)\\
&\le \limsup_{n \to \infty} a_n + \liminf_{n \to \infty} b_n\\
&\le \limsup_{n \to \infty}(a_n + b_n)\\
&\le \limsup_{n \to \infty} a_n + \limsup_{n \to \infty} b_n.
\end{aligned}$$
6. If $\lim_{n \to \infty} a_n$ exists, then
$$\liminf_{n \to \infty}(a_n + b_n) = \lim_{n \to \infty} a_n + \liminf_{n \to \infty} b_n$$
and
$$\limsup_{n \to \infty}(a_n + b_n) = \lim_{n \to \infty} a_n + \limsup_{n \to \infty} b_n.$$
Finally, one can also interpret the limit supremum and limit infimum in terms of the concept of clustering points. A clustering point is a point that a sequence $\{a_n\}_{n=1}^{\infty}$ approaches (i.e., enters a ball of arbitrarily small radius centered at that point) infinitely many times. For example, if $a_n = \sin(n\pi/2)$, then $\{a_n\}_{n=1}^{\infty} = \{1, 0, -1, 0, 1, 0, -1, 0, \ldots\}$. Hence, there are three clustering points in this sequence, namely $-1$, $0$ and $1$. The limit supremum of the sequence is then nothing but its largest clustering point, and its limit infimum is exactly its smallest clustering point. Specifically, $\limsup_{n \to \infty} a_n = 1$ and $\liminf_{n \to \infty} a_n = -1$. This approach can sometimes be useful to determine the limsup and liminf quantities.
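For readers who like to experiment, the following small Python sketch (ours, purely illustrative) tabulates the tail suprema and infima of the sequence $a_n = \sin(n\pi/2)$ used above, making the limsup/liminf interpretation concrete.

```python
import math

N = 40
a = [math.sin(n * math.pi / 2.0) for n in range(1, N + 1)]   # 1, 0, -1, 0, 1, 0, -1, ...

# Approximate sup_{k >= n} a_k and inf_{k >= n} a_k using the available tail a_n, ..., a_N.
# (The tails kept here are long enough to contain a full period of the sequence.)
for n in (1, 5, 10, 20):
    tail = a[n - 1:]
    print(n, "tail sup ~", round(max(tail), 6), " tail inf ~", round(min(tail), 6))
# Every tail sup is ~1 and every tail inf is ~-1, so limsup a_n = 1 and liminf a_n = -1,
# the largest and smallest clustering points of the sequence.
```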
A.5 Equivalence

We close this appendix by providing some equivalent statements that are often used to simplify proofs. For example, instead of directly showing that quantity $x$ is less than or equal to quantity $y$, one can take an arbitrary constant $\epsilon > 0$ and prove that $x < y + \epsilon$. Since $y + \epsilon$ is a larger quantity than $y$, in some cases it might be easier to show $x < y + \epsilon$ than to prove $x \le y$. By the next theorem, any proof that concludes that $x < y + \epsilon$ for all $\epsilon > 0$ immediately gives the desired result $x \le y$.

Theorem A.28 For any $x$, $y$ and $a$ in $\mathbb{R}$:
1. $x < y + \epsilon$ for all $\epsilon > 0$ iff $x \le y$;
2. $x < y - \epsilon$ for some $\epsilon > 0$ iff $x < y$;
3. $x > y - \epsilon$ for all $\epsilon > 0$ iff $x \ge y$;
4. $x > y + \epsilon$ for some $\epsilon > 0$ iff $x > y$;
5. $|a| < \epsilon$ for all $\epsilon > 0$ iff $a = 0$.
Appendix B
Overview of Probability and Random Processes

This appendix presents a quick overview of basic concepts from probability theory and the theory of random processes. The reader can consult comprehensive texts on these subjects for a thorough study (e.g., cf. [2, 6, 20]). We close the appendix with a brief discussion of Jensen's inequality and the Lagrange multipliers technique for the optimization of convex functions [5, 11].
B.1 Probability space

Definition B.1 (σ-Fields) Let $\mathcal{F}$ be a collection of subsets of a non-empty set $\Omega$. Then $\mathcal{F}$ is called a σ-field (or σ-algebra) if the following hold:
1. $\Omega \in \mathcal{F}$.
2. Closure of $\mathcal{F}$ under complementation: If $A \in \mathcal{F}$, then $A^c \in \mathcal{F}$, where $A^c$ is the complement set of $A$ (relative to $\Omega$).
3. Closure of $\mathcal{F}$ under countable union: If $A_i \in \mathcal{F}$ for $i = 1, 2, 3, \ldots$, then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$.

It directly follows that the empty set $\emptyset$ is also an element of $\mathcal{F}$ (as $\Omega^c = \emptyset$) and that $\mathcal{F}$ is closed under countable intersection since
$$\bigcap_{i=1}^{\infty} A_i = \left(\bigcup_{i=1}^{\infty} A_i^c\right)^c.$$
The largest σ-field of subsets of a given set $\Omega$ is the collection of all subsets of $\Omega$ (i.e., its powerset), while the smallest σ-field is given by $\{\Omega, \emptyset\}$. Also, if $A$ is a proper (strict) non-empty subset of $\Omega$, then the smallest σ-field containing $A$ is given by $\{\Omega, \emptyset, A, A^c\}$.
Definition B.2 (Probability space) A probability space is a triple $(\Omega, \mathcal{F}, P)$, where $\Omega$ is a given set called the sample space containing all possible outcomes (usually observed from an experiment), $\mathcal{F}$ is a σ-field of subsets of $\Omega$, and $P$ is a probability measure $P : \mathcal{F} \to [0, 1]$ on the σ-field satisfying the following:
1. $0 \le P(A) \le 1$ for all $A \in \mathcal{F}$.
2. $P(\Omega) = 1$.
3. Countable additivity: If $A_1, A_2, \ldots$ is a sequence of disjoint sets (i.e., $A_i \cap A_j = \emptyset$ for $i \ne j$) in $\mathcal{F}$, then
$$P\left(\bigcup_{k=1}^{\infty} A_k\right) = \sum_{k=1}^{\infty} P(A_k).$$

It directly follows from properties 1-3 of the above definition that $P(\emptyset) = 0$. Usually, the σ-field $\mathcal{F}$ is called the event space and its elements (which are subsets of $\Omega$ satisfying the properties of Definition B.1) are called events.
B.2 Random variable and random process
B.3 Central limit theorem
Theorem B.3 (Central limit theorem) If $\{X_n\}_{n=1}^{\infty}$ is a sequence of i.i.d. random variables with finite common marginal mean $\mu$ and variance $\sigma^2$, then
$$\frac{1}{\sqrt{n}}\sum_{i=1}^{n}(X_i - \mu) \xrightarrow{d} Z \sim \mathcal{N}(0, \sigma^2),$$
where the convergence is in distribution (as $n \to \infty$) and $Z \sim \mathcal{N}(0, \sigma^2)$ is a Gaussian distributed random variable with mean 0 and variance $\sigma^2$.
B.4 Convexity, concavity and Jensen's inequality

Jensen's inequality provides a useful bound for the expectation of convex (or concave) functions.

Definition B.4 (Convexity) Consider a convex set¹ $\mathcal{O} \subseteq \mathbb{R}^m$, where $m$ is a fixed positive integer. Then a function $f : \mathcal{O} \to \mathbb{R}$ is said to be convex over $\mathcal{O}$ if for every $x, y$ in $\mathcal{O}$ and $0 \le \lambda \le 1$,
$$f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y).$$
Furthermore, a function $f$ is said to be strictly convex if equality holds only when $\lambda = 0$ or $\lambda = 1$.
Definition B.5 (Concavity) A function $f$ is concave if $-f$ is convex.
Note that when $\mathcal{O} = (a, b)$ is an interval in $\mathbb{R}$ and the function $f : \mathcal{O} \to \mathbb{R}$ has a non-negative (respectively, positive) second derivative over $\mathcal{O}$, then the function is convex (resp. strictly convex). This can be easily shown via the Taylor series expansion of the function.
Theorem B.6 (Jensen's inequality) If $f : \mathcal{O} \to \mathbb{R}$ is convex over a convex set $\mathcal{O} \subseteq \mathbb{R}^m$, and $X = (X_1, X_2, \ldots, X_m)^T$ is an $m$-dimensional random vector with alphabet $\mathcal{A} \subseteq \mathcal{O}$, then
$$E[f(X)] \ge f(E[X]).$$
Moreover, if $f$ is strictly convex, then equality in the above inequality immediately implies $X = E[X]$ with probability 1.

Note: $\mathcal{O}$ is a convex set; hence, $\mathcal{A} \subseteq \mathcal{O}$ implies $E[X] \in \mathcal{O}$. This guarantees that $f(E[X])$ is defined. Similarly, if $f$ is concave, then
$$E[f(X)] \le f(E[X]).$$
Furthermore, if $f$ is strictly concave, then equality in the above inequality immediately implies that $X = E[X]$ with probability 1.
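Before the proof, here is a small numerical illustration of Jensen's inequality (our own sketch, using the convex function $f(x) = x^2$, for which $E[X^2] \ge (E[X])^2$ is just the non-negativity of the variance).

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.exponential(scale=2.0, size=100000)   # any distribution on a convex set works

f = lambda x: x ** 2                           # a convex function
print("E[f(X)] =", round(f(X).mean(), 3))      # ~ 8 for this example
print("f(E[X]) =", round(f(X.mean()), 3))      # ~ 4
# E[f(X)] >= f(E[X]), with the gap equal to Var(X) for this particular f.
```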
Proof: Let $y = \mathbf{a}^T x + b$ be a support hyperplane² for $f$ with slope vector $\mathbf{a}^T$ and affine parameter $b$ that passes through the point $(E[X], f(E[X]))$, where a support hyperplane for a function $f$ at a point $x^*$ is by definition a hyperplane passing through the point $(x^*, f(x^*))$ and lying entirely below the graph of $f$ (see Fig. B.1 for an illustration of a support line for a convex function over $\mathbb{R}$).

Figure B.1: The support line $y = ax + b$ of the convex function $f(x)$.

Thus,
$$(\forall\, x \in \mathcal{A})\quad \mathbf{a}^T x + b \le f(x).$$
By taking the expectation of both sides, we obtain
$$\mathbf{a}^T E[X] + b \le E[f(X)],$$
but we know that $\mathbf{a}^T E[X] + b = f(E[X])$. Consequently,
$$f(E[X]) \le E[f(X)]. \qquad \Box$$

¹ A set $\mathcal{O} \subseteq \mathbb{R}^m$ is said to be convex if for every $x = (x_1, x_2, \ldots, x_m)^T$ and $y = (y_1, y_2, \ldots, y_m)^T$ in $\mathcal{O}$ (where $T$ denotes transposition) and every $0 \le \lambda \le 1$, we have $\lambda x + (1 - \lambda) y \in \mathcal{O}$; in other words, the convex combination of any two points $x$ and $y$ in $\mathcal{O}$ also belongs to $\mathcal{O}$.
² A hyperplane $y = \mathbf{a}^T x + b$ is said to be a support hyperplane for a function $f$ with slope vector $\mathbf{a}^T \in \mathbb{R}^m$ and affine parameter $b \in \mathbb{R}$ if, among all hyperplanes with the same slope vector $\mathbf{a}$, it is the largest one satisfying $\mathbf{a}^T x + b \le f(x)$ for every $x \in \mathcal{O}$. Hence, a support hyperplane may not necessarily pass through the point $(x^*, f(x^*))$ for every $x^* \in \mathcal{O}$. Here, since we only consider convex functions, the validity of a support hyperplane at $x^*$ passing through $(x^*, f(x^*))$ is therefore guaranteed. Note that when $x$ is one-dimensional (i.e., $m = 1$), a support hyperplane is simply referred to as a support line.
Bibliography
[1] S. Arimoto, An algorithm for computing the capacity of arbitrary discrete
memoryless channel, IEEE Trans. Inform. Theory, vol. 18, no. 1, pp. 14-20,
Jan. 1972.
[2] R. B. Ash and C. A. Doleans-Dade, Probability and Measure Theory, Aca-
demic Press, MA, 2000.
[3] C. Berrou, A. Glavieux and P. Thitimajshima, Near Shannon limit error-
correcting coding and decoding: Turbo-codes(1), Proc. IEEE Int. Conf.
Commun., pp. 1064-1070, Geneva, Switzerland, May 1993.
[4] C. Berrou and A. Glavieux, Near optimum error correcting coding and
decoding: Turbo-codes, IEEE Trans. Commun., vol. 44, no. 10, pp. 1261-
1271, Oct. 1996.
[5] D. P. Bertsekas, with A. Nedic and A. E. Ozdagler, Convex Analysis and
Optimization, Athena Scientic, Belmont, MA, 2003.
[6] P. Billingsley. Probability and Measure, 2nd. Ed., John Wiley and Sons, NY,
1995.
[7] R. E. Blahut, Computation of channel capacity and rate-distortion func-
tions, IEEE Trans. Inform. Theory, vol. 18, no. 4, pp. 460-473, Jul. 1972.
[8] R. E. Blahut, Theory and Practice of Error Control Codes, Addison-Wesley,
MA, 1983.
[9] R. E. Blahut. Principles and Practice of Information Theory. Addison Wes-
ley, MA, 1988.
[10] R. E. Blahut, Algebraic Codes for Data Transmission, Cambridge Univ.
Press, 2003.
[11] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University
Press, Cambridge, UK, 2003.
[12] T. M. Cover and J.A. Thomas, Elements of Information Theory, 2nd Ed.,
Wiley, NY, 2006.
[13] I. Csiszar and J. Korner, Information Theory: Coding Theorems for Discrete
Memoryless Systems, Academic, NY, 1981.
[14] I. Csiszar and G. Tusnady, Information geometry and alternating min-
imization procedures, Statistics and Decision, Supplement Issue, vol. 1,
pp. 205-237, 1984.
[15] S. H. Friedberg, A.J. Insel and L. E. Spence, Linear Algebra, 4th Ed., Pren-
tice Hall, 2002.
[16] R. G. Gallager, Low-density parity-check codes, IRE Trans. Inform. The-
ory, vol. 28, no. 1, pp. 8-21, Jan. 1962.
[17] R. G. Gallager, Low-Density Parity-Check Codes, MIT Press, 1963.
[18] R. G. Gallager, Information Theory and Reliable Communication, Wiley,
1968.
[19] R. Gallager, Variations on a theme by Huffman, IEEE Trans. Inform. The-
ory, vol. 24, no. 6, pp. 668-674, Nov. 1978.
[20] G. R. Grimmett and D. R. Stirzaker, Probability and Random Processes,
Third Edition, Oxford University Press, NY, 2001.
[21] T. S. Han and S. Verdú, Approximation theory of output statistics, IEEE
Trans. Inform. Theory, vol. 39, no. 3, pp. 752-772, May 1993.
[22] S. Ihara, Information Theory for Continuous Systems, World-Scientific, Sin-
gapore, 1993.
[23] R. Johannesson and K. Zigangirov, Fundamentals of Convolutional Coding,
IEEE, 1999.
[24] W. Karush, Minima of Functions of Several Variables with Inequalities as
Side Constraints, M.Sc. Dissertation, Dept. Mathematics, Univ. Chicago,
Chicago, Illinois, 1939.
[25] A. N. Kolmogorov, On the Shannon theory of information transmission in
the case of continuous signals, IEEE Trans. Inform. Theory, vol. 2, no. 4,
pp. 102-108, Dec. 1956.
[26] A. N. Kolmogorov and S. V. Fomin, Introductory Real Analysis, Dover Pub-
lications, NY, 1970.
[27] H. W. Kuhn and A. W. Tucker, Nonlinear programming, Proc. 2nd Berke-
ley Symposium, Berkeley, University of California Press, pp. 481-492, 1951.
[28] S. Lin and D. J. Costello, Error Control Coding: Fundamentals and Appli-
cations, 2nd Edition, Prentice Hall, NJ, 2004.
[29] D. J. C. MacKay and R. M. Neal, Near Shannon limit performance of low
density parity check codes, Electronics Letters, vol. 33, no. 6, Mar. 1997.
[30] D. J. C. MacKay, Good error correcting codes based on very sparse matri-
ces, IEEE Trans. Inform. Theory, vol. 45, no. 2, pp. 399-431, Mar. 1999.
[31] F. J. MacWilliams and N. J. A. Sloane, The Theory of Error Correcting
Codes, North-Holland Pub. Co., 1978.
[32] J. E. Marsden and M. J. Hoffman, Elementary Classical Analysis, W.H.
Freeman & Company, 1993.
[33] R. J. McEliece, The Theory of Information and Coding, 2nd. Ed., Cam-
bridge University Press, 2002.
[34] M. S. Pinsker, Information and Information Stability of Random Variables
and Processes, Holden-Day, San Francisco, 1964.
[35] A. Renyi, On the dimension and entropy of probability distributions, Acta
Math. Acad. Sci. Hung., vol. 10, pp. 193-215, 1959.
[36] M. Rezaeian and A. Grant, Computation of total capacity for discrete
memoryless multiple-access channels, IEEE Trans. Inform. Theory, vol. 50,
no. 11, pp. 2779-2784, Nov. 2004.
[37] T. J. Richardson and R. L. Urbanke, Modern Coding Theory, Cambridge
University Press, 2008.
[38] H. L. Royden. Real Analysis, Macmillan Publishing Company, 3rd. Ed., NY,
1988.
[39] C. E. Shannon, A mathematical theory of communications, Bell Syst.
Tech. Journal, vol. 27, pp. 379-423, 1948.
[40] C. E. Shannon, Coding theorems for a discrete source with a fidelity cri-
terion, IRE Nat. Conv. Rec., Pt. 4, pp. 142-163, 1959.
[41] C. E. Shannon and W. W. Weaver, The Mathematical Theory of Commu-
nication, Univ. of Illinois Press, Urbana, IL, 1949.
[42] P. C. Shields, The Ergodic Theory of Discrete Sample Paths, American
Mathematical Society, 1991.
[43] N. J. A. Sloane and A. D. Wyner, Ed., Claude Elwood Shannon: Collected
Papers, IEEE Press, NY, 1993.
[44] S. Verdú and S. Shamai, Variable-rate channel capacity, IEEE Trans.
Inform. Theory, vol. 56, no. 6, pp. 2651-2667, June 2010.
[45] W. R. Wade, An Introduction to Analysis, Prentice Hall, NJ, 1995.
[46] S. Wicker, Error Control Systems for Digital Communication and Storage,
Prentice Hall, NJ, 1995.
[47] R. W. Yeung, Information Theory and Network Coding, Springer, NY, 2008.