AT77.13
DIGITAL COMMUNICATIONS
Poompat Saengudomlert
Asian Institute of Technology
February 2012
Contents
1 Introduction
2 Review of Related Mathematics
  2.1 Review of Probability
  2.2 Review of Fourier Analysis
  2.3 Review of Linear Algebra
  2.4 Review of Random Processes
  2.5 Practice Problems
3 Source Coding
  3.1 Binary Source Code for Discrete Sources
  3.2 Entropy of Discrete Random Variables
  3.3 Source Coding Theorem for Discrete Sources
  3.4 Asymptotic Equipartition Property
  3.5 Source Coding for Discrete Sources with Memory
  3.6 Source Coding for Continuous Sources
  3.7 Vector Quantization
  3.8 Summary
  3.9 Practice Problems
4 Communication Signals
  4.1 L2 Signal Space
  4.2 Pulse Amplitude Modulation
  4.3 Nyquist Criterion for No ISI
  4.4 Passband Modulation: DSB-AM and QAM
  4.5 K-Dimensional Signal Sets
  4.6 Summary
  4.7 Practice Problems
5 Signal Detection
  5.1 Hypothesis Testing
  5.2 AWGN Channel Model
  5.3 Optimal Receiver for AWGN Channels
  5.4 Performance of Optimal Receivers
  5.5 Detection of Multiple Transmitted Symbols
  5.6 Comparison of Modulation Schemes
  5.7 Summary
  5.8 Practice Problems
6 Channel Coding
  6.1 Hard Decision and Soft Decision Decoding
  6.2 Binary Linear Block Codes
  6.3 Binary Linear Convolutional Codes
  6.4 Summary
  6.5 Practice Problems
7 Capacities of Communication Channels
  7.1 Discrete Memoryless Channels
  7.2 Mutual Information
  7.3 Capacity of a DMC
  7.4 Jointly Typical Sequences and Joint AEP
  7.5 Data Processing and Fano Inequalities
  7.6 Proof of Channel Coding Theorem for DMCs
  7.7 Differential Entropy
  7.8 Capacity of AWGN Channels
  7.9 Summary
  7.10 Appendix: Convex Functions and Jensen Inequality
  7.11 Practice Problems
Chapter 1
Introduction
In this course, we discuss principles of digital communications. We shall focus on
the fundamental knowledge behind the construction of practical systems, rather than
on detailed specifications of particular standards or commercial systems. Having
mastered the fundamental knowledge, you should be able to read and understand the
technical specifications of practical systems throughout your career. For most
of the course, we shall focus our attention on point-to-point digital communication
systems, leaving the networking aspects of digital communications to other courses.
Figure 1.1 shows a block diagram of a typical point-to-point communication system.
We discuss the different parts of the block diagram below.
[Figure 1.1 block diagram. Transmitter: input -> source encoder -> bits -> channel encoder -> bits -> modulator -> signal waveform -> physical channel. Receiver: physical channel -> signal waveform -> demodulator -> bits -> channel decoder -> bits -> source decoder -> output.]
Figure 1.1: Block diagram of a point-to-point digital communication system.

Source coding: The function of a source encoder is to efficiently represent input
information as a sequence of bits called message bits or information bits. The
function of a source decoder is to convert the information bits back to the
original information, or as close as possible to the original if not exactly the
same.
Channel coding: The function of a channel encoder is to transform information
bits into another sequence of bits, which we shall refer to as transmitted bits,
that can be efficiently transmitted through the channel. This is accomplished
by introducing redundancy to the information bits so that bit errors can be
detected and in some cases corrected at the decoder. The function of a channel
decoder is to convert the possibly corrupted received bit sequence back to the
information bits, or as close as possible to the information bits.
Modulation: The function of a modulator is to map the coded bit sequence into
a signal waveform suitable for transmission over the physical channel. The
function of a demodulator is to convert the possibly corrupted received signal
waveform back to the transmitted bit sequence, or as close as possible to the
transmitted bit sequence.
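The three stages above can be sketched as a toy end-to-end chain. This is only an illustration under simplifying assumptions that are not from the text: the source encoder uses plain 8-bit ASCII (no compression), the channel code is a rate-1/3 repetition code, the modulator maps bits to amplitudes +1 and -1, and `physical_channel` is a hypothetical channel that inverts every third sample.

```python
def source_encode(text):
    """Represent each character by 8 bits (no compression; illustration only)."""
    return [int(b) for ch in text.encode("ascii") for b in format(ch, "08b")]

def source_decode(bits):
    """Convert information bits back to the original characters."""
    chars = [int("".join(map(str, bits[i:i + 8])), 2) for i in range(0, len(bits), 8)]
    return bytes(chars).decode("ascii")

def channel_encode(bits, r=3):
    """Rate-1/r repetition code: add redundancy by repeating each bit r times."""
    return [b for b in bits for _ in range(r)]

def channel_decode(bits, r=3):
    """Majority vote over each block of r received bits."""
    return [int(sum(bits[i:i + r]) > r // 2) for i in range(0, len(bits), r)]

def modulate(bits):
    """Map bit 0 to amplitude +1.0 and bit 1 to amplitude -1.0."""
    return [1.0 - 2.0 * b for b in bits]

def demodulate(samples):
    """Decide each bit from the sign of the received sample."""
    return [0 if s >= 0 else 1 for s in samples]

def physical_channel(samples, flip_every=3):
    """Hypothetical channel: inverts the sign of every flip_every-th sample."""
    return [-s if i % flip_every == 0 else s for i, s in enumerate(samples)]

message = "AIT"
tx_bits = channel_encode(source_encode(message))
rx_samples = physical_channel(modulate(tx_bits))
received = source_decode(channel_decode(demodulate(rx_samples)))
print(received)
```

In this sketch the demodulated bit sequence differs from the transmitted one, yet the message is recovered because each block of three repeated bits contains at most one error.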
Note that the structure in figure 1.1 is common, but is not always used. For
example, in some cases, it is desirable to perform channel coding and modulation
together in a single step called coded modulation. Breaking the overall communication
problem into different steps is in general suboptimal. However, such separations
are often practical; different parts of the system can be designed and constructed
separately.
If the information signal from the source is an analog waveform, the source encoder
typically needs to perform sampling and quantization on the input. Sampling refers
to obtaining sample values from the waveform, while quantization refers to converting
the sample values to information bits. Sampling and quantization are in general lossy,
i.e. even if the physical channel is ideal, the system output will be distorted
and cannot be used to retrieve the original waveform exactly.
For applications that require encryption, we can add the encryptor after the source
encoder, and the decryptor after the channel decoder. Encryption is, however, beyond
the scope of this course and will not be discussed.
The subsequent chapters discuss various components of a typical digital communication system. As a note to the student reader, the sections that are marked with
are optional materials; you will not be responsible for them in the examinations.
Chapter 2
Review of Related Mathematics
In this chapter, we give a brief review of the basic mathematical tools that we shall use
in the analysis of digital communication systems. The review includes probability,
Fourier analysis, linear algebra, and random processes. The review is not meant to
be comprehensive, but is used to help refresh relevant concepts that we shall use in
this course. Several results are stated without proofs. However, references are given
for more detailed information.
2.1 Review of Probability
The sample space $S$ of an experiment is the set of all possible outcomes. An event
is a set of outcomes, or a subset of the sample space. For an event $E$, we shall use
$\Pr\{E\}$ to denote the probability of $E$. We first present the axioms that a probability
measure must satisfy.
Axiom 2.1 (Axiom of probability): Let $S$ be the sample space and $E, F \subseteq S$ be
events.
1. $\Pr\{S\} = 1$.
2. $0 \le \Pr\{E\} \le 1$.
3. If $E$ and $F$ are disjoint, then $\Pr\{E \cup F\} = \Pr\{E\} + \Pr\{F\}$.
The above axiom can be used to prove basic properties such as $\Pr\{E^c\} = 1 - \Pr\{E\}$
and $\Pr\{E \cup F\} = \Pr\{E\} + \Pr\{F\} - \Pr\{E, F\}$ (see footnotes 1 and 2). For example, since $E$ and $E^c$ are
disjoint and their union is $S$, from statement 3 of the axiom, $\Pr\{S\} = \Pr\{E\} + \Pr\{E^c\}$.
From statement 1, we obtain the desired property: $1 = \Pr\{E\} + \Pr\{E^c\}$.
By induction, statement 3 can be extended to three or more events. In particular,
if $E_1, \ldots, E_n$ are disjoint, then
\[ \Pr\Big\{ \bigcup_{j=1}^{n} E_j \Big\} = \sum_{j=1}^{n} \Pr\{E_j\}. \]
The conditional probability of event $E$ given that event $F$ happens (or in short
given event $F$), denoted by $\Pr\{E|F\}$, is defined as $\Pr\{E|F\} = \Pr\{E, F\} / \Pr\{F\}$.
Footnote 1: $E^c$ denotes the complement of $E$, i.e. $E^c = S \setminus E$.
Footnote 2: $\Pr\{E, F\}$ denotes $\Pr\{E \cap F\}$.
A partition of $E$ is a set of disjoint subsets of $E$ whose union is equal to $E$. Let $F_1, \ldots, F_n$
be a partition of the sample space $S$. From the definition of conditional probability, we can obtain the
total probability theorem, which underlies the Bayes rule and is written as $\Pr\{E\} = \sum_{j=1}^{n} \Pr\{E|F_j\} \Pr\{F_j\}$.
Events $E$ and $F$ are independent if $\Pr\{E, F\} = \Pr\{E\} \Pr\{F\}$. Equivalently, $E$ and
$F$ are independent if $\Pr\{E|F\} = \Pr\{E\}$. In addition, events $E$ and $F$ are conditionally
independent given event $G$ if $\Pr\{E, F|G\} = \Pr\{E|G\} \Pr\{F|G\}$.
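As a small check of these definitions, the sketch below computes conditional probabilities, the sum $\sum_j \Pr\{E|F_j\}\Pr\{F_j\}$ over a partition, and an independence test using exact fractions. The fair six-sided die and the particular events are assumptions chosen only for illustration.

```python
from fractions import Fraction

# Sample space of a fair six-sided die; every outcome has probability 1/6.
S = {1, 2, 3, 4, 5, 6}

def pr(event):
    """Probability of an event (a subset of S) under equally likely outcomes."""
    return Fraction(len(event & S), len(S))

def pr_given(event, cond):
    """Conditional probability Pr{E|F} = Pr{E, F} / Pr{F}."""
    return pr(event & cond) / pr(cond)

E = {2, 4, 6}                          # the outcome is even
partition = [{1, 2}, {3, 4}, {5, 6}]   # disjoint events whose union is S

# Sum of Pr{E|F_j} Pr{F_j} over the partition; it should equal Pr{E}.
total = sum(pr_given(E, F) * pr(F) for F in partition)
print(total, pr(E))

# Independence check: Pr{E, G} versus Pr{E} Pr{G}.
G = {1, 2, 3, 4}
independent = pr(E & G) == pr(E) * pr(G)
print(independent)
```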
Random Variables
A random variable (RV) is a mapping from a sample space $S$ to a set of finite real
numbers. By convention, we use capital letters to denote RVs and use lower case
letters to denote their values. Strictly speaking, the value of a RV must be a real
number. Note that a result of an experiment, e.g. head or tail in a coin toss, may
not be a RV. (Some prefer to use the term chance variable for such a quantity.)
However, if we assign real numbers to outcomes, e.g. 1 for head and 0 for tail, it is
straightforward to turn such an experimental result into a RV. For this reason, we
shall use the term RV to refer to any experimental result.
A discrete RV takes on a discrete, i.e. countable, set of values (see footnote 3). A continuous
RV takes on a continuous set of values. A RV can also be neither discrete nor continuous.
A probability distribution or in short a distribution of a RV $X$, denoted by $F_X(x)$, is
defined as $F_X(x) = \Pr\{X \le x\}$. An example of a distribution for a RV that is neither
discrete nor continuous is given below. Note that $X$ is equal to $-1$ with probability
1/2, or else is uniformly distributed between 0 and 1.
\[ F_X(x) = \begin{cases} 0, & x \in (-\infty, -1) \\ 1/2, & x \in [-1, 0) \\ (x + 1)/2, & x \in [0, 1) \\ 1, & x \in [1, \infty) \end{cases} \]
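The mixed distribution above can be checked by simulation. The following is a minimal sketch, assuming (as the distribution indicates) that $X = -1$ with probability 1/2 and is otherwise uniform on $[0, 1)$; the sample size and random seed are arbitrary choices.

```python
import random

def cdf(x):
    """F_X(x) for the mixed RV: a jump of 1/2 at x = -1, then a ramp on [0, 1)."""
    if x < -1:
        return 0.0
    if x < 0:
        return 0.5
    if x < 1:
        return (x + 1) / 2
    return 1.0

def sample(rng):
    """Draw X = -1 with probability 1/2, otherwise uniform on [0, 1)."""
    return -1.0 if rng.random() < 0.5 else rng.random()

rng = random.Random(0)
xs = [sample(rng) for _ in range(100_000)]

# Compare the empirical fraction of samples <= x with F_X(x).
for x in (-1.0, -0.5, 0.25, 0.75):
    empirical = sum(v <= x for v in xs) / len(xs)
    print(x, cdf(x), round(empirical, 3))
```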
The probability mass function (PMF) for a discrete RV $X$, denoted by $f_X(x)$,
is defined as $f_X(x) = \Pr\{X = x\}$. The probability density function (PDF) for a
continuous RV $X$, also denoted by $f_X(x)$, is defined as $f_X(x) = dF_X(x)/dx$ when the
derivative exists. Note that we use the same notation for PMF and PDF. It will be
clear from the context whether $f_X(x)$ is a PMF or a PDF.
A joint distribution of RVs $X_1, \ldots, X_n$, denoted by $F_{X_1,\ldots,X_n}(x_1, \ldots, x_n)$, is defined
as $F_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = \Pr\{X_1 \le x_1, \ldots, X_n \le x_n\}$. The joint PDFs and PMFs are
defined similarly to the case of a single RV. In particular, we can write the joint PDF
of RVs $X_1, \ldots, X_n$ as
\[ f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = \frac{\partial^n F_{X_1,\ldots,X_n}(x_1, \ldots, x_n)}{\partial x_1 \cdots \partial x_n}. \]
RVs $X$ and $Y$ are independent if $F_{X,Y}(x, y) = F_X(x) F_Y(y)$ for all $x$ and $y$.
When the PDFs/PMFs exist, we can write $f_{X,Y}(x, y) = f_X(x) f_Y(y)$ if $X$ and $Y$
are independent. More generally, RVs $X_1, \ldots, X_n$ are independent if we can write
$f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = \prod_{j=1}^{n} f_{X_j}(x_j)$.
Footnote 3: A set $A$ is countable if we can assign a one-to-one mapping from its elements to a subset of
the positive integers $\{1, 2, \ldots\}$.
Conditional PDFs/PMFs
In this section, we discuss conditional PDFs/PMFs involving two RVs in four different
cases.
1. Consider two discrete RVs $X$ and $Y$. The conditional PMF of $X$ given $Y$,
denoted by $f_{X|Y}(x|y)$, is equal to the conditional probability $\Pr\{X = x | Y = y\}$.
It follows that
\[ f_{X|Y}(x|y) = \frac{\Pr\{X = x, Y = y\}}{\Pr\{Y = y\}} = \frac{f_{X,Y}(x, y)}{f_Y(y)}. \]
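2. For continuous RVs $X$ and $Y$, the conditional PDF of $X$ given $Y$, denoted by
$f_{X|Y}(x|y)$, is obtained from the conditional distribution $\Pr\{X \le x | y - \Delta y \le Y \le y\}$
by letting $\Delta y \to 0$ and taking the derivative with respect to $x$ (e.g. [?, p.
27]), i.e.
\[ f_{X|Y}(x|y) = \frac{\partial}{\partial x} \Big( \lim_{\Delta y \to 0} \Pr\{X \le x | y - \Delta y \le Y \le y\} \Big). \]
We first rewrite the conditional distribution as
\[ \Pr\{X \le x | y - \Delta y \le Y \le y\} = \frac{\Pr\{X \le x, y - \Delta y \le Y \le y\}}{\Pr\{y - \Delta y \le Y \le y\}} = \frac{[F_{X,Y}(x, y) - F_{X,Y}(x, y - \Delta y)] / \Delta y}{[F_Y(y) - F_Y(y - \Delta y)] / \Delta y}. \]
Letting $\Delta y \to 0$, we can write
\[ \lim_{\Delta y \to 0} \frac{[F_{X,Y}(x, y) - F_{X,Y}(x, y - \Delta y)] / \Delta y}{[F_Y(y) - F_Y(y - \Delta y)] / \Delta y} = \frac{\partial F_{X,Y}(x, y) / \partial y}{\partial F_Y(y) / \partial y}. \]
Finally, taking the derivative with respect to $x$ and using the definition of PDF,
we can write
\[ \frac{\partial}{\partial x} \Big( \frac{\partial F_{X,Y}(x, y) / \partial y}{\partial F_Y(y) / \partial y} \Big) = \frac{\partial^2 F_{X,Y}(x, y) / \partial x \partial y}{\partial F_Y(y) / \partial y} = \frac{f_{X,Y}(x, y)}{f_Y(y)}. \]
Therefore, for continuous RVs $X$ and $Y$, we can write $f_{X|Y}(x|y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}$.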
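3. Consider the conditional PDF of continuous RV $X$ given discrete RV $Y$, denoted
by $f_{X|Y}(x|y)$. We can define
\[ f_{X|Y}(x|y) = \frac{\partial}{\partial x} \big( \Pr\{X \le x | Y = y\} \big), \]
from which we can write
\[ f_{X|Y}(x|y) = \frac{\partial}{\partial x} \Big( \frac{\Pr\{X \le x, Y = y\}}{\Pr\{Y = y\}} \Big) = \frac{\partial \big( \Pr\{X \le x, Y = y\} \big) / \partial x}{\Pr\{Y = y\}} = \frac{f_{X,Y}(x, y)}{f_Y(y)}. \]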
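4. Consider the conditional PMF of discrete RV $X$ given continuous RV $Y$, denoted by $f_{X|Y}(x|y)$. In this case, we can define
\[ f_{X|Y}(x|y) = \lim_{\Delta y \to 0} \Pr\{X = x | y - \Delta y \le Y \le y\}, \]
from which we can write
\[ f_{X|Y}(x|y) = \lim_{\Delta y \to 0} \frac{\Pr\{X = x, y - \Delta y \le Y \le y\} / \Delta y}{\Pr\{y - \Delta y \le Y \le y\} / \Delta y} = \frac{\lim_{\Delta y \to 0} \Pr\{X = x, y - \Delta y \le Y \le y\} / \Delta y}{\lim_{\Delta y \to 0} \Pr\{y - \Delta y \le Y \le y\} / \Delta y} = \frac{f_{X,Y}(x, y)}{f_Y(y)}. \]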
In summary, in all cases of discrete and continuous RVs $X$ and $Y$, we can write
\[ f_{X|Y}(x|y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}, \text{ or equivalently } f_{X,Y}(x, y) = f_{X|Y}(x|y) f_Y(y). \tag{2.1} \]
It is worth noting that, if $X$ and $Y$ are independent, then $f_{X|Y}(x|y) = f_X(x)$, i.e.
knowing $Y$ does not alter the statistics of $X$.
Means, Variances, and Moment Generating Functions

From this point on, we focus on continuous RVs. Similar statements can be made for
discrete RVs.
The expected value or mean of a RV $X$, denoted by $E[X]$ or $\bar{X}$, is defined as
$E[X] = \int_{S_X} x f_X(x)\,dx$, where $S_X$ denotes the sample space of $X$. The variance of $X$,
denoted by $\mathrm{var}[X]$ or $\sigma_X^2$, is defined as $\mathrm{var}[X] = \int_{S_X} (x - \bar{X})^2 f_X(x)\,dx$. (Note that we
can also write $\mathrm{var}[X] = E[(X - \bar{X})^2]$, as will be seen shortly.) The standard deviation
of $X$, denoted by $\sigma_X$, is the positive square root of the variance.
The conditional mean of RV $X$ given that RV $Y$ is equal to $y$, denoted by $E[X|Y = y]$,
is defined as $E[X|Y = y] = \int_{S_{X|Y=y}} x f_{X|Y}(x|y)\,dx$, where $S_{X|Y=y}$ is the sample space
of $X$ given that $Y = y$. The conditional variance, denoted by $\sigma_{X|Y=y}^2$, can be defined
similarly, i.e. $\sigma_{X|Y=y}^2 = \int_{S_{X|Y=y}} (x - E[X|Y = y])^2 f_{X|Y}(x|y)\,dx$.
The $j$th moment of $X$ is defined as $E[X^j]$. The moment generating function (MGF)
of $X$, denoted by $\Phi_X(s)$, is defined as $\Phi_X(s) = E[e^{sX}]$. As the name suggests, the $j$th
moment of $X$ can be obtained by using the relationship $E[X^j] = d^j \Phi_X(s)/ds^j |_{s=0}$.
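As a worked illustration of how the MGF generates moments, consider an exponential RV with rate $\lambda > 0$. The choice of distribution and the symbol $\Phi_X(s)$ for the MGF are assumptions made here for the example only.

```latex
\begin{align*}
f_X(x) &= \lambda e^{-\lambda x}, \quad x \ge 0, \\
\Phi_X(s) &= E[e^{sX}] = \int_0^\infty e^{sx} \lambda e^{-\lambda x}\,dx
           = \frac{\lambda}{\lambda - s}, \quad s < \lambda, \\
E[X] &= \left.\frac{d\Phi_X(s)}{ds}\right|_{s=0}
      = \left.\frac{\lambda}{(\lambda - s)^2}\right|_{s=0} = \frac{1}{\lambda}, \\
E[X^2] &= \left.\frac{d^2\Phi_X(s)}{ds^2}\right|_{s=0}
        = \left.\frac{2\lambda}{(\lambda - s)^3}\right|_{s=0} = \frac{2}{\lambda^2}, \\
\mathrm{var}[X] &= E[X^2] - (E[X])^2 = \frac{1}{\lambda^2}.
\end{align*}
```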
Functions of RVs
Let $Y = g(X)$, where $g$ is a monotonically increasing and differentiable function. It
is known that the PDF of $Y$ is
\[ f_Y(y) = \frac{f_X(x)}{dg(x)/dx}, \]
where $y = g(x)$ (e.g. [?, p. 130] or [?, p. 541]).
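The formula can be checked numerically. The sketch below assumes, purely for illustration, that $X$ is uniform on $[0, 1]$ and $g(x) = e^x$, so that $f_Y(y) = 1/y$ on $[1, e]$ and $\Pr\{Y \le 2\} = \ln 2$. It compares the probability obtained by integrating $f_Y$ with the fraction of simulated samples satisfying $Y \le 2$.

```python
import math
import random

def f_X(x):
    """PDF of X, assumed uniform on [0, 1] for this example."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def g(x):
    """Monotonically increasing, differentiable transformation Y = g(X)."""
    return math.exp(x)

def g_inv(y):
    return math.log(y)

def dg_dx(x):
    return math.exp(x)

def f_Y(y):
    """PDF of Y = g(X): f_X(x) / (dg(x)/dx) evaluated at x = g^{-1}(y)."""
    x = g_inv(y)
    return f_X(x) / dg_dx(x)

# Pr{Y <= 2} by midpoint-rule integration of f_Y over [1, 2].
n, a, b = 10_000, 1.0, 2.0
h = (b - a) / n
prob_from_pdf = sum(f_Y(a + (k + 0.5) * h) for k in range(n)) * h

# The same probability estimated by transforming samples of X.
rng = random.Random(1)
samples = [g(rng.random()) for _ in range(100_000)]
prob_from_samples = sum(y <= 2.0 for y in samples) / len(samples)

print(round(prob_from_pdf, 4), round(prob_from_samples, 4), round(math.log(2), 4))
```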

The result can be extended to functions of multiple RVs (e.g. [?, p. 244]). Suppose
that we have $n$ RVs $X_1, \ldots, X_n$. Define $Y_1, \ldots, Y_n$ as $Y_k = g_k(X_1, \ldots, X_n)$, where
$g = [g_1, \ldots, g_n]$ is one-to-one. Then,
\[ f_{Y_1,\ldots,Y_n}(y_1, \ldots, y_n) = \frac{f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)}{|J(x_1, \ldots, x_n)|} \tag{2.2} \]
where $(y_1, \ldots, y_n) = g(x_1, \ldots, x_n)$ and
\[ J(x_1, \ldots, x_n) = \begin{vmatrix} \partial g_1/\partial x_1 & \cdots & \partial g_1/\partial x_n \\ \vdots & \ddots & \vdots \\ \partial g_n/\partial x_1 & \cdots & \partial g_n/\partial x_n \end{vmatrix}. \]
The quantity $J(x_1, \ldots, x_n)$ is called the Jacobian for the transformation $g$.
We often need to evaluate the expected value of a function of a RV. To compute
$E[Y]$ where $Y = g(X)$, it is often easier not to compute the distribution of $Y$ but to
use the relationship $E[Y] = \int_{S_X} g(x) f_X(x)\,dx$ (e.g. [?, p. 142] or [?, p. 560]). The
relationship can be extended to multiple RVs, i.e. if $Y = g(X_1, \ldots, X_n)$, then
\[ E[Y] = \int_{S_{X_1}} \cdots \int_{S_{X_n}} g(x_1, \ldots, x_n) f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)\,dx_1 \ldots dx_n. \tag{2.3} \]
In a special case where $Y = \prod_{j=1}^{n} g_j(X_j)$ and $X_1, \ldots, X_n$ are independent, we can
write $E[Y]$ as
\[ E[Y] = \Big( \int_{S_{X_1}} g_1(x_1) f_{X_1}(x_1)\,dx_1 \Big) \cdots \Big( \int_{S_{X_n}} g_n(x_n) f_{X_n}(x_n)\,dx_n \Big) = \prod_{j=1}^{n} E[g_j(X_j)]. \tag{2.4} \]
RVs $X$ and $Y$ are uncorrelated if $E[(X - \bar{X})(Y - \bar{Y})] = 0$. If $X$ and $Y$ are
independent, they are also uncorrelated, as shown below.
\[ E[(X - \bar{X})(Y - \bar{Y})] = E[X - \bar{X}] E[Y - \bar{Y}] = 0 \]
However, the converse is not true in general, i.e. uncorrelated RVs are not necessarily
independent. The exception is when $X$ and $Y$ are jointly Gaussian (to be discussed
later).
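A standard counterexample shows that uncorrelated RVs need not be independent. The sketch below assumes, for illustration only, that $X$ is uniform on $\{-1, 0, 1\}$ and $Y = X^2$; it computes the covariance exactly and then shows that the joint PMF does not factor into the product of the marginals.

```python
from fractions import Fraction

# PMF of X (uniform on {-1, 0, 1}) and joint PMF of (X, Y) with Y = X^2.
pmf_X = {-1: Fraction(1, 3), 0: Fraction(1, 3), 1: Fraction(1, 3)}
joint = {(x, x * x): p for x, p in pmf_X.items()}

def expect(fn):
    """Expected value of fn(X, Y) under the joint PMF."""
    return sum(p * fn(x, y) for (x, y), p in joint.items())

mean_x = expect(lambda x, y: x)
mean_y = expect(lambda x, y: y)
cov = expect(lambda x, y: (x - mean_x) * (y - mean_y))

# Marginal PMF of Y.
pmf_Y = {}
for (x, y), p in joint.items():
    pmf_Y[y] = pmf_Y.get(y, 0) + p

# Independence would require f_{X,Y}(x, y) = f_X(x) f_Y(y) for every pair.
p_joint = joint.get((0, 1), 0)     # Pr{X = 0, Y = 1}
p_product = pmf_X[0] * pmf_Y[1]    # Pr{X = 0} Pr{Y = 1}

print(cov, p_joint, p_product)
```

Here the covariance is zero while $\Pr\{X = 0, Y = 1\} = 0$ differs from $\Pr\{X = 0\}\Pr\{Y = 1\} = 2/9$, so $X$ and $Y$ are uncorrelated but dependent.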
Sum of RVs
Let $X$ and $Y$ be any two RVs. Then $X + Y$ is another RV. More generally, if $X_1, \ldots, X_n$
are RVs, so is the sum $\sum_{j=1}^{n} X_j$. Below are useful properties of the mean and the
variance of a sum of RVs. These properties can be derived from the definitions of
mean and variance.
For RVs $X$ and $Y$ and real numbers $a$ and $b$, $E[aX + bY] = aE[X] + bE[Y]$. More
generally, for RVs $X_1, \ldots, X_n$ and real numbers $a_1, \ldots, a_n$,
\[ E\Big[ \sum_{j=1}^{n} a_j X_j \Big] = \sum_{j=1}^{n} a_j E[X_j]. \tag{2.5} \]
For RVs $X$ and $Y$ and real numbers $a$ and $b$, $\mathrm{var}[aX + bY] = a^2 \mathrm{var}[X] + b^2 \mathrm{var}[Y]$
if $X$ and $Y$ are uncorrelated. More generally, for uncorrelated RVs $X_1, \ldots, X_n$ and
real numbers $a_1, \ldots, a_n$,
\[ \mathrm{var}\Big[ \sum_{j=1}^{n} a_j X_j \Big] = \sum_{j=1}^{n} a_j^2 \mathrm{var}[X_j]. \tag{2.6} \]
Note that the statement in (2.5) does not require the RVs to be uncorrelated to
be valid, while the statement in (2.6) does.
Laws of Large Numbers

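Laws of large numbers concern the statistics of a sum of independent and identically distributed (IID) RVs. We start by deriving a simple bound on the probability
$\Pr\{X \ge a\}$. Then we use the bound to establish the weak law of large numbers.
Theorem 2.1 (Markov inequality): For a nonnegative RV $X$ and $a > 0$,
\[ \Pr\{X \ge a\} \le \frac{E[X]}{a}. \]
Proof: $\Pr\{X \ge a\} = \int_a^\infty f_X(x)\,dx \le \int_a^\infty \frac{x}{a} f_X(x)\,dx \le \frac{1}{a} \int_0^\infty x f_X(x)\,dx = \frac{E[X]}{a}$.
Theorem 2.2 (Chebyshev inequality): For a RV $X$ and $b > 0$,
\[ \Pr\{|X - E[X]| \ge b\} \le \frac{\sigma_X^2}{b^2}. \]
Proof: Take $|X - E[X]|^2$ as a RV in the Markov inequality, with $a = b^2$.
Theorem 2.3 (Weak law of large numbers (WLLN)): Consider IID RVs $X_1, \ldots, X_n$
with mean $E[X]$ and variance $\sigma_X^2$. Define the average $S_n = \frac{1}{n} \sum_{j=1}^{n} X_j$. Then,
for any $\varepsilon > 0$,
\[ \lim_{n \to \infty} \Pr\{|S_n - E[X]| < \varepsilon\} = 1. \]
Proof: Take $S_n$ as a RV in the Chebyshev inequality. Since $S_n$ has mean $E[X]$ and, from (2.6), variance $\sigma_X^2 / n$, we have $\Pr\{|S_n - E[X]| \ge \varepsilon\} \le \sigma_X^2 / (n \varepsilon^2)$, which approaches 0 as $n \to \infty$.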
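The behavior described by the WLLN, together with the Chebyshev bound used in its proof, can be observed by simulation. This is a minimal sketch assuming IID RVs uniform on $[0, 1)$ (mean 1/2, variance 1/12); the deviation threshold, number of trials, and seed are arbitrary choices, and the helper name `deviation_probability` is introduced only for this example.

```python
import random

def deviation_probability(n, eps, trials, rng):
    """Estimate Pr{|S_n - E[X]| >= eps} for X uniform on [0, 1)."""
    mean = 0.5
    count = 0
    for _ in range(trials):
        s_n = sum(rng.random() for _ in range(n)) / n
        if abs(s_n - mean) >= eps:
            count += 1
    return count / trials

rng = random.Random(2012)
eps = 0.05
variance = 1.0 / 12.0   # variance of a uniform RV on [0, 1)

results = {}
for n in (10, 100, 1000):
    estimate = deviation_probability(n, eps, 2000, rng)
    # Chebyshev bound on the deviation probability, capped at 1.
    bound = min(1.0, variance / (n * eps ** 2))
    results[n] = (estimate, bound)
    print(n, estimate, round(bound, 4))
```

The estimated deviation probability shrinks as $n$ grows and stays below the Chebyshev bound, which is loose but sufficient to prove the theorem.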
Roughly speaking, the WLLN states that, as $n$ gets large, the empirical average
$S_n$ is close to the mean $E[X]$ with high probability. There is a stronger version of the law of large numbers
called the strong law of large numbers, which states that $\Pr\{\lim_{n \to \infty} S_n = E[X]\} = 1$
(e.g. [?, p. 566] or [?, p. 258]). However, we shall use only the weak law in this
course. Finally, we present the central limit theorem without proof (e.g. [?, p. 258]).
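Theorem 2.4 (Central limit theorem (CLT)): Consider IID RVs $X_1, \ldots, X_n$
with mean $E[X]$ and variance $\sigma_X^2$. Define the average $S_n = \frac{1}{n} \sum_{j=1}^{n} X_j$. Then,
\[ \lim_{n \to \infty} \Pr\Big\{ \frac{S_n - E[X]}{\sigma_X / \sqrt{n}} \le a \Big\} = \int_{-\infty}^{a} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx. \]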
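The statement of the CLT can be illustrated by comparing the empirical distribution of the normalized average with the Gaussian distribution function. The sketch assumes IID RVs uniform on $[0, 1)$ and uses $n = 30$; these values, the number of trials, and the seed are arbitrary choices for illustration.

```python
import math
import random

def gaussian_cdf(a):
    """Distribution function of a zero-mean unit-variance Gaussian RV."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

def normalized_average(n, rng):
    """(S_n - E[X]) / (sigma_X / sqrt(n)) for X uniform on [0, 1)."""
    mean, sigma = 0.5, math.sqrt(1.0 / 12.0)
    s_n = sum(rng.random() for _ in range(n)) / n
    return (s_n - mean) / (sigma / math.sqrt(n))

rng = random.Random(7)
n, trials = 30, 20_000
zs = [normalized_average(n, rng) for _ in range(trials)]

# Compare the empirical probability of {z <= a} with the Gaussian value.
for a in (-1.0, 0.0, 1.0):
    empirical = sum(z <= a for z in zs) / trials
    print(a, round(empirical, 3), round(gaussian_cdf(a), 3))
```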
Roughly speaking, the CLT states that, as $n$ gets large, the distribution of
$\frac{S_n - E[X]}{\sigma_X / \sqrt{n}}$ approaches that of a zero-mean unit-variance Gaussian RV.
2.2 Review of Fourier Analysis
We shall assume for general discussion that the signals are complex. Initially, when
we discuss baseband communications, this assumption is not required since we are
dealing with real signals. However, when we discuss passband communications, it is
convenient to consider complex signals.
Fourier Transforms of L2 Signals
The energy of a complex signal $u(t)$ is defined as $\int_{-\infty}^{\infty} |u(t)|^2\,dt$. We shall focus on
signals whose energies are finite. Such signals are called $L_2$ signals. Two $L_2$ signals
$u(t)$ and $v(t)$ are $L_2$-equivalent if their difference has zero energy, i.e. $\int_{-\infty}^{\infty} |u(t) - v(t)|^2\,dt = 0$.
We focus on $L_2$ signals partly because $L_2$ signals always have Fourier transforms
and their inverse transforms always exist in the $L_2$-equivalent sense [?, p. 118]. For
practical purposes, if two signals are $L_2$-equivalent, they are considered the same.
More specifically, let $u(t)$ be an $L_2$ signal. The Fourier transform of $u(t)$, denoted
by $\mathcal{F}\{u(t)\}$ or $\hat{u}(f)$, is equal to
\[ \hat{u}(f) = \mathcal{F}\{u(t)\} = \int_{-\infty}^{\infty} u(t) e^{-i 2\pi f t}\,dt. \tag{2.7} \]
For an $L_2$ signal $u(t)$ in the time domain, $\hat{u}(f)$ is an $L_2$ signal in the frequency
domain. The inverse Fourier transform of $\hat{u}(f)$, denoted by $\mathcal{F}^{-1}\{\hat{u}(f)\}$ or $u_{\mathrm{inv}}(t)$, is
equal to
\[ u_{\mathrm{inv}}(t) = \mathcal{F}^{-1}\{\hat{u}(f)\} = \int_{-\infty}^{\infty} \hat{u}(f) e^{i 2\pi f t}\,df. \tag{2.8} \]
Figure 2.1: Fourier transform pair for the rectangle signal.

As mentioned above, for an $L_2$ signal $u(t)$, $u_{\mathrm{inv}}(t)$ always exists and is $L_2$-equivalent
to $u(t)$. For convenience, for $L_2$ signals, we shall simply write $u(t) = v(t)$ to state
that $u(t)$ and $v(t)$ are $L_2$-equivalent. For example, in the above discussion, we simply
write $u_{\mathrm{inv}}(t) = u(t)$.
Example 2.1 Consider the rectangle signal $u(t) = \mathrm{rect}(t)$ shown in figure 2.1a. Its
Fourier transform is given by $\hat{u}(f) = \mathrm{sinc}(f)$ and is shown in figure 2.1b (see footnote 4). Finally,
the inverse Fourier transform $u_{\mathrm{inv}}(t)$ is shown in figure 2.1c. Note that $u_{\mathrm{inv}}(t)$ and
$u(t)$ are not exactly equal, but are $L_2$-equivalent.
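The transform pair in this example can be verified by evaluating the Fourier integral numerically. The sketch below assumes the usual convention that $\mathrm{rect}(t) = 1$ for $|t| \le 1/2$ and 0 otherwise, and $\mathrm{sinc}(x) = \sin(\pi x)/(\pi x)$; the midpoint rule and the number of integration steps are arbitrary choices.

```python
import cmath
import math

def rect(t):
    """Rectangle signal, assumed equal to 1 for |t| <= 1/2 and 0 otherwise."""
    return 1.0 if abs(t) <= 0.5 else 0.0

def sinc(x):
    """sinc(x) = sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def fourier_transform(u, f, t_min=-0.5, t_max=0.5, n=20_000):
    """Midpoint-rule approximation of the integral of u(t) exp(-i 2 pi f t) dt.

    The integration range defaults to the support of rect(t).
    """
    h = (t_max - t_min) / n
    total = 0j
    for k in range(n):
        t = t_min + (k + 0.5) * h
        total += u(t) * cmath.exp(-2j * math.pi * f * t)
    return total * h

for f in (0.0, 0.5, 1.0, 1.5):
    print(f, round(fourier_transform(rect, f).real, 6), round(sinc(f), 6))
```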

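We shall write $u(t) \leftrightarrow \hat{u}(f)$ to denote that $u(t)$ and $\hat{u}(f)$ are a Fourier transform
pair. It is assumed that the reader is familiar with basic Fourier transform properties
(e.g. [?, p. 223-225]). Some useful properties are given below.
  $a u(t) + b v(t) \leftrightarrow a \hat{u}(f) + b \hat{v}(f)$   (linearity)
  $u^*(t) \leftrightarrow \hat{u}^*(-f)$   (conjugation)
  $\hat{u}(t) \leftrightarrow u(-f)$   (time/frequency duality)
  $u(t - t_0) \leftrightarrow e^{-i 2\pi f t_0} \hat{u}(f)$   (time shift)
  $e^{i 2\pi f_0 t} u(t) \leftrightarrow \hat{u}(f - f_0)$   (frequency shift)
  $u(t/T) \leftrightarrow T \hat{u}(fT)$   (scaling, $T > 0$)
  $du(t)/dt \leftrightarrow i 2\pi f \hat{u}(f)$   (differentiation)
  $u(t) * v(t) \leftrightarrow \hat{u}(f) \hat{v}(f)$   (convolution)
  $\int_{-\infty}^{\infty} u(\tau) v^*(\tau - t)\,d\tau \leftrightarrow \hat{u}(f) \hat{v}^*(f)$   (correlation)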
From the definition of the Fourier transform and its inverse, note that we have the
following identities in the special cases when $t = 0$ and $f = 0$: $u(0) = \int_{-\infty}^{\infty} \hat{u}(f)\,df$ and
$\hat{u}(0) = \int_{-\infty}^{\infty} u(t)\,dt$. Using the first identity and the correlation property, we obtain
the Parseval theorem
\[ \int_{-\infty}^{\infty} u(t) v^*(t)\,dt = \int_{-\infty}^{\infty} \hat{u}(f) \hat{v}^*(f)\,df. \tag{2.9} \]
If we set $v(t) = u(t)$ in the Parseval theorem, we obtain the energy equation
\[ \int_{-\infty}^{\infty} |u(t)|^2\,dt = \int_{-\infty}^{\infty} |\hat{u}(f)|^2\,df. \tag{2.10} \]
The quantity $|\hat{u}(f)|^2$ is called the spectral density of $u(t)$, which describes the
amount of energy contained per unit frequency around $f$.
Footnote 4: $\mathrm{sinc}(x) = \frac{\sin(\pi x)}{\pi x}$.
Unit Impulse $\delta(t)$ and Its Fourier Transform

The unit impulse $\delta(t)$ is not an $L_2$ signal. In fact, it does not behave like an ordinary
function. To deal with $\delta(t)$ in a rigorous fashion, we need to view $\delta(t)$ as a generalized
function (e.g. [?, p. 269]). For our purpose, it suffices to consider $\delta(t)$ based on some
of its properties which we state below without proofs.
Let $u(t)$ be any $L_2$ signal, and $s(t)$ be the unit-step signal (see footnote 5).
1. $\int_{-\infty}^{\infty} \delta(t) u(t)\,dt = u(0)$
2. $\delta(t) = \frac{d}{dt} s(t)$
3. $\delta(t) \leftrightarrow 1$
4. $1 \leftrightarrow \delta(f)$
By assuming the above properties and manipulating $\delta(t)$ as if it were an ordinary
function, we can carry out most analysis in digital communications.
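Example 2.2 The signal $\cos(2\pi f_c t)$ is not an $L_2$ signal. Its Fourier transform can
be evaluated using the above properties of the unit impulse. In particular, from
the property of the Fourier transform pair $e^{i 2\pi f_c t} u(t) \leftrightarrow \hat{u}(f - f_c)$, setting $u(t) = 1$
yields $e^{i 2\pi f_c t} \leftrightarrow \delta(f - f_c)$. Similarly, from $e^{-i 2\pi f_c t} u(t) \leftrightarrow \hat{u}(f + f_c)$, we can write
$e^{-i 2\pi f_c t} \leftrightarrow \delta(f + f_c)$. It follows that
\[ \cos(2\pi f_c t) = \frac{1}{2} e^{-i 2\pi f_c t} + \frac{1}{2} e^{i 2\pi f_c t} \leftrightarrow \frac{1}{2} \delta(f + f_c) + \frac{1}{2} \delta(f - f_c), \]
which gives us the Fourier transform of $\cos(2\pi f_c t)$.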
Fourier Series of L2 Signals
For an $L_2$ signal $u(t)$ that is time-limited to the time interval $[-\frac{T}{2}, \frac{T}{2}]$, the following
set of Fourier series coefficients exists (see footnote 6).
\[ u_k = \frac{1}{T} \int_{-T/2}^{T/2} u(t) e^{-i 2\pi k t / T}\,dt, \quad k \in \mathbb{Z}. \tag{2.11} \]
In addition, the following signal reconstructed from the above coefficients is $L_2$-equivalent to $u(t)$ [?, p. 110].
\[ u_{\mathrm{rec}}(t) = \sum_{k=-\infty}^{\infty} u_k e^{i 2\pi k t / T}, \quad t \in \Big[ -\frac{T}{2}, \frac{T}{2} \Big]. \tag{2.12} \]
In addition, if the signal $u(t)$ is continuous, then the reconstruction is perfect, i.e.
$u_{\mathrm{rec}}(t)$ and $u(t)$ are exactly the same.
Footnote 5: The unit step signal is defined as $s(t) = \begin{cases} 0, & t < 0 \\ 1, & t \ge 0 \end{cases}$.
Footnote 6: $\mathbb{Z}$ denotes the set of all integers, while $\mathbb{Z}^+$ denotes the set of all nonnegative integers.
Figure 2.2: Fourier series reconstruction of the rectangle signal.

Example 2.3 Consider the rectangle signal $u(t) = \mathrm{rect}(2t/T)$ shown in figure 2.2a.
Its Fourier coefficients are given by $u_k = \frac{\sin(\pi k / 2)}{\pi k}$, $k \in \mathbb{Z}$ (with $u_0 = 1/2$). Finally, the reconstructed
signal is shown in figure 2.2b. Note that $u_{\mathrm{rec}}(t)$ and $u(t)$ are not exactly equal, but
are $L_2$-equivalent.
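The reconstruction in this example can be explored with truncated sums of the Fourier series. The sketch below assumes $T = 2$, the convention $\mathrm{rect}(t) = 1$ for $|t| \le 1/2$, and a truncation to $|k| \le 200$; all three are illustrative choices. At the discontinuity $t = T/4$ the series gives 1/2 while $u(T/4) = 1$ under this convention, which is consistent with the two signals being $L_2$-equivalent but not identical.

```python
import cmath
import math

T = 2.0  # assumed duration for this example

def u(t):
    """rect(2t/T): equal to 1 for |t| <= T/4 and 0 otherwise."""
    return 1.0 if abs(t) <= T / 4 else 0.0

def coeff(k):
    """Fourier series coefficient u_k = sin(pi k / 2) / (pi k), with u_0 = 1/2."""
    return 0.5 if k == 0 else math.sin(math.pi * k / 2) / (math.pi * k)

def partial_sum(t, K):
    """Truncated reconstruction: sum of u_k exp(i 2 pi k t / T) over |k| <= K."""
    total = sum(coeff(k) * cmath.exp(2j * math.pi * k * t / T)
                for k in range(-K, K + 1))
    return total.real

for t in (0.0, T / 4, 0.4 * T):
    print(t, u(t), round(partial_sum(t, 200), 4))
```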
2.3 Review of Linear Algebra
A field $F$ is a set of elements together with addition and multiplication defined to
satisfy the following field axioms (see footnote 7).
Axiom 2.2 (Field axioms): For all $a, b, c \in F$, we have the following properties.
1. Commutativity: $a + b = b + a$, $ab = ba$
2. Associativity: $(a + b) + c = a + (b + c)$, $(ab)c = a(bc)$
3. Distributivity: $a(b + c) = ab + ac$
4. Existence of additive and multiplicative identities: There exist elements, denoted by 0 and 1, such that $a + 0 = a$ and $1a = a$ for all $a \in F$.
5. Existence of additive inverse: For each $a \in F$, there exists an element, denoted
by $-a$, such that $a + (-a) = 0$.
6. Existence of multiplicative inverse: For each $a \in F$ with $a \ne 0$, there exists an
element, denoted by $a^{-1}$, such that $a a^{-1} = 1$.
An example of a field is the set of real numbers $\mathbb{R}$ with the usual addition and
multiplication. Another example of a field is the set of complex numbers $\mathbb{C}$ with
complex addition and multiplication. Recall that, for $a, b \in \mathbb{C}$, the addition and
multiplication of $a = a_R + i a_I$ and $b = b_R + i b_I$ are defined as $a + b = (a_R + b_R) + i(a_I + b_I)$
and $ab = (a_R b_R - a_I b_I) + i(a_R b_I + a_I b_R)$ respectively. For both $\mathbb{R}$ and $\mathbb{C}$, 0 is the
additive identity while 1 is the multiplicative identity.
Another important field in communication theory is the Galois field of order $k$,
denoted by $F_k$, where $k$ is a prime number. The field has $k$ elements which are the integers $0, \ldots, k - 1$. Its addition
and multiplication are given by the rules of modulo-$k$ or in short mod-$k$ arithmetic,
with 0 as the additive identity and 1 as the multiplicative identity. Figure 2.3 shows
the additive and multiplicative operations for $F_2$ and $F_5$. The field $F_2$ is also called
the binary field.
Footnote 7: The addition and multiplication of $a$ and $b$ are denoted by $a + b$ and $ab$ respectively.
Figure 2.3: Mod-k addition and multiplication for Fk .
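The mod-$k$ operations can be tabulated directly. The sketch below builds the addition and multiplication tables and searches for multiplicative inverses by brute force; the function names are introduced only for this example. It also shows why the order is taken to be prime: with mod-4 arithmetic, the element 2 has no multiplicative inverse, so the field axioms fail.

```python
def addition_table(k):
    """Entry [a][b] is (a + b) mod k."""
    return [[(a + b) % k for b in range(k)] for a in range(k)]

def multiplication_table(k):
    """Entry [a][b] is (a * b) mod k."""
    return [[(a * b) % k for b in range(k)] for a in range(k)]

def multiplicative_inverse(a, k):
    """Return b in {1, ..., k-1} with a * b = 1 (mod k), or None if none exists."""
    for b in range(1, k):
        if (a * b) % k == 1:
            return b
    return None

for row in multiplication_table(5):
    print(row)

# Every nonzero element of F_5 has a multiplicative inverse.
inverses_f5 = {a: multiplicative_inverse(a, 5) for a in range(1, 5)}
print(inverses_f5)

# With mod-4 arithmetic, 2 has no inverse, so {0, 1, 2, 3} is not a field.
inverse_of_2_mod_4 = multiplicative_inverse(2, 4)
print(inverse_of_2_mod_4)
```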
Vector Spaces
A vector space $V$ is a set of elements defined over a field $F$ according to the following
vector space axioms. The elements of the field are called scalars. The elements of a
vector space are called vectors.
Axiom 2.3 (Vector space axioms): For all $u, v, w \in V$ and $\alpha, \beta \in F$, we have
the following properties.
1. Closure: $u + v \in V$, $\alpha u \in V$
2. Axioms for addition
Commutativity: $u + v = v + u$
Associativity: $(u + v) + w = u + (v + w)$
Existence of identity: There exists an element in $V$, denoted by 0, such that
$u + 0 = u$ for all $u \in V$.
Existence of inverse: For each $u \in V$, there exists an element in $V$, denoted by
$-u$, such that $u + (-u) = 0$.
3. Axioms for multiplication
Associativity: $(\alpha\beta)u = \alpha(\beta u)$
Unit multiplication: $1u = u$
Distributivity: $\alpha(u + v) = \alpha u + \alpha v$, $(\alpha + \beta)u = \alpha u + \beta u$
In this course, we consider three different scalar fields: $\mathbb{R}$, $\mathbb{C}$, and $F_k$. A vector
space with scalar field $\mathbb{R}$ is called a real vector space. A vector space with scalar field
$\mathbb{C}$ is called a complex vector space. A vector space with scalar field $F_2$ is called a
binary vector space.
A set of vectors $v_1, \ldots, v_n \in V$ spans $V$ if each $u \in V$ can be written as a linear
combination of $v_1, \ldots, v_n$, i.e. $u = \sum_{j=1}^{n} \alpha_j v_j$ for some scalars $\alpha_1, \ldots, \alpha_n$. A vector
space $V$ is finite-dimensional if there is a finite set of vectors that spans $V$.
A set of vectors $v_1, \ldots, v_n \in V$ is linearly dependent if $\sum_{j=1}^{n} \alpha_j v_j = 0$ for some
scalars $\alpha_1, \ldots, \alpha_n$ not all equal to zero. A set of vectors $v_1, \ldots, v_n \in V$ is linearly
independent if it is not linearly dependent. A set of vectors $v_1, \ldots, v_n \in V$ is a
basis for $V$ if it spans $V$ and is linearly independent. The following theorem states
important properties of a basis [?, p. 157].
Theorem 2.5 Let $V$ be a finite-dimensional vector space.
1. If a set of vectors $v_1, \ldots, v_m \in V$ spans $V$ but is linearly dependent, then there
is a subset of $v_1, \ldots, v_m$ that forms a basis for $V$ with $n < m$ vectors.
2. If a set of vectors $v_1, \ldots, v_m \in V$ is linearly independent but does not span $V$,
then there is a basis for $V$ with $n > m$ vectors that includes $v_1, \ldots, v_m$.
3. Every basis for $V$ contains the same number of vectors.
Statement 3 of theorem 2.5 allows us to define the dimension of a finite-dimensional
vector space as the number of vectors in a basis. A vector space is infinite-dimensional
if it is not finite-dimensional. For such a space, a basis must contain an infinite number of vectors.
Inner Product Spaces

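An inner product defined on a vector space $V$ (defined over a field $F$) is a function
of two vectors $u, v \in V$, denoted by $\langle u, v \rangle$, that satisfies the following properties for
all $u, v, w \in V$ and all $\alpha \in F$.
1. Commutativity: $\langle u, v \rangle = \langle v, u \rangle^*$
2. Distributivity: $\langle u + v, w \rangle = \langle u, w \rangle + \langle v, w \rangle$
3. Associativity: $\langle \alpha u, v \rangle = \alpha \langle u, v \rangle$
4. Positivity: $\langle u, u \rangle \ge 0$ with equality if and only if $u = 0$
Note that we can use properties 1 and 2 to show that $\langle u, v + w \rangle = \langle u, v \rangle + \langle u, w \rangle$,
and properties 1 and 3 to show that $\langle u, \alpha v \rangle = \alpha^* \langle u, v \rangle$. A vector space with a defined
inner product is called an inner product space.
In an inner product space $V$, the norm of vector $u \in V$, denoted by $\|u\|$, is defined
as $\|u\| = \sqrt{\langle u, u \rangle}$. Two vectors $u, v \in V$ are called orthogonal if $\langle u, v \rangle = 0$.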
Example 2.4 One familiar inner product space is the vector space $\mathbb{R}^n$ consisting
of all real $n$-tuples with the inner product of $u = (u_1, \ldots, u_n)$ and $v = (v_1, \ldots, v_n)$
defined as $\langle u, v \rangle = \sum_{j=1}^{n} u_j v_j$ (see footnote 8). The corresponding norm is $\|u\| = \sqrt{\sum_{j=1}^{n} u_j^2}$. A real
linear vector space with a defined inner product is called a Euclidean space. Therefore,
$\mathbb{R}^n$ is a Euclidean space.
The vector space $\mathbb{C}^n$ consisting of all complex $n$-tuples can be made an inner
product space by defining the inner product of $u = (u_1, \ldots, u_n)$ and $v = (v_1, \ldots, v_n)$
as $\langle u, v \rangle = \sum_{j=1}^{n} u_j v_j^*$. The corresponding norm is $\|u\| = \sqrt{\sum_{j=1}^{n} |u_j|^2}$.
Subspaces and Projections

A subspace $S$ of a vector space $V$ is a subset of the vectors in $V$ that is itself a vector
space over the same scalar field.
Example 2.5 Consider the Euclidean space $\mathbb{R}^3$. Let $u = (1, 0, 0)$. Consider the set
of all vectors of the form $\alpha u$ for all $\alpha \in \mathbb{R}$. This set of vectors is itself a real vector
space and is therefore a subspace of $\mathbb{R}^3$. This subspace has dimension 1 and has $\{u\}$
as a basis.
However, consider the set of all vectors of the form $\alpha u$ for all real $\alpha \ne 0$.
This set of vectors does not contain the zero vector. It is therefore not a vector space, and hence not a subspace of $\mathbb{R}^3$.
Finally, let $v = (1, 1, 0)$. Consider the set of all vectors of the form $\alpha u + \beta v$ for all
$\alpha, \beta \in \mathbb{R}$. This set of vectors is itself a real vector space and is therefore a subspace
of $\mathbb{R}^3$. Note that this subspace has dimension 2 and has $\{u, v\}$ as a basis.

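Let $u$ and $v$ be two vectors in an inner product space $V$. The projection of $u$ on
$v$, denoted by $u_{|v}$, is defined as
\[ u_{|v} = \frac{\langle u, v \rangle}{\|v\|^2} v. \tag{2.13} \]
From the definition, we can verify that the difference between $u$ and $u_{|v}$ is orthogonal to $v$ as follows.
\[ \langle u - u_{|v}, v \rangle = \Big\langle u - \frac{\langle u, v \rangle}{\|v\|^2} v, v \Big\rangle = \langle u, v \rangle - \frac{\langle u, v \rangle}{\|v\|^2} \langle v, v \rangle = \langle u, v \rangle - \langle u, v \rangle = 0 \]
Let $u_{\perp v} = u - u_{|v}$. Since $\langle u_{\perp v}, v \rangle = 0$ and $u_{|v}$ is a scalar multiple of $v$, it follows
that $u_{\perp v}$ and $u_{|v}$ are orthogonal, i.e. $\langle u_{\perp v}, u_{|v} \rangle = 0$, and $u$ can be expressed as a
sum of two orthogonal components: $u = u_{\perp v} + u_{|v}$, as illustrated in figure 2.4 for
Euclidean space $\mathbb{R}^2$.
The following theorem states that $u_{|v}$ is the best approximation of $u$ among the
vectors in the subspace spanned by $v$ based on the square error.
Theorem 2.6 $\langle u, v \rangle / \|v\|^2 = \arg\min_{\alpha \in \mathbb{R}} \|u - \alpha v\|^2$.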
Footnote 8: The notation $(u_1, \ldots, u_n)$ corresponds to a column vector. To save space while writing, we
normally write $(u_1, \ldots, u_n)$ instead of $\begin{bmatrix} u_1 \\ \vdots \\ u_n \end{bmatrix}$.
Figure 2.4: Two orthogonal components of u R2 .

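Proof: We first show that, for two orthogonal vectors $\mathbf{x}$ and $\mathbf{y}$, $\|\mathbf{x} + \mathbf{y}\|^2 = \|\mathbf{x}\|^2 + \|\mathbf{y}\|^2$. To see this, we write
$$ \|\mathbf{x} + \mathbf{y}\|^2 = \langle \mathbf{x} + \mathbf{y}, \mathbf{x} + \mathbf{y} \rangle = \|\mathbf{x}\|^2 + \langle \mathbf{x}, \mathbf{y} \rangle + \langle \mathbf{y}, \mathbf{x} \rangle + \|\mathbf{y}\|^2 = \|\mathbf{x}\|^2 + \|\mathbf{y}\|^2, $$
where the last equality follows from the orthogonality of $\mathbf{x}$ and $\mathbf{y}$.
To prove the theorem, we write $\|\mathbf{u} - \alpha\mathbf{v}\|^2 = \|\mathbf{u}_{\perp\mathbf{v}} + (\mathbf{u}_{|\mathbf{v}} - \alpha\mathbf{v})\|^2$. Since $\mathbf{u}_{|\mathbf{v}} - \alpha\mathbf{v}$ is a scalar multiple of $\mathbf{v}$, $\mathbf{u}_{\perp\mathbf{v}}$ and $\mathbf{u}_{|\mathbf{v}} - \alpha\mathbf{v}$ are orthogonal. It follows that
$$ \|\mathbf{u} - \alpha\mathbf{v}\|^2 = \|\mathbf{u}_{\perp\mathbf{v}}\|^2 + \|\mathbf{u}_{|\mathbf{v}} - \alpha\mathbf{v}\|^2 \geq \|\mathbf{u}_{\perp\mathbf{v}}\|^2. $$
Therefore, the choice $\alpha = \langle \mathbf{u}, \mathbf{v} \rangle / \|\mathbf{v}\|^2$, which yields the square error $\|\mathbf{u}_{\perp\mathbf{v}}\|^2$, minimizes the square error.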
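As a numerical illustration of the projection and its minimum-error property, the following minimal sketch works in $\mathbb{R}^2$ with the standard dot product. The vectors, the grid of scalar values, and the helper names `inner` and `project` are assumptions chosen only for this example.

```python
def inner(x, y):
    """Standard inner product of two real vectors given as lists."""
    return sum(a * b for a, b in zip(x, y))

def project(u, v):
    """Projection of u on v: (<u, v> / ||v||^2) v."""
    scale = inner(u, v) / inner(v, v)
    return [scale * b for b in v]

u = [3.0, 1.0]
v = [2.0, 0.0]
u_par = project(u, v)                       # component along v
u_perp = [a - b for a, b in zip(u, u_par)]  # component orthogonal to v

# Square error ||u - alpha v||^2 evaluated on a grid of alpha values.
alphas = [-3.0 + 0.01 * i for i in range(601)]
errors = [inner([a - al * b for a, b in zip(u, v)],
                [a - al * b for a, b in zip(u, v)]) for al in alphas]
best_alpha = alphas[errors.index(min(errors))]

print(u_par, u_perp, best_alpha)
```

On this grid, the minimizing scalar coincides with $\langle \mathbf{u}, \mathbf{v} \rangle / \|\mathbf{v}\|^2 = 6/4 = 1.5$.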

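The identity $\mathbf{u} = \mathbf{u}_{\perp\mathbf{v}} + \mathbf{u}_{|\mathbf{v}}$ can also be used to derive two important inequalities which we state as theorems below.
Theorem 2.7 (Schwarz inequality): $|\langle \mathbf{u}, \mathbf{v} \rangle| \leq \|\mathbf{u}\|\,\|\mathbf{v}\|$
Proof: We first write $\|\mathbf{u}\|^2 = \|\mathbf{u}_{\perp\mathbf{v}} + \mathbf{u}_{|\mathbf{v}}\|^2 = \|\mathbf{u}_{\perp\mathbf{v}}\|^2 + \|\mathbf{u}_{|\mathbf{v}}\|^2 \geq \|\mathbf{u}_{|\mathbf{v}}\|^2$. From the definition $\mathbf{u}_{|\mathbf{v}} = \frac{\langle \mathbf{u}, \mathbf{v} \rangle}{\|\mathbf{v}\|^2}\mathbf{v}$, it follows that $\|\mathbf{u}\|^2 \geq \frac{|\langle \mathbf{u}, \mathbf{v} \rangle|^2}{\|\mathbf{v}\|^2}$, yielding $|\langle \mathbf{u}, \mathbf{v} \rangle| \leq \|\mathbf{u}\|\,\|\mathbf{v}\|$.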
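Theorem 2.8 (Triangle inequality): $\|\mathbf{u} + \mathbf{v}\| \leq \|\mathbf{u}\| + \|\mathbf{v}\|$
Proof: We first write $\|\mathbf{u} + \mathbf{v}\|^2 = \langle \mathbf{u} + \mathbf{v}, \mathbf{u} + \mathbf{v} \rangle = \|\mathbf{u}\|^2 + \langle \mathbf{u}, \mathbf{v} \rangle + \langle \mathbf{v}, \mathbf{u} \rangle + \|\mathbf{v}\|^2$. Note that $\langle \mathbf{u}, \mathbf{v} \rangle + \langle \mathbf{v}, \mathbf{u} \rangle = 2\,\mathrm{Re}\{\langle \mathbf{u}, \mathbf{v} \rangle\} \leq 2|\langle \mathbf{u}, \mathbf{v} \rangle|$.$^9$ The Schwarz inequality can then be applied to write
$$ \|\mathbf{u} + \mathbf{v}\|^2 \leq \|\mathbf{u}\|^2 + 2|\langle \mathbf{u}, \mathbf{v} \rangle| + \|\mathbf{v}\|^2 \leq \|\mathbf{u}\|^2 + 2\|\mathbf{u}\|\,\|\mathbf{v}\| + \|\mathbf{v}\|^2 = (\|\mathbf{u}\| + \|\mathbf{v}\|)^2, $$
yielding $\|\mathbf{u} + \mathbf{v}\| \leq \|\mathbf{u}\| + \|\mathbf{v}\|$.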
In an inner product space, a set of vectors $\phi_1, \ldots, \phi_n$ is orthonormal if
$$ \langle \phi_j, \phi_k \rangle = \begin{cases} 0, & j \neq k \\ 1, & j = k \end{cases} \tag{2.14} $$
We call a basis that is orthonormal an orthonormal basis. Note that the projection of vector $\mathbf{u}$ on a unit-norm vector $\phi_j$ has a simple form $\mathbf{u}_{|\phi_j} = \langle \mathbf{u}, \phi_j \rangle \phi_j$.
9 $\mathrm{Re}\{x\}$ denotes the real part of complex number $x$.
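Example 2.6 Consider the Euclidean space $\mathbb{R}^2$. One possible orthonormal basis is $\left\{ \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \end{bmatrix} \right\}$. A different orthonormal basis is $\left\{ \begin{bmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{bmatrix}, \begin{bmatrix} -1/\sqrt{2} \\ 1/\sqrt{2} \end{bmatrix} \right\}$.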
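If $S$ is a subspace of an inner product space $V$, then the projection of $\mathbf{u} \in V$ on $S$ is the vector in $S$, denoted by $\mathbf{u}_{|S}$, such that $\mathbf{u}_{\perp S} = \mathbf{u} - \mathbf{u}_{|S}$ is orthogonal to all vectors in $S$. Using the same argument as in the proof of theorem 2.6, we can show that the projection $\mathbf{u}_{|S}$ is the vector in $S$ that is closest to $\mathbf{u}$, i.e. $\|\mathbf{u} - \mathbf{u}_{|S}\|^2 \leq \|\mathbf{u} - \mathbf{v}\|^2$ for all $\mathbf{v} \in S$.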
Suppose that $S$ has dimension $n$ and an orthonormal basis $\{\phi_1, \ldots, \phi_n\}$. It follows that $\mathbf{u}_{|S} = \sum_{j=1}^{n} \langle \mathbf{u}, \phi_j \rangle \phi_j$, i.e. the projection of $\mathbf{u}$ on $S$ is the summation of one-dimensional projections of $\mathbf{u}$ on all the basis vectors. One can easily check that the corresponding $\mathbf{u}_{\perp S}$ is orthogonal to all vectors in $S$.
If $\mathbf{u}$ is itself in $S$, then $\mathbf{u}_{|S} = \mathbf{u}$ and $\mathbf{u}$ can be expressed as $\mathbf{u} = \sum_{j=1}^{n} \langle \mathbf{u}, \phi_j \rangle \phi_j$. Such an expression for $\mathbf{u}$ in terms of a linear combination of orthonormal basis vectors is called an orthonormal expansion of $\mathbf{u}$.
Given a set of orthogonal vectors $\mathbf{v}_1, \mathbf{v}_2, \ldots$, we can create an orthonormal set by normalizing them. The resultant normalized vectors are $\mathbf{v}_1/\|\mathbf{v}_1\|, \mathbf{v}_2/\|\mathbf{v}_2\|, \ldots$. If a given set of vectors $\mathbf{v}_1, \mathbf{v}_2, \ldots$ is linearly independent but is not orthogonal, then we can use the Gram-Schmidt procedure to create an orthonormal set $\phi_1, \phi_2, \ldots$ that spans the same vector space as follows.
Gram-Schmidt procedure:
1. Set $\phi_1 = \mathbf{v}_1/\|\mathbf{v}_1\|$.
2. At each step $j \in \{2, 3, \ldots\}$, subtract from $\mathbf{v}_j$ its projection on the subspace spanned by $\phi_1, \ldots, \phi_{j-1}$ to create an intermediate result $\theta_j$, i.e.
$$ \theta_j = \mathbf{v}_j - \sum_{k=1}^{j-1} \langle \mathbf{v}_j, \phi_k \rangle \phi_k. $$
Then, normalize $\theta_j$ to obtain $\phi_j$, i.e. $\phi_j = \theta_j/\|\theta_j\|$.
3. Repeat step 2 for additional basis vectors.
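The procedure above can be sketched in a few lines of Python. This is a minimal illustration for real vectors with the standard dot product; the function names `inner` and `gram_schmidt`, the variable `residual` (the intermediate result of step 2), and the three input vectors are assumptions made for this example. The input vectors are assumed to be linearly independent.

```python
import math

def inner(x, y):
    """Standard inner product of two real vectors given as lists."""
    return sum(a * b for a, b in zip(x, y))

def gram_schmidt(vectors):
    """Return an orthonormal list spanning the same space as `vectors`.

    Assumes the input vectors are real and linearly independent.
    """
    basis = []
    for v in vectors:
        # Subtract from v its projections on the basis vectors found so far.
        residual = list(v)
        for phi in basis:
            coeff = inner(v, phi)
            residual = [r - coeff * p for r, p in zip(residual, phi)]
        # Normalize the intermediate result to unit norm.
        norm = math.sqrt(inner(residual, residual))
        basis.append([r / norm for r in residual])
    return basis

phis = gram_schmidt([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
for phi in phis:
    print([round(c, 4) for c in phi])
```

For the first vector the inner loop is empty, so step 1 (plain normalization) falls out as a special case of step 2.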
2.4 Review of Random Processes
Recall that a random variable (RV) is a mapping from the sample space $S$ to the set of real numbers $\mathbb{R}$. In comparison, a stochastic process or random process is a mapping from the sample space $S$ to the set of real-valued functions called sample functions. We can denote a stochastic process as $\{X(t), t \in \mathbb{R}\}$ to emphasize that it consists of a set of RVs, one for each time $t$. However, for convenience, we shall simply use $X(t)$ instead of $\{X(t), t \in \mathbb{R}\}$ to denote a random process in this course.
A random process is strict sense stationary (SSS) if, for all values of $n \in \mathbb{Z}^+$, $t_1, \ldots, t_n$, and $\tau \in \mathbb{R}$, the joint PDF satisfies
$$ f_{X(t_1),\ldots,X(t_n)}(x_1, \ldots, x_n) = f_{X(t_1+\tau),\ldots,X(t_n+\tau)}(x_1, \ldots, x_n) $$
for all $x_1, \ldots, x_n$. Roughly speaking, the statistics of the random process look the same at all times.
Let $E[X(t)]$ and $\bar{X}(t)$ denote the mean of the random process $X(t)$ at time $t$. The covariance function, denoted by $K_X(t_1, t_2)$, is defined as
$$ K_X(t_1, t_2) = E\left[ \left( X(t_1) - \bar{X}(t_1) \right) \left( X(t_2) - \bar{X}(t_2) \right) \right]. $$
For the purpose of analyzing communication systems, it is usually sufficient to assume a stationary condition that is weaker than SSS. In particular, a random process $X(t)$ is wide-sense stationary (WSS) if, for all $t_1, t_2 \in \mathbb{R}$,
$$ E[X(t_1)] = E[X(0)] \quad \text{and} \quad K_X(t_1, t_2) = K_X(t_1 - t_2, 0). $$
Roughly speaking, for a WSS random process, the first and second order statistics look the same at all times.
Since the covariance function $K_X(t_1, t_2)$ of a WSS random process only depends on the time difference $t_1 - t_2$, we usually write $K_X(t_1, t_2)$ as a function with one argument $K_X(t_1 - t_2)$. Note that an SSS random process is always WSS, but the converse is not always true.
Define the correlation function of a WSS random process $X(t)$ as
$$ R_X(\tau) = E\left[ X(t) X(t - \tau) \right]. $$
The power spectral density (PSD), denoted by $S_X(f)$, is defined as the Fourier transform of $R_X(\tau)$, i.e. $R_X(\tau) \leftrightarrow S_X(f)$. It is possible to show that $S_X(f)$ is real and non-negative, and can be thought of as the power per unit frequency at $f$ (e.g. [?, p. 68]).
A complex-valued random process $Z(t)$ is defined as $Z(t) = X(t) + iY(t)$ where $X(t)$ and $Y(t)$ are random processes. The joint PDF of complex-valued RVs $Z(t_1), \ldots, Z(t_n)$ is given by the joint PDF of their components
$$ f_{X(t_1),\ldots,X(t_n),Y(t_1),\ldots,Y(t_n)}(x_1, \ldots, x_n, y_1, \ldots, y_n). $$
The covariance function of a complex-valued random process $Z(t)$ is defined as
$$ K_Z(t_1, t_2) = \frac{1}{2} E\left[ \left( Z(t_1) - \bar{Z}(t_1) \right) \left( Z(t_2) - \bar{Z}(t_2) \right)^* \right], $$
where the scaling factor 1/2 is introduced for convenience in the analysis. Finally, we can extend the definitions of SSS, WSS, and PSD to complex-valued random processes in a straightforward fashion.
Gaussian Processes
A set of RVs $X_1, \ldots, X_n$ are zero-mean jointly Gaussian if there is a set of IID zero-mean unit-variance Gaussian RVs $N_1, \ldots, N_m$ such that, for each $k \in \{1, \ldots, n\}$, $X_k$ can be expressed as a linear combination of $N_1, \ldots, N_m$, i.e. $X_k = \sum_{j=1}^{m} \alpha_{k,j} N_j$.
For convenience, define a random vector $\mathbf{X} = (X_1, \ldots, X_n)$ and a random vector $\mathbf{N} = (N_1, \ldots, N_m)$. In addition, define a matrix
$$ \mathbf{A} = \begin{bmatrix} \alpha_{1,1} & \cdots & \alpha_{1,m} \\ \vdots & \ddots & \vdots \\ \alpha_{n,1} & \cdots & \alpha_{n,m} \end{bmatrix} $$
so that we can write $\mathbf{X} = \mathbf{A}\mathbf{N}$.
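Let $\mathbf{K}_X$ be the covariance matrix for random vector $\mathbf{X}$, i.e.
$$ \mathbf{K}_X = \begin{bmatrix} E[(X_1 - \bar{X}_1)(X_1 - \bar{X}_1)] & \cdots & E[(X_1 - \bar{X}_1)(X_n - \bar{X}_n)] \\ \vdots & \ddots & \vdots \\ E[(X_n - \bar{X}_n)(X_1 - \bar{X}_1)] & \cdots & E[(X_n - \bar{X}_n)(X_n - \bar{X}_n)] \end{bmatrix}. $$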
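The PDF of a zero-mean jointly Gaussian random vector $\mathbf{X}$ is
$$ f_{\mathbf{X}}(\mathbf{x}) = \frac{1}{(2\pi)^{n/2} \sqrt{\det \mathbf{K}_X}}\, e^{-\frac{1}{2} \mathbf{x}^T \mathbf{K}_X^{-1} \mathbf{x}}. \tag{2.15} $$
The above PDF can be derived from the PDF of $\mathbf{N}$ together with (2.2). Note that, for an IID zero-mean unit-variance Gaussian random vector $\mathbf{N}$, the PDF has the following simple form.
$$ f_{\mathbf{N}}(\mathbf{n}) = \frac{1}{(2\pi)^{m/2}}\, e^{-\frac{1}{2} \mathbf{n}^T \mathbf{n}} \tag{2.16} $$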
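A random vector $\mathbf{X}' = (X'_1, \ldots, X'_n)$ is jointly Gaussian if $\mathbf{X}' = \mathbf{X} + \boldsymbol{\mu}$, where $\mathbf{X}$ is zero-mean jointly Gaussian and $\boldsymbol{\mu} \in \mathbb{R}^n$. The PDF of $\mathbf{X}'$ is
$$ f_{\mathbf{X}'}(\mathbf{x}') = \frac{1}{(2\pi)^{n/2} \sqrt{\det \mathbf{K}_X}}\, e^{-\frac{1}{2} (\mathbf{x}' - \boldsymbol{\mu})^T \mathbf{K}_X^{-1} (\mathbf{x}' - \boldsymbol{\mu})}. \tag{2.17} $$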
Some important properties of a jointly Gaussian random vector $\mathbf{X}'$ are listed below [?, chp. 7].
1. A linear transformation of $\mathbf{X}'$ yields another jointly Gaussian random vector.
2. The PDF of $\mathbf{X}'$ is fully determined by the mean $\boldsymbol{\mu}$ and the covariance matrix $\mathbf{K}_X$, which are the first-order and second-order statistics.
3. Jointly Gaussian RVs that are uncorrelated are independent.
We are now ready to define a Gaussian process. We say that $X(t)$ is a zero-mean Gaussian process if, for all $n \in \mathbb{Z}^+$ and $t_1, \ldots, t_n \in \mathbb{R}$, $(X(t_1), \ldots, X(t_n))$ is a zero-mean jointly Gaussian random vector. In addition, we say that $X'(t)$ is a Gaussian process if it is the sum of a zero-mean Gaussian process $X(t)$ and a deterministic function $\mu(t)$.
Some important properties of a Gaussian process $X'(t)$ are listed below [?, chp. 7].

1. For any linear time invariant (LTI) filter with impulse response $h(t)$ that is an $\mathcal{L}_2$ signal, if we pass $X'(t)$ through the filter, the output $X'(t) * h(t)$ is a Gaussian process.
2. The statistics of $X'(t)$ are fully determined by the mean $\mu(t)$ and the covariance function $K_X(t_1, t_2)$.
3. We refer to a quantity of the form $\int_{-\infty}^{\infty} X'(t) u(t)\, dt$ as an observable or linear functional of $X'(t)$. Any set of linear functionals of $X'(t)$ is jointly Gaussian.
Finally, a complex-valued Gaussian process $Z(t)$ is defined as $Z(t) = X(t) + iY(t)$ where $X(t)$ and $Y(t)$ are Gaussian processes.
Filtered Random Processes

Consider passing a WSS zero-mean random process $X(t)$ through an LTI filter with impulse response $h(t)$. Denote the output random process by $Y(t) = X(t) * h(t)$. Given the covariance function $K_X(\tau)$ of $X(t)$, we now derive the covariance function $K_Y(\tau)$ of $Y(t)$ in terms of $K_X(\tau)$ and $h(t)$.
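From the definition of the covariance function, we can write $K_Y(\tau) = E[Y(\tau)Y(0)]$. In addition, from $Y(\tau) = \int_{-\infty}^{\infty} h(\alpha) X(\tau - \alpha)\, d\alpha$ and $Y(0) = \int_{-\infty}^{\infty} h(\beta) X(-\beta)\, d\beta = \int_{-\infty}^{\infty} h(-\beta) X(\beta)\, d\beta$, we can write
$$ \begin{aligned} K_Y(\tau) &= E\left[ \left( \int_{-\infty}^{\infty} h(\alpha) X(\tau - \alpha)\, d\alpha \right) \left( \int_{-\infty}^{\infty} h(-\beta) X(\beta)\, d\beta \right) \right] \\ &= E\left[ \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} h(\alpha) h(-\beta) X(\tau - \alpha) X(\beta)\, d\alpha\, d\beta \right] \\ &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} h(\alpha) h(-\beta) K_X(\tau - \alpha - \beta)\, d\alpha\, d\beta \\ &= \int_{-\infty}^{\infty} h(-\beta) \left( h * K_X \right)(\tau - \beta)\, d\beta \\ &= h(\tau) * h(-\tau) * K_X(\tau) \end{aligned} \tag{2.18} $$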
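A discrete-time analogue of (2.18) can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the derivation: time is discretized to integer lags, the impulse response `h` is a hypothetical three-tap filter, and the input is assumed to be white with $K_X[\tau] = 1$ at $\tau = 0$ and zero elsewhere. The helper `convolve` is a plain discrete convolution written for this example.

```python
def convolve(a, b):
    """Discrete convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

h = [1.0, 0.5, 0.25]   # hypothetical impulse response h[0], h[1], h[2]
h_rev = h[::-1]        # samples of h(-tau) for tau = -2, -1, 0
k_x = [1.0]            # white input: K_X[tau] is 1 at tau = 0, else 0

# K_Y = h * h(-) * K_X; index i of the result corresponds to lag tau = i - 2.
k_y = convolve(convolve(h, h_rev), k_x)
print(k_y)
```

For a white input, the output covariance reduces to the autocorrelation of the impulse response: it is symmetric in the lag, and its value at lag 0 is the energy $\sum_k h[k]^2$ of the filter.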
In the frequency domain, (2.18) can be written in terms of the PSDs as
$$ S_Y(f) = |\hat{h}(f)|^2 S_X(f). \tag{2.19} $$
2.5 Practice Problems
Problem 2.1 (Constant addition to RV): Let $X$ be a RV with mean $\bar{X}$ and variance $\sigma_X^2$. Consider another RV $Y$ defined as $Y = X + a$, where $a \in \mathbb{R}$. Find the mean and the variance of $Y$.
Problem 2.2 (Means and variances of uniform RVs): Let $X$ be a uniformly distributed RV with the following PDF. Assume that $a > 0$.
$$ f_X(x) = \begin{cases} 1/a, & x \in [0, a] \\ 0, & \text{otherwise} \end{cases} $$
(a) Find the mean and the variance of $X$ in terms of $a$.
(b) Given 0 < b < a, find the conditional mean E[X|X b] and the conditional
variance var[X|X b] in terms of a and b.
Problem 2.3 (Sum of independent Gaussian RVs): Let $X$ and $Y$ be two independent Gaussian RVs. Let $\bar{X}$ and $\bar{Y}$ denote their means, and $\sigma_X^2$ and $\sigma_Y^2$ denote their variances. Using the fact that the MGF of a Gaussian RV with mean $\mu$ and variance $\sigma^2$ is given by $\Phi(s) = e^{s\mu + s^2\sigma^2/2}$, argue that the sum $X + Y$ is another Gaussian RV with mean $\bar{X} + \bar{Y}$ and variance $\sigma_X^2 + \sigma_Y^2$. (HINT: Find the MGF of the RV $X + Y$.)
Problem 2.4 (Problem 2.6 in [Pro95]): Let RVs $X_1, \ldots, X_n$ be IID binary RVs with the following PMF: $f_X(1) = p$ and $f_X(0) = 1 - p$. Define $Y = \sum_{j=1}^{n} X_j$.
(a) Find the MGF of $Y$, i.e. compute $\Phi_Y(s) = E[e^{sY}]$.
(b) Use the relationship $E[Y^j] = \left. \frac{d^j \Phi_Y(s)}{ds^j} \right|_{s=0}$ to find the first two moments of $Y$, i.e. $E[Y]$ and $E[Y^2]$.
Problem 2.5 (Uncorrelated and dependent RVs): Verify that the RVs $X$ and $Y$ with the following joint PMF are uncorrelated but are not independent.
$$ f_{X,Y}(1, 0) = f_{X,Y}(-1, 0) = f_{X,Y}(0, 1) = f_{X,Y}(0, -1) = 1/4 $$
Problem 2.6 (Sample mean and sample variance): Consider $n$ IID RVs $X_1, \ldots, X_n$. Let $\mu$ and $\sigma^2$ denote the mean and the variance of each $X_j$ respectively.
(a) The quantity $S_\mu = \frac{1}{n} \sum_{j=1}^{n} X_j$ is known as the sample mean. Show that $E[S_\mu] = \mu$.
(b) The quantity $S_{\sigma^2} = \frac{1}{n-1} \sum_{j=1}^{n} (X_j - S_\mu)^2$ is known as the sample variance. Show that $E[S_{\sigma^2}] = \sigma^2$.
Problem 2.7 (Pairwise independence versus independence): Recall that RVs $X_1, \ldots, X_n$ are independent if we can write the joint PMF/PDF as $f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) = \prod_{j=1}^{n} f_{X_j}(x_j)$.
RVs $X_1, \ldots, X_n$ are pairwise independent if $X_j$ and $X_k$ are independent for all $j, k \in \{1, \ldots, n\}$ with $j \neq k$, i.e. we can write $f_{X_j,X_k}(x_j, x_k) = f_{X_j}(x_j) f_{X_k}(x_k)$ for all such $j, k$.
Give an example of three RVs $X_1, X_2, X_3$ that are pairwise independent but are not independent. (HINT: One possible example can be constructed from two independent and equally likely binary RVs $X_1$ and $X_2$. For the third RV, let $X_3 = X_1 + X_2 \bmod 2$.)
Chapter 3
Source Coding
In this chapter, we shall consider the problem of source coding. With respect to the block diagram of the point-to-point digital communication system in figure 1.1, we shall discuss in detail the operations of the source encoder and decoder. We first focus on source coding for discrete sources. For continuous sources, we shall discuss sampling and quantization that convert a continuous source to a discrete one. During these discussions, we shall introduce basic definitions of several quantities in information theory, including entropy and mutual information.
3.1 Binary Source Code for Discrete Sources
A discrete source is an information source that produces a sequence of symbols each drawn from a finite set of symbols, called an alphabet and denoted by $\mathcal{X}$, according to some probability mass function (PMF) $f_X(x)$. An example of a discrete source is a person typing a text message on a computer keyboard.
A discrete source is memoryless if the source output symbols are statistically independent. Such a source is called a discrete memoryless source (DMS). Note that a practical source is usually not a DMS. For example, in an English article, what comes after the character "q" is almost always a "u". Therefore, there is some dependency among successive symbols. While DMSs are not always practical, they serve as a good starting point for the discussion on source coding.
A binary source code, or in short a code, denoted by $C$, is a mapping from alphabet $\mathcal{X}$ to finite sequences of bits. The bit sequence corresponding to symbol $x \in \mathcal{X}$ is called a codeword of $x$ and is denoted by $C(x)$. Denote the length (in bit) of $C(x)$ by $l(C(x))$. The expected codeword length of $C$, denoted by $L(C)$, can then be written as
$$ L(C) = \sum_{x \in \mathcal{X}} f_X(x)\, l(C(x)) \tag{3.1} $$
Example 3.1 Let $\mathcal{X} = \{a, b, c, d\}$ with the PMF $\{1/2, 1/4, 1/8, 1/8\}$, i.e. $f_X(a) = 1/2, \ldots, f_X(d) = 1/8$. Consider the code $C = \{0, 10, 110, 111\}$, where the codewords are for $a$, $b$, $c$, and $d$ respectively. It follows that $L(C) = \frac{1}{2} \cdot 1 + \frac{1}{4} \cdot 2 + \frac{1}{8} \cdot 3 + \frac{1}{8} \cdot 3 = 1.75$ bit.
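The computation in example 3.1 can be reproduced with a short script. This is a minimal sketch; the dictionary layout and the function name `expected_length` are assumptions made for illustration, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# PMF and code of example 3.1.
pmf = {"a": Fraction(1, 2), "b": Fraction(1, 4),
       "c": Fraction(1, 8), "d": Fraction(1, 8)}
code = {"a": "0", "b": "10", "c": "110", "d": "111"}

def expected_length(pmf, code):
    """Expected codeword length: sum over x of f_X(x) * l(C(x))."""
    return sum(pmf[x] * len(code[x]) for x in pmf)

avg = expected_length(pmf, code)
print(avg, float(avg))
```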

One fundamental problem that we shall consider is the design of a code C that
minimizes the expected codeword length L(C) in (3.1).
Fixed-Length and Variable-Length Codes

A fixed-length code has all the codewords of equal length. An example of a fixed-length code is the standard ASCII code that maps each keyboard character to a sequence of 7 bits, e.g. "z" is mapped to 1111010. In general, given the alphabet $\mathcal{X}$ and its associated fixed-length code $C$, we can write $L(C) = \lceil \log_2 |\mathcal{X}| \rceil$.
When the symbols are equally likely and $|\mathcal{X}| = 2^m$ for some positive integer $m$, a fixed-length code with $m$ bits minimizes $L(C)$. This can be seen after we discuss an optimal code construction. Otherwise, we may do better with a variable-length code. As the name suggests, in a variable-length code, the codeword lengths are not necessarily equal.
Intuitively, for a variable-length code, we can assign longer codewords to less likely symbols in order to minimize $L(C)$. An example of a variable-length code is the Morse code for telegraphs. In the Morse code, "e" is mapped to "." while "z" is mapped to "--..". One potential difficulty in using a variable-length code is the problem of parsing or separating individual codewords from the encoded bit sequence. For example, suppose that $\mathcal{X} = \{a, b, c\}$ and $C = \{0, 1, 01\}$. The coded bit sequence 01 can be decoded as either "ab" or "c". This example code is thus not useful.
A code $C$ is uniquely decodable if, for any finite sequence of source symbols $x_1, \ldots, x_n$, the concatenation of the codewords $C(x_1) \ldots C(x_n)$ differs from the concatenation of the codewords $C(y_1) \ldots C(y_m)$ for any other sequence of source symbols $y_1, \ldots, y_m$.
One special class of uniquely decodable codes is the class of prefix-free codes. A prefix of a bit sequence $b_1, \ldots, b_n$ is any initial subsequence $b_1, \ldots, b_m$, $m \leq n$, of that bit sequence. A code is prefix-free if no codeword is a prefix of any other codeword.
Example 3.2 Suppose that $\mathcal{X} = \{a, b, c\}$ and $C = \{0, 10, 11\}$. This code is prefix-free, and hence uniquely decodable.
Suppose now that $C = \{0, 01, 011\}$. This code is uniquely decodable by recognizing bit 0 as the beginning of each codeword. However, the code is not prefix-free.
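The prefix-free property is easy to test mechanically. The sketch below checks every ordered pair of codewords; the function name `is_prefix_free` is an assumption chosen for this illustration, and codewords are represented as strings of "0" and "1".

```python
def is_prefix_free(codewords):
    """Return True if no codeword is a prefix of another codeword."""
    for i, a in enumerate(codewords):
        for j, b in enumerate(codewords):
            if i != j and b.startswith(a):
                return False
    return True

print(is_prefix_free(["0", "10", "11"]))    # first code of example 3.2
print(is_prefix_free(["0", "01", "011"]))   # second code of example 3.2
```

Note that the check says nothing about unique decodability of a code that fails it: the second code of example 3.2 is not prefix-free but is still uniquely decodable.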

Prefix-free codes are desirable for the following reasons.
1. If there exists a uniquely decodable code with a certain set of codeword lengths, then a prefix-free code can easily be constructed with the same set of codeword lengths. We shall describe the justification of this statement shortly.
2. The decoder can decode each codeword immediately upon receiving the last bit of that codeword. (Some call a code with this property an instantaneous code. Thus, every prefix-free code is an instantaneous code.)
3. Given the PMF of the source symbols, it is easy to construct a prefix-free code with the minimum value of $L(C)$. We shall see such a construction shortly.
Figure 3.1: Example code tree of a prefix-free code.

With all the above properties of prefix-free codes, we shall focus on prefix-free codes; we shall not consider codes that are uniquely decodable but not prefix-free. A prefix-free code can be represented on a binary code tree or simply a code tree. Starting from the root, from each branching node, the two branches are labeled 0 and 1. Because of the prefix-free property, each codeword can correspond to a leaf of the tree. Such a code tree can be used in the decoding process, as illustrated in the following example.
Example 3.3 Suppose that $\mathcal{X} = \{a, b, c\}$ and $C = \{0, 10, 11\}$. The corresponding code tree is shown in figure 3.1.
Suppose that the decoder receives the bit sequence 1011011.... Proceeding through the code tree from the root, the decoder sees that 1 is not a codeword, but 10 is a codeword for $b$. So $b$ is decoded, leaving the sequence 11011.... Proceeding from the root again, the decoder sees that 1 is not a codeword, but 11 is a codeword for $c$. So $c$ is decoded, leaving the sequence 011.... This decoding process repeats over and over again.
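The decoding walk of example 3.3 can be sketched as follows. As a simplification, this illustration replaces the explicit code tree with a dictionary lookup on the bits accumulated so far, which is equivalent for a prefix-free code; the function name `decode` and the finite input string are assumptions made for the example.

```python
def decode(bits, code):
    """Decode a bit string with a prefix-free code given as {symbol: codeword}.

    Returns the decoded symbols and any leftover bits that do not yet
    form a complete codeword.
    """
    inverse = {word: symbol for symbol, word in code.items()}
    symbols, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:   # reached a leaf of the code tree
            symbols.append(inverse[current])
            current = ""         # restart from the root
    return "".join(symbols), current

code = {"a": "0", "b": "10", "c": "11"}
decoded, leftover = decode("1011011", code)
print(decoded, repr(leftover))
```

Each symbol is emitted as soon as its last bit arrives, which is the instantaneous decoding property of prefix-free codes.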

There is a simple way of checking whether it is possible to construct a prefix-free code for a given alphabet $\mathcal{X} = \{a_1, \ldots, a_M\}$ with a given set of codeword lengths $\{l_1, \ldots, l_M\}$. This result is known as the Kraft inequality.
Theorem 3.1 (Kraft inequality for prefix-free codes): There exists a prefix-free code for the alphabet $\mathcal{X} = \{a_1, \ldots, a_M\}$ with a given set of codeword lengths $\{l_1, \ldots, l_M\}$ if and only if $\sum_{m=1}^{M} 2^{-l_m} \leq 1$.
Proof: To show that a prefix-free code with codeword lengths $l_1, \ldots, l_M$ must satisfy the Kraft inequality, consider the following analogy. Suppose we have a unit of nutrition to be distributed from the root to the leaves of the code tree. At each branching node, the nutrition is distributed into two equal halves, as shown in figure 3.2. Note that it is possible that some leaves may not be used; they do not correspond to any codeword.
It follows that the amount of nutrition received by a codeword with length $l_m$ is obtained after $l_m$ splits and is equal to $2^{-l_m}$. Since we start with a unit of nutrition and since every codeword corresponds to a leaf in the tree, we must have $\sum_{m=1}^{M} 2^{-l_m} \leq 1$, where the equality holds when there is no unused leaf.
Figure 3.2: Analogy to nutrition distribution in a code tree.
Conversely, given the codeword lengths $l_1, \ldots, l_M$ that satisfy the Kraft inequality, a prefix-free code with these codeword lengths can be constructed as follows. Without loss of generality, assume that $l_1 \leq \ldots \leq l_M$. Start with a full binary code tree of depth $l_M$, i.e. $l_M$ branches between the root and each leaf. Pick any node at depth $l_1$ to be the first codeword leaf. At this point, all nodes at depth $\geq l_1$ are still available except for a fraction $2^{-l_1}$ of nodes stemming from the first codeword. Next, pick any available node at depth $l_2$ to be the second codeword leaf. At this point, all nodes at depth $\geq l_2$ are still available except for a fraction $2^{-l_1} + 2^{-l_2}$ of nodes stemming from the first and second codewords, as illustrated in figure 3.3.
Repeat the process until the last codeword. Since the Kraft inequality is satisfied, i.e. $\sum_{m=1}^{M} 2^{-l_m} \leq 1$, there is always at least a fraction $2^{-l_{j+1}}$ of nodes available at depth $l_{j+1}$ after each step $j$, $j \in \{1, \ldots, M-1\}$, of the process. This means there is never a problem of finding a free node to use as a codeword.
Figure 3.3: Systematic construction of a code tree from the codeword lengths $l_1 \leq \ldots \leq l_M$.
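The construction in the converse part of the proof can be sketched in code. The version below is one particular instance of the argument: where the proof allows any available node, this sketch always picks the leftmost available node at each depth. The function names `kraft_sum` and `prefix_code_from_lengths` are assumptions for this illustration, and all lengths are assumed to be positive integers.

```python
from fractions import Fraction

def kraft_sum(lengths):
    """Left-hand side of the Kraft inequality: sum of 2^(-l_m)."""
    return sum(Fraction(1, 2 ** l) for l in lengths)

def prefix_code_from_lengths(lengths):
    """Build a prefix-free code with the given (positive) codeword lengths.

    Codewords are returned in order of nondecreasing length.
    """
    if kraft_sum(lengths) > 1:
        raise ValueError("lengths violate the Kraft inequality")
    codewords = []
    value, prev_len = 0, 0
    for l in sorted(lengths):
        value = value * 2 ** (l - prev_len)   # move down to depth l
        codewords.append(format(value, "b").zfill(l))
        value += 1                            # next free node at depth l
        prev_len = l
    return codewords

print(kraft_sum([1, 2, 3, 3]))
print(prefix_code_from_lengths([3, 1, 3, 2]))
```

Because the Kraft sum of the lengths processed so far is at most 1, the counter `value` never runs past the last node at the current depth, which mirrors the "there is always a free node" step of the proof.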
It can be shown that the Kraft inequality also holds for any uniquely decodable code. For the proof of this fact, see [?, p. 115]. Therefore, given a uniquely decodable code, we can use its codeword lengths to construct a prefix-free code.
We have so far described a property that the codeword lengths of a prefix-free code must satisfy, i.e. the Kraft inequality. We shall now present a procedure to construct a prefix-free code with the minimum expected codeword length. This procedure is called the Huffman algorithm. The resultant code is called a Huffman code.
Figure 3.4: Construction of a Huffman code.
Huffman Algorithm
Suppose that the alphabet is $\mathcal{X} = \{a_1, \ldots, a_M\}$ with the PMF $\{p_1, \ldots, p_M\}$, i.e. $f_X(a_1) = p_1, \ldots, f_X(a_M) = p_M$. For a prefix-free code, two codewords are siblings of each other if they differ only in the last bit. The Huffman algorithm is an iterative process that proceeds as follows.
In each step, take the two least likely symbols, say with probabilities $q_1$ and $q_2$, and make them siblings. The pair of siblings is regarded as one symbol with probability $q_1 + q_2$ in the next step.
Repeat the process until only one symbol remains. The resultant code tree yields an optimal prefix-free code.
Example 3.4 Suppose that $M = 5$ and the PMF is $\{0.35, 0.2, 0.2, 0.15, 0.1\}$. The Huffman code tree is shown in figure 3.4. The corresponding set of codewords is $C = \{00, 10, 11, 010, 011\}$. For this code, $L(C) = (0.35 + 0.2 + 0.2) \cdot 2 + (0.15 + 0.1) \cdot 3 = 2.25$ bit.
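The iterative merging step can be sketched with a priority queue. The code below computes only the codeword lengths, which is all that is needed to evaluate $L(C)$; the function name `huffman_lengths`, the use of `heapq`, and the tie-breaking counter are implementation choices assumed for this illustration, not part of the algorithm's definition.

```python
import heapq
from fractions import Fraction
from itertools import count

def huffman_lengths(pmf):
    """Return the codeword lengths of a Huffman code for a list of probabilities."""
    tiebreak = count()
    # Heap entries: (probability, tie-breaker, indices of symbols in the subtree).
    heap = [(p, next(tiebreak), [i]) for i, p in enumerate(pmf)]
    heapq.heapify(heap)
    lengths = [0] * len(pmf)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)   # least likely symbol
        p2, _, s2 = heapq.heappop(heap)   # second least likely symbol
        for i in s1 + s2:
            lengths[i] += 1               # symbols under the merged node gain one bit
        heapq.heappush(heap, (p1 + p2, next(tiebreak), s1 + s2))
    return lengths

# PMF of example 3.4, written as exact fractions.
pmf = [Fraction("0.35"), Fraction("0.2"), Fraction("0.2"),
       Fraction("0.15"), Fraction("0.1")]
lengths = huffman_lengths(pmf)
avg = sum(p * l for p, l in zip(pmf, lengths))
print(lengths, float(avg))
```

Different tie-breaking rules can produce different trees, but every tree produced this way has the same expected codeword length.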

Note that a Huffman code is not unique. For example, we can arbitrarily interchange bit 0 and bit 1 in each branching step of the code tree without changing the value of $L(C)$.
Proof of Optimality for the Huffman Algorithm
This section provides a proof of optimality for the Huffman algorithm. The proof approach is based on [?, p. 123]. We start by proving some useful facts, and then go on to the optimality proof.
Lemma 3.1 An optimal code must satisfy the following properties.
1. If $p_j > p_k$, then $l_j \leq l_k$.
2. The two longest codewords have the same length.
Proof: We prove the two properties one by one below.
1. If there are $p_j > p_k$ with $l_j > l_k$, then we can swap the two codewords and $L(C)$ would decrease by $p_j l_j + p_k l_k - (p_j l_k + p_k l_j) = (p_j - p_k)(l_j - l_k) > 0$. Therefore, we must have $l_j \leq l_k$.
2. If the two longest codewords have different lengths, then we can delete the last bit of the longer one and use the result as a new codeword to reduce $L(C)$. This new codeword is valid because the longest codeword with a unique length cannot have any sibling, and thus the new codeword does not violate the prefix-free property.

Lemma 3.2 There exists an optimal prefix-free code such that the two least likely symbols correspond to two longest codewords that are siblings.
Proof: From lemma 3.1, the two least likely symbols must have codewords with the maximum length. However, they may not be siblings. In that case, we can exchange the codewords (with the maximum length) to make them siblings without changing $L(C)$.

Proof of optimality of the Huffman algorithm: Without loss of generality, assume that $p_1 \geq \ldots \geq p_M$. Let $C_M$ be any code for $M$ symbols that satisfies the property in lemma 3.2. From lemma 3.2, note that there is at least one optimal code with this property.
Define the merged code $C_{M-1}$ for $M - 1$ symbols as follows. The code $C_{M-1}$ has $M - 1$ codewords. The first $M - 2$ codewords are the same as those in $C_M$. The last codeword is the first $l_M - 1$ bits of the $M$th codeword (equivalently of the $(M-1)$th codeword) of $C_M$; it has the codeword length $l'_{M-1} = l_M - 1$ and has the symbol probability $p'_{M-1} = p_{M-1} + p_M$.
Let $L(C_M)$ and $L(C_{M-1})$ be the expected codeword lengths for $C_M$ and $C_{M-1}$ respectively. Note that we can write
$$ L(C_M) = \sum_{m=1}^{M} p_m l_m = \left( \sum_{m=1}^{M-2} p_m l_m \right) + p_{M-1} l_{M-1} + p_M l_M. $$
From the property in lemma 3.2, $l_{M-1} = l_M$. From the construction of $C_{M-1}$, we can write
$$ \begin{aligned} L(C_M) &= \left( \sum_{m=1}^{M-2} p_m l_m \right) + p'_{M-1} l_M \\ &= \left( \sum_{m=1}^{M-2} p_m l_m \right) + p'_{M-1} (l'_{M-1} + 1) \\ &= L(C_{M-1}) + p_{M-1} + p_M. \end{aligned} $$
The above relationship tells us that minimizing $L(C_M)$ can be done through minimizing $L(C_{M-1})$. Therefore, we can reduce the problem of finding $M$ codewords to finding $M - 1$ codewords without loss of optimality. By reducing the problem all the way down to 2 symbols, we have 2 obvious codewords for $C_2$, i.e. 0 and 1. This process is precisely what the Huffman algorithm does. Therefore, the Huffman algorithm yields an optimal code.
Figure 3.5: Huffman coding for block size equal to 1.
Coding Blocks of Symbols

In some cases, we can reduce the number of bits per symbol used for source coding by encoding a block of symbols at a time. This property is illustrated in the following example. For a fair comparison, we define the expected codeword length per symbol as $\bar{L}(C) = L(C)/n$, where $n$ is the number of symbols per block.
Example 3.5 Suppose that $\mathcal{X} = \{a, b, c\}$ and the symbols are equally likely. The corresponding Huffman code tree is shown in figure 3.5. For this code, we can compute $\bar{L}(C) = \frac{1}{3} \cdot 1 + 2 \cdot \frac{1}{3} \cdot 2 = \frac{5}{3} \approx 1.67$ bit/symbol.
Suppose now that a block of two symbols is encoded at a time. Then the new alphabet is $\mathcal{X}^2 = \{aa, ab, ac, ba, bb, bc, ca, cb, cc\}$ with equally likely symbols. The corresponding Huffman code tree is shown in figure 3.6. For this code, we can compute $\bar{L}(C) = \frac{1}{2} \left( 7 \cdot \frac{1}{9} \cdot 3 + 2 \cdot \frac{1}{9} \cdot 4 \right) = \frac{29}{18} \approx 1.61$ bit/symbol. Note the decrease in the value of $\bar{L}(C)$.
The process can be repeated to show that, for $n = 1, 2, 3, 4, 5, \ldots$, the corresponding values of $\bar{L}(C)$ are $1.67, 1.61, 1.60, 1.60, 1.59, \ldots$. The specific Huffman codes are not presented here.
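The per-symbol lengths quoted in example 3.5 can be reproduced without building the code trees explicitly. The sketch below relies on the fact that each Huffman merge adds one bit to every symbol under the merged node, so the expected codeword length equals the sum of the merged probabilities. The function names are assumptions made for this illustration.

```python
import heapq
from fractions import Fraction

def huffman_expected_length(pmf):
    """Expected codeword length of a Huffman code, as the sum of merged probabilities."""
    heap = list(pmf)
    heapq.heapify(heap)
    total = Fraction(0)
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged               # one extra bit for all symbols below this node
        heapq.heappush(heap, merged)
    return total

def per_symbol_length(n):
    """Per-symbol expected length when blocks of n ternary symbols are coded."""
    block_pmf = [Fraction(1, 3 ** n)] * (3 ** n)   # equally likely blocks
    return huffman_expected_length(block_pmf) / n

values = [per_symbol_length(n) for n in range(1, 6)]
for n, v in enumerate(values, start=1):
    print(n, v, round(float(v), 2))
```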
From the above example, there seems to be a lower bound on the value of $\bar{L}(C)$; this bound is approached as we increase $n$. In the next section, we define the entropy of a discrete random variable (RV), which is the quantity that serves as the limit value of $\bar{L}(C)$ as $n$ grows large.
3.2 Entropy of Discrete Random Variables
Consider a discrete RV $X$ with the alphabet $\mathcal{X}$ and the PMF $f_X(x)$. The entropy of $X$, denoted by $H(X)$, is defined as
$$ H(X) = -\sum_{x \in \mathcal{X}} f_X(x) \log f_X(x). \tag{3.2} $$
Figure 3.6: Huffman coding for block size equal to 2.

The unit of entropy H(X) is bit if the logarithm has base 2, and is nat (natural
unit) if the logarithm has base e (natural logarithm).
Example 3.6 Consider a binary RV $X$ with $f_X(1) = p$ and $f_X(0) = 1 - p$. Its entropy (in bit) is
$$ H(X) = -p \log_2(p) - (1 - p) \log_2(1 - p), $$
and is plotted in figure 3.7. Note that the maximum is equal to 1 and corresponds to having $p = 0.5$.
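The binary entropy of example 3.6 is simple to evaluate numerically. In the sketch below, the function name `binary_entropy` is an assumption, and the endpoints $p = 0$ and $p = 1$ are handled with the usual convention $0 \log 0 = 0$, which is also an assumption not stated in the text above.

```python
import math

def binary_entropy(p):
    """Entropy (in bit) of a binary RV with f_X(1) = p and f_X(0) = 1 - p."""
    if p in (0.0, 1.0):
        return 0.0   # convention: 0 log 0 = 0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    print(p, round(binary_entropy(p), 4))
```

The printed values are symmetric about $p = 0.5$, where the entropy reaches its maximum of 1 bit.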

Roughly speaking, $H(X)$ can be considered as the amount of uncertainty in RV $X$. The higher the value of $H(X)$, the higher the error probability in guessing $X$, as can be seen in figure 3.7. (In the extreme, if $H(X) = 0$, we can guess $X$ correctly with zero error probability.)
For convenience, the entropy of a binary RV with parameter $p$ will be denoted as $H_{\mathrm{bin}}(p)$, i.e.
$$ H_{\mathrm{bin}}(p) = -p \log_2(p) - (1 - p) \log_2(1 - p). \tag{3.3} $$
The following theorem contains important properties of $H(X)$.
Theorem 3.2 (Bounds on the entropy of a discrete RV): Let $H(X)$ be the entropy of a discrete RV $X$ with the alphabet $\mathcal{X}$. Denote the alphabet size by $M = |\mathcal{X}|$. Then, $H(X)$ has the following properties.
1. $H(X) \geq 0$.
2. $H(X) \leq \log M$, with equality if and only if the $M$ symbols are equally likely.
Figure 3.7: Entropy (in bit) of a binary RV.

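Proof: We shall prove the statements one by one below.
1. Since $-\log f_X(x)$ is always nonnegative for all $x \in \mathcal{X}$, it follows from (3.2) that $H(X) \geq 0$.
2. We first rewrite $H(X)$ as follows.
$$ \begin{aligned} H(X) &= \sum_{x \in \mathcal{X}} f_X(x) \log \frac{1}{f_X(x)} = \sum_{x \in \mathcal{X}} f_X(x) \log \frac{1/M}{f_X(x) \cdot 1/M} \\ &= \sum_{x \in \mathcal{X}} f_X(x) \log \frac{1}{1/M} + \sum_{x \in \mathcal{X}} f_X(x) \log \frac{1/M}{f_X(x)} \\ &= \log M + \sum_{x \in \mathcal{X}} f_X(x) \log \frac{1/M}{f_X(x)} \end{aligned} $$
Assume for now that the logarithm has base $e$. Using the fact that $\ln x \leq x - 1$, as shown in figure 3.8, we can bound $H(X)$ by$^1$
$$ \begin{aligned} H(X) &\leq \ln M + \sum_{x \in \mathcal{X}} f_X(x) \left( \frac{1/M}{f_X(x)} - 1 \right) = \ln M + \sum_{x \in \mathcal{X}} \left( \frac{1}{M} - f_X(x) \right) \\ &= \ln M + 1 - 1 = \ln M \end{aligned} $$
We see from figure 3.8 that the bound $\ln x \leq x - 1$ holds with equality if and only if $x = 1$. In the derivation above, we see that $H(X) = \ln M$ if and only if $f_X(x) = 1/M$ for all $x \in \mathcal{X}$, i.e. the symbols are equally likely.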
Finally, if the logarithm has base 2, then we can use the bound $\log_2 x = \frac{\ln x}{\ln 2} \leq \frac{x - 1}{\ln 2}$ to show that $H(X) \leq \log_2 M$. The argument is the same as before and is thus omitted.
1 $\ln x$ denotes the natural logarithm of $x$.
Figure 3.8: Upper bound $\ln x \leq x - 1$.

Consider now two discrete RVs $X$ and $Y$ with the alphabets $\mathcal{X}$ and $\mathcal{Y}$ and the PMFs $f_X(x)$ and $f_Y(y)$ respectively. The joint entropy of $X$ and $Y$, denoted by $H(X, Y)$, is defined as
$$ H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} f_{X,Y}(x, y) \log f_{X,Y}(x, y), \tag{3.4} $$
where $f_{X,Y}(x, y)$ is the joint PMF of $X$ and $Y$. The joint entropy $H(X, Y)$ can be considered as the amount of uncertainty in RVs $X$ and $Y$. The definition can be extended to $n$ RVs $X_1, \ldots, X_n$; their joint entropy is written as
$$ H(X_1, \ldots, X_n) = -\sum_{x_1 \in \mathcal{X}_1} \cdots \sum_{x_n \in \mathcal{X}_n} f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) \log f_{X_1,\ldots,X_n}(x_1, \ldots, x_n). \tag{3.5} $$
Consider again two discrete RVs $X$ and $Y$ with the alphabets $\mathcal{X}$ and $\mathcal{Y}$ and the PMFs $f_X(x)$ and $f_Y(y)$ respectively. The conditional entropy of $X$ given that $Y = y$, denoted by $H(X|Y = y)$, is defined as
$$ H(X|Y = y) = -\sum_{x \in \mathcal{X}} f_{X|Y}(x|y) \log f_{X|Y}(x|y) \tag{3.6} $$
where $f_{X|Y}(x|y)$ is the conditional PMF of $X$ given $Y$. The conditional entropy $H(X|Y = y)$ can be considered as the amount of uncertainty left in RV $X$ given that we know $Y = y$. The average conditional entropy of $X$ given the RV $Y$, or in short the conditional entropy of $X$ given $Y$, denoted by $H(X|Y)$, is defined as
$$ H(X|Y) = \sum_{y \in \mathcal{Y}} f_Y(y) H(X|Y = y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} f_{X,Y}(x, y) \log f_{X|Y}(x|y). \tag{3.7} $$
Two special cases should be noted. First, if $X$ and $Y$ are independent, then $H(X|Y) = H(X)$. Intuitively, in this case, the knowledge of $Y$ does not change the uncertainty about $X$. Second, if $X$ is a function of $Y$, then $H(X|Y) = 0$. Intuitively, in this case, the knowledge of $Y$ tells us the exact value of $X$, leaving zero uncertainty.
One important property is the fact that conditioning can only reduce the entropy. More specifically, it is always true that $H(X|Y) \leq H(X)$, as indicated in the theorem below.
Theorem 3.3 (Conditioning can only reduce the entropy): For two discrete RVs $X$ and $Y$, $H(X|Y) \leq H(X)$ with equality if and only if $X$ and $Y$ are independent.
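Proof: We first rewrite $H(X|Y)$ as follows.
$$ \begin{aligned} H(X|Y) &= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} f_{X,Y}(x, y) \log \frac{1}{f_{X|Y}(x|y)} = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} f_{X,Y}(x, y) \log \frac{f_X(x)}{f_{X|Y}(x|y) f_X(x)} \\ &= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} f_{X,Y}(x, y) \log \frac{1}{f_X(x)} + \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} f_{X,Y}(x, y) \log \frac{f_X(x)}{f_{X|Y}(x|y)} \\ &= \sum_{x \in \mathcal{X}} f_X(x) \log \frac{1}{f_X(x)} + \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} f_{X,Y}(x, y) \log \frac{f_X(x) f_Y(y)}{f_{X,Y}(x, y)} \\ &= H(X) + \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} f_{X,Y}(x, y) \log \frac{f_X(x) f_Y(y)}{f_{X,Y}(x, y)} \end{aligned} $$
Assuming the natural logarithm, we can use $\ln x \leq x - 1$ to bound $H(X|Y)$ by
$$ \begin{aligned} H(X|Y) &\leq H(X) + \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} f_{X,Y}(x, y) \left( \frac{f_X(x) f_Y(y)}{f_{X,Y}(x, y)} - 1 \right) \\ &= H(X) + \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} f_X(x) f_Y(y) - \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} f_{X,Y}(x, y) \\ &= H(X) + 1 - 1 = H(X). \end{aligned} $$
Note that the equality $H(X|Y) = H(X)$ holds if and only if the logarithm argument is equal to 1 while applying $\ln x \leq x - 1$. This happens when the ratio $f_X(x) f_Y(y) / f_{X,Y}(x, y)$ is equal to 1 for all $x$ and $y$. This is equivalent to having $f_{X,Y}(x, y) = f_X(x) f_Y(y)$ for all $x$ and $y$, i.e. $X$ and $Y$ are independent.
Finally, as for the proof of theorem 3.2, we can use the bound $\frac{\ln x}{\ln 2} \leq \frac{x - 1}{\ln 2}$ to prove the theorem if the logarithm has base 2.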

The following theorem tells us that the joint entropy can be written as the sum of conditional entropies.
Theorem 3.4 (Chain rules for the joint entropy): For discrete RVs $X_1, \ldots, X_n$,
$$ H(X_1, \ldots, X_n) = H(X_1) + \sum_{j=2}^{n} H(X_j | X_1, \ldots, X_{j-1}) $$
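Proof: We provide below a proof for two RVs $X_1$ and $X_2$. The proof is essentially based on the fact that $f_{X_1,X_2}(x_1, x_2) = f_{X_1}(x_1) f_{X_2|X_1}(x_2|x_1)$.
$$ \begin{aligned} H(X_1, X_2) &= -\sum_{x_1 \in \mathcal{X}_1} \sum_{x_2 \in \mathcal{X}_2} f_{X_1,X_2}(x_1, x_2) \log f_{X_1,X_2}(x_1, x_2) \\ &= -\sum_{x_1 \in \mathcal{X}_1} \sum_{x_2 \in \mathcal{X}_2} f_{X_1,X_2}(x_1, x_2) \log \left( f_{X_1}(x_1) f_{X_2|X_1}(x_2|x_1) \right) \\ &= -\sum_{x_1 \in \mathcal{X}_1} \sum_{x_2 \in \mathcal{X}_2} f_{X_1,X_2}(x_1, x_2) \log f_{X_1}(x_1) - \sum_{x_1 \in \mathcal{X}_1} \sum_{x_2 \in \mathcal{X}_2} f_{X_1,X_2}(x_1, x_2) \log f_{X_2|X_1}(x_2|x_1) \\ &= H(X_1) + H(X_2|X_1) \end{aligned} $$
The proof for more than two RVs can be done using mathematical induction.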
From theorems 3.2 and 3.4, we can establish an upper bound on the joint entropy
H(X1 , . . . , Xn )
H(Xj ),
(3.8)
j=1
where the equality holds if and only if X1 , . . . , Xn are independent.
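As a numerical sanity check of theorem 3.3, the chain rule for two RVs, and the bound (3.8), the short Python sketch below evaluates the relevant entropies for one joint PMF. The joint PMF values are hypothetical and chosen only for illustration; they do not come from the text.

```python
import math

def entropy(pmf):
    """Entropy in bits of a PMF given as an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

# Hypothetical joint PMF f_{X,Y}(x, y), chosen only for illustration.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Marginal PMFs of X and Y.
f_x, f_y = {}, {}
for (x, y), p in joint.items():
    f_x[x] = f_x.get(x, 0.0) + p
    f_y[y] = f_y.get(y, 0.0) + p

h_x = entropy(f_x.values())
h_y = entropy(f_y.values())
h_xy = entropy(joint.values())

# H(X|Y) = -sum f_{X,Y}(x,y) log2 f_{X|Y}(x|y), with f_{X|Y} = f_{X,Y} / f_Y.
h_x_given_y = -sum(p * math.log2(p / f_y[y])
                   for (x, y), p in joint.items() if p > 0)

print(f"H(X)   = {h_x:.4f}   H(X|Y)        = {h_x_given_y:.4f}")
print(f"H(X,Y) = {h_xy:.4f}   H(Y) + H(X|Y) = {h_y + h_x_given_y:.4f}")
print(f"H(X) + H(Y) = {h_x + h_y:.4f}")
```

For this dependent pair, the printed values should show $H(X|Y)$ strictly below $H(X)$ and $H(X,Y)$ strictly below $H(X) + H(Y)$.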
3.3 Source Coding Theorem for Discrete Sources
Consider a discrete RV $X$ with the alphabet $\mathcal{X}$ and entropy $H(X)$. Consider constructing a prefix-free source code for $X$. Let $L_{\min}(C)$ be the minimum expected
codeword length (in bits). Recall that $L_{\min}(C)$ can be achieved using the Huffman
algorithm. The following theorem relates $L_{\min}(C)$ to $H(X)$.
Theorem 3.5 (Entropy bound for prefix-free codes): Assume that all quantities have the bit unit.
1. $H(X) \le L_{\min}(C) < H(X) + 1$
2. $H(X) = L_{\min}(C)$ if and only if the PMF $f_X(\cdot)$ is dyadic, i.e. $f_X(x)$ is a negative
integer power of 2 for all $x \in \mathcal{X}$.
Proof: For convenience, let $M = |\mathcal{X}|$. Denote the PMF values by $p_1, \ldots, p_M$, and
the codeword lengths of the Huffman code by $l_1, \ldots, l_M$. We first prove the lower
bound $H(X) \le L_{\min}(C)$. Using the inequality $\log_2 x = \frac{\ln x}{\ln 2} \le \frac{x - 1}{\ln 2}$ (see figure 3.8), we
can upper bound $H(X) - L_{\min}(C)$ as follows.
\begin{align*}
H(X) - L_{\min}(C) &= \sum_{m=1}^{M} p_m \log_2 \frac{1}{p_m} - \sum_{m=1}^{M} p_m l_m = \sum_{m=1}^{M} p_m \log_2 \frac{2^{-l_m}}{p_m} \\
&\le \frac{1}{\ln 2} \sum_{m=1}^{M} p_m \left( \frac{2^{-l_m}}{p_m} - 1 \right) = \frac{1}{\ln 2} \left( \sum_{m=1}^{M} 2^{-l_m} - 1 \right)
\end{align*}
Using the Kraft inequality (see theorem 3.1), i.e. $\sum_{m=1}^{M} 2^{-l_m} \le 1$, $H(X) - L_{\min}(C)$
is further upper bounded by $\frac{1}{\ln 2}(1 - 1) = 0$, yielding $H(X) \le L_{\min}(C)$.
We now prove the upper bound $L_{\min}(C) < H(X) + 1$ by showing the existence
of one prefix-free code with the expected codeword length $L(C) < H(X) + 1$. In
this code, we choose the codeword lengths to be $l_m = \lceil -\log_2 p_m \rceil$, $m \in \{1, \ldots, M\}$.
(This is also referred to as the Shannon-Fano-Elias coding.) Note that the following
inequality holds: $-\log_2 p_m \le l_m < -\log_2 p_m + 1$.
From the bound $-\log_2 p_m \le l_m$, or equivalently $2^{-l_m} \le p_m$, we obtain $\sum_{m=1}^{M} 2^{-l_m} \le \sum_{m=1}^{M} p_m = 1$. Thus, the Kraft inequality holds, and there exists a prefix-free code
with the above choice of codeword lengths. From $l_m < -\log_2 p_m + 1$, we can write
\[ L(C) = \sum_{m=1}^{M} p_m l_m < \sum_{m=1}^{M} p_m \left( -\log_2 p_m + 1 \right) = -\sum_{m=1}^{M} p_m \log_2 p_m + \sum_{m=1}^{M} p_m = H(X) + 1. \tag{3.9} \]
Finally, if the PMF of $X$ is dyadic, then $l_m = -\log_2 p_m$, and $L(C) = L_{\min}(C) = H(X)$. On the other hand, if $L_{\min}(C) = H(X)$, then the inequality $\frac{\ln x}{\ln 2} \le \frac{x - 1}{\ln 2}$ in
the proof for $H(X) \le L_{\min}(C)$ must be satisfied with equality. This means that
$p_m = 2^{-l_m}$, $m \in \{1, \ldots, M\}$, and thus the PMF is dyadic.
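The two bounds of theorem 3.5 can be checked numerically. The sketch below is a minimal illustration: it computes the lengths $\lceil -\log_2 p_m \rceil$ used in the proof, verifies the Kraft sum, and compares the expected lengths with those of a Huffman code built with Python's `heapq`. The PMFs and the helper name `huffman_lengths` are assumptions made for this example.

```python
import heapq
import math

def huffman_lengths(pmf):
    """Codeword lengths of a binary Huffman code for the given PMF."""
    if len(pmf) == 1:
        return [1]
    # Heap entries: (probability, unique tie-breaker, symbols in the subtree).
    heap = [(p, i, [i]) for i, p in enumerate(pmf)]
    heapq.heapify(heap)
    lengths = [0] * len(pmf)
    counter = len(pmf)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:  # each merge adds one bit to every symbol below it
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

def expected_length(pmf, lengths):
    return sum(p * l for p, l in zip(pmf, lengths))

pmf = [0.4, 0.3, 0.2, 0.1]  # hypothetical, non-dyadic PMF
entropy = -sum(p * math.log2(p) for p in pmf)

ceil_lengths = [math.ceil(-math.log2(p)) for p in pmf]  # lengths from the proof
kraft_sum = sum(2.0 ** -l for l in ceil_lengths)
l_ceil = expected_length(pmf, ceil_lengths)
l_huffman = expected_length(pmf, huffman_lengths(pmf))

print(f"H(X) = {entropy:.4f}, Kraft sum = {kraft_sum:.4f}")
print(f"L(C) with ceil lengths = {l_ceil:.4f}, Huffman L_min(C) = {l_huffman:.4f}")

# For a dyadic PMF, the Huffman code meets the entropy exactly.
dyadic = [0.5, 0.25, 0.125, 0.125]
h_dyadic = -sum(p * math.log2(p) for p in dyadic)
l_dyadic = expected_length(dyadic, huffman_lengths(dyadic))
print(f"dyadic: H(X) = {h_dyadic:.4f}, L_min(C) = {l_dyadic:.4f}")
```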

The next theorem states that, by coding blocks of symbols for a DMS with symbol entropy $H(X)$, the minimum expected codeword length per symbol, denoted by
$\bar{L}_{\min}(C)$, approaches $H(X)$ as the block size grows large. Consequently, the entropy
$H(X)$ can serve as the fundamental limit of data compression for a DMS. However,
note that a large block size for source encoding may either involve too much computation or require too much decoding delay to be practical.
Theorem 3.6 (Lossless source coding theorem): For a DMS with symbol entropy $H(X)$, coding blocks of symbols with block length $n$ using a prefix-free code
yields
\[ H(X) \le \bar{L}_{\min}(C) < H(X) + \frac{1}{n}. \]
Proof: For coding $n$ symbols $X_1, \ldots, X_n$, we can view these $n$ symbols as a combined
symbol when using the Huffman algorithm. As a result, theorem 3.5 yields
\[ H(X_1, \ldots, X_n) \le L_{\min}(C) < H(X_1, \ldots, X_n) + 1. \]
From (3.8), noting that $X_1, \ldots, X_n$ are independent, we can write $H(X_1, \ldots, X_n) = \sum_{j=1}^{n} H(X_j) = nH(X)$. Normalizing $L_{\min}(C)$ by $n$, we have the desired bound.
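To see theorem 3.6 at work, the sketch below applies the Huffman algorithm to blocks of $n$ symbols from a binary DMS and prints the expected codeword length per symbol. It relies on the standard fact that the expected length of a Huffman code equals the sum of the probabilities of all merged nodes. The source probabilities (0.8, 0.2) and the function name are assumptions for this illustration.

```python
import heapq
import itertools
import math

def huffman_expected_length(pmf):
    """Expected Huffman codeword length: sum of all merged node probabilities."""
    heap = list(pmf)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total

symbol_pmf = [0.8, 0.2]  # hypothetical binary DMS
h = -sum(p * math.log2(p) for p in symbol_pmf)

results = {}
for n in range(1, 7):
    # PMF of the combined symbol (X_1, ..., X_n) for independent symbols.
    block_pmf = [math.prod(c) for c in itertools.product(symbol_pmf, repeat=n)]
    results[n] = huffman_expected_length(block_pmf) / n
    print(f"n = {n}: bits/symbol = {results[n]:.4f}, "
          f"bounds = [{h:.4f}, {h + 1 / n:.4f})")
```

The number of combined symbols grows as $2^n$ here, which illustrates why large block sizes quickly become impractical.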
3.4 Asymptotic Equipartition Property
The asymptotic equipartition property (AEP) is a key concept in information theory.

Roughly speaking, AEP states that, given a long sequence of independent and identically distributed (IID) RVs $X_1, \ldots, X_n$ with symbol entropy $H(X)$, we have the
following properties.
1. There exists a typical set of sequences $(x_1, \ldots, x_n) \in \mathcal{X}^n$ whose aggregate probability is approximately 1.
2. There are approximately $2^{nH(X)}$ typical sequences each of which occurs with
approximately equal probability $2^{-nH(X)}$.
Define the typical set $T_\epsilon^n$ with respect to symbol PMF $f_X(x)$ to be the set of
sequences $(x_1, \ldots, x_n) \in \mathcal{X}^n$ such that
\[ \left| -\frac{\log f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)}{n} - H(X) \right| < \epsilon. \tag{3.10} \]
Any length-$n$ sequence $(x_1, \ldots, x_n)$ satisfying (3.10) is called a typical sequence.
Example 3.7 Consider a DMS with two symbols 0 and 1 with probabilities 0.8 and
0.2 respectively. The typical set $T_\epsilon^n$ with $n = 6$ and $\epsilon = 0.1$ is found as follows. Note
that $H(X) = H_{\rm bin}(0.2)$. It follows that the condition for a typical sequence is
\[ 0.033 \approx 2^{-6(H_{\rm bin}(0.2) + 0.1)} < f_{X_1,\ldots,X_6}(x_1, \ldots, x_6) < 2^{-6(H_{\rm bin}(0.2) - 0.1)} \approx 0.075. \]
For $j \in \{0, \ldots, 6\}$, define a group-$j$ sequence to be a sequence $(x_1, \ldots, x_6)$ with
bit 1 appearing $j$ times. We now check whether a group-$j$ sequence is in $T_{0.1}^6$ for each
$j$.
Group-0 : $f_{X_1,\ldots,X_6}(x_1, \ldots, x_6) = 0.8^6 \approx 0.26$
Group-1 : $f_{X_1,\ldots,X_6}(x_1, \ldots, x_6) = 0.8^5 \cdot 0.2 \approx 0.066$
Group-2 : $f_{X_1,\ldots,X_6}(x_1, \ldots, x_6) = 0.8^4 \cdot 0.2^2 \approx 0.016$
Group-3 : $f_{X_1,\ldots,X_6}(x_1, \ldots, x_6) = 0.8^3 \cdot 0.2^3 \approx 4.1 \times 10^{-3}$
Group-4 : $f_{X_1,\ldots,X_6}(x_1, \ldots, x_6) = 0.8^2 \cdot 0.2^4 \approx 1.0 \times 10^{-3}$
Group-5 : $f_{X_1,\ldots,X_6}(x_1, \ldots, x_6) = 0.8 \cdot 0.2^5 \approx 2.6 \times 10^{-4}$
Group-6 : $f_{X_1,\ldots,X_6}(x_1, \ldots, x_6) = 0.2^6 \approx 6.4 \times 10^{-5}$
We see that only group-1 sequences are in $T_{0.1}^6$. Therefore, $T_{0.1}^6$ = {100000, 010000,
001000, 000100, 000010, 000001}.
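The typical set in example 3.7 can also be found by brute force. The following sketch enumerates all $2^6$ binary sequences and keeps those satisfying (3.10); the variable names are illustrative only.

```python
import itertools
import math

p = {0: 0.8, 1: 0.2}  # symbol PMF from example 3.7
n, eps = 6, 0.1
h = -sum(q * math.log2(q) for q in p.values())  # H(X) = H_bin(0.2)

typical = []
typical_prob = 0.0
for seq in itertools.product((0, 1), repeat=n):
    prob = math.prod(p[s] for s in seq)
    # Condition (3.10): | -log2 f(x_1..x_n) / n - H(X) | < eps
    if abs(-math.log2(prob) / n - h) < eps:
        typical.append("".join(map(str, seq)))
        typical_prob += prob

print(typical)
print(f"|T| = {len(typical)}, Pr{{T}} = {typical_prob:.4f}")
```

With $n$ as small as 6, the aggregate probability of the typical set is still far from 1; the AEP statements concern large $n$.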
The following theorem lists important properties of typical sets and typical sequences.
Theorem 3.7 (AEP):
1. For any $\delta > 0$, $\Pr\{T_\epsilon^n\} > 1 - \delta$ for sufficiently large $n$.
2. For each typical sequence $(x_1, \ldots, x_n) \in T_\epsilon^n$, the joint PMF is bounded by
$2^{-n(H(X)+\epsilon)} < f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) < 2^{-n(H(X)-\epsilon)}$.
3. For any $\delta > 0$, $(1 - \delta) 2^{n(H(X)-\epsilon)} < |T_\epsilon^n| < 2^{n(H(X)+\epsilon)}$ for sufficiently large $n$.
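Proof: We shall prove the statements one by one.
1. Since $X_1, \ldots, X_n$ are IID, $-\frac{\log f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)}{n} = \frac{1}{n} \sum_{j=1}^{n} -\log f_X(x_j)$. Define
a RV $W_j = -\log f_X(X_j)$. Note that $W_1, \ldots, W_n$ are IID with mean $H(X)$.
Let $\sigma_W^2$ denote the variance of $W_j$. From the definition of $T_\epsilon^n$, $1 - \Pr\{T_\epsilon^n\} = \Pr\left\{ \left| \frac{1}{n} \sum_{j=1}^{n} W_j - H(X) \right| \ge \epsilon \right\}$. Using the weak law of large numbers (WLLN),
we obtain the following bound.
\[ 1 - \Pr\{T_\epsilon^n\} \le \frac{\sigma_W^2}{n \epsilon^2}, \quad \text{or equivalently} \quad \Pr\{T_\epsilon^n\} \ge 1 - \frac{\sigma_W^2}{n \epsilon^2} \]
Given any $\delta > 0$, there is an $n$ sufficiently large so that $\frac{\sigma_W^2}{n \epsilon^2} < \delta$.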
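2. The statement follows directly from the definition of a typical sequence in (3.10).
3. Since $f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) > 2^{-n(H(X)+\epsilon)}$ for each typical sequence in $T_\epsilon^n$,
\[ 1 \ge \sum_{(x_1,\ldots,x_n) \in T_\epsilon^n} f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) > |T_\epsilon^n| \, 2^{-n(H(X)+\epsilon)}, \]
yielding $|T_\epsilon^n| < 2^{n(H(X)+\epsilon)}$.
From statement 1 and $f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) < 2^{-n(H(X)-\epsilon)}$, given any $\delta > 0$, there
is an $n$ sufficiently large so that $\Pr\{T_\epsilon^n\} > 1 - \delta$. With this choice of $n$,
\[ 1 - \delta < \sum_{(x_1,\ldots,x_n) \in T_\epsilon^n} f_{X_1,\ldots,X_n}(x_1, \ldots, x_n) < |T_\epsilon^n| \, 2^{-n(H(X)-\epsilon)}, \]
yielding $|T_\epsilon^n| > (1 - \delta) 2^{n(H(X)-\epsilon)}$.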
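The statements of theorem 3.7 can be illustrated numerically for a binary DMS. In the sketch below, sequences are grouped by their number of ones, so that $\Pr\{T_\epsilon^n\}$ and $|T_\epsilon^n|$ can be computed exactly without enumerating all $2^n$ sequences. The source probability 0.2, the value $\epsilon = 0.1$, and the function name are assumptions for this illustration; the last printed column is the bound $1 - \sigma_W^2 / (n \epsilon^2)$ from the proof of statement 1.

```python
import math

def typical_set_stats(n, eps, p1=0.2):
    """Return (Pr{T}, |T|, H(X)) for a binary DMS with Pr{1} = p1."""
    h = -(p1 * math.log2(p1) + (1 - p1) * math.log2(1 - p1))
    prob, size = 0.0, 0
    for j in range(n + 1):
        # -log2 f / n for any sequence containing j ones
        rate = -(j * math.log2(p1) + (n - j) * math.log2(1 - p1)) / n
        if abs(rate - h) < eps:
            count = math.comb(n, j)  # number of sequences with j ones
            size += count
            prob += count * p1 ** j * (1 - p1) ** (n - j)
    return prob, size, h

p1, eps = 0.2, 0.1
# Variance of W = -log2 f_X(X) for a binary source.
var_w = p1 * (1 - p1) * math.log2((1 - p1) / p1) ** 2

for n in (6, 50, 200, 1000):
    prob, size, h = typical_set_stats(n, eps, p1)
    bound = 1 - var_w / (n * eps ** 2)
    print(f"n = {n:4d}: Pr{{T}} = {prob:.4f}, log2|T|/n = {math.log2(size) / n:.4f}, "
          f"H(X) = {h:.4f}, lower bound = {bound:.4f}")
```

The lower bound from the proof is loose (it is even negative for small $n$), but it is enough to show that $\Pr\{T_\epsilon^n\}$ approaches 1.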
Lossy Source Coding

The AEP can be used to construct a source code that is optimal in minimizing the
expected codeword length per symbol, denoted by $\bar{L}(C)$. Consider the following source
coding strategy. We take a sequence of $n$ source symbols at a time. We then provide
codewords only for the typical sequences in $T_\epsilon^n$ for some small $\epsilon > 0$. For a sequence
not in $T_\epsilon^n$, we declare a coding failure since no codeword is provided. Since there is
a nonzero probability of a coding failure, this approach is called lossy source coding.
Since $|T_\epsilon^n| < 2^{n(H(X)+\epsilon)}$ (statement 3 of theorem 3.7), it suffices to use a fixed-length code with codeword length equal to $\lceil n(H(X) + \epsilon) \rceil$ bits, yielding the bound
$\bar{L}(C) < H(X) + \epsilon + 1/n$. Since $\Pr\{T_\epsilon^n\}$ approaches 1 for large $n$ (statement 1 of
theorem 3.7), the probability of coding failure approaches 0 for large $n$. Since $\epsilon$ can
be chosen to be arbitrarily small, it follows that, for large $n$, this coding scheme yields
$\bar{L}(C) \approx H(X)$, which is optimal.
3.5 Source Coding for Discrete Sources with Memory
In this section, we consider discrete sources with memory. In particular, let X1 , X2 , . . .

be the symbols emitted from the source. It is no longer true that X1 , X2 , . . . are
independent; they may, however, still be identically distributed. For such sources,
symbol-by-symbol encoding is clearly not optimal. An obvious example is when a
source emits two identical symbols successively, e.g. aabbccbb . . .; encoding two
symbols at a time is clearly more efficient.
We focus on discrete stationary sources whose statistics do not depend on time.
For such a source, X1 , X2 , . . . are identically distributed with some marginal PMF
fX (x). Note that a complete statistical description of X1 , X2 , . . . has to include joint
PMFs. Let H(X) be the entropy corresponding to fX (x). Because of memory, it is
no longer true that H(X) is the lower limit on the number of bits per symbol for
source coding.
However, we can still use the Huffman algorithm to code blocks of symbols since
the algorithm does not rely on the independence assumption for $X_1, X_2, \ldots$. Suppose
that we code a block of $n$ symbols at a time. Let $L_{\min}(C)$ be the expected codeword
length from the Huffman algorithm. From theorem 3.5, we can bound $L_{\min}(C)$ as
\[ H(X_1, \ldots, X_n) \le L_{\min}(C) < H(X_1, \ldots, X_n) + 1. \]
By normalizing $L_{\min}(C)$ by $n$ to get the expected codeword length per symbol
$\bar{L}_{\min}(C) = L_{\min}(C)/n$, we have
\[ \frac{H(X_1, \ldots, X_n)}{n} \le \bar{L}_{\min}(C) < \frac{H(X_1, \ldots, X_n)}{n} + \frac{1}{n}. \]
As $n$ grows large, we see that the optimal coding procedure uses approximately
$H(X_1, \ldots, X_n)/n$ bits per symbol. This motivates the following definition of the
entropy rate of a discrete stationary source, denoted by $H_\infty(X)$, as
\[ H_\infty(X) = \lim_{n \to \infty} \frac{H(X_1, \ldots, X_n)}{n}. \tag{3.11} \]
It is known that the limit in (3.11) exists and hence the definition is valid (see [?,
p. 103] or [?, p. 74]).
Even though the Huffman algorithm can be used as an optimal coding procedure
for discrete sources with memory, it requires the knowledge of the joint PMF of source
symbols. In practice, this information may not be available or may be difficult to estimate. To overcome this requirement, the Lempel-Ziv (LZ) algorithm was proposed
as a source coding procedure that does not require the knowledge of the source statistics. Due to this property, the LZ algorithm is considered a universal source coding
algorithm. Various versions of the LZ algorithm have been implemented in practice,
e.g. the compress command in Unix.
Figure 3.9: Example operations of the LZ algorithm.
Lempel-Ziv Algorithm
We now discuss the operations of the LZ algorithm as well as a rough explanation of
why it is efficient. The discussion is taken from [?, p. 51].
The LZ algorithm we describe is a variable-to-variable length coding process.2
At each step, the algorithm maps a variable number of symbols to a variable-length
codeword. In addition, the code C adapts or changes over time depending on the
statistics in the recent past.
Let $X_1, X_2, \ldots$ denote the sequence of identically distributed source symbols. Let
$\mathcal{X}$ denote the alphabet, and define $M = |\mathcal{X}|$. Let $x_m^n$, $m \le n$, denote the subsequence
$(x_m, \ldots, x_n)$. The algorithm operates by keeping a sliding window of size $W = 2^k$,
where $k$ is some large positive integer. The operations of the algorithm are as follows.
1. Encode the first $W$ symbols using a fixed-length code with $\lceil \log_2 M \rceil$ bits per
symbol. (In terms of the overall efficiency, it does not really matter how efficiently these $W$ symbols are coded since the $W \lceil \log_2 M \rceil$ bits used in this step
are a negligible fraction of the total number of encoded bits.)
2. Set the pointer $P = W$ indicating that all symbols up to $x_P$ have been coded.
3. Find the largest integer $n \ge 2$ (if it exists) such that $x_{P+1}^{P+n} = x_{P+1-u}^{P+n-u}$
for some $u \in \{1, \ldots, W\}$. (In other words, find the longest match between the
symbol sequence starting at index $P + 1$ and a symbol subsequence starting in
the sliding window.) Encode $x_{P+1}^{P+n}$ by encoding $n$ and then encoding $u$. Figure 3.9
gives some example values of $n$ and $u$.
If no match exists for $n \ge 2$, then set $n = 1$.
4. Encode $n$ with the unary-binary code, as shown in figure 3.10. In particular,
$n$ is encoded by its binary representation preceded by a prefix of $\lfloor \log_2 n \rfloor$ zeros.
The corresponding codeword length for $n$ is $2 \lfloor \log_2 n \rfloor + 1$ bits. (In terms of the
overall efficiency, the encoded bits for $n$ are negligible compared to the encoded
bits for $u$, as will be seen shortly.)
5. If $n > 1$, encode $u$ using a fixed-length code of length $\log_2 W$ bits. Else (i.e.
$n = 1$), encode $x_{P+1}$ using the previously defined fixed-length code.
2 A different version of the LZ algorithm is a variable-to-fixed length coding process [?, p. 106].
Figure 3.10: Unary-binary code tree.

6. Update the pointer by incrementing P by n, and go back to step 3.
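The six steps above can be sketched in Python as follows. This is a minimal illustration under some assumptions that the text does not specify: the fixed-length symbol code assigns index $i$ of the alphabet its $\lceil \log_2 M \rceil$-bit binary representation, the match position $u \in \{1, \ldots, W\}$ is sent as the $\log_2 W$-bit representation of $u - 1$, and ties between equally long matches are broken in favor of the smallest $u$. The function names and the test sequence are hypothetical.

```python
import math

def unary_binary(n):
    """Unary-binary codeword: floor(log2 n) zeros, then n in binary."""
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary

def lz_encode(symbols, alphabet, window):
    """Sliding-window LZ encoding; returns (bit string, list of (n, u) steps)."""
    sym_bits = max(1, math.ceil(math.log2(len(alphabet))))
    u_bits = int(math.log2(window))  # window is assumed to be a power of 2
    fixed = {s: format(i, f"0{sym_bits}b") for i, s in enumerate(alphabet)}

    # Steps 1 and 2: send the first W symbols uncompressed, set P = W.
    bits = "".join(fixed[s] for s in symbols[:window])
    steps = []
    p = window  # number of symbols coded so far (0-based index of the next one)
    while p < len(symbols):
        # Step 3: longest match of length n >= 2 starting u symbols back.
        best_n, best_u = 1, None
        for u in range(1, window + 1):
            n = 0
            while p + n < len(symbols) and symbols[p + n] == symbols[p + n - u]:
                n += 1
            if n >= 2 and n > best_n:
                best_n, best_u = n, u
        # Steps 4 and 5: encode n, then either u or the single literal symbol.
        if best_u is not None:
            bits += unary_binary(best_n) + format(best_u - 1, f"0{u_bits}b")
        else:
            bits += unary_binary(1) + fixed[symbols[p]]
        steps.append((best_n, best_u))
        p += best_n  # Step 6
    return bits, steps

bits, steps = lz_encode("abababcabcabcc", "abc", window=4)
print(steps)
print(bits, len(bits))
```

Note that a match may run past the pointer $P$ into symbols that are being encoded in the same step, as in the third step of this example, where a match of length 6 starts only 3 symbols back.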
Optimality of the LZ Algorithm

In this section, we provide a rough explanation of why the LZ algorithm is optimal.
First, we modify the definition of a typical sequence in order to deal with sources
with memory. In particular, define the typical set $T_\epsilon^n$ in this case to be the set of
sequences $(x_1, \ldots, x_n) \in \mathcal{X}^n$ such that
\[ \left| -\frac{\log f_{X_1,\ldots,X_n}(x_1, \ldots, x_n)}{n} - H_\infty(X) \right| < \epsilon. \]
The following properties follow from the above definition (with no justification
given here).
1. For large $n$, $\Pr\{T_\epsilon^n\} \approx 1$.
2. There are approximately $2^{nH_\infty(X)}$ typical sequences each of which occurs with
approximately equal probability $2^{-nH_\infty(X)}$.
We are now ready to understand why the LZ algorithm is optimal. Recall that,
with the sliding window size $W$, there are $W$ positions that the algorithm can search
for a matching subsequence. The length $n$ of the match can be estimated as follows.
If $n$ is so large that $2^{nH_\infty(X)} \gg W$, then most length-$n$ typical sequences will not
have their starting points in the sliding window and there will likely be no match
of length $n$.
On the other hand, if $n$ is so small that $2^{nH_\infty(X)} \ll W$, then most length-$n$ typical
sequences will be matched, but $n$ will likely not be the size of the longest match.
Roughly speaking, the longest match will occur for length $n^*$ such that $2^{n^* H_\infty(X)} \approx W$,
or equivalently $n^* \approx \frac{\log_2 W}{H_\infty(X)}$ symbols.
Recall that, in the algorithm, coding for a match of length $n$ and a match position
$u$ uses $2 \lfloor \log_2 n \rfloor + 1$ bits for $n$ (unary-binary code) and $\log_2 W$ bits for $u$ (fixed-length code), for a total of $(2 \lfloor \log_2 n \rfloor + 1) + \log_2 W$ bits. Since $n^* \approx \frac{\log_2 W}{H_\infty(X)}$, the term
$2 \lfloor \log_2 n^* \rfloor + 1$ is negligible compared to $\log_2 W$. Therefore, in a typical step, we have
a match of length $n^*$ and can encode both $n$ and $u$ using approximately $\log_2 W$ bits,
yielding the number of bits per symbol approximately equal to $\frac{\log_2 W}{n^*} \approx H_\infty(X)$.
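The rough argument above can be put into numbers. The sketch below evaluates $\left( 2 \lfloor \log_2 n^* \rfloor + 1 + \log_2 W \right) / n^*$ with $n^* = \log_2 W / H_\infty(X)$ for a few window sizes $W = 2^k$. The entropy rate of 0.5 bit/symbol and the function name are assumptions for this illustration, and the formula is only the heuristic estimate from the text, not an exact performance figure.

```python
import math

def lz_bits_per_symbol(k, entropy_rate):
    """Heuristic bits/symbol for a typical LZ step with window W = 2**k."""
    n_star = k / entropy_rate  # typical match length: log2(W) / H_inf(X)
    length_bits = 2 * math.floor(math.log2(n_star)) + 1  # unary-binary code for n
    return (length_bits + k) / n_star  # k = log2(W) bits for u

entropy_rate = 0.5  # hypothetical entropy rate in bit/symbol
estimates = {k: lz_bits_per_symbol(k, entropy_rate) for k in (10, 100, 1000)}
for k, value in estimates.items():
    print(f"log2 W = {k:4d}: about {value:.4f} bit/symbol")
```

Under this estimate, the overhead of encoding $n$ vanishes only slowly as the window grows, which is consistent with the requirement that $k$ be large.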
3.6 Source Coding for Continuous Sources
When a source emits an information-bearing signal that is a continuous waveform,
e.g. speech waveform, we need to convert the continuous signal into a discrete-time
signal with a finite number of values for the transmission over a digital communication
system. To create a discrete-time signal, we can sample the continuous signal. We
shall shortly see that, if the continuous signal is band-limited to the frequency range
$[-W, W]$ (in Hz), then its samples taken every period $\frac{1}{2W}$ (in s) can be used to
perfectly reconstruct the original signal.
Having obtained an equivalent discrete-time signal representation, it remains to
quantize the amplitude of each sample so that each sample can be considered as a
discrete RV with a finite alphabet. In general, quantization is a lossy process; a
quantized signal is a distorted version of the original signal. More quantization levels
can be used to decrease the amount of distortion but at the expense of using more
bits for source coding.
Example 3.8 A typical local office of a telephone network converts a voice signal
to bits as follows. First, a voice signal is filtered to be band-limited to 4 kHz. The
filtered signal is then sampled at a rate of 8,000 samples per second. Each sample
is then quantized to one of 256 possible levels. Finally, each quantized sample is
mapped to an 8-bit codeword using a fixed-length code. Overall, the corresponding
bit rate for a voice signal is 8,000 × 8 = 64,000 bits per second, i.e. 64 kbps.
Sampling Theorem for Band-Limited Signals

Suppose a continuous L2 signal $x(t)$ is band-limited to the frequency band $[-W, W]$.
More specifically, if $\hat{x}(f)$ is the Fourier transform of $x(t)$ (denoted by $x(t) \leftrightarrow \hat{x}(f)$),
then $\hat{x}(f) = 0$ for $f \notin [-W, W]$. The following theorem states that the samples of
$x(t)$ taken at a period of $\frac{1}{2W}$ can be used to perfectly reconstruct $x(t)$.
Theorem 3.8 (Sampling theorem for band-limited signals): A continuous L2
signal $x(t)$ band-limited to the frequency range $[-W, W]$ is uniquely determined by
its samples taken at the sampling period $\frac{1}{2W}$.
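Proof: Recall that, for an L2 signal $x(t)$ that is time-limited to the time interval
$\left[ -\frac{T}{2}, \frac{T}{2} \right]$, there exists a set of Fourier series coefficients $\hat{x}_k = \frac{1}{T} \int_{-T/2}^{T/2} x(t) e^{-i 2\pi k t / T} \, dt$, $k \in \mathbb{Z}$.
In addition, $x(t)$ can be reconstructed from $\hat{x}_k$ by $x(t) = \sum_{k \in \mathbb{Z}} \hat{x}_k e^{i 2\pi k t / T}$, $t \in \left[ -\frac{T}{2}, \frac{T}{2} \right]$.
By applying the same properties in the frequency domain, we can establish the
sampling theorem as follows. Since $x(t) \leftrightarrow \hat{x}(f)$ and $x(t)$ is an L2 signal, $\hat{x}(f)$ is also
an L2 signal. Since $\hat{x}(f) = 0$ for $f \notin [-W, W]$, there exists a set of Fourier series
coefficients
\[ x_k = \frac{1}{2W} \int_{-W}^{W} \hat{x}(f) e^{-i \frac{2\pi k f}{2W}} \, df, \quad k \in \mathbb{Z}. \tag{3.12} \]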
In addition, $\hat{x}(f)$ can be reconstructed from $x_k$, $k \in \mathbb{Z}$, by
\[ \hat{x}(f) = \sum_{k \in \mathbb{Z}} x_k e^{i \frac{2\pi k f}{2W}}, \quad f \in [-W, W]. \tag{3.13} \]
From the inverse Fourier transform formula, we see that $x_k$ in (3.12) is $\frac{1}{2W}$ times
the sampling value $x\left( -\frac{k}{2W} \right)$ in the time domain. Since $\hat{x}(f)$ uniquely determines $x(t)$,
we can reconstruct $x(t)$ perfectly from the samples $x\left( -\frac{k}{2W} \right)$, $k \in \mathbb{Z}$.
The reconstruction formula can be obtained by taking the inverse Fourier transform of the expression in (3.13). We can reexpress (3.13) for $f \in \mathbb{R}$ as3
\[ \hat{x}(f) = \sum_{k \in \mathbb{Z}} x_k e^{i \frac{2\pi k f}{2W}} \, \mathrm{rect}\left( \frac{f}{2W} \right), \quad f \in \mathbb{R}. \tag{3.14} \]
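Using Fourier transform pairs
\begin{align*}
2W \, \mathrm{sinc}(2Wt) &\leftrightarrow \mathrm{rect}\left( \frac{f}{2W} \right), \\
2W \, \mathrm{sinc}\left( 2W \left( t + \frac{k}{2W} \right) \right) &\leftrightarrow e^{i \frac{2\pi k f}{2W}} \, \mathrm{rect}\left( \frac{f}{2W} \right),
\end{align*}
we can write the inverse Fourier transform of (3.14) as
\[ x(t) = \sum_{k \in \mathbb{Z}} 2W x_k \, \mathrm{sinc}(2Wt + k) = \sum_{k \in \mathbb{Z}} x\left( -\frac{k}{2W} \right) \mathrm{sinc}(2Wt + k). \]
By making a change of variable $j = -k$, we get the reconstruction formula below.
\[ x(t) = \sum_{j=-\infty}^{\infty} x\left( \frac{j}{2W} \right) \mathrm{sinc}(2Wt - j). \tag{3.15} \]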
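The reconstruction formula (3.15) can be tried out numerically. The sketch below assumes the normalized sinc function, $\mathrm{sinc}(t) = \sin(\pi t) / (\pi t)$, which is consistent with the transform pairs above. The test signal $\mathrm{sinc}^2(Wt)$ is a hypothetical choice; its Fourier transform is a triangle supported on $[-W, W]$, so it is band-limited as required. The infinite sum is truncated, so the match is only approximate.

```python
import math

def sinc(t):
    """Normalized sinc: sin(pi t) / (pi t), with sinc(0) = 1."""
    return 1.0 if t == 0 else math.sin(math.pi * t) / (math.pi * t)

W = 4.0  # bandwidth in Hz (hypothetical value)

def x(t):
    """Hypothetical test signal, band-limited to [-W, W]."""
    return sinc(W * t) ** 2

def reconstruct(t, num_terms=200):
    """Truncated version of (3.15), using samples x(j / 2W) for |j| <= num_terms."""
    return sum(x(j / (2 * W)) * sinc(2 * W * t - j)
               for j in range(-num_terms, num_terms + 1))

for t in (0.0, 0.03, 0.11, 0.4):
    print(f"t = {t:.2f}: x(t) = {x(t):+.6f}, reconstructed = {reconstruct(t):+.6f}")
```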
Scalar Quantization
In this section, we discuss quantization of a single symbol produced from a source,
e.g. its sample value. A scalar quantizer with $M$ levels partitions the set $\mathbb{R}$ into $M$
subsets $\mathcal{R}_1, \ldots, \mathcal{R}_M$ called quantization regions. Each region $\mathcal{R}_m$, $m \in \{1, \ldots, M\}$, is
then represented by a quantization point $q_m \in \mathcal{R}_m$. If a symbol $u \in \mathcal{R}_m$ is produced
from the source, then $u$ is quantized to $q_m$.
Our goal is to treat the following problem. Let $U$ be a RV denoting a source
symbol with probability density function (PDF) $f_U(u)$. Let $q(U)$ be a RV denoting
its quantized value. Given the number of quantization levels $M$, we want to find the
quantization regions $\mathcal{R}_1, \ldots, \mathcal{R}_M$ and the quantization points $q_1, \ldots, q_M$ to minimize
the following mean square error (MSE) distortion.
\[ \mathrm{MSE} = E\left[ (U - q(U))^2 \right] = \int_{-\infty}^{\infty} (u - q(u))^2 f_U(u) \, du. \tag{3.16} \]
3 Recall that $\mathrm{rect}\left( \frac{f}{2W} \right) = 1$ for $f \in [-W, W]$, and 0 otherwise.
Figure 3.11: Example quantization regions and quantization points for M = 4.
For the time being, let us assume that $\mathcal{R}_1, \ldots, \mathcal{R}_M$ are intervals, as shown in
figure 3.11. We ask two simplified questions.
1. Given $q_1, \ldots, q_M$, how do we choose $\mathcal{R}_1, \ldots, \mathcal{R}_M$?
2. Given $\mathcal{R}_1, \ldots, \mathcal{R}_M$, how do we choose $q_1, \ldots, q_M$?
We first consider the problem of choosing $\mathcal{R}_1, \ldots, \mathcal{R}_M$ given $q_1, \ldots, q_M$. For a given
$u \in \mathbb{R}$, the square error to $q_m$ is $(u - q_m)^2$. To minimize the MSE, $u$ should be quantized
to the closest quantization point, i.e. $q(u) = q_m$ where $m = \arg\min_{j \in \{1,\ldots,M\}} (u - q_j)^2$.
It follows that the boundary point $b_m$ between $\mathcal{R}_m$ and $\mathcal{R}_{m+1}$ must be the halfway
point between $q_m$ and $q_{m+1}$, i.e. $b_m = (q_m + q_{m+1})/2$. In addition, we can say that
$\mathcal{R}_1, \ldots, \mathcal{R}_M$ must be intervals.
We now consider the problem of choosing $q_1, \ldots, q_M$ given $\mathcal{R}_1, \ldots, \mathcal{R}_M$. Given
$\mathcal{R}_1, \ldots, \mathcal{R}_M$, the MSE in (3.16) can be written as
\[ \mathrm{MSE} = \sum_{m=1}^{M} \int_{\mathcal{R}_m} (u - q_m)^2 f_U(u) \, du. \]
To minimize the MSE, we can consider each quantization region separately from
the rest. Define a RV $V$ such that $V = m$ if $U \in \mathcal{R}_m$, and let $p_m = \Pr\{V = m\}$. The
conditional PDF of $U$ given that $V = m$ can be written as
\[ f_{U|V}(u|m) = \frac{f_{U,V}(u, m)}{f_V(m)} = \frac{f_{V|U}(m|u) f_U(u)}{f_V(m)} = \frac{f_U(u)}{f_V(m)} = \frac{f_U(u)}{p_m} \]
in region $\mathcal{R}_m$. In terms of $f_{U|V}(u|m)$, the contribution of region $\mathcal{R}_m$ to the MSE can
be written as
\begin{align*}
\int_{\mathcal{R}_m} (u - q_m)^2 f_U(u) \, du &= p_m \int_{\mathcal{R}_m} (u - q_m)^2 \frac{f_U(u)}{p_m} \, du \\
&= p_m \int_{\mathcal{R}_m} (u - q_m)^2 f_{U|V}(u|m) \, du = p_m E\left[ (U - q_m)^2 \,|\, V = m \right]. \tag{3.17}
\end{align*}
It is known that the value of $a$ that minimizes $E[(X - a)^2]$ is the mean of $X$, i.e.
$E[X] = \arg\min_{a \in \mathbb{R}} E[(X - a)^2]$.4 Therefore, the MSE is minimized when we set $q_m$
equal to the conditional mean of $U$ given $V = m$, i.e.
\[ q_m = E[U | V = m] = E[U | U \in \mathcal{R}_m]. \tag{3.18} \]
In summary, we have the following necessary conditions for an optimal scalar
quantizer.
1. $\mathcal{R}_1, \ldots, \mathcal{R}_M$ are intervals. In addition, the boundary point $b_m$ between $\mathcal{R}_m$ and
$\mathcal{R}_{m+1}$ must be halfway between $q_m$ and $q_{m+1}$, i.e. $b_m = (q_m + q_{m+1})/2$.
2. In each region $\mathcal{R}_m$, the quantization point $q_m$ must be equal to the conditional
mean $E[U | U \in \mathcal{R}_m]$.
These two necessary conditions are the ideas behind the Lloyd-Max algorithm for
scalar quantization described below. Since the algorithm is based on conditions that are necessary
but not sufficient, the resultant solution may be suboptimal, as illustrated
in the example that follows.
Lloyd-Max Algorithm:
1. Choose an arbitrary initial set of $q_1, \ldots, q_M$ such that $q_1 < \ldots < q_M$.
2. For each $m \in \{1, \ldots, M - 1\}$, set $b_m = (q_m + q_{m+1})/2$.
3. For each $m \in \{1, \ldots, M\}$, set $q_m = E[U | U \in (b_{m-1}, b_m]]$, where we set $b_0$ and
$b_M$ to be the minimum and the maximum values of $U$ (possibly infinite).
4. Repeat steps 2 and 3 until the improvement in the MSE is negligible, e.g. less
than some small $\epsilon > 0$.
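The four steps can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: it assumes $U$ has a PDF on a finite interval, replaces the conditional means and the MSE integral by midpoint Riemann sums on a fine grid, and assumes the initial points lie inside the support. The function name, grid size, and tolerance are hypothetical choices.

```python
def lloyd_max(pdf, u_min, u_max, q_init, grid=2000, tol=1e-10, max_iter=1000):
    """Grid-based Lloyd-Max iteration; returns (quantization points, MSE)."""
    du = (u_max - u_min) / grid
    us = [u_min + (i + 0.5) * du for i in range(grid)]  # cell midpoints
    ws = [pdf(u) * du for u in us]                       # probability of each cell
    q = sorted(q_init)                                   # step 1
    prev_mse = float("inf")
    mse = prev_mse
    for _ in range(max_iter):
        # Step 2: boundaries halfway between neighbouring quantization points.
        b = [(q[m] + q[m + 1]) / 2 for m in range(len(q) - 1)]
        # Step 3: conditional mean of U in each region (b_{m-1}, b_m].
        mass = [0.0] * len(q)
        moment = [0.0] * len(q)
        m = 0
        for u, w in zip(us, ws):
            while m < len(b) and u > b[m]:
                m += 1
            mass[m] += w
            moment[m] += w * u
        q = [moment[i] / mass[i] if mass[i] > 0 else q[i] for i in range(len(q))]
        # Step 4: stop when the MSE improvement is negligible.
        mse = sum(w * min((u - qi) ** 2 for qi in q) for u, w in zip(us, ws))
        if prev_mse - mse < tol:
            break
        prev_mse = mse
    return q, mse

# Usage: U uniform on [0, 1] with M = 2; the optimum is q = (0.25, 0.75), MSE = 1/48.
q, mse = lloyd_max(lambda u: 1.0, 0.0, 1.0, q_init=[0.1, 0.2])
print([round(v, 3) for v in q], round(mse, 5))
```

Because the iteration only enforces the two necessary conditions, different values of `q_init` may lead to different fixed points for PDFs with several modes.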
Example 3.9 Let $U$ be a RV with the PDF shown in figure 3.12a. Two possible
outcomes from the Lloyd-Max algorithm are illustrated in figures 3.12b and 3.12c.
The MSEs of the two solutions are 0.224 and 0.208 respectively. Depending on the
initial values of $q_1$ and $q_2$, it is possible that the algorithm yields the solution in
figure 3.12b, which has the larger MSE. Therefore, the Lloyd-Max algorithm can be suboptimal in general.
High-rate Uniform Scalar Quantization
We now consider the special case of high-rate uniform scalar quantization. In this
scenario, we assume that $U$ is in a finite interval $[u_{\min}, u_{\max}]$. Consider using $M$
quantization regions of equal lengths, i.e. uniform quantization. In addition, assume
that $M$ is large, i.e. high-rate quantization. Let $\Delta$ denote the length of each quantization region. Note that $\Delta = (u_{\max} - u_{\min})/M$. When $M$ is sufficiently large (and hence
small $\Delta$), we can approximate the PDF $f_U(u)$ as being constant in each quantization
region. More specifically,
\[ f_U(u) \approx \frac{p_m}{\Delta}, \quad u \in \mathcal{R}_m. \tag{3.19} \]
4 To see why, we can write $E[(X - a)^2] = E[X^2] - 2aE[X] + a^2$. Differentiating the expression with
respect to $a$ and setting the result to zero, we can solve for the optimal value of $a$.
Figure 3.12: Suboptimality of the Lloyd-Max algorithm.
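Under this approximation, the quantization point in each region is the midpoint
of the region. From (3.19), the corresponding MSE can be expressed as
\begin{align*}
\mathrm{MSE} &\approx \sum_{m=1}^{M} \int_{\mathcal{R}_m} \frac{p_m}{\Delta} (u - q_m)^2 \, du = \sum_{m=1}^{M} \frac{p_m}{\Delta} \int_{-\Delta/2}^{\Delta/2} w^2 \, dw \\
&= \sum_{m=1}^{M} \frac{p_m}{\Delta} \left( \frac{\Delta^3}{12} \right) = \frac{\Delta^2}{12}, \tag{3.20}
\end{align*}
where we use the fact that $\int_{\mathcal{R}_m} (u - q_m)^2 \, du = \int_{-\Delta/2}^{\Delta/2} w^2 \, dw$ for each length-$\Delta$ quantization region with the quantization point in the middle. Therefore, the approximate
MSE does not depend on the form of $f_U(u)$ for a high-rate uniform quantizer.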
If we represent the quantization points using a fixed-length code, then the codeword length $L(C)$ is equal to $L(C) = \log_2 M$ (assuming $M = 2^k$ for some $k \in \mathbb{Z}^+$),
and is related to the MSE by
\[ \mathrm{MSE} = \frac{(u_{\max} - u_{\min})^2}{12 \cdot 2^{2L(C)}}. \tag{3.21} \]
Therefore, the MSE decreases exponentially with the number of bits used for a
high-rate uniform quantizer; each extra bit reduces the MSE to 1/4 of its previous value.
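The high-rate approximation $\Delta^2 / 12$ and the factor-of-4 reduction per extra bit can be checked numerically for a non-uniform PDF. The sketch below uses the hypothetical PDF $f_U(u) = 3u^2$ on $[0, 1]$ and a midpoint Riemann sum for the MSE integral; the function name and grid size are assumptions for this illustration.

```python
def uniform_quantizer_mse(pdf, u_min, u_max, levels, cells_per_region=200):
    """MSE of a uniform quantizer with midpoint quantization points."""
    delta = (u_max - u_min) / levels
    du = delta / cells_per_region
    mse = 0.0
    for m in range(levels):
        q = u_min + (m + 0.5) * delta  # midpoint of region m
        for i in range(cells_per_region):
            u = u_min + m * delta + (i + 0.5) * du
            mse += (u - q) ** 2 * pdf(u) * du
    return mse

def pdf(u):
    """Hypothetical non-uniform PDF on [0, 1]."""
    return 3.0 * u * u

mse_by_bits = {}
for k in range(1, 9):
    levels = 2 ** k
    mse_by_bits[k] = uniform_quantizer_mse(pdf, 0.0, 1.0, levels)
    approx = (1.0 / levels) ** 2 / 12.0  # Delta^2 / 12 from (3.20)
    print(f"k = {k}: MSE = {mse_by_bits[k]:.3e}, Delta^2/12 = {approx:.3e}")
```

For small $k$ the exact MSE is slightly above $\Delta^2 / 12$, and the gap closes as $k$ grows, as expected from the high-rate assumption.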
3.7 Vector Quantization
When the source produces successive symbols $U_1, U_2, \ldots$ that are continuous RVs, it
is possible to use scalar quantization to quantize these symbols one by one. However,
a reduction in the MSE per symbol may be obtained if we quantize multiple symbols
at the same time, especially when these symbols are dependent. The process of
quantizing multiple symbols at a time is called vector quantization.
Figure 3.13: Reduction of MSE from vector quantization.
Consider two-dimensional vector quantization. Suppose that two source symbols
$U_1$ and $U_2$ are quantized to $X_1$ and $X_2$. The MSE per symbol, denoted by $\mathrm{MSE}_{2D}$, is
defined as
\[ \mathrm{MSE}_{2D} = \frac{1}{2} E\left[ (U_1 - X_1)^2 + (U_2 - X_2)^2 \right]. \tag{3.22} \]
The following example illustrates the reduction in MSE obtained from vector
quantization.
Example 3.10 (from problem 3.41 of [Pro95]): Two random variables $U_1$ and
$U_2$ are uniformly distributed on the square shown in figure 3.13a. If we perform scalar
quantization to $U_1$ and then to $U_2$ using the 4-level uniform quantizer in figure 3.13b,
the resultant MSE is 1/12 per symbol.
However, if we perform vector quantization using the 16-level two-dimensional
quantizer in figure 3.13c, the resultant MSE is 1/24 per symbol. Note that, in both
cases, we can represent the quantized value using 2 bits per source symbol.

We shall not discuss vector quantization further; more discussion can be found
in [?, p. 72]. It should be noted that coding for sources with memory in practice
often uses content-specific techniques, e.g. speech analysis for voice encoding.
3.8 Summary
In this chapter, we considered the problem of source coding. We showed that, for
a discrete memoryless source (DMS), the entropy serves as a fundamental limit on
the average number of bits required to represent each source symbol. For sources
with memory, we defined the entropy rate, which serves as a fundamental limit for
these sources. In either case, we can use the Huffman algorithm to efficiently encode
source symbols, assuming the knowledge of the joint probability mass function (PMF) of
the source symbols. In cases where the joint PMF is not available, the Lempel-Ziv
universal encoding algorithm can be applied.
When the source produces a continuous waveform, we can convert the source to
a discrete source by sampling the source output and quantizing the sample values.
We saw that, for a band-limited continuous waveform, we can perfectly represent
the waveform by its sample values with the sampling rate equal to twice the source
bandwidth. In addition, we discussed a heuristic for finding quantization regions and
quantization points for scalar quantization. Compared to scalar quantization using
the same data bit rate, we saw that vector quantization can reduce the distortion
even though the source symbols are independent.
Quantization can be studied using the information theory framework. Using mutual information between the symbol and its quantized value, we can define the rate
distortion function, which gives a theoretical lower bound on the data rate subject to
the constraint on the distortion. See [?, p. 108] or [?, p. 301] for the discussion on
rate distortion theory.
In practice, there are techniques for source coding that are specialized to the applications. For example, for coding of speech in cellular networks, model-based source
coding based on linear predictive coding (LPC) is commonly used. Such specialized
source coding techniques are beyond the scope of this course. See [?, p. 125] for more
detailed discussions.
3.9 Practice Problems
Problem 3.1 (Problem 3.7 in [Pro95]): A DMS has an alphabet containing eight
letters $a_1, \ldots, a_8$ with probabilities 0.25, 0.2, 0.15, 0.12, 0.1, 0.08, 0.05, and 0.05.
(a) Assuming that we encode one source symbol at a time, find an optimal prefix-free
code $C$ for this source.
(b) Compute the expected codeword length $L(C)$ for the code in part (a).
(c) Compute the entropy $H(X)$ of the source symbol.
Problem 3.2 (Problem 3.8 in [Pro95]): A DMS has an alphabet containing five
letters $a_1, \ldots, a_5$ with equal probabilities. Evaluate the expected codeword length per
symbol $\bar{L}(C)$ for a fixed-length code in each case.
(a) One symbol is encoded at a time.
(b) Two symbols are encoded at a time.
(c) Three symbols are encoded at a time.
Problem 3.3 (Problem 2.16 in [CT91]): Consider two discrete RVs $X$ and $Y$
with the joint PMF given below.

f_{X,Y}(x, y)    X = 0    X = 1
Y = 0            1/3      0
Y = 1            1/3      1/3

Compute $H(X)$, $H(Y)$, $H(X|Y)$, $H(Y|X)$, and $H(X, Y)$.
Problem 3.4 (Entropy computation from cards): Consider drawing two cards
randomly from a deck of 8 different cards without putting the 1st card back before
drawing the 2nd card. Let the cards be numbered by 1, . . . , 8. Let $X$ and $Y$ denote
the numerical values of the 1st and 2nd cards respectively. Note that $X$ and $Y$ are
RVs.
(a) Compute H(X) and H(Y ).
(b) Compute H(X, Y ) and H(X|Y ).
(c) Suppose that you put the 1st card back into the deck before randomly drawing
the 2nd card. Compute H(X, Y ) in this case.
Problem 3.5 (Entropy of a sum, problem 2.8 in [CT91]): Let $X$ and $Y$ be
RVs with the alphabet sizes $M$ and $N$ respectively. Let $Z = X + Y$.
(a) Show that $H(Z|X) = H(Y|X)$. Argue that, if $X$ and $Y$ are independent, then
$H(Y) \le H(Z)$ and $H(X) \le H(Z)$. Thus, the addition of independent RVs can
only increase the uncertainty.
(b) Give an example in which $H(X) > H(Z)$ and $H(Y) > H(Z)$.
(c) Give an example in which $H(Z) = H(X) + H(Y)$, with $H(X) \ne 0$ and $H(Y) \ne 0$.
Problem 3.6 (Encoding a sequence of RV pairs):
(a) Let $X$ and $Y$ be two independent RVs with the following PMFs.
\[ f_X(x) = \begin{cases} 1/2, & x = 0 \\ 1/2, & x = 1 \end{cases} \qquad f_Y(y) = \begin{cases} 1/4, & y = -1 \\ 1/2, & y = 0 \\ 1/4, & y = 1 \end{cases} \]
Let $Z = X + Y$. Compute $H(Z)$ and $H(Z|X)$.
(b) Suppose we want to transmit the values of $X$ and $Y$ in part (a). Find an
optimal source code that minimizes the expected number of bits used in the
transmission.
(c) Let $X_1, X_2, \ldots$ be a sequence of IID RVs with the PMF $f_X(x)$ in part (a). Let
$Y_1, Y_2, \ldots$ be a sequence of IID RVs, independent of the sequence $X_1, X_2, \ldots$,
with the PMF $f_Y(y)$ in part (a). Suppose we want to transmit the two sequences. What is the minimum number of bits per symbol pair $(X_j, Y_j)$ required
for the transmission?
Problem 3.7 (Kraft inequality for uniquely decodable codes): Consider a
uniquely decodable code $C$ for a RV $X$ with the alphabet $\mathcal{X}$. Let $l(x)$ denote the
length of the codeword for $x \in \mathcal{X}$. For convenience, let $C^k$ denote the $k$th extension
of the code, i.e. the code formed by encoding $k$ successive symbols using $C$. In
addition, let $l(x_1, \ldots, x_k)$ denote the total length of the codeword in $C^k$ for symbol
$(x_1, \ldots, x_k)$. Note that $l(x_1, \ldots, x_k) = \sum_{j=1}^{k} l(x_j)$.
Show that the code $C$ must satisfy the Kraft inequality, i.e. $\sum_{x \in \mathcal{X}} 2^{-l(x)} \le 1$, by
following the steps below.
(a) Write $\left( \sum_{x \in \mathcal{X}} 2^{-l(x)} \right)^k$ as $\sum_{(x_1,\ldots,x_k) \in \mathcal{X}^k} 2^{-l(x_1,\ldots,x_k)}$.
(b) Define $l_{\max} = \max_{x \in \mathcal{X}} l(x)$. In addition, let $a(m)$ denote the number of symbols
$(x_1, \ldots, x_k)$ such that $l(x_1, \ldots, x_k) = m$. Rewrite $\left( \sum_{x \in \mathcal{X}} 2^{-l(x)} \right)^k$ in part (a) as
$\sum_{m=1}^{k l_{\max}} a(m) 2^{-m}$.
(c) Argue that, for unique decodability, we must have $a(m) \le 2^m$. Use this inequality to upper bound $\left( \sum_{x \in \mathcal{X}} 2^{-l(x)} \right)^k$ in part (a) by $k l_{\max}$.
(d) Write the bound in part (c) as $\sum_{x \in \mathcal{X}} 2^{-l(x)} \le (k l_{\max})^{1/k}$ and use the fact that
$(k l_{\max})^{1/k} \to 1$ as $k \to \infty$ to conclude that $\sum_{x \in \mathcal{X}} 2^{-l(x)} \le 1$.
Problem 3.8 (Typical set of sequences):
(a) Consider a sequence of independent and equally likely binary RVs $X_1, X_2, \ldots$.
Let $T_\epsilon^n$ denote the typical set of length-$n$ sequences $(x_1, \ldots, x_n)$. Compute the
size of the set $T_{0.2}^5$.
(b) Consider a sequence of IID RVs $X_1, X_2, \ldots$ whose PMF is given below.
\[ f_X(x) = \begin{cases} 1/4, & x = 0 \\ 1/2, & x = 1 \\ 1/4, & x = 2 \end{cases} \]
Let $T_\epsilon^n$ denote the typical set of length-$n$ sequences $(x_1, \ldots, x_n)$ with respect to
$f_X(x)$. Compute the probabilities of the sets $T_{0.2}^2$ and $T_{0.6}^2$.
Problem 3.9 (LZ universal source coding): Use the LZ universal source coding
(as discussed in class) to encode the following bit sequence. Assume that the size of
the sliding window is 8.
0010011010101110010
Problem 3.10 (First-order Markov source, based on 3.22 in [Pro95]): Recall
that, for a source with memory, the fundamental limit on source coding is the entropy
rate defined as $H_\infty(X) = \lim_{n \to \infty} \frac{1}{n} H(X_1, \ldots, X_n)$. It can be shown that [Pro95, p.
104]
\[ H_\infty(X) = \lim_{n \to \infty} H(X_n | X_1, \ldots, X_{n-1}). \]
Consider a discrete source with the alphabet $\mathcal{X} = \{x_1, \ldots, x_M\}$. A stationary
first-order Markov source is characterized by a Markov chain with $M$ states denoted
by $1, \ldots, M$. In state $j$, the source outputs the symbol $x_j$ before moving to the next
state (possibly the same state) according to the transition probabilities $p_{k|j}$, where
$j, k \in \{1, \ldots, M\}$ with $k \ne j$.
From the transition probabilities, we can compute steady-state probabilities $p_j$,
where $j \in \{1, \ldots, M\}$, which describe the long-term average fraction of time the
source spends in state $j$.
Several practical sources are modeled as Markov sources. From the above form
of the entropy rate, it can be seen that the entropy rate of a stationary first-order
Markov source is
\[ H_\infty(X) = H(X_2 | X_1) = \sum_{j=1}^{M} p_j H(X_2 | X_1 = x_j), \]
where we assume that $X_1$ has the PMF according to the steady-state probabilities.
(a) Determine the entropy rate of the binary stationary rst-order Markov source
with two states as shown below. Note that the source has transition probabilities
between the two states equal to p2|1 = 0.2 and p1|2 = 0.3. (HINT: The steadystate probabilities p1 and p2 can be found in this case as follows. In the steady
state, the probability of being in state 1 and moving to state 2 must be equal to
the probability of being in state 2 and moving to state 1, i.e. p1 p2|1 = p2 p1|2 .)
(b) How does the entropy rate compare with the entropy of a binary DMS with the
same output symbol probabilities p1 and p2 ?
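The entropy rate formula and the balance condition in the hint translate directly into a few lines of code. The following is a minimal sketch for the two-state case only; the function names are made up for illustration, and the transition probabilities used are deliberately different from those in part (a).

```python
from math import log2

def binary_entropy(p):
    """Entropy (in bits) of a binary RV with probabilities p and 1 - p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def two_state_entropy_rate(p21, p12):
    """Entropy rate of a two-state stationary first-order Markov source.

    p21 = P(next state is 2 | current state is 1),
    p12 = P(next state is 1 | current state is 2).
    """
    # Balance condition p1 * p21 = p2 * p12 combined with p1 + p2 = 1.
    p1 = p12 / (p12 + p21)
    p2 = 1.0 - p1
    # H(X2 | X1) = p1 * H(X2 | X1 = x1) + p2 * H(X2 | X1 = x2).
    rate = p1 * binary_entropy(p21) + p2 * binary_entropy(p12)
    return p1, p2, rate

# Illustrative values only (not the values given in part (a)).
p1, p2, rate = two_state_entropy_rate(0.1, 0.4)
dms_entropy = binary_entropy(p1)  # entropy of a DMS with the same symbol probabilities
```

The value `rate` can then be set against `dms_entropy` when thinking about part (b).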
Problem 3.11 (Uniform quantization with uniform distribution): Let $U$ be a RV uniformly distributed over the interval $[0, a]$, where $a > 0$. Consider uniform scalar quantization of $U$ with $M$ quantization regions, where $M = 2^k$ for some $k \in \mathbb{Z}^+$.
(a) Express the MSE of this quantizer in terms of a and M .
(b) Let V denote the quantized value of U . Show that an optimal source code for
the discrete RV V uses k bit/symbol.
(c) Show that the MSE in part (a) decreases exponentially with k, which is the bit
rate of the source (in bit/symbol). In particular, how does the MSE change if
we increase the value of k by 1?
Problem 3.12 (One-bit scalar quantizers):

(a) Let $U$ be a continuous RV with the PDF shown below.
$$f_U(u) = \begin{cases} 2 - 2u, & u \in [0, 1] \\ 0, & \text{otherwise} \end{cases}$$
Consider designing a scalar quantizer for $U$ with two quantization points. Find the optimal quantizer that minimizes the MSE. You need not compute the MSE.
(b) Let $U$ be a Gaussian RV with the PDF below.
$$f_U(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}$$
Consider designing a scalar quantizer for $U$ with two quantization points located at $-a$ and $a$, where $a > 0$. Find the optimal value of $a$ that minimizes the MSE. You need not compute the MSE.
Problem 3.13 (Quantization with Huffman coding): Consider a scalar quantizer shown below together with the PDF of a DMS.
[Figure: PDF with marked values 1/4, 1/8, and 1/16, shown together with the quantization points and the quantization region boundaries.]
(a) Can the given scalar quantizer be a result of the Lloyd-Max algorithm? Why?
(b) Compute the associated MSE for the given quantizer.
(c) Let a RV $V$ denote the quantized value. Suppose that we use a fixed-length
code for $V$. What is the number of bits per symbol required for source coding?
(d) Find the optimal (variable-length) source code for the quantized value V . What
is the number of bits per symbol required for source coding in this case?
Chapter 4
Communication Signals
Physically, communication signals are continuous waveforms. In this chapter, we
show how to represent communication signals as vectors in a linear vector space.
This vector representation is a powerful tool for analysis of communication systems.
It also allows us to understand communication theory using geometric visualization.
In addition, we discuss various modulation schemes including pulse amplitude
modulation (PAM), quadrature amplitude modulation (QAM), and other modulations with higher dimensions. We assume throughout the chapter that the transmission channel is ideal and the system is noise-free. We shall relax these two assumptions
in the next chapter.
4.1
L2 Signal Space
Recall that a signal $u(t)$ is an $L_2$ signal if $\int_{-\infty}^{\infty} |u(t)|^2 dt < \infty$. The set of $L_2$ signals together with the complex scalar field $\mathbb{C}$ forms a vector space called the $L_2$ signal space. Most communication signals of interest can be reasonably modeled as $L_2$ signals. Consequently, we shall view a signal as a vector in the $L_2$ signal space, and use the terms signal and vector interchangeably.
To make the $L_2$ signal space an inner product space, we can define the inner product between two $L_2$ signals $u(t)$ and $v(t)$, denoted by $\langle u(t), v(t) \rangle$, as
$$\langle u(t), v(t) \rangle = \int_{-\infty}^{\infty} u(t) v^*(t)\,dt. \tag{4.1}$$
Note that we need to consider the equal sign "=" in (4.1) in terms of $L_2$ equivalence. Otherwise, the above definition is not a valid inner product. For example, consider the following signal.
$$u(t) = \begin{cases} 1, & t = 0 \\ 0, & \text{otherwise} \end{cases}$$
The above signal has $\langle u(t), u(t) \rangle = 0$, but it is not a zero vector. Without the notion of $L_2$ equivalence, the positivity property of the inner product, i.e. $\langle u(t), u(t) \rangle \ge 0$ with equality if and only if $u(t) = 0$, is violated. Based on the inner product in (4.1), the norm of $u(t)$ is $\|u(t)\| = \sqrt{\int_{-\infty}^{\infty} |u(t)|^2 dt}$. It follows that each signal in the $L_2$ signal space has a finite norm.^1

Let $\{\phi_1(t), \phi_2(t), \ldots\}$ be an orthonormal basis for the $L_2$ signal space. Each $L_2$ signal $u(t)$ is equivalent to its infinite-dimensional projection onto this basis [?, p. 168], i.e.
$$u(t) = \sum_{j=1}^{\infty} \langle u(t), \phi_j(t) \rangle \phi_j(t). \tag{4.2}$$
The representation of $u(t)$ in (4.2) is called an orthonormal expansion of $u(t)$. We now give two important examples of an orthonormal expansion in the $L_2$ signal space.
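Example 4.1 From the Fourier series, the set of $L_2$ signals time-limited to the time interval $\left[ -\frac{T}{2}, \frac{T}{2} \right]$ forms an infinite-dimensional complex vector space. One orthonormal basis for this vector space is the set of vectors
$$\phi_k(t) = \begin{cases} \frac{1}{\sqrt{T}} e^{i 2\pi k t / T}, & t \in \left[ -\frac{T}{2}, \frac{T}{2} \right] \\ 0, & \text{otherwise} \end{cases}$$
where $k \in \mathbb{Z}$. Given a signal $u(t)$ in this vector space, the corresponding orthonormal expansion is
$$u(t) = \frac{1}{\sqrt{T}} \sum_{k=-\infty}^{\infty} u_k e^{i 2\pi k t / T}, \quad t \in \left[ -\frac{T}{2}, \frac{T}{2} \right],$$
where $u_k = \left\langle u(t), \frac{1}{\sqrt{T}} e^{i 2\pi k t / T} \right\rangle = \frac{1}{\sqrt{T}} \int_{-T/2}^{T/2} u(t) e^{-i 2\pi k t / T} dt$.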
Example 4.2 From the sampling theorem, the set of $L_2$ signals band-limited to the frequency range $[-W, W]$ forms an infinite-dimensional complex vector space. The reconstruction formula in (3.15) tells us that one orthonormal basis for this vector space is the set of vectors
$$\phi_k(t) = \sqrt{2W}\,\text{sinc}(2Wt - k), \quad k \in \mathbb{Z}.$$
Given a signal $u(t)$ in this vector space, the corresponding orthonormal expansion is
$$u(t) = \sqrt{2W} \sum_{k=-\infty}^{\infty} u_k\,\text{sinc}(2Wt - k),$$
where $u_k = \left\langle u(t), \sqrt{2W}\,\text{sinc}(2Wt - k) \right\rangle = \frac{1}{\sqrt{2W}} u\left( \frac{k}{2W} \right)$.
Consider the subspace $S_n$ of $L_2$ that is spanned by a set of orthonormal basis vectors $\{\phi_1(t), \ldots, \phi_n(t)\}$. Let $u(t)$ and $v(t)$ be two signals in $S_n$ with the orthonormal expansions $u(t) = \sum_{j=1}^{n} u_j \phi_j(t)$ and $v(t) = \sum_{j=1}^{n} v_j \phi_j(t)$ respectively. It follows that the inner product between $u(t)$ and $v(t)$ can be computed from the coefficients of their orthonormal expansions as shown below.
$$\langle u(t), v(t) \rangle = \left\langle \sum_{j=1}^{n} u_j \phi_j(t), \sum_{k=1}^{n} v_k \phi_k(t) \right\rangle = \sum_{j=1}^{n} \sum_{k=1}^{n} u_j v_k^* \langle \phi_j(t), \phi_k(t) \rangle = \sum_{j=1}^{n} u_j v_j^*$$
Note that the last equality follows from the fact that $\phi_1(t), \ldots, \phi_n(t)$ are orthonormal.
^1 A complex vector space with the inner product defined such that all vectors have finite norms is called a Hilbert space. The $L_2$ signal space is thus a Hilbert space.
The relationship is also valid for the infinite-dimensional $L_2$ signal space, i.e. for $u(t) = \sum_{j=1}^{\infty} u_j \phi_j(t)$ and $v(t) = \sum_{j=1}^{\infty} v_j \phi_j(t)$,
$$\langle u(t), v(t) \rangle = \sum_{j=1}^{\infty} u_j v_j^*. \tag{4.3}$$
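As a consequence of (4.3), the energy of an $L_2$ signal $u(t)$ can be computed from the coefficients of its orthonormal expansion $u(t) = \sum_{j=1}^{\infty} u_j \phi_j(t)$ as
$$\int_{-\infty}^{\infty} |u(t)|^2 dt = \langle u(t), u(t) \rangle = \sum_{j=1}^{\infty} u_j u_j^* = \sum_{j=1}^{\infty} |u_j|^2. \tag{4.4}$$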
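The energy relation (4.4) can be checked numerically with the Fourier basis of Example 4.1. The sketch below is an illustration under stated assumptions: the test signal `u`, the value of `T`, and the midpoint-rule integration are all choices made here, not part of the text.

```python
import cmath
import math

T = 2.0
N = 2000  # number of midpoint samples on [-T/2, T/2]
dt = T / N
ts = [-T / 2 + (n + 0.5) * dt for n in range(N)]

def phi(k, t):
    """Fourier basis function of Example 4.1, evaluated inside [-T/2, T/2]."""
    return cmath.exp(2j * math.pi * k * t / T) / math.sqrt(T)

def u(t):
    """Hypothetical test signal: a short trigonometric polynomial."""
    return 2 * math.cos(2 * math.pi * t / T) + 3j * cmath.exp(2j * math.pi * 2 * t / T)

def inner(f, g):
    """Midpoint-rule approximation of <f, g> = integral of f(t) g*(t) dt."""
    return sum(f(t) * g(t).conjugate() for t in ts) * dt

# Coefficients u_k = <u, phi_k> for a range of k that covers the test signal.
coeffs = {k: inner(u, lambda t, k=k: phi(k, t)) for k in range(-4, 5)}

energy_time = inner(u, u).real                           # left-hand side of (4.4)
energy_coeff = sum(abs(c) ** 2 for c in coeffs.values())  # right-hand side of (4.4)
```

For this particular signal only the coefficients with $k = -1, 1, 2$ are nonzero, and both energy computations agree.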
4.2
Pulse Amplitude Modulation
Consider a baseband communication system in which an information carrying pulse $p(t)$ is transmitted every symbol period $T$ with varying amplitudes depending on the information contents. Such a scheme is called pulse amplitude modulation (PAM). More specifically, let $a_0, a_1, \ldots$ be the amplitudes of the pulses $p(t), p(t - T), \ldots$ respectively. The transmitted signal can be written as^2
$$s(t) = \sum_{j=0}^{\infty} a_j p(t - jT). \tag{4.5}$$
One possible choice for $p(t)$ is the rectangular pulse shown below.^3
$$p_{\text{rec}}(t/T) = \begin{cases} 1, & t \in \left[ -\frac{T}{2}, \frac{T}{2} \right] \\ 0, & \text{otherwise} \end{cases} \tag{4.6}$$
However, the rectangular pulse is not practical since its bandwidth is infinite; the pulse cannot be transmitted over a bandlimited channel. Another choice for $p(t)$ is the sinc pulse $\text{sinc}(t/T)$. Strictly speaking, although the sinc pulse is bandlimited, it cannot be generated perfectly in practice since its support, i.e. the time interval during which $\text{sinc}(t/T)$ is nonzero, is infinite. To generate the sinc pulse in practice, we need to approximate it by truncating the pulse to be time-limited. Later on in the chapter, we shall see other choices of pulses that are more practical than the sinc pulse.
^2 For notational convenience, we start indexing the amplitudes from 0 so that the amplitude $a_j$ is used to modulate the pulse delayed by time $jT$, i.e. $p(t - jT)$.
^3 We are not concerned about the non-causality of $p(t)$ since in practice the modulator can be made causal if we allow some delay, e.g. $T/2$ for the rectangular pulse.
Figure 4.1: Standard $M$-PAM signal set with spacing $d$
Figure 4.2: Receiver structure for PAM [an LTI filter followed by a sampler]
$M$-PAM Signal Set

Suppose that independent and equiprobable data bits enter the modulator at the rate of $b/T$ (in bps), where $T$ is the symbol period for PAM. The number of possible values for each amplitude $a_j$ is then $2^b$. Let $M = 2^b$. A set of values for each amplitude is called a signal set. Each amplitude value is called a signal point.
The standard $M$-PAM signal set consists of $M$ equally spaced signal points located symmetrically around the origin, as illustrated in figure 4.1. Let $d$ be the distance between two adjacent signal points.
Let $A_j$ denote the random variable (RV) corresponding to $a_j$. The energy per symbol, denoted by $E_s$, is defined as $E_s = E\left[ A_j^2 \right]$. In particular, $E_s$ can be computed for the standard $M$-PAM with spacing $d$ as follows.^4
$$E_{s,M\text{-PAM}} = \frac{2}{M} \left( \left( \frac{d}{2} \right)^2 + \left( \frac{3d}{2} \right)^2 + \ldots + \left( \frac{(M-1)d}{2} \right)^2 \right) = \frac{d^2 (M^2 - 1)}{12} = \frac{d^2 (2^{2b} - 1)}{12}. \tag{4.7}$$
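The closed form in (4.7) can be compared against a direct average over the signal points. This is a minimal sketch; the helper names and the values of `M` and `d` are chosen here for illustration.

```python
def pam_points(M, d):
    """Standard M-PAM signal set with spacing d: points -(M-1)d/2, ..., (M-1)d/2."""
    return [(2 * i - (M - 1)) * d / 2 for i in range(M)]

def average_energy(points):
    """Energy per symbol for equiprobable signal points."""
    return sum(abs(a) ** 2 for a in points) / len(points)

M, d = 8, 2.0
points = pam_points(M, d)                    # -7, -5, ..., 5, 7
es_direct = average_energy(points)           # direct average of a_j^2
es_formula = d ** 2 * (M ** 2 - 1) / 12      # closed form in (4.7)
```

Both computations give the same energy per symbol for any even `M` and any spacing `d`.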
Receiver Structure for PAM

Let $r(t)$ be the received signal at the receiver. For the time being, assume a noiseless system in which $r(t) = s(t)$. Consider the receiver structure shown in figure 4.2. The receiver passes $r(t)$ through a linear time invariant (LTI) filter with impulse response $q(t)$. The receiver then samples the filter output $v(t)$ at time $t = 0, T, 2T, \ldots$ in order to recover the pulse amplitudes $a_0, a_1, \ldots$. When we consider noise in the system later on, we shall discuss why the receiver structure in figure 4.2 is optimal.
^4 The following known identities are useful in such computation: $\sum_{j=1}^{n} j = \frac{n(n+1)}{2}$ and $\sum_{j=1}^{n} j^2 = \frac{n(n+1)(2n+1)}{6}$.
Using the convolution formula, we can express $v(t)$ as
$$v(t) = \int_{-\infty}^{\infty} r(\tau) q(t - \tau)\,d\tau = \sum_{j=0}^{\infty} a_j \int_{-\infty}^{\infty} p(\tau - jT) q(t - \tau)\,d\tau$$
For convenience, define $g(t) = p(t) * q(t)$. We can then write $v(t)$ as
$$v(t) = \sum_{j=0}^{\infty} a_j g(t - jT) \tag{4.8}$$
To obtain $v(jT) = a_j$ for $j \in \{0, 1, \ldots\}$, it suffices to choose $g(t)$ with the following property.
$$g(kT) = \begin{cases} 1, & k = 0 \\ 0, & k \in \mathbb{Z}, k \neq 0 \end{cases} \tag{4.9}$$
A signal $g(t)$ that satisfies the condition in (4.9) is called ideal Nyquist with period $T$. Note that the rectangular pulse in (4.6) and the sinc pulse $\text{sinc}(t/T)$ are both ideal Nyquist with period $T$. If $g(t)$ is not ideal Nyquist, then
$$v(jT) = g(0) a_j + \sum_{k \neq j} a_k g(jT - kT). \tag{4.10}$$
From (4.10), we see that $v(jT)$ contains a desired contribution $g(0) a_j$ and an undesired contribution $\sum_{k \neq j} a_k g(jT - kT)$ from the other symbols. This undesired contribution is called inter-symbol interference (ISI).
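The decomposition in (4.10) can be illustrated by sampling the filter output for two choices of $g(t)$. In the sketch below, the amplitude sequence and the second pulse (a sinc stretched to twice the symbol period) are hypothetical choices made only to show a case where ISI appears.

```python
import math

def sinc(x):
    """Normalized sinc: sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def sample_output(amplitudes, g, T, j):
    """v(jT) = sum over k of a_k g(jT - kT), following (4.8)."""
    return sum(a * g((j - k) * T) for k, a in enumerate(amplitudes))

T = 1.0
amplitudes = [1.0, -3.0, 3.0, -1.0]   # hypothetical 4-PAM amplitudes

def g_nyquist(t):
    """sinc(t/T), which is ideal Nyquist with period T."""
    return sinc(t / T)

def g_wide(t):
    """Hypothetical pulse sinc(t/(2T)), which is not ideal Nyquist with period T."""
    return sinc(t / (2 * T))

v_clean = [sample_output(amplitudes, g_nyquist, T, j) for j in range(4)]
v_isi = [sample_output(amplitudes, g_wide, T, j) for j in range(4)]

# ISI term of (4.10): what remains after removing the desired part g(0) a_j.
isi = [v - g_wide(0.0) * a for v, a in zip(v_isi, amplitudes)]
```

With `g_nyquist` the samples reproduce the amplitudes up to rounding, while with `g_wide` each sample carries a nonzero contribution from the neighboring symbols.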
4.3
Nyquist Criterion for No ISI
In this section, we develop a general condition that makes an L2 signal g(t) ideal
Nyquist with period T . Before doing so, it is useful to develop the sampling theorem
for passband signals and the aliasing theorem.
Sampling Theorem for Passband Signals

Theorem 4.1 (Sampling theorem for passband signals): A continuous $L_2$ signal $x(t)$ band-limited to the frequency range $[f_c - W, f_c + W]$, where $f_c > W$, is uniquely determined by its samples taken at the sampling period $\frac{1}{2W}$.
Proof: Since $x(t)$ is an $L_2$ signal, $\hat{x}(f)$ is an $L_2$ signal in the frequency domain. Since $\hat{x}(f)$ is band-limited, we can write a Fourier series expansion
$$\hat{x}(f) = \sum_{k=-\infty}^{\infty} x_k e^{i \frac{2\pi k f}{2W}}, \quad f \in [f_c - W, f_c + W], \tag{4.11}$$
where the $x_k$'s are the Fourier series coefficients given by
$$x_k = \frac{1}{2W} \int_{f_c - W}^{f_c + W} \hat{x}(f) e^{-i \frac{2\pi k f}{2W}} df, \quad k \in \mathbb{Z}. \tag{4.12}$$
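From the inverse Fourier transform formula, we see that $x_k$ in (4.12) is $\frac{1}{2W}$ times the sampling value $x\left( -\frac{k}{2W} \right)$ in the time domain. Since $\hat{x}(f)$ uniquely determines $x(t)$, we can reconstruct $x(t)$ perfectly from the samples $x\left( \frac{k}{2W} \right)$, $k \in \mathbb{Z}$.
The reconstruction formula can be obtained by taking the inverse Fourier transform of the expression in (4.11). We can reexpress (4.11) for $f \in \mathbb{R}$ as follows.
$$\hat{x}(f) = \sum_{k=-\infty}^{\infty} x_k e^{i \frac{2\pi k f}{2W}}\,\text{rect}\left( \frac{f - f_c}{2W} \right), \quad f \in \mathbb{R}, \quad \text{where } \text{rect}(f) = \begin{cases} 1, & |f| \le 1/2 \\ 0, & \text{otherwise} \end{cases} \tag{4.13}$$
From the Fourier transform pair $2W\,\text{sinc}(2Wt) \leftrightarrow \text{rect}\left( \frac{f}{2W} \right)$, basic properties of the Fourier transform pair yield
$$2W e^{i 2\pi f_c \left( t + \frac{k}{2W} \right)}\,\text{sinc}(2Wt + k) \leftrightarrow e^{i \frac{2\pi k f}{2W}}\,\text{rect}\left( \frac{f - f_c}{2W} \right). \tag{4.14}$$
From (4.13) and (4.14), we obtain
$$x(t) = \sum_{k=-\infty}^{\infty} 2W x_k\,\text{sinc}(2Wt + k) e^{i 2\pi f_c \left( t + \frac{k}{2W} \right)}. \tag{4.15}$$
Since $x_k = \frac{1}{2W} x\left( -\frac{k}{2W} \right)$, we can rewrite (4.15) to obtain the reconstruction formula shown below.
$$x(t) = \sum_{k=-\infty}^{\infty} x\left( \frac{k}{2W} \right) \text{sinc}(2Wt - k) e^{i 2\pi f_c \left( t - \frac{k}{2W} \right)} \tag{4.16}$$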
Aliasing Theorem
Consider sampling a continuous $L_2$ signal $x(t)$ at the sampling period $\frac{1}{2W}$. However, we do not assume that $x(t)$ is band-limited to the frequency range $[-W, W]$. Let $z(t)$ be the reconstructed signal from the samples of $x(t)$ according to the reconstruction formula in (3.15), i.e.
$$z(t) = \sum_{k=-\infty}^{\infty} x\left( \frac{k}{2W} \right) \text{sinc}(2Wt - k). \tag{4.17}$$
The aliasing theorem below gives an explicit expression for the Fourier transform of the reconstructed signal, i.e. $\hat{z}(f)$.
Theorem 4.2 (Aliasing theorem): The Fourier transform of the reconstructed signal $z(t)$ is given by
$$\hat{z}(f) = \begin{cases} \sum_{j=-\infty}^{\infty} \hat{x}(f - 2Wj), & f \in [-W, W] \\ 0, & \text{otherwise} \end{cases}$$
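Figure 4.3: Frequency components $\hat{x}_j(f)$ (panels (a) and (b)).

Proof: Consider breaking $x(t)$ into different frequency components $x_j(t)$, $j \in \mathbb{Z}$, such that $x(t) = \sum_{j=-\infty}^{\infty} x_j(t)$ and
$$\hat{x}_j(f) = \begin{cases} \hat{x}(f), & f \in [2Wj - W, 2Wj + W] \\ 0, & \text{otherwise} \end{cases}$$
Figure 4.3a illustrates the frequency components $\hat{x}_j(f)$. Using the rectangle function $\text{rect}\left( \frac{f}{2W} \right)$, we can write $\hat{x}_j(f) = \hat{x}(f)\,\text{rect}\left( \frac{f}{2W} - j \right)$.
From (4.17) and $x(t) = \sum_{j=-\infty}^{\infty} x_j(t)$, we can write
$$z(t) = \sum_{k=-\infty}^{\infty} \sum_{j=-\infty}^{\infty} x_j\left( \frac{k}{2W} \right) \text{sinc}(2Wt - k). \tag{4.18}$$
From the reconstruction formula for passband signals in (4.16), note that $x_j(t) = \sum_{k=-\infty}^{\infty} x_j\left( \frac{k}{2W} \right) \text{sinc}(2Wt - k) e^{i 2\pi j (2Wt - k)}$. We can rewrite (4.18) as
$$z(t) = \sum_{j=-\infty}^{\infty} x_j(t) e^{-i 2\pi (2Wj) t},$$
where we use the fact that $e^{i 2\pi j k} = 1$. It follows that, in the frequency domain, $\hat{z}(f) = \sum_{j=-\infty}^{\infty} \hat{x}_j(f + 2Wj)$. From $\hat{x}_j(f) = \hat{x}(f)\,\text{rect}\left( \frac{f}{2W} - j \right)$, we can write
$$\hat{z}(f) = \sum_{j=-\infty}^{\infty} \hat{x}(f + 2Wj)\,\text{rect}\left( \frac{f}{2W} \right) = \sum_{j=-\infty}^{\infty} \hat{x}(f - 2Wj)\,\text{rect}\left( \frac{f}{2W} \right),$$
which is the statement in the theorem. Figure 4.3b illustrates $\hat{z}(f)$.
It should be clear from the example in figure 4.3 that, unless $x(t)$ is band-limited to the frequency range $[-W, W]$, the reconstructed signal $z(t)$ is not equal to $x(t)$.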
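Nyquist Criterion
We are now ready to develop a condition that makes an $L_2$ signal $g(t)$ ideal Nyquist with period $T$. The condition is called the Nyquist criterion and is given in the following theorem.
Theorem 4.3 (Nyquist criterion): A continuous $L_2$ signal $g(t)$ is ideal Nyquist with period $T$ if and only if
$$\frac{1}{T} \sum_{j=-\infty}^{\infty} \hat{g}\left( f - \frac{j}{T} \right) = 1, \quad f \in \left[ -\frac{1}{2T}, \frac{1}{2T} \right].$$
Proof: Let $z(t)$ be the signal reconstructed from the samples $g(kT)$, $k \in \mathbb{Z}$, i.e.
$$z(t) = \sum_{k=-\infty}^{\infty} g(kT)\,\text{sinc}\left( \frac{t}{T} - k \right).$$
Note that $g(t)$ is ideal Nyquist with period $T$ if and only if $z(t) = \text{sinc}(t/T)$, or equivalently $\hat{z}(f) = T\,\text{rect}(fT)$. From the aliasing theorem, we can write
$$\hat{z}(f) = \sum_{j=-\infty}^{\infty} \hat{g}\left( f - \frac{j}{T} \right) \text{rect}(fT),$$
yielding $T\,\text{rect}(fT) = \sum_{j=-\infty}^{\infty} \hat{g}\left( f - \frac{j}{T} \right) \text{rect}(fT)$, or equivalently
$$\frac{1}{T} \sum_{j=-\infty}^{\infty} \hat{g}\left( f - \frac{j}{T} \right) = 1, \quad f \in \left[ -\frac{1}{2T}, \frac{1}{2T} \right],$$
which is the desired result.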
Two important observations can be made from the Nyquist criterion.

1. A baseband signal cannot satisfy the Nyquist criterion if its bandwidth is less than $\frac{1}{2T}$. In other words, the bandwidth of an ideal Nyquist signal with period $T$ must be at least $\frac{1}{2T}$. This bandwidth of $\frac{1}{2T}$ is called the Nyquist bandwidth.
2. If a baseband signal $g(t)$ has bandwidth exactly equal to $\frac{1}{2T}$, then the only choice for $g(t)$ to be ideal Nyquist with period $T$ is $g(t) = \text{sinc}(t/T)$.
From the above observations, we are interested in finding a signal $g(t)$ whose bandwidth is slightly above $\frac{1}{2T}$, e.g. $(1 + \alpha) \frac{1}{2T}$ for some small $\alpha > 0$.
Figure 4.4: Band edge symmetry of $\hat{g}(f)$.
Band Edge Symmetry and Raised Cosine Pulses

Consider finding an ideal Nyquist pulse with period $T$ and bandwidth $(1 + \alpha) \frac{1}{2T}$, where $0 < \alpha < 1$. In this case, the Nyquist criterion becomes
$$\frac{1}{T} \left( \hat{g}\left( f + \frac{1}{T} \right) + \hat{g}(f) \right) = 1, \quad f \in \left[ -\frac{1}{2T}, 0 \right],$$
$$\frac{1}{T} \left( \hat{g}(f) + \hat{g}\left( f - \frac{1}{T} \right) \right) = 1, \quad f \in \left[ 0, \frac{1}{2T} \right]. \tag{4.19}$$
Let us concentrate on the frequency range $\left[ 0, \frac{1}{2T} \right]$; the same conclusion can be drawn in the frequency range $\left[ -\frac{1}{2T}, 0 \right]$. In practice, $g(t)$ is real and thus $\hat{g}(f)$ has a conjugate symmetry, i.e. $\hat{g}(f) = \hat{g}^*(-f)$. Let $f = \frac{1}{2T} - \Delta$. Then, the condition in (4.19) can be written as
$$T - \hat{g}\left( \frac{1}{2T} - \Delta \right) = \hat{g}^*\left( \frac{1}{2T} + \Delta \right), \tag{4.20}$$
and is referred to as the band edge symmetry. Figure 4.4 illustrates the band edge symmetry when $\hat{g}(f)$ is real. If $\hat{g}(f)$ is complex, then the figure illustrates the symmetry for the real part of $\hat{g}(f)$.
It can be verified that the following choice of $g(t)$, called the raised cosine pulse, satisfies the band edge symmetry in (4.20). The raised cosine pulse with parameter $\alpha$ and period $T$ is shown below together with its Fourier transform.
$$g_{\text{rc},\alpha}(t) = \text{sinc}\left( \frac{t}{T} \right) \left( \frac{\cos(\pi \alpha t / T)}{1 - 4\alpha^2 t^2 / T^2} \right) \tag{4.21}$$
$$\hat{g}_{\text{rc},\alpha}(f) = \begin{cases} T, & |f| \le \frac{1 - \alpha}{2T} \\ T \cos^2\left( \frac{\pi T}{2\alpha} \left( |f| - \frac{1 - \alpha}{2T} \right) \right), & \frac{1 - \alpha}{2T} < |f| \le \frac{1 + \alpha}{2T} \\ 0, & |f| > \frac{1 + \alpha}{2T} \end{cases} \tag{4.22}$$
The Fourier transform of a raised cosine pulse is called a raised cosine spectrum. Note that, when $\alpha = 0$, the raised cosine pulse is the same as the sinc pulse. Compared to the sinc pulse, a raised cosine pulse decays faster in time, as illustrated in figure 4.5, and is more desirable when there is ISI due to signal distortion in the nonideal channel.
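Both sides of the Nyquist criterion can be checked numerically for the raised cosine pulse: the time samples $g(kT)$ and the folded spectrum $\frac{1}{T} \sum_j \hat{g}(f - j/T)$. The sketch below assumes $T = 1$ and $\alpha = 0.5$ as example values; the function names are chosen here, and the removable singularity of (4.21) at $|t| = T/(2\alpha)$ is handled with its limit value.

```python
import math

def rc_pulse(t, T, alpha):
    """Raised cosine pulse (4.21), with the removable singularity replaced by its limit."""
    x = t / T
    sinc = 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)
    denom = 1.0 - (2.0 * alpha * x) ** 2
    if abs(denom) < 1e-12:
        # As |t| -> T / (2 alpha), cos(pi alpha t / T) / denom tends to pi / 4.
        return sinc * math.pi / 4.0
    return sinc * math.cos(math.pi * alpha * x) / denom

def rc_spectrum(f, T, alpha):
    """Raised cosine spectrum (4.22)."""
    f = abs(f)
    f1 = (1 - alpha) / (2 * T)
    f2 = (1 + alpha) / (2 * T)
    if f <= f1:
        return T
    if f <= f2:
        return T * math.cos(math.pi * T / (2 * alpha) * (f - f1)) ** 2
    return 0.0

T, alpha = 1.0, 0.5

# Time-domain condition (4.9): g(kT) for k = -5, ..., 5.
samples = [rc_pulse(k * T, T, alpha) for k in range(-5, 6)]

# Frequency-domain condition of Theorem 4.3 on a grid over [-1/(2T), 1/(2T)].
freqs = [(-0.5 + n / 100) / T for n in range(101)]
folded = [sum(rc_spectrum(f - j / T, T, alpha) for j in range(-2, 3)) / T for f in freqs]
```

Only the shifts $j \in \{-2, \ldots, 2\}$ are summed because the spectrum vanishes beyond $|f| = \frac{1+\alpha}{2T}$, so the other shifts contribute nothing on this grid.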
Figure 4.5: Raised cosine pulses with parameter $\alpha$, where $0 \le \alpha \le 1$ and $T = 1$. The higher the value of $\alpha$, the faster the pulse decays. [Two panels: $g_{\text{rc},\alpha}(t)$ versus $t$ and $\hat{g}_{\text{rc},\alpha}(f)$ versus $f$, each for $\alpha = 0$, $0.5$, and $1$.]
Choosing PAM Pulses Using Nyquist Criterion

Having specified choices of signal $g(t) = p(t) * q(t)$ that are ideal Nyquist with period $T$, it remains to specify the PAM pulse $p(t)$ and the receiver LTI filter impulse response $q(t)$. For simplicity, we focus on real and nonnegative $\hat{g}(f)$, e.g. a raised cosine spectrum. Since $\hat{g}(f) = \hat{p}(f) \hat{q}(f)$, one choice for $\hat{p}(f)$ and $\hat{q}(f)$ is to set
$$|\hat{p}(f)| = |\hat{q}(f)| = \sqrt{\hat{g}(f)}. \tag{4.23}$$
Since $\hat{g}(f)$ is real, it follows that $\hat{q}(f) = \hat{p}^*(f)$, or equivalently $q(t) = p^*(-t)$.
With this choice of $p(t)$ and $q(t)$, it turns out that the set of pulses $\{p(t - jT), j \in \mathbb{Z}\}$ is a set of orthonormal signals, as stated formally in the following theorem.
Theorem 4.4 Let $g(t)$ be ideal Nyquist with period $T$. In addition, assume that $\hat{g}(f)$ is real and nonnegative. Let $|\hat{p}(f)| = \sqrt{\hat{g}(f)}$. Then, $\{p(t - jT), j \in \mathbb{Z}\}$ is a set of orthonormal signals.
Proof: Since $q(t) = p^*(-t)$, we can write $g(t) = p(t) * q(t)$ as
$$g(t) = \int_{-\infty}^{\infty} p(\tau) q(t - \tau)\,d\tau = \int_{-\infty}^{\infty} p(\tau) p^*(\tau - t)\,d\tau.$$
For $t = kT$, $k \in \mathbb{Z}$, we can write
$$g(kT) = \int_{-\infty}^{\infty} p(\tau) p^*(\tau - kT)\,d\tau = \begin{cases} 1, & k = 0 \\ 0, & k \neq 0 \end{cases} \tag{4.24}$$
where the last equality follows from the assumption that $g(t)$ is ideal Nyquist with period $T$. Thus, $p(t)$ is orthogonal to $p(t - kT)$, $k \neq 0$. By the change of variable $\tau' = \tau - jT$, we can establish that $p(t - jT)$ is orthogonal to $p(t - kT)$, $k \neq j$. In addition, from (4.24), we see that $\|p(t)\| = 1$. It follows that $\|p(t - jT)\| = \|p(t)\| = 1$, $j \in \mathbb{Z}$. In conclusion, $\{p(t - jT), j \in \mathbb{Z}\}$ is an orthonormal set.
Figure 4.6: Square root of raised cosine pulses with parameter $\alpha$, where $0 \le \alpha \le 1$ and $T = 1$. The higher the value of $\alpha$, the faster the pulse decays.

If we use the raised cosine spectrum for $\hat{g}(f)$, then the choice of $\hat{p}(f)$ in (4.23), i.e. $\hat{p}(f) = \sqrt{\hat{g}_{\text{rc},\alpha}(f)}$, is called a square root of raised cosine spectrum. In the time domain, a square root of raised cosine pulse is equal to the inverse Fourier transform of $\sqrt{\hat{g}_{\text{rc},\alpha}(f)}$, which is given below [?, p. 228] and illustrated in figure 4.6.
$$p_{\text{sqrc},\alpha}(t) = \frac{4\alpha}{\pi \sqrt{T}} \left( \frac{\cos((1 + \alpha)\pi t / T) + T \sin((1 - \alpha)\pi t / T) / (4\alpha t)}{1 - (4\alpha t / T)^2} \right) \tag{4.25}$$
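The orthonormality of the shifted pulses stated in Theorem 4.4 can be checked numerically for the pulse in (4.25). The sketch below is a rough check under stated assumptions: $T = 1$, $\alpha = 0.5$, a midpoint Riemann sum on a truncated interval $[-40, 40]$, and a grid chosen so that the removable singularities at $t = 0$ and $|t| = T/(4\alpha)$ are never evaluated. All names are chosen here for illustration.

```python
import math

def sqrt_rc_pulse(t, T, alpha):
    """Square root of raised cosine pulse, following (4.25).

    Not defined at t = 0 or |t| = T / (4 alpha); the grid below avoids these points.
    """
    num = (math.cos((1 + alpha) * math.pi * t / T)
           + T * math.sin((1 - alpha) * math.pi * t / T) / (4 * alpha * t))
    den = 1.0 - (4 * alpha * t / T) ** 2
    return 4 * alpha / (math.pi * math.sqrt(T)) * num / den

T, alpha = 1.0, 0.5
dt = 0.01
half_span = 40.0
n_pts = int(2 * half_span / dt)
# Midpoints of the grid cells: never exactly 0 or +-0.5, even after integer shifts.
ts = [-half_span + (n + 0.5) * dt for n in range(n_pts)]

def shifted_inner(k):
    """Approximate <p(t), p(t - kT)> for a real pulse p by a midpoint Riemann sum."""
    return sum(sqrt_rc_pulse(t, T, alpha) * sqrt_rc_pulse(t - k * T, T, alpha)
               for t in ts) * dt

norm_sq = shifted_inner(0)   # expected to be close to 1
cross_1 = shifted_inner(1)   # expected to be close to 0
cross_2 = shifted_inner(2)   # expected to be close to 0
```

The truncation to a finite interval is the main approximation here; the pulse tails decay quickly enough that the neglected energy is small.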
Orthonormal Expansions of PAM Signals and Matched Filtering

Given that $\{p(t - jT), j \in \mathbb{Z}\}$ is a set of orthonormal signals, a PAM signal in (4.5), i.e. $s(t) = \sum_{j=0}^{\infty} a_j p(t - jT)$, can be thought of as an orthonormal expansion in the subspace spanned by $\{p(t - jT), j \in \mathbb{Z}^+\}$. Note that the amplitudes $a_0, a_1, \ldots$ are simply the coefficients of the orthonormal expansion.
Retrieving the amplitude $a_j$ can be done by projecting $s(t)$ on the orthonormal basis vector $p(t - jT)$, i.e.
$$a_j = \langle s(t), p(t - jT) \rangle = \int_{-\infty}^{\infty} s(\tau) p^*(\tau - jT)\,d\tau. \tag{4.26}$$
From (4.26), we see that $a_j$ can be obtained by passing $s(t)$ through an LTI filter with impulse response $q(t) = p^*(-t)$ and sampling the output at time $t = jT$. This is exactly the operation of the receiver in figure 4.2. Therefore, we can think of the receiver operations in figure 4.2 as finding the coefficients of the orthonormal expansion of the PAM signal.
The choice $q(t) = p^*(-t)$ at the receiver is called a matched filter. Note that, in practice, $p(t)$ is real, yielding $q(t) = p(-t)$.
Figure 4.7: Moving $s_b(t)$ from the baseband to the passband (panels (a) and (b)).
4.4
Passband Modulation: DSB-AM and QAM
In a majority of communication systems, the available frequency band is not at the baseband but is centered around some frequency $f_c \gg 0$. Suppose that the available frequency bands are $[f_c - W, f_c + W]$ as well as $[-f_c - W, -f_c + W]$, where $f_c > W$. Note that the negative frequency band is required for transmitted signals to be real, which is the case in practice. Such a system is called a passband system. We say that the given passband system has bandwidth $2W$ since the bandwidth is universally defined to be the range of positive frequencies in the transmission band.
Consider a real baseband PAM signal $s_b(t)$ band-limited to the frequency range $[-W, W]$. More specifically, we can think of $s_b(t)$ as a PAM signal given in (4.5) with $p(t) = \text{sinc}(t/T)$ and $T = \frac{1}{2W}$. The signal is to be transmitted over the passband system described above. Moving $s_b(t)$ up to the passband can be done by multiplying $s_b(t)$ by $e^{i 2\pi f_c t}$, as illustrated in figure 4.7a.
However, $s_b(t) e^{i 2\pi f_c t}$ is complex. To make the passband signal real, we can add the complex conjugate $s_b(t) e^{-i 2\pi f_c t}$. This method of producing a real passband signal is called double-sideband amplitude modulation (DSB-AM). Denote the real passband DSB-AM signal by $s_{\text{DSB-AM}}(t)$. We can write
$$s_{\text{DSB-AM}}(t) = s_b(t) e^{i 2\pi f_c t} + s_b(t) e^{-i 2\pi f_c t} = 2 s_b(t) \cos(2\pi f_c t). \tag{4.27}$$
Figure 4.7b illustrates the spectrum of $s_{\text{DSB-AM}}(t)$. Note that, since $s_b(t)$ is real, $\hat{s}_b(f)$ must have the conjugate symmetry. Consequently, $\hat{s}_{\text{DSB-AM}}(f)$ in the frequency range $[f_c - W, f_c]$ can be determined from $\hat{s}_{\text{DSB-AM}}(f)$ in the frequency range $[f_c, f_c + W]$. This redundancy indicates an inefficient use of bandwidth by DSB-AM.
Degrees of Freedom in Passband Signals

Given the passband bandwidth $2W$, it is known that the number of independent signal values that can be transmitted is at most $2W$ complex values per second [?, p. 128]; the justification of this fact is beyond the scope of this course. This number of independent real values is called the number of real degrees of freedom; one real signal value corresponds to one real degree of freedom.
For the above DSB-AM system, suppose that we use the sinc pulse $p(t) = \text{sinc}(t/T)$ as the baseband pulse shape, where $T = \frac{1}{2W}$. Note that this pulse, when modulated to the passband, will have the bandwidth equal to $2W$. It follows that the number of degrees of freedom provided by DSB-AM is $2W$ real degrees per second. In the next section, we discuss a passband modulation scheme that can provide $2W$ complex degrees per second, which is equivalent to $4W$ real degrees per second.
Quadrature Amplitude Modulation (QAM)

Similar to DSB-AM, a QAM signal is constructed from a baseband PAM signal
$$s_b(t) = \sum_{j=0}^{\infty} a_j p(t - jT) \tag{4.28}$$
but with complex amplitudes $a_0, a_1, \ldots$. Since transmitting one complex value is equivalent to transmitting two real values, a QAM signal has twice as many degrees of freedom as a DSB-AM signal with the same bandwidth.
To construct a passband signal, the complex baseband signal $s_b(t)$ is modulated by multiplying with $e^{i 2\pi f_c t}$, yielding $s_b(t) e^{i 2\pi f_c t}$. To make the passband signal real, we can add its complex conjugate $s_b^*(t) e^{-i 2\pi f_c t}$. In addition, it is convenient to add a scaling factor of $1/\sqrt{2}$, yielding the following expressions for a QAM signal.^5
$$s_{\text{QAM}}(t) = \frac{1}{\sqrt{2}} s_b(t) e^{i 2\pi f_c t} + \frac{1}{\sqrt{2}} s_b^*(t) e^{-i 2\pi f_c t}$$
$$= \sqrt{2}\,\text{Re}\left\{ s_b(t) e^{i 2\pi f_c t} \right\}$$
$$= \sqrt{2}\,\text{Re}\{s_b(t)\} \cos(2\pi f_c t) - \sqrt{2}\,\text{Im}\{s_b(t)\} \sin(2\pi f_c t) \tag{4.29}$$
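The three expressions in (4.29) can be evaluated side by side for any complex baseband value. The sketch below uses arbitrary example values for $s_b(t)$, $f_c$, and $t$; the function name is chosen here for illustration.

```python
import cmath
import math

def s_qam_forms(sb, fc, t):
    """Evaluate the three expressions in (4.29) for a complex baseband value sb = s_b(t)."""
    carrier = cmath.exp(2j * math.pi * fc * t)
    # First line: (1/sqrt 2) s_b e^{i 2 pi fc t} + (1/sqrt 2) s_b^* e^{-i 2 pi fc t}.
    form1 = (sb * carrier + sb.conjugate() * carrier.conjugate()) / math.sqrt(2)
    # Second line: sqrt 2 Re{ s_b e^{i 2 pi fc t} }.
    form2 = math.sqrt(2) * (sb * carrier).real
    # Third line: sqrt 2 Re{s_b} cos(2 pi fc t) - sqrt 2 Im{s_b} sin(2 pi fc t).
    form3 = math.sqrt(2) * (sb.real * math.cos(2 * math.pi * fc * t)
                            - sb.imag * math.sin(2 * math.pi * fc * t))
    return form1, form2, form3

# Arbitrary example values.
f1, f2, f3 = s_qam_forms(complex(1.5, -0.5), fc=10.0, t=0.0137)
```

The first form is computed as a complex number whose imaginary part cancels, which reflects the fact that adding the complex conjugate makes the passband signal real.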
We now look at the conceptual implementation of QAM that is simple to follow but allows for the use of complex signals. Figure 4.8a shows such an implementation. Note that the Hilbert filter (with gain 1) has the following frequency response.
$$\hat{h}_{\text{Hilbert}}(f) = \begin{cases} 1, & f \ge 0 \\ 0, & f < 0 \end{cases}$$
Figure 4.8b shows the Fourier transform of the real passband QAM signal. Compared to figure 4.7b, there is no redundancy in the Fourier transform $\hat{s}_{\text{QAM}}(f)$.
^5 With this scaling factor, we shall see later that the real and imaginary parts of complex amplitudes $a_0, a_1, \ldots$ are equal to the coefficients of an orthonormal expansion of a QAM signal. For a complex number $x$, $\text{Re}\{x\}$ denotes its real part, while $\text{Im}\{x\}$ denotes its imaginary part.
Figure 4.8: Conceptual implementation of QAM. [Panel (a): the transmitter takes the complex baseband signal to a complex passband signal and then to a real passband signal; after the passband channel (assumed ideal), the receiver takes the real passband signal to a complex passband signal and back to a complex baseband signal. Panel (b): Fourier transform of the real passband QAM signal.]
Figure 4.9: Transmitter implementation for QAM. [Two real PAM transmitters produce real baseband signals that are combined into one real passband signal.]
QAM Implementation
To avoid the use of complex signals in QAM implementation, we can view the complex baseband signal $s_b(t)$ in (4.28) as two real signals $\text{Re}\{s_b(t)\}$ and $\text{Im}\{s_b(t)\}$. In particular, we can write
$$\text{Re}\{s_b(t)\} = \sum_{j=0}^{\infty} \text{Re}\{a_j\} p(t - jT), \quad \text{Im}\{s_b(t)\} = \sum_{j=0}^{\infty} \text{Im}\{a_j\} p(t - jT).$$
We can view the transmissions of $\text{Re}\{s_b(t)\}$ and $\text{Im}\{s_b(t)\}$ as transmissions over two parallel baseband PAM systems. From the expression of $s_{\text{QAM}}(t)$ in (4.29), i.e. $s_{\text{QAM}}(t) = \sqrt{2}\,\text{Re}\{s_b(t)\} \cos(2\pi f_c t) - \sqrt{2}\,\text{Im}\{s_b(t)\} \sin(2\pi f_c t)$, we have the transmitter implementation in figure 4.9. Notice that all involved signals are real.
To recover the complex baseband signal $s_b(t)$, we can separately recover $\text{Re}\{s_b(t)\}$ and $\text{Im}\{s_b(t)\}$. From trigonometric identities $2 \cos^2 x = 1 + \cos(2x)$ and $2 \sin x \cos x = \sin(2x)$, we can write
$$\sqrt{2} s_{\text{QAM}}(t) \cos(2\pi f_c t) = 2 \left[ \text{Re}\{s_b(t)\} \cos(2\pi f_c t) - \text{Im}\{s_b(t)\} \sin(2\pi f_c t) \right] \cos(2\pi f_c t)$$
$$= \text{Re}\{s_b(t)\} \left[ 1 + \cos(4\pi f_c t) \right] - \text{Im}\{s_b(t)\} \sin(4\pi f_c t).$$
It follows that, after multiplying $s_{\text{QAM}}(t)$ with $\sqrt{2} \cos(2\pi f_c t)$, we can use a low pass filter (LPF) passing the frequency range $[-W, W]$ to recover $\text{Re}\{s_b(t)\}$. Figure 4.10 shows the demodulation of $\text{Re}\{s_b(t)\}$.
Since $\text{Re}\{s_b(t)\}$ is a baseband PAM signal, it is passed through a PAM receiver that contains a matched filter $q(t) = p(-t)$. Since $p(t)$ is band-limited to $[-W, W]$, so is $q(t)$. (Note that $\hat{q}(f) = \hat{p}^*(f)$.) It follows that the LPF in figure 4.10 is in fact redundant. Therefore, to recover $\text{Re}\{a_0\}, \text{Re}\{a_1\}, \ldots$, we can use the receiver structure shown in figure 4.11.
Similarly, to demodulate $\text{Im}\{s_b(t)\}$, we can multiply $s_{\text{QAM}}(t)$ by $-\sqrt{2} \sin(2\pi f_c t)$ to get
$$-\sqrt{2} s_{\text{QAM}}(t) \sin(2\pi f_c t) = \text{Im}\{s_b(t)\} \left[ 1 - \cos(4\pi f_c t) \right] - \text{Re}\{s_b(t)\} \sin(4\pi f_c t),$$
and then lowpass filter the multiplication result.
Figure 4.10: Demodulation of $\text{Re}\{s_b(t)\}$ from $s_{\text{QAM}}(t)$ (assuming no noise), using an LPF.
Figure 4.11: Recovering $\text{Re}\{a_0\}, \text{Re}\{a_1\}, \ldots$ from $s_{\text{QAM}}(t)$ (assuming no noise), using a PAM matched filter.

As with $\text{Re}\{s_b(t)\}$, $\text{Im}\{s_b(t)\}$ is a baseband PAM signal. Therefore, the LPF is redundant when we pass $\text{Im}\{s_b(t)\}$ through a matched filter at the PAM receiver. In summary, we have the receiver implementation in figure 4.12. Note that all involved signals are real.
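The demodulation steps above can be imitated numerically in a simplified setting. The sketch below assumes a single symbol whose baseband value is a constant complex amplitude over $[0, T)$, and it replaces the low pass filter by an average over an integer number of carrier cycles (so $f_c T$ is assumed to be an integer). These are simplifications made here, not the receiver of figure 4.12, and the function name is hypothetical.

```python
import math

def qam_demodulate(a, fc, T, n=10000):
    """Recover Re{a} and Im{a} from s_QAM(t) for a constant baseband value a on [0, T).

    Averaging over an integer number of carrier cycles stands in for the low pass filter.
    """
    dt = T / n
    re_acc = 0.0
    im_acc = 0.0
    for i in range(n):
        t = (i + 0.5) * dt
        c = math.cos(2 * math.pi * fc * t)
        s = math.sin(2 * math.pi * fc * t)
        # Transmitted signal, third line of (4.29).
        s_qam = math.sqrt(2) * (a.real * c - a.imag * s)
        # Multiply by sqrt(2) cos and by -sqrt(2) sin, then accumulate.
        re_acc += s_qam * math.sqrt(2) * c
        im_acc += -s_qam * math.sqrt(2) * s
    return re_acc * dt / T, im_acc * dt / T

# Arbitrary example amplitude; fc * T = 8 carrier cycles per symbol.
re_hat, im_hat = qam_demodulate(complex(3.0, -1.0), fc=8.0, T=1.0)
```

The double-frequency terms $\cos(4\pi f_c t)$ and $\sin(4\pi f_c t)$ average to zero over whole carrier cycles, which is why the two outputs reduce to the real and imaginary parts of the amplitude.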
Orthonormal Expansions of QAM Signals
We have seen that a baseband PAM signal in (4.5), i.e. $s(t) = \sum_{j=0}^{\infty} a_j p(t - jT)$, can be thought of as an orthonormal expansion in the subspace spanned by basis vectors $\{p(t - jT), j \in \mathbb{Z}^+\}$; the amplitudes $a_0, a_1, \ldots$ are the coefficients of expansion.
Similarly, we can think of the passband signal $s_{\text{QAM}}(t)$ in (4.29) as an orthonormal expansion in the subspace spanned by the following orthonormal vectors. (The verification that these vectors are indeed orthonormal is left as an exercise.)
$$\left\{ \sqrt{2} p(t - jT) \cos(2\pi f_c t), \; -\sqrt{2} p(t - jT) \sin(2\pi f_c t), \; j \in \mathbb{Z}^+ \right\}$$
Figure 4.12: Receiver implementation for QAM (assuming no noise), using two PAM matched filters.
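To find the coefficients of expansion, we can write $s_{\text{QAM}}(t)$ in (4.29) as follows.
$$s_{\text{QAM}}(t) = \sqrt{2}\,\text{Re}\{s_b(t)\} \cos(2\pi f_c t) - \sqrt{2}\,\text{Im}\{s_b(t)\} \sin(2\pi f_c t)$$
$$= \sum_{j=0}^{\infty} \text{Re}\{a_j\} \sqrt{2} p(t - jT) \cos(2\pi f_c t) - \sum_{j=0}^{\infty} \text{Im}\{a_j\} \sqrt{2} p(t - jT) \sin(2\pi f_c t) \tag{4.30}$$
From (4.30), we see that the coefficients of expansion are equal to $\text{Re}\{a_0\}, \text{Re}\{a_1\}, \ldots$ and $\text{Im}\{a_0\}, \text{Im}\{a_1\}, \ldots$. Being a coefficient of an orthonormal expansion, $\text{Re}\{a_j\}$ can be retrieved from the inner product
$$\text{Re}\{a_j\} = \left\langle s_{\text{QAM}}(t), \sqrt{2} p(t - jT) \cos(2\pi f_c t) \right\rangle,$$
which is indeed equivalent to the operation in figure 4.11. The equivalence can be seen more easily by rewriting the inner product as follows.
$$\text{Re}\{a_j\} = \int_{-\infty}^{\infty} s_{\text{QAM}}(t) \sqrt{2} p(t - jT) \cos(2\pi f_c t)\,dt$$
$$= \int_{-\infty}^{\infty} \left[ s_{\text{QAM}}(t) \sqrt{2} \cos(2\pi f_c t) \right] p(t - jT)\,dt$$
$$= \left. \left[ s_{\text{QAM}}(t) \sqrt{2} \cos(2\pi f_c t) \right] * p(-t) \right|_{t = jT}$$
The coefficient $\text{Im}\{a_j\}$ can be retrieved similarly from the inner product
$$\text{Im}\{a_j\} = \left\langle s_{\text{QAM}}(t), -\sqrt{2} p(t - jT) \sin(2\pi f_c t) \right\rangle.$$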
QAM Signal Sets

Suppose that independent and equiprobable data bits enter the modulator at the rate of $b/T$ bps, where $T$ is the symbol period for QAM. Then, the number of possible values for each random complex amplitude $A_j$ must be $2^b$. Let $M = 2^b$. A set of values for each complex amplitude is called a signal constellation or signal set. Each amplitude value is called a signal point.
The standard $M' \times M'$-QAM signal set $\mathcal{A}$ with spacing $d$, where $M' = \sqrt{M}$, is the Cartesian product of two standard $M'$-PAM signal sets with spacing $d$, i.e.
$$\mathcal{A} = \mathcal{A}' \times \mathcal{A}' = \{(x, y) : x \in \mathcal{A}', y \in \mathcal{A}'\},$$
$$\mathcal{A}' = \left\{ -\frac{d(M' - 1)}{2}, \ldots, -\frac{d}{2}, \frac{d}{2}, \ldots, \frac{d(M' - 1)}{2} \right\}. \tag{4.31}$$
Figure 4.13 shows the standard $4 \times 4$-QAM signal set with spacing $d$. For QAM, the energy per symbol, denoted by $E_s$, is defined as $E_s = E\left[ |A_k|^2 \right]$. For the standard $M' \times M'$-QAM signal set,
$$E_{s,M' \times M'\text{-QAM}} = \frac{d^2 (M - 1)}{6}. \tag{4.32}$$
Figure 4.13: Standard $4 \times 4$-QAM signal set with spacing $d$.
Figure 4.14: Commonly used $M$-PSK signal sets: 2PSK or binary PSK (BPSK), 4PSK or quadrature PSK (QPSK), and 8PSK.

Note that the energy per symbol for the standard $M' \times M'$-QAM is twice the energy per symbol for the standard $M'$-PAM. (The derivation of the above symbol energy is left as an exercise.)
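The symbol energy in (4.32) and its relation to the PAM energy can be confirmed by enumerating the signal points. This is a minimal sketch with example values for the side length and spacing; the helper names are chosen here for illustration.

```python
from itertools import product

def pam_points(m, d):
    """Standard m-PAM signal set with spacing d."""
    return [(2 * i - (m - 1)) * d / 2 for i in range(m)]

def qam_points(m_side, d):
    """Standard m_side x m_side QAM set, with the point (x, y) stored as x + iy."""
    side = pam_points(m_side, d)
    return [complex(x, y) for x, y in product(side, side)]

def average_energy(points):
    """Energy per symbol for equiprobable signal points."""
    return sum(abs(a) ** 2 for a in points) / len(points)

m_side, d = 4, 2.0          # example: 4 x 4 QAM with spacing 2
M = m_side ** 2
es_qam = average_energy(qam_points(m_side, d))
es_pam = average_energy(pam_points(m_side, d))
es_formula = d ** 2 * (M - 1) / 6   # closed form in (4.32)
```

Because the real and imaginary parts of a point are drawn from the same PAM set, the QAM energy is the sum of two equal PAM energies.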
One class of commonly used QAM signal sets is the class of phase shift keying (PSK) signal sets. In general, an $M$-PSK signal set can be described as
$$\mathcal{A} = \left\{ \sqrt{E_s} e^{i \left( \frac{2\pi j}{M} + \theta \right)}, \; j \in \{0, \ldots, M - 1\} \right\}, \tag{4.33}$$
where $\theta$ is either $0$ or $\pi/4$. Note that, for $M$-PSK, every signal point has the same energy equal to $E_s$. Figure 4.14 shows some examples of $M$-PSK signal sets.
When we study optimal detection in the presence of noise, we shall see that the performance of the signal set depends on the minimum distance between signal points, denoted by $d_{\min}$. For the signal sets in figure 4.14, we can compute $d_{\min}$ to be
$$d_{\min,\text{BPSK}} = 2\sqrt{E_s}, \quad d_{\min,\text{QPSK}} = \sqrt{2 E_s}, \quad d_{\min,\text{8PSK}} = \sqrt{(2 - \sqrt{2}) E_s}. \tag{4.34}$$
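The minimum distances in (4.34) can be reproduced by building the signal sets from (4.33) and searching over all pairs of points. The sketch below assumes $E_s = 1$ and picks the phase offset per constellation as an example; the function names are chosen here.

```python
import cmath
import math

def psk_points(M, es, theta=0.0):
    """M-PSK signal set following (4.33), as complex numbers."""
    return [math.sqrt(es) * cmath.exp(1j * (2 * math.pi * j / M + theta)) for j in range(M)]

def min_distance(points):
    """Minimum Euclidean distance over all pairs of distinct signal points."""
    return min(abs(p - q) for i, p in enumerate(points) for q in points[i + 1:])

es = 1.0
d_bpsk = min_distance(psk_points(2, es))
d_qpsk = min_distance(psk_points(4, es, math.pi / 4))
d_8psk = min_distance(psk_points(8, es))
```

The phase offset rotates the whole constellation and therefore does not change the minimum distance.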
One fundamental question is how to choose an $M$-point QAM signal set such that it has the maximum value of $d_{\min}$ subject to a fixed value of $E_s$. In general, optimal signal sets are difficult to derive. In addition, the performance gain is limited and often not worth the additional complexity involved in signal detection. As a result, several simple but suboptimal signal sets are often used in practice. We shall not investigate the problem of finding optimal signal sets any further.
Figure 4.15: Orthogonal signal set ($K = M = 3$).
4.5 K-Dimensional Signal Sets
So far, we have seen one-dimensional signal sets for PAM and two-dimensional signal sets for QAM. It is possible to generalize to $K$-dimensional signal sets. For a transmission system that uses a $K$-dimensional signal set, the $j$th transmitted symbol is a signal point that can be described as a $K$-dimensional vector $\mathbf{a}_j = (a_{j,1}, \ldots, a_{j,K})$.
As with PAM and QAM, we can view $a_{j,k}$, $j \in \mathbb{Z}^+$, $k \in \{1, \ldots, K\}$, as the coefficients of an orthonormal expansion. In particular, let $\{\phi_1(t), \ldots, \phi_K(t)\}$ be the set of $K$ orthonormal signals corresponding to the 0th transmission. In addition, assume that $\{\phi_1(t - jT), \ldots, \phi_K(t - jT), j \in \mathbb{Z}^+\}$ is an orthonormal set, where $T$ is the symbol period. Then, the transmitted signal can be described as
\[
s_{K\text{-dim}}(t) = \sum_{j=0}^{\infty} \sum_{k=1}^{K} a_{j,k}\, \phi_k(t - jT). \tag{4.35}
\]
As with PAM and QAM, the process of retrieving the coefficient $a_{j,k}$ from $s_{K\text{-dim}}(t)$ is equivalent to computing the inner product $a_{j,k} = \langle s_{K\text{-dim}}(t), \phi_k(t - jT) \rangle$.
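In an orthogonal signal set with $M$ signal points, we can describe the $M$ signal points as the following $M$ vectors in $M$ dimensions.
\[
\mathcal{A} = \left\{
\begin{bmatrix} \sqrt{E_s} \\ 0 \\ \vdots \\ 0 \end{bmatrix},
\begin{bmatrix} 0 \\ \sqrt{E_s} \\ \vdots \\ 0 \end{bmatrix},
\ldots,
\begin{bmatrix} 0 \\ 0 \\ \vdots \\ \sqrt{E_s} \end{bmatrix}
\right\} \tag{4.36}
\]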
Figure 4.15 shows the orthogonal signal set with 3 signal points. Note that, for an orthogonal signal set, the number of dimensions $K$ is equal to the number of signal points $M$. One example of an orthogonal signal set is a set of $M$-point pulse position modulation ($M$-PPM) signals, shown in figure 4.16 for $M = 4$.
A biorthogonal signal set with $M$ signal points ($M$ even) is obtained from an orthogonal signal set with $M/2$ signal points by including the negatives of those signal points. In particular, if $\{\mathbf{s}_1, \ldots, \mathbf{s}_{M/2}\}$ is the $M/2$-point orthogonal signal set, then the corresponding biorthogonal signal set is $\{\mathbf{s}_1, \ldots, \mathbf{s}_{M/2}, -\mathbf{s}_1, \ldots, -\mathbf{s}_{M/2}\}$. Figure 4.17 shows the 6-point biorthogonal signal set constructed from the signal set in figure 4.15. Note that, for a biorthogonal signal set, $K = M/2$.
Figure 4.16: 4-PPM signals.
Figure 4.17: Biorthogonal signal set (K = 3, M = 6).

A simplex signal set with $M$ signal points is obtained from an orthogonal signal set by subtracting the mean signal point from each signal point. In particular, if $\{\mathbf{s}_1, \ldots, \mathbf{s}_M\}$ is the $M$-point orthogonal signal set, then the corresponding simplex signal set is $\{\mathbf{s}_1 - \mathbf{m}, \ldots, \mathbf{s}_M - \mathbf{m}\}$, where $\mathbf{m} = \frac{1}{M} \sum_{j=1}^{M} \mathbf{s}_j$. Note that, for a simplex signal set, the dimension of the subspace of $\mathbb{R}^M$ spanned by the signal points is $M - 1$. (The justification is left as an exercise.)
Similar to PAM and QAM, the performance of a $K$-dimensional signal set depends on the minimum distance between signal points, denoted by $d_{\min}$. It can be shown that, for the same $d_{\min}$, the expected signal energy per symbol (i.e. $E_s = E[\|\mathbf{A}_j\|^2]$) for a simplex signal set is lower than that of an orthogonal signal set by a factor of $\left(1 - \frac{1}{M}\right)$. (The justification is left as an exercise.) Therefore, the simplex signal set is usually preferred to the orthogonal signal set when the transmit power is limited.
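The three constructions above can be sketched numerically. This is an illustrative sketch assuming NumPy is available; the function names are hypothetical, and signal points are stored as rows of a matrix.

```python
import numpy as np


def orthogonal_set(M, Es=1.0):
    """M orthogonal signal points in M dimensions, one point per row."""
    return np.sqrt(Es) * np.eye(M)


def biorthogonal_set(M, Es=1.0):
    """M points (M even) built from an M/2-point orthogonal set and its negatives."""
    half = orthogonal_set(M // 2, Es)
    return np.vstack([half, -half])


def simplex_set(M, Es=1.0):
    """Orthogonal set with the mean signal point subtracted from every point."""
    S = orthogonal_set(M, Es)
    return S - S.mean(axis=0)


def mean_energy(S):
    """Expected symbol energy for equally likely signal points."""
    return float(np.mean(np.sum(S ** 2, axis=1)))


def min_distance(S):
    """Minimum Euclidean distance between distinct signal points."""
    D = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=2)
    return float(D[~np.eye(len(S), dtype=bool)].min())


M, Es = 4, 1.0
orth, simp, bior = orthogonal_set(M, Es), simplex_set(M, Es), biorthogonal_set(6, Es)
energy_ratio = mean_energy(simp) / mean_energy(orth)   # expected to be 1 - 1/M
simplex_dim = int(np.linalg.matrix_rank(simp))         # expected to be M - 1
```

Subtracting the mean is a translation, so the simplex set keeps the minimum distance of the orthogonal set while lowering the energy.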
4.6 Summary
We started the chapter by discussing the L2 signal space. We showed that transmitted signals in PAM, QAM, and higher dimensional modulation techniques can be
conveniently viewed as orthonormal expansions in the L2 signal space. Based on this
viewpoint, the process of retrieving the transmitted symbols is equivalent to computing the inner product between the transmitted signal and the appropriate basis
vectors.
Our discussion on modulations was based on the assumption of an ideal channel
with no noise. Under this perfect condition, there is a problem of designing a PAM
pulse so that there is no ISI. We described the Nyquist criterion which can be used to
identify PAM pulses with no ISI, e.g. the sinc pulse and the square root of raised cosine pulse. We also specified the Nyquist bandwidth of $\frac{1}{2T}$, which is the minimum bandwidth required for a PAM pulse with period $T$ to have no ISI.
Our discussion on modulation techniques is by no means complete. In particular, we only focused on linear modulations with no memory where each symbol is
modulated onto a waveform in a linear fashion and is modulated independently from
the other symbols. Nonlinear modulations and modulations with memory are discussed in [?, sec. 4.3]. Their advantages include the improvement of characteristics
of the transmit signal spectrum, and the ability to perform signal detection without
synchronization at the receiver.
4.7 Practice Problems
Problem 4.1 (TRUE or FALSE): Indicate whether each of the following statements is true or false (i.e. not always true). Justify your answer.
(a) The set of signals $\{\mathrm{sinc}(t - 2k), k \in \mathbb{Z}\}$ forms an orthonormal signal set.
(b) The set of signals $\{\mathrm{sinc}(3Wt - k), k \in \mathbb{Z}\}$ is a basis for the vector space of continuous $L_2$ signals band-limited to the frequency band $[-W, W]$.
(c) A continuous $L_2$ passband signal band-limited to the frequency band $[f_c - W, f_c + W]$ can be uniquely determined by its samples taken at the sampling rate $2W$ (in sample/s).
(d) Suppose $p(t)$ is ideal Nyquist with period $T$. Let $W$ be the bandwidth of $p(t)$, i.e. $\hat{p}(f) = 0$ for $f \notin [-W, W]$. Then, $W$ is bounded by $\frac{1}{2T} \leq W \leq \frac{1}{T}$.
(e) Consider baseband transmission using the standard 4-PAM signal set. Suppose
we want to double the transmission bit rate while keeping the same signal
spacing. If we use the same amount of channel bandwidth, we need to increase
the expected symbol energy by a factor of 4.
Problem 4.2 (Properties of PAM signals): Consider using the standard 4-PAM
signal set with signal spacing d for the transmission of independent and equally likely
data bits that enter the baseband modulator at the rate of 4 Mbps. Suppose that we
want to transmit a PAM signal over the baseband channel. Assume that we use the
sinc function as the pulse shape, i.e. the transmitted signal is
\[
s(t) = \sum_{j=0}^{\infty} a_j \,\mathrm{sinc}\!\left(\frac{t}{T} - j\right),
\]
where $T$ is the symbol period and $a_j$ is the signal value for symbol $j \in \{0, 1, \ldots\}$.
(a) Write down all possible values for each aj .
(b) Find the amount of bandwidth (in Hz) required to transmit the above information.
(c) Suppose that the transmission lasts for 1 s, i.e. we only transmit 4 million bits.
Express the expected energy of the PAM signal that is used to carry this amount
of information bits in terms of d. (HINT: Use an orthonormal expansion.)
(d) Repeat parts (a), (b), and (c) for the standard 8-PAM with signal spacing d.
In addition, what is the ratio between the bandwidth required in this case and
that in part (b)?
Problem 4.3 (Necessity of zero mean for PAM signal sets): Consider a PAM signal set with $M$ signal points $a_1, \ldots, a_M$. Let $m$ be the mean signal point defined by $m = \frac{1}{M} \sum_{j=1}^{M} a_j$. Let $E_s$ denote the expected symbol energy for this signal set.
(a) Show that, if $m \neq 0$, then the symbol energy can be reduced further by constructing a modified signal set with signal points $a'_1, \ldots, a'_M$, where $a'_j = a_j - m$, $j \in \{1, \ldots, M\}$. In particular, let $E'_s$ be the symbol energy of the modified signal set. Write $E'_s$ in terms of $E_s$ and $m$.
(b) Compute the expected symbol energy of the following $M$-PAM signal set, where $d > 0$ and $M$ is a positive integer power of 2.
\[
\left\{ -\left(\frac{M}{2} - 1\right) d, \ldots, -d, 0, d, \ldots, \frac{M}{2} d \right\}
\]
Problem 4.4 (Symbol energy of standard $M \times M$-QAM): Show that, for the standard $M \times M$-QAM signal set with the minimum distance $d$ between signal points, the expected symbol energy is given by
\[
E_{s, M \times M\text{-QAM}} = \frac{d^2 (M' - 1)}{6}, \quad \text{where } M' = M^2.
\]
Problem 4.5 (Symbol energy of QAM signal sets): Consider the following 8-point QAM signal sets. Note that each signal set has zero mean and the minimum distance $d_{\min}$ equal to $d$.
(a) For each signal set, compute the expected symbol energy Es in terms of d.
(b) Which of the three signal sets has the lowest symbol energy Es ?
Problem 4.6 (Orthonormal basis for QAM signals): Let the set of signals $\{p(t - jT), j \in \{0, 1, \ldots\}\}$ be an orthonormal set band-limited to the frequency range $[-W, W]$. Let $f_c$ be the carrier frequency with $f_c > W$. Show that the following set of vectors or signals in the $L_2$ signal space is an orthonormal set.
\[
\left\{ \sqrt{2}\, p(t - jT) \cos(2\pi f_c t), \; \sqrt{2}\, p(t - jT) \sin(2\pi f_c t), \; j \in \{0, 1, \ldots\} \right\}
\]
(Figure for Problem 4.5, panel titles: signal set 1, signal set 2, signal set 3.)
Problem 4.7 (Properties of QAM signals): Suppose that we want to transmit

a QAM signal over the passband channel. Consider using a QPSK signal set for the
transmission of data bits that enter the modulator at the rate of 1 kbps.
(a) Suppose that we use the sinc function as the baseband pulse shape. In particular, the complex baseband signal is
\[
s_b(t) = \sum_{j=0}^{\infty} a_j \,\mathrm{sinc}\!\left(\frac{t}{T} - j\right),
\]
where $a_0, a_1, \ldots$ are the complex signal amplitudes and $T$ is the symbol period. What is the minimum value of the channel bandwidth required for this transmission?
(b) Suppose that we want to have the minimum distance of $d_{\min}$ between signal points. What is the expected symbol energy $E_s$ of the QPSK signal set?
(c) Continuing from (a) and (b), suppose that the transmission lasts for 1 s, i.e. only 1,000 bits are transmitted. Express the expected signal energy of the passband QAM signal $s_{\text{QAM}}(t) = \mathrm{Re}\left\{ \sqrt{2}\, s_b(t) e^{i 2\pi f_c t} \right\}$ in terms of $d_{\min}$.
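Problem 4.8 (Pulse position modulation (PPM)): Consider a 4-point orthogonal signal set constructed based on the four orthonormal signals or vectors in the $L_2$ signal space as shown below.
Consider the transmission of a single symbol. Let $E_s$ denote the expected symbol energy. The corresponding transmitted signal is
\[
s(t) = \sum_{k=1}^{4} a_k \phi_k(t), \quad \text{where} \quad
\begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \end{bmatrix} \in \left\{
\begin{bmatrix} \sqrt{E_s} \\ 0 \\ 0 \\ 0 \end{bmatrix},
\begin{bmatrix} 0 \\ \sqrt{E_s} \\ 0 \\ 0 \end{bmatrix},
\begin{bmatrix} 0 \\ 0 \\ \sqrt{E_s} \\ 0 \end{bmatrix},
\begin{bmatrix} 0 \\ 0 \\ 0 \\ \sqrt{E_s} \end{bmatrix}
\right\}.
\]
The corresponding waveforms for $s(t)$ are called the set of 4-point pulse position modulation (4-PPM) signals.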
(a) Specify the value of (in terms of T ) that makes the signals 1 (t), 2 (t), 3 (t),
4 (t) orthonormal.
(b) Draw all possible 4-PPM signal waveforms associated with the transmission of
a single symbol. Specify the signal values in your drawing.
(c) What is the minimum distance dmin between signal points in the given 4-point
signal set?
(d) Consider constructing a simplex signal set from the given 4-point signal set.
Draw all signal waveforms associated with the transmission of a single symbol
based on this simplex signal set. Specify the signal values in your drawing.
Problem 4.9 (Symbol energy and dimension of a simplex signal set): Consider the simplex signal set constructed from an $M$-point orthogonal signal set $\{\mathbf{s}_1, \ldots, \mathbf{s}_M\}$.
(a) Show that the expected symbol energy of the simplex signal set is lower than that of the orthogonal signal set by a factor of $(1 - 1/M)$.
(b) Let $d_{\min}$ be the minimum distance between signal points in the orthogonal signal set. What is the minimum distance between signal points for the corresponding simplex signal set?
(c) Show that the dimension of the subspace of $\mathbb{R}^M$ spanned by the simplex signal set is $M - 1$.
Chapter 5
Signal Detection
In this chapter, we consider the presence of noise in a communication channel. We
shall focus our attention on additive white Gaussian noise (AWGN) channels and
investigate how to perform signal detection for various modulation schemes discussed
in the previous chapter. Since the problem of signal detection involves hypothesis
testing, we shall start our discussion there.
5.1 Hypothesis Testing
In hypothesis testing, there are $M$ possible outcomes in the sample space. Each outcome is called a hypothesis. We shall index these hypotheses from 1 to $M$. Let $H$ be a discrete random variable (RV) whose value is equal to $h$ if hypothesis $h$ actually occurs, where $h \in \{1, \ldots, M\}$. Denote the probability mass function (PMF) values of $H$ by $p_1, \ldots, p_M$. In the context of hypothesis testing, $p_1, \ldots, p_M$ are called a priori probabilities. We assume that the a priori probabilities $p_1, \ldots, p_M$ are known.
Assume there is an observation RV $R$ (or random vector $\mathbf{R}$) whose statistics depend on the hypothesis. In addition, assume that the conditional probability density function (PDF) or the conditional probability mass function (PMF), denoted by $f_{R|H}(r|h)$, is known.
For our discussion on digital communications, M hypotheses correspond to M
possible signal points with probabilities p1 , . . . , pM . An observation R corresponds to a
received signal. The conditional PDF/PMF fR|H (r|h) characterizes a communication
channel. In what follows, we assume that R is a continuous RV. Note, however,
that the discussion is also valid for a discrete observation RV R, as well as for an
observation random vector R.
Given $R$, the goal of hypothesis testing is to decide which hypothesis $h$ actually occurs while minimizing the probability of decision error or equivalently maximizing the probability of correct decision. Let $\hat{H}$ denote the decision value that is a function of $R$. Since $R$ is a RV, so is $\hat{H}$. Note that $\hat{H} \in \{1, \ldots, M\}$. In addition, the probability of correct decision is equal to $\Pr\{\hat{H} = H\}$. Using the conditional probability, we can write
\[
\Pr\{\hat{H} = H\} = \int_{-\infty}^{\infty} f_R(r) \Pr\{\hat{H} = H \mid R = r\}\, dr.
\]
Since $f_R(r)$ is nonnegative, maximizing $\Pr\{\hat{H} = H\}$ is equivalent to maximizing $\Pr\{\hat{H} = H \mid R = r\}$ for each value of $r$.

For a particular $r$, $f_{H|R}(h|r)$ is equal to the probability that hypothesis $h$ actually occurs given $r$. In the context of hypothesis testing, $f_{H|R}(1|r), \ldots, f_{H|R}(M|r)$ are called a posteriori probabilities. If we set $\hat{H} = h$, then
\[
\Pr\{\hat{H} = H \mid R = r\} = \Pr\{H = h \mid R = r\} = f_{H|R}(h|r). \tag{5.1}
\]
From (5.1), to maximize $\Pr\{\hat{H} = H \mid R = r\}$, it is optimal to set
\[
\hat{H} = \arg\max_{h \in \{1, \ldots, M\}} f_{H|R}(h|r). \tag{5.2}
\]
The decision rule in (5.2) is called the maximum a posteriori (MAP) decision rule. Observe that, in case of a tie, i.e. more than one value of $h$ maximizes $f_{H|R}(h|r)$, we can arbitrarily select one of the optimal values of $h$ without changing the probability of correct decision.
Since we are not given the values of $f_{H|R}(h|r)$, it is convenient to rewrite the MAP decision rule in (5.2) in terms of the known quantities. Note that we can write $f_{H|R}(h|r) = \frac{f_{R|H}(r|h)\, p_h}{f_R(r)}$. Since $f_R(r)$ is independent of $h$, we can express the MAP rule in (5.2) as follows.
\[
\text{MAP rule:} \quad \hat{H} = \arg\max_{h \in \{1, \ldots, M\}} f_{R|H}(r|h)\, p_h \tag{5.3}
\]
For equally likely hypotheses, i.e. $p_1 = \ldots = p_M = 1/M$, the MAP rule in (5.3) can be simplified as follows.
\[
\text{ML rule:} \quad \hat{H} = \arg\max_{h \in \{1, \ldots, M\}} f_{R|H}(r|h). \tag{5.4}
\]
The decision rule in (5.4) is called the maximum likelihood (ML) decision rule. Note that the ML decision rule can be applied in cases where we know $f_{R|H}(r|h)$ but do not know $p_1, \ldots, p_M$.
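The MAP and ML rules can be illustrated with a small sketch. The example below assumes, purely for illustration, a scalar observation that is Gaussian with a hypothesis-dependent mean; the function names and numerical values are hypothetical and not taken from the notes.

```python
import math


def gaussian_pdf(r, mean, sigma):
    """Conditional PDF f_{R|H}(r|h) for a Gaussian observation (an assumed model)."""
    return math.exp(-(r - mean) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)


def map_decision(r, means, priors, sigma):
    """Return the 1-based hypothesis index maximizing f_{R|H}(r|h) * p_h."""
    scores = [gaussian_pdf(r, mu, sigma) * p for mu, p in zip(means, priors)]
    return 1 + max(range(len(scores)), key=scores.__getitem__)


def ml_decision(r, means, sigma):
    """ML rule: the MAP rule with equal a priori probabilities."""
    M = len(means)
    return map_decision(r, means, [1.0 / M] * M, sigma)


means = [-1.0, 1.0]   # conditional means under hypotheses 1 and 2
sigma = 1.0
ml_choice = ml_decision(0.2, means, sigma)                # observation is closer to +1
map_choice = map_decision(0.2, means, [0.9, 0.1], sigma)  # strong prior on hypothesis 1
```

With the same observation, the two rules can disagree: the strong prior on hypothesis 1 outweighs the slightly larger likelihood of hypothesis 2.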
Binary Hypothesis Testing

For $M = 2$, the decision process is called binary hypothesis testing. In this case, the MAP decision rule in (5.3) can be written as
\[
f_{R|H}(r|1)\, p_1 \;\overset{\hat{H}=1}{\underset{\hat{H}=2}{\gtrless}}\; f_{R|H}(r|2)\, p_2.
\]
The above expression means that we set $\hat{H} = 1$ if the left hand side (LHS) is at least the right hand side (RHS). On the other hand, we set $\hat{H} = 2$ if the RHS is more than the LHS. We can rewrite the above MAP decision rule as
\[
L(r) = \frac{f_{R|H}(r|1)}{f_{R|H}(r|2)} \;\overset{\hat{H}=1}{\underset{\hat{H}=2}{\gtrless}}\; \frac{p_2}{p_1}. \tag{5.5}
\]
The quantity $L(r)$ is called the likelihood ratio, and the decision rule of the form in (5.5) is called a likelihood ratio test (LRT).
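Example 5.1 Consider binary hypothesis testing in which the two hypotheses are equally likely and the observation RV $R$ is given by (assuming $\mu > 0$)
\[
R = \begin{cases} -\mu + N, & h = 1 \\ \mu + N, & h = 2 \end{cases}
\]
where $N$ is a Gaussian RV with mean 0 and variance $\sigma^2$. Note that we can write
\[
f_{R|H}(r|1) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(r+\mu)^2}{2\sigma^2}}, \quad
f_{R|H}(r|2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(r-\mu)^2}{2\sigma^2}}.
\]
It follows that the LRT can be expressed as
\[
\frac{e^{-\frac{(r+\mu)^2}{2\sigma^2}}}{e^{-\frac{(r-\mu)^2}{2\sigma^2}}} \;\overset{\hat{H}=1}{\underset{\hat{H}=2}{\gtrless}}\; 1
\quad \Longleftrightarrow \quad
r \;\overset{\hat{H}=2}{\underset{\hat{H}=1}{\gtrless}}\; 0.
\]
Figure 5.1 shows the conditional PDFs $f_{R|H}(r|1)$ and $f_{R|H}(r|2)$. From figure 5.1, it is easy to see that 0 is the threshold of the decision rule.
The probability of decision error, denoted by $P_e$, is equal to
\[
\begin{aligned}
P_e = \Pr\{\hat{H} \neq H\} &= \tfrac{1}{2} \Pr\{\hat{H} \neq H \mid H = 1\} + \tfrac{1}{2} \Pr\{\hat{H} \neq H \mid H = 2\} \\
&= \tfrac{1}{2} \Pr\{R > 0 \mid H = 1\} + \tfrac{1}{2} \Pr\{R \leq 0 \mid H = 2\} \\
&= \Pr\{R > 0 \mid H = 1\},
\end{aligned}
\]
where the last equality follows from symmetry. Note that $P_e$ is equal to the area of the shaded region in figure 5.1. Since $R = -\mu + N$ under hypothesis 1, we can write $\Pr\{R > 0 \mid H = 1\} = \Pr\{N > \mu \mid H = 1\} = \Pr\{N > \mu\}$, where the last equality follows from the fact that $N$ is independent of $H$.
Let $Q$ denote the complementary cumulative distribution function of a zero-mean unit-variance Gaussian RV, i.e. $Q(x) = \int_x^{\infty} \frac{1}{\sqrt{2\pi}} e^{-y^2/2}\, dy$. In terms of the $Q$ function, we can express $P_e$ as
\[
P_e = \Pr\{N > \mu\} = \Pr\left\{ \frac{N}{\sigma} > \frac{\mu}{\sigma} \right\} = Q\left(\frac{\mu}{\sigma}\right).
\]
Note that $P_e$ decreases with $\mu$, but increases with $\sigma$.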
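The error probability in Example 5.1 can be checked with a short Monte Carlo sketch. The code below is illustrative only: it writes the conditional means as $\mp\mu$ and the noise standard deviation as $\sigma$, and the function names, trial count, and seed are arbitrary choices. The $Q$ function is computed from the complementary error function via $Q(x) = \frac{1}{2}\,\mathrm{erfc}(x/\sqrt{2})$.

```python
import math
import random


def q_function(x):
    """Q(x) = Pr{Z > x} for a zero-mean unit-variance Gaussian Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def simulate_error_rate(mu, sigma, trials=200_000, seed=1):
    """Estimate Pe for R = -mu + N (h = 1) or R = +mu + N (h = 2), threshold at 0."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        h = rng.choice((1, 2))                  # equally likely hypotheses
        mean = -mu if h == 1 else mu
        r = mean + rng.gauss(0.0, sigma)
        decision = 2 if r > 0 else 1            # decision rule with threshold 0
        errors += decision != h
    return errors / trials


mu, sigma = 1.0, 1.0
pe_theory = q_function(mu / sigma)
pe_sim = simulate_error_rate(mu, sigma)
```

The simulated error rate should approach `pe_theory` as the number of trials grows.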
Figure 5.1: Conditional PDFs for binary hypothesis testing with Gaussian noise. (Plot of $f_{R|H}(r|1)$ and $f_{R|H}(r|2)$ versus $r$.)
Figure 5.2: AWGN channel model.
5.2 AWGN Channel Model
Figure 5.2 shows an additive white Gaussian noise (AWGN) channel model in which
the received signal R(t) is the sum of the transmitted signal S(t) and a zero-mean
white Gaussian random process N (t). AWGN channel models are often used in
practice.
We shall assume that $N(t)$ is wide-sense stationary (WSS). Let $S_N(f)$ denote the power spectral density (PSD) of $N(t)$. By convention, we normally set
\[
S_N(f) = N_0/2. \tag{5.6}
\]
Accordingly, the covariance function $K_N(\tau)$ of $N(t)$ is given by
\[
K_N(\tau) = \frac{N_0}{2} \delta(\tau). \tag{5.7}
\]
The Gaussian noise assumption is justified by the central limit theorem (CLT), which tells us that a superposition of a large number of waveforms associated with filtered impulse noises in electronics converges to a Gaussian random process.
The white noise assumption is for modeling convenience. Although white noise does not exist in practice, as long as the PSD is approximately constant over the transmission band, the noise behaves as if it were white, i.e. with infinite bandwidth. In particular, note that we usually pass $R(t)$ through a receiver filter. Figure 5.3 illustrates that the filtered noises are the same for white and non-white noises whose PSDs are constant in the transmission band.
Figure 5.3: White noise yields the same filtered noise PSD as wideband non-white noise. (Panels: white noise PSD and non-white noise PSD, each followed by the frequency response of the receiver filter and the resulting filtered noise PSD.)
White Gaussian Noise Through LTI Filters

Consider passing a zero-mean white Gaussian noise process $N(t)$ with PSD $N_0/2$ through a linear time invariant (LTI) filter with impulse response $q(t)$. In the context of a communication system, $q(t)$ corresponds to a receiver filter, e.g. matched filter for pulse amplitude modulation (PAM). Let $W(t) = N(t) * q(t)$ denote the filtered noise process at the output. From $K_W(\tau) = q(\tau) * q(-\tau) * K_N(\tau)$, we can write the auto-covariance function and the PSD of $W(t)$ as
\[
K_W(\tau) = q(\tau) * q(-\tau) * \frac{N_0}{2} \delta(\tau) = \frac{N_0}{2}\, q(\tau) * q(-\tau), \tag{5.8}
\]
\[
S_W(f) = \frac{N_0}{2} |\hat{q}(f)|^2. \tag{5.9}
\]
Consider taking a sample of the filtered noise $W(t)$ at $t = t_1$. Since $W(t)$ is a zero-mean Gaussian process, $W(t_1)$ is a zero-mean Gaussian RV whose variance is equal to
\[
E\left[W(t_1)^2\right] = K_W(0) = \frac{N_0}{2}\, q(\tau) * q(-\tau) \Big|_{\tau = 0} = \frac{N_0}{2} \int_{-\infty}^{\infty} |q(\alpha)|^2\, d\alpha. \tag{5.10}
\]
Now consider taking two samples $W(t_1)$ and $W(t_2)$. The covariance between the two samples is given by
\[
E[W(t_1) W(t_2)] = K_W(t_1 - t_2) = \frac{N_0}{2}\, q(\tau) * q(-\tau) \Big|_{\tau = t_1 - t_2} = \frac{N_0}{2} \int_{-\infty}^{\infty} q(\alpha)\, q(\alpha + t_2 - t_1)\, d\alpha. \tag{5.11}
\]
Figure 5.4: Passing $N(t)$ through two LTI filters.
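Finally, consider splitting and passing $N(t)$ through two LTI filters $q_1(t)$ and $q_2(t)$, as shown in figure 5.4. Let $W_1(t) = N(t) * q_1(t)$ and $W_2(t) = N(t) * q_2(t)$. Consider taking two samples $W_1(t_1)$ and $W_2(t_2)$. The covariance between the two samples is given by
\[
\begin{aligned}
E[W_1(t_1) W_2(t_2)] &= E\left[ \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} q_1(\alpha) N(t_1 - \alpha)\, q_2(\beta) N(t_2 - \beta)\, d\alpha\, d\beta \right] \\
&= \frac{N_0}{2} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} q_1(\alpha)\, q_2(\beta)\, \delta(t_1 - t_2 - \alpha + \beta)\, d\alpha\, d\beta \\
&= \frac{N_0}{2} \int_{-\infty}^{\infty} q_1(\alpha)\, q_2(\alpha + t_2 - t_1)\, d\alpha.
\end{aligned} \tag{5.12}
\]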
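A discrete-time analogue of (5.10) and (5.12) can be simulated as a sanity check. The sketch below is an illustration under stated assumptions, not the continuous-time model itself: white noise is replaced by IID Gaussian samples with variance `sigma2` (playing the role of $N_0/2$), the filters are short FIR filters chosen arbitrarily, and NumPy is assumed to be available. In this setting the output variance is `sigma2 * sum(q1**2)` and the covariance between the two filter outputs at the same time index is `sigma2 * dot(q1, q2)`.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 0.5                                   # per-sample noise variance (role of N0/2)
q1 = np.array([0.5, 0.5, 0.5, 0.5])            # unit-energy FIR filter
q2 = np.array([0.5, -0.5, 0.5, -0.5])          # unit-energy FIR filter orthogonal to q1

noise = rng.normal(0.0, np.sqrt(sigma2), size=400_000)
w1 = np.convolve(noise, q1, mode="valid")      # filtered noise at the output of q1
w2 = np.convolve(noise, q2, mode="valid")      # filtered noise at the output of q2

var_w1_measured = float(np.var(w1))
var_w1_predicted = sigma2 * float(np.sum(q1 ** 2))   # discrete analogue of (5.10)
cov_measured = float(np.mean(w1 * w2))
cov_predicted = sigma2 * float(np.dot(q1, q2))       # discrete analogue of (5.12), t1 = t2
```

Because the two filters are orthogonal, the predicted covariance is zero, which mirrors the argument used later for the matched filter outputs.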
5.3 Optimal Receiver for AWGN Channels
Let us first consider a single symbol transmission using an $M$-point $K$-dimensional signal set. More specifically, the transmitted signal is of the form
\[
S(t) = \sum_{k=1}^{K} A_k \phi_k(t), \tag{5.13}
\]
where $\{\phi_1(t), \ldots, \phi_K(t)\}$ is a set of orthonormal signals and $\mathbf{A} = (A_1, \ldots, A_K)$ denotes a signal point in $\mathbb{R}^K$.
Recall that, for PAM, $K = 1$ and one possible choice of $\phi_1(t)$ is $\phi_1(t) = \frac{1}{\sqrt{T}}\, \mathrm{sinc}\!\left(\frac{t}{T}\right)$, where $T$ is the symbol period. For quadrature amplitude modulation (QAM), $K = 2$ and one possible choice for $\phi_1(t)$ and $\phi_2(t)$ is $\phi_1(t) = \sqrt{\frac{2}{T}}\, \mathrm{sinc}\!\left(\frac{t}{T}\right) \cos(2\pi f_c t)$ and $\phi_2(t) = \sqrt{\frac{2}{T}}\, \mathrm{sinc}\!\left(\frac{t}{T}\right) \sin(2\pi f_c t)$, where $f_c$ is the carrier frequency.
Figure 5.5: Optimal receiver structure for a single symbol transmission over an AWGN
channel.
Assume that the signal points are equally likely. Denote the set of signal points by
\[
\{\mathbf{s}_1, \ldots, \mathbf{s}_M\} = \left\{
\begin{bmatrix} s_{1,1} \\ \vdots \\ s_{1,K} \end{bmatrix}, \ldots,
\begin{bmatrix} s_{M,1} \\ \vdots \\ s_{M,K} \end{bmatrix}
\right\}.
\]
Note that $\mathbf{A}$ takes its value in $\{\mathbf{s}_1, \ldots, \mathbf{s}_M\}$. Suppose that we transmit the symbol through an AWGN channel whose noise PSD is equal to $N_0/2$. Let $N(t)$ denote the noise process. The received signal $R(t)$ is given by
\[
R(t) = S(t) + N(t). \tag{5.14}
\]
Consider the $K$-dimensional signal space $\mathcal{S}$ spanned by the orthonormal set $\{\phi_1(t), \ldots, \phi_K(t)\}$. We shall see shortly that we can project the received signal $R(t)$ on $\mathcal{S}$; the noise components outside $\mathcal{S}$ can be ignored without loss of optimality in terms of the probability of decision error. In particular, given $R(t)$, the receiver can use a bank of $K$ matched filters to compute
\[
R_k = \langle R(t), \phi_k(t) \rangle = A_k + N_k, \quad k \in \{1, \ldots, K\}, \tag{5.15}
\]
where we define $N_k = \langle N(t), \phi_k(t) \rangle$. Figure 5.5 shows the receiver structure corresponding to the computation in (5.15).
From (5.10), the variance of $N_k$ is equal to $\frac{N_0}{2} \int_{-\infty}^{\infty} |\phi_k(\alpha)|^2\, d\alpha = N_0/2$. From (5.12), the covariance between $N_j$ and $N_k$, $j \neq k$, is equal to $\frac{N_0}{2} \int_{-\infty}^{\infty} \phi_j(\alpha) \phi_k(\alpha)\, d\alpha = 0$. Since $N_1, \ldots, N_K$ are linear functionals of $N(t)$, they are jointly Gaussian. Since uncorrelated jointly Gaussian RVs are independent, it follows that $N_1, \ldots, N_K$ are independent and identically distributed (IID) with zero mean and variance $N_0/2$. In particular, the joint PDF of $N_1, \ldots, N_K$ is given by
\[
f_{N_1, \ldots, N_K}(n_1, \ldots, n_K) = \frac{1}{(\pi N_0)^{K/2}}\, e^{-\sum_{j=1}^{K} n_j^2 / N_0}. \tag{5.16}
\]
For convenience, let $\mathbf{R} = (R_1, \ldots, R_K)$ and $\mathbf{N} = (N_1, \ldots, N_K)$. Note that, if $\mathbf{A} = \mathbf{s}_m$, $m \in \{1, \ldots, M\}$, then we can write
\[
\mathbf{R} = \mathbf{s}_m + \mathbf{N}. \tag{5.17}
\]
Compared to the expression of the AWGN channel in (5.14), we see that the channel can be described in (5.17) using vectors in the signal space $\mathcal{S}$ instead of waveforms. From (5.17), we see that detection of the transmitted signal point can be viewed as a hypothesis testing problem. There are $M$ hypotheses indexed from 1 to $M$. Under hypothesis $m \in \{1, \ldots, M\}$, the observation random vector $\mathbf{R}$ is given in (5.17).
Irrelevant Noise
Consider again the receiver structure in figure 5.5. Let $N_{\mathcal{S}}(t) = \sum_{k=1}^{K} N_k \phi_k(t)$. Note that the orthonormal expansion in the signal space $\mathcal{S}$ constructed from $\mathbf{R}$ is equal to
\[
\sum_{k=1}^{K} R_k \phi_k(t) = \sum_{k=1}^{K} A_k \phi_k(t) + \sum_{k=1}^{K} N_k \phi_k(t) = S(t) + N_{\mathcal{S}}(t).
\]
The above expression implies that the receiver in figure 5.5 discards the noise component $N(t) - N_{\mathcal{S}}(t)$. In what follows, we argue that this noise component is in fact irrelevant to the receiver's decision, and hence can be ignored. The discussion is based on [?, p. 220]. We start by proving a useful theorem.
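Theorem 5.1 (Theorem of irrelevance): Let the vectors $\mathbf{R}$ and $\mathbf{R}'$ be two observations at the receiver after the signal point $\mathbf{A}$ is sent. An optimal receiver can disregard $\mathbf{R}'$ if and only if $f_{\mathbf{R}'|\mathbf{R},\mathbf{A}}(\mathbf{r}'|\mathbf{r}, \mathbf{a}) = f_{\mathbf{R}'|\mathbf{R}}(\mathbf{r}'|\mathbf{r})$. In addition, a sufficient condition for disregarding $\mathbf{R}'$ is $f_{\mathbf{R}'|\mathbf{R},\mathbf{A}}(\mathbf{r}'|\mathbf{r}, \mathbf{a}) = f_{\mathbf{R}'}(\mathbf{r}')$.
Proof: Since the hypotheses are equally likely, the MAP decision rule is equal to the ML decision rule. In particular, the ML decision rule compares
\[
f_{\mathbf{R}',\mathbf{R}|\mathbf{A}}(\mathbf{r}', \mathbf{r}|\mathbf{s}_m) = f_{\mathbf{R}|\mathbf{A}}(\mathbf{r}|\mathbf{s}_m)\, f_{\mathbf{R}'|\mathbf{R},\mathbf{A}}(\mathbf{r}'|\mathbf{r}, \mathbf{s}_m), \quad m \in \{1, \ldots, M\}.
\]
If $f_{\mathbf{R}'|\mathbf{R},\mathbf{A}}(\mathbf{r}'|\mathbf{r}, \mathbf{a}) = f_{\mathbf{R}'|\mathbf{R}}(\mathbf{r}'|\mathbf{r})$, then the last term can be ignored since it is the same for all $m$. Thus, we can simplify the decision rule to be based only on $\mathbf{R}$. Conversely, if the last term can be ignored, it cannot depend on $m$, and we necessarily have $f_{\mathbf{R}'|\mathbf{R},\mathbf{A}}(\mathbf{r}'|\mathbf{r}, \mathbf{a}) = f_{\mathbf{R}'|\mathbf{R}}(\mathbf{r}'|\mathbf{r})$.
Finally, if $f_{\mathbf{R}'|\mathbf{R},\mathbf{A}}(\mathbf{r}'|\mathbf{r}, \mathbf{a}) = f_{\mathbf{R}'}(\mathbf{r}')$, then the last term can be ignored since it is the same for all $m$. Thus, $f_{\mathbf{R}'|\mathbf{R},\mathbf{A}}(\mathbf{r}'|\mathbf{r}, \mathbf{a}) = f_{\mathbf{R}'}(\mathbf{r}')$ is a sufficient condition for disregarding $\mathbf{R}'$.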

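Let us now extend the orthonormal set $\{\phi_1(t), \ldots, \phi_K(t)\}$ to an infinite orthonormal set $\{\phi_1(t), \phi_2(t), \ldots\}$ that spans the $L_2$ signal space. For a finite observation time, we can view each realization of $N(t)$ as an $L_2$ signal. Define $N_k = \langle N(t), \phi_k(t) \rangle$ for $k \in \mathbb{Z}^+$ and let $\mathbf{N}' = (N_{K+1}, N_{K+2}, \ldots)$. Note that $\mathbf{N}'$ completely specifies the noise component $N(t) - N_{\mathcal{S}}(t)$.
Define $R_k = \langle R(t), \phi_k(t) \rangle$ for $k \in \mathbb{Z}^+$ and let $\mathbf{R}' = (R_{K+1}, R_{K+2}, \ldots)$. Note that $\mathbf{R}$ and $\mathbf{R}'$ completely specify $R(t)$. In addition, note that $\mathbf{R}' = \mathbf{N}'$. From the theorem of irrelevance, in order to discard $\mathbf{N}'$ from the decision rule, it suffices to show that
\[
f_{\mathbf{N}'|\mathbf{R},\mathbf{A}}(\mathbf{n}'|\mathbf{r}, \mathbf{a}) = f_{\mathbf{N}'}(\mathbf{n}').
\]
Since knowing $\mathbf{R}$ and $\mathbf{A}$ is equivalent to knowing $\mathbf{N}$ and $\mathbf{A}$ (note that $\mathbf{R} = \mathbf{N} + \mathbf{A}$),
\[
f_{\mathbf{N}'|\mathbf{R},\mathbf{A}}(\mathbf{n}'|\mathbf{r}, \mathbf{a}) = f_{\mathbf{N}'|\mathbf{N},\mathbf{A}}(\mathbf{n}'|\mathbf{n}, \mathbf{a}).
\]
From the definition of $\mathbf{N}'$, $\mathbf{N}'$ is independent of $\mathbf{A}$ given $\mathbf{N}$, so
\[
f_{\mathbf{N}'|\mathbf{N},\mathbf{A}}(\mathbf{n}'|\mathbf{n}, \mathbf{a}) = f_{\mathbf{N}'|\mathbf{N}}(\mathbf{n}'|\mathbf{n}).
\]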
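Therefore, it remains to show $f_{\mathbf{N}'|\mathbf{N}}(\mathbf{n}'|\mathbf{n}) = f_{\mathbf{N}'}(\mathbf{n}')$, or equivalently $\mathbf{N}'$ and $\mathbf{N}$ are independent. We prove that $\mathbf{N}'$ and $\mathbf{N}$ are independent by showing that, for any finite subset $\tilde{\mathbf{N}}' = (\tilde{N}'_1, \ldots, \tilde{N}'_P)$ of $\mathbf{N}'$ and any subset $\tilde{\mathbf{N}} = (\tilde{N}_1, \ldots, \tilde{N}_Q)$ of $\mathbf{N}$, we can write $f_{\tilde{\mathbf{N}}',\tilde{\mathbf{N}}}(\tilde{\mathbf{n}}', \tilde{\mathbf{n}}) = f_{\tilde{\mathbf{N}}'}(\tilde{\mathbf{n}}')\, f_{\tilde{\mathbf{N}}}(\tilde{\mathbf{n}})$.
Note that $\tilde{N}'_1, \ldots, \tilde{N}'_P, \tilde{N}_1, \ldots, \tilde{N}_Q$ are jointly Gaussian since they are linear functionals of $N(t)$. Using (5.10) and (5.12), we can show that they are pairwise uncorrelated, and are consequently IID Gaussian RVs with zero mean and variance $N_0/2$. It follows that we can write
\[
\begin{aligned}
f_{\tilde{\mathbf{N}}',\tilde{\mathbf{N}}}(\tilde{\mathbf{n}}', \tilde{\mathbf{n}})
&= f_{\tilde{N}'_1, \ldots, \tilde{N}'_P, \tilde{N}_1, \ldots, \tilde{N}_Q}(\tilde{n}'_1, \ldots, \tilde{n}'_P, \tilde{n}_1, \ldots, \tilde{n}_Q) \\
&= \frac{1}{(\pi N_0)^{(P+Q)/2}}\, e^{-\sum_{j=1}^{P} \tilde{n}'^2_j / N_0 \,-\, \sum_{k=1}^{Q} \tilde{n}_k^2 / N_0} \\
&= \left( \frac{1}{(\pi N_0)^{P/2}}\, e^{-\sum_{j=1}^{P} \tilde{n}'^2_j / N_0} \right) \left( \frac{1}{(\pi N_0)^{Q/2}}\, e^{-\sum_{k=1}^{Q} \tilde{n}_k^2 / N_0} \right) \\
&= f_{\tilde{\mathbf{N}}'}(\tilde{\mathbf{n}}')\, f_{\tilde{\mathbf{N}}}(\tilde{\mathbf{n}}),
\end{aligned}
\]
from which we conclude that $\mathbf{N}'$ and $\mathbf{N}$ are independent. In conclusion, $\mathbf{N}'$ can be ignored in optimal detection at the receiver.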
Minimum Distance Decision Rule

Consider again the hypothesis testing problem whose observation $\mathbf{R}$ is given in (5.17). Recall that $N_1, \ldots, N_K$ are IID Gaussian RVs with zero mean and variance $N_0/2$. The MAP decision rule, which is equal to the ML decision rule, is given by
\[
\hat{H} = \arg\max_{m \in \{1, \ldots, M\}} f_{\mathbf{R}|\mathbf{A}}(\mathbf{r}|\mathbf{s}_m). \tag{5.18}
\]
Since $\mathbf{R} = \mathbf{s}_m + \mathbf{N}$ under hypothesis $m$, it follows that
\[
\hat{H} = \arg\max_{m \in \{1, \ldots, M\}} \frac{1}{(\pi N_0)^{K/2}}\, e^{-\frac{1}{N_0} \|\mathbf{r} - \mathbf{s}_m\|^2}
= \arg\min_{m \in \{1, \ldots, M\}} \|\mathbf{r} - \mathbf{s}_m\|. \tag{5.19}
\]
Since the quantity $\|\mathbf{r} - \mathbf{s}_m\|$ is the distance between the received signal $\mathbf{r}$ and the signal point $\mathbf{s}_m$ in the signal space, the decision rule of the form in (5.19) is called the minimum distance decision rule. The minimum distance decision rule is quite simple intuitively. Given an observation point $\mathbf{r}$, the most likely transmitted signal point is the one closest to the observation $\mathbf{r}$.
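The minimum distance decision rule in (5.19) amounts to a nearest-neighbor search over the signal points. The following minimal sketch assumes NumPy; the function name and the example QPSK coordinates are hypothetical and chosen only for illustration.

```python
import numpy as np


def min_distance_decision(r, signal_points):
    """Return the 1-based index m of the signal point s_m closest to the observation r."""
    r = np.asarray(r, dtype=float)
    S = np.asarray(signal_points, dtype=float)     # one signal point per row
    distances = np.linalg.norm(S - r, axis=1)      # ||r - s_m|| for every m
    return 1 + int(np.argmin(distances))


# Hypothetical QPSK-like signal set in two dimensions.
qpsk = [(1, 1), (-1, 1), (-1, -1), (1, -1)]
decision = min_distance_decision((0.3, -0.8), qpsk)
```

In case of a tie, `np.argmin` returns the first minimizing index, which is one valid way to break ties arbitrarily.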
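Figure 5.6: Pairwise error between two signal points. (The figure marks the perpendicular bisector of the line segment joining the two signal points.)
5.4 Performance of Optimal Receivers
Define the pairwise error probability $P_{m'|m}$ as the probability that the received signal $\mathbf{r}$ is closer to $\mathbf{s}_{m'}$ than to $\mathbf{s}_m$ when $\mathbf{s}_m$ is transmitted. From the illustration of pairwise error probability in figure 5.6, $\mathbf{r}$ is closer to $\mathbf{s}_{m'}$ than to $\mathbf{s}_m$ when the noise component $N$ along the direction $\mathbf{s}_{m'} - \mathbf{s}_m$ is greater than $\frac{1}{2} d(\mathbf{s}_m, \mathbf{s}_{m'})$, where $d(\mathbf{s}_m, \mathbf{s}_{m'}) = \|\mathbf{s}_{m'} - \mathbf{s}_m\|$. It follows that $P_{m'|m} = \Pr\left\{ N > \frac{1}{2} d(\mathbf{s}_m, \mathbf{s}_{m'}) \right\}$. From (5.16), note that the joint PDF of $\mathbf{N}$ is spherically symmetric. It follows that $N$ is a zero-mean Gaussian RV with variance $N_0/2$. Using the $Q$ function, we can write
\[
P_{m'|m} = Q\left( \frac{d(\mathbf{s}_m, \mathbf{s}_{m'})}{\sqrt{2N_0}} \right). \tag{5.20}
\]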
Union Bound Estimate of Symbol Error Probability

Let $E_1$ and $E_2$ be two events. The union bound on the probability $\Pr\{E_1 \cup E_2\}$ is given by $\Pr\{E_1 \cup E_2\} \leq \Pr\{E_1\} + \Pr\{E_2\}$. The union bound can be extended to any finite number of events $E_1, \ldots, E_n$, i.e.
\[
\Pr\left\{ \bigcup_{j=1}^{n} E_j \right\} \leq \sum_{j=1}^{n} \Pr\{E_j\}. \tag{5.21}
\]
Let $E_m$ denote the event in which a decision error is made given $H = m$, i.e. $\mathbf{a} = \mathbf{s}_m$. Let $E_{m'|m}$ denote the event in which $\mathbf{r}$ is closer to $\mathbf{s}_{m'}$ than to $\mathbf{s}_m$. By the definition of $E_{m'|m}$, we can write $E_m = \bigcup_{m' \neq m} E_{m'|m}$. Using the union bound,
\[
\Pr\{E_m\} \leq \sum_{m' \neq m} \Pr\{E_{m'|m}\} = \sum_{m' \neq m} P_{m'|m}.
\]
Let $d_{\min}$ be the minimum distance between signal points of the signal set. The union bound estimate of $\Pr\{E_m\}$ is based on the idea that the nearest neighbors to $\mathbf{s}_m$ at distance $d_{\min}$ will dominate the summation in the union bound.
Figure 5.7: Gray encoding for PAM and QAM signal sets. (Panels: 8-PAM with the labels 000, 001, 011, 010, 110, 111, 101, 100 in order along the line; 8-PSK; and a $4 \times 4$-QAM grid with rows labeled 0000 0001 0011 0010, 0100 0101 0111 0110, 1100 1101 1111 1110, and 1000 1001 1011 1010.)
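Let $K_{\min,m}$ be the number of neighbors of $\mathbf{s}_m$ that are at distance $d_{\min}$ away. The union bound estimate of $\Pr\{E_m\}$ is
\[
\Pr\{E_m\} \approx K_{\min,m}\, Q\left( \frac{d_{\min}}{\sqrt{2N_0}} \right).
\]
Let $K_{\min}$ be the average value of $K_{\min,m}$ over all $m$. The overall union bound estimate of the symbol error probability $P_s$ is
\[
P_s = \frac{1}{M} \sum_{m=1}^{M} \Pr\{E_m\} \approx K_{\min}\, Q\left( \frac{d_{\min}}{\sqrt{2N_0}} \right). \tag{5.22}
\]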
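The union bound estimate in (5.22) can be computed directly from a list of signal points. The sketch below is illustrative and assumes NumPy; the function names and the example 4-PAM points with spacing 2 are hypothetical choices.

```python
import math
import numpy as np


def q_function(x):
    """Q(x) = Pr{Z > x} for a zero-mean unit-variance Gaussian Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def union_bound_estimate(signal_points, N0):
    """Return (d_min, K_min, Ps estimate) following Ps ~ K_min * Q(d_min / sqrt(2 N0))."""
    S = np.asarray(signal_points, dtype=float)
    if S.ndim == 1:                       # treat a flat list as a one-dimensional signal set
        S = S[:, None]
    D = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)           # exclude the distance from a point to itself
    d_min = float(D.min())
    nearest = np.isclose(D, d_min)        # marks neighbors at distance d_min
    k_min = float(nearest.sum(axis=1).mean())
    return d_min, k_min, k_min * q_function(d_min / math.sqrt(2 * N0))


pam4 = [-3, -1, 1, 3]                     # 4-PAM with spacing d = 2
d_min, k_min, ps = union_bound_estimate(pam4, N0=1.0)
```

For this 4-PAM set the two inner points have two nearest neighbors and the two outer points have one, so the average is $K_{\min} = 1.5$.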
Instead of the symbol error probability, it is conventional to express the transmission system performance in terms of the bit error probability or the bit error rate (BER). Let $P_b$ denote the bit error probability. In terms of $P_s$, we can approximate $P_b$ as
\[
P_b \approx P_s / \log_2 M \tag{5.23}
\]
with the assumption that a symbol error leads to only one bit error. This assumption is reasonable if we can map $\log_2 M$ information bits to $M$ signal points such that adjacent points differ in only one information bit. For PAM and QAM signal sets, such a mapping is called Gray encoding [?, p. 175]. Figure 5.7 illustrates examples of Gray encoding.
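One common way to generate such labels for PAM is the binary reflected Gray code, in which the label of the $i$th point is `i XOR (i >> 1)`. The sketch below is illustrative; the function names are hypothetical, and the notes do not state which construction was used for figure 5.7.

```python
def gray_labels(num_bits):
    """Binary reflected Gray code labels for 2**num_bits consecutive signal points."""
    return [format(i ^ (i >> 1), f"0{num_bits}b") for i in range(2 ** num_bits)]


def hamming_distance(a, b):
    """Number of bit positions in which two equal-length labels differ."""
    return sum(x != y for x, y in zip(a, b))


labels_8pam = gray_labels(3)
adjacent_distances = [hamming_distance(a, b)
                      for a, b in zip(labels_8pam, labels_8pam[1:])]
```

Every pair of adjacent labels differs in exactly one bit, which is the property that supports the approximation in (5.23).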
Error Performance of PAM Signal Sets

Consider binary PAM with the expected symbol energy $E_s$. Define $E_b$ as the expected energy per bit, i.e. $E_b = E_s / \log_2 M$. For binary PAM, note that $E_b = E_s$, and $P_b = P_s$. Given $E_b$, the two signal points are $\sqrt{E_b}$ and $-\sqrt{E_b}$, and the distance between signal points is $2\sqrt{E_b}$. Using (5.22), we can write
\[
P_{b,2\text{-PAM}} = 1 \cdot Q\left( \frac{2\sqrt{E_b}}{\sqrt{2N_0}} \right) = Q\left( \sqrt{\frac{2E_b}{N_0}} \right). \tag{5.24}
\]
Figure 5.8: Bit error probability for $M$-PAM signal sets. (Plot of $\log_{10} P_b$ versus $E_b/N_0$ in dB for 2-PAM, 4-PAM, 8-PAM, and 16-PAM.)
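For $M$-PAM, $E_s = \frac{d_{\min}^2 (M^2 - 1)}{12}$ or equivalently $d_{\min} = \sqrt{\frac{12 E_s}{M^2 - 1}}$ (see (4.7)). In addition, note that $E_b = E_s / \log_2 M$ and $P_b \approx P_s / \log_2 M$. Note also that
\[
K_{\min} = \frac{1}{M} \left( 2 \cdot 1 + (M - 2) \cdot 2 \right) = \frac{2(M-1)}{M}.
\]
Using (5.22) and (5.23), we can write
\[
P_{b,M\text{-PAM}} \approx \frac{1}{\log_2 M} \cdot \frac{2(M-1)}{M}\, Q\left( \frac{\sqrt{\frac{12 E_s}{M^2 - 1}}}{\sqrt{2N_0}} \right)
= \frac{2(M-1)}{M \log_2 M}\, Q\left( \sqrt{ \left( \frac{6 \log_2 M}{M^2 - 1} \right) \frac{E_b}{N_0} } \right). \tag{5.25}
\]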
Figure 5.8 shows the bit error probability for $M$-PAM signal sets according to (5.25) for different values of $M$.
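The estimate in (5.25) is easy to evaluate numerically. The sketch below is illustrative; the function names and the choice of $E_b/N_0 = 10$ dB are arbitrary, and the $Q$ function is computed from `math.erfc`.

```python
import math


def q_function(x):
    """Q(x) = Pr{Z > x} for a zero-mean unit-variance Gaussian Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def pb_pam(M, ebn0_db):
    """Union bound estimate (5.25) of the bit error probability for M-PAM."""
    ebn0 = 10 ** (ebn0_db / 10)                       # convert dB to a linear ratio
    bits = math.log2(M)
    k_min = 2 * (M - 1) / M                           # average number of nearest neighbors
    arg = math.sqrt(6 * bits / (M ** 2 - 1) * ebn0)   # equals d_min / sqrt(2 N0)
    return k_min / bits * q_function(arg)


pb_table = {M: pb_pam(M, 10.0) for M in (2, 4, 8, 16)}
```

At a fixed $E_b/N_0$, the estimated bit error probability grows with $M$, which matches the ordering of the curves described for figure 5.8.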
Error Performance of QAM Signal Sets

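Consider standard $M \times M$-QAM signal sets with the expected symbol energy $E_s$. Note that $E_s = \frac{d_{\min}^2 (M^2 - 1)}{6}$ or equivalently $d_{\min} = \sqrt{\frac{6 E_s}{M^2 - 1}}$ (see (4.32)), $E_b = \frac{E_s}{2 \log_2 M}$, and $P_b \approx \frac{P_s}{2 \log_2 M}$. In addition, we can compute
\[
K_{\min} = \frac{1}{M^2} \left( (M-2)^2 \cdot 4 + 4(M-2) \cdot 3 + 4 \cdot 2 \right) = \frac{4(M-1)}{M}.
\]
It follows that
\[
P_{b,M \times M\text{-QAM}} \approx \frac{1}{2 \log_2 M} \cdot \frac{4(M-1)}{M}\, Q\left( \frac{\sqrt{\frac{6 E_s}{M^2 - 1}}}{\sqrt{2N_0}} \right)
= \frac{2(M-1)}{M \log_2 M}\, Q\left( \sqrt{ \left( \frac{6 \log_2 M}{M^2 - 1} \right) \frac{E_b}{N_0} } \right). \tag{5.26}
\]
Figure 5.9: Bit error probability for $M \times M$-QAM signal sets. (Plot of $\log_{10} P_b$ versus $E_b/N_0$ in dB for $2 \times 2$-QAM, $4 \times 4$-QAM, $8 \times 8$-QAM, and $16 \times 16$-QAM.)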
Figure 5.9 shows the bit error probability for standard $M \times M$-QAM signal sets according to (5.26) for different values of $M$.
Consider $M$-point phase shift keying (PSK) signal sets with the expected symbol energy $E_s$. Note that, for $M = 2$, 2-PSK is the same as binary PAM. For $M = 4$, 4-PSK is the same as $2 \times 2$-QAM. We now consider 8-PSK. Note that $d_{\min} = \sqrt{(2 - \sqrt{2}) E_s}$, $E_b = E_s/3$, and $P_b \approx P_s/3$. In addition, it is easy to see that $K_{\min} = 2$. It follows that
\[
P_{b,8\text{-PSK}} \approx \frac{1}{3} \cdot 2 \cdot Q\left( \frac{\sqrt{(2 - \sqrt{2}) E_s}}{\sqrt{2N_0}} \right)
= \frac{2}{3}\, Q\left( \sqrt{ (2 - \sqrt{2})\, \frac{3}{2}\, \frac{E_b}{N_0} } \right). \tag{5.27}
\]
Figure 5.10 shows the bit error probability for $M$-PSK signal sets for different $M$.
Figure 5.10: Bit error probability for $M$-PSK signal sets. (Plot of $\log_{10} P_b$ versus $E_b/N_0$ in dB; one curve for 2-PSK and 4-PSK, and one curve for 8-PSK.)
Error Performance of Orthogonal Signal Sets

Consider $M$-point orthogonal signal sets with the expected symbol energy $E_s$. Note that $d_{\min} = \sqrt{2E_s}$ and $E_b = E_s / \log_2 M$. Unlike the signal sets discussed so far, we cannot well approximate $P_b \approx P_s / \log_2 M$ since all distances between signal points are the same and thus Gray encoding is not applicable. However, we can approximate $P_b \approx \frac{M/2}{M-1} P_s$ since, for each bit position, there are $M/2$ out of $M - 1$ error symbols with the incorrect bit value in that position. In addition, note that $K_{\min} = M - 1$. Using (5.22), we can write
\[
P_{b,M\text{-orthogonal}} \approx \frac{M/2}{M-1}\, (M-1)\, Q\left( \frac{\sqrt{2E_s}}{\sqrt{2N_0}} \right)
= \frac{M}{2}\, Q\left( \sqrt{ \log_2 M\, \frac{E_b}{N_0} } \right). \tag{5.28}
\]
Figure 5.11 shows the bit error probability for M -point orthogonal signal sets
according to (5.28) for dierent values of M .
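The counting argument behind the factor $\frac{M/2}{M-1}$ and the estimate in (5.28) can be sketched as follows. This is an illustration only: it assumes that $M$ is a power of 2 and that each symbol is labeled by the binary representation of its index, and the function names are hypothetical.

```python
import math

def q_func(x):
    """Gaussian tail probability Q(x) = Pr{N(0,1) > x}."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def wrong_bit_count(m, sent, position):
    """Number of error symbols whose label differs from `sent` in bit `position`.

    Assumes symbols 0, ..., m-1 are labeled by their binary representations.
    """
    return sum(1 for other in range(m)
               if other != sent and ((other ^ sent) >> position) & 1)

def pb_orthogonal(m, ebn0_db):
    """Union bound estimate (5.28) for m-point orthogonal signaling (Eb/N0 in dB)."""
    ebn0 = 10.0 ** (ebn0_db / 10.0)
    return (m / 2.0) * q_func(math.sqrt(math.log2(m) * ebn0))

m = 8
# For every sent symbol and bit position, M/2 of the M-1 error symbols are wrong.
print({wrong_bit_count(m, s, p) for s in range(m) for p in range(3)})
print([pb_orthogonal(k, 10.0) for k in (2, 4, 8, 16)])
```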
5.5 Detection of Multiple Transmitted Symbols
Consider $J$ successive symbol transmissions using an $M$-point $K$-dimensional signal
set. More specifically, the transmitted signal is of the form
$$ S(t) = \sum_{j=0}^{J-1} \sum_{k=1}^{K} A_{j,k} \, \phi_k(t - jT), \tag{5.29} $$
where $T$ is the symbol period, $\{\phi_1(t), \ldots, \phi_K(t)\}$ is a set of orthonormal signals, and
$\mathbf{A}_j = (A_{j,1}, \ldots, A_{j,K})$ denotes the $j$th transmitted signal point. Denote the set of $M$
signal points by $\{\mathbf{s}_1, \ldots, \mathbf{s}_M\}$. Note that $\mathbf{A}_j$ takes its value in $\{\mathbf{s}_1, \ldots, \mathbf{s}_M\}$ for each $j$.
For no intersymbol interference (ISI), assume that $\{\phi_k(t - jT), k \in \{1, \ldots, K\}, j \in \mathbb{Z}\}$
is an orthonormal set.

Figure 5.11: Bit error probability for M-point orthogonal signal sets. (Plot of $\log_{10} P_b$ versus $E_b/N_0$ in dB, with curves for 2-, 4-, 8-, and 16-point orthogonal signal sets.)

Figure 5.12: Optimal receiver structure for J symbol transmissions over an AWGN channel.

In the context of hypothesis testing, there are $M^J$ hypotheses. We can describe
a hypothesis using the vector $\mathbf{m} = (m_0, \ldots, m_{J-1})$, where $m_j \in \{1, \ldots, M\}$ for
$j \in \{0, \ldots, J-1\}$. Note that, under hypothesis $\mathbf{m}$, the $J$ transmitted signal points are
$\mathbf{s}_{m_0}, \ldots, \mathbf{s}_{m_{J-1}}$.
Consider transmitting the signal in (5.29) through the AWGN channel whose
noise PSD is $N_0/2$. Let $N(t)$ denote the noise and $R(t)$ denote the received signal,
i.e. $R(t) = S(t) + N(t)$. Viewing the signal in (5.29) as an orthonormal expansion, it
follows that the optimal receiver has the structure shown in figure 5.12.
Note that the receiver in figure 5.12 only preserves the signal and noise components
in the signal space spanned by $\{\phi_k(t - jT), k \in \{1, \ldots, K\}, j \in \{0, \ldots, J-1\}\}$.
The justification that we can throw away noise components outside the signal space
without loss of optimality in detection performance is the same as in the case of a
single symbol transmission and is omitted here.
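From figure 5.12, the optimal receiver computes $R_{j,k} = \langle R(t), \phi_k(t - jT) \rangle$ for
$k \in \{1, \ldots, K\}$ and $j \in \{0, \ldots, J-1\}$. Define $N_{j,k} = \langle N(t), \phi_k(t - jT) \rangle$. Note
that these $N_{j,k}$'s are IID Gaussian RVs with zero mean and variance $N_0/2$. For
convenience, define the following vectors.
$\mathbf{A} = (\mathbf{A}_0, \ldots, \mathbf{A}_{J-1})$
$\mathbf{s}_{\mathbf{m}} = (\mathbf{s}_{m_0}, \ldots, \mathbf{s}_{m_{J-1}})$
$\mathbf{N} = (\mathbf{N}_0, \ldots, \mathbf{N}_{J-1})$, where $\mathbf{N}_j = (N_{j,1}, \ldots, N_{j,K})$
$\mathbf{R} = (\mathbf{R}_0, \ldots, \mathbf{R}_{J-1})$, where $\mathbf{R}_j = (R_{j,1}, \ldots, R_{j,K})$
It follows that, under hypothesis $\mathbf{m}$, we can write $\mathbf{R} = \mathbf{s}_{\mathbf{m}} + \mathbf{N}$. For optimal
detection, we use the ML decision rule (equivalent to the MAP decision rule for
equally likely hypotheses) given below.
$$ \hat{H} = \arg\max_{\mathbf{m} \in \{1, \ldots, M\}^J} f_{\mathbf{R}|\mathbf{A}}(\mathbf{r} | \mathbf{s}_{\mathbf{m}}) \tag{5.30} $$
Under hypothesis $\mathbf{m}$, $\mathbf{R}$ contains independent Gaussian RVs with means given by
$\mathbf{s}_{\mathbf{m}}$ and variances all equal to $N_0/2$. Thus, we can write the decision rule in (5.30) as
$$ \hat{H} = \arg\max_{\mathbf{m} \in \{1, \ldots, M\}^J} \frac{1}{(\pi N_0)^{JK/2}} \, e^{-\frac{1}{N_0} \|\mathbf{r} - \mathbf{s}_{\mathbf{m}}\|^2} = \arg\min_{\mathbf{m} \in \{1, \ldots, M\}^J} \|\mathbf{r} - \mathbf{s}_{\mathbf{m}}\|^2, \tag{5.31} $$
which is the minimum distance decision rule for multiple symbol transmissions. We
can rewrite the decision rule in (5.31) as
$$ \hat{H} = \arg\min_{\mathbf{m} \in \{1, \ldots, M\}^J} \sum_{j=0}^{J-1} \|\mathbf{r}_j - \mathbf{s}_{m_j}\|^2. $$
Since the $j$th term of the sum depends only on $m_j$, minimizing $\|\mathbf{r} - \mathbf{s}_{\mathbf{m}}\|^2$ is equivalent
to minimizing $\|\mathbf{r}_j - \mathbf{s}_{m_j}\|^2$ for each $j$. Therefore,
the optimal value $\hat{\mathbf{m}} = (\hat{m}_0, \ldots, \hat{m}_{J-1})$ from the MAP rule can be found by separate
decisions on $m_j$ as follows.
$$ \hat{m}_j = \arg\min_{m_j \in \{1, \ldots, M\}} \|\mathbf{r}_j - \mathbf{s}_{m_j}\|^2 = \arg\min_{m_j \in \{1, \ldots, M\}} \|\mathbf{r}_j - \mathbf{s}_{m_j}\| $$
Note that the above decision rule for the $j$th symbol is exactly the minimum
distance decision rule that we saw before in (5.19). In conclusion, when the set
$\{\phi_k(t - jT), k \in \{1, \ldots, K\}, j \in \mathbb{Z}\}$ is an orthonormal set, we can detect transmitted symbols
separately from the observations $\mathbf{R}_0, \mathbf{R}_1, \ldots$ respectively without loss of optimality.
In other words, symbol-by-symbol detection is optimal when there is no ISI.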
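The equivalence between the joint minimum distance rule in (5.31) and separate per-symbol decisions can be illustrated with a small brute-force experiment. The sketch below is an assumption-laden illustration rather than the receiver of figure 5.12: the signal set `POINTS`, the noise level, and all function names are chosen only for the example.

```python
import itertools
import random

# Assumed 4-point, 2-dimensional signal set (a 2 x 2 grid), for illustration only.
POINTS = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def joint_detect(r):
    """Search all M^J hypotheses for the smallest total squared distance, as in (5.31)."""
    hypotheses = itertools.product(range(len(POINTS)), repeat=len(r))
    return list(min(hypotheses,
                    key=lambda m: sum(sq_dist(rj, POINTS[mj]) for rj, mj in zip(r, m))))

def symbolwise_detect(r):
    """Minimum distance decision made separately for each received vector r_j."""
    return [min(range(len(POINTS)), key=lambda i: sq_dist(rj, POINTS[i])) for rj in r]

def detectors_agree(seed, num_symbols=4, sigma=0.7):
    """Draw J random symbols, add Gaussian noise, and compare the two detectors."""
    rng = random.Random(seed)
    sent = [rng.randrange(len(POINTS)) for _ in range(num_symbols)]
    r = [tuple(c + rng.gauss(0.0, sigma) for c in POINTS[m]) for m in sent]
    return joint_detect(r) == symbolwise_detect(r)

print(all(detectors_agree(seed) for seed in range(20)))
```

The joint search visits $M^J$ hypotheses while the symbol-by-symbol detector performs only $JM$ distance computations, which is the practical value of the result.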
Equivalent Discrete-Time AWGN Channel

From the fact that we can write $\mathbf{R} = \mathbf{A} + \mathbf{N}$, it follows that the AWGN channel
model with noise PSD $N_0/2$ can be replaced by an equivalent discrete-time AWGN
channel model described by
$$ \mathbf{R}_j = \mathbf{A}_j + \mathbf{N}_j, \quad j \in \mathbb{Z}^+, \tag{5.32} $$
where $\mathbf{R}_j$, $\mathbf{A}_j$, $\mathbf{N}_j$ are the received signal, the transmitted signal, and the Gaussian
noise for the $j$th transmitted symbol respectively. Recall that each of these vectors
has $K$ components for $K$-dimensional modulation. In addition, the $K$ components of
$\mathbf{N}_j$ are IID Gaussian RVs with zero mean and variance $N_0/2$, and are independent of
the components of $\mathbf{N}_{j'}$, $j' \neq j$.
For analysis of communication systems, it is usually more convenient to work with
the discrete-time AWGN channel model in (5.32) than to work with the continuous-time
channel model. For the rest of the course, we shall use the discrete-time channel
model whenever it is possible to do so.
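As an illustration of how the discrete-time model is used in practice, the following sketch simulates the channel for binary PAM ($K = 1$) and compares the measured bit error rate with $Q(\sqrt{2E_b/N_0})$. The function names, the seed, and the number of simulated symbols are assumptions made for the example.

```python
import math
import random

def q_func(x):
    """Gaussian tail probability Q(x) = Pr{N(0,1) > x}."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def simulate_binary_pam(eb, n0, num_symbols, seed=0):
    """Estimate the bit error rate of binary PAM over R_j = A_j + N_j."""
    rng = random.Random(seed)
    sigma = math.sqrt(n0 / 2.0)   # noise standard deviation per dimension
    amp = math.sqrt(eb)           # amplitudes are +sqrt(Eb) and -sqrt(Eb)
    errors = 0
    for _ in range(num_symbols):
        a = amp if rng.random() < 0.5 else -amp
        r = a + rng.gauss(0.0, sigma)          # discrete-time AWGN channel
        decided = amp if r > 0 else -amp       # minimum distance decision
        errors += decided != a
    return errors / num_symbols

eb, n0 = 1.0, 1.0
print(simulate_binary_pam(eb, n0, 100_000), q_func(math.sqrt(2.0 * eb / n0)))
```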
5.6 Comparison of Modulation Schemes
We compare different modulation schemes based on the approach in [?, sec. 5.2.10].
In particular, for each modulation scheme, we consider two performance parameters.
The first parameter, called the bandwidth efficiency, is the transmission bit rate (in bps
or bit/s) obtained per unit of bandwidth (in Hz). Let $R$ and $W$ be the transmission
bit rate and the bandwidth; then the bandwidth efficiency is the ratio $R/W$ (in
bit/s/Hz).
The second performance parameter is the value of $E_b/N_0$ associated with a certain
requirement on the bit error probability $P_b$. We shall assume $10^{-5}$ as this requirement
in the following discussion, and denote the corresponding $E_b/N_0$ by $(E_b/N_0)_{10^{-5}}$.
M-PAM: We assume that M-PAM utilizes the orthonormal set of baseband
signals
$$ \left\{ \frac{1}{\sqrt{T}} \operatorname{sinc}\!\left( \frac{t}{T} - j \right), \; j \in \mathbb{Z} \right\}, $$
where $T$ is the symbol period. For M-PAM, note that $W = \frac{1}{2T}$, $R = \frac{\log_2 M}{T}$,
and $\frac{R}{W} = 2 \log_2 M$. The value of $(E_b/N_0)_{10^{-5}}$ can be obtained from the union
bound estimate of $P_b$ shown in figure 5.8.

M × M-QAM: We assume that M × M-QAM utilizes the orthonormal set of
passband signals
$$ \left\{ \sqrt{\frac{2}{T}} \operatorname{sinc}\!\left( \frac{t}{T} - j \right) \cos(2\pi f_c t), \; \sqrt{\frac{2}{T}} \operatorname{sinc}\!\left( \frac{t}{T} - j \right) \sin(2\pi f_c t), \; j \in \mathbb{Z} \right\}, $$
where $T$ is the symbol period and $f_c$ is the carrier frequency. For M × M-QAM,
note that $W = \frac{1}{T}$, $R = \frac{2 \log_2 M}{T}$, and $\frac{R}{W} = 2 \log_2 M$. The value of $(E_b/N_0)_{10^{-5}}$
can be obtained from the union bound estimate of $P_b$ shown in figure 5.9.

M-PSK: We assume that M-PSK utilizes the orthonormal set of passband signals
$$ \left\{ \sqrt{\frac{2}{T}} \operatorname{sinc}\!\left( \frac{t}{T} - j \right) \cos(2\pi f_c t), \; \sqrt{\frac{2}{T}} \operatorname{sinc}\!\left( \frac{t}{T} - j \right) \sin(2\pi f_c t), \; j \in \mathbb{Z} \right\}, $$
where $T$ is the symbol period and $f_c$ is the carrier frequency. For M-PSK, note
that $W = \frac{1}{T}$, $R = \frac{\log_2 M}{T}$, and $\frac{R}{W} = \log_2 M$. The value of $(E_b/N_0)_{10^{-5}}$ can be
obtained from the union bound estimate of $P_b$ shown in figure 5.10.

M-point orthogonal modulation: We assume that M-point orthogonal modulation
utilizes the orthonormal set of signals
$$ \left\{ \frac{1}{\sqrt{T/M}} \operatorname{sinc}\!\left( \frac{t}{T/M} - k \right), \; k \in \{0, \ldots, M-1\} \right\} $$
for the 0th symbol, where $T$ is the symbol period. (Note that this is the same
as M-point pulse position modulation (M-PPM) shown in figure 4.16.) In
addition, for the $j$th symbol, we use
$$ \left\{ \frac{1}{\sqrt{T/M}} \operatorname{sinc}\!\left( \frac{t}{T/M} - jM - k \right), \; k \in \{0, \ldots, M-1\} \right\}, $$
where $j \in \mathbb{Z}^+$. For M-point orthogonal modulation, note that $W = \frac{M}{2T}$,
$R = \frac{\log_2 M}{T}$, and $\frac{R}{W} = \frac{2 \log_2 M}{M}$. The value of $(E_b/N_0)_{10^{-5}}$ can be obtained from the
union bound estimate of $P_b$ shown in figure 5.11.
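The two performance parameters can be tabulated numerically. The sketch below is an illustration under stated assumptions: it uses the union bound estimates (5.26) to (5.28) as the definition of $P_b$, solves $P_b = 10^{-5}$ for $E_b/N_0$ by bisection (a choice made for this example), and uses hypothetical function names.

```python
import math

def q_func(x):
    """Gaussian tail probability Q(x) = Pr{N(0,1) > x}."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def required_ebn0_db(pb, target=1e-5, lo_db=-10.0, hi_db=40.0):
    """Bisection on Eb/N0 in dB; assumes pb(ebn0) decreases as ebn0 grows."""
    for _ in range(60):
        mid_db = 0.5 * (lo_db + hi_db)
        if pb(10.0 ** (mid_db / 10.0)) > target:
            lo_db = mid_db
        else:
            hi_db = mid_db
    return 0.5 * (lo_db + hi_db)

def pb_qam(m):
    """Estimate (5.26) for m-by-m QAM as a function of linear Eb/N0."""
    bits = math.log2(m)
    return lambda g: 2.0 * (m - 1) / (m * bits) * q_func(
        math.sqrt(6.0 * bits / (m * m - 1) * g))

def pb_8psk(g):
    """Estimate (5.27) for 8-PSK as a function of linear Eb/N0."""
    return (2.0 / 3.0) * q_func(math.sqrt(1.5 * (2.0 - math.sqrt(2.0)) * g))

def pb_orthogonal(m):
    """Estimate (5.28) for m-point orthogonal signaling."""
    return lambda g: (m / 2.0) * q_func(math.sqrt(math.log2(m) * g))

# (name, bandwidth efficiency R/W in bit/s/Hz, Pb estimate)
schemes = [(f"{m}x{m}-QAM", 2.0 * math.log2(m), pb_qam(m)) for m in (2, 4, 8)]
schemes.append(("8-PSK", 3.0, pb_8psk))
schemes += [(f"{m}-orthogonal", 2.0 * math.log2(m) / m, pb_orthogonal(m))
            for m in (2, 4, 8, 16)]

for name, r_over_w, pb in schemes:
    print(f"{name:14s} R/W = {r_over_w:5.2f} bit/s/Hz, "
          f"Eb/N0 = {required_ebn0_db(pb):5.2f} dB")
```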
Figure 5.13 shows the curve of $R/W$ versus $(E_b/N_0)_{10^{-5}}$ for the different modulation
schemes that we discussed above. Shown also is the upper bound on $R/W$ as a
function of $E_b/N_0$. This upper limit is called the channel capacity, a quantity that
we shall define and study in more detail in a later chapter. For now, it suffices to say
that, for a given $E_b/N_0$, if the bandwidth efficiency $R/W$ does not exceed the channel
capacity, then we can make the bit error probability as small as we want. Conversely,
if the bandwidth efficiency $R/W$ exceeds the channel capacity, then we cannot make
the bit error probability approach zero.
From figure 5.13, we see that there is a trade-off between the bandwidth efficiency
$R/W$ and the power efficiency $E_b/N_0$. We can categorize communication system
scenarios into two regions: the bandwidth-limited region with $R/W > 1$ and the power-limited
region with $R/W < 1$. Examples of systems in the bandwidth-limited region are
Asymmetric Digital Subscriber Line (ADSL) systems, cellular phone systems, and
wireless local area networks (WLANs). Examples of systems in the power-limited
region are optical communication systems and communications in deep space.
Based on figure 5.13, for bandwidth-limited communications, we should consider
the following modulation schemes: M-PAM, M-PSK, and M × M-QAM with large
$M$. On the other hand, for power-limited communications, orthogonal signal
sets should be considered. Since bi-orthogonal and simplex signal sets are created
from orthogonal signal sets, they are also good candidates for power-limited communications.

Figure 5.13: The curve of $R/W$ versus $(E_b/N_0)_{10^{-5}}$ for different modulation schemes
(similar to [?, Fig. 5.2.17]). Note that, to obtain the above figure, the bit error
probabilities are computed based on the union bound estimate of $P_b$. (Plot of $R/W$ in dB
versus $E_b/N_0$ in dB for $P_b = 10^{-5}$, showing the channel capacity curve together with points
for 2-, 4-, and 8-PAM; 2 × 2-, 4 × 4-, and 8 × 8-QAM; 2-, 4-, and 8-PSK; and 2-, 4-, 8-, and
16-point orthogonal signal sets.)

5.7 Summary
In this chapter, we considered the presence of noise in a communication channel. We
modeled noise in communication systems as a white Gaussian process and assumed
that the received signal is the transmitted signal plus noise, yielding the additive
white Gaussian noise (AWGN) channel model.
We showed that the optimal receiver structure preserves only the noise components
in the signal space and ignores the noise components outside the signal space. Using
this receiver structure, the observations for the detection of a single transmitted
symbol are the outputs of matched filters whose impulse responses are taken from an
orthonormal basis of the signal waveform.
Using the framework of hypothesis testing, we discussed optimal detection of
transmitted symbols that minimizes the symbol error probability. In particular, we
showed that optimal detection is based on the maximum a posteriori probability
(MAP) decision rule, which in turn is equivalent to the maximum likelihood (ML)
decision rule for equally likely symbols.
One convenient observation that we made is that, when signal waveforms for different
symbols are orthonormal, optimal detection can be done through symbol-by-symbol
detection for AWGN channels. So the discussion on a single symbol transmission
can be extended to multiple symbol transmissions in a straightforward fashion.
We then analyzed the detection performance of the various modulation schemes that
we discussed in the previous chapter. For other modulation schemes not discussed
in the previous chapter, see [?, chp. 5] for more details. We pointed out the two
operating regions of communication systems: bandwidth-limited and power-limited.
We saw that PAM and QAM with large numbers of signal points are bandwidth
efficient, while orthogonal signal sets are power efficient. Thus, the choice of the
modulation scheme should depend on the constraints of the system.
5.8 Practice Problems
Problem 5.1 (Independent observations for a symbol detection): Consider
binary hypothesis testing in which the observation is a random vector $\mathbf{R} = (R_1, \ldots, R_5)$
such that
$$ \begin{bmatrix} R_1 \\ R_2 \\ R_3 \\ R_4 \\ R_5 \end{bmatrix} = A \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} + \begin{bmatrix} N_1 \\ N_2 \\ N_3 \\ N_4 \\ N_5 \end{bmatrix}, $$
where $N_1, \ldots, N_5$ are IID zero-mean Gaussian RVs with variance $\sigma^2$, and $A$ is the
signal amplitude equal to $\alpha$ under hypothesis 1 and equal to $-\alpha$ under hypothesis 2.
Assume that $\alpha > 0$.
(a) Assume that the two hypotheses are equally likely. Find the optimal decision
rule that minimizes the probability of decision error, i.e., the MAP decision
rule, and its associated probability of decision error.
(b) Consider now a hard decision procedure in which we make 5 separate decisions
based on $R_1, \ldots, R_5$, and use the majority rule for the final decision. For
example, if the 5 decisions are $\hat{H}_1 = 1$, $\hat{H}_2 = 2$, $\hat{H}_3 = 1$, $\hat{H}_4 = 1$, and $\hat{H}_5 = 2$,
then the final decision is 1. Express the probability of decision error for this
hard decision procedure.
Problem 5.2 (Optimal combining of independent observations): Consider
binary hypothesis testing in which the observation is a random vector $\mathbf{R} = (R_1, R_2)$
such that
$$ \begin{bmatrix} R_1 \\ R_2 \end{bmatrix} = A \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} + \begin{bmatrix} N_1 \\ N_2 \end{bmatrix}, $$
where $N_1, N_2$ are IID zero-mean Gaussian RVs with variance $\sigma^2$, $A$ is the signal
amplitude equal to $\alpha$ under hypothesis 1 and equal to $-\alpha$ under hypothesis 2, and
$\beta_1, \beta_2$ are attenuation parameters. Assume that $\alpha > 0$ and $\beta_1, \beta_2 \in (0, 1)$.
(a) Assume that the two hypotheses are equally likely. Find the optimal decision
rule that minimizes the probability of decision error, i.e. the MAP decision rule,
and its associated probability of decision error.
(b) Assume now that $\beta_1 = \beta_2$. What is the optimal decision rule in this case?
(c) Suppose we use the decision rule in part (b) when $\beta_1 \neq \beta_2$, i.e. a suboptimal
decision rule. Compare the probability of decision error to that of part (a).
NOTE: The optimal combining of independent observations taking into account different
attenuation parameters in part (a) is called maximum ratio combining (MRC).
If we use the suboptimal decision rule in part (b) based on the assumption that
$\beta_1 = \beta_2$, the corresponding combining is called equal gain combining (EGC).
Problem 5.3 (based on problem 4.6 in [WJ65]): Consider binary hypothesis
testing in which the observation is a random vector $\mathbf{R} = (R_1, R_2)$ such that
$$ R_1 = A + N_1, \qquad R_2 = N_1 + N_2, $$
where $N_1, N_2$ are IID zero-mean Gaussian RVs with variance $\sigma^2$, and $A$ is the signal
amplitude equal to $\alpha$ under hypothesis 1 and equal to $-\alpha$ under hypothesis 2. Assume
that $\alpha > 0$.
(a) Find the optimal decision rule for this binary hypothesis testing problem.
NOTE: For jointly Gaussian RVs $X_1$ and $X_2$, the joint PDF is given by
$$ \frac{1}{2\pi \sigma_1 \sigma_2 \sqrt{1 - \rho^2}} \exp\left[ -\frac{1}{2(1 - \rho^2)} \left( \frac{(x_1 - \mu_1)^2}{\sigma_1^2} - \frac{2\rho (x_1 - \mu_1)(x_2 - \mu_2)}{\sigma_1 \sigma_2} + \frac{(x_2 - \mu_2)^2}{\sigma_2^2} \right) \right], $$
where $\mu_1, \mu_2$ and $\sigma_1^2, \sigma_2^2$ are the means and the variances of $X_1, X_2$, and $\rho$ is the
covariance coefficient equal to $\frac{E[(X_1 - \mu_1)(X_2 - \mu_2)]}{\sigma_1 \sigma_2}$.
(b) Find the corresponding probability of an incorrect decision.
(c) Suppose that the observation contains only $R_1$. Show that the probability of an
incorrect decision under the optimal decision rule in this case is strictly higher
than that in part (b).
Problem 5.4 (On-off keying (OOK) modulation): Consider binary PAM in
which the transmitted signal is
$$ s(t) = \begin{cases} 0, & \text{for bit 0} \\ \sqrt{2E_b} \, p(t), & \text{for bit 1} \end{cases} $$
where $p(t)$ is a unit-norm band-limited baseband signal. This type of modulation
is called on-off keying (OOK) and is common for optical transmission. In addition,
assume that bit 0 and bit 1 are equally likely.
(a) Show that the expected signal energy per bit is $E_b$.
(b) Consider a single bit transmission through an AWGN channel with noise PSD
equal to $N_0/2$. Describe the optimal detection at the receiver. (In other words,
describe the optimal receiver structure and the optimal decision rule.)
(c) Define the signal-to-noise ratio (SNR) to be the expected signal energy per bit
($E_b$) divided by the noise variance ($N_0/2$). Note that, for OOK, SNR $= 2E_b/N_0$.
Using the Q function, express the bit error probability $P_b$ for the decision rule
in part (b) in terms of the SNR.
(d) Compare the bit error probability for OOK in part (c) to the bit error probability
for BPSK with
$$ s(t) = \begin{cases} -\sqrt{E_b} \, p(t), & \text{for bit 0} \\ \sqrt{E_b} \, p(t), & \text{for bit 1} \end{cases} $$
In particular, express the bit error probability $P_b$ for BPSK in terms of the SNR,
which is also equal to $2E_b/N_0$. For the same value of SNR, which modulation
(OOK or BPSK) has a lower bit error probability?
Problem 5.5 (Frequency shift keying (FSK) modulation): Suppose that we
transmit information bits at the rate of 2 kbps using the frequency shift keying (FSK)
signal set
$$ s_m(t) = \begin{cases} A \cos\left( 2\pi (f_c + m \Delta f) t \right), & t \in [0, T] \\ 0, & \text{otherwise} \end{cases} $$
where $m \in \{1, \ldots, M\}$, $T$ is the symbol period, and $f_c$ is an integer multiple of $1/T$.
Throughout the problem, assume that $T = 1$ ms.
(a) Assume that $\Delta f = \frac{1}{2T}$. Specify the value of $M$, and compute the expected
energy per bit $E_b$ in terms of $A$ and $T$.
(b) Show that, for $\Delta f = \frac{1}{2T}$, the signals $s_1(t), \ldots, s_M(t)$ are orthogonal to one
another. HINT: You may utilize the fact that $2 \cos x \cos y = \cos(x + y) + \cos(x - y)$.
(c) For $\Delta f = \frac{1}{2T}$ and $f_c = 0$, sketch all $M$ possible signal waveforms. HINT: Recall
that $M$ is already specified in part (a).
(d) Assume that we transmit over an AWGN channel with noise PSD equal to $N_0/2$.
Draw the optimal receiver structure and specify the optimal decision rule for a
single symbol transmission.
(e) Use the union bound estimate to express the symbol error probability $P_s$ for
the decision rule in part (d) in terms of $E_b/N_0$.
(f) Describe how we can further reduce $P_s$ for a fixed $E_b/N_0$ without any channel
coding.
Problem 5.6 (Optimal detection of orthogonal signal sets): Consider an orthogonal
signal set with $M$ equally likely signals
$$ s_m(t) = \sqrt{E_s} \, \phi_m(t), \quad m \in \{1, \ldots, M\}, $$
where $\{\phi_1(t), \ldots, \phi_M(t)\}$ is an orthonormal set of signals. Consider a single symbol
transmission through an AWGN channel with noise PSD equal to $N_0/2$.
(a) Describe the optimal detection at the receiver. (In other words, describe the
optimal receiver structure and the optimal decision rule.) Simplify the decision
rule as much as you can.
(b) Compute the union bound estimate of the symbol error probability $P_s$ for the
decision rule in part (a). Express your answer in terms of $M$, $E_s$, and $N_0$.
Problem 5.7 (Optimal detection of biorthogonal signal sets): Consider a
biorthogonal signal set with $M$ equally likely signals
$$ s_m(t) = \begin{cases} \sqrt{E_s} \, \phi_m(t), & m \in \left\{ 1, \ldots, \frac{M}{2} \right\} \\ -\sqrt{E_s} \, \phi_{m - M/2}(t), & m \in \left\{ \frac{M}{2} + 1, \ldots, M \right\} \end{cases} $$
where $\{\phi_1(t), \ldots, \phi_{M/2}(t)\}$ is an orthonormal set of signals.
(a) Sketch the signal set in the signal space diagram for $M = 4$. HINT: For
$K$-dimensional modulation, the signal space diagram has $K$ dimensions. The $k$th
axis specifies the signal amplitude in the $k$th dimension, i.e. the coefficient of
$\phi_k(t)$ for $k \in \{1, \ldots, K\}$.
(b) Consider a single symbol transmission through an AWGN channel with noise
PSD equal to $N_0/2$. Describe the optimal detection at the receiver. (In other
words, describe the optimal receiver structure and the optimal decision rule.)
(c) Compute the union bound estimate of the symbol error probability $P_s$ for the
decision rule in part (b). Express your answer in terms of $M$, $E_s$, and $N_0$.
Chapter 6
Channel Coding
In this chapter, we discuss the functions of the channel encoder and decoder in the
schematic diagram of a communication system shown in figure 1.1. Studying in detail
the subject of channel coding is beyond the scope of this course. For our course, we
shall not study how to construct a channel code, but we shall study how to evaluate
the performance of a given code.
We shall focus on binary block codes and binary convolutional codes in this chapter.
For such codes, a block of information bits is mapped to an encoded bit sequence
that contains additional bits.^1 Such a mapping from information bits to encoded bits
for transmission is called channel coding. The redundancy introduced by channel
coding can improve the bit error rate (BER) of a system in the presence of noise.
6.1 Hard Decision and Soft Decision Decoding
Let us start with an example. The simplest but not the most efficient channel code is a
repetition code, which simply repeats the information bit multiple times. In particular,
consider a repetition code in which each bit is repeated 3 times: 0 → 000 and 1 → 111.
For transmission, suppose that we use binary pulse amplitude modulation (PAM)
through an additive white Gaussian noise (AWGN) channel with noise power spectral
density (PSD) equal to $N_0/2$.
Note that the observations for detection are the outputs of a matched filter sampled
at 3 successive symbol periods. Let $\mathbf{R} = (R_1, R_2, R_3)$ denote the observation. Let
hypotheses 0 and 1 correspond to the transmission of bit 0 and of bit 1 respectively.
In particular, we write
$$ \mathbf{R} = \mathbf{s}_m + \mathbf{N} \quad \text{under hypothesis } m \in \{0, 1\}, $$
where $\mathbf{s}_0 = \sqrt{E_d} \, (-1, -1, -1)$, $\mathbf{s}_1 = \sqrt{E_d} \, (1, 1, 1)$, $\mathbf{N} = (N_1, N_2, N_3)$, and $N_1, N_2, N_3$ are
independent and identically distributed (IID) zero-mean Gaussian RVs with variance
$N_0/2$.^2 The optimal decision rule is the maximum a posteriori probability (MAP)
decision rule. Assuming equally likely hypotheses, the optimal MAP decision rule
has the following form. (The derivation is left as an exercise.)
$$ r_1 + r_2 + r_3 \; \underset{\hat{H} = 0}{\overset{\hat{H} = 1}{\gtrless}} \; 0 $$
We call the above decision process that jointly utilizes the exact values of $r_1, r_2, r_3$
soft decision decoding. Alternatively, we can perform 3 separate hypothesis tests
based on $r_1, r_2, r_3$, and then use a majority rule for a final decision. More specifically,
if the 3 decisions are $\hat{H}_1 = 1$, $\hat{H}_2 = 0$, $\hat{H}_3 = 1$, then the final decision is $\hat{H} = 1$. Such
a decision process that involves separate bit decisions based on different observations
is called hard decision decoding.
While soft decision decoding performs better in terms of the error probability, hard
decision decoding can be attractive since it usually requires less computational effort
for the decoding process.^3 Because of its optimality, we shall focus on soft decision
decoding in this chapter.
Following our example, the corresponding probability of decision error for soft
decision decoding, denoted by $P_b^{\text{SOFT}}$, is
$$ P_b^{\text{SOFT}} = \Pr\{R_1 + R_2 + R_3 > 0 \,|\, m = 0\} = \Pr\left\{ N_1 + N_2 + N_3 > 3\sqrt{E_d} \right\}. $$
Since $N_1 + N_2 + N_3$ is a Gaussian random variable (RV) with mean zero and
variance $3N_0/2$, it follows that
$$ P_b^{\text{SOFT}} = \Pr\left\{ \frac{N_1 + N_2 + N_3}{\sqrt{3N_0/2}} > \frac{3\sqrt{E_d}}{\sqrt{3N_0/2}} \right\} = Q\!\left( \sqrt{\frac{6E_d}{N_0}} \right). $$
For hard decision decoding, we make 3 decisions $\hat{H}_1, \hat{H}_2, \hat{H}_3$ from $r_1, r_2, r_3$. Note
that each decision $\hat{H}_j$ has the error probability $p$ equal to that of binary PAM, i.e.
$p = Q\!\left( \sqrt{2E_d/N_0} \right)$. From the majority rule, the overall bit error occurs when there
are two or three errors in $\hat{H}_1, \hat{H}_2, \hat{H}_3$. Therefore, the overall probability of decision
error, denoted by $P_b^{\text{HARD}}$, is given by
$$ P_b^{\text{HARD}} = \binom{3}{2} p^2 (1 - p) + \binom{3}{3} p^3 = 3 Q^2\!\left( \sqrt{\frac{2E_d}{N_0}} \right) \left( 1 - Q\!\left( \sqrt{\frac{2E_d}{N_0}} \right) \right) + Q^3\!\left( \sqrt{\frac{2E_d}{N_0}} \right). $$

^1 Instead of raw data bits, information bits can also be the output of a source encoder.
^2 In the presence of channel coding, there is a difference between information bits and encoded
or transmitted bits. We shall use $E_b$ to denote the expected energy per information bit (as before),
and $E_d$ to denote the expected energy per transmission of encoded bits or equivalently the expected
energy per dimension (as will be seen later).
^3 Note that the discussions on bit error detection and bit error correction are only relevant when
we discuss hard decision decoding.

Figure 6.1: $P_b^{\text{SOFT}}$ and $P_b^{\text{HARD}}$ for different values of $E_b/N_0$ for the repetition code. (Plot of $\log_{10} P_b$ versus $E_b/N_0$ in dB, with one curve for hard decision decoding and one for soft decision decoding.)

Figure 6.1 compares $P_b^{\text{SOFT}}$ and $P_b^{\text{HARD}}$ for different values of $E_b/N_0$, where
$E_b = 3E_d$. The figure verifies that soft decision decoding outperforms hard decision
decoding in our example.
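The comparison between the two decoders for the rate-1/3 repetition code can be reproduced numerically from the two closed-form expressions derived in this section. The sketch below is illustrative; the function names are hypothetical and $E_d = E_b/3$ is substituted as stated above.

```python
import math

def q_func(x):
    """Gaussian tail probability Q(x) = Pr{N(0,1) > x}."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def pb_soft(ebn0):
    """Soft decision error probability Q(sqrt(6 Ed/N0)) with Ed = Eb/3."""
    ed_n0 = ebn0 / 3.0
    return q_func(math.sqrt(6.0 * ed_n0))

def pb_hard(ebn0):
    """Majority-rule error probability 3 p^2 (1 - p) + p^3 with p = Q(sqrt(2 Ed/N0))."""
    ed_n0 = ebn0 / 3.0
    p = q_func(math.sqrt(2.0 * ed_n0))
    return 3.0 * p * p * (1.0 - p) + p ** 3

for ebn0_db in (0, 2, 4, 6, 8, 10):
    ebn0 = 10.0 ** (ebn0_db / 10.0)
    print(ebn0_db, math.log10(pb_soft(ebn0)), math.log10(pb_hard(ebn0)))
```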
6.2 Binary Linear Block Codes
To understand the operations of binary linear block codes, it is convenient to use the
vector space viewpoint.
The binary field, or equivalently the Galois field of order 2, is denoted by $\mathbb{F}_2$ and
contains two elements: 0 and 1. Its addition and multiplication are given by the rules
of modulo-2 or mod-2 arithmetic. A vector space defined over the scalar field $\mathbb{F}_2$ is
called a binary vector space. Examples of binary vector spaces are given below.
Example 6.1 The set of all binary $n$-tuples (i.e. vectors with $n$ components), denoted
by $\mathbb{F}_2^n$, with componentwise mod-2 addition and mod-2 scalar multiplication is
a binary vector space.

Example 6.2 Let $\{\mathbf{g}_1, \ldots, \mathbf{g}_k\}$ be a set of linearly independent vectors in $\mathbb{F}_2^n$, where
$k \leq n$. Then the set of all binary linear combinations
$$ C = \left\{ \sum_{j=1}^{k} \alpha_j \mathbf{g}_j : \alpha_1, \ldots, \alpha_k \in \mathbb{F}_2 \right\} $$
is itself a binary vector space. In particular, $C$ is a subspace of $\mathbb{F}_2^n$. Its dimension is
$k$, while its size is $2^k$.
An $(n, k)$ binary linear block code, denoted by $C(n, k)$, is a subspace of the binary
vector space $\mathbb{F}_2^n$ that has dimension $k$. In the encoding process based on $C(n, k)$, $k$
information bits are mapped to an $n$-bit codeword in $C(n, k)$. Let $R_c$ be equal to the
ratio $k/n$; we call the quantity $R_c$ the rate of the code. Note that, in general, $R_c < 1$.
Below are some examples of binary linear block codes.
Example 6.3 Consider a binary linear block code $C$ that has only two codewords:
the all-zero and the all-one $n$-tuples. This code is an $(n, 1)$ code and is called the
rate-$1/n$ repetition code. Note that this code simply maps bit 0 to the all-zero $n$-tuple,
and bit 1 to the all-one $n$-tuple.

Example 6.4 Consider a binary linear block code $C$ that contains all binary $n$-tuples
with even numbers of ones. This code is an $(n, n-1)$ code and is called the single
parity check (SPC) code of length $n$. The $(n, n-1)$ SPC code maps a length-$(n-1)$
information bit sequence to a length-$n$ transmitted bit sequence such that the first
$n - 1$ bits of the encoded sequence are the same as the information sequence.

Example 6.5 Let $\{\mathbf{g}_1, \ldots, \mathbf{g}_k\}$ be a set of linearly independent vectors in $\mathbb{F}_2^n$, where
$k \leq n$. Then, the set of all binary linear combinations
$$ C = \left\{ \sum_{j=1}^{k} \alpha_j \mathbf{g}_j : \alpha_1, \ldots, \alpha_k \in \mathbb{F}_2 \right\} $$
is an $(n, k)$ binary linear block code.
For an $(n, k)$ code that contains all binary linear combinations of linearly independent
vectors $\mathbf{g}_1, \ldots, \mathbf{g}_k$ in $\mathbb{F}_2^n$, we can define the $k \times n$ generator matrix $\mathbf{G}$ such
that its rows are $\mathbf{g}_1^T, \ldots, \mathbf{g}_k^T$, i.e.
$$ \mathbf{G} = \begin{bmatrix} \mathbf{g}_1^T \\ \vdots \\ \mathbf{g}_k^T \end{bmatrix}. $$
The encoding operation can then be viewed as computing the product of the information
bit vector $\mathbf{b} = [b_1, \ldots, b_k]$ and $\mathbf{G}$; the encoded sequence is $\mathbf{b}\mathbf{G}$.^4
Example 6.6 Consider the generator matrix
$$ \mathbf{G} = \begin{bmatrix} 1 & 0 & 0 & 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & 0 & 1 & 1 & 1 \\ 0 & 0 & 1 & 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 & 0 & 1 & 1 \end{bmatrix}. $$
The corresponding binary linear block code is a (7, 4) code. There are 16 codewords as
shown below.
x16 = [0000]G = [0000000]
x1 = [0001]G = [0001011]
x2 = [0010]G = [0010110]
x3 = [0011]G = [0011101]
x4 = [0100]G = [0100111]
x5 = [0101]G = [0101100]
x6 = [0110]G = [0110001]
x7 = [0111]G = [0111010]
x8 = [1000]G = [1000101]
x9 = [1001]G = [1001110]
x10 = [1010]G = [1010011]
x11 = [1011]G = [1011000]
x12 = [1100]G = [1100010]
x13 = [1101]G = [1101001]
x14 = [1110]G = [1110100]
x15 = [1111]G = [1111111]
Note that, for convenience, we let the all-zero codeword be the 16th codeword
so that the index of any other codeword corresponds to the decimal value of the
information bits.

^4 By convention, we write the information bit vector $\mathbf{b}$ as a row vector [?, p. 417].
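The encoding operation $\mathbf{b}\mathbf{G}$ over $\mathbb{F}_2$ can be sketched in a few lines. The code below is a minimal illustration using the generator matrix of example 6.6; the function name `encode` and the list-of-lists representation are choices made for this sketch.

```python
import itertools

# Generator matrix of the (7,4) code in example 6.6, one row per list.
G = [
    [1, 0, 0, 0, 1, 0, 1],
    [0, 1, 0, 0, 1, 1, 1],
    [0, 0, 1, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1, 1],
]

def encode(b, G):
    """Compute the codeword bG using mod-2 arithmetic."""
    n = len(G[0])
    return [sum(b[i] * G[i][j] for i in range(len(G))) % 2 for j in range(n)]

# List all 2^k codewords in the order of the information bit vectors.
for b in itertools.product((0, 1), repeat=len(G)):
    print("".join(map(str, b)), "->", "".join(map(str, encode(b, G))))
```

Because the first four columns of this $\mathbf{G}$ form an identity matrix, the first four bits of each codeword repeat the information bits.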

The Hamming metric or Hamming weight of a binary vector $\mathbf{x}$ in the binary vector
space $\mathbb{F}_2^n$, denoted by $w_H(\mathbf{x})$, is defined as
$$ w_H(\mathbf{x}) = \text{number of ones in } \mathbf{x}. \tag{6.1} $$
The Hamming distance between two binary vectors $\mathbf{x}$ and $\mathbf{y}$ in $\mathbb{F}_2^n$, denoted by
$d_H(\mathbf{x}, \mathbf{y})$, is defined as
$$ d_H(\mathbf{x}, \mathbf{y}) = w_H(\mathbf{x} + \mathbf{y}), \tag{6.2} $$
and can be thought of as the number of bit positions that are different between $\mathbf{x}$ and
$\mathbf{y}$. For example, let $\mathbf{x} = [001]$ and $\mathbf{y} = [100]$. Both $\mathbf{x}$ and $\mathbf{y}$ have Hamming weight 1.
Their Hamming distance is 2.
An $(n, k)$ binary linear block code $C$ has the minimum Hamming distance $d$, and
is called an $(n, k, d)$ binary linear block code if
$$ d = \min_{\mathbf{x}, \mathbf{y} \in C, \, \mathbf{x} \neq \mathbf{y}} d_H(\mathbf{x}, \mathbf{y}). \tag{6.3} $$
Since an $(n, k, d)$ binary linear block code $C$ is itself a binary vector space, it has
the closure property; a mod-2 addition of any two codewords in $C$ yields a codeword
in $C$. This closure property allows us to easily identify $d$ as described next.
We first argue that, for an arbitrary codeword $\mathbf{y} \in C$, the set $C_{\mathbf{y}} = \{\mathbf{y} + \mathbf{x} : \mathbf{x} \in C\}$
is the same as $C$. To see this, note that each addition $\mathbf{y} + \mathbf{x}$ is a codeword in $C$ from
the closure property. In addition, for any codeword $\mathbf{z} \in C$, we see that $\mathbf{z}$ is also in $C_{\mathbf{y}}$
since $\mathbf{z}$ is equal to $\mathbf{y} + (\mathbf{y} + \mathbf{z})$ and $\mathbf{y} + \mathbf{z}$ is in $C$ by the closure property.
Let $\mathcal{W}(C)$ be the set of Hamming weights of the codewords in $C$, i.e.
$\mathcal{W}(C) = \{w_H(\mathbf{x}) : \mathbf{x} \in C\}$. Consider an arbitrary codeword $\mathbf{y} \in C$. Let $\mathcal{D}_{\mathbf{y}}(C)$ be the
set of Hamming distances between $\mathbf{y}$ and all the codewords in $C$, i.e.
$\mathcal{D}_{\mathbf{y}}(C) = \{w_H(\mathbf{x} + \mathbf{y}) : \mathbf{x} \in C\}$. Since $C_{\mathbf{y}} = C$, it follows that $\mathcal{D}_{\mathbf{y}}(C) = \mathcal{W}(C)$ for all $\mathbf{y}$. This
observation yields the following theorem.
Theorem 6.1 (Minimum Hamming distance of binary linear block codes):
An $(n, k, d)$ binary linear block code $C$ has the following properties.
1. The minimum Hamming distance $d$ is equal to the minimum Hamming weight
among the nonzero codewords in $C$, i.e. $d = \min_{\mathbf{x} \in C, \, \mathbf{x} \neq \mathbf{0}} w_H(\mathbf{x})$.
2. With respect to each codeword $\mathbf{x} \in C$, the number of codewords that are at
Hamming distance $d$ away from $\mathbf{x}$ is equal to the number of codewords in $C$
with Hamming weight $d$.
From statement 2 of the theorem, we can meaningfully define $N_d$ to be the number
of codewords that are at Hamming distance $d$ away from each codeword in $C$. At this
point, we can describe a code using four parameters: $k$, $n$, $d$, $N_d$. By an $(n, k, d, N_d)$
binary linear block code, we refer to a code that maps $k$ information bits to $n$ encoded
bits such that the minimum Hamming weight of the nonzero codewords is $d$ and each
codeword has $N_d$ other codewords at Hamming distance $d$ away from it.
Example 6.7 Consider again the (7,4) code in example 6.6. By inspection, we see
that $d = 3$. There are 7 codewords with Hamming weight equal to 3: x1, x2, x5, x6,
x8, x11, x12. Thus, $N_d = 7$, and the code is a (7,4,3,7) code.

Example 6.8 Consider a (3,2) code C = {000, 011, 101, 110}. Note that this is an
SPC code with length 3. By inspection, we see that $d = 2$. Since there are 3 codewords
with Hamming weight equal to 2, we see that $N_d = 3$. In summary, the code is a
(3,2,2,3) code.
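By theorem 6.1, the parameters $d$ and $N_d$ can be found by enumerating the nonzero codewords and inspecting their Hamming weights. The sketch below does this for the two examples above. The generator matrix `G_SPC3` for the (3,2) SPC code is an assumption made for the illustration (its rows 101 and 011 generate the codewords 000, 011, 101, 110); it is not given in the text.

```python
import itertools

def codewords(G):
    """All binary linear combinations of the rows of G (mod-2 arithmetic)."""
    k, n = len(G), len(G[0])
    return [tuple(sum(b[i] * G[i][j] for i in range(k)) % 2 for j in range(n))
            for b in itertools.product((0, 1), repeat=k)]

def min_distance_and_count(G):
    """Return (d, N_d) from the Hamming weights of the nonzero codewords."""
    weights = [sum(c) for c in codewords(G) if any(c)]
    d = min(weights)
    return d, weights.count(d)

G_74 = [
    [1, 0, 0, 0, 1, 0, 1],
    [0, 1, 0, 0, 1, 1, 1],
    [0, 0, 1, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1, 1],
]
# Assumed generator matrix for the (3,2) SPC code of example 6.8.
G_SPC3 = [
    [1, 0, 1],
    [0, 1, 1],
]

print(min_distance_and_count(G_74))    # parameters (d, N_d) of the (7,4) code
print(min_distance_and_count(G_SPC3))  # parameters (d, N_d) of the (3,2) SPC code
```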
Signal Space Images of Binary Linear Block Codes

We now consider the transmission of encoded bits. Assume that the modulation is
binary PAM that maps bit 0 to amplitude $-\sqrt{E_d}$ and bit 1 to amplitude $\sqrt{E_d}$, where
$E_d$ is the expected energy per transmission of encoded bits. An $(n, k)$ binary linear
block code $C$ that is a subspace of $\mathbb{F}_2^n$ can be mapped into a $2^k$-point signal set $S$ in
the signal space $\mathbb{R}^n$ by using the mapping of the given binary PAM. We make the
following observations from such mapping.
1. The set of all binary $n$-tuples $\mathbb{F}_2^n$ is mapped to the set of $2^n$ vertices of an $n$-cube
centered at the origin and with side length $2\sqrt{E_d}$.
2. The set of codewords of an $(n, k)$ code $C$ is mapped to a subset of $2^k$ vertices of
this $n$-cube.
We shall refer to the set of signal points $S$ obtained from the above mapping of $C$
to $\mathbb{R}^n$ as the signal space image of the code $C$.
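Example 6.9 Consider again the (3,2,2) SPC code C = {000, 011, 101, 110}. The
corresponding signal set $S$ in $\mathbb{R}^3$ is
$$ S = \left\{ \sqrt{E_d} \begin{bmatrix} -1 \\ -1 \\ -1 \end{bmatrix}, \; \sqrt{E_d} \begin{bmatrix} -1 \\ 1 \\ 1 \end{bmatrix}, \; \sqrt{E_d} \begin{bmatrix} 1 \\ -1 \\ 1 \end{bmatrix}, \; \sqrt{E_d} \begin{bmatrix} 1 \\ 1 \\ -1 \end{bmatrix} \right\}, $$
which is illustrated in figure 6.2.

Figure 6.2: Signal space image in $\mathbb{R}^3$ of the (3,2,2) SPC code.
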
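If two codewords $\mathbf{x}, \mathbf{y} \in C$ have Hamming distance $d_H(\mathbf{x}, \mathbf{y})$, then their signal space
images, denoted by $\mathbf{s}(\mathbf{x}), \mathbf{s}(\mathbf{y}) \in \mathbb{R}^n$, have the squared Euclidean distance
equal to
$$ \|\mathbf{s}(\mathbf{x}) - \mathbf{s}(\mathbf{y})\|^2 = 4 E_d \, d_H(\mathbf{x}, \mathbf{y}). \tag{6.4} $$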
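The relation (6.4) can be checked directly on a small code. The sketch below maps each codeword of the (3,2) SPC code to its signal space image and compares both sides of (6.4). It assumes that bit 1 is sent to the positive amplitude; the identity holds equally for the opposite sign convention, since each differing position contributes $(2\sqrt{E_d})^2$ either way. The function names and the value of `ed` are illustrative.

```python
import itertools
import math

def image(x, ed):
    """Signal space image of a binary tuple under binary PAM.

    Assumed mapping for this sketch: bit 0 -> -sqrt(Ed), bit 1 -> +sqrt(Ed).
    """
    a = math.sqrt(ed)
    return [a if bit else -a for bit in x]

def hamming_distance(x, y):
    """Number of positions in which x and y differ."""
    return sum(xi != yi for xi, yi in zip(x, y))

def sq_euclidean(u, v):
    """Squared Euclidean distance between two real vectors."""
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v))

code = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]  # (3,2) SPC code
ed = 2.0                                             # arbitrary energy per dimension

for x, y in itertools.combinations(code, 2):
    lhs = sq_euclidean(image(x, ed), image(y, ed))
    rhs = 4.0 * ed * hamming_distance(x, y)
    print(x, y, lhs, rhs)
```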
We can view the transmission of a signal point in the signal set $S$ obtained from
an $(n, k, d)$ binary linear block code as performing an $n$-dimensional modulation with
$2^k$ possible signal points or hypotheses. Each signal point is transmitted using $n$
successive transmissions of PAM pulses. Each transmission corresponds to a dimension
in the signal set. Therefore, we can talk about a quantity per transmission and per
dimension interchangeably, e.g. $E_d$ is the expected energy per transmission or per
dimension.
From (6.4), the minimum distance $d_{\min}$ between signal points of $S$ is related to
the minimum Hamming distance $d$ of $C$ by
$$ d_{\min} = 2\sqrt{E_d \, d}. \tag{6.5} $$
Since for an $(n, k, d, N_d)$ code the number of codewords at Hamming distance $d$
from each codeword is $N_d$, the number of nearest neighbors to each signal point in $S$
is $N_d$. Therefore, the average number of nearest neighbors of the signal set $S$ is
$$ K_{\min} = N_d. \tag{6.6} $$
Consider the transmission through an AWGN channel with noise PSD N0 /2. It
follows that the union bound estimate for the symbol error probability Ps of the signal
set is equal to
(
)
(
)
2Ed d
dmin /2
= Nd Q
Ps Kmin Q
.
N0
N0 /2
To estimate the bit error probability $P_b$, we follow the convention of normalizing $P_s$ by a factor of $k$ to get the error probability per information bit [?, p. 2391], i.e. we can approximate (see footnote 5)
$$P_b \approx \frac{1}{k} P_s = \frac{N_d}{k}\, Q\!\left( \sqrt{\frac{2 E_d\, d}{N_0}} \right).$$
Footnote 5: Note that this error probability per information bit is only an approximation, and is not the same as the bit error probability, which is difficult to obtain analytically.
Finally, it is desirable to express $P_b$ in terms of $E_b/N_0$ in order to compare with the baseline uncoded scenario. Since for every $n$ transmissions, we transmit $k$ information bits, we can write $E_d = \frac{k}{n} E_b$. It follows that
$$P_b \approx \frac{N_d}{k}\, Q\!\left( \sqrt{\frac{kd}{n} \cdot \frac{2E_b}{N_0}} \right).$$ (6.7)
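As a rough numerical aid, the estimate in (6.7) can be evaluated directly. The following is a minimal sketch, not a definitive implementation; the function names `q_function` and `pb_block_code` are hypothetical, and the Q function is computed from the complementary error function in the Python standard library.

```python
import math

def q_function(x):
    """Gaussian tail probability Q(x) = Pr{N(0,1) > x}."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def pb_block_code(n, k, d, Nd, ebn0_db):
    """Union bound estimate (6.7) of the error probability per information bit."""
    ebn0 = 10.0 ** (ebn0_db / 10.0)  # convert dB to a linear ratio
    return (Nd / k) * q_function(math.sqrt((k * d / n) * 2.0 * ebn0))

# Example: the (3, 2, 2, 3) SPC code at a few values of Eb/N0 in dB.
for ebn0_db in (4.0, 6.0, 8.0, 10.0):
    print(ebn0_db, pb_block_code(3, 2, 2, 3, ebn0_db))
```

Setting $(n, k, d, N_d) = (1, 1, 1, 1)$ reduces the expression to $Q(\sqrt{2E_b/N_0})$, i.e. uncoded binary PAM.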
Error Performance of Binary Linear Block Codes

As the baseline scenario for comparing performances of binary linear block codes, consider uncoded binary PAM. From (5.24), the baseline performance is given by
$$P_b^{\mathrm{UNCODED}} = Q\!\left( \sqrt{\frac{2E_b}{N_0}} \right).$$ (6.8)
By comparing $P_b$ in (6.7) with $P_b^{\mathrm{UNCODED}}$ in (6.8), we can quantify the coding gain of an $(n, k, d, N_d)$ binary linear block code. For a specific value of $P_b$, e.g. $P_b = 10^{-5}$, the coding gain is the reduction in the value of $E_b/N_0$ required, as illustrated in the following example.
Example 6.10 For the (3, 2, 2, 3) SPC code, we can use (6.7) to write
$$P_b \approx 1.5\, Q\!\left( \sqrt{\frac{4}{3} \cdot \frac{2E_b}{N_0}} \right).$$
From the plot of $P_b$ versus $E_b/N_0$ in figure 6.3, we can see that the coding gain of this (3,2,2,3) code is about 1 dB for $P_b = 10^{-5}$.
In general, the curve for $P_b$ for an $(n, k, d, N_d)$ binary linear block code can be approximately obtained from that for uncoded binary PAM by moving the curve to the left by $10 \log_{10}(kd/n)$ dB and up by $\log_{10}(N_d/k)$ on the $\log_{10} P_b$ axis.
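The coding gain at a target $P_b$ can also be found numerically rather than read off a plot. The sketch below is one possible approach, assuming bisection on $E_b/N_0$ in dB over a hypothetical search range of -10 to 30 dB; all function names are illustrative. With default parameters, `pb_estimate` is the uncoded baseline $Q(\sqrt{2E_b/N_0})$.

```python
import math

def q_function(x):
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def pb_estimate(ebn0_db, n=1, k=1, d=1, Nd=1):
    """(Nd/k) Q(sqrt((kd/n) 2Eb/N0)); the defaults give uncoded binary PAM."""
    ebn0 = 10.0 ** (ebn0_db / 10.0)
    return (Nd / k) * q_function(math.sqrt((k * d / n) * 2.0 * ebn0))

def required_ebn0_db(target_pb, lo=-10.0, hi=30.0, **code):
    """Bisection on Eb/N0 (dB); relies on pb_estimate decreasing in Eb/N0."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if pb_estimate(mid, **code) > target_pb:
            lo = mid  # error probability still too high: need more Eb/N0
        else:
            hi = mid
    return 0.5 * (lo + hi)

def coding_gain_db(target_pb, n, k, d, Nd):
    """Reduction in required Eb/N0 (dB) relative to uncoded binary PAM."""
    uncoded = required_ebn0_db(target_pb)
    coded = required_ebn0_db(target_pb, n=n, k=k, d=d, Nd=Nd)
    return uncoded - coded

print("nominal shift (dB):", 10 * math.log10(2 * 2 / 3))
print("coding gain at 1e-5 (dB):", coding_gain_db(1e-5, 3, 2, 2, 3))
```

The computed gain is smaller than the nominal shift $10\log_{10}(kd/n)$ because the factor $N_d/k > 1$ moves the curve up.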
It should be noted that the reduction in $E_b/N_0$ does not come without a cost. In particular, if we want to keep the information bit rate the same as for the uncoded system, then the coded system based on an $(n, k)$ binary linear block code requires more bandwidth. In particular, if we use the set of sinc pulses $\left\{ \frac{1}{\sqrt{T}} \mathrm{sinc}\!\left( \frac{t}{T} - j \right), j \in \mathbb{Z} \right\}$ for uncoded binary PAM, then the period $T$ must be changed to $kT/n$ for the coded binary PAM system to support the same information bit rate. It follows that the required bandwidth increases from $\frac{1}{2T}$ to $\frac{n}{k} \cdot \frac{1}{2T}$. Therefore, the factor $n/k$ can be viewed as the bandwidth expansion ratio for using an $(n, k)$ code.
Table 6.1 contains the coding gains for some known codes with soft decision decoding. These gains are computed based on the estimate of Pb in (6.7) and the baseline
PbUNCODED in (6.8). As previously mentioned, we shall not discuss the theory behind
the construction of these codes in this course.
Figure 6.3: Error performance of the (3,2,2,3) SPC code. (Plot of $\log_{10} P_b$ versus $E_b/N_0$ in dB for the uncoded and coded systems.)
Table 6.1: Coding gains at $P_b = 10^{-6}$ for some known block codes [?, p. 2392].

(n, k, d, Nd)          n/k     coding gain (dB)
(8,4,4,14)             2.0     2.6
(16,5,8,30)            3.2     3.5
(24,12,8,759)          2.0     4.8
(32,6,16,62)           5.3     4.1
(32,16,8,620)          2.0     5.0
(64,7,32,126)          9.1     4.6
(64,22,16,2604)        2.9     6.0
(128,8,64,254)         16      5.0
(128,64,16,10668)      2.0     6.9
(128,64,16,94488)      2.0     6.9
(256,9,128,510)        28.4    5.4
(256,37,64,43180)      6.9     7.6
6.3 Binary Linear Convolutional Codes
Similar to the discussion on binary linear block codes, we can discuss binary linear convolutional codes using the binary vector space $\mathbb{F}_2^n$ with mod-2 arithmetic. A binary linear convolutional code is defined by an encoding structure that contains a shift register; the outputs are linear combinations of the contents of the shift register.
A rate-$k/n$ binary convolutional code uses a shift register of length $Kk$, where $K$ is called the constraint length of the code. At each time step, $k$ information bits are shifted into the register, and $n$ encoded bits are formed as linear combinations of the $Kk$ bits in the register. Let vector $\mathbf{b} = [b_1, \ldots, b_{Kk}]$ contain the bits in the shift register. Define the $Kk \times n$ generator matrix $G$ such that the encoded bits are $\mathbf{x} = \mathbf{b}G$ at each time step, as illustrated in the figure below.
Figure: Convolutional encoding using a shift register. (At each step, $k$ input bits are shifted into the register; a link from a register bit to the mth output exists if the corresponding entry of $G$ is 1.)
Observe that the $n$ encoded bits at each time step lie in the subspace of the binary vector space $\mathbb{F}_2^n$ spanned by the rows of $G$; this subspace has dimension at most $Kk$. Forming a subspace, the encoded bit sequences have the closure property, i.e. if $[\mathbf{x}_1, \mathbf{x}_2, \ldots]$ and $[\mathbf{y}_1, \mathbf{y}_2, \ldots]$ are two valid encoded bit sequences, so is $[\mathbf{x}_1 + \mathbf{y}_1, \mathbf{x}_2 + \mathbf{y}_2, \ldots]$.
Example 6.11 Consider the rate-1/3 convolutional encoder with constraint length 3 and generator matrix
$$G = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix}.$$
Figure 6.4 shows the corresponding encoding structure using a shift register.
Assume that the contents of the shift register are initially all zeros. Given 6 information bits 110101 that enter the encoder, the encoded bits are 111 010 001 011 101 011, as illustrated in figure 6.5.
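The encoding operation $\mathbf{x} = \mathbf{b}G$ of example 6.11 can be sketched in a few lines. This is a minimal illustration for rate-1/n codes only (so $k = 1$); the function name `conv_encode` is hypothetical, and the register is assumed to start in the all-zero state with new bits entering at position $b_1$.

```python
def conv_encode(info_bits, G):
    """Rate-1/n convolutional encoding with a K-by-n generator matrix G over GF(2)."""
    K, n = len(G), len(G[0])
    register = [0] * K  # all-zero initial contents
    blocks = []
    for bit in info_bits:
        # Shift the new bit in at b_1 and drop the oldest bit.
        register = [bit] + register[:-1]
        # x = bG with mod-2 arithmetic.
        blocks.append([sum(register[i] * G[i][m] for i in range(K)) % 2
                       for m in range(n)])
    return blocks

G = [[1, 1, 1], [1, 0, 1], [1, 0, 0]]  # generator matrix of example 6.11
encoded = conv_encode([1, 1, 0, 1, 0, 1], G)
print(" ".join("".join(map(str, block)) for block in encoded))
# prints "111 010 001 011 101 011", the sequence stated in the example
```

Appending two zero tail bits to the input returns the register to the all-zero state.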

The shift register contents $(b_{k+1}, \ldots, b_{Kk})$ can be viewed as the state of the convolutional encoder. Note that the current inputs $(b_1, \ldots, b_k)$ are not part of the state definition. Observe that there are in total $2^{(K-1)k}$ states.
The state transition diagram is an alternative representation of a binary linear convolutional code. Each state is labeled by the register contents $(b_{k+1}, \ldots, b_{Kk})$. From each state, there are $2^k$ links indicating $2^k$ possible transitions. The $n$ output bits for that time step are labeled on the transition link.

Figure 6.4: Encoding structure for the convolutional code in example 6.11.
Figure 6.5: Encoding operations of the convolutional code in example 6.11.
Figure 6.6: State transition diagram for the convolutional code in example 6.11. (Solid lines correspond to transitions with input bit 1. Dashed lines correspond to transitions with input bit 0. Link labels are output bits for each time step.)
Figure 6.7: Trellis diagram for the convolutional code in example 6.11.
Example 6.12 The state transition diagram of the rate-1/3 binary linear convolutional code in example 6.11 is shown in figure 6.6.

Another way to represent a binary linear convolutional code is to use the trellis diagram, which is essentially the state transition diagram with the states at different time steps drawn separately.
Example 6.13 For the rate-1/3 binary linear convolutional code in example 6.11, assuming that the original state is 00 and the final two information bits are zero, we can draw the trellis diagram as shown in figure 6.7. Note that, regardless of the current state, the appearance of two zero tail bits always leads to state 00.
Given six input bits 110101 and two tail bits 00, the encoded bit sequence corresponds to a path in the trellis diagram, as shown in figure 6.8. From the link labels along the path, we can read out the encoded bits, i.e. 111 010 001 011 101 011 101 100.
Figure 6.8: Correspondence between an encoded bit sequence and a path in the trellis diagram.
Signal Space Images of Binary Linear Convolutional Codes

Assume that binary PAM is used for modulation. As for binary linear block codes, the encoded bits can be mapped to signal points in the signal space $\mathbb{R}^n$ by mapping bit 0 to amplitude $\sqrt{E_d}$ and bit 1 to amplitude $-\sqrt{E_d}$, where $E_d$ is the expected energy per transmission of encoded bits, or equivalently the expected energy per dimension. At each step of the encoding operations, the convolutional encoder outputs a signal point that is in the $2^k$-point signal set $S$ that is a subset of $\mathbb{R}^n$ and can be viewed as a result of n-dimensional modulation.
Unlike successive codewords that result from a binary linear block code, encoded bit sequences from different time steps of convolutional encoding are not statistically independent. Therefore, optimal detection of transmitted signal points cannot be done separately. We next describe an optimal and efficient procedure for detecting an encoded bit sequence.
Soft-Decision Decoding for Binary Convolutional Codes: the Viterbi Algorithm
For simplicity, assume for now that $k = 1$. Consider passing $J$ information bits through a rate-$1/n$ convolutional encoder with constraint length $K$. Assume that we are originally in the all-zero state. In addition, assume that the $J$ information bits are followed by $K - 1$ zero tail bits so that the final state is also the all-zero state.
The encoded bits are transmitted using binary PAM described above. Using the discrete-time AWGN channel model, the receiver observes a sequence of random vectors $\mathbf{R} = (\mathbf{R}_1, \ldots, \mathbf{R}_{J+K-1})$, where random vector $\mathbf{R}_j$, $j \in \{1, \ldots, J+K-1\}$, corresponds to the n-dimensional observation during the transmission of the jth signal point. Notice that $\mathbf{R}$ is an $n(J+K-1)$-dimensional vector.
Let $\mathbf{A} = (\mathbf{A}_1, \ldots, \mathbf{A}_{J+K-1})$ be the signal points transmitted in $J+K-1$ different time steps. Note that there are $2^J$ hypotheses for signal detection. Let $\mathbf{s}_m = (\mathbf{s}_{m,1}, \ldots, \mathbf{s}_{m,J+K-1})$ be the transmitted signal points under hypothesis $m$, where $m \in \{1, \ldots, 2^J\}$. Note that $\mathbf{A}$ takes its value in $\{\mathbf{s}_1, \ldots, \mathbf{s}_{2^J}\}$. Using the optimal minimum-distance decision rule, the receiver decision is to set
$$\hat{H} = \arg\min_{m \in \{1,\ldots,2^J\}} \|\mathbf{r} - \mathbf{s}_m\|^2 = \arg\min_{m \in \{1,\ldots,2^J\}} \sum_{j=1}^{J+K-1} \|\mathbf{r}_j - \mathbf{s}_{m,j}\|^2$$
$$= \arg\min_{m \in \{1,\ldots,2^J\}} \left[ \sum_{j=1}^{J+K-1} \|\mathbf{r}_j\|^2 + \sum_{j=1}^{J+K-1} \|\mathbf{s}_{m,j}\|^2 - 2 \sum_{j=1}^{J+K-1} \langle \mathbf{r}_j, \mathbf{s}_{m,j} \rangle \right].$$
Notice that the first and the second terms are the same for all $m$; in particular, $\|\mathbf{s}_{m,j}\|^2 = nE_d$ for every signal point. Therefore, we can simplify the decision rule to
$$\hat{H} = \arg\min_{m \in \{1,\ldots,2^J\}} \left( -\sum_{j=1}^{J+K-1} \langle \mathbf{r}_j, \mathbf{s}_{m,j} \rangle \right).$$ (6.9)
The Viterbi algorithm is an efficient recursive algorithm that can be used for soft-decision decoding for binary linear convolutional codes. The underlying ideas are based on the following observations.
1. If we assign to each branch or link $l$ in the trellis diagram a metric equal to $-\langle \mathbf{r}_l, \mathbf{x}_l \rangle$, where $\mathbf{r}_l$ and $\mathbf{x}_l$ are the n-dimensional observations and encoded outputs associated with that branch, then the optimal decision is equivalent to finding the shortest path through the trellis diagram from the all-zero state at time 1 to the all-zero state in the final time step.
2. The initial segment of the shortest path, say from time 1 to time $m$, must be the shortest path to whatever state $S_m$ that is passed at time $m$; if there were any shorter path to $S_m$, it could replace the segment of the shortest path to create an even shorter path, yielding a contradiction. Therefore, it suffices at time $m$ to determine and keep, for each state $S_m$, only the shortest path from the all-zero state at time 1 to $S_m$. These paths are called survivors.
3. The survivors at time $m + 1$ can be found from the survivors at time $m$ using the following recursive operations:
(i) For each branch from a state at time $m$ to a state at time $m + 1$, add the metric of that branch to the metric of the time-$m$ survivor to get a candidate path metric at time $m + 1$.
(ii) For each state at time $m + 1$, compare the candidate path metrics arriving at that state and select the path with the smallest metric as the survivor. Store the survivor path history, i.e. the previous node along the survivor path, as well as its metric.
(iii) Since we have the all-zero state in the final time step, it is clear that there is only one survivor in the end.
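The recursion in observation 3 can be sketched as follows for a rate-1/n code. This is an illustrative sketch under stated assumptions, not a production decoder: the names `branch` and `viterbi_decode` are hypothetical, $E_d = 1$, bit 0 is assumed to map to amplitude +1 and bit 1 to -1, and each survivor stores its full decoded bit history instead of back-pointers.

```python
G = [[1, 1, 1], [1, 0, 1], [1, 0, 0]]  # generator matrix of example 6.11

def branch(state, bit, G):
    """Return (next_state, signal_point); state holds the K-1 stored bits."""
    register = (bit,) + state
    n = len(G[0])
    x = [sum(register[i] * G[i][m] for i in range(len(G))) % 2 for m in range(n)]
    signal = [1.0 if xi == 0 else -1.0 for xi in x]  # assumed mapping, Ed = 1
    return register[:-1], signal

def viterbi_decode(received, G, num_tail):
    """Soft-decision decoding; received is a list of n-dimensional observations."""
    zero = (0,) * (len(G) - 1)
    survivors = {zero: (0.0, [])}  # state -> (path metric, decoded bits so far)
    for r in received:
        candidates = {}
        for state, (metric, bits) in survivors.items():
            for bit in (0, 1):
                nxt, s = branch(state, bit, G)
                # Branch metric is minus the inner product <r, s>.
                m = metric - sum(ri * si for ri, si in zip(r, s))
                if nxt not in candidates or m < candidates[nxt][0]:
                    candidates[nxt] = (m, bits + [bit])
        survivors = candidates
    # Tail bits force the final state to be all-zero: take that survivor.
    bits = survivors[zero][1]
    return bits[:len(bits) - num_tail]

# Noiseless demonstration: encode 110101 plus two zero tail bits, then decode.
state, rx = (0, 0), []
for b in [1, 1, 0, 1, 0, 1, 0, 0]:
    state, s = branch(state, b, G)
    rx.append(s)
print(viterbi_decode(rx, G, num_tail=2))
```

Keeping one survivor per state means the work per time step is proportional to the number of states, rather than to the $2^J$ hypotheses of a brute-force search.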
Figure 6.9: Operations of the Viterbi algorithm.

Notice that the procedure described in observation 3 can be used as the decoding
procedure, as illustrated in the following example.
Example 6.14 For the rate-1/3 convolutional code in example 6.11, assume that the original state is 00 and the final two tail bits are zero. Suppose that 6 information bits are sent and the corresponding 24 received signal values (18 values for 6 information bits and 6 values for 2 tail bits) are
0.2, 0.4, 0.1 0.5, 0.3, 0.2 0.3, 0.4, 0.1 0.2, 0.2, 0.3
0.4, 0.1, 0.2
0.2, 0.4, 0.1 0.2, 0.1, 0.3
0.3, 0.2, 0.1.
Assume that Ed = 1. Figure 6.9 illustrates the steps of the Viterbi algorithm. In
each step, the survivors are indicated by bolded paths.
In this example, the decoded bits are 11010100, which are the same as the 6 information bits plus two zero tail bits.

In practice, we do not use the tail bits to force the all-zero state. Empirically, it was
found that, with high probability, all survivors have the same history if we look back
to about 5 constraint lengths (5K) in the past. Thus, we can view 5K time steps as
the decoding delay in practice.
Finally, for $k > 1$, we have a trellis diagram with $2^{(K-1)k}$ states. At each step, there are $2^{(K-1)k}$ survivors. In addition, there are $2^k$ paths that merge at each node. Since each path that converges at a common node requires the computation of a metric, there are $2^k$ metrics computed for each node. Of the $2^k$ paths that merge at each node, only one survives.
Error Performance of Binary Linear Convolutional Codes

Error analysis of convolutional codes can be done based on the notion of an error event. Suppose that the transmitted encoded bits (of finite length) are $\mathbf{x}$ and the decoded bits are $\mathbf{y}$. Typically, $\mathbf{x}$ and $\mathbf{y}$ agree for a long time, but will disagree over some finite intervals. An error event corresponds to one of these finite intervals.

Figure 6.10: Modified state transition diagram for the derivation of the transfer function of the convolutional code in example 6.11.
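An error event corresponds to some finite error bit sequence $\mathbf{e}$. Suppose that there is only one error event, i.e. we can write $\mathbf{y} = \mathbf{x} + \mathbf{e}$. The probability of such a finite error bit sequence $\mathbf{e}$ is given by
$$\Pr\{\mathbf{y} \text{ detected} \mid \mathbf{x} \text{ sent}\} = Q\!\left( \frac{\|s(\mathbf{y}) - s(\mathbf{x})\|}{\sqrt{2N_0}} \right),$$
where $s(\mathbf{x})$ and $s(\mathbf{y})$ are the signal space images of $\mathbf{x}$ and $\mathbf{y}$.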
As with binary linear block codes, we have
$$\|s(\mathbf{y}) - s(\mathbf{x})\|^2 = 4E_d\, d_H(\mathbf{y}, \mathbf{x}) = 4E_d\, w_H(\mathbf{e}).$$
Thus, the error event that is the most likely is the one corresponding to the minimum Hamming distance between $\mathbf{x}$ and another valid encoded bit sequence $\mathbf{y}$. This distance is called the free distance of a convolutional code, and is denoted by $d_{\mathrm{free}}$. In addition, let $N_{\mathrm{free}}$ be the number of error events with Hamming weight $d_{\mathrm{free}}$.
By the closure property of a binary linear convolutional code, the error bit sequence $\mathbf{e} = \mathbf{x} + \mathbf{y}$ is a valid encoded bit sequence. Hence, $d_{\mathrm{free}}$ is equal to the minimum Hamming weight of any nonzero finite-length encoded bit sequence, i.e. an encoded bit sequence that starts and ends with the all-zero sequences. In the trellis diagram, $d_{\mathrm{free}}$ is the minimum Hamming weight of any path that leaves from the all-zero state and returns to the all-zero state.
The values of $d_{\mathrm{free}}$ and $N_{\mathrm{free}}$ can be obtained from the transfer function of the convolutional code, which is constructed as follows. Consider the state transition diagram in which the all-zero state is split into the source and the destination nodes. Label each link by $D^w$, where $w$ is the Hamming weight of the corresponding encoded outputs on that link. To illustrate, consider the rate-1/3 convolutional code in example 6.11. The modified state diagram is shown in figure 6.10.
From the modified state transition diagram, we can write the state transition equations. For each state $S$, $S \neq S_{00}^s$, the state transition equation is of the form
$$S = D^u U + D^v V,$$
where $u$ is the Hamming weight of the link label for the transition from state $U$ to state $S$, and $v$ is the Hamming weight of the link label for the transition from state $V$ to state $S$. The transfer function $T(D)$ is defined as the ratio $S_{00}^d / S_{00}^s$. For the rate-1/3 convolutional code in example 6.11, the state transition equations are
$$S_{10} = D^3 S_{00}^s + D^2 S_{01}$$
$$S_{11} = D S_{10} + D^2 S_{11}$$
$$S_{01} = D^2 S_{10} + D S_{11}$$
$$S_{00}^d = D S_{01}$$
From the above state transition equations, we can write the transfer function
$$T(D) = \frac{S_{00}^d}{S_{00}^s} = \frac{2D^6 - D^8}{1 - D^2 - 2D^4 + D^6} = 2D^6 + D^8 + 5D^{10} + \ldots,$$ (6.10)
from which the exponent of $D$ in the first term (with the smallest exponent) is $d_{\mathrm{free}}$ (= 6) and the coefficient of the first term is $N_{\mathrm{free}}$ (= 2).
In general, the powers of $D$ in $T(D)$ indicate the Hamming weights of error events, and the coefficient of $D^w$ indicates how many error events have Hamming weight $w$. For example, the transfer function in (6.10) indicates that there are 2 error events with Hamming weight 6, one error event with Hamming weight 8, 5 error events with Hamming weight 10, and so on.
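The low-order coefficients of $T(D)$ can be cross-checked by directly counting paths in the modified state diagram. The sketch below is a hypothetical helper, not part of the original derivation: it counts paths that leave the all-zero state and first return to it, grouped by Hamming weight. It assumes a rate-1/n code in which every branch other than the zero-state self-loop has weight at least 1, which holds for the code in example 6.11.

```python
from collections import defaultdict

G = [[1, 1, 1], [1, 0, 1], [1, 0, 0]]  # example 6.11, rate 1/3, K = 3

def transitions(G):
    """Return a list of (state, next_state, output_weight) for a rate-1/n code."""
    K, n = len(G), len(G[0])
    edges = []
    for s in range(2 ** (K - 1)):
        state = tuple((s >> i) & 1 for i in range(K - 1))
        for bit in (0, 1):
            register = (bit,) + state
            weight = sum(sum(register[i] * G[i][m] for i in range(K)) % 2
                         for m in range(n))
            edges.append((state, register[:-1], weight))
    return edges

def error_event_counts(G, max_weight):
    """Count error events (first returns to the all-zero state) by Hamming weight."""
    zero = (0,) * (len(G) - 1)
    edges = transitions(G)
    # paths[w][state]: number of partial paths of weight w from the source to state
    paths = [defaultdict(int) for _ in range(max_weight + 1)]
    counts = defaultdict(int)
    for src, dst, bw in edges:  # first branch out of the source node
        if src == zero and dst != zero and bw <= max_weight:
            paths[bw][dst] += 1
    for w in range(max_weight + 1):
        for state, num in list(paths[w].items()):
            for src, dst, bw in edges:
                if src != state or w + bw > max_weight:
                    continue
                if dst == zero:
                    counts[w + bw] += num  # reached the destination node
                else:
                    paths[w + bw][dst] += num
    return dict(counts)

print(error_event_counts(G, max_weight=10))
```

For this code the counts for weights 6, 8, and 10 agree with the coefficients 2, 1, and 5 in (6.10).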
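For error probability, we consider the probability of an error event starting at a given time, assuming that no error event is in progress. The union bound estimate for this probability is
$$P_s \approx K_{\min}\, Q\!\left( \frac{d_{\min}/2}{\sqrt{N_0/2}} \right) = N_{\mathrm{free}}\, Q\!\left( \sqrt{\frac{2E_d\, d_{\mathrm{free}}}{N_0}} \right).$$
Since $E_d = \frac{k}{n} E_b$, we can write $P_s \approx N_{\mathrm{free}}\, Q\!\left( \sqrt{\frac{k d_{\mathrm{free}}}{n} \cdot \frac{2E_b}{N_0}} \right)$. As with binary linear block codes, we can normalize $P_s$ by $k$ to obtain the error probability per information bit, denoted by $P_b$, as shown below.
$$P_b \approx \frac{N_{\mathrm{free}}}{k}\, Q\!\left( \sqrt{\frac{k d_{\mathrm{free}}}{n} \cdot \frac{2E_b}{N_0}} \right)$$ (6.11)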
By comparing $P_b$ in (6.11) to $P_b^{\mathrm{UNCODED}}$ in (6.8) for uncoded binary PAM, we can quantify the coding gain, as illustrated in the following example.
Example 6.15 For the rate-1/3 convolutional code in example 6.11, we have $d_{\mathrm{free}} = 6$ and $N_{\mathrm{free}} = 2$. Therefore,
$$P_b \approx 2\, Q\!\left( \sqrt{\frac{4E_b}{N_0}} \right).$$
From the plot of $P_b$ versus $E_b/N_0$ in figure 6.11, we see that the coding gain is about 2.6 dB for $P_b = 10^{-5}$.
Figure 6.11: Error performance of the rate-1/3 convolutional code in example 6.11. (Plot of $\log_{10} P_b$ versus $E_b/N_0$ in dB for the uncoded and coded systems.)
Table 6.2 shows the coding gains for some known codes with soft decision decoding. These gains are computed based on the estimate of $P_b$ in (6.11) and the baseline $P_b^{\mathrm{UNCODED}}$ in (6.8). As previously mentioned, we shall not discuss the theory behind the construction of these codes in this course.
6.4 Summary
We discussed two fundamental types of channel coding: binary linear block codes
and binary linear convolutional codes. For the purpose of the course, we did not
discuss how to construct these codes, but discussed how to evaluate the error
performances of given codes. By comparing to uncoded binary PAM, we see that a
rate-k/n channel code can reduce the requirement of Eb /N0 for a given bit error rate
by the amount called the coding gain at the expense of using more bandwidth by a
factor of n/k.
Channel coding is a rich subject that requires a separate course to master the
materials. For more information, see for example on-line materials of the MIT Open
Courseware available for free at ocw.mit.edu.
Table 6.2: Coding gains at $P_b = 10^{-6}$ for some known convolutional codes [?, p. 2393].

(n, k, K, dfree, Nfree)    n/k    coding gain (dB)
(2,1,1,3,1)                2.0    1.8
(2,1,2,5,1)                2.0    4.0
(2,1,3,6,1)                2.0    4.8
(2,1,4,7,2)                2.0    5.2
(2,1,5,8,2)                2.0    5.8
(2,1,6,10,12)              2.0    6.3
(2,1,7,10,1)               2.0    7.0
(2,1,8,12,10)              2.0    7.1
(3,1,1,5,1)                3.0    2.2
(3,1,2,8,2)                3.0    4.1
(3,1,3,10,3)               3.0    4.9
(3,1,4,12,5)               3.0    5.6
(3,1,5,13,1)               3.0    6.4
(3,1,6,15,3)               3.0    6.7
(3,1,7,16,1)               3.0    7.3
(3,1,8,18,5)               3.0    7.4
(4,1,1,7,1)                4.0    2.4
(4,1,2,10,1)               4.0    4.0
(4,1,3,13,2)               4.0    4.9
(4,1,4,16,4)               4.0    5.6
(4,1,5,18,3)               4.0    6.2
(4,1,6,20,10)              4.0    6.4
(4,1,7,22,1)               4.0    7.4
(4,1,8,24,2)               4.0    7.6

6.5 Practice Problems
Problem 6.1 (Bit error probability of a binary linear block code): Consider the generator matrix $G$ for an $(n, k, d, N_d)$ binary linear block code given below.
$$G = \begin{bmatrix} 1 & 0 & 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 0 & 1 \end{bmatrix}$$
(a) Specify the values of $n$, $k$, $d$, and $N_d$.
(b) Assume that the uncoded binary PAM system has the bit error probability $Q\!\left( \sqrt{2E_b/N_0} \right)$, where $E_b$ is the energy per bit and $N_0/2$ is the PSD of AWGN. Find the union bound estimate of the error probability per bit $P_b$ in terms of $E_b/N_0$ for the above block code.
(c) Compute the coding gain (numerically in dB) for this block code at $P_b = 10^{-4}$.
NOTE: You will need a calculator to evaluate the Q function, e.g. MATLAB.
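Problem 6.2: Suppose you have to choose between two binary linear block codes with the following generator matrices. Based on the union bound estimate of the error probability per bit, which one would you choose and why?
$$G_1 = \begin{bmatrix} 1 & 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & 1 & 1 \end{bmatrix}, \quad G_2 = \begin{bmatrix} 1 & 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & 1 & 0 \end{bmatrix}$$
HINT: The following facts (not all) may be helpful.
$$Q\!\left( \sqrt{\frac{12E_b}{5N_0}} \right) = 10^{-5} \Rightarrow \frac{E_b}{N_0} = 8.8 \text{ dB}, \qquad \frac{1}{2} Q\!\left( \sqrt{\frac{12E_b}{5N_0}} \right) = 10^{-5} \Rightarrow \frac{E_b}{N_0} = 8.5 \text{ dB}$$
$$Q\!\left( \sqrt{\frac{8E_b}{5N_0}} \right) = 10^{-5} \Rightarrow \frac{E_b}{N_0} = 10.6 \text{ dB}, \qquad \frac{1}{2} Q\!\left( \sqrt{\frac{8E_b}{5N_0}} \right) = 10^{-5} \Rightarrow \frac{E_b}{N_0} = 10.2 \text{ dB}$$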
Problem 6.3 (Binary linear convolutional code and Viterbi algorithm): Consider a binary linear convolutional code described by the following generator matrix.
$$G = \begin{bmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 1 \end{bmatrix}$$
(a) Draw the state transition diagram for this code.
(b) Assume that we use binary PAM transmission through an AWGN channel with the expected energy per dimension (or per transmission) $E_d = 1$. In addition, bits 0 and 1 are mapped to amplitudes $1$ and $-1$ respectively. Assume that 3 information bits followed by 2 zero tail bits are transmitted. Suppose that the received signals are given by
(0.1, 0.4, 0.2, 0.3, 0.2, 0.4, 0.1, 0.4, 0.2, 0.1, 0.1, 0.3, 0.2, 0.1, 0.3).
Use the Viterbi algorithm to perform soft decision decoding of this received sequence. Identify the 3 information bits.
(c) Assume that uncoded binary PAM transmission has the bit error probability $Q\!\left( \sqrt{2E_b/N_0} \right)$, where $E_b$ is the energy per bit and $N_0/2$ is the PSD of AWGN. Specify $d_{\mathrm{free}}$ and $N_{\mathrm{free}}$, and use the union bound estimate to express the error probability per bit $P_b$.
Problem 6.4: Consider the structure of a convolutional encoder with constraint length 3 and the shift register implementation shown below.
[Figure: shift register with one current bit and stored bits; 1 input bit is shifted in and 2 output bits are produced in each step.]
(a) What is the rate of this convolutional code?
(b) Write down the generator matrix of this code.
(c) Draw the state transition diagram for this code.
(d) Find the parameters $d_{\mathrm{free}}$ and $N_{\mathrm{free}}$ of this code.
Now consider a binary linear block code whose generator matrix is shown below.
$$G = \begin{bmatrix} 1 & 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & 1 & 1 \end{bmatrix}$$
(e) Assume that uncoded binary PAM transmission has the bit error probability $Q\!\left( \sqrt{2E_b/N_0} \right)$, where $E_b$ is the energy per bit, and $N_0/2$ is the PSD of AWGN. Find the union bound estimate of the error probability per bit $P_b$ in terms of $E_b/N_0$ for the above code.
Chapter 7
Capacities of Communication Channels
In this chapter, we discuss fundamental limits on transmission rates of digital communication systems. We shall focus on discrete-time channel models. In the first part, we shall consider discrete channels, where both inputs and outputs of the communication systems belong to finite sets of possible values. In the second part, we shall consider discrete-time additive white Gaussian noise (AWGN) channels, where outputs are inputs plus Gaussian noise random variables (RVs) and are therefore continuous RVs. To treat AWGN channels, we introduce the differential entropy and its related definitions. After that, we derive the channel capacity formula for AWGN channels.
Since the discussion involves information theory, reviewing the concept of the entropy and its basic properties can be helpful. In addition, we shall introduce the mutual information, which is another important fundamental quantity in information theory.
7.1 Discrete Memoryless Channels
Consider a discrete-time channel model (the first symbol has index 1)
$$R_j = S_j + N_j, \quad j \in \{1, 2, \ldots\},$$ (7.1)
where $R_j$, $S_j$, $N_j$ are observation, signal, and noise RVs at symbol time $j$ respectively. Note that the discrete-time AWGN channel in (5.32) is one example of the channel model in (7.1).
For a discrete-time channel, if the set of input values $\mathcal{X}$ and the set of output values $\mathcal{Y}$ are finite (not the case for the discrete-time AWGN channel), then we call the discrete-time channel a discrete channel.
Example 7.1 Let $X_j = S_j$ be the input signal RV, and $Y_j = \hat{H}(R_j)$ be the decision RV (on the value of $S_j$) at the receiver of a discrete-time AWGN channel. Usually, $S_j$ and $\hat{H}(R_j)$ take on a finite number of values, e.g. $M$ values for M-point signal sets. Then $S_j$ and $\hat{H}(R_j)$ can be considered as the input and the output of a discrete channel.
Since $R_j$ is a continuous RV, $S_j$ and $R_j$ cannot be considered as the input and the output of a discrete channel.

Figure 7.1: BSC model for binary PAM.

In general, a discrete channel consists of the set of input values $\mathcal{X}$, the set of output values $\mathcal{Y}$, and a conditional probability mass function (PMF) $f_{Y|X}(y|x)$.
For a discrete memoryless channel (DMC), the n-tuple input sequence $\mathbf{x} = (x_1, \ldots, x_n) \in \mathcal{X}^n$ and the n-tuple output sequence $\mathbf{y} = (y_1, \ldots, y_n) \in \mathcal{Y}^n$ have their conditional PMF satisfying
$$f_{Y_1,\ldots,Y_n|X_1,\ldots,X_n}(y_1, \ldots, y_n | x_1, \ldots, x_n) = \prod_{j=1}^{n} f_{Y_j|X_j}(y_j|x_j).$$ (7.2)
Example 7.2 Consider transmitting uncoded information bits using binary pulse amplitude modulation (PAM) with the mapping $0 \to \sqrt{E_b}$ and $1 \to -\sqrt{E_b}$ over an AWGN channel with noise power spectral density (PSD) $N_0/2$. Recall that the probability of bit error $P_b$ is given by $P_b = Q\!\left( \sqrt{2E_b/N_0} \right)$.
With no intersymbol interference (ISI), we can model the channel as a DMC with $\mathcal{X} = \mathcal{Y} = \{0, 1\}$ and the following conditional PMF.
$$f_{Y|X}(y|0) = \begin{cases} 1 - P_b, & y = 0 \\ P_b, & y = 1 \end{cases} \qquad f_{Y|X}(y|1) = \begin{cases} P_b, & y = 0 \\ 1 - P_b, & y = 1 \end{cases}$$
Such a DMC is called a binary-symmetric channel (BSC). The probability of error $P_b$ is called the crossover probability. The corresponding BSC for binary PAM is illustrated in figure 7.1.
7.2 Mutual Information
The mutual information between $X$ and $Y$, denoted by $I(X; Y)$, is defined as
$$I(X; Y) = H(X) - H(X|Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} f_{X,Y}(x, y) \log \frac{f_{X,Y}(x, y)}{f_X(x) f_Y(y)}.$$ (7.3)
The mutual information $I(X; Y)$ can be considered as the amount of information that $Y$ provides about $X$. In other words, it is the reduction in the uncertainty about $X$ from the knowledge of $Y$. From the expression of $I(X; Y)$ in the right hand side of (7.3), we see that there is a symmetry between the roles of $X$ and $Y$. It follows that $I(X; Y) = I(Y; X)$. In addition, from theorem 3.3 and the definition of $I(X; Y)$, we see that the mutual information is nonnegative, i.e. $I(X; Y) \geq 0$.
Two special cases should be noted. First, if $X$ and $Y$ are independent, then $I(X; Y) = H(X) - H(X) = 0$. Intuitively, in this case, the knowledge of $Y$ does not provide any information about $X$. Second, if $X$ is a function of $Y$, then $I(X; Y) = H(X) - 0 = H(X)$. Intuitively, in this case, the knowledge of $Y$ tells us everything about $X$, and reduces the uncertainty about $X$ by its full amount $H(X)$.
Example 7.3 Consider again the BSC in example 7.2. Let $\varepsilon = P_b$, i.e. the crossover probability, and $p = \Pr\{X = 0\}$. The conditional entropy of $Y$ given a specific value of $X$ is
$$H(Y|X = 0) = H(Y|X = 1) = H_{\mathrm{bin}}(\varepsilon),$$
which is also equal to the conditional entropy $H(Y|X)$.
The mutual information $I(X; Y)$ can be expressed as
$$I(X; Y) = H(Y) - H(Y|X) = H_{\mathrm{bin}}(p(1 - \varepsilon) + (1 - p)\varepsilon) - H_{\mathrm{bin}}(\varepsilon).$$
Suppose we want to obtain $H(X|Y)$. To do so, we use the identity $I(X; Y) = H(X) - H(X|Y)$ to write
$$H(X|Y) = H(X) - I(X; Y) = H_{\mathrm{bin}}(p) - H_{\mathrm{bin}}(p(1 - \varepsilon) + (1 - p)\varepsilon) + H_{\mathrm{bin}}(\varepsilon).$$
Figure 7.2 illustrates $H(X|Y)$ and $I(X; Y)$ for the BSC with different values of $p$ and $\varepsilon$.
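The expressions for the BSC in example 7.3 are simple to evaluate. The following is a minimal sketch with hypothetical function names; `p` denotes $\Pr\{X = 0\}$, `eps` denotes the crossover probability, and entropies are in bits (base-2 logarithm).

```python
import math

def h_bin(q):
    """Binary entropy function in bits, with the convention 0 log 0 = 0."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1.0 - q) * math.log2(1.0 - q)

def bsc_mutual_information(p, eps):
    """I(X;Y) = H(Y) - H(Y|X) for a BSC."""
    py0 = p * (1.0 - eps) + (1.0 - p) * eps  # Pr{Y = 0}
    return h_bin(py0) - h_bin(eps)

def bsc_equivocation(p, eps):
    """H(X|Y) = H(X) - I(X;Y)."""
    return h_bin(p) - bsc_mutual_information(p, eps)

for eps in (0.0, 0.1, 0.3, 0.5):
    print(eps, bsc_mutual_information(0.5, eps), bsc_equivocation(0.5, eps))
```

With a crossover probability of 0.5 the output is independent of the input, so the mutual information is zero for every `p`.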

The conditional mutual information between RVs $X$ and $Y$ given RV $Z$, denoted by $I(X; Y|Z)$, is defined as
$$I(X; Y|Z) = H(X|Z) - H(X|Z, Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \sum_{z \in \mathcal{Z}} f_{X,Y,Z}(x, y, z) \log \frac{f_{X,Y|Z}(x, y|z)}{f_{X|Z}(x|z) f_{Y|Z}(y|z)}.$$ (7.4)
The conditional mutual information $I(X; Y|Z)$ can be considered as the amount of information that $Y$ provides about $X$ given $Z$.
As for the definition of conditional entropy, the definition of conditional mutual information can be extended in a straightforward fashion to more than two RVs. The following theorem provides a useful identity for multiple RVs. The proof can be done using induction.
Theorem 7.1 (Chain rule for mutual information): Consider discrete RVs $X_1, \ldots, X_n, Y$.
$$I(X_1, \ldots, X_n; Y) = I(X_1; Y) + \sum_{j=2}^{n} I(X_j; Y | X_1, \ldots, X_{j-1})$$
Figure 7.2: $H(X|Y)$ and $I(X; Y)$ for a BSC. (Two panels, one for $H(X|Y)$ and one for $I(X;Y)$, each with curves for crossover probabilities 0, 0.1, 0.3, and 0.5.)
7.3 Capacity of a DMC
An $(n, k)$ channel code (binary or nonbinary) for the DMC with $\mathcal{X}$, $\mathcal{Y}$, and $f_{Y|X}(y|x)$ consists of the following.
1. The index set $\{1, \ldots, 2^k\}$ for $2^k$ possible hypotheses or signal points. Let $U$ be the hypothesis RV (see footnote 1).
2. An encoding mapping $x : \{1, \ldots, 2^k\} \to \mathcal{X}^n$ that maps each index $m$ in $\{1, \ldots, 2^k\}$ to a codeword $\mathbf{x}_m = (x_{m,1}, \ldots, x_{m,n})$ in $\mathcal{X}^n$. The set of codewords $\mathcal{C}$ is called a codebook or simply a code.
3. A decoding mapping $u : \mathcal{Y}^n \to \{1, \ldots, 2^k\}$ that maps each possible received vector $\mathbf{y} = (y_1, \ldots, y_n)$ in $\mathcal{Y}^n$ to a hypothesis index $u(\mathbf{y})$ in $\{1, \ldots, 2^k\}$.
Let $\lambda_m = \Pr\{u(\mathbf{Y}) \neq U \mid U = m\}$ be the decision error probability under hypothesis $m$. Let $\lambda^n_{\max} = \max_{m \in \{1,\ldots,2^k\}} \lambda_m$ be the maximum error probability for an $(n, k)$ code. Let $P_e^n = \frac{1}{2^k} \sum_{m=1}^{2^k} \lambda_m$ be the average error probability for an $(n, k)$ code. Note that the rate of an $(n, k)$ code is $k/n$ bit/dimension, or equivalently $k/n$ bits per transmission.
Let $R$ be the information bit rate (in bit/dimension). We say that $R$ is achievable if there is a sequence of $(n, Rn)$ codes with $\lambda^n_{\max} \to 0$ as $n \to \infty$. The channel capacity of a DMC with $\mathcal{X}$, $\mathcal{Y}$, and $f_{Y|X}(y|x)$, denoted by $C$, is defined as
$$C = \max_{f_X(x)} I(X; Y) \quad \text{(in bit/dimension)},$$ (7.5)
where $I(X; Y)$ denotes the mutual information between $X$ and $Y$. Note that the maximization is performed over all input PMFs. In a later section, we shall prove the following theorem on the capacity of a DMC.
Footnote 1: Unlike in the previous two chapters, we use $U$ instead of $H$ to denote the hypothesis RV because $H$ will denote the entropy in this chapter.
Theorem 7.2 (Channel coding theorem for a DMC): All rates $R < C$ are achievable. Conversely, if a rate $R$ is achievable, then $R \leq C$.
Before we can justify theorem 7.2, we need to develop additional analytical tools in information theory. The next two sections are for this purpose.
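To make the maximization in the capacity definition concrete, the sketch below searches over the input PMF of a BSC on a uniform grid. It is only an illustration under stated assumptions: the function names and the grid size are hypothetical, and the comparison value $1 - H_{\mathrm{bin}}(\varepsilon)$ is the standard closed-form BSC capacity, which is not derived in the text above.

```python
import math

def h_bin(q):
    """Binary entropy function in bits, with 0 log 0 = 0."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1.0 - q) * math.log2(1.0 - q)

def bsc_mutual_information(p, eps):
    """I(X;Y) for a BSC with Pr{X=0} = p and crossover probability eps."""
    return h_bin(p * (1.0 - eps) + (1.0 - p) * eps) - h_bin(eps)

def bsc_capacity(eps, grid=1001):
    """Crude grid search for max over p of I(X;Y); returns (best p, capacity)."""
    candidates = (i / (grid - 1) for i in range(grid))
    best_p = max(candidates, key=lambda p: bsc_mutual_information(p, eps))
    return best_p, bsc_mutual_information(best_p, eps)

for eps in (0.0, 0.1, 0.3):
    p_star, cap = bsc_capacity(eps)
    print(eps, p_star, cap, 1.0 - h_bin(eps))
```

For a general DMC the maximization is over a PMF with more than one free parameter, so a grid search of this kind does not scale; it is used here only because the BSC has a single parameter.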
7.4 Jointly Typical Sequences and Joint AEP
First, let us briefly review the asymptotic equipartition property (AEP). Consider a sequence of independent and identically distributed (IID) RVs $X_1, X_2, \ldots$, where each $X_j$ takes its value in the set $\mathcal{X}$ according to the PMF $f_X(x)$. Let $H(X)$ denote the entropy of $X$. The typical set $T_\varepsilon^n$ with respect to the PMF $f_X(x)$ is the set of sequences $\mathbf{x} = (x_1, \ldots, x_n)$ that satisfy the inequality
$$2^{-n(H(X)+\varepsilon)} < f_{\mathbf{X}}(\mathbf{x}) < 2^{-n(H(X)-\varepsilon)}, \text{ or equivalently } \left| -\frac{1}{n} \log f_{\mathbf{X}}(\mathbf{x}) - H(X) \right| < \varepsilon.$$
The AEP states that, for sufficiently large $n$, we have the following properties.
1. $\Pr\{T_\varepsilon^n\} > 1 - \varepsilon$.
2. $(1 - \varepsilon) 2^{n(H(X)-\varepsilon)} < |T_\varepsilon^n| < 2^{n(H(X)+\varepsilon)}$.
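The two AEP properties can be checked numerically for a binary IID source. The sketch below is illustrative only: the function name is hypothetical, the source is assumed binary with $\Pr\{X = 1\} = p_1$, and sequences are grouped by their number of ones since all sequences with the same count have the same probability.

```python
import math

def typical_set_stats(p1, n, eps):
    """Return (Pr{T}, |T|, H) for the typical set of a binary IID source."""
    H = -p1 * math.log2(p1) - (1.0 - p1) * math.log2(1.0 - p1)
    prob, size = 0.0, 0
    for ones in range(n + 1):
        # log2-probability of any one sequence with this many ones
        logp = ones * math.log2(p1) + (n - ones) * math.log2(1.0 - p1)
        if abs(-logp / n - H) < eps:  # typicality condition
            count = math.comb(n, ones)  # number of such sequences
            size += count
            prob += count * 2.0 ** logp
    return prob, size, H

prob, size, H = typical_set_stats(p1=0.3, n=200, eps=0.1)
print("Pr{T} =", prob)
print("log2|T| / n =", math.log2(size) / n if (n := 200) else None, " H =", H)
```

For these parameters the typical set carries nearly all of the probability while containing only a small fraction of the $2^{200}$ binary sequences.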
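Roughly speaking, as $n \to \infty$, there are about $2^{nH(X)}$ equally likely sequences. Now consider a sequence of IID pairs of RVs $(X_1, Y_1), (X_2, Y_2), \ldots$, where each $(X_j, Y_j)$ takes its value in the set $\mathcal{X} \times \mathcal{Y}$ with joint PMF $f_{X,Y}(x, y)$. The jointly typical set $A_\varepsilon^n$ with respect to $f_{X,Y}(x, y)$ is the set of sequences $(\mathbf{x}, \mathbf{y}) = (x_1, \ldots, x_n, y_1, \ldots, y_n)$ such that
$$\left| -\frac{1}{n} \log f_{\mathbf{X}}(\mathbf{x}) - H(X) \right| < \varepsilon, \quad \left| -\frac{1}{n} \log f_{\mathbf{Y}}(\mathbf{y}) - H(Y) \right| < \varepsilon, \quad \left| -\frac{1}{n} \log f_{\mathbf{X},\mathbf{Y}}(\mathbf{x}, \mathbf{y}) - H(X, Y) \right| < \varepsilon.$$ (7.6)
We next derive the AEP for jointly typical sequences. This AEP will be useful in proving the forward part of theorem 7.2.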
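Theorem 7.3 (Joint AEP): For sufficiently large $n$, we have the following.
1. $\Pr\{A_\varepsilon^n\} > 1 - \varepsilon$.
2. $(1 - \varepsilon) 2^{n(H(X,Y)-\varepsilon)} < |A_\varepsilon^n| < 2^{n(H(X,Y)+\varepsilon)}$.
3. Let $(\tilde{\mathbf{X}}, \tilde{\mathbf{Y}}) = (\tilde{X}_1, \ldots, \tilde{X}_n, \tilde{Y}_1, \ldots, \tilde{Y}_n)$ be a sequence of RVs. If their joint PMF satisfies $f_{\tilde{\mathbf{X}},\tilde{\mathbf{Y}}}(\tilde{\mathbf{x}}, \tilde{\mathbf{y}}) = f_{\mathbf{X}}(\tilde{\mathbf{x}}) f_{\mathbf{Y}}(\tilde{\mathbf{y}})$, then
$$(1 - \varepsilon) 2^{-n(I(X;Y)+3\varepsilon)} < \Pr\left\{ (\tilde{\mathbf{X}}, \tilde{\mathbf{Y}}) \in A_\varepsilon^n \right\} < 2^{-n(I(X;Y)-3\varepsilon)}.$$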

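Proof:
1. Define RV $W_j = -\log f_{X_j}(X_j)$. Note that $E[W_j] = H(X)$. Let $\sigma_W^2$ denote the variance of $W_j$. From the weak law of large numbers (WLLN), we can write
$$\Pr\left\{ \left| \frac{1}{n} \sum_{j=1}^{n} W_j - H(X) \right| \geq \varepsilon \right\} \leq \frac{\sigma_W^2}{n\varepsilon^2}.$$
By choosing $n$ large enough, say $n \geq n_1$, so that $\frac{\sigma_W^2}{n\varepsilon^2} < \frac{\varepsilon}{3}$, it follows that
$$\Pr\left\{ \left| -\frac{1}{n} \log f_{\mathbf{X}}(\mathbf{X}) - H(X) \right| \geq \varepsilon \right\} = \Pr\left\{ \left| \frac{1}{n} \sum_{j=1}^{n} W_j - H(X) \right| \geq \varepsilon \right\} < \frac{\varepsilon}{3}.$$ (7.7)
Similarly, with $n$ sufficiently large, say $n \geq n_2$,
$$\Pr\left\{ \left| -\frac{1}{n} \log f_{\mathbf{Y}}(\mathbf{Y}) - H(Y) \right| \geq \varepsilon \right\} < \frac{\varepsilon}{3}.$$ (7.8)
Finally, define RV $Z_j = -\log f_{X_j,Y_j}(X_j, Y_j)$. Note that $E[Z_j] = H(X, Y)$. Let $\sigma_Z^2$ denote the variance of $Z_j$. From the WLLN,
$$\Pr\left\{ \left| \frac{1}{n} \sum_{j=1}^{n} Z_j - H(X, Y) \right| \geq \varepsilon \right\} \leq \frac{\sigma_Z^2}{n\varepsilon^2}.$$
By choosing $n$ large enough, say $n \geq n_3$, so that $\frac{\sigma_Z^2}{n\varepsilon^2} < \frac{\varepsilon}{3}$,
$$\Pr\left\{ \left| -\frac{1}{n} \log f_{\mathbf{X},\mathbf{Y}}(\mathbf{X}, \mathbf{Y}) - H(X, Y) \right| \geq \varepsilon \right\} < \frac{\varepsilon}{3}.$$ (7.9)
By choosing $n \geq \max(n_1, n_2, n_3)$, the probability that a given sequence $(x_1, \ldots, x_n, y_1, \ldots, y_n)$ is not in $A_\varepsilon^n$ is, from the union bound, upper bounded by the summation of the three probabilities in (7.7), (7.8), and (7.9). Therefore, for large $n$, we can write $1 - \Pr\{A_\varepsilon^n\} < \varepsilon$, or equivalently $\Pr\{A_\varepsilon^n\} > 1 - \varepsilon$.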

2. Since each sequence (x, y) in An satises n1 log f X,Y (x, y) H(X, Y ) < ,
n(H(X,Y )+)
we
< f X,Y (x, y) < 2n(H(X,Y )) . It follows that 1
can write 2
n n(H(X,Y )+)
f X,Y (x, y) > |A | 2
, yielding |An | < 2n(H(X,Y )+) .
(x,y)An
From statement 1
of the theorem and f X,Y (x, y) < 2n(H(X,Y )) , with n large
enough, 1 <
f X,Y (x, y) < |An | 2n(H(X,Y )) , yielding |An | >
(x,y)An
(1 )2n(H(X,Y )) .
x, y
) = f X (
x)f Y (
y), we can write
3. From the assumption that f X,
Y
(
}
{
Y)
An =
f X (
x)f Y (
y).
Pr (X,
(
x,
y)An
From |An | < 2n(H(X,Y )+) , f X (x) < 2n(H(X)) , and f Y (y) < 2n(H(Y )) ,
{
}
Y)
An < 2n(H(X,Y )+) 2n(H(X)) 2n(H(Y )) = 2n(I(X;Y )3)
Pr (X,
7.5. DATA PROCESSING AND FANO INEQUALITIES
129
From |An | > (1 )2n(H(X,Y )) , f X (x) > 2n(H(X)+) , and f Y (y) > 2n(H(Y )+) ,
{
}
n
Pr (X, Y) A
> (1 )2n(H(X,Y )) 2n(H(X)+) 2n(H(Y )+)
= (1 )2n(I(X;Y )+3) .
Combining the above two inequalities, we obtain statement 3.
Roughly speaking, the joint AEP tells us that there are about $2^{nH(X)}$ typical
$\mathbf{x}$-sequences, about $2^{nH(Y)}$ typical $\mathbf{y}$-sequences, but only about $2^{nH(X,Y)}$ jointly
typical $(\mathbf{x}, \mathbf{y})$-sequences. Thus, the probability that a randomly selected
$(\mathbf{x}, \mathbf{y})$-sequence is jointly typical is about
$$\frac{2^{nH(X,Y)}}{2^{nH(X)} \, 2^{nH(Y)}} = 2^{-nI(X;Y)}.$$
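As a numerical illustration of the single-sequence AEP reviewed above, the following sketch computes the probability and the size of the typical set for an IID Bernoulli source by grouping sequences according to their number of ones. The function name `bernoulli_typical_set` and the parameter values are illustrative assumptions that do not appear in the text, and logarithms are taken to base 2.

```python
from math import comb, log2


def bernoulli_typical_set(p: float, n: int, eps: float):
    """Return (probability, size, entropy) of the typical set of n IID Bernoulli(p) bits."""
    h = -p * log2(p) - (1 - p) * log2(1 - p)  # entropy H(X) in bits
    prob, size = 0.0, 0
    for k in range(n + 1):  # k = number of ones in the sequence
        # Every sequence with k ones has the same probability p^k (1-p)^(n-k).
        sample_entropy = -(k * log2(p) + (n - k) * log2(1 - p)) / n
        if abs(sample_entropy - h) < eps:
            count = comb(n, k)  # number of sequences with k ones
            size += count
            prob += count * p**k * (1 - p) ** (n - k)
    return prob, size, h


prob, size, h = bernoulli_typical_set(p=0.2, n=1000, eps=0.1)
print(f"Pr(typical set) = {prob:.6f}")
print(f"log2(size)/n    = {log2(size) / 1000:.4f}  (H(X) = {h:.4f})")
```

For these values the typical set carries almost all of the probability, while its size stays between the two bounds in property 2 of the AEP.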
7.5 Data Processing and Fano Inequalities
In this section, we establish two inequalities that will be useful for proving the
converse part of theorem 7.2.
Three RVs $X, Y, Z$ form a Markov chain, denoted by $X \to Y \to Z$, if the
conditional PMF of $Z$ given $X$ and $Y$ depends only on $Y$, i.e.
$$f_{Z|Y,X}(z|y, x) = f_{Z|Y}(z|y).$$
In particular, consider three RVs X, Y, Z. If Z is a function of Y , i.e. Z = g(Y ),
then X, Y, Z form a Markov chain. We now establish a useful result called the data
processing inequality.
Theorem 7.4 (Data processing inequality): If $X \to Y \to Z$, then $I(X; Y) \ge I(X; Z)$
and $I(Y; Z) \ge I(X; Z)$.
Proof: Using the chain rule for mutual information, we can expand $I(X; Y, Z)$ in
two ways, as shown below.
$$I(X; Y, Z) = I(X; Z) + I(X; Y|Z) = I(X; Y) + I(X; Z|Y)$$
Since $X \to Y \to Z$, $f_{X,Z|Y}(x, z|y) = f_{X|Y}(x|y) f_{Z|Y,X}(z|y, x) = f_{X|Y}(x|y) f_{Z|Y}(z|y)$.
Thus, $X$ and $Z$ are independent given $Y$. Consequently, $I(X; Z|Y) = 0$ and
$$I(X; Z) + I(X; Y|Z) = I(X; Y).$$
Since $I(X; Y|Z) \ge 0$ (nonnegativity of mutual information), $I(X; Y) \ge I(X; Z)$.
The proof of $I(Y; Z) \ge I(X; Z)$ is similar (start with $I(Z; Y, X)$ instead of with
$I(X; Y, Z)$) and is thus omitted.

Note that, when Z = g(Y ), the data processing inequality tells us that no clever
processing of data Y can increase the amount of information about X from that
already contained in Y .
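The data processing inequality can be checked numerically on a small Markov chain. The sketch below builds the joint PMFs of $(X, Y)$, $(X, Z)$, and $(Y, Z)$ from an input PMF and two channel matrices, then evaluates the three mutual informations in bits. The helper names and the particular matrices are hypothetical choices made only for illustration.

```python
from math import log2


def mutual_information(joint):
    """I(A;B) in bits from a joint PMF given as a list of rows joint[a][b]."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    return sum(
        p * log2(p / (pa[i] * pb[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0
    )


def markov_chain_informations(px, chan_xy, chan_yz):
    """Return (I(X;Y), I(X;Z), I(Y;Z)) for the Markov chain X -> Y -> Z."""
    nx, ny, nz = len(px), len(chan_xy[0]), len(chan_yz[0])
    pxy = [[px[x] * chan_xy[x][y] for y in range(ny)] for x in range(nx)]
    py = [sum(pxy[x][y] for x in range(nx)) for y in range(ny)]
    # Z depends on X only through Y, so sum over the intermediate value y.
    pxz = [
        [sum(pxy[x][y] * chan_yz[y][z] for y in range(ny)) for z in range(nz)]
        for x in range(nx)
    ]
    pyz = [[py[y] * chan_yz[y][z] for z in range(nz)] for y in range(ny)]
    return mutual_information(pxy), mutual_information(pxz), mutual_information(pyz)


px = [0.3, 0.7]
chan_xy = [[0.8, 0.1, 0.1], [0.2, 0.2, 0.6]]  # rows: f_{Y|X}(.|x)
chan_yz = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]  # rows: f_{Z|Y}(.|y)
ixy, ixz, iyz = markov_chain_informations(px, chan_xy, chan_yz)
print(f"I(X;Y) = {ixy:.4f}, I(Y;Z) = {iyz:.4f}, I(X;Z) = {ixz:.4f}")
```

Both $I(X; Y)$ and $I(Y; Z)$ come out at least as large as $I(X; Z)$, as theorem 7.4 requires.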
Consider now a discrete RV $X$ taking a value from the set $\mathcal{X}$. Suppose we guess the
value of $X$ from a RV $Y$ that is related to $X$ by the conditional PMF $f_{Y|X}(y|x)$.
Let $\hat{X} = g(Y)$ be the guess. Let $P_e = \Pr\{\hat{X} \ne X\}$ be the probability of error.
Intuitively, if the conditional entropy $H(X|Y)$ is small, then $P_e$ should be small, and
vice versa. The Fano inequality quantifies the bound on $P_e$ based on $H(X|Y)$.
Theorem 7.5 (Fano inequality): The probability of error $P_e$ in guessing $X$ from
$Y$ is bounded by
$$1 + P_e \log(|\mathcal{X}| - 1) \ge H(X|Y), \quad \text{or equivalently} \quad P_e \ge \frac{H(X|Y) - 1}{\log(|\mathcal{X}| - 1)}.$$
Proof: Define an indicator RV $E$ as follows.
$$E = \begin{cases} 1, & \hat{X} \ne X \\ 0, & \hat{X} = X \end{cases}$$
Using the chain rule to expand $H(E, X|Y)$ in two ways, we can write
$$H(E, X|Y) = H(X|Y) + H(E|Y, X) = H(E|Y) + H(X|Y, E).$$
Since $E$ is known given $X$ and $Y$, $H(E|Y, X) = 0$. Since $E$ takes one of two values,
$H(E|Y) \le \log 2 = 1$. Finally, we can express $H(X|Y, E)$ as
$$H(X|Y, E) = \Pr\{E = 0\} H(X|Y, E = 0) + \Pr\{E = 1\} H(X|Y, E = 1) = (1 - P_e) \cdot 0 + P_e H(X|Y, E = 1) \le P_e \log(|\mathcal{X}| - 1),$$
where the last inequality follows since $X$ takes one of $|\mathcal{X}| - 1$ values given that
$\hat{X} \ne X$.
From $H(E|Y, X) = 0$, $H(E|Y) \le 1$, and $H(X|Y, E) \le P_e \log(|\mathcal{X}| - 1)$, we can write
$$H(X|Y) + 0 \le 1 + P_e \log(|\mathcal{X}| - 1),$$
which is the Fano inequality.
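To see the Fano inequality at work, the sketch below takes a joint PMF of $(X, Y)$, uses the most likely value of $X$ for each $y$ as the guess $g(y)$, and compares $H(X|Y)$ with $1 + P_e \log_2(|\mathcal{X}| - 1)$. The function name `fano_check` and the 4-ary symmetric example are assumptions introduced here for illustration; the inequality itself holds for any guessing rule.

```python
from math import log2


def fano_check(joint):
    """joint[x][y] = Pr{X = x, Y = y}.

    Returns (P_e of the most-likely-x guess, H(X|Y) in bits, Fano bound).
    """
    num_x = len(joint)
    columns = list(zip(*joint))  # columns[y][x] = Pr{X = x, Y = y}
    py = [sum(col) for col in columns]
    # The guess g(y) = argmax_x Pr{X = x, Y = y} is correct with probability max(col).
    pe = 1.0 - sum(max(col) for col in columns)
    h_x_given_y = -sum(
        p * log2(p / py[y]) for row in joint for y, p in enumerate(row) if p > 0
    )
    bound = 1.0 + pe * log2(num_x - 1)
    return pe, h_x_given_y, bound


# 4-ary symmetric example: uniform input, correct output w.p. 0.7, each wrong output w.p. 0.1.
joint = [[0.25 * (0.7 if x == y else 0.1) for y in range(4)] for x in range(4)]
pe, h, bound = fano_check(joint)
print(f"P_e = {pe:.3f}, H(X|Y) = {h:.4f} <= bound = {bound:.4f}")
```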
7.6 Proof of Channel Coding Theorem for DMCs
We now prove theorem 7.2 in two steps. First, we justify the forward part of the
theorem by showing that any information bit rate R < C is achievable. Then we
justify the converse part of the theorem by showing that any rate R > C is not
achievable.
Achievability of R < C
The proof that R < C is achievable is based on the random coding argument.
Consider the transmission scheme described below.
Transmission and detection: We first randomly generate an $(n, Rn)$ code $\mathcal{C}$ with $2^{Rn}$
codewords according to PMF $f_X(x)$.$^2$ In particular, we can write the codebook as a
matrix
$$\mathcal{C} = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{2^{Rn},1} & x_{2^{Rn},2} & \cdots & x_{2^{Rn},n} \end{bmatrix},$$
where the entries are generated IID with PMF $f_X(x)$, and the codewords are listed
as the rows of the matrix. Note that the probability of obtaining a particular $\mathcal{C}$ is
given by
$$\Pr\{\mathcal{C}\} = \prod_{m=1}^{2^{Rn}} \prod_{j=1}^{n} f_X(x_{m,j}).$$
Assume that the transmitter and the receiver both know the code $\mathcal{C}$ and the channel
description $f_{Y|X}(y|x)$. The receiver observes $\mathbf{y} = (y_1, \ldots, y_n)$ and uses typical set
decoding; the decision is to set $\hat{U}(\mathbf{y}) = m$ if the following two conditions hold.
1. $(\mathbf{x}_m, \mathbf{y})$ is jointly typical, i.e. $(\mathbf{x}_m, \mathbf{y}) \in A_\epsilon^n$.
2. There is no other index $m' \ne m$ such that $(\mathbf{x}_{m'}, \mathbf{y})$ is jointly typical.
Otherwise, the receiver declares an error, i.e. the transmission is not successful.
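Analysis of error probability: Let $\Pr\{E\}$ denote the error probability averaged over
all the codewords as well as over all the codebooks, i.e.
$$\Pr\{E\} = \sum_{\mathcal{C}} \Pr\{\mathcal{C}\} P_e^n(\mathcal{C}),$$
where $P_e^n(\mathcal{C})$ is the error probability averaged over all the codewords for codebook
$\mathcal{C}$. From the definition $P_e^n = \frac{1}{2^{Rn}} \sum_{m=1}^{2^{Rn}} \lambda_m$, where $\lambda_m = \Pr\{\hat{U}(\mathbf{y}) \ne U | U = m\}$,
$$\Pr\{E\} = \frac{1}{2^{Rn}} \sum_{m=1}^{2^{Rn}} \sum_{\mathcal{C}} \Pr\{\mathcal{C}\} \lambda_m(\mathcal{C}).$$
By the symmetry of the code construction, $\sum_{\mathcal{C}} \Pr\{\mathcal{C}\} \lambda_m(\mathcal{C})$ does not depend on $m$.
Therefore, we can write
$$\Pr\{E\} = \sum_{\mathcal{C}} \Pr\{\mathcal{C}\} \lambda_1(\mathcal{C}) = \Pr\{E | U = 1\}.$$
$^2$ For simplicity, we assume that $Rn$ is an integer. Note that this is always possible for large $n$ if
$R$ is rational.
Let $E_m$ denote the event that $(\mathbf{x}_m, \mathbf{y})$ is jointly typical. Let $E_m^c$ denote the
complement of $E_m$. From typical set decoding,
$\Pr\{E | U = 1\} = \Pr\{E_1^c \cup E_2 \cup \ldots \cup E_{2^{Rn}}\}$. Using the union bound,
$$\Pr\{E | U = 1\} \le \Pr\{E_1^c\} + \sum_{m=2}^{2^{Rn}} \Pr\{E_m\}.$$
From joint AEP (statement 1 of theorem 7.3), $\Pr\{E_1^c\} < \epsilon$ for large $n$. From the
code construction, for $m \ne 1$, $\mathbf{x}_m$ and $\mathbf{y}$ are independent, i.e. their joint PMF is
$f_{\mathbf{X}}(\mathbf{x}_m) f_{\mathbf{Y}}(\mathbf{y})$. From joint AEP (statement 3 of theorem 7.3),
$\Pr\{E_m\} < 2^{-n(I(X;Y)-3\epsilon)}$. It follows that we can bound $\Pr\{E | U = 1\}$ as follows.
$$\Pr\{E | U = 1\} < \epsilon + \sum_{m=2}^{2^{Rn}} 2^{-n(I(X;Y)-3\epsilon)} = \epsilon + (2^{Rn} - 1) 2^{-n(I(X;Y)-3\epsilon)} < \epsilon + 2^{-n(I(X;Y)-3\epsilon-R)}$$
If $R < I(X; Y)$, then it is possible to choose $\epsilon$ small and $n$ large enough so that
$2^{-n(I(X;Y)-3\epsilon-R)} < \epsilon$, yielding $\Pr\{E | U = 1\} < 2\epsilon$.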
Existence of capacity achieving code: We finish the proof by arguing that there
exists a codebook whose rate achieves the capacity as follows.
1. By selecting the PMF $f_X(x)$ to be the capacity achieving PMF $f_X^*(x)$, i.e. setting $f_X^*(x) = \arg\max_{f_X(x)} I(X; Y)$, we can replace the condition $R < I(X; Y)$
by $R < C$.
2. Since the average error probability over all codebooks is less than $2\epsilon$, there must
exist at least one codebook $\mathcal{C}^*$ with error probability $P_e^n(\mathcal{C}^*) < 2\epsilon$.
3. From $P_e^n = \frac{1}{2^{Rn}} \sum_{m=1}^{2^{Rn}} \lambda_m$ and $P_e^n(\mathcal{C}^*) < 2\epsilon$, throwing away the worse half (based on
$\lambda_m$) of the codewords of $\mathcal{C}^*$ yields a codebook with half the size and $\lambda_{\max}^n < 4\epsilon$.
(Note that if more than half the codewords of $\mathcal{C}^*$ have $\lambda_m \ge 4\epsilon$, then $P_e^n(\mathcal{C}^*) \ge 2\epsilon$,
yielding a contradiction.) We can reindex the codewords in the modified
codebook with $2^{Rn}/2 = 2^{Rn-1}$ codewords. The rate of this code is reduced from
$R$ to $R - 1/n$.
In conclusion, given any $\epsilon > 0$, if $R < C$, there exists a code with rate $R - 1/n$ and
$\lambda_{\max}^n < 4\epsilon$ for sufficiently large $n$. As $n \to \infty$, the code rate approaches $R$. Thus, $R$
is achievable.
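The bound $\Pr\{E | U = 1\} < \epsilon + 2^{-n(I(X;Y)-3\epsilon-R)}$ from the error analysis makes it possible to estimate how long the codewords must be before the second term drops below $\epsilon$. The sketch below is a minimal illustration; the function names and the numerical values of $I(X;Y)$, $R$, and $\epsilon$ are assumptions chosen for the example, with rates measured in bits per channel use.

```python
from math import floor, log2


def random_coding_bound(n, mutual_info, rate, eps):
    """Upper bound eps + 2^{-n(I - 3*eps - R)} on Pr{E | U = 1}."""
    return eps + 2.0 ** (-n * (mutual_info - 3.0 * eps - rate))


def min_block_length(mutual_info, rate, eps):
    """Smallest n such that 2^{-n(I - 3*eps - R)} < eps (requires R < I - 3*eps)."""
    gap = mutual_info - 3.0 * eps - rate
    if gap <= 0:
        raise ValueError("need R < I(X;Y) - 3*eps")
    # 2^{-n*gap} < eps  <=>  n > log2(1/eps) / gap
    return floor(log2(1.0 / eps) / gap) + 1


n = min_block_length(mutual_info=0.5, rate=0.4, eps=0.01)
print(n, random_coding_bound(n, 0.5, 0.4, 0.01))
```

Because the exponent is proportional to $I(X;Y) - 3\epsilon - R$, the required block length grows quickly as $R$ approaches $I(X;Y)$.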
Non-Achievability of R > C
Proving that $R > C$ is not achievable is equivalent to showing that, for any
sequence of $(n, Rn)$ codes with $\lambda_{\max}^n \to 0$, we have $R \le C$.
Note that $\lambda_{\max}^n \to 0$ implies $P_e^n \to 0$. Since $P_e^n = \Pr\{\hat{U} \ne U\}$, we can apply the
Fano inequality (theorem 7.5) for guessing $U$ from $\mathbf{Y} = (Y_1, \ldots, Y_n)$ to write
$$H(U|\mathbf{Y}) \le 1 + P_e^n \log(2^{Rn} - 1) < 1 + P_e^n Rn. \tag{7.10}$$
To relate $R$ to $C$, we first note that $H(U) = Rn$. Then, we can use the equality
$H(U) = H(U|\mathbf{Y}) + I(U; \mathbf{Y})$ and (7.10) to write
$$Rn = H(U) = H(U|\mathbf{Y}) + I(U; \mathbf{Y}) < 1 + P_e^n Rn + I(U; \mathbf{Y}). \tag{7.11}$$
From the data processing inequality (theorem 7.4), $I(U; \mathbf{Y}) \le I(\mathbf{X}(U); \mathbf{Y})$, where
$\mathbf{X}(U) = \mathbf{x}_m$ under hypothesis $m$. It follows from (7.11) that
$$Rn < 1 + P_e^n Rn + I(\mathbf{X}(U); \mathbf{Y}). \tag{7.12}$$
We now show that $I(\mathbf{X}(U); \mathbf{Y}) \le nC$. For convenience, we drop the notation $U$
below. Using the chain rule on the entropy and the memoryless property of a DMC,
we can write $I(\mathbf{X}; \mathbf{Y})$ as
$$I(\mathbf{X}; \mathbf{Y}) = H(\mathbf{Y}) - H(\mathbf{Y}|\mathbf{X}) = H(\mathbf{Y}) - \sum_{j=1}^{n} H(Y_j | \mathbf{X}, Y_{j-1}, \ldots, Y_1) = H(\mathbf{Y}) - \sum_{j=1}^{n} H(Y_j | X_j).$$
Using the inequality $H(\mathbf{Y}) \le \sum_{j=1}^{n} H(Y_j)$ and the definition of $C$, we can write
$$I(\mathbf{X}; \mathbf{Y}) \le \sum_{j=1}^{n} H(Y_j) - \sum_{j=1}^{n} H(Y_j | X_j) = \sum_{j=1}^{n} I(X_j; Y_j) \le nC.$$
Substituting $I(\mathbf{X}; \mathbf{Y}) \le nC$ in (7.12), we can write $Rn < 1 + P_e^n Rn + Cn$, yielding
$R < 1/n + P_e^n R + C$, which becomes $R \le C$ as $n \to \infty$.
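Rearranging $R < 1/n + P_e^n R + C$ gives $P_e^n > 1 - C/R - 1/(nR)$, so at any rate above capacity the average error probability is bounded away from zero. The sketch below evaluates this floor; the function name and the numerical values are illustrative assumptions, with $R$ and $C$ in bits per channel use.

```python
def error_probability_floor(rate, capacity, n):
    """Lower bound on P_e^n implied by R < 1/n + P_e^n * R + C (zero if vacuous)."""
    return max(0.0, 1.0 - capacity / rate - 1.0 / (n * rate))


for n in (10, 100, 1000):
    # R = 0.6 exceeds C = 0.5, so the floor approaches 1 - C/R = 1/6 as n grows.
    print(n, error_probability_floor(rate=0.6, capacity=0.5, n=n))
```

For $R \le C$ the expression is not positive and the bound says nothing, which is consistent with the achievability part.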
7.7 Differential Entropy
The differential entropy of a continuous RV $X$ with probability density function
(PDF) $f_X(x)$, denoted by $h(X)$, is defined as
$$h(X) = -\int_{-\infty}^{\infty} f_X(x) \log f_X(x) \, dx. \tag{7.13}$$
Unlike the entropy of a discrete RV, the differential entropy $h(X)$ can be negative, as
can be seen in the following examples.
Example 7.4 For RV $X$ that is uniformly distributed over the interval $[0, a]$, $h(X) =
\log a$. Note that, for $a < 1$, $h(X)$ is negative.

Example 7.5 For Gaussian RV $X$ with mean 0 and variance $\sigma^2$,
$$h(X) = \frac{1}{2} \log(2 \pi e \sigma^2).$$
Note that, for any real constant $a$, $h(X) = h(X + a)$. It follows that, for Gaussian
RV $Y = X + a$ with mean $a$ and variance $\sigma^2$, $h(Y) = h(X) = \frac{1}{2} \log(2 \pi e \sigma^2)$.
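Examples 7.4 and 7.5 can be verified by numerical integration of definition (7.13). The sketch below uses a simple midpoint rule with base-2 logarithms, so results are in bits; the helper names, the integration limits, and the step count are assumptions made for this illustration.

```python
from math import exp, log2, pi, e, sqrt


def differential_entropy(pdf, lo, hi, steps=100_000):
    """Midpoint-rule estimate of h(X) = -integral of f log2 f over [lo, hi], in bits."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        fx = pdf(lo + (i + 0.5) * dx)
        if fx > 0:  # the integrand is taken as 0 where the PDF vanishes
            total -= fx * log2(fx) * dx
    return total


def gaussian_pdf(x, sigma):
    """Zero-mean Gaussian PDF with standard deviation sigma."""
    return exp(-x * x / (2.0 * sigma**2)) / sqrt(2.0 * pi * sigma**2)


a, sigma = 0.5, 2.0
h_uniform = differential_entropy(lambda x: 1.0 / a, 0.0, a)
h_gauss = differential_entropy(lambda x: gaussian_pdf(x, sigma), -10 * sigma, 10 * sigma)
print(h_uniform, log2(a))                             # uniform: log a (negative for a < 1)
print(h_gauss, 0.5 * log2(2 * pi * e * sigma**2))     # Gaussian: (1/2) log(2 pi e sigma^2)
```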
It is left as an exercise for the reader to show that, among all the RVs with variance
$\sigma^2$, the Gaussian RV has the highest differential entropy. We state this result
formally below.
Theorem 7.6 (Maximum entropy of Gaussian RV): Let $X$ be a continuous RV
with variance $\sigma^2$. Then, $h(X) \le \frac{1}{2} \log(2 \pi e \sigma^2)$, where the equality holds if and only if
$X$ is Gaussian.
Joint and Conditional Differential Entropy

The joint differential entropy of continuous RVs $X$ and $Y$ with joint PDF
$f_{X,Y}(x, y)$, denoted by $h(X, Y)$, is defined as
$$h(X, Y) = -\int \int f_{X,Y}(x, y) \log f_{X,Y}(x, y) \, dx \, dy. \tag{7.14}$$
Let $X$ and $Y$ be continuous RVs. The conditional differential entropy of $X$ given $Y$,
denoted by $h(X|Y)$, is defined as
$$h(X|Y) = -\int \int f_{X,Y}(x, y) \log f_{X|Y}(x|y) \, dx \, dy. \tag{7.15}$$
The definitions of joint and conditional differential entropies can be extended to
more than two RVs in a straightforward fashion. More specifically, consider two sets
of RVs $\mathbf{X} = (X_1, \ldots, X_n)$ and $\mathbf{Y} = (Y_1, \ldots, Y_m)$. Their joint and conditional
differential entropies can be written as follows.
$$h(\mathbf{X}, \mathbf{Y}) = -\int f_{\mathbf{X},\mathbf{Y}}(\mathbf{x}, \mathbf{y}) \log f_{\mathbf{X},\mathbf{Y}}(\mathbf{x}, \mathbf{y}) \, d\mathbf{x} \, d\mathbf{y} \tag{7.16}$$
$$h(\mathbf{X}|\mathbf{Y}) = -\int f_{\mathbf{X},\mathbf{Y}}(\mathbf{x}, \mathbf{y}) \log f_{\mathbf{X}|\mathbf{Y}}(\mathbf{x}|\mathbf{y}) \, d\mathbf{x} \, d\mathbf{y} \tag{7.17}$$
From the above definitions, we can derive the following theorems. The proofs are
similar to those for discrete RVs and are therefore omitted.
Theorem 7.7 (Chain rule for differential entropy): For continuous RVs $X_1, \ldots, X_n$,
$$h(X_1, \ldots, X_n) = h(X_1) + h(X_2|X_1) + \ldots + h(X_n|X_{n-1}, \ldots, X_1) = \sum_{j=1}^{n} h(X_j | X_{j-1}, \ldots, X_1).$$
Theorem 7.8 (Conditioning can only reduce entropy): For continuous RVs
$X$ and $Y$,
$$h(X|Y) \le h(X).$$
Mutual Information for Continuous RVs

The mutual information between two continuous RVs $X$ and $Y$, denoted by
$I(X; Y)$, is defined as
$$I(X; Y) = h(X) - h(X|Y). \tag{7.18}$$
As for discrete RVs, we can establish that
$$I(X; Y) = h(Y) - h(Y|X) = h(X) + h(Y) - h(X, Y). \tag{7.19}$$
In addition, since $h(X|Y) \le h(X)$, $I(X; Y) \ge 0$. Let $X, Y, Z$ be three continuous
RVs. The conditional mutual information, denoted by $I(X; Y|Z)$, is defined as
$$I(X; Y|Z) = h(X|Z) - h(X|Y, Z). \tag{7.20}$$
As for discrete RVs, we have the following theorems. The proofs are omitted.
Theorem 7.9 (Chain rule for mutual information): For continuous RVs $X_1, \ldots, X_n, Y$,
$$I(X_1, \ldots, X_n; Y) = I(X_1; Y) + I(X_2; Y|X_1) + \ldots + I(X_n; Y|X_{n-1}, \ldots, X_1) = \sum_{j=1}^{n} I(X_j; Y | X_{j-1}, \ldots, X_1). \tag{7.21}$$
In the remaining part of this section, we shall establish some inequalities that are
useful in proving the channel coding theorem for AWGN channels. Since the
derivations are similar in nature to the results for discrete RVs, we omit the proofs
in what follows. For more details, see [?].
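AEP for Continuous RVs

Consider a sequence of IID continuous RVs $X_1, X_2, \ldots$, where each $X_j$ takes its
value in the set $\mathcal{X}$ according to PDF $f_X(x)$. Let $h(X)$ be the differential entropy of
$X_j$. The typical set $T_\epsilon^n$ with respect to $f_X(x)$ is the set of sequences
$\mathbf{x} = (x_1, \ldots, x_n)$ such that $2^{-n(h(X)+\epsilon)} \le f_{\mathbf{X}}(\mathbf{x}) \le 2^{-n(h(X)-\epsilon)}$, or equivalently
$$\left| -\frac{1}{n} \log f_{\mathbf{X}}(\mathbf{x}) - h(X) \right| \le \epsilon.$$
The volume of a set $A$ in $\mathbb{R}^n$, denoted by $\mathrm{vol}(A)$, is defined as
$$\mathrm{vol}(A) = \int_A dx_1 \ldots dx_n. \tag{7.22}$$
Theorem 7.10 (AEP): For sufficiently large $n$, we have the following properties.
1. $\Pr\{T_\epsilon^n\} > 1 - \epsilon$.
2. $(1 - \epsilon) 2^{n(h(X)-\epsilon)} \le \mathrm{vol}(T_\epsilon^n) \le 2^{n(h(X)+\epsilon)}$.

Roughly speaking, as $n \to \infty$, the volume of an $n$-dimensional set that contains
most probability is $2^{nh(X)}$. This is the volume of an $n$-cube with side length $2^{h(X)}$.
Consider now a sequence of IID pairs of continuous RVs $(X_1, Y_1), (X_2, Y_2), \ldots$ with joint PDF
$f_{X,Y}(x, y)$. The jointly typical set $A_\epsilon^n$ is the set of sequences
$(\mathbf{x}, \mathbf{y}) = (x_1, \ldots, x_n, y_1, \ldots, y_n)$ such that
$$\left| -\frac{1}{n} \log f_{\mathbf{X}}(\mathbf{x}) - h(X) \right| < \epsilon, \quad \left| -\frac{1}{n} \log f_{\mathbf{Y}}(\mathbf{y}) - h(Y) \right| < \epsilon, \quad \left| -\frac{1}{n} \log f_{\mathbf{X},\mathbf{Y}}(\mathbf{x}, \mathbf{y}) - h(X, Y) \right| < \epsilon.$$
Theorem 7.11 (Joint AEP): For sufficiently large $n$, we have the following.
1. $\Pr\{A_\epsilon^n\} > 1 - \epsilon$.
2. $(1 - \epsilon) 2^{n(h(X,Y)-\epsilon)} < \mathrm{vol}(A_\epsilon^n) < 2^{n(h(X,Y)+\epsilon)}$.
3. Let $(\tilde{\mathbf{X}}, \tilde{\mathbf{Y}}) = (\tilde{X}_1, \ldots, \tilde{X}_n, \tilde{Y}_1, \ldots, \tilde{Y}_n)$ be a sequence of RVs. If their joint PDF
satisfies $f_{\tilde{\mathbf{X}},\tilde{\mathbf{Y}}}(\tilde{\mathbf{x}}, \tilde{\mathbf{y}}) = f_{\mathbf{X}}(\tilde{\mathbf{x}}) f_{\mathbf{Y}}(\tilde{\mathbf{y}})$, then
$$(1 - \epsilon) 2^{-n(I(X;Y)+3\epsilon)} < \Pr\left\{ (\tilde{\mathbf{X}}, \tilde{\mathbf{Y}}) \in A_\epsilon^n \right\} < 2^{-n(I(X;Y)-3\epsilon)}.$$
Roughly speaking, the typical $\mathbf{x}$-sequences form a set of volume $2^{nh(X)}$ in $\mathbb{R}^n$. The
typical $\mathbf{y}$-sequences form a set of volume $2^{nh(Y)}$ in $\mathbb{R}^n$. The jointly typical
$(\mathbf{x}, \mathbf{y})$-sequences form a set of volume $2^{nh(X,Y)}$ in $\mathbb{R}^{2n}$. Thus, the probability that a
randomly selected $(\mathbf{x}, \mathbf{y})$-sequence is jointly typical is about
$$\frac{2^{nh(X,Y)}}{2^{nh(X)} \, 2^{nh(Y)}} = 2^{-nI(X;Y)}.$$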
Data Processing and Fano Inequalities

Theorem 7.12 (Data processing inequality): If three RVs $X, Y, Z$ form a Markov
chain, then $I(X; Y) \ge I(X; Z)$ and $I(Y; Z) \ge I(X; Z)$.
Consider now a discrete RV $X$ and a continuous RV $Y$. In this case, we can define
the conditional entropy of $X$ given $Y$ as $H(X|Y) = \int f_Y(y) H(X|Y = y) \, dy$. In the
context of a digital communication system, $X$ can be a transmitted signal value
while $Y$ is the observation at the receiver. The Fano inequality gives a lower bound
on the probability of decision error.
Theorem 7.13 (Fano inequality): Let $X$ be a discrete RV taking a value in the set
$\mathcal{X}$. Suppose we guess the value of $X$ from a continuous RV $Y$. Let $\hat{X} = g(Y)$ be the
guess. Let $P_e = \Pr\{\hat{X} \ne X\}$ be the error probability. Then,
$$1 + P_e \log(|\mathcal{X}| - 1) \ge H(X|Y).$$
7.8 Capacity of AWGN Channels
Consider the discrete-time AWGN channel model
$$Y_j = X_j + N_j, \quad j \in \{1, 2, \ldots\},$$
where the IID Gaussian noise RVs $N_1, N_2, \ldots$ have mean 0 and variance $N_0/2$.
Without further restriction, the capacity of this channel is infinite since we can
choose an arbitrarily large number of signal points for each $X_j$ to be arbitrarily far
apart with respect to the noise variance $N_0/2$; the noise will then have negligible
effects on the bit error probability.
The most common limitation on the input is an energy or a power constraint, i.e.
for any codeword $\mathbf{x} = (x_1, \ldots, x_n)$ transmitted over the channel, we require the
energy bound
$$\frac{1}{n} \sum_{j=1}^{n} x_j^2 \le E_d. \tag{7.23}$$
We now define the channel capacity of the AWGN channel and then prove the
channel coding theorem for the AWGN channel. Note that we use the definitions of
an achievable rate as well as various error probabilities as in the case of a DMC.
The channel capacity of the AWGN channel with energy constraint $E_d$ is defined as
$$C = \max_{f_X(x): \, E[X^2] \le E_d} I(X; Y). \tag{7.24}$$
Theorem 7.14 (Channel coding theorem for AWGN channel): Any rate $R < C$
is achievable. Conversely, if a rate $R$ is achievable, then $R \le C$.
Before proving the channel coding theorem, we first derive the capacity expression
for the AWGN channel.
Theorem 7.15 (Capacity formula for AWGN channel): Let $\mathrm{SNR} = \frac{E_d}{N_0/2}$.
Then,
$$C = \frac{1}{2} \log_2 (1 + \mathrm{SNR}) \quad \text{(in bit/dimension)}. \tag{7.25}$$
Proof: We first write
$$I(X; Y) = h(Y) - h(Y|X) = h(Y) - h(X + N | X) = h(Y) - h(N) = h(Y) - \frac{1}{2} \log(\pi e N_0).$$
Since $h(Y)$ is maximized when $Y$ is Gaussian and the variance of $Y$ is $E_d + N_0/2$,
$$I(X; Y) \le \frac{1}{2} \log(2 \pi e (E_d + N_0/2)) - \frac{1}{2} \log(\pi e N_0) = \frac{1}{2} \log\left(1 + \frac{E_d}{N_0/2}\right) = \frac{1}{2} \log(1 + \mathrm{SNR}),$$
which is the desired expression.
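Formula (7.25) is straightforward to evaluate. The following sketch computes the capacity in bit/dimension from $E_d$ and $N_0$, or directly from an SNR given in decibels; the function names and the decibel convenience wrapper are assumptions added here for illustration.

```python
from math import log2


def awgn_capacity(ed, n0):
    """Capacity in bit/dimension from (7.25): C = (1/2) log2(1 + Ed / (N0/2))."""
    snr = ed / (n0 / 2.0)
    return 0.5 * log2(1.0 + snr)


def awgn_capacity_from_snr_db(snr_db):
    """Same formula with the SNR supplied in decibels."""
    return 0.5 * log2(1.0 + 10.0 ** (snr_db / 10.0))


print(awgn_capacity(ed=1.5, n0=1.0))       # SNR = 3, so C = (1/2) log2(4) = 1 bit/dimension
for snr_db in (0, 10, 20, 30):
    print(snr_db, awgn_capacity_from_snr_db(snr_db))
```

At high SNR the capacity grows by roughly half a bit per dimension for every 3 dB increase in SNR, since doubling the SNR adds about one to $\log_2(1 + \mathrm{SNR})$.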
Before formally proving the channel coding theorem for AWGN channels, we make
the following comments.
1. Note that $Y$ is Gaussian only when $X$ is Gaussian. Therefore, to achieve the
capacity, the input PDF must be Gaussian. Note that the signal sets in our
discussions so far do not yield Gaussian inputs. Additional efforts in signal set design are
needed to make the inputs Gaussian (see [?, sec. IV-B]).
2. Recall that a passband channel with bandwidth $W$ has $2W$ degrees of freedom
(or dimensions) per second. Define the signal power $P = 2W E_d$. In terms of
$P$, the capacity formula can be written as
$$C = \frac{1}{2} \log_2\left(1 + \frac{P}{N_0 W}\right) \ \text{(in bit/dimension)} \ = W \log_2\left(1 + \frac{P}{N_0 W}\right) \ \text{(in bit/s or bps)} \tag{7.26}$$
3. The capacity formula in (7.25) can be derived using the sphere packing argument [?, p. 734]. Consider transmitting the sequence $\mathbf{x} = (x_1, \ldots, x_n)$.
The received $n$-dimensional vector $\mathbf{y} = (y_1, \ldots, y_n)$ is given by $\mathbf{y} = \mathbf{x} + \mathbf{n}$,
where $\mathbf{n} = (n_1, \ldots, n_n)$ is the IID noise sequence. By the WLLN, we have
$\frac{1}{n} \sum_{j=1}^{n} n_j^2 \le \frac{N_0}{2} + \epsilon$ with high probability for large $n$. Since $\mathbf{n} = \mathbf{y} - \mathbf{x}$, we can
write
$$\|\mathbf{y} - \mathbf{x}\|^2 \le n \left( \frac{N_0}{2} + \epsilon \right).$$
Therefore, with high probability, the received sequence $\mathbf{y}$ is contained in the
sphere of radius $\sqrt{n N_0/2}$ around the signal point $\mathbf{x}$.
Given the transmitted energy bound $E_d$, the WLLN tells us that the received
sequence $\mathbf{y}$ satisfies $\frac{1}{n} \sum_{j=1}^{n} y_j^2 \le E_d + \frac{N_0}{2} + \epsilon$ with high probability for large $n$.
In other words, we can write
$$\|\mathbf{y}\|^2 \le n \left( E_d + \frac{N_0}{2} + \epsilon \right).$$
Therefore, $\mathbf{y}$ could possibly be anywhere in the sphere of radius $\sqrt{n(E_d + N_0/2)}$
centered at the origin. For reliable communication, we want the signal points to
be located far enough apart so that the hyperspheres of radius $\sqrt{n N_0/2}$ around
them do not overlap, as shown in figure 7.3.
Since the hypersphere of radius $r$ in $\mathbb{R}^n$ has volume $\alpha_n r^n$, where $\alpha_n$ is the constant
of proportionality, there can be as many as
$$\frac{\left(\sqrt{n(E_d + N_0/2)}\right)^n}{\left(\sqrt{n N_0/2}\right)^n}$$
codewords, yielding the capacity of
$$\log_2 \frac{\left(\sqrt{n(E_d + N_0/2)}\right)^n}{\left(\sqrt{n N_0/2}\right)^n} = \frac{n}{2} \log_2\left(1 + \frac{E_d}{N_0/2}\right) \ \text{bit per } n \text{ dimensions} \ = \frac{1}{2} \log_2(1 + \mathrm{SNR}) \ \text{bit/dimension}.$$
Figure 7.3: Illustration of the sphere packing argument.
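Comments 2 and 3 can be checked with a few lines of code. The sketch below evaluates the capacity in bit/s from (7.26) and the per-dimension rate that results from the ratio of the two hypersphere volumes in the sphere packing argument, working with logarithms so that the constant of proportionality cancels and large $n$ causes no overflow. Function names and numerical values are illustrative assumptions.

```python
from math import log2


def capacity_bps(power, n0, bandwidth):
    """Capacity in bit/s from (7.26): C = W log2(1 + P / (N0 W))."""
    return bandwidth * log2(1.0 + power / (n0 * bandwidth))


def sphere_packing_rate(ed, n0, n):
    """(1/n) log2 of the ratio of hypersphere volumes with radii
    sqrt(n (Ed + N0/2)) and sqrt(n N0/2); the volume constant cancels."""
    log_r_big = 0.5 * log2(n * (ed + n0 / 2.0))
    log_r_small = 0.5 * log2(n * n0 / 2.0)
    return (n * log_r_big - n * log_r_small) / n


ed, n0, w = 1.5, 1.0, 1.0e6
print(sphere_packing_rate(ed, n0, n=100))   # equals (1/2) log2(1 + SNR) bit/dimension
print(capacity_bps(2 * w * ed, n0, w))      # P = 2 W Ed, i.e. 2W dimensions per second
```

The second value is $2W$ times the first, reflecting the $2W$ dimensions per second of a passband channel with bandwidth $W$.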
Achievability of R < C
The proof is based on the random coding argument. The main ideas are similar to
the proof of the channel coding theorem for DMCs. Consider the transmission
scheme described below.
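Transmission and detection: We first randomly generate an $(n, Rn)$ codebook $\mathcal{C}$
with $2^{Rn}$ codewords according to a Gaussian PDF with mean 0 and variance $E_d - \epsilon$.
(The variance is taken slightly below $E_d$ so that, by the WLLN, a codeword violates the energy constraint only with small probability.)
In particular, we can write the codebook as a matrix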
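$$\mathcal{C} = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{2^{Rn},1} & x_{2^{Rn},2} & \cdots & x_{2^{Rn},n} \end{bmatrix},$$
where the entries are generated IID according to the above Gaussian PDF.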
The transmitter and the receiver both know the code $\mathcal{C}$ and the channel $f_{Y|X}(y|x)$.
The receiver observes $\mathbf{y} = (y_1, \ldots, y_n)$ and uses the following typical set decoding;
the decision is to set $\hat{U}(\mathbf{y}) = m$ if the following conditions hold.
1. $(\mathbf{x}_m, \mathbf{y})$ is jointly typical, i.e. $(\mathbf{x}_m, \mathbf{y}) \in A_\epsilon^n$.
2. There is no other index $m' \ne m$ such that $(\mathbf{x}_{m'}, \mathbf{y})$ is jointly typical.
3. $\mathbf{x}_m$ satisfies the energy constraint, i.e. $\frac{1}{n} \sum_{j=1}^{n} x_{m,j}^2 \le E_d$.

Otherwise, the receiver declares an error, i.e. the transmission is not successful.
Analysis of error probability: Let $\Pr\{E\}$ denote the error probability averaged over
all the codewords as well as over all the codebooks, i.e.
$$\Pr\{E\} = \int \Pr\{\mathcal{C}\} P_e^n(\mathcal{C}) \, d\mathcal{C},$$
where $P_e^n(\mathcal{C})$ is the error probability averaged over all the codewords for a codebook
$\mathcal{C}$.
By the symmetry of the code construction,
$$\Pr\{E\} = \Pr\{E | U = 1\}.$$
Let $E_m$ denote the event that $(\mathbf{x}_m, \mathbf{y})$ is jointly typical, i.e. $(\mathbf{x}_m, \mathbf{y}) \in A_\epsilon^n$. Let $E_m^c$
denote the complement of $E_m$. Let $F$ denote the event that $\frac{1}{n} \sum_{j=1}^{n} x_{1,j}^2 > E_d$. From
the modified typical set decoding, $\Pr\{E | U = 1\} = \Pr\{F \cup E_1^c \cup E_2 \cup \ldots \cup E_{2^{Rn}}\}$.
Using the union bound,
$$\Pr\{E | U = 1\} \le \Pr\{F\} + \Pr\{E_1^c\} + \sum_{m=2}^{2^{Rn}} \Pr\{E_m\}.$$
By the WLLN, $\Pr\{F\} < \epsilon$ for large $n$. From joint AEP (statement 1 of
theorem 7.11), $\Pr\{E_1^c\} < \epsilon$ for large $n$. From the code construction, for $m \ne 1$, $\mathbf{x}_m$
and $\mathbf{y}$ are independent, i.e. their joint PDF is $f_{\mathbf{X}}(\mathbf{x}_m) f_{\mathbf{Y}}(\mathbf{y})$. From joint AEP
(statement 3 of theorem 7.11), $\Pr\{E_m\} < 2^{-n(I(X;Y)-3\epsilon)}$. It follows that
$$\Pr\{E | U = 1\} < 2\epsilon + (2^{Rn} - 1) 2^{-n(I(X;Y)-3\epsilon)} < 2\epsilon + 2^{-n(I(X;Y)-3\epsilon-R)}.$$
Note that, if $R < I(X; Y)$, we can choose $\epsilon$ small and $n$ large enough so that
$2^{-n(I(X;Y)-3\epsilon-R)} < \epsilon$, yielding $\Pr\{E | U = 1\} < 3\epsilon$.
Existence of capacity achieving code: We finish the proof by arguing that there
exists a code whose rate achieves the capacity as follows.
1. By selecting the PDF $f_X(x)$ to be the capacity achieving PDF $f_X^*(x)$, i.e. setting $f_X^*(x) = \arg\max_{f_X(x)} I(X; Y)$, we can replace the condition $R < I(X; Y)$
by $R < C$.
2. Since the average error probability over all codebooks is less than $3\epsilon$, there must
be at least one codebook $\mathcal{C}^*$ with error probability $P_e^n(\mathcal{C}^*) < 3\epsilon$.
3. From $P_e^n = \frac{1}{2^{Rn}} \sum_{m=1}^{2^{Rn}} \lambda_m$, throwing away the worse half (based on $\lambda_m$) of the
codewords of $\mathcal{C}^*$ yields a codebook with half the size and $\lambda_{\max}^n < 6\epsilon$. (Note that if
more than half the codewords of $\mathcal{C}^*$ have $\lambda_m \ge 6\epsilon$, then $P_e^n(\mathcal{C}^*) \ge 3\epsilon$, yielding
a contradiction.) We can reindex the codewords in the modified codebook with
$2^{Rn}/2 = 2^{Rn-1}$ codewords. The rate of this code is reduced from $R$ to $R - 1/n$.
In conclusion, if $R < C$, there exists a code with rate $R - 1/n$ and $\lambda_{\max}^n < 6\epsilon$ for
sufficiently large $n$. As $n \to \infty$, the code rate approaches $R$. Thus, $R$ is achievable.
Non-Achievability of R > C
In addition to data processing and Fano inequalities, we shall make use of the
Jensen inequality described next. A function $f$ is convex over an interval $(a, b)$ if,
for every $x_1, x_2 \in (a, b)$ and $\lambda \in [0, 1]$, $f$ satisfies
$$f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2).$$
Theorem 7.16 (Jensen inequality): If a function $f$ is convex and $X$ is a RV, then
$$E[f(X)] \ge f(E[X]).$$
Proof: See the appendix.
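We now proceed with the proof that $R > C$ is not achievable. Proving that $R > C$
is not achievable is equivalent to showing that, for any sequence of $(n, Rn)$ codes
with $\lambda_{\max}^n \to 0$, we must have $R \le C$. Note that $\lambda_{\max}^n \to 0$ implies that $P_e^n \to 0$.
Since $P_e^n = \Pr\{\hat{U} \ne U\}$, we can apply the Fano inequality for guessing $U$ from
$\mathbf{Y} = (Y_1, \ldots, Y_n)$ to write
$$H(U|\mathbf{Y}) \le 1 + P_e^n \log(2^{Rn} - 1) < 1 + P_e^n Rn. \tag{7.27}$$
To relate $R$ to $C$, we first note that $H(U) = Rn$. Then, we can use the equality
$H(U) = H(U|\mathbf{Y}) + I(U; \mathbf{Y})$ and (7.27) to write
$$Rn = H(U) = H(U|\mathbf{Y}) + I(U; \mathbf{Y}) < 1 + P_e^n Rn + I(U; \mathbf{Y}). \tag{7.28}$$
From the data processing inequality, $I(U; \mathbf{Y}) \le I(\mathbf{X}(U); \mathbf{Y})$, where $\mathbf{X}(U) = \mathbf{x}_m$
under hypothesis $m$. It follows from (7.28) that
$$Rn < 1 + P_e^n Rn + I(\mathbf{X}(U); \mathbf{Y}). \tag{7.29}$$
For convenience, we drop the notation $U$ below. Using the chain rule on the entropy
and the memoryless property of the discrete-time channel, we can write and bound
$I(\mathbf{X}; \mathbf{Y})$ as
$$I(\mathbf{X}; \mathbf{Y}) = h(\mathbf{Y}) - h(\mathbf{Y}|\mathbf{X}) = h(\mathbf{Y}) - \sum_{j=1}^{n} h(Y_j | \mathbf{X}, Y_{j-1}, \ldots, Y_1) = h(\mathbf{Y}) - \sum_{j=1}^{n} h(Y_j | X_j) = h(\mathbf{Y}) - \sum_{j=1}^{n} h(N_j) \le \sum_{j=1}^{n} h(Y_j) - \sum_{j=1}^{n} h(N_j) = \sum_{j=1}^{n} [h(Y_j) - h(N_j)]. \tag{7.30}$$
Let $E_j = \frac{1}{2^{Rn}} \sum_{m=1}^{2^{Rn}} x_{m,j}^2$ be the average energy of position $j$ in the codeword. Since
$Y_j = X_j + N_j$ has variance at most $E_j + N_0/2$ and the Gaussian RV maximizes the
differential entropy for a given variance (theorem 7.6), $h(Y_j) \le \frac{1}{2} \log(2 \pi e (E_j + N_0/2))$,
while $h(N_j) = \frac{1}{2} \log(2 \pi e N_0/2)$ for the zero-mean Gaussian noise. It follows
from (7.29) and (7.30) that
$$R < \frac{1}{n} + P_e^n R + \frac{1}{n} \sum_{j=1}^{n} \frac{1}{2} \log\left(1 + \frac{2 E_j}{N_0}\right).$$
We now apply the Jensen inequality for the convex function $-\log(1 + x)$ to obtain
$$R < \frac{1}{n} + P_e^n R + \frac{1}{2n} \sum_{j=1}^{n} \log\left(1 + \frac{2 E_j}{N_0}\right) \le \frac{1}{n} + P_e^n R + \frac{1}{2} \log\left(1 + \frac{1}{n} \sum_{j=1}^{n} \frac{2 E_j}{N_0}\right).$$
Since the energy constraint is satisfied, i.e. $\frac{1}{n} \sum_{j=1}^{n} E_j \le E_d$,
$$R < \frac{1}{n} + P_e^n R + \frac{1}{2} \log\left(1 + \frac{2 E_d}{N_0}\right) = \frac{1}{n} + P_e^n R + C.$$
Finally, since $\frac{1}{n} + P_e^n R \to 0$ as $n \to \infty$, we have $R \le C$ as $n \to \infty$.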
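The Jensen step in the converse says that spreading the energy unevenly across codeword positions can only lower the sum of per-position terms $\frac{1}{2} \log(1 + 2E_j/N_0)$ relative to using the average energy in every position. The sketch below compares the two sides for an arbitrary energy profile; the function names and the profile are hypothetical values chosen for illustration.

```python
from math import log2


def per_position_average(energies, n0):
    """(1/n) * sum_j (1/2) log2(1 + 2 E_j / N0)."""
    return sum(0.5 * log2(1.0 + 2.0 * ej / n0) for ej in energies) / len(energies)


def average_energy_term(energies, n0):
    """(1/2) log2(1 + (1/n) * sum_j 2 E_j / N0)."""
    mean_energy = sum(energies) / len(energies)
    return 0.5 * log2(1.0 + 2.0 * mean_energy / n0)


energies = [0.0, 1.0, 2.0, 5.0]  # uneven energy profile with average 2
print(per_position_average(energies, n0=1.0))  # left side of the Jensen step
print(average_energy_term(energies, n0=1.0))   # right side, never smaller
```

The two sides coincide when all $E_j$ are equal, which is the case in which the bound $R \le C$ can be approached.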
7.9 Summary
In this chapter, we defined the channel capacities of DMCs and AWGN channels.
Using the AEP and related inequalities in information theory, we proved the
channel coding theorems for both DMCs and AWGN channels. We observed that
the proofs of the coding theorems guarantee the existence of good codes, but do not
tell us specifically how to construct good codes. Nevertheless, the channel capacity
can provide us with a fundamental limit on what we can achieve in terms of the
communication rate subject to the available bandwidth (for both DMCs and
AWGN channels) as well as the available transmit power (for AWGN channels). It is
interesting to note that the random coding argument assumes no restriction on the
coding delay. In particular, by allowing the sequence length to go to infinity, we
allow the coding delay to become arbitrarily large.
Finally, we derived the channel capacity formula for AWGN channels. The formula
provides the bound in figure 5.13. In addition, it is used as a benchmark to evaluate
the performance of practical systems.
7.10 Appendix: Convex Functions and Jensen Inequality
A function $f$ is convex over an interval $(a, b)$ if, for every $x_1, x_2 \in (a, b)$ and
$\lambda \in [0, 1]$, $f$ satisfies
$$f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2). \tag{7.31}$$
The following theorem can be used to check whether a function $f$ is convex.
Theorem 7.17 If a function $f$ is twice differentiable in $(a, b)$, then $f$ is convex if and
only if its second derivative $f''$ is nonnegative in $(a, b)$.
Proof: We first show that, if $f''(x) \ge 0$ in $(a, b)$, then $f(x)$ is convex in $(a, b)$.
Consider the second-order Taylor series expansion around a point $x_0$
$$f(x) = f(x_0) + f'(x_0)(x - x_0) + \frac{f''(x^*)}{2} (x - x_0)^2,$$
where $x^*$ lies somewhere between $x$ and $x_0$. Since $f''(x^*) \ge 0$,
$$f(x) \ge f(x_0) + f'(x_0)(x - x_0).$$
By setting $x_0 = \lambda x_1 + (1 - \lambda) x_2$ and $x = x_1$, where $\lambda \in [0, 1]$, we obtain
$$f(x_1) \ge f(\lambda x_1 + (1 - \lambda) x_2) + f'(x_0)((1 - \lambda)(x_1 - x_2)).$$
By setting $x_0 = \lambda x_1 + (1 - \lambda) x_2$ and $x = x_2$, we obtain
$$f(x_2) \ge f(\lambda x_1 + (1 - \lambda) x_2) + f'(x_0)(\lambda (x_2 - x_1)).$$
Adding $\lambda$ times the first of these two inequalities and $(1 - \lambda)$ times the second yields
$$f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2),$$
which implies that $f(x)$ is convex.
For the converse part, we shall prove by contradiction. Suppose there exists some $x_0$
such that $f''(x_0) < 0$. Then, for $x > x_0$ close enough to $x_0$, we have the second-order
Taylor series expansion
$$f(x) = f(x_0) + f'(x_0)(x - x_0) + \frac{f''(x^*)}{2} (x - x_0)^2,$$
where $x^*$ lies somewhere between $x$ and $x_0$, and $f''(x^*) < 0$. Thus,
$$f(x) < f(x_0) + f'(x_0)(x - x_0).$$
Now consider a point $x_0 + h$ in $(x_0, x)$. This point can be written as
$x_0 + h = \lambda x_0 + (1 - \lambda) x$, where $\lambda \in [0, 1]$. By writing $f'(x_0) = \lim_{h \to 0} \frac{f(x_0 + h) - f(x_0)}{h}$,
we have
$$f(x) < \lim_{h \to 0} \left[ f(x_0) + \frac{f(x_0 + h) - f(x_0)}{h} (x - x_0) \right].$$
Substituting $x_0 + h = \lambda x_0 + (1 - \lambda) x$ and $h = (1 - \lambda)(x - x_0)$, we get
$$f(x) < \lim_{\lambda \to 1} \left[ f(x_0) + \frac{f(\lambda x_0 + (1 - \lambda) x) - f(x_0)}{1 - \lambda} \right].$$
Rewriting the above inequality yields
$$f(x) < \lim_{\lambda \to 1} \left[ \frac{-\lambda f(x_0)}{1 - \lambda} + \frac{f(\lambda x_0 + (1 - \lambda) x)}{1 - \lambda} \right],$$
which implies that there exists some value of $\lambda > 0$ such that
$\lambda f(x_0) + (1 - \lambda) f(x) < f(\lambda x_0 + (1 - \lambda) x)$, contradicting the assumption that $f(x)$
is convex.

Example 7.6 We can check the second derivatives to verify that $e^x$ and $-\log x$ are
convex functions.
Proof of Jensen inequality

Recall that the Jensen inequality states that, for a convex function $f$,
$$E[f(X)] \ge f(E[X]).$$
Using the second-order Taylor series expansion around point $x = x_0$,
$$f(x) = f(x_0) + f'(x_0)(x - x_0) + \frac{f''(x^*)}{2} (x - x_0)^2,$$
where $x^*$ lies between $x$ and $x_0$. Since $f''(x^*) \ge 0$,
$$f(x) \ge f(x_0) + f'(x_0)(x - x_0).$$
It follows that, for a RV $X$,
$$f(X) \ge f(x_0) + f'(x_0)(X - x_0).$$
Taking the expectation and setting x0 = E[X], we obtain the Jensen inequality.
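The Jensen inequality is easy to test numerically for a discrete RV. The sketch below computes the gap $E[f(X)] - f(E[X])$, which should be nonnegative for a convex $f$ and zero for a linear $f$. The function name `jensen_gap` and the example PMF are assumptions for illustration; natural logarithms are used, which does not affect the sign of the gap.

```python
from math import exp, log


def jensen_gap(f, values, probs):
    """Return E[f(X)] - f(E[X]) for a discrete RV with the given values and PMF."""
    mean = sum(p * x for x, p in zip(values, probs))
    return sum(p * f(x) for x, p in zip(values, probs)) - f(mean)


values, probs = [1.0, 2.0, 4.0], [0.5, 0.25, 0.25]
print(jensen_gap(exp, values, probs))                      # convex e^x: positive gap
print(jensen_gap(lambda x: -log(x), values, probs))        # convex -log x: positive gap
print(jensen_gap(lambda x: 3.0 * x + 1.0, values, probs))  # linear: gap of zero
```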
7.11 Practice Problems
Problem 7.1 (Z channel and binary erasure channel): Consider transmitting
an equally likely information bit through each of the following two DMCs. Note that,
in DMC 1, there is a bit error only if bit 1 is sent. In DMC 2, the output has three
possible values: bit 0, bit 1, and an error denoted by e.
[Figure: transition diagrams of DMC 1 and DMC 2]
Write the expressions for the mutual information $I(X; Y)$ between the input and the
output as well as the conditional entropy $H(X|Y)$ for each channel.
Problem 7.2 (Cascade of two BSCs): Consider the cascade of two BSCs with
the crossover probabilities given below. (An example scenario of this model is a
satellite transmission system in which the first BSC corresponds to the uplink while
the second BSC corresponds to the downlink.)
Let $\epsilon_1$ be the crossover probability of the first BSC, and $\epsilon_2$ be the crossover probability
of the second BSC. Let RVs $X$, $Y$, and $Z$ denote the input, the intermediate output,
and the final output of the transmission system respectively.
(a) Compute the mutual information $I(X; Y)$ and $I(Y; Z)$.
(b) Compute the mutual information $I(X; Z)$.
(c) Argue that the overall channel is equivalent to a single BSC. Specify the crossover
probability for this equivalent BSC.
(d) Find the channel capacity of this cascade channel.
Problem 7.3 (Capacity of a DMC for 8-PSK): Consider an uncoded QAM
system in which a signal point $(a_1, a_2)$ in the signal set is transmitted by the signal
waveform
$$s(t) = a_1 \sqrt{2} \, p(t) \cos(2 \pi f_c t) - a_2 \sqrt{2} \, p(t) \sin(2 \pi f_c t),$$
where $p(t)$ is the baseband unit-norm pulse and $f_c$ is the carrier frequency. The signal
$s(t)$ is then transmitted through an AWGN channel with noise PSD equal to $N_0/2$.
Assume that we use the 8-PSK signal set shown below.
[Figure: 8-PSK signal set]
(a) Assume that we want to support the bit rate of 6 Mbps. Find the amount of
bandwidth required to transmit at this bit rate.
(b) Draw the decision regions for the received signals according to the optimal decision rule that minimizes the probability of decision error.
(c) Find the union bound estimate of the symbol error probability associated with
the optimal decision rule in part (b). Express the bound in terms of the energy
per bit $E_b$ and $N_0$.
(d) Assume that a decision error occurs only for nearest neighbors. In other words,
assume that the probability that a signal point $(a_1, a_2)$ is sent but a non-nearest
neighbor is decided is negligible.
View the channel as a DMC with 8 inputs and 8 outputs (after the receiver's
decision). Let $\delta = Q\left( \frac{d_{\min}}{\sqrt{2 N_0}} \right)$, where $d_{\min}$ denotes the minimum distance between
signal points. Find the channel capacity of this DMC in terms of $\delta$ with the
unit of bit/transmission (or equivalently bit/dimension).
Problem 7.4: Consider the additive-noise DMC shown below. In the figure, $X$, $Y$,
and $N$ denote the input, the output, and the additive noise RVs respectively.
[Figure: additive-noise DMC]
In addition, the alphabet set for $X$ is $\mathcal{X} = \{0, 1\}$; the PMF for $N$ is given by
$$p_N(n) = \begin{cases} 1/2, & n = a \\ 1/2, & n = -a \end{cases}$$
where $a > 0$. Compute the capacity of this DMC. HINT: Since your answer will
depend on the value of $a$, you can compute the capacity separately for different
cases.
Bibliography

We call a basis that is orthonormal an orthonormal basis. Note

Re{x} denotes the real part of complex number x.