
CHAPTER 1

INTRODUCTION
The basis for understanding and analyzing stochastic processes is probability theory and real analysis. This chapter presents an overview of probability and the associated concepts in order to develop the basis for the theory of stochastic or random processes in subsequent chapters. The results given in this chapter are the key ideas and results with direct bearing on the tools necessary to develop the results in the sequel. Although it is not necessary to have had a prior course in probability, graduate students in engineering usually have had an undergraduate course in probability and statistics. Prior background will enable readers to go through the early parts faster; however, the emphasis in this book is on understanding the basic concepts and their use, and this is different from an undergraduate course in which the emphasis is on computation and calculation.
1.1 Definition of a probability space
Let us begin at the beginning. In order to introduce the notion of probabilities and operations on them, we first need to set up the mathematical basis or hypotheses on which we can construct our edifice. This mathematical basis is the notion of a probability space. The first object we need to define is the space Ω, which is known as the space of all possible outcomes (or observations). This is the mathematical artifact which is the setting for studying the likelihood of occurrence of the outcome of an experiment, based on some assumptions on how we expect the quantities of interest to behave. By experiment it is meant the setting in which quantities are observed or measured. For example, the experiment may be the measurement of rainfall in a particular area and the method of determining the amount. The outcome ω is the actual value measured (or observed) at a given time.
Another, classical example is the roll of a die. The space Ω is just the set of possible values which can appear in one roll, i.e. {1, 2, 3, 4, 5, 6}. In the case of measurements of rainfall it is the numerical value, typically any real non-negative number in [0, ∞). Once Ω is specified, the next notion we need is the notion of an event. Once an experiment is performed the outcome is observed and it is possible to tell whether an event of interest has occurred or not. For example, in a roll of the die we can say whether the number is even or odd; in the case of rainfall, whether it is greater than 5 mm but less than 25 mm or not. The set of all events is usually denoted by F. F is just the collection of subsets of Ω which satisfy the following axioms:
i) Ω ∈ F;
ii) if A ∈ F then its complement A^c ∈ F;
iii) if A_1, A_2, ... ∈ F then ∪_i A_i ∈ F.
Such a collection F is called a σ-field (or σ-algebra), and an event is just an element of F.
1.2 Random variables and probability distributions
One of the most important concepts for a useful operational theory is the notion of a random variable, or r.v. for short. In this course we will usually concern ourselves with so-called real valued random variables or countably valued (or discrete-valued) random variables. In order to define random variables we first need the notion of what are termed Borel sets. In the case of real spaces the Borel sets are just the sets formed by countable intersections and unions of open sets (and their complements); hence typical Borel sets are open intervals of the type (a, b) with −∞ < a < b < ∞, and by property ii) above they also include half-open and closed sets of the form (a, b] or [a, b], and by the intersection property a single point can be taken to be a Borel set, etc. Borel sets play an important role in the definition of random variables. Note that for discrete valued r.vs a single value can play the role of a Borel set.
Definition 1.2.1 A mapping X(ω) : Ω → ℜ is said to be a random variable if for every Borel set C:
{ω : X(ω) ∈ C} ∈ F

or, what is equivalent,

X⁻¹(C) ∈ F
Remark: In words, what the above definition says is that if we consider all the elementary events ω which are mapped by X(.) into C, then the collection {ω} will define a valid event so that a probability can be assigned to it. At the level of these notes there will never be any difficulty as to whether the given mappings will define random variables; in fact we treat r.vs as primitives. This will be clarified a little later on.
From the viewpoint of computations it is most convenient to work with an induced measure rather than on the original space. This amounts to defining the probability measure induced by, or associated with, a r.v. on its range, so that rather than treat the points ω ∈ Ω and the probability measure IP we can work with a probability distribution on ℜ (or X) with x ∈ ℜ as the sample values.
Definition 1.2.2 Let X(ω) be a real valued r.v. defined on (Ω, F, IP). Then the function

F(x) = IP({ω : X(ω) ≤ x})

is called the (cumulative) probability distribution function of X.
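To make the definition concrete, here is a minimal numerical sketch (assuming Python with numpy, which is not part of these notes): it estimates F(x) for the die-roll experiment above by the empirical fraction of outcomes not exceeding x, and compares it with the exact staircase CDF.

```python
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=10_000)   # fair-die outcomes in {1, ..., 6}

def empirical_cdf(samples, x):
    """Estimate F(x) = IP(X <= x) as the fraction of samples not exceeding x."""
    return np.mean(samples <= x)

for x in [0.5, 1, 3, 5.5, 6]:
    exact = min(max(np.floor(x), 0.0), 6.0) / 6.0   # F(x) = floor(x)/6 on [0, 6]
    print(f"F({x}) ~ {empirical_cdf(rolls, x):.3f}  (exact {exact:.3f})")
```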
1.2.1 Functions of random variables
It is evident that given a r.v., any (measurable) mapping of the r.v. will define a
random variable. By measurability it is meant that the inverse images of Borel sets
in the range of the function belong to the -field of X ; more precisely, given any
mapping f () : X Y , then f (.) is said to be measurable if given any Borel set C
BY the Borel -field in Y then
f 1(C )BX
where B
X
denotes the Borel -field in X. This property assures us that the
mapping f (X ()) will define a r.v. since we can associate probabilities
associated with events.
Hence, the expectation can be defined (if it exists) by the following:

E[f(X(ω))] = ∫_Ω f(X(ω)) dIP(ω) = ∫_X f(x) dF_X(x) = ∫_Y y dF_Y(y)
In the next section we will obtain a generalization of these results for the case of vector valued r.vs. We will then present several examples.
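Before moving on, a small Monte Carlo sketch (Python/numpy assumed, purely illustrative) of the chain of equalities above: the expectation of f(X) can be computed either by averaging f over samples of X, or by averaging samples of the induced r.v. Y = f(X); both estimate the same integral.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=200_000)   # X ~ N(0, 1)

f = lambda t: t ** 2                      # a measurable mapping f
y = f(x)                                  # the induced r.v. Y = f(X)

# Both sample averages estimate the same integral E[f(X)] = E[Y] = 1:
print(np.mean(f(x)))   # averaging f against the law of X
print(np.mean(y))      # averaging against the induced law of Y
```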
In the sequel we will generally omit the argument ω for r.vs; capital letters will usually denote random variables while lowercase letters will denote values.
1.3 Joint distributions and Conditional Probabilities
So far we have worked with one r.v. on a probability space. In order to develop a useful theory we must develop tools to analyze collections of r.vs, and in particular the interactions between them. Of course they must be defined on a common probability space. Let {X_i(ω)}_{i=1}^n be a collection of r.vs defined on a common probability space (Ω, F, IP). Specifying their individual distributions F_i(x) does not account for their possible interactions. In fact, to say that they are defined on a common probability space is equivalent to specifying a probability measure IP which can assign probabilities to events such as {ω : X_1(ω) ∈ A_1, X_2(ω) ∈ A_2, ..., X_n(ω) ∈ A_n}. In the particular case where the r.vs are real valued this amounts to specifying a joint probability distribution function defined as follows:

F(x_1, x_2, ..., x_n) = IP{X_1(ω) ≤ x_1, X_2(ω) ≤ x_2, ..., X_n(ω) ≤ x_n}
Definition 1.3.2 Let X_1 and X_2 be two r.vs with means m_1 = E[X_1] and m_2 = E[X_2] respectively. Then the covariance between X_1 and X_2, denoted by cov(X_1, X_2), is defined as follows:

cov(X_1, X_2) = E[(X_1 − m_1)(X_2 − m_2)]

Definition 1.3.3 A r.v. X_1 is said to be uncorrelated with X_2 if cov(X_1, X_2) = 0.
Remark: If two r.vs are independent then they are uncorrelated, but not vice versa. The reverse implication does hold, however, if they are jointly Gaussian or normal (which we will see later on). In the statistical literature the normalized covariance between two r.vs is referred to as the correlation coefficient.
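A standard illustrative sketch of the remark (Python/numpy assumed): with X standard Gaussian and Y = X², the two r.vs are uncorrelated, since cov(X, Y) = E[X³] = 0, yet Y is completely determined by X, so they are far from independent.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=500_000)   # X ~ N(0, 1)
y = x ** 2                     # Y = X^2 is a deterministic function of X

# cov(X, Y) = E[X^3] - E[X] E[X^2] = 0: uncorrelated, yet clearly dependent.
print(np.cov(x, y)[0, 1])        # ~ 0
print(np.corrcoef(x, y)[0, 1])   # ~ 0
```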
We conclude this section with the result on how the distribution of a vector valued r.v. (or a finite collection of r.vs) which is a transformation of another vector valued r.v. can be obtained.

Let X be an ℜⁿ valued r.v. and let Y = f(X) be an ℜⁿ valued r.v., where f(.) is a 1:1 mapping. Then, just as in the case of scalar valued r.vs, the joint distribution of the Y_i can be obtained from the joint distribution of the X_i's by an extension of the techniques for scalar valued r.vs. These transformations are called Jacobian transformations.

First note that since f(.) is 1:1, we can write X = f⁻¹(Y). Let x_k = [f⁻¹(y)]_k, i.e. the k-th component of f⁻¹. Then the joint density of Y is given by

f_Y(y) = f_X(f⁻¹(y)) |det J(y)|,   J_{k,j}(y) = ∂[f⁻¹(y)]_k / ∂y_j

where J(y) is the Jacobian matrix of the inverse mapping.
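A hedged numerical check of the Jacobian formula in the scalar case (Python/numpy assumed; the map f(x) = eˣ is chosen purely for illustration): for X ~ N(0, 1) and Y = eˣ, the formula yields the lognormal density, which is compared against a histogram of simulated values.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=500_000)   # X ~ N(0, 1)
y = np.exp(x)                  # Y = f(X), f 1:1 on the range of X

def f_Y(yv):
    """Jacobian formula: f_Y(y) = f_X(f^{-1}(y)) |d f^{-1}/dy|, f^{-1} = log."""
    xv = np.log(yv)
    f_X = np.exp(-xv ** 2 / 2) / np.sqrt(2 * np.pi)
    return f_X / yv            # |d(log y)/dy| = 1/y

counts, edges = np.histogram(y, bins=200, range=(0.05, 5.0))
hist = counts / (y.size * np.diff(edges))      # density estimate of f_Y
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - f_Y(centers))))     # ~ 0.01: formula matches data
```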
The definition of conditional probabilities and the sequential computation of the joint probabilities above lead to a very useful formula of importance in estimation theory called the Bayes rule. Let {B_k} be a finite or countable collection of disjoint events with ∪_k B_k = Ω and IP(B_k) > 0. Then for any event A:

IP(A) = Σ_k IP(A|B_k) IP(B_k)

IP(B_k|A) = IP(A|B_k) IP(B_k) / Σ_j IP(A|B_j) IP(B_j)

The first relation is known as the law of total probability, which allows us to compute the probability of an event by computing the conditional probability on simpler events which are disjoint and exhaust the space Ω. The second relation is usually referred to as Bayes rule. Its importance arises in applications where we cannot observe a particular event but instead observe another event whose probability can be inferred from knowledge of the event conditioned on simpler events. Then Bayes rule just states that the conditional probability can be computed from the knowledge of how the observed event interacts with other simpler events. The result also holds for a finite decomposition, with the convention that if IP(B_k) = 0 then the corresponding conditional probability is set to 0.
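A small worked example of the two relations (plain Python; the numbers, a 1% prevalence and the two test error rates, are hypothetical and chosen only for illustration):

```python
# Law of total probability and Bayes rule over a two-event decomposition.
p_B = [0.01, 0.99]            # IP(B_1) = sick, IP(B_2) = healthy
p_A_given_B = [0.95, 0.02]    # IP(A | B_k): probability of a positive test

# Total probability: IP(A) = sum_k IP(A | B_k) IP(B_k)
p_A = sum(pa * pb for pa, pb in zip(p_A_given_B, p_B))

# Bayes rule: IP(B_1 | A) = IP(A | B_1) IP(B_1) / IP(A)
p_B1_given_A = p_A_given_B[0] * p_B[0] / p_A
print(p_A, p_B1_given_A)      # ~0.0293 and ~0.324
```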
Once we have the definition of conditional probabilities we can readily obtain conditional distributions. Let us begin with the problem of computing the conditional distribution of a real valued r.v. X given that X ∈ A, where IP(X ∈ A) > 0. In particular let us take A = (a, b]. We denote the conditional distribution by F_{(a,b]}(x). Now, by the definition of conditional probabilities:

F_{(a,b]}(x) = IP(X ≤ x | a < X ≤ b) = (F(min(x, b)) − F(a)) / (F(b) − F(a)) for x > a,

and F_{(a,b]}(x) = 0 for x ≤ a.
Remark: Although we assumed that X and Y are real valued r.vs, the above definitions can be directly used in the case when X and Y are vector valued r.vs, where the densities are replaced by the joint densities of the component r.vs of the vectors. We assumed the existence of densities in order to define the conditional distribution (through the definition of the conditional density) given a particular value of the r.v. Y. This is strictly not necessary, but giving a proof without densities would take us beyond the scope of the course.
Once we have the conditional distribution we can calculate conditional moments. From the point of view of applications the conditional mean is of great significance, which we discuss in the next subsection.
Proposition 1.3.3 Let X and Y be jointly distributed r.vs. Let f(.) be a measurable function with E[|f(X)|] < ∞, i.e. f(X) is integrable. Then:
a) If X and Y are independent, E[f(X)|Y] = E[f(X)].
b) If X is a function of Y, say X = h(Y), then E[f(X)|Y] = f(X) = f(h(Y)).
c) E[f(X)] = E[E[f(X)|Y]].
d) E[h(Y)f(X)|Y] = h(Y)E[f(X)|Y] for all functions h(.) such that E[h(Y)f(X)] is defined.
Properties c) and d) give rise to a useful characterization of conditional expectations. This is called the orthogonality principle. It states that the difference between a r.v. and its conditional expectation is uncorrelated with any function of the r.v. on which it is conditioned, i.e.

E[(X − E[X|Y])h(Y)] = 0

In the context of mean squared estimation theory the above is just a statement of the orthogonality principle, with E[XY] playing the role of the inner-product on the Hilbert space of square integrable r.vs, i.e. r.vs such that E[X²] < ∞. Thus the conditional expectation can be seen as a projection onto the subspace spanned by Y with the inner-product as defined. This will be discussed in Chapters 2 and 4.
E[(X − g(Y))²] = E[(X − E[X|Y] + E[X|Y] − g(Y))²]
= E[(X − E[X|Y])²] + 2E[(X − E[X|Y])(E[X|Y] − g(Y))] + E[(E[X|Y] − g(Y))²]
= E[(X − E[X|Y])²] + E[(E[X|Y] − g(Y))²]

where we have used properties c) and d) to note that E[(E[X|Y] − g(Y))(X − E[X|Y])] = 0. Hence, since the right hand side is the sum of two squares and we are free to choose g(Y), the right hand side is minimized by choosing g(Y) = E[X|Y].
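The orthogonality principle and the minimization above can be checked numerically. In the sketch below (Python/numpy assumed, a deliberately simple model), X = Y + W with W independent of Y, so E[X|Y] = Y exactly; the error X − E[X|Y] is uncorrelated with functions of Y, and any other g(Y) incurs a larger mean squared error.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500_000
y = rng.normal(size=n)
x = y + rng.normal(size=n)   # X = Y + W with W independent of Y, so E[X|Y] = Y

err = x - y                  # estimation error X - E[X|Y]

# Orthogonality: the error is uncorrelated with any function h(Y).
print(np.mean(err * y), np.mean(err * np.sin(y)))        # both ~ 0

# E[X|Y] minimizes the mean squared error over functions g(Y).
print(np.mean(err ** 2), np.mean((x - 0.9 * y) ** 2))    # ~1.00 vs ~1.01
```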
Remark 1.3.3 Following the same proof it is easy to show that the constant C which minimizes E[(X − C)²] is C = E[X], provided E[X²] < ∞.
Finally, we conclude our discussion of conditional expectations by showing another important property associated with them. This is related to the fact that if we have a 1:1 transformation of a r.v., then conditioning w.r.t. the r.v. or its transformed version gives the same result.

Proposition 1.3.5 Let X and Y be two jointly distributed r.vs. Let Φ(.) be a 1:1 mapping. Then: E[X|Y] = E[X|Φ(Y)]

Note this is not trivial, since X² : ℜ → ℜ is not 1:1 (knowing X² does not tell us whether X is positive or negative). Moreover, calculating the conditional densities involves generalized functions (delta functions) (see below). The way to show this is by using the property that, for any measurable and integrable function g(.), we have:
1.4 Gaussian or Normal random variables
We now discuss properties of Gaussian or Normal r.vs, since they play an important role in the modeling of signals and possess special properties which are crucial in the development of estimation theory. Throughout we will work with vector valued r.vs. Note that in probability we usually refer to Gaussian distributions, while statisticians refer to them as Normal distributions. In these notes we prefer the terminology Gaussian.
We first develop some general results regarding random vectors in ℜⁿ. Given a vector valued r.v. X ∈ ℜⁿ (i.e. n r.vs which are jointly distributed), the mean is just the column vector with elements m_i = E[X_i], and the covariance is a matrix with elements R_{i,j} = E[(X_i − m_i)(X_j − m_j)]; hence R_{i,j} = R_{j,i}, or the matrix R is self-adjoint (symmetric). In vectorial terms this can be written as R = E[(X − m)(X − m)ᵀ].
Remark: The above proposition shows that not only is the conditional distribution of jointly Gaussian r.vs Gaussian, but also the conditional mean is an affine function of the second r.v., and the conditional covariance does not depend on the second r.v.; i.e. the conditional covariance is a constant and not a function of the r.v. on which the conditioning takes place. This result only holds for Gaussian distributions, and thus the problem of finding the best mean squared estimate can be restricted to the class of linear (affine) maps.
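The following sketch (Python/numpy assumed; the affine model and its parameters are chosen only for illustration) checks both claims numerically: the conditional mean of a jointly Gaussian pair is affine in the conditioning variable, and the residual variance is the constant R_XX − R_XY R_YY⁻¹ R_YX.

```python
import numpy as np

rng = np.random.default_rng(5)
# A jointly Gaussian pair built explicitly: X = 2 + a(Y - 1) + W,
# with Y ~ N(1, 1) and W ~ N(0, 0.25) independent of Y.
a = 0.8
y = rng.normal(1.0, 1.0, size=500_000)
x = 2.0 + a * (y - 1.0) + rng.normal(0.0, 0.5, size=y.size)

m_x, m_y = 2.0, 1.0
R_xy, R_yy, R_xx = a, 1.0, a ** 2 + 0.25

# Gaussian conditioning: E[X|Y] = m_x + R_xy R_yy^{-1} (Y - m_y) is affine in Y,
# and the conditional variance R_xx - R_xy^2 / R_yy is a constant.
cond_mean = m_x + (R_xy / R_yy) * (y - m_y)
cond_var = R_xx - R_xy ** 2 / R_yy          # = 0.25, independent of Y

print(np.mean((x - cond_mean) ** 2), cond_var)   # both ~ 0.25
```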
1.5 Probabilistic Inequalities and Bounds
In this section we give some important inequalities for probabilities and bounds associated with random variables. These results will play an important role in the study of convergence issues.

Proposition 1.5.1 (Markov Inequality)
Let X be a r.v. and f(.) : ℜ → ℜ₊ such that E[f(X)] < ∞. Then for any a > 0:

IP(f(X) ≥ a) ≤ E[f(X)] / a

Chebychev's inequality, IP(|X − m| ≥ a) ≤ var(X)/a², amounts to choosing the function f(X) = (X − m)² in the Markov inequality. An immediate consequence of Chebychev's inequality is the fact that if a r.v. X has variance equal to 0 then the random variable is almost surely a constant.
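A quick numerical verification of both inequalities (Python/numpy assumed; the exponential distribution is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.exponential(scale=1.0, size=500_000)   # X >= 0, E[X] = 1, var(X) = 1

a = 3.0
# Markov (with f(x) = x): IP(X >= a) <= E[X]/a
print(np.mean(x >= a), 1.0 / a)                    # ~0.050 <= 0.333
# Chebychev: IP(|X - m| >= a) <= var(X)/a^2
print(np.mean(np.abs(x - 1.0) >= a), 1.0 / a**2)   # ~0.018 <= 0.111
```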
Definition 1.5.1 An event is said to be IP-almost sure (usually abbreviated as IP-a.s.) if the probability of that event is 1, i.e. IP(A) = 1.

Using the Cauchy-Schwarz inequality we can prove a one-sided version of Chebychev's inequality that is often useful. This is sometimes referred to as Cantelli's inequality.

Proposition 1.5.3 (Cantelli's inequality)
Let X be a random variable with mean E[X] = m and finite variance σ². Then for any λ > 0:

IP(X − m ≥ λ) ≤ σ² / (σ² + λ²)

Proof: Without loss of generality we assume m = 0. Then,

λ = E[λ − X] ≤ E[(λ − X)1I_{[λ>X]}]

and by the Cauchy-Schwarz inequality,

λ² ≤ E[(λ − X)²] IP(X < λ) = (σ² + λ²) IP(X < λ)

so that IP(X ≥ λ) ≤ σ²/(σ² + λ²), which is the required result.
Another important inequality, known as Jensen's inequality, allows us to relate the mean of a function of a r.v. and the function of the mean of the r.v. Of course we need some assumptions on the functions; these are the convex functions. Recall, a function f(.) : ℜ → ℜ is said to be convex downward (or simply convex) if for any a, b ≥ 0 such that a + b = 1:

f(ax + by) ≤ af(x) + bf(y)

For such f, Jensen's inequality states that f(E[X]) ≤ E[f(X)].
Remark: The FKG inequality states that if f(.) and g(.) are non-decreasing functions from ℜⁿ → ℜ then the above result holds for ℜⁿ valued r.vs X. Note that here a function is non-decreasing if it is non-decreasing in each coordinate.
We conclude this section with a brief discussion of the so-called Cramér theorem (also called the Chernoff bound), which allows us to obtain finer bounds on probabilities than those provided by Chebychev's inequality, particularly when the tails of the distributions are rapidly decreasing. This can be done if the moment generating function of a r.v. is defined, and thus imposes stronger assumptions on the existence of moments. This result is the basis of the so-called large deviations theory, which plays an important role in calculating tail probabilities when the probabilities are small, and is important in information theory and in simulation.

Let X be a real valued r.v. such that the moment generating function M(h) = E[e^{hX}] is defined for h < ∞. Then for any a:

IP(X ≥ a) ≤ inf_{h>0} e^{−ha} M(h)
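For a concrete comparison (plain Python; the N(0, 1) tail at a = 3 is an arbitrary test case): for a standard Gaussian, M(h) = exp(h²/2), so the Chernoff bound gives IP(X ≥ a) ≤ exp(−a²/2), far tighter than Chebychev's 1/a² for large a.

```python
from math import erf, exp, sqrt

# For X ~ N(0, 1): M(h) = exp(h^2/2), so the bound
# IP(X >= a) <= inf_{h>0} e^{-ha} M(h) = exp(-a^2/2) (minimum at h = a).
a = 3.0
chernoff = exp(-a ** 2 / 2)                 # 0.0111
exact = 0.5 * (1.0 - erf(a / sqrt(2.0)))    # exact Gaussian tail: 0.00135
chebychev = 1.0 / a ** 2                    # 0.111, for comparison
print(exact, chernoff, chebychev)
```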
1.6 Borel-Cantelli Lemmas
We conclude our overview of probability by discussing the so-called Borel-Cantelli lemmas. These results are crucial in establishing almost sure properties of events, and in particular in studying the almost sure behavior of limits of events. The utility of these results will be seen in the context of convergence of sequences of r.vs and in establishing the limiting behavior of Markov chains.
CONCLUDING REMARKS
In this chapter we have had a very quick overview of probability and random variables. These results will form the basis on which we will build to advance our study of stochastic processes in the subsequent chapters. The results presented in this chapter are just vignettes of the deep and important theory of probability. A list of reference texts is provided in the bibliography, which the reader may consult for a more comprehensive presentation of the results in this chapter.
CHAPTER 2
Strong Markov Processes
1 Markov time
Definition (). Let (S, W, P_a) be a Markov process with W = W_rc, W_d or W_c. A mapping σ : W → [0, +∞] is called a Markov time if

{w : σ(w) ≤ t} ∈ B_t.

It is easily seen that w → w_σ− is a measurable map of W → W. In fact, it is enough to show that

w_σ−(t) = x(t ∧ σ(w), w)

is measurable, and this is immediate since x(s, w), σ(w) and t ∧ σ are all measurable in the pair (s, w). Similarly, w → w_σ+ is measurable.

The system of all subsets of W of the form {w : w_σ− ∈ B}, B ∈ B, is denoted by B_σ−. B_σ− is a Borel algebra contained in B. We shall give examples to show that σ is not always B_σ−-measurable. However, if σ < ∞, x_σ = w(σ(w)) is B_σ−-measurable, for x_σ = lim w_σ−(t) and x_σ(w_t) is
CHAPTER 3.
RANDOM PROCESSES
Definition
A random process (or stochastic process) X is an indexed collection X = {X(t), t ∈ T} of random variables, all defined on the same probability space (S, F, P).
In many applications the index set T (parameter set) is a set of times (continuous or discrete):
discrete time random processes
continuous time random processes
To every ω ∈ S there corresponds a function of time, a sample function.
The totality of all sample functions is called an ensemble.
The values assumed by X(t) are called states, and they form the state space E of the random process.
Example 3.1
In the coin tossing experiment, where S = {H, T}, define the random process

X(t, Heads) = sin(t)
X(t, Tails) = cos(t)

The ensemble consists of these two sample functions.
Both the parameter set and the state space can be discrete or continuous. Depending on that, the process is classified as follows:

PARAMETER SET T discrete: DISCRETE PARAMETER or DISCRETE TIME process, also called a RANDOM SEQUENCE {X_n, n = 1, 2, ...}
STATE SPACE E discrete: DISCRETE STATE process, also called a CHAIN
PARAMETER SET T continuous: CONTINUOUS PARAMETER or CONTINUOUS TIME process
STATE SPACE E continuous: CONTINUOUS STATE process
There are three ways to look at the random process:
1. X(ω, t) as a function of both ω ∈ S and t ∈ T,
2. for each fixed ω ∈ S, X(t) is a function of t ∈ T,
3. for each fixed t ∈ T, X(ω) is a function on S.
Distribution and density functions
The first-order distribution function is defined as:

F_X(x; t) = P(X(t) ≤ x)

The first-order density function is defined as:

f_X(x; t) = ∂F_X(x; t) / ∂x

In general, we can define the nth-order distribution function as:

F_X(x_1, ..., x_n; t_1, ..., t_n) = P(X(t_1) ≤ x_1, ..., X(t_n) ≤ x_n)

and the nth-order density function as:

f_X(x_1, ..., x_n; t_1, ..., t_n) = ∂ⁿ F_X(x_1, ..., x_n; t_1, ..., t_n) / (∂x_1 ... ∂x_n)
First- and second-order statistical averages
The mean or expected value of a random process X(t) is defined as:

μ_X(t) = E[X(t)] = ∫_{−∞}^{+∞} x f_X(x; t) dx

X(t) is treated as a random variable for a fixed value of t. In general, μ_X(t) is a function of time, and it is often called the ensemble average of X(t).
A measure of dependence of the random variables of X(t) is expressed by its autocorrelation function, defined by:

R_X(t_1, t_2) = E[X(t_1)X(t_2)] = ∫∫ x_1 x_2 f_X(x_1, x_2; t_1, t_2) dx_1 dx_2

and its autocovariance function, defined by:

K_X(t_1, t_2) = Cov(X(t_1), X(t_2)) = E[(X(t_1) − μ_X(t_1))(X(t_2) − μ_X(t_2))]
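These ensemble quantities can be estimated by averaging across simulated sample functions. A minimal sketch (Python/numpy assumed; the random-phase sinusoid, for which R_X(t_1, t_2) = 0.5 cos(2π(t_2 − t_1)), is a standard textbook choice):

```python
import numpy as np

rng = np.random.default_rng(7)
t = np.linspace(0.0, 1.0, 200)                      # common time grid
theta = rng.uniform(0, 2 * np.pi, size=(5000, 1))   # one random phase per path
paths = np.sin(2 * np.pi * t + theta)               # ensemble of sample functions

mu_hat = paths.mean(axis=0)          # ensemble average mu_X(t), ~0 for every t
i, j = 50, 120                       # two fixed time instants t_1, t_2
R_hat = np.mean(paths[:, i] * paths[:, j])          # autocorrelation estimate
R_true = 0.5 * np.cos(2 * np.pi * (t[j] - t[i]))
print(np.max(np.abs(mu_hat)), R_hat, R_true)
```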
Classification of random processes
Stationary processes
A random process X(t) is stationary, or strict-sense stationary, if its statistical properties do not change with time, or more precisely:

F_X(x_1, ..., x_n; t_1, ..., t_n) = F_X(x_1, ..., x_n; t_1+τ, ..., t_n+τ)

for all orders n and all time shifts τ.
Stationarity influences the form of the first- and second-order distribution and density functions:

F_X(x; t) = F_X(x; t+τ) = F_X(x)
f_X(x; t) = f_X(x; t+τ) = f_X(x)
F_X(x_1, x_2; t_1, t_2) = F_X(x_1, x_2; t_2 − t_1)
f_X(x_1, x_2; t_1, t_2) = f_X(x_1, x_2; t_2 − t_1)

The mean of a stationary process

μ_X(t) = E[X(t)] = ∫_{−∞}^{+∞} x f_X(x) dx = μ_X

does not depend on time, and the autocorrelation function

R_X(t_1, t_2) = ∫∫ x_1 x_2 f_X(x_1, x_2; t_2 − t_1) dx_1 dx_2 = R_X(t_2 − t_1)

depends only on the time difference t_2 − t_1. If the stationarity condition of a random process X(t) does not hold for all n, but only for n ≤ k, then the process X(t) is stationary to order k.
If X(t) is stationary to order 2, then it is wide-sense stationary (WSS) or weakly stationary.
Independent processes
In a random process X(t), if the X(t_i) are independent random variables for i = 1, 2, ..., n, then for n ≥ 2 we have

F_X(x_1, ..., x_n; t_1, ..., t_n) = Π_{i=1}^n F_X(x_i; t_i)

Only the first-order distribution is sufficient to characterize an independent random process.
A random process X(t) is said to be a Markov process if

P(X(t_{n+1}) ≤ x_{n+1} | X(t_n) = x_n, X(t_{n−1}) = x_{n−1}, ..., X(t_1) = x_1) = P(X(t_{n+1}) ≤ x_{n+1} | X(t_n) = x_n)

The future states of the process depend only on the present state and not on the past history (memoryless property).
For a Markov process we can write:

F_X(x_1, ..., x_n; t_1, ..., t_n) = F_X(x_1; t_1) Π_{k=2}^n P(X(t_k) ≤ x_k | X(t_{k−1}) = x_{k−1})
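A minimal simulation of a Markov process in discrete time (Python/numpy assumed; the two-state transition matrix is arbitrary): the empirical one-step transition frequencies depend only on the current state and recover the matrix P.

```python
import numpy as np

rng = np.random.default_rng(8)
P = np.array([[0.9, 0.1],     # two-state chain: P[i, j] = IP(X(t+1) = j | X(t) = i)
              [0.4, 0.6]])

n, state = 100_000, 0
states = np.empty(n, dtype=np.int64)
for k in range(n):
    states[k] = state
    state = rng.choice(2, p=P[state])   # next state depends only on the current state

# Empirical one-step transition frequencies recover the rows of P.
for i in range(2):
    nxt = states[1:][states[:-1] == i]
    print(i, np.bincount(nxt, minlength=2) / len(nxt))
```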
Ergodic processes
A random process X(t) is ergodic if the time averages of the sample functions are equal to the ensemble averages.
The time average of x(t) is defined as:

⟨x(t)⟩ = lim_{T→∞} (1/T) ∫_{−T/2}^{T/2} x(t) dt
Similarly, the time autocorrelation function of x(t) is defined as:

R̄_X(τ) = ⟨x(t)x(t+τ)⟩ = lim_{T→∞} (1/T) ∫_{−T/2}^{T/2} x(t)x(t+τ) dt
Counting process
A random process {X(t), t ≥ 0} is called a counting process if X(t) represents the total number of events that have occurred in the interval (0, t). It has the following properties:
1. X(t) ≥ 0 and X(0) = 0
2. X(t) is integer valued
3. X(t_1) ≤ X(t_2) if t_1 ≤ t_2
4. X(t_2) − X(t_1) equals the number of events in the interval (t_1, t_2)
A sample function of a counting process
Poisson processes
If the number of events n in any interval of length τ is Poisson distributed with mean λτ, that is:

P(X(t+τ) − X(t) = n) = exp(−λτ) (λτ)ⁿ / n!,   n = 0, 1, 2, ...

then the counting process X(t) is said to be a Poisson process with rate (or intensity) λ.
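A standard way to simulate a Poisson process (Python/numpy assumed) uses the fact that its inter-arrival times are i.i.d. Exponential(λ); counts over disjoint unit intervals should then be Poisson(λ), with mean and variance both equal to λ.

```python
import numpy as np

rng = np.random.default_rng(9)
lam, t_max = 2.0, 1000.0
# Inter-arrival times of a rate-lam Poisson process are i.i.d. Exponential(lam).
gaps = rng.exponential(1.0 / lam, size=int(3 * lam * t_max))
arrivals = np.cumsum(gaps)
arrivals = arrivals[arrivals <= t_max]

def N(t):
    """Counting process X(t): the number of events in (0, t]."""
    return np.searchsorted(arrivals, t, side="right")

# Counts over disjoint unit intervals should be Poisson(lam): mean = var = lam.
counts = np.diff([N(t) for t in np.arange(0.0, t_max + 1.0)])
print(counts.mean(), counts.var())   # both ~ 2.0
```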
CHAPTER 4
4.1 Ergodic transformation
Let T be a measure-preserving transformation on a measure space (X, Σ, μ). An element A of Σ is T-invariant if A differs from T⁻¹(A) by a set of measure zero, i.e. if

μ(A Δ T⁻¹(A)) = 0

where Δ denotes the set-theoretic symmetric difference of A and T⁻¹(A).
The transformation T is said to be ergodic if for every T-invariant element A of Σ, either A or X\A has measure zero.
Ergodic transformations capture a very common phenomenon in statistical physics. For instance, think of the measure space as a model for the particles of some gas contained in a bounded container, with X a finite set of positions that the particles fill at any time and μ the counting measure on X, and let T(x) be the position of the particle x after one unit of time. Then the assertion that T is ergodic means that any part of the gas which is neither empty nor the whole container is mixed with its complement during one unit of time. This is of course a reasonable assumption from a physical point of view.
In other words, for any A with 0 < μ(A) < 1, mixing must happen under the transformation T. That is, the system state can change to any state in the sample space (with non-zero transition probability or non-zero conditional probability density) after the transformation T; every state is reachable with nonzero probability measure after the transformation T. Given the state X(t) at time t, the next state is X(t+1) = T(X(t)); X(t+1) can take any value x with μ(X(t+1) = x) > 0, where x satisfies μ(X(t) = x) > 0. If T is an ergodic transformation, then X(t+1) = T(X(t)) can reach any state reachable by X(t).
An ergodic transformation can be applied an integer number of times (discrete time); ergodic transformations can also be extended to the case of continuous time.
A stochastic process created by an ergodic transformation is called an ergodic process.
A process possesses the ergodic property if the time/empirical averages converge (to a r.v. or a deterministic value) in some sense (almost surely, in probability, or in p-th norm sense).
Strong law of large numbers: the sample average of i.i.d. random variables, each with finite mean and variance, converges to their expectation with probability one (a.s.).
Weak law of large numbers: the sample average of i.i.d. random variables, each with finite mean and variance, converges to their expectation in probability.
Central limit theorem: the normalized sum of i.i.d. random variables, each with finite mean and variance, converges to a Gaussian r.v. (convergence in distribution). Specifically, the central limit theorem states that as the sample size n increases, the distribution of the sample average of these random variables approaches the normal distribution with mean μ and variance σ²/n, irrespective of the shape of the original distribution. In other words, the sample average, re-centered at μ and scaled by σ/√n, converges to a Gaussian r.v. of zero mean, unit variance (see the numerical sketch after this list).
An ergodic process may not have the ergodic property.
As with probability theory, the theory of stochastic processes can be developed with either non-measure-theoretic or measure-theoretic probability theory.
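A numerical sketch of the three statements above (Python/numpy assumed; exponential samples are chosen as a deliberately non-Gaussian case):

```python
import numpy as np

rng = np.random.default_rng(10)
n, reps = 2000, 5000
x = rng.exponential(1.0, size=(reps, n))   # i.i.d. samples, mean 1, variance 1

s = x.mean(axis=1)                         # one sample mean per replication
# LLN: the sample means concentrate around the expectation 1.
print(s.mean(), s.std())                   # ~1.0 and ~1/sqrt(n) = 0.0224

# CLT: the normalized sample mean is approximately N(0, 1), even though
# the underlying exponential distribution is far from Gaussian.
z = (s - 1.0) * np.sqrt(n)                 # (S_n - mu) / (sigma/sqrt(n)), sigma = 1
print(np.mean(np.abs(z) <= 1.96))          # ~0.95
```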
How to characterize a stochastic process:
1. Use the n-dimensional pdf (or cdf or pmf) of n random variables at n randomly selected time instants (also called the nth-order pdf). Generally, the n-dimensional pdf is time varying. If it is time invariant, the stochastic process is stationary in the strict sense.
than the equilibrium behavior), we use time-varying marginal cdf
F(q,t) of the queue length Q(t). Then the steady-state distribution F(q)
is simply the limit of F(q,t) as t goes to infinity.
3. Use moments: expectation, auto-correlation, high-order statistics
4. Use spectrum:
power spectral density: Fourier transform of the second-order moment
bi-spectrum: Fourier transform of the third-order moment
tri-spectrum: Fourier transform of the fourth-order moment
poly-spectrum.
Limit Theorems:
1. Ergodic theorems: sufficient conditions for the ergodic property. A process possesses the ergodic property if the time/empirical averages converge (to a r.v. or a deterministic value) in some sense (almost surely, in probability, or in p-th mean sense).
Laws of large numbers
Mean ergodic theorems in L^p space
Necessary condition for the limiting sample averages to be constants instead of random variables: the process has to be ergodic (being ergodic is not the same as having the ergodic property).
2. Central limit theorems: sufficient conditions for normalized time averages to converge to a Gaussian r.v. in distribution.
Laws of large numbers
1. Weak law of large numbers (WLLN)
Sample means converge to a numerical value (not necessarily statistical
mean) in probability.
2. Strong law of large numbers (SLLN)
Sample means converge to a numerical value (not necessarily statistical
mean) with probability 1.
(SLLN/WLLN) If X1, X2, ... are i.i.d. with finite mean \mu, then sample
means converge to \mu with probability 1 and in probability.
(Kolmogorov): If {X_i} are i.i.d. r.v.'s with E[|X_i|]<infinity and
E[X_i]= \mu, then sample means converge to \mu with probability 1.
1. If {X_i} are i.i.d. r.v.'s with E[|X_i|] < infinity, E[X_i] = \mu, and Var(X_i) = infinity, then sample means converge to \mu with probability 1. But the variance of the sample means does not converge to 0; actually, the variance of the sample mean is infinite. This is an example showing that almost sure convergence does not imply convergence in the mean square sense.
Mean Ergodic Theorems:
1. Sample means converge to a numerical value (not necessarily
statistical mean) in mean square sense.
2. A stochastic process is said to be mean ergodic if its sample means
converge to the expectation.
Central limit theorems (CLT)
1. Normalized sample means converge to a Gaussian random variable in
distribution.
2. Normalized by the standard deviation of the sample mean.
3. Like lim_{x goes to 0} x/x = 1 (the limit of 0/0 is a constant), the CLT characterizes how, as n goes to infinity, (S_n - E[X])/(\sigma/sqrt(n)) converges to a r.v. N(0,1), i.e., convergence in distribution. Abusing notation a bit, lim_{n goes to \infty} (S_n - E[X])/(\sigma/sqrt(n)) = Y; the limit of 0/0 is a r.v.
4. SLLN/WLLN is about the re-centered sample mean converging to 0. CLT is about the limit of 0/0.
(Lindeberg-Levy): If {x_i} are i.i.d. and have finite mean m and finite variance
\sigma^2 (\neq 0), then the CDF of [(\sum_{i=1}^n x_i /n) - m]/(\sigma/\sqrt{n})
converges to a Gaussian distribution with mean 0 and unity variance.
Comments: the WLLN/SLLN do not require finite variance, but they yield convergence in probability and with probability 1, respectively, both stronger than the convergence in distribution in the CLT. Why?
The difference is that in the WLLN/SLLN, sample means converge to a deterministic value rather than to a random variable as in the CLT. Since the CLT also requires finite variance, the CLT gives a stronger result than the WLLN/SLLN. That is, the WLLN/SLLN only tell us that sample means converge to a deterministic value, but they do not tell how the sample means converge to the deterministic value (in what distribution?). The CLT tells us that the sample mean is asymptotically Gaussian distributed.
Implication of CLT: The aggregation of random effects follows Gaussian
distribution. We can use Gaussian approximation/assumption in practice and
enjoy the ease of doing math with Gaussian r.v.'s.
What's the intuition of the CLT? Why do we have this phenomenon (normalized sample means converging to a Gaussian r.v.)?
For i.i.d. r.v.'s with a heavy-tail or sub-exponential distribution, as long as the mean and variance are finite, the sequence satisfies the CLT.
When does the Gaussian approximation under/over-estimate the tail probability?
If the tail of the distribution decays more slowly than the Gaussian (e.g., heavy-tail distributions), the Gaussian approximation under-estimates the tail probability, i.e., the actual tail probability is larger than the Gaussian approximation.
If the tail of the distribution decays faster than the Gaussian (e.g., close-to-deterministic distributions), the Gaussian approximation over-estimates the tail probability, i.e., the actual tail probability is smaller than the Gaussian approximation.
Maximum likelihood estimators of the mean and variance of i.i.d. X_i:

(1/n) Σ_{i=1}^{n} X_i = μ + (σ/√n) Y_0 + o(1/√n)

(1/n) Σ_{i=1}^{n} (X_i − X̄_n)² = σ² + (1/√n) Z_0 + o(1/√n)

where Y_0 is distributed as N(0,1); Z_0 is a Gaussian r.v.
2. Non-identical case: If {x_i} are independent but not necessarily
identically distributed, and if each x_i << \sum_{i=1}^n x_i for
sufficiently large n, then the CDF of [(\sum_{i=1}^n x_i /n) - m]/
(\sigma/\sqrt{n}) converges to a Gaussian distribution with mean 0
and unity variance.
3. Non-independent case:
There are some theorems which treat the case of sums of non-
independent variables, for instance the m-dependent central limit
theorem, the martingale central limit theorem and the central limit
theorem for mixing processes.
How to characterize the correlation structure of a stochastic process?
1. auto-correlation function R(t1,t2) = E[X(t1)X(t2)]
For a wide-sense (covariance) stationary process, R(\tau) = R(t1, t1+\tau) for all t1 \in R.
If the process is white noise with zero mean, R(\tau) is a Dirac delta function, the magnitude of which is the double-sided power spectral density of the white noise. Note that the variance of a r.v. at any time in a white process is infinite.
If R(\tau) is a Dirac delta function, then the r.v.'s at any two different instants are orthogonal.
In discrete time, we have similar conclusions:
If a random sequence consists of i.i.d. r.v.'s, R(n) is a Kronecker delta function, the magnitude of which is the second moment of a r.v.
If R(n) is a Kronecker delta function, then the r.v.'s at any two different instants are orthogonal.
R(t1,t2) characterizes orthogonality between a process' two r.v.'s at different instants.
Cross-reference: temporal (time) autocorrelation function of a
deterministic process (which is energy limited):
R(\tau)= \int_{-infinity}^{+infinity} X(t)*X(t+\tau) dt
Discrete time: R(n) = \sum_{i=-infinity}^{+infinity}
X(i)*X(i+n)
2. auto-covariance function C(t1,t2) = E[(X(t1) - E[X(t1)])(X(t2) - E[X(t2)])]
For a wide-sense (covariance) stationary process, C(\tau) = C(t1, t1+\tau) for all t1 \in R.
If the process is a white noise, C(\tau) is a Dirac delta
function, the magnitude of which is the double-sided
power spectrum density of the white noise.
If C(\tau) is a Dirac delta function, then the r.v.'s at any two different instants are (linearly) uncorrelated.
In discrete time, we have similar conclusions:
If a random sequence consists of i.i.d. r.v.'s, C(n) is a Kronecker delta function, the magnitude of which is the variance of a r.v.
If C(n) is a Kronecker delta function, then the r.v.'s at any two different instants are uncorrelated.
C(t1,t2) characterizes linear correlation between a process' two r.v.'s at different instants.
C(t1,t2) > 0: positively correlated
C(t1,t2) < 0: negatively correlated
C(t1,t2) = 0: uncorrelated
3. mixing coefficient:
Why is the Toeplitz matrix important?
1. The covariance matrix of any wide-sense stationary discrete-time process is Toeplitz.
2. An n-by-n Toeplitz matrix T = [t_{i,j}] has t_{i,j} = a_{i-j}, where a_{-(n-1)}, a_{-(n-2)}, ..., a_{n-1} are constant numbers. Only 2n-1 numbers are needed to specify an n-by-n Toeplitz matrix.
3. In a word, the diagonals of a Toeplitz matrix are constant (constant-along-diagonals).
4. A Toeplitz matrix is constant-along-diagonals; a circulant matrix is constant-along-diagonals with diagonals that wrap around.
Toeplitz matrix: "right shift without rotation"; circulant matrix: "right shift with rotation".
Why is the circulant/cyclic matrix important?
1. Circulant matrices are used to approximate the behavior of Toeplitz matrices.
2. An n-by-n circulant matrix C = [c_{i,j}] is one in which each row is a cyclic shift of the row above it. Denote the top row by {c_0, c_1, ..., c_{n-1}}; the other rows are cyclic shifts of this row. Only n numbers are needed to specify an n-by-n circulant matrix.
3. Circulant matrices are an especially tractable class of matrices since their inverses, products, and sums are also circulant, and it is straightforward to construct them. The eigenvalues of such matrices can be easily and exactly found.
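A short sketch of both points (Python with numpy and scipy assumed; the geometric autocovariance 0.5^|k| is an arbitrary WSS example): scipy.linalg.toeplitz builds the covariance matrix, and the eigenvalues of the circulant built by wrapping the same autocovariance around are exactly the DFT of its first column.

```python
import numpy as np
from scipy.linalg import toeplitz, circulant

# Toeplitz covariance of a WSS sequence with autocovariance a_k = 0.5**|k|.
n = 6
a = 0.5 ** np.arange(n)
T = toeplitz(a)                          # constant along diagonals

# Circulant approximation: wrap the autocovariance around (c_k = c_{n-k}),
# which makes the matrix circulant and its eigenvalues the DFT of column 0.
c = 0.5 ** np.minimum(np.arange(n), n - np.arange(n))
C = circulant(c)
eig_fft = np.fft.fft(C[:, 0]).real       # exact eigenvalues via the DFT
eig_np = np.linalg.eigvals(C).real
print(np.allclose(np.sort(eig_fft), np.sort(eig_np)))   # True
```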
Empirical/sample/time average (mean)
Borel-Cantelli theorem
1. The first Borel-Cantelli theorem
If sum_{n=1}^{\infty} Pr{A_n} < \infty, then Pr{limsup_{n goes to \infty} A_n} = 0.
Intuition (using a queue with infinite buffer size): if the sum over n of the probabilities Pr{queue length >= n} is finite, then the probability that the actual queue length is infinite is 0, i.e., the actual queue length is finite with probability 1.
E[Q] = sum_{n=1}^{\infty} Pr{Q >= n}; this is another way to compute an expectation.
2. The second Borel-Cantelli theorem
If {A_n} are independent and sum_{i=1}^{n} Pr{A_i} diverges, then Pr{limsup_{n goes to \infty} A_n} = 1.
1. First-order result (about the mean): WLLN/SLLN
Question: can we use the sample mean to estimate the expectation?
The WLLN studies sufficient conditions for P{|S_n − E[X]| < ε} → 1 for every ε > 0, i.e., S_n → E[X] in probability. Let E[X_i] ≡ E[X]. If the WLLN is satisfied for every E[X] ∈ R, the estimator is called a consistent estimator.

var{S_n} = (1/n²) Σ_{i=1}^{n} Σ_{j=1}^{n} σ_{i,j}
(Markov) If lim_{n→∞} var{S_n} = 0, then S_n → E[X] in probability; i.e., convergence in mss implies convergence in probability.
o For a wide-sense stationary LRD process with var{X_i} < ∞ and σ_{i,i+n} → 0 as n → ∞, we have S_n → E[X] in probability. So in this LRD case the sample mean is a consistent estimator.
(Khintchine) If {X_i} are iid random variables with E[|X_i|] < ∞ (and possibly infinite variance), then S_n → E[X] in probability. Note that here S_n converges in probability even if it does not converge in mss (the infinite variance case).
The SLLN studies sufficient conditions for S_n → E[X] with probability 1.
(Kolmogorov) If {X_i} are iid random variables with E[|X_i|] < ∞ (and possibly infinite variance), then S_n → E[X] with probability 1. Note that here S_n converges with probability 1 even if it does not converge in mss (the infinite variance case).
2. Second-order result (about the variance): convergence in mss and CLT
Mean ergodic theorem: S_n → E[X] in mss if and only if lim_{n→∞} var{S_n} = 0.
Mean ergodic theorem (wide-sense stationary, WSS): let {X_i} be WSS with E[X_i] < ∞. If var{X_i} < ∞ and the covariance σ_{i,i+n} → 0 as n → ∞, then S_n → E[X] in mss.
o For an LRD process with var{X_i} < ∞ and σ_{i,i+n} → 0 as n → ∞, we have S_n → E[X] in mss. So in this LRD case, as the number of samples increases, the variance of the estimator (sample mean) reduces.
o An LRD process whose covariance matrix is all ones, e.g.

[1 1 1 1]
[1 1 1 1]
[1 1 1 1]
[1 1 1 1]

(extended analogously for any n), has var{S_n} = 1 for all n, i.e., the variance of the estimator does not reduce with more samples.
o An LRD process with covariance matrix

[1.0 0.5 0.5 0.5]
[0.5 1.0 0.5 0.5]
[0.5 0.5 1.0 0.5]
[0.5 0.5 0.5 1.0]

(extended analogously for any n) has lim_{n→∞} var{S_n} = 0.5, i.e., the variance of the estimator does not reduce to 0 even with infinitely many samples.
o An SRD process always has lim_{n→∞} var{S_n} = 0, since lim_{n→∞} Σ_{j=1}^{n} σ_{i,i+j} < ∞ implies σ_{i,i+n} → 0.
o Note that mean ergodic theorem only tells the condition for the
variance of the sample mean to converge but does not tell the
convergence rate. We expect a good estimator has a fast convergence
rate, i.e., for a fixed n, it should have small variance; in terms of
convergence rate, we know the relation iid>SRD>LRD holds, i.e., iid
sequence has the highest convergence rate.
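The contrast between the iid and constant-covariance cases can be seen numerically. In the sketch below (Python/numpy assumed), the correlated sequence is built as X_i = √ρ·Z + √(1−ρ)·W_i, a simple construction giving pairwise covariance ρ = 0.5 as in the example above; the variance of its sample mean stays near ρ instead of decaying like 1/n.

```python
import numpy as np

rng = np.random.default_rng(11)
reps, n, rho = 4000, 400, 0.5

# Constant-covariance sequence: X_i = sqrt(rho)*Z + sqrt(1-rho)*W_i has
# var{X_i} = 1 and cov(X_i, X_j) = rho for i != j, as in the example above.
z = rng.normal(size=(reps, 1))
w = rng.normal(size=(reps, n))
x_corr = np.sqrt(rho) * z + np.sqrt(1 - rho) * w
x_iid = rng.normal(size=(reps, n))

print(x_iid.mean(axis=1).var())    # ~ 1/n = 0.0025: variance decays
print(x_corr.mean(axis=1).var())   # ~ rho + (1-rho)/n = 0.501: stuck near rho
```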
CLT:
o Motivation: the WLLN tells us the first-order behavior of the estimator, the sample mean; i.e., sample means converge to the expectation, and we have an unbiased estimator since E[S_n] = E[X]. The mean ergodic theorem tells us the second-order behavior of the estimator; i.e., the variance of the sample means converges to 0. The next question is: what about the distribution of the estimator? This is answered by the CLT.
o (Lindeberg-Levy CLT): If {X_i} are iid random variables with E[X_i] < ∞ and var{X_i} < ∞, then S_n → Y in distribution, where Y is a normal random variable with mean E[X_i] and variance var{X_i}/n. Denote this S_n ~ N(μ, σ²/n). I.e., (S_n − E[X_i]) / sqrt(var{X_i}/n) → Y_0 in distribution, where Y_0 is a normal random variable with zero mean and unity variance.
o How to use the CLT: given the normal distribution of the estimator (the sample mean), we can evaluate the performance of the estimator (e.g., a confidence interval). That is, the CLT provides approximations to, or limits of, performance measures (e.g., confidence intervals) for the estimator as the sample size gets large. So we can compute the confidence interval according to the normal distribution. Since S_n ~ N(μ, σ²/n), we also have μ ~ N(S_n, σ²/n), and we can then compute the confidence interval of μ.
o The CLT does not apply to self-similar processes, since the randomness does not average out for self-similar processes. The sample means have the same (non-zero) variance; i.e., even if the number of samples goes to infinity, the variance of the sample mean does not go to zero.
o The CLT only uses the expectation and variance of the sample mean, since the sample mean is asymptotically normal. Higher-order statistics of the sample mean are ignored. Why is the limiting distribution Gaussian? In the proof using the moment-generating function, we see that the higher-order terms O(n^{-2}) vanish, since the first-order and second-order statistics dominate the value of the moment-generating function as n goes to infinity. Intuitively, first-order and second-order statistics can completely characterize the distribution in the asymptotic region, since the randomness is averaged out; we only need to capture the mean and variance (energy) of the sample mean in the asymptotic domain. The Gaussian distribution is the maximum entropy distribution under constraints on the mean and variance.
o What about LRD, subexponential distributions, heavy-tail distributions?
Large deviation theory
o Motivation: the WLLN tells us P{|S_n − E[X]| ≥ ε} → 0 for every ε > 0, but it does not tell how fast this converges to zero. Large deviation theory characterizes the convergence rate. Why is this important? Because in many applications we need to compute the probability of a rare event like {|S_n − E[X]| ≥ ε}, where ε is large.
o The large deviation principle (LDP) is about computing a small probability P{|S_n − E[X]| ≥ ε} for sufficiently large n, more specifically

limsup_{n→∞} (1/n) log P{|S_n − E[X]| ≥ ε}   or   limsup_{n→∞} (1/n) log P{Σ_{i=1}^{n} X_i ≥ na}

for a > E[X]. The LDP characterizes the probability of a large deviation of Σ_{i=1}^{n} X_i of order O(n) about the expectation n·E[X]; i.e., a large deviation of S_n of order O(1) about the expectation E[X].
o In contrast, the CLT is for computing lim_{n→∞} P{(S_n − E[X]) / (σ/√n) ≤ x}. It characterizes the probability of a small deviation of order O(n^{−1/2}) about the expectation. As n goes to infinity, O(n^{−1/2}) goes to 0; so it is a small deviation with respect to a fixed expectation. On the other hand, this deviation O(n^{−1/2}) is of the same order as the standard deviation of the sample mean, so the deviation normalized by the standard deviation remains finite.
o For an LRD process with var{X_i} < ∞ and σ_{i,i+n} → 0 as n → ∞, even though we have the WLLN, i.e., P{|S_n − E[X]| ≥ ε} → 0 for every ε > 0, the probability P{|S_n − E[X]| ≥ ε} does not satisfy the large deviation principle (i.e., the convergence rate is not exponential). We need to compress the time axis to obtain a large deviation principle.
White Gaussian noise:
The auto-correlation function: R(τ) = σ²δ(τ).
The power spectral density: S(f) = σ², f ∈ (−∞, +∞).
The marginal PDF is Gaussian with zero mean and infinite variance. If the white Gaussian noise passes through a filter with frequency response H(f) in the frequency band [−B, B], the resulting marginal PDF is Gaussian with zero mean and variance

∫_{−B}^{B} σ² |H(f)|² df.
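A discrete-time sanity check of the last statement (Python/numpy assumed; the short FIR filter h is arbitrary): for a white sequence with variance σ², the output variance equals the integral of σ²|H(f)|² over one frequency period, which by Parseval equals σ² Σ_k h[k]².

```python
import numpy as np

rng = np.random.default_rng(12)
sigma2 = 4.0
x = rng.normal(0.0, np.sqrt(sigma2), size=2_000_000)  # discrete-time white noise

h = np.array([0.25, 0.5, 0.25])        # an arbitrary short FIR filter
y = np.convolve(x, h, mode="valid")    # filtered noise

# Output variance = integral of sigma2*|H(f)|^2 over one frequency period,
# which by Parseval equals sigma2 * sum(h^2).
print(y.var(), sigma2 * np.sum(h ** 2))   # both ~ 1.5
```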
Markov property:
The Markov property is useful since you can reduce the dimensionality of the sufficient statistics using conditional independence. But there exist stationary sources that are not Markovian of any finite order; that is, there isn't conditional independence and you cannot reduce the dimensionality of the sufficient statistics; you have to use all the data to construct the sufficient statistics.
Classification of stochastic processes:
Memoryless processes: Poisson process, Bernoulli process
Short-memory processes: sum of the auto-correlations (actually, auto-
covariance coefficients) is finite. (SRD) Markov processes (having
memoryless property, i.e., conditional independence) possess short memory.
E.g., AR(1) has a memory of 1 time unit.
Long-memory processes: sum of the auto-correlations (actually, auto-
covariance coefficients) is infinite. (LRD, self similar, sub-exponential
distribution, heavy-tail distribution)
CONCLUSION
In probability theory, a stochastic process (pronounced /stoʊˈkæstɪk/), or sometimes random process (widely used), is a collection of random variables; it is often used to represent the evolution of some random value, or system, over time.
This is the probabilistic counterpart to a deterministic process (or deterministic
system). Instead of describing a process which can only evolve in one way (as in the
case, for example, of solutions of an ordinary differential equation), in a stochastic or
random process there is some indeterminacy: even if the initial condition (or starting
point) is known, there are several (often infinitely many) directions in which the
process may evolve.
REFERENCES
[1] P. Bremaud; An introduction to probabilistic modeling, Undergraduate Texts in Mathematics, Springer-Verlag, N.Y., 1987
[2] W. Feller; An introduction to probability theory and its applications, Vol. 1, 2nd edition, J. Wiley and Sons, N.Y., 1961
[3] R. E. Mortensen; Random signals and systems, J. Wiley, N.Y. 1987
[4] A. Papoulis; Probability, random variables and stochastic processes, McGraw Hill, 1965
[5] B. Picinbono; Random signals and systems, Prentice-Hall, Englewood Cliffs, 1993
[6] E. Wong; Introduction to random processes, Springer-Verlag, N.Y., 1983
At an advanced level the following two books are noteworthy from the point of view of engineering applications.
[1] E. Wong and B. Hajek; Stochastic processes in engineering systems, 2nd Edition,
[2] A. N. Shiryayev; Probability, Graduate Texts in Mathematics, Springer-
Verlag, N.Y. 1984
[10] DOOB, J. L. - A probability approach to the heat equation, Trans. Amer. Math. Soc. 80, 1 (1955), pp. 216-280
[11] DOOB, J. L. - Conditional Brownian motion and the boundary limits of harmonic functions, Bull. Soc. Math. France, 85 (1957), pp. 431-458
[12] DOOB, J. L. - Probability methods applied to the first boundary value problem, Proc. of Third Berkeley Symposium on Math. Stat. and Probability, 2, pp. 49-80
[13] DOOB, J. L. - Probability theory and the first boundary value problem, Ill. Jour. Math., 2-1 (1958), pp. 19-36
[14] DOOB, J. L. - Semi-martingales and subharmonic functions, Trans. Amer. Math. Soc., 77-1 (1954), pp. 86-121
[15] HUNT, G. A. - Markov processes and potentials, Ill. Jour. Math., 1 (1957), pp. 44-93; pp. 316-369 and 2 (1958), pp. 151-213
[16] ITO, K. & McKEAN, H. P. - Potentials and the random walk, Ill. Jour. Math., 4-1 (1960), pp. 119-132
[17] LEVY, P. - Processus Stochastiques et Mouvement Brownien, P
[18] DOOB, J. L. - loc. cit.
[19] FROSTMAN, O. - Potentiel d'équilibre et capacité des ensembles avec quelques applications à la théorie des fonctions, Thèse pour le doctorat, Lund, 1935, pp. 1-118
[20] ITO, K. - Stochastic Processes, Jap. Jour. Math., 18 (1942), pp. 261-301