Lecture 3

1-D Random Variables

1. Probability
Many people have an intuitive understanding of probability. For example, when one hears the
weather forecast: There is a 50% chance of snow today, it suggests that the probability of the
event that it will snow today is 0.5. In fact, probability is a measure of the size of an event.
Hence, in order to develop a firm grasp of the basic mathematical elements of probability, it is
first necessary to understand what an event is. Then it is necessary to discuss in concrete terms
what is meant by a measure of size.
Definition 1. An event is a set; that is, it is a collection of entities (or objects). The notation { }
will denote a set.
For example, consider a weather forecast that includes the event that it snows, and the event that
it does not snow. The notation for these events is {recording snow} and {recording no snow}.
Note the use of the word record. It is an action verb. It is not correct to say that to snow is an
event. It may well snow, but not in sufficient amounts at a given measurement location to record
it. This distinction points out a crucial characteristic of an event; namely how it is
recorded/observed/measured. This, in turn, is related directly to the resolution of the sensing
device. The following example may help to clarify this.
Example 1. An increasingly important problem in the aviation industry is the problem of aging
aircraft. Over time, a part such as a wing or a turbine blade will develop micro-cracks. Even
though a part may have a crack, to say that the part has a crack is not an event. It is a claim as
to an attribute of that part. This claim is not the same as the claim: Based on a nondestructive
evaluation of the part with crack width detection resolution of .005mm, the part has a crack.
This claim includes a description of the act of recording the presence of a crack of at least
.005mm width.
At this point a typical course in probability would proceed to discuss the general concepts of a
sample space and sigma algebras. Since we do not have the luxury of devoting so much time to
probability theory, we will present these and other concepts in relation to a random variable.
Definition 2. A random variable is an action which, when performed, results in a number.
Moreover, the number that results is not perfectly predictable. Notationally, the action will be
assigned an uppercase letter (e.g. X), whereas the number that results by performing the action
will be assigned the corresponding lower case (e.g. x).
Example 1 (continued). Let X = The act of recording whether or not a crack is detected. If we
enter the number 0 to represent no crack and 1 to represent crack, then we have two possible
events. Notationally, these can be represented simply as {0} and {1}, respectively. In the
parlance of random variables, these two events are typically denoted as [X = 0] and [X = 1],
respectively. For a number $x \in \{0,1\}$, the notation becomes $\{x\}$ or, equivalently, [X = x].
Definition 3. The set (or collection) of all the possible numerical values that a random variable,
X, can have is called the sample space for X. It will be denoted as SX.
Definition 4. The set (or collection) of all possible subsets of $S_X$ is called the field of events. It
will be denoted as $\mathcal{F}_X$.
Example 1 (continued). The sample space for X = The act of noting whether or not a crack of at
least .005mm is detected is $S_X = \{0,1\}$. The corresponding field of events is
$\mathcal{F}_X = \{\{0\}, \{1\}, S_X, \varnothing\}$, where $\varnothing$ is the symbol for the empty set.
The above definitions are not entirely mathematically correct. We have ignored the concept of
sigma algebras. But since this is not a course in probability theory, these definitions will suffice
for our purposes. We are now in a position to discuss probability.
Definition 5. Let X be a random variable with sample space $S_X$ and field of events $\mathcal{F}_X$. Then
probability, denoted Pr(*), is a measure of the size of the sets contained in $\mathcal{F}_X$. Moreover, it has
the following properties:
(P1): $\Pr(\varnothing) = 0$; (P2): $\Pr(S_X) = 1$; (P3): for any $A, B \in \mathcal{F}_X$ with $A \cap B = \varnothing$,
$\Pr(A \cup B) = \Pr(A) + \Pr(B)$.
It needs to be emphasized that Pr(*) measures the size of a set, and not the size of a number. For
example, Pr(1) makes no sense, whereas $\Pr(\{1\})$ does.
Example 1 (continued). From property (P2), $\Pr(S_X) = \Pr(\{0,1\}) = 1$. Suppose that we define
$\Pr(\{1\}) \triangleq p$. Since $\{0\} \cap \{1\} = \varnothing$, and $\{0\} \cup \{1\} = S_X$, it follows from properties (P2) and (P3)
that:
$$1 = \Pr(S_X) = \Pr(\{0\} \cup \{1\}) = \Pr(\{0\}) + \Pr(\{1\}) = \Pr(\{0\}) + p$$
$$\Rightarrow \quad \Pr(\{0\}) = 1 - p.$$
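As an aside (this sketch is not part of the lecture), the little probability space of Example 1 can be written out in a few lines of Python; the value p = 0.25 is an arbitrary assumption:

```python
import math

p = 0.25  # assumed value of Pr({1}); any 0 <= p <= 1 works

# The field of events has exactly four members: the empty set, {0}, {1}, and S_X.
Pr = {
    frozenset():       0.0,      # (P1): Pr(empty set) = 0
    frozenset({0}):    1 - p,    # deduced from (P2) and (P3) above
    frozenset({1}):    p,        # defined: Pr({1}) = p
    frozenset({0, 1}): 1.0,      # (P2): Pr(S_X) = 1
}

# (P3): {0} and {1} are disjoint, so their probabilities must add to Pr(S_X).
assert math.isclose(Pr[frozenset({0})] + Pr[frozenset({1})], Pr[frozenset({0, 1})])
```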

Definition 6. Let X be a random variable with sample space $S_X$, field of events $\mathcal{F}_X$, and
probability measure Pr(*). Then the ordered triple $(S_X, \mathcal{F}_X, \Pr)$ is said to be a probability space
for X.


Even though, for a given random variable, say, X, the attendant probability measure Pr(*) is
appropriate, very often it is not used directly. Instead, one of the following two quantities related
to it is used.
Definition 7. Let X be a random variable with sample space $S_X$, field of events $\mathcal{F}_X$, and
probability measure Pr(*). For any number $x \in (-\infty, \infty)$, the half-open interval $(-\infty, x]$, which
can also be denoted as $[X \le x]$, is called a cumulative event. The probability of this event,
$$\Pr\{(-\infty, x]\} = \Pr[X \le x] \triangleq F_X(x),$$
is called the cumulative distribution function (cdf) for X.
Definition 8. The (perhaps generalized) derivative of $F_X(x)$, $f_X(x) \triangleq dF_X(x)/dx$, is called the
probability density function (pdf) for X.
Example 1 (continued). Since the sample space for X is $S_X = \{0,1\}$, then for any $x \in (-\infty, 0)$
we have
$$F_X(x) = \Pr\{(-\infty, x]\} = \Pr[X \le x] = 0 \quad \text{for } x < 0.$$
Since $\Pr[X = 0] = 1 - p$, we have
$$F_X(0) = \Pr\{(-\infty, 0]\} = \Pr[X \le 0] = 1 - p.$$
Since there is no probability to accumulate for x > 0 until we reach the point x = 1, we have
$$F_X(x) = \Pr\{(-\infty, x]\} = \Pr[X \le x] = 1 - p \quad \text{for } 0 \le x < 1,$$
and
$$F_X(x) = \Pr\{(-\infty, x]\} = \Pr[X \le x] = 1 \quad \text{for } x \ge 1.$$
The cdf is shown in the plot below.

[Figure: staircase plot of $F_X(x)$, rising from 0 to $1 - p$ at x = 0 and from $1 - p$ to 1 at x = 1.]
Figure 1. Graph of the cdf related to Example 1.
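The staircase in Figure 1 is easy to evaluate in code. A minimal Python sketch of this cdf (not part of the lecture; p = 0.25 is again an assumed value):

```python
def bernoulli_cdf(x: float, p: float = 0.25) -> float:
    """cdf of Example 1: jumps by 1 - p at x = 0 and by p at x = 1."""
    if x < 0:
        return 0.0        # no probability to the left of 0
    elif x < 1:
        return 1.0 - p    # the jump at x = 0 has been accumulated
    else:
        return 1.0        # both jumps have been accumulated

for x in (-0.5, 0.0, 0.5, 1.0, 2.0):
    print(f"F_X({x:4}) = {bernoulli_cdf(x):.2f}")
```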


Clearly, the formal derivative $f_X(x) = dF_X(x)/dx$ does not exist at the collection of points $S_X =
\{0,1\}$. In this situation we will define the generalized derivative:
$$f_X(x) \triangleq \frac{dF_X(x)}{dx} = (1 - p)\,\delta(x) + p\,\delta(x - 1).$$

The function $\delta(x)$ is called the Dirac delta function. It is not a proper function. Rather, it is
defined in relation to its integral. Specifically, for any continuous function $g(x)$ and any chosen
$x_o$, $\delta(x)$ is defined via its sifting property:
$$\int_{-\infty}^{\infty} g(x)\,\delta(x - x_o)\,dx = g(x_o).$$

In words, what $\delta(\cdot)$ does is sift out the value of the function it is integrated against, at the
location that makes its argument equal to zero. Another, less mathematical, term for
describing $\delta(x)$, and one that is used in systems and control theory, is the unit impulse.
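Because $\delta(x)$ is defined only through its integral, a good way to build intuition is to replace it with a narrow unit-area pulse and watch the sifting property emerge as the pulse narrows. A minimal numerical sketch (not part of the lecture), assuming a Gaussian pulse and the test function g(x) = cos(x):

```python
import numpy as np

def narrow_gaussian(x, width):
    """Unit-area pulse; behaves like delta(x) as width -> 0."""
    return np.exp(-x**2 / (2 * width**2)) / (width * np.sqrt(2 * np.pi))

g = np.cos                # any continuous test function
xo = 0.5                  # the sifting location
x = np.linspace(-10, 10, 200_001)
dx = x[1] - x[0]

for width in (0.5, 0.1, 0.01):
    # Riemann-sum approximation of the sifting integral
    approx = np.sum(g(x) * narrow_gaussian(x - xo, width)) * dx
    print(f"width = {width:5}: integral = {approx:.6f}  vs  g(xo) = {np.cos(xo):.6f}")
```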
In relation to $F_X(x)$ given in Figure 1, the generalized derivative, $f_X(x)$, is shown below.

[Figure: plot of $f_X(x)$ consisting of impulses of intensity $1 - p$ at x = 0 and $p$ at x = 1.]
Figure 2. The generalized derivative of $F_X(x)$ given in Figure 1.


Remark. The numerical values $1 - p$ and $p$ shown in Figure 2 are not the values of $f_X(x)$.
This function has no well-defined values on $S_X = \{0,1\}$. Loosely speaking, one could also say
that its value is infinity on this set. The numerical values $1 - p$ and $p$ are said to be the intensity
values of $f_X(x)$. They are, in fact, the sizes of the jumps of the cdf $F_X(x)$.
Example 2. The lifetime of an electrical component is often such that the probability of failure is
greatest at the beginning of its life. This is why, when you order a personal computer, the pc
manufacturer will run a series of exhaustive tests (i.e. burn it in) prior to sending it to you. Let X
denote the act of recording the lifetime (i.e. time to failure) of an electronic device, and, for
convenience, suppose that the recording resolution is infinitely precise. Then the sample space
for X is $S_X = (0, \infty)$. Since this is a continuous open interval, X is said to be a continuous
random variable. A very common pdf model for X is the exponential pdf: $f_X(x) = \lambda e^{-\lambda x}$. The
corresponding cdf is:
$$\Pr[X \le x] = F_X(x) = \int_0^x f_X(u)\,du = \int_0^x \lambda e^{-\lambda u}\,du = 1 - e^{-\lambda x}.$$

This is a 1-parameter pdf model, since it is parameterized by the single parameter $\lambda$. Notice that
since x has units of time, and since the cdf is dimensionless, the parameter $\lambda$ must have units
of 1/time. Equivalently, $\lambda^{-1}$ must have units of time. Suppose that $\lambda^{-1} = 0.63$ years. The pdf and
cdf for X are plotted below.

[Figure: plots of the pdf (blue) and cdf (black) versus x, for 0 < x < 4.5.]
Figure 3. Plots of the pdf, $f_X(x)$, and the cdf, $F_X(x)$.
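For illustration, a minimal matplotlib sketch (not part of the lecture) that would regenerate plots like Figure 3, assuming $\lambda^{-1} = 0.63$; the shaded band anticipates the event [1.0, 2.0] discussed next:

```python
import numpy as np
import matplotlib.pyplot as plt

lam = 1 / 0.63                    # rate parameter, since 1/lambda = 0.63 years
x = np.linspace(0.0, 4.5, 500)

pdf = lam * np.exp(-lam * x)      # f_X(x) = lambda * exp(-lambda * x)
cdf = 1 - np.exp(-lam * x)        # F_X(x) = 1 - exp(-lambda * x)

plt.plot(x, pdf, "b", label="pdf  f_X(x)")
plt.plot(x, cdf, "k", label="cdf  F_X(x)")
plt.axvspan(1.0, 2.0, alpha=0.2)  # the event [1.0, 2.0] discussed below
plt.xlabel("x (years)")
plt.legend()
plt.show()
```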


Even though the pdf is more commonly used to describe X, when it comes to computing
probability it is often the cdf that can make life a little easier. For example, suppose that we want
to compute the probability of the event [1.0, 2.0], which is also denoted as $[1.0 \le X \le 2.0]$.
This event is the shaded region of the x-axis shown in Figure 3. To compute the probability of
this event using the pdf, we need to integrate it over this region:
$$\Pr[1.0 \le X \le 2.0] = \int_1^2 f_X(x)\,dx = \left[-e^{-\lambda x}\right]_{x=1}^{x=2} = e^{-1/0.63} - e^{-2/0.63} \cong 0.163.$$

However, since $F_X(x)$ gives all the area to the left of any chosen number, x, we can more easily
compute the probability of interest as:
$$\Pr[1.0 \le X \le 2.0] = F_X(2) - F_X(1) = e^{-1/0.63} - e^{-2/0.63} \cong 0.163.$$
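Both routes are easy to check numerically; a minimal sketch (not part of the lecture), assuming SciPy is available for the quadrature:

```python
import numpy as np
from scipy.integrate import quad

lam = 1 / 0.63
pdf = lambda x: lam * np.exp(-lam * x)
cdf = lambda x: 1 - np.exp(-lam * x)

p_int, _ = quad(pdf, 1.0, 2.0)    # Method 1: integrate the pdf over [1, 2]
p_cdf = cdf(2.0) - cdf(1.0)       # Method 2: difference of the cdf, no integration

print(p_int, p_cdf)               # both ~0.163
```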


2. Expectations
Again, let X be a random variable, with sample space $S_X$ and with pdf $f_X(x)$. If one knows
$f_X(x)$, then one knows everything about the probabilistic structure of X. However, this
knowledge is not in relation to a scalar parameter, but to a collection of scalars $\{f_X(x)\}_{x \in S_X}$. In
the case where $S_X$ includes only two scalars (e.g. 0 and 1), it follows that the collection of pdf
scalars includes only two values. However, suppose that (as is often the case) X is a continuous
random variable. Then there are an infinite number of scalars in the collection $\{f_X(x)\}_{x \in S_X}$.
This can be problematic when using a finite number, say, n, of measurements of X in order to
estimate $f_X(x)$ for every $x \in S_X$. As a general rule of thumb, one needs to have many more
measurements than parameters being estimated. Suppose, for example, that one has 100
measurements with which to construct a histogram-based estimate of $f_X(x)$. If one desires a
histogram with high resolution (say, 10 bins), then one can expect that the 10 bin heights will not
be very trustworthy. If one desires trustworthy bin heights, then it will be necessary to use fewer
bins. Consequently, there is a trade-off: More bins will give higher resolution but higher bin height
uncertainty. Fewer bins will reduce the bin height uncertainty, but give lower resolution.
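This trade-off is easy to see in simulation. A minimal sketch (not part of the lecture), assuming 100 exponential lifetime measurements and an arbitrary seed:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=0.63, size=100)   # n = 100 recorded lifetimes

for bins in (5, 10, 25):
    heights, edges = np.histogram(x, bins=bins, density=True)
    counts, _ = np.histogram(x, bins=bins)
    # More bins give finer resolution but fewer samples per bin,
    # so each estimated bin height of f_X is less trustworthy.
    print(f"{bins:2d} bins: avg count/bin = {counts.mean():5.1f}, "
          f"first bin height = {heights[0]:.2f}")
```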
For this reason, often one will downplay, if not forego, an investigation of $f_X(x)$, and resort to
estimating parameters such as the mean and variance of X. These parameters are defined as
special cases of the following definition.
Definition 9. Let X be a random variable, with sample space $S_X$ and with pdf $f_X(x)$, and let
$g(X)$ be any chosen function of X. Then
$$E[g(X)] \triangleq \int_{S_X} g(x) f_X(x)\,dx. \qquad (1)$$

Remark. In any textbook on probability it would be heresy to call (1) a definition. It is, in fact, a
theorem. However, since this course is not a course in probability, we will take (1) to be a
definition.
Definition 10. The kth moment of a random variable X is $E(X^k)$. In particular, the first moment
of X is called the mean (or expected value) of X. It will be denoted as $E(X) \triangleq \mu_X$. The kth
central moment of X is $E[(X - \mu_X)^k]$. In particular, the second central moment of X is called
the variance of X. It will be denoted as $\sigma_X^2$.
Example 1 (continued). We will now use (1) to compute the mean of X. In this case, we have
$g(X) = X$:
$$\mu_X = E(X) = \int_{S_X} x f_X(x)\,dx = \sum_{x=0}^{1} x f_X(x) = 0(1-p) + 1(p) = p.$$
To compute the variance of X, we use (1) with $g(X) = (X - \mu_X)^2$. In this case, we have
$$\sigma_X^2 = E[(X - \mu_X)^2] = \int_{S_X} (x - \mu_X)^2 f_X(x)\,dx = \sum_{x=0}^{1} (x - p)^2 f_X(x) = (0-p)^2(1-p) + (1-p)^2 p = p(1-p).$$
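As a sanity check (not part of the lecture; p = 0.25 and the seed are arbitrary assumptions), simulated draws of X should show a sample mean near p and a sample variance near p(1 - p):

```python
import numpy as np

p = 0.25
rng = np.random.default_rng(1)
x = rng.binomial(1, p, size=100_000)   # 0/1 draws of the crack indicator X

print(x.mean(), p)                     # sample mean vs mu_X = p
print(x.var(), p * (1 - p))            # sample variance vs sigma_X^2 = p(1 - p)
```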

Before we proceed, it is both instructive and expedient to present the following theorem.
THEOREM 1. $E(aX + b) = aE(X) + b$.
Proof: Let $g(X) = aX + b$. Then (1) gives
$$E[g(X)] = E(aX + b) = \int_{S_X} (ax + b) f_X(x)\,dx = a\int_{S_X} x f_X(x)\,dx + b\int_{S_X} f_X(x)\,dx = aE(X) + b,$$
since $\int_{S_X} f_X(x)\,dx = 1$. $\square$

We will now use this theorem to prove the next theorem.


THEOREM 2. $\sigma_X^2 = E(X^2) - \mu_X^2$.
Proof:
$$\sigma_X^2 = E[(X - \mu_X)^2] = E(X^2 - 2\mu_X X + \mu_X^2) = E(X^2) - 2\mu_X E(X) + \mu_X^2 = E(X^2) - 2\mu_X^2 + \mu_X^2 = E(X^2) - \mu_X^2. \;\square$$

The method of computing the variance of a random variable using THEOREM 2 can be
computationally advantageous.
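One reason is that $E(X)$ and $E(X^2)$ can be accumulated in a single pass over the data, with no need to know $\mu_X$ in advance. A minimal sketch of the two computations (not part of the lecture), assuming simulated exponential data:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=0.63, size=100_000)

mu = x.mean()
var_central  = ((x - mu) ** 2).mean()    # E[(X - mu)^2] computed directly
var_shortcut = (x ** 2).mean() - mu**2   # THEOREM 2: E(X^2) - mu^2

print(var_central, var_shortcut)         # both ~1/lambda^2 = 0.63^2 ~ 0.397
```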
Example 2 (continued). From a table of integrals,
http://en.wikipedia.org/wiki/List_of_integrals_of_exponential_functions
we have:
$$\int x e^{-\lambda x}\,dx = -\frac{e^{-\lambda x}}{\lambda^2}(\lambda x + 1).$$
The mean of X is:
$$\mu_X = \int_0^\infty x\,\lambda e^{-\lambda x}\,dx = \left[-\frac{e^{-\lambda x}}{\lambda}(\lambda x + 1)\right]_0^\infty = \frac{1}{\lambda}.$$
The second moment of X is:
$$E(X^2) = \int_0^\infty x^2\,\lambda e^{-\lambda x}\,dx = \left[-e^{-\lambda x}\left(x^2 + \frac{2x}{\lambda} + \frac{2}{\lambda^2}\right)\right]_0^\infty = \frac{2}{\lambda^2}.$$
Hence, from THEOREM 2, the variance of X is:
$$\sigma_X^2 = E(X^2) - \mu_X^2 = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}.$$
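These closed-form moments can be verified by numerical quadrature. A minimal sketch (not part of the lecture), assuming SciPy and $\lambda^{-1} = 0.63$:

```python
import numpy as np
from scipy.integrate import quad

lam = 1 / 0.63   # lambda from Example 2

mean, _ = quad(lambda x: x * lam * np.exp(-lam * x), 0, np.inf)
ex2,  _ = quad(lambda x: x**2 * lam * np.exp(-lam * x), 0, np.inf)

print(mean, 1 / lam)               # both ~0.63
print(ex2, 2 / lam**2)             # both ~0.794
print(ex2 - mean**2, 1 / lam**2)   # variance: both ~0.397
```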


Before proceeding to extend the results of this section and the last to the case of two random
variables, we offer the following extension of Example 2.
Example 2 (continued). Suppose that it has been decided that an electrical component, once
placed in service, should not remain in the field for longer than 1 year. Then the resulting random
variable, call it Y, relates to X as follows:
For $0 \le x < 1$, the cumulative events $[X \le x]$ and $[Y \le x]$ are one and the same event.
Also, the events $[X \ge 1]$ and $[Y = 1]$ are one and the same event. It follows that for $0 \le x < 1$,
$\Pr[X \le x] = \Pr[Y \le x]$; that is, $F_X(x) = F_Y(x)$ and $f_X(x) = f_Y(x)$. Moreover,
$\Pr[X \ge 1] = 1 - F_X(1) = e^{-1/0.63} \cong 0.20 = \Pr[Y = 1]$. In relation to Figure
3, the pdf for Y is exactly the pdf for X on the interval (0,1). The area beneath $f_X(x)$ over the
interval $[1, \infty)$ is mapped to a lump of probability at the location x = 1.

[Figure: plots of $f_X(x)$ and $f_Y(x)$; the two agree on (0,1), and $f_Y$ has an impulse of intensity 0.20 at x = 1.]
Figure 4. Plots of $f_X(x)$ (BLUE) and $f_Y(x)$ (RED).


We are now in a position to compute the mean and variance of Y. The component of the mean
associated with the continuous part of the pdf is
$$\int_0^1 x\,\lambda e^{-\lambda x}\,dx = \frac{1}{\lambda}\left[1 - (\lambda + 1)e^{-\lambda}\right] \cong 0.30.$$
The component associated with the lump of probability, 0.20, at the location x = 1 is 0.20(1) = 0.20. Hence,
$$\mu_Y \cong 0.30 + 0.20(1) = 0.50 \quad (\text{versus } \mu_X = 0.63).$$

Computation of the variance of Y proceeds in the same manner. The component of the second
moment associated with the continuous portion of the pdf is:
$$\int_0^1 x^2\,\lambda e^{-\lambda x}\,dx = \frac{2}{\lambda^2} - e^{-\lambda}\left(1 + \frac{2}{\lambda} + \frac{2}{\lambda^2}\right) \cong 0.17.$$


The component associated with the lump of probability at x = 1 is $0.20(1)^2 = 0.20$. Hence, the second
moment for Y is $E(Y^2) \cong 0.17 + 0.20 = 0.37$ [versus $E(X^2) \cong 0.80$]. It follows from THEOREM 2 that the
variance of Y is
$$\sigma_Y^2 = E(Y^2) - \mu_Y^2 \cong 0.37 - (0.50)^2 \cong 0.12 \quad (\text{versus } \sigma_X^2 \cong 0.40).$$
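These values can be sanity-checked by simulation, since Y is X censored at 1 year, i.e. Y = min(X, 1). A minimal sketch (not part of the lecture; seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=0.63, size=1_000_000)  # lifetimes X
y = np.minimum(x, 1.0)    # retire the component at 1 year: Y = min(X, 1)

print(y.mean())                       # ~0.50 = mu_Y
print((y**2).mean())                  # ~0.37 = E(Y^2)
print((y**2).mean() - y.mean()**2)    # ~0.12 = sigma_Y^2, via THEOREM 2
```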

End of Lecture 3

