
Handout No. 5
Phil.015
April 4, 2016

JOINT AND CONDITIONAL PROBABILITY DISTRIBUTIONS, INDEPENDENT VARIABLES, AND HYPOTHESIS TESTING
1. Joint Probability Distribution Functions
Studying the relationships between two (or more) random variables requires their joint probability distribution.[1] In Handout No. 4 we treated probability distributions of single discrete random variables. We now consider two (or more) random variables that are defined simultaneously on the same sample space. In this section we focus on examples involving two random variables. First we make the following definition of the concept of a joint or, more specifically, bivariate probability distribution function (jpdf):

Given a probability model ⟨Ω, A, P⟩ of a target random experiment, let X : Ω → R and Y : Ω → R be two discrete random variables with respective finite sets of possible values X(Ω) = {x_1, x_2, ..., x_n} and Y(Ω) = {y_1, y_2, ..., y_m}. The probability distribution function

p_{X,Y}(x, y) =_df P(X = x & Y = y)

is called the joint probability distribution function of X and Y. Thus, to specify a joint probability distribution of X and Y, we must specify the pairs of values (x_i, y_j) together with the probabilities p_{X,Y}(x_i, y_j) =_df P(X = x_i & Y = y_j) for all i = 1, 2, ..., n and j = 1, 2, ..., m.[2] Joint probability distributions of two random variables with finitely many values are best exhibited by an n × m double-entry table, similar to the one below:
[1] For example, changing one of the two related random variables may cause a change in the other. Also, a relationship between random variables can be used for predicting one variable from the knowledge of others, even when the relationship is not causal.

[2] Remember that we may define a bivariate probability distribution function simply by giving the function p_{X,Y}(x, y) without any reference to the underlying sample space Ω, event algebra A, and probability measure thereon. Statisticians seldom refer to the underlying sample space (domain) of a random variable. All one has to know are the values of the random variables.

 X \ Y |       y_1        |       y_2        | ... |       y_m        |   p_X
 ------+------------------+------------------+-----+------------------+----------
  x_1  | p_{X,Y}(x_1,y_1) | p_{X,Y}(x_1,y_2) | ... | p_{X,Y}(x_1,y_m) | p_X(x_1)
  x_2  | p_{X,Y}(x_2,y_1) | p_{X,Y}(x_2,y_2) | ... | p_{X,Y}(x_2,y_m) | p_X(x_2)
  ...  |       ...        |       ...        | ... |       ...        |   ...
  x_n  | p_{X,Y}(x_n,y_1) | p_{X,Y}(x_n,y_2) | ... | p_{X,Y}(x_n,y_m) | p_X(x_n)
  p_Y  |     p_Y(y_1)     |     p_Y(y_2)     | ... |     p_Y(y_m)     |

Note that p_{X,Y}(x, y) ≥ 0 and the double sum of all values of p_{X,Y}(x, y) is equal to 1:

p_{X,Y}(x_1, y_1) + p_{X,Y}(x_2, y_1) + ... + p_{X,Y}(x_n, y_1) + ... + p_{X,Y}(x_n, y_m) = 1.

Suppose, as above, that random variables X and Y come with the joint probability distribution function p_{X,Y}(x, y). Then the functions defined by the sums

p_X(x) =_df p_{X,Y}(x, y_1) + p_{X,Y}(x, y_2) + ... + p_{X,Y}(x, y_m)

and

p_Y(y) =_df p_{X,Y}(x_1, y) + p_{X,Y}(x_2, y) + ... + p_{X,Y}(x_n, y)

are called the marginal probability distribution functions of X and Y, respectively. These are entered in the last column and last row of the extended double-entry table above. They are simply the probability distributions of the random variables X and Y, and are used in defining conditional probability distribution functions. Please remember that from the joint probability distribution p_{X,Y}(x, y) of X and Y we can calculate a large variety of probabilities, including not only P(X ≤ x_i) and P(Y ≤ y_j) (obtained by summing up the probabilities of X = x with x ≤ x_i and by adding the probabilities of Y = y for all y ≤ y_j), but also probabilities of the form P(X < Y), P(X + Y ≤ 2), P(X − Y ≥ 2), P(X ≥ 1 & Y ≤ 1), and so forth.[3]
[3] As expected, the sum X + Y of random variables X and Y is the unique random variable Z, defined argumentwise as follows: Z(ω) =_df X(ω) + Y(ω) for all ω ∈ Ω. Similarly for the product and the other algebraic operations on random variables.
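Continuing the hypothetical 2 × 3 table from the earlier sketch, the marginal distributions are obtained exactly as described above: p_X by summing each row and p_Y by summing each column. This is a sketch, not part of the handout.

    import numpy as np

    # Same hypothetical 2-by-3 joint table as before (rows = values of X, columns = values of Y).
    p_xy = np.array([
        [0.10, 0.20, 0.10],
        [0.30, 0.15, 0.15],
    ])

    # Marginal of X: sum each row over all values of Y.
    p_x = p_xy.sum(axis=1)    # -> [0.40, 0.60]
    # Marginal of Y: sum each column over all values of X.
    p_y = p_xy.sum(axis=0)    # -> [0.40, 0.35, 0.25]

    print("p_X:", p_x, "sums to", p_x.sum())
    print("p_Y:", p_y, "sums to", p_y.sum())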

Given two random variables X and Y with data as above, we say that they are statistically independent, in symbols X ⊥ Y, provided that their joint probability distribution is the product of their marginal probability distributions. That is to say, the equation

p_{X,Y}(x, y) = p_X(x) · p_Y(y)

holds for all values x ∈ X(Ω) and y ∈ Y(Ω). Obviously, independence is a symmetric relation, so that we have X ⊥ Y if and only if Y ⊥ X. Furthermore, independence crucially depends on the associated probability distribution functions. Informally, independence means that knowledge that X has assumed a given value, say x_i, does not affect at all the probability that Y will assume any given value, say y_j. The notion of independence can be carried over to more than two random variables.
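The factorization criterion can be checked mechanically. The sketch below (illustrative only, with made-up tables) compares a joint table with the outer product of its marginals; the first table is the dependent one used earlier, the second is independent by construction.

    import numpy as np

    def is_independent(p_xy):
        """Check whether a joint table factors into the product of its marginals."""
        p_x = p_xy.sum(axis=1)            # marginal of X (row sums)
        p_y = p_xy.sum(axis=0)            # marginal of Y (column sums)
        return np.allclose(p_xy, np.outer(p_x, p_y))

    # The hypothetical table used above: NOT independent (e.g. 0.10 != 0.40 * 0.40).
    p_dep = np.array([[0.10, 0.20, 0.10],
                      [0.30, 0.15, 0.15]])
    # A table built as an outer product of marginals: independent by construction.
    p_ind = np.outer([0.4, 0.6], [0.5, 0.3, 0.2])

    print(is_independent(p_dep))   # False
    print(is_independent(p_ind))   # True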
We mention in passing that in the case of three random variables X, Y and Z, their joint (trivariate) probability distribution is defined by

p_{X,Y,Z}(x, y, z) =_df P(X = x & Y = y & Z = z)

and their mutual independence is defined by the equation

p_{X,Y,Z}(x, y, z) = p_X(x) · p_Y(y) · p_Z(z).

Independence implies many pleasing properties for the expectation and variance. In particular, recall that while E(X + Y) = E(X) + E(Y) always holds, the following holds for the multiplication of independent random variables:

If X ⊥ Y, then E(X · Y) = E(X) · E(Y).

(The implication cannot be reversed.) Correspondingly, while the additivity law does not hold for variance in general, we have the following special situation:

If X ⊥ Y, then Var(X + Y) = Var(X) + Var(Y).
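A quick numerical sanity check of these two facts is easy to carry out. The sketch below (not part of the handout) builds a small independent pair from hypothetical marginals and verifies E(XY) = E(X)E(Y) and Var(X + Y) = Var(X) + Var(Y) directly from the joint table.

    import numpy as np

    # Hypothetical independent X (values 1, 2) and Y (values 1, 2, 3).
    x_vals, p_x = np.array([1, 2]), np.array([0.4, 0.6])
    y_vals, p_y = np.array([1, 2, 3]), np.array([0.5, 0.3, 0.2])
    p_xy = np.outer(p_x, p_y)                      # joint table; X and Y independent by construction

    EX  = (x_vals * p_x).sum()
    EY  = (y_vals * p_y).sum()
    EXY = (np.outer(x_vals, y_vals) * p_xy).sum()  # E(X * Y) from the joint table
    print(np.isclose(EXY, EX * EY))                # True: E(XY) = E(X)E(Y)

    VarX = ((x_vals - EX) ** 2 * p_x).sum()
    VarY = ((y_vals - EY) ** 2 * p_y).sum()
    s_vals = np.add.outer(x_vals, y_vals)          # all possible values of X + Y
    ES   = (s_vals * p_xy).sum()
    VarS = ((s_vals - ES) ** 2 * p_xy).sum()       # Var(X + Y) from the joint distribution
    print(np.isclose(VarS, VarX + VarY))           # True: Var(X+Y) = Var(X) + Var(Y)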
Example 1: Consider the following probability experiment: Roll a balanced die once and let X : Ω_6 → R be the random variable whose values are determined by the number on the die's upturned face. Hence the possible values of X are specified by the set X(Ω_6) = {1, 2, 3, 4, 5, 6}. Next, after the die's outcome has been observed, toss a fair coin exactly as many times as the value of X. Let Y : (Ω_2 + Ω_2^2 + ... + Ω_2^6) → R be the random variable that counts the number of tails in all possible coin tossing experiments. Clearly, the possible values of Y form the set {0, 1, 2, 3, 4, 5, 6}.[4]

Problem: Determine the joint probability distribution p_{X,Y}(x, y) of the random variables specified above!

Solution: All we have to do is calculate the values of p_{X,Y}(x, y) for x = 1, 2, 3, ..., 6 and for y = 0, 1, 2, ..., 6:
p_{X,Y}(1, 0) = P(X = 1 & Y = 0) = P(X = 1) · P(Y = 0 | X = 1) = (1/6)(1/2) = 1/12

p_{X,Y}(1, 1) = P(X = 1 & Y = 1) = P(X = 1) · P(Y = 1 | X = 1) = (1/6)(1/2) = 1/12

p_{X,Y}(2, 0) = P(X = 2 & Y = 0) = P(X = 2) · P(Y = 0 | X = 2) = (1/6)(1/4) = 1/24

p_{X,Y}(2, 1) = P(X = 2 & Y = 1) = P(X = 2) · P(Y = 1 | X = 2) = 1/12

p_{X,Y}(2, 2) = P(X = 2 & Y = 2) = P(X = 2) · P(Y = 2 | X = 2) = 1/24

p_{X,Y}(3, 0) = P(X = 3 & Y = 0) = P(X = 3) · P(Y = 0 | X = 3) = (1/6) C(3, 0) (1/2)^0 (1/2)^3 = 1/48

p_{X,Y}(3, 1) = P(X = 3 & Y = 1) = P(X = 3) · P(Y = 1 | X = 3) = (1/6) C(3, 1) (1/2)^1 (1/2)^2 = 3/48

p_{X,Y}(3, 2) = P(X = 3 & Y = 2) = P(X = 3) · P(Y = 2 | X = 3) = 3/48

p_{X,Y}(3, 3) = P(X = 3 & Y = 3) = P(X = 3) · P(Y = 3 | X = 3) = 1/48
The remaining values of p_{X,Y}(x, y) (there are 6 × 7 = 42 entries in total) are calculated similarly.
[4] Astute readers will object that X and Y are not defined on the same sample space! This can easily be fixed by assuming that a slightly modified variant of X, denoted by X′, is actually defined on the product sample space Ω_6 × Ω, where Ω =_df Ω_2 + Ω_2^2 + ... + Ω_2^6, i.e., X′ : Ω_6 × Ω → R, constant in the second coordinate (X′ does not depend on the coin experiments), i.e., we have X′(ω, ω′) =_df X(ω). Likewise, Y′ : Ω_6 × Ω → R is a slightly modified variant of Y, presumed to be constant in the first coordinate (it does not depend directly on the die's outcome), i.e., we have Y′(ω, ω′) =_df Y(ω′).

Recall that the marginal probability distribution p_X(x) is calculated by summing up the rows of the table for p_{X,Y}(x, y), so that p_X(1) = 1/6, p_X(2) = 1/6, ..., and p_X(6) = 1/6. Obviously, we get the same values, since rolling the die does not depend on flipping the coin.
On the other hand, the marginal probability distribution p_Y(y) is calculated by summing up the columns of the table for p_{X,Y}(x, y), so that p_Y(0) = 63/384, p_Y(1) = 120/384, p_Y(2) = 99/384, p_Y(3) = 64/384, p_Y(4) = 29/384, p_Y(5) = 8/384, and p_Y(6) = 1/384. Here the values differ significantly from case to case, since Y depends on X.
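The whole joint table of Example 1 can be generated mechanically: for each die value x, the conditional distribution of the number of tails is binomial with x tosses and success probability 1/2. The sketch below (not part of the handout) rebuilds the nonzero entries of the table with exact fractions and checks the marginal values quoted above, e.g. p_Y(0) = 63/384.

    from fractions import Fraction
    from math import comb

    # Joint distribution of Example 1: p_{X,Y}(x, y) = (1/6) * C(x, y) * (1/2)^x,
    # where x = 1,...,6 is the die value and y = 0,...,x counts the tails
    # (for y > x the joint probability is simply 0 and is omitted here).
    p_xy = {(x, y): Fraction(1, 6) * comb(x, y) * Fraction(1, 2) ** x
            for x in range(1, 7) for y in range(0, x + 1)}

    # Marginal of X: sum over y for each fixed die value x  -> always 1/6.
    p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in range(1, 7)}
    # Marginal of Y: sum over x for each fixed number of tails y.
    p_y = {y: sum(p for (_, yy), p in p_xy.items() if yy == y) for y in range(0, 7)}

    assert all(p_x[x] == Fraction(1, 6) for x in range(1, 7))
    assert p_y[0] == Fraction(63, 384) and p_y[6] == Fraction(1, 384)
    assert sum(p_xy.values()) == 1
    print({y: str(p_y[y]) for y in range(7)})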
2. Conditional Probability Distribution Functions
Given a probability model ⟨Ω, A, P⟩ of a target random experiment, let X : Ω → R and Y : Ω → R be two discrete random variables with respective finite sets of possible values X(Ω) = {x_1, x_2, ..., x_n} and Y(Ω) = {y_1, y_2, ..., y_m}. The probability distribution function of the form p_{Y|X}(y|x), defined below, is called the conditional probability distribution of random variable Y given variable X:

p_{Y|X}(y|x) =_df P(X = x & Y = y) / P(X = x) = p_{X,Y}(x, y) / p_X(x),

provided the marginal distribution satisfies p_X(x) > 0 for all x. Note that the foregoing distribution is in general different from the conditional probability distribution function p_{X|Y}(x|y) of random variable X given variable Y, defined by

p_{X|Y}(x|y) =_df P(Y = y & X = x) / P(Y = y) = p_{X,Y}(x, y) / p_Y(y),

provided p_Y(y) > 0 for all y. Of course, for a fixed value y, p_{X|Y}(x|y) is itself a probability distribution (i.e., it is nonnegative and sums up to 1). It is easy to see that if X ⊥ Y, then p_{X|Y}(x|y) = p_X(x).
Example 2: Suppose we are given two random variables X and Y with respective possible values X(Ω) = {1, 2, 3} and Y(Ω) = {1, 2, 3}, and joint probability distribution p_{X,Y}(x, y) =_df (1/36) x · y.

Problem: Determine whether X and Y are independent.

Answer: Because the marginal for X is p_X(x) = x/6 (with x = 1, 2, 3) and the marginal for Y is p_Y(y) = y/6 (with y = 1, 2, 3), and p_{X,Y}(x, y) = (1/36) x y = p_X(x) · p_Y(y), we have X ⊥ Y. Therefore p_{X|Y}(x|y) = p_X(x) and p_{Y|X}(y|x) = p_Y(y).
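A short check of Example 2 (a sketch, not part of the handout): it computes p_{X|Y}(x|y) from the defining quotient and confirms that it coincides with the marginal p_X(x), as independence requires.

    from fractions import Fraction

    xs, ys = [1, 2, 3], [1, 2, 3]
    p_xy = {(x, y): Fraction(x * y, 36) for x in xs for y in ys}   # joint pmf of Example 2

    p_x = {x: sum(p_xy[(x, y)] for y in ys) for x in xs}           # marginal of X: x/6
    p_y = {y: sum(p_xy[(x, y)] for x in xs) for y in ys}           # marginal of Y: y/6

    # Conditional distribution of X given Y = y, from the definition p_{X,Y} / p_Y.
    p_x_given_y = {(x, y): p_xy[(x, y)] / p_y[y] for x in xs for y in ys}

    # Because X and Y are independent here, conditioning on Y changes nothing.
    assert all(p_x_given_y[(x, y)] == p_x[x] for x in xs for y in ys)
    print("p_X:", {x: str(p_x[x]) for x in xs})    # {1: '1/6', 2: '1/3', 3: '1/2'}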
Mathematical expectation automatically generalizes to the conditional case. Specifically, let X and Y be discrete random variables, and let the set of possible values of X be X(Ω) = {x_1, x_2, ..., x_n}. Then the conditional expectation of random variable X, given that Y = y, is defined by the weighted sum

E(X | Y = y) =_df x_1 · p_{X|Y}(x_1|y) + x_2 · p_{X|Y}(x_2|y) + ... + x_n · p_{X|Y}(x_n|y)

of conditional probability distribution functions, where p_Y(y) > 0.

As expected, the conditional variance Var(X | Y = y) is defined by the sum

(x_1 − μ_y)^2 · p_{X|Y}(x_1|y) + (x_2 − μ_y)^2 · p_{X|Y}(x_2|y) + ... + (x_n − μ_y)^2 · p_{X|Y}(x_n|y),

where μ_y =_df E(X | Y = y).
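To make the weighted-sum definitions concrete, here is a minimal sketch (hypothetical 2 × 2 joint table, not from the handout) computing E(X | Y = 1) and Var(X | Y = 1) directly from the formulas above.

    import numpy as np

    # Hypothetical joint table with X taking values 1, 2 (rows) and Y taking values 0, 1 (columns).
    x_vals = np.array([1, 2])
    p_xy = np.array([[0.10, 0.30],    # p(X=1, Y=0), p(X=1, Y=1)
                     [0.40, 0.20]])   # p(X=2, Y=0), p(X=2, Y=1)

    y_col = 1                                     # condition on the event Y = 1
    p_y = p_xy[:, y_col].sum()                    # marginal probability P(Y = 1) = 0.5
    p_x_given_y = p_xy[:, y_col] / p_y            # conditional pmf p_{X|Y}(x | 1) = [0.6, 0.4]

    cond_mean = (x_vals * p_x_given_y).sum()                      # E(X | Y = 1) = 1.4
    cond_var = ((x_vals - cond_mean) ** 2 * p_x_given_y).sum()    # Var(X | Y = 1) = 0.24
    print(cond_mean, cond_var)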
The purpose of conditional probability distribution functions is best seen in the applications of Bayes' theorem. In particular, Bayesians write the binomial probability distribution in the conditional form

Bin_{S_n}(k | p) =_df C(n, k) p^k (1 − p)^(n−k),

where C(n, k) is the binomial coefficient "n choose k", S_n denotes the so-called success random variable in n trials, and 0 ≤ p ≤ 1 is the probability of achieving success in any single trial. Recall that Bin_{S_n}(k | p) = P(S_n = k), from which all other probabilities of interest regarding S_n can be calculated. The conditional symbolization of the binomial distribution above suggests that the parameter p is best viewed as a random variable that interacts with the values of S_n. In particular, an experimenter may receive valuable information from S_n about the unknown value of p. This works in general,[5] since whenever Y ~ Bin_n(k | p), we have a connection, namely E(Y) = np and Var(Y) = np(1 − p).[6]
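As an alternative to a printed binomial table or Excel's BINOMDIST, the binomial model can be handled in Python (a sketch assuming scipy is available; n = 10 and p = 1/4 are illustrative values). The last two lines confirm the stated connection E(S_n) = np and Var(S_n) = np(1 − p).

    from scipy.stats import binom

    n, p = 10, 0.25                                # assumed illustrative parameter values
    print(binom.pmf(3, n, p))                      # P(S_10 = 3 | p) = Bin_{S_10}(3 | 1/4) ~ 0.2503
    print(binom.cdf(3, n, p))                      # cumulative P(S_10 <= 3 | p) ~ 0.7759
    print(binom.mean(n, p), n * p)                 # both 2.5:   E(S_n) = np
    print(binom.var(n, p), n * p * (1 - p))        # both 1.875: Var(S_n) = np(1 - p)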
[5] Statisticians often write X ~ p_X(x) to indicate that random variable X comes with distribution p_X(x).

[6] The values of the cumulative binomial probability distribution P(S_n ≤ k | p) can be obtained from a binomial table handed out in class. Also, you can open Excel on your laptop and calculate the value of P(S_n ≤ k | p) by typing BINOMDIST(k,n,p,1) with appropriate values for k, n, and p.

3. Classical Hypothesis Testing

A typical problem in scientific disciplines that rely on statistical models is to learn something about a particular population. (For example, a politician wants to know the opinion of a certain group of people.) It may be impractical or even impossible to examine the whole population, so one must rely on a sample from the population. From the standpoint of probability theory, a sample is modeled by a sequence of mutually independent random variables, each of which has the same probability distribution function.
Consider a probability experiment about which we formally reason in terms of a probability model ⟨Ω, A, P⟩. Next, suppose that n replications of the experiment are performed independently and under identical conditions (e.g., imagine some coin flipping experiments). Suppose that X is a random variable associated with the repeated experiment with a probability distribution function p_X(x). The induced pair ⟨X, p_X(x)⟩ is called a statistical model of the repeated target experiment. Using a somewhat sloppy notation, statisticians often write X ~ p(x) to indicate that the representing random variable X of the target experiment has distribution p(x). For example, if the experiment consists of rolling a fair die once and the interest is in the number on the die's upturned face, then X will have the possible values 1, 2, 3, 4, 5, 6, and the associated probability distribution function will be the so-called uniform distribution p_X(x) = 1/6, giving the same probability value to all x.
Now, if the target experiment is repeated n times (so that we have n trials, where n is any natural number), resulting in n-fold repeated observations of the values of random variable X, the correct probability model for the repeated experiment consists of n random variables X_1, X_2, ..., X_n such that

(i) the probability distribution of each X_i (for i = 1, 2, ..., n) is exactly the same, namely p_X(x);

(ii) the joint probability distribution function of X_1, X_2, ..., X_n is the product of the marginal distributions:

p_{X_1, X_2, ..., X_n}(x_1, x_2, ..., x_n) = p_X(x_1) · p_X(x_2) · ... · p_X(x_n),

and the joint probability distribution is symmetric in all of its arguments x_1, x_2, ..., x_n, so that

(iii) all X_1, X_2, ..., X_n are mutually independent.

The sequence X_1, X_2, ..., X_n is commonly referred to as an exchangeable random sample or simply a sample. In probability theory, alternatively, the sequence X_1, X_2, ..., X_n is called an iid sequence, meaning an identically and independently distributed sequence of random variables. In any case, the sequence X_1, X_2, ..., X_n together with p_X(x) is a probabilistic representation of n independent outcomes of a target experiment, repeated n times.
Samples have a mean. Specifically, the sample mean of an iid sequence X_1, X_2, ..., X_n is defined by the random variable

X̄ =_df (1/n)(X_1 + X_2 + ... + X_n).

It is simply the arithmetic average of n observations. Remember that lower case symbols x_1, x_2, ..., x_n denote the distinct deterministic values in a particular data set, whereas the capital symbols X_1, X_2, ..., X_n are the random variables representing individual observations in each trial that take the respective values x_1, x_2, ..., x_n in some repeated experiment with a certain probability.
In addition, samples have a variance. We tentatively define the sample variance of an iid sequence X_1, X_2, ..., X_n by the random variable

V =_df (1/n)[(X_1 − X̄)^2 + (X_2 − X̄)^2 + ... + (X_n − X̄)^2].

However, the usual statistical convention is to replace n in the denominator with n − 1. Therefore the sample variance is standardly defined by the sum

S^2 =_df (1/(n − 1))[(X_1 − X̄)^2 + (X_2 − X̄)^2 + ... + (X_n − X̄)^2].
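For a concrete data set (the numbers below are made up), the sample mean, the tentative variance V with denominator n, and the standard S^2 with denominator n − 1 can be computed as in this sketch; numpy's ddof argument switches between the two conventions.

    import numpy as np

    # A hypothetical observed sample x_1, ..., x_n of size n = 5.
    x = np.array([2.0, 4.0, 4.0, 5.0, 7.0])
    n = len(x)

    x_bar = x.mean()                              # sample mean: (1/n) * sum of observations
    v     = ((x - x_bar) ** 2).sum() / n          # tentative variance V (denominator n)
    s2    = ((x - x_bar) ** 2).sum() / (n - 1)    # standard sample variance S^2 (denominator n - 1)

    print(x_bar)                                  # 4.4
    print(v,  np.var(x))                          # both 2.64 (ddof=0 is numpy's default)
    print(s2, np.var(x, ddof=1))                  # both 3.3
    print(np.sqrt(s2), np.std(x, ddof=1))         # sample standard deviation S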

So now we have the population mean μ_X = E(X) of variable X, parametrizing the associated probability distribution function p_X(x | μ, σ), and the sample mean X̄ of the sample X_1, X_2, ..., X_n of size n, taken from the target population. The exact relationship between the population mean μ and the sample mean X̄ is described by the weak and strong laws of large numbers. Specifically, if the purpose of sampling is to obtain information about the population average μ, the weak law of large numbers tells us that the sample average X̄ = (1/n)(X_1 + X_2 + ... + X_n) is likely to be near μ when n is large. So, for a large sample, we expect to find the sample average (which we can compute) close to the population average μ, which is unknown.
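A small simulation illustrates the weak law of large numbers at work (a sketch only; the sample sizes and the random seed are arbitrary choices): the sample average of n fair-die rolls settles near the population average μ = 3.5 as n grows.

    import numpy as np

    rng = np.random.default_rng(seed=0)      # fixed seed so the sketch is reproducible
    mu = 3.5                                 # population mean of a fair six-sided die

    for n in (10, 100, 10_000, 1_000_000):
        rolls = rng.integers(1, 7, size=n)   # n iid observations X_1, ..., X_n
        print(n, rolls.mean())               # sample average drifts toward mu = 3.5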
In parallel, we have the population variance Var(X) of X, parametrizing the associated probability distribution function p_X(x | μ, σ), and the sample variance S^2 of the sample X_1, X_2, ..., X_n of size n, taken from the target population. Of course, the sample standard deviation of the sample X_1, X_2, ..., X_n is the random variable (statistic)

S =_df √[ (1/(n − 1)) ( (X_1 − X̄)^2 + (X_2 − X̄)^2 + ... + (X_n − X̄)^2 ) ].

We also have the notions of sample correlation, sample moments, and a host of other concepts, paralleling the population terminology. Generally, probability appears only to relate the calculations of the sample mean, sample variance, etc., to the population mean, population variance, and so on.
Often it is easier to work with a standardized variant of a random variable X that transforms its expectation to 0 and its standard deviation to 1. This is achieved by using the so-called standard score or Z-score, defined by the linear transformation

Z =_df (X − μ_X) / σ_X.

It is easy to check that μ_Z = E(Z) = 0 and σ_Z^2 = Var(Z) = 1. We shall use the Z-score in calculating the so-called P-values.
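As a sketch (not from the handout), the snippet below standardizes a binomial variable with the assumed parameters n = 10, p = 1/4 and verifies, by direct summation over its values, that the Z-score has expectation 0 and variance 1.

    import numpy as np
    from scipy.stats import binom

    n, p = 10, 0.25                               # assumed illustrative parameters
    k = np.arange(0, n + 1)                       # possible values of X ~ Bin(n, p)
    pmf = binom.pmf(k, n, p)

    mu, sigma = n * p, np.sqrt(n * p * (1 - p))   # population mean and standard deviation of X
    z = (k - mu) / sigma                          # Z-score of each possible value of X

    print((z * pmf).sum())                        # E(Z) = 0 (up to rounding error)
    print((z ** 2 * pmf).sum())                   # Var(Z) = E(Z^2) = 1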
Sample data provide evidence concerning hypotheses about the population from
which they are drawn. Here is a typical example:
Extrasensory Perception: In attempting to determine whether or not a subject
may be said to have extrasensory perception, something akin to the following
procedure is commonly used. The subject is placed in one room, while, in
another room, a card is randomly selected from a deck of cards. The subject is
then asked to guess a particular feature (e.g., color, number, suit, etc.) of the
card drawn. In this experiment, a person with no extrasensory powers might
be expected to guess correctly an average number of cards. On the other hand,
a person who claims that (s)he has extrasensory powers should presumably be
able to guess correctly an impressively large number of cards.
For specificity, Felix from the TV show called The Odd Couple claims to have ESP. Oscar tested Felix's claim by drawing a card at random from a set of four large cards, each with a different number on it, and without showing it, he asked Felix to identify the card. They repeated this basic experiment many times. At each such trial, an individual without ESP has one chance in four (1/4) of correctly identifying the card. In 10 trials, Felix made six correct identifications. Although he did not claim to be perfect, six is rather more than 2.5 = (1/4)[1 + 2 + 3 + 4], the average number of correctly guessed cards if Felix does not have ESP. Question: Does this prove anything about Felix having ESP? Here is where hypothesis testing comes in.
Let p denote the probability that Felix correctly identifies a card. The so-called null hypothesis H_0 is that Felix has no ESP and is only guessing; in terms of the parameter p, we specify the hypothesis formally by setting

H_0 : p = 1/4.

Now we need a test statistic, say random variable Y_10, that represents exactly 10 trials, each consisting of drawing a card by Oscar and then asking Felix to identify it. Because the trials are independent and the probability p is the same at each trial, the statistical model is given by the binomial distribution

Bin_{Y_10}(k | p).

In words, Y_10 is a Bernoulli process with a binomial distribution. Under hypothesis H_0 (i.e., p = 1/4), the probability distribution function of Y_10 is given by

Specification of Bin_{Y_10}(k | 1/4): Probability Assignment

  y            0      1      2      3      4      5      6      7      8
  p_{Y_10}(y)  0.056  0.188  0.282  0.250  0.146  0.058  0.016  0.003  0.000

We left out the values p_{Y_10}(9) = p_{Y_10}(10) = 0.000, because they are practically zero.
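The table above can be reproduced directly (a sketch assuming scipy is available; rounding to three decimals matches the displayed values):

    from scipy.stats import binom

    n, p = 10, 0.25                    # 10 trials, guessing probability 1/4 under H0
    for y in range(0, 9):              # y = 9, 10 are omitted, as in the table above
        print(y, round(binom.pmf(y, n, p), 3))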
Recall that the population mean is μ = np = 10 · (1/4) = 2.5. And Felix's score (he guessed six times correctly) is rather far from 2.5, out in a tail of the null hypothesis distribution Bin_{Y_10}(k | 1/4). In this sense, 6 correct is rather surprising when H_0 is true.

From the table above we have to calculate P(Y_10 ≥ 6 | H_0):

P(6 or more correct | H_0) = 0.016 + 0.003 + 0.000 + 0.000 + 0.000 = 0.019.

That this probability is small tells us that Y_10 = 6 is quite far from what we expect when p = 1/4. Simply, the sample results are rather inconsistent with the null hypothesis. On the other hand, 6 correct is not very surprising if Felix really has some degree of ESP.
So let's consider the alternative hypothesis

H_a : p > 1/4,

the alternative to H_0, capturing the informal conjecture that Felix has ESP. Now, the so-called P-value (or observed level of significance) is the probability in the right tail at and beyond the observed number of successes (Y_10 = 6), that is

P(Y_10 ≥ 6) = 0.019.
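The same P-value can be obtained as an upper-tail binomial probability without the table (a sketch assuming scipy): binom.sf(5, 10, 0.25) gives P(Y_10 > 5) = P(Y_10 ≥ 6) ≈ 0.0197, which agrees with the 0.019 obtained by summing the rounded table entries.

    from scipy.stats import binom

    n, p0 = 10, 0.25                        # null hypothesis H0: p = 1/4
    y_obs = 6                               # Felix's observed number of correct identifications

    # P-value for Ha: p > 1/4 is the right-tail probability P(Y_10 >= 6) under H0.
    p_value = binom.sf(y_obs - 1, n, p0)    # survival function: P(Y_10 > 5)
    print(p_value)                          # ~ 0.0197, i.e. significant at the 5% level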
Many statisticians take the following interpretations as benchmarks:

(i) Highly statistically significant: P-value < 0.01 is strong evidence against H_0;

(ii) Statistically significant: 0.01 < P-value < 0.05 is moderate evidence against H_0; and

(iii) P-value > 0.10 is little or no evidence against H_0.

In view of the foregoing classification, at the 5% level the test is statistically significant, and therefore H_0 should be rejected and H_a accepted instead.
