Phil.015
April 4, 2016
Handout No. 5

JOINT AND CONDITIONAL PROBABILITY DISTRIBUTIONS, INDEPENDENT VARIABLES, AND HYPOTHESIS TESTING
1. Joint Probability Distribution Functions
Studying the relationships between two (or more) random variables requires their joint probability distribution.[1] In Handout No. 4 we treated probability distributions of single discrete random variables. We now consider two (or more) random variables that are defined simultaneously on the same sample space. In this section we focus on examples involving two random variables. First we make the following definition of the concept of a joint or, more specifically, bivariate probability distribution function (jpdf):
Given a probability model ⟨Ω, A, P⟩ of a target random experiment, let X : Ω → R and Y : Ω → R be two discrete random variables with respective finite sets of possible values X(Ω) = {x1, x2, …, xn} and Y(Ω) = {y1, y2, …, ym}. The probability distribution function

    p_{X,Y}(x, y) =df P(X = x & Y = y)

for all x ∈ X(Ω) and y ∈ Y(Ω) is called the joint (bivariate) probability distribution of X and Y.[2]

[1] For example, changing one of the two related random variables may cause a change in the other. Also, a relationship between random variables can be used for predicting one variable from the knowledge of others, even when the relationship is not causal.

[2] Remember that we may define a bivariate probability distribution function simply by giving the function p_{X,Y}(x, y) without any reference to the underlying sample space Ω, event algebra A and a probability measure thereon. Statisticians seldom refer to the underlying sample space (domain) of a random variable. All one has to know are the values of random variables.
The joint distribution, together with the marginal distributions p_X and p_Y, can be displayed in the following table:

    X \ Y     y1                y2                ⋯    ym                p_X
    x1        p_{X,Y}(x1, y1)   p_{X,Y}(x1, y2)   ⋯    p_{X,Y}(x1, ym)   p_X(x1)
    x2        p_{X,Y}(x2, y1)   p_{X,Y}(x2, y2)   ⋯    p_{X,Y}(x2, ym)   p_X(x2)
    ⋮         ⋮                 ⋮                      ⋮                 ⋮
    xn        p_{X,Y}(xn, y1)   p_{X,Y}(xn, y2)   ⋯    p_{X,Y}(xn, ym)   p_X(xn)
    p_Y       p_Y(y1)           p_Y(y2)           ⋯    p_Y(ym)
Note that p_{X,Y}(x, y) ≥ 0 and the double sum of all values of p_{X,Y}(x, y) is equal to 1:

    ∑_i ∑_j p_{X,Y}(xi, yj) = 1.
[3] As expected, the sum X + Y of random variables X and Y is the unique random variable Z, defined argumentwise as follows: Z(ω) =df X(ω) + Y(ω) for all ω ∈ Ω. Similarly for product and the other algebraic operations on random variables.
Given two random variables X and Y with data as above, we say that they are statistically independent, in symbols X ⊥⊥ Y, provided that their joint probability distribution is the product of their marginal probability distributions. That is to say, the equation

    p_{X,Y}(x, y) = p_X(x) · p_Y(y)

holds for all values x ∈ X(Ω) and y ∈ Y(Ω). Obviously, independence is a symmetric relation, so that we have X ⊥⊥ Y if and only if Y ⊥⊥ X. Furthermore, independence crucially depends on the associated probability distribution functions. Informally, independence means that knowledge that X has assumed a given value, say xi, does not affect at all the probability that Y will assume any given value, say yj. The notion of independence can be carried over to more than two random variables.
We mention in passing that in the case of three random variables X, Y and Z,
their joint (trivariate) probability distribution is defined by
    p_{X,Y,Z}(x, y, z) =df P(X = x & Y = y & Z = z).
The remaining values of p_{X,Y}(x, y) (and the total of 6 · 7 = 42 entries) are calculated similarly.[4] Recall that the marginal probability distribution p_X(x) is calculated by summing up the rows of the table for p_{X,Y}(x, y), so that p_X(1) = 1/6, p_X(2) = 1/6, …, and p_X(6) = 1/6. Obviously, we get the same values, since rolling the die does not depend on flipping the coin.

[4] Astute readers will object that X and Y are not defined on the same sample space! This can easily be fixed by assuming that a slightly modified variant of X, denoted by X′, is actually defined on the product sample space Ω6 × Ω∗, where Ω∗ =df Ω2 + Ω2² + ⋯ + Ω2⁶, i.e., X′ : Ω6 × Ω∗ → R, constant in the second coordinate (X′ does not depend on the coin experiments), i.e., we have X′(ω, ω′) =df X(ω). Likewise, Y′ : Ω6 × Ω∗ → R is a slightly modified variant of Y, presumed to be constant in the first coordinate (it does not depend directly on the die's outcome), i.e., we have Y′(ω, ω′) =df Y(ω′).
On the other hand, the marginal probability distribution p_Y(y) is calculated by summing up the columns of the table for p_{X,Y}(x, y), so that p_Y(0) = 63/384, p_Y(1) = 120/384, p_Y(2) = 99/384, p_Y(3) = 64/384, p_Y(4) = 29/384, p_Y(5) = 8/384, and p_Y(6) = 1/384. Here the values are significantly different from case to case, since Y depends on X.
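The joint table for this example does not survive above, but assuming the experiment suggested by the footnote and the marginals — X is the outcome of a fair die roll, and Y counts the heads in X flips of a fair coin — the marginal values can be recomputed exactly; a sketch in Python:

```python
from fractions import Fraction
from math import comb

# Assumed experiment (consistent with the marginals quoted above): X is the
# outcome of a fair die roll, Y is the number of heads in X fair coin flips.
p_XY = {}
for x in range(1, 7):
    for y in range(0, x + 1):
        # p_{X,Y}(x, y) = P(X = x) * P(Y = y | X = x), binomial with x flips
        p_XY[(x, y)] = Fraction(1, 6) * comb(x, y) * Fraction(1, 2) ** x

# Marginals: p_X sums across rows, p_Y sums down columns.
p_X = {x: sum(p for (u, _), p in p_XY.items() if u == x) for x in range(1, 7)}
p_Y = {y: sum(p for (_, v), p in p_XY.items() if v == y) for y in range(0, 7)}

print(p_X[2] == Fraction(1, 6))      # True
print(p_Y[0] == Fraction(63, 384))   # True
print(p_Y[6] == Fraction(1, 384))    # True
```

Exact fractions avoid any rounding, so the computed marginals can be compared directly with the values 63/384, …, 1/384 above.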
2. Conditional Probability Distribution Functions
Given a probability model ⟨Ω, A, P⟩ of a target random experiment, let X : Ω → R and Y : Ω → R be two discrete random variables with respective finite sets of possible values X(Ω) = {x1, x2, …, xn} and Y(Ω) = {y1, y2, …, ym}. The probability distribution function of the form p_{Y|X}(y|x), defined below, is called the conditional probability distribution of random variable Y given variable X:

    p_{Y|X}(y|x) =df p_{X,Y}(x, y) / p_X(x) = P(X = x & Y = y) / P(X = x),

provided the marginal distribution satisfies p_X(x) > 0 for all x. Note that the foregoing distribution is in general different from the conditional probability distribution function p_{X|Y}(x|y) of random variable X given variable Y, defined by

    p_{X|Y}(x|y) =df p_{X,Y}(x, y) / p_Y(y) = P(Y = y & X = x) / P(Y = y),

provided p_Y(y) > 0 for all y. Of course, for a fixed value y, p_{X|Y}(x|y) itself is a probability distribution (i.e., it is nonnegative and sums up to 1). It is easy to see that if X ⊥⊥ Y, then p_{X|Y}(x|y) = p_X(x).
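As a quick illustration, here is a minimal Python sketch that computes p_{Y|X}(y|x) from a small joint table; the joint values are made up purely for illustration and are not from the handout:

```python
from fractions import Fraction

# Hypothetical 2x2 joint distribution (illustrative values; they sum to 1).
p_XY = {(0, 0): Fraction(1, 8), (0, 1): Fraction(3, 8),
        (1, 0): Fraction(1, 4), (1, 1): Fraction(1, 4)}

def p_X(x):
    # Marginal of X: sum the row for x.
    return sum(p for (u, _), p in p_XY.items() if u == x)

def p_Y_given_X(y, x):
    # p_{Y|X}(y|x) = p_{X,Y}(x, y) / p_X(x), defined when p_X(x) > 0.
    return p_XY[(x, y)] / p_X(x)

# For each fixed x, the conditional values form a probability distribution.
print(p_Y_given_X(0, 0) + p_Y_given_X(1, 0))  # 1
```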
Example 2: Suppose we are given two random variables X and Y with respective possible values X(Ω) = {1, 2, 3} and Y(Ω) = {1, 2, 3}, and joint probability distribution p_{X,Y}(x, y) =df (1/36) · x · y.

Problem: Determine whether X and Y are independent.

Answer: Because the marginal for X is p_X(x) = x/6 (with x = 1, 2, 3) and the marginal for Y is p_Y(y) = y/6 (with y = 1, 2, 3), and p_{X,Y}(x, y) = (1/36) · x · y = p_X(x) · p_Y(y), we have X ⊥⊥ Y. Therefore p_{X|Y}(x|y) = p_X(x) and p_{Y|X}(y|x) = p_Y(y).
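Example 2 can also be checked mechanically: compute both marginals and test the factorization at every point. A small Python sketch with exact fractions:

```python
from fractions import Fraction

xs = ys = [1, 2, 3]
# Joint distribution of Example 2: p_{X,Y}(x, y) = (1/36) * x * y.
p_XY = {(x, y): Fraction(x * y, 36) for x in xs for y in ys}

# Marginals by summing rows and columns.
p_X = {x: sum(p_XY[(x, y)] for y in ys) for x in xs}  # equals x/6
p_Y = {y: sum(p_XY[(x, y)] for x in xs) for y in ys}  # equals y/6

# X and Y are independent iff the joint factors into the marginals everywhere.
independent = all(p_XY[(x, y)] == p_X[x] * p_Y[y] for x in xs for y in ys)
print(independent)  # True
```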
Mathematical expectation automatically generalizes to the conditional case. Specifically, let X and Y be discrete random variables with data as above. The conditional expectation of X given Y = y is defined by the sum

    E(X | Y = y) =df x1 · p_{X|Y}(x1|y) + x2 · p_{X|Y}(x2|y) + ⋯ + xn · p_{X|Y}(xn|y)

of conditional probability distribution functions, where p_Y(y) > 0.

As expected, the conditional variance Var(X | Y = y) is defined by the sum

    (x1 − μy)² · p_{X|Y}(x1|y) + (x2 − μy)² · p_{X|Y}(x2|y) + ⋯ + (xn − μy)² · p_{X|Y}(xn|y),

where μy =df E(X | Y = y).
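Both definitions are straightforward to compute from a joint table; a Python sketch, where the joint distribution is hypothetical and chosen only to exercise the formulas:

```python
from fractions import Fraction

# Hypothetical joint distribution over X in {0, 1, 2} and Y in {0, 1}.
p_XY = {(0, 0): Fraction(1, 6), (1, 0): Fraction(1, 6), (2, 0): Fraction(1, 6),
        (0, 1): Fraction(1, 12), (1, 1): Fraction(1, 4), (2, 1): Fraction(1, 6)}

def p_Y(y):
    return sum(p for (_, v), p in p_XY.items() if v == y)

def cond_expectation(y):
    # E(X | Y = y) = sum_i x_i * p_{X|Y}(x_i | y), requires p_Y(y) > 0.
    return sum(x * p / p_Y(y) for (x, v), p in p_XY.items() if v == y)

def cond_variance(y):
    # Var(X | Y = y) = sum_i (x_i - mu_y)^2 * p_{X|Y}(x_i | y).
    mu_y = cond_expectation(y)
    return sum((x - mu_y) ** 2 * p / p_Y(y)
               for (x, v), p in p_XY.items() if v == y)

print(cond_expectation(0))  # 1
print(cond_variance(0))     # 2/3
```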
The purpose of conditional probability distribution functions is best seen in the applications of Bayes' theorem. In particular, Bayesians write the binomial probability distribution in the conditional form

    Bin_{Sn}(k | p) =df C(n, k) · p^k · (1 − p)^(n−k),

where C(n, k) denotes the binomial coefficient.[5]

[5] Statisticians often write X ∼ p_X(x) to indicate that random variable X comes with distribution p_X(x).
[6] The values of the cumulative binomial probability distribution P(Sn ≤ k | p) can be obtained from a binomial table handed out in class. Also, you can open Excel on your laptop and calculate the value of P(Sn ≤ k | p) by typing BINOMDIST(k,n,p,1) with appropriate values for k, n and p.

The sample mean is defined by

    X̄ =df (1/n) · (X1 + X2 + ⋯ + Xn).
The corresponding (uncorrected) variance of the sample observations is

    (1/n) · [(X1 − X̄)² + (X2 − X̄)² + ⋯ + (Xn − X̄)²].

The sample variance S² is defined with the correction factor 1/(n − 1):

    S² =df (1/(n − 1)) · [(X1 − X̄)² + (X2 − X̄)² + ⋯ + (Xn − X̄)²],

and the sample standard deviation S is its square root:

    S =df √((1/(n − 1)) · [(X1 − X̄)² + (X2 − X̄)² + ⋯ + (Xn − X̄)²]).
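These sample statistics are easy to compute directly from data; a minimal sketch (the observation values are made up):

```python
from math import sqrt

def sample_stats(xs):
    """Return the sample mean, the sample variance (with the 1/(n-1)
    correction), and the sample standard deviation of the observations."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return mean, var, sqrt(var)

x_bar, s2, s = sample_stats([2, 4, 4, 4, 5, 5, 7, 9])
print(x_bar)  # 5.0
```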
We also have the notion of sample correlation, sample moments, and a host of other concepts, paralleling the population terminology. Generally, probability appears only to relate the calculations of sample mean, sample variance, etc., to population mean, population variance, and so on.
Often it is easier to work with a standardized variant of a random variable X, called its Z-score:

    Z =df (X − μX) / σX.

It is easy to check that μZ = E(Z) = 0 and σZ² = Var(Z) = 1. We shall use the Z-score in calculating the so-called P-values.
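For a concrete data set (the values here are made up), standardizing with the population mean and standard deviation does indeed yield mean 0 and variance 1:

```python
data = [2.0, 4.0, 6.0, 8.0]
n = len(data)

mu = sum(data) / n                                     # population mean
sigma = (sum((x - mu) ** 2 for x in data) / n) ** 0.5  # population std dev

# Z-scores: Z = (X - mu_X) / sigma_X
z = [(x - mu) / sigma for x in data]

mean_z = sum(z) / n
var_z = sum(t ** 2 for t in z) / n  # E(Z) = 0, so Var(Z) = E(Z^2)
print(round(mean_z, 9), round(var_z, 9))
```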
Sample data provide evidence concerning hypotheses about the population from
which they are drawn. Here is a typical example:
Extrasensory Perception: In attempting to determine whether or not a subject
may be said to have extrasensory perception, something akin to the following
procedure is commonly used. The subject is placed in one room, while, in
another room, a card is randomly selected from a deck of cards. The subject is
then asked to guess a particular feature (e.g., color, number, suit, etc.) of the
card drawn. In this experiment, a person with no extrasensory powers might
be expected to guess correctly an average number of cards. On the other hand,
a person who claims that (s)he has extrasensory powers should presumably be
able to guess correctly an impressively large number of cards.
For specificity, Felix from the TV show The Odd Couple claims to have ESP. Oscar tested Felix's claim by drawing a card at random from a set of four large cards, each with a different number on it, and without showing it, he asked Felix to identify the card. They repeated this basic experiment many times. At each such trial, an individual without ESP has one chance in four (1/4) of correctly identifying the card. In 10 trials, Felix made six correct identifications. Although he did not claim to be perfect, six is rather more than 2.5 = (1/4) · 10, the average number of correctly guessed cards if Felix does not have ESP. Question: Does this prove anything about Felix having ESP? Here is where hypothesis testing comes in.
Let p denote the probability that Felix correctly identifies a card. The so-called null hypothesis H0 is that Felix has no ESP and is only guessing; in terms of the parameter p, we specify the hypothesis formally by setting

    H0 : p = 1/4.
Now we need a test statistic, say random variable Y10, that represents exactly 10 trials, each consisting of drawing a card by Oscar and then asking Felix to identify it. Because the trials are independent and the probability p is the same at each trial, the statistical model is given by the binomial distribution

    Bin_{Y10}(k | p).

In words, Y10 counts the successes in a Bernoulli process and has a binomial distribution. Under hypothesis H0 (i.e., p = 1/4), the probability distribution function of Y10 is given by

    y            0      1      2      3      4      5      6      7      8
    p_{Y10}(y)   0.056  0.188  0.282  0.250  0.146  0.058  0.016  0.003  0.000
We left out the values p_{Y10}(9) = p_{Y10}(10) = 0.000, because they are practically zero.
Recall that the population mean is μ = np = 10 · (1/4) = 2.5. And Felix's score (he guessed six times correctly) is rather far from 2.5, out in a tail of the null hypothesis distribution Bin_{Y10}(k | 1/4). In this sense, 6 correct is rather surprising when H0 is true.
From the table above we have to calculate P(Y10 ≥ 6 | H0):

    P(6 or more correct | H0) = 0.016 + 0.003 + 0.000 + 0.000 + 0.000 = 0.019.

That this probability is small tells us that Y10 = 6 is quite far from what we expect when p = 1/4. Simply, the sample results are rather inconsistent with the null hypothesis. On the other hand, 6 correct is not very surprising if Felix really has some degree of ESP.
So let's consider the alternative hypothesis

    Ha : p > 1/4,

the alternative to H0, capturing the informal conjecture that Felix has ESP. Now, the so-called P-value (or observed level of significance) is the probability in the right tail at and beyond the observed number of successes (Y10 = 6), that is

    P(Y10 ≥ 6) = 0.019.
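The P-value can be recomputed without the table; a short Python sketch (the table entries are rounded to three decimals, which is why they sum to 0.019 while the exact tail probability is ≈ 0.0197):

```python
from math import comb

def binom_pmf(k, n, p):
    # Bin_{S_n}(k | p) = C(n, k) * p^k * (1 - p)^(n - k)
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# P-value for Felix: P(Y10 >= 6 | p = 1/4)
p_value = sum(binom_pmf(k, 10, 0.25) for k in range(6, 11))
print(round(p_value, 4))  # 0.0197
```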
Many statisticians take the following interpretations as benchmarks:

(i) Highly statistically significant: P-value < 0.01 is strong evidence against H0;

(ii) Statistically significant: 0.01 < P-value < 0.05 is moderate evidence against H0; and

(iii) P-value > 0.10 is little or no evidence against H0.

In view of the foregoing classification, at the 5% level the test is statistically significant, and therefore H0 should be rejected and Ha should be accepted instead.