
LECTURE 2: Review of Probability and Statistics

• Probability
  - Definition of probability
  - Axioms and properties
  - Conditional probability
  - Bayes Theorem
• Random Variables
  - Definition of a Random Variable
  - Cumulative Distribution Function
  - Probability Density Function
  - Statistical characterization of Random Variables
• Random Vectors
  - Mean vector
  - Covariance matrix
• The Gaussian random variable

Basic probability concepts
• Definitions (informal)
  - Probabilities are numbers assigned to events that indicate how likely it is that the event will occur when a random experiment is performed
  - A probability law for a random experiment is a rule that assigns probabilities to the events in the experiment
  - The sample space S of a random experiment is the set of all possible outcomes
[Figure: a probability law assigns to each event A1, A2, A3, A4 in the sample space a probability]

• Axioms of probability
  - Axiom I: 0 ≤ P[Ai]
  - Axiom II: P[S] = 1
  - Axiom III: if Ai ∩ Aj = ∅, then P[Ai ∪ Aj] = P[Ai] + P[Aj]

Warming-up exercise
• I come to class with three colored cards
  - One BLUE on both sides
  - One RED on both sides
  - One BLUE on one side, RED on the other

[Figure: the three cards, labeled A, B, C]

• I shuffle the three cards, then pick one and show you one side only. The side visible to you is RED
  - Obviously, the card has to be either A or C, right?
• I am willing to bet $1 that the other side of the card has the same color, and I need someone in the audience to bet another $1 that it is the other color
  - Obviously, on average we will end up even, right?
  - Let's try it!
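
Before trying it live, the bet can be simulated. A minimal Monte Carlo sketch (Python is assumed here; the code and the card encoding are not part of the original lecture):

```python
# Sketch (not from the slides): simulate the card bet.
import random

cards = [("BLUE", "BLUE"), ("RED", "RED"), ("BLUE", "RED")]
same = shown_red = 0
for _ in range(100_000):
    card = random.choice(cards)               # shuffle and pick one card
    visible, hidden = random.sample(card, 2)  # show one side at random
    if visible == "RED":                      # we only bet when RED is shown
        shown_red += 1
        same += (hidden == "RED")
print(same / shown_red)  # ≈ 2/3: the hidden side usually matches the visible one
```

The "50/50" intuition fails because the RED/RED card offers two ways of showing a RED side, so it is twice as likely as the mixed card to be the one on the table.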

More properties of probability

PROPERTY 1: P[Aᶜ] = 1 − P[A]

PROPERTY 2: P[A] ≤ 1

PROPERTY 3: P[∅] = 0

PROPERTY 4: given {A1, A2, ..., AN}, if Ai ∩ Aj = ∅ for all i ≠ j, then P[∪(k=1..N) Ak] = Σ(k=1..N) P[Ak]

PROPERTY 5: P[A1 ∪ A2] = P[A1] + P[A2] − P[A1 ∩ A2]

PROPERTY 6: P[∪(k=1..N) Ak] = Σ(k=1..N) P[Ak] − Σ(j<k) P[Aj ∩ Ak] + ... + (−1)^(N+1) P[A1 ∩ A2 ∩ ... ∩ AN]

PROPERTY 7: if A1 ⊂ A2, then P[A1] ≤ P[A2]

Conditional probability
• If A and B are two events, the probability of event A when we already know that event B has occurred is defined by the relation

  P[A | B] = P[A ∩ B] / P[B],  for P[B] > 0

• This conditional probability P[A|B] is read:
  - the conditional probability of A conditioned on B, or simply
  - the probability of A given B

[Figure: Venn diagrams; once B has occurred, the sample space S shrinks to B, and A shrinks to A ∩ B]

• Interpretation
  - The new evidence "B has occurred" has the following effects
    · The original sample space S (the whole square) becomes B (the rightmost circle)
    · The event A becomes A ∩ B
  - P[B] simply re-normalizes the probability of events that occur jointly with B
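
For example, for one roll of a fair die: P[X = 2 | X is even] = P[{X = 2} ∩ {X even}] / P[X even] = (1/6) / (1/2) = 1/3.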

Theorem of total probability
• Let B1, B2, ..., BN be mutually exclusive events whose union equals the sample space S. We refer to these sets as a partition of S.
• An event A can be represented as:

  A = A ∩ S = A ∩ (B1 ∪ B2 ∪ ... ∪ BN) = (A ∩ B1) ∪ (A ∩ B2) ∪ ... ∪ (A ∩ BN)

[Figure: the event A overlapping a partition B1, B2, ..., BN of the sample space S]

• Since B1, B2, ..., BN are mutually exclusive, then

  P[A] = P[A ∩ B1] + P[A ∩ B2] + ... + P[A ∩ BN]

• and, therefore

  P[A] = P[A | B1] P[B1] + ... + P[A | BN] P[BN] = Σ(k=1..N) P[A | Bk] P[Bk]

Bayes Theorem
• Let B1, B2, ..., BN be a partition of the sample space S. Suppose that event A occurs; what is the probability of event Bj?
  - Using the definition of conditional probability and the Theorem of total probability we obtain

  P[Bj | A] = P[A ∩ Bj] / P[A] = P[A | Bj] P[Bj] / Σ(k=1..N) P[A | Bk] P[Bk]

• This is known as Bayes Theorem or Bayes Rule, and is (one of) the most useful relations in probability and statistics
  - Bayes Theorem is definitely the fundamental relation in Statistical Pattern Recognition

[Portrait: Rev. Thomas Bayes (1702-1761)]

Bayes Theorem and Statistical Pattern Recognition
• For the purpose of pattern classification, Bayes Theorem can be expressed as

  P[ωj | x] = P[x | ωj] P[ωj] / P[x] = P[x | ωj] P[ωj] / Σ(k=1..N) P[x | ωk] P[ωk]

  - where ωj is the j-th class and x is the feature vector

• A typical decision rule (class assignment) is to choose the class ωi with the highest P[ωi | x]
  - Intuitively, we will choose the class that is more likely given feature vector x
• Each term in Bayes Theorem has a special name, which you should be familiar with
  - P[ωj]       Prior probability (of class ωj)
  - P[ωj | x]   Posterior probability (of class ωj given the observation x)
  - P[x | ωj]   Likelihood (conditional probability of observation x given class ωj)
  - P[x]        A normalization constant that does not affect the decision

Stretching exercise
• Consider a clinical problem where we need to decide if a patient has a particular medical condition on the basis of an imperfect test:
  - Someone with the condition may go undetected (false negative)
  - Someone free of the condition may yield a positive result (false positive)
• Nomenclature
  - The true-negative rate P(NEG | ¬COND) of a test is called its SPECIFICITY
  - The true-positive rate P(POS | COND) of a test is called its SENSITIVITY

                       TEST IS POSITIVE    TEST IS NEGATIVE    ROW TOTAL
  HAS CONDITION        True positive       False negative
                       P(POS | COND)       P(NEG | COND)
  FREE OF CONDITION    False positive      True negative
                       P(POS | ¬COND)      P(NEG | ¬COND)
  COLUMN TOTAL

• PROBLEM
  - Assume a population of 10,000 where 1 out of every 100 people has the condition
  - Assume that we design a test with 98% specificity and 90% sensitivity
  - Assume you are required to take the test, which then yields a POSITIVE result
  - What is the probability that you have the condition?
• SOLUTION A: Fill in the joint frequency table above
• SOLUTION B: Apply Bayes rule

Stretching exercise
• SOLUTION A: the joint frequency table, filled in for a population of 10,000 in which 1 out of every 100 people has the condition (specificity 98%, sensitivity 90%):

                       TEST IS POSITIVE          TEST IS NEGATIVE        ROW TOTAL
  HAS CONDITION        True positive             False negative
                       100 × 0.90 = 90           100 × (1 − 0.90) = 10   100
  FREE OF CONDITION    False positive            True negative
                       9,900 × (1 − 0.98) = 198  9,900 × 0.98 = 9,702    9,900
  COLUMN TOTAL         288                       9,712                   10,000

  - Reading off the POSITIVE column: P(COND | POS) = 90 / 288 = 0.3125

Stretching exercise
• SOLUTION B: Apply Bayes theorem

  P[COND | POS] = P[POS | COND] P[COND] / P[POS]
                = P[POS | COND] P[COND] / (P[POS | COND] P[COND] + P[POS | ¬COND] P[¬COND])
                = (0.90 × 0.01) / (0.90 × 0.01 + (1 − 0.98) × 0.99)
                = 0.3125
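
The same computation as code; a minimal sketch assuming Python, where the helper name `posterior_given_positive` is mine, not from the slides:

```python
# Sketch (not from the slides): Bayes rule for a binary diagnostic test.
def posterior_given_positive(prior, sensitivity, specificity):
    """Return P[COND | POS] for a test with the given sensitivity/specificity."""
    p_pos_and_cond = sensitivity * prior                  # P[POS | COND] P[COND]
    p_pos_and_free = (1.0 - specificity) * (1.0 - prior)  # P[POS | ¬COND] P[¬COND]
    return p_pos_and_cond / (p_pos_and_cond + p_pos_and_free)

print(posterior_given_positive(prior=0.01, sensitivity=0.90, specificity=0.98))
# 0.3125: even after a positive result, the condition is still more likely absent
```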

Random variables
• When we perform a random experiment we are usually interested in some measurement or numerical attribute of the outcome
  - When we sample a population we may be interested in their weights
  - When rating the performance of two computers we may be interested in the execution time of a benchmark
  - When trying to recognize an intruder aircraft, we may want to measure parameters that characterize its shape
• These examples lead to the concept of random variable
  - A random variable X is a function that assigns a real number X(ζ) to each outcome ζ in the sample space of a random experiment
    · This function X(ζ) performs a mapping from all the possible elements in the sample space onto the real line (real numbers)
  - The function that assigns values to each outcome is fixed and deterministic
    · as in the rule "count the number of heads in three coin tosses"
    · the randomness in the observed values is due to the underlying randomness of the argument of the function X, namely the outcome ζ of the experiment
  - Random variables can be
    · Discrete: e.g., the resulting number after rolling a die
    · Continuous: e.g., the weight of a sampled individual

[Figure: X(ζ) maps each outcome ζ in S to a point x on the real line; the range of X is Sx]

Cumulative distribution function (cdf)
• The cumulative distribution function FX(x) of a random variable X is defined as the probability of the event {X ≤ x}

  FX(x) = P[X ≤ x]  for −∞ < x < +∞

  - Intuitively, FX(b) is the long-term proportion of times in which X(ζ) ≤ b

[Figure: cdf for a person's weight, rising from 0 to 1 over 100-500 lb]

• Properties of the cdf

  0 ≤ FX(x) ≤ 1
  lim(x→+∞) FX(x) = 1
  lim(x→−∞) FX(x) = 0
  FX(a) ≤ FX(b) if a ≤ b
  FX(b) = lim(h→0⁺) FX(b + h) = FX(b⁺)

[Figure: staircase cdf for rolling a die, with steps of 1/6 at x = 1, 2, ..., 6]

Probability density function (pdf)
• The probability density function of a continuous random variable X, if it exists, is defined as the derivative of FX(x)

  fX(x) = dFX(x) / dx

[Figure: pdf for a person's weight, a bell-shaped curve over 100-500 lb]

• For discrete random variables, the equivalent to the pdf is the probability mass function:

  fX(x) = ΔFX(x) / Δx

• Properties

  fX(x) ≥ 0
  P[a < x < b] = ∫(a..b) fX(x) dx
  FX(x) = ∫(−∞..x) fX(x') dx'
  1 = ∫(−∞..+∞) fX(x) dx
  fX(x | A) = d/dx FX(x | A),  where FX(x | A) = P[{X ≤ x} ∩ A] / P[A]  if P[A] > 0

[Figure: pmf for rolling a (fair) die, six equal masses of 1/6]

Probability density function vs. probability
• What is the probability of somebody weighing 200 lb?
  - According to the pdf, this is about 0.62
  - This number seems reasonable, right?
• Now, what is the probability of somebody weighing 124.876 lb?
  - According to the pdf, this is about 0.43
  - But, intuitively, we know that the probability should be zero (or very, very small)
• How do we explain this paradox?
  - The pdf DOES NOT define a probability, but a probability DENSITY!
  - To obtain the actual probability we must integrate the pdf over an interval (see the sketch below)
  - So we should have asked the question: what is the probability of somebody weighing 124.876 lb plus or minus 2 lb?

[Figure: pdf for a person's weight]

• The probability mass function is a true probability (the reason why we call it a "mass", as opposed to a "density")
  - The pmf indicates that the probability of any number when rolling a fair die is the same for all numbers, and equal to 1/6, a very legitimate answer
  - The pmf DOES NOT need to be integrated to obtain the probability (it cannot be integrated in the first place)

[Figure: pmf for rolling a (fair) die]
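
To make the density-vs-probability distinction concrete, here is a small sketch; Python with SciPy and the Normal(160, 50) weight model are my assumptions, not the actual curve on the slide:

```python
# Sketch (not from the slides): density values vs. actual probabilities.
from scipy.stats import norm

weight = norm(loc=160, scale=50)   # hypothetical pdf for a person's weight (lb)
print(weight.pdf(124.876))         # a DENSITY value, not a probability
# P[X = exactly 124.876] is zero; only an interval carries probability mass:
print(weight.cdf(124.876 + 2) - weight.cdf(124.876 - 2))  # P[122.876 < X < 126.876]
```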

Statistical characterization of random variables
• The cdf or the pdf are SUFFICIENT to fully characterize a random variable. However, a random variable can be PARTIALLY characterized with other measures
  - Expectation

    E[X] = μ = ∫(−∞..+∞) x fX(x) dx

    · The expectation represents the center of mass of a density
  - Variance

    VAR[X] = E[(X − E[X])²] = ∫(−∞..+∞) (x − μ)² fX(x) dx

    · The variance represents the spread about the mean
  - Standard deviation

    STD[X] = VAR[X]^(1/2)

    · The square root of the variance. It has the same units as the random variable.
  - N-th moment

    E[X^N] = ∫(−∞..+∞) x^N fX(x) dx
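
These measures are easy to estimate from samples; a minimal sketch assuming Python with NumPy (the exponential distribution is my example, not from the slides):

```python
# Sketch (not from the slides): sample estimates of the measures above.
import numpy as np

x = np.random.default_rng(0).exponential(scale=2.0, size=100_000)
mean = x.mean()                  # estimate of E[X]   (true value: 2)
var = ((x - mean) ** 2).mean()   # estimate of VAR[X] (true value: 4)
std = np.sqrt(var)               # STD[X]
m3 = (x ** 3).mean()             # third moment E[X^3] (true value: 48)
print(mean, var, std, m3)
```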

Random vectors
• The notion of a random vector is an extension of that of a random variable
  - A vector random variable X is a function that assigns a vector of real numbers to each outcome in the sample space S
  - We will always denote a random vector by a column vector
• The notions of cdf and pdf are replaced by joint cdf and joint pdf
  - Given a random vector X = [x1 x2 ... xN]^T, we define
    · the joint cumulative distribution function as:

      FX(x) = P[{X1 ≤ x1} ∩ {X2 ≤ x2} ∩ ... ∩ {XN ≤ xN}]

    · the joint probability density function as:

      fX(x) = ∂^N FX(x) / (∂x1 ∂x2 ... ∂xN)

• The term marginal pdf is used to represent the pdf of a subset of all the random vector dimensions
  - A marginal pdf is obtained by integrating out the variables that are not of interest
  - As an example, for a two-dimensional problem with random vector X = [x1 x2]^T, the marginal pdf for x1, given the joint pdf fX1X2(x1, x2), is

    fX1(x1) = ∫(x2 = −∞ .. +∞) fX1X2(x1, x2) dx2
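
Numerically, "integrating out" a variable amounts to summing the joint pdf over a grid. A sketch assuming Python with NumPy/SciPy; the bivariate Gaussian is my example, not from the slides:

```python
# Sketch (not from the slides): marginalizing a joint pdf on a grid.
import numpy as np
from scipy.stats import multivariate_normal

x1 = np.linspace(-5, 5, 201)
x2 = np.linspace(-5, 5, 201)
X1, X2 = np.meshgrid(x1, x2, indexing="ij")
joint = multivariate_normal(mean=[0, 0], cov=[[1.0, 0.5], [0.5, 1.0]])
f12 = joint.pdf(np.dstack((X1, X2)))    # f_X1X2(x1, x2) evaluated on the grid
f1 = f12.sum(axis=1) * (x2[1] - x2[0])  # integrate x2 out (Riemann sum)
print(f1.sum() * (x1[1] - x1[0]))       # ≈ 1: the marginal is a valid pdf
```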

Statistical characterization of random vectors
• A random vector is also fully characterized by its joint cdf or joint pdf
• Alternatively, we can (partially) describe a random vector with measures similar to those defined for scalar random variables
  - Mean vector

    E[X] = [E[X1] E[X2] ... E[XN]]^T = [μ1 μ2 ... μN]^T = μ

  - Covariance matrix

    COV[X] = Σ = E[(X − μ)(X − μ)^T]

            ⎡ E[(x1−μ1)(x1−μ1)]  ...  E[(x1−μ1)(xN−μN)] ⎤   ⎡ σ1²  ...  c1N ⎤
          = ⎢        ...         ...         ...        ⎥ = ⎢ ...  ...  ... ⎥
            ⎣ E[(xN−μN)(x1−μ1)]  ...  E[(xN−μN)(xN−μN)] ⎦   ⎣ c1N  ...  σN² ⎦

Covariance matrix (1)
• The covariance matrix indicates the tendency of each pair of features (dimensions in a random vector) to vary together, i.e., to co-vary*
• The covariance has several important properties
  - If xi and xk tend to increase together, then cik > 0
  - If xi tends to decrease when xk increases, then cik < 0
  - If xi and xk are uncorrelated, then cik = 0
  - |cik| ≤ σiσk, where σi is the standard deviation of xi
  - cii = σi² = VAR(xi)
• The covariance terms can be expressed as

  cii = σi²  and  cik = ρik σi σk

  - where ρik is called the correlation coefficient

[Figure: five scatter plots of xk vs. xi illustrating cik = −σiσk (ρik = −1), cik < 0 (−1 < ρik < 0), cik = 0 (ρik = 0), cik > 0 (0 < ρik < 1), and cik = +σiσk (ρik = +1)]

*from http://www.engr.sjsu.edu/~knapp/HCIRODPR/PR_home.htm
Covariance matrix (2)
• The covariance matrix can be reformulated as*

  Σ = E[(X − μ)(X − μ)^T] = E[XX^T] − μμ^T = S − μμ^T

  with

  S = E[XX^T] = ⎡ E[x1x1]  ...  E[x1xN] ⎤
                ⎢   ...    ...    ...   ⎥
                ⎣ E[xNx1]  ...  E[xNxN] ⎦

  - S is called the autocorrelation matrix, and contains the same amount of information as the covariance matrix
• The covariance matrix can also be expressed as

  Σ = Γ R Γ = ⎡ σ1  0  ...  0  ⎤ ⎡  1   ρ12  ...  ρ1N ⎤ ⎡ σ1  0  ...  0  ⎤
              ⎢ 0   σ2 ...  0  ⎥ ⎢ ρ12   1   ...  ... ⎥ ⎢ 0   σ2 ...  0  ⎥
              ⎣ 0   0  ...  σN ⎦ ⎣ ρ1N  ...  ...   1  ⎦ ⎣ 0   0  ...  σN ⎦

  - A convenient formulation, since Γ contains the scales of the features and R retains the essential information about the relationship between the features
  - R is the correlation matrix (a numerical check of this factorization follows below)
• Correlation vs. independence
  - Two random variables xi and xk are uncorrelated if E[xi xk] = E[xi] E[xk]
    · Uncorrelated variables are also called linearly independent
  - Two random variables xi and xk are independent if P[xi, xk] = P[xi] P[xk]
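
A quick numerical check of the Σ = Γ R Γ factorization above (a sketch assuming Python with NumPy; the 2×2 covariance matrix is my example):

```python
# Sketch (not from the slides): Sigma = Gamma * R * Gamma.
import numpy as np

cov = np.array([[4.0, 1.2],
                [1.2, 1.0]])                # example covariance matrix
gamma = np.diag(np.sqrt(np.diag(cov)))      # Gamma: standard deviations
gamma_inv = np.linalg.inv(gamma)
R = gamma_inv @ cov @ gamma_inv             # correlation matrix (rho_12 = 0.6)
print(R)
print(np.allclose(gamma @ R @ gamma, cov))  # True: Sigma is recovered
```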
*from [Fukunaga, 1991]
A numerical example
• Given the following samples from a 3-dimensional distribution
  - Compute the covariance matrix
  - Generate scatter plots for every pair of variables
  - Can you observe any relationships between the covariance and the scatter plots?

              Variables (or features)
  Examples    x1    x2    x3
     1         2     2     4
     2         3     4     6
     3         5     4     2
     4         6     6     4

• You may work out your solution in the templates below; a NumPy check of the answer follows

[Worksheet: blank x2-vs-x1 and x2-vs-x3 scatter axes, plus a table with one row per example and columns x1−μ1, x2−μ2, x3−μ3, (x1−μ1)², (x2−μ2)², (x3−μ3)², (x1−μ1)(x2−μ2), (x1−μ1)(x3−μ3), (x2−μ2)(x3−μ3), and an Average row]
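
A sketch to check the hand computation (Python with NumPy assumed; the code is not part of the original exercise):

```python
# Sketch (not from the slides): covariance matrix of the four examples.
import numpy as np

X = np.array([[2, 2, 4],
              [3, 4, 6],
              [5, 4, 2],
              [6, 6, 4]], dtype=float)  # rows = examples, columns = x1, x2, x3
mu = X.mean(axis=0)                     # mean vector: [4, 4, 4]
cov = (X - mu).T @ (X - mu) / len(X)    # population covariance (divide by N)
print(cov)
# [[ 2.5  2.  -1. ]
#  [ 2.   2.   0. ]
#  [-1.   0.   2. ]]  -- note c23 = 0: x2 and x3 are uncorrelated
```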
The Normal or Gaussian distribution
• The multivariate Normal or Gaussian distribution N(μ, Σ) is defined as

  fX(x) = 1 / ((2π)^(n/2) |Σ|^(1/2)) · exp( −(1/2) (x − μ)^T Σ⁻¹ (x − μ) )

• For a single dimension, this expression reduces to

  fX(x) = 1 / (√(2π) σ) · exp( −(1/2) ((x − μ) / σ)² )

• Gaussian distributions are very popular since
  - The parameters (μ, Σ) are sufficient to uniquely characterize the normal distribution
  - If the xi's are mutually uncorrelated (cik = 0), then they are also independent
    · The covariance matrix becomes a diagonal matrix, with the individual variances on the main diagonal
  - The Central Limit Theorem
  - The marginal and conditional densities are also Gaussian
  - Any linear transformation of N jointly Gaussian random variables results in N random variables that are also Gaussian
    · For X = [X1 X2 ... XN]^T jointly Gaussian and A an N×N invertible matrix, Y = AX is also jointly Gaussian, with

      fY(y) = fX(A⁻¹ y) / |A|
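
A sketch verifying the density formula above against a library implementation (Python with SciPy assumed; μ, Σ, and the query point are my own example values):

```python
# Sketch (not from the slides): evaluating the N(mu, Sigma) density.
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])
sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
x = np.array([0.5, -0.5])

print(multivariate_normal(mean=mu, cov=sigma).pdf(x))

# The same value from the formula on this slide:
d = x - mu
n = len(mu)
norm_const = (2 * np.pi) ** (n / 2) * np.linalg.det(sigma) ** 0.5
print(np.exp(-0.5 * d @ np.linalg.inv(sigma) @ d) / norm_const)
```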

Central Limit Theorem
• The central limit theorem states that, given a distribution with mean μ and variance σ², the sampling distribution of the mean approaches a normal distribution with mean μ and variance σ²/N as N, the sample size, increases
  - No matter what the shape of the original distribution is, the sampling distribution of the mean approaches a normal distribution
  - Keep in mind that N is the sample size for each mean, not the number of samples
• A uniform distribution is used to illustrate the idea behind the Central Limit Theorem
  - Five hundred experiments were performed using a uniform distribution
    · For N = 1, one sample was drawn from the distribution and its mean was recorded (for each of the 500 experiments)
      Obviously, the histogram shows a uniform density
    · For N = 4, four samples were drawn and the mean of these four samples was recorded (for each of the 500 experiments)
      The histogram starts to show a Gaussian shape
    · And so on for N = 7 and N = 10
    · As N grows, the shape of the histograms resembles a Normal distribution more closely

[Figure: histograms of the 500 recorded means for N = 1, 4, 7, 10]
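
The experiment is easy to reproduce; a minimal sketch assuming Python with NumPy (500 experiments, as on the slide):

```python
# Sketch (not from the slides): sampling distribution of the mean.
import numpy as np

rng = np.random.default_rng(0)
for n in (1, 4, 7, 10):
    # 500 experiments, each recording the mean of n uniform samples
    means = rng.uniform(0, 1, size=(500, n)).mean(axis=1)
    print(n, means.var())  # the variance shrinks like sigma^2 / n
# Histograms of `means` show the Gaussian shape emerging as n grows.
```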
