
Chapter 2: Probability

Notes on MLAPP

Wu Ziqing

School of Computer Science and Engineering


Nanyang Technological University

14/07/2018



Outline

1 Basic Concept of Probability Theory


Definition of Probability
Basic Notations
Discrete and Continuous Variable
Mean and Variance
Fundamental Rules
Bayes Rule
Independence

2 Discrete Distributions
Binomial and Bernoulli Distributions
Multinomial and Multinoulli Distributions
Poisson Distribution
Empirical Distribution



Outline

3 Continuous Distributions
Normal Distribution
Degenerate pdf
Laplace Distribution
Gamma Distribution
Beta Distribution
Pareto Distribution

4 Joint Probability Distributions


Covariance and Correlation
Multivariate Gaussian
Multivariate Student t Distribution
Dirichlet Distribution



Outline

5 Transformations of Random Variables


Linear Transformation
General Transformation
Multivariate Change of Variables
Central Limit Theorem

6 Monte Carlo approximation

7 Information Theory
Entropy
KL Divergence
Mutual Information



Basic Concept of Probability Theory
Definition of Probability

There are two possible interpretations of probability:


Frequentist Interpretation: probability represents the long-run
frequency of events.
Bayesian Interpretation: probability quantifies our uncertainty
about something happening.
The Bayesian view lets us model uncertainty about events that do not
have long-run frequencies.



Basic Concept of Probability Theory
Basic Notations

p(A): denotes the probability that event A will happen; 0 ≤ p(A) ≤ 1.


p(¬A): denotes the probability of the complement event "not A";
p(¬A) = 1 − p(A).



Basic Concept of Probability Theory
Discrete and Continuous Variables (Discrete)

Discrete Random Variable: a variable that can take on any value from a
finite or countably infinite set X. The notation p(X = x) denotes the
probability of the event X = x.
p(·) is called a Probability Mass Function (pmf); it satisfies
0 ≤ p(x) ≤ 1 and Σ_{x∈X} p(x) = 1.



Basic Concept of Probability Theory
Discrete and Continuous Variables (Continuous)

If the variable is continuous, we define the Cumulative Distribution
Function (cdf) as:

F(q) = p(X ≤ q)

The cdf is always monotonically non-decreasing.
We define the Probability Density Function (pdf) as:

f(x) = (d/dx) F(x)

Thus the probability of a continuous variable falling in a finite interval is:

P(a < X ≤ b) = ∫_a^b f(x) dx = F(b) − F(a)

The probability of a continuous variable lying in a small interval around x is:

P(x ≤ X ≤ x + dx) ≈ p(x) dx

Note that here p(x) is allowed to take values > 1, so long as the
density integrates to 1.
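As a minimal Python sketch of these definitions (the density f(x) = 2x on [0, 1], with cdf F(q) = q², is just a made-up example), we can check numerically that P(a < X ≤ b) = F(b) − F(a) equals the integral of the pdf; note also that this density exceeds 1 near x = 1 while still integrating to 1:

```python
# Toy continuous distribution: pdf f(x) = 2x on [0, 1], cdf F(q) = q^2.
def f(x):          # pdf
    return 2 * x

def F(q):          # cdf
    return q ** 2

a, b, n = 0.2, 0.7, 100_000
width = (b - a) / n
# crude Riemann sum of the density over (a, b]
integral = sum(f(a + (i + 0.5) * width) * width for i in range(n))

print(F(b) - F(a))   # 0.45
print(integral)      # ~0.45, matching F(b) - F(a)
print(f(0.99))       # 1.98 > 1, yet the density still integrates to 1
```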
Basic Concept of Probability Theory
Discrete and Continuous Variables (Continuous) (Cont.)

Quantile: Since the cdf F is monotonically increasing, it can be inverted:
F^{−1}(α) = x_α such that P(X ≤ x_α) = α. x_α is called the α quantile of F.



Basic Concept of Probability Theory
Mean and Variance

Mean/Expected Value: denoted by µ.


For discrete variables:
E[X] = Σ_{x∈X} x p(x)
For continuous variables:
E[X] = ∫_X x p(x) dx

Variance: Variance measures the 'spread' of the data, denoted by σ².

Var[X] = E[(X − µ)²] = E[X²] − µ²

Thus, E[X²] = µ² + σ²

Standard Deviation: the standard deviation has the same units as the data.
It is denoted by σ:

Std[X] = √(Var[X])
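A small sketch computing the mean, variance and standard deviation of a hypothetical loaded-die pmf directly from the definitions above:

```python
# Hypothetical loaded die: side 6 comes up half the time.
pmf = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}

mu = sum(x * p for x, p in pmf.items())            # E[X]
ex2 = sum(x ** 2 * p for x, p in pmf.items())      # E[X^2]
var = ex2 - mu ** 2                                # Var[X] = E[X^2] - mu^2
std = var ** 0.5                                   # Std[X]

print(mu, var, std)   # 4.5, 3.25, ~1.803
```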
Basic Concept of Probability Theory
Fundamental Rules

Joint Probability of two events (Product Rule):


p(A, B) = p(A ∧ B) = p(A|B)p(B)

Probability of the Union of two events:

p(A ∨ B) = p(A) + p(B) − p(A ∧ B)
         = p(A) + p(B)   if A and B are mutually exclusive

Marginal Distribution (Sum Rule):

p(A) = Σ_b p(A, B = b) = Σ_b p(A|B = b)p(B = b)

Chain Rule (Product Rule applied several times):

p(X_{1:D}) = p(X_1)p(X_2|X_1)p(X_3|X_2, X_1)p(X_4|X_3, X_2, X_1)...p(X_D|X_{1:D−1})

Conditional Probability:

p(A|B) = p(A, B) / p(B),   if p(B) > 0



Basic Concept of Probability Theory
Bayes Rule

Bayes Rule/Bayes Theorem is the combination of Conditional
Probability, the Product Rule and the Sum Rule:

p(X = x|Y = y) = p(X = x, Y = y) / p(Y = y)
               = p(Y = y|X = x)p(X = x) / Σ_{x'} p(Y = y|X = x')p(X = x')

It is useful for obtaining the conditional probability p(A|B) when we already
know p(B|A) and p(A).
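A hedged numeric sketch of Bayes rule (the disease-testing numbers below are made up for illustration): we combine a prior p(D), a likelihood p(T = 1|D) and the sum rule to get the posterior p(D = 1|T = 1):

```python
# Hypothetical test: prior p(D=1) = 0.01, sensitivity p(T=1|D=1) = 0.9,
# false-positive rate p(T=1|D=0) = 0.05. We want p(D=1|T=1).
prior = {1: 0.01, 0: 0.99}                 # p(D = d)
likelihood = {1: 0.9, 0: 0.05}             # p(T = 1 | D = d)

evidence = sum(likelihood[d] * prior[d] for d in prior)   # p(T = 1), by the sum rule
posterior = likelihood[1] * prior[1] / evidence           # Bayes rule

print(posterior)   # ~0.154: a positive test is still far from conclusive
```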



Basic Concept of Probability Theory
Independence

Unconditional Independence/Marginal Independence: denoted by
X ⊥ Y, satisfies:

X ⊥ Y ⟺ p(X, Y) = p(X)p(Y)

Conditional Independence: in most cases, two variables are
independent only conditionally, via other variables:

X ⊥ Y | Z ⟺ p(X, Y|Z) = p(X|Z)p(Y|Z)

Conditional independence has the following property:

X ⊥ Y | Z iff there exist functions g() and h() such that:
p(x, y|z) = g(x, z)h(y, z)



Discrete Distributions
Binomial and Bernoulli Distributions

Binomial Distribution: When we toss a coin n times, with the probability
of heads being θ, the number of heads X ∈ {0, 1, ..., n} has a binomial
distribution:

Bin(k|n, θ) = C(n, k) θ^k (1 − θ)^{n−k},   where C(n, k) = n! / (k!(n − k)!)

The Binomial Distribution has the following properties:


mean = nθ, var = nθ(1 − θ)

Bernoulli Distribution: It is a special case of the binomial distribution
where n = 1. Thus,

Ber(x|θ) = θ^{I(x=1)} (1 − θ)^{I(x=0)}

That is,

Ber(x|θ) = θ if x = 1, and 1 − θ if x = 0
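A small Python sketch of the binomial pmf using math.comb, checking normalisation and that the mean and variance match nθ and nθ(1 − θ):

```python
import math

def binom_pmf(k, n, theta):
    # C(n, k) * theta^k * (1 - theta)^(n - k)
    return math.comb(n, k) * theta ** k * (1 - theta) ** (n - k)

n, theta = 10, 0.3
pmf = [binom_pmf(k, n, theta) for k in range(n + 1)]

print(sum(pmf))                                            # 1.0 (normalisation)
mean = sum(k * p for k, p in enumerate(pmf))
var = sum(k ** 2 * p for k, p in enumerate(pmf)) - mean ** 2
print(mean, n * theta)                                     # both 3.0
print(var, n * theta * (1 - theta))                        # both ~2.1
```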
Discrete Distributions
Multinomial and Multinoulli Distributions

Multinomial Distribution: If we instead toss a K-sided die n times, with
the probability of landing on each side represented by a vector
θ = (θ_1, θ_2, ..., θ_K), let x = (x_1, x_2, ..., x_K), where x_j is the number
of times the j-th side occurs; then x follows the Multinomial Distribution:

Mu(x|n, θ) = C(n; x_1, ..., x_K) ∏_{j=1}^K θ_j^{x_j},
where C(n; x_1, ..., x_K) = n! / (x_1! x_2! ... x_K!)

Multinoulli Distribution: It is a special case of the Multinomial
distribution where n = 1. It is also called the Categorical Distribution /
Discrete Distribution.

Cat(x|θ) = Mu(x|1, θ) = ∏_{j=1}^K θ_j^{I(x_j = 1)},   where p(x_j = 1|θ) = θ_j

Do note that here x becomes a binary vector with all elements 0 or 1,
and only one element can be 1 (since only one side will occur). This
encoding is also known as dummy encoding or one-hot encoding.
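A minimal sketch of the multinomial pmf for a hypothetical 3-sided die, computing the coefficient n!/(x_1!...x_K!) directly:

```python
import math

def multinomial_pmf(x, theta):
    # n!/(x1!...xK!) * prod_j theta_j^{x_j}
    n = sum(x)
    coef = math.factorial(n)
    for xj in x:
        coef //= math.factorial(xj)
    prob = 1.0
    for xj, tj in zip(x, theta):
        prob *= tj ** xj
    return coef * prob

theta = [0.2, 0.3, 0.5]                    # probabilities of the 3 sides
print(multinomial_pmf([1, 2, 2], theta))   # P(counts = (1, 2, 2)) in n = 5 tosses: 0.135
```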



Discrete Distributions
Poisson Distribution

The Poisson Distribution is usually used to model the probability that a
certain event occurs a given number of times in a specified time interval.
It has one parameter λ, the average number of events occurring in the
interval:

Poi(x|λ) = e^{−λ} λ^x / x!
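A small sketch of the Poisson pmf, truncating its infinite support to check that the probabilities sum to 1 and that the mean equals λ:

```python
import math

def poisson_pmf(x, lam):
    return math.exp(-lam) * lam ** x / math.factorial(x)

lam = 4.0
pmf = [poisson_pmf(x, lam) for x in range(100)]    # truncate the infinite support
print(sum(pmf))                                    # ~1.0
print(sum(x * p for x, p in enumerate(pmf)))       # ~4.0, the mean equals lambda
```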



Discrete Distributions
Empirical Distribution

The Empirical Distribution is obtained from observed data. For a dataset
D = {x_1, x_2, ..., x_N}:

p_emp(A) = (1/N) Σ_{i=1}^N δ_{x_i}(A),   where δ_x(A) = 1 if x ∈ A and 0 otherwise

Each data point can also be associated with a weight w_i:

p(x) = Σ_{i=1}^N w_i δ_{x_i}(x),   where 0 ≤ w_i ≤ 1 and Σ_{i=1}^N w_i = 1.
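A small sketch of the (uniformly weighted) empirical distribution of a toy dataset, and of the probability it assigns to a set A:

```python
from collections import Counter

data = [2, 2, 3, 5, 5, 5, 7, 9]            # toy dataset
N = len(data)

counts = Counter(data)
p_emp = {x: c / N for x, c in counts.items()}      # uniform weights w_i = 1/N
print(p_emp)                                        # {2: 0.25, 3: 0.125, 5: 0.375, ...}

# probability that a draw from the empirical distribution lands in A = [3, 6]
print(sum(p for x, p in p_emp.items() if 3 <= x <= 6))   # 0.5
```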



Continuous Distributions
Normal Distribution

The Normal Distribution/Gaussian Distribution has the pdf:


N(x|µ, σ²) = (1/√(2πσ²)) exp(−(x − µ)² / (2σ²))

Its cdf is:

Φ(x; µ, σ²) = ∫_{−∞}^x N(z|µ, σ²) dz

which is usually implemented as:

Φ(x; µ, σ²) = (1/2)[1 + erf(z/√2)],   where z = (x − µ)/σ
and erf(x) = (2/√π) ∫_0^x e^{−t²} dt

Standard Normal Distribution: the Normal distribution N(0, 1).


Precision: the precision λ = 1/σ². The higher the precision, the smaller the
variance, and the narrower the distribution.
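A sketch of the Gaussian pdf and of the cdf implemented via erf exactly as in the formula above, checked against a couple of familiar values:

```python
import math

def norm_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def norm_cdf(x, mu=0.0, sigma=1.0):
    z = (x - mu) / sigma
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

print(norm_cdf(0.0))               # 0.5 (the mean is also the median)
print(norm_cdf(1.96))              # ~0.975
print(norm_cdf(1) - norm_cdf(-1))  # ~0.683, mass within one standard deviation
```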



Continuous Distributions
Normal Distribution (Cont.)

Advantages of the Normal Distribution:


1 It has two parameters, which capture the most basic properties of a
distribution (mean and variance).
2 The sum of independent random variables has an approximately
Normal Distribution (Central Limit Theorem).
3 It makes the fewest assumptions (it has maximum entropy), subject to
the constraint of having a specified mean and variance.
4 It has a simple mathematical form, which is easy to implement.



Continuous Distributions
Degenerate pdf (Dirac Delta Function)

Dirac Delta Function: When the variance of a Normal Distribution
approaches 0, its pdf becomes infinitely thin and tall:

lim_{σ²→0} N(x|µ, σ²) = δ(x − µ)

Here δ(x − µ) is called the Dirac Delta Function. It is defined as:

δ(x) = ∞ if x = 0, and 0 if x ≠ 0,   where ∫_{−∞}^∞ δ(x) dx = 1

The Dirac Delta Function has the sifting property: it selects out a single
term from a sum or integral:

∫_{−∞}^∞ f(x) δ(x − µ) dx = f(µ)



Continuous Distributions
Student t Distribution

Student t Distribution: Compared to the Normal Distribution, the Student t
Distribution is less affected by outliers. It has the following pdf:

T(x|µ, σ², v) ∝ [1 + (1/v)((x − µ)/σ)²]^{−(v+1)/2}

v > 0 is called the degrees of freedom.


The Student t Distribution has the following properties:
mode = µ
mean = µ, only if v > 1
variance = vσ²/(v − 2), only if v > 2



Continuous Distributions
Laplace Distribution

The Laplace Distribution/Double Sided Exponential Distribution is also a
distribution that is more robust to outliers than the Normal Distribution.
It has the following pdf:

Lap(x|µ, b) = (1/(2b)) exp(−|x − µ|/b)

It has the following properties:


mode = µ
mean = µ
variance = 2b²



Continuous Distributions
Gamma Distribution

The Gamma Distribution is a flexible distribution for positive real-valued
random variables. It is defined as follows:

Γ(T|shape = a, rate = b) = (b^a / Γ(a)) T^{a−1} e^{−Tb},
where Γ(x) = ∫_0^∞ u^{x−1} e^{−u} du

The Gamma Distribution has the following properties:


mode = (a − 1)/b
mean = a/b
variance = a/b²
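A hedged check, assuming SciPy is available, that the Gamma(shape = a, rate = b) mean and variance match a/b and a/b²; note that scipy.stats.gamma is parameterised by scale = 1/rate:

```python
from scipy.stats import gamma

a, b = 3.0, 2.0                     # shape and rate
dist = gamma(a, scale=1.0 / b)      # scipy uses scale = 1/rate

print(dist.mean(), a / b)           # 1.5, 1.5
print(dist.var(), a / b ** 2)       # 0.75, 0.75
```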



Continuous Distributions
Gamma Distribution (Cont.)

Special cases of the Gamma Distribution:


Exponential Distribution: defined as Expon(x|λ) = Γ(x|1, λ),
where λ is the parameter of the Poisson Distribution. It describes the
time between two consecutive events in a Poisson process.
Erlang Distribution: defined as Erlang(x|λ) = Γ(x|2, λ).
Chi-squared Distribution: defined by χ²(x|v) = Γ(x|v/2, 1/2). It is
the distribution of a sum of squared Gaussian variables, i.e., if
Z_i ∼ N(0, 1) and S = Σ_{i=1}^v Z_i², then S ∼ χ²_v.



Continuous Distributions
Gamma Distribution (Cont.)

Inverse Gamma Distribution: if X ∼ Ga(a, b), then 1/X ∼ IG(a, b),
which is defined by:

IG(x|a, b) = (b^a / Γ(a)) x^{−(a+1)} e^{−b/x}

IG has the following properties:


mode = b/(a + 1)
mean = b/(a − 1), only if a > 1
variance = b²/((a − 1)²(a − 2)), only if a > 2



Continuous Distributions
Beta Distribution

The Beta Distribution has support on [0, 1] and is defined as:


Beta(x|a, b) = (1/B(a, b)) x^{a−1} (1 − x)^{b−1},   where B(a, b) = Γ(a)Γ(b)/Γ(a + b)

The Beta Distribution only exists when a, b > 0.


If a = b = 1, the distribution becomes the Uniform Distribution.
If a, b < 1, the distribution becomes bimodal, with spikes at 0 and 1.
If a, b > 1, the distribution becomes unimodal, with a heap shape.
The Beta Distribution has the following properties:
mode = (a − 1)/(a + b − 2)
mean = a/(a + b)
variance = ab/((a + b)²(a + b + 1))
Continuous Distributions
Pareto Distribution

The Pareto Distribution is used to model quantities that exhibit long
tails. It is defined as:

Pareto(x|k, m) = k m^k x^{−(k+1)} I(x ≥ m)

The Pareto Distribution has the following properties:


mode = m
mean = km/(k − 1), only if k > 1
variance = m²k/((k − 1)²(k − 2)), only if k > 2.



Joint Probability Distribution

A Joint Probability Distribution has multiple variables and has the form
p(x_1, x_2, ..., x_D) for a set of D variables x.
If all variables are discrete, we can represent the joint probability as a
multi-dimensional array, with one variable per dimension.
The size of such a high-dimensional array can be reduced by making
Conditional Independence assumptions, or by restricting the pdf to certain
functional forms (for continuous distributions).



Joint Probability Distribution
Covariance

Covariance describes the degree to which two variables are linearly related:

cov[X, Y] = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X]E[Y]

If x is a d-dimensional vector containing d variables, its Covariance
Matrix is defined as:

cov[x] = E[(x − E[x])(x − E[x])^T]

       | var[X_1]        cov[X_1, X_2]   ...   cov[X_1, X_d] |
     = | cov[X_2, X_1]   var[X_2]        ...   cov[X_2, X_d] |
       | ...             ...             ...   ...           |
       | cov[X_d, X_1]   cov[X_d, X_2]   ...   var[X_d]      |



Joint Probability Distribution
Correlation

The Correlation Coefficient normalises the Covariance and gives it a finite
upper bound:

corr[X, Y] = cov[X, Y] / √(var[X] var[Y])

The Correlation Matrix R is thus:

    | corr[X_1, X_1]   corr[X_1, X_2]   ...   corr[X_1, X_d] |
R = | corr[X_2, X_1]   corr[X_2, X_2]   ...   corr[X_2, X_d] |
    | ...              ...              ...   ...            |
    | corr[X_d, X_1]   corr[X_d, X_2]   ...   corr[X_d, X_d] |

The Correlation Coefficient measures the degree of linearity:
corr[X, Y] = 1 if X and Y have a linear relationship.
corr[X, Y] = 0 means X and Y are uncorrelated.
If X and Y are independent (i.e., p(X, Y) = p(X)p(Y)), then corr[X, Y] = 0,
but not vice versa.
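A sketch using NumPy to estimate the covariance and correlation matrices from samples of two (noisily) linearly related variables:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = 2.0 * x + rng.normal(scale=0.5, size=10_000)   # y is (noisily) linear in x

print(np.cov(x, y))        # 2x2 covariance matrix; cov[x, y] ~ 2
print(np.corrcoef(x, y))   # correlation matrix; corr[x, y] close to (but below) 1
```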
Joint Probability Distribution
Multivariate Gaussian

The Multivariate Gaussian is defined as:

N(x|µ, Σ) = (1 / ((2π)^{D/2} |Σ|^{1/2})) exp[−(1/2)(x − µ)^T Σ^{−1} (x − µ)],

where µ is the mean vector of x, and Σ is the D × D Covariance Matrix.
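A hedged sketch, assuming SciPy is available, that evaluates the multivariate Gaussian density both with the formula above and with scipy.stats.multivariate_normal, for comparison (the particular µ, Σ and x below are arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
x = np.array([0.5, 0.5])

D = len(mu)
diff = x - mu
manual = np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / \
         ((2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma)))

print(manual)
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))   # should match
```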



Joint Probability Distribution
Multivariate Student t Distribution

The Multivariate Student t Distribution is defined as:

T(x|µ, Σ, v) = [Γ(v/2 + D/2) / Γ(v/2)] × (|Σ|^{−1/2} / (v^{D/2} π^{D/2}))
               × [1 + (1/v)(x − µ)^T Σ^{−1} (x − µ)]^{−(v+D)/2}
             = [Γ(v/2 + D/2) / Γ(v/2)] × |πV|^{−1/2}
               × [1 + (x − µ)^T V^{−1} (x − µ)]^{−(v+D)/2},
where V = vΣ

The Multivariate Student t Distribution has the following properties:


mode = µ
mean = µ
variance = (v/(v − 2)) Σ



Joint Probability Distribution
Dirichlet Distribution

The Dirichlet Distribution has support on the probability simplex
S_K = {x : 0 ≤ x_k ≤ 1, Σ_{k=1}^K x_k = 1}. It is defined by:

Dir(x|α) = (1/B(α)) ∏_{k=1}^K x_k^{α_k − 1} I(x ∈ S_K),

where B(α) = ∏_{k=1}^K Γ(α_k) / Γ(α_0) and α_0 = Σ_{k=1}^K α_k

The Dirichlet Distribution has the following properties:


mode[x_k] = (α_k − 1)/(α_0 − K)
mean[x_k] = α_k/α_0
variance[x_k] = α_k(α_0 − α_k) / (α_0²(α_0 + 1))
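A sketch drawing samples from a Dirichlet distribution with NumPy, checking that each sample lies on the simplex and that the sample mean of each component approaches α_k/α_0:

```python
import numpy as np

alpha = np.array([2.0, 5.0, 3.0])
rng = np.random.default_rng(0)
samples = rng.dirichlet(alpha, size=100_000)

print(samples.sum(axis=1)[:3])   # each sample sums to 1 (lies on the simplex)
print(samples.mean(axis=0))      # ~ [0.2, 0.5, 0.3]
print(alpha / alpha.sum())       # theoretical mean alpha_k / alpha_0
```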



Transformations of Random Variables
Linear Transformation

If y = f(x) = Ax + b, then:
E[y] = E[Ax + b] = Aµ + b
cov[y] = cov[Ax + b] = AΣA^T
where µ = E[x] and Σ = cov[x].
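A sketch verifying E[Ax + b] = Aµ + b and cov[Ax + b] = AΣA^T with Monte Carlo samples of a 2-dimensional Gaussian (the particular A, b, µ, Σ below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
A = np.array([[2.0, 0.0],
              [1.0, 3.0]])
b = np.array([0.5, -1.0])

x = rng.multivariate_normal(mu, Sigma, size=200_000)
y = x @ A.T + b                     # apply y = Ax + b to every sample

print(y.mean(axis=0), A @ mu + b)   # empirical vs analytic mean
print(np.cov(y.T))                  # empirical covariance of y
print(A @ Sigma @ A.T)              # analytic covariance A Sigma A^T
```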



Transformations of Random Variables
General Transformation

General Transformation: If we transform x into y = f(x):


If x is discrete:
p_y(y) = Σ_{x: f(x)=y} p_x(x)
If x is continuous, we work with the cdf first:
P_y(y) = P(Y ≤ y) = P(f(X) ≤ y) = P(X ∈ {x : f(x) ≤ y})
If f is monotonic, it can be inverted:
P_y(y) = P(f(X) ≤ y) = P(X ≤ f^{−1}(y)) = P_x(f^{−1}(y))
To obtain the pdf, we differentiate the cdf:
p_y(y) = (d/dy) P_y(y) = (d/dy) P_x(f^{−1}(y)) = (dx/dy)(d/dx) P_x(x) = (dx/dy) p_x(x)
Since the sign of dx/dy is irrelevant, we take its absolute value and get:
p_y(y) = |dx/dy| p_x(x)   (Change of Variables Formula)



Transformations of Random Variables
Multivariate Change of Variables

If f() is a function that maps an n-dimensional vector x to an n-dimensional
vector y, then |dy/dx| is given by |det J_{x→y}|, where J is the
Jacobian Matrix:

J_{x→y} = ∂(y_1, ..., y_n)/∂(x_1, ..., x_n) =

    | ∂y_1/∂x_1   ∂y_1/∂x_2   ...   ∂y_1/∂x_n |
    | ∂y_2/∂x_1   ∂y_2/∂x_2   ...   ∂y_2/∂x_n |
    | ...         ...         ...   ...       |
    | ∂y_n/∂x_1   ∂y_n/∂x_2   ...   ∂y_n/∂x_n |

If f() is an invertible mapping, then according to the Change of Variables
Formula,

p_y(y) = p_x(x) |det J_{y→x}|



Transformations of Random Variables
Central Limit Theorem

Consider N random variables, each with the same pdf and hence the same µ
and σ², i.e., the variables are independent and identically distributed (i.i.d.).
Let S_N = Σ_{i=1}^N X_i, i.e., the sum of all the variables. As N increases,

p(S_N = s) → (1/√(2πNσ²)) exp(−(s − Nµ)² / (2Nσ²))

i.e., the distribution of the standardised sum Z_N = (S_N − Nµ)/(σ√N)
converges to the standard normal.
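A small simulation of the Central Limit Theorem: sums of N i.i.d. Uniform(0, 1) variables, once standardised, behave like a standard normal:

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 50, 100_000
mu, sigma2 = 0.5, 1.0 / 12.0                       # mean/variance of Uniform(0, 1)

S = rng.uniform(0.0, 1.0, size=(trials, N)).sum(axis=1)
Z = (S - N * mu) / np.sqrt(N * sigma2)             # standardised sum

print(Z.mean(), Z.std())                           # ~0 and ~1
print(np.mean(np.abs(Z) <= 1.96))                  # ~0.95, as for N(0, 1)
```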



Monte Carlo approximation

Monte Carlo approximation computes the distribution of a function of a
random variable by:


1 generating S samples from the distribution, x_1, x_2, ..., x_S;
2 approximating the distribution of f(X) using the empirical distribution of
{f(x_s)}_{s=1}^S, e.g., by calculating the arithmetic mean of the function
applied to the samples:

E[f(X)] = ∫ f(x)p(x) dx ≈ (1/S) Σ_{s=1}^S f(x_s)

From the samples drawn, we can compute the following quantities:


E[X] ≈ x̄ = (1/S) Σ_{s=1}^S x_s

var[X] ≈ (1/S) Σ_{s=1}^S (x_s − x̄)²

median(X) ≈ median{x_1, x_2, ..., x_S}

P(X ≤ c) ≈ (1/S) #{x_s ≤ c}
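A sketch of Monte Carlo approximation, estimating E[f(X)] for f(x) = x² with X ∼ N(0, 1) (exact value 1), along with the other sample quantities listed above:

```python
import numpy as np

rng = np.random.default_rng(0)
S = 100_000
x = rng.normal(size=S)             # 1. draw S samples from the distribution
fx = x ** 2                        # 2. push them through f

print(fx.mean())                   # ~1.0, approximates E[x^2] = sigma^2 + mu^2 = 1
print(x.mean(), x.var())           # sample mean and variance of X itself
print(np.median(x))                # sample median
print(np.mean(x <= 1.0))           # estimate of P(X <= 1) ~ 0.841
```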



Monte Carlo approximation (Cont.)

The accuracy of a Monte Carlo approximation depends on the number of
samples drawn. By the Central Limit Theorem, the error of MC, i.e., the
difference between the sample mean and the actual mean, satisfies:

(µ̂ − µ) → N(0, σ²/S),

where σ², although unknown, can be estimated by:

σ̂² = (1/S) Σ_{s=1}^S (f(x_s) − µ̂)²

Since for a Normal Distribution,

P{µ − 1.96 σ̂/√S ≤ µ̂ ≤ µ + 1.96 σ̂/√S} ≈ 0.95,

where σ̂/√S is called the Standard Error.

We can obtain an estimate accurate to within ±ε with probability at
least 95% by choosing a sample size S ≥ 4σ̂²/ε².



Information Theory

Definition: Information Theory is concerned with:


Data Compression/Source Coding: represent data in a compressed
fashion.
Error Correction/Channel Coding: transmit and store data in a
way that is robust to errors.

Relation to Probability Model:


Compressing data requires representing frequently occurring messages with
short code words, and reserving long code words for rarely used messages.
A good probability model is required for decoding messages sent over
noisy channels.



Information Theory
Entropy

The Entropy of a random variable X with distribution p is a measure of its
uncertainty. It is denoted by H(X):

H(X) = − Σ_{k=1}^K p(X = k) log2 p(X = k)

Entropy has the following properties:


The unit when using log2 is called bits (binary digits). The unit
when using ln is called nats (natural digits).
The uniform distribution has the maximum entropy, H(X) = log2 K for a
K-ary random variable.
A deterministic distribution (all mass on one state) has the minimum
entropy, H(X) = 0.
If X is a binary variable and we denote p(X = 1) = θ, we get the
Binary Entropy Function:
H(X) = −[θ log2 θ + (1 − θ) log2(1 − θ)]
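A sketch of the entropy of a discrete distribution (in bits) and of the binary entropy function, which is maximised at θ = 0.5:

```python
import math

def entropy(p):
    # H(p) in bits; terms with p_k = 0 contribute nothing
    return -sum(pk * math.log2(pk) for pk in p if pk > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0 = log2(4), the uniform maximum
print(entropy([1.0, 0.0, 0.0, 0.0]))       # 0.0, a deterministic distribution

for theta in (0.1, 0.5, 0.9):
    print(theta, entropy([theta, 1 - theta]))   # binary entropy, peaks at 1 bit
```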
Information Theory
KL Divergence

The Kullback-Leibler Divergence (KL Divergence) is a measure of the
dissimilarity between two probability distributions p and q:

KL(p||q) = Σ_{k=1}^K p_k log(p_k / q_k)
         = Σ_{k=1}^K p_k log p_k − Σ_{k=1}^K p_k log q_k
         = −H(p) + H(p, q)

H(p, q) is called the cross entropy. It is the average number of bits needed
to encode data coming from a source with distribution p when we use model q
to encode it. Thus, the KL Divergence is the average number of extra bits
needed to encode the data.
The Information Inequality theorem states that:
KL(p||q) ≥ 0, with equality only if p = q.
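A sketch computing KL(p||q) as cross entropy minus entropy for two made-up distributions, illustrating that it is non-negative, zero only when p = q, and not symmetric:

```python
import math

def cross_entropy(p, q):
    # H(p, q) in bits
    return -sum(pk * math.log2(qk) for pk, qk in zip(p, q) if pk > 0)

def kl(p, q):
    return cross_entropy(p, q) - cross_entropy(p, p)   # H(p, q) - H(p)

p = [0.5, 0.4, 0.1]
q = [1 / 3, 1 / 3, 1 / 3]

print(kl(p, q))   # > 0
print(kl(q, p))   # > 0 but different: KL is not symmetric
print(kl(p, p))   # 0.0
```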
Information Theory
Mutual Information

Mutual Information (MI) measures how much knowing one variable X can
tell us about another variable Y. It is defined by the KL Divergence between
the joint probability p(X, Y) and the factored probability p(X)p(Y):

I(X; Y) = KL(p(X, Y) || p(X)p(Y))
        = Σ_x Σ_y p(x, y) log [p(x, y) / (p(x)p(y))]
        = H(X) − H(X|Y)
        = H(Y) − H(Y|X)

H(Y|X) is called the Conditional Entropy, which equals Σ_x p(x) H(Y|X = x).
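A sketch computing the mutual information of a hypothetical 2 × 2 joint distribution, both directly from the definition and as H(X) − H(X|Y):

```python
import math

# Made-up joint distribution p(X, Y) over binary X and Y
p_xy = {(0, 0): 0.4, (0, 1): 0.1,
        (1, 0): 0.1, (1, 1): 0.4}

p_x = {x: sum(p for (xi, _), p in p_xy.items() if xi == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yi), p in p_xy.items() if yi == y) for y in (0, 1)}

# Direct definition: sum p(x, y) log2 [p(x, y) / (p(x) p(y))]
mi = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())

H_x = -sum(p * math.log2(p) for p in p_x.values())
H_x_given_y = -sum(p * math.log2(p / p_y[y]) for (x, y), p in p_xy.items())

print(mi)                  # ~0.278 bits
print(H_x - H_x_given_y)   # same value: the reduction in uncertainty about X
```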



Information Theory
Mutual Information (Cont.)

From the previous equation, we can see:


I(X; Y) = 0 if and only if p(X, Y) = p(X)p(Y), which means X and Y are
independent.
According to the last two lines of the previous equation, we can
interpret MI as the reduction in uncertainty about X after observing
Y, or vice versa.

Pointwise Mutual Information measures the discrepancy between two
events occurring together as compared to what would be expected by
chance. It is defined as:

PMI(x, y) = log [p(x, y) / (p(x)p(y))] = log [p(x|y) / p(x)] = log [p(y|x) / p(y)]

From this equation, we can see that MI is the expected value of PMI.



Information Theory
Mutual Information (Cont.)

MI for continuous variables: we first need to discretise the variables by
separating them into different bins.


The sizes and boundaries of the bins can be selected by trying many
combinations and taking the largest resulting value. This normalised
statistic is called the Maximal Information Coefficient (MIC):

MIC = max_{x,y: xy<B} [ max_{G∈G(x,y)} I(X(G); Y(G)) / log min(x, y) ]

G(x, y) is the set of 2d grids of size x × y; X(G), Y(G) are the
discretisations of the variables on the grid; B is a sample-size
dependent bound, usually set to N^0.6.
MIC ∈ [0, 1], where 0 represents no relationship and 1 represents a noise-free
relationship (not limited to linear relationships).

