
Intro to Probability

Andrei Barbu (MIT)
August 2018


Some problems

A means to capture uncertainty:

You have data from two sources, are they different?
How can you generate data or fill in missing data?
What explains the observed data?


Uncertainty

Every event has a nonnegative probability of occurring, some event always
occurs, and combining mutually exclusive events adds their probabilities.

Why these axioms? Many other choices are possible:

Possibility theory, probability intervals
Belief functions, upper and lower probabilities
Probability is special: Dutch books

I offer that if X happens I pay out R, otherwise I keep your money.

Why should I always value my bet at pR, where p is the probability of X?

Negative p
  I'm offering to pay out R for −pR dollars: you earn money just for taking the bet.
Probabilities don't sum to one
  Take out a bet that always pays off.
  If the sum is below 1, I pay out R after collecting less than R dollars.
  If the sum is above 1, buy the bet and sell it to me for more.
When X and Y are incompatible the value isn't the sum
  If the value is bigger, I still pay out more.
  If the value is smaller, sell me my own bets.

Decisions under probability are "rational": no opponent can assemble a
collection of my own bets that guarantees I lose money.
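A minimal numeric sketch of the "don't sum to one" case, with invented
numbers, not from the slides: a hypothetical bookie prices a bet that pays
R on face i of a die at p_i × R, but the quoted p_i sum to 1.2.

    import random

    # Hypothetical bookie prices: each of six bets pays R = 10 on one face
    # of a fair die and costs p_i * R. The quoted "probabilities" sum to 1.2.
    prices = [0.2] * 6
    R = 10.0

    random.seed(0)
    for trial in range(5):
        face = random.randrange(6)               # whatever outcome occurs
        collected = sum(p * R for p in prices)   # sell one unit of every bet: 12.0
        paid_out = R                             # exactly one face wins, so I owe R
        print(f"outcome {face}: profit {collected - paid_out:+.2f}")
    # Prints +2.00 on every trial: a guaranteed win for the bettor, possible
    # only because the quoted "probabilities" violate the sum-to-one axiom.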


Experiments, theory, and funding


Data

Mean µ_X = E[X] = Σ_x x p(x)
Variance σ_X² = var(X) = E[(X − µ_X)²]
Covariance cov(X, Y) = E[(X − µ_X)(Y − µ_Y)]
Correlation ρ_{X,Y} = cov(X, Y) / (σ_X σ_Y)
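As a quick sketch, all four statistics can be estimated from samples with
numpy; the data below is made up purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=5.0, scale=2.0, size=10_000)
    y = 0.5 * x + rng.normal(size=10_000)             # y depends on x

    print("mean        ", x.mean())                   # estimates µ_X
    print("variance    ", x.var(ddof=1))              # estimates σ_X²
    print("covariance  ", np.cov(x, y, ddof=1)[0, 1]) # estimates cov(X, Y)
    print("correlation ", np.corrcoef(x, y)[0, 1])    # estimates ρ_{X,Y}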


Mean and variance

We often implicitly assume that our data comes from a normal distribution.
And that our samples are i.i.d. (independent, identically distributed).
Generally these don't capture enough about the underlying data.
Uncorrelated does not mean independent!


Correlation vs independence

V = N(0, 1), X = sin(V), Y = cos(V)

X and Y are uncorrelated, yet X² + Y² = 1: they are completely dependent.
Correlation only measures linear relationships.
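A short simulation of exactly this construction (sample size chosen
arbitrarily):

    import numpy as np

    rng = np.random.default_rng(1)
    v = rng.normal(size=1_000_000)
    x, y = np.sin(v), np.cos(v)

    print("correlation:", np.corrcoef(x, y)[0, 1])  # ~0: no linear relationship
    print("max |x^2 + y^2 - 1|:",
          np.abs(x**2 + y**2 - 1).max())            # ~1e-16: X² + Y² = 1 exactly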


Data dinosaurs

(Figure: very different datasets, one shaped like a dinosaur, sharing
nearly identical means, variances, and correlations.)


Data

Summary statistics: mean, variance, covariance, correlation.

Are two players the same?

How do you know and how certain are you?
What about two players is different?
How do you quantify which differences matter?
Here's a player, how good will they be?
What is the best information to ask for?
What is the best test to run?
If I change the size of the board, how might the results change?
...


Probability as an experiment

A machine enters a random state; its current state is an event.

Events, x, have associated probabilities, p(X = x) (p(x) for short)
Sets of events, A
Random variables, X, are a function of the event
The probability of both events p(A ∩ B)
The probability of either event p(A ∪ B)
p(¬x) = 1 − p(x)
Joint probabilities P(x, y)
Independence P(x, y) = P(x)P(y)
Conditional probabilities P(x|y) = P(x, y) / P(y)
Law of total probability: Σ_{a∈A} p(a) = 1 when the events A are a disjoint cover
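A small sketch checking these identities on a concrete "machine", two fair
dice; the particular events x and y below are illustrative choices, not
from the slides.

    import random

    random.seed(0)
    N = 200_000
    rolls = [(random.randint(1, 6), random.randint(1, 6)) for _ in range(N)]

    # x = "first die shows 6", y = "the sum is at least 10".
    p_x  = sum(a == 6 for a, b in rolls) / N
    p_y  = sum(a + b >= 10 for a, b in rolls) / N
    p_xy = sum(a == 6 and a + b >= 10 for a, b in rolls) / N

    print("P(x|y) =", p_xy / p_y)   # conditional probability, ~0.5
    print("P(x)P(y) =", p_x * p_y,
          "vs P(x,y) =", p_xy)      # not equal: x and y are dependent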


Beating the lottery


Analyzing a test

You want to play the lottery, and have a method to win.

0.5% of tickets are winners, and you have a test to verify this.
Your test is 85% accurate (a 5% false-positive rate and a 10% false-negative rate).
Is this test useful? How useful? Should you be betting?


Is this test useful?

The four possible outcomes (D: the ticket is a winner, T: the test result):
(D−, T+) (D−, T−) (D+, T+) (D+, T−)

What percent of the time when my test comes up positive am I a winner?

P(D+ ∩ T+) / P(T+) = (0.9 × 0.005) / (0.9 × 0.005 + 0.05 × 0.995) ≈ 8.3%
                   = P(T+|D+) P(D+) / P(T+)

P(A|B) = P(B|A) P(A) / P(B)    posterior = likelihood × prior / probability of data
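The 8.3% figure can be checked both ways, by plugging into Bayes' rule and
by simulating tickets; a sketch using the slide's numbers (code itself is
illustrative):

    import random

    p_win, sens, fpr = 0.005, 0.90, 0.05   # P(D+), P(T+|D+), P(T+|D-)

    # Bayes' rule directly:
    p_pos = sens * p_win + fpr * (1 - p_win)   # P(T+)
    print("P(D+|T+) =", sens * p_win / p_pos)  # ~0.083

    # The same number by simulating many tickets:
    random.seed(0)
    wins = positives = 0
    for _ in range(2_000_000):
        d = random.random() < p_win                 # is the ticket a winner?
        t = random.random() < (sens if d else fpr)  # does the test say yes?
        positives += t
        wins += (d and t)
    print("simulated  ", wins / positives)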


Common Probability Distributions
(Slides from Prof. Kai Arras, Human-Oriented Robotics, Social Robotics Lab)

Bernoulli Distribution

• Given a Bernoulli experiment, that is, a yes/no experiment with outcomes 0 ("failure") or 1 ("success")
• The Bernoulli distribution is a discrete probability distribution which takes value 1 with success probability p and value 0 with failure probability 1 − p
• Probability mass function: P(x | p) = p^x (1 − p)^(1−x), x ∈ {0, 1}
• Parameters: p, the probability of observing a success
• Expectation: E[x] = p
• Variance: var[x] = p(1 − p)
• Notation: x ∼ Bern(p)
(Plot: PMF bars of height 1 − p at x = 0 and p at x = 1.)

Binomial Distribution

• Given a sequence of Bernoulli experiments
• The binomial distribution is the discrete probability distribution of the number of successes m in a sequence of N independent yes/no experiments, each of which yields success with probability p
• Probability mass function: P(m | N, p) = C(N, m) p^m (1 − p)^(N−m)
• Parameters: N, the number of trials; p, the success probability
• Expectation: E[m] = Np
• Variance: var[m] = Np(1 − p)
• Notation: m ∼ Bin(N, p)
(Plot: PMFs for p = 0.5, N = 20; p = 0.7, N = 20; p = 0.5, N = 40.)

Gaussian Distribution

• Most widely used distribution for continuous variables
• Reasons: (i) simplicity (fully represented by only two moments, mean and variance) and (ii) the central limit theorem (CLT)
• The CLT states that, under mild conditions, the mean (or sum) of many independently drawn random variables is distributed approximately normally, irrespective of the form of the original distribution
• Probability density function: p(x | µ, σ²) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))
• Parameters: µ, the mean; σ², the variance
• Expectation: E[x] = µ
• Variance: var[x] = σ²
• Notation: x ∼ N(µ, σ²)
• Called the standard normal distribution for µ = 0 and σ = 1
• About 68% (roughly two thirds) of values drawn from a normal distribution are within ±1 standard deviation of the mean
• About 95% of the values lie within ±2 standard deviations of the mean; important e.g. for hypothesis testing
(Plot: densities for (µ, σ²) = (0, 1), (−3, 0.1), (2, 2).)

Multivariate Gaussian Distribution

• For d-dimensional random vectors, the multivariate Gaussian distribution is governed by a d-dimensional mean vector µ and a d × d covariance matrix Σ that must be symmetric and positive semi-definite
• Probability density function: p(x | µ, Σ) = (2π)^(−d/2) |Σ|^(−1/2) exp(−½ (x − µ)ᵀ Σ⁻¹ (x − µ))
• Parameters: µ, the mean vector; Σ, the covariance matrix
• Expectation: E[x] = µ
• Variance: cov[x] = Σ
• Notation: x ∼ N(µ, Σ)
• For d = 2, we have the bivariate Gaussian distribution
• The covariance matrix (often C) determines the shape of the distribution
(Figure: bivariate Gaussian densities for several example covariance matrices; source [1].)

Poisson Distribution

• Consider independent events that happen with an average rate of λ over time
• The Poisson distribution is a discrete distribution that describes the probability of a given number of events occurring in a fixed interval of time
• Can also be defined over other intervals such as distance, area or volume
• Probability mass function: P(m | λ) = (λ^m / m!) e^(−λ)
• Parameters: λ, the average rate of events over time or space
• Expectation: E[m] = λ
• Variance: var[m] = λ
• Notation: m ∼ Pois(λ)
(Plot: PMFs for λ = 1, 4, 10.)
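A sketch that spot-checks the expectation and variance formulas above by
sampling with numpy; all parameter values are arbitrary.

    import numpy as np

    rng = np.random.default_rng(2)
    n, p, N, lam, mu, sigma = 1_000_000, 0.3, 20, 4.0, 2.0, 1.5

    bern  = rng.binomial(1, p, size=n)     # Bern(p)
    binom = rng.binomial(N, p, size=n)     # Bin(N, p)
    gauss = rng.normal(mu, sigma, size=n)  # N(mu, sigma^2)
    pois  = rng.poisson(lam, size=n)       # Pois(lambda)

    for name, s, mean, var in [
            ("Bern",  bern,  p,     p * (1 - p)),
            ("Bin",   binom, N * p, N * p * (1 - p)),
            ("Gauss", gauss, mu,    sigma**2),
            ("Pois",  pois,  lam,   lam)]:
        # Sample moments should land close to the closed-form formulas.
        print(f"{name:5s} mean {s.mean():.3f} ~ {mean}   var {s.var():.3f} ~ {var}")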
Bayesian updates

P(θ|X) = P(X|θ) P(θ) / P(X)

P(θ): I have some prior over how good a player is: informative vs uninformative.
P(X|θ): I think dart throwing is a stochastic process; every player has an unknown mean.
X: I observe them throwing darts.
P(X): Across all parameters, this is how likely the data is.
Normalization is usually hard to compute, but it's often not needed.

Say P(θ) is a normal distribution with mean 0 and high variance.

And P(X|θ) is also a normal distribution.
What's the best estimate for this player's performance?
Find the mode of the posterior:

∂/∂θ log P(θ|X) = 0
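For a normal prior and normal likelihood, the posterior is again normal, so
the maximizer has a closed form. A sketch with invented numbers (prior mean
0 with large variance as on the slide; the noise σ is an assumed known):

    import numpy as np

    mu0, tau0 = 0.0, 10.0      # prior: theta ~ N(mu0, tau0^2), wide/uninformative
    sigma = 2.0                # assumed known throw-to-throw noise

    x = np.array([3.1, 2.4, 4.0, 3.3, 2.8])  # observed dart scores (made up)
    n = len(x)

    # Conjugacy: P(theta|X) is normal, and its mode (the solution of
    # d/dtheta log P(theta|X) = 0) is a precision-weighted average of the
    # prior mean and the sample mean.
    post_prec = 1 / tau0**2 + n / sigma**2
    post_mean = (mu0 / tau0**2 + x.sum() / sigma**2) / post_prec
    print("posterior mean:", post_mean, "posterior std:", post_prec**-0.5)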


Graphical models

So far we've talked about independence, conditioning, and observation.
A toolkit to discuss these at a higher level of abstraction.


Speech recognition: Naïve Bayes

Break down a speech signal into parts, then recover the original speech.

Create a set of features; each sound is composed of combinations of features.

P(c|X) ∝ P(c) ∏_k P(X_k|c)
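A minimal sketch of this classification rule on synthetic binary features;
the data, Laplace smoothing, and shapes are illustrative assumptions, not
the speech pipeline from the slides.

    import numpy as np

    rng = np.random.default_rng(3)
    n_classes, n_feats = 3, 8
    X = rng.integers(0, 2, size=(300, n_feats))   # binary features X_k
    y = rng.integers(0, n_classes, size=300)      # class labels c

    # Estimate P(c) and P(X_k = 1 | c) from counts (Laplace-smoothed).
    prior = np.bincount(y, minlength=n_classes) / len(y)
    theta = np.array([(X[y == c].sum(0) + 1) / ((y == c).sum() + 2)
                      for c in range(n_classes)])

    def predict(x):
        # log P(c|x) up to a constant: log P(c) + sum_k log P(x_k|c)
        log_post = np.log(prior) + (x * np.log(theta)
                                    + (1 - x) * np.log(1 - theta)).sum(1)
        return np.argmax(log_post)

    print("predicted:", predict(X[0]), "true:", y[0])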


Speech recognition: Gaussian mixture model


Speech recognition: Hidden Markov model


Summary

Probabilities defined in terms of events
Random variables and their distributions
Reasoning with probabilities and Bayes' rule
Updating our knowledge over time
Graphical models to reason abstractly
A quick tour of how we would build a more complex model