
Bayesian Essentials and Bayesian Regression

Distribution Theory 101


Marginal and Conditional Distributions:

p_Y(y) = \int p_{X,Y}(x,y) dx = \int p_{Y|X}(y|x) p_X(x) dx

p_X(x) = \int p_{X,Y}(x,y) dy

Example: p_{X,Y}(x,y) = 2 on the triangle 0 < y < x < 1.

p_X(x) = \int_0^x 2 dy = 2x

p_{Y|X}(y|x) = p_{X,Y}(x,y) / p_X(x) = 2/(2x) = 1/x,   y \in (0,x)   -- uniform on (0,x)

Simulating from Joint

p_{X,Y}(x,y) = p_{Y|X}(y|x) p_X(x)

To draw from the joint:
i. draw from the marginal of X
ii. condition on this draw, and draw from the conditional of Y|X
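A minimal R sketch of this two-step draw, using the triangle example above (p_X(x) = 2x, and Y|X = x uniform on (0, x)); the object names are my own illustration, not from the slides:

set.seed(1)
R=10000
x=sqrt(runif(R))          # inverse-CDF draw from the marginal p_X(x) = 2x on (0,1)
y=runif(R,min=0,max=x)    # draw from the conditional Y | X = x ~ Uniform(0, x)
# the (x, y) pairs are now draws from the joint; e.g. mean(y) should be near E[Y] = 1/3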

The Goal of Inference

Make inferences about unknown quantities using available information.

Inference = make probability statements about unknowns:
  parameters, functions of parameters, states or latent variables, future outcomes, outcomes conditional on an action

Information:
  data-based
  non-data-based: theories of behavior; subjective views (there is an underlying structure; parameters are finite or lie in some range)

Data Aspects of Marketing Problems

Ex: Conjoint Survey
  500 respondents rank, rate, or choose among product configurations.
  Small amount of information per respondent.
  Discrete response variable.

Ex: Retail Scanning Data
  very large number of products
  large number of geographical units (markets, zones, stores)
  limited variation in some marketing mix variables

Must make plausible predictions for decision making!

The likelihood principle

p(D | \theta) \propto \ell(\theta)

Note: any function proportional to the data density can be called the likelihood.

LP: the likelihood contains all information relevant for inference. That is, as long as I have the same likelihood function, I should make the same inferences about the unknowns.

Implies analysis is done conditional on the data.

Bayes theorem

p(\theta | D) = p(\theta, D) / p(D) = p(D | \theta) p(\theta) / p(D)

p(\theta | D) \propto p(D | \theta) p(\theta)

Posterior \propto Likelihood \times Prior

Modern Bayesian computing: simulation methods for generating draws from the posterior distribution p(\theta | D).

Summarizing the posterior

Output from Bayesian inference: p(\theta | D), a high-dimensional distribution.

Summarize this object via simulation:
  marginal distributions of \theta and of functions h(\theta)
  don't just compute E[\theta | D], Var[\theta | D]
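A small R illustration (with placeholder draws standing in for real simulator output) of summarizing marginals rather than just a mean and a variance:

draws=matrix(rnorm(2000*2),ncol=2)            # placeholder: R x k matrix of posterior draws
h=draws[,1]/draws[,2]                         # a function h(theta) of interest
apply(draws,2,quantile,probs=c(.025,.5,.975)) # marginal posterior summaries
quantile(h,probs=c(.025,.5,.975))             # summary of the posterior of h(theta)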

Prediction

See D, then compute the predictive distribution:

p(\tilde D | D) = \int p(\tilde D | \theta) p(\theta | D) d\theta

(not the plug-in p(\tilde D | \hat\theta)!!!)

This assumes p(\tilde D, D | \theta) = p(\tilde D | \theta) p(D | \theta).

Decision theory

Loss: L(a, \theta), where a = action and \theta = state of nature.

Bayesian decision theory:

min_a  \bar L(a) = E_{\theta|D}[ L(a, \theta) ] = \int L(a, \theta) p(\theta | D) d\theta

The estimation problem is a special case with a = \hat\theta; typically quadratic loss,
L(\hat\theta, \theta) = (\hat\theta - \theta)' A (\hat\theta - \theta):

min_{\hat\theta}  \bar L(\hat\theta)

Sampling properties of Bayes estimators

Risk:  r_{\hat\theta}(\theta) = E_{D|\theta}[ L(\theta, \hat\theta) ] = \int L(\hat\theta(D), \theta) p(D | \theta) dD

An estimator is admissible if there exists no other estimator with lower risk for all values of \theta. The Bayes estimator minimizes expected (average) risk, which implies it is admissible:

E_\theta[ r_{\hat\theta}(\theta) ] = E_\theta[ E_{D|\theta}[ L(\theta, \hat\theta) ] ] = E_D[ E_{\theta|D}[ L(\theta, \hat\theta) ] ]

The Bayes estimator does the best for every D; therefore, it must do at least as well on average as any other estimator.

Bayes Inference: Summary

Bayesian inference delivers an integrated approach to:
  Inference, including estimation and testing
  Prediction, with a full accounting for uncertainty
  Decision, with likelihood and loss (these are distinct!)

Bayesian inference is conditional on available information.
The right answer to the right question.
Bayes estimators are admissible. All admissible estimators are Bayes (Complete Class Theorem). Which Bayes estimator?

Bayes/Classical Estimators

Asymptotically, the log-likelihood grows with the sample size, \ell_N(\theta) = \sum_n \ell_n(\theta), while the prior p(\theta) stays fixed. The prior "washes out" -- it is locally uniform!!! Bayes is consistent unless you have a dogmatic prior.

p(\theta | D) \approx N( \hat\theta_{MLE}, H^{-1} )

where H is the observed information (negative Hessian of the log-likelihood) evaluated at the MLE.

Benefits/Costs of Bayes Inf

Benefits:
  finite-sample answer to the right question
  full accounting for uncertainty
  integrated approach to inference/decision

Costs:
  computational (still true? often cheaper than classical methods!!)
  the prior (cost or benefit?), esp. with many parms (hierarchical/non-parametric problems)

Bayesian Computations

Before simulation methods, Bayesians used posterior expectations of various functions as the summary of the posterior:

E_{\theta|D}[ h(\theta) ] = \int h(\theta) p(D | \theta) p(\theta) / p(D) d\theta

If p(\theta | D) is in a convenient form (e.g. normal), then I might be able to compute this analytically for some h. Via iid simulation, it can be computed for any h.

note:  p(D) = \int p(D | \theta) p(\theta) d\theta

Conjugate Families

Models with convenient analytic properties almost invariably come from conjugate families.

Why do I care now?
  - conjugate models are used as building blocks
  - they build intuition about how Bayesian inference works

Definition: a prior is conjugate to a likelihood if the posterior is in the same class of distributions as the prior.

Basically, conjugate priors are like the posterior from some imaginary dataset analyzed with a diffuse prior.

Beta-Binomial model

y_i ~ Bernoulli(\theta)

\ell(\theta) = \prod_{i=1}^n \theta^{y_i} (1-\theta)^{1-y_i} = \theta^{y} (1-\theta)^{n-y},   where  y = \sum_{i=1}^n y_i

p(\theta | y) = ?   Need a prior!

Beta distribution

Beta(\alpha, \beta):   p(\theta) \propto \theta^{\alpha-1} (1-\theta)^{\beta-1},   E[\theta] = \alpha / (\alpha + \beta)

[Figure: Beta densities on (0, 1) for (a=2, b=4), (a=3, b=3), (a=4, b=2).]

Posterior

p(\theta | D) \propto p(D | \theta) p(\theta)
  \propto \theta^{y} (1-\theta)^{n-y} \theta^{\alpha-1} (1-\theta)^{\beta-1}
  = \theta^{\alpha+y-1} (1-\theta)^{\beta+n-y-1}

\theta | y ~ Beta(\alpha + y, \beta + n - y)

Prediction

Pr(\tilde y = 1 | y) = \int Pr(\tilde y = 1 | \theta, y) p(\theta | y) d\theta
  = \int_0^1 \theta p(\theta | y) d\theta
  = E[\theta | y]
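A minimal R sketch (with assumed prior hyperparameters and hypothetical data) of beta-binomial posterior simulation and the predictive probability:

a=3; b=3                              # assumed Beta(a,b) prior
n=20; y=7                             # hypothetical data: 7 successes in 20 trials
thetadraws=rbeta(10000,a+y,b+n-y)     # iid draws from the Beta(a+y, b+n-y) posterior
quantile(thetadraws,c(.025,.5,.975))  # posterior summary of theta
(a+y)/(a+b+n)                         # Pr(ytilde = 1 | y) = E[theta | y], available exactly here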

Regression model

y_i = x_i'\beta + \varepsilon_i,   \varepsilon_i ~ N(0, \sigma^2)

p(y_i | \beta, \sigma^2) = (2\pi\sigma^2)^{-1/2} exp{ -(y_i - x_i'\beta)^2 / (2\sigma^2) }

y | X, \beta, \sigma^2 ~ N(X\beta, \sigma^2 I)

Is this model complete? For non-experimental data, don't we need a model for the joint distribution of y and x?

Regression model

p(x, y) = p(x | \psi) p(y | x, \beta, \sigma^2)

(Note: simultaneous systems are not written this way! This factorization rules out x = f(\varepsilon).)

If \psi is a priori independent of (\beta, \sigma^2), then

p(\psi, \beta, \sigma^2 | y, X) \propto [ p(\psi) \prod_i p(x_i | \psi) ] \times [ p(\beta, \sigma^2) \prod_i p(y_i | x_i, \beta, \sigma^2) ]

-- two separate analyses.

Conjugate Prior

What is the conjugate prior? It comes from the form of the likelihood function. Here we condition on X.

\varepsilon ~ N(0, \sigma^2 I_n),   Jacobian: |d\varepsilon/dy| = 1

\ell(\beta, \sigma^2) = p(y | X, \beta, \sigma^2) \propto (\sigma^2)^{-n/2} exp{ -(y - X\beta)'(y - X\beta) / (2\sigma^2) }

The quadratic form in \beta suggests a normal prior.

Let's complete the square on \beta, or rewrite by projecting y on X (the column space of X).

Geometry of regression

[Figure: y is projected onto the column space of X = span(x_1, x_2); the fitted vector is \hat y = \hat\beta_1 x_1 + \hat\beta_2 x_2 and the residual e = y - \hat y is orthogonal to that space.]

"Least Squares":  min_\beta (e'e),  with  e'\hat y = 0

Traditional regression

e'\hat y = \hat y'e = 0
(X\hat\beta)'(y - X\hat\beta) = 0
\hat\beta = (X'X)^{-1} X'y

No one ever computes a matrix inverse directly. Two numerically stable methods:
  QR decomposition of X
  Cholesky root of X'X, computing the inverse from the root
Non-Bayesians have to worry about singularity or near-singularity of X'X. We don't! (more later)

Cholesky Roots

In Bayesian computations, the fundamental matrix operation is the Cholesky root: chol() in R.
The Cholesky root is the generalization of the square root, applied to positive definite matrices.
As Bayesians with proper priors, we don't ever have to worry about singular matrices!

\Sigma = U'U,   \Sigma positive definite symmetric,   u_{ii} > 0

U is upper triangular with positive diagonal elements. U^{-1} is easy to compute by recursively solving TU = I for T: backsolve() in R.
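A small R illustration of chol() and backsolve() (the matrix Sigma below is an arbitrary positive definite example of my own):

Sigma=matrix(c(2,.5,.5,1),nrow=2)  # an assumed p.d.s. matrix
U=chol(Sigma)                      # upper-triangular root: Sigma = U'U
Uinv=backsolve(U,diag(2))          # solves U %*% Uinv = I
Uinv%*%t(Uinv)                     # equals solve(Sigma), i.e. Sigma^-1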

Cholesky Roots

Cholesky roots can be used to simulate from the multivariate normal distribution:

\Sigma = U'U;   z ~ N(0, I)
y = U'z ~ N(0, \Sigma),   since  E[U'zz'U] = U'IU = \Sigma

To simulate a matrix of draws from the MVN (each row is a separate draw) in R:

Y=matrix(rnorm(n*k),ncol=k)%*%chol(Sigma)
Y=t(t(Y)+mu)
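A quick check of this recipe (n, k, mu, and Sigma below are illustrative values of my own):

set.seed(2)
Sigma=matrix(c(1,.8,.8,2),nrow=2); mu=c(1,-1)
n=100000; k=2
Y=matrix(rnorm(n*k),ncol=k)%*%chol(Sigma)
Y=t(t(Y)+mu)
colMeans(Y)   # close to mu
var(Y)        # close to Sigma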

Regression with R

data.txt:

UNIT  Y    X1        X2
A     1     0.23815   0.43730
A     2     0.55508   0.47938
A     3     3.03399  -2.17571
A     4    -1.49488   1.66929
B     10   -1.74019   0.35368
B     9     1.40533  -1.26120
B     8     0.15628  -0.27751
B     7    -0.93869  -0.04410
B     6    -3.06566   0.14486

df=read.table("data.txt",header=TRUE)

myreg=function(y,X){
#
# purpose: compute lsq regression
#
# arguments:
#  y -- vector of dep var
#  X -- array of indep vars
#
# output:
#  list containing lsq coef and std errors
#
XpXinv=chol2inv(chol(crossprod(X)))
bhat=XpXinv%*%crossprod(X,y)
res=as.vector(y-X%*%bhat)
ssq=as.numeric(res%*%res/(nrow(X)-ncol(X)))
se=sqrt(diag(ssq*XpXinv))
list(b=bhat,std_errors=se)
}
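Hypothetical usage of myreg() with the df read above (the intercept column is added by hand, since myreg() does not add one):

X=cbind(1,df$X1,df$X2)   # design matrix with an intercept
out=myreg(y=df$Y,X=X)
out$b                    # least-squares coefficients
out$std_errors           # classical standard errors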

Regression likelihood

(y - X\beta)'(y - X\beta) = (y - X\hat\beta)'(y - X\hat\beta) + (X\hat\beta - X\beta)'(X\hat\beta - X\beta) + 2(y - X\hat\beta)'(X\hat\beta - X\beta)

The cross term is zero (X'(y - X\hat\beta) = 0), so

(y - X\beta)'(y - X\beta) = \nu s^2 + (\beta - \hat\beta)' X'X (\beta - \hat\beta)

where  \nu s^2 = SSE = (y - X\hat\beta)'(y - X\hat\beta)  and  \nu = n - k.

p(y | X, \beta, \sigma^2) \propto (\sigma^2)^{-k/2} exp{ -(\beta - \hat\beta)' X'X (\beta - \hat\beta) / (2\sigma^2) }
  \times (\sigma^2)^{-\nu/2} exp{ -\nu s^2 / (2\sigma^2) }

Regression likelihood

p(y | X, \beta, \sigma^2) \propto normal kernel in \beta \times ?

? is a density of the form  p(\sigma^2) \propto (\sigma^2)^{-\nu/2} exp{ -\nu s^2 / (2\sigma^2) }

This is called an inverted gamma distribution. It can also be related to the inverse of a chi-squared distribution.

Note: the conjugate prior suggested by the form of the likelihood has a prior on \beta which depends on \sigma^2.

Bayesian Regression

Prior (interpretation: as if from another dataset):

p(\beta, \sigma^2) = p(\beta | \sigma^2) p(\sigma^2)

p(\beta | \sigma^2) \propto (\sigma^2)^{-k/2} exp{ -(\beta - \bar\beta)' A (\beta - \bar\beta) / (2\sigma^2) },
  i.e.  \beta | \sigma^2 ~ N(\bar\beta, \sigma^2 A^{-1})

p(\sigma^2) \propto (\sigma^2)^{-(\nu_0/2 + 1)} exp{ -\nu_0 s_0^2 / (2\sigma^2) }

Inverted chi-square:  \sigma^2 ~ \nu_0 s_0^2 / \chi^2_{\nu_0}

How would you draw from this prior?
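A minimal R sketch of drawing from this prior (the hyperparameter values are assumed for illustration):

k=2
betabar=rep(0,k); A=.01*diag(k)          # assumed prior mean and precision
nu0=3; s0sq=1                            # assumed prior df and scale
sigmasq=nu0*s0sq/rchisq(1,nu0)           # sigma^2 ~ nu0*s0sq/chisq_nu0
RA=chol(A)                               # A = RA'RA
beta=betabar+sqrt(sigmasq)*backsolve(RA,rnorm(k))  # beta | sigma^2 ~ N(betabar, sigmasq*A^-1)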

Posterior

p(\beta, \sigma^2 | D) \propto \ell(\beta, \sigma^2) p(\beta | \sigma^2) p(\sigma^2)

  \propto (\sigma^2)^{-n/2} exp{ -(y - X\beta)'(y - X\beta) / (2\sigma^2) }
  \times (\sigma^2)^{-k/2} exp{ -(\beta - \bar\beta)' A (\beta - \bar\beta) / (2\sigma^2) }
  \times (\sigma^2)^{-(\nu_0/2 + 1)} exp{ -\nu_0 s_0^2 / (2\sigma^2) }

Combining quadratic forms

(y - X\beta)'(y - X\beta) + (\beta - \bar\beta)' A (\beta - \bar\beta)
  = (y - X\beta)'(y - X\beta) + (\beta - \bar\beta)' U'U (\beta - \bar\beta)     (A = U'U)
  = (v - W\beta)'(v - W\beta)

with  v = [ y ; U\bar\beta ]  and  W = [ X ; U ]  (stacked).

(v - W\beta)'(v - W\beta) = n\tilde s^2 + (\beta - \tilde\beta)' W'W (\beta - \tilde\beta)

\tilde\beta = (W'W)^{-1} W'v = (X'X + A)^{-1} (X'X\hat\beta + A\bar\beta)

n\tilde s^2 = (v - W\tilde\beta)'(v - W\tilde\beta) = (y - X\tilde\beta)'(y - X\tilde\beta) + (\tilde\beta - \bar\beta)' A (\tilde\beta - \bar\beta)

Posterior

p(\beta, \sigma^2 | D) \propto (\sigma^2)^{-k/2} exp{ -(\beta - \tilde\beta)' (X'X + A) (\beta - \tilde\beta) / (2\sigma^2) }
  \times (\sigma^2)^{-((n + \nu_0)/2 + 1)} exp{ -(\nu_0 s_0^2 + n\tilde s^2) / (2\sigma^2) }

[\beta | \sigma^2] ~ N( \tilde\beta, \sigma^2 (X'X + A)^{-1} )

[\sigma^2] ~ \nu_1 s_1^2 / \chi^2_{\nu_1},   with  \nu_1 = \nu_0 + n  and  s_1^2 = (\nu_0 s_0^2 + n\tilde s^2) / (\nu_0 + n)

IID Simulations

Scheme:  [y | X, \beta, \sigma^2] [\beta | \sigma^2] [\sigma^2]

1) Draw [\sigma^2 | y, X]
2) Draw [\beta | \sigma^2, y, X]
3) Repeat

IID Simulator, cont.

1)  [\sigma^2 | y, X] ~ \nu_1 s_1^2 / \chi^2_{\nu_1}

2)  [\beta | \sigma^2, y, X] ~ N( \tilde\beta, \sigma^2 (X'X + A)^{-1} )

    \tilde\beta = (X'X + A)^{-1} (X'X\hat\beta + A\bar\beta)

note: if z ~ N(0, I) and X'X + A = U'U, then \tilde\beta + \sigma U^{-1} z ~ N( \tilde\beta, \sigma^2 (X'X + A)^{-1} )
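A compact R sketch of this iid sampler (my own wrapper; the hyperparameters are passed as arguments). bayesm's runireg(), shown on the following slides, is the production version:

iid_reg=function(y,X,betabar,A,nu0,s0sq,R=2000){
  n=nrow(X); k=ncol(X)
  XpXpA=crossprod(X)+A
  btilde=solve(XpXpA,crossprod(X,y)+A%*%betabar)
  nstilde=sum((y-X%*%btilde)^2)+drop(t(btilde-betabar)%*%A%*%(btilde-betabar))
  U=chol(XpXpA)
  betadraw=matrix(0,R,k); sigmasqdraw=double(R)
  for(r in 1:R){
    sigmasq=(nu0*s0sq+nstilde)/rchisq(1,nu0+n)        # [sigma^2 | y, X]
    beta=btilde+sqrt(sigmasq)*backsolve(U,rnorm(k))   # [beta | sigma^2, y, X]
    betadraw[r,]=beta; sigmasqdraw[r]=sigmasq
  }
  list(betadraw=betadraw,sigmasqdraw=sigmasqdraw)
}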

Bayes Estimator

The Bayes estimator is the posterior mean of \beta:

E[\beta | D] = E_{\sigma^2|D}[ E[\beta | D, \sigma^2] ] = E_{\sigma^2|D}[ \tilde\beta ] = \tilde\beta

The marginal posterior of \beta is a multivariate Student t. Who cares?

Shrinkage and Conjugate Priors

The Bayes estimator is the posterior mean of \beta. This is a shrinkage estimator:

\tilde\beta = (X'X + A)^{-1} (X'X\hat\beta + A\bar\beta)   shrinks \hat\beta toward \bar\beta

As n \to \infty, \tilde\beta \to \hat\beta  (why? X'X is of order n).

Var(\beta | \sigma^2, D) = \sigma^2 (X'X + A)^{-1}, smaller than either \sigma^2 A^{-1} or \sigma^2 (X'X)^{-1}

Is this reasonable?

Assessing Prior Hyperparameters

\bar\beta, A, \nu_0, s_0^2 determine the prior location and spread for both the coefficients and the error variance.

It has become customary to assess a diffuse prior:

  A "small", e.g.  A = .01 I_k,  so that  Var(\beta | \sigma^2 = 1) = 100 I_k
  \nu_0 "small", e.g. 3;  s_0^2 = 1

This can be problematic: s_0^2 = Var(y) might be a better choice than 1.
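In R, one such diffuse assessment looks like the following (these values mirror the slide and are defaults to question, not a recommendation; X and y are the data from the regression example above, and betabar is my own assumed prior mean):

k=ncol(X)
betabar=rep(0,k)     # assumed prior mean
A=.01*diag(k)        # small prior precision
nu0=3
s0sq=var(y)          # the slide's suggestion: scale to Var(y) rather than using 1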

Improper or non-informative priors

Classic non-informative prior (improper):

p(\beta, \sigma^2) = p(\beta) p(\sigma^2) \propto 1 \times 1/\sigma^2

Is this non-informative?
  Of course not: it says that \beta is large with high prior probability.
Is this wise computationally?
  No, I have to worry about singularity in X'X.
Is this a good procedure?
  No, it is not admissible. Shrinkage is good!

runireg

runireg=
function(Data,Prior,Mcmc){
#
# purpose:
#  draw from posterior for a univariate regression model with natural conjugate prior
#
# Arguments:
#  Data  -- list of data: y, X
#  Prior -- list of prior hyperparameters:
#           betabar, A  (prior mean, prior precision)
#           nu, ssq     (prior on sigmasq)
#  Mcmc  -- list of MCMC parms:
#           R    -- number of draws
#           keep -- thinning parameter
#
# output:
#  list of beta, sigmasq draws
#  beta is k x 1 vector of coefficients
#
# model:
#  Y = Xbeta + e,  var(e_i) = sigmasq
#  priors: beta | sigmasq ~ N(betabar, sigmasq*A^-1)
#          sigmasq ~ (nu*ssq)/chisq_nu

runireg

RA=chol(A)
W=rbind(X,RA)
z=c(y,as.vector(RA%*%betabar))
IR=backsolve(chol(crossprod(W)),diag(k))
# W'W = R'R ; (W'W)^-1 = IR IR'  -- this is the UL decomp
btilde=crossprod(t(IR))%*%crossprod(W,z)
res=z-W%*%btilde
s=t(res)%*%res
#
# first draw sigmasq
#
sigmasq=(nu*ssq + s)/rchisq(1,nu+n)
#
# now draw beta given sigmasq
#
beta = btilde + as.vector(sqrt(sigmasq))*IR%*%rnorm(k)
list(beta=beta,sigmasq=sigmasq)
}
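Hypothetical usage with the bayesm package's full runireg() (the slide shows only the core of the function); the simulated data below are my own, and the prior is left at its defaults:

library(bayesm)
set.seed(66)
n=200; X=cbind(1,runif(n)); beta=c(1,2); sigsq=.25
y=as.vector(X%*%beta+rnorm(n,sd=sqrt(sigsq)))
out=runireg(Data=list(y=y,X=X),Mcmc=list(R=2000))
colMeans(out$betadraw)     # posterior means of beta
mean(out$sigmasqdraw)      # posterior mean of sigmasq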

[Figure: trace plot of out$betadraw over 2,000 draws (values roughly between 1.0 and 2.5).]

[Figure: trace plot of out$sigmasqdraw over 2,000 draws (values roughly between 0.20 and 0.35).]

Multivariate Regression

Y = XB + E

y_1 = X\beta_1 + \varepsilon_1
  ...
y_c = X\beta_c + \varepsilon_c
  ...
y_m = X\beta_m + \varepsilon_m

B = [\beta_1, ..., \beta_c, ..., \beta_m],   Y = [y_1, ..., y_c, ..., y_m],   E = [\varepsilon_1, ..., \varepsilon_c, ..., \varepsilon_m]

Each row of E ~ iid N(0, \Sigma)

Multivariate regression likelihood

p(Y | X, B, \Sigma) \propto |\Sigma|^{-n/2} exp{ -(1/2) \sum_{r=1}^n (y_r - B'x_r)' \Sigma^{-1} (y_r - B'x_r) }

  = |\Sigma|^{-n/2} etr{ -(1/2) (Y - XB)'(Y - XB) \Sigma^{-1} }

  \propto |\Sigma|^{-(n-k)/2} etr{ -(1/2) S \Sigma^{-1} } \times |\Sigma|^{-k/2} etr{ -(1/2) (B - \hat B)' X'X (B - \hat B) \Sigma^{-1} }

where  S = (Y - X\hat B)'(Y - X\hat B)  and  etr(.) = exp(tr(.)).

Multivariate regression likelihood

But  tr(A'B) = vec(A)' vec(B),  so

tr[ (B - \hat B)' X'X (B - \hat B) \Sigma^{-1} ] = vec(B - \hat B)' vec( X'X (B - \hat B) \Sigma^{-1} )

and  vec(ABC) = (C' \otimes A) vec(B),  so this equals

vec(B - \hat B)' ( \Sigma^{-1} \otimes X'X ) vec(B - \hat B)

Therefore,

p(Y | X, B, \Sigma) \propto |\Sigma|^{-(n-k)/2} etr{ -(1/2) S \Sigma^{-1} }
  \times |\Sigma|^{-k/2} exp{ -(1/2) (\beta - \hat\beta)' ( \Sigma^{-1} \otimes X'X ) (\beta - \hat\beta) }

with  \beta = vec(B),  \hat\beta = vec(\hat B).

Inverted Wishart distribution

The form of the likelihood suggests that the natural conjugate (convenient) prior for \Sigma is of the inverted Wishart form:

p(\Sigma | \nu_0, V_0) \propto |\Sigma|^{-(\nu_0 + m + 1)/2} etr{ -(1/2) V_0 \Sigma^{-1} }

denoted  \Sigma ~ IW(\nu_0, V_0).   If \nu_0 > m + 1,  E[\Sigma] = (\nu_0 - m - 1)^{-1} V_0.

\nu -- tightness;  V -- location. However, location and spread cannot be set independently: as V increases, the spread also increases.

Limitations: (i) thick tails require small \nu; (ii) there is only one tightness parameter.

Wishart distribution (rwishart)

If \Sigma ~ IW(\nu_0, V_0), then \Sigma^{-1} ~ W(\nu_0, V_0^{-1}), with E[\Sigma^{-1}] = \nu_0 V_0^{-1}.

Generalization of the \chi^2: let \varepsilon_i ~ N_m(0, \Sigma). Then

W = \sum_{i=1}^{\nu} \varepsilon_i \varepsilon_i'  ~  W(\nu, \Sigma)

The diagonal elements are (scaled) \chi^2_\nu random variables.
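A minimal R sketch (my own) of the definitional Wishart draw as a sum of nu outer products; bayesm's rwishart() is the production version used on the following slides:

rwishart_naive=function(nu,Sigma){
  m=nrow(Sigma)
  E=matrix(rnorm(nu*m),ncol=m)%*%chol(Sigma)  # nu draws from N_m(0, Sigma), one per row
  crossprod(E)                                # sum_i eps_i eps_i' ~ W(nu, Sigma)
}
W=rwishart_naive(nu=20,Sigma=diag(2))
W[1,1]   # distributed chi-squared with 20 df (since sigma_11 = 1)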

Multivariate regression prior and posterior

Prior:

p(\Sigma, B) = p(\Sigma) p(B | \Sigma)
\Sigma ~ IW(\nu_0, V_0)
\beta | \Sigma ~ N( \bar\beta, \Sigma \otimes A^{-1} ),   \beta = vec(B),  \bar\beta = vec(\bar B)

Posterior:

\Sigma | Y, X ~ IW( \nu_0 + n, V_0 + \tilde S )
\beta | Y, X, \Sigma ~ N( \tilde\beta, \Sigma \otimes (X'X + A)^{-1} )

\tilde\beta = vec(\tilde B),   \tilde B = (X'X + A)^{-1} (X'X\hat B + A\bar B)
\tilde S = (Y - X\tilde B)'(Y - X\tilde B) + (\tilde B - \bar B)' A (\tilde B - \bar B)

Drawing from Posterior: rmultireg

rmultireg=
function(Y,X,Bbar,A,nu,V) {
# note: Y,X,A,Bbar must be matrices!
n=nrow(Y); m=ncol(Y); k=ncol(X)
RA=chol(A)
W=rbind(X,RA)
Z=rbind(Y,RA%*%Bbar)
IR=backsolve(chol(crossprod(W)),diag(k))
# W'W = R'R and (W'W)^-1 = IR IR'  -- this is the UL decomp!
Btilde=crossprod(t(IR))%*%crossprod(W,Z)
# IR IR' (W'Z) = (X'X+A)^-1 (X'Y + A Bbar)
S=crossprod(Z-W%*%Btilde)
rwout=rwishart(nu+n,chol2inv(chol(V+S)))
#
# now draw B given Sigma; note beta ~ N(vec(Btilde), Sigma (x) Cov)
#  Cov = (X'X + A)^-1 = IR IR'  and  Sigma = CI CI'
#  therefore cov(beta) = Omega = CI CI' (x) IR IR' = (CI (x) IR)(CI (x) IR)'
#  so to draw beta: beta = vec(Btilde) + (CI (x) IR) vec(Z_km),
#  where Z_km is a k x m matrix of N(0,1) draws.
#  Since vec(ABC) = (C' (x) A) vec(B), this is B = Btilde + IR Z_km CI'
#
B = Btilde + IR%*%matrix(rnorm(m*k),ncol=m)%*%t(rwout$CI)
list(B=B,Sigma=rwout$IW)   # return the draws (as in bayesm's rmultireg)
}
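Hypothetical usage of rmultireg() (one posterior draw per call) with simulated data and a fairly diffuse prior of my own choosing; bayesm supplies rwishart() and its own rmultireg():

library(bayesm)
set.seed(3)
n=300; k=2; m=2
X=cbind(1,runif(n))
B=matrix(c(1,2,-1,.5),ncol=m)
Sigma=matrix(c(1,.5,.5,1),ncol=m)
Y=X%*%B+matrix(rnorm(n*m),ncol=m)%*%chol(Sigma)
out=rmultireg(Y,X,Bbar=matrix(0,k,m),A=.01*diag(k),nu=m+3,V=(m+3)*diag(m))
out$B       # one draw of the k x m coefficient matrix
out$Sigma   # one draw of the m x m error covariance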

Conjugacy is Fragile!

SUR -- a set of regressions related via correlated errors:

y_i = X_i \beta_i + \varepsilon_i,   i = 1, ..., m

Given \Sigma, a normal prior on \beta would be conjugate.
Given \beta, an inverted Wishart prior on \Sigma would be conjugate.
BUT there is no joint conjugate prior!!
