
Bayesian Essentials and Bayesian Regression

Distribution Theory 101


Marginal and Conditional Distributions:

p_Y(y) = \int p_{X,Y}(x,y) dx = \int p_{Y|X}(y|x) p_X(x) dx

p_X(x) = \int p_{X,Y}(x,y) dy

Example: p_{X,Y}(x,y) = 2 on the triangle 0 < y < x < 1.

p_X(x) = \int_0^x 2 dy = 2x

p_{Y|X}(y|x) = p_{X,Y}(x,y) / p_X(x) = 2/(2x) = 1/x,   y \in (0,x)   -- uniform on (0,x)

Simulating from Joint

p_{X,Y}(x,y) = p_{Y|X}(y|x) p_X(x)

To draw from the joint:
i. draw from the marginal of X
ii. condition on this draw, and draw from the conditional of Y|X
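A minimal R sketch of this two-step draw, using the triangle example above (p_X(x) = 2x, and Y|X = x uniform on (0, x)); the object names are my own illustration, not from the slides:

set.seed(1)
R=10000
x=sqrt(runif(R))          # inverse-CDF draw from the marginal p_X(x) = 2x on (0,1)
y=runif(R,min=0,max=x)    # draw from the conditional Y | X = x ~ Uniform(0, x)
# the (x, y) pairs are now draws from the joint; e.g. mean(y) should be near E[Y] = 1/3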

The Goal of Inference

Make inferences about unknown quantities using available information.

Inference = make probability statements about unknowns:
  parameters, functions of parameters, states or latent variables, future outcomes, outcomes conditional on an action

Information:
  data-based
  non-data-based: theories of behavior; subjective views (there is an underlying structure; parameters are finite or lie in some range)

Data Aspects of Marketing Problems

Ex: Conjoint Survey
  500 respondents rank, rate, or choose among product configurations.
  Small amount of information per respondent.
  Discrete response variable.

Ex: Retail Scanning Data
  very large number of products
  large number of geographical units (markets, zones, stores)
  limited variation in some marketing mix variables

Must make plausible predictions for decision making!

The likelihood principle

p(D | \theta) \propto \ell(\theta)

Note: any function proportional to the data density can be called the likelihood.

LP: the likelihood contains all information relevant for inference. That is, as long as I have the same likelihood function, I should make the same inferences about the unknowns.

Implies analysis is done conditional on the data.

Bayes theorem

p(\theta | D) = p(\theta, D) / p(D) = p(D | \theta) p(\theta) / p(D)

p(\theta | D) \propto p(D | \theta) p(\theta)

Posterior \propto Likelihood \times Prior

Modern Bayesian computing: simulation methods for generating draws from the posterior distribution p(\theta | D).

Summarizing the posterior

Output from Bayesian inference: p(\theta | D), a high-dimensional distribution.

Summarize this object via simulation:
  marginal distributions of \theta and of functions h(\theta)
  don't just compute E[\theta | D], Var[\theta | D]
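A small R illustration (with placeholder draws standing in for real simulator output) of summarizing marginals rather than just a mean and a variance:

draws=matrix(rnorm(2000*2),ncol=2)            # placeholder: R x k matrix of posterior draws
h=draws[,1]/draws[,2]                         # a function h(theta) of interest
apply(draws,2,quantile,probs=c(.025,.5,.975)) # marginal posterior summaries
quantile(h,probs=c(.025,.5,.975))             # summary of the posterior of h(theta)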

Prediction

See D, then compute the predictive distribution:

p(\tilde D | D) = \int p(\tilde D | \theta) p(\theta | D) d\theta

(not the plug-in p(\tilde D | \hat\theta)!!!)

This assumes p(\tilde D, D | \theta) = p(\tilde D | \theta) p(D | \theta).

Decision theory

Loss: L(a, \theta), where a = action and \theta = state of nature.

Bayesian decision theory:

min_a  \bar L(a) = E_{\theta|D}[ L(a, \theta) ] = \int L(a, \theta) p(\theta | D) d\theta

The estimation problem is a special case with a = \hat\theta; typically quadratic loss,
L(\hat\theta, \theta) = (\hat\theta - \theta)' A (\hat\theta - \theta):

min_{\hat\theta}  \bar L(\hat\theta)

Sampling properties of Bayes estimators

Risk:  r_{\hat\theta}(\theta) = E_{D|\theta}[ L(\theta, \hat\theta) ] = \int L(\hat\theta(D), \theta) p(D | \theta) dD

An estimator is admissible if there exists no other estimator with lower risk for all values of \theta. The Bayes estimator minimizes expected (average) risk, which implies it is admissible:

E_\theta[ r_{\hat\theta}(\theta) ] = E_\theta[ E_{D|\theta}[ L(\theta, \hat\theta) ] ] = E_D[ E_{\theta|D}[ L(\theta, \hat\theta) ] ]

The Bayes estimator does the best for every D; therefore, it must do at least as well on average as any other estimator.

Bayes Inference: Summary

Bayesian inference delivers an integrated approach to:
  Inference, including estimation and testing
  Prediction, with a full accounting for uncertainty
  Decision, with likelihood and loss (these are distinct!)

Bayesian inference is conditional on available information.
The right answer to the right question.
Bayes estimators are admissible. All admissible estimators are Bayes (Complete Class Theorem). Which Bayes estimator?

Bayes/Classical Estimators

Asymptotically, the log-likelihood grows with the sample size, \ell_N(\theta) = \sum_n \ell_n(\theta), while the prior p(\theta) stays fixed. The prior "washes out" -- it is locally uniform!!! Bayes is consistent unless you have a dogmatic prior.

p(\theta | D) \approx N( \hat\theta_{MLE}, H^{-1} )

where H is the observed information (negative Hessian of the log-likelihood) evaluated at the MLE.

Benefits/Costs of Bayes Inf

Benefits:
  finite-sample answer to the right question
  full accounting for uncertainty
  integrated approach to inference/decision

Costs:
  computational (still true? often cheaper than classical methods!!)
  the prior (cost or benefit?), esp. with many parms (hierarchical/non-parametric problems)

Bayesian Computations

Before simulation methods, Bayesians used posterior expectations of various functions as the summary of the posterior:

E_{\theta|D}[ h(\theta) ] = \int h(\theta) p(D | \theta) p(\theta) / p(D) d\theta

If p(\theta | D) is in a convenient form (e.g. normal), then I might be able to compute this analytically for some h. Via iid simulation, it can be computed for any h.

note:  p(D) = \int p(D | \theta) p(\theta) d\theta

Conjugate Families

Models with convenient analytic properties almost invariably come from conjugate families.

Why do I care now?
  - conjugate models are used as building blocks
  - they build intuition about how Bayesian inference works

Definition: a prior is conjugate to a likelihood if the posterior is in the same class of distributions as the prior.

Basically, conjugate priors are like the posterior from some imaginary dataset analyzed with a diffuse prior.

Beta-Binomial model

y_i ~ Bernoulli(\theta)

\ell(\theta) = \prod_{i=1}^n \theta^{y_i} (1-\theta)^{1-y_i} = \theta^{y} (1-\theta)^{n-y},   where  y = \sum_{i=1}^n y_i

p(\theta | y) = ?   Need a prior!

Beta distribution

Beta(\alpha, \beta):   p(\theta) \propto \theta^{\alpha-1} (1-\theta)^{\beta-1},   E[\theta] = \alpha / (\alpha + \beta)

[Figure: Beta densities on (0, 1) for (a=2, b=4), (a=3, b=3), (a=4, b=2).]

Posterior

p(\theta | D) \propto p(D | \theta) p(\theta)
  \propto \theta^{y} (1-\theta)^{n-y} \theta^{\alpha-1} (1-\theta)^{\beta-1}
  = \theta^{\alpha+y-1} (1-\theta)^{\beta+n-y-1}

\theta | y ~ Beta(\alpha + y, \beta + n - y)

Prediction

Pr(\tilde y = 1 | y) = \int Pr(\tilde y = 1 | \theta, y) p(\theta | y) d\theta
  = \int_0^1 \theta p(\theta | y) d\theta
  = E[\theta | y]
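A minimal R sketch (with assumed prior hyperparameters and hypothetical data) of beta-binomial posterior simulation and the predictive probability:

a=3; b=3                              # assumed Beta(a,b) prior
n=20; y=7                             # hypothetical data: 7 successes in 20 trials
thetadraws=rbeta(10000,a+y,b+n-y)     # iid draws from the Beta(a+y, b+n-y) posterior
quantile(thetadraws,c(.025,.5,.975))  # posterior summary of theta
(a+y)/(a+b+n)                         # Pr(ytilde = 1 | y) = E[theta | y], available exactly here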

Regression model

y_i = x_i'\beta + \varepsilon_i,   \varepsilon_i ~ N(0, \sigma^2)

p(y_i | \beta, \sigma^2) = (2\pi\sigma^2)^{-1/2} exp{ -(y_i - x_i'\beta)^2 / (2\sigma^2) }

y | X, \beta, \sigma^2 ~ N(X\beta, \sigma^2 I)

Is this model complete? For non-experimental data, don't we need a model for the joint distribution of y and x?

Regression model

p(x, y) = p(x | \psi) p(y | x, \beta, \sigma^2)

(Note: simultaneous systems are not written this way! This factorization rules out x = f(\varepsilon).)

If \psi is a priori independent of (\beta, \sigma^2), then

p(\psi, \beta, \sigma^2 | y, X) \propto [ p(\psi) \prod_i p(x_i | \psi) ] \times [ p(\beta, \sigma^2) \prod_i p(y_i | x_i, \beta, \sigma^2) ]

-- two separate analyses.

Conjugate Prior

What is the conjugate prior? It comes from the form of the likelihood function. Here we condition on X.

\varepsilon ~ N(0, \sigma^2 I_n),   Jacobian: |d\varepsilon/dy| = 1

\ell(\beta, \sigma^2) = p(y | X, \beta, \sigma^2) \propto (\sigma^2)^{-n/2} exp{ -(y - X\beta)'(y - X\beta) / (2\sigma^2) }

The quadratic form in \beta suggests a normal prior.

Let's complete the square on \beta, or rewrite by projecting y on X (the column space of X).

Geometry of regression

[Figure: y is projected onto the column space of X = span(x_1, x_2); the fitted vector is \hat y = \hat\beta_1 x_1 + \hat\beta_2 x_2 and the residual e = y - \hat y is orthogonal to that space.]

"Least Squares":  min_\beta (e'e),  with  e'\hat y = 0

Traditional regression

e'\hat y = \hat y'e = 0
(X\hat\beta)'(y - X\hat\beta) = 0
\hat\beta = (X'X)^{-1} X'y

No one ever computes a matrix inverse directly. Two numerically stable methods:
  QR decomposition of X
  Cholesky root of X'X, computing the inverse from the root
Non-Bayesians have to worry about singularity or near-singularity of X'X. We don't! (more later)

Cholesky Roots

In Bayesian computations, the fundamental matrix operation is the Cholesky root: chol() in R.
The Cholesky root is the generalization of the square root, applied to positive definite matrices.
As Bayesians with proper priors, we don't ever have to worry about singular matrices!

\Sigma = U'U,   \Sigma positive definite symmetric,   u_{ii} > 0

U is upper triangular with positive diagonal elements. U^{-1} is easy to compute by recursively solving TU = I for T: backsolve() in R.
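A small R illustration of chol() and backsolve() (the matrix Sigma below is an arbitrary positive definite example of my own):

Sigma=matrix(c(2,.5,.5,1),nrow=2)  # an assumed p.d.s. matrix
U=chol(Sigma)                      # upper-triangular root: Sigma = U'U
Uinv=backsolve(U,diag(2))          # solves U %*% Uinv = I
Uinv%*%t(Uinv)                     # equals solve(Sigma), i.e. Sigma^-1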

Cholesky Roots

Cholesky roots can be used to simulate from the multivariate normal distribution:

\Sigma = U'U;   z ~ N(0, I)
y = U'z ~ N(0, \Sigma),   since  E[U'zz'U] = U'IU = \Sigma

To simulate a matrix of draws from the MVN (each row is a separate draw) in R:

Y=matrix(rnorm(n*k),ncol=k)%*%chol(Sigma)
Y=t(t(Y)+mu)
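A quick check of this recipe (n, k, mu, and Sigma below are illustrative values of my own):

set.seed(2)
Sigma=matrix(c(1,.8,.8,2),nrow=2); mu=c(1,-1)
n=100000; k=2
Y=matrix(rnorm(n*k),ncol=k)%*%chol(Sigma)
Y=t(t(Y)+mu)
colMeans(Y)   # close to mu
var(Y)        # close to Sigma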

Regression with R

data.txt:

UNIT  Y    X1        X2
A     1     0.23815   0.43730
A     2     0.55508   0.47938
A     3     3.03399  -2.17571
A     4    -1.49488   1.66929
B     10   -1.74019   0.35368
B     9     1.40533  -1.26120
B     8     0.15628  -0.27751
B     7    -0.93869  -0.04410
B     6    -3.06566   0.14486

df=read.table("data.txt",header=TRUE)

myreg=function(y,X){
#
# purpose: compute lsq regression
#
# arguments:
#  y -- vector of dep var
#  X -- array of indep vars
#
# output:
#  list containing lsq coef and std errors
#
XpXinv=chol2inv(chol(crossprod(X)))
bhat=XpXinv%*%crossprod(X,y)
res=as.vector(y-X%*%bhat)
ssq=as.numeric(res%*%res/(nrow(X)-ncol(X)))
se=sqrt(diag(ssq*XpXinv))
list(b=bhat,std_errors=se)
}
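Hypothetical usage of myreg() with the df read above (the intercept column is added by hand, since myreg() does not add one):

X=cbind(1,df$X1,df$X2)   # design matrix with an intercept
out=myreg(y=df$Y,X=X)
out$b                    # least-squares coefficients
out$std_errors           # classical standard errors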

Regression likelihood

(y - X\beta)'(y - X\beta) = (y - X\hat\beta)'(y - X\hat\beta) + (X\hat\beta - X\beta)'(X\hat\beta - X\beta) + 2(y - X\hat\beta)'(X\hat\beta - X\beta)

The cross term is zero (X'(y - X\hat\beta) = 0), so

(y - X\beta)'(y - X\beta) = \nu s^2 + (\beta - \hat\beta)' X'X (\beta - \hat\beta)

where  \nu s^2 = SSE = (y - X\hat\beta)'(y - X\hat\beta)  and  \nu = n - k.

p(y | X, \beta, \sigma^2) \propto (\sigma^2)^{-k/2} exp{ -(\beta - \hat\beta)' X'X (\beta - \hat\beta) / (2\sigma^2) }
  \times (\sigma^2)^{-\nu/2} exp{ -\nu s^2 / (2\sigma^2) }

Regression likelihood

p(y | X, \beta, \sigma^2) \propto normal kernel in \beta \times ?

? is a density of the form  p(\sigma^2) \propto (\sigma^2)^{-\nu/2} exp{ -\nu s^2 / (2\sigma^2) }

This is called an inverted gamma distribution. It can also be related to the inverse of a chi-squared distribution.

Note: the conjugate prior suggested by the form of the likelihood has a prior on \beta which depends on \sigma^2.

Bayesian Regression

Prior (interpretation: as if from another dataset):

p(\beta, \sigma^2) = p(\beta | \sigma^2) p(\sigma^2)

p(\beta | \sigma^2) \propto (\sigma^2)^{-k/2} exp{ -(\beta - \bar\beta)' A (\beta - \bar\beta) / (2\sigma^2) },
  i.e.  \beta | \sigma^2 ~ N(\bar\beta, \sigma^2 A^{-1})

p(\sigma^2) \propto (\sigma^2)^{-(\nu_0/2 + 1)} exp{ -\nu_0 s_0^2 / (2\sigma^2) }

Inverted chi-square:  \sigma^2 ~ \nu_0 s_0^2 / \chi^2_{\nu_0}

How would you draw from this prior?
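A minimal R sketch of drawing from this prior (the hyperparameter values are assumed for illustration):

k=2
betabar=rep(0,k); A=.01*diag(k)          # assumed prior mean and precision
nu0=3; s0sq=1                            # assumed prior df and scale
sigmasq=nu0*s0sq/rchisq(1,nu0)           # sigma^2 ~ nu0*s0sq/chisq_nu0
RA=chol(A)                               # A = RA'RA
beta=betabar+sqrt(sigmasq)*backsolve(RA,rnorm(k))  # beta | sigma^2 ~ N(betabar, sigmasq*A^-1)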

Posterior

p(\beta, \sigma^2 | D) \propto \ell(\beta, \sigma^2) p(\beta | \sigma^2) p(\sigma^2)

  \propto (\sigma^2)^{-n/2} exp{ -(y - X\beta)'(y - X\beta) / (2\sigma^2) }
  \times (\sigma^2)^{-k/2} exp{ -(\beta - \bar\beta)' A (\beta - \bar\beta) / (2\sigma^2) }
  \times (\sigma^2)^{-(\nu_0/2 + 1)} exp{ -\nu_0 s_0^2 / (2\sigma^2) }

Combining quadratic forms

(y - X\beta)'(y - X\beta) + (\beta - \bar\beta)' A (\beta - \bar\beta)
  = (y - X\beta)'(y - X\beta) + (\beta - \bar\beta)' U'U (\beta - \bar\beta)     (A = U'U)
  = (v - W\beta)'(v - W\beta)

with  v = [ y ; U\bar\beta ]  and  W = [ X ; U ]  (stacked).

(v - W\beta)'(v - W\beta) = n\tilde s^2 + (\beta - \tilde\beta)' W'W (\beta - \tilde\beta)

\tilde\beta = (W'W)^{-1} W'v = (X'X + A)^{-1} (X'X\hat\beta + A\bar\beta)

n\tilde s^2 = (v - W\tilde\beta)'(v - W\tilde\beta) = (y - X\tilde\beta)'(y - X\tilde\beta) + (\tilde\beta - \bar\beta)' A (\tilde\beta - \bar\beta)

Posterior

p(\beta, \sigma^2 | D) \propto (\sigma^2)^{-k/2} exp{ -(\beta - \tilde\beta)' (X'X + A) (\beta - \tilde\beta) / (2\sigma^2) }
  \times (\sigma^2)^{-((n + \nu_0)/2 + 1)} exp{ -(\nu_0 s_0^2 + n\tilde s^2) / (2\sigma^2) }

[\beta | \sigma^2] ~ N( \tilde\beta, \sigma^2 (X'X + A)^{-1} )

[\sigma^2] ~ \nu_1 s_1^2 / \chi^2_{\nu_1},   with  \nu_1 = \nu_0 + n  and  s_1^2 = (\nu_0 s_0^2 + n\tilde s^2) / (\nu_0 + n)

IID Simulations

Scheme:  [y | X, \beta, \sigma^2] [\beta | \sigma^2] [\sigma^2]

1) Draw [\sigma^2 | y, X]
2) Draw [\beta | \sigma^2, y, X]
3) Repeat

IID Simulator, cont.

1)  [\sigma^2 | y, X] ~ \nu_1 s_1^2 / \chi^2_{\nu_1}

2)  [\beta | \sigma^2, y, X] ~ N( \tilde\beta, \sigma^2 (X'X + A)^{-1} )

    \tilde\beta = (X'X + A)^{-1} (X'X\hat\beta + A\bar\beta)

note: if z ~ N(0, I) and X'X + A = U'U, then \tilde\beta + \sigma U^{-1} z ~ N( \tilde\beta, \sigma^2 (X'X + A)^{-1} )
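A compact R sketch of this iid sampler (my own wrapper; the hyperparameters are passed as arguments). bayesm's runireg(), shown on the following slides, is the production version:

iid_reg=function(y,X,betabar,A,nu0,s0sq,R=2000){
  n=nrow(X); k=ncol(X)
  XpXpA=crossprod(X)+A
  btilde=solve(XpXpA,crossprod(X,y)+A%*%betabar)
  nstilde=sum((y-X%*%btilde)^2)+drop(t(btilde-betabar)%*%A%*%(btilde-betabar))
  U=chol(XpXpA)
  betadraw=matrix(0,R,k); sigmasqdraw=double(R)
  for(r in 1:R){
    sigmasq=(nu0*s0sq+nstilde)/rchisq(1,nu0+n)        # [sigma^2 | y, X]
    beta=btilde+sqrt(sigmasq)*backsolve(U,rnorm(k))   # [beta | sigma^2, y, X]
    betadraw[r,]=beta; sigmasqdraw[r]=sigmasq
  }
  list(betadraw=betadraw,sigmasqdraw=sigmasqdraw)
}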

Bayes Estimator

The Bayes estimator is the posterior mean of \beta:

E[\beta | D] = E_{\sigma^2|D}[ E[\beta | D, \sigma^2] ] = E_{\sigma^2|D}[ \tilde\beta ] = \tilde\beta

The marginal posterior of \beta is a multivariate Student t. Who cares?

Shrinkage and Conjugate Priors

The Bayes estimator is the posterior mean of \beta. This is a shrinkage estimator:

\tilde\beta = (X'X + A)^{-1} (X'X\hat\beta + A\bar\beta)   shrinks \hat\beta toward \bar\beta

As n \to \infty, \tilde\beta \to \hat\beta  (why? X'X is of order n).

Var(\beta | \sigma^2, D) = \sigma^2 (X'X + A)^{-1}, smaller than either \sigma^2 A^{-1} or \sigma^2 (X'X)^{-1}

Is this reasonable?

Assessing Prior Hyperparameters

\bar\beta, A, \nu_0, s_0^2 determine the prior location and spread for both the coefficients and the error variance.

It has become customary to assess a diffuse prior:

  A "small", e.g.  A = .01 I_k,  so that  Var(\beta | \sigma^2 = 1) = 100 I_k
  \nu_0 "small", e.g. 3;  s_0^2 = 1

This can be problematic: s_0^2 = Var(y) might be a better choice than 1.
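In R, one such diffuse assessment looks like the following (these values mirror the slide and are defaults to question, not a recommendation; X and y are the data from the regression example above, and betabar is my own assumed prior mean):

k=ncol(X)
betabar=rep(0,k)     # assumed prior mean
A=.01*diag(k)        # small prior precision
nu0=3
s0sq=var(y)          # the slide's suggestion: scale to Var(y) rather than using 1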

Improper or non-informative priors

Classic non-informative prior (improper):

p(\beta, \sigma^2) = p(\beta) p(\sigma^2) \propto 1 \times 1/\sigma^2

Is this non-informative?
  Of course not: it says that \beta is large with high prior probability.
Is this wise computationally?
  No, I have to worry about singularity in X'X.
Is this a good procedure?
  No, it is not admissible. Shrinkage is good!

runireg

runireg=
function(Data,Prior,Mcmc){
#
# purpose:
#  draw from posterior for a univariate regression model with natural conjugate prior
#
# Arguments:
#  Data  -- list of data: y, X
#  Prior -- list of prior hyperparameters:
#           betabar, A  (prior mean, prior precision)
#           nu, ssq     (prior on sigmasq)
#  Mcmc  -- list of MCMC parms:
#           R    -- number of draws
#           keep -- thinning parameter
#
# output:
#  list of beta, sigmasq draws
#  beta is k x 1 vector of coefficients
#
# model:
#  Y = Xbeta + e,  var(e_i) = sigmasq
#  priors: beta | sigmasq ~ N(betabar, sigmasq*A^-1)
#          sigmasq ~ (nu*ssq)/chisq_nu

runireg

RA=chol(A)
W=rbind(X,RA)
z=c(y,as.vector(RA%*%betabar))
IR=backsolve(chol(crossprod(W)),diag(k))
# W'W = R'R ; (W'W)^-1 = IR IR'  -- this is the UL decomp
btilde=crossprod(t(IR))%*%crossprod(W,z)
res=z-W%*%btilde
s=t(res)%*%res
#
# first draw sigmasq
#
sigmasq=(nu*ssq + s)/rchisq(1,nu+n)
#
# now draw beta given sigmasq
#
beta = btilde + as.vector(sqrt(sigmasq))*IR%*%rnorm(k)
list(beta=beta,sigmasq=sigmasq)
}
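Hypothetical usage with the bayesm package's full runireg() (the slide shows only the core of the function); the simulated data below are my own, and the prior is left at its defaults:

library(bayesm)
set.seed(66)
n=200; X=cbind(1,runif(n)); beta=c(1,2); sigsq=.25
y=as.vector(X%*%beta+rnorm(n,sd=sqrt(sigsq)))
out=runireg(Data=list(y=y,X=X),Mcmc=list(R=2000))
colMeans(out$betadraw)     # posterior means of beta
mean(out$sigmasqdraw)      # posterior mean of sigmasq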

[Figure: trace plot of out$betadraw over 2,000 draws (values roughly between 1.0 and 2.5).]

[Figure: trace plot of out$sigmasqdraw over 2,000 draws (values roughly between 0.20 and 0.35).]

Multivariate Regression

Y = XB + E

y_1 = X\beta_1 + \varepsilon_1
  ...
y_c = X\beta_c + \varepsilon_c
  ...
y_m = X\beta_m + \varepsilon_m

B = [\beta_1, ..., \beta_c, ..., \beta_m],   Y = [y_1, ..., y_c, ..., y_m],   E = [\varepsilon_1, ..., \varepsilon_c, ..., \varepsilon_m]

Each row of E ~ iid N(0, \Sigma)

Multivariate regression likelihood

p(Y | X, B, \Sigma) \propto |\Sigma|^{-n/2} exp{ -(1/2) \sum_{r=1}^n (y_r - B'x_r)' \Sigma^{-1} (y_r - B'x_r) }

  = |\Sigma|^{-n/2} etr{ -(1/2) (Y - XB)'(Y - XB) \Sigma^{-1} }

  \propto |\Sigma|^{-(n-k)/2} etr{ -(1/2) S \Sigma^{-1} } \times |\Sigma|^{-k/2} etr{ -(1/2) (B - \hat B)' X'X (B - \hat B) \Sigma^{-1} }

where  S = (Y - X\hat B)'(Y - X\hat B)  and  etr(.) = exp(tr(.)).

Multivariate regression likelihood

But  tr(A'B) = vec(A)' vec(B),  so

tr[ (B - \hat B)' X'X (B - \hat B) \Sigma^{-1} ] = vec(B - \hat B)' vec( X'X (B - \hat B) \Sigma^{-1} )

and  vec(ABC) = (C' \otimes A) vec(B),  so this equals

vec(B - \hat B)' ( \Sigma^{-1} \otimes X'X ) vec(B - \hat B)

Therefore,

p(Y | X, B, \Sigma) \propto |\Sigma|^{-(n-k)/2} etr{ -(1/2) S \Sigma^{-1} }
  \times |\Sigma|^{-k/2} exp{ -(1/2) (\beta - \hat\beta)' ( \Sigma^{-1} \otimes X'X ) (\beta - \hat\beta) }

with  \beta = vec(B),  \hat\beta = vec(\hat B).

Inverted Wishart distribution

The form of the likelihood suggests that the natural conjugate (convenient) prior for \Sigma is of the inverted Wishart form:

p(\Sigma | \nu_0, V_0) \propto |\Sigma|^{-(\nu_0 + m + 1)/2} etr{ -(1/2) V_0 \Sigma^{-1} }

denoted  \Sigma ~ IW(\nu_0, V_0).   If \nu_0 > m + 1,  E[\Sigma] = (\nu_0 - m - 1)^{-1} V_0.

\nu -- tightness;  V -- location. However, location and spread cannot be set independently: as V increases, the spread also increases.

Limitations: (i) thick tails require small \nu; (ii) there is only one tightness parameter.

Wishart distribution (rwishart)

If \Sigma ~ IW(\nu_0, V_0), then \Sigma^{-1} ~ W(\nu_0, V_0^{-1}), with E[\Sigma^{-1}] = \nu_0 V_0^{-1}.

Generalization of the \chi^2: let \varepsilon_i ~ N_m(0, \Sigma). Then

W = \sum_{i=1}^{\nu} \varepsilon_i \varepsilon_i'  ~  W(\nu, \Sigma)

The diagonal elements are (scaled) \chi^2_\nu random variables.
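A minimal R sketch (my own) of the definitional Wishart draw as a sum of nu outer products; bayesm's rwishart() is the production version used on the following slides:

rwishart_naive=function(nu,Sigma){
  m=nrow(Sigma)
  E=matrix(rnorm(nu*m),ncol=m)%*%chol(Sigma)  # nu draws from N_m(0, Sigma), one per row
  crossprod(E)                                # sum_i eps_i eps_i' ~ W(nu, Sigma)
}
W=rwishart_naive(nu=20,Sigma=diag(2))
W[1,1]   # distributed chi-squared with 20 df (since sigma_11 = 1)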

Multivariate regression prior and posterior

Prior:

p(\Sigma, B) = p(\Sigma) p(B | \Sigma)
\Sigma ~ IW(\nu_0, V_0)
\beta | \Sigma ~ N( \bar\beta, \Sigma \otimes A^{-1} ),   \beta = vec(B),  \bar\beta = vec(\bar B)

Posterior:

\Sigma | Y, X ~ IW( \nu_0 + n, V_0 + \tilde S )
\beta | Y, X, \Sigma ~ N( \tilde\beta, \Sigma \otimes (X'X + A)^{-1} )

\tilde\beta = vec(\tilde B),   \tilde B = (X'X + A)^{-1} (X'X\hat B + A\bar B)
\tilde S = (Y - X\tilde B)'(Y - X\tilde B) + (\tilde B - \bar B)' A (\tilde B - \bar B)

Drawing from Posterior: rmultireg

rmultireg=
function(Y,X,Bbar,A,nu,V) {
# note: Y,X,A,Bbar must be matrices!
n=nrow(Y); m=ncol(Y); k=ncol(X)
RA=chol(A)
W=rbind(X,RA)
Z=rbind(Y,RA%*%Bbar)
IR=backsolve(chol(crossprod(W)),diag(k))
# W'W = R'R and (W'W)^-1 = IR IR'  -- this is the UL decomp!
Btilde=crossprod(t(IR))%*%crossprod(W,Z)
# IR IR' (W'Z) = (X'X+A)^-1 (X'Y + A Bbar)
S=crossprod(Z-W%*%Btilde)
rwout=rwishart(nu+n,chol2inv(chol(V+S)))
#
# now draw B given Sigma; note beta ~ N(vec(Btilde), Sigma (x) Cov)
#  Cov = (X'X + A)^-1 = IR IR'  and  Sigma = CI CI'
#  therefore cov(beta) = Omega = CI CI' (x) IR IR' = (CI (x) IR)(CI (x) IR)'
#  so to draw beta: beta = vec(Btilde) + (CI (x) IR) vec(Z_km),
#  where Z_km is a k x m matrix of N(0,1) draws.
#  Since vec(ABC) = (C' (x) A) vec(B), this is B = Btilde + IR Z_km CI'
#
B = Btilde + IR%*%matrix(rnorm(m*k),ncol=m)%*%t(rwout$CI)
list(B=B,Sigma=rwout$IW)   # return the draws (as in bayesm's rmultireg)
}
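Hypothetical usage of rmultireg() (one posterior draw per call) with simulated data and a fairly diffuse prior of my own choosing; bayesm supplies rwishart() and its own rmultireg():

library(bayesm)
set.seed(3)
n=300; k=2; m=2
X=cbind(1,runif(n))
B=matrix(c(1,2,-1,.5),ncol=m)
Sigma=matrix(c(1,.5,.5,1),ncol=m)
Y=X%*%B+matrix(rnorm(n*m),ncol=m)%*%chol(Sigma)
out=rmultireg(Y,X,Bbar=matrix(0,k,m),A=.01*diag(k),nu=m+3,V=(m+3)*diag(m))
out$B       # one draw of the k x m coefficient matrix
out$Sigma   # one draw of the m x m error covariance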

Conjugacy is Fragile!

SUR -- a set of regressions related via correlated errors:

y_i = X_i \beta_i + \varepsilon_i,   i = 1, ..., m

Given \Sigma, a normal prior on \beta would be conjugate.
Given \beta, an inverted Wishart prior on \Sigma would be conjugate.
BUT there is no joint conjugate prior!!
