Skew Gaussian Process for Nonlinear Regression


ISSN: 0361-0926 print / 1532-415X online

DOI: 10.1080/03610926.2012.737498

M. T. Alodat1 and Al-Momani2

1 Department of Mathematics, Statistics and Physics, Qatar University, Qatar
2 Department of Statistics, Yarmouk University, Irbid, Jordan

Address correspondence to M. T. Alodat, Department of Mathematics, Statistics and Physics, Qatar University, Qatar; E-mail: alodatmts@yahoo.com

Color versions of one or more of the figures in the article can be found online at www.tandfonline.com/lsta.

In this article, we extend the Gaussian process regression model by assuming a skew Gaussian process prior on the input function and a skew Gaussian white noise on the error term. Under these assumptions, the predictive density of the output function at a new fixed input is obtained in closed form. Also, we study the Gaussian process predictor when the errors depart from Gaussianity to skew Gaussian white noise. The bias is derived in closed form and is studied for some special cases. We conduct a simulation study to compare the empirical distribution functions of the Gaussian process predictor under Gaussian white noise and skew Gaussian white noise.

Keywords: Prior distribution.

1. Introduction

In the statistical literature, the assumption of Gaussianity or normality has long been made on statistical models when analyzing spatial data. The popularity of the Gaussian assumption is due to its mathematical tractability. For example, the multivariate Gaussian distribution possesses the property of closure under marginal and conditional distributions, as well as closure under convolution. Despite such nice properties of the Gaussian distribution, the data distribution does not meet the assumption of Gaussianity for a large number of real data sets, due to the presence of skewness. If the analysis of such data sets relies on the Gaussian assumption, then unrealistic or nonsensical estimates will be produced. The simplest way to analyze skewed data via the Gaussian model is to Gaussianize the data, i.e., to transform the data to near-Gaussian data. Such a transformation method is not recommended for the following reasons. (i) Finding a suitable transformation to achieve normality is not an easy issue in practice. (ii) Since such transformations are usually applied to the data component-wise, the normality of the marginal distributions does not guarantee joint normality. Hence, the estimates might


be subject to biases. (iii) Despite the difficulty in interpreting the transformed data, data skewness cannot be ignored, since it has its own interpretation (Buccianti, 2005).

Recently, random processes that possess a skewness parameter have been defined by several researchers. Alodat and Aludaat (2007) employed the skew normal theory, as presented in Genton (2004), to define a new random process called the skew Gaussian process, and they gave an application to real data. Relying on the multivariate closed skew normal distribution of González-Farías et al. (2004), Allard and Naveau (2007) defined what they called the closed skew normal random field. For more examples of skew random processes or fields, we refer the reader to Zhang and El-Shaarawi (2009) and Alodat and Al-Rawwash (2009).

The cornerstone in defining a new skew process or field is the multivariate skew normal distribution, which appeared in the pioneering works of Azzalini (1985, 1986), Azzalini and Dalla Valle (1996), and Azzalini and Capitanio (1999). The skew normal or skew Gaussian distribution is defined as follows. A random vector $Z$ $(n\times 1)$ is said to have an $n$-dimensional multivariate skew normal distribution if it has the probability density function (pdf)

$$P_Z(z) = 2\,\phi_n(z; 0, \Sigma)\,\Phi(\alpha^T z), \quad z \in \mathbb{R}^n, \tag{1}$$

where $\phi_n(\cdot\,; 0, \Sigma)$ is the pdf of $N_n(0, \Sigma)$, $\Phi(\cdot)$ is the cdf of $N(0, 1)$, and $\alpha$ $(n\times 1)$ is a vector called the skewness parameter. A more general family than (1) is obtained by the transformation $X = \mu + Z$, $\mu \in \mathbb{R}^n$. It is easy to show that the pdf of $X$ is $P_X(x) = P_Z(x - \mu)$. We use the notation $X \sim SN_n(\mu, \Sigma, \alpha)$ to denote an $n$-dimensional skew normal distribution with parameters $\mu$, $\Sigma$, and $\alpha$.
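To make the definition concrete, the following minimal sketch (ours, not the authors' code; it assumes NumPy and SciPy are available) evaluates the density (1):

```python
# A minimal sketch (not from the paper): evaluating the skew normal pdf (1),
# P_Z(z) = 2 * phi_n(z; 0, Sigma) * Phi(alpha^T z).
import numpy as np
from scipy.stats import multivariate_normal, norm

def sn_pdf(z, Sigma, alpha, mu=None):
    """Density of SN_n(mu, Sigma, alpha) at z, following Eq. (1)."""
    z = np.asarray(z, dtype=float)
    if mu is not None:
        z = z - mu                      # shift: X = mu + Z has pdf P_Z(x - mu)
    phi = multivariate_normal(mean=np.zeros(len(z)), cov=Sigma).pdf(z)
    return 2.0 * phi * norm.cdf(alpha @ z)

# Example: a bivariate skew normal with positive skewness in both coordinates.
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
alpha = np.array([2.0, 2.0])
print(sn_pdf([0.5, -0.2], Sigma, alpha))
```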

A generalization of (1) is given by González-Farías et al. (2004) as follows. Let $\mu \in \mathbb{R}^p$, let $D$ be an arbitrary $q \times p$ matrix, and let $\Sigma$ and $\Delta$ be positive definite matrices of dimensions $p \times p$ and $q \times q$, respectively. A random vector $Y$ is said to have a $p$-dimensional closed skew normal (CSN) distribution with parameters $q, \mu, \Sigma, D, v, \Delta$, denoted by $Y \sim CSN_{p,q}(\mu, \Sigma, D, v, \Delta)$, if its pdf takes the form

$$P_Y(y) = C\,\phi_p(y; \mu, \Sigma)\,\Phi_q\big(D(y - \mu); v, \Delta\big), \quad y \in \mathbb{R}^p, \tag{2}$$

where

$$C^{-1} = \Phi_q\big(0; v, \Delta + D\Sigma D^T\big),$$

and $\phi_p(\cdot\,; \eta, \Psi)$, $\Phi_p(\cdot\,; \eta, \Psi)$ are the pdf and the cumulative distribution function (cdf) of a $p$-dimensional normal distribution with mean vector $\eta$ and covariance matrix $\Psi$. Throughout this article, several lemmas and results about the multivariate CSN distribution will be used extensively, so we present them in Appendix A. For their proofs, we refer the reader to González-Farías et al. (2004) or Genton (2004).
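As an illustration (a sketch under our own naming, using scipy's multivariate normal cdf for $\Phi_q$), the CSN density (2) can be evaluated as follows:

```python
# A minimal sketch (not from the paper): evaluating the CSN_{p,q} pdf (2).
import numpy as np
from scipy.stats import multivariate_normal

def csn_pdf(y, mu, Sigma, D, v, Delta):
    """Density of CSN_{p,q}(mu, Sigma, D, v, Delta) at y, following Eq. (2)."""
    y, mu, v = map(np.atleast_1d, (y, mu, v))
    phi_p = multivariate_normal(mean=mu, cov=Sigma).pdf(y)
    # Phi_q(x; eta, Psi) = P(W <= x) for W ~ N_q(eta, Psi)
    Phi_num = multivariate_normal(mean=v, cov=Delta).cdf(D @ (y - mu))
    Phi_den = multivariate_normal(mean=v, cov=Delta + D @ Sigma @ D.T).cdf(
        np.zeros(len(v)))
    return (1.0 / Phi_den) * phi_p * Phi_num
```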

Furthermore, it has been shown that the family of skew normal distributions possesses properties that are close to, or coincide with, those of the normal family. Besides these closure properties, it contains the normal family as a special case, i.e., when $\alpha = 0$. Such properties have attracted researchers to extend well-known statistical techniques to the skew normality assumption, and many such extensions are still in progress. For example, the Gaussian process regression (GPR) model is a statistical technique introduced by Neal (1995) to treat the nonlinear regression $Y(t) = f(t) + \varepsilon(t)$ from a Bayesian viewpoint. Simply, the technique assumes a Gaussian process as a prior on the unknown function $f(t)$, while $\varepsilon(t)$ is assumed to be a white noise process. The aim is then to predict $f(t)$ at a


new value of t. In other words, the Gaussian process provides us with a prior distribution

over the space of all functions.

Since the Gaussian family is a sub-family of the skew Gaussian family, using the skew Gaussian process, i.e., a process whose finite dimensional distributions are of the form (1), as a prior on $f(t)$ allows us to define a distribution over a richer family of functions than the Gaussian one. It also allows us to extend the error term in the above regression model to have a skewed distribution, which is closer to real data than its Gaussian counterpart.

It appears from the literature that GPR has significant applications in various fields of science. For example, it has been applied to model noisy data and to classification problems arising in machine learning, e.g., to predict the inverse dynamics of a robot arm (Rasmussen and Williams, 2006). Brahim-Belhouari and Bermak (2004) applied the GPR model to predict the future values of a non-stationary time series. Schmidt et al. (2008) studied the sensitivity of GPR to the choice of correlation function; based on a numerical study, they concluded that the predictions did not differ much among the different correlation functions. Vanhatalo et al. (2009) proposed a GPR with Student-t likelihood by approximating the joint distribution of the process by a Student distribution. The idea behind that approximation is to make the GPR model robust against outliers; however, the model they proposed is analytically intractable. Kuss (2006) proposed other robust models as alternatives to GPR. Macke et al. (2010) applied GPR to estimate cortical maps of the human brain, modeling the brain image of their experiment, where the activity at each voxel is measured, by a Gaussian process. Fyfe et al. (2008) applied GPR to canonical correlation analysis with an application to neural data.

The problem of treating the prediction problem of the nonlinear regression $Y(t) = f(t) + \varepsilon(t)$ from a Bayesian viewpoint when both $f(t)$ and $\varepsilon(t)$ follow skew Gaussian processes has not yet been addressed in the literature. In this article, we extend the GPR model by assuming two independent skew Gaussian processes, one on $f(t)$ and the other on $\varepsilon(t)$. In other words, we consider the nonlinear regression model $Y_i = f(t_i) + \varepsilon(t_i)$, $i = 1, 2, \ldots, n$, i.e., for each $i$, $f(t_i)$ is measured as $Y_i$ but corrupted by the noise $\varepsilon(t_i)$. We then put a skew Gaussian process prior on the function $f(t)$ and assume that the process $\varepsilon(t)$ follows a skew Gaussian process as well. Under these assumptions, the following two prediction problems are considered: (i) prediction of $f(t)$ at a fixed input $t$, and (ii) prediction of $f(t)$ at a random input $t^*$.

The rest of this article is organized as follows. In Sec. 2, we introduce the reader to the GPR model. In Sec. 3, we generalize the GPR model by assuming a skew Gaussian process on $f(t)$ and another skew Gaussian process on $\varepsilon(t)$; we then derive the predictive density of the output function at a new input, as well as the mean and the variance of the predictive distribution. In Sec. 4, it is assumed that the GPR predictor is used to analyze data with skewed errors, and we derive the resulting bias and variance. In Sec. 5, we conduct a simulation study to compare the new model to the Gaussian one. Finally, we state our conclusions in Sec. 6.

2. The Gaussian Process Regression Model

A family $\{X(t), t \in C\}$, $C \subseteq \mathbb{R}^n$, of random variables is said to constitute a Gaussian process if for every $n$ and $t_1, \ldots, t_n \in C$, the random variables $X(t_1), \ldots, X(t_n)$ have an $n$-dimensional multivariate normal distribution. The Gaussian process is used in the statistical literature as a prior process for the Bayesian analysis of several statistical problems. For example, O'Hagan (1978) was the first to use the Gaussian process as a prior process over


the space of functions to treat a nonlinear regression from a Bayesian viewpoint, while an application of O'Hagan's work to Bayesian learning in neural networks appeared in Neal (1995).

The GPR, as presented in Neal (1995), can be illustrated as follows. Consider a set of training data $Y = (Y_1, Y_2, \ldots, Y_n)^T$, where the input vectors $t_1, t_2, \ldots, t_n \in C \subseteq \mathbb{R}^n$ and their output values $Y_1, Y_2, \ldots, Y_n$ are governed by the nonlinear regression model $Y_i = f(t_i) + \varepsilon(t_i)$, where $\varepsilon(t_1), \varepsilon(t_2), \ldots, \varepsilon(t_n)$ are iid Gaussian noises on $C$ with mean 0 and variance $\tau^2$, and $f(\cdot)$ is an unknown function. The main question is: what is the predicted value of $f^* = f(t^*)$, the value of $f(t)$ at a new input $t^*$? To answer this question, a prior distribution is needed on $f(t)$, i.e., a distribution over a set of functions. This prior distribution should be defined on the class of all functions defined on the space of $t$. The set of all sample paths of a Gaussian process on $C$ provides us with such a class.

Assume that $f(t)$, $t \in C$, is a Gaussian process with covariance function $k(\cdot, \cdot)$, i.e., for every $n$ and $t_1, t_2, \ldots, t_n \in C$, we have $f = (f(t_1), \ldots, f(t_n))^T \sim N_n(0, \Sigma)$, where

$$\Sigma = \begin{pmatrix} k(t_1, t_1) & \cdots & k(t_1, t_n) \\ \vdots & \ddots & \vdots \\ k(t_n, t_1) & \cdots & k(t_n, t_n) \end{pmatrix}.$$

One widely used covariance function is

$$k(t_i, t_j) = \exp\left(-\frac{1}{2}(t_i - t_j)^T \Lambda^{-1} (t_i - t_j)\right). \tag{3}$$

For simplicity, we may consider $\Lambda = \mathrm{diag}(\lambda_1^2, \ldots, \lambda_n^2)$. A covariance function $k(\cdot, \cdot)$ is said to be isotropic if $k(t_i, t_j)$ depends only on the distance $\|t_i - t_j\|$. For more information about other types of covariance functions, see Girard et al. (2004).
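For concreteness, here is a small sketch (ours, not the authors' Matlab code) that builds $\Sigma$ from the covariance function (3):

```python
# A minimal sketch (assumptions ours): building the covariance matrix of
# Eq. (3) with Lambda = diag(lambda_1^2, ..., lambda_p^2).
import numpy as np

def sq_exp_cov(T1, T2, lam):
    """k(t_i, t_j) = exp(-0.5 (t_i - t_j)^T Lambda^{-1} (t_i - t_j))."""
    T1, T2 = np.atleast_2d(T1), np.atleast_2d(T2)
    diff = T1[:, None, :] - T2[None, :, :]          # pairwise differences
    return np.exp(-0.5 * np.sum(diff**2 / lam**2, axis=-1))

t = np.linspace(0.1, 10, 50)[:, None]               # 50 one-dimensional inputs
Sigma = sq_exp_cov(t, t, lam=np.array([1.0]))       # 50 x 50 covariance matrix
```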

Since $f(t)$ is assumed to follow a Gaussian process, then, according to Quiñonero-Candela et al. (2003), the joint distribution of $f(t)$ and $f(t^*)$ is also an $(n+1)$-dimensional multivariate Gaussian distribution, i.e.,

$$\begin{pmatrix} f(t_1) \\ \vdots \\ f(t_n) \\ f(t^*) \end{pmatrix} \sim N_{n+1}(0, \Omega),$$

with

$$\Omega = \begin{pmatrix} k(t_1, t_1) & \cdots & k(t_1, t_n) & k(t_1, t^*) \\ \vdots & \ddots & \vdots & \vdots \\ k(t_n, t_1) & \cdots & k(t_n, t_n) & k(t_n, t^*) \\ k(t^*, t_1) & \cdots & k(t^*, t_n) & k(t^*, t^*) \end{pmatrix} = \begin{pmatrix} \Sigma & k \\ k^T & k^* \end{pmatrix},$$

where $k = (k(t_1, t^*), \ldots, k(t_n, t^*))^T$ and $k^* = k(t^*, t^*)$.

Rasmussen (1996) showed that the predictive distribution of $f^*$ given $Y$ and $t^*$ remains Gaussian and is given by

$$f^* \mid Y, t^* \sim N\big(\mu(t^*), \sigma^2(t^*)\big), \tag{4}$$

where $\mu(t^*)$ and $\sigma^2(t^*)$ are the mean and the variance of the predictive distribution (4) and are given by

$$\mu^* = \mu(t^*) = k^T\big(\Sigma + \tau^2 I_n\big)^{-1} Y \quad \text{and} \quad \sigma^2(t^*) = k^* - k^T\big(\Sigma + \tau^2 I_n\big)^{-1} k,$$

where $Y = (Y_1, Y_2, \ldots, Y_n)^T$.

The distribution (4) can be used to draw several inferential statements about $f(t^*)$. For instance, when $p = 1$, a $100(1-\alpha)\%$ prediction interval for $f(t^*)$ is given by $[L, U]$, where $L$ and $U$ are the solutions of the following two equations:

$$\int_{-\infty}^{L} P(f^* \mid Y, t^*)\, df^* = \frac{\alpha}{2} \quad \text{and} \quad \int_{U}^{\infty} P(f^* \mid Y, t^*)\, df^* = \frac{\alpha}{2}.$$

Since (4) is Gaussian, this interval reduces to

$$\mu(t^*) \pm z_{1-\alpha/2}\, \sigma(t^*),$$

where $z_{1-\alpha/2}$ is the $100(1-\alpha/2)\%$ quantile of $N(0, 1)$. Moreover, the mean $\mu(t^*)$ serves as a predictor for $f(t^*)$ given the data $Y$ and $t^*$, while the variance $\sigma^2(t^*)$ serves as a measure of uncertainty in $\mu(t^*)$.
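The predictor (4) is straightforward to compute; the following hedged sketch (ours, not the authors' Matlab implementation) returns $\mu(t^*)$, $\sigma^2(t^*)$, and the prediction interval:

```python
# A minimal sketch of the GPR predictor of Eq. (4); names are ours.
import numpy as np
from scipy.stats import norm

def gp_predict(Sigma, k, k_star, Y, tau2, level=0.95):
    """Return mu(t*), sigma^2(t*), and the normal prediction interval."""
    n = len(Y)
    K = Sigma + tau2 * np.eye(n)
    mu = k @ np.linalg.solve(K, Y)           # k^T (Sigma + tau^2 I_n)^{-1} Y
    s2 = k_star - k @ np.linalg.solve(K, k)  # k* - k^T (Sigma + tau^2 I_n)^{-1} k
    z = norm.ppf(0.5 + level / 2)            # z_{1 - alpha/2}
    half = z * np.sqrt(s2)
    return mu, s2, (mu - half, mu + half)
```

Using `np.linalg.solve` rather than forming the explicit inverse avoids the main computational bottleneck noted below, the inversion of $\Sigma + \tau^2 I_n$.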

Now, assume that we are interested in predicting $f(t)$ at $t^*$, where $t^*$ is a random variable such that $t^* \sim N_p(\mu^*, \Sigma^*)$, i.e., we are interested in prediction at a random input. The predictive pdf for $f^*$ given $\mu^*, \Sigma^*$ is (Girard et al., 2004):

$$P(f^* \mid \mu^*, \Sigma^*, Y) = \int P(f^* \mid Y, t^*)\, P(t^*)\, dt^*. \tag{5}$$

The integral in Eq. (5) does not have a closed form. Hence, an approximation to this integral is needed in order to report inferential statements about $f^*$. Moreover, the main computational problem in GPR is the inversion of the matrix $\Sigma + \tau^2 I_n$ and obtaining the mean and variance of the predictive distribution of $f^*$ given $Y$ at a random input $t^*$. For this reason, we propose the following simple Monte Carlo approximation to (5):

$$P(f^* \mid \mu^*, \Sigma^*, Y) = \int P(f^* \mid Y, t^*)\, P(t^*)\, dt^* \approx \frac{1}{N} \sum_{r=1}^{N} P\big(f^* \mid Y, t^{*(r)}\big),$$

where $t^{*(1)}, \ldots, t^{*(N)}$ are independent samples from $P(t^*)$. Before closing this section, we refer to Girard et al. (2002) and Rasmussen and Williams (2006), where the reader can find several analytical approximation techniques for the predictive density (5).
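A minimal sketch (ours) of this Monte Carlo average over random inputs:

```python
# Average the fixed-input Gaussian predictive density over draws of t*.
import numpy as np
from scipy.stats import norm

def mc_predictive_pdf(f_grid, draw_t_star, predict, Y, N=1000):
    """draw_t_star() samples from P(t*); predict(t*, Y) -> (mu(t*), sigma^2(t*))."""
    dens = np.zeros_like(f_grid, dtype=float)
    for _ in range(N):
        mu, s2 = predict(draw_t_star(), Y)
        dens += norm.pdf(f_grid, loc=mu, scale=np.sqrt(s2))
    return dens / N
```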


3. The Skew Gaussian Process Regression Model

In this section, a generalization of the Gaussian process, called the skew Gaussian process (SGP), is proposed. Under the SGP, we give a generalization of the GPR model called the skew Gaussian process regression (SGPR) model. Then the predictive density at new inputs is derived for the SGPR model.

Definition 3.1. A random process $Y(t)$, $t \in C$, is said to be a skew Gaussian process (SGP) if for every $n \in \{1, 2, 3, \ldots\}$ and every $t_1, \ldots, t_n \in C$, the vector $(Y(t_1), \ldots, Y(t_n))^T$ follows the density (1), i.e., $Y(t)$ is a skew Gaussian process if its set of finite dimensional distributions is a subfamily of the family of distributions defined by (1).

Definition 3.2. A skew Gaussian process $Y(t)$ possesses a fixed skewness in all directions if for every $n$ and $t_1, \ldots, t_n$, the parameter $\alpha$ in (1) takes the form $\alpha = \alpha 1_n$, $\alpha \in \mathbb{R}$. Throughout this article, we will assume that for each $n$ and $t_1, \ldots, t_n \in C$, the parameter $\Sigma = \big(k(t_i, t_j)\big)_{i,j=1}^{n}$, where $k(\cdot, \cdot)$ is a given covariance function.

Definition 3.3. A skew Gaussian process is called a skew white noise if for every $n$ and $t_1, \ldots, t_n \in C \subseteq \mathbb{R}^n$, $\varepsilon = (\varepsilon(t_1), \ldots, \varepsilon(t_n))^T \sim SN_n(0, \tau^2 I_n, \beta 1_n)$, where $\tau, \beta \in \mathbb{R}$.

If the function $f(t)$ follows a skew Gaussian process and $\varepsilon(t)$ follows a skew white noise, the following theorem gives the predictive distribution of $f(t^*)$ given $Y, t^*$, as well as its mean and variance, at the fixed input $t^*$.

Theorem 3.1. Consider the SGPR model under the above assumptions, and let

$$\Sigma_A = \begin{pmatrix} \Sigma + \tau^2 I_n & k \\ k^T & k^* \end{pmatrix} = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}, \qquad \Delta_A = \begin{pmatrix} \Delta_{11} & \Delta_{12} \\ \Delta_{21} & \Delta_{22} \end{pmatrix},$$

and

$$D_A = \begin{pmatrix} D_{11} & D_{12} \\ D_{21} & D_{22} \end{pmatrix} = \begin{pmatrix} D_1 & D_2 \end{pmatrix}, \quad \text{where} \quad D_1 = \begin{pmatrix} D_{11} \\ D_{21} \end{pmatrix}_{2\times n} \ \text{and} \ D_2 = \begin{pmatrix} D_{12} \\ D_{22} \end{pmatrix}_{2\times 1},$$

with $\Delta_A$ and $D_A$ as derived in Appendix B. Then:

i. the conditional distribution of $f^*$ given $Y$ and $t^*$ is

$$f^* \mid Y, t^* \sim CSN_{1,2}\left(k^T(\Sigma + \tau^2 I_n)^{-1} Y,\; k^* - k^T(\Sigma + \tau^2 I_n)^{-1} k,\; D_2,\; -D^* Y,\; \Delta_A\right), \tag{6}$$

where

$$D^* = D_1 + D_2\, k^T(\Sigma + \tau^2 I_n)^{-1};$$


ii. the predictive mean and the predictive variance are given by

$$E(f^* \mid Y, t^*) = \mu^* + \sigma^{*2}\, \frac{D_{12}\,\Phi_2^{(1)}\big(0_{2\times 1}; -D^* Y, \Delta_A + \sigma^{*2} D_2 D_2^T\big) + D_{22}\,\Phi_2^{(2)}\big(0_{2\times 1}; -D^* Y, \Delta_A + \sigma^{*2} D_2 D_2^T\big)}{\Phi_2\big(0_{2\times 1}; -D^* Y, \Delta_A + \sigma^{*2} D_2 D_2^T\big)}$$

and

$$\mathrm{var}(f^* \mid Y, t^*) = 2\sigma^{*4}\Big(\Phi_2^{(11)}\big(0_{2\times 1}; -D^* Y, \Delta_A + \sigma^{*2} D_2 D_2^T\big)\, D_{12}^2 + \Phi_2^{(12)}\big(0_{2\times 1}; -D^* Y, \Delta_A + \sigma^{*2} D_2 D_2^T\big)\, D_{12} D_{22}$$
$$+ \Phi_2^{(21)}\big(0_{2\times 1}; -D^* Y, \Delta_A + \sigma^{*2} D_2 D_2^T\big)\, D_{12} D_{22} + \Phi_2^{(22)}\big(0_{2\times 1}; -D^* Y, \Delta_A + \sigma^{*2} D_2 D_2^T\big)\, D_{22}^2\Big),$$

where $\mu^* = k^T(\Sigma + \tau^2 I_n)^{-1} Y$, $\sigma^{*2} = k^* - k^T(\Sigma + \tau^2 I_n)^{-1} k$, $\Phi_2^{(i)}(\cdot, \cdot)$ is the first partial derivative of $\Phi_2(\cdot, \cdot)$ with respect to the $i$th argument for $i = 1, 2$, and $\Phi_2^{(ij)}(\cdot, \cdot)$ is the second mixed partial derivative of $\Phi_2(\cdot, \cdot)$ with respect to the $i$th and $j$th arguments for $i, j = 1, 2$.
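Since $\Phi_2$ and its partial derivatives have no elementary closed form, one practical route (our sketch, not the paper's prescription) is to approximate $\Phi_2^{(i)}$ by finite differences on scipy's bivariate normal cdf; we keep the $\Phi_2(0;\cdot)$ normalizer from the moment generating function:

```python
# A hedged numerical sketch of the predictive mean in Theorem 3.1.
import numpy as np
from scipy.stats import multivariate_normal

def phi2_partials(x, mean, cov, h=1e-4):
    """Central-difference estimates of the two first partials of Phi_2 at x.

    scipy computes the bivariate normal cdf by numerical integration, so h
    should not be taken too small.
    """
    F = multivariate_normal(mean=mean, cov=cov).cdf
    grads = []
    for i in range(2):
        e = np.zeros(2)
        e[i] = h
        grads.append((F(x + e) - F(x - e)) / (2.0 * h))
    return grads

def sgpr_mean(mu_star, s2_star, D2, m, Delta_A):
    """E(f* | Y, t*) of Theorem 3.1, with m = -D* Y and D2 = (D12, D22)^T."""
    V = Delta_A + s2_star * np.outer(D2, D2)
    g1, g2 = phi2_partials(np.zeros(2), m, V)
    Z0 = multivariate_normal(mean=m, cov=V).cdf(np.zeros(2))
    return mu_star + s2_star * (D2[0] * g1 + D2[1] * g2) / Z0
```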

The above theorem shows that the predictive distribution of a new output follows a closed skew Gaussian distribution. As a special case, this predictive distribution reduces to (4) if the skewness is absent, i.e., if $\alpha = \beta = 0$. Another predictor of $f(t^*)$ is the median of the conditional distribution of $f^*$ given $Y$; neither the mean nor the median of the conditional distribution in our case has a simple closed form. Furthermore, part (i) of Theorem 3.1 can be used to predict the value of $f(t)$ at a random input, say $t^*$. For instance, assume that $t^* \sim N_p(a, B)$ and we wish to predict $f^* = f(t^*)$. Since $f^* \mid Y, t^* \sim CSN_{1,2}(\mu^*, \sigma^{*2}, D_2, -D^* Y, \Delta_A)$, then using the total probability law, we write

$$P(f^* \mid Y) = \int_{\mathbb{R}^p} P(f^* \mid Y, t^*)\, P(t^*)\, dt^*.$$

Unfortunately, it is difficult, even for GPR, to find a closed form for the integral in the last equation, so an approximation for $P(f^* \mid Y)$ is needed. Here, we propose the following simple Monte Carlo approximation for the predictive distribution at a random input:

$$P(f^* \mid Y) = \int_{\mathbb{R}^p} P(f^* \mid Y, t^*)\, P(t^*)\, dt^* \approx \frac{1}{N} \sum_{r=1}^{N} P\big(f^* \mid Y, t^{*(r)}\big),$$

where $t^{*(1)}, \ldots, t^{*(N)}$ are independent samples from $P(t^*)$. Since we are putting a skew Gaussian process prior on the function $f(t)$, then for each $n$, the finite dimensional distribution of the skew Gaussian process is used as a prior for the distribution of the vector $f = (f(t_1), \ldots, f(t_n))^T$, i.e., $f \sim SN_n(0, \Sigma, \alpha 1_n)$, where $\Sigma$ is as defined in the previous sections. Since $Y = f + \varepsilon$, where $\varepsilon \sim SN_n(0, \tau^2 I_n, \beta 1_n)$, the posterior distribution of $f$ is


$$P(f \mid y, t^*) = \frac{P(y, t^* \mid f)\, P(f)}{P(y, t^*)}.$$

Since the prior distribution is proper, then so is the posterior distribution. So the predictive distribution of $f^*$ given $Y$ and $t^*$ is

$$P(f^* \mid y, t^*) = \int P(f, f^* \mid y, t^*)\, df = \frac{1}{P(y, t^*)} \int P(y, t^* \mid f, f^*)\, P(f, f^*)\, df.$$

It will be shown in the proof of Theorem 3.1, Appendix B.1, that the last equation simplifies to the pdf of the distribution

$$CSN_{1,2}\left(k^T(\Sigma + \tau^2 I_n)^{-1} Y,\; k^* - k^T(\Sigma + \tau^2 I_n)^{-1} k,\; D_2,\; -D^* Y,\; \Delta_A\right).$$

The predictive distribution of $f^*$ given $Y$ is obtained by averaging $P(f^* \mid Y, t^*)$ over all values of $t^*$. Since $t^*$ has a proper distribution, then so does the distribution of $f^*$ given $Y$. Furthermore, the strong law of large numbers implies that the estimator $\frac{1}{N}\sum_{r=1}^{N} P(f^* \mid Y, t^{*(r)})$ converges almost surely to its mean value, i.e.,

$$\frac{1}{N} \sum_{r=1}^{N} P\big(f^* \mid Y, t^{*(r)}\big) \xrightarrow{a.s.} E\, P\big(f^* \mid Y, t^{*(1)}\big) = \int P(f^* \mid y, t^*)\, P(t^*)\, dt^* = P(f^* \mid y).$$

To simulate from

$$f^* \mid Y, t^* \sim CSN_{1,2}\left(k^T(\Sigma + \tau^2 I_n)^{-1} Y,\; k^* - k^T(\Sigma + \tau^2 I_n)^{-1} k,\; D_2,\; -D^* Y,\; \Delta_A\right),$$

we may utilize the following stochastic representation of the CSN distribution (Genton, 2004; Allard and Naveau, 2007):

i. Let $V$ be a random vector distributed as $V_0 \mid V_0 \le 0$, where $V_0 \sim N_2(-D^* Y, Q)$ and

$$Q = \Delta_A + \sigma^{*2} D_2 D_2^T.$$

ii. Independently of $V$, generate $H \sim N(0, 1)$.

iii. Set $Z = \mu^* + m^*(V + D^* Y) + \Delta^{*1/2} H$, where

$$m^* = -\sigma^{*2} D_2^T \big(\Delta_A + \sigma^{*2} D_2 D_2^T\big)^{-1} \quad \text{and} \quad \Delta^* = \sigma^{*2} - \sigma^{*4} D_2^T \big(\Delta_A + \sigma^{*2} D_2 D_2^T\big)^{-1} D_2.$$

Then $Z$ has the distribution of $f^* \mid Y, t^*$.


4. The GPR Predictor under Skew Gaussian White Noise

In this section, we consider the model $Y = f + \varepsilon$, where the error process $\varepsilon(t)$ follows a skew white noise. Then we use the Gaussian process predictor, i.e., $\hat{f} = k^T(\Sigma + \tau^2 I_n)^{-1} Y$, to predict $f(t^*)$. Under such a setup, we study the effect of this assumption on the mean and the variance of the GPR predictor. In the sequel, the mean and the variance of $\hat{f}$, where $Y$ is replaced by $Y = f + \varepsilon$ with $\varepsilon \sim N_n(0, \tau^2 I_n)$, are denoted by $E^G(\hat{f})$ and $\mathrm{var}^G(\hat{f})$, respectively. Also, if $Y$ is replaced by $Y = f + \varepsilon$ with $\varepsilon \sim SN_n(0, \tau^2 I_n, \beta 1_n)$, then the mean and the variance are denoted by $E^{SG}(\hat{f})$ and $\mathrm{var}^{SG}(\hat{f})$, respectively. Under a white noise process, i.e., $\varepsilon \sim N_n(0, \tau^2 I_n)$, we have

$$E^G(\hat{f}) = k^T(\Sigma + \tau^2 I_n)^{-1} E(f + \varepsilon) = k^T(\Sigma + \tau^2 I_n)^{-1} f,$$

while under the assumption $\varepsilon \sim SN_n(0, \tau^2 I_n, \beta 1_n)$, we have that

$$E^{SG}(\hat{f}) = k^T(\Sigma + \tau^2 I_n)^{-1} E(f + \varepsilon) = k^T(\Sigma + \tau^2 I_n)^{-1} f + k^T(\Sigma + \tau^2 I_n)^{-1} E(\varepsilon),$$

where, by Proposition A.6,

$$E(\varepsilon) = \sqrt{\frac{2}{\pi}}\, \frac{\tau^2 \beta\, 1_n}{\sqrt{1 + \beta^2 \tau^2 n}}.$$

Hence,

$$E^{SG}(\hat{f}) = E^G(\hat{f}) + \sqrt{\frac{2}{\pi}}\, k^T(\Sigma + \tau^2 I_n)^{-1} \frac{\tau^2 \beta\, 1_n}{\sqrt{1 + \beta^2 \tau^2 n}} = E^G(\hat{f}) + b(\tau^2, \beta^2, n), \text{ say.} \tag{7}$$

Equation (7) shows that the GPR predictor is either increased or decreased by an amount of $b(\tau^2, \beta^2, n)$. Similarly, under the assumption $\varepsilon \sim SN_n(0, \tau^2 I_n, \beta 1_n)$, the variance $\mathrm{var}^{SG}(\hat{f})$ is obtained by applying Proposition A.6 of Appendix A. So

$$\mathrm{var}^{SG}(\hat{f}) = k^T(\Sigma + \tau^2 I_n)^{-1} \left(\tau^2 I_n - \frac{2}{\pi}\, \frac{\beta^2 \tau^4}{1 + n\tau^2\beta^2}\, 1_n 1_n^T\right) (\Sigma + \tau^2 I_n)^{-1} k,$$

which, in the setting of Theorem 4.1(iii) below (where $k = k_0 1_n$), reduces to

$$\mathrm{var}^{SG}(\hat{f}) = \tau^2 k^T(\Sigma + \tau^2 I_n)^{-2} k - \frac{2}{\pi}\, \frac{\beta^2 \tau^4 k_0^2 n^2}{(1 + n\tau^2\beta^2)\, L_n^2},$$

where $L_n = \tau^2 + \sum_{i=1}^{n} k(t_1, t_i)$.
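As a numerical illustration (ours; the setup and variable names are assumptions), the bias (7) and the variance above can be evaluated directly:

```python
# A minimal sketch: bias b(tau^2, beta^2, n) of Eq. (7) and var^SG(f_hat).
import numpy as np

def bias_b(Sigma, k, tau2, beta):
    """b = sqrt(2/pi) tau^2 beta / sqrt(1 + beta^2 tau^2 n) * k^T (Sigma+tau^2 I)^{-1} 1_n."""
    n = Sigma.shape[0]
    w = np.linalg.solve(Sigma + tau2 * np.eye(n), np.ones(n))
    return np.sqrt(2 / np.pi) * tau2 * beta / np.sqrt(1 + beta**2 * tau2 * n) * (k @ w)

def var_sg(Sigma, k, tau2, beta):
    """var^SG(f_hat) = a^T Cov(eps) a with a = (Sigma + tau^2 I)^{-1} k."""
    n = Sigma.shape[0]
    a = np.linalg.solve(Sigma + tau2 * np.eye(n), k)
    ones = np.ones((n, 1))
    cov_eps = (tau2 * np.eye(n)
               - (2 / np.pi) * (beta**2 * tau2**2 / (1 + n * tau2 * beta**2))
               * (ones @ ones.T))                      # Cov(eps) = tau^2 I - (2/pi) delta delta^T
    return a @ cov_eps @ a
```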


Theorem 4.1. Consider the setup in the above discussion. Then $b(\tau^2, \beta^2, n)$ and $\mathrm{var}^{SG}(\hat{f})$ satisfy the following properties.

i. $\mathrm{var}^{SG}(\hat{f}) \le \mathrm{var}^{G}(\hat{f})$ for all $\tau, \beta, n$.

ii. $\lim_{\beta \to \pm\infty} b(\tau^2, \beta^2, n) = \pm\sqrt{\frac{2}{\pi}}\, \frac{\tau}{\sqrt{n}}\, k^T(\Sigma + \tau^2 I_n)^{-1} 1_n$, $\lim_{\beta \to 0} b(\tau^2, \beta^2, n) = 0$, and $\lim_{\tau \to 0} b(\tau^2, \beta^2, n) = 0$.

iii. Assume that $t_1, t_2, \ldots, t_n$ are chosen so that they are the vertices of a regular polygon and $t^*$ is located at its center. If $k(\cdot, \cdot)$ is an isotropic covariance function and $k_0 = k(t_1, t^*)$, then:

a. $b(\tau^2, \beta^2, n) > 0$ for all $\tau$, $n$ and $\beta \ne 0$, and $\lim_{\tau \to \infty} b(\tau^2, \beta^2, n) = 0$.

b. If $n^{-0.5} \sum_{i=1}^{n} k(t_1, t_i) \to c \ne 0$ as $n \to \infty$, then $\lim_{n\to\infty} b(\tau^2, \beta^2, n) = \sqrt{\frac{2}{\pi}}\, \frac{\tau \beta k_0}{|\beta|\, c}$, and if $\sum_{i=1}^{n} k(t_1, t_i) = O(n)$, then $\lim_{n\to\infty} b(\tau^2, \beta^2, n) = 0$.

c. If $n^{-0.5} \sum_{i=1}^{n} k(t_1, t_i) \to c \ne 0$, then $\lim_{n\to\infty} \mathrm{var}^{SG}(\hat{f}) = \frac{\tau^2 k_0^2}{c^2}\left(1 - \frac{2}{\pi}\right)$.

d. $\lim_{\beta \to 0} \mathrm{var}^{SG}(\hat{f}) = \frac{n\tau^2 k_0^2}{L_n^2}$, $\lim_{\beta \to \pm\infty} \mathrm{var}^{SG}(\hat{f}) = \frac{n\tau^2 k_0^2}{L_n^2}\left(1 - \frac{2}{\pi}\right)$, and $\lim_{\tau \to 0,\infty} \mathrm{var}^{SG}(\hat{f}) = 0$.

Proof. The proofs of (ii) and (iii)(a)-(iii)(b) are given in Appendix B. The proof of the other parts is easy, so we leave it to the reader.

It can be noticed that if the Gaussian predictor is used for predicting skewed data, then the variance of the predictor cannot exceed the variance under Gaussian errors. On the other hand, the value of the predictor will be shifted to the left or to the right of the Gaussian one by an amount of $b(\tau^2, \beta^2, n)$. If an isotropic Gaussian covariance function is used, then $\sum_{i=1}^{n} k(t_1, t_i) = \sum_{i=1}^{n} \exp(-0.5\,\theta_0 i/\lambda^2) = O(n)$, where $\theta_0$ denotes the angle between $t_i$ and $t^*$ for all $i = 1, \ldots, n$. So the Gaussian covariance function satisfies part (b) of Theorem 4.1.

5. Simulation study

In this section, we present an algorithm to simulate a realization from a skew Gaussian process, i.e., by simulating from its finite dimensional distributions. The algorithm is then implemented in Matlab code to simulate from the GPR and SGPR predictors.

Simulation of a sample path from the skew Gaussian process can be obtained by sampling from a multivariate skew normal distribution on a smooth grid. To simulate a random vector from the pdf (1), we may use the accept-reject method. The accept-reject method, as given in Robert and Casella (2004), assumes that the pdf $P(x)$ can be written as

$$P(x) = c\, g(x)\, h(x),$$

where $c \ge 1$, $0 < g(x) \le 1$ for all $x$, and $h(x)$ is a pdf. If this is the case, then a random observation from $P(x)$ is generated as follows.

1. Generate $U$ from $U(0, 1)$.
2. Generate $Y$ from $h(x)$.
3. If $U \le g(Y)$, then deliver $Y$ as a realization of $P(x)$.
4. Go to step 1.

For the $SN_n(0, \Sigma, \lambda)$ distribution, we may use this algorithm with $c = 2$, $g(x) = \Phi(\lambda^T x)$, and $h(x) = \phi_n(x; 0, \Sigma)$.
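A direct transcription of this accept-reject scheme into code (our sketch, not the authors' Matlab implementation):

```python
# Accept-reject sampler for SN_n(0, Sigma, lam) with c = 2,
# g(x) = Phi(lam^T x), h(x) = phi_n(x; 0, Sigma).
import numpy as np
from scipy.stats import norm

def rsn(Sigma, lam, size=1, rng=None):
    """Draw `size` samples from SN_n(0, Sigma, lam); acceptance rate is 1/c = 1/2."""
    rng = rng or np.random.default_rng()
    L = np.linalg.cholesky(Sigma)
    out = []
    while len(out) < size:
        y = L @ rng.standard_normal(Sigma.shape[0])  # step 2: Y ~ h = phi_n(.; 0, Sigma)
        if rng.uniform() <= norm.cdf(lam @ y):       # steps 1 and 3: U <= g(Y)
            out.append(y)                            # deliver Y; otherwise go to step 1
    return np.array(out)
```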


For the CSN distribution, it is harder to achieve this via the accept-reject method, due to the complexity of calculating $g(x) = \Phi_q\big(D^T(Y - \mu); v, \Delta\big)$. Instead, we employ the following algorithm, which is derived from the definition of the CSN distribution (see Genton, 2004; Allard and Naveau, 2007):

(i) Simulate an observation from $U \sim N_q(v, \Delta + D\Sigma D^T) \mid U \le 0$.

Also, the simulation from $U \mid U \le 0$ is not an easy task, so an accept-reject method will be implemented.

Maximum likelihood estimation (MLE) is one approach to estimating the hyperparameters in GPR. Here we use MLE to estimate the parameters $\tau, \sigma^2, \alpha, \beta$, and $\lambda$, i.e., by maximizing the likelihood function $L(\tau, \sigma^2, \alpha, \beta, \lambda; Y)$ of $Y$. Consider the model

$$Y = X + \varepsilon,$$

where $X \sim CSN_{n,1}(0, \Sigma, \alpha 1_n^T, 0, 1)$, $\varepsilon \sim CSN_{n,1}(0, \tau^2 I_n, \beta 1_n^T, 0, 1)$, $\tau > 0$, and $X, \varepsilon$ are independent random vectors. Then applying Proposition A.5 of Appendix A yields $Y \sim CSN_{n,2}(0, \Sigma + \tau^2 I_n, D^\circ, 0, \Delta^\circ)$, where

$$D^\circ = \begin{pmatrix} \alpha 1_n^T \Sigma (\Sigma + \tau^2 I_n)^{-1} \\ \beta\tau^2 1_n^T (\Sigma + \tau^2 I_n)^{-1} \end{pmatrix}, \qquad \Delta^\circ = \begin{pmatrix} A_{11} & A_{12} \\ A_{12} & A_{22} \end{pmatrix},$$

and

$$A_{11} = 1 + \alpha^2 1_n^T \Sigma 1_n - \alpha^2 1_n^T \Sigma (\Sigma + \tau^2 I_n)^{-1} \Sigma 1_n, \qquad A_{22} = 1 + n\beta^2\tau^2 - \beta^2 \tau^4 1_n^T (\Sigma + \tau^2 I_n)^{-1} 1_n,$$

$$A_{12} = -\alpha\beta\tau^2 1_n^T \Sigma (\Sigma + \tau^2 I_n)^{-1} 1_n.$$


The likelihood function of $Y$ is then

$$L(\tau, \sigma^2, \alpha, \beta, \lambda; Y) = \frac{\Phi_2(D^\circ Y; 0, \Delta^\circ)}{\Phi_2\big(0; 0, \Delta^\circ + D^\circ (\Sigma + \tau^2 I_n) D^{\circ T}\big)} \times \phi_n\big(Y; 0, \Sigma + \tau^2 I_n\big).$$
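For concreteness, the following sketch (ours) evaluates the log-likelihood; it assumes $\sigma^2$ and $\lambda$ enter through the construction of $\Sigma$, which the function takes directly:

```python
# A hedged sketch: closed skew normal log-likelihood of Y.
import numpy as np
from scipy.stats import multivariate_normal as mvn

def sgpr_loglik(Y, Sigma, tau2, alpha, beta):
    n = len(Y)
    ones = np.ones(n)
    S = Sigma + tau2 * np.eye(n)
    S_inv = np.linalg.inv(S)
    D = np.vstack([alpha * ones @ Sigma @ S_inv,          # D° row 1
                   beta * tau2 * ones @ S_inv])           # D° row 2
    A11 = 1 + alpha**2 * ones @ Sigma @ ones \
            - alpha**2 * ones @ Sigma @ S_inv @ Sigma @ ones
    A22 = 1 + n * beta**2 * tau2 - beta**2 * tau2**2 * ones @ S_inv @ ones
    A12 = -alpha * beta * tau2 * ones @ Sigma @ S_inv @ ones
    Delta = np.array([[A11, A12], [A12, A22]])
    num = mvn(mean=np.zeros(2), cov=Delta).cdf(D @ Y)
    den = mvn(mean=np.zeros(2), cov=Delta + D @ S @ D.T).cdf(np.zeros(2))
    return np.log(num) - np.log(den) + mvn(mean=np.zeros(n), cov=S).logpdf(Y)
```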

Although the marginal distribution of the data $Y$ is a multivariate closed skew normal distribution, the problem of finding confidence intervals for the parameters $\tau, \sigma^2, \alpha, \beta$, and $\lambda$ is not an easy task, since these parameters are embedded in the distribution's parameters, i.e., in $\Sigma$, $D^\circ$, and $\Delta^\circ$. So one may instead consider Bayesian intervals. For this purpose, a prior distribution on $\tau, \sigma^2, \alpha, \beta$, and $\lambda$ must be assumed. Let $P(\tau, \sigma^2, \alpha, \beta, \lambda)$ be the prior that represents our belief about the distribution of $\tau, \sigma^2, \alpha, \beta$, and $\lambda$. Then the posterior distribution of $\tau, \sigma^2, \alpha, \beta$, and $\lambda$ given $Y$ satisfies

$$P(\tau, \sigma^2, \alpha, \beta, \lambda \mid Y) \propto L(\tau, \sigma^2, \alpha, \beta, \lambda; Y) \times P(\tau, \sigma^2, \alpha, \beta, \lambda).$$

Again, we face another problem in finding the normalizing constant of the posterior distribution. Hence, a Markov chain Monte Carlo (MCMC) method should be called for; for example, one may use the Metropolis-Hastings algorithm (Robert and Casella, 2004).
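A minimal random-walk Metropolis-Hastings sketch (ours; `log_post` is a hypothetical user-supplied function returning the unnormalized log posterior of $\theta = (\tau, \sigma^2, \alpha, \beta, \lambda)$):

```python
import numpy as np

def metropolis_hastings(log_post, theta0, n_iter=10000, step=0.1, rng=None):
    rng = rng or np.random.default_rng()
    theta = np.asarray(theta0, dtype=float)
    lp = log_post(theta)
    chain = []
    for _ in range(n_iter):
        prop = theta + step * rng.standard_normal(theta.size)  # symmetric proposal
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:                # accept/reject
            theta, lp = prop, lp_prop
        chain.append(theta.copy())
    return np.array(chain)
```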

To find confidence intervals for the parameters $\tau, \sigma^2, \alpha, \beta$, and $\lambda$, a large sample from $P(\tau, \sigma^2, \alpha, \beta, \lambda \mid Y)$ is needed. To do so, we propose an algorithm in Appendix C to find such confidence intervals. On the other hand, once the parameters $\tau, \sigma^2, \alpha, \beta$, and $\lambda$ have been estimated, their estimates can be plugged into the variance formula $\mathrm{var}(f^* \mid Y, t^*)$ to get an estimate of $\mathrm{var}(f^* \mid Y, t^*)$. Furthermore, to obtain a 95% Bayesian confidence band for $E(f^* \mid y, t^*)$, we also need to use simulation. To proceed, let $\Theta = (\tau, \sigma^2, \alpha, \beta, \lambda)$ and $\Psi(\Theta, t^*) = E(f^* \mid y, t^*)$. A Bayesian simultaneous 95% confidence band for $\Psi(\Theta, t^*)$ is obtained by finding $L$ and $U$ such that $P_{\Theta \mid Y, t^*}(L < \Psi(\Theta, t^*) < U \text{ for all } t^*) = 0.95$, which is equivalent to solving the following equation for $L$ and $U$:

$$P_{\Theta \mid Y, t^*}\left(L < \inf_{t^*} \Psi(\Theta, t^*) \le \sup_{t^*} \Psi(\Theta, t^*) < U\right) = 0.95. \tag{8}$$

Then $L < \Psi(\Theta, t^*) < U$ is a 95% confidence band for $\Psi(\Theta, t^*)$ for all $t^*$.

In this simulation work, realizations of the sample path of the SGP are generated for the input function $f(t) = \sin(t)/t$, $t \ne 0$. The simulated data are then substituted into both the Gaussian and skew Gaussian predictors. To see the effect of the departure from Gaussianity on the Gaussian predictor, we plot the distribution functions of the two predictors. Figures 1-8 show these distribution functions for $\lambda = 1$ and different values of $\alpha$, $\beta$, and $\tau$. We observe the following.

1. If a Gaussian process prior is used on the input function, i.e., $\alpha = 0$, then there is a small difference between the two distributions, and this difference increases as a function of $|\beta|$. Moreover, the skew Gaussian predictor distribution is larger than the Gaussian predictor distribution if $\beta < 0$, and the converse is true if $\beta > 0$ (see Fig. 2a).

2. The two predictors have about the same distribution functions for small values of the parameters $\tau$, $\alpha$, and $\beta$ (see Figs. 2a, b).

Figure 1. (a) GPR (G) and skew Gaussian (SG) predictors with parameters α = −0.05, β = −5, 0, 1, 5, and τ = 0.1; (b) GPR (G) and SG predictors with parameters α = −0.01, β = −5, −1, 0, 5, and τ = 0.1.

Figure 2. (a) GPR (G) and SG predictors with parameters α = 0, β = −5, 0, 1, 5, and τ = 0.1; (b) GPR (G) and SG predictors with parameters α = 0.05, β = −5, 0, 2, 5, and τ = 0.1.

Figure 3. (a) GPR (G) and SG predictors with parameters α = 2, β = −5, −2, 0, 5, and τ = 0.1; (b) GPR (G) and SG predictors with parameters α = 5, β = −5, 0, 2, 5, and τ = 0.1.

Figure 4. (a) GPR (G) and SG predictors with parameters α = −0.01, β = −5, 0, 1.5, 5, and τ = 1; (b) GPR (G) and SG predictors with parameters α = 0, β = −5, 0, 2, 5, and τ = 1.

Figure 5. (a) GPR (G) and SG predictors with parameters α = 0.5, β = −5, 0, 1.5, 5, and τ = 1.

Figure 6. (a) GPR (G) and SG predictors with parameters α = 1.5, β = −5, −1, 0, 5, and τ = 1.5; (b) GPR (G) and SG predictors with parameters α = 4, β = −5, 0, 2, 5, and τ = 1.5.

Figure 7. (a) GPR (G) and SG predictors with parameters α = −0.1, β = −5, 0, 2, 4, and τ = 2; (b) GPR (G) and SG predictors with parameters α = 5, β = −5, 0, 2, 5, and τ = 2.

3. If a Gaussian process is used for the errors, i.e., $\beta = 0$, then there is no difference between the two distributions when $\alpha \le 0$ and $\tau$ is small (see Figs. 1, 2, 3, and 4).

4. For fixed values of $\alpha$ and moderate values of $\tau$, the difference between the two distributions is very clear and appears to be an increasing function of $|\beta|$ (see Figs. 4, 5).

5. For fixed values of $\alpha$ and large values of $\tau$, there is a huge difference between the two distributions (see Fig. 8).

6. Conclusions

In this article, the nonlinear regression model $Y(t) = f(t) + \varepsilon(t)$ has been tackled from a Bayesian viewpoint by assuming two skew Gaussian processes on $f(t)$ and $\varepsilon(t)$. It is shown that, under this assumption, the predictive density at a new input has a closed form. Also, we studied the GPR predictor under the assumption that the errors violate the assumption of Gaussianity. If the errors depart from Gaussianity to skew Gaussianity, then the GPR predictor will be affected and may lead to unrealistic estimates. The skew Gaussian process for regression addressed in this paper has several advantages over the GPR, which encourage us to continue this work in the future. We highlight some possible directions.

1. Studying the effect of the choice of the covariance function on the skew Gaussian process predictor.

2. Developing methods for estimating the hyperparameters of the model.

3. Prediction at several inputs.

4. Defining more robust models by using more general distributions either on the input function $f(t)$ or on the error term. For such future work, one may utilize the work of Lachos et al. (2010a) and Da Silva-Ferreira et al. (2011) by assuming that either the input function or the error term follows a random process whose finite dimensional distributions are scale mixtures of skew normal (SMSN) distributions. A random vector $Y$ is said to follow a scale mixture of skew normal distributions, denoted by $Y \sim SMSN_n(\mu, \Sigma, \lambda, H)$, if $Y$ has the stochastic representation $Y = \mu + c^{1/2}(U)\, Z$, where $Z \sim SN_n(0, \Sigma, \lambda)$, $c(\cdot)$ is a weight function, and $U$ is a positive mixing random variable with cdf $H(\cdot)$, independent of $Z$. The pdf of $Y$ is

$$P(y) = 2 \int_0^\infty \phi_n\big(y; \mu, c(u)\Sigma\big)\, \Phi\big(c^{-1/2}(u)\, \lambda^T \Sigma^{-1/2} (y - \mu)\big)\, dH(u).$$

Lachos et al. (2010b) showed that the SMSN family includes several known families such as the skew-t, skew-slash, and skew-Cauchy families. This opens the way for further research on more robust models.

Although the process whose finite dimensional distributions are SMSN is very general, we face several computational challenges when finding the estimates of the hyperparameters. These challenges are due to the integration in the pdf of $Y \sim SMSN_n(\mu, \Sigma, \lambda, H)$. So, instead of conducting the numerical calculations, it could be easier to use an intensive statistical computing algorithm to calculate the integral in the pdf $P(y)$. Since intensive computing requires large samples from the pdf of $Y$, we may utilize the stochastic representation $Y = \mu + c^{1/2}(U)\, Z$ for such simulation purposes.
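As an illustration (ours; the choices $c(u) = 1/u$ and $U \sim \mathrm{Gamma}(\nu/2, \nu/2)$, giving a skew-t, are one known special case, not the paper's prescription), the representation can be sampled as follows:

```python
# Sample Y = mu + c^{1/2}(U) Z with Z ~ SN_n(0, Sigma, lam) and c(u) = 1/u.
import numpy as np
from scipy.stats import norm

def rsmsn(mu, Sigma, lam, nu, size=1, rng=None):
    rng = rng or np.random.default_rng()
    L = np.linalg.cholesky(Sigma)
    out = []
    while len(out) < size:
        z = L @ rng.standard_normal(len(lam))    # candidate from phi_n(.; 0, Sigma)
        if rng.uniform() > norm.cdf(lam @ z):    # accept-reject step for SN_n(0, Sigma, lam)
            continue
        u = rng.gamma(nu / 2.0, 2.0 / nu)        # mixing variable U ~ Gamma(nu/2, nu/2)
        out.append(mu + z / np.sqrt(u))          # c^{1/2}(u) Z = Z / sqrt(u)
    return np.array(out)
```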

Azzalini and Capitanio (1999) pointed out that the MLE of the skewness parameter of the multivariate skew normal distribution may diverge with positive probability. They also noticed that the Fisher information matrix is singular when the skewness parameter is zero. For the multivariate closed skew normal distribution, these issues have been considered in only a few papers. Here we refer to the work of Arellano-Valle et al. (2005), who used the skew normal distribution to model both the random effects and the error terms in the linear mixed effects model. They also showed that the response data vector has a multivariate closed skew normal distribution, and they derived and implemented an EM algorithm to find the MLEs of all parameters. According to the literature, the above issues concerning the MLEs of the closed skew normal distribution parameters are still not explored enough. Hence, we believe that further research should be conducted; for example, the estimation of the SGP model parameters using the penalized maximum likelihood method could be called for. We leave these issues to a separate article.


References

Allard, D., Naveau, P. (2007). A new spatial skew-normal random field model. Commun. Statist.

Theory. Meth. 36:1821–1834.

Alodat, M. T., Aludaat, K. M. (2007). A skew Gaussian process. Pak. J.Statist. 23:89–97.

Alodat, M. T., AL-Rawwash, M. Y. (2009). Skew Gaussian random field. J.Computat. Appl. Math.

232(2):496–504.

Arellano-Valle, R. B., Bolfarine, H., Lachos, V. H. (2005). Skew-normal Linear Mixed models. J.Data

Sci. 3:415–438.

Azzalini, A. (1985). A class of distributions which includes the normal ones. Scand. J. Statist. 12:171–178.

Azzalini, A. (1986). Further results on a class of distributions which includes the normal ones.

Statistica 46:199–208.


Azzalini, A., Capitanio, A. (1999). Statistical applications of the multivariate skew normal distribution. J. Roy. Statist. Soc. Ser. B 61:579–602.

Azzalini, A., Dalla Valle, A. (1996). The multivariate skew-normal distribution. Biometrika

83:715–726.

Brahim-Belhouari, S., Bermak, A. (2004). Gaussian process for non-stationary time series prediction.

Computat. Statist. Data Anal. 47:705–712.

Buccianti, A. (2005). ‘Meaning of the λ parameter of skew–normal and log–skew normal distributions

in fluid geochemistry’ a CODAWORK’05.

Robert, C. P., Casella, G. (2004). Monte Carlo Statistical Methods. New York: Springer.

Da Silva-Ferreira, C., Bolfarine, H., Lachos, V. (2011). Skew-scale mixture of skew-normal distribu-

tions. Statist. Methodol. 8:154–181.

Fan, J., Peng, H. (2004). Nonconcave penalized likelihood with diverging number of Parameters.

Ann. Statist. 32:928–961.

Fyfe, C., Leen, G., Lai, P. L. (2008). Gaussian processes for canonical correlation analysis. Neurocomputing 71:3077–3088.

Genton, M. (2004). Skew-Elliptical Distributions and Their Applications: A Journey Beyond Nor-

mality. Boca Raton, FL: Chapman & Hall/CRC.

Girard, A., Kocijan, J., Murray-Smith, R., Rasmussen, C. E. (2004). Gaussian process model based

predictive control. Proc. Amer. Control Conf . Boston.

Girard, A., Rasmussen, C. E., Murray-Smith, R. M. (2002). Gaussian Process priors with uncer-

tain Inputs: Multiple-Step-Ahead Prediction. Technical Report TR-2002-119, Department of

computing Science, University of Glasgow.

González-Farías, G., Domínguez-Molina, J., Gupta, A. (2004). Additive properties of skew normal random vectors. J. Statist. Plan. Infer. 126:521–534.

Kuss, M. (2006). Gaussian process models for robust regression, classification, and reinforcement

learning. Ph.D. thesis, Technische Universität Darmstadt.

Lachos, V., Labra, F., Bolfarine, H., Gosh, H. (2010a). Multivariate measurements error models based

on scale mixtures of the skew-normal distribution. Statistics 44:541–556.

Lachos, V. H., Ghosh, P., Arellano-Valle, R. B. (2010b). Likelihood based inference for skew-normal

independent linear mixed models. Statistica Sinica 20:303–322.

Macke, J. H., Gerwinn, S., White, L. E., Kaschube, M., Bethge, M. (2010). Gaussian process methods

for estimating Cortical maps.

Neal, R. M. (1995). Bayesian learning for neural networks. Ph.D., thesis, Dept. of Computer Science,

University of Toronto.

O’Hagan, A. (1978). On curve fitting and optimal design for prediction. J. Roy. Statist. Soc. Ser. B 40:1–42.

Rasmussen, C. E., Williams, C. (2006). Gaussian Processes for Machine Learning. Cambridge, MA:

MIT press.

Rasmussen, C. E. (1996). Evaluation of Gaussian Processes and other methods for non-linear

regression, Ph.D. thesis, Dept. of Computer Science, University of Toronto.


Schmidt, A. M., Conceição, M. F., Moreira, G. A. (2008). Investigating the sensitivity of Gaussian processes to the choice of their correlation functions and prior specifications. J. Statist. Computat. Simul. 78(8):681–699.

Schott, J. R. (1997). Matrix Analysis for Statistics. New York: Wiley-Interscience.

Vanhatalo, J., Jylänki, P., Vehtari, A. (2009). Gaussian process regression with Student-t likelihood.

In: Bengio Y., Schuurmans D., Lafferty J., Williams C. K. I., Culotta A. Eds. Advances in Neural

Information Processing Systems 22:1910–1918.

Williams, C. K. I., Rasmussen, C. E. (1996) Gaussian processes for regression. Adv. Neur. Inform.

Process. Syst. 8:514–520.

Zhang, H., El-Shaarawi, A. (2009). On spatial skew Gaussian process applications. Environmetrics

10:982.


Appendix A

The results of this appendix are quoted from Genton (2004).

Proposition A.1 Let $Y_1, \ldots, Y_n$ be independent random vectors with $Y_i \sim CSN_{p_i,q_i}(\mu_i, \Sigma_i, D_i, v_i, \Delta_i)$. Then the joint distribution of $Y_1, \ldots, Y_n$ is $Y = (Y_1^T, \ldots, Y_n^T)^T \sim CSN_{p^+,q^+}(\mu^+, \Sigma^+, D^+, v^+, \Delta^+)$, where

$$p^+ = \sum_{i=1}^{n} p_i, \quad q^+ = \sum_{i=1}^{n} q_i, \quad \mu^+ = (\mu_1^T, \ldots, \mu_n^T)^T, \quad \Sigma^+ = \oplus_{i=1}^{n} \Sigma_i,$$

$$D^+ = \oplus_{i=1}^{n} D_i, \quad v^+ = (v_1^T, \ldots, v_n^T)^T, \quad \Delta^+ = \oplus_{i=1}^{n} \Delta_i,$$

and

$$A \oplus B = \begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix}.$$

Proposition A.2 Let $Y \sim CSN_{p,q}(\mu, \Sigma, D, v, \Delta)$ and let $A$ be an $n \times p$ matrix of rank $n$. Then $AY \sim CSN_{n,q}(\mu_A, \Sigma_A, D_A, v, \Delta_A)$, where $\mu_A = A\mu$, $\Sigma_A = A\Sigma A^T$, $D_A = D\Sigma A^T \Sigma_A^{-1}$, and $\Delta_A = \Delta + D\Sigma D^T - D\Sigma A^T \Sigma_A^{-1} A\Sigma D^T$.

Proposition A.3 If $Y \sim CSN_{p,q}(\mu, \Sigma, D, v, \Delta)$, then for two subvectors $Y_1$ and $Y_2$, where $Y^T = (Y_1^T, Y_2^T)$, $Y_1$ is $k$-dimensional, $1 \le k \le p$, and $\mu, \Sigma, D$ are partitioned as

$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}, \quad D = (D_1 \;\; D_2),$$

with $\mu_1$ of dimension $k \times 1$, $\Sigma_{11}$ of dimension $k \times k$, and $D_1$ of dimension $q \times k$, the conditional distribution of $Y_2$ given $Y_1 = y_{10}$ is

$$CSN_{p-k,q}\left(\mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(y_{10} - \mu_1),\; \Sigma_{22.1},\; D_2,\; v - D^*(y_{10} - \mu_1),\; \Delta\right),$$

where

$$D^* = D_1 + D_2 \Sigma_{21} \Sigma_{11}^{-1} \quad \text{and} \quad \Sigma_{22.1} = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}.$$

Proposition A.4 If $Y \sim CSN_{p,q}(\mu, \Sigma, D, v, \Delta)$, then the moment generating function of $Y$ is

$$M_Y(s) = \frac{\Phi_q\big(D\Sigma s; v, \Delta + D\Sigma D^T\big)}{\Phi_q\big(0; v, \Delta + D\Sigma D^T\big)}\, e^{s^T\mu + \frac{1}{2}s^T\Sigma s}, \quad s \in \mathbb{R}^p.$$

Proposition A.5 If $Y_1$ and $Y_2$ are independent vectors such that $Y_i \sim CSN_{p,q_i}(\mu_i, \Sigma_i, D_i, v_i, \Delta_i)$, $i = 1, 2$, then $Y_1 + Y_2 \sim CSN_{p,q_1+q_2}(\mu_1 + \mu_2, \Sigma_1 + \Sigma_2, D^\circ, v^\circ, \Delta^\circ)$, where

$$D^\circ = \begin{pmatrix} D_1 \Sigma_1 (\Sigma_1 + \Sigma_2)^{-1} \\ D_2 \Sigma_2 (\Sigma_1 + \Sigma_2)^{-1} \end{pmatrix}, \qquad \Delta^\circ = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix},$$

and

$$A_{11} = \Delta_1 + D_1 \Sigma_1 D_1^T - D_1 \Sigma_1 (\Sigma_1 + \Sigma_2)^{-1} \Sigma_1 D_1^T,$$

$$A_{22} = \Delta_2 + D_2 \Sigma_2 D_2^T - D_2 \Sigma_2 (\Sigma_1 + \Sigma_2)^{-1} \Sigma_2 D_2^T,$$

$$A_{12} = A_{21}^T = -D_1 \Sigma_1 (\Sigma_1 + \Sigma_2)^{-1} \Sigma_2 D_2^T, \qquad v^\circ = (v_1^T, v_2^T)^T.$$

Proposition A.6 If $X \sim SN_n(\mu, \Sigma, \alpha)$, then:

i. $EX = \mu + \sqrt{\frac{2}{\pi}}\, \delta$, where $\delta = \frac{\Sigma\alpha}{\sqrt{1 + \alpha^T \Sigma \alpha}}$.

ii. $\mathrm{Cov}(X) = \Sigma - \frac{2}{\pi}\, \delta\delta^T$.

Appendix B

B.1. Joint Density of Data and Output. The aim of this section is to derive the joint density of $f^* = f(t^*)$ and $Y$. For simplicity, we assume that the skew Gaussian processes used here possess fixed skewness in all directions. Since $f(t)$ is assumed to have a skew Gaussian process prior, then

$$\begin{pmatrix} f \\ f^* \end{pmatrix} \sim CSN_{n+1,1}\big(0, \Omega, \alpha 1_{n+1}^T, 0, 1\big), \qquad \varepsilon \sim CSN_{n,1}\big(0, \tau^2 I_n, \beta 1_n^T, 0, 1\big),$$

where $1_{n+1}$ denotes the column of ones of size $(n+1)$, $I_n$ is the identity matrix of size $n \times n$, and $\Omega$ is the $(n+1)\times(n+1)$ covariance matrix of $(f^T, f^*)^T$ defined in Sec. 2. Since $(f^T, f^*)^T$ is independent of $\varepsilon(t)$, then by Proposition A.1 we have that

$$\begin{pmatrix} f \\ f^* \\ \varepsilon \end{pmatrix} \sim CSN_{2n+1,2}\big(\mu^+, \Sigma^+, D^+, v^+, \Delta^+\big),$$

where

$$\mu^+ = \big(0_{1\times n}^T, 0, 0_{1\times n}^T\big)^T, \quad v^+ = (0, 0)^T, \quad \Delta^+ = I_2,$$

$$D^+ = \begin{pmatrix} \alpha 1_{n+1}^T & 0_{1\times n} \\ 0_{1\times(n+1)} & \beta 1_n^T \end{pmatrix}_{2\times(2n+1)}, \qquad \Sigma^+ = \begin{pmatrix} \Omega_{(n+1)\times(n+1)} & 0_{(n+1)\times n} \\ 0_{n\times(n+1)} & \tau^2 I_n \end{pmatrix}.$$

The conditional distribution in Theorem 3.1 is obtained by direct application of Proposition A.3 with $p = n+1$ and $q = 2$. The first step in finding the conditional distribution of $f^*$ given $Y$ and $t^*$ is to find the joint pdf of $f^*$ and $Y$. To proceed, we write $(Y^T, f^*)^T$ as a linear combination of $(f^T, f^*, \varepsilon^T)^T$, i.e.,

$$\begin{pmatrix} Y \\ f^* \end{pmatrix} = \begin{pmatrix} f + \varepsilon \\ f^* \end{pmatrix} = \begin{pmatrix} I_n & 0_{n\times 1} & I_n \\ 0_{n\times 1}^T & 1 & 0_{n\times 1}^T \end{pmatrix} \begin{pmatrix} f \\ f^* \\ \varepsilon \end{pmatrix}.$$

To simplify the notation, let

$$A_{(n+1)\times(2n+1)} = \begin{pmatrix} I_n & 0_{n\times 1} & I_n \\ 0_{n\times 1}^T & 1 & 0_{n\times 1}^T \end{pmatrix}.$$

It is straightforward to check that the matrix $A$ is of rank $(n+1)$. Now, we are ready to apply Proposition A.2.

Hence,

$$\begin{pmatrix} Y \\ f^* \end{pmatrix} = A \begin{pmatrix} f \\ f^* \\ \varepsilon \end{pmatrix} \sim CSN_{n+1,2}\big(\mu_A, \Sigma_A, D_A, v^+, \Delta_A\big),$$

where

$$\mu_A = A\mu^+ = 0_{(n+1)\times 1}$$

and

$$\Sigma_A = A\Sigma^+ A^T = \begin{pmatrix} \Sigma + \tau^2 I_n & k \\ k^T & k^* \end{pmatrix}_{(n+1)\times(n+1)}.$$

To proceed, we need the following matrix identity, which can be found in Schott (1997). Let $A$ be a matrix partitioned as

$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}.$$

Then

$$A^{-1} = \begin{pmatrix} \big(A_{11} - A_{12}A_{22}^{-1}A_{21}\big)^{-1} & -A_{11}^{-1}A_{12}\big(A_{22} - A_{21}A_{11}^{-1}A_{12}\big)^{-1} \\ -A_{22}^{-1}A_{21}\big(A_{11} - A_{12}A_{22}^{-1}A_{21}\big)^{-1} & \big(A_{22} - A_{21}A_{11}^{-1}A_{12}\big)^{-1} \end{pmatrix}.$$

Applying this identity, we can write $\Sigma_A^{-1}$ as

$$\Sigma_A^{-1} = \begin{pmatrix} \big(\Sigma + \tau^2 I_n - kk^{*-1}k^T\big)^{-1} & -\big(\Sigma + \tau^2 I_n\big)^{-1} k \big(k^* - k^T(\Sigma + \tau^2 I_n)^{-1}k\big)^{-1} \\ -k^{*-1}k^T\big(\Sigma + \tau^2 I_n - kk^{*-1}k^T\big)^{-1} & \big(k^* - k^T(\Sigma + \tau^2 I_n)^{-1}k\big)^{-1} \end{pmatrix}_{(n+1)\times(n+1)}.$$

By Proposition A.2,

$$D_A = D^+\Sigma^+ A^T \Sigma_A^{-1} = \begin{pmatrix} \alpha 1_{n+1}^T (\Sigma, k)^T & \alpha 1_{n+1}^T (k^T, k^*)^T \\ \beta\tau^2 1_n^T & 0 \end{pmatrix} \Sigma_A^{-1} = \begin{pmatrix} D_{11(1\times n)} & D_{12} \\ D_{21(1\times n)} & D_{22} \end{pmatrix}_{2\times(n+1)},$$

where

$$D_{11} = \alpha 1_{n+1}^T (\Sigma, k)^T \big(\Sigma + \tau^2 I_n - kk^{*-1}k^T\big)^{-1} - \alpha 1_{n+1}^T (k^T, k^*)^T k^{*-1} k^T \big(\Sigma + \tau^2 I_n - kk^{*-1}k^T\big)^{-1},$$

$$D_{12} = -\alpha 1_{n+1}^T (\Sigma, k)^T \big(\Sigma + \tau^2 I_n\big)^{-1} k \big(k^* - k^T(\Sigma + \tau^2 I_n)^{-1}k\big)^{-1} + \alpha 1_{n+1}^T (k^T, k^*)^T \big(k^* - k^T(\Sigma + \tau^2 I_n)^{-1}k\big)^{-1},$$

$$D_{21} = \beta\tau^2 1_n^T \big(\Sigma + \tau^2 I_n - kk^{*-1}k^T\big)^{-1},$$

and

$$D_{22} = -\beta\tau^2 1_n^T \big(\Sigma + \tau^2 I_n\big)^{-1} k \big(k^* - k^T(\Sigma + \tau^2 I_n)^{-1}k\big)^{-1}.$$

Finally, Proposition A.2 gives $\Delta_A = \Delta^+ + D^+\Sigma^+ D^{+T} - D^+\Sigma^+ A^T \Sigma_A^{-1} A\Sigma^+ D^{+T}$, where $\Delta^+ = I_2$,

$$D^+\Sigma^+ D^{+T} = \begin{pmatrix} \alpha^2 1_{n+1}^T \Omega\, 1_{n+1} & 0 \\ 0 & n\beta^2\tau^2 \end{pmatrix}_{2\times 2},$$

$$D^+\Sigma^+ A^T = \begin{pmatrix} \alpha 1_{n+1}^T (\Sigma, k)^T & \alpha 1_{n+1}^T (k^T, k^*)^T \\ \beta\tau^2 1_n^T & 0 \end{pmatrix}_{2\times(n+1)},$$

and

$$A\Sigma^+ D^{+T} = \begin{pmatrix} \alpha (\Sigma, k)\, 1_{n+1} & \beta\tau^2 1_n \\ \alpha (k^T, k^*)\, 1_{n+1} & 0 \end{pmatrix}_{(n+1)\times 2}.$$

Hence,

$$\Delta_A = I_2 + \begin{pmatrix} \alpha^2 1_{n+1}^T \Omega\, 1_{n+1} & 0 \\ 0 & n\beta^2\tau^2 \end{pmatrix} - \begin{pmatrix} W_{11} & W_{12} \\ W_{21} & W_{22} \end{pmatrix} = \begin{pmatrix} 1 + \alpha^2 1_{n+1}^T \Omega\, 1_{n+1} - W_{11} & -W_{12} \\ -W_{21} & 1 + n\beta^2\tau^2 - W_{22} \end{pmatrix},$$

where

$$W_{11} = \Big[\alpha 1_{n+1}^T (\Sigma, k)^T \big(\Sigma + \tau^2 I_n - kk^{*-1}k^T\big)^{-1} - \alpha 1_{n+1}^T (k^T, k^*)^T k^{*-1} k^T \big(\Sigma + \tau^2 I_n - kk^{*-1}k^T\big)^{-1}\Big]\, \alpha (\Sigma, k)\, 1_{n+1}$$
$$+ \Big[-\alpha 1_{n+1}^T (\Sigma, k)^T \big(\Sigma + \tau^2 I_n\big)^{-1} k \big(k^* - k^T(\Sigma + \tau^2 I_n)^{-1}k\big)^{-1} + \alpha 1_{n+1}^T (k^T, k^*)^T \big(k^* - k^T(\Sigma + \tau^2 I_n)^{-1}k\big)^{-1}\Big]\, \alpha (k^T, k^*)\, 1_{n+1},$$

$$W_{12} = \alpha\beta\tau^2 \Big[1_{n+1}^T (\Sigma, k)^T \big(\Sigma + \tau^2 I_n - kk^{*-1}k^T\big)^{-1} - 1_{n+1}^T (k^T, k^*)^T k^{*-1} k^T \big(\Sigma + \tau^2 I_n - kk^{*-1}k^T\big)^{-1}\Big]\, 1_n,$$

$$W_{21} = \alpha\beta\tau^2 \Big[1_n^T \big(\Sigma + \tau^2 I_n - kk^{*-1}k^T\big)^{-1} (\Sigma, k)\, 1_{n+1} - 1_n^T \big(\Sigma + \tau^2 I_n\big)^{-1} k \big(k^* - k^T(\Sigma + \tau^2 I_n)^{-1}k\big)^{-1} (k^T, k^*)\, 1_{n+1}\Big],$$

and

$$W_{22} = \beta^2\tau^4\, 1_n^T \big(\Sigma + \tau^2 I_n - kk^{*-1}k^T\big)^{-1} 1_n.$$

We can now apply Proposition A.3 with $p = n+1$ and $q = 2$. To proceed, consider the following partitions of $\mu_A, \Sigma_A, D_A, v^+, \Delta_A$:

$$\Sigma_A = \begin{pmatrix} \Sigma + \tau^2 I_n & k \\ k^T & k^* \end{pmatrix} = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}, \qquad \Delta_A = \begin{pmatrix} \Delta_{11} & \Delta_{12} \\ \Delta_{21} & \Delta_{22} \end{pmatrix},$$

and

$$D_A = \begin{pmatrix} D_{11} & D_{12} \\ D_{21} & D_{22} \end{pmatrix} = \begin{pmatrix} D_1 & D_2 \end{pmatrix}, \quad \text{where} \quad D_1 = \begin{pmatrix} D_{11} \\ D_{21} \end{pmatrix}_{2\times n} \ \text{and} \ D_2 = \begin{pmatrix} D_{12} \\ D_{22} \end{pmatrix}_{2\times 1}.$$

Proposition A.3 then yields

$$f^* \mid Y, t^* \sim CSN_{1,2}\left(k^T(\Sigma + \tau^2 I_n)^{-1} Y,\; k^* - k^T(\Sigma + \tau^2 I_n)^{-1} k,\; D_2,\; -D^* Y,\; \Delta_A\right), \tag{6}$$

where

$$D^* = D_1 + D_2\, k^T(\Sigma + \tau^2 I_n)^{-1}.$$

This proves part (i).

Proof of (ii). Here, we find the mean and the variance of $f^* \mid Y, t^*$ by applying Proposition A.4. The moment generating function of $f^* \mid Y, t^*$ is

$$M_{f^*\mid Y,t^*}(s) = \frac{\Phi_2\big(D_2\sigma^{*2}s; -D^*Y, \Delta_A + \sigma^{*2}D_2D_2^T\big)}{\Phi_2\big(0_{2\times 1}; -D^*Y, \Delta_A + \sigma^{*2}D_2D_2^T\big)}\, e^{s\mu^* + \frac{1}{2}\sigma^{*2}s^2}, \quad s \in \mathbb{R},$$

where $\sigma^{*2} = k^* - k^T(\Sigma + \tau^2 I_n)^{-1}k$ and $\mu^* = k^T(\Sigma + \tau^2 I_n)^{-1}Y$. Let $\Phi_2^{(j)}(\cdot, \cdot)$ denote the first partial derivative of $\Phi_2(\cdot, \cdot)$ with respect to the $j$th component for $j = 1, 2$, and let $\Phi_2^{(ij)}(\cdot, \cdot)$ denote the mixed second partial derivative of $\Phi_2(\cdot, \cdot)$. Now we find the mean and the variance of $f^* \mid Y, t^*$:

$$E(f^* \mid Y, t^*) = \frac{\partial}{\partial s} M_{f^*\mid Y,t^*}(s)\Big|_{s=0}$$
$$= \frac{1}{\Phi_2\big(0_{2\times 1}; -D^*Y, \Delta_A + \sigma^{*2}D_2D_2^T\big)}\Big[\Phi_2^{(1)}\big(D_2\sigma^{*2}s; -D^*Y, \Delta_A + \sigma^{*2}D_2D_2^T\big)\, D_{12}\sigma^{*2}$$
$$+ \Phi_2^{(2)}\big(D_2\sigma^{*2}s; -D^*Y, \Delta_A + \sigma^{*2}D_2D_2^T\big)\, D_{22}\sigma^{*2}$$
$$+ \Phi_2\big(D_2\sigma^{*2}s; -D^*Y, \Delta_A + \sigma^{*2}D_2D_2^T\big)\big(\mu^* + \sigma^{*2}s\big)\Big]\, e^{s\mu^* + \frac{1}{2}\sigma^{*2}s^2}\Big|_{s=0}.$$

Finally, we find that

$$E(f^* \mid y, t^*) = \mu^* + \sigma^{*2}\, \frac{D_{12}\,\Phi_2^{(1)}\big(0_{2\times 1}; -D^*Y, \Delta_A + \sigma^{*2}D_2D_2^T\big) + D_{22}\,\Phi_2^{(2)}\big(0_{2\times 1}; -D^*Y, \Delta_A + \sigma^{*2}D_2D_2^T\big)}{\Phi_2\big(0_{2\times 1}; -D^*Y, \Delta_A + \sigma^{*2}D_2D_2^T\big)},$$

where $\Phi_2^{(1)}$ is the first derivative of $\Phi_2$ with respect to the first component, and $\Phi_2^{(2)}$ is the first derivative of $\Phi_2$ with respect to the second component.

Also, we need $E(f^{*2} \mid Y, t^*)$ to calculate the variance of $f^* \mid Y, t^*$. So

$$E(f^{*2} \mid Y, t^*) = \frac{1}{\Phi_2\big(0_{2\times 1}; -D^*Y, \Delta_A + \sigma^{*2}D_2D_2^T\big)}\, \frac{\partial^2}{\partial s^2}\Big[\Phi_2\big(D_2\sigma^{*2}s; -D^*Y, \Delta_A + \sigma^{*2}D_2D_2^T\big)\, e^{s\mu^* + \frac{1}{2}\sigma^{*2}s^2}\Big]_{s=0}$$
$$= \frac{1}{\Phi_2\big(0_{2\times 1}; -D^*Y, \Delta_A + \sigma^{*2}D_2D_2^T\big)}\, \frac{\partial}{\partial s}\Big[\Big(\Phi_2^{(1)}\big(D_2\sigma^{*2}s; -D^*Y, \Delta_A + \sigma^{*2}D_2D_2^T\big)\, D_{12}\sigma^{*2}$$
$$+ \Phi_2^{(2)}\big(D_2\sigma^{*2}s; -D^*Y, \Delta_A + \sigma^{*2}D_2D_2^T\big)\, D_{22}\sigma^{*2} + \Phi_2\big(D_2\sigma^{*2}s; -D^*Y, \Delta_A + \sigma^{*2}D_2D_2^T\big)\big(\mu^* + \sigma^{*2}s\big)\Big)\, e^{s\mu^* + \frac{1}{2}\sigma^{*2}s^2}\Big]_{s=0}.$$

After substituting $s = 0$, $E(f^{*2} \mid Y, t^*)$ reduces to

$$E(f^{*2} \mid Y, t^*) = 2\Big(\Phi_2^{(11)}\big(0_{2\times 1}; -D^*Y, \Delta_A + \sigma^{*2}D_2D_2^T\big)(D_{12}\sigma^{*2})^2 + \Phi_2^{(12)}\big(0_{2\times 1}; -D^*Y, \Delta_A + \sigma^{*2}D_2D_2^T\big)\, D_{12}\sigma^{*2}\, D_{22}\sigma^{*2}$$
$$+ \Phi_2^{(21)}\big(0_{2\times 1}; -D^*Y, \Delta_A + \sigma^{*2}D_2D_2^T\big)\, D_{12}\sigma^{*2}\, D_{22}\sigma^{*2} + \Phi_2^{(22)}\big(0_{2\times 1}; -D^*Y, \Delta_A + \sigma^{*2}D_2D_2^T\big)(D_{22}\sigma^{*2})^2\Big)$$
$$+ 4\Big(\Phi_2^{(1)}\big(0_{2\times 1}; -D^*Y, \Delta_A + \sigma^{*2}D_2D_2^T\big)\, D_{12}\sigma^{*2} + \Phi_2^{(2)}\big(0_{2\times 1}; -D^*Y, \Delta_A + \sigma^{*2}D_2D_2^T\big)\, D_{22}\sigma^{*2}\Big)\mu^* + \mu^{*2} + \sigma^{*2}.$$

Hence,

$$\mathrm{var}(f^* \mid Y, t^*) = E(f^{*2} \mid Y, t^*) - \big[E(f^* \mid Y, t^*)\big]^2$$
$$= 2\Big(\Phi_2^{(11)}\big(0_{2\times 1}; -D^*Y, \Delta_A + \sigma^{*2}D_2D_2^T\big)(D_{12}\sigma^{*2})^2 + \Phi_2^{(12)}\big(0_{2\times 1}; -D^*Y, \Delta_A + \sigma^{*2}D_2D_2^T\big)\, D_{12}\sigma^{*2}\, D_{22}\sigma^{*2}$$
$$+ \Phi_2^{(21)}\big(0_{2\times 1}; -D^*Y, \Delta_A + \sigma^{*2}D_2D_2^T\big)\, D_{12}\sigma^{*2}\, D_{22}\sigma^{*2} + \Phi_2^{(22)}\big(0_{2\times 1}; -D^*Y, \Delta_A + \sigma^{*2}D_2D_2^T\big)(D_{22}\sigma^{*2})^2\Big)$$
$$+ 4\Big(\Phi_2^{(1)}\big(0_{2\times 1}; -D^*Y, \Delta_A + \sigma^{*2}D_2D_2^T\big)\, D_{12}\sigma^{*2} + \Phi_2^{(2)}\big(0_{2\times 1}; -D^*Y, \Delta_A + \sigma^{*2}D_2D_2^T\big)\, D_{22}\sigma^{*2}\Big)\mu^* + \mu^{*2} + \sigma^{*2}$$
$$- \left(\mu^* + \sigma^{*2}\Big[D_{12}\,\Phi_2^{(1)}\big(0_{2\times 1}; -D^*Y, \Delta_A + \sigma^{*2}D_2D_2^T\big) + D_{22}\,\Phi_2^{(2)}\big(0_{2\times 1}; -D^*Y, \Delta_A + \sigma^{*2}D_2D_2^T\big)\Big]\right)^2,$$

which simplifies to

$$\mathrm{var}(f^* \mid Y, t^*) = 2\sigma^{*4}\Big(\Phi_2^{(11)}\big(0_{2\times 1}; -D^*Y, \Delta_A + \sigma^{*2}D_2D_2^T\big)\, D_{12}^2 + \Phi_2^{(12)}\big(0_{2\times 1}; -D^*Y, \Delta_A + \sigma^{*2}D_2D_2^T\big)\, D_{12}D_{22}$$
$$+ \Phi_2^{(21)}\big(0_{2\times 1}; -D^*Y, \Delta_A + \sigma^{*2}D_2D_2^T\big)\, D_{12}D_{22} + \Phi_2^{(22)}\big(0_{2\times 1}; -D^*Y, \Delta_A + \sigma^{*2}D_2D_2^T\big)\, D_{22}^2\Big),$$

where $\Phi_2^{(11)}$ is the derivative of $\Phi_2^{(1)}$ with respect to the first component, $\Phi_2^{(12)}$ is the derivative of $\Phi_2^{(1)}$ with respect to the second component, $\Phi_2^{(21)}$ is the derivative of $\Phi_2^{(2)}$ with respect to the first component, and $\Phi_2^{(22)}$ is the derivative of $\Phi_2^{(2)}$ with respect to the second component.


Proof of (ii) and (iii)(a)-(iii)(b). Since $k(\cdot, \cdot)$ is isotropic and $t_1, t_2, \ldots, t_n$ are the vertices of a regular polygon with center located at $t^*$, then $k(t_i, t^*)$ is constant for each $i = 1, \ldots, n$, and the matrix $\Sigma = \big(k(t_i, t_j)\big)_{i,j=1}^{n}$ is circulant. Let $k_0 = k(t_i, t^*)$. Then $k = k(t^*) = k_0 1_n$. Moreover, the matrix $(\Sigma + \tau^2 I_n)^{-1}$ is also circulant. Therefore,

$$b(\tau^2, \beta^2, n) = \sqrt{\frac{2}{\pi}}\, \frac{\tau^2\beta}{\sqrt{1 + \beta^2\tau^2 n}}\, k^T(\Sigma + \tau^2 I_n)^{-1} 1_n = \sqrt{\frac{2}{\pi}}\, \frac{\tau^2\beta k_0}{\sqrt{1 + \beta^2\tau^2 n}}\, 1_n^T(\Sigma + \tau^2 I_n)^{-1} 1_n.$$

Since $(\Sigma + \tau^2 I_n)^{-1}$ is circulant and $1_n$ is an eigenvector of any circulant matrix, then

$$(\Sigma + \tau^2 I_n)^{-1} 1_n = \frac{1}{L_n}\, 1_n, \quad \text{where} \quad L_n = \tau^2 + \sum_{i=1}^{n} k(t_1, t_i).$$

Hence,

$$b(\tau^2, \beta^2, n) = \sqrt{\frac{2}{\pi}}\, \frac{\tau^2\beta k_0}{\sqrt{1 + \beta^2\tau^2 n}}\, \frac{n}{L_n}.$$

It is easy to see that $b(\tau^2, \beta^2, n) > 0$ for all nonzero values of $\tau$, $\beta$, and $k_0$. If $n^{-0.5}\sum_{i=1}^{n} k(t_1, t_i) \to c \ne 0$ as $n \to \infty$, then

$$b(\tau^2, \beta^2, n) = \sqrt{\frac{2}{\pi}}\, \frac{\tau^2\beta k_0\, n}{\sqrt{1 + \beta^2\tau^2 n}\left(\tau^2 + \sum_{i=1}^{n} k(t_1, t_i)\right)} = \sqrt{\frac{2}{\pi}}\, \frac{\tau^2\beta k_0\, \sqrt{n}}{\sqrt{1 + \beta^2\tau^2 n}\left(\dfrac{\tau^2}{\sqrt{n}} + \dfrac{\sum_{i=1}^{n} k(t_1, t_i)}{\sqrt{n}}\right)}.$$

Hence,

$$\lim_{n\to\infty} b(\tau^2, \beta^2, n) = \sqrt{\frac{2}{\pi}}\, \frac{\tau^2\beta k_0}{|\beta|\tau\, c} \cdot \frac{1}{\tau} \cdot \tau = \sqrt{\frac{2}{\pi}}\, \frac{\tau\beta k_0}{|\beta|\, c}.$$

To show that $\lim_{\tau\to\infty} b(\tau^2, \beta^2, n) = 0$, we notice that $(\Sigma + \tau^2 I_n)^{-1} 1_n = \frac{1}{L_n} 1_n$. Consequently,

$$b(\tau^2, \beta^2, n) = \sqrt{\frac{2}{\pi}}\, \frac{\tau^2\beta k_0\, n}{\sqrt{1 + \beta^2\tau^2 n}\, L_n} = \sqrt{\frac{2}{\pi}}\, \frac{n\beta k_0\, \tau^2}{\left(\tau^2 + \sum_{i=1}^{n} k(t_1, t_i)\right)\sqrt{1 + \beta^2\tau^2 n}}.$$

Hence $\lim_{\tau\to\infty} b(\tau^2, \beta^2, n) = 0$.

Appendix C

i. Confidence intervals for hyperparameters

1. Simulate a large sample, say $(\tau^{(i)}, \sigma^{2(i)}, \alpha^{(i)}, \beta^{(i)}, \lambda^{(i)})$, $i = 1, \ldots, N$, from $P(\tau, \sigma^2, \alpha, \beta, \lambda \mid Y)$.
2. Let $q^N_{0.025}(\tau)$, $q^N_{0.025}(\sigma^2)$, $q^N_{0.025}(\alpha)$, $q^N_{0.025}(\beta)$, $q^N_{0.025}(\lambda)$ be the 2.5% percentiles of the samples $\tau^{(i)}, \sigma^{2(i)}, \alpha^{(i)}, \beta^{(i)}, \lambda^{(i)}$, $i = 1, \ldots, N$, respectively, and $q^N_{0.975}(\tau)$, $q^N_{0.975}(\sigma^2)$, $q^N_{0.975}(\alpha)$, $q^N_{0.975}(\beta)$, $q^N_{0.975}(\lambda)$ be the 97.5% percentiles of the same samples, respectively.
3. Then $[q^N_{0.025}(\tau), q^N_{0.975}(\tau)]$, $[q^N_{0.025}(\sigma^2), q^N_{0.975}(\sigma^2)]$, $[q^N_{0.025}(\alpha), q^N_{0.975}(\alpha)]$, $[q^N_{0.025}(\beta), q^N_{0.975}(\beta)]$, and $[q^N_{0.025}(\lambda), q^N_{0.975}(\lambda)]$ are 95% confidence intervals for $\tau, \sigma^2, \alpha, \beta$, and $\lambda$, respectively.

ii. Algorithm to solve Eq. (8)

Inputs:
- Increments $d_1, d_2$, a large real number $L_0$, and a precision $\omega$.
- A smooth grid for the space of $t^*$, say $t^*_1, \ldots, t^*_M$.
- Grid and sample sizes $M$ and $N$, respectively.

Start.
1. Set $l = -L_0$ and $u = -L_0$ (searching for the solution in $(-L_0, L_0) \times (-L_0, L_0)$).
2. For each $t^*_j$, simulate a large sample from the posterior, say $\Theta_1, \ldots, \Theta_N$.
3. Find $T_i = \min_{j=1,\ldots,M}\{\Psi(\Theta_i, t^*_j)\}$ and $U_i = \max_{j=1,\ldots,M}\{\Psi(\Theta_i, t^*_j)\}$, $i = 1, \ldots, N$.
4. Estimate $p$, the probability in (8), via $\hat{p} = \frac{1}{N}\sum_{i=1}^{N} I(l \le T_i \text{ and } U_i \le u)$.
5. While $|\hat{p} - 0.95| > \omega$, update $l \leftarrow l + d_1$ and $u \leftarrow u + d_2$, and recompute $\hat{p}$ as in step 4.

Output: $L = l$ and $U = u$.
