
Communications in Statistics—Theory and Methods, 43: 4936–4961, 2014

Copyright © Taylor & Francis Group, LLC


ISSN: 0361-0926 print / 1532-415X online
DOI: 10.1080/03610926.2012.737498

Skew Gaussian Process for Nonlinear Regression

M. T. ALODAT1 AND E. Y. AL-MOMANI2


Downloaded by [Nanyang Technological University] at 13:47 25 April 2015

1 Department of Mathematics, Statistics and Physics, Qatar University, Qatar
2 Department of Statistics, Yarmouk University, Irbid, Jordan

In this article, we extend the Gaussian process for regression model by assuming a skew
Gaussian process prior on the input function and a skew Gaussian white noise on the
error term. Under these assumptions, the predictive density of the output function at
a new fixed input is obtained in a closed form. Also, we study the Gaussian process
predictor when the errors depart from the Gaussianity to the skew Gaussian white noise.
The bias is derived in a closed form and is studied for some special cases. We conduct a
simulation study to compare the empirical distribution function of the Gaussian process
predictor under Gaussian white noise and skew Gaussian white noise.

Keywords Gaussian process; Multivariate closed skew normal distribution; Prediction; Prior distribution.

Mathematics Subject Classification Primary 60G15; Secondary 62J02.

1. Introduction
In the statistical literature, Gaussianity (normality) has long been assumed when analyzing spatial data. The popularity of the Gaussian assumption is due to its mathematical tractability. For example, the multivariate Gaussian distribution is closed under marginalization, conditioning, and convolution. Despite these convenient properties, a large number of real data sets do not meet the Gaussian assumption because of skewness. If the analysis of such data sets relies on the Gaussian assumption, then unrealistic or nonsensical estimates will be produced. The simplest way to analyze skewed data via the Gaussian model is to Gaussianize the data, i.e., to transform them to near-Gaussian data. Such a transformation is not recommended for the following reasons. (i) Finding a suitable transformation to achieve normality is not easy in practice. (ii) Since such transformations are usually applied to the data component-wise, normality of the marginal distributions does not guarantee joint normality. Hence, the estimates might

Received May 9, 2012; Accepted October 3, 2012.


Address correspondence to M. T. Alodat, Department of Mathematics, Statistics and Physics,
Qatar University, Qatar; E-mail: alodatmts@yahoo.com
Color versions of one or more of the figures in the article can be found online at
www.tandfonline.com/lsta.

be biased. (iii) The transformed data are difficult to interpret, whereas the skewness of the original data cannot be ignored, since it has an interpretation (Buccianti, 2005).
Recently, several researchers have defined random processes that possess a skewness parameter. Alodat and Aludaat (2007) employed the skew normal theory, as presented in Genton (2004), to define a new random process called the skew Gaussian process, and they gave an application to real data. Relying on the multivariate closed skew normal distribution of González-Farías et al. (2004), Allard and Naveau (2007) defined what they called the closed skew normal random field. For more examples of skew random processes or fields, we refer the reader to Zhang and El-Sharaawi (2009) and Alodat and Al-Rawwash (2009).
The cornerstone in defining a new skew process or field is the multivariate skew normal distribution, which appeared in the pioneering works of Azzalini (1985, 1986), Azzalini and Dalla Valle (1996), and Azzalini and Capitanio (1999). The skew normal (or skew Gaussian) distribution is defined as follows. A random vector $Z_{(n \times 1)}$ is said to have an $n$-dimensional multivariate skew normal distribution if it has the probability density function (pdf)

$$P_Z(z) = 2\,\phi_n(z; 0, \Sigma)\,\Phi(\alpha^T z), \quad z \in \mathbb{R}^n, \tag{1}$$

where $\phi_n(\cdot; 0, \Sigma)$ is the pdf of $N_n(0, \Sigma)$, $\Phi(\cdot)$ is the cdf of $N(0, 1)$, and $\alpha_{(n \times 1)}$ is a vector called the skewness parameter. A more general family than (1) is obtained by using the transformation $X = \mu + Z$, $\mu \in \mathbb{R}^n$. It is easy to show that the pdf of $X$ is $P_X(x) = P_Z(x - \mu)$. We use the notation $X \sim SN_n(\mu, \Sigma, \alpha)$ to denote an $n$-dimensional skew normal distribution with parameters $\mu$, $\Sigma$, and $\alpha$.
A generalization of (1) is given by González-Farías et al. (2004) as follows. Let $\mu \in \mathbb{R}^p$, let $D$ be an arbitrary $q \times p$ matrix, and let $\Sigma$ and $\Delta$ be positive definite matrices of dimensions $p \times p$ and $q \times q$, respectively. A random vector $Y$ is said to have a $p$-dimensional closed skew normal (CSN) distribution with parameters $q, \mu, \Sigma, D, v, \Delta$, denoted by $Y \sim CSN_{p,q}(\mu, \Sigma, D, v, \Delta)$, if its pdf takes the form

$$P_{p,q}(y) = C\,\phi_p(y; \mu, \Sigma)\,\Phi_q(D(y - \mu); v, \Delta), \quad y \in \mathbb{R}^p, \tag{2}$$

where $C$ is defined via

$$C^{-1} = \Phi_q(0; v, \Delta + D \Sigma D^T),$$

and $\phi_p(\cdot; \eta, \psi)$, $\Phi_p(\cdot; \eta, \psi)$ are the pdf and the cumulative distribution function (cdf) of a $p$-dimensional normal distribution with mean vector $\eta$ and covariance matrix $\psi$. Throughout this article, several lemmas and results about the multivariate CSN distribution are used extensively, so we present them in Appendix A. For their proofs, we refer the reader to González-Farías et al. (2004) or Genton (2004).
Furthermore, it has been shown that the family of skew normal distributions possesses properties that are close to or coincide with those of the normal family. Besides these closure properties, it contains the normal family as the special case $\alpha = 0$. Such properties have attracted researchers to extend well-known statistical techniques to the skew normal setting, and much work in this direction remains to be done. For example, the Gaussian process regression (GPR) model is a statistical technique introduced by Neal (1995) to treat a nonlinear regression $Y(t) = f(t) + \varepsilon(t)$ from a Bayesian viewpoint. The technique assumes a Gaussian process as a prior on the unknown function $f(t)$, while $\varepsilon(t)$ is assumed to be a white noise process. The aim is then to predict $f(t)$ at a new value of $t$. In other words, the Gaussian process provides us with a prior distribution over the space of all functions.
Since the Gaussian family is a sub-family of the skew Gaussian family, using the skew Gaussian process, i.e., a process whose finite dimensional distributions are of the form (1), as a prior on $f(t)$ allows us to define a distribution over a richer family of functions than the Gaussian one. It also allows us to extend the error term in the above regression model to have a skewed distribution, which is closer to real data than its Gaussian counterpart.
The literature shows that the GPR has significant applications in various fields of science. For example, it has been applied to model noisy data, to classification problems arising in machine learning, and to predict the inverse dynamics of a robot arm (Rasmussen and Williams, 2006). Brahim-Belhouari and Bermak (2004) applied the GPR model to predict the future value of a non-stationary time series. Schmidt et al. (2008) studied the sensitivity of GPR to the choice of correlation function. Based on a numerical study, they concluded that the predictions did not differ much among the different correlation functions. Vanhatalo et al. (2009) proposed a GPR with a Student-t likelihood by approximating the joint distribution of the process by a Student distribution. The idea behind that approximation is to make the GPR model robust against outliers. The model they proposed is analytically intractable. Kuss (2006) proposed other robust models as alternatives for GPR. Macke et al. (2010) applied the GPR to estimate the cortical map of the human brain. They modeled the brain image of their experiment, where the activity at each voxel is measured, by a Gaussian process. Fyfe et al. (2008) applied the GPR to canonical correlation analysis with an application to neuron data.
The prediction problem for the nonlinear regression $Y(t) = f(t) + \varepsilon(t)$ from a Bayesian viewpoint, when both $f(t)$ and $\varepsilon(t)$ follow skew Gaussian processes, has not yet been addressed in the literature. In this article, we extend the GPR model by assuming two independent skew Gaussian processes, one on $f(t)$ and the other on $\varepsilon(t)$. In other words, we consider the nonlinear regression model $Y_i = f(t_i) + \varepsilon(t_i)$, $i = 1, 2, \ldots, n$, i.e., for each $i$, $f(t_i)$ is measured as $Y_i$ but corrupted by the noise $\varepsilon(t_i)$. We put a skew Gaussian process prior on the function $f(t)$ and assume that the process $\varepsilon(t)$ also follows a skew Gaussian process. Under these assumptions, the following two prediction problems are considered: (i) prediction of $f(t)$ at a fixed input $t^*$, and (ii) prediction of $f(t)$ at a random input $t^*$.
The rest of this article is organized as follows. In Sec. 2, we introduce the reader to the GPR model. In Sec. 3, we generalize the GPR model by assuming a skew Gaussian process on $f(t)$ and another skew Gaussian process on $\varepsilon(t)$. Then we derive the predictive density of the output function at a new input, together with the mean and the variance of the predictive distribution. In Sec. 4, it is assumed that the GPR predictor is used to analyze data with skewed errors, and we derive its bias and variance. In Sec. 5, we conduct a simulation study to compare the new model to the Gaussian one. Finally, we state our conclusions in Sec. 6.

2. Gaussian Process for Regression


A family $\{X(t), t \in C\}$, $C \subseteq \mathbb{R}^p$, of random variables is said to constitute a Gaussian process if for every $n$ and $t_1, \ldots, t_n \in C$, the random variables $X(t_1), \ldots, X(t_n)$ have an $n$-dimensional multivariate normal distribution. The Gaussian process is used in the statistical literature as a prior process for the Bayesian analysis of several statistical problems. For example, O'Hagan (1978) was the first to use the Gaussian process as a prior process over the space of functions to treat a nonlinear regression from a Bayesian viewpoint, while an application of O'Hagan's work to Bayesian learning in networks appeared in Neal (1995).
The GPR, as presented in Neal (1995), can be illustrated as follows. Consider a set of training data $Y = (Y_1, Y_2, \ldots, Y_n)^T$, where the input vectors $t_1, t_2, \ldots, t_n \in C \subseteq \mathbb{R}^p$ and their output values $Y_1, Y_2, \ldots, Y_n$ are governed by the nonlinear regression model $Y_i = f(t_i) + \varepsilon(t_i)$, where $\varepsilon(t_1), \varepsilon(t_2), \ldots, \varepsilon(t_n)$ are iid Gaussian noises on $C$ with mean $0$ and variance $\tau^2$, and $f(\cdot)$ is an unknown function. The main question is: what is the predicted value of $f^* = f(t^*)$, the value of $f(t)$ at a new input $t^*$? To answer this question, a prior distribution on $f(t)$ is needed, i.e., a distribution over a set of functions. This prior distribution should be defined on the class of all functions defined on the space of $t$. The set of all sample paths of a Gaussian process on $C$ provides us with a rich class of such functions.


Assume that $f(t)$, $t \in C$, is a Gaussian process with covariance function $k(\cdot, \cdot)$, i.e., for every $n$ and $t_1, t_2, \ldots, t_n \in C$, we have $f = (f(t_1), \ldots, f(t_n))^T \sim N_n(0, \Sigma)$, where

$$\Sigma = \begin{pmatrix} k(t_1, t_1) & \cdots & k(t_1, t_n) \\ \vdots & \ddots & \vdots \\ k(t_n, t_1) & \cdots & k(t_n, t_n) \end{pmatrix}.$$

A suitable choice for $k(\cdot, \cdot)$ is the following covariance function:

$$k(t_i, t_j) = \exp\left\{ -\frac{1}{2} (t_i - t_j)^T \Lambda^{-1} (t_i - t_j) \right\}. \tag{3}$$

For simplicity, we may consider $\Lambda = \mathrm{diag}(\lambda_1^2, \ldots, \lambda_p^2)$. A covariance function $k(\cdot, \cdot)$ is said to be isotropic if $k(t_i, t_j)$ depends only on the distance $\lVert t_i - t_j \rVert$. For more information about other types of covariance functions, see Girard et al. (2004).
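As a minimal sketch of how the covariance function (3) and the matrix $\Sigma$ might be evaluated numerically, the following Python fragment assumes a diagonal $\Lambda$ given by a vector of length scales. The function names `sq_exp_kernel` and `gram_matrix` are illustrative choices, not part of the original article.

```python
import numpy as np

def sq_exp_kernel(ti, tj, lam):
    """Covariance (3) with Lambda = diag(lam**2): exp(-0.5 * sum(((ti - tj) / lam)**2))."""
    d = (np.asarray(ti, float) - np.asarray(tj, float)) / np.asarray(lam, float)
    return float(np.exp(-0.5 * np.dot(d, d)))

def gram_matrix(T, lam):
    """Matrix Sigma = {k(t_i, t_j)} for inputs stored as the rows of T."""
    T = np.atleast_2d(np.asarray(T, float))
    return np.array([[sq_exp_kernel(a, b, lam) for b in T] for a in T])

# Hypothetical usage: three one-dimensional inputs with unit length scale.
Sigma = gram_matrix([[0.0], [1.0], [2.0]], lam=[1.0])
```

With equal length scales the kernel depends only on $\lVert t_i - t_j \rVert$, so it is isotropic in the sense defined above.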
Since $f(t)$ is assumed to follow a Gaussian process, then, according to Quiñonero-Candela et al. (2003), the joint distribution of $f(t_1), \ldots, f(t_n)$ and $f(t^*)$ is also an $(n+1)$-dimensional multivariate Gaussian distribution, i.e.,

$$\begin{pmatrix} f(t_1) \\ \vdots \\ f(t_n) \\ f(t^*) \end{pmatrix} \sim N_{n+1}(0, \widetilde{\Sigma}),$$

with

$$\widetilde{\Sigma} = \begin{pmatrix} k(t_1, t_1) & \cdots & k(t_1, t_n) & k(t_1, t^*) \\ \vdots & \ddots & \vdots & \vdots \\ k(t_n, t_1) & \cdots & k(t_n, t_n) & k(t_n, t^*) \\ k(t^*, t_1) & \cdots & k(t^*, t_n) & k(t^*, t^*) \end{pmatrix} = \begin{pmatrix} \Sigma & k \\ k^T & k^* \end{pmatrix},$$

where

$$\Sigma = \{k(t_i, t_j)\}_{i,j=1}^{n}, \quad k = (k(t_1, t^*), \ldots, k(t_n, t^*))^T, \quad \text{and} \quad k^* = k(t^*, t^*).$$


Rasmussen (1996) showed that the predictive distribution of $f^*$ given $Y$ and $t^*$ remains Gaussian and is given by

$$P(f^* \mid Y, t^*) \sim N(\mu(t^*), \sigma^2(t^*)), \tag{4}$$

where $\mu(t^*)$ and $\sigma^2(t^*)$ are the mean and the variance of the predictive distribution (4) and are given by

$$\mu^* = \mu(t^*) = k^T (\Sigma + \tau^2 I_n)^{-1} Y \quad \text{and} \quad \sigma^2(t^*) = k^* - k^T (\Sigma + \tau^2 I_n)^{-1} k,$$

where $Y = (Y_1, Y_2, \ldots, Y_n)^T$.

The distribution (4) can be used to draw several inferential statements about $f(t^*)$. For instance, when $p = 1$, a $100(1 - \alpha)\%$ prediction interval for $f(t^*)$ is given by $[L, U]$, where $L$ and $U$ are the solutions of the following two equations:

$$\int_{-\infty}^{L} P(f^* \mid Y, t^*)\, df^* = \frac{\alpha}{2} \quad \text{and} \quad \int_{U}^{\infty} P(f^* \mid Y, t^*)\, df^* = \frac{\alpha}{2}.$$

For GPR, a $100(1 - \alpha)\%$ prediction interval for $f^*$ is

$$\mu(t^*) \pm Z_{1 - \alpha/2}\, \sigma(t^*),$$

where $Z_{1 - \alpha/2}$ is the $100(1 - \alpha/2)$th percentile of $N(0, 1)$. Moreover, the mean $\mu(t^*)$ serves as a predictor for $f(t^*)$ given the data $Y$ and $t^*$, while the variance $\sigma^2(t^*)$ serves as a measure of uncertainty in $\mu(t^*)$.
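The predictive mean, variance, and interval above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `gpr_predict` and its argument layout are hypothetical, and the inputs `Sigma`, `k`, and `k_star` are assumed to have been computed already from a covariance function.

```python
import numpy as np
from statistics import NormalDist

def gpr_predict(Sigma, k, k_star, Y, tau2, level=0.95):
    """Return (mean, variance, interval) of the Gaussian predictive distribution (4)."""
    n = len(Y)
    A = Sigma + tau2 * np.eye(n)                      # Sigma + tau^2 I_n
    mean = float(k @ np.linalg.solve(A, Y))           # k^T (Sigma + tau^2 I_n)^{-1} Y
    var = float(k_star - k @ np.linalg.solve(A, k))   # k* - k^T (Sigma + tau^2 I_n)^{-1} k
    z = NormalDist().inv_cdf(0.5 + level / 2.0)       # Z_{1 - alpha/2} with alpha = 1 - level
    half = z * np.sqrt(var)
    return mean, var, (mean - half, mean + half)
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse of $\Sigma + \tau^2 I_n$ explicitly, which is usually the more stable choice numerically.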
Now, assume that we are interested in predicting $f(t)$ at $t^*$, where $t^*$ is a random variable such that $t^* \sim N_p(\mu^*, \Sigma^*)$, i.e., we are interested in prediction at a random input. The predictive pdf of $f^*$ given $\mu^*, \Sigma^*$ is (Girard et al., 2004):

$$P(f^* \mid \mu^*, \Sigma^*, Y) = \int P(f^* \mid Y, t^*)\, P(t^*)\, dt^*. \tag{5}$$

The integral in Eq. (5) does not have a closed form. Hence, an approximation to this integral is needed in order to report inferential statements about $f^*$. Moreover, the main computational problems in GPR are the inversion of the matrix $\Sigma + \tau^2 I_n$ and obtaining the mean and variance of the predictive distribution of $f^*$ given $Y$ at a random input $t^*$. For this reason, we propose the following simple Monte Carlo approximation to (5):

$$P(f^* \mid \mu^*, \Sigma^*, Y) = \int P(f^* \mid Y, t^*)\, P(t^*)\, dt^* \simeq \frac{1}{N} \sum_{r=1}^{N} P(f^* \mid Y, t^{*(r)}),$$

where $t^{*(1)}, \ldots, t^{*(N)}$ are independent samples from $P(t^*)$. Before closing this section, we refer to Girard et al. (2002) and Rasmussen and Williams (2006), where the reader can find several analytical techniques to approximate the predictive density (5).
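The Monte Carlo approximation to (5) can be sketched as follows for a one-dimensional input. The sketch assumes the covariance function (3) with a single length scale and a normal input distribution with mean `a` and variance `B`; the names `kern` and `predictive_pdf_random_input` are hypothetical.

```python
import numpy as np

def kern(a, b, lam=1.0):
    """One-dimensional version of covariance (3)."""
    return np.exp(-0.5 * (a - b) ** 2 / lam ** 2)

def predictive_pdf_random_input(f_grid, t, Y, tau2, a, B, N=2000, seed=0):
    """Average the Gaussian predictive pdf (4) over N draws t* ~ N(a, B)."""
    rng = np.random.default_rng(seed)
    A = kern(t[:, None], t[None, :]) + tau2 * np.eye(len(t))  # Sigma + tau^2 I_n
    A_inv_Y = np.linalg.solve(A, Y)
    dens = np.zeros_like(f_grid, dtype=float)
    for t_star in rng.normal(a, np.sqrt(B), size=N):
        k = kern(t, t_star)
        mu = k @ A_inv_Y                                       # predictive mean at t*
        var = kern(t_star, t_star) - k @ np.linalg.solve(A, k) # predictive variance at t*
        dens += np.exp(-0.5 * (f_grid - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)
    return dens / N
```

The result is a finite mixture of Gaussian densities evaluated on `f_grid`, so it is non-negative and integrates to one up to grid error.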

3. Skew Gaussian Process for Nonlinear Regression


In this section, a generalization to the Gaussian process, called the skew Gaussian process
(SGP), is proposed. Under the SGP, we give a generalization to the GPR model called the
skew Gaussian process regression (SGPR) model. Then the predictive density at new inputs
is derived for the SGPR model.

Definition 3.1. A random process $Y(t)$, $t \in C \subseteq \mathbb{R}^p$, is said to be a skew Gaussian process (SGP) if for every $n \in \{1, 2, 3, \ldots\}$ and every $t_1, \ldots, t_n \in C$, the vector $(Y(t_1), \ldots, Y(t_n))^T$ follows the density (1), i.e., $Y(t)$ is a skew Gaussian process if its set of finite dimensional distributions is a subfamily of the family of distributions defined by (1).

Definition 3.2. A skew Gaussian process $Y(t)$ possesses a fixed skewness in all directions if for every $n$ and $t_1, \ldots, t_n$, the parameter $\alpha$ in (1) takes the form $\alpha 1_n$, $\alpha \in \mathbb{R}$.

Throughout this article, we will assume that for each $n$ and $t_1, \ldots, t_n \in C$, the parameter $\Sigma = \{k(t_i, t_j)\}_{i,j=1}^{n}$, where $k(\cdot, \cdot)$ is a given covariance function.

Definition 3.3. A skew Gaussian process is called a skew white noise if for every $n$ and $t_1, \ldots, t_n \in C \subseteq \mathbb{R}^p$, $\varepsilon = (\varepsilon(t_1), \ldots, \varepsilon(t_n))^T \sim SN_n(0, \tau^2 I_n, \beta 1_n)$, where $\tau, \beta \in \mathbb{R}$.

Let $\mu_A$, $\Sigma_A$, $D_A$, $v^+$, and $\Delta_A$ be as defined in Appendix B2. Under the assumption that the function $f(t)$ follows a skew Gaussian process and $\varepsilon(t)$ follows a skew white noise, we state the following theorem, which gives the predictive distribution of $f(t^*)$ given $Y, t^*$ as well as its mean and variance at the fixed input $t^*$.

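Theorem 3.1. Consider the following partitions for $\Sigma_A$, $\Delta_A$, and $D_A$:

$$\Sigma_A = \begin{pmatrix} \Sigma + \tau^2 I_n & k \\ k^T & k^* \end{pmatrix} = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}, \quad \Delta_A = \begin{pmatrix} \Delta_{11} & \Delta_{12} \\ \Delta_{21} & \Delta_{22} \end{pmatrix},$$

and

$$D_A = \begin{pmatrix} D_{11} & D_{12} \\ D_{21} & D_{22} \end{pmatrix} = \begin{pmatrix} D_1 & D_2 \end{pmatrix},$$

where

$$D_1 = \begin{pmatrix} D_{11} \\ D_{21} \end{pmatrix}_{2 \times n} \quad \text{and} \quad D_2 = \begin{pmatrix} D_{12} \\ D_{22} \end{pmatrix}_{2 \times 1}.$$

Then:
i. the conditional distribution of $f^*$ given $Y$ and $t^*$ is

$$f^* \mid Y, t^* \sim CSN_{1,2}\left( k^T (\Sigma + \tau^2 I_n)^{-1} Y,\; k^* - k^T (\Sigma + \tau^2 I_n)^{-1} k,\; D_2,\; -D^* Y,\; \Delta_A \right), \tag{6}$$

where

$$D^* = D_1 + D_2 k^T (\Sigma + \tau^2 I_n)^{-1};$$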

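ii. the predictive mean and the predictive variance are given by

$$E(f^* \mid Y, t^*) = \mu^* + \sigma^{*2} \left[ D_{12}\, \Phi_2^{(1)}\left(0_{2 \times 1}; -D^* Y, \Delta_A + \sigma^{*2} D_2 D_2^T\right) + D_{22}\, \Phi_2^{(2)}\left(0_{2 \times 1}; -D^* Y, \Delta_A + \sigma^{*2} D_2 D_2^T\right) \right],$$

and

$$\begin{aligned} \mathrm{var}(f^* \mid Y, t^*) = 2\sigma^{*4} \Big[ & \Phi_2^{(11)}\left(0_{2 \times 1}; -D^* Y, \Delta_A + \sigma^{*2} D_2 D_2^T\right) D_{12}^2 \\ & + \Phi_2^{(12)}\left(0_{2 \times 1}; -D^* Y, \Delta_A + \sigma^{*2} D_2 D_2^T\right) D_{12} D_{22} \\ & + \Phi_2^{(21)}\left(0_{2 \times 1}; -D^* Y, \Delta_A + \sigma^{*2} D_2 D_2^T\right) D_{12} D_{22} \\ & + \Phi_2^{(22)}\left(0_{2 \times 1}; -D^* Y, \Delta_A + \sigma^{*2} D_2 D_2^T\right) D_{22}^2 \Big], \end{aligned}$$

where $\Phi_2^{(i)}(\cdot, \cdot)$ is the first partial derivative of $\Phi_2(\cdot, \cdot)$ with respect to the $i$th argument for $i = 1, 2$, and $\Phi_2^{(ij)}(\cdot, \cdot)$ is the second mixed partial derivative of $\Phi_2(\cdot, \cdot)$ with respect to the $i$th and $j$th arguments for $i, j = 1, 2$.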

Proof. See Appendix A for the proof of Theorem 3.1. 

The above theorem shows that the predictive distribution of a new output follows a closed skew Gaussian distribution. As a special case, this predictive distribution reduces to (4) if the skewness is absent, i.e., if $\alpha = \beta = 0$. Another predictor of $f(t^*)$ is the median of the conditional distribution of $f^*$ given $Y$. Neither the mean nor the median of the conditional distribution in our case has a simple closed form. Furthermore, part (i) of Theorem 3.1 can be used to predict the value of $f(t)$ at a random input, say $t^*$. For instance, assume that $t^* \sim N_p(a, B)$ and we wish to predict $f^* = f(t^*)$. Since $f^* \mid Y, t^* \sim CSN_{1,2}(\mu^*, \sigma^{*2}, D_2, -D^* Y, \Delta_A)$, then using the total probability law, we write the predictive density of $f^*$ given $Y$ as follows:

$$P(f^* \mid Y) = \int_{\mathbb{R}^p} P(f^* \mid Y, t^*)\, P(t^*)\, dt^*.$$

Unfortunately, it is difficult even for GPR to find a closed form for the integral in the last equation, so an approximation for $P(f^* \mid Y)$ is needed. Here, we propose the following simple Monte Carlo approximation for the predictive distribution at a random input:

$$P(f^* \mid Y) = \int_{\mathbb{R}^p} P(f^* \mid Y, t^*)\, P(t^*)\, dt^* \cong \frac{1}{N} \sum_{r=1}^{N} P(f^* \mid Y, t^{*(r)}),$$

where $t^{*(1)}, \ldots, t^{*(N)}$ are independent samples from $P(t^*)$. Since we are putting a skew Gaussian process prior on the function $f(t)$, then for each $n$, the finite dimensional distribution of the skew Gaussian process is used as a prior for the distribution of the vector $f = (f(t_1), \ldots, f(t_n))^T$, i.e., $f \sim SN_n(0, \Sigma, \alpha 1_n)$, where $\Sigma$ is as defined in the previous sections. Since $Y = f + \varepsilon$, where $\varepsilon \sim SN_n(0, \tau^2 I_n, \beta 1_n)$, the posterior distribution of $f$ given $Y$, $t^*$ is calculated as follows:

$$P(f \mid y, t^*) = \frac{P(y, t^* \mid f)\, P(f)}{P(y, t^*)}.$$

Since the prior distribution is proper, so is the posterior distribution. So the predictive distribution of $f^*$ given $Y$ and $t^*$ is

$$P(f^* \mid y, t^*) = \int P(f, f^* \mid y, t^*)\, df = \frac{1}{P(y, t^*)} \int P(y, t^* \mid f, f^*)\, P(f, f^*)\, df.$$
It will be shown in the proof of Theorem 3.1, Appendix B1, that the last equation simplifies to the pdf of the distribution

$$CSN_{1,2}\left( k^T (\Sigma + \tau^2 I_n)^{-1} Y,\; k^* - k^T (\Sigma + \tau^2 I_n)^{-1} k,\; D_2,\; -D^* Y,\; \Delta_A \right).$$

Now if $t^*$ is assumed to have $t^* \sim N_p(a, B)$, then the predictive distribution of $f^*$ given $Y$ is obtained by averaging $P(f^* \mid Y, t^*)$ over all values of $t^*$. Since $t^*$ has a proper distribution, so does $f^*$ given $Y$. Furthermore, the strong law of large numbers implies that the estimator $\frac{1}{N} \sum_{r=1}^{N} P(f^* \mid Y, t^{*(r)})$ converges almost surely to its mean value, i.e.,

$$\frac{1}{N} \sum_{r=1}^{N} P(f^* \mid Y, t^{*(r)}) \xrightarrow{\text{a.s.}} E\left[ P(f^* \mid Y, t^{*(1)}) \right] = \int P(f^* \mid y, t^*)\, P(t^*)\, dt^* = P(f^* \mid y).$$

To simulate from $(f^* \mid Y, t^*)$, i.e., from

$$CSN_{1,2}\left( k^T (\Sigma + \tau^2 I_n)^{-1} Y,\; k^* - k^T (\Sigma + \tau^2 I_n)^{-1} k,\; D_2,\; -D^* Y,\; \Delta_A \right),$$

we may utilize the following stochastic representation of the CSN distribution (Genton, 2004; Allard and Naveau, 2007). Write $\mu^* = k^T (\Sigma + \tau^2 I_n)^{-1} Y$ and $\sigma^{*2} = k^* - k^T (\Sigma + \tau^2 I_n)^{-1} k$.
i. Let $V$ be a random vector from $N_2(-D^* Y, Q)$, where

$$Q = \Delta_A + \sigma^{*2} D_2 D_2^T.$$

ii. Let $U = V \mid V \le 0$.
iii. Let $Z = \mu^* + m^* (U + D^* Y) + \Sigma^{*1/2} H$, where

$$m^* = -\sigma^{*2} D_2^T \left( \Delta_A + \sigma^{*2} D_2 D_2^T \right)^{-1},$$

$$\Sigma^* = \sigma^{*2} - \sigma^{*4} D_2^T \left( \Delta_A + \sigma^{*2} D_2 D_2^T \right)^{-1} D_2,$$

and $H \sim N(0, 1)$. Then $Z$ is from the distribution in (6).



4. Gaussian Process for Regression with Skew Normal Errors


In this section, we consider the model $Y = f + \varepsilon$, where the error process $\varepsilon(t)$ follows a skew white noise. Then we use the Gaussian process predictor, i.e., $\hat{f} = k^T (\Sigma + \tau^2 I_n)^{-1} Y$, to predict $f(t^*)$. Under this setup, we study the effect of the skewness assumption on the mean and the variance of the GPR predictor. In the sequel, the mean and the variance of $\hat{f}$, where $Y$ is replaced by $Y = f + \varepsilon$ with $\varepsilon \sim N_n(0, \tau^2 I_n)$, are denoted by $E^{G}(\hat{f})$ and $\mathrm{var}^{G}(\hat{f})$, respectively. If $Y$ is replaced by $Y = f + \varepsilon$ with $\varepsilon \sim SN_n(0, \tau^2 I_n, \beta 1_n)$, then the mean and the variance are denoted by $E^{SG}(\hat{f})$ and $\mathrm{var}^{SG}(\hat{f})$, respectively. Under a white noise process, i.e., $\varepsilon \sim N_n(0, \tau^2 I_n)$, we have

$$E^{G}(\hat{f}) = k^T (\Sigma + \tau^2 I_n)^{-1} E(f + \varepsilon) = k^T (\Sigma + \tau^2 I_n)^{-1} f,$$

while under the assumption $\varepsilon \sim SN_n(0, \tau^2 I_n, \beta 1_n)$, we have that

$$E^{SG}(\hat{f}) = k^T (\Sigma + \tau^2 I_n)^{-1} E(f + \varepsilon) = k^T (\Sigma + \tau^2 I_n)^{-1} f + k^T (\Sigma + \tau^2 I_n)^{-1} E\varepsilon.$$

Using Proposition A.6 in Appendix A, we get

$$E(\varepsilon) = \sqrt{\frac{2}{\pi}}\, \frac{\tau^2 \beta 1_n}{\sqrt{1 + \beta^2 \tau^2 n}}.$$

Hence,

$$E^{SG}(\hat{f}) = E^{G}(\hat{f}) + \sqrt{\frac{2}{\pi}}\, k^T (\Sigma + \tau^2 I_n)^{-1} \frac{\tau^2 \beta 1_n}{\sqrt{1 + \beta^2 \tau^2 n}} = E^{G}(\hat{f}) + b(\tau^2, \beta^2, n), \text{ say.} \tag{7}$$

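From the last equation, we conclude that the GPR predictor is either increased or decreased by an amount of $b(\tau^2, \beta^2, n)$. Similarly, under the assumption $\varepsilon \sim SN_n(0, \tau^2 I_n, \beta 1_n)$, the variance $\mathrm{var}^{SG}(\hat{f})$ is obtained by applying Proposition A.6 of Appendix A. So

$$\begin{aligned} \mathrm{var}^{SG}(\hat{f}) &= k^T (\Sigma + \tau^2 I_n)^{-1} \left[ \tau^2 I_n - \frac{2 \beta^2 \tau^4}{\pi (1 + n \tau^2 \beta^2)} 1_n 1_n^T \right] (\Sigma + \tau^2 I_n)^{-1} k \\ &= \tau^2 k^T (\Sigma + \tau^2 I_n)^{-2} k - \frac{2 \beta^2 \tau^4 k_0^2 n^2}{\pi (1 + n \tau^2 \beta^2) L_n^2}, \end{aligned}$$

where $L_n = \tau^2 + \sum_{i=1}^{n} k(t_1, t_i)$.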
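As a small numerical sketch of the bias term in (7), the following Python fragment evaluates the mean of the skew white noise and the resulting shift of the GPR predictor. The function names `skew_noise_mean` and `gp_predictor_bias` are hypothetical, and `Sigma` and `k` are assumed to be supplied by the user.

```python
import numpy as np

def skew_noise_mean(n, tau2, beta):
    """E(eps) for eps ~ SN_n(0, tau^2 I_n, beta 1_n): sqrt(2/pi) tau^2 beta 1_n / sqrt(1 + beta^2 tau^2 n)."""
    return np.sqrt(2.0 / np.pi) * tau2 * beta / np.sqrt(1.0 + beta ** 2 * tau2 * n) * np.ones(n)

def gp_predictor_bias(Sigma, k, tau2, beta):
    """Bias b in (7): k^T (Sigma + tau^2 I_n)^{-1} E(eps)."""
    n = len(k)
    w = np.linalg.solve(Sigma + tau2 * np.eye(n), k)  # (Sigma + tau^2 I_n)^{-1} k, matrix is symmetric
    return float(w @ skew_noise_mean(n, tau2, beta))
```

The bias vanishes when `beta` is zero and changes sign with `beta`, which matches the remark that the predictor is shifted either up or down.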
Theorem 4.1. Consider the setup in the above discussion. Then $b(\tau^2, \beta^2, n)$ and $\mathrm{var}^{SG}(\hat{f})$ satisfy the following properties.
i. $\mathrm{var}^{SG}(\hat{f}) \le \mathrm{var}^{G}(\hat{f})$ for all $\tau, \beta, n$.
ii. $\lim_{\beta \to \pm\infty} b(\tau^2, \beta^2, n) = \pm \sqrt{\frac{2}{\pi}}\, \frac{\tau}{\sqrt{n}}\, k^T (\Sigma + \tau^2 I_n)^{-1} 1_n$, $\lim_{\beta \to 0} b(\tau^2, \beta^2, n) = 0$, and $\lim_{\tau \to 0} b(\tau^2, \beta^2, n) = 0$.
iii. Assume that $t_1, t_2, \ldots, t_n$ are chosen so that they are the vertices of a regular polygon and $t^*$ is located at its center. If $k(\cdot, \cdot)$ is an isotropic covariance function and $k_0 = k(t_1, t^*)$, then
a. $|b(\tau^2, \beta^2, n)| > 0$ for all $\tau$, $n$ and $\beta \ne 0$, and $\lim_{\tau \to \infty} b(\tau^2, \beta^2, n) = 0$.
b. If $\sum_{i=1}^{n} k(t_1, t_i) = n^{-0.5} O(n)$, with $n^{-0.5} O(n) \to c \ne 0$ as $n \to \infty$, then $\lim_{n \to \infty} b(\tau^2, \beta^2, n) = \sqrt{\frac{2}{\pi}}\, \frac{\tau \beta k_0}{|\beta| c}$, and if $\sum_{i=1}^{n} k(t_1, t_i) = O(n)$, then $\lim_{n \to \infty} b(\tau^2, \beta^2, n) = 0$.
c. If $\sum_{i=1}^{n} k(t_1, t_i) = O(n)$, then $\lim_{n \to \infty} \mathrm{var}^{SG}(\hat{f}) = \left(1 - \frac{2}{\pi}\right) \frac{\tau k_0^2}{c^2}$.
d. $\lim_{\beta \to 0} \mathrm{var}^{SG}(\hat{f}) = \frac{n \tau^2 k_0^2}{L_n^2}$, $\lim_{\beta \to \pm\infty} \mathrm{var}^{SG}(\hat{f}) = \frac{n \tau^2 k_0^2}{L_n^2} \left(1 - \frac{2}{\pi}\right)$, and $\lim_{\tau \to 0, \infty} \mathrm{var}^{SG}(\hat{f}) = 0$.

Proof. The proofs of (ii) and (iii)(a)–(iii)(b) are given in Appendix B. The proofs of the other parts are easy, so we leave them to the reader.
It can be noticed that if a Gaussian predictor is used for predicting skew data, then the variance of the predictor cannot exceed the variance of the Gaussian predictor. On the other hand, the value of the predictor will be shifted to the left or to the right of the Gaussian one by an amount of $|b(\tau^2, \beta^2, n)|$. If an isotropic Gaussian covariance function is used, then
$\sum_{i=1}^{n} k(t_1, t_i) = \sum_{i=1}^{n} \tau^2 \exp\left( -\frac{0.5\, \theta_0 i}{\lambda^2} \right) = O(n)$, where $\theta_0$ denotes the angle between $t_i$ and $t^*$ for all $i = 1, \ldots, n$. So the Gaussian covariance function satisfies part b of Theorem 4.1.

5. Simulation Study
In this section, we present an algorithm to simulate a realization from a skew Gaussian process by simulating from its finite dimensional distributions. The algorithm is then implemented in Matlab code to simulate from the GPR and SGPR predictors.

5.1. Simulation from $SN_n(0, \Sigma, \lambda)$

A sample path from the skew Gaussian process can be simulated by sampling from a multivariate skew normal distribution on a smooth grid. To simulate a random vector from the pdf (1), we may use the accept-reject method. The accept-reject method as given in Christian and Casella (2004) assumes that the pdf $P(x)$ can be written as

$$P(x) = c\, g(x)\, h(x),$$

where $c \ge 1$, $0 < g(x) \le 1$ for all $x$, and $h(x)$ is a pdf. If this is the case, then a random observation from $P(x)$ is generated as follows.
1. Generate $U$ from $U(0, 1)$.
2. Generate $Y$ from $h(x)$.
3. If $U \le g(Y)$, then deliver $Y$ as a realization of $P(x)$.
4. Otherwise, go to step 1.
For the $SN_n(0, \Sigma, \lambda)$ distribution, we may use this algorithm with $c = 2$, $g(x) = \Phi(\lambda^T x)$ and $h(x) = \phi_n(x; 0, \Sigma)$.
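The four steps above, with $c = 2$, $g(x) = \Phi(\lambda^T x)$ and $h(x) = \phi_n(x; 0, \Sigma)$, can be sketched in Python as follows. This is an illustrative sketch rather than the Matlab code used in the article; the names `std_normal_cdf` and `sample_skew_normal` are hypothetical.

```python
import math
import numpy as np

def std_normal_cdf(x):
    """Phi(x), the cdf of N(0, 1), via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sample_skew_normal(Sigma, lam, size, seed=0):
    """Accept-reject sampler for the density 2 phi_n(x; 0, Sigma) Phi(lam^T x)."""
    rng = np.random.default_rng(seed)
    Sigma = np.asarray(Sigma, float)
    lam = np.asarray(lam, float)
    L = np.linalg.cholesky(Sigma)                 # Sigma = L L^T
    out = []
    while len(out) < size:
        u = rng.uniform()                         # step 1: U ~ U(0, 1)
        y = L @ rng.standard_normal(len(lam))     # step 2: Y ~ N_n(0, Sigma)
        if u <= std_normal_cdf(float(lam @ y)):   # step 3: accept if U <= g(Y)
            out.append(y)
        # step 4: otherwise loop back to step 1
    return np.array(out)
```

Because $c = 2$, about half of the proposals are accepted on average, regardless of $\lambda$ and $\Sigma$.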

5.2. Simulation from $CSN_{p,q}(\mu, \Sigma, D, v, \Delta)$

To simulate a random observation from $CSN_{p,q}(\mu, \Sigma, D, v, \Delta)$, the accept-reject method is difficult to apply because of the complexity of calculating $g(x) = \Phi_q(D^T (Y - \mu); v, \Delta)$. Instead, we employ the following algorithm, which is derived from the definition of the CSN distribution (see Genton, 2004; Allard and Naveau, 2007).
(i) Simulate an observation $U$ from

$$N_q(v, \Delta + D^T \Sigma D) \mid U \le 0.$$

(ii) Given $U$, simulate $Z$ from

$$N_p\left( -\Sigma D (\Delta + D^T \Sigma D)^{-1} (U - v),\; \Sigma - \Sigma D (\Delta + D^T \Sigma D)^{-1} D^T \Sigma \right).$$

(iii) Deliver $Z$ as an observation from $CSN_{p,q}(\mu, \Sigma, D, v, \Delta)$.

Simulating from $U \mid U \le 0$ is itself not an easy task, so an accept-reject method will be implemented for this step.
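For illustration, a simpler alternative to the two-step algorithm above is a direct rejection sampler based on the density (2). With $D$ a $q \times p$ matrix as in (2), if $W \sim N_p(0, \Sigma)$ and $E \sim N_q(v, \Delta)$ are independent, then $\mu + W$ conditioned on $E \le DW$ has density proportional to $\phi_p(y; \mu, \Sigma)\, \Phi_q(D(y - \mu); v, \Delta)$. The Python sketch below uses this conditioning argument; it is not the algorithm of the article, and the name `sample_csn` is hypothetical.

```python
import numpy as np

def sample_csn(mu, Sigma, D, v, Delta, size, seed=0):
    """Rejection sampler for CSN_{p,q}(mu, Sigma, D, v, Delta), with D of shape (q, p) as in (2)."""
    rng = np.random.default_rng(seed)
    mu = np.atleast_1d(np.asarray(mu, float))
    v = np.atleast_1d(np.asarray(v, float))
    Sigma = np.atleast_2d(np.asarray(Sigma, float))
    D = np.atleast_2d(np.asarray(D, float))
    Delta = np.atleast_2d(np.asarray(Delta, float))
    Ls = np.linalg.cholesky(Sigma)
    Ld = np.linalg.cholesky(Delta)
    out = []
    while len(out) < size:
        w = Ls @ rng.standard_normal(len(mu))     # W ~ N_p(0, Sigma)
        e = v + Ld @ rng.standard_normal(len(v))  # E ~ N_q(v, Delta)
        if np.all(e <= D @ w):                    # accepted with probability Phi_q(D w; v, Delta)
            out.append(mu + w)
    return np.array(out)
```

The acceptance rate equals $\Phi_q(0; v, \Delta + D \Sigma D^T)$, so this sketch can be slow when that probability is small; the two-step algorithm in the text avoids sampling the $p$-dimensional part by rejection.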

5.3. Estimation of Hyper Parameters


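Maximum likelihood estimation (MLE) is one approach to estimating the hyper parameters in GPR. Here we use the MLE to estimate the parameters $\tau, \sigma^2, \alpha, \beta$, and $\lambda$, i.e., by maximizing the likelihood function $L(\tau, \sigma^2, \alpha, \beta, \lambda; Y)$ of $Y$. Consider the model

$$Y = X + \varepsilon,$$

where $X \sim SN_n(0, \Sigma, \alpha 1_n^T, 0, 1)$, $\varepsilon \sim SN_n(0, \tau^2 I_n, \beta 1_n^T, 0, 1)$, $\tau > 0$, and $X, \varepsilon$ are independent random vectors. Then applying Proposition A.5 of Appendix A yields $Y \sim CSN_{n,2}(0, \Sigma + \tau^2 I_n, D^\circ, 0, \Delta^\circ)$, where

$$D^\circ = \begin{pmatrix} \alpha 1_n^T \Sigma (\Sigma + \tau^2 I_n)^{-1} \\ \beta \tau^2 1_n^T (\Sigma + \tau^2 I_n)^{-1} \end{pmatrix}, \quad \Delta^\circ = \begin{pmatrix} A_{11} & A_{12} \\ A_{12} & A_{22} \end{pmatrix},$$

and

$$A_{11} = 1 + \alpha^2 1_n^T \Sigma 1_n - \alpha^2 1_n^T \Sigma (\Sigma + \tau^2 I_n)^{-1} \Sigma 1_n,$$

$$A_{22} = 1 + n \beta^2 \tau^2 - \beta^2 \tau^4 1_n^T (\Sigma + \tau^2 I_n)^{-1} 1_n,$$

$$A_{12} = -\alpha \beta \tau^2 1_n^T \Sigma (\Sigma + \tau^2 I_n)^{-1} 1_n.$$

Using the pdf (2), we find the likelihood function of $\tau, \sigma^2, \alpha, \beta$, and $\lambda$:

$$L(\tau, \sigma^2, \alpha, \beta, \lambda; Y) = \frac{\Phi_2(D^\circ Y; 0, \Delta^\circ)}{\Phi_2\left(0; 0, \Delta^\circ + D^\circ (\Sigma + \tau^2 I_n) D^{\circ T}\right)} \times \phi_n(Y; 0, \Sigma + \tau^2 I_n).$$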

Although the marginal distribution of the data $Y$ is a multivariate closed skew normal distribution, finding confidence intervals for the parameters $\tau, \sigma^2, \alpha, \beta$, and $\lambda$ is not an easy task, since these parameters are embedded in the distribution's parameters, i.e., in $\Sigma$, $D^\circ$ and $\Delta^\circ$. So one may consider Bayesian intervals instead. For this purpose, a prior distribution on $\tau, \sigma^2, \alpha, \beta$, and $\lambda$ must be assumed. Let $P(\tau, \sigma^2, \alpha, \beta, \lambda)$ be the prior that represents our belief about the distribution of $\tau, \sigma^2, \alpha, \beta$, and $\lambda$. Then the posterior distribution of $\tau, \sigma^2, \alpha, \beta$, and $\lambda$ given $Y$ satisfies

$$P(\tau, \sigma^2, \alpha, \beta, \lambda \mid Y) \propto L(\tau, \sigma^2, \alpha, \beta, \lambda; Y) \times P(\tau, \sigma^2, \alpha, \beta, \lambda).$$

Again, we face another problem in finding the normalizing constant for the posterior distribution. Hence, a Markov chain Monte Carlo (MCMC) method is called for. For example, one may use the Metropolis-Hastings algorithm (Christian and Casella, 2004). To find confidence intervals for the parameters $\tau, \sigma^2, \alpha, \beta$, and $\lambda$, a large sample from $P(\tau, \sigma^2, \alpha, \beta, \lambda \mid Y)$ is needed. We propose an algorithm in Appendix C to find such confidence intervals. On the other hand, once the parameters $\tau, \sigma^2, \alpha, \beta$, and $\lambda$ have been estimated, their estimates can be plugged into the variance formula $\mathrm{var}(f^* \mid Y, t^*)$ to get an estimate of $\mathrm{var}(f^* \mid Y, t^*)$. Furthermore, to obtain a 95% Bayesian confidence band for $(f^* \mid y, t^*)$, we also need to use simulation. To proceed, let $\Theta = (\tau, \sigma^2, \alpha, \beta, \lambda)$ and $\Psi(\Theta, t^*) = (f^* \mid y, t^*)$. A Bayesian simultaneous 95% confidence band for $\Psi(\Theta, t^*)$ is obtained by finding $L$ and $U$ such that $P_{\Theta \mid Y, t^*}(L < \Psi(\Theta, t^*) < U \text{ for all } t^*) = 0.95$, which is equivalent to solving the following equation for $L$ and $U$:

$$P_{\Theta \mid Y, t^*}\left( L < \inf_{t^*} \Psi(\Theta, t^*) < \sup_{t^*} \Psi(\Theta, t^*) < U \right) = 0.95. \tag{8}$$

We also propose an algorithm in Appendix C to solve (8) for $L$ and $U$. Hence, $L < \Psi(\Theta, t^*) < U$ is a 95% confidence band for $\Psi(\Theta, t^*)$ for all $t^*$.

5.4. Simulation Results


In this simulation work, realizations of the sample path of the SGP are generated for the input function $f(t) = \frac{\sin(t)}{t}$, $t \ne 0$. Then the simulated data are substituted in both the Gaussian and the skew Gaussian predictors. To see the effect of the departure from Gaussianity on the Gaussian predictor, we plot the distribution function for the two predictors. Figures 1–8 show these distribution functions for $\lambda = 1$ and different values of $\alpha$, $\beta$, and $\tau$.

From Figs. 1–8, we report the following concluding remarks:

1. If a Gaussian process prior is used on the input function, i.e., $\alpha = 0$, then there is a small difference between the two distributions and this difference increases as a function of $|\beta|$. Moreover, the skew Gaussian predictor distribution is larger than the Gaussian predictor distribution if $\beta < 0$ and the converse is true if $\beta > 0$ (see Fig. 2a).
2. The two predictors have about the same distribution functions for small values of the parameters $\tau$, $\alpha$, and $\beta$ (see Figs. 2a, b).
Figure 1. (a) GPR (G) and skew Gaussian (SG) predictors with parameters α = −0.05, β = −5, 0, 1, 5 and τ = 0.1; (b) GPR (G) and SG predictors with parameters α = −0.01, β = −5, −1, 0, 5 and τ = 0.1.

Figure 2. (a) GPR (G) and SG predictors with parameters α = 0, β = −5, 0, 1, 5 and τ = 0.1; (b) GPR (G) and SG predictors with parameters α = 0.05, β = −5, 0, 2, 5 and τ = 0.1.

Figure 3. (a) GPR (G) and SG predictors with parameters α = 2, β = −5, −2, 0, 5 and τ = 0.1; (b) GPR (G) and SG predictors with parameters α = 5, β = −5, 0, 2, 5 and τ = 0.1.

Figure 4. (a) GPR (G) and SG predictors with parameters α = −0.01, β = −5, 0, 1.5, 5 and τ = 1; (b) GPR (G) and SG predictors with parameters α = 0, β = 5, 0, 2, 5 and τ = 1.

Figure 5. (a) GPR (G) and SG predictors with parameters α = 0.5, β = −5, 0, 1.5, 5 and τ = 1.

Figure 6. (a) GPR (G) and SG predictors with parameters α = 1.5, β = −5, −1, 0, 5 and τ = 1.5; (b) GPR (G) and SG predictors with parameters α = 4, β = −5, 0, 2, 5 and τ = 1.5.

Figure 7. (a) GPR (G) and SG predictors with parameters α = −0.1, β = −5, 0, 2, 4 and τ = 2; (b) GPR (G) and SG predictors with parameters α = 5, β = −5, 0, 2, 5 and τ = 2.

3. If a Gaussian process is used for the errors, i.e., β = 0, then there is no difference
between the two distributions when α ≤ 0 and τ is small (see Figs. 1, 2, 3, and 4).
4. For fixed values of α and moderate values of τ, the difference between the two
distributions is very clear and seems to be an increasing function of |β| (see Figs. 4, 5).
5. For fixed values of α and large values of τ, there is a huge difference between the two
distributions (see Fig. 8).

Figure 8. GPR (G) and SG predictors with parameters α = 1, β = 2, and τ = 2, 10.

6. Conclusions and Possibility of Future Work

In this article, the nonlinear regression model Y(t) = f(t) + ε(t) has been tackled from a
Bayesian viewpoint by assuming skew Gaussian processes on both f(t) and ε(t). It is shown
that, under this assumption, the predictive density at a new input has a closed form. Also,
we studied the GPR predictor under the assumption that the errors violate the assumption
of Gaussianity. If the errors depart from Gaussianity to skew Gaussianity, then the GPR
predictor will be affected and may lead to unrealistic estimates. The skew Gaussian process
for regression addressed in this paper has several advantages over the GPR, and these
advantages motivate us to continue this work in the future. We highlight some possible
directions.

1. Studying the effect of the choice of the covariance function on the skew Gaussian
process predictor.
2. Developing methods for estimating the hyper-parameters of the model.
3. Prediction at several inputs.
4. Defining more robust models by using more general distributions either on the input
function f(t) or on the error term. For such future work, one may utilize the work of
Lachos et al. (2010) and Da Silva-Ferreira et al. (2011) by assuming that either the
input function or the error term follows a random process whose finite dimensional
distributions are scale mixtures of skew normal (SMSN) distributions as defined by
Lachos et al. (2010a). A random vector Y is said to have an n-dimensional scale
mixture of skew normal distributions, denoted by $Y \sim SMSN_n(\mu, \Sigma, \lambda, H)$, if Y has the
stochastic representation $Y = \mu + c^{1/2}(U)\,Z$, where $Z \sim SN_n(0, \Sigma, \lambda)$, $c(\cdot)$ is a weight
function, and U is a positive mixing random variable with cdf $H(\cdot)$, independent of
Z. The pdf of Y is

$$P(y) = 2\int_{0}^{\infty} \varphi_n\big(y;\ \mu,\ c(u)\Sigma\big)\,\Phi\big(c^{-1/2}(u)\,\lambda^{T}\Sigma^{-1/2}(y-\mu)\big)\,dH(u).$$

Lachos et al. (2010b) showed that the family SMSN includes several known families
such as the skew-t, skew-slash, and skew-Cauchy families. This opens the way for further
research on more robust models.
Although the process whose finite dimensional distributions are SMSN is very general,
we face several computational challenges when finding the estimates of the hyperparameters.
These challenges are due to the integration in the pdf of $Y \sim SMSN_n(\mu, \Sigma, \lambda, H)$.
So, instead of conducting the numerical calculations, it could be easier to use an intensive
statistical computing algorithm to calculate the integration in the pdf P(y). Since intensive
computing requires large samples from the pdf of Y, we may utilize the stochastic
representation $Y = \mu + c^{1/2}(U)\,Z$ for such simulation purposes.
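The simulation idea in the paragraph above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the helper names (`sym_sqrt`, `sample_sn`, `sample_smsn`, `smsn_pdf_mc`) are hypothetical, the skew normal draws use the standard construction $\delta|T_0| + LW$ with $\delta = \lambda/\sqrt{1+\lambda^T\lambda}$, and the skew-t choice $c(u) = 1/u$ with a gamma mixing variable is one example of an SMSN member. The integral in P(y) is approximated by averaging the integrand over draws of U, using the weight $c^{-1/2}(u)$ implied by the stochastic representation.

```python
import math
import numpy as np


def sym_sqrt(S):
    """Symmetric square root of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(w)) @ V.T


def sample_sn(rng, Sigma, lam, size):
    """Draws from SN_n(0, Sigma, lam), density 2*phi_n(z; 0, Sigma)*Phi(lam^T Sigma^{-1/2} z)."""
    n = len(lam)
    delta = lam / math.sqrt(1.0 + lam @ lam)
    # Z0 = delta*|T0| + L W, with L L^T = I - delta delta^T, is SN_n(0, I, lam).
    L = np.linalg.cholesky(np.eye(n) - np.outer(delta, delta))
    t0 = np.abs(rng.standard_normal(size))
    W = rng.standard_normal((size, n))
    Z0 = t0[:, None] * delta + W @ L.T
    return Z0 @ sym_sqrt(Sigma)  # each row is Sigma^{1/2} Z0


def sample_smsn(rng, mu, Sigma, lam, c, draw_u, size):
    """Y = mu + c(U)^{1/2} Z with U ~ H independent of Z ~ SN_n(0, Sigma, lam)."""
    U = draw_u(rng, size)
    Z = sample_sn(rng, Sigma, lam, size)
    return mu + np.sqrt(c(U))[:, None] * Z


def smsn_pdf_mc(y, mu, Sigma, lam, c, u_draws):
    """Monte Carlo estimate of 2*E_U[phi_n(y; mu, c(U)Sigma) * Phi(c(U)^{-1/2} lam^T Sigma^{-1/2}(y - mu))]."""
    n = len(mu)
    r = y - mu
    quad = r @ np.linalg.solve(Sigma, r)
    _, logdet = np.linalg.slogdet(Sigma)
    a = lam @ np.linalg.solve(sym_sqrt(Sigma), r)
    cu = c(u_draws)
    log_phi = -0.5 * (n * np.log(2.0 * np.pi * cu) + logdet + quad / cu)
    Phi = 0.5 * (1.0 + np.vectorize(math.erf)(a / np.sqrt(2.0 * cu)))
    return float(np.mean(2.0 * np.exp(log_phi) * Phi))


# Example (assumed values): a bivariate skew-t member of the SMSN family.
rng = np.random.default_rng(0)
nu = 5.0
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
lam = np.array([1.5, -1.0])
inv_u = lambda u: 1.0 / u                                   # weight function c(u)
draw_gamma = lambda g, size: g.gamma(nu / 2.0, 2.0 / nu, size)  # mixing cdf H

Y_draws = sample_smsn(rng, mu, Sigma, lam, inv_u, draw_gamma, 1000)
pdf_at_mu = smsn_pdf_mc(mu, mu, Sigma, lam, inv_u, draw_gamma(rng, 20000))
```

Averaging over draws of U replaces the one-dimensional integral by a sample mean, so its accuracy improves at the usual Monte Carlo rate as the number of draws grows.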
Azzalini and Capitanio (1999) pointed out that the MLE for the skewness parameter
of the multivariate skew normal distribution may diverge with positive probability.
They also noticed that the Fisher information matrix is singular when the skewness
parameter is zero. For the multivariate closed skew normal distribution, these issues have
been considered in only a few papers. Here we refer to the work of Arellano-Valle
et al. (2005). They used the skew normal distribution to model both the random effects
and the error terms in the linear mixed effects model. They also showed that the response
data vector has a multivariate closed skew normal distribution. Furthermore, they derived
and implemented an EM algorithm to find the MLEs for all parameters. According to the
literature, the above issues concerning the MLEs of the closed skew normal distribution
parameters are still not explored enough. Hence, we believe that further research should be
conducted. For example, the estimation of the SGP model parameters using the penalized
maximum likelihood method could be called for. We leave these issues to a separate article.

References
Allard, D., Naveau, P. (2007). A new spatial skew-normal random field model. Commun. Statist.
Theory. Meth. 36:1821–1834.
Alodat, M. T., Aludaat, K. M. (2007). A skew Gaussian process. Pak. J. Statist. 23:89–97.
Alodat, M. T., AL-Rawwash, M. Y. (2009). Skew Gaussian random field. J. Computat. Appl. Math.
232(2):496–504.
Arellano-Valle, R. B., Bolfarine, H., Lachos, V. H. (2005). Skew-normal linear mixed models. J. Data
Sci. 3:415–438.
Azzalini, A. (1985). A class of distributions which includes the normal ones. Scand. J. Statist.
12:171–178.
Azzalini, A. (1986). Further results on a class of distributions which includes the normal ones.
Statistica 46:199–208.

Azzalini, A., Capitanio, A. (1999). Statistical application of the multivariate skew normal distributions.
J. Roy. Stat. Soc. Ser. B 61:579–602.
Azzalini, A., Dalla Valle, A. (1996). The multivariate skew-normal distribution. Biometrika
83:715–726.
Brahim-Belhouari, S., Bermak, A. (2004). Gaussian process for non-stationary time series prediction.
Computat. Statist. Data Anal. 47:705–712.
Buccianti, A. (2005). ‘Meaning of the λ parameter of skew–normal and log–skew normal distributions
in fluid geochemistry’ a CODAWORK’05.
Robert, C. P., Casella, G. (2004). Monte Carlo Statistical Methods. New York: Springer.
Da Silva-Ferreira, C., Bolfarine, H., Lachos, V. (2011). Skew-scale mixture of skew-normal distribu-
tions. Statist. Methodol. 8:154–181.
Fan, J., Peng, H. (2004). Nonconcave penalized likelihood with diverging number of Parameters.
Ann. Statist. 32:928–961.
Fyfe, C., Leen, G., Lai, P. L. (2008). Gaussian processes for canonical correlation analysis. Neuro
Comput. 71:3077–3088.
Genton, M. (2004). Skew-Elliptical Distributions and Their Applications: A Journey Beyond Nor-
mality. Boca Raton, FL: Chapman & Hall/CRC.
Girard, A., Kocijan, J., Murray-Smith, R., Rasmussen, C. E. (2004). Gaussian process model based
predictive control. Proc. Amer. Control Conf . Boston.
Girard, A., Rasmussen, C. E., Murray-Smith, R. M. (2002). Gaussian Process priors with uncer-
tain Inputs: Multiple-Step-Ahead Prediction. Technical Report TR-2002-119, Department of
computing Science, University of Glasgow.
González-Farías, G., Domínguez-Molina, J., Gupta, A. (2004). Additive properties of skew normal
random vectors. J. Statist. Plan. Infer. 126:521–534.
Kuss, M. (2006). Gaussian process models for robust regression, classification, and reinforcement
learning. Ph.D. thesis, Technische Universität Darmstadt.
Lachos, V., Labra, F., Bolfarine, H., Ghosh, P. (2010a). Multivariate measurement error models based
on scale mixtures of the skew-normal distribution. Statistics 44:541–556.
Lachos, V. H., Ghosh, P., Arellano-Valle, R. B. (2010b). Likelihood based inference for skew-normal
independent linear mixed models. Statistica Sinica 20:303–322.
Macke, J. H., Gerwinn, S., White, L. E., Kaschube, M., Bethge, M. (2010). Gaussian process methods
for estimating Cortical maps.
Neal, R. M. (1995). Bayesian learning for neural networks. Ph.D., thesis, Dept. of Computer Science,
University of Toronto.
O’Hagan, A. (1978). On curve fitting and optimal design for prediction. J. Roy. Soc. B 40:1–42.
Rasmussen, C. E., Williams, C. (2006). Gaussian Processes for Machine Learning. Cambridge, MA:
MIT press.
Rasmussen, C. E. (1996). Evaluation of Gaussian Processes and other methods for non-linear
regression, Ph.D. thesis, Dept. of Computer Science, University of Toronto.

Schmidt, A. M., Conceição, M. F., Moreira, G. A. (2008). Investigating the sensitivity of Gaussian
processes to the choice of their correlation functions and prior specifications. J. Statist. Computat.
Simul. 78(8):681–699.
Schott, J. R. (1997). Matrix Analysis for Statistics. New York: Wiley-Interscience.
Vanhatalo, J., Jylänki, P., Vehtari, A. (2009). Gaussian process regression with Student-t likelihood.
In: Bengio Y., Schuurmans D., Lafferty J., Williams C. K. I., Culotta A. Eds. Advances in Neural
Information Processing Systems 22:1910–1918.
Williams, C. K. I., Rasmussen, C. E. (1996) Gaussian processes for regression. Adv. Neur. Inform.
Process. Syst. 8:514–520.
Zhang, H., El-Shaarawi, A. (2009). On spatial skew Gaussian process applications. Environmetrics
10:982.

Appendix

A. Basic Results on Closed-Skew Normal Distribution


The results of this appendix are quoted from Genton (2004).

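Proposition A.1 If $Y_1, \ldots, Y_n$ are independent random vectors with $Y_i \sim CSN_{p_i,q_i}(\mu_i, \Sigma_i, D_i, v_i, \Delta_i)$, then the joint distribution of $Y_1, \ldots, Y_n$ is $Y = (Y_1^T, \ldots, Y_n^T)^T \sim CSN_{p^+,q^+}(\mu^+, \Sigma^+, D^+, v^+, \Delta^+)$, where

$$p^+ = \sum_{i=1}^{n} p_i,\quad q^+ = \sum_{i=1}^{n} q_i,\quad \mu^+ = (\mu_1^T, \ldots, \mu_n^T)^T,\quad \Sigma^+ = \oplus_{i=1}^{n}\Sigma_i,$$

$$D^+ = \oplus_{i=1}^{n} D_i,\quad v^+ = (v_1^T, \ldots, v_n^T)^T,\quad \Delta^+ = \oplus_{i=1}^{n}\Delta_i,$$

and

$$A \oplus B = \begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix}.$$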

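Proposition A.2 Let $Y \sim CSN_{p,q}(\mu, \Sigma, D, v, \Delta)$ and let A be an n × p (n ≤ p) matrix of rank n. Then $AY \sim CSN_{n,q}(\mu_A, \Sigma_A, D_A, v, \Delta_A)$, where $\mu_A = A\mu$, $\Sigma_A = A\Sigma A^T$, $D_A = D\Sigma A^T\Sigma_A^{-1}$, and $\Delta_A = \Delta + D\Sigma D^T - D\Sigma A^T\Sigma_A^{-1}A\Sigma D^T$.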

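Proposition A.3 If $Y \sim CSN_{p,q}(\mu, \Sigma, D, v, \Delta)$, then for two subvectors $Y_1$ and $Y_2$, where $Y^T = (Y_1^T, Y_2^T)$, $Y_1$ is k-dimensional, 1 ≤ k ≤ p, and $\mu$, $\Sigma$, D are partitioned as follows:

$$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix},\qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},\qquad D = \begin{pmatrix} D_1 & D_2 \end{pmatrix},$$

where $\mu_1$ is k × 1, $\mu_2$ is (p − k) × 1, $\Sigma_{11}$ is k × k, $\Sigma_{22}$ is (p − k) × (p − k), $D_1$ is q × k, and $D_2$ is q × (p − k). Then the conditional distribution of $Y_2$ given $Y_1 = y_{10}$ is

$$CSN_{p-k,q}\big(\mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(y_{10} - \mu_1),\ \Sigma_{22.1},\ D_2,\ v - D^{*}(y_{10} - \mu_1),\ \Delta\big),$$

where

$$D^{*} = D_1 + D_2\Sigma_{21}\Sigma_{11}^{-1},$$

and

$$\Sigma_{22.1} = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}.$$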

Proposition A.4 If $Y \sim CSN_{p,q}(\mu, \Sigma, D, v, \Delta)$, then the moment generating function of Y is

$$M_Y(s) = \frac{\Phi_q\big(D\Sigma s;\ v,\ \Delta + D\Sigma D^T\big)}{\Phi_q\big(0;\ v,\ \Delta + D\Sigma D^T\big)}\, e^{s^T\mu + \frac{1}{2}s^T\Sigma s},\qquad s \in \mathbb{R}^p.$$

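Proposition A.5 If $Y_1$ and $Y_2$ are independent vectors such that $Y_i \sim CSN_{p,q_i}(\mu_i, \Sigma_i, D_i, v_i, \Delta_i)$, i = 1, 2, then $Y_1 + Y_2 \sim CSN_{p,q_1+q_2}(\mu_1 + \mu_2, \Sigma_1 + \Sigma_2, D^{\circ}, v^{\circ}, \Delta^{\circ})$, where

$$D^{\circ} = \begin{pmatrix} D_1\Sigma_1(\Sigma_1 + \Sigma_2)^{-1} \\ D_2\Sigma_2(\Sigma_1 + \Sigma_2)^{-1} \end{pmatrix},\qquad \Delta^{\circ} = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix},$$

and

$$A_{11} = \Delta_1 + D_1\Sigma_1D_1^T - D_1\Sigma_1(\Sigma_1 + \Sigma_2)^{-1}\Sigma_1D_1^T,$$

$$A_{22} = \Delta_2 + D_2\Sigma_2D_2^T - D_2\Sigma_2(\Sigma_1 + \Sigma_2)^{-1}\Sigma_2D_2^T,$$

$$A_{12} = -D_1\Sigma_1(\Sigma_1 + \Sigma_2)^{-1}\Sigma_2D_2^T,\qquad v^{\circ} = (v_1^T, v_2^T)^T.$$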

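Proposition A.6 If $X \sim SN_n(\mu, \Sigma, \alpha)$, then

i. $E X = \mu + \sqrt{\dfrac{2}{\pi}}\,\delta$, where $\delta = \dfrac{\alpha}{\sqrt{1 + \alpha^T\alpha}}$.
ii. $\mathrm{Cov}(X) = \Sigma - \dfrac{2}{\pi}\,\delta\delta^T$.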

B. Proof of Theorem 3.1.


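B.1. Joint Density of Data and Output. The aim of this section is to derive the joint density
of $f^* = f(t^*)$ and Y. For simplicity, we assume that the skew Gaussian processes used
here possess fixed skewness in all directions. Since f(t) is assumed to have a skew Gaussian
process prior, then

$$\begin{pmatrix} f \\ f^* \end{pmatrix} \sim CSN_{n+1,1}\big(0,\ \Sigma,\ \alpha\mathbf{1}_{n+1}^T,\ 0,\ 1\big),\qquad \varepsilon \sim CSN_{n,1}\big(0,\ \tau^2 I_n,\ \beta\mathbf{1}_n^T,\ 0,\ 1\big),$$

where $\mathbf{1}_{n+1}$ denotes the column of ones of size (n + 1), and $I_n$ is the identity matrix of size
n × n. Since $(f^T, f^*)^T$ is independent of ε(t), then by Proposition A.1 we have that

$$\begin{pmatrix} f \\ f^* \\ \varepsilon \end{pmatrix} \sim CSN_{2n+1,2}\big(\mu^+, \Sigma^+, D^+, v^+, \Delta^+\big),$$

where

$$\mu^+ = \big(\mathbf{0}_{1\times n},\ 0,\ \mathbf{0}_{1\times n}\big)^T,\qquad v^+ = (0, 0)^T,\qquad \Delta^+ = I_2,$$

where $\mathbf{0}_{n\times 1}$ is the zero vector of size n × 1, and

$$D^+ = \begin{pmatrix} \alpha\mathbf{1}_{n+1}^T & \mathbf{0}_{n\times 1}^T \\ \mathbf{0}_{(n+1)\times 1}^T & \beta\mathbf{1}_n^T \end{pmatrix}_{2\times(2n+1)},\qquad \Sigma^+ = \begin{pmatrix} \Sigma_{(n+1)\times(n+1)} & \mathbf{0}_{(n+1)\times n} \\ \mathbf{0}_{(n+1)\times n}^T & \tau^2 I_n \end{pmatrix}.$$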

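B.2. The Predictive Density of $f^*$ Given Y, $t^*$. The predictive density of $f^*$ given Y, $t^*$ is
obtained by direct application of Proposition A.3 with p = n + 1 and q = 2. The first step in
finding the conditional distribution of $f^*$ given Y, $t^*$ is to find the joint pdf of $f^*$ and Y. To
proceed, we write $(Y^T, f^*)^T$ as a linear combination of $(f^T, f^*, \varepsilon^T)^T$, i.e.,

$$\begin{pmatrix} Y \\ f^* \end{pmatrix} = \begin{pmatrix} f + \varepsilon \\ f^* \end{pmatrix} = \begin{pmatrix} I_n & \mathbf{0}_{n\times 1} & I_n \\ \mathbf{0}_{n\times 1}^T & 1 & \mathbf{0}_{n\times 1}^T \end{pmatrix}\begin{pmatrix} f \\ f^* \\ \varepsilon \end{pmatrix}.$$

To simplify the notation, let

$$A_{(n+1)\times(2n+1)} = \begin{pmatrix} I_n & \mathbf{0}_{n\times 1} & I_n \\ \mathbf{0}_{n\times 1}^T & 1 & \mathbf{0}_{n\times 1}^T \end{pmatrix}.$$

It is straightforward to check that the matrix A is of rank (n + 1). Now, we are ready to apply Proposition A.2.
Hence,

$$\begin{pmatrix} Y \\ f^* \end{pmatrix} = A\begin{pmatrix} f \\ f^* \\ \varepsilon \end{pmatrix} \sim CSN_{n+1,2}\big(\mu_A, \Sigma_A, D_A, v^+, \Delta_A\big),$$

where

$$\mu_A = A\mu^+ = \mathbf{0}_{(n+1)\times 1},$$

$$\Sigma_A = A\Sigma^+A^T = \begin{pmatrix} I_n & \mathbf{0}_{n\times 1} & I_n \\ \mathbf{0}_{n\times 1}^T & 1 & \mathbf{0}_{n\times 1}^T \end{pmatrix}\begin{pmatrix} \Sigma_{(n+1)\times(n+1)} & \mathbf{0}_{(n+1)\times n} \\ \mathbf{0}_{(n+1)\times n}^T & \tau^2 I_n \end{pmatrix}\begin{pmatrix} I_n & \mathbf{0}_{n\times 1} \\ \mathbf{0}_{n\times 1}^T & 1 \\ I_n & \mathbf{0}_{n\times 1} \end{pmatrix} = \begin{pmatrix} \Sigma + \tau^2 I_n & k \\ k^T & k^* \end{pmatrix}_{(n+1)\times(n+1)}.$$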

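To proceed, we need to apply the following matrix identity, which can be found in Schott
(1997). Let A be a matrix which is partitioned as follows:

$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix},$$

where $A_{11}$ and $A_{22}$ are invertible square matrices. Then

$$A^{-1} = \begin{pmatrix} \big(A_{11} - A_{12}A_{22}^{-1}A_{21}\big)^{-1} & -A_{11}^{-1}A_{12}\big(A_{22} - A_{21}A_{11}^{-1}A_{12}\big)^{-1} \\ -A_{22}^{-1}A_{21}\big(A_{11} - A_{12}A_{22}^{-1}A_{21}\big)^{-1} & \big(A_{22} - A_{21}A_{11}^{-1}A_{12}\big)^{-1} \end{pmatrix}.$$

For proof, we refer to Schott (1997). Hence, we find $\Sigma_A^{-1}$ as follows:

$$\Sigma_A^{-1} = \begin{pmatrix} \big(\Sigma + \tau^2 I_n - kk^{*-1}k^T\big)^{-1} & -\big(\Sigma + \tau^2 I_n\big)^{-1}k\big(k^* - k^T(\Sigma + \tau^2 I_n)^{-1}k\big)^{-1} \\ -k^{*-1}k^T\big(\Sigma + \tau^2 I_n - kk^{*-1}k^T\big)^{-1} & \big(k^* - k^T(\Sigma + \tau^2 I_n)^{-1}k\big)^{-1} \end{pmatrix}_{(n+1)\times(n+1)}.$$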
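As a sanity check of the partitioned-inverse identity applied to $\Sigma_A$, the four blocks can be assembled numerically and compared with a direct inverse. The sketch below assumes a hypothetical randomly generated positive definite covariance for $(f, f^*)$ and arbitrary values of n and $\tau^2$; none of these values come from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, tau2 = 4, 0.3

# Hypothetical positive definite (n+1)x(n+1) covariance of (f, f*), split into Sigma, k, k*.
B = rng.standard_normal((n + 1, n + 1))
full = B @ B.T + (n + 1) * np.eye(n + 1)
Sigma, k, kstar = full[:n, :n], full[:n, n], full[n, n]

S = Sigma + tau2 * np.eye(n)  # plays the role of A11; A12 = k, A21 = k^T, A22 = k*
Sigma_A = np.block([[S, k[:, None]], [k[None, :], np.array([[kstar]])]])

S_inv = np.linalg.inv(S)
top_left = np.linalg.inv(S - np.outer(k, k) / kstar)  # (A11 - A12 A22^{-1} A21)^{-1}
schur = 1.0 / (kstar - k @ S_inv @ k)                 # (A22 - A21 A11^{-1} A12)^{-1}
top_right = -(S_inv @ k) * schur                      # -A11^{-1} A12 (...)^{-1}
bottom_left = -(k @ top_left) / kstar                 # -A22^{-1} A21 (...)^{-1}

Sigma_A_inv = np.block([
    [top_left, top_right[:, None]],
    [bottom_left[None, :], np.array([[schur]])],
])

# Largest entrywise discrepancy against a direct numerical inverse.
max_abs_err = float(np.max(np.abs(Sigma_A_inv - np.linalg.inv(Sigma_A))))
```

Because $k^*$ is a scalar here, the lower-right block reduces to the reciprocal of the Schur complement, which is also the reciprocal of the predictive variance used later.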

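The parameter $D_A$ is given by

$$D_A = D^+\Sigma^+A^T\Sigma_A^{-1}$$

$$= \begin{pmatrix} \alpha\mathbf{1}_{n+1}^T & \mathbf{0}_{1\times n} \\ \mathbf{0}_{1\times(n+1)} & \beta\mathbf{1}_n^T \end{pmatrix}\begin{pmatrix} \Sigma_{(n+1)\times(n+1)} & \mathbf{0}_{(n+1)\times n} \\ \mathbf{0}_{(n+1)\times n}^T & \tau^2 I_n \end{pmatrix}\begin{pmatrix} I_n & \mathbf{0}_{n\times 1} \\ \mathbf{0}_{n\times 1}^T & 1 \\ I_n & \mathbf{0}_{n\times 1} \end{pmatrix}\Sigma_A^{-1}$$

$$= \begin{pmatrix} \alpha\mathbf{1}_{n+1}^T(\Sigma, k)^T & \alpha\mathbf{1}_{n+1}^T(k^T, k^*)^T \\ \beta\tau^2\mathbf{1}_n^T & 0 \end{pmatrix}\Sigma_A^{-1} = \begin{pmatrix} D_{11(1\times n)} & D_{12} \\ D_{21(1\times n)} & D_{22} \end{pmatrix}_{2\times(n+1)},$$

where

$$D_{11} = \alpha\mathbf{1}_{n+1}^T(\Sigma, k)^T\big(\Sigma + \tau^2 I_n - kk^{*-1}k^T\big)^{-1} - \alpha\mathbf{1}_{n+1}^T(k^T, k^*)^T k^{*-1}k^T\big(\Sigma + \tau^2 I_n - kk^{*-1}k^T\big)^{-1},$$

$$D_{12} = -\alpha\mathbf{1}_{n+1}^T(\Sigma, k)^T\big(\Sigma + \tau^2 I_n\big)^{-1}k\big(k^* - k^T(\Sigma + \tau^2 I_n)^{-1}k\big)^{-1} + \alpha\mathbf{1}_{n+1}^T(k^T, k^*)^T\big(k^* - k^T(\Sigma + \tau^2 I_n)^{-1}k\big)^{-1},$$

$$D_{21} = \beta\tau^2\mathbf{1}_n^T\big(\Sigma + \tau^2 I_n - kk^{*-1}k^T\big)^{-1},$$

and

$$D_{22} = -\beta\tau^2\mathbf{1}_n^T\big(\Sigma + \tau^2 I_n\big)^{-1}k\big(k^* - k^T(\Sigma + \tau^2 I_n)^{-1}k\big)^{-1}.$$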

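Also, the parameter $\Delta_A = \Delta^+ + D^+\Sigma^+D^{+T} - D^+\Sigma^+A^T\Sigma_A^{-1}A\Sigma^+D^{+T}$, where $\Delta^+ = I_2$,
can be simplified as follows:

$$D^+\Sigma^+D^{+T} = \begin{pmatrix} \alpha^2\mathbf{1}_{n+1}^T\Sigma_{(n+1)\times(n+1)}\mathbf{1}_{n+1} & 0 \\ 0 & n\beta^2\tau^2 \end{pmatrix}_{2\times 2},$$

$$D^+\Sigma^+A^T = \begin{pmatrix} \alpha\mathbf{1}_{n+1}^T(\Sigma, k)^T & \alpha\mathbf{1}_{n+1}^T(k^T, k^*)^T \\ \beta\tau^2\mathbf{1}_n^T & 0 \end{pmatrix}_{2\times(n+1)},$$

and

$$A\Sigma^+D^{+T} = \begin{pmatrix} \alpha(\Sigma, k)\mathbf{1}_{n+1} & \beta\tau^2\mathbf{1}_n \\ \alpha(k^T, k^*)\mathbf{1}_{n+1} & 0 \end{pmatrix}_{(n+1)\times 2}.$$

Finally, $\Delta_A$ takes the following form:

$$\Delta_A = I_2 + \begin{pmatrix} \alpha^2\mathbf{1}_{n+1}^T\Sigma_{(n+1)\times(n+1)}\mathbf{1}_{n+1} & 0 \\ 0 & n\beta^2\tau^2 \end{pmatrix} - \begin{pmatrix} \alpha\mathbf{1}_{n+1}^T(\Sigma, k)^T & \alpha\mathbf{1}_{n+1}^T(k^T, k^*)^T \\ \beta\tau^2\mathbf{1}_n^T & 0 \end{pmatrix}\Sigma_A^{-1}\begin{pmatrix} \alpha(\Sigma, k)\mathbf{1}_{n+1} & \beta\tau^2\mathbf{1}_n \\ \alpha(k^T, k^*)\mathbf{1}_{n+1} & 0 \end{pmatrix}$$

$$= \begin{pmatrix} 1 + \alpha^2\mathbf{1}_{n+1}^T\Sigma_{(n+1)\times(n+1)}\mathbf{1}_{n+1} - W_{11} & -W_{12} \\ -W_{21} & 1 + n\beta^2\tau^2 - W_{22} \end{pmatrix},$$

where

$$W_{11} = \Big(\alpha\mathbf{1}_{n+1}^T(\Sigma, k)^T\big(\Sigma + \tau^2 I_n - kk^{*-1}k^T\big)^{-1} - \alpha\mathbf{1}_{n+1}^T(k^T, k^*)^T k^{*-1}k^T\big(\Sigma + \tau^2 I_n - kk^{*-1}k^T\big)^{-1}\Big)\alpha(\Sigma, k)\mathbf{1}_{n+1}$$

$$\qquad + \Big(-\alpha\mathbf{1}_{n+1}^T(\Sigma, k)^T\big(\Sigma + \tau^2 I_n\big)^{-1}k\big(k^* - k^T(\Sigma + \tau^2 I_n)^{-1}k\big)^{-1} + \alpha\mathbf{1}_{n+1}^T(k^T, k^*)^T\big(k^* - k^T(\Sigma + \tau^2 I_n)^{-1}k\big)^{-1}\Big)\alpha(k^T, k^*)\mathbf{1}_{n+1},$$

$$W_{12} = \Big(\alpha\mathbf{1}_{n+1}^T(\Sigma, k)^T\big(\Sigma + \tau^2 I_n - kk^{*-1}k^T\big)^{-1} - \alpha\mathbf{1}_{n+1}^T(k^T, k^*)^T k^{*-1}k^T\big(\Sigma + \tau^2 I_n - kk^{*-1}k^T\big)^{-1}\Big)\beta\tau^2\mathbf{1}_n$$

$$\quad = \alpha\beta\tau^2\Big(\mathbf{1}_{n+1}^T(\Sigma, k)^T\big(\Sigma + \tau^2 I_n - kk^{*-1}k^T\big)^{-1} - \mathbf{1}_{n+1}^T(k^T, k^*)^T k^{*-1}k^T\big(\Sigma + \tau^2 I_n - kk^{*-1}k^T\big)^{-1}\Big)\mathbf{1}_n,$$

$$W_{21} = \beta\tau^2\mathbf{1}_n^T\big(\Sigma + \tau^2 I_n - kk^{*-1}k^T\big)^{-1}\alpha(\Sigma, k)\mathbf{1}_{n+1} - \beta\tau^2\mathbf{1}_n^T\big(\Sigma + \tau^2 I_n\big)^{-1}k\big(k^* - k^T(\Sigma + \tau^2 I_n)^{-1}k\big)^{-1}\alpha(k^T, k^*)\mathbf{1}_{n+1}$$

$$\quad = \alpha\beta\tau^2\Big(\mathbf{1}_n^T\big(\Sigma + \tau^2 I_n - kk^{*-1}k^T\big)^{-1}(\Sigma, k)\mathbf{1}_{n+1} - \mathbf{1}_n^T\big(\Sigma + \tau^2 I_n\big)^{-1}k\big(k^* - k^T(\Sigma + \tau^2 I_n)^{-1}k\big)^{-1}(k^T, k^*)\mathbf{1}_{n+1}\Big),$$

and

$$W_{22} = \beta^2\tau^4\mathbf{1}_n^T\big(\Sigma + \tau^2 I_n - kk^{*-1}k^T\big)^{-1}\mathbf{1}_n.$$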

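The predictive density of $f^*$ given Y, $t^*$ is obtained by direct application of Proposition A.3
with p = n + 1 and q = 2. To proceed, consider the following partitions for
$\mu_A, \Sigma_A, D_A, v^+, \Delta_A$:

$$\Sigma_A = \begin{pmatrix} \Sigma + \tau^2 I_n & k \\ k^T & k^* \end{pmatrix} = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},\qquad \Delta_A = \begin{pmatrix} \Delta_{11} & \Delta_{12} \\ \Delta_{21} & \Delta_{22} \end{pmatrix},$$

and

$$D_A = \begin{pmatrix} D_{11} & D_{12} \\ D_{21} & D_{22} \end{pmatrix} = \begin{pmatrix} D_1 & D_2 \end{pmatrix},$$

where

$$D_1 = \begin{pmatrix} D_{11} \\ D_{21} \end{pmatrix}_{2\times n}\quad\text{and}\quad D_2 = \begin{pmatrix} D_{12} \\ D_{22} \end{pmatrix}_{2\times 1}.$$

So the conditional distribution of $f^*$ given Y, $t^*$ is

$$f^*\,|\,Y, t^* \sim CSN_{1,2}\Big(k^T\big(\Sigma + \tau^2 I_n\big)^{-1}Y,\ k^* - k^T\big(\Sigma + \tau^2 I_n\big)^{-1}k,\ D_2,\ -D^{*}Y,\ \Delta_A\Big),\qquad (3.1)$$

where

$$D^{*} = D_1 + D_2k^T\big(\Sigma + \tau^2 I_n\big)^{-1}.$$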
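The chain of Propositions A.1, A.2, and A.3 that leads to (3.1) can be traced numerically. The following sketch builds $\Sigma^+$, $D^+$, and A for a small example and then computes $\Sigma_A$, $D_A$, $\Delta_A$, and the parameters of (3.1) directly from the matrix definitions. The covariance matrix, the observations Y, and the values of n, α, β, and τ² are hypothetical placeholders chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, alpha, beta, tau2 = 3, 0.8, -1.5, 0.25

# Hypothetical covariance of (f, f*) and hypothetical observations Y.
B = rng.standard_normal((n + 1, n + 1))
full = B @ B.T + (n + 1) * np.eye(n + 1)
Sigma, k, kstar = full[:n, :n], full[:n, n], full[n, n]
Y = rng.standard_normal(n)

# Parameters of (f, f*, eps) from Proposition A.1.
Sigma_plus = np.zeros((2 * n + 1, 2 * n + 1))
Sigma_plus[: n + 1, : n + 1] = full
Sigma_plus[n + 1 :, n + 1 :] = tau2 * np.eye(n)
D_plus = np.zeros((2, 2 * n + 1))
D_plus[0, : n + 1] = alpha
D_plus[1, n + 1 :] = beta
Delta_plus = np.eye(2)

# (Y, f*) = A (f, f*, eps).
A = np.zeros((n + 1, 2 * n + 1))
A[:n, :n] = np.eye(n)
A[:n, n + 1 :] = np.eye(n)
A[n, n] = 1.0

# Proposition A.2.
Sigma_A = A @ Sigma_plus @ A.T
D_A = D_plus @ Sigma_plus @ A.T @ np.linalg.inv(Sigma_A)
Delta_A = Delta_plus + D_plus @ Sigma_plus @ D_plus.T - D_A @ A @ Sigma_plus @ D_plus.T

# Proposition A.3 with Y1 = Y (first n coordinates) and Y2 = f*.
S_inv = np.linalg.inv(Sigma + tau2 * np.eye(n))
D1, D2 = D_A[:, :n], D_A[:, n:]
mu_star = float(k @ S_inv @ Y)               # location parameter in (3.1)
sigma2_star = float(kstar - k @ S_inv @ k)   # scale parameter in (3.1)
D_star = D1 + D2 @ (k @ S_inv)[None, :]
v_cond = -D_star @ Y                         # fourth parameter in (3.1)
```

The location and scale parameters coincide with the usual Gaussian process predictive mean and variance; the skewness enters only through $D_2$, $-D^{*}Y$, and $\Delta_A$.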

Proof of (ii). Here, we find the mean and the variance of $f^*|Y, t^*$ by applying
Proposition A.4. By that proposition, the moment generating function of $f^*|Y, t^*$ is

$$M_{f^*|Y,t^*}(s) = \frac{\Phi_2\big(D_2\sigma^{*2}s;\ -D^{*}Y,\ \Delta_A + \sigma^{*2}D_2D_2^T\big)}{\Phi_2\big(\mathbf{0}_{2\times 1};\ -D^{*}Y,\ \Delta_A + \sigma^{*2}D_2D_2^T\big)}\,e^{s\mu^* + \frac{1}{2}\sigma^{*2}s^2},\qquad s \in \mathbb{R},$$

where $\sigma^{*2} = k^* - k^T(\Sigma + \tau^2 I_n)^{-1}k$ and $\mu^* = k^T(\Sigma + \tau^2 I_n)^{-1}Y$. Let $\Phi_2^{(j)}(\cdot,\cdot)$ denote
the first partial derivative of $\Phi_2(\cdot,\cdot)$ with respect to the jth component for j = 1, 2. Also,
let $\Phi_2^{(ij)}(\cdot,\cdot)$ denote the mixed second partial derivative of $\Phi_2(\cdot,\cdot)$. Now we find the mean
and the variance of $f^*|Y, t^*$:

$$E(f^*|Y, t^*) = \frac{\partial}{\partial s}M_{f^*|Y}(s)\Big|_{s=0} = \frac{\partial}{\partial s}\left(\frac{\Phi_2\Big(\begin{pmatrix} D_{12}\sigma^{*2}s \\ D_{22}\sigma^{*2}s \end{pmatrix};\ -D^{*}Y,\ \Delta_A + \sigma^{*2}D_2D_2^T\Big)}{\Phi_2\big(\mathbf{0}_{2\times 1};\ -D^{*}Y,\ \Delta_A + \sigma^{*2}D_2D_2^T\big)}\,e^{s\mu^* + \frac{1}{2}\sigma^{*2}s^2}\right)\Bigg|_{s=0}$$

$$= \frac{1}{\Phi_2\big(\mathbf{0}_{2\times 1};\ -D^{*}Y,\ \Delta_A + \sigma^{*2}D_2D_2^T\big)}\Big[\Phi_2^{(1)}\big(D_2\sigma^{*2}s;\ -D^{*}Y,\ \Delta_A + \sigma^{*2}D_2D_2^T\big)D_{12}\sigma^{*2}$$

$$\qquad + \Phi_2^{(2)}\big(D_2\sigma^{*2}s;\ -D^{*}Y,\ \Delta_A + \sigma^{*2}D_2D_2^T\big)D_{22}\sigma^{*2}$$

$$\qquad + \Phi_2\big(D_2\sigma^{*2}s;\ -D^{*}Y,\ \Delta_A + \sigma^{*2}D_2D_2^T\big)\big(\mu^* + \sigma^{*2}s\big)\Big]e^{s\mu^* + \frac{1}{2}\sigma^{*2}s^2}\Big|_{s=0}.$$

Finally, we find that

$$E(f^*|y, t^*) = \mu^* + \sigma^{*2}\Big(D_{12}\Phi_2^{(1)}\big(\mathbf{0}_{2\times 1};\ -D^{*}Y,\ \Delta_A + \sigma^{*2}D_2D_2^T\big) + D_{22}\Phi_2^{(2)}\big(\mathbf{0}_{2\times 1};\ -D^{*}Y,\ \Delta_A + \sigma^{*2}D_2D_2^T\big)\Big),$$

where $\Phi_2^{(1)}$ is the first derivative of $\Phi_2$ with respect to the first component, and $\Phi_2^{(2)}$ is the
first derivative of $\Phi_2$ with respect to the second component.
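
Also, we need to find $E(f^{*2}|Y, t^*)$ to calculate the variance of $f^*|Y, t^*$. So

$$E\big(f^{*2}|Y, t^*\big) = \frac{1}{\Phi_2\big(\mathbf{0}_{2\times 1};\ -D^{*}Y,\ \Delta_A + \sigma^{*2}D_2D_2^T\big)}\times\frac{\partial^2}{\partial s^2}\left[\Phi_2\Big(\begin{pmatrix} D_{12}\sigma^{*2}s \\ D_{22}\sigma^{*2}s \end{pmatrix};\ -D^{*}Y,\ \Delta_A + \sigma^{*2}D_2D_2^T\Big)e^{s\mu^* + \frac{1}{2}\sigma^{*2}s^2}\right]\Bigg|_{s=0}$$

$$= \frac{1}{\Phi_2\big(\mathbf{0}_{2\times 1};\ -D^{*}Y,\ \Delta_A + \sigma^{*2}D_2D_2^T\big)}\times\frac{\partial}{\partial s}\Big[\Big(\Phi_2^{(1)}\big(D_2\sigma^{*2}s;\ -D^{*}Y,\ \Delta_A + \sigma^{*2}D_2D_2^T\big)D_{12}\sigma^{*2}$$

$$\qquad + \Phi_2^{(2)}\big(D_2\sigma^{*2}s;\ -D^{*}Y,\ \Delta_A + \sigma^{*2}D_2D_2^T\big)D_{22}\sigma^{*2}$$

$$\qquad + \Phi_2\big(D_2\sigma^{*2}s;\ -D^{*}Y,\ \Delta_A + \sigma^{*2}D_2D_2^T\big)\big(\mu^* + \sigma^{*2}s\big)\Big)e^{s\mu^* + \frac{1}{2}\sigma^{*2}s^2}\Big]\Big|_{s=0}.$$

After substituting s = 0, $E(f^{*2}|Y, t^*)$ reduces to

$$E\big(f^{*2}|Y, t^*\big) = 2\Big(\Phi_2^{(11)}\big(\mathbf{0}_{2\times 1};\ -D^{*}Y,\ \Delta_A + \sigma^{*2}D_2D_2^T\big)\big(D_{12}\sigma^{*2}\big)^2$$

$$\qquad + \Phi_2^{(12)}\big(\mathbf{0}_{2\times 1};\ -D^{*}Y,\ \Delta_A + \sigma^{*2}D_2D_2^T\big)D_{12}\sigma^{*2}D_{22}\sigma^{*2}$$

$$\qquad + \Phi_2^{(21)}\big(\mathbf{0}_{2\times 1};\ -D^{*}Y,\ \Delta_A + \sigma^{*2}D_2D_2^T\big)D_{12}\sigma^{*2}D_{22}\sigma^{*2}$$

$$\qquad + \Phi_2^{(22)}\big(\mathbf{0}_{2\times 1};\ -D^{*}Y,\ \Delta_A + \sigma^{*2}D_2D_2^T\big)\big(D_{22}\sigma^{*2}\big)^2\Big)$$

$$\qquad + 4\Big(\Phi_2^{(1)}\big(\mathbf{0}_{2\times 1};\ -D^{*}Y,\ \Delta_A + \sigma^{*2}D_2D_2^T\big)D_{12}\sigma^{*2} + \Phi_2^{(2)}\big(\mathbf{0}_{2\times 1};\ -D^{*}Y,\ \Delta_A + \sigma^{*2}D_2D_2^T\big)D_{22}\sigma^{*2}\Big)\mu^* + \mu^{*2} + \sigma^{*2}.$$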

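Hence,

$$\mathrm{var}\big(f^*|Y, t^*\big) = E\big(f^{*2}|Y\big) - \big(E(f^*|Y)\big)^2$$

$$= 2\Big(\Phi_2^{(11)}\big(\mathbf{0}_{2\times 1};\ -D^{*}Y,\ \Delta_A + \sigma^{*2}D_2D_2^T\big)\big(D_{12}\sigma^{*2}\big)^2 + \Phi_2^{(12)}\big(\mathbf{0}_{2\times 1};\ -D^{*}Y,\ \Delta_A + \sigma^{*2}D_2D_2^T\big)D_{12}\sigma^{*2}D_{22}\sigma^{*2}$$

$$\qquad + \Phi_2^{(21)}\big(\mathbf{0}_{2\times 1};\ -D^{*}Y,\ \Delta_A + \sigma^{*2}D_2D_2^T\big)D_{12}\sigma^{*2}D_{22}\sigma^{*2} + \Phi_2^{(22)}\big(\mathbf{0}_{2\times 1};\ -D^{*}Y,\ \Delta_A + \sigma^{*2}D_2D_2^T\big)\big(D_{22}\sigma^{*2}\big)^2\Big)$$

$$\qquad + 4\Big(\Phi_2^{(1)}\big(\mathbf{0}_{2\times 1};\ -D^{*}Y,\ \Delta_A + \sigma^{*2}D_2D_2^T\big)D_{12}\sigma^{*2} + \Phi_2^{(2)}\big(\mathbf{0}_{2\times 1};\ -D^{*}Y,\ \Delta_A + \sigma^{*2}D_2D_2^T\big)D_{22}\sigma^{*2}\Big)\mu^* + \mu^{*2} + \sigma^{*2}$$

$$\qquad - \Big(\mu^* + \sigma^{*2}\Big(D_{12}\Phi_2^{(1)}\big(\mathbf{0}_{2\times 1};\ -D^{*}Y,\ \Delta_A + \sigma^{*2}D_2D_2^T\big) + D_{22}\Phi_2^{(2)}\big(\mathbf{0}_{2\times 1};\ -D^{*}Y,\ \Delta_A + \sigma^{*2}D_2D_2^T\big)\Big)\Big)^2.$$

Using the necessary algebra, the expression for $\mathrm{var}(f^*|Y, t^*)$ reduces to

$$\mathrm{var}\big(f^*|Y, t^*\big) = 2\sigma^{*4}\Big(\Phi_2^{(11)}\big(\mathbf{0}_{2\times 1};\ -D^{*}Y,\ \Delta_A + \sigma^{*2}D_2D_2^T\big)D_{12}^2 + \Phi_2^{(12)}\big(\mathbf{0}_{2\times 1};\ -D^{*}Y,\ \Delta_A + \sigma^{*2}D_2D_2^T\big)D_{12}D_{22}$$

$$\qquad + \Phi_2^{(21)}\big(\mathbf{0}_{2\times 1};\ -D^{*}Y,\ \Delta_A + \sigma^{*2}D_2D_2^T\big)D_{12}D_{22} + \Phi_2^{(22)}\big(\mathbf{0}_{2\times 1};\ -D^{*}Y,\ \Delta_A + \sigma^{*2}D_2D_2^T\big)D_{22}^2\Big),$$

where $\Phi_2^{(11)}$ is the derivative of $\Phi_2^{(1)}$ with respect to the first component, $\Phi_2^{(12)}$ is the
derivative of $\Phi_2^{(1)}$ with respect to the second component, $\Phi_2^{(21)}$ is the derivative of $\Phi_2^{(2)}$
with respect to the first component, and $\Phi_2^{(22)}$ is the derivative of $\Phi_2^{(2)}$ with respect to the
second component.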

B. Proof of Theorem 4.1.


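Proof. (ii) and (iii)(a)–(iii)(b). Since $k(\cdot,\cdot)$ is isotropic and $t_1, t_2, \ldots, t_n$ are the vertices of a
regular polygon with center located at $t^*$, $k(t_i, t^*)$ is constant for each i = 1, …, n and
the matrix $\Sigma = \big(k(t_i, t_j)\big)_{i,j=1}^{n}$ is circulant. Let $k_0 = k(t_i, t^*)$. Then $k = k(t^*) = k_0\mathbf{1}_n$.
Moreover, the matrix $(\Sigma + \tau^2 I_n)^{-1}$ is also circulant. Therefore,

$$b\big(\tau^2, \beta^2, n\big) = \sqrt{\frac{2}{\pi}}\,\frac{\tau^2\beta}{\sqrt{1 + \beta^2\tau^2 n}}\,k^T\big(\Sigma + \tau^2 I_n\big)^{-1}\mathbf{1}_n = \sqrt{\frac{2}{\pi}}\,\frac{\tau^2\beta k_0}{\sqrt{1 + \beta^2\tau^2 n}}\,\mathbf{1}_n^T\big(\Sigma + \tau^2 I_n\big)^{-1}\mathbf{1}_n.$$

Since $(\Sigma + \tau^2 I_n)^{-1}$ is circulant and $\mathbf{1}_n$ is an eigenvector of any circulant matrix, then

$$\big(\Sigma + \tau^2 I_n\big)^{-1}\mathbf{1}_n = \frac{1}{L_n}\mathbf{1}_n,$$

where

$$L_n = \tau^2 + \sum_{i=1}^{n}k(t_1, t_i).$$

Hence,

$$b\big(\tau^2, \beta^2, n\big) = \sqrt{\frac{2}{\pi}}\,\frac{\tau^2\beta k_0\,n}{\sqrt{1 + \beta^2\tau^2 n}\,L_n}.$$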
  
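It is easy to see that $\big|b(\tau^2, \beta^2, n)\big| > 0$ for all nonzero values of τ, β, and $k_0$. If
$\sum_{i=1}^{n}k(t_1, t_i) = n^{-0.5}O(n)$, where $n^{-0.5}O(n) \to c \neq 0$, then

$$b\big(\tau^2, \beta^2, n\big) = \sqrt{\frac{2}{\pi}}\,\frac{\tau^2\beta k_0\,n}{\sqrt{1 + \beta^2\tau^2 n}\,\big(\tau^2 + n^{-0.5}O(n)\big)} = \sqrt{\frac{2}{\pi}}\,\frac{\tau^2\beta k_0\sqrt{n}}{\sqrt{1 + \beta^2\tau^2 n}\,\Big(\dfrac{\tau^2}{\sqrt{n}} + \dfrac{O(n)}{n}\Big)}.$$

Hence,

$$\lim_{n\to\infty}b\big(\tau^2, \beta^2, n\big) = \sqrt{\frac{2}{\pi}}\,\frac{\tau^2\beta k_0}{\sqrt{\beta^2\tau^2}}\,\frac{1}{c} = \sqrt{\frac{2}{\pi}}\,\frac{\tau\beta k_0}{c\,|\beta|}.$$

To show that $\lim_{\tau\to\infty}b(\tau^2, \beta^2, n) = 0$, we notice that $(\Sigma + \tau^2 I_n)^{-1}\mathbf{1}_n = \frac{1}{L_n}\mathbf{1}_n$.
Consequently, we find that

$$b\big(\tau^2, \beta^2, n\big) = \sqrt{\frac{2}{\pi}}\,\frac{\tau^2\beta k_0\,n}{\sqrt{1 + \beta^2\tau^2 n}\,L_n} = \sqrt{\frac{2}{\pi}}\,\frac{n\beta k_0\tau^2}{\big(\tau^2 + \sum_{i=1}^{n}k(t_1, t_i)\big)\sqrt{1 + \beta^2\tau^2 n}}.$$

Hence $\lim_{\tau\to\infty}b(\tau^2, \beta^2, n) = 0$.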
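The eigenvector argument in this proof can be checked numerically. The sketch below places the inputs on the vertices of a regular n-gon of unit radius centred at $t^*$, uses a squared exponential kernel as one example of an isotropic covariance function (an assumption made here for illustration, as are the function name `bias_terms` and its arguments), and compares the bias computed from the matrix expression with the closed form involving $L_n$.

```python
import math
import numpy as np


def bias_terms(n, tau2, beta, length_scale=1.0):
    """Return (direct, closed_form) values of b(tau^2, beta^2, n) on a regular n-gon."""
    angles = 2.0 * math.pi * np.arange(n) / n
    T = np.column_stack([np.cos(angles), np.sin(angles)])  # vertices; centre t* at the origin
    sq_dist = ((T[:, None, :] - T[None, :, :]) ** 2).sum(axis=-1)
    Sigma = np.exp(-0.5 * sq_dist / length_scale**2)       # isotropic kernel gives a circulant matrix
    k0 = math.exp(-0.5 / length_scale**2)                  # k(t_i, t*) since |t_i - t*| = 1
    k = np.full(n, k0)
    ones = np.ones(n)
    scale = math.sqrt(2.0 / math.pi) * tau2 * beta / math.sqrt(1.0 + beta**2 * tau2 * n)
    # Matrix expression: scale * k^T (Sigma + tau^2 I)^{-1} 1_n.
    direct = scale * (k @ np.linalg.solve(Sigma + tau2 * np.eye(n), ones))
    # Closed form: scale * k0 * n / L_n with L_n = tau^2 + sum_i k(t_1, t_i).
    L_n = tau2 + Sigma[0].sum()
    closed = scale * k0 * n / L_n
    return float(direct), float(closed)


direct, closed = bias_terms(n=6, tau2=0.5, beta=1.2)
bias_large_tau = bias_terms(n=6, tau2=1e8, beta=1.2)[1]
```

For a fixed n the closed form behaves like a constant times $1/\tau$ for large τ, which is why `bias_large_tau` is close to zero.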

C. Algorithms for Section 5


i. Confidence intervals for hyperparameters
1. Simulate a large sample, say $\tau^{(i)}, \sigma^{2(i)}, \alpha^{(i)}, \beta^{(i)}$, and $\lambda^{(i)}$, i = 1, …, N, from
$P(\tau, \sigma^2, \alpha, \beta, \lambda\,|\,Y)$.
2. Let $q_{0.025}^{N}(\tau)$, $q_{0.025}^{N}(\sigma^2)$, $q_{0.025}^{N}(\alpha)$, $q_{0.025}^{N}(\beta)$, $q_{0.025}^{N}(\lambda)$ be the 2.5% percentiles
of the samples $\tau^{(i)}, \sigma^{2(i)}, \alpha^{(i)}, \beta^{(i)}$, and $\lambda^{(i)}$, i = 1, …, N, respectively, and let $q_{0.975}^{N}(\tau)$,
$q_{0.975}^{N}(\sigma^2)$, $q_{0.975}^{N}(\alpha)$, $q_{0.975}^{N}(\beta)$, $q_{0.975}^{N}(\lambda)$ be the 97.5% percentiles of the same
samples, respectively.
3. Then $\big(q_{0.025}^{N}(\tau), q_{0.975}^{N}(\tau)\big)$, $\big(q_{0.025}^{N}(\sigma^2), q_{0.975}^{N}(\sigma^2)\big)$, $\big(q_{0.025}^{N}(\alpha), q_{0.975}^{N}(\alpha)\big)$,
$\big(q_{0.025}^{N}(\beta), q_{0.975}^{N}(\beta)\big)$, $\big(q_{0.025}^{N}(\lambda), q_{0.975}^{N}(\lambda)\big)$ are 95% confidence intervals for
τ, σ², α, β, and λ, respectively.
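Steps 1–3 above amount to taking empirical 2.5% and 97.5% quantiles of the simulated hyperparameter values. A minimal Python sketch follows. Since the posterior sampler itself is not specified in this appendix, the draws below are hypothetical stand-ins generated from arbitrary distributions, and the helper names `quantile` and `percentile_interval` are illustrative.

```python
import math
import random


def quantile(sorted_draws, q):
    """Empirical quantile with linear interpolation between order statistics."""
    pos = q * (len(sorted_draws) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(sorted_draws) - 1)
    return sorted_draws[lo] + (pos - lo) * (sorted_draws[hi] - sorted_draws[lo])


def percentile_interval(draws, level=0.95):
    """Equal-tailed interval (q_{0.025}, q_{0.975}) when level = 0.95."""
    s = sorted(draws)
    tail = (1.0 - level) / 2.0
    return quantile(s, tail), quantile(s, 1.0 - tail)


# Placeholder posterior draws (assumed distributions, for illustration only).
random.seed(0)
N = 5000
posterior = {
    "tau": [random.gammavariate(2.0, 0.5) for _ in range(N)],
    "sigma2": [random.gammavariate(3.0, 0.3) for _ in range(N)],
    "alpha": [random.gauss(0.5, 0.2) for _ in range(N)],
    "beta": [random.gauss(-1.0, 0.4) for _ in range(N)],
    "lambda": [random.gammavariate(4.0, 0.25) for _ in range(N)],
}
intervals = {name: percentile_interval(d) for name, d in posterior.items()}
```

Each entry of `intervals` is a pair (lower, upper); in practice the lists in `posterior` would be replaced by draws from $P(\tau, \sigma^2, \alpha, \beta, \lambda\,|\,Y)$.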
ii. Algorithm to Solve Eq. (8)
Inputs:
• Increments $d_1$, $d_2$, a large real number $L_0$, and precision ω.
• Smooth grid for the space of $t^*$, say $t_1^*, \ldots, t_M^*$.
• Grid and sample sizes M and N, respectively.
Start.
1. For each $t_j^*$, simulate a large sample of size N from the posterior.
2. Find $T_i$ and $U_i$, the minimum and the maximum over j = 1, …, M of the ith sampled
value at $t_j^*$, i = 1, …, N.
3. Estimate p, the probability in (8), via $\hat{p} = \frac{1}{N}\sum_{i=1}^{N} I\big(l \le T_i \text{ and } U_i \le u\big)$.
For l = −L0 to L1 STEP $d_1$ (searching for the solution in $(-L_0, L_0) \times (-L_0, L_0)$)
  For u = −U0 to U1 STEP $d_2$
    Do while ($|\hat{p} - 0.95| > \omega$), starting from $l = -L_0$, $u = -L_0$
      For each $t_j^*$, simulate a large sample of size N from the posterior.
      Find $T_i$ and $U_i$ as in step 2.
      Estimate p, the probability in (8), via $\hat{p} = \frac{1}{N}\sum_{i=1}^{N} I\big(l \le T_i \text{ and } U_i \le u\big)$.
      Update l and u: $l \leftarrow l + d_1$ and $u \leftarrow u + d_2$
    End Do
Output: L = l and U = u
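One possible reading of this search is sketched below in Python. The loop structure in the listing is ambiguous, so the sketch substitutes a plain exhaustive grid search over $(-L_0, L_0) \times (-L_0, L_0)$ and, as an added assumption not stated in the text, returns the narrowest pair (l, u) whose estimated probability is within ω of 0.95. The posterior draws are hypothetical placeholders (independent standard normal values on a grid of M points), and the posterior sample is simulated once rather than at every iteration.

```python
import random


def coverage(T, U, l, u):
    """p_hat = (1/N) * sum_i I(l <= T_i and U_i <= u)."""
    return sum(1 for t, v in zip(T, U) if l <= t and v <= u) / len(T)


def find_band(draws, L0, d1, d2, target=0.95, omega=0.01):
    """Grid search for (l, u, p_hat) with |p_hat - target| <= omega and smallest u - l."""
    T = [min(row) for row in draws]  # T_i: minimum of draw i over the grid of t*
    U = [max(row) for row in draws]  # U_i: maximum of draw i over the grid of t*
    best = None
    for a in range(int(2 * L0 / d1) + 1):
        l = -L0 + a * d1
        for b in range(int(2 * L0 / d2) + 1):
            u = -L0 + b * d2
            if u <= l:
                continue
            p_hat = coverage(T, U, l, u)
            if abs(p_hat - target) <= omega and (best is None or u - l < best[1] - best[0]):
                best = (l, u, p_hat)
    return best  # None if no grid point meets the precision


# Placeholder posterior draws: N samples, each evaluated at M grid points.
random.seed(0)
N, M = 400, 10
draws = [[random.gauss(0.0, 1.0) for _ in range(M)] for _ in range(N)]
band = find_band(draws, L0=5.0, d1=0.1, d2=0.1, omega=0.02)
```

If `band` is `None`, the increments $d_1$, $d_2$ or the precision ω are too coarse relative to the resolution 1/N of $\hat{p}$, and one of them should be relaxed.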