Skew Gaussian Process For Nonlinear Regression

Communications in Statistics—Theory and Methods, 43: 4936–4961, 2014
Copyright © Taylor & Francis Group, LLC

ISSN: 0361-0926 print / 1532-415X online
DOI: 10.1080/03610926.2012.737498
M. T. ALODAT (1) AND E. Y. AL-MOMANI (2)

(1) Department of Mathematics, Statistics and Physics, Qatar University, Qatar
(2) Department of Statistics, Yarmouk University, Irbid, Jordan
In this article, we extend the Gaussian process for regression model by assuming a skew
Gaussian process prior on the input function and a skew Gaussian white noise on the
error term. Under these assumptions, the predictive density of the output function at
a new fixed input is obtained in closed form. We also study the Gaussian process
predictor when the errors depart from Gaussianity and follow a skew Gaussian white noise.
The bias is derived in closed form and is studied for some special cases. We conduct a
simulation study to compare the empirical distribution function of the Gaussian process
predictor under Gaussian white noise and skew Gaussian white noise.
Keywords Gaussian process; Multivariate closed skew normal distribution; Prediction; Prior distribution.
Mathematics Subject Classification Primary 60G15; Secondary 62J02.
1. Introduction
In the statistical literature, Gaussianity (normality) has long been assumed for the statistical
models used to analyze spatial data. The popularity of the Gaussian assumption is due to its
mathematical tractability. For example, the multivariate Gaussian distribution is closed under
marginalization, conditioning, and convolution. Despite these convenient properties, many real
data sets do not meet the Gaussian assumption because of skewness. If the analysis of such data
sets relies on the Gaussian assumption, then unrealistic or nonsensical estimates can be
produced. The simplest way to analyze skewed data with a Gaussian model is to Gaussianize the
data, i.e., to transform the data so that they are approximately Gaussian. Such a transformation
is not recommended for the following reasons. (i) Finding a suitable transformation to achieve
normality is not easy in practice. (ii) Since such transformations are usually applied to the
data component-wise, normality of the marginal distributions does not guarantee joint
normality. Hence, the estimates might
Received May 9, 2012; Accepted October 3, 2012.

Address correspondence to M. T. Alodat, Department of Mathematics, Statistics and Physics,
Qatar University, Qatar; E-mail: alodatmts@yahoo.com
Color versions of one or more of the figures in the article can be found online at
www.tandfonline.com/lsta.
be biased. (iii) Apart from the difficulty of interpreting the transformed data, the
skewness of the data should not be ignored, since it has an interpretation (Buccianti, 2005).
Recently, several researchers have defined random processes that possess a skewness
parameter. Alodat and Aludaat (2007) employed the skew normal theory, as presented in Genton
(2004), to define a new random process called the skew Gaussian process, and gave an
application to real data. Relying on the multivariate closed skew normal distribution of
González-Farías et al. (2004), Allard and Naveau (2007) defined what they called the closed
skew normal random field. For more examples of skew random processes or fields, we refer the
reader to Zhang and El-Sharaawi (2009) and Alodat and Al-Rawwash (2009).
The cornerstone in defining new skew processes or fields is the multivariate skew normal
distribution, which appeared in the pioneering works of Azzalini (1985, 1986), Azzalini and
Dalla Valle (1996), and Azzalini and Capitanio (1999). The skew normal, or skew Gaussian,
distribution is defined as follows. A random vector $Z_{(n \times 1)}$ is said to have an
n-dimensional multivariate skew normal distribution if it has the probability density function
(pdf)

$$P_Z(z) = 2\,\phi_n(z; 0, \Sigma)\,\Phi(\alpha^T z), \quad z \in \mathbb{R}^n, \tag{1}$$

where $\phi_n(\cdot\,; 0, \Sigma)$ is the pdf of $N_n(0, \Sigma)$, $\Phi(\cdot)$ is the cdf of
$N(0, 1)$, and $\alpha_{(n \times 1)}$ is a vector called the skewness parameter. A more general
family than (1) is obtained by using the transformation $X = \mu + Z$, $\mu \in \mathbb{R}^n$.
It is easy to show that the pdf of X is $P_X(x) = P_Z(x - \mu)$. We use the notation
$X \sim SN_n(\mu, \Sigma, \alpha)$ to denote an n-dimensional skew normal distribution with
parameters $\mu$, $\Sigma$, and $\alpha$.
A generalization of (1) is given by González-Farías et al. (2004) as follows. Let
$\mu \in \mathbb{R}^p$, let D be an arbitrary $q \times p$ matrix, and let $\Sigma$ and $\Delta$
be positive definite matrices of dimensions $p \times p$ and $q \times q$, respectively. A random
vector Y is said to have a p-dimensional closed skew normal (CSN) distribution with parameters
$q, \mu, \Sigma, D, v, \Delta$, denoted by $Y \sim CSN_{p,q}(\mu, \Sigma, D, v, \Delta)$, if its
pdf takes the form

$$P_{p,q}(y) = C\,\phi_p(y; \mu, \Sigma)\,\Phi_q(D(y - \mu); v, \Delta), \quad y \in \mathbb{R}^p, \tag{2}$$

where C is defined via

$$C^{-1} = \Phi_q(0; v, \Delta + D \Sigma D^T),$$

and $\phi_p(\cdot\,; \eta, \psi)$, $\Phi_p(\cdot\,; \eta, \psi)$ are the pdf and the cumulative
distribution function (cdf) of a p-dimensional normal distribution with mean vector $\eta$ and
covariance matrix $\psi$. Throughout this article, several lemmas and results about the
multivariate CSN distribution are used extensively, so we present them in Appendix A. For their
proofs, we refer the reader to González-Farías et al. (2004) or Genton (2004).
Furthermore, it has been shown that the family of skew normal distributions possesses
properties that are close to or coincide with those of the normal family. Besides these
closure properties, it contains the normal family as the special case α = 0. Such properties
have attracted researchers to extend well-known statistical techniques to the skew normal
setting, and much work in this direction remains to be done. For example, the Gaussian process
regression (GPR) model is a statistical technique introduced by Neal (1995) to treat the
nonlinear regression Y(t) = f(t) + ε(t) from a Bayesian viewpoint. Simply, the technique places
a Gaussian process prior on the unknown function f(t), while ε(t) is assumed to be a white
noise process. Then the aim is to predict f(t) at a
new value of t. In other words, the Gaussian process provides us with a prior distribution
over the space of all functions.
Since the Gaussian family is a sub-family of the skew Gaussian family, using the skew
Gaussian process, i.e., a process whose finite dimensional distributions are of the form (1),
as a prior on f(t) allows us to define a distribution over a richer family of functions than
the Gaussian one. It also allows us to extend the error term in the above regression model to
have a skewed distribution, which is closer to real data than its Gaussian counterpart.
The literature shows that GPR has significant applications in various fields of science. For
example, it has been applied to model noisy data, to classification problems arising in machine
learning, and to predict the inverse dynamics of a robot arm (Rasmussen and Williams, 2006).
Brahim-Belhouari and Bermak (2004) applied the GPR model to predict the future value of a
non-stationary time series. Schmidt et al. (2008) studied the sensitivity of GPR to the choice
of correlation function. Based on a numerical study, they concluded that the predictions did
not differ much among the different correlation functions. Vanhatalo et al. (2009) proposed a
GPR with a Student-t likelihood by approximating the joint distribution of the process by a
Student distribution. The idea behind that approximation is to make the GPR model robust
against outliers. The model they proposed is analytically intractable. Kuss (2006) proposed
other robust models as alternatives to GPR. Macke et al. (2010) applied GPR to estimate the
cortical map of the human brain. They modeled the brain image from their experiment, in which
the activity at each voxel is measured, by a Gaussian process. Fyfe et al. (2008) applied GPR
to canonical correlation analysis with an application to neuron data.
The prediction problem for the nonlinear regression Y(t) = f(t) + ε(t) from a Bayesian
viewpoint, when both f(t) and ε(t) follow skew Gaussian processes, has not yet been addressed
in the literature. In this article, we extend the GPR model by assuming two independent skew
Gaussian processes, one on f(t) and the other on ε(t). In other words, we consider the
nonlinear regression model $Y_i = f(t_i) + \varepsilon(t_i)$, i = 1, 2, ..., n, i.e., for each
i, $f(t_i)$ is measured as $Y_i$ but corrupted by the noise $\varepsilon(t_i)$. We put a skew
Gaussian process prior on the function f(t) and assume that the process ε(t) also follows a
skew Gaussian process. Under these assumptions, the following two prediction problems are
considered: (i) prediction of f(t) at a fixed input t, and (ii) prediction of f(t) at a random
input $t^*$.
The rest of this article is organized as follows. In Sec. 2, we introduce the reader to
the GPR model. In Sec. 3, we generalize the GPR model by assuming a skew Gaussian
process on f (t) and another skew Gaussian process on ε (t). Then we derive the predictive
density of the output function at new input. Also, we derive the mean and the variance of
the predictive distribution. In Sec. 4, we assume that the GPR predictor is used to analyze
data with skewed errors, and we derive its bias and variance. In Sec. 5, we conduct
a simulation study to compare the new model to the Gaussian one. Finally, we state our
conclusions in Sec. 6.
2. Gaussian Process for Regression

A family $\{X(t), t \in C\}$, $C \subseteq \mathbb{R}^p$, of random variables is said to
constitute a Gaussian process if for every n and $t_1, \ldots, t_n \in C$, the random variables
$X(t_1), \ldots, X(t_n)$ have an n-dimensional multivariate normal distribution. The Gaussian
process is used in the statistical literature as a prior process for the Bayesian analysis of
several statistical problems. For example, O’Hagan (1978) was the first to use the Gaussian
process as a prior over the space of functions to treat nonlinear regression from a Bayesian
viewpoint, while an application of O’Hagan’s work to Bayesian learning in networks appeared in
Neal (1995).
The GPR, as presented in Neal (1995), can be illustrated as follows. Consider a set of
training data $Y = (Y_1, Y_2, \ldots, Y_n)^T$, where the input vectors
$t_1, t_2, \ldots, t_n \in C \subseteq \mathbb{R}^p$ and their output values
$Y_1, Y_2, \ldots, Y_n$ are governed by the nonlinear regression model
$Y_i = f(t_i) + \varepsilon(t_i)$, where $\varepsilon(t_1), \ldots, \varepsilon(t_n)$ are iid
Gaussian noises on C with mean 0 and variance $\tau^2$, and $f(\cdot)$ is an unknown function.
The main question is: what is the predicted value of $f^* = f(t^*)$, the value of f(t) at a new
input $t^*$? To answer this question, a prior distribution on f(t) is needed, i.e., a
distribution over a set of functions. This prior distribution should be defined on the class of
all functions defined on the space of t. The set of all sample paths of a Gaussian process on C
provides us with a rich class of such functions.

Assume that $f(t)$, $t \in C$, is a Gaussian process with covariance function
$k(\cdot, \cdot)$, i.e., for every n and $t_1, t_2, \ldots, t_n \in C$, we have
$f = (f(t_1), \ldots, f(t_n))^T \sim N_n(0, \Sigma)$, where

$$\Sigma = \begin{pmatrix} k(t_1, t_1) & \cdots & k(t_1, t_n) \\ \vdots & \ddots & \vdots \\ k(t_n, t_1) & \cdots & k(t_n, t_n) \end{pmatrix}.$$

A suitable choice for $k(\cdot, \cdot)$ is the following covariance function:

$$k(t_i, t_j) = \exp\left(-\frac{1}{2}(t_i - t_j)^T \Lambda^{-1} (t_i - t_j)\right). \tag{3}$$

For simplicity, we may consider $\Lambda = \mathrm{diag}(\lambda_1^2, \ldots, \lambda_p^2)$. A
covariance function $k(\cdot, \cdot)$ is said to be isotropic if $k(t_i, t_j)$ depends only on
the distance $\|t_i - t_j\|$. For more information about other types of covariance functions,
see Girard et al. (2004).
Since f(t) is assumed to follow a Gaussian process, then, according to Quiñonero-Candela
et al. (2003), the joint distribution of f and $f(t^*)$ is an (n + 1)-dimensional multivariate
Gaussian distribution, i.e.,

$$\begin{pmatrix} f(t_1) \\ \vdots \\ f(t_n) \\ f(t^*) \end{pmatrix} \sim N_{n+1}(0, \tilde{\Sigma}),$$

with

$$\tilde{\Sigma} = \begin{pmatrix} k(t_1, t_1) & \cdots & k(t_1, t_n) & k(t_1, t^*) \\ \vdots & \ddots & \vdots & \vdots \\ k(t_n, t_1) & \cdots & k(t_n, t_n) & k(t_n, t^*) \\ k(t^*, t_1) & \cdots & k(t^*, t_n) & k(t^*, t^*) \end{pmatrix} = \begin{pmatrix} \Sigma & k \\ k^T & k^* \end{pmatrix},$$

where $\Sigma = \{k(t_i, t_j)\}_{i,j=1}^{n}$, $k = (k(t_1, t^*), \ldots, k(t_n, t^*))^T$, and
$k^* = k(t^*, t^*)$.

Rasmussen (1996) showed that the predictive distribution of $f^*$ given Y and $t^*$ remains
Gaussian and is given by

$$f^* \mid Y, t^* \sim N(\mu(t^*), \sigma^2(t^*)), \tag{4}$$

where $\mu(t^*)$ and $\sigma^2(t^*)$ are the mean and the variance of the predictive
distribution (4), given by

$$\mu^* = \mu(t^*) = k^T(\Sigma + \tau^2 I_n)^{-1} Y \quad \text{and} \quad \sigma^2(t^*) = k^* - k^T(\Sigma + \tau^2 I_n)^{-1} k,$$

with $Y = (Y_1, Y_2, \ldots, Y_n)^T$.
The distribution (4) can be used to draw several inferential statements about $f(t^*)$.
For instance, when p = 1, a 100(1 − α)% prediction interval for $f(t^*)$ is given by [L, U],
where L and U are the solutions of the following two equations:

$$\int_{-\infty}^{L} P(f^* \mid Y, t^*)\,df^* = \frac{\alpha}{2} \quad \text{and} \quad \int_{U}^{\infty} P(f^* \mid Y, t^*)\,df^* = \frac{\alpha}{2}.$$

For GPR, a 100(1 − α)% prediction interval for $f^*$ is

$$\mu(t^*) \pm Z_{1-\alpha/2}\,\sigma(t^*),$$

where $Z_{1-\alpha/2}$ is the 100(1 − α/2)th percentile of N(0, 1). Moreover, the mean
$\mu(t^*)$ serves as a predictor for $f(t^*)$ given the data Y and $t^*$, while the variance
$\sigma^2(t^*)$ serves as a measure of uncertainty in $\mu(t^*)$.
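The predictor and interval above can be sketched numerically. The following is a minimal illustration, not the authors' implementation: it assumes scalar inputs, the covariance function (3) with a single hypothetical length-scale `lam`, and the helper names `sq_exp_kernel`, `gpr_predict`, and `prediction_interval` are introduced here only for illustration.

```python
import numpy as np
from statistics import NormalDist


def sq_exp_kernel(a, b, lam=1.0):
    """Covariance function (3) for scalar inputs, with Lambda = lam**2 (assumption)."""
    d = np.subtract.outer(np.asarray(a, float), np.asarray(b, float))
    return np.exp(-0.5 * d**2 / lam**2)


def gpr_predict(t, y, t_star, tau=0.1, lam=1.0):
    """Return (mu(t*), sigma^2(t*)) of the Gaussian predictive distribution (4)."""
    t = np.asarray(t, float)
    y = np.asarray(y, float)
    S = sq_exp_kernel(t, t, lam) + tau**2 * np.eye(len(t))  # Sigma + tau^2 I_n
    k = sq_exp_kernel(t, [t_star], lam)[:, 0]               # vector k
    k_star = 1.0                                            # k(t*, t*) for kernel (3)
    mu = k @ np.linalg.solve(S, y)
    var = k_star - k @ np.linalg.solve(S, k)
    return mu, var


def prediction_interval(mu, var, level=0.95):
    """mu +/- Z_{1-alpha/2} * sigma, with alpha = 1 - level."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half = z * np.sqrt(var)
    return mu - half, mu + half
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse of Σ + τ²Iₙ explicitly, which is usually the numerically safer choice.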
Now, assume that we are interested in predicting f(t) at $t^*$, where $t^*$ is a random
variable such that $t^* \sim N_p(\mu_*, \Sigma_*)$, i.e., we are interested in prediction at a
random input. The predictive pdf of $f^*$ given $\mu_*$ and $\Sigma_*$ is (Girard et al., 2004):

$$P(f^* \mid \mu_*, \Sigma_*, Y) = \int P(f^* \mid Y, t^*)\,P(t^*)\,dt^*. \tag{5}$$

The integral in Eq. (5) does not have a closed form. Hence, an approximation to this
integral is needed in order to report inferential statements about $f^*$. Moreover, the main
computational burden in GPR lies in inverting the matrix $\Sigma + \tau^2 I_n$ and in obtaining
the mean and variance of the predictive distribution of $f^*$ given Y at a random input $t^*$.
For this reason, we propose the following simple Monte Carlo approximation to (5):

$$P(f^* \mid \mu_*, \Sigma_*, Y) = \int P(f^* \mid Y, t^*)\,P(t^*)\,dt^* \approx \frac{1}{N} \sum_{r=1}^{N} P\left(f^* \mid Y, t^{*(r)}\right),$$
where t ∗(1) , . . . , t ∗(N) are independent samples from P (t ∗ ). Before closing this section, we
refer to Girard et al. (2002) and Williams and Rasmussen (2006) where the reader can find
several analytical approximation techniques to approximate the predictive density (5).
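As a rough sketch of the Monte Carlo approximation to (5), the code below averages the Gaussian predictive density (4) over draws of the random input. It assumes scalar inputs, the covariance function (3) with unit length-scale, and a univariate normal input distribution with hypothetical mean `m_star` and standard deviation `s_star`; the function names are illustrative only.

```python
import numpy as np


def kernel(a, b, lam=1.0):
    """Covariance function (3) for scalar inputs (single length-scale, an assumption)."""
    d = np.subtract.outer(np.asarray(a, float), np.asarray(b, float))
    return np.exp(-0.5 * d**2 / lam**2)


def mc_predictive_density(f_grid, t, y, m_star, s_star, tau=0.1, N=2000, seed=0):
    """Approximate (5) on f_grid by averaging density (4) over draws t*(r)."""
    rng = np.random.default_rng(seed)
    t = np.asarray(t, float)
    y = np.asarray(y, float)
    f_grid = np.asarray(f_grid, float)
    S = kernel(t, t) + tau**2 * np.eye(len(t))       # Sigma + tau^2 I_n
    t_draws = rng.normal(m_star, s_star, size=N)     # t*(1), ..., t*(N)
    K = kernel(t, t_draws)                           # column r is k for t*(r)
    A = np.linalg.solve(S, K)                        # (Sigma + tau^2 I_n)^{-1} k
    mu = A.T @ y                                     # predictive means
    var = np.maximum(1.0 - np.sum(K * A, axis=0), 1e-12)  # predictive variances
    z = (f_grid[:, None] - mu[None, :]) / np.sqrt(var)[None, :]
    dens = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi * var)[None, :]
    return dens.mean(axis=1)                         # average over the N draws
```

All N draws are handled in one linear solve, so the cost is dominated by a single factorization of the n × n matrix rather than N separate ones.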
3. Skew Gaussian Process for Nonlinear Regression

In this section, a generalization to the Gaussian process, called the skew Gaussian process
(SGP), is proposed. Under the SGP, we give a generalization to the GPR model called the
skew Gaussian process regression (SGPR) model. Then the predictive density at new inputs
is derived for the SGPR model.
Definition 3.1. A random process $Y(t)$, $t \in C \subseteq \mathbb{R}^p$, is said to be a skew
Gaussian process (SGP) if for every $n \in \{1, 2, 3, \ldots\}$ and every
$t_1, \ldots, t_n \in C$, the vector $(Y(t_1), \ldots, Y(t_n))^T$ follows the density (1), i.e.,
Y(t) is a skew Gaussian process if its set of finite dimensional distributions is a subfamily
of the family of distributions defined by (1).
Definition 3.2. A skew Gaussian process Y(t) possesses a fixed skewness in all directions
if for every n and $t_1, \ldots, t_n$, the parameter $\alpha$ in (1) takes the form
$\alpha = \alpha 1_n$, $\alpha \in \mathbb{R}$. Throughout this article, we will assume that for
each n and $t_1, \ldots, t_n \in C$, the parameter $\Sigma = \{k(t_i, t_j)\}_{i,j=1}^{n}$, where
$k(\cdot, \cdot)$ is a given covariance function.
Definition 3.3. A skew Gaussian process is called a skew white noise if for every n and
$t_1, \ldots, t_n \in C \subseteq \mathbb{R}^p$,
$\varepsilon = (\varepsilon(t_1), \ldots, \varepsilon(t_n))^T \sim SN_n(0, \tau^2 I_n, \beta 1_n)$,
where $\tau, \beta \in \mathbb{R}$.
Let $\mu_A$, $\Sigma_A$, $D_A$, $v^+$, and $\Delta_A$ be as defined in Appendix B2. Under the
assumption that the function f(t) follows a skew Gaussian process and ε(t) follows a skew white
noise, the following theorem gives the predictive distribution of $f(t^*)$ given $Y, t^*$, as
well as its mean and variance at the fixed input $t^*$.
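Theorem 3.1. Consider the following partitions for $\mu_A$, $\Sigma_A$, $D_A$, $v^+$, $\Delta_A$:

$$\Sigma_A = \begin{pmatrix} \Sigma + \tau^2 I_n & k \\ k^T & k^* \end{pmatrix} = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}, \quad \Delta_A = \begin{pmatrix} \Delta_{11} & \Delta_{12} \\ \Delta_{21} & \Delta_{22} \end{pmatrix},$$

and

$$D_A = \begin{pmatrix} D_{11} & D_{12} \\ D_{21} & D_{22} \end{pmatrix} = \begin{pmatrix} D_1 & D_2 \end{pmatrix},$$

where

$$D_1 = \begin{pmatrix} D_{11} \\ D_{21} \end{pmatrix}_{2 \times n} \quad \text{and} \quad D_2 = \begin{pmatrix} D_{12} \\ D_{22} \end{pmatrix}_{2 \times 1}.$$

Then:
i. the conditional distribution of $f^*$ given Y and $t^*$ is

$$f^* \mid Y, t^* \sim CSN_{1,2}\left(k^T(\Sigma + \tau^2 I_n)^{-1} Y,\; k^* - k^T(\Sigma + \tau^2 I_n)^{-1} k,\; D_2,\; -D^* Y,\; \Delta_A\right), \tag{6}$$

where

$$D^* = D_1 + D_2 k^T(\Sigma + \tau^2 I_n)^{-1};$$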
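ii. the predictive mean and the predictive variance are given by

$$E(f^* \mid Y, t^*) = \mu^* + \sigma^{*2}\left[D_{12}\,\Phi_2^{(1)}\left(0_{2 \times 1}; -D^* Y, \Delta_A + \sigma^{*2} D_2 D_2^T\right) + D_{22}\,\Phi_2^{(2)}\left(0_{2 \times 1}; -D^* Y, \Delta_A + \sigma^{*2} D_2 D_2^T\right)\right],$$

and

$$\begin{aligned} \mathrm{var}(f^* \mid Y, t^*) = 2\sigma^{*4}\Big[ &\Phi_2^{(11)}\left(0_{2 \times 1}; -D^* Y, \Delta_A + \sigma^{*2} D_2 D_2^T\right) D_{12}^2 \\ &+ \Phi_2^{(12)}\left(0_{2 \times 1}; -D^* Y, \Delta_A + \sigma^{*2} D_2 D_2^T\right) D_{12} D_{22} \\ &+ \Phi_2^{(21)}\left(0_{2 \times 1}; -D^* Y, \Delta_A + \sigma^{*2} D_2 D_2^T\right) D_{12} D_{22} \\ &+ \Phi_2^{(22)}\left(0_{2 \times 1}; -D^* Y, \Delta_A + \sigma^{*2} D_2 D_2^T\right) D_{22}^2 \Big], \end{aligned}$$

where $\mu^* = \mu(t^*)$ and $\sigma^{*2} = \sigma^2(t^*)$ are as in (4),
$\Phi_2^{(i)}(\cdot, \cdot)$ is the first partial derivative of $\Phi_2(\cdot, \cdot)$ with
respect to the ith argument for i = 1, 2, and $\Phi_2^{(ij)}(\cdot, \cdot)$ is the second mixed
partial derivative of $\Phi_2(\cdot, \cdot)$ with respect to the ith and jth arguments for
i, j = 1, 2.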
Proof. See Appendix A for the proof of Theorem 3.1.
The above theorem shows that the predictive distribution of a new output follows a closed
skew Gaussian distribution. As a special case, this predictive distribution reduces to (4) if
the skewness is absent, i.e., if α = β = 0. Another predictor of $f(t^*)$ is the median of the
conditional distribution of $f^*$ given Y. Neither the mean nor the median of the conditional
distribution has a simple closed form in our case. Furthermore, part (i) of Theorem 3.1 can be
used to predict the value of f(t) at a random input, say $t^*$. For instance, assume that
$t^* \sim N_p(a, B)$ and we wish to predict $f^* = f(t^*)$. Since
$f^* \mid Y, t^* \sim CSN_{1,2}(\mu^*, \sigma^{*2}, D_2, -D^* Y, \Delta_A)$, then using the law
of total probability, we write the predictive density of $f^*$ given Y as follows:

$$P(f^* \mid Y) = \int_{\mathbb{R}^p} P(f^* \mid Y, t^*)\,P(t^*)\,dt^*.$$

Unfortunately, it is difficult even for GPR to find a closed form for the integral in the
last equation, so an approximation for $P(f^* \mid Y)$ is needed. Here, we propose the following
simple Monte Carlo approximation for the predictive distribution at a random input:

$$P(f^* \mid Y) = \int_{\mathbb{R}^p} P(f^* \mid Y, t^*)\,P(t^*)\,dt^* \cong \frac{1}{N} \sum_{r=1}^{N} P\left(f^* \mid Y, t^{*(r)}\right),$$
where $t^{*(1)}, \ldots, t^{*(N)}$ are independent samples from $P(t^*)$. Since we are putting a
skew Gaussian process prior on the function f(t), then for each n, the finite dimensional
distribution of the skew Gaussian process is used as a prior for the distribution of the vector
$f = (f(t_1), \ldots, f(t_n))^T$, i.e., $f \sim SN_n(0, \Sigma, \alpha 1_n)$, where $\Sigma$ is
as defined in the previous sections. Since $Y = f + \varepsilon$, where
$\varepsilon \sim SN_n(0, \tau^2 I_n, \beta 1_n)$, the posterior distribution of f given Y,
$t^*$ is calculated as follows:

$$P(f \mid y, t^*) = \frac{P(y, t^* \mid f)\,P(f)}{P(y, t^*)}.$$

Since the prior distribution is proper, so is the posterior distribution. So the predictive
distribution of $f^*$ given Y and $t^*$ is

$$P(f^* \mid y, t^*) = \int P(f, f^* \mid y, t^*)\,df = \frac{1}{P(y, t^*)} \int P(y, t^* \mid f, f^*)\,P(f, f^*)\,df.$$

It will be shown in the proof of Theorem 3.1, Appendix B1, that the last equation
simplifies to the pdf of the distribution

$$CSN_{1,2}\left(k^T(\Sigma + \tau^2 I_n)^{-1} Y,\; k^* - k^T(\Sigma + \tau^2 I_n)^{-1} k,\; D_2,\; -D^* Y,\; \Delta_A\right).$$
Now if $t^*$ is assumed to follow $t^* \sim N_p(a, B)$, then the predictive distribution of
$f^*$ given Y is obtained by averaging $P(f^* \mid Y, t^*)$ over all values of $t^*$. Since
$t^*$ has a proper distribution, so does $f^*$ given Y. Furthermore, the strong law of large
numbers implies that the estimator $\frac{1}{N}\sum_{r=1}^{N} P(f^* \mid Y, t^{*(r)})$ converges
almost surely to its mean value, i.e.,

$$\frac{1}{N} \sum_{r=1}^{N} P\left(f^* \mid Y, t^{*(r)}\right) \xrightarrow{\text{a.s.}} E\left[P\left(f^* \mid Y, t^{*(1)}\right)\right] = \int P(f^* \mid y, t^*)\,P(t^*)\,dt^* = P(f^* \mid y).$$
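To simulate from $(f^* \mid Y, t^*)$, i.e., from

$$CSN_{1,2}\left(k^T(\Sigma + \tau^2 I_n)^{-1} Y,\; k^* - k^T(\Sigma + \tau^2 I_n)^{-1} k,\; D_2,\; -D^* Y,\; \Delta_A\right),$$

we may utilize the following stochastic representation of the CSN distribution (Genton,
2004; Allard and Naveau, 2007), written here with $\mu^* = k^T(\Sigma + \tau^2 I_n)^{-1} Y$ and
with $D_2$ of dimension 2 × 1 as in Theorem 3.1:
i. Let V be a random vector from $N_2(-D^* Y, Q)$, where

$$Q = \Delta_A + D_2\left(k^* - k^T(\Sigma + \tau^2 I_n)^{-1} k\right) D_2^T.$$

ii. Let $U = V \mid V \le 0$.
iii. Set $Z = \mu^* + m^*(U + D^* Y) + (\Gamma^*)^{1/2} H$, where

$$m^* = -\left(k^* - k^T(\Sigma + \tau^2 I_n)^{-1} k\right) D_2^T \left[\Delta_A + \left(k^* - k^T(\Sigma + \tau^2 I_n)^{-1} k\right) D_2 D_2^T\right]^{-1},$$

$$\Gamma^* = \left(k^* - k^T(\Sigma + \tau^2 I_n)^{-1} k\right) - \left(k^* - k^T(\Sigma + \tau^2 I_n)^{-1} k\right)^2 D_2^T \left[\Delta_A + \left(k^* - k^T(\Sigma + \tau^2 I_n)^{-1} k\right) D_2 D_2^T\right]^{-1} D_2,$$

and $H \sim N(0, 1)$. Then Z is from the distribution in (6).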

4. Gaussian Process for Regression with Skew Normal Errors

In this section, we consider the model $Y = f + \varepsilon$, where the error process ε(t)
follows a skew white noise. We then use the Gaussian process predictor, i.e.,
$\hat{f} = k^T(\Sigma + \tau^2 I_n)^{-1} Y$, to predict $f(t^*)$. Under this setup, we study the
effect of the skewness on the mean and the variance of the GPR predictor. In the sequel, the
mean and the variance of $\hat{f}$ when $Y = f + \varepsilon$ with
$\varepsilon \sim N_n(0, \tau^2 I_n)$ are denoted by $E^{G}(\hat{f})$ and
$\mathrm{var}^{G}(\hat{f})$, respectively. Likewise, when
$\varepsilon \sim SN_n(0, \tau^2 I_n, \beta 1_n)$, the mean and the variance are denoted by
$E^{SG}(\hat{f})$ and $\mathrm{var}^{SG}(\hat{f})$, respectively. Under a white noise process,
i.e., $\varepsilon \sim N_n(0, \tau^2 I_n)$, we have

$$E^{G}(\hat{f}) = k^T(\Sigma + \tau^2 I_n)^{-1} E(f + \varepsilon) = k^T(\Sigma + \tau^2 I_n)^{-1} f,$$

while under the assumption $\varepsilon \sim SN_n(0, \tau^2 I_n, \beta 1_n)$, we have that

$$E^{SG}(\hat{f}) = k^T(\Sigma + \tau^2 I_n)^{-1} E(f + \varepsilon) = k^T(\Sigma + \tau^2 I_n)^{-1} f + k^T(\Sigma + \tau^2 I_n)^{-1} E\varepsilon.$$

Using Proposition A.6 in Appendix A, we get

$$E(\varepsilon) = \sqrt{\frac{2}{\pi}}\,\frac{\tau^2 \beta 1_n}{\sqrt{1 + \beta^2 \tau^2 n}}.$$

Hence,

$$E^{SG}(\hat{f}) = E^{G}(\hat{f}) + \sqrt{\frac{2}{\pi}}\,k^T(\Sigma + \tau^2 I_n)^{-1}\frac{\tau^2 \beta 1_n}{\sqrt{1 + \beta^2 \tau^2 n}} = E^{G}(\hat{f}) + b(\tau^2, \beta^2, n), \text{ say.} \tag{7}$$
From the last equation, we conclude that the GPR predictor is either increased or decreased
by an amount of $b(\tau^2, \beta^2, n)$. Similarly, under the assumption
$\varepsilon \sim SN_n(0, \tau^2 I_n, \beta 1_n)$, the variance $\mathrm{var}^{SG}(\hat{f})$ is
obtained by applying Proposition A.6 of Appendix A. So

$$\begin{aligned} \mathrm{var}^{SG}(\hat{f}) &= k^T(\Sigma + \tau^2 I_n)^{-1}\left[\tau^2 I_n - \frac{2\beta^2 \tau^4}{\pi(1 + n\tau^2\beta^2)}\,1_n 1_n^T\right](\Sigma + \tau^2 I_n)^{-1} k, \\ &= \tau^2 k^T(\Sigma + \tau^2 I_n)^{-2} k - \frac{2\beta^2 \tau^4 k_0^2 n^2}{\pi(1 + n\tau^2\beta^2) L_n^2}, \end{aligned}$$

where $L_n = \tau^2 + \sum_{i=1}^{n} k(t_1, t_i)$.
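To make the shift in (7) and the variance reduction concrete, the sketch below evaluates $b(\tau^2, \beta^2, n)$ and both variances for a given covariance matrix, using only the first (general) line of the variance formula. The function name `gp_predictor_moments` and its argument layout are assumptions made for illustration.

```python
import numpy as np


def gp_predictor_moments(Sigma, k, tau, beta):
    """Return (b, var_G, var_SG) for the GPR predictor under skew white noise.

    b      : shift in (7), sqrt(2/pi) k^T (Sigma+tau^2 I)^{-1} tau^2 beta 1_n / sqrt(1+beta^2 tau^2 n)
    var_G  : tau^2 k^T (Sigma+tau^2 I)^{-2} k   (the beta = 0 case)
    var_SG : var_G minus the rank-one correction due to skewness
    """
    Sigma = np.asarray(Sigma, float)
    k = np.asarray(k, float)
    n = len(k)
    ones = np.ones(n)
    w = np.linalg.solve(Sigma + tau**2 * np.eye(n), k)  # (Sigma + tau^2 I_n)^{-1} k
    denom = 1.0 + n * tau**2 * beta**2
    b = np.sqrt(2.0 / np.pi) * tau**2 * beta * (w @ ones) / np.sqrt(denom)
    var_g = tau**2 * (w @ w)
    var_sg = var_g - (2.0 / np.pi) * beta**2 * tau**4 * (w @ ones) ** 2 / denom
    return b, var_g, var_sg
```

By the Cauchy-Schwarz inequality the subtracted term never exceeds `var_g`, so `var_sg` stays nonnegative for every β.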

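Theorem 4.1. Consider the setup in the above discussion. Then $b(\tau^2, \beta^2, n)$ and
$\mathrm{var}^{SG}(\hat{f})$ satisfy the following properties.
i. $\mathrm{var}^{SG}(\hat{f}) \le \mathrm{var}^{G}(\hat{f})$ for all $\tau, \beta, n$.
ii. $\lim_{\beta \to \pm\infty} b(\tau^2, \beta^2, n) = \sqrt{\frac{2}{\pi}}\,\frac{\tau}{\sqrt{n}}\,k^T(\Sigma + \tau^2 I_n)^{-1} 1_n$,
$\lim_{\beta \to 0} b(\tau^2, \beta^2, n) = 0$, and $\lim_{\tau \to 0} b(\tau^2, \beta^2, n) = 0$.
iii. Assume that $t_1, t_2, \ldots, t_n$ are chosen so that they are the vertices of a regular
polygon and $t^*$ is located at its center. If $k(\cdot, \cdot)$ is an isotropic covariance
function and $k_0 = k(t_1, t^*)$, then
a. $b(\tau^2, \beta^2, n) > 0$ for all $\tau$, n and $\beta \ne 0$, and
$\lim_{\tau \to \infty} b(\tau^2, \beta^2, n) = 0$.
b. If $\sum_{i=1}^{n} k(t_1, t_i) = n^{-0.5} O(n)$, with $n^{-0.5} O(n) \to c \ne 0$ as
$n \to \infty$, then $\lim_{n \to \infty} b(\tau^2, \beta^2, n) = \sqrt{\frac{2}{\pi}}\,\frac{\tau \beta k_0}{|\beta| c}$,
and if $\sum_{i=1}^{n} k(t_1, t_i) = O(n)$, then $\lim_{n \to \infty} b(\tau^2, \beta^2, n) = 0$.
c. If $\sum_{i=1}^{n} k(t_1, t_i) = O(n)$, then
$\lim_{n \to \infty} \mathrm{var}^{SG}(\hat{f}) = \left(1 - \frac{2}{\pi}\right)\frac{\tau^2 k_0^2}{c^2}$.
d. $\lim_{\beta \to 0} \mathrm{var}^{SG}(\hat{f}) = \frac{n \tau^2 k_0^2}{L_n^2}$,
$\lim_{\beta \to \pm\infty} \mathrm{var}^{SG}(\hat{f}) = \frac{n \tau^2 k_0^2}{L_n^2}\left(1 - \frac{2}{\pi}\right)$, and
$\lim_{\tau \to 0, \infty} \mathrm{var}^{SG}(\hat{f}) = 0$.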
Proof. The proofs of (ii) and (iii)(a)–(iii)(b) are given in Appendix B. The proofs of the
other parts are straightforward and are left to the reader.
It can be noticed that if a Gaussian predictor is used for predicting skew data, then the
variance of the predictor cannot exceed the variance of the Gaussian predictor. On the other
hand, the value of the predictor will be shifted to the left or to the right of the Gaussian one
by an amount of $b(\tau^2, \beta^2, n)$. If an isotropic Gaussian covariance function is used, then
$\sum_{i=1}^{n} k(t_1, t_i) = \sum_{i=1}^{n} \tau \exp\left(-\frac{0.5\,\theta_0 i}{\lambda^2}\right) = O(n)$,
where $\theta_0$ denotes the angle between $t_i$ and $t^*$ for all i = 1, ..., n. So the
Gaussian covariance function satisfies part b of Theorem 4.1.
5. Simulation study
In this section, we present an algorithm to simulate a realization from a skew Gaussian
process, i.e., by simulating from its finite dimensional distributions. The algorithm is then
implemented in Matlab code to simulate from the GPR and SGPR predictors.
5.1. Simulation from $SN_n(0, \Sigma, \lambda)$

A sample path from the skew Gaussian process can be simulated by sampling from a
multivariate skew normal distribution on a fine grid. To simulate a random vector from the pdf
(1), we may use the accept-reject method. The accept-reject method as given in Christian and
Casella (2004) assumes that the pdf P(x) can be written as

$$P(x) = c\,g(x)\,h(x),$$

where $c \ge 1$, $0 < g(x) \le 1$ for all x, and h(x) is a pdf. If this is the case, then a
random observation from P(x) is generated as follows.
1. Generate U from u(0, 1).
2. Generate Y from h(x).
3. If $U \le g(Y)$, then deliver Y as a realization of P(x).
4. Otherwise, go to step 1.
For the $SN_n(0, \Sigma, \lambda)$ distribution, we may use this algorithm with c = 2,
$g(x) = \Phi(\lambda^T x)$, and $h(x) = \phi_n(x; 0, \Sigma)$.
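The accept-reject steps for the skew normal case can be sketched as follows. This is an illustrative Python version (the article's own implementation is in Matlab and is not shown); the function names and the `seed` argument are assumptions.

```python
import numpy as np
from math import erf, sqrt


def std_normal_cdf(x):
    """cdf of N(0, 1)."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def sample_skew_normal(Sigma, lam, size, seed=0):
    """Accept-reject sampler for the density 2 * phi_n(x; 0, Sigma) * Phi(lam^T x)."""
    rng = np.random.default_rng(seed)
    Sigma = np.asarray(Sigma, float)
    lam = np.asarray(lam, float)
    n = len(lam)
    L = np.linalg.cholesky(Sigma)             # Sigma = L L^T
    out = []
    while len(out) < size:
        u = rng.uniform()                     # step 1: U ~ u(0, 1)
        y = L @ rng.standard_normal(n)        # step 2: Y ~ N_n(0, Sigma)
        if u <= std_normal_cdf(lam @ y):      # step 3: accept if U <= g(Y)
            out.append(y)
        # step 4: otherwise loop back to step 1
    return np.array(out)
```

Since c = 2, about half of the proposals are accepted on average, regardless of the dimension n or the skewness vector.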
5.2. Simulation from $CSN_{p,q}(\mu, \Sigma, D, v, \Delta)$

Simulating a random observation from $CSN_{p,q}(\mu, \Sigma, D, v, \Delta)$ via the
accept-reject method is difficult because of the complexity of calculating
$g(x) = \Phi_q(D(Y - \mu); v, \Delta)$. Instead, we employ the following algorithm, which is
derived from the definition of the CSN distribution (see Genton, 2004; Allard and Naveau,
2007), written here for D of dimension q × p as in (2).
(i) Simulate an observation U from $N_q(v, \Delta + D \Sigma D^T)$ conditioned on $U \le 0$.
(ii) Given U, simulate Z from
$N_p\left(\mu - \Sigma D^T(\Delta + D \Sigma D^T)^{-1}(U - v),\; \Sigma - \Sigma D^T(\Delta + D \Sigma D^T)^{-1} D \Sigma\right)$.
(iii) Deliver Z as an observation from $CSN_{p,q}(\mu, \Sigma, D, v, \Delta)$.

Simulating from $U \mid U \le 0$ is not an easy task either, so an accept-reject method will
be implemented for this step.
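A minimal sketch of steps (i)–(iii), assuming D is q × p as in (2) and using plain rejection for the truncated normal in step (i). The function name `sample_csn`, the `max_tries` safeguard, and the eigenvalue-based matrix square root are choices made here for illustration, not part of the article.

```python
import numpy as np


def sample_csn(mu, Sigma, D, v, Delta, size, seed=0, max_tries=100000):
    """Draw `size` observations from CSN_{p,q}(mu, Sigma, D, v, Delta)."""
    rng = np.random.default_rng(seed)
    mu = np.atleast_1d(mu).astype(float)
    Sigma = np.atleast_2d(Sigma).astype(float)
    D = np.atleast_2d(D).astype(float)
    v = np.atleast_1d(v).astype(float)
    Delta = np.atleast_2d(Delta).astype(float)

    Q = Delta + D @ Sigma @ D.T               # q x q covariance of U before truncation
    G = Sigma @ D.T @ np.linalg.inv(Q)        # p x q regression coefficient
    C = Sigma - G @ D @ Sigma                 # p x p conditional covariance
    Lq = np.linalg.cholesky(Q)
    w, V = np.linalg.eigh(C)                  # square root that tolerates a singular C
    Lc = V * np.sqrt(np.clip(w, 0.0, None))

    out = []
    for _ in range(size):
        # step (i): rejection sampling of U ~ N_q(v, Q) restricted to U <= 0
        for _ in range(max_tries):
            u = v + Lq @ rng.standard_normal(len(v))
            if np.all(u <= 0.0):
                break
        else:
            raise RuntimeError("rejection step did not accept a draw")
        # step (ii): Z | U ~ N_p(mu - G (U - v), C)
        z = mu - G @ (u - v) + Lc @ rng.standard_normal(len(mu))
        out.append(z)                         # step (iii)
    return np.array(out)
```

Plain rejection in step (i) becomes slow when the orthant U ≤ 0 has small probability under $N_q(v, Q)$; a dedicated truncated-normal sampler would be a reasonable substitute in that regime.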
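5.3. Estimation of Hyper Parameters

Maximum likelihood estimation (MLE) is a standard approach for estimating the hyper
parameters in GPR. Here we use the MLE to estimate the parameters $\tau, \sigma^2, \alpha, \beta$,
and $\lambda$, i.e., by maximizing the likelihood function
$L(\tau, \sigma^2, \alpha, \beta, \lambda; Y)$ of Y. Consider the model

$$Y = X + \varepsilon,$$

where $X \sim SN_n(0, \Sigma, \alpha 1_n^T, 0, 1)$,
$\varepsilon \sim SN_n(0, \tau^2 I_n, \beta 1_n^T, 0, 1)$, $\tau > 0$, and X, ε are independent
random vectors. Then applying Proposition A.5 of Appendix A yields
$Y \sim CSN_{n,2}(0, \Sigma + \tau^2 I_n, D^{\circ}, 0, \Delta^{\circ})$, where

$$D^{\circ} = \begin{pmatrix} \alpha 1_n^T \Sigma(\Sigma + \tau^2 I_n)^{-1} \\ \beta \tau^2 1_n^T(\Sigma + \tau^2 I_n)^{-1} \end{pmatrix}, \quad \Delta^{\circ} = \begin{pmatrix} A_{11} & A_{12} \\ A_{12} & A_{22} \end{pmatrix},$$

and

$$A_{11} = 1 + \alpha^2 1_n^T \Sigma 1_n - \alpha^2 1_n^T \Sigma(\Sigma + \tau^2 I_n)^{-1} \Sigma 1_n,$$

$$A_{22} = 1 + n\beta^2\tau^2 - \beta^2\tau^4 1_n^T(\Sigma + \tau^2 I_n)^{-1} 1_n,$$

$$A_{12} = -\alpha\beta\tau^2 1_n^T \Sigma(\Sigma + \tau^2 I_n)^{-1} 1_n.$$

Using the pdf (2), we find the likelihood function of $\tau, \sigma^2, \alpha, \beta$, and $\lambda$:

$$L(\tau, \sigma^2, \alpha, \beta, \lambda; Y) = \frac{\Phi_2(D^{\circ} Y; 0, \Delta^{\circ})}{\Phi_2\left(0; 0, \Delta^{\circ} + D^{\circ}(\Sigma + \tau^2 I_n) D^{\circ T}\right)} \times \phi_n(Y; 0, \Sigma + \tau^2 I_n).$$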
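The likelihood above can be evaluated numerically as in the following sketch, which assumes that Σ has already been built from the covariance function for given σ² and λ, and uses SciPy's `multivariate_normal` for the bivariate normal cdf and the n-variate normal log-density. The function name `sgpr_log_likelihood` is hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal


def sgpr_log_likelihood(y, Sigma, tau, alpha, beta):
    """Log-likelihood of Y ~ CSN_{n,2}(0, Sigma + tau^2 I_n, D°, 0, Delta°)."""
    y = np.asarray(y, float)
    Sigma = np.asarray(Sigma, float)
    n = len(y)
    ones = np.ones(n)
    S = Sigma + tau**2 * np.eye(n)            # Sigma + tau^2 I_n
    Sinv_1 = np.linalg.solve(S, ones)         # S^{-1} 1_n
    Sig1 = Sigma @ ones                       # Sigma 1_n
    Sinv_Sig1 = np.linalg.solve(S, Sig1)      # S^{-1} Sigma 1_n

    # Rows of D° (S and Sigma are symmetric, so the row vectors transpose cleanly)
    D0 = np.vstack([alpha * Sinv_Sig1, beta * tau**2 * Sinv_1])

    A11 = 1.0 + alpha**2 * (ones @ Sig1) - alpha**2 * (Sig1 @ Sinv_Sig1)
    A22 = 1.0 + n * beta**2 * tau**2 - beta**2 * tau**4 * (ones @ Sinv_1)
    A12 = -alpha * beta * tau**2 * (Sig1 @ Sinv_1)
    Delta0 = np.array([[A11, A12], [A12, A22]])

    zero2 = np.zeros(2)
    num = multivariate_normal.cdf(D0 @ y, mean=zero2, cov=Delta0)
    den = multivariate_normal.cdf(zero2, mean=zero2, cov=Delta0 + D0 @ S @ D0.T)
    log_phi = multivariate_normal.logpdf(y, mean=np.zeros(n), cov=S)
    return np.log(num) - np.log(den) + log_phi
```

A generic optimizer could then be applied to the negative of this function over the hyper parameters; when α = β = 0 the two cdf terms cancel and the value reduces to the usual Gaussian log-likelihood.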
Although the marginal distribution of the data Y is a multivariate closed skew normal
distribution, finding confidence intervals for the parameters $\tau, \sigma^2, \alpha, \beta$,
and $\lambda$ is not an easy task, since these parameters are embedded in the distribution’s
parameters, i.e., in $\Sigma$, $D^{\circ}$, and $\Delta^{\circ}$. So one may consider Bayesian
intervals instead. For this purpose, a prior distribution on $\tau, \sigma^2, \alpha, \beta$,
and $\lambda$ must be assumed. Let $P(\tau, \sigma^2, \alpha, \beta, \lambda)$ be the prior that
represents our belief about the distribution of $\tau, \sigma^2, \alpha, \beta$, and $\lambda$.
Then the posterior distribution of $\tau, \sigma^2, \alpha, \beta$, and $\lambda$ given Y
satisfies

$$P(\tau, \sigma^2, \alpha, \beta, \lambda \mid Y) \propto L(\tau, \sigma^2, \alpha, \beta, \lambda; Y) \times P(\tau, \sigma^2, \alpha, \beta, \lambda).$$

Again, we face another problem in finding the normalizing constant of the posterior
distribution. Hence, a Markov chain Monte Carlo (MCMC) method is called for. For example, one
may use the Metropolis-Hastings algorithm (Christian and Casella, 2004). To find confidence
intervals for the parameters $\tau, \sigma^2, \alpha, \beta$, and $\lambda$, a large sample from
$P(\tau, \sigma^2, \alpha, \beta, \lambda \mid Y)$ is needed. We propose an algorithm in
Appendix C to find such confidence intervals. On the other hand, once the parameters
$\tau, \sigma^2, \alpha, \beta$, and $\lambda$ have been estimated, their estimates can be
plugged into the variance formula $\mathrm{var}(f^* \mid Y, t^*)$ to get an estimate of
$\mathrm{var}(f^* \mid Y, t^*)$. Furthermore, to obtain a 95% Bayesian confidence band for
$(f^* \mid y, t^*)$, we also need to use simulation. To proceed, let
$\theta = (\tau, \sigma^2, \alpha, \beta, \lambda)$ and $\eta(\theta, t^*) = (f^* \mid y, t^*)$.
A simultaneous 95% Bayesian confidence band for $\eta(\theta, t^*)$ is obtained by finding L
and U such that $P_{\theta \mid Y, t^*}(L < \eta(\theta, t^*) < U \text{ for all } t^*) = 0.95$,
which is equivalent to solving the following equation for L and U:

$$P_{\theta \mid Y, t^*}\left(L < \inf_{t^*} \eta(\theta, t^*) < \sup_{t^*} \eta(\theta, t^*) < U\right) = 0.95. \tag{8}$$

We also propose an algorithm in Appendix C to solve (8) for L and U. Hence,
$L < \eta(\theta, t^*) < U$ is a 95% confidence band for $\eta(\theta, t^*)$ for all $t^*$.
5.4. Simulation Results

In this simulation, realizations of the sample path of the SGP are generated for the input
function $f(t) = \sin(t)/t$, $t \ne 0$. The simulated data are then substituted into both the
Gaussian and skew Gaussian predictors. To see the effect of the departure from Gaussianity on
the Gaussian predictor, we plot the distribution functions of the two predictors. Figures 1–8
show these distribution functions for λ = 1 and different values of α, β, and τ.
From Figs. 1–8, we report the following concluding remarks:

1. If a Gaussian process prior is used on the input function, i.e., α = 0, then there is a small
difference between the two distributions, and this difference increases as a function of |β|.
Moreover, the skew Gaussian predictor distribution is larger than the Gaussian predictor
distribution if β < 0, and the converse is true if β > 0 (see Fig. 2a).
2. The two predictors have about the same distribution functions for small values of the
parameters τ, α, and β (see Figs. 2a, b).
Figure 1. (a) GPR (G) and skew Gaussian (SG) predictors with parameters α = −0.05, β = −5, 0, 1, 5, and τ = 0.1; (b) GPR (G) and SG predictors with parameters α = −0.01, β = −5, −1, 0, 5, and τ = 0.1.
Figure 2. (a) GPR (G) and SG predictors with parameters α = 0, β = −5, 0, 1, 5, and τ = 0.1; (b) GPR (G) and SG predictors with parameters α = 0.05, β = −5, 0, 2, 5, and τ = 0.1.
Figure 3. (a) GPR (G) and SG predictors with parameters α = 2, β = −5, −2, 0, 5, and τ = 0.1; (b) GPR (G) and SG predictors with parameters α = 5, β = −5, 0, 2, 5, and τ = 0.1.
Figure 4. (a) GPR (G) and SG predictors with parameters α = −0.01, β = −5, 0, 1.5, 5, and τ = 1; (b) GPR (G) and SG predictors with parameters α = 0, β = 5, 0, 2, 5, and τ = 1.
Figure 5. (a) GPR (G) and SG predictors with parameters α = 0.5, β = −5, 0, 1.5, 5, and τ = 1.
Figure 6. (a) GPR (G) and SG predictors with parameters α = 1.5, β = −5, −1, 0, 5, and τ = 1.5; (b) GPR (G) and SG predictors with parameters α = 4, β = −5, 0, 2, 5, and τ = 1.5.
Figure 7. (a) GPR (G) and SG predictors with parameters α = −0.1, β = −5, 0, 2, 4, and τ = 2; (b) GPR (G) and SG predictors with parameters α = 5, β = −5, 0, 2, 5, and τ = 2.
3. If a Gaussian process is used for the errors, i.e., β = 0, then there is no difference
between the two distributions when α ≤ 0 and τ is small (see Figs. 1, 2, 3, and 4).
4. For fixed values of α and moderate values of τ, the difference between the two
distributions is very clear and seems to be an increasing function of |β| (see Figs. 4, 5).
5. For fixed values of α and large values of τ, there is a huge difference between the two
distributions (see Fig. 8).
Figure 8. GPR (G) and SG predictors with parameters α = 1, β = 2, and τ = 2, 10.

6. Conclusions and Possibility of Future Work

In this article, the nonlinear regression model Y(t) = f(t) + ε(t) has been tackled from a
Bayesian viewpoint by assuming two skew Gaussian processes on f(t) and ε(t). It is shown that,
under this assumption, the predictive density at a new input has a closed form. We also studied
the GPR predictor under the assumption that the errors violate the assumption of Gaussianity.
If the errors depart from Gaussianity to skew Gaussianity, then the GPR predictor will be
affected and may lead to unrealistic estimates. The skew Gaussian process for regression
addressed in this paper has several advantages over the GPR, which motivate us to continue this
work in the future. We highlight some possible directions.
1. Studying the effect of the choice of the covariance function on the skew Gaussian
process predictor.
2. Developing methods for estimating the hyper-parameters of the model.
3. Prediction at several inputs.
4. Defining more robust models by using more general distributions either on the input function f(t) or on the error term. For such future work, one may utilize the work of Lachos et al. (2010) and Da Silva-Ferreira et al. (2011) by assuming that either the input function or the error term follows a random process whose finite dimensional distributions are scale mixtures of skew normal (SMSN) distributions as defined by Lachos et al. (2010a). A random vector $Y$ is said to have an $n$-dimensional scale mixture of skew normal distributions, denoted by $Y \sim \mathrm{SMSN}_n(\mu, \Sigma, \lambda, H)$, if $Y$ has the stochastic representation $Y = \mu + c^{1/2}(U) Z$, where $Z \sim \mathrm{SN}_n(0, \Sigma, \lambda)$, $c(\cdot)$ is a weight function, and $U$ is a positive mixing random variable with cdf $H(\cdot)$, independent of $Z$. The pdf of $Y$ is
\[
P(y) = 2 \int_0^{\infty} \phi_n\bigl(y; \mu, c(u)\Sigma\bigr)\, \Phi\bigl(c^{-1/2}(u)\, \lambda^T \Sigma^{-1/2} (y - \mu)\bigr)\, dH(u).
\]
Lachos et al. (2010b) showed that the family SMSN includes several known families such as the skew-t, skew-slash, and skew-Cauchy families. This opens the way for further research on more robust models.
Although the process whose finite dimensional distributions are SMSN is very general, estimating its hyperparameters poses several computational challenges. These challenges are due to the integration in the pdf of $Y \sim \mathrm{SMSN}_n(\mu, \Sigma, \lambda, H)$. So, instead of numerical integration, it could be easier to use a computationally intensive statistical algorithm to evaluate the integral in the pdf $P(y)$. Since intensive computing requires large samples from the pdf of $Y$, we may utilize the stochastic representation $Y = \mu + c^{1/2}(U) Z$ for such simulation purposes.
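The stochastic representation can be turned into a sampler along the following lines. This is a minimal sketch, not the authors' code: it assumes the $\mathrm{SN}_n(0, \Sigma, \lambda)$ density $2\phi_n(z; 0, \Sigma)\Phi(\lambda^T \Sigma^{-1/2} z)$, draws $Z$ with the standard sign-flip construction, and, purely for illustration, takes $c(u) = 1/u$ with $U \sim \mathrm{Gamma}(\nu/2, \text{rate } \nu/2)$, which is the skew-t member of the family. The function names `sample_sn` and `sample_smsn` and the value `nu=5` are hypothetical.

```python
import numpy as np

def sample_sn(n_draws, Sigma, lam, rng):
    """Draw from SN_n(0, Sigma, lam), assumed pdf 2*phi_n(z; 0, Sigma)*Phi(lam^T Sigma^{-1/2} z)."""
    w, V = np.linalg.eigh(Sigma)
    inv_sqrt = V @ np.diag(w ** -0.5) @ V.T           # symmetric Sigma^{-1/2}
    X = rng.multivariate_normal(np.zeros(len(lam)), Sigma, size=n_draws)
    W = rng.standard_normal(n_draws)
    keep = W <= X @ inv_sqrt @ lam                     # sign-flip construction
    return np.where(keep[:, None], X, -X)

def sample_smsn(n_draws, mu, Sigma, lam, nu=5.0, seed=0):
    """Y = mu + c(U)^{1/2} Z with the illustrative choice c(u) = 1/u, U ~ Gamma(nu/2, rate nu/2)."""
    rng = np.random.default_rng(seed)
    Z = sample_sn(n_draws, Sigma, lam, rng)
    U = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=n_draws)
    c = 1.0 / U
    return mu + np.sqrt(c)[:, None] * Z

# Example: bivariate draws skewed along the first coordinate only.
Y = sample_smsn(20000, np.zeros(2), np.eye(2), np.array([3.0, 0.0]))
```

Averages of functions of such draws can then stand in for integrals with respect to $P(y)$. Other mixing distributions $H$ only require changing the two lines that draw `U` and compute `c`.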
Azzalini and Capitanio (1999) pointed out that the MLE of the skewness parameter of the multivariate skew normal distribution may diverge with positive probability. They also noticed that the Fisher information matrix is singular when the skewness parameter is zero. For the multivariate closed skew normal distribution, these issues have been considered in only a few papers. Here we refer to the work of Arellano-Valle et al. (2005), who used the skew normal distribution to model both the random effects and the error terms in the linear mixed effects model. They also showed that the response data vector has a multivariate closed skew normal distribution. Furthermore, they derived and implemented an EM algorithm to find the MLEs of all parameters. According to the literature, the above issues concerning the MLEs of the closed skew normal distribution parameters are still not explored enough. Hence, we believe that further research should be conducted. For example, the SGP model parameters could be estimated using the penalized maximum likelihood method. We leave these issues to a separate article.
References
Allard, D., Naveau, P. (2007). A new spatial skew-normal random field model. Commun. Statist. Theory Meth. 36:1821–1834.
Alodat, M. T., Aludaat, K. M. (2007). A skew Gaussian process. Pak. J. Statist. 23:89–97.
Alodat, M. T., AL-Rawwash, M. Y. (2009). Skew Gaussian random field. J. Computat. Appl. Math. 232(2):496–504.
Arellano-Valle, R. B., Bolfarine, H., Lachos, V. H. (2005). Skew-normal linear mixed models. J. Data Sci. 3:415–438.
Azzalini, A. (1985). A class of distributions which includes the normal ones. Scand. J. Statist. 12:171–178.
Azzalini, A. (1986). Further results on a class of distributions which includes the normal ones. Statistica 46:199–208.
Azzalini, A., Capitanio, A. (1999). Statistical applications of the multivariate skew normal distribution. J. Roy. Statist. Soc. Ser. B 61:579–602.
Azzalini, A., Dalla Valle, A. (1996). The multivariate skew-normal distribution. Biometrika 83:715–726.
Brahim-Belhouari, S., Bermak, A. (2004). Gaussian process for non-stationary time series prediction. Computat. Statist. Data Anal. 47:705–712.
Buccianti, A. (2005). Meaning of the λ parameter of skew-normal and log-skew normal distributions in fluid geochemistry. CODAWORK'05.
Christian, P. R., Casella, G. (2004). Monte Carlo Statistical Methods. New York: Springer.
Da Silva-Ferreira, C., Bolfarine, H., Lachos, V. (2011). Skew-scale mixture of skew-normal distributions. Statist. Methodol. 8:154–181.
Fan, J., Peng, H. (2004). Nonconcave penalized likelihood with diverging number of parameters. Ann. Statist. 32:928–961.
Fyfe, C., Leen, G., Lai, P. L. (2008). Gaussian processes for canonical correlation analysis. Neurocomputing 71:3077–3088.
Genton, M. (2004). Skew-Elliptical Distributions and Their Applications: A Journey Beyond Normality. Boca Raton, FL: Chapman & Hall/CRC.
Girard, A., Kocijan, J., Murray-Smith, R., Rasmussen, C. E. (2004). Gaussian process model based predictive control. Proc. Amer. Control Conf. Boston.
Girard, A., Rasmussen, C. E., Murray-Smith, R. M. (2002). Gaussian Process Priors with Uncertain Inputs: Multiple-Step-Ahead Prediction. Technical Report TR-2002-119, Department of Computing Science, University of Glasgow.
González-Farías, G., Domínguez-Molina, J., Gupta, A. (2004). Additive properties of skew normal random vectors. J. Statist. Plan. Infer. 126:521–534.
Kuss, M. (2006). Gaussian process models for robust regression, classification, and reinforcement learning. Ph.D. thesis, Technische Universität Darmstadt.
Lachos, V., Labra, F., Bolfarine, H., Ghosh, P. (2010a). Multivariate measurement error models based on scale mixtures of the skew-normal distribution. Statistics 44:541–556.
Lachos, V. H., Ghosh, P., Arellano-Valle, R. B. (2010b). Likelihood based inference for skew-normal independent linear mixed models. Statistica Sinica 20:303–322.
Macke, J. H., Gerwinn, S., White, L. E., Kaschube, M., Bethge, M. (2010). Gaussian process methods for estimating cortical maps.
Neal, R. M. (1995). Bayesian learning for neural networks. Ph.D. thesis, Dept. of Computer Science, University of Toronto.
O'Hagan, A. (1978). On curve fitting and optimal design for prediction. J. Roy. Statist. Soc. Ser. B 40:1–42.
Rasmussen, C. E., Williams, C. (2006). Gaussian Processes for Machine Learning. Cambridge, MA: MIT Press.
Rasmussen, C. E. (1996). Evaluation of Gaussian processes and other methods for non-linear regression. Ph.D. thesis, Dept. of Computer Science, University of Toronto.
Schmidt, A. M., Conceição, M. F., Moreira, G. A. (2008). Investigating the sensitivity of Gaussian processes to the choice of their correlation functions and prior specifications. J. Statist. Computat. Simul. 78(8):681–699.
Schott, J. R. (1997). Matrix Analysis for Statistics. New York: Wiley-Interscience.
Vanhatalo, J., Jylänki, P., Vehtari, A. (2009). Gaussian process regression with Student-t likelihood. In: Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C. K. I., Culotta, A., eds. Advances in Neural Information Processing Systems 22:1910–1918.
Williams, C. K. I., Rasmussen, C. E. (1996). Gaussian processes for regression. Adv. Neur. Inform. Process. Syst. 8:514–520.
Zhang, H., El-Shaarawi, A. (2009). On spatial skew Gaussian process applications. Environmetrics 10:982.
Appendix
A. Basic Results on Closed-Skew Normal Distribution

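The results of this appendix are quoted from Genton (2004).
Proposition A.1 If $Y_1, \ldots, Y_n$ are independent random vectors with $Y_i \sim \mathrm{CSN}_{p_i, q_i}(\mu_i, \Sigma_i, D_i, v_i, \Delta_i)$, then the joint distribution of $Y_1, \ldots, Y_n$ is
\[
Y = (Y_1^T, \ldots, Y_n^T)^T \sim \mathrm{CSN}_{p^+, q^+}(\mu^+, \Sigma^+, D^+, v^+, \Delta^+),
\]
where
\[
p^+ = \sum_{i=1}^n p_i, \quad q^+ = \sum_{i=1}^n q_i, \quad \mu^+ = (\mu_1^T, \ldots, \mu_n^T)^T, \quad \Sigma^+ = \oplus_{i=1}^n \Sigma_i,
\]
\[
D^+ = \oplus_{i=1}^n D_i, \quad v^+ = (v_1^T, \ldots, v_n^T)^T, \quad \Delta^+ = \oplus_{i=1}^n \Delta_i,
\]
and
\[
A \oplus B = \begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix}.
\]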
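Proposition A.2 Let $Y \sim \mathrm{CSN}_{p,q}(\mu, \Sigma, D, v, \Delta)$ and let $A$ be an $n \times p$ ($n \le p$) matrix of rank $n$. Then $AY \sim \mathrm{CSN}_{n,q}(\mu_A, \Sigma_A, D_A, v, \Delta_A)$, where $\mu_A = A\mu$, $\Sigma_A = A \Sigma A^T$, $D_A = D \Sigma A^T \Sigma_A^{-1}$, and $\Delta_A = \Delta + D \Sigma D^T - D \Sigma A^T \Sigma_A^{-1} A \Sigma D^T$.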
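The parameter map in Proposition A.2 is easy to evaluate numerically. The following is a minimal sketch under the assumption that all inputs are NumPy arrays of compatible shapes; the function name `csn_linear_transform` and the example values are hypothetical.

```python
import numpy as np

def csn_linear_transform(A, mu, Sigma, D, v, Delta):
    """Parameters of A @ Y when Y ~ CSN_{p,q}(mu, Sigma, D, v, Delta) and A has full row rank."""
    mu_A = A @ mu
    Sigma_A = A @ Sigma @ A.T
    Sigma_A_inv = np.linalg.inv(Sigma_A)
    D_A = D @ Sigma @ A.T @ Sigma_A_inv
    Delta_A = Delta + D @ Sigma @ D.T - D @ Sigma @ A.T @ Sigma_A_inv @ A @ Sigma @ D.T
    return mu_A, Sigma_A, D_A, v, Delta_A

# Example with p = 3, q = 1: keep only the first two coordinates of Y.
mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.5, 0.1], [0.5, 1.0, 0.2], [0.1, 0.2, 1.5]])
D = np.array([[1.0, -0.5, 2.0]])
v = np.zeros(1)
Delta = np.eye(1)
A_sel = np.eye(3)[:2]
mu_A, Sigma_A, D_A, v_A, Delta_A = csn_linear_transform(A_sel, mu, Sigma, D, v, Delta)

# Sanity check: the identity transformation must return the original parameters.
_, S_id, D_id, _, Delta_id = csn_linear_transform(np.eye(3), mu, Sigma, D, v, Delta)
```

Taking $A = I_p$ is a quick consistency check, since then $D_A = D$ and $\Delta_A = \Delta$.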
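Proposition A.3 If $Y \sim \mathrm{CSN}_{p,q}(\mu, \Sigma, D, v, \Delta)$, then for two subvectors $Y_1$ and $Y_2$, where $Y^T = (Y_1^T, Y_2^T)$, $Y_1$ is $k$-dimensional, $1 \le k \le p$, and $\mu$, $\Sigma$, $D$ are partitioned as
\[
\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad
\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}, \qquad
D = (D_1 \;\; D_2),
\]
with $\mu_1$ of size $k$, $\mu_2$ of size $p - k$, $\Sigma_{11}$ of size $k \times k$, $\Sigma_{22}$ of size $(p-k) \times (p-k)$, $D_1$ of size $q \times k$, and $D_2$ of size $q \times (p - k)$, the conditional distribution of $Y_2$ given $Y_1 = y_{10}$ is
\[
\mathrm{CSN}_{p-k,q}\bigl(\mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(y_{10} - \mu_1),\; \Sigma_{22.1},\; D_2,\; v - D^*(y_{10} - \mu_1),\; \Delta\bigr),
\]
where
\[
D^* = D_1 + D_2 \Sigma_{21} \Sigma_{11}^{-1},
\]
and
\[
\Sigma_{22.1} = \Sigma_{22} - \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12}.
\]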
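The conditioning rule of Proposition A.3 can be sketched in a few lines. The function name `csn_conditional` and the numerical values are hypothetical, and NumPy arrays of compatible shapes are assumed.

```python
import numpy as np

def csn_conditional(y1, k, mu, Sigma, D, v, Delta):
    """Parameters of Y2 | Y1 = y1 for Y ~ CSN_{p,q}(mu, Sigma, D, v, Delta); Y1 is the first k coordinates."""
    mu1, mu2 = mu[:k], mu[k:]
    S11, S12 = Sigma[:k, :k], Sigma[:k, k:]
    S21, S22 = Sigma[k:, :k], Sigma[k:, k:]
    D1, D2 = D[:, :k], D[:, k:]
    S21_S11inv = S21 @ np.linalg.inv(S11)
    D_star = D1 + D2 @ S21_S11inv
    return (mu2 + S21_S11inv @ (y1 - mu1),   # location
            S22 - S21_S11inv @ S12,          # Sigma_{22.1}
            D2,
            v - D_star @ (y1 - mu1),
            Delta)

# Example with p = 2, q = 1, k = 1.
mu = np.zeros(2)
Sigma = np.array([[2.0, 1.0], [1.0, 2.0]])
D = np.array([[1.0, 0.5]])
loc, scale, D2, v_c, Delta_c = csn_conditional(np.array([1.0]), 1, mu, Sigma, D, np.zeros(1), np.eye(1))
```

In this example the location and scale are the familiar Gaussian conditional mean and variance (0.5 and 1.5), while the skewness enters only through $D_2$ and the shifted $v$.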
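Proposition A.4 If $Y \sim \mathrm{CSN}_{p,q}(\mu, \Sigma, D, v, \Delta)$, then the moment generating function of $Y$ is
\[
M_Y(s) = \frac{\Phi_q\bigl(D \Sigma s;\, v,\, \Delta + D \Sigma D^T\bigr)}{\Phi_q\bigl(0;\, v,\, \Delta + D \Sigma D^T\bigr)}\; e^{s^T \mu + \frac{1}{2} s^T \Sigma s}, \qquad s \in \mathbb{R}^p .
\]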
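Proposition A.5 If $Y_1$ and $Y_2$ are independent vectors such that $Y_i \sim \mathrm{CSN}_{p,q_i}(\mu_i, \Sigma_i, D_i, v_i, \Delta_i)$, $i = 1, 2$, then $Y_1 + Y_2 \sim \mathrm{CSN}_{p, q_1 + q_2}(\mu_1 + \mu_2, \Sigma_1 + \Sigma_2, D^\circ, v^\circ, \Delta^\circ)$, where
\[
D^\circ = \begin{pmatrix} D_1 \Sigma_1 (\Sigma_1 + \Sigma_2)^{-1} \\ D_2 \Sigma_2 (\Sigma_1 + \Sigma_2)^{-1} \end{pmatrix}, \qquad
\Delta^\circ = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix},
\]
and
\[
A_{11} = \Delta_1 + D_1 \Sigma_1 D_1^T - D_1 \Sigma_1 (\Sigma_1 + \Sigma_2)^{-1} \Sigma_1 D_1^T,
\]
\[
A_{22} = \Delta_2 + D_2 \Sigma_2 D_2^T - D_2 \Sigma_2 (\Sigma_1 + \Sigma_2)^{-1} \Sigma_2 D_2^T,
\]
\[
A_{12} = -D_1 \Sigma_1 (\Sigma_1 + \Sigma_2)^{-1} \Sigma_2 D_2^T, \qquad A_{21} = A_{12}^T, \qquad v^\circ = (v_1^T, v_2^T)^T .
\]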
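Proposition A.6 If $X \sim \mathrm{SN}_n(\mu, \Sigma, \alpha)$, then

i. $EX = \mu + \sqrt{\dfrac{2}{\pi}}\, \delta$, where $\delta = \dfrac{\Sigma \alpha}{\sqrt{1 + \alpha^T \Sigma \alpha}}$.
ii. $\mathrm{Cov}(X) = \Sigma - \dfrac{2}{\pi}\, \delta \delta^T$.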
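The two moments in Proposition A.6 can be checked by simulation. The sketch below assumes the parametrization in which $\mathrm{SN}_n(0, \Sigma, \alpha)$ has density $2\phi_n(x; 0, \Sigma)\Phi(\alpha^T x)$ and draws from it with the standard sign-flip construction; the numerical values of `Sigma` and `alpha` are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
alpha = np.array([2.0, -1.0])
N = 200_000

# Sign-flip construction: X ~ N(0, Sigma), W ~ N(0, 1); keep X if W <= alpha^T X, else use -X.
X = rng.multivariate_normal(np.zeros(2), Sigma, size=N)
W = rng.standard_normal(N)
Z = np.where((W <= X @ alpha)[:, None], X, -X)

# Theoretical moments from Proposition A.6 with mu = 0.
delta = Sigma @ alpha / np.sqrt(1.0 + alpha @ Sigma @ alpha)
mean_theory = np.sqrt(2.0 / np.pi) * delta
cov_theory = Sigma - (2.0 / np.pi) * np.outer(delta, delta)

mean_mc = Z.mean(axis=0)
cov_mc = np.cov(Z.T)
```

With this many draws the Monte Carlo mean and covariance should agree with the closed forms to about two decimal places.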
B. Proof of Theorem 3.1.

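B.1. Joint Density of Data and Output. The aim of this section is to derive the joint density of $f^* = f(t^*)$ and $Y$. For simplicity, we assume that the skew Gaussian processes used here possess fixed skewness in all directions. Since $f(t)$ is assumed to have a skew Gaussian process prior,
\[
\begin{pmatrix} f \\ f^* \end{pmatrix} \sim \mathrm{CSN}_{n+1,1}\bigl(0, \Sigma_{(n+1)\times(n+1)}, \alpha \mathbf{1}_{n+1}^T, 0, 1\bigr), \qquad
\varepsilon \sim \mathrm{CSN}_{n,1}\bigl(0, \tau^2 I_n, \beta \mathbf{1}_n^T, 0, 1\bigr),
\]
where $\mathbf{1}_{n+1}$ denotes the column of ones of size $(n + 1)$, and $I_n$ is the identity matrix of size $n \times n$. Since $(f^T, f^*)^T$ is independent of $\varepsilon(t)$, by Proposition A.1 we have
\[
\begin{pmatrix} f \\ f^* \\ \varepsilon \end{pmatrix} \sim \mathrm{CSN}_{2n+1,2}\bigl(\mu^+, \Sigma^+, D^+, v^+, \Delta^+\bigr),
\]
where
\[
\mu^+ = \bigl(0_{n\times 1}^T, 0, 0_{n\times 1}^T\bigr)^T, \qquad v^+ = (0, 0)^T, \qquad \Delta^+ = I_2,
\]
$0_{n\times 1}$ is the zero vector of size $n \times 1$, and
\[
D^+ = \begin{pmatrix} \alpha \mathbf{1}_{n+1}^T & 0_{n\times 1}^T \\ 0_{(n+1)\times 1}^T & \beta \mathbf{1}_n^T \end{pmatrix}_{2\times(2n+1)}, \qquad
\Sigma^+ = \begin{pmatrix} \Sigma_{(n+1)\times(n+1)} & 0_{(n+1)\times n} \\ 0_{(n+1)\times n}^T & \tau^2 I_n \end{pmatrix}.
\]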
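B.2. The Predictive Density of $f^*$ Given $Y, t^*$. The predictive density of $f^*$ given $Y$, $t^*$ is obtained by direct application of Proposition A.3 with $p = n + 1$ and $q = 2$. The first step in finding the conditional distribution of $f^*$ given $Y$, $t^*$ is to find the joint pdf of $f^*$ and $Y$. To proceed, we write $(Y^T, f^*)^T$ as a linear combination of $(f^T, f^*, \varepsilon^T)^T$, i.e.,
\[
\begin{pmatrix} Y \\ f^* \end{pmatrix} = \begin{pmatrix} f + \varepsilon \\ f^* \end{pmatrix}
= \begin{pmatrix} I_n & 0_{n\times 1} & I_n \\ 0_{n\times 1}^T & 1 & 0_{n\times 1}^T \end{pmatrix}
\begin{pmatrix} f \\ f^* \\ \varepsilon \end{pmatrix}.
\]
To simplify the notation, let
\[
A_{(n+1)\times(2n+1)} = \begin{pmatrix} I_n & 0_{n\times 1} & I_n \\ 0_{n\times 1}^T & 1 & 0_{n\times 1}^T \end{pmatrix}.
\]
It is straightforward to check that the matrix $A$ is of rank $(n + 1)$. Now we are ready to apply Proposition A.2. Hence,
\[
\begin{pmatrix} Y \\ f^* \end{pmatrix} = A \begin{pmatrix} f \\ f^* \\ \varepsilon \end{pmatrix} \sim \mathrm{CSN}_{n+1,2}\bigl(\mu_A, \Sigma_A, D_A, v^+, \Delta_A\bigr),
\]
where
\[
\mu_A = A\mu^+ = \begin{pmatrix} I_n & 0_{n\times 1} & I_n \\ 0_{n\times 1}^T & 1 & 0_{n\times 1}^T \end{pmatrix}
\begin{pmatrix} 0_{n\times 1} \\ 0 \\ 0_{n\times 1} \end{pmatrix} = 0_{(n+1)\times 1}
\]
and, writing the prior covariance of $(f^T, f^*)^T$ in blocks as $\Sigma_{(n+1)\times(n+1)} = \begin{pmatrix} \Sigma & k \\ k^T & k^* \end{pmatrix}$,
\[
\Sigma_A = A \Sigma^+ A^T
= \begin{pmatrix} I_n & 0_{n\times 1} & I_n \\ 0_{n\times 1}^T & 1 & 0_{n\times 1}^T \end{pmatrix}
\begin{pmatrix} \Sigma_{(n+1)\times(n+1)} & 0_{(n+1)\times n} \\ 0_{(n+1)\times n}^T & \tau^2 I_n \end{pmatrix}
\begin{pmatrix} I_n & 0_{n\times 1} \\ 0_{n\times 1}^T & 1 \\ I_n & 0_{n\times 1} \end{pmatrix}
= \begin{pmatrix} \Sigma + \tau^2 I_n & k \\ k^T & k^* \end{pmatrix}_{(n+1)\times(n+1)}.
\]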
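To proceed, we need the following matrix identity, which can be found in Schott (1997). Let $A$ be a matrix partitioned as
\[
A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix},
\]
where $A_{11}$ and $A_{22}$ are invertible square matrices. Then
\[
A^{-1} = \begin{pmatrix}
\bigl(A_{11} - A_{12} A_{22}^{-1} A_{21}\bigr)^{-1} & -A_{11}^{-1} A_{12} \bigl(A_{22} - A_{21} A_{11}^{-1} A_{12}\bigr)^{-1} \\
-A_{22}^{-1} A_{21} \bigl(A_{11} - A_{12} A_{22}^{-1} A_{21}\bigr)^{-1} & \bigl(A_{22} - A_{21} A_{11}^{-1} A_{12}\bigr)^{-1}
\end{pmatrix}.
\]
For the proof, we refer to Schott (1997). Hence, we find $\Sigma_A^{-1}$ as follows:
\[
\Sigma_A^{-1} = \begin{pmatrix}
\bigl(\Sigma + \tau^2 I_n - k k^{*-1} k^T\bigr)^{-1} & -\bigl(\Sigma + \tau^2 I_n\bigr)^{-1} k \bigl(k^* - k^T (\Sigma + \tau^2 I_n)^{-1} k\bigr)^{-1} \\
-k^{*-1} k^T \bigl(\Sigma + \tau^2 I_n - k k^{*-1} k^T\bigr)^{-1} & \bigl(k^* - k^T (\Sigma + \tau^2 I_n)^{-1} k\bigr)^{-1}
\end{pmatrix}_{(n+1)\times(n+1)}.
\]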
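The block form of $\Sigma_A^{-1}$ can be verified numerically against a direct inverse. The sketch below is only a check under assumed inputs: the squared-exponential kernel `kern`, its length scale, the input locations, and $\tau^2 = 0.5$ are illustrative choices and not taken from the article.

```python
import numpy as np

rng = np.random.default_rng(0)
n, tau2 = 4, 0.5
t = rng.uniform(0.0, 1.0, size=n)      # hypothetical training inputs
t_star = 0.5                            # hypothetical new input

def kern(a, b):
    # Assumed squared-exponential covariance with length scale 0.2.
    return np.exp(-0.5 * (a - b) ** 2 / 0.2 ** 2)

Sigma = kern(t[:, None], t[None, :])
k = kern(t, t_star)
k_star = kern(t_star, t_star)

B = Sigma + tau2 * np.eye(n)
Sigma_A = np.block([[B, k[:, None]], [k[None, :], np.array([[k_star]])]])

# Blocks of the partitioned inverse, following the displayed formula.
M_inv = np.linalg.inv(B - np.outer(k, k) / k_star)
s_inv = 1.0 / (k_star - k @ np.linalg.solve(B, k))
top_right = -np.linalg.solve(B, k)[:, None] * s_inv
bottom_left = -(k[None, :] @ M_inv) / k_star
blockwise = np.block([[M_inv, top_right], [bottom_left, np.array([[s_inv]])]])

direct = np.linalg.inv(Sigma_A)
```

The arrays `blockwise` and `direct` should agree up to floating-point error, which gives a quick guard against sign or ordering slips when the blocks are reused in $D_A$ and $\Delta_A$.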
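The parameter $D_A$ is given by
\[
D_A = D^+ \Sigma^+ A^T \Sigma_A^{-1}
= \begin{pmatrix} \alpha \mathbf{1}_{n+1}^T & 0_{n\times 1}^T \\ 0_{(n+1)\times 1}^T & \beta \mathbf{1}_n^T \end{pmatrix}
\begin{pmatrix} \Sigma_{(n+1)\times(n+1)} & 0_{(n+1)\times n} \\ 0_{(n+1)\times n}^T & \tau^2 I_n \end{pmatrix}
\begin{pmatrix} I_n & 0_{n\times 1} \\ 0_{n\times 1}^T & 1 \\ I_n & 0_{n\times 1} \end{pmatrix} \Sigma_A^{-1}
\]
\[
= \begin{pmatrix} \alpha \mathbf{1}_{n+1}^T (\Sigma, k)^T & \alpha \mathbf{1}_{n+1}^T (k^T, k^*)^T \\ \beta \tau^2 \mathbf{1}_n^T & 0 \end{pmatrix} \Sigma_A^{-1}
= \begin{pmatrix} D_{11(1\times n)} & D_{12} \\ D_{21(1\times n)} & D_{22} \end{pmatrix}_{2\times(n+1)},
\]
where
\[
D_{11} = \alpha \mathbf{1}_{n+1}^T (\Sigma, k)^T \bigl(\Sigma + \tau^2 I_n - k k^{*-1} k^T\bigr)^{-1}
- \alpha \mathbf{1}_{n+1}^T (k^T, k^*)^T k^{*-1} k^T \bigl(\Sigma + \tau^2 I_n - k k^{*-1} k^T\bigr)^{-1},
\]
\[
D_{12} = -\alpha \mathbf{1}_{n+1}^T (\Sigma, k)^T \bigl(\Sigma + \tau^2 I_n\bigr)^{-1} k \bigl(k^* - k^T (\Sigma + \tau^2 I_n)^{-1} k\bigr)^{-1}
+ \alpha \mathbf{1}_{n+1}^T (k^T, k^*)^T \bigl(k^* - k^T (\Sigma + \tau^2 I_n)^{-1} k\bigr)^{-1},
\]
\[
D_{21} = \beta \tau^2 \mathbf{1}_n^T \bigl(\Sigma + \tau^2 I_n - k k^{*-1} k^T\bigr)^{-1},
\]
and
\[
D_{22} = -\beta \tau^2 \mathbf{1}_n^T \bigl(\Sigma + \tau^2 I_n\bigr)^{-1} k \bigl(k^* - k^T (\Sigma + \tau^2 I_n)^{-1} k\bigr)^{-1}.
\]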
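Also, the parameter $\Delta_A = \Delta^+ + D^+ \Sigma^+ D^{+T} - D^+ \Sigma^+ A^T \Sigma_A^{-1} A \Sigma^+ D^{+T}$, where $\Delta^+ = I_2$, can be simplified as follows:
\[
D^+ \Sigma^+ D^{+T} = \begin{pmatrix} \alpha^2 \mathbf{1}_{n+1}^T \Sigma_{(n+1)\times(n+1)} \mathbf{1}_{n+1} & 0 \\ 0 & n \beta^2 \tau^2 \end{pmatrix}_{2\times 2},
\]
\[
D^+ \Sigma^+ A^T = \begin{pmatrix} \alpha \mathbf{1}_{n+1}^T (\Sigma, k)^T & \alpha \mathbf{1}_{n+1}^T (k^T, k^*)^T \\ \beta \tau^2 \mathbf{1}_n^T & 0 \end{pmatrix}_{2\times(n+1)},
\]
and
\[
A \Sigma^+ D^{+T} = \begin{pmatrix} \alpha (\Sigma, k) \mathbf{1}_{n+1} & \beta \tau^2 \mathbf{1}_n \\ \alpha (k^T, k^*) \mathbf{1}_{n+1} & 0 \end{pmatrix}_{(n+1)\times 2}.
\]
Finally, $\Delta_A$ takes the following form:
\[
\Delta_A = I_2 + \begin{pmatrix} \alpha^2 \mathbf{1}_{n+1}^T \Sigma_{(n+1)\times(n+1)} \mathbf{1}_{n+1} & 0 \\ 0 & n \beta^2 \tau^2 \end{pmatrix}
- \begin{pmatrix} \alpha \mathbf{1}_{n+1}^T (\Sigma, k)^T & \alpha \mathbf{1}_{n+1}^T (k^T, k^*)^T \\ \beta \tau^2 \mathbf{1}_n^T & 0 \end{pmatrix}
\Sigma_A^{-1}
\begin{pmatrix} \alpha (\Sigma, k) \mathbf{1}_{n+1} & \beta \tau^2 \mathbf{1}_n \\ \alpha (k^T, k^*) \mathbf{1}_{n+1} & 0 \end{pmatrix}
\]
\[
= \begin{pmatrix} 1 + \alpha^2 \mathbf{1}_{n+1}^T \Sigma_{(n+1)\times(n+1)} \mathbf{1}_{n+1} - W_{11} & -W_{12} \\ -W_{21} & 1 + n \beta^2 \tau^2 - W_{22} \end{pmatrix},
\]
where
\[
W_{11} = \Bigl[\alpha \mathbf{1}_{n+1}^T (\Sigma, k)^T \bigl(\Sigma + \tau^2 I_n - k k^{*-1} k^T\bigr)^{-1}
- \alpha \mathbf{1}_{n+1}^T (k^T, k^*)^T k^{*-1} k^T \bigl(\Sigma + \tau^2 I_n - k k^{*-1} k^T\bigr)^{-1}\Bigr] \alpha (\Sigma, k) \mathbf{1}_{n+1}
\]
\[
\quad + \Bigl[-\alpha \mathbf{1}_{n+1}^T (\Sigma, k)^T \bigl(\Sigma + \tau^2 I_n\bigr)^{-1} k \bigl(k^* - k^T (\Sigma + \tau^2 I_n)^{-1} k\bigr)^{-1}
+ \alpha \mathbf{1}_{n+1}^T (k^T, k^*)^T \bigl(k^* - k^T (\Sigma + \tau^2 I_n)^{-1} k\bigr)^{-1}\Bigr] \alpha (k^T, k^*) \mathbf{1}_{n+1},
\]
\[
W_{12} = \Bigl[\alpha \mathbf{1}_{n+1}^T (\Sigma, k)^T \bigl(\Sigma + \tau^2 I_n - k k^{*-1} k^T\bigr)^{-1}
- \alpha \mathbf{1}_{n+1}^T (k^T, k^*)^T k^{*-1} k^T \bigl(\Sigma + \tau^2 I_n - k k^{*-1} k^T\bigr)^{-1}\Bigr] \beta \tau^2 \mathbf{1}_n
\]
\[
\quad = \alpha \beta \tau^2 \Bigl[\mathbf{1}_{n+1}^T (\Sigma, k)^T \bigl(\Sigma + \tau^2 I_n - k k^{*-1} k^T\bigr)^{-1}
- \mathbf{1}_{n+1}^T (k^T, k^*)^T k^{*-1} k^T \bigl(\Sigma + \tau^2 I_n - k k^{*-1} k^T\bigr)^{-1}\Bigr] \mathbf{1}_n,
\]
\[
W_{21} = \beta \tau^2 \mathbf{1}_n^T \bigl(\Sigma + \tau^2 I_n - k k^{*-1} k^T\bigr)^{-1} \alpha (\Sigma, k) \mathbf{1}_{n+1}
- \beta \tau^2 \mathbf{1}_n^T \bigl(\Sigma + \tau^2 I_n\bigr)^{-1} k \bigl(k^* - k^T (\Sigma + \tau^2 I_n)^{-1} k\bigr)^{-1} \alpha (k^T, k^*) \mathbf{1}_{n+1}
\]
\[
\quad = \alpha \beta \tau^2 \Bigl[\mathbf{1}_n^T \bigl(\Sigma + \tau^2 I_n - k k^{*-1} k^T\bigr)^{-1} (\Sigma, k) \mathbf{1}_{n+1}
- \mathbf{1}_n^T \bigl(\Sigma + \tau^2 I_n\bigr)^{-1} k \bigl(k^* - k^T (\Sigma + \tau^2 I_n)^{-1} k\bigr)^{-1} (k^T, k^*) \mathbf{1}_{n+1}\Bigr],
\]
and
\[
W_{22} = \beta^2 \tau^4 \mathbf{1}_n^T \bigl(\Sigma + \tau^2 I_n - k k^{*-1} k^T\bigr)^{-1} \mathbf{1}_n .
\]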
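The predictive density of $f^*$ given $Y, t^*$ is obtained by direct application of Proposition A.3 with $p = n + 1$ and $q = 2$. To proceed, consider the following partitions for $\mu_A, \Sigma_A, D_A, v^+, \Delta_A$:
\[
\Sigma_A = \begin{pmatrix} \Sigma + \tau^2 I_n & k \\ k^T & k^* \end{pmatrix} = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},
\qquad
\Delta_A = \begin{pmatrix} \Delta_{11} & \Delta_{12} \\ \Delta_{21} & \Delta_{22} \end{pmatrix},
\]
and
\[
D_A = \begin{pmatrix} D_{11} & D_{12} \\ D_{21} & D_{22} \end{pmatrix} = (D_1 \;\; D_2),
\]
where
\[
D_1 = \begin{pmatrix} D_{11} \\ D_{21} \end{pmatrix}_{2\times n} \quad \text{and} \quad D_2 = \begin{pmatrix} D_{12} \\ D_{22} \end{pmatrix}_{2\times 1}.
\]
So the conditional distribution of $f^*$ given $Y$, $t^*$ is
\[
f^* \mid Y, t^* \sim \mathrm{CSN}_{1,2}\bigl(k^T (\Sigma + \tau^2 I_n)^{-1} Y,\; k^* - k^T (\Sigma + \tau^2 I_n)^{-1} k,\; D_2,\; -D^* Y,\; \Delta_A\bigr), \tag{3.1}
\]
where
\[
D^* = D_1 + D_2 k^T (\Sigma + \tau^2 I_n)^{-1}.
\]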
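The chain of Propositions A.1, A.2, and A.3 used above can be traced numerically. The following is a minimal sketch, not the authors' implementation: the function name `predictive_csn_params`, the squared-exponential kernel, and all numerical inputs are assumptions made for illustration.

```python
import numpy as np

def predictive_csn_params(Sigma, k, k_star, Y, tau, alpha, beta):
    """Parameters of f* | Y, t* in (3.1), computed from the joint CSN construction."""
    n = len(Y)
    K = np.block([[Sigma, k[:, None]], [k[None, :], np.array([[k_star]])]])  # prior cov of (f, f*)
    Sigma_plus = np.block([[K, np.zeros((n + 1, n))],
                           [np.zeros((n, n + 1)), tau ** 2 * np.eye(n)]])
    D_plus = np.block([[alpha * np.ones((1, n + 1)), np.zeros((1, n))],
                       [np.zeros((1, n + 1)), beta * np.ones((1, n))]])
    A = np.block([[np.eye(n), np.zeros((n, 1)), np.eye(n)],
                  [np.zeros((1, n)), np.ones((1, 1)), np.zeros((1, n))]])
    # Proposition A.2: distribution of (Y, f*) = A (f, f*, eps).
    Sigma_A = A @ Sigma_plus @ A.T
    D_A = D_plus @ Sigma_plus @ A.T @ np.linalg.inv(Sigma_A)
    Delta_A = np.eye(2) + D_plus @ Sigma_plus @ D_plus.T - D_A @ A @ Sigma_plus @ D_plus.T
    # Proposition A.3: condition f* on Y (zero location, v+ = 0).
    w = Sigma_A[n, :n] @ np.linalg.inv(Sigma_A[:n, :n])   # k^T (Sigma + tau^2 I)^{-1}
    D1, D2 = D_A[:, :n], D_A[:, n:]
    D_star = D1 + D2 @ w[None, :]
    mu_star = w @ Y
    sigma2_star = Sigma_A[n, n] - w @ Sigma_A[:n, n]
    return mu_star, sigma2_star, D2, -D_star @ Y, Delta_A

# Hypothetical data and kernel.
t = np.array([0.1, 0.4, 0.9])
t_star = 0.5
kern = lambda a, b: np.exp(-0.5 * (a - b) ** 2)
Sigma = kern(t[:, None], t[None, :])
k = kern(t, t_star)
Y = np.array([0.2, -0.1, 0.5])
mu_star, sigma2_star, D2, v_c, Delta_A = predictive_csn_params(
    Sigma, k, 1.0, Y, tau=0.5, alpha=1.0, beta=2.0)
```

The returned `mu_star` and `sigma2_star` coincide with the usual GPR predictive mean and variance; the skewness of the predictive density is carried by `D2`, `v_c`, and `Delta_A`.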
Proof of (ii). Here, we find the mean and the variance of $f^* \mid Y, t^*$ by applying Proposition A.4. The moment generating function of $f^* \mid Y, t^*$ is
\[
M_{f^* \mid Y, t^*}(s) = \frac{\Phi_2\bigl(D_2 \sigma^{*2} s;\, -D^* Y,\, \Delta_A + \sigma^{*2} D_2 D_2^T\bigr)}{\Phi_2\bigl(0_{2\times 1};\, -D^* Y,\, \Delta_A + \sigma^{*2} D_2 D_2^T\bigr)}\; e^{s \mu^* + \frac{1}{2} \sigma^{*2} s^2}, \qquad s \in \mathbb{R},
\]
where $\sigma^{*2} = k^* - k^T (\Sigma + \tau^2 I_n)^{-1} k$ and $\mu^* = k^T (\Sigma + \tau^2 I_n)^{-1} Y$. Let $\Phi_2^{(j)}(\cdot, \cdot)$ denote the first partial derivative of $\Phi_2(\cdot, \cdot)$ with respect to the $j$th component for $j = 1, 2$. Also, let $\Phi_2^{(ij)}(\cdot, \cdot)$ denote the mixed second partial derivative of $\Phi_2(\cdot, \cdot)$. Now we find the mean and the variance of $f^* \mid Y, t^*$:
\[
E(f^* \mid Y, t^*) = \frac{\partial}{\partial s} M_{f^* \mid Y}(s)\Big|_{s=0}
= \frac{1}{\Phi_2\bigl(0_{2\times 1};\, -D^* Y,\, \Delta_A + \sigma^{*2} D_2 D_2^T\bigr)}\,
\frac{\partial}{\partial s} \left[ \Phi_2\!\left( \begin{pmatrix} D_{12} \sigma^{*2} s \\ D_{22} \sigma^{*2} s \end{pmatrix};\, -D^* Y,\, \Delta_A + \sigma^{*2} D_2 D_2^T \right) e^{s \mu^* + \frac{1}{2} \sigma^{*2} s^2} \right]_{s=0}
\]
\[
= \frac{1}{\Phi_2\bigl(0_{2\times 1};\, -D^* Y,\, \Delta_A + \sigma^{*2} D_2 D_2^T\bigr)}
\Bigl[ \Phi_2^{(1)}\bigl(D_2 \sigma^{*2} s;\, -D^* Y,\, \Delta_A + \sigma^{*2} D_2 D_2^T\bigr) D_{12} \sigma^{*2}
+ \Phi_2^{(2)}\bigl(D_2 \sigma^{*2} s;\, -D^* Y,\, \Delta_A + \sigma^{*2} D_2 D_2^T\bigr) D_{22} \sigma^{*2}
\]
\[
\qquad + \Phi_2\bigl(D_2 \sigma^{*2} s;\, -D^* Y,\, \Delta_A + \sigma^{*2} D_2 D_2^T\bigr) \bigl(\mu^* + \sigma^{*2} s\bigr) \Bigr] e^{s \mu^* + \frac{1}{2} \sigma^{*2} s^2} \Big|_{s=0}.
\]
Finally, we find that
\[
E(f^* \mid Y, t^*) = \mu^* + \sigma^{*2} \Bigl[ D_{12} \Phi_2^{(1)}\bigl(0_{2\times 1};\, -D^* Y,\, \Delta_A + \sigma^{*2} D_2 D_2^T\bigr)
+ D_{22} \Phi_2^{(2)}\bigl(0_{2\times 1};\, -D^* Y,\, \Delta_A + \sigma^{*2} D_2 D_2^T\bigr) \Bigr],
\]
where $\Phi_2^{(1)}$ is the first derivative of $\Phi_2$ with respect to the first component, and $\Phi_2^{(2)}$ is the first derivative of $\Phi_2$ with respect to the second component.

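Also, we need to find $E(f^{*2} \mid Y, t^*)$ to calculate the variance of $f^* \mid Y, t^*$. So
\[
E\bigl(f^{*2} \mid Y, t^*\bigr) = \frac{1}{\Phi_2\bigl(0_{2\times 1};\, -D^* Y,\, \Delta_A + \sigma^{*2} D_2 D_2^T\bigr)}\,
\frac{\partial^2}{\partial s^2} \left[ \Phi_2\!\left( \begin{pmatrix} D_{12} \sigma^{*2} s \\ D_{22} \sigma^{*2} s \end{pmatrix};\, -D^* Y,\, \Delta_A + \sigma^{*2} D_2 D_2^T \right) e^{s \mu^* + \frac{1}{2} \sigma^{*2} s^2} \right]_{s=0}
\]
\[
= \frac{1}{\Phi_2\bigl(0_{2\times 1};\, -D^* Y,\, \Delta_A + \sigma^{*2} D_2 D_2^T\bigr)}\,
\frac{\partial}{\partial s} \Bigl[ \Bigl( \Phi_2^{(1)}\bigl(D_2 \sigma^{*2} s;\, -D^* Y,\, \Delta_A + \sigma^{*2} D_2 D_2^T\bigr) D_{12} \sigma^{*2}
+ \Phi_2^{(2)}\bigl(D_2 \sigma^{*2} s;\, -D^* Y,\, \Delta_A + \sigma^{*2} D_2 D_2^T\bigr) D_{22} \sigma^{*2}
\]
\[
\qquad + \Phi_2\bigl(D_2 \sigma^{*2} s;\, -D^* Y,\, \Delta_A + \sigma^{*2} D_2 D_2^T\bigr) \bigl(\mu^* + \sigma^{*2} s\bigr) \Bigr) e^{s \mu^* + \frac{1}{2} \sigma^{*2} s^2} \Bigr]_{s=0}.
\]
After substituting $s = 0$, $E(f^{*2} \mid Y, t^*)$ reduces to
\[
E\bigl(f^{*2} \mid Y, t^*\bigr)
= 2 \Bigl( \Phi_2^{(11)}\bigl(0_{2\times 1};\, -D^* Y,\, \Delta_A + \sigma^{*2} D_2 D_2^T\bigr) (D_{12} \sigma^{*2})^2
+ \Phi_2^{(12)}\bigl(0_{2\times 1};\, -D^* Y,\, \Delta_A + \sigma^{*2} D_2 D_2^T\bigr) D_{12} \sigma^{*2} D_{22} \sigma^{*2}
\]
\[
\qquad + \Phi_2^{(21)}\bigl(0_{2\times 1};\, -D^* Y,\, \Delta_A + \sigma^{*2} D_2 D_2^T\bigr) D_{12} \sigma^{*2} D_{22} \sigma^{*2}
+ \Phi_2^{(22)}\bigl(0_{2\times 1};\, -D^* Y,\, \Delta_A + \sigma^{*2} D_2 D_2^T\bigr) (D_{22} \sigma^{*2})^2 \Bigr)
\]
\[
\qquad + 4 \Bigl( \Phi_2^{(1)}\bigl(0_{2\times 1};\, -D^* Y,\, \Delta_A + \sigma^{*2} D_2 D_2^T\bigr) D_{12} \sigma^{*2}
+ \Phi_2^{(2)}\bigl(0_{2\times 1};\, -D^* Y,\, \Delta_A + \sigma^{*2} D_2 D_2^T\bigr) D_{22} \sigma^{*2} \Bigr) \mu^* + \mu^{*2} + \sigma^{*2}.
\]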
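Hence,
\[
\mathrm{var}\bigl(f^* \mid Y, t^*\bigr) = E\bigl(f^{*2} \mid Y\bigr) - \bigl(E(f^* \mid Y)\bigr)^2
\]
\[
= 2 \Bigl( \Phi_2^{(11)}\bigl(0_{2\times 1};\, -D^* Y,\, \Delta_A + \sigma^{*2} D_2 D_2^T\bigr) (D_{12} \sigma^{*2})^2
+ \Phi_2^{(12)}\bigl(0_{2\times 1};\, -D^* Y,\, \Delta_A + \sigma^{*2} D_2 D_2^T\bigr) D_{12} \sigma^{*2} D_{22} \sigma^{*2}
\]
\[
\qquad + \Phi_2^{(21)}\bigl(0_{2\times 1};\, -D^* Y,\, \Delta_A + \sigma^{*2} D_2 D_2^T\bigr) D_{12} \sigma^{*2} D_{22} \sigma^{*2}
+ \Phi_2^{(22)}\bigl(0_{2\times 1};\, -D^* Y,\, \Delta_A + \sigma^{*2} D_2 D_2^T\bigr) (D_{22} \sigma^{*2})^2 \Bigr)
\]
\[
\qquad + 4 \Bigl( \Phi_2^{(1)}\bigl(0_{2\times 1};\, -D^* Y,\, \Delta_A + \sigma^{*2} D_2 D_2^T\bigr) D_{12} \sigma^{*2}
+ \Phi_2^{(2)}\bigl(0_{2\times 1};\, -D^* Y,\, \Delta_A + \sigma^{*2} D_2 D_2^T\bigr) D_{22} \sigma^{*2} \Bigr) \mu^* + \mu^{*2} + \sigma^{*2}
\]
\[
\qquad - \Bigl( \mu^* + \sigma^{*2} \Bigl[ D_{12} \Phi_2^{(1)}\bigl(0_{2\times 1};\, -D^* Y,\, \Delta_A + \sigma^{*2} D_2 D_2^T\bigr)
+ D_{22} \Phi_2^{(2)}\bigl(0_{2\times 1};\, -D^* Y,\, \Delta_A + \sigma^{*2} D_2 D_2^T\bigr) \Bigr] \Bigr)^2 .
\]
Using the necessary algebra, the expression for $\mathrm{var}(f^* \mid Y, t^*)$ reduces to
\[
\mathrm{var}\bigl(f^* \mid Y, t^*\bigr) = 2 \sigma^{*4} \Bigl( \Phi_2^{(11)}\bigl(0_{2\times 1};\, -D^* Y,\, \Delta_A + \sigma^{*2} D_2 D_2^T\bigr) D_{12}^2
+ \Phi_2^{(12)}\bigl(0_{2\times 1};\, -D^* Y,\, \Delta_A + \sigma^{*2} D_2 D_2^T\bigr) D_{12} D_{22}
\]
\[
\qquad + \Phi_2^{(21)}\bigl(0_{2\times 1};\, -D^* Y,\, \Delta_A + \sigma^{*2} D_2 D_2^T\bigr) D_{12} D_{22}
+ \Phi_2^{(22)}\bigl(0_{2\times 1};\, -D^* Y,\, \Delta_A + \sigma^{*2} D_2 D_2^T\bigr) D_{22}^2 \Bigr),
\]
where $\Phi_2^{(11)}$ is the derivative of $\Phi_2^{(1)}$ with respect to the first component, $\Phi_2^{(12)}$ is the derivative of $\Phi_2^{(1)}$ with respect to the second component, $\Phi_2^{(21)}$ is the derivative of $\Phi_2^{(2)}$ with respect to the first component, and $\Phi_2^{(22)}$ is the derivative of $\Phi_2^{(2)}$ with respect to the second component.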
B. Proof of Theorem 4.1.

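Proof. (ii) and (iii)(a)–(iii)(b). Since $k(\cdot, \cdot)$ is isotropic and $t_1, t_2, \ldots, t_n$ are vertices of a regular polygon with center located at $t^*$, $k(t_i, t^*)$ is constant for each $i = 1, \ldots, n$ and the matrix $\Sigma = \{k(t_i, t_j)\}_{i,j=1}^n$ is circulant. Let $k_0 = k(t_i, t^*)$. Then $k = k(t^*) = k_0 \mathbf{1}_n$. Moreover, the matrix $(\Sigma + \tau^2 I_n)^{-1}$ is also circulant. Therefore,
\[
b(\tau^2, \beta^2, n) = \sqrt{\frac{2}{\pi}}\, \frac{\tau^2 \beta}{\sqrt{1 + \beta^2 \tau^2 n}}\; k^T (\Sigma + \tau^2 I_n)^{-1} \mathbf{1}_n
= \sqrt{\frac{2}{\pi}}\, \frac{\tau^2 \beta k_0}{\sqrt{1 + \beta^2 \tau^2 n}}\; \mathbf{1}_n^T (\Sigma + \tau^2 I_n)^{-1} \mathbf{1}_n .
\]
Since $(\Sigma + \tau^2 I_n)^{-1}$ is circulant and $\mathbf{1}_n$ is an eigenvector of any circulant matrix,
\[
(\Sigma + \tau^2 I_n)^{-1} \mathbf{1}_n = \frac{1}{L_n} \mathbf{1}_n,
\]
where
\[
L_n = \tau^2 + \sum_{i=1}^n k(t_1, t_i).
\]
Hence,
\[
b(\tau^2, \beta^2, n) = \sqrt{\frac{2}{\pi}}\, \frac{\tau^2 \beta k_0 n}{\sqrt{1 + \beta^2 \tau^2 n}\; L_n}.
\]
It is easy to see that $b(\tau^2, \beta^2, n)$ is nonzero for all nonzero values of $\tau$, $\beta$, and $k_0$, and positive whenever $\beta k_0 > 0$. If $\sum_{i=1}^n k(t_1, t_i) = O(n)$, where $n^{-0.5} O(n) \to c \neq 0$, then
\[
b(\tau^2, \beta^2, n) = \sqrt{\frac{2}{\pi}}\, \frac{\tau^2 \beta k_0 n}{\sqrt{1 + \beta^2 \tau^2 n}\, \bigl(\tau^2 + O(n)\bigr)}
= \sqrt{\frac{2}{\pi}}\, \frac{\tau^2 \beta k_0}{\dfrac{\sqrt{1 + \beta^2 \tau^2 n}}{\sqrt{n}} \cdot \dfrac{\tau^2 + O(n)}{\sqrt{n}}}.
\]
Hence,
\[
\lim_{n \to \infty} b(\tau^2, \beta^2, n) = \sqrt{\frac{2}{\pi}}\, \frac{\tau^2 \beta k_0}{\sqrt{\beta^2 \tau^2}}\, \frac{1}{c}
= \sqrt{\frac{2}{\pi}}\, \frac{\tau \beta k_0}{c\, |\beta|}.
\]
To show that $\lim_{\tau \to \infty} b(\tau^2, \beta^2, n) = 0$, we notice that $(\Sigma + \tau^2 I_n)^{-1} \mathbf{1}_n = \frac{1}{L_n} \mathbf{1}_n$. Consequently, we find that
\[
b(\tau^2, \beta^2, n) = \sqrt{\frac{2}{\pi}}\, \frac{\tau^2 \beta k_0 n}{\sqrt{1 + \beta^2 \tau^2 n}\; L_n}
= \sqrt{\frac{2}{\pi}}\, \frac{n \beta k_0 \tau^2}{\bigl(\tau^2 + \sum_{i=1}^n k(t_1, t_i)\bigr) \sqrt{1 + \beta^2 \tau^2 n}}.
\]
Since $\tau^2 / \bigl(\tau^2 + \sum_{i=1}^n k(t_1, t_i)\bigr) \to 1$ and $1 / \sqrt{1 + \beta^2 \tau^2 n} \to 0$ as $\tau \to \infty$, we conclude that $\lim_{\tau \to \infty} b(\tau^2, \beta^2, n) = 0$.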
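The closed form for the bias can be checked numerically on a concrete regular polygon. This is a minimal sketch under assumed inputs: the isotropic squared-exponential kernel `kern`, the unit-circle polygon centered at the origin, and the values of `n` and `beta` are illustrative choices, and the function name `bias` is hypothetical.

```python
import numpy as np

n, beta = 8, 2.0
angles = 2.0 * np.pi * np.arange(n) / n
t = np.column_stack([np.cos(angles), np.sin(angles)])   # vertices of a regular polygon
t_star = np.zeros(2)                                     # its center

def kern(a, b):
    # Assumed isotropic squared-exponential covariance.
    return np.exp(-0.5 * np.sum((a - b) ** 2, axis=-1))

Sigma = kern(t[:, None, :], t[None, :, :])               # circulant n x n matrix
k = kern(t, t_star)                                      # constant vector k0 * 1_n
k0 = k[0]

def bias(tau):
    """Return (direct, closed-form) values of b(tau^2, beta^2, n)."""
    scale = np.sqrt(2.0 / np.pi) * tau ** 2 * beta / np.sqrt(1.0 + beta ** 2 * tau ** 2 * n)
    direct = scale * (k @ np.linalg.solve(Sigma + tau ** 2 * np.eye(n), np.ones(n)))
    L_n = tau ** 2 + Sigma[0].sum()
    closed = scale * k0 * n / L_n
    return direct, closed

b_direct, b_closed = bias(0.7)
b_large_tau, _ = bias(1.0e4)
```

The two values returned by `bias` should agree for any positive `tau`, and the value at a very large `tau` illustrates the limit $\lim_{\tau \to \infty} b(\tau^2, \beta^2, n) = 0$.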
C. Algorithms for Section 5

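i. Confidence intervals for hyperparameters
1. Simulate a large sample, say $\tau^{(i)}, \sigma^{2(i)}, \alpha^{(i)}, \beta^{(i)}$, and $\lambda^{(i)}$, $i = 1, \ldots, N$, from $P(\tau, \sigma^2, \alpha, \beta, \lambda \mid Y)$.
2. Let $q^N_{0.025}(\tau)$, $q^N_{0.025}(\sigma^2)$, $q^N_{0.025}(\alpha)$, $q^N_{0.025}(\beta)$, $q^N_{0.025}(\lambda)$ be the 2.5% percentiles of the samples $\tau^{(i)}, \sigma^{2(i)}, \alpha^{(i)}, \beta^{(i)}$, and $\lambda^{(i)}$, $i = 1, \ldots, N$, respectively, and let $q^N_{0.975}(\tau)$, $q^N_{0.975}(\sigma^2)$, $q^N_{0.975}(\alpha)$, $q^N_{0.975}(\beta)$, $q^N_{0.975}(\lambda)$ be the 97.5% percentiles of the same samples, respectively.
3. Then $\bigl(q^N_{0.025}(\tau), q^N_{0.975}(\tau)\bigr)$, $\bigl(q^N_{0.025}(\sigma^2), q^N_{0.975}(\sigma^2)\bigr)$, $\bigl(q^N_{0.025}(\alpha), q^N_{0.975}(\alpha)\bigr)$, $\bigl(q^N_{0.025}(\beta), q^N_{0.975}(\beta)\bigr)$, $\bigl(q^N_{0.025}(\lambda), q^N_{0.975}(\lambda)\bigr)$ are 95% confidence intervals for $\tau$, $\sigma^2$, $\alpha$, $\beta$, and $\lambda$, respectively.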
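Steps 2 and 3 amount to taking empirical percentiles of the posterior draws column by column. A minimal sketch follows, assuming the draws are already stored in an $N \times d$ array with one column per hyperparameter; the function name `percentile_intervals` and the synthetic draws are hypothetical, and obtaining real draws from $P(\tau, \sigma^2, \alpha, \beta, \lambda \mid Y)$ is outside the scope of the sketch.

```python
import numpy as np

def percentile_intervals(samples, names, level=0.95):
    """Equal-tailed percentile intervals, one per column of `samples`."""
    lo = 100.0 * (1.0 - level) / 2.0
    hi = 100.0 * (1.0 + level) / 2.0
    q = np.percentile(samples, [lo, hi], axis=0)   # shape (2, d)
    return {name: (q[0, j], q[1, j]) for j, name in enumerate(names)}

# Synthetic stand-in for posterior draws of two hyperparameters.
draws = np.column_stack([np.linspace(0.0, 1.0, 1001), np.linspace(0.0, 2.0, 1001)])
intervals = percentile_intervals(draws, ["tau", "sigma2"])
```

For real output, `draws` would hold the $N$ simulated values of $(\tau, \sigma^2, \alpha, \beta, \lambda)$ and `names` would list all five hyperparameters.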
ii. Algorithm to Solve Eq. (8)
Inputs:
• Increments d1, d2, a large real number L0, and a precision ω.
• A fine grid for the space of t∗, say t∗1, . . . , t∗M.
• Grid and sample sizes M and N, respectively.
Start.
1. For each t∗j, simulate a large sample of size N from the posterior.
2. For the ith posterior draw, find Ti and Ui, the minimum and the maximum over j = 1, . . . , M of its values at t∗j, i = 1, . . . , N.
3. Estimate p, the probability in (8), via $\hat{p} = \frac{1}{N} \sum_{i=1}^N I(l \le T_i \text{ and } U_i \le u)$.
For l = −L0 to L1 STEP d1 (searching for the solution in (−L0, L0) × (−L0, L0))
  For u = −U0 to U1 STEP d2
    Do while (|p̂ − 0.95| > ω), starting from l = −L0, u = −L0
      For each t∗j, simulate a large sample of size N from the posterior.
      Find Ti and Ui as in step 2.
      Estimate p, the probability in (8), via $\hat{p} = \frac{1}{N} \sum_{i=1}^N I(l \le T_i \text{ and } U_i \le u)$.
      Update l and u: l ← l + d1 and u ← u + d2.
    End Do
Output: L = l and U = u
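The search for a band (L, U) can be sketched as follows. This is an illustrative variation and not the algorithm exactly as listed: it assumes the posterior draws are already available as an $N \times M$ array (row $i$ holds the $i$th draw evaluated on the grid $t^*_1, \ldots, t^*_M$), starts from the wide band $(-L_0, L_0)$, and shrinks both ends by the increments until the estimated coverage would fall below the target level. The function name `simultaneous_band` and the synthetic draws are hypothetical.

```python
import numpy as np

def simultaneous_band(draws, L0=10.0, d1=0.01, d2=0.01, level=0.95):
    """Shrink (l, u) from (-L0, L0) while the estimate p_hat stays at or above `level`."""
    T = draws.min(axis=1)            # T_i: minimum of draw i over the grid
    U = draws.max(axis=1)            # U_i: maximum of draw i over the grid

    def p_hat(l, u):
        # Monte Carlo estimate of P(l <= T_i and U_i <= u).
        return np.mean((T >= l) & (U <= u))

    l, u = -L0, L0
    while True:
        new_l, new_u = l + d1, u - d2
        if new_l >= new_u or p_hat(new_l, new_u) < level:
            break
        l, u = new_l, new_u
    return l, u, p_hat(l, u)

# Synthetic stand-in for N = 2000 posterior draws on a grid of M = 20 points.
rng = np.random.default_rng(0)
draws = rng.standard_normal((2000, 20))
l, u, coverage = simultaneous_band(draws)
```

Because the band is shrunk symmetrically in steps, the returned coverage is at or just above the target level. An asymmetric search over a two-dimensional grid of (l, u), closer to the listing above, would only change the loop.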


Received May 9, 2012; Accepted October 3, 2012.
