J Geogr Syst (2009) 11:381–406

DOI 10.1007/s10109-009-0090-z

ORIGINAL ARTICLE

Area-to-point Kriging in spatial hedonic pricing models

E.-H. Yoo · P. C. Kyriakidis

Received: 16 October 2008 / Accepted: 18 May 2009 / Published online: 13 June 2009
© Springer-Verlag 2009

Abstract This paper proposes a geostatistical hedonic price model in which the
effects of location on house values are explicitly modeled. The proposed geostatistical
approach, namely area-to-point Kriging with External Drift (A2PKED), can
take into account spatial dependence and spatial heteroskedasticity, if they exist.
Furthermore, this approach has significant implications in situations where
exhaustive area-averaged housing price data are available in addition to a subset of
individual housing price data. In the case study, we demonstrate that A2PKED
substantially improves the quality of predictions using apartment sale transaction
records that occurred in Seoul, South Korea, during 2003. The improvement is
illustrated via a comparative analysis, where predicted values obtained from different
models, including two traditional regression-based hedonic models and a
point-support geostatistical model, are compared to those obtained from the
A2PKED model.

Keywords Spatial hedonic model · Kriging · External drift · Geostatistics

JEL Classification C51 · C52 · C53 · R20

E.-H. Yoo (✉)
Department of Geography, University at Buffalo,
The State University of New York, Buffalo, NY, USA
e-mail: eunhye@buffalo.edu

P. C. Kyriakidis
Department of Geography, University of California Santa Barbara, Santa Barbara, CA, USA

1 Introduction

Prediction of housing prices using hedonic models has a practical implication in the
context of mass-appraisal or house price index construction, because often only a
subset of individual house transaction records is publicly available and significant
uncertainty exists regarding the true value of a house. From a traditional economic
view, hedonic pricing is concerned with the generation of implicit prices for
attributes of commodities, where a commodity is viewed as a bundle of amenities
(Rosen 1974). In this context, a house is a commodity composed of bundles of
attributes capitalized into the value of the house. These attribute bundles are not
explicitly traded on the market, but rather bought and sold as a package. Typical
amenities comprise a set of variables describing the structural characteristics of a
house, such as age and size, and attributes associated with the geographical location
and the quality of neighborhood in which the house is located (Can and Megbolugbe
1997).
Typically, the primary concern of hedonic modeling is the estimation of the
implicit price of each attribute using statistical methods. Ordinary least squares
(OLS) is one of the most commonly used techniques for estimating the coefficients
of a (typically linear) regression model involving house price as the dependent
variable and amenities as predictors. OLS is optimal, however, only under the
assumption of independent and homoscedastic model errors, which may not
hold in most practical situations. For example, a hedonic model may suffer from
unreliable and biased model parameter estimates due to the omission of relevant
neighborhood characteristics, which are often associated with unmeasurable or
qualitative neighborhood qualities of a house. To alleviate such problems, a full set
of dummy variables that indicate houses located within a certain neighborhood
boundary may be adopted (Black 1999). This method accounts for any unobserved
characteristics shared by houses on either side of the dummy boundary, although the
delineation of such neighborhood boundaries is still problematic. Alternatively,
proxies for neighborhood variables such as a set of descriptive statistics for areal
units generalized from US Census can be used (Can and Megbolugbe 1997).
Not only omitted variables, but also spatial effects in housing prices are likely to
affect the efficiency and significance of hedonic model coefficient estimates and
may lead to errors in the interpretation of regression diagnostics, such as tests for
heteroskedasticity (Kim et al. 2003). These spatial effects pertain to the nature of
housing values, i.e., nearby houses often have similar structural features because
they were built at the same time on similar sized lots and they share locational
amenities (Dubin et al. 1999). In an effort to improve hedonic model quality while
keeping a parsimonious model, accounting for spatial dependence has been a
subject of considerable recent research (Dubin 1988; Can 1990; Basu and
Thibodeau 1998; Pace and Gilley 1998; Gelfand et al. 2004; Paez et al. 2008).
Here, we briefly review some spatial hedonic models that explicitly account for
spatial dependence by incorporating a spatially correlated error structure with
advanced statistical techniques. For example, a generalized least squares (GLS)
algorithm, unlike OLS, is applicable when spatial correlation is assumed for model
errors. There are empirical studies showing that this spatial error structure modeling
approach is flexible and yields improved predictions over approaches relying on
complicated functions of multiple variables to eliminate the problem of spatial error
dependencies (Dubin et al. 1999; Anselin 2002; Lesage and Pace 2004b), although
specification of the error covariance is challenging. Typically, the following three
approaches are employed for specifying the spatial structure underlying observed
data: the inverse of the data-to-data covariance matrix in the case of conditional
autoregressive (CAR) models (Gelfand et al. 2004), the inverse of the square root of
the data-to-data covariance matrix in the case of the simultaneous autoregressive
(SAR) models (Pace and Gilley 1998), and the data-to-data covariance matrix itself
in the case of geostatistical approaches (Dubin 1988; Chica-Olmo 1995).
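
As an illustration of the GLS idea discussed above, the following is a minimal sketch of a GLS coefficient estimate under a spatially correlated error covariance. It is not the authors' implementation: the exponential covariance with a practical-range convention, the helper names `exp_cov` and `gls`, and all data values are assumptions made for this example.

```python
import numpy as np

def exp_cov(coords_a, coords_b, sill=1.0, range_=1.0):
    """Isotropic exponential covariance sill * exp(-3 h / range_) (assumed model)."""
    d = np.linalg.norm(coords_a[:, None, :] - coords_b[None, :, :], axis=-1)
    return sill * np.exp(-3.0 * d / range_)

def gls(F, z, Sigma):
    """GLS estimate (F' S^-1 F)^-1 F' S^-1 z, computed with linear solves."""
    Si_F = np.linalg.solve(Sigma, F)
    Si_z = np.linalg.solve(Sigma, z)
    return np.linalg.solve(F.T @ Si_F, F.T @ Si_z)

rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 10.0, size=(30, 2))              # made-up house locations
F = np.column_stack([np.ones(30), rng.normal(size=30)])    # intercept + one covariate
z = F @ np.array([2.0, 0.5]) + rng.normal(scale=0.3, size=30)

Sigma = exp_cov(coords, coords, sill=1.0, range_=4.0)
beta_gls = gls(F, z, Sigma)
beta_ols = gls(F, z, np.eye(30))   # an identity covariance reduces GLS to OLS
```

With an identity error covariance the same routine returns the OLS coefficients, which makes explicit that the two estimators differ only in the assumed error structure.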
Hedonic models incorporating spatial effects also include the spatial lag model,
where the price of a house is determined by a combination of a spatially weighted
average of housing prices in a neighborhood (indirect effects) and the standard
explanatory variables of housing and neighborhood characteristics (direct effects)
(Kim et al. 2003). The spatial lag model differs from the spatial error model in that
the latter does not include indirect effects but is based on the assumption that there
exist omitted variables in the hedonic price equation and the spatial dependence of
the error term is due to those spatially varying omitted variable(s) (Anselin 1998,
2002). The two models are closely related to each other mathematically, but their
economic interpretations are slightly different. The relevance of one model versus
another depends on the particular application at hand: for example, a spatial error
model is preferred over a spatial lag model if the spatial pattern of residuals is
considered as potentially valuable information (Paez et al. 2008).
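
For reference, the two specifications contrasted above are commonly written as follows. This is the usual textbook form rather than a formula given in this paper; the price vector $\mathbf{z}$, the covariate matrix $\mathbf{X}$, the spatial weights matrix $\mathbf{W}$, the autoregressive parameters $\rho$ and $\theta$, and the independent error $\boldsymbol{\varepsilon}$ are notation assumed here.

```latex
\begin{align}
  % Spatial lag model: neighboring prices enter the equation directly (indirect effects)
  \mathbf{z} &= \rho \mathbf{W} \mathbf{z} + \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon} \\
  % Spatial error model: spatial dependence is confined to the error term
  \mathbf{z} &= \mathbf{X} \boldsymbol{\beta} + \mathbf{e}, \qquad
  \mathbf{e} = \theta \mathbf{W} \mathbf{e} + \boldsymbol{\varepsilon}
\end{align}
```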
Another issue of concern in hedonic models is the conceptualization of the spatial
variation in housing prices. In recent works on hedonic modeling, both Jones and
Bullen (1994) and Orford (2000) emphasize the importance of contextual effects,
that is, "the difference a place makes," and claim that such effects should be taken
into account along with compositional effects (or differences) produced by the
variations in housing attributes within each place (Paez et al. 2001). Here, place is
equivalent to an areal unit (or housing submarket) in which the price per unit of
housing quantity is constant (Goodman and Thibodeau 1998). From this perspective,
house prices and the factors that influence prices can be seen as operating
across several spatial scales, and each scale needs to be defined explicitly in the
hedonic model specification. Unfortunately, most hedonic models pay little
attention to such multi-scale variations of housing values.
Most geostatistical hedonic models are no exception, in that the effects of
location across various spatial scales on house transaction prices are not properly
captured. Multi-level or hierarchical hedonic modeling approaches (Jones
and Bullen 1994; Orford 2000) are perhaps the most notable efforts to explicitly take
into account contextual effects, i.e., the spatial variation of house values across
spatial scales, in housing applications. Multi-level modeling, however, often fails to
account for the support differences between observed data and predictions, as well as
those among different source data. Here, support pertains to the areal extent of a
datum or a sought-after prediction; see, for example, Kyriakidis (2004) for an
in-depth discussion of support-adjusted covariance specification.
Focusing on the prediction of property values at the individual house level while
taking into account the effects of location across various spatial scales, we propose a
geostatistical hedonic model, namely area-to-point Kriging with External Drift
(A2PKED). A2PKED is an extension of area-to-point Kriging (Gotway and Young
2002; Kyriakidis 2004; Goovaerts 2005) which incorporates covariates to estimate a
spatially varying drift component. The coherence property of area-to-point Kriging,
therefore, can be readily used to complement the lack of individual household data
with aggregate housing property values. That is, the information provided from
area-aggregated house prices adjusts the individual house price predictions by
forcing the latter to reproduce the area-level values when such prices are
reaggregated over the area, e.g., a census tract or local housing market. Despite
its significance, the areal data reproduction property is not shared by any other
hedonic modeling approach. This areal data reproduction can be achieved by a
consistent covariance modeling that explicitly takes into account support
differences.
In addition, the proposed geostatistical hedonic modeling approach allows
addressing spatial dependence and heteroskedasticity simultaneously, if they exist.
The A2PKED regression coefficients are implicitly re-estimated from one areal unit
to another, implying that the relationship between house prices and predictors is not
constant over the entire study area. For example, the implicit price of apartment age
may not be constant across the whole study area, but may vary spatially (Chica-
Olmo 2007). The spatial dissimilarity of residual house values is also modeled with
multiple nested variogram models corresponding, for example, to factors whose
scale of variation is confined at the school district (high frequency variation) or city
boundary (low frequency variation). Last, the uncertainty associated with predicted
house values is quantified via the A2PKED prediction error variance.
In a case study, the application of the proposed geostatistical hedonic model is
illustrated using apartment sales prices in Seoul, South Korea. The resulting
A2PKED predictions are carefully examined in terms of their coherence, i.e.,
reproduction of the area-averaged housing prices, and their accuracy at a set of
validation locations. The performance comparison of the proposed A2PKED
approach versus other methods, such as OLS, GLS, and point-to-point Kriging with
External Drift (KED), allows evaluating the influence of neighborhood effects on
the house price determination process.

2 Study area and data

The study area is the capital of South Korea, Seoul, which is located at the center of
the Korean peninsula with a population of around 12 million people by 2000. The area
of Seoul is 605.77 km², and the Han River bisects the city into two parts, northern
(KangBuk) and southern (KangNam) Seoul; see Fig. 1. The main focus of this
study is on the KangNam area shown in Fig. 1c, which is a dense apartment
complex area for the upper-middle classes. Most apartment complexes in this area are
relatively expensive compared to those in the other parts of Seoul due to the high
quality of education, convenience, safety, accessibility to green areas and Han River
views. The study area comprises four administrative district units (gus) among 25
autonomous gus in Seoul. Each gu consists of several dongs, which are the smallest
administrative district units in South Korea. A total of 93 dongs are included in the
study area, but there are five dongs without apartment complexes due to their land
use, i.e., commercial or green areas.


Fig. 1 a The capital city of South Korea, Seoul. b Seoul is divided by the Han River into KangBuk
and KangNam, which are residential areas, mostly with apartments for the upper-middle classes.
Administratively, Seoul consists of autonomous gus (wards), which are subdivided into 522 dongs
(precincts). c Study area consisting of four gus: KangNam, Seocho, KangDong, and Songpa,
which are characterized by high density apartment complexes. The transaction prices are represented by
the height of bars located at the centers of apartment complexes

The original data set used in this study consists of 648 records of apartment
complex sale transactions that occurred in the KangNam and KangDong areas
during 2003 (see the height of bars in Fig. 1c). Each apartment complex may
contain multiple units (or flats) with different sizes and a wide range of sale prices,
but the spatial average of individual flat sale prices (of those nested within the
complex) is taken as the price datum of a single complex. The sale price of an
individual flat is calculated as the average of its minimum and maximum assessed
values. This decision to treat apartment complexes as data of point support is made
under the consideration of the spatial extent of the study area relative to the size of
apartment complexes. To maintain consistency across all spatial prediction methods
considered hereafter, the study area is discretized using an equally spaced
(188 × 168) regular grid with cell size (348 × 348) meters, and the centroid of each
apartment complex is assigned to the nearest grid node. It should also be noted that
the implicit prices exhibited by assessed apartment values may be different from
those derived from selling (market) prices because assessment processes do not
necessarily reflect the dynamics of markets and assessors may have different views
of important attributes of apartment complexes.
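
The assignment of complex centroids to grid nodes described above can be sketched as follows. This is only an illustration: the grid origin, the orientation of the 188 × 168 grid (here 188 nodes along x and 168 along y), the helper name `snap_to_grid`, and the centroid coordinates are assumptions, while the 348 m spacing is the value quoted in the text.

```python
import numpy as np

def snap_to_grid(xy, origin, cell, shape):
    """Return grid indices and coordinates of the node nearest to each point."""
    xy = np.asarray(xy, dtype=float)
    idx = np.rint((xy - origin) / cell).astype(int)   # nearest node index per axis
    idx = np.clip(idx, 0, np.array(shape) - 1)        # keep indices inside the grid
    return idx, origin + idx * cell

origin = np.array([0.0, 0.0])     # hypothetical lower-left grid node
cell = 348.0                      # node spacing in meters, as quoted in the text
shape = (188, 168)                # assumed: nodes along x, nodes along y

centroids = np.array([[500.0, 1000.0], [70000.0, -20.0]])   # made-up centroids
idx, nodes = snap_to_grid(centroids, origin, cell, shape)
```

Points falling outside the grid extent are clipped to the nearest boundary node in this sketch; whether the original study needed such clipping is not stated.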
The data used in this paper are taken from housing price databases and may not
represent a comprehensive housing market, but we assume that they represent
exhaustive individual housing price records at a given year. To duplicate a common
practical situation, however, whereby exhaustive individual house price records are
scarce but often their aggregate summary statistics, such as dong-averaged housing
prices, are easily obtainable, the original 648 apartment transaction records are
divided into two groups: training sample and validation data. A potential drawback
of this scheme is that the prediction performance of the proposed geostatistical
hedonic model depends on the spatial configuration of training and validation data,
as well as on the quality of dong-average information: how many training data are
available, how close the validation locations are to those of the training data, and
how many apartment complexes are included in each dong. To control such a bias
originating from a particular sample selection, model performance was assessed
repeatedly. That is, we randomly partition the 648 transaction records into training
data and validation data, and then assess model performances multiple times. More
details on the sampling and model performance evaluations will be given in
Sect. 4.2. A particular set of randomly selected training and validation data is
presented in Fig. 2b and c, respectively.
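
The repeated random partitioning described above can be sketched as below. The split sizes (300 training and 348 validation records out of 648) follow the figures reported for Fig. 2; the number of repetitions, the seed, and the function name `random_splits` are placeholders chosen for this example.

```python
import numpy as np

def random_splits(n_total, n_train, n_repeats, seed=0):
    """Yield (train_idx, valid_idx) index arrays for repeated random partitions."""
    rng = np.random.default_rng(seed)
    for _ in range(n_repeats):
        perm = rng.permutation(n_total)
        yield np.sort(perm[:n_train]), np.sort(perm[n_train:])

# n_repeats=5 is an arbitrary illustrative value.
splits = list(random_splits(n_total=648, n_train=300, n_repeats=5))
```

Each model would then be fitted on the training indices and scored on the validation indices of every split, so that performance summaries do not hinge on one particular sample configuration.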
Figure 2a shows the histogram of the original 648 apartment complex unit prices
[apartment sales price/pyong (3.3 m²)], which is skewed to the right due to the
presence of a few extreme values. This skewness of the distribution is also
confirmed by the higher mean value (1,563 thousand won) compared to the median
value (1,475 thousand won). Unit prices range from 475 thousand won per pyong
to 4,960 thousand won per pyong with a standard deviation of 660 thousand won per
pyong.
Contrary to hedonic price models developed within a conventional econometric
framework (Black 1999), in the current work a linear model with a small number of
covariates is considered. For site attributes associated with each apartment complex,
the average floor area of the residential units, the age of the apartment buildings, the
total number of units, and the existence of a scheduled plan for reconstruction are
considered. For neighborhood characteristics, only the desirability of the location in
terms of education opportunity is considered. The mean values of apartment unit
price and covariates are summarized in Table 1.
The other source of information, which is often ignored or draws little
attention in traditional hedonic models but is easily accessible, consists of area-
averaged summary statistics of apartment prices and covariates. For example, the
Korea Land Corporation publishes monthly summary statistics of apartment and
house transaction records to the public, and such data may provide information to
improve the precision and accuracy of house price predictions. In the current
study, areal data complementing the partially available individual records are
obtained by averaging the apartment sale prices of 648 apartment transaction
records over 88 dong units. Here, it should be noted that dong-averaged
apartment prices are derived conditional on the presence of apartments, i.e., dong-averaged
apartment prices do not include zero price values at discretization
locations where apartments do not exist. Along with the dong-averaged apartment
prices, dong-averaged values of covariates, i.e., the average floor area, the
average age of the building, and the portion of apartment complexes scheduled
for reconstruction conditional on apartment occurrence in each dong, are taken
over 88 dong units. The summary statistics of the dong-averaged variables are
also given in Table 1.

Fig. 2 a Summary statistics and histogram of a total of 648 records of apartment transaction price per
pyong (3.3 m²). b Locations of training data (300 points). c Locations of validation data (348 points)

Table 1 Mean values of hedonic factors used in the model

Variables                                     Units            Apt. complex   Dong-average
Housing prices     Housing price/pyong        Thousand won     1563.05        1616.7
Site att.          Average floor area         Pyong (3.3 m²)   35.6147        34.61
                   Age of building            Year             12             12.45
                   No. of units               No.              417            658.86
                   Scheduled reconstruction   Dummy            0.03           0.04
Neighborhood att.  8th School district        Dummy            0.58           0.46
No. of samples                                                 648            88
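
The dong averaging described above, conditional on apartments being present, can be sketched as follows. The function name `area_averages`, the prices, and the dong labels are made up for illustration; areas without any record simply never appear in the output, which mirrors the exclusion of dongs without apartment complexes.

```python
import numpy as np

def area_averages(values, area_ids):
    """Average point values within each area that contains at least one point."""
    values = np.asarray(values, dtype=float)
    area_ids = np.asarray(area_ids)
    labels, inverse, counts = np.unique(
        area_ids, return_inverse=True, return_counts=True
    )
    sums = np.bincount(inverse.reshape(-1), weights=values, minlength=len(labels))
    return labels, sums / counts, counts

prices = [1500.0, 1700.0, 2100.0, 900.0]     # made-up unit prices
dong = ["A", "A", "B", "C"]                  # made-up dong labels
labels, means, counts = area_averages(prices, dong)
```

The same routine applies to the covariates (floor area, building age, reconstruction dummy), giving the dong-averaged predictors used alongside the dong-averaged prices.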
In what follows, we describe a geostatistical housing price prediction model
which takes into account spatial autocorrelation present in various spatial scales,
while specifying the drift or expectation of housing price as a function of relevant
covariates.

3 Methods

3.1 Geostatistical hedonic price modeling

Given the nature of housing prices, the price value $z(\mathbf{u})$ of a property located at $\mathbf{u}$ is
viewed as a realization of a random variable (RV) $Z(\mathbf{u})$. Here, $\mathbf{u}$ denotes the
coordinate vector of the house location, typically latitude and longitude. This
approach allows modeling the set of house prices over a study area $A$ as a realization
of a non-stationary random function (RF), i.e., an infinite collection $\{Z(\mathbf{u}), \mathbf{u} \in A\}$ of
spatially correlated RVs (Chiles and Delfiner 1999). In such a setting, $Z(\mathbf{u})$ can be
decomposed into a mean and error (residual) component as:

$$Z(\mathbf{u}) = \mu(\mathbf{u}) + R(\mathbf{u}), \quad \mathbf{u} \in A, \tag{1}$$

where a deterministic function $\mu(\mathbf{u})$ (or drift) describes the average house price
and a zero-mean RV $R(\mathbf{u})$ captures the spatial variation of the house price
unexplained by the drift. The drift is typically specified by a linear function of a set
of covariates describing housing characteristics, i.e., $\mu(\mathbf{u}) = \sum_{m=0}^{M} \beta_m f_m(\mathbf{u})$ with, by
convention, $f_0(\mathbf{u}) = 1$. The stochastic residual component $R(\mathbf{u})$ of the house price
model has no implicit price, but instead is seen as a factor driving the spatial
variation in the housing price determination process (Can 1990). In other words,
this model specification suggests that the price $z(\mathbf{u})$ of a house located at $\mathbf{u}$ is
comprised of the value $\mu(\mathbf{u})$ encompassing the effects of structural and locational
attributes and the value $r(\mathbf{u})$ characterizing the idiosyncratic element of the house
(Jones and Bullen 1994).
There is a general consensus on the drift specification in terms of the functional
forms and commonly used attributes to describe the average price of a house.
Modeling the residual prices, however, is rather subjective because such residuals
could be entirely attributed to measurement errors as in traditional hedonic models,
or they could be viewed as a pervasive and fundamental feature of house price as in
spatial hedonic models (Cressie 1993, p 25). There is a large and growing body of
literature showing that a spatially correlated residual specification provides more
parsimonious and better fitting models than those relying on extensive explanatory
variables (LeSage and Pace 2004b). But these results should not be taken for
granted without caution, because the quality of such spatial hedonic model
predictions as well as that of model coefficient estimates highly depends on the
information contained in the observed data.
More specifically, consider a situation where only limited information
pertaining to individual house properties is available due to confidentiality issues,
whereas aggregated statistics of sales and valuations over areal units, such as
submarket or census tracts, are relatively easy to access. Most hedonic models
including spatial models are built on the basis of this partial individual house
information; spatial hedonic models may account for spatial autocorrelation or
heteroscedasticity present at the individual house level (LeSage and Pace 2004a;
Chica-Olmo 2007). None of the existing approaches, however, fully pays attention
to the aggregate statistics obtained from exhaustive surveys. In what follows, we
propose a geostatistical hedonic model applicable in a situation where the lack of
individual housing price data, i.e., data being available only for a subset of all
houses, is complemented by integrated measurements of the same attributes over
areal units.

3.2 Area-to-point Kriging with external drift

Consider the task of predicting the unknown house value at location $\mathbf{u}_p$ using a
subset of an exhaustive household survey $\{z(\mathbf{u}_i), i = 1, \ldots, n, n+1, \ldots, N\}$ and
areal averages of the survey data. Individual house prices extracted from the subset
of the house survey, denoted by an $(n \times 1)$ vector $\mathbf{z}_u = [z(\mathbf{u}_i), i = 1, \ldots, n]^T$, are the
primary source for the proposed spatial hedonic model, whereas areal averages of
house values over $K$ areas, denoted by $\mathbf{z}_s = [z(s_k), k = 1, \ldots, K]^T$, complement
such partially available individual data. Here we consider the $k$th areal datum $z(s_k)$
to be the average price of all the houses falling within the boundary of the $k$th region $s_k$,
which is defined as $z(s_k) = \frac{1}{n_k} \sum_{j=1}^{n_k} z(\mathbf{u}_j),\ \mathbf{u}_j \in s_k$, where $n_k$ denotes the total number
of individual houses located within region $s_k$.
The idea underlying the proposed hedonic model is that not only the information
on the individual neighboring house values but also the overall appreciation of
house values nested within the neighborhoods, typically defined over a larger spatial
extent, may be relevant information to explain the spatial variation of house prices
unexplained by the drift. The proposed hedonic model is a unique application of
area-to-point Kriging (Journel and Huijbregts 1978; Gotway and Young 2002;
Kyriakidis 2004; Goovaerts 2005; Yoo and Kyriakidis 2008) to housing price
prediction, as well as a methodological improvement by extending area-to-point
Kriging to incorporate covariates for the drift estimation.
The inclusion of overall house values of neighborhoods is particularly useful in a
situation where access to individuals is limited. Given the $n$ point sample data $\mathbf{z}_u$
and the $K$ areal data $\mathbf{z}_s$, the A2PKED prediction $\hat{z}(\mathbf{u}_p)$ of an unknown property value
at location $\mathbf{u}_p$ is obtained as:

$$\hat{z}(\mathbf{u}_p) = \boldsymbol{\eta}_p^T \mathbf{z}_u + \boldsymbol{\lambda}_p^T \mathbf{z}_s = \sum_{i=1}^{n} \eta_p(\mathbf{u}_i)\, z(\mathbf{u}_i) + \sum_{k=1}^{K} \lambda_p(s_k)\, z(s_k) \tag{2}$$

where $\boldsymbol{\eta}_p = [\eta_p(\mathbf{u}_i), i = 1, \ldots, n]^T$ denotes an $(n \times 1)$ vector of A2PKED weights
associated with the house prices at the $n$ locations where individual records are
available, and $\boldsymbol{\lambda}_p = [\lambda_p(s_k), k = 1, \ldots, K]^T$ denotes a $(K \times 1)$ vector of A2PKED
weights assigned to the $K$ area-averaged house prices.
For any prediction location $\mathbf{u}_p$, the A2PKED weights, $\boldsymbol{\eta}_p$ and $\boldsymbol{\lambda}_p$, are determined
from the requirements of unbiasedness and minimum prediction error variance.
More precisely, the predictor RV $\hat{Z}(\mathbf{u}_p)$ should be unbiased, i.e.,
$E\{\hat{Z}(\mathbf{u}_p) - Z(\mathbf{u}_p)\} = 0$, which leads to the following conditions:

$$\sum_{i=1}^{n} \eta_p(\mathbf{u}_i)\, f_m(\mathbf{u}_i) + \sum_{k=1}^{K} \lambda_p(s_k)\, f_m(s_k) = f_m(\mathbf{u}_p), \quad \text{for } m = 0, \ldots, M \tag{3}$$

where $f_m(\mathbf{u}_i)$ denotes the $(m+1)$-th predictor value known at the sample location
$\mathbf{u}_i$, and $f_m(s_k)$ is the area-average of predictor values associated with the areal
support $s_k$. The A2PKED prediction of Eq. 2 is obtained by minimizing the prediction
error variance $\mathrm{Var}\{\hat{Z}(\mathbf{u}_p) - Z(\mathbf{u}_p)\}$ subject to the unbiasedness constraints
of Eq. 3. This is accomplished by introducing $(M+1)$ Lagrange multipliers,
denoted as $\boldsymbol{\xi}_p = [\xi_m(\mathbf{u}_p), m = 0, \ldots, M]^T$, and the resulting A2PKED system is
obtained as:
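$$\begin{bmatrix} \boldsymbol{\Sigma}^R_{uu} & \boldsymbol{\Sigma}^R_{us} & \mathbf{F}_u \\ \boldsymbol{\Sigma}^R_{su} & \boldsymbol{\Sigma}^R_{ss} & \mathbf{F}_s \\ \mathbf{F}_u^T & \mathbf{F}_s^T & \mathbf{0} \end{bmatrix} \begin{bmatrix} \boldsymbol{\eta}_p \\ \boldsymbol{\lambda}_p \\ \boldsymbol{\xi}_p \end{bmatrix} = \begin{bmatrix} \boldsymbol{\sigma}^u_p \\ \boldsymbol{\sigma}^s_p \\ \mathbf{f}^u_p \end{bmatrix} \tag{4}$$

with

$$\mathbf{F}_u = \begin{bmatrix} f_0(\mathbf{u}_1) & \cdots & f_M(\mathbf{u}_1) \\ \vdots & \ddots & \vdots \\ f_0(\mathbf{u}_n) & \cdots & f_M(\mathbf{u}_n) \end{bmatrix}, \quad \mathbf{F}_s = \begin{bmatrix} f_0(s_1) & \cdots & f_M(s_1) \\ \vdots & \ddots & \vdots \\ f_0(s_K) & \cdots & f_M(s_K) \end{bmatrix}, \quad \boldsymbol{\xi}_p = \begin{bmatrix} \xi_0(\mathbf{u}_p) \\ \vdots \\ \xi_M(\mathbf{u}_p) \end{bmatrix}, \quad \mathbf{f}^u_p = \begin{bmatrix} f_0(\mathbf{u}_p) \\ \vdots \\ f_M(\mathbf{u}_p) \end{bmatrix} \tag{5}$$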
where $\boldsymbol{\Sigma}^R_{uu} = [\sigma_R(\mathbf{u}_i, \mathbf{u}_{i'}), i, i' = 1, \ldots, n]$ denotes an $(n \times n)$ matrix of covariance
values among any two individual residual RVs $R(\mathbf{u}_i)$ and $R(\mathbf{u}_{i'})$, and
$\boldsymbol{\Sigma}^R_{us} = [\sigma_R(\mathbf{u}_i, s_k), i = 1, \ldots, n, k = 1, \ldots, K]$ denotes an $(n \times K)$ matrix of (cross) covariance
values between any pair of individual residual RV $R(\mathbf{u}_i)$ and areal average
residual RV $R(s_k)$. In words, the spatial variation in apartment transaction prices is
modeled across different spatial scales, i.e., at the individual house level and the
submarket level, and their interactions are captured by the term $\boldsymbol{\Sigma}^R_{us}$. The interaction
between areal supports is captured by a $(K \times K)$ matrix of areal data-to-data
residual covariance values $\boldsymbol{\Sigma}^R_{ss} = [\sigma_R(s_k, s_{k'}), k, k' = 1, \ldots, K]$. $\mathbf{F}_u$ is an $(n \times (M+1))$
design matrix with $[f_m(\mathbf{u}_i), i = 1, \ldots, n, m = 0, \ldots, M]$, and $\mathbf{F}_s$ is a $(K \times (M+1))$
matrix $[f_m(s_k), k = 1, \ldots, K, m = 0, \ldots, M]$ of areal average covariates associated
with $K$ areal supports. The $((M+1) \times (M+1))$ matrix of zeros is denoted by $\mathbf{0}$.
The $(n \times 1)$ vector of covariance values between $n$ individual residual RVs and the
individual residual RV $R(\mathbf{u}_p)$ at the prediction location is denoted by $\boldsymbol{\sigma}^u_p$, and
the $(K \times 1)$ vector of covariance values between the individual residual RV $R(\mathbf{u}_p)$ at
the prediction location and the areal residual RVs is denoted by $\boldsymbol{\sigma}^s_p$. The $((M+1) \times 1)$
vector $\mathbf{f}^u_p = [f_m(\mathbf{u}_p), m = 0, \ldots, M]^T$ denotes covariate values known at the
prediction location $\mathbf{u}_p$.
The A2PKED system in Eq. 4 has a unique solution only if the $((n+K) \times (n+K))$
matrix consisting of $\boldsymbol{\Sigma}^R_{uu}$, $\boldsymbol{\Sigma}^R_{us}$, $\boldsymbol{\Sigma}^R_{su}$ and $\boldsymbol{\Sigma}^R_{ss}$ is non-singular, i.e., symmetric
and strictly positive definite, and the design matrix $[\mathbf{F}_u^T\ \mathbf{F}_s^T]^T$ is of full column rank
$((M+1) < (n+K))$. The latter condition implies that the covariates used in
A2PKED must not be co-linear.
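
To make the structure of the system in Eq. 4 concrete, the following is a minimal sketch on a toy 4 × 4 grid split into four areas. It is not the authors' code: the exponential covariance, the single made-up covariate, the data values, and the approximation of area-level covariances by averaging point covariances over the grid nodes in each area are all assumptions of this example.

```python
import numpy as np

def exp_cov(a, b, sill=1.0, range_=3.0):
    """Assumed exponential residual covariance between two sets of points."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return sill * np.exp(-3.0 * d / range_)

# Discretization: a 4 x 4 grid of nodes split into four 2 x 2 areas.
gx, gy = np.meshgrid(np.arange(4.0), np.arange(4.0), indexing="ij")
nodes = np.column_stack([gx.ravel(), gy.ravel()])             # (16, 2)
area_of = (nodes[:, 0] // 2 * 2 + nodes[:, 1] // 2).astype(int)
# H[k, j] = 1/n_k if node j lies in area k, so H @ v averages v per area.
H = np.stack([(area_of == k) / np.sum(area_of == k) for k in range(4)])

def drift(xy):
    """Design matrix with f_0 = 1 and one made-up covariate f_1 = x."""
    return np.column_stack([np.ones(len(xy)), xy[:, 0]])

obs = np.array([0, 6, 15])                 # node indices carrying point data
C = exp_cov(nodes, nodes)                  # node-to-node residual covariance
S_uu = C[np.ix_(obs, obs)]                 # point-to-point block
S_us = C[obs] @ H.T                        # point-to-area block
S_ss = H @ C @ H.T                         # area-to-area block
F_u, F_s = drift(nodes[obs]), H @ drift(nodes)

n, K, M1 = len(obs), 4, 2
A = np.zeros((n + K + M1, n + K + M1))     # left-hand side of Eq. 4
A[:n, :n], A[:n, n:n + K] = S_uu, S_us
A[n:n + K, :n], A[n:n + K, n:n + K] = S_us.T, S_ss
A[:n + K, n + K:] = np.vstack([F_u, F_s])
A[n + K:, :n + K] = np.vstack([F_u, F_s]).T

# Right-hand sides for all 16 prediction nodes at once (one column per node).
B = np.vstack([C[obs], H @ C, drift(nodes).T])
W = np.linalg.solve(A, B)[:n + K]          # weights; Lagrange rows dropped

z_u = np.array([10.0, 12.0, 9.0])          # made-up point prices
z_s = np.array([10.5, 11.0, 11.5, 9.5])    # made-up area-averaged prices
z_hat = W.T @ np.concatenate([z_u, z_s])   # Eq. 2 at every node
```

In this sketch, averaging `z_hat` over the nodes of each area (`H @ z_hat`) returns `z_s`, and the predictions at the observed nodes return `z_u`. This is the coherence property discussed in the text, and it holds here because the area-level covariances and covariates are built from the same discretization used for prediction.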
In fact, the A2PKED prediction in Eq. 2 can be written in the context of a linear
regression model which has a non-stationary error structure. This non-stationary error
term becomes stationary only after a spatially varying mean component is taken
into account. That is, the drift is estimated by GLS by incorporating relevant covariates
and the resulting stationary residuals are used in Kriging prediction. This alternative
form, which is presented in Appendix 1, derives the housing price prediction at the
target location $\mathbf{u}_p$ as the sum of a drift $\hat{\mu}(\mathbf{u}_p)$ and a residual $\hat{r}(\mathbf{u}_p)$ prediction obtained
via GLS and area-to-point Simple Kriging (A2PSK), respectively (Chiles and
Delfiner 1999). Note here that, after the drift is taken into account, A2PKED
becomes equivalent to coKriging with areal data residuals being treated as auxiliary
information in addition to point data residuals. This representation implies that the
prediction $\hat{\mu}(\mathbf{u}_p)$ of the unknown drift value is a linear combination of covariate
values $\mathbf{f}^u_p$ known at the prediction location $\mathbf{u}_p$ weighted by spatially varying GLS
regression coefficient estimates $\hat{\boldsymbol{\beta}} = [\hat{\beta}_m, m = 0, \ldots, M]^T$. In A2PKED the drift
coefficients are implicitly estimated, and are equivalent to GLS regression
coefficients (Chiles and Delfiner 1999; Kyriakidis and Goodchild 2006). Note that
this equivalence between the GLS estimator and the implicitly estimated A2PKED
drift coefficients is guaranteed only if the same spatial information is employed in
both cases; see Appendix 2 for a special case when only one areal datum to which
the prediction location belongs is considered for spatial prediction.
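
The equivalence between the one-step kriging system and the two-step "GLS drift plus simple kriging of residuals" form can be checked numerically. The sketch below uses point data only to stay short; the same algebra applies when point and areal data are stacked into one data vector. The covariance model, the covariate, and all values are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
xy = rng.uniform(0.0, 5.0, size=(8, 2))            # made-up data locations
xp = np.array([[2.5, 2.5]])                        # made-up prediction location

def exp_cov(a, b, sill=1.0, range_=3.0):
    """Assumed exponential residual covariance."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return sill * np.exp(-3.0 * d / range_)

def drift(p):
    """Design matrix with an intercept and one made-up covariate (x coordinate)."""
    return np.column_stack([np.ones(len(p)), p[:, 0]])

Sigma, sig_p = exp_cov(xy, xy), exp_cov(xy, xp)[:, 0]
F, f_p = drift(xy), drift(xp)[0]
z = rng.normal(loc=10.0, size=8)                   # made-up prices

# One step: solve the kriging-with-external-drift system for the weights.
n, m = F.shape
A = np.block([[Sigma, F], [F.T, np.zeros((m, m))]])
w = np.linalg.solve(A, np.concatenate([sig_p, f_p]))[:n]
pred_one_step = w @ z

# Two steps: GLS drift estimate plus simple kriging of the GLS residuals.
Si_F, Si_z = np.linalg.solve(Sigma, F), np.linalg.solve(Sigma, z)
beta = np.linalg.solve(F.T @ Si_F, F.T @ Si_z)
resid_sk = sig_p @ np.linalg.solve(Sigma, z - F @ beta)
pred_two_step = f_p @ beta + resid_sk
```

Both routes give the same prediction up to floating-point error, provided the same covariance matrix and the same data enter both computations.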
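The A2PKED prediction error variance $\hat{\sigma}^2(\mathbf{u}_p)$ at location $\mathbf{u}_p$ can be written as:

$$\begin{aligned}
\hat{\sigma}^2(\mathbf{u}_p) &= \hat{\sigma}^2_{SK}(\mathbf{u}_p) + \hat{\sigma}^2_{D}(\mathbf{u}_p) \\
&= \sigma^2_R(0) - \left[ \boldsymbol{\eta}_p^T \boldsymbol{\sigma}^u_p + \boldsymbol{\lambda}_p^T \boldsymbol{\sigma}^s_p \right] - \boldsymbol{\xi}_p^T \mathbf{f}^u_p \\
&= \sigma^2_R(0) - \left[ \sum_{i=1}^{n} \eta_p(\mathbf{u}_i)\, \sigma_R(\mathbf{u}_i, \mathbf{u}_p) + \sum_{k=1}^{K} \lambda_p(s_k)\, \sigma_R(s_k, \mathbf{u}_p) \right] - \sum_{m=0}^{M} \xi_m(\mathbf{u}_p)\, f_m(\mathbf{u}_p)
\end{aligned} \tag{6}$$

where $\sigma^2_R(0)$ denotes the residual variance, which per stationarity of the residuals is
the same for all prediction locations.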


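Equation 6 entails that the A2PKED prediction variance is decomposed into the
A2PSK prediction variance $\hat{\sigma}^2_{SK}(\mathbf{u}_p)$ for the unknown residual $r(\mathbf{u}_p)$, and the drift
prediction variance $\hat{\sigma}^2_{D}(\mathbf{u}_p)$ for the unknown drift $\mu(\mathbf{u}_p)$ at $\mathbf{u}_p$. By construction, the
A2PSK prediction variance $\hat{\sigma}^2_{SK}(\mathbf{u}_p)$ is independent of any data values, and is only a
function of the residual spatial correlation and the distance between the prediction and
data locations. The drift prediction variance $\hat{\sigma}^2_{D}(\mathbf{u}_p)$ is a function of covariate values $\mathbf{f}^u_p$
known at that location $\mathbf{u}_p$ penalized by a vector of Lagrange multipliers
$\boldsymbol{\xi}_p = [\xi_m(\mathbf{u}_p), m = 0, \ldots, M]^T$. This term may look similar to the GLS prediction variance
of the mean component, i.e., $\mathrm{Var}\{\hat{\mu}(\mathbf{u}_p)\} = (\mathbf{f}^u_p)^T (\mathbf{F}^T \boldsymbol{\Sigma}_R^{-1} \mathbf{F})^{-1} \mathbf{f}^u_p$ (see Eq. 11 in
Appendix 1 for $\mathbf{F}$, $\boldsymbol{\Sigma}_R$), but the A2PKED prediction variance of the mean component
is adjusted so that it becomes zero at sample locations, i.e.,
$\hat{\sigma}^2_{D}(\mathbf{u}_p) = \mathrm{Var}\{\hat{\mu}(\mathbf{u}_p)\} = [\mathbf{f}^u_p - \mathbf{F}_u^T \boldsymbol{\eta}_p - \mathbf{F}_s^T \boldsymbol{\lambda}_p]^T (\mathbf{F}^T \boldsymbol{\Sigma}_R^{-1} \mathbf{F})^{-1} [\mathbf{f}^u_p - \mathbf{F}_u^T \boldsymbol{\eta}_p - \mathbf{F}_s^T \boldsymbol{\lambda}_p]$ (Ripley 1981, p. 49).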
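
The prediction error variance of Eq. 6 can be evaluated directly from the solution of the kriging system. The sketch below again uses point data only for brevity (the areal terms enter in the same way once data are stacked); the exponential covariance with unit sill, the covariate, and the locations are assumptions made for illustration.

```python
import numpy as np

def ked_variance(Sigma, F, sig_p, f_p, sill):
    """Return (weights, variance), with the variance written as in Eq. 6."""
    n, m = F.shape
    A = np.block([[Sigma, F], [F.T, np.zeros((m, m))]])
    sol = np.linalg.solve(A, np.concatenate([sig_p, f_p]))
    w, xi = sol[:n], sol[n:]                 # data weights and Lagrange multipliers
    return w, sill - w @ sig_p - xi @ f_p

rng = np.random.default_rng(2)
xy = rng.uniform(0.0, 5.0, size=(6, 2))              # made-up data locations
d = np.linalg.norm(xy[:, None] - xy[None, :], axis=-1)
Sigma = np.exp(-3.0 * d / 3.0)                       # assumed exponential model, sill 1
F = np.column_stack([np.ones(6), xy[:, 0]])

# Prediction location chosen as the first data location.
w0, var_at_datum = ked_variance(Sigma, F, Sigma[:, 0], F[0], sill=1.0)

# Prediction at a new location, compared with the direct quadratic form.
xp = np.array([2.0, 2.0])
sig_p = np.exp(-3.0 * np.linalg.norm(xy - xp, axis=1) / 3.0)
f_p = np.array([1.0, xp[0]])
w, var_new = ked_variance(Sigma, F, sig_p, f_p, sill=1.0)
var_direct = 1.0 - 2.0 * w @ sig_p + w @ Sigma @ w
```

The variance vanishes (up to rounding) at a data location and depends only on the data configuration and covariance model, not on the observed prices, consistent with the discussion above.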

4 Results and discussion

Along with the observed apartment data, A2PKED prediction calls for the
underlying variogram model of the residual point RF. Note that the common
practice in geostatistics consists of inferring and modeling the residual variogram
$\gamma_R(\mathbf{h})$ instead of the covariance function $\sigma_R(\mathbf{h})$ to populate the entries of $\boldsymbol{\Sigma}_R$
(Goovaerts 1997). As shown in Eq. 1, the non-stationary RV $Z(\mathbf{u})$ at location $\mathbf{u}$
consists of a deterministic drift $\mu(\mathbf{u})$ and a zero-mean stationary residual component
$R(\mathbf{u})$ whose spatial correlation must be inferred from the sample data. In practice,
the drift in the sample data masks the underlying residual variogram and is most
likely to cause the sample residual variogram $\hat{\gamma}_R(\mathbf{h})$ to be biased. The underlying
residual variogram $\gamma_R(\mathbf{h})$ should therefore be inferred after subtracting the drift,
although the drift is unknown in most practical situations.
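
For a second-order stationary residual, the variogram and covariance are linked by $\gamma_R(\mathbf{h}) = \sigma_R(0) - \sigma_R(\mathbf{h})$, which is how a fitted variogram model populates a covariance matrix. The sketch below illustrates this with a nested model made of a nugget and two exponential structures; the choice of exponential structures and all parameter values are assumptions, not the model fitted in the paper.

```python
import numpy as np

def nested_variogram(h, nugget, sills, ranges):
    """Nugget plus a sum of exponential structures with practical ranges."""
    h = np.asarray(h, dtype=float)
    gamma = np.where(h > 0.0, nugget, 0.0)          # nugget applies only for h > 0
    for c, a in zip(sills, ranges):
        gamma = gamma + c * (1.0 - np.exp(-3.0 * h / a))
    return gamma

def covariance_from_variogram(h, nugget, sills, ranges):
    """sigma_R(h) = sigma_R(0) - gamma_R(h) for a stationary residual."""
    total_sill = nugget + sum(sills)
    return total_sill - nested_variogram(h, nugget, sills, ranges)

h = np.array([0.0, 500.0, 5000.0])                  # lag distances in meters (made up)
params = dict(nugget=0.1, sills=[0.5, 0.4], ranges=[1000.0, 10000.0])
gam = nested_variogram(h, **params)
cov = covariance_from_variogram(h, **params)
```

The short-range structure stands in for locally confined variation and the long-range structure for city-wide variation; the split of the total sill between them is purely illustrative.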
In the case study, we assume that the functional form of the underlying variogram
model is known, but the model parameters need to be inferred from the original data
(648 apartment sales prices). Strictly speaking, the underlying variogram model
should be inferred from the 300 training samples to mimic practical applications,
but we pooled all data (training and validation) together and inferred a common
spatial dependence model from them. This approximation is made mainly to avoid
repeating the variogram inference for each randomly selected set of training and
validation data; it does not unduly affect the inference results, since the
variograms estimated from the 300 training samples are fairly similar to those
inferred from the 648 pooled data. To infer the variogram from the complete data,
we adopted an approach widely used in statistics, i.e., the Gaussian Maximum
Likelihood (GML) method (Chiles and Delfiner 1999). In what follows, we compare
the prediction performance of four hedonic models: two standard hedonic models
whose coefficients are estimated via OLS and GLS, and two geostatistical models
with (A2PKED) and without (KED) areal data. The comparison is made at two spatial
scales: at the individual apartment complex level, to investigate prediction ability
by comparing predicted values with known true values at the 348 validation
locations, and at the areal level, to evaluate the coherence property by examining
whether the areal averages of the predicted apartment transaction prices reproduce
the original summary statistics (apartment values) aggregated over 88 dongs.

4.1 Inference of underlying variogram model

In the use of the GML method, it is assumed that the observed housing data consist
of a location-dependent mean $\mu(u)$ and residuals which follow a multivariate
Gaussian distribution. The mean $\mu(u)$ is unknown, but we further assume that its
functional form is fixed with a set of predictors: the average floor area of units (or
flats) $f_1(u)$, the age of buildings $f_2(u)$, the total number of units $f_3(u)$, the scheduled
reconstruction $f_4(u)$ of the apartment complex, and belonging or not to the preferred
school district $f_5(u)$, i.e.,
$\mu(u) = \beta_0 + \beta_1 f_1(u) + \beta_2 f_2(u) + \beta_3 f_3(u) + \beta_4 f_4(u) + \beta_5 f_5(u)$.
For a variogram model $\gamma_R(h)$ underlying the observed apartment price
data, we propose a two-component model with a nugget and exponential function
as:

$$\gamma_R(|h|) = b_0 \left[1 - \delta(|h|)\right] + b_1 \left[1 - \exp\left(-\frac{3|h|}{\rho}\right)\right] \qquad (7)$$

where $b_0$ denotes a nugget variance, and $\delta(|h|) = 1$ if $|h| = 0$, and 0 otherwise; here, $|h|$
denotes the norm of the vector $h$. The parameter $b_1$ denotes the partial sill of an
exponential function with an effective range $\rho$. In summary, assuming that the
predictors and the residual variogram form (i.e., number and type of nested
structures) are known and depend only on two sets of parameters,
$\boldsymbol{\beta} = [\beta_0\ \beta_1\ \beta_2\ \beta_3\ \beta_4\ \beta_5]^T$ and $\mathbf{b} = [b_0\ b_1\ \rho]^T$, the optimal parameters are
estimated by maximizing the likelihood function.
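The nugget-plus-exponential model of Eq. 7 can be written as a short function. This is only a sketch: the name `variogram_model` and its argument names are illustrative choices, not taken from the original study.

```python
import numpy as np

def variogram_model(h, nugget, partial_sill, effective_range):
    """Nugget plus exponential semivariogram; zero at |h| = 0 by definition."""
    h = np.abs(np.asarray(h, dtype=float))
    # Exponential structure with the "3|h| / range" convention of Eq. 7.
    structured = partial_sill * (1.0 - np.exp(-3.0 * h / effective_range))
    # The nugget applies only at strictly positive lags.
    return np.where(h > 0, nugget + structured, 0.0)
```

Under this convention the exponential structure reaches about 95% of its partial sill at the effective range, since $1 - e^{-3} \approx 0.95$.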
Initially, coefficients of the drift model are estimated via OLS and are shown in
Table 2. At the 0.05 level of significance, all five predictors show a statistically
significant linear relation with housing prices. From the table of estimated
regression coefficients, we can infer that the mean house price increases with each
unit increase in the average flat size or building age, but by much less than it does
with the dummy variables, such as the existence of a scheduled reconstruction plan.
This interpretation agrees with the current housing market of the GangNam area,
where the apartment reconstruction market is booming and reconstruction is
considered a means of increasing the value of old apartments. The scheduled
reconstruction of an apartment complex allows the owners to replace the low-rise,
small apartments constructed by the public sector in the 1960s with larger
high-rise ones.

Table 2 OLS regression results regarding the drift model

Regression coefficient   Estimated value   Estimated SD   t^a
$\beta_0$                540.05            56.16           9.62
$\beta_1$                  8.80             1.32           6.66
$\beta_2$                 14.73             2.07           7.12
$\beta_3$                  0.40             0.03          13.42
$\beta_4$                662.62           106.72           6.20
$\beta_5$                584.58            35.72          16.37

^a Denotes the values of the t statistics for coefficient estimates at the level of significance 0.05
The other unique characteristic of the study area is associated with its prestigious
high schools, in what is called the 8th school district. This area gained its reputation
for educational opportunity in the 1970s, after the government required that
students enroll in high schools in their residential area. The 8th school district
gained its popularity because the majority of its students entered the nation's top
universities. This strong desire for better education is based on the traditional notion
that education is the best asset for professional success and economic well-being.
Once these regression coefficients were estimated via OLS, a set of initial
variogram model parameters $\hat{\mathbf{b}} = [0.4\ \ 0.6\ \ 2000]^T$ was visually inferred from
the experimental variogram of the OLS-derived residuals. Then, these parameters
were used as the initial guess for the iterative GML fitting. Figure 3a shows the
initial variogram model of the OLS-derived residuals (solid line) overlaid on
the empirical residual variogram (dotted line) and the variogram model $\gamma_R(|h|)$ of
the GML-derived residuals (solid line with squares) estimated after 120 iterations
as:

$$\gamma_R(|h|) = 0.5 \left[1 - \delta(|h|)\right] + 0.5 \left[1 - \exp\left(-\frac{3|h|}{2{,}300}\right)\right] \qquad (8)$$
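To make the fitting step more concrete, the sketch below evaluates a Gaussian negative log-likelihood for the nugget-plus-exponential model, with the drift coefficients profiled out by GLS. It is an assumption-laden illustration rather than the procedure used in the study: the function name `neg_log_likelihood`, the profiling of the drift coefficients, and the use of a Cholesky factorization are all choices made here for brevity.

```python
import numpy as np

def neg_log_likelihood(params, coords, F, z):
    """Gaussian negative log-likelihood with drift coefficients profiled out (sketch)."""
    nugget, partial_sill, effective_range = params
    coords = np.asarray(coords, dtype=float)
    F = np.asarray(F, dtype=float)
    z = np.asarray(z, dtype=float)
    n = len(z)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    # Residual covariance implied by the nugget + exponential model.
    cov = partial_sill * np.exp(-3.0 * dist / effective_range) + nugget * np.eye(n)
    chol = np.linalg.cholesky(cov)
    # Whitening by the Cholesky factor turns GLS into ordinary least squares.
    Fw = np.linalg.solve(chol, F)
    zw = np.linalg.solve(chol, z)
    beta, *_ = np.linalg.lstsq(Fw, zw, rcond=None)
    resid = zw - Fw @ beta
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))
    return 0.5 * (n * np.log(2.0 * np.pi) + log_det + resid @ resid)
```

A function of this kind could be handed to a general-purpose optimizer such as `scipy.optimize.minimize`, started from the visually inferred values (0.4, 0.6, 2000) and constrained to positive parameters; whether this reproduces the iterative scheme of the study is not claimed.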
There is theoretical and empirical evidence (Neuman and Jacobson 1984)
showing that variogram estimators based on iterative approaches are biased, but
such a bias associated with residual-based variogram estimators is small at lags near
the origin and becomes negligible asymptotically as the size of sample data
increases (Cressie 1993, p. 167). In the current study, the residual variogram model
inferred from the 648 apartment transaction data is considered reliable (unbiased)
only for short distances, i.e., up to $\rho = 2{,}300$ m, compared to the full extent, i.e.,
180,000 m, of the study area.
The distribution of GML-derived residuals is shown in the histogram of Fig. 3b,
which is symmetric with a mean of 0 and an SD of 438. Compared with the distribution
of original housing prices/pyong (Fig. 2a), the distribution of GML-derived
residuals has a smaller but still substantial standard deviation. The explicit
consideration of spatial variation in GML-derived residuals makes the spatial
hedonic models different from traditional covariate-driven hedonic models. Last, we
investigate the homogeneity of the estimated apartment price residuals, i.e., lack of
trends, by plotting such residuals against longitude and latitude in the scatter plots of
Fig. 3c and d, respectively. Neither scatter plot reveals any systematic increase
or decrease of apartment price residuals, either from east to west or from south to
north.
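The visual trend check described above could be complemented by a simple numerical summary. The following sketch, with the hypothetical helper `linear_trend`, reports the least-squares slope and the Pearson correlation of the residuals against one coordinate; values near zero for both longitude and latitude would be in line with the absence of a trend.

```python
import numpy as np

def linear_trend(coordinate, residuals):
    """Least-squares slope and Pearson correlation of residuals against one coordinate."""
    coordinate = np.asarray(coordinate, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    # Degree-1 polynomial fit returns (slope, intercept).
    slope, _intercept = np.polyfit(coordinate, residuals, deg=1)
    corr = np.corrcoef(coordinate, residuals)[0, 1]
    return slope, corr
```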

Fig. 3 a Residual variograms with lag distance 200 m; dashed and solid lines denote, respectively,
empirical and theoretical variograms of the OLS-derived residuals. OLS-derived model parameters are
used as the initial guess in the iterations of the GML fitting procedure. The solid line with squares depicts
the variogram model of the GML-derived residuals. b Histogram and summary statistics of GML-
derived residuals. c, d GML-derived residuals plotted against longitude and latitude,
respectively; no significant trend is evident within the study area

4.2 Comparison of hedonic modeling approaches

In this section, we compare four different hedonic modeling approaches, i.e., two
standard regression-derived hedonic models and two geostatistical models, in terms
of the following two criteria: accuracy and coherence. The main difference between
regression-derived hedonic models and geostatistical models is in the assumptions
the models adopt to describe the spatial variation of housing prices: standard
hedonic models assume that the spatial variability in house values is solely
determined by a global drift (first-order effects), whereas geostatistical approaches
view the house values as a combination of a drift and a spatial variability stemming
from the residuals (second-order effects); see Eq. 1. One of the standard hedonic
models considered in this paper, i.e., the GLS-based hedonic model, shares the same
spatial error structure model $\gamma_R(|h|)$ in Eq. 8 with the two geostatistical models, i.e.,
point-support KED and A2PKED. We chose the GLS estimator to explore whether
such a consideration of spatially correlated error structure affects the standard
hedonic model coefficient estimates and model predictions. In fact, the GLS
regression coefficient estimators coincide with the drift coefficients, implicitly
estimated in point-support KED, as long as the same spatial correlation structure is
considered. The corresponding model predictions, however, are different due to the
residual component added in point-support KED predictions. Point-support KED
prediction is conducted in this study only to provide a yardstick against which one
can evaluate the quality of the proposed A2PKED approach relative to the
conventional geostatistical approach.
Before we evaluate the performance of the four models, it should be noted that
the prediction results and the model parameter estimates reported here depend
strongly on the number of training samples, their spatial configuration, as well as on
the validation data, and the quality of dong-averaged information. To control for
any bias stemming from the random division of data, we repeat the performance
evaluation 100 times, and assess model predictive ability
each time. The selection of 300 training data, a number slightly smaller than half of
the total number of observations but large enough to be considered as an unbiased
subset of the original data, is made to mimic practical situations where individual
apartment unit transaction data are not always available whereas average summary
statistics are more generally accessible. In other words, the case study is designed to
demonstrate how much prediction ability can be improved by utilizing additional
conditioning information, i.e., summary statistics over a large spatial extent, in a
data-poor environment.
For prediction of unknown house values, traditional hedonic models call for
model parameter estimation prior to prediction; such parameters are implicitly
estimated in the prediction stage in the spatial hedonic models. The parameters of
standard hedonic regression models are estimated from a newly selected training
sample via OLS and GLS repeatedly, and box plots depicting the resulting estimates
of OLS and GLS regression coefficients are given in Fig. 4.
The model parameters estimated via OLS and GLS are broadly similar, but the
consideration of spatial autocorrelation in housing price data yields smaller
estimates of the regression coefficients except $\beta_4$, i.e., the coefficient for
the scheduled reconstruction plan of an apartment complex in Fig. 4d. Under GLS, for
example, the influence of the age of buildings and of inclusion in a prestigious
school district on the apartment price decreases, with the median value of
$\hat{\beta}_2$ falling from 14.67 (OLS) to 10.08 (GLS) and that of $\hat{\beta}_5$ from 589.13 to 508.28,
whereas the coefficient of the reconstruction plan increases slightly, with the
median value of $\hat{\beta}_4$ rising from 643.27 to 720.09. The differences in the parameter
estimates between OLS and GLS can be explained in the context of the spatial
configuration and the level of spatial autocorrelation of sample data. In GLS, an
observation located in a dense residential area where unit prices are similar
contributes less to the regression model, whereas all data are treated equally in
OLS. These smaller GLS estimates of the regression coefficients for site attributes
agree with the point estimates of model parameters obtained from the total of 648
data, which implies that the resulting differences are not necessarily due to the
random selection of training data. Along with model coefficients, we also compared
the variation in model fit that the different estimation methods yield per iteration.
As expected, Fig. 4f shows that the mean squared error is larger for GLS than for OLS.
In the current study, A2PKED was implemented using only the single dong
(areal) datum to which a prediction location belongs (see Appendix 2), but more
than one areal datum can be included in a general model as shown in Eq. 2. When
more than one areal datum is considered, the data can be selected using various
criteria, including but not limited to proximity based on distance or adjacency. The
current A2PKED system with a single dong datum is based on the consideration that
the dong size is approximately equal to the range of the residual variogram model
$\gamma_R(h)$ underlying the observed apartment price data in Eq. 8.

Fig. 4 Estimated coefficients (a–e) and MSE (f) for different regression models. $\hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3, \hat{\beta}_4, \hat{\beta}_5$ denote
the slopes for the predictors, i.e., floor area of apartment units (flats), age of building, the total number of
units, the existence of a scheduled reconstruction plan, and belonging or not to a primary high school
district. MSE denotes the mean squared error. The boxes have lines at the median (thick line), lower
quartile and upper quartile values (dotted lines). The whiskers are denoted by crosses and show the rest of
data outside of the interquartile range

In addition, it should be noted that some dong data consist of a few apartments or
a single apartment complex with a high transaction price. If such apartments were
included in the validation data, the bias would increase significantly for all
models considered except A2PKED. A2PKED prediction performance
is rarely affected by such a configuration because of the control exerted by the
areal datum on prediction locations falling in that dong. For a more robust
comparison, we exclude dongs with a single discretization location or dongs with all
discretization locations included in the training data set for A2PKED prediction.
The prediction ability of the four hedonic models at the individual apartment
complex level is evaluated using a validation procedure, often called cross-
validation (Kutner et al. 1974). This model validation technique is based on data
splitting, i.e., division into a training sample (or model-building set) and a
validation (or prediction) set. A possible drawback of this method is the loss of
precision with which regression coefficients are estimated (Montgomery et al.
2001). The standard errors of model coefficients obtained from the training sample
are usually larger than they would have been if all the data had been used to estimate
such coefficients. In the application of this method to the current study, model
validation results highly depend on the relative spatial configuration of 300 training
and 348 validation locations. As stated before, in order to focus on model prediction
accuracy, while minimizing any bias stemming from a particular sample selection,
we repeated the model performance assessment 100 times.
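The repeated data-splitting design can be sketched as follows. The generator name `repeated_splits` and the fixed random seed are illustrative assumptions; the sizes (648 records, 300 training, 100 repetitions) follow the text.

```python
import numpy as np

def repeated_splits(n_total=648, n_train=300, n_repeats=100, seed=0):
    """Yield (training, validation) index arrays for repeated random data splitting."""
    rng = np.random.default_rng(seed)
    for _ in range(n_repeats):
        # A fresh random permutation gives disjoint training and validation sets.
        order = rng.permutation(n_total)
        yield order[:n_train], order[n_train:]
```

Each iteration would then refit the models on the training indices and score them on the remaining 348 validation indices.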
Given model parameters (regression coefficients) estimated at each iteration
using the newly selected 300 training data, housing values corresponding to given
levels of predictor variables at 348 validation locations $\{u_p, p = 1, \ldots, 348\}$ are
predicted. We consider the mean squared prediction error (MSPR) as a validation
criterion, computed as:

$$\mathrm{MSPR} = \frac{1}{n_p} \sum_{p=1}^{n_p} \left[z(u_p) - \hat{z}(u_p)\right]^2$$

where $n_p$ denotes the number of locations in the validation data set, here 348.
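The point-level criterion above amounts to a single averaging step; a minimal sketch, with the hypothetical function name `mspr`, is:

```python
import numpy as np

def mspr(observed, predicted):
    """Mean squared prediction error over the validation locations."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean((observed - predicted) ** 2))
```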
The range of the MSPR measures obtained from the four models is presented in
the box plots of Fig. 5a. A2PKED outperforms other models, which indicates that
A2PKED yields the most exact and accurate apartment price predictions at 348
validation locations, independent of the random selection of training and validation
data.
The other geostatistical model, point-support KED, yields much larger MSPR
than A2PKED, although still smaller than the MSPR of regression-based hedonic
models. It is clear that the A2PKED method improves the quality of model
predictions; the only difference between point-support KED and A2PKED is in the
consideration of additional dong-averaged information whose contribution to
prediction varies from one dong to another. The result, however, should be carefully
interpreted since for the particular apartment price data and the discretization scheme
considered in this study, there are several dongs that consist of a small number of
discretization locations. Even after excluding dong data with a single discretization
location or dongs whose discretization points are all included in the training sample,
some dong-averaged data are still informative due to their small number of
discretization locations. In practice, however, area-averaged statistics are typically
derived from many individual data, which consequently reduces their information
content depending on the spatial correlation of such individual attribute values.
On the other hand, the better performance of the point-support KED model over
the two regression-based hedonic models implies that the consideration of spatial
autocorrelation in the housing prices improves the prediction ability of hedonic
models. As discussed in Sect. 3.2, the apartment price prediction at location $u$ is
solely determined by its drift, i.e., $\hat{z}(u) = \hat{\mu}(u)$, in traditional hedonic models,
whereas Kriging predictions are the sum of the drift and the residual component at
the prediction location. The residual component compensates for the apartment
price that was not explained by the drift component.

A desirable, but often overlooked, property in multi-level spatial prediction is
coherence, i.e., the predicted individual apartment transaction prices, when
reaggregated over an areal unit within which prediction locations lie, should
reproduce the observed areal datum. The coherence property of predicted apartment
values is evaluated under the same setting as the cross-validation, i.e., using the
randomly selected 300 training data, but prediction of house values is sought at all
648 original point locations. The predicted 648 apartment prices are reaggregated
over the dong to which they belong, and the coherence property of the model
predictions is evaluated using MSPR, which is calculated as:

$$\mathrm{MSPR} = \frac{1}{n_s} \sum_{k=1}^{n_s} \left[z(s_k) - \hat{z}(s_k)\right]^2$$

where $n_s$ denotes the number of dongs which have at least one observed apartment
complex transaction record.
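The areal coherence check can be sketched by re-aggregating point predictions per dong before computing the squared errors. The function name `areal_mspr` and the dictionary input are assumptions made for this illustration; equal weighting of the point predictions within a dong is also assumed.

```python
import numpy as np

def areal_mspr(dong_ids, predicted_points, observed_areal):
    """MSPR between observed areal data and re-aggregated point predictions.

    observed_areal maps each dong identifier to its observed areal average.
    """
    dong_ids = np.asarray(dong_ids)
    predicted_points = np.asarray(predicted_points, dtype=float)
    squared_errors = []
    for dong, z_areal in observed_areal.items():
        in_dong = dong_ids == dong
        if not in_dong.any():
            continue  # skip dongs without any prediction location
        # Equal-weight average of the point predictions falling in this dong.
        squared_errors.append((z_areal - predicted_points[in_dong].mean()) ** 2)
    return float(np.mean(squared_errors))
```

A value of zero would indicate that the predictions reproduce every areal datum exactly, i.e., perfect coherence.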
Consistent with the previous cross-validation results in Fig. 5a, A2PKED
generally performs best, yielding the most exact and accurate (lowest MSPR) dong-
averaged apartment price predictions, given any spatial configuration of the data
(Fig. 5b). Areal averages of point-support KED predictions yield slightly higher
MSPR than those of A2PKED, although they are still lower than those of traditional
hedonic models. OLS-derived predictions are slightly better than those of GLS; both
regression-based models perform worse than the geostatistical models
with respect to the coherence property. In summary, as expected, no
method other than A2PKED satisfies the coherence property.

Fig. 5 a Boxplots of mean squared prediction errors (MSPR) summarizing prediction performance for
OLS, GLS, point-support KED (P2PKED), and A2PKED at 348 validation locations. b Boxplots of
MSPR summarizing prediction performance for OLS, GLS, P2PKED, and A2PKED at 88 dongs

For illustrative purposes, we consider a particular case where apartment
transaction records are available only at the 300 training data locations shown in
Fig. 2b. Sharing the common variogram model in Eq. 8, four hedonic model
predictions are computed at all 648 point locations. Their coherence property is
examined by scatter plots of the dong-averaged predicted apartment prices versus
the known dong-averaged apartment prices at 88 dongs (see Fig. 6). A straight line
in a scatter plot indicates a perfect coherence for the apartment price predictions.
The A2PKED predictions shown in Fig. 6d nearly satisfy the coherence property, except
for a few scattered points corresponding to dongs with a single discretization
location. At such discretization locations, point-support KED is performed instead
of A2PKED. Consequently, the predicted values do not exactly reproduce the
original data, but they are close to the original values as shown in Fig. 6d. Dong-
averaged point-support KED predictions in Fig. 6c are similar to the original dong
values, but this may be due to the particular configuration of the training and dong-
average data: several dong data (22 dongs out of a total of 88 dongs) in the study
area consist of a single apartment complex, and some of them are included in the 300
training data. Note that the scatter plots in Fig. 6c and d would show a worse
reproduction of dong-average data, had the 22 dongs with a single apartment
complex, i.e., a single discretization location selected as training datum, not been
included in those scatter plots. The dong-averaged OLS- and GLS-derived
predictions in Fig. 6a and b, however, deviate more from the straight line than
those of the two geostatistical hedonic models, although they are similar to each
other. From this result, we may conclude that accounting for spatial correlation does
not significantly improve the quality of regression-based model predictions, at least
for this particular case study and for this particular random selection of training and
validation data.

Fig. 6 Scatter plots of area-averaged predictions for a OLS, b GLS, c point-support KED, and
d A2PKED against original area-averaged prices at 88 dongs

The degree of discrepancy between original dong data and dong-averaged
predictions varies from one model to another and depends on the configuration of
the training data. The selective application of A2PKED only at locations where the
unknown value is not completely informed by an areal datum, allows a comparative
model performance analysis. From the scatter plot in Fig. 6d, we can infer that there
are a few dongs with a single discretization point and they were included in the 348
validation data since the coherence property was not met. As shown in Fig. 6a and
b, however, the bias of estimation is more significant in the traditional hedonic
models, yielding a larger deviation from the straight line.

5 Conclusions

There have been several applications of geostatistics to real estate appraisal where a
typical geostatistical approach, i.e., point-to-point Kriging, is used to incorporate
spatial correlation of housing prices to predict such prices at unsampled locations
(Dubin 1992; Chica-Olmo 1995; Basu and Thibodeau 1998). In geostatistics,
the mean component, i.e., drift, of housing prices is separately modeled from the
residuals using traditional estimation techniques, i.e., OLS and GLS, where the
underlying spatial residual correlation model affects the estimation of the unknown
mean component and vice versa. One of the advantages of such a modeling
approach, i.e., a hedonic regression/Kriging procedure, is that reasonably accurate
predictions can be obtained with housing data, such as the multiple listings data
used by realtors (Dubin 1998). But again, the problem is that such complete housing
data are not always readily available.
The major contribution of the current work in hedonic pricing models is to
provide a methodology to complement such insufficient individual house price data
by incorporating aggregated data, e.g., area-averaged house values. By pooling
information from all data sources, i.e., individual house data as well as those of
submarkets, while accounting for their support differences, the implicit price
estimation obtained from the proposed A2PKED model can be substantially
improved. Also, the proposed A2PKED model takes into account spatial
heteroskedasticity of house values as well as the effects of location characteristics
operating at various spatial scales. For example, the spatial dependence of house
values, corresponding to factors varying at different scales, can be accounted for by
nested variogram models.
In the case study, we illustrate the application of the proposed A2PKED
method using apartment transaction records obtained from Seoul, South Korea,
during 2003. Starting from a simple model whose predictors consist of apartment
average floor area, age of buildings, size of apartment complex (i.e., total number
of units), and two dummy variables indicating if the apartment complex has a
scheduled reconstruction plan and if it is located within special high school
districts, we augment that model to account for spatial residual correlation. The
A2PKED model provides an improvement in house value predictions over any
other model considered in the current study, i.e., OLS- and GLS-based standard
hedonic models, and a point-support KED model. For a robust comparison of
model prediction performance, we divide the original transaction records into
training and validation data sets and assess model performance repeatedly. More
specifically, we evaluate model prediction performance using two criteria:
coherence and accuracy, and conclude that the proposed geostatistical hedonic
model improves the quality of spatial prediction by combining residual variogram
modeling and Kriging. Caution should be exercised, however, in generalizing the results
because the added value of the areal data depends on the quality and quantity of
the observed data. The contribution of the areal data to predictions increases
when the available point sample data are limited or less reliable auxiliary
variables are available. As long as spatial correlation exists in the regression
residuals, however, geostatistical hedonic models are expected to perform better
than hedonic regression models. Apart from the variogram model in Eq. 8,
alternative forms of residual variograms can be considered; it would be
interesting to assess the differences in the prediction performance from different
variogram models.
This work can be expanded in a number of ways. First, these models could be
compared using housing submarkets that are defined over less homogeneous
areas than the one considered in the case study. By extending the current study
area into a larger region, such as all of Seoul, one could evaluate model
performance over such inhomogeneous areas. Also, the proposed method should
be tested over a region where the relation between areal and point data is more
complex. More interestingly, if house pricing data for different dwelling types
across different time periods are available, the current framework could be
extended to account for this information, i.e., a spatio-temporal case, using a
coKriging approach. Last, model performance comparison could involve models
specifically developed for lattice data, which have been well-received in the
hedonic modeling literature.

Acknowledgments P. C. Kyriakidis acknowledges funding provided by the National Geospatial-
Intelligence Agency (NGA) under award HM1582-07-2020.

Appendix 1: A2PKED as a form of generalized linear regression

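Given both $n$ point support data and $K$ area-averaged data, the A2PKED prediction
of the unknown value at a location $u_p$ (see Eq. 2 for the original form) can be
rewritten in the form of a linear regression model with correlated residuals, i.e., a
point prediction of mean response and the residual, as (Chiles and Delfiner 1999):

$$\begin{aligned}
\hat{z}(u_p) &= \hat{\mu}(u_p) + \hat{r}(u_p) \\
&= \hat{\boldsymbol{\beta}}^T \mathbf{f}^u_p + \left[\tilde{\boldsymbol{\eta}}_p^T \mathbf{r}_u + \tilde{\boldsymbol{\lambda}}_p^T \mathbf{r}_s\right] \\
&= \sum_{m=0}^{M} \hat{\beta}_m f_m(u_p) + \left[\sum_{i=1}^{n} \tilde{\eta}_p(u_i)\, r(u_i) + \sum_{k=1}^{K} \tilde{\lambda}_p(s_k)\, r(s_k)\right]
\end{aligned} \qquad (9)$$

where $\tilde{\boldsymbol{\eta}}_p$ and $\tilde{\boldsymbol{\lambda}}_p$ denote, respectively, the area-to-point Simple Kriging (A2PSK)
weights for the $n$ point data residuals and the $K$ areal data residuals.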
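The GLS estimator of regression coefficients $\hat{\boldsymbol{\beta}}$ can be obtained as a function of
the spatial correlation model of the underlying residuals $\boldsymbol{\Sigma}_R$ and the design matrix $\mathbf{F}$
as (Cressie 1993, p. 167):

$$\hat{\boldsymbol{\beta}} = \left(\mathbf{F}^T \boldsymbol{\Sigma}_R^{-1} \mathbf{F}\right)^{-1} \mathbf{F}^T \boldsymbol{\Sigma}_R^{-1} \mathbf{z} \qquad (10)$$

with

$$\mathbf{F} = \begin{bmatrix} \mathbf{F}_u \\ \mathbf{F}_s \end{bmatrix}, \quad
\boldsymbol{\Sigma}_R = \begin{bmatrix} \boldsymbol{\Sigma}^R_{uu} & \boldsymbol{\Sigma}^R_{us} \\ \boldsymbol{\Sigma}^R_{su} & \boldsymbol{\Sigma}^R_{ss} \end{bmatrix}, \quad
\mathbf{z} = \begin{bmatrix} \mathbf{z}_u \\ \mathbf{z}_s \end{bmatrix} \qquad (11)$$

Note that when $\boldsymbol{\Sigma}_R$ is diagonal, with constant entries along its main diagonal, the
drift prediction obtained from Eq. 10 coincides with that of OLS. Again, the unique
solution of the estimate $\hat{\boldsymbol{\beta}}$ requires the $((n + K) \times (n + K))$ matrix of variogram values
of residuals $\boldsymbol{\Sigma}_R$ to be non-singular and the $((n + K) \times (M + 1))$ design matrix $\mathbf{F}$ to
be of full column rank.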
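The GLS estimator of Eq. 10 and its reduction to OLS under a constant diagonal covariance can be checked numerically. The sketch below uses linear solves instead of explicit matrix inverses; the function name `gls_estimator` is a hypothetical choice, and the inputs are assumed to be NumPy arrays.

```python
import numpy as np

def gls_estimator(F, cov, z):
    """GLS coefficients (F' S^-1 F)^-1 F' S^-1 z, computed with linear solves."""
    cov_inv_F = np.linalg.solve(cov, F)   # S^-1 F
    cov_inv_z = np.linalg.solve(cov, z)   # S^-1 z
    return np.linalg.solve(F.T @ cov_inv_F, F.T @ cov_inv_z)
```

Passing `cov = c * np.eye(len(z))` for any positive constant `c` should return the ordinary least-squares coefficients, since the constant cancels in Eq. 10.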
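Once the drift component is predicted as $\hat{\mu}(u_p) = \hat{\boldsymbol{\beta}}^T \mathbf{f}^u_p$ using the GLS estimator
of regression coefficients $\hat{\boldsymbol{\beta}}$, we can predict the unknown residual component $\hat{r}(u_p)$ at
the prediction location $u_p$ using A2PSK. The A2PSK prediction at $u_p$ is a weighted
linear combination of the $n$ point data residuals
$\mathbf{r}_u = [r(u_i) = z(u_i) - \sum_{m=0}^{M} f_m(u_i) \hat{\beta}_m,\ i = 1, \ldots, n]^T$ and the areal data residuals
$\mathbf{r}_s = [r(s_k) = z(s_k) - \sum_{m=0}^{M} f_m(s_k) \hat{\beta}_m,\ k = 1, \ldots, K]^T$ as:

$$\hat{r}(u_p) = \tilde{\boldsymbol{\eta}}_p^T \mathbf{r}_u + \tilde{\boldsymbol{\lambda}}_p^T \mathbf{r}_s = \sum_{i=1}^{n} \tilde{\eta}_p(u_i)\, r(u_i) + \sum_{k=1}^{K} \tilde{\lambda}_p(s_k)\, r(s_k). \qquad (12)$$

Here, the A2PSK weights $\tilde{\boldsymbol{\eta}}_p = [\tilde{\eta}_p(u_i),\ i = 1, \ldots, n]^T$ for the $n$ point data and the
weights $\tilde{\boldsymbol{\lambda}}_p = [\tilde{\lambda}_p(s_k),\ k = 1, \ldots, K]^T$ for the areal data are determined per solution of
an A2PSK system similar to that of Eq. 4 as:

$$\begin{bmatrix} \boldsymbol{\Sigma}^R_{uu} & \boldsymbol{\Sigma}^R_{us} \\ \boldsymbol{\Sigma}^R_{su} & \boldsymbol{\Sigma}^R_{ss} \end{bmatrix}
\begin{bmatrix} \tilde{\boldsymbol{\eta}}_p \\ \tilde{\boldsymbol{\lambda}}_p \end{bmatrix} =
\begin{bmatrix} \boldsymbol{\sigma}^u_p \\ \boldsymbol{\sigma}^s_p \end{bmatrix} \qquad (13)$$

where the covariance model of residuals involved in the above A2PSK system is
assumed to be identical to that used in Eq. 4.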
In summary, A2PKED is equivalent to optimum drift estimation followed by
area-to-point simple Kriging of the residuals from this drift estimate, as if the mean
were known. The A2PKED prediction error variance accounts for the fact that the
drift is actually unknown but estimated.
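The residual step of this two-stage view can be sketched in a few lines. The code below is an illustration under stated assumptions, not the implementation used in the study: residuals are treated as zero-mean with a known nugget-plus-exponential covariance, each area is represented by a set of discretization points, and area-related covariances are approximated by averaging point covariances over those points. The function names `covariance` and `a2psk_predict` and the default parameter values (taken from Eq. 8) are choices made here.

```python
import numpy as np

def covariance(a, b, nugget=0.5, partial_sill=0.5, effective_range=2300.0):
    """Point-to-point residual covariance matching a nugget + exponential variogram."""
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return partial_sill * np.exp(-3.0 * dist / effective_range) + nugget * (dist == 0)

def a2psk_predict(target, data_xy, data_res, areas_xy, area_res):
    """Area-to-point simple Kriging of a zero-mean residual at one target location.

    areas_xy is a list of (n_k, 2) arrays holding the discretization points of each area.
    """
    target = np.atleast_2d(np.asarray(target, dtype=float))
    c_uu = covariance(data_xy, data_xy)
    # Point-to-area covariances: average over each area's discretization points.
    c_us = np.column_stack([covariance(data_xy, a).mean(axis=1) for a in areas_xy])
    # Area-to-area covariances: average over all pairs of discretization points.
    c_ss = np.array([[covariance(a, b).mean() for b in areas_xy] for a in areas_xy])
    lhs = np.block([[c_uu, c_us], [c_us.T, c_ss]])
    rhs = np.concatenate([
        covariance(data_xy, target)[:, 0],
        [covariance(a, target).mean() for a in areas_xy],
    ])
    weights = np.linalg.solve(lhs, rhs)  # stacked point and areal weights
    return weights @ np.concatenate([data_res, area_res])
```

When every discretization point of an area is predicted with the same data and the same covariance model, the average of the predicted residuals equals the areal residual, which is the coherence property discussed in Sect. 4.2. The system becomes singular if an areal datum is fully determined by the point data, e.g., an area whose only discretization point is also a point datum.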

Appendix 2: Estimation of local drift coefficients in A2PKED

Consider the task of predicting the unknown value at location up using n point
data and a subset of the areal data instead of all K such data. This modification
of the original A2PKED system in Eq. 4 may be necessary when dealing with
large data sets of point support as well as areal support or when statistical
modeling of spatial variation, i.e., spatially varying regression model coefficients,
is adopted.
In our case study, we include only one areal datum in which the prediction
location up falls, which may affect the efficiency of the GLS estimator as well as
the residual component. This change of the original system may yield some
problems in the design matrix, particularly when dummy variables associated with
areal data are included. However, this change does not destroy the desirable
properties of Kriging prediction, such as unbiasedness and minimum prediction
error variance. In what follows, we present the A2PKED prediction at location up
as a linear regression form with correlated residuals based on n point data and a
single areal datum:
" #
X
M Xn
zup
^ ^ up fm up
b g ui rui ~kp sk rsk
~ 14
m p
m0 i1

where $\tilde{\eta}_p(u_i)$ and $\tilde{\lambda}_p(s_k)$ denote, respectively, the A2PSK weight assigned to the $i$th
point-support residual $r(u_i) = z(u_i) - \hat{\mu}(u_i)$ and to the $k$th areal residual
$r(s_k) = z(s_k) - \hat{\mu}(s_k)$ at the region $s_k$ in which the prediction location $u_p$ falls.
Note that the A2PSK weights for point data and the single areal datum in Eq. 14 need to be
updated at each prediction location, because the areal datum associated with the
prediction location changes from one location to the next. This amounts to estimating
a spatially varying (local) linear drift component whose GLS regression coefficients are
constant within each neighborhood, but vary from one area to another. For example, the GLS
regression coefficients $\hat{\beta}_m(u_p)$, $m = 0, \ldots, M$, at location $u_p$ differ from
those $\hat{\beta}_m(u_{p'})$ at location $u_{p'}$ if $u_{p'} \in s_{k'}$ with $k \neq k'$.
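
The structure of Eq. 14 can be illustrated with a minimal sketch under stated assumptions; it is not the authors' code. It assumes a one-dimensional domain, an exponential point covariance, and an areal datum defined as the average of the process over a set of discretization points, so that area-to-point and area-to-area covariances are averages of point covariances. The function `local_a2p_predict`, its arguments, and all numbers are hypothetical. The routine stacks the $n$ point data with the single areal datum, estimates the local GLS coefficients, and adds the simple Kriging prediction of the residuals.

```python
import numpy as np

def cov(a, b, sill=1.0, corr_range=3.0):
    """Assumed exponential point covariance between 1-D coordinate arrays."""
    return sill * np.exp(-np.abs(a[:, None] - b[None, :]) / corr_range)

def local_a2p_predict(u, z_pts, f_pts, v_area, z_area, f_area, u_p, f_p):
    """Local drift plus simple Kriging of residuals, in the spirit of Eq. 14.

    u: (n,) point locations; z_pts: (n,) point values;
    f_pts: (n, M+1) drift functions at the points (first column = 1);
    v_area: (d,) discretization points of the area s_k containing u_p;
    z_area: areal average value; f_area: (M+1,) area-averaged drift functions;
    u_p: prediction location; f_p: (M+1,) drift functions at u_p.
    """
    n = u.size
    # Data-to-data covariance with point-point, point-area, area-area blocks.
    C = np.empty((n + 1, n + 1))
    C[:n, :n] = cov(u, u)
    C[:n, n] = C[n, :n] = cov(u, v_area).mean(axis=1)
    C[n, n] = cov(v_area, v_area).mean()
    # Covariance between each datum and the prediction location.
    target = np.array([u_p])
    c0 = np.append(cov(u, target)[:, 0], cov(v_area, target).mean())
    # Stack point and areal data, then estimate the local GLS coefficients.
    F = np.vstack([f_pts, f_area])
    z = np.append(z_pts, z_area)
    Ci_F = np.linalg.solve(C, F)
    beta = np.linalg.solve(F.T @ Ci_F, Ci_F.T @ z)
    resid = z - F @ beta
    # Simple Kriging weights for the n point residuals and the areal residual.
    w = np.linalg.solve(C, c0)
    return f_p @ beta + w @ resid, beta, w

# Hypothetical example: four point data and one area spanning [2, 5].
u = np.array([0.5, 2.0, 3.5, 6.0])
z_pts = np.array([10.2, 11.0, 12.1, 9.4])
f_pts = np.column_stack([np.ones(4), [60.0, 85.0, 110.0, 70.0]])
v_area = np.linspace(2.25, 4.75, 6)
z_area, f_area = 11.5, np.array([1.0, 90.0])
f_p = np.array([1.0, 95.0])

pred, beta, w = local_a2p_predict(u, z_pts, f_pts, v_area, z_area, f_area, 3.0, f_p)
```

Calling the routine again for a prediction location in a different area, with that area's datum and discretization points, yields different coefficients `beta`, which is the local behavior described above.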
Typically, the application of GLS to hedonic price models assumes that the
relationship between house price and covariates is fixed over the study area. The
proposed approach takes into account heteroskedasticity present in house prices so
that the implicit price of housing attributes varies spatially by submarkets defined by
areal units (Orford 2000).


References

Anselin L (2002) Under the hood: issues in the specification and interpretation of spatial regression
models. Agric Econ 27:247–267
Anselin L (1998) Spatial econometrics: methods and models. Kluwer, Dordrecht
Basu S, Thibodeau TG (1998) Analysis of spatial autocorrelation in house prices. J Real Estate Financ
Econ 17:61–85
Black SE (1999) Do better schools matter? Parental valuation of elementary education. Quart J Econ
114(2):577–599
Can A (1990) The measurement of neighborhood dynamics in urban house prices. Econ Geogr
66(3):254–272
Can A, Megbolugbe I (1997) Spatial dependence and house price index construction. J Real Estate Financ
Econ 14:203–222
Chica-Olmo J (2007) Prediction of housing location price by a multivariate spatial method: cokriging.
J Real Estate Res 29(2):233–254
Chica-Olmo J (1995) Spatial estimation of housing prices and locational rent. Urban Stud
32(8):1331–1344
Chiles JP, Delfiner P (1999) Geostatistics: modeling spatial uncertainty. Wiley, New York
Cressie N (1993) Statistics for spatial data. Wiley, New York
Dubin RA (1988) Estimation of regression coefficients in the presence of spatially autocorrelated error
terms. Rev Econ Stat 70:466–474
Dubin RA (1992) Spatial autocorrelation and neighborhood quality. Region Sci Urban Econ 22:433–452
Dubin RA (1998) Predicting house prices using multiple listings data. J Real Estate Financ Econ
17:35–59
Dubin RA, Pace RK, Thibodeau TG (1999) Spatial autoregression techniques for real estate data. J Real
Estate Lit 7:79–95
Gelfand AE, Ecker MD, Knight JR, Sirmans CF (2004) The dynamics of location in home price. J Real
Estate Financ Econ 29(2):149–166
Goodman AC, Thibodeau TG (1998) Housing market segmentation. J Hous Econ 7:121–143
Goovaerts P (2005) Geostatistical analysis of disease data: estimation of cancer mortality risk from
empirical frequencies using Poisson kriging. Int J Health Geogr 4(31)
Goovaerts P (1997) Geostatistics for natural resources evaluation. Oxford University Press, New York
Gotway CA, Young LJ (2002) Combining incompatible spatial data. J Am Stat Assoc 97(458):632–648
Jones K, Bullen N (1994) Contextual models of urban house prices: a comparison of fixed- and random-
coefficient models developed by expansion. Econ Geogr 70(3):252–272
Journel AG, Huijbregts ChJ (1978) Mining geostatistics. Academic Press, New York
Kim CW, Phipps TT, Anselin L (2003) Measuring the benefits of air quality improvement: a spatial
hedonic approach. J Environ Econ Manage 45:24–39
Kutner MH, Nachtsheim CJ, Neter J, Li W (1974) Applied linear statistical models, 5th edn. McGraw-
Hill, London
Kyriakidis PC (2004) A geostatistical framework for the area-to-point spatial interpolation. Geogr Anal
36(3):41–50
Kyriakidis PC, Goodchild MF (2006) On the prediction error variance of three common spatial
interpolation schemes. Int J Geogr Inform Sci 20(8):823–855
LeSage JP, Pace RK (2004a) Models for spatially dependent missing data. J Real Estate Financ Econ
29(2):233–254
LeSage JP, Pace RK (eds) (2004b) Spatial and spatiotemporal econometrics. Elsevier, Oxford
Montgomery CC, Peck EA, Vining GG (2001) Introduction to linear regression analysis. Wiley, New
York
Neuman SP, Jacobson EA (1984) Analysis of nonintrinsic spatial variability by residual kriging with
application to regional ground water levels. Math Geol 16(5):499–521
Orford S (2000) Modelling spatial structures in local housing market dynamics: a multilevel perspective.
Urban Stud 37(9):1643–1671
Pace RK, Gilley OW (1998) Generalizing the OLS and grid estimators. Real Estate Econ 26:331–347
Paez A, Uchida T, Miyamoto K (2001) Spatial association and heterogeneity issues in land price
models. Urban Stud 38(9):1493–1508

Paez A, Long F, Farber S (2008) Moving window approaches for hedonic price estimation: an empirical
comparison of modeling techniques. Urban Stud 45(8):1565–1581
Ripley BD (1981) Spatial statistics. Wiley, New York
Rosen S (1974) Hedonic prices and implicit markets: product differentiation in pure competition. J Polit
Econ 82(1):34–55
Yoo E-H, Kyriakidis PC (2008) Area-to-point predictions under boundary conditions. Geogr Anal
40(4):355–379
