Chris Frazier Master Thesis

Copyright by Christopher Rawls Frazier 2004
Spatial Econometric Models for Land Use/Land Cover Panel Data: Theory and Application using Satellite Images for the Austin, Texas Region
by Christopher Rawls Frazier, B.S., B.A.
Thesis Presented to the Faculty of the Graduate School of The University of Texas at Austin in Partial Fulfillment of the Requirements for the Degree of Master of Science in Engineering The University of Texas at Austin August 2004
APPROVED BY SUPERVISING COMMITTEE: __________________________ Supervisor: Kara Kockelman __________________________ Lauren Ancel Meyers
For Violet and, especially, for Kathryn
ACKNOWLEDGMENTS
This thesis would not have been possible without the help and support of numerous people. First of all, I would like to thank Dr. Kara Kockelman, who has been my advisor over the past two years. She not only supplied the original ideas and inspiration from which this thesis grew, but also provided me with the guidance, feedback, financial support, and freedom which made this work possible. I would also like to acknowledge Dr. Kockelman's assistant, Annette Perrone, whose help in administrative and logistic matters has been immeasurable. I want to give special thanks to Dr. Lauren Ancel Meyers, who agreed, at a last-minute request (and while 8 months pregnant!), to be a reader for this thesis. I also would like to extend a general thanks to all the faculty and students at The University of Texas at Austin with whom I have worked and conversed during my time as a graduate student. In particular, I would like to single out recent graduate Michael Reyes, who assisted me greatly at various stages of this work. I also want to thank my family, especially my parents; over the last few years, a lot has gone on in my life and they have supported me unconditionally through it all. I should also thank Busch, Djuna, C.J., and B.S. for the comic relief and aromatherapy they have supplied over the past two years. Lastly, and most importantly, I want to acknowledge the loves of my life: my beautiful wife, Kathryn, and daughter, Violet. My craziness and idiocy are not always easy to deal with, and you have given nothing but the greatest love and support throughout it all. For this, I thank you.
by
Christopher Rawls Frazier, M.S.E. The University of Texas at Austin, 2004 SUPERVISOR: Kara Kockelman
ABSTRACT: This thesis develops and uses a variety of novel econometric methodologies incorporating the effects of space and time on demographic and geographic variables, including land-cover data derived from LandSat satellite imagery. Models for panel data incorporating spatial autocorrelation, temporal random effects, and time-lagged independent variables with varying time-lags are developed for linear regression, linear regression with sample selection, and logistic regression frameworks. In addition, a panel data model incorporating time-lagged dependent variables and models approximating differential equations in time and space are presented. To utilize these methodologies, a data set composed of land-cover data derived from satellite imagery, spatial statistics derived from the land-cover information, U.S. Census data, and geographic information is assembled; to incorporate these various data elements, special techniques, including approximating functions for off-year Census data and spatial integration via a data combination grid, are used. The methodologies are tested and compared through models estimated for a variety of demographic and land-cover variables; from these results, important spatial, temporal, and relational information concerning the data is garnered. Finally, predictions incorporating spatial autocorrelation for population and land cover are run, though the poor results indicate there is a probable flaw in the simulation methodology.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1: INTRODUCTION
  1.1 LAND USE AND MODELING
  1.2 THESIS OBJECTIVES AND ORGANIZATION
CHAPTER 2: HISTORICAL/LITERATURE REVIEW
  2.1 EARLY LAND-USE MODELS
  2.2 SPATIAL ECONOMETRICS
    2.2.1 Socio-Economic and Demographic Applications of Spatial Econometrics
    2.2.2 Land-Use/Land-Cover Models Using Spatial Econometrics
  2.3 OTHER MODELS & THEORIES
  2.4 MODEL COMPARISON
CHAPTER 3: DATA
  3.1 LANDSAT SATELLITE IMAGERY
  3.2 LAND USE VS. LAND COVER
  3.3 SUPERVISED IMAGE CLASSIFICATION
  3.4 DERIVED SPATIAL DATA
  3.5 UNITED STATES CENSUS DATA
  3.6 COMBINATION GRID
  3.7 ESTIMATING OFF-YEAR CENSUS DATA
  3.8 GEOGRAPHIC DATA
  3.9 DATA CAVEATS
  3.10 DATA SUMMARY
CHAPTER 4: METHODOLOGY
  4.1 THE SPATIAL WEIGHT MATRIX
  4.2 PANEL DATA SPATIAL LINEAR REGRESSION MODEL
  4.3 PANEL DATA LINEAR REGRESSION MODEL WITH TIME-LAGGED DEPENDENT VARIABLE (LSDV MODEL)
  4.4 PANEL DATA SPATIAL LINEAR REGRESSION MODEL USING PROBIT SAMPLE-SELECTION
    4.4.1 Panel Data Linear Regression Model using Probit Sample-Selection
    4.4.2 Panel Data Spatial Linear Regression Model using Probit Sample-Selection
  4.5 PANEL DATA SPATIAL LOGISTIC REGRESSION MODEL
  4.6 ACCOUNTING FOR DIFFERENCES IN TIME LAGS
  4.7 THE TEMPORAL AND SPATIAL INCIDENTAL PARAMETERS PROBLEM
  4.8 ESTIMATING DIFFERENTIAL EQUATION FRAMEWORK THROUGH PANEL DATA MODELS
  4.9 SAMPLING AND MODEL ESTIMATION
  4.10 SUMMARY
CHAPTER 5: MODEL RESULTS
  5.1 PANEL DATA SPATIAL LINEAR REGRESSION MODEL
    5.1.1 ln(Population) Model
    5.1.2 ln(Per Capita Income) Model
    5.1.3 Average Vehicles Available per Household Model
  5.2 LSDV MODEL
    5.2.1 ln(Population) Model
    5.2.2 ln(Per Capita Income) Model
    5.2.3 Average Vehicles Available per Household Model
  5.3 PANEL DATA SPATIAL LINEAR REGRESSION MODEL USING PROBIT SAMPLE SELECTION
    5.3.1 Urban Land Cover Greater/Less Than 0.3 ln(Population) Sample Selection Panel Data Spatial Regression Model
      5.3.1.1 Probit Selection Model
      5.3.1.2 Urban Land Cover Greater Than 0.3 ln(Population) Model
      5.3.1.3 Urban Land Cover Less Than 0.3 ln(Population) Model
    5.3.2 Population Greater/Less Than 175 ln(Per Capita Income) Sample Selection Panel Data Spatial Regression Model
      5.3.2.1 Probit Selection Model
      5.3.2.2 Population Greater Than 175 ln(Per Capita Income) Model
      5.3.2.3 Population Less Than 175 ln(Per Capita Income) Model
  5.4 PANEL DATA SPATIAL LOGISTIC REGRESSION MODEL
    5.4.1 Proportion Urban Land Cover Model
    5.4.2 Proportion Residential|Urban Land Cover Model
    5.4.3 Proportion Rural|Not Urban Land Cover Model
  5.5 DIFFERENTIAL EQUATION MODELS
    5.5.1 d(Population) Model
      5.5.1.1 Time Dimension
      5.5.1.2 Space Dimension
    5.5.2 d(Average Vehicles Available per Household) Model
      5.5.2.1 Time Dimension
      5.5.2.2 Space Dimension
    5.5.3 d(Median House Price) Model
      5.5.3.1 Time Dimension
      5.5.3.2 Space Dimension
CHAPTER 6: MODEL PREDICTIONS
  6.1 SIMULATION METHODOLOGIES
    6.1.1 Spatial Regression
    6.1.2 Spatial Logistic Regression
  6.2 POPULATION PREDICTIONS
  6.3 LAND COVER PREDICTIONS
  6.4 SUMMARY
CHAPTER 7: CONCLUSIONS & EXTENSIONS
  7.1 SUMMARY AND CONCLUSIONS
  7.2 EXTENSIONS
APPENDIX: MULTINOMIAL PANEL DATA SPATIAL PROBIT MODEL
  A.1 MONTE CARLO INTEGRATION
  A.2 THE MULTINOMIAL PANEL DATA SPATIAL PROBIT MODEL
REFERENCES
VITA
LIST OF TABLES
Table 3.1 Mean values of 1990 and 2000 U.S. Census data and approximation formula parameters derived from them
Table 3.2 The absolute difference between predicted and actual 1990 and 2000 U.S. Census data (||); and the ratio between || and mean actual value for the 1990 and 2000 U.S. Census data (||/mean)
Table 3.3 Descriptive statistics for year 2000 data
Table 3.4 Descriptive statistics for year 1997 data
Table 3.5 Descriptive statistics for year 1991 data
Table 3.6 Descriptive statistics for year 1983 data
Table 5.1 Results of ln(Population) spatial regression model without time-lagged variables
Table 5.2 Results of ln(Population) spatial regression model with time-lagged variables
Table 5.3 Results of ln(Population) spatial regression model with time-lagged variables and time adjustment
Table 5.4 Results of ln(Per Capita Income) spatial regression model without time-lagged variables
Table 5.5 Results of ln(Per Capita Income) spatial regression model with time-lagged variables
Table 5.6 Results of ln(Per Capita Income) spatial regression model with time-lagged variables and time adjustment
Table 5.7 Results of Average Number of Vehicles Available per Household spatial regression model without time-lagged variables
Table 5.8 Results of Average Number of Vehicles Available per Household spatial regression model with time-lagged variables
Table 5.9 Results of Average Number of Vehicles Available per Household spatial regression model with time-lagged variables and time adjustment
Table 5.10 Results of ln(Population) LSDV model
Table 5.11 Results of ln(Per Capita Income) LSDV model
Table 5.12 Results of Average Number of Vehicles Available per Household LSDV model
Table 5.13 Results of Proportion of Urban Land Cover greater than 0.3 probit model with time-lagged variables
Table 5.14 Results of Proportion of Urban Land Cover greater than 0.3 probit model with time-lagged variables and time adjustment
Table 5.15 Results of Proportion of Urban Land Cover greater than 0.3 probit model with time-lagged variables and time adjustment using entire data set
Table 5.16 Results from ln(Population) spatial regression model without time-lagged variables: Urban Land Cover greater than 0.3
Table 5.17 Results from ln(Population) spatial regression model with time-lagged variables: Urban Land Cover greater than 0.3
Table 5.18 Results from ln(Population) spatial regression model with time-lagged variables and time adjustment: Urban Land Cover greater than 0.3
Table 5.19 Results from ln(Population) spatial regression model without time-lagged variables: Urban Land Cover less than 0.3
Table 5.20 Results from ln(Population) spatial regression model with time-lagged variables: Urban Land Cover less than 0.3
Table 5.21 Results from ln(Population) spatial regression model with time-lagged variables and time adjustment: Urban Land Cover less than 0.3
Table 5.22 Results of Population greater than 175 probit model with time-lagged variables
Table 5.23 Results of Population greater than 175 probit model with time-lagged variables and time adjustment
Table 5.24 Results of Population greater than 175 probit model with time-lagged variables and time adjustment using entire data set
Table 5.25 Results from ln(Per Capita Income) spatial regression model without time-lagged variables: Population greater than 175
Table 5.26 Results from ln(Per Capita Income) spatial regression model with time-lagged variables: Population greater than 175
Table 5.27 Results from ln(Per Capita Income) spatial regression model with time-lagged variables and time adjustment: Population greater than 175
Table 5.28 Results from ln(Per Capita Income) spatial regression model without time-lagged variables: Population less than 175
Table 5.29 Results from ln(Per Capita Income) spatial regression model with time-lagged variables: Population less than 175
Table 5.30 Results from ln(Per Capita Income) spatial regression model with time-lagged variables and time adjustment: Population less than 175
Table 5.31 Results from Proportion of Urban Land Cover spatial logistic regression model without time-lagged variables
Table 5.32 Results from Proportion of Urban Land Cover spatial logistic regression model with time-lagged variables
Table 5.33 Results from Proportion of Urban Land Cover spatial logistic regression model with time-lagged variables and time adjustment
Table 5.34 Results from Proportion of Urban Land Cover which is Residential spatial logistic regression model without time-lagged variables
Table 5.35 Results from Proportion of Urban Land Cover which is Residential spatial logistic regression model with time-lagged variables
Table 5.36 Results from Proportion of Urban Land Cover which is Residential spatial logistic regression model with time-lagged variables and time adjustment
Table 5.37 Results from Proportion of Non-Urban Land Cover which is Rural spatial logistic regression model without time-lagged variables
Table 5.38 Results from Proportion of Non-Urban Land Cover which is Rural spatial logistic regression model with time-lagged variables
Table 5.39 Results from Proportion of Non-Urban Land Cover which is Rural spatial logistic regression model with time-lagged variables and time adjustment
Table 5.40 Results from Proportion of Non-Urban Land Cover which is Rural spatial logistic regression model with time-lagged variables and time adjustment and with one highly deviant sample removed
Table 5.41 Results from time dimension of Population differential equation model
Table 5.42 Results from vertical spatial dimension of Population differential equation model
Table 5.43 Results from horizontal spatial dimension of Population differential equation model
Table 5.44 Results from time dimension of Average Number of Vehicles Available per Household differential equation model
Table 5.45 Results from vertical spatial dimension of Average Number of Vehicles Available per Household differential equation model
Table 5.46 Results from horizontal spatial dimension of Average Number of Vehicles Available per Household differential equation model
Table 5.47 Results from time dimension of Median Home Value differential equation model
Table 5.48 Results from vertical spatial dimension of Median Home Value differential equation model
Table 5.49 Results from horizontal spatial dimension of Median Home Value differential equation model
LIST OF FIGURES
Figure 3.1 Example of how the point nature of a LandSat scanner creates a distorted image
Figure 3.2 Land-cover maps derived from LandSat imagery for the years 1983 (left) and 1991 (right)
Figure 3.3 Land-cover maps derived from LandSat imagery for the years 1997 (left) and 2000 (right)
Figure 3.4 Reference diagram for pixel neighborhood used in calculation of land-cover mix statistic
Figure 3.5 Illustration of the fact that the combination grid shown here by thick black lines does not line up with the Census block-groups represented by the patchwork of grey shapes
Figure 5.1 Example of convergence of bootstrap parameter standard deviation estimates from ln(Population) LSDV model
Figure 5.2 Scatter plots comparing various characteristics of the data
Figure 5.3 Example of convergence of bootstrap parameter standard deviation estimates for urban land-cover proportion greater than 0.3 ln(Population) panel-data spatial regression model
Figure 6.1 Map of the Austin, Texas area showing the Downtown and Cedar Park regions used for predictions
Figure 6.2 Population data and predictions for the downtown Austin region
Figure 6.3 Population predictions for the downtown Austin region: +1 sample standard deviation from average
Figure 6.4 Population predictions for the downtown Austin region: -1 sample standard deviation from average
Figure 6.5 Population data and predictions for the Cedar Park region
Figure 6.6 Population predictions for the downtown Austin and Cedar Park regions simulated without time adjustment
Figure 6.7 Urban land cover data and predictions for the downtown Austin region
Figure 6.8 Residential land cover data and predictions for the downtown Austin region
Figure 6.9 Rural land cover data and predictions for the downtown Austin region
Figure 6.10 Urban land cover data and predictions for the Cedar Park region
Figure 6.11 Residential land cover data and predictions for the Cedar Park region
Figure 6.12 Rural land cover data and predictions for the Cedar Park region
CHAPTER 1: INTRODUCTION
As a result of humankinds innate ability to adapt to its environment, civilization has spread, at an increasingly rapid rate, all over the globe. Depending on the environment, resources, and cultural factors, the manner in which humankind inhabits different locales will often vary greatly. Whereas the landscape of a modern city, with its high population density and situation as an economic and cultural center, may be completely covered with buildings and roads, in some parts of the world, a nomadic culture flourishes and little, if any, permanent imprint of humankinds existence is visible. Over the past century and a half, the Industrial Revolution and its beneficial consequences increases in economic and social prosperity, advances in medicine and technology, greater dissemination of information have profoundly effected the manner in which humans have inhabited their environment. Nowhere has this been more apparent than in The United States of America, where, in the span of a few generations, the country has moved from a primarily agrarian, rural society to one in which over 75% of its citizens live in urbanized areas and their economic and social livelihood is fueled by industrial and technological sectors (U.S. Census Bureau 1993 & 2004(a)). Though this demographic shift in the U.S. generally has been beneficial, a number of problems have arisen as a result. Among these are increases in pollution and waste production, reductions in natural habitats, and a greater reliance on consumptive activities and non-renewable natural resources. Problems felt at a personal level also have shifted to issues which are outgrowths of greater urbanization, such as highway congestion and urban sprawl. Over the course of recent history, many attempts have been made at finding solutions to these various problems. In the U.S., environmental legislation, the construction of more and larger freeways, the development of
vehicles with greater fuel economy, and urban growth restrictions are all examples of attempts to remedy problems brought on by urbanization. In order to address the problems effectively, however, a thorough understanding of both the problem itself and the socio-demographic and geographic milieu is necessary. Developing this understanding is aided by and often requires rigorous methods of analysis, which in turn lead to the development, justification, and implementation of possible solutions, as well as a deeper knowledge of the complexities associated with humankinds existence on the Earth. Generally speaking, many roadblocks exist which hinder the development and use of the analytic tools used to understand the impact of urbanization. The largest and most apparent problem is that a great deal of uncertainty is introduced because humans are involved. That is, because humans are not machines, predicting what will happen in given situation involving humans, even with perfect information and infinite computational resources, is virtually impossible. Furthermore, in real-life, information and computational resources are limited and, even with recent gains in both areas, still place constraints on the analytic methods used. The manner in which these hindrances are overcome is actually rather obvious: approximate reality. That is, make very general assumptions in order to simplify reality, and then create models based around those assumptions. The form of the assumptions, and of the models themselves, is entirely up to the researcher, but there should be some justification or general consensus for what is done. The primary assumption which is made for this thesis is statistical in nature and addresses the uncertainty inherent in humans. Simply stated, it is assumed that humans, though varied and to some extent unpredictable individually, when observed at an aggregate level seem to act in more similar and predictable ways. 
This justification for this is that social, cultural, and genetic commonalities that exist between humans can be observed far more easily at an aggregate level 2
because individual idiosyncrasies, which one would expect to exist in close to equal level above and below the average human, will cancel out. A further justification of the assumption is that because the problems rising out of industrialization and urbanization, such as pollution and traffic, are the result of the summation of many humans actions, analyzing the problems one or two humans at a time is probably not as instructive as analyzing them at an aggregate level. For all of the importance that is associated with developing the methods and techniques used for analyzing the problems of modern humankind, just as much importance, if not more, should be attached to deciding what actually should be modeled. That is, many models are somewhat generic in nature and could be applied to any one or more of the multitude of elements involving humans; however, the true power of a model can be drawn out when those elements which are highly important to humans are modeled. Part of the art of modeling is in selecting things to model which are relevant and influential to past, present and future of humankind. For example, while a modeling framework might be used on a variable like the average length of a populations hair, it would probably be better to model something like the number of miles per day that the population drives on freeways, for this latter variable is more important for understanding important issues effecting the populace, such as traffic, pollution, and oil dependence. In this spirit, the underlying motivation for work is essentially to create models for important elements, such as population and vehicles owned per household, which one would expect to contribute to the problems facing the modern world and, especially, the United States. This is done with a focus towards transportation, urban, and regional planning applications as opposed to others, such as environmental or political. 
Although such a focus is not unique, one of the distinguishing contributions of this work is the combination of this 3
focus with the inclusion of detailed land-use information and novel modeling techniques which have not been included in many previous models. 1.1 LAND USE AND MODELING One would expect that one of the most important inputs for models addressing the problems of increasing industrialization and urban growth would be the footprint that humankind makes on the Earth. That is, if humans are actually using an area of land, then how that land is being used will most likely make a difference as to what kinds of problems are generated as a result. For example, a large, dirty power plant near a major freeway may create an environmental wasteland, but, because of this, few people will want to live or have businesses near it, so traffic in that area will probably not be an issue. On the other hand, a sprawling suburban shopping center in a small town will probably attract businesses and residents, but this may result in, among other things, limited growth because of restrictions put forth by a local planning commission worried about losing the small-town essence of the area. Even though it may contain important information on the impacts of mankinds growth, good information on how the land is being used has always been elusive. While such information is usually kept by local city and county agencies, it often may be outdated, located in non-centralized locations, and inconsistent. Furthermore, such data is generally not available in electronic form, hindering quantitative analysis, especially since such data sets will tend to be large and dense. This problem is exacerbated if one wants historical data of any form; definite, historical land-use records may be difficult to come by, and, where they exist, probably are not available in digital form. Recent developments in image processing techniques on satellite imagery offer a potential solution to this problem. By capturing advanced images of the landscape and applying quantitative analysis to them, a picture of how mankind 4
appears to be using the land emerges. Furthermore, because the satellite image is essentially a historical documentation of a specific point in time, analysis of past images can yield data for previous periods. Another problem with the use of land-use information in models is the high computational demand associated with it. Not only are such data sets large, but they are also spatial, requiring complex geographic techniques and modeling methodologies in order to fully incorporate their information. It was not until the last decade, with vast increases in computing power and the development of geographic information system (GIS) software, that the inclusion of detailed land-use data into models became realistic. Even with the advances in computing power and the promise of image processing techniques, land-use data is still not widely available, hindering the development of such models (Irwin and Geoghegan 2001). For this work, detailed land-use/land-cover information of the Austin, Texas region derived from satellite imagery has been obtained. This data is comprised of four data panels, each representing a different point in time, spanning more than a decade. By using this data, more powerful, instructive, and useful models are developed. The development of these models poses another problem: how to rigorously account for the spatial and temporal aspects of the data, as one would expect that there might be information contained within these aspects which could be harnessed for improved models. Herein lies the second major contribution of this work: developing methodologies which capture spatial and temporal information contained within data. These methodologies not only account for the spatial and temporal correlations and attributes of the data, but also utilize them to better understand the various interactions among the data.
1.2 THESIS OBJECTIVES AND ORGANIZATION

The goal of this work is to develop and use models, for variables important to transportation, urban, and regional planning applications, which can incorporate the spatial and temporal aspects of land use and other information in a rigorous and informative manner. Quantitative analysis of how to incorporate such data, as well as a comparison of how various models perform, will be carried out. Also developed is a special dynamic model form, whose goal will be to exploit the temporal characteristics of the data so as to create predictions for the future in a practical manner. The organization of this thesis is as follows: Chapter 2 discusses the historical background of work on methodologies incorporating spatial information and models for and using land-use and urban growth data. It also includes a defense of the methodological slant chosen for this work. Chapter 3 discusses in detail the various sources of data for this thesis, as well as the methods used to integrate them for use in modeling. Specifically, it covers the land-use/land-cover data used in this work, including an overview of satellite imagery and the methods used to derive the land-use/land-cover data from it. Other data sources, including derived land-cover data, U.S. Census data, and geographical data, are also discussed. One of the major issues with using these various data sets is that they are not aligned in space and time. In this chapter, the methods used to integrate these data sets, including approximation methods and a data combination grid, are also discussed. Finally, caveats concerning the data are addressed in detail. Chapter 4 presents the methodologies used to model the data introduced in Chapter 3. A series of econometric models, incorporating spatial autocorrelation and capturing temporal effects, are discussed. The models covered include those for both continuous and proportional data. It might be expected that areas with
different characteristics might have different models, such as having different population models for urban versus rural regions. Sample selection models, which account for such model variation, will be discussed in this chapter as well. Finally, a differential equations modeling framework, for both spatial and temporal dimensions, is developed. Chapter 5 covers the quantitative results of the implementations of the models developed in Chapter 4. The results for various models of population, per capita income, average vehicles owned per household, median house value, and land cover are presented and discussed. The explanatory power and usefulness of the models, in the context of the results, are covered as well, including comparisons between the various models. Chapter 6 discusses a practical application of the model results in the form of population and land-cover projections for parts of the Austin, Texas region for the years 2005, 2010, and 2020. Chapter 7 reviews the work and summarizes its various elements. This chapter also discusses the continuations of and extensions to this research which should take place in the future.
CHAPTER 2: HISTORICAL/LITERATURE REVIEW

This thesis builds upon a foundation constructed by a variety of past works. Though these works provide a context for the structure and theory of this thesis, they by no means define a clear point from which this work should extend. Instead, they comprise a historical and theoretical record of the multitude of ways in which the problem of modeling spatial, and in particular land-use/land-cover, data has been approached.1 Though this thesis draws on this background to extend some of the models discussed here and develop new ones, many of the methods covered here are not incorporated; nevertheless, all of the works presented here are important to the development of this thesis. In fact, those techniques which are not used are of particular importance, as they highlight many of the specific reasons and justifications for why the methods which are used were chosen. As this thesis uses land-use/land-cover data derived from satellite, or remotely sensed, imagery, works which incorporate remotely sensed data will also be discussed. Most of the applications of this data involve ecological land change issues, in particular deforestation, which are somewhat removed from the transportation and regional planning focus of this thesis. However, the methods developed in these texts still provide a good base from which an understanding of the techniques, benefits, and problems surrounding the area of modeling remotely sensed data can be established. This chapter is organized into four sections. The first discusses early land-use models, which create the groundwork for not only the land-use models developed in this thesis, but also, more generally, the spatial analysis techniques used throughout it. The second section discusses spatial econometrics and its
1 There is a subtle, but important, difference between land use and land cover. For this chapter, it suffices to say that land cover is based on what the land looks like it is being used for, and land use is what the land is actually being used for. A more detailed discussion of this distinction, and how it specifically relates to this work, is presented in Chapter 3.
application in various models. The third section will discuss other models of land use and/or involving spatial interactions. Finally, the last section will compare the various methodologies and make a case for the choice of the spatial econometric techniques used in this work.

2.1 EARLY LAND-USE MODELS

Many early land-use models were focused on modeling urban regions and were based on the assumption that a city is centralized: as one extends radially from its center, distinct regions can be discerned. The Concentric Zone Theory, proposed by E.W. Burgess in 1926, described cities in terms of a central business district (CBD) and concentric rings surrounding it; the farther a ring was from the CBD, the more affluent the residents living there (Candau 2002). Other theories used the idea that distance from a city's center matters to create models of land values, or rents, in which populations distribute themselves according to some maximization scheme. For example, the von Thunen Model described agricultural land use according to the interaction of distance to the CBD and the value of various goods on the profit maximizing actions of farmers (Candau 2002). Bid rent theory extended this idea to develop models for the distribution of urban land use (residential, industrial, commercial, etc.) based on utility maximization and land rents which declined as the distance to the CBD increased (Candau 2002, Fujita 1989). Improvements of these land-use models would allow for less centralized city forms. Sector Theory was an extension of the Concentric Zone Theory in which a city is still centralized around a CBD, but in which irregular sectors, rather than rings, around the CBD define the landscape of the city (Candau 2002). The irregularity of the sectors allowed for more regional effects to be modeled. This idea of including regional effects was the basis for the Multiple Nuclei Theory, in which an urban region was allowed to have many nuclei acting, with
various strengths, as city centers (Candau 2002). Thus, a city could be seen as still developing according to distance from centers of growth, but in a form which is no longer completely centralized. A model similar to the Multiple Nuclei Theory, but with a more structured form, is the Central Place Theory developed by Walter Christaller in 1933 (Candau 2002). This theory was based around a strict geometric representation of the sphere of influence of urban areas. Spheres of influence took the shape of polygons which could be packed together, such as triangles or hexagons, and whose size was dictated by the size of the urban area with which they were associated, i.e. the sphere of influence of a city would be larger than that of a village (Candau 2002, Christaller 1954). Overlaying of the various spheres of influence illustrated not only the distribution of the urbanized regions, but also the interactions between them (Candau 2002). Throughout these early models, two themes are important for the development of the spatial analysis used in this work. First is the notion that the distance of a location from major city components, such as the city center or its transportation network, affects the manner in which that location develops. The second theme extends this idea of the importance of distance to a less centralized realm: that sub-regions of a city can develop their own pockets of influence which are local and somewhat independent of the surrounding areas. These two ideas implicitly motivate and guide most later spatial analysis techniques, especially those of spatial econometrics.

2.2 SPATIAL ECONOMETRICS

The basic goal of econometrics is to develop models which draw useful information from data in a statistically rigorous manner. Spatial econometrics extends this basic premise to explicitly incorporate spatial information contained within the data. The motivation for including spatial information in models is that
for some types of data it might be expected that the location and spatial context of an observation will inform its characteristics. For example, the population of a one kilometer square tract of land is certainly dependent on its spatial properties; among the spatial factors affecting this would be whether or not the tract is in a city, the country it is located in (through which certain social and economic factors specific to that country may be introduced), what the terrain of the tract is, and whether the area surrounding the tract is urbanized or not. An econometric model will provide potentially biased results if these regional and geographic dependencies of the data exist and are not accounted for (Anselin 1988). To introduce spatial effects into a rigorous econometric framework, two general methods, which may be used together, have been developed. One explicitly places spatial information as an explanatory variable of the model; the other allows spatial information to be accounted for via autocorrelation among the error terms of the model. By explicitly defining the spatial data which is incorporated into the model, the first method allows the dependence of the model on that spatial information to be expressed directly; however, this also places a burden on the modeler to not only acquire the spatial data but also to introduce it into the model in the correct form. Anselin (1999) notes that this method corrects for structural instabilities which occur as a result of spatial heterogeneity in the data. In contrast with the previous technique, the method incorporating spatial autocorrelation requires less specificity. One of the things that error terms in econometric models account for is unobserved, or unobservable, information, and the spatial autocorrelation method assumes that within this unobserved information are spatial dependencies that can be drawn out (Greene 2000, Anselin 1999). 
The implicit lack of specificity as to what this unobserved information is allows for a great deal of flexibility as to what the correlations may account
for; however, because of this lack of specificity, there will also be some degree of vagueness as to how these spatial correlations can be interpreted (Anselin 1999). The simplest method of placing spatial information among the explanatory variables is to include some sort of geographic reference. This is often a type of distance measure, such as distance to the nearest highway or distance to the nearest body of water. Anselin (1988) has mentioned that global geographic coordinates, such as longitude and latitude, can be used, though it is difficult to imagine intuitively motivating the use of such a reference without the proper contextualization. Dummy variables for specific spatial areas can also be used to pick up on regional effects, as used in Wangen and Bim (2001). Another type of spatial information which can be included in the explanatory variables is that which incorporates information from the neighborhood around an observation. Basically, the information in other
observations, usually the dependent variable, which are within some distance of the observation of interest is added to the explanatory variables. Anselin (1988) refers to these added neighborhood variables as spatial lags, as they are the spatial analog of time lags in time series or panel data models. Following the lines of the early land-use models, it is assumed that observations that are close will have more of an impact on each other than those far away; consequently, spatially lagged variables are usually multiplied by spatial weights which account for the effects of distance. Lastly, spatial statistics can also be used as spatial explanatory data for an econometric model. These statistics transform known spatial or location-specific data to draw additional information from an observation and/or the area surrounding it. Examples of spatial statistics include the mix and entropy statistics used in Kockelman (1997), the built environment variables in Cervero and Kockelman (1997), and the spatial filtering transformation of the suitability maps in Schneider and Pontius (2001).
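The spatially lagged variable described above can be illustrated with a minimal sketch. It assumes one common construction that the text does not itself specify: weights proportional to inverse distance, set to zero beyond a cutoff, and row-standardized so that each lag is a weighted average of neighboring values. The function names, the cutoff parameter, and the inverse-distance form are hypothetical choices made only for illustration.

```python
import numpy as np


def inverse_distance_weights(coords, cutoff):
    """Row-standardized inverse-distance weights (an assumed form).

    Observations farther apart than `cutoff` get zero weight, and an
    observation is never its own neighbor.
    """
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))

    w = np.zeros_like(dist)
    neighbors = (dist > 0) & (dist <= cutoff)
    w[neighbors] = 1.0 / dist[neighbors]

    # Scale each row to sum to one; rows with no neighbors stay all zero.
    row_sums = w.sum(axis=1, keepdims=True)
    np.divide(w, row_sums, out=w, where=row_sums > 0)
    return w


def spatial_lag(w, y):
    """Weighted average of the neighbors' values for each observation."""
    return w @ np.asarray(y, dtype=float)
```

Under these assumptions, the output of `spatial_lag` would be added as an extra column among the explanatory variables; closer observations contribute more to the lag than distant ones.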
In order to incorporate spatial autocorrelation into econometric models, a specific error structure incorporating the spatial effects must be defined (Anselin 1988). The general method of specifying autocorrelation in an econometric model is to define the error term for an individual observation i as:
\[ \varepsilon_i = f(\varepsilon) \tag{2.1} \]

where $\varepsilon_i$ is the error term for observation $i$ and $f(\varepsilon)$ is a function of the error terms for all of the observations. What this means is that in modeling a given observation, unobserved information from other observations is important to, and can inform, the model of the observation of interest. In models incorporating spatial autocorrelation, $f(\varepsilon)$ relates the error terms of the observations according to their spatial relationships. As with the spatial lags discussed above, usually it is assumed that observations which are close to one another will influence or inform one another more than those which are far away, and the methodological construction of the function in (2.1) is generally based on this assumption (Anselin 1988). To actually account for spatial autocorrelation, the function in (2.1) must be defined. Generally, one of two methods is used, though they may also be combined (Anselin 1988). The first defines a neighborhood around a given observation, and then correlates the error terms for the observations within that neighborhood with that of the observation of interest, usually with equal weight. The second method does not necessarily restrict the spatial autocorrelation to a neighborhood, and instead explicitly accounts for the distance between observations, allowing for greater autocorrelation between observations which are closer together.
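The two ways of defining the function in (2.1) can be sketched as two weight matrices. The sketch below then plugs either matrix into a linear error structure, in which each error equals a scalar lam times the weighted sum of the other errors plus an independent disturbance u. That linear form is an assumption chosen for illustration, since this section leaves the function general, and the function names and the parameters `radius`, `power`, and `lam` are hypothetical.

```python
import numpy as np


def _row_standardize(w):
    """Scale rows to sum to one, leaving neighborless rows as zeros."""
    row_sums = w.sum(axis=1, keepdims=True)
    return np.divide(w, row_sums, out=np.zeros_like(w), where=row_sums > 0)


def neighborhood_weights(dist, radius):
    """First method: equal weight for every observation within `radius`."""
    w = ((dist > 0) & (dist <= radius)).astype(float)
    return _row_standardize(w)


def distance_decay_weights(dist, power=1.0):
    """Second method: no neighborhood limit; weight falls with distance."""
    w = np.zeros_like(dist, dtype=float)
    off_diagonal = dist > 0
    w[off_diagonal] = dist[off_diagonal] ** (-power)
    return _row_standardize(w)


def autocorrelated_errors(w, lam, u):
    """Solve eps = lam * W @ eps + u for eps (assumed linear structure)."""
    n = w.shape[0]
    return np.linalg.solve(np.eye(n) - lam * w, np.asarray(u, dtype=float))
```

With `lam` set to zero the errors reduce to the independent disturbances, which is the ordinary non-spatial case; larger values of `lam` tie each error more strongly to those of its neighbors.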
2.2.1 Socio-Economic and Demographic Applications of Spatial Econometrics

There is a growing body of literature concerned with applying spatial econometric models to socio-economic variables, both discrete and continuous. In the case of continuous variables, examples include Vandaveer, Soto, and Niu's (2002) use of both spatial autocorrelation and spatial lags to model rural land values; Dubin (1991), who has exploited the flexibility of spatial autocorrelation to utilize it as a proxy for neighborhood quality in a study of residential home values; and Messner and Anselin (2002), who have used spatial autocorrelation and spatial lags in a regression analysis of national homicide rates. In discrete variable analyses, spatial probit formulations are generally preferred so as to exploit convenient characteristics of the normal distribution; specifically, that the sum of two normally distributed variables is also normally distributed (Anselin 1988). Examples of this are Marsh, Mittelhammer, and Huffaker (1997), who used both spatial lags and spatial autocorrelation in their examination of the effects of an agricultural blight on potato farmers, and Coughlin, Garrett, and Hernández-Murillo (2004), who incorporated global and regional spatial autocorrelation effects into a panel data analysis of state lotteries. Spatial effects have also been introduced into logit discrete choice models, though with less success than with the probit models due to the difficulty in maintaining a usable error structure. Bhat and Guo (2004), for example, developed a complicated, specialized, and constraining multinomial-logit-inspired error structure to account for spatial autocorrelation in residential choice models. All of these examples are indicative of most applications of spatial econometrics towards economic, transportation, regional planning, and urban planning issues in the sense that they do not take into account land-use/land-cover information. 
A notable exception is Kline and Alig (2001), who develop a land-use change model which incorporates both land-use information (which is not remotely sensed) as well as geographic information. Another is the commercial
package UrbanSim (Waddell 2002), which, using an econometric framework, simulates numerous socio-economic and land-use variables with a focus on policy-based applications. However, because the binary or multinomial logit formulation of these models precludes their inclusion, neither spatial autocorrelation nor spatial lags are incorporated. The lack of spatial econometric approaches to models involving land use/land cover, or even transportation and regional planning applications, is highlighted by the fact that in Berling-Wolff and Wu's (2004) review of urban and transportation modeling and simulation techniques, spatial econometrics is not even mentioned. However, there is one research area where the combination of land-use/land-cover data with spatial econometric techniques is an active topic of research: ecological and, in particular, deforestation models. The following section will discuss some of these works in order to develop a better theoretical and practical foundation for the methods used throughout this thesis.
2.2.2 Land-Use/Land-Cover Models Using Spatial Econometrics

In their review of various methods used for modeling land-use/land-cover variables, Irwin and Geoghegan (2001) emphasize the importance of incorporating spatial effects into models, noting many studies whose findings point in this direction. However, they also state that such work, especially in using econometric techniques, is in its infancy, and a greater understanding of how space affects land use/land cover will only come as more advanced and rigorous models are developed. Hindrances to such work are not only methodological (in that the models have not yet been developed), but also computational: spatial data, especially with the advent of GIS and land-use/land-cover data derived from satellite imagery, is often very dense and memory intensive (Nelson and Geoghegan 2001). Furthermore, spatial econometric techniques require calculations whose computational demands increase
exponentially with the size of the data set, further placing a burden on the techniques. Another issue which requires further investigation is the development of rigorous analyses of goodness-of-fit for land-use/land-cover models, as well as the effects of scale on the models (see Irwin and Geoghegan 2001, Veldkamp and Lambin 2001, and Kok, et al. 2001). Despite these difficulties, some attempts have been made in modeling land use/land cover using spatial econometric techniques. These models, as noted above, have focused on deforestation issues and, in particular, on land-use/land-cover change in tropical rain forest regions. Because change is modeled, it is natural to use discrete choice models in such research; however, continuous variable analysis in spatial econometrics is much simpler. Both techniques may be applied to land-use/land-cover change by analyzing either where a change is predicted to occur (discrete) or how much change is expected (continuous) (Nagendra, Munroe and Southworth 2004). Comparisons of these techniques are found in Geoghegan, et al. (2001) and Southworth, Munroe and Nagendra (2004). Though it is technically more difficult to incorporate spatial effects into discrete choice methodologies, much of the research in econometric techniques for land-use/land-cover change has focused on these types of models. However, the difficulty with such models becomes clear, as most of this research ignores potentially important spatial effects or employs questionable methodology. For example, Wear and Bolstead (1998) claim to incorporate spatial autocorrelation into a binomial logit model of land use. However, the method they employ is actually a spatial lag model. Furthermore, they do not take into account the fact that doing so invalidates the necessary assumption that the error term has a Gumbel distribution (Greene 2000), making their methodology highly problematic. Nelson and Hellerstein (1997) and Nelson, et al. (2004) have also applied logit models to land-use/land-cover data. To avoid complications
caused by spatial effects, these works have employed sampling of the data sets so that no cells are neighbors (spatially), as well as incorporating spatial lags. Not only is it unclear that, as is claimed, the sampling technique eliminates spatial autocorrelation, but again the use of spatially lagged dependent variables invalidates the necessary assumptions of the model. Munroe, Southworth and Tucker (2001) have applied probit discrete choice analysis to analyze land-cover changes in Honduran rain forests. Though far more rigorous methodologically than the examples discussed previously, this work does not account for many potentially important spatial effects, especially spatial autocorrelation. However, spatial effects in the form of distance measures, spatial statistics, location measures, and spatially-constant error effects were incorporated. The results indicate that spatial (and, incidentally, temporal) effects are highly significant to the model. Examples of land-use/land-cover models for continuous variables have many of the same problems as in the discrete case. An example is Schneider and Pontius (2001), who, in an analysis of deforestation in Massachusetts, develop a logistic regression methodology to model land-use/land-cover change. However, rather than rigorously accounting for spatial effects, they calculate averages of the dependent variable to use as spatial statistics. This is effectively a spatial lag formulation disguised as a spatial statistic, and without the rigor of Anselin's (1988) technique; thus, their methodology is suspect. What is obvious from this discussion is that research concerning the application of spatial econometrics to land-use/land-cover analysis is still in its infancy. In much of the work that has been done in this area, methodological rigor is often ignored, leading to questions concerning the validity of the subsequent model results. 
In other analyses, important spatial effects are not incorporated, often because of analytic or computational difficulties. What should be noted is that any new research into this area (read: this thesis) has little
guidance as to what works or is correct when applying spatial econometrics to land-use/land-cover data, and that any such research is truly testing uncharted waters.

2.3 OTHER MODELS & THEORIES

There is a wide range of techniques and methodologies, separate from spatial econometrics, which have been developed to create models incorporating land-use/land-cover data and/or dynamics. This section will discuss some of these in order to provide a fuller picture as to the current state of such modeling, and to create a field against which spatial econometrics can be compared in the following section. One group of models comprises those which draw primarily from the foundation land-use models presented in Section 2.1. For example, Fujita (1989) uses bid-rent theory to develop a complicated theory of economics, land use, and growth in urban regions. Sonis (2001) combines Christaller's central place theory with operations research techniques to model urban form and growth. The problem with these works is that their theoretical underpinnings are so stringent that they overly restrict the usefulness and practicality of the models. In fact, Fujita's work is purely theoretical and does not offer one empirical example; and though Sonis does discuss an application of his work to Munich, the results and implications of the study are primarily theoretical and intellectual in nature. A modeling technique which draws on central place and bid rent theory, but which attempts to place it in the context of a larger, more complex scheme, is the systems approach of Allen (1997). As noted in Berling-Wolff and Wu (2004), this work builds on previous work which dealt with a dynamic version of the central place theory. In Allen's models, various systems comprised of, among other things, human, spatial, and economic elements are allowed to interact via complex deterministic equations. However, though models of real-world
examples involving actual data were created and compared with reality, the model still suffers from the fact that it was theoretically, as opposed to empirically, driven. Weidlich (2000) took a similar, if less focused, approach to models of population, economic, and urban dynamics. Using random utility theory as the basis for models, complex, deterministic equations were devised to establish the probabilities of various events and interactions among urban and non-urban regions and players. Though the results of this work are intellectually satisfying and seem to make intuitive sense, they too are basically non-empirical in nature with few, if any, practical applications. Drawing upon the ideas of complex, interacting systems used in the work of Allen and Weidlich, but applying them towards a more empirically minded and flexible framework, are so-called agent-based models. Agents are defined as autonomous decision makers which interact with each other and their environment through their decisions (Parker, et al. 2003). Though decision rules and some interaction effects are defined, the models are neither deterministic nor equilibrium-based, and it is only by simulating the interactions of the various players that the results and implications of a scenario can be discerned (Parker, et al. 2003). These models have been applied to a wide range of issues relevant to land use/land cover, including deforestation and urban sprawl (Parker, Berger, and Manson 2001) and the effects of spatial interactions on land-use patterns (Parker and Meretsky 2004). The primary drawback of agent-based models is the fact that they tend to use real-world data for validation after they have been constructed, as opposed to calibration up front. Furthermore, agent-based modeling techniques are still relatively new and lack a body of empirical research to both justify the methods and firmly establish paradigms for them (Parker, et al. 2003).
Another class of models which has gained widespread popularity for modeling urbanization and land use comprises those based on cellular automata (CA). CA models take as their basic elements a regular lattice of identical cells, each of which may exist, during a given time interval, in only one of a finite set of discrete states, and which act autonomously according to a set of rules governing transformations and interactions with other cells and the environment (Candau 2002; O'Sullivan and Torrens 2000). Applied to land-use dynamics, CA models take cells to represent land parcels, and take land-use types as their set of discrete states. Based on a cell's interactions with its neighbors and environment (which includes such things as a cell's physical topography and proximity to transportation networks), as well as with randomized inputs, a grid of cells is allowed to grow over time from a seed map, thus forming land-use growth predictions (Candau 2002; Clarke 1997). To calibrate a CA model, first the set of various growth rules is developed. Then parameters governing the strength and characteristics of those rules are determined by growing the model over a time period from the past, comparing the results with historic land-use/land-cover data, adjusting the parameters, and starting the process over again until the model replicates the historic data within desired accuracy (Candau 2002). CA models have been used successfully in modeling urbanization and land-use change (Clarke, Hoppen, and Gaydos 1997; Clarke and Gaydos 1998). Also, because CA and agent-based models are similar in many ways (they both involve autonomous entities which act or change according to certain rules), these two model forms have been combined in analyses of land-use/land-cover change (Parker, Berger, and Manson 2001). However, CA models suffer from many of the same drawbacks as agent-based models. 
In particular, they focus on finding emergent characteristics of a region while clouding the understanding of the actual interactions of the different elements in the model, which become more difficult to determine as the number of interacting elements increases (Candau
2002; O'Sullivan and Torrens 2000). Another issue with CA models is that extensive data resources are generally required and that calibration can be extremely demanding on computing resources, hindering the applicability and practicality of the models (Candau 2000). Some of the more commercial or practically-minded land-use/land-cover change models combine or incorporate aspects from many of the previously discussed models and theories. Mentioned previously, UrbanSim (Waddell 2002) is, in some respects, closely related to agent-based models. Landis and Zhang's (1998) California Urban Futures 2 (CUF2) model combines CA models with multinomial models of land-use change per hectare (or other unit of observation) to predict future land-use patterns. Another package, What If? (Klosterman 1999), does not draw from any of these theories, instead creating a simplified model of land use based on user-defined suitability indices and socioeconomic, geographic, transportation, and zoning information. The major drawbacks of these models are that they tend to be data intensive, inflexible with regard to their model form, and, most importantly, unable to fully account for spatial dependence in the data. In addition to the aforementioned models and theories, there are many other, less popular, approaches to modeling spatial and land-use/land-cover data. Berling-Wolff and Wu (2004) discuss some of these, such as applications of fractal form analysis, fuzzy-logic theory, and neural network modeling. Chawla, et al. (2001) discuss a data mining approach to spatial modeling which they posit is much faster than spatial econometric techniques. However, all of these techniques, though novel, are fairly specialized and not very accessible to most land-use/land-cover, urban, and transportation modelers and planners.
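The cellular automata calibration procedure described earlier in this section (grow the model from a seed map, compare the result with historic data, adjust the parameters, and repeat) can be sketched with a deliberately simplified toy model. Everything here is an assumption made for illustration and is not the method of any of the cited works: cells have two states (0 undeveloped, 1 urban), the single growth rule converts an undeveloped cell with probability `spread` times the share of its eight neighbors that are urban, and calibration is a plain search over candidate `spread` values scored by cell-by-cell agreement with the historic map.

```python
import numpy as np


def urban_neighbor_fraction(grid):
    """Share of each cell's eight neighbors that are urban (edges padded with 0)."""
    rows, cols = grid.shape
    padded = np.pad(grid, 1)
    total = np.zeros((rows, cols), dtype=float)
    for di in (0, 1, 2):
        for dj in (0, 1, 2):
            if di == 1 and dj == 1:
                continue  # skip the cell itself
            total += padded[di:di + rows, dj:dj + cols]
    return total / 8.0


def grow(seed_map, spread, steps, rng):
    """Grow a copy of the seed map; urban cells never revert in this toy rule."""
    grid = seed_map.copy()
    for _ in range(steps):
        prob = spread * urban_neighbor_fraction(grid)
        converts = (grid == 0) & (rng.random(grid.shape) < prob)
        grid[converts] = 1
    return grid


def calibrate(seed_map, historic_map, steps, candidates, runs=20, rng_seed=0):
    """Return the candidate `spread` whose simulations best match the historic map."""
    best, best_score = None, -1.0
    for spread in candidates:
        rng = np.random.default_rng(rng_seed)
        scores = [np.mean(grow(seed_map, spread, steps, rng) == historic_map)
                  for _ in range(runs)]
        score = float(np.mean(scores))
        if score > best_score:
            best, best_score = spread, score
    return best
```

Even in this toy form, each candidate parameter value requires many full simulation runs, which hints at why calibration of realistic CA models is so demanding on computing resources.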
2.4 MODEL COMPARISON

When selecting a methodological slant for this work, the goals introduced in Chapter 1 were kept in mind. Specifically, the chosen methods and techniques should not only incorporate land use/land cover as well as spatial and temporal effects, but also produce results which provide a deeper understanding of the complex interactions which occur in the modern urbanized landscape. Though many of the models discussed could incorporate land-use information and accommodate complex spatial and temporal interactions, only the spatial econometric techniques have strong potential to provide results which are easily interpreted and which expose subtle interactions between the various elements that are modeled.

Some of the methods with the greatest potential to compete with spatial econometrics, in particular agent-based and cellular automata models, while rich in their complexity, have outputs which cloud the explicit interactions between the various elements. That is, their results present the emergent outcomes of the various interactions which they model, and these outcomes provide little, if any, information as to what is causing them. Though their results might be of some use to planners (they can, for example, expose the possible implications of policy decisions), they cannot easily provide detailed information on what is going on behind the scenes to drive such implications. Furthermore, even when CA or agent-based models replicate reality, there is no guarantee that they do so in a realistic or insightful manner (Candau 2002).

Another problem with the CA and agent-based models and, especially, the complex systems methods of Allen (1997) and Weidlich (2000), is that they do not use empirical data to directly drive the models. For CA and agent-based models, a series of simulations is validated against real data to determine the best model form.
This not only leads to large computational demands for model calibration, but also is highly dependent on the metrics used to quantify the quality of the model fit, which may have more control over the calibrated parameters than the calibration data itself (Candau 2002; Parker, Berger, and Manson 2001; Parker et al. 2003). With the complex systems models, quantitative model calibration is difficult and instead the models are often justified by how well their results apparently replicate reality. Furthermore, these models, especially the complex systems and agent-based ones, often create extremely complicated model forms which hinder their implementation because they require a high degree of specialization and, often, programming skills.

In all of the concerns described above, spatial econometrics has clear advantages over the other techniques. It is data driven; it gives explicit, interpretable results which can clearly expose relationships among various elements; and it is an adaptation of econometric theory, which is in widespread use. Elaborating on this last point, though there are complexities associated with models incorporating spatial econometrics (see Chapter 4), the basic model forms and, especially, the assumptions associated with them are relatively simple, allowing for straightforward interpretation of the model. In fact, because the data is our only direct link to reality, it is preferred that the data provide as much information as it can, and avoiding overly complex model formulations lets the data do the talking in a clear, interpretable manner.

On top of all this, there is also a solid empirical background in applying spatial econometric techniques to models incorporating land-use/land-cover data. Though these models fall outside of the focus of this thesis, they nonetheless provide an excellent foundation from which this work can build. Also, the success of the techniques as applied in these works instills a certain sense of confidence in the methodology.

In sum, spatial econometrics is the ideal modeling theory for this work. It not only allows the data to drive the models, but also is able to elegantly draw important information, including spatial and temporal effects, from the data in a statistically rigorous fashion. Furthermore, it has a solid empirical and theoretical foundation, and does not require a large amount of new, specialized knowledge to understand and implement. Finally, as will be seen, it is flexible enough to allow for a wide range of methodological variations, letting a wider range of information be drawn from the data as successive models are developed, calibrated, and interpreted.
CHAPTER 3: DATA
For this work, there are essentially only three sources of data: LandSat satellite imagery, the United States Census, and a map of the Austin, Texas region. However, from these core sources, a rich set of data is developed using a variety of techniques. This chapter discusses in detail the three sources, as well as the methods used to derive and organize data from them.

As mentioned previously, the advent of detailed satellite imagery and of computational methods to analyze it has allowed for easier access to detailed land-cover/land-use data. Such data is used in this thesis, and sections 3.1 and 3.3 discuss relevant information concerning satellite imagery and how the land-cover data used here was derived from it. Though it is mentioned in the discussion, it is emphasized here that this derivation of the land-cover data was not performed by the author but rather by a University of Texas at Austin professor and her students as part of a graduate class. Also, the use of land cover as opposed to land use to describe this data is due to a subtle distinction discussed in section 3.2.

Sections 3.4, 3.5, and 3.8 discuss the other data used in this work. Section 3.4 covers two spatial statistics, land-cover mix and land-cover entropy, which were derived from the land-cover data. Section 3.5 discusses United States Census data, and section 3.8 discusses cartographic data.

One of the major goals of this thesis is to create models which can draw important spatial and temporal information from data. In order to do this, the various data sources must be spatially and temporally aligned. That is, if two data sets are not available for the same years, and/or if they are spatially referenced to a region differently (e.g., one is referenced to a grid, and the other to irregular polygons), then adjustments and approximations must be made so that the two sources can be used together. Sections 3.6 and 3.7 of this chapter deal with the
methods used to align land-cover and Census data which, in their original form, are spatially and temporally incompatible.

No matter how perfect and how detailed the collection method is, data will always have limitations. Understanding these limitations, including their sources, extent, and possible consequences, is an important part of evaluating the performance and power of models which use that data. Section 3.9 discusses a series of caveats concerning the data used in this work. It should be noted that though this section is rather critical of the data, it is not meant as an attack on its validity; rather, it is included so that a complete picture of the data, which should include a discussion of its limitations and ways it might be improved in the future, is presented. Finally, section 3.10 presents a summary of the data, including tables of descriptive statistics.

3.1 LANDSAT SATELLITE IMAGERY

Satellite data offers excellent opportunities and considerable challenges. A serious and recurring problem for modeling land use has been the lack of spatially detailed data. Remote sensing, imaging technology, and geographical information systems (GIS) are making accurate land-cover maps far more accessible to the researcher and to the public. In particular, global satellite imaging, initiated in the early 1970s, provides highly detailed images which can, with image analysis software, be classified into various land-cover categories. Furthermore, GIS software combines data maps of various types, dramatically facilitating spatial analysis.

The United States launched LandSat 1 in 1972. Passing over Austin every 18 days, this early satellite provided images with 79 m × 79 m pixel resolution. LandSat 4 was launched in 1982 and produced 185 km × 185 km images with 30 m × 30 m resolution and a repeat orbit cycle of 16 days. 1984's LandSat 5 and 1999's LandSat 7 have essentially identical orbit and image characteristics to LandSat 4. These imaging systems work by scanning multiple passes (each representing one pixel) over an area and recording the reflectance of seven distinct spectral bands (Richards and Jia 2000); six of these bands record with 30 m × 30 m resolution, while the seventh, a thermal band, records with 120 m × 120 m resolution (60 m × 60 m for LandSat 7).

There are several sources of potential error which occur during the recording of the LandSat image by its scanning system. Some image distortion results from the satellite's motion (relative to the Earth) during scanning, the overlap of individual scans, and the fact that the scanner is effectively a point rather than a strip (see Figure 3.1). Pre-processing of the image, done before dissemination of the data, corrects many of these errors. Other sources of error, especially apparent when comparing pixels from the same image or across images, are variations in atmospheric conditions, including cloud cover, and the location of the sun vis-à-vis the satellite, which can cause distortions from shadow effects and variations in the overall spectral characteristics of the image.

Any methods used to rectify errors in a satellite image are potentially distortions themselves, and they do not guarantee image accuracy or usefulness when the image is imported into a GIS or other image processing software. Thus, a large amount of post-processing by the consumer may be necessary in order to further correct image distortion. Examples of such post-processing include correcting for atmospheric distortions and registering the image to a specific spatial projection system (Richards and Jia 1999). The latter example is particularly important since accurate registration of an image is necessary for use in a GIS package and, especially, for comparison of different images of the same spatial location.
Specifically, in order to display parts of the Earth on flat surfaces, different projection systems, based on ellipsoids created to mimic the Earth's shape, have been developed (e.g., North American Datum of 1927 (NAD27); see Mugnier (2000) for more information). Unfortunately, while the ellipsoid used in the projection system may mimic the shape of the Earth, localized deviations from this ellipsoid become apparent in an actual image taken of the Earth. In order to use a particular projection system, these deviations of the Earth from the ellipsoid must be corrected during post-processing.

Figure 3.1 Example of how the point nature of a LandSat scanner creates a distorted image. Though the size of the pixels recorded both straight down and at an angle (shown by A and C) is the same, the features contained within the pixel (shown by A and B) are distorted (compressed) for the pixel recorded at an angle. Preprocessing of the LandSat image corrects for such distortion.

The land-cover data used in this work was derived from images taken by the LandSat 4, 5, and 7 satellite systems. Four images of Austin, Texas and the surrounding region were used; these images were taken at 4:30 pm on September 4, 2000; 4:30 pm on April 29, 1997; 4:30 pm on February 8, 1991; and 4:31 pm on January 25, 1983 (Trelogan 2002). The image sections used are all 48.5 km × 55.8 km and have 30 m × 30 m resolution; each image thus contains just over three million pixels of data. The process used to derive the land-cover data from the images is discussed in section 3.3. Since this derivation process required the accurate registration of the images, they were matched to the GRS 1980 projection ellipsoid. This was carried out by slightly distorting different points of the images so that they matched up with Digital Ortho Quarter Quadrangles (DOQQs), which are rectified aerial photographs maintained by the U.S. Geological Survey.[1]

3.2 LAND USE VS. LAND COVER

Throughout this work, the data derived from the satellite images will be referred to as land-cover data. From the description of the data, the term land use, which is more common, would seem to be appropriate and, perhaps, preferred. However, there is an important distinction between the two, which this work will follow. Specifically, land cover refers to a classification system based on what a piece of land appears to be, judging from its visual/spectral properties. In contrast, land use refers to what a piece of land is actually used for by humans (or, if it is not used by humans, how its features should be classified). In most cases, land use and land cover are identical. However, in certain circumstances, there are differences. For example, imagine a parcel of land upon which sits a house that is used as a lawyer's office. This house will almost always, based solely on visual information, be classified as residential, even though its true land use is commercial. Likewise, a park in a city might very well look like land upon which naturally occurring vegetation is growing, even though
[1] For information concerning the USGS Digital Orthophoto Program, one may visit http://mapping.usgs.gov/www/ndop/. DOQQs may be ordered through this site, though many are available free through individual states' websites; e.g., Texas DOQQ files are available through the Texas Natural Resources Information System (http://www.tnris.state.tx.us/DigitalData/doqs.htm).
its true use is human-based (either residential or planted/cultivated land). Generally, and this is the case in this work, the goal of land-cover classification is to be as close to the actual land use as possible. Also, even with nearly complete information, discrepancies between the two can be hard to determine, so a case for using the terms interchangeably can be made. Nevertheless, because the data in this work is derived from spectral (reflected light) information, the term land cover will be used, both to avoid misleading the reader and to emphasize the nature of the data.

3.3 SUPERVISED IMAGE CLASSIFICATION

The process used to derive the land-cover data used in this work from the satellite images is called supervised image classification. The basic idea behind this process is to select portions of the image known to be representative of certain types of land cover and use the spectral data from these sections, known as training data, to determine the land-cover classification of the rest of the image (Richards and Jia 1999). All of the image processing done to derive the land-cover data for this work was performed by Dr. Barbara Parmenter, a professor at The University of Texas at Austin, and her students.

Each image pixel was classified into one of nine land-cover types: water, barren, forest, shrubland, herbaceous natural/seminatural, herbaceous planted/cultivated, fallow, residential, or commercial/industrial/transportation. In the preceding list, the second through fifth classifications are considered uninhabited land, the sixth and seventh rural (agricultural) land, and the final two urban land. To carry out the supervised classification scheme, Dr. Parmenter and her students created a set of training data by using USGS topographic maps and DOQQs to select areas representative of the various land-cover classifications. This training data then was used to generate a set of decision rules by which the
entire LandSat image then was classified. As part of the classification procedure, spatial filtering also was performed in order to remove residual noise from the image processing (Trelogan 2002).

The derived land-cover maps for the four years are shown in Figures 3.2 and 3.3. The rapid urban expansion that occurred in the Austin region in the 1990s is clearly evident when comparing the 1983 map with the other three, especially the 1997 and 2000 maps. Furthermore, comparisons by Dr. Parmenter and her students of parts of the derived data with DOQQs showed that the land-cover classification was fairly accurate (Trelogan 2002). Though these qualitative validations of the land-cover data are promising, a quantitative error analysis should be performed to rigorously test the accuracy of the data. Because such an analysis would be both time consuming and complicated (it would probably involve a systematic, detailed comparison of a large part of the data with DOQQs), it is not carried out for this work.

3.4 DERIVED SPATIAL DATA

In addition to the land-cover data used in this work, two spatial statistics based on the land-cover data, land-cover mix and land-cover entropy, were computed. These allow more information to be drawn out of the derived land-cover data for incorporation into models.

Land-cover mix (from here on called mix) characterizes the dissimilarity of the land cover in a particular area. For a given pixel, mix is an index of adjacent pixels' dissimilarity; it measures how much the land-cover types of a central pixel's neighbors (xi) differ from that of the central pixel itself (x0) (Kockelman 1997; Cervero and Kockelman 1997). For this work, the neighborhood around a pixel was considered to be the eight pixels immediately surrounding it (see Figure 3.4). Mathematically, mix is defined by
Figure 3.2 Land-cover maps derived from LandSat imagery for the years 1983 (left) and 1991 (right).

Figure 3.3 Land-cover maps derived from LandSat imagery for the years 1997 (left) and 2000 (right).
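$$\text{mix}(x_0) = \sum_{i=1}^{8} \frac{1 - \delta_{x_0, x_i}}{8} \tag{3.1}$$

where

$$\delta_{x_0, x_i} = \begin{cases} 1 & \text{if } x_i = x_0 \\ 0 & \text{otherwise} \end{cases} \tag{3.2}$$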
As an average measure of dissimilarity, the mix index ranges from 0 to 1, with a higher numerical value corresponding to less similarity between a given pixel and its neighbors.
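As an illustration only, equations (3.1) and (3.2) can be sketched in a few lines of Python. The function name `land_cover_mix`, the nested-list grid representation, and the example land-cover codes are assumptions made for this sketch; the thesis calculations themselves were carried out in MatLab, and this is not that code. The sketch handles interior pixels only, since edge pixels lack a full eight-pixel neighborhood.

```python
def land_cover_mix(grid, row, col):
    """Mix index of equation (3.1) for an interior pixel of a 2-D grid.

    `grid` is a list of rows of land-cover codes (hypothetical layout).
    """
    center = grid[row][col]
    dissimilar = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue  # skip the central pixel x0 itself
            if grid[row + dr][col + dc] != center:
                dissimilar += 1  # this neighbor contributes 1 - delta = 1
    return dissimilar / 8


# Hypothetical 3 x 3 patch of land-cover codes (values are illustrative only).
patch = [
    [1, 1, 2],
    [1, 1, 3],
    [1, 2, 2],
]
print(land_cover_mix(patch, 1, 1))  # 4 of the 8 neighbors differ -> 0.5
```

A completely homogeneous neighborhood gives 0, and a central pixel that differs from all eight neighbors gives 1, matching the range described above.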
x1 x4 x6
x2 x0 x7
x3 x5 x8
Figure 3.4 Reference diagram for pixel neighborhood used in calculation of land-cover mix statistic.

In subtle contrast, land-cover entropy (from here on called entropy) measures the level of land-cover variety, or balance, of a particular neighborhood (which can be of any size). It essentially measures the heterogeneity of land cover in the neighborhood (Kockelman 1997). Rather than comparing all the pixels in a neighborhood to the central one, as is done in the mix calculations, it compares all of the pixels with each other. The mathematical formulation of entropy, for J land-cover types, is given by:
$$\text{entropy}(x_i) = -\frac{1}{\ln(J)} \sum_{j=1}^{J} P_j \ln(P_j) \tag{3.3}$$
where Pj is the proportion of land cover j in the neighborhood around cell i. Entropy also ranges from 0 to 1, with a higher value corresponding to a greater level of heterogeneity in the land cover of a neighborhood. It equals one when all land-cover types exist in a zone and all their proportions are equal (i.e., perfect balance in land cover). Because of the non-centralized nature of the statistic, it was calculated for 300 m × 300 m neighborhoods (which correspond to the combination grid cells described in section 3.6), as opposed to the nine-cell neighborhoods used for mix.

3.5 UNITED STATES CENSUS DATA

To augment the land-cover data, demographic data from the United States Census was used. The Census is conducted every ten years (at the beginning of each decade) and consists of data collected from a full survey of the population of the country (using a short-form survey), as well as more detailed data derived from a sample of the population (using a long-form survey). This data covers a variety of demographic information, from population counts to income data to travel behavior. Census data is organized into a series of different spatial units. The smallest spatial unit for which the long-form data is available to the public is the block group. Because some of the data used in this work is from the long form, the block group is used as the spatial unit for the Census data.

Though the Census has been conducted for well over a hundred years, only the 1990 and 2000 Censuses are readily available in digital formats. Because of this, only these years are used here. The 2000 data used in this work was provided by the Caliper Corporation (for use in the TransCAD computer program, version 4.5); the 1990 data was downloaded from the Census website (U.S. Census Bureau 2004(b)). Data from four counties in Texas (Bastrop, Hays, Travis, and Williamson) was used, since these counties completely cover the region used for the land-cover data.

3.6 COMBINATION GRID

In order to use the land-cover data and the Census data together, three issues had to be addressed. First was the density of the land-cover data, second was the incompatibility of the spatial organization units for the land-cover and Census data, and last was the fact that the two data sources did not match up in the time dimension. The method used to deal with the first two problems is discussed in this section, while the last problem is addressed in section 3.7.

The large number of data points generated from the satellite data posed a problem for modeling because of computational and memory limitations. As will be seen in the next chapter, many of the models require the creation of weight matrices whose size is the number of observations (rows) by the number of observations (columns). A matrix of roughly 3 million × 3 million would contain around nine trillion cells. The memory required just to store such a matrix would be out of reach of most computing resources.[3] Furthermore, some of the methods require computing eigenvalues, eigenvectors, and inverses of these matrices. The computational time required, as well as the fact that such calculations on very large matrices have questionable accuracy, also made the data set's size problematic.

In order to reduce the size of the data set, a combination grid (the reason for the name will become clear later) was created. This grid is superimposed over the land-cover data set and the number of pixels of each land-cover type in each grid cell is counted so as to create proportion (or percentage) land-cover data. Each pixel in the initial land-cover data set is 30 m × 30 m; each cell in the combination grid is 300 m ×
[3] As a comparative example, a 30,000 × 30,000 matrix (900 million cells) was created as this work developed and required 800 megabytes of storage.
300 m. Thus, exactly 100 of the original data pixels fit in each of the combination grid's cells and the data set was reduced to just under 30,000 cells. Because the combination grid lined up exactly with the initial raster[4] grid, the transfer of the land-cover data to the combination grid is more accurate if done by counting the data, as opposed to using more complex computational methods (such as the overlays discussed below).[5] Because the land-cover data is raster (pixel) based, it is essentially a multi-dimensional matrix (or tensor) of values. Such a mathematical construct is very easy to manipulate with matrix-algebra software, and the counting necessary to transfer the data to the combination grid is fairly straightforward. As such, the transfer of the land-cover data to the combination grid was done using MatLab software (The Mathworks, Inc. 1999). The calculations of entropy and mix, including the transfer of mix to the combination grid to create an average mix statistic (which will still be referred to as mix), were carried out in a similar manner.

Even after the land-cover data was transferred to the combination grid, the fact that the spatial organization of the land-cover and Census data did not match up still posed a problem. That is, the combination grid does not line up exactly with Census block-group boundaries, so there was no obvious correspondence between the data sets for modeling (see Figure 3.5 for an illustration of the problem). To remedy this problem, essentially the same technique as used for the land-cover data was undertaken. The combination grid was superimposed over the Census block groups, and the Census data was allotted to each grid cell based on
[4] A raster is a spatial data type used in GIS in which information is organized as points on a regular lattice (or grid). The other spatial data type commonly used in GIS is the vector type, in which data is stored as lines and polygons formed out of vectors (Richards and Jia 1999).

[5] Actually, the combination grid did not fit exactly over the original data set. The original data sets were 1,621 × 1,864 cells (3,023,408 total), whereas the combination grid had to fit into round multiples of 100. Furthermore, in order to create accurate aggregations of the mix statistic, a buffer of at least one cell had to be removed from the edge of the data set. Thus, 11 rows and 4 columns of edge data were not included in the combination grid.
how much of each block group lay within the cell. For actual count variables, such as population, the fraction of the variable that corresponded to the fraction of the block group in the cell was transferred; for variables representing averages over the block group, such as average household income, the transfer was done by (spatially) weighted summation of the Census values.

Figure 3.5 Illustration of the fact that the combination grid (shown here by thick black lines) does not line up with the Census block groups (represented by the patchwork of grey shapes).

Because the Census block groups are not rasters, the method used to transfer the land-cover values could not be used. Instead, the superimposition and data transfer was achieved via GIS software. Using TransCAD GIS (Caliper Corporation 2004), a vector (as opposed to raster) grid that exactly matched the combination grid was created and superimposed over the block groups (which contained the Census data). Then, using TransCAD's built-in Overlay function, the data transfer was made.

3.7 ESTIMATING OFF-YEAR CENSUS DATA

In order to employ the land-cover and Census data in an accurate and useful manner in this work's models, the years which the data represent must be identical. Because three of the land-cover data panels (1983, 1991, and 1997) cover non-Census years, the models cannot incorporate both land-cover and
Census data sets in their original form. In order to remedy this, approximations of the Census data for the years covered by the land-cover data were made.[6] Since socio-demographic measures, especially population-related ones, often follow an exponential path over time, an exponential form for the approximation was chosen.[7] The mathematical form of the approximation was (see Smith and Sincich (1992) for a motivation for using this simple exponential form):
$$x(t) = \alpha e^{\beta t} \tag{3.4}$$

where α and β are parameters to be estimated. The time index t counts years and is set to zero in the year 1983, so the year 2000 is equivalent to t = 16. The parameters were estimated using the average of the data over all of the cells for the years 1990 and 2000.[8] This gives two equations for the cell averages, denoted $\bar{x}(t)$,

$$\bar{x}(6) = \alpha e^{6\beta}, \qquad \bar{x}(16) = \alpha e^{16\beta} \tag{3.5}$$

which are solved to give the approximations for the parameters:

$$\alpha = \exp\left\{\tfrac{1}{10}\left[16\ln\left(\bar{x}(6)\right) - 6\ln\left(\bar{x}(16)\right)\right]\right\} \tag{3.6}$$

and

$$\beta = \tfrac{1}{10}\left[\ln\left(\bar{x}(16)\right) - \ln\left(\bar{x}(6)\right)\right] \tag{3.7}$$

[6] Technically, the approximation could have been made on the land-cover data instead of the Census data. However, because the land-cover data was more dense with respect to the number of panels that were available; because the discrete nature of the land-cover data might make it more difficult to approximate; and because of the greater centrality the land-cover data has for this work, it was decided that the approximation should be made on the Census data.

[7] The exponential form of some variables, such as population, is well justified (see Smith and Sincich 1992). However, other variables, such as vehicle ownership levels, may not have exponential growth. Nonetheless, the time period considered here (i.e., 17 years, from 1983 to 2000) is short enough that an exponential form might approximate the form of each variable's growth over time. Furthermore, assuming the same form for all variables' change made for greater simplicity in the preparation of the data.

[8] An attempt was made to have an individual approximation for each cell. However, because fluctuations in the data often were large and in varying directions, estimates for the off-years often seemed invalid and unintuitive. This was especially true for the 1983 approximation: because the model was leaving the boundaries of its calibration data points and because of its exponential nature, it often would grow or shrink by ridiculous amounts, leading to patently implausible results.
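The calibration in (3.5) through (3.7) can be checked numerically. The following is a minimal Python sketch, not the code used for the thesis; the function name `calibrate` and its argument names are assumptions made for illustration. It takes the two calibration points at t = 6 and t = 16, exactly as they appear in (3.5), and uses the Population means reported in Table 3.1 as input.

```python
import math


def calibrate(mean_1990, mean_2000, t_1990=6, t_2000=16):
    """Solve the two equations in (3.5) for alpha and beta.

    With the default time indices this reduces to (3.6) and (3.7).
    """
    span = t_2000 - t_1990  # 10 years between the two Census means
    beta = (math.log(mean_2000) - math.log(mean_1990)) / span
    alpha = math.exp(
        (t_2000 * math.log(mean_1990) - t_1990 * math.log(mean_2000)) / span
    )
    return alpha, beta


# Mean Population per cell in 1990 and 2000, taken from Table 3.1.
alpha, beta = calibrate(22.09, 32.65)
print(round(alpha, 2), round(beta, 3))  # approximately 17.47 and 0.039
```

The printed values agree with the Population row of Table 3.1, which suggests the reconstruction of (3.6) and (3.7) is consistent with the reported parameters.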
In order to use any Census data whose units were dollars, a correction for inflation had to be made. All such data from 1990, which was in 1989 dollars, was converted to 1999 dollars, the units of the 2000 data. This was done using a correction based on the Consumer Price Index, which showed that $1.00 in 1989 would be worth $1.34 in 1999 (Federal Reserve Bank of Minneapolis 2004).

Because (3.4) is calibrated at an aggregate level, it effectively treats every cell as if it were an average cell from the data set. In reality, each cell may deviate from average behavior and, in order to create accurate approximations of the Census data, a rectifying factor had to be introduced. The factor used was the amount by which the cell's value deviated from average behavior. Specifically, the approximation actually used was the following transformation of (3.4):
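$$x(t) = D \alpha e^{\beta t} \tag{3.8}$$

where

$$D = \frac{\tfrac{1}{2}\left(x(6) + x(16)\right)}{\tfrac{1}{2}\left(\bar{x}(6) + \bar{x}(16)\right)} = \frac{x(6) + x(16)}{\bar{x}(6) + \bar{x}(16)} \tag{3.9}$$

Here x(t) is the cell's own value and $\bar{x}(t)$ is the average over all cells.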
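To make the role of the rectifying factor D concrete, the sketch below evaluates (3.8) for a single cell. It is an illustrative Python sketch under stated assumptions: the function name `approximate` is hypothetical, the cell values 30 and 50 are invented for the example, and only the Population means (22.09 and 32.65) come from Table 3.1.

```python
import math


def approximate(t, cell_1990, cell_2000, mean_1990, mean_2000):
    """Cell-level approximation x(t) = D * alpha * exp(beta * t) of (3.8)."""
    # Aggregate parameters, as in (3.6) and (3.7).
    beta = (math.log(mean_2000) - math.log(mean_1990)) / 10
    alpha = math.exp((16 * math.log(mean_1990) - 6 * math.log(mean_2000)) / 10)
    # Rectifying factor of (3.9): the cell's level relative to the average cell.
    d = (cell_1990 + cell_2000) / (mean_1990 + mean_2000)
    return d * alpha * math.exp(beta * t)


# Population means from Table 3.1; the cell values 30 and 50 are hypothetical.
pred_1990 = approximate(6, 30.0, 50.0, 22.09, 32.65)
pred_2000 = approximate(16, 30.0, 50.0, 22.09, 32.65)
print(pred_1990 - 30.0, pred_2000 - 50.0)  # equal in size, opposite in sign
```

Because the aggregate curve passes through both calibration means, the predictions at t = 6 and t = 16 are D times those means, and by the definition of D in (3.9) they sum to the cell's own 1990 and 2000 values. The two deviations for any one cell therefore cancel, so their absolute values are equal.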
Table 3.1 gives the averages of the Census data used for calibration of the parameters, as well as the parameters themselves. All of the estimates are what would be expected, except for the Number of Vehicles per Household variable, which is predicted to fall with time. Though this seems non-intuitive, attempts to remedy it by using other forms of the variable, such as vehicle ownership per capita, ran into the same problem. As such, and because the Census data is not expected to be at fault, this variable's approximation was kept as it was.

As a check of the accuracy of (3.8), 1990 and 2000 Census data was predicted using the approximation formula, and then deviations from the true Census values, as well as the ratio of these deviations to the average values,
Variable                                                  1990 Mean   2000 Mean   α         β
Population                                                22.09       32.65       17.47     0.039
% of Population Living in Urban Areas                     0.427       0.559       0.364     0.027
% of Population Living in Rural Areas                     0.572       0.441       0.669     -0.026
Number of Workers                                         11.64       17.44       9.132     0.040
Number of Workers Driving Alone to Work                   8.876       13.39       6.935     0.041
Number of Workers Carpooling to Work                      1.551       2.277       1.232     0.038
Number of Workers Taking Public Transportation to Work    0.439       0.543       0.387     0.021
Number of Workers Walking, Bicycling, or Other to Work    0.445       0.593       0.375     0.029
Number of Workers Working at Home                         0.327       0.632       0.220     0.066
Number of Workers Working Away from Home                  11.31       16.81       8.920     0.040
Average Commute Time to Work                              27.77       29.32       26.87     0.005
Median Household Income (1999 $'s)                        52772       65259       46458     0.021
Per Capita Income (1999 $'s)                              21458       29197       17837     0.031
Number of Houses                                          9.886       13.16       8.329     0.029
Number of Households                                      8.762       12.62       7.039     0.036
Number of Vehicles per Household                          2.059       1.978       2.109     -0.004
Average Household Size                                    2.847       2.812       2.867     -0.001
Median Rent (1999 $'s)                                    664.6       712.4       637.4     0.007
Median House Value (1999 $'s)                             123583      158395      106486    0.025

Table 3.1 Mean values of 1990 and 2000 U.S. Census data and approximation formula parameters (α and β) derived from them.
were calculated. Table 3.2 summarizes these results; absolute values of the deviations are used so that cancellations among the deviations, which would overstate the performance of the approximation, do not occur. As can be seen, most of the predictions are within 25% of the average value, and those that are not, such as Number of Workers Working at Home and Number of Workers Walking, Bicycling, or Other to Work, have such small values in the first place that their deviations should not be too problematic.[9] As a note, the fact that the total deviations for 1990 and 2000 are identical is expected because of the least-squares technique used to estimate the parameters for (3.8).

3.8 GEOGRAPHIC DATA

The last type of data used in this work measured the Euclidean distance between each grid cell and certain geographic features. One of these features is Austin's central business district (CBD), whose center is generally considered to be located close to the State Capitol building. The grid cell which contained the Capitol building was located, and the distance to the CBD was measured as the distance (in km) between the center of each grid cell and the cell containing the Capitol building. Another distance measure concerns the shortest distance between the grid cells and a major highway. Using TransCAD software, the shortest Euclidean distance (in km) was calculated between each cell and the closest of the following highways: U.S. Highway 290, U.S. Highway 79, U.S. Highway 183 (also known as (aka) Research Blvd.), State Highway 71 (aka Ben White Blvd.), Interstate 35, Loop 1 (aka MOPAC), and Loop 360 (aka The Capital of Texas Highway). All
9 It should also be noted that many of the variables whose characteristics are presented here are not used in the models presented in later chapters. The reason for presenting all of these results is two-fold: 1) it was not known, when the data were estimated, whether or not they would be used in the final form of the models; and 2) they give a better understanding of the performance of the method used to estimate the off-Census year data.
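As a rough illustration of the statistics summarized in Table 3.2, the sketch below computes the absolute deviation between predicted and actual values and its ratio to the mean actual value. The function name and the sample numbers are hypothetical (they are not thesis data), and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def deviation_summary(predicted, actual):
    """Summarize |predicted - actual| and its ratio to the mean actual value."""
    abs_dev = [abs(p - a) for p, a in zip(predicted, actual)]
    mean_actual = mean(actual)
    ratio = [d / mean_actual for d in abs_dev]

    def describe(values):
        # Population standard deviation is an assumption of this sketch.
        return {"mean": mean(values), "std": pstdev(values),
                "min": min(values), "max": max(values)}

    return describe(abs_dev), describe(ratio)

# Hypothetical per-cell values (not taken from the thesis data set).
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

abs_stats, ratio_stats = deviation_summary(predicted, actual)

# Signed deviations cancel here (+2, -2, +3, -3), which is why absolute values are used.
signed_mean = mean(p - a for p, a in zip(predicted, actual))
```

In this toy example the signed deviations average to zero even though every prediction is off, whereas the absolute deviations do not cancel.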
                                                        |Δ|: 2000                           |Δ|: 1990
Variable                                                Mean     SD       Min      Max      Mean     SD       Min      Max
Population                                              5.196    10.226   0.000    226.0    5.196    10.23    0.000    226.0
% of Population Living in Urban Areas                   0.090    0.131    0.000    1.000    0.090    0.132    0.000    1.000
% of Population Living in Rural Areas                   0.090    0.131    0.000    0.547    0.090    0.132    0.000    1.000
Number of Workers                                       2.731    5.469    0.000    123.4    2.731    5.469    0.000    123.4
Number of Workers Driving Alone to Work                 2.237    4.454    0.000    92.62    2.237    4.454    0.000    92.62
Number of Workers Carpooling to Work                    0.460    1.227    0.000    26.58    0.460    1.227    0.000    26.58
Number of Workers Taking Public Transportation to Work  0.146    0.693    0.000    29.00    0.146    0.693    0.000    29.00
Number of Workers Walking, Bicycling, or Other to Work  0.162    0.666    0.000    36.02    0.162    0.666    0.000    36.02
Number of Workers Working at Home                       0.164    0.463    0.000    25.98    0.164    0.463    0.000    25.98
Number of Workers Working Away from Home                2.642    5.346    0.000    125.0    2.642    5.346    0.000    125.0
Average Commute Time to Work                            1.418    1.295    0.000    16.69    1.418    1.295    0.000    16.69
Median Household Income (1999 $'s)                      5780     6307     0.000    64964    5780     6307     0.000    64964
Per Capita Income (1999 $'s)                            2679     3182     0.000    61212    2679     3182     0.000    61212
Number of Houses                                        2.135    4.495    0.000    95.47    2.135    4.495    0.000    95.47
Number of Households                                    1.956    4.077    0.000    90.23    1.956    4.077    0.000    90.23
Number of Vehicles per Household                        0.177    0.159    0.000    1.407    0.177    0.159    0.000    1.407
Average Household Size                                  0.121    0.165    0.000    1.810    0.121    0.165    0.000    1.810
Median House Value (1999 $'s)                           15447    28561    0.000    244235   15447    28561    0.000    244235
Median Rent (1999 $'s)                                  114.0    137.6    0.000    965.7    114.0    137.6    0.000    965.7

                                                        |Δ|/mean: 2000                      |Δ|/mean: 1990
Variable                                                Mean     SD       Min      Max      Mean     SD       Min      Max
Population                                              0.159    0.313    0.000    6.923    0.235    0.463    0.000    10.24
% of Population Living in Urban Areas                   0.160    0.235    0.000    1.789    0.210    0.309    0.000    2.340
% of Population Living in Rural Areas                   0.203    0.297    0.000    1.241    0.157    0.231    0.000    1.74
Number of Workers                                       0.157    0.314    0.000    7.076    0.235    0.470    0.000    10.60
Number of Workers Driving Alone to Work                 0.167    0.333    0.000    6.915    0.252    0.502    0.000    10.43
Number of Workers Carpooling to Work                    0.202    0.539    0.000    11.67    0.296    0.791    0.000    17.13
Number of Workers Taking Public Transportation to Work  0.269    1.277    0.000    53.45    0.333    1.578    0.000    66.06
Number of Workers Walking, Bicycling, or Other to Work  0.274    1.123    0.000    60.73    0.365    1.496    0.000    80.88
Number of Workers Working at Home                       0.259    0.732    0.000    41.09    0.500    1.414    0.000    79.40
Number of Workers Working Away from Home                0.157    0.318    0.000    7.439    0.234    0.473    0.000    11.05
Average Commute Time to Work                            0.048    0.044    0.000    0.569    0.051    0.047    0.000    0.601
Median Household Income (1999 $'s)                      0.089    0.097    0.000    0.995    0.110    0.120    0.000    1.231
Per Capita Income (1999 $'s)                            0.092    0.109    0.000    2.096    0.125    0.148    0.000    2.853
Number of Houses                                        0.162    0.342    0.000    7.257    0.216    0.455    0.000    9.657
Number of Households                                    0.155    0.323    0.000    7.151    0.223    0.465    0.000    10.30
Number of Vehicles per Household                        0.089    0.080    0.000    0.711    0.086    0.077    0.000    0.683
Average Household Size                                  0.043    0.059    0.000    0.644    0.043    0.058    0.000    0.636
Median House Value (1999 $'s)                           0.098    0.180    0.000    1.542    0.125    0.231    0.000    1.976
Median Rent (1999 $'s)                                  0.160    0.193    0.000    1.356    0.172    0.207    0.000    1.453

Table 3.2 The absolute difference between predicted and actual 1990 and 2000 U.S. Census data (|Δ|), and the ratio between |Δ| and the mean actual value for the 1990 and 2000 U.S. Census data (|Δ|/mean).
of these highways existed in 1983 and 2000, so no changes to this measurement were made between the years (Machemehl 2004). The reason for using Euclidean measurements of distance, as opposed to measuring network distances (i.e., actual driving distances), is two-fold. First, the computational burden of the driving distance measurement is considerably greater, especially when considering the number of grid cells in the data set (29,946). Second, though the distance traveled by road will always be at least as large as the Euclidean distance, the ratio between the distances for different cells (which is what really matters) would probably not change greatly. So, for simplicity and because the relative differences between the two measurement methods are probably not too great, Euclidean distance measures were used. Ideally, however, network distances would be used. This is an extension of this research that may be undertaken in the future.
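The two Euclidean measures described above can be sketched in a few lines. This is only an illustration: the thesis computed highway distances in TransCAD, whereas the code below assumes that cell centers and highway vertices are available as planar (x, y) coordinates in km, and all names and coordinates are hypothetical.

```python
import math

def euclidean_km(a, b):
    """Straight-line distance between two (x, y) points given in km."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def point_to_segment_km(p, a, b):
    """Shortest distance from point p to the line segment from a to b."""
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return euclidean_km(p, a)
    # Project p onto the segment and clamp the projection to its end points.
    t = ((p[0] - ax) * dx + (p[1] - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return euclidean_km(p, (ax + t * dx, ay + t * dy))

def distance_to_nearest_highway(p, highways):
    """highways: list of polylines, each a list of (x, y) vertices in km."""
    return min(point_to_segment_km(p, line[i], line[i + 1])
               for line in highways
               for i in range(len(line) - 1))

# Hypothetical coordinates (km); not the actual Austin geometry.
cbd_center = (0.0, 0.0)
cell_center = (3.0, 4.0)
highways = [[(0.0, 10.0), (10.0, 10.0)],   # an east-west road
            [(5.0, -5.0), (5.0, 5.0)]]     # a north-south road

d_cbd = euclidean_km(cell_center, cbd_center)
d_hwy = distance_to_nearest_highway(cell_center, highways)
```

Replacing these functions with a shortest-path computation on the road network would give the network distances mentioned above, at a much higher computational cost for roughly 30,000 cells.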
3.9 DATA CAVEATS
As is the case with many data sets, there are opportunities for measurement errors in these data sets. One particularly problematic area concerns the transformation of reflected light values to land-use categories. There are many image processing steps required simply to get a satellite image into a usable form, and then several more to analyze and clean (filter) the image. Each can degrade the quality of the source image as well as the final product. Furthermore, the supervised classification of the satellite image assumes that similar land covers share distinctive spectral and visual characteristics. Abnormal or unusual land covers may be classified incorrectly. For example, the industrial/commercial/transportation land-cover type is predicated on the presence of concrete or asphalt and larger building footprints (relative to residential and undeveloped areas). However, a residential area with many housing complexes
may easily be coded as industrial/commercial/transportation, due to the presence of larger buildings and parking lots, and the resulting lack of yard space. Another example of misclassification can occur because barren and industrial/commercial/transportation land cover may be hard to distinguish. In the 2000 and 1983 data, a large barren area is apparent in the top center area of the region. However, in the 1997 and 1991 data, this area is classified as industrial/commercial/transportation. Upon examination of the 2000 DOQQ corresponding to this area, it was discovered that it was actually a rock quarry. Because this area switched classifications over the years, and because the rock quarry probably existed at least as far back as 1997, an error in classification definitely occurred.

This last example also brings up another problem: how to classify certain questionable areas. For example, the classification of the rock quarry as either barren or industrial/commercial/transportation land cover could be justified, and to choose one over the other would require a more exhaustive definition of the land-cover types than that used in the classification of these images (i.e., one that would venture a bit farther into the realm of land use). Furthermore, even using a more exhaustive rule system does not guarantee that a given piece of land will be classified correctly, especially if it is spectrally very similar to other land-cover types.

Also, some areas may not be just one single type of land cover. For example, a certain 30 m × 30 m piece of land corresponding to one satellite pixel may be part residential, part commercial, and part water. The supervised classification will only choose one of these, which requires a compromise of some degree. Also, because a mixture of land-cover types may exist in a given pixel, the classification may be tricked into selecting an incorrect land-cover classification.
Errors caused by these shortcomings can be analyzed, in part, by comparing the resulting land-cover map with actual photographs (e.g., DOQQs, as done in the analysis of the rock quarry described above) or other, verified land-cover information. In this way, classification models developed from spectral information can be more accurately calibrated. While a qualitative, very broad analysis of the Austin map created by Dr. Parmenter's students suggested good accuracy in the data, a rigorous, quantitative analysis is preferable. No such analysis is performed in this work, however.

Another potential source of error is the fact that the Census data obscure within-block-group variation. For example, if one-half of a block group is within a particular grid cell, exactly one-half of that block group's population will be assigned to that cell. In reality, the part of the block group in the cell may have little or no population. This problem is magnified in sparsely populated areas, because the block groups become large and often have their population concentrated in only a very small proportion of their overall area.10 This limitation may be addressed, to some degree, by using land-cover data to inform the Census homogeneity assumptions for data assignment to grid cells. For example, one may be able to distinguish urban and rural portions within the same block group and/or interpolate a continuous spatial distribution of data (see Mennis 2003). The result may be a much better spatial representation of the Census data, particularly in the peripheral block groups, where much of the land may be undeveloped and the population is small.

Lastly, and most obviously, the approximation of the off-year Census data creates a whole host of potential errors. The model assumes an exponential form of growth and imposes an arbitrary standard for counting time, which may
10 In sparsely populated areas, Census tracts tend to be assigned only one or two block groups. So, in these areas, the size of a block group is on the order of that of a Census tract, and Census tracts themselves already tend to be relatively large in these areas because of the small population.
or may not be correct. Also, it was calibrated at an aggregate level; it would probably be more appropriate to have a different model form for different regions of the data set. Finally, the 1983 approximation extended beyond the limits of the calibration years (1990 and 2000). Because of the exponential form of the model, which may grow or shrink very fast, this could cause a severe misrepresentation of the 1983 data.
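The area-proportional assignment of block-group data to grid cells discussed earlier in this section (one-half of a block group's area receives one-half of its population) can be sketched as follows. The function name, identifiers, areas, and populations are hypothetical; the sketch assumes the overlap areas between cells and block groups have already been computed in a GIS.

```python
def allocate_by_area(block_groups, overlaps):
    """Assign block-group counts to grid cells in proportion to overlapping area.

    block_groups: {bg_id: (total_area, population)}
    overlaps:     list of (cell_id, bg_id, overlap_area) tuples
    """
    cell_population = {}
    for cell_id, bg_id, overlap_area in overlaps:
        total_area, population = block_groups[bg_id]
        share = overlap_area / total_area  # homogeneity assumption within the block group
        cell_population[cell_id] = cell_population.get(cell_id, 0.0) + share * population
    return cell_population

# Hypothetical inputs (areas in km^2); not thesis data.
block_groups = {"bg1": (4.0, 1000.0), "bg2": (10.0, 200.0)}
overlaps = [("cellA", "bg1", 2.0), ("cellA", "bg2", 1.0),
            ("cellB", "bg1", 2.0), ("cellB", "bg2", 9.0)]

cell_population = allocate_by_area(block_groups, overlaps)
```

Because each share is a pure area ratio, the method preserves totals but cannot tell whether the people in a large, sparsely populated block group actually live in the part that falls inside a given cell, which is the caveat raised above.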
3.10 DATA SUMMARY
Now that all of the data and data sources have been discussed, Tables 3.3 through 3.6 give a summary of all of the data used in this work. This final data set is the result of a series of careful transformations that integrated data sources which were spatially and temporally incompatible in their original forms. By creating a data set whose elements share the same reference system in space and time, the methodologies presented in the next chapter can be used directly to capture and cull information from the spatial and temporal qualities of the data. Furthermore, by including both demographic and geographic data, a wide range of interesting and important characteristics of a region can be incorporated into models.
Variable                                                Mean      Std. Dev.  Minimum  Maximum
% of Land Barren                                        0.00754   0.056      0        1
% of Land Commercial                                    0.0955    0.1707     0        1
% of Land Fallow                                        0.1015    0.1984     0        1
% of Land Forest                                        0.1982    0.263      0        1
% of Land Herbaceous Natural/Semi-Natural               0.2049    0.2219     0        1
% of Land Herbaceous Planted/Cultivated                 0.0199    0.0617     0        0.97
% of Land Residential                                   0.2126    0.2494     0        1
% of Land Shrub                                         0.1448    0.1549     0        0.94
% of Land Water                                         0.0151    0.1005     0        1
Average Land-Cover Mix                                  0.3942    0.1573     0        0.7463
Average Land-Cover Entropy                              0.1184    0.0571     0        0.3151
Distance to CBD                                         22.22     10.06      0        46.78
Distance to Nearest Highway                             5.295     3.956      0.02     20.6
Population                                              32.65     61.26      0        1238
% of Population Living in Urban Areas                   0.5591    0.43       0        1
% of Population Living in Rural Areas                   0.4407    0.43       0        1
Number of Workers                                       17.44     33.45      0        611.0
Number of Workers Driving Alone to Work                 13.39     24.36      0        437.2
Number of Workers Carpooling to Work                    2.277     5.81       0        119.6
Number of Workers Taking Public Transportation to Work  0.5426    2.711      0        103.4
Number of Workers Walking, Bicycling, or Other to Work  0.5932    3.465      0        122.6
Number of Workers Working at Home                       0.6323    1.304      0        21.37
Number of Workers Working Away from Home                16.81     32.54      0        607.3
Average Commute Time to Work                            29.32     6.374      0        43.79
Median Household Income (1999 $'s)                      65260     26131      0        170758
Per Capita Income (1999 $'s)                            29198     14095      0        155441
Number of Houses                                        13.16     26.26      0        457.3
Number of Households                                    12.62     25.41      0        459.0
Number of Vehicles per Household                        1.978     0.3073     0        2.620
Average Household Size                                  2.812     0.4654     0        4.430
Median Rent (1999 $'s)                                  712.4     313.9      0        2001
Median House Value (1999 $'s)                           158395    99930      0        733100

Table 3.3 Descriptive statistics for year 2000 data.

Variable                                                Mean      Std. Dev.  Minimum  Maximum
% of Land Barren                                        0.005148  0.04828    0        1
% of Land Commercial                                    0.07403   0.1335     0        0.98
% of Land Fallow                                        0.06276   0.1787     0        1
% of Land Forest                                        0.227     0.2749     0        1
% of Land Herbaceous Natural/Semi-Natural               0.068     0.1179     0        0.95
% of Land Herbaceous Planted/Cultivated                 0.1687    0.238      0        1
% of Land Residential                                   0.1956    0.2001     0        0.94
% of Land Shrub                                         0.1679    0.1815     0        1
% of Land Water                                         0.0309    0.132      0        1
Average Land-Cover Mix                                  0.3846    0.1522     0        0.7275
Average Land-Cover Entropy                              0.1212    0.05424    0        0.301
Distance to CBD                                         22.22     10.06      0        46.78
Distance to Nearest Highway                             5.295     3.956      0.02     20.6
Population                                              29.04     58.12      0        1121
% of Population Living in Urban Areas                   0.5236    0.3726     0.06     1
% of Population Living in Rural Areas                   0.4764    0.3726     0        0.94
Number of Workers                                       15.44     31.21      0        431.9
Number of Workers Driving Alone to Work                 11.84     22.83      0        304.6
Number of Workers Carpooling to Work                    2.029     4.984      0        82.91
Number of Workers Taking Public Transportation to Work  0.5102    2.438      0        76.26
Number of Workers Walking, Bicycling, or Other to Work  0.5451    3.26       0        129.09
Number of Workers Working at Home                       0.5194    1.141      0        34.1
Number of Workers Working Away from Home                14.92     30.35      0        428.2
Average Commute Time to Work                            28.85     5.801      0        43.51
Median Household Income (1999 $'s)                      61232     22987      0        165702
Per Capita Income (1999 $'s)                            26621     12069      0        125502
Number of Houses                                        12.07     26.24      0        457.4
Number of Households                                    11.31     24.46      0        413.4
Number of Vehicles per Household                        2.002     0.2534     0.7      2.97
Average Household Size                                  2.823     0.376      0        4.01
Median Rent (1999 $'s)                                  698       261.8      0        1693
Median House Value (1999 $'s)                           147032    76781      0        535699

Table 3.4 Descriptive statistics for year 1997 data.
Variable                                                Mean      Std. Dev.  Minimum  Maximum
% of Land Barren                                        0.00595   0.05445    0        1
% of Land Commercial                                    0.06759   0.1482     0        1
% of Land Fallow                                        0.08393   0.2023     0        1
% of Land Forest                                        0.2625    0.3141     0        1
% of Land Herbaceous Natural/Semi-Natural               0.1957    0.2297     0        1
% of Land Herbaceous Planted/Cultivated                 0.05721   0.1321     0        1
% of Land Residential                                   0.1243    0.1777     0        1
% of Land Shrub                                         0.1797    0.2056     0        1
% of Land Water                                         0.02303   0.1215     0        1
Average Land-Cover Mix                                  0.3271    0.1477     0        0.7212
Average Land-Cover Entropy                              0.098     0.052      0        0.33
Distance to CBD                                         22.22     10.06      0        46.78
Distance to Nearest Highway                             5.295     3.956      0.02     20.6
Population                                              22.97     45.97      0        886.7
% of Population Living in Urban Areas                   0.4387    0.3526     0        0.89
% of Population Living in Rural Areas                   0.5613    0.3526     0.11     1
Number of Workers                                       12.12     24.49      0        338.9
Number of Workers Driving Alone to Work                 9.249     17.84      0        238
Number of Workers Carpooling to Work                    1.612     3.959      0        65.86
Number of Workers Taking Public Transportation to Work  0.4477    2.147      0        67.16
Number of Workers Walking, Bicycling, or Other to Work  0.4582    2.745      0        108.7
Number of Workers Working at Home                       0.3492    0.7686     0        22.96
Number of Workers Working Away from Home                11.77     23.94      0        337.7
Average Commute Time to Work                            27.92     5.614      0        42.11
Median Household Income (1999 $'s)                      53906     20236      0        145877
Per Capita Income (1999 $'s)                            22129     10032      0        104327
Number of Houses                                        10.17     22.11      0        385.3
Number of Households                                    9.087     19.65      0        332.1
Number of Vehicles per Household                        2.05      0.2598     0.72     3.04
Average Household Size                                  2.843     0.379      0        4.04
Median Rent (1999 $'s)                                  669.2     251.1      0        1624
Median House Value (1999 $'s)                           126690    66159      0        461587

Table 3.5 Descriptive statistics for year 1991 data.

Variable                                                Mean      Std. Dev.  Minimum  Maximum
% of Land Barren                                        0.005718  0.04241    0        1
% of Land Commercial                                    0.07857   0.1446     0        1
% of Land Fallow                                        0.1906    0.2716     0        1
% of Land Forest                                        0.1287    0.2091     0        1
% of Land Herbaceous Natural/Semi-Natural               0.089     0.1561     0        1
% of Land Herbaceous Planted/Cultivated                 0.03532   0.1232     0        1
% of Land Residential                                   0.04503   0.1231     0        0.96
% of Land Shrub                                         0.4074    0.2865     0        1
% of Land Water                                         0.01956   0.1148     0        1
Average Land-Cover Mix                                  0.3047    0.1409     0        0.69
Average Land-Cover Entropy                              0.08673   0.04811    0        0.2917
Distance to CBD                                         22.22     10.06      0        46.78
Distance to Nearest Highway                             5.295     3.956      0.02     20.6
Population                                              16.8      33.62      0        648.6
% of Population Living in Urban Areas                   0.3541    0.285      0        0.72
% of Population Living in Rural Areas                   0.6459    0.285      0.28     1
Number of Workers                                       8.77      17.72      0        245.3
Number of Workers Driving Alone to Work                 6.655     12.83      0        171.3
Number of Workers Carpooling to Work                    1.186     2.912      0        48.44
Number of Workers Taking Public Transportation to Work  0.378     8.813      0        56.69
Number of Workers Walking, Bicycling, or Other to Work  0.3645    2.183      0        86.43
Number of Workers Working at Home                       0.2027    0.4534     0        13.56
Number of Workers Working Away from Home                8.573     17.44      0        246.0
Average Commute Time to Work                            26.72     5.374      0        40.31
Median Household Income (1999 $'s)                      45483     17074      0        123083
Per Capita Income (1999 $'s)                            17296     7841       0        81544
Number of Houses                                        8.095     17.59      0        306.6
Number of Households                                    6.787     14.68      0        248.1
Number of Vehicles per Household                        2.117     0.2682     0.74     3.14
Average Household Size                                  2.871     0.3825     0        4.08
Median Rent (1999 $'s)                                  633       237.5      0        1536
Median House Value (1999 $'s)                           103877    54245      0        378468

Table 3.6 Descriptive statistics for year 1983 data.
CHAPTER 4: METHODOLOGY
The models developed in this work draw from a wide range of established and novel econometric techniques. As they are presented here, they may seem to be a rather disparate collection. However, there is an important underlying connection among them: they all have been developed such that the spatial and temporal aspects of the data are exploited in a powerful manner. The result is a series of models which are able to investigate data across space and time from a variety of viewpoints, allowing a fuller understanding of the complex interrelationships contained within the data.

It is emphasized up-front that this chapter forms the heart of this work. Many of the techniques developed here are new, and even those developed by others have not been incorporated into transportation and regional planning applications. Though the use of panel data modeling in transportation is not new (e.g., see Kweon's (2004) investigation of vehicle crash data), the explicit incorporation of spatial autocorrelation in such a framework is. More importantly, the models presented here are specifically developed such that spatial and temporal characteristics can not only be investigated, but also compared. Taken together, the methodologies presented in this chapter illustrate a variety of ways in which space and time can be included in econometric models and can serve as a starting point for future research.

Though many very different techniques and issues are discussed in this chapter, its organization is rather straightforward. The first part of the chapter, which takes up more than half of it, involves developing models which essentially are extensions, spatially and temporally, of a traditional panel data linear regression framework. Section 4.1 discusses the spatial weight matrix, which is the mathematical construct used in spatial econometrics to incorporate spatial autocorrelation. Section 4.2 develops a panel data linear regression model which
uses the weight matrix presented in the previous section to capture spatial autocorrelation, as well as a special method to estimate it. In a panel data model, including a time-lagged dependent variable among the explanatory variables can often greatly improve predictive power. However, accounting for the statistical complications of such an inclusion in a linear regression model which also incorporates spatial autocorrelation is beyond the scope of this work. Nonetheless, to better understand the benefits of a time-lagged dependent variable as compared with those of spatial effects, a model which incorporates a lagged dependent variable but not spatial autocorrelation is presented in section 4.3.

It might be expected that, for a given model, some parts of a region have different coefficient values than others (e.g., different population models for urban and rural regions). Section 4.4 develops a model which allows for this under the spatial autocorrelation structure developed in section 4.2. The land-cover data used in this work are not continuous over the whole real line; they are proportions confined to the [0,1] interval, so special techniques must be developed to model them. Section 4.5 combines the model presented in section 4.2 with a logistic regression framework to allow the incorporation of spatial autocorrelation in proportions modeling.

In the applications of the models presented in Chapter 5, time-lagged explanatory variables are often used. However, because the data used in this work are not organized in evenly spaced time intervals, a correction in the models for the size of the time lags might be necessary. Such a correction is presented in section 4.6. Also, due to the inclusion of time lags, an infinite data set is technically required for the models to be rigorously accurate. The same issue comes up in the spatial dimension, due to the inclusion of spatial autocorrelation in the models. Section 4.7 discusses this incidental parameters problem in the context of this work.
Another way to incorporate spatial and temporal information into a model is to investigate deviations in space and time. Section 4.8 looks at just that in the context of a framework which is developed as an approximation to a differential equations model. Because of the large size of the data sets used in this work and the computationally demanding estimation methods needed for the models developed here, sampling is required to estimate the models. Section 4.9 presents this sampling methodology, including a discussion of the implications of using it. Finally, section 4.10 summarizes and contextualizes the models and issues discussed in this chapter.
4.1 THE SPATIAL WEIGHT MATRIX
In this work, spatial autocorrelation is accounted for through the use of the spatial weight matrix. The premise behind a spatial weight matrix is that correlations will be proportional to some function of the (Cartesian) distance between each pair of cells in a data set (Anselin 1988). Thus, by recording the values of the distance function, the weight matrix accounts for correlations directly. Specifically, if the distance between two cells, i and j, is denoted as $s_{ij}$, $h(\cdot)$ is some real-valued function, and the total number of cells in the data set (for one time period) is N, then the heart of the spatial weight matrix is constructed as a symmetric matrix $\tilde{W}$:

\[
\tilde{W} =
\begin{bmatrix}
0 & h(s_{12}) & \cdots & h(s_{1N}) \\
h(s_{21}) & 0 & \cdots & h(s_{2N}) \\
\vdots & \vdots & \ddots & \vdots \\
h(s_{N1}) & h(s_{N2}) & \cdots & 0
\end{bmatrix}
\tag{4.1}
\]
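For panel data sets, the spatial weight matrix consists of T copies of (4.1), one block per time period:

\[
W =
\begin{bmatrix}
\tilde{W} & 0 & \cdots & 0 \\
0 & \tilde{W} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \tilde{W}
\end{bmatrix}
\begin{matrix}
t = T \\
t = T-1 \\
\vdots \\
t = 1
\end{matrix}
\tag{4.2}
\]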
While the ordering of the time indices is irrelevant for (4.2) since each block is identical, it is used below for a parameter transformation, so, for consistency, it is retained here. Generally, it is expected that spatial autocorrelation between cells falls with distance, so $h(\cdot)$ is usually taken to be an inverse distance measure or a negative exponential function (Anselin 1988, Elhorst 2003). This work will focus on an inverse distance measure. Specifically,

\[
h(s_{ij}) = \frac{1}{(s_{ij})^{k}}
\tag{4.3}
\]

where $k \ge 1$. Since every model using the spatial weight matrix is estimated using a maximum likelihood routine, k could be estimated; however, this would add a large level of complexity to every model since inverses of (4.1) or (4.2) or eigenvalues of (4.1) are needed. For this work, k will be fixed at two, creating an inverse distance squared measure. The motivation for this is that not only should autocorrelation decrease with distance, but also that as the distance between cells increases, the correlations should quickly become insignificant. That is, the inverse distance squared function ensures, more so than even inverse distance, that the spatial autocorrelation picks up only local effects. Furthermore, the inverse distance squared measure follows conventions from others (e.g., Case 1992; Marsh et al. 2000).
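The construction in (4.1) through (4.3) can be sketched with NumPy as below. This is a minimal illustration rather than the code used in this work: the function names and the three cell-center coordinates are hypothetical, distances are assumed to be planar Euclidean distances between distinct cell centers, and the panel matrix is built as a Kronecker product of an identity matrix with the single-period matrix.

```python
import numpy as np

def inverse_distance_weights(coords, k=2.0):
    """Return the N x N matrix with entries 1 / s_ij**k and a zero diagonal."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))       # pairwise Euclidean distances
    weights = np.zeros_like(dist)
    off_diag = ~np.eye(len(coords), dtype=bool)    # assumes no two cells share a center
    weights[off_diag] = 1.0 / dist[off_diag] ** k
    return weights

def panel_weights(w_tilde, n_periods):
    """Place T copies of the single-period matrix along the block diagonal."""
    return np.kron(np.eye(n_periods), w_tilde)

# Hypothetical cell centers (km).
coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
w_tilde = inverse_distance_weights(coords, k=2.0)
w_panel = panel_weights(w_tilde, n_periods=2)
```

With k = 2, a cell 2 km away receives one quarter of the weight of a cell 1 km away, which is the rapid decay toward local effects described above. For a data set with tens of thousands of cells the dense matrices built here would be very large, which is one motivation for the sampling discussed later in this chapter.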
4.2 PANEL DATA SPATIAL LINEAR REGRESSION MODEL

The model structure used for modeling continuous random variables ($y_{it}$) in panel data sets (without sample selection; see below) is the panel data spatial linear regression model. In the context of this work, a relatively general form of the model for an individual cell i (with N total cells and T total time periods) is:

\[
y_{it} = \gamma\, y_{i,t-1} + \tilde{x}_{it}\,\alpha + z_{it}\,\delta + v_i + \varepsilon_{it}
\tag{4.4}
\]

where $y_{it}$ is the continuous dependent variable at time t (such as population or per capita income), $\tilde{x}_{it}$ is a vector of strictly exogenous variables, $z_{it}$ is a vector of possibly endogenous and/or time-lagged variables, and $v_i$ is an individual-specific effect ~ Normal(0, $\sigma_v^2$).1 $\varepsilon_{it}$ is an error term which, to capture spatial autocorrelation, is specified, in block matrix form, as follows (Anselin 1988):

\[
\varepsilon = \rho W \varepsilon + \eta
\tag{4.5}
\]

where $\eta$ is a (TN × 1) vector of which every element ~ Normal(0, $\sigma^2$) and W is a (TN × TN) block diagonal matrix with T copies of the (N × N) spatial weight matrix $\tilde{W}$ described previously.2 Unfortunately, if $\gamma$ is non-zero (i.e., if time-lagged dependent variables are included in the explanatory variables), then the inclusion of spatial autocorrelation creates a model whose complexity renders it beyond the scope of this work. In the model discussed in this section, spatial autocorrelation will be included but time-lagged dependent variables will not ($\gamma = 0$); a method where $\gamma$ is allowed to be non-zero, but with $\tilde{W} = 0$ (no spatial autocorrelation), is discussed in the following section.

1 As a convention, the tilde (~) will be used throughout this text to mean "is distributed."
2 It is assumed that, unless otherwise noted, when shown in block matrix form, the observations are grouped by time period, not by individual cell. If the latter were used (which is required for many econometric software packages), then W would be altered significantly to capture the spatial autocorrelation and would no longer be block diagonal. Since a matrix inversion involving it is required, and since W will generally be large (TN × TN), this altered form leads to computational inefficiencies when compared to the version specified. Furthermore, the method described below utilizes the eigenvalues of $\tilde{W}$ and thus requires that the data be grouped by time.

Before continuing, some assumptions concerning $z_{it}$ need to be addressed. For this work, $z_{it}$ includes the mix and entropy measures described in Chapter 3. Depending on the model, these measures may seem to be endogenous (e.g., when modeling land cover). Because mix and entropy do not depend directly on the past levels of land cover, one method of getting around this issue is to use time-lagged variables. However, to avoid any problems, throughout this work mix and entropy are assumed to be exogenous. The justification for this assumption is that the information the two variables measure may not be solely tied to land-cover levels; for example, by looking at a satellite picture, the mix and entropy measures could be determined without having to know the specific land cover. Furthermore, the mathematical formulas used to derive mix and entropy from land cover make it very unlikely that, as the models discussed here are linear in nature, endogeneity problems will surface. As such, any correlation problems (with the dependent variable), serial or otherwise, will be disregarded. Then, to simplify notation without any loss of generality, from here on, the $\tilde{x}_{it}$ and $z_{it}$ vectors are concatenated into one vector $x_{it} = [\tilde{x}_{it}, z_{it}]$, and the coefficient vectors likewise are represented by $\beta = [\alpha', \delta']'$.

For the version of (4.4) with $\gamma = 0$, a combination of feasible generalized least squares regression (FGLS) and maximum-likelihood estimation (MLE) can be used. In the following derivation of the model, which closely follows Elhorst (2003 and 2004), the random effects model can be viewed as a variable
parameters model, with the constant variable, $X_1 = [1, 1, \ldots, 1]'$, having a variable coefficient $\beta_1 + v_i$. Furthermore, $\beta$ is partitioned such that $\beta = [\beta_1, \beta_{-1}']'$, the eigenvalues of $\tilde{W}$ are $\omega_i$, the matrix of the eigenvectors of $\tilde{W}$ is $\Omega$, and a parameter $\theta$ is defined such that

\[
\theta^2 = \frac{\sigma_v^2}{\sigma^2}
\tag{4.6}
\]

In addition, a matrix R is defined as an (N × N) diagonal matrix whose ith diagonal element is given by $T\theta^2 + (1 - \rho\omega_i)$. With these assumptions, for a model with K explanatory variables, the concentrated log-likelihood function for the model is given by
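\[
\ln L = \frac{NT}{2}\left[\ln\!\left(\frac{NT}{2\pi}\right) - 1 - \ln\!\left(\sum_{t=1}^{T} d_t' d_t\right)\right]
+ \sum_{i=1}^{N}\left[T\ln(1 - \rho\omega_i) - \frac{1}{2}\ln\!\left(1 + T\theta^2 (1 - \rho\omega_i)^2\right)\right]
\tag{4.7}
\]

where

\[
\begin{aligned}
d_t ={}& (I_N - \rho\tilde{W})\left[Y_t - \bar{Y} - \iota\,\frac{1}{N}\sum_{i=1}^{N}\bar{y}_i - \left(X_t - \bar{X} - \iota\,\frac{1}{N}\sum_{i=1}^{N}\bar{x}_i\right)\beta_{-1}\right] \\
&+ R\left[\bar{Y} - \iota\,\frac{1}{N}\sum_{i=1}^{N}\bar{y}_i - \left(\bar{X} - \iota\,\frac{1}{N}\sum_{i=1}^{N}\bar{x}_i\right)\beta_{-1}\right]
\end{aligned}
\tag{4.8}
\]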
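$Y_t$ is the (N × 1) vector of observations at time t, $\bar{Y}$ is the (N × 1) vector of time averages across $Y_t$, $X_t$ is the (N × (K − 1)) matrix of exogenous variables minus the constant term, $\bar{X}$ is the (N × (K − 1)) matrix of time averages across $X_t$, and $\iota$ is an (N × 1) vector of ones. (4.7) is called the concentrated log-likelihood function because $\beta_1$ and $\sigma^2$ have been factored out of the equation; they can be recovered by

\[
\beta_1 = \frac{1}{N}\sum_{i=1}^{N}\bar{y}_i - \frac{1}{N}\sum_{i=1}^{N}\bar{x}_i\,\beta_{-1}
\tag{4.9}
\]

and

\[
\sigma^2 = \frac{1}{NT}\sum_{t=1}^{T} e_t' e_t
\tag{4.10}
\]

where

\[
e_t = (I_N - \rho\tilde{W})\left[Y_t - \bar{Y} - (X_t - \bar{X})\beta_{-1}\right] + R\left[\bar{Y} - \beta_1\iota - \bar{X}\beta_{-1}\right]
\tag{4.11}
\]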
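The $e_t$ term in (4.10) and (4.11) is the vector of estimated errors (residuals) which correspond to the $\eta$ term from (4.5). To estimate the parameters $\rho$, $\theta^2$, and $\beta_{-1}$, a two-step iterative procedure can be used (Elhorst 2004). First, values for $\rho$ and $\theta^2$ are chosen, then $\beta_{-1}$ is estimated using an ordinary least squares routine of $Y^*$ on $X^*$, where both are stacked with elements given by:

\[
Y_t^* = (I_N - \rho\tilde{W})\left[Y_t - \bar{Y} - \iota\,\frac{1}{N}\sum_{i=1}^{N}\bar{y}_i\right] + R\left[\bar{Y} - \iota\,\frac{1}{N}\sum_{i=1}^{N}\bar{y}_i\right]
\tag{4.12}
\]

and

\[
X_t^* = (I_N - \rho\tilde{W})\left[X_t - \bar{X} - \iota\,\frac{1}{N}\sum_{i=1}^{N}\bar{x}_i\right] + R\left[\bar{X} - \iota\,\frac{1}{N}\sum_{i=1}^{N}\bar{x}_i\right]
\tag{4.13}
\]

where $\iota$ is an (N × 1) vector of ones. Next, given $\beta_{-1}$, $\rho$ and $\theta^2$ are estimated using an MLE routine. The entire routine is iterated until suitable convergence is achieved.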
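The estimation method above relies on the eigenvalues of the single-period weight matrix rather than on repeated determinant or inverse calculations with the full panel matrix. One reason eigenvalues help in spatial error models generally is that the log-determinant of the identity matrix minus a scalar times the weight matrix equals the sum of the logs of one minus that scalar times each eigenvalue, so the eigenvalues can be computed once and reused for every trial value of the autocorrelation parameter. The sketch below checks this identity numerically and draws one period of spatially autocorrelated errors. The five-cell layout, the parameter value, and all variable names (`rho`, `omega`, `eta`) are illustrative assumptions, not values or notation taken from this work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weight matrix: 5 cells on a line, inverse distance squared weights.
n = 5
positions = np.arange(n, dtype=float)
dist = np.abs(positions[:, None] - positions[None, :])
w_tilde = np.zeros((n, n))
mask = dist > 0
w_tilde[mask] = 1.0 / dist[mask] ** 2

rho = 0.2                              # illustrative autocorrelation parameter
omega = np.linalg.eigvalsh(w_tilde)    # eigenvalues of the symmetric weight matrix

# Log-determinant computed directly and via the eigenvalues.
sign, logdet_direct = np.linalg.slogdet(np.eye(n) - rho * w_tilde)
logdet_eig = np.log(1.0 - rho * omega).sum()

# One period of errors satisfying eps = rho * W eps + eta.
eta = rng.normal(0.0, 1.0, size=n)
eps = np.linalg.solve(np.eye(n) - rho * w_tilde, eta)
```

Because the panel weight matrix is block diagonal with identical blocks, the same eigenvalues serve every time period, which is why grouping the data by time period matters for this approach.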
4.3 PANEL DATA LINEAR REGRESSION MODEL WITH TIME-LAGGED DEPENDENT VARIABLE (LSDV MODEL)
Intuitively, for some (continuous) variables it is expected that the present value would be dependent on past levels. For example, the present population of a certain region will probably depend, to some degree, on the population of the region one, five, or even ten years previously. Despite the flexibility of the model discussed in the previous section, it does not allow for the inclusion of a time-lagged dependent variable (it will cause bias in the estimates; see Greene (2000)). In fact, as noted previously, a model incorporating spatial autocorrelation and a time-lagged dependent variable is beyond the scope of this thesis. However, it is of interest to know how much the inclusion of a time-lagged dependent variable benefits a model. To this end, in this section a model
with the form of (4.4) which incorporates a time-lagged dependent variable ($\gamma \neq 0$) but not spatial autocorrelation (i.e., $\tilde{W} = 0$) is discussed. The method presented here to estimate the model uses first differences across time, which eliminates the use of any distance measures (e.g., distance to the CBD), which are time-constant. This means that, in the context of this work, this model does not incorporate any strictly spatial information. Thus, parameter estimates of the model presented here can be compared with those from the panel data spatial linear regression model to better understand the benefits of time-lagged dependent variables as opposed to spatial effects.

The method used to estimate (4.4) without spatial autocorrelation but with a lagged dependent variable is that developed by Kiviet (1995). If a panel data linear regression model includes a random effect, then the so-called least squares dummy variable (LSDV) estimator (defined in equation 4.17), which is often the estimator of choice for linear panel data models, is asymptotically biased (Greene 2000).3 Typically, to estimate a random effects panel data model, a generalized method of moments (GMM) technique is used. Unfortunately, this requires $T \to \infty$ for efficiency, and for small panels, like that used here, the methods are often highly inefficient (Kiviet 1995, Judson and Owen 1997). The idea behind Kiviet's method is to approximate the bias generated by the LSDV method and then subtract it from the incorrectly estimated parameters. Given a set of conditions on the independence of the various components of (4.4), which our model will be assumed to follow (see Kiviet (1995) for more details4), we let $M = [Y_{t-1}, X]$, $\psi = [\gamma, \alpha', \delta']'$, and

\[
v = [I_N \otimes \iota_T]
\begin{bmatrix}
v_1 \\ \vdots \\ v_N
\end{bmatrix},
\tag{4.14}
\]

3 The name "least squares dummy variable" comes from the estimation method (least squares) and the fact that the constant term is effectively represented by a dummy (indicator) variable.
4 Among these conditions is that $E(\varepsilon_{it}\varepsilon_{jt}) = 0$, which is why spatial autocorrelation techniques will not work with this method.
where T is a vector of ones and is the Kronecker product.5 With the data ordered by time period first, then observation, (4.4) can be rewritten, in block form, as
Y = M + v +
(4.15)
Also, by defining

$$A_T = I_T - \frac{1}{T}\,\iota_T \iota_T' \tag{4.16}$$

and $A = I_N \otimes A_T$, the LSDV estimator is (Kiviet 1995, Greene 2000)

$$\hat{\delta} = (M'AM)^{-1} M'AY \tag{4.17}$$
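As a rough illustration of (4.16) and (4.17), the sketch below builds the demeaning matrix explicitly and applies it to simulated data. The function name `lsdv_estimate`, the simulated data, and the row ordering (each cell's $T$ periods stored contiguously) are assumptions made for this example only; the example also omits the lagged dependent variable, so it shows the mechanics of the estimator rather than its bias.

```python
import numpy as np

def lsdv_estimate(Y, M, N, T):
    """Within (LSDV) estimate (M'AM)^{-1} M'AY, with A = I_N kron A_T.

    Assumes rows are stacked cell by cell, each cell's T periods contiguous
    (the ordering implied by I_N kron iota_T); this layout is an assumption.
    """
    iota = np.ones((T, 1))
    A_T = np.eye(T) - (iota @ iota.T) / T   # subtracts each cell's time mean
    A = np.kron(np.eye(N), A_T)             # (NT x NT) demeaning matrix
    MtA = M.T @ A
    return np.linalg.solve(MtA @ M, MtA @ Y)

# Hypothetical check: 50 cells, 4 periods, cell effects but no noise
rng = np.random.default_rng(0)
N, T = 50, 4
X = rng.normal(size=(N * T, 2))
v = np.repeat(rng.normal(size=N), T)        # time-constant effect per cell
beta = np.array([1.5, -0.5])
delta_hat = lsdv_estimate(X @ beta + v, X, N, T)
```

Because $A$ removes anything that is constant over time within a cell, the cell effects drop out and the noiseless coefficients are recovered. For a data set of realistic size, demeaning each cell directly would avoid forming the full $(NT \times NT)$ matrix.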
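The matrix $A$ subtracts off the time-constant part of the variables in $M$; this means that variables which are time constant (such as a constant term or distance to the CBD) cannot be incorporated into the model. However, once the parameters have been estimated (and, as shown below, corrected for bias), the constant term, which will be called $\beta_1$, can be determined as the average residual across time and space in (4.15).

5 The Kronecker, or outer, product of two matrices $A$, of dimension $(N \times M)$, and $B$ is defined as:

$$A \otimes B = \begin{bmatrix} A_{11}B & \cdots & A_{1M}B \\ \vdots & \ddots & \vdots \\ A_{N1}B & \cdots & A_{NM}B \end{bmatrix}$$

With a lagged dependent variable, Kiviet (1995) has calculated the bias of (4.17) to be:

$$\begin{aligned} E(\hat{\delta} - \delta) = {} & -\sigma_\varepsilon^2 \bar{D}^{-1} \Big( \frac{N}{T} (\iota_T' C \iota_T) \left[ 2q - \bar{M}'A\bar{M}\bar{D}^{-1} q \right] + \mathrm{tr}\{\bar{M}'(I_N \otimes A_T C A_T)\bar{M}\bar{D}^{-1}\}\, q \\ & + \bar{M}'(I_N \otimes A_T C A_T)\bar{M}\bar{D}^{-1} q + \sigma_\varepsilon^2 N q'\bar{D}^{-1} q \left[ -\frac{N}{T} (\iota_T' C \iota_T)\, \mathrm{tr}\{C'A_T C\} + 2\, \mathrm{tr}\{C'A_T C A_T C\} \right] q \Big) \\ & + O(N^{-1}T^{-3/2}) \end{aligned} \tag{4.18}$$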
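where $\bar{D} = \bar{M}'A\bar{M} + \sigma_\varepsilon^2 N\, \mathrm{tr}\{C'A_T C\}\, qq'$, $A\bar{M} = E(AM)$, $O(x)$ is a higher order term whose magnitude is asymptotically on the order of $x$, $q = [1, 0, \ldots, 0]'$ is $K$ elements long, and

$$C = \begin{bmatrix} 0 & \cdots & \cdots & \cdots & 0 \\ 1 & 0 & & & \vdots \\ \gamma & 1 & \ddots & & \vdots \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ \gamma^{T-2} & \cdots & \gamma & 1 & 0 \end{bmatrix} \tag{4.19}$$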
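To calculate (4.18) (neglecting the higher order term), a consistent estimate of the parameters must be used. Following Kiviet (1995) and Judson and Owen (1997), the Anderson and Hsiao estimator is used. Letting $\Delta y_{it} = y_{it} - y_{i,t-1}$, then

$$\hat{\delta}_{AH} = (Z_{AH}' X_{AH})^{-1} Z_{AH}' Y_{AH} \tag{4.20}$$

where

$$Z_{AH} = \begin{bmatrix} Z_{AH,1} \\ \vdots \\ Z_{AH,N} \end{bmatrix}, \quad X_{AH} = \begin{bmatrix} X_{AH,1} \\ \vdots \\ X_{AH,N} \end{bmatrix}, \quad Y_{AH} = \begin{bmatrix} Y_{AH,1} \\ \vdots \\ Y_{AH,N} \end{bmatrix} \tag{4.21}$$

and

$$Z_{AH,i} = \begin{bmatrix} y_{i1} & \Delta x_{i3} \\ \vdots & \vdots \\ y_{i,T-2} & \Delta x_{iT} \end{bmatrix}, \quad X_{AH,i} = \begin{bmatrix} \Delta y_{i2} & \Delta x_{i3} \\ \vdots & \vdots \\ \Delta y_{i,T-1} & \Delta x_{iT} \end{bmatrix}, \quad Y_{AH,i} = \begin{bmatrix} \Delta y_{i3} \\ \vdots \\ \Delta y_{iT} \end{bmatrix} \tag{4.22}$$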
The value of (4.18) can be subtracted from (4.17) to get a bias-corrected estimate of $\delta$. Also, because of the form of (4.16), the constant term cannot be included in the estimation. This term is easily estimated as the average residual of the estimated model; i.e.,

$$\hat{\beta}_1 = \frac{1}{NT}\left[\sum_{i=1}^{N}\sum_{t=1}^{T} y_{it} - \sum_{i=1}^{N}\sum_{t=1}^{T} x_{it}\hat{\beta}\right] \tag{4.23}$$
Because of the complicated nature of the corrected $\hat{\delta}$, it is not tractable to compute its standard errors analytically (Judson and Owen 1997). Instead, these are estimated by a bootstrapping procedure (see Greene 2000).6 Given the corrected estimate $\hat{\delta}$, as well as the constant term estimate $\hat{\beta}_1$, the variance of the random effect in (4.15) can be estimated by ML, with the following log-likelihood function:

$$\ln L = -\frac{NT}{2}\ln(2\pi\sigma_\varepsilon^2) + \sum_{i=1}^{N} \ln \int \exp\left(-\frac{1}{2\sigma_\varepsilon^2}\sum_{t=1}^{T} d_{it}^2\right) f(v_i)\, dv_i \tag{4.24}$$

where

$$d_{it} = y_{it} - (M\hat{\delta})_{it} - \hat{\beta}_1 + v_i \tag{4.25}$$

and

$$f(v_i) = \frac{1}{\sqrt{2\pi\sigma_v^2}} \exp\left(-\frac{v_i^2}{2\sigma_v^2}\right) \tag{4.26}$$

Thus, a panel data model with random effects and a time-lagged dependent variable can be estimated using a combination of an analytic estimation method (LSDV), bias correction, and MLE.

6 The idea behind bootstrapping is to take the estimates of the model from the original data set and compare them with estimates from a data set created by randomly sampling observations with replacement from the original data set. Unless it is rigorously determined, there is no assurance that the bootstrapped estimates will be correct, or that they will even converge; however, because they may allow standard errors to be estimated for complicated models for which no analytic means are feasible, they are of great potential use. For more details on bootstrapping as applied to econometric models, see Greene (2000) and Fair (2003).
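The bootstrapping idea used for the standard errors can be sketched generically. In this hedged example the function name `bootstrap_se`, the default of 200 replications, and the choice to resample whole cells (keeping each cell's time series intact) are assumptions; the thesis does not specify these details here.

```python
import numpy as np

def bootstrap_se(estimator, y, x, n_boot=200, seed=0):
    """Bootstrap standard errors for estimator(y, x) -> 1-D parameter array.

    Whole cells (rows of y and x) are resampled with replacement, so each
    cell's time series stays intact; this resampling unit is an assumption.
    """
    rng = np.random.default_rng(seed)
    N = y.shape[0]
    draws = []
    for _ in range(n_boot):
        idx = rng.integers(0, N, size=N)     # cell indices, with replacement
        draws.append(estimator(y[idx], x[idx]))
    return np.std(np.asarray(draws), axis=0, ddof=1)

# Hypothetical usage: standard error of a simple pooled mean
rng = np.random.default_rng(1)
y_demo = rng.normal(size=(400, 3))
x_demo = np.zeros((400, 3, 1))
se = bootstrap_se(lambda yy, xx: np.atleast_1d(yy.mean()), y_demo, x_demo)
```

Any estimator with the same call signature, for example a bias-corrected LSDV routine, could be passed in place of the toy mean used here.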
4.4 PANEL DATA SPATIAL LINEAR REGRESSION MODEL USING PROBIT SAMPLE-SELECTION
Thus far, the models discussed have assumed that the same coefficients for the model in (4.4) are applicable across the entire data set. It is possible, however, to imagine reasons why such an assumption would not hold. For example, when modeling population in a region, different parameter values may apply depending on whether a particular section of that region is urban, suburban, or rural. The justification is that the motivations behind why people move to, and live in, these three areas are probably very different, and such differences could manifest themselves as distinct population models for each area. The purpose of this section is to develop a methodology for allowing multiple coefficient sets for panel data models incorporating spatial autocorrelation. Though models with three or more coefficient sets could be developed, in this work, only models with two separate coefficient sets are considered. The method used to estimate linear regression models with two sets of coefficients is a variant of the incidental truncation model and has the following form (Greene 2000, Vella and Verbeek 1999):7
$$\begin{aligned} y_{it}^{1*} &= x_{it}\beta^1 + v_i^1 + \varepsilon_{it}^1 \\ y_{it}^{2*} &= x_{it}\beta^2 + v_i^2 + \varepsilon_{it}^2 \\ m_{it}^{*} &= w_{it}\theta + \alpha_i + u_{it} \\ m_{it} &= 1[m_{it}^{*} > 0] \\ y_{it}^{1} &= y_{it}^{1*}\, m_{it} \\ y_{it}^{2} &= y_{it}^{2*}\, (1 - m_{it}) \end{aligned} \tag{4.27}$$

where $m_{it}$ is the variable defining the selection, $y_{it}^1$ is the variable of interest when the selection variable $m_{it} = 1$, $y_{it}^2$ is the variable of interest when $m_{it} = 0$, $x_{it}$ and $w_{it}$ are sets of exogenous variables which may or may not contain the same elements (and which may include time-lagged variables), $v_i^k \sim \text{Normal}(0, \sigma_{v^k}^2)$ and $\alpha_i \sim \text{Normal}(0, \sigma_\alpha^2)$ are variables capturing unobserved heterogeneity in the data, and $u_{it}$ and $\varepsilon_{it}^k$ are error terms which may be correlated with each other. Essentially, this model estimates a binary discrete choice model for the selection variable, and then estimates the model for the variable of interest, conditioned on the level of the selection variable.

7 Another method for incorporating model variations is the variable parameters model, in which parameters, rather than having fixed values, are assumed to be distributed over a range of values. (As noted previously, the incorporation of random effects is actually a form of such a model, where only the constant term is allowed to vary.) Though such a framework is flexible, it can be complicated and does not directly lead to a clear understanding of how the parameter values are spread over a region (see Greene (2000) and McFadden and Train (2000) for more details).

4.4.1 Panel data Linear Regression Model using Probit Sample-Selection
For cross-sectional data sets ($T = 1$), the above model in (4.27) is fairly simple to estimate using a two-step Heckit estimator (see Greene 2000). For panel data, the model is complicated by the fact that first differencing, which is generally used in panel data models to remove individual-specific effects, removes neither the individual nor the time-varying effects resulting from the selection equation (Jensen, Rosholm, and Verner 2002). Many consistent estimators for (4.27) have been developed, with varying degrees of efficiency and computational complexity. Jensen, Rosholm, and Verner (2002) survey many of these methods and report that many perform well and are fairly robust to unobserved correlations and misspecifications of the error term.

The model form chosen for sample-selection in this work is a two-step estimator developed from Wooldridge (1995) and Vella and Verbeek (1999). It uses a probit formulation to model the selection variable, and, after including the correct additional information variables from the selection model, the variable of interest is modeled using least squares regression. This subsection discusses this model; in the following subsection, an extension of this model which incorporates spatial autocorrelation in the regression equation is discussed. To estimate the probit sample selection equation, an MLE routine is used. The log-likelihood for the selection model is given by
$$\ln L = \sum_{i=1}^{N} \ln \int \prod_{t=1}^{T} \Phi\left[(2m_{it} - 1)(w_{it}\theta + \alpha_i)\right] f(\alpha_i)\, d\alpha_i \tag{4.28}$$
where $\Phi(\cdot)$ is the standard normal cumulative distribution function (CDF). Because the integration is over only one dimension, it can be performed numerically (using, for example, quadrature techniques; see Greene (2000)) rather than by simulation. Once maximum likelihood estimates of (4.28) are obtained, the expected value of the error term in the selection model is calculated by (Jensen, Rosholm, and Verner 2002):

$$e_{it} = E(\alpha_i + u_{it} \mid w_{i1},\ldots,w_{iT}, m_{i1},\ldots,m_{iT}) = \int \left[\alpha_i + E(u_{it} \mid \alpha_i, w_{i1},\ldots,w_{iT}, m_{i1},\ldots,m_{iT})\right] f(\alpha_i \mid w_{i1},\ldots,w_{iT}, m_{i1},\ldots,m_{iT})\, d\alpha_i \tag{4.29}$$

where, assuming that $w_{it}$ are independent across time, and with $\phi(\cdot)$ denoting the standard normal PDF,

$$E(u_{it} \mid \alpha_i, w_{i1},\ldots,w_{iT}, m_{i1},\ldots,m_{iT}) = m_{it}\,\frac{\phi(w_{it}\theta + \alpha_i)}{\Phi(w_{it}\theta + \alpha_i)} - (1 - m_{it})\,\frac{\phi(w_{it}\theta + \alpha_i)}{1 - \Phi(w_{it}\theta + \alpha_i)} \tag{4.30}$$

and

$$f(\alpha_i \mid w_{i1},\ldots,w_{iT}, m_{i1},\ldots,m_{iT}) = \frac{f(m_{i1},\ldots,m_{iT}, \alpha_i \mid w_{i1},\ldots,w_{iT})}{f(m_{i1},\ldots,m_{iT} \mid w_{i1},\ldots,w_{iT})} \tag{4.31}$$
The denominator in (4.31) corresponds to the likelihood function for a single individual across time, and the numerator is the same as the denominator except without $\alpha_i$ integrated out. The value of $e_{it}$ in (4.29) can be calculated numerically and, given this, the model for the variable of interest is estimated using OLS:

$$y_{it}^{1} = x_{it}\beta^1 + \lambda_1 e_{it} + \lambda_2 \bar{e}_i + \varepsilon_{it}^1 \tag{4.32}$$

where $\bar{e}_i$ is the time-average of the individual-specific residual level and $\lambda_1$ and $\lambda_2$ are parameters to be estimated. Note that $v_i^1$ is absent from the model; either it can be assumed that the second term on the right-hand side of (4.32) replaces $v_i^1$, or its variance parameter can be estimated. If the latter choice is selected, then (4.32) must be estimated using a concentrated log-likelihood function where the random parameter is integrated out (Wooldridge 1995). Also note that the model for $y_{it}^2$ can be estimated easily by noting that the above formulation is correct for the $y_{it}^2$ case if $m_{it}$ and $(1 - m_{it})$ are switched throughout.
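The one-dimensional integral in the selection log-likelihood (4.28) lends itself to Gauss-Hermite quadrature. The sketch below is one possible way to evaluate it and is not taken from the thesis: the function names, the array layout (`m` of shape N by T, `w` of shape N by T by K), the symbol `theta` for the probit coefficients, and the default of 20 quadrature nodes are all assumptions.

```python
import numpy as np
from math import erf

_erf = np.vectorize(erf)

def norm_cdf(z):
    """Standard normal CDF, applied elementwise."""
    return 0.5 * (1.0 + _erf(np.asarray(z) / np.sqrt(2.0)))

def re_probit_loglik(theta, sigma_alpha, m, w, n_nodes=20):
    """Random-effects probit log-likelihood with alpha_i ~ Normal(0, sigma_alpha^2).

    m: (N, T) array of 0/1 selections; w: (N, T, K) regressors.
    Uses E[g(alpha)] ~ sum_k weight_k g(sqrt(2) sigma node_k) / sqrt(pi).
    """
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    alpha = np.sqrt(2.0) * sigma_alpha * nodes       # change of variables
    index = w @ theta                                # (N, T) linear index
    sign = 2 * m - 1                                 # +1 if selected, -1 if not
    arg = sign[:, :, None] * (index[:, :, None] + alpha[None, None, :])
    per_cell = (norm_cdf(arg).prod(axis=1) * weights).sum(axis=1) / np.sqrt(np.pi)
    return np.log(per_cell).sum()

# Hypothetical check: with a zero index and T = 1, each cell contributes ln(0.5)
m_demo = np.array([[1], [0], [1], [1], [0]])
w_demo = np.zeros((5, 1, 1))
ll_demo = re_probit_loglik(np.array([0.0]), 1.3, m_demo, w_demo)
```

The resulting function could be handed to a general-purpose optimizer to obtain the maximum likelihood estimates; the same weights and nodes can then be reused for the integrals in (4.29) and (4.31).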
4.4.2 Panel data Spatial Linear Regression Model using Probit Sample-Selection
It is possible to account for spatial autocorrelation in the sample selection model using the estimation procedure outlined above. Intuitively, it seems reasonable to argue that spatial autocorrelation may exist in the selection model. Unfortunately, using spatial autocorrelation techniques in the selection equation is not easy, primarily because determining the probit error variables which are incorporated in (4.32) is not straightforward. Instead, spatial autocorrelation can be introduced into the regression equation in a fairly straightforward manner (e.g., see Vella and Verbeek (1999)). Though this might leave out important spatial information from the selection half of the model, it still creates a powerful structure in which two sets of model coefficients and spatial autocorrelation are accounted for in a rigorous manner. To incorporate spatial autocorrelation into the regression part of the sample selection model, the same process is used as described above, except that the regression equations for $y_{it}^{1*}$ and $y_{it}^{2*}$ in (4.27) are replaced with the structural form of (4.4) with no time-lagged dependent variable ($\gamma = 0$). So, (4.32) is transformed to

$$y_{it}^{1} = x_{it}\beta^1 + \lambda_1 e_{it} + \lambda_2 \bar{e}_i + (I - \rho W)^{-1}\varepsilon_{it} \tag{4.33}$$
which, given $e_{it}$ and $\bar{e}_i$, can be estimated using the two-step procedure discussed above. Because of the inclusion of the expected value of the probit model error variables, estimates of the standard errors of the parameter estimates calculated using standard methods will be biased (Greene 2000). For models without spatial autocorrelation, Greene (2000) discusses a method through which consistent estimators of the standard errors can be obtained. However, the applicability of this method, which is rather complex, to models with spatial autocorrelation is not clear. As such, in this work, a bootstrapping procedure will be used to estimate the standard errors.8
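To make the role of the $(I - \rho W)^{-1}$ error structure concrete, the following sketch estimates a single cross-section with spatially autocorrelated errors by combining least squares (for the coefficients, given $\rho$) with a likelihood search over $\rho$. It is a generic spatial error model sketch, not the thesis's exact routine: the function name `spatial_error_fit`, the grid search, the ring-shaped weight matrix, and the simulated data are all assumptions.

```python
import numpy as np

def spatial_error_fit(y, X, W, rho_grid):
    """Grid search over rho for y = X beta + (I - rho W)^{-1} eps.

    For each rho the data are filtered by (I - rho W), beta is found by
    least squares, and the concentrated log-likelihood (up to a constant)
    ln|I - rho W| - (n/2) ln(sigma^2) is evaluated.
    """
    n = len(y)
    I = np.eye(n)
    best = None
    for rho in rho_grid:
        B = I - rho * W
        ys, Xs = B @ y, B @ X                        # spatially filtered data
        beta, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
        resid = ys - Xs @ beta
        sigma2 = resid @ resid / n
        _, logdet = np.linalg.slogdet(B)
        ll = logdet - 0.5 * n * np.log(sigma2)
        if best is None or ll > best[0]:
            best = (ll, rho, beta)
    return best[1], best[2]

# Hypothetical data: 400 cells on a ring, each with two equally weighted neighbors
rng = np.random.default_rng(2)
n = 400
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.5
X = np.column_stack([np.ones(n), rng.normal(size=n)])
eps = rng.normal(size=n)
y = X @ np.array([1.0, 2.0]) + np.linalg.solve(np.eye(n) - 0.5 * W, eps)
rho_hat, beta_hat = spatial_error_fit(y, X, W, np.linspace(-0.9, 0.9, 37))
```

In the sample-selection setting, the columns of `X` would also include the probit correction terms $e_{it}$ and $\bar{e}_i$. The log-determinant is the expensive step for large $N$, which is consistent with the computational concerns raised later in the chapter.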
8 Again, whether or not the bootstrapping procedure actually converges to the correct standard error estimates is not rigorously proven here; it will be seen, in the applications presented in Chapter 5, that the estimates seem to be incorrect. A defense (and good discussion) of using bootstrapping in sample selection models is found in Hill, Adkins, and Bender (2003).

4.5 PANEL DATA SPATIAL LOGISTIC REGRESSION MODEL
Because it must be contained on the [0,1] interval, as opposed to $(-\infty, \infty)$, proportional land-cover data cannot be modeled using spatial linear regression as described previously. Instead, a modification of that model, which is an extension of the logistic regression method that Greene (2000) proposes for modeling proportions data, is used to model the land-cover data in this work. It is noted that though this work uses Greene's method as inspiration, the methodology is completely new. Also, the method developed here models binary data, so for this work it will be applied to modeling one land-cover type versus another (for example, urban vs. non-urban). Towards the end of this section an extension to allow further binary splits among the data will be discussed.9 The method begins by using a transformation that will take measurements on $(-\infty, \infty)$ to the [0,1] interval. Any CDF defined on the whole real number line satisfies this property and, for this work, the logistic distribution CDF will be used:

$$F(x_{it}\beta) = \frac{e^{x_{it}\beta}}{1 + e^{x_{it}\beta}} \tag{4.34}$$
Following Greene (2000), the percentage land cover is treated as a series of draws from a Bernoulli distribution. This is justified by the fact that the land-cover percentage for a given cell is constructed from counts of smaller cells, each having one of two specific land-cover types; thus, this model derivation assumes that the smaller cells represent Bernoulli draws. Then the percentage of a given land-cover type, following the conventions of the discussion of the panel data spatial linear regression model, is written as

$$P_{it} = F(x_{it}\beta) + v_i + \eta_{it} \tag{4.35}$$

where

$$\eta = \rho W \eta + \varepsilon \tag{4.36}$$

and

$$E(\varepsilon_{it}) = 0, \quad \mathrm{Var}(\varepsilon_{it}) = F(X\beta)(1 - F(X\beta)) \tag{4.37}$$

9 Another method which can be used to model proportions data incorporating spatial autocorrelation is the panel data spatial probit discrete choice model. This model was developed for this work, but attempts at applying it with data were thwarted by convergence problems in the MLE routine. Because of these problems, it is not presented in this chapter; however, as it may be of future interest, the methodology of the model is presented in the Appendix.

(4.35) can be approximated by a Taylor expansion as (Greene 2000):

$$F^{-1}(P_{it}) \approx x_{it}\beta + \frac{\eta_{it}}{f_{it}(x_{it}\beta)} + \frac{v_i}{f_{it}(x_{it}\beta)} \tag{4.38}$$
where $f_{it}(\cdot)$ is the probability density function (PDF) of the logistic distribution. (4.38) is a heteroscedastic version of the spatial regression model. In order to correct for the heteroscedasticity, a variance normalizing transformation is applied to (4.38):

$$QF^{-1}(P) \approx QX\beta + \tilde{v} + (I - \rho W)^{-1}\tilde{\varepsilon} \tag{4.39}$$

where $Q$ is a $(TN \times TN)$ diagonal matrix with

$$Q_{it,it} = \sqrt{F(x_{it}\beta)(1 - F(x_{it}\beta))}, \tag{4.40}$$

$$\tilde{v}_{it} = \frac{v_i}{Q_{it,it}} \tag{4.41}$$

and

$$\tilde{\varepsilon}_{it} = \frac{\varepsilon_{it}}{Q_{it,it}} \tag{4.42}$$

The error term in (4.42) is distributed Normal(0,1), so, except for the random effect (4.41), which varies across time, the model in (4.39) is exactly the same as the spatial linear regression model. In order to make the random-effects term work correctly, the error components of (4.35) are transformed:

$$v_i + \eta_{it} \;\rightarrow\; v_i Q_{it,it} + \eta_{it} \tag{4.43}$$

in which case (4.39) transforms to

$$QF^{-1}(P) \approx QX\beta + v + (I - \rho W)^{-1}\tilde{\varepsilon} \tag{4.44}$$
which is of the required form. The transformation in (4.43) is curious and controversial, to say the least. The essential problem is which model form, (4.35) or (4.44), should be selected to have the true random effect; either may be correct, and to select (4.44) for convenience is possibly no more restrictive than to accept the assumptions in (4.37). It should be noted that (4.41) may be written out, using power expansions, as

$$\frac{v_i}{Q_{it,it}} = v_i\left[2 + \frac{(x_{it}\beta)^2}{4} + \frac{(x_{it}\beta)^4}{192} + \frac{(x_{it}\beta)^6}{23040} + \ldots\right] \tag{4.45}$$

which, if all terms involving $(x_{it}\beta)$ are dropped, would form a model of the same form as (4.44). Such a model would have a bias in its estimate of the random effect and, though this would make the estimates inefficient, it would not affect the consistency of the estimate of $\beta$. So, at the very least, the form of (4.44) can be used to consistently model the percentage of land-cover variables. To estimate (4.44), the percentage land-cover variables are transformed by the logit function (Greene 2000):

$$F^{-1}(P_{it}) = \ln\left(\frac{P_{it}}{1 - P_{it}}\right) \tag{4.46}$$
in which, to avoid infinities, a small constant is added to or subtracted from $P_{it}$ if it equals 0 or 1. Then, with a small adjustment, the iterative process described earlier to estimate the spatial linear regression model is used. The adjustment is that, because $Q$ requires estimated parameters, (4.44) cannot be estimated directly; however, estimates from (4.35) can be used to obtain a consistent estimate of $Q$, from which (4.44) can then be estimated (Greene 2000).

The above technique works only for a binary case. Thus, distinctions such as urban versus non-urban can be modeled, but residential versus commercial versus non-urban cannot. In order to incorporate more land-cover distinctions into the model, the method described above is performed iteratively. Specifically, as an example, if the percentage of a certain cell, $i$, that is residential at time $t$ is $P_{it}^{Res}$, and the percentage of that same cell that is urban in the same time period is $P_{it}^{Urban}$, then the quantity

$$P_{it}^{Res|Urban} = \frac{P_{it}^{Res}}{P_{it}^{Urban}} \tag{4.47}$$

can be modeled by the methods described above. However, because

$$P_{it}^{Res|Urban} \propto \left(P_{it}^{Urban}\right)^{-1} \tag{4.48}$$

estimates of $(P_{it}^{Urban})^{-1}$ should be used to instrument the model. However, incorporating this correctly with respect to the logistic transformation is difficult. Instead, $(P_{it}^{Urban})^{-1}$ can be inserted among the explanatory variables. The motivation, though it does not rigorously prove that the variable correctly and fully instruments the model, is that adding this variable to the explanatory variables allows the model to incorporate the information it contains.10 Doing this, and using the actual estimates from the urban model, creates the following model form:
$$F^{-1}\left(P_{it}^{Res|Urban}\right) \approx x_{it}\beta_1 + \frac{1 + \exp\left(x_{it}^{Urban}\beta^{Urban} + \varepsilon_{it}^{Urban}\right)}{\exp\left(x_{it}^{Urban}\beta^{Urban} + \varepsilon_{it}^{Urban}\right)}\,\beta_2 + \frac{\eta_{it}}{f_{it}(x_{it}\beta)} + \frac{v_i}{f_{it}(x_{it}\beta)} \tag{4.49}$$

where the full coefficient vector is defined by $\beta = [\beta_1, \beta_2]$.

10 At the very least, it is known that even introducing $(P_{it}^{Urban})^{-1}$ into the model in this form at least preserves the property that as $P_{it}^{Urban}$ increases, $P_{it}^{Res|Urban}$ is expected to decrease. Furthermore, including the instrument variable is better than leaving it out, as it surely contains information on $P_{it}^{Res|Urban}$, especially as statistical corrections, given below, are included to maintain the validity of the model (Greene 2000).

The total error term of (4.49), including errors from both models, is complicated and seemingly intractable. However, if it is assumed that the error terms from the two models are independent, and if $(P_{it}^{Urban})^{-1}$ is expanded using a Taylor expansion about $\varepsilon_{it}^{Urban} = 0$, then an analysis of the error contribution of the added term can be performed:

$$\left(P_{it}^{Urban}\right)^{-1} \approx 1 + \exp\left(-x_{it}^{Urban}\beta^{Urban}\right)\left(1 - \varepsilon_{it}^{Urban}\right) \tag{4.50}$$

where terms of $(\varepsilon_{it}^{Urban})^2$ and higher have been dropped. As can be seen, if the error from the Urban model is ignored in the estimation of (4.49), then the variance of the error terms for the Res model will be biased high and the parameter estimates will be inefficient because of the added heteroscedasticity. To account for this, the weight matrix, $Q$, used for this model should be transformed such that
Qit ,it ) Wi exp 2 xitUrban Urban + = Urban Urban 1 F xitUrban Urban F xit F xitUrban Urban 1 F xitUrban Urban
)(
)(
it
))
))
(4.51)
1 / 2
Var v with
Urban i
)exp(2 x
Urban
Urban
1 + F ( xit )(1 F ( xit ))
N ) 1 Wi = (1 W ) j =1
ij
(4.52)
which would, again, invalidate the time-independence of the random effect of the Res model. However, an argument similar to the one used to justify the form of $v_{it}$ in (4.43) could be applied to (4.49) in combination with (4.51) to justify the time-independence of the random effect in the Res model; then (4.49) can be estimated using the weight in (4.51). It should be noted that though the Taylor expansion in (4.50) was used to justify the form of (4.51), the actual estimate of $(P_{it}^{Urban})^{-1}$, as opposed to its approximation, is used in the model estimation.

Also, though this discussion focused on urban and residential models, any series of binary splits of the land-cover variables could be used. Intuitively, one would expect that the splits should be logical; a split of "Residential and Forest" versus "Not Residential and Forest" seems neither to make sense nor to be of any use on a practical level. Furthermore, going beyond two binary levels, though possible, would probably be pushing the model assumptions too far (especially concerning the random-effects term). In this work, no more than two binary levels will be modeled.
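Two of the computational ingredients of this section, the logit transformation with its boundary adjustment and the variance-normalizing weights, can be sketched briefly. The function names and the size of the small constant (`eps = 1e-4`) are assumptions for illustration; the thesis does not state the constant it used. The weight is taken as the square root of $F(1 - F)$, whose reciprocal equals $2\cosh(x\beta/2)$, the function whose power series begins $2 + (x\beta)^2/4 + \ldots$.

```python
import numpy as np

def logit_transform(P, eps=1e-4):
    """Logit of a proportion, moving exact 0s and 1s inward by eps (assumed value)."""
    P = np.asarray(P, dtype=float)
    P = np.where(P <= 0.0, eps, np.where(P >= 1.0, 1.0 - eps, P))
    return np.log(P / (1.0 - P))

def normalizing_weight(xb):
    """Diagonal weight sqrt(F(1 - F)) for the logistic CDF F evaluated at xb."""
    F = 1.0 / (1.0 + np.exp(-np.asarray(xb, dtype=float)))
    return np.sqrt(F * (1.0 - F))

# Hypothetical usage on a few proportions and index values
z_vals = np.linspace(-3.0, 3.0, 7)
transformed = logit_transform([0.0, 0.25, 0.5, 1.0])
weights = normalizing_weight(z_vals)
```

In an iterative scheme, the index values passed to `normalizing_weight` would come from a first-round estimate of the coefficients and be updated as the estimates change.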
4.6 ACCOUNTING FOR DIFFERENCES IN TIME LAGS

If time-lagged variables are contained in the $x_{it}$ ($z_{it}$) vector in (4.4), (4.27), and (4.34) and the time lag between successive panels is not the same for each time period (as is the case with the data set in this work), then the use of the same coefficient for each lagged variable is problematic. The issue is that, under the assumptions discussed previously, it would be expected that the effect of the time-lagged variables would change for different sized time lags. Thus, a variable lagged by three years would be expected to influence the dependent variable differently than one lagged by eight years. In order to account for this, the vector of coefficients for the models must be altered so that it is time-dependent. For this work, this time-dependence is assumed to be proportional to a constant raised to the power of the time difference. Specifically, if the time differences between the various panels are

$$\Delta_i = t_{\text{panel } i} - t_{\text{panel } i-1} \tag{4.53}$$

then a time-adjustment vector can be constructed:

$$\phi = \left[\tilde{x},\; a^{\Delta_T}\tilde{z},\; \tilde{x},\; a^{\Delta_{T-1}}\tilde{z},\; \ldots,\; \tilde{x},\; a^{\Delta_1}\tilde{z}\right] \tag{4.54}$$

where $\tilde{x}$ is a vector of ones with as many elements as the number of non-lagged variables, $\tilde{z}$ is a vector of ones with as many elements as the number of lagged variables, and $a$, called the time adjustment factor, is a parameter to be estimated. It is also assumed that the most recent time period is located at the top of the stacked variables. Then, the $k$th term in the vector of coefficients is transformed such that

$$\beta_k \rightarrow \beta_k \phi_k \tag{4.55}$$
For the models discussed thus far, the transformation in (4.55) is straightforward and does not alter the estimation procedure in a significant way, except that the time adjustment parameter, a, is added to those estimated in the MLE procedure.
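A minimal sketch of the time adjustment follows, under the assumption that the multipliers are stored as one row per panel (most recent first) rather than as one long stacked vector. The function name `time_adjustment`, the example gaps of three and eight years, and the coefficient values are hypothetical.

```python
import numpy as np

def time_adjustment(gaps, n_nonlagged, n_lagged, a):
    """Multipliers for the coefficient vector, one row per panel.

    gaps: time differences ordered from the most recent panel to the oldest.
    Non-lagged variables get a multiplier of 1; lagged variables get a**gap.
    """
    rows = [np.concatenate([np.ones(n_nonlagged), np.full(n_lagged, a ** gap)])
            for gap in gaps]
    return np.vstack(rows)

# Hypothetical example: two panels with lags of 3 and 8 years,
# two non-lagged variables and one lagged variable
adj = time_adjustment([3, 8], n_nonlagged=2, n_lagged=1, a=0.9)
beta_demo = np.array([0.4, -1.0, 0.7])
beta_by_panel = adj * beta_demo          # adjusted coefficients for each panel
```

With $a < 1$, a variable lagged by eight years receives a smaller effective coefficient than one lagged by three years; in estimation, `a` would be one more parameter in the likelihood search.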
4.7 THE TEMPORAL AND SPATIAL INCIDENTAL PARAMETERS PROBLEM

With time-lagged variables, the so-called incidental parameters problem (IPP) becomes an issue. In effect, the IPP raises the point that with a lagged variable, call it $x_{t-1}$, in a finite (i.e., realistic) model, there must be, at some point, an initial condition, $x_0$, which cannot be assumed to exist a priori in nonequilibrium models without creating bias in the results. As noted in Lancaster's (2000) excellent survey of the subject, a wide variety of solutions to the problem have been posed, but none has gained broad acceptance in practice. Part of the problem is that many of the proposed solutions are difficult to implement and make estimation of a model much more cumbersome. Even though spatial lag variables are not being used explicitly in any of the models discussed above, the inclusion of spatial autocorrelation effects leads to the spatial-dimension analogue of the time-dimension IPP. Specifically, there is an edge effect: the cells which lie beyond the data should be included in the spatial autocorrelation term. Anselin (1988) discusses this problem in detail. Some of the proposed solutions include wrapping the data set around to correlate opposite sides with each other, creating an estimate for the data that lies beyond the data set, and truncating the data. None of these has been found to be a particularly suitable solution to the problem (Anselin 1988). Unfortunately, no other tractable ways to deal with this problem have been developed as of yet. Because of the difficulties in dealing with the spatial and temporal IPP, both will be ignored in this work. An extension of this work would be to investigate more fully the implications of this for the model results. As a note, a method to account for the temporal IPP is included in the development of the multinomial panel data spatial probit model in the Appendix.
4.8 ESTIMATING DIFFERENTIAL EQUATION FRAMEWORK THROUGH PANEL DATA MODELS
Applying differential equations to human or society based variables is difficult to do accurately. Nonetheless, using a differential equations framework may provide interesting results which complement those from the models described above, especially since they deal with deviations in the data across various dimensions. To see this, consider the following model:

$$y_{it}(r,t) = x_i(r)\beta_r + x_t(t)\beta_t + \kappa\sqrt{r} + e^{\mu t} + \psi_t(t) + v_i(r) \tag{4.56}$$

where $r$ is a positive spatial measurement with respect to a single reference point (e.g., distance from the CBD), $t$ is a time measurement, and $v_i(r)$ and $\psi_t(t)$ are nonlinear effects constant across the $t$ and $r$ dimensions, respectively. Also, it is assumed that $x_{it}(r,t) = x_i(r) + x_t(t)$; that is, the level of $x$ in the individual (cell) dimension depends only on its spatial location, and its level in the time dimension depends only on the time.11

11 Note that the form $x_{it}(r,t) = a x_i(r) + b x_t(t)$ is supported by the above model, with $a$ and $b$ being absorbed into the parameters $\beta_r$ and $\beta_t$.

(4.56) is motivated by the fact that it seems possible that $y_{it}(r,t)$ would depend not only on $x_{it}(r,t)$, but also explicitly on $r$ and $t$, including some possibly non-linear effects (which will later be assumed to be stochastic). The motivation for using $\sqrt{r}$ follows from previous results in Frazier and Kockelman (2003), in which it was found that the effects of distance are attenuated as the distance becomes large. The motivation behind using $e^{\mu t}$ is that many demographic variables may be described by exponential functions in time (Smith and Sincich 1992). Taking derivatives we get:
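$$\frac{\partial y_{it}(r,t)}{\partial r} = \beta_r \frac{\partial x_{it}(r,t)}{\partial r} + \frac{\kappa}{2\sqrt{r}} + \frac{\partial v_i(r)}{\partial r} \tag{4.57}$$

and

$$\frac{\partial y_{it}(r,t)}{\partial t} = \beta_t \frac{\partial x_{it}(r,t)}{\partial t} + \mu e^{\mu t} + \frac{\partial \psi_t(t)}{\partial t} \tag{4.58}$$

As discrete intervals must be used, (4.57) and (4.58) can be reworked using differentials:

$$\Delta_i y_{it}(r,t) = \beta_r \Delta_i x_{it}(r,t) + \kappa \frac{\Delta_i r}{2\sqrt{r}} + \Delta_i v_i(r) \tag{4.59}$$

and

$$\Delta_t y_{it}(r,t) = \beta_t \Delta_t x_{it}(r,t) + \mu e^{\mu t} \Delta_t t + \Delta_t \psi_t(t) \tag{4.60}$$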
(4.59) and (4.60) are reminiscent of the first-differencing often done in panel data models to remove group-wise effects (see Greene 2000). In order to estimate the parameters in (4.56) using these equations, first differencing will be carried out and a few assumptions will be made. First, $v_i(r)$ and $\psi_t(t)$ will be allowed to be stochastic with distributions such that $v_i(r) \sim \text{Normal}(0, \sigma_v^2/2)$ and $\psi_t(t) \sim \text{Normal}(0, \sigma_\psi^2/2)$ (note that this does not cost any loss of generality with respect to the variances of the distributions). Also, it will be assumed that the first differenced versions of (4.59) and (4.60) have uncorrelated error terms. So, using this notation, the models to be estimated become:

$$\Delta_i y_{it} = \beta_r \Delta_i x_{it} + \kappa \frac{\Delta_i r}{2\sqrt{r}} + \Delta_i v_i + \varepsilon_{it,r} \tag{4.61}$$

and

$$\Delta_t y_{it} = \beta_t \Delta_t x_{it} + \mu e^{\mu t} \Delta_t t + \Delta_t \psi_t + \varepsilon_{it,\tau} \tag{4.62}$$

where $\Delta_i v_i \sim \text{Normal}(0, \sigma_v^2)$, $\Delta_t \psi_t \sim \text{Normal}(0, \sigma_\psi^2)$, $\varepsilon_{it,r} \sim \text{Normal}(0, \sigma_r^2)$, $\varepsilon_{it,\tau} \sim \text{Normal}(0, \sigma_\tau^2)$, and $E(\varepsilon_{it,r}\,\varepsilon_{it,\tau}) = 0$. Given the above assumptions, the parameters for the model can be estimated using standard methods.12 (4.61) becomes just a linear panel data model with random effects, and can be estimated using an FGLS algorithm (see Greene 2000). On the other hand, (4.62) is a nonlinear model and its parameters will be estimated using MLE methods. Its log-likelihood function is:
ln L = NT ln 2 2 2
1 T t =1 ln 2 2 2
exp
((
t2
2 2
)+ ( )
1 2 2
N i =1
2 it
) f ( )d ( )
t t t t
(4.63)
with

$$d_{it} = \Delta_t y_{it} - \beta_t \Delta_t x_{it} - \mu e^{\mu t} \Delta_t t + \Delta_t \psi_t \tag{4.64}$$

The integral can be evaluated numerically as it is only over a one-dimensional normal distribution. Thus, through a differential approximation, the model in (4.56), which would not be simple to estimate directly (since the split of $x_{it}(r,t)$ into spatial and temporal dimensions is not observable), can be estimated using panel data techniques.
12 For ease of estimation, the incidental parameters problem (in both dimensions) is being ignored.

4.9 SAMPLING AND MODEL ESTIMATION

In Chapter 3, the original land-cover data had to be aggregated in order to reduce the computational burden posed by its large size. For this work, even the reduced data set, which is 29,946 cells × 4 time periods, creates computational issues that severely hamper the model estimation process. Specifically, many of the models require, among other things, eigenvalue calculations and bootstrap estimations, all of which involve computations whose size grows geometrically with respect to the number of observations. In order to make estimation of the models computationally more tractable, cell sampling is used. The sampling method is straightforward. Depending on the model, a sample size, $n$, is chosen. Then, a set of $S$ random samples of size $n$ is selected (without replacement) from the $N$ cells in the initial data set. The model is estimated $S$ times using the different data sets, and the vectors of parameters of the estimated models, call them $\hat{\beta}_s$, are then averaged. That is, because

$$\bar{\beta} = \frac{1}{S}\sum_{s=1}^{S} \hat{\beta}_s \tag{4.65}$$

is a consistent estimator for the true $\beta$ (for the whole data set), it can be used as an estimate of the parameters (Greene 2000). Unfortunately, a simple formula analogous to (4.65) for creating consistent estimates of the standard deviation (or variance) of $\bar{\beta}$ does not exist. Therefore, significance tests are carried out at the individual model estimation stage, and averages of these are analyzed to get a feel for the significance of the parameters. A discussion of the details and complications of this sampling method specific to each model will be presented in Chapter 5.
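The sample-and-average scheme of (4.65) can be sketched as below. The function name `subsample_average`, the use of ordinary least squares as the stand-in estimator, and the simulated data are assumptions made for illustration; any of the chapter's models could take the estimator's place.

```python
import numpy as np

def subsample_average(estimator, y, X, n, S, seed=0):
    """Average the estimates from S random samples of n cells each.

    Each sample is drawn without replacement from the N available cells.
    Returns the averaged parameter vector and the (S, K) array of estimates.
    """
    rng = np.random.default_rng(seed)
    N = y.shape[0]
    estimates = []
    for _ in range(S):
        idx = rng.choice(N, size=n, replace=False)
        estimates.append(estimator(y[idx], X[idx]))
    estimates = np.asarray(estimates)
    return estimates.mean(axis=0), estimates

# Hypothetical usage with OLS on noiseless data
def ols(y, X):
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(3)
X_all = rng.normal(size=(1000, 2))
beta_true = np.array([1.0, -2.0])
y_all = X_all @ beta_true
beta_bar, beta_s = subsample_average(ols, y_all, X_all, n=100, S=10)
```

Keeping the individual estimates alongside the average makes it straightforward to inspect how the per-sample significance tests vary across samples.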
4.10 SUMMARY
This chapter has presented a myriad of techniques for incorporating spatial and temporal effects into econometric-based models. A large part of the chapter was devoted to incorporating spatial autocorrelation into models under a rigorous framework. The key to this incorporation was the spatial weight matrix, which 84
allows for the diminishing effects of distance on spatial autocorrelation to be accounted for. The basic model using the spatial weight matrix was the panel data spatial linear regression model for continuous dependent variables. In order to estimate this model in a timely fashion, a technique combining least squares regression and MLE is used. Because of spatial diversity, regional effects, or other heterogeneity, some linear regression models might have different model parameters for different parts of a data set. The panel data spatial linear regression model with probit sample selection allows for two different coefficient sets in a linear regression model incorporating spatial autocorrelation. This model uses a binary probit discrete choice framework to select which group an observation lies in, and then estimates a different set of model parameters for each group. Because of complications which arose because of the necessity of incorporating the effects of the probit selection model error into the regression model, spatial autocorrelation was only accounted for in the regression half of the model. Despite the flexibility of the spatial linear regression models, they did not, in the form presented in this work, allow for the inclusion of a time-lagged dependent variable in the explanatory variables. To investigate the effects of such a variable, the LSDV model, which allows for a lagged-dependent variable but not spatial autocorrelation, was introduced. Because the estimation of this model cancels out time-constant variables, it effectively eliminates the inclusion of any strictly spatial effects in the model. This allows for a comparison of the tradeoffs which occur in deciding between including time-lagged dependent variables or spatial effects in a panel data linear regression models. 
Because the land-cover data used in this thesis is not continuous over the whole real line, but is instead limited to the [0,1] interval, the panel data spatial linear regression model could not be applied directly. Instead, the model could be applied after using the logistic PDF transformation. However, this led to two complications. First, this only allowed for binary models of the data set, which means that if more than two types of land cover are to be modeled, they must be organized into a series of binary splits. Second, the transformation introduced heteroscedasticity into the model, which must be corrected for before the estimation methods from the panel data spatial linear regression model could be applied. The last issue was made more complicated by the fact that if secondary (or greater) binary splits are used, then an instrument variable, which is constructed from the previous binary model's dependent variable, should be included in the explanatory variables.
These various spatial econometric techniques can incorporate lagged dependent variables. However, because the different panels of data used in this thesis have different time gaps between them, an accounting of the size of the time lags might be needed. A method by which this could be done was presented, in which model parameters are multiplied by another (estimated) parameter which is taken to the power of the size of the time lag.
Another way to analyze spatial and temporal aspects of a data set is to look at deviations over space and time. One way of doing this is to approximate a differential equation framework by using first-differences across space and time. The application of such a technique presented in this chapter assumes a certain model form from which a differential equation model, in space and time, is constructed; the model is then approximated. Because of the construction of this model, both explicit and implicit spatial and temporal effects are incorporated.
Lastly, because of the complications associated with the estimation of these models for a large data set, data sampling must be used.
By taking a series of random samples from the data set, estimating the model parameters, and then averaging the estimates, consistent estimates of the model parameters for the entire data set can be formed. Unfortunately, such consistent estimates for the standard deviations of the model parameters cannot be achieved as easily, and thus evaluations of the significance of the parameters must be done in a more qualitative fashion.
The techniques presented in this chapter provide an excellent base of rigorous techniques by which the complexities of space and time in the data presented in Chapter 3 can be investigated in Chapter 5. These investigations lead to interesting, and sometimes surprising, conclusions, and show the power and usefulness of incorporating and analyzing spatial and temporal effects in transportation/regional planning based models.
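The series of binary splits and the mapping of [0,1] proportions onto the real line, both summarized above, can be illustrated with a minimal sketch. It assumes that the transformation is the standard logit, ln(y / (1 - y)); the exact form of equation 4.46 is not reproduced in this section, so this is an assumption. The function names, the clipping constant eps, and the example land-cover shares are hypothetical and are not taken from the thesis.

```python
import math


def logit(p, eps=1e-6):
    """Map a proportion in [0, 1] onto the real line (standard logit).

    Clipping by eps is an illustrative assumption so that proportions of
    exactly 0 or 1 stay finite; the thesis may treat these cases differently.
    """
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def inverse_logit(z):
    """Map a value on the real line back to a proportion in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_splits(residential, other_urban, rural, other_non_urban):
    """Organize four land-cover shares of a cell into nested binary proportions."""
    urban = residential + other_urban
    non_urban = rural + other_non_urban
    total = urban + non_urban
    return {
        "urban": urban / total,
        "residential_given_urban": residential / urban if urban > 0 else None,
        "rural_given_non_urban": rural / non_urban if non_urban > 0 else None,
    }


# Hypothetical cell: 20% residential, 10% other urban, 50% rural, 20% other.
splits = binary_splits(0.2, 0.1, 0.5, 0.2)
# Each binary proportion is transformed before being used as a dependent variable.
transformed = {name: logit(p) for name, p in splits.items() if p is not None}
print(splits)
print(transformed)
```

Each of the three binary proportions would then be modeled separately, which is why only two-way splits are needed at any one step.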
CHAPTER 5: MODEL RESULTS

This chapter summarizes and discusses the results of the models developed in Chapter 4 as applied to the data discussed in Chapter 3. The discussion is organized according to model type, though comparisons of different model types' results also occur in several sections. Some of the models are further separated into those without time-lagged variables, those with time-lagged variables, and those with both time-lagged variables and time adjustment. As this chapter discusses many different aspects of the various models in detail, it is important to emphasize the guiding objective of this chapter: to elucidate the manner in which land cover, space, and time are reflected in model results. What is garnered from this is a better understanding of both how well the models perform and, most importantly, how various factors interact to effect demographic and geographic change.
Seven variables are modeled here: population, per capita income, the average number of vehicles available per household, median home value, proportion of land which is urban, proportion of urbanized land which is residential, and proportion of non-urbanized land which is rural.1 These variables have been selected for modeling because they are important indicators for a variety of economic, social, environmental, and political qualities, and especially for transportation and regional science-based applications. For example, the number of vehicles per household might be used as a measure of household wealth as well as hydrocarbon emissions for a given region. Likewise, the proportions of different types of land cover could be used to measure urban sprawl and its consequences.
1 Because the cells modeled in this work all have the same land area, the population variable can also be interpreted as a population density measure.
In addition to the logistic transformation (equation 4.46) needed to apply proportions data to the logistic spatial regression model, two of the variables were also transformed: the natural log of population and per capita income were modeled in most specifications. The primary motivation for this is that the natural log transformation (ln) eliminates the possibility that the model predicts negative values for the variables, which, to be realistic, will always be positive.2 For the differential equations models, no transformation of the variables was carried out, primarily because the differences themselves are not strictly positive. Also, in order to increase the power or fit of some models, the natural log transformation was applied to a few explanatory variables.
An issue which came up in selecting which models to run was estimation time constraints. That is, estimating most of these models was extremely time consuming, with run times ranging from around one hour to more than three days. This placed restrictions on how many models could be run, and how much model adjustment (mainly adding or deleting variables) could be carried out. Even given this, the multitude of results presented here still gives an excellent opportunity to make qualified judgments concerning the strengths and weaknesses of the models.
As noted in Chapter 4, the computational demands of the estimation of these models necessitated the use of sampling. Except where noted, all models were estimated using 25 random samples of 1,000 observations (grid cells) each. In order to ease time and computational demands on the sampling end, the same 25 samples were usually used for all models.
2 Though it might make sense to use the natural log transformation on the average number of vehicles available per household variable, attempts to do this led to convergence problems in the model estimation, possibly implying an incorrect form for the data. As such, the transformation was not applied to this variable. The transformation is also not applied to the median home value because this variable is only used in the differential equations model (see paragraph following footnote).
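The sampling scheme described above (25 random samples of 1,000 grid cells each, with the parameter estimates averaged across samples) can be sketched as follows. This is only an illustration under stated assumptions: ordinary least squares on one explanatory variable stands in for the spatial maximum-likelihood estimators actually used, the data are synthetic, and all function and variable names are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def averaged_sample_estimates(xs, ys, n_samples=25, sample_size=1000, seed=0):
    """Estimate the model on repeated random samples and average the estimates."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_samples):
        # Draw cells without replacement within a sample.
        chosen = rng.sample(range(len(xs)), sample_size)
        estimates.append(
            fit_line([xs[i] for i in chosen], [ys[i] for i in chosen])
        )
    averages = tuple(sum(e[k] for e in estimates) / n_samples for k in range(2))
    return averages, estimates


# Synthetic stand-in for the grid cells (not the thesis data).
data_rng = random.Random(42)
x_all = [data_rng.uniform(0.0, 10.0) for _ in range(20000)]
y_all = [5.0 - 0.7 * x + data_rng.gauss(0.0, 1.0) for x in x_all]

(avg_intercept, avg_slope), per_sample = averaged_sample_estimates(x_all, y_all)
slopes = [slope for _, slope in per_sample]
# The averages are reported alongside the per-sample maximum and minimum.
print(avg_intercept, avg_slope, max(slopes), min(slopes))
```

Reusing one seed for the sample draws mirrors the choice, noted above, to use the same 25 samples for most models.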
All of the models presented in this chapter were estimated using procedures for the GAUSS computer program (Aptech 1998) written by the author. These codes all use the built-in maximum-likelihood function of GAUSS to estimate some or all of the model parts. Because of the large number of results produced, the discussion in this chapter only includes a summary of the results. This summary includes average parameter values, average standard deviations for the parameters, average t-statistics for the parameters, and average model statistics (including R² measures and log-likelihood levels). As noted in Chapter 4, only the parameter averages offer consistent estimates of the actual values of the results if the entire model were applied to the data set.3 The other average statistics are presented to give a feel for how the various model components are performing. Also included in the result summaries are maximum and minimum parameter values as well as sample standard deviations for the parameter averages.
3 Because the sampling scheme only incorporates a small part of the data set, for the models incorporating spatial autocorrelation the consistency of the parameter values, especially that measuring the spatial autocorrelation, may no longer hold. This does not negate the results, especially in the interpretation of them; rather, it means that if the results are used in another capacity, such as the simulations presented in Chapter 6, the manner in which they are used must be consistent with the context in which they were estimated. The implications of this are made clearer in the discussion of the simulation methodology in Chapter 6.
To gauge the effects of individual variables on each model, elasticities are reported. These measure the percentage change in the dependent variable expected for a one percent increase in the independent variable. For linear models, the formula for the elasticity of a particular variable k on the model can be estimated (using the data averages) by (Greene 2000):

$$\eta_k = \frac{\beta_k \bar{X}_k}{\bar{X}'\beta} \qquad (5.1)$$

where $\eta_k$ is the elasticity of variable k, $\beta$ is the estimated coefficient vector, and $\bar{X}$ represents the average of the independent variables across space. This formula is used for all of the regression-based models. For those models with a constant term in the independent variables, the denominator can be replaced by the average of the dependent variable, $\bar{Y}$ (Greene 2000).4 For the probit model used in the sample selection model, the elasticities are given by

$$\eta_k = \beta_k \bar{X}_k \, \frac{\phi(\bar{X}'\beta)}{\Phi(\bar{X}'\beta)} \qquad (5.2)$$

where $\phi(\cdot)$ is the standard normal PDF and $\Phi(\cdot)$ is the standard normal CDF. To better gauge the effects of time, elasticities for each year were determined.
Also included in this chapter is a small discussion of the convergence properties and consequences of sampling for the models. For some of the models, estimates for the whole data set could be determined in a timely manner along with the estimates from the samples. In order to get a feel for the quality of the sampling and its results, these estimates are compared. Lastly, for the models using bootstrapping, the convergence of the bootstrapping standard error estimates is shown. Since these bootstrapped standard error estimates seem to be unrealistically large, the model results both with and without bootstrapping are summarized.
In all of the models incorporating distance measures in the explanatory variables, the square roots of the distances, as opposed to the distances themselves, are used. This was done because this transformation of the distance variables improved the model performance.
As discussed in Frazier and Kockelman (2003), the reason that this transformation has more explanatory power is that as distance increases, the effect of that distance on population (or other variables), while still monotonic, is dampened. That is, for cells which are close to the CBD or a highway, a move of one kilometer closer or farther away will make a much greater difference than for a cell which is far from the CBD or highway. The square root transformation accounts for this relative effect of distance, while still preserving the expected monotonicity of the distance effect.5
Finally, before starting the discussion of the model results, a larger discussion concerning the goal of this chapter is in order. In this chapter a large number of results are presented. Many of these results will seem somewhat redundant, as some of the models have three flavors (without time-lagged variables, with time-lagged variables, and with time-lagged variables and time adjustment), all of which often have similar results. The point behind presenting all of these is two-fold, and both parts have to do with motivating and justifying the use of time-lagged variables and, especially, the time adjustment factor. First of all, for predictive purposes, using time-lagged variables is preferred, as this allows one to use currently known data, as opposed to projections, to predict the future. Showing that time-lagged variables offer comparable, and sometimes even preferable, model results helps justify their inclusion. Secondly, by showing how the time adjustment factor affects the model results, its inclusion can be motivated. Specifically, in most of the results presented here, the time adjustment factor, though statistically significant, does not affect the results of the model very much. This is important because it shows that its inclusion does not lead to unrealistic results, especially as the model with the time adjustment factor is the most general and flexible version of the models presented here. Furthermore, by including the results of all three models, a better picture of how the model specifications affect the results, and a better understanding of the uses, strengths, and limitations of these models, can be developed. This last point will be emphasized in the conclusion of this chapter, and will be extended by the results shown in Chapter 6.
4 For the logistic model, the true elasticity formula, as it relates to the proportion variable (as opposed to the variable as transformed by equation 4.46), should actually be transformed by a factor of $(1 - \bar{Y})^{-1}$, but, as this is constant across all elasticity calculations, it does not affect the relative elasticity levels, which is what is truly of interest here. A similar issue is present wherever the dependent variable is transformed by a natural log transformation.
5 It is noted that in order to arrive at these final variable forms, a variety were tried. These included linear distances and squared distances. Square roots were shown to perform the best, though there was not enough time to test whether using more than one (i.e. squared distances and square root of the distances) would have been beneficial. This would be an interesting extension of this work.
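The elasticity calculations of equations (5.1) and (5.2), evaluated at the data averages, can be sketched in a few lines. This is an illustrative sketch rather than the GAUSS code used for the thesis; the function names, coefficient values, and variable averages below are hypothetical.

```python
import math


def normal_pdf(z):
    """Standard normal probability density function."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def linear_elasticities(beta, x_bar):
    """Equation (5.1): beta_k * xbar_k divided by xbar'beta."""
    xb = sum(b * x for b, x in zip(beta, x_bar))
    return [b * x / xb for b, x in zip(beta, x_bar)]


def probit_elasticities(beta, x_bar):
    """Equation (5.2): beta_k * xbar_k * pdf(xbar'beta) / cdf(xbar'beta)."""
    xb = sum(b * x for b, x in zip(beta, x_bar))
    scale = normal_pdf(xb) / normal_cdf(xb)
    return [b * x * scale for b, x in zip(beta, x_bar)]


# Hypothetical linear model: constant, one distance measure, one land-cover share.
beta_linear = [5.0, -0.7, 0.1]
x_bar_linear = [1.0, 4.0, 2.0]
lin = linear_elasticities(beta_linear, x_bar_linear)

# Hypothetical probit selection model: constant and one explanatory variable.
beta_probit = [0.5, -0.2]
x_bar_probit = [1.0, 2.0]
pro = probit_elasticities(beta_probit, x_bar_probit)

print(lin)
print(pro)
```

Because the denominator in (5.1) is the fitted value at the averages, the linear elasticities (including the term for the constant) sum to one, which gives a quick check on an implementation.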
5.1 PANEL DATA SPATIAL LINEAR REGRESSION MODEL
Three variables were modeled using the panel data spatial linear regression model: natural log of population, natural log of per capita income, and vehicles available per household; for simplicity, in this section these variables will be respectively referred to as population, income, and vehicles. When reading the tables, remember that the spatial autocorrelation parameter estimates the amount of spatial autocorrelation present, and the random effects parameter relates the time-constant random effects variance to the variance of the model error which varies with time. Also, in the models with time adjustment, the t-statistic reported is testing the null hypothesis that the time adjustment parameter is equal to one (which would make the model equivalent to a model with time lags but no time adjustment).
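The test of the time adjustment parameter against a null value of one, rather than zero, can be illustrated with a short sketch. The function name is hypothetical; the inputs are the averaged time adjustment estimates and standard errors reported in Tables 5.3 and 5.6. Because the tables report averages of the per-sample t-statistics (5.235 and 3.680), the ratio of the averaged values computed here is close to, but not identical to, those figures.

```python
def t_stat_against(estimate, std_error, null_value=1.0):
    """t-statistic for the null hypothesis that the parameter equals null_value."""
    return (estimate - null_value) / std_error


# Averaged time adjustment estimates and standard errors (Tables 5.3 and 5.6).
t_population = t_stat_against(0.943, 0.0108)
t_income = t_stat_against(1.046, 0.0140)

# A negative value indicates an estimate below one, a positive value above one.
print(round(t_population, 3), round(t_income, 3))
```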
5.1.1 ln(Population) Model
Table 5.1 presents the results for the population model with no time-lagged variables. It is seen that as the distance to the CBD increases, the expected population of a cell decreases. This makes sense, intuitively, because it is expected that the populated areas of a city are more concentrated closer to the city center, and that the farther from a city a cell is, the more sparsely populated it is. Interestingly, the results also show that as the distance from the nearest highway increases, the population increases. At first glance this is a non-intuitive result, as it is expected that the major transportation network is located near, not far from, the populated parts of a city. However, in looking at the elasticities of the two distance measures, the interpretation of the results becomes clearer. Comparing the elasticities of the distance measures, it is seen that a change in distance to the CBD impacts the population more than ten times as much as a similar change in the distance to the nearest highway. What is happening is that the distance to CBD measure dominates as far as predicting the population levels, and the distance to the nearest highway acts as a correction factor. Cells which are very far from the CBD will also often be very far from a highway; the negative effects of the large distance to the CBD will cause an underestimation of the population, and the parameter on distance to the nearest highway corrects for this.6
The results also show that population is expected to increase with the proportion of residential and commercial land cover, and fall with the proportion of rural land cover. Furthermore, in both the magnitude of the parameter and its elasticity, the proportion of residential land cover has a greater impact on population than does commercial, and both have a greater impact than rural land cover does. This is expected, as a city's population will tend to be focused in more urbanized areas, and more so in residential areas than commercial ones. It is also seen that as land cover entropy or mix increases, the population is expected to increase.
6 The fact that the distance to the nearest highway measure has a non-intuitive effect on the model could be due to a mis-specification, in that the linear or squared distance measure should have been used instead. This is questionable, though, because, during the estimation stage, the square root of the distance performed the best out of the three. Also, the result might be an effect of multicollinearity between the distance to the nearest highway and distance to the CBD measures. However, as noted in Greene (2000), such an effect would be evidenced by low significance levels for one or more of the collinear variables, as well as large parameter changes for small changes in the data set used to estimate the model. From the results, it is seen that both the parameters are highly significant on average and, more importantly, there is very little variation in the estimated parameters for the different samples. Thus, multicollinearity is probably not a problem here.

Variable                                    Beta        S.E.        T-statistic Sample Max  Sample Min  Sample Std. Error
Constant                                    5.044       0.174       29.119      5.351       4.590       0.188
Square root of Distance to CBD              -0.715      0.0345      20.774      -0.630      -0.774      0.0314
Square root of Distance to Nearest Highway  0.125       0.0460      2.711       0.196       0.0526      0.0367
Proportion of Commercial Land Cover         0.273       0.0437      6.261       0.394       0.157       0.0679
Proportion of Residential Land Cover        0.330       0.0324      10.122      0.573       0.216       0.0792
ln(Proportion of Rural Land Cover)          -4.78E-04   0.00119     0.857       0.00243     -0.00211    0.00112
ln(Land Cover Mix)                          0.0108      0.00403     2.664       0.0183      0.00601     0.00361
Land Cover Entropy                          0.00116     0.0356      0.570       0.181       -0.0505     0.0427
Random effects parameter                    8.217       0.227       36.136      9.317       4.451       0.917
Spatial autocorrelation parameter           5.634       0.0293      231.232     7.046       4.714       0.736
Error Variance                              0.0329
Random Effect Standard Deviation            1.444
R-Squared                                   0.549
Number of Valid Samples                     25

Elasticities
Variable                                    2000        1997        1991
Square root of Distance to CBD              -1.413      -1.533      -1.723
Square root of Distance to Nearest Highway  0.115       0.125       0.140
Proportion of Commercial Land Cover         0.00873     0.00865     0.0113
Proportion of Residential Land Cover        0.0279      0.0192      0.00782
ln(Proportion of Rural Land Cover)          0.00071     0.00106     0.00114
ln(Land Cover Mix)                          -0.00576    -0.00751    -0.00827
Land Cover Entropy                          6.07E-05    5.33E-05    5.30E-05

Table 5.1 Results of ln(Population) spatial regression model without time-lagged variables.
Given the high average t-statistics for the spatial autocorrelation and random effects parameters, it is concluded that spatial autocorrelation and temporal random effects are statistically, and possibly practically, significant parts of the model. The positive value of the spatial autocorrelation parameter implies that cells which are close together tend to have similar population levels. Actually, this result is expected, as areas of similar population densities, such as housing developments (medium to high density) or agricultural land (low density), are already known to group together.7 The significance attached to the random effects in the model also seems intuitive, as this effect accounts for the regional deviations in the constant term, which itself accounts for the propensity of a cell to have a certain population (regardless of the year). Intuitively, a cell with a certain population in 1983, at the very least, keeps that population throughout the rest of the years modeled. The random effect, as an adjustment to the constant term, helps to pick up this propensity.
7 It is noted that such grouping indicated by the spatial autocorrelation is highly dependent on the scale of the maps being examined. That is, if grid cells of a few kilometers or more in size were used, then such a cell might encompass a whole neighborhood and the grouping effects may not be as readily picked up. The 300 meter size of the grid cells, however, intuitively seems small enough to capture the effects of such groupings. Scale is an important issue in spatial models, and many papers discuss the issue in detail, including Kok et al. (2001) and Lam and Quattrochi (1992).
Tables 5.2 and 5.3 present the results of the population model with time-lagged variables and with both time-lagged variables and time adjustment, respectively. Comparing these results with each other, it is seen that they are nearly identical; furthermore, comparing them with those from the population model without lagged variables, it is seen that all three of the model results are quite similar. In fact, the only results which are significantly different between the models involve the proportion of rural land cover, land cover mix, and land cover entropy variables. In the models with time-lagged variables, as the proportion of rural land cover increases, the population will also increase. From examining the magnitude of the other parameters and their elasticities, the effect of this increase is less than that from residential or commercial land cover, which is expected. However, in the model with no time-lagged variables, the effect of increases in rural land cover was a decrease in population. In fact, in all three models, the maximum and minimum parameter estimates show that this parameter is positive for some samples and negative for others. The reason is probably that rural land cover is being compared against two general alternatives: urban land cover and non-urban non-rural land cover. In a highly urbanized area, an increase in rural land cover probably leads to a decrease in the population level; on the other hand, in a non-urbanized region, an increase in rural land cover probably leads to a small population increase. This confusion is reflected in the various estimation samples; the variable's low average significance level is probably also a result of it trying to account for these two essentially opposite effects.
In the lagged variable models, the land cover mix parameter has changed sign to negative from the model without lagged variables. Also, its t-statistic shows that it is of decreased significance to the model. On the other hand, land cover entropy, which was of questionable significance in the model without lagged variables, is now highly significant and, as evidenced by its elasticity, has a greater relative effect on the population of a cell.
For the population model with time adjustment, the time adjustment factor is shown to be statistically significant and less than one. This possibly implies that the farther one goes back in time, the less of an impact a cell's information has on its current population. This also suggests that the expectation that population increases with time is not being exposed by this factor, though it could be indicating that population densities remain fairly constant over time. The implications of the time adjustment factor will become more apparent in Chapter 6, when population prediction simulations are discussed.
As a final note, the R² value for this model is actually slightly lower than that for the lagged variable model without time adjustment. However, the R² values for the two models are so close, as is that for the model without time-lagged variables, that they can be considered essentially equivalent as far as fit according to R² is concerned. As such, selecting the model with the most flexibility and generality, that is, the model with time-lagged variables and time adjustment, would be recommended. It is noted that the R² values, which hover around 0.55, suggest that there is a reasonable amount of uncertainty in the population variable which the model is not accounting for.

Variable                                    Beta        S.E.        T-statistic Sample Max  Sample Min  Sample Std. Error
Constant                                    4.968       0.174       28.594      5.340       4.484       0.173
Square root of Distance to CBD              -0.715      0.0347      20.671      -0.629      -0.783      0.0319
Square root of Distance to Nearest Highway  0.125       0.0462      2.696       0.197       0.0513      0.0371
Proportion of Commercial Land Cover*        0.162       0.0483      3.356       0.235       0.0670      0.0433
Proportion of Residential Land Cover*       0.456       0.0330      13.803      0.674       0.275       0.0870
ln(Proportion of Rural Land Cover)*         9.39E-04    0.00121     1.020       0.00269     -0.00218    0.00120
ln(Land Cover Mix)*                         -0.00652    0.00393     1.688       4.12E-04    -0.0114     0.00311
Land Cover Entropy*                         0.181       0.0338      5.245       0.724       0.0836      0.118
Random effects parameter                    8.143       0.225       36.181      9.808       4.454       1.017
Spatial autocorrelation parameter           4.982       0.027       223.163     7.157       3.880       0.836
Error Variance                              0.033
Random Effect Standard Deviation            1.435
R-Squared                                   0.556
Number of Valid Samples                     25

Elasticities
Variable                                    2000        1997        1991
Square root of Distance to CBD              -1.413      -1.533      -1.722
Square root of Distance to Nearest Highway  0.115       0.125       0.140
Proportion of Commercial Land Cover*        0.00518     0.00513     0.00670
Proportion of Residential Land Cover*       0.0385      0.0266      0.0108
ln(Proportion of Rural Land Cover)*         -0.00139    -0.00208    -0.00225
ln(Land Cover Mix)*                         0.00348     0.00454     0.00500
Land Cover Entropy*                         0.00947     0.00831     0.00827

Table 5.2 Results of ln(Population) spatial regression model with time-lagged variables (* denotes time-lagged variable).

Variable                                    Beta        S.E.        T-statistic Sample Max  Sample Min  Sample Std. Error
Constant                                    4.985       0.174       28.660      5.347       4.485       0.187
Square root of Distance to CBD              -0.717      0.0347      20.689      -0.629      -0.782      0.0325
Square root of Distance to Nearest Highway  0.125       0.0463      2.696       0.194       0.0523      0.0360
Proportion of Commercial Land Cover*        0.224       0.0652      3.439       0.345       0.100       0.0585
Proportion of Residential Land Cover*       0.575       0.0444      12.997      0.786       0.404       0.103
ln(Proportion of Rural Land Cover)*         0.00096     0.00163     1.030       0.00404     -0.00388    0.00181
ln(Land Cover Mix)*                         -0.00751    0.00533     1.429       -0.00165    -0.0141     0.00367
Land Cover Entropy*                         0.232       0.0453      5.079       0.738       0.122       0.119
Random effects parameter                    8.360       0.232       35.993      9.538       4.455       0.956
Spatial autocorrelation parameter           5.363       0.032       212.524     6.794       3.940       0.747
Time Adjustment                             0.943       0.0108      5.235       0.985       0.876       0.027
Error Variance                              0.032
Random Effect Standard Deviation            1.439
R-Squared                                   0.5533
Number of Valid Samples                     25

Elasticities
Variable                                    2000        1997        1991
Square root of Distance to CBD              -1.416      -1.536      -1.726
Square root of Distance to Nearest Highway  0.115       0.125       0.140
Proportion of Commercial Land Cover*        0.00600     0.00499     0.00580
Proportion of Residential Land Cover*       0.0407      0.0236      0.00852
ln(Proportion of Rural Land Cover)*         -0.00120    -0.00150    -0.00144
ln(Land Cover Mix)*                         0.00336     0.00368     0.00361
Land Cover Entropy*                         0.010       0.00748     0.00662

Table 5.3 Results of ln(Population) spatial regression model with time-lagged variables and time adjustment (* denotes time-lagged variable).
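The role of the time adjustment factor can be made concrete with a minimal sketch. It follows the description in Chapter 4, in which a model parameter is multiplied by an estimated factor taken to the power of the size of the time lag. The function name is hypothetical, and the lag lengths of 3, 6, and 8 years are inferred from the panel years mentioned in this chapter (1983, 1991, 1997, and 2000), so they are an assumption here. The coefficient and the two adjustment factors are the averaged estimates from Tables 5.3 and 5.6.

```python
def adjusted_coefficient(beta, time_adjustment, lag_years):
    """Lagged-variable coefficient scaled by the adjustment factor to the lag power."""
    return beta * time_adjustment ** lag_years


# Averaged estimates: residential land cover coefficient and time adjustment
# factor for the population model (Table 5.3).
beta_residential = 0.575
gamma_population = 0.943

# Time adjustment factor for the income model (Table 5.6), shown for contrast.
gamma_income = 1.046

lags = [3, 6, 8]  # assumed gaps between panels, in years
population_effects = [
    adjusted_coefficient(beta_residential, gamma_population, lag) for lag in lags
]
income_scalings = [gamma_income ** lag for lag in lags]

# A factor below one shrinks the effect as the lag grows; above one inflates it.
print(population_effects)
print(income_scalings)
```

With a factor below one, as in the population model, older information carries less weight; a factor above one has the opposite effect.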
5.1.2 ln(Per Capita Income) Model
The results for the income model without time-lagged variables are presented in Table 5.4. As with the population models, the most influential variable, judging by elasticities, is the distance to CBD. It is seen that as the distance to the CBD increases, income is expected to increase. At first glance, this may not seem very intuitive, as it might be expected that income would drop as one moved farther from the city. However, what this parameter is most likely picking up on is the fact that regions with high per capita income, while not located extremely far from the city center, may not be located very close to the CBD. On the other hand, it might be expected that many regions with low income are remote in the sense that they are located both far from the CBD and far from highways. This idea is borne out by the fact that the model results show that as the distance to the nearest highway increases, the income is expected to decrease, and that this effect, as evidenced by the elasticity of the variable, probably acts as a correction to the distance to the CBD in a similar way to that of the population models discussed previously.

Variable                                    Beta        S.E.        T-statistic Sample Max  Sample Min  Sample Std. Error
Constant                                    9.850       0.0851      123.149     10.130      9.444       0.175
ln(Square root of Distance to CBD)          0.156       0.0547      2.735       0.408       0.00528     0.104
Square root of Distance to Nearest Highway  -0.00919    0.0189      0.810       0.0172      -0.052      0.0170
ln(Proportion of Commercial Land Cover)     0.00147     6.48E-04    2.608       0.00211     -0.00105    6.14E-04
Proportion of Residential Land Cover        0.115       0.0186      6.275       0.267       0.0827      0.0369
ln(Proportion of Rural Land Cover)          -5.58E-04   6.89E-04    1.026       8.24E-04    -0.00188    6.25E-04
ln(Land Cover Mix)                          0.00892     0.00455     2.330       0.0183      -0.0272     0.00861
ln(Land Cover Entropy)                      -0.00304    0.00381     1.466       0.0345      -0.0098     0.00825
Random effects parameter                    5.862       0.164       35.640      10.603      1.584       2.234
Spatial autocorrelation parameter           7.094       0.0399      315.182     10.032      3.075       1.362
Error Variance                              0.0146
Random Effect Standard Deviation            0.592
R-Squared                                   0.9921
Number of Valid Samples                     25

Elasticities
Variable                                    2000        1997        1991
ln(Square root of Distance to CBD)          0.0227      0.0229      0.0233
Square root of Distance to Nearest Highway  -0.00193    -0.00285    -0.00383
ln(Proportion of Commercial Land Cover)     -8.93E-04   -0.00109    -8.95E-04
Proportion of Residential Land Cover        0.0137      0.0251      0.0371
ln(Proportion of Rural Land Cover)          1.88E-04    2.61E-04    2.56E-04
ln(Land Cover Mix)                          -0.00108    -0.00131    -0.00131
ln(Land Cover Entropy)                      3.12E-04    3.82E-04    3.81E-04

Table 5.4 Results of ln(Per Capita Income) spatial regression model without time-lagged variables.
The model results also show that income is expected to increase with the proportion of residential and commercial land cover, and decrease with the proportion of rural land cover. This is reasonable as it is expected that regions of higher income tend to be urbanized as opposed to rural. Furthermore, it is expected that the proportion of residential land cover in a cell is probably a better indicator of higher income than commercial land cover. This is shown by the fact that both the parameter level and the elasticity of the residential land cover variable are significantly higher than those for commercial land cover. In fact, the elasticity of the residential land cover variable shows this variable to have nearly as big an impact on the income of a cell as the distance to the CBD. The model results also show that income is expected to increase as land cover mix increases and land cover entropy decreases.
Again, the high average t-statistics for the random effects and spatial autocorrelation parameters suggest that temporal random effects and spatial autocorrelation play a significant role in the income model. The spatial autocorrelation parameter is positive and, in fact, is much higher in magnitude than that for the population models. This suggests that the amount of similitude in per capita income among cells close together is greater than that for population. The random effects variance, which is small compared with the constant term of the model, suggests that the random effect offers a regional adjustment for the average baseline income, which the constant term represents.
Tables 5.5 and 5.6 present the results for the income model with time-lagged variables and with both time-lagged variables and time adjustment, respectively. As with the population model, these two models are similar to the model without time-lagged variables. The main difference is that the signs changed for the parameters for the proportion of commercial land cover, the proportion of rural land cover, land cover mix, and land cover entropy. This implies that by time lagging the variables, the manner in which they predict the income level of a cell changes significantly.

Variable                                    Beta        S.E.        T-statistic Sample Max  Sample Min  Sample Std. Error
Constant                                    9.825       0.0849      123.111     10.147      9.393       0.187
ln(Square root of Distance to CBD)          0.155       0.0547      2.707       0.422       0.00574     0.107
Square root of Distance to Nearest Highway  -0.00872    0.0189      0.804       0.0177      -0.0524     0.0174
ln(Proportion of Commercial Land Cover)*    -0.00200    6.47E-04    3.034       -8.13E-04   -0.00686    0.00125
Proportion of Residential Land Cover*       0.196       0.0194      10.400      0.344       0.136       0.0449
ln(Proportion of Rural Land Cover)*         4.43E-04    6.99E-04    1.310       0.00291     -0.00618    0.00161
ln(Land Cover Mix)*                         -0.00222    0.00432     1.022       0.00668     -0.0902     0.0186
ln(Land Cover Entropy)*                     0.00341     0.00355     0.946       0.105       -0.00603    0.0214
Random effects parameter                    5.756       0.161       35.674      9.565       1.572       1.931
Spatial autocorrelation parameter           6.586       0.037       408.742     9.940       2.868       1.452
Error Variance                              0.0147
Random Effect Standard Deviation            0.589
R-Squared                                   0.9921
Number of Valid Samples                     25

Elasticities
Variable                                    2000        1997        1991
ln(Square root of Distance to CBD)          0.0226      0.0228      0.0232
Square root of Distance to Nearest Highway  -0.00183    -0.00271    -0.00364
ln(Proportion of Commercial Land Cover)*    0.00122     0.00149     0.00122
Proportion of Residential Land Cover*       0.0233      0.0429      0.0634
ln(Proportion of Rural Land Cover)*         -1.50E-04   -2.08E-04   -2.03E-04
ln(Land Cover Mix)*                         2.70E-04    3.27E-04    3.26E-04
ln(Land Cover Entropy)*                     -3.51E-04   -4.29E-04   -4.28E-04

Table 5.5 Results of ln(Per Capita Income) spatial regression model with time-lagged variables (* denotes time-lagged variable).
Variable Constant ln(Square root of Distance to CBD) Square root of Distance to Nearest Highway ln(Proportion of Commercial Land Cover)* Proportion of Residential Land Cover* ln(Proportion of Rural Land Cover)*
Beta 9.823 0.155 -0.00895 -0.00171 0.158 3.06E-04 -0.00200 0.00279 5.757 6.233 1.046 0.148 0.590 0.9921 25
S.E. 0.0848 0.0547 0.019 5.16E-04 0.0154 5.57E-04 0.00346 0.00285 0.161 0.039 0.0140
T-statistic 123.326 2.715 0.791 3.338 10.734 1.293 0.912 0.900 35.670 308.015 3.680
Estimation Sample Properties Standard Max Min Error 10.144 9.399 0.188 0.417 0.0157 -6.96E-04 0.319 0.00187 0.00529 0.0839 10.231 9.863 1.128 0.00550 -0.0520 -0.00631 0.0944 -0.00557 -0.0718 -0.00447 1.572 2.851 0.989 0.107 0.0171 0.00112 0.0482 0.00138 0.0147 0.0170 2.136 1.279 0.038
0.0226 -0.00188 0.00120 0.0215 -1.18E-04 2.79E-04 -3.28E-04
0.0228 -0.00278 0.00167 0.0453 -1.88E-04 3.87E-04 -4.60E-04
0.0232 -0.00374 0.00150 0.0733 -2.01E-04 4.23E-04 -5.02E-04
ln(Land Cover Mix)* ln(Land Cover Entropy)* Time Adjustment Error Variance Random Effect Standard Deviation R-Squared Number of Valid Samples
Table 5.6 Results of ln(Per Capita Income) spatial regression model with time-lagged variables and time adjustment (* denotes time lagged variable).
Specifically, for the proportion of commercial land cover variable, which was the only one of the four whose sign changed with a high average t-statistic, as the proportion of commercial land cover in the past increases, the average income is expected to decrease. This is in direct contrast with the model without time-lagged variables, whose results showed the opposite effect. This might be signaling that regions with higher commercial development are not, over time, attractive to those with higher incomes. Also, if per capita income is used as a signal of the economic health of the individuals of a region, then this result implies that increased commercial development might actually be slightly detrimental, down the road, to the people living in that region.

The time adjustment factor for the model including it was slightly bigger than one. This implies that the impact of the past state of the cell on the average income in that cell increases with time. The most influential lagged variable in the model, judging by its parameter value and elasticities, was the proportion of residential land cover. Because increased residential land cover in a cell is expected to lead to greater per capita income in that cell, it can be said that residential development seems to lead to significant, and multiplicatively increasing, economic benefit for the residents of that cell.

Finally, the R² level for all three models was high and, to four significant digits, was also the same: 0.9921. This suggests that the parameters which were essentially constant across the models (those for the constant and distance to the CBD variables, and the random-effects term) probably dominate the model and that the others only offer small corrections. Nonetheless, for generality and flexibility, the model with time-lagged variables and time adjustment is suggested as the preferred model form.
5.1.3 Average Vehicles Available per Household Model
Tables 5.7, 5.8, and 5.9 present the results for the vehicle model with no time-lagged variables, with time-lagged variables, and with both time-lagged variables and time adjustment, respectively. As with the previously discussed models, the results of all three of these models are very similar. Interestingly, judging by the elasticities, the parameter with the greatest impact on vehicles is the proportion of commercial land cover, which, as it increases, leads to an expected drop in vehicles per household. The same effect, though lesser in magnitude, is found with both the proportion of residential and rural land cover. This latter effect is probably the result of the fact that rural areas will tend to have lower economic prosperity, and thus the households in these areas probably cannot afford as many vehicles per household. It is also seen, from the positive parameters on the two distance variables, that the farther from the CBD and the nearest highway a cell is, the more vehicles are expected in that cell. Also, for the model without lagged variables, an increase in land cover entropy and a decrease in mix lead to an increase in expected vehicles. However, for the time-lagged variable models, the signs on the entropy and mix variables are switched. This suggests that the effects of mix and entropy tend to be very different when they are time lagged and when they are not (this is something shared by the population and income models as well).

Again, the parameters capturing random effects and spatial autocorrelation are highly significant. The positive spatial autocorrelation suggests that regions with similar average numbers of vehicles per household tend to group together. Also, the time adjustment parameter, in the model incorporating it, is less than one. This is probably a reflection of the fact that, according to the Census approximations described by Table 3.2, the average number of vehicles available per household is expected to decrease with time. Though this reflects the data, it also means that the farther out any prediction is made, the average number of
Variable Constant Square root of Distance to CBD Square root of Distance to Nearest Highway ln(Proportion of Commercial Land Cover) Proportion of Residential Land Cover Proportion of Rural Land Cover
Beta 1.820 0.0404 0.00775 -0.140 -0.00191 -0.00868 -0.121 0.0922 2.191 6.370 0.0177 0.291 0.960 25
S.E. 0.0357 0.00695 0.00914 0.0270 9.00E-04 0.0152 0.0425 0.0404 0.0649 0.0383
T-statistic 51.101 5.827 1.068 5.184 2.130 0.920 2.843 2.447 33.750 272.752
Estimation Sample Properties Standard Max Min Error 1.872 1.774 0.0268 0.0525 0.0227 -0.0659 5.81E-05 0.0174 -0.0333 0.167 2.428 10.556 0.0292 -0.00803 -0.208 -0.00349 -0.035 -0.209 0.0365 1.949 4.679 0.00578 0.00843 0.0341 9.22E-04 0.0135 0.0436 0.0348 0.137 1.534
0.0936 0.00837 0.437 -0.00117 -0.00102 -0.0236 0.0521
0.0925 0.00828 0.522 -0.00211 -6.12E-04 -0.0198 0.0976
0.0903 0.00808 0.412 -0.00299 -9.57E-04 -0.0180 0.140
Land Cover Mix Land Cover Entropy Error Variance Random Effect Standard Deviation R-Squared Number of Valid Samples
Table 5.7 Results of Average Number of Vehicles Available per Household spatial regression model without time-lagged variables.
Variable Constant Square root of Distance to CBD Square root of Distance to Nearest Highway ln(Proportion of Commercial Land Cover)* Proportion of Residential Land Cover* Proportion of Rural Land Cover*
Beta 1.808 0.0387 0.00804 -0.0793 -0.00434 -0.00391 0.0484 -0.055 2.211 6.210 0.0177 0.293 0.959 25
S.E. 0.0362 0.00703 0.00920 0.0296 7.83E-04 0.0133 0.0446 0.0403 0.0654 0.0404
T-statistic 50.111 5.508 1.115 2.686 5.533 0.624 1.132 1.372 33.799 253.776
Estimation Sample Properties Standard Max Min Error 1.840 1.752 0.0246 0.0489 0.0223 -0.0205 -0.00274 0.0131 0.110 0.00415 2.523 10.301 0.0295 -0.00966 -0.133 -0.0056 -0.026 -0.025 -0.170 1.968 4.671 0.00578 0.00869 0.0311 8.47E-04 0.00993 0.0336 0.0363 0.151 1.542
0.0894 0.00868 0.249 -0.00266 -4.58E-04 0.00942 -0.0309
0.0884 0.00858 0.297 -0.00480 -2.76E-04 0.00791 -0.0579
0.0863 0.00838 0.234 -0.00680 -4.31E-04 0.00720 -0.0832
Land Cover Mix* Land Cover Entropy* Error Variance Random Effect Standard Deviation R-Squared Number of Valid Samples
Table 5.8 Results of Average Number of Vehicles Available per Household spatial regression model with time-lagged variables (* denotes time lagged variable).
Variable Constant Square root of Distance to CBD Square root of Distance to Nearest Highway ln(Proportion of Commercial Land Cover)* Proportion of Residential Land Cover* Proportion of Rural Land Cover*
Beta 1.810 0.0383 0.00818 -0.170 -0.0101 -0.00911 0.145 -0.166 2.150 5.283 0.871 0.0186 0.292 0.9593 25
S.E. 0.0358 0.00700 0.00917 0.0631 0.00163 0.0281 0.0949 0.0829 0.0640 0.0411 0.0208
T-statistic 50.640 5.484 1.137 2.738 6.162 0.720 1.525 1.946 33.583 182.255 6.308
Estimation Sample Properties Standard Max Min Error 1.846 1.740 0.0266 0.0527 0.0222 -0.0116 -0.00456 0.0548 0.295 2.98E-04 2.588 8.082 0.980 0.0290 -0.0095 -0.359 -0.0160 -0.0599 -0.0382 -0.432 1.954 4.111 0.787 0.00637 0.00848 0.0835 0.00313 0.0288 0.0894 0.107 0.152 1.127 0.048
0.0887 0.00883 0.352 -0.00409 -7.04E-04 0.0186 -0.0622
0.0876 0.00873 0.277 -0.00487 -2.80E-04 0.0103 -0.0768
0.0855 0.00852 0.166 -0.00523 -3.32E-04 0.00710 -0.0837
Land Cover Mix* Land Cover Entropy* Time Adjustment Error Variance Random Effect Standard Deviation R-Squared Number of Valid Samples
Table 5.9 Results of Average Number of Vehicles Available per Household spatial regression model with time-lagged variables and time adjustment (* denotes time lagged variable).
vehicles will continue to decrease, which may not reflect reality. Finally, in comparing the R² values of the different models, it is seen that, again, they are all very close in value and essentially equivalent.
5.2 LSDV MODEL
For the population, average per capita income, and average vehicles available per household models discussed in the previous section, it is reasonable to expect that the past levels of these variables in a cell are important indicators of future levels of the variables. As discussed in Chapter 4, a statistically rigorous way of incorporating lagged dependent variables into a spatial regression model is beyond the scope of this work. However, Chapter 4 does present a method to estimate a panel data regression model with a lagged dependent variable but without incorporating spatial autocorrelation. This method uses the least squares dummy variable (LSDV) method to estimate the model and then corrects for the bias introduced by the lagged dependent variable.

This section presents LSDV model results for the three variables modeled in the previous section: ln(population), ln(per capita income), and average vehicles available per household, with the shortened names of the variables from that section (population, income, and vehicles) used throughout. These results will be used to compare the LSDV model and the panel data spatial regression model in two key areas: the benefit of having a lagged dependent variable, and the incorporation of spatial information. This latter comparison comes about not only because the LSDV model does not account for spatial autocorrelation, but also because its estimation method, which uses first differences across the time dimension, eliminates the possibility of using the time-constant distance to the CBD and distance to the nearest highway variables. It should be noted that, to avoid over-complication of the LSDV model, time adjustment was not incorporated into the models.
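The loss of the time-constant distance variables under first differencing can be illustrated with a minimal sketch. The cell values below are invented for illustration, and the helper name `first_differences` is an assumption; this is not the estimation code used in this work.

```python
# Illustrative panel: one cell observed in three periods (values are made up).
# 'dist_cbd' is constant over time; 'prop_residential' varies.
panel = [
    {"year": 1991, "dist_cbd": 12.0, "prop_residential": 0.10},
    {"year": 1997, "dist_cbd": 12.0, "prop_residential": 0.18},
    {"year": 2000, "dist_cbd": 12.0, "prop_residential": 0.25},
]


def first_differences(rows, key):
    """Return the period-to-period changes of one variable for a single cell."""
    values = [row[key] for row in rows]
    return [later - earlier for earlier, later in zip(values, values[1:])]


d_dist = first_differences(panel, "dist_cbd")
d_res = first_differences(panel, "prop_residential")

print(d_dist)  # every difference is zero, so no coefficient can be identified
print(d_res)   # nonzero differences remain usable as a regressor
```

Because the differenced distance column carries no variation, any information it held can only reappear indirectly, for example through the cell-specific effect.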
To maintain comparability with the panel data spatial regression models, 25 samples of 1,000 observations (cells) were again used to estimate the LSDV models. However, because of complications with maintaining the same samples for the two model forms, a different set of samples was used to estimate the LSDV parameters. Examining the standard errors of the average sample parameter estimates for both the LSDV and the spatial regression models suggests that they both had similar levels of accuracy, which should not drastically affect their parameter levels and thus should not affect the comparison of the results of the models.

As mentioned in Chapter 4, because the bias is subtracted from the estimated parameters in the LSDV model, a bootstrapping technique should be used to estimate the standard errors for the parameters. However, in all three models discussed here, these standard errors seemed rather high, as they suggested nearly all of the variables were insignificant to the model, despite R² levels which were not terribly low. Though the R² measure cannot be used directly to judge whether more variables should be significant to the model, comparisons with the panel data spatial regression models suggest that, at the very least, many of the variables seen to be insignificant with bootstrapping probably should be more significant than their t-statistics suggest. As such, both the bootstrapped and the non-bootstrapped estimates for the standard errors and their respective t-statistics are included in the results, and the question of significance based on either of these will not be discussed in detail. Because bootstrapped standard error estimates may not always converge, Figure 5.1 presents an example from the population LSDV model which is indicative of the convergence of the standard error estimates as the number of bootstrap sample estimations increased (100 samples were used in all of the models).
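The kind of convergence check described above can be sketched as follows. This is only an illustration under stated assumptions: it uses synthetic data and a simple one-regressor least squares slope rather than the bias-corrected LSDV estimator of Chapter 4, and the function names are hypothetical.

```python
import random
import statistics


def ols_slope(xs, ys):
    """Ordinary least squares slope for a single regressor with intercept."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def bootstrap_slopes(xs, ys, n_boot, rng):
    """Re-estimate the slope on n_boot resamples drawn with replacement."""
    n = len(xs)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        slopes.append(ols_slope([xs[i] for i in idx], [ys[i] for i in idx]))
    return slopes


rng = random.Random(0)
# Synthetic data (an assumption for illustration): y = 2x + noise.
xs = [rng.uniform(0.0, 1.0) for _ in range(200)]
ys = [2.0 * x + rng.gauss(0.0, 0.5) for x in xs]

slopes = bootstrap_slopes(xs, ys, n_boot=100, rng=rng)

# Running standard deviation of the bootstrap estimates: the quantity whose
# stabilization a convergence plot would display.
running_se = [statistics.stdev(slopes[:k]) for k in range(2, len(slopes) + 1)]
for k in (10, 50, 100):
    print(k, round(running_se[k - 2], 4))
```

A running estimate that settles down only shows that the bootstrap has converged; it does not by itself show that the value it converges to is the correct standard error.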
Figure 5.1 Example of convergence of bootstrap parameter standard deviation estimates from ln(Population) LSDV model.
5.2.1 ln(Population) Model
Table 5.10 presents the results for the LSDV model for population. In comparing it with the results of the panel data spatial regression population models with similar explanatory variables, the most notable difference is that the R² level for this model is significantly higher than the others: 0.8127 versus around 0.55. This suggests that the inclusion of the lagged population variable is highly important for accurately estimating population. In the panel data spatial regression population models, the distance to the CBD was the most important variable for predicting the population of a cell. Because this variable cannot be included in the LSDV model, and because of the inclusion of the lagged dependent variable, the parameters for the other explanatory variables have changed significantly. Specifically, it is now seen that increases in residential and commercial land cover lead to decreases in the expected population of a cell. Also, population is expected to increase with increases in rural land cover. It is difficult to intuitively interpret these results, though it seems that the population lag somehow overestimates the population in urbanized and rural regions and the parameters for the land cover in these regions are correcting for this.

Finally, the random effects variance of this model, which is estimated to be 0.4162, is significantly less than that reported for the panel data spatial regression models. What is most likely happening is that the random effect in the LSDV model is accounting for, at least in part, the distance to the CBD and the nearest highway variables. On the other hand, the random effect in the spatial regression models is probably in large part accounting for the lack of a lagged dependent variable. What these results show is that it is beneficial to develop models which are able to account for both spatial effects, such as distance measures and spatial autocorrelation, and lagged dependent variables, as both of these elements seem
Variable Lagged ln(Population) No Bootstrap Constant No Bootstrap Proportion of Commercial Land Cover* No Bootstrap Proportion of Residential Land Cover* No Bootstrap ln(Proportion of Rural Land Cover)* No Bootstrap ln(Land Cover Mix)* No Bootstrap Land Cover Entropy* No Bootstrap Random Effect Standard Deviation R-Squared Number of Valid Samples
Beta 1.458 -0.531 -0.319
S.E. 0.382 0.0203 0.391 0.136 1.324 0.0649
T-statistic 3.820 72.329 1.382 3.933 0.240 4.952 0.267 9.517 0.0883 0.802 0.257 3.332 0.182 0.505 2948.863
Bias -0.707
Estimation Sample Properties Standard Max Min Error 1.474 1.439 0.0107 -0.471 -0.585 -0.420 0.0326 0.0529
Elasticities 2000 1.447 1997 1.432 1991 1.422
0.339
-0.200
-0.00667
-0.00618
-0.00400
-0.436
1.636 0.0464
0.467
-0.390
-0.492
0.0278
0.226
0.148
0.208
0.00108 -0.0205 2.23E-05 0.645 0.8127 25
0.0156 0.00170 0.0807 0.00624 0.127 0.0485 2.19E-04
-6.25E-04 0.0199 -0.00225
0.00346 -0.0146 0.0664 0.672
-0.00162 -0.0283 -0.0941 0.620
0.00126 0.00347 0.0336 0.0148
-1.19E-04 0.00195 -2.11E-06
-1.32E-04 0.00213 -2.31E-06
-1.62E-04 0.00263 -2.85E-06
Table 5.10 Results of ln(Population) LSDV model (* denotes time lagged variable).
to play a significant role in predicting the population of a cell.
5.2.2 ln(Per Capita Income) Model
Table 5.11 presents the results for the LSDV income model. Unlike the population models, the R² level for this model, at 0.7896, is significantly less than that for the panel data spatial regression models, which all had R² values of 0.9921. This immediately suggests that the removal of the distance to CBD and highway measures, as well as spatial autocorrelation, has a significant, detrimental impact on the model. This idea is supported by the previous discussion of the income spatial regression model, in which it was suggested that the distance to CBD measure was a dominant part of predicting the income levels in a cell.

As with the population model, the inclusion of the lagged dependent variable and the lack of time-constant data significantly change the parameter estimates. For example, an increase in the proportion of residential land cover in a cell leads to an expected decrease in the average income of that cell. Intuitively, this is contrary to what might be expected and, similar to that in the population model, it is probably a correction to the lagged dependent variable. It is thus seen that, for the income model, the distance to the CBD variable is so important that even a lagged dependent variable does not seem to be as useful for predicting the income of a cell. As such, though a model including both spatial effects and a lagged dependent variable probably would be more powerful than those discussed here, it seems that the panel data spatial linear regression model still performs very well.
Variable Lagged ln(Per Capita Income) No Bootstrap Constant No Bootstrap ln(Proportion of Commercial Land Cover)* No Bootstrap Proportion of Residential Land Cover* No Bootstrap ln(Proportion of Rural Land Cover)* No Bootstrap ln(Land Cover Mix)* No Bootstrap ln(Land Cover Entropy)* No Bootstrap Random Effect Standard Deviation R-Squared Number of Valid Samples
Beta 1.355 -3.335 -0.00445
S.E. 0.461 0.0186 4.133 0.207 0.105 0.00115
T-statistic 2.942 72.682 0.808 16.127 0.0426 3.859 0.0415 9.229 0.0177 1.365 0.0346 3.134 0.0347 1.678 4128.154
Bias -0.738
Estimation Sample Properties Standard Max Min Error 1.381 -3.154 1.337 -3.596 -0.00530 0.0122 0.122 5.41E-04
Elasticities 2000 1.345 1997 1.331 1991 1.322
0.00445
-0.00331
-9.30E-05
-8.63E-05
-5.59E-05
-0.300
7.239 0.0325
0.321
-0.265
-0.325
0.0146
0.156
0.102
0.143
0.00163 -0.0238 0.0104 0.484 0.7896 25
0.089 0.00120 0.685 0.00759 0.294 0.00608 1.17E-04
0.00120 0.0274 -0.0117
0.00263 -0.0158 0.0210 0.489
-0.0001 -0.0350 0.00379 0.483
6.48E-04 0.00485 0.00438 0.00164
-1.80E-04 0.00226 -9.84E-04
-2.00E-04 0.00246 -0.00107
-2.44E-04 0.00305 -0.00133
Table 5.11 Results of ln(Per Capita Income) LSDV model (* denotes time lagged variable).
5.2.3 Average Vehicles Available per Household Model
Table 5.12 presents the results for the LSDV vehicles model. Even more so than with the income model, the R² level for this model, 0.5696, is much less than those for the panel data spatial regression models, which were near 0.96. This suggests again that the loss of the distance measures in the models significantly affects their power in predicting the vehicles variable.

The parameters in the LSDV results actually seem to make more intuitive sense than those with the panel data spatial regression model. Specifically, an increase in the proportion of residential land cover leads to an expected increase in vehicles, whereas an increase in commercial land cover leads to a decrease. Furthermore, judging by the elasticity of the parameters, residential land cover is more important for predicting the number of vehicles than commercial land cover, which is opposite the results of the spatial regression models. Judging by the previous LSDV results, there is a good chance that these variables are still acting as correction factors to the lagged variable term.

The random effects variance of this LSDV model is significantly higher than that for the spatial regression models, 0.2333 versus about 0.085. This suggests that there is a significant time-constant effect which the LSDV model is not accounting for and which the other models are: namely, the two distance measures. Even the inclusion of a lagged dependent variable seems to leave out a lot of important information from the model. Thus, as with the income model, it seems that though a model with a lagged dependent variable and incorporating spatial effects probably would be beneficial, the panel data spatial regression model as presented here seems to work very well in predicting the average number of vehicles per household in a given cell.
Variable Lagged Vehicles Available No Bootstrap Constant No Bootstrap ln(Proportion of Commercial Land Cover)* No Bootstrap Proportion of Residential Land Cover* No Bootstrap Proportion of Rural Land Cover* No Bootstrap Land Cover Mix* No Bootstrap Land Cover Entropy* No Bootstrap Random Effect Standard Deviation R-Squared Number of Valid Samples
Beta 1.362 -0.815 -1.16E-04
S.E. 0.593 0.0639 1.028 0.158 0.0292 0.00107
T-statistic 2.314 21.334 0.813 5.176 0.0203 0.482 0.035 2.279 0.0439 0.416 0.037 2.539 0.032 1.871 4119.377
Bias -0.736
Estimation Sample Properties Standard Max Min Error 1.462 1.147 0.073 -0.353 -1.034 -0.00132 0.154 6.31E-04
Elasticities 2000 1.351 1997 1.337 1991 1.328
-0.00283
0.00109
-2.42E-06
-2.24E-06
-1.45E-06
0.0659
1.853 0.0288
-0.0355
0.116
0.0149
0.0231
-0.0342
-0.0224
-0.0314
0.00803 0.154 -0.0987 0.483 0.5696 25
0.264 0.0199 4.194 0.0604 3.066 0.0528 1.17E-04
-0.0124 -0.308 0.236
0.0315 0.233 -0.0236 0.483
-0.00269 0.067 -0.179 0.483
0.00872 0.046 0.0366 3.54E-06
-8.86E-04 -0.0146 0.00937
-9.83E-04 -0.0159 0.01022
-0.00120 -0.0197 0.0126
Table 5.12 Results of Average Number of Vehicles Available per Household LSDV model (* denotes time lagged variable).
5.3 PANEL DATA SPATIAL LINEAR REGRESSION MODEL USING PROBIT SAMPLE SELECTION
Despite the incorporation of spatial variables and autocorrelation into the regression models discussed earlier, there may still be regional or cell-specific effects which the models are not accounting for. Specifically, for a given dependent variable, there may be different models for different areas of the region modeled. For example, for areas of high urbanization, there might be a different model for population than for areas of low urbanization, because of differences in the way in which these regions attract (or deter) people from living in them. The sample selection models presented in Chapter 4 offer a way to account for such differences in models, especially when the variable determining the model split is latent.

In this section, two sample selection models are discussed: one is a model for ln(population), the other is a model for ln(per capita income), from here on respectively referred to as the population and income models. For the population model, the selection criterion for splitting the model is whether or not a cell is greater than 30% urbanized. The selection criterion for the income model split is whether or not a cell has a population greater than 175. The reason that these splits are chosen is illustrated graphically in Figure 5.2, which plots all of the data used in the models (119,784 data points) in a scatter plot format, comparing their variable characteristics. In the figure, it is clear that there is a different scatter pattern for population when the proportion of urbanized land cover is less than 0.3 than when it is greater. Likewise, though the delineation is less clear, there seems to be a change in the way that per capita income is spread once the population of a cell drops below 175. Other scatter plots comparing population, per capita income, and average vehicles per household to various latent variables were examined, but no other obvious variable splits were seen.

Figure 5.2 Scatter plots comparing various characteristics of the data: population vs. per capita income (left) and proportion of urban land cover vs. population (right).

Because the probit models used to determine the selection criterion do not involve spatial autocorrelation, a model incorporating all of the observations at once could be run. However, in order to maintain consistency with the second stage regression models, 25 samples of 1,000 observations (cells) each are again used to estimate the probit models. The fact that the entire data set could be modeled does, however, offer an opportunity to validate, at least to some extent, the sampling scheme used in this work. From this validation, it is seen that the model results from the samples are fairly accurate, though there are some problems which might be present. It is also seen from these validation results that the average standard errors from the samples are generally larger than the standard errors obtained from the entire data set, and thus the significance of the parameters is most likely higher than what is seen from the model averages.

Finally, recall that, when reading the tables, one of the reported parameters estimates the amount of spatial autocorrelation present, another relates the time-constant random effects variance to the variance of the model error which varies with time, and, in the models with time adjustment, the t-statistic reported for the time adjustment parameter tests the null hypothesis that this parameter is equal to one.
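As a small worked example of a t-statistic taken against a null value of one rather than zero, the sketch below uses the time adjustment estimate and standard error from the full-data-set probit model in Table 5.15. The function name `t_stat_against` is hypothetical, and the computation is the ordinary (estimate minus null) over standard error form, assumed here to match the reported statistic.

```python
def t_stat_against(estimate, std_error, null_value):
    """t-statistic for H0: parameter == null_value."""
    return (estimate - null_value) / std_error


# Time adjustment estimate and standard error from the full-data-set probit
# model (Table 5.15).
t_one = t_stat_against(0.934, 0.00243, null_value=1.0)
t_zero = t_stat_against(0.934, 0.00243, null_value=0.0)

print(round(abs(t_one), 2))  # close to the 27.128 reported in Table 5.15
print(round(t_zero, 2))      # what a conventional test against zero would give
```

The magnitude obtained against a null of one is close to the reported value; the small gap presumably reflects rounding of the published estimate and standard error. A test against zero would give a far larger statistic and would answer a different question.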
5.3.1 Urban Land Cover Greater/Less Than 0.3 ln(Population) Sample Selection Panel Data Spatial Regression Model
5.3.1.1 Probit Selection Model
Tables 5.13 and 5.14 respectively present the results for the proportion of urban land cover greater than 0.3 probit model with time-lagged variables and with time-lagged variables and time adjustment.

                                                                                  Estimation Sample Properties     Elasticities
Variable                                     Beta      S.E.     T-statistic  Max      Min      Std. Error  2000     1997     1991
Constant                                     0.879     0.332    2.650        1.664    0.215    0.370
Square root of Distance to CBD               -0.704    0.0604   11.687       -0.582   -0.842   0.069       -4.100   -4.690   -4.927
Square root of Distance to Nearest Highway   0.0440    0.0699   1.065        0.194    -0.116   0.082       0.120    0.137    0.144
Land Cover Mix*                              4.039     0.672    6.144        5.288    1.503    0.762       1.977    1.923    1.881
Land Cover Entropy*                          0.164     0.608    1.024        7.275    -1.248   1.608       0.0252   0.0233   0.0217
Random Effects Standard Deviation            1.349     0.0628   21.476       1.464    1.227    0.063
Log Likelihood Level                         -1.3007
Log Likelihood Level: Constants Only         -1.489
Number of Valid Samples                      25

Table 5.13 Results of Proportion of Urban Land Cover greater than 0.3 probit model with time-lagged variables (* denotes time lagged variable).

                                                                                  Estimation Sample Properties     Elasticities
Variable                                     Beta      S.E.     T-statistic  Max      Min      Std. Error  2000     1997     1991
Constant                                     1.118     0.348    3.219        2.127    0.344    0.408
Square root of Distance to CBD               -0.735    0.0628   11.719       -0.594   -0.871   0.074       -3.902   -5.162   -5.726
Square root of Distance to Nearest Highway   0.0479    0.0724   1.123        0.216    -0.120   0.090       0.119    0.157    0.174
Land Cover Mix*                              5.059     0.996    5.115        6.517    2.748    0.868       1.857    1.717    1.557
Land Cover Entropy*                          0.147     0.895    0.935        3.542    -1.476   1.149       0.0169   0.0149   0.0128
Random Effects Standard Deviation            1.393     0.0649   21.502       1.524    1.272    0.067
Time Adjustment                              0.937     0.0137   4.705        0.961    0.901    0.015
Log Likelihood Level                         -1.2861
Log Likelihood Level: Constants Only         -1.489
Number of Valid Samples                      25

Table 5.14 Results of Proportion of Urban Land Cover greater than 0.3 probit model with time-lagged variables and time adjustment (* denotes time lagged variable).

These model results are averages from the model results run on the 25 samples of 1,000 observations.[8] As with the previously discussed models, the results of both models are very similar. With respect to parameter estimates, the biggest difference between them is that the model with time adjustment has a larger constant term than the model without adjustment. More significant, however, is the fact that the log-likelihood level of the model with time adjustment is better than that without time adjustment. A measure of goodness of fit can be calculated by the likelihood ratio index (LRI), which compares the likelihood level of the full model, L, with that of the model including only constants, L0:

\mathrm{LRI} = 1 - \frac{\ln L}{\ln L_0} \qquad (5.3)
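As a quick check of equation 5.3, the short sketch below applies it to the average log-likelihood levels reported in Tables 5.13 and 5.14. The function name `likelihood_ratio_index` is an assumption introduced for this illustration.

```python
def likelihood_ratio_index(log_l_full, log_l_constants):
    """LRI = 1 - ln L / ln L0 (equation 5.3)."""
    return 1.0 - log_l_full / log_l_constants


# Average log-likelihood levels reported in Tables 5.13 and 5.14.
lri_no_adjust = likelihood_ratio_index(-1.3007, -1.489)
lri_adjust = likelihood_ratio_index(-1.2861, -1.489)

print(round(lri_no_adjust, 4))  # 0.1265
print(round(lri_adjust, 4))     # 0.1363
```

Both log-likelihood levels are negative, so a full-model level closer to zero than the constants-only level gives a ratio below one and hence a positive index.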
Generally (though there are caveats to this), the closer the LRI value is to unity, the better the fit of the model (Greene 2000). For the model with time adjustment, the LRI level is 0.1363, while that for the model without time adjustment is 0.1265, indicating a slightly improved fit for the model with time adjustment.

In examining the parameter estimates from the model results, it is seen that as the distance to the CBD increases, the probability that a particular cell will be more than 30% urbanized decreases. This makes sense, as it is expected that urbanized areas would tend to be located close to the CBD, as opposed to far from it. It is also seen that as the distance to the highway increases, the probability that a cell is greater than 30% urbanized actually increases; however, as with the models discussed previously, comparison of the elasticities leads to the interpretation that this parameter is probably a correction factor to the distance to the CBD measure. It is also seen that as land cover mix and entropy increase, the probability that a cell will be more than 30% urbanized increases as well. Furthermore, examining the elasticities shows that the mix variable has a far greater impact on the model than entropy, indicating that land cover mix is, at least to some degree, a good indicator for higher levels of urbanization.

As mentioned previously, it is possible to estimate the results of a greater than 0.3 urban land cover probit model on the entire data set, as opposed to samples. Table 5.15 presents the results of such a model with lagged variables and time adjustment. Comparing these results to the averages reported in Table 5.14 shows they are very close. Even the time adjustment factor, random-effects variance, log-likelihood levels, and elasticities of the two models are similar. The only major difference is in the standard errors of the parameters and their t-statistics. As is expected with a larger sample, the standard errors of the parameters decreased; however, because the parameter estimates stayed nearly the same, the t-statistics in the full model are generally much larger than the averages from the models run on the samples. Except for the entropy variable, all of the variables were shown to be highly significant. This lends credence to the sampling scheme, at least as far as indicating that the 25 samples are a representative sample of the population as a whole.

[8] In the regression models that are discussed below, the expected value of the probit error variables (see Section 4.41) from the model without time adjustment is used in the regression models without lagged variables and with lagged variables but without time adjustment; the expected value of the probit error variables from the model with time adjustment is used with the regression model with time adjustment. The motivation for this is to avoid problems which may arise if the time adjustment factor renders the instrument variable incorrect for regression models without time adjustment.
5.3.1.2 Urban Land Cover Greater Than 0.3 ln(Population) Model
Table 5.16 presents the results of the population model for cells which are greater than 30% urbanized. These results include the standard errors and t-statistics for the parameters determined by both traditional means, which are known to be incorrect, and bootstrapping. Though the traditionally determined standard errors are known to be biased, the bootstrapped estimates are very high, rendering all of the parameters insignificant at a 95% confidence level.
                                                                             Elasticities
Variable                                     Beta      S.E.     T-statistic  2000     1997     1991
Constant                                     1.188     0.063    18.880
Square root of Distance to CBD               -0.747    0.0113   -66.103      -3.909   -5.221   -5.807
Square root of Distance to Nearest Highway   0.0461    0.0130   3.558        0.113    0.151    0.168
Land Cover Mix*                              5.166     0.177    29.183       1.854    1.717    1.553
Land Cover Entropy*                          0.0740    0.142    0.522        0.00837  0.00737  0.00633
Random Effects Standard Deviation            1.366     0.0113   120.352
Time Adjustment                              0.934     0.00243  27.128
Log Likelihood Level                         -1.287
Log Likelihood Level: Constants Only         -1.4991

Table 5.15 Results of Proportion of Urban Land Cover greater than 0.3 probit model with time-lagged variables and time adjustment using entire data set (* denotes time lagged variable).
Variable Constant No Bootstrap Square root of Distance to CBD No Bootstrap Square root of Distance to Nearest Highway No Bootstrap Proportion of Commercial Land Cover No Bootstrap Proportion of Residential Land Cover No Bootstrap Proportion of Rural Land Cover No Bootstrap ln(Land Cover Mix) No Bootstrap ln(Land Cover Entropy) No Bootstrap Expected Value of Probit Error No Bootstrap Time Average of E.V. of Probit Error No Bootstrap Error Variance Random Effect Standard Deviation R-Squared Number of Valid Samples
Beta 5.471 -0.806 0.172 0.289 0.488 -7.94E-04 -0.125 0.0716 -0.199 0.659 4.958 4.880 13.244 16.431 0.7165 25
S.E. 63.051 0.282 14.918 0.0595 15.418 0.0773 5.843 0.0857 4.403 0.0774 0.111 0.00241 7.877 0.132 1.482 0.0307 3.223 0.0442 21.523 0.0871 0.138 0.085
T-statistic 0.0875 19.609 0.0547 13.624 0.0128 2.302 0.0511 3.495 0.113 6.416 0.0230 1.048 0.0273 1.615 0.0468 2.316 0.0627 4.618 0.0309 7.635 35.755 75.253
Estimation Sample Properties Standard Max Min Error 6.483 4.418 0.398 -0.688 0.417 0.556 0.813 0.00467 0.706 0.149 -0.0957 0.840 43.775 5.613 -0.891 -0.0714 0.158 0.327 -0.00561 -0.468 -0.00291 -0.297 0.502 3.969 4.259 0.0543 0.122 0.0946 0.0967 0.00280 0.240 0.0473 0.0537 0.0865 0.782 0.345
-1.006 0.109 0.0175 0.0645 -1.10E-05 0.0304 -0.0218
-1.004 0.110 0.0165 0.0633 -3.09E-05 0.0237 -0.0174
-0.899 0.103 0.0208 0.0533 -1.31E-05 0.0260 -0.0204
Table 5.16 Results from ln(Population) spatial regression model without time-lagged variables: Urban Land Cover greater than 0.3.
These bootstrapped results seem incorrect, as the R² level of the model, at 0.7165, indicates that at least some of the variables should be significant to the model. Figure 5.3 shows how the level of the bootstrapped standard error estimates changes as the number of bootstrap samples increases, and it is seen that the standard error estimates are converging. However, as with the LSDV models discussed previously, it seems that the values that these estimates are converging to may be incorrect. All the other sample selection regression models came across the same issue, further suggesting that the bootstrap procedure was not producing accurate standard error estimates. For completeness, both bootstrapped and traditional standard error and t-statistic estimates will be reported for all of the sample selection regression models; however, because of the questions concerning their accuracy, a detailed discussion of the parameter significances will not be carried out.

Two other issues concerning the model results are notable. First, elasticities for the expected value of the probit error variables are not reported. This is a result of the fact that statistics for these variables (i.e., their averages) were not kept and thus the elasticities of the parameters could not be computed using equation 5.1. Examining this more thoroughly would be a good topic of future research extending this work. Second, two of the variables, land cover entropy and proportion of rural land, have been changed from the population models discussed previously. Namely, the natural log of the former variable and the latter variable without the transformation are now used. Though for comparability between models the variable forms should not have been changed, the fit of the models improved in the selection models when these transformations were applied. This occurred for both the greater than and less than population models; why such a change occurred is unclear.
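The convergence check behind Figure 5.3 can be illustrated with a minimal sketch. It is not the procedure used in this thesis: it substitutes an ordinary least squares slope on synthetic data for the spatial panel model, and the names `ols_slope` and `running_bootstrap_se` are hypothetical. The point is only to show how a bootstrap standard error estimate is tracked as replicates accumulate, and how it can be compared with an analytical value when one is available.

```python
import numpy as np

def ols_slope(x, y):
    """Slope of a simple least-squares line of y on x."""
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))

def running_bootstrap_se(x, y, n_boot, rng):
    """Bootstrap the slope by resampling (x, y) pairs with replacement.

    Entry k (0-based) of the returned array is the standard deviation of
    the first k + 2 bootstrap slopes, i.e. the SE estimate at that point.
    """
    n = len(x)
    slopes = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample observation indices
        slopes[b] = ols_slope(x[idx], y[idx])
    return np.array([slopes[:k].std(ddof=1) for k in range(2, n_boot + 1)])

# Synthetic data (assumption: independent errors, unlike the spatial model).
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

se_path = running_bootstrap_se(x, y, n_boot=400, rng=rng)

# Analytical OLS standard error of the slope, for comparison.
xc = x - x.mean()
resid = y - y.mean() - ols_slope(x, y) * xc
analytic_se = float(np.sqrt(resid @ resid / (n - 2) / (xc @ xc)))

print(f"SE after 50 replicates:  {se_path[48]:.4f}")
print(f"SE after 400 replicates: {se_path[-1]:.4f}")
print(f"Analytical SE:           {analytic_se:.4f}")
```

In this independent-error setting the bootstrap path should settle near the analytical value. A path that settles on a value far from a trusted benchmark points to the resampling scheme rather than the number of replicates; one commonly cited difficulty is that resampling observations independently does not preserve dependence between them, though whether that applies here is not established in the text.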
Figure 5.3 Example of convergence of bootstrap parameter standard deviation estimates for urban land-cover proportion greater than 0.3 ln(Population) panel-data spatial regression model.

Nonetheless, the major results of the models, specifically variables with the greatest impact (rural land cover and entropy are not among them), goodness of fit, and spatial and temporal properties, can still be compared. The results in Table 5.16 are similar to those of the population model without time lags reported in Table 5.1. Though there are some differences, the parameter estimates for the distance measures and proportion residential and commercial land cover can be interpreted in a similar manner to that in section 5.1.1. A comparison of the elasticities shows, though, that residential land cover is a much more important indicator of population among cells which are greater than 30% urbanized than among all of the cells. The result for the spatial autocorrelation parameter shows that, within a neighborhood, cells tend to have similar population levels.

What is more interesting is the fact that the random error and random effects variances are significantly higher in this model than in the model not using sample selection. Though it is not definite, the most likely cause of this is the fact that only part of the 1,000 observations for each sample are used in the regression equation. Because the spatial autocorrelation causes all error terms to influence one another, eliminating some of them (that is, the ones for cells where less than 30% of the land cover is urbanized) may reduce the canceling effects that such autocorrelation can produce. Similar error term levels for the less than model bolster such an explanation, though it is by no means implied that this explanation is definitive.

Perhaps the most telling result of the population model for greater than 30% urban land cover is its R² level. At 0.7165, it is significantly higher than the 0.5491 level of the model without sample selection, and is approaching the 0.8127 level reported for the LSDV model which includes lagged variables. Thus, it is seen that, at least for this half of the model, there is a significant improvement in model fit by estimating a separate model for cells which have moderate to high urbanization levels.
However, as Greene (2000) notes, it may not be entirely possible to directly compare these fit levels since the models are not addressing identical data sets. Nonetheless, as the models and their results are fairly similar, the fit measures are most likely pointing towards an improved fit for the model using sample selection.

Tables 5.17 and 5.18 present the results of the models with time-lagged variables and with both time-lagged variables and time adjustment, respectively. Again, the results are relatively consistent with the results for the models without sample selection (Tables 5.2 and 5.3); also, there is a similar improvement to the R² level of the models as in the model without time-lagged variables. There are again high random error and random effects variances for the models. One significant change is in the time adjustment factor for the sample selection model. At 0.9932, it is much closer to unity than in the model without sample selection. In fact, with an average t-statistic of 1.592, whether or not this parameter is statistically different from one is questionable (at a 95% confidence level).

Between the two sample selection models with time-lagged variables, the results are nearly the same. Though the R² level for the model without time adjustment is slightly higher, the levels of the two models are close enough to render them essentially identical as far as goodness of fit is concerned. As such, the model with time adjustment, which is the most flexible, would again be recommended for use.
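Testing whether a time adjustment factor differs from one uses the usual t-ratio with a null value of one rather than zero. The sketch below assumes a two-sided test with the large-sample critical value of 1.96; the function name `t_against` is hypothetical. It applies the test to the full-data-set estimate in Table 5.15 and then compares the averaged t-statistic of 1.592 quoted above with the same critical value.

```python
def t_against(estimate, std_error, null_value=1.0):
    """t-ratio for H0: parameter == null_value."""
    return (estimate - null_value) / std_error

CRITICAL_95 = 1.96  # two-sided, large-sample normal approximation (assumption)

# Full-data-set probit model (Table 5.15): time adjustment and its S.E.
t_full = t_against(0.934, 0.00243)
print(f"Table 5.15: t = {t_full:.2f}, differs from one: {abs(t_full) > CRITICAL_95}")

# Averaged t-statistic reported for the greater than 0.3 population model
# with time adjustment (Table 5.18); used as printed, not recomputed.
t_avg = 1.592
print(f"Table 5.18: t = {t_avg:.3f}, differs from one: {abs(t_avg) > CRITICAL_95}")
```

Because the table entries are rounded, the ratio computed for Table 5.15 differs slightly in magnitude from the 27.128 printed there. The averaged statistic for Table 5.18 is an average of per-sample t-statistics, so it need not equal the ratio formed from the averaged estimate and standard error.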
5.3.1.3 Urban Land Cover Less Than 0.3 ln(Population) Model
Table 5.19 presents the results of the population model for cells which are less than 30% urbanized with no time-lagged variables. As previous results suggest, the population is expected to decrease as the distance to the CBD increases, and increase slightly with the distance to the nearest highway. However, the effects of these distances, as indicated by the parameters, are less than for the model including all of the cells (Table 5.1) and, especially, the model
Variable Constant No Bootstrap Square root of Distance to CBD No Bootstrap Square root of Distance to Nearest Highway No Bootstrap Proportion of Commercial Land Cover* No Bootstrap Proportion of Residential Land Cover* No Bootstrap Proportion of Rural Land Cover* No Bootstrap ln(Land Cover Mix)* No Bootstrap ln(Land Cover Entropy)* No Bootstrap Expected Value of Probit Error No Bootstrap Time Average of E.V. of Probit Error No Bootstrap Error Variance Random Effect Standard Deviation R-Squared Number of Valid Samples
Beta 5.255 -0.807 0.174 0.272 0.660 9.47E-04 0.0501 -0.0208 -0.0887 0.570 6.020 4.357 10.293 16.823 0.742 25
S.E. 72.846 0.266 17.203 0.0620 18.945 0.0777 3.030 0.0530 1.438 0.0357 0.100 0.00212 7.655 0.103 3.984 0.0276 7.435 0.0729 24.669 0.0909 0.167 0.139
T-statistic 0.0732 19.914 0.0477 13.073 0.0105 2.314 0.0946 5.408 0.467 19.107 0.0217 1.016 0.0144 1.041 0.00912 1.027 0.0128 1.326 0.0233 6.323 35.950 60.085
Estimation Sample Properties Standard Max Min Error 6.311 4.475 0.377 -0.659 0.418 0.373 0.735 0.00624 0.270 0.0466 0.0879 0.779 61.792 5.746 -0.922 -0.069 0.172 0.591 -0.00833 -0.250 -0.108 -0.289 0.443 4.144 0.728 0.0647 0.122 0.0514 0.0382 0.00288 0.114 0.0344 0.0755 0.0973 1.009 1.094
-1.007 0.110 0.0153 0.0845 0.0500 -0.00939 3.63E-05
-1.004 0.111 0.0204 0.0751 0.00619 -0.0109 1.62E-05
-0.899 0.104 0.0273 0.0498 0.00560 -0.0105 2.51E-05
Table 5.17 Results from ln(Population) spatial regression model with time-lagged variables: Urban Land Cover greater than 0.3 (* denotes time lagged variable).
Variable Constant No Bootstrap Square root of Distance to CBD No Bootstrap Square root of Distance to Nearest Highway No Bootstrap Proportion of Commercial Land Cover* No Bootstrap Proportion of Residential Land Cover* No Bootstrap Proportion of Rural Land Cover* No Bootstrap ln(Land Cover Mix)* No Bootstrap ln(Land Cover Entropy)* No Bootstrap Expected Value of Probit Error No Bootstrap Time Average of E.V. of Probit Error No Bootstrap Time Adjustment Error Variance Random Effect Standard Deviation R-Squared Number of Valid Samples
Beta 5.249 -0.808 0.173 0.286 0.690 9.00E-04 0.0642 -0.0215 -0.0838 0.567 6.028 4.628 0.993 10.293 16.823 0.742 25
S.E. 159.176 0.266 36.366 0.0620 47.657 0.0777 4.326 0.0552 4.494 0.0376 0.167 0.00220 11.161 0.109 7.701 0.0287 12.801 0.0726 54.056 0.0909 0.167 0.0950 0.00454
T-statistic 0.0344 19.900 0.0232 13.098 0.00438 2.311 0.0711 5.415 0.248 19.019 0.0144 0.975 0.0122 1.101 0.00516 1.030 0.00789 1.286 0.0109 6.287 35.962 79.212 1.592
Estimation Sample Properties Standard Max Min Error 6.304 4.482 0.377 -0.659 0.423 0.404 0.934 0.00717 0.310 0.0449 0.091 0.778 63.873 5.128 1.030 -0.922 -0.0705 0.146 0.547 -0.00723 -0.261 -0.117 -0.305 0.444 4.203 3.587 0.953 0.0638 0.122 0.0719 0.0964 0.00284 0.120 0.0362 0.0773 0.0978 1.019 0.435 0.0206
-1.008 0.110 0.0158 0.0865 0.0505 -0.0118 3.38E-05
-1.005 0.111 0.0206 0.0753 0.00613 -0.0134 1.48E-05
-0.900 0.104 0.0272 0.0492 0.00547 -0.0127 2.26E-05
Table 5.18 Results from ln(Population) spatial regression model with time-lagged variables and time adjustment: Urban Land Cover greater than 0.3 (* denotes time lagged variable).
Variable Constant No Bootstrap Square root of Distance to CBD No Bootstrap Square root of Distance to Nearest Highway No Bootstrap Proportion of Commercial Land Cover No Bootstrap Proportion of Residential Land Cover No Bootstrap Proportion of Rural Land Cover No Bootstrap ln(Land Cover Mix) No Bootstrap ln(Land Cover Entropy) No Bootstrap Expected Value of Probit Error No Bootstrap Time Average of E.V. of Probit Error No Bootstrap Error Variance Random Effect Standard Deviation R-Squared Number of Valid Samples
Beta 3.973 -0.540 0.0939 0.837 0.599 -0.00174 0.00473 0.0148 -0.281 0.399 5.553 5.449 22.564 26.338 0.5211 25
S.E. 41.437 0.225 7.755 0.0420 8.674 0.0464 4.261 0.147 2.556 0.0931 0.0481 0.00163 1.546 0.0505 0.128 0.00453 1.958 0.0418 9.033 0.0489 0.155 0.0377
T-statistic 0.096 17.708 0.0697 12.879 0.0108 2.021 0.199 5.706 0.235 6.438 0.0383 1.134 0.0286 0.862 0.119 3.273 0.146 6.728 0.0443 8.156 35.890 168.200
Estimation Sample Properties Standard Max Min Error 4.503 3.554 0.250 -0.438 0.148 1.121 0.776 5.88E-04 0.521 0.0223 -0.206 0.554 36.365 6.659 -0.647 0.0313 0.597 0.276 -0.00424 -0.0840 0.0075 -0.424 0.302 26.099 4.930 0.0444 0.0317 0.137 0.125 0.00129 0.113 0.00380 0.052 0.0607 0.243 0.603
-1.566 0.121 0.0138 0.0243 -1.77E-04 -0.00302 -0.0106
-2.726 0.131 0.00868 0.0301 -3.14E-04 -0.00386 -0.0140
-1.781 0.140 0.00871 0.00623 -1.99E-04 -0.00466 -0.0167
Table 5.19 Results from ln(Population) spatial regression model without time-lagged variables: Urban Land Cover less than 0.3.
for cells with greater than 30% urbanization (Table 5.16). This indicates that the importance of distance is diminished here, probably because less urbanized regions tend to be located farther from the CBD and highways. This interpretation is bolstered by the fact that the elasticities on the distance variables are higher in this model than the others. As with previous results, as the proportions of commercial and residential land cover increase and the proportion of rural land cover decreases, the population of the cell is expected to increase. Also, the positive value of the spatial autocorrelation parameter indicates that even in less urbanized areas, a cell with a certain population, or population density, will tend to have cells with similar levels located near it.

Similar to the results of the greater than population model, it is seen that there is a significant increase in the variance of the random error and random-effects terms in the model over the model without sample selection. Again this might be due to the fact that error cancellations across the more- and less-urbanized groups are not allowed to occur. Finally, unlike the greater than model, the R² level of this model is slightly smaller than that for the model without sample selection. This indicates that the model with sample selection does not have an improved fit level over the one without it. In fact, the poor fit of the model for cells with less than 30% urban land cover might be causing the smaller fit for the model without sample selection, as compared to the greater than model presented in the previous subsection.

Tables 5.20 and 5.21 present the results for the models with time-lagged variables and with both time-lagged variables and time adjustment, respectively. The results are again similar to those of the model without time-lagged variables and to the models without sample selection. One major difference, however, is that as the proportion of commercial land increases, population is expected to
Variable Constant No Bootstrap Square root of Distance to CBD No Bootstrap Square root of Distance to Nearest Highway No Bootstrap Proportion of Commercial Land Cover* No Bootstrap Proportion of Residential Land Cover* No Bootstrap Proportion of Rural Land Cover* No Bootstrap ln(Land Cover Mix)* No Bootstrap ln(Land Cover Entropy)* No Bootstrap Expected Value of Probit Error No Bootstrap Time Average of E.V. of Probit Error No Bootstrap Error Variance Random Effect Standard Deviation R-Squared Number of Valid Samples
Beta 4.392 -0.612 0.0950 -0.470 0.707 0.00313 0.272 -0.00849 0.0505 0.210 5.754 5.555 21.246 26.474 0.5239 25
S.E. 42.847 0.228 7.994 0.043 8.878 0.0468 4.003 0.129 2.110 0.0756 0.0501 0.00157 1.966 0.0553 0.124 0.00431 3.106 0.0594 9.395 0.0557 0.160 0.0487
T-statistic 0.103 19.257 0.0766 14.257 0.0107 2.026 0.118 3.609 0.339 9.338 0.0639 2.018 0.136 4.697 0.0723 1.972 0.0214 1.086 0.0224 3.790 35.890 123.474
Estimation Sample Properties Standard Max Min Error 4.941 3.911 0.264 -0.519 0.156 -0.217 0.955 0.00608 1.463 -0.00337 0.185 0.349 40.260 7.052 -0.719 0.0476 -0.744 0.466 -0.00077 0.158 -0.0156 -0.0651 0.0692 27.859 4.746 0.0479 0.0324 0.142 0.126 0.00159 0.252 0.00350 0.0622 0.0689 0.280 0.765
-1.775 0.123 -0.00451 0.0328 0.00742 -0.205 5.22E-04
-3.090 0.132 -0.00460 0.00689 0.00897 -0.252 3.35E-04
-2.019 0.141 -0.0102 0.00590 0.00899 -0.248 5.30E-04
Table 5.20 Results from ln(Population) spatial regression model with time-lagged variables: Urban Land Cover less than 0.3 (* denotes time lagged variable).
Variable Constant No Bootstrap Square root of Distance to CBD No Bootstrap Square root of Distance to Nearest Highway No Bootstrap Proportion of Commercial Land Cover* No Bootstrap Proportion of Residential Land Cover* No Bootstrap Proportion of Rural Land Cover* No Bootstrap ln(Land Cover Mix)* No Bootstrap ln(Land Cover Entropy)* No Bootstrap Expected Value of Probit Error No Bootstrap Time Average of E.V. of Probit Error No Bootstrap Time Adjustment Error Variance Random Effect Standard Deviation R-Squared Number of Valid Samples
Beta 4.396 -0.613 0.0947 -0.592 0.903 0.00425 0.332 -0.0103 0.0566 0.206 5.669 5.183 0.967 21.761 26.420 0.5258 25
S.E. 53.791 0.229 10.084 0.0430 11.423 0.0467 6.717 0.163 8.715 0.0931 0.102 0.00198 5.659 0.0658 0.203 0.00543 5.904 0.0599 10.863 0.056 0.158 0.0361 0.00881
T-statistic 0.0820 19.258 0.0609 14.271 0.00828 2.022 0.0913 3.630 0.146 9.649 0.0445 2.088 0.0719 4.864 0.0571 1.952 0.0131 1.165 0.0190 3.695 35.955 153.916 3.628
Estimation Sample Properties Standard Max Min Error 4.967 3.922 0.263 -0.520 0.158 -0.231 1.273 0.0109 1.249 -0.00205 0.207 0.346 39.090 6.909 1.064 -0.723 0.0483 -1.036 0.516 -2.33E-04 0.151 -0.0183 -0.0634 0.0672 27.855 4.640 0.905 0.047 0.0318 0.207 0.222 0.00253 0.213 0.0045 0.0632 0.0687 0.245 0.505 0.0347
-1.778 0.122 -0.00514 0.0379 0.00817 -0.227 6.41E-04
-3.095 0.132 -0.00474 0.00720 0.00894 -0.251 3.73E-04
-2.022 0.141 -0.0099 0.00577 0.00838 -0.231 5.52E-04
Table 5.21 Results from ln(Population) spatial regression model with time-lagged variables and time adjustment: Urban Land Cover less than 0.3 (* denotes time lagged variable).
decrease. Because this variable is time lagged, this indicates that for regions of low urbanization, past commercial development seems to lead to areas which are less attractive for residents. The time adjustment factor is again closer to one than that for the model without sample selection; however, unlike in the greater than model, the factor is different from unity at a 95% confidence level. As with the model without time-lagged variables, the R² levels of the models are slightly smaller than those of the models without sample selection, and the model with time adjustment has the best fit.

From this discussion, it is thus seen that if one were looking at only regions of low urbanization, the complication of going through the sample selection model would probably be unnecessary and is discouraged. However, if a region with a mix of urbanization levels is being modeled, then it is suggested that a sample selection model with time adjustment be used.
5.3.2 Population Greater/Less Than 175 ln(Per Capita Income) Sample Selection Panel Data Spatial Regression Model
5.3.2.1 Probit Selection Model
Tables 5.22 and 5.23 present the results of the population greater than 175 probit model with time-lagged variables and with both time-lagged variables and time adjustment, respectively. These models essentially model whether or not cells have high population densities (with high being roughly greater than 1944 persons per square kilometer). As expected from previous results, both models show that as the distance to the CBD increases, the expectation that a cell will have high population density decreases. It is also seen that the distance to the nearest highway variable again seems to act as a correction factor to the distance to the CBD measure, with the probability that a cell has high population density slightly increasing as the distance to the highway increases. Comparing the two models, there is a big discrepancy between the
Variable Constant Square root of Distance to CBD Square root of Distance to Nearest Highway Land Cover Mix* Land Cover Entropy* Random Effects Standard Deviation Log Likelihood Level Log Likelihood Level: Constants Only Number of Valid Samples
Beta 1.392 -1.582 0.0823 7.028 -7.336 2.250 -0.2028 -0.2498 25
S.E. 0.701 0.201 0.206 2.170 2.086 0.280
T-statistic 2.098 8.036 1.970 3.257 3.471 8.249
Estimation Sample Properties Standard Max Min Error 3.625 -0.678 1.035 -0.975 0.707 15.959 -0.990 2.996 -1.876 -1.077 1.445 -25.033 1.705 0.232 0.477 3.824 5.055 0.360
-42.258 1.027 15.780 -5.189
-42.258 1.027 13.418 -4.197
-42.258 1.027 12.501 -3.714
Table 5.22 Results of Population greater than 175 probit model with time-lagged variables (* denotes time lagged variable).
Variable Constant Square root of Distance to CBD Square root of Distance to Nearest Highway Land Cover Mix* Land Cover Entropy* Random Effects Standard Deviation Time Adjustment Log Likelihood Level Log Likelihood Level: Constants Only Number of Valid Samples
Beta 1.151 -1.710 0.0598 20.697 -15.850 2.346 0.773 -0.1999 -0.2677 25
S.E. 0.652 0.231 0.206 12.824 10.407 0.295 0.115
T-statistic 1.902 7.633 1.803 1.819 1.567 8.148 2.079
Estimation Sample Properties Standard Max Min Error 4.242 -1.443 1.151 -0.958 0.871 40.599 -0.469 3.082 1.120 -2.745 -1.081 5.425 -68.313 1.756 0.615 0.342 0.474 10.414 13.477 0.420 0.126
-31.223 0.510 14.678 -3.541
-43.842 0.716 8.096 -1.858
-47.529 0.776 4.886 -1.065
Table 5.23 Results of Population greater than 175 probit model with time-lagged variables and time adjustment (* denotes time lagged variable).
parameter estimates for land cover mix and entropy. Though the estimates have the same sign (with the probability that a cell will have a high population density increasing with mix and decreasing with entropy), the magnitudes of the parameters are very different. Though at first this might be attributed to the time adjustment factor which, at 0.773, is much smaller than one, an examination of the elasticities, which incorporate the time adjustment, shows there to be great differences between these as well. The true source of the problem is that there are not many cells with very high density (less than 2% in the entire population), and many of the samples may not contain enough of them to produce accurate parameter estimates. To compensate for this, oversampling or weighting should probably be used. Future work could examine this more extensively.

Further exposing the problems with the sampling are the results of the model run on the entire data set, shown in Table 5.24. It is seen that the full-data-set parameter estimates for land cover mix and entropy are much smaller than the averages estimated by either of the previous two models. Furthermore, though the rest of the parameter estimates are fairly similar among the three models, the time adjustment factor is also much different. In fact, where the averaged model had the time adjustment factor's level at much less than one, the full data set model estimated it at 1.988. This is not only a significant difference in magnitude; the fact that the two values lie on different sides of one indicates that the two model results have completely different interpretations of how the value of past information, with respect to this model, changes over time. Clearly, the sampling procedure has some serious problems when a segment which is modeled is underrepresented in the data set.

Lastly, it is noted that in comparing the LRI of the two models with averages, the model with time adjustment, at 0.2533, has a slightly better fit than that without, at 0.1881.
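The likelihood ratio index (LRI) figures quoted above can be reproduced from the log-likelihood levels printed in the tables, assuming the usual definition LRI = 1 - ln L / ln L0, where ln L0 is the constants-only level. The function name below is hypothetical, and because the table entries are rounded the results can differ from the quoted values in the last digit.

```python
def likelihood_ratio_index(loglik, loglik_constants_only):
    """LRI = 1 - lnL / lnL0 (assumed definition; both inputs are negative)."""
    return 1.0 - loglik / loglik_constants_only

# Log-likelihood levels as printed in the tables: (model, constants only).
models = {
    "Table 5.22 (no time adjustment)": (-0.2028, -0.2498),
    "Table 5.23 (time adjustment)": (-0.1999, -0.2677),
    "Table 5.24 (entire data set)": (-0.2182, -0.3023),
}

lri = {name: likelihood_ratio_index(ll, ll0) for name, (ll, ll0) in models.items()}
for name, value in lri.items():
    print(f"{name}: LRI = {value:.4f}")
```

The same function applies to any of the probit tables that report both log-likelihood levels.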
Variable                                      Beta      S.E.      T-statistic  Elast. 2000  Elast. 1997  Elast. 1991
Constant                                      1.762     0.134     13.171
Square root of Distance to CBD                -1.642    0.039     -42.556      -41.511      -41.011      -39.289
Square root of Distance to Nearest Highway    0.182     0.046     3.949        2.144        2.119        2.030
Land Cover Mix*                               0.0102    0.0105    0.976        0.171        1.125        3.968
Land Cover Entropy*                           -0.021    0.018     -1.154       -0.110       -0.691       -2.314
Random Effects Standard Deviation             2.231     0.046     48.978
Time Adjustment                               1.988     0.191     5.121
Log Likelihood Level                          -0.2182
Log Likelihood Level: Constants Only          -0.3023
Table 5.24 Results of Population greater than 175 probit model with time-lagged variables and time adjustment using entire data set (* denotes time lagged variable).
5.3.2.2 Population Greater Than 175 ln(Per Capita Income) Model
The results for the income model for cells which have population greater than 175 without time-lagged variables are presented in Table 5.25. As with the results from the population sample selection model, the bootstrapped standard error estimates appear to be inaccurate: they are very high, which understates the significance of the parameters. Also, in estimating the models for some of these samples, the maximum likelihood procedure used was unable to calculate the Hessian, which means that the standard errors for the parameters estimated by this procedure could not be calculated.9 As it took many hours to estimate the model for each sample, re-estimating the failed models was not feasible. Thus, when the Hessian from the model of a sample failed to be computed, that sample was not included in the final model averages.

Comparing the results with those of the income model without sample selection (Table 5.4) shows that the results have some similarities. For example, income is expected to increase as the distance to the CBD increases. However, there are some differences. Though it still seems to act as a correction to the distance to the CBD variable, the distance to the nearest highway variable now has a positive effect on the expected average per capita income in a cell. Also, increased commercial land cover is expected to lead to decreases in the average income levels. These changes, along with the fact that increased residential and rural land cover leads to increased average incomes, indicate that, among areas of higher population density, greater affluence is found in more remote and less commercially developed areas. Another, more general, change is that many of the estimated parameter averages have much higher magnitudes in the sample selection model.
9 One reason this might be happening is the poor sample properties associated with the population greater than 175 section of the data set; supporting this idea is the fact that the problem with Hessian calculations did not occur with the population less than 175 income model.
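The dependence of maximum likelihood standard errors on the Hessian can be shown with a small sketch. It assumes the standard result that the asymptotic covariance of the estimates is the inverse of the Hessian of the negative log-likelihood at the optimum. The matrices below are made-up illustrations, not values from these models, and `standard_errors_from_hessian` is a hypothetical helper rather than the routine used in this work. When the Hessian is not positive definite, no usable standard errors result, which is one way the situation described above can arise.

```python
import numpy as np

def standard_errors_from_hessian(hessian):
    """Return sqrt(diag(H^-1)), or None if H is not positive definite.

    `hessian` is the Hessian of the negative log-likelihood at the optimum.
    """
    hessian = np.asarray(hessian, dtype=float)
    try:
        # Cholesky succeeds only for positive definite matrices.
        np.linalg.cholesky(hessian)
    except np.linalg.LinAlgError:
        return None
    covariance = np.linalg.inv(hessian)
    return np.sqrt(np.diag(covariance))

# Illustrative, made-up Hessians for a two-parameter model.
well_behaved = [[400.0, 20.0], [20.0, 100.0]]
indefinite = [[400.0, 250.0], [250.0, 100.0]]   # negative determinant

se_ok = standard_errors_from_hessian(well_behaved)
se_bad = standard_errors_from_hessian(indefinite)
print("well-behaved Hessian:", se_ok)
print("indefinite Hessian:  ", se_bad)
```

Returning `None` mirrors the choice described in the text of dropping a sample from the averages when its standard errors cannot be computed.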
Variable Constant No Bootstrap ln(Square root of Distance to CBD) No Bootstrap Square root of Distance to Nearest Highway No Bootstrap ln(Proportion of Commercial Land Cover) No Bootstrap Proportion of Residential Land Cover No Bootstrap ln(Proportion of Rural Land Cover) No Bootstrap ln(Land Cover Mix) No Bootstrap ln(Land Cover Entropy) No Bootstrap Expected Value of Probit Error No Bootstrap Time Average of E.V. of Probit Error No Bootstrap Error Variance Random Effect Standard Deviation R-Squared Number of Valid Samples
Beta 9.277 0.806 0.110 -0.00156 0.322 0.00519 0.0128 -0.0281 -0.0860 -0.0568 0.0690 4.590 0.213 0.084 0.9975 23
S.E. 70.203 0.383 64.346 0.564 10.771 0.130 1.739 0.00367 27.756 0.0699 2.240 0.00773 37.042 0.120 29.874 0.0775 39.570 0.105 43.740 0.138 0.0271 0.094
T-statistic 0.139 34.504 0.0125 1.917 0.0128 0.966 0.00510 1.419 0.00093 0.167 -0.00051 -0.195 0.0124 2.772 -0.00128 -0.376 -0.00228 -0.954 -0.00132 -0.532 1.999 87.445
Estimation Sample Properties Standard Max Min Error 10.599 8.009 0.674 3.772 0.747 0.0185 0.699 0.0138 0.184 0.204 0.246 0.177 0.321 6.202 -0.999 -0.348 -0.0160 0.0187 -0.00214 -0.186 -0.211 -0.344 -0.575 0.0000 1.122 1.006 0.248 0.00783 0.156 0.00402 0.101 0.110 0.125 0.164 0.0673 0.519
0.0846 0.0215 4.93E-04 0.0198 -0.00460 -0.00186 0.00353
0.0811 0.0210 3.44E-04 0.0179 -0.00193 -0.00126 0.00256
0.0815 0.0209 2.63E-04 0.0160 -0.00459 -0.00166 0.00316
Table 5.25 Results from ln(Per Capita Income) spatial regression model without time-lagged variables: Population greater than 175.
A less significant change concerns the level of spatial autocorrelation. The value of the spatial autocorrelation parameter is smaller in this model than in the model without sample selection. Its positive value indicates that in high population density areas, neighborhoods tend to have similar per capita income levels. However, because it is smaller than in the model without sample selection, it is implied that such spatial autocorrelation, with respect to income, occurs across, and perhaps regardless of, population density levels.

Tables 5.26 and 5.27 respectively present the results for the income model with time-lagged variables and with both time-lagged variables and time adjustment. Comparing the results to the models without sample selection (Tables 5.5 and 5.6), it is again seen that the magnitudes of many of the parameter estimates are increased with the sample selection models. It is also seen that the distance to the nearest highway variable has a positive parameter; however, now so does the proportion of commercial land cover variable. This indicates, in an effect which is opposite of that for the non-sample selection models, that for densely populated areas commercial development is an indicator of future increased affluence. Also, the spatial autocorrelation parameter is again found to have a slightly smaller value than in the models without sample selection.

The differences between the models with and without time adjustment are very similar in the models with and without sample selection. Again the time adjustment factor is greater than one, indicating that the effect of the lagged variables in predicting the affluence of a cell, as measured by its average income level, increases multiplicatively with time. It is also seen that there is a minuscule decrease in the R² level in the model with time adjustment. Nonetheless, the R² levels for all the sample selection models for densely populated cells, which hovered around 0.9975, were improved over the models without sample selection, all of
Variable Constant No Bootstrap ln(Square root of Distance to CBD) No Bootstrap Square root of Distance to Nearest Highway No Bootstrap ln(Proportion of Commercial Land Cover)* No Bootstrap Proportion of Residential Land Cover* No Bootstrap ln(Proportion of Rural Land Cover)* No Bootstrap ln(Land Cover Mix)* No Bootstrap ln(Land Cover Entropy)* No Bootstrap Expected Value of Probit Error No Bootstrap Time Average of E.V. of Probit Error No Bootstrap Error Variance Random Effect Standard Deviation R-Squared Number of Valid Samples
Beta 9.318 0.503 0.110 0.0138 0.282 0.00522 0.0624 -0.111 0.0543 -0.134 0.0867 4.543 0.193 0.105 0.9976 24
S.E. 76.571 0.388 60.205 0.610 10.134 0.129 1.383 0.00330 41.630 0.125 4.138 0.0145 33.479 0.0992 50.307 0.140 52.515 0.161 51.899 0.154 0.0332 0.088
T-statistic 0.133 34.191 0.015 1.746 0.0189 1.552 0.00574 1.712 0.00447 1.149 0.00619 1.318 0.0102 3.014 0.00449 1.233 0.00266 1.004 0.00324 1.298 2.420 94.642
Estimation Sample Properties Standard Max Min Error 11.121 8.032 0.691 3.588 0.679 0.105 0.509 0.0116 0.492 0.199 0.547 0.262 0.708 6.125 -1.625 -0.225 -0.0356 0.0630 -0.00440 -0.217 -0.484 -0.363 -0.595 0.000 1.533 1.037 0.225 0.0291 0.132 0.00382 0.181 0.191 0.201 0.187 0.143 0.236
0.0528 0.0215 -0.00302 0.0156 -0.00192 -0.00608 0.0100
0.0506 0.0211 -0.00228 0.0137 -0.00451 -0.00791 0.0122
0.0508 0.0209 -0.00207 0.0095 -0.00386 -0.00639 0.0929
Table 5.26 Results from ln(Per Capita Income) spatial regression model with time-lagged variables: Population greater than 175 (* denotes time lagged variable).
Variable Constant No Bootstrap ln(Square root of Distance to CBD) No Bootstrap Square root of Distance to Nearest Highway No Bootstrap ln(Proportion of Commercial Land Cover)* No Bootstrap Proportion of Residential Land Cover* No Bootstrap ln(Proportion of Rural Land Cover)* No Bootstrap ln(Land Cover Mix)* No Bootstrap ln(Land Cover Entropy)* No Bootstrap Expected Value of Probit Error No Bootstrap Time Average of E.V. of Probit Error No Bootstrap Time Adjustment Error Variance Random Effect Standard Deviation R-Squared Number of Valid Samples
Beta 9.237 0.608 0.098 0.00737 0.174 0.00322 0.0235 -0.0571 0.0312 -0.115 0.107 3.858 1.134 0.186 0.128 0.9973 21
S.E. 67.701 0.407 78.316 0.643 10.575 0.137 0.834 0.00183 18.311 0.0662 1.767 0.00770 15.153 0.0522 22.667 0.073 40.997 0.156 45.134 0.158 0.0304 0.157 0.00458
T-statistic 0.146 32.391 0.0127 1.965 0.0199 1.548 0.00465 1.979 0.00525 1.232 0.00593 1.192 0.0110 3.224 0.00557 1.341 0.00360 1.029 0.00367 1.215 3.309 75.688 29.023
Estimation Sample Properties Standard Max Min Error 11.136 8.061 0.687 3.712 0.659 0.0717 0.514 0.0116 0.322 0.167 0.493 0.242 0.716 7.648 1.323 -1.475 -0.283 -0.0122 0.0200 -0.00232 -0.203 -0.316 -0.423 -0.605 0.000 1.398 0.962 1.084 0.262 0.0175 0.147 0.00305 0.118 0.118 0.193 0.192 0.170 1.532 0.103
0.0638 0.0192 -0.00235 0.0140 -0.00173 -0.00333 0.00750
0.0612 0.0189 -0.00258 0.0179 -0.00591 -0.00632 0.0133
0.0615 0.0187 -0.00302 0.0160 -0.00650 -0.00657 0.130
Table 5.27 Results from ln(Per Capita Income) spatial regression model with time-lagged variables and time adjustment: Population greater than 175 (* denotes time lagged variable).
which had R² values of 0.9921. Thus, in this case, the sample selection model provides a slight improvement over the model without it.
5.3.2.3 Population Less Than 175 ln(Per Capita Income) Model
Table 5.28 presents the results for the income model without time-lagged variables for cells which have populations less than 175. As noted earlier, a majority of the data set, and thus by extension of the samples, consists of cells with populations less than 175. Thus, as expected, the results of this model are nearly identical to those from the income model without sample selection (Table 5.4). This includes the R² level, which indicates that the models have identical fit. One slight difference is that, in the sample selection model, the spatial autocorrelation parameter is a little smaller, though it again implicitly indicates that neighborhoods of cells tend to have similar per capita income levels, regardless of the population density of the region. Another difference is the variance of the error terms. Though the parameter relating the random-effects variance to the error variance has nearly the same value as in the model without sample selection, the random error variance, and thus the random-effects variance, is much higher. The reason for this is unclear, though one explanation might follow the lines of that explaining the high variances in the population sample selection model: the observations that were dropped (because they had populations greater than 175) may have offered error cancellations via spatial effects which help to decrease the error variance. However, the small number of dropped observations does not lead to much confidence in this explanation, and it may be that there is an error in the method used to compute the random error variance in this model (see footnote 10).

Tables 5.29 and 5.30 respectively present the results of the income model for cells with populations less than 175 with time-lagged variables and with both
Footnote 10: The GAUSS code used to estimate this model was checked for errors and none were found. A more rigorous investigation of the issue of increased variances is something that may be addressed in future work.
Variable Constant No Bootstrap ln(Square root of Distance to CBD) No Bootstrap Square root of Distance to Nearest Highway No Bootstrap ln(Proportion of Commercial Land Cover) No Bootstrap Proportion of Residential Land Cover No Bootstrap ln(Proportion of Rural Land Cover) No Bootstrap ln(Land Cover Mix) No Bootstrap ln(Land Cover Entropy) No Bootstrap Expected Value of Probit Error No Bootstrap Time Average of E.V. of Probit Error No Bootstrap Error Variance Random Effect Standard Deviation R-Squared Number of Valid Samples
Beta 9.926 0.108 -0.00902 0.00177 0.143 -6.48E-04 0.0104 -0.00354 0.171 -0.201 5.397 5.946 15.188 18.046 0.9922 25
S.E. 230.428 0.101 140.561 0.0638 45.147 0.0192 0.248 0.00076 1.444 0.00493 0.228 0.00070 8.772 0.0202 1.056 0.00413 174.018 0.114 161.883 0.095 0.151 0.0297
T-statistic 0.0478 104.390 0.00077 1.697 3.85E-04 0.870 0.00320 1.062 0.0104 2.855 0.00927 2.876 0.0176 7.265 0.00711 1.726 0.00131 1.581 0.00150 2.028 35.591 265.585
Estimation Sample Properties Standard Max Min Error 10.343 9.463 0.202 0.399 0.0190 0.00314 0.252 0.00102 0.0217 0.0505 0.880 0.0156 102.936 9.865 -0.0641 -0.0507 -0.00148 0.0978 -0.00280 -0.0404 -0.0132 -0.131 -0.731 2.475 4.289 0.118 0.0184 0.00085 0.0339 7.21E-04 0.0115 0.0117 0.254 0.200 2.176 1.170
0.0166 -0.00197 -1.16E-03 0.00286 3.40E-04 -0.00118 3.45E-04
0.0167 -0.00200 -0.00116 0.00269 2.29E-04 -0.00133 3.84E-04
0.0170 -0.00204 -0.00142 0.00177 3.18E-04 -0.00163 4.76E-04
Table 5.28 Results from ln(Per Capita Income) spatial regression model without time-lagged variables: Population less than 175.
Variable Constant No Bootstrap ln(Square root of Distance to CBD) No Bootstrap Square root of Distance to Nearest Highway No Bootstrap ln(Proportion of Commercial Land Cover)* No Bootstrap Proportion of Residential Land Cover* No Bootstrap ln(Proportion of Rural Land Cover)* No Bootstrap ln(Land Cover Mix)* No Bootstrap ln(Land Cover Entropy)* No Bootstrap Expected Value of Probit Error No Bootstrap Time Average of E.V. of Probit Error No Bootstrap Error Variance Random Effect Standard Deviation R-Squared Number of Valid Samples
Beta 9.917 0.0938 -0.00916 -0.00232 0.246 4.59E-04 -0.00139 0.00218 0.226 -0.241 5.409 5.579 15.064 17.984 0.9922 25
S.E. 232.753 0.101 142.812 0.064 46.177 0.0192 0.266 0.00076 1.369 0.00462 0.204 0.00069 6.091 0.021 1.004 0.00379 181.262 0.116 166.485 0.0964 0.151 0.0243
T-statistic 0.0475 104.506 6.61E-04 1.502 3.94E-04 0.884 0.00468 1.338 0.00330 0.887 0.0122 3.404 0.0453 12.209 0.00358 0.889 0.00150 1.816 0.00176 2.404 35.633 259.157
Estimation Sample Properties Standard Max Min Error 10.290 9.470 0.197 0.385 0.0179 -0.00113 0.306 0.00238 0.00678 0.0825 0.957 -0.0184 103.237 6.964 -0.0639 -0.0518 -0.00626 0.158 -0.00749 -0.0697 -0.00547 -0.0531 -0.782 2.418 4.284 0.114 0.0189 0.00102 0.0380 0.002 0.0145 0.017 0.251 0.198 2.116 0.790
0.0144 -0.00200 0.00151 0.00455 -1.61E-04 1.77E-04 -2.34E-04
0.0145 -0.00203 0.00182 0.00297 -2.20E-04 2.13E-04 -2.86E-04
0.0148 -0.00207 0.00149 0.00112 -2.19E-04 2.14E-04 -2.87E-04
Table 5.29 Results from ln(Per Capita Income) spatial regression model with time-lagged variables: Population less than 175 (* denotes time lagged variable).
Variable Constant No Bootstrap ln(Square root of Distance to CBD) No Bootstrap Square root of Distance to Nearest Highway No Bootstrap ln(Proportion of Commercial Land Cover)* No Bootstrap Proportion of Residential Land Cover* No Bootstrap ln(Proportion of Rural Land Cover)* No Bootstrap ln(Land Cover Mix)* No Bootstrap ln(Land Cover Entropy)* No Bootstrap Expected Value of Probit Error No Bootstrap Time Average of E.V. of Probit Error No Bootstrap Time Adjustment Error Variance Random Effect Standard Deviation R-Squared Number of Valid Samples
Beta 9.915 0.0928 -0.00876 -0.00154 0.147 1.69E-04 -2.89E-04 6.32E-04 0.228 -0.243 5.448 5.533 1.100 15.178 17.146 0.9922 25
S.E. 0.101 225.109 0.0639 142.186 0.0192 45.5640 4.11E-04 0.210 0.0125 34.597 4.54E-04 0.433 0.00274 1.585 0.00223 1.348 0.115 180.548 0.0963 166.784 0.152 0.0411 0.0120
T-statistic 104.557 0.0492 1.519 6.70E-04 0.886 3.93E-04 3.863 0.00727 12.550 0.00512 1.283 0.00150 0.824 0.00156 0.866 0.00174 1.812 0.00141 2.418 0.00172 35.609 230.332 8.288
Estimation Sample Properties Standard Max Min Error 10.291 9.484 0.195 0.386 0.0175 -7.95E-04 0.358 0.00160 0.00603 0.0354 0.980 -0.00636 10.167 6.926 1.174 -0.0651 -0.0519 -0.00647 0.0800 -0.00615 -0.0299 -0.00366 -0.0346 -0.840 1.550 2.854 1.015 0.114 0.0192 0.00110 0.0573 0.00139 0.00644 0.00736 0.263 0.209 2.271 0.882 0.0396
0.0375 -0.00180 -3.74E-05 0.00809 -1.75E-04 2.33E-05 1.04E-05
0.0372 -0.00181 -6.60E-05 0.00951 -2.86E-05 3.58E-05 6.17E-06
0.0353 -0.00180 -1.14E-04 0.00809 -3.33E-05 4.41E-05 1.23E-05
Table 5.30 Results from ln(Per Capita Income) spatial regression model with time-lagged variables and time adjustment: Population less than 175 (* denotes time lagged variable).
time-lagged variables and time adjustment. Again, the results are very similar to those of the models without sample selection (Tables 5.5 and 5.6). As with the model without time-lagged variables, the spatial autocorrelation parameter is smaller and there is an increased variance in the random error and random-effects terms. Also, the time adjustment factor is very close to that of the model without sample selection. It is clear that this part of the sample selection model is, except for the estimates concerning the random terms, essentially the same as the model without sample selection. However, because there is a slightly better fit found with the population greater than 175 income model, the sample selection model as a whole seems to perform better than the model without sample selection.
5.4 PANEL DATA SPATIAL LOGISTIC REGRESSION MODEL
As discussed in Chapter 4, because they lie on the [0,1] interval, land cover proportions data cannot be modeled using the standard spatial regression techniques. However, such data can be modeled using the spatial logistic regression technique discussed in Section 4.5. In this section, three proportions data models are run: the proportion of urban land cover, the proportion of urban land cover which is residential (residential|urban), and the proportion of non-urban land cover which is rural (rural|not urban) (see footnote 11). These models will, from here on in this section, be referred to as the urban, residential, and rural models. These three models are actually the constituent parts of a larger, two-binary split model. The urban model represents the first binary split, with the rural and residential models, which incorporate results from the urban model, being the secondary splits. Again, an average of model estimates from 25 samples of 1,000 observations (cells) each was used to estimate the parameters. However, for some of the samples in the residential and rural models, the maximum likelihood
Footnote 11: Recall that in this work, rural refers to agricultural-based land cover.
procedure in the GAUSS program used to estimate the models could not calculate the Hessian for the parameters estimated by this method. As it took many hours to estimate the model results for each sample, these samples were discarded from that model's results (see footnote 12). When reading the tables, remember that the spatial autocorrelation parameter estimates the amount of spatial autocorrelation present, that the variance ratio parameter relates the time-constant random-effects variance to the variance of the model error which varies with time, and that, in the models with time adjustment, the t-statistic reported tests the null hypothesis that the time adjustment parameter is equal to one. As noted in Chapter 4, there is an issue concerning the correct form of the instrument variable included in the explanatory variables of the secondary binary split models. However, it is seen that this variable not only appears to be significant, but can also be interpreted such that important insights for the model are gained.
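As a reading aid for the two-binary-split structure described in this section, the following minimal Python sketch shows how the three modeled proportions would combine into unconditional land cover shares by simple multiplication. The function name, the labels "other urban" and "other non-urban", and the input values are hypothetical illustrations, not quantities taken from the estimated models. How the secondary splits are linked to the urban model through the instrument variable is described in Chapter 4 and is not reproduced here.

```python
def combine_splits(p_urban, p_res_given_urban, p_rural_given_not_urban):
    """Combine the three modeled proportions into four unconditional shares."""
    for p in (p_urban, p_res_given_urban, p_rural_given_not_urban):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie on the [0, 1] interval")
    return {
        # First split: urban versus not urban; second splits act within each branch.
        "residential": p_urban * p_res_given_urban,
        "other urban": p_urban * (1.0 - p_res_given_urban),
        "rural": (1.0 - p_urban) * p_rural_given_not_urban,
        "other non-urban": (1.0 - p_urban) * (1.0 - p_rural_given_not_urban),
    }


# Hypothetical cell: 40% urban, 75% of urban is residential,
# 50% of non-urban is rural (agricultural).
shares = combine_splits(0.40, 0.75, 0.50)
total = sum(shares.values())
print(shares, total)
```

Because each secondary split only divides one branch of the first split, the four shares always sum to one.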
5.4.1 Proportion Urban Land Cover Model
Table 5.31 presents the results of the urban model without time-lagged variables. As is expected, as the distance to the CBD increases, the expected proportion of urban land cover decreases. Also, as the distance to the nearest highway increases, the expected proportion of urban land cover increases. Once again, a comparison of the elasticities of the parameters leads to the conclusion that this non-intuitive result is possibly due to the distance to the nearest highway acting as a correction factor to the distance to CBD measure. It is also seen that the expected proportion of urbanized land increases as land cover mix decreases and entropy increases.
Footnote 12: Attempts were made to re-estimate the results for the samples which caused problems by changing the initial values of the parameters, but this was not successful (the long estimation times limited such attempts).
                                                                             Estimation Sample Properties
Variable                                    Beta      S.E.      T-statistic  Max       Min       Standard Error
Constant                                    0.311     0.228     1.472        0.636     -0.238    0.205
Square root of Distance to CBD              -0.347    0.0417    8.312        -0.284    -0.422    0.0346
Square root of Distance to Nearest Highway  0.0126    0.0548    0.578        0.0785    -0.110    0.0418
Land Cover Mix                              -1.038    0.654     1.617        0.254     -1.795    0.534
Land Cover Entropy                          2.580     0.596     4.385        9.311     1.145     1.487
Random-Effects Variance Ratio Parameter     0.127     0.0318    11.199       0.175     0.0931    0.0213
Spatial Autocorrelation Parameter           4.335     0.0817    59.073       5.056     3.892     0.324
Error Variance                              1.474
Random Effect Standard Deviation            0.432
R-Squared                                   0.08117
Number of Valid Samples                     25
0.751 -0.0127 0.194 -0.145
0.721 -0.0122 0.182 -0.142
0.383 -0.00648 0.0820 -0.0611
Table 5.31 Results from Proportion of Urban Land Cover spatial logistic regression model without time-lagged variables.
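As a quick consistency check on the urban model tables (Tables 5.31 through 5.33), the sketch below compares the reported value of the parameter relating the time-constant random-effects variance to the error variance (0.127 in Table 5.31) with the ratio implied by the reported random effect standard deviation and error variance. It assumes, as a working hypothesis rather than a definition taken from Chapter 4, that the parameter is the random-effects variance divided by the error variance. The function name is hypothetical, and because the tabulated values are averages over samples, only approximate agreement should be expected.

```python
def implied_variance_ratio(re_std_dev, error_variance):
    """Random-effects variance divided by error variance (assumed form)."""
    return re_std_dev ** 2 / error_variance


# (reported ratio parameter, random effect std. dev., error variance)
reported = {
    "Table 5.31": (0.127, 0.432, 1.474),
    "Table 5.32": (0.0969, 0.390, 1.574),
    "Table 5.33": (0.0966, 0.387, 1.555),
}

gaps = {}
for name, (ratio, sd, var) in reported.items():
    implied = implied_variance_ratio(sd, var)
    gaps[name] = abs(implied - ratio)
    print(f"{name}: reported {ratio:.4f}, implied {implied:.4f}")
```

For all three urban models the implied and reported values differ by less than 0.001, which is consistent with the assumed form.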
From the positive value of the spatial autocorrelation parameter, and its high average significance, it is seen that for a cell with a given level of urbanization, cells with similar levels of urbanization are expected to be located in its neighborhood. This is expected, as the urban areas in the land cover maps are obviously not homogeneously distributed throughout the map but rather tend to be grouped into areas of high and low urbanization (e.g., cities and rural areas tend to be noticeably distinct). Also, the high average significance of the variance ratio parameter indicates that there is a significant variation in the time-constant urbanization level expected in the region. That is, cells have distinct predispositions towards or against urbanization which are time constant; this may be a result of their geographic location (outside of that already accounted for in the model), their physical geography, or something else which is perhaps unmeasurable.

Tables 5.32 and 5.33 present the results for the urban models with time-lagged variables and with both time-lagged variables and time adjustment, respectively. The results are similar to those of the model without time-lagged variables, with the major difference being that the signs on land cover mix and entropy have changed. Examining the average t-statistics and elasticities of the mix variable, it is seen that when it is time lagged, it gains a large amount of power in predicting the level of urbanization of a cell. This indicates that the higher a cell's land cover mix, the greater the chance that more urban development will occur in that cell in the future. It is also seen that the time adjustment factor is slightly greater than one, possibly indicating that, in order to account for future urbanization, mix and entropy levels must be inflated. It also indicates that the model will probably predict increases in urbanized regions as time progresses. This is as expected; however, the true implications of this will be apparent in the simulations presented in Chapter 6.
                                                                             Estimation Sample Properties
Variable                                    Beta      S.E.      T-statistic  Max       Min       Standard Error
Constant                                    0.290     0.226     1.323        0.557     -0.0803   0.174
Square root of Distance to CBD              -0.329    0.0402    8.219        -0.270    -0.397    0.0320
Square root of Distance to Nearest Highway  0.00730   0.0530    0.636        0.0958    -0.103    0.0446
Land Cover Mix*                             3.951     0.679     5.824        5.425     2.824     0.641
Land Cover Entropy*                         -1.891    0.602     3.032        -0.608    -9.637    1.694
Random-Effects Variance Ratio Parameter     0.0969    0.0335    9.328        0.154     0.0557    0.0207
Spatial Autocorrelation Parameter           4.312     0.0810    59.992       4.807     3.899     0.286
Error Variance                              1.574
Random Effect Standard Deviation            0.390
R-Squared                                   0.08917
Number of Valid Samples                     25
0.714 -0.00739 -0.720 0.109
0.686 -0.00710 -0.588 0.084
0.364 -0.00377 -0.291 0.0396
Table 5.32 Results from Proportion of Urban Land Cover spatial logistic regression model with time-lagged variables (* denotes time lagged variable).
                                                                             Estimation Sample Properties
Variable                                    Beta      S.E.      T-statistic  Max       Min       Standard Error
Constant                                    0.293     0.226     1.331        0.549     -0.0779   0.174
Square root of Distance to CBD              -0.332    0.0403    8.254        -0.271    -0.404    0.0326
Square root of Distance to Nearest Highway  0.00736   0.0532    0.632        0.096     -0.103    0.0447
Land Cover Mix*                             3.629     0.622     5.834        4.932     1.868     0.781
Land Cover Entropy*                         -1.732    0.555     3.007        -0.592    -9.207    1.638
Random-Effects Variance Ratio Parameter     0.0966    0.034     9.262        0.154     0.0558    0.0206
Spatial Autocorrelation Parameter           4.311     0.0812    59.814       4.808     3.892     0.286
Time Adjustment                             1.018     0.00938   1.789        1.083     0.966     0.0264
Error Variance                              1.555
Random Effect Standard Deviation            0.387
R-Squared                                   0.08999
Number of Valid Samples                     25
0.720 -0.00745 -0.697 0.105
0.691 -0.00716 -0.600 0.086
0.367 -0.00380 -0.308 0.0418
Table 5.33 Results from Proportion of Urban Land Cover spatial logistic regression model with time-lagged variables and time adjustment (* denotes time lagged variable).
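Because the time adjustment t-statistic tests against one rather than zero, it can help to see the calculation spelled out. The sketch below applies it to the averaged estimate and standard error for the time adjustment parameter in Table 5.33. The function name and the 1.96 critical value (a two-sided 95% normal approximation) are assumptions made for illustration. The tabulated t-statistic of 1.789 is an average over samples, so it need not equal the value computed from the averaged estimate and averaged standard error.

```python
def t_stat_against_one(estimate, std_error):
    """Absolute t-statistic for the null hypothesis that the parameter equals one."""
    return abs(estimate - 1.0) / std_error


# Averaged time adjustment estimate and standard error from Table 5.33.
t_value = t_stat_against_one(1.018, 0.00938)

# Assumed two-sided 95% critical value under a normal approximation.
significant = t_value > 1.96
print(round(t_value, 3), significant)
```

Under these assumptions the statistic is a little above 1.9, just short of the assumed critical value, which matches the description of the factor as only slightly greater than one.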
Finally, it is noted that the R² levels for the three models, all less than 0.09, are not very high. However, there is a definite improvement in the fit of the urban model with time-lagged variables and time adjustment, indicating that the addition of these model elements increases the power of the model.
5.4.2 Proportion Residential|Urban Land Cover Model
Table 5.34 presents the results of the residential model without time-lagged variables. As is expected, it is seen that as the distances to both the CBD and the nearest highway increase, the proportion of urbanized land which is residential decreases. However, it is noted that the average t-statistics for these distance variables are very low, indicating that they probably do not distinguish between residential and non-residential urban land cover very well. In fact, the t-statistics on all of the explanatory variables are low. Only the instrument variable and land cover entropy have average t-statistics which are even close to being significant with 95% confidence. Furthermore, upon examining the elasticities, the instrument variable appears to dominate the model. The sign on this variable is negative, indicating that as the proportion of urban land cover increases, the proportion of that land which is expected to be residential increases.

Two parameters which do have high average t-statistics are the spatial autocorrelation and variance ratio parameters. The positive value of the spatial autocorrelation parameter indicates that even among urbanized land, cells with similar levels of residential land cover tend to be found close to one another. The significance of the variance ratio parameter indicates that there is a statistically significant variation in the time-constant propensity for certain urban areas to be residential. Finally, it is noted that the R² level of this model, at 0.1124, is an improvement over the level of the urban model, and indicates that this further refinement allows for a better data fit.
                                                                             Estimation Sample Properties
Variable                                    Beta      S.E.      T-statistic  Max       Min       Standard Error
Instrument Variable                         -0.351    0.210     1.657        -0.071    -0.671    0.166
Constant                                    -0.0702   0.442     0.699        0.692     -0.655    0.367
Square root of Distance to CBD              -0.0611   0.140     0.734        0.107     -0.237    0.0954
Square root of Distance to Nearest Highway  -0.0295   0.0638    0.595        0.0341    -0.097    0.0358
Land Cover Mix                              0.067     0.886     0.498        1.393     -1.510    0.580
Land Cover Entropy                          1.510     1.185     1.354        3.362     -0.240    0.903
Random-Effects Variance Ratio Parameter     0.0381    0.0555    4.480        0.141     0.000     0.0395
Spatial Autocorrelation Parameter           5.007     0.0505    112.464      9.797     3.966     1.296
Error Variance                              1.425
Random Effect Standard Deviation            0.224
R-Squared                                   0.1124
Number of Valid Samples                     22
Elasticities 2000 -0.212 1997 -0.182 1991 -0.156
-0.0613 -0.0138 0.00578 0.0392
-0.0543 -0.0123 0.00500 0.0355
-0.0441 -0.0100 0.00345 0.0233
Table 5.34 Results from Proportion of Urban Land Cover which is Residential spatial logistic regression model without time-lagged variables.
Tables 5.35 and 5.36 present the results for the residential models with time-lagged variables and with both time-lagged variables and time adjustment, respectively. With these models, and especially that with time adjustment, there seems to be a greater significance for the explanatory variables, as indicated by the average t-statistics, than in the model without time lags. Furthermore, the land cover mix variable, judging by its elasticity and parameter levels, is much more powerful in predicting residential land cover when it is time lagged. This result is similar to that for the urban model, and indicates that mix is perhaps a good indicator for future urban and residential development.

More problematic is the instrument variable. For the model with time adjustment, its result was similar to that of the model without time-lagged variables; however, for the model without time adjustment, the instrument variable's sign changed and its elasticities dropped significantly. This could be an indication of a bad selection for the form of the instrument variable. Another explanation is that not accounting for the difference in time lags, as is done with the time adjustment factor, can have a detrimental effect on the model results. Most likely, however, it is a result of the fact that, for the model without time adjustment, five samples had to be thrown out because of problems computing the Hessian. Though the other models had three samples removed, those additional lost samples could very well have caused problems for the model results.

Finally, it is noted that the time adjustment factor is very close to one, and the low average t-statistic indicates that it may not be significantly different, at a 95% confidence level, from unity. Though a comparison of the results for the two models with time-lagged variables seems to indicate that the time adjustment factor significantly impacts the model parameters, this could also be a result of the samples which were thrown out. Nonetheless, as the time adjustment
                                                                             Estimation Sample Properties
Variable                                    Beta      S.E.      T-statistic  Max       Min       Standard Error
Instrument Variable                         0.0465    0.193     0.710        0.328     -0.208    0.154
Constant                                    -0.485    0.432     1.210        0.157     -1.157    0.337
Square root of Distance to CBD              -0.251    0.120     2.084        -0.120    -0.415    0.0952
Square root of Distance to Nearest Highway  -0.0624   0.0617    1.164        0.085     -0.194    0.0557
Land Cover Mix*                             4.933     1.658     3.015        6.804     3.575     1.067
Land Cover Entropy*                         -2.037    0.973     2.116        -0.633    -8.651    1.671
Random-Effects Variance Ratio Parameter     0.0133    0.0855    1.711        0.0974    0.000     0.0233
Spatial Autocorrelation Parameter           4.589     0.0385    145.607      6.190     3.967     0.555
Error Variance                              1.507
Random Effect Standard Deviation            0.137
R-Squared                                   0.1115
Number of Valid Samples                     20
Elasticities 2000 0.0314 1997 0.0302 1991 0.0249
-0.252 -0.0292 0.416 -0.0541
-0.223 -0.0259 0.313 -0.0388
-0.181 -0.0210 0.237 -0.0279
Table 5.35 Results from Proportion of Urban Land Cover which is Residential spatial logistic regression model with time-lagged variables (* denotes time lagged variable).
                                                                             Estimation Sample Properties
Variable                                    Beta      S.E.      T-statistic  Max       Min       Standard Error
Instrument Variable                         -0.378    0.168     3.357        1.011     -1.895    0.618
Constant                                    0.249     0.387     2.019        2.439     -1.718    0.953
Square root of Distance to CBD              -0.0157   0.106     2.345        0.906     -0.845    0.363
Square root of Distance to Nearest Highway  -0.0522   0.0620    1.155        0.103     -0.229    0.0782
Land Cover Mix*                             1.651     1.589     2.258        12.071    -6.460    4.247
Land Cover Entropy*                         -0.928    0.991     1.483        2.951     -4.692    1.585
Random-Effects Variance Ratio Parameter     0.0164    0.0705    2.108        0.0828    0.000     0.0205
Spatial Autocorrelation Parameter           4.512     0.0388    135.447      5.708     3.966     0.489
Time Adjustment                             0.990     0.0196    0.686        1.117     0.883     0.0675
Error Variance                              1.488
Random Effect Standard Deviation            0.153
R-Squared                                   0.1095
Number of Valid Samples                     22
Elasticities 2000 -0.263 1997 -0.252 1991 -0.208
-0.0158 -0.0244 0.135 -0.0239
-0.0140 -0.0217 0.0984 -0.0166
-0.0113 -0.0176 0.0729 -0.0117
Table 5.36 Results from Proportion of Urban Land Cover which is Residential spatial logistic regression model with time-lagged variables and time adjustment (* denotes time lagged variable).
parameter led to more samples being included when using time-lagged variables, the model with time adjustment is recommended as the form for estimating residential land cover.
5.4.3 Proportion Rural|Not Urban Land Cover Model
Table 5.37 presents the results of the rural model without time-lagged variables. In this model, it is seen that the proportion of non-urban land which is rural falls as the distance to the CBD and nearest highway increase. Furthermore, from the parameter elasticities it is also seen that the distance to the CBD has a large impact on the model, indicating that even among non-urban land, agricultural land tends to be located far away from the city center. Nearly as important as the distance to the CBD, as indicated by its elasticity, is the instrument variable. From its positive value, it is seen that as the proportion of non-urban land cover increases, the proportion of rural land cover is actually expected to decrease. This may be because areas which are less urbanized are so because they have qualities that do not make them conducive to urban or rural land use; for example, they are mountainous or water areas.

With the exception of the distance to the nearest highway parameter, all of the parameters are significant. Highly significant is the spatial autocorrelation parameter, whose positive value indicates that a cell with a certain proportion of rural land cover will tend to be located close to other cells with similar rural land cover levels. This is expected, as agricultural land is expected to be found as part of large farms or tracts. Also significant is the variance ratio parameter, which indicates that for non-urban land, there is a statistically significant variation in the average baseline level of rural land cover expected in a cell. Again this is expected, as certain non-urban areas are expected to be highly conducive to rural use (that is, flat and fertile) and others not (such as craggy or barren areas).
                                                                             Estimation Sample Properties
Variable                                    Beta      S.E.      T-statistic  Max       Min       Standard Error
Instrument Variable                         1.754     0.699     2.533        3.176     0.867     0.617
Constant                                    -7.346    1.568     4.754        -5.276    -11.033   1.445
Square root of Distance to CBD              0.625     0.151     4.182        0.972     0.418     0.153
Square root of Distance to Nearest Highway  0.0231    0.0728    0.830        0.218     -0.109    0.0771
Land Cover Mix                              -4.339    1.079     4.136        -2.282    -6.699    1.192
Land Cover Entropy                          3.762     1.419     2.975        6.026     1.200     1.291
Random-Effects Variance Ratio Parameter     0.0358    0.0511    4.376        0.101     0.000     0.0304
Spatial Autocorrelation Parameter           4.778     0.0325    165.526      6.180     3.975     0.686
Error Variance                              1.190
Random Effect Standard Deviation            0.203
R-Squared                                   0.2328
Number of Valid Samples                     24
Elasticities 2000 0.669 1997 0.605 1991 0.483
-0.665 -0.0115 0.398 -0.104
-1.578 -0.0272 0.922 -0.252
-0.749 -0.0129 0.372 -0.097
Table 5.37 Results from Proportion of Non-Urban Land Cover which is Rural spatial logistic regression model without time-lagged variables.
Tables 5.38 and 5.39 present the results for the rural model with time-lagged variables and with time-lagged variables and time adjustment, respectively. With the exception of the entropy and distance to the nearest highway variables in the model with time adjustment, the results are very similar to those of the model without time-lagged variables. In the model with time adjustment, the results show a significant change in the distance to nearest highway, entropy, and mix variables, with a sign change in the first two. Upon closer examination, it is found that this is primarily a result of the fact that the model results from one sample had very large magnitudes for these variables (indicated in the Max and Min columns of the results), and that these results skewed the averages (see footnote 13). Table 5.40 shows the averages with this sample dropped, and shows that the estimates are then more in line with what is seen in the other models. However, this exposes how sensitive the sampling scheme used here is to aberrations in the estimates. Finally, it is noted that the R² levels for all of the rural models indicate that they have a better fit than the urban and residential models. Nonetheless, with levels no higher than 0.2328, the fit is still fairly low.
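The effect of the single deviant sample can be checked by arithmetic. The sketch below removes one sample's value from a reported 25-sample average, assuming that the extreme values in the Max and Min columns of Table 5.39 (201.795 for mix, -729.427 for entropy, and -0.449 for distance to nearest highway) all come from that one sample. The function name is hypothetical. Under that assumption, the recomputed 24-sample averages agree with the values reported in Table 5.40 (-9.784, 6.496, and 0.01743).

```python
def mean_without(mean_all, n, dropped_value):
    """Average of the remaining n - 1 values after dropping one value."""
    return (n * mean_all - dropped_value) / (n - 1)


n_samples = 25

# (25-sample average, assumed value from the deviant sample), from Table 5.39.
table_5_39 = {
    "mix": (-1.321, 201.795),
    "entropy": (-22.941, -729.427),
    "highway": (-0.00123, -0.449),
}

recomputed = {
    name: mean_without(mean_all, n_samples, extreme)
    for name, (mean_all, extreme) in table_5_39.items()
}
print(recomputed)
```

A single extreme estimate among 25 is enough to flip the sign of an averaged coefficient, which illustrates the sensitivity of the sampling scheme noted above.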
5.5 DIFFERENTIAL EQUATION MODELS
The models discussed up to this point have been essentially focused on how the levels of different variables could be used to elucidate the complex interactions between various demographic and geographic elements of a region. Another way of analyzing these interactions is to look at the way that differences in the variable levels across space and time can be modeled. Though such analysis can be used to estimate results for models involving the full variable levels (an example of this is the method used to estimate the LSDV models),
Footnote 13: The sample whose results skewed the averages was not the sample which was dropped from the other two rural models.
                                                                             Estimation Sample Properties
Variable                                    Beta      S.E.      T-statistic  Max       Min       Standard Error
Instrument Variable                         0.768     0.691     1.192        2.446     -0.454    0.662
Constant                                    -4.438    1.482     3.037        -1.923    -7.337    1.461
Square root of Distance to CBD              0.370     0.132     2.840        0.608     0.153     0.129
Square root of Distance to Nearest Highway  0.0194    0.0705    0.664        0.152     -0.108    0.0589
Land Cover Mix*                             -6.334    1.844     3.592        0.641     -10.166   2.274
Land Cover Entropy*                         4.821     1.156     5.075        7.147     1.066     1.282
Random-Effects Variance Ratio Parameter     0.0386    0.0494    4.632        0.107     0.000     0.031
Spatial Autocorrelation Parameter           4.704     0.0372    138.762      6.202     3.975     0.620
Error Variance                              1.352
Random Effect Standard Deviation            0.224
R-Squared                                   0.2073
Number of Valid Samples                     24
Elasticities 2000 0.272 1997 0.228 1991 0.187
-0.394 -0.0096 0.567 -0.136
-0.935 -0.0228 1.144 -0.261
-0.444 -0.0108 0.506 -0.110
Table 5.38 Results from Proportion of Non-Urban Land Cover which is Rural spatial logistic regression model with time-lagged variables (* denotes time lagged variable).
                                                                             Estimation Sample Properties
Variable                                    Beta      S.E.      T-statistic  Max       Min       Standard Error
Instrument Variable                         1.793     0.654     3.265        4.466     -3.340    1.592
Constant                                    -6.626    1.418     4.997        4.923     -12.638   3.491
Square root of Distance to CBD              0.569     0.128     4.754        1.122     -0.439    0.303
Square root of Distance to Nearest Highway  -0.00123  0.0745    0.878        0.208     -0.449    0.123
Land Cover Mix*                             -1.321    6.982     5.094        201.795   -20.174   42.563
Land Cover Entropy*                         -22.941   19.370    5.995        12.538    -729.427  147.206
Random-Effects Variance Ratio Parameter     0.0385    0.0474    4.642        0.0881    0.000     0.0266
Spatial Autocorrelation Parameter           4.600     0.0399    125.784      5.898     3.971     0.551
Time Adjustment                             0.963     0.0130    2.768        1.076     0.393     0.124
Error Variance                              1.253
Random Effect Standard Deviation            0.218
R-Squared                                   0.204
Number of Valid Samples                     25
Elasticities 2000 0.622 1997 0.525 1991 0.429
-0.605 6.10E-04 0.105 0.577
-1.436 0.00145 0.190 0.989
-0.681 6.87E-04 0.0778 0.385
Table 5.39 Results from Proportion of Non-Urban Land Cover which is Rural spatial logistic regression model with time-lagged variables and time adjustment (* denotes time lagged variable).
Variable                                    Beta        S.E.      T-statistic
Instrument Variable                         1.802       0.642     3.329
Constant                                    -6.644      1.394     5.076
Square root of Distance to CBD              0.561       0.126     4.777
Square root of Distance to Nearest Highway  0.01743     0.0709    0.797
Land Cover Mix*                             -9.784      1.878     5.241
Land Cover Entropy*                         6.496       1.041     6.179
Random-Effects Variance Ratio Parameter     0.0398      0.0459    4.793
Spatial Autocorrelation Parameter           4.617       0.0388    128.375
Time Adjustment                             0.987       0.0121    2.405
Error Variance                              1.2931042
Random Effect Standard Deviation            0.2254406
R-Squared                                   0
Number of Valid Samples                     24
Elasticities 2000 0.625 1997 0.527 1991 0.431
-0.598 -8.66E-03 0.781 -0.163
-1.418 -0.02056 1.407 -0.280
-0.673 -9.75E-03 0.5764 -0.109
Table 5.40 Results from Proportion of Non-Urban Land Cover which is Rural spatial logistic regression model with time-lagged variables and time adjustment and with one highly deviant sample removed (* denotes time lagged variable).
another way that such results can be used is to estimate models based on differential equation approximations. An application of this, extending the differentials in both spatial and temporal dimensions, is described in Chapter 4. In this section, the results of these models as applied to population, average vehicles per household, and median house price (from here on referred to as the population, vehicles, and home value models) are presented. For the home value model, because the dependent variable has a much larger scale than the explanatory variables, it has been rescaled so that it is measured in steps of $10,000. For the time model, how to take the first differences is obvious, as the data is separated into years. However, for the spatial dimension, any number of differencing schemes could be used. Because the data is split into grids, the natural choice is to take first differences between contiguous neighbors. That is what was done for this work, and because there are two directions in which such a difference could be taken (vertical and horizontal), models using both first difference directions were estimated. Due to the random-effects terms contained in these models, special techniques have to be used to account for the fact that these terms are constant over a certain dimension: time-constant for the spatial dimension model and space-constant for the time dimension model. Unfortunately, due to an error on the part of the author, the wrong assumptions concerning which dimension these terms are constant over were made in the estimation stage: for the time dimension model it was assumed that the random-effects term was time-constant, and for the spatial dimension model the random-effects term was assumed to be space-constant. A space-constant random effect severely hampers the estimation of the model, as it necessitates a very restrictive sampling scheme which, as will be seen, makes it difficult to estimate precise model results. As
169
such, the models in the spatial dimension are severely effected by this incorrect specification. Despite this error, the (incorrect) model results are still presented below. The reason for this is two-fold. For one, as noted above, the specification does not totally invalidate the model results, though it probably does introduce some bias (which will be more pronounced in the space dimension, because of the small sample size used). Secondly, the issues which are raised concerning the difficulties caused by the sampling scheme used for the space dimension model are still valid, though they actually should be applied to the time dimension. In the following discussion, the issue of this random-effects term mix-up will not be dwelt upon; however, it will be mentioned in key areas to emphasize the fact that an error in the model estimation did occur and that further work is needed to correctly specify the models. Though sampling is not needed to estimate the time dimension models, for the space dimension, only models with very small numbers of observations can be estimated. This is because the model actually is constructed such that each time period acts as an observation and each cell acts as a panel in the traditional sense. As a result of this, the log-likelihood function multiplies together the likelihoods of each cell in each time period; as the number of cells modeled gets large, then a large number of likelihoods, all of which are less than one, are multiplied together. At some point, the likelihood, within machine precision, becomes zero; it is this point which is the limiting number on the number of cells which can be modeled at once. Experimenting with various sample sizes, it was found that sample sizes of 30 we required to avoid problems. This low sample size is obviously problematic. 
However, a sample size this low increases the speed that the model could be estimated; thus, to counteract the lowered accuracy of the estimates a larger number of samples are estimated. So, for the spatial models, 100 samples of 30 cells are estimated and averaged. 170
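The underflow described above can be illustrated with a minimal sketch. The contribution value of 1e-4 and the count of 120 (30 cells times 4 time periods) are illustrative assumptions, not values taken from the estimation, and `naive_likelihood` and `log_likelihood` are hypothetical names. The sketch only shows why a long product of numbers below one reaches zero in double precision while the sum of their logarithms stays finite; whether summing logarithms could be applied directly inside the random-effects likelihood used in this work is not established here.

```python
import math


def naive_likelihood(contributions):
    """Multiply the per-observation likelihoods directly."""
    product = 1.0
    for value in contributions:
        product *= value
    return product


def log_likelihood(contributions):
    """Sum the logarithms of the per-observation likelihoods instead."""
    return sum(math.log(value) for value in contributions)


# Assumed values: 30 cells x 4 time periods, each contributing 1e-4.
contributions = [1e-4] * 120

print(naive_likelihood(contributions))  # underflows to 0.0
print(log_likelihood(contributions))    # finite, roughly -1105.2
```

The direct product falls below the smallest representable double after roughly 80 such factors, which is consistent in spirit with the hard ceiling on sample size reported above, although the actual limit depends on the magnitudes of the real likelihood contributions.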
As with the logistic regression models discussed earlier, for some of the samples the maximum likelihood procedure of the GAUSS program could not compute the Hessian, and so those samples were discarded. It will be seen that the results of these spatial dimension models are somewhat erratic, indicating that the small sample sizes are severely hampering the accuracy of the model estimation. Finally, when reading the tables, remember that the two time parameters measure the exponential effect of time on the model, and the distance parameter measures the effect of distance (from the CBD). Also, because the variables are first differenced, constants could not be included in the model; thus a log-likelihood level for a constants-only model could not be calculated. A useful goodness-of-fit comparison would be a model with no explanatory variables (but with random effects and, perhaps, explicit time and space information); unfortunately, this was not calculated when these models were estimated. As such, a measure of the models' fit is not quantified.
5.5.1 d(Population) Model
5.5.1.1 Time Dimension
Table 5.41 presents the results for the time dimension of the population model. It is seen that as the proportion of residential land cover increases, population is expected to increase. This result makes sense, especially if population is viewed as population density. It is expected that areas of higher population density need to have higher amounts of residential land in order to accommodate the added residents; thus, it is expected that as the population density of a cell increases, so does the residential land cover. Likewise, an opposite effect is expected for the rural and, to a lesser degree, commercial land cover: as these increase, a population density decrease is expected. Such an effect is shown by the results of the model, as the parameters for both of these variables are less than zero, with the magnitude of the rural parameter larger than that of commercial. Interestingly, the variable with the largest elasticity is rural land cover, indicating that if population densities change over time, a larger relative change is expected in the rural land cover proportion of a cell. It is also seen that as land cover mix and entropy increase over time, a drop in population is expected; or, more intuitively, as the population in a cell increases over time, the mix and entropy of a cell are expected to decrease. From the positive, and statistically significant, values of the two time parameters, it is seen that population exhibits a time-dependent exponential growth. This last effect is expected, as an exponential model form (equation 3.4) identical to that used in the differential equation form (equation 4.56) was used to approximate the population variable.
5.5.1.2 Space Dimension
Tables 5.42 and 5.43 present the results of the population model for the vertical and horizontal dimensions, respectively. As expected, population is expected to increase as commercial and residential land cover increase, with residential having a greater effect than commercial. It is also seen that as one moves to a cell with greater rural land cover, population is expected to fall. The positive value of the distance parameter indicates that population is expected to increase as the distance to the CBD decreases. This is because the term to which this parameter relates is a function of r and of the spatial difference in r (where r is the distance to the CBD), and it is generally dominated by a component in r alone, which decreases as the distance to the CBD increases.14 What is most interesting is the fact that there seems to be some disparity between the results of the models for the horizontal and vertical dimensions.
14
The reason this component dominates is that the spatial difference in r is generally on the order of the size of the grid cells, or 300 meters, whereas r, with a mean of 4.575, is usually much larger.
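The vertical and horizontal first differences that underlie the two spatial models can be sketched as follows. This is a minimal illustration under stated assumptions: the 3 x 3 grid of values is invented, the function names are hypothetical, and the sign convention (each cell minus its upper or left neighbor) is an assumption, since the direction of subtraction is not stated in this section.

```python
def vertical_differences(grid):
    """Each cell minus the contiguous cell directly above it."""
    return [
        [grid[i][j] - grid[i - 1][j] for j in range(len(grid[0]))]
        for i in range(1, len(grid))
    ]


def horizontal_differences(grid):
    """Each cell minus the contiguous cell directly to its left."""
    return [
        [row[j] - row[j - 1] for j in range(1, len(row))]
        for row in grid
    ]


# Invented values for a 3 x 3 block of grid cells.
population = [
    [10, 12, 15],
    [11, 14, 18],
    [9, 13, 20],
]

print(vertical_differences(population))    # [[1, 2, 3], [-2, -1, 2]]
print(horizontal_differences(population))  # [[2, 3], [3, 4], [4, 7]]
```

Because each direction discards one row or one column of cells and pairs different neighbors, the two differenced data sets are not the same, which is one reason the vertical and horizontal models need not return identical estimates.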
                                                                       Elasticities
Variable                              Beta      S.E.      T-statistic  2000       1997       1991
Proportion of Commercial Land Cover   -0.0934   0.0518    -1.803       -5.53E-04  -9.85E-05  1.67E-04
Proportion of Residential Land Cover  0.237     0.0349    6.782        0.00111    0.00278    0.00305
Proportion of Rural Land Cover        -0.380    0.0150    -25.322      0.0116     -0.00565   0.00522
Land Cover Mix                        -0.322    0.0575    -5.602       -8.57E-04  -0.00306   -0.00117
Land Cover Entropy                    -0.299    0.0470    -6.370       0.00138    -0.00242   -0.00204
(first time parameter)                0.111     0.00107   103.456
(second time parameter)               0.615     0.0158    38.964
Error Standard Deviation              0.941     0.00408   230.858
Random Effect Standard Deviation      1.430     0.0104    137.176
Log Likelihood Level                  -7.896
Table 5.41 Results from time dimension of Population differential equation model.
However, a comparison of the maximum and minimum sample estimates, as well as the standard deviations of the estimates, shows them to be similar in this respect. Furthermore, the differences in the parameter estimates are well within their respective sample standard deviations. This indicates that the results are probably consistent, and that the high variability of the estimates, which is a result of the small sample sizes, is leading to highly variable averages. At the very least, this indicates that the sampling scheme used is not returning very accurate model parameters; to remedy this, a method incorporating more or larger samples should be carried out.15
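The comparison described above can be reproduced from the averages and sample spreads reported in Tables 5.42 and 5.43. The sketch below is illustrative only: it assumes that the third "Estimation Sample Properties" column of each table is the standard deviation of the estimates across samples, and the function name and the choice to scale by the smaller of the two spreads are assumptions made for this example.

```python
# (vertical average, vertical spread, horizontal average, horizontal spread),
# read from Tables 5.42 and 5.43 for the five land cover variables.
ESTIMATES = {
    "commercial": (11.654, 57.153, 7.625, 53.145),
    "residential": (18.189, 52.917, 19.050, 41.749),
    "rural": (-4.346, 17.235, -1.402, 16.751),
    "mix": (2.299, 45.393, -5.383, 54.529),
    "entropy": (-7.941, 41.737, -3.947, 46.707),
}


def gap_in_spread_units(v_avg, v_spread, h_avg, h_spread):
    """Absolute gap between the two averages, scaled by the smaller spread."""
    return abs(v_avg - h_avg) / min(v_spread, h_spread)


for name, values in ESTIMATES.items():
    print(f"{name:12s} {gap_in_spread_units(*values):.3f}")
```

Every ratio is well below one, so the vertical and horizontal averages differ by only a small fraction of the sample-to-sample spread, even where the sign of the average changes (as it does for land cover mix).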
5.5.2 d(Average Vehicles Available per Household) Model
5.5.2.1 Time Dimension
Table 5.44 presents the results of the time dimension of the vehicles model. It is seen that as the proportion of residential or rural land cover increases, the number of vehicles per household is expected to increase. This is actually expected, as such land cover changes are with respect to other land cover types, and households in areas with higher commercial or unsettled land cover might be expected to have fewer vehicles, either because of lowered needs or, more probably, less affluence. Interestingly, it is seen, from both parameter magnitudes and elasticities, that changes in rural land cover have a greater effect on the number of vehicles in households than any other included variable. A possible explanation for this is that households in rural regions have a greater need for more vehicles because many of them work in agriculture, which may require different vehicles for work and personal purposes.
15
Recall that the error in estimation means that this sampling scheme, and hence this discussion, should have been applied to the time dimension.
                                                                      Estimation Sample Properties
Variable                              Beta      S.E.     T-statistic  Max       Min       Standard Error
Proportion of Commercial Land Cover   11.654    20.922   2.273        155.710   -98.068   57.153
Proportion of Residential Land Cover  18.189    18.137   2.553        113.737   -75.176   52.917
Proportion of Rural Land Cover        -4.346    13.887   1.011        36.958    -48.842   17.235
Land Cover Mix                        2.299     36.220   1.032        116.895   -101.794  45.393
Land Cover Entropy                    -7.941    27.699   1.246        93.503    -74.720   41.737
(distance parameter)                  36.440    4.485    18.453       87.814    -84.771   71.716
Error Standard Deviation              213.268   0.957    16.848       495.732   59.781    137.916
Random Effect Standard Deviation      634.217   9.283    3.077        588.997   577.948   246.714
Log Likelihood Level                  -141.43
Number of Valid Samples               87
                                      Elasticities
Variable                              2000      1997      1991      1983
Proportion of Commercial Land Cover   -0.777    -0.647    -0.832    0.424
Proportion of Residential Land Cover  -2.425    -4.042    -1.299    4.961
Proportion of Rural Land Cover        0.483     -1.449    -0.310    -2.371
Land Cover Mix                        -0.306    -0.639    -0.164    0.209
Land Cover Entropy                    0.353     1.324     0.567     -0.722
Table 5.42 Results from vertical spatial dimension of Population differential equation model.
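                                                                      Estimation Sample Properties
Variable                              Beta      S.E.     T-statistic  Max       Min       Standard Error
Proportion of Commercial Land Cover   7.625     24.025   1.818        162.187   -104.386  53.145
Proportion of Residential Land Cover  19.050    20.212   1.814        140.980   -77.641   41.749
Proportion of Rural Land Cover        -1.402    15.279   0.885        36.878    -44.624   16.751
Land Cover Mix                        -5.383    40.322   1.022        185.332   -131.480  54.529
Land Cover Entropy                    -3.947    30.773   1.168        103.745   -138.442  46.707
(distance parameter)                  25.299    3.562    11.595       69.450    -77.545   35.315
Error Standard Deviation              291.105   1.089    15.491       566.331   130.696   91.581
Random Effect Standard Deviation      583.310   8.538    2.829        588.485   577.407   1.778
Log Likelihood Level                  -127.69
Number of Valid Samples               89
                                      Elasticities
Variable                              2000      1997      1991      1983
Proportion of Commercial Land Cover   -0.0696   0.0913    -0.2311   0.0794
Proportion of Residential Land Cover  0.957     0.228     0.577     -0.198
Proportion of Rural Land Cover        0.109     0.260     0.276     0.394
Land Cover Mix                        -0.147    -0.226    0.041     -0.112
Land Cover Entropy                    -0.0360   -0.0473   0.0120    -0.0411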
Table 5.43 Results from horizontal spatial dimension of Population differential equation model.
                                                                       Elasticities
Variable                              Beta      S.E.      T-statistic  2000       1997       1991
Proportion of Commercial Land Cover   -0.00114  0.00219   -0.521       -6.75E-06  -1.20E-06  2.03E-06
Proportion of Residential Land Cover  0.0212    0.00159   13.347       9.99E-05   2.49E-04   2.73E-04
Proportion of Rural Land Cover        0.113     8.77E-04  128.995      -0.00345   0.00168    -0.00155
Land Cover Mix                        0.0772    0.00302   25.587       2.05E-04   7.32E-04   2.80E-04
Land Cover Entropy                    0.0100    0.00254   3.942        -4.63E-05  8.10E-05   6.84E-05
(first time parameter)                0.0789    1.19E-04  661.066
(second time parameter)               -0.508    0.00142   -358.998
Error Standard Deviation              0.866     1.55E-04  423.875
Random Effect Standard Deviation      1.303     0.00197   244.725
Log Likelihood Level                  -3.19
Table 5.44 Results from time dimension of Average Number of Vehicles Available per Household differential equation model.
Finally, it is seen from the negative time parameter estimate that, as time goes on, the number of vehicles per household is expected to decrease. Though this seems non-intuitive, it is a direct result of the fact that the Census approximation for this variable, discussed in Chapter 3, was based on data showing the average number of vehicles per household decreasing over time. At the very least, it is promising that the model is picking up on this expected exponential effect of time on the variable.
5.5.2.2 Space Dimension
Tables 5.45 and 5.46 present the results of the vehicles model for the vertical and horizontal spatial dimensions, respectively. The results show that as one moves to a cell with a higher proportion of residential and rural land cover, there is an expected drop in the number of vehicles per household. It is also seen that an increase in land cover entropy and a decrease in mix lead to an expected drop in the number of vehicles per household. Furthermore, the positive value of the distance parameter indicates that the average number of vehicles per household will tend to become higher as one moves closer to the CBD. Even more so than with the population differential equation model, there seem to be some serious problems with the parameter estimates for this model. There is again a large discrepancy between the average estimates for the vertical and horizontal dimensions, with the parameter for commercial land cover actually switching signs. Furthermore, the results described above concerning the residential and rural land cover do not make intuitive sense; it is not expected that cells with more uninhabited land have higher average vehicles per household levels. The sampling scheme again seems to be to blame for these issues. Examining the maximum and minimum sample estimates, as well as their standard deviations, reveals that the parameter estimates vary widely, indicating a serious accuracy problem. A further problem is evidenced by the elasticity
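                                                                      Estimation Sample Properties
Variable                              Beta      S.E.     T-statistic  Max       Min       Standard Error
Proportion of Commercial Land Cover   0.216     15.543   1.662        77.488    -69.088   31.501
Proportion of Residential Land Cover  -1.728    12.891   1.317        51.656    -49.091   21.027
Proportion of Rural Land Cover        -0.539    10.014   0.939        28.015    -30.624   12.147
Land Cover Mix                        -1.084    26.151   1.192        72.211    -92.115   38.778
Land Cover Entropy                    1.968     19.976   1.352        91.076    -70.640   33.053
(distance parameter)                  69.017    3.291    21.497       94.441    23.910    22.819
Error Standard Deviation              125.390   0.705    15.492       374.227   31.198    69.760
Random Effect Standard Deviation      583.278   8.537    2.829        588.000   577.002   148.162
Log Likelihood Level                  -114.313
Number of Valid Samples               94
                                      Elasticities
Variable                              2000      1997      1991      1983
Proportion of Commercial Land Cover   0.647     -0.0719   -0.0719   0.0288
Proportion of Residential Land Cover  -10.370   2.304     0.576     -1.728
Proportion of Rural Land Cover        -2.694    -1.078    -0.180    -1.078
Land Cover Mix                        -6.505    1.807     0.361     -0.361
Land Cover Entropy                    3.936     -1.968    -0.656    0.656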
Table 5.45 Results from vertical spatial dimension of Average Number of Vehicles Available per Household differential equation model.
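Variable                              Beta      S.E.     T-statistic
Proportion of Commercial Land Cover   -5.397    19.429   1.451
Proportion of Residential Land Cover  -1.124    16.455   1.326
Proportion of Rural Land Cover        -1.019    13.224   0.860
Land Cover Mix                        -8.004    35.256   1.182
Land Cover Entropy                    5.588     27.556   1.309
(distance parameter)                  44.786    3.028    14.594
Error Standard Deviation              205.330   0.913    15.492
Random Effect Standard Deviation      583.350   8.539    2.829
Log Likelihood Level                  -122.406
Number of Valid Samples               94
                                      Elasticities
Variable                              2000      1997      1991      1983
Proportion of Commercial Land Cover   3.598     26.986    -53.971   5.397
Proportion of Residential Land Cover  -4.121    5.620     11.240    -1.124
Proportion of Rural Land Cover        5.777     -79.006   -66.263   -27.525
Land Cover Mix                        -16.008   140.070   -20.010   16.008
Land Cover Entropy                    3.725     -27.938   5.588     -5.588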
Table 5.46 Results from horizontal spatial dimension of Average Number of Vehicles Available per Household differential equation model.
estimates, which also vary greatly both within and between the two models.16
5.5.3 d(Median House Price) Model
5.5.3.1 Time Dimension
Table 5.47 presents the results for the time dimension of the home value model. It is seen that as the proportions of commercial, residential, and rural land cover increase, the median value of a home is expected to increase. Interestingly, the magnitudes of the parameters for residential and rural land cover are nearly the same; this indicates that there is a propensity for areas which are more urban or more rural to have larger home values. What these results are picking up on is that homes are appreciating over time (even with respect to inflation), and more so in regions where the population tends to be settled than in those which are more uninhabited. Again reflecting the Census approximation discussed in Chapter 3, it is seen from the two time parameters that home values are expected to increase over time. In light of this, the estimates for the land cover variables are probably adding a correction to the expected amount of average appreciation over time which the model applies to every home in the region.
5.5.3.2 Space Dimension
Tables 5.48 and 5.49 present the results for the home value differential equation model in the vertical and horizontal spatial dimensions, respectively. The positive value of the distance parameter in both models indicates that home values are expected to increase as one moves closer to the CBD. Though this result is intuitive, the rest
16
Again recall that this issue with the sampling scheme should actually apply to the time dimension.
                                                                       Elasticities
Variable                              Beta      S.E.      T-statistic  2000       1997       1991
Proportion of Commercial Land Cover   0.0994    0.0300    3.312        5.89E-04   1.05E-04   -1.77E-04
Proportion of Residential Land Cover  0.356     0.0224    15.919       0.00168    0.00418    0.00458
Proportion of Rural Land Cover        0.313     0.0116    27.067       -0.0095    0.00465    -0.00430
Land Cover Mix                        -0.117    0.0402    -2.910       -3.11E-04  -0.00111   -4.24E-04
Land Cover Entropy                    0.305     0.0333    9.140        -0.00141   0.00246    0.00208
(first time parameter)                0.0111    4.20E-04  26.377
(second time parameter)               23.788    0.962     24.719
Error Standard Deviation              0.866     0.00271   295.126
Random Effect Standard Deviation      1.303     0.00448   167.383
Log Likelihood Level                  -4.866
Table 5.47 Results from time dimension of Median Home Value differential equation model.
of the results for the models are much less clear. Again, the sampling scheme seems to have severely hampered the models' ability to arrive at accurate parameter estimates. Not only are the maximum and minimum sample estimates widely spread, but between the two models, the average parameter values on all of the explanatory variables, save the proportion of land cover which is residential, have changed sign. Because this sampling (and estimation) scheme should have been used with the time rather than the space dimension, it is obvious that much further work needs to be done concerning these models. However, even though the sampling scheme and estimation methods were applied to the incorrect models, the conclusions drawn from them are still essentially valid. This is especially true for the problems with the sampling scheme used in the spatial dimension (which should have been used in the time dimension), as it is seen that an overly small sample size severely hampers the results of the model estimation.
5.6 SUMMARY
Upon reviewing the results of the model estimations presented in this chapter, many important conclusions can be drawn. However, before discussing these, a review of the results is in order. First, panel data spatial regression models were estimated with population, per capita income, and average vehicles owned per household as dependent variables. For all three dependent variables, models were estimated without time-lagged variables, with time-lagged variables, and with both time-lagged variables and time adjustment. For each model, estimations from 25 random samples of 1,000 observations (cells) each were averaged together. The results of the estimations were generally intuitive. One consistent result, which occurred in nearly all of the models discussed in this chapter, was that the parameter for the distance to the nearest highway generally had somewhat non-intuitive values. Though this could be the result of
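Variable                              Beta      S.E.     T-statistic
Proportion of Commercial Land Cover   -3.910    14.938   1.710
Proportion of Residential Land Cover  -2.845    12.987   1.530
Proportion of Rural Land Cover        -1.124    9.702    0.801
Land Cover Mix                        -3.860    25.197   1.141
Land Cover Entropy                    4.292     19.067   1.212
(distance parameter)                  69.544    3.165    22.507
Error Standard Deviation              117.170   0.681    15.492
Random Effect Standard Deviation      583.094   8.534    2.830
Log Likelihood Level                  -113.275
Number of Valid Samples               97
                                      Elasticities
Variable                              2000      1997      1991      1983
Proportion of Commercial Land Cover   -0.173    -0.0954   -0.111    0.054
Proportion of Residential Land Cover  -0.252    -0.278    -0.0806   0.295
Proportion of Rural Land Cover        -0.0829   0.165     0.0318    0.233
Land Cover Mix                        -0.342    -0.471    -0.109    0.133
Land Cover Entropy                    0.127     0.314     0.122     -0.148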
Table 5.48 Results from vertical spatial dimension of Median Home Value differential equation model.
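Variable                              Beta      S.E.     T-statistic
Proportion of Commercial Land Cover   1.395     21.213   1.371
Proportion of Residential Land Cover  -1.091    17.306   1.207
Proportion of Rural Land Cover        0.994     13.491   0.960
Land Cover Mix                        0.875     35.390   1.039
Land Cover Entropy                    -2.678    27.185   1.116
(distance parameter)                  42.687    3.017    14.028
Error Standard Deviation              216.616   0.939    15.491
Random Effect Standard Deviation      583.288   8.538    2.829
Log Likelihood Level                  -123.255
Number of Valid Samples               89
                                      Elasticities
Variable                              2000      1997      1991      1983
Proportion of Commercial Land Cover   -0.00890  0.0110    -0.0256   0.00781
Proportion of Residential Land Cover  -0.038    -0.00863  -0.0200   0.00611
Proportion of Rural Land Cover        -0.054    -0.122    -0.119    -0.150
Land Cover Mix                        0.0167    0.0242    -0.00401  0.010
Land Cover Entropy                    -0.0171   -0.0212   0.00491   -0.0150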
Table 5.49 Results from horizontal spatial dimension of Median Home Value differential equation model.
a misspecification or multicollinearity issue, the consistency of the results suggests that the distance to the highway measure is actually acting as a correction factor to the distance to the CBD measure, which generally dominates the models.
Next, LSDV models incorporating time-lagged dependent variables were estimated for the population, per capita income, and average vehicles owned per household variables. Models again were estimated and averaged for 25 random samples of 1,000 observations each. For all of the models, bootstrap estimates of the parameter standard errors seemed to be lower than expected. For the population model, it was seen that a lagged dependent variable provided important information for the model; however, for the other two variables, the lagged dependent variable was not as important as the spatial information the LSDV model could not incorporate due to the method used to estimate it.
For the population and per capita income variables, sample selection models were also run, with the criteria for splitting the models being, respectively, that the proportion of a cell's urban land cover is greater than 0.3 and that the population of a cell is greater than 175. Again, models without time-lagged variables, with time-lagged variables, and with both time-lagged variables and time adjustment were estimated, and results were produced from averages of results from the same samples used for the panel data spatial regression models. As with the LSDV model, bootstrap estimates of the parameter standard errors seemed to underestimate the true values. For the population sample selection model, results from both the probit sample selection and regression models were basically as expected. Also, a probit selection model was run on the entire data set; comparisons with the averages from the 25 samples showed the results to be very similar. The only exception was that the standard errors for the averages were much lower than those for the entire data set, indicating that the significance of the variables to the model is probably higher than the sample averages point to. In the regression model for cells with greater than 30% urbanization, there was a significant improvement in the model fit over that from the model run on the entire data set; the regression model for cells with less than 30% urbanization showed no improvement. For the per capita income sample selection model, most of the results were also fairly intuitive. However, there were some major discrepancies in the probit sample selection model estimates for the land cover mix and entropy variables between the sample-averaged models and a model estimated using the entire data set. For the regression models, the model for cells with population greater than 175 showed a slightly improved fit over the model without sample selection, whereas the regression model for cells with population less than 175 showed no improvement in goodness-of-fit.
Panel data spatial logistic regression models were run for the proportion of urban land cover, the proportion of urban land cover which is residential, and the proportion of non-urban land cover which is rural. Once again, 25 samples of 1,000 observations were used to estimate the models, and three model variations were estimated: without time-lagged variables, with time-lagged variables, and with time-lagged variables and time adjustment. For the residential and rural models, computational issues caused some of the samples to be discarded during the estimation process. The average model results were fairly intuitive; however, the instrument variable, which was of great significance to all of the models, had parameter estimates in the residential model which were widely varying and possibly inconsistent. All of the models had very low fit levels, though the residential and rural models had improved fit over the first-stage urban model.
Lastly, differential equation models were estimated for the population, average vehicles per household, and median home value dependent variables. For the time dimension, models were estimated using the entire data set and the results were intuitive.
For the spatial dimension, peculiarities of the model require very small samples of 30 observations each to be used for estimation. This resulted in model results with questionable accuracy and large standard errors. As such, many of the results from the spatial dimension models were non-intuitive and difficult to interpret. Unfortunately, a mistake on the part of the author resulted in the use of incorrect estimation schemes for the models: the method used for the spatial dimension should have been used for the time dimension model, and vice versa. Though this theoretically should only have a major effect on the random effect results, it did also hamper the rest of the spatial dimension results because of the sampling scheme (incorrectly) used for it.
From these results, a number of important conclusions can be drawn. First and foremost, it is evident that throughout all of the models, spatial and temporal effects have great statistical significance to the model results. In the spatial realm, not only is spatial autocorrelation nearly always present in a statistically significant way, but the distance measures, in particular distance to the CBD, also have great importance to the models. As for temporal effects, time-constant random effects are almost always present and statistically significant; also statistically significant is the time adjustment factor, which corrects for differences in time lags.
The significance of the time adjustment factor brings up the issue of its real importance, as well as the comparison of the models with and without time-lagged variables. As mentioned in the introduction to this chapter, a lot of the results presented were somewhat redundant, as there was often little change between models with and without time lags and, especially, between time-lagged models with and without the time adjustment factor. These comparisons are, however, a major part of this chapter, as they help to gauge the effects of time on the model results.
For the models with time-lagged variables, only variables involving land cover were lagged. The fact that many of the results of the models with and without time lags are so similar indicates that there seems to be a fair bit of inertia in land cover change; for some of the models it could be said that past land cover can be used as a type of proxy for current levels, and differences in the size of time lags do not seem to affect the model results greatly.17 Nevertheless, the time adjustment factor, however little it may seem to affect a model, is not only almost always highly statistically significant, but also adds a great deal of flexibility to the models.
In comparing models with and without time-lagged variables, it is important to remember that parameter estimates for time-lagged variables are basically measuring the effect of land cover on the future of the region. This is especially important for the land cover mix and entropy measures, because their effects cannot be reasoned out as clearly as those for, say, rural or residential land cover proportions. In fact, one of the most consistent changes between the models with and without time lags was that the parameter for land cover mix was very different between them. Thus, the manner in which land cover should be used to assess the present state of a region is often very different from how it should be used to predict the future.
One of the main advantages of models with time lags is that they can be used for predictive purposes without having to make future projections of the explanatory variables. This is especially important when using land cover variables because, as evidenced by the logistic land cover models estimated here, land cover projections may not be that accurate. In this work, the time adjustment factor is usually the differentiating line for the models with lagged variables, and thus deciding whether or not this factor is necessary is of primary importance in choosing a predictive model.
17
It should be emphasized that this conclusion is only valid for the time scales being used in this work; extension to time lags greater than 10 years or so may produce problematic results. Such consequences will be more apparent when simulations are discussed in Chapter 6.
Obviously, as mentioned above, the time adjustment factor is statistically significant in most of the models which incorporate it. However, it has also been noted that the factor, which is usually very close to unity, does not affect the model results very much. What is important to remember is that the time adjustment factor not only accounts for the differences in time lags, but also allows one the flexibility to choose how far into the future to make a prediction. Though there may be questions as to the use of the factor for long-range predictions, what the factor actually corrects for, and whether one factor is enough for a model, the flexibility offered by it makes it a preferred inclusion in models with lagged variables.18
It is also important to compare the various general model forms to understand their strengths and weaknesses. Obviously, if very specific temporal or spatial dynamics are to be investigated, the differential approximation method is a strong model. Though there are some major issues with the solution quality for the spatial dimension, these were a result of the restrictive sampling scheme required by the estimation method. Furthermore, this estimation method was incorrectly applied to the wrong model, and thus the issues with solution quality actually should be addressed to the time dimension model. Further research should clear up this issue.
For the regression models, as noted above, it is seen that capturing spatial effects is important to the models. This is seen not only in the significance of the terms capturing spatial effects, but also in the fact that in two of the LSDV models, even the inclusion of a lagged dependent variable cannot make up for the loss of spatial information which could not be included. Even though the lagged dependent variable can sometimes be highly beneficial, as seen in the population LSDV model, the inclusion of spatial effects is evidently extremely important in these models.
18
A more detailed discussion concerning issues surrounding the time adjustment factor will take place in Chapters 6 and 7.
Furthermore, when models are specified to account for further heterogeneities, as with the sample selection models, greater model fits are seen. Though it would be beneficial to develop models which can account for spatial correlation and incorporate dependent variable lags, accounting for spatial and regional effects seems to be more important to the models. Furthermore, though a lagged dependent variable can be informative when interpreting the results of a model, incorporating spatial effects arguably provides more important knowledge for understanding the interactions and complexities which affect a region.
Lastly, it is noted that though in general the sampling schemes used for model estimations seemed to perform well, there were a few problems. This is especially true of the estimation used for the spatial dimension of the differential equation model (which should have been applied to the time dimension model), whose results were so poor as to render them essentially inconclusive. Though this model, in theory, might be of interest, it does not seem practical to estimate it and get accurate results in a timely fashion. The only potential solutions to its problems, at least as the model has been developed in this work, are either to estimate many more samples, which would increase the estimation time greatly, or to increase the number of digits of computer precision, which is probably not feasible.
Another issue with the sampling is that it limits the amount of spatial autocorrelation that can be measured. This is because in a sample of 1,000 observations, only up to 999 error terms can be involved in the autocorrelation, whereas the true number in reality would be on the order of the number of cells in the data set (or higher, if edge effects are accounted for). This has a very profound effect on how predictions can be created (see Chapter 6); it also means that the true quantitative level of this effect cannot be measured and that the models are only providing an illustration of this effect.
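The cost of the first remedy mentioned above, estimating many more samples, can be gauged with a short sketch. It rests on an assumption that is not tested in this work: that the estimates from different samples behave like independent draws with a common standard deviation, so that the standard error of their average shrinks with the square root of the number of samples. The function names are hypothetical, and the input of 57.153 over 87 valid samples is taken from the commercial land cover row of Table 5.42 purely for illustration.

```python
import math


def standard_error_of_average(sample_sd, n_samples):
    """Standard error of an average of n independent sample estimates."""
    return sample_sd / math.sqrt(n_samples)


def samples_needed(sample_sd, target_se):
    """Smallest number of samples whose average reaches the target standard error."""
    return math.ceil((sample_sd / target_se) ** 2)


# Illustrative inputs from Table 5.42 (commercial land cover, 87 valid samples).
print(round(standard_error_of_average(57.153, 87), 2))  # about 6.13
print(samples_needed(57.153, 3.0))                      # 363
```

Under this assumption, halving the standard error of an averaged parameter requires roughly four times as many samples, which is why the estimation time grows so quickly.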
CHAPTER 6: MODEL PREDICTIONS

An important test for any model is how it performs in a practical setting. Such validation is important not only to show how the model can be extended to real-world applications, but also to better understand its strengths and limitations. One validation method which is of particular use for transportation/regional planning-based applications is the generation of future-year predictions using the model. Predictions not only illuminate model performance, but can also be used for practical purposes by researchers or planners who are interested in how, among other things, demographics, land use, and transportation characteristics of a region will change over time. Predictions are a particularly useful validation technique for some of the models created in this work, since their incorporation of time-lagged variables and, in particular, the time adjustment factor allows them to create predictions without having to create forecasts of the explanatory variables. Particularly powerful is the time adjustment factor, as it allows the selection of exactly how far into the future one wishes to make predictions.
In this chapter, the practical performance of two of the model forms developed in Chapter 4, the panel data spatial regression model and the panel data spatial logistic regression model, is tested by creating population and land cover predictions for parts of the Austin, Texas region for the years 2005, 2010, and 2020. In addition, some analysis of the variance of the predictions, as well as the effects of the time adjustment factor, is carried out. In order to correctly capture the spatial autocorrelation and random effects which were incorporated into these models, special simulation methodologies are developed. Because of computational issues with these methodologies, only limited areas of the Austin region are used in the predictions. As will be seen, the performance of the predictions is not very good, leading to the conclusion that either these models are not very useful in a practical sense or, more likely, that the simulation methodology used to generate the predictions is flawed in some way.
6.1 SIMULATION METHODOLOGIES
Though it was noted in Chapter 5, it bears emphasizing here: the sampling method used in estimating the spatial econometric models has a profound effect on how they perform under simulation. The specific issue is the incorporation of spatial autocorrelation; because samples of 1,000 observations were used in estimating the models, only 1,000 error terms at a time are used in measuring the level of spatial autocorrelation. In reality, and in the strict interpretation of the models as presented in Chapter 4, every other cell, not just a random sample, would be expected to have a spatially autocorrelated effect on each cell. Not including every observation in the estimations does not invalidate the method of capturing spatial autocorrelation, however; rather, the sampling method provides a partial view of the spatial effect. It gives an accurate, though not complete, understanding of the effects of spatial autocorrelation in the model.
The consequence of using samples to estimate the models is that a similar procedure must be used to create predictions. As has just been discussed, the sampling scheme gives a partial view of the autocorrelation in the model; if the estimated effect were then applied to the whole data set, as opposed to a sample, the true effect would be misrepresented. In a sample of 1,000 random observations, probably only a handful are very close to a given cell, and it is those cells which dominate the estimation of the autocorrelation effect. However, if the estimated effect were applied in a simulation to the whole data set, then the dozens of cells which are very close to the cell of interest would contribute to this autocorrelation (via the factor estimated from the sample of cells), potentially creating a much larger effect than is warranted.¹ By accounting for the spatial autocorrelation in a manner consistent with the method used to estimate the models, the performance of the models, not only at a predictive level but also at a more general level of fit, can be assessed. Because spatial autocorrelation involves random error terms, simulations must be used to capture these effects. The following subsections discuss the simulation methodologies used in this work to create population and land cover predictions.
6.1.1 Spatial Regression
The method used to simulate predictions from the panel data spatial regression model simply uses the estimated parameter values in the mathematical form of the model, using the same notation as in Chapter 4:
$$Y = X\tilde{\beta} + v + (I - \rho W)^{-1}\varepsilon \qquad (6.1)$$
where $\tilde{\beta}$ represents a parameter vector which has been time adjusted according to equation 4.55. To estimate the effect of the random terms, $v$ and $\varepsilon$, a Monte Carlo simulation technique is used (see Greene (2000) for a discussion). However, as noted above, to correctly account for the effects of spatial autocorrelation, a random sample of 1,000 cells must be chosen for each Monte Carlo scheme. Furthermore, because there may be an unintended bias resulting from bad sampling, multiple samples of 1,000 cells should be selected for every cell, and the results averaged. The pseudo-code for the panel data spatial regression simulation method is as follows (as will be noted later, for this work, N = 2500, J = 5, and K = 200):
Definitions:
  Cells on which the simulation will be run are indexed $c_1, c_2, \ldots, c_N$
  $\beta$, $\rho$, $\sigma^2$, and $\sigma^2_{RE}$ are previously estimated model parameters
  $\tau$ is the time adjustment vector defined in equation 4.54: $\tau = [\tilde{X}, (a^{T})\tilde{Z}]$
    ($\tilde{X}$ is a vector of ones the number of non-lagged variables long and $\tilde{Z}$ is a vector of ones the number of time-lagged variables long)
  a is the estimated time adjustment factor
  N is the number of cells in the region for which predictions are being simulated
  T is the number of years into the future the prediction will be made
  J = number of outer-loop cycles
  K = number of inner-loop cycles
Procedure:
  Calculate time adjusted parameter: $\tilde{\beta}_k = \tau_k \beta_k$
  Calculate deterministic part of model: $U = X\tilde{\beta}$
  Cell Loop: For i = 1 through N
    Outer Loop: For j = 1 through J
      Select random sample of 1,000 cells, $S_{ij}$, including $c_i$
      Calculate spatial weight matrix $\tilde{W}_{ij}$ for $S_{ij}$ (via eqs. 4.1-4.3)
      Sample random effect term $v_{ij}$ drawn from Normal$(0, \sigma^2_{RE})$
      Inner Loop: For k = 1 through K
        Create (1000 × 1) vector, $\varepsilon_{ijk}$, of i.i.d. draws from Normal$(0, \sigma^2)$
        Calculate total error: $\xi_{ijk} = v_{ij} + (I - \rho\tilde{W}_{ij})^{-1}\varepsilon_{ijk}$
      End Inner Loop
      Calculate inner-loop prediction: $p_{ij} = U_i + \frac{1}{K}\sum_{k=1}^{K}\xi_{ijk}$
    End Outer Loop
    Calculate prediction: $Y_i = \frac{1}{J}\sum_{j=1}^{J} p_{ij}$
  End Cell Loop

¹ Early in the research carried out for this chapter, simulations were tried using small neighborhoods of cells to capture the spatial autocorrelation. The result was impossibly large predictions for every cell, which lends credence to the idea that not following the method used to estimate the model when simulating predictions will create biased results.
It is seen from the procedure that a 1,000 × 1,000 matrix inversion must be performed J times for every cell in the prediction set. As will be discussed below, the time required for this calculation limits how many outer-loop cycles can be run.
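The procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the GAUSS implementation used in this work: the weight matrix is a hypothetical row-normalised inverse-distance matrix standing in for equations 4.1-4.3, the function and argument names (`predict_cell`, `rho`, `sigma`, `sigma_re`) are invented for the example, and the sample size is a parameter so that the sketch can be run on small samples. It also assumes that only the error element belonging to the cell of interest enters the prediction, so instead of forming the full inverse it solves a single linear system for the corresponding row of the inverse.

```python
import math
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[r]] for r, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def weight_matrix(coords):
    """ASSUMPTION: row-normalised inverse-distance weights, a stand-in
    for the weight definition in eqs. 4.1-4.3 (not reproduced here)."""
    n = len(coords)
    w = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            if a != b:
                w[a][b] = 1.0 / math.dist(coords[a], coords[b])
        total = sum(w[a])
        w[a] = [x / total for x in w[a]]
    return w


def predict_cell(i, coords, u, rho, sigma, sigma_re, sample_size, J, K, rng):
    """Monte Carlo prediction for cell i; u[i] is the deterministic part U_i."""
    others = [c for c in range(len(coords)) if c != i]
    outer = []
    for _ in range(J):
        # Random sample that always contains the cell of interest (position 0).
        sample = [i] + rng.sample(others, sample_size - 1)
        w = weight_matrix([coords[c] for c in sample])
        m = len(sample)
        # Row 0 of (I - rho W)^-1 is the solution z of (I - rho W)^T z = e_0.
        a_t = [[(1.0 if r == c else 0.0) - rho * w[c][r] for c in range(m)]
               for r in range(m)]
        z = solve(a_t, [1.0] + [0.0] * (m - 1))
        v = rng.gauss(0.0, sigma_re)  # random effect, one draw per outer cycle
        total = 0.0
        for _ in range(K):
            eps = [rng.gauss(0.0, sigma) for _ in range(m)]
            total += v + sum(zc * e for zc, e in zip(z, eps))
        outer.append(u[i] + total / K)
    return sum(outer) / J
```

With both standard deviations set to zero the function returns the deterministic part of the model for the cell, which is a convenient sanity check. Solving for one row rather than inverting the whole matrix is a design choice of this sketch and is not something the procedure above prescribes.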
6.1.2 Spatial Logistic Regression
The method used to create predictions using the panel data spatial logistic regression model is very similar to that for the panel data spatial regression model, except for the additional requirements of correcting for heteroscedasticity and applying the logistic transformation (the logistic CDF) to each prediction:
$$F(x) = \frac{e^{x}}{1 + e^{x}} \qquad (6.2)$$
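Equation 6.2 needs some care in floating-point arithmetic: evaluated literally, $e^{x}$ overflows double precision once x exceeds roughly 709, and in many environments the resulting ratio of two infinities is not a number. The sketch below is one common way to avoid this, offered as an illustration rather than as the routine used in this work; the function name `logistic` is chosen for the example.

```python
import math


def logistic(x: float) -> float:
    """Evaluate e^x / (1 + e^x) without overflowing for large |x|."""
    if x >= 0:
        # Algebraically identical form; exp(-x) only underflows towards 0.
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # x < 0, so e is in (0, 1) and cannot overflow
    return e / (1.0 + e)
```

For very large positive arguments this returns a proportion of one, and for very large negative arguments a proportion of zero, which is the limiting behavior expected of the transformation.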
In the pseudo-code for the technique, which is given below, the diagonal matrix Q is used to return the model, as estimated using Monte Carlo simulation, to its original heteroscedastic form. The form of this matrix is given in Chapter 4 and depends on which variable is being predicted: if the variable is urban land cover, then equation 4.40 is used; if residential or rural land cover is being predicted, then equation 4.51 is used. Since these forms are complicated, they will not be written out again in this chapter; rather, the symbol Q (or its diagonal element $Q_{ii}$) will be taken to represent them. Also, in the formation of Q for the residential and rural models, estimates from the urban model must be used as instrumental variables (see Section 4.5). For the cell of interest, the actual urban prediction is used; for the other cells, predictions are calculated using (6.1), disregarding the error terms. The pseudo-code for the panel data spatial logistic regression model prediction simulation follows (again, for this work, N = 2500, J = 5, and K = 200):
Definitions:
  Cells on which the simulation will be run are indexed $c_1, c_2, \ldots, c_N$
  $\beta$, $\rho$, $\sigma^2$, and $\sigma^2_{RE}$ are previously estimated model parameters
  $\tau$ is the time adjustment vector defined in equation 4.54: $\tau = [\tilde{X}, (a^{T})\tilde{Z}]$
    ($\tilde{X}$ is a vector of ones the number of non-lagged variables long and $\tilde{Z}$ is a vector of ones the number of time-lagged variables long)
  a is the estimated time adjustment factor
  N is the number of cells in the region for which predictions are being simulated
  T is the number of years into the future the prediction will be made
  Q is the variance normalizing transformation defined in Chapter 4
  F() is the logistic function defined in equation 6.2
  J = number of outer-loop cycles
  K = number of inner-loop cycles
Procedure:
  Calculate time adjusted parameter: $\tilde{\beta}_k = \tau_k \beta_k$
  Calculate deterministic part of model: $U = X\tilde{\beta}$
  Cell Loop: For i = 1 through N
    Outer Loop: For j = 1 through J
      Select random sample of 1,000 cells, $S_{ij}$, including $c_i$
      Calculate spatial weight matrix $\tilde{W}_{ij}$ for $S_{ij}$ (via eqs. 4.1-4.3)
      Sample random effect term $v_{ij}$ drawn from Normal$(0, \sigma^2_{RE})$
      Calculate Q for $S_{ij}$
      Inner Loop: For k = 1 through K
        Create (1000 × 1) vector, $\varepsilon_{ijk}$, of i.i.d. draws from Normal$(0, \sigma^2)$
        Calculate total error: $\xi_{ijk} = Q_{ii}^{-1} v_{ij} + Q^{-1}(I - \rho\tilde{W}_{ij})^{-1}\varepsilon_{ijk}$
      End Inner Loop
      Calculate inner-loop prediction: $p_{ij} = \frac{1}{K}\sum_{k=1}^{K} F(U_i + \xi_{ijk})$
    End Outer Loop
    Calculate prediction: $Y_i = \frac{1}{J}\sum_{j=1}^{J} p_{ij}$
  End Cell Loop
Again, J inversions of a 1,000 × 1,000 matrix are required for each cell prediction. This limitation is compounded by the fact that a new inverse must be calculated for each variable (urban, residential, and rural), as each has a different value of the spatial autocorrelation parameter $\rho$.
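The scale of this burden can be made concrete with a short calculation. The sketch below uses the figures reported in this chapter (about ten seconds per inversion, J = 5, and N = 2500 cells per prediction region); the function names are invented for the example, and the cell count it derives for the full data set is an inference from the 17-day figure quoted in Section 6.2, not a number stated in the text.

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_INVERSION = 10.0   # approximate figure reported in Section 6.2
OUTER_CYCLES = 5               # J
CELLS_PER_REGION = 50 * 50     # N = 2500 for each prediction region


def inversion_days(cells, outer_cycles=OUTER_CYCLES,
                   seconds=SECONDS_PER_INVERSION):
    """Days spent on matrix inversions alone for one simulation run."""
    return cells * outer_cycles * seconds / SECONDS_PER_DAY


def cells_for_days(days, outer_cycles=OUTER_CYCLES,
                   seconds=SECONDS_PER_INVERSION):
    """Number of cells whose inversions would take the given number of days."""
    return days * SECONDS_PER_DAY / (outer_cycles * seconds)


print("one 2500-cell region, days:", inversion_days(CELLS_PER_REGION))
print("both regions, days:", inversion_days(2 * CELLS_PER_REGION))
print("cells implied by 17 days:", cells_for_days(17))
```

Under these figures a single 2500-cell region already needs well over a day of inversions per run, and the land cover models repeat that cost for each of the three variables.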
6.2 POPULATION PREDICTIONS
The population predictions were simulated using the methodology described in Section 6.1.1. Incorporating the results presented in Table 5.3, predictions were created for the years 2005, 2010, and 2020. Also, to better understand the effects of the time adjustment factor, a simulation was run with no time adjustment factor (which means that a specific year of prediction cannot be specified) using the results presented in Table 5.2. Because of the computational issues discussed below, only 5 outer-loop cycles, each with 200 inner-loop cycles, were used to simulate the prediction for each cell.
As discussed previously, the simulation procedure requires a large number of 1,000 × 1,000 matrix inversions. This placed a potentially huge computational burden on the simulations. Though each inversion required only about ten seconds on the computers used, a modest 5 outer-loop cycles per cell works out to over 17 days of computing time just to calculate the matrix inversions needed for the entire data set. Two things were done to compensate for this computational constraint. First, relatively small numbers of inner- and outer-loop cycles were used so as to minimize the computational time for each cell. Second, to reduce the total number of cells used in the prediction, two 15 km × 15 km (50 cell × 50 cell) regions were selected for the simulations. The two regions selected were the area around downtown Austin and a part of Cedar Park, which is a close suburb of Austin. A map showing these regions in the context of both Austin and the original data map is given in Figure 6.1. These two areas were chosen because they offered potentially interesting results for the predictions: the downtown area is a well-established urban area with a river running through it, whereas Cedar Park is a rapidly growing area with a mixture of rural and developed land.
Because of the logarithmic transformation of the population variable in the models estimated in Chapter 5, an exponential transformation of the simulation results was required in order to obtain predictions. However, this meant that small simulation fluctuations could be magnified exponentially in the population predictions. This happened in roughly 1% of the predictions. Because such occurrences are not only unrealistic, but also make it difficult to view the results, whenever a cell had a very high population prediction (greater than 3,000) its result was replaced with a simple average of the predictions for the surrounding cells.² The simulations were run using GAUSS programs. Sample variances from the outer-loop predictions were also calculated.
Figure 6.2 presents the results of the population predictions for the downtown Austin region as well as, for reference, the population data from the 2000 Census. In this and all of the figures presented in this section, dark columns represent areas of high population, while lighter ones represent areas with lower populations. It is immediately clear that none of the predictions bears any visual resemblance to the 2000 data; that is, the manner in which the population levels are distributed in the 2000 data is not reflected in the population predictions. Furthermore, the somewhat gentle variance of population levels in the Census
² The reason for choosing a population of 3,000 as the cutoff was twofold. First, as the population of a cell can also be interpreted as population density, and because the 2000 Census showed that even regions of high density rarely had cell populations greater than 2,000, populations higher than 3,000 could not reasonably be expected. Second, it was visually difficult to interpret the results when population predictions greater than 3,000 were included.
Figure 6.1 Map of the Austin, Texas area showing the Downtown and Cedar Park regions used for predictions. The green line outlines the original data set area.
Figure 6.2 Population data and predictions for the downtown Austin region. Panels: 2000 Census Data; 2005 Prediction; 2010 Prediction; 2020 Prediction.
data has, in the predictions, been replaced by a relatively low base population across the entire region with a few dozen large spikes of high population which seem to be randomly spread throughout the area. Also, the Austin CBD is located at the center of the downtown region, as evidenced by the area of high population in the center of the Census data, but there seems to be no recognition of this important feature in the predictions.
It is evident that the distribution of the population of the region is not being correctly accounted for by the model. This is despite the fact that spatial autocorrelation, which would be expected to account for some of the local spatial heterogeneity, was included in the simulation. One possible reason is that the effects of the autocorrelation are dominated by the non-random parts of the regression model (i.e., the explanatory variables and their respective parameters). However, the cells in this region are not identical, and if the non-random elements of the model dominated then the cells would be expected to have a greater variance of population levels than is seen (though the relatively low $R^2$ of this model indicates that such diversity may not be captured). The relative flatness of most of the cell predictions may in fact be a result of the autocorrelation effects: by incorporating information from surrounding cells, with more weight given to those which are closer, the effect might be to average out the differences in the data.
Another issue with the predictions is that they predict a decrease in the population over time. Though the highest population cells in the 2005 prediction are greater than those of the 2000 data, these highest levels decline as the year of the prediction increases. This is probably a consequence of the fact that the time adjustment factor used in this model was, at 0.943, slightly less than one. Thus, even if the model accounted for the spatial heterogeneities accurately, it apparently would not correctly predict the expected increase in population over time.³
Figures 6.3 and 6.4 present predictions which are, respectively, one (outer-loop) sample standard deviation above and below the average. The distribution of population levels in these above- and below-average predictions is roughly the same as in the predictions from Figure 6.2, and the difference between them is of little interest. That is, there is little additional spatial heterogeneity displayed in these predictions, and they certainly do not come any closer to matching the form of the 2000 Census data. These results concerning the variability of the predictions are representative of the other predictions discussed below; as they are not of great interest, they are not discussed further.
Figure 6.5 presents the population predictions for the Cedar Park region as well as the data from the 2000 Census. The 2000 Census data show population concentrated around a highway which runs diagonally, from the southeast to the northwest, through the region. The results of the predictions are very similar to those of the downtown Austin region: a low base of population with sporadic cells of high population; a total lack of resemblance to the spatial distribution of population in the Census data; and a decrease over time in the predicted population levels. The inability of the model simulations to account for the spatial distribution of population is even more evident in this case, because the alignment of the population along the highway is so distinct that any resemblance in the predictions would be obvious. Thus, the predictive power of the model, as far as population goes, appears to be fairly low.
³ Because the cell population levels are also densities, it might be argued that the densities may be expected, in some regions, to remain the same or even decrease over time. However, such an effect should not be captured by the model (recall that estimates of the population for the non-Census years were created using an exponential model which increased over time).
Figure 6.3 Population predictions for the downtown Austin region: +1 sample standard deviation from average. Panels: 2005; 2010; 2020.
Figure 6.4 Population predictions for the downtown Austin region: -1 sample standard deviation from average. Panels: 2005; 2010; 2020.
Figure 6.5 Population data and predictions for the Cedar Park region.
As noted above, the predicted population levels decrease over time, presumably as a result of the time adjustment factor. To gain a better understanding of this factor's effect on the predictions, Figure 6.6 presents the results of the population predictions simulated without the time adjustment factor. These predictions are close to, if not higher than, the 2005 predictions made with the time adjustment factor. This lends credence to the idea that the time adjustment factor is what causes the decrease in the predicted population levels over time. Though it creates unrealistic results in these predictions, it is important to note that the time adjustment factor does have a visible effect on the predictions.
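To give a sense of scale for the time adjustment, the sketch below tabulates the multiplier applied to the coefficients of the time-lagged variables for the three prediction years. It rests on two assumptions that should be checked against equations 4.54 and 4.55: that the factor a enters as the power $a^{T}$, and that T is counted in years from the 2000 data (so T = 5, 10, and 20). The factor values are those reported in this chapter; the function and variable names are invented for the example.

```python
def lagged_multiplier(a: float, years_ahead: int) -> float:
    """ASSUMPTION: time-lagged coefficients are scaled by a ** T."""
    return a ** years_ahead


# Time adjustment factors reported in this chapter.
FACTORS = {"population": 0.943, "residential": 0.99, "urban": 1.018}
# ASSUMPTION: horizons measured from the 2000 data (2005, 2010, 2020).
HORIZONS = (5, 10, 20)

table = {
    name: {t: lagged_multiplier(a, t) for t in HORIZONS}
    for name, a in FACTORS.items()
}

for name, row in table.items():
    print(name, {t: round(m, 3) for t, m in row.items()})
```

Under these assumptions the population multiplier shrinks steadily as the horizon lengthens, which is in line with the decline seen in the predictions. The multiplier only rescales the coefficients of the time-lagged variables, so whether a prediction rises or falls also depends on the signs and sizes of those coefficients and on the values of the variables themselves.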
6.3 LAND COVER PREDICTIONS
Urban, residential, and rural land cover predictions for 2005, 2010, and 2020 were simulated using the methodology presented in Section 6.1.2 and incorporating the results presented in Tables 5.33, 5.36, and 5.39. The urban land cover predictions for each year were used to instrument the residential and rural predictions. As with the population model, computational issues resulting from the large number of matrix inversions placed limits on the simulations. Thus, as with the population variable, predictions were only run for the downtown Austin and Cedar Park regions, and 5 outer-loop and 200 inner-loop cycles were used in the simulations.
Just as the logarithmic transformation caused some problems with the population predictions, the logistic transformation introduced some problematic results for the land cover predictions. The issue here was that when the exponential term in the logistic transformation (equation 6.2) becomes very large, the proportion is expected to approach one. However, the GAUSS program used sometimes had difficulty calculating the value of the proportion and would often return an indeterminate (non-numeric)
Figure 6.6 Population predictions for the downtown Austin and Cedar Park regions simulated without time adjustment. Panels: Downtown Austin; Cedar Park.
result. When this happened, as with the population predictions, averages from the surrounding cells were substituted for the missing values.
Figures 6.7 through 6.9 present the results of the urban, residential, and rural land cover predictions for the downtown Austin region as well as, for reference, the actual 2000 land cover data used in this work. For better visualization, the coloring of these figures, and of all those in this section, is the opposite of that used for the population predictions: lighter columns represent cells with higher proportions of the land cover of interest, whereas dark columns represent lower proportions. From the results of the urban and residential land cover predictions, it is seen that these simulations are plagued by many of the same problems as the population predictions. Specifically, the distribution of urban land cover present in the 2000 data is essentially non-existent in the predictions. For example, though it is difficult to see in the figure, in the 2000 data there is a band of cells with low proportions of urban and residential land cover which follows the river running from the northwest to the southeast corner of the region. This geographic feature is not accounted for at all in any of the urban or residential predictions. Furthermore, the predictions, compared to the 2000 data, have very little variation and seem to predict similar urban and residential land cover proportions for all cells. The 2020 predictions seem to have greater variation, but their scale is much smaller than that of the other predictions, so this impression is somewhat misleading.
Another problem shared with the population predictions is that proportions of urban and residential land cover are predicted to decrease significantly over time. The 2005 predictions show a large drop in urban and residential land cover proportions, and the proportions are then predicted to decrease further from those levels. Though for the residential model, which has a time adjustment factor less than one (0.99), such a result might be
Figure 6.7 Urban land cover data and predictions for the downtown Austin region.
Figure 6.8 Residential land cover data and predictions for the downtown Austin region.
Figure 6.9 Rural land cover data and predictions for the downtown Austin region.
expected, the time adjustment factor for the urban model is 1.018, which (in combination with the parameter and elasticity values for the time-lagged variables in Table 5.33) would lead to the expectation that urban proportions might increase over time. It is clear that the effects of the time adjustment factor may not always be garnered solely from whether it is below or above one.⁴
The rural land cover predictions for the downtown Austin region also have issues, though of a slightly different flavor. Though again the spatial distribution of the land cover proportions from the 2000 data is not reflected, there is actually more variance in the predictions than in the data. Furthermore, though the highest predicted rural proportions are lower than the highest levels in the 2000 data, the general level of the 2010 and, especially, 2005 predictions is higher than that of the actual data. This means that rural land cover is predicted to rise in the five years after 2000 and then decrease over time from there.
Figures 6.10 through 6.12 present the results of the urban, residential, and rural land cover predictions for the Cedar Park region, as well as the actual 2000 land cover data. The results are very similar to those from the downtown Austin region, except that the decrease in the land cover levels over time is less severe. In particular, the area of urban and residential land cover which surrounds the highway running through the region, which is very obvious in the 2000 data, is clearly not accounted for in the predictions. Furthermore, the general level of the rural and, especially, urban predictions is far more homogeneous than the 2000 data. The residential predictions have more diversity, but, as mentioned above, it has no relation to the patterns found in the 2000 data. It is also seen that the predictions for urban and residential land cover are far below the 2000 levels,
⁴ The problems with understanding the effects of the time adjustment factor are partially a consequence of the ambiguity as to how it should be understood theoretically: it could either be interpreted as a representation of the diminishing or increasing importance of the past, or it could be understood as a correction factor adjusting past variable values to future levels. Though there is some crossover in these interpretations, there is also great potential disparity, especially concerning predictions.
Figure 6.10 Urban land cover data and predictions for the Cedar Park region.
Figure 6.11 Residential land cover data and predictions for the Cedar Park region.
Figure 6.12 Rural land cover data and predictions for the Cedar Park region.
whereas the proportion of rural land cover is much higher in the predictions than in the 2000 data.
From all of these results, it is evident that, just as with the population predictions, this model and simulation technique are unable to create accurate and intuitive predictions. Though this might be expected from the low $R^2$ values of the model estimations, the severity of this shortcoming is greater than might be anticipated. It underscores the fact that the models and simulation techniques should be improved before they are used for predictive purposes.
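Both the population and the land cover predictions relied on the same repair step: a cell whose simulated value was unusable (a population above 3,000 in Section 6.2, a non-numeric proportion in this section) was replaced by an average of the surrounding cells. The sketch below illustrates one way to do this. The eight-cell neighborhood, the exclusion of other flagged cells from the average, and the function name are assumptions made for the example, since the text does not specify them.

```python
import math


def replace_with_neighbour_mean(grid, cutoff=3000.0):
    """Return a copy of grid in which flagged cells (NaN or above cutoff)
    are replaced by the mean of their unflagged neighbours.

    ASSUMPTIONS: an eight-cell neighbourhood is used, and neighbours that
    are themselves flagged are left out of the average.
    """
    rows, cols = len(grid), len(grid[0])

    def flagged(v):
        return math.isnan(v) or v > cutoff

    out = [row[:] for row in grid]
    for r in range(rows):
        for c in range(cols):
            if not flagged(grid[r][c]):
                continue
            vals = [
                grid[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c) and not flagged(grid[rr][cc])
            ]
            # If every neighbour is flagged too, the cell stays missing.
            out[r][c] = sum(vals) / len(vals) if vals else float("nan")
    return out
```

For population grids the default cutoff of 3,000 applies; for land cover proportions the cutoff can be set to `math.inf` so that only non-numeric values are replaced.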
6.4 SUMMARY
One of the stated goals in the development of the methodologies in Chapter 4 was to create models which could capture important spatial and temporal correlations and effects in the data. Chapter 5's results illustrate that interesting spatial and temporal characteristics of the data were being exposed by the models; however, they do not ensure that the results are of use in a practical setting. The predictions run in this chapter were a way of testing the models in this respect by applying them to an application which is of interest in both a theoretical and a practical sense. In fact, the predictions are an excellent way to test the goals of the development of these models: they require that the models create predictions of the future (i.e., account for temporal effects) at an individual cell level, which, when taken together, test the ability of the models to capture spatial heterogeneities.
Unfortunately, as evidenced by the results of the predictions, the models did not perform very well. In terms of temporal effects, the predicted future levels not only seemed incongruous with the 2000 data, but also generally implied unrealistic changes over time. Spatially, the predictions did not come close to reflecting the actual spatial distribution of the population and land cover levels across the regions. As noted previously, even though spatial effects, in the form of distance measures and spatial autocorrelation, were accounted for, the spatial characteristics of the region were not accurately captured by the model.
The poor spatial performance of the predictions could be a result of the simulation technique used. The results of many of the predictions seem to indicate that a general leveling or averaging-out process is going on. Though, as mentioned above, this might be a result of the dominance of the non-random parts of the model, it is more probable that the method used here to capture spatial autocorrelation has a smoothing effect when applied as it was here. No matter the cause, it is clear that in all of the models the accounting for spatial effects is not accurate enough to create usable predictions.
The results of the predictions also show that the time adjustment factor does not account for the effects of time in an intuitive sense. That is, many of the predicted variables, such as population and urban land cover, would be expected to increase over time; however, the time adjustment factor's effect is to cause a decrease in the predicted levels over time. It is also found that the effect of the time adjustment factor on the predictions cannot necessarily be determined directly from its value in relation to unity. From these results, more insight into the parameter's effect can be gained. Specifically, though it seems natural to use the factor in a predictive setting, it does not perform well or realistically in such an application. On the other hand, it should be noted that the parameter was introduced in order to account for differences in time lags, not specifically for predictive purposes (though this was a beneficial consequence of its form). Thus it is very possible that extending it to years outside of the data used to estimate it applies it in an incorrect context.
Furthermore, it is not clear that only the time-lagged variables should be used to capture the evolution of the world and interesting land-use dynamics; it seems problematic that the effect of an important variable, such as distance to the CBD, is not, in the context of these models, expected to change at all over time.
Despite all of the problems the models had in capturing the spatial and temporal aspects of the data, they are by no means rendered completely useless. It is evident from the discussions in Chapter 5 that interesting (and intuitive) results were generated from the models. Thus, though they are not useful for predictive purposes, they definitely have exploratory benefits with respect to the data. These benefits, while not as overtly practical as predictions, allow a better understanding of the complex interactions and inter-relations which form an important part of the development of the region.
CHAPTER 7: CONCLUSIONS & EXTENSIONS

In Chapter 1, the central goal of this thesis was established: to develop models which can incorporate a variety of spatial and temporal aspects of data, and to employ them in applications which are relevant to transportation and regional planning. Over the past six chapters, this thesis has achieved that goal. However, this work's true value lies not in its final result, but rather in the many interesting techniques and results which were developed and presented as it progressed. With this in mind, this chapter summarizes the various parts of this thesis, including discussions of the data, model development, model applications, and prediction simulations. It also develops some conclusions based on these various elements. Finally, there is a broad discussion of extensions which could be made to improve and further develop the techniques and models presented in this thesis.
7.1 SUMMARY AND CONCLUSIONS
One of the major accomplishments of this thesis is its application of established and new modeling methodologies, incorporating a range of spatial and temporal effects, to advanced transportation and regional planning applications. The motivation driving this was the fact that, as evidenced by the literature review, little work had been done towards applying spatial econometrics, let alone spatial econometrics in combination with panel data, to research relevant to transportation and regional planning. This is despite the fact that, intuitively, the effects of space and time on models in this field would be expected to be highly significant. Furthermore, in the area of land-use/land-cover modeling, not only has little quality work in applying spatial econometrics been done, but there is also a lack of consensus, across all of the modeling techniques (spatial econometrics, cellular automata, agent-based modeling, etc.), as to what constitutes a good land-use/land-cover model.
This thesis addressed these deficiencies by applying spatial and panel data econometric methodologies, chosen primarily for their statistical strengths and transparency in model interpretation, in a variety of interesting and relevant contexts.

In order to estimate models which are not only of interest in a transportation/regional planning context, but also could expose interesting effects of space and time on a region, a relevant data set had to be developed. With this in mind, a combination of land-cover data derived from Landsat satellite imagery, statistics derived from the land-cover information, demographic data from the U.S. Census, and cartographic data for the Austin, Texas region was collected. Issues with the density of the land-cover data and the fact that the spatial reference systems for the land-cover and Census data were not aligned required the use of a combination grid to integrate the data sets. Furthermore, because the years for the land-cover and Census data also did not line up, Census data approximations for non-decennial years had to be carried out. The result was a data set which could be used both to create relevant models and to test their ability to draw out the spatial and temporal characteristics of the data.

The centerpiece of this thesis is its development of a series of econometric methodologies to model data in space and time. The models can be separated into three main categories: regression models for continuous dependent variables, regression models for proportions data, and econometric approximations to differential equation models. For the first two types, lagged dependent variables could be incorporated; however, because the data panels are not equally spaced in time, a correction was developed. This correction used an estimated time adjustment factor to transform the parameters of the time-lagged explanatory variables.

For the continuous dependent variables, two panel data econometric models incorporating spatial autocorrelation and temporal random effects were
developed; one model assumed a single set of parameters for the entire data set (the panel data spatial linear regression model) while the other assumed two different coefficient sets determined, via a probit discrete choice model, by latent characteristics of the data (the panel data spatial linear regression model using probit sample selection). Though incorporating both a time-lagged dependent variable and spatial-autocorrelation was beyond the scope of this work, a model incorporating a time-lagged dependent variable but no spatial effects was developed (the LSDV model) so as to compare the benefits of such a model with those incorporating spatial effects. To model proportions data, an extension of the spatial regression model to logistic regression was developed (the panel data spatial logistic regression model). This allowed for models of binary proportions data with spatial autocorrelation and temporal random effects. In order to model proportions data with more than two types, a method to similarly model the data by applying further binary splits was also developed. Because of issues concerning correcting for heteroscedasticity and instrumenting models of secondary binary splits, a series of approximations and, possibly controversial, assumptions (especially concerning the random-effects term) had to be made. Finally, a methodology was developed by which a differential equation model in time and space could be estimated by approximating differentials by first differences in the data. Because of the way in which this framework was developed, spatial and temporal effects were explicitly incorporated into the models, allowing for a more transparent view of the effects of space and time on the models. To investigate these methodologies, a series of models were estimated, including models of population, average vehicles available per household, per capita income, median house value, and land-cover (urban, residential urban, and rural non-urban). 
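The idea of approximating differentials by first differences, mentioned above for the differential equation models, can be illustrated with a minimal sketch. It is not the estimation procedure used in this thesis; the function names, panel years, and values below are hypothetical and chosen only to show how unequally spaced panels and adjacent grid cells could yield difference quotients.

```python
def time_first_differences(years, values):
    """Approximate dy/dt between consecutive panels by first differences.

    The panels may be unequally spaced, so each difference is divided by
    the actual number of years between the two panels.
    """
    rates = []
    for k in range(len(years) - 1):
        rates.append((values[k + 1] - values[k]) / (years[k + 1] - years[k]))
    return rates


def space_first_differences(row, cell_size):
    """Approximate dy/dx along one row of grid cells (forward differences)."""
    return [(b - a) / cell_size for a, b in zip(row, row[1:])]


# Hypothetical panel years and cell populations (illustration only).
years = [1983, 1991, 1997, 2000]
population = [100.0, 180.0, 240.0, 255.0]
print(time_first_differences(years, population))  # [10.0, 10.0, 5.0]
```

Dividing by the actual gap between panels is what keeps the approximated rates comparable when the panels are not equally spaced in time.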
Though there is not nearly enough space to summarize all of the results here, a few key points are worth noting. First is the fact that the large size of the data set and the computationally demanding model aspects required sampling to be used to estimate the models. For most of the models, a series of random samples from the entire data set were drawn and the parameter estimates for these samples averaged to get unbiased estimates for the model parameters. However, this meant that accurate estimates for the significance of the model parameters could not be made; instead, they were only qualitatively analyzed based on averages from the sample estimates. Also, there are some issues as to the accuracy of the sampling, especially concerning its ability to handle large biases in individual sample parameter estimates and its use with small sample sizes.

The model results, in general, show that many spatial and temporal characteristics of the data are very important in the models. It is first noted that the effects of distance were found to saturate as the distances became large and, as such, the square roots of the distance measures were used. Consistently one of the most important explanatory variables was the distance to the CBD measure. The elasticities for this measure showed it to be the most influential explanatory variable in nearly all of the models. However, because it was based on the distance from a single point, it often over-estimated the effect of distance for regions which were very far from the CBD. As a result, the distance to the nearest highway measure often seemed to act as a correction factor to the distance to the CBD variable, allowing for more subtle spatial effects to be exposed by the model. This result could also have been an effect of mis-specification or multicollinearity, though this seems unlikely from the results when taken as a whole.

It was also seen from the model estimations that spatial autocorrelation and temporal random effects have a significant effect on the models. From the positive value of the spatial autocorrelation in all of the models incorporating it, it
is seen that regions with similar characteristics, be it population, per capita income, or land cover, tend to be located near one another. Also, from the sample selection models, it is seen that isolating further latent spatial heterogeneities by estimating two different models from a data set can greatly improve model performance.

A large number of results were generated investigating the effects of time-lagged explanatory variables and the time adjustment factor. It was seen that using time-lagged variables in the model, as opposed to not using them, did not significantly improve the results, but it did lead to some interesting model results. Though it was seen that the effect of the time adjustment factor was statistically significant, its actual effect on the model results was rather small. More importantly, though, the results established the legitimacy of the time adjustment factor, which is important for creating flexible predictive models without having to create separate forecast models.

The LSDV model investigated the effects of a time-lagged dependent variable on the continuous dependent variable models. Though for the population model the addition of the time lag significantly improved the model, in general it was seen that the loss of any explicit spatial information severely harmed the model performance. It is thus concluded that though the inclusion of a lagged dependent variable will probably be beneficial to a model, it should not be done at the expense of incorporating spatial information, which had significant explanatory value for all of the variables modeled.

Finally, in the differential equation models, the explicit effects of space and time on the models were found to be highly significant. For the time dimension, this was somewhat as expected because the approximations of the Census data for non-Census years were of the same form as that in the differential equation model.
Unfortunately, because of confusion concerning the random-effects terms in the models (i.e., which one was constant over time and which one
was constant over space), the estimation of the differential equation models was not carried out correctly. Essentially, the method used to estimate the time dimension should have been used for the space dimension, and vice versa. However, the results were still instructive, though it was seen that the small sample sizes used for the space dimension (which should have been used for the time dimension) created results of questionable accuracy.

To test the applicability of the models and their results in a more practical setting, predictions for the years 2005, 2010, and 2020 were run for population and land-cover variables. To evaluate the effect of spatial autocorrelation, Monte Carlo simulations were run. The results were not very promising, as the specific spatial heterogeneity of the modeled regions was not transferred to the predictions and the dynamics which the predictions show are not intuitive. The reason for this is not clear, though there is a good chance that the methodology for accounting for spatial autocorrelation was to blame and caused the predictions to average out rather than expose the spatial diversity of the region.

Broadly speaking, this thesis shows that in transportation and regional planning based models, temporal and, especially, spatial effects are very important and can offer significant improvements to models. Furthermore, their inclusion can lead to a better understanding of the underlying demographic and geographic dynamics and complexities which occur in a region. However, there are issues with the computational demands of estimating such models and how they may be used to accurately predict the future. Nonetheless, the results are very promising and lead the way for interesting new research which can extend and expand this work.
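The sampling scheme summarized in this section (drawing a series of random samples and averaging the resulting parameter estimates) can be sketched as follows. This is only an illustration under simplifying assumptions: ordinary least squares with one regressor stands in for the spatial models actually estimated, the data are synthetic, and the names `ols_fit` and `averaged_estimates` are hypothetical.

```python
import random


def ols_fit(xs, ys):
    """Return (intercept, slope) from a one-regressor least-squares fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def averaged_estimates(xs, ys, n_samples, sample_size, rng):
    """Estimate the model on several random subsamples and average the results."""
    fits = []
    for _ in range(n_samples):
        idx = rng.sample(range(len(xs)), sample_size)  # draw without replacement
        fits.append(ols_fit([xs[i] for i in idx], [ys[i] for i in idx]))
    avg_intercept = sum(f[0] for f in fits) / n_samples
    avg_slope = sum(f[1] for f in fits) / n_samples
    return avg_intercept, avg_slope, fits


# Synthetic data with known parameters (intercept 2, slope 3); illustration only.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(5000)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 1) for x in xs]
avg_intercept, avg_slope, fits = averaged_estimates(xs, ys, 20, 200, rng)
print(avg_intercept, avg_slope)
```

The spread of the individual subsample estimates in `fits` gives only a rough, qualitative sense of parameter stability; as noted above, it does not provide accurate significance estimates for the averaged parameters.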
7.2 EXTENSIONS
There are a number of valid extensions which may be applied to this research. Though by no means comprehensive, this section will point out some of the most interesting and important of these and how they relate to this work.

One interesting topic of research would be to better understand the sources and magnitudes of error in the data used for this work. Such error would not only occur as a result of things specifically done in this thesis, such as the Census approximations, but also due to causes outside of the researcher's control, like poor land-cover classification of the satellite images or problems with the original satellite images themselves. Such an investigation should also examine how such errors propagate through the work and the models, determine how well the models account for such error, and perform sensitivity analyses. Investigations of methods which could be used to improve the data set, such as using the land-cover data to incorporate Census data at a less aggregate level (e.g., see Mennis (2003)), would also be of interest.

With regard to the methodology, one of the immediate conclusions from this thesis is that combining spatial autocorrelation econometric methods with time-lagged dependent variables would lead to more powerful models. Investigations of further spatial econometric techniques, such as using spatial lags, would also be useful. More fundamentally, one thing which was not done in this work was to analyze the data ahead of time for spatial autocorrelation and heterogeneities. Such analyses (see Anselin 1988 and 1999 for details) would be beneficial and would lead the way to justifying model selection in a more rigorous fashion. Another area of interest would be to investigate more fully the properties of the panel data spatial logistic regression model developed in Chapter 4, especially with regard to the applicability of using the instrument variable for the second-stage binary split models.
Also, research into applying the panel data spatial probit model presented in the Appendix and comparing it to the logistic regression model could lead to improved land-cover models.

Of particular use would be research concerning more efficient algorithms for estimating spatial econometric models. If estimation methods could be developed which sped up model estimations, then this would lead to more widespread use of these methods. Along similar lines, investigations into the effects of using the sampling scheme to estimate the models would help to better understand the biases that these schemes might be causing. Of special interest is the effect they have on the spatial autocorrelation parameter and on the standard errors and significance tests of the parameters. In a related issue, using the correct procedure to sample and estimate the differential equation models should also be investigated.

One of the biggest problems in this thesis dealt with the time adjustment factor, which corrected for differences in time lags between data panels. There was a great deal of uncertainty and ambiguity as to how this parameter should be interpreted and used. Further research into this parameter, as well as alternative methods for accounting for time-lag differences, would be of great interest not only for work related to this thesis, but for all models incorporating multi-year data.

Possibly related to the time adjustment factor are the issues with the simulations run for this work. Determining whether the model forms, model results, or simulation methodology (or all three) is causing the poor predictions is of great interest. It seems that the problems might be caused by the Monte Carlo method used to simulate the effects of spatial autocorrelation; if so, a better way to capture this spatial effect in predictions is needed.

Finally, at a broad level, there are a great many areas where this work might be extended. Applying spatial econometric models to more practical applications, such as travel demand models or economic impact assessments,
would be of great interest. Also, extending the differential equations framework to creating dynamic models of population or land-use/land-cover change would be interesting research topics as well. Finally, it would be of great use to compare and contrast more traditional modeling techniques (for example ordinary least squares) with the methodologies presented in this thesis to gain a better understanding of the benefits and drawbacks, at a practical level, of incorporating spatial and temporal effects into econometric models.
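As an illustration of the kind of preliminary spatial autocorrelation analysis suggested in this section, the sketch below computes Moran's I, one common diagnostic. It is an assumption that this particular statistic would be the one chosen; the binary neighbor weights on a one-dimensional chain of cells and the function names are hypothetical and used only to keep the example small.

```python
def morans_i(values, weights):
    """Moran's I for a list of values and a square spatial weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [x - mean for x in values]
    s0 = sum(sum(row) for row in weights)  # sum of all weights
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / s0) * (num / den)


def chain_weights(n):
    """Binary weights for n cells in a line: adjacent cells are neighbors."""
    return [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
            for i in range(n)]


w = chain_weights(6)
clustered = morans_i([1, 1, 1, 5, 5, 5], w)    # similar values sit together
alternating = morans_i([1, 5, 1, 5, 1, 5], w)  # neighbors always differ
print(clustered, alternating)  # positive for clustered, negative for alternating
```

A clearly positive value, as found for the clustered pattern, would support specifying spatial autocorrelation in the model before estimation rather than inferring it afterwards.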
APPENDIX: MULTINOMIAL PANEL DATA SPATIAL PROBIT MODEL

In Chapter 4, a methodology for modeling proportions data and incorporating spatial autocorrelation was developed in the panel data spatial logistic regression model. As was noted in that chapter, another way to model proportions data is to use discrete choice models. Such a methodology is presented in this appendix. The reason a probit model is used is that a normally distributed error term allows for easy specification of spatial autocorrelation, because the sum of two normally distributed terms is also normally distributed. In contrast, to incorporate spatial autocorrelation in another popular discrete choice model, the multinomial logit, a much more complicated, and less elegant, model must be used.

In the estimation of the model, Monte Carlo integration is required, and the first section in this appendix covers the theory behind this topic. The second section covers the multinomial panel data spatial probit model, first by discussing the binary form of the model and then by extending it to J choice types. As mentioned in Chapter 4, attempts were made to estimate this model using the land-cover data used in this thesis, but there were issues with the models not converging (it seems that the model is extremely sensitive to initial conditions).
A.1 MONTE CARLO INTEGRATION
The model developed in section A.2 requires integrations over N-dimensional normal distributions, where N is the number of observations. If N exceeds 4, this is impossible to evaluate even numerically, let alone analytically. Instead, simulation techniques must be used. For this work, a standard Monte Carlo (MC) simulation method will be used.¹

For the models requiring simulation in this work, the form of the likelihood function for a single cell is

$$
L = \int_{v_1} \cdots \int_{v_N} \prod_{t=1}^{T} g(u_{it}, v_1, \ldots, v_N)\, h(v_1, \ldots, v_N \mid \sigma_v^2)\, dv_1 \cdots dv_N = E\left[F(v_1, \ldots, v_N \mid \sigma_v^2)\right] \tag{A.1}
$$

where $u_{it}$ is a vector of data for observation i in time period t, $g(u_{it}, v_1, \ldots, v_N)$ is a generic, real-valued function, $v_i \sim$ i.i.d. Normal$(0, \sigma_v^2)$,² $h(\cdot)$ is the probability density function (PDF) for the collection of $v_i$'s, and consequently $F(\cdot)$ is a cumulative distribution function (CDF). The law of large numbers implies that if $\{(v_{11}, \ldots, v_{1N}), \ldots, (v_{R1}, \ldots, v_{RN})\}$ is an i.i.d. sample from $h(v_1, \ldots, v_N \mid \sigma_v^2)$, then (Durrett 1996)

$$
\frac{1}{R} \sum_{r=1}^{R} F(v_{r1}, \ldots, v_{rN} \mid \sigma_v^2) \xrightarrow{\text{prob}} E\left[F(v_1, \ldots, v_N \mid \sigma_v^2)\right] \tag{A.2}
$$

Then, the simulated log-likelihood function

$$
\ln SL = \ln\left[\frac{1}{R} \sum_{r=1}^{R} F(v_{r1}, \ldots, v_{rN} \mid \sigma_v^2)\right] \tag{A.3}
$$

is maximized to obtain parameter estimates. Given an i.i.d. standard multivariate uniform sample contained on $[0,1]^N$, it is straightforward to create an i.i.d. normally distributed sample with variance $\sigma_v^2$. That is, if $(U_{r1}, \ldots, U_{rN})$ is an i.i.d. sample on $[0,1]^N$, an i.i.d. sample from the distribution of $(v_1, \ldots, v_N)$ can be created by

$$
(v_{r1}, \ldots, v_{rN}) = \left(\sigma_v \Phi^{-1}(U_{r1}), \ldots, \sigma_v \Phi^{-1}(U_{rN})\right) \tag{A.4}
$$

where $\Phi^{-1}(\cdot)$ is the inverse of the standard normal CDF. The MC method utilizes (A.4) to create R i.i.d. samples from the distribution of $(v_1, \ldots, v_N)$, and then estimates (A.1) using (A.3).

¹ Quasi-Monte Carlo (QMC) methods, which use deterministic as opposed to random sampling to estimate integrals, have recently been used with much success (see Bhat (2001) and Train (1999)). However, these methods are not accurate for high-dimensional integrals; depending on the method, the maximum dimension of integration that can be achieved accurately falls between around 9 and 200 dimensions. Furthermore, though the upper limit of 200 may actually be higher, to employ QMC in such a case requires a specialized sequence of numbers whose calculation is not straightforward. Because the dimension of integration being estimated in this work is at least 1,000, traditional MC was chosen over QMC methods.
² Recall from Chapter 4 that the tilde (~) is used to mean "is distributed."
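The MC procedure described in this section can be sketched in a few lines. The sketch below is a minimal illustration, not the GAUSS code used for this work: the integrand, the `offsets` values, and all function names are hypothetical. The integrand is a product of normal CDFs chosen because its expectation is known in closed form, which allows the simulated value to be checked.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, provides cdf() and inv_cdf()


def draw_random_effects(n_cells, sigma_v, rng):
    """One draw of (v_1, ..., v_N) via v = sigma_v * Phi^{-1}(U), as in (A.4)."""
    eps = 1e-12  # keep U strictly inside (0, 1) so inv_cdf is defined
    draws = []
    for _ in range(n_cells):
        u = min(max(rng.random(), eps), 1.0 - eps)
        draws.append(sigma_v * STD_NORMAL.inv_cdf(u))
    return draws


def mc_expectation(integrand, n_cells, sigma_v, n_draws, rng):
    """Average the integrand over R draws: the left-hand side of (A.2)."""
    total = 0.0
    for _ in range(n_draws):
        total += integrand(draw_random_effects(n_cells, sigma_v, rng))
    return total / n_draws


# Hypothetical integrand: product of normal CDFs with fixed offsets.
offsets = [0.3, -0.5, 1.0]


def integrand(v):
    return math.prod(STD_NORMAL.cdf(a + v_i) for a, v_i in zip(offsets, v))


sigma_v = 0.8
estimate = mc_expectation(integrand, len(offsets), sigma_v, 20000, random.Random(0))
# Closed form for this integrand: E[Phi(a + v)] = Phi(a / sqrt(1 + sigma_v^2)).
exact = math.prod(STD_NORMAL.cdf(a / math.sqrt(1.0 + sigma_v ** 2)) for a in offsets)
simulated_loglik = math.log(estimate)  # the quantity in (A.3)
print(estimate, exact, simulated_loglik)
```

In an actual estimation, the same set of uniform draws would typically be held fixed while the parameters are varied, so that the simulated log-likelihood is a smooth function of the parameters.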

A.2 THE MULTINOMIAL PANEL DATA SPATIAL PROBIT MODEL
To model proportions data, an alternative to the panel data spatial logistic regression model is the multinomial panel data spatial probit model (spatial probit model, for short). Though the spatial probit model is generally used for discrete choice modeling, it can be specified to model the proportion of a population which selects a particular choice. By replacing the interpretation that the spatial probit models the proportion of a population that selects a given choice with the interpretation that it models the fraction of the land cover in a particular cell, proportions of land cover in a particular cell can be modeled using the spatial probit model. This attribute of the spatial probit model (that it can be used to model proportions) is actually shared by other discrete choice models.

The reason for selecting the spatial probit model to model land cover over other potential models, in particular the multinomial logit, is that the way the spatial probit incorporates spatial correlation is much more natural. For example, to create a correlation structure in the multinomial logit (MNL), one method requires using indicator variables to capture the correlations between neighboring cells. The correlations cannot be specified in the error term, due to the properties of the Gumbel distribution, and a complex mixed logit form³ probably would be needed to account correctly for the correlations in an unbiased manner (so that the fraction of a particular type of land cover in a given cell is correlated with the fractions of land cover in surrounding cells). Furthermore, this approach would be problematic, since the number of required indicator variables would be huge and correctly restricting them (see discussion of spatial weights below) would be difficult.⁴ Alternatively, one could create a simplified correlation structure as in Bhat and Guo (2004), which provides computational feasibility. However, that structure puts significant restrictions on the spatial correlations and specifies them on the choice, rather than on the cell (individual), level. In fact, Bhat and Guo (2004) state clearly that the formulation of the spatial probit model captures spatial correlations more efficiently than the MNL.

³ See Train (2003) or McFadden and Train (2000) for a discussion of the mixed logit specification. The correlations could not easily be specified in the error term of an MNL model because the sum of two Gumbel-distributed terms is no longer Gumbel distributed, so the MNL model would no longer be valid. Thus, a structure analogous to the nested logit structure, which allows for correlations across choice alternatives (see Greene (2000)), does not exist for correlations across individuals. Furthermore, even if one did, the number of nests and the restrictions on the correlations as required by the present situation would be enormous and impractical.
⁴ If there were N cells total and m cells in each neighborhood around a cell, the number of required indicator variables would be on the order of Nm/2, which, even for modest N and m, would be large.

As discussed above, the discrete nature of the spatial probit model goes unused in this work: the model is being employed for one of its beneficial properties (i.e., that it estimates probabilities which, like proportions, lie on [0,1] and sum to unity). Typically, the spatial probit model is motivated via utility maximization. That is, the choice that an individual makes is the one that provides the greatest utility to him/her as opposed to all other choices (Greene 2000). It is important to note that in the spatial probit models developed here, this interpretation no longer holds. At the most fundamental level, this is because the discrete-choice nature of the model has been discarded and thus the utility-maximization interpretation is no longer valid here. Furthermore, there is always a problem when attempting to assign utility-maximizing principles to non-human agents (in this case cells of land), as they technically do not have a consciousness and do not make decisions (there is the possibility of a single land developer for every cell, but to assume this generally is probably incorrect). It is possible to motivate the utility-maximization framework through the idea that land cover is, to some degree, a human decision (via society) and thus the land cover in a cell is an indirect representation of a collective human decision that reflects utility maximization. On the surface this seems plausible (especially if Darwinian notions are extended to the collective human actions) but is problematic at a fundamental level, since it does not account correctly for land-cover choices which humans rarely have a direct hand in (e.g., natural forest or water areas). In general, it is best to drop the utility-maximizing viewpoint and instead motivate the model purely on the basis that dependent variables (i.e., fractions of land-cover types) are determined by various exogenous and endogenous variables.

Since it uses less cumbersome notation, the binary panel data spatial probit model (binary spatial probit model) will first be used to describe the model structure. The extension to the multinomial case is straightforward and is given at the end of this section. The general binary spatial probit model for N cells (individuals) and T time periods is given by:
$$
y_{it} = 1\{\tilde{y}_{it} > 0\} \tag{A.5}
$$

where $1\{\cdot\}$ is the indicator function and $\tilde{y}_{it}$, in stacked matrix form, is modeled by (Heckman 1981(a), Greene 2000):

$$
\Lambda \tilde{Y} = X\beta + Z\gamma + \varepsilon \tag{A.6}
$$

where $\Lambda$ is a $(TN \times TN)$ non-singular matrix with unit diagonal elements that allows the specification of spatial and temporal lags in the dependent variable, $\tilde{Y}$; $X$ is a $(TN \times K)$ matrix of exogenous variables; $Z$ is a $(TN \times L)$ matrix of potentially endogenous and/or time-lagged variables; $\varepsilon$ is an error term with zero mean; and $\beta$ and $\gamma$ are, respectively, $(K \times 1)$ and $(L \times 1)$ coefficient vectors. To specify spatial autocorrelation in the error terms, the same method used for the panel data spatial linear regression model (Chapter 4) is employed, and the vector of error terms is specified as (Anselin 1988, Greene 2001):

$$
\varepsilon = \xi + v \tag{A.7}
$$

where

$$
\xi = \rho W \xi + \eta \tag{A.8}
$$

$\eta$ is a $(TN \times 1)$ vector wherein every Nth element (one cell over all time periods) $\sim$ Normal$(0, \Sigma)$, $v$ is a $(TN \times 1)$ vector of i.i.d. Normal$(0, \sigma_v^2)$ elements of which every Tth one is equal (so there are only N unique elements), and $W$ is a $(TN \times TN)$ block diagonal matrix with T copies of the $(N \times N)$ spatial weight matrix $\tilde{W}$. For identification, $\Sigma$ must have at least one diagonal element normalized to unity and $(T-1)$ off-diagonal elements set to zero (Greene 2000 and 2002). For simplicity and ease of estimation, from here on $\Sigma$ will be assumed to be an identity matrix, and thus $\eta_{it} \sim$ i.i.d. Normal$(0,1)$. $v$ represents a random effect which captures unobserved heterogeneity among the individuals and is constant across time for each individual. Combining (A.6) through (A.8), the model can thus be rewritten as

$$
\tilde{Y} = \Lambda^{-1}X\beta + \Lambda^{-1}Z\gamma + \Lambda^{-1}v + \Lambda^{-1}(I - \rho W)^{-1}\eta \tag{A.9}
$$

The structure of (A.9) induces heteroscedasticity. To correct for this, Heckman's (1978) method (employed in Case (1992), Marsh, Mittelhammer, and Huffaker (2000), and Coughlin, Garrett, and Hernández-Murillo (2004)) is used here. By renaming the last term in (A.9) as

$$
u = \Lambda^{-1}(I - \rho W)^{-1}\eta \tag{A.10}
$$

whose covariance can be calculated as

$$
E(uu') = \left[\Lambda'(I - \rho W)'(I - \rho W)\Lambda\right]^{-1} \tag{A.11}
$$

a variance normalizing transformation for (A.9) can be constructed as a diagonal matrix $Q = \left(\operatorname{diag}[E(uu')]\right)^{-1/2}$. (A.9) can then be pre-multiplied by this $(TN \times TN)$ matrix to create a homoscedastic model:

$$
Q\tilde{Y} = Q\Lambda^{-1}X\beta + Q\Lambda^{-1}Z\gamma + Q\Lambda^{-1}v + Q\Lambda^{-1}(I - \rho W)^{-1}\eta \tag{A.12}
$$
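The variance-normalizing transformation described above can be checked numerically on a small example. The following sketch assumes NumPy is available and makes several illustrative assumptions that are not taken from this work: four cells arranged in a line, a row-standardized weight matrix, two time periods, a spatial autocorrelation value of 0.5, and no lags in the dependent variable (so the lag matrix is the identity). All names are hypothetical.

```python
import numpy as np


def chain_adjacency(n):
    """Adjacency matrix for n cells in a line (each cell neighbors the next)."""
    adj = np.zeros((n, n))
    for i in range(n - 1):
        adj[i, i + 1] = adj[i + 1, i] = 1.0
    return adj


N, T, rho = 4, 2, 0.5  # illustrative sizes and autocorrelation value

# (N x N) spatial weight matrix, row-standardized so each row sums to one.
W_tilde = chain_adjacency(N)
W_tilde = W_tilde / W_tilde.sum(axis=1, keepdims=True)

# (TN x TN) block diagonal matrix with T copies of W_tilde.
W = np.kron(np.eye(T), W_tilde)

# Assumption for this sketch: no spatial or temporal lags in the dependent variable.
Lam = np.eye(T * N)

# Covariance of u = Lam^{-1} (I - rho W)^{-1} eta with eta ~ Normal(0, I).
A = (np.eye(T * N) - rho * W) @ Lam
cov_u = np.linalg.inv(A.T @ A)

# Diagonal matrix holding the inverse standard deviations of u.
Q = np.diag(1.0 / np.sqrt(np.diag(cov_u)))

# After pre-multiplying by Q, every element of Qu has unit variance.
cov_Qu = Q @ cov_u @ Q.T
print(np.diag(cov_Qu))
```

Note that Q rescales only the variances; the off-diagonal elements of `cov_Qu` remain non-zero, so the spatial correlation between cells is retained after the transformation.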
Since $Q\tilde{y}_{it} > 0$ is equivalent to $\tilde{y}_{it} > 0$, the model is fundamentally the same. For notational simplicity, the model in (A.12) can be rewritten as

$$
Y = \Psi X\beta + \Psi v + \omega \tag{A.13}
$$

where $\Psi = Q\Lambda^{-1}$, $Y = Q\tilde{Y}$, $X = [X, Z]$, $\beta = [\beta, \gamma]$, and $\omega = \Psi(I - \rho W)^{-1}\eta$. The heteroscedastic correction discussed above effectively ignores the error component $v$; in fact, the correction via $Q$ induces heteroscedasticity in the $v$ component. However, this component represents an unobserved effect which cannot be measured directly. Following Greene (2001), the general form of the model in (A.13) can be written as

$$
f(y_{it} \mid x_{it}, v_i, W) = g(y_{it}, x_{it}, v_i, \sigma_v^2, W) \tag{A.14}
$$

where $f(\cdot)$ refers to a generic PDF. Likewise, the PDF of $v_i$ can be denoted as

$$
f(v_i) = h(v_i \mid \sigma_v^2) \tag{A.15}
$$

Note the important assumption that $v_i$ captures all of the correlation across time for an individual cell. This is clearly a false assumption if $\Lambda \neq I$ and if $Z \neq 0$ (or if $\Sigma \neq I\sigma^2$), but for the moment, this problem is ignored. Later, it will be shown how the random-effects framework can be used to account for the incidental parameters problem associated with lagged dependent variables, so the following discussion will be useful. Given the assumption just discussed, along with equations (A.14) and (A.15), the marginal PDF for a single cell across time is represented by:

$$
\begin{aligned}
f(y_{i1}, y_{i2}, \ldots, y_{iT}, v_i \mid x_{i1}, \ldots, x_{iT}, \beta, \sigma_v^2, W)
&= f(y_{i1}, y_{i2}, \ldots, y_{iT} \mid x_{i1}, \ldots, x_{iT}, v_i, \beta, \sigma_v^2, W)\, f(v_i) \\
&= \prod_{t=1}^{T} g(y_{it}, x_{it}, v_i, \sigma_v^2, W)\, h(v_i \mid \sigma_v^2)
\end{aligned} \tag{A.16}
$$
For estimation purposes, a log-likelihood function for (A.16) must be created. In order to do this, $v_i$ must be integrated out of the marginal PDF. After this, the likelihood function for all N cells can be written as follows:⁵

$$
L = \prod_{i=1}^{N} \int_{v_N} \cdots \int_{v_1} \prod_{t=1}^{T} \left[g(y_{it}, x_{it}, v_i, \sigma_v^2)\right]^{p_{it}} \left[1 - g(y_{it}, x_{it}, v_i, \sigma_v^2)\right]^{(1 - p_{it})} h(v_i \mid \sigma_v^2)\, dv_i \tag{A.17}
$$

where $p_{it}$ is the proportion of cell i at time t that has the land cover of interest (this is a binary representation); it is in this manner that the discrete choice model becomes a proportions model (see Greene 2000). The log-likelihood function would involve N one-dimensional integrations over a univariate normal distribution. However, as will be seen, when spatial correlation is included, the log-likelihood function requires N integrations over an N-dimensional normal distribution. This is approximated using Monte Carlo simulation.

⁵ Greene (2001) covers a slightly more general version of (A.17), allowing for different panel (time) periods for different individuals (cells).

As mentioned previously, the assumption used to construct the random-effects model, namely that the random effect captures all of the correlation over time for an individual cell, is inconsistent with (A.9). For the models discussed here, the lagged dependent variables will only occur in the T dimension; that is, only time lags and not spatial lags will be used. Spatial lags are often used in spatial probit models (e.g., see Case (1992) and Coughlin, Garrett, and Hernández-Murillo (2004)). However, this method does not formally measure spatial dependence and, without error term specifications, would lead to biased coefficient estimates. Furthermore, at an intuitive level, the actual proportions of land cover in a given cell do not depend on the land-cover proportions in the neighborhood, but rather are correlated with these. On the other hand, it could be argued that proportions of land cover in a given cell do depend on past proportions, hence the inclusion of time-lagged variables.⁶

⁶ In fact, dependent variables lagged in space and time could be used, but this would add another level of complexity to what is already an extremely dense model. A more generalized discussion of this topic, considering real and spurious state dependence in discrete choice models, is given in Heckman (1981a) and Hsiao (1986).

As mentioned in Chapter 4, with the inclusion of time-lagged variables, the incidental parameter problem (IPP) becomes an issue. For panel data discrete choice models, Honoré and Kyriazidou (2000) developed a logit method which, while consistent, is highly restrictive and unclear as to how the initial conditions impact the results. Heckman (1981(b)) offered a possible solution to the IPP for the dynamic panel probit by approximating a conditional distribution for the initial conditions. The method used here is one developed by Wooldridge (2002), which is similar in spirit to Heckman's but which turns out to be far simpler to estimate since it collapses to a random-effects structure under certain assumptions.

The method Wooldridge developed to address the IPP solves for the distribution of the outcomes conditioned on initial conditions and model parameters, and then employs an estimate for the distribution of the initial conditions using the random-effects parameter ($v$). First, all assumptions concerning $v$ are released, so the vector has a variance-covariance matrix associated with it ($\Sigma_v$). Then, if one assumes that the distribution of the random-effects parameter can be modeled correctly via a function of the initial conditions, the entire set of exogenous variables, and $\Sigma_v$, then the IPP can be addressed correctly. So, adapting (A.15), let

$$
f(v_i) = h(v_i \mid y_{i0}, \tilde{x}_{i1}, \ldots, \tilde{x}_{iT}, \Sigma_v) \tag{A.18}
$$

Then, the general form of (A.16) becomes (assuming for the moment no spatial correlation):

$$
f(y_{i1}, \ldots, y_{iT} \mid x_{i1}, \ldots, x_{iT}, v_i, \beta, \Sigma_v)\, f(v_i) = \prod_{t=1}^{T} g(y_{it}, x_{it}, v_i, \Sigma_v)\, h(v_i \mid y_{i0}, \tilde{x}_{i1}, \ldots, \tilde{x}_{iT}, \Sigma_v) \tag{A.19}
$$

In order to transfer this to (A.17), we need a workable form of (A.18). Wooldridge suggested assuming

$$
v_i \mid y_{i0}, \tilde{x}_{i1}, \ldots, \tilde{x}_{iT}, \Sigma_v \sim \text{Normal}\left(\alpha_0 + \alpha_1 y_{i0} + \alpha_2 \bar{x}_i,\ \sigma_v^2\right) \tag{A.20}
$$

where $\alpha_0$, $\alpha_1$, and $\alpha_2$ are parameters to be estimated and $\bar{x}_i$ represents the time average of the exogenous variables (those which are time-constant are left out). The assumption in (A.20) is more believable than it may seem at first, since a case may be made that the unobserved effect (which includes the essence of the initial conditions) is not only random, but also depends on the initial conditions as well as the time-varying exogenous variables.⁷ If (A.20) is accepted, the same assumptions made previously concerning $v_i$ are re-invoked, and the variables $y_{i0}$ and $\bar{x}_i$ are added to $X$, then the new model is of the form of (A.17), only with (A.18) replacing (A.15). Thus, with nearly no added complications to the model, the IPP can be resolved.

For the binary probit model, the function $g(\cdot)$ is the standard normal CDF, $\Phi(\cdot)$. So, using the notation described above, the log-likelihood function for the completely specified spatial binary probit model is:
$$
\log L = \sum_{i=1}^{N} \log \int_{v_1} \cdots \int_{v_N} \prod_{t=1}^{T} \left[\Phi\left([\Psi X\beta]_{it} + [\Psi v]_{it}\right)\right]^{p_{it}} \left[1 - \Phi\left([\Psi X\beta]_{it} + [\Psi v]_{it}\right)\right]^{1 - p_{it}} h(v_1, \ldots, v_N \mid \sigma_v^2)\, dv_1 \cdots dv_N \tag{A.21}
$$

The formulation thus far has effectively ignored $Z$ from (A.6), the explanatory variables which are either time-lagged or endogenous. In this work, $Z$ represents the mix and entropy statistics. Per the discussion in Chapter 4, these variables are considered to be exogenous. If $Z$ did contain endogenous variables, accounting for the issues associated with them could be done using a random parameters framework (see Greene 2002), though the application is not straightforward and would require additional model adjustments (see Wooldridge 1995 and Heckman 1981(a) and 1981(b)).

⁷ Wooldridge (2002) discusses in detail the potential problems with this assumption, all of which, he effectively argues, are no worse than those of other common econometric models, such as ordinary least squares regression and the multinomial logit model.

In order to generalize (A.21) to the spatial probit model with J choices, a (J-1)-dimensional normal distribution is used to generate the likelihoods. Specifically, for a discrete choice model, the likelihood function for an individual cell is (Greene 2000):
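$$
L = \int_{v_1} \cdots \int_{v_N} \prod_{t=1}^{T} \prod_{j=1}^{J} \left[ P_{itj}\left( x_{itj}\beta + v_{itj} \right) \right]^{p_{itj}} h\left( v_1, \ldots, v_N \mid \sigma_v^2 \right) dv_1 \cdots dv_N ,
$$

$$
P_{itj} = \mathrm{Prob}\left[ \varepsilon_{it1} - \varepsilon_{itj} < (x_{itj} - x_{it1})\beta + (v_{itj} - v_{it1}), \; \ldots, \; \varepsilon_{itJ} - \varepsilon_{itj} < (x_{itj} - x_{itJ})\beta + (v_{itj} - v_{itJ}) \right] \tag{A.22}
$$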
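The reduction from J errors to J-1 error differences can be sketched as follows. This is a minimal illustration, assuming the J errors for one cell and period have a known covariance matrix stored as a nested list; the function name `difference_covariance` and the argument `base` (the index of the chosen alternative) are hypothetical and do not appear in the original formulation.

```python
def difference_covariance(sigma, base):
    """Covariance of (eps_k - eps_base) for every alternative k != base.

    Uses Cov(e_k - e_b, e_m - e_b) = s_km - s_kb - s_bm + s_bb, which turns a
    J x J covariance matrix into the (J-1) x (J-1) matrix of the differences.
    """
    J = len(sigma)
    others = [k for k in range(J) if k != base]
    return [
        [sigma[k][m] - sigma[k][base] - sigma[base][m] + sigma[base][base]
         for m in others]
        for k in others
    ]


# Three alternatives with independent unit-variance errors, differenced
# against the first alternative.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
diff_cov = difference_covariance(identity, base=0)
```

Even when the original errors are independent, the differences are correlated, which is why a multidimensional normal probability has to be evaluated.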
In order to evaluate the probability expression, an approximation for the multidimensional normal must be made. For this work, the Geweke-Hajivassiliou-Keane (GHK) simulator, which has been shown to be quite effective, is used. Details on the simulator can be found in Hajivassiliou, McFadden, and Ruud (1996) and Greene (2000). Specifics on the computational methods used to evaluate (A.22) can be found in Ruud (1996) or, especially for applications using the GHK simulator, Navarro (2004).
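For readers unfamiliar with the GHK simulator, a minimal sketch of the idea is given below. It estimates the probability that a zero-mean multivariate normal vector falls below a set of upper bounds, which is the form taken by a choice probability once the errors are differenced. The function names, the pseudo-random draws (rather than, for example, Halton sequences), and the default number of draws are assumptions for illustration only; this is not the implementation used in this work.

```python
import math
import random
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def cholesky(S):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(S[i][i] - s)
            else:
                L[i][j] = (S[i][j] - s) / L[j][j]
    return L


def ghk_probability(upper, cov, n_draws=2000, seed=0):
    """Estimate Prob(e_1 < upper_1, ..., e_K < upper_K) for e ~ N(0, cov)."""
    rng = random.Random(seed)
    L = cholesky(cov)
    K = len(upper)
    total = 0.0
    for _ in range(n_draws):
        eta = []
        weight = 1.0
        for k in range(K):
            # Bound on eta_k implied by e_k < upper_k, given the earlier draws.
            shift = sum(L[k][m] * eta[m] for m in range(k))
            p = STD.cdf((upper[k] - shift) / L[k][k])
            weight *= p
            # Draw eta_k from N(0, 1) truncated above at the bound (inverse CDF).
            u = min(max(rng.random() * p, 1e-12), 1.0 - 1e-12)
            eta.append(STD.inv_cdf(u))
        total += weight
    return total / n_draws


# Bivariate example with correlation 0.5 and both bounds at zero.
estimate = ghk_probability([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], n_draws=5000)
```

For this bivariate example the exact orthant probability is 1/4 + arcsin(0.5)/(2*pi) = 1/3, so the estimate can be compared against a known value.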
REFERENCES
Allen, Peter M. 1997. Cities and Regions as Self-Organizing Systems: Models of Complexity. Amsterdam: Gordon and Breach.
Anselin, Luc. 1988. Spatial Econometrics: Methods and Models. Dordrecht: Kluwer Academic Press.
Anselin, Luc. 1999. Spatial Econometrics. Working paper. Accessed July 14, 2004: http://www.csiss.org/learning_resources/content/papers/baltchap.pdf
Aptech Systems, Inc. 1998. GAUSS, version 5.0, with Maximum Likelihood Estimation Module. Aptech Systems, Inc. Maple Valley, Washington.
Berling-Wolff, Sheryl, and Jianguo Wu. 2004. Modeling Urban Landscapes Dynamics: A Review. Ecological Research 19: 119-129.
Bhat, Chandra. 2001. Quasi-Random Maximum Simulated Likelihood Estimation of the Mixed Multinomial Logit Model. Transportation Research Part B 35: 677-693.
Bhat, Chandra, and Jessica Guo. 2004. A Mixed Spatially Correlated Logit Model: Formulation and Application to Residential Choice Modeling. Transportation Research Part B 38: 147-168.
Caliper Corporation. 2004. TransCAD GIS Software, version 4.7. Caliper Corporation, Newton, Massachusetts.
Case, Anne. 1992. Neighborhood Influence and Technological Change. Regional Science and Urban Economics 22: 491-508.
Candau, Jeanette Therese. 2002. Temporal Calibration Sensitivity of the SLEUTH Urban Growth Model. Master's Thesis. The University of California, Santa Barbara.
Cervero, Robert, and Kara Kockelman. 1997. Travel Demand and the Three Ds: Density, Diversity, and Design. Transportation Research D 2(3): 199-219.
Christaller, Walter. 1954. Die Zentralen Orte in Süddeutschland. Originally published by Gustave Fischer, 1933. Trans. C. Baskin, Bureau of Population and Urban Research, University of Virginia.
Chawla, Sanjay, Shashi Shekhar, Weili Wu, and Uygar Ozesmi. 2001. Modeling Spatial Dependencies for Mining Geospatial Data. Accessed July 10, 2004: http://www.siam.org/meetings/SDM01/pdf/sdm01_28.pdf
Clarke, Keith C. 1997. Land Transition Modeling With Deltatrons. Web Paper. Accessed July 10, 2004: http://www.ncgia.ucsb.edu/conf/landuse97/
Clarke, Keith C., Stacy Hoppen, and Leonard Gaydos. 1997. A Self-Modifying Cellular Automata Model of Historical Urbanization in the San Francisco Bay Area. Environment and Planning B 24: 247-261.
Clarke, Keith C., and Leonard Gaydos. 1998. Loose-Coupling a Cellular Automaton Model and GIS: Long-Term Urban Growth Prediction for San Francisco and Washington/Baltimore. International Journal of Geographical Information Science 12(7): 699-714.
Coughlin, Cletus C., Thomas A. Garrett, and Rubén Hernández-Murillo. 2003. Spatial Probit and the Geographic Patterns of State Lotteries. The Federal Reserve Bank of St. Louis Working Paper Series, working paper 2003-042A. Accessed March 1, 2004: http://research.stlouisfed.org/wp/2003/2003-042.pdf
Dubin, Robin A. 1992. Spatial Auto Correlation and Neighborhood Quality. Regional Science and Urban Economics 22: 433-452.
Durrett, Richard. 1996. Probability: Theory and Examples, 2nd Ed. Belmont: Duxbury Press.
Elhorst, J. Paul. 2001. Panel Data Models Extended to Spatial Error Autocorrelation or a Spatially Lagged Dependent Variable. University of Groningen, Research Institute SOM Research Paper 01C05. Accessed March 1, 2004: http://www.ub.rug.nl/eldoc/som/c/01C05/01C05.pdf
Elhorst, J. Paul. 2003. Specification and Estimation of Spatial Panel Data Models. International Regional Science Review 26: 244-268.
Fair, Ray C. 2003. Bootstrapping Macroeconometric Models. Studies in Nonlinear Dynamics and Econometrics 7(4): 1-24.
Federal Reserve Bank of Minneapolis. What is a Dollar Worth? Accessed February 23, 2004: http://woodrow.mpls.frb.fed.us/research/data/us/calc/
Frazier, Chris, and Kara Kockelman. 2003. Cities and Satellite Imagery: Models for Regional Change. Accessed March 3, 2004: http://www.ce.utexas.edu/prof/kockelman/public_html/TRB04SatData.pdf
Fujita, Masahisa. 1989. Urban Economic Theory: Land use and city size. Cambridge: Cambridge University Press.
Geoghegan, Jacqueline, et al. 2001. Modeling Tropical Deforestation in the Southern Yucatán Peninsular Region: Comparing Survey and Satellite Data. Agriculture, Ecosystems, and Environment 85: 25-46.
Greene, William. 2000. Econometric Analysis. Upper Saddle River: Prentice-Hall.
Greene, William. 2001. Fixed and Random Effects in Nonlinear Models. Preliminary draft. Accessed March 1, 2004: http://pages.stern.nyu.edu/~wgreene/panel.pdf
Greene, William. 2002. Convenient Estimators for the Panel Probit Model: Further Results. Accessed March 1, 2004: http://pages.stern.nyu.edu/~wgreene/panelprobitmodel.pdf
Hajivassiliou, Vassilis, Daniel McFadden, and Paul Ruud. 1996. Simulation of Multivariate Normal Rectangle Probabilities and their Derivatives: Theoretical and Computational Results. Journal of Econometrics 72: 85-134.
Heckman, James J. 1978. Dummy Endogenous Variables in a Simultaneous Equation System. Econometrica 46: 931-959.
Heckman, James J. 1981(a). Statistical Models for Discrete Panel Data, in Structural Analysis of Discrete Data and Econometric Applications, Charles F. Manski and Dan L. McFadden, eds. Cambridge: The MIT Press.
Heckman, James J. 1981(b). The Incidental Parameters Problem and the Problem of Initial Conditions in Estimating a Discrete Time-Discrete Data Stochastic Process, in Structural Analysis of Discrete Data and Econometric Applications, Charles F. Manski and Dan L. McFadden, eds. Cambridge: The MIT Press.
Hill, R. Carter, Lee C. Adkins, and Keith A. Bender. 2003. Test Statistics and Critical Values in Selectivity Models, in Advances in Econometrics Vol. 17: Maximum Likelihood Estimation of Misspecified Models: 20 Years Later, R. Carter Hill and Thomas B. Fomby, eds. Oxford: Elsevier Science.
Honoré, Bo E., and Ekaterini Kyriazidou. 2000. Panel Data Discrete Choice Models with Lagged Dependent Variables. Econometrica 68: 839-874.
Hsiao, Cheng. 1986. Analysis of Panel Data. Cambridge: Cambridge University Press.
Irwin, Elena G., and Jacqueline Geoghegan. 2001. Theory, Data, Methods: Developing Spatially Explicit Economic Models of Land Use Change. Agriculture, Ecosystems, and Environment 85: 7-23.
Jensen, Peter, Michael Rosholm, and Mette Verner. 2001. A Comparison of Different Estimators for Panel Data Sample Selection Models. Aarhus School of Business, Department of Economics Working Paper Series, paper 02-1. Accessed March 10, 2004: http://www.hha.dk/nat/WPER/02-1_pje.pdf
Judson, Ruth A., and Ann L. Owen. 1997. Estimating Dynamic Panel Data Models: A Practical Guide for Macroeconomists. Federal Reserve Board Finance and Economic Discussion Series, FEDS paper 1997-3. Accessed March 10, 2004: http://www.federalreserve.gov/pubs/feds/1997/199703/199703pap.pdf
Kiviet, Jan F. 1995. On Bias, Inconsistency, and Efficiency of Various Estimators in Dynamic Panel Data Models. Journal of Econometrics 68: 53-78.
Kline, Jeffrey D., and Ralph J. Alig. 2001. A Spatial Model of Land Use Change for Western Oregon and Western Washington. USDA Research Paper PNW-RP-528. Accessed July 10, 2004: http://www.fs.fed.us/pnw/pubs/rp528.pdf
Klosterman, R. E. 1999. What if?: Collaborative Planning Support System. Environment and Planning B 26: 393-408.
Kockelman, Kara M. 1997. Travel Behavior as a Function of Accessibility, Land Use Mixing, and Land Balance: Evidence from the San Francisco Bay Area. Transportation Research Record 1607: 117-125.
Kok, Kasper, Andrew Farrow, A. Veldkamp, and Peter H. Verburg. 2001. A Method and Application of Multi-Scale Validation in Spatial Land Use Models. Agriculture, Ecosystems, and Environment 85: 223-238.
Kweon, Young-Jun. 2004. Speed Choices and Crash Consequences: Effects of Speed Limit Policies on High-Speed Roadways. Doctoral Dissertation. The University of Texas at Austin.
Lam, N., and D. A. Quattrochi. 1992. On the Issues of Scale, Resolution, and Fractal Analysis in the Mapping Sciences. Professional Geographer 44: 88-98.
Lancaster, Tony. 2000. The Incidental Parameter Problem Since 1948. Journal of Econometrics 95: 391-413.
Machemehl, Randy. Personal conversation, February 20, 2004.
Marsh, Thomas L., Ron C. Mittelhammer, and Ray G. Huffaker. 1997. Spatial Correlation in Applied Econometric Models: A Generalized Model with an Application to Potato Production. Accessed March 1, 2004: http://www.agecon.ksu.edu/tlmarsh/Research/aaea_97.pdf
Mathworks, Inc., The. 1999. MatLab Student Version 5.3. The Mathworks, Inc. Natick, Massachusetts.
McFadden, Daniel, and Kenneth Train. 2000. Mixed MNL Models for Discrete Response. Journal of Applied Econometrics 15: 447-470.
Mennis, Jeremy. 2003. Generating Surface Models of Population Using Dasymetric Mapping. The Professional Geographer 55: 31-42.
Messner, Steve, and Luc Anselin. 2002. Spatial Analyses of Homicide with Areal Data. Working paper. Accessed July 12, 2004: http://agec221.agecon.uiuc.edu/users/anselin/papers/smla.pdf
Mugnier, Clifford J. 2000. The Basics of Classical Datums. Photogrammetric Engineering & Remote Sensing 66(4): 367-368.
Munroe, Darla, Jane Southworth, and Catherine M. Tucker. 2001. The Dynamics of Land-Cover Change in Western Honduras: Spatial Autocorrelation and Temporal Variation. Conference Proceedings. American Agricultural Economics Association. AAEA-CAES 2001 Annual Meeting. Accessed July 10, 2004: http://agecon.lib.umn.edu/cgi-bin/pdf_view.pl?paperid=2611
Nagendra, Harini, Darla K. Munroe, and Jane Southworth. 2004. From Patterns to Process: Landscape Fragmentation and the Analysis of Land Use/Land Cover Change. Agriculture, Ecosystems, and Environment 101: 111-115.
Navarro, Salvador. 2004. Economics 350 (University of Chicago) Class Handout #8, 2004. Accessed April 1, 2004: http://lily.src.uchicago.edu/econ350/Salvador/h8.pdf
Nelson, Gerald C., and Daniel Hellerstein. 1997. Do Roads Cause Deforestation: Using Satellite Images in Econometric Analysis of Land Use. American Journal of Agricultural Economics 79: 80-88.
Nelson, Gerald C., and Jacqueline Geoghegan. 2002. Deforestation and Land Use Change: Sparse Data Environments. Agricultural Economics 27: 201-216.
Nelson, Gerald C., Alessandro DePinto, Virginia Harris, and Steven Stone. 2004. Land Use and Road Improvements: A Spatial Perspective. International Regional Science Review 27(3): 297-325.
O'Sullivan, David, and Paul M. Torrens. 2000. Cellular Models of Urban Systems. Centre for Advanced Spatial Analysis Working Paper Series #22. Accessed July 12, 2004: http://www.casa.ucl.ac.uk/cellularmodels.pdf
Parker, Dawn C., Thomas Berger, and Steven M. Manson, eds. 2001. Agent Based Models of Land-Use and Land-Cover Change: Proceedings of an International Workshop, October 4-7, 2001. CIPEC Collaborative Report CCR-3. Accessed July 10, 2004: http://www.csiss.org/events/other/agent-based/additional/proceedings.pdf
Parker, Dawn C., Steven M. Manson, Marco A. Janssen, Matthew J. Hoffmann, and Peter Deadman. 2003. Multi-Agent Systems for the Simulation of Land-Use and Land-Cover Change: A Review. Annals of the Association of American Geographers 93(2): 314-340.
Parker, Dawn C., and Vicky Meretsky. 2004. Measuring Pattern Outcomes in an Agent-Based Model of Edge-Effect Externalities using Spatial Metrics. Agriculture, Ecosystems, and Environment 101: 233-250.
Richards, John A., and Xiuping Jia. 1999. Remote Sensing Digital Image Analysis. Berlin: Springer-Verlag.
Ruud, Paul A. 1996. Approximation and Simulation of a Multinomial Probit Model: An Analysis of Covariance Matrix Estimation. Working paper, 1996. Accessed April 1, 2004: http://emlab.berkeley.edu/users/ruud/montreal.pdf
Schneider, Laura C., and R. Gil Pontius Jr. 2001. Modeling Land-Use Change in the Ipswich Watershed, Massachusetts, USA. Agriculture, Ecosystems, and Environment 85: 83-94.
Smith, Stanley K., and Terry Sincich. 1992. Forecasting State and Household Populations: Evaluating the Forecast Accuracy and Bias of Alternative Population Projections for States. International Journal of Forecasting 8: 495-508.
Sonis, Michael. 2001. Complication and Complexity in Dynamics of Linear Systems in Economic Geography and Regional Science. Working paper presented at the 2001 North American Meeting of the Regional Sciences Association International in Charleston, South Carolina.
Southworth, Jane, Darla Munroe, and Harini Nagendra. 2004. Land Cover Change and Landscape Fragmentation: Comparing the Utility of Continuous and Discrete Analyses for a Western Honduras Region. Agriculture, Ecosystems, and Environment 101: 185-205.
Train, Kenneth. 1999. Halton Sequences for Mixed Logit. Accessed March 1, 2004: http://iber.berkeley.edu/wps/econ/E00-278.pdf
Train, Kenneth. 2003. Discrete Choice Methods with Simulation. Cambridge: Cambridge University Press.
Trelogan, Jessica. 2002. Html file titled tm00_blob Metadata. (Data description for student interpreted land-use/land-cover data for Travis County).
United States Census Bureau. 1993. United States Summary: Population and Housing Unit Counts 1790 to 1990. Accessed June 11, 2004: http://www.census.gov/population/censusdata/table-4.pdf
United States Census Bureau. 2004(a). Statistical Abstract of the United States 2003 - No. HS-2. Population Characteristics: 1900 to 2000. Accessed June 11, 2004: http://www.census.gov/statab/hist/HS-02.pdf
United States Census Bureau. 2004(b). U.S. Census web site. Accessed June 4, 2004: http://www.census.gov
Vandeveer, Lonnie R., Patricia Soto, and Huizhen Niu. 2002. A Statewide Spatial Analysis of the Effects of Location and Economic Development on Rural Land Values. Southwestern Economic Review 29(1): 1-20.
Veldkamp, A., and E. F. Lambin. 2001. Editorial: Predicting Land Use Change. Agriculture, Ecosystems, and Environment 85: 1-6.
Waddell, Paul. 2002. UrbanSim: Modeling Urban Development for Land use, Transportation, and Environmental Planning. The Journal of the American Planning Association 68(3): 297-314.
Wangen, Knut R., and Erik Biørn. 2001. Prevalence and Substitution Effects in Tobacco Consumption: A Discrete Choice Analysis of Panel Data. Discussion paper #312, Statistics Norway, Research Department. Accessed March 1, 2004: http://www.ssb.no/publikasjoner/DP/pdf/dp312.pdf
Wear, David N., and Paul Bolstad. 1998. Land-Use Changes in Southern Appalachian Landscapes: Spatial Analysis and Forecast Evaluation. Ecosystems 1: 575-594.
Weidlich, Wolfgang. 2000. Sociodynamics: A systematic approach to mathematical modelling in the social sciences. Amsterdam: Harwood Academic Press.
Wooldridge, Jeffrey M. 1995. Selection Corrections for Panel Data Models Under Conditional Mean Independence Assumptions. Journal of Econometrics 68: 115-132.
Wooldridge, Jeffrey M. 2002. Simple Solutions to the Initial Conditions Problem in Dynamic, Nonlinear Panel Data Models with Unobserved Heterogeneity. University College London Centre for Microdata Methods and Practice, Institute for Fiscal Studies CeMMAP working paper CWP18/02. Accessed March 4, 2004: http://cemmap.ifs.org.uk/docs/cwp1802.pdf
VITA
Christopher Rawls Frazier was born in Wilmington, Delaware on April 21, 1978, the son of Rawls Harrell Frazier and Mary Ruth Frazier. He is married to Kathryn Morgan Frazier and with her has one daughter, Violet Allegra Frazier. After graduating from Petaluma High School, Petaluma, California, in 1996, he enrolled in The University of California, Santa Barbara. He graduated from The University of California, Santa Barbara in 2000 with a Bachelor of Science in Physics and a Bachelor of Arts in Film Studies. In 2002, after idling for two years in New Orleans, Louisiana, he entered the Graduate School at The University of Texas at Austin.
Permanent Address: 2204 Enfield Rd. #209 Austin, Texas 78703
This thesis was typed by the author.
