Eugene Brusilovskiy
Outline
Review of Correlation
OLS Regression
Regression with a non-normal dependent variable
Spatial Regression
Correlation
Defined as a measure of how much two variables X and Y change together
Dimensionless measure:
A correlation between two variables is a single number that can range from -1 to 1, with positive values close to one indicating a strong direct relationship and negative values close to -1 indicating a strong inverse relationship
E.g., a positive correlation between income and years of schooling indicates that more years of schooling would correspond to greater income (i.e., an increase in the years of schooling is associated with an increase in income)
Generally denoted by the Greek letter ρ (rho)
Pearson Correlation: used when the variables are normally distributed
Spearman Correlation: used when the variables aren't normally distributed
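As a rough illustration of the two coefficients named above, here is a minimal sketch in plain Python. The schooling and income figures are made up for the example, and the helper names (pearson, ranks, spearman) are our own rather than anything from a statistical package. Spearman is computed here as the Pearson correlation of the ranks, with tied values sharing an average rank.

```python
import math

def pearson(x, y):
    """Pearson correlation: co-variation scaled by the spread of each variable."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Made-up illustration data: years of schooling and income (in $1000s)
schooling = [8, 10, 12, 12, 14, 16, 18, 20]
income = [18, 22, 30, 28, 41, 55, 80, 150]

r_pearson = pearson(schooling, income)
r_spearman = spearman(schooling, income)
print(f"Pearson:  {r_pearson:.3f}")
print(f"Spearman: {r_spearman:.3f}")
```

Both values are positive for this data. Because the made-up incomes are skewed by one large value, the rank-based Spearman coefficient comes out higher than the Pearson coefficient.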
Some Remarks
In practice, we rarely see perfect positive or negative correlations (i.e., correlations of exactly 1 or -1)
Correlations higher than 0.6 (or lower than -0.6) are considered to be strong
There might be confounding factors that explain a strong positive or negative correlation between variables
E.g., volume of ice cream consumption might be correlated with crime rates. Why?
Both tend to be high when the temperatures are warmer!
The correlation between two seemingly unrelated variables does not always equal exactly zero (although it will often be close to it)
Source: http://imgs.xkcd.com/comics/correlation.png
Regression
A statistical method used to examine the relationship between a variable of interest (dependent variable) and one or more explanatory variables (predictors)
Strength of the relationship
Direction of the relationship (positive, negative, zero)
Goodness of model fit
Allows you to calculate the amount by which your dependent variable changes when a predictor variable changes by one unit (holding all other predictors constant)
Often referred to as Ordinary Least Squares (OLS) regression
Regression with one predictor is called simple regression
Regression with two or more predictors is called multiple regression
Available in all statistical packages
Just like correlation, if an explanatory variable is a significant predictor of the dependent variable, it doesn't imply that the explanatory variable is a cause of the dependent variable
Example
Assume we have data on median income and median house value in 381 Philadelphia census tracts (i.e., our unit of measurement is a tract)
Each of the 381 tracts has information on income (call it Y) and on house value (call it X). So, we can create a scatter plot of Y against X.
Through this scatter plot, we can calculate the equation of the line that best fits the pattern (recall: Y=mx+b, where m is the slope and b is the y-intercept)
This is done by finding a line such that the sum of the squared (vertical) distances between the points and the line is minimized
Hence the term ordinary least squares
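The least-squares line described above has a well-known closed-form solution for one predictor, sketched below in plain Python. The tract values are hypothetical stand-ins, not the Philadelphia data, and the function names are our own.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = b + m*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sum_squared_residuals(x, y, intercept, slope):
    """Sum of squared vertical distances between the points and the line."""
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

# Hypothetical tract data (both in $1000s), not the actual Philadelphia values
house_value = [60, 85, 110, 150, 200, 260]   # X
income = [24, 30, 33, 45, 52, 70]            # Y

b, m = fit_simple_ols(house_value, income)
best = sum_squared_residuals(house_value, income, b, m)
# Nudging the slope away from the fitted value increases the sum of squares
worse = sum_squared_residuals(house_value, income, b, m + 0.01)

print(f"Income = {b:.2f} + {m:.3f} * House Value")
print(f"Sum of squared residuals at the fit: {best:.2f}, after nudging the slope: {worse:.2f}")
```

The comparison at the end is only a sanity check: any line other than the fitted one gives a larger sum of squared vertical distances.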
$$\text{Income} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \varepsilon,$$
where
An Example with 2 Predictors: Income as a function of House Value and Crime
$$\text{Income} = \beta_0 + \beta_1\,\text{House Value} + \beta_2\,\text{Thefts} + \varepsilon$$
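To make the two-predictor model concrete, the following sketch estimates an intercept and two coefficients by solving the normal equations (X'X)b = X'y in plain Python. This is one standard way to obtain OLS estimates, not necessarily what any particular statistical package does internally. The data and the names solve and fit_ols are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_ols(predictors, y):
    """Return [b0, b1, ..., bk] minimizing the sum of squared residuals."""
    rows = [[1.0] + list(p) for p in predictors]  # leading 1 for the intercept
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical tract data: house value and income in $1000s, thefts as a count
house_value = [60, 85, 110, 150, 200, 260]
thefts = [30, 12, 25, 8, 15, 3]
income = [20, 31, 32, 46, 51, 71]

b0, b1, b2 = fit_ols(list(zip(house_value, thefts)), income)
print(f"Income = {b0:.2f} + {b1:.3f} * House Value + {b2:.3f} * Thefts")
```

Each slope is read as the change in income associated with a one-unit change in that predictor, holding the other predictor constant.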
2. The predictors should not be strongly correlated with each other (i.e., no multicollinearity)
3. Very importantly, the observations should be independent of each other. (The same holds for regression residuals). If this assumption is violated, our coefficient estimates could be wrong!
N=140
Data Transformations
Sometimes, it is possible to transform a variable's distribution by subjecting it to some simple algebraic operation.
The logarithmic transformation is the most widely used to achieve normality when the variable is positively skewed (as in the image on the left below)
Analysis is then performed on the transformed variable.
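A small sketch of the logarithmic transformation, assuming made-up positively skewed values. It compares a simple moment-based skewness measure before and after taking natural logs. The skewness helper is our own, and a lower skewness is only a rough indication that the distribution is closer to normal.

```python
import math

def skewness(values):
    """Moment-based skewness: mean cubed deviation divided by the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Made-up, positively skewed values (for example, house values in $1000s)
raw = [40, 45, 50, 55, 60, 70, 85, 110, 180, 400]
logged = [math.log(v) for v in raw]  # natural log; requires strictly positive values

print(f"Skewness before: {skewness(raw):.2f}")
print(f"Skewness after log transform: {skewness(logged):.2f}")
```

Note that the log is only defined for strictly positive values, so variables containing zeros or negatives need a different treatment.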
Multinomial logistic regression: when your dependent variable is categorical and has more than two categories
E.g., Race: Black, Asian, White, Other
Ordinal logistic regression: when your dependent variable is ordinal and has more than two categories
E.g., Education: (1=Less than High School, 2=High School, 3=More than High School)
Spatial Autocorrelation
Recall:
There is spatial autocorrelation in a variable if observations that are closer to each other in space have related values (Tobler's Law)
One of the regression assumptions is independence of observations. If this doesn't hold, we obtain inaccurate estimates of the coefficients, and the error term contains spatial dependencies (i.e., meaningful information), whereas we want the error to be indistinguishable from random noise.
This example is obviously a dramatization, but nonetheless, in many spatial problems points which are close together have similar values
Ranges from -1 to 1, just like the non-spatial correlation coefficient
Can be calculated in ArcGIS
As in OLS regression, we can include independent variables in the model.
Whereas we will see spatial autocorrelation in OLS residuals, the spatial lag (SL) model should account for spatial dependencies, so the SL residuals should not be autocorrelated.
Hence the SL residuals should not be distinguishable from random noise (i.e., they should have no consistent patterns or dependencies in them)
When we have n observations, we form an n x n table (called a weight matrix or a link matrix) which summarizes all the pairwise spatial relationships in the dataset
These weight matrices are used in the estimation of spatial regression (and the calculation of Moran's I).
Unless we have compelling reasons not to do so, it's generally a good idea to see whether our results hold with different types of weight matrices
Point #   1  2  3  4  5  6  7  8  9  10
1         0  1  0  0  1  0  0  0  0  1
2         1  0  0  1  0  1  0  0  0  0
3         0  0  0  1  1  1  1  0  0  0
4         0  1  1  0  1  0  1  0  1  0
5         1  0  1  1  0  1  1  1  0  0
6         0  1  1  0  1  0  1  0  0  0
7         0  0  1  1  1  1  0  0  0  1
8         0  0  0  0  1  0  0  0  1  1
9         0  0  0  1  0  0  0  1  0  1
10        1  0  0  0  0  0  1  1  1  0
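To show how such a weight matrix enters the calculations, the sketch below hard-codes the 10-point binary matrix shown above, row-standardizes it, and computes Moran's I with the standard formula I = (n / S0) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2, where S0 is the sum of all weights. The attribute values attached to the ten points are made up, and the function names are our own. Packages such as GeoDa or ArcGIS also report significance tests, which are not reproduced here.

```python
# Binary weight matrix for the 10 points (1 = the two points are neighbors)
W = [
    [0, 1, 0, 0, 1, 0, 0, 0, 0, 1],
    [1, 0, 0, 1, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 1, 1, 1, 0, 0],
    [0, 1, 1, 0, 1, 0, 1, 0, 0, 0],
    [0, 0, 1, 1, 1, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0, 0, 1, 1, 1, 0],
]

def row_standardize(weights):
    """Scale each row to sum to 1 (assumes every point has at least one neighbor)."""
    return [[w / sum(row) for w in row] for row in weights]

def spatial_lag(values, weights):
    """Weighted sum of neighbors' values; an average if weights are row-standardized."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]

def morans_i(values, weights):
    """Moran's I for the given values and weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(d * d for d in dev)

# Made-up attribute values for the 10 points (for example, income in $1000s)
values = [12, 15, 30, 28, 25, 27, 31, 14, 16, 13]

W_std = row_standardize(W)
lagged = spatial_lag(values, W_std)   # average value among each point's neighbors
i_value = morans_i(values, W)

print("Neighbor averages:", [round(v, 1) for v in lagged])
print(f"Moran's I: {i_value:.3f}")
```

Rerunning morans_i with W_std instead of W is a quick way to check whether a result holds under a different weighting scheme.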
GeoDa
A software package developed by Luc Anselin
Can be downloaded free of charge (for members of educational and research institutions) at https://www.geoda.uiuc.edu/
Has a user-friendly interface
Accepts ESRI shapefiles as inputs
Is able to perform a number of basic GIS operations in addition to running the sophisticated spatial statistics models
These methods also aim to account for spatial dependencies in the data