(i) Normality
(ii) Homogeneity of Variance
(iii) Fixed X (X represents explanatory variables)
(iv) Independence
(v) Correct model specification (Zuur et al., 2007).
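Two of these assumptions, normality and homogeneity of variance, are usually checked on the model residuals. The sketch below shows one possible way to do that on synthetic data: a Shapiro-Wilk test (scipy.stats.shapiro) for normality, and a rough comparison of residual spread between low and high fitted values for homogeneity. The data, the variable names, and the halves-based spread check are illustrative assumptions, not the procedure used in this analysis.

```python
import numpy as np
from scipy import stats

# Synthetic data (not the study data): a straight line plus constant-variance noise.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=300)
y = 3.0 + 1.5 * x + rng.normal(scale=1.0, size=300)

# Ordinary least squares fit of a straight line; polyfit returns [slope, intercept].
slope, intercept = np.polyfit(x, y, deg=1)
fitted = intercept + slope * x
residuals = y - fitted

# (i) Normality: Shapiro-Wilk test on the residuals.
w_stat, p_value = stats.shapiro(residuals)

# (ii) Homogeneity of variance: compare the residual standard deviation
# in the lower and upper halves of the fitted values.
order = np.argsort(fitted)
low, high = residuals[order[:150]], residuals[order[150:]]
spread_ratio = high.std(ddof=1) / low.std(ddof=1)

print("Shapiro-Wilk W:", w_stat, "p-value:", p_value)
print("Residual spread ratio (high/low):", spread_ratio)
```

A small p-value would suggest non-normal residuals, and a spread ratio far from 1 would suggest unequal variance; a residual plot remains the more common visual check.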
Note that the land use-land cover (LULC) data were categorical and
needed to be converted to dummy variables (0/1 values). I used the
Pandas function pd.get_dummies to encode the nominal LULC data so
that it could be included among the predictors of NPP.
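As a minimal sketch of this step, the snippet below encodes a small, made-up LULC column with pd.get_dummies. The column names, class labels, and values are hypothetical and are not the study data.

```python
import pandas as pd

# Hypothetical sample: column names and values are illustrative only.
df = pd.DataFrame({
    "Elevation": [120.0, 450.0, 300.0, 80.0],
    "LULC": ["Forest", "Grasslands", "Croplands", "Forest"],
})

# Build one 0/1 column per LULC class, then swap it in for the nominal column.
dummies = pd.get_dummies(df["LULC"], dtype=int)
encoded = pd.concat([df.drop(columns="LULC"), dummies], axis=1)

print(encoded)
```

Passing drop_first=True would drop one class as the reference level, which avoids perfect collinearity between the dummies and the intercept; whether the original analysis did this is not stated.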
Train/Test
The model is trained on data with known outputs and then
evaluated on held-out test data to see how well it generalizes
to data it was not trained on. The training data (X_train,
y_train) are used to fit the linear regression model, and the
test data are used to measure its predictive accuracy. The fitted
model is then used to predict NPP2001 from the independent variables.
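The workflow described above can be sketched as follows. The synthetic X and y, the 80/20 split, and random_state=0 are assumptions made for illustration; the split ratio actually used is not stated here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the predictors and NPP2001 (not the study data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Hold out 20% of the rows for testing (the ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit on the training data only, then predict the held-out rows.
lm = LinearRegression()
lm.fit(X_train, y_train)
predictions = lm.predict(X_test)

# R2 on rows the model has not seen during fitting.
test_r2 = lm.score(X_test, y_test)
print("Test R2:", round(test_r2, 3))
```

Scoring on the held-out rows, rather than on the training rows, is what gives an estimate of how the model behaves on new data.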
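import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing, svm
from sklearn.preprocessing import StandardScaler
from sklearn import model_selection, metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, cross_val_predict

# Fit the linear model on the training data, then predict the test data.
# X_train, X_test, y_train and y_test come from the train/test split described above.
lm = LinearRegression(n_jobs=-1)
model = lm.fit(X_train, y_train)
predictions = lm.predict(X_test)

# Residuals against actual values: a visual check for homogeneity of variance.
plt.scatter(y_test, y_test - predictions)
plt.title("Homogeneity of Variance")
plt.xlabel("Actual NPP2001")
plt.ylabel("Residual")
plt.show()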
# Perform 10-fold cross-validation (KFold)
scores = cross_val_score(model, X, y, cv=10)
print("Cross-validated scores:", scores)

# Show the row indices in each fold (X and y are assumed to be NumPy arrays)
kf = KFold(n_splits=10, shuffle=True, random_state=None)
for train_index, test_index in kf.split(X):
    print("TRAIN", train_index, "TEST", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
# Make cross-validated predictions
predictions2 = cross_val_predict(model, X, y, cv=10)

# Check R2, the proportion of variance in the dependent variable explained by the predictors
accuracy = metrics.r2_score(y, predictions2)
print("This is R2", accuracy)
# Plot actual against cross-validated predicted values
plt.scatter(y, predictions2, color='c', marker='.', label="Cross-validated prediction")
plt.legend(loc=4)
plt.xlabel("Actual NPP2001", size=10)
plt.ylabel("NPP2001_Predict", size=10)
plt.title("Actual and Predicted NPP2001 Values using 10 Fold Cross Validation", size=10)
plt.show()
The cross-validated R2 indicates that the predictors account for
70.2% of the variance in net primary productivity (NPP) for 2001.
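To make the meaning of this figure concrete, the sketch below computes R2 by hand as 1 - SS_res/SS_tot from 10-fold cross-validated predictions and compares it with metrics.r2_score. The data are synthetic and the shuffled KFold with random_state=1 is an assumption, so the printed value is unrelated to the 70.2% reported above.

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic predictors and response (not the study data).
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=1.0, size=300)

# Each row is predicted by a model fitted on the other nine folds.
cv = KFold(n_splits=10, shuffle=True, random_state=1)
y_pred = cross_val_predict(LinearRegression(), X, y, cv=cv)

# R2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# The library function should agree with the manual calculation.
r2_sklearn = metrics.r2_score(y, y_pred)
print("Manual R2:", r2_manual, "sklearn R2:", r2_sklearn)
```

Because every prediction comes from a model that never saw that row, this R2 is an out-of-sample estimate and is typically a little lower than the R2 computed on the training data.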
1.274E-04*(b1_PG2001) + 2.314E-03*(SPTFPR2001) - 1.147E-01*(b1_Tmn)
+ 8.877E-01*(X2001WSI) + 1.326E-01*(b1_Vap) - 3.43E-05*(Elevation)
- 1.0E-01*(Forest) - 1.27*(Closed_Shrublands) - 9.79*(Open_Shrublands)
- 1.019E-01*(Woody_Savannas) - 9.549E-02*(Savannas) - 1.0422*(Grasslands)
- 1.22E-02*(Croplands)
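For illustration, the terms listed above can be evaluated for a single site as a weighted sum. The coefficients are copied from the equation; the site values and the function name sum_of_terms are hypothetical. No intercept appears among the terms as shown, so the function returns only the sum of the listed terms.

```python
# Coefficients copied from the fitted equation above.
COEFFICIENTS = {
    "b1_PG2001": 1.274e-04,
    "SPTFPR2001": 2.314e-03,
    "b1_Tmn": -1.147e-01,
    "X2001WSI": 8.877e-01,
    "b1_Vap": 1.326e-01,
    "Elevation": -3.43e-05,
    "Forest": -1.0e-01,
    "Closed_Shrublands": -1.27,
    "Open_Shrublands": -9.79,
    "Woody_Savannas": -1.019e-01,
    "Savannas": -9.549e-02,
    "Grasslands": -1.0422,
    "Croplands": -1.22e-02,
}


def sum_of_terms(site):
    """Sum coefficient * value over the listed predictors.

    Predictors missing from `site` are treated as 0.
    """
    return sum(coef * site.get(name, 0.0) for name, coef in COEFFICIENTS.items())


# Hypothetical site: the values are made up for illustration.
site = {"b1_PG2001": 1000.0, "Elevation": 200.0, "Forest": 1}
value = sum_of_terms(site)
print("Sum of listed terms:", value)
```

Only one LULC dummy should be set to 1 for a given site, since each site belongs to a single land-cover class.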
Reference
https://www.medium.com/towards-data-science/train-test-split-and-cross-validation-in-python-80b61beca4b6
(retrieved on June 28, 2017).