Multiple Linear Regression using Python Machine Learning
Kaleab Woldemariam, June 2017

Objective: The objective of this exercise is to predict Net Primary Productivity (NPP, a major ecosystem health indicator) from climate and land use data for the Upper Blue Nile Basin, Ethiopia, East Africa. NPP is derived from Gross Primary Productivity (GPP), an ecosystem-level parameter that refers to the rate at which green plants produce organic matter by assimilating carbon dioxide using solar energy through photosynthesis (Liang et al., 2012). Net Primary Productivity is the difference between GPP and plant autotrophic respiration. Approximately 50% of the organic matter generated by gross primary production is released into the atmosphere through plant respiration; the other half, which constitutes NPP, is the biomass produced in a given time (Liang et al., 2012). The following variables were used:
- NPP (dependent variable): dataset for the years 2001 to 2010, downloaded from NASA's Reverb/ECHO website. Data from 2001 was used for the regression analysis.
- Precipitation: GPCC (Global Precipitation Climatology Centre) raster image.
- Land use/land cover classification: images for 2001 and 2010, acquired from MODIS Land Cover (MCD12Q1) via Reverb/ECHO.
- Fraction of Photosynthetically Active Radiation (fAPAR): SPOT satellite, AfSIS raster image (ftp://africagrids.org).
- Digital Elevation Model (DEM): ftp://srtm.csi.cgiar.org.
- Minimum Temperature, Vapor Pressure, and WSI (Water Stress Index, derived from Potential Evapotranspiration and Actual Evapotranspiration) from CRU TS 3.22 time-series data (Climatic Research Unit, University of East Anglia).

In this exercise, a total of 2,377 random sample points were collected from the raster data using ArcGIS 10.3. I used the Pandas module for loading the comma-delimited (CSV) file, the NumPy module to convert the data into arrays, scikit-learn for computing the multiple linear regression, and the Matplotlib module for plotting the results.

Certain assumptions about the dataset must be met before conducting multiple linear regression. In ecological studies, both statistical and spatial contexts must be considered in modeling. To keep things simple, only the statistical assumptions are addressed here. Multiple linear regression assumes:

(i) Normality
(ii) Homogeneity of Variance
(iii) Fixed X (X represents explanatory variables)
(iv) Independence
(v) Correct model specification (Zuur et al., 2007).
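
As a minimal sketch (not part of the original workflow) of how the first two assumptions might be checked, using SciPy and the residuals y_test - predictions from the model fitted later in this exercise:

from scipy import stats
import matplotlib.pyplot as plt

# Shapiro-Wilk test: a small p-value casts doubt on the normality of the residuals
residuals = y_test - predictions
stat, p = stats.shapiro(residuals)
print("Shapiro-Wilk statistic:", stat, "p-value:", p)

# Residuals vs. predicted values: a funnel shape suggests non-constant variance
plt.scatter(predictions, residuals, marker='.')
plt.xlabel("Predicted NPP2001")
plt.ylabel("Residual")
plt.show()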

Note that the land use-land cover (LULC) data were categorical and needed to be converted to dummies (0/1 values). I used a Pandas function, pd.get_dummies, to transform the nominal LULC data so that it could be included in predicting NPP.
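
As a minimal illustration of what pd.get_dummies does, using made-up LULC labels rather than the actual dataset:

import pandas as pd

# Hypothetical nominal LULC column, for illustration only
lulc = pd.DataFrame({'LULC': ['Forest', 'Croplands', 'Savannas', 'Forest']})
print(pd.get_dummies(lulc['LULC']))
# Each category becomes its own 0/1 column:
#    Croplands  Forest  Savannas
# 0          0       1         0
# 1          1       0         0
# 2          0       0         1
# 3          0       1         0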

To segregate the numerical and categorical data, I used a separate Pandas DataFrame, data1, for the Precipitation, fAPAR, Minimum Temperature, Vapor Pressure, WSI, and Elevation features (the numerical independent variables), converted the categorical LULC columns to dummies, and eventually joined the two datasets as a NumPy array X. The dependent variable, NPP2001, was also converted to an array y using NumPy.

Train/Test

The model is trained to predict the known outputs and is later tested on held-out test data so that it can be generalized to other, non-trained data. The test data is used to assess the prediction ability (accuracy) of the model. The training data (X_train, y_train) is used to fit the regression model, i.e., to build a linear model. This model is then used to predict NPP2001 from the independent variables.


'''Regression for predicting NPP using features (independent variables)
in Machine Learning'''
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, KFold, cross_val_score, cross_val_predict
import matplotlib.pyplot as plt
from matplotlib import style

style.use('ggplot')

raw_data = 'mydata_2001_2377_BlueNile.csv'
df = pd.read_csv(raw_data)

# Create a DataFrame for the numerical features
data1 = pd.DataFrame(df, columns=['b1_PG2001', 'SPTFPR2001', 'b1_Tmn',
                                  'X2001WSI', 'b1_Vap', 'Elevation'])
print(data1.shape)

# Create a DataFrame for the categorical features
cols_to_transform = pd.DataFrame(df, columns=['Forest', 'Closed_Shrublands',
                                              'Open_Shrublands', 'Woody_Savannas',
                                              'Savannas', 'Grasslands', 'Croplands'])
dummies = pd.get_dummies(cols_to_transform)

# Join data1 and dummies and yield a NumPy array
X = np.array(data1.join(dummies))

# Specify the dependent variable as an array
y = np.array(df['NPP2001'])

lm = LinearRegression(n_jobs=-1)

'''To check the accuracy/confidence of the prediction, 25% of the
data is held out for testing, while 75% is used for training.'''
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

# First, fit a model on the training data
model = lm.fit(X_train, y_train)

# Print the coefficients
print("The linear coefficients", model.coef_)

# Predict y (NPP_Predict) for the test features (X_test)
predictions = lm.predict(X_test)

# Accuracy (R^2) of the prediction on the test set
confidence = lm.score(X_test, y_test)
print("This is predicted NPP2001 Values", predictions)
print("This is the prediction accuracy", confidence)

# Plot actual vs. predicted NPP2001 for the test set
plt.scatter(y_test, predictions, color='c', marker='.')
plt.title("Actual NPP2001 vs. NPP2001_Predict", size=10)
plt.xlabel("Actual NPP2001", size=10)
plt.ylabel("NPP2001_Predict", size=10)
plt.show()

# Plot residuals vs. actual values to inspect homogeneity of variance
plt.scatter(y_test, y_test - predictions)
plt.title("Homogeneity of Variance")
plt.xlabel("Actual NPP2001")
plt.ylabel("Residual")
plt.show()

# Perform 10-fold Cross Validation (KFold)
scores = cross_val_score(model, X, y, cv=10)
print("Cross Validated Scores", scores)
kf = KFold(n_splits=10, random_state=None, shuffle=True)
for train_index, test_index in kf.split(X):
    print("TRAIN", train_index, "TEST", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

# Make Cross Validated predictions
predictions2 = cross_val_predict(model, X, y, cv=10)

# Check R^2: the proportion of variance in the dependent
# variable explained by the predictors
accuracy = metrics.r2_score(y, predictions2)
print("This is R2", accuracy)

# Plot actual vs. cross-validated predicted NPP2001
plt.scatter(y, predictions2, color='c', marker='.')
plt.xlabel("Actual NPP2001", size=10)
plt.ylabel("NPP2001_Predict", size=10)
plt.title("Actual and Predicted NPP2001 Values using 10 Fold Cross Validation", size=10)
plt.show()

The steps used so far are:

(i) Load the data.
(ii) Convert categorical variables to dummies and join them to the numerical variables.
(iii) Split the sample (2,377 points) into training and test sets.
(iv) Use the training data to fit a regression model.
(v) Make predictions from the test features (X_test).
(vi) Compute the accuracy (score) of the prediction.

A single train/test split is not enough to guarantee the randomness of the samples. If the samples fail to be random, this might result in overfitting. Overfitting means the model fits the training data too closely: it can happen when the model uses too many predictors, and while such a model works very well on the training set, it fails on new, untrained data. This means we cannot make inferences from our model.
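
One quick diagnostic, reusing the objects from the listing above, is to compare the model's score on the training set with its score on the test set; a large gap between the two is a symptom of overfitting.

# A training score far above the test score suggests overfitting
print("Train R2:", lm.score(X_train, y_train))
print("Test R2:", lm.score(X_test, y_test))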


A cross-validation method called K-Fold Cross Validation is used to split the sample into k different subsets (or folds). We train the model on k-1 of the folds and hold out the remaining fold as test data, repeating this so that each fold serves once as the test set; the scores are then averaged across the folds to finalize the model. Cross-validated predictions are made by supplying the cross_val_predict function with the model, X (the full set of independent variables, not just the test portion), y (the dependent variable), and cv (the number of cross-validation folds). Each of the 2,377 points is predicted by the fold model in which it was held out, so every observation appears once in the resulting plot.
# Perform 10-fold Cross Validation (KFold)
scores = cross_val_score(model, X, y, cv=10)
print("Cross Validated Scores", scores)

Cross Validated Scores [ 0.34638801  0.56139146  0.61525375  0.7076254   0.70162425
  0.49563864  0.61883974  0.52543957  0.33933734  0.10156286]
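
The listing prints the ten fold scores but never averages them; the single cross-validated estimate described above can be obtained with scores.mean(), which for the scores shown here is about 0.50.

# Average the ten fold scores into one cross-validated R^2 estimate
print("Mean cross-validated R2:", scores.mean())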

# Make Cross Validated predictions
predictions2 = cross_val_predict(model, X, y, cv=10)
Finally, R2, the proportion of variance explained by the predictors, is given by:

accuracy = metrics.r2_score(y, predictions2)

The result indicates that the predictors account for 70.2% of the variance in the Net Primary Productivity for the year 2001.

The linear equation:

NPP2001_Predict = 1.274E-04*(b1_PG2001) + 2.314E-03*(SPTFPR2001) - 1.147E-01*(b1_Tmn)
                + 8.877E-01*(X2001WSI) + 1.326E-01*(b1_Vap) - 3.43E-05*(Elevation)
                - 1.0E-01*(Forest) - 1.27*(Closed_Shrublands) - 9.79*(Open_Shrublands)
                - 1.019E-01*(Woody_Savannas) - 9.549E-02*(Savannas) - 1.0422*(Grasslands)
                - 1.22E-02*(Croplands)
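
This equation can be reconstructed from the fitted model, for example with the following sketch (assuming the feature order used to build X, i.e., the columns of data1 followed by those of dummies):

# Pair each fitted coefficient with its feature name and print the equation
feature_names = list(data1.columns) + list(dummies.columns)
terms = ["{:+.3E}*({})".format(c, name) for c, name in zip(model.coef_, feature_names)]
print("NPP2001_Predict =", " ".join(terms), "{:+.3E}".format(model.intercept_))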


References

https://www.medium.com/towards-data-science/train-test-split-and-cross-validation-in-python-80b61beca4b6, retrieved on June 28, 2017.

Liang, S., Li, X., Wang, J., 2012. Advanced Remote Sensing: Terrestrial Information Extraction and Applications. Academic Press, 800 pp.

Zuur, A. F., Ieno, E. N., Smith, G. M., 2007. Analysing Ecological Data. Statistics for Biology and Health. Springer Science + Business Media, LLC.
