Multiple Linear Regression using Python Machine Learning
Kaleab Woldemariam, June 2017

Objective: The objective of this exercise is to predict Net Primary Productivity (NPP, a major ecosystem health indicator) from climate and land use data for the Upper Blue Nile Basin, Ethiopia, East Africa. NPP is derived from Gross Primary Productivity (GPP), an ecosystem-level parameter that refers to the rate at which green plants produce organic matter by assimilating carbon dioxide using solar energy through photosynthesis (Liang et al., 2012). Net Primary Productivity is the difference between GPP and plant autotrophic respiration. Approximately 50% of the organic matter generated by gross primary production is released into the atmosphere through plant respiration; the other half, which constitutes NPP, is the biomass produced in a given time (Liang et al., 2012). The following variables were used:
- NPP (dependent variable): dataset for the years 2001 to 2010, downloaded from NASA's Reverb/ECHO website. Data from 2001 was used for the regression analysis.
- Precipitation: GPCC (Global Precipitation Climatology Centre) raster image.
- Land use/land cover classification: images for 2001 and 2010, acquired from MODIS Land Cover (MCD12Q1) via Reverb/ECHO.
- Fraction of Photosynthetically Active Radiation (fAPAR): SPOT satellite, AfSIS raster image (ftp://africagrids.org).
- Digital Elevation Model (DEM): ftp://srtm.csi.cgiar.org.
- Minimum Temperature, Vapor Pressure, and WSI (Water Stress Index, derived from Potential Evapotranspiration and Actual Evapotranspiration) from CRU TS 3.22 time-series data (Climatic Research Unit, University of East Anglia).

In this exercise, a total of 2,377 random sample points were collected from the raster data using ArcGIS 10.3. I used the Pandas module for loading the comma-delimited (CSV) file, the NumPy module to convert the data into arrays, scikit-learn for computing the multiple linear regression, and the Matplotlib module for plotting the results.

Certain assumptions about the dataset must be met before conducting multiple linear regression. In ecological studies, both statistical and spatial contexts must be considered in modeling. To keep things simple, only the statistical assumptions are addressed here. Multiple linear regression assumes:

(i) Normality
(ii) Homogeneity of Variance
(iii) Fixed X (X represents explanatory variables)
(iv) Independence
(v) Correct model specification (Zuur et al., 2007).
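
As a minimal sketch (not part of the original workflow) of how the first two assumptions might be checked, using SciPy and the residuals y_test - predictions from the model fitted later in this exercise:

from scipy import stats
import matplotlib.pyplot as plt

# Shapiro-Wilk test: a small p-value casts doubt on the normality of the residuals
residuals = y_test - predictions
stat, p = stats.shapiro(residuals)
print("Shapiro-Wilk statistic:", stat, "p-value:", p)

# Residuals vs. predicted values: a funnel shape suggests non-constant variance
plt.scatter(predictions, residuals, marker='.')
plt.xlabel("Predicted NPP2001")
plt.ylabel("Residual")
plt.show()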

Note that the land use-land cover (LULC) data were categorical and needed to be converted to dummies (0/1 values). I used a Pandas function, pd.get_dummies, to transform the nominal LULC data so that it could be included in predicting NPP.
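
As a minimal illustration of what pd.get_dummies does, using made-up LULC labels rather than the actual dataset:

import pandas as pd

# Hypothetical nominal LULC column, for illustration only
lulc = pd.DataFrame({'LULC': ['Forest', 'Croplands', 'Savannas', 'Forest']})
print(pd.get_dummies(lulc['LULC']))
# Each category becomes its own 0/1 column:
#    Croplands  Forest  Savannas
# 0          0       1         0
# 1          1       0         0
# 2          0       0         1
# 3          0       1         0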

To segregate the numerical and categorical data, I used a separate Pandas DataFrame, data1, for the Precipitation, fAPAR, Minimum Temperature, Vapor Pressure, WSI, and Elevation features (the numerical independent variables), converted the categorical LULC columns to dummies, and eventually joined the two datasets as a NumPy array X. The dependent variable, NPP2001, was also converted to an array y using NumPy.

Train/Test

The model is trained to predict the known outputs and is later tested on held-out test data so that it can be generalized to other, non-trained data. The test data is used to assess the prediction ability (accuracy) of the model. The training data (X_train, y_train) is used to fit the regression model, i.e., to build a linear model. This model is then used to predict NPP2001 from the independent variables.


'''Regression for predicting NPP using features (independent variables)
in Machine Learning'''
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, KFold, cross_val_score, cross_val_predict
import matplotlib.pyplot as plt
from matplotlib import style

style.use('ggplot')

raw_data = 'mydata_2001_2377_BlueNile.csv'
df = pd.read_csv(raw_data)

# Create a DataFrame for the numerical features
data1 = pd.DataFrame(df, columns=['b1_PG2001', 'SPTFPR2001', 'b1_Tmn',
                                  'X2001WSI', 'b1_Vap', 'Elevation'])
print(data1.shape)

# Create a DataFrame for the categorical features
cols_to_transform = pd.DataFrame(df, columns=['Forest', 'Closed_Shrublands',
                                              'Open_Shrublands', 'Woody_Savannas',
                                              'Savannas', 'Grasslands', 'Croplands'])
dummies = pd.get_dummies(cols_to_transform)

# Join data1 and dummies and yield a NumPy array
X = np.array(data1.join(dummies))

# Specify the dependent variable as an array
y = np.array(df['NPP2001'])

lm = LinearRegression(n_jobs=-1)

'''To check the accuracy/confidence of the prediction, 25% of the
data is held out for testing, while 75% is used for training.'''
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

# First, fit a model on the training data
model = lm.fit(X_train, y_train)

# Print the coefficients
print("The linear coefficients", model.coef_)

# Predict y (NPP_Predict) for the test features (X_test)
predictions = lm.predict(X_test)

# Accuracy (R^2) of the prediction on the test set
confidence = lm.score(X_test, y_test)
print("This is predicted NPP2001 Values", predictions)
print("This is the prediction accuracy", confidence)

# Plot actual vs. predicted NPP2001 for the test set
plt.scatter(y_test, predictions, color='c', marker='.')
plt.title("Actual NPP2001 vs. NPP2001_Predict", size=10)
plt.xlabel("Actual NPP2001", size=10)
plt.ylabel("NPP2001_Predict", size=10)
plt.show()

# Plot residuals vs. actual values to inspect homogeneity of variance
plt.scatter(y_test, y_test - predictions)
plt.title("Homogeneity of Variance")
plt.xlabel("Actual NPP2001")
plt.ylabel("Residual")
plt.show()

# Perform 10-fold Cross Validation (KFold)
scores = cross_val_score(model, X, y, cv=10)
print("Cross Validated Scores", scores)
kf = KFold(n_splits=10, random_state=None, shuffle=True)
for train_index, test_index in kf.split(X):
    print("TRAIN", train_index, "TEST", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

# Make Cross Validated predictions
predictions2 = cross_val_predict(model, X, y, cv=10)

# Check R^2: the proportion of variance in the dependent
# variable explained by the predictors
accuracy = metrics.r2_score(y, predictions2)
print("This is R2", accuracy)

# Plot actual vs. cross-validated predicted NPP2001
plt.scatter(y, predictions2, color='c', marker='.')
plt.xlabel("Actual NPP2001", size=10)
plt.ylabel("NPP2001_Predict", size=10)
plt.title("Actual and Predicted NPP2001 Values using 10 Fold Cross Validation", size=10)
plt.show()

The steps used so far are:

(i) Load the data.
(ii) Convert categorical variables to dummies and join them to the numerical variables.
(iii) Split the sample (2,377 points) into training and test sets.
(iv) Use the training data to fit a regression model.
(v) Make predictions from the test features (X_test).
(vi) Compute the accuracy (score) of the prediction.

A single train/test split is not enough to guarantee the randomness of the samples. If the samples fail to be random, this might result in overfitting. Overfitting means the model fits the training data too closely: it can happen when the model uses too many predictors, and while such a model works very well on the training set, it fails on new, untrained data. This means we cannot make inferences from our model.
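
One quick diagnostic, reusing the objects from the listing above, is to compare the model's score on the training set with its score on the test set; a large gap between the two is a symptom of overfitting.

# A training score far above the test score suggests overfitting
print("Train R2:", lm.score(X_train, y_train))
print("Test R2:", lm.score(X_test, y_test))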


A cross-validation method called K-Fold Cross Validation is used to split the sample into k different subsets (or folds). We train the model on k-1 of the folds and hold out the remaining fold as test data, repeating this so that each fold serves once as the test set; the scores are then averaged across the folds to finalize the model. Cross-validated predictions are made by supplying the cross_val_predict function with the model, X (the full set of independent variables, not just the test portion), y (the dependent variable), and cv (the number of cross-validation folds). Each of the 2,377 points is predicted by the fold model in which it was held out, so every observation appears once in the resulting plot.
# Perform 10-fold Cross Validation (KFold)
scores = cross_val_score(model, X, y, cv=10)
print("Cross Validated Scores", scores)

Cross Validated Scores [ 0.34638801  0.56139146  0.61525375  0.7076254   0.70162425
  0.49563864  0.61883974  0.52543957  0.33933734  0.10156286]
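
The listing prints the ten fold scores but never averages them; the single cross-validated estimate described above can be obtained with scores.mean(), which for the scores shown here is about 0.50.

# Average the ten fold scores into one cross-validated R^2 estimate
print("Mean cross-validated R2:", scores.mean())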

# Make Cross Validated predictions
predictions2 = cross_val_predict(model, X, y, cv=10)
Finally, R2, the proportion of variance explained by the predictors, is given by:

accuracy = metrics.r2_score(y, predictions2)

The result indicates that the predictors account for 70.2% of the variance in the Net Primary Productivity for the year 2001.

The linear equation:

NPP2001_Predict = 1.274E-04*(b1_PG2001) + 2.314E-03*(SPTFPR2001) - 1.147E-01*(b1_Tmn)
                + 8.877E-01*(X2001WSI) + 1.326E-01*(b1_Vap) - 3.43E-05*(Elevation)
                - 1.0E-01*(Forest) - 1.27*(Closed_Shrublands) - 9.79*(Open_Shrublands)
                - 1.019E-01*(Woody_Savannas) - 9.549E-02*(Savannas) - 1.0422*(Grasslands)
                - 1.22E-02*(Croplands)
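
This equation can be reconstructed from the fitted model, for example with the following sketch (assuming the feature order used to build X, i.e., the columns of data1 followed by those of dummies):

# Pair each fitted coefficient with its feature name and print the equation
feature_names = list(data1.columns) + list(dummies.columns)
terms = ["{:+.3E}*({})".format(c, name) for c, name in zip(model.coef_, feature_names)]
print("NPP2001_Predict =", " ".join(terms), "{:+.3E}".format(model.intercept_))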


References

https://www.medium.com/towards-data-science/train-test-split-and-cross-validation-in-python-80b61beca4b6, retrieved on June 28, 2017.

Liang, S., Li, X., Wang, J., 2012. Advanced Remote Sensing: Terrestrial Information Extraction and Applications. Academic Press, 800 pp.

Zuur, A. F., Ieno, E. N., Smith, G. M., 2007. Analysing Ecological Data. Statistics for Biology and Health. Springer Science + Business Media, LLC.
