
LINEAR REGRESSION IN PYTHON


Ekta Aggarwal

Linear regression is a supervised statistical technique in which we estimate a dependent variable from a given set of independent variables. We assume the relationship to be linear, and the dependent variable must be continuous in nature.

[Figure: Python - Linear Regression. Scatter plot of mileage against horsepower with the fitted regression line in red.]

In the diagram we can see that as horsepower increases, mileage decreases, so it is reasonable to fit a linear regression. The red line is the fitted regression line and the points denote the actual observations.

The vertical distances between the points and the fitted line (the line of best fit) are called errors. The main idea is to fit this regression line by minimizing the sum of squares of these errors. This is also known as the principle of least squares.

Examples:

Estimating the price of a house (Y) on the basis of its area (X1), number of bedrooms (X2), proximity to the market (X3), etc.

Estimating the mileage of a car (Y) on the basis of its displacement (X1), horsepower (X2), number of cylinders (X3), whether it is automatic or manual (X4), etc.

Estimating the treatment cost of a patient (Y) on the basis of factors like age, weight, past medical history and, if blood reports are available, the information from the blood report.

Simple Linear Regression Model: Here we try to predict the value of the dependent variable (Y) with only one regressor or independent variable (X).

Multiple Linear Regression Model: Here we try to predict the value of the dependent variable (Y) with more than one regressor or independent variable.

The linear regression model (multiple regression equation):

Y = β0 + β1*X1 + β2*X2 + ... + βk*Xk + ε

Here Y is the dependent variable to be estimated, X1, ..., Xk are the independent variables and ε is the error term.

Assumptions of linear regression:

There must be a linear relationship between the dependent and independent variables.
Sample observations are independent.
Error terms are normally distributed with mean 0.
No multicollinearity - when the independent variables in the model are highly linearly related, the situation is called multicollinearity.
Error terms are identically and independently distributed. (Independence means absence of autocorrelation.)
Error terms have constant variance, i.e. there is no heteroscedasticity.
No outliers are present in the data.

Important Model Performance Metrics

Coefficient of Determination (R square)

It indicates the proportion of variation in Y which can be explained by the independent variables. Mathematically, it is the ratio of the explained variation to the total variation:

R^2 = 1 - RSS/TSS

where RSS is the residual sum of squares and TSS is the total sum of squares. If the fit is perfect then RSS = 0 and R^2 = 1, while R^2 = 0 indicates a poor fit. Thus R^2 lies between 0 and 1.

If the value of R^2 is 0.912, this suggests that 91.2% of the variation in Y can be explained by the explanatory variables in the model. In other words, R^2 is the proportion of variation in the dependent variable that is explained by the independent variables.

R square alone is not such a good measure:

On addition of a new variable the residual sum of squares can only decrease (or stay the same), so R square always increases whenever a new variable is added to the model. This increase may say nothing about the importance of the variable.

For example, in a model determining the price of a house, suppose we had the variables GDP, inflation rate and area. If we add a new, irrelevant variable such as the number of plane crashes, R square will still increase.

Adjusted R square:

Adjusted R square is given by:

Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)

where n is the number of observations and k is the number of regressors or predictors. Hence adjusted R square will always be less than or equal to R square.

When a variable is added, R square in the numerator and k in the denominator both increase. If the variable is actually useful, R square increases by a large amount, which compensates for the increase in k. On the other hand, if the variable is irrelevant, R square will not increase much and adjusted R square will eventually decrease.

Thus, as a general rule of thumb: if adjusted R square increases when a new variable is added to the model, the variable should remain in the model. If adjusted R square decreases when the new variable is added, the variable should not remain in the model.
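As a quick illustration of this rule, here is a minimal sketch using simulated data (the variable names and numbers below are purely illustrative, not from the concrete dataset used later in this article). It adds an irrelevant random predictor to a simple model and compares R square with adjusted R square.

import numpy as np
import statsmodels.api as sm

rng = np.random.RandomState(0)
x1 = rng.normal(size=200)                  # a genuinely useful predictor
y = 3 + 2 * x1 + rng.normal(size=200)      # the true model depends only on x1
noise = rng.normal(size=200)               # an irrelevant predictor

X_small = sm.add_constant(np.column_stack([x1]))
X_big = sm.add_constant(np.column_stack([x1, noise]))

fit_small = sm.OLS(y, X_small).fit()
fit_big = sm.OLS(y, X_big).fit()

# R-squared never decreases when a variable is added, while adjusted
# R-squared penalizes the extra parameter and will typically drop (or
# rise only marginally) when the added variable is irrelevant.
print(fit_small.rsquared, fit_small.rsquared_adj)
print(fit_big.rsquared, fit_big.rsquared_adj)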

Why should error terms be normally distributed?

For parameter estimation (i.e. estimating the βi's) we do not need this assumption. But if the errors are not normally distributed, some of the hypothesis tests we carry out as part of the diagnostics may not be valid.

For example, to check whether a beta (a regression coefficient) is significant or not, we use a t-test. If the errors are not normally distributed, the statistic we derive may not follow a t-distribution, so the diagnostic or hypothesis test is not valid. Similarly, the F-test for linear regression, which checks whether any of the independent variables in a multiple linear regression model are significant, will not be valid.
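Once a model has been fitted with statsmodels (as done later in this article, where the fitted results object is called lm2), these t- and F-test results are readily available on the results object. A minimal sketch, assuming lm2 is a fitted OLS results object:

# t-statistics and p-values for each coefficient (individual significance)
print(lm2.tvalues)
print(lm2.pvalues)

# F-statistic and its p-value (overall significance of the regression)
print(lm2.fvalue)
print(lm2.f_pvalue)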

Why is the expectation of the error always zero?

The error term is the deviation between the observed points and the fitted line. The observed points lie both above and below the fitted line, so if we take the average of all the deviations it should be 0 or near 0. The zero conditional mean assumption says that the negative and positive errors cancel out on average. This helps us estimate the dependent variable precisely.
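For an OLS model fitted with an intercept, the residuals do in fact average out to (numerically) zero on the training data. A minimal check, assuming a fitted statsmodels results object called lm2 as created later in this article:

import numpy as np

# residuals of the fitted model; their mean is ~0 when an intercept is included
print(np.mean(lm2.resid))
print(np.isclose(np.mean(lm2.resid), 0.0, atol=1e-8))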

Why is multicollinearity a problem?

If the Xi's are highly correlated then |X'X| will be close to 0, and hence the inverse of (X'X) will not exist or will have extremely large entries. Mathematically, the variance-covariance matrix of the estimated coefficients is σ²(X'X)⁻¹, which becomes indefinitely large in the presence of multicollinearity. In short, multicollinearity inflates the estimated standard errors of the regression coefficients, which can make variables statistically insignificant when they should be significant.

How can you detect multicollinearity?

1. Bunch Map Analysis: By plotting scatter plots between the various Xi's we can get a visual picture of how the variables are related.

2. Correlation Method: By calculating the correlation coefficients between the variables we can gauge the extent of multicollinearity in the data (see the sketch after this list).

3. VIF (Variance Inflation Factor) Method: First we fit a model with all the variables and then calculate the variance inflation factor (VIF) for each variable. VIF measures how much the variance of an estimated regression coefficient increases when the predictors are correlated. The higher the VIF for the ith regressor, the more highly correlated it is with the other variables.
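For the correlation method in point 2, a minimal sketch, assuming the predictors are stored in a pandas DataFrame called x_train as they are later in this article:

# pairwise correlation coefficients between the predictors;
# values close to +1 or -1 indicate potential multicollinearity
corr_matrix = x_train.corr()
print(corr_matrix.round(2))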

So what is the Variance Inflation Factor?

The variance inflation factor (VIF) for an explanatory variable is given by 1 / (1 - R^2), where this R^2 is obtained by taking that particular X as the response variable and all the other explanatory variables as predictors. In other words, we run a regression of one explanatory variable on the remaining explanatory variables.
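That definition translates almost directly into code. A minimal sketch of a manual VIF computation, assuming a pandas DataFrame of predictors called x_train as used later in this article:

import statsmodels.api as sm

def manual_vif(df, col):
    # regress the chosen predictor on all the remaining predictors
    y_aux = df[col]
    X_aux = sm.add_constant(df.drop(columns=[col]))
    r2_aux = sm.OLS(y_aux, X_aux).fit().rsquared
    # VIF = 1 / (1 - R^2) from the auxiliary regression
    return 1.0 / (1.0 - r2_aux)

print({c: round(manual_vif(x_train, c), 2) for c in x_train.columns})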

Detecting heteroscedasticity

1. Graphical Method: First run the regression analysis and then plot the error terms against the predicted values (Ŷi). If a definite pattern (linear, quadratic or funnel shaped) appears in the scatter plot, heteroscedasticity is present (see the sketch after this list).

2. Goldfeld-Quandt (GQ) Test: It assumes that the heteroscedastic variance σi² is positively related to one of the explanatory variables, and the errors are assumed to be normal. Thus, if heteroscedasticity is present, the variance will be high for large values of X.
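For the graphical method in point 1, a minimal sketch, assuming a fitted statsmodels results object called lm2 as created later in this article:

import matplotlib.pyplot as plt

# fitted values vs. residuals; a funnel or other clear pattern suggests heteroscedasticity
plt.scatter(lm2.fittedvalues, lm2.resid)
plt.axhline(y=0, color="red")
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.show()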


Steps for the GQ test:

1. Order/rank (ascending) the observations according to the value of Xi, beginning with the lowest X value.
2. Omit c central observations and divide the remaining (n - c) observations into 2 groups of (n - c)/2 observations each.
3. Fit separate OLS regressions to both groups and obtain the residual sums of squares (RSS1 and RSS2) for the two groups.
4. Obtain F = RSS2 / RSS1.

This statistic follows an F distribution with ((n - c)/2 - k) degrees of freedom in both the numerator and the denominator, where k is the number of parameters to be estimated including the intercept. If the errors are homoscedastic then the two residual sums of squares RSS1 and RSS2 turn out to be roughly equal, i.e. F will tend to 1.
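These steps can be mirrored in code. A minimal sketch of a manual Goldfeld-Quandt calculation, assuming a single predictor stored in a pandas Series x and a response Series y (the names and the choice of c are illustrative, not from this article's dataset):

import numpy as np
import statsmodels.api as sm
from scipy import stats

def goldfeld_quandt_manual(x, y, c):
    # 1. order observations by the value of x
    order = np.argsort(x.values)
    x_ord, y_ord = x.values[order], y.values[order]
    n = len(x_ord)
    m = (n - c) // 2                      # 2. drop c central observations
    lo, hi = slice(0, m), slice(n - m, n)
    # 3. fit separate OLS regressions to each group and collect RSS1, RSS2
    rss = []
    for grp in (lo, hi):
        X_grp = sm.add_constant(x_ord[grp])
        rss.append(sm.OLS(y_ord[grp], X_grp).fit().ssr)
    # 4. F = RSS2 / RSS1 with (m - k) d.f. in numerator and denominator (k = 2 here)
    F = rss[1] / rss[0]
    df = m - 2
    p_value = 1 - stats.f.cdf(F, df, df)
    return F, p_value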

Dataset used:

We have 1030 observations on 9 variables. We try to estimate the concrete compressive strength (labelled CMS in the data) using:

1. Cement - kg in a m3 mixture
2. Blast Furnace Slag - kg in a m3 mixture
3. Fly Ash - kg in a m3 mixture
4. Water - kg in a m3 mixture
5. Superplasticizer - kg in a m3 mixture
6. Coarse Aggregate - kg in a m3 mixture
7. Fine Aggregate - kg in a m3 mixture
8. Age - Day (1-365)

Dataset - Download Data

Importing the libraries:

NumPy, pandas and matplotlib.pyplot are imported with the aliases np, pd and plt respectively.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Loading the data


We load our data using pd.read_csv( ):

data = pd.read_csv("Concrete_Data.csv")

Now the data is divided into the independent variables (x) and the dependent variable (y):

x = data.iloc[:,0:8]
y = data.iloc[:,8:]


Splitting the data into training and test sets

Using sklearn we split 80% of our data into the training set and the rest into the test set. Setting random_state will give the same training and test sets every time the code is run. Note that train_test_split lives in sklearn.model_selection (the older sklearn.cross_validation module has been removed in recent scikit-learn versions).

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 100)

Running linear regression using sklearn

Using sklearn, linear regression can be carried out using the LinearRegression( ) class. sklearn automatically adds an intercept term to our model.

from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm = lm.fit(x_train, y_train)   # lm.fit(input, output)

The coefficients are given by:

lm.coef_

array([[ 0.12415357,  0.10366839,  0.093371  , -0.13429401,  0.28804259,
         0.02065756,  0.02563037,  0.11461733]])

To store the coefficients in a data frame along with their respective independent variables:

coefficients = pd.concat([pd.DataFrame(x_train.columns),
                          pd.DataFrame(np.transpose(lm.coef_))], axis = 1)

0            Cement  0.124154
1             Blast  0.103668
2           Fly Ash  0.093371
3             Water -0.134294
4  Superplasticizer  0.288043
5                CA  0.020658
6                FA  0.025630
7               Age  0.114617

The intercept is:

lm.intercept_

array([-34.273527])

To predict the values of y on the test set we use lm.predict( ):

y_pred = lm.predict(x_test)

Errors are the differences between the observed and predicted values.


y_error = y_test - y_pred

R square can be obtained using sklearn.metrics:

from sklearn.metrics import r2_score
r2_score(y_test, y_pred)

0.62252008774048395

Running linear regression using statsmodels

It is to be noted that statsmodels does not add an intercept term automatically, thus we need to add the intercept to our model ourselves.

import statsmodels.api as sma
X_train = sma.add_constant(x_train)  # let's add an intercept (beta_0) to our model
X_test = sma.add_constant(x_test)

Linear regression can be run using OLS from statsmodels:

lm2 = sma.OLS(y_train, X_train).fit()


The summary of our model can be obtained via:

lm2.summary()

"""
OLS Regression
Results
=================================================

Dep. Variable: CMS R-


squared: 0.613
Model: OLS Adj. R-
squared: 0.609
Method: Least Squares F-
statistic: 161.0
Date: Wed, 03 Jan 2018 Prob (F-
statistic): 4.37e-162
Time: 21:29:10 Log-
Likelihood: -3090.4
No. Observations: 824 AIC:
6199.
Df Residuals: 815 BIC:
6241.
Df Model: 8
Covariance Type: nonrobust
=================================================

coef std err t


P>|t| [0.025 0.975]
-------------------------------------------------
-----------------------------------
const -34.2735 29.931 -1.145
0.253 -93.025 24.478
Cement 0.1242 0.010 13.054
0.000 0.105 0.143
Blast 0.1037 0.011 9.229
0.000 0.082 0.126
Fly Ash 0.0934 0.014 6.687
0.000 0.066 0.121
Water -0.1343 0.046 -2.947
0.003 -0.224 -0.045
Superplasticizer 0.2880 0.102 2.810
0.005 0.087 0.489
CA 0.0207 0.011 1.966

https://www.listendata.com/2018/01/linear-regression-in-python.html 14/25
6/2/2019 Linear Regression in Python

0.050 2.79e-05 0.041


FA 0.0256 0.012 2.131
0.033 0.002 0.049
Age 0.1146 0.006 19.064
0.000 0.103 0.126
=================================================

Omnibus: 3.757 Durbin-


Watson: 2.033
Prob(Omnibus): 0.153 Jarque-
Bera (JB): 3.762
Skew: -0.165 Prob(JB):
0.152
Kurtosis: 2.974 Cond. No.
1.07e+05
=================================================

Warnings:
[1] Standard Errors assume that the covariance
matrix of the errors is correctly specified.
[2] The condition number is large, 1.07e+05. This
might indicate that there are
strong multicollinearity or other numerical
problems.
"""

The predicted values for the test set are given by:

y_pred2 = lm2.predict(X_test)

Note that y_pred and y_pred2 are the same; they are just calculated via different packages.
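A quick sanity check of that claim (y_pred from sklearn is an (n, 1) array while y_pred2 from statsmodels comes back as a one-dimensional pandas Series, so both are flattened before the element-wise comparison):

import numpy as np

# True if the two sets of predictions agree within floating point tolerance
print(np.allclose(y_pred.ravel(), np.asarray(y_pred2).ravel()))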

Calculate R-Squared and Adjusted R-Squared Manually on Test Data

We can also calculate R-squared and adjusted R-squared via the formulas, without using any package.


import numpy as np
y_test = pd.to_numeric(y_test.CMS, errors='coerce')
RSS = np.sum((y_pred2 - y_test)**2)
y_mean = np.mean(y_test)
TSS = np.sum((y_test - y_mean)**2)
R2 = 1 - RSS/TSS
R2

n = X_test.shape[0]
p = X_test.shape[1] - 1

adj_rsquared = 1 - (1 - R2) * ((n - 1)/(n - p - 1))
adj_rsquared

R-Squared : 0.6225
Adjusted R-Squared : 0.60719

Detecting Outliers:
Firstly we try to get the studentized residuals
using get_influence( ). The studentized residuals
are saved in resid_student.

influence = lm2.get_influence()
resid_student = influence.resid_studentized_external


Combining the training set and the residuals we have:

resid = pd.concat([x_train, pd.Series(resid_student, name = "Studentized Residuals")], axis = 1)
resid.head()

   Cement  Blast  Fly Ash  Water  Superplasticizer      CA     FA    Age  Studentized Residuals
0   540.0    0.0      0.0  162.0               2.5  1040.0  676.0   28.0               1.559672
1   540.0    0.0      0.0  162.0               2.5  1055.0  676.0   28.0              -0.917354
2   332.5  142.5      0.0  228.0               0.0   932.0  594.0  270.0               1.057443
3   332.5  142.5      0.0  228.0               0.0   932.0  594.0  365.0               0.637504
4   198.6  132.4      0.0  192.0               0.0   978.4  825.5  360.0              -1.170290

If the absolute value of a studentized residual is more than 3, that observation is considered an outlier and hence should be removed. We create a logical filter for absolute studentized residuals greater than 3:

resid.loc[np.absolute(resid["Studentized Residuals"]) > 3, :]

     Cement  Blast  Fly Ash  Water  Superplasticizer     CA     FA  Age  Studentized Residuals
649   166.8  250.2      0.0  203.5               0.0  975.6  692.6  3.0               3.161183

The index of the outlier is given by ind:

ind = resid.loc[np.absolute(resid["Studentized Residuals"]) > 3, :].index
ind

Int64Index([649], dtype='int64')

Dropping the Outlier

Using the drop( ) function we remove the outlier from our training sets.

y_train.drop(ind, axis = 0, inplace = True)
x_train.drop(ind, axis = 0, inplace = True)  # intercept column is not there
X_train.drop(ind, axis = 0, inplace = True)  # intercept column is there

Detecting and Removing Multicollinearity

We use the statsmodels library to calculate VIF:

from statsmodels.stats.outliers_influence import variance_inflation_factor
[variance_inflation_factor(x_train.values, j) for j in range(x_train.shape[1])]

[15.477582601956859,
3.2696650121931814,
4.1293255012993439,
82.210084751631086,
5.21853674386234,
85.866945489015535,
71.816336942930675,
1.6861600968467656]

We create a function to remove the collinear variables. We choose a threshold of 5, which means that if the VIF for a particular variable is more than 5, that variable will be removed.

def calculate_vif(x):
    # iteratively drop the variable with the highest VIF until all VIFs are below the threshold
    thresh = 5.0
    output = pd.DataFrame()
    k = x.shape[1]
    vif = [variance_inflation_factor(x.values, j) for j in range(x.shape[1])]
    for i in range(1, k):
        print("Iteration no.")
        print(i)
        print(vif)
        a = np.argmax(vif)
        print("Max VIF is for variable no.:")
        print(a)
        if vif[a] <= thresh:
            break
        if i == 1:
            output = x.drop(x.columns[a], axis = 1)
            vif = [variance_inflation_factor(output.values, j) for j in range(output.shape[1])]
        elif i > 1:
            output = output.drop(output.columns[a], axis = 1)
            vif = [variance_inflation_factor(output.values, j) for j in range(output.shape[1])]
    return(output)

train_out = calculate_vif(x_train)

Now we view the training set:

train_out.head()

     Cement  Blast  Fly Ash  Superplasticizer  Age
337   275.1    0.0    121.4               9.9   56
384   516.0    0.0      0.0               8.2   28
805   393.0    0.0      0.0               0.0   90
682   183.9  122.6      0.0               0.0   28
329   246.8    0.0    125.1              12.0    3


Removing the variables from the test set:

x_test.head()
x_test.drop(["Water","CA","FA"], axis = 1, inplace = True)
x_test.head()

     Cement  Blast  Fly Ash  Superplasticizer  Age
173   318.8  212.5      0.0              14.3   91
134   362.6  189.0      0.0              11.6   28
822   322.0    0.0      0.0               0.0   28
264   212.0    0.0    124.8               7.8    3
479   446.0   24.0     79.0              11.6    7

Running linear regression again on our new training set (without multicollinearity):

import statsmodels.api as sma
train_out = sma.add_constant(train_out)  # let's add an intercept (beta_0) to our model
X_test = sma.add_constant(x_test)        # the collinear variables were already dropped from x_test above

lm2 = sma.OLS(y_train, train_out).fit()
lm2.summary()

"""
OLS Regression
Results
=================================================

Dep. Variable: CMS R-


squared: 0.570
Model: OLS Adj. R-
squared: 0.567
Method: Least Squares F-
statistic: 216.3
Date: Wed, 10 Jan 2018 Prob (F-
statistic): 6.88e-147
Time: 15:14:59 Log-
Likelihood: -3128.8
No. Observations: 823 AIC:
6270.
Df Residuals: 817 BIC:
6298.
Df Model: 5
Covariance Type: nonrobust
=================================================

coef std err t


P>|t| [0.025 0.975]
-------------------------------------------------
-----------------------------------
const -11.1119 1.915 -5.803
0.000 -14.871 -7.353
Cement 0.1031 0.005 20.941
0.000 0.093 0.113
Blast 0.0721 0.006 12.622
0.000 0.061 0.083
Fly Ash 0.0614 0.009 6.749
0.000 0.044 0.079
Superplasticizer 0.7519 0.077 9.739
0.000 0.600 0.903
Age 0.1021 0.006 16.582
0.000 0.090 0.114
=================================================

https://www.listendata.com/2018/01/linear-regression-in-python.html 22/25
6/2/2019 Linear Regression in Python

Omnibus: 0.870 Durbin-


Watson: 2.090
Prob(Omnibus): 0.647 Jarque-
Bera (JB): 0.945
Skew: 0.039 Prob(JB):
0.623
Kurtosis: 2.853 Cond. No.
1.59e+03
=================================================

Checking normality of residuals

We use the Shapiro-Wilk test from the scipy library to check the normality of the residuals.

1. Null Hypothesis: The residuals are normally distributed.
2. Alternative Hypothesis: The residuals are not normally distributed.

from scipy import stats
stats.shapiro(lm2.resid)

(0.9983407258987427, 0.6269884705543518)

Since the p-value is 0.6269, at the 5% level of significance we fail to reject the null hypothesis and can say that the residuals are normally distributed.

Checking for autocorrelation

To check for the absence of autocorrelation we use the Ljung-Box test.

1. Null Hypothesis: Autocorrelation is absent.
2. Alternative Hypothesis: Autocorrelation is present.

from statsmodels.stats import diagnostic as diag
diag.acorr_ljungbox(lm2.resid, lags = 1)

(array([ 1.97177212]), array([ 0.16025989]))

Since the p-value is 0.1602, we fail to reject the null hypothesis and can say that autocorrelation is absent.
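The Durbin-Watson statistic reported in the model summary above (around 2, which also points to no first-order autocorrelation) can be computed directly as well; a minimal sketch:

from statsmodels.stats.stattools import durbin_watson

# values near 2 indicate little or no first-order autocorrelation in the residuals
print(durbin_watson(lm2.resid))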

Checking heteroscedasticity

Using the Goldfeld-Quandt test we check for heteroscedasticity.

1. Null Hypothesis: Error terms are homoscedastic.
2. Alternative Hypothesis: Error terms are heteroscedastic.

import statsmodels.stats.api as sms
from statsmodels.compat import lzip
name = ['F statistic', 'p-value']
test = sms.het_goldfeldquandt(lm2.resid, lm2.model.exog)
lzip(name, test)


[('F statistic', 0.9903), ('p-value', 0.539)]

The p-value is 0.539, so we fail to reject the null hypothesis and can say that the residuals have constant variance. Thus all the assumptions of our linear regression model are satisfied.
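As a final step, the reduced (post-VIF) model can also be scored on the held-out test set; a minimal sketch, assuming X_test is the constant-added test set built above for the reduced model (the exact R-squared value will depend on the data split):

from sklearn.metrics import r2_score

# predictions of the reduced model on the test set and its R-squared
y_pred_reduced = lm2.predict(X_test)
print(r2_score(y_test, y_pred_reduced))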
