Linear Regression in Python
https://www.listendata.com/2018/01/linear-regression-in-python.html (retrieved 6/2/2019)
Examples:

- Estimating the price (Y) of a house on the basis of its area (X1), number of bedrooms (X2), proximity to market (X3), etc.
- Estimating the mileage of a car (Y) on the basis of its displacement (X1), horsepower (X2), number of cylinders (X3), and whether it is automatic or manual (X4).
- Predicting the treatment cost of a patient on the basis of factors like age, weight, and past medical history, or even information from blood reports if they are available.
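As a minimal sketch of the first example, with made-up numbers, a one-variable regression of house price on area might look like:

```python
import numpy as np

# hypothetical data: area in square feet, price in thousands of dollars
area = np.array([1000, 1500, 2000, 2500, 3000], dtype=float)
price = np.array([200, 280, 370, 450, 520], dtype=float)

# ordinary least squares fit of price = b0 + b1 * area
b1, b0 = np.polyfit(area, price, 1)

# predicted price of an 1800 sq ft house
pred = b0 + b1 * 1800
```

The multiple-regression case below works the same way, just with more X columns.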
R-Square
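In symbols, R-Square compares the residual sum of squares to the total sum of squares (the same RSS and TSS computed by hand later in this article):

```latex
R^2 = 1 - \frac{RSS}{TSS}
    = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}
```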
Adjusted R-Square
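Adjusted R-Square penalizes R-Square for the number of predictors p relative to the sample size n, matching the calculation used later in the article:

```latex
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
```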
Why is multicollinearity a problem?
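A quick numerical sketch of the problem, with simulated data: when two predictors are nearly identical, the design matrix becomes ill-conditioned, so the individual coefficient estimates are unreliable even though their sum is stable.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)  # nearly a copy of x1
y = 3 * x1 + rng.normal(scale=0.1, size=100)

X = np.column_stack([x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# the condition number of X is huge, so coef[0] and coef[1] are
# individually unstable, while coef[0] + coef[1] stays close to 3
cond = np.linalg.cond(X)
```

This is exactly what a large condition number in a regression summary (as seen later with statsmodels) is warning about.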
Detecting heteroscedasticity

1. Graphical method: First run the regression analysis, then plot the error terms against the predicted values (Yi^). If the scatter plot shows a definite pattern (linear, quadratic, or funnel shaped), heteroscedasticity is present.

2. Goldfeld-Quandt (GQ) test: It assumes that the heteroscedastic variance σi² is positively related to one of the explanatory variables, and that the errors are normally distributed. Thus, if heteroscedasticity is present, the variance will be high for large values of X.
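The GQ idea can be sketched by hand on simulated data: order the observations by the suspect X, fit separate regressions to the low and high ends, and compare the residual sums of squares. (This is only a sketch; the statsmodels version used later also drops a middle band and reports a proper p-value.)

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(1, 10, size=90))
y = 2 * x + rng.normal(scale=0.3 * x)   # error variance grows with x

def rss(xs, ys):
    """Residual sum of squares from a simple OLS line fit."""
    b, a = np.polyfit(xs, ys, 1)
    return np.sum((ys - (a + b * xs)) ** 2)

# compare the high-x tail to the low-x tail
F = rss(x[-30:], y[-30:]) / rss(x[:30], y[:30])
# F well above 1 points to variance increasing with x
```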
Dataset used:

We have 1030 observations on 9 variables. We try to estimate the concrete compressive strength (CCS) using:
1. Cement - kg in a m3 mixture
2. Blast Furnace Slag - kg in a m3 mixture
3. Fly Ash - kg in a m3 mixture
4. Water - kg in a m3 mixture
5. Superplasticizer - kg in a m3 mixture
6. Coarse Aggregate - kg in a m3 mixture
7. Fine Aggregate - kg in a m3 mixture
8. Age - Day (1-365)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("Concrete_Data.csv")
x = data.iloc[:, 0:8]
y = data.iloc[:, 8:]
from sklearn.model_selection import train_test_split  # sklearn.cross_validation is deprecated
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=100)

from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(x_train, y_train)

lm.coef_
coefficients = pd.concat([pd.DataFrame(x_train.columns),
                          pd.DataFrame(np.transpose(lm.coef_))],
                         axis=1)

0   Cement             0.124154
1   Blast              0.103668
2   Fly Ash            0.093371
3   Water             -0.134294
4   Superplasticizer   0.288043
5   CA                 0.020658
6   FA                 0.025630
7   Age                0.114617
lm.intercept_
array([-34.273527])
y_pred = lm.predict(x_test)
lm.score(x_test, y_test)

0.62252008774048395
import statsmodels.api as sm

X_train = sm.add_constant(x_train)  # X_train / X_test carry an intercept column
X_test = sm.add_constant(x_test)
lm2 = sm.OLS(y_train, X_train).fit()
lm2.summary()

"""
                 OLS Regression Results
=================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.07e+05. This might indicate that there are strong multicollinearity or other numerical problems.
"""
y_pred2 = lm2.predict(X_test)
import numpy as np

y_test = pd.to_numeric(y_test.CMS, errors='coerce')
RSS = np.sum((y_pred2 - y_test)**2)
y_mean = np.mean(y_test)
TSS = np.sum((y_test - y_mean)**2)
R2 = 1 - RSS/TSS
R2

n = X_test.shape[0]
p = X_test.shape[1] - 1   # the intercept column does not count as a predictor
adj_R2 = 1 - (1 - R2)*(n - 1)/(n - p - 1)

R-Squared : 0.6225
Adjusted R-Squared : 0.60719
Detecting Outliers:

First we obtain the studentized residuals using get_influence(). The studentized residuals are saved in resid_student.

influence = lm2.get_influence()
resid_student = influence.resid_studentized_external
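For reference, the externally studentized residual for observation i, with leverage h_ii and the error variance re-estimated with observation i left out, is:

```latex
t_i = \frac{e_i}{\hat{\sigma}_{(i)}\sqrt{1 - h_{ii}}}
```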
Studentized Residuals
0    1.559672
1   -0.917354
2    1.057443
3    0.637504
4   -1.170290

resid = pd.concat([x_train,
                   pd.Series(resid_student, name="Studentized Residuals",
                             index=x_train.index)],  # align with x_train's index
                  axis=1)
resid.head()
resid.loc[np.absolute(resid["Studentized Residuals"]) > 3, :]

Studentized Residuals
649    3.161183

ind = resid.loc[np.absolute(resid["Studentized Residuals"]) > 3, :].index
ind

Int64Index([649], dtype='int64')
Dropping Outlier

Using the drop() function we remove the outlier from our training sets.

y_train.drop(ind, axis=0, inplace=True)
x_train.drop(ind, axis=0, inplace=True)  # intercept column is not there
X_train.drop(ind, axis=0, inplace=True)  # intercept column is there
from statsmodels.stats.outliers_influence import variance_inflation_factor

[variance_inflation_factor(x_train.values, j) for j in range(x_train.shape[1])]

[15.477582601956859,
 3.2696650121931814,
 4.1293255012993439,
 82.210084751631086,
 5.21853674386234,
 85.866945489015535,
 71.816336942930675,
 1.6861600968467656]
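Each VIF above comes from regressing one predictor on all the others; with R_j² denoting the R-Square of that auxiliary regression:

```latex
VIF_j = \frac{1}{1 - R_j^2}
```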
def calculate_vif(x):
    thresh = 5.0
    k = x.shape[1]
    vif = [variance_inflation_factor(x.values, j) for j in range(x.shape[1])]
    for i in range(1, k):
        print("Iteration no.", i)
        print(vif)
        a = np.argmax(vif)
        print("Max VIF is for variable no.:", a)
        # stop once the largest VIF is acceptable; otherwise drop
        # that column and recompute the VIFs
        if vif[a] <= thresh:
            break
        x = x.drop(x.columns[a], axis=1)
        vif = [variance_inflation_factor(x.values, j) for j in range(x.shape[1])]
    return x

train_out = calculate_vif(x_train)
train_out.head()
x_test.head()
x_test.drop(["Water", "CA", "FA"], axis=1, inplace=True)
x_test.head()
lm2 = sm.OLS(y_train, train_out).fit()
lm2.summary()

"""
                 OLS Regression Results
=================================================
"""
(0.9983407258987427, 0.6269884705543518)
import statsmodels.stats.api as sms
from statsmodels.compat import lzip

name = ['F statistic', 'p-value']
test = sms.het_goldfeldquandt(lm2.resid, lm2.model.exog)
lzip(name, test)