
Analysis of Hydrocarbon Data

Anirban Ray & Soumya Sahu


October 25, 2017

Description of the Dataset


When petrol is pumped into tanks, hydrocarbons escape. To evaluate the effectiveness of pollution controls,
experiments were performed. The following dataset was obtained from the experiments.
DATASET <- read.csv("Dataset.csv")
str(DATASET)

## 'data.frame': 32 obs. of 5 variables:


## $ Tank.temperature : int 33 31 33 37 36 35 59 60 59 60 ...
## $ Petrol.temperature : int 53 36 51 51 54 35 56 60 60 60 ...
## $ Initial.tank.pressure: num 3.32 3.1 3.18 3.39 3.2 3.03 4.78 4.72 4.6 4.53 ...
## $ Petrol.pressure : num 3.42 3.26 3.18 3.08 3.41 3.03 4.57 4.72 4.41 4.53 ...
## $ Hydrocarbons.escaping: int 29 24 26 22 27 21 33 34 32 34 ...
Here, we have 32 observations on the response variable Hydrocarbons escaping (grams) and
4 explanatory variables: Tank temperature (degrees Fahrenheit), Petrol temperature (degrees
Fahrenheit), Initial tank pressure (pounds/square inch) and Petrol pressure (pounds/square
inch). Let us denote these by y, x1, x2, x3 and x4 respectively.
n <- nrow(DATASET)
p <- ncol(DATASET)
x1 <- DATASET $ Tank.temperature
x2 <- DATASET $ Petrol.temperature
x3 <- DATASET $ Initial.tank.pressure
x4 <- DATASET $ Petrol.pressure
y <- DATASET $ Hydrocarbons.escaping

Primary Analysis

Let us first plot the response variable against each of the explanatory variables, which will give us some
insight into the nature of the data. Here, all the graphs show patterns to a greater or lesser extent, so we can
conclude that each of the x variables explains some part of the variation in y.
for(i in 1 : (p - 1))
{
  plot(DATASET[[i]], DATASET[[5]], xlab = colnames(DATASET)[i], ylab = colnames(DATASET)[5], main = paste("Plot of y vs. x", i, sep = ""), sub = paste("Figure", i))
}

[Figure 1: Plot of y vs. x1 (Hydrocarbons.escaping against Tank.temperature)]

[Figure 2: Plot of y vs. x2 (Hydrocarbons.escaping against Petrol.temperature)]

[Figure 3: Plot of y vs. x3 (Hydrocarbons.escaping against Initial.tank.pressure)]

[Figure 4: Plot of y vs. x4 (Hydrocarbons.escaping against Petrol.pressure)]
Let us now fit an Ordinary Least Squares (OLS) model, without checking the validity of the assumptions.
We observe that the fit seems to be very good in terms of the Adjusted R² and the F-statistic.
model.1 <- lm(y ~ 1 + x1 + x2 + x3 + x4) # INITIAL OLS MODEL
X.1 <- model.matrix(model.1)[, -1] # INITIAL REDUCED DESIGN MATRIX
summary(model.1)

##
## Call:
## lm(formula = y ~ 1 + x1 + x2 + x3 + x4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.586 -1.221 -0.118 1.320 5.106
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.01502 1.86131 0.545 0.59001
## x1 -0.02861 0.09060 -0.316 0.75461
## x2 0.21582 0.06772 3.187 0.00362 **
## x3 -4.32005 2.85097 -1.515 0.14132
## x4 8.97489 2.77263 3.237 0.00319 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.73 on 27 degrees of freedom
## Multiple R-squared: 0.9261, Adjusted R-squared: 0.9151

## F-statistic: 84.54 on 4 and 27 DF, p-value: 7.249e-15
Now we prepare the residual plot and also the plot of the residuals against the fitted values. The first plot does not seem
to be uniformly scattered around zero, and the second plot is not at all random. Hence we suspect that
not all of the assumptions hold.
Y.1 <- fitted(model.1) # PREDICTED y BASED ON INITIAL MODEL
e.1 <- residuals(model.1) # RESIDUALS BASED ON INITIAL MODEL
plot(e.1, ylab = "Residuals", main = "Residual Plot", sub = "Figure 5") # RESIDUAL PLOT # not random

[Figure 5: Residual Plot (Residuals against Index)]
plot(Y.1, e.1, xlab = "Fits", ylab = "Residuals", main = "Plot of Residuals vs. Predicted Values", sub = "Figure 6")

[Figure 6: Plot of Residuals vs. Predicted Values]

Checking Model Assumptions


The assumptions of the OLS model y = Xβ + ε are the following:

1. Errors are unbiased, i.e. E(ε_i) = 0 ∀ i,
2. Errors have constant variance, i.e. V(ε_i) = σ² ∀ i,
3. Errors are uncorrelated, i.e. cov(ε_i, ε_j) = 0 ∀ i ≠ j,
4. Errors are normally distributed, i.e. ε ∼ N(0, σ²I_n),
5. Explanatory variables are independent, i.e. X is of full column rank.

Now, we test these assumptions one by one.

Normality of Errors

In this case, we first draw the Quantile-Quantile Plot of the residuals and then perform the Shapiro-Wilk
normality test. The plot closely follows the y = x line, and the null hypothesis of the test, i.e. normality
of the errors, is accepted with a considerably high p-value. So we can conclude that the errors can be assumed
to come from a normal distribution.
qqnorm(e.1, sub = "Figure 7") # QQPLOT OF INITIAL RESIDUALS # close to identity line

[Figure 7: Normal Q-Q Plot of the residuals]
shapiro.test(e.1) # SHAPIRO-WILK TEST OF INITIAL RESIDUALS # normally distributed

##
## Shapiro-Wilk normality test
##
## data: e.1
## W = 0.97847, p-value = 0.7539

Multicollinearity

Next, we plot each of the explanatory variables against one another. It seems that there is correlation between
almost all of them.
# PLOT OF EXPLANATORY AND RESPONSE VARIABLES # seems to be multicollinear data
for(i in 1 : (p - 2))
{
for(j in (i + 1) : (p - 1))
{
plot(DATASET[[i]], DATASET[[j]], main = paste("Figure", 7 + i + (j - 1) - (i == 1)), xlab = colnames(DATASET)[i], ylab = colnames(DATASET)[j])
}
}

[Figure 8: Petrol.temperature against Tank.temperature]

[Figure 9: Initial.tank.pressure against Tank.temperature]

[Figure 10: Petrol.pressure against Tank.temperature]

[Figure 11: Initial.tank.pressure against Petrol.temperature]

[Figure 12: Petrol.pressure against Petrol.temperature]

[Figure 13: Petrol.pressure against Initial.tank.pressure]

We then compute the correlation matrix.


cor(X.1) # CORRELATION MATRIX OF EXPLANATORY VARIABLES # highly linear relation

## x1 x2 x3 x4
## x1 1.0000000 0.7742909 0.9554116 0.9337690
## x2 0.7742909 1.0000000 0.7815286 0.8374639
## x3 0.9554116 0.7815286 1.0000000 0.9850748
## x4 0.9337690 0.8374639 0.9850748 1.0000000
Now, we strongly suspect multicollinearity and hence calculate the VIFs and obtain the following.
car :: vif(model.1) # VARIANCE INFLATION FACTORS # indicates high collinearity

## x1 x2 x3 x4
## 12.997379 4.720998 71.301491 61.932647
The high values suggest collinearity too. For the final verification, we compute the condition number
of X*′X*, where X* is the scaled design matrix. Its large value leads us to conclude that the extent of
multicollinearity in the dataset is significant.
(K.SQ.1 <- kappa(t(scale(X.1)) %*% scale(X.1), exact = TRUE)) # SQUARE OF CONDITION NUMBER # confirms multicollinearity

## [1] 482.6577
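The VIFs reported above can also be derived from first principles: VIF_j = 1/(1 − R_j²), where R_j² is the R² from regressing the j-th explanatory variable on the remaining ones. A minimal sketch on synthetic data (the variables u1, u2, u3 are illustrative, not from this dataset):

```r
# Sketch: VIF computed by hand as 1 / (1 - R_j^2), on synthetic
# predictors where u1 and u2 are nearly collinear by construction.
set.seed(1)
u1 <- rnorm(40)
u2 <- u1 + rnorm(40, sd = 0.1)   # nearly collinear with u1
u3 <- rnorm(40)
vif.by.hand <- function(Z) {
  sapply(seq_len(ncol(Z)), function(j) {
    r2 <- summary(lm(Z[, j] ~ Z[, -j]))$r.squared
    1 / (1 - r2)
  })
}
vif.by.hand(cbind(u1, u2, u3))   # u1 and u2 inflated; u3 near 1
```

car::vif applied to a fitted lm object computes the same quantities.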

Checking for Unusual Observations

Outliers in x-direction

We know that if there are high-leverage or influential points present in the dataset, they may lead
to pseudo-multicollinearity, for example through masking, swamping, etc. So to avoid that situation, we first
try to detect these points and check whether their removal leads to a decrease in the extent of
multicollinearity.

Detection

First, we detect the influential points by the hat diagonals and covariance ratios, and obtain the following
detected points:
h <- hatvalues(model.1) # HAT.DIAGONAL
hat.leverage <- which(h > 2 * p / n) # HIGH LEVERAGE POINTS DETECTED BY HAT DIAGONALS
COVARIANCE.RATIO <- covratio(model.1)
cov.leverage <- which(abs(COVARIANCE.RATIO - 1) > 3 * p / n) # INFLUENTIAL POINTS DETECTED BY COVARIANCE RATIOS
(influential.points <- sort(unique(c(hat.leverage, cov.leverage)))) # ALL DETECTED INFLUENTIAL POINTS

## [1] 2 3 4 15 17 18 20 23
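The 2p/n cutoff used above is the usual rule of thumb: the hat values average p/n, so a point exceeding twice that average is flagged as high-leverage. A toy illustration (the variables u, v here are made up, not from this dataset):

```r
# Sketch: leverage detection via hat values and the 2p/n rule of thumb.
set.seed(2)
u <- c(rnorm(20), 10)            # last observation is extreme in the x-direction
v <- 2 * u + rnorm(21)
toy.fit <- lm(v ~ u)
h.toy <- hatvalues(toy.fit)
which(h.toy > 2 * length(coef(toy.fit)) / length(u))   # flags the extreme-x point
```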

Outliers in y-direction

Detection

Before proceeding to fitting models, we first detect the outliers by DFBETA, DFFIT and Cook’s Distance
criteria. The detected points are the following:
DFBETA <- dfbetas(model.1) # DFBETA
dfbeta.outlier <- list() # OUTLIERS DETECTED BY DFBETAS
for(i in 1 : p){dfbeta.outlier[[i]] <- which(DFBETA[, i] > 2 / sqrt(n))}
DFFIT <- dffits(model.1) # DFFIT
dffit.outlier <- which(DFFIT > 2* sqrt(p / (n - p))) # OUTLIERS DETECTED BY DFFITS
COOK <- cooks.distance(model.1) # COOK'S DISTANCES
cook.outlier <- which(COOK > qf(0.05, p, n-p, lower.tail = F)) # OUTLIERS DETECTED BY COOK'S DISTANCES
(potential.outlier.1 <- sort(unique(c(unlist(dfbeta.outlier), dffit.outlier, cook.outlier)))) # ALL DETECTED POTENTIAL OUTLIERS

## [1] 4 15 18 21 23 24 25 26

Outlier Shift Model

To verify whether these are really outliers, we compare the flagged points against the rest of the points, which
are assumed to be clean. Here we test whether the observations under scrutiny come from a distribution
different from that of the normal observations. If the null hypothesis is rejected, we can conclude that at
least some of the points are outliers. In that case, we test the significance of the individual γ coefficients
and return to the clean dataset those points whose coefficients are not significantly different from zero. Then
we perform the test again and continue in the same way until we obtain a set of points for which all the
coefficients are significant. We shall treat those points as outliers.
k.1 <- length(potential.outlier.1) # NUMBER OF INITIALLY DETECTED OUTLIERS
y.mod.1 <- c(y[-potential.outlier.1], y[potential.outlier.1]) # INITIALLY MODIFIED RESPONSE
X.mod.1 <- cbind(rbind(X.1[-potential.outlier.1, ], X.1[potential.outlier.1, ]), rbind(matrix(0, n - k.1, k.1), diag(k.1))) # INITIAL MODIFIED DESIGN MATRIX

outlier.model.1 <- lm(y.mod.1 ~ 1 + X.mod.1) # INITIAL OUTLIER SHIFT MODEL
F.1 <- ((sum((residuals(model.1)) ^ 2) - sum((residuals(outlier.model.1)) ^ 2)) / k.1) / (sum((residuals(outlier.model.1)) ^ 2) / (n - p - k.1)) # F STATISTIC FOR THE OUTLIER SHIFT MODEL
F.1 > qf(0.05, k.1, n - p - k.1, lower.tail = FALSE)

## [1] TRUE
summary(outlier.model.1)

##
## Call:
## lm(formula = y.mod.1 ~ 1 + X.mod.1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4613 -0.6052 0.0000 0.4497 3.9471
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.75298 1.41066 -0.534 0.599688
## X.mod.1x1 -0.11158 0.08277 -1.348 0.193478
## X.mod.1x2 0.31378 0.07556 4.153 0.000541 ***
## X.mod.1x3 0.10255 3.84141 0.027 0.978981
## X.mod.1x4 4.71023 3.97008 1.186 0.250076
## X.mod.1 -3.97634 2.67335 -1.487 0.153318
## X.mod.1 2.85640 3.03530 0.941 0.358485
## X.mod.1 3.30415 2.94678 1.121 0.276142
## X.mod.1 -2.51355 2.23577 -1.124 0.274913
## X.mod.1 -6.80742 2.46254 -2.764 0.012344 *
## X.mod.1 -3.79404 2.16636 -1.751 0.096014 .
## X.mod.1 4.63336 2.05466 2.255 0.036121 *
## X.mod.1 5.58316 2.09783 2.661 0.015419 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.912 on 19 degrees of freedom
## Multiple R-squared: 0.9745, Adjusted R-squared: 0.9584
## F-statistic: 60.46 on 12 and 19 DF, p-value: 1.606e-12
potential.outlier.2 <- potential.outlier.1[c(5, 7, 8)] # OUTLIERS DETECTED AFTER CROSSCHECK
k.2 <- length(potential.outlier.2) # NUMBER OF DETECTED OUTLIERS AFTER CROSSCHECK
y.mod.2 <- c(y[-potential.outlier.2], y[potential.outlier.2]) # MODIFIED RESPONSE AFTER CROSSCHECK
X.mod.2 <- cbind(rbind(X.1[-potential.outlier.2, ], X.1[potential.outlier.2, ]), rbind(matrix(0, n - k.2, k.2), diag(k.2))) # MODIFIED DESIGN MATRIX AFTER CROSSCHECK
outlier.model.2 <- lm(y.mod.2 ~ 1 + X.mod.2) # OUTLIER SHIFT MODEL AFTER CROSSCHECK
F.2 <- ((sum((residuals(model.1)) ^ 2) - sum((residuals(outlier.model.2)) ^ 2)) / k.2) / (sum((residuals(outlier.model.2)) ^ 2) / (n - p - k.2)) # F STATISTIC AFTER CROSSCHECK
F.2 > qf(0.05, k.2, n - p - k.2, lower.tail = FALSE)

## [1] TRUE
summary(outlier.model.2)

##
## Call:
## lm(formula = y.mod.2 ~ 1 + X.mod.2)
##
## Residuals:
## Min 1Q Median 3Q Max

## -3.5204 -0.8975 0.0000 1.0743 4.3300
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.16239 1.45093 0.112 0.911818
## X.mod.2x1 -0.10068 0.07347 -1.370 0.183236
## X.mod.2x2 0.21759 0.05235 4.157 0.000354 ***
## X.mod.2x3 -0.31137 2.36207 -0.132 0.896226
## X.mod.2x4 5.98182 2.26424 2.642 0.014282 *
## X.mod.2 -7.12255 2.38953 -2.981 0.006496 **
## X.mod.2 5.30706 2.21009 2.401 0.024441 *
## X.mod.2 6.33178 2.23658 2.831 0.009238 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.1 on 24 degrees of freedom
## Multiple R-squared: 0.9611, Adjusted R-squared: 0.9498
## F-statistic: 84.76 on 7 and 24 DF, p-value: 2.289e-15

Checking Influence and Modifying Data

Now, we consider the design matrix without the rows corresponding to the influential points and calculate
the condition number based on that. But in this context, we must mention that if multicollinearity is present
in the dataset, this is not the correct approach.
X.2 <- X.1[-influential.points,]
(K.SQ.2 <- kappa(t(scale(X.2)) %*% scale(X.2), exact = TRUE))

## [1] 1281.337
Even now the condition number is too large, so we can conclude that the problem of multicollinearity
is serious, and hence we will opt for suitable regression methods. But before that, we should decide how to
treat the outliers. Since our dataset is small and we have already verified the presence of severe
multicollinearity, we cannot afford to remove these observations entirely. Instead, we predict
these points by an OLS regression fitted to the rest of the points and continue our analysis with
these predicted observations.
y.mod.3 <- y # MODIFIED RESPONSE AFTER ESTIMATING DETECTED OUTLIERS
y.mod.3[potential.outlier.2] <- as.vector(cbind(1, X.1[potential.outlier.2, ]) %*% coefficients(lm(y[-potential.outlier.2] ~ 1 + X.1[-potential.outlier.2, ]))) # PREDICTIONS FROM OLS ON THE CLEAN POINTS

Fitting Models
To handle multicollinearity, we can proceed either by removing some of the explanatory variables, or by
performing biased regression, where we minimise the mean squared error subject to a penalty term. In
this assignment, we first try to select a model by stepwise regression, and then we apply LASSO
regression.

Stepwise Regression

Here, we start with the null model, i.e. with only an intercept term. Then we add variables one by one
and calculate the AIC. At each step, we see which of the following gives the minimum AIC value:
1. Adding any further variable,
2. Removing a variable which has been added,
3. Keeping the model the same.
If the last option gives the minimum AIC, the algorithm stops there and that is our final model.
Otherwise, we repeat the same process until we reach such a stage. Here, we obtain the following results:
model.2 <- step(lm(y.mod.3 ~ 1 + x1 + x2 + x3 + x4), direction = "both") # STEPWISE REGRESSION

## Start:  AIC=48.27
## y.mod.3 ~ 1 + x1 + x2 + x3 + x4
##
##        Df Sum of Sq    RSS    AIC
## - x3    1     0.089 105.90 46.295
## <none>              105.81 48.268
## - x1    1     9.204 115.01 48.937
## - x4    1    34.690 140.50 55.342
## - x2    1    76.949 182.76 63.757
##
## Step:  AIC=46.3
## y.mod.3 ~ x1 + x2 + x4
##
##        Df Sum of Sq    RSS    AIC
## <none>              105.90 46.295
## + x3    1     0.089 105.81 48.268
## - x1    1    17.254 123.15 49.125
## - x2    1   112.325 218.22 67.433
## - x4    1   186.416 292.31 76.787

LASSO Regression
In this method, we minimise (1/n) Σ_{i=1}^{n} (y_i − x_i′β)² + λ Σ_{j=1}^{p} |β_j|. This is justified because
multicollinearity inflates the variances of the estimated regression coefficients, and by imposing the constraint
Σ_{j=1}^{p} |β_j| ≤ c we force the regression coefficients to take small values, possibly driving some of them
exactly to zero. This is supported by the fact that, in the presence of multicollinearity, not all of the
explanatory variables are actually required. Thus, even though this method no longer yields unbiased
estimators, we still gain because the estimates have comparatively smaller MSEs than the unbiased ones. Here,
we first obtain an optimal λ, which comes out to be 0.0479, and the fitted model is obtained as
Y = 0.194 − 0.084x1 + 0.221x2 + 5.390x4.
lasso.constant <- glmnet :: cv.glmnet(X.1, y.mod.3, alpha = 1, type.measure = "mse")$lambda.min # LASSO PENALTY PARAMETER BY CROSS-VALIDATION
model.3 <- glmnet :: glmnet(X.1, y.mod.3, alpha = 1, lambda = lasso.constant, intercept = T) # LASSO MODEL
lasso.coefficients <- coefficients(model.3) # LASSO COEFFICIENTS

Checking Goodness of the Fitted LASSO Model

Now, we plot the original y observations together with the predictions obtained from this method. Since it seems
that the biased model yields good results, we proceed to the residual analysis of this model. Here, we
check the residuals for normality, autocorrelation and heteroscedasticity.
Y.2 <- predict(model.3, newx = X.1) # LASSO PREDICTIONS
e.2 <- y - Y.2 # RESIDUALS BASED ON MODIFIED MODEL
plot(y, type = "l", col = "blue", ylab = NULL, sub = "Figure 14", main = "Plot of Original Observations and LASSO Predictions")
lines(Y.2, lty = 2, col = "green")
legend(legend = c("Original Observations", "Lasso Predictions"), x = 3, y = 50, lty = c(1, 2), col = c("blue", "green"))

[Figure 14: Plot of Original Observations and LASSO Predictions]
First, we check for normality of the errors; both the QQ plot and the Shapiro-Wilk test conclude in the
affirmative.
qqnorm(e.2, sub = "Figure 15")

[Figure 15: Normal Q-Q Plot of the LASSO residuals]
shapiro.test(e.2)

##
## Shapiro-Wilk normality test
##
## data: e.2
## W = 0.97103, p-value = 0.5284
Now, we check the condition number of the reduced design matrix and note that it is well below 100, so the
model can be regarded as essentially free from the effect of multicollinearity.
X.3 <- cbind(x1, x2, x4)
(K.SQ.2 <- kappa(t(scale(X.3)) %*% scale(X.3), exact = TRUE))

## [1] 46.34992
Next, we prepare the ACF and PACF plots of the residuals, and note that none of the spikes are significant.
acf(e.2, sub = "Figure 16", main = "Autocorrelation Plot of LASSO residuals") # AUTOCORRELATION PLOT OF RESIDUALS

[Figure 16: Autocorrelation Plot of LASSO residuals]
pacf(e.2, sub = "Figure 17", main = "Partial Autocorrelation Plot of LASSO residuals") # PARTIAL AUTOCORRELATION PLOT OF RESIDUALS

[Figure 17: Partial Autocorrelation Plot of LASSO residuals]
Now, we check for equal variances. We take moving windows of the residuals (of some fixed size) and
compute the variance within each window. If the plot of these variances against the indices reveals a pattern,
we can suspect that heteroscedasticity is present in the dataset. We repeat this procedure for different window
sizes. Here, the plots show an increasing pattern in each case, which indicates that the residuals are not
homoscedastic.
d <- seq(10, 20, 5)
rolling.variance <- sapply(d, function(j){zoo :: rollapply(e.2, j, var)}) # ROLLING VARIANCES
for(i in seq_along(d))
{
plot(rolling.variance[[i]], ylab = "Variance", sub = paste("Figure", 18 + i), main = paste("Moving variances with Order", d[i]))
}

[Figure 19: Moving variances with Order 10]

[Figure 20: Moving variances with Order 15]

[Figure 21: Moving variances with Order 20]
To confirm our suspicion, we turn to the Breusch-Pagan test. Here, we test whether the variances
can be modelled by the explanatory variables, using the squared residuals as estimates of the
variances. From this model, we observe that the null hypothesis of homoscedasticity is not rejected at the
5% level of significance, but the p-value is low. So based on the data we cannot confidently conclude that the
errors are homoscedastic, and we would certainly prefer to repeat the test with more observations.
model.4 <- lm((e.2 ^ 2) ~ 1 + x1 + x2 + x4)
summary(model.4)

##
## Call:
## lm(formula = (e.2^2) ~ 1 + x1 + x2 + x4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.997 -6.898 -1.477 2.465 39.968
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.0094 7.6959 -0.261 0.7959
## x1 0.7542 0.2978 2.533 0.0172 *
## x2 0.1865 0.2419 0.771 0.4471
## x4 -10.3912 4.8345 -2.149 0.0404 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.57 on 28 degrees of freedom

## Multiple R-squared: 0.2127, Adjusted R-squared: 0.1283
## F-statistic: 2.521 on 3 and 28 DF, p-value: 0.07818

Conclusion
After all these calculations, we see that the initial tank pressure is not included in our final model,
as the other explanatory variables already account for it due to multicollinearity. If the temperature of the
tank increases, the amount of escaped hydrocarbons decreases, while increases in the temperature or pressure
of the petrol increase the waste. One should consider these points while taking steps against pollution,
keeping in mind that these conclusions are based on a biased model affected by heteroscedasticity.
