Sarajit Poddar
26 July 2015
Contents
1 Executive Summary
1.1 Objective
2 Exploratory Analysis
4 Final conclusion
1 Executive Summary
1.1 Objective
Using the diamonds dataset from the ggplot2 library, determine the predictors of diamond prices. Use a
linear model to predict the prices, compare the predicted prices with the actuals, and observe the residuals.
Check the accuracy of the fit against other predictive models built with machine learning algorithms. The
machine learning algorithms will be explored in subsequent articles.
1.2 Details
2 Exploratory Analysis
2.1 Loading relevant libraries
2.2
##      carat               cut            color        clarity
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066
##  Max.   :5.0100                     I: 5422   VVS1   : 3655
##                                     J: 2808   (Other): 2531
##      depth           table           price             x
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700
##  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740
##        y                z              price2           carat2
##  Min.   : 0.000   Min.   : 0.000   Min.   : 1.000   Min.   : 2.000
##  1st Qu.: 4.720   1st Qu.: 2.910   1st Qu.: 1.000   1st Qu.: 4.000
##  Median : 5.710   Median : 3.530   Median : 3.000   Median : 7.000
##  Mean   : 5.735   Mean   : 3.539   Mean   : 4.398   Mean   : 8.468
##  3rd Qu.: 6.540   3rd Qu.: 4.040   3rd Qu.: 6.000   3rd Qu.:11.000
##  Max.   :58.900   Max.   :31.800   Max.   :19.000   Max.   :51.000
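The summary lists two derived columns, price2 and carat2, whose construction is not shown in this document. Their ranges (price2 from 1 to 19 against price from 326 to 18823, carat2 from 2 to 51 against carat from 0.2 to 5.01) are consistent with rounding up price/1000 and carat*10. The sketch below illustrates that reading on a tiny hand-made data frame; the binning rule and the `toy` data frame are assumptions, not the author's code.

```r
# Tiny illustrative data frame; the real analysis uses ggplot2::diamonds.
toy <- data.frame(
  carat = c(0.25, 1.04, 5.01),
  price = c(326, 2401, 18823)
)

# ASSUMPTION: price2 bins price into USD 1000 steps and carat2 bins carat
# into 0.1-carat steps, both rounded up. This rule is inferred from the
# summary output above; it is not stated in the source.
toy$price2 <- ceiling(toy$price / 1000)
toy$carat2 <- ceiling(toy$carat * 10)

print(toy)
```

If this reading is right, price2 and carat2 are coarse integer versions of price and carat, which would explain why they are kept for the correlation plot later on.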
2.3
2.3.1
[Figure: histogram of diamond price with the mean, a density curve and a normal curve overlaid; x-axis: Diamond Price, y-axis: Frequency]
2.3.2
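# Histogram of price on the density scale, with a kernel density
# estimate and a normal curve of the same mean and sd overlaid
g <- ggplot(data.sample, aes(x=price))
g <- g + geom_histogram(aes(y = ..density..), fill="dark grey")
g <- g + geom_density(alpha=.3, fill="#FF6666")
# stat_function takes the extra arguments of fun through 'args'
g <- g + stat_function(fun = dnorm, colour = "red",
                       args = list(mean = mean(data.sample$price),
                                   sd = sd(data.sample$price)))
g <- g + xlab("Diamond price")
g <- g + ylab("Frequency")
g <- g + ggtitle("Frequency Distribution of Diamond Price")
g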
[Figure: Frequency Distribution of Diamond Price; x-axis: Diamond price, y-axis: Frequency]
2.3.3
g <- ggplot(data.sample)
# Use cut as the fill colour to show the differences in price due to the
# quality of the cut
# (geom_histogram bins the continuous price, as geom_bar did in older ggplot2)
g <- g + geom_histogram(aes(x=price, fill=cut))
g <- g + xlab("Price of Diamonds")
g <- g + ylab("Number of Diamonds")
g <- g + ggtitle("Prices of Sampled Diamonds")
g <- g + theme(legend.position="bottom")
g
[Figure: Prices of Sampled Diamonds; x-axis: Price of Diamonds, y-axis: Number of Diamonds, fill: cut]
2.3.4 Regression line showing the impact of Carat on the price (Using lm)
g <- ggplot(data.sample, aes(y = price, x = carat))
g <- g + geom_point(aes(color=clarity), position="jitter")
g <- g + geom_smooth(method=lm, col="red", lwd=1)
g <- g + theme(legend.position="bottom")
g
[Figure: price against carat with a linear (lm) fit; points coloured by clarity]
2.3.5 Regression line showing the impact of Carat on the price (Using Loess)
g <- ggplot(data.sample, aes(y = price, x = carat))
g <- g + geom_point(aes(color=clarity))
g <- g + geom_smooth(method=loess, col="blue", lwd=1)
g <- g + theme(legend.position="bottom")
g
[Figure: price against carat with a loess fit; points coloured by clarity]
2.3.6
[Figure: price against carat, faceted by cut (Fair, Good, Very Good, Premium, Ideal) and color (D to J); points coloured by clarity]
2.3.7
library(corrplot)
# Convert Diamonds dataset all fields to numeric
diamonds.num <- data.sample
diamonds.num[, 1:12] <- sapply(diamonds.num[, 1:12], as.numeric)
# Remove price and carat and retain price2 and carat2
diamonds.num <- select(diamonds.num, cut:table, x:carat2)
M <- cor(diamonds.num)
corrplot.mixed(M)
[Figure: mixed correlation plot (corrplot.mixed) of cut, color, clarity, depth, table, x, y, z, price2 and carat2]
From the correlation matrix, the variables with a high correlation with price are carat, x, y and z.
The other variables show little correlation. However, x, y and z are also highly
correlated with carat. This can also mean that taking carat2 as the
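The overlap between carat and the dimensions x, y and z can be illustrated on simulated data. The sketch below does not use the diamonds sample: it generates a hypothetical carat column and dimensions that grow with the cube root of carat, then reports the pairwise correlations and a variance inflation factor for carat. The constants 6.4 and 0.62 and the noise levels are arbitrary assumptions chosen only to give plausible-looking sizes.

```r
set.seed(1)
n <- 500

# Simulated stand-ins, NOT the diamonds data: carat plus dimensions that
# grow roughly with the cube root of carat, as the size of a stone would.
carat <- runif(n, 0.2, 1.5)
x <- 6.4 * carat^(1/3) + rnorm(n, sd = 0.05)
y <- x + rnorm(n, sd = 0.05)
z <- 0.62 * x + rnorm(n, sd = 0.05)

# Pairwise correlations: every dimension tracks carat closely.
print(round(cor(cbind(carat, x, y, z)), 2))

# Variance inflation factor for carat: 1 / (1 - R^2), where R^2 comes from
# regressing carat on the other predictors. Large values signal redundancy.
r2 <- summary(lm(carat ~ x + y + z))$r.squared
vif_carat <- 1 / (1 - r2)
print(vif_carat)
```

When predictors overlap this strongly, their individual coefficients in a joint linear model become unstable, which is one reason to consider keeping only one of them.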
2.3.8 Exploratory plot
3
3.1
## Call:
## lm(formula = price ~ carat + cut + color + clarity + depth +
##     table + x + z, data = model.data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2307.76  -186.11   -18.29   179.54  1564.51
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1757.964    436.220  -4.030 5.66e-05 ***
## carat        5946.169    114.468  51.946  < 2e-16 ***
## cut.L         289.409     18.773  15.416  < 2e-16 ***
## cut.Q        -118.602     14.754  -8.039 1.13e-15 ***
## cut.C         122.496     13.433   9.119  < 2e-16 ***
## cut^4          27.127     11.409   2.378   0.0175 *
## color.L      -889.472     16.658 -53.397  < 2e-16 ***
## color.Q      -205.282     14.891 -13.785  < 2e-16 ***
## color.C       -54.810     14.049  -3.901 9.69e-05 ***
## color^4        31.203     13.128   2.377   0.0175 *
## color^5        54.780     12.283   4.460 8.38e-06 ***
## color^6        43.795     11.144   3.930 8.62e-05 ***
## clarity.L    1938.848     28.227  68.689  < 2e-16 ***
## clarity.Q    -726.394     24.116 -30.121  < 2e-16 ***
## clarity.C     451.869     20.706  21.823  < 2e-16 ***
## clarity^4    -248.135     17.122 -14.492  < 2e-16 ***
## clarity^5      77.947     14.643   5.323 1.06e-07 ***
## clarity^6     -17.521     13.226  -1.325   0.1853
## clarity^7      18.544     11.973   1.549   0.1215
## depth         -10.051      4.490  -2.239   0.0252 *
## table          -4.957      2.624  -1.889   0.0589 .
## x              91.671     49.098   1.867   0.0619 .
## z              97.390     44.027   2.212   0.0270 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 324.1 on 4977 degrees of freedom
## Multiple R-squared:  0.927,  Adjusted R-squared:  0.9267
## F-statistic:  2873 on 22 and 4977 DF,  p-value: < 2.2e-16
We observe that the variables which have high impact on the Diamond Price are cut, color, clarity, table, y, z
and carat.
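This document does not show how the list of predictors above was arrived at. One common route from a full model to a reduced one is backward elimination by AIC with `step()`; whether that was used here is an assumption. The sketch below demonstrates the technique on simulated data with hypothetical column names (`a`, `b`, `noise`), not on `model.data`.

```r
set.seed(42)
n <- 300

# Simulated stand-in data (hypothetical names): y depends on a and b only.
d <- data.frame(a = rnorm(n), b = rnorm(n), noise = rnorm(n))
d$y <- 2 * d$a - 1.5 * d$b + rnorm(n)

# Start from the model containing every candidate predictor.
full <- lm(y ~ a + b + noise, data = d)

# Backward elimination by AIC; trace = 0 silences the step-by-step log.
reduced <- step(full, direction = "backward", trace = 0)

print(formula(reduced))
```

`step()` only drops a term when doing so lowers the AIC, so the reduced model never scores worse than the full one on that criterion. It is a heuristic, and the retained terms should still be checked against subject knowledge.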
3.2
[Figure: price against Carat; points coloured by cut]
[Figure: price against Carat; points coloured by clarity]
[Figure: price against Carat; points coloured by color]
3.3
3.4
## Call:
## lm(formula = price ~ carat, data = model.data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -3138.17  -307.81   -14.44   299.14  2393.19
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -728.13      24.62  -29.57   <2e-16 ***
## carat        4694.20      32.72  143.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 529.1 on 4998 degrees of freedom
## Multiple R-squared:  0.8047, Adjusted R-squared:  0.8046
## F-statistic: 2.059e+04 on 1 and 4998 DF,  p-value: < 2.2e-16
summary(fitted.model)
## Call:
## lm(formula = price ~ carat + cut + clarity + color + table +
##     y + z, data = model.data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2340.21  -184.12   -17.18   178.61  1571.15
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2253.841    180.766 -12.468  < 2e-16 ***
## carat        6154.635     66.980  91.888  < 2e-16 ***
## cut.L         316.192     17.328  18.247  < 2e-16 ***
## cut.Q        -126.597     14.558  -8.696  < 2e-16 ***
## cut.C         123.818     13.368   9.262  < 2e-16 ***
## cut^4          25.591     11.421   2.241  0.02509 *
## clarity.L    1936.765     28.124  68.866  < 2e-16 ***
## clarity.Q    -733.049     24.012 -30.529  < 2e-16 ***
## clarity.C     454.915     20.721  21.954  < 2e-16 ***
## clarity^4    -247.327     17.145 -14.426  < 2e-16 ***
## clarity^5      79.403     14.658   5.417 6.34e-08 ***
## clarity^6     -18.140     13.241  -1.370  0.17075
## clarity^7      19.361     11.993   1.614  0.10653
## color.L      -890.439     16.679 -53.387  < 2e-16 ***
## color.Q      -205.497     14.892 -13.800  < 2e-16 ***
## color.C       -54.116     14.056  -3.850  0.00012 ***
## color^4        32.902     13.141   2.504  0.01232 *
## color^5        54.910     12.298   4.465 8.18e-06 ***
## color^6        44.704     11.158   4.007 6.25e-05 ***
## table          -1.865      2.462  -0.757  0.44892
## y              30.742     12.057   2.550  0.01081 *
## z              64.395     38.046   1.693  0.09060 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 324.6 on 4978 degrees of freedom
## Multiple R-squared:  0.9268, Adjusted R-squared:  0.9265
## F-statistic:  3001 on 21 and 4978 DF,  p-value: < 2.2e-16
# Conduct Analysis of Variance between the simple model and the best fitted model
anova(simple.model, fitted.model)
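For readers unfamiliar with `anova()` on two nested `lm` fits, the sketch below shows what the comparison returns. It uses simulated data with a hypothetical `grade` factor, not `model.data`, and the effect sizes are arbitrary assumptions. The second row of the table carries the F test of whether the extra terms reduce the residual sum of squares by more than chance would.

```r
set.seed(7)
n <- 200

# Simulated stand-in data: price driven by carat plus a three-level factor.
d <- data.frame(
  carat = runif(n, 0.2, 1.5),
  grade = factor(sample(c("low", "mid", "high"), n, replace = TRUE))
)
bump <- c(low = 0, mid = 400, high = 800)   # hypothetical grade premiums
d$price <- 4500 * d$carat + unname(bump[as.character(d$grade)]) +
  rnorm(n, sd = 300)

small <- lm(price ~ carat, data = d)          # simple model
large <- lm(price ~ carat + grade, data = d)  # adds the factor

# F test for nested models: row 2 compares 'large' against 'small'.
cmp <- anova(small, large)
print(cmp)
p_value <- cmp[["Pr(>F)"]][2]
```

A small p-value in the second row indicates that the larger model explains significantly more variation than the simple one. The test is only valid when the smaller model is nested in the larger, as `price ~ carat` is in the fitted model above.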
3.5
3.5.1
The graph shows that the spread of the residuals (actual minus predicted price) grows as the price of the
diamond increases. A factor that drives prices up in the higher price range may be missing from
the model, so the available predictors cannot adequately capture the variation in price.
# Residuals against fitted values, matching the "Fitted value" axis label
x <- fitted(fitted.model)
y <- resid(fitted.model)
ggplot(data.frame(x, y), aes(x,y)) +
  geom_hline(yintercept=0, size=1) +
  geom_point(size=3, colour="black", alpha = 0.1) +
  geom_point(size=2, colour="salmon", alpha = 0.2) +
  xlab("Fitted value") +
  ylab("Residual") +
  geom_smooth(method="loess", colour="red", lwd=1)
[Figure: Residual against Fitted value, with a loess smoother]
3.5.2
[Figure: histogram of the residuals; x-axis: Residuals, y-axis: Frequency]
3.6
model.rmse<- sqrt(mean(residuals(fitted.model)^2))
model.rmse
## [1] 323.8607
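The RMSE of 323.86 is slightly below the residual standard error of 324.6 reported by `summary(fitted.model)`. The two differ only in the divisor: the RMSE divides the residual sum of squares by n, the residual standard error by the residual degrees of freedom. With n = 5000 and 4978 degrees of freedom, 323.8607 × sqrt(5000/4978) ≈ 324.6. The sketch below checks that relationship on simulated data; the data frame `d` and the model `fit` are illustrative, not the diamond model.

```r
set.seed(3)
n <- 100

# Simulated stand-in data for a simple linear fit.
d <- data.frame(x = rnorm(n))
d$y <- 1 + 2 * d$x + rnorm(n)
fit <- lm(y ~ x, data = d)

rmse <- sqrt(mean(residuals(fit)^2))   # divides RSS by n
rse  <- summary(fit)$sigma             # divides RSS by residual df

# Rescaling the RMSE by sqrt(n / df) reproduces the residual standard error.
rescaled <- rmse * sqrt(n / df.residual(fit))
print(c(rmse = rmse, rse = rse, rescaled = rescaled))
```

Both figures are in-sample measures, so they describe how well the model fits the data it was trained on rather than how it would perform on new diamonds.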
3.7
Here we see that the prediction is most accurate for prices between USD 1000 and USD 5000. Outside
this price range, the prediction is noticeably less accurate. A separate prediction model may be needed for
data that fall outside this range.
For prices below USD 1000, the predicted price is lower than the actual price. Conversely, for prices
above USD 4500, the predicted price is higher than the actual price.
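The comparison of actual and predicted prices by price range can also be tabulated. The sketch below is one way to do that, shown on simulated data: it fits a straight line to a deliberately curved relationship, then averages the signed error (actual minus predicted) within bands of actual price. The data, the band limits and the column names `predicted` and `band` are assumptions for illustration; this is not the code used for the plot that follows.

```r
set.seed(11)
n <- 400

# Simulated stand-in for model.data (hypothetical values, curved truth).
d <- data.frame(carat = runif(n, 0.2, 1.2))
d$price <- 800 + 3000 * d$carat^2 + rnorm(n, sd = 250)

# Straight-line fit, then in-sample predictions.
fit <- lm(price ~ carat, data = d)
d$predicted <- predict(fit, newdata = d)

# Mean signed error (actual - predicted) within each band of actual price.
d$band <- cut(d$price, breaks = c(-Inf, 1000, 3000, Inf),
              labels = c("below 1000", "1000 to 3000", "above 3000"))
bias <- tapply(d$price - d$predicted, d$band, mean)
print(round(bias, 1))
```

A band whose mean error is far from zero is one where the model systematically over- or under-predicts. Across the whole sample the mean error of a least-squares fit with an intercept is zero, so any bias in one band is offset elsewhere.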
[Figure: Actual Price against Predicted Price]
par(mfrow=c(2, 2))
plot(fitted.model)
[Figure: diagnostic plots from plot(fitted.model): Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage]
The points in the Q-Q plot lie more or less on the line, indicating that the residuals are approximately normally distributed.
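The visual reading of a Q-Q plot can be backed by a number. The sketch below, on simulated data rather than the diamond model, computes the correlation between the theoretical normal quantiles and the ordered standardized residuals; values close to 1 correspond to points lying close to a straight line. This numeric check is an addition for illustration and is not part of the original analysis.

```r
set.seed(5)
n <- 500

# Simulated stand-in data with normally distributed errors.
d <- data.frame(x = rnorm(n))
d$y <- 3 + 2 * d$x + rnorm(n)
fit <- lm(y ~ x, data = d)

# Standardized residuals, as used in the Normal Q-Q panel of plot(fit).
res <- rstandard(fit)

# qqnorm returns the theoretical (x) and sample (y) quantiles;
# plot.it = FALSE suppresses the drawing.
qq <- qqnorm(res, plot.it = FALSE)
qq_cor <- cor(qq$x, qq$y)
print(qq_cor)
```

Heavy tails or skew in the residuals pull the ends of the Q-Q plot away from the line and lower this correlation, so it is most useful alongside the plot rather than instead of it.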
4 Final conclusion
We have seen that a good predictive model can be developed with a linear model, provided that the
predictors which significantly affect the outcome (price in this case) are accurately identified.
We also observe that the prediction may only work within certain boundary conditions. If those boundaries
are accurately identified, separate models can be built for data outside the range covered by the fitted model.