Predicting Diamond Price Using Linear Model

Sarajit Poddar
26 July 2015
Contents
1 Executive Summary
1.1 Objective
1.2 About the data
2 Exploratory Analysis
2.1 Loading relevant libraries
2.2 Subsetting the dataset
2.3 Plotting the characteristics of dataset
3 Predicting the diamond price
3.1 Determining the Significant Predictors of Diamond price
3.2 Exploring the predictors using box plot
3.3 Generating the Model
3.4 Analysing the variance between multiple models
3.5 Analysing the Residuals
3.6 Predicting using the fitted model
3.7 Plotting the predicted data with actual data
4 Final conclusion

1 Executive Summary
1.1 Objective
The objective is to determine the predictors of diamond prices using the diamonds dataset from the ggplot2
library. A linear model is used to predict the prices, the predicted prices are compared with the actual prices,
and the residuals are examined. The accuracy of the fit can then be checked against other predictive models
built with machine learning algorithms; those algorithms will be explored in subsequent articles.
1.2 About the data
1.2.1 Description
Prices of 50,000 round cut diamonds

A dataset containing the prices and other attributes of almost 54,000 diamonds. The variables are as follows:
1.2.2 Details
price. price in US dollars ($326-$18,823)

carat. weight of the diamond (0.2-5.01)
cut. quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color. diamond colour, from J (worst) to D (best)
clarity. a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
x. length in mm (0-10.74)
y. width in mm (0-58.9)
z. depth in mm (0-31.8)
depth. total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43-79)
table. width of top of diamond relative to widest point (43-95)
2 Exploratory Analysis
2.1 Loading relevant libraries
# Load required libraries

library(dplyr);
library(tidyr);
library(ggplot2)
2.2 Subsetting the dataset
The full dataset is large, so the analysis works on a smaller sample of it.

# Load the diamonds dataset
data(diamonds)
# Bin the continuous variables and store the bin index as a number
# Price: intervals of 1000
diamonds$price2 <- as.numeric(cut(diamonds$price,
                                  seq(from = 0, to = 20000, by = 1000)))
# Carat: intervals of 0.1
diamonds$carat2 <- as.numeric(cut(diamonds$carat,
                                  seq(from = 0, to = 6, by = 0.1)))
# Summary of diamonds dataset
summary(diamonds)
##      carat               cut        color        clarity
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066
##  Max.   :5.0100                     I: 5422   VVS1   : 3655
##                                     J: 2808   (Other): 2531
##      depth           table           price             x
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700
##  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740
##        y                z              price2           carat2
##  Min.   : 0.000   Min.   : 0.000   Min.   : 1.000   Min.   : 2.000
##  1st Qu.: 4.720   1st Qu.: 2.910   1st Qu.: 1.000   1st Qu.: 4.000
##  Median : 5.710   Median : 3.530   Median : 3.000   Median : 7.000
##  Mean   : 5.735   Mean   : 3.539   Mean   : 4.398   Mean   : 8.468
##  3rd Qu.: 6.540   3rd Qu.: 4.040   3rd Qu.: 6.000   3rd Qu.:11.000
##  Max.   :58.900   Max.   :31.800   Max.   :19.000   Max.   :51.000
# Structure of the diamond dataset
str(diamonds)
## 'data.frame':    53940 obs. of  12 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
##  $ price2 : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ carat2 : num  3 3 3 3 4 3 3 3 3 3 ...
# Let's say that the input price range is 1000 to 5000 and
# the number of observations is 5000
input.pricerange.low  <- 1000
input.pricerange.high <- 5000
input.obs             <- 5000
# Subset the data to the chosen price range
data.sample <- subset(diamonds,
                      price >= input.pricerange.low &
                      price <= input.pricerange.high)
# Draw a random sample of rows from the subset
data.sample <- data.sample[sample(1:nrow(data.sample), input.obs,
                                  replace=FALSE),]
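The binning and sampling steps can be tried on a handful of values before running them on the full dataset. The sketch below uses a small invented price vector (not taken from the diamonds data) and adds a `set.seed()` call, which the original code does not have, only so that the draw is repeatable.

```r
# Toy prices, invented for illustration only
toy <- data.frame(price = c(326, 950, 1000, 2401, 3933, 5000, 5324, 18823))

# cut() assigns each price to a right-closed interval (0,1000], (1000,2000], ...
# and as.numeric() returns the index of that interval
toy$price2 <- as.numeric(cut(toy$price, seq(from = 0, to = 20000, by = 1000)))
toy$price2
# 1 1 1 3 4 5 6 19  (1000 falls in the first bin because intervals are right-closed)

# Keep prices within [1000, 5000], then draw rows without replacement
in.range <- subset(toy, price >= 1000 & price <= 5000)
set.seed(42)   # assumption: added here for repeatability, not part of the original
picked <- in.range[sample(seq_len(nrow(in.range)), 3, replace = FALSE), ]
picked
```

Because the original sampling is not seeded, every run of the report draws a different sample, so coefficients and plots can vary slightly between runs.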
2.3 Plotting the characteristics of dataset
2.3.1 Plotting using base graphics
#-------------------------------
# Plotting with Base graphics
#-------------------------------
x <- data.sample$price
# Plotting the histogram
myhist <- hist(x, breaks=10, density=10, col="darkgrey",
               xlab="Diamond Price",
               main="Frequency Distribution of Diamond Price")
# Adding a vertical line for the mean
abline(v=mean(x), col="darkgreen", lwd=2)
# Plotting the density curve, rescaled from density to counts
multiplier <- myhist$counts / myhist$density
mydensity <- density(x)
mydensity$y <- mydensity$y * multiplier[1]
lines(mydensity, col="blue", lwd=2)
# Plotting the normal curve with the same mean and standard deviation
xfit <- seq(min(x), max(x), length=40)
yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))
yfit <- yfit * diff(myhist$mids[1:2]) * length(x)
lines(xfit, yfit, col="red", lwd=2)
# Add legend
legend('topright', c("Mean", "Density Curve", "Normal Curve"),
       lty=c(1,1,1), lwd=c(2,2,2), col = c("darkgreen", "blue", "red"))
[Figure: Frequency Distribution of Diamond Price. Histogram of Diamond Price (x-axis) against Frequency (y-axis) with the mean line, density curve and normal curve overlaid.]
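The density and normal curves in the base-graphics plot are multiplied by a factor so that they sit on the same scale as the histogram counts. A minimal sketch of why that works, using simulated values rather than the diamond prices: with equal-width bins, counts divided by density is the same number for every non-empty bin, namely the number of observations times the bin width.

```r
set.seed(1)                                # simulated values, for illustration only
x <- rnorm(1000, mean = 3000, sd = 800)

h <- hist(x, breaks = 10, plot = FALSE)    # compute the bins without drawing
bin.width <- diff(h$breaks[1:2])
scale.factor <- length(x) * bin.width      # converts a density into expected counts

# For every non-empty bin, counts / density equals the same factor
non.empty <- h$counts > 0
ratio <- h$counts[non.empty] / h$density[non.empty]
range(ratio)
scale.factor

# Normal curve on the count scale, ready to pass to lines()
xfit <- seq(min(x), max(x), length.out = 40)
yfit <- dnorm(xfit, mean = mean(x), sd = sd(x)) * scale.factor
```

Using `length(x) * bin.width` directly avoids relying on the first bin being non-empty, which `multiplier[1]` implicitly assumes.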
2.3.2 Plotting using ggplot
g <- ggplot(data.sample, aes(x=price))
g <- g + geom_histogram(aes(y = ..density..), fill="darkgrey")
g <- g + geom_density(alpha=.3, fill="#FF6666")
g <- g + stat_function(fun = dnorm, colour = "red",
                       args = list(mean = mean(data.sample$price),
                                   sd = sd(data.sample$price)))
g <- g + xlab("Diamond price")
g <- g + ylab("Frequency")
g <- g + ggtitle("Frequency Distribution of Diamond Price")
g
[Figure: Frequency Distribution of Diamond Price. ggplot histogram on the density scale with density curve and normal curve; x-axis: Diamond price, y-axis: Frequency.]
2.3.3 Diamond price distribution with regards to Cut
g <- ggplot(data.sample)
# Using the cut to show the differences in the price due to the
# quality of the cut. geom_histogram bins the continuous price;
# older ggplot2 versions did this through geom_bar.
g <- g + geom_histogram(aes(x=price, fill=cut))
g <- g + xlab("Price of Diamonds")
g <- g + ylab("Number of Diamonds")
g <- g + ggtitle("Prices of Sampled Diamonds")
g <- g + theme(legend.position="bottom")
g
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
[Figure: Prices of Sampled Diamonds. Stacked histogram of Price of Diamonds (x-axis) against Number of Diamonds (y-axis), filled by cut (Fair, Good, Very Good, Premium, Ideal).]
2.3.4 Regression line showing the impact of Carat on the price (Using lm)
g <- ggplot(data.sample, aes(y = price, x = carat))
g <- g + geom_point(aes(color=clarity), position="jitter")
g <- g + geom_smooth(method=lm, col="red", lwd=1)
g <- g + theme(legend.position="bottom")
g
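[Figure: scatter plot of price against carat, coloured by clarity (I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF), with the lm regression line.]
2.3.5 Regression line showing the impact of Carat on the price (Using Loess)
g <- ggplot(data.sample, aes(y = price, x = carat))
g <- g + geom_point(aes(color=clarity))
g <- g + geom_smooth(method=loess, col="blue", lwd=1)
g <- g + theme(legend.position="bottom")
g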
[Figure: scatter plot of price against carat, coloured by clarity, with the loess curve.]
2.3.6 Regression line faceted by Colour and Cut
g <- ggplot(data.sample, aes(y = price, x = carat))
g <- g + geom_point(aes(color=clarity), position="jitter")
g <- g + facet_grid(color~cut)
g <- g + geom_smooth(method=lm, col="salmon", lwd=1)
g <- g + theme(legend.position="bottom")
g
[Figure: price against carat with lm regression lines, faceted by color (rows D to J) and cut (columns Fair to Ideal), points coloured by clarity.]
2.3.7 Correlation plot between all variables
library(corrplot)
# Convert Diamonds dataset all fields to numeric
diamonds.num <- data.sample
diamonds.num[, 1:12] <- sapply(diamonds.num[, 1:12], as.numeric)
# Remove price and carat and retain price2 and carat2
diamonds.num <- select(diamonds.num, cut:table, x:carat2)
M <- cor(diamonds.num)
corrplot.mixed(M)
[Figure: mixed correlation plot of cut, color, clarity, depth, table, x, y, z, price2 and carat2. Coefficient magnitudes recoverable for price2: x 0.85, y 0.73, z 0.82, carat2 0.86.]
The correlation matrix shows that the variables most strongly correlated with price are carat, x, y and z;
the other variables show little correlation with price. However, x, y and z are also highly
correlated with carat. This can also mean that taking carat2 as the ...
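The overlap between carat and the three dimensions can be illustrated with simulated stones. This is a sketch on invented data, not the diamonds sample: the proportions and the volume-to-weight constant are assumptions chosen only to mimic a round stone whose weight follows its volume.

```r
set.seed(7)                            # simulated stones, not the diamonds data
n <- 500
x <- runif(n, 4, 7)                    # length in mm
y <- x + rnorm(n, sd = 0.05)           # width tracks length for a round stone
z <- 0.62 * x + rnorm(n, sd = 0.05)    # depth as a roughly fixed share of width
carat <- 0.0061 * x * y * z            # assumed volume-to-weight constant

# Pairwise correlations between weight and the three dimensions
cors <- cor(cbind(carat, x, y, z))
round(cors, 2)

# Variance inflation factor of x given y and z: 1 / (1 - R^2)
r2 <- summary(lm(x ~ y + z))$r.squared
vif.x <- 1 / (1 - r2)
vif.x
```

When predictors are this strongly related, their individual coefficients become unstable, which is one reason to keep carat and treat x, y and z with caution.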
2.3.8 Exploratory plot
# Loading required libraries

library(ggplot2)
library(GGally)
library(scales)
# Sampling the data for the plot generation
diasamp <- diamonds[sample(1:length(diamonds$price), 500),]
# Generating the plot
ggpairs(diasamp, params = c(shape = I('.'), outlier.shape = I('.')))
3 Predicting the diamond price
3.1 Determining the Significant Predictors of Diamond price
model.data <- subset(data.sample, select = -c(price2, carat2))

full.model <- lm(price ~ ., data = model.data)
reduced.model <- step(full.model, direction="backward", k=2, trace=0)

summary(reduced.model)
##
## Call:
## lm(formula = price ~ carat + cut + color + clarity + depth +
##     table + x + z, data = model.data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2307.76  -186.11   -18.29   179.54  1564.51
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1757.964    436.220  -4.030 5.66e-05 ***
## carat        5946.169    114.468  51.946  < 2e-16 ***
## cut.L         289.409     18.773  15.416  < 2e-16 ***
## cut.Q        -118.602     14.754  -8.039 1.13e-15 ***
## cut.C         122.496     13.433   9.119  < 2e-16 ***
## cut^4          27.127     11.409   2.378   0.0175 *
## color.L      -889.472     16.658 -53.397  < 2e-16 ***
## color.Q      -205.282     14.891 -13.785  < 2e-16 ***
## color.C       -54.810     14.049  -3.901 9.69e-05 ***
## color^4        31.203     13.128   2.377   0.0175 *
## color^5        54.780     12.283   4.460 8.38e-06 ***
## color^6        43.795     11.144   3.930 8.62e-05 ***
## clarity.L    1938.848     28.227  68.689  < 2e-16 ***
## clarity.Q    -726.394     24.116 -30.121  < 2e-16 ***
## clarity.C     451.869     20.706  21.823  < 2e-16 ***
## clarity^4    -248.135     17.122 -14.492  < 2e-16 ***
## clarity^5      77.947     14.643   5.323 1.06e-07 ***
## clarity^6     -17.521     13.226  -1.325   0.1853
## clarity^7      18.544     11.973   1.549   0.1215
## depth         -10.051      4.490  -2.239   0.0252 *
## table          -4.957      2.624  -1.889   0.0589 .
## x              91.671     49.098   1.867   0.0619 .
## z              97.390     44.027   2.212   0.0270 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 324.1 on 4977 degrees of freedom
## Multiple R-squared:  0.927,  Adjusted R-squared:  0.9267
## F-statistic:  2873 on 22 and 4977 DF,  p-value: < 2.2e-16
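The backward selection shown above retains carat, cut, color, clarity, depth, table, x and z. Of these, carat,
cut, color and clarity have by far the largest t-values, while depth, table, x and z are only marginally
significant. Because the sample is drawn at random without a fixed seed, the weaker terms that survive
selection can change from run to run. The model built in the next sections uses cut, color, clarity, table, y, z
and carat.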
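Backward selection with `step()` can be tried on any small dataset to see what it does. The sketch below uses `mtcars`, which ships with base R, as a stand-in for the diamonds sample; the choice of dataset and response is an assumption made only for illustration.

```r
# mtcars ships with base R; it stands in for the diamonds sample here
full <- lm(mpg ~ ., data = mtcars)

# Backward elimination: drop one term at a time for as long as the
# AIC (penalty k = 2 per parameter) keeps improving
reduced <- step(full, direction = "backward", k = 2, trace = 0)

# Terms that survived, and the AIC before and after
formula(reduced)
c(full = AIC(full), reduced = AIC(reduced))
```

The procedure optimises AIC rather than testing each predictor, so a term can be retained even when its p-value is above 0.05.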
3.2 Exploring the predictors using box plot
#------------------------------
# Exploring the predictors using box plot
#------------------------------
# Exploring association of Cut with Carat and Price
ggplot(data.sample, aes(factor(carat2), price)) +
  geom_boxplot(aes(fill = cut)) + xlab("Carat") +
  theme(legend.position="bottom")
[Figure: box plots of price by carat bin (x-axis: Carat), filled by cut (Fair, Good, Very Good, Premium, Ideal).]
# Exploring association of Clarity with Carat and Price
ggplot(data.sample, aes(factor(carat2), price)) +
  geom_boxplot(aes(fill = clarity)) + xlab("Carat") +
  theme(legend.position="bottom")
[Figure: box plots of price by carat bin (x-axis: Carat), filled by clarity (I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF).]
# Exploring association of Color with Carat and Price
ggplot(data.sample, aes(factor(carat2), price)) +
  geom_boxplot(aes(fill = color)) + xlab("Carat") +
  theme(legend.position="bottom")
[Figure: box plots of price by carat bin (x-axis: Carat), filled by color.]
3.3 Generating the Model
# The Starting and Suggested Model

simple.model <- lm(price ~ carat, data = model.data)
fitted.model <- lm(price ~ carat + cut + clarity + color + table + y + z,
data = model.data)
3.4 Analysing the variance between multiple models
# Summary of the simple model and fitted model

summary(simple.model)
##
## Call:
## lm(formula = price ~ carat, data = model.data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -3138.17  -307.81   -14.44   299.14  2393.19
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -728.13      24.62  -29.57   <2e-16 ***
## carat        4694.20      32.72  143.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## F-statistic: 2.059e+04 on 1 and 4998 DF,  p-value: < 2.2e-16
summary(fitted.model)
##
## Call:
## lm(formula = price ~ carat + cut + clarity + color + table +
##     y + z, data = model.data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2340.21  -184.12   -17.18   178.61  1571.15
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2253.841    180.766 -12.468  < 2e-16 ***
## carat        6154.635     66.980  91.888  < 2e-16 ***
## cut.L         316.192     17.328  18.247  < 2e-16 ***
## cut.Q        -126.597     14.558  -8.696  < 2e-16 ***
## cut.C         123.818     13.368   9.262  < 2e-16 ***
## cut^4          25.591     11.421   2.241  0.02509 *
## clarity.L    1936.765     28.124  68.866  < 2e-16 ***
## clarity.Q    -733.049     24.012 -30.529  < 2e-16 ***
## clarity.C     454.915     20.721  21.954  < 2e-16 ***
## clarity^4    -247.327     17.145 -14.426  < 2e-16 ***
## clarity^5      79.403     14.658   5.417 6.34e-08 ***
## clarity^6     -18.140     13.241  -1.370  0.17075
## clarity^7      19.361     11.993   1.614  0.10653
## color.L      -890.439     16.679 -53.387  < 2e-16 ***
## color.Q      -205.497     14.892 -13.800  < 2e-16 ***
## color.C       -54.116     14.056  -3.850  0.00012 ***
## color^4        32.902     13.141   2.504  0.01232 *
## color^5        54.910     12.298   4.465 8.18e-06 ***
## color^6        44.704     11.158   4.007 6.25e-05 ***
## table          -1.865      2.462  -0.757  0.44892
## y              30.742     12.057   2.550  0.01081 *
## z              64.395     38.046   1.693  0.09060 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## F-statistic:  3001 on 21 and 4978 DF,  p-value: < 2.2e-16
# Conduct Analysis of Variance between the simple model and the best fitted model
anova(simple.model, fitted.model)
## Analysis of Variance Table
##
## Model 1: price ~ carat
## Model 2: price ~ carat + cut + clarity + color + table + y + z
##   Res.Df        RSS Df Sum of Sq      F    Pr(>F)
## 1   4998 1399334980
## 2   4978  524428896 20 874906084 415.24 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
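The F value in the analysis of variance table can be reproduced by hand from the two residual sums of squares and their degrees of freedom. The sketch below only re-does that arithmetic with the numbers printed in the table; the variable names are chosen here for illustration.

```r
rss.simple <- 1399334980   # RSS of price ~ carat, from the table
rss.fitted <- 524428896    # RSS of the larger model, from the table
df.simple  <- 4998         # residual degrees of freedom, simple model
df.fitted  <- 4978         # residual degrees of freedom, larger model

# Number of extra parameters in the larger model
extra.terms <- df.simple - df.fitted

# F = (drop in RSS per extra parameter) / (residual variance of the larger model)
f.stat <- ((rss.simple - rss.fitted) / extra.terms) / (rss.fitted / df.fitted)
p.value <- pf(f.stat, extra.terms, df.fitted, lower.tail = FALSE)

round(f.stat, 2)   # 415.24
```

A large F with a very small p-value indicates that the 20 extra parameters reduce the residual sum of squares far more than chance would.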
3.5 Analysing the Residuals
3.5.1 Checking unidentified patterns in the Residuals
The graph shows that the gap between the actual and predicted values widens as the price of the diamond
increases. Possibly a factor that raises the price in the higher price range is not captured in
the model, so the variance of the price cannot be adequately explained by the available
predictors.
# Residuals plotted against the actual price
x <- model.data$price
y <- resid(fitted.model)
ggplot(data.frame(x, y), aes(x,y)) +
  geom_hline(yintercept=0, size=1) +
  geom_point(size=3, colour="black", alpha = 0.1) +
  geom_point(size=2, colour="salmon", alpha = 0.2) +
  xlab("Actual price") +
  ylab("Residual") +
  geom_smooth(method="loess", colour="red", lwd=1)
[Figure: residuals of the fitted model plotted against price, with a loess curve.]
3.5.2 Density plot of residuals to check Normal Distribution
The graph shows that the residuals follow an approximately normal pattern.

x <- residuals(fitted.model)
# Plotting the histogram
myhist <- hist(x, breaks=10, density=10, col="darkgrey",
               xlab="Residuals",
               main="Frequency Distribution of residuals")
# Adding a vertical line for the mean
abline(v=mean(x), col="darkgreen", lwd=2)
# Plotting the density curve, rescaled from density to counts
multiplier <- myhist$counts / myhist$density
mydensity <- density(x)
mydensity$y <- mydensity$y * multiplier[1]
lines(mydensity, col="blue", lwd=2)
# Plotting the normal curve with the same mean and standard deviation
xfit <- seq(min(x), max(x), length=40)
yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))
yfit <- yfit * diff(myhist$mids[1:2]) * length(x)
lines(xfit, yfit, col="red", lwd=2)
[Figure: Frequency Distribution of residuals. Histogram of Residuals (x-axis) against Frequency (y-axis) with the mean line, density curve and normal curve overlaid.]
3.6 Predicting using the fitted model
The formula for prediction is

Diamond price = -4253.844 + 4920.324 * carat + xx * cut + yy * clarity + zz * color + 1.462 * table +
376.099 * y + 275.481 * z
Note: The values of xx, yy and zz depend on the level of the respective variable. The coefficients listed here
differ from the summary printed in Section 3.4, presumably because they were taken from a run with a
different random sample.
Coefficients:
(Intercept)      carat      cut.L      cut.Q      cut.C      cut^4  clarity.L  clarity.Q
  -4253.844   4920.324    261.542    -83.890     66.899     37.187   2011.021   -749.420
  clarity.C  clarity^4  clarity^5  clarity^6  clarity^7    color.L    color.Q    color.C
    477.879   -296.209     77.985    -30.944     27.387   -944.980   -226.105    -80.026
    color^4    color^5    color^6      table          y          z
     18.037     12.063     30.269      1.462    376.099    275.481
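Because cut, clarity and color are ordered factors, R codes them with orthogonal polynomial contrasts, which is where the `.L`, `.Q`, `.C` and `^4` coefficient names come from. The sketch below shows how the four cut coefficients listed above translate into a price contribution for each cut level; `cut.effect` and the other variable names are introduced here for illustration only.

```r
cut.levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")

# Orthogonal polynomial contrasts, the default coding for an ordered factor
P <- contr.poly(length(cut.levels))
rownames(P) <- cut.levels

# cut coefficients as listed above, in the order .L, .Q, .C, ^4
beta.cut <- c(261.542, -83.890, 66.899, 37.187)

# Contribution of each cut level to the predicted price,
# measured relative to the unweighted average over the five levels
cut.effect <- drop(P %*% beta.cut)
round(cut.effect, 1)
```

The columns of the contrast matrix each sum to zero, so the level contributions also sum to zero and the intercept carries the average level. In practice `predict()` does this bookkeeping automatically.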
# Join the predicted and model data for comparison
pred.data <- model.data
pred.data <- select(pred.data, cut:z, carat)
pred <- predict(fitted.model, pred.data)
pred <- data.frame(model.data, pred)
# Round the predicted data
pred$pred <- round(pred$pred, 0)
# Determining RMSE (Root Mean Squared Error) to assess fit
model.rmse <- sqrt(mean(residuals(fitted.model)^2))
model.rmse
## [1] 323.8607
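The RMSE can be cross-checked against the analysis of variance table, since the mean of the squared residuals is the residual sum of squares divided by the number of observations. The sketch below re-does that arithmetic with the figures reported earlier and compares it with the residual standard error, which divides by the residual degrees of freedom instead.

```r
rss <- 524428896   # residual sum of squares of the fitted model (ANOVA table)
n   <- 5000        # number of sampled observations (input.obs)
df  <- 4978        # residual degrees of freedom of the fitted model

rmse <- sqrt(rss / n)    # what sqrt(mean(residuals^2)) computes
rse  <- sqrt(rss / df)   # residual standard error, corrected for fitted parameters

round(c(rmse = rmse, rse = rse), 2)
```

The two measures are close because the number of fitted parameters is small relative to the sample size. Both are computed on the data used for fitting, so they describe in-sample fit rather than accuracy on new diamonds.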
3.7 Plotting the predicted data with actual data
The prediction is most accurate within the price range of USD 1000 to USD 5000. Outside
this price range the prediction is not accurate; a separate prediction model should perhaps be built for
data outside the range.
For prices below USD 1000, the predicted price is lower than the actual price. Similarly, for prices
above USD 4500, the predicted price is higher than the actual price.
g <- ggplot(pred, aes(y = price, x = pred))
g <- g + geom_point(size=3, colour="black", alpha = 0.1)
g <- g + geom_point(size=2, colour="salmon", alpha = 0.2)
# g <- g + geom_point()
g <- g + ylab("Actual Price")
g <- g + xlab("Predicted Price")
g <- g + geom_smooth(method=loess, col="blue", lwd=1)
g <- g + geom_smooth(method=lm, col="red", lwd=1)
g
[Figure: Actual Price (y-axis) against Predicted Price (x-axis) with loess and lm smoothers.]
par(mfrow=c(2, 2))
plot(fitted.model)
[Figure: diagnostic plots of the fitted model in four panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance contours.]
The points in the Q-Q plot lie more or less on the line, indicating that the residuals are approximately normally distributed.
4 Final conclusion
We have seen that a good predictive model can be developed using a linear model, provided that the variables
(predictors) which significantly impact the outcome (price in this case) can be accurately identified.
We also observe that the prediction may only work within some boundary conditions. If those boundary conditions
are accurately identified, different models can be built for predicting data outside the range of the fitted model.