
Mission Hospital
Assignment 2

20/001 - Aarushi Sharma
20/279 - Nischal Jain
20/360 - Swati Jain
20/345 - Sagar Dang
20/115 - Siddharth Solanki
20/267 - Harsh Garg

1. Develop a suitable simple linear regression model to check if there is
any relationship between Total Cost to Hospital and AGE. For the
fitted model, interpret the regression coefficient corresponding to
AGE.

Plot between Total Cost to Hospital and AGE

Below are the results of the linear regression of Total Cost to Hospital on AGE.
The diagnostic plots show that the assumptions of normality and homoscedasticity are not
satisfied. The residuals vs fitted plot shows a pattern: the residuals are tightly clustered at
lower fitted values and spread far apart at higher fitted values, so the error variance is not
constant but increases with the fitted value. Similarly, the points in the normal Q-Q plot
deviate from the reference line. Hence the underlying assumptions of the linear model are
not satisfied.
So we try the log-linear model.
Following are the results of the linear regression of log(Total Cost to Hospital) on AGE.
In the residuals vs fitted plot the residuals now scatter randomly, and the pattern that was
visible in the previous plot is gone. The normal Q-Q plot also shows a better fit to
normality than before. The slope beta 1 = 0.008565 means that each additional year of age
multiplies the total cost to hospital by a factor of exp(0.008565) = 1.0086, i.e. an increase
of about 0.86% per year.
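The back-transformation of the slope can be checked numerically. The following is a minimal sketch that uses only the AGE estimate reported for the log-linear model; the variable names are illustrative and not part of the original analysis.

```r
# Slope of AGE in the model log(cost) ~ AGE, copied from the fitted output
beta_age <- 0.008565

# Multiplicative effect on cost of one additional year of age
factor_per_year <- exp(beta_age)

# The same effect expressed as a percentage change
pct_per_year <- 100 * (factor_per_year - 1)

# Over a 10-year age difference the effect compounds multiplicatively
factor_per_decade <- exp(10 * beta_age)

print(round(c(per_year = factor_per_year,
              pct_per_year = pct_per_year,
              per_decade = factor_per_decade), 4))
```

Because the response is on the log scale, the effect is a ratio of costs and carries no currency unit.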

2. At the time of admission, suppose a patient's age is 50 years. Based on
the fitted model in (1), what will be the minimum cost of treatment for
this patient at 95% confidence level?
Using the log-linear model from question 1 with AGE = 50, the predict function with
interval = "prediction" (default level is 95%) gives the 95% prediction interval for
log(total cost of treatment). The output is as follows:
       fit      lwr      upr
1 12.24298 11.34373 13.14223

The lower limit back-transforms to exp(11.34373) = 84434.41. So the minimum cost of
treatment for a patient aged 50 at the 95% confidence level is Rs. 84,434.41.
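The back-transformation from the log scale to rupees can be sketched as follows. The three interval values are copied from the predict() output above; the object names log_pi and cost_pi are illustrative assumptions.

```r
# 95% prediction interval for log(cost) at AGE = 50, copied from the output
log_pi <- c(fit = 12.24298, lwr = 11.34373, upr = 13.14223)

# exp() is monotone, so the interval limits map directly to the rupee scale
cost_pi <- exp(log_pi)

print(round(cost_pi, 2))
```

Note that exp(fit) corresponds to the median cost rather than the mean cost if the log costs are normally distributed, so it should be read as a typical value.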

3. Suppose Mission Hospital is planning to introduce a package price for
the treatment and has decided to charge INR 250,000 for patients of
age 50 years. What is the probability that the treatment cost will
exceed the package price? Do you think that the Mission Hospital
should revise the package price?

We calculate the probability that the treatment cost exceeds the package price. We assume
that, for a given age, the log of treatment cost follows a normal distribution. Its mean is the
fitted value at AGE = 50, and its standard deviation is approximated by the residual standard
error of the model in question 1, so the probability is
1 - pnorm((log(250000) - mean) / residual standard error)

Residual standard error: 0.455 on 246 degrees of freedom.

The fitted mean of log(total cost of treatment) at AGE = 50 is 12.24298.

The probability that the treatment cost exceeds the package price is 34% (0.3412). The
hospital should not revise the package price, as log(250000) = 12.43 is greater than the
fitted mean of 12.24.
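The exceedance probability can be reproduced from the reported numbers alone. This is a sketch under the stated normality assumption; exceed_prob is a hypothetical helper name, and the rounded fitted value is used, so the last digits can differ slightly from the console output in the Appendix.

```r
mu_log <- 12.24298   # fitted log(cost) at AGE = 50, from question 2
sigma  <- 0.455      # residual standard error of the log-linear model
price  <- 250000     # proposed package price in INR

# P(cost > price) = P(log(cost) > log(price)) under a normal model for log(cost)
exceed_prob <- function(price, mu, sd) {
  pnorm(log(price), mean = mu, sd = sd, lower.tail = FALSE)
}

p_exceed <- exceed_prob(price, mu_log, sigma)
print(round(p_exceed, 4))
```

This treats the fitted mean as known and ignores the estimation uncertainty in the regression line, which is small relative to the residual spread for this model.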

4. Build a simple linear regression model between Total Cost to
Hospital and GENDER. Interpret the results.
Following is the plot of Total Cost to Hospital against GENDER. GENDER is a qualitative
variable with values F or M.

Because GENDER is qualitative, a numeric dummy variable GEN is created and, as already
discussed in question 1, the log of total cost is used as the response in the regression
model. The contrasts command shows that it is coded as 1 for male and 0 for female.
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.93436    0.05503 216.865  < 2e-16 ***
GEN          0.19082    0.06726   2.837  0.00493 **

The model shows that for males the total cost to hospital is higher by a factor of
exp(0.19082) = 1.210242, i.e. about 21% higher than for females, and the p-value
(0.00493) is significant at the 5% level.
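The interpretation of a dummy coefficient in a log-linear model can be sketched with the two reported estimates. The names b0, b1, typical_f and typical_m are illustrative assumptions; the back-transformed values are typical (geometric mean) costs, not arithmetic means.

```r
b0 <- 11.93436  # intercept: mean log(cost) for females (GEN = 0)
b1 <- 0.19082   # GEN coefficient: difference in mean log(cost), male minus female

# Ratio of typical male cost to typical female cost
ratio_m_to_f <- exp(b1)

# Back-transformed typical costs for each group
typical_f <- exp(b0)
typical_m <- exp(b0 + b1)

print(round(c(ratio = ratio_m_to_f, female = typical_f, male = typical_m), 4))
```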
5. Build a simple linear regression model between Total Cost to
Hospital and MARITAL STATUS. Interpret the results.

Following is the plot of Total Cost to Hospital against MARITAL STATUS. MARITAL STATUS is a
qualitative variable with values MARRIED and UNMARRIED.
Because marital status is qualitative, a numeric dummy variable MARITAL_STAT is created and,
as already discussed in question 1, the log of total cost is used as the response. In the code
in the Appendix, MARITAL_STAT is coded as 1 for MARRIED and 0 for UNMARRIED; the contrasts
command reports the default coding of the original factor, which is the reverse (1 for
UNMARRIED and 0 for MARRIED).
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  11.88486    0.03923 302.987   <2e-16 ***
MARITAL_STAT  0.40697    0.05944   6.847    6e-11 ***
---
F-statistic: 46.88 on 1 and 246 DF, p-value: 5.998e-11

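With MARITAL_STAT coded as 1 for MARRIED, the model shows that for married patients the total
cost to hospital is higher by a factor of exp(0.40697) = 1.5023 relative to unmarried
patients; equivalently, the cost for unmarried patients is exp(-0.40697) = 0.6656642 times
that of married patients. The p-value is also significant.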
6. Build a multiple linear regression model with Total Cost to Hospital
as dependent variable, and AGE, GENDER and MARITAL STATUS
as predictors. Compare the results with that of (4) and (5).

Linear regression with AGE, GENDER and MARITAL STATUS as independent variables.

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  11.757557   0.056499 208.102  < 2e-16 ***
AGE           0.007637   0.002555   2.989  0.00308 **
GEN           0.104211   0.062490   1.668  0.09667 .
MARITAL_STAT  0.032630   0.132570   0.246  0.80578

Only AGE is significant at the 5% level. GENDER (p = 0.097) and MARITAL STATUS (p = 0.806)
are not significant, although both were significant in questions 4 and 5. This shows that,
considered individually, gender and marital status appear to have a significant impact on the
total cost to hospital, but in the combined model their effects are not significant once AGE
is included. The standard error of MARITAL_STAT rises from 0.05944 to 0.132570, which
suggests that marital status is correlated with age.
7. Build a multiple linear regression model with appropriate set of predictors.
Identify the statistically significant predictors that the Mission Hospital can
use in predicting Total Cost to Hospital. Comment on the performance of
the fitted model. How does the fitted model help Mission Hospital to take
managerial decisions?
Linear regression by taking all variables as independent variables
Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)        10.4195765  0.4989676  20.882  < 2e-16 ***
AGE                 0.0085850  0.0030825   2.785 0.006015 **
MALE               -0.0410937  0.0716926  -0.573 0.567339
UNMARRIED           0.0964430  0.1444585   0.668 0.505364
ACHD                0.0606913  0.1454933   0.417 0.677148
CAD.DVD             0.4675391  0.1300201   3.596 0.000433 ***
CAD.SVD             0.3492459  0.3141862   1.112 0.268025
CAD.TVD             0.3441462  0.1408546   2.443 0.015670 *
CAD.VSD             0.3220618  0.4186867   0.769 0.442926
OS.ASD              0.2303903  0.1517427   1.518 0.130964
other..heart        0.2947377  0.1152326   2.558 0.011488 *
other..respiratory  0.0736222  0.2061631   0.357 0.721494
other.general      -1.6289222  0.4634972  -3.514 0.000577 ***
other.nervous       0.6509382  0.4193210   1.552 0.122602
other.tertalogy     0.3684828  0.1693884   2.175 0.031108 *
PM.VSD              0.2809374  0.2406915   1.167 0.244907
RHD                 0.5645466  0.1333216   4.234  3.9e-05 ***
BODY.WEIGHT         0.0022855  0.0037020   0.617 0.537890
BODY.HEIGHT         0.0005591  0.0016910   0.331 0.741381
HR.PULSE            0.0050994  0.0019315   2.640 0.009129 **
BP..HIGH           -0.0021987  0.0023049  -0.954 0.341603
BP.LOW             -0.0005388  0.0032198  -0.167 0.867311
RR                  0.0173013  0.0090719   1.907 0.058343 .
Diabetes1          -0.0931856  0.1643344  -0.567 0.571496
Diabetes2           0.2090071  0.1756235   1.190 0.235820
hypertension1      -0.0623585  0.1217057  -0.512 0.609116
hypertension2      -0.2203463  0.1496889  -1.472 0.143028
hypertension3       0.1137384  0.1999772   0.569 0.570339
other              -0.0703775  0.1239298  -0.568 0.570932
HB                  0.0027892  0.0118002   0.236 0.813456
UREA                0.0008210  0.0026521   0.310 0.757307
CREATININE          0.2667857  0.1271125   2.099 0.037444 *
AMBULANCE           0.1048268  0.3199244   0.328 0.743607
TRANSFERRED        -0.2662347  0.2261663  -1.177 0.240923
ELECTIVE            0.0878894  0.3115261   0.282 0.778221
The significant predictors (p < 0.05) in the table above are AGE, CAD.DVD, CAD.TVD,
other..heart, other.general, other.tertalogy, RHD, HR.PULSE and CREATININE.

Now consider only significant variables for building model.

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     10.974447   0.176893  62.040  < 2e-16 ***
AGE              0.006630   0.001672   3.965 0.000101 ***
CAD.DVD          0.401122   0.105391   3.806 0.000186 ***
CAD.TVD          0.388842   0.109755   3.543 0.000490 ***
other..heart     0.221803   0.074259   2.987 0.003162 **
other.general   -1.544496   0.419724  -3.680 0.000298 ***
other.tertalogy  0.288918   0.114124   2.532 0.012103 *
RHD              0.490360   0.100450   4.882 2.11e-06 ***
HR.PULSE         0.005739   0.001594   3.600 0.000399 ***
CREATININE       0.223745   0.064466   3.471 0.000633 ***
Residual standard error: 0.4061 on 205 degrees of freedom
(33 observations deleted due to missingness)
Multiple R-squared: 0.4232, Adjusted R-squared: 0.3979
F-statistic: 16.71 on 9 and 205 DF, p-value: < 2.2e-16

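The fitted model with only the significant predictors has a multiple R-squared of 0.4232 and
an adjusted R-squared of 0.3979, i.e. it explains 42.32% of the variation in log(Total Cost to
Hospital). The full model has a higher multiple R-squared (0.5307) but only a slightly higher
adjusted R-squared (0.4285), while using 34 predictors instead of 9.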
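The adjusted R-squared values can be cross-checked from the reported R-squared and degrees of freedom. This is a sketch using the standard adjustment formula; adj_r2 is a hypothetical helper, and the sample sizes are inferred from the residual degrees of freedom in the Appendix output (n = residual df + number of predictors + 1).

```r
# Adjusted R-squared: penalises R-squared for the number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Reduced model: 9 predictors, 205 residual df, so n = 215
reduced <- adj_r2(0.4232, n = 215, p = 9)

# Full model: 34 predictors, 156 residual df, so n = 191
full <- adj_r2(0.5307, n = 191, p = 34)

print(round(c(reduced = reduced, full = full), 4))
```

The two models are fitted on different numbers of complete observations (215 and 191), so their fit statistics are not strictly comparable.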
Appendix R syntax

Question 1:
> ggplot(mission, aes(x=AGE, y=TOTAL.COST.TO.HOSPITAL)) + geom_point(col="green") +
+   geom_smooth(method="lm", col="red") + labs(x="Age", y="Total Cost")
> mod1 <- lm(TOTAL.COST.TO.HOSPITAL ~ AGE, data = mission)
> summary(mod1)

Call:
lm(formula = TOTAL.COST.TO.HOSPITAL ~ AGE, data = mission)

Residuals:
Min 1Q Median 3Q Max
-232683 -61888 -19440 28238 600773

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 141216.6 10610.7 13.309 < 2e-16 ***
AGE 1991.2 273.8 7.273 4.67e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 111400 on 246 degrees of freedom


Multiple R-squared: 0.177, Adjusted R-squared: 0.1736
F-statistic: 52.9 on 1 and 246 DF, p-value: 4.672e-12

> mod2 <- lm(log(TOTAL.COST.TO.HOSPITAL) ~ AGE, data = mission)


> summary(mod2)

Call:
lm(formula = log(TOTAL.COST.TO.HOSPITAL) ~ AGE, data = mission)

Residuals:
Min 1Q Median 3Q Max
-1.51748 -0.24402 -0.00536 0.25388 1.39912

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.814724 0.043326 272.693 < 2e-16 ***
AGE 0.008565 0.001118 7.662 4.21e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.455 on 246 degrees of freedom


Multiple R-squared: 0.1927, Adjusted R-squared: 0.1894
F-statistic: 58.7 on 1 and 246 DF, p-value: 4.212e-13

Question 2:
> newage = data.frame(AGE=50)
> p<- predict(mod2, newage, interval = "prediction")
> p
fit lwr upr
1 12.24298 11.34373 13.14223
> exp(p[2])
[1] 84434.41

Question 3:
> 1-pnorm((log(250000)-p[1])/.455)
[1] 0.3411574

Question 4:
# Develop a simple linear model to predict total cost with gender as predictor
# Assumption: attach(mission) was called earlier, since columns are used by bare name
# Plot gender vs total cost
plot(GENDER, TOTAL.COST.TO.HOSPITAL, col="red")
# Plot points and the fitted linear model
ggplot(mission, aes(x=GENDER, y=TOTAL.COST.TO.HOSPITAL)) + geom_point(col="green") +
  geom_smooth(method="lm", col="red") + labs(x="Gender", y="Total Cost")
# Create a dummy variable for gender: 1 for male, 0 for female
GEN <- NULL
for(i in 1:248)
{
  if(GENDER[i]=="M") GEN[i]<-1
  if(GENDER[i]=="F") GEN[i]<-0
}
> contrasts(GENDER)
M
F 0
M 1
# developing a linear model with Total cost and Gender
> mod3 <- lm(log(TOTAL.COST.TO.HOSPITAL) ~ GEN, data = mission)
> summary(mod3)

Call:
lm(formula = log(TOTAL.COST.TO.HOSPITAL) ~ GEN, data = mission)

Residuals:
Min 1Q Median 3Q Max
-1.31142 -0.28273 -0.08258 0.26109 1.57082

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.93436 0.05503 216.865 < 2e-16 ***
GEN 0.19082 0.06726 2.837 0.00493 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4983 on 246 degrees of freedom


Multiple R-squared: 0.03168, Adjusted R-squared: 0.02774
F-statistic: 8.048 on 1 and 246 DF, p-value: 0.004934
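
As a side note on the dummy coding, lm() can also take a factor directly and builds the 0/1 column itself using the coding shown by contrasts(). The sketch below uses a small made-up data frame (toy is not the mission data, and its numbers are purely illustrative) to show that the hand-made dummy and the factor give the same coefficients.

```r
# Toy data for illustration only; these are NOT values from the mission data set
toy <- data.frame(
  GENDER = factor(c("F", "M", "F", "M", "F", "M")),
  COST   = c(100000, 150000, 120000, 180000, 90000, 160000)
)

# Hand-made dummy: 1 for male, 0 for female (vectorised, no loop needed)
toy$GEN <- as.integer(toy$GENDER == "M")

# Model with the numeric dummy and model with the factor itself
fit_dummy  <- lm(log(COST) ~ GEN, data = toy)
fit_factor <- lm(log(COST) ~ GENDER, data = toy)

print(coef(fit_dummy))
print(coef(fit_factor))
```

With the default treatment contrasts the first level (F) is the baseline, so the factor model reports the coefficient under the name GENDERM.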

Question 5:

> plot(MARITAL.STATUS, TOTAL.COST.TO.HOSPITAL, col="red")


> ggplot(mission, aes(x=MARITAL.STATUS, y=TOTAL.COST.TO.HOSPITAL)) + geom_point(col="green") +
+   geom_smooth(method="lm", col="red") + labs(x="Marital Status", y="Total Cost")
> MARITAL_STAT <- NULL
> for(i in 1:248)
+ {
+ if(MARITAL.STATUS[i]=="UNMARRIED") MARITAL_STAT[i]<-0
+ if(MARITAL.STATUS[i]=="MARRIED") MARITAL_STAT[i]<-1
+ }

> contrasts(MARITAL.STATUS)
UNMARRIED
MARRIED 0
UNMARRIED 1

> mod4 <- lm(log(TOTAL.COST.TO.HOSPITAL) ~ MARITAL_STAT, data = mission)


> summary(mod4)

Call:
lm(formula = log(TOTAL.COST.TO.HOSPITAL) ~ MARITAL_STAT, data = mission)

Residuals:
Min 1Q Median 3Q Max
-1.3608 -0.2360 -0.0334 0.2396 1.4042

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.88486 0.03923 302.987 <2e-16 ***
MARITAL_STAT 0.40697 0.05944 6.847 6e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4641 on 246 degrees of freedom


Multiple R-squared: 0.1601, Adjusted R-squared: 0.1566
F-statistic: 46.88 on 1 and 246 DF, p-value: 5.998e-11

> plot(mod4, which = c(1,2))

Question 6:

> mod5 <- lm(log(TOTAL.COST.TO.HOSPITAL) ~ AGE + GEN + MARITAL_STAT, data = mission)
> summary(mod5)

Call:
lm(formula = log(TOTAL.COST.TO.HOSPITAL) ~ AGE + GEN + MARITAL_STAT,
data = mission)

Residuals:
Min 1Q Median 3Q Max
-1.5285 -0.2603 -0.0104 0.2470 1.3529

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.757557 0.056499 208.102 < 2e-16 ***
AGE 0.007637 0.002555 2.989 0.00308 **
GEN 0.104211 0.062490 1.668 0.09667 .
MARITAL_STAT 0.032630 0.132570 0.246 0.80578
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4543 on 244 degrees of freedom
Multiple R-squared: 0.2019, Adjusted R-squared: 0.1921
F-statistic: 20.58 on 3 and 244 DF, p-value: 6.394e-12

Question 7:
> mod6 <- lm(log(TOTAL.COST.TO.HOSPITAL) ~ AGE + MALE + UNMARRIED + ACHD + CAD.DVD +
+     CAD.SVD + CAD.TVD + CAD.VSD + OS.ASD + other..heart + other..respiratory +
+     other.general + other.nervous + other.tertalogy + PM.VSD + RHD + BODY.WEIGHT +
+     BODY.HEIGHT + HR.PULSE + BP..HIGH + BP.LOW + RR + Diabetes1 + Diabetes2 +
+     hypertension1 + hypertension2 + hypertension3 + other + HB + UREA + CREATININE +
+     AMBULANCE + TRANSFERRED + ELECTIVE, data = mission)
> summary(mod6)

Call:
lm(formula = log(TOTAL.COST.TO.HOSPITAL) ~ AGE + MALE + UNMARRIED +
ACHD + CAD.DVD + CAD.SVD + CAD.TVD + CAD.VSD + OS.ASD + other..heart +
other..respiratory + other.general + other.nervous + other.tertalogy +
PM.VSD + RHD + BODY.WEIGHT + BODY.HEIGHT + HR.PULSE + BP..HIGH +
BP.LOW + RR + Diabetes1 + Diabetes2 + hypertension1 + hypertension2 +
hypertension3 + other + HB + UREA + CREATININE + AMBULANCE +
TRANSFERRED + ELECTIVE, data = mission)

Residuals:
Min 1Q Median 3Q Max
-0.96533 -0.18093 -0.01659 0.19462 1.19165

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.4195765 0.4989676 20.882 < 2e-16 ***
AGE 0.0085850 0.0030825 2.785 0.006015 **
MALE -0.0410937 0.0716926 -0.573 0.567339
UNMARRIED 0.0964430 0.1444585 0.668 0.505364
ACHD 0.0606913 0.1454933 0.417 0.677148
CAD.DVD 0.4675391 0.1300201 3.596 0.000433 ***
CAD.SVD 0.3492459 0.3141862 1.112 0.268025
CAD.TVD 0.3441462 0.1408546 2.443 0.015670 *
CAD.VSD 0.3220618 0.4186867 0.769 0.442926
OS.ASD 0.2303903 0.1517427 1.518 0.130964
other..heart 0.2947377 0.1152326 2.558 0.011488 *
other..respiratory 0.0736222 0.2061631 0.357 0.721494
other.general -1.6289222 0.4634972 -3.514 0.000577 ***
other.nervous 0.6509382 0.4193210 1.552 0.122602
other.tertalogy 0.3684828 0.1693884 2.175 0.031108 *
PM.VSD 0.2809374 0.2406915 1.167 0.244907
RHD 0.5645466 0.1333216 4.234 3.9e-05 ***
BODY.WEIGHT 0.0022855 0.0037020 0.617 0.537890
BODY.HEIGHT 0.0005591 0.0016910 0.331 0.741381
HR.PULSE 0.0050994 0.0019315 2.640 0.009129 **
BP..HIGH -0.0021987 0.0023049 -0.954 0.341603
BP.LOW -0.0005388 0.0032198 -0.167 0.867311
RR 0.0173013 0.0090719 1.907 0.058343 .
Diabetes1 -0.0931856 0.1643344 -0.567 0.571496
Diabetes2 0.2090071 0.1756235 1.190 0.235820
hypertension1 -0.0623585 0.1217057 -0.512 0.609116
hypertension2 -0.2203463 0.1496889 -1.472 0.143028
hypertension3 0.1137384 0.1999772 0.569 0.570339
other -0.0703775 0.1239298 -0.568 0.570932
HB 0.0027892 0.0118002 0.236 0.813456
UREA 0.0008210 0.0026521 0.310 0.757307
CREATININE 0.2667857 0.1271125 2.099 0.037444 *
AMBULANCE 0.1048268 0.3199244 0.328 0.743607
TRANSFERRED -0.2662347 0.2261663 -1.177 0.240923
ELECTIVE 0.0878894 0.3115261 0.282 0.778221
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3965 on 156 degrees of freedom


(57 observations deleted due to missingness)
Multiple R-squared: 0.5307, Adjusted R-squared: 0.4285
F-statistic: 5.19 on 34 and 156 DF, p-value: 5.174e-13

> mod7 <- lm(log(TOTAL.COST.TO.HOSPITAL) ~ AGE + CAD.DVD + CAD.TVD + other..heart +
+     other.general + other.tertalogy + RHD + HR.PULSE + CREATININE, data = f)
> summary(mod7)

Call:
lm(formula = log(TOTAL.COST.TO.HOSPITAL) ~ AGE + CAD.DVD + CAD.TVD +
other..heart + other.general + other.tertalogy + RHD + HR.PULSE +
CREATININE, data = f)

Residuals:
Min 1Q Median 3Q Max
-1.06605 -0.20151 -0.02119 0.19485 1.26342

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.974447 0.176893 62.040 < 2e-16 ***
AGE 0.006630 0.001672 3.965 0.000101 ***
CAD.DVD 0.401122 0.105391 3.806 0.000186 ***
CAD.TVD 0.388842 0.109755 3.543 0.000490 ***
other..heart 0.221803 0.074259 2.987 0.003162 **
other.general -1.544496 0.419724 -3.680 0.000298 ***
other.tertalogy 0.288918 0.114124 2.532 0.012103 *
RHD 0.490360 0.100450 4.882 2.11e-06 ***
HR.PULSE 0.005739 0.001594 3.600 0.000399 ***
CREATININE 0.223745 0.064466 3.471 0.000633 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4061 on 205 degrees of freedom


(33 observations deleted due to missingness)
Multiple R-squared: 0.4232, Adjusted R-squared: 0.3979
F-statistic: 16.71 on 9 and 205 DF, p-value: < 2.2e-16
