
Mathematical Planning and Advanced Statistics 2018-2019

Problem 1. An article in the Journal of Pharmaceutical Sciences (vol. 80, 1991, pp. 971-977)
shows data on the observed mole fraction solubility of a solute at a constant temperature and the
dispersion, dipolar, and hydrogen bonding Hansen partial solubility parameters. The data is
available in the file ‘Experiment1.xlsx’, where y is the negative logarithm of the mole fraction
solubility, x1 is the dispersion Hansen partial solubility, x2 is the dipolar partial solubility, and x3
is the hydrogen bonding partial solubility.

(a) Fit a full quadratic model.

The fitted model is thus:

Even though this is the complete quadratic model of the experiment, further analysis is
still needed to evaluate the significance of the model and of each term.
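
As a reminder of what "full quadratic" means here, the short Python sketch below lists the terms
of such a model for three factors: the main effects, the two-factor interactions, and the squared
terms. The function name and the `x1*x2` label format are illustrative choices, not JMP output,
and no coefficients are implied.

```python
from itertools import combinations


def full_quadratic_terms(factors):
    """Return the term labels of a full quadratic model (intercept not included)."""
    main = list(factors)
    # Two-factor interactions: every unordered pair of distinct factors.
    interactions = [f"{a}*{b}" for a, b in combinations(factors, 2)]
    # Pure quadratic terms: each factor multiplied by itself.
    squares = [f"{f}*{f}" for f in factors]
    return main + interactions + squares


terms = full_quadratic_terms(["x1", "x2", "x3"])
print(len(terms), terms)
```

With three factors this gives nine terms, so the model has ten parameters once the intercept is
counted.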

(b) Using an Analysis of Variance, test for significance of the regression model using an
𝛼 = 0.05. Interpret the results.
With a p-value below 0.0001, the null hypothesis that all regression coefficients are zero is
clearly rejected at the 5% level of significance. We can conclude that the association between
the negative logarithm of the mole fraction solubility and the factors (dispersion Hansen
partial solubility, dipolar partial solubility, and hydrogen bonding partial solubility) is
significant. However, this is an overall test of the model; it does not take into account the
significance of each individual term in the fitted model.

(c) Using t-tests, interpret the significant model coefficients.

At the 5% significance level, none of the interaction terms in the fitted model is significant.
The intercept and the dipolar partial solubility (x2) were also found to be non-significant;
however, they should not be removed from the model, since they are main terms.

(d) Using t-tests, simplify the model using an 𝛼 = 0.05. Remove the nonsignificant
coefficients one by one starting from the one with the largest p-value. That is, remove the
most nonsignificant coefficient, fit a reduced model and use it to identify the next term
to be removed. Continue until there are no nonsignificant coefficients. Hint: In the Menu
Analyze - Fit Model platform, activate the ‘Keep dialog open’ option so you can easily
remove the coefficients one by one.

After removing the non-significant terms, we ended up with a new expression, noting that
the quadratic term of the hydrogen bonding partial solubility (x3*x3) is still relevant for the
fitted model.
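
The one-by-one removal described in (d) can be sketched in Python as follows. This is only an
outline of the procedure, not what JMP runs internally: `fit_pvalues` is a hypothetical callable
that refits the model on the given terms and returns their p-values, `protected` is an assumed
way of keeping main terms in the model, and `toy_fit` returns made-up p-values purely to
exercise the loop.

```python
def backward_eliminate(terms, fit_pvalues, alpha=0.05, protected=()):
    """Drop the least significant term, refit, and repeat until all remaining p-values <= alpha."""
    current = list(terms)
    keep = set(protected)
    while True:
        pvalues = fit_pvalues(current)  # refit the model on the remaining terms
        candidates = {t: p for t, p in pvalues.items()
                      if t not in keep and p > alpha}
        if not candidates:
            return current
        worst = max(candidates, key=candidates.get)  # term with the largest p-value
        current.remove(worst)


def toy_fit(terms):
    """Hypothetical stand-in for a real model fit; p-values are invented."""
    pvalues = {"x1": 0.001, "x3": 0.01}
    # In this toy, x2 only becomes significant once the interaction is removed.
    pvalues["x2"] = 0.2 if "x1*x2" in terms else 0.03
    if "x1*x2" in terms:
        pvalues["x1*x2"] = 0.6
    return {t: pvalues[t] for t in terms}


kept = backward_eliminate(["x1", "x2", "x3", "x1*x2"], toy_fit)
print(kept)
```

Refitting after every removal matters because p-values change when a term is dropped, which is
what the toy example imitates.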

(e) Compare the new model in (d) to the initial model in (c).

The estimates changed slightly between the two models. The intercept continues to be
non-significant in the new expression, and the factors x1 and x3 continue to be significant.
Comparing the expanded model with the simplified one, the x2 term has become significant in
the new model (p-value: 0.0276). The same occurs with the quadratic term x3*x3, whose p-value
changed from 0.5754 (non-significant) to 0.0347 (significant).
Initial Model New Model

(f) Comment on the goodness of fit (𝑅² and Adjusted 𝑅²) of the new model.

From the comparison chart, the initial model shows a higher R² value (0.916949) than the
new model (0.892772). This is expected, because R² never decreases when terms are added to
a model. The adjusted R², however, compensates for the number of terms: it increases only if
a new term improves the model more than would be expected by chance. Since most of the
interaction terms in the initial model were non-significant, its adjusted R² was lower. To
conclude, once the number of terms is taken into account, the second model gives the better
fit.
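
The penalty behind the adjusted R² can be made concrete with a small Python sketch of the
usual formula, 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of runs and p the number of
model terms excluding the intercept. The numbers below are illustrative only; they are not the
values from Experiment1.xlsx, whose sample size is not stated here.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p terms (intercept excluded) fitted to n runs."""
    if n - p - 1 <= 0:
        raise ValueError("need more runs than model parameters (n > p + 1)")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Illustrative numbers only: a larger model with a slightly higher R^2
# versus a smaller model with a slightly lower R^2, both on 20 runs.
big = adjusted_r2(0.90, n=20, p=9)
small = adjusted_r2(0.88, n=20, p=4)
print(big, small)
```

In this made-up case the smaller model ends up with the higher adjusted R², which is the same
pattern described above.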

Initial Model New Model

(g) Using the prediction profiler, determine the optimal settings that maximize the response.

The maximum value for the response variable (negative logarithm of the mole fraction
solubility) is obtained when the dispersion Hansen partial solubility (x1) is equal to 10.3, the
dipolar partial solubility (x2) to 7.8, and the hydrogen bonding partial solubility (x3) to 0.

(h) Predict the response for the following variable settings: x1 = 7.1, x2 = 0, and x3 = 20.7.
Given the nature of the response, do you find any issues with this prediction?

The predicted value for the negative logarithm of the mole fraction (Xa) at these settings
is -0.23849 (y = -log(Xa)), which corresponds to a mole fraction of 1.73. However, a mole
fraction can only range between 0 and 1, so its negative logarithm is never negative.
The response variable (y) can therefore only be greater than or equal to 0, and this
prediction is not physically meaningful.
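
The back-transformation above can be checked with a few lines of Python. The sketch assumes a
base-10 logarithm, which is the base that reproduces the value of 1.73 quoted in the text.

```python
# Predicted response from the text: y = -log10(Xa), assuming a base-10 logarithm.
y_pred = -0.23849

# Invert the transformation to recover the implied mole fraction.
mole_fraction = 10 ** (-y_pred)
print(round(mole_fraction, 2))

# A mole fraction must lie in the interval (0, 1].
is_physical = 0 < mole_fraction <= 1
print(is_physical)
```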

Problem 2. A chemist performed the experiments shown in the file ‘Experiment2.xlsx’,
randomizing the order of the runs within each week. The objective of this 16-run experiment is
to increase the yield.

(a) Fit a model including the effect of Week, and the main effects and two-factor interactions
involving the other factors.
The fitted model is thus:

In this case, the yield depends on both quantitative and qualitative factors. For instance,
the type of catalyst (1 or 2) is a qualitative nominal variable that affects the response
variable at two different levels, and the same holds for the week variable. Temperature
and pH, being continuous, each have only one estimate.

(b) Using a procedure similar to Problem 1 (d), simplify the model. Does the model have a
good fit in terms of the 𝑅² and Adjusted 𝑅²?

After removing the non-significant terms from the fitted model, the new model has a lower
R² than the initial one. Both R² and adjusted R² are higher in the initial model.
Therefore, we cannot conclude that the new model has a good fit.
Initial Model New Model

The prediction expression of the new fitted model is shown below, where the only
nominal variable is the catalyst.

In the next chart, the t-test shows that the p-value of the nominal variable (catalyst)
has increased.

(c) Using the prediction profiler, interpret the significant effects. More specifically, interpret
the significant main effects individually. If there are significant two-factor interactions,
discuss how the impact of one factor on the response depends on the level of the other
factor.

The relationship between yield and temperature, for instance, is positive: for each 10 °C
increase in temperature, with the other variables held constant, the yield rises by 7.75%.
The same happens with pH: increasing the pH by 0.1 units raises the yield by 7.575%. The
catalyst modifies the rate at which temperature affects the yield, which is a
temperature-by-catalyst interaction. The values above were obtained with catalyst 1 and
differ for catalyst 2, for which a 10 °C increase in temperature raises the yield by only
3.275%. The effect of pH at constant temperature, however, is the same with catalyst 2 as
with catalyst 1 (7.575%).
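
The interaction described above amounts to a temperature slope that depends on the catalyst.
The Python sketch below uses only the two slopes quoted in the text and assumes the effect of
temperature is linear over the studied range. The names are illustrative. If the two catalyst
levels are coded as +1 and -1 (an assumption about the coding, not something stated in the
text), the average slope and the half gap correspond to the temperature main effect and the
interaction coefficient per 10 °C.

```python
# Yield change (in %) per 10 degC increase, by catalyst, as quoted in the text.
slope_per_10c = {1: 7.75, 2: 3.275}


def yield_change(delta_temp_c, catalyst):
    """Change in yield for a temperature change, assuming a linear effect."""
    return slope_per_10c[catalyst] * delta_temp_c / 10.0


# Slope averaged over both catalysts, and half the gap between them.
average_slope = (slope_per_10c[1] + slope_per_10c[2]) / 2
interaction_half_gap = (slope_per_10c[1] - slope_per_10c[2]) / 2

print(yield_change(20, 1), yield_change(20, 2))
print(average_slope, interaction_half_gap)
```

A non-zero half gap is what makes the temperature lines for the two catalysts non-parallel in
the profiler.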

(d) What are the optimal settings to increase the Yield?

To maximize the yield (82.2375%), it is necessary to use the first catalyst,
a temperature of 150 °C, and a pH of 6.9.

(e) Assess multicollinearity in the effect estimates of the reduced model using the Variance
Inflation Factors. Interpret the results.

The VIF of a term is defined as 1/(1 - R²), where R² is obtained by regressing that term on
the other terms in the model. If the term is uncorrelated with the others, R² is 0 and the
VIF is 1; as R² approaches 1, the VIF tends to infinity. In this case, none of the variables
presents a collinearity problem.
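
The definition above can be illustrated with a small Python sketch. It is restricted to the
simplest case of two predictors, where the R² of one predictor regressed on the other equals
their squared Pearson correlation; the ±1 columns are a made-up orthogonal two-level layout,
not the runs from Experiment2.xlsx.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def vif_from_r2(r2):
    """Variance inflation factor 1 / (1 - R^2); infinite when R^2 reaches 1."""
    if r2 >= 1:
        return float("inf")
    return 1 / (1 - r2)


# Hypothetical coded settings of two factors in an orthogonal two-level layout.
temp = [-1, -1, 1, 1]
ph = [-1, 1, -1, 1]

vif_orthogonal = vif_from_r2(pearson_r(temp, ph) ** 2)  # uncorrelated factors
vif_collinear = vif_from_r2(0.9)                        # strongly related factors
print(vif_orthogonal, vif_collinear)
```

With more than two predictors the R² has to come from a regression on all the other terms, so
the squared pairwise correlation is no longer enough.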

Moreover, as shown in the chart below, the main effects are not correlated with each
other (0.0000). However, temperature (-0.1017) and, to a greater extent, pH (-0.9948) are
correlated with the intercept, which means that the interpretation of the intercept is
subject to these variables.

(f) Assess the assumptions of the residuals of the simplified model using graphical displays.
More specifically, assess whether the residuals have a constant variance and are normally
distributed with zero mean. For the former, use the ‘Row Diagnostics - Plot Residual by
Predicted’ in the output from the Menu Analyze - Fit Model platform. For the latter, save
the residuals of the model and use the histogram given by the Menu Analyze -
Distribution platform. Investigate on the internet or in books on applied linear statistical
models about how to interpret these two plots.

The “Residual by Predicted” plot shows the predictions made by the model on the x-axis and
the residuals on the y-axis, where Residual = Observed - Predicted. The distance from the
line at 0 shows how far off the prediction was for that value: positive residuals mean the
prediction was too low, negative residuals mean the prediction was too high, and 0 means
the prediction was exactly correct. The plot is used to judge whether the variance of the
residuals is constant across the range of predicted values.
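
The sign convention for residuals can be illustrated with a few lines of Python. The observed
and predicted values below are invented for the example and are not taken from the yield
experiment.

```python
# Hypothetical observed and predicted yields (not the experiment's data).
observed = [70.0, 65.5, 80.0]
predicted = [68.0, 67.0, 80.0]

# Residual = Observed - Predicted.
residuals = [o - p for o, p in zip(observed, predicted)]


def describe(residual, tol=1e-9):
    """Translate the sign of a residual into a statement about the prediction."""
    if residual > tol:
        return "prediction too low"
    if residual < -tol:
        return "prediction too high"
    return "prediction exact"


for p, r in zip(predicted, residuals):
    print(p, r, describe(r))
```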

For the yield model, the residual plot shows a slight pattern toward the left: the residuals
lie farther from the zero line when the predicted values are smaller. This could mean that
the variance is not constant in the model.

Regarding the distribution of the residuals, the histogram allows a visual assessment of the
assumption that the errors in the response variable are normally distributed. In this case,
the residuals show a bell-shaped distribution and the median is near 0 (-0.075), so even
though the distribution is shifted slightly to the left, the residuals appear approximately
normally distributed.
