
Abstract

This study attempts to create a model for predicting writing scores for high
school students. Such a model could be used in a variety of real-life
applications, such as determining which students could benefit most from additional
writing instruction. To generate the model, I took a host of relevant variables,
including gender, race, socioeconomic status, ability in other academic areas, and
type of program and school, and analyzed them using multivariate linear regression
and logit modeling. My regression analysis attempts to establish which variables are
important in predicting a student’s writing score. I follow this analysis with a logit
model that attempts to establish which variables are important in predicting whether
a student will score high on a writing test. After assessing the two resulting
models, one regression and one logit, I conclude that gender and performance in
other academic areas are the variables that help make such predictions.

Part 3: Multivariate Linear Regression

A priori expectations

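Based on my analysis of univariate and bivariate statistics, I first ran a
regression using the following model:

$$
\begin{aligned}
\text{writing} = {} & \beta_0 + \beta_1\,\text{female} + \beta_2\,\text{reading} + \beta_3\,\text{math} + \beta_4\,\text{science} + \beta_5\,\text{social studies} \\
& + \beta_6\,\text{white} + \beta_7\,\text{low SES} + \beta_8\,\text{middle SES} + \beta_9\,\text{private} \\
& + \beta_{10}\,\text{general} + \beta_{11}\,\text{vocational} + \varepsilon
\end{aligned}
$$

$\beta_0$ represents the intercept coefficient, which is hard to interpret and in
this case will be relatively meaningless, so I have no a priori expectation for it.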

The female variable’s coefficient will be $\beta_1$, and I expect it to be positive
based on the T-test involving gender and writing scores from part 2. I believe this
assumption makes sense in theory as well, because considerable research has shown
that females outperform males in areas involving verbal skills.

Coefficients $\beta_2$ through $\beta_5$ are the coefficients for reading scores,
math scores, science scores, and social studies scores respectively. I expect all of
these coefficients to be positive based on the correlation analysis from part 2, and
in theory it seems that all scores would rise together.

The coefficient on the dummy variable indicating white or non-white is
represented by $\beta_6$. I expect that this coefficient will also be positive
because the T-test in part 2 indicated that whites on average outperform nonwhites
in writing.

Coefficients $\beta_7$ and $\beta_8$ are on the dummy variables for low
socioeconomic status and middle socioeconomic status respectively. I have left off
high socioeconomic status, which makes it the base case to which the other two
variables are compared. Therefore, I expect both of these coefficients to be
negative, as suggested by our T-test in part 2 and the general theory that high
socioeconomic status usually results in higher test scores.

Coefficient $\beta_9$ captures the impact of attending private school or not. Since
the dummy variable is 1 when the student attended private school, I expect to
see a positive coefficient because of previous analysis and my belief that on
average those in private school outperform students in public school on tests.

The last two coefficients, $\beta_{10}$ and $\beta_{11}$, attempt to capture the
effects of the general program and the vocational program respectively. Here I
have chosen to omit the academic program, making it our base case for
comparison. For this reason I anticipate both coefficients to be negative, because in
part 2 a standard T-test showed that on average students in academic programs
outperformed all others in writing scores.
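
The base-case coding described above can be illustrated with a short Python sketch. The function name, argument names, and category labels here are hypothetical and chosen only for illustration; they are not taken from the data set. The point is that the omitted categories (high socioeconomic status and the academic program) are represented by zeros in all of their related dummy variables.

```python
def encode_student(female, race, ses, school, program):
    """Return the dummy variables for one student.

    High SES and the academic program are the omitted base cases,
    so a student in both gets zeros for the four related dummies.
    All labels ("low", "middle", "general", ...) are illustrative.
    """
    return {
        "female": 1 if female else 0,
        "white": 1 if race == "white" else 0,
        "low_ses": 1 if ses == "low" else 0,
        "mid_ses": 1 if ses == "middle" else 0,
        "private": 1 if school == "private" else 0,
        "general": 1 if program == "general" else 0,
        "vocational": 1 if program == "vocational" else 0,
    }


# A student in both base cases: every SES and program dummy is 0.
base_case = encode_student(False, "white", "high", "public", "academic")
print(base_case)

# A student outside the base cases switches on exactly one dummy per group.
other = encode_student(True, "nonwhite", "low", "private", "vocational")
print(other)
```

Because each coefficient is measured relative to the omitted category, a negative coefficient on the low-SES dummy would mean a lower expected writing score than an otherwise identical high-SES student.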

Reevaluation dropping and changing variables

I ran the regression using the above model and got some expected results
but also many unexpected ones, all of which are summarized in table 1 under
model 1. As I expected, the coefficient on female was highly significant and
positive. Additionally, all the coefficients on the other test scores were positive;
only the reading score was not significant, and even its p-value was barely above
the .05 threshold. No other variable was found to be significant, however, and many
had signs opposite to what I had expected. Since I expected other variables to be
important, I checked for collinearity using variance inflation factors (VIF) to see
whether that could be causing some of the unexpected results. The VIF check
indicated that the variables for the other test scores were collinear with one
another. Since they all appear to be equally important, instead of using one as a
proxy for all the others I decided to make a new variable that was the average of
reading score, math score, science score, and social studies score. This way I could
replace the four test score variables with the one test average variable I created.
I ran the regression again using the updated model, expecting to see some changes
in the significance of some of my variables.
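
The collinearity check and the averaging step can be sketched in Python. This is only an illustration under stated assumptions: the scores below are made-up toy values, not the study's data; the function and variable names are hypothetical; and the VIF is computed from its textbook definition, 1 / (1 - R^2) from regressing each predictor on the others, which may differ in detail from the routine used for the analysis.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def r_squared(y, xs):
    """R-squared from an OLS fit of y on the columns in xs plus an intercept."""
    n = len(y)
    cols = [[1.0] * n] + xs
    xtx = [[sum(p * q for p, q in zip(c1, c2)) for c2 in cols] for c1 in cols]
    xty = [sum(p * q for p, q in zip(c, y)) for c in cols]
    beta = solve(xtx, xty)
    fitted = [sum(b * c[i] for b, c in zip(beta, cols)) for i in range(n)]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def vif(columns):
    """VIF per predictor: 1 / (1 - R^2) from regressing it on the others."""
    result = {}
    for name, col in columns.items():
        others = [c for k, c in columns.items() if k != name]
        result[name] = 1.0 / (1.0 - r_squared(col, others))
    return result


# Made-up toy scores for eight students (NOT the study's data).
scores = {
    "read":    [57, 68, 44, 63, 47, 44, 50, 34],
    "math":    [41, 53, 54, 47, 57, 51, 42, 45],
    "science": [47, 63, 58, 53, 53, 63, 53, 39],
    "socst":   [57, 61, 31, 56, 61, 61, 61, 36],
}

vifs = vif(scores)
print(vifs)

# Replace the four score variables with a single per-student average.
test_avg = [sum(vals) / len(vals) for vals in zip(*scores.values())]
print(test_avg)
```

A VIF near 1 indicates a predictor that is almost unrelated to the others, while larger values indicate that much of its variation is already explained by the remaining predictors. Averaging the four scores removes that overlap at the cost of no longer estimating a separate effect for each subject.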

However, no new variables became significant as a result of the change in the
model other than the newly created test average variable. The details are
summarized in table 1 under model 2. White vs. nonwhite, socioeconomic status,
private vs. public, and type of program all remained statistically insignificant.
Therefore, I decided to drop all the variables except for female and test average.
The reason these other variables can be omitted and the reason they are
statistically insignificant is that information about other test scores is a far better
indicator of how a student will do on a writing test. Multivariate linear regression
shows the effect of each variable holding all other variables constant.
Although we would expect high socioeconomic students to perform better on
average on a writing test, we would in fact expect them to do
better on all tests, which I confirm with a T-test performed in
retrospect in part 2. We would not expect a student classified as having high
socioeconomic status to do even better on writing tests than on tests in other
subjects. That would have to be the case in order to observe a negative significant
coefficient for the low socioeconomic and middle socioeconomic variables.

The same can be said for the other dropped variables, since they all likely
impact overall test scores and not just writing scores. In other words, if two
students have identical test scores in all other subjects, it is unlikely that just
because a student is white or attends private school he will outperform his
counterpart in writing alone. Gender remains significant because, according to our
T-tests, females and males differ significantly in writing scores but do not differ
significantly in average test scores. In this instance, if two students, one male
and one female, had identical other test scores, the female would be expected to
outperform her counterpart on a writing test.

Final Model

The final model ends up being rather simple, with only two independent
variables: average test score and gender.

Both variables had their expected signs and were highly significant using the
current model. The final regression had an R-squared of 0.60, which means about
60% of the variability in a student’s writing score can be explained by the model.
The intercept coefficient was 6.58 but this has no practical interpretation.

The coefficient for the variable representing gender was approximately 5.50,
meaning that, all else held constant, a female would be expected to score 5.50
points higher on the writing test. A 95% confidence interval suggested it could be
as high as 7.20 additional points and as low as 3.81 additional points.

The coefficient for the variable of average test score was approximately 0.83,
meaning that, all else held constant, a 1.00 point increase in average test score
would be expected to result in a 0.83 point increase in writing score. A 95%
confidence interval suggested it could be as high as 0.93 additional points and as
low as 0.73 additional points.
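
As a quick consistency check on the two intervals, a symmetric confidence interval is centred on its point estimate, so the midpoint of each reported interval should match the reported coefficient up to rounding. The sketch below assumes the intervals are the usual symmetric t-based ones, which the text does not state explicitly; the helper name is hypothetical.

```python
def midpoint_and_half_width(low, high):
    """Return the centre and half-width of the interval (low, high)."""
    return (low + high) / 2, (high - low) / 2


# Interval bounds as reported in the text.
female_mid, female_half = midpoint_and_half_width(3.81, 7.20)
avg_mid, avg_half = midpoint_and_half_width(0.73, 0.93)

print(round(female_mid, 3), round(female_half, 3))
print(round(avg_mid, 3), round(avg_half, 3))
```

The midpoints come out at about 5.505 and 0.83, in line with the reported coefficients of approximately 5.50 and 0.83, and neither interval contains zero, which is consistent with both variables being significant at the 5% level.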

Based on this information, an equation for predicting writing scores is as
follows:

$$\widehat{\text{writing}} = 6.58 + 5.50\,\text{female} + 0.83\,\text{test average}$$
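
Using the coefficients reported in this section (intercept 6.58, 5.50 for female, and 0.83 for average test score), the prediction can be sketched as a small Python function. The function and constant names are hypothetical, and the test average of 50 in the usage lines is an arbitrary illustrative input rather than a value from the data.

```python
# Rounded coefficients as reported for the final model.
INTERCEPT = 6.58
COEF_FEMALE = 5.50
COEF_TEST_AVG = 0.83


def predict_writing(female, test_avg):
    """Predicted writing score from gender and the average of the other test scores."""
    return INTERCEPT + COEF_FEMALE * (1 if female else 0) + COEF_TEST_AVG * test_avg


# Two students with the same (illustrative) test average of 50.
male_score = predict_writing(False, 50.0)
female_score = predict_writing(True, 50.0)
print(round(male_score, 2), round(female_score, 2))
```

For any fixed test average, the two predictions differ by exactly the female coefficient, which is the "all else held constant" interpretation given above. Since the model explains about 60% of the variability in writing scores, individual predictions should be read as expected values rather than precise forecasts.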
