
FRESHMEN GPA IN GPA & RETENTION MODELING

Jesse Harden
Department of Mathematics and Statistics
Radford University

Author's Note

This work was created as Jesse Harden's Research Paper as part of an
Independent Study of the R Programming Language with Dr. Yong Xu at Radford
University.
Correspondence concerning this article should be directed to
Jesse Harden
Mathematics Major, Statistics Concentration
Radford University
801 E. Main Street
Radford, VA 24142
Email: jharden@radford.edu
Phone: (757)-560-7939
or
Yong Xu
Department of Mathematics and Statistics
Radford University
801 E. Main Street
Radford, VA 24142
Email: yxu10@radford.edu
Phone: (540)-831-6442

Abstract

Predicting how likely a student is to be retained and graduate, as well as
their Cumulative GPA upon graduation, is important to many colleges when
selecting students, both for freshman enrollment and for retention
purposes. This paper creates a predictive model for Freshman GPA based on
traditional measures such as High School GPA and SAT scores of students
who enter as freshmen, as well as using all of these measures to predict both
whether or not a student will graduate from their institution and what their
Cumulative GPA upon graduation will be. In this study, data from Radford
University undergraduate students who entered Radford University as
freshmen since Fall 2000, including High School GPA, SAT Score, Freshman
GPA, and Cumulative GPA upon graduation were taken in order to represent
the population of all undergraduate students in America who enter as
freshmen into public, mid-size universities. Any extrapolations to all
undergraduates should be done with some caution. For the GPA Modeling,
students who did not graduate were removed from the sample, and the time
in which they entered was ignored. This subset was then used to construct
several simple, and eventually multiple, linear regression models. For
Retention Modeling, High School GPA & Freshman GPA were used to create
logistic models of whether or not a student graduated from the
institution they entered as a freshman. As a result of my studies, I have
concluded that Freshman GPA seems to be a stronger predictor of
Cumulative GPA than the traditional measures, and that the traditional GPA
estimates, as well as Freshman GPA, are not very useful for Retention
Modeling.

Keywords: regression, freshman, retention, GPA

Introduction
When predicting a prospective student's Cumulative GPA, university
officials often consider two variables: the student's High School GPA and SAT
Score. Several models of prediction already exist, such as Astin & Oseguera's
simple linear model (0.077(High School GPA) + 0.000322(SAT Score)
- 0.2092) (Campbell, 2008), and Freshman GPA is often used to test the
validity of measures involved in admissions decisions (Wilson, 1983, p. 1).
With this in mind, I decided to investigate whether or not Freshman GPA is a
better predictor of Cumulative GPA than the usual measures, namely High
School GPA and SAT Scores. While such research may not be of interest to
admissions, it is certainly valuable to anyone looking to locate students who
need assistance or who are likely to be successful.
Of course, admissions offices are also often concerned with admitting
applicants who will eventually graduate from their institution. Thus, I also
wanted to test if there was a significant relationship between traditional GPA
predictors, as well as Freshman GPA, and whether or not a student
graduates, as well as how long it takes them to graduate.
Therefore, this paper's focus is on both determining a workable model
for predicting Cumulative GPA with traditional predictors and Freshman GPA
and investigating any potential relationship between GPA Predictors
and Retention Measures, specifically whether or not a student graduated
from their first college, and how many traditional semesters (Fall/Spring) it
took them to graduate.
This study has been broken down into two parts; the first part focuses
on building models to predict Cumulative GPA, while the second part deals
with retention measures and the traditional GPA predictors.

PART 1
Notes on Data Used for Study Part 1
Data for this study was taken from all Radford University
undergraduate students who entered as freshmen since Fall 2000, and thus
is not truly random. Furthermore, only the subset of students who also
graduated from Radford University, entered in a Fall Semester, and had a
High School GPA & SAT Score on file was used in this part of the study.
Caution should be taken when extrapolating any conclusions, especially to
other universities.
The variables studied from the data include High School GPA (HS.GPA),
SAT Scores (SAT), and Freshman GPA (FMGPA), which are used to predict
Cumulative GPA (CUM.GPA).

Process Used in Study Part 1


After uploading the data, I prepared my models and analysis using the
following steps in the order specified:
1. Normality Testing: I first tested to make sure that the data values came
from approximately normal distributions.
2. Simple Linear Regression Models: I ran simple linear regression models
for predicting Cumulative GPA using SAT Scores, High School GPA, and
Freshman GPA alone, testing and accounting for any instances of
non-constant variance.
3. Multicollinearity Check: I checked for multicollinearity between SAT
Scores, High School GPA, and Freshman GPA using a regression model.
4. Multiple Regression Model: I constructed several multiple regression
models for predicting Cumulative GPA using High School GPA, SAT
Scores, and Freshman GPA, and then compared the correlation coefficients
and variable significance of each model.
All processes were completed using RStudio.

Normality Testing for Part 1


Using the qqnorm() function in R, I created QQ-Plots for High School GPA,
SAT Scores, Freshman GPA, and Cumulative GPA, as seen below:

It is clear from the plots that the normality assumption is
adequately satisfied for all four variables.
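A visual reading of a QQ-Plot can be backed by a number. The sketch below is an illustration only: it uses simulated values (the name fakeGpa and its parameters are assumptions, not the study data), adds a reference line with qqline(), and computes the correlation between theoretical and sample quantiles. Values close to 1 correspond to a nearly straight QQ-Plot.

```r
# Simulated stand-in for one study variable; NOT the Radford data.
set.seed(1)
fakeGpa <- rnorm(1000, mean = 3.0, sd = 0.4)

# QQ-plot with a reference line through the first and third quartiles.
qqnorm(fakeGpa, main = "QQ fakeGpa")
qqline(fakeGpa)

# Numeric summary: correlation between theoretical and sample quantiles.
qq <- qqnorm(fakeGpa, plot.it = FALSE)
qqCor <- cor(qq$x, qq$y)
print(qqCor)
```

The same three steps could be repeated for HS.GPA, SAT, FMGPA, and CUM.GPA in the actual data set.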

Simple Linear Regression Models for Part 1


Using the lm() function in R, I created the simple linear regression
models below. Each model was determined to be statistically
significant via testing with the summary() function, with all three variables
producing p-values of much less than 0.001, and was tested for non-constant
variance using the bptest() function of the lmtest package with an alpha of
0.05. Both tests were done using RStudio, with the Freshman GPA and SAT
Score models both exhibiting non-constant variance. To account for this, I
used White's Heteroscedasticity-Corrected Covariance Matrices to re-test
significance, and compared the results to the ones found with summary().
Both models remained significant, and there was no meaningful difference
between the two sets of results. All R^2 values are drawn from the
summary() function output.
CUM.GPA = 1.25125 + 0.56203(HS.GPA), R^2 = 0.2669
CUM.GPA = 2.020 + 0.001008(SAT), R^2 = 0.06728
CUM.GPA = 1.473662 + 0.529696(FMGPA), R^2 = 0.4854
These equations suggest that Freshman GPA may help determine Cumulative
GPA to a greater extent than SAT Scores and High School GPA.
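To make the non-constant variance check and its correction more concrete, the following base-R sketch reproduces both by hand on simulated data (all variable names and values here are assumptions, not the study data). It computes the studentized (Koenker) Breusch-Pagan statistic, which to my understanding is what bptest() reports by default, and an HC3 sandwich covariance matrix, which I believe is the default type in car's hccm().

```r
# Simulated data with variance that grows with x; NOT the study data.
set.seed(2)
n <- 500
x <- runif(n, 1, 4)
y <- 1.5 + 0.5 * x + rnorm(n, sd = 0.1 * x)
fit <- lm(y ~ x)

# Koenker's studentized Breusch-Pagan statistic: n * R^2 from regressing
# the squared residuals on the predictors.
aux <- lm(residuals(fit)^2 ~ x)
bpStat <- n * summary(aux)$r.squared
bpP <- pchisq(bpStat, df = 1, lower.tail = FALSE)

# HC3 sandwich covariance: (X'X)^-1 X' diag(e^2 / (1 - h)^2) X (X'X)^-1
X <- model.matrix(fit)
e <- residuals(fit)
h <- hatvalues(fit)
bread <- solve(crossprod(X))
meat <- crossprod(X * (e / (1 - h)))
hc3 <- bread %*% meat %*% bread

# Compare classical and robust standard errors side by side.
seTable <- cbind(classical = sqrt(diag(vcov(fit))),
                 robust = sqrt(diag(hc3)))
print(c(BP = bpStat, p.value = bpP))
print(seTable)
```

Note that a robust covariance matrix changes only the standard errors and the resulting tests; the coefficient estimates themselves are the same as those from lm().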

Multicollinearity Check for Part 1


Using the lm() function in R, I created simple linear regression models
regressing High School GPA, SAT Scores, and Freshman GPA on one another,
one pair at a time, in order to test whether or not there was a relationship
between them at the 0.05 level of significance. I utilized the aov() (ANOVA)
function in order to perform the test. The results, as seen below, indicate
that every pair of variables is significantly related.
Test of Interaction between Freshman GPA & High School GPA
> summary(aov.hsFm)
               Df Sum Sq Mean Sq F value Pr(>F)
HS.GPA          1    961   961.3    3352 <2e-16 ***
Residuals   11490   3295     0.3
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Test of Interaction between Freshman GPA & SAT Scores
> summary(aov.satFm)
               Df Sum Sq Mean Sq F value Pr(>F)
SAT             1    356   356.1    1049 <2e-16 ***
Residuals   11490   3900     0.3
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Test of Interaction between High School GPA & SAT Scores
> summary(aov.satHs)
               Df Sum Sq Mean Sq F value Pr(>F)
SAT             1  131.1  131.11   773.5 <2e-16 ***
Residuals   11490 1947.6    0.17
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

To compensate for the presence of interaction, interaction terms were
included in the next step, building the multiple regression model.
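A significant pairwise test shows that two predictors are related, but not how much that relationship inflates the uncertainty of a multiple regression. One common supplementary diagnostic, which was not part of this study, is the variance inflation factor: regress each predictor on the others and compute 1 / (1 - R^2). The sketch below does this in base R on simulated predictors (the names hs, sat, and fm and all numeric values are assumptions).

```r
# Simulated, mutually correlated predictors; NOT the study data.
set.seed(3)
n <- 1000
hs  <- rnorm(n, 3.2, 0.4)
sat <- 1000 + 150 * (hs - 3.2) + rnorm(n, sd = 120)
fm  <- 2.9 + 0.6 * (hs - 3.2) + 0.001 * (sat - 1000) + rnorm(n, sd = 0.5)
preds <- data.frame(hs, sat, fm)

# Pairwise correlations between the predictors.
print(round(cor(preds), 3))

# VIF for each predictor: regress it on the others, then 1 / (1 - R^2).
vif <- sapply(names(preds), function(v) {
  r2 <- summary(lm(reformulate(setdiff(names(preds), v), response = v),
                   data = preds))$r.squared
  1 / (1 - r2)
})
print(round(vif, 3))
```

A VIF near 1 means a predictor is nearly unrelated to the others; larger values mean its coefficient is estimated less precisely because of the overlap.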

Multiple Regression Model for Part 1


With the simple linear regressions resulting in significance for High
School GPA, Freshman GPA and SAT Scores, and the multicollinearity check
showing the presence of interaction, I went on to create a multiple regression
model using High School GPA, Freshman GPA, SAT Scores and all of their
interactions, including the three-way interaction, to predict
Cumulative GPA. The multiple regression model and its diagnostics can be
seen below:
CUM.GPA = 3.096 - 0.2771(HS.GPA) - 7.018e-4(SAT) - 0.4634(FMGPA) +
5.841e-5(HS.GPA:SAT) + 0.2212(HS.GPA:FMGPA) +
4.266e-4(SAT:FMGPA) - 6.911e-5(HS.GPA:SAT:FMGPA)

Call:
lm(formula = CUM.GPA ~ (HS.GPA + SAT + FMGPA)^3)

Residuals:
     Min       1Q   Median       3Q      Max
-1.28542 -0.21344  0.01477  0.21785  2.04091

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)       3.096e+00  8.898e-01   3.480 0.000504 ***
HS.GPA           -2.771e-01  2.882e-01  -0.962 0.336269
SAT              -7.018e-04  8.783e-04  -0.799 0.424288
FMGPA            -4.634e-01  2.811e-01  -1.649 0.099236 .
HS.GPA:SAT        5.841e-05  2.834e-04   0.206 0.836727
HS.GPA:FMGPA      2.212e-01  8.852e-02   2.499 0.012474 *
SAT:FMGPA         4.266e-04  2.731e-04   1.562 0.118273
HS.GPA:SAT:FMGPA -6.911e-05  8.556e-05  -0.808 0.419301
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3141 on 11484 degrees of freedom
Multiple R-squared: 0.5395, Adjusted R-squared: 0.5392
F-statistic: 1922 on 7 and 11484 DF, p-value: < 2.2e-16

Clearly, many of the terms in this regression model do not
show statistical significance. To address this, I used the step()
function in R to reduce the model by backward elimination, yielding the
following:
CUM.GPA = 2.393 - 0.05053(HS.GPA) - 9.158e-6(SAT) - 0.2411(FMGPA) -
1.645e-4(HS.GPA:SAT) + 0.1504(HS.GPA:FMGPA) + 2.092e-4(SAT:FMGPA)
Call:
lm(formula = CUM.GPA ~ HS.GPA + SAT + FMGPA + HS.GPA:SAT + HS.GPA:FMGPA +
    SAT:FMGPA)

Residuals:
     Min       1Q   Median       3Q      Max
-1.28387 -0.21349  0.01499  0.21737  2.04705

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.393e+00  1.852e-01  12.924  < 2e-16 ***
HS.GPA       -5.053e-02  6.591e-02  -0.767   0.4433
SAT          -9.158e-06  1.897e-04  -0.048   0.9615
FMGPA        -2.411e-01  5.662e-02  -4.258 2.08e-05 ***
HS.GPA:SAT   -1.645e-04  6.456e-05  -2.548   0.0109 *
HS.GPA:FMGPA  1.504e-01  1.215e-02  12.379  < 2e-16 ***
SAT:FMGPA     2.092e-04  4.601e-05   4.547 5.49e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3141 on 11485 degrees of freedom
Multiple R-squared: 0.5395, Adjusted R-squared: 0.5392
F-statistic: 2242 on 6 and 11485 DF, p-value: < 2.2e-16
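It may help to see why step() can leave terms with large p-values in a model. By default, step() compares models by AIC rather than by p-value, and it will not drop a main effect while an interaction containing it remains. The sketch below runs the same kind of backward search on simulated data (the data frame d and its coefficients are assumptions chosen for illustration, not the study data).

```r
# Simulated data whose true model has no three-way interaction;
# NOT the study data.
set.seed(4)
n <- 2000
d <- data.frame(HS.GPA = rnorm(n, 3.2, 0.4),
                SAT    = rnorm(n, 1000, 130),
                FMGPA  = rnorm(n, 2.9, 0.6))
d$CUM.GPA <- 2.4 - 0.2 * d$HS.GPA + 0.15 * d$HS.GPA * d$FMGPA +
  rnorm(n, sd = 0.3)

full <- lm(CUM.GPA ~ (HS.GPA + SAT + FMGPA)^3, data = d)

# Backward elimination by AIC; trace = 0 suppresses the step-by-step log.
reduced <- step(full, direction = "backward", trace = 0)

print(formula(reduced))
print(c(full = AIC(full), reduced = AIC(reduced)))
```

Because the search only accepts a removal when it lowers AIC, the reduced model's AIC is never higher than the full model's.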

Simply through the removal of the three-way interaction term, the
significance of the remaining terms jumps tremendously. However, both
HS.GPA & SAT appear to be very insignificant outside of interaction with the
other variables, with SAT having a p-value of 0.9615.
Because of this, I decided to try removing SAT from the equation entirely,
which resulted in the much simpler equation seen below:
CUM.GPA = 2.43969 - 0.23628(HS.GPA) - 0.05280(FMGPA) +
0.15843(HS.GPA:FMGPA)
Call:
lm(formula = CUM.GPA ~ (HS.GPA + FMGPA)^2)

Residuals:
     Min       1Q   Median       3Q      Max
-1.28805 -0.21258  0.01537  0.21750  2.00515

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.43969    0.10943  22.294  < 2e-16 ***
HS.GPA       -0.23628    0.03550  -6.656 2.94e-11 ***
FMGPA        -0.05280    0.03498  -1.510    0.131
HS.GPA:FMGPA  0.15843    0.01104  14.355  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3145 on 11488 degrees of freedom
Multiple R-squared: 0.5382, Adjusted R-squared: 0.538
F-statistic: 4462 on 3 and 11488 DF, p-value: < 2.2e-16

This final model doesn't just have nearly the same adjusted R-squared
as the other models; all of its variables exhibit relatively decent significance.
While FMGPA's p-value is greater than 0.1 in this model, it is still far from the
0.4433 or 0.9615 of HS.GPA & SAT from the previous reduced model.
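Because the final model contains an interaction, the negative main-effect coefficients should not be read on their own: the effect of Freshman GPA depends on High School GPA. As a worked example, the sketch below plugs the reported coefficients into two small helper functions (predictCum and slopeFm are hypothetical names introduced here for illustration).

```r
# Coefficients copied from the final model's summary output.
b <- c(intercept = 2.43969, hs = -0.23628, fm = -0.05280, hsfm = 0.15843)

# Predicted Cumulative GPA for given High School and Freshman GPAs.
predictCum <- function(hs, fm) {
  unname(b["intercept"] + b["hs"] * hs + b["fm"] * fm + b["hsfm"] * hs * fm)
}

# Change in predicted Cumulative GPA per extra Freshman GPA point.
slopeFm <- function(hs) unname(b["fm"] + b["hsfm"] * hs)

print(predictCum(hs = 3.0, fm = 3.0))  # 2.99832
print(slopeFm(hs = 3.0))               # 0.42249
```

For a student with a 3.0 High School GPA, each additional Freshman GPA point therefore raises the predicted Cumulative GPA by about 0.42, despite the negative FMGPA coefficient.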

Part 1 Conclusions
All three multiple regression models show a moderate correlation
(R^2 of roughly 0.54), noticeably stronger than that of Astin & Oseguera's
model with an R^2 of 0.319 (Campbell, 2008). However, Freshman GPA alone
resulted in a similar, if slightly lesser, correlation (R^2 = 0.4854) to these
mixed models. This suggests that
Freshman GPA can be used to better predict overall college performance of
students, with High School GPA and SAT Scores adding little, if any, value to
what Freshman GPA already accomplishes. Thus, I recommend that anyone
seeking to locate students based on probable Cumulative GPA start as early
as the end of students' Freshman Year, whether their purpose be to help
underperformers or attract high achievers.

Part 1 Limitations & Concerns


Several concerns should be noted regarding this study. As mentioned
earlier, since all subjects are from one university, extrapolation to other
universities should not be done without further testing. Furthermore, the fact
that the time in which a student entered Radford University was not
considered may have skewed the results. Thus, further testing should
consider blocking or grouping data by semester entered. Finally, Freshman
GPA may be a stronger predictor in part because of how Freshman GPA is
essentially a part of Cumulative GPA. Still, the fact that Freshman GPA
measures academic performance in an ostensibly new environment, college,
is reason enough not to discount it.

PART 2
Notes on Data Used for Study Part 2
Data for this study was taken from all Radford University
undergraduate students who entered as freshmen since Fall 2000, and thus
is not truly random. The data was further divided into three subsets: one is
the same subset used in Part 1, which consists of only students who also
graduated from Radford University, entered in a Fall Semester, and had a
High School GPA & SAT Score on file, while the other two consist of all entries
with a High School GPA who are not still attending and of all entries with a
Freshman GPA who are not still attending, respectively. Again, caution should
be taken when extrapolating any conclusions, especially to other universities.
The variables studied from the sampled observations include High
School GPA (HS.GPA), and Freshman GPA (FMGPA), which are used to predict
whether or not a student graduated from Radford University (DID.GRAD).

Process Used in Study Part 2


After gathering the data, I approached the problem in one step:
1. Logistic Regression Modeling: I ran simple logistic regression models
for predicting whether or not a student graduated from Radford
University using High School GPA and Freshman GPA alone.

Logistic Regression Modeling for Part 2


With normality and variance already tested in Part 1, I moved straight
to creating logistic regression models, using the caTools package to split
the data into training and test sets, and measured the fit of the models with
the McFadden R2 index as seen on an R-Bloggers article guide to logistic
regression in R (Alice, 2015), which resulted in the following values:

DID.GRAD~FMGPA McFadden R2 Index: 6.829991e-02


DID.GRAD~HS.GPA McFadden R2 Index: 2.132800e-02

Given the incredibly small values for fit, further testing was deemed
unnecessary.
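For readers unfamiliar with the index, McFadden's R2 compares the log-likelihood of the fitted model with that of an intercept-only model. The base-R sketch below computes it by hand on simulated data (the variables gpa and grad and their parameters are assumptions, not the study data); as far as I know, this is the quantity pR2() from the pscl package reports under the name McFadden.

```r
# Simulated binary outcome weakly related to a GPA-like predictor;
# NOT the study data.
set.seed(5)
n <- 2000
gpa <- rnorm(n, 2.9, 0.6)
grad <- rbinom(n, 1, plogis(-0.5 + 0.5 * gpa))

fit  <- glm(grad ~ gpa, family = binomial(link = "logit"))
null <- glm(grad ~ 1,   family = binomial(link = "logit"))

# McFadden's pseudo R^2: 1 - logLik(model) / logLik(intercept-only model)
mcFadden <- 1 - as.numeric(logLik(fit)) / as.numeric(logLik(null))
print(mcFadden)
```

This index tends to run much lower than an ordinary least-squares R^2, so the two scales are not directly comparable.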

Part 2 Conclusion

I had hoped that there might be a significant relationship between
traditional GPA measures and retention, but the small fit values suggest that
High School & Freshman GPA are not nearly sufficient to predict retention.
While they may play a part, they do not appear to be
significant variables.

Code
# lmtest provides bptest() and coeftest(); car provides hccm()
library(lmtest)
library(car)

# Load the data and build the Part 1 subset
gpaData <- read.table("gpaModelData.txt", header = TRUE)
gpaDataMod <- subset(gpaData, HS.GPA > 0 & SAT > 0 & COMP.FM > 0 &
                       GRAD > 0)
attach(gpaDataMod)

# QQ-plots for the four study variables
qqnorm(HS.GPA, main="QQ HS.GPA")
qqnorm(SAT, main="QQ SAT")
qqnorm(FMGPA, main="QQ FMGPA")
qqnorm(CUM.GPA, main="QQ CUM.GPA")

# Simple linear regressions, each followed by a Breusch-Pagan test
hsRegCum <- lm(CUM.GPA~HS.GPA)
summary(hsRegCum)
bptest(hsRegCum)

fmRegCum <- lm(CUM.GPA~FMGPA)
summary(fmRegCum)
bptest(fmRegCum)
# Re-test with White's heteroscedasticity-corrected covariance matrix
coeftest(fmRegCum,vcov=hccm(fmRegCum))

satRegCum <- lm(CUM.GPA~SAT)
summary(satRegCum)
bptest(satRegCum)
coeftest(satRegCum,vcov=hccm(satRegCum))

# Pairwise tests of the relationships between the predictors
aov.hsFm <- aov(FMGPA~HS.GPA)
summary(aov.hsFm)

aov.satFm <- aov(FMGPA~SAT)
summary(aov.satFm)

aov.satHs <- aov(HS.GPA~SAT)
summary(aov.satHs)

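# Full model with all two-way and three-way interactions
fullMR <- lm(CUM.GPA~(HS.GPA+SAT+FMGPA)^3)
summary(fullMR)

# Backward elimination
reduce_fullMR <- step(fullMR, direction="backward")
summary(reduce_fullMR)

# Final model without SAT
hsFmMR <- lm(CUM.GPA~(HS.GPA+FMGPA)^2)
summary(hsFmMR)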

# Part 2: logistic regression of graduation on High School GPA
gpaDataMod2 <- subset(gpaData, HS.GPA > 0 & DID.GRAD >=0)
library(caTools)
library(pscl)
set.seed(101)
# sample.split() expects the outcome vector, not the whole data frame
inTrain <- sample.split(gpaDataMod2$DID.GRAD, SplitRatio = .75)
train <- subset(gpaDataMod2, inTrain == TRUE)
test <- subset(gpaDataMod2, inTrain == FALSE)
model <- glm(DID.GRAD~HS.GPA,family=binomial(link='logit'),data=train)
summary(model)
anova(model, test="Chisq")
pR2(model)  # includes the McFadden index

# Part 2: logistic regression of graduation on Freshman GPA
gpaDataMod3 <- subset(gpaData, COMP.FM > 0 & DID.GRAD >=0)
library(caTools)
library(pscl)
set.seed(101)
# sample.split() expects the outcome vector, not the whole data frame
inTrain2 <- sample.split(gpaDataMod3$DID.GRAD, SplitRatio = .75)
train2 <- subset(gpaDataMod3, inTrain2 == TRUE)
test2 <- subset(gpaDataMod3, inTrain2 == FALSE)
model2 <- glm(DID.GRAD~FMGPA,family=binomial(link='logit'),data=train2)
summary(model2)
anova(model2, test="Chisq")
pR2(model2)  # includes the McFadden index

References
Alice, M. (2015, September 13). How to perform a logistic regression in R. Retrieved
from R-Bloggers: http://www.r-bloggers.com/how-to-perform-a-logistic-regression-in-r/
Campbell, J. (2008). Analysis of institutional data in predicting student retention
utilizing knowledge discovery and statistical techniques. Northern Arizona
University. Retrieved from https://books.google.com/books?id=VajhPWlLwvsC
Wilson, K. M. (1983). A review of research on the prediction of academic
performance after the freshman year. Retrieved from Research - The College
Board Homepage:
https://research.collegeboard.org/sites/default/files/publications/2012/7/researchreport-1983-2-prediction-performance-after-freshman-year.pdf
