
Regression Analysis: How Do I Interpret R-squared and Assess the Goodness-of-Fit?


Jim Frost 30 May, 2013

After you have fit a linear model using regression analysis, ANOVA, or design of experiments (DOE), you need to
determine how well the model fits the data. To help you out, Minitab statistical software presents a variety of
goodness-of-fit statistics. In this post, we'll explore the R-squared (R2) statistic, some of its limitations, and some
surprises along the way. For instance, low R-squared values are not always bad and high R-squared values are
not always good!

What Is Goodness-of-Fit for a Linear Model?

Definition: Residual = Observed value - Fitted value

Linear regression calculates an equation that minimizes the distance between the fitted line and all of the data points.
Technically, ordinary least squares (OLS) regression minimizes the sum of the squared residuals.

In general, a model fits the data well if the differences between the observed values and the model's predicted values
are small and unbiased.

Before you look at the statistical measures for goodness-of-fit, you should check the residual plots. Residual plots can
reveal unwanted residual patterns that indicate biased results more effectively than numbers. When your residual
plots pass muster, you can trust your numerical results and check the goodness-of-fit statistics.
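
For concreteness, here is a minimal sketch of the fitting-and-checking workflow described above. It uses Python's statsmodels and NumPy rather than Minitab, and the data are simulated purely for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 + 1.5 * x + rng.normal(0, 1.0, 50)     # linear signal plus random noise

X = sm.add_constant(x)                         # add the intercept column
model = sm.OLS(y, X).fit()                     # OLS minimizes the sum of squared residuals

residuals = y - model.fittedvalues             # residual = observed value - fitted value
print("Sum of squared residuals:", np.sum(residuals ** 2))

# Before trusting any goodness-of-fit statistic, plot the residuals against the
# fitted values (e.g. plt.scatter(model.fittedvalues, residuals)) and look for
# patterns; unbiased residuals should scatter randomly around zero.
```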

What Is R-squared?

R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the
coefficient of determination, or the coefficient of multiple determination for multiple regression.

The definition of R-squared is fairly straight-forward; it is the percentage of the response variable variation that is
explained by a linear model. Or:

R-squared = Explained variation / Total variation

R-squared is always between 0 and 100%:

0% indicates that the model explains none of the variability of the response data around its mean.
100% indicates that the model explains all the variability of the response data around its mean.
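
The ratio above can be computed directly. The sketch below (Python with statsmodels, simulated data) calculates R-squared from its definition and confirms it matches the value the fitted model reports.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 3.0 + 0.8 * x + rng.normal(0, 2.0, 100)

fit = sm.OLS(y, sm.add_constant(x)).fit()

ss_total = np.sum((y - y.mean()) ** 2)            # total variation around the mean
ss_resid = np.sum((y - fit.fittedvalues) ** 2)    # variation left unexplained by the model
r_squared = 1 - ss_resid / ss_total               # explained variation / total variation

print(round(r_squared, 4), round(fit.rsquared, 4))   # the two values agree
```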

In general, the higher the R-squared, the better the model fits your data. However, there are important conditions for
this guideline that I'll talk about both in this post and my next post.

Graphical Representation of R-squared

Plotting fitted values against observed values graphically illustrates different R-squared values for regression models.

The regression model on the left accounts for 38.0% of the variance while the one on the right accounts for 87.4%.
The more variance that is accounted for by the regression model, the closer the data points will fall to the fitted
regression line. Theoretically, if a model could explain 100% of the variance, the fitted values would always equal the
observed values and, therefore, all the data points would fall on the fitted regression line.
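
The same effect is easy to reproduce with simulated data: the less the points scatter around the line, the higher the R-squared. The values below will not match the 38.0% and 87.4% from the plots; they only illustrate the principle.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 200)
X = sm.add_constant(x)

for noise_sd in (4.0, 1.0):                             # noisy data first, then tight data
    y = 1.0 + 1.0 * x + rng.normal(0, noise_sd, 200)
    r2 = sm.OLS(y, X).fit().rsquared
    print(f"noise sd = {noise_sd}: R-squared = {r2:.3f}")   # smaller noise -> higher R-squared
```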

Key Limitations of R-squared

R-squared cannot determine whether the coefficient estimates and predictions are biased, which is why you must
assess the residual plots.

R-squared does not indicate whether a regression model is adequate. You can have a low R-squared value for a good
model, or a high R-squared value for a model that does not fit the data!

The R-squared in your output is a biased estimate of the population R-squared.

Are Low R-squared Values Inherently Bad?

No! There are two major reasons why it can be just fine to have low R-squared values.

In some fields, it is entirely expected that your R-squared values will be low. For example, any field that attempts to
predict human behavior, such as psychology, typically has R-squared values lower than 50%. Humans are simply
harder to predict than, say, physical processes.

Furthermore, if your R-squared value is low but you have statistically significant predictors, you can still draw
important conclusions about how changes in the predictor values are associated with changes in the response value.
Regardless of the R-squared, the significant coefficients still represent the mean change in the response for one unit
of change in the predictor while holding other predictors in the model constant. Obviously, this type of information
can be extremely valuable.
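
A short simulation makes the point, as a sketch only: even when the noise swamps the signal and R-squared is low, the estimated slope is close to the true value and statistically significant, so its interpretation as the mean change in the response per one-unit change in the predictor still stands.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 500)
y = 10.0 + 0.4 * x + rng.normal(0, 5.0, 500)   # weak signal, lots of noise

fit = sm.OLS(y, sm.add_constant(x)).fit()
print("R-squared:     ", round(fit.rsquared, 3))   # low, typically well under 0.5
print("slope estimate:", round(fit.params[1], 3))  # still close to the true 0.4
print("slope p-value: ", fit.pvalues[1])           # typically highly significant
```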

See a graphical illustration of why a low R-squared doesn't affect the interpretation of significant variables.

A low R-squared is most problematic when you want to produce predictions that are reasonably precise (have a small
enough prediction interval). How high should the R-squared be for prediction? Well, that depends on your
requirements for the width of a prediction interval and how much variability is present in your data. While a high R-
squared is required for precise predictions, it's not sufficient by itself, as we shall see.

Are High R-squared Values Inherently Good?

No! A high R-squared does not necessarily indicate that the model has a good fit. That might be a surprise, but look
at the fitted line plot and residual plot below. The fitted line plot displays the relationship between semiconductor
electron mobility and the natural log of the density for real experimental data.

The fitted line plot shows that these data follow a nice tight function and the R-squared is 98.5%, which sounds great.
However, look closer to see how the regression line systematically over and under-predicts the data (bias) at different
points along the curve. You can also see patterns in the Residuals versus Fits plot, rather than the randomness that
you want to see. This indicates a bad fit, and serves as a reminder as to why you should always check the residual
plots.

This example comes from my post about choosing between linear and nonlinear regression. In this case, the answer is
to use nonlinear regression because linear models are unable to fit the specific curve that these data follow.

However, similar biases can occur when your linear model is missing important predictors, polynomial terms, and
interaction terms. Statisticians call this specification bias, and it is caused by an underspecified model. For this type of
bias, you can fix the residuals by adding the proper terms to the model.
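
Here is a small sketch of specification bias with simulated data: a straight line fitted to gently curved data can still post a high R-squared, but its residuals show a systematic pattern; adding the missing squared term removes the bias.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 200)
y = 1.0 + 2.0 * x + 0.5 * x ** 2 + rng.normal(0, 2.0, 200)

# Underspecified model: intercept and x only
under = sm.OLS(y, sm.add_constant(x)).fit()
print("linear-only R-squared:", round(under.rsquared, 3))   # high despite the bias

# Properly specified model: add the x^2 term
X_full = sm.add_constant(np.column_stack([x, x ** 2]))
full = sm.OLS(y, X_full).fit()
print("with x^2 term R-squared:", round(full.rsquared, 3))

# Plotting residuals vs. fitted values for each model shows curvature for the
# first fit (bias) and random scatter around zero for the second.
```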

For more information about how a high R-squared is not always a good thing, read my post Five Reasons Why Your
R-squared Can Be Too High.

Closing Thoughts on R-squared

R-squared is a handy, seemingly intuitive measure of how well your linear model fits a set of observations. However,
as we saw, R-squared doesn't tell us the entire story. You should evaluate R-squared values in conjunction with
residual plots, other model statistics, and subject area knowledge in order to round out the picture (pardon the pun).

While R-squared provides an estimate of the strength of the relationship between your model and the response
variable, it does not provide a formal hypothesis test for this relationship. The F-test of overall significance determines
whether this relationship is statistically significant.
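
As a rough sketch (Python's statsmodels rather than Minitab, simulated data), the F-statistic and its p-value are reported alongside R-squared on any fitted model:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 100)
y = 2.0 + 1.0 * x + rng.normal(0, 3.0, 100)

fit = sm.OLS(y, sm.add_constant(x)).fit()
print("R-squared:      ", round(fit.rsquared, 3))
print("F-statistic:    ", round(fit.fvalue, 2))   # overall test that the model explains
print("F-test p-value: ", fit.f_pvalue)           # a significant amount of variation
```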

In my next blog, we'll continue with the theme that R-squared by itself is incomplete and look at two other types of R-
squared: adjusted R-squared and predicted R-squared. These two measures overcome specific problems in order to
provide additional information by which you can evaluate your regression model's explanatory power.

Coefficient of Determination

The coefficient of determination (denoted by R2) is a key output of regression analysis. It is interpreted
as the proportion of the variance in the dependent variable that is predictable from the independent
variable.

The coefficient of determination is the square of the correlation (r) between predicted y scores
and actual y scores; thus, it ranges from 0 to 1.
With linear regression, the coefficient of determination is also equal to the square of the
correlation between x and y scores.
An R2 of 0 means that the dependent variable cannot be predicted from the independent variable.
An R2 of 1 means the dependent variable can be predicted without error from the independent
variable.
An R2 between 0 and 1 indicates the extent to which the dependent variable is predictable. An
R2 of 0.10 means that 10 percent of the variance in Y is predictable from X; an R2 of 0.20 means
that 20 percent is predictable; and so on.
The formula for computing the coefficient of determination for a linear regression model with one
independent variable is given below.

Coefficient of determination. The coefficient of determination (R2) for a linear regression model with
one independent variable is:

R² = { ( 1 / N ) * Σ [ ( xᵢ − x̄ ) * ( yᵢ − ȳ ) ] / ( σx * σy ) }²

where N is the number of observations used to fit the model, Σ is the summation symbol, xᵢ is the x value
for observation i, x̄ is the mean x value, yᵢ is the y value for observation i, ȳ is the mean y value, σx is the
standard deviation of x, and σy is the standard deviation of y.
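
As a quick check, the sketch below (Python, simulated data) fits a one-predictor model and confirms that its R-squared equals the square of the Pearson correlation between x and y, as the formula implies.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import pearsonr

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 80)
y = 4.0 - 0.7 * x + rng.normal(0, 1.5, 80)

fit = sm.OLS(y, sm.add_constant(x)).fit()
r, _ = pearsonr(x, y)

print("model R-squared: ", round(fit.rsquared, 4))
print("correlation r^2: ", round(r ** 2, 4))      # matches the model R-squared
```
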
Correlation
Correlation is a statistical technique that can show whether and how strongly pairs of variables are
related. For example, height and weight are related; taller people tend to be heavier than shorter people.
The relationship isn't perfect. People of the same height vary in weight, and you can easily think of two
people you know where the shorter one is heavier than the taller one. Nonetheless, the average weight of
people 5'5'' is less than the average weight of people 5'6'', and their average weight is less than that of
people 5'7'', etc. Correlation can tell you just how much of the variation in peoples' weights is related to
their heights.

Although this correlation is fairly obvious, your data may contain unsuspected correlations. You may also
suspect there are correlations, but don't know which are the strongest. An intelligent correlation analysis
can lead to a greater understanding of your data.

Techniques in Determining Correlation

There are several different correlation techniques. The Survey System's optional Statistics Module
includes the most common type, called the Pearson or product-moment correlation. The module
also includes a variation on this type called partial correlation. The latter is useful when you want to look
at the relationship between two variables while removing the effect of one or two other variables.

Like all statistical techniques, correlation is only appropriate for certain kinds of data. Correlation works
for quantifiable data in which numbers are meaningful, usually quantities of some sort. It cannot be used
for purely categorical data, such as gender, brands purchased, or favorite color.
Rating Scales

Rating scales are a controversial middle case. The numbers in rating scales have meaning, but that
meaning isn't very precise. They are not like quantities. With a quantity (such as dollars), the difference
between 1 and 2 is exactly the same as between 2 and 3. With a rating scale, that isn't really the case.
You can be sure that your respondents think a rating of 2 is between a rating of 1 and a rating of 3, but
you cannot be sure they think it is exactly halfway between. This is especially true if you labeled the mid-
points of your scale (you cannot assume "good" is exactly half way between "excellent" and "fair").

Most statisticians say you cannot use correlations with rating scales, because the mathematics of the
technique assume the differences between numbers are exactly equal. Nevertheless, many survey
researchers do use correlations with rating scales, because the results usually reflect the real world. Our
own position is that you can use correlations with rating scales, but you should do so with care. When
working with quantities, correlations provide precise measurements. When working with rating scales,
correlations provide general indications.

Correlation Coefficient

The main result of a correlation is called the correlation coefficient (or "r"). It ranges from -1.0 to +1.0.
The closer r is to +1 or -1, the more closely the two variables are related.

If r is close to 0, it means there is little or no linear relationship between the variables. If r is positive, it means that as
one variable gets larger the other gets larger. If r is negative it means that as one gets larger, the other
gets smaller (often called an "inverse" correlation).

While correlation coefficients are normally reported as r = (a value between -1 and +1), squaring them
makes them easier to understand. The square of the coefficient (r squared) equals the percent of the
variation in one variable that is related to the variation in the other, so read the squared value as a
percentage: an r of .5 means 25% of the variation is related (.5 squared = .25), and an r of .7 means 49%
of the variance is related (.7 squared = .49).

A correlation report can also show a second result of each test - statistical significance. In this case, the
significance level will tell you how likely it is that the correlations reported may be due to chance in the
form of random sampling error. If you are working with small sample sizes, choose a report format that
includes the significance level. This format also reports the sample size.
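
The sketch below shows the quantities discussed above: the correlation coefficient r, its square read as a percentage, and the significance level (p-value) reported with it. It uses Python's scipy rather than the Survey System, and the height/weight numbers are invented.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(7)
height = rng.normal(66, 3, 40)                         # inches, illustrative only
weight = 5.0 * height - 180 + rng.normal(0, 20, 40)    # pounds, loosely tied to height

r, p_value = pearsonr(height, weight)
print("r =", round(r, 2))
print("percent of variation related:", round(100 * r ** 2), "%")   # r squared as a percentage
print("significance (p-value):", p_value)   # chance of an r this large from sampling error alone
```
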
A key thing to remember when working with correlations is never to assume a correlation means that a
change in one variable causes a change in another. Sales of personal computers and athletic shoes have
both risen strongly in the last several years and there is a high correlation between them, but you cannot
assume that buying computers causes people to buy athletic shoes (or vice versa).

The second caveat is that the Pearson correlation technique works best with linear relationships: as one
variable gets larger, the other gets larger (or smaller) in direct proportion. It does not work well with
curvilinear relationships (in which the relationship does not follow a straight line). An example of
a curvilinear relationship is age and health care. They are related, but the relationship doesn't follow a
straight line. Young children and older people both tend to use much more health care than teenagers or
young adults. Multiple regression (also included in the Statistics Module) can be used to examine
curvilinear relationships, but it is beyond the scope of this article.
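
As an illustration only (invented age and health-care numbers), the sketch below shows a U-shaped relationship where the Pearson r is near zero even though the variables are clearly related, while a regression that includes a squared age term captures the curve well.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import pearsonr

rng = np.random.default_rng(8)
age = rng.uniform(5, 85, 300)
care_visits = 30 - 0.9 * age + 0.01 * age ** 2 + rng.normal(0, 2.0, 300)   # U-shaped in age

r, _ = pearsonr(age, care_visits)
print("Pearson r:", round(r, 2))                        # near zero despite a real relationship

X = sm.add_constant(np.column_stack([age, age ** 2]))   # add a curvilinear (squared) term
fit = sm.OLS(care_visits, X).fit()
print("R-squared with age^2 term:", round(fit.rsquared, 2))
```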
