You are on page 1of 6

Quick and Dirty Regression Tutorial

The statistical procedure known as linear regression basically involves drawing and analyzing
trend-lines through data points. Economists use regression analysis to test hypotheses, derived
from economic theory, against real-world data.
In your first microeconomics class you saw theoretical demand schedules (Figure 1) showing
that if price increases, the quantity demanded ought to decrease. But when we collect market
data to actually test this theory, the data may exhibit a trend, but they are "noisy" (Figure 2).

Drawing a trendline through datapoints
To analyze the empirical relationship between price and quantity, download and open the Excel
spreadsheet with the data. Right-click on the spreadsheet chart to open a chart window, and
print off a full-page copy of the chart (same as the one shown in Figure 2). Using a pencil and
straightedge, eyeball and then draw a straight line through the cloud of points that best fits the
overall trend. Extend this line to both axes. Now calculate the values of intercept A and slope B
of the linear equation that represents the trend-line Price =A +B*Quantity
Although it is standard practice to graph supply and demand with Price on the Y-axis and
Quantity on the X-axis, economists more often consider demand Quantity to be the "dependent"
variable influenced by the "independent" variable Price. To obtain a more conventional demand
equation, invert your equation, solving for intercept and slope coefficients a and b, where
Quantity =a +b*Price. Technically, since this "empirical" (i.e., data-derived) demand model
doesn't fit through the data points exactly, it ought to be written as Quantity =a +b*Price +e
where e is the residual "unexplained" variation in the Quantity variable (the deviations of the
actual Quantity data points from the estimated regession line that you drew through them).
That's basically what linear regression is about: fitting trend lines through data to analyze
relationships between variables. Since doing it by hand is imprecise and tedious, most
economists and statisticians prefer to...
Fitting a trendline in an XY-scatterplot
MS-Excel provides two methods for fitting the best-fitting trend-line through data points, and
calculating that line's slope and intercept coefficients. The standard criterion for "best fit" is the
trend line that minimizes the sum of the squared vertical deviations of the data points from the
fitted line. This is called the ordinary least-squares (OLS) regression line. (If you got a bunch of
people to fit regression lines by hand and averaged their results, you would get something very
close to the OLS line.)
The easiest way to plot a trend line and calculate a single-variable regression equation is to right-
click on the data points in an Excel XY plot and select "Add Trendline." Under the "Options" tab
check "Display equation on chart" and click "OK." How well do this trend line and calculated
slope and intercept coefficients match the line you drew and the slope and intercept that you
calculated?

Using Excel's Regression utility (Data Analysis tools)
Excel also includes a formal regression utility in its Analysis ToolPak that provides statistics
indicating goodness-of-fit and confidence intervals for slope and intercept coefficients. This
utility lets you regress one dependent "left-hand-side" (of the equal sign) variable against one or
several independent "right-hand side" variables, and it provides useful indicators about the
statistical reliability of your model.
Excel's Regression procedure is one of the Data Analysis tools. If you don't see it, you need to
activate the Analysis ToolPak. Click the Windows symbol or the File menu, choose Options--
Add-Ins, select Analysis ToolPak (not Analysis ToolPak VBA) and click "Go..." Check the
Analysis TookPak checkbox and "OK." You will find "Data Analysis" on the right end of the
"Data" menu.

The only things you are required to specify are...
(a) one column of numbers as the Y Range, aka the dependent variable, "left-hand-side" variable
or endogenous variable whose variation is to be "explained" by the regression model;
(b) one or several adjacent columns of numbers as the X Range, aka the independent variables,
right-hand side (of the equals sign) variables, exogenous variables or "explanatory" variables;
(c) the upper-left corner of a blank range of cells in your spreadsheet where the results will be
printed.
The X and Y ranges must contain the same number of rows, all numeric data, no missing values.

Here is output from Excel's regression utility replicating the regression of Price (Y range) against
Quantity (X range). At the bottom of the output you can see the same Intercept and Quantity
slope coefficients that are shown for the trend line in the XY plot above. This empirical inverse
demand model, written out in equation form, is P = 13.675 - 0.1664*Q + e. Other parts of the
output are explained below.)
Try specifing Quantity as the dependent variable and Price as the independent variable, and
estimating the conventional demand regression model Quantity = a + b*Price . Note that you
obtain an approximate rather than exact mathematical inverse of the price equation! This is
because OLS minimizes the sum of the squared vertical deviations from the regression line, not
the sum of squared perpendicular deviations:

Multivariate models
Now try regressing Quantity (Y range) against both Price and Income (the X range is both the
Price and Income columns). This will yield coefficient estimates for the multivariate demand
model Quantity =a +b*Price +c*I ncome +e. You should get something like this:


Written out in equation form, this empirical demand model is Q = 49.18 - 3.118*P + 0.510*I +
e. Multivariate models such as this don't lend themselves to easy graphing, but they are much
more interesting. In this example an increase in Income shifts the conventional Q vs. P demand
schedule to the right, while an increase in Price shifts the Q vs. Income curve (aka Engel curve)
to the left.
Model diagnostics
When analyzing your regression output, first check the signs of the model coefficients: are they
consistent with your hypotheses? Is the Price coefficient negative as theory predicts? Does the
Income coefficient indicate this is a normal good, or an inferior good? Try calculating the price
and income elasticities using these slope coefficients and the average values of Price and
Quantity.
The next thing you should check is the statistical significance of your model coefficients.
Because the data are noisy and the regression line doesnt fit the data points exactly, each
reported coefficient is really a point estimate, a mean value from a distribution of possible
coefficient estimates. So the residuals e (the remaining noise in the data) are used to analyze the
statistical reliability of the regression coefficients. The columns to the right of the coefficients
column at the bottom of the Excel output report the standard errors, t-statistics, P-values, and
lower and upper 95% confidence bounds for each coefficient.
The standard error is the square root of the variance of the regression coefficient. The t-statistic
is the coefficient estimate divided by the standard error. If your regression is based on what
statisticians call a "large" sample (30 or more observations), a t-statistic greater than 2 (or less
than -2) indicates the coefficient is significant with >95% confidence. A t-statistic greater than
1.68 (or less than -1.68) indicates the coefficient is significant with >90% confidence. The
confidence thresholds for t-statistics are higher for small sample sizes. This example uses only
21 observations to estimate 1 intercept and 2 slope coefficients, which leaves 21 - 3 = 18
"degrees of freedom" (df) for calculating significance levels. In this example, the t-statistic on
the Income coefficient is 2.037, which would exceed the 95% confidence threshold for a "large"
(N > 30 observations) dataset, but does not quite meet the 95% confidence threshold when N =
21 observations.
If that last paragraph is just statistical gibberish for you, don't worry--most people just check the
P-values. These are the probabilities that the coefficients are not statistically significant. The P-
value of 0.056 for the Income coefficient implies 1 - 0.056 = 94.4% confidence that the "true"
coefficient is between 0 and about 1.02. The last two columns report the exact lower and upper
95% confidence thresholds for the Income coefficient: -0.0159 and +1.038 respectively. The
very low P-values for the Intercept and Price coefficients indicate they are very strongly
significant, so their 95% confidence intervals are relatively narrower.
The R-Square statistic near the top of the output represents the percent of the total variation in
the dependent variable that is explained by the independent variables, i.e., the model's overall
goodness of fit." But whether a model is really a "good" fit or not depends on context. R-squares
for cross-sectional models are typically much lower than R-squares for time-series models. You
can always increase R-square by throwing another independent variable (any variable!) into your
model. Remember that your real objective is to test your hypotheses, not to maximize R-square
by including irrelevant variables in your model and then making up some "hypothesis" after the
fact to "explain" the results you got.
Those are all the diagnostics you really need to worry about.
Final comments
The classical OLS model assumes that the residuals e are independent of each other and
randomly distributed with a mean of zero. It is sometimes helpful to examine plots of residuals to
check for non-random pattens that indicate problems with your model. If you take an
econometrics class, you will learn how to identify violations of these assumptions and how to
adapt the OLS model to deal with these situations.
Keep in mind that a regression actually analyzes the statistical correlation between one variable
and a set of other variables. It doesn't actually prove causality. It is only the context of your
analysis that lets you infer that the "independent" variabes "cause" the variation in the
"dependent" variable. Somebody else out there is probably using the same data to prove that your
dependent variable is "causing" one of your independent variables!
You should never force the regression line through the origin (the "Constant is zero" check-box
in the Excel utility) without a clear theoretical justification for doing so. It makes your model
diagnostics unreliable.

You might also like