You are on page 1of 21

Correlation and Linear Regression

orrelation and regression are powerful and flexible techniques used to analyze interval-level relationships. Pearson's correlation coefficient (Pearson's r) measures the strength and direction of the relationship between two interval-level variables. Pearson's r is not a proportional reduction in error (PRE) measure, but it does gauge strength by an easily understood scale-from -1, a perfectly negative association between the variables, to + I , a perfectly positive relationship. A correlation of 0 indicates no relationship. Researchers often use correlation techniques in the beginning stages of analysis to get an overall picture of the relationships between interesting variables. Regression analysis produces a statistic, the regression coefficient, that estimates the effect of an independent variable on a dependent variable. Regression also produces a PRE measure of association, R-squared, which indicates how completely an independent variable or set of independent variables explains the dependent variable. In regression analysis the dependent variable is measured at the interval level, but the independent variable can come in any variety-nominal, ordinal, or interval. Regression is more specialized than correlation. Researchers use regression analysis to model causal relationships between one or more independent variables and a dependent variable. In this chapter you will learn to perform correlation analysis using the correlate command. You will run bivariate regression and multiple regression using the regress command. Bivariate regression uses one independent variable to predict a dependent variable, whereas multiple regression uses two or more independent variables to predict a dependent variable. Also, by becoming familiar with a Stata graphics routine, graph twoway, you will learn how to construct visual depictions of interval-level relationships. By exploiting Stata's ability to combine two (or more) graphics, you will learn to overlay a scatterplot with a regression prediction line. These sophisticated graphic techniques will help you interpret your findings and will greatly enhance the presentation of your results.

THE CORRELATE COMMAND AND THE REGRESS COMMAND


Suppose that a student of state politics is interested in the gender composition of state legislatures. Running the summarize command on a states.dta variable (womleg), this student finds that state legislatures range from 6 percent female to 40 percent female. Why is there such variation in this variable? The student researcher begins to formulate an explanation. Perhaps states with lower percentages of college graduates have lower percentages of women legislators than do states with more college-educated residents. And maybe a cultural variable, the percentage of Christian adherents, plays a role. Perhaps states with higher percentages

7 3 8 A Stata Companion to Political Analysis

of Christian residents have lower percentages of female lawmakers. Correlation analysis would give this researcher an overview of the relationships among these variables. Let's investigate. Open states.dta. First let's use correlate to get a general idea about the direction and strength of the relationships among three variables: the percentage of Christian adherents (christad), the percentage of college graduates (college), and the percentage of female state legislators (womleg). Stata commands don't get much simpler than correlate. You type "correlate" (or "corn for short) and then type or click the variables into the Command window:

cor christad college wornleg

.0 corm 5 9c)h r i s t a d ( b~
I
c h r istad

c o l l e g e watnleg christad college woml eg

Stata produces a table, called a correlation matrix, that shows the correlation of each variable with each of the other variables-it even shows the correlation between each variable and itself! The correlation between christad and womleg is -.5696, or about -.57, which tells us that increasing values of one of the variables is associated with decreasing values of the other variable. So as the percentage of Christian adherents goes up, the percentage of female legislators goes down. How strong is the relationship? We know that Pearson's r is bracketed by -1 and + I , so we could say that this relationship is a fairly strong negative association. The correlation between college and womleg, at about .41, indicates a positive relationship: As states' percentages of college graduates increase, so do their percentages of women legislators. This association is weaker, however, than the christad-womleg relationship. Finally, note that christad and college, with a Pearson's r of -.14, are weakly related. This value for Pearson's correlation coefficient suggests that there is no clear, systematic pattern to the relationship. Correlation analysis is a good place to start when analyzing interval-level relationships. Even so, a correlation coefficient is agnostic on the question of which variable is the cause and which the effect. Does an increase in the percentage of Christian adherents somehow cause lower percentages of women in state legislatures? Or do increasing percentages of women in state legislatures somehow cause states to have lower percentages of Christian adherents? Either way, correlation analysis reports the same measure of association, a Pearson's r of -.57. Regression is more powerful than correlation, in part because it helps the researcher investigate causal relationships-relationships in which an independent variable is thought to affect a dependent variable. Regression analysis will (a) reveal the precise nature of the relationship between an independent variable and a dependent variable, (b) test the null hypothesis that the observed relationship occurred by chance, and (c) provide a PRE measure of association between the independent variable and the dependent variable. The general syntax of the regress command is as follows: regress dep- var indep-var(s) Using the permissible command abbreviation "reg," we would analyze the effect of christad on womleg by typing "reg womleg christad" and pressing Enter:

reg wmleg christad source Model Residual ~ oa1t

ss
990. M3l77 206l.83152 3051.8447

df
48

MS

Number o f obs =
,

1 990.M3177
42.9548234 62.2825449 49

48) = Prob > F = R-squared = ~ d R-squared = j Root MSE =


FC

1,

50 23.05 0.000 0.3244 0.3103 6.554

m l eg
c h r i s t ad ,cons

coef.

std. ~ r r .
.0727169 3.983823

t
-4.80 9.85

P>ltl
0.000 0.00p

[95% conf. I n t e r v a l ]
-.4953068 EL. 24272 -.2028928 47.26275

-.3490998 39.25274

Correlation and Linear Regression

139

In the bottom-most table, Stata reports values ("Coef.") for the Y-intercept or constant, -cons, and for the regression coefficient on the independent variable, christad. According to these coefficients, the regression equation for estimating the effect of christad on womleg is as follows: Percent of state legislators who are women = 39.25 - 0.35*christad. The constant, 39.25, is the estimated value of Y when X equals 0. If you were using this equation to estimate the percentage of women legislators for a state, you would start with 39.25 percent and then subtract .35 for each 1 percent of the state's population who are Christian adherents. So your estimate for a state with, say, 50 percent Christian adherents would be 39.25 - .35*(50) = 39.25 - 17.5 22 percent female legislators. The main statistic of interest, then, is the regression coefficient, -.35, which estimates the average change in the dependent variable for each unit change in the independent variable. A regression coefficient of -35 tells us that, for each 1-unit increase in the percentage of Christian adherents, there is a 35-unit decrease in the percentage of female legislators. So a 1-percentage-point increase in christad is associated with a .35-percentage-point decrease in wom1eg.l What would the null hypothesis have to say about all this? Of course, we are not analyzing a random sample here, given that we have information on the entire population of 50 states. But let's assume, for illustrative purposes, that we have just analyzed a random sample and that we have obtained a sample estimate of the effect of christad on womleg. The null hypothesis would say what it always says: In the population from which the sample was drawn, there is no relationship between the independent variable (in this case, the percentage of Christian adherents) and the dependent variable (the percentage of female legislators). In the population the true regression coefficient is equal to 0. Furthermore, the regression coefficient that we obtained, -35, occurred by chance. In Stata regression results, you test the null hypothesis by examining two columns in the bottom tablethe column labeled "t", which reports t-ratios, and the column labeled "P>I t I", which reports P-values. Informally, to safely reject the null hypothesis, the researcher generally looks for t-ratios with magnitudes (absolute values) of 2 or greater. According to the results of our analysis, the regression coefficient for christad has a t-ratio of -4.80, well above the informal 2-or-greater rule. A P-value, which tells you the probability of obtaining the results if the null hypothesis is correct, helps you to make more precise inferences about the relationship between the independent variable and the dependent variable. If "P>I t I" is greater than .05, then the observed results would occur too frequently by chance, and you must not reject the null hypothesis. By contrast, if "P>I t I" is equal to or less than .05, then the null hypothesis represents an unlikely occurrence and may be rejected. The t-ratio for christad has a corresponding P-value of 0.000. If the null is correct, it is highly unlikely that we would obtain these result^.^ Reject the null hypothesis. It depends on the research problem at hand, of course, but for most applications you can ignore the t-ratio and P-value for the ~ o n s t a n t . ~ How strong is the relationship between christad and womleg? The answer is provided by the R-squared statistics, which appear on the right-hand side of the results, above the table of coefficients, t-ratios, and P-values. Stata reports two values, one labeled "R-squared" and one labeled "Adj R-squared" (for Adjusted R-squared). Which one should you use? Most research articles report the adjusted value, so let's rely on Adj R-squared to provide the best overall measure of the strength of the relationship. (See "A Closer Look: R-squared and Adjusted R-squared: What's the Difference?" on the following page.) Adj R-squared is equal to .3 1. What does this mean? R-squared communicates the proportion of the variation in the dependent variable that is explained by the independent variable. Like any proportion, R-squared can assume any value between 0 and 1. Thus, of all the variation in womleg between states, .31, or 31 percent, is explained by christad. The rest of the variation in womleg, 69 percent, remains unexplained by the independent variable.

CREATING A SCATTERPLOT
One of Stata's graphic commands, graph twoway, adds a visual dimension to correlation and regression and thus can help you to paint a richer portrait of a relationship. Consider Figure 8-1, created using Graphics -P Overlaid twoway graphs. The scatterplot displays the cases in a two-dimensional space according to their values on the two variables. The horizontal axis (X-axis) is defined by the independent variable, christad, and the vertical axis (Y-axis) is defined by the dependent variable, womleg. We know from our correlation

140 A Stata Companion to Political Analysis

R-squared and Adjusted R-squared: What's the Difference?


Most data analysis programs, Stata included, provide two values of R-squared-a plain version, which Stata labels "R-squared," and an adjusted version, "Ad] R-squared." Adjusted R-squared is often about the same as (but is always less than) p l a ~ n R-squared. What is the difference? Just like a sample mean, w h ~ c h provides an estimate of the unseen population mean, a sample R-squared provides an est~mate the true value of R-squared In the population. And just l ~ k e sample mean, the sample of a R-squared is equal to the populalion R-squared, grve or take random sampling error. However, unlike the random error associated with a sample mean, R-squared's errors can assume only positive values-squaring any negative error, after all, produces a positive number-~ntroducing upward bias into the est~mated value of R-squared. This problem, which is more troublesome for small samples and for models with many independent variables, can be corrected by adjusting plain R-squared "downward." For a sample of size N and a regression model with k predictors, adjusted R-square is equal to 1 - (1 - R-squared)[(N - 1)/(N - k - l)]. See Barbara G. Tabachnick and Linda S. Fidell, Using Multivariate Statistics, 3d ed. (New York: HarperCollins, 1996),164-165.

Figure 8-1

Scatterplot o f Percentage Women Legislators by Percentage Christian Adherents

Correlat~on Linear Regression 7 4 7 and Figure 8-2


The Graphics + Overlaid twoway graphs Menu

Diatributiml g r w s Sm&ing dttmsiYe;f; Regression dianoskic pbts

b
b b

Time s w i e ~ graphs 00s~-wctirsnd tin~e-?;wiss e pili Survival analyds graphs ROC analysis

b b
b

Change s c k m j s b

analysis that Pearson's r for this relationship is -.57. We can now see what the correlation "looks like." Based on the figure, states that have lower percentages of Christian adherents tend to have higher percentages of women legislators, with values on the Y-axis that range between roughly 20 percent and 3 5 4 0 percent. The percentages of women legislators for states at the higher end of the X-axis, furthermore, fall between less than 10 percent and around 20 percent. So as you move from left to right along the X-axis, values on the Y-axis generally decline, just as the negative correlation coefficient suggested. Obviously this scatterplot adds depth and interest to our description of the relationship. How was it created? Click Graphics + Overlaid twoway graphs, as shown in Figure 8-2. The Twoway graphs window appears, open to the Plot 1 tab (Figure 8-3). This interface is designed mainly to produce graphic overlays, for which the user makes requests in two or more of the four available Plot tabs. We will demonstrate this usage below. Our current, simpler object-a two-way scatterplot-can be accomplished with a handful of choices in the Plot 1 tab, plus additional entries in the familiar-looking Y-Axis, X-Axis, Title, Caption, and Overall tabs. You must take two actions in the Plot 1 tab, and you may want to experiment with a third. (See Figure 8-4). First, select Scatter in the Plot type list. (By doing so, you wake up the rest of the choices in the Plot 1 tab.) Second, type or click the independent variable, christad, into the X variable box, and type or

742 A Stata Companion to Political Analysis

Figure 8-3

Twoway graphs, Plot 1 Tab [default]

Figure 8-4

Twoway graphs, Plot 1 Tab [modified for scatterplot)

click the dependent variable, womleg, into the Y variable box. Third, and optionally, you can choose a color for the scatterplot's markers. Click the Y-Axis tab (Figure 8-5). This is where we supply a desired title for the graph's vertical axis. Click in the Title box and type "Percentage Women: State Legislators". Click the X-Axis tab (Figure 8-6) and in the Title box type "Percentage Christian Adherents". Just as in earlier graphs you have constructed in this book, click Title tab (Figure 8-7) and provide a concise but descriptive title ("Percentage Women Legislators") and subtitle ("by Percentage Christian Adherents"). And don't forget to source the data in the Note box of the Caption tab (Figure 8-8).

1 I

Correlation and Linear Regresslor1 143

Figure 8-5

Adding a Y-Axis Title to a Scatterplot

6 Figure 8-6

Adding an X-Axis Title to a Scatterplot

Now click the Overall tab (Figure 8-9). There are many ways to alter a graph's general appearance, including its overall size and shape. Left to its defaults, Stata will produce a scatterplot in which the X-axis is approximately twice the length of the Y-axis. Tastes vary, of course, but a width-to-height ratio of about 1.5 to 1 probably has greater aesthetic appeal.4 How can we change Stata's default? Given the way Stata defines the different areas of a graph, if we specify a Graph Region with Height equal to 4 inches and Width equal .~ to 4.75 inches, Stata will draw a scatterplot with an X-axis that is about 1.5 times as long as its Y - a ~ i sIn the Graph Region panel of the Overall tab (as shown in Figure 8-9), click in the Height box and type "4': and

144 A Stata Companion to Political Analysis

Figure 8-7

Adding a Title to a Scatterplot

Pl.,t 1

fld 2

Plot 3

r' 4tt:

r .A413

Tltie

C 3ptlorl

Legerd

0: eisil

E,l

Figure 8-8

Adding a Source Note to a Scatterplot

Plot 1

P l d

Pb>t 3

Plot 4

YAxs

}:-P.m:

Titie

c~tiilr.

L w Overell By e d

then click in the Width box and type "4.75". Click Submit. A scatterplot, constructed to our specifications, appears above the Results window (Figure 8- 10). Depending on the analysis at hand, you may want Stata to include a linear prediction line in the scatterplot you're working on. This enhancement is easily achieved with information you can provide in the Plot 2 tab, plus a minor adjustment you can make in the Legend tab. Return to the Twoway graphs window, and click the Plot 2 tab (Figure 8-11). Scroll down the Plot type list. Click on Linear prediction. (Be careful not to select the similar-looking "Linear prediction w/cin plot type.) With this selection, Stata will plot predicted

Correlation and Linear Regression 145

Figure 8-9

Modify~ng Height and Width of a Scatterplot the

Figure 8-10

Scatterplot w ~ t h Command Syntax

values of the variable named in the Y variable box based on the observed values of the variable named in the X variable box. We want to see a line showing predicted values of womleg based on observed values of christad. Just as you did in the Plot 1 tab, put christad in the X variable box and put womleg in the Y variable box, as shown in Figure 8-1 1. One more thing. Click the Legend tab (Figure 8-12). By default, Stata will include a legend that distinguishes the different plotted elements in its graphic overlays. For the kind of

746 A Stata Companion to Political Analysis

Figure 8-1 1 Adding a Linear Prediction Overlay to a Scatterplot

X v+

Y varlabk

If. (expression]

Figure 8-12

Suppressing the Display of the Legend

straightforward plotting we are doing here-a scatter plus a line-a legend adds unnecessary clutter. Click the drop-down next to the Use legend box and select No. Clicking OK or Submit gives us the enhanced product (Figure 8-13). The negatively sloping regression line we estimated in our earlier analysis-for each 1-point increase in christad there is a .35-point decrease in womleg-is superimposed over the data points. A scatterplot like this one would provide professional-quality graphic support to a research report or article. Nicely done.

Correlation and Linear Regression 747

Figure 8-13

Scatterplot with Linear Prediction Overlay and Command Syntax

Creating a scatterplot can be a tedious, exacting chore, especially if you also want to add a linear prediction line. Here is a step-by-step review: To create a scatterplot, follow Steps 1 through 6: Step 1 . Step 2 Step 3 Step 4 Step 5 Step 6 Click Graphics + Overlaid twoway graphs. In the Plot 1 tab, select Scatter in the Plot type list. Type the independent variable in the X variable box. Type the dependent variable in the Y variable box. Click the Y-Axis tab. Click in the Title box and type a Y-axis title. Click the X-Axis tab. Click in the Title box and type an X-axis title. Click the Title tab. Click in the Title box and type a title. Click in the Subtitle box and type a subtitle. Click the Caption tab. Click in the Note box and type the name of the data source. Click the Overall tab. In the Graph Region panel, click in the Height box and type "4". Click in the Width box and type "4.75".

To add a linear prediction line to the scatterplot, also perform Steps 7 and 8: Step 7 Step 8 Click the Plot 2 tab. Select Linear prediction from the Plot type list. Type the independent variable in the X variable box. Type the dependent variable in the Y variable box. Click the Legend tab. Click the drop-down list in the Use legend box and select No.

EXPLORING MULTIVARIATE RELATIONSHIPS WITH REGRESSION


Suppose a policy researcher is investigating factors causally related to motor vehicle deaths in the states. One such factor might be a simple characteristic of states: how densely populated they are. Residents of sparsely populated states, the policy researcher reasons, would typically drive longer distances at higher speeds than would residents of more densely populated states. Plus, a car accident in a thinly populated state would be

748 A Stata Companion to Political Analysis

more likely to be fatal, because "both Good Samaritans and hospitals are more scattered in thinly populated states compared to the denser state^."^ So as density goes up, we should find that fatalities go down. Another variable might be demographic: the proportion of young people in the population. As every insurance agent knows-and as many premium-paying parents will attest-younger drivers are more likely to be involved in automobile accidents than are older drivers. Thus, as the proportion of younger people goes up, fatalities should be found to go up, too. Dataset states.dta contains three variables: carfatal, the number of motor vehicle deaths per 100,000 residents; density, state population per square mile; and under20, the percentage of the population 19 years of age or younger. The command "cor carfatal density under20" will produce a correlation matrix of these three variables:

. car

c a r f a t a l d e n s i t y under20 (obs=50)

I
c a r f 3x31 dens i r y under 20

carfatal

density
1,8000 -0.4965

under20

X. 0000 -0.6l.92 8.4400

1,0000

Note the correlation between each independent variable and the dependent variable. The correlation between density and carfatal is negative, indicating that as density increases, motor vehicle fatalities decrease (r = -.62). The relationship between under20 and carfatal, as the policy researcher suspected, is positive: As the percentage of young people increases, fatalities also increase (r = .44). But notice, too, that the two independent variables are themselves moderately related (r = -.50). This correlation is negative, suggesting that deilsely populated states have lower percentages of young people than do sparsely populated states. (This relationship may become important later on.) First let's run a simple regression, using carfatal as the dependent variable and density as the independent variable. Type the command "reg carfatal density" and press Enter:

reg c a r f a t a l density
Number o f obs = FC 1, 48)s
554.937491

1 554,937491

Frob >

=
. :

R-squared
ROOt MSE

ndj R-squared =
=

50 29-84 0.0000 0.3834 0.3705 4.3124

carfatal

coef,
-.a40757 20.27816

s r d . Err.
.OQZ5767 -7576958

P > I ~ I

~95% conf.

xntlervalf
-.0088949 21.80161

-5.46 26.76

O.0QO O. QQQ

-.OT92566 18.75471

Consider the "Coef." column, which reports the Y-intercept (-cons) and the regression coefficient on density. According to these values, the Y-intercept is equal to 20.278, and the regression coefficient is -.014. The regression equation for the effect of density on carfatal, therefore, is as follows: Motor vehicle fatalities per 100,000 pop. = 20.278 - .014*pop. per square mile. What do these coefficients mean? In terms of its magnitude, for example, the regression coefficient seems to be an incredibly small number, and its meaning is not intuitively obvious. Remember to keep the substantive relationship in mind-and focus on the units of measurement. Very thinly populated states will have an estimated fatality rate close to the intercept, or about 20 fatalities per 100,000 population. Alaska, for example, has a population density of just more than 1 person per square mile. So its estimated fatality rate would be

1
Correlat~on and L~near Regression 749

close to the intercept of 20. The regression coefficient tells us that, for each additional person per square mile, the motor vehicle fatality rate drops by .014. New Jersey, for instance, is a very densely populated state, with a density of about 1,000 people per square mile. So New Jersey's estimated fatality rate would be 20 .014*1000,which is equal to 20 - 14, or about 6 fatalities per 100,000 population. Thus, as density increases, by one person at a time, fatalities decrease by .014 of a fatality per 100,000 population. Is this effect, a drop of .014 of a fatality for each unit increase in density, statistically different from O? Yes. With a t-ratio of -5.46 and a vanishingly small P-value (0.000), we can safely reject the null hypothesis. How strong is the carfatal-density relationship? Stata reports an Adj R-squared of .37. Thus, of all the variation among states in automobile fatality rates, about 37 percent is explained by population density. As bivariate regression models go, that's not too bad. Now let's run another bivariate regression. We will keep carfatal as the dependent variable, but this time we'll use the percentage of the population 19 years of age or younger (undedo) as the independent variable. Type the command "reg carfatal under20" and press Enter:
r e g carfata7 under20

umber o f oks = 50 FC 1, 48) = 11.52 ~rab F > 1 0.04ll4 R-squar ed = 013 .96 ~ d R-squared j 0.1F68 ROOT WSE 4.9315

7
r

According to the estimated coefficients, the regression line for the effect of under20 on carfatal is as follows: Motor vehicle fatalities per 100,000 pop. = -13.174

+ 1.07*percentage of pop. age 19 or younger.

As is sometimes the case with regression, the constant, -13.174, represents an "unreal" situation. The estimated value of carfatal for states with a value of 0 on the independent variable is a minus 13.174 motor vehicle deaths per 100,000 residents. Of course, the smallest value of under20 in the actual data is substantially higher than 0.' However, for the regression line to produce the best estimates for "real" data, regression has anchored the line at a Y-intercept. The regression coefficient, 1.070, says that for each percentage point increase in under20, there is a bit more than a one-unit increase in the motor vehicle fatality rate-one additional fatality per 100,000 population. As the percentage of younger people in the population increases, so too does the fatality rate. In the population, could the true value of the regression coefficient be O? Probably not, according to the t-ratio (3.39) and the P-value (0.001). And, according to Adj R-squared, the independent variable does a fair amount of work in explaining the dependent variable. About 18 percent of the variation in carfatal is explained by under2O. Let's review what we have discovered so far. In the first bivariate regression, we found that population density has a statistically significant negative effect on motor vehicle fatalities. Low-density states have higher fatality rates than do high-density states. In the second bivariate regression, we found that the percentage of younger people has a significant positive effect on motor vehicle fatalities. States with lower percentages of young people have lower fatality rates than do states with higher percentages of young people. But recall the initial correlation matrix. There we found that the two independent variables are related: As density goes up, the percentage of younger people goes down. So when we compare states that have lower percentages of young people with states that have higher percentages of young people, we are also comparing high-density states with low-density states. Perhaps states with higher percentages of young people have higher fatality rates not because they have more young people but because they have lower population densities. Thus the relationship between under20 and carfatal might be spurious. Then again, it might not be. Unless we reexamine the under20-carfatal relationship, controlling for density, there is no way to tell.

7 5 0 A Stata Companion to Political Analysis

Multiple regression is designed to estimate the partial effect of an independent variable on a dependent variable, controlling for the effects of other independent variables. Stata's regression command easily allows the researcher to run multiple regression analysis. Let's do such an analysis, again using carfatal as the dependent variable and entering both density and under20 as independent variables. Click the last regression command back into the Command window. Modify it to read "reg carfatal under20 density":

r e g c a r f a t a l under2O density ~ m b e r obs GI= F ( 2, 47) rob > F R-squared ~ d R-squared j Ratit MSE c a r f a t a1
under20

=
=
E

S O

= = =

0.38l4 4.2748

coef,
.4277766 -.01,20895 7.534612

std. ~ r r .

p>lt

[95% conf. ~ n t e r v a l ]

density ,cons

.a 4 6 3 2 8 .002.9425
9.403022

1.36 -4.15, 0.80

0.180 Q.OOQ 0.427

-.2051832
-.01,8

-3.l.3Bl.83

1.060736 -.08h226.45105

The Coef. column provides the information we need to isolate the partial effect of each independent variable on the dependent variable. The multiple regression equation is as follows: Motor vehicle fatalities per 100,000 pop. = 7.535 + .428*percentage of pop. age 19 or younger - .012*pop. per square mile. Let's focus on the regression coefficients for each of the independent variables. The coefficient for under20, .428, tells us the effect of under20 on carfatal, controlling for density. Recall that in the bivariate analysis, a l-percentage-point increase in under20 was associated with about a one-unit increase in the fatality rate. When we control for density, however, we find a substantial reduction in this effect-to less than a half-unit increase in the fatality rate. What is more, the regression coefficient for under20, with a P-value of .180, is not sta~istically significant. Density, on the other hand, retains much of its predictive power. The regression coefficient, -.O12, is very close to the effect we found earlier when we investigated the bivariate relationship between carfatal and density. With a P-value of 0.000, we can say that, controlling for the percentage of young residents, population density is significantly related to motor vehicle fatalities. It would appear, then, that the carfatal-under20 relationship is a spurious artifact of differences between states in population density. In multiple regression, R-squared communicates how well all of the independent variables explain the dependent variable. So by knowing two things about states-the percentage of younger people and the population density-we can account for about 38 percent of the variation in motor vehicle death rates. But notice that this value of Adj R-squared is practically the same as the R-squared we found before, using density by itself to explain carfatal. Clearly, density does the lion's share of explanatory work in accounting for the dependent variable.

EXERCISES
1. (Dataset: states.dta. Variables: demnat, demstate, union.) Consider a plausible scenario for the relationships

between three variables: The percentages of a state's U.S. House and U.S. Senate delegations who are Democrats, the percentage of state legislators who are Democrats, and the percentage of workers in the state who are unionized. One could hypothesize that, compared with states with few Democrats in their state legislatures, states having larger percentages of Democratic legislators would also have greater proportions of Democrats in their U.S. congressional delegations. Furthermore, because unions tend to support Democratic candidates, one would also expect more heavily unionized states to have higher percentages of Democratic legislators at the state and national levels. Dataset states.dta contains three variables: demnat, the percentage of House and Senate members who are Democrats; demstate, the percentage of state legislators who are Democrats; and union, the percentage of workers who are union members.

?U

*, .

*.-.v" W

.-

--

Correlation and Llnear Regression 7 5 7

A. Run correlate to find the Pearson's correlation coefficients among demnat, demstate, and union. Next to the question marks, write in the two missing correlation coefficients in the following matrix:

F
1
6

Percent U.S. House and Senate Democratic Percent U.S. House and Senate Democratic Percent state legislators Democratic Percent workers who are union members Pearson's correlation Pearson's correlation Pearson's correlation
1

Percent state legislators Democratic

Percent workers who are union members

Eh
{

.5590
?

/?

B. According to the correlation coefficient, as the percentage of ~ulionized workers increases, the percentage of Democratic U.S. representatives and U.S. senators (circle one)
increases. decreases.

i
C

C. According to the correlation coefficient, as the percentage of unionized workers decreases, the percentage of Democratic U.S. representatives and U.S. senators (circle one)
increases. decreases.

\ t

D. Which of the followiilg statements best describes the relationship between the percentage of unionized workers and the percentage of state legislators who are Democrats (check one)?

Cl The relationship is negative and strong.


CI The relationship is negative and weak. CI The relationship is positive and weak.

CI The relationship is positive and strong.


2. (Dataset: states.dta. Variables: conpct-m, cons-hr.) Two congressional scholars are discussing the extent to which members of the U.S. House of Representatives stay in touch with the voters in their states.

Scholar 1: "When members of Congress vote on important public policies, they are closely attuned to the ideological make-ups of their states. Members from states having lots of liberals will tend to cast votes in the liberal direction. Representatives from states with mostly conservative constituencies, by contrast, will take conservative positions on important policies." Scholar 2: "You certainly have a naive view of congressional behavior. Once they get elected, members of Congress adopt a 'Washington, D.C., state of mind,' perhaps voting in the liberal direction on one policy and in the conservative direction on another. One thing is certain: The way members vote has little to do with the ideological composition of their states." Think about an independent variable that measures the percentage of self-described "conservatives" among the mass public in a state, with low values denoting low percentages of conservatives and high values denoting high percentages of conservatives. And consider a dependent variable that gauges the degree to which the state's House delegation votes in a conservative direction on public policies. Low scores on this dependent variable tell you that the delegation tends to vote in a liberal direction and high scores say that the delegation votes in a conservative direction.

7 5 2 A Stata Corr~panion Political Analysis to

A. Below is an empty graphic shell showing the relationship between the independent variable and the dependent variable. Draw a regression line inside the shell that depicts what the relationship should look like if Scholar 1 is correct.

B. Below is another graphic shell showing the relationship between the independent variable and the dependent variable. Draw a regression line inside the shell that depicts what the relationship should look like if Scholar 2 is correct.

Correlation and I-inear Regression 153 C. Dataset states.dta contains the variable conpct-m, the percentage of the mass public calling themselves conservative. This is the independent variable. The dataset also contains cons-hr, a measure of conservative votes by states' House members. Scores on this variable can range from 0 (low conservatism) to 100 (high conservatism). This is the dependent variable. Run regress. According to the coefficient estimates, the regression equation for estimating House conservatism scores is as follows:

The t-ratio for conpct-m is

,and the P-value for the regression coefficient on conpct-m is

D. According to the regression equation, a 1-percentage-point increase in conservatives in the mass public is associated with (check one): D about a 16-point decrease in House conservatism scores. D about a 2-point increase in House conservatism scores. D about a 14-point increase in House conservatism scores.

E. If you were to use this regression to estimate the mean House conservatism score for states having 30 percent conservatives, your estimate would be (circle one):
a score of about 30. a score of about 50. a score of about 80.

F. The value of Adj R-squared for this relationship is equal to

. This tells you that about

percent of the variation in cons-hr is explained by conpct-m.

G. Use Graphics + Overlaid twoway graphs to create a scatterplot (in the Plot 1 tab) with a linear prediction line (in the Plot 2 tab). Use the following titles and notes: Y-axis title, "Conservatism Score: US House Delegation"; X-axis title, "Percent Mass Public Conservative"; title, "House Conservatism Score"; subtitle, "by Percent Mass Public Conservative"; caption note, "Source: State Dataset". In the Use legend box of the Legend tab, select No. In the Chart Region panel of the Overall tab, specify Height of 4 and Width of 4.75. Print the graphic you have created.
H. Based on your inspection of the regression results, the scatterplot and regression line, and Adj R-squared, which congressional scholar is more correct (check one)? Scholar 1 I. Briefly explain your reasoning. Scholar 2

3. (Dataset: statesdta. Variables: to-0004, kerry04). An article of faith among Democratic Party strategists (and a source of apprehension among Republican strategists) is that high voter turnouts help Democratic candidates. Why should this be the case? According to the conventional wisdom, Democratic electorates are less likely to vote than are Republican voters. Thus low turnouts naturally favor Republican candidates. As turnouts push higher, the reasoning goes, a larger number of potential Democratic voters will go to the polls, creating a better opportunity for Democratic candidates. Therefore, as turnouts go up, so should the Democratic percentage of the vote.'

754 A Stata Companion to Political Analysis

A. Use the regress command to test this conventional wisdom. Dataset states.dta contains to-0004, the percentage-point change in presidential election turnout between 2000 and 2004. States in which turnout declined between 2000 and 2004 have negative values on to-0004, and states in which turnout increased have positive values on to-0004. (For example, Florida's turnout increased from 53.4 percent to 61.8 percent, giving Florida a value of 8.4 on to-0004. Arizona's turnout dropped from 44.6 percent to 42.3 percent, giving it a value of -2.3 on to-0004.) To-0004 is the independent variable. Another variable, kerry04, the percentage of the vote cast for Democratic candidate John Kerry, is the dependent variable. Based on your results, the regression equation for estimating the percentage voting for Kerry is (put the value of the constant, -cons, in the first blank):

B. The P-value for the regression coefficient on to-0004 is

, and the value of Adj R-squared is

C. Consider your findings in parts A and B. You can conclude that


D the conventional wisdom is correct because

D the conventional wisdom is incorrect because

4. (Dataset: states.dta. Variables: abortlaw, permit.) As you are no doubt aware, in its momentous decision in Roe v. Wade (1973), the U.S. Supreme Court declared that states may not outlaw abortion. Even so, many state legislatures have enacted restrictions and regulations that, while not banning abortion, make an abortion more difficult to obtain. Other states, however, have few or no restrictions. What factors might explain these differences in abortion laws among the states?We know that the mass public remains divided on this issue. Public opinion in some states is more favorable toward permitting abortion, whereas public opinion in other states is less favorable. Does public opinion guide state policy on this issue?

Dataset states.dta contains abortlaw, which measures the number of abortion restrictions a state has enacted into law. Values on abortlaw range from 0 (least restrictive) to 10 (most restrictive). This is the dependent variable. The dataset also has the variable permit, the percentage of the mass public saying that abortion should always be permitted. This is the independent variable. A. If you were to use regression analysis to test the idea that public opinion on abortion affects state abortion policy, you would expect to find (check one)
D a negative sign on permit's regression coefficient. D a positive sign on permit's regression coefficient.

B. Use the regress command to analyze the abortlaw-permit relationship. According to the results, the regression equation for estimating the number of abortion restrictions is as follows (fill in the blanks): 'permit. (regression coefficient)

(constant) C. The P-value for the regression coefficient is

. The value of Adj R-squared is

Correlation and Linear Regression 755

D. According to states.dta, about 44 percent of Colorado residents believe that abortion should always be permitted. In Louisiana, by contrast, only about 14 percent of the public holds this view. Based on the regression equation (fill in the blanks), you would estimate that Colorado would have about you would estimate that Louisiana would have about abortion restrictions. abortion restrictions.

E. Use Graphics + Overlaid twoway graphs to create a scatterplot (in the Plot 1 tab) with a linear prediction line (in the Plot 2 tab). Use the following titles and notes: Y-axis title, "Number of Abortion Restrictions"; X-axis title, "Percent Mass Public Saying Always Permit"; title, "Number of Abortion Restrictions"; subtitle, "by Percent Mass Saying Always Permit Abortion"; note, "Source: State Dataset". In the Use legend box of the Legend tab, select No. In the Chart Region panel of the Overall tab, specify Height of 4 and Width of 4.75.Print the graphic you have created.
5 (Dataset: states.dta. Variables: demstate, dempct-m, libpct-m.) In Exercise 2 you analyzed the connection . between mass political attitudes and congressional voting, and in Exercise 4 you examined the link between public opinion and public policy. In this exercise you will use correlation and multiple regression to examine a set of relationships between mass attitudes and the partisan make-up of state legislatures. State legislatures are remarkably varied in this regard-ranging in partisan composition from about 22 percent Democratic to over 85 percent Democratic. What accounts for this variation? Consider two plausible independent variables: the percentage of a state's citizens who are self-identified Democrats, and the percentage of citizens who are selfdescribed "liberals." Each of these variables should have a positive relationship with the percentage of Democrats in the state legislature. A. Dataset states.dta contains these three variables: demstate, the percentage of state legislators who are Democrats; dempct-m, the percentage of Democrats in the mass electorate; and libpct-m, the percentage of selfdescribed liberals in the mass public. Find the Pearson's correlation coefficients among demstate, dempct-rn, and libpct-m. Write the correlation coefficients next to the question marks in the following matrix. Percent state legislators Democratic Percent state legislators Democratic Percent mass public Democratic Percent mass public Liberal Pearson's correlation Pearson's correlation Pearson's correlation
1

Percent mass public Democratic

Percent mass public Liberal

? ?

1 ?
1

B. According to the correlation coefficient, as the percentage of liberals in the mass public increases, the percentage of Democratic state legislators (circle one) increases. decreases.

C. Suppose someone were to make this claim: "Being a Democrat and being a liberal are practically synonymous. The relationship between the percentage of Democratic identifiers and the percentage of liberals, therefore, will be positive and strong." According to the correlation coefficient, this claim is

L l correct because

7 5 6 A Stata Companion to Political Analysis

Cl incorrect because

D. Run regression analysis to obtain multiple regression estimates for the partial effects of dempct-m and libpct-m on demstate. Demstate is the dependent variable, and dempct-m and libpct-m are the independent variables. Based on your results, the multiple regression for estimating the percentage of Democratic state legislators is as follows:

E. The P-value for the regression coefficient on dempct-m is


regression coefficient on libpct-m is

, and the P-value for the

F. As you may know, Nebraska's state legislature is unique in two ways-it is unicameral (all other state legislatures are bicameral), and it is nonpartisan. Candidates do not run for the state legislature using party labels, and the legislature is not organized on the basis of party. Thus Nebraska has a missing value on the variable demstate, and it was not included in the regression analysis you just performed. However, if you were to peruse states.dta, you would find that 29.03 percent of Nebraskans are Democrats and 16.44 percent are self-described liberals. For the sake of speculation, assume that Nebraska decided that all members of the state legislature should declare a partisan allegiance. Based on your regression model, about what percentage would be Democrats (circle one)? About 24 percent About 36 percent About 60 percent

G. Based on your interpretation of the multiple regression output, you can conclude that (check all that apply):
Cl controlling for the percentage of the mass public who are liberal, a 1 -percentage-point increase in the percentage of Democrats in the mass public is associated with about a 5.8-percentage-point increase in the percentage of Democratic state legislators. Cl controlling for the percentage of the mass public who are Democratic, a I-percentage-point increase in the percentage of liberals in the mass public is associated with about a 1.5-percentage-point increase in the percentage of Democratic state legislators. Cl both independent variables are significantly related to the dependent variable.
Ll the relationship between dempd-m and demstate is spurious.
Ll taken together, both independent variables explain about one-half of the variation in the dependent

variable. That concludes the exercises for this chapter. After exiting Stata, be sure to take the media containing your datasets with you.

1. Regression analysis on variables measured by percentages can be confusing. Always stay focused on the exact units of

measurement. One percentage point would be 1.00. So if christad increases by 1.00, then womleg decreases, on average, by .35, or .35 of a percentage point. 2. Stata reports two-tailed P-values, not one-tailed P-values. The expression "P>I t I" means, "the area of the curve above the absolute value of t." This includes the region below a negatively signed t and above a positively signed t. Strictly speaking, then, you may correctly apply the .05 standard by rejecting the null hypothesis for any reported P-value of equal to or less than .lo. However, in this book we follow the more conservative practice of rejecting the null hypothesis for P-values of equal to or less than .05.

Correlation and Linear Regression 757


3. The t-ratio for the Y-intercept permits you to test the null hypothesis that, in the population, the Y-intercept is 0. In

4.

5.

6. 7. 8.

this case we have no interest in testing the hypothesis that states having 0 Christian adherents have 0 percent women in their state legislatures. See Edward R. Tufte, The Visual Display of Quantitative Information, 2d ed. (Cheshire, Conn.: Graphics Press, 2001), 186-190. Tufte notes that although people have a "mild preference for proportions near the Golden Rectangle" (p. 190), preferences vary considerably, from 1.2 to 1 to 2.2 to 1. The Golden Rectangle, based on a "venerable (fifth-century B.c.)but dubious rule of aesthetic proportion" (p. 189), has a width to height ratio of 1.618 to 1. TufteS recommendations (p. 190):"If the nature of the data suggests the shape of the graphic, follow that suggestion. Otherwise, move toward horizontal graphics about 50 percent wider than tall." The Graph Region includes virtually all elements of the graphic space, including axis titles and the chart title and subtitle. Assuming that the user requests default sizes for these elements, a 4-inch Height will produce a Y-axis of 2.75 inches, and a 4.75-inch Width will produce an X-axis of 4.125 inches. See Stata Graphics Reference Manual, Release 9 (College Station, Texas: Stata Press, 2005), 445. Edward R. Tufte, Data Analysis for Politics and Policy (Englewood Cliffs, N.J.: Prentice Hall, 1974), 21. Tufte uses regression analysis to evaluate the effectiveness of motor vehicle inspections, controlling for population density. If you do a quick codebook run, you will find that the lowest value of under20 is 25 percent. See Michael D. Martinez and Jeff Gill, "The Effects of Turnout on Partisan Outcomes in U.S. Presidential Elections 1960-2000," Journal of Politics 67 (November 2005): 1248-1274. Martinez and Gill find that the Democratic advantage from higher turnouts has declined over time.