a) Data View
First is the Data View. It looks like an Excel spreadsheet and should feel familiar: variable names run across the top row and case numbers run down the rows. You can enter or delete data directly, as in Excel. If you delete a value, the resulting missing value is represented in the dataset by a dot.
b) Variable View
The second is called Variable View. This is where you can view various components of your variables; the most important components are Name, Label, Values and Measure. The Name column specifies the name of your variable.
The Measure column is often overlooked but is important for certain analyses in SPSS and will help orient you to the types of analyses that are possible. You can choose between Scale, Ordinal or Nominal variables:
In regression, you typically work with Scale outcomes and Scale predictors,
although we will go into special cases of when you can use Nominal variables as
predictors in Lesson 3.
From the Variable View we can see that we have 21 variables and the labels
describing each of the variables. We will not go into all of the details about these
variables. We have variables about academic performance in 2000 api00, and
various characteristics of the schools, e.g., average class size in kindergarten to
third grade acs_k3, parent’s education avg_ed, percent of teachers with full
credentials full, and number of students enroll.
c) Syntax Editor
The Syntax Editor is where you enter SPSS Command Syntax. You can highlight
portions of your code and implement it by pressing the Run Selection button.
Note that you can explore all the syntax options in SPSS via the Command
Syntax Reference by going to the Help menu. This will call a PDF file that is a
reference for all the syntax available in SPSS.
As we will see in this seminar, there are some analyses you simply can’t do from
the dialog box, which is why learning SPSS Command Syntax may be useful.
Throughout this seminar, we will show you how to use both the dialog box and
syntax when available. To begin, let’s go over basic syntax terminology:
Note that ** next to a specification means it is the default if none is provided (e.g., /MISSING = LISTWISE). When you paste syntax from the drop-down menus, SPSS usually outputs the default specifications explicitly.
After pasting the syntax and clicking the Run Selection button, or after specifying your analysis through the menu system and clicking OK, a new window pops up called the SPSS Viewer, otherwise known as the Output window. This is where all the results from your regression analysis will be stored.
Now that we are familiar with all the essential components of the SPSS
environment, we can proceed to our first regression analysis.
yi = b0 + b1*xi + ei

The index i can refer to a particular student, participant or observation; in this seminar, it indexes schools. The term yi is the dependent or outcome variable (e.g., api00) and xi is the independent variable (e.g., acs_k3). The term b0 is the intercept, b1 is the regression coefficient, and ei is the residual for each school. Now let's run a regression analysis using api00, our measure of academic performance, as the dependent variable. Let's first include acs_k3, the average class size in kindergarten through 3rd grade. We expect that better academic performance would be associated with lower class size.
Let’s try it first using the dialog box by going to Analyze – Regression – Linear.
In the Linear Regression menu, you will see Dependent and Independent fields.
Dependent variables are also known as outcome variables, which are variables
that are predicted by the independent or predictor variables. Let’s not worry about the other fields for now. You can either click OK now, or click Paste and the code will appear in the Syntax Editor. Click the Run Selection button to run the analysis.
This is the output that SPSS gives you if you paste the syntax.
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT api00
/METHOD=ENTER acs_k3.
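Under the hood this is ordinary least squares. As a rough sketch (not SPSS itself; the helper function and the toy numbers below are purely illustrative), the slope and intercept that REGRESSION estimates can be computed like this in Python:

```python
import numpy as np

def simple_ols(y, x):
    """Fit y = b0 + b1*x by least squares.

    Pairs with a missing value are dropped first, mirroring the
    /MISSING LISTWISE behavior of the REGRESSION command."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    keep = ~(np.isnan(y) | np.isnan(x))      # listwise deletion
    y, x = y[keep], x[keep]
    b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # slope
    b0 = y.mean() - b1 * x.mean()                        # intercept
    return b0, b1

# Toy usage (hypothetical numbers, not the API data):
b0, b1 = simple_ols([650, 700, 720, 600], [20, 19, 18, 22])
```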
Coefficients

Model 1              B        Std. Error   Beta    t        Sig.
avg class size k-3   -2.712   1.420        -.096   -1.910   .057
We omit certain portions of the output which we will discuss in detail later.
Looking at the coefficients, average class size (acs_k3, b = -2.712) is marginally significant (p = 0.057), and the coefficient is negative, indicating that larger class sizes are related to lower academic performance, which is what we would expect. Should we take these results and write them up for publication? From these results, we would conclude that lower class sizes are related to higher performance. Before we write this up for publication, we should run a number of checks to make sure we can firmly stand behind these results.
We start by getting more familiar with the data file, doing preliminary data
checking, and looking for errors in the data.
DESCRIPTIVES /VAR=ALL.
Descriptive Statistics
(N, Minimum, Maximum, Mean and Std. Deviation for each variable; table values not reproduced here)
Recall that we have 400 elementary schools in our subsample of the API 2000 data set. Some variables have missing values; for example, acs_k3 (average class size) has a valid sample size (N) of 398. When we ran our original regression analysis the Total DF (degrees of freedom) was 397 (not shown above; see the ANOVA table in your output), which matches our expectation, since the degrees of freedom for the Total Sum of Squares is the total sample size minus one. Looking at the minimum and maximum for acs_k3, the average class size ranges from -21 to 25. An average class size of -21 is implausible, which means we need to investigate further. Additionally, as we see from the Regression With SPSS web book, the variable full (pct full credential) appears to be entered as proportions, hence we see 0.42 as the minimum. The last row of the Descriptives table, Valid N (listwise), is the sample size you would obtain if you put all of the variables in the table into your regression analysis; this is known as listwise deletion, which is the default for the REGRESSION command. The descriptives have uncovered peculiarities worthy of further examination.
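The same summary, including the listwise-valid N, can be sketched with pandas (assuming a hypothetical DataFrame df holding the school variables; this is an illustration of the idea, not the SPSS implementation):

```python
import pandas as pd

def descriptives(df):
    """Per-variable N, min, max, mean, SD, plus the listwise-valid N
    (the number of rows with no missing value on any variable)."""
    summary = pd.DataFrame({
        "N": df.count(),
        "Minimum": df.min(),
        "Maximum": df.max(),
        "Mean": df.mean(),
        "Std. Deviation": df.std(),
    })
    valid_n_listwise = len(df.dropna())   # listwise deletion
    return summary, valid_n_listwise
```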
The actual values of the “fences” in the boxplots can be difficult to read. We can
request percentiles to show where exactly the lines lie in the boxplot. Recall that
the boxplot is marked by the 25th percentile on the bottom end and 75th
percentile on the upper end. To request percentiles go to Analyze – Descriptive
Statistics – Explore – Statistics.
The code you obtain from pasting the syntax:
EXAMINE VARIABLES=acs_k3
/PLOT BOXPLOT HISTOGRAM
/COMPARE GROUPS
/PERCENTILES(5,10,25,50,75,90,95) HAVERAGE
/STATISTICS DESCRIPTIVES
/CINTERVAL 95
/MISSING LISTWISE
/NOTOTAL.
Now let’s take a look at the output.
Cases
avg class size k-3: Valid N = 398 (99.5%), Missing = 2 (0.5%), Total = 400 (100.0%)
We get the Case Processing Summary, which tells us again the valid N and the number of missing cases, as expected.
Descriptives

95% Confidence Interval, Lower Bound   18.05
Median                                 19.00
Variance                               25.049
Std. Deviation                         5.005
Minimum                                -21
Maximum                                25
Range                                  46
Interquartile Range                    2
The Descriptives output gives us detailed information about average class size. You can get special output here that you can’t get from Analyze – Descriptive Statistics – Descriptives, such as the 5% trimmed mean.
(Optional)
For an annotated description of a similar analysis, please see our web page: Annotated SPSS Output: Descriptive Statistics.
Boxplots are better for depicting ordinal variables, since boxplots use percentiles as the indicators of central tendency and variability. The key percentiles to note are the 25th, 50th and 75th, since these indicate the lower, middle and upper “fences” of the boxplot. Note that Tukey’s hinges fall on data values or halfway between two data values, whereas the Weighted Average definition can take on any fractional value.
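The difference between the two definitions can be sketched numerically with numpy. SPSS's HAVERAGE (the weighted average at position (n+1)p) corresponds, we believe, to numpy's "weibull" method; treat the mapping as illustrative, not authoritative:

```python
import numpy as np

def haverage_quartiles(x):
    """Quartiles via the weighted average at (n+1)p, numpy's 'weibull' method."""
    x = np.sort(np.asarray(x, dtype=float))
    return np.percentile(x, [25, 50, 75], method="weibull")

def tukey_hinges(x):
    """Tukey's hinges: medians of the lower and upper halves of the data
    (each half includes the overall median point when n is odd)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    med = np.median(x)
    half = (n + 1) // 2
    lower, upper = x[:half], x[n - half:]
    return np.median(lower), med, np.median(upper)
```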
Percentiles

avg class size k-3, Weighted Average (Definition 1):
5th = 16.00, 10th = 17.00, 25th = 18.00, 50th = 19.00, 75th = 20.00, 90th = 21.00, 95th = 21.00
(The Tukey’s Hinges row of the table is not reproduced here.)
The boxplot is shown below. Pay particular attention to the circles which are mild
outliers and stars, which indicate extreme outliers. Note that the extreme outliers
are at the lower end.
You can look at the outliers by double clicking on the boxplot and then right-clicking on the starred cases (extreme outliers). Then click Go to Case to see the case in Data View.
Going back to Data View
We see that the histogram and boxplot are effective in showing the schools with
class sizes that are negative. Looking at the boxplot and histogram we see
observations where the class sizes are around -21 and -20, so it seems as
though some of the class sizes somehow became negative, as though a negative
sign was incorrectly typed in front of them. To see if there’s a pattern, let’s look at
the school and district number for these observations to see if they come from the
same district. Indeed, they all come from district 140.
We can use Variable View to move the variable acs_k3 from position 10 to position 3 by holding down the left mouse button on the leftmost column (in Windows) and dragging the variable up.
All of the observations from District 140 seem to have this problem. When you find such a problem, you want to go back to the original source of the data to verify the values. We should reveal that we fabricated this error for illustration purposes; the actual data had no such problem. But let’s pretend that we checked with District 140 and found a problem with the data there: a hyphen had accidentally been typed in front of the class sizes, making them negative. We will make a note to fix this! Let’s continue checking our data.
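The check-and-fix described above can be sketched in pandas (the column names acs_k3 and dnum, for class size and district number, are assumptions for illustration; this is not part of the seminar's SPSS workflow):

```python
import pandas as pd

def fix_negative_class_sizes(df, col="acs_k3"):
    """Flag implausible negative class sizes, report which districts
    they come from, and strip the stray negative sign."""
    bad = df[df[col] < 0]
    districts = bad["dnum"].unique()          # affected districts
    fixed = df.copy()
    fixed.loc[fixed[col] < 0, col] = fixed.loc[fixed[col] < 0, col].abs()
    return fixed, districts
```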
Further steps
We recommend repeating these steps for all the variables you will be analyzing in your linear regression model. In particular, it seems there are additional typos in the full variable; in the Regression With SPSS web book we describe this error in more detail. In conclusion, we have identified problems with our original data which led to incorrect conclusions about the effect of class size on academic performance. The corrected version of the data is called elemapi2v2. Let’s use that data file, repeat our analysis, and see if the results are the same as our original analysis.
Now that we have the correct data, let’s revisit the relationship between average
class size acs_k3 and academic performance api00.
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT api00
/METHOD=ENTER acs_k3.
Model Summary

Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .171   .029       .027                140.165
ANOVA
(Sum of Squares, df, Mean Square, F and Sig.; values not reproduced here)
Coefficients

Model 1              B        Std. Error   Beta    t       Sig.
avg class size k-3   17.751   5.140        .171    3.454   .001
Looking at the Model Summary we see that the R square is .029, which means
that approximately 2.9% of the variance of api00 is accounted for by the model.
The R is the correlation of the model with the outcome, and since we only have
one predictor, this is in fact the correlation of acs_k3 with api00. From the
ANOVA table we see that the F-test and hence our model is statistically
significant. Looking at the Coefficients table the constant or intercept term is
308.34, and this is the predicted value of academic performance
when acs_k3 equals zero. We are not that interested in this coefficient because a
class size of zero is not plausible. The t-test for acs_k3 equals 3.454, and is
statistically significant, meaning that the regression coefficient for acs_k3 is
significantly different from zero. Note that (3.454)2 = 11.93, which is the same as
the F-statistic (with some rounding error). The coefficient for acs_k3 is 17.75,
meaning that for a one student increase in average class size, we would expect a
17.75 increase in api00. Additionally from the Standardized Coefficients Beta, a
one standard deviation increase in average class size leads to a 0.171 standard
deviation increase in academic performance.
To verify this, let’s compute the correlation between the two variables by going to Analyze – Correlate – Bivariate. Move api00 and acs_k3 from the left field to the right field by highlighting the two variables (holding down Ctrl on a PC) and then clicking the right arrow.
Correlations

                                        api 2000   avg class size k-3
api 2000            Pearson Correlation 1          .171**
                    Sig. (2-tailed)                .001
                    N                   400        398
avg class size k-3  Pearson Correlation .171**     1
                    N                   398        398

**. Correlation is significant at the 0.01 level (2-tailed).
Note that the correlation is equal to the Standardized Coefficients Beta column from our simple linear regression, whose term we will denote beta-hat, with a hat to indicate that it is estimated from our sample.

The formula for the unstandardized coefficient in simple linear regression is:

b1-hat = corr(y, x) * SD(y) / SD(x).

For standardized variables:

beta1-hat = corr(Zy, Zx) * SD(Zy) / SD(Zx).

Since the standard deviation of a standardized variable is 1, the SD terms on the right-hand side divide to 1 and we are left with the correlation coefficient. It can be shown that the correlation of the z-scores is the same as the correlation of the original variables:

beta1-hat = corr(Zy, Zx) = corr(y, x).

Thus, for simple linear regression, the standardized beta coefficient is simply the correlation of the two unstandardized variables!
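This identity is easy to verify numerically; the sketch below uses randomly generated toy data (nothing from the API file):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)

# Unstandardized slope: cov(x, y) / var(x)
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
# Standardized coefficient: b1 * SD(x) / SD(y)
beta = b1 * x.std(ddof=1) / y.std(ddof=1)
# ...which equals the plain correlation coefficient:
r = np.corrcoef(x, y)[0, 1]
assert abs(beta - r) < 1e-9
```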
yi = b0 + b1*x1i + b2*x2i + b3*x3i + ei
For this multiple regression example, we will regress the dependent variable, api00, on the predictors acs_k3, meals and full. We can modify the code directly from Section 1.4. Remember to use the corrected data file: elemapi2v2.
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT api00
/METHOD=ENTER acs_k3 meals full.
The output you obtain is as follows:
Variables Entered/Removed

Model   Variables Entered                                           Variables Removed   Method
1       pct full credential, avg class size k-3, pct free meals     .                   Enter
Model Summary
(R, R Square, Adjusted R Square, Std. Error of the Estimate; values discussed in the text below)
a. Predictors: (Constant), pct full credential, avg class size k-3, pct free meals

ANOVA
(Sum of Squares, df, Mean Square, F, Sig.; values discussed in the text below)
b. Predictors: (Constant), pct full credential, avg class size k-3, pct free meals
Coefficients

Model 1               B        Std. Error   Beta    t         Sig.
pct free meals        -3.686   .112         -.828   -32.978   .000
pct full credential   1.327    .239         .139    5.556     .000
Let’s examine the output from this regression analysis. As with the simple
regression, we look to the p-value of the F-test to see if the overall model is
significant. With a p-value of zero to three decimal places, the model is
statistically significant. The R-squared is 0.824, meaning that approximately 82%
of the variability of api00 is accounted for by the variables in the model. In this
case, the adjusted R-squared indicates that about 82% of the variability
of api00 is accounted for by the model, even after taking into account the number
of predictor variables in the model. The coefficient for each variable indicates the amount of change one could expect in api00 given a one-unit change in that variable, with all other variables in the model held constant. For example, consider the variable meals. We would expect a
decrease of 3.686 in the api00 score for every one unit increase in percent free
meals, assuming that all other variables in the model are held constant. The
interpretation of much of the output from the multiple regression is the same as it
was for the simple regression.
(Optional) You may be wondering what a -3.686 change in meals really means,
and how you might compare the strength of that coefficient to the coefficient for
another variable, say full. To address this problem, we can refer to the column
Standardized Coefficients Beta, also known as standardized regression
coefficients. The Beta coefficients are used by some researchers to compare the
relative strength of the various predictors within the model. Because the Beta
coefficients are all measured in standard deviations, instead of the units of the
variables, they can be compared to one another. In other words, the beta
coefficients are the coefficients that you would obtain if the outcome and predictor
variables were all transformed to standard scores, also called z-scores, before
running the regression. In this example, meals has the largest Beta coefficient, -0.828, and acs_k3 has the smallest, -0.007. Thus, a one standard deviation increase in meals leads to a 0.828 standard deviation decrease in predicted api00, with the other variables held constant. And a one standard deviation increase in acs_k3 leads to only a 0.007 standard deviation decrease in api00 with the other variables in the model held constant. This means
that the positive relationship between average class size and academic
performance can be explained away by adding a proxy of socioeconomic status
and teacher quality into our model.
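The claim that the Beta column equals the coefficients you would get after z-scoring everything can be checked directly; the sketch below uses random toy data and plain numpy (illustrative only, not SPSS output):

```python
import numpy as np

def ols_coefs(y, X):
    """Least-squares slope coefficients (intercept dropped)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta[1:]

def zscore(a):
    a = np.asarray(a, dtype=float)
    return (a - a.mean(0)) / a.std(0, ddof=1)

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=150)

b = ols_coefs(y, X)                       # unstandardized coefficients
betas = ols_coefs(zscore(y), zscore(X))   # regression on z-scores
# The standardized coefficients also equal b_j * SD(x_j) / SD(y):
manual = b * X.std(0, ddof=1) / y.std(ddof=1)
assert np.allclose(betas, manual)
```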
To see the additional benefit of adding student enrollment as a predictor let’s click
Next and move on to Block 2. Remember that the previous predictors in Block 1
are also included in Block 2.
Note that we need to output something called the R squared change, so under
Linear Regression click on Statistics and check the R squared change box and
click Continue. The syntax looks like this (notice the new keyword CHANGE
under the /STATISTICS subcommand).
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA CHANGE
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT api00
/METHOD=ENTER acs_k3 full meals
/METHOD=ENTER enroll.
The output we obtain from this analysis is:
Variables Entered/Removed

Model   Variables Entered                                           Variables Removed   Method
1       pct full credential, avg class size k-3, pct free meals     .                   Enter
2       number of students                                          .                   Enter
Model Summary
(R, R Square, Adjusted R Square, Std. Error of the Estimate, and Change Statistics: R Square Change, F Change, df1, df2, Sig. F Change; values not reproduced here)
a. Predictors: (Constant), pct full credential, avg class size k-3, pct free meals
b. Predictors: (Constant), pct full credential, avg class size k-3, pct free meals, number of students
ANOVA

Model 1, Regression: Sum of Squares = 6604966.181, df = 3, Mean Square = 2201655.394, F = 615.546, Sig. = .000(b)

b. Predictors: (Constant), pct full credential, avg class size k-3, pct free meals
c. Predictors: (Constant), pct full credential, avg class size k-3, pct free meals, number of students
Coefficients

Model 1               B        Std. Error   Beta    t         Sig.
pct free meals        -3.686   .112         -.828   -32.978   .000
pct full credential   1.327    .239         .139    5.556     .000

Model 2               B        Std. Error   Beta    t         Sig.
avg class size k-3    .841     2.243        .008    .375      .708
pct free meals        -3.645   .111         -.819   -32.965   .000
pct full credential   1.080    .244         .113    4.420     .000
number of students    -.052    .014         -.084   -3.711    .000
Excluded Variables(a)

Model 1              Beta In   t        Sig.   Partial Correlation   Collinearity Statistics (Tolerance)
number of students   -.084b    -3.711   .000   -.184                 .849

a. Dependent Variable: api 2000
b. Predictors in the Model: (Constant), pct full credential, avg class size k-3, pct free meals
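The F test behind the Sig. F Change column can be sketched as follows, using plain numpy on toy data (this mirrors the standard R-squared-change formula, not SPSS's internal code; the variable names are hypothetical):

```python
import numpy as np

def r2(y, X):
    """R-squared of an OLS fit of y on X (with intercept)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

def f_change(y, X_block1, X_block2_extra):
    """R-squared change and its F test when Block 2 predictors are added."""
    X_full = np.column_stack([X_block1, X_block2_extra])
    r2_1, r2_2 = r2(y, X_block1), r2(y, X_full)
    df1 = X_full.shape[1] - X_block1.shape[1]   # number of added predictors
    df2 = len(y) - X_full.shape[1] - 1          # residual df of the full model
    f = ((r2_2 - r2_1) / df1) / ((1 - r2_2) / df2)
    return r2_2 - r2_1, f, df1, df2
```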
1.6 Summary
In this seminar we have discussed the basics of how to perform simple and
multiple regressions, the basics of interpreting output, as well as some related
commands. We examined some tools and techniques for screening for bad data
and the consequences such data can have on your results. We began with a
simple hypothesis that decreasing class size increases academic performance.
However, what we realize is that a correct conclusion must be based on valid data as well as a sufficiently specified model. Our initial findings changed when we removed implausible (negative) values of average class size. After correcting the data, we found that with class size as the sole predictor, increasing class size appeared to have a positive effect on academic performance; however, the R-square was low. When we added more explanatory predictors to our model, such as proxies of socioeconomic status, teacher quality and school enrollment, the effect of class size disappeared. Our hypothesis that larger class size decreases performance was not confirmed once we specified the full model. Let’s move on to the next lesson, where we make sure the assumptions of linear regression are satisfied before making our inferences.
(Optional) Proof for the Standardized Regression Coefficient for Simple
Linear Regression
Suppose a and b1 are the unstandardized intercept and regression coefficient, respectively, in a simple linear regression model. Additionally, we are given that the formula for the intercept is a = y-bar - b1*x-bar. The simple linear equation is given as:

yi = a + b1*xi + ei

Substituting the formula for the intercept we obtain:

yi = (y-bar - b1*x-bar) + b1*xi + ei

Rearranging terms:

yi = y-bar + b1*(xi - x-bar) + ei

Subtracting y-bar from both sides (note that the resulting first term on the right-hand side goes to zero):

(yi - y-bar) = b1*(xi - x-bar) + ei

Multiplying the first term on the right-hand side by SD(x)/SD(x) = 1:

(yi - y-bar) = b1*SD(x)*(xi - x-bar)/SD(x) + ei

Substitute Zx(i) = (xi - x-bar)/SD(x), which is the standardized variable of