
A Manual for Conducting Analyses with Data from TIMSS and PISA

Report prepared for the UNESCO Institute for Statistics


J. Douglas Willms, Canadian Research Institute for Social Policy, University of New Brunswick
and Thomas Smith, Peabody College, Vanderbilt University

Table of Contents

Chapter 1 Getting Started
   Downloading and preparing the source data base
   Creating and merging tall skinny files for your research database
Chapter 2 Constructing Pupil and School/Classroom Variables
   Approaches to scaling variables
   Handling missing data
Chapter 3 Using Survey Design Weights and Plausible Values
   The Function of Survey Design Weights
   The Function of Replicate Weights
   The Use of Plausible Values
Chapter 4 Estimating Basic Hierarchical Linear Models
   The null HLM and variance components
   A two-level HLM with variables at level 1
   Random or fixed effects?
   A two-level HLM with variables at levels 1 and 2
   Intercepts and slopes as outcomes

Chapter 1 Getting Started


Downloading and Preparing the Source Data Bases
PISA 2003. The source data bases for PISA 2003 can be downloaded from the PISA website at the following address: http://pisaweb.acer.edu.au/oecd_2003/oecd_pisa_data.html. For the purposes of this workshop you will need to download the SPSS control files for the student and school questionnaires, and the student and school data sets in TXT format. You will probably also find it useful to download the corresponding codebook files. Our first aim is to construct SPSS system data files for the PISA 2003 data, one for the student data and one for the school data. To do this you will need to slightly modify the SPSS control files to read the *.txt data sets. We have included an example of the modified files on the website. This necessary first step can be described as follows:

Int_Stui.TXT + Read_StuI.SPS → PISA_2003_Pupil.SAV


Our file to read the data, Read_StuI.SPS, was modified in two ways. First, it includes the file handle indicating where the Int_Stui.txt file is located, and it changes the format of the country, school, and student ID variables from alphanumeric to numeric:

DATA LIST FILE = 'f:\PISA2003\DATA\pupil\INT_stui.txt'
 /COUNTRY 1 - 3
 CNT 4 - 6 (a)
 SUBNATIO 7 - 10 (a)
 SCHOOLID 11 - 15
 STIDSTD 16 - 20
 etc.

Second, it includes the code to create a PISA ID variable that can be used for matching all files at the student level. The code also calls for the data to be sorted by this ID variable, and a file handle is given for the PISA2003_pupil.sav data set:

*** inserted pisaid here.
compute countryx=country*10000000000.
compute schlidx=schoolid*100000.
compute pisaid=countryx+schlidx+stidstd.
compute countryz=100000*country.
compute schlid=countryz+schoolid.
format pisaid (f11.0) schlid (f8.0).
execute.
sort cases by pisaid.
save outfile="f:\pisa2003\data\pupil\pisa2003_pupil.sav".
execute.

TIMSS 2003. The source data bases for TIMSS 2003 should be available for download from the TIMSS website at the following address in March 2005: http://timss.bc.edu/timss2003.html. The data files referred to in this manual were those made available to TIMSS National Coordinators around the time of the press release in December 2004. The SPSS syntax file Join_M3.sps (available on the UIS website) can be used to merge the SPSS portable files into an international dataset. Just modify the list of countries, the datafile type (student = BSG, math teacher = BMT, science teacher = BST, school = BCG, etc.), and the sort variables (which depend on the data file type). As the ADD FILES command in SPSS is limited to 50 files, only 50 countries/jurisdictions can be processed at one time. The syntax file included on the website does not include the US state of Indiana, one of the benchmarking jurisdictions participating in TIMSS 2003.
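The general pattern in Join_M3.sps is sketched below for two countries' student (BSG) files. The directories, file names, and sort variables shown are illustrative assumptions only; substitute your own list of countries, the file-type prefix you are merging, and the ID variables appropriate to that file type (consult the TIMSS codebook).

* Sketch only: convert each country's portable file, then stack the student files.
import file = 'c:\timss2003\por\bsgausm3.por'.
save outfile = 'c:\timss2003\sav\bsgausm3.sav'.
import file = 'c:\timss2003\por\bsgrusm3.por'.
save outfile = 'c:\timss2003\sav\bsgrusm3.sav'.
* Stack the country files (at most 50 files per ADD FILES command).
add files
 /file = 'c:\timss2003\sav\bsgausm3.sav'
 /file = 'c:\timss2003\sav\bsgrusm3.sav'.
execute.
sort cases by idcntry idschool idstud.
save outfile = 'c:\timss2003\sav\bsg_allm3.sav'.
execute.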

Additional technical information about the structure of the data files, as well as how they were created and cleaned, is included in the file Cleaning_Docu_Version_3.0.pdf, which was distributed to TIMSS National Coordinators.

Creating and Merging Tall Skinny Files


What are tall skinny files? Tall skinny files are SPSS data files (i.e., *.SAV files) that include information on a single variable, or sometimes a small set of related variables, together with the PISA ID variables. They are created with a single syntax file (*.SPS in SPSS) that: (1) reads the source data base (e.g., PISA_2003_Pupil.SAV), (2) creates a new derived variable or set of variables through various manipulations, such as recoding existing variables or combining existing variables, and (3) saves the tall skinny file as an SPSS system data set. The new variable is assigned a new name, usually one that is more meaningful for analysis than the ones used in the source data set. For a simple example, the variable in the source PISA data set for the child's sex is st03q01. In the syntax file for the new variable called female, the original variable st03q01 is recoded into a dummy variable that can be used in analysis:

recode st03q01 (2=0) (1=1) (else=sysmis) into female.

The syntax file can also include the code for creating a new variable that flags cases with missing data on the target variable, and a new derived variable that has imputed values for missing cases.

The overall strategy is for the research network to build up a set of syntax files that produce tall skinny files. Analysts can then simply choose the tall skinny files with the variables that they wish to use for their particular analysis, and merge them together using the PISAID variable. They then have a data set ready for analysis. Although this approach may seem somewhat tedious, and uses more space on the hard drive, we have found that it works very well with large research networks for several reasons. First, analysts can see how particular variables were derived without having to wade through very long syntax files. Second, when there is an error in the syntax, one can simply fix the relevant syntax file and run it to produce a new tall skinny file. This avoids having to run very large syntax files that try to do everything in one step. We have also found that syntax errors are easier to identify with this approach. Third, it enables network members to contribute to the efforts of other analysts. For example, one analyst may wish to test whether a different approach to the scaling of socioeconomic status (SES) produces different results. He or she can simply create the syntax for a tall skinny file with the new SES variable, which can then be merged into the working data set for analysis. Fourth, some variables may need to be created using programs other than SPSS. For example, one may wish to create a new variable based on Rasch modelling using specialized software. The new variable can easily be slotted into the data structure as a tall skinny file. Fifth, analysts usually use different sets of variables for their analyses. With the tall skinny data sets, they can simply choose which files they need for their particular analyses. Finally, and perhaps most important, the approach saves people many hours of tedious work, as they do not need to create every single variable themselves.

The data structure for the use of tall skinny files for the PISA data is presented in Figure 1. A similar structure would apply to the PISA school-level data, and to the pupil, teacher, and school TIMSS data sets. For the UIS project, the syntax files and the tall skinny files are made available on the website. A sketch of one such syntax file, and of the JoinPupil.SPS merge step, follows Figure 1.

Figure 1. Data structure for the use of tall skinny files:

   Int_Stui.TXT + Read_Stui.SPS → PISA_2003_Pupil.SAV
   PISA_2003_Pupil.SAV + Weights.SPS / Pared.SPS / SES.SPS → WEIGHTS.SAV / PARED.SAV / SES.SAV
   WEIGHTS.SAV + PARED.SAV + SES.SAV + JoinPupil.SPS → Pupil_Master.SAV
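As an illustration of the pieces in Figure 1, a complete syntax file producing a single tall skinny file, together with the JoinPupil.SPS step that merges several of them by PISAID, might look roughly as follows. The file handles are illustrative, and the variable female is the simple example described above; the actual syntax files on the website follow the same pattern.

* FEMALE.SPS: a sketch of a tall skinny file for student sex.
get file = 'f:\pisa2003\data\pupil\pisa2003_pupil.sav'.
* Recode the source variable into an analysis-ready dummy variable.
recode st03q01 (2=0) (1=1) (else=sysmis) into female.
* Flag cases with missing data on the target variable.
compute m_female = missing(female).
sort cases by pisaid.
save outfile = 'f:\pisa2003\data\pupil\female.sav'
 /keep = pisaid female m_female.
execute.

* JoinPupil.SPS: a sketch of merging the chosen tall skinny files by PISAID.
match files
 /file = 'f:\pisa2003\data\pupil\weights.sav'
 /file = 'f:\pisa2003\data\pupil\pared.sav'
 /file = 'f:\pisa2003\data\pupil\ses.sav'
 /by pisaid.
save outfile = 'f:\pisa2003\data\pupil\pupil_master.sav'.
execute.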

Chapter 2 Constructing Pupil and School/Classroom Variables


Approaches to Scaling Variables
In educational research we often want to measure abstract concepts, or constructs, such as mathematical ability, self-esteem, or socioeconomic status. Such constructs cannot be measured directly, and therefore they are commonly called latent variables. However, we develop tests and questionnaires that yield observable responses, or manifest variables, which we believe depend on people's scores on an underlying latent variable. For example, we believe that someone with high self-esteem is likely to respond differently to a statement like "I am liked by most people" than someone with low self-esteem. While we can never actually attain the true scores on the unobservable latent variable, we can make progress in understanding our world by aggregating the scores on the manifest variables that we believe represent the latent construct. The term scaling refers to the way in which we aggregate scores from the responses to a set of test items or self-report questions into a single score. There are many different approaches to scaling, and a comprehensive treatment of the subject would require several volumes. For our purposes, we will consider different classes of scaling approaches and briefly discuss their role in this project.

A simple way to classify scaling methods is to consider whether the underlying variable is categorical or continuous. For example, we might consider people to be depressed, in a clinical sense, or not depressed, or a child to be hyperactive or not hyperactive. While there are many shades of grey in such constructs, we often wish to make such classifications, especially for the development of social policy. In contrast, a construct such as mathematical ability or socioeconomic status is usually considered continuous. Our observed variables are also measured as either continuous or categorical variables.

When the latent variable is continuous and the observed variables are continuous, the most common approach for creating a scale is factor analysis. Factor analysis is often used to determine the number of latent variables underlying a set of observed scores, as well as to create a scale from the observed responses. When the latent variable is continuous and the observed variables are categorical, a set of techniques based on item response theory (IRT) is commonly used. The basic idea is that people have a certain position on the underlying continuum of the latent variable and, depending on their position, they will have a particular response pattern for the observed variables. If we have ten dichotomous items on a brief mathematics test, for example, there are 2^10 (1,024) possible response patterns, whereas if we have 6 Likert items intended to measure a person's self-esteem, each with 4 possible responses (e.g., strongly disagree, disagree, agree, strongly agree), there are 4^6 (4,096) different response patterns. IRT includes several different models for describing the relationship between a person's position on the underlying latent variable and the possible response patterns. The model used in PISA is the Rasch model. It provides information about the position of each item on the underlying latent scale, as well as the likely position of each respondent. A set of techniques commonly called cluster analysis (or latent profile analysis) is used when the latent variable is categorical and the observed variables are continuous. Cluster analysis techniques generally strive to classify people into a small number of clusters based on their responses to a set of observed continuous variables. Finally, when both the latent and observed variables are categorical, a technique called latent class analysis is commonly used. Like IRT methods, the classification of subjects is based on the entire set of possible response patterns for the observed data.

In practice, most researchers use fairly simple approaches for scaling data. For example, on a 6-item scale of self-esteem, with 4 possible responses for each item, a common approach is to score the responses as 0, 1, 2, and 3, and then either compute the total score (ranging from 0 to 18) for each respondent or the average score (ranging from 0 to 3). Although this is not the preferred approach, it usually yields a scale that one can work with, at least in the early stages of analysis. An approach suggested by Mosteller and Tukey is to presume that the underlying latent variable has a logit distribution (like a normal distribution but with fatter tails). Their approach treats the observed scaled scores (e.g., ranging from 0 to 18) as 19 ordinal categories. The scores are recoded into new scale values based on the assumption of a logit distribution and the frequency of each observed score. The new scale values preserve the original order, but the distance between scores varies, depending on the original distribution of scores. A variant of this approach is to presume that the underlying latent variable has a uniform distribution, and to rescale the data based on the observed frequency of each attained score. We have provided an SPSS syntax file for scaling variables using these two approaches.

The second common approach is to conduct a factor analysis of the observed data. In this case, the researcher is presuming that the Likert responses (0, 1, 2, and 3) are measures on a continuous scale. Although this assumption can easily be shown to be false, the results provide an indication of whether there is more than one underlying latent construct. The most straightforward factor analytic technique, principal components analysis, yields a factor score, which is commonly used as the scaled score representing the latent variable. We have used this approach, for example, to construct our measure of socioeconomic status for PISA.

IRT is the preferred approach for our 6-item Likert scale, as it considers a much wider range of response patterns. Fitting IRT models requires specialized software, and this can be a rather time-consuming process. Over the longer term, we would like to see most of the major constructs we are using in PISA and TIMSS analysed with IRT techniques. That said, in practice the simpler techniques suggested above nearly always yield scaled scores that are highly correlated with scores derived with IRT (in our experience, with r = 0.98 or higher). To make progress, therefore, many of the variables we use have been scaled using the simpler approaches. The most important variables for PISA and TIMSS, the mathematics scores, were scaled centrally with IRT techniques and are available in the data downloaded from the websites.
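As a sketch of the two simpler approaches described above (the item-average score and the principal components factor score), the syntax below uses a hypothetical 6-item self-esteem scale with items named est1 to est6; the item names are illustrative, and any reverse-coding is assumed to have been done beforehand.

* Average score across the six items, requiring at least four valid responses.
compute esteem_mean = mean.4(est1, est2, est3, est4, est5, est6).
* One-factor principal components solution, saving the factor score as a new variable.
factor variables = est1 est2 est3 est4 est5 est6
 /extraction = pc
 /criteria = factors(1)
 /save = reg(all, esteem).
execute.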

Handling Missing Data


Why is missing data a problem? Nearly all standard statistical methods presume that every case has information on all the variables to be included in the analysis. At the same time, nearly all voluntary surveys have items that respondents refused to answer, accidentally skipped, or were not asked for a particular subset of the sample. Both the PISA and the TIMSS data files have missing data due to item non-response (e.g., a student skipped the question on parental education) or because a teacher or administrator did not complete the questionnaire for their class or school. Data are missing and we need to deal with it! So what can we do? As with approaches to scaling, there are some simple approaches and some more difficult and time-consuming, but preferable, approaches. The method we choose depends to some extent on why the data are missing.

Missing completely at random (MCAR). When data are MCAR, the probability of missing data on Y is unrelated to the value of Y itself or to the values of any other variables in the dataset. MCAR allows missingness on Y to be related to missingness on X, but not to the value of X. A simple test for MCAR is to divide the sample by missingness on Y and examine whether the values of X differ for the two groups. MCAR is a strong assumption, but it is reasonable in some cases.

A weaker assumption: missing at random (MAR). When data are MAR, the probability of missing data on Y is unrelated to the value of Y after controlling for other variables in the analysis: Pr(Y missing | Y, X) = Pr(Y missing | X). This assumption is reasonable when the observed data capture the key confounding influences, that is, the variables related to both missingness and the outcome of interest. Unfortunately, it is impossible to test whether the MAR condition is satisfied. If the data are not MAR, then we say that the missing data problem is non-ignorable. In such cases we would need to model the missing data mechanism in some way to get good estimates of the parameters of interest. An example is Heckman's (1976) two-stage estimator for regression models with selection bias on the dependent variable. This, unfortunately, requires very good prior knowledge about the nature of the missing data process, something we are unlikely to have in TIMSS or PISA. Furthermore, the results tend to be sensitive to the models chosen. We will briefly discuss four approaches commonly used by analysts.

1. Listwise deletion. The most common method for handling missing data is listwise deletion: we delete from the sample any observations that have missing data on any of the variables in the model of interest. This is what most often happens if one simply runs an analysis and ignores the missing data. If the data are MCAR, then the reduced sample will be a random subsample of the original sample, resulting in unbiased parameter estimates. However, in many social science data sets one can lose more than half of the subjects to listwise deletion when the model includes 6 or 7 variables. Therefore, even if the data are MCAR, an important disadvantage of listwise deletion is a loss of precision. This is reflected in the standard errors, which are larger because less information is utilized, and the analyst therefore runs the risk of a Type II error. Moreover, if data are MAR rather than MCAR, estimates can be biased.

2. Pairwise deletion. With pairwise deletion, regression analyses or factor analyses are based on a correlation matrix derived from the data available for each pair of variables. Statisticians tend not to like this approach, as the resulting correlation matrix can have some undesirable properties (e.g., sometimes it cannot be inverted). We tend to use it for preliminary exploratory analyses, because it can be done easily in SPSS for ordinary least squares regression. However, for logistic regression or various other non-linear models it is not an option. This option used to be available for level-1 (e.g., pupil-level) data in HLM, but it was never an option for level-2 (e.g., school-level) data, and the latest version of HLM makes no provision for pairwise deletion.

3. Mean substitution with dummy variable adjustment. This method is described in more detail in Cohen & Cohen (1985) and Cohen et al. (2003). The analyst creates a dummy (flag) variable D that is equal to 1 if data are missing on X and equal to 0 otherwise. He or she then creates a new variable, X*, which is equal to X when data are not missing (i.e., when D = 0). When cases are missing on X, X* is set to some constant, c; normally one sets c equal to the mean value of X. The regression model regresses Y on X*, D, and the other independent variables. The coefficient for X* is interpreted as the relationship between Y and X for those subjects for whom data on X were available, while the coefficient for D, if mean substitution was used, is the "missingness" effect, that is, the difference in Y between those who had data on X and those who did not. One of the problems with this approach is that the estimate of the coefficient for X* pertains to subjects for whom data on X were available, whereas what we would really like is the estimate for all subjects. Thus, the estimates can be biased, even under MCAR (Jones, 1996). Another problem is that when one has 6 or 7 variables, each with missing data, the analyst has 6 or 7 dummy variables to contend with. This set of variables tends to be collinear, and can therefore cause a regression model to become unstable or yield unrealistically large standard errors. That said, this approach is easy to use, and it is quite practical for dealing with variables that have a small amount of missing data (e.g., less than 10%).

4. Maximum likelihood/EM algorithm. This approach examines the relationships among the variables intended for use in an analysis, and imputes values for the missing data using iterative methods (Little & Rubin, 1987). It provides appropriate standard errors for the estimates in studies using large samples (n > 200). The model assumes that missingness is completely accounted for by variables in the model, and under this assumption the estimates are the best possible. The procedure works for missing data on the Xs and on Y. It requires specialized software, although many of the large commercial programs (e.g., SPSS, BMDP, MPLUS) now include modules for imputing missing data. The best imputation techniques require the analyst to specify a particular regression model and to include Y in the imputation. The technique then estimates several plausible values for the missing values on the X variables, and ideally one should analyze each of the several resulting data sets and aggregate the results; this is known as multiple imputation (see Allison, 2002, for a complete discussion of options for handling missing data). This allows the uncertainty of the imputed values to be represented in the analysis, thus resulting in appropriate standard errors. While this is feasible, it requires much more analysis and is not practical in the exploratory phase of analysis. Another hurdle is that the procedures used to impute values on X do not account for the multilevel structure of most educational data. Multilevel procedures for imputing missing data are in their early stages of development.

Therefore, for this project, we advocate the use of the mean substitution/dummy variable approach to get us started. The syntax files used to make the tall skinny files described earlier include variables with missing data set to the country mean, alongside a dummy variable flag; a sketch of the approach is given below. The HLM analyses described in Chapter 4 demonstrate the use and interpretation of this approach. As the project proceeds, we plan to use more advanced methods to determine whether our principal findings are sensitive to the method used for handling missing data.
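For a single variable, the approach amounts to a missing-data flag plus a conditional substitution. The sketch below uses an illustrative variable name (hped, for highest parental education); the aggregate step, which requires a version of SPSS that supports MODE=ADDVARIABLES, attaches the country mean so that it can be substituted for the missing values.

* Flag cases that are missing on the target variable.
compute m_hped = missing(hped).
* Attach the country mean of hped to every case.
aggregate outfile = * mode = addvariables
 /break = country
 /hped_cm = mean(hped).
* Imputed version: the observed value where available, the country mean otherwise.
compute hped_i = hped.
if (missing(hped)) hped_i = hped_cm.
execute.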


Chapter 3 Using Survey Design Weights and Plausible Values


The Function of Survey Design Weights
In many studies, subjects are randomly selected from a target population. In the simplest case, students are randomly sampled from the target population, with each student having an equal chance of being selected. However, simple random sampling is seldom used in large-scale national and international surveys. One reason is that researchers often wish to make inferences about particular sub-populations. In order to achieve accurate statistical estimates for each sub-population, the design might call for over-sampling (i.e., taking a proportionately larger sample) from small sub-populations and under-sampling from large sub-populations. In Canada, for example, the survey design for PISA over-samples small provinces (e.g., Prince Edward Island and New Brunswick) and under-samples large provinces (e.g., Ontario and Quebec), such that the achieved sample sizes for each province are approximately equal. This enables researchers to obtain statistical estimates for each province that are of comparable accuracy. However, if one calculated a statistic for the national sample without taking account of this sample design, the estimate would be biased. Biases associated with sampling can also arise because the sampling frame (the list of potential participants comprising the population) is inaccurate or out of date, or because some schools or students refuse to participate. Student absenteeism can also contribute to bias. Generally, the sample design aims to help researchers answer the research questions of interest with the maximum precision, given the available resources.

The role of the design weight is to weight students differentially such that these biases are eliminated, or at least reduced. One way to think about this is that each student has a certain probability of being selected, and the weight is the inverse of the probability of selection. In New Brunswick, for example, about one-third of all 15-year-old students are selected for PISA; the weights for New Brunswick students are therefore about 3.0. In Ontario, where about 1 in 30 students are selected, the weights are about 30. In reality the sampling design is more complicated than this, but hopefully this conveys the idea. One way in which the design is more complicated is that the sample is drawn in two stages: in the first stage, schools are drawn from the population of schools with a probability proportional to their size, and in the second stage a fixed number of children are sampled from within each school. The weight for a student must therefore reflect the probability of his or her school being selected, as well as the probability that he or she was selected within that school. The design weights can also incorporate information about non-response to the survey. For example, if girls are less likely than boys to be absent from school, or to refuse to do the survey, then the weights can compensate for this potential bias. Both the PISA and TIMSS surveys include an overall sample design weight, and this should be used to calculate most statistics.
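In SPSS, applying the design weight is a single command issued before computing statistics. A minimal sketch, assuming the PISA final student weight w_fstuwt and an illustrative derived variable ses from the tall skinny files (the TIMSS equivalent of the overall student weight is totwgt):

* Turn on the overall design weight.
weight by w_fstuwt.
descriptives variables = ses.
* Turn weighting off when it is no longer wanted.
weight off.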


The Function of Replicate Weights


When we calculate most statistics, such as a mean or a regression coefficient, we also want to know how accurate the estimate is. One way to think about statistical accuracy is to imagine drawing 1000 different samples for the PISA or TIMSS study, each time following the random selection procedures consistent with our sampling design. We could then use the 1000 different data sets to obtain 1000 estimates of the statistic of interest. The estimates would differ from one data set to another, as they are based on different samples from the population. For example, we could estimate the variance of the 1000 estimated means. This is called the sampling variance of the mean. The square root of the sampling variance is called the standard error, which is commonly used as an indicator of the accuracy of the statistic.

In the case of a mean score, the magnitude of the sampling variance depends on the sample size. It also depends on the extent to which the scores vary in the population. For mean scores, the sampling variance can be calculated directly: it is σ²/n, where σ is the population standard deviation of the variable. Generally we estimate σ by calculating s, the sample standard deviation, and s²/n is the estimated sampling variance. However, this formula applies under conditions of simple random sampling; it does not apply to more complicated sample designs. Moreover, the formulae for the standard errors of many statistics are unknown. Also, of course, it is entirely impractical to draw 1000 separate samples from the population, as in the thought experiment described above.

Instead, consider using the original sample of data, which has n students, and drawing 1000 samples from that data set, each time sampling n students with replacement. With this approach some students might get selected more than once in some of the 1000 samples, while others may not be selected at all. With the 1000 bootstrap samples in hand, you can simply calculate the statistic of interest for each of the 1000 samples, and then calculate the sampling variance (and hence the standard error) of the 1000 estimates of the statistic. This technique is called the bootstrap. The jackknife is a close cousin of the bootstrap, but instead of sampling n students with replacement, the analyst draws n replicate samples of n - 1 units each. The formula for calculating the sampling variance is similar to the usual formula for a variance: for each of the n replicates, one computes the difference between the estimate based on the replicate sample and the estimate based on the full sample; these differences are squared, summed, and multiplied by (n - 1)/n. If θ is the statistic of interest (e.g., a mean or regression coefficient), and θi is the same statistic estimated for replicate sample i (i = 1, ..., n), then the sampling variance is:
Sampling Variance = sv^2 = \frac{n-1}{n} \sum_{i=1}^{n} (\theta_i - \theta)^2

In the PISA and TIMSS surveys, the sampling design calls for national centres to stratify their sampling frame of schools into strata that are likely to be related to school performance. These strata might be certain geographic regions, or they might be types of schools, such as public and private or vocational and academic. Schools and students are then selected in a two-stage design, with schools selected with probability proportional to their size. The strategy for estimating the jackknife sampling variance is similar to that described above for a simple random sample, except that it entails the pairing of schools within strata, such that schools of similar size are paired. In this case there are J pairs of schools (j = 1, ..., J), and the formula is:

Sampling Variance = sv^2 = \sum_{j=1}^{J} (\theta_j - \theta)^2

Rust (1995) provides further details, and shows how this approach takes both the stratification and the two-stage clustering into account. When conducting the analysis, we do not have to draw the random replicate samples within each stratum ourselves. Instead, a set of weights is provided which does the sampling for us: these weights simply select some cases (by giving them a weight of 1.0, for example) and leave out others (by giving them a weight of 0). To get the estimate of a particular statistic we simply weight by the replicate weight for replicate sample 1, compute θ1, and then repeat this step for all of the J replicates. With these J estimates in hand we can estimate the sampling variance using the formula above. We have provided SPSS programs to do this operation in a reasonably automated fashion. Other programs, such as WesVar or SUDAAN, are somewhat better suited for this, but they require you to learn a new program, and so for this project we prefer to keep all of the operations in SPSS. The TIMSS study uses the jackknife as described above, while PISA uses a variant called Balanced Repeated Replication (BRR), modified with a procedure developed by Fay. The BRR approach selects one school for removal from each stratum, but then doubles the weights for the other schools. The Fay modification, which avoids problems associated with estimation in small strata, instead uses weights of 0.5 for the school selected for removal and 1.5 for the schools selected for inclusion in each replicate. (Thus, schools are not really removed, but the effect is the same.) In PISA the BRR procedure has 80 replicate samples, and the formula for the sampling variance is:
Sampling Variance = sv^2 = \frac{1}{G(1-k)^2} \sum_{i=1}^{80} (\theta_i - \theta)^2 = \frac{1}{20} \sum_{i=1}^{80} (\theta_i - \theta)^2

with G = 80 replicates and the Fay factor k = 0.5.

The BRR procedure can also be accommodated in SPSS, although the syntax is somewhat awkward, especially for regression coefficients. We provide examples for estimating means and regression coefficients. The multilevel regression package HLM (or the alternative, MLwiN) explicitly takes into account the hierarchical nature of the two-stage (pupils within schools) sample. However, the analyst must still use the overall design weight provided for PISA or TIMSS. With a weighted multilevel regression analysis, the estimates of the standard errors are quite close to those obtained using either the jackknife or BRR procedures. It would of course be feasible to use the replicate weights to run the desired HLM analysis multiple times and then compute the standard errors using the formula above. However, the results are quite similar, and HLM is not well set up for doing multiple runs.
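To give an indication of how the replicate weights are used in practice, the macro below loops over the 80 PISA replicate weights (w_fstr1 to w_fstr80) and reports the weighted mean of an illustrative variable, ses, for each replicate. This is only a sketch; the syntax files we provide on the website follow this pattern and also collect the replicate estimates and apply the variance formula.

* Sketch only: compute the statistic of interest once per BRR replicate weight,
* then restore the overall design weight (w_fstuwt).
define !repmeans ()
!do !i = 1 !to 80
weight by !concat(w_fstr, !i).
descriptives variables = ses /statistics = mean.
!doend
weight by w_fstuwt.
!enddefine.
!repmeans.

The 80 replicate means can then be collected (for example with OMS, or by pasting them into a small working file) and plugged into the sampling variance formula above, together with the full-sample estimate obtained under w_fstuwt.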


The Use of Plausible Values


Both PISA and TIMSS use an incomplete or rotated-booklet design for testing children on the major outcome variables. One can think of this as having each student complete only a small proportion of a very long achievement test. However, we wish to give each student a score for the full test. This is directly analogous to the problem of imputing missing data discussed earlier. Essentially, instead of one score there is a distribution of plausible scores that the student might have obtained had he or she completed the full test, given the measurement error associated with the test. Plausible values are a sample of scores from this distribution (called the posterior distribution). See Beaton (1987) and Wu and Adams (2002) for a detailed discussion of plausible values. Both PISA and TIMSS provide 5 plausible values for each student.

To obtain the point estimate of a statistic, one estimates the desired statistic, θ, for each plausible value and then averages the 5 estimates:

\theta = \frac{1}{5} \sum_{PV=1}^{5} \theta_{PV}

The sampling variance is the sum of two components: an average sampling variance for the 5 plausible values and an imputation variance. The average sampling variance is computed by estimating the sampling variance associated with each plausible value and averaging the 5 values. The imputation variance is the variance of the five estimates θPV, calculated in the usual way:

\text{Imputation Variance} = \frac{1}{4} \sum_{PV=1}^{5} (\theta_{PV} - \theta)^2

The sampling variance is then simply the average sampling variance across the 5 PVs plus 1.2 times the imputation variance. As before, the standard error is the square root of the sampling variance. Note that in working with plausible values, one cannot simply take the average of the 5 plausible values and use the resulting score as the dependent variable; doing so results in biased estimates of the standard errors of any calculated statistic. For statistics involving test scores (e.g., a mean mathematics score, or a model in which mathematics scores are the dependent variable), one must estimate the sampling variance for each of the PVs using the jackknife or BRR procedure. In the case of PISA, for example, this means conducting an analysis 400 times: 80 replicates for each of the 5 PVs. With HLM, there is an option to indicate that there are plausible values to be taken into account. When you invoke this option, the analysis is repeated five times, once for each PV, and the program then computes the correct standard errors for the regression coefficients. For details, see the section The Null HLM and Variance Components in Chapter 4.
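As a sketch of the combination step, suppose the 5 point estimates and their 5 sampling variances have been placed in a one-case SPSS file with illustrative variable names theta1 to theta5 and sv1 to sv5. The final estimate and standard error then reduce to a few compute statements:

* Combine the 5 plausible-value estimates and their sampling variances.
compute theta = mean(theta1, theta2, theta3, theta4, theta5).
compute sv_avg = mean(sv1, sv2, sv3, sv4, sv5).
compute impvar = ((theta1 - theta)**2 + (theta2 - theta)**2 + (theta3 - theta)**2
 + (theta4 - theta)**2 + (theta5 - theta)**2) / 4.
compute sampvar = sv_avg + 1.2 * impvar.
compute se = sqrt(sampvar).
execute.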


Chapter 4 Estimating Basic Hierarchical Linear Models


Ordinary Least Squares (OLS) regression models assume that residuals (i.e., the differences between the observed values and the values predicted by the model) are normally distributed and independent, with a mean of zero and a constant variance. When data are collected using a cluster or area sampling method, as is the case in TIMSS and PISA, the residuals are unlikely to be independent of each other. For example, we would expect the mathematics achievement of students within a school to be more similar than would be the case in a simple random sample of students. This is because students in the same school are more likely to share a common curriculum, common textbooks, common teachers, and other school and community resources than a random sample of students drawn across schools. A major concern when using OLS regression to estimate relationships on clustered data is that the estimated standard errors will be too small (negatively biased), leading to an overestimation of the statistical significance of regression coefficients.

The bootstrap and jackknife estimation procedures described in the last chapter are one way to estimate unbiased standard errors for clustered data. Another way to address this potential alpha inflation (Cohen et al. 2003) is to use a multilevel model, which makes different assumptions about the correlation structure of the individual observations (Kreft & de Leeuw, 1998; Raudenbush & Bryk, 2002; Snijders & Bosker, 1999). This class of models, also known as Hierarchical Linear Models (HLMs), has the additional advantage of allowing us to estimate the effects of group-level (e.g., school or class) variables both on group average outcomes (e.g., school mean achievement) and on the within-group relationship between individual characteristics (e.g., SES) and the outcome of interest. For example, we can estimate the degree to which urban and rural schools have different average levels of achievement, holding student SES constant, as well as whether achievement is more equitably distributed within rural schools than within urban schools. HLMs also allow us to estimate the "true" variance in these regression parameters across groups. For example, we can ask: how much variance is there between schools in both average achievement and in the SES effect, and how much of this variance can be explained by school-level predictors?

This chapter explains how to estimate two-level HLMs (e.g., students nested within classes) and describes how to interpret the results. A more in-depth discussion of the technical details underlying these models can be found, in increasing level of technical detail, in Cohen et al. (2003), Kreft & de Leeuw (1998), Snijders & Bosker (1999), Raudenbush & Bryk (2002), and Goldstein (2003). The examples below use Russia's TIMSS 2003 data; the syntax files that created the data files Make HLM Level 1 File timss Russia.SAV (student level) and Make HLM Level 2 File timss Russia.SAV (class/school level) are available on the UIS website and can be used to replicate the findings for any participating country. A description of how to construct the data file that HLM will analyze (called an MDM file) can be found at http://www.ssicentral.com/hlm/hlm6/hlm6.htm.


The Null HLM and Variance Components


The first step in conducting an HLM analysis is to determine the extent to which observations within schools are correlated, that is, the degree of nonindependence among observations in the sample. A Null Model, or one-way ANOVA with random effects (Raudenbush & Bryk, 2002: 23), is used to partition the variance in the dependent variable, in this case math achievement, into within- and between-classroom components. For this analysis we use the 5 plausible values on the TIMSS 2003 math assessment as the dependent variable (mathpv1 to mathpv5) and weight the data using TOTWGTN.

(Screenshots: selecting an outcome variable; the Null Model; Other Settings and Estimation Settings, used to incorporate plausible values and design weights.)

To use all 5 plausible values of mathematics achievement as the dependent variable in the analysis, first select MATHPV1 as the outcome variable and then select Other Settings, then Estimation Settings, then Plausible values. Choose MATHPV1 from the pull-down menu labelled Choose first variable from level 1 equation, and then double-click on MATHPV2, MATHPV3, MATHPV4, and MATHPV5 to move them from the Possible choices column to the Plausible Values column. Click OK. Now click on the Design Weights button and select TOTWGTN from the pull-down menu under Level-1 Weight. As we have already normalized the weight to sum to the number of observations, we can leave Normalize weight blank. Also click Generalize at level 1, because students are our unit of analysis.


Now we are ready to estimate this null or unconditional model, i.e., a model with no predictors at either level 1 or level 2. The level-1 model (Yij = β0j + rij) predicts math achievement within each school with just one school-level parameter, the intercept β0j, which is simply the mean outcome for school j. The level-2 model is β0j = γ00 + u0j, where γ00 represents the grand-mean outcome (here, average math achievement in Russia) and u0j is the random effect associated with unit j. By substituting the level-2 model into the level-1 model, Yij = γ00 + u0j + rij, it is easier to see how we are partitioning the variance. That is, an individual student's score can be decomposed into the country mean score (γ00), how much the student's school differs from the country mean score (u0j), and how much the individual student's score differs from his or her school's average score (rij). It is the partitioning of the residual or random component of the model into school/class and student-level components that makes this a multilevel model.

Running the model. It is useful to give each model that you run a different name, so that you can easily recall it later if you want to make a modification. We also recommend that you assign the output file a name that corresponds to your command file name. Click on Basic Settings to change the Output File Name; it might be Null Model.TXT. Click on File, then Save as, and then assign the Command File Name as, say, Null model.HLM. For some reason, HLM likes the command files and the data files to be in the same directory, so we recommend that you do so (this will usually be the default). Now click on Run on the menu to estimate the model. You will, hopefully, see the maximum likelihood estimation procedure iterating towards convergence. After this screen disappears, you can view the output by clicking on File and then View Output. The output appears in an ASCII text format (most likely in Notepad), which you can save under a different filename if you wish, or copy pieces out of it into a spreadsheet.

Interpreting the output. The first part of the output recalls what you have just estimated: you should see evidence that the data were weighted in the analysis, that all 5 plausible values were used, and the specific model estimated (in this case a two-level model with no predictors at either level). The most informative output for the null model, however, is found near the bottom of the file.


(Annotated HLM output for the Null Model:)
Estimate of grand mean achievement, γ00 = 507.62; the standard error can be used to construct a confidence interval around this estimate.
Var(Yij) = Var(u0j + rij) = τ00 + σ²
Intraclass correlation coefficient ρ = τ00 / (τ00 + σ²) = 2133.09 / (2133.09 + 3830.38) = .36

Intraclass correlation. As noted above, the Null Model partitions variation in the dependent variable (Yij) into two components: between classes (Var(u0j) = τ00) and within classes (Var(rij) = σ²). In the output above we see estimates of τ00 = 2133.09 and σ² = 3830.38. The proportion of the total variance that is between classes is called the intraclass correlation, ρ = τ00 / (τ00 + σ²), which in this model is estimated to be ρ = 2133.09 / (2133.09 + 3830.38) = .36. Thus, in the Russian TIMSS 2003 data, 36% of the variation in 8th grade students' math scores is between schools. Such a large between-school variance component is an indication that the threat of alpha inflation that would result from assuming the data came from a simple random sample is considerable (by as much as a factor of 10; see Kreft & de Leeuw, 1998: 10), and that there is considerable variation that could be explained using school- or class-level variables. To assess how much Russian achievement scores vary between schools we can construct a plausible value range around the school-level means using the estimate of τ00 (Raudenbush & Bryk, 2002: 71):

γ00 ± 1.96(τ00^{1/2}) = 507.62 ± 1.96 × (2133.09^{1/2}) = (417.1, 598.1)

This indicates that school mean achievement is likely to range from 417 to 598 across 95% of the schools in Russia that serve 8th grade students.


A Two-level HLM with Variables at Level 1


A recommended next step in model building is to select student-level variables to try to explain within-school variation in achievement. We want to use theory, prior research, and whatever we are interested in studying to select a relatively small number of level-one variables, Xq, as predictors. We can think of each school as having its own regression equation, Yij = β0j + β1j(Xij - X̄.j) + rij, with an intercept (β0j) and a slope (β1j). We can estimate the mean value of these parameters (E(β0j) = γ00 and E(β1j) = γ10) as well as their variance across schools (Var(β0j) = τ00 and Var(β1j) = τ11, respectively). For example, we can examine the degree to which parents' level of educational attainment is associated with students' achievement within schools, as well as how much the strength of this relationship varies across schools. We can also examine the degree to which the intercepts and slopes covary (Cov(β0j, β1j) = τ01). We estimate this model using the Russian TIMSS data.

We brought three variations of the parental education variable into HLM: uncentered (where zero = less than ISCED 1, primary education), class-mean centered (where zero = the average level of parental education in each classroom in the Russian data), and country-mean centered (where zero = the average level of parental education in the Russian data). Since the interpretation of the intercept (γ00) depends on where all Xq = 0, it matters how the data are centered. For a more complete discussion of centering, look up Centering under Frequently Asked Questions at http://www.ssicentral.com/hlm/hlm6/hlm6.htm or see Raudenbush & Bryk (2002). Here we include parental education as class-mean centered, meaning that we can interpret the intercept as the predicted average achievement level if each student were at their class-average level of parental education. This is also how we would interpret the variance in this parameter.

Note that in this model there are three random effects being estimated: rij, individual student i's deviation from his or her class j's average achievement after controlling for level of parental education; u0j, the deviation of school j from the country-average math achievement; and u1j, the deviation of school j's parental education/math achievement slope from the average slope across schools (see Raudenbush & Bryk, 2002: 76 for further detail). The results of this model are below. We have also included a dummy variable flag, M_HPEDij, to indicate when the parental education variable was imputed using mean substitution (Cohen et al. 2003).
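For reference, the random-coefficients model just described can be written out as follows; here the level-1 predictor is the class-mean-centered parental education variable, HPED_CLM, and the missing-data flag M_HPEDij enters level 1 as an additional fixed predictor:

Level 1:  Yij = β0j + β1j(HPED_CLMij) + rij
Level 2:  β0j = γ00 + u0j
          β1j = γ10 + u1j

with Var(rij) = σ², Var(u0j) = τ00, Var(u1j) = τ11, and Cov(u0j, u1j) = τ01.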


Here grand mean achievement (γ00), interpreted as the average of the school means on math achievement, is 507.5, similar to the estimate in the Null Model. Since we interpret this coefficient when HPED_CLM = 0, i.e., when each student is at their class-average level of parental education, we say that the intercept is unadjusted for parental education. That is, it is not adjusted for between-class differences in the distribution of students' parents' education level. The average parental education/math achievement slope (γ10) is 10.51, meaning that we could expect a student whose parents have a one-unit higher ISCED level (e.g., upper secondary vs. post-secondary, sub-bachelor's degree) to have a 10.51-point higher mathematics score. As the ratio of both of these coefficients to their corresponding standard errors is greater than 1.96, we consider them statistically significant at the .05 level (here the p-values are <.0001). A coefficient of -20.38 on the dummy variable flag M_HPEDij is an indication that students who skipped this question scored about 20 points below average on mathematics achievement. Next we interpret the variance components:

Level-1 residual variance (σ²) has been reduced to 3710.02, compared with 3830.38 in our model without any predictors (the random effects ANOVA or Null Model). We can calculate the proportion of variance explained by students' highest parental education by comparing the σ² estimates of these two models:

Proportional variance explained at level 1 = [σ²(random ANOVA) - σ²(HPED_CLM)] / σ²(random ANOVA) = (3830.38 - 3710.02) / 3830.38 = 0.0314

Here we see that adding students' highest level of parental education as a predictor of math achievement reduced the within-school variance by only 3.14%.

We can also see that the variance in the intercept, Var(β0j) = τ00, is 2166.24, with a χ² of 2791.3, to be compared with a critical value of χ² with J - 1 = 213 degrees of freedom. We infer that a large difference exists across the school means. The estimated variance in the parental education/math achievement slope, Var(β1j) = τ11, is 48.9, with a χ² of 242.4 on J - 1 = 213 degrees of freedom, p = .081. Using conventional standards for detecting statistical significance (p < .05), we fail to reject the null hypothesis that τ11 = 0. As with the intercept, we can use this estimate of the variance of the slope to construct a 95% plausible value range around the average parental education/math achievement slope (γ10) of 10.51:

γ10 ± 1.96(τ11^{1/2}) = 10.51 ± 1.96 × (48.90^{1/2}) = (-3.20, 24.23)

This indicates that the parental education/math achievement slope is likely to range from -3.20 to 24.23 across 95% of the schools in Russia that serve 8th grade students. This suggests that we might try to model this variance. The HLM output also provides information on the covariance of the random effects, Cov(β0j, β1j) = τ01. Above the Final estimation of fixed effects in the HLM output you will find the variance-covariance matrix (see the format below), also called the Tau matrix. The variances, Var(β0j) = τ00 = 2166.24 and Var(β1j) = τ11 = 48.9, are on the diagonal, and the covariance, Cov(β0j, β1j) = τ01 = -73.60, is on the off-diagonal:

Tau
β0j    2166.24    -73.60
β1j     -73.60     48.90

In the table below the Tau matrix these estimates are reported as correlations. The correlation between β0j and β1j is -.226, indicating that 8th grade classes with higher average math achievement tend to have a weaker relationship between parental education and math achievement than classes with lower average math achievement (i.e., achievement is more equitably distributed in the classes with higher average achievement).
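The reported correlation is simply the covariance rescaled by the two variances, and it can be recovered directly from the Tau matrix above:

r(β0j, β1j) = τ01 / (τ00 × τ11)^{1/2} = -73.60 / (2166.24 × 48.90)^{1/2} = -.226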

Random or fixed effects? In the random coefficients model presented above, we allowed both the intercept, β0j, and the slope, β1j, to vary randomly across schools. Substituting the level-2 model into the level-1 model gives us a mixed model (which is what HLM is actually estimating):

MathPV1ij = γ00 + γ10(HPED_CLMij) + u0j + u1j(HPED_CLMij) + rij

which can be rearranged as:

MathPV1ij = (γ00 + u0j) + (γ10 + u1j)(HPED_CLMij) + rij

What we would normally think of as the 'intercept' is now made up of two components: a fixed effect, γ00, and a random effect, u0j, which varies across schools. HLM provides a t-test of the hypothesis that the fixed effect is zero (H0: γ00 = 0), and a χ² test that the variance of u0j is zero (H0: Var(u0j) = 0). Similarly, the slope for HPED_CLMij is made up of a fixed effect, γ10, and a random effect, u1j. As noted above, one can think of the slope as an average slope, γ10, plus an increment that is unique to each school, u1j. Again we ask whether the average slope is significant, and whether the schools vary in their slopes. The same applies for any additional level-1 predictor, Xq, that we add to the model. If the variance of a group-level residual is not significant (e.g., if p > .10), we constrain it to be zero. This is done by clicking on the respective error term in the model and making it go grey (see below). By doing this for u1j you are fixing Var(u1j) = 0, i.e., specifying that the slopes do not vary between schools.


To fix a parameter variance to zero, click on the level-2 error term to make it go grey. This is equivalent to setting Var(u1j) = 0. Note that the term u1j(HPED_CLMij) also disappears from the mixed model equation.

If the parameter variances are non-zero (e.g., if p <= .10), it means we have some between-group variance to work with for examining the effects of group-level variables. Once these are in the model they play a similar role to the fixed portion of the within-group effects.

A Two-level HLM with Variables at Levels 1 and 2


After estimating the variability of regression equations across classes, we can focus on building an explanatory model to try to explain why some classes have higher average math achievement than others and why some classes have a stronger association between students' parental education and their math achievement. When hierarchical models involve both random intercepts and slopes (e.g., Var(u0j) ≠ 0 and Var(u1j) ≠ 0), we recommend that you develop a tentative model to explain between-school variance in the intercept, β0j, before proceeding to fit models for the random slopes (see Raudenbush & Bryk, 2002: 267 for an additional discussion of model building). Following this guideline, we begin by trying to explain between-class differences in average math achievement in Russia. Note that if we leave parents' education at level 1 class-mean centered, the variance in class mean achievement that we are modeling will be unadjusted for between-class differences in the distribution of students' parents' education level. If parents' education is country-mean (i.e., grand-mean) centered, then the variance in class mean achievement that we are modeling will be adjusted for parents' education (i.e., we are examining variance in the predicted value of β0j as if all students' parents had the country-average level of parental education). This difference affects the interpretation of level-2 coefficients, namely whether they are explaining adjusted or unadjusted variance in the level-1 parameters. We start by adding class-average parents' education level (CLHPED_ij), which is aggregated from the student data, small class size (S_1_24j), large class size (S_33_40j), class size missing (M_sizej), a school climate index (PPSCA_ij), and school climate missing (M_PPSCj) to our equation modeling between-school variance in class-average math achievement (β0j). As CLHPEDij was not missing for any Russian class, no corresponding dummy flag is needed.

The results from the fixed portion of the model are presented below.

Grand mean achievement (γ00), interpreted as average class mean achievement in classes with an average level of parental education, class size between 25 and 32, and average climate, is 507.5, similar to the estimate in the Null Model and in the random coefficients model run earlier. The coefficient on students' parental education (HPED_CLM, γ10) is 10.98, which again is similar to the coefficient estimated on this variable in the random coefficients model, and is statistically significant (p < .0001). In general, we would not expect adding class-level explanatory variables to the level-2 model for the intercept to affect the size of previously estimated level-1 coefficients; that is, we would not expect class-level variables to explain within-class relationships. An exception would be when a level-1 variable is correlated with a level-2 residual (e.g., Cov(Xqij, u0j) ≠ 0), a violation of the assumptions underlying HLM. In this case adding a level-2 explanatory variable could affect the size of the estimated coefficient on Xqij, in effect removing a form of omitted variable bias from the model (for a discussion of this and other model misspecification issues, see Chapter 9 of Raudenbush & Bryk, 2002, and Chapter 9 of Snijders & Bosker, 1999).


Turning to the estimated coefficients on the level-2 variables, we see that the estimated effect of class-average parental education is large (γ01 = 52.95) and statistically significant (p < .0001). A class whose average parental education is one ISCED level higher than another class is expected to have a 53-point higher average math score. Note that the between-school effect of parental education on student achievement (βb) is nearly 5 times the within-school effect (βw), although the variation in parental education is considerably larger within than between classes (a standard deviation of .92 within classes compared with .47 between classes). This relationship can be seen in the graph below. A one standard deviation increase in students' highest level of parental education is associated with about a 10-point increase in class average achievement (.92 × 10.98), while the expected difference in class average achievement between a class one standard deviation below the mean and a class at the mean is about 25 points (.47 × 52.95). For a description of how to construct graphs like this with HLM, see http://www.ssicentral.com/hlm/hlm6/hlm6.htm

The estimated effect of being in a small or large class, as opposed to a medium-size class, was not statistically significant in the above model. Average school academic climate (γ05 = -15.81), an index calculated by taking the mean of 8 questions asked of the principal regarding teachers' job satisfaction; teachers' understanding of the school's curricular goals; teachers' degree of success in implementing the school's curriculum; teachers' expectations for student achievement; parental support for student achievement; parental involvement in school activities; students' regard for school property; and students' desire to do well in school (with very high = 1 and low = 4), is marginally significant (p = .072). A one standard deviation worsening in school climate (a .35 increase in the index) is associated with a 5.53-point reduction in class average achievement (0.35 × -15.81). Although not statistically significant, the coefficient of -11.7 on the variable school climate missing (M_PPSCj) is an indication that sampled schools that did not fill out the school questionnaire, or skipped all of the items on climate, had classes that were about 11 points below average in class mathematics achievement. Next we examine the variance components in the output:


Here we see that the residual variance in the intercept, Var(u0j) = τ00, is 1493.33, with a χ² of 1895.2 to be compared against a critical value of χ² with J - 7 (the intercept plus the 6 coefficients estimated) = 207 degrees of freedom. We can compare the variance in this model, with its 4 level 2 explanatory variables (and two dummy missing-data flags), to the variance reported in the random coefficients model above (where τ00(random coeff.) = 2166.24):

Proportional variance explained at level 2
= [τ00(random coeff.) - τ00(current model)] / τ00(random coeff.)
= (2166.24 - 1493.33) / 2166.24
= 0.311

Here we see that class average parental education, class size, and school climate explain 31 percent of the variance in class average math achievement. Note that there remains significant variance in the intercept left to explain (H0: Var(u0j) = 0 is rejected with p<.0001). The variance in the HPED_CLM slope remains similar to the variance in the random coefficients model, which we would expect since we have not yet added any variables to explain this variance.

Intercepts and slopes as outcomes

Finally, we use class and school level variables to try to explain the between school variance in the HPED_CLM slope (Var(u1j) = τ11), that is, why some classes have a stronger relationship between parental education and student math achievement than others. For this example, we will use the same level 2 variables to explain the variance in β1j that we used to explain the variance in class average math achievement, β0j (see below).
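The statistic used here (and again below for the slope variance) is the usual proportional reduction in a level 2 variance component when predictors are added. Written generally, with the "base" model being the one without the level 2 predictors in question:

\[
R^2_{\text{level 2}} \;=\; \frac{\hat{\tau}_{qq}(\text{base model}) - \hat{\tau}_{qq}(\text{fitted model})}{\hat{\tau}_{qq}(\text{base model})}
\]

With q = 0 this gives the 0.311 just computed for the intercept; with q = 1 it gives the .189 reported for the slope below.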

This model examines whether classes with higher levels of parental education, small or large classes, and classes in schools with a lower academic climate have a stronger or weaker within class relationship between students' parental education and their math achievement. For example, we might hypothesize that classes with higher average parental education will have less disparity in math achievement because the teacher may have fewer at-risk kids and can focus more attention on them, or because there might be stronger peer effects. A similar argument could be made for why there might be a flatter parental education/math achievement slope in smaller classes or in schools with better academic climates. Although we have chosen here to keep the same class/school variables in both the level 2 equation explaining variance in the intercept and the equation explaining variance in the parental education/math achievement slope, there is no requirement that they be the same. If theory, prior research, or a personal hunch leads you to believe that the important predictors might be different, feel free to construct different models. Be aware, however, that misspecification of the model explaining variance in one random coefficient (e.g., β0j) can lead to biased coefficients in the models explaining variance in other random coefficients (e.g., β1j). See Chapter 9 of Raudenbush & Bryk (2002) for a detailed discussion of specification issues at level 2. The intercepts and slopes as outcomes model:
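A sketch of this specification in standard HLM notation is given below; the variable labels are our own shorthand, and the placement of the two missing-data flags in each equation is an assumption.

\[
\begin{aligned}
Y_{ij} &= \beta_{0j} + \beta_{1j}(\mathrm{HPED\_CLM}_{ij}) + r_{ij} \\
\beta_{0j} &= \gamma_{00} + \gamma_{01}(\mathrm{CLHPED\_I}_{j}) + \gamma_{02}(\mathrm{SMALL}_{j}) + \gamma_{03}(\mathrm{LARGE}_{j}) + \gamma_{04}(\mathrm{M\_CLSIZE}_{j}) + \gamma_{05}(\mathrm{PPSC}_{j}) + \gamma_{06}(\mathrm{M\_PPSC}_{j}) + u_{0j} \\
\beta_{1j} &= \gamma_{10} + \gamma_{11}(\mathrm{CLHPED\_I}_{j}) + \gamma_{12}(\mathrm{SMALL}_{j}) + \gamma_{13}(\mathrm{LARGE}_{j}) + \gamma_{14}(\mathrm{M\_CLSIZE}_{j}) + \gamma_{15}(\mathrm{PPSC}_{j}) + \gamma_{16}(\mathrm{M\_PPSC}_{j}) + u_{1j}
\end{aligned}
\]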

The results from the fixed portion of the model:


Since the coefficients on our level 2 variables explaining between class variation in the intercept (β0j) are similar to those in the previous model, we will focus only on the coefficients for the variables included to explain between school variation in the parental education/math achievement slope (β1j). Although none of these coefficients are statistically significant at the p<=.05 level, we will go through their interpretation for instructional purposes. A one unit increase in class average parental education is associated with a slight reduction in the parental education/math achievement slope (γ11 = -2.26). Thus, while a medium size class with average parental education and average school climate has a predicted parental education/math achievement slope of 8.49, a medium size class with one standard deviation higher class average parental education (.45) has a predicted slope of 7.43, calculated as:

γ10 + (γ11 × SD(CLHPED_I)) = 8.49 + (-2.26 × .45) = 7.43

By graphing these predicted values we see a slight flattening of the association between students' parental education and their math scores in classes with higher average levels of parental education.
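More generally, the predicted parental education/math achievement slope for a class is obtained by plugging that class's characteristics into the β1j equation sketched above, holding the other predictors (climate and the missing-data flags) at their reference values:

\[
\hat{\beta}_{1j} \;=\; \hat{\gamma}_{10} + \hat{\gamma}_{11}(\mathrm{CLHPED\_I}_{j}) + \hat{\gamma}_{12}(\mathrm{SMALL}_{j}) + \hat{\gamma}_{13}(\mathrm{LARGE}_{j})
\]

The small- and large-class slopes reported next are computed the same way, with SMALL or LARGE set to 1 and CLHPED_I held at its mean.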

Small classes (between 1 and 24 students) have a stronger relationship between parental education and math achievement than medium size classes (γ10 + γ12 = 8.49 + 5.11 = 13.60, p = .075). Large classes also have a steeper slope, although the difference is smaller and not statistically significant (γ10 + γ13 = 8.49 + 2.13 = 10.62, p = .755). By graphing these predicted values we can see how the predicted advantage of being in a small math class grows as parental education increases.


Next we examine the variance components in the intercepts and slopes as outcomes model:

Here we see that the variance in the slope, Var(u1j) = τ11, is 40.93, with a χ² of 226.85 to be compared against a critical value of χ² with J - 7 (the intercept γ10 plus the 6 coefficients estimated) = 207 degrees of freedom. We can compare the variance in the prior model, with no explanatory variables predicting the variance in β1j, to this model:

Proportional variance explained at level 2
= [τ11(no explan. vars.) - τ11(current model)] / τ11(no explan. vars.)
= (50.46 - 40.93) / 50.46
= .189

Here we see that class average parental education, class size, and school climate explain 19 percent of the variance in the parental education/math achievement slope. Further, the variance in this slope is no longer statistically significant (although it was only marginally so before). For additional information on proportional variance explained statistics see the HLM FAQ at http://www.ssicentral.com/hlm/hlm6/hlm6.htm . Happy modeling!


References
Allison, P.D. (2002). Missing data. Thousand Oaks, CA: Sage.
Beaton, A.E. (1987). The NAEP 1983-1984 technical report. Princeton, NJ: ETS.
Cohen, J., & Cohen, P. (1983). Applied multiple regression/correlation analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Cohen, J., Cohen, P., West, S.G., & Aiken, L.S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Hillsdale, NJ: Erlbaum.
Goldstein, H. (2003). Multilevel statistical models (3rd ed.). Available for download at: http://www.soziologie.uni-halle.de/langer/multilevel/books/goldstein.pdf
Kreft, I., & de Leeuw, J. (1998). Introduction to multilevel modeling. Thousand Oaks, CA: Sage.
Little, R.J.A., & Rubin, D.B. (1987). Statistical analysis with missing data. New York: Wiley.
Mosteller, F., & Tukey, J.W. (1977). Data analysis and regression. Reading, MA: Addison-Wesley.
Raudenbush, S.W., & Bryk, A.S. (2002). Hierarchical linear models: Applications and data analysis methods (2nd ed.). Thousand Oaks, CA: Sage.
Rust, K.F. (1996). TIMSS 1995 working paper.
Snijders, T.A.B., & Bosker, R.J. (1999). Multilevel analysis: An introduction to basic and advanced multilevel modeling. Thousand Oaks, CA: Sage.
Wu, M., & Adams, R.J. (2002, April). Plausible values: Why they are important. Paper presented at the International Objective Measurement Workshop, New Orleans, Louisiana.

