
Statistics Tutorial: Stratified Random Sampling

Stratified random sampling refers to a sampling method that has the following properties:

- The population consists of N elements.
- The population is divided into H groups, called strata.
- Each element of the population can be assigned to one, and only one, stratum.
- The number of observations within each stratum h, Nh, is known, and N = N1 + N2 + N3 + ... + NH-1 + NH.
- The researcher obtains a probability sample from each stratum.

In this tutorial, we will assume that the researcher draws a simple random sample from each stratum.

Advantages and Disadvantages

Stratified sampling offers several advantages over simple random sampling:

- A stratified sample can provide greater precision than a simple random sample of the same size.
- Because it provides greater precision, a stratified sample often requires a smaller sample, which saves money.
- A stratified sample can guard against an "unrepresentative" sample (e.g., an all-male sample from a mixed-gender population).
- We can ensure that we obtain sufficient sample points to support a separate analysis of any subgroup.

The main disadvantage of a stratified sample is that it may require more administrative effort than a simple random sample.

Proportionate Versus Disproportionate Stratification

All stratified sampling designs fall into one of two categories, each of which has strengths and weaknesses, as described below.
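The procedure described above, dividing the population into strata and then drawing a simple random sample within each, can be sketched in Python. The stratum labels, population values, and sample sizes below are invented purely for illustration:

```python
import random

# Hypothetical population of N = 10 elements in H = 2 strata
# (labels and values are made up for illustration).
population = {
    "A": [10, 12, 11, 13],         # stratum A, N1 = 4
    "B": [50, 55, 53, 52, 54, 51]  # stratum B, N2 = 6
}

def stratified_sample(strata, sizes, seed=0):
    """Draw a simple random sample of sizes[h] elements (without
    replacement) from each stratum h."""
    rng = random.Random(seed)
    return {h: rng.sample(elements, sizes[h])
            for h, elements in strata.items()}

sample = stratified_sample(population, {"A": 2, "B": 3})
```

Each element belongs to exactly one stratum, so the stratum sizes sum to N, and every stratum is guaranteed to appear in the sample.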

Proportionate stratification. With proportionate stratification, the sample size of each stratum is proportionate to the population size of the stratum. This means that each stratum has the same sampling fraction. Proportionate stratification provides equal or better precision than a simple random sample of the same size. Gains in precision are greatest when values within strata are homogeneous, and the gains accrue to all survey measures.

Disproportionate stratification. With disproportionate stratification, the sampling fraction may vary from one stratum to the next. The precision of the design may be very good or very poor, depending on how sample points are allocated to strata. The way to maximize precision through disproportionate stratification is discussed in a subsequent lesson (see Statistics Tutorial: Sample Size Within Strata). If variances differ across strata, disproportionate stratification can provide better precision than proportionate stratification, provided sample points are correctly allocated to strata. With disproportionate stratification, the researcher can maximize precision for a single important survey measure; however, gains in precision may not accrue to other survey measures.
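Proportionate allocation, every stratum sampled at the same fraction n/N, reduces to the formula n_h = n × Nh / N. A minimal sketch, with hypothetical stratum sizes:

```python
def proportionate_allocation(stratum_sizes, total_sample):
    """Proportionate stratification: every stratum gets the same
    sampling fraction n/N, so n_h = n * N_h / N.
    (Rounding can make the allocations sum to slightly more or
    less than n in general; these illustrative figures divide evenly.)"""
    N = sum(stratum_sizes.values())
    return {h: round(total_sample * Nh / N)
            for h, Nh in stratum_sizes.items()}

# Invented strata: N1 = 600, N2 = 300, N3 = 100 (N = 1000), n = 100.
alloc = proportionate_allocation({"S1": 600, "S2": 300, "S3": 100}, 100)
```

Here each stratum is sampled at the common fraction 100/1000 = 10%, giving allocations of 60, 30, and 10.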

Recommendation: If costs and variances are about equal across strata, choose proportionate stratification over disproportionate stratification. If the variances or costs differ across strata, consider disproportionate stratification.
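When variances differ across strata, one commonly used disproportionate scheme is Neyman allocation, which (assuming equal costs per stratum) assigns n_h in proportion to Nh × Sh, where Sh is the standard deviation within stratum h. The figures below are invented for illustration; the full treatment of allocation is left to the later lesson mentioned above:

```python
def neyman_allocation(sizes, stdevs, total_sample):
    """Neyman allocation: n_h proportional to N_h * S_h, which
    minimizes the variance of the estimated mean for a fixed total
    sample size when per-unit costs are equal across strata."""
    weights = {h: sizes[h] * stdevs[h] for h in sizes}
    total_weight = sum(weights.values())
    return {h: round(total_sample * w / total_weight)
            for h, w in weights.items()}

# Two equal-sized strata, but S2 is far more variable, so stratum 2
# receives a much larger share of the sample than proportionate
# allocation (50/50) would give it.
alloc = neyman_allocation({"S1": 500, "S2": 500},
                          {"S1": 2.0, "S2": 8.0}, 100)
```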

Regression analysis
In its simplest form, regression analysis involves finding the best straight-line relationship to explain how the variation in an outcome (or dependent) variable, Y, depends on the variation in a predictor (or independent, or explanatory) variable, X. Once the relationship has been estimated, we can use the equation

Y = b0 + b1 X

to predict the value of the outcome variable for different values of the explanatory variable. Hence, for example, if age is a predictor of the outcome of treatment, the regression equation would enable us to predict the outcome of treatment for a person of a particular age. Of course, this is only useful if most of the variation in the outcome variable is explained by the variation in the explanatory variable.

In many situations the outcome will depend on more than one explanatory variable. This leads to multiple regression, in which the dependent variable is predicted by a linear combination of the possible explanatory variables. For example, it is known that male peak expiratory flow rate (PEFR) depends on both age and height, so the regression equation will be

PEFR = b0 + b1 x age + b2 x height,

where the values b0, b1, b2 are called the regression coefficients and are estimated from the study data by a mathematical process called least squares, explained by Altman (1991). If we want to predict the PEFR for a male of a particular age and height, we can use this equation directly.

Often there will be many possible explanatory variables in the data set and, by using a stepwise regression process, the explanatory variables can be considered one at a time. The variable that explains the most variation in the dependent variable is added to the model at each step. The process stops when adding an extra variable makes no significant improvement in the amount of variation explained.
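For the simple case Y = b0 + b1 X, the least-squares estimates can be computed directly from the data. A minimal sketch in Python; the data points are invented and chosen to lie exactly on the line y = 1 + 2x, so the fit recovers those coefficients:

```python
def least_squares(x, y):
    """Estimate b0, b1 for Y = b0 + b1*X by ordinary least squares:
    b1 = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2),
    b0 = mean_y - b1 * mean_x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
          / sum((xi - mean_x) ** 2 for xi in x))
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Invented data lying exactly on y = 1 + 2x.
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]
b0, b1 = least_squares(x, y)
# Prediction for a new value of the explanatory variable:
y_pred = b0 + b1 * 5
```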
The amount of variation in the dependent variable that is accounted for by variation in the predictor variables is measured by the coefficient of determination, here reported in its adjusted form (R2 adjusted). The closer this is to 1 the better, because if R2 adjusted is 1 then the regression model accounts for all the variation in the outcome variable. This is discussed, together with the assumptions made in regression analysis, by both Altman (1991) and Campbell & Machin (1993). In the preceding paper the outcome variable is the ISQ-SR-N score, and several independent variables were considered in the stepwise regression, which selected four for inclusion in the final model. Although this is the best model, it still accounts for only 15.2% of the variation in ISQ-SR-N, because the R2 adjusted is only 0.152. In other words, although the model explains a statistically significant amount of the variation, it still leaves most of it unexplained.
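The coefficient of determination and its adjusted form can be computed from the observed and fitted values. A sketch with invented outcomes and fitted values from a hypothetical one-predictor model:

```python
def r2_and_adjusted(y, y_hat, p):
    """Compute R^2 and adjusted R^2 for a fit with p predictors.
    R^2 = 1 - SS_res / SS_tot; adjusted
    R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1), which penalizes
    predictors that explain little extra variation."""
    n = len(y)
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)              # total
    r2 = 1 - ss_res / ss_tot
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return r2, r2_adj

# Made-up observed outcomes and fitted values (p = 1 predictor).
y     = [1, 2, 3, 4, 5]
y_hat = [1.1, 1.9, 3.0, 4.2, 4.8]
r2, r2_adj = r2_and_adjusted(y, y_hat, 1)
```

An R2 adjusted near 1 means nearly all the variation in the outcome is accounted for; a value like 0.152, as in the paper discussed above, means most of the variation remains unexplained.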
