
Section on Survey Research Methods

Small-Area Estimation: Theory and Practice

Michael Hidiroglou
Statistical Innovation and Research Division, Statistics Canada, 16th Floor, Section D, R.H. Coats Building, Tunney's Pasture, Ottawa, Ontario, K1A 0T6, Canada

Abstract

Small area estimation (SAE) was first studied at Statistics Canada in the seventies. Small area estimates have been produced using administrative files or surveys enhanced with administrative auxiliary data since the early eighties. In this paper we provide a summary of existing procedures for producing official small-area estimates at Statistics Canada, as well as a summary of the ongoing research. The use of these techniques is illustrated for a number of applications at Statistics Canada that include: the estimation of health statistics; the estimation of average weekly earnings; the estimation of under-coverage in the census; and the estimation of unemployment rates. We also highlight problems in producing small-area estimates for business surveys.

KEY WORDS: Small Area, Official Statistics, Fay-Herriot

1. Introduction

Small domain or area refers to a population for which reliable statistics of interest cannot be produced due to certain limitations of the available data. Examples of domains include a geographical region (e.g. a province, county, municipality, etc.), a demographic group (e.g. age x sex), or a demographic group within a geographic region. The demand for such data on small areas has greatly increased during the past few years (Brackstone, 1987). This increase is due to the usefulness of these data in government policy and program development, allocation of various funds and regional planning.

A number of national and regional statistical agencies, including Statistics Canada, have introduced programs aimed at producing estimates for small areas to meet the new demand. Available data to produce such estimates are based on surveys that are not designed for these levels. However, if administrative sources have data at the small area level, and these data are well correlated with the variables of interest at the corresponding level, several procedures are available to estimate various parameters of interest for these lower levels.

This paper draws on the small area methodology discussed in Rao (2003) and illustrates how some of the estimators have been used in practice on a number of surveys at Statistics Canada. It is structured as follows. Section 2 provides a summary of the primary uses of small area estimates, as well as criteria for producing them. Section 3 defines the notation, and provides a number of typical direct and indirect estimators used in small area estimation. Section 4 provides four examples that reflect the diverse uses of small area estimation at Statistics Canada.

2. Primary uses and Criteria for SAE Production

One of the primary objectives for producing small area estimates is to provide summary statistics to central or local governments so that they can plan for immediate or future resource allocation. Typical small area estimates include employment indicators (employed and unemployed), health indicators (drug use, alcohol use) and business indicators such as average salary.

The production of small area estimates depends on a number of factors. What is the demand for such statistics? What is the commitment and will of the agency to support methodological, systems, and subject matter staff? How much methodology and subject matter expertise exists within the agency? How well correlated are existing auxiliary data with the variables of interest? Is the survey sample size large enough to allow reliable estimates by using both the survey data and the existing auxiliary data? How much bias are the agency and clients willing to tolerate in the estimates, and what are the consequences of making incorrect decisions? The size of the small areas in terms of the number of units that belong to them is also an important consideration. Small areas that are too small may result in confidentiality breaches. Furthermore, small area estimates may be quite different from statistics based on local knowledge.

3. SAE Estimators

3.1 Introduction

A survey population $U$ consists of $N$ distinct elements (or ultimate units) identified through the labels $j = 1, \ldots, N$. A sample $s$ is selected from $U$ with probability


$p(s)$, and the probability of including the $j$-th element in the sample is $\pi_j$. The design weight for each selected unit $j \in s$ is defined as $w_j = 1/\pi_j$. Suppose $U_i$ denotes a domain (or subpopulation) of interest. Denote as $s_i = s \cap U_i$ the part of the sample $s$ that falls in domain $U_i$. The realized sample size of $s_i$ is a random variable $n_i$, where $0 \le n_i \le N_i$. Auxiliary data $x$ will either be known at the element level $x_j$ for $j \in s$, or for each small area $i$ as totals $X_i = \sum_{j \in U_i} x_j$ or means $\bar{X}_i = X_i / N_i$.

The problem is to estimate the domain total $Y_i = \sum_{j \in U_i} y_j$ or the domain mean $\bar{Y}_i = Y_i / N_i$, where $N_i$, the number of elements in $U_i$, may or may not be known. We define $y_{ij}$ to be $y_j$ if $j \in U_i$, and 0 otherwise. An indicator variable $a_{ij}$ is similarly defined: it is equal to one if $j \in U_i$ and 0 otherwise. Note that $Y_i$ can be written as $Y_i = \sum_{j \in U} y_{ij} = \sum_{j \in U} y_j a_{ij}$.

Small area estimation is categorized into two types of estimators: direct and indirect estimators. A direct estimator is one that uses values of the variable of interest, $y$, only from the sample units in the domain of interest. However, a major disadvantage of such estimators is that unacceptably large standard errors may result: this is especially true if the sample size within the domain is small or nil. An indirect estimator uses values of the variable of interest from a domain and/or time period other than the domain and time period of interest. Three types of indirect estimators can be identified. A domain indirect estimator uses values of the variable of interest from another domain but not from another time period. A time indirect estimator uses values of the variable of interest from another time period but not from another domain. An estimator that is both domain and time indirect uses values of the variable of interest from another domain and another time period.

An alternative is to use estimators that borrow strength across small areas, by modeling the dependent variable on independent variables across a number of small areas: they are called indirect estimators. Indirect estimators will be quite good (i.e. they indirectly increase the effective sample size and thus decrease the standard error) if the models obtained across small areas still hold at the small area level. Departures from the model will result in unknown biases. There is a wide variety of indirect estimators available, and a good summary is provided in Rao (2003). We will confine ourselves to just a few of them that include the synthetic estimator, and the more well-known composite estimators.

3.2 Direct Estimation

Let $w_j$ be the design weight associated with $j \in s$. The Horvitz-Thompson estimator is the simplest direct estimator. If the small area total $Y_i$ is to be estimated for small area $U_i$, then the corresponding Horvitz-Thompson estimator is given by $\hat{Y}_{i,HT} = \sum_{j \in s_i} w_j y_j$, provided that the realized sample size $n_i$ is non-zero.

Auxiliary information can be available either at the population level or at the domain level. If it is available at the population level, then we use the Generalized Regression Estimator (GREG) given by

$$\hat{Y}_{i,GR} = X' \tilde{\beta}_{i,GREG} + \left( \hat{Y}_{i,HT} - \hat{X}_{HT}' \tilde{\beta}_{i,GREG} \right)$$

where $X' = \sum_{i=1}^{m} \sum_{j \in U_i} x_j'$, $\hat{X}_{HT}' = \sum_{k \in s} x_k' / \pi_k$, and $\tilde{\beta}_{i,GREG}$ is the vector of regression coefficients obtained by regressing $y_{ij}$ on $x_j$. That is,

$$\tilde{\beta}_{i,GREG} = \left( \sum_{j \in s} \frac{w_j x_j x_j'}{c_j} \right)^{-1} \sum_{j \in s} \frac{w_j x_j y_{ij}}{c_j},$$

where $c_j$ is a specified constant ($c_j > 0$).

The straight GREG estimator is not efficient, and it is better to use regression estimators that use auxiliary data available as close as possible to the small areas of interest. One such estimator is the domain-specific GREG that uses auxiliary data at the domain level. It is given by

$$\hat{Y}^{*}_{i,GR} = X_i' \hat{\beta}_{i,GREG} + \left( \hat{Y}_{i,HT} - \hat{X}_{i,HT}' \hat{\beta}_{i,GREG} \right)$$

where $\hat{\beta}_{i,GREG} = \left( \sum_{j \in s_i} w_j x_j x_j' / c_j \right)^{-1} \sum_{j \in s_i} w_j x_j y_j / c_j$.

An estimator that is approximately p-unbiased as the overall sample size increases, but uses y-values outside the domain, is the modified direct estimator given by

$$\hat{Y}_{i,SR} = X_i' \hat{\beta}_{GREG} + \left( \hat{Y}_{i,HT} - \hat{X}_{i,HT}' \hat{\beta}_{GREG} \right)$$

where $\hat{\beta}_{GREG} = \left( \sum_{j \in s} w_j x_j x_j' / c_j \right)^{-1} \sum_{j \in s} w_j x_j y_j / c_j$. This estimator is also referred to in Woodruff (1966), and Battese, Harter, and Fuller (1988) as the "survey regression estimator".
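As a rough illustration of the direct estimators of Section 3.2, the following Python sketch computes the Horvitz-Thompson total and the survey regression (modified direct) estimate for a single domain. The function names, the toy data and the default choice $c_j = 1$ are assumptions made for this example only; they do not come from any production system.

```python
import numpy as np

def horvitz_thompson_total(y, w, in_domain):
    """Horvitz-Thompson domain total: sum of w_j * y_j over sampled units in the domain."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    d = np.asarray(in_domain, dtype=bool)
    return float(np.sum(w[d] * y[d]))

def survey_regression_total(y, x, w, in_domain, X_i, c=None):
    """Survey regression estimate of a domain total.

    beta is fitted on the whole sample s (weights w_j / c_j); the residual
    correction uses only the sampled units that fall in the domain.
    X_i is the known vector of auxiliary totals for the domain.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)            # shape (n, p)
    w = np.asarray(w, dtype=float)
    d = np.asarray(in_domain, dtype=bool)
    c = np.ones_like(w) if c is None else np.asarray(c, dtype=float)  # assumed c_j = 1
    a = w / c
    A = (x * a[:, None]).T @ x                # sum_s w_j x_j x_j' / c_j
    b = (x * a[:, None]).T @ y                # sum_s w_j x_j y_j / c_j
    beta = np.linalg.solve(A, b)
    y_ht = np.sum(w[d] * y[d])                # HT total of y in the domain
    x_ht = (w[d][:, None] * x[d]).sum(axis=0) # HT totals of x in the domain
    estimate = np.asarray(X_i, dtype=float) @ beta + y_ht - x_ht @ beta
    return float(estimate), beta

# Toy sample of six units, the first three of which fall in the domain (hypothetical data).
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x = np.column_stack([np.ones(6), x1])         # intercept plus one auxiliary variable
y = 2.0 + 3.0 * x1                            # y is exactly linear in x here
w = np.full(6, 10.0)
in_domain = [True, True, True, False, False, False]
X_i = [32.0, 70.0]                            # assumed known (N_i, total of x) for the domain

ht = horvitz_thompson_total(y, w, in_domain)
sr, beta = survey_regression_total(y, x, w, in_domain, X_i)
print(ht, sr, beta)
```

Because the toy y-values are exactly linear in x, the regression residuals vanish and the survey regression estimate depends only on the known auxiliary totals, whereas the Horvitz-Thompson estimate depends on which units happened to be sampled in the domain.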


Hidiroglou and Patak (2004) compared a number of the direct estimators. One of their conclusions was that the direct estimators would be best if the domains of interest coincided as closely as possible with the design strata.

3.2 Indirect Estimation

Some of the most widely used indirect estimators have been the synthetic estimator, the regression-adjusted synthetic estimator, the composite estimator, and the sample-dependent estimator.

The synthetic estimator uses reliable information of a direct estimator for a large area that spans several small areas, and this information is used to obtain an indirect estimator for a small area. It is assumed that the small areas have the same characteristics as the large area: Gonzalez (1978) provides a good account of how these estimators were obtained, and used to obtain unemployment statistics at levels lower than those planned in the survey design. The National Center for Health Statistics (1968) in the United States pioneered the use of synthetic estimation for developing state estimates of disability and other health characteristics from the National Health Interview Survey (NHIS). Sample sizes in most states were too small to provide reliable direct state estimates.

Levy (1971) used mortality data to compute average relative errors of synthetic estimates for States. He used the regression-adjusted synthetic estimator to account for local variation by combining area-specific covariates with the synthetic estimator. These covariates attempted to attenuate the magnitude of potential relative bias associated with the synthetic estimator.

The potential bias associated with indirect estimators $\hat{Y}_{i,INDIR}$ can be attenuated by combining them with the direct estimators $\hat{Y}_{i,DIR}$ via a weighted average. The resulting combined estimator is given by

$$\hat{Y}_{i,COMB} = \phi_i \hat{Y}_{i,DIR} + (1 - \phi_i) \hat{Y}_{i,INDIR}$$

where $0 \le \phi_i \le 1$. The optimal $\phi_i^{*}$ is determined by minimizing the MSE of $\hat{Y}_{i,COMB}$. The resulting composite estimator has a mean square error which is smaller than that of either component estimator.

Schaible (1978) noted that the composite estimator is insensitive to poor estimates of the optimum weight. This insensitivity depends on the relative sizes of the mean square errors of the component estimators. The composite estimator is most insensitive when the mean square errors of the two component estimators do not differ greatly. Simple weighting factors for the composite estimators that depend on the realized domain size were given by Drew, Singh and Choudhry (1982), and by Hidiroglou and Särndal (1985).

Small area estimators are split into two main types, depending on how models are applied to the data within the small areas: these two types are known as area level and unit level. Small area estimators are based on area level computations if models link small area means of interest ($y$) to area-specific auxiliary variables (such as $x$ sample means). They are based on unit level computations if the models link unit values of interest to unit-specific auxiliary variables. Area based small area estimators are computed if the unit level area data are not available. They can also be computed if the unit level data are available by summarizing them at the appropriate area level.

3.2.1 Area Model

One of the most widely used area level small area estimators was given by Fay and Herriot (1979) small. Population totals ($Y_i = \sum_{j \in U_i} y_j$) or means ($\bar{Y}_i = Y_i / N_i$), where $N_i$ is the number of elements in small area $U_i$, can be estimated. The Fay-Herriot methodology is usually presented as an estimator of the small area population mean $\bar{Y}_i$ for a given small area $U_i$, where $i = 1, \ldots, m$. The Fay-Herriot estimator for small area $U_i$ is a linear combination of a direct estimator (say $\hat{\bar{Y}}_{i,DIR}$) and a synthetic estimator (say $\hat{\bar{Y}}_{i,SYN}$). The direct estimator of the population mean $\bar{Y}_i$ is given by $\hat{\bar{Y}}_{i,DIR} = \hat{Y}_{i,DIR} / \hat{N}_{i,DIR}$, where $\hat{Y}_{i,DIR} = \sum_{j \in s_i} \tilde{w}_j y_j$ and $\hat{N}_{i,DIR} = \sum_{j \in s_i} \tilde{w}_j$. The weight $\tilde{w}_j$ associated with the $j$-th unit can be the design weight $w_j$ (i.e. $\tilde{w}_j = w_j$) or a final weight that reflects any adjustment (i.e. non-response, calibration, or a product thereof) made to the design weight.

The synthetic portion is estimated as the product of a given auxiliary population mean row-vector (say $\bar{Z}_i' = \sum_{j \in U_i} z_j' / N_i$) for the $i$-th small area of interest times an estimated regression vector (say $\hat{\beta}_{FH}$, where FH stands for Fay-Herriot). The auxiliary data row-vector $z_j'$ is known for all units in the population


small area $U_i$. The regression vector $\hat{\beta}_{FH}$ is computed across a number of small areas in such a way that the model linking the variable of interest (the mean $\hat{\bar{Y}}_{i,DIR}$) to the auxiliary data also holds at the small area level. The Fay-Herriot estimator of a given population mean $\bar{Y}_i$ is given by:

$$\tilde{\bar{Y}}_{i,FH} = \gamma_i \hat{\bar{Y}}_{i,DIR} + (1 - \gamma_i) \bar{Z}_i' \tilde{\beta}_{FH} \tag{3.1}$$

The two components (direct estimator and synthetic estimator) of (3.1) are weighted by $\gamma_i$ and $(1 - \gamma_i)$, where $\gamma_i = \sigma_v^2 / (\sigma_v^2 + \psi_{i,DIR})$. The regression vector $\tilde{\beta}_{FH}$ and $\gamma_i$ depend on the population variance $\psi_{i,DIR}$ of the direct estimator $\hat{\bar{Y}}_{i,DIR}$ and the model variance $\sigma_v^2$. Although the sampling variance of $\hat{\bar{Y}}_{i,DIR}$ is easy to compute, it may be unstable if the domain sizes are small. This is remedied by smoothing the estimated variances. We denote the smoothed variances as $\hat{\psi}_{i,DIR}$. The estimated model variance $\hat{\sigma}_v^2$ and $\hat{\beta}_{FH}$ are computed recursively. Details of the required computations for obtaining $\hat{\sigma}_v^2$ can be found in Rao (2003, pp. 118-119). The estimated regression vector $\hat{\beta}_{FH}$ and the factor $\hat{\gamma}_i$ are given by:

$$\hat{\beta}_{FH} = \left[ \sum_{i=1}^{m} \frac{\bar{Z}_i \bar{Z}_i'}{\hat{\psi}_{i,DIR} + \hat{\sigma}_v^2} \right]^{-1} \left[ \sum_{i=1}^{m} \frac{\bar{Z}_i \hat{\bar{Y}}_{i,DIR}}{\hat{\psi}_{i,DIR} + \hat{\sigma}_v^2} \right] \tag{3.2}$$

and

$$\hat{\gamma}_i = \hat{\sigma}_v^2 / \left( \hat{\psi}_{i,DIR} + \hat{\sigma}_v^2 \right) \tag{3.3}$$

respectively. The Fay-Herriot estimator $\hat{\bar{Y}}_{i,FH}$ can also be expressed as:

$$\hat{\bar{Y}}_{i,FH} = \bar{Z}_i' \hat{\beta}_{FH} + \hat{\gamma}_i \left( \hat{\bar{Y}}_{i,DIR} - \bar{Z}_i' \hat{\beta}_{FH} \right) \tag{3.4}$$

This form of the Fay-Herriot estimator is very similar to the "normal" regression estimator

$$\hat{\bar{Y}}_{i,REG} = \bar{Z}_i' \hat{\beta}_{REG} + \left( \hat{\bar{Y}}_{i,EXP} - \hat{\bar{Z}}_i' \hat{\beta}_{REG} \right) \tag{3.5}$$

given in Cochran (1977), where the estimated regression vector is given by

$$\hat{\beta}_{REG} = \left( \sum_{i=1}^{m} \bar{Z}_i \bar{Z}_i' / \hat{\psi}_{i,EXP} \right)^{-1} \left( \sum_{i=1}^{m} \bar{Z}_i \hat{\bar{Y}}_{i,EXP} / \hat{\psi}_{i,EXP} \right)$$

and $\hat{\bar{Y}}_{i,EXP} = \sum_{j \in s_i} w_j y_j / \sum_{j \in s_i} w_j$ is the simple estimator of the mean involving the design weights $w_j$. The computations required to obtain the normal regression estimator do not involve estimating any variance components.

3.2.2 Unit Model

The unit model originates with Battese, Harter and Fuller (1988). They used the nested error regression model to estimate county crop areas using sample survey data in conjunction with satellite information. Their model is given by

$$y_{ij} = x_{ij}' \beta + v_i + e_{ij} \tag{3.5}$$

where they assumed that $v_i \overset{iid}{\sim} (0, \sigma_v^2)$ and $e_{ij} \overset{iid}{\sim} (0, \sigma_e^2)$, $i = 1, \ldots, m$ and $j = 1, \ldots, n_i$. The small areas of interest in Battese, Harter and Fuller (1988) were 12 counties ($m = 12$) in North-Central Iowa. Each county was divided into area segments, and the areas under corn and soybeans were ascertained for a sample of segments by interviewing farm operators. The number of sampled segments in a county, $n_i$, ranged from 1 to 6. Auxiliary data in the form of numbers of pixels (a term used for "picture elements" of about 0.45 hectares) classified as corn and soybeans were also obtained for all the area segments, including the sampled segments, in each county using LANDSAT satellite readings.

The resulting sample mean using (3.5) is given by

$$\bar{y}_{i.} = \bar{x}_{i.}' \beta + v_i + \bar{e}_{i.} \tag{3.6}$$

where $\bar{y}_{i.}$, $\bar{x}_{i.}'$ and $\bar{e}_{i.}$ are the means of the associated $n_i$ $(y, x)$ observations and e-residuals. The objective of Battese et al. (1988) was to estimate the conditional population mean given the realized cluster (county) effect. Under the assumption of model (3.5), the conditional population mean is given by

$$\bar{Y}_{i.} = \bar{X}_{i.}' \beta + v_i \tag{3.7}$$

where $\bar{Y}_{i.}$ and $\bar{X}_{i.}'$ are the population means of the associated $N_i$ observations $(y_{ij}, x_{ij})$ in the $i$-th sampled cluster $U_i$. The corresponding predictor $\tilde{y}_{i.}$ for the county mean crop area per segment is $\bar{X}_{i.}' \tilde{\beta}_{BHF} + \tilde{v}_{i.}$, where $\tilde{v}_{i.} = \gamma_i \, n_i^{-1} \sum_{j=1}^{n_i} \left( y_{ij} - x_{ij}' \tilde{\beta}_{BHF} \right)$ with

$$\tilde{\beta}_{BHF} = \left( \sum_{i=1}^{m} \sum_{j=1}^{n_i} \left( x_{ij} x_{ij}' - \gamma_i \bar{x}_{i.} \bar{x}_{i.}' \right) \right)^{-1} \sum_{i=1}^{m} \sum_{j=1}^{n_i} \left( x_{ij} y_{ij} - \gamma_i \bar{x}_{i.} \bar{y}_{i.} \right) \tag{3.8}$$

and $\gamma_i = \sigma_v^2 \left( \sigma_v^2 + n_i^{-1} \sigma_e^2 \right)^{-1}$.
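The unit-level predictor just described can be sketched in a few lines of Python, under the simplifying assumption that the variance components $\sigma_v^2$ and $\sigma_e^2$ are known. The function name `bhf_blup`, the integer area labels and the toy data are hypothetical and serve only to show how $\gamma_i$, the regression vector in (3.8) and the area predictors fit together.

```python
import numpy as np

def bhf_blup(y, X, area, Xbar_pop, sigma_v2, sigma_e2):
    """Nested error regression predictor with KNOWN variance components.

    y: (n,) responses; X: (n, p) covariates; area: (n,) labels 0..m-1
    Xbar_pop: (m, p) known population means of the covariates per area.
    Every area is assumed to have at least one sampled unit.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    area = np.asarray(area)
    Xbar_pop = np.asarray(Xbar_pop, dtype=float)
    m, p = Xbar_pop.shape
    A = np.zeros((p, p))
    b = np.zeros(p)
    gamma = np.zeros(m)
    xbar = np.zeros((m, p))
    ybar = np.zeros(m)
    for i in range(m):
        idx = area == i
        n_i = int(idx.sum())
        gamma[i] = sigma_v2 / (sigma_v2 + sigma_e2 / n_i)
        xbar[i] = X[idx].mean(axis=0)
        ybar[i] = y[idx].mean()
        # sum_j (x_ij x_ij' - gamma_i xbar_i xbar_i') and its y-counterpart, as in (3.8)
        A += X[idx].T @ X[idx] - n_i * gamma[i] * np.outer(xbar[i], xbar[i])
        b += X[idx].T @ y[idx] - n_i * gamma[i] * xbar[i] * ybar[i]
    beta = np.linalg.solve(A, b)
    v = gamma * (ybar - xbar @ beta)          # predicted area effects
    return Xbar_pop @ beta + v, beta, gamma

# Hypothetical toy data: three areas with 2, 3 and 1 sampled units.
area_s = np.array([0, 0, 1, 1, 1, 2])
x1 = np.array([1.0, 2.0, 0.5, 1.5, 2.5, 3.0])
X_s = np.column_stack([np.ones(6), x1])
y_s = np.array([3.1, 4.9, 2.2, 4.1, 6.2, 7.0])
Xbar_pop = np.array([[1.0, 1.4], [1.0, 1.6], [1.0, 2.8]])

pred, beta, gamma = bhf_blup(y_s, X_s, area_s, Xbar_pop, sigma_v2=1.0, sigma_e2=2.0)
print(pred, beta, gamma)
```

Areas with fewer sampled units receive a smaller $\gamma_i$, so their predictors lean more heavily on the regression-synthetic part $\bar{X}_{i.}' \tilde{\beta}_{BHF}$. In practice the variance components must be estimated, which this sketch does not attempt.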

The resulting best linear unbiased prediction (BLUP) estimator is $\tilde{y}_{i,BHF} = \gamma_i \bar{y}_{i.} + \left( \bar{X}_{i.} - \gamma_i \bar{x}_{i.} \right)' \tilde{\beta}_{BHF}$ for the $i$-th small area. However, the variance components $\sigma_v^2$ and $\sigma_e^2$ are not known. Battese et al. (1988) use the well-known method of fitting-of-constants to estimate them. The resulting estimator of the $i$-th area mean is known as the EBLUP estimator, because the variance components are estimated.

Prasad and Rao (1990) derived an approximation to $o(m^{-1})$ for the model based mean squared error of the Battese-Harter-Fuller estimator, and also obtained its estimator to $o(m^{-1})$. Prasad and Rao (1999) were the first to include the survey weights in the unit level model: they labelled their estimator as a pseudo-EBLUP estimator of the small area mean $\bar{Y}_i$. The Prasad-Rao estimator of $\bar{Y}_i$ is given by

$$\tilde{\bar{Y}}_{i,PR} = \bar{X}_i' \tilde{\beta}_{PR} + \gamma_{iw} \left( \bar{y}_{iw} - \bar{x}_{iw}' \tilde{\beta}_{PR} \right) \tag{3.9}$$

where $\gamma_{iw} = \sigma_v^2 / \left( \sigma_v^2 + \sigma_e^2 \sum_{j \in s_i} \tilde{w}_{ij}^2 \right)$ with $\bar{y}_{iw} = \sum_{j \in s_i} \tilde{w}_{ij} y_{ij}$; $\tilde{w}_{ij} = w_{ij}^{*} / \sum_{j \in s_i} w_{ij}^{*}$, the $w_{ij}^{*}$ are calibrated weights, and $\tilde{\beta}_{PR}$ is given by

$$\tilde{\beta}_{PR} = \left( \sum_{i=1}^{m} \gamma_{iw} \bar{x}_{iw} \bar{x}_{iw}' \right)^{-1} \sum_{i=1}^{m} \gamma_{iw} \bar{x}_{iw} \bar{y}_{iw} \tag{3.10}$$

Prasad and Rao (1999) also provided model based expressions for the MSE of their estimator when it included the estimated variance components $\hat{\sigma}_v^2$ and $\hat{\sigma}_e^2$.

The sum of the small area estimates does not necessarily add up to the corresponding direct estimator. You and Rao (2002) proposed an estimator of $\beta$ that ensures self-benchmarking of the small area estimates to the corresponding direct estimator. Their estimator is given by

$$\tilde{\bar{Y}}_{i,YR} = \bar{X}_i' \tilde{\beta}_{YR} + \gamma_{iw} \left( \bar{y}_{iw} - \bar{x}_{iw}' \tilde{\beta}_{YR} \right) \tag{3.10}$$

where

$$\tilde{\beta}_{YR} = \left( \sum_{i=1}^{m} \sum_{j=1}^{n_i} \tilde{w}_{ij} x_{ij} \left( x_{ij} - \gamma_{iw} \bar{x}_{iw} \right)' \right)^{-1} \sum_{i=1}^{m} \sum_{j=1}^{n_i} \tilde{w}_{ij} \left( x_{ij} - \gamma_{iw} \bar{x}_{iw} \right) y_{ij}.$$

Replacing $\sigma_v^2$ and $\sigma_e^2$ by $\hat{\sigma}_v^2$ and $\hat{\sigma}_e^2$, we obtain a survey-weighted estimator of $\beta$ (say $\hat{\beta}_{YR}$). The resulting "pseudo-EBLUP" estimator is given by $\hat{\bar{Y}}_{i,YR} = \bar{X}_i' \hat{\beta}_{YR} + \hat{\gamma}_{iw} \left( \bar{y}_{iw} - \bar{x}_{iw}' \hat{\beta}_{YR} \right)$. Note that the self-benchmarking property means that the sum of the estimated small area totals is equal to the direct estimator of the overall total $Y$. That is,

$$\sum_{i=1}^{m} N_i \hat{\bar{Y}}_{i,YR} = \hat{Y}_w + \left( X - \hat{X}_w \right)' \hat{\beta}_{YR}$$

where $\hat{Y}_w = \sum_{i=1}^{m} \sum_{j=1}^{n_i} \tilde{w}_{ij} y_{ij}$ and $\hat{X}_w$ is similarly defined.

4. Applications

4.1 Canadian Community Health Survey: Area model

The Canadian Community Health Survey (CCHS) is a cross-sectional health survey carried out by Statistics Canada since 2001. The survey operates on a two-year collection cycle. The first year of the survey cycle, "x.1", is a large sample (130,000 persons), general population health survey, designed to provide reliable estimates at the health region (sub-provincial areas defined in terms of Census results), provincial and national levels. This portion of the survey collects information related to health status, health care utilization and health determinants for the Canadian population. The second year of the survey cycle, "x.2", has a smaller sample (30,000 persons) and is designed to provide provincial and national level results on specific focused health topics.

The CCHS is based on a multiple frame (two frames) sampling design. The first frame, used as the primary frame, is the area frame designed for the Canadian Labour Force Survey. This survey is basically a two-stage stratified design that uses probability proportional to size sampling without replacement at each stage. Face-to-face interviews take place with individuals selected from that frame. The second frame is a list frame of telephone numbers, used in some of the Health Regions for cost reasons. Individuals selected in that frame are interviewed by telephone.

The area frame uses the Labour Force Frame. The resulting sample is a two-stage stratified cluster sample. Sampling in that frame is carried out in three steps. Firstly, a list of the dwellings that were or had been in


scope to the Labour Force sample is identified. Secondly, a sample of dwellings is selected from this list. The households in the selected dwellings then form the sample of households. The majority (88%) of the targeted sample is selected from the area frame. Lastly, respondents are randomly selected from households in this frame. Although a single individual is normally randomly selected from each household, the requirement to oversample youths results in a second member of a number of households being selected as well. Face-to-face interviews are carried out with the selected respondents.

The telephone frame is mainly based on a stratified version (Health Regions) of the Canada Phone directory. Simple random sampling takes place within each of the resulting strata. Random digit dialling is carried out in five HRs and the three Territories.

The direct estimator of a population total $Y_i$ for a given domain $i$ is given by $\hat{Y}_i^{DIR} = \sum_{j \in s_i} \tilde{w}_j^{*} y_j$, where $\tilde{w}_j^{*}$ represents the overall weight that incorporates the multiple frame nature of the sampling design, non-response adjustments at each stage, where appropriate, and the calibration (age groups 12 to 19, 20 to 29, 30 to 44, 45 to 64 and 65 or older for each sex within each health region and province). More details of this sampling design are available in Béland (2002).

Estimates of various population parameters can be produced for different domains. In the present example, taken from Hidiroglou, Singh and Hamel (2007), our parameter of interest is the proportion of alcohol abusers within the previously stated domains belonging to the province of British Columbia using the two year (2000-2001) CCHS sample. The associated sample had 18,302 observations, with domain sample sizes ranging from 20 to 238 for the 200 domains. Figure 4.1 provides an idea of how the Health Regions are delineated in British Columbia.

Figure 4.1: Health areas in British Columbia

The $i$-th domain is a cross-classification of health regions $r$ ($r = 1, \ldots, 20$) and age-sex groups $a$ ($a = 1, \ldots, 10$). The direct estimator of the proportion of alcohol abuse is given by $\hat{p}_{r,a}^{DIR} = \hat{Y}_{r,a}^{DIR} / \hat{N}_{r,a}^{DIR}$, where $\hat{N}_{r,a}^{DIR} = \sum_{j \in s_{r,a}} \tilde{w}_j^{*}$. Given that, for domain $ra$, $\hat{\psi}_{r,a}^{DIR}$ denotes the estimated variance of $\hat{p}_{r,a}^{DIR}$ under the sampling design, the associated estimated design effect is given by

$$\text{deff}_{r,a}^{DIR} = \hat{\psi}_{r,a}^{DIR} \Big/ \left( \hat{p}_{r,a}^{DIR} \left( 1 - \hat{p}_{r,a}^{DIR} \right) / n_{r,a} \right).$$

The smoothed design effect over all $I = 200$ domains is given by $\overline{\text{deff}}^{DIR} = \sum_{i} \text{deff}_i^{DIR} / I$. The estimated coefficient of variation, $cv(\hat{p}_{r,a}^{DIR})$, of $\hat{p}_{r,a}^{DIR}$ for a given domain is

$$cv(\hat{p}_{r,a}^{DIR}) = \sqrt{ \overline{\text{deff}}^{DIR} \, \hat{p}_{r,a}^{DIR} \left( 1 - \hat{p}_{r,a}^{DIR} \right) / n_{r,a} } \Big/ \hat{p}_{r,a}^{DIR}.$$

The common mean model is the simplest one that can be implemented using the Fay-Herriot (1979) methodology. This model assumes that the proportion of alcohol abuse is the same within each of the twenty Health Regions for a given age-sex group: that is, the linking model is given by $P_{r,a} = \beta_a + \nu_{r,a}$, where $P_{r,a}$ is the unknown population proportion of interest, and $\beta_a$ is the common mean across the health regions for the $a$-th age-sex group. The corresponding sampling model is given by $\hat{p}_{r,a}^{DIR} = P_{r,a} + e_{r,a}$. The resulting small area estimate for the $ra$-th domain is given by

$$\hat{p}_{r,a}^{EBLUP} = \hat{\gamma}_{r,a} \hat{p}_{r,a}^{DIR} + \left( 1 - \hat{\gamma}_{r,a} \right) \hat{\beta}_a, \quad \text{where} \quad \hat{\gamma}_{r,a} = \frac{\hat{\sigma}_v^2}{\tilde{\psi}_{r,a}^{DIR} + \hat{\sigma}_v^2}$$

(see Rao 2003, p. 116). The term $\tilde{\psi}_{r,a}^{DIR}$, given by $\tilde{\psi}_{r,a}^{DIR} = \overline{\text{deff}}^{DIR} \, \hat{p}_{r,a}^{DIR} \left( 1 - \hat{p}_{r,a}^{DIR} \right) / n_{r,a}$, is obtained using the smoothed design effect $\overline{\text{deff}}^{DIR} = \sum_{i} \text{deff}_i^{DIR} / I$ over the $I = 200$ domains. The $\hat{\sigma}_v^2$ term is obtained from the Fay-Herriot methodology: computational details for estimating $\hat{\sigma}_v^2$ can be found in Rao (2003, p. 118). The estimated coefficient of variation $cv(\hat{p}_{r,a}^{EBLUP})$ for


$\hat{p}_{r,a}^{EBLUP}$ is given by $\sqrt{ mse(\hat{p}_{r,a}^{EBLUP}) } \big/ \hat{p}_{r,a}^{EBLUP}$, where

$$mse(\hat{p}_{r,a}^{EBLUP}) = \frac{ \hat{\sigma}_v^2 \, \tilde{\psi}_{r,a}^{DIR} }{ \tilde{\psi}_{r,a}^{DIR} + \hat{\sigma}_v^2 }$$

represents the estimated leading term of $MSE(\hat{p}_{r,a}^{EBLUP})$. Figure 4.2 compares the estimated coefficients of variation resulting from direct and indirect estimation.

Figure 4.2: Estimated coefficients of variation for the direct (blue) and EBLUP (red) estimators of proportion (vertical axis: estimated cv, %; horizontal axis: observed proportion)

4.2 Canadian Survey of Employment Payroll and Hours: Unit model

The Canadian Survey of Employment, Payrolls and Hours (SEPH) collects and publishes, on a monthly basis, estimates of payrolls, employment, paid hours and earnings at detailed industrial and geographic levels. Estimates of average weekly earnings (AWE) have been produced since the early nineties by SEPH. These estimates have been produced via the generalized regression (GREG) estimator using a combination of survey and payroll deduction (administrative) data provided to Statistics Canada by the Canada Tax department. The GREG estimator is approximately design unbiased (ADU).

SEPH is currently being redesigned to redefine primary domains of interest, as well as to incorporate improvements in the use of the administrative data. The resulting sample, estimated to be between 11,00 to 20,00 establishments (depending on budget constraints), will be allocated to the newly defined strata, defined as cross-classifications of geography (provinces) and industry (NAICS3), so that the resulting GREG estimates for AWE satisfy specified coefficients of variation. The design strata are also referred to as model groups, since the GREG estimators are computed at these levels as well. Estimates below this level can be obtained using domain estimation. As the associated sample will be relatively small (or non-existent), the GREG estimators could have unacceptably large measures of error.

Rubin et al. (2007) investigated whether Small Area Estimation (SAE) procedures could be used to produce estimates for AWE with reasonably good estimated mean squared errors for lower levels, namely industry groups at level 4 of the North American Industry Classification System (NAICS4) and geography at the level of province, that is, the "NAICS 4 x province" domains. The Average Weekly Earnings for a population domain $i$ ($U_i$) is given by

$$\bar{Y}_i = \sum_{j \in U_i} E_{ij} y_{ij} \Big/ \sum_{j \in U_i} E_{ij}$$

where $y_{ij}$ is the average weekly earnings and $E_{ij}$ is the average number of employees within the $j$-th establishment within that domain.

A Monte Carlo study was carried out to evaluate the properties of the GREG estimator and a number of SAE estimators. The y-values for the population used for the study were created for twelve months representing the January to December 2005 calendar year. In-sample y-values were kept as is, and the y-values for the out-of-sample units were synthesized using the nearest neighbour based on the average number of employees and average monthly earnings (available for the whole population). Some 1000 samples were then independently selected from each of the twelve generated populations, preserving the longitudinal aspect of SEPH (i.e. sample rotation of one-twelfth of the sample on a monthly basis). Summary statistics based on the specific estimators, $\bar{y}_{i,EST}^{(r)}$, for the $i$-th small area ($i = 1, \ldots, I$) computed from the Monte Carlo study included the average relative bias (ARB),

$$\frac{1}{I} \sum_{i=1}^{I} \frac{1}{R \bar{Y}_i} \sum_{r=1}^{R} \left( \bar{y}_{i,EST}^{(r)} - \bar{Y}_i \right),$$

and the average root relative mean square error (ARMSE),

$$\frac{1}{I} \sum_{i=1}^{I} \left( \frac{1}{R \bar{Y}_i} \sum_{r=1}^{R} \left( \bar{y}_{i,EST}^{(r)} - \bar{Y}_i \right)^2 \right)^{0.5}.$$

Estimators considered in the Rubin et al. (2007) simulation included the GREG, the Prasad-Rao (1999) pseudo-EBLUP unit level, and the You-Rao (2002) pseudo-EBLUP area level SAE estimators given in Section 3. The GREG estimator is given by

$$\bar{y}_{i,GREG} = \sum_{j \in U_i} \tilde{E}_{ij} x_{ij}' \hat{\beta} + \sum_{j \in s_i} w_{ij} \tilde{E}_{ij} \left( y_{ij} - x_{ij}' \hat{\beta} \right) \tag{4.1}$$

with $x_{ij}' = (1, x_{ij})$. Here $x_{ij}$ is the average monthly earnings associated with the $j$-th sampled establishment within domain $U_i$, and $\hat{\beta}$ is the


regression estimator resulting from the (Undercount) and the gross number of persons
iid erroneously included in the final Census count
model yij = xij′ βˆ + eij , with eij ~( 0,σ / Eij ) .
2
e (Overcount). The sample size of the RRC is designed
to produce reliable direct estimates for the provinces
Figure 3 and 4 provide the ARB and ARMSE (including the two Territories),and eight age - sex
respectively for construction domains in Canada for groups, with age categories are less than 19, 20 to
2005. The GREG estimator has the smallest ARB 29, 30 to 44, and 45 and over at the national level.
amongst the three estimators. The Prasad-Rao (1999) The cross tabulation of these two marginal tabulations
is the best estimator in terms of ARMSE. This is results in m= 96 (12*8) cells. These cells are
reasonable, given that the You-Rao (2002) estimator loses some efficiency because of its benchmarking property.

Figure 4.3: Average absolute relative bias for construction domains in Canada for 2005

Figure 4.4: Average relative root mean square error for construction domains in Canada for 2005

4.3 Canadian Census of Population Undercoverage

The Census of Canada is conducted every five years. One objective is to provide the Population Estimates Program with accurate baseline counts of the number of persons by age and sex for specified geographic areas. However, not all persons are correctly enumerated. Two errors that occur are undercoverage (exclusion of eligible persons) and overcoverage (erroneous inclusion of persons). This undercoverage varies between 2% and 3%.

A special survey, known as the Reverse Record Check (RRC), with a sample size of 60,000 persons, estimates the net number of persons missed by the Census. This net number combines two types of coverage errors: the gross number of persons missed by the Census

considered as small areas because they have too few observations to sustain reliable direct estimates. The objective is to use small area techniques to improve the reliability of the cell estimates. Dick (1995) applied the Fay-Herriot methodology for this purpose.

For the i-th cell (small area), we define the following quantities. The true (but unknown) Census count is denoted as $T_i$, and the corresponding observed Census count as $C_i$. The difference $T_i - C_i$ is the unknown net undercoverage count $M_i$. The RRC estimate of this net undercoverage count for the i-th small area is $\hat{M}_i$. The true count $T_i$ can be expressed as the product of the observed count $C_i$ and the true adjustment factor $\theta_i = (M_i + C_i)/C_i = T_i/C_i$. The true adjustment factor can be estimated directly as $y_i = (\hat{M}_i + C_i)/C_i$. However, the direct estimator $\hat{M}_i$ may not be reliable. The problem is cast into a Fay-Herriot context as follows.

The sampling model can be written as $y_i = \theta_i + e_i$, where we assume that $E_p(e_i) = 0$ and $V_p(e_i) = \psi_i$, where $\psi_i$ is assumed to be known. The linking model is given by $\theta_i = z_i'\beta + v_i$, where $z_i$ is a set of auxiliary variables, and $v_i \overset{iid}{\sim} (0, \sigma_v^2)$.

The resulting Fay-Herriot estimator is given as $\hat{\theta}_{i,FH} = z_i'\hat{\beta}_{FH} + \hat{\gamma}_i (y_i - z_i'\hat{\beta}_{FH})$, where $\hat{\gamma}_i = \hat{\sigma}_v^2 / (\hat{\sigma}_v^2 + \psi_i)$.

The sampling variances are not known, but can be estimated as $\hat{\psi}_i = v(y_i)$ given the sampling plan for the RRC. As these variances are for domains, they tend to be variable. Dick (1995) smoothed them by using $\log v(\hat{M}_i) = \alpha + \beta \log(C_i) + \eta_i$, where it is assumed that $\eta_i \overset{iid}{\sim} N(0, \zeta^2)$.
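The log-linear smoothing model can be illustrated with a minimal sketch. It assumes ordinary least squares is an acceptable way to fit the regression of the log direct variance estimates on the log Census counts, and it back-transforms the fitted values with a plain exponential. Dividing by $C_i^2$ then gives a smoothed variance on the scale of $y_i = 1 + \hat{M}_i/C_i$. The function name `smooth_variances` and the input values are hypothetical and are not taken from Dick (1995).

```python
import math


def smooth_variances(v_direct, census_counts):
    """Fit log v_i = alpha + beta * log C_i by OLS and back-transform.

    Returns (alpha_hat, beta_hat, v_smooth, psi_smooth), where
    v_smooth[i] = exp(alpha_hat + beta_hat * log C_i) and
    psi_smooth[i] = v_smooth[i] / C_i**2.
    """
    x = [math.log(c) for c in census_counts]
    y = [math.log(v) for v in v_direct]
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Closed-form simple linear regression coefficients.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta_hat = sxy / sxx
    alpha_hat = y_bar - beta_hat * x_bar
    v_smooth = [math.exp(alpha_hat + beta_hat * xi) for xi in x]
    psi_smooth = [vs / c ** 2 for vs, c in zip(v_smooth, census_counts)]
    return alpha_hat, beta_hat, v_smooth, psi_smooth


# Hypothetical, noise-free inputs: v_i = 0.5 * C_i**0.9, so the fit
# should recover alpha = log(0.5) and beta = 0.9.
census_counts = [1000.0, 5000.0, 20000.0, 80000.0]
v_direct = [0.5 * c ** 0.9 for c in census_counts]
alpha_hat, beta_hat, v_smooth, psi_smooth = smooth_variances(v_direct, census_counts)
```

With real RRC variance estimates the fitted line would not pass through every point. The smoothed values replace the noisy direct variances cell by cell.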

The smoothed estimate of variance for the i-th small area is $\tilde{v}(\hat{M}_i) = \exp(\hat{\alpha} + \hat{\beta} \log(C_i))$. Hence, the smoothed variance of $y_i = 1 + \hat{M}_i / C_i$ is $\tilde{\psi}_i = \tilde{v}(\hat{M}_i) / C_i^2$.

Replacing the unknown $\psi_i$ by $\tilde{\psi}_i$ leads to $\tilde{\theta}_{i,FH} = z_i'\tilde{\beta}_{FH} + \tilde{\gamma}_i (y_i - z_i'\tilde{\beta}_{FH})$, where $\tilde{\sigma}_v^2$ and $\tilde{\beta}_{FH}$ are solved iteratively using the algorithm given in the appendix. For further details see You, Rao and Dick (2002).

The above methodology was used to estimate the 2001 Canadian Census undercoverage. The final z-variables used in the linking model (4.3) were Yukon, Nunavut, Male 20 to 29, Male 30 to 44, Female 20 to 29, British Columbia renters, Ontario renters and Northwest Territories renters.

Figure 4.5 displays the direct and FH estimates of undercoverage ratios by the domain sample sizes. Figure 4.6 displays the corresponding coefficients of variation (CV) of the direct and FH HB estimates.

Figure 4.5: Comparison of Direct and HB Estimates (Source: You and Dick 2004)

Figure 4.5 supports the conclusion that the FH approach leads to smoothed estimates, particularly for the domains with relatively small sample sizes. When the sample size is small, some direct net undercoverage estimates are negative because the overcoverage estimates are larger than the undercoverage estimates. The FH method “corrected” the negative values. All the FH net undercoverage estimates are positive.

Figure 4.6: Comparison of Direct and FH CVs (Source: You and Dick 2004)

In terms of the CV comparison given in Figure 4.6, the HB approach achieves a large CV reduction when the sample sizes are small. As the sample size increases, the CV reduction decreases, and the CVs of the direct and HB estimates become quite similar.

4.4 Labour Force Survey

Unemployment rates are produced on a monthly basis in Canada by the Labour Force Survey (LFS). The LFS samples some 53,000 households based on a stratified multi-stage design. The survey reduces response burden by having one-sixth of its sample replaced each month. For a detailed description of the LFS design, see Gambino, Singh, Dufour, Kennedy and Lindeyer (1998). The published provincial and national estimates of unemployment rates are a key indicator of economic performance in Canada.

Unemployment rates at levels lower than the provincial level are also of great interest. For instance, the unemployment rates for Census Metropolitan Areas (CMAs, i.e., cities with a population of more than 100,000) and Census Agglomerations (CAs, i.e., other urban centers) receive scrutiny from local governments. However, many of the CAs do not have a large enough sample to produce adequate direct estimates. Their estimates need to be produced using SAE techniques. You, Rao and Gambino (2003) used a cross-sectional and time series model to estimate unemployment for such small areas: their methodology borrowed strength both across time and small areas.

Let $y_{it}$ denote the direct LFS estimate of $\theta_{it}$, the true unemployment rate of the i-th CA (small area) at time t, for i = 1, ..., m, t = 1, ..., T, where m is the total number of CAs and T is the (current) time of interest. Assume that the sampling model is


$y_{it} = \theta_{it} + e_{it}$, i = 1, ..., m; t = 1, ..., T,

where the $e_{it}$'s are sampling errors. Since the CAs can be treated as strata, the $e_{it}$'s are uncorrelated between themselves for a given time period t. However, the rotation results in a significant level of overlap for the sampled households. This is reflected in the linking model given by $\theta_{it} = x_{it}'\beta + v_i + u_{it}$, where the error structure of the $u_{it}$'s is assumed to follow an AR(1) process, represented as $u_{it} = u_{i,t-1} + \varepsilon_{it}$; $\varepsilon_{it} \overset{iid}{\sim} (0, \sigma^2)$.

The error structure of the $e_{it}$'s is assumed known; as this is not the case in practice, the sample-based estimates need to be smoothed. You, Rao and Gambino (2003) used the Hierarchical Bayes (HB) procedure to estimate the required parameters in the error and linking equations. They compared numerically three estimators of the unemployment rates in June 1999. These estimators were the direct estimator (Direct Est), a small area estimator based only on the current cross-sectional data (the Fay-Herriot), and one using both the cross-sectional and longitudinal data (Space-time).

Figure 4.7 displays these LFS estimates for the June 1999 unemployment rates for the 62 CAs across Canada. The 62 CAs appear in the order of population size with the smallest CA (Dawson Creek, BC, population is 10,107) on the left and the largest CA (Toronto, Ont., population is 3,746,123) on the right. The Fay-Herriot model tends to shrink the estimates towards the average of the unemployment rates. The space-time model leads to moderate smoothing of the direct LFS estimates. For the CAs with large population sizes and therefore large sample sizes, the direct estimates and the HB estimates are very close to each other; for smaller CAs, the direct and HB estimates differ substantially for some regions.

Figure 4.7: Comparison of unemployment rates using Direct, Fay-Herriot, and space-time for June 1999 (vertical axis: unemployment rate (%); series: Direct Est, Fay-Herriot, Space-Time)

Figure 4.8: Comparison of coefficients of variation of unemployment rates using Direct, Fay-Herriot, and space-time estimates for June 1999 (vertical axis: estimated CV; series: Direct Est, Fay-Herriot, Space-Time)

Acknowledgements: The author would like to acknowledge Jon Rao, Peter Dick and Susana Rubin-Bleuer.

References

Australian Bureau of Statistics (2006). A Guide to Small Area Estimation - Version 1.1. Internal ABS document.
Battese, G.E., Harter, R.M., Fuller, W.A. (1988). An Error-Components Model for Prediction of Crop Areas Using Survey and Satellite Data, Journal of the American Statistical Association, 83, 28-36.
Brackstone, G.J. (1987). Small area data: policy issues and technical challenges. In R. Platek, J.N.K. Rao, C.E. Sarndal, and M.P. Singh, eds., Small Area Statistics, pp. 3-20. John Wiley & Sons, New York.
Béland, Y. (2002). Canadian Community Health Survey - Methodological overview. Health Reports, Statistics Canada, Catalogue no. 82-003-XPE (0030182-003-XIE.pdf), Vol. 13, No. 3, ISSN 0840-6529.
Dick, P. (1995). Modelling Net Undercoverage in the 1991 Canadian Census, Survey Methodology, 21, 45-54.
Drew, D., Singh, M.P., and Choudhry, G.H. (1982). Evaluation of Small Area Estimation Techniques for the Canadian Labour Force Survey, Survey Methodology, 8, 17-47.
Fay, R.E. and Herriot, R.A. (1979). Estimation of Income for Small Places: An Application of James-Stein Procedures to Census Data, Journal of the American Statistical Association, 74, 269-277.
Fuller, W.A. (1999). Environmental Surveys Over Time, Journal of Agricultural, Biological and Environmental Statistics, 4, 331-345.
Gambino, J.G., Singh, M.P., Dufour, J., Kennedy, B. and Lindeyer, J. (1998). Methodology of the Canadian Labour Force Survey, Statistics Canada, Catalogue No. 71-526.


Gonzalez, M.E., and Hoza, C. (1978). Small-Area Estimation with Application to Unemployment and Housing Estimates, Journal of the American Statistical Association, 73, 7-15.
Hidiroglou, M.A., Singh, A., and Hamel, M. (2007). Some thoughts on small area estimation for the Canadian Community Health Survey (CCHS). Internal Statistics Canada document.
Hidiroglou, M.A. and Särndal, C.E. (1985). Small Domain Estimation: A Conditional Analysis, Proceedings of the Social Statistics Section, American Statistical Association, 147-158.
Hidiroglou, M.A. and Patak, Z. (2004). Domain estimation using linear regression, Survey Methodology, 30, 67-78.
Levy, P.S. (1971). The Use of Mortality Data in Evaluating Synthetic Estimates, Proceedings of the Social Statistics Section, American Statistical Association, pp. 328-331.
Prasad, N.G.N., and Rao, J.N.K. (1990). The Estimation of the Mean Squared Error of Small-Area Estimators, Journal of the American Statistical Association, 85, 163-171.
Prasad, N.G.N. and Rao, J.N.K. (1999). On robust small area estimation using a simple random effects model, Survey Methodology, 25, 67-72.
Rao, J.N.K. (2003). Small Area Estimation. New York: Wiley.
Rao, J.N.K. and Choudhry, H. (1995). Small Area Estimation: Overview and Empirical study. Business Survey Methods, Edited by Cox, Binder, Chinnappa, Christianson, Colledge, Kott, Chapter 27.
Rubin-Bleuer, S., Godbout, S. and Morin, Y. (2007). Evaluation of small domain estimators for the Canadian Survey of Employment, Payrolls and Hours. Paper presented at the third International Conference of Establishment Surveys July 2007 Statistical of Society Meetings.
Schaible, W.A. (1978). Choosing Weights for Composite Estimators for Small Area Statistics, Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 741-746.
Singh, A.C. and Verret, F. (2006). Mixed Linear Nonlinear Aggregate level Models for Small Area Estimation for Binary count data from Surveys. Proceedings of the Statistics Canada Symposium.
Singh, A.C. (2006). Some problems and proposed solutions in developing a small area estimation product for clients. ASA Proc. Surv. Res. Meth. Sec.
Singh, M.P., Gambino, J., Mantel, H.J. (1994). Issues and Strategies for Small Area Data, Survey Methodology, 20, 3-22.
Woodruff, R.S. (1966). Use of a Regression Technique to Produce Area Breakdowns of the Monthly National Estimates of Retail Trade, Journal of the American Statistical Association, 61, 496-504.
You, Y., and Rao, J.N.K. (2002). A Pseudo-Empirical Best Linear Unbiased Prediction Approach to Small Area Estimation Using Survey Weights, Canadian Journal of Statistics, 30, 431-439.
You, Y., Rao, J.N.K. and Dick, J.P. (2002). Benchmarking hierarchical Bayes small area estimators with application in census undercoverage estimation. Proceedings of the Survey Methods Section 2002, Statistical Society of Canada, 81-86.
You, Y., Rao, J.N.K., and Gambino, J.G. (2003). Model-based unemployment rate estimation for the Canadian Labour Force Survey: A hierarchical Bayes approach, Survey Methodology, 29, 25-32.
You, Y. and Dick, P. (2004). Hierarchical Bayes Small Area Inference to the 2001 Census Undercoverage Estimation. Proceedings of the ASA Section on Government Statistics, 1836-1840.


Appendix: Fay-Herriot computational summary

| Description | Computation |
|---|---|
| 1. Model a smooth function of $Y_i$ | $\theta_i = g(Y_i)$, where $Y_i$ is the small area population mean for the i-th small area; i = 1, ..., m |
| 2. Direct estimate of $\theta_i$ | $\hat{\theta}_i = g(\hat{Y}_i)$, where $\hat{Y}_i$ is the observed direct estimate |
| 3. Auxiliary data | $z_i = (z_{1i}, z_{2i}, \ldots, z_{pi})'$ |
| 4. Linking model: connect the $\theta_i$ | $\theta_i = z_i'\beta + v_i$; $v_i$ i.i.d. under the model $(0, \sigma_v^2)$; $\sigma_v^2$ = model variance |
| 5. Sampling model | $\hat{\theta}_i = \theta_i + e_i$; sampling errors $e_i$ independent with $E_p(e_i \mid \theta_i) = 0$ and sampling variance $V_p(e_i \mid \theta_i) = \psi_i$ (assumed known) |
| 6. Combine 4 and 5 | $\hat{\theta}_i = z_i'\beta + v_i + e_i$: Fay-Herriot model |
| 7. Estimation of $\sigma_v^2$ | Method of moments: solve $h(\sigma_v^2) = \sum_{i=1}^{m} (\hat{\theta}_i - z_i'\tilde{\beta}(\sigma_v^2))^2 / (\psi_i + \sigma_v^2) = m - p$ for $\sigma_v^2$ via the iteration $\sigma_v^{2(r+1)} = \sigma_v^{2(r)} + [m - p - h(\sigma_v^{2(r)})] / h_*'(\sigma_v^{2(r)})$, constraining to $\sigma_v^{2(r+1)} \ge 0$, where $h_*'(\sigma_v^2) = -\sum_{i=1}^{m} (\hat{\theta}_i - z_i'\tilde{\beta}(\sigma_v^2))^2 / (\psi_i + \sigma_v^2)^2$ is an approximation to the derivative of $h(\sigma_v^2)$ (see p. 118, Rao (2003)) |
| 8. Optimal model-based Fay-Herriot estimator | $\hat{\theta}_{i,FH} = \hat{\gamma}_i \hat{\theta}_i + (1 - \hat{\gamma}_i) z_i'\hat{\beta} = z_i'\hat{\beta} + \hat{\gamma}_i (\hat{\theta}_i - z_i'\hat{\beta}) = z_i'\hat{\beta} + \hat{v}_i$, where $\hat{\beta}$ is the weighted least squares estimator of $\beta$, $\hat{\beta} = \tilde{\beta}(\hat{\sigma}_v^2) = [\sum_{i=1}^{m} z_i z_i' / (\psi_i + \hat{\sigma}_v^2)]^{-1} [\sum_{i=1}^{m} z_i \hat{\theta}_i / (\psi_i + \hat{\sigma}_v^2)]$, and $\hat{\gamma}_i = \hat{\sigma}_v^2 / (\psi_i + \hat{\sigma}_v^2)$ (see p. 116, Rao (2003)) |
| 9. Leading term of MSE of $\hat{\theta}_{i,FH}$ | $MSE(\hat{\theta}_{i,FH}) = E(\hat{\theta}_{i,FH} - \theta_i)^2$, where the expectation is with respect to the Fay-Herriot model (see step 6). The leading term $g_{1i}(\sigma_v^2) = \gamma_i \psi_i$ shows that the efficiency of $\hat{\theta}_{i,FH}$ over the direct estimator $\hat{\theta}_i$ is $\gamma_i^{-1}$ for a large number of areas m. If $\gamma_i = \sigma_v^2 / (\psi_i + \sigma_v^2) = 1/2$, then the efficiency is 200%, or the gain in efficiency is 100%. |
| 10. Scenarios for large efficiency gains | Sampling variance $\psi_i$ large, or model variance $\sigma_v^2$ small relative to $\psi_i$ |
| 11. Nearly unbiased estimator of $MSE(\hat{\theta}_{i,FH})$ | $mse(\hat{\theta}_{i,FH})$: see equation (7.1.26), p. 129, Rao (2003); easily programmable |
| 12. Estimation of small area mean $Y_i$ | $\hat{Y}_{i,FH} = g^{-1}(\hat{\theta}_{i,FH}) = K(\hat{\theta}_{i,FH})$ |
| 13. MSE estimator of $\hat{Y}_{i,FH}$ | $mse(\hat{Y}_{i,FH}) = [K'(\hat{\theta}_{i,FH})]^2 \, mse(\hat{\theta}_{i,FH})$; may not be nearly unbiased. Empirical Bayes (EB) and hierarchical Bayes (HB) methods are better suited for handling non-linear cases $K(\hat{\theta}_{i,FH})$; see p. 133, Rao (2003) |
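Steps 7 to 9 of the summary above can be sketched in a few lines of Python. This is an illustrative sketch only, not the production code used at Statistics Canada. It assumes the sampling variances $\psi_i$ are supplied (for example, smoothed values), starts the iteration at $\sigma_v^2 = 0$, and uses NumPy for the weighted least squares step. The function names `wls_beta` and `fay_herriot`, the stopping rule and the simulated inputs are all hypothetical.

```python
import numpy as np


def wls_beta(theta_hat, Z, psi, sigma2_v):
    """Weighted least squares estimate of beta for a given sigma_v^2 (step 8)."""
    w = 1.0 / (psi + sigma2_v)
    A = Z.T @ (w[:, None] * Z)      # sum_i z_i z_i' / (psi_i + sigma_v^2)
    b = Z.T @ (w * theta_hat)       # sum_i z_i theta_hat_i / (psi_i + sigma_v^2)
    return np.linalg.solve(A, b)


def fay_herriot(theta_hat, Z, psi, max_iter=100, tol=1e-10):
    """Method-of-moments fit of sigma_v^2 (step 7), then the FH estimator (step 8)."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    Z = np.asarray(Z, dtype=float)
    psi = np.asarray(psi, dtype=float)
    m, p = Z.shape
    sigma2_v = 0.0                  # assumed starting value
    for _ in range(max_iter):
        beta = wls_beta(theta_hat, Z, psi, sigma2_v)
        resid2 = (theta_hat - Z @ beta) ** 2
        h = np.sum(resid2 / (psi + sigma2_v))
        h_prime = -np.sum(resid2 / (psi + sigma2_v) ** 2)
        if h_prime == 0.0:          # exact fit: nothing left to iterate on
            break
        # Update of step 7, truncated at zero.
        new_sigma2_v = max(0.0, sigma2_v + (m - p - h) / h_prime)
        done = abs(new_sigma2_v - sigma2_v) < tol
        sigma2_v = new_sigma2_v
        if done:
            break
    beta = wls_beta(theta_hat, Z, psi, sigma2_v)
    gamma = sigma2_v / (psi + sigma2_v)
    synthetic = Z @ beta
    theta_fh = synthetic + gamma * (theta_hat - synthetic)
    g1 = gamma * psi                # leading MSE term of step 9
    return sigma2_v, beta, gamma, theta_fh, g1


# Simulated (hypothetical) data: m areas, an intercept and one covariate.
rng = np.random.default_rng(0)
m = 30
Z = np.column_stack([np.ones(m), rng.uniform(0.0, 1.0, m)])
psi = rng.uniform(0.01, 0.05, m)
theta = Z @ np.array([1.0, 0.5]) + rng.normal(0.0, 0.2, m)
theta_hat = theta + rng.normal(0.0, np.sqrt(psi))

sigma2_v, beta_hat, gamma, theta_fh, g1 = fay_herriot(theta_hat, Z, psi)
```

Each Fay-Herriot estimate lies between the direct estimate and the regression-synthetic value $z_i'\hat{\beta}$. Areas with a large $\psi_i$ relative to $\hat{\sigma}_v^2$ are pulled further towards the synthetic value, which matches the scenario described in step 10. The nearly unbiased MSE estimator of step 11 is not implemented here.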
