
Nicola Torelli, Matilde Trevisani

Labour force estimates for small geographical domains in Italy: problems, data and models
Working Paper n. 118 2008

Dipartimento di Scienze Economiche e Statistiche, Università di Trieste, Italy
e-mail: nicola.torelli@econ.units.it, matilde.trevisani@econ.units.it

Abstract
One of the contexts where small area estimation techniques have proved their potential is the analysis of data collected in national labour force surveys to obtain estimates for small geographical domains. Applications of small area estimation methods to labour force survey data have recently been considered in Italy as well. The paper reviews the specific problems, data and opportunities for applying small area estimation models to produce reliable information on labour force aggregates at provincial and sub-provincial level in Italy. Some new developments stimulated by the application of small area estimation models to labour force survey data are also discussed.

Keywords: Small area estimation, Bayesian hierarchical models, count data, local labour markets, spatial misalignment.

1. Introduction
Labour force surveys (LFSs) are the major source of information on labour force conditions in most developed countries. LFSs aim to provide timely official estimates of the number of employed and unemployed persons and of activity, employment and unemployment rates, which are important measures of the performance of a country's economy. The importance of these surveys for understanding participation in the labour market cannot be overstated: it is not surprising that LFSs are often the most demanding and expensive surveys carried out by national statistical agencies. In fact, they are usually large-scale repeated surveys that allow estimation of the relevant quantities over time: monthly (as in the US and Canada), quarterly (as in many European countries, including Italy) or annually. LFSs usually adopt complex survey designs with two or more stages and stratification of primary sampling units, with the aim of producing estimates for the country as a whole and for large geographical domains.

In recent years, demand for estimates for smaller geographical areas has grown enormously: welfare policies that devolve resources from central to local authorities according to the state of the local economy need reliable information. Even for such large surveys, direct estimates for small geographical domains, i.e., estimates based only on the survey data actually collected therein, are too unstable to be useful. Small area estimation (SAE) techniques aim to obtain more reliable estimates by using (possibly) explicit small area models that borrow strength from related areas across space and/or time, or through auxiliary information that is supposed to be correlated with the variable of interest (for a survey of small area estimation methods, see Rao (2003) and also Pfeffermann (2002)).

Estimation of labour force related quantities, like the number of employed or unemployed persons, has been considered a field where small area estimation techniques can be applied most effectively owing to: (i) the availability of administrative archives from which potentially helpful auxiliary data can be easily obtained; (ii) the availability of estimates of the relevant quantities over time. It is then not surprising that small area estimation techniques have been applied so frequently to labour force estimation, which is still one of the most stimulating fields of application of SAE techniques.

The aim of this paper is to review applications of SAE to LFS data with a focus on the Italian case. Some specific problems arising when estimating labour force quantities in fact offer new methodological challenges that can help extend the applicability and scope of SAE techniques. In section 2, we describe the Italian LFS. After a concise review of small area models (section 3), the application of small area estimation techniques to Italian Labour Force Survey data is discussed (section 4). Sections 5 and 6 consider some new challenges arising when using SAE models for labour force estimation. Focusing on the hierarchical Bayesian approach, section 5 discusses SAE modelling of count data (a case relevant when the goal is to estimate variables like the number of employed or unemployed persons in a small geographical domain). The problem of using complementary data that refer to geographical units different from the target geographical domain is considered in section 6. It is worth noting that this extension of SAE models has been stimulated by a real problem faced when trying to use administrative data such as the number of persons enrolled in Labour exchange offices in Italy. Some concluding remarks are in section 7.

Work supported by Italian Ministry of University and Research, Prin 2005: Innovative Methods, organization and contents in sample surveys for agriculture and environment.

2. The Italian Labour force survey and the need of information for small geographical domains
In most countries there is great demand for reliable statistics at the local level to support decision making and service organization and delivery. At the same time, producers of official statistics are required to reduce costs and respondent burden. This is particularly true for labour force statistics. In response to these needs, in Italy in the 1980s it was quite common for some local governments (like the Italian Regions) to pay for an increase in the LFS sample size in order to obtain reliable estimates for smaller geographical domains such as Provinces or large Municipalities.

Labour force surveys represent the main source of information for studying the labour market. Nowadays, national statistical agencies of European Union countries carry out LFSs according to Eurostat community regulations, and the surveys are designed according to common standards to provide quarterly official estimates of the number of employed and unemployed persons, of activity, employment and unemployment rates, etc. The breadth of the information gathered also allows national statistical agencies to produce information on many specific aspects of participation in the labour market: professional condition, sector of economic activity, type of working hours, job duration, occupation, etc.

In many countries the LFS is the largest and most demanding household survey. This is also true in Italy, where the sample includes over 70,000 households (the ultimate sampling units) selected within a sample of municipalities (the primary sampling units, PSUs). PSUs are stratified according to the size of the municipalities, and large municipalities are self-representing strata. In 2004 the Italian LFS was restructured: its information content was enlarged and its methodology modified. Before 2004 the survey gathered information on the first week of each three-month period, while now it is carried out continuously over all 13 weeks of the period.

The Italian LFS has been explicitly designed to produce reliable statistics for large administrative areas. The actual sample size allows quarterly estimates of the relevant quantities to be obtained for areas as large as the Italian Regions. Yearly estimates of the various quantities can also be produced for Provinces (smaller administrative areas within the Regions) by pooling data collected in the different waves of the survey within the year. These large areas are designated as planned domains, while other local self-government areas, areas consisting of a single (large) municipality or groups of small municipalities are unplanned domains. For unplanned domains the actual sample size is usually too small, and direct estimates (possibly post-stratified by using appropriate auxiliary information) of the quantities of interest are extremely unreliable.

Small area estimation techniques have been recognized by national statistical agencies as the most effective strategy for obtaining valid estimates for small geographical domains. Estimating the number of employed or unemployed people, the activity rate or the unemployment rate by small area estimation techniques involves, as we will discuss later on, modelling survey data and making appropriate use of all the information available from auxiliary sources. This task is not straightforward, and successful application of small area techniques depends on the specific quantity to be estimated, on the territorial scale at which information is needed, and on the availability and type of auxiliary information.

3. Small area models: some general issues


Details on small area estimation techniques can easily be found in the general references given above, and a thorough discussion of the methods and their properties is beyond the scope of this paper. The structure of the basic small area models is only sketched here as a basis for the subsequent discussion of the potential of SAE models for application to LFS data. Some extensions of these models, within the Bayesian framework, will be introduced in sections 5 and 6.

Rao (2003) argued that traditional methods of indirect estimation, like synthetic estimation, are based on an implicit model that specifies, through auxiliary data, the link relating the small areas. More significantly, in the last decades emphasis has been on explicit small area models. These are mixed models in which area-specific random effects explain the extra variation in small areas that is not accounted for by the auxiliary variables introduced into the model. Explicit models can be classified into two categories: (i) area level models and (ii) unit level models. The core of classical small area models consists essentially of linear mixed models.

The most common specification of an area level model is the Fay-Herriot model (Fay and Herriot, 1979). It consists of an area-linking model, e.g.,

\[ \theta_i = x_i^T \beta + \nu_i, \qquad \nu_i \overset{ind}{\sim} N(0, \sigma_\nu^2) \]

(hereafter $\overset{ind}{\sim}$ stands for "independently distributed as"), coupled with a sampling error model, e.g.,

\[ \hat{\theta}_i = \theta_i + e_i, \qquad e_i \mid \theta_i \overset{ind}{\sim} N(0, \psi_i). \]

The terms $\theta_i$, $\hat{\theta}_i$ and $x_i$ are, respectively, the characteristic of interest, its survey estimate obtained from the sample data actually collected in area $i$, and the auxiliary data available for each area $i$. The linking model is merely a linear mixed model where $\beta$ are fixed coefficients, accounting for the effects of the auxiliary variables $x$ and valid for the entire population, while $\nu_i$ are random area-specific effects. The sampling variances $\psi_i$ are usually assumed to be known, while $\beta$ and $\sigma_\nu^2$ are parameters to be estimated. Parameter estimation follows the usual strategies adopted for linear mixed models, which lead to the well-known EBLUP (Empirical Best Linear Unbiased Prediction). Alternatively, prediction based on a hierarchical Bayesian (HB) form of the mixed model has been proposed (a first example is in Datta and Ghosh (1991), while an account of its merits is in Arora and Lahiri (1997)). The specification of the Fay-Herriot model within a hierarchical Bayesian framework will be presented in section 5 along with some new developments.

The specification of a unit level model (according to the seminal paper by Battese et al. (1988)) has the following structure. Let $y_{ij}$ denote the target variable and $x_{ij}$ the values of a set of auxiliary variables for unit $j$ in area $i$; then

\[ y_{ij} = x_{ij}^T \beta + \nu_i + \varepsilon_{ij}, \]

where $\nu_i$ and $\varepsilon_{ij}$ are mutually independent (Gaussian) error terms with zero means and variances $\sigma_u^2$ and $\sigma_\varepsilon^2$ respectively. The random term $\nu_i$ represents the effect of area characteristics that are not accounted for by the auxiliary variables. Note that this model assumes knowledge of auxiliary information at the unit level. EBLUP is also possible for this model, whose structure parallels that of a standard two-level model (details can be found in Rao (2003)).

These models have been refined to address many of the problems that often arise in practice. Generalized linear mixed models can be considered when the variable of interest is a proportion or a count (a general approach is suggested in Ghosh et al. (1998)). When the focus is on small geographical entities, it can be sensible to allow the small area random effects to be spatially correlated (a first application of this idea in the small area context is in Datta et al. (1999)). Time series models can be introduced to borrow strength from measurements taken on the same area on different occasions (Pfeffermann et al. (1998), Pfeffermann and Tiller (2006)). Time series models are highly relevant for the application of SAE to LFS data, since time series of measurements (of the unemployment rate, for instance) are usually available.

Significantly enough, many of these refinements and further developments, if not all, were introduced to address specific problems encountered in estimating labour force related quantities for small geographical domains. Small area estimation from LFS data is still very stimulating, and it is not surprising that some new developments have originated from it. More specifically, the standard SAE model specification with a continuous target variable, as in Fay-Herriot, could be inappropriate. In section 5 it will be argued that models for count data are more appropriate when estimating classical labour force related quantities (i.e., the number of employed or unemployed persons).
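As a rough illustration of how the Fay-Herriot area level model of this section turns into estimates, the sketch below computes the BLUP for a given random-effect variance and a simple moment estimate of that variance. It is a minimal sketch and not the procedure used in any of the studies cited here: the function names are hypothetical, `d` holds the direct estimates, `psi` the sampling variances assumed known, `X` the area-level covariates, and the moment estimator is the Prasad-Rao one as usually presented in the SAE literature, truncated at zero.

```python
import numpy as np

def fay_herriot_blup(d, X, psi, sigma2_v):
    """BLUP of the area quantities for a given random-effect variance sigma2_v."""
    d, X, psi = (np.asarray(a, dtype=float) for a in (d, X, psi))
    w = 1.0 / (sigma2_v + psi)                 # inverse marginal variances of d_i
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * d))  # GLS fit
    synthetic = X @ beta                       # regression-synthetic part
    gamma = sigma2_v / (sigma2_v + psi)        # shrinkage weight on the direct estimate
    return gamma * d + (1.0 - gamma) * synthetic, beta, gamma

def prasad_rao_sigma2(d, X, psi):
    """Moment estimate of sigma2_v from OLS residuals, truncated at zero."""
    d, X, psi = (np.asarray(a, dtype=float) for a in (d, X, psi))
    m, p = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)      # hat matrix of the OLS fit
    resid = d - H @ d
    est = (resid @ resid - np.sum(psi * (1.0 - np.diag(H)))) / (m - p)
    return max(float(est), 0.0)
```

Plugging `prasad_rao_sigma2(d, X, psi)` into `fay_herriot_blup` gives an EBLUP-type estimate: areas with a large sampling variance are pulled towards the regression prediction, while precisely estimated areas stay close to their direct estimate.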

4. SAE for Labour force survey data in Italy


In this section, applications of small area estimation techniques to data from the Italian LFS are reviewed, with emphasis on the use of explicit models. The geographical detail at which small area estimates are needed is the first criterion for classifying Italian studies. The actual availability of potential auxiliary information, the quality of this information and the nature of the quantities to be estimated are other important evaluation criteria. The aim is not simply to give a (more or less detailed) account of the use of small area estimation methodology for LFS data: we expect that the analysis can help to point out some new directions of research and to stimulate new developments and new approaches to SAE.

In Italy, researchers at the national statistical agency (Istat) started working about twenty years ago on developing SAE methods and evaluating their implications for the production of small area statistics (see, for instance, Falorsi et al. (1994, 1995)). This has led, more recently, to applications of small area techniques in the production of official statistics. Istat, for instance, recently began using SAE to produce labour force estimates at territorial levels below the regional one, namely Local Labour Markets (LLM) and provinces. These are, in fact, the two territorial levels that were considered by Istat for application of SAE techniques. Note that they correspond to two levels of the European Nomenclature des Unités Territoriales Statistiques (NUTS) hierarchy. More precisely, NUTS3 corresponds to provinces, while in Italy there is no officially defined NUTS4 geography; LLM, which are groups of municipalities, can be considered a proxy for NUTS4. It is worth noting, however, that Italian local authorities also demand SAE for different geographical partitions, and trying to cover all possible cases is pointless.
Our analysis will be limited to those geographical partitions that have so far been considered in real applications and that refer to geographical domains for which the need for statistical data has been widely recognized.

Local labour markets (LLM)
LLM consist of groups of geographically connected municipalities such that the proportion of people living and working within each group is high. The partition of Italy into local labour systems is one of the most relevant spatial references for studying the labour market and for policy evaluation. LLM form a finer partition of the territory and are therefore geographical domains not planned by the Italian LFS. The survey sample sizes associated with such minor domains are inadequate for precise (design-based) estimates. LLM are determined from the analysis of data collected in the Census and are therefore revised every ten years. This geographical partition has been, perhaps, the most relevant one for which the application of SAE techniques to labour force survey data has been considered. The basic methodology adopted by Istat for SAE, leading to the publication of official figures, is presented in Cruciani et al. (2002). It relies on composite estimators without explicit SAE models. Composite estimators balance the instability of an unbiased direct estimator against the bias of a more efficient synthetic estimator (related, for instance, to a larger area) by taking a weighted average of the two. The direct estimator is usually calibrated by using information on some auxiliary variables (whose totals are assumed to be known at area level).
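The weighted average behind a composite estimator can be sketched in a few lines. This is only an illustration under stated assumptions, not the estimator of Cruciani et al. (2002): the function names are hypothetical, and the weight rule shown is the textbook approximation that ignores the covariance between the direct and the synthetic estimator.

```python
import numpy as np

def composite_estimate(direct, synthetic, weight):
    """Area-by-area weighted average; weight is the share given to the direct estimate."""
    direct, synthetic, weight = (np.asarray(a, dtype=float)
                                 for a in (direct, synthetic, weight))
    if np.any((weight < 0.0) | (weight > 1.0)):
        raise ValueError("weights must lie in [0, 1]")
    return weight * direct + (1.0 - weight) * synthetic

def approx_optimal_weight(mse_direct, mse_synthetic):
    """Approximate MSE-minimising weight, assuming negligible covariance
    between the two estimators (an assumption of this sketch)."""
    mse_direct = np.asarray(mse_direct, dtype=float)
    mse_synthetic = np.asarray(mse_synthetic, dtype=float)
    return mse_synthetic / (mse_direct + mse_synthetic)
```

For example, a direct estimate of 100 with mean squared error 30 and a synthetic estimate of 80 with mean squared error 10 give a weight of 0.25 on the direct estimate and a composite value of 85.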

SAE techniques have been compared in De Vitiis et al. (2002) and Di Consiglio et al. (2003), who also considered explicit models. In those cases, the assumption was that auxiliary variables were available from census data. The LLM in Italy were also considered as a target geographical level within the EURAREA project (EURAREA, 2004), a large European project, started in 2001, which involved many national statistical offices and selected academic researchers. One of the aims of the project was a comparison of different SAE techniques. Within the EURAREA project a large simulation study was designed in order to compare, for Italy, the performance of different models in estimating unemployment at LLM level. The project gave new stimuli to research on SAE models: the use of explicit SAE models, the introduction of spatial correlation and the use of time series of data to borrow strength over time were clearly advocated. In this case too, the auxiliary variables were obtained from census data.

Industrial districts
The industrial districts correspond to aggregations of neighbouring municipalities in which small and medium-sized firms specialising in manufacturing activities are backward and forward linked in the same production chain (Bartoloni, 2008). Their size does not differ much from that of a typical LLM, but the districts can be scattered over the territory and usually do not form a partition of the entire larger area. Bartoloni (2008) considered SAE of employed and unemployed people for industrial districts (as defined by the regional law) in Lombardy. Composite estimators are considered, and some auxiliary variables (census population counts by sex and age at the district level) are used to post-stratify the direct estimator. The use of explicit models and of more suitable auxiliary variables is clearly advocated.
Provinces (quarterly)
Provinces are an administrative domain in Italy (unlike LLM and industrial districts, which are functional domains), and it is not surprising that they form a geographical domain for which statistical information is of crucial importance. They correspond to the NUTS3 level, and in Italy their population size varies from a few thousand to millions (as for the Provinces that include Rome or Milan). As already mentioned, the LFS is designed to provide reliable estimates of the main quantities (the unemployment rate, for instance) at least every year: the sample size for a given quarter is too small for many provinces, so data collected in the four LFS waves carried out in a given year are pooled together in order to achieve a large enough sample. The goal is then to obtain quarterly provincial estimates. This problem has already been considered by researchers at Istat. Falorsi et al. (1998) proposed alternative composite estimators where direct estimators were post-stratified by using provincial level population counts. Provinces were also considered in the EURAREA extensive simulation programme. Results parallel those for LLM: the use of explicit models allowing for spatial (and possibly temporal) correlation could lead to significant improvement. Menardi et al. (2004) present an application of small area models to provide quarterly estimates of labour force aggregates in the four provinces of an Italian region (Friuli-Venezia Giulia). This application: (i) considers an explicit small area model, (ii) exploits important auxiliary variables from an external source available at provincial level (i.e., the number of unemployed persons enrolled in the Labour exchange offices), (iii) borrows strength over time using time series models that exploit the collection of measurements of the quantity to be estimated over time, (iv) uses a Bayesian approach in specifying the model, predicting the provincial values and obtaining sensible variability measures. The paper follows closely the strategy already adopted by Datta et al. (1999) and is an excellent example of the potential of SAE models that make efficient use of complementary auxiliary information.

(Large) Municipalities
Municipalities are very natural administrative domains (they correspond to NUTS5). It is surprising that the municipal level has so far been neglected as a level for using SAE techniques. Note that the size of Municipalities in Italy varies widely, and the problem of estimating labour force related quantities for a large city (say Rome or Milan, which are self-representing strata in the LFS and hence selected into the sample with certainty) can be very different from that for a small village (an administrative unit selected into the sample with a given probability). SAE techniques can obviously be of some help in this context. A partly related problem has been described in Giommi et al. (2008). Starting from data collected in a specially designed LFS-type survey in Florence, a SAE unit level model is fitted to obtain estimates of employed and unemployed persons for districts and other sub-areas of the municipality.

4.1. Auxiliary information for SAE
As discussed above, the quality of a SAE model depends crucially on the availability of auxiliary information. At area level, for geographical domains corresponding to administrative regions, it could be easier to find good auxiliary covariates, while this may not be true when the areas are functional domains (like the LLM).
In many applications of SAE to the LFS in Italy the census has been the major source of auxiliary information. Census data have many advantages, since they allow appropriate information to be built for any area level worth considering. Moreover, the census collects data on aspects strictly related to labour force status. Unfortunately, census data become outdated very quickly.

A much more important source of auxiliary information is administrative archives. This information is often strictly related to the phenomena under study, and its explanatory power is very high compared to potential auxiliary variables collected in the survey. Registers of labour exchange offices, of persons receiving unemployment benefits or participating in other welfare programs, and of taxes and revenues can usually provide valuable information at the appropriate territorial level to be used as auxiliary information. These administrative archives can be very rich, and the quality of the data collected is usually good. Administrative archives are excellent in northern European countries, where register-based official statistics are quite familiar.

It is important to note, however, that in many cases, and especially for geographical domains that have no administrative relevance (like LLM), area level auxiliary information may not be defined over the same geographical grid. This can easily happen when administrative data relate to a service that has been organized according to different territorial criteria. In this case, the target small areas and the areas for which auxiliary variables are available are misaligned. This problem calls for new methodological developments and will be dealt with in section 6, with reference to the estimation of unemployment in LLM.

Finally, it is important to note that unit level models have been applied less frequently to LFS data in Italy. In fact, it is difficult to find relevant auxiliary information from external sources at the unit level. Using census data, when not completely out of date, would in principle be possible by means of record linkage procedures that locate the records pertaining to the same units in the two archives. The same idea could be applied to records from administrative archives, but in this case the use of a record linkage procedure could be further complicated by the conflicting goal of protecting the privacy of respondents and avoiding disclosure of sensitive information.

5. Hierarchical Bayesian approach to area level models with count data


Bayesian methods have proved to be quite effective for solving SAE problems. But the general theory as well as specific applications of both the EB and HB approaches have concentrated mainly on continuous variables. To date, a discussion of the most appropriate model specification when small area estimates are needed for discrete or categorical variates has not yet been thoroughly developed. In this section, we introduce alternative HB models for the case in which survey data, at area level, consist of counts, summing up several model structures proposed in recent years (Ghosh et al., 1998; You and Rao, 2002; Rao, 2003; You et al., 2004; Trevisani and Torelli, 2004, 2006, 2007). The motivating example is the estimation of small area totals (e.g. the numbers of employed or unemployed persons for Local Labour Markets) when LF survey data are the information available at area level.

The basic small area model for area level data is the classical Fay-Herriot (FH) model which, under a HB approach, can be specified as

\[
\begin{aligned}
d_i \mid \theta_i, \psi_i &\overset{ind}{\sim} \mathrm{Normal}(\theta_i, \psi_i) \\
\theta_i \mid \beta, \sigma_\nu^2 &\overset{ind}{\sim} \mathrm{Normal}(x_i^T \beta, \sigma_\nu^2) \\
(\beta, \sigma_\nu^2) &\sim \pi(\beta, \sigma_\nu^2).
\end{aligned}
\tag{1}
\]

As before, $\theta_i$ is the small area total, $d_i$ the direct survey estimate (when available) and $x_i^T = (x_{i1}, \ldots, x_{ip})$ any area-specific auxiliary data, for each area $i$. In (1) the FH model is specified stage-wise (as is common in HB writing, to emphasise the conditional independence assumptions): the sampling and linking models (first and second stages) are unchanged whereas, within a full HB approach, an additional hyperprior stage (third row), that is a prior distribution $\pi$ on the founding parameters $\beta$ and $\sigma_\nu^2$, is required. Formally, then, a Bayesian analysis allows any prior opinion or external empirical evidence to be incorporated via the prior distribution $\pi$. In practice (and this is still the mainstream in SAE analyses), out of ignorance or because we want inference to be driven solely by the data at hand, noninformative priors are often used. In this case, to prevent the posterior density from being improper, diffuse yet proper (in other words, weakly informative) priors are routinely assumed. Such a choice, which however needs a careful sensitivity analysis especially when models are barely identified, generally ensures a valid inference.
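To make the stage-wise structure of the HB Fay-Herriot model concrete, the following is a minimal Gibbs sampler sketch. It is not the authors' implementation (the paper mentions Bugs for computations). The hyperpriors are assumptions chosen only for illustration: a diffuse but proper Normal prior with variance `tau2` on the regression coefficients and an Inverse-Gamma(`a`, `b`) prior on the random-effect variance. The function name and all default values are hypothetical.

```python
import numpy as np

def gibbs_fay_herriot(d, X, psi, n_iter=2000, burn_in=500,
                      tau2=1e6, a=0.01, b=0.01, seed=0):
    """Gibbs sampler for d_i ~ N(theta_i, psi_i), theta_i ~ N(x_i' beta, sigma2).

    Assumed hyperpriors (illustrative): beta ~ N(0, tau2 * I),
    sigma2 ~ Inverse-Gamma(a, b). Returns retained draws of theta (rows).
    """
    rng = np.random.default_rng(seed)
    d, X, psi = (np.asarray(arr, dtype=float) for arr in (d, X, psi))
    m, p = X.shape
    XtX = X.T @ X
    beta = np.linalg.lstsq(X, d, rcond=None)[0]   # start from an OLS fit
    sigma2 = float(np.var(d - X @ beta)) + 1e-6   # crude positive starting value
    draws = np.empty((n_iter - burn_in, m))
    for t in range(n_iter):
        # theta_i | rest: Normal, mixing the direct estimate and the regression value
        gamma = sigma2 / (sigma2 + psi)
        theta = rng.normal(gamma * d + (1.0 - gamma) * (X @ beta),
                           np.sqrt(gamma * psi))
        # beta | rest: conjugate Normal update
        V = np.linalg.inv(XtX / sigma2 + np.eye(p) / tau2)
        V = 0.5 * (V + V.T)                       # remove round-off asymmetry
        mean = V @ (X.T @ theta) / sigma2
        beta = mean + np.linalg.cholesky(V) @ rng.standard_normal(p)
        # sigma2 | rest: conjugate Inverse-Gamma update
        resid = theta - X @ beta
        sigma2 = 1.0 / rng.gamma(a + 0.5 * m, 1.0 / (b + 0.5 * resid @ resid))
        if t >= burn_in:
            draws[t - burn_in] = theta
    return draws
```

Column means of the returned array approximate the posterior means of the small area quantities, and column quantiles give credible intervals. In line with the remark on sensitivity analysis, it is worth rerunning the sampler with different values of `a`, `b` and `tau2`.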

The classical FH specification (1) may be defective either for (i) assuming the sampling errors $\varepsilon_i = d_i - \theta_i$ to be normal or for (ii) setting a linear link $\theta_i = x_i^T \beta + \nu_i$ (with $\nu_i \overset{ind}{\sim} \mathrm{Normal}(0, \sigma_\nu^2)$) directly between $\theta_i$ and $x_i$. Indeed, the $\theta_i$ are counts (i.e. positive integer-valued variates); moreover, a non-identity link $g(\theta_i) = x_i^T \beta + \nu_i$ can be more appropriate when the predicted variable $\theta_i$ is non-continuous and/or the covariates $x_i$ are thought to produce a non-additive effect on it. Of course, the Normal-Normal model (1) owes its popularity to being in general computationally convenient and inferentially tractable by classical estimation methods.

We can envision several directions for extending the FH model (1), which are listed in the section below. The first one (matched models) actually consists in an alternative definition of FH-type models: a matched specification is, essentially, a two-stage Normal model fitted to both the estimates of interest ($\theta_i$) and the data ($d_i$), likewise transformed. The other two (unmatched and nonnormal sampling error models) offer a range of more realistic models for solving SAE problems with count data. Inference for these more complex models, which is daunting if not infeasible under classical approaches, is here enabled by Markov chain Monte Carlo (MCMC) estimation methods, the most popular computing tools in Bayesian practice. Indeed, Bayesian methods now enjoy broad scientific application mainly thanks to the flexibility inherent in HB modelling, which in turn is made computationally tractable by MCMC tools. In Section 5.2, another advantage of the HB way of thinking will become apparent: its ability to capture uncertainty and to incorporate its multiple sources into a unified analysis. More specifically, we will mention a natural extension of the several specifications introduced here, that is letting the sampling variances be stochastically determined rather than fixed at design estimates, as is standard.

5.1. Alternative hierarchical Bayesian specifications

Matched models
Once a suitable link function $g(\cdot)$ has been chosen, the direct estimates $d_i$ in the sampling model are transformed accordingly, $g(d_i) = g(\theta_i) + e_i$ (again with normal errors $e_i$). That allows the two equations to be combined into a single linear mixed model, $g(d_i) = x_i^T \beta + \nu_i + e_i$. Small area estimates for the $\theta_i$ are then obtained by inverting $g(\cdot)$. When the quantity to be estimated is a count, a log-linear link is customarily adopted, so that the sampling and linking models in (1) are replaced by

\[
\begin{aligned}
\log(d_i) \mid \theta_i, \phi_i &\overset{ind}{\sim} \mathrm{Normal}(\log(\theta_i), \phi_i) \\
\log(\theta_i) \mid \beta, \sigma_\nu^2 &\overset{ind}{\sim} \mathrm{Normal}(x_i^T \beta, \sigma_\nu^2).
\end{aligned}
\tag{2}
\]
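A small sketch can show what fitting a matched log-scale model requires in practice: the direct estimates must be log-transformed and their sampling variances carried over to the log scale. The sketch uses the first-order Taylor (delta method) approximation with $g = \log$, so that the log-scale variance is roughly the design variance divided by the squared total. Evaluating the derivative at the direct estimate rather than at the unknown true total is an assumption of this sketch, and the function name is hypothetical. Areas whose direct estimate is zero cannot be transformed and are flagged as unusable.

```python
import numpy as np

def log_scale_inputs(d, psi):
    """Log-transform direct estimates and approximate their log-scale variances.

    d   : direct estimates of area counts (non-negative)
    psi : design variances of d, treated as known
    Returns (log_d, phi, usable); entries for areas with d == 0 are NaN.
    """
    d = np.asarray(d, dtype=float)
    psi = np.asarray(psi, dtype=float)
    usable = d > 0                       # log(d) is undefined when d == 0
    log_d = np.full(d.shape, np.nan)
    phi = np.full(d.shape, np.nan)
    log_d[usable] = np.log(d[usable])
    # Delta method with g = log: g'(theta)^2 * psi, with theta replaced by d
    phi[usable] = psi[usable] / d[usable] ** 2
    return log_d, phi, usable
```

With `d = [100, 0, 50]` and `psi = [400, 0, 100]`, the first and third areas both get a log-scale variance of 0.04 (a coefficient of variation of 20%), while the second area drops out of the matched model altogether.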

Specification (2) preserves computational tractability (it is still a Normal-Normal model), but it has a number of possible faults. You and Rao (2002) pointed out that the customary hypotheses on the sampling errors $e_i = g(d_i) - g(\theta_i)$ may be quite questionable when $g$ is nonlinear and the area sample size is small. In particular, they refer to the unbiasedness assumption, $E(e_i \mid \theta_i) = 0$, and to the Taylor approximation ordinarily used for the variance, i.e. $\phi_i = \mathrm{var}(e_i \mid \theta_i) \approx \{g'(\theta_i)\}^2 \psi_i$, with $\psi_i = \mathrm{var}(\varepsilon_i \mid \theta_i)$ from (1). This approximation is widely used because it allows the design variance $v_p(d_i)$, which is taken as known, to be imputed to the sampling variance, $\psi_i := v_p(d_i)$. Lastly, with specific reference to (2), survey information may be partly wasted, in that the transformed direct estimates $\log(d_i)$ are not defined when $d_i = 0$. Thus, missing data originate both from areas with null direct estimates (which may not be so rare when the area sample size is small) and, as usual, from non-sampled small areas.

Unmatched models. In light of the foregoing, You and Rao advise leaving the sampling model $d_i = \theta_i + \varepsilon_i$ unchanged, so that design-unbiasedness is preserved (and the sampling variance $\psi_i$, fixed at the design variance $v_p(d_i)$, is conveniently treated as known). Thereby, for count data, they propose coupling the sampling model of (1), $d_i \mid \theta_i, \psi_i \sim \mathrm{Normal}(\theta_i, \psi_i)$, with the linking model of (2), $\log(\theta_i) \mid \beta, \sigma_v^2 \sim \mathrm{Normal}(x_i^T\beta, \sigma_v^2)$. Clearly, the two model stages cannot be combined into a single linear mixed model. As standard estimation methods cannot be applied, You and Rao adopt a full HB approach. On the other hand, Trevisani and Torelli (2004, 2006) propose an unmatched version similar to the You and Rao structure, though derived by building stage-wise an HB model suited to SAE with count variates,

$$
\begin{aligned}
d_i \mid \theta_i, \psi_i &\sim \mathrm{Normal}(\theta_i, \psi_i) \\
\theta_i \mid \mu_i &\sim \mathrm{Poisson}(\mu_i) \\
\log(\mu_i) \mid \beta, \sigma_v^2 &\sim \mathrm{Normal}(\log(S_i) + x_i^T\beta, \sigma_v^2).
\end{aligned}
\tag{3}
$$

This choice was determined by $\theta_i$ being of count type and, because of that, was borrowed from the extensive literature on disease mapping. In that area, disease counts are modeled as Poisson variates with mean $\mu_i = \rho_i E_i$, where $\rho_i$ is the relative risk and $E_i$ the expected count in area $i$. A regression equation is then usually set on a logarithmic scale, $\log(\rho_i) = x_i^T\beta + v_i$, to accommodate a linear predictor $x_i^T\beta$ and any random effect $v_i$ (which is customarily assumed to be normally distributed). When relevant, $E_i$ may be taken as random (Wakefield and Best, 1999), but it is usually taken as known (thus, treated as an offset in the regression equation). In the SAE context, $E_i$ can be set to the known synthetic estimate $S_i$, and a disease-mapping-like model is then fitted to the counts $\theta_i$.

We recall that a synthetic estimate is an indirect estimate derived for small area $i$ by using a reliable direct estimate for a large area or a survey planned domain (such as the Region in our field of investigation). To some readers the specification for $d_i$ might appear inconsistent with that for $\theta_i$: the $d_i$s are generated by a continuous distribution over the real line, while the $\theta_i$s are drawn from a discrete distribution over the non-negative integers. Yet, it is the sampling model for the $d_i$s that is really inconsistent with the count type of the variable of interest $\theta_i$ (indeed, $d_i$ might be non-integer, since it derives from an estimation process not necessarily constrained to produce integer values, though it certainly cannot be negative). Nonetheless, in (3), we let the sampling model be specified as is standard in the SAE literature, while we originally assumed $\theta_i$ to be generated by a Poisson model. It is superfluous to remark that there is no inconsistency in restricting the parameter space of a Normal distribution mean ($\theta_i$) to the integers only. Moreover, there is no need to discretize $d_i$: $\theta_i$ naturally arises as an integer since it is generated as a Poisson variate. (Incidentally, the Bugs software which we usually use for computations allows specifying a Poisson prior for any continuous quantity.) Last but not least, a feature for which we turned to a Poisson model was its variance-proportional-to-mean property; though the nonnegativeness and discreteness of its sample space are undoubted advantages, they are not so urgent as to require the replacement of the standardly assumed Normal model.

Nonnormal sampling error models. The characteristic of interest is a count, thus a canonical Poisson model is set from the very first stage. A standard generalized linear mixed model for count variates is the Poisson-logNormal specification; a form suitable for SAE problems may be the following,

$$
\begin{aligned}
d_i \mid \lambda_i &\sim \mathrm{Poisson}(\lambda_i) \\
\log(\lambda_i) \mid \theta_i, \phi_i &\sim \mathrm{Normal}(\log(\theta_i), \phi_i) \\
\log(\theta_i) \mid \beta, \sigma_v^2 &\sim \mathrm{Normal}(\log(S_i) + x_i^T\beta, \sigma_v^2).
\end{aligned}
\tag{4}
$$
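The two concerns raised about log-transformed direct estimates, namely the Taylor approximation of the variance and the loss of areas with a null direct estimate, can be illustrated numerically. The following is only a minimal sketch under assumptions that are not part of the models above: a single area of size `N` with `theta` units having the characteristic, a simple random sample without replacement of size `n`, and the expanded total `d = (N / n) * y` as direct estimate. All function names and numerical settings are hypothetical.

```python
import numpy as np

def design_variance(N, theta, n):
    """Exact SRSWOR variance of the expanded total d = (N / n) * y (psi_i in the text)."""
    p = theta / N
    return N**2 * (1 - n / N) * (N / (N - 1)) * p * (1 - p) / n

def log_scale_check(N, theta, n, reps=200_000, seed=0):
    """Compare the simulated variance of log(d) with the Taylor value psi / theta^2."""
    rng = np.random.default_rng(seed)
    # y = number of sampled units with the characteristic (e.g. unemployed persons)
    y = rng.hypergeometric(theta, N - theta, n, size=reps)
    d = (N / n) * y
    positive = d > 0                      # log(d) is undefined where d == 0
    log_d = np.log(d[positive])
    # Taylor approximation {g'(theta)}^2 * psi with g = log, so g'(theta) = 1 / theta
    taylor_var = design_variance(N, theta, n) / theta**2
    return {
        "share_zero": 1.0 - positive.mean(),
        "bias_log": log_d.mean() - np.log(theta),
        "sim_var_log": log_d.var(),
        "taylor_var_log": taylor_var,
    }

if __name__ == "__main__":
    for n in (40, 2000):
        print(n, log_scale_check(N=20_000, theta=1_000, n=n))
```

Under these made-up settings, the small-sample case produces a visible share of null direct estimates and a simulated variance of `log(d)` that departs from the Taylor value, whereas the two agree closely when the area sample is large.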

The log-Normal stage here depends on two sources of random variability: the sampling error, $e_i$, and the random effect, $v_i$ (second and third rows of (4), respectively). Again, the sampling error variance $\phi_i$ is set according to the Taylor approximation written above. Precisely to remedy possible failures implied by the Taylor approximation, a Gamma Poisson-logNormal model,

$$
\begin{aligned}
d_i \mid \lambda_i &\sim \mathrm{Poisson}(\lambda_i) \\
\lambda_i \mid \theta_i, \delta_i &\sim \mathrm{Gamma}(\delta_i, \delta_i/\theta_i) \\
\log(\theta_i) \mid \beta, \sigma_v^2 &\sim \mathrm{Normal}(\log(S_i) + x_i^T\beta, \sigma_v^2),
\end{aligned}
\tag{5}
$$

is alternatively proposed (Trevisani and Torelli, 2007). The mean/variance $\lambda_i$ of $d_i$ is given a Gamma distribution (second row) which is set (by conveniently fixing the hyperparameter $\delta_i$) so that design unbiasedness ($E(d_i \mid \theta_i) = \theta_i$) and design variance imputation ($\mathrm{var}(d_i \mid \theta_i) = v_p(d_i)$) hold for the sampling error. An ordinary log-Normal model follows at the linking stage.

The performance of the above alternative HB specifications has been assessed through a range of simulation studies. Simulated data were generated by assuming the population characteristics of interest as well as the sampling survey design as known. In one set of experiments, the actual LLM numbers of employed/unemployed from census data (of Veneto in 1991) were used; in others, population characteristics were varied (by changing the type of symmetry of their distribution). Moreover, LLM survey sample sizes were either kept fixed at the actual (1999 first quarter) LFS values, thus keeping the same non-sampled small areas (a source of missing data $d_i$), or varied in different ways. The sampling design has been kept quite simple across all studies (we adopted a simple random sampling without replacement scheme); moreover, synthetic estimates are the sole source of auxiliary information incorporated into the models. A more detailed description of the simulation analysis can be found in Trevisani and Torelli (2007). HB small area estimates have been examined according to several criteria, such as bias, accuracy and efficiency, as well as reliability in predicting the realized (actual) finite population data of interest, in addition to some standard criteria for model selection in Bayesian analyses. Across all simulation studies, the standard Fay-Herriot or matched specifications (such as (1) and (2)) show persistent model failures (see Tables 2 and A2-A4, rows labelled FH(2)/(3), of the above cited work). In general, both the unmatched and the Gamma Poisson-logNormal models achieve the highest performance scores.

5.2. Model-based sampling variance estimates

As already mentioned, a natural extension of a model-based approach to SAE is to let sampling variances be stochastic, whereas, in the foregoing discussion, they have been treated as constants and fixed at given estimates (typically at values of the sampling design variance $v_p(d_i)$ or a suitable function of it). Two options are conceivable for taking into account the uncertainty inherent in estimating sampling variances. The first one, which we refer to as model-based sampling variance functions, is a straightforward strategy under stronger information conditions. Suppose that the design variance is given by a certain function of the unknown quantity of interest, that is $v_p(d_i) = f(\theta_i)$, with $f$ known from the knowledge of sampling design characteristics. The common practice for fixing sampling variance values in models ($\psi_i$, or $\phi_i$ when a logarithmic transformation takes place, or $\delta_i$ in (5)) is to use an estimate of the design variance obtained by replacing the unknown $\theta_i$ with the direct estimate $d_i$, that is $\hat{v}_p(d_i) = f(d_i)$. From a modeling perspective, a strategy may then consist in (stopping a little earlier and) letting $\psi_i$ (or $\phi_i$ or $\delta_i$) be defined as $f(\theta_i)$ (or a function of it) while specifying the model. Hence, sampling variances (like the small area quantities $\theta_i$) will be obtained as model-based estimates (through $\theta_i$). Such a treatment has been shown to remarkably improve the performance of all the types of model listed in Section 5.1, compared to the respective fixed sampling variance counterparts (see Trevisani and Torelli (2007), Tables 5 and A2-A4). It is worth noting that this modeling option is straightforward and relatively easy to implement only within the Bayesian approach. However, this is only one option among others for stochastically modeling small area sampling variances. A more general strategy would consist in explicitly modeling the estimated sampling variances by enriching the HB setup with two additional components, that is,

$$
s_i \mid \psi_i \sim p(\psi_i), \qquad \psi_i \sim \pi_i
\tag{6}
$$

where $s_i$ denotes an estimate of the sampling error variance, whereas $p$ and $\pi_i$ broadly indicate a suitable likelihood for $s_i$ and a prior distribution for the hyperparameter $\psi_i$, respectively. For instance, You and Chapman (2006) add, to an otherwise standard FH model (1) for continuous data, the component $(n_i - 1) s_i \mid \psi_i \sim \psi_i \chi^2_{n_i - 1}$, wherein $s_i$ derives from an unbiased estimator of the sampling variance, independent of the direct estimator $d_i$, and $n_i$ is the small area sample size (e.g. $s_i$ is the sample variance and $d_i$ the sample mean of the $n_i$ area-specific observations). Lastly, they assume an inverse-Gamma prior distribution for the sampling variance $\psi_i$. The model-based sampling variance estimates approach, sketched by specification (6), is more general than the first one described, and preferable when one cannot exactly know the sampling variance function $f(\theta_i)$, or can only poorly approximate it. For future work, the HB stochastic approach to sampling variances can be extended to the general class of area level models encompassing the entire range of survey data, from continuous to categorical type. In particular, the unmatched and nonnormal sampling models presented above can be suitably elaborated to take into account the extra uncertainty associated with the estimation of sampling variances, in order to produce more reasonable and reliable small area estimates.
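The difference between fixing the sampling variance at the plug-in value $f(d_i)$ and letting it be $f(\theta_i)$ inside the model can be sketched in a few lines. The sketch below assumes, purely for illustration, that $f$ is the simple random sampling (without replacement) variance of an expanded total for a 0/1 characteristic in an area of size `N` with sample size `n`; the function names are hypothetical and the Normal sampling model is the one used in (1) and (3).

```python
import numpy as np

def f_srswor(theta, N, n):
    """Assumed variance function f: SRSWOR design variance of an expanded total."""
    p = np.asarray(theta, dtype=float) / N
    return N**2 * (1.0 - n / N) * (N / (N - 1.0)) * p * (1.0 - p) / n

def normal_loglik(d, theta, var):
    """Log density of d under Normal(theta, var)."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (d - theta) ** 2 / var)

def loglik_fixed(d, theta, N, n):
    """Common practice: psi_i fixed at the plug-in value f(d_i)."""
    return normal_loglik(d, theta, f_srswor(d, N, n))

def loglik_model_based(d, theta, N, n):
    """Model-based variance function: psi_i = f(theta_i) moves with theta_i."""
    return normal_loglik(d, theta, f_srswor(theta, N, n))

if __name__ == "__main__":
    N, n = 1000, 50
    grid = np.arange(1.0, 301.0)          # candidate values of theta_i
    print(grid[np.argmax(loglik_fixed(40.0, grid, N, n))])
    print(grid[np.argmax(loglik_model_based(40.0, grid, N, n))])
    # With a null direct estimate the plug-in variance f(0) is zero (degenerate),
    # while f(theta) stays positive for any theta > 0.
    print(f_srswor(0.0, N, n), loglik_model_based(0.0, 5.0, N, n))
```

In a full HB analysis the second likelihood would be combined with the linking model and sampled by MCMC; the grid evaluation here only shows how the variance term changes the shape of the likelihood in $\theta_i$.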

6. SAE and misalignment issues


6.1. A fully model-based approach to combining misaligned auxiliary data

A fundamental objective of SAE methods is the efficient use of any auxiliary data possibly related to the characteristic of interest and thereby a source of important information. However, the level of spatial resolution at which auxiliary data are available may differ from the one associated with the small areas of interest. This section, in particular, addresses the case of auxiliary variables observed on areal partitions nonnested with the set of small areas (Trevisani and Torelli, 2005). For instance, consider the specific problem of estimating LLM numbers of unemployed. The areas into which LLMs divide the region under study (small areas) do not correspond to the administrative districts (ADs), which may be a source of data (e.g. numbers of unemployed enrolled in Labor exchange offices) thought to be relevant for improving on direct estimates of LLM unemployed totals (Fig. 1, left and right panels respectively, where the Veneto Region is the whole domain of study). The statistical issue then consists in extending a traditional SAE model (e.g. the HB model (3), which is traditional relative to the development of the current section) in order to perform a regression analysis also on covariates whose areal data are misaligned with the set of small areas (misaligned areal regression problem).

Figure 1: (Left) Direct estimates of unemployed for Local Labour Market areas of Veneto Region (from the Italian Quarterly Labour Force Survey, 1999 first quarter); (right) number of unemployed enrolled in Labor exchange offices available at administrative district level (from Veneto Lavoro, stock at the end of 1998). (The figures are percentages over population.)

A number of techniques exist to obtain estimates for misaligned data at the required area level (an estimation process usually called modifiable areal unit problem in the statistical literature, or areal interpolation in geography): the pycnophylactic density surface estimation (Tobler, 1979), the areal weighting interpolation (Goodchild and Lam, 1980), the intelligent areal interpolation (Flowerdew and Green, 1992, 1994), as well as other methods implemented in Geographical Information System environments. However, none of these methods allows for a fully inferential approach to the problem of areal interpolation. We follow, instead, a fully inferential philosophy and adopt the so-called atom-based models borrowed from the most recent HB literature on spatial misalignment (Mugglin and Carlin, 1998; Mugglin et al., 1999, 2000; Zhu and Carlin, 2000; Banerjee et al., 2004). The rationale underlying atom-based models is to realign the misaligned data onto a common intersection-partition, whose areal unit is the atom, then to set the regression model at the atom level and, finally, to build the estimates of interest by aggregation over atoms. With reference to our motivating application, atom areas (Fig. 2, left) arise from crossing any LLM area with any district unit area. Moreover, each atom consists of a group of municipalities, since municipalities are nested within both partitions, the one made up of small areas and the one made up of administrative districts.

Figure 2: (Left) Regional partition into atoms; (right) number of manufacturing industry units available at municipality level (1991 census data, figures are percentages over population).

Thus, the skeleton of an atom-based model, as we framed it in the context of SAE, is sketched below (Trevisani and Gelfand, 2008). At atom level, the small area quantity and the nonnested auxiliary variable are modeled as

$$
\begin{aligned}
\theta_k \mid \eta_k &\sim \mathrm{Poisson}(S_k e^{\eta_k}), & X_k \mid \zeta_k &\sim \mathrm{Poisson}(N_k e^{\zeta_k}) && (7) \\
\eta_k &= \beta_0 + \beta_1 X_k + \beta_2 z_k + v_{i(\ni k)}, & \zeta_k &= \alpha_0 + \alpha_1 w_k + u_{j(\ni k)} && (8)
\end{aligned}
$$

where $\theta_k$, $X_k$ and $(z_k, w_k)$ are, respectively, the latent (or unknown) total of interest, the latent auxiliary variable and further known covariates on atom $k$; $v_i$ and $u_j$ are random effects inherited by atom $k$ from small area $i$ and auxiliary source area $j$, respectively (the areas which atom $k$ belongs to, indicated in (8) by the $\ni k$ notation); $S_k$ and $N_k$ are the known synthetic estimate and population count for atom $k$, respectively; the $\beta$s and $\alpha$s are fixed (unknown) regression coefficients.
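The construction of atoms and the atom-level stage (7)-(8) can be sketched as follows. This is only an illustrative forward simulation, not the fitting procedure: the municipality lookup, the area labels and every numerical value are made up, and the covariate is used on its raw scale with a small coefficient simply to keep the toy example numerically stable.

```python
from collections import defaultdict
import numpy as np

# Hypothetical lookup: municipality -> (small area i, auxiliary source area j).
MUNICIPALITIES = {
    "m1": ("LLM_A", "AD_1"),
    "m2": ("LLM_A", "AD_1"),
    "m3": ("LLM_A", "AD_2"),
    "m4": ("LLM_B", "AD_2"),
    "m5": ("LLM_B", "AD_2"),
    "m6": ("LLM_B", "AD_3"),
}

def build_atoms(lookup):
    """Each non-empty intersection (i, j) of the two partitions is an atom."""
    atoms = defaultdict(list)
    for municipality, pair in lookup.items():
        atoms[pair].append(municipality)
    return dict(atoms)

def simulate_atom_stage(atoms, S, N, z, w, beta, alpha, v, u, rng):
    """Forward draw from the atom-level stage: X-model first, then theta-model."""
    X, theta = {}, {}
    for k in atoms:
        area, district = k
        # X-model: log-relative risk with the effect inherited from district j
        zeta_k = alpha[0] + alpha[1] * w[k] + u[district]
        X[k] = int(rng.poisson(N[k] * np.exp(zeta_k)))
        # theta-model: uses the latent X_k and the effect inherited from area i
        eta_k = beta[0] + beta[1] * X[k] + beta[2] * z[k] + v[area]
        theta[k] = int(rng.poisson(S[k] * np.exp(eta_k)))
    return X, theta

def demo(seed=0):
    rng = np.random.default_rng(seed)
    atoms = build_atoms(MUNICIPALITIES)
    keys = list(atoms)
    S = {k: 30.0 for k in keys}      # synthetic estimates (made up)
    N = {k: 500.0 for k in keys}     # population counts (made up)
    z = {k: 0.1 for k in keys}       # atom-level covariates (made up)
    w = {k: 0.2 for k in keys}
    v = {"LLM_A": 0.05, "LLM_B": -0.05}
    u = {"AD_1": 0.0, "AD_2": 0.1, "AD_3": -0.1}
    X, theta = simulate_atom_stage(atoms, S, N, z, w, beta=(-0.2, 0.005, 0.3),
                                   alpha=(-3.0, 0.5), v=v, u=u, rng=rng)
    return atoms, X, theta

if __name__ == "__main__":
    print(demo())
```

Keying atoms by the pair (small area, district) makes the inheritance of the two random effects explicit: every atom looks up one effect from each of the two misaligned partitions.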

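Then, on the actual regional partitions, the $\theta$-model and the $X$-model consist in

$$
d_i \mid \theta_i \sim \mathrm{Normal}(\theta_i, \psi_i)
\tag{9}
$$

$$
\theta_i = \sum_{k:\, k \in i} \theta_k, \qquad X_j = \sum_{k:\, k \in j} X_k
\tag{10}
$$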

which amount to setting sum-constraints (10) on each set of atoms making up either a small area or an auxiliary source area, besides suitably assuming a sampling model (9) for the direct estimates $d_i$.

Remark 1. We consider the possibly important covariate $X$ to be a count variable (see (7)), as in the case example (Fig. 1, right). In any case, in a very large class of applications, data for areal units are likely to be derived from some discrete process (there are plentiful socioeconomic examples). Of course, depending on the type of variables at hand, different distributional assumptions have to be made about the process that generated them.

Remark 2. In principle, the linking model for $\theta$ might be set at small area level. In such a case, we should aggregate the atom values $X_k$ so as to reconstruct the small area value $X_i$ (i.e. $X_i = \sum_{k:\, k \in i} X_k$), which is the one to be entered into the regression equation set for $\theta_i$ (or the log-relative risk $\eta_i$, as parameterized in (8)). However, a regression analysis is definitely more conveniently carried out at atom or finer level whenever further covariates are available on regional partitions finer than the set of small areas. In (8), $z_k$ as well as $w_k$ denote known values of auxiliary variables whose source is nested within atoms. For instance, in our application, both $Z$ and $W$ are variables observed at municipality level, thus nested within any other partition level considered so far. The number of manufacturing industry units (Fig. 2, right), available from census data for each municipality, serves as an example of a covariate possibly related both to $\theta$ and $X$ (compare to Fig. 1).

Remark 3. Linking model (8) is defined on the log-linear scale; nonetheless other functional forms, such as the identity-linear or mixed-type ones, can be considered (see Best et al. (2000)).

Remark 4. In the regression analysis for $\theta$ or $X$, the random effects, generically denoted by $v_i$ and $u_j$ in (8), can be set either as spatially structured (e.g. by specifying a Gaussian conditionally autoregressive (CAR) prior, which assumes a neighbour-based spatial correlation between areas; for a comprehensive account of neighbourhood definitions see Best et al. (1999)) or simply as heterogeneous effects (by customarily assuming an exchangeable Gaussian prior as in Section 5.1); otherwise, a convolution form which combines both types of effects constitutes a still more flexible modeling option (Besag et al., 1991).

6.2. SAE via spatial misalignment models

The previous section introduced a class of misaligned data models framed to cope with a frequently occurring SAE issue, namely, integrating into the SAE model any source of auxiliary information even if available on an areal partition misaligned with the set of small areas. But such a class may be further developed to produce better small area estimates, (i) by combining sources of auxiliary information whatever the associated level of spatial resolution (misaligned regression problem); (ii) by integrating, within a fully HB inferential approach, large area estimates as well (misaligned areal interpolation problem). By hitting target (i), any auxiliary information, regardless of the type of misalignment (either point or areal data of any shape or size), can in principle be integrated. As regards (ii), an improvement in small area estimates is reasonably expected, given that direct estimates evaluated for survey planned domains are generally reliable.

Towards the first end, the formerly introduced atom-based model is given a Gaussian process version (Trevisani and Gelfand, 2006). Consider the basis of the former HB model, that is, the atom level stage (7)-(8). As in (7), the latent counts of both the quantity of interest, $\theta_k$, and the misaligned covariate, $X_k$, are again modeled as Poisson variates with mean arising as the product of population size (or a function of this) and incidence rate. But, differently from (8), incidence is now a (function of a) Gaussian process modeling the spatial point pattern (one associated with $\theta$, another with $X$) over the entire region. Atom counts are then obtained by integrating the point process over atoms. More formally, the mean of the latent counts arises as

$$
S_k e^{\eta_k} = \int_{A_k} S(s)\, e^{\eta(s)}\, ds, \qquad N_k e^{\zeta_k} = \int_{A_k} N(s)\, e^{\zeta(s)}\, ds
\tag{11}
$$

with

$$
S(s) \approx \frac{S_k}{|A_k|}, \qquad N(s) \approx \frac{N_k}{|A_k|}
\tag{12}
$$

and

$$
\eta(s) = \beta_0 + \beta_1 X_k + \beta_2 z_k + v(s), \qquad \zeta(s) = \alpha_0 + \alpha_1 w_k + u(s)
\tag{13}
$$

where $s$ denotes a spatial point and $A_k$ atom $k$ (with area $|A_k|$). Moreover, we assume atoms to be small enough to approximate $S(s)$ and $N(s)$ by a constant function over each atom, as in (12). Finally, the random effects $v(s)$ and $u(s)$ are given Gaussian process priors, which in principle could be made dependent using coregionalization (Banerjee et al., 2004). This version of misaligned data models fits an incidence surface, which is a more flexible model than a step distribution over the region (as in the discrete version (8) with CAR and/or exchangeable priors for random effects), allowing, in principle, for local adjustment at every location (hence achieving target (i) stated above). Moreover, it models at the highest spatial resolution (at point location), thus avoiding the choice of areal units (and instead cumulating to these). Finally, as far as interpretation is concerned, it is directly explained as an intensity surface for the spatial point pattern of cases of interest (e.g. cases of unemployment, as in our leading example).

The second objective is straightforwardly achieved by specifying an additional stage for large area estimates, that is, by assuming a suitable sampling model $d_p \mid \theta_p \sim \mathrm{Normal}(\theta_p, \psi_p)$, with the large area quantities subject to a sum-constraint $\theta_p = \sum_{k:\, k \in p} \theta_k$ as in (10). Symbols $d_p$ and $\psi_p$ have the customary interpretation of direct estimates and sampling error variances, respectively, for survey planned domains (e.g. for provinces in our application).

We conclude this section by pointing out some relevant aspects of novelty of the proposed class of misaligned data models (in both versions) with respect to a traditional model for SAE. The proposed HB framework builds small area estimates ($\theta_i$) by aggregation (over the $\theta_k$s), hence in a way similar to the one in which survey estimates are generally obtained. However, it integrates any source of potentially important information through a full probability model, thus formally combining the multiple sources of variation within a global analysis. Sources of information are those traditionally incorporated into an SAE model, such as small area direct estimates ($d_i$), auxiliary variables ($z_k$) and (possibly) synthetic estimates ($S_k$). In addition, the atom-based model proposed here allows misaligned supplementary variables ($X_k$) and large area direct estimates ($d_p$) to be incorporated as well, properly accounting for the uncertainty related to this integration. Whilst traditional model-based estimators do not generally benchmark to reliable direct survey estimates for large areas (though there have been attempts to correct for this, see e.g. Ugarte et al. (2008) or Lu and Larsen (2007) among the most recent references), benchmarking is automatically satisfied in the proposed atom-based model (small area estimates sum up to large area estimates).
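The passage from the intensity surface in (11)-(13) to atom counts, and from atom counts to coherent small area and large area totals, can be sketched numerically. Everything below is an assumption made for illustration only: a unit-square region cut into four quadrant atoms, a regular grid used as quadrature points, an exponential covariance for the Gaussian process, made-up values of $S_k$ and of the fixed part of the predictor, and a toy layout in which each small area lies within one province.

```python
import numpy as np

def exponential_cov(points, sigma2=0.25, rho=0.3):
    """Exponential covariance between grid points (one possible GP kernel)."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return sigma2 * np.exp(-dist / rho)

def atom_poisson_means(atom_of_point, S, fixed, v):
    """Quadrature for (11) under (12): with S(s) = S_k / |A_k| on a regular grid,
    the integral reduces to S_k * exp(fixed_k) * mean of exp(v(s)) over atom k."""
    S = np.asarray(S, dtype=float)
    means = np.empty(S.size)
    for k in range(S.size):
        in_atom = atom_of_point == k
        means[k] = S[k] * np.exp(fixed[k]) * np.exp(v[in_atom]).mean()
    return means

def aggregate(values, group_of_unit, n_groups):
    """Sum-constraint: totals of a coarser partition are sums over its units."""
    return np.bincount(group_of_unit, weights=values, minlength=n_groups)

def demo(seed=0, m=12):
    rng = np.random.default_rng(seed)
    centers = (np.arange(m) + 0.5) / m
    xx, yy = np.meshgrid(centers, centers)
    points = np.column_stack([xx.ravel(), yy.ravel()])
    # Four quadrant atoms, labelled 0..3
    atom_of_point = (2 * (points[:, 1] >= 0.5).astype(int)
                     + (points[:, 0] >= 0.5).astype(int))
    # One draw of the spatial effect v(s) at the grid points
    K = exponential_cov(points) + 1e-8 * np.eye(len(points))
    v = np.linalg.cholesky(K) @ rng.standard_normal(len(points))
    S = np.array([120.0, 80.0, 60.0, 140.0])      # made-up synthetic estimates
    fixed = np.array([-0.1, 0.0, 0.2, -0.3])      # made-up fixed part of eta
    theta_atom = rng.poisson(atom_poisson_means(atom_of_point, S, fixed, v))
    area_of_atom = np.array([0, 0, 1, 2])         # atoms -> small areas
    province_of_atom = np.array([0, 0, 0, 1])     # atoms -> provinces
    province_of_area = np.array([0, 0, 1])        # small areas -> provinces
    theta_area = aggregate(theta_atom, area_of_atom, 3)
    theta_prov = aggregate(theta_atom, province_of_atom, 2)
    return theta_area, theta_prov, aggregate(theta_area, province_of_area, 2)

if __name__ == "__main__":
    print(demo())
```

Because both the small area and the province totals are sums of the same atom-level values, the small area totals add up to the province totals by construction, which is the sense in which benchmarking comes for free in this toy layout.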

7. Concluding remarks
SAE methods are by now largely recognized as essential for the production of official statistics. They are, in many cases, the most convincing solution to the growing demand for reliable information for fine geographical partitions, to support decisions of local government, allocation of resources and funds to regions or provinces, etc. More specifically, the production of labour statistics at local level can benefit greatly from the application of SAE techniques. Nonetheless, applications of SAE for the actual production of official statistics still seem limited, or at least well below their potential. A possible explanation is that there is still resistance, in national statistical agencies, to fully accepting model-based inference when estimating population parameters. The design-based approach is considered by far more convincing. SAE is one of the areas (along with the use of models for dealing with non-sampling errors) pushing more practitioners to consider statistical models as a viable solution. In fact, they can give reasonable answers also in cases where the sample is too small or there is no sample at all. Using Bayesian models is even harder to accept in official statistics production, although applications of Bayesian models in this area are becoming more and more widespread. Again, as far as SAE is concerned, the use of the Bayesian approach can lead to more convincing solutions (notable examples are the problems discussed in Sections 5 and 6). SAE models can greatly help to widen the toolbox of national statistical agencies for the effective production of statistical information. The Italian LFS design is a very complex one and there is room for some interesting new advances. It is worth mentioning two interesting applications that have already been envisaged and whose development is left to future research.
The first idea has been put forward by Ferrante and Pacei (2004) and aims to take advantage of the fact that the Italian LFS actually has a 2-2-2 rotation sampling scheme: households are interviewed in two consecutive quarters and, after a two-quarter interval, they are re-interviewed in the corresponding two quarters of the following year. So each unit participates in the survey on four occasions. It is likely that information from the same units on other survey occasions can also be exploited for building more effective models for SAE. The second idea relies upon the new format of the survey, where interviews are spread over the 13 weeks of each quarter. This means that, in principle, a small sample is available within a shorter time reference. Note that Eurostat currently releases monthly unemployment estimates based on the Labour Force Survey (LFS) for many countries, while Italy is not producing monthly estimates. Calibration estimation can be a very effective strategy for the national level, while SAE techniques can be adopted to obtain monthly estimates at a given territorial level (i.e., for a Region or a Province).
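The 2-2-2 rotation pattern described above can be made concrete with a small sketch that lists the interview quarters of a household and the share of the sample common to two survey occasions. It assumes, as a simplification not stated in the text, that one rotation group of equal size enters the sample every quarter; quarters are indexed by consecutive integers and the function names are hypothetical.

```python
# Lags (in quarters) at which a household is interviewed after entering the sample:
# two quarters in, two quarters out, two quarters in again one year later.
INTERVIEW_LAGS = (0, 1, 4, 5)

def interview_quarters(entry_quarter):
    """Quarters in which a household entering at `entry_quarter` is interviewed."""
    return [entry_quarter + lag for lag in INTERVIEW_LAGS]

def groups_in_sample(quarter):
    """Entry quarters of the rotation groups interviewed in `quarter`."""
    return {quarter - lag for lag in INTERVIEW_LAGS}

def overlap(quarter_a, quarter_b):
    """Share of the rotation groups in `quarter_a` that are also in `quarter_b`."""
    a, b = groups_in_sample(quarter_a), groups_in_sample(quarter_b)
    return len(a & b) / len(a)

if __name__ == "__main__":
    print(interview_quarters(10))
    for lag in range(1, 6):
        print(lag, overlap(10, 10 + lag))
```

Under the equal-size assumption, half of the rotation groups are shared between consecutive quarters and between the same quarter of consecutive years, which is the repeated information that a longitudinal SAE model could exploit.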

References
Arora V. and Lahiri P. (1997) On the superiority of the Bayesian method over the BLUP in small area estimation problems, Statistica Sinica, 7, 1053-1063.
Banerjee S., Carlin B.P. and Gelfand A.E. (2004) Hierarchical Modeling and Analysis for Spatial Data, Chapman & Hall.
Bartoloni E. (2008) Small area estimation and the labour market in Lombardy's industrial districts: a methodological approach, Italian Journal of Regional Science, 2.
Battese G.E., Harter R.M. and Fuller W.A. (1988) An error-components model for prediction of county crop areas using survey and satellite data, Journal of the American Statistical Association, 83.
Best N.G., Arnold R.A., Thomas A., Waller L.A. and Conlon E.M. (1999) Bayesian models for spatially correlated disease and exposure data, in: Bayesian Statistics 6, Bernardo J.M., Berger J.O., Dawid A.P. and Smith A.F.M., eds., Oxford University Press, Oxford, 107-132.
Best N.G., Ickstadt K. and Wolpert R.L. (2000) Spatial Poisson regression for health and exposure data measured at disparate resolutions, J. Amer. Statist. Assoc., 95, 1076-1088.
Cruciani S., Faramondi A., Falorsi S., Di Consiglio L., Solari F. and Rizzo F.P. (2002) Metodologia utilizzata per le stime sull'occupazione residente e le persone in cerca di occupazione nei sistemi locali del lavoro per gli anni 1998-2000, ISTAT: Progetto interdipartimentale Informazione statistica territoriale e settoriale per le politiche strutturali 2001-2008, 1-27.
Datta G.S. and Ghosh M. (1991) Bayesian prediction in linear models: Application to small area estimation, Ann. Statist., 19, 1748-1770.
Datta G.S., Lahiri P., Maiti T. and Lu K.L. (1999) Hierarchical Bayes estimation of unemployment rates for the states of the U.S., J. Amer. Statist. Assoc., 94, 1074-1082.
De Vitiis C., Di Consiglio L. and Falorsi S. (2002) A comparison among two stage sampling plans based on different time stratifications for the redesign of the Italian labour force survey, in: Proceedings of The XLI Meeting of the Italian Statistical Society, Milan.
Di Consiglio L., Falorsi P., Falorsi S. and Russo A. (2003) Conditional and unconditional analysis of some small area estimators in complex sampling, Survey Methodology, 29, 4-12.
EURAREA (2004) Project reference volume: Enhancing small area estimation techniques to meet European needs, Technical report, Proceedings of the Survey Research Methods Section.
Falorsi P., Falorsi S. and Russo A. (1994) Empirical comparison of small area estimation methods for the Italian labour force survey, Survey Methodology, 20.
Falorsi P., Falorsi S. and Russo A. (1995) Small area estimation at provincial level in the Italian labour force survey, in: 1995 Annual Research Conference Proceedings, Bureau of the Census, Washington, D.C.
Falorsi P., Falorsi S. and Russo A. (1998) Small area estimation at provincial level in the Italian labour force survey, Journal of the Italian Statistical Society, 7.
Fay R.E. and Herriot R.A. (1979) Estimates of income for small places: an application of James-Stein procedures to census data, J. Amer. Statist. Assoc., 74, 269-277.
Ferrante M. and Pacei S. (2004) Small area estimation for longitudinal surveys, Statistical Methods and Applications, 13, 327-340.
Flowerdew R. and Green M. (1992) Developments in areal interpolation methods and GIS, Ann. Reg. Sci., 26, 67-78.
Flowerdew R. and Green M. (1994) Areal interpolation and types of data, in: Spatial Analysis and GIS, Taylor and Francis, London, 121-145.
Ghosh M., Natarajan K., Stroud T.W.F. and Carlin B.P. (1998) Generalized linear models for small-area estimation, J. Amer. Statist. Assoc., 93, 273-282.
Giommi A., Innocenti R. and Rocco E. (2008) The labour force survey in the municipality of Florence: Technical innovations and methods for small area estimation, Scienze Regionali, 2.
Goodchild M.F. and Lam N.S.N. (1980) Areal interpolation: a variant of the traditional spatial problem, Geo-Processing, 1, 297-312.
Lu L. and Larsen M.D. (2007) Small area estimation in a survey of high school students in Iowa, Technical report, Proceedings of the Survey Research Methods Section, American Statistical Association.
Menardi G., Monte A. and F P. (2004) Unemployment estimation for Friuli-Venezia Giulia provinces by time series small area models, in: Proceedings of The XLII Meeting of the Italian Statistical Society, Bari.
Mugglin A.S. and Carlin B.P. (1998) Hierarchical modeling in geographical information systems: Population interpolation over incompatible zones, J. of Agric., Biolog., and Environm. Statist., 3, 111-130.
Mugglin A.S., Carlin B.P. and Gelfand A. (2000) Fully model-based approaches for spatially misaligned data, J. Amer. Statist. Assoc., 95, 877-887.
Mugglin A.S., Carlin B.P., Zhu L. and Conlon E. (1999) Bayesian areal interpolation, estimation, and smoothing: an inferential approach for geographic information systems, Environment and Planning A, 31, 1337-1352.
Pfeffermann D. (2002) Small area estimation - new developments and directions, International Statistical Review, 70, 125-143.
Pfeffermann D., Feder M. and Signorelli D. (1998) Estimation of autocorrelations of survey errors with application to trend estimation in small areas, Journal of Business and Economic Statistics, 16, 339-348.
Pfeffermann D. and Tiller R. (2006) Small-area estimation with state-space models subject to benchmark constraints, Journal of the American Statistical Association, 101, 1387-1397.
Rao J.N.K. (2003) Small Area Estimation, Wiley, New York.
Tobler W.R. (1979) Smooth pycnophylactic interpolation for geographical regions, J. Amer. Statist. Assoc., 74, 519-530.
Trevisani M. and Gelfand A. (2006) A Gaussian process version of misaligned data models for small area estimation problems, Contributed paper, Valencia/ISBA 8th World Meeting on Bayesian Statistics.
Trevisani M. and Gelfand A. (2008) Spatial misalignment models for small area estimation problems, Technical report.
Trevisani M. and Torelli N. (2004) Small area estimation by hierarchical Bayesian models: some practical and theoretical issues, Atti della XLII Riunione Scientifica della Società Italiana di Statistica, 273-276.
Trevisani M. and Torelli N. (2005) Spatial misalignment modeling for small area estimation problems, Contributed paper, Bayesian Inference on Stochastic Processes, Fourth Workshop.
Trevisani M. and Torelli N. (2006) Comparing hierarchical Bayesian models for small area estimation, in: Metodi statistici per l'integrazione di basi di dati da fonti diverse, Franco Angeli, 17-36.
Trevisani M. and Torelli N. (2007) Hierarchical Bayesian models for small area estimation with count data, Working Paper 115, Università degli Studi di Trieste, Dipartimento di Scienze Economiche e Statistiche.
Ugarte M.D., Militino T.G. and Goicoa T. (2008) Benchmarked estimates in small areas using linear mixed models with restrictions, Test, 1-23, online first.
Wakefield J. and Best N. (1999) Accounting for inaccuracies in population counts and case registration in cancer mapping studies, J. Roy. Statist. Soc. A, 162, 363-382.
You Y. and Chapman B. (2006) Small area estimation using area level models and estimated sampling variances, Survey Methodology, 32, 97-103.
You Y. and Rao J.N.K. (2002) Small area estimation using unmatched sampling and linking models, Canadian Journal of Statistics, 30, 3-15.
You Y., Rao J.N.K. and Dick P. (2004) Benchmarking hierarchical Bayes small area estimators in the Canadian census undercoverage estimation, Statistics in Transition, 6, 631-640.
Zhu L. and Carlin B.P. (2000) Comparing hierarchical models for spatio-temporally misaligned data using the deviance information criterion, Statistics in Medicine, 19, 2265-2278.
