
Assessing Microdata Disclosure Risk Using the Poisson-Inverse Gaussian Distribution

Michael Carlson
Department of Statistics, Stockholm University
SE-106 91 Stockholm, Sweden
June 28, 2002

Abstract: An important measure of identification risk associated with the release of microdata or large complex tables is the number or proportion of population units that can be uniquely identified by some set of characterizing attributes. Various methods for estimating this quantity based on sample data have been proposed in the literature by means of superpopulation models. In the present paper the Poisson-inverse Gaussian (PiG) distribution is evaluated as a possible approach within this context. It has been noted that in disclosure applications the frequency distribution tends to have an inverse-J shape with a heavy upper tail, properties characteristic of the PiG distribution that make it a plausible candidate distribution. Furthermore, the distribution is expressed in closed form, which gives it an advantage over e.g. the Poisson-lognormal. Disclosure risk measures are discussed and derived under the proposed model, as are various likelihood estimation approaches.

Keywords: statistical disclosure; uniqueness; inverse Gaussian; Poisson; superpopulation.

1. Introduction
A considerable amount of research has been done in the area of statistical disclosure, and different approaches to defining and assessing disclosure risk are treated in depth by, among others, Dalenius (1977), Duncan and Pearson (1991), Frank (1979), Lambert (1993), Skinner et al. (1994) and Willenborg and de Waal (1996, 2000). A special case concerns the release of public-use microdata files and so-called identity disclosure, see Duncan and Lambert (1989). It is well known that there remains the possibility that statistical disclosure could occur even when such a file has been anonymized by the removal of direct identifiers, see e.g. Block and Olsson (1976) and Dalenius (1986). The main concern is to ensure that no

E-mail: Michael.Carlson@stat.su.se

record in a released microdata set can be reliably associated with an identifiable individual. For instance, a unique is defined as an entity that has a unique set of values on a set of characterizing attributes or key attributes. A unit that is unique in the population is referred to as a population unique whereas a unit that is unique in a sample is referred to as a sample unique. If a population unique is included in the sample it is necessarily also a sample unique, but the converse is not true. Obviously a population unique is subjected to a greater risk of being exposed relative to other, non-unique units if included in the released data. Furthermore, it could be argued that an intruder will be more inclined to focus upon those records that are sample unique since it is only these that can by definition be population uniques, see e.g. Elliot et al. (1998). Thus a possible indicator of disclosure risk is the number or proportion of population uniques included in the sample amongst the sample uniques. The objective is to estimate this proportion based on sample information, e.g. a data set considered for release.

Various methods for estimating this quantity based on sample data have been proposed in the literature by means of superpopulation models and especially compound Poisson models. Under a superpopulation model it is assumed that the population at hand, as defined by the frequency structure of the key attributes, has been generated by some appropriate distribution. The risk assessment, here in terms of uniqueness, is then reduced to a matter of parameter estimation. Bethlehem et al. (1990) were perhaps the first to adopt this approach and others include Chen and Keller-McNulty (1998), Hoshino (2001), Samuels (1998), Skinner and Holmes (1993, 1998), St-Cyr (1998) and Takemura (1999).

In the present paper we propose the Poisson-inverse Gaussian (PiG) distribution as a possible candidate. This distribution has appeared elsewhere in the literature but we are not aware of it having been applied to the disclosure problem before. It was introduced by Holla (1966) in studies of repeated accidents and recurrent disease symptoms and further developed by Sichel (1971) into a more general three-parameter family of distributions. The PiG model has been applied to a wide range of situations, see e.g. Sichel (1973, 1974, 1975, 1982a), Ord and Whitmore (1986) and Willmot (1987). Chen and Keller-McNulty (1998) noted that the frequency distribution in disclosure applications tends to have an inverse-J shape with a heavy upper tail. Since the PiG distribution is characterized by its positive skewness and heavy upper tail, it appears to be an appropriate distribution for modeling frequency counts in disclosure applications.

The present paper is a slightly shorter version of Carlson (2002) and the scope is limited to a review of the proposed model with some simple simulated examples to illustrate the method. In the following section the superpopulation model is specified and some basic notation is introduced. In section 3 the PiG distribution is reviewed and in section 4 its application to the problem of assessing disclosure risks, in terms of uniqueness, is discussed. Parameter estimation is described

in section 5, and section 6 provides some examples based on simulated data sets. Some concluding remarks and possible directions for future research are given in section 7.

2. Specification of the Superpopulation Model


Consider a finite population U of size N from which a simple random sample s \subset U of size n \le N is drawn. The sampling fraction is denoted by \pi_s = n/N. With each unit h \in U are associated the values of a number of discrete variables, X_1, ..., X_q with c_1, ..., c_q categories respectively. The cross-classification of these variables defines the discrete variable X with \prod c_i = C categories or cells, and for simplicity we let the cells of X be labeled as 1, 2, ..., C. Following e.g. Bethlehem et al. (1990), the X_i are termed key variables, X the key and the C different categories of X the key values. We denote by F_i the number of units in the population with key value i, i.e. the population frequency or size of cell i for i = 1, ..., C. The sample counterpart is denoted by f_i. Define T_j and its sample counterpart t_j as the number of cells of size j, i.e.

    T_j = \sum_{i=1}^{C} I_{F_i = j},   j = 0, 1, ..., N

and

    t_j = \sum_{i=1}^{C} I_{f_i = j},   j = 0, 1, ..., n,

respectively, where I(.) is the usual indicator function. The T_j and t_j are usually termed cell size indices or frequencies of frequencies. It is clear that

    \sum_{i=1}^{C} F_i = \sum_{j=1}^{N} j T_j = N,   \sum_{i=1}^{C} f_i = \sum_{j=1}^{n} j t_j = n

and

    \sum_{j=0}^{N} T_j = \sum_{j=0}^{n} t_j = C.
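As a concrete illustration of this notation (not part of the original exposition), a minimal Python sketch follows that computes the frequencies of frequencies t_j from a vector of per-cell sample counts f_i and checks the identities above; the function name and the toy counts are our own.

```python
from collections import Counter

def freq_of_freq(f):
    """f: sample cell counts f_i, one entry per cell (length C).
    Returns a dict t with t[j] = number of cells of size j."""
    return dict(Counter(f))

# Toy key with C = 6 cells and n = 10 sampled units.
f = [4, 3, 1, 1, 1, 0]
t = freq_of_freq(f)
assert sum(j * tj for j, tj in t.items()) == 10  # sum_j j*t_j = n
assert sum(t.values()) == 6                      # sum_j t_j = C
print(t)  # {4: 1, 3: 1, 1: 3, 0: 1}
```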

Of these quantities C, N and n are fixed in the design, the f_i and t_j are observed and the F_i and T_j are unknown. The goal is to model and estimate the frequency structure, i.e. the T_j and especially T_1, which is the number of unique individuals in the population, based on sample information. The key divides the population of size N into C cells and the frequency structure, the T_j, is a function of the actual F_i which therefore need to be estimated. However, it is not possible for a sample to carry all the information about the

structure of a population, and finite population theory will give only unreliable estimates, especially when the sampling fraction is small. See also the discussion by Willenborg and de Waal (2000, p. 55). A way out is to model the structure and view the population as the realization of a superpopulation model.

As a starting point we therefore assume that the cell frequencies are generated independently from Poisson distributions with individual rates \lambda_i, i = 1, ..., C. To simplify the model further we view the \lambda_i as independent realizations of a continuous random variable with a common distribution g(\lambda). The number of cells C is usually quite large and this assumption will significantly reduce the number of parameters that need to be estimated. The simplification seems reasonable also in view of the fact that we are not conditioning on any of the characterizing variables defining the key.

The specification of the mixing distribution g(\lambda) is the crucial step and several different suggestions have been studied. Bethlehem et al. (1990) proposed a gamma distribution, which implies that the marginal distribution of each F_i is the negative binomial. This model has however been noted to provide a poor fit to real-life data. Skinner and Holmes (1993, 1998) argue instead for the use of a lognormal distribution. Chen and Keller-McNulty (1998) considered a shifted negative binomial for the marginal distribution of the F_i and achieve better results compared to the Poisson-gamma. St-Cyr (1998) proposed a mixture of a Pareto and a truncated lognormal distribution based on his findings from studying the relationship between the conditional probability of population uniqueness and the sampling fraction. Hoshino and Takemura (1998) investigated the relationships between different models and provide interesting results.

It is also obvious in many situations that certain combinations of the key variables may be impossible, such as married 4-year-olds or male primiparas, i.e. so-called structural zeroes. Skinner and Holmes (1993) specified their superpopulation model to allow for individual rates to be zero with positive probability. Following their idea we therefore assume that the distribution of the \lambda_i is a mixture of a discrete probability mass \delta at zero and (1 - \delta) of a continuous distribution with probability density function (pdf) g(\lambda) on (0, \infty); i.e. we specify the marginal of the cell frequencies as

    Pr(F_i = j) = \delta I_{j=0} + (1 - \delta) P_j    (1)

where 0 \le \delta < 1 and

    P_j = \int_0^{\infty} \frac{\lambda^j e^{-\lambda}}{j!} g(\lambda) \, d\lambda.    (2)

Note that with the Poisson assumption the total of the cell counts \sum F_i is a random variable. Although it is true from the design that \sum F_i = N, it may suffice to check that E(\sum F_i) = N rather than conditioning on this fact, cf. Bethlehem et al. (1990). Denoting the expectation of the F_i by \mu (conditional on \lambda_i > 0), this requirement translates to

    C (1 - \delta) \mu = N,    (3)

since the sum is taken only over those cells where the probability of observing population units is strictly positive, i.e. cells that are not structural zero cells.
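To make the model concrete, here is a small simulation sketch in Python (our own illustration, not code from the paper). For definiteness the mixing density g is taken to be a gamma density, the choice of Bethlehem et al. (1990); any density on (0, \infty) could be substituted, and all parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
C, delta = 21_504, 0.10          # cells, probability of a structural zero
shape, scale = 0.05, 50.0        # gamma mixing density, mean mu = 2.5

is_zero = rng.random(C) < delta                   # structural-zero cells
lam = np.where(is_zero, 0.0, rng.gamma(shape, scale, size=C))
F = rng.poisson(lam)                              # population cell counts F_i

print(F.sum(), C * (1 - delta) * shape * scale)   # compare with C(1-delta)mu, eq. (3)
print((F == 1).sum())                             # T_1, the number of population uniques
```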

3. The Poisson-Inverse Gaussian Distribution


We assume that the individual rates follow an inverse Gaussian (iG) distribution and, using the same parameterization as Willmot (1987), the pdf is

    g(\lambda \mid \mu, \tau) = \frac{\mu}{(2\pi\tau\lambda^3)^{1/2}} \exp\left\{-\frac{(\lambda - \mu)^2}{2\tau\lambda}\right\},   \lambda > 0,    (4)

where \mu > 0, \tau > 0. The mean and variance of (4) are \mu and \mu\tau respectively. The inverse Gaussian distribution is the first passage time distribution of Brownian motion with positive drift. It may be derived through an inversion relationship associated with cumulant generating functions of the Gaussian and inverse Gaussian families, hence the name. Folks and Chhikara (1978) provide a review of the iG distribution with an extensive set of references. See also Johnson et al. (1994, chapter 15).

Integrating out \lambda from (2) with g(\lambda) replaced by (4) yields the probability mass function (pmf) of the Poisson-inverse Gaussian (PiG), i.e.

    P_j = \left(\frac{2\mu\eta}{\pi\tau}\right)^{1/2} e^{\mu/\tau} \, \frac{(\mu/\eta)^j}{j!} \, K_{j-1/2}(\mu\eta/\tau),   j = 0, 1, ...,    (5)

where \eta = \sqrt{1 + 2\tau} and K_\nu(z) denotes a modified Bessel function of the second kind (sometimes referred to as the third kind) of order \nu and with argument z, see e.g. Abramowitz and Stegun (1970, chapter 9). The mean and variance of the PiG distribution in (5) are \mu and \mu(1 + \tau) respectively. The distribution in (5) is a special case of a more general three-parameter PiG, commonly known as the Sichel distribution, which was introduced by Sichel (1971). A review of the Sichel distribution is given in Johnson et al. (1992, pp. 455-457). In the present paper we will however limit the scope to the two-parameter distribution in (5).

The expression in (5) is not always convenient due to the Bessel functions, but by making use of the properties of the Bessel function a more practical recurrence formula for calculating the probabilities is easily derived, see Carlson (2002) for details. The first two probabilities are given by

    P_0 = \exp\left\{\frac{\mu}{\tau}(1 - \eta)\right\}    (6)

and P_1 = (\mu/\eta) P_0, and the following probabilities by

    P_j = \frac{\tau(2j - 3)}{\eta^2 j} P_{j-1} + \frac{\mu^2}{\eta^2 j(j-1)} P_{j-2},   j = 2, 3, ...    (7)
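The recurrence translates directly into code. The following Python sketch (ours, with illustrative parameter values) evaluates P_0, ..., P_jmax via (6)-(7); the partial sum can be checked against 1.

```python
import math

def pig_pmf(mu, tau, jmax):
    """Return [P_0, ..., P_jmax] for the PiG(mu, tau) distribution,
    using the recurrence (6)-(7) with eta = sqrt(1 + 2*tau)."""
    eta = math.sqrt(1.0 + 2.0 * tau)
    P = [0.0] * (jmax + 1)
    P[0] = math.exp(mu * (1.0 - eta) / tau)          # eq. (6)
    if jmax >= 1:
        P[1] = (mu / eta) * P[0]
    for j in range(2, jmax + 1):                      # eq. (7)
        P[j] = (tau * (2 * j - 3) / (eta**2 * j)) * P[j - 1] \
             + (mu**2 / (eta**2 * j * (j - 1))) * P[j - 2]
    return P

P = pig_pmf(mu=2.0, tau=5.0, jmax=2000)
print(sum(P))       # close to 1 when jmax is large enough
print(P[0], P[1])   # the starting values (6)
```

The recurrence avoids evaluating Bessel functions altogether, which is the practical advantage noted above.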

Approximating simple random sampling by Bernoulli sampling with sampling probability \pi_s = n/N (cf. Särndal et al., 1992, chapter 3), it follows that

    f_i \mid \lambda_i \sim Po(\pi_s \lambda_i)   and   F_i - f_i \mid \lambda_i \sim Po((1 - \pi_s) \lambda_i).    (8)

The marginal pmf for the sample cell frequencies f_i then follows as

    Pr(f_i = j) = \delta I_{j=0} + (1 - \delta) p_j

where

    p_j = \int_0^{\infty} \frac{(\pi_s\lambda)^j e^{-\pi_s\lambda}}{j!} g(\lambda) \, d\lambda = \left(\frac{2\mu_s\eta_s}{\pi\tau_s}\right)^{1/2} e^{\mu_s/\tau_s} \, \frac{(\mu_s/\eta_s)^j}{j!} \, K_{j-1/2}(\mu_s\eta_s/\tau_s),    (9)

i.e. a PiG distribution where \mu_s = \pi_s\mu, \tau_s = \pi_s\tau and \eta_s = \sqrt{1 + 2\tau_s}. This provides an easy transformation when we have a simple random sample from the larger population: given the sample we estimate the parameters \mu_s and \tau_s and simply multiply by \pi_s^{-1}. See also section 2 of Sichel (1982a) and the discussion concerning sampling in Takemura (1999).

4. Risk Assessment
The outline of the disclosure problem considered here is the same as that of many authors, e.g. Bethlehem et al. (1990) and Skinner and Holmes (1998). Consider an intruder who attempts to disclose information about a set of identifiable units in the population, termed targets. The intruder is assumed to have prior information about the key values of the targets and attempts to establish a link between these and individual records in the released microdata file using the values of the key attributes. Assume that the intruder finds that a specific record r in the microdata file matches a target with respect to the key X. Now F_i is the number of units in the population with X = i and we let i(r) denote the value of X for record r. If F_{i(r)} were known, the intruder could infer that the probability of a correct link is F_{i(r)}^{-1}, and if F_{i(r)} = 1 the link is correct with absolute certainty. Usually the intruder will not know the true value of F_{i(r)} since the microdata set contains only a sample, but by introducing a superpopulation model he may attach a probability distribution Pr(F_i = j) to the cell frequencies. Furthermore, it could be argued that an intruder will be more inclined to focus upon those

records that are sample unique since it is only these that can by definition be population uniques, see e.g. Skinner et al. (1994). By equating uniqueness with disclosure risk, a simple measure of the risk for a given sample is the proportion of sample uniques that are also population uniques, i.e.

    R = \frac{\#(\text{records that are population unique and sample unique})}{\#(\text{records that are sample unique})}.

Under simple random sampling or Bernoulli sampling the expected number of population uniques to fall into the sample is \pi_s T_1 = n T_1 / N. Since T_1 is unknown, this proportion will have to be estimated. Under the model (1) the expected number of population uniques is E(T_1) = C Pr(F_i = 1) = C(1 - \delta) P_1. Thus an obvious risk measure, denoted by R_1, is defined as the proportion of the observed number of sample uniques expected to be population uniques, i.e.

    R_1 = \frac{E(T_1)/N}{t_1/n} = \frac{\pi_s C (1 - \delta) \mu}{t_1 \, \eta} \exp\left\{\frac{\mu}{\tau}(1 - \eta)\right\}.    (10)
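For illustration, a small Python sketch of (10) follows (our own, with made-up inputs loosely patterned on the example in section 6); the sample-level estimates are transformed back to the population level as in section 3 before P_1 is evaluated.

```python
import math

def risk_R1(mu_s, tau_s, delta, C, n, N, t1):
    """Point estimate of R_1 in (10) from sample-level parameter estimates."""
    pi_s = n / N
    mu, tau = mu_s / pi_s, tau_s / pi_s       # back-transformation of section 3
    eta = math.sqrt(1.0 + 2.0 * tau)
    P1 = (mu / eta) * math.exp(mu * (1.0 - eta) / tau)
    ET1 = C * (1.0 - delta) * P1              # expected number of population uniques
    return pi_s * ET1 / t1                    # expected proportion among sample uniques

print(risk_R1(mu_s=2.2, tau_s=33.0, delta=0.894,
              C=21_504, n=5_000, N=50_000, t1=382))
```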

Assuming Bernoulli sampling, an alternative risk measure R_2 follows naturally from (8) and is defined by

    R_2 = Pr(F_i = 1 \mid f_i = 1) = \int_0^{\infty} \exp\{-(1 - \pi_s)\lambda\} \, g(\lambda \mid f_i = 1) \, d\lambda,    (11)

where

    g(\lambda_i \mid f_i = 1) = \frac{\lambda_i \exp(-\pi_s\lambda_i) \, g(\lambda_i)}{\int_0^{\infty} \lambda \exp(-\pi_s\lambda) \, g(\lambda) \, d\lambda},

which simplifies to the risk measure

    R_2 = \frac{E(T_1)/N}{E(t_1)/n} = \frac{\eta_s}{\eta} \exp\left\{\frac{\mu}{\tau}(\eta_s - \eta)\right\},    (12)

i.e. the observed value of t_1 in (10) is replaced by its expectation under the model. This is the approach considered by Skinner and Holmes (1998). Note that R_2 \to 1 as \pi_s \to 1, which is not necessarily the case with R_1. The risk measures R_1 and R_2 coincide if and only if a perfect fit of the model to the observed number of sample uniques is obtained, i.e. E(t_1) = t_1. Either way, the disclosure risk is estimated by plugging the parameter estimates into (10) or (12).

It should be noted that population uniqueness is neither a sufficient nor a necessary condition for re-identification or for disclosing additional information, see e.g. Willenborg and de Waal (1996, pp. 19-20). It is not a sufficient condition since, first of all, the unique unit must be included in the sample and, secondly, it must also be known to the intruder that the unit is in fact unique. It is not necessary for several reasons. If for instance a person in the population shares the same values on the key attributes with say only one other person, they will both be able to re-identify and disclose information about each other. In general, coalitions of respondents exchanging information can be formed within small groups sharing the same scores on the key attributes, in order to disclose information about an individual within the same group but outside the coalition. Alternatively, if a group of people share the same values on the key attributes, none of them are unique. But if they in addition all share the same score on a certain sensitive attribute provided in the released data, the sensitive information can be disclosed for all those individuals without re-identifying individual records. Another possibility is response knowledge, i.e. knowledge that a specific individual participated in the survey and consequently that his or her data must be included in the data. Identification and disclosure can then occur if the person is unique in the sample and not necessarily in the population (Bethlehem et al., 1990). These issues will not be investigated further in the present paper, but they do motivate an extension of the risk measure in (11) to a more general measure defined by

    Pr(F_i = j + k \mid f_i = j) = \frac{(1 - \pi_s)^k}{k!} \cdot \frac{\int_0^{\infty} \lambda^{k+j} \exp(-\lambda) \, g(\lambda) \, d\lambda}{\int_0^{\infty} \lambda^{j} \exp(-\pi_s\lambda) \, g(\lambda) \, d\lambda}

for j = 1, 2, .... Simple examples pertaining to some of the situations just described are e.g. j = 1 and k = 1, which after simplifying yields

    Pr(F_i = 2 \mid f_i = 1) = (1 - \pi_s) R_2 \, \frac{\tau + \mu\eta}{\eta^2},

or j = 2 and k = 0, which yields

    Pr(F_i = 2 \mid f_i = 2) = \pi_s R_2 \, \frac{\eta_s^2 (\tau + \mu\eta)}{\eta^2 (\tau_s + \mu_s\eta_s)}.
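Since the general measure involves only one-dimensional integrals against the iG density (4), it is straightforward to evaluate numerically. The following Python sketch (ours, assuming scipy is available; parameter values are illustrative) does this with scipy.integrate.quad; the special case j = 1, k = 0 reproduces R_2.

```python
import math
from scipy.integrate import quad

def ig_pdf(lam, mu, tau):
    """Inverse Gaussian density (4) with mean mu and variance mu*tau."""
    return mu / math.sqrt(2 * math.pi * tau * lam**3) \
           * math.exp(-(lam - mu) ** 2 / (2 * tau * lam))

def pr_pop_given_sample(j, k, mu, tau, pi_s):
    """Pr(F_i = j + k | f_i = j) under the PiG superpopulation model."""
    num = quad(lambda l: l**(j + k) * math.exp(-l) * ig_pdf(l, mu, tau),
               0, math.inf)[0]
    den = quad(lambda l: l**j * math.exp(-pi_s * l) * ig_pdf(l, mu, tau),
               0, math.inf)[0]
    return (1 - pi_s) ** k / math.factorial(k) * num / den

# R_2 is the special case j = 1, k = 0:
print(pr_pop_given_sample(1, 0, mu=22.0, tau=330.0, pi_s=0.1))
```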

5. Estimation
Let S_0 denote the number of structural zeroes. If this number is known a priori, the parameter \delta is known and equals 1 - (C - S_0)/C. When this is the case, and especially when S_0 = 0, the parameters \mu_s and \tau_s can be estimated using ordinary maximum likelihood (ML) methods. The likelihood is derived using (9) over the C - S_0 non-structural-zero cells. Willmot (1987) derived the ML estimates for the present parameterization and we include them here for the sake of self-containment. It is easily shown that the ML estimate of \mu_s is simply the average cell size, i.e.

    \hat{\mu}_s = \frac{n}{C - S_0}.    (13)

Note that this estimator satisfies the requirement in (3) since

    C(1 - \delta)\hat{\mu} = C\left(1 - \frac{S_0}{C}\right)\frac{N}{n} \cdot \frac{n}{C - S_0} = N.

The ordinary ML estimate of \tau_s is the solution to

    h = \sum_{j=0}^{\infty} t_j \, \frac{(j + 1)\, p_{j+1}}{p_j} - n = 0,    (14)

with t_0 replaced by t_0 - S_0. Equation (14) is easily solved using e.g. Newton-Raphson iteration. The technical details are described in Willmot (1987).
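As an illustration of the ML route (not code from the paper), the following Python sketch simulates sample cell counts from the model itself, fixes \mu_s by (13) and then estimates \tau_s by maximizing the log likelihood \sum_j t_j \log p_j directly with scipy, which targets the same estimate as Newton-Raphson on the score equation (14); pig_pmf is the function sketched in section 3 and the parameter values are illustrative.

```python
import numpy as np
from collections import Counter
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(seed=2)
C, mu_true, tau_true = 21_504, 0.23, 5.0
lam = rng.wald(mu_true, mu_true**2 / tau_true, size=C)  # IG: mean mu, var mu*tau
f = rng.poisson(lam)                                    # sample cell counts
t, n = Counter(f.tolist()), int(f.sum())

mu_hat = n / C                                          # eq. (13) with S_0 = 0

def negloglik(tau):
    p = pig_pmf(mu_hat, tau, max(t))
    return -sum(tj * np.log(p[j]) for j, tj in t.items())

res = minimize_scalar(negloglik, bounds=(1e-4, 100.0), method="bounded")
print(mu_hat, res.x)   # the tau_s estimate should land near tau_true
```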

It would however more often be the case that S_0 is unknown, so that \delta needs to be considered in the estimation process. By employing a zero-truncated likelihood where only the non-empty sample cell frequencies are considered, this problem is circumvented and \delta is accordingly treated as a nuisance parameter, since

    Pr(f_i = j \mid f_i \ge 1) = \frac{(1 - \delta)\, p_j}{1 - (\delta + (1 - \delta)\, p_0)} = \frac{p_j}{1 - p_0},   j = 1, 2, ...    (15)

Zero-truncated estimation was described by Sichel (1975) and Ord and Whitmore (1986), and it is the approach considered by Skinner and Holmes (1993), who designate it a conditional likelihood (CL). The log likelihood of the f_i for those i for which f_i \ge 1 is thus defined as

    l_{CL} = \sum_{j=1}^{\infty} t_j \log \frac{p_j}{1 - p_0},    (16)

which yields the system of equations

    h_1 = \frac{n}{C - t_0} - \frac{\mu_s}{1 - p_0} = 0,
                                                                  (17)
    h_2 = \sum_{j=1}^{\infty} t_j \varphi_j - n + \frac{n p_0}{\eta_s} = 0.

Estimates of \mu_s and \tau_s are the solutions to (17) and may be obtained by numerical iteration methods such as Newton-Raphson. The derivation of (17) and the required derivatives, including the functions \varphi_j, are provided in Carlson (2002).

From (1) we have E(t_0) = C\delta + C(1 - \delta) p_0, so once the estimates \hat{\mu}_s and \hat{\tau}_s are obtained it is straightforward to estimate \delta by replacing E(t_0) with t_0 and p_0 with its estimate, i.e.

    \hat{\delta} = \frac{t_0 - C\hat{p}_0}{C(1 - \hat{p}_0)}.    (18)

As a consequence we have \hat{t}_0 = t_0, that is, we obtain a perfect fit for the number of empty cells in the sample. Furthermore, using that \hat{\mu} = N\hat{\mu}_s/n and, from (17), that \hat{\mu}_s = n(1 - \hat{p}_0)/(C - t_0), it is seen that the restriction in (3) is automatically satisfied since

    C(1 - \hat{\delta})\hat{\mu} = C\left(1 - \frac{t_0 - C\hat{p}_0}{C(1 - \hat{p}_0)}\right) \frac{N(1 - \hat{p}_0)}{C - t_0} = N.

Skinner and Holmes (1993) also considered truncating the set of probabilities in (15) above a threshold value m. The idea is that in applications to disclosure control, a lack of fit in the right hand tail is not likely to be as critical as in the left hand tail, which may be considered more crucial since only cells belonging to t_0 and t_1 can by definition contain population uniques. This is basically the approach used by Chen and Keller-McNulty (1998) as well, who explicitly fit their model to the observed values of t_1 and t_2. A further motivation for this approach is a possible reduction of computational effort. Thus the p_j are assumed to be proportional to (9) for j = 1, ..., m and no assumptions are made about the p_j for j \ge m + 1. The right-truncated approach is described in Carlson (2002) and will not be pursued further in the present paper.
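A corresponding sketch of the CL route (again ours, not the paper's code): the zero-truncated log likelihood (16) is maximized numerically over (\mu_s, \tau_s) and \delta is then recovered from (18). The frequencies of frequencies below are illustrative, loosely patterned on the example in section 6 and truncated for brevity; pig_pmf is the function sketched in section 3.

```python
import numpy as np
from scipy.optimize import minimize

t = {1: 382, 2: 127, 3: 79, 4: 47, 5: 31, 6: 27, 7: 22}  # non-empty cells only
t0, C = 20_642, 21_504

def neg_cl(theta):
    mu_s, tau_s = np.exp(theta)                  # optimize on the log scale
    p = pig_pmf(mu_s, tau_s, max(t))
    return -sum(tj * (np.log(p[j]) - np.log1p(-p[0])) for j, tj in t.items())

res = minimize(neg_cl, x0=np.log([1.0, 10.0]), method="Nelder-Mead")
mu_s, tau_s = np.exp(res.x)
p0 = pig_pmf(mu_s, tau_s, 0)[0]
delta = (t0 - C * p0) / (C * (1 - p0))           # eq. (18)
print(mu_s, tau_s, delta)
```

Direct maximization of (16) avoids deriving the score functions in (17), at the cost of slower convergence than Newton-Raphson.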

6. Example
To illustrate the method a simulated population consisting of N = 50,000 units was generated. The data were designed to possess a reasonable level of structure with dependencies between the variables. This resulted in a frequency structure with the properties mentioned in the introduction, i.e. an inverse-J shape and a heavy right tail, cf. Chen and Keller-McNulty (1998). The data consisted of six categorical variables with 2, 3, 7, 8, 8, and 8 categories respectively, giving a total of C = 21,504 possible key values or cells. Of these T_0 = 19,581 were empty and the number of population uniques was T_1 = 534. The largest cell contained 852 units. Figure 1 below depicts the data.

To demonstrate the differences between the ordinary ML and zero-truncated CL approaches to estimation, the PiG model was fitted to a simple random sample consisting of 5,000 units using both methods. The number of empty cells in the sample was t_0 = 20,642 and the number of sample uniques was t_1 = 382, of which 55 were true population uniques, corresponding to approximately 10% of T_1. The proportion of sample uniques that also were population uniques was accordingly approximately 14%, which is the quantity to be estimated by (10) and (12). The largest cell size among the 862 non-empty cells in the sample was 97 (one cell).

The results of the estimation procedures are summarized in table 1 and it is clear that the large number of empty cells has an impact on the estimates. This is most notable in the estimate of \mu_s, which differs by about a factor of 10, and in the estimate of \delta from the CL procedure, which corresponds to about 19,234 structural zero cells. Note also that the risk indicators \hat{R}_1 and \hat{R}_2 are over-estimated using the ML approach, whereas the CL approach under-estimates the risk in this particular case.


Figure 1: Population cell size distribution, (a) entire distribution and (b) details of the left hand side, small frequency cells. The frequency of empty cells, i.e. T_0, is omitted.

    Estimates            \hat{\mu}_s   \hat{\tau}_s   \hat{\delta}   \hat{R}_1   \hat{R}_2   Likelihood
    Ordinary ML          0.2325        52.740         --             0.35061     0.28800     5603.4
    Zero-truncated CL    2.2024        33.316         0.894          0.09813     0.09940     1967.7

Table 1: Results of estimation using the ordinary ML and zero-truncated CL procedures.

    Cell size j    Observed t_j    Fitted t_j, ML    Fitted t_j, CL
    0              20642           20638.42          (20642)
    1              382             465.04            377.14
    2              127             120.41            143.39
    3              79              59.68             75.14
    4              47              36.95             47.13
    5              31              25.63             32.77
    6              27              19.04             24.33
    7              22              14.82             18.89
    ...            ...             ...               ...
    15             6               4.25              5.22
    16+            76              66.30             70.79
    \chi^2                         49.336            13.425
    d.f.                           14                13

Table 2: Observed and fitted cell size frequencies for the 10% sample of the simulated data set.


Figure 2: Sample cell size distribution with (a) ordinary ML fits and (b) zero-truncated CL fits imposed.

The observed and fitted cell size frequencies are summarized in table 2 and depicted in figure 2. It is seen that there was a larger degree of discrepancy between the observed and fitted values when the ML procedure was used, especially among the smaller cell sizes. For the larger cell sizes the differences were smaller, but the CL procedure tends to adapt better to the heavy tail.

7. Remarks
Since the scope of this paper has been limited to a theoretical review with only a small-scale simulated example, it is necessarily difficult to evaluate how the PiG model fares in the general case and when compared to competing approaches. Before any conclusions can be drawn, a more extensive evaluation is called for, including tests on real-life data and comparisons with other models. Such an evaluation is intended to appear in a coming report. The PiG model does however provide easily computed estimates of the disclosure risk along the lines discussed. Also the positive skewness and heavy upper tail characterizing the PiG seem to promote it as an interesting candidate in disclosure applications.

In the present paper we have only considered the two-parameter version of the general three-parameter Sichel distribution. The three-parameter distribution is very powerful and a number of known distribution functions, such as the Poisson, negative binomial, geometric and Fisher's logarithmic series, are special or limiting forms of the Sichel. As Sichel (1975) also points out, a useful discrimination between discrete distributions is achieved by studying the ratio function P_{j+1}/P_j. For many standard distributions this function is either monotonically increasing or decreasing. The ratio function of the Sichel, however, may first decrease and,


after passing through a minimum, increase again.

We note also that the risk measures in (10) and (12) provide only an overall measure of disclosure risk pertaining to the sample as a whole. A per-record measure of disclosure risk is perhaps more useful as it would provide a means to identify sensitive (unique) records to which disclosure controlling measures can be applied. From an intruder's point of view it would be optimal to utilize as much as possible of the information provided in the sample when formulating a model. More structured methods which attempt to capture the underlying probability structure inherent in the key variables defining X have been suggested, see e.g. Fienberg and Makov (1998) and Skinner and Holmes (1998). The latter considered a per-record measure based on their Poisson-lognormal model, and we note that similar regression methods are available for PiG data, based on a model of the form \mu_x = \exp(\beta'x) with \tau fixed, see Dean et al. (1989) for details. In conclusion, the possibility of extending the present model in these directions is certainly worthy of future exploration.

8. Acknowledgement
The author wishes to thank Professor Daniel Thorburn, Dep. of Statistics, Stockholm University, for valuable comments and suggestions in the preparation of this paper and Boris Lorenc, Dep. of Statistics, Stockholm University, for providing the simulated data.

References
[1] Abramowitz, M. and Stegun, I.A. (1970) Handbook of Mathematical Functions. Monograph. New York: Dover Publications.

[2] Bethlehem, J.G., Keller, W.J. and Pannekoek, J. (1990) Disclosure Control of Microdata. Journal of the American Statistical Association, 85, pp. 38-45.

[3] Block, H. and Olsson, L. (1976) Backwards Identification of Personal Information - Bakvägsidentifiering (in Swedish). Statistisk Tidskrift, 4, pp. 135-144.

[4] Carlson, M. (2002) Assessing Microdata Disclosure Risk Using the Poisson-Inverse Gaussian Distribution. Research Report, Dep. of Statistics, Stockholm University.

[5] Chen, G. and Keller-McNulty, S. (1998) Estimation of Identification Disclosure Risk in Microdata. Journal of Official Statistics, 14, pp. 79-95.

[6] Dalenius, T. (1977) Towards a Methodology for Statistical Disclosure Control. Statistisk Tidskrift, 5, pp. 429-444.

[7] Dalenius, T. (1986) Finding a Needle in a Haystack or Identifying Anonymous Census Records. Journal of Official Statistics, 2, pp. 329-336.

[8] Dean, C., Lawless, J.F. and Willmot, G.E. (1989) A Mixed Poisson-Inverse-Gaussian Regression Model. The Canadian Journal of Statistics, 17, pp. 171-181.

[9] Duncan, G. and Lambert, D. (1989) The Risk of Disclosure for Microdata. Journal of Business & Economic Statistics, 7, pp. 207-217.

[10] Duncan, G.T. and Pearson, R.W. (1991) Enhancing Access to Microdata while Protecting Confidentiality: Prospects for the Future. With discussion. Statistical Science, 6, pp. 219-239.

[11] Elliot, M.J., Skinner, C.J. and Dale, A. (1998) Special Uniques, Random Uniques and Sticky Populations: Some Counterintuitive Effects of Geographical Detail on Disclosure Risk. Research in Official Statistics, 1, pp. 53-67.

[12] Fienberg, S.E. and Makov, U. (1998) Confidentiality, Uniqueness and Disclosure Limitation for Categorical Data. Journal of Official Statistics, 14, pp. 385-397.

[13] Folks, J.L. and Chhikara, R.S. (1978) The Inverse Gaussian Distribution and Its Statistical Application - A Review. With discussion. Journal of the Royal Statistical Society, Series B, 40, pp. 263-289.

[14] Frank, O. (1979) Inferring Individual Information from Released Statistics. In Proceedings of the 42nd Session of the International Statistical Institute, Manila, Dec. 4-14, 1979. The Hague: International Statistical Institute, pp. 487-498.

[15] Holla, M.S. (1966) On a Poisson-Inverse Gaussian Distribution. Metrika, 11, pp. 115-121.

[16] Hoshino, N. (2001) Applying Pitman's Sampling Formula to Microdata Disclosure Risk Assessment. Journal of Official Statistics, 17, pp. 499-520.

[17] Hoshino, N. and Takemura, A. (1998) On the Relation Between Logarithmic Series Model and Other Superpopulation Models Useful for Microdata Disclosure Risk Assessment. Discussion Paper, 98-F-7, Faculty of Economics, University of Tokyo.

[18] Johnson, N.L., Kotz, S. and Balakrishnan, N. (1994) Continuous Univariate Distributions, Vol. 1, 2nd ed. Monograph. New York: John Wiley & Sons.


[19] Johnson, N.L., Kotz, S. and Kemp, A.W. (1992) Univariate Discrete Distributions, 2nd ed. Monograph. New York: John Wiley & Sons.

[20] Lambert, D. (1993) Measures of Disclosure Risk and Harm. Journal of Official Statistics, 9, pp. 313-331.

[21] Ord, J.K. and Whitmore, G. (1986) The Poisson-Inverse Gaussian Distribution as a Model for Species Abundance. Communications in Statistics - Theory and Methods, 15, pp. 853-871.

[22] Samuels, S.M. (1998) A Bayesian, Species-Sampling-Inspired Approach to the Uniques Problem in Microdata Disclosure Risk Assessment. Journal of Official Statistics, 14, pp. 373-383.

[23] Särndal, C.E., Swensson, B. and Wretman, J. (1992) Model Assisted Survey Sampling. Monograph. New York: Springer-Verlag.

[24] Sichel, H.S. (1971) On a Family of Discrete Distributions Particularly Suited to Represent Long-Tailed Frequency Data. In N.F. Laubscher (editor) Proceedings of the Third Symposium on Mathematical Statistics, pp. 51-97. Pretoria: C.S.I.R.

[25] Sichel, H.S. (1973) The Density and Size Distribution of Diamonds. Bulletin of the International Statistical Institute, 45(2), pp. 420-427.

[26] Sichel, H.S. (1974) On a Distribution Representing Sentence-Length in Written Prose. Journal of the Royal Statistical Society, Series A, 137, pp. 25-34.

[27] Sichel, H.S. (1975) On a Distribution Law for Word Frequencies. Journal of the American Statistical Association, 70, pp. 542-547.

[28] Sichel, H.S. (1982a) Repeat-Buying and the Generalized Inverse Gaussian-Poisson Distribution. Applied Statistics, 31, pp. 193-204.

[29] Skinner, C.J. and Holmes, D.J. (1993) Modelling Population Uniqueness. In Proceedings of the International Seminar on Statistical Confidentiality, Dublin, 8-10 September 1992, pp. 175-199. Luxembourg: Office for Official Publications of the European Communities.

[30] Skinner, C.J. and Holmes, D.J. (1998) Estimating the Re-identification Risk Per Record. Journal of Official Statistics, 14, pp. 361-372.

[31] Skinner, C., Marsh, C., Openshaw, S. and Wymer, C. (1994) Disclosure Control for Census Microdata. Journal of Official Statistics, 10, pp. 31-51.

[32] St-Cyr, P. (1998) Modelling Population Uniqueness Using a Mixture of Two Distributions. In Statistical Data Protection - Proceedings of the Conference, Lisbon, 25 to 27 March 1998 - 1999 edition, pp. 277-286. Luxembourg: Office for Official Publications of the European Communities.

[33] Takemura, A. (1999) Some Superpopulation Models for Estimating the Number of Population Uniques. In Statistical Data Protection - Proceedings of the Conference, Lisbon, 25 to 27 March 1998 - 1999 edition, pp. 59-76. Luxembourg: Office for Official Publications of the European Communities.

[34] Willmot, G.E. (1987) The Poisson-Inverse Gaussian Distribution as an Alternative to the Negative Binomial. Scandinavian Actuarial Journal, pp. 113-127.

[35] Willenborg, L. and de Waal, T. (1996) Statistical Disclosure Control in Practice. Lecture Notes in Statistics, Vol. 111. Monograph. New York: Springer-Verlag.

[36] Willenborg, L. and de Waal, T. (2000) Elements of Statistical Disclosure Control. Lecture Notes in Statistics, Vol. 155. Monograph. New York: Springer-Verlag.

