
SOME ALTERNATIVES TO ECOLOGICAL CORRELATION¹

LEO A. GOODMAN

ABSTRACT
Under certain specified conditions, ecological data, e.g., the percentage of non-whites and the percentage
in domestic service for different community areas, can be used to estimate the "individual correlation"
between two dichotomous classifications, e.g., the non-white-white classification and the domestic service-
other than domestic service classification. Quite accurate estimates are obtained for some data on the relation
between color and occupation, somewhat less accurate estimates on the relation between color and literacy.
For some situations where the specific conditions here described are not met, other conditions are presented
that lead to different methods for estimating the individual correlation from the ecological data. These
methods are developed for the study of the individual correlation between two dichotomous variables,
between two qualitative variables where one of them is dichotomous, and between two quantitative variables.

The present article discusses some of the results reported in Robinson's paper on ecological correlation² and explores further the suggestions in earlier notes by the present author³ and by Duncan and Davis.⁴ The terminology used in these earlier papers will, for the sake of convenience, be used here, although it does have some disadvantages. "In an ecological correlation . . . the variables are . . . descriptive properties of groups. . . . An individual correlation is a correlation in which the . . . variables are descriptive properties of individuals. . . ."⁵ The phrase "behavior of individuals" referred to the variables describing properties of individuals, while "ecological data" referred to the ecological variables describing properties of groups. An "ecological regression" study is a standard regression analysis for ecological variables. The problems of "aggregation," as discussed in some of the economics literature,⁶ are related somewhat to the mathematical problems that have appeared in the discussion of ecological and individual correlations, although the terminology of this literature is quite different from that of the papers referred to earlier.

The variables in an ecological correlation are usually quantitative (e.g., percentages or means for each of the 48 states), while the variables in an individual correlation may be qualitative (e.g., race of each individual) or quantitative (e.g., height of each individual). The ecological correlation coefficient used in the earlier papers was the Pearsonian correlation coefficient for the joint distribution of two quantitative ecological variables. These papers dealt mainly, though not exclusively, with the situation in which both variables in the individual correlation study were qualitative and dichotomous and the individual correlation coefficient used was the fourfold-point (φ) correlation coefficient for the cross-classification table describing the joint distribution of the two dichoto-

¹ The research was carried out at the Statistical Research Center, University of Chicago, under the sponsorship of the Statistics Branch, Office of Naval Research, and of the Social Science Research Committee, University of Chicago. Reproduction in whole or in part is permitted for any purpose of the United States government. I am indebted to R. Blough, who assisted with the numerical computations; and to O. D. Duncan, P. F. Lazarsfeld, J. S. Coleman, Z. Griliches, Y. Grunfeld, P. M. Hauser, P. H. Rossi, and H. Zeisel for helpful comments.
² W. S. Robinson, "Ecological Correlations and the Behavior of Individuals," American Sociological Review, XV (1950), 351-57.
³ Leo A. Goodman, "Ecological Regression and Behavior of Individuals," American Sociological Review, XVIII (1953), 663-64.
⁴ Otis Dudley Duncan and Beverly Davis, "An Alternative to Ecological Correlation," American Sociological Review, XVIII (1953), 665-66.
⁵ Robinson, op. cit., p. 351.
⁶ For example, H. Theil, Linear Aggregations of Economic Relations (Amsterdam: North-Holland Publishing Co., 1954).

This content downloaded from 177.220.086.174 on March 05, 2018 07:40:45 AM


All use subject to University of Chicago Press Terms and Conditions (http://www.journals.uchicago.edu/t-and-c).
mous variables.⁷ The present article will also study this situation, as well as situations in which both variables considered in the individual correlation study are quantitative, or where both are qualitative and one is dichotomous. Situations in which one variable is quantitative and the other qualitative, or where both are qualitative but neither dichotomous, will not be considered here.⁸

It has been shown that ecological correlations cannot be used as substitutes for individual correlations.⁹ However, ecological correlations may be of interest in themselves; the kinds of questions that can be answered by a study of ecological correlations are sometimes of direct concern to social scientists.¹⁰ In some problems, both the ecological and the individual correlations and the relations between them may be of interest. Even if the investigator is concerned only with individual correlations, ecological data may be of service, though ecological correlations are not recommended.¹¹

The author's earlier note¹² showed that, under very special circumstances, the analysis of the regression between ecological variables may be used to make inferences about "individual behavior," i.e., about the unknown data, for a population of individuals, describing the cross-classification of two dichotomous attributes. In the present article, the general approach presented in the note will be developed further, and the inferences about "individual behavior" will be used to estimate individual correlations. (Since the individual correlation coefficient, φ, may not be an appropriate measure of association in many situations,¹³ the author's note did not discuss individual correlations explicitly but rather inferences about "individual behavior." However, since the individual correlation coefficient may sometimes be an appropriate measure, it will be investigated here. The general method developed here can also be applied to situations in which some other measure of association is of interest.) This article will also explore in some detail the method presented in the note by Duncan and Davis¹⁴ and will suggest a few techniques that lead to further insight into it.¹⁵

If individual correlations are of interest, it is best to obtain the directly relevant data on individual behavior rather than ecological data. For example, if the individual correlation between color (Negro-white) and illiteracy (illiterate-literate) is of interest, the appropriate data would be a fourfold table describing the cross-classification of individuals according to Negro-white and illiterate-literate categories.¹⁶ However, in some situations this table may not be available; thus the fourfold-point correlation coefficient cannot be computed from it. However, the marginal totals (i.e., the number of Negroes, whites, illiterates, and literates) for the total Negro-white population and also for the Negro-white populations of various subdivisions of the country may be known. Using these ecological data, methods will be presented for estimating the data, which would have appeared in the table, and the fourfold-point correlation coefficient for it. These methods can also be used to estimate

⁷ See, e.g., Helen M. Walker and Joseph Lev, Statistical Inference (New York: Henry Holt & Co., 1953), p. 272, and Leo A. Goodman and William H. Kruskal, "Measures of Association for Cross Classifications," Journal of the American Statistical Association, XLIX (1954), 732-64, esp. 739.
⁸ See Goodman and Kruskal, op. cit., pp. 735-38, for a description of some distinctions between these situations.
⁹ Robinson, op. cit., p. 357; Goodman, op. cit., p. 663.
¹⁰ Herbert Menzel's "Comment" on Robinson's paper, in American Sociological Review, XV (1950), 674; Goodman, op. cit., p. 663.
¹¹ Duncan and Davis, op. cit., p. 665; Goodman, op. cit., p. 664.
¹² Op. cit.
¹³ See Goodman and Kruskal, op. cit., for further discussion of this point.
¹⁴ Op. cit.
¹⁵ Cf. Hanan C. Selvin, "Durkheim's 'Suicide' and Problems of Empirical Research," American Journal of Sociology, LXIII (1958), 615-18. Selvin refers to the results presented in an unpublished version of the present article.
¹⁶ See, e.g., Robinson, op. cit., p. 353, Table 1.

THE AMERICAN JOURNAL OF SOCIOLOGY
the non-available data and the corresponding correlation coefficient for each subdivision of the country. These methods are simple, but they cannot be expected to lead to as accurate estimates as those obtained from relevant data on individual behavior. On the other hand, if the ecological data are easily available, then the amount of computation involved in using the methods suggested here costs very little in comparison with the cost of obtaining the directly relevant data from a special study.

Ecological regression.-The proportion, y, of individuals in the Negro-white population who are illiterate may be written as $y = xp + (1 - x)r$, where x is the proportion in the population who are Negro, p is the proportion of Negroes who are illiterate, $(1 - x)$ is the proportion in the population who are white, and r is the proportion of whites who are illiterate. Thus $y = r + (p - r)x = a + bx$, where $a = r$ and $b = p - r$. Hence, if different populations or areas are considered where the proportion p is the same for each of these populations and also the proportion r is the same for these populations, then there will be an exact linear relationship, $y = a + bx$, between the values of y and x for the different populations (assuming that not all the values of x are equal), where the slope will be $b = p - r$, and the y-intercept will be $a = r$. This straight line could be used to determine $r = a$ and $p = b + a$.

In practice, the actual values of p and r will not be constant, but it may be the case that the average $E(p \mid x)$ of the values of p, for populations with the same proportion x of Negroes, is constant [i.e., $E(p \mid x)$ is the same for different values of x] and the average $E(r \mid x)$ of the values of r, for populations with the same x value, is also constant. In this situation, the main assumption of linear regression analysis, $E(y \mid x) = A + Bx$, holds true, where $A = E(r \mid x)$ and $B = E(p \mid x) - E(r \mid x)$. Thus standard methods of linear regression¹⁷ can be used to estimate A and B. The variance $\sigma^2(y \mid x)$ of the observed y values from the straight line will depend on the variance $\sigma^2(p \mid x)$ computed for the probability distribution of the proportion p illiterate among Negroes for populations with the same proportion x of Negroes; the variance $\sigma^2(r \mid x)$ of the r for populations with a given x value; the covariance $\mathrm{Cov}(p, r \mid x)$ of these two proportions; and the distribution of the x values. If $\sigma^2(y \mid x)$ is not approximately constant for the different x values, it will sometimes be worthwhile to modify the standard regression methods by the use of a "weighted regression."¹⁸ (Another kind of modification, which will sometimes be appropriate, can be based on methods developed for the situation of "linear regression where both variates are subject to error"¹⁹ rather than for the standard linear regression.) For a given x, $\sigma^2(y \mid x) = \sigma^2(p \mid x)\,x^2 + \sigma^2(r \mid x)(1 - x)^2 + 2\,\mathrm{Cov}(p, r \mid x)\,x(1 - x)$. Thus, under the present assumptions, $\sigma^2(y \mid x) = 0$ only when all the p and r values equal $B + A$ and $A$, respectively, or when there is a specific negative linear relationship between p and r, for each x; viz., $px + r(1 - x) = A + Bx$, or $p = A + B - (r - A)(1 - x)/x$. [In the final section herein, different assumptions are made, which lead to quite different kinds of situations in which it is possible that $\sigma^2(y \mid x) = 0$.]

The expected values will be constant, and the variances will be small, when the probability of illiteracy, say, is much more a function of color (i.e., it depends on whether a person is white or Negro) rather than a function of the ecological area being considered. Where the phenomenon under investigation is more a function of the area (i.e., the

¹⁷ See, e.g., Wilfrid J. Dixon and Frank J. Massey, Jr., Introduction to Statistical Analysis (New York: McGraw-Hill Book Co., 1951), chap. xi.
¹⁸ See, e.g., R. L. Anderson and T. A. Bancroft, Statistical Theory in Research (New York: McGraw-Hill Book Co., Inc., 1952), pp. 182-86.
¹⁹ See, e.g., M. G. Kendall, "Regression, Structure, and Functional Relationship," Parts I and II, Biometrika, XXXVIII (1951), 11-25, and XXXIX (1952), 96-108; D. V. Lindley, "Regression Lines and Linear Functional Relationship," Journal of the Royal Statistical Society, Suppl., IX (1947), 219-44; J. W. Tukey, "Components in Regression," Biometrics, VII (1951), 33-70.


p and r values differ widely in the different areas) than a function of color, the methods presented here are not recommended; however, in some situations the variance of p and r may be sufficiently small for them to be applicable, while in others the variance of p and r for a particular subset of ecological areas (e.g., for the states in a given geographic division of the United States) or for a set of combined ecological areas (e.g., for the nine geographic divisions of the United States) may be sufficiently small for the present method to be applied to the subset of ecological areas or to the set of combined areas. If the variance of p and r for the states in each division of the United States is small, then the methods may be applied to obtain separate estimates for each division, thus obtaining estimates where, in a certain sense, geographic divisions have been held constant.

If the scatter diagram of y and x does not suggest a linear relation between y and x, then the present strategy is not recommended (unless the scatter diagram of y and x for a subset of the areas or for a set of combined areas suggests linearity). If the scatter diagram does suggest a linear relation, then it may be applicable, but it is still possible that the variances of p and r are large. In this case, the present method leads to estimates of $E(p \mid x)$ and $E(r \mid x)$ when these average (expected) values are constant, but the estimate of the individual correlation for the total population based on these estimates of expected values may be quite poor if the variance of p and/or r is large.

From the scatter diagram of the per cent Negro and per cent illiterate in different areas, the slope B and the y-intercept A can be estimated by the usual methods of linear regression, obtaining the estimates b and a, respectively. Then $E(r \mid x)$ and $E(p \mid x)$ can be estimated by $\hat{r} = a$ and $\hat{p} = b + \hat{r} = b + a$, respectively. The numbers of illiterate Negroes, illiterate whites, literate Negroes, and literate whites (i.e., the four entries in the fourfold "individual behavior" table) are estimated by $\hat{p}NX$, $\hat{r}N(1 - X)$, $(1 - \hat{p})NX$, and $(1 - \hat{r})N(1 - X)$, respectively, where $NX$ is the total number of Negroes.

Since p and r lie between 0 and 1, it is desirable that the estimates $\hat{p}$ and $\hat{r}$ also lie between 0 and 1. When this is not the case, the underlying assumptions should be re-examined, although it is possible to obtain such estimates even if these assumptions are satisfied. A method for dealing with this situation was suggested in the author's earlier note.

The estimated proportion $\hat{Y} = a + bX = \hat{p}X + \hat{r}(1 - X)$ of illiterates in the Negro-white population should be close to the known proportion, Y, of illiterates. If this is not the case for a given set of data, this method is not recommended. In the special case where Y is, in fact, equal to the average, $\bar{y}$, of the illiteracy proportions in the various ecological areas and X is equal to the average, $\bar{x}$, of the proportions Negro in the ecological areas, then this check on the underlying assumptions of the method does not apply, since in this case $\hat{Y} = a + b\bar{x} = \bar{y} = Y$, even if the assumptions are not met. A method for determining roughly whether or not $\hat{Y}$ is sufficiently close to Y will be mentioned later in this section.

Rather than compute the correlation coefficient $\hat{c}$ directly for the estimated fourfold table by the usual formula, a simplified formula is $b\sqrt{X(1 - X)/[\hat{Y}(1 - \hat{Y})]}$. Since $\hat{Y}$ will be close to Y when this general approach is applicable, it will not matter much whether or not $\hat{Y}$ is replaced by the known proportion Y. Thus an estimate of the fourfold-point correlation is

$$\hat{c} = b\sqrt{\frac{X(1 - X)}{Y(1 - Y)}}.$$

Following standard correlation theory, the ecological correlation can be computed by multiplying b by the ratio of the standard deviation of the proportions of Negroes in the ecological areas and the standard deviation of the proportions of illiterates there. Since this ratio will usually be very different from $\sqrt{X(1 - X)/[Y(1 - Y)]}$ when ecological data are used, the ecological and individual correlations will usually be very differ-

ent. However, the ecological correlation might also serve as a rough measure of whether the underlying assumptions are not satisfied for a particular set of data;²⁰ the present method is not to be recommended if it is rather small in absolute value.

The estimates b and a will be unbiased estimates of B and A, respectively, if $E(y \mid x) = A + Bx$. If $\sigma^2(y \mid x)$ does not depend on x (i.e., the special case of "homoscedasticity"), then the estimates b and a are the "best" unbiased estimates. When $\sigma^2(y \mid x)$ is not constant, which will usually be the case, the estimates will still be unbiased, but they may not be "best." When homoscedasticity can be assumed and each y value, given x, is a statistically independent observation, the variances of the estimates of b and a are $\sigma^2(b) = \sigma^2(y \mid x)/[n\hat{\sigma}^2(x)]$ and $\sigma^2(a) = \sigma^2(y \mid x)[\hat{\sigma}^2(x) + \bar{x}^2]/[n\hat{\sigma}^2(x)]$,²¹ where $\hat{\sigma}^2(x)$ is the variance of the observed x values and n is the number of observations. The variances of $\hat{p}$, $\hat{r}$, $\hat{Y}$, and $\hat{c}$ can be written as follows: $\sigma^2(\hat{p}) = \sigma^2(\bar{y}) + (1 - \bar{x})^2\sigma^2(b)$, $\sigma^2(\hat{r}) = \sigma^2(a) = \sigma^2(\bar{y}) + \bar{x}^2\sigma^2(b)$, $\sigma^2(\hat{Y}) = \sigma^2(\bar{y}) + (X - \bar{x})^2\sigma^2(b)$, $\sigma^2(\hat{c}) = \sigma^2(b)X(1 - X)/[Y(1 - Y)]$, where $\sigma^2(\bar{y}) = \sigma^2(y \mid x)/n$. These variances are all proportional to $\sigma^2(y \mid x)$, which depends on $\sigma^2(p \mid x)$, $\sigma^2(r \mid x)$, $\mathrm{Cov}(r, p \mid x)$, and the distribution of x. When homoscedasticity can be assumed, the variances, $\sigma^2(\hat{p})$, $\sigma^2(\hat{r})$, $\sigma^2(\hat{Y})$, $\sigma^2(\hat{c})$, can be estimated, in an unbiased manner, by replacing $\sigma^2(y \mid x)$ in the formulas given above by the mean-square deviation of the observed y values from the least-squares regression line, $y = a + bx$.²² The estimated variance of $\hat{Y}$ can be used along with the observed difference between $\hat{Y}$ and Y to determine roughly whether or not $\hat{Y}$ is sufficiently close to Y, which is another partial check on the underlying assumptions.

The formulas presented above must be interpreted with caution, since they compute only the variance of the estimate from its expected value, whereas the difference between the estimate and the actual (rather than the expected) population value would be of greater interest. Furthermore, if homoscedasticity cannot be assumed, the numerical values obtained by the formulas may be in error. In developing the formulas, it was not necessary to make any assumptions about the distribution of y for given x except that of homoscedasticity and linearity of regression. If, in addition, the distribution of y for given x is a normal distribution, then it is also possible to obtain confidence intervals based on b, $\hat{p}$, $\hat{r}$, $\hat{Y}$, and $\hat{c}$, using the variance formulas that are given above.

Our approach must begin, in each case, with a careful examination of the underlying assumptions;²³ however, the only necessary assumption for the justification of the use of the point estimates b, $\hat{p}$, $\hat{r}$, $\hat{Y}$, and $\hat{c}$ is that p and r must be more or less constant for the different ecological areas in such a way that the standard linear regression model can be applied.

If the proportion z of Negroes among the illiterates is approximately constant and if the proportion v of Negroes among the literates is also approximately constant, then an analogous approach to the one presented here could be used with the same ecological data to obtain estimates of the proportions z and v and the individual correlation c. Thus this approach may lead to two quite different estimates of c; the choice between them should depend upon whether p and r are more constant than z and v (see comments in the final section herein).

²⁰ A statistical test for linearity of regression for certain kinds of data is described in Walker and Lev, op. cit., pp. 245-46.
²¹ A somewhat different, but equivalent, set of formulas is given, for example, in Alexander McFarland Mood, Introduction to the Theory of Statistics (New York: McGraw-Hill Book Co., Inc., 1950), p. 294.
²² See, e.g., the discussion of regression analysis in Dixon and Massey, op. cit.
²³ Relevant are two earlier papers: Frederick F. Stephan, "Sampling Errors and the Interpretation of Social Data Ordered in Time and Space," Journal of the American Statistical Association, XXIX (March 1934 Suppl.), 165-66, and Frank Alexander Ross, "Ecology and the Statistical Method," American Journal of Sociology, XXXVIII (1933), 507-22.
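
The point estimates described in the ecological regression section can be illustrated with a short computational sketch. It is not part of the original article: the function names, the table layout, and the small data set in the demonstration are assumptions made only for illustration. The sketch fits the least-squares line y = a + bx, forms the estimates of r and p as a and a + b, builds the estimated fourfold table, and compares the usual fourfold-point formula with the simplified slope formula.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


def estimated_fourfold_table(a, b, X, N):
    """Estimated 2 x 2 table from r_hat = a and p_hat = a + b.

    Rows: first group (proportion X of N), second group.
    Columns: has the attribute, lacks the attribute.
    """
    r_hat = a
    p_hat = a + b
    return [
        [p_hat * N * X, (1 - p_hat) * N * X],
        [r_hat * N * (1 - X), (1 - r_hat) * N * (1 - X)],
    ]


def phi_from_table(table):
    """Usual fourfold-point correlation for a 2 x 2 table."""
    (n11, n12), (n21, n22) = table
    rows = (n11 + n12) * (n21 + n22)
    cols = (n11 + n21) * (n12 + n22)
    return (n11 * n22 - n12 * n21) / math.sqrt(rows * cols)


def phi_from_slope(b, X, Y):
    """Simplified formula: b * sqrt(X(1 - X) / (Y(1 - Y)))."""
    return b * math.sqrt(X * (1 - X) / (Y * (1 - Y)))


if __name__ == "__main__":
    # Hypothetical areas in which p = .30 and r = .10 hold exactly,
    # so that y = .10 + .20 x (illustrative data, not from the article).
    xs = [0.1, 0.2, 0.4, 0.5]
    ys = [0.10 + 0.20 * x for x in xs]
    a, b = fit_line(xs, ys)
    X, N = 0.25, 1000  # assumed totals for the combined population
    Y_hat = a + b * X
    table = estimated_fourfold_table(a, b, X, N)
    print("a, b:", a, b)
    print("phi from table:", phi_from_table(table))
    print("phi from slope:", phi_from_slope(b, X, Y_hat))
```

With exactly linear data the two correlation formulas agree, as the text indicates they should when the estimated proportion is used in the denominator. As a further check, inserting the rounded values reported in the numerical examples that follow (b = .25, Y = .04, and X taken as .10; b = .27, X = .07, Y = .08) into `phi_from_slope` gives about .38 and .25, respectively.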


Two numerical examples.-Let us first consider the example discussed by Robinson,²⁴ where, for the Census Bureau's nine geographic divisions of the United States in 1930, the ecological correlation between the per cent illiterate and the per cent Negro for the divisional Negro-white populations, ten years old and over, was .95, while the individual fourfold-point correlation for the 2 × 2 table giving the cross-classified color-illiteracy data for the corresponding total Negro-white population was .20.²⁵ Using the present approach, we see that the graph (Robinson's Fig. 1) for the nine geographic divisions looks more or less linear and that the slope $b = .25$ and the y-intercept $a = .02$. Also, the estimated proportion, $\hat{Y}$, of illiterates in the total population is .04, which is, in fact, equal to the known proportion $Y = .04$. This does not serve as a second partial check on the underlying assumptions, since, in this example, Y does not differ very much from the average, $\bar{y}$, of the proportions illiterate in the nine divisional populations and X does not differ from the average, $\bar{x} = .10$, of the proportions Negro in the nine divisional populations. (We shall see, in our second illustration, how the comparison of $\hat{Y}$ and Y can serve as a partial check on the assumptions. Differences between Y and $\bar{y}$, X and $\bar{x}$, are due to the fact that the percentages appearing in the graph are weighted by the relative population size of the corresponding area in the computation of Y and X, but not so in the computation of $\bar{y}$ and $\bar{x}$.) The fourfold-point correlation for the estimated table, based on the ecological data for the nine geographic divisions, is $\hat{c} = .38$. The bounds for the individual correlation, obtained by the method suggested by Duncan and Davis, are -.07 and +.60.²⁶ Thus the estimate $\hat{c} = .38$, while not very close, is closer to the known value of the individual correlation than are the ecological correlation (.95) and the bounds.

We shall now consider the numerical example on the relation between color (non-white-white) and occupation (domestic service-other than domestic service) for employed females in Chicago in 1940, which is discussed in the note by Duncan and Davis. The individual correlation was .29, while their method, when applied to the available ecological data for community areas, led to the bounds .126 and .355. The scatter diagram for the proportion in domestic service and the proportion non-white among the employed females, computed from the available ecological data for each of fifteen community areas and the "balance of city"²⁷ indicates that the relation is more or less linear, the ecological correlation is .93, $b = .27$, and $a = .07$. Also, the estimated proportion, $\hat{Y}$, of persons in domestic service in the total employed female population in Chicago is .08, which is, in fact, equal to the known proportion $Y = .08$. (Since $Y = .08$ differed from the average, $\bar{y} = .13$, of the proportions of employed females in domestic service in the various ecological areas, and $X = .07$, the proportion non-white among the total employed female population in Chicago, differed from the average, $\bar{x} = .24$, of the proportions non-white among the employed females in the ecological areas, the fact that $\hat{Y}$ was very close to Y gives us some further confidence in the application of this method to the present data.) The fourfold-point correlation for the estimated table is $\hat{c} = .25$. Thus the ecological data show that the individual correlation

²⁴ Op. cit.
²⁵ Further discussion and application of some of the methods developed here will be presented in Otis Dudley Duncan, Ray P. Cuzzort, and Beverly Duncan, Statistical Geography: Problems in Analyzing Areal Data (to be published by the Free Press, Glencoe, Ill.); see also Donald J. Bogue and Margaret J. Hagood, Subregional Migration in the United States, 1935-40, Vol. II: Differential Migration in the Corn and Cotton Belts (Oxford, Ohio: Scripps Foundation, 1953), Appendix A, for a somewhat different approach to the problem of ecological correlation.
²⁶ See Duncan and Davis, op. cit.; comments on this method will be presented in the following section.
²⁷ Sixteenth Census of the United States: 1940: Population and Housing Statistics for Census Tracts and Community Areas, Chicago, Illinois (Washington: Government Printing Office, 1943), Tables A-3 and A-3a, pp. 25-39, and Tables 3 and 3a, pp. 176-341.


must lie in the interval between .13 and .35, and the estimate $\hat{c} = .25$ is quite close to the known value of .29. The computation of the bounds may be used as a third partial check on the estimate $\hat{c}$, as well as for determining the possible range of the individual correlations; we would not have recommended that $\hat{c}$ be used if it had not been within the interval determined by the bounds.

Duncan and Davis also determine that the possible range of the percentage of non-whites in domestic service, based on the community area data, is between 21.1 and 44.5 per cent. It can be seen that the ecological regression method leads to the quite accurate estimate $\hat{p} = .34$ in this particular case; the known value of this proportion is .38.

TABLE 1

Parameters    True Values    Estimates    Estimated Standard Deviations
B                 .32           .27           .03
E(p|x)            .38           .34           .02
E(r|x)            .06           .07           .01
Y                 .08           .08           .01
c                 .29           .25           .03

In this particular illustration, since Y and X differed from $\bar{y}$ and $\bar{x}$, respectively, there were three partial checks on the underlying assumptions, while in the preceding example, there were, in effect, only two. For the present illustration, the results may be summarized in Table 1, where the estimates are shown to compare favorably with their respective known true values and where the estimated standard deviations, computed from the formulas given earlier, are also presented. Since the true value of Y is known from the ecological data, a rough comparison between $\hat{Y}$ and Y can be made by using the information about the estimate of $\sigma(\hat{Y})$, another partial check on the assumptions underlying the present method.

The method of obtaining bounds.-Robinson and also Duncan and Davis point out that different systems of areal subdivision give different results. Duncan and Davis mention that, for their illustrative material, substantially closer bounds to the individual correlation can be derived from the marginal frequencies for the census tracts than from the marginals for the community areas (which are combinations of tracts) and that the criterion for choice between the results of different systems of areal subdivisions is clear: "The individual correlation is approximated most closely by the least maximum and the greatest minimum among the results for several systems of areal subdivisions." Where one areal subdivision (e.g., community areas) represents a combination of another areal subdivision (census tracts), the least maximum and greatest minimum are obtained from the finer areal subdivision.²⁸ Thus, if the best bounds are desired, it is necessary only to compute the bounds for the finer subdivision. However, it is possible to combine the areas of the finer subdivision into not more than four combined areas, where all the areas in a given combined area are similar (in a sense to be defined in the following paragraph), so that the bounds computed by using only the data for the less fine subdivision will be equal to the best bounds determined by the finer subdivision.

Duncan and Davis indicate that substantially closer bounds are obtained for their data when the finer areal subdivision is used. It may sometimes happen that there is little or no difference between the bounds for the finer subdivision and the less fine subdivision. All tracts in which the number of non-white employed females was not more than the number of females in domestic service and the number of females in domestic service was not more than the number of white employed females can be combined into a single area without affecting the bounds (except for rounding errors); tracts in which the number of non-whites was not less than the number in domestic service and the number in domestic service was not more than the number of whites can be combined without affecting the bounds; etc. Tracts that can be combined in this way will be called "similar" (this definition of "similarity" is con-

²⁸ See reference to this finding in Selvin, op. cit., pp. 616-18.


venient in this particular problem but not the points, or the ecological areas that they
necessarily so in other quite different prob- represent, that appear in the same part of
lems). Thus the fact that substantially the graph can be combined to form a single
closer bounds were obtained when census combined area, thus obtaining four areas:
tracts rather than community areas were A, B, C, D. Bounds for the individual cor-
used indicates that some of the tracts that relation computed by using the ecological
form a given community area were not data for the combined areas will yield the
"similar." same result as the bounds computed by us-
The color-illiteracy data for the forty- ing the data for each of the points on the
eight states and Washington, D.C., indicate graph; i.e., the bounds computed by using
that the nine geographic divisions were com- each of the points on the graph could ac-
binations of areas that were, in fact, quite tually be computed from, at most, four
"similar." More specifically, the number of points, A, B, C, D, placed on the graph,
Negroes was not more than the number of where point A is a weighted average
illiterates, and the number of illiterates was (weighted by relative population size) of the
not more than the number of whites (or the points in part A of the graph, etc.
number of Negroes was not less than the The bounds for the individual correlation
number of illiterates, and the number of il- are determinedby first calculating the mini-
literates was not more than the number of mum number of non-white females that are
whites) in almost all the areas that were in domestic service; based on the available
combined to form a given division. Only ecological data, this can be seen to be 0 for
seven states had been combined with areas area A (or for any ecological area repre-
that differed in this respect. Thus the sented by a point in part A of the graph) and
bounds (-.07 and +.60) for the individual also for area B (i.e., the areas where y <
color-illiteracycorrelationbased on the data 1 - x); it is equal to the differencebetween
for the nine geographic divisions differ only the number of females in domestic service
slightly from the bounds (-.07 and +.58) and the number of white employed females
based on the data for the forty-eight states for area C and also for area D (i.e., the areas
and Washington, D.C. The method of com- where y > 1 - x). The maximum number
biningareas so that the same bounds are ob- of non-white employed females in domestic
tained for the combinedareas as for the finer service can be seen to be equal to the num-
subdivision is as follows: Draw two lines ber of non-white employed females for area
y x and y = 1 - x on the graph of the A and also for area D (i.e., the areas where
scatter diagram of the observed ecological y > x); it is equal to the number of females
variables y and x. This divides the graph in domestic service for area B and also for
into four parts: A, B, C, D, where A don- C (i.e., the areas where y < x). Thus, at a
tains all those points representing areas minimum, the number of non-white females
where the number of non-white employed in domestic service for the total population
females was not more than the number of under consideration(i.e., the combinationof
females in domestic service and the number areas A, B, C, and D) will be equal to the
of females in domestic service was not more differencebetween the number of females in
than the number of white employed females domestic service and the number of white
(x < y < 1 - x); B contains those points employed females for the combined popula-
where x > y < 1 - x; C contains those tion in areas C and D. At a maximum, the
points where x > y > 1 - x; and D con- total number of non-white females in do-
tains those points where x < y > 1 - x. mestic service will be equal to the sum of the
(If a point falls exactly on a diagonal line number of non-white employed females for
dividing the parts of the graph, it may be the combined population in areas A and D
put in either one of the adjacent parts, but, and the number of females in domestic serv-
of course, cannot be put in both parts.) All ice for the combined population in areas B
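The bounding procedure described in this section can be sketched in code. The following is only a minimal illustration, assuming each ecological area is supplied as a hypothetical triple (population, x, y), with x the proportion non-white and y the proportion in domestic service. The function names and the four toy areas are illustrative assumptions and are not the article's Chicago data.

```python
from math import sqrt

def phi_bounds(areas):
    """Bounds on the joint count and on the fourfold-point correlation.

    areas: list of (n, x, y); n is the area population, x the proportion in
    the first category and y the proportion in the second category.
    (Hypothetical input layout, chosen for this sketch.)
    """
    total = sum(n for n, _, _ in areas)
    # Within each area the joint proportion lies between
    # max(0, y - (1 - x)) and min(x, y).
    low = sum(n * max(0.0, x + y - 1.0) for n, x, y in areas)
    high = sum(n * min(x, y) for n, x, y in areas)
    nx = sum(n * x for n, x, _ in areas)   # marginal total, first category
    ny = sum(n * y for n, _, y in areas)   # marginal total, second category
    denom = sqrt(nx * (total - nx) * ny * (total - ny))

    def phi(n11):
        return (total * n11 - nx * ny) / denom

    return low, high, phi(low), phi(high)

def check_sum(areas):
    """The sum S: x in part A, y in part B, 1 - x in part C, 1 - y in part D."""
    s = 0.0
    for n, x, y in areas:
        if y < 1 - x:
            s += n * (x if y >= x else y)              # parts A and B
        else:
            s += n * ((1 - x) if y < x else (1 - y))   # parts C and D
    return s

# One toy area in each of the parts A, B, C, D (made-up numbers).
areas = [(1000, 0.1, 0.3), (1000, 0.6, 0.2), (1000, 0.8, 0.5), (1000, 0.3, 0.9)]
low, high, phi_low, phi_high = phi_bounds(areas)
```

In this sketch the difference `high - low` plays the role of T, and `check_sum(areas)` plays the role of S, so comparing the two gives the same partial check on the arithmetic that the text describes.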

and C. From the available ecological data for community areas in Chicago, we find that these minimum and maximum numbers are 5,826 and 12,271, respectively. Accordingly, the fourfold-point correlation coefficient is between .13 and .35. The difference, T, between the maximum and minimum numbers, 12,271 - 5,826 = 6,445, can be shown to be equal to the sum, S, of the number of non-white employed females who reside in area A, the number white in area C, the number in domestic service in B, and the number not in domestic service who reside in area D. By obtaining S separately and comparing it with T, we have a partial check on our computations of the minimum and maximum numbers.

It can be seen that T multiplied by the total population is equal to the product of the possible range, R, of the fourfold-point correlation (.35 - .13 = .22) and the square root of the product of the four marginal totals in the fourfold cross-classification table for the population. Thus T = S is directly proportional to R (the constant of proportionality depends on the population marginal totals). Hence the "accuracy" R of the bounds, for a given set of population marginal totals, depends on the magnitude of S, which can be determined very quickly in a rough fashion by an examination of the data described by the graph. These bounds will be quite accurate if all the points in part A of the graph are very close to the vertical line determined by x = 0, the points in part B are close to the horizontal line y = 0, the points in part C are close to the vertical line x = 1, and the points in part D are close to the horizontal line y = 1. The exact value of the individual correlation can be determined (R = 0) from the ecological data if S = 0; i.e., if all the ecological areas can be represented by points on some of the lines that form the sides (boundaries) of the graph. In other words, if in each ecological area employed females were either all white, all non-white, all in domestic service, or all not in domestic service, we should, of course, be able to determine the exact value of the individual correlation from the ecological data alone. However, if each of the areas in, say, part B of the graph becomes more completely white (i.e., the percentage non-white decreases) or more completely non-white, but the areas still remain in part B (i.e., the percentage in domestic service still remains less than the percentage non-white and the percentage white), then the accuracy of the bounds need not be improved unless the percentage in domestic service in these areas also decreased. Using this general approach, an examination of the respective graphs describing the ecological data and a glance at the respective marginal proportions for the total populations would reveal that the bounds for the color-occupation data for the community areas or census tracts would be more accurate than the bounds for the color-illiteracy data obtained on either a divisional or a state basis.

Since it is possible, in computing bounds, to reduce the original ecological data to, at most, four areas (this simplifies somewhat the amount of computation; in any case, very little computation is required), the bounds are based essentially only on the information available for these four combined areas or their respective four points on the graph. The actual distribution of points on the graph is not used except insofar as it supplies data for the combined areas; the accuracy of the bounds depends on how closely these four points "hug" the four sides of the graph.

Further comments on ecological regression.-The regression approach made use of the graph of the ecological data, and the results depended on these data. If some other areal subdivision of the population is of interest, quite different estimates of the slope and y-intercept may be obtained, unless the underlying assumptions of this approach also hold true for this second areal subdivision and the values of $E(p \mid x)$ and $E(r \mid x)$ remain unchanged. Sometimes it is possible to combine areas or to use some other method of defining areas or classes of individuals in order to obtain a new subdivision of the population for which the underlying assumptions of the regression approach are more reasonable than for the original area data. It is not necessary that this new subdivision actually divide the population into mutually exclusive classes or that the entire population be included in the subdivision. For the areas in the new subdivision and for the entire population, the underlying assumptions concerning p and r should be more reasonable. With additional information about the population, it may be possible to determine such a subdivision of the population. If this is the case, the regression methods should be applied to this new subdivision rather than to the original data.

None of the methods discussed here makes much use of the information about the spatial distribution of the areas under consideration. This information and the information concerning the relative population sizes of the areas are not contained in the graph. The information may be of interest in itself and also it would probably be worthwhile to make some use of it in dealing with the present problem. For example, the method mentioned herein for "holding constant the geographic divisions" does make some use of the spatial distribution of the states. The spatial distribution, the population sizes, and any other relevant information should enter into the discussion of whether or not, in a particular case, the underlying assumptions are met. The information concerning relative population sizes can also be utilized (to a certain extent) as "weights" in the weighted linear regression analysis referred to earlier in this article.

Relation between two qualitative variables when one of them is dichotomous.-We shall now consider the situation in which the "individual behavior" described by a $2 \times \beta$ cross-classification table for a population is of direct interest but where the only available data are the marginal totals in the table for the population and the marginal totals for some subdivision of the population. Here both variables in the individual correlation study are qualitative; one of them has two categories (e.g., literate-illiterate) and the other has $\beta$ categories (e.g., Negro-white-"other races"; $\beta = 3$). Using an approach similar to the regression approach described here for the case where $\beta = 2$, we see that the proportion y of individuals in the Negro-white-"other races" population who are illiterate may be written as $y = x_1 p_1 + x_2 p_2 + x_3 p_3$, where $x_1$ is the proportion in the population who are Negro, $p_1$ is the proportion of Negroes who are illiterate, $x_2$ is the proportion in the population who are white, $p_2$ is the proportion of whites who are illiterate, $x_3$ is the proportion in the population who belong to "other races," and $p_3$ is the proportion of people in "other races" who are illiterate. Since $x_1 + x_2 + x_3 = 1$, we have $y = x_1 p_1 + x_2 p_2 + (1 - x_1 - x_2) p_3 = p_3 + (p_1 - p_3) x_1 + (p_2 - p_3) x_2 = a + b_1 x_1 + b_2 x_2$, where $a = p_3$, $b_1 = p_1 - p_3$, and $b_2 = p_2 - p_3$. Hence, if different areas are considered where the proportion $p_1$ is the same for each area, the proportion $p_2$ is the same for each area, and the proportion $p_3$ is the same for each area, then there will be an exact multilinear relationship, $y = a + b_1 x_1 + b_2 x_2$, between the values of y and $x_1$, $x_2$ for the different areas, where the slopes will be $b_1 = p_1 - p_3$ and $b_2 = p_2 - p_3$, and the y-intercept will be $a = p_3$. This multilinear relationship could be used to determine $p_3 = a$, $p_2 = b_2 + a$, and $p_1 = b_1 + a$.

In practice, the actual values of $p_1$, $p_2$, and $p_3$ will not be constant, but it may be the case that the average $E(p_1 \mid x_1, x_2)$ of the values of $p_1$, the average $E(p_2 \mid x_1, x_2)$ of the values of $p_2$, and the average $E(p_3 \mid x_1, x_2)$ of the values of $p_3$ for populations with the same proportions $x_1$ and $x_2$ of Negroes and whites, respectively, are constant. Then the main assumption of multiple linear regression analysis, $E(y \mid x_1, x_2) = A + B_1 x_1 + B_2 x_2$, holds true, where $A = E(p_3 \mid x_1, x_2)$, $B_1 = E(p_1 \mid x_1, x_2) - A$, and $B_2 = E(p_2 \mid x_1, x_2) - A$. Thus standard methods of multiple regression can be used to obtain estimates $a$, $b_1$, $b_2$ of $A$, $B_1$, $B_2$, respectively.29 If the variances of $p_1$, $p_2$, and $p_3$ are not large, then these estimates can be used to obtain the estimates $\hat{p}_3 = a$, $\hat{p}_2 = b_2 + a$, and $\hat{p}_1 = b_1 + a$ of the expected values of $p_3$, $p_2$, and

29 See, e.g., Walker and Lev, op. cit., chap. xiii.
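The multilinear relationship for the three-category case can be illustrated with a short sketch. It assumes, as the text's idealized case does, that the three category rates are exactly constant across areas; the area proportions, the rates, and the helper names `solve` and `least_squares` are hypothetical choices made only for this example, and the normal-equations solver stands in for any standard multiple-regression routine.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def least_squares(rows, y):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Assumed constant illiteracy rates for the three categories (toy values).
p1, p2, p3 = 0.30, 0.05, 0.15
# Hypothetical areas: (x1, x2) = proportions Negro and white; x3 = 1 - x1 - x2.
areas = [(0.10, 0.85), (0.40, 0.55), (0.25, 0.60), (0.05, 0.70), (0.50, 0.45)]
y = [x1 * p1 + x2 * p2 + (1 - x1 - x2) * p3 for x1, x2 in areas]

# Regress y on x1 and x2, then invert a = p3, b1 = p1 - p3, b2 = p2 - p3.
a, b1, b2 = least_squares([[1.0, x1, x2] for x1, x2 in areas], y)
p3_hat, p2_hat, p1_hat = a, b2 + a, b1 + a
```

Because the toy rates are exactly constant, the recovered values agree with the assumed ones up to rounding; with real data the estimates would only approximate the expected values of the rates, as the text notes.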

$p_1$, respectively. The six entries in the $2 \times 3$ cross-classification table describing the relation between literacy and color can then be estimated by a method analogous to that described earlier for the case in which $\beta = 2$. The estimates $\hat{p}_1$, $\hat{p}_2$, $\hat{p}_3$ should be examined to see whether they all lie between 0 and 1. Also the estimated proportion $\hat{Y} = a + b_1 X_1 + b_2 X_2$ of illiterates in the total population (where $X_1$ is the proportion of Negroes in the total population and $X_2$ is the proportion of whites in this population) should be close to the known proportion $Y$ of illiterates in this population, if the method suggested here is to be applied. Many of the comments discussed earlier for the case $\beta = 2$ can be generalized to the situation described in this section. This is left as an exercise for the interested reader.

Relation between two quantitative variables.-Many of the ideas presented above can be applied also in the case in which we are dealing with quantitative variables rather than categories, e.g., income rather than race. Let us consider the situation in which the individual Pearsonian correlation between income and size of family is of interest and the relevant cross-classified data for the entire population are not available. That is, for each individual in the population, information about both his income, x, and size of his family, y, cannot be obtained, but it is possible to determine or to estimate the income distribution and the distribution of size of family for the population (i.e., the marginal totals). If the average income and the average size of family for, say, each of the forty-eight states is known and if there is a linear relationship between income x and average size of family $E(y \mid x)$ when x is given, and it is more or less constant in these states [i.e., $E(y \mid x) = A + Bx$ is true for the individuals in each state and A and B are constant for all states], then it is possible to use these ecological data for each of the states, to estimate the individual correlation for the population. The appropriate estimate of this Pearsonian correlation is obtained by multiplying b, the estimate of the slope of the regression line of y and x obtained from the average income and average family-size data, by the ratio V of the standard deviation of the population income distribution and the standard deviation of the population family-size distribution. [This corresponds to the earlier multiplication of b by $\sqrt{X(1 - X)}/\sqrt{Y(1 - Y)}$.] The usual ecological correlation, which cannot be used in general to estimate the individual correlation, is obtained by multiplying b by the ratio of the observed standard deviation of the distribution of the forty-eight average incomes for the states and the observed standard deviation of the distribution of the forty-eight average family sizes for the states; if Washington, D.C., is included, there will be forty-nine averages. If, as is often the case, this latter ratio is much larger than the ratio V, the usual ecological correlation will overestimate the individual correlation.30 The ecological correlation may serve as a rough measure and partial check on whether the underlying assumption is not satisfied, i.e., whether $E(y \mid x) = A + Bx$, where A and B are constant for all states.

30 See G. Udny Yule and M. G. Kendall, An Introduction to the Theory of Statistics (New York: Hafner Publishing Co., 1950), pp. 313-14, for some related comments.

We have seen that it is not necessary to know the entire distribution of income and the distribution of size of family for the total population, but only the standard deviations of the two, in order to use the ecological data to estimate the individual correlation, since only these standard deviations enter into the computation of the ratio V and of the individual correlation estimate. These standard deviations can be determined from the standard deviations (or variances) and the averages of the respective distributions for the states, since the variance of the population income (family size) distribution is the weighted sum $\sum w(i) M(i)$ of the average squared deviations $M(i)$ of the income (family size) of a person in the ith state from the average income (family size) $\bar{X}$ for the total population, where $w(i)$ is the relative population size of the ith state, $M(i) = \sigma^2(i) + [\bar{x}(i) - \bar{X}]^2$, $\bar{x}(i)$ is the average income (family size) in the ith state, and $\sigma^2(i)$ is the variance of the income (family-size) distribution in the ith state. These data, together with the estimated slope b of the regression line obtained from the data on average income and average family size for the states, lead to the estimate $bV$ of the individual correlation coefficient for the entire population, where there is a constant linear relationship between income x and average family size $E(y \mid x)$ when x is given.

We have seen that the variance $\sigma^2(X)$ of the population income (family size) distribution is $\sum w(i) M(i) = \sum w(i) \sigma^2(i) + \sum w(i) [\bar{x}(i) - \bar{X}]^2$; i.e., the population variance is the sum of two terms: (a) the weighted sum of the variances for the states [the "within states" variance, $WS(X)$] and (b) the weighted sum of the squared deviations of the averages $\bar{x}(i)$ for the states from $\bar{X}$ [the "between states weighted" variance, $BSW(\bar{x})$], where the weights are the relative population sizes. Thus $\sigma^2(X) = WS(X) + BSW(\bar{x})$, and the variance of the family-size distribution is $\sigma^2(Y) = WS(Y) + BSW(\bar{y})$. The usual ecological correlation is

$$b \sqrt{BS(\bar{x})/BS(\bar{y})},$$

where $BS(\bar{x})$ and $BS(\bar{y})$ are the observed (unweighted) variances of the average incomes $\bar{x}(i)$ and the average family sizes $\bar{y}(i)$, respectively, for the states. Since the estimate of the individual correlation is $bV = b \sqrt{\sigma^2(X)/\sigma^2(Y)}$, the ecological correlation will be larger than $bV$ whenever $\sigma^2(X)/\sigma^2(Y) < BS(\bar{x})/BS(\bar{y})$, i.e., whenever $WS(X)/WS(Y) < [BS(\bar{x})/BS(\bar{y})] + E$, where $E = [BSW(\bar{y}) BS(\bar{x}) - BS(\bar{y}) BSW(\bar{x})]/[WS(Y) BS(\bar{y})]$. If the usual ecological correlation is modified to obtain the "weighted" ecological correlation, $b \sqrt{BSW(\bar{x})/BSW(\bar{y})}$, then this "weighted" ecological correlation will be larger than the estimate of the individual correlation whenever $\sigma^2(X)/\sigma^2(Y) < BSW(\bar{x})/BSW(\bar{y})$; thus, whenever the ratio $WS(X)/WS(Y)$ of the "within states" variances is less than the ratio $BSW(\bar{x})/BSW(\bar{y})$ of the "between states weighted" variances.

The ratio of the usual ecological correlation and the estimate of the individual correlation is $\sqrt{BS(\bar{x})/BS(\bar{y})} / \sqrt{\sigma^2(X)/\sigma^2(Y)}$. If the observed variances $BS(\bar{y})$ and $\sigma^2(Y)$ are replaced in the above ratio by their expected values computed under the usual independence assumptions of the linear regression model, where $E(y \mid x) = A + Bx$ for each state and $\sigma^2[y \mid x(i)]$ is the variance around the regression line in the ith state, then the so-called "expected" ratio obtained will be larger than 1 whenever

$$\frac{\sum \{ w(i)\, \sigma^2[y \mid x(i)] \}}{\sigma^2(X)} > \frac{\sum \{ \sigma^2[y \mid x(i)] / w(i) \}}{TN\,[BS(\bar{x})]},$$

where there are T states and a total population size of N individuals; i.e., whenever

$$BS(\bar{x}) > \frac{\sigma^2(X) \sum \{ \sigma^2[y \mid x(i)] / w(i) \}}{TN \sum \{ w(i)\, \sigma^2[y \mid x(i)] \}}.$$

In the special situation where $w(i) = 1/T$, then this "expected" ratio will be larger than 1 whenever $BS(\bar{x}) > \sigma^2(X) T/N$; i.e., whenever $BS(\bar{x})$ is greater than $\sigma^2(X)$ divided by the average population size $N/T$ of the states, which will often be the case. [If the x values observed in each state were a random sample of size $N/T$ from the same population of x values with variance $\sigma^2(X)$, then the expected value of $BS(\bar{x})$ would be approximately $\sigma^2(X)/(N/T)$.] By a similar approach, the "expected" ratio of the "weighted" ecological correlation and the estimate of the individual correlation will be larger than 1 whenever $BSW(\bar{x}) > \sigma^2(X) [\sum \{ \sigma^2[y \mid x(i)] \}] / N \sum \{ w(i)\, \sigma^2[y \mid x(i)] \}$. Where $\sigma^2[y \mid x(i)]$ is the same for each state [or where $w(i) = 1/T$], this "expected" ratio will be larger than 1 whenever $BSW(\bar{x}) > \sigma^2(X) T/N$, which will usually be the case. Since $\sigma^2(X) = WS(X) + BSW(\bar{x})$, the relationships being presented here can be described in terms of relationships between $WS(X)$, $BSW(\bar{x})$, and $BS(\bar{x})$. These relationships indicate to a certain extent why
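The variance decomposition and the two competing coefficients discussed in this section can be illustrated numerically. The sketch below assumes four hypothetical states with made-up weights, means, and within-state variances; the function names are illustrative, and the slope is taken from an unweighted regression of the state averages, which is an assumption of this example rather than a prescription from the article.

```python
from math import sqrt

def decompose(weights, means, variances):
    """Return (WS, BSW, total variance) for one variable.

    weights: relative population sizes w(i), summing to 1
    means, variances: within-state average and variance for each state
    """
    grand = sum(w * m for w, m in zip(weights, means))
    ws = sum(w * v for w, v in zip(weights, variances))
    bsw = sum(w * (m - grand) ** 2 for w, m in zip(weights, means))
    return ws, bsw, ws + bsw

def unweighted_variance(values):
    """Observed (unweighted) variance of the state averages, i.e. BS."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def slope(xs, ys):
    """Unweighted least-squares slope of ys on xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return num / sum((x - mx) ** 2 for x in xs)

# Toy ecological data for four hypothetical states.
w = [0.4, 0.3, 0.2, 0.1]
x_mean, x_var = [2.0, 3.0, 4.0, 5.0], [4.0, 4.0, 4.0, 4.0]   # income
y_mean, y_var = [3.0, 3.2, 3.4, 3.6], [2.0, 2.0, 2.0, 2.0]   # family size

ws_x, bsw_x, var_x = decompose(w, x_mean, x_var)
ws_y, bsw_y, var_y = decompose(w, y_mean, y_var)
b = slope(x_mean, y_mean)

individual = b * sqrt(var_x / var_y)                      # the estimate bV
ecological = b * sqrt(unweighted_variance(x_mean) / unweighted_variance(y_mean))
weighted_ecological = b * sqrt(bsw_x / bsw_y)
```

With these toy figures the within-state variation dominates the total variance, so both ecological coefficients come out well above the estimate bV, which is the pattern the text attributes to the ratio of "within states" to "between states" variances.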

the (usual and the "weighted") ecological correlations are generally larger than the estimate of the individual correlation and thus why they cannot be used as estimates of the individual correlation.

The standard deviation of the estimate of the individual correlation and confidence intervals for it can be determined, when certain additional assumptions are made, by using methods similar to those developed earlier herein; this will not be discussed here. The statements and formulas presented here should be understood to hold when the amount of data is sufficiently large to permit sampling fluctuations to be neglected; i.e., we assume that the estimate b of the slope B is quite accurate and that $\bar{Y}$, the average size of family in the population, is close to the numerical value $a + b\bar{X}$. The comparison between $a + b\bar{X}$ and $\bar{Y}$ can be used as a partial check on the underlying assumptions made in this section (except if the average $\bar{x}(\cdot)$ of the $\bar{x}(i)$ is close to $\bar{X}$ and the average $\bar{y}(\cdot)$ of the $\bar{y}(i)$ is close to $\bar{Y}$) in the same way that the comparison of $\hat{Y}$ and $Y$ was used to check the assumptions made earlier in this article.

We shall now consider briefly what might happen if the method suggested here is applied where the underlying assumptions did not apply. Suppose that the simple linear relationship was not true but that a multilinear relation, $E(y \mid x, z) = A + Bx + Dz$, did hold true for the individuals, where A, B, and D were constant for all states and where z was some relevant variable. In this case the averages $\bar{y}(i)$, $\bar{x}(i)$, and $\bar{z}(i)$ for the ith state are related as follows: $E[\bar{y}(i) \mid \bar{x}(i), \bar{z}(i)] = A + B\bar{x}(i) + D\bar{z}(i)$. Then the standard methods of multiple linear regression can be applied to the ecological data, $\bar{y}(i)$, $\bar{x}(i)$, $\bar{z}(i)$, in order to obtain estimates of the constants A, B, D, in the multilinear equation for the individuals. However, if a simple linear relationship between $\bar{y}(i)$ and $\bar{x}(i)$ is incorrectly assumed and $\bar{z}(i)$ is neglected, then the standard estimates a and b for the regression line between y and x will have the following biases: $E\{b\} - B = D\sigma[\bar{x}(i), \bar{z}(i)]/V[\bar{x}(i)] = D\delta$, $E\{a\} - A = D[\bar{z}(\cdot) - \bar{x}(\cdot)\delta]$, where $\sigma[\bar{x}(i), \bar{z}(i)]$ is the covariance between $\bar{x}(i)$ and $\bar{z}(i)$ for the different states; $V[\bar{x}(i)]$ is the variance of $\bar{x}(i)$ for the states; and $\delta = \sigma[\bar{x}(i), \bar{z}(i)]/V[\bar{x}(i)]$. Thus, if D = 0, the relation between y and x will be linear, and the standard estimates will be unbiased. However, if $D \neq 0$, the b will be biased unless the covariance between $\bar{x}(i)$ and $\bar{z}(i)$ is 0. Even in the situation where b is unbiased but $D \neq 0$, it will not be possible, except under special circumstances, to estimate the individual correlation between y and x from the ecological data $\bar{y}(i)$ and $\bar{x}(i)$, since the individual values of z and their relation to y and x will play an important role in determining this correlation if $D \neq 0$.

Let us now consider the special circumstance in which the individual value z measures a characteristic of the state in which the individual lives (e.g., its size), so that z will be the same for all individuals living in it. The value of D may be known, or it can be estimated, along with A and B, by the usual methods of multiple linear regression applied to the ecological data concerning $\bar{y}(i)$, $\bar{x}(i)$, and $\bar{z}(i)$, thus obtaining the estimates d, a, b. In this case, there will be a simple linear relation between y and x, for the individuals in a given state, but the y-intercept of the line may differ for the different states; i.e., $E(y \mid x) = (A + Dz) + Bx$, where $z = \bar{z}(i)$, may differ from state to state. Thus the regression line for each state can be estimated, and each line will have the same slope, b. If the variances of the y and x measurements, for a given state, are known, then it is possible to estimate the individual correlation coefficient for the population in that state. Furthermore, if these variances are known for each state, then they can be used, together with the values $\bar{y}(i)$, $\bar{x}(i)$, b and the relative size of each state, to estimate the individual correlation coefficient for the total population. Hence it is possible to obtain an estimate of the individual correlation coefficient for a population from the ecological data, even if there is no constant linear relationship, $E(y \mid x) = A + Bx$, as long as the
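The bias expressions for the simple regression that neglects z can be checked on a small noise-free example. The sketch below assumes hypothetical state averages and constants A, B, D chosen only for illustration; with exact (error-free) data the differences b - B and a - A equal the stated bias terms algebraically.

```python
def mean(values):
    return sum(values) / len(values)

def cov(u, v):
    """Unweighted covariance of two equally long lists of state averages."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

# Assumed constants of the multilinear relation E(y | x, z) = A + Bx + Dz.
A, B, D = 1.0, 0.5, 2.0
# Hypothetical state averages of x and z (toy values).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
z = [0.2, 0.1, 0.5, 0.4, 0.9]
y = [A + B * xi + D * zi for xi, zi in zip(x, z)]   # noise-free state averages

delta = cov(x, z) / cov(x, x)        # covariance of x and z over the variance of x

# Simple regression of y on x alone, with z neglected.
b = cov(x, y) / cov(x, x)
a = mean(y) - b * mean(x)

slope_bias = b - B                   # equals D * delta
intercept_bias = a - A               # equals D * (mean(z) - delta * mean(x))
```

Setting D to zero, or choosing z values uncorrelated with x, makes `slope_bias` vanish in this sketch, which mirrors the two cases distinguished in the text.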

situation is such that the slope B remains the same in the different states, while the y-intercept may differ from state to state in a way that is linearly related to some measured characteristic, z, of the state.

Relation between two dichotomous variables.-The point of view described at the end of the preceding section can be applied to show that if the average $E(r \mid x)$ of the values of the proportion r of whites who are illiterate, for states with the same proportion x of Negroes, is a linear function of a measurable characteristic z of each state [i.e., $E(r \mid x) = C + Fz$] and if the difference between the average $E(p \mid x)$ of the values of the proportion p of Negroes who are illiterate (for states with the same proportion x of Negroes) and $E(r \mid x)$ is constant [i.e., $E(p \mid x) - E(r \mid x) = B$], then the average of the values of the proportion y of illiterates (for states with the same proportion x of Negroes) is equal to $E(y \mid x) = C + Fz + Bx$. The special situation where F = 0 has been studied earlier in this article. By standard methods of multiple regression applied to the ecological data (i.e., to the proportions y and x and the value of z for each state), estimates c, f, and b of C, F, and B, respectively, can be obtained, which can then be used to obtain the estimates $\hat{r} = c + fz$ and $\hat{p} = b + c + fz$ of $E(r \mid x)$ and $E(p \mid x)$, respectively, for each state. These estimates $\hat{r}$ and $\hat{p}$ can be used along with the values of x and the size of the population of each state to estimate the four entries in the $2 \times 2$ cross-classification table describing the relation between the two dichotomous variables, race and illiteracy, for each state. These tables for the separate states can then be combined to estimate the four table entries for the total population; thus an estimate of the individual correlation between race and illiteracy can be obtained for the total population.

The magnitude of the estimate b of $B = E(p \mid x) - E(r \mid x)$, the average difference between the illiteracy rates for whites and the rates for Negroes for states having the same proportion x of Negroes, might be interpreted as the "effect of race on illiteracy," while the magnitude of the estimate f of F might be interpreted as the "effect of z on the illiteracy of whites" (z might measure average income, average social status, per cent unemployed, etc., for each state). It should be noted that z cannot be taken equal to x (neither can z be a linear function of x), unless some additional assumptions are made, because if z = x, then $E(y \mid x) = C + (B + F)x$, in which case B + F and C can be estimated by the methods of linear regression applied to y and x, but it will not be possible to obtain separate estimates of B and F unless additional assumptions are made about their relative magnitudes. For example, if the additional assumption that F = 0 is made (i.e., that the "effect of z on the illiteracy of whites" is zero), then the methods developed earlier can be utilized; but if the assumption that B = 0 is made (i.e., that the "effect of race on illiteracy" is zero), then the table entries for each state can be estimated as described in the preceding paragraph and these table entries for the states can then be combined to estimate the individual correlation for the total population. In this particular example in which z = x, if F = 0, then the effect of the percentage of population which is Negro in a state on the illiteracy rate for the whites there is zero, while if B = 0, then the average difference between the illiteracy rate for Negroes and the rate for whites is zero in states having the same proportion x of Negroes. In this situation, where B = 0, the estimated individual correlation between race and illiteracy computed for each state will be zero, but the individual correlation estimated for the total population may not be zero unless F = 0 as well. Since it is possible to obtain an exact linear relationship between y and x when either F = 0 or B = 0 (or even when neither F nor B equals zero), it is not possible to decide on the basis of the ecological data concerning y and x whether it should be assumed that F = 0, that B = 0, or that the ratio B/F is a known constant. The research worker will require additional data to help him choose between these models and the assumptions underlying them. This is an important choice, since they lead to different
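The estimation steps for the model with a state characteristic z can be sketched as follows. This is only an illustration under stated assumptions: the five states, their populations, and the constants C, F, B are made up, the data are generated without error so that the regression recovers the constants exactly, and the helper names (`solve`, `least_squares`, `phi`) are hypothetical. The normal-equations solver stands in for any standard multiple-regression routine.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def least_squares(rows, y):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

def phi(n11, n12, n21, n22):
    """Fourfold-point correlation of a 2 x 2 table of counts."""
    r1, r2 = n11 + n12, n21 + n22
    c1, c2 = n11 + n21, n12 + n22
    return (n11 * n22 - n12 * n21) / (r1 * r2 * c1 * c2) ** 0.5

# Hypothetical states: (population, x = proportion Negro, z = state characteristic).
states = [(1000, 0.10, 1.0), (2000, 0.30, 2.0), (1500, 0.20, 0.5),
          (500, 0.40, 1.5), (1200, 0.05, 3.0)]
C, F, B = 0.02, 0.01, 0.10           # assumed constants for the toy data
y = [C + F * z + B * x for _, x, z in states]   # E(y | x) = C + Fz + Bx

# Multiple regression of y on z and x gives the estimates c, f, b.
c, f, b = least_squares([[1.0, z, x] for _, x, z in states], y)

# Build each state's 2 x 2 table and add the tables together.
# Order: Negro illiterate, Negro literate, white illiterate, white literate.
table = [0.0, 0.0, 0.0, 0.0]
for n, x, z in states:
    r_hat = c + f * z                # estimated illiteracy rate for whites
    p_hat = b + c + f * z            # estimated illiteracy rate for Negroes
    table[0] += n * x * p_hat
    table[1] += n * x * (1 - p_hat)
    table[2] += n * (1 - x) * r_hat
    table[3] += n * (1 - x) * (1 - r_hat)

estimated_phi = phi(*table)
```

The design choice here is to sum the estimated state tables before computing the coefficient, following the order of steps in the text; computing a coefficient per state and averaging would answer a different question.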

methods of analysis of the data and also to different interpretations of the results. It was assumed earlier that F = 0, and the methods described in that case led to estimates of the individual correlations which were different from what they would have been if it had been assumed that B = 0 or if it had been assumed that the ratio B/F was a known constant.

Let us now consider the situation where E(r | x) = C + Fz and E(p | x) = G + Hz. [The preceding comments in this section dealt with the special situation where H = F, so that E(p | x) - E(r | x) = G - C = B.] In this case, E(y | x) = C + Fz + [G - C + (H - F)z]x = C + Fz + (G - C)x + (H - F)zx, and a multiple regression analysis of the ecological variable y on the three variables z, x, and zx will lead to estimates of C, F, G - C, and H - F. These estimates can be used to obtain estimates of C, F, G, and H, which in turn can be used along with the values of z for each state to estimate E(r | x) and E(p | x) for each state. From these estimates and the values of x and the size of the population of each state, the four entries in the 2 × 2 cross-classification table describing the relation between race and illiteracy in each state can be estimated, and the table entries for the states can be combined to estimate the table entries for the total population, thus providing an estimate of the individual correlation for the total population. If z = x (or if z is a linear function of x), then the methods described in this paragraph cannot be applied unless some specific additional assumptions about the relationships between the constants are made, similar to those mentioned in the preceding paragraph.³¹

Let us now consider the situation in which E(p | x)/E(r | x) = J is constant for the different values of x and E(r | x) = C + Fx. In this case, E(y | x) = E(r | x) + E(r | x)[J - 1]x = [C + Fx][1 + (J - 1)x] = C + [F + C(J - 1)]x + F(J - 1)x², and a multiple regression analysis of y on the variables x and x² will lead to estimates of C, F + C(J - 1), and F(J - 1).³² These estimates can then be used to estimate C, F, and J. With these estimates and the values of x and the size of the population of each state, it is possible to estimate the entries in the cross-classification table describing the relation between race and illiteracy for each state and then to combine these tables for the separate states to obtain an estimate of the cross-classification table for the total population. It is possible to perform a rough test of whether F(J - 1) = 0 by applying the standard test that the regression of y on x is linear rather than quadratic.³³ If F = 0, then the methods developed here earlier may be appropriate, while if J - 1 = 0 (i.e., the average illiteracy rate for Negroes equals the average rate for whites in states having the same proportion x of Negroes), then the method described in this paragraph can be applied. If F(J - 1) = 0, the decision as to whether to assume F = 0 or J - 1 = 0 should depend on the research worker's available knowledge or on some additional data related, directly or indirectly, to the magnitudes of F and J - 1. The magnitude of J - 1 may be interpreted, for the model under consideration, as the "effect of race on illiteracy." For this model, the scatter diagram of y on x can suggest whether (1) both F and J - 1 are different from zero (if the relationship between y and x is not linear, but it can be fitted by a second-degree polynomial in x); (2) either F or J - 1 is different from zero but not both (if the relationship is linear but the slope of the line is not zero); or (3) both F and J - 1 are equal to zero (if the relationship is linear with a slope of zero). The extent to which the scatter diagram can be fitted by a first- or second-degree polynomial in x can serve as a partial check on the assumptions underlying the methods described here. This is no more than a partial check, since, as we had seen earlier in this article, several different models may lead to a specified relationship between y and x, and the methods applied to the ecological data will depend very much on which model is chosen.

If E(p | x)/E(r | x) = J and the relation between E(r | x) and x is E(r | x) = C + Fx + Kx² or some more complicated relation, it is still possible to use a method similar to the one given in the preceding paragraph, in order to estimate the constants C, F, K, J and then to use these estimates to estimate the individual correlation between race and illiteracy for each state separately and also for the total population. If E(p | x)/E(r | x) = J and E(r | x) = C + Fz, where z is some measurable characteristic of each state, then it is also possible to estimate the constants C, F, and J from the relation E(y | x) = (C + Fz)[1 + (J - 1)x] = C + Fz + C(J - 1)x + F(J - 1)zx, a multilinear relation between y and the variables z, x, and zx. As a partial check on this model, the relations between the four constants in the multilinear relation can be examined to see whether they lead to a consistent set of estimates of the three constants C, F, and J.

The research worker who uses the methods described herein should be aware of the underlying assumptions of each method and should take advantage of all possible partial checks on them. The choice between the various models described here should be made on the basis of the research worker's knowledge or on some additional data pertaining to the underlying assumptions of the models.

UNIVERSITY OF CHICAGO

³¹ See Duncan, Cuzzort, and Duncan, op. cit., for some related comments.

³² See, e.g., George W. Snedecor, Statistical Methods (4th ed., 4th printing; Ames, Iowa: Iowa State College Press, 1950), Sec. 14.3, pp. 379-82, for a description of curvilinear regression methods for a second-degree polynomial.

³³ See ibid., pp. 381-84.
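As an illustration only, the two ecological regressions discussed in this section (y on z, x, and zx; and y on x and x²) might be sketched in Python with NumPy as follows. Everything named here is an assumption made for the sketch and does not come from the article: the function names are hypothetical, the state data are synthetic and noise-free, and the fourfold-point (phi) coefficient is assumed to be the measure of individual correlation for a 2 × 2 table. One point worth noting for the quadratic case: the equations b1 = F + C(J - 1) and b2 = F(J - 1) make F and C(J - 1) the two roots of t² - b1·t + C·b2 = 0, so they generally admit two candidate solutions for (F, J). The sketch returns both and leaves the choice to outside knowledge.

```python
import numpy as np


def fit_linear(y, columns):
    """Least-squares fit of y on an intercept plus the given columns."""
    design = np.column_stack([np.ones_like(y)] + list(columns))
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def constants_from_zx_model(y, x, z):
    """Model E(r|x) = C + F z, E(p|x) = G + H z: regress y on z, x, zx."""
    c, f, g_minus_c, h_minus_f = fit_linear(y, [z, x, z * x])
    return c, f, c + g_minus_c, f + h_minus_f


def constants_from_quadratic_model(y, x):
    """Model E(r|x) = C + F x, E(p|x) = J E(r|x): regress y on x, x^2.

    b1 = F + C(J-1) and b2 = F(J-1), so F and C(J-1) are the two roots of
    t^2 - b1 t + C b2 = 0. Both assignments are returned as (C, F, J).
    """
    c, b1, b2 = fit_linear(y, [x, x ** 2])
    disc = b1 ** 2 - 4.0 * c * b2
    if disc < 0 or c == 0:
        raise ValueError("no real solution for F and J under this model")
    roots = [(b1 + np.sqrt(disc)) / 2.0, (b1 - np.sqrt(disc)) / 2.0]
    return [(c, f, 1.0 + (b1 - f) / c) for f in roots]


def pooled_table(x, r_hat, p_hat, pop):
    """Sum the estimated 2 x 2 tables (race by illiteracy) over states."""
    return np.array([
        [np.sum(pop * x * p_hat), np.sum(pop * x * (1 - p_hat))],
        [np.sum(pop * (1 - x) * r_hat), np.sum(pop * (1 - x) * (1 - r_hat))],
    ])


def phi(table):
    """Fourfold-point (phi) correlation of a 2 x 2 table."""
    (a, b), (c, d) = table
    return (a * d - b * c) / np.sqrt((a + b) * (c + d) * (a + c) * (b + d))


# Hypothetical, noise-free state data (not taken from the article).
x = np.array([0.05, 0.10, 0.20, 0.30, 0.40, 0.50])  # proportion Negro
z = np.array([0.9, 0.4, 0.7, 0.2, 0.6, 0.3])        # some state characteristic
pop = np.array([1e6, 2e6, 1.5e6, 3e6, 2.5e6, 1e6])  # state populations

# Regression of y on z, x, zx with assumed true constants C, F, G, H.
y_zx = (0.02 + 0.05 * z) + ((0.10 + 0.20 * z) - (0.02 + 0.05 * z)) * x
c_hat, f_hat, g_hat, h_hat = constants_from_zx_model(y_zx, x, z)
table = pooled_table(x, c_hat + f_hat * z, g_hat + h_hat * z, pop)
phi_total = phi(table)

# Regression of y on x, x^2 with assumed true C = 0.05, F = 0.02, J = 4.
y_quad = (0.05 + 0.02 * x) * (1.0 + (4.0 - 1.0) * x)
solutions = constants_from_quadratic_model(y_quad, x)

print("C, F, G, H:", c_hat, f_hat, g_hat, h_hat)
print("phi for total population:", phi_total)
print("candidate (C, F, J):", solutions)
```

With real ecological data the fitted coefficients would carry sampling error, so the recovered constants (and the choice between the two quadratic-model solutions) should be treated with the same caution the text recommends for the underlying assumptions.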
