LEO A. GOODMAN
ABSTRACT
Under certain specified conditions, ecological data, e.g., the percentage of non-whites and the percentage in domestic service for different community areas, can be used to estimate the "individual correlation" between two dichotomous classifications, e.g., the non-white-white classification and the domestic service-other than domestic service classification. Quite accurate estimates are obtained for some data on the relation between color and occupation, somewhat less accurate estimates on the relation between color and literacy. For some situations where the specific conditions here described are not met, other conditions are presented that lead to different methods for estimating the individual correlation from the ecological data. These methods are developed for the study of the individual correlation between two dichotomous variables, between two qualitative variables where one of them is dichotomous, and between two quantitative variables.
The present article discusses some of the results reported in Robinson's paper on ecological correlation² and explores further the suggestions in earlier notes by the present author³ and by Duncan and Davis.⁴ The terminology used in these earlier papers will, for the sake of convenience, be used here, although it does have some disadvantages. "In an ecological correlation . . . the variables are . . . descriptive properties of groups. . . . An individual correlation is a correlation in which the . . . variables are descriptive properties of individuals. . . ."⁵ The phrase "behavior of individuals" referred to the variables describing properties of individuals, while "ecological data" referred to the ecological variables describing properties of groups. An "ecological regression" study is a standard regression analysis for ecological variables. The problems of "aggregation," as discussed in some of the economics literature,⁶ are related somewhat to the mathematical problems that have appeared in the discussion of ecological and individual correlations, although the terminology of this literature is quite different from that of the papers referred to earlier.

The variables in an ecological correlation are usually quantitative (e.g., percentages or means for each of the 48 states), while the variables in an individual correlation may be qualitative (e.g., race of each individual) or quantitative (e.g., height of each individual). The ecological correlation coefficient used in the earlier papers was the Pearsonian correlation coefficient for the joint distribution of two quantitative ecological variables. These papers dealt mainly, though not exclusively, with the situation in which both variables in the individual correlation study were qualitative and dichotomous and the individual correlation coefficient used was the fourfold-point ($\phi$) correlation coefficient for the cross-classification table describing the joint distribution of the two dichoto-

1 The research was carried out at the Statistical Research Center, University of Chicago, under the sponsorship of the Statistics Branch, Office of Naval Research, and of the Social Science Research Committee, University of Chicago. Reproduction in whole or in part is permitted for any purpose of the United States government. I am indebted to R. Blough, who assisted with the numerical computations; and to O. D. Duncan, P. F. Lazarsfeld, J. S. Coleman, Z. Griliches, Y. Grunfeld, P. M. Hauser, P. H. Rossi, and H. Zeisel for helpful comments.
2 W. S. Robinson, "Ecological Correlations and the Behavior of Individuals," American Sociological Review, XV (1950), 351-57.
3 Leo A. Goodman, "Ecological Regression and Behavior of Individuals," American Sociological Review, XVIII (1953), 663-64.
4 Otis Dudley Duncan and Beverly Davis, "An Alternative to Ecological Correlation," American Sociological Review, XVIII (1953), 665-66.
5 Robinson, op. cit., p. 351.
6 For example, H. Theil, Linear Aggregations of Economic Relations (Amsterdam: North-Holland Publishing Co., 1954).
p and r values differ widely in the different areas) than a function of color, the methods presented here are not recommended; however, in some situations the variance of p and r may be sufficiently small for them to be applicable, while in others the variance of p and r for a particular subset of ecological areas (e.g., for the states in a given geographic division of the United States) or for a set of combined ecological areas (e.g., for the nine geographic divisions of the United States) may be sufficiently small for the present method to be applied to the subset of ecological areas or to the set of combined areas. If the variance of p and r for the states in each division of the United States is small, then the methods may be applied to obtain separate estimates for each division, thus obtaining estimates where, in a certain sense, geographic divisions have been held constant.

If the scatter diagram of y and x does not suggest a linear relation between y and x, then the present strategy is not recommended (unless the scatter diagram of y and x for a subset of the areas or for a set of combined areas suggests linearity). If the scatter diagram does suggest a linear relation, then it may be applicable, but it is still possible that the variances of p and r are large. In this case, the present method leads to estimates of $E(p \mid x)$ and $E(r \mid x)$ when these average (expected) values are constant, but the estimate of the individual correlation for the total population based on these estimates of expected values may be quite poor if the variance of p and/or r is large.

From the scatter diagram of the per cent Negro and per cent illiterate in different areas, the slope B and the y-intercept A can be estimated by the usual methods of linear regression, obtaining the estimates b and a, respectively. Then $E(r \mid x)$ and $E(p \mid x)$ can be estimated by $\hat{r} = a$ and $\hat{p} = b + \hat{r} = b + a$, respectively. The numbers of illiterate Negroes, illiterate whites, literate Negroes, and literate whites (i.e., the four entries in the fourfold "individual behavior" table) are estimated by $\hat{p}NX$, $\hat{r}N(1 - X)$, $(1 - \hat{p})NX$, and $(1 - \hat{r})N(1 - X)$, respectively, where NX is the total number of Negroes.

Since p and r lie between 0 and 1, it is desirable that the estimates $\hat{p}$ and $\hat{r}$ also lie between 0 and 1. When this is not the case, the underlying assumptions should be re-examined, although it is possible to obtain such estimates even if these assumptions are satisfied. A method for dealing with this situation was suggested in the author's earlier note.

The estimated proportion $\hat{Y} = a + bX = \hat{p}X + \hat{r}(1 - X)$ of illiterates in the Negro-white population should be close to the known proportion, Y, of illiterates. If this is not the case for a given set of data, this method is not recommended. In the special case where Y is, in fact, equal to the average, $\bar{y}$, of the illiteracy proportions in the various ecological areas and X is equal to the average, $\bar{x}$, of the proportions Negro in the ecological areas, then this check on the underlying assumptions of the method does not apply, since in this case $\hat{Y} = a + b\bar{x} = \bar{y} = Y$, even if the assumptions are not met. A method for determining roughly whether or not $\hat{Y}$ is sufficiently close to Y will be mentioned later in this section.

Rather than compute the correlation coefficient c directly for the estimated fourfold table by the usual formula, a simplified formula is $b\sqrt{X(1 - X)/\hat{Y}(1 - \hat{Y})}$. Since $\hat{Y}$ will be close to Y when this general approach is applicable, it will not matter much whether or not $\hat{Y}$ is replaced by the known proportion Y. Thus an estimate of the fourfold-point correlation is

$c = b\sqrt{X(1 - X)/Y(1 - Y)}$.

Following standard correlation theory, the ecological correlation can be computed by multiplying b by the ratio of the standard deviation of the proportions of Negroes in the ecological areas and the standard deviation of the proportions of illiterates there. Since this ratio will usually be very different from $\sqrt{X(1 - X)/Y(1 - Y)}$ when ecological data are used, the ecological and individual correlations will usually be very differ-
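The regression steps described on this page can be sketched in a few lines of Python. This is only an illustrative sketch, not the author's computation: the function names `fit_line` and `ecological_regression`, the dictionary keys, and the four hypothetical areas are assumptions introduced here, and an unweighted least-squares fit is assumed. The sketch estimates a and b, forms $\hat{r} = a$ and $\hat{p} = a + b$, fills in the estimated fourfold table, and computes c from the simplified formula, using the known Y when it is supplied and $\hat{Y}$ otherwise.

```python
from math import sqrt


def fit_line(xs, ys):
    """Unweighted least-squares intercept a and slope b of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


def ecological_regression(xs, ys, X, N, Y=None):
    """Estimate r, p, the fourfold table, and c from area proportions.

    xs, ys: proportion Negro and proportion illiterate in each area.
    X, N:   proportion Negro in, and size of, the total population.
    Y:      known proportion illiterate (optional; Y_hat is used if absent).
    """
    a, b = fit_line(xs, ys)
    r_hat = a          # estimate of E(r | x)
    p_hat = a + b      # estimate of E(p | x)
    Y_hat = a + b * X  # estimated proportion illiterate in the population
    Y_used = Y_hat if Y is None else Y
    table = {
        "illiterate_negro": p_hat * N * X,
        "illiterate_white": r_hat * N * (1 - X),
        "literate_negro": (1 - p_hat) * N * X,
        "literate_white": (1 - r_hat) * N * (1 - X),
    }
    c = b * sqrt(X * (1 - X) / (Y_used * (1 - Y_used)))
    return {"a": a, "b": b, "r_hat": r_hat, "p_hat": p_hat,
            "Y_hat": Y_hat, "table": table, "c": c}


# Hypothetical areas (not data from the article) in which p = .30 and
# r = .10 hold exactly, so that y = .10 + .20x in every area.
xs = [0.1, 0.2, 0.4, 0.5]
ys = [0.10 + 0.20 * x for x in xs]
result = ecological_regression(xs, ys, X=0.3, N=1000)
print(result["p_hat"], result["r_hat"], result["Y_hat"], result["c"])
```

In practice the checks described in the text would follow: whether `p_hat` and `r_hat` lie between 0 and 1, and whether `Y_hat` is close to the known Y.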
must lie in the interval between .13 and .35, and the estimate c = .25 is quite close to the known value of .29. The computation of the bounds may be used as a third partial check on the estimate c, as well as for determining the possible range of the individual correlations; we would not have recommended that c be used if it had not been within the interval determined by the bounds.

Duncan and Davis also determine that the possible range of the percentage of non-whites in domestic service, based on the community area data, is between 21.1 and 44.5 per cent. It can be seen that the ecological regression method leads to the quite accurate estimate $\hat{p}$ = .34 in this particular case; the known value of this proportion is .38.

TABLE 1

Parameters        True Values    Estimates    Estimated Standard Deviations
B                 .32            .27          .03
E(p | x)          .38            .34          .02
E(r | x)          .06            .07          .01
Y                 .08            .08          .01
c                 .29            .25          .03

In this particular illustration, since Y and X differed from $\bar{y}$ and $\bar{x}$, respectively, there were three partial checks on the underlying assumptions, while in the preceding example, there were, in effect, only two. For the present illustration, the results may be summarized in Table 1, where the estimates are shown to compare favorably with their respective known true values and where the estimated standard deviations, computed from the formulas given earlier, are also presented. Since the true value of Y is known from the ecological data, a rough comparison between $\hat{Y}$ and Y can be made by using the information about the estimate of $\sigma(\hat{Y})$, another partial check on the assumptions underlying the present method.

The method of obtaining bounds.-Robinson and also Duncan and Davis point out that different systems of areal subdivision give different results. Duncan and Davis mention that, for their illustrative material, substantially closer bounds to the individual correlation can be derived from the marginal frequencies for the census tracts than from the marginals for the community areas (which are combinations of tracts) and that the criterion for choice between the results of different systems of areal subdivisions is clear: "The individual correlation is approximated most closely by the least maximum and the greatest minimum among the results for several systems of areal subdivisions." Where one areal subdivision (e.g., community areas) represents a combination of another areal subdivision (census tracts), the least maximum and greatest minimum are obtained from the finer areal subdivision.²⁸ Thus, if the best bounds are desired, it is necessary only to compute the bounds for the finer subdivision. However, it is possible to combine the areas of the finer subdivision into not more than four combined areas, where all the areas in a given combined area are similar (in a sense to be defined in the following paragraph), so that the bounds computed by using only the data for the less fine subdivision will be equal to the best bounds determined by the finer subdivision.

28 See reference to this finding in Selvin, op. cit., pp. 616-18.

Duncan and Davis indicate that substantially closer bounds are obtained for their data when the finer areal subdivision is used. It may sometimes happen that there is little or no difference between the bounds for the finer subdivision and the less fine subdivision. All tracts in which the number of non-white employed females was not more than the number of females in domestic service and the number of females in domestic service was not more than the number of white employed females can be combined into a single area without affecting the bounds (except for rounding errors); tracts in which the number of non-whites was not less than the number in domestic service and the number in domestic service was not more than the number of whites can be combined without affecting the bounds; etc. Tracts that can be combined in this way will be called "similar" (this definition of "similarity" is convenient in this particular problem but not necessarily so in other quite different problems). Thus the fact that substantially closer bounds were obtained when census tracts rather than community areas were used indicates that some of the tracts that form a given community area were not "similar."

The color-illiteracy data for the forty-eight states and Washington, D.C., indicate that the nine geographic divisions were combinations of areas that were, in fact, quite "similar." More specifically, the number of Negroes was not more than the number of illiterates, and the number of illiterates was not more than the number of whites (or the number of Negroes was not less than the number of illiterates, and the number of illiterates was not more than the number of whites) in almost all the areas that were combined to form a given division. Only seven states had been combined with areas that differed in this respect. Thus the bounds (-.07 and +.60) for the individual color-illiteracy correlation based on the data for the nine geographic divisions differ only slightly from the bounds (-.07 and +.58) based on the data for the forty-eight states and Washington, D.C. The method of combining areas so that the same bounds are obtained for the combined areas as for the finer subdivision is as follows: Draw two lines $y = x$ and $y = 1 - x$ on the graph of the scatter diagram of the observed ecological variables y and x. This divides the graph into four parts: A, B, C, D, where A contains all those points representing areas where the number of non-white employed females was not more than the number of females in domestic service and the number of females in domestic service was not more than the number of white employed females ($x < y < 1 - x$); B contains those points where $x > y < 1 - x$; C contains those points where $x > y > 1 - x$; and D contains those points where $x < y > 1 - x$. (If a point falls exactly on a diagonal line dividing the parts of the graph, it may be put in either one of the adjacent parts, but, of course, cannot be put in both parts.) All the points, or the ecological areas that they represent, that appear in the same part of the graph can be combined to form a single combined area, thus obtaining four areas: A, B, C, D. Bounds for the individual correlation computed by using the ecological data for the combined areas will yield the same result as the bounds computed by using the data for each of the points on the graph; i.e., the bounds computed by using each of the points on the graph could actually be computed from, at most, four points, A, B, C, D, placed on the graph, where point A is a weighted average (weighted by relative population size) of the points in part A of the graph, etc.

The bounds for the individual correlation are determined by first calculating the minimum number of non-white females that are in domestic service; based on the available ecological data, this can be seen to be 0 for area A (or for any ecological area represented by a point in part A of the graph) and also for area B (i.e., the areas where $y < 1 - x$); it is equal to the difference between the number of females in domestic service and the number of white employed females for area C and also for area D (i.e., the areas where $y > 1 - x$). The maximum number of non-white employed females in domestic service can be seen to be equal to the number of non-white employed females for area A and also for area D (i.e., the areas where $y > x$); it is equal to the number of females in domestic service for area B and also for C (i.e., the areas where $y < x$). Thus, at a minimum, the number of non-white females in domestic service for the total population under consideration (i.e., the combination of areas A, B, C, and D) will be equal to the difference between the number of females in domestic service and the number of white employed females for the combined population in areas C and D. At a maximum, the total number of non-white females in domestic service will be equal to the sum of the number of non-white employed females for the combined population in areas A and D and the number of females in domestic service for the combined population in areas B
and C. From the available ecological data for community areas in Chicago, we find that these minimum and maximum numbers are 5,826 and 12,271, respectively. Accordingly, the fourfold-point correlation coefficient is between .13 and .35. The difference, T, between the maximum and minimum numbers, 12,271 - 5,826 = 6,445, can be shown to be equal to the sum, S, of the number of non-white employed females who reside in area A, the number white in area C, the number in domestic service in B, and the number not in domestic service who reside in area D. By obtaining S separately and comparing it with T, we have a partial check on our computations of the minimum and maximum numbers.

It can be seen that T multiplied by the total population is equal to the product of the possible range, R, of the fourfold-point correlation (.35 - .13 = .22) and the square root of the product of the four marginal totals in the fourfold cross-classification table for the population. Thus T = S is directly proportional to R (the constant of proportionality depends on the population marginal totals). Hence the "accuracy" R of the bounds, for a given set of population marginal totals, depends on the magnitude of S, which can be determined very quickly in a rough fashion by an examination of the data described by the graph. These bounds will be quite accurate if all the points in part A of the graph are very close to the vertical line determined by x = 0, the points in part B are close to the horizontal line y = 0, the points in part C are close to the vertical line x = 1, and the points in part D are close to the horizontal line y = 1. The exact value of the individual correlation can be determined (R = 0) from the ecological data if S = 0; i.e., if all the ecological areas can be represented by points on some of the lines that form the sides (boundaries) of the graph. In other words, if in each ecological area employed females were either all white, all non-white, all in domestic service, or all not in domestic service, we should, of course, be able to determine the exact value of the individual correlation from the ecological data alone. However, if each of the areas in, say, part B of the graph becomes more completely white (i.e., the percentage non-white decreases) or more completely non-white, but the areas still remain in part B (i.e., the percentage in domestic service still remains less than the percentage non-white and the percentage white), then the accuracy of the bounds need not be improved unless the percentage in domestic service in these areas also decreased. Using this general approach, an examination of the respective graphs describing the ecological data and a glance at the respective marginal proportions for the total populations would reveal that the bounds for the color-occupation data for the community areas or census tracts would be more accurate than the bounds for the color-illiteracy data obtained on either a divisional or a state basis.

Since it is possible, in computing bounds, to reduce the original ecological data to, at most, four areas (this simplifies somewhat the amount of computation; in any case, very little computation is required), the bounds are based essentially only on the information available for these four combined areas or their respective four points on the graph. The actual distribution of points on the graph is not used except insofar as it supplies data for the combined areas; the accuracy of the bounds depends on how closely these four points "hug" the four sides of the graph.

Further comments on ecological regression.-The regression approach made use of the graph of the ecological data, and the results depended on these data. If some other areal subdivision of the population is of interest, quite different estimates of the slope and y-intercept may be obtained, unless the underlying assumptions of this approach also hold true for this second areal subdivision and the values of $E(p \mid x)$ and $E(r \mid x)$ remain unchanged. Sometimes it is possible to combine areas or to use some other method of defining areas or classes of individuals in order to obtain a new subdivision of the population for which the underlying assumptions of the regression approach are more reasonable than for the original area data. It is not necessary that this new subdivision actually divide the population into mutually exclusive classes or that the entire population be included in the subdivision. For the areas in the new subdivision and for the entire population, the underlying assumptions concerning p and r should be more reasonable. With additional information about the population, it may be possible to determine such a subdivision of the population. If this is the case, the regression methods should be applied to this new subdivision rather than to the original data.

None of the methods discussed here makes much use of the information about the spatial distribution of the areas under consideration. This information and the information concerning the relative population sizes of the areas are not contained in the graph. The information may be of interest in itself and also it would probably be worthwhile to make some use of it in dealing with the present problem. For example, the method mentioned herein for "holding constant the geographic divisions" does make some use of the spatial distribution of the states. The spatial distribution, the population sizes, and any other relevant information should enter into the discussion of whether or not, in a particular case, the underlying assumptions are met. The information concerning relative population sizes can also be utilized (to a certain extent) as "weights" in the weighted linear regression analysis referred to earlier in this article.

Relation between two qualitative variables when one of them is dichotomous.-We shall now consider the situation in which the "individual behavior" described by a $2 \times \beta$ cross-classification table for a population is of direct interest but where the only available data are the marginal totals in the table for the population and the marginal totals for some subdivision of the population. Here both variables in the individual correlation study are qualitative; one of them has two categories (e.g., literate-illiterate) and the other has $\beta$ categories (e.g., Negro-white-"other races"; $\beta = 3$). Using an approach similar to the regression approach described here for the case where $\beta = 2$, we see that the proportion y of individuals in the Negro-white-"other races" population who are illiterate may be written as $y = x_1 p_1 + x_2 p_2 + x_3 p_3$, where $x_1$ is the proportion in the population who are Negro, $p_1$ is the proportion of Negroes who are illiterate, $x_2$ is the proportion in the population who are white, $p_2$ is the proportion of whites who are illiterate, $x_3$ is the proportion in the population who belong to "other races," and $p_3$ is the proportion of people in "other races" who are illiterate. Since $x_1 + x_2 + x_3 = 1$, we have $y = x_1 p_1 + x_2 p_2 + (1 - x_1 - x_2) p_3 = p_3 + (p_1 - p_3) x_1 + (p_2 - p_3) x_2 = a + b_1 x_1 + b_2 x_2$, where $a = p_3$, $b_1 = p_1 - p_3$, and $b_2 = p_2 - p_3$. Hence, if different areas are considered where the proportion $p_1$ is the same for each area, the proportion $p_2$ is the same for each area, and the proportion $p_3$ is the same for each area, then there will be an exact multilinear relationship, $y = a + b_1 x_1 + b_2 x_2$, between the values of y and $x_1$, $x_2$ for the different areas, where the slopes will be $b_1 = p_1 - p_3$ and $b_2 = p_2 - p_3$, and the y-intercept will be $a = p_3$. This multilinear relationship could be used to determine $p_3 = a$, $p_2 = b_2 + a$, and $p_1 = b_1 + a$.

In practice, the actual values of $p_1$, $p_2$, and $p_3$ will not be constant, but it may be the case that the average $E(p_1 \mid x_1, x_2)$ of the values of $p_1$, the average $E(p_2 \mid x_1, x_2)$ of the values of $p_2$, and the average $E(p_3 \mid x_1, x_2)$ of the values of $p_3$ for populations with the same proportions $x_1$ and $x_2$ of Negroes and whites, respectively, are constant. Then the main assumption of multiple linear regression analysis, $E(y \mid x_1, x_2) = A + B_1 x_1 + B_2 x_2$, holds true, where $A = E(p_3 \mid x_1, x_2)$, $B_1 = E(p_1 \mid x_1, x_2) - A$, and $B_2 = E(p_2 \mid x_1, x_2) - A$. Thus standard methods of multiple regression can be used to obtain estimates $a$, $b_1$, $b_2$, of $A$, $B_1$, $B_2$, respectively.²⁹ If the variances of $p_1$, $p_2$, and $p_3$ are not large, then these estimates can be used to obtain the estimates $\hat{p}_3 = a$, $\hat{p}_2 = b_2 + a$, and $\hat{p}_1 = b_1 + a$ of the expected values of $p_3$, $p_2$, and

29 See, e.g., Walker and Lev, op. cit., chap. xiii.
$p_1$, respectively. The six entries in the $2 \times 3$ cross-classification table describing the relation between literacy and color can then be estimated by a method analogous to that described earlier for the case in which $\beta = 2$. The estimates $\hat{p}_1$, $\hat{p}_2$, $\hat{p}_3$ should be examined to see whether they all lie between 0 and 1. Also the estimated proportion $\hat{Y} = a + b_1 X_1 + b_2 X_2$ of illiterates in the total population (where $X_1$ is the proportion of Negroes in the total population and $X_2$ is the proportion of whites in this population) should be close to the known proportion Y of illiterates in this population, if the method suggested here is to be applied. Many of the comments discussed earlier for the case $\beta = 2$ can be generalized to the situation described in this section. This is left as an exercise for the interested reader.

Relation between two quantitative variables.-Many of the ideas presented above can be applied also in the case in which we are dealing with quantitative variables rather than categories, e.g., income rather than race. Let us consider the situation in which the individual Pearsonian correlation between income and size of family is of interest and the relevant cross-classified data for the entire population are not available. That is, for each individual in the population, information about both his income, x, and size of his family, y, cannot be obtained, but it is possible to determine or to estimate the income distribution and the distribution of size of family for the population (i.e., the marginal totals). If the average income and the average size of family for, say, each of the forty-eight states is known and if there is a linear relationship between income x and average size of family $E(y \mid x)$ when x is given, and it is more or less constant in these states [i.e., $E(y \mid x) = A + Bx$ is true for the individuals in each state and A and B are constant for all states], then it is possible to use these ecological data for each of the states, to estimate the individual correlation for the population. The appropriate estimate of this Pearsonian correlation is obtained by multiplying b, the estimate of the slope of the regression line of y and x obtained from the average income and average family-size data, by the ratio V of the standard deviation of the population income distribution and the standard deviation of the population family-size distribution. [This corresponds to the earlier multiplication of b by the $\sqrt{X(1 - X)}/\sqrt{Y(1 - Y)}$.] The usual ecological correlation, which cannot be used in general to estimate the individual correlation, is obtained by multiplying b by the ratio of the observed standard deviation of the distribution of the forty-eight average incomes for the states and the observed standard deviation of the distribution of the forty-eight average family sizes for the states; if Washington, D.C., is included, there will be forty-nine averages. If, as is often the case, this latter ratio is much larger than the ratio V, the usual ecological correlation will overestimate the individual correlation.³⁰ The ecological correlation may serve as a rough measure and partial check on whether the underlying assumption is not satisfied, i.e., whether $E(y \mid x) = A + Bx$, where A and B are constant for all states.

30 See G. Udny Yule and M. G. Kendall, An Introduction to the Theory of Statistics (New York: Hafner Publishing Co., 1950), pp. 313-14, for some related comments.

We have seen that it is not necessary to know the entire distribution of income and the distribution of size of family for the total population, but only the standard deviations of the two, in order to use the ecological data to estimate the individual correlation, since only these standard deviations enter into the computation of the ratio V and of the individual correlation estimate. These standard deviations can be determined from the standard deviations (or variances) and the averages of the respective distributions for the states, since the variance of the population income (family size) distribution is the weighted sum $\sum w(i) M(i)$ of the average squared deviations $M(i)$ of the income (family size) of a person in the ith state from the average income (family size) $\bar{X}$ for the total population, where $w(i)$ is the relative population size of the ith state, $M(i) = \sigma^{2}(i) + [\bar{x}(i) - \bar{X}]^{2}$, $\bar{x}(i)$ is the average income (family size) in the ith state, and $\sigma^{2}(i)$ is the variance of the income (family-size) distribution in the ith state. These data, together with the estimated slope b of the regression line obtained from the data on average income and average family size for the states, lead to the estimate $bV$ of the individual correlation coefficient for the entire population, where there is a constant linear relationship between income x and average family size $E(y \mid x)$ when x is given.

We have seen that the variance $\sigma^{2}(X)$ of the population income (family size) distribution is $\sum w(i) M(i) = \sum w(i) \sigma^{2}(i) + \sum w(i) [\bar{x}(i) - \bar{X}]^{2}$; i.e., the population variance is the sum of two terms: (a) the weighted sum

ratio $BSW(x)/BSW(y)$ of the "between states weighted" variances.

The ratio of the usual ecological correlation and the estimate of the individual correlation is $\sqrt{BS(x)/BS(y)} \big/ \sqrt{\sigma^{2}(X)/\sigma^{2}(Y)}$. If the observed variances $BS(y)$ and $\sigma^{2}(Y)$ are replaced in the above ratio by their expected values computed under the usual independence assumptions of the linear regression model, where $E(y \mid x) = A + Bx$ for each state and $\sigma^{2}[y \mid x(i)]$ is the variance around the regression line in the ith state, then the so-called "expected" ratio obtained will be larger than 1 whenever

$\sum \{ w(i)\, \sigma^{2}[y \mid x(i)] \}$

$\sigma^{2}(X)$
situation is such that the slope B remains the same in the different states, while the y-intercept may differ from state to state in a way that is linearly related to some measured characteristic, z, of the state.

Relation between two dichotomous variables.-The point of view described at the end of the preceding section can be applied to show that if the average $E(r \mid x)$ of the values of the proportion r of whites who are illiterate, for states with the same proportion x of Negroes, is a linear function of a measurable characteristic z of each state [i.e., $E(r \mid x) = C + Fz$] and if the difference between the average $E(p \mid x)$ of the values of the proportion p of Negroes who are illiterate (for states with the same proportion x of Negroes) and $E(r \mid x)$ is constant [i.e., $E(p \mid x) - E(r \mid x) = B$], then the average of the values of the proportion y of illiterates (for states with the same proportion x of Negroes) is equal to $E(y \mid x) = C + Fz + Bx$. The special situation where F = 0 has been studied earlier in this article. By standard methods of multiple regression applied to the ecological data (i.e., to the proportions y and x and the value of z for each state), estimates e, f, and b of C, F, and B, respectively, can be obtained, which can then be used to obtain the estimates $\hat{r} = e + fz$ and $\hat{p} = b + e + fz$ of $E(r \mid x)$ and $E(p \mid x)$, respectively, for each state. These estimates $\hat{r}$ and $\hat{p}$ can be used along with the values of x and the size of the population of each state to estimate the four entries in the $2 \times 2$ cross-classification table describing the relation between the two dichotomous variables, race and illiteracy, for each state. These tables for the separate states can then be combined to estimate the four table entries for the total population; thus an estimate of the individual correlation between race and illiteracy can be obtained for the total population.

The magnitude of the estimate b of $B = E(p \mid x) - E(r \mid x)$, the average difference between the illiteracy rates for whites and the rates for Negroes for states having the same proportion x of Negroes might be interpreted as the "effect of race on illiteracy," while the magnitude of the estimate f of F might be interpreted as the "effect of z on the illiteracy of whites" (z might measure average income, average social status, per cent unemployed, etc., for each state). It should be noted that z cannot be taken equal to x (neither can z be a linear function of x), unless some additional assumptions are made, because if z = x, then $E(y \mid x) = C + (B + F)x$, in which case B + F and C can be estimated by the methods of linear regression applied to y and x, but it will not be possible to obtain separate estimates of B and F unless additional assumptions are made about their relative magnitudes. For example, if the additional assumption that F = 0 is made (i.e., that the "effect of z on the illiteracy of whites" is zero), then the methods developed earlier can be utilized; but if the assumption that B = 0 is made (i.e., that the "effect of race on illiteracy" is zero), then the table entries for each state can be estimated as described in the preceding paragraph and these table entries for the states can then be combined to estimate the individual correlation for the total population. In this particular example in which z = x, if F = 0, then the effect of the percentage of population which is Negro in a state on the illiteracy rate for the whites there is zero, while if B = 0, then the average difference between the illiteracy rate for Negroes and the rate for whites is zero in states having the same proportion x of Negroes. In this situation, where B = 0, the estimated individual correlation between race and illiteracy computed for each state will be zero, but the individual correlation estimated for the total population may not be zero unless F = 0 as well. Since it is possible to obtain an exact linear relationship between y and x when either F = 0 or B = 0 (or even when neither F nor B equals zero), it is not possible to decide on the basis of the ecological data concerning y and x whether it should be assumed that F = 0, that B = 0, or that the ratio B/F is a known constant. The research worker will require additional data to help him choose between these models and the assumptions underlying them. This is an important choice, since they lead to different