Publisher: Taylor & Francis
Informa Ltd Registered in England and Wales Registered Number: 1072954
Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH,
UK

Cybernetics and Systems: An International Journal
Publication details, including instructions for authors and subscription information:
http://www.tandfonline.com/loi/ucbs20

INTERPRETATION OF THE GENERALIZED ZIPF-MANDELBROT LAW PARAMETERS
Patricia Sastre-Vázquez (a), Yolanda Villacampa (b), José A. Reyes (b) & Fernando García-Alonso (b)
(a) Department of Applied Mathematics, Universidad Nacional del Centro de la Provincia de Buenos Aires, Argentina
(b) Department of Applied Mathematics, University of Alicante, Alicante, Spain
Published online: 28 Apr 2009.

To cite this article: Patricia Sastre-Vázquez, Yolanda Villacampa, José A. Reyes & Fernando García-Alonso (2009) INTERPRETATION OF THE GENERALIZED ZIPF-MANDELBROT LAW PARAMETERS, Cybernetics and Systems: An International Journal, 40:4, 326-336, DOI: 10.1080/01969720902847029

To link to this article: http://dx.doi.org/10.1080/01969720902847029

Cybernetics and Systems: An International Journal, 40: 326–336
Copyright © 2009 Taylor & Francis Group, LLC
ISSN: 0196-9722 print/1087-6553 online
DOI: 10.1080/01969720902847029

INTERPRETATION OF THE GENERALIZED ZIPF-MANDELBROT LAW PARAMETERS

PATRICIA SASTRE-VÁZQUEZ (1), YOLANDA VILLACAMPA (2), JOSÉ A. REYES (2), and FERNANDO GARCÍA-ALONSO (2)
(1) Department of Applied Mathematics, Universidad Nacional del Centro de la Provincia de Buenos Aires, Argentina
(2) Department of Applied Mathematics, University of Alicante, Alicante, Spain

This article provides an interpretation of the parameters of the generalized Zipf-Mandelbrot law, which the authors call the generalized range-frequency law and which was developed in 2009. The law, obtained using a Pearson system, models the distribution of frequencies of words in school textbooks. The significance of the parameters was established by analyzing certain variables. Because the statistical law can be applied to texts written in natural language, it may also be applicable to ecological models, which can be considered a particular case of mathematical models written as texts in a formal language, as developed in 1999. After establishing certain hypotheses regarding the significance of the parameters of the generalized law obtained for natural language, this study was designed to analyze whether the law is applicable to models considered as texts written in a formal language. If so, the law would be a very useful tool for comparing models.

Keywords: Modelling; Pearson system; Statistical law; Texts

Address correspondence to Yolanda Villacampa, University of Alicante, Department of Applied Mathematics, Alicante, Spain. E-mail: villacampa@ua.es

INTRODUCTION
This article deals with a particular type of text, namely, text systems as defined in the theory published in Villacampa, Castro, Usó, and Sastre (1999a) and Villacampa and Usó-Doménech (1999b). For this type of text, the authors interpret the law obtained as a generalization of the laws of Zipf (1932, 1949) and Mandelbrot (1953), which they have named the generalized range-frequency law and which was published in Sastre-Vázquez, Villacampa, García-Alonso, and Reyes (2009).
Using data from texts written down from oral narrations of schoolchildren, equations were found for the distribution of the frequency of words ordered by their rank, in a similar fashion to the studies carried out by Sastre-Vázquez, Usó-Doménech, Villacampa-Esteve, and Mateu (1999). In this case, the authors interpreted the parameters of the distributions that were obtained using a Pearson system (Sastre-Vázquez, 2002; Sastre-Vázquez et al., 2009), and an analysis of variance was carried out for certain variables.
The authors obtained a law derived from a Pearson system in Sastre-Vázquez et al. (2009). A Pearson system is a system of equations obtained from the generalization of the differential equation for the Gaussian distribution:

$$\frac{dy}{dx} = \frac{x + a}{b_0 + 2b_1 x + b_2 x^2}\, y.$$

Pearson distributions are wholly determined by the first four moments:

$$a = \frac{\mu_3(\mu_4 + 3\mu_2^2)}{A}, \qquad b_0 = -\frac{\mu_2(4\mu_2\mu_4 - 3\mu_3^2)}{A},$$

$$b_1 = -\frac{\mu_3(\mu_4 + 3\mu_2^2)}{2A}, \qquad b_2 = -\frac{2\mu_2\mu_4 - 3\mu_3^2 - 6\mu_2^3}{A},$$

with $A = 10\mu_4\mu_2 - 18\mu_2^3 - 12\mu_3^2$. In accordance with the values of the roots, $\alpha$ and $\beta$, of the quadratic trinomial $b_0 + 2b_1 x + b_2 x^2$, we established 12 different kinds of Pearson distributions, according to the values of the discriminant $\Delta = b_0 b_2 - b_1^2$ and of $\kappa = b_1^2/(b_0 b_2)$. For the texts analyzed using the Pearson method, we obtained values of $\Delta < 0$ and $\kappa < 0$, giving a Type I distribution, whose probability density is

$$f(x) = \begin{cases} C\,(x + \alpha)^{\beta_1}(\beta - x)^{\beta_2}, & x \in (-\alpha, \beta) \\ 0, & x \notin (-\alpha, \beta) \end{cases} \qquad (1)$$
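where

$$C = \frac{\alpha^{2\beta_1}\,\beta^{2\beta_2}}{(\alpha + \beta)^{\beta_1 + \beta_2 + 1}\, B(\beta_1 + 1, \beta_2 + 1)} \qquad (2)$$

$$B(\beta_1 + 1, \beta_2 + 1) = \frac{\Gamma(\beta_1 + 1)\,\Gamma(\beta_2 + 1)}{\Gamma(\beta_1 + \beta_2 + 2)}, \quad \text{with } \beta_1 > -1 \text{ and } \beta_2 > -1. \qquad (3)$$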
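As an illustration of how the coefficients and the classification criteria could be evaluated numerically, the following Python sketch computes $a$, $b_0$, $b_1$, $b_2$, $\Delta$, and $\kappa$ from the second, third, and fourth central moments. It assumes the standard moment solution of the Pearson system with the sign convention $dy/dx = (x + a)\,y/(b_0 + 2b_1 x + b_2 x^2)$; the function names are hypothetical and are not taken from the original study.

```python
def pearson_coefficients(mu2, mu3, mu4):
    """Pearson-system coefficients from central moments (assumed convention:
    dy/dx = (x + a) / (b0 + 2*b1*x + b2*x**2) * y)."""
    A = 10 * mu4 * mu2 - 18 * mu2**3 - 12 * mu3**2
    a = mu3 * (mu4 + 3 * mu2**2) / A
    b0 = -mu2 * (4 * mu2 * mu4 - 3 * mu3**2) / A
    b1 = -mu3 * (mu4 + 3 * mu2**2) / (2 * A)
    b2 = -(2 * mu2 * mu4 - 3 * mu3**2 - 6 * mu2**3) / A
    return a, b0, b1, b2


def discriminant_and_kappa(b0, b1, b2):
    """Return (Delta, kappa); kappa is None when b0 * b2 == 0."""
    delta = b0 * b2 - b1**2
    kappa = None if b0 * b2 == 0 else b1**2 / (b0 * b2)
    return delta, kappa


# Central moments of a Beta(2, 3) variable, used here only as a test case.
a, b0, b1, b2 = pearson_coefficients(1 / 25, 2 / 875, 33 / 8750)
delta, kappa = discriminant_and_kappa(b0, b1, b2)
print(delta < 0 and kappa < 0)  # both negative: the Type I case
```

For Gaussian moments ($\mu_2 = 1$, $\mu_3 = 0$, $\mu_4 = 3$) the sketch gives $b_0 = -1$ and $a = b_1 = b_2 = 0$, so the equation reduces to $dy/dx = -x\,y$.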
If the following replacements are made in expression (1):

$$\alpha = \rho, \qquad \beta_1 = -b, \qquad C(\beta - x)^{\beta_2} = P_L,$$

we obtain

$$f(x) = \begin{cases} P_L\,(x + \rho)^{-b}, & x \in (-\alpha, \beta) \\ 0, & x \notin (-\alpha, \beta) \end{cases} \qquad \text{with } \beta_1 > -1 \text{ and } \beta_2 > -1,$$

an expression that coincides with the Estoup-Zipf-Mandelbrot law, which is in turn a generalization of the Zipf law. To summarize, the expression for the frequency of words obtained by applying a Pearson system is a generalization of the Zipf and Estoup-Zipf-Mandelbrot rank-frequency laws.
In this article, the parameters of the model are denoted by Alfa ($\alpha$), Beta ($\beta$), Beta 1 ($\beta_1$), and Beta 2 ($\beta_2$).
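To make the shape of a Zipf-Mandelbrot rank-frequency curve concrete, the sketch below evaluates $(r + \rho)^{-b}$ over ranks $r = 1, \dots, n$ and normalizes the values so that they sum to one. The function name and the normalization over a finite rank list are assumptions made for illustration; they stand in for the constant $P_L$ and are not the authors' fitting procedure.

```python
def zipf_mandelbrot(n_ranks, rho, b):
    """Relative frequencies proportional to (rank + rho) ** (-b), ranks 1..n_ranks.

    Hypothetical helper: the proportionality constant is chosen so that
    the frequencies over the finite rank list sum to 1.
    """
    weights = [(rank + rho) ** (-b) for rank in range(1, n_ranks + 1)]
    total = sum(weights)
    return [w / total for w in weights]


freqs = zipf_mandelbrot(n_ranks=5, rho=0.5, b=1.0)
print(freqs)  # decreasing relative frequencies for ranks 1 to 5
```

With $\rho = 0$ and $b = 1$ the same function reproduces the classical Zipf law, in which frequency is inversely proportional to rank.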
The main aim was to identify whether said parameters are indepen-
dent of classifications by group, gender, and age, or whether on the
contrary each text is characterized with regard to these classifications
by some specific parameter or variable. In addition, we calculated the
correlations between the parameters and certain variables, such as text
length, number of different words, tokens, entropy, etc. This allowed
us to identify the basis upon which to adjust valid regressions for later
interpretation of the distribution parameters derived from the Pearson
system. For each text, we also calculated the entropy $H$ using the following equation:

$$H = -\sum_{i=1}^{n} p_i \ln p_i,$$

where $p_i$ is the probability of a word in the text.
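The entropy formula can be sketched in Python as follows. The sketch assumes that $p_i$ is estimated as the relative frequency of each distinct word in the text and that the text has already been split into words; the function name is hypothetical.

```python
import math
from collections import Counter


def word_entropy(words):
    """Entropy H = -sum(p_i * ln p_i), with p_i the relative frequency
    of each distinct word (an assumed estimate of its probability)."""
    counts = Counter(words)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


print(word_entropy(["the", "cat", "saw", "the", "dog"]))
```

A text that repeats a single word has entropy 0, and a text of $n$ equally frequent words has entropy $\ln n$, the maximum for a vocabulary of that size.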



ANALYSIS OF VARIANCE
With the data obtained from sighted and nonsighted (blind) boys and girls of school age (7 to 13 years; 1st to 7th year of basic general education in Argentina), an analysis of variance was carried out. The model included the factors group, gender, and age, with all second-order interactions, and the following analysis variables: 1) text length; 2) vocabulary (number of different words in the text); 3) token (quotient between vocabulary and text length); and 4) the parameters of the model derived from the Pearson system (polynomial roots: $\alpha$ and $\beta$; exponents: $\beta_1$ and $\beta_2$).
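The first three analysis variables can be computed directly from a text, as in the sketch below. Splitting on whitespace and lowercasing are assumptions made for illustration; the original study does not state how words were delimited, and the function name is hypothetical.

```python
def text_variables(text):
    """Return (length, vocabulary, token) for a text.

    length: number of words; vocabulary: number of different words;
    token: vocabulary divided by length. Whitespace tokenization and
    lowercasing are simplifying assumptions.
    """
    words = text.lower().split()
    length = len(words)
    vocabulary = len(set(words))
    token = vocabulary / length if length else 0.0
    return length, vocabulary, token


print(text_variables("the cat saw the dog"))  # (5, 4, 0.8)
```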
The model used for the analysis of variance (ANOVA) corresponded to a completely randomized design with three classification criteria (group, gender, and age), taking into account all second-order interactions. Its general analytical equation is as follows:

$$y_{ijkl} = \mu + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + (\alpha\gamma)_{ik} + (\beta\gamma)_{jk} + \varepsilon_{ijkl} \qquad \text{(Model I)}$$

where $y_{ijkl}$ is the observation of the variable considered for the $i$th level of group, the $j$th level of gender, the $k$th level of age, and the $l$th repetition; $\mu$ is the general average; $\alpha_i$ is the fixed effect of the $i$th level of group, with $i = 1, 2$; $\beta_j$ is the fixed effect of the $j$th level of gender, with $j = 1, 2$; $\gamma_k$ is the fixed effect of the $k$th level of age, with $k = 1, 2, \ldots, 7$; $(\alpha\beta)_{ij}$ is the effect of the interaction of the $i$th level of group with the $j$th level of gender; $(\alpha\gamma)_{ik}$ is the effect of the interaction of the $i$th level of group with the $k$th level of age; $(\beta\gamma)_{jk}$ is the effect of the interaction of the $j$th level of gender with the $k$th level of age; and $\varepsilon_{ijkl}$ is the random error.
Duncan's test was used for the multiple comparison tests. This test uses multiple ranges and a variable level of significance that depends on the number of averages involved at each stage. Before carrying out the analysis of variance, the assumptions of normality and homogeneity required by the models were checked. To do so, a residual analysis was carried out, which gave no evidence that the hypothesis of normality was not satisfied.
Homogeneity of variances was verified using graphical methods and Levene's test. This test consists of carrying out an analysis of variance of the absolute values of the residuals and tests the null hypothesis that the variances of the populations are equal. The test did not reject this hypothesis. The graphs of the residuals versus the sample averages did not show any worrying pattern, such as a cone or diamond shape as the averages increased. As a result, the homogeneity of the variances was assumed.
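The idea behind Levene's test, an analysis of variance applied to the absolute residuals, can be sketched for the simplest one-factor case. This is an illustration under that simplifying assumption, not the three-factor design used in the study; the function name is hypothetical, and the resulting F statistic would still have to be compared with an F distribution with $(k - 1, n - k)$ degrees of freedom to obtain a p-value.

```python
from statistics import mean


def levene_statistic(groups):
    """One-way ANOVA F statistic computed on |y - group mean|.

    groups: list of lists of observations, one list per group.
    Simplified one-factor illustration of Levene's test.
    """
    # Absolute residuals of each observation around its own group mean.
    abs_res = [[abs(y - mean(g)) for y in g] for g in groups]
    k = len(abs_res)
    n = sum(len(z) for z in abs_res)
    grand = mean([v for z in abs_res for v in z])
    ss_between = sum(len(z) * (mean(z) - grand) ** 2 for z in abs_res)
    ss_within = sum((v - mean(z)) ** 2 for z in abs_res for v in z)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Two groups with the same spread give a statistic of zero.
print(levene_statistic([[1, 2, 3], [11, 12, 13]]))
```

A small statistic, as in this example, is consistent with equal variances; large values point to heterogeneous variances.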
Table 1 shows the values of R² and Pr > F obtained when carrying out the ANOVAs for certain variables and the parameters.
For the variables studied, except for the Alfa, Beta 1, and Beta 2 parameters, the models are statistically significant and the fits (R²) are satisfactory.
A statistically significant effect for the interaction between the group and age factors can be seen for all the variables except for the dependent variable entropy. For the entropy variable, only an effect due to the age factor was detected (Table 2).
Duncan's test was used for the multiple comparison tests of the averages.
Table 3 compares the averages of the entropies of the texts. The entropy of the text rises as the age of the schoolchild increases, clearly showing two groups with statistically significant differences: 1) 7, 8, 9, and 10 years of age; 2) 12 and 13 years of age. The group of texts from schoolchildren aged 11 fell between the two groups previously mentioned.
Table 4 shows the averages for the length of texts according to the
age of the schoolchildren and the group to which they belong. It can
be seen that for both sighted and nonsighted children, the greatest length
of text corresponds to schoolchildren age 11. However, nonsighted

Table 1. Analysis of variance using Model I

Dependent variable   Entropy   Length   Different words   Token   Alfa    Beta     Beta 1   Beta 2
Pr > F               0.03      0.0001   0.0001            0.008   0.12    0.0001   0.21     0.49
R²                   0.55      0.81     0.80              0.60    0.29    0.80     0.45     0.38
Average              3.0410    396.87   169.017           0.462   0.058   244.00   0.72     0.69

Table 2. Analysis of variance for the models that were statistically significant

Variable            Factor   d.f.   Sum of Sq       Mean Sq        F       Pr > F
Entropy             G        1      0.19682857      0.19682857     0.40    0.5289
                    S        1      1.23017857      1.23017857     2.53    0.1210
                    E        6      11.77298571     1.96216429     4.04    0.0037
                    G × S    1      0.00731429      0.00731429     0.02    0.9031
                    G × E    6      3.61402143      0.60233690     1.24    0.3114
                    S × E    6      3.49262143      0.58210357     1.20    0.3315
Length              G        1      231814.446      231814.446     11.03   0.0021
                    S        1      57152.161       57152.161      2.72    0.1083
                    E        6      1983130.500     330521.750     15.73   0.0001
                    G × S    1      11400.018       11400.018      0.54    0.4664
                    G × E    6      577732.929      96288.821      4.58    0.0017
                    S × E    6      200884.714      33480.786      1.59    0.1792
Vocabulary          G        1      37183.0179      37183.0179     15.77   0.0004
(different words)   S        1      5540.1607       5540.1607      2.35    0.1346
                    E        6      212505.6071     35417.6012     15.02   0.0001
                    G × S    1      17.1607         17.1607        0.01    0.9325
                    G × E    6      58001.1071      9666.8512      4.10    0.0034
                    S × E    6      17621.4643      2936.9107      1.25    0.3082
Beta                G        1      64966.7252      64966.7252     13.11   0.0009
                    S        1      12354.9036      12354.9036     2.49    0.1236
                    E        6      478128.9872     79688.1645     16.08   0.0001
                    G × S    1      512.6165        512.6165       0.10    0.7497
                    G × E    6      120790.0787     20131.6798     4.06    0.0036
                    S × E    6      39283.8028      6547.3005      1.32    0.2747

G: group; S: gender; E: age.

Table 3. Duncan's multiple range test

Age       7          10         8          9          11          12         13
Entropy   2.504(a)   2.690(a)   2.721(a)   2.736(a)   3.061(ab)   3.695(b)   3.705(b)

The same letters indicate averages that do not differ.

Table 4. Duncan's multiple range test

Sighted      Age       11        12       13       10       8        9        7
             Length    1044.25   675.50   403.75   368.75   306.00   296.25   134.00
             Letters   A         B        C        C        DC       DC       D
Nonsighted   Age       11        13       12       10       8        9        7
             Length    479.50    445.25   398.50   342.75   282.50   221.50   157.75
             Letters   A         BA       BA       BAC      BAC      BC       C

The same letters indicate averages that do not differ.


Table 5. Duncan's multiple range test

Sighted      Age          11         12         13          10           8            9           7
             Vocabulary   372.7(a)   271.2(b)   186.2(c)    162.2(c)     150.2(cd)    141.5(cd)   79.2(d)
Nonsighted   Age          13         11         12          10           8            9           7
             Vocabulary   191.7(a)   190.2(a)   162.5(ab)   150.2(abc)   124.2(abc)   101.5(bc)   82.2(c)

The same letters indicate averages that do not differ.


schoolchildren produce texts half as long as those corresponding to sighted schoolchildren. In addition, in both groups, the shortest texts were produced by the schoolchildren aged 7, with similar lengths in both groups, although slightly greater for the nonsighted children.
Sighted schoolchildren aged 12 produced texts with slightly more than half the words produced by sighted children aged 11, and this difference was statistically significant. The number of words used by sighted children aged 10 and 13 did not differ in a statistically significant way. Sighted schoolchildren aged 8 and 9 produced texts that were very similar to and slightly longer than those aged 7.
With nonsighted schoolchildren, a similar tendency to that shown by
sighted schoolchildren regarding length of text associated with age was
observed. However, the differences were not statistically significant in
many cases.
With regard to the number of different words used by the schoolchil-
dren, the tendencies were very similar to those found for length of text.
This result was to be expected, as new words are logically incorporated
as text length increases.
Table 5 shows the averages corresponding to this variable.
Table 6 compares the averages of the values of the beta parameter. Once again, the same tendency was detected as for the length of text variable. This allows us to hypothesize that the beta parameter has strong links with the length of text variable and that both dependent variables, length of text and beta, actually measure the same phenomenon.

Table 6. Duncan's multiple range test

Sighted      Age    11          12          13           10            8             9            7
             Beta   542.04(a)   393.7(b)    262.67(c)    231.32(c)     204.67(cd)    204.02(cd)   108.04(d)
Nonsighted   Age    11          13          12           10            8             9            7
             Beta   284.41(a)   275.64(a)   243.89(ab)   221.03(abc)   178.70(abc)   148.95(bc)   116.98(c)

The same letters indicate averages that do not differ.

CORRELATIONS AND REGRESSIONS


Table 7 analyzes the correlations between the parameters and certain variables (length of text, number of different words, tokens, entropy, etc.). Based on the values shown in Table 7, we decided to fit the regressions shown in Tables 8–10.
The fitted regressions for the alpha and beta parameters show values of R² of over 0.90, which indicates that the independent variables used in each case explain almost all the variation of the estimated parameters. For the beta 1 parameter, the value of R² was close to 0.50, indicating that the independent variables used explain only about 50% of the variation of the parameter. There are possibly other causes of variation, not taken into account in this study, that should be investigated. Nevertheless, as a first approximation towards estimating the beta 1 parameter, the fitted regression can be regarded as acceptable.
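For reference, a Pearson correlation coefficient of the kind reported for the parameters and text variables can be computed as in the following sketch. The function name and the sample data are hypothetical; the significance probabilities are not reproduced here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical text lengths and fitted beta values, for illustration only.
print(pearson_r([100, 200, 300, 400], [102, 148, 199, 245]))
```

In a regression with a single predictor, R² equals the square of this coefficient, which agrees with the reported values: 0.97645² ≈ 0.95 for Alfa against token and 0.99131² ≈ 0.98 for Beta against length.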

Table 7. Pearson correlation coefficients (R) and Prob > |R| under the null hypothesis Rho = 0 (N = 56)

Alfa          Token        Length    Vocabulary   Entropy
R             0.97645      0.71110   0.66311      0.20311
Prob > |R|    0.0001       0.0001    0.0001       0.1333

Beta          Vocabulary   Length    Token        Entropy
R             0.99530      0.99131   0.69427      0.26444
Prob > |R|    0.0001       0.0001    0.0001       0.0489

Beta 1        Entropy      Token     Length       Vocabulary
R             0.42405      0.33649   0.18294      0.13509
Prob > |R|    0.0011       0.0112    0.1772       0.3208

Beta 2        Entropy      Token     Length       Vocabulary
R             0.21022      0.07037   0.05271      0.04966
Prob > |R|    0.1199       0.6063    0.6996       0.7163

Table 8. Regression of the alpha parameter as a function of the proportion of different words in a text (TOK)

Dependent variable   Source   d.f.   Sum of Sq   Mean Sq   F          Prob > F
Alfa                 Model    1      0.02070     0.02070   1106.248   0.0001
                     Error    54     0.00101     0.00002
                     Total    55     0.02171
R² = 0.95

Dependent variable   Variable    d.f.   Value       Std. error   t value   Prob > |t|
Alfa                 Intercept   1      0.051652    0.00351748   14.684    0.0001
                     TOK         1      -0.249660   0.00750626   -33.260   0.0001

With the estimates from Tables 8–10, the following regression equations can be formulated, which allow us to estimate the parameters of the Pearson distributions:

$$\alpha = 0.051652 - 0.24966\,\text{token}$$

$$\beta = 53.601433 + 0.479758\,\text{length}$$

$$\beta_1 = 0.38899 + 0.206085\,\text{entropy} - 1.141164\,\text{token} - 0.586747\,\text{entropy} \times \text{token}$$
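The two single-predictor regression equations can be applied as in the sketch below, which estimates alpha from the token value and beta from the text length. The function names are hypothetical; the coefficients are those reported for the alpha and beta regressions, with the negative token slope taken from the statement that alpha falls as the proportion of different words rises.

```python
def estimate_alpha(token):
    """Alpha estimated from the proportion of different words (token)."""
    return 0.051652 - 0.24966 * token


def estimate_beta(length):
    """Beta estimated from the text length in words."""
    return 53.601433 + 0.479758 * length


# Evaluated at the average token and length reported in Table 1.
print(estimate_alpha(0.462))
print(estimate_beta(396.87))
```

Evaluated at the average text length of 396.87 words, the beta equation returns approximately 244.00, which matches the average beta reported in Table 1.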

Table 9. Regression of the beta parameter as a function of the length of text (length)

Dependent variable   Source   d.f.   Sum of Sq       Mean Sq         F          Prob > F
Beta                 Model    1      869219.70089    869219.70089    3065.006   0.0001
                     Error    54     15314.11809     283.59478
                     Total    55     884533.81898
R² = 0.98

Dependent variable   Variable    d.f.   Value       Std. error   t value   Prob > |t|
Beta                 Intercept   1      53.601433   4.11004425   13.042    0.0001
                     Length      1      0.479758    0.00866576   55.362    0.0001

Table 10. Regression of the beta 1 parameter as a function of the proportion of different words in a text (TOK), of entropy (Entropy), and of their interaction

Dependent variable   Source   d.f.   Sum of Sq   Mean Sq   F        Prob > F
Beta 1               Model    3      0.18720     0.06240   14.378   0.0001
                     Error    52     0.22568     0.00434
                     Total    55     0.41289
R² = 0.48

Dependent variable   Variable          d.f.   Value      Std. error   t value   Prob > |t|
Beta 1               Intercept         1      0.38899    0.24158039   1.61      0.1134
                     Entropy           1      0.206085   0.08464025   2.435     0.0184
                     TOKEN             1      1.141164   0.53580876   2.130     0.0379
                     Entropy × Token   1      0.586747   0.19028192   3.084     0.0033

CONCLUSIONS
The Alpha parameter of the generalized law is an indicator of the
proportion of different words that appear in the text. This parameter falls
by 0.25 units per single unit of increase in the proportion of new words in
the text.
The Beta parameter of the generalized law indicates the length of the
text in question. Its value increases by 0.48 units for each word added to
the text.
The Beta 1 parameter of the generalized law is a function of
the entropy of the text, of the proportion of different words, and the
interaction between these two variables. However, more studies
should be carried out to obtain a more precise explanation of this
parameter.
The Beta 2 parameter cannot be explained by the length, token, vocabulary, and entropy variables. Based on the results obtained by Sastre-Vázquez (2002), it is believed that this parameter could be related to the language in which the texts are written.
The length of the narrative text increases with the age of the school-
children, up to the age of 11, as does the vocabulary used.
Schoolchildren aged 11 produce the longest narrative texts. From
this age onwards, the length of the narrations starts falling until the
age of 13, when they reach levels similar to those of children aged 10.
The texts of nonsighted schoolchildren aged 11 and 12 are notably shorter than those of sighted schoolchildren of the same age, a difference that is almost nonexistent when the children reach 13 years of age.

REFERENCES
Mandelbrot, B. 1953. Structure formelle des textes et communication. J. Word 10: 1–27. Bingley, UK.
Sastre-Vázquez, P., Usó-Doménech, J. L., Villacampa-Esteve, Y., and Mateu, J. 1999. Statistical linguistic laws in ecological models. Cybernetics and Systems 30: 697–724.
Sastre-Vázquez, P. 2002. Doctoral thesis. Estadística Lingüística. Modelos para el Análisis de Textos en el Lenguaje de Escolares Videntes y No Videntes. University of Jaume I.
Sastre-Vázquez, P., Villacampa, Y., García-Alonso, F., and Reyes, J. A. 2009. Modelling of statistical laws for text using a Pearson system. Cybernetics and Systems 40: 1–13.
Villacampa, Y., Castro, M. A., Usó, J. L., and Sastre, P. 1999a. A text theory of ecological systems. Cybernetics and Systems 30: 587–607.
Villacampa, Y. and Usó-Doménech, J. L. 1999b. Mathematical models of complex structural systems. International Journal of General Systems 28: 37–52.
Zipf, G. K. 1932. Selective studies and the principle of relative frequency. Cambridge: Addison Wesley.
Zipf, G. K. 1949. Human behavior and the principle of least effort. Cambridge: Addison Wesley.
