Professional Documents
Culture Documents
Our hope is that the characterization presented will be use- information for Information Retrieval tasks [1]. Our work
ful both in the design of more effective social systems and in goes in a different direction: we aim at understanding the
increasing our understanding of the underlying driving forces behavior of Wikipedia contributors as a way to understand
governing content creation by humans. The main findings of its general evolution. The work in [17] follows a similar goal
this study are summarized below: but is different from ours in the sense that the authors do not
try to model contributor interaction in Wikipedia as a whole.
• Wikipedia evolution follows a self-similar process both Instead, they focus on proposing a visualization framework
in terms of revisions to existing articles and creation that can be used to evaluate cooperation and conflict patterns
of new ones. This result is in contrast to the findings on individual articles, in addition to providing some general
from prior studies [4, 6], where researchers found that statistics regarding the whole Wikipedia.
the changes to the Web pages follow a Poisson Process. A number of characterizations of different workloads types
• The number of articles on Wikipedia has been growing have been conducted, such as peer-to-peer systems [10], chat
exponentially since its creation in 2001. This growth is rooms [7], e-mail [3], streaming media [16], e-business [15],
mainly driven by the exponential increase in the num- and Web servers [2]. We borrow from these studies the tech-
ber of users contributing new articles, indicating the niques they have used, but there is a fundamental difference
importance of the Wikipedia’s open editorial policy in between their analysis and ours. Their focus was primarily
the current success. Interestingly, we find that the num- performance evaluation its influences on system design and
ber of articles contributed by each user has decreased we use the tools to understand contributor behavior and its
over time as the Wikipedia grows larger. impact on Wikipedia evolution.
# of updates
90 5000
# of updates
3 80
70 4000
2 60 3000
50
1 40 2000
30 1000
0 20
0
500
1000
1500
2000
2500
3000
3500
4000
200
400
600
800
1000
1200
1400
1600
0
Sat Sun Mon Tue Wed Thu Fri Sat Sun
Time (1 sec slots) Time (1 minute slots) Time (1 hour slots)
(a) Update arrival - 1 hour (b) Update arrival - 1 day (c) Update arrival - 1 week
2 55 3500
50
45 3000
1.5 40
# of updates
# of updates
2500
# of updates
35
30 2000
1
25
20 1500
0.5 15
10 1000
5 500
0 0
0
500
1000
1500
2000
2500
3000
3500
4000
200
400
600
800
1000
1200
1400
1600
0
Sat Sun Mon Tue Wed Thu Fri Sat Sun
Time (1 sec slots) Time (1 minute slots) Time (1 hour slots)
(d) Article creation - 1 hour (e) Article creation - 1 day (f) Article creation - 1 week
Fig. 2: Time series for the arrival processes for updates and article creation
of the results on the evolution of Wikipedia. Section 4.1 dis- and burstiness, differently from the prediction of the Poisson
cusses the general processes governing contributor interaction model.
with Wikipedia in terms of updates. The growth of Wikipedia Another class of processes, the so called self-similar pro-
is discussed in more detail in Section 4.2. Section 4.3 shows cesses, have been extensively used to model user access and
some results based on the analysis of contributor-centered network traffic patterns. These kinds of processes typically
metrics. Finally, Section 4.4 shows the analysis in terms of exhibit what is called a “fractal-like” behavior, i.e. no matter
an article centered approach. what time scale you use to examine the data, you see similar
patterns. The implication of such a behavior is that: (1) Xt
is bursty across several time scales, (2) there is no natural or
4.1 Self-similarities in Wikipedia evolution expected length of a burst, and (3) Xt does not aggregate as
One of the key characteristics of stochastic processes is their well as a Poisson process, which is close to what is observed
arrival process. in Figure 2.
In our case, we wanted to examine two processes: (1) the To investigate the appropriate model for this process more
arrival process of updates in general and (2) the arrival pro- formally, we analyzed the data using a statistical method.
cess of the creation of new articles. A usual way of analyzing There exists several methods to evaluate the self-similarity
such data is through the the time series plots for the events nature of a time-series. Each of this methods explores one or
generated by the process. Let Xt be a random variable that more of the statistical properties of self-similar time-series as
captures the number of events of a certain process that hap- a way to estimate the self similarity of a sample data. These
pened in the time interval [t − 1, t). For example, if the unit properties include the slowly decaying variance, long range
time is one second, then X1 represents the number of events dependency, non-degenerate autocorrelations and the Hurst
that occurred during the first second, X2 represents the num- effect.
ber of events that occurred during the second second and so In this study we have used the rescaled adjusted range
on. Figure 2 shows the time-series plots the updates received statistics in order to estimate the Hurst effect for the pro-
((a), (b) and (c)) and for the creation of new articles ((d), cesses being analyzed [9]. Let Xt be the time series being
(e) and (f)) at three different time scales, with the unit time analyzed, the Hurst effect is based on the fact that for almost
varying from 1 second to 1 hour. all of the naturally occurring time series, the rescaled ad-
Recent studies have shown that the evolution of the Web justed range statistic (R/S statistic) for a sequential sample
as a whole follows a Poisson process [4, 6]. In the Poisson of Xt with size n obeys the following relation:
process the time between successive events follows an i.i.d.
exponential distribution. As a result of this characteristic E[Rn /Sn ] = CnH (1)
the Poisson process is said to aggregate well. In other words,
as the unit time goes from a small timescale, e.g. 1 second, to The parameter H of Equation 1 is called the Hurst parameter
a larger one, e.g. 1 hour, Xt tends to get smoother approach- and lies between 0 and 1. For Poisson processes H = 0.5. For
ing a straight line. Our plots show that the Xt time-series self similar processes 0.5 < H < 1, so this parameter is easy-
for the Wikipedia data still preserves some of its variability to-use single parameter that is able to characterize whether
10000 effective in a context in which the evolution of a data-source
Actual data does not follow a Poisson process.
H = 0.78
1000
4.2 Wikipedia growth
100 Another point regarding the page creation process is how it
R/S
10000
# of users
100000 80
1000
10000 60
100
40
1000 10 actual data 20
exponential
100 1 0
Jan/01
Jul/01
Jan/02
Jul/02
Jan/03
Jul/03
Jan/04
Jul/04
Jan/05
Jul/05
Jan/06
Jul/06
Jan/07
Jan/01
Jul/01
Jan/02
Jul/02
Jan/03
Jul/03
Jan/04
Jul/04
Jan/05
Jul/05
Jan/06
Jul/06
Jan/07
Jan/01
Jul/01
Jan/02
Jul/02
Jan/03
Jul/03
Jan/04
Jul/04
Jan/05
Jul/05
Jan/06
Jul/06
Jan/07
Time Time Time
(a) # of articles (b) # of contributors (c) Average # of articles created per con-
tributor per month
20 1e+10
# of article created/user
# of updates
1e+07
12 1e+06
10 100000
8 10000
6 1000
4 100
2 10
Jan/01
Jul/01
Jan/02
Jul/02
Jan/03
Jul/03
Jan/04
Jul/04
Jan/05
Jul/05
Jan/06
Jul/06
Jan/07
1
0.1
10
100
1000
10000
100000
1e+06
Time
(a) Users registered from Jan 2001 to
Dec 2001
User rank
for those regostered in 2005
6.5
# of article created/user
6
5.5
Fig. 6: Contributor rank in terms of updates made.
5
4.5
4 tribute mostly well below 1000 articles. The exact reason for
3.5
3
this clear separation is not clear, but it may be because most
2.5 individuals have a limited number of updates that they can
2 make and keep track of.
Nov/04
Jan/05
Mar/05
May/05
Jul/05
Sep/05
Nov/05
Jan/06
Mar/06
May/06
Jul/06
Sep/06
Time 1
(b) Users registered from Jan 2005 to
Dec 2005 0.8
P[X <= x]
P[X <= x]
attention to which specific articles are being modified or cre- 0.7
ated. But this is an important characteristic that may help 0.6
us understand the reasoning behind contributor actions and 0.5
may help to explain the results we have found so far. 0.4
0.3
100000 0.2
10
100
1000
10000
100000
1e+06
Actual data
10000 (k=0.66)
# of updates
10
100
1000
10000
100000
1e+06
1e+07
updated article
3500
3000
2500
Article rank
(a) Zipf’s law for article popularity. 2000
1500
1000
% updates to popular articles
0.5 500
actual data
0.45 smoothed curve 0
0
500
1000
1500
2000
2500
3000
3500
4000
4500
5000
0.4
0.35 Session length
0.3 (b) Session length versus # of updates to the most up-
0.25
dated article in each session
0.2
Fig. 9: Contributor focus while interacting with Wikipedia.
0.15
Jan/01
Jul/01
Jan/02
Jul/02
Jan/03
Jul/03
Jan/04
Jul/04
Jan/05
Jul/05
Jan/06
Jul/06
Jan/07