
A Novel Approach to Parameter Estimation in
Markov-modulated Poisson Processes
Larry N. Singh, G. R. Dattatreya
Department of Computer Science
The University of Texas at Dallas
Richardson, Texas 75083-0688
Phone: (972) 883-2189 Fax: (972) 883-2349
Email: {lns,datta}@utdallas.edu

Abstract— The Markov-modulated Poisson Process (MMPP) is employed as a network traffic model in some applications. In this model, if traffic from each node is a Poisson process, the final sequence of service requirements is a hyperexponential renewal process, a special case of MMPP. This paper solves the estimation of the parameters of such an MMPP as follows. Two novel algorithms for estimating the parameters of the hyperexponential density are formulated. In the first algorithm of the present paper, equations are developed for the M mixing probabilities in terms of the component means of the hyperexponential density. This reduces the number of unknown parameters from 2M to M. An objective function is constructed as a function of the unknown component means and an estimate of the cumulative distribution function (cdf) of the hyperexponential density. The component means are obtained by minimizing this objective function, using quasi-Newton techniques. The mixing probabilities are then computed using these known means and linear least squares analysis. In the second algorithm, an objective function of the unknown component means, mixing probabilities, and an estimate of the cdf is constructed. All 2M parameters are computed by minimizing this objective function, using quasi-Newton techniques. The merits of each algorithm are discussed. The algorithms developed are computationally efficient and easily implemented. Simulation results presented here demonstrate that both algorithms work well in practical situations.

I. INTRODUCTION

The Markov-modulated Poisson Process (MMPP) is a doubly stochastic Poisson process in which the current rate is determined, or modulated, by a continuous-time Markov chain. This process is a special case of the Markovian Arrival Process, or MAP (Trivedi [1]). There are a number of reasons for the growth in popularity of the MMPP in certain Computer Science applications, particularly in the area of computer networks. In recent years, the volume of traffic on large-scale networks and across the Internet has increased tremendously, necessitating serviceable statistical models for traffic flow and analysis. Statistical models are required for the design of underlying IP-based transport layer protocols, efficient data gathering and evaluation, and statistical analysis of corresponding random processes and variables (Markovitch and Krieger [2]). The work of Leland et al. [3] demonstrates that classical Poisson-based methods are insufficient for large-scale networks. Typical Poisson methods predict early fluctuations in network traffic and hinge on the assumption that these anomalies will smooth out over a long period of time. In reality, these fluctuations occur over a wide range of time scales, generating high variability and self-similar behaviour. Self-similar processes are structurally similar over many different time scales. This phenomenon leads to long-range dependencies in network traffic.

Ostensibly, an alternative to classical Poisson methods is sought. To this end, long-tailed distributions have shown much success. A distribution is a long-tailed distribution (also referred to as a heavy-tailed distribution) if its ccdf decays more slowly than exponentially, i.e. if

\lim_{x \to \infty} e^{\alpha x} F^c(x) = \infty,   (1)

for all \alpha > 0 (Feldmann and Whitt [4]). The complementary cumulative distribution function (ccdf) is defined as the complement of the cumulative distribution function (cdf), as follows:

F^c(x) = 1 - F(x),   (2)

where F(x) is the cdf. Conversely, a distribution is a short-tailed distribution if its ccdf decays exponentially, i.e. if

\lim_{x \to \infty} e^{\alpha x} F^c(x) = 0,   (3)

for some \alpha > 0. Of note is a special case of long-tailed distributions called the power-tail distribution. A distribution is said to be a power-tail distribution if

F^c(x) \sim \alpha x^{-\beta} \quad \text{as } x \to \infty,   (4)

where \alpha and \beta are positive constants and the operator \sim is defined such that f(x) \sim g(x) implies that

\lim_{x \to \infty} \frac{f(x)}{g(x)} = 1.   (5)

Two common examples of long-tailed distributions that are used widely in network performance analysis are the Pareto and Weibull distributions. In addition, the Pareto distribution is a power-tail distribution; the Weibull distribution, however, is not.
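As a worked check of these definitions (our illustration, using the standard textbook ccdfs rather than expressions from the text): the Pareto ccdf F^c(x) = (x_m/x)^\beta, for x \ge x_m > 0, satisfies

\lim_{x \to \infty} e^{\alpha x} (x_m/x)^\beta = \infty \quad \text{for every } \alpha > 0,

since exponential growth dominates polynomial decay, so the Pareto distribution is long-tailed by (1); it also satisfies (4) directly with constant x_m^\beta. The Weibull ccdf F^c(x) = e^{-(x/\sigma)^c} with shape parameter 0 < c < 1 likewise gives e^{\alpha x - (x/\sigma)^c} \to \infty for every \alpha > 0, so it is long-tailed, yet it decays faster than any power x^{-\beta}, so no positive constants \alpha and \beta can satisfy (4).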
Recent studies have demonstrated that long-tailed distributions are aptly able to model many network characteristics. For example, long-tailed distributions have been valuable in modeling World Wide Web (WWW) traffic, since such traffic often originates from heterogeneous sources [2]. Long-tailed distributions have also been shown to be successful in modeling file transfer protocol (FTP) connections and intervals between connection requests [4]. As efficacious as the long-tailed distribution has been in capturing the statistical characteristics of large-scale networks, there are some deficiencies of this model. Most notably, long-tailed distributions are generally very difficult to analyze. An example is the task of analyzing the performance figures of the basic M/G/1 queue, which becomes quite involved if the service-time distribution is Pareto. Furthermore, unlike many short-tailed distributions, expressions for the Laplace transforms of long-tailed distributions are quite complex. Laplace transforms are generally useful for analyzing distributions by numerical transform inversion. Standard non-parametric estimators such as the histogram, projections and kernel estimators are not suitable for long-tailed distributions, mostly because long-tailed distributions do not have compact support. Short-tailed distributions, on the other hand, do have compact support. It is well known that kernel estimators suffer from spurious noise appearing in the tail of the estimator (Silverman [5]). Markovitch and Krieger [2] have explored some non-parametric procedures for estimating and approximating long-tailed distributions. Two of these procedures involve transformation functions that map a long-tailed density into a pdf with compact support, and polygrams. Polygrams are histograms with variable bin widths.

Hyperexponential densities have exhibited much success in approximating long-tailed distributions and in constructing network performance models ([4] and [2]). Moreover, analysis of hyperexponential densities is tractable and their Laplace transforms are simple expressions. The MMPP is a relatively simple model and is able to approximate network traffic activity well without the accompanying analysis becoming intractable (Fischer and Meier-Hellstern [6]). In addition, the MMPP models arrival streams and bursty traffic more precisely than other models (Bolch et al. [7]). As a consequence, the MMPP has been widely used to model network traffic and in queuing theory. For instance, Heffes [8] demonstrates how the MMPP may be utilized to model the statistical multiplexing of a finite number of voice services and data. Muscariello et al. [9] show how the MMPP approximates the long-range dependence (LRD) characteristics of Internet traffic traces. Scott [10] discusses the application of the MMPP to web traffic modeling. Consider also the following type of sensor network. Each sensor transmits a sequence of bursts of data to a central server for processing. Each burst may consist of several packets of requests. Bursts from multiple nodes are merged, resulting in a single sequence of packets. An appropriate model for the sequence of sensor node identifications (IDs) of the packets in the merged sequence is a Markov chain. Service requirements for packets from each node are exponentially distributed, with different nodes having different means. The sequence of service requirements results in an MMPP. If traffic from each node is a Poisson process, the final sequence of service requirements is a hyperexponential renewal process, a special case of an MMPP. Applications of the MMPP are not limited to computer networks and may also be found in environmental, medical, industrial and sociological research [10]. Indeed, Onof et al. [11] utilize the MMPP to study rainfall patterns. A further exposition of the MMPP is found in [6].

A. Definition of the MMPP

The MMPP has a finite number, M, of states and operates as a Poisson process with state-dependent rate λ_i, where 1 ≤ i ≤ M. Define q_{ij} to be the transition rate from state i to state j, and the M × M matrix Q to be the matrix with element q_{ij} at row i and column j. The steady state probabilities are defined as π = [π_1, ..., π_M]^T and satisfy the matrix equation

\pi Q = 0.   (6)

From this definition of the model, it is clear that the output of the MMPP may be modeled as the outcome of a mixture of exponential random variables with probability density function (pdf)

f(x) = \sum_{i=1}^{M} \pi_i \lambda_i e^{-\lambda_i x}.   (7)

This mixture distribution is commonly referred to as the hyperexponential distribution. Given N independent, identically distributed (iid) samples of data, the problem dealt with here is to estimate the parameters λ = [λ_1, ..., λ_M]^T and π of the hyperexponential density. Thus, all the parameters of the corresponding MMPP, except for the transition rates Q, may be determined from the hyperexponential density.
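To make the estimation problem concrete, the following sketch (ours, in Python with NumPy; the paper itself reports a MATLAB implementation) draws iid samples from the pdf in equation (7), using the parameter values quoted later in section IV:

    import numpy as np

    def sample_hyperexponential(pi, lam, n, seed=None):
        # Draw n iid samples from f(x) = sum_i pi[i] * lam[i] * exp(-lam[i] * x):
        # pick a mixture component for each sample, then draw an exponential
        # variate with the corresponding rate.
        rng = np.random.default_rng(seed)
        lam = np.asarray(lam, dtype=float)
        comp = rng.choice(len(pi), size=n, p=pi)
        return rng.exponential(scale=1.0 / lam[comp])

    # Parameter values from the simulations of section IV.
    pi = [0.28, 0.14, 0.38, 0.2]
    lam = [1.0, 2.0, 3.0, 4.0]
    x = sample_hyperexponential(pi, lam, n=1000, seed=0)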
In much of the literature, it is assumed that the parameters of the MMPP model are known, and so treatment of the problem of parameter estimation is sparse [10]. However, some algorithms have been presented that tackle the task of estimating these parameters. The majority of these algorithms are based on maximum-likelihood (ML) techniques. Scott [10] demonstrates a procedure wherein an MMPP may be expressed as a hidden Markov model (HMM) and the corresponding parameters estimated using either the expectation-maximization (EM) algorithm or a rapidly mixing Markov chain Monte Carlo (MCMC) algorithm. Meier-Hellstern [12] presents a technique for estimating the parameters using the EM algorithm in the special case wherein the number of states M is two. Techniques that make use of the EM algorithm inherently suffer from the problems associated with it: the EM algorithm is well known to have a slow rate of convergence and may converge to values on the boundary of the parameter space.

In this paper, we present two computationally efficient, tractable and easily implemented algorithms for estimating the parameters of a hyperexponential distribution. These techniques may be extended to estimate the transition probabilities Q as well. The first algorithm (hereafter referred to as Algorithm 1) is a two-step procedure. The first step involves estimating the component means λ of the hyperexponential distribution. The approach here is to develop equations that express π in terms of λ. These expressions are then substituted into an objective function that is a function of λ and an estimate of the cdf of the hyperexponential distribution. Minimizing this objective function yields the required estimates of λ. The second step estimates the steady state probabilities π, given that the component means are known, by making use of linear least squares analysis. To our knowledge, there are no other similar techniques for parameter estimation of hyperexponential distributions. The second algorithm (Algorithm 2) estimates π and λ in one phase by minimizing an objective function that is a function of π, λ and an estimate of the cdf of the hyperexponential distribution. The relative merits of each algorithm are also compared and discussed. Once the parameters of the hyperexponential density are obtained, the algorithm developed by Dattatreya [13] is employed to compute the transition probabilities of the MMPP.

B. Organization of paper

Algorithm 1 is developed in section II, along with the required supporting expressions for the associated objective function. Likewise, the corresponding objective function and expressions for Algorithm 2 are constructed in section III. Simulation results and analysis are described in section IV. Finally, section V concludes the paper.

II. ALGORITHM 1

Most ML techniques for computing the parameters of hyperexponential densities intrinsically require estimation of 2M parameters. Likewise, any optimization procedure that uses or evaluates the hyperexponential density directly also involves estimation of 2M parameters. In this section, equations expressing π in terms of λ and of functions of the samples of data are derived, thus reducing the number of unknown parameters to M. Making use of these equations, an objective function is formulated and the values of λ are calculated by minimizing this objective function. It is assumed that each mixing proportion is strictly positive, ensuring that each component plays a role in influencing the resulting pdf. Algorithm 1 essentially reduces the number of unknown parameters by incorporating additional information from the samples of data.

A. Expressions for π

The Laplace transform of the hyperexponential density, evaluated at M distinct points, gives tractable equations connecting the transforms and the unknown parameters. Define α = [α_1, ..., α_M]^T, where each α_i is a distinct, real, positive value. The Laplace transform of the pdf in equation (7), evaluated at the point 1/α_i and scaled by a factor of 1/α_i, is

E\left[ \frac{1}{\alpha_i} e^{-X/\alpha_i} \right] = \sum_{j=1}^{M} \frac{\pi_j}{1/\lambda_j + \alpha_i}.   (8)

Let A be the matrix with element 1/(1/λ_j + α_i) at row i and column j. Also let a = [a_1, ..., a_M]^T be the vector of expectations, where a_i = E[(1/α_i) e^{-X/α_i}]. Therefore, in matrix notation, equation (8) is

A \pi = a \quad \text{and}   (9)

\pi = A^{-1} a,   (10)

provided that A is nonsingular. Matrix A is a Cauchy matrix and hence has certain nice properties (Boras [14]). For instance, the inverse of a Cauchy matrix can be represented as an explicit expression (Knuth [15]), and thus B = A^{-1} is an M × M matrix with element

b_{ij} = \frac{\prod_{k=1}^{M} \left( \frac{1}{\lambda_j} + \alpha_k \right) \left( \frac{1}{\lambda_k} + \alpha_i \right)}{\left( \frac{1}{\lambda_j} + \alpha_i \right) \left( \prod_{k \ne j} \left( \frac{1}{\lambda_j} - \frac{1}{\lambda_k} \right) \right) \left( \prod_{k \ne i} \left( \alpha_i - \alpha_k \right) \right)},   (11)

at row i and column j, where 1 ≤ k ≤ M, λ_1 ≠ λ_2 ≠ ... ≠ λ_M ≠ 0 and α_1 ≠ α_2 ≠ ... ≠ α_M. Each λ_i is distinct by the assumption that the mixing proportions of the components are all non-zero, and each α_i is distinct by assumption. Hence, A is invertible. From equations (10) and (11), the steady state probabilities may be expressed as

\pi_i(\lambda) = \sum_{j=1}^{M} b_{ij} a_j.   (12)

This gives a means of computing π if the component means and the expectations in equation (8) are provided.
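As an illustrative sketch (ours) of the mapping in equations (8) through (12), the matrix A can be formed directly and the system Aπ = a solved numerically; for moderate M a direct solve is adequate, while the explicit Cauchy inverse (11) can be substituted when better numerical precision is needed:

    import numpy as np

    def steady_state_probs(lam, alpha, a):
        # Equation (10): solve A * pi = a, where A[i, j] = 1 / (1/lam[j] + alpha[i])
        # is a Cauchy matrix, invertible when the lam's and alpha's are distinct.
        lam = np.asarray(lam, dtype=float)
        alpha = np.asarray(alpha, dtype=float)
        A = 1.0 / (1.0 / lam[np.newaxis, :] + alpha[:, np.newaxis])
        return np.linalg.solve(A, a)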

B. Determination of λ

The expressions for the steady state probabilities developed in the previous section afford a means of reducing the number of unknown variables to just M. In this section, an algorithm to obtain these M component means is formulated by fitting a candidate cdf to the given exact cdf. The cdf of a hyperexponential density is

F(x) = 1 - \sum_{k=1}^{M} \pi_k e^{-\lambda_k x}.   (13)

Let λ̃ = [λ̃_1, ..., λ̃_M]^T and π̃(λ̃) = [π̃_1(λ̃), ..., π̃_M(λ̃)]^T be the current approximations for λ and π, respectively, in an iterative sequence of approximations for fitting the cdfs. Define â to be the estimate of a from the samples of data. Hence, an approximate or candidate cdf is given as

\tilde{F}(x, \tilde{\lambda}) = 1 - \sum_{k=1}^{M} \tilde{\pi}_k(\tilde{\lambda}) e^{-\tilde{\lambda}_k x}.   (14)

Notice that the only unknown in this candidate cdf is λ̃. The error of the fit of this candidate cdf at a single point x is defined to be

\left( F(x) - \tilde{F}(x, \tilde{\lambda}) \right)^2.   (15)

This result can be extended over the entire domain of x, i.e. the set of all real numbers greater than or equal to zero (R^+), to give the total error by integrating equation (15) for all x ∈ R^+. Unfortunately, this integral does not furnish a simple, tractable expression for the total error. Thus, in practice, the integral would have to be evaluated numerically, which is a somewhat expensive task. Realize that for practical purposes the entire domain R^+ need not be considered. A viable approximation of the total error may be obtained by computing the error at a finite number of points, m, over the region of interest and summing up the error as follows:

d(\tilde{\lambda}) = \sum_{k=1}^{m} \left( F(x_k) - \tilde{F}(x_k, \tilde{\lambda}) \right)^2.   (16)

Observe that d(λ̃) ≥ 0 for all λ̃_i > 0 and 1 ≤ i ≤ M. Therefore, d(λ̃) is bounded from below and has a global minimizer. Moreover, this minimum is known to be zero: for an ideal candidate cdf, d(λ̃) = 0. Obviously, a better approximation of the total error is obtained if m is made large; however, a lower bound on m is desirable. The family of hyperexponential densities is identifiable (Yakowitz and Spragins [16]) and hence, by [17], a minimum of M data points {x_1, ..., x_M} is required in order for a set of parameters to uniquely determine this data. Therefore, in theory it is necessary that m ≥ M.

The component means are obtained by minimizing (16) with respect to λ̃. It is worth noticing that the objective function in question is not, in general, a convex function of λ̃. Therefore, Newton methods for minimization cannot be directly applied. Nonetheless, the problem of finding λ̃ is posed as a constrained nonlinear optimization problem. The first set of constraints ensures that λ_k > 0 for all 1 ≤ k ≤ M. The second set results from ensuring that π are valid, non-zero probabilities, i.e.

\sum_{k=1}^{M} \pi_k(\tilde{\lambda}) = 1 \quad \text{and}   (17)

0 < \pi_k(\tilde{\lambda}) < 1 \quad \text{for all } 1 \le k \le M.   (18)

C. Estimation of λ from statistical data

In order to minimize the objective function developed, there are several quantities that need to be estimated from the samples of data. In this section, estimators for these quantities are devised. Given n samples of data {x_1, ..., x_n} and assumed distinct constants α, an estimator for a_i is

\hat{a}_i = \frac{1}{n \alpha_i} \sum_{k=1}^{n} e^{-x_k/\alpha_i}.   (19)
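A vectorized version of estimator (19) (our sketch, continuing the NumPy setting above) is immediate:

    import numpy as np

    def a_hat(x, alpha):
        # Sample-mean estimate of a_i = E[(1/alpha_i) * exp(-X/alpha_i)],
        # returned as a vector with one entry per alpha_i.
        x = np.asarray(x, dtype=float)
        alpha = np.asarray(alpha, dtype=float)
        return np.exp(-x[np.newaxis, :] / alpha[:, np.newaxis]).mean(axis=1) / alpha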
The equations derived previously implicitly assume that the cdf F(x) is exact and available. Of course, this is not the case, and an estimate of the cdf must be obtained from the samples of data. The cdf is a good choice for function-fitting, and for estimation in general, for the hyperexponential density for a number of reasons. First, the cdf is a smooth, monotonically increasing function. Second, the cdf can be estimated easily and accurately from a finite number of samples of data. Third, given the implementation chosen here for the cdf, a lookup of a value runs in O(log n) time, which is quite fast. A piecewise continuous estimate of the cdf is obtained as follows. Sort the observations {x_1, ..., x_n} to produce the values {y_1, ..., y_n} such that y_1 ≤ y_2 ≤ ... ≤ y_n and {y_1, ..., y_n} is a permutation of {x_1, ..., x_n}. Hence, the estimate of the cdf is defined as

\hat{F}(x) = \begin{cases} 0, & x < y_1 \\ \frac{i-1}{n-1}, & x = y_i \\ \frac{i-1}{n-1} + \frac{x - y_i}{(y_{i+1} - y_i)(n-1)}, & y_i < x < y_{i+1} \\ 1, & x > y_n. \end{cases}   (20)

Define λ̂ as the estimate of λ and substitute equation (20) into (16), giving the new objective function

\hat{d}(\hat{\lambda}) = \sum_{k=1}^{m} \left( \hat{F}(x_k) - \tilde{F}(x_k, \hat{\lambda}) \right)^2.   (21)

Minimizing this objective function produces the estimate λ̂.
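The piecewise-linear estimate (20) and the objective (21) translate directly into code. In the sketch below (ours; the paper minimized (21) with MATLAB's fminunc, for which SciPy's BFGS option stands in), np.interp reproduces the piecewise-linear form of (20) for distinct sorted samples, clamping to 0 and 1 outside [y_1, y_n]:

    import numpy as np
    from scipy.optimize import minimize

    def empirical_cdf(samples):
        # Piecewise-linear cdf estimate of equation (20).
        y = np.sort(np.asarray(samples, dtype=float))
        p = np.linspace(0.0, 1.0, len(y))   # (i-1)/(n-1) at each y_i
        return lambda t: np.interp(t, y, p)

    def d_hat(lam, xs, F_hat, pi_of_lam):
        # Objective (21): squared cdf-fit error at the m points xs, with the
        # mixing proportions recovered from the candidate means via (12).
        pi = pi_of_lam(lam)
        F_tilde = 1.0 - np.exp(-np.outer(xs, lam)) @ pi
        return np.sum((F_hat(xs) - F_tilde) ** 2)

    # Quasi-Newton (BFGS) minimization from an initial guess lam0:
    #   res = minimize(d_hat, lam0, args=(xs, F_hat, pi_of_lam), method="BFGS")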
D. Estimation of π from statistical data

In the preceding section, a method was developed for determining estimates of λ given an estimate of the cdf. A naive approach to attaining values for π is to exploit equation (12). However, due to the nature of the hyperexponential density and the equations involved, the values of the vector a cannot be estimated accurately enough from data for computing π. The values attained for π from equation (12) are very sensitive to the values derived for λ̃: small inaccuracies in λ̃ translate into large errors in π. This is mainly due to the inability of â to estimate a with sufficient accuracy. In contrast, the estimate F̂(x) of the cdf computed from the data provides a very accurate representation of the exact cdf. This is yet another advantage of using the estimate of the cdf for function-fitting.

The cdf of the hyperexponential density is linear in π. In addition, the estimated cdf can be expressed in terms of the estimates λ̂ and the unknown parameters π as follows:

\sum_{k=1}^{M} \hat{\pi}_k e^{-\hat{\lambda}_k x} = 1 - \hat{F}(x),   (22)

for all 0 < x < ∞. Let z = {z_1, ..., z_S} be an arbitrary set of positive, real constants such that

\inf_{1 \le i \le S} z_i = \inf_{1 \le j \le n} x_j \quad \text{and}   (23)

\sup_{1 \le i \le S} z_i = \sup_{1 \le j \le n} x_j.   (24)

Define F̂(z) = [F̂(z_1), ..., F̂(z_S)]^T and π̂ = [π̂_1, ..., π̂_M]^T. Let Ĉ be the S × M matrix with Ĉ_{ij} = e^{-λ̂_j z_i} being the element at the ith row and jth column. From these definitions, equation (22) can be written in matrix notation as

\hat{C} \hat{\pi} = 1 - \hat{F}(z),   (25)

leading to the following theorem.

Theorem 1: Equation (25) has a unique solution for the mixing proportions, given that the component means are known.

Proof: Equation (25) can be solved using linear least squares regression analysis, and the associated least squares objective is a convex function. Therefore, it has a global minimizer, yielding a unique solution for π.

From this theorem, π̂ is obtained by solving equation (25) using linear least squares regression analysis.
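A minimal sketch of this second phase (ours; the paper used MATLAB's lsqlin, and SciPy's bounded least squares routine stands in for it here) is:

    import numpy as np
    from scipy.optimize import lsq_linear

    def estimate_pi(lam_hat, z, F_hat):
        # Solve equation (25), C_hat * pi_hat = 1 - F_hat(z), by linear least
        # squares with the bound constraints 0 <= pi_k <= 1 of (18).
        C = np.exp(-np.outer(z, lam_hat))   # C[i, j] = exp(-lam_hat[j] * z[i])
        rhs = 1.0 - F_hat(np.asarray(z, dtype=float))
        pi = lsq_linear(C, rhs, bounds=(0.0, 1.0)).x
        # Renormalizing enforces constraint (17); this step is our addition,
        # and an equality-constrained solver could impose it directly instead.
        return pi / pi.sum()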
E. Summary of Algorithm 1

The following presents a summary of Algorithm 1 for obtaining estimates of the parameters of the hyperexponential density, given n samples of data.

1) Choose values for α such that each α_i is distinct and positive.
2) Obtain an initial estimate for λ̂.
3) Compute â using α and the samples of data.
4) Minimize d̂(λ̂) to obtain a new estimate for λ̂.
5) Using the new estimate of λ̂, compute Ĉ and, using linear least squares regression, obtain an estimate for π.

An end-to-end sketch of these steps follows the figure below.

[Fig. 1. Sample plot of pdf generated from Algorithm 1 using 100 samples of data and after 10-20 iterations. The plot shows f(x) versus x, comparing the actual pdf, the Algorithm 1/2 pdf, and the EM pdf.]
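Pulling the pieces together, a minimal end-to-end sketch of Algorithm 1 (ours; it reuses the helper functions sketched in sections II-A through II-D, and the choices of alpha, lam0 and m below are illustrative assumptions, not values from the paper):

    import numpy as np
    from scipy.optimize import minimize

    # Step 1: distinct, positive alpha values (illustrative choice).
    alpha = np.array([0.5, 1.0, 1.5, 2.0])
    # Step 2: initial estimate of the component means.
    lam0 = np.array([0.5, 1.5, 2.5, 3.5])
    # Step 3: expectations a_hat computed from the samples x.
    a = a_hat(x, alpha)
    pi_of_lam = lambda lam: steady_state_probs(lam, alpha, a)
    # Step 4: quasi-Newton (BFGS) minimization of the cdf-fit objective (21).
    F = empirical_cdf(x)
    xs = np.linspace(x.min(), x.max(), 50)   # m = 50 fit points, m >= M
    res = minimize(d_hat, lam0, args=(xs, F, pi_of_lam), method="BFGS")
    lam_hat = np.sort(res.x)
    # Step 5: mixing proportions by bounded least squares on equation (25).
    pi_hat = estimate_pi(lam_hat, xs, F)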
III. ALGORITHM 2

The second algorithm developed is similar to Algorithm 1 of the previous section. The essential difference in Algorithm 2 is that all 2M parameters are estimated from the developed objective function. In the following subsection, the required objective function is developed.

A. Development of the objective function

As in the previous section, an approximate cdf is constructed as follows:

\tilde{F}(x, \tilde{\lambda}, \tilde{\pi}) = 1 - \sum_{k=1}^{M} \tilde{\pi}_k e^{-\tilde{\lambda}_k x}.   (26)

Notice that this cdf has three arguments as opposed to two, and that π̃ is no longer described as a function of λ̃. Following the procedure of the previous section, the error of fit of this candidate function is expressed as

\left( F(x) - \tilde{F}(x, \tilde{\lambda}, \tilde{\pi}) \right)^2.   (27)

Similarly, the approximate total error of the fit is denoted

d(\tilde{\lambda}, \tilde{\pi}) = \sum_{k=1}^{m} \left( F(x_k) - \tilde{F}(x_k, \tilde{\lambda}, \tilde{\pi}) \right)^2,   (28)

where m ≥ M. The function d(λ̃, π̃) has properties similar to those of d(λ̃), in that d(λ̃, π̃) ≥ 0 for all λ̃_i > 0 and 1 ≤ i ≤ M. Hence, d(λ̃, π̃) is also bounded below and has a global minimizer.

The required parameters are obtained by minimizing the objective function in equation (28). As in the previous section, this is a constrained nonlinear optimization problem. The constraints are the same, i.e. that the component means are positive and that the mixing proportions are valid probabilities. The latter constraint can, however, be relaxed by introducing the softmax function (Bishop [18]). Let γ = [γ_1, ..., γ_M] be a vector of real constants, and hence

\pi_i(\gamma) = \frac{e^{\gamma_i}}{\sum_{j=1}^{M} e^{\gamma_j}},   (29)

for all 1 ≤ i ≤ M. The number of free parameters can be further reduced to 2M − 1 by fixing one of the γ_i to an arbitrary constant. For instance, let γ_M = 0; the values of the remaining γ_i will then be translated appropriately. Using this definition of π_i ensures that the constraints on the mixing proportions are met. Thus, the only applicable constraint is that λ_i > 0 for all 1 ≤ i ≤ M. Observe that the hyperexponential cdf is strictly monotonic, and thus the value of the objective function will be large if any λ_i becomes negative. Therefore, the constraint λ_i > 0 is implicitly enforced and we have an unconstrained nonlinear optimization problem. Let γ̂ be the estimate of γ and introduce the new objective function defined as

\hat{d}_u(\hat{\lambda}, \hat{\gamma}) = \sum_{k=1}^{m} \left( \hat{F}(x_k) - \tilde{F}(x_k, \hat{\lambda}, \hat{\pi}(\hat{\gamma})) \right)^2.   (30)

It is worth noticing that (30) is not a convex function of λ̂ and γ̂, so the Newton method for optimization cannot be applied. Instead, the minimization should be performed using either quasi-Newton methods with the BFGS update and inexact line searches, or the Levenberg-Marquardt method for nonlinear regression analysis [19].

For Algorithm 2, the only quantity that needs to be estimated is the cdf. The approach for estimating the cdf is the same in both Algorithms 1 and 2. In the next subsection, a summary of the algorithm is presented.

B. Summary of Algorithm 2

The following presents a summary of Algorithm 2. Estimates of the parameters of the hyperexponential density are obtained, given n samples of data.

1) Obtain initial estimates for λ̂ and π̂.
2) Minimize d̂_u(λ̂, γ̂) to give new estimates for λ̂ and π̂.
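A compact sketch of Algorithm 2 (ours, reusing empirical_cdf and the names x, xs, F and lam0 from the Algorithm 1 sketch above; the max-subtraction in the softmax is a standard numerical-stability precaution, not something the paper specifies):

    import numpy as np
    from scipy.optimize import minimize

    def softmax(gamma):
        # Equation (29): map unconstrained gamma to valid mixing proportions.
        e = np.exp(gamma - gamma.max())
        return e / e.sum()

    def d_hat_u(theta, xs, F_hat, M):
        # Objective (30) over theta = [lam_1..lam_M, gamma_1..gamma_{M-1}],
        # with gamma_M fixed to 0 to remove the redundant degree of freedom.
        lam, gamma = theta[:M], np.append(theta[M:], 0.0)
        F_tilde = 1.0 - np.exp(-np.outer(xs, lam)) @ softmax(gamma)
        return np.sum((F_hat(xs) - F_tilde) ** 2)

    M = len(lam0)
    theta0 = np.concatenate([lam0, np.zeros(M - 1)])   # both parameter groups at once
    res = minimize(d_hat_u, theta0, args=(xs, F, M), method="BFGS")
    lam_hat = res.x[:M]
    pi_hat = softmax(np.append(res.x[M:], 0.0))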
IV. SUMMARY OF SIMULATION EXPERIMENTS AND RESULTS

The algorithms discussed in the previous sections were implemented and tested through simulation using different values of M, λ and π. A subset of the simulation trials is discussed here. Both algorithms were tested on a four-component hyperexponential density using synthetically generated iid mixture samples. In addition, the results of both algorithms were compared to a basic implementation of the EM algorithm.
Evaluation of the objective function d̂(λ̂) was performed by evaluating the approximate total error at equally spaced points over a meaningful domain. In Algorithm 1, the first phase, estimating λ̂, is accomplished through the use of quasi-Newton methods, with the BFGS method for the Hessian update and a safeguarded mixed quadratic and cubic polynomial interpolation and extrapolation method for line searches. The implementation chosen is the fminunc MATLAB function [20]. For the second phase of Algorithm 1, constrained linear regression was performed using SQP methods, specifically the lsqlin MATLAB function.

Sample results of the three algorithms are given in figure 1. The pdfs generated by Algorithms 1 and 2 were very similar, so only one plot is shown. The figure compares the generated pdf of the corresponding algorithm to the actual pdf. Synthetic data with the following characteristics was generated and used for each of the algorithms: λ = [1.0, 2.0, 3.0, 4.0] and π = [0.28, 0.14, 0.38, 0.2]. For each algorithm, starting points for λ̂ were chosen randomly in the range [0.5, 4.5]. Algorithm 1 was executed using 100 data samples and terminated after 10-20 iterations for all the simulation runs. Algorithm 2 was executed using 1000 data samples and terminated after a maximum of 100 iterations. The EM algorithm was performed on 1000 data samples and terminated after 250 iterations. From the simulation results, it is evident that Algorithms 1 and 2 produce results of similar quality. Our results for the basic EM algorithm are not very accurate unless the starting point is quite close to the solution. Algorithm 1 appears to be superior to Algorithm 2 since it uses much less data (10% of that used by Algorithm 2) and terminates in far fewer iterations (10%-20% of the iterations required by Algorithm 2).

Since Algorithm 1 has M unknown parameters, versus 2M parameters in Algorithm 2, it is expected that more local minima of the corresponding objective function will exist in Algorithm 2. This explains why more data samples are required in Algorithm 2 for a sufficiently accurate solution. Realize also that the computations of Algorithm 1 are somewhat more complex than those of Algorithm 2, and that there are two distinct steps in Algorithm 1 as opposed to one in Algorithm 2. So the iterations of Algorithm 1 are slightly more expensive in terms of computation time than those of Algorithm 2. There is no simulation evidence to suggest that there are operating regions in which one algorithm is superior to the other, given the current approaches taken for optimization. The general manifestation is that Algorithm 1 is better than Algorithm 2 in terms of computation expense and accuracy of solution, given that all input parameters to both algorithms are identical.

V. CONCLUSION

The major contributions presented here are algorithms for computing the parameters of a hyperexponential density. Our algorithms are easily implemented yet computationally very efficient. There are numerous potential applications of these algorithms, particularly in the areas of network traffic modeling and queuing theory. In addition, evidence is presented to suggest that the algorithms presented are superior to the EM algorithm in terms of robustness and computation speed. Algorithm 1 appears to be superior to Algorithm 2 in terms of computation speed and the number of data samples required. However, Algorithm 2 is somewhat conceptually simpler to implement and execute. Nevertheless, both algorithms are efficient and easily implemented using standard tools of numerical optimization. In addition, simulation results indicate that both algorithms give remarkably accurate estimates of the hyperexponential pdf.

There are possible areas for improvement of the algorithms presented here under certain conditions. The use of simulated annealing may overcome the need for initial estimates of the component means, and may also allow a global minimum of the objective function to be obtained. Boras [14] demonstrates techniques for improving the numerical precision of the task of finding the inverse of a Cauchy matrix. These techniques may improve the accuracy and quality of the estimates.

REFERENCES

[1] K. S. Trivedi, Probability and Statistics with Reliability, Queuing and Computer Science Applications, John Wiley and Sons, NY, USA, 2002.
[2] N. M. Markovitch and U. R. Krieger, "Nonparametric estimation of long-tailed density functions and its application to the analysis of World Wide Web traffic," Performance Evaluation, Vol. 42, No. 2-3, pp. 205-222, 2000.
[3] W. E. Leland, M. S. Taqqu, W. Willinger and V. Wilson, "On the Self-Similar Nature of Ethernet Traffic (Extended Version)," IEEE/ACM Trans. on Networking, Vol. 2, No. 1, pp. 1-15, Jan. 1994.
[4] A. Feldmann and W. Whitt, "Fitting Mixtures of Exponentials to Long-Tail Distributions to Analyze Network Performance Models," Performance Evaluation, Vol. 31, No. 3-4, pp. 245-279, Jan. 1998.
[5] B. W. Silverman, Density Estimation for Statistics and Data Analysis, Chapman and Hall, Ltd., 1986.
[6] W. Fischer and K. S. Meier-Hellstern, "The Markov-modulated Poisson process (MMPP) cookbook," Performance Evaluation, No. 18, pp. 149-171, 1992.
[7] G. Bolch, S. Greiner, H. de Meer and K. S. Trivedi, Queuing Networks and Markov Chains: Modeling and Performance Evaluation with Computer Science Applications, John Wiley and Sons, NY, USA, 1998.
[8] H. Heffes and D. M. Lucantoni, "A Markov Modulated Characterization of Packetized Voice and Data Traffic and Related Statistical Multiplexer Performance," IEEE Journal Sel. Areas Commun., Vol. SAC-4, No. 6, Sept. 1986.
[9] L. Muscariello, M. Mellia, M. Meo, M. Marsan and R. Lo Cigno, "An MMPP-Based Hierarchical Model of Internet Traffic," ICC 2004.
[10] S. L. Scott and P. Smyth, "The Markov Modulated Poisson Process and Markov Poisson Cascade with Applications to Web Traffic Modeling," Bayesian Statistics 7, 2003.
[11] C. Onof, B. Yameundjeu, J. P. Paoli and N. Ramesh, "A Markov modulated Poisson process model for rainfall increments," Water Science and Technology, Vol. 45, No. 2, pp. 91-97, 2002.
[12] K. S. Meier-Hellstern, "A fitting algorithm for Markov-modulated Poisson processes," Euro. Jour. Oper. Res., No. 29, pp. 370-377, 1987.
[13] G. R. Dattatreya, "Estimation of prior and transition probabilities in multi-class finite Markov mixtures," IEEE Trans. on Sys., Man, and Cyber., Vol. 21, No. 2, pp. 418-426, Mar. 1991.
[14] T. Boras, "Studies in Displacement Structure Theory," Ph.D. Dissertation, Stanford University, CA, USA, 1996.
[15] D. E. Knuth, Fundamental Algorithms, The Art of Computer Programming, Vol. 1, Second Edition, Addison-Wesley, MA, USA, 1973.
[16] S. J. Yakowitz and J. D. Spragins, "On the Identifiability of Finite Mixtures," Ann. Math. Stat., Vol. 39, No. 1, pp. 209-214, 1968.
[17] H. Teicher, "Identifiability of Finite Mixtures," Ann. Math. Stat., Vol. 34, No. 4, pp. 1265-1269, Dec. 1963.
[18] C. M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, NY, USA, 1995.
[19] J. Nocedal and S. J. Wright, Numerical Optimization, Springer-Verlag, NY, USA, 1999.
[20] MATLAB software, http://www.mathworks.com
