Abstract - This paper presents two statistical methods for determining a voice quality degradation event that rely on call duration analysis, an approach that is suitable for large call volume systems with low processing power requirements. For the analysis we used real life data (460000 calls) and 387 customer complaints from a prepaid long distance provider. We show that, on average, more than 80% of the voice quality degradation events that lead to customer complaints can be identified in real time using our tests.

I. INTRODUCTION

The public switched telephone network (PSTN) gives a high level of reliability and conversational quality to its users [1, 3], attributes achieved by the dedicated network it uses. Unlike PSTN, Voice over Internet Protocol (VoIP) has a lower reliability [1] due to the best effort nature of the Internet. On the other hand, the Internet enabled quick and easy interconnection for VoIP carriers, which made possible the existence of a large number of carriers and providers with a greater number of interconnects between them. The quick expansion of VoIP also introduced equipment and bandwidth dimensioning problems resulting in fluctuating delays and packet losses, factors that lower the voice quality of the phone calls.

For many telephony applications voice quality monitoring is required to determine in (near) real-time if a voice quality degradation event has occurred. The existing objective voice quality assessment solutions are voice stream analysis (intrusive and non-intrusive) algorithms, whose main drawback is the processing power they require, as presented in section II. In section III, we present the call length distribution and determine an analytical model for it. Sections IV and V describe the two new test methods and we present our conclusions in section VI.

II. CURRENT METHODS

The methods currently used for voice quality assessment can be divided into subjective and objective ones.

A. SUBJECTIVE MEASUREMENTS

Historically, the first method developed for quality evaluation was a subjective one, defined by the International Telecommunication Union (ITU-T) recommendation P.800, which introduced the Absolute Category Rating (ACR) test method. The ACR gives scores from 5 (excellent) to 1 (bad). Because of the telecommunication environment, testing is done without a comparison to an undistorted reference. This matches the typical situation of a phone call, where the listener has no access to a comparison with a reference, for example the original voice of the other party. However, this test can be regarded as a comparison of a test signal with a reference "in the mind" of the listener: because the listener is very familiar with the natural sound of the human voice, he/she can compare the test signal with his/her representation of voice. The testing procedure is as follows: 20 to 50 test subjects are presented with an identical series of speech fragments. Every test subject is asked to score each sample using ACR. After statistically processing the individual results, a mean opinion score (MOS) can be calculated.

B. OBJECTIVE MEASUREMENTS

Objective quality evaluation followed from the need for computer automation of these tests, and two categories of methods were developed: intrusive and non-intrusive.

1. INTRUSIVE OBJECTIVE MEASUREMENT

Intrusive measurements, like PSQM and PESQ[4] for voice quality and PEAQ[5] for audio quality, by definition use two input signals for the quality evaluation: the undistorted original signal and the degraded signal. The MOS score is estimated based on a weighted sum of differences between the two signals. The most successful intrusive algorithms rely on psychoacoustic models which emulate the human hearing process and measure the differences between the reference and test signal on a perceptual basis.

In order to use intrusive measurement, a VoIP provider needs to have access to the distorted signal, a situation that can rarely be achieved in a real world deployment because of the difficulty of mounting equipment at the client end.

2. NON-INTRUSIVE OBJECTIVE MEASUREMENT

In non-intrusive measurements, like the one defined in ITU-T recommendation P.563[6], only the degraded voice stream is used for the MOS estimation. Measurements following this approach are typically built on general models of the human vocal tract to model speech generation, as well as on psychoacoustic models to simulate the human hearing process.
Figure 1. Call length distributions (axes: call length [15s] versus probability). 'Long term CLD' is built on call data acquired over a long term, while the other two are built on recent calls. 'Short term CLD 2' is considered better than 'Short term CLD 1', having a distribution similar to the long term CLD.

A. DISTRIBUTION OF THE CLD MEAN

Several studies have tried to identify the distribution family of the call lengths. Willinger W. et al. argue that the distribution is exponential [4]. For wireless channel holding times, Jedrzycki C. et al. [5] and Barcelo F. et al. [10] show that the log-normal distribution is a better fit. In his work in ITC 14, Bolotin [11] mentioned several papers appearing in ITC 13 which assumed an exponentially distributed call length. The authors showed, however, that mixtures of lognormals fit the call holding

$$f(x, k, \lambda) = \frac{\lambda^k x^{k-1} e^{-\lambda x}}{(k-1)!} \quad \text{for } x, \lambda \ge 0 \qquad (3)$$

Based on the Central Limit Theorem, we can state that for a sufficiently large k the Erlang (Gamma) distribution converges to a Gaussian distribution with mean:

$$\mu = \frac{k}{\lambda} \qquad (4)$$

and variance:

$$\sigma^2 = \frac{k}{\lambda^2} \qquad (5)$$

From formula (4) and formula (5), the mean of k samples will be normally distributed with the mean value of

$$\mu = \frac{1}{\lambda} \qquad (6)$$

and variance
$$\sigma^2 = \frac{1}{k\lambda^2} \qquad (7)$$

From this result, we can compute the accuracy gain obtained by using a larger number of calls in calculating the average:

$$\frac{\sigma_1}{\sigma_2} = \frac{\sqrt{1/(k_1\lambda^2)}}{\sqrt{1/(k_2\lambda^2)}} = \sqrt{\frac{k_2}{k_1}} \qquad (8)$$

For example, to reduce the value of sigma by 50% we need to use 4 times more call samples.

In the following sections we will analyze separately the case where k is large and the case where k is small.

IV. TESTS USING LARGE SAMPLE NUMBER (K LARGE)

First we will analyze the case where the number of samples is large and we can make the normality assumption for the sample mean, in our case the call length average. Testing for a statistically significant difference between the short and long term call length average can be done using the statistical Z test.

A. THE Z TEST

The general form of testing a hypothesis is:

$$H_0: \mu < \mu_0 \quad \text{(null hypothesis)} \qquad (9)$$

$$H_1: \mu \ge \mu_0 \quad \text{(alternate hypothesis)} \qquad (10)$$

If μX (the sample mean of an X CLD distribution) is normally distributed and σμX is the standard deviation of μX, the general formula of the Z test is given by:

$$Z_X = \frac{\mu_X - \mu_0}{\sigma_{\mu_X}} \qquad (11)$$

If ZX is smaller than the critical value (ZX < Zcritical), we will state that we cannot reject the test hypothesis (H0) based on the evidence data.

Our test hypothesis is:

$$H_0: \mu_S < \mu_L \qquad (12)$$

where

$$\mu_L = \frac{1}{n}\sum_{i=1}^{n} CD_i \quad \text{(sample mean for the CLDL)} \qquad (13)$$

$$\mu_S = \frac{1}{k}\sum_{i=1}^{k} CD_i \quad \text{(sample mean for the CLDS)} \qquad (14)$$

We consider the X distribution to be the CLDS, so we have to define the standard deviation of the mean value of the CLDS. Because a long term period is built of several short term periods, by calculating the sample mean of these periods (μS) we can state that:

$$\mu_L = \frac{1}{l}\sum_{i=1}^{l} \mu_{S_i} \quad \text{(} \mu_L \text{ is the sample mean of } \mu_S \text{)} \qquad (15)$$

where l is the number of short term periods in the long term period.

So in our formula σμS becomes σL (the standard deviation of the CLDL), calculated using the mean values of the CLDS:

$$\sigma_L \; (= \sigma_{\mu_S}) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (\mu_{S_i} - \mu_L)^2} \qquad (16)$$

where n is the number of call lengths for the short term average calculation.

The number of calls for the long term statistic (n) is determined by the number of calls in a long period of time (4 months in our case). This number (n) can be considered random but large enough to make the Central Limit Theorem assumption.

The assessment of the difference between the two means can be done using the Z test, given in our case by:

$$Z = \frac{\mu_S - \mu_L}{\sigma_L} \qquad (17)$$

Because μS is considered normally distributed, Z has a standard normal distribution (μ=0, σ=1), as shown in Fig. 2. We hold our H0 hypothesis to be true if the Z value is less than a critical Z value (Zcritical), which is fixed by the level (α) chosen for our test. α is defined as the cumulative distribution function value of the standard normal distribution (CDFSN) with Zcritical as its argument. It represents a measure of how confident we want to be: we do not reject our hypothesis in the 100(1-α)% confidence interval, which denotes a significance level of 100α%.

$$\alpha = CDF_{SN}(Z_{critical}) \qquad (18)$$

Figure 2. The standard normal distribution. Zcritical represents the value for which the cumulative distribution function (CDF) of the standard normal distribution is α.

Looking up our calculated Z in the standard normal distribution table shows us the probability of seeing this (or even a more extreme) difference compared to the
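
The Z test of equations (15) to (18) can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the implementation used for the simulations: the function names, the sample values and the level α = 0.05 are hypothetical, the window is kept deliberately tiny for readability, and a detection event is flagged when Z falls below Zcritical, which is the comparison used in the Z-test pseudo-code.

```python
import math
from statistics import NormalDist, fmean


def long_term_stats(short_term_means):
    """Return (mu_L, sigma_L) computed from short-term means, as in (15) and (16)."""
    mu_l = fmean(short_term_means)
    variance = fmean([(m - mu_l) ** 2 for m in short_term_means])
    return mu_l, math.sqrt(variance)


def z_value(window_durations, mu_l, sigma_l):
    """Z = (mu_S - mu_L) / sigma_L, equation (17)."""
    mu_s = fmean(window_durations)
    return (mu_s - mu_l) / sigma_l


def is_detection_event(z, alpha):
    """Flag an event when Z < Z_critical, with alpha = CDF_SN(Z_critical), equation (18)."""
    z_critical = NormalDist().inv_cdf(alpha)
    return z < z_critical


# Hypothetical data: mean call lengths (seconds) of five past short-term windows.
past_window_means = [180.0, 200.0, 220.0, 190.0, 210.0]
mu_l, sigma_l = long_term_stats(past_window_means)

# Hypothetical recent window containing unusually short calls.
z = z_value([150.0, 160.0, 170.0], mu_l, sigma_l)
print(round(z, 3), is_detection_event(z, alpha=0.05))
```

Because Zcritical depends only on α, a deployment could compute it once per confidence level and reuse it for every window.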
Figure 3. Percentage of confirmed detection events from all customer complaints, plotted against the confidence interval (1-α) for means computed over 30 calls and 40 calls. What can be noticed is a better detection percentage for a lower number of calls used to compute the mean. This makes the test more sensitive but with a higher false positive rate, as seen in Figure 4.

$$\Theta_m = \Theta(\text{carrier\_number} \cdot \text{sample\_size}) \qquad (20)$$

This requirement grows linearly with the carrier number and the sample size. For most cases, this memory requirement is bounded to a low value.

Cases = get_cases(destination, search_interval)
Mean_LT = compute_mean(search_interval)
Sigma_LT = compute_sigma(search_interval)
For (i = 0; i < carrier_number; i++)
    For (j = 0; j < calls_for_carrier[i]; j++)
        call_array[j % sample_size] = call[j]
        Mean_ST = 0
        For (k = 0; k < sample_size; k++)
            Mean_ST = Mean_ST + call_array[k].duration / sample_size
        Z = compute_z_value(Mean_LT, Sigma_LT, Mean_ST)
        Foreach probability
            found = 0
            If (Z < probability.z_crit)
                found = search_for_case(Cases, call_array[j % sample_size].time, call_array[(j + 1) % sample_size].time)
                If (found)
                    confirmed_detection_event[probability]++
                detection_event[probability]++

Figure 5. Pseudo code for the Z-test algorithm

V. TESTS USING SMALL SAMPLE NUMBER (K SMALL)

In order to build a statistical test that works for a low number of samples (calls) that have an exponential distribution, we will use the following methodology: as presented in 3.1, the Erlang distribution is the sum of k

$$\sum_{i=1}^{k} \ldots \qquad (21)$$

So the sum of k call lengths will have an Erlang distribution, and we can compute the probability of observing that or a more extreme value by:

$$\alpha = \int_{0}^{E_x} f_{Erlang}(x, k, \lambda)\,dx \qquad (22)$$

A. THE ERLANG TEST

As considered in section IV, the presumption for a voice quality degradation event is the lowering of the short term average, and accordingly the null and alternate hypotheses will be the same as defined above, (9) and (10) respectively.

The test variable Ex will be computed using the formula:

λ - rate parameter

B. SIMULATION ANALYSIS

The data used for the Erlang test are the same as for the Z test presented in section IV.B. Confirming a detection event is done using the same methodology as in IV.B.

The results of the simulation are presented in Figure 6 and Figure 7. Figure 6 confirms that this methodology is able to detect a very high percentage of the voice quality degradation events. The rate of false positive events is high (only 10-20% are confirmed), although this can be explained by the fact that customers don't always call the customer service when they experience a poor quality call.

Figure 6. Percentage of confirmed detection events from all customer complaints using the Erlang test, plotted against the confidence interval (1-α) for 3, 5 and 10 calls. An increase in the percentage of found cases can be seen as the confidence interval decreases. However, this makes the test more sensitive, with a higher false positive rate, as seen in Figure 7.

[Figure 7: percentage of positive detection for 3, 5 and 10 calls, at confidence levels from 0.5 to 0.999]
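
The Erlang test of equation (22) can be sketched in Python with the standard library only. This is an illustrative sketch under stated assumptions rather than the procedure used for the simulations: the function names and sample values are hypothetical, the rate λ is assumed to be estimated as the reciprocal of the long term mean call length (following (6)), and the integral in (22) is evaluated with the closed-form Erlang CDF for integer k instead of a pre-computed table.

```python
import math


def erlang_cdf(x, k, lam):
    """P(X <= x) for an Erlang(k, lam) variable: the integral of (3) from 0 to x."""
    if x <= 0:
        return 0.0
    t = lam * x
    term = 1.0      # (lam*x)^n / n! for n = 0
    partial = 1.0   # running sum of the terms for n = 0 .. k-1
    for n in range(1, k):
        term *= t / n
        partial += term
    return 1.0 - math.exp(-t) * partial


def erlang_test(durations, long_term_mean, alpha):
    """Return (p, flagged): p is the value of (22) for Ex = sum of the k durations."""
    k = len(durations)
    lam = 1.0 / long_term_mean   # assumption: rate estimated from the long term mean
    ex = sum(durations)          # test variable: sum of the k call lengths
    p = erlang_cdf(ex, k, lam)
    return p, p < alpha


# Hypothetical example: long term mean of 200 s and two windows of k = 3 calls.
p_short, flag_short = erlang_test([30.0, 20.0, 40.0], 200.0, alpha=0.05)
p_normal, flag_normal = erlang_test([200.0, 180.0, 220.0], 200.0, alpha=0.05)
print(round(p_short, 4), flag_short, round(p_normal, 4), flag_normal)
```

Comparing the CDF value with α is equivalent to comparing Ex with a critical value, so a table of critical values indexed by k and confidence level would give the same decision with a single lookup.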
this value was acquired by computing the Gamma distribution value (as presented in section 3.1, the Erlang distribution is a special case of the Gamma distribution). These critical values were pre-computed for every value of sample size (from 2 to 20) and Ex (from 1 to 5400) and saved to a database (534600 rows). The computation of the Erlang critical value reduces to an indexed search in a database, a search that we considered to take constant time.

Cases = get_cases(destination, search_interval)
Mean_LT = compute_mean(search_interval)
For (i = 0; i < carrier_number; i++)
    For (j = 0; j < calls_for_carrier[i]; j++)
        call_array[j % sample_size] = call[j]
        Sum_ST = 0
        For (k = 0; k < sample_size; k++)
            Sum_ST = Sum_ST + call_array[k].duration
        Ex = Sum_ST
        Foreach probability
            found = 0
            If (Ex < probability.erlang_crit)
                found = search_for_case(Cases, call_array[j % sample_size].time, call_array[(j + 1) % sample_size].time)
                If (found)
                    confirmed_detection_event[probability]++
                detection_event[probability]++

Figure 8. Pseudo code for the Erlang test

The time and memory complexity analysis follows the same steps as in IV.C, so the time and memory requirements are expressed by:

$$\Theta_t = \Theta(\text{call\_number} \cdot \text{sample\_size}) \qquad (24)$$

$$\Theta_m = \Theta(\text{carrier\_number} \cdot \text{sample\_size}) \qquad (25)$$

These requirements are linear with the call number for the time complexity, and with the carrier number for the memory requirements. The scaling constant sample_size is approximately a factor of 10 lower than for the Z test.

VI. CONCLUSIONS

The contribution of this paper consists in describing and defining two new statistical methods for detecting a significant lowering in call quality. These two new statistical methods have a lower complexity and computational power requirement than PESQ and 3SQM, which are the current methods. These statistical methods were developed for both low and high call volume destinations. Because of the low computational requirements and the flexibility in the call volume, these methods can be applied to medium and large scale VoIP deployments, for example a prepaid long distance VoIP provider. By using them, the exact MOS value of the calls cannot be determined, but it can be assessed whether it is lower than the average. This information is valuable for a provider with multiple carriers for the same destination, because it can stop sending calls to a carrier whose voice quality is not good enough.

Choosing the best algorithm for a VoIP provider for a certain destination depends mainly on the call rate (calls/hour) available. For low call rates the Erlang test will give results closer to real time, with a higher false positive rate. If the call rate is higher, the Z test should be used. Due to the larger sample size it takes into account, the rate of false positive detection events is lower.

If one can define the cost of a false positive detection event and the cost of not detecting a degradation event, the optimal test (Erlang or Z), for a given confidence level, can be computed by finding the point that maximizes the revenue.

In the future we are planning to develop a test method that is based on the presumption of the call distribution being a mixture of log-normal distributions.

ACKNOWLEDGMENT

The authors would like to thank Florin Miron and Silvana Santa for providing the data and for many insightful discussions that contributed to the ideas of this paper. They would also like to thank Gheorghe Toacse for his continuous feedback, which helped improve this paper.

REFERENCES

[1] Carolyn R.J.: VoIP Reliability: A Service Provider's Perspective, IEEE Communications Magazine, July 2004
[2] Holub B., Beerends J.G., Smid R.: A Dependence Between Average Call Duration and Voice Transmission Quality, Wireless Telecommunication Symposium, 2004
[3] Jiang W., Schulzrinne H.: Assessment of VoIP Service Availability in the Current Internet, Proceedings of the 4th International Workshop on Passive and Active Network Measurement (PAM 2003)
[4] Willinger W., Paxson V.: Where Mathematics Meets the Internet, Notices of the American Mathematical Society, 1998
[5] Jedrzycki C., Leung V.C.M.: Probability Distribution of Channel Holding Time in Cellular Telephony Systems, IEEE 46th Vehicular Technology Conference, 1996
[6] ITU-T Rec. P.861: Objective quality measurement of telephone-band speech codecs, International Telecommunication Union, Geneva, 1996
[7] ITU-T Rec. P.862: Perceptual evaluation of speech quality, International Telecommunication Union, Geneva, 2001
[8] ITU-T Rec. P.563: Single-ended method for objective speech quality assessment in narrow-band telephony applications, International Telecommunication Union, Geneva, 2004
[9] Opticom: "3SQM OEM User manual", version 2.0.1
[10] Barcelo F., Jordan J.: Channel Holding Time Distribution in Cellular Telephony, 9th Int. Conf. on Wireless Communications (Wireless '97), Calgary, July 1997
[11] Bolotin V.: Telephone Circuit Holding Time Distribution, Proc. 14th International Teletraffic Congress, pp. 125-134, North Holland, 1994
[12] Chlebus E.: Empirical Validation of Call Holding Time Distribution in Cellular Communications Systems, Proc. 15th International Teletraffic Congress, pp. 1179-1189, Elsevier Science B.V., 1997