Voice Quality Degradation Recognition Using The Call Lengths

2010, 12th International Conference on Optimization of Electrical and Electronic Equipment, OPTIM 2010
Zoltan Gaspar, Izabella Gocza
Computer Science Department, Transilvania University
gaspar@vega.unitbv.ro gocza@vega.unitbv.ro
Abstract - This paper presents two statistical methods for determining a voice quality degradation event that rely on call duration analysis, an approach suitable for large call volume systems with little processing power requirements. For the analysis we used real life data (460000 calls) and 387 customer complaints from a prepaid long distance provider. We show that, on average, more than 80% of the voice quality degradation events that lead to customer complaints can be identified in real time using our tests.

I. INTRODUCTION

The public switched telephone network (PSTN) gives a high level of reliability and conversational quality to its users [1, 3], attributes achieved by the dedicated network it uses. Unlike PSTN, Voice over Internet Protocol (VoIP) has a lower reliability [1] due to the best effort nature of the Internet. On the other hand, the Internet enabled a quick and easy interconnect for the VoIP carriers, which made possible the existence of a large number of carriers and providers with a greater number of interconnects between them. The quick expansion of VoIP also introduced equipment and bandwidth dimensioning problems resulting in fluctuating delays and packet losses, factors that lower the voice quality of the phone calls.

For many telephony applications voice quality monitoring is required to determine in (near) real-time if a voice quality degradation event has occurred. The existing objective voice quality assessment solutions are voice stream analysis (intrusive and non-intrusive) algorithms, with their main drawback being the processing power they require, as presented in section II. In section III, we present the call length distribution and determine an analytical model for it. Sections IV and V describe the two new test methods and we present our conclusions in section VI.

II. CURRENT METHODS

The methods currently used for voice quality assessment can be divided into subjective and objective ones.

A. Subjective measurements
Historically, the first developed method for quality evaluation was a subjective one, defined by the International Telecommunication Union (ITU-T) recommendation P.800, which introduced the Absolute Category Rating (ACR) test method. The ACR gives scores from 5 (excellent) to 1 (bad). Because of the telecommunication environment, testing is done without a comparison to an undistorted reference. This matches the typical situation of a phone call, where the listener has no access to a comparison with a reference, for example the original voice of the other party. However, this test can be regarded as a comparison of a test signal and a reference "in the mind" of the listener, because the listener is very familiar with the natural sound of the human voice; he/she can compare the test signal with his/her representation of voice. The testing procedure is as follows: 20 to 50 test subjects are presented with an identical series of speech fragments. Every test subject is asked to score each sample using ACR. After statistically processing the individual results, a mean opinion score (MOS) can be calculated.

B. OBJECTIVE MEASUREMENTS
Objective quality evaluation followed from the need for computer automation of these tests, and two categories of methods were developed: intrusive and non-intrusive.

1. INTRUSIVE OBJECTIVE MEASUREMENT
Intrusive measurements for voice quality like PSQM, PESQ[4] and PEAQ[5] for audio quality, by definition use two input signals for the quality evaluation. These are the undistorted original signal and the degraded signal. The MOS score is estimated based on a weighted sum of differences between the two signals. The most successful intrusive algorithms rely on psychoacoustic models which emulate the human hearing process and measure the differences between the reference and test signal on a perceptual basis.

In order to use intrusive measurement, a VoIP provider needs to have access to the distorted signal, a situation that can rarely be achieved in a real world deployment, due to the difficulties of mounting equipment at the client end.

2. NON-INTRUSIVE OBJECTIVE MEASUREMENT
In non-intrusive measurements, like the one defined in ITU-T recommendation P.563[6], only the degraded voice stream is used for the MOS estimation. Measurements following this approach are typically built on general models of the human vocal tract to model speech generation as well as psychoacoustic models to simulate the human hearing process.
978-1-4244-7020-4/10/$26.00 ©2010 IEEE

The processing power requirements for non-intrusive measurement are hundreds of MFLOPS for each 8 second input file [7]. This factor limits its usability to lab test environments, making it too costly for a medium or large size deployment due to equipment and license fees.

III. STATISTICAL ASSESSMENT OF VOICE QUALITY

A study regarding the correlation of call lengths and MOS score was made in [2], and came to the conclusion that in the low to medium quality range (MOS scores between 1 and 3) the call duration increases with increasing speech quality. This can be explained by the fact that when the connection quality is bad, customers usually hang up and call again, increasing the percentage of short calls.

We can build the call lengths distribution (CLD) by computing the relative frequency of the call durations. Based on the number of calls used when building the CLD, we can define a distribution built on recent calls (near real time evaluation on short term) and a distribution built on call data acquired over a long term, as presented in Fig. 1.

CLDS short term call length distribution
CLDL long term call length distribution

[Figure 1: plot of probability (0 to 0.25) versus call length in units of 15 s (0 to 20) for three curves: Long term CLD, Short term CLD 1, Short term CLD 2.]
Figure 1. Call lengths distribution: 'Long term CLD' is built on call data acquired in a long term while the other two are built on recent calls. 'Short term CLD 2' is considered better than 'Short term CLD 1', having a distribution similar to the long term CLD.

A. DISTRIBUTION OF THE CLD MEAN
Several studies were made trying to identify the distribution family of the call lengths. Willinger W. et al. argue that the distribution is exponential [4]. For wireless channel holding times Jedrzycki C. et al. [5] and Barcelo F. et al. [10] show that the log-normal distribution is a better fit. In his work in ITC 14 Bolotin [11] mentioned several papers appearing in ITC 13 which assumed an exponentially distributed call length. The authors showed, however, that mixtures of lognormals fit the call holding time better than the exponential distribution when applying the Kolmogorov-Smirnov (K-S) goodness-of-fit test to an empirically obtained sample. In ITC 15 Chlebus [12] used the Anderson-Darling (A-D) test to show that call duration in mobile telephony follows the same patterns as shown by Bolotin [11] for fixed telephony, as could be expected. In order to determine the characteristics of the mean value, without introducing a significant amount of error, we will consider for the call length distribution, f(x,λ), an exponential expression:

$$ f(x, \lambda) = \lambda e^{-\lambda x} \quad \text{for } x, \lambda \ge 0 \tag{1} $$

where λ is the rate parameter of the exponential distribution and x is the random variable representing the call length.

The computation of the mean value, taking k samples x_i drawn from the distribution (1), is expressed by:

$$ \mu = \frac{1}{k} \sum_{i=1}^{k} x_i \tag{2} $$

It is also known that the sum of k independent and identical (same λ) exponentially distributed random variables follows the Erlang distribution (a special case of the Gamma distribution, with k integer) expressed by:

$$ f(x, k, \lambda) = \frac{\lambda^k x^{k-1} e^{-\lambda x}}{(k-1)!} \quad \text{for } x, \lambda \ge 0 \tag{3} $$

Based on the Central Limit Theorem, we can state that for a sufficiently large k the Erlang (Gamma) distribution converges to a Gaussian distribution with mean:

$$ \mu = \frac{k}{\lambda} \tag{4} $$

and variance:

$$ \sigma^2 = \frac{k}{\lambda^2} \tag{5} $$

From formula (4) and formula (5), the mean of k samples will follow a normal distribution with the mean value of

$$ \mu = \frac{1}{\lambda} \tag{6} $$

and variance
$$ \sigma^2 = \frac{1}{k\lambda^2} \tag{7} $$

From this result, we can compute the accuracy gain obtained by using a larger number of calls in calculating the average:

$$ \frac{\sigma_1}{\sigma_2} = \frac{\sqrt{1/(k_1\lambda^2)}}{\sqrt{1/(k_2\lambda^2)}} = \sqrt{\frac{k_2}{k_1}} \tag{8} $$

For example, to reduce the value of sigma by 50% we need to use 4 times more call samples.

In the following sections we will analyze separately the case where k is large and the case where k is small.

IV. TESTS USING LARGE SAMPLE NUMBER (K LARGE)

First we will analyze the case where the number of samples is large and we can make the normality assumption for the sample mean, in our case the call length average. Testing for a statistically significant difference between the short and long term call length average can be done using the statistical Z test.

A. THE Z TEST
The general form of testing a hypothesis is:

$$ H_0: \mu < \mu_0 \quad \text{(null hypothesis)} \tag{9} $$

$$ H_1: \mu \ge \mu_0 \quad \text{(alternate hypothesis)} \tag{10} $$

If μX (the sample mean of an X CLD distribution) is normally distributed and σμX is the standard deviation of μX, the general formula of the Z test is given by:

$$ Z_X = \frac{\mu_X - \mu_0}{\sigma_{\mu_X}} \tag{11} $$

If ZX is smaller than the critical value (ZX < Zcritical), we will state that we cannot reject the test hypothesis (H0) based on the evidence data.

Our test hypothesis is:

$$ H_0: \mu_S < \mu_L \tag{12} $$

where

$$ \mu_L = \frac{1}{n} \sum_{i=1}^{n} CD_i \quad \text{(sample mean for the CLDL)} \tag{13} $$

$$ \mu_S = \frac{1}{k} \sum_{i=1}^{k} CD_i \quad \text{(sample mean for the CLDS)} \tag{14} $$

We consider the X distribution to be the CLDS, so we have to define the standard deviation of the mean value of the CLDS. A long term period is built of several short term periods, and calculating the sample mean of these periods (μS), we can state that:

$$ \mu_L = \frac{1}{l} \sum_{i=1}^{l} \mu_{S_i} \quad \text{($\mu_L$ is the sample mean of $\mu_S$)} \tag{15} $$

where l is the number of short term periods in the long term period.

So in our formula the σμS becomes σL (the standard deviation of the CLDL), calculated using the mean values of the CLDS:

$$ \sigma_L \,(= \sigma_{\mu_S}) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\mu_{S_i} - \mu_L)^2} \tag{16} $$

where n is the number of call lengths for the short term average calculation.

The number of calls for the long term statistic (n) is determined by the number of calls in a long period of time (4 months in our case). This number (n) can be considered random but large enough to make the Central Limit Theorem assumption.

The assessment of the difference between the two means can be done using the Z test given in our case by:

$$ Z = \frac{\mu_S - \mu_L}{\sigma_L} \tag{17} $$

Because μS is considered normally distributed, Z has a standard normal distribution (μ=0, σ=1) as shown in Fig. 2. Our H0 hypothesis is true if the Z value is less than a critical Z value (Zcritical) for which we can define a critical value (α) of our test. α is defined as the cumulative distribution function value of the standard normal distribution (CDFSN) with Zcritical as its argument. It represents a measure of how confident we want to be: we do not reject our hypothesis in the 100(1-α)% confidence interval, which denotes a significance level of 100α%.

$$ \alpha = CDF_{SN}(Z_{critical}) \tag{18} $$

Figure 2. The standard normal distribution. Zcritical represents the value for which the cumulative distribution function (CDF) of the standard normal distribution is α.

Looking up our calculated Z in the standard normal distribution table shows us the probability of seeing this (or even a more extreme) difference compared to the critical value.
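As an illustration only, the Z test of equations (15) to (18) could be sketched in Python as below. The function names, the use of non-overlapping windows of k calls to form the short term means, and the sample durations are assumptions made for this example and are not taken from the paper's data; the decision rule follows the comparison Z < Zcritical used in the text, with Zcritical obtained from the standard library's inverse normal CDF.

```python
from statistics import NormalDist, fmean, pstdev


def short_term_means(durations, k):
    """Means of consecutive, non-overlapping windows of k call durations
    (the mu_S_i values of equation (15)); windowing scheme is an assumption."""
    return [fmean(durations[i:i + k])
            for i in range(0, len(durations) - k + 1, k)]


def z_statistic(recent, mu_long, sigma_long):
    """Equation (17): Z = (mu_S - mu_L) / sigma_L."""
    return (fmean(recent) - mu_long) / sigma_long


def is_detection_event(z, alpha):
    """True when Z < Z_critical, where alpha = CDF_SN(Z_critical), eq. (18)."""
    z_critical = NormalDist().inv_cdf(alpha)
    return z < z_critical


if __name__ == "__main__":
    # Hypothetical long term call durations in seconds (not real data).
    history = [180, 240, 95, 300, 210, 150, 260, 120, 330, 200, 175, 245]
    means = short_term_means(history, k=3)
    mu_long = fmean(means)               # equation (15)
    sigma_long = pstdev(means, mu_long)  # equation (16)

    for recent in ([20, 35, 15], [200, 230, 190]):
        z = z_statistic(recent, mu_long, sigma_long)
        print(recent, round(z, 2), is_detection_event(z, alpha=0.05))
```

With these made-up numbers, the three very short recent calls give a strongly negative Z and are flagged, while the three ordinary calls are not. In practice k would be in the range of tens of calls so that the normality assumption for the sample mean is reasonable.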
B. SIMULATION ANALYSIS
In order to determine if there is degradation in the voice quality, our method assesses the difference between the mean value of the CLDS and CLDL. A statistically significant decrease of the mean value of CLDS indicates a low MOS value for the recent calls.

In our simulation we used real life call data together with customer complaint reports from a prepaid long distance provider. For computing the CLDL, we chose an interval of four months from May 2007 to August 2007. During this period there were 460000 answered calls and the customer service department received 387 complaints from the customers. The CLDS data are taken from the same period, but only 20, 30 or 40 consecutive answered calls are taken into consideration and we define the sample period as the time from the first to the last call.

When our test confirmed a significant statistical difference between μS and μL for a specific confidence interval, we called it a detection event. When, in the corresponding sample period, we found a complaint from at least one customer to the specific destination, we called that quality degradation event confirmed, otherwise we called it unconfirmed. If a customer experienced a poor voice quality call, he/she would not always call customer service right away. For this reason we added a period of 2 days to the sample period when searching for a customer complaint.

The results of the simulation were saved in a database and further analyzed in two aspects of the proposed method, for 20, 30 and 40 calls:
1. the percentage of confirmed detection events from all customer complaints (Figure 3);
2. the ratio between confirmed events and total events, called the percentage of positive detection (Figure 4).

[Figure 3: plot of percentage of detection (0 to 100) versus confidence interval (1-α), from 0.5 to 0.999, for samples of 20, 30 and 40 calls.]
Figure 3. Percentage of confirmed detection events from all customer complaints. What can be noticed is a better detection percentage for a lower number of calls used to compute the mean. This makes the test more sensitive but with a higher false positive rate, as seen in Figure 4.

[Figure 4: plot of percentage of positive detection (0 to 35) versus confidence interval (1-α), from 0.5 to 0.999, for samples of 20, 30 and 40 calls.]
Figure 4. The percentage of confirmed detection events from all detection events. The higher the number of calls used for the computation of the average call length, the higher the percentage of positive detection.

C. COMPLEXITY ANALYSIS
The pseudo-code of the algorithm can be seen in Fig. 5. The long term average value and standard deviation computations are not inside any loop, so although they are database intensive operations they can be considered as running in constant time. It is the nature of these values to change infrequently, so the results can be saved in a database and read from there without the need to re-compute them.

The time complexity of the algorithm is given by the execution count of the most frequent instruction. In our case, this instruction is the computation of the short term mean value (Mean_ST). We can notice that the two outer loops (over the carriers and over the calls on each carrier) together iterate once per call to the destination, so the time complexity is given by:

$$ \Theta_t = \Theta(\text{call\_number} \cdot \text{sample\_size}) \tag{19} $$

The sample size being limited to a relatively low number, in the order of tens, we can consider this algorithm to be of linear complexity with the call number. On the other hand, the memory requirement is given by:

$$ \Theta_m = \Theta(\text{carrier\_number} \cdot \text{sample\_size}) \tag{20} $$

This requirement grows linearly with the carrier number and the sample size. For most cases, this memory requirement is bounded to a low value.

Cases=get_cases(destination,search_interval)
Mean_LT=compute_mean(search_interval)
Sigma_LT=compute_sigma(search_interval)
For(i=0;i<carrier_number;i++)
    For(j=0;j<calls_for_carrier[i];j++)
        call_array[j%sample_size]=call[j]
        Mean_ST=0;
        For(k=0;k<sample_size;k++)
            Mean_ST=Mean_ST+call_array[k].duration/sample_size
        Z=compute_z_value(Mean_LT,Sigma_LT,call_array)
        Foreach probability
            Found=0
            If(Z<probability.z_crit)
                Found=search_for_case(Cases, call_array[j%sample_size].time, call_array[(j+1)%sample_size].time)
                If(Found)
                    confirmed_detection_event[probability]++
                detection_event[probability]++
Figure 5. Pseudo-code for the Z-test algorithm

V. TESTS USING SMALL SAMPLE NUMBER (K SMALL)

In order to build a statistical test which works for a low number of samples (calls) that have an exponential distribution, we will use the following methodology: as presented in section III.A, the Erlang distribution is the distribution of the sum of k independent exponentially distributed random variables (x) with the same λ (rate parameter).

$$ E_x = \sum_{i=1}^{k} x_i(\lambda) \tag{21} $$

So the sum of k call lengths will have an Erlang distribution and we can compute the probability of observing that or a more extreme value by:

$$ \alpha = \int_{0}^{E_x} f_{Erlang}(x, k, \lambda)\, dx \tag{22} $$

A. THE ERLANG TEST
As considered in section IV, the presumption for a voice quality degradation event is the lowering of the short term average, and accordingly the null and alternate hypotheses will be the same as defined above, (9) and (10) respectively.

The test variable Ex will be computed using the formula:

$$ E_x = \sum_{i=1}^{k} CL_i \tag{23} $$

where CLi is the call length of the i-th call (a sample from the exponential distribution).

If the call length random variable has an exponential distribution, then Ex will approximate the Erlang distribution. Comparing the Ex value to the critical values of the Erlang distribution (values for which the cumulative distribution function has the value of 0.001, 0.002, 0.005, 0.01 and so on) will tell us if we can reject the null hypothesis. Our hypothesis (H0, null hypothesis) is true if the Ex value is less than Ecrit (with a specific confidence interval α) and H1 is true otherwise. We compute the Ecrit values from the inverse Erlang (Gamma) distribution with our parameters:
k - sample size
λ - rate parameter

B. SIMULATION ANALYSIS
The data used for the Erlang test are the same as for the Z test presented in section IV.B. Confirming a detection event is done using the same methodology as in IV.B.

The results of the simulation are presented in Figure 6 and Figure 7. Figure 6 confirms that this methodology is able to detect a very high percentage of the voice quality degradation events. The rate of false positive events is high (only 10-20% are confirmed), although this can be explained by the fact that customers don't always call the customer service when they experience a poor quality call.

[Figure 6: plot of percentage of detection (60 to 95) versus confidence interval (1-α), from 0.5 to 0.999, for samples of 3, 5 and 10 calls.]
Figure 6. Percentage of confirmed detection events from all customer complaints using the Erlang test. An increase in the percentage of found cases can be seen as the confidence interval decreases. However, this makes the test more sensitive, with a higher false positive rate as seen in Figure 7.

[Figure 7: plot of percentage of positive detection (0 to 25) versus confidence interval (1-α), from 0.5 to 0.999, for samples of 3, 5 and 10 calls.]
Figure 7. The percentage of confirmed detection events from all detection events using the Erlang test. The higher the number of calls used for the computation of the average call length, the higher the percentage of positive detection. In all cases the percentage of positive detection has a positive slope. This is because, by definition, false positive results (unconfirmed detection events) are more probable for lower confidence intervals.

C. COMPLEXITY ANALYSIS
The pseudo-code of the algorithm can be seen in Fig. 8. First we compute the expected value of the exponential distribution (Mean_LT) using a large number (long term statistic) of calls. As presented in section IV.C, this value changes infrequently and can be saved in a database, avoiding the need to re-compute it.

A very complex operation is the computation of the critical values of the Erlang distribution. As most of the statistical libraries don't include the Erlang distribution,
this value was acquired by computing the Gamma distribution value (as presented in section III.A, the Erlang distribution is a special case of the Gamma distribution). These critical values were pre-computed for every value of sample size (from 2 to 20) and Ex (from 1 to 5400) and saved to a database (534600 rows). The computation of the Erlang critical value reduces to an indexed search in a database, a search that we considered to take constant time.

Cases=get_cases(destination,search_interval)
Mean_LT=compute_mean(search_interval)
For(i=0;i<carrier_number;i++)
    For(j=0;j<calls_for_carrier[i];j++)
        call_array[j%sample_size]=call[j]
        Sum_ST=0;
        For(k=0;k<sample_size;k++)
            Sum_ST=Sum_ST+call_array[k].duration
        Ex=Sum_ST
        Foreach probability
            Found=0
            If(Ex<probability.erlang_crit)
                Found=search_for_case(Cases, call_array[j%sample_size].time, call_array[(j+1)%sample_size].time)
                If(Found)
                    confirmed_detection_event[probability]++
                detection_event[probability]++
Figure 8. Pseudo-code for the Erlang test

The time and memory complexity analysis follows the same steps as in IV.C, so the time and memory requirements are expressed by:

$$ \Theta_t = \Theta(\text{call\_number} \cdot \text{sample\_size}) \tag{24} $$

$$ \Theta_m = \Theta(\text{carrier\_number} \cdot \text{sample\_size}) \tag{25} $$

These requirements are linear with the call number for the time complexity, and with the carrier number for the memory requirements. The scaling constant sample_size is lower than in IV.C by approximately a factor of 10.

VI. CONCLUSIONS

The contribution of this paper consists in describing and defining two new statistical methods for detecting a significant lowering in call quality. These two new statistical methods have a lower complexity and computational power requirement than PESQ and 3SQM, which are the current methods. These statistical methods were developed for both low and high call volume destinations. Because of the low computational requirements and the flexibility in the call volume, these methods can be applied to medium and large scale VoIP deployments, for example a prepaid long distance VoIP provider. By using them, the exact MOS value of the calls cannot be determined, but it can be assessed whether it is lower than the average. This information is valuable for a provider with multiple carriers for the same destination, because it can stop sending calls to a carrier whose voice quality is not good enough.

Choosing the best algorithm for a VoIP provider for a certain destination depends mainly on the call rate (calls/hour) available. For low call rates the Erlang test will give results closer to real time, with a higher false positive rate. If the call rate is higher, the Z test should be used. Due to the larger sample size it takes into account, the rate of false positive detection events is lower.

If one can define the cost of a false positive detection event and the cost of not detecting a degradation event, the optimal test (Erlang or Z), for a given confidence level, can be computed by finding the point that maximizes the revenue.

In the future we are planning to develop a test method that is based on the presumption of the call distribution being a mixture of log-normal distributions.

ACKNOWLEDGMENT
The authors would like to thank Florin Miron and Silvana Santa for providing the data and for many insightful discussions that contributed to the ideas of this paper. They would also like to thank Gheorghe Toacse for his continuous feedback, which helped improve this paper.

REFERENCES
[1] Carolyn R.J.: VoIP Reliability: A Service Provider's Perspective, IEEE Communications Magazine, July 2004
[2] Holub B., Beerends J.G., Smid R.: A Dependence Between Average Call Duration and Voice Transmission Quality, Wireless Telecommunication Symposium, 2004
[3] Jiang W., Schulzrinne H.: Assessment of VoIP Service Availability in the Current Internet, Proceedings of the 4th International Workshop on Passive and Active Network Measurement (PAM 2003)
[4] Willinger W., Paxson V.: Where Mathematics Meets the Internet, Notices of the American Mathematical Society, 1998
[5] Jedrzycki C., Leung V.C.M.: Probability Distribution of Channel Holding Time in Cellular Telephony Systems, IEEE 46th Vehicular Technology Conference, 1996
[6] ITU-T Rec. P.861: Objective quality measurement of telephone-band speech codecs, International Telecommunication Union, Geneva, 1996
[7] ITU-T Rec. P.862: Perceptual evaluation of speech quality, International Telecommunication Union, Geneva, 2001
[8] ITU-T Rec. P.563: Single-ended method for objective speech quality assessment in narrow-band telephony applications, International Telecommunication Union, Geneva, 2004
[9] Opticom: "3SQM OEM User manual", version 2.0.1
[10] Barcelo F., Jordan J.: Channel Holding Time Distribution in Cellular Telephony, 9th Int. Conf. on Wireless Communications (Wireless '97), Calgary, July 1997
[11] Bolotin V.: Telephone Circuit Holding Time Distribution, Proc. 14th International Teletraffic Congress, pp. 125-134, North Holland, 1994
[12] Chlebus E.: Empirical Validation of Call Holding Time Distribution in Cellular Communications Systems, Proc. 15th International Teletraffic Congress, pp. 1179-1189, Elsevier Science B.V., 1997
