
Audio Classification Using Features Derived From The Hartley Transform

I. Paraskevas, E. Chilton, M. Rangoussi


I. Paraskevas, E. Chilton: Centre for Vision Speech and Signal Processing (CVSSP), School of Electronics and Physical Sciences, University of Surrey, Guildford, GU2 7XH, Surrey, UK. E-mail: eeplip@ee.surrey.ac.uk, e.Chilton@ee.surrey.ac.uk
M. Rangoussi: Department of Electronics, Technological Education Institute of Piraeus, 250, Thivon str., Aigaleo-Athens, GR-12244, GREECE. Phone/Fax: +302105381222,6 E-mail: mariar@teipir.gr

Keywords: Audio recordings, audio databases, content-based retrieval, classification, phase spectrum, Hartley
transform.

Abstract - The increasing use of audio databases has led to the need for automatic content-based retrieval and classification of audio signals. In this paper, a feature extraction method is presented for the classification of audio recordings. The proposed method employs magnitude and phase-related information based on the Hartley transform. Up to now, the features extracted from an audio recording have been temporal and magnitude spectrum related, since the phase spectrum could not be used due to the discontinuities it exhibits. In the proposed method, the sources of phase discontinuities are detected and overcome, resulting in a phase spectrum in which the number of discontinuities is significantly reduced. Experimental results show that the classification performance is improved when magnitude and phase related information are used together, compared to the case where only magnitude is used.

1. INTRODUCTION

Recent research on automatic content-based classification of audio recordings employs magnitude spectrum and temporal related features to classify different acoustic events retrieved from audiovisual databases, [1]. In most cases, the task is to classify sounds that are acoustically different and yet belong to the same family, such as sports sounds, [2]. In other cases, existing research work attempts to discriminate between various TV events, such as news, advertisements, etc., that are acoustically dissimilar, [3].

Some of the features employed for speech recognition, such as Mel-scale cepstral coefficients or Perceptual Linear Prediction coefficients, are not always effective for audio classification, as speech signals constitute a special group within the family of audio signals. Magnitude spectrum features convey energy-related information that is essential for the classification process. Hence, for audio classification, most of the frequency domain feature extraction techniques are based on the spectral magnitude content of the signal, [3]. The calculation of the magnitude spectrum of a signal preserves information related to the absolute value of its real and imaginary Fourier spectrum components, but does not preserve information related to the signs, or changes of sign, of these components. The aim of this work is to show that, for certain classes of audio signals, the feature vectors extracted from the phase content of a signal perform better, in terms of classification, than the feature vectors extracted from the magnitude content of a signal in the frequency domain. Consequently, phase-related feature vectors can be used for frequency based feature extraction towards audio classification. In this paper, the phase and magnitude information are combined in order to discriminate sounds that spectrally belong to the same family.

Most researchers do not use phase information in the classification process, although it conveys essential information, because of the difficulties in processing the discontinuities in the phase spectrum. In our previous work [7], the Fourier transform was used in order to implement the phase spectrograms. In this paper, the Hartley transform is used instead. As will be explained in Section 5, the major advantage of the Hartley over the Fourier transform is that the Hartley-derived phase related spectrum conveys fewer discontinuities and also encapsulates the phase related content of the signal in an improved manner.

The discontinuities appearing in the phase spectrum are unwanted features that affect the classification rate. There exist two kinds of phase discontinuities: the 'extrinsic' type arises from the computation of the phase, [4], whereas the 'intrinsic' type is due to the structure of the signal, [5]. Note that the proposed method for processing the phase spectrum is non-parametric, so it does not use any a-priori information, [6].

The novelty of the proposed method is the use of phase related information together with magnitude information for frequency based statistical feature extraction for audio classification. Feature vectors, containing phase and magnitude spectrogram statistical values, are independently passed to a Mahalanobis metric classifier. The experimental results obtained from a database of spectrally similar sounds indicate that, for certain classes of audio signals, the use of phase-related feature streams together with magnitude feature streams (Section 4) increases the classification rate as compared to the case where only magnitude feature streams are used.

2. PHASE SPECTRUM DISCONTINUITIES

2.1. Phase spectrum computation via the DTFT

The phase spectrum calculation is first carried out in the Fourier transform domain, using the Discrete-Time Fourier Transform (DTFT), in order to investigate the signal properties in that domain. After establishing the relationship between the Fourier and the Hartley transforms, the analysis will be extended to the Hartley transform domain.
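The modulo-2π ambiguity behind the 'extrinsic' discontinuities is easy to reproduce numerically. The sketch below (NumPy; the delayed-impulse signal is an illustrative choice of ours, not an example from the paper) computes the sampled DTFT phase of a pure delay: the raw inverse-tangent phase is wrapped into (-π, π], and unwrapping recovers the true linear phase because, for this signal, every jump is 'extrinsic':

```python
import numpy as np

# The inverse-tangent phase is only known modulo 2*pi, which produces the
# 'extrinsic' discontinuities. A linear-phase signal (a pure delay) makes
# the wrapping jumps visible and easy to undo.
N = 64
x = np.zeros(N)
x[5] = 1.0                            # impulse delayed by 5 samples
S = np.fft.fft(x)                     # sampled DTFT
wrapped = np.angle(S)                 # atan2(S_I, S_R), values in (-pi, pi]
unwrapped = np.unwrap(wrapped)        # removes the 2*pi 'extrinsic' jumps

# The true phase of a 5-sample delay is -5*omega, omega = 2*pi*k/N:
k = np.arange(N)
true_phase = -5 * 2 * np.pi * k / N
print(np.allclose(unwrapped, true_phase))   # True: all jumps were extrinsic
```

With real audio frames, 'intrinsic' jumps caused by spectral zeros mix with the wrapping jumps, and an unwrapping routine of this kind can no longer tell the two causes apart.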
The phase of the DTFT is defined as:

f(ω) = arctan( S_I(ω) / S_R(ω) ),   0 ≤ f(ω) ≤ 2π,   (1)

where S(ω) is the complex Fourier spectrum of the data and S_I(ω), S_R(ω) are its imaginary and real components, respectively.

When the phase information is extracted from S(ω), two problems arise. The first is related to the discontinuities that the inverse tangent function presents ('extrinsic' discontinuities). The computation of this function results in phase values modulo 2π. In order to overcome these ambiguities, the phase has to be 'unwrapped', [4].

The second kind of discontinuity is caused by 'intrinsic' characteristics of the system that generates the data under analysis. The 'intrinsic' discontinuities appear at frequencies ω ('critical points') where both the imaginary and the real part of the Fourier spectrum become simultaneously zero during the phase evaluation process. This is equivalent to the existence of a 'pole' or a 'zero' of the signal on the circumference of the unit circle in the z-domain, [5], [8]. This latter kind of discontinuity causes 'jumps' in the phase spectrum, [5]. In order to overcome the 'intrinsic' discontinuities, the 'critical points' of the signal have to be detected first. The spectrum is then scanned from lower (zero) to higher frequencies, and π is added to the remaining phase spectrum values if the value before the critical point is higher, or subtracted if it is lower.

The techniques used to detect and overcome 'extrinsic' and 'intrinsic' discontinuities have their drawbacks. For the 'extrinsic' discontinuities, based on [4], whenever a phase 'jump' occurs, 2π is added or subtracted accordingly. The disadvantage is that phase 'jumps' can be caused either by rapidly changing angles or by the 'wrapping' ambiguity; yet, conventional 'unwrapping' algorithms cannot discriminate between these causes. Moreover, 'intrinsic' discontinuities may appear when both the real and imaginary parts of the spectrum assume values very close to - but not exactly - zero, due to the precision limitations of discrete computation, [13].

Although the 'intrinsic' discontinuities are due to the structure of the signal and not to computational artifacts, their existence still reduces the classification rate. The classification score is maximized when both sources of discontinuities are detected and removed, [7].

2.2. Phase spectrum computation via the z-transform

The alternative approach for removing the discontinuities appearing in the phase spectrum is based on the z-transform. Any signal can be modeled as an all-'zero' filter by taking its roots, [8]. The 'zeros' (roots of the polynomial formed by the signal) lying on the unit circle cause π or near-π 'jumps' in the phase spectrum, [5]. These phase 'jumps' are the 'intrinsic' discontinuities.

In the proposed method, [7], all the zeros lying on the unit circle are removed, and the phase spectrum is reconstructed from the remaining zeros, based on the 'geometric' evaluation of phase, [8]. In practice, however, due to computational errors in the evaluation of the roots of the polynomial, zeros lying not exactly on, but within a certain distance (a 'ring') of, the unit circle have to be removed as well. The width of the 'ring' has to be kept within certain limits, so that the information loss is limited (see Section 5).

The advantage of this method over the DTFT-based method is that, although each phase component still comes out of the inverse tangent function, its value is constrained within ±π radians. Therefore, 'wrapping' ambiguities do not arise (the phase is not 'wrapped' around zero) and 'extrinsic' discontinuities do not appear at all.

2.3. Fourier transform and Hartley transform relation

In this subsection, the Hartley spectrum magnitude and phase are defined, based on the relationship between the Fourier and the Hartley transforms. Let H(ω) denote the Hartley spectrum; then, [9]:

H(ω) = S_R(ω) − S_I(ω)   (2)

On the other hand, the Fourier spectrum S(ω) can be written as

S(ω) = M(ω)( cos(φ(ω)) − j sin(φ(ω)) )   (3)

in terms of its magnitude M(ω), (4), and phase φ(ω), (5), respectively:

M(ω) = √( S_R²(ω) + S_I²(ω) )   (4)

φ(ω) = arctan( S_I(ω) / S_R(ω) )   (5)

The real and imaginary parts of S(ω) become:

S_R(ω) = M(ω) cos(φ(ω))   (6)
S_I(ω) = −M(ω) sin(φ(ω))   (7)

Then, the Hartley spectrum H(ω) and its 'conjugate' H*(ω) = S_R(ω) + S_I(ω) take on the forms:

H(ω) = M(ω)( cos(φ(ω)) + sin(φ(ω)) )   (8)
H*(ω) = M(ω)( cos(φ(ω)) − sin(φ(ω)) )   (9)

The Hartley spectrum magnitude, N(ω), is defined as:

N(ω) = √( H(ω) H*(ω) ) = M(ω) √( cos(2φ(ω)) )   (10)

The Hartley spectrum magnitude combines in a single quantity both the Fourier spectrum magnitude and phase; hence, it captures all the Fourier spectrum-related information of the signal.

Furthermore, the 'whitened' Hartley spectrum, or Hartley phase spectrum, Y(ω), is defined as:

Y(ω) = H(ω) / M(ω) = cos(φ(ω)) + sin(φ(ω))   (11)

Y(ω) is a function of the Fourier phase only, [14]; as such, it inherits the 'intrinsic' discontinuities. Also, from (11), the values of Y(ω) are bounded within ±√2.
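The Fourier-Hartley relations can be verified numerically for a real signal. In the sketch below (NumPy), the mapping of the paper's phase convention onto NumPy's, φ = −angle(S), is our assumption, made so that S = M(cos φ − j sin φ) holds for the FFT output:

```python
import numpy as np

# NumPy's FFT uses S(w) = sum_n x[n] e^{-jwn}; for a real signal the Hartley
# spectrum (cas kernel, cos + sin) then satisfies H = Re(S) - Im(S), Eq. (2).
rng = np.random.default_rng(0)
N = 64
x = rng.standard_normal(N)
S = np.fft.fft(x)

# Direct DHT via the cas kernel, as a cross-check of Eq. (2):
n = np.arange(N)
arg = 2 * np.pi * np.outer(n, n) / N
H_direct = (np.cos(arg) + np.sin(arg)) @ x
H = S.real - S.imag
print(np.allclose(H, H_direct))                   # True

# With S = M(cos(phi) - j sin(phi)), phi = -angle(S) in NumPy's convention:
M = np.abs(S)
phi = -np.angle(S)
Y = H / M                                         # 'whitened' Hartley, Eq. (11)
print(np.allclose(Y, np.cos(phi) + np.sin(phi)))  # True: Eqs. (8), (11)
print(np.max(np.abs(Y)) <= np.sqrt(2) + 1e-12)    # True: bounded within +/- sqrt(2)

# Eq. (10) squared, with H' = Re(S) + Im(S):
H_conj = S.real + S.imag
print(np.allclose(H * H_conj, M**2 * np.cos(2 * phi)))  # True
```

The last check uses the square of (10), H(ω)H*(ω) = M²(ω)cos(2φ(ω)), which avoids taking square roots at frequencies where cos(2φ(ω)) is negative.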
2.4. Hartley phase spectrum computation

(1) If we choose to employ the DTFT-based phase computation method (Subsection 2.1), the 'intrinsic' discontinuities can be removed as follows. Let ω_c denote a critical frequency point, where π should be either added to or subtracted from the Fourier phase for compensation, and let b = φ(ω_c). Then, by (11), the compensated Y(ω) is given by (12) in either case:

cos(b ± π) + sin(b ± π) = −( cos(b) + sin(b) )   (12)

Thus, compensation for an 'intrinsic' discontinuity is achieved by multiplication of Y(ω) by (−1). Consequently, the method of Subsection 2.1 can be simplified as follows (computation of Y(ω) via the Discrete-Time Hartley Transform, DTHT): after the 'critical points' of the signal have been detected, the spectrum is scanned from lower to higher frequencies; whenever a critical frequency point ω_c is met, all Y(ω) values for ω ≥ ω_c are multiplied by (−1).

(2) If we choose to employ the z-transform based phase computation method (Subsection 2.2), it too can be modified to employ Y(ω) (computation of Y(ω) via the z-transform): Y(ω) can be evaluated geometrically (see Appendix).

3. STATISTICAL ANALYSIS & CLASSIFICATION

In order to exploit the Hartley phase properties mentioned earlier in an audio signal classification problem, eight (8) different statistical features are extracted by analysis of each spectrogram (Hartley phase spectrum), namely: variance, skewness, kurtosis, entropy, inter-quartile range, range, median and mean absolute deviation, [10], [11]. The value range of the Hartley phase spectrum is not used as a feature for either of the 'whitened' Hartley spectrograms, because Y(ω) is always bounded within ±√2.

As the focus of our work is on the statistical feature extraction rather than on the classifier, a simple Mahalanobis distance classifier is employed in order to classify the acoustic patterns. The Mahalanobis distance is defined as:

d(x_t, x_r) = √( (x_r − x_t) C_r⁻¹ (x_r − x_t)^T ),   (13)

where x_t and x_r are the test and the reference feature vectors, respectively, and C_r is the covariance matrix of the reference data, [12].

A codebook is derived from the mean values of each class of audio signals and represents the class. The distance between a test pattern and the codeword of each class is calculated, and the test pattern is assigned to the class that yields the minimum distance among all classes.

4. AUDIO DATABASE & EXPERIMENTAL SET UP

An audio signal database containing gunshot recordings (10 classes of 10 recordings each, on average) is used to test the performance of the proposed Hartley spectrum-based feature set. The ten classes contain firings of: i) revolver, ii) .22 caliber handgun, iii) M-1 rifle, iv) World War II German rifle, v) cannon, vi) 30-30 rifle, vii) .38 caliber semi-automatic pistol, viii) lever action Winchester rifle, ix) 37 mm anti-tank gun, and x) pistol.

The particular database is chosen as a demanding classification task, because all ten classes belong acoustically to the same family. Each recording has a different length from the rest, as it contains real field data. Seven (7) recordings of each class are used as training data and the rest as test data.

For the spectrogram computation, each signal is segmented into frames of equal length, with zero padding. The frame length is set to 256 samples, to limit the round-off error in the calculation of the polynomial roots (Subsection 2.2) from each frame.

Five (5) different spectrograms are calculated from each recording of the database, and feature vectors are extracted from each of them:
1. Hartley magnitude spectrogram, based on (10),
2. Hartley transform spectrogram, based on (2),
3. Fourier magnitude spectrogram,
4. 'whitened' Hartley spectrogram (via the DTHT), and
5. 'whitened' Hartley spectrogram (via the z-transform).

5. RESULTS AND DISCUSSION

The correct classification rates (averaged across the ten classes) obtained from the gunshot database with the Mahalanobis classifier are as follows:
• Hartley transform spectrogram: 82.9%,
• Hartley magnitude spectrogram: 91.4%,
• 'whitened' Hartley spectrogram (via the DTHT): 81.4%,
• 'whitened' Hartley spectrogram (via the z-transform): 82.9%.

These results compare favorably to their Fourier spectrum counterparts, reported in [7] under the same experimental set up. In fact, the 'whitened' Hartley spectrogram via the DTHT outperforms the Fourier spectrogram (via the DTFT) by 5.7%, while via the z-transform it does so by 4.8%. This significant improvement is explained by the 'immunity' of the 'whitened' Hartley spectrum to the discontinuities, relative to its Fourier counterpart.

For the computation of the 'whitened' Hartley spectrogram via the z-transform, the choice of the width of the exclusion 'ring' around the unit circle is studied with respect to its impact on the classification rate. The width of the 'ring' is varied between 0 (where 'zeros' are excluded only if they lie exactly on the unit circle - not attainable in practical calculations of roots) and 0.001; a further increase would yield unreliable results, as the ring would exclude too many 'zeros'. Classification scores vary accordingly between 78.6% and 68.6%. In general, classification scores decrease as the ring width increases, due to the information loss incurred; they peak for a 'ring' width of 0.00003 (82.9%) - a similar performance, at the same ring width, is reported in [7] for the Fourier spectrogram via the z-transform.

A closer examination of the correct classification scores per class reveals that the 2nd, 4th, 5th and 10th classes yield the poorest results and lower the average scores across classes.
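The codebook classifier of Section 3 can be sketched in a few lines. The example below is a simplified illustration on synthetic two-dimensional features (the 8-dimensional gunshot feature vectors are not reproduced, and more than seven training vectors per class are drawn so that the sample covariance is stable):

```python
import numpy as np

# Each class is represented by a codeword (mean feature vector) and the
# inverse covariance of its training data; a test vector is assigned to the
# class of minimum Mahalanobis distance, Eq. (13).
rng = np.random.default_rng(1)

def mahalanobis(x_t, x_r, C_inv):
    d = x_r - x_t
    return np.sqrt(d @ C_inv @ d)          # Eq. (13)

# Two synthetic classes of training feature vectors (illustrative values):
train = {0: rng.normal([0.0, 0.0], 0.5, size=(50, 2)),
         1: rng.normal([3.0, 3.0], 0.5, size=(50, 2))}
codebook = {c: (v.mean(axis=0), np.linalg.inv(np.cov(v.T)))
            for c, v in train.items()}

def classify(x_t):
    return min(codebook, key=lambda c: mahalanobis(x_t, *codebook[c]))

print(classify(np.array([0.2, -0.1])))  # 0
print(classify(np.array([2.8, 3.1])))   # 1
```

Each class keeps its own inverse covariance C_r⁻¹, so distances are measured in units of that class's own spread, as (13) requires.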
In order to assist the decision for these four classes, three out of the five 'experts', or streams, are combined in the fusion scheme shown in Fig. 1:
1. Fourier magnitude spectrogram: it encapsulates only the magnitude content of the signal,
2. 'whitened' Hartley spectrogram (via the z-transform): it conveys the phase-related content of the signal, and
3. Hartley magnitude spectrogram: it encapsulates both the magnitude and the phase spectral content of the signal.

[Fig. 1 shows the three streams - Fourier magnitude, 'whitened' Hartley via the z-transform, and Hartley magnitude - each feeding its own Mahalanobis classifier (streams 1-3), whose decisions are combined by a majority vote.]
Fig. 1. Majority vote decision rule: a recording is classified to a certain class if two or more streams agree.

An incoming sound is 'misclassified' when two or more out of the three streams assign it to the same 'false' class, and 'unclassified' when each one of the three streams assigns it to a different class.

For these four classes, the classification rates per single 'expert', or stream, are tabulated in Table 1. When the majority vote decision rule of Fig. 1 is employed, however, the classification score reaches 90.5%, with 9.5% unclassified recordings (no misclassifications).

                Fourier     'Whitened'    Hartley
                Magnitude   Hartley (z)   Magnitude
Class 2         83.3        83.3          83.3
Class 4         63.6        63.6          77.9
Class 5         81.8        91.7          91.7
Class 10        91.7        85.1          91.7
All 4 classes   80.4        81.2          86.5

Table 1. Correct classification scores (%) per 'expert'.

6. CONCLUSION

We have proposed a novel approach to phase extraction, based on the Hartley rather than the Fourier transform, with application to an audio signal classification problem. The experimental results obtained indicate that, for certain classes of audio signals, the combination of magnitude and phase-related information provides improved performance, as compared to the independent use of each information stream. Moreover, the experimental results indicate that the 'whitened' Hartley spectrograms perform on average better than their Fourier phase counterparts.

REFERENCES

[1] T. Zhang, C.C.J. Kuo, "Audio content analysis for online audiovisual data segmentation and classification," IEEE Trans. on Speech and Audio Processing, vol. 9, no. 4, 2001.
[2] Z. Xiong et al., "Audio events detection based highlights extraction from baseball, golf and soccer games in a unified framework," IEEE Intl. Conf. on Acoustics, Speech and Signal Processing, vol. 5, pp. 632-635, 2003.
[3] Y. Wang, Z. Liu, "Multimedia content analysis using both audio and visual cues," IEEE Signal Processing Magazine, November 2000.
[4] J.M. Tribolet, "A new phase unwrapping algorithm," IEEE Trans. on Acoustics, Speech and Signal Processing, vol. 25, pp. 170-177, April 1977.
[5] H. Al-Nashi, "Phase unwrapping of digital signals," IEEE Trans. on Acoustics, Speech and Signal Processing, vol. 37, no. 11, November 1989.
[6] C.L. Nikias, A.P. Petropulu, Higher-Order Spectra Analysis: A Nonlinear Signal Processing Framework, Prentice Hall Signal Processing Series, 1993, Ch. 6.
[7] I. Paraskevas, E. Chilton, "Combination of magnitude and phase statistical features for audio classification," Acoustics Research Letters Online, vol. 5, no. 3, July 2004.
[8] J.G. Proakis, D.G. Manolakis, Digital Signal Processing: Principles, Algorithms, and Applications, Macmillan Publishing Company, 1992, Ch. 4 and 5.
[9] R.N. Bracewell, The Fourier Transform and Its Applications, McGraw-Hill Book Company, 1986, Ch. 19.
[10] E. Mansfield, Basic Statistics with Applications, W.W. Norton and Co., 1986.
[11] A. Papoulis, Probability and Statistics, Prentice-Hall, Inc., 1990, Ch. 12.
[12] P.C. Mahalanobis, "On the generalized distance in statistics," Proceedings of the National Institute of Science of India, vol. 12, pp. 49-55, 1936.
[13] G.E. Forsythe, M.A. Malcolm, C.B. Moler, Computer Methods for Mathematical Computations, Prentice-Hall, 1977, Sect. 7.
[14] E. Chilton, "An 8 kb/s speech coder based on the Hartley transform," ICCS '90 Communication Systems: Towards Global Integration, vol. 1, pp. 13.5.1-13.5.5, 1990.

APPENDIX

[Appendix figure: z-plane diagram showing a 'zero' z, the right triangle formed by points A, B and the frequency point C, the angle φ, and the origin O on the real axis.]

The contribution of a 'zero' z to the phase spectrum is evaluated with respect to a single frequency point C, as an example. For the Fourier spectrum, φ(ω) = arctan(BC/AB), while for the Hartley spectrum,

Y(ω) = cos(φ(ω)) + sin(φ(ω)) = (AB + BC)/(AC)

This is repeated for all frequency points of interest and for all 'zeros'.
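The Appendix construction can be cross-checked numerically: factor a frame into its 'zeros' with np.roots, accumulate the Fourier phase at each frequency point as the sum of the angles contributed by the individual zeros (plus a linear term), and compare the resulting 'whitened' Hartley values with H(ω)/M(ω) computed directly. The frame values below are illustrative, and the sign-convention mapping (the paper's φ equals −angle(S) under NumPy's FFT) is our assumption:

```python
import numpy as np

x = np.array([1.0, -0.4, 0.3, 0.2, -0.1])   # a short frame (illustrative values)
N = len(x)
zeros = np.roots(x)                          # 'zeros' of the all-zero model, [8]
w = 2 * np.pi * np.arange(8) / 8             # frequency points on the unit circle

# Geometric phase at e^{jw}: the sum of the angles subtended by the individual
# zeros, plus the linear term from z^{-(N-1)} (x[0] > 0 here, so the leading
# coefficient adds no angle of its own):
geo = -(N - 1) * w + np.array(
    [np.angle(np.exp(1j * wi) - zeros).sum() for wi in w])

# 'Whitened' Hartley spectrum from the geometric phase. With the FFT convention
# S(w) = sum x[n] e^{-jwn}, the paper's phi equals -angle(S), so
# Y = cos(phi) + sin(phi) = cos(geo) - sin(geo); cos/sin are 2*pi-periodic,
# so any modulo-2*pi slack in the summed angles cancels out.
Y_geo = np.cos(geo) - np.sin(geo)

S = np.fft.fft(x, 8)                         # direct DTFT samples
Y_direct = (S.real - S.imag) / np.abs(S)     # H(w)/M(w), Eqs. (2), (11)
print(np.allclose(Y_geo, Y_direct))          # True
```

Because only cosines and sines of the accumulated angle are needed, this evaluation never calls the inverse tangent on a ratio, which is what makes the z-transform route free of 'extrinsic' discontinuities.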