Keywords: Audio recordings, audio databases, content-based retrieval, classification, phase spectrum, Hartley
transform.
3. STATISTICAL ANALYSIS & CLASSIFICATION

In order to exploit the Hartley phase properties mentioned earlier in an audio signal classification problem, eight (8) different statistical features are extracted by analysis of each spectrogram (Hartley phase spectrum), namely: variance, skewness, kurtosis, entropy, inter-quartile range, range, median and mean absolute deviation [10], [11]. The value range of the Hartley phase spectrum is not used as a feature for either of the ‘whitened’ Hartley spectrograms, because Y(ω) is always bounded within ±√2.

As the focus of our work is on the statistical feature extraction rather than the classifier, a simple Mahalanobis distance classifier is employed in order to classify the acoustic patterns. The Mahalanobis distance is defined as:

$$d(x_t, x_r) = (x_r - x_t)\, C_r^{-1}\, (x_r - x_t)^T, \tag{13}$$

where $x_t$ and $x_r$ are the test and the reference feature vectors, respectively, and $C_r$ is the covariance matrix of the reference data [12].

A codebook is derived from the mean values of each class of audio signals and represents the class. The distance between a test pattern and the codeword of each class is calculated and the test pattern is assigned to the class that yields the minimum distance among all classes.

4. AUDIO DATABASE & EXPERIMENTAL SET UP

An audio signal database containing gunshot recordings (10 classes of 10 recordings each, on average) is used to test the performance of the proposed Hartley spectrum-based feature set. The ten classes contain firings of: i) revolver, ii) .22 caliber handgun, iii) M-1 rifle, iv) World War II German rifle, v) cannon, vi) 30-30 rifle, vii)

The correct classification rates (average across ten classes) obtained from the gunshot database with the Mahalanobis classifier are as follows:
• Hartley transform spectr.: 82.9%,
• Hartley magnitude spectr.: 91.4%,
• whitened Hartley spectr. (via the DTHT): 81.4%,
• whitened Hartley spectr. (via the z-transform): 82.9%.

These results compare favorably to their Fourier spectrum counterparts, reported in [7] under the same experimental set up. In fact, the whitened Hartley spectrogram computed via the DTHT outperforms the Fourier spectrogram computed via the DTFT by 5.7%, while the one computed via the z-transform outperforms its Fourier counterpart by 4.8%. This significant improvement is explained by the ‘immunity’ of the whitened Hartley spectrum to the discontinuities, relative to its Fourier counterpart.

For the computation of the whitened Hartley spectrogram via the z-transform, the choice of the width of the exclusion ‘ring’ around the unit circle is studied as to its impact on the classification rate. The width of the ‘ring’ is varied between 0 and 0.001. A width of 0 means that ‘zeros’ are excluded only if they lie exactly on the unit circle, which is not possible in practical calculations of roots. Further increase beyond 0.001 would yield unreliable results, as the ring would exclude too many ‘zeros’. Classification scores vary accordingly between 78.6% and 68.6%. In general, classification scores decrease as the ring width increases, due to the information loss incurred; they peak for a ‘ring’ width of 0.00003 (82.9%). A similar performance, at the same ring width, is reported in [7] for the Fourier spectrogram via the z-transform.

A closer examination of correct classification scores per class reveals that the 2nd, 4th, 5th and 10th classes yield the poorest results and lower the average scores across classes. In order to assist the decision as to these classes, three out of the five ‘experts’ or streams are combined in a fusion scheme shown in Fig.1:
1. Fourier magnitude spectrogram: it encapsulates only the magnitude content of the signal,
2. Whitened Hartley spectrogram (via the z-transform): it conveys the phase-related content of the signal, and
3. Hartley magnitude spectrogram: it encapsulates both signal magnitude and phase spectral content.

[Fig. 1 block diagram: three parallel streams, each feeding its own Mahalanobis classifier (stream 1: Fourier magnitude; stream 2: ‘whitened’ Hartley via the z-transform; stream 3: Hartley magnitude); the three decisions are combined by a majority vote.]

Fig. 1. Majority vote decision rule: A recording is classified to a certain class if two or more streams agree.

An incoming sound is ‘misclassified’ when two or more out of the three streams assign it to the same ‘false’ class, and ‘unclassified’ when each one of the three streams assigns it to a different class.

For these four classes, classification rates per single ‘expert’ or stream are tabulated in Table 1. When the majority vote decision rule of Fig.1 is employed, however, the classification score reaches 90.5%, with 9.5% unclassified recordings (no misclassifications).

                  Fourier      Whitened       Hartley
                  Magnitude    Hartley (z)    Magnitude
Class 2           83.3         83.3           83.3
Class 4           63.6         63.6           77.9
Class 5           81.8         91.7           91.7
Class 10          91.7         85.1           91.7
All 4 classes     80.4         81.2           86.5

Table 1. Correct classification scores in (%) per ‘expert’.

6. CONCLUSION

We have proposed a novel approach to phase extraction, based on the Hartley rather than the Fourier transform, with application to an audio signal classification problem.

[2] Z. Xiong et al., “Audio events detection based highlights extraction from baseball, golf and soccer games in a unified framework,” IEEE Intl. Conf. on Acoustics, Speech and Signal Processing, vol. 5, pp. 632-635, 2003.
[3] Y. Wang, Z. Liu, “Multimedia Content Analysis using both Audio and Visual Cues,” IEEE Signal Processing Magazine, November 2000.
[4] J.M. Tribolet, “A new phase unwrapping algorithm,” IEEE Trans. on Acoustics, Speech and Signal Processing, vol. 25, pp. 170-177, April 1977.
[5] H. Al-Nashi, “Phase Unwrapping of Digital Signals,” IEEE Trans. on Acoustics, Speech and Signal Processing, vol. 37, no. 11, November 1989.
[6] C.L. Nikias, A.P. Petropulu, Higher-order spectra analysis: a nonlinear signal processing framework, Prentice Hall - Signal Processing Series, 1993, Ch. 6.
[7] I. Paraskevas, E. Chilton, “Combination of Magnitude and Phase Statistical Features for Audio Classification,” Acoustics Research Letters Online, vol. 5, (3), July 2004.
[8] J.G. Proakis, D.G. Manolakis, Digital Signal Processing: Principles, Algorithms, and Applications, Macmillan Publishing Company, 1992, Ch. 4 and 5.
[9] R.N. Bracewell, The Fourier Transform and Its Applications, McGraw-Hill Book Company, 1986, Ch. 19.
[10] E. Mansfield, Basic Statistics with Applications, W.W. Norton and Co., 1986.
[11] A. Papoulis, Probability and Statistics, Prentice-Hall, Inc., 1990, Ch. 12.
[12] P.C. Mahalanobis, “On the generalized distance in statistics,” Proceedings of the National Institute of Science of India, vol. 12, pp. 49-55, 1936.
[13] G.E. Forsythe, M.A. Malcolm, C.B. Moler, Computer Methods for Mathematical Computations, Prentice-Hall, 1977, Sect. 7.
[14] E. Chilton, “An 8kb/s speech coder based on the Hartley transform,” ICCS ‘90 Communication Systems: Towards Global Integration, vol. 1, pp. 13.5.1-13.5.5, 1990.

APPENDIX

[Figure: z-plane diagram showing the imaginary axis, the origin O, the points A, B and C, a point z, and the angles ϕ and ω.]
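As an illustration only, the codebook-based Mahalanobis classifier of Section 3 and the majority vote rule of Fig. 1 may be sketched in Python as follows. The function names, the use of NumPy, the per-class estimate of the covariance matrix and the use of a pseudo-inverse are assumptions made for this sketch; they are not taken from the experimental set up described above.

```python
import numpy as np


def fit_codebook(features, labels):
    """Build one codeword per class: (mean feature vector, inverse covariance).

    features: array of shape (n_recordings, n_features)
    labels:   array of shape (n_recordings,)
    Assumption: the covariance C_r is estimated separately for each class.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    codebook = {}
    for c in np.unique(labels):
        ref = features[labels == c]
        mean = ref.mean(axis=0)
        # Assumption: the pseudo-inverse guards against a singular covariance
        # matrix when a class holds only a few recordings.
        cov_inv = np.linalg.pinv(np.cov(ref, rowvar=False))
        codebook[c] = (mean, cov_inv)
    return codebook


def mahalanobis(x_t, x_r, cov_inv):
    """Distance of Eq. (13): (x_r - x_t) C_r^{-1} (x_r - x_t)^T."""
    diff = np.asarray(x_r, dtype=float) - np.asarray(x_t, dtype=float)
    return float(diff @ cov_inv @ diff)


def classify(x_t, codebook):
    """Assign the test pattern to the class whose codeword is closest."""
    return min(
        codebook,
        key=lambda c: mahalanobis(x_t, codebook[c][0], codebook[c][1]),
    )


def majority_vote(decisions):
    """Fuse the stream decisions as in Fig. 1.

    Returns the class chosen by two or more streams, or None when every
    stream picks a different class (an 'unclassified' recording).
    """
    classes, counts = np.unique(np.asarray(decisions), return_counts=True)
    best = counts.argmax()
    return classes[best] if counts[best] >= 2 else None
```

In such a sketch, one codebook would be fitted per stream (Fourier magnitude, whitened Hartley via the z-transform, Hartley magnitude), each test recording would be passed through `classify` once per stream, and the three resulting labels would be handed to `majority_vote`.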