
Audio Processing

Audio
- RMS
- Mel-Frequency Cepstrum Coefficients (MFCCs)

Audio-Visual
- Mixelgrams (Hershey & Movellan, 2000)

RMS

Root Mean Square: an audio feature measuring the average amplitude of the audio signal

RMS = \sqrt{\frac{1}{N} \sum_{t=1}^{N} a_t^2}

where a_t is the audio amplitude (i.e., raw audio sample) at time t, and N is the number of samples.

IKAROS module: RMSAudio

Example XML script:
- RMSAudio/Example/RMSAudio_test.xml
- Example creates file rms.txt with RMS amplitude

[Figure: per-visual-frame RMS amplitude of the sDog.mov audio track]
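For concreteness, a minimal NumPy sketch of the RMS computation itself (the function name and test signal are illustrative, not part of the IKAROS module):

import numpy as np

def rms(samples):
    """Root mean square amplitude of a window of raw audio samples."""
    samples = np.asarray(samples, dtype=np.float64)
    return np.sqrt(np.mean(samples ** 2))

# Example: one visual frame's worth of audio (1/30 s at 44.1 kHz = 1470 samples)
window = np.sin(2 * np.pi * 440 * np.arange(1470) / 44100.0)
print(rms(window))  # ~0.707 for a unit-amplitude sine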

MFCCs

Implementation in IKAROS ported from Marsyas by Eric Mislivec
- Marsyas: http://sourceforge.net/projects/marsyas

Marsyas implementation based on Matlab code by Malcolm Slaney
- http://cobweb.ecn.purdue.edu/~malcolm/interval/1998-010/
- See the file mfcc.m contained in the toolbox archive
- See pp. 30-33 of the documentation: http://cobweb.ecn.purdue.edu/~malcolm/interval/1998-010/AuditoryToolboxTechReport.pdf

MFCCs (Mel-Frequency Cepstrum Coefficients)

A parametric representation of a speech audio signal (Davis & Mermelstein, 1980)

Objective: compress speech data by eliminating information not pertinent to phonetic analysis
- E.g., represent an interval of 6.4 ms of speech audio with 10 real numbers
- With mono audio sampled at 44.1 kHz, the same interval would otherwise be 282 numbers (0.0064 s × 44,100 samples/s ≈ 282 samples)

Observation used by Davis & Mermelstein (1980)

The first six eigenvectors of the covariance matrix for Dutch vowels of three speakers, expressed in terms of 17 mel-frequency filter energies, accounted for 91.8% of the total variance (Pols, 1966)

Mel-Frequency Scale
A linear frequency spacing below low frequencies (e.g., 1,000 Hz); a logarithmic spacing above high freq. (e.g., 1,000 Hz) Filters corresponding to these spacings seem to capture phonetically important characteristics of speech (e.g., of the cochlea??)

40 mel-frequency filters; MFCC implementation by Malcolm Slaney (http://cobweb.ecn.purdue.edu/~malcolm/interval/1998-010/)
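As a rough illustration of the scale, a sketch of one common Hz-to-mel mapping (the O'Shaughnessy formula; note Slaney's toolbox uses a piecewise linear/log variant, so the constants here are illustrative, not his exact mapping):

import numpy as np

def hz_to_mel(f_hz):
    # Approximately linear below ~1 kHz, logarithmic above
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m) / 2595.0) - 1.0)

# 40 filter center frequencies evenly spaced on the mel scale, 0-8 kHz
edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 42))
centers = edges[1:-1]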

Frequency Analysis of Audio
- Audio samples as obtained digitally reflect amplitude
- Mel-frequency filters are applied to the frequency content of the audio, not the amplitude
- Before processing with mel-frequency filters, first take the Discrete Fourier Transform (DFT) of the audio
  - Converts to a frequency representation
- DFT analysis occurs in terms of a number of equally spaced bins
  - Each bin represents a particular frequency range
  - DFT analysis gives the amount of energy in the audio signal present within the frequency range of each bin
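A small sketch of this step using NumPy's FFT routines (the 1024-sample window size and the random stand-in signal are arbitrary assumptions):

import numpy as np

sample_rate = 44100
window = np.random.randn(1024)                 # stand-in for one audio window
spectrum = np.fft.rfft(window)                 # complex DFT of a real signal
power = np.abs(spectrum) ** 2                  # energy per frequency bin
bin_freqs = np.fft.rfftfreq(len(window), d=1.0 / sample_rate)
# bin k spans roughly sample_rate / len(window) Hz around bin_freqs[k]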

Using Mel-Frequency Filters - 1
- Applied to the frequency analysis of the audio
  - I.e., to the power spectrum of the DFT
- The result of applying the mel-frequency filters is to reduce the amount of data
  - Instead of a number of values equal to the number of bins produced by the DFT, we now have a number of values equal to the number of filters
- Example
  - Input: 512 frequency bin values
  - Output: 40 filter responses

Using Mel-Frequency Filters - 2
- The filter bank can be represented as a matrix
  - E.g., with 40 filters and 512 frequency bins, we obtain a 40x512 filter matrix
  - See: MFCC/mfccFilterWeights.xls
- The filters are applied to the DFT bin output by matrix multiplication
  - With the DFT bin output in a 512x1 matrix (F), and the mel-frequency filters in a 40x512 matrix (M), application of the filters to this frequency data is
    M x F = 40x1 matrix
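A shape-level sketch of this multiplication (the filter weights below are random placeholders; the real ones are the triangular mel weights, cf. MFCC/mfccFilterWeights.xls):

import numpy as np

M = np.random.rand(40, 512)    # placeholder 40x512 mel filter matrix
F = np.random.rand(512, 1)     # placeholder 512x1 DFT power spectrum
responses = M @ F              # 40x1: one response per filter
print(responses.shape)         # (40, 1)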

Algorithm to compute MFCCs
1. Compute the DFT power spectrum of the speech signal
2. Apply a mel-frequency filter bank to the power spectrum to get N filter responses (N = 20-60)
3. Compute the discrete cosine transform (DCT) of the log filter-bank energies to get uncorrelated MFCCs (e.g., M = 10 values)
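Putting the three steps together, an end-to-end sketch under simplifying assumptions (rectangular windowing, a random placeholder filter bank, SciPy's orthonormal DCT; a real implementation would use the Slaney filter weights):

import numpy as np
from scipy.fft import dct

def mfcc_frame(samples, mel_filters, n_coeffs=13):
    power = np.abs(np.fft.rfft(samples)) ** 2               # 1. DFT power spectrum
    energies = mel_filters @ power[:mel_filters.shape[1]]   # 2. mel filter bank
    log_energies = np.log10(np.maximum(energies, 1e-10))    # avoid log(0)
    return dct(log_energies, type=2, norm="ortho")[:n_coeffs]  # 3. DCT, keep first few

mel_filters = np.random.rand(40, 512)   # placeholder triangular filter weights
frame = np.random.randn(1024)           # stand-in for one speech window
print(mfcc_frame(frame, mel_filters))   # 13 MFCC values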

Discrete Cosine Transform (DCT) - 1


Signals are represented in terms of basis functions:
(a) First basis function: a constant component (DC)
(b) Remaining basis functions: a series of successively increasing frequency components (AC)

The basis functions are uncorrelated (orthogonal and orthonormal). The DCT generates as many component scalar values (numbers) as are present in the original signal.

F(u) = C(u) \sqrt{2/M} \sum_{i=0}^{M-1} f(i) \cos\left( \frac{(2i+1) u \pi}{2M} \right)

f(i) = \sqrt{2/M} \sum_{u=0}^{M-1} C(u) F(u) \cos\left( \frac{(2i+1) u \pi}{2M} \right)

where f(i) is the original discrete-valued signal and F(u) is the transformed signal, with 0 <= u <= M-1, and

C(u) = \begin{cases} \sqrt{2}/2 & \text{if } u = 0 \\ 1 & \text{otherwise} \end{cases}

Discrete Cosine Transform (DCT) - 2


A trick with the DCT, though, is not to use all of the resulting F(u) terms to represent the original signal f(i). Rather, in decompression only some initial portion of the F(u) terms is used
- I.e., the latter, higher-frequency F(u) terms are dropped

This corresponds to representing the lower-frequency components of the signal while dropping some of the higher-frequency components, and it leads us to a second form of the inverse DCT:

\tilde{f}(i) = \sqrt{2/M} \sum_{u=0}^{P-1} C(u) F(u) \cos\left( \frac{(2i+1) u \pi}{2M} \right)

With P < M, \tilde{f}(i) is an approximation of the original signal f(i).
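A small sketch of this truncated reconstruction using SciPy's DCT pair (the signal here is random; in the MFCC examples below, f would be the 40 log filter energies and P = 13):

import numpy as np
from scipy.fft import dct, idct

M, P = 40, 13
f = np.random.randn(M)                   # stand-in for 40 log filter energies
F = dct(f, type=2, norm="ortho")         # forward DCT: M coefficients
F_trunc = np.zeros_like(F)
F_trunc[:P] = F[:P]                      # drop the higher-frequency terms
f_approx = idct(F_trunc, type=2, norm="ortho")  # second form of the inverse DCT
print(np.abs(f - f_approx).max())        # nonzero: P < M loses fine detail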

DCT Example - 1

With our MFCC computation, the DCT is applied to the output of the 40 mel-scale filters. Example (log10) output of the filters:

-2.586700, -2.627700, -2.086800, -2.100100, -2.319800, -1.975200, -2.178400, -2.195400, -1.953100, -2.231900, -2.021400, -1.933800, -1.966900, -1.514800, -1.646200, -1.530600, -1.488100, -2.062500, -2.286800, -2.348500, -2.538200, -2.696200, -2.764100, -2.852300, -2.950000, -2.843200, -2.454700, -2.438700, -2.655200, -2.318300, -2.457900, -3.171100, -3.413300, -2.628100, -2.558700, -3.296300, -3.576600, -3.560700, -3.462800, -3.396300

The DCT is then applied, and the first 13 of the DCT components are retained
- These are the MFCCs

DCT Example - 2

DCT result:
-15.667092, 2.549819, -1.343900, -0.593737, -0.913674, 0.896987, 0.198794, -0.630580, -0.427731, 0.001723, -0.197247, 0.127791, 0.041553, -0.574495, 0.279874, -0.355358, 0.257603, -0.292258, 0.079028, 0.226440, -0.369133, 0.400663, -0.348537, -0.034636, 0.321552, -0.140994, 0.129148, -0.001110, 0.211364, 0.337528, 0.264207, 0.034994, -0.117453, -0.037960, -0.082142, 0.059513, 0.011227, 0.032282, 0.017767, -0.027099

First 13 (the MFCCs):
-15.667092, 2.549819, -1.343900, -0.593737, -0.913674, 0.896987, 0.198794, -0.630580, -0.427731, 0.001723, -0.197247, 0.127791, 0.041553

DCT Example: Inverse DCT

IKAROS Example - 1

Purpose
- To demonstrate the computation of MFCC parameters from samples of audio
- Audio data: DogAudio.wav

Running the code
- Located in MFCC/MFCCAudio/Example
- From a Terminal: IKAROS MFCCAudio_test.xml
- Generates 328 sets of MFCC coefficients (each has 13 values)

IKAROS Example - 1: Evaluation

To evaluate, compare the results of the DFT on the DogAudio.wav data itself to a reconstruction of the DFT based on the MFCC output.

Procedure
1. Use praat (http://www.praat.org/) to first get a sense of what the DFT should look like
2. Use Mathematica to display the DFT results of the MFCCAudio processing
3. Use Matlab to reconstruct the DFT results based on the MFCC output (mfccTest.m), and use Mathematica to display the reconstructed results

IKAROS Example - 1: Evaluation Results

[Figure: praat spectrogram of DogAudio.wav]

DFT (Mathematica):
A = Import["/Users/chris/Desktop/DogAudio_fft.data", "csv"];
ListDensityPlot[A, Mesh -> False, AspectRatio -> 0.25, ImageSize -> {800, 200}]

Reconstruction (Matlab + Mathematica):
A = Import["/Users/chris/Desktop/DogAudio_freqrecon.data", "csv"];
ListDensityPlot[A, Mesh -> False, AspectRatio -> 0.25, ImageSize -> {800, 200}]

IKAROS Example - 2

Purpose
- To demonstrate the computation of MFCC parameters from samples of audio contained in a Quicktime file
- Audio data: sCup.mov

Running the code
- Located in IKAROS/MFCC/MFCCAudio/Example
- From a Terminal: IKAROS MFCCAudio_test2.xml
- Generates 448 sets of MFCC coefficients (each has 13 values)

IKAROS Example - 2: Evaluation Results

[Figure: praat spectrogram of the sCup.mov audio track]

DFT (Mathematica):
A = Import["/Users/chris/Desktop/sCup_fft.data", "csv"];
ListDensityPlot[A, Mesh -> False, AspectRatio -> 0.25, ImageSize -> {800, 200}]

Reconstruction (Matlab + Mathematica):
A = Import["/Users/chris/Desktop/sCup_freqrecon.data", "csv"];
ListDensityPlot[A, Mesh -> False, AspectRatio -> 0.25, ImageSize -> {800, 200}]

Audio-Visual Synchrony Features

Several algorithms process audio-visual data and generate a measure of audio-visual synchrony as output
- E.g., Hershey & Movellan (2000), Slaney & Covell (2001), Kidron et al. (2005)

Hershey & Movellan (2000)
- Compares the current audio (e.g., a 1/2 s time interval) to all parts of the current visual input (e.g., the same 1/2 s time interval)
  - Separate comparisons for each part of the visual input
- Can process raw or feature-processed audio and visual data
- Effectively spatial visual processing; no spatial audio processing

[Figure: audio over a 1/2 s window compared against each visual region over the same window]

Mutual Info Computation

M(x, y, t_k) = \frac{1}{2} \log_2 \frac{ |\Sigma_A(t_k)| \, |\Sigma_V(x, y, t_k)| }{ |\Sigma_{A,V}(x, y, t_k)| }

Notation (see also section 2.2 of Prince & Hollich, 2005):
- \Sigma is covariance; |X| is the determinant of matrix X
- V(x, y, t_k) is a matrix with the current visual information for region (x, y) of the visual frames (e.g., pixel at x, y) over a time interval of length S (e.g., S = 15 visual frames)
- A(t_k) is a matrix with the current audio information over the same time interval (length S)
- \Sigma_{A,V}(x, y, t_k) is computed from the matrix whose columns are those of V(x, y, t_k) and A(t_k)

The minimum mutual information is 0.
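A minimal sketch of this Gaussian mutual-information measure for one visual region (the feature dimensions and S below are illustrative assumptions; rows are time samples over the shared interval):

import numpy as np

def gaussian_mi(A, V):
    """0.5 * log2(|Cov_A| |Cov_V| / |Cov_{A,V}|); 0 when A and V are independent."""
    cov = lambda X: np.atleast_2d(np.cov(X, rowvar=False))
    det = np.linalg.det
    joint = np.hstack([A, V])       # columns of A and V side by side
    return 0.5 * np.log2(det(cov(A)) * det(cov(V)) / det(cov(joint)))

# Example: S = 15 time steps, 2 audio features, 1 visual feature for region (x, y)
A = np.random.randn(15, 2)
V = np.random.randn(15, 1)
print(gaussian_mi(A, V))            # always >= 0; larger means more synchrony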

Output: Mixelgram
- Mutual information is taken for each part of the visual scene (e.g., each pixel) at each time step, giving a qualitative spatial map of synchrony
- The module Mixelgram in IKAROS performs this computation
- Example XML file: Mixelgram/Example/Mixelgram_test.xml
[Figure: example output]

[Movie of output]

Conclusions
- It is important to share not just ideas but also working system components (e.g., program code components)
- Let's have someone do Feature Processing Version #2 next year (and make it more developmental!)

REFERENCES

For the Discrete Cosine Transform (DCT):
Li, Z., & Drew, M. S. (2004). Fundamentals of Multimedia. Upper Saddle River, NJ: Pearson Prentice Hall.

Hershey, J., & Movellan, J. (2000). Audio-vision: Using audio-visual synchrony to locate sounds. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in Neural Information Processing Systems 12 (pp. 813-819). Cambridge, MA: MIT Press.

Kidron, E., Schechner, Y. Y., & Elad, M. (2005). Pixels that sound. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 1, pp. 88-96.

Prince, C. G., & Hollich, G. (2005). Synching models with infants: A perceptual-level model of infant audio-visual synchrony detection. Journal of Cognitive Systems Research, 6, 205-228. Internet: http://www.cprince.com/PubRes/JCSR04

Slaney, M., & Covell, M. (2001). FaceSync: A linear operator for measuring synchronization of video facial images and audio tracks. In Proceedings of Neural Information Processing Systems 13. Cambridge, MA: MIT Press.
