
Implementing a Quantitative Model for the Effective Signal Processing in the Auditory System on a Dedicated Digital VLSI Hardware

A. Schwarz, B. Mertsching University of Hamburg Computer Science Department, IMA Group D-22527 Hamburg, Germany
schwarz@informatik.uni-hamburg.de

M. Brucke, W. Nebel University of Oldenburg Computer Science Department, VLSI Group D-26111 Oldenburg, Germany
Matthias.Brucke@Informatik.Uni-Oldenburg.De

J. Tschorz, B. Kollmeier University of Oldenburg Physics Science Department, Medical Physics Group D-26111 Oldenburg, Germany
biko@medi.physik.uni-oldenburg.de

Abstract
A digital VLSI implementation of an algorithm modeling the effective signal processing of the human auditory system is presented. The model consists of several stages psychoacoustically and physiologically motivated by the signal processing in the human ear and has been successfully applied to various speech processing tasks. The processing scheme was partitioned for implementation into a set of three chips. Due to local properties of the signal dynamics and the required arithmetical precision, different approaches for number representation and the corresponding arithmetic operators were investigated and implemented. It is demonstrated how an application of the model was used to determine the wordlengths necessary for transferring the algorithm into a version suitable for hardware implementation. Fix point arithmetic is used in the linear parts of the original algorithm, and a special small floating point operator set was developed for the nonlinear part. This part was coded in behavioral VHDL and synthesized with the Synopsys Behavioral Compiler. The hardware algorithm is being evaluated at different implementation levels on FPGAs and will be manufactured as ASICs in a later version. The presented FPGA chip set will be combined with a commercial DSP system (TMS320C6201) for real time and reconfigurable signal processing.

1. Introduction

The binaural perception model introduced in [1] describes the effective signal processing in the human auditory system and provides an appropriate internal representation of acoustic signals. Its capabilities were successfully demonstrated as a preprocessing algorithm for speech recognition [2], objective speech quality measurement [3], and digital hearing aids. The algorithm processes stereo signals and includes a gammatone filter bank (30 bandpass filters equidistantly distributed on the ERB scale from 73 to 6700 Hz) to model spectral properties of the human ear such as spectral masking and the frequency-dependent bandwidth of the auditory filters.

[Figure 1: block diagram — stereo input → gammatone filterbank → envelope extraction (halfwave rectification, 1 kHz lowpass filtering), absolute threshold → adaptation loops (time constants t1..t5) → 8 Hz lowpass filter, for the left and the right channel.]

Figure 1. Processing scheme of the binaural perception model introduced in [1].

A stage modeling inner hair cell behavior (envelope extraction) is followed by five adaptation loops (with time constants between 5 and 500 ms) to account for dynamical effects such as nonlinear adaptive compression and temporal masking (see Fig. 1). The presented VLSI design contains additional components to determine differences in phase and magnitude between the left and the right channel (Fig. 2).

2. Hardware design specifications


Due to its complexity, the design was partitioned into three chips (Fig. 2). Besides the serial data interfaces, chip 1 contains the 30-channel binaural filter bank, the envelope extraction, and a module computing phase differences and magnitude quotients for each stereo output of the channels. A single bandpass filter is multiplexed through all 30 channels for both the left and the right stereo signal. Including a six-stage pipelined multiplier and one adder/subtractor, this kernel is realized by a quad cascade of a single stage complex IIR filter. This saves chip area but requires a 50 MHz system clock to operate at a 16276 Hz sampling frequency. RAM units save temporary data, and the filter constants are read from a ROM (see Fig. 2). Three system outputs are available for further processing. A high speed interface (30 MBit/s) provides the real and imaginary parts of the right and left stereo data for all filter bank channels (chip 1). The adaptively compressed data of the left (1st chip 2) and the right stereo signal (2nd chip 2) and the phase and amplitude information of the filter bank outputs are combined into a second data stream (12 MBit/s). The 4-wire serial interfaces of the chip set (16 bit data words) support a direct connection to the serial ports of most DSP devices. Together with a DSP able to serve the fast serial ports (TI TMS320C6201), a system solution for auditory signal processing is provided.
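The quad cascade of a single stage complex IIR filter can be illustrated with a floating point sketch. This is not the fixed-point hardware kernel; the mapping from bandwidth to pole radius and the parameter values below are assumptions for illustration only.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

// Sketch: a gammatone-like bandpass as a fourfold cascade of ONE complex
// first-order recursion, y[n] = x[n] + a * y[n-1], with the complex pole
// a = lambda * exp(i*2*pi*fc/fs). Reusing the same stage four times mirrors
// the multiplexed hardware kernel (one multiplier, one adder/subtractor).
std::vector<std::complex<double>> gammatone(const std::vector<double>& x,
                                            double fc, double bw, double fs) {
    const double kPi = 3.14159265358979323846;
    double lambda = std::exp(-2.0 * kPi * bw / fs);   // pole radius (assumed mapping)
    std::complex<double> a =
        lambda * std::exp(std::complex<double>(0.0, 2.0 * kPi * fc / fs));
    std::vector<std::complex<double>> y(x.size());
    std::complex<double> state[4] = {};               // one state per cascade stage
    for (std::size_t n = 0; n < x.size(); ++n) {
        std::complex<double> v(x[n], 0.0);
        for (int k = 0; k < 4; ++k) {                 // quad cascade of the one stage
            v = v + a * state[k];
            state[k] = v;
        }
        y[n] = v;   // real part = bandpass output, abs() = envelope
    }
    return y;
}
```

The complex output is what makes the subsequent envelope extraction cheap: magnitude and phase per channel fall out of the same state.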

[Figure 2: block diagram — input interfaces (50/24 MHz clocks), gammatone filter bank with multiplexed multiplier kernel, constants ROM and state RAM, halfwave rectification, lowpass & decimation, magnitude quotient and phase difference units, the five multiplexed ("rolled") adaptation loops with divider and 1st order lowpass, scale & lowpass output stage, and the 12 Mbit/s and 30 Mbit/s serial output interfaces to the DSP.]

Figure 2. Structure and wiring scheme of the internal components of the chip set.


Each of the five adaptation loops contains a divider whose quotient is fed back through a 1st order IIR lowpass providing the divisor. This feedback, the necessary precision, and the signal dynamics require large fix point wordlengths or a logarithmic number format. The dividers are very expensive in area and are therefore the most critical components in the design. Fourfold subsampling and data serialization in chip 1 allow a multiplexed loop kernel implemented monaurally in two chips (two instances of chip 2). The loop kernel contains RAM cells storing the states of all lowpass filters for the 30 serially processed frequency channels.

This error bound leads directly to the necessary internal wordlength: allowing an error of 0.001, a minimal wordlength of 24 bits is required for the lowest filter bank channel (Fig. 3).

3. Floating point to fix point to floating point arithmetic suitable for auditory signal processing
A direct implementation of the model with IEEE 32 bit single precision floating point arithmetic is not possible due to area and timing limitations. To obtain an optimal implementation, different methods are applied to the linear filter bank and to the nonlinear adaptation loops respectively. The main problem when converting number formats and designing dedicated arithmetic is determining the required numerical precision. Because the necessary quantization depends on the application and on the typical signal dynamics, the perception model was recoded in C++ using new classes of scalable data types together with the necessary operators. Such a class takes the internal wordlength as a parameter and stores values in exactly the same format as they would be stored in a register on an ASIC. Thus numerical effects of imprecise arithmetic can be simulated in target applications. The kernel arithmetic of the gammatone filter bank was designed and successfully evaluated in a fix point notation. After evaluating a scalable fix point version of the nonlinear adaptation loops and recognizing the high area consumption, especially for the dividers, a small floating point class was successfully tested.
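The authors' scalable data-type class is not published; a minimal sketch of the idea might look as follows, where the name, saturation behavior, and rounding mode are assumptions. The point is that every value is held as an integer of exactly the configured wordlength, i.e. as it would sit in an ASIC register.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Sketch of a wordlength-parameterized fix point type. A value with
// 'intBits' integer and 'fracBits' fraction bits is stored as the integer
// round(v * 2^fracBits), saturated to the representable range, so each
// operation requantizes exactly like a hardware register would.
class Fix {
public:
    Fix(double v, int intBits, int fracBits)
        : intBits_(intBits), fracBits_(fracBits) {
        int64_t maxRaw = (int64_t(1) << (intBits + fracBits)) - 1;
        int64_t raw = std::llround(v * double(int64_t(1) << fracBits));
        if (raw > maxRaw) raw = maxRaw;     // saturate like hardware clipping
        if (raw < -maxRaw) raw = -maxRaw;
        raw_ = raw;
    }
    double value() const {
        return double(raw_) / double(int64_t(1) << fracBits_);
    }
    // Operate in full precision, then requantize back to this format.
    Fix operator*(const Fix& o) const { return Fix(value() * o.value(), intBits_, fracBits_); }
    Fix operator+(const Fix& o) const { return Fix(value() + o.value(), intBits_, fracBits_); }
private:
    int64_t raw_;
    int intBits_, fracBits_;
};
```

For example, Fix(0.1, 4, 15) stores round(0.1 * 2^15) = 3277, i.e. the value 3277/32768, and that quantization error then propagates through the simulated algorithm exactly as it would on the chip.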

Figure 3. Error introduced by fix point quantization in the gammatone filter bank.

Numerical operations. The filter algorithm consists of a fourfold first-order filter which contains only additions and multiplications by constants.

Number formats. Due to the increasing analysis bandwidth, the error for a given wordlength decreases with increasing center frequency and channel number respectively. All channels use the same operator structure, so a common number format of 24 bits fix point is required.

3.2. Arithmetic transformation for nonlinear adaptation loops


Principle. Determining an optimal quantization for the adaptation loops is much more difficult because they show strongly nonlinear behavior. It was demonstrated in [3] that the perception model can supply an objective speech quality measure q: speech signals distorted by the low-bit-rate codecs used in mobile telephones are compared to their undistorted versions, and a quality measure q is derived which correlates with the subjective Mean Opinion Score (MOS) of the test signals. Because this testbench is very sensitive to limited number precision and signal dynamics in the perception model, it can be used to evaluate the modifications caused by limited quantization and arithmetic (Fig. 4). An optimized quantization of the nonlinear adaptation loops (small wordlengths, i.e. small chip area, vs. reliable signal processing) was found by empirical wordlength variation. The results were verified by processing two large sets of speech signals while varying the input signal level from -10 to 50 dB.

3.1. Arithmetic transformation for linear gammatone filters


Principle. The necessary internal wordlength for the gammatone filter bank can be assessed in a straightforward way, because the filters are linear time invariant systems to which classical numerical measures like the SNR can be applied. It is sufficient to record the impulse responses of each filter parameterized with different internal wordlengths. Figure 3 shows the mean square error (relative error, i.e. noise-to-signal ratio) between such an implementation and the original specification with IEEE single precision floating point arithmetic. Choosing a certain maximal square error (e.g. 10^-3 for all channels) then determines the required wordlength.
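This assessment boils down to running the same filter twice, once in full precision and once quantized, and comparing the impulse responses. A minimal sketch, with a one-pole lowpass standing in for a filter bank channel and quantization modeled as simple rounding to a given number of fraction bits (both assumptions, not the paper's kernel):

```cpp
#include <cassert>
#include <cmath>

// Round a value to 'fracBits' fraction bits, modeling a register of that width.
double quantize(double v, int fracBits) {
    double scale = std::ldexp(1.0, fracBits);       // 2^fracBits
    return std::round(v * scale) / scale;
}

// Noise-to-signal ratio between a reference recursion y[n] = x[n] + a*y[n-1]
// and the same recursion with coefficient and output quantized, driven by an
// impulse -- the measurement behind a Fig. 3 style wordlength plot.
double noiseToSignalRatio(double a, int fracBits, int nSamples) {
    double yRef = 0.0, yQ = 0.0, errSq = 0.0, sigSq = 0.0;
    for (int n = 0; n < nSamples; ++n) {
        double x = (n == 0) ? 1.0 : 0.0;            // impulse input
        yRef = x + a * yRef;                        // full precision reference
        yQ = quantize(x + quantize(a, fracBits) * yQ, fracBits);
        double e = yQ - yRef;
        errSq += e * e;
        sigSq += yRef * yRef;
    }
    return errSq / sigSq;                           // relative mean square error
}
```

Sweeping fracBits and reading off where the curve drops below the chosen bound (e.g. 10^-3) reproduces the procedure described above.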

[Figure 4: block diagram — the original signal and the codec-distorted signal are each processed by the perception model (with IEEE 32 bit floating, fixed, or small floating point arithmetic) and frequency weighting; cross-correlation of the internal representations yields the objective measure, which is compared/correlated with subjective MOS data.]

Figure 4. Speech quality measurement used as a testbench for changes in the kernel arithmetic of the adaptation loops in the perception model.

Data analysis. Histograms were recorded at internal nodes to investigate signal levels during the processing of typical speech (ETSI test data [4][5]) and noise input signals (Fig. 5).
[Figure 5: histograms (logarithmic frequency axis) of the outputs of loops 0-4 over the value range 0-30 and of the divisors of loops 0-4 over the value range 0.2-1.0.]

Figure 5. Histograms of output and divisor in the adaptation loops for typical speech signals.

The divisors of the loops have an individual threshold, and their lower bounds are introduced to reduce unwanted peaks. The dynamic range is clearly limited: only positive values occur in the loops, the divisors never exceed 1.0, and the loop outputs are concentrated near zero. This is to be expected, since small amplitudes occur very frequently in typical speech signals according to their probability density distribution [6].

Numerical operations. Within the loops and the subsequent scaling and lowpass unit, the original C code contains all basic arithmetic operators (Table 1). The current quotient qi[n] in loop i is calculated from the local lowpass filter output bi[n-1] of the previous cycle. The current lowpass output is derived from its previous output bi[n-1] and the new quotient qi[n]. The output of the last loop, q5[n], is shifted and scaled to s[n] in the scaling unit, and after the final lowpass filter the result o[n] is passed to the output interface. All Cx(i) are constants.

division in loop i (i = 0, ..., 4):   q0[n] = x[n] / b0[n-1]  (1st loop);   qi[n] = qi-1[n] / bi[n-1]  (others)
lowpass in loop i:                    bi[n] = C1i * qi[n] + C2i * bi[n-1]
scaling unit:                         s[n] = (q5[n] - C3) * C4
completing lowpass:                   o[n] = C5 * s[n] + C6 * o[n-1]

Table 1. Operations in the adaptation loops; i is the loop number and n the sample number.
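The loop recursions of Table 1 can be prototyped directly. The sketch below is a floating point reference (not the hardware wordlengths); the concrete time constants and the start value of the divisors are assumptions based only on the 5-500 ms range named in the text.

```cpp
#include <cassert>
#include <cmath>

// Direct prototype of the Table 1 recursions: five chained loops, each
// dividing by the lowpass-filtered history of its own output.
struct AdaptationLoops {
    double b[5];                 // lowpass states = the divisors bi[n-1]
    double C1[5], C2[5];
    explicit AdaptationLoops(double fs) {
        const double tau[5] = {0.005, 0.05, 0.129, 0.253, 0.5};  // assumed, in s
        for (int i = 0; i < 5; ++i) {
            C2[i] = std::exp(-1.0 / (tau[i] * fs));  // 1st order IIR pole
            C1[i] = 1.0 - C2[i];                     // unity DC gain
            b[i] = 1.0;                              // assumed start: full scale
        }
    }
    double process(double x) {
        double q = x;
        for (int i = 0; i < 5; ++i) {
            q = q / b[i];                            // division in loop i
            b[i] = C1[i] * q + C2[i] * b[i];         // lowpass in loop i
        }
        return q;                                    // q5[n], before scaling unit
    }
};
```

Fed with a constant input, the chain settles so that each divisor equals its loop output, which is the adaptive compression the text describes: a stationary input x converges toward x^(1/32).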

A useful simplification for the hardware specification is the fact that all values remain positive up to the last output of the last loop. Only the scaling unit introduces a sign bit, which propagates to the output.

Number formats. Considering the necessary precision of the kernel arithmetic and the arithmetic cores available in the synthesis tool libraries (Synopsys DesignWare), two approaches are possible. Simulations with the integer prototype show that, using the available fix point operators, a number format of 4 integer bits (int part) and 15 fraction bits (frac part) is sufficient, and all constants Cx(i) have to be quantized with 19 fraction bits. When dividing or multiplying these fix point numbers, the internal wordlengths must be larger to hold all possible digits: 34 bits for the divider (eq. 1) and 38 bits for the multiplier (eq. 2). The dividend has to be prescaled (shifted), because the integer part of the quotient can grow by the fraction bits of the divisor (complementary to multipliers).

div wordlength = (int part + frac part) + (frac part)            (1)
mul wordlength = (int part a) + (frac part a + frac part b)      (2)

The product wordlength is the sum of the wordlengths of the operands a and b; operand b (the filter constants) only has a fraction part (frac part b). In addition, a 20 bit fix point adder and subtractor are necessary. The most expensive operator is the 34 bit divider, with an unacceptably large area demand; it appears to be near the limits of what the design tools can handle. A floating point number format has therefore been introduced for the adaptation loops to reduce the area requirements and the long signal propagation delays through the combinational nets of the operators (Table 2). The speech quality measure testbench shows that a small floating point divider with a 6 bit significand and a 6 bit exponent in the unsigned operands is sufficient (Fig. 6) and has an impressively reduced area demand (see Table 4).

Divider (precision p = 5):                  significand s = 6, exponent e = 6, binary excess 100000
Multiplier, Adder, Subtractor (p = 13):     significand s = 14, exponent e = 6
largest error (machine epsilon [7]):        eps = 2^-p = 0.03125 (div); eps = 2^-p = 0.00012207 (mul, add, sub)
max binary value (div):                     111111.111111
min binary value (div):                     100000.000000
binary zero (div):                          000000.100000

Table 2. Properties of the small floating point number format.

Furthermore, this number format matches the requirements of speech processing systems much better than a fix point system with equidistant resolution, since its logarithmic range partitioning has the best resolution at the lower end (near zero) of the representable dynamic range, where speech signals are concentrated. For the same reason, i.e. the probability density distribution of speech signals, the A- and mu-law companding characteristics of the AD and DA converters are efficient standards for telecommunication systems. A similar approach is introduced in [8] for a neural net implementation for speech recognition, where the net weights could be successfully quantized in a floating point format of only 1 sign bit, 1 mantissa bit, and 3 exponent bits.

Prototype and VHDL implementation. Since the design tool libraries support neither scalable floating point data types nor the corresponding operators, a custom prototype was developed. Similar to the approach proposed in [9], floating point operators were designed which incorporate fix point sub-units provided by the synthesis tools. However, a test and simulation environment that can evaluate signal distortions with meaningful coverage over large data streams (ETSI test data [5]) is not feasible at the VHDL logic simulation level. Therefore, a C++ class was designed whose operators work identically to the desired hardware version and allow extensive tests of different wordlengths. Multiplication (eq. 3) and division (eq. 4) use fix point library elements for multiplying/dividing the significands and adding/subtracting the exponents respectively [10].

(s1 * 2^e1) * (s2 * 2^e2) = (s1 * s2) * 2^(e1 + e2)      (3)
(s1 * 2^e1) / (s2 * 2^e2) = (s1 / s2) * 2^(e1 - e2)      (4)

The small floating point division is enclosed in normalization operations for each operand and for the result, in order to obtain a leading 1 in the MSBs and to reduce the complexity of data handling. Under- or overflow during normalization forces clipping of the signal to zero or full scale. The internal wordlength of the divider is twice the operand length in order to preserve the precision of the operands; normalization and shrinking to the operand wordlength follow. Adder and subtractor need exponent alignment before the mantissas can be summed or subtracted. If the operands differ greatly, one of them can vanish during alignment. When subtracting values of similar magnitude, an additional "dirty zero" problem can occur, i.e. calculation errors grow; in this design, however, a sufficient distance between subtrahend and minuend could generally be observed.
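Eq. (4) together with the Table 2 format can be sketched as follows. The bit-level packing, the truncating conversion, and the handling of the excess-32 exponent are assumptions beyond what Table 2 states.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Sketch of the unsigned small floating point format from Table 2, divider
// variant: 6 bit significand with an explicit leading 1 (100000b..111111b,
// i.e. 32..63) and a 6 bit exponent (hardware would store exp in excess-32).
struct SmallFloat {
    uint32_t sig;   // normalized significand, 32..63
    int32_t  exp;   // unbiased exponent, value = sig * 2^exp
};

SmallFloat toSmallFloat(double v) {
    assert(v > 0.0);                        // only positive values occur (see text)
    int e;
    double m = std::frexp(v, &e);           // v = m * 2^e, m in [0.5, 1)
    uint32_t sig = uint32_t(m * 64.0);      // truncate to 6 bits, always >= 32
    return {sig, e - 6};
}

double toDouble(SmallFloat f) { return double(f.sig) * std::ldexp(1.0, f.exp); }

// Division per eq. (4): divide the significands in double width (the text's
// "twice the operand length"), subtract exponents, then renormalize.
SmallFloat divSmall(SmallFloat a, SmallFloat b) {
    uint32_t q = (a.sig << 6) / b.sig;      // 12 bit internal wordlength
    int32_t e = a.exp - b.exp - 6;
    while (q >= 64) { q >>= 1; ++e; }       // renormalize to a leading 1
    while (q < 32)  { q <<= 1; --e; }
    return {q, e};
}
```

Exact ratios like 3/1 come out exactly; irrational ratios like 1/3 are off by less than the format's machine epsilon of 2^-5, matching the error bound in Table 2.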

The use of purely behavioral code synthesizable by the Synopsys Behavioral Compiler requires some additional work. Briefly, the Behavioral Compiler analyzes the data dependencies and the required operator usage, schedules the design, and builds a controller. The type of the automatically created finite state machine for the controller may be specified; a binary encoding is used in this case. All operators are implemented as combinational nets for easy timing and scheduling and are handled as dedicated multicycle (delayed) blocks where necessary. Overloading the operators (+, -, *, /) allows operator inference in VHDL and a straightforward coding of the algorithm. In addition, a RAM module of the target library was created manually and is handled by wrappers in the behavioral code in order to allow indexed access to the lowpass values via an array data type. Except for the RAM block, the design is coded completely independently of a target library, because no specific cores of the FPGA technology are instantiated. Thus no code modifications are needed when the target library changes.
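The payoff of operator overloading can be illustrated in the C++ prototype as well: the loop kernel is written once against a generic numeric type, and swapping the type switches between the floating point reference and a bit-true simulation. QuantDouble below is a toy stand-in, not the authors' class.

```cpp
#include <cassert>
#include <cmath>

// Toy quantized type: rounds every constructed value to 8 fraction bits.
struct QuantDouble {
    double v;
    QuantDouble(double x = 0.0) : v(std::round(x * 256.0) / 256.0) {}
};
QuantDouble operator*(QuantDouble a, QuantDouble b) { return {a.v * b.v}; }
QuantDouble operator+(QuantDouble a, QuantDouble b) { return {a.v + b.v}; }
QuantDouble operator/(QuantDouble a, QuantDouble b) { return {a.v / b.v}; }

// The adaptation-loop kernel, written once for any numeric type T.
// Instantiated with double it is the reference model; instantiated with a
// quantized type it simulates the limited hardware arithmetic.
template <typename T>
T loopKernel(T x, T& state, T c1, T c2) {
    T q = x / state;                 // division in the loop
    state = c1 * q + c2 * state;     // 1st order lowpass feedback
    return q;
}
```

Comparing the two instantiations on identical inputs quantifies the arithmetic error directly, which is the same idea the overloaded VHDL operators exploit on the synthesis side.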

4. Synthesis and simulation results

A prototype of the core design of chip 1 (input interface, gammatone filter bank, halfwave rectification, lowpass filter, and output interface) was implemented on a Xilinx XC4062XL-2 device. A completely mapped FPGA cell netlist is transferred to the Xilinx place&route tools. With the temporary values stored in an external RAM, 2186 logic cells are allocated and the FPGA utilization is about 40% (Table 3). The timing constraints imposed by the sampling rate of the whole system are met, even though the RAM access limits the clock to 32 MHz.

interfaces:                                273 logic cells = 5% LC usage
kernel:                                    1913 logic cells = 35% LC usage
memory:                                    external RAM
max clock frequency (external RAM access): 32 MHz

Table 3. Allocated resources of a Xilinx XC4062XL-2 device for the chip1 design.

After compiling and mapping the chip2 design to the FPGA look-up-table cell level (not mapped to FPGA gates), an EDIF netlist is transferred to the vendor-specific place&route tool. Here, the design is mapped to physical cells and connected. Table 4 presents the allocated hardware resources and timing analysis results when targeting an Altera Flex10K100A-1 device.

interfaces:                                             195 logic cells = 5% LC usage
kernel, scaling unit and lowpass:                       2983 logic cells = 59% LC usage
memory (in Flex10K EAB blocks):                         3600 bits = 14% EAB usage
max clock frequency (kernel, timing constraints violation): 24 MHz
small float divider (6 bit mantissa, 6 bit exponent):   94 logic cells, 205 ns delay
fix point divider (34 bit):                             1186 logic cells, 1527 ns delay

Table 4. Allocated resources of an Altera Flex10K100A-1 device for the chip2 design.

The state vector of the controller has eight bits storing 142 states. Timing analysis shows that the most critical path lies in this controller, reducing the maximum clock frequency. Since 50 MHz could not be reached for a common clock, one of the two FPGA clock networks drives the kernel with 24 MHz while the other is used for the interface parts. Because the design uses very few I/O pins, pin locking causes no routing problems. Simulation in the testbench was performed extensively at the prototype level (C++) with large sample data streams; the enormous simulation times at the VHDL logic level allow only single values or short data streams to be evaluated. The following results for versions of the chip2 arithmetic were obtained using the perception model as a testbench (Fig. 6). Diagram (a) shows that the model works correctly and the objective speech quality measure is well correlated with the subjective MOS (indicated by the linear correlation coefficient r). Nearly no losses due to fixed point quantization errors can be found in diagram (c), where the resolution is 4 integer and 30 fraction bits. In (d), severe losses in the data correlation appear after reducing the wordlength to 4 integer and 24 fraction bits. The small floating point implementation works well with an operand width of 6 mantissa bits for division, 14 mantissa bits for all other operations, and 6 exponent bits throughout. Real time experiments become possible with the completion of the demonstrator board; installed on the DSP card, it provides a powerful signal processing system with a reconfigurable coprocessor.

[Figure 6: four scatter plots of subjective MOS (1-4.5) versus objective measure q (0.75-1).]

Figure 6. Results of a complete objective speech quality measurement with the ETSI half-rate selection test data [4][5]: (a) IEEE single precision floating point, r = 0.935; (b) small floating point (add/sub/mul: 14 bit mantissa, 6 bit exponent; div: 6 bit mantissa, 6 bit exponent), r = 0.928; (c) fix point, 4 integer / 30 fraction bits, r = 0.927; (d) fix point, 4 integer / 24 fraction bits, r = 0.63.

5. Conclusion
In this paper we presented our work on the digital VLSI implementation of a speech perception model. The hardware design of the algorithm was derived from a recoded version of the model in C/C++ using special classes for fix point and small floating point quantization. An application of the model (speech quality measurement) was used to determine optimized wordlengths for the dedicated hardware. The development of the perception model as an FPGA/ASIC for a target system, e.g. a PC card, provides efficient co-processing power and allows real time implementations of complex auditory-based speech processing algorithms.

References

[1] Dau, T., Püschel, D. and Kohlrausch, A.: A quantitative model of the effective signal processing in the auditory system I. Journal of the Acoustical Society of America (JASA) 99 (6): 3631-3633, 1996.
[2] Tchorz, J., Wesselkamp, M. and Kollmeier, B.: Gehörgerechte Merkmalsextraktion zur robusten Spracherkennung in Störgeräuschen. Fortschritte der Akustik - DAGA 96: 532-533, DEGA, Oldenburg, Germany, 1996.
[3] Hansen, M. and Kollmeier, B.: Using a quantitative psychoacoustical signal representation for objective speech quality measurement. In: Proc. ICASSP-97, Intl. Conf. on Acoustics, Speech and Signal Proc.: 1387, Munich, Germany, 1997.
[4] Hansen, M.: Assessment and prediction of speech transmission quality with an auditory processing model. Dissertation, Oldenburg, Germany, 1998.
[5] ETSI, TM/TM5/TCH-HS: Selection Test Phase II: Listening test results with German speech samples. Technical Report 92/35, FI/DBP-Telekom, Experiment 1, IM4, 1992.
[6] Vary, P., Heute, U., Hess, W.: Digitale Sprachsignalverarbeitung. Teubner, Stuttgart, Germany, 1998.
[7] Goldberg, D.: What Every Computer Scientist Should Know About Floating-Point Arithmetic. ACM Computing Surveys, March 1991.
[8] Wüst, H., Kasper, K., Reininger, H.: Hybrid Number Representation for the FPGA-Realization of a Versatile Neuro-Processor. Proc. EUROMICRO 98: 694-701, Västerås, Sweden, 1998.
[9] Shirazi, N., Walters, A., Athanas, P.: Quantitative Analysis of Floating Point Arithmetic on FPGA Based Custom Computing Machines. Technical Report, Virginia Polytechnic Institute and State University, Blacksburg, Virginia, 1995.
[10] Hennessy, J. L., Patterson, D. A.: Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, San Francisco, California, 1996.
