
ABSTRACT



Speech is a very basic way for humans to convey information to one another. With a bandwidth of only 4 kHz, speech can convey information with the emotion of a human voice. People want to be able to hear someone's voice from anywhere in the world as if the person were in the same room. As a result, greater emphasis is being placed on the design of new and efficient speech coders for voice communication and transmission. Today, applications of speech coding and compression are numerous.

This paper looks at a new technique for analyzing and compressing speech signals using
wavelets. Any signal can be represented by a set of scaled and translated versions of a basic function called the mother wavelet. Taking the wavelet transform of the original signal yields a set of wavelet coefficients at different scales and positions. Speech is a non-stationary random process due to the time-varying nature of the human
speech production system. Non-stationary signals are characterized by numerous transitory drifts,
trends and abrupt changes. The localization feature of wavelets, along with its time-frequency
resolution properties makes them well suited for coding speech signals.

The coefficients represent the signal in the wavelet domain, and all data operations can be performed using just the corresponding wavelet coefficients. In designing a wavelet-based speech coder, the major issues explored in this paper are:
i. Choosing optimal wavelets for speech,
ii. Decomposition level in wavelet transforms,
iii. Threshold criteria for the truncation of coefficients,
iv. Efficiently representing zero valued coefficients and
v. Quantizing and digitally encoding the coefficients.
The performance of the wavelet compression scheme in coding speech signals and the quality of the reconstructed signals are also evaluated.

INTRODUCTION


Human speech in its pristine form is an acoustic signal. For the purposes of communication and storage, it must be converted into an electrical signal. This is accomplished with the help of instruments called 'transducers'.
This electrical representation of speech has certain properties.
1. It is a one-dimensional signal, with time as its independent variable.
2. It is random in nature.
3. It is non-stationary, i.e. the frequency spectrum is not constant in time.
4. Although human beings have an audible frequency range of 20 Hz to 20 kHz, human speech has significant frequency components only up to 4 kHz, a property that is exploited in the compression of speech.
Digital Representation of Speech
With the advent of digital computing machines, it was propounded to exploit the powers of the
same for processing of speech signals. This required a digital representation of speech. To
achieve this, the analog signal is sampled at some frequency and then quantized at discrete
levels. Thus, the parameters of digital speech are:
1. Sampling rate
2. Bits per sample
3. Number of channels.
The sound files can be stored and played on digital computers. Various formats have been proposed by different manufacturers, for example '.wav' and '.au', to name a few. In this thesis, the '.wav' format is used extensively owing to the convenience of recording it with the 'Sound Recorder' software shipped with the Windows OS.
The Need for Compression
In recent years, large-scale information transfer by remote computing and the development of massive storage and retrieval systems have witnessed tremendous growth. To cope with the growth in the size of databases, additional storage devices must be installed, and modems and multiplexers have to be continuously upgraded to permit large amounts of data transfer between computers and remote terminals. This increases both cost and equipment. One solution to these problems is compression, whereby the database and the transmission sequence can be encoded efficiently.
Methods of Speech Coding
There are various methods of coding the speech signal.
THE FOURIER TRANSFORM

This chapter develops the need and motivation for studying the wavelet transform. Historically, the Fourier Transform has been the most widely used tool for signal processing. As signal processing grew to encompass newer classes of signals, the Fourier Transform was found unable to satisfy the growing need for processing them. Hence, this chapter begins with a review of Fourier methods. Detailed explanation is avoided to rid the discussion of insignificant details. A simple case is presented in which the shortcomings of Fourier methods are expounded. The next chapter concerns wavelet transforms and shows how the drawbacks of the Fourier Transform are eliminated.
Drawbacks of the Fourier Transform



A major drawback of Fourier analysis is that in transforming to the frequency domain, the time-domain information is lost. The most important difference between the two kinds of transforms is that individual wavelet functions are localized in space. In contrast, Fourier sine and cosine functions are non-local and are active for all time t.
THE WAVELET TRANSFORM
This chapter provides a motivation towards the study of wavelets as a tool for signal processing.
The drawbacks inherent in the Fourier methods are overcome with wavelets. This fact is
demonstrated here. It must be reiterated that the discussion in this chapter is by no means
comprehensive and exhaustive. The concepts of time-frequency resolution have been avoided for
the sake of simplicity. Instead, the development endeavors to compare the Wavelet methods with
the Fourier methods as the reader is expected to be well conversant with the latter.
What is a Wavelet?
Consider a real- or complex-valued continuous-time function ψ(t) with the following properties:

1. The function integrates to zero:

   ∫ ψ(t) dt = 0

2. It is square integrable or, equivalently, has finite energy:

   ∫ |ψ(t)|² dt < ∞

(both integrals taken over −∞ < t < ∞).
A function is called a mother wavelet if it satisfies these two properties. There are infinitely many functions that satisfy these properties and thus qualify as mother wavelets. The simplest of them is the 'Haar wavelet'. Other wavelets include the Mexican hat and the Morlet wavelet. Apart from these, there are various families of wavelets, such as the Daubechies, Symlet and Coiflet families. In this thesis, the main stress is on the Daubechies family, which comprises the db1 to db10 wavelets. They are shown in the following figure.
Figure: Some wavelet functions.
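The two defining properties above are easy to check numerically for the Haar wavelet. The following Python sketch (Python and the Riemann-sum discretization are our own illustration, not the thesis's code) approximates both integrals:

```python
# Numerical check that the Haar wavelet satisfies the two mother-wavelet
# properties: zero mean and finite energy (here the energy is exactly 1).

def haar(t):
    """Haar mother wavelet: +1 on [0, 0.5), -1 on [0.5, 1), 0 elsewhere."""
    if 0.0 <= t < 0.5:
        return 1.0
    if 0.5 <= t < 1.0:
        return -1.0
    return 0.0

# Riemann-sum approximations of the two defining integrals over [0, 1),
# outside which the Haar wavelet vanishes.
n = 100000
dt = 1.0 / n
mean = sum(haar(k * dt) for k in range(n)) * dt         # ~ 0
energy = sum(haar(k * dt) ** 2 for k in range(n)) * dt  # ~ 1 (finite)

print(round(mean, 6), round(energy, 6))  # 0.0 1.0
```

The compact support of the Haar wavelet makes the check exact up to floating-point rounding; for wavelets like the Morlet, the integration interval would have to be widened.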
Wavelets versus Sine Waves
Consider the following figure which juxtaposes a sinusoid and a wavelet

Figure: Comparing a sine wave and a wavelet.


As has already been pointed out, wavelet is a waveform of effectively limited duration that has
an average value of zero. Compare wavelets with sine waves, which are the basis of Fourier
analysis. Sinusoids do not have limited duration -- they extend from minus to plus infinity. And
where sinusoids are smooth and predictable, wavelets tend to be irregular and asymmetric.
Fourier analysis consists of breaking up a signal into sine waves of various frequencies (fig 2.1). Similarly, wavelet analysis is the breaking up of a signal into shifted and scaled versions of the original (or mother) wavelet. Compare the following figure with fig 2.1.

Figure: Decomposition of a signal into wavelets.


The above diagram suggests the existence of a synthesis equation to represent the original signal
as a linear combination of wavelets which are the basis function for wavelet analysis (recollect
that in Fourier analysis, the basis functions are sines and cosines). This is indeed the case. The
wavelets in the synthesis equation are multiplied by scalars. To obtain these scalars, we need an
analysis equation, just as in the Fourier case.












THE DISCRETE WAVELET TRANSFORM
Calculating wavelet coefficients at every possible scale (for continuous WT) is a fair
amount of work, and it generates an awful lot of data. What if we choose only a subset of scales
and positions at which to make our calculations?
It turns out, rather remarkably, that if we choose scales and positions based on powers of two -- the so-called dyadic scales and positions -- then our analysis will be much more efficient and just as accurate. We obtain such an analysis from the discrete wavelet transform (DWT).
The Discrete Wavelet Transform
The Discrete Wavelet Transform (DWT) involves choosing scales and positions based on powers of two: the so-called dyadic scales and positions. The mother wavelet is rescaled or "dilated" by powers of two and translated by integers. Specifically, a function f(t) ∈ L²(R) (the space of square-integrable functions) can be represented as

f(t) = Σ_k a(J, k) 2^(−J/2) φ(2^(−J) t − k) + Σ_{j≤J} Σ_k d(j, k) 2^(−j/2) ψ(2^(−j) t − k)

The function ψ(t) is known as the mother wavelet, while φ(t) is known as the scaling function. The set of functions { 2^(−J/2) φ(2^(−J) t − k), 2^(−j/2) ψ(2^(−j) t − k) | j ≤ J; j, k, J ∈ Z }, where Z is the set of integers, is an orthonormal basis for L²(R). The numbers a(J, k) are known as the approximation coefficients at scale J, while d(j, k) are known as the detail coefficients at scale j. These approximation and detail coefficients can be expressed as

a(J, k) = 2^(−J/2) ∫ f(t) φ(2^(−J) t − k) dt
d(j, k) = 2^(−j/2) ∫ f(t) ψ(2^(−j) t − k) dt
These two equations give a mathematical relationship for computing the approximation and detail coefficients, but the procedure is seldom adopted in practice. A more practical approach is to use Mallat's Fast Wavelet Transform algorithm. The Mallat algorithm for the discrete wavelet transform (DWT) is, in fact, a classical scheme in the signal processing community, known as a two-channel subband coder using conjugate quadrature filters or quadrature mirror filters (QMF). It is developed in the following sections.
Vanishing Moments

The number of vanishing moments of a wavelet indicates the smoothness of the wavelet function, as well as the flatness of the frequency response of the wavelet filters (the filters used to compute the DWT). Typically, a wavelet with p vanishing moments satisfies

∫ t^m ψ(t) dt = 0,  for m = 0, 1, ..., p − 1.
Wavelets with a high number of vanishing moments lead to a more compact signal representation
and are hence useful in coding applications. However, in general, the length of the filters
increases with the number of vanishing moments and the complexity of computing the DWT
coefficients increases with the size of the wavelet filters.
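The moment condition above has a discrete counterpart on the high-pass filter taps: a wavelet with p vanishing moments has Σ_k k^m g(k) = 0 for m = 0, ..., p−1. This can be illustrated numerically; the sketch below is a Python illustration (not the thesis's code) using the standard db2 Daubechies coefficients.

```python
# Counting vanishing moments from the high-pass filter taps g(k):
# the first p discrete moments sum_k k^m g(k) vanish for m < p.
import math

s3 = math.sqrt(3.0)
norm = 4.0 * math.sqrt(2.0)
# Standard db2 low-pass coefficients.
h = [(1 + s3) / norm, (3 + s3) / norm, (3 - s3) / norm, (1 - s3) / norm]
# High-pass taps via the alternating-flip relation g(k) = (-1)^k h(L-1-k).
g = [((-1) ** k) * h[len(h) - 1 - k] for k in range(len(h))]

def vanishing_moments(g, tol=1e-9):
    """Count how many leading discrete moments of g are (numerically) zero."""
    p = 0
    while p < len(g) and abs(sum((k ** p) * gk for k, gk in enumerate(g))) < tol:
        p += 1
    return p

print(vanishing_moments(g))  # 2 for db2
```

Running the same check on the Haar high-pass taps [1/√2, −1/√2] gives 1, matching the trade-off described above: db2 has more vanishing moments than Haar, but also a longer filter.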

Computing the DWT: The Fast Wavelet Transform

The Discrete Wavelet Transform (DWT) coefficients can be computed using Mallat's Fast Wavelet Transform algorithm. This algorithm, sometimes referred to as the pyramid algorithm, involves filtering the input signal with filters derived from the wavelet function used.

Implementation of the Fast Wavelet Transform


To explain the implementation of the Fast Wavelet Transform algorithm, consider the following equations:

φ(t) = Σ_k c(k) φ(2t − k)
ψ(t) = Σ_k d(k) φ(2t − k)
∫ φ(t) ψ(t) dt = 0

The first equation is known as the twin-scale relation (or the dilation equation) and defines the scaling function. The next equation expresses the wavelet in terms of the scaling function. The third equation is the condition required for the wavelet to be orthogonal to the scaling function and its translates. The coefficients c(k), or {c0, ..., c2N−1}, in the above equations represent the impulse response coefficients of a low-pass filter of length 2N, with a sum of 1 and a norm of 1/√2. The high-pass filter coefficients d(k) are obtained from the low-pass filter using the relationship

d(k) = (−1)^k c(1 − k)

where k varies over the range (1 − (2N − 1)) to 1.
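The high-pass/low-pass relationship above can be made concrete with the Haar filter. This Python sketch is our own illustration; the mapping of the index range onto 0-based list positions is an assumption about the text's convention.

```python
# Build the high-pass filter d(k) = (-1)^k c(1-k) from the Haar low-pass
# filter c = [0.5, 0.5] (length 2N with N = 1), then verify the stated
# sum, norm and orthogonality properties.
import math

c = [0.5, 0.5]                      # Haar low-pass: sum 1, norm 1/sqrt(2)

def highpass_from_lowpass(c):
    """Return d(k) = (-1)^k c(1-k) for k = 1-(2N-1), ..., 1."""
    twoN = len(c)
    ks = range(1 - (twoN - 1), 2)   # 1-(2N-1) .. 1 inclusive
    return [((-1) ** k) * c[1 - k] for k in ks]

d = highpass_from_lowpass(c)
print(d)                                                  # [0.5, -0.5]
print(abs(sum(c) - 1.0) < 1e-12)                          # sum of c is 1
print(abs(math.hypot(*c) - 1 / math.sqrt(2)) < 1e-12)     # norm is 1/sqrt(2)
print(abs(sum(ck * dk for ck, dk in zip(c, d))) < 1e-12)  # c and d orthogonal
```

The alternating signs make the high-pass taps sum to zero, which is exactly the zero-mean property required of a wavelet.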

Starting with a discrete input signal vector s, the first stage of the FWT algorithm decomposes the
signal into two sets of coefficients. These are the approximation coefficients cA1 (low frequency
information) and the detail coefficients cD1 (high frequency information), as shown in the figure
below.
The coefficient vectors are obtained by convolving s with the low-pass filter Lo_D for the approximation and with the high-pass filter Hi_D for the details. This filtering operation is then followed by dyadic decimation, i.e. downsampling by a factor of 2. Mathematically, the two-channel filtering of the discrete signal s is represented by the expressions:

cA1(k) = Σ_m s(m) Lo_D(2k − m)
cD1(k) = Σ_m s(m) Hi_D(2k − m)

These equations implement a convolution plus downsampling by a factor of 2 and give the forward fast wavelet transform. If the length of each filter is equal to 2N and the length of the original signal s is equal to n, then the corresponding lengths of the coefficients cA1 and cD1 are given by the formula:

length(cA1) = length(cD1) = floor((n − 1) / 2) + N

This shows that the total length of the wavelet coefficients is always slightly greater than the length of the original signal, due to the filtering process used.
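One analysis stage -- convolution followed by dyadic decimation -- can be sketched directly. The Python code below uses the Haar decomposition filters and verifies the length formula; it is a plain illustration, not production code (a wavelet library would normally be used).

```python
# One analysis stage of the fast wavelet transform: convolve the signal
# with the decomposition filters, then downsample by 2.
import math

def conv(x, h):
    """Full linear convolution of sequences x and h."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def dwt_step(s, lo_d, hi_d):
    """Return (cA1, cD1): convolve, then keep every other sample."""
    return conv(s, lo_d)[1::2], conv(s, hi_d)[1::2]

r2 = 1.0 / math.sqrt(2.0)
lo_d, hi_d = [r2, r2], [-r2, r2]     # Haar decomposition filters

s = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
cA1, cD1 = dwt_step(s, lo_d, hi_d)

n, N = len(s), len(lo_d) // 2
print(len(cA1), len(cD1))            # 4 4
print(len(cA1) == (n - 1) // 2 + N)  # length formula holds: True
```

For this length-8 signal and length-2 filter, floor((8 − 1)/2) + 1 = 4 coefficients per branch, i.e. 8 samples become 8 coefficients at this level; the slight overhead appears only with longer filters.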

Multilevel Decomposition


The decomposition process can be iterated, with successive approximations being decomposed in turn, so that one signal is broken down into many lower-resolution components. This is called the wavelet decomposition tree. The wavelet decomposition of the signal s analyzed at level j has the structure [cAj, cDj, ..., cD1]. Looking at a signal's wavelet decomposition tree can reveal valuable information. The diagram below shows the wavelet decomposition to level 3 of a sample signal S.

Since the analysis process is iterative, in theory it can be continued indefinitely. In reality, the decomposition can proceed only until the vector consists of a single sample. Normally, however, there is little or no advantage gained in decomposing a signal beyond a certain level. The selection of the optimal decomposition level in the hierarchy depends on the nature of the signal being analyzed or on some other suitable criterion, such as the low-pass filter cut-off.

Wavelet Reconstruction

The original signal can be reconstructed, or synthesized, using the inverse discrete wavelet transform (IDWT). The synthesis starts with the approximation and detail coefficients cAj and cDj and reconstructs cAj-1 by upsampling and filtering with the reconstruction filters. The reconstruction filters are designed to cancel out the aliasing introduced in the wavelet decomposition phase. The reconstruction filters (Lo_R and Hi_R), together with the low- and high-pass decomposition filters, form a system known as quadrature mirror filters (QMF). For a multilevel analysis, the reconstruction process can itself be iterated, producing successive approximations at finer resolutions and finally synthesizing the original signal.
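The cancellation of aliasing by the QMF pair can be demonstrated with a round trip. In this Python sketch (our illustration: Haar only, even-length signals assumed), one analysis step followed by one synthesis step reproduces the input exactly.

```python
# Perfect-reconstruction round trip for the Haar QMF pair: analysis
# splits the signal into cA1/cD1, synthesis (the IDWT step) rebuilds the
# original, showing that the reconstruction filters cancel the aliasing
# introduced by downsampling.
import math

r2 = 1.0 / math.sqrt(2.0)

def haar_analysis(s):
    """One DWT level: pairwise scaled sums (cA1) and differences (cD1)."""
    cA = [r2 * (s[2 * i] + s[2 * i + 1]) for i in range(len(s) // 2)]
    cD = [r2 * (s[2 * i] - s[2 * i + 1]) for i in range(len(s) // 2)]
    return cA, cD

def haar_synthesis(cA, cD):
    """One IDWT level: interleave the two upsampled, filtered branches."""
    s = []
    for a, d in zip(cA, cD):
        s.extend([r2 * (a + d), r2 * (a - d)])
    return s

s = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
cA1, cD1 = haar_analysis(s)
rec = haar_synthesis(cA1, cD1)
print(max(abs(a - b) for a, b in zip(s, rec)) < 1e-12)  # True
```

Iterating haar_analysis on cA1 gives the multilevel decomposition tree described earlier, and iterating haar_synthesis in reverse order rebuilds the signal level by level.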


SPEECH COMPRESSION USING WAVELETS


 
The idea behind signal compression using wavelets is primarily linked to the relative sparseness of the wavelet-domain representation of the signal. Wavelets concentrate speech information (energy and perception) into a few neighbouring coefficients. Therefore, as a result of taking the wavelet transform of a signal, many coefficients will either be zero or have negligible magnitudes.
Another factor comes from psychoacoustic studies. Since our ears are more sensitive to low frequencies than to high frequencies, and our hearing threshold is very high in the high-frequency regions, we used a compression method by which the detail coefficients (corresponding to high-frequency components) of the wavelet transform are thresholded such that the error due to thresholding is inaudible. Since some of the high-frequency components are discarded, we should expect a smoothed output signal, as shown in the following figure:

In summary, compression rests on the observation that the regular signal component can be accurately approximated using a small number of approximation coefficients (at a suitably chosen level) and some of the detail coefficients.
Data compression is then achieved by treating small-valued coefficients as insignificant data and discarding them. The process of compressing a speech signal using wavelets involves a number of stages, each of which is discussed below.

Choice of Wavelet

The choice of the mother-wavelet function used in designing high-quality speech coders is of prime importance. Choosing a wavelet that has compact support in both time and frequency, in addition to a significant number of vanishing moments, is essential for an optimal wavelet speech compressor. The Daubechies D20, D12, D10 and D8 wavelets all concentrate more than 96% of the signal energy in the Level 1 approximation coefficients. Wavelets with more vanishing moments provide better reconstruction quality, as they introduce less distortion into the processed speech and concentrate more signal energy in a few neighboring coefficients.

Wavelet Decomposition Level


Wavelets work by decomposing a signal into different resolutions or frequency bands, and this
task is carried out by choosing the wavelet function and computing the Discrete Wavelet
Transform (DWT). Signal compression is based on the concept that selecting a small number of
approximation coefficients (at a suitably chosen level) and some of the detail coefficients can
accurately represent regular signal components. Choosing a decomposition level for the DWT
usually depends on the type of signal being analyzed or some other suitable criterion such as
entropy. For the processing of speech signals, decomposition up to scale 5 is adequate, with no further advantage gained in processing beyond scale 5.

Truncation of Coefficients (Thresholding)




After calculating the wavelet transform of the speech signal, compression involves truncating wavelet coefficients below a threshold. Most of the speech energy is contained in the few high-valued coefficients; the small-valued coefficients can therefore be truncated, or zeroed, and the remaining coefficients used to reconstruct the signal. This compression scheme provided a segmental signal-to-noise ratio (SEGSNR) of 20 dB while retaining only 10% of the coefficients. Two different approaches are available for calculating thresholds. The first, known as global thresholding, involves taking the wavelet expansion of the signal and keeping the largest-absolute-value coefficients. In this case one can manually set a global threshold, a compression performance, or a relative square-norm recovery performance, so only a single parameter needs to be selected. The second approach, known as by-level thresholding, consists of applying visually determined, level-dependent thresholds to each decomposition level in the wavelet transform.
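Global thresholding reduces, in code, to zeroing every coefficient whose magnitude falls below a single threshold. A minimal Python sketch (the coefficient values and the threshold are arbitrary illustrations, not recommended settings):

```python
# Global thresholding: zero all coefficients with magnitude below thr;
# only the few large coefficients survive.

def global_threshold(coeffs, thr):
    """Return (thresholded coefficients, fraction of nonzeros kept)."""
    kept = [c if abs(c) >= thr else 0.0 for c in coeffs]
    frac = sum(1 for c in kept if c != 0.0) / len(kept)
    return kept, frac

coeffs = [9.1, -0.02, 0.4, 7.5, 0.003, -6.2, 0.1, 0.05]
kept, frac = global_threshold(coeffs, thr=1.0)
print(kept)  # [9.1, 0.0, 0.0, 7.5, 0.0, -6.2, 0.0, 0.0]
print(frac)  # 0.375
```

By-level thresholding would simply apply this same function separately to each coefficient vector cD1, ..., cDj with its own threshold.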

Encoding of Coefficients




Signal compression is achieved by first truncating small-valued coefficients and then efficiently encoding them. One approach is to encode each run of consecutive zero-valued coefficients with two bytes: one byte to indicate a sequence of zeros in the wavelet transform vector, and a second byte giving the number of consecutive zeros. For further data compaction, a suitable bit-encoding format can be used to quantize and transmit the data at low bit rates. A low bit-rate representation can also be achieved by using an entropy coder such as Huffman coding or arithmetic coding.

Voiced and Unvoiced Speech Frames


In speech there are two major types of excitation: voiced and unvoiced. Voiced sounds are produced when air flows between the vocal cords and causes them to vibrate; voiced speech tends to be periodic in nature. Examples of voiced sounds are English vowels, such as the /a/ in 'bay' and the /e/ in 'see'. Unvoiced sounds result from constricting the vocal tract at some point so that turbulence is produced by air flowing past the constriction. Since unvoiced speech is due to turbulence, it is aperiodic and has a noise-like structure. Some examples of unvoiced English sounds are the /s/ in 'so' and the /h/ in 'he'. In general, at least 90% of the speech energy is retained in the first N/2 transform coefficients if the speech is a voiced frame. However, for an unvoiced frame the energy is spread across several frequency bands, and typically the first N/2 coefficients hold less than 40% of the total energy. Because of this, wavelets are inefficient at coding unvoiced speech. Unvoiced speech frames are, however, infrequent. By detecting unvoiced speech frames and directly encoding them (perhaps using entropy coding), no unvoiced data is lost and the quality of the compressed speech remains transparent.
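The 90%-energy figure above suggests a simple voiced/unvoiced test: compute the fraction of the energy held by the first N/2 transform coefficients. A Python sketch (the 0.9 cutoff follows the text; the coefficient vectors are synthetic illustrations):

```python
# Voiced/unvoiced decision from the energy distribution of a frame's
# transform coefficients: voiced frames concentrate energy in the first
# half, unvoiced frames spread it across the bands.

def energy_fraction_first_half(coeffs):
    """Fraction of total energy held by the first N/2 coefficients."""
    half = len(coeffs) // 2
    total = sum(c * c for c in coeffs)
    return sum(c * c for c in coeffs[:half]) / total

def is_voiced(coeffs, cutoff=0.9):
    return energy_fraction_first_half(coeffs) >= cutoff

voiced_like = [8.0, 6.0, 4.0, 3.0, 0.5, 0.2, 0.1, 0.1]    # energy up front
unvoiced_like = [1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 0.8, 1.0]  # energy spread out
print(is_voiced(voiced_like))    # True
print(is_voiced(unvoiced_like))  # False
```

Frames flagged as unvoiced would then bypass the thresholding stage and be encoded directly, as suggested above.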

Performance Measures


A number of quantitative parameters can be used to evaluate the performance of the wavelet
based speech coder, in terms of both reconstructed signal quality after decoding and compression
scores. The following parameters are compared:

1. Signal-to-Noise Ratio (SNR)
2. Peak Signal-to-Noise Ratio (PSNR)
3. Normalized Root Mean Square Error (NRMSE)
4. Retained Signal Energy
5. Compression Ratio
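These measures can be written out concretely. The Python sketch below uses the usual textbook definitions (SNR and PSNR in dB, NRMSE normalized by the signal's deviation from its mean, retained energy as an energy ratio); the original/reconstructed vectors are synthetic illustrations, and the thesis's exact formulas may differ slightly.

```python
# Quality measures for a reconstructed signal y against an original x.
import math

x = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]   # original frame (toy data)
y = [4.1, 5.9, 10.2, 11.8, 8.1, 6.0, 4.9, 5.1]   # reconstructed frame

err_power = sum((a - b) ** 2 for a, b in zip(x, y))
sig_power = sum(a * a for a in x)

snr = 10.0 * math.log10(sig_power / err_power)              # dB
peak = max(abs(a) for a in x)
psnr = 10.0 * math.log10(len(x) * peak ** 2 / err_power)    # dB
mean_x = sum(x) / len(x)
nrmse = math.sqrt(err_power / sum((a - mean_x) ** 2 for a in x))
retained = 100.0 * sum(b * b for b in y) / sig_power        # percent

print(round(snr, 2), round(psnr, 2), round(nrmse, 4), round(retained, 2))
```

The compression ratio, the fifth measure, is simply the size of the original representation divided by the size of the encoded one and needs no signal arithmetic.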

PERFORMANCE OF THE WAVELET COMPRESSION SCHEME

Male and female speech signals were decomposed at scale 3, and level-dependent thresholds were applied using the Birge-Massart strategy. Since the speech files were of short duration, the entire signal was decomposed at once, without framing. A summary of the performance for the different wavelets used is given below.


CONCLUSION


Speech coding is currently an active research topic in the areas of Very Large Scale Integration (VLSI) circuit technologies and Digital Signal Processing (DSP). The Discrete Wavelet Transform performs very well in the compression of recorded speech signals. For real-time speech processing, however, its performance is not as good; therefore, for real-time speech coding it is recommended to use a wavelet with a small number of vanishing moments at level 5 decomposition or less. The wavelet-based compression scheme designed here reaches a signal-to-noise ratio of 17.45 dB at a compression ratio of 3.88 using the Daubechies 10 wavelet. The performance of the wavelet scheme, in terms of compression scores and signal quality, is comparable with other good techniques such as code-excited linear predictive (CELP) coding of speech, with much less computational burden. In addition, with wavelets the compression ratio can be easily varied, while most other compression techniques have fixed compression ratios.


FUTURE SCOPE

This thesis was, admittedly, limited in scope. The main objective was to develop an appreciation for wavelet transforms, discuss their application in the compression of human speech signals and study the effect of a few parameters on the quality of compression. The parameters studied are: sampling frequency, type of wavelet, threshold, and male/female voice.
A few topics were deliberately excluded owing to the limitations of our undergraduate study. Some of them are stated below; any or all can be explored further with a view to achieving better performance. The analysis that we undertook covers only orthogonal, compactly supported wavelets. The reader may find it interesting to study the effect of other wavelets on compression.
Secondly, the sound files that we tested were of limited duration, around 5 seconds. Although the programs will run for larger files (with correspondingly longer computation times), a better approach for such files is to process them in frames of finite length. This procedure is more common in real-time compression of sound files and was not discussed here. Encoding was performed using only run-length encoding; the effect of other encoding schemes on the compression factor has not been studied. In fact, higher compression ratios are expected with coding techniques like Huffman coding. This thesis considered only wavelet analysis, wherein only the approximation coefficients are split. There exists another analysis, called wavelet packet analysis, which splits the detail coefficients too; this was not explored in this thesis. Last but not least, the effect of the wavelet transform on voiced and unvoiced speech is different, and thus the compression ratios ought to be different.
