This is to certify that the work titled SPEECH ANALYSIS, submitted by ADITYA KHER in
partial fulfillment of the requirements for the award of the degree of B.Tech of Jaypee Institute
of Information Technology, Noida, has been carried out under my supervision. This work has not
been submitted, partially or wholly, to any other University or Institute for the award of this or
any other degree or diploma.
Signature of Supervisor: ____________
Name of Supervisor: ____________
Designation: ____________
Date: ____________
ACKNOWLEDGEMENT
Words are inadequate to express my overwhelming sense of gratitude and humble regards to my
guide and supervisor, Mr. Manish Kumar, Department of Electronics and Communication
Engineering, for his constant motivation, support, expert guidance, supervision, and
constructive suggestions during the preparation of this progress report on Speech Recognition.
I also thank all the teaching and non-teaching staff for their kind cooperation with the students.
This report would have been impossible without the perpetual moral support of my family
members and friends. I would like to thank them all.
INTRODUCTION
First of all, why do we need a transform, and what is a transform anyway?
Mathematical transformations are applied to signals to obtain further information that is not
readily available in the raw signal. In the following discussion, we will refer to a time-domain
signal as a raw signal, and to a signal that has been "transformed" by any of the available
mathematical transformations as a processed signal.
There are a number of transformations that can be applied, among which the Fourier transforms
are probably by far the most popular.
Most signals in practice are TIME-DOMAIN signals in their raw format. That is, whatever the
signal is measuring is a function of time. In other words, when we plot the signal, one of the
axes is time (the independent variable), and the other (the dependent variable) is usually the
amplitude. When we plot time-domain signals, we obtain a time-amplitude representation of the
signal. This representation is not always the best one for most signal processing applications.
In many cases, the most distinguishing information is hidden in the frequency content of the
signal. The frequency SPECTRUM of a signal is basically the set of frequency components
(spectral components) of that signal. The frequency spectrum of a signal shows what
frequencies exist in the signal.
Intuitively, we all know that frequency has something to do with the rate of change of
something. If something (a mathematical or physical variable, to use the technically correct
term) changes rapidly, we say that it is of high frequency, whereas if this variable does not
change rapidly, i.e., it changes smoothly, we say that it is of low frequency. If this variable does
not change at all, then we say it has zero frequency, or no frequency. For example, the
publication frequency of a daily newspaper is higher than that of a monthly magazine (it is
published more frequently).
Frequency is measured in cycles per second, or with its more common name, in "Hertz". For
example, the electric power we use in our daily life is 50 Hz in most of the world (60 Hz in the
US). This means that if you plot the electric current, it will be a sine wave passing through the
same point 50 times in 1 second. Now, look at the following figures. The first one is a sine
wave at 3 Hz, the second one at 10 Hz, and the third one at 50 Hz. Compare them.
So how do we measure frequency, or how do we find the frequency content of a signal? The
answer is FOURIER TRANSFORM (FT).
The frequency axis starts from zero, and goes up to infinity. For every frequency, we have an
amplitude value. For example, if we take the FT of the electric current that we use in our houses,
we will have one spike at 50 Hz, and nothing elsewhere, since that signal has only 50 Hz
frequency component. No other signal, however, has a FT which is this simple. For most
practical purposes, signals contain more than one frequency component. The following shows
the FT of the 50 Hz signal:
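To make the idea concrete, here is a small numerical sketch (in Python rather than the MATLAB used later in this report, so it needs no toolbox; a naive O(N^2) DFT stands in for the FT):

```python
import cmath
import math

# One second of a 50 Hz sine wave, sampled at 200 samples/s.
fs = 200
x = [math.sin(2 * math.pi * 50 * n / fs) for n in range(fs)]

def dft_mag(signal):
    """Magnitudes of a naive discrete Fourier transform (O(N^2), demo only)."""
    n = len(signal)
    return [abs(sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n)]

mag = dft_mag(x)
# With exactly 1 s of data, bin k corresponds to k Hz: the spectrum is a
# single spike at 50 Hz (mirrored at 150 Hz, since the spectrum of a
# real-valued signal is symmetric and the second half adds no information).
peak = max(range(fs // 2), key=lambda k: mag[k])
print(peak)  # 50
```

Running this confirms both claims in the text at once: the single spike at 50 Hz, and the mirror-image symmetry of the spectrum of a real-valued signal.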
One word of caution is in order at this point. Note that two plots are given in Figure 1.4. The
bottom one plots only the first half of the top one. Due to reasons that are not crucial to know at
this time, the frequency spectrum of a real valued signal is always symmetric. The top plot
illustrates this point. However, since the symmetric part is exactly a mirror image of the first
part, it provides no additional information, and therefore, this symmetric second part is usually
not shown. In most of the following figures corresponding to FT, I will only show the first half
of this symmetric spectrum.
Let's give an example from biological signals. Suppose we are looking at an ECG signal
(ElectroCardioGraphy, graphical recording of heart's electrical activity). The typical shape of a
healthy ECG signal is well known to cardiologists. Any significant deviation from that shape is
usually considered to be a symptom of a pathological condition.
This pathological condition, however, may not always be quite obvious in the original
time-domain signal. Cardiologists usually analyze ECG signals recorded in the time domain on
strip-charts; more recently, computerized ECG recording and analysis systems have also come
into use.
This, of course, is only one simple example of why frequency content might be useful.
Today, Fourier transforms are used in many different areas, including all branches of
engineering. Although the FT is probably the most popular transform in use (especially in
electrical engineering), it is not the only one. There are many other transforms that are used
quite often by engineers and mathematicians. The Hilbert transform, the short-time Fourier
transform (more about this later), Wigner distributions, the Radon transform, and of course our
featured transformation, the wavelet transform, constitute only a small portion of a huge list of
transforms at engineers' and mathematicians' disposal. Every transformation technique has its
own area of application, with advantages and disadvantages, and the wavelet transform (WT)
is no exception.
For a better understanding of the need for the WT, let's look at the FT more closely. The FT
(as well as the WT) is a reversible transform; that is, it allows us to go back and forth between
the raw and processed (transformed) signals. However, only one of them is available at any given
time. That is, no frequency information is available in the time-domain signal, and no time
information is available in the Fourier-transformed signal. The natural question that comes to
mind is: is it necessary to have both the time and the frequency information at the same time?
As we will see soon, the answer depends on the particular application and the nature of
the signal in hand. Recall that the FT gives the frequency information of the signal, which means
that it tells us how much of each frequency exists in the signal, but it does not tell us when in
time these frequency components exist. This information is not required when the signal is
so-called stationary.
Let's take a closer look at this stationarity concept, since it is of paramount importance in
signal analysis. Signals whose frequency content does not change in time are called stationary
signals. In this case, one does not need to know at what times the frequency components exist,
since all frequency components exist at all times!
For example, the signal x(t) = cos(2*pi*10t) + cos(2*pi*25t) + cos(2*pi*50t) + cos(2*pi*100t)
is a stationary signal, because it has frequencies of 10, 25, 50, and 100 Hz at any given
time instant. This signal is plotted below:
The top plot in Figure 1.5 is the (half of the symmetric) frequency spectrum of the signal
in Figure 1.5. The bottom plot is the zoomed version of the top plot, showing only the range of
frequencies that are of interest to us. Note the four spectral components corresponding to the
frequencies 10, 25, 50 and 100 Hz.
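This four-spike spectrum is easy to reproduce numerically (a Python sketch with a naive DFT; the sampling rate of 400 Hz is chosen only to keep the demo fast while staying above the Nyquist rate for the 100 Hz component):

```python
import cmath
import math

fs = 400  # 1 s of data at 400 samples/s (Nyquist 200 Hz)
freqs = [10, 25, 50, 100]
# The stationary signal: all four frequencies present at every instant.
x = [sum(math.cos(2 * math.pi * f * n / fs) for f in freqs) for n in range(fs)]

def dft_mag(signal):
    """Naive DFT magnitudes (O(N^2); fine for a short demo)."""
    n = len(signal)
    return [abs(sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n)]

mag = dft_mag(x)
# With 1 s of data each component occupies exactly one bin, so the
# spectrum is four clean spikes at 10, 25, 50, and 100 Hz.
peaks = sorted(k for k in range(fs // 2) if mag[k] > fs / 4)
print(peaks)  # [10, 25, 50, 100]
```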
Contrary to the signal in Figure 1.6, the following signal is not stationary. Figure 1.8 plots a
signal whose frequency constantly changes in time. This signal is known as the "chirp" signal.
This is a non-stationary signal.
Let's look at another example. Figure 1.9 plots a signal with four different frequency components
at four different time intervals, hence a non-stationary signal. The interval 0 to 300 ms has a 100
Hz sinusoid, the interval 300 to 600 ms has a 50 Hz sinusoid, the interval 600 to 800 ms has a 25
Hz sinusoid, and finally the interval 800 to 1000 ms has a 10 Hz sinusoid.
Other than those ripples, everything seems to be right. The FT has four peaks,
corresponding to four frequencies with reasonable amplitudes... Right?
WRONG!
For the first signal, plotted in Figure 1.5, consider the following question:
Now, consider the same question for the non-stationary signal in Figure 1.7 or in Figure
1.8.
For the signal in Figure 1.8, we know that in the first interval we have the highest frequency
component, and in the last interval we have the lowest frequency component. For the signal in
Figure 1.7, the frequency components change continuously. Therefore, for these signals the
frequency components do not appear at all times!
Now, compare Figures 1.6 and 1.9. The similarity between these two spectra should
be apparent. Both of them show four spectral components at exactly the same frequencies, i.e., at
10, 25, 50, and 100 Hz. Other than the ripples and the difference in amplitude (which can always
be normalized), the two spectra are almost identical, although the corresponding time-domain
signals are not even close to each other. Both signals involve the same frequency components,
but the first one has these frequencies at all times, while the second one has them in different
intervals. So how come the spectra of two entirely different signals look so much alike? Recall
that the FT gives the spectral content of the signal, but it gives no information regarding where
in time those spectral components appear. Therefore, the FT is not a suitable technique for
non-stationary signals, with one exception:
The FT can be used for non-stationary signals if we are only interested in what spectral
components exist in the signal, but not in where they occur. However, if this information is
needed, i.e., if we want to know what spectral components occur at what time (interval), then
the Fourier transform is not the right transform to use.
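The comparison can be reproduced numerically (a Python sketch with a naive DFT; the intervals follow the description of Figure 1.9):

```python
import cmath
import math

fs = 400  # 1 s of data: 400 samples
# One frequency per interval: 100 Hz in 0-300 ms, 50 Hz in 300-600 ms,
# 25 Hz in 600-800 ms, and 10 Hz in 800-1000 ms.
def sample(n):
    t = n / fs
    if t < 0.3:
        f = 100
    elif t < 0.6:
        f = 50
    elif t < 0.8:
        f = 25
    else:
        f = 10
    return math.cos(2 * math.pi * f * t)

x = [sample(n) for n in range(fs)]

def dft_mag(signal):
    """Naive DFT magnitudes (O(N^2); fine for a short demo)."""
    n = len(signal)
    return [abs(sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n)]

mag = dft_mag(x)
# The same four frequencies stand out (plus ripples from the abrupt
# transitions), even though each exists only in its own interval:
# the FT alone cannot tell us *when* each component occurred.
for f in (10, 25, 50, 100):
    print(f, round(mag[f], 1))
```

The printed magnitudes are large at exactly the same four bins as for the stationary signal, which is the whole point: the magnitude spectrum hides the time information.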
For practical purposes it is difficult to make this separation, since there are many practical
stationary signals as well as non-stationary ones. Almost all biological signals, for example,
are non-stationary. Some of the most famous ones are the ECG (electrical activity of the
heart, electrocardiogram), EEG (electrical activity of the brain, electroencephalogram), and
EMG (electrical activity of the muscles, electromyogram).
Once again please note that, the FT gives what frequency components (spectral
components) exist in the signal. Nothing more, nothing less.
When the time localization of the spectral components is needed, a transform giving the
TIME-FREQUENCY REPRESENTATION of the signal is required.
Oftentimes, a particular spectral component occurring at a particular instant can be of special
interest. In these cases it may be very beneficial to know the time intervals in which these
particular spectral components occur. For example, in EEGs, the latency of an event-related
potential is of particular interest (an event-related potential is the response of the brain to a
specific stimulus, like a flash of light; the latency of this response is the amount of time elapsed
between the onset of the stimulus and the response).
To make a long story short, we pass the time-domain signal through various highpass
and lowpass filters, which filter out either the high-frequency or the low-frequency portion of
the signal. This procedure is repeated, each time removing some portion of the signal
corresponding to some band of frequencies.
Here is how this works: suppose we have a signal with frequencies up to 1000 Hz.
In the first stage we split the signal into two parts by passing it through a highpass and a
lowpass filter (the filters should satisfy certain conditions, the so-called admissibility condition),
which results in two different versions of the same signal: the portion of the signal
corresponding to 0-500 Hz (the lowpass portion), and to 500-1000 Hz (the highpass portion).
Then, we take either portion (usually low pass portion) or both, and do the same thing
again. This operation is called decomposition .
Assuming that we have taken the lowpass portion, we now have 3 sets of data, each
corresponding to the same signal at frequencies 0-250 Hz, 250-500 Hz, and 500-1000 Hz.
Then we take the lowpass portion again and pass it through the low and high pass filters; we
now have 4 sets of signals corresponding to 0-125 Hz, 125-250 Hz, 250-500 Hz, and 500-1000
Hz. We continue like this until we have decomposed the signal down to a pre-defined level.
We then have a bunch of signals which actually represent the same signal, but each
corresponding to a different frequency band. We know which signal corresponds to which
frequency band, and if we put all of them together and plot them on a 3-D graph, we will have
time on one axis, frequency on the second, and amplitude on the third axis. This will show us
which frequencies exist at which times. (There is an issue, called the "uncertainty principle",
which states that we cannot exactly know what frequency exists at what time instant; we can
only know what frequency bands exist in what time intervals.)
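The repeated splitting described above can be sketched with the simplest possible filter pair, the Haar averaging/differencing filters (a stand-in chosen to keep the code dependency-free; practical filter banks use longer filters satisfying the admissibility condition):

```python
import math

def haar_split(x):
    """One lowpass/highpass split with the two-tap Haar filters,
    keeping one output in two (downsampling). Even-length input assumed."""
    low = [(x[2 * i] + x[2 * i + 1]) / math.sqrt(2) for i in range(len(x) // 2)]
    high = [(x[2 * i] - x[2 * i + 1]) / math.sqrt(2) for i in range(len(x) // 2)]
    return low, high

def decompose(x, levels):
    """Split the lowpass part repeatedly, collecting the highpass bands.
    Three levels on a 0-1000 Hz signal give bands 500-1000 Hz, 250-500 Hz,
    125-250 Hz, and a final 0-125 Hz lowpass residue."""
    bands = []
    for _ in range(levels):
        x, high = haar_split(x)
        bands.append(high)
    bands.append(x)  # final lowpass residue
    return bands

signal = [math.sin(2 * math.pi * 5 * n / 1024) for n in range(1024)]
bands = decompose(signal, 3)
print([len(b) for b in bands])  # [512, 256, 128, 128]
```

Because the Haar pair is orthonormal, the total energy of the bands equals that of the original signal: no information is lost even though each stage keeps only half the samples of the previous one.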
The uncertainty principle, originally discovered and formulated by Heisenberg, states that the
momentum and the position of a moving particle cannot both be known exactly at the same
time. This applies to our subject as follows: we cannot know the exact time-frequency
representation of a signal, i.e., what spectral components exist at what instants of time; we can
only know what frequency bands exist in what time intervals.
WAVELET APPLICATIONS
Wavelets have scale aspects and time aspects; consequently, every application has scale and
time aspects. To clarify them, we untangle the two aspects somewhat arbitrarily.
For scale aspects, we present one idea around the notion of local regularity. For time aspects,
we present a list of domains. When the decomposition is taken as a whole, the de-noising and
compression processes are the central points.
Scale Aspects:
As a complement to spectral signal analysis, new signal forms appear. These are less regular
signals than the usual ones. The cusp signal, for example, presents a very quick local variation.
Such scale-based analysis has found applications in:
- Biology, for cell membrane recognition, to distinguish normal from pathological membranes
- Finance (which is more surprising), for detecting the properties of quick variation of values
Time Aspects:
Let's switch to time aspects. The main goals include:
- Checking undue noises in craned or dented wheels, and more generally in nondestructive
quality-control processes
- SAR imagery
- Intermittence in physics
Wavelet Decomposition
Many applications use the wavelet decomposition taken as a whole. The common goals concern
the signal or image clearance and simplification, which are parts of de-noising or compression.
We find many published papers in oceanography and earth studies.
One of the most popular successes of the wavelets is the compression of FBI fingerprints.
When trying to classify the applications by domain, it is almost impossible to sum up several
thousand papers written within the last 15 years. Moreover, it is difficult to get information on
real-world industrial applications from companies. They understandably protect their own
information.
Some domains are very productive. Medicine is one of them. We can find studies on
micro-potential extraction in EKGs, on time localization of His bundle electrical heart activity,
and on ECG noise removal. In EEGs, a quick transitory signal is drowned in the usual one.
Wavelets are able to determine if such a quick signal exists, and if so, can localize it. There are
also attempts to enhance mammograms to discriminate tumors from calcifications.
Another prototypical application is a classification of Magnetic Resonance Spectra. The study
concerns the influence of the fat we eat on our body fat. The type of feeding is the basic
information and the study is intended to avoid taking a sample of the body fat. Each Fourier
spectrum is encoded by some of its wavelet coefficients. A few of them are enough to code the
most interesting features of the spectrum. The classification is performed on the coded vectors.
Fourier Analysis:
Signal analysts already have at their disposal an impressive arsenal of tools. Perhaps the most
well known of these is Fourier analysis, which breaks down a signal into constituent sinusoids of
different frequencies. Another way to think of Fourier analysis is as a mathematical technique for
transforming our view of the signal from time-based to frequency-based.
For many signals, Fourier analysis is extremely useful because the signal's frequency content is
of great importance. So why do we need other techniques, like wavelet analysis?
Fourier analysis has a serious drawback. In transforming to the frequency domain, time
information is lost. When looking at a Fourier transform of a signal, it is impossible to tell when
a particular event took place.
If the signal properties do not change much over time -- that is, if it is what is called a stationary
signal -- this drawback isn't very important. However, most interesting signals contain numerous
nonstationary or transitory characteristics: drift, trends, abrupt changes, and beginnings and ends
of events. These characteristics are often the most important part of the signal, and Fourier
analysis is not suited to detecting them.
In an effort to correct this deficiency, Dennis Gabor (1946) adapted the Fourier transform to
analyze only a small section of the signal at a time -- a technique called windowing the signal.
Gabor's adaptation, called the Short-Time Fourier Transform (STFT), maps a signal into a
two-dimensional function of time and frequency.
The STFT represents a sort of compromise between the time- and frequency-based views of a
signal. It provides some information about both when and at what frequencies a signal event
occurs. However, you can only obtain this information with limited precision, and that precision
is determined by the size of the window.
While the STFT compromise between time and frequency information can be useful, the
drawback is that once you choose a particular size for the time window, that window is the same
for all frequencies. Many signals require a more flexible approach -- one where we can vary the
window size to determine more accurately either time or frequency.
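A minimal STFT sketch makes this fixed-window behavior concrete (a Python sketch with a naive DFT per frame; the window and hop sizes are illustrative choices):

```python
import cmath
import math

def stft_mag(x, win, hop):
    """Magnitudes of windowed DFTs: one row per time frame, one column per
    frequency bin. A single window length 'win' is used for all frequencies."""
    frames = []
    for start in range(0, len(x) - win + 1, hop):
        seg = x[start:start + win]
        frames.append([abs(sum(seg[t] * cmath.exp(-2j * math.pi * k * t / win)
                               for t in range(win)))
                       for k in range(win // 2)])
    return frames

# 8 Hz for the first half second, 40 Hz for the second (fs = 160 Hz).
fs = 160
x = [math.cos(2 * math.pi * (8 if n < fs // 2 else 40) * n / fs)
     for n in range(fs)]

frames = stft_mag(x, win=40, hop=40)  # 40-sample windows -> 4 Hz bins
lo_bin = frames[0].index(max(frames[0]))    # first frame peaks at bin 2  (8 Hz)
hi_bin = frames[-1].index(max(frames[-1]))  # last frame peaks at bin 10 (40 Hz)
print(lo_bin * 4, hi_bin * 4)  # 8 40
```

Both events are found, and roughly when they occur -- but always through the same 40-sample window, i.e., with the same 4 Hz resolution at every frequency, which is exactly the rigidity that motivates wavelets.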
Wavelet Analysis:
Wavelet analysis represents the next logical step: a windowing technique with variable-sized
regions. Wavelet analysis allows the use of long time intervals where we want more precise lowfrequency information, and shorter regions where we want high-frequency information.
Here's what this looks like in contrast with the time-based, frequency-based, and STFT views of
a signal:
You may have noticed that wavelet analysis does not use a time-frequency region, but rather a
time-scale region.
A plot of the Fourier coefficients (as provided by the fft command) of this signal shows nothing
particularly interesting: a flat spectrum with two peaks representing a single frequency.
However, a plot of wavelet coefficients clearly shows the exact location in time of the
discontinuity.
Wavelet analysis is capable of revealing aspects of data that other signal analysis techniques
miss: aspects like trends, breakdown points, discontinuities in higher derivatives, and
self-similarity. Furthermore, because it affords a different view of data than that presented by
traditional techniques, wavelet analysis can often compress or de-noise a signal without
appreciable degradation.
Indeed, in their brief history within the signal processing field, wavelets have already proven
themselves to be an indispensable addition to the analyst's collection of tools and continue to
enjoy a burgeoning popularity today.
Now that we know some situations when wavelet analysis is useful, it is worthwhile asking
"What is wavelet analysis?" and even more fundamentally, "What is a wavelet?"
A wavelet is a waveform of effectively limited duration that has an average value of zero.
Compare wavelets with sine waves, which are the basis of Fourier analysis. Sinusoids do not
have limited duration -- they extend from minus to plus infinity. And where sinusoids are smooth
and predictable, wavelets tend to be irregular and asymmetric.
Fourier analysis consists of breaking up a signal into sine waves of various frequencies.
Similarly, wavelet analysis is the breaking up of a signal into shifted and scaled versions of the
original (or mother) wavelet.
Just looking at pictures of wavelets and sine waves, you can see intuitively that signals with
sharp changes might be better analyzed with an irregular wavelet than with a smooth sinusoid,
just as some foods are better handled with a fork than a spoon.
It also makes sense that local features can be described better with wavelets that have local
extent.
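These two defining properties -- zero average and effectively limited duration -- are easy to verify numerically for a concrete wavelet. The Mexican-hat (Ricker) wavelet is used here purely as an illustration:

```python
import math

def ricker(t):
    """Mexican-hat wavelet (second derivative of a Gaussian, unnormalized)."""
    return (1 - t * t) * math.exp(-t * t / 2)

# Its average value over a generous interval is (numerically) zero ...
dt = 0.001
ts = [i * dt for i in range(-8000, 8001)]
mean = sum(ricker(t) for t in ts) * dt

# ... and the waveform has effectively died out beyond |t| ~ 5,
# unlike a sinusoid, which extends from minus to plus infinity.
tail = max(abs(ricker(t)) for t in ts if abs(t) > 5)
print(round(mean, 6), tail < 1e-4)
```

The irregular, bump-like shape this function traces out is exactly the kind of "fork" that handles sharp local features better than a smooth, infinitely long sinusoid.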
Number of Dimensions:
Thus far, we've discussed only one-dimensional data, which encompasses most ordinary signals.
However, wavelet analysis can be applied to two-dimensional data (images) and, in principle, to
higher dimensional data.
This toolbox uses only one- and two-dimensional analysis techniques.
Mathematically, the process of Fourier analysis is represented by the Fourier transform,

    F(w) = integral over all t of f(t) * e^(-jwt) dt,

which is the sum over all time of the signal f(t) multiplied by a complex exponential. (Recall
that a complex exponential can be broken down into real and imaginary sinusoidal
components.) The results of the transform are the Fourier coefficients F(w), which when
multiplied by a sinusoid of frequency w yield the constituent sinusoidal components of the
original signal.
Similarly, the continuous wavelet transform (CWT) is defined as the sum over all time of the
signal multiplied by scaled, shifted versions of the wavelet function psi:

    C(scale, position) = integral over all t of f(t) * psi(scale, position, t) dt

The results of the CWT are many wavelet coefficients C, which are a function of scale and
position.
Multiplying each coefficient by the appropriately scaled and shifted wavelet yields the
constituent wavelets of the original signal:
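A direct, unoptimized evaluation of such a coefficient can be sketched in a few lines of Python. The derivative-of-Gaussian wavelet below is an assumption made for the sketch (any admissible wavelet would do); the example shows the CWT doing what the FT could not, pinpointing a discontinuity in time:

```python
import math

def dog(t):
    """Derivative-of-Gaussian wavelet: zero mean, effectively limited duration.
    (Chosen for illustration; the CWT works with any admissible wavelet.)"""
    return -t * math.exp(-t * t / 2)

def cwt_coeff(x, dt, scale, position):
    """C(scale, position): sum over all time of the signal multiplied by a
    scaled, shifted copy of the wavelet."""
    return sum(x[n] * dog((n * dt - position) / scale)
               for n in range(len(x))) * dt / math.sqrt(scale)

# A flat signal with a step (discontinuity) at t = 1 s.
dt = 0.01
x = [0.0 if n * dt < 1.0 else 1.0 for n in range(200)]

# At a small scale, the coefficient magnitude is largest at the step itself:
coeffs = {b: abs(cwt_coeff(x, dt, scale=0.05, position=b))
          for b in (0.5, 1.0, 1.5)}
best = max(coeffs, key=coeffs.get)
print(best)  # 1.0
```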
Scaling:
We've already alluded to the fact that wavelet analysis produces a time-scale view of a signal,
and now we're talking about scaling and shifting wavelets. What exactly do we mean by scale in
this context?
Scaling a wavelet simply means stretching (or compressing) it.
To go beyond colloquial descriptions such as "stretching," we introduce the scale factor, often
denoted by the letter a. If we're talking about sinusoids, for example, the effect of the scale factor
is very easy to see:
The scale factor works exactly the same with wavelets. The smaller the scale factor, the more
"compressed" the wavelet.
Shifting:
Shifting a wavelet simply means delaying (or hastening) its onset. Mathematically, delaying a
function f(t) by k is represented by f(t-k):
3. Shift the wavelet to the right and repeat steps 1 and 2 until you've covered the whole
signal.
In these plots, the x-axis represents position along the signal (time), the y-axis represents
scale, and the color at each point represents the magnitude of the wavelet coefficient C. These
are the coefficient plots generated by the graphical tools.
These coefficient plots resemble a bumpy surface viewed from above. If you could look at the
same surface from the side, you might see something like this:
The continuous wavelet transform coefficient plots are precisely the time-scale view of the signal
we referred to earlier. It is a different view of signal data from the time-frequency Fourier view,
but it is not unrelated.
Thus, there is a correspondence between wavelet scales and frequency as revealed by wavelet
analysis:

Low scale a  => compressed wavelet => rapidly changing details => high frequency
High scale a => stretched wavelet  => slowly changing, coarse features => low frequency
Calculating wavelet coefficients at every possible scale is a fair amount of work, and it
generates an awful lot of data. What if we choose only a subset of scales and positions at
which to make our calculations?
It turns out, rather remarkably, that if we choose scales and positions based on powers of two
-- so-called dyadic scales and positions -- then our analysis will be much more efficient and
just as accurate. We obtain such an analysis from the discrete wavelet transform (DWT).
An efficient way to implement this scheme using filters was developed in 1988 by Mallat (see
[Mal89] in References). The Mallat algorithm is in fact a classical scheme known in the signal
processing community as a two-channel subband coder. This very practical filtering algorithm
yields a fast wavelet transform -- a box into which a signal passes, and out of which wavelet
coefficients quickly emerge. Let's examine this in more depth.
The filtering process, at its most basic level, looks like this:
The original signal, S, passes through two complementary filters and emerges as two signals.
Unfortunately, if we actually perform this operation on a real digital signal, we wind up with
twice as much data as we started with. Suppose, for instance, that the original signal S consists of
1000 samples of data. Then the resulting signals will each have 1000 samples, for a total of 2000.
These signals A and D are interesting, but we get 2000 values instead of the 1000 we had.
There exists a more subtle way to perform the decomposition using wavelets. By looking
carefully at the computation, we may keep only one point out of two in each of the two
1000-length sequences and still get the complete information. This is the notion of
downsampling. We produce two sequences called cA and cD.
The process on the right, which includes downsampling, produces DWT coefficients.
To gain a better appreciation of this process, let's perform a one-stage discrete wavelet transform
of a signal. Our signal will be a pure sinusoid with high-frequency noise added to it.
Here is our schematic diagram with real signals inserted into it:
s = sin(20.*linspace(0,pi,1000)) + 0.5.*rand(1,1000);  % noisy sinusoid, 1000 samples
[cA,cD] = dwt(s,'db2');                                % one-stage DWT
where db2 is the name of the wavelet we want to use for the analysis.
Notice that the detail coefficients cD are small and consist mainly of high-frequency noise,
while the approximation coefficients cA contain much less noise than the original signal.
[length(cA) length(cD)]
You may observe that the actual lengths of the detail and approximation coefficient vectors are
slightly more than half the length of the original signal. This has to do with the filtering process,
which is implemented by convolving the signal with a filter. The convolution "smears" the
signal, introducing several extra samples into the result.
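For readers without MATLAB, the same one-stage decomposition can be sketched in Python using the two-tap Haar (db1) filter pair instead of db2 -- an assumption made to keep the code dependency-free, since db2 would require its four filter taps:

```python
import math
import random

random.seed(0)
n = 1000
# A pure sinusoid (10 cycles) plus uniform high-frequency noise,
# mirroring the MATLAB example above.
s = [math.sin(20 * math.pi * k / (n - 1)) + 0.5 * random.random()
     for k in range(n)]

# One-stage DWT with the Haar pair: the pairwise average is the lowpass
# branch (approximation cA), the pairwise difference the highpass branch
# (detail cD); keeping one output in two is the downsampling step.
cA = [(s[2 * i] + s[2 * i + 1]) / math.sqrt(2) for i in range(n // 2)]
cD = [(s[2 * i] - s[2 * i + 1]) / math.sqrt(2) for i in range(n // 2)]

print(len(cA), len(cD))  # 500 500

def energy(v):
    return sum(t * t for t in v)

# The detail coefficients carry mostly noise: their energy is a small
# fraction of the approximation's.
print(energy(cD) < 0.1 * energy(cA))  # True
```

With the two-tap Haar pair the output lengths come out exactly half; the "slightly more than half" lengths noted above appear with longer wavelets such as db2, whose filters smear the convolution past the ends of the signal.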
USER INTERFACE:
OUTPUT:
I would like to explain the analysis with the help of an example.
On recording the signal, we observe that it contains 4 different and distinct words.
- We first plot the respective time and frequency graphs for the entire sentence, i.e., "my
name is aditya".
- Then we plot the time and frequency graphs for the individual words in the sentence,
such as "my", "aditya", etc.
- Then we compare the values of time and frequency from the respective plots for the
individual words.
The respective plots are shown below. It can be easily seen that the values of frequency and
time for an individual word match the values obtained when these words are part of a larger
sentence. We can easily match frequencies using this method.
SOUND FILE: MY
SOUND FILE: IS
Now, we note that when the same sentence is recorded by a different person, we still obtain
similar time and frequency characteristics, i.e., on comparing this graph with our earlier results,
we can easily match the frequencies of individual words.
RESULT:
We have observed that the plots of time and frequency for individual words and those same
words contained in larger sentences are almost similar. There is also a high degree of similarity
when a different user is speaking those same words.
Hence, we conclude that we can obtain the frequency-time plots or wavelets for speech signals
with minimal error. Therefore our speech analysis is successful.
REFERENCES:
1. [Mal89] S. Mallat, "A theory for multiresolution signal decomposition: the wavelet
representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 11,
no. 7, pp. 674-693, 1989.