
Background of the Study

Speech recognition is the process of converting an acoustic signal, captured by a microphone or
a telephone, into a set of words. The recognized words can be the final result, as in applications
such as command and control, data entry, and document preparation. They can also serve as the
input to further linguistic processing in order to achieve speech understanding.

Speech recognition systems can be characterized by many parameters, some of the more
important of which are shown in Table 1. An isolated-word speech recognition system requires
that the speaker pause briefly between words, whereas a continuous speech recognition system
does not. Spontaneous, or extemporaneously generated, speech contains disfluencies, and is
much more difficult to recognize than speech read from a script. Some systems require speaker
enrollment, in which a user must provide samples of his or her speech before using the system,
whereas other systems are said to be speaker-independent, in that no enrollment is necessary. Some of the other
parameters depend on the specific task. Recognition is generally more difficult when
vocabularies are large or have many similar-sounding words. When speech is produced in a
sequence of words, language models or artificial grammars are used to restrict the combination
of words.

Table 1: Typical parameters used to characterize the capability of speech recognition systems

Speech recognition is a difficult problem, largely because of the many sources of variability
associated with the signal. First, the acoustic realizations of phonemes, the smallest sound units
of which words are composed, are highly dependent on the context in which they appear. These
phonetic variabilities are exemplified by the acoustic differences of the phoneme /t/ in two,
true, and butter in American English. At word boundaries, contextual variations can be quite
dramatic---making gas shortage sound like gash shortage in American English, and devo andare
sound like devandare in Italian.

Second, acoustic variabilities can result from changes in the environment as well as in the
position and characteristics of the transducer. Third, within-speaker variabilities can result from
changes in the speaker's physical and emotional state, speaking rate, or voice quality. Finally,
differences in sociolinguistic background, dialect, and vocal tract size and shape can contribute
to across-speaker variabilities.

Figure 1 shows the major components of a typical speech recognition system. The digitized
speech signal is first transformed into a set of useful measurements or features at a fixed rate,
typically once every 10--20 msec. These measurements are then used to search for the most likely
word candidate, making use of constraints imposed by the acoustic, lexical, and language
models. Throughout this process, training data are used to determine the values of the model
parameters.

Figure 1: Components of a typical speech recognition system.
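As a concrete illustration of the fixed-rate analysis described above, the following MATLAB fragment splits a digitized signal into overlapping frames and computes one simple log-magnitude feature vector per frame. This is a minimal sketch, not code from this project; the 25 ms window, 10 ms shift, and choice of log-magnitude features are illustrative assumptions.

% Fixed-rate front end: one feature vector per frame (window/shift values are illustrative).
fs = 11025;                                 % sampling rate in Hz (the rate this project records at)
x = randn(2*fs, 1);                         % placeholder signal; in practice the digitized speech
frameLen = round(0.025*fs);                 % 25 ms analysis window (assumed value)
frameShift = round(0.010*fs);               % a new feature vector every 10 ms
win = 0.54 - 0.46*cos(2*pi*(0:frameLen-1)'/(frameLen-1)); % Hamming window, written out explicitly
nFrames = floor((length(x) - frameLen)/frameShift) + 1;
nBins = floor(frameLen/2) + 1;              % keep the non-redundant half of the spectrum
features = zeros(nFrames, nBins);
for k = 1:nFrames
    idx = (k-1)*frameShift + (1:frameLen)';      % sample indices of the k-th frame
    seg = x(idx) .* win;                         % windowed frame
    spec = abs(fft(seg));                        % magnitude spectrum of the frame
    features(k, :) = log(spec(1:nBins) + eps).'; % log-magnitude feature vector for this frame
end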

Speech recognition systems attempt to model the sources of variability described above in
several ways. At the level of signal representation, researchers have developed representations
that emphasize perceptually important speaker-independent features of the signal, and de-
emphasize speaker-dependent characteristics. At the acoustic phonetic level, speaker variability
is typically modeled using statistical techniques applied to large amounts of data. Speaker
adaptation algorithms have also been developed that adapt speaker-independent acoustic models
to those of the current speaker during system use. Effects of linguistic context at the acoustic
phonetic level are typically handled by training separate models for phonemes in different
contexts; this is called context dependent acoustic modeling.

Word level variability can be handled by allowing alternate pronunciations of words in


representations known as pronunciation networks. Common alternate pronunciations of words,
as well as effects of dialect and accent are handled by allowing search algorithms to find
alternate paths of phonemes through these networks. Statistical language models, based on
estimates of the frequency of occurrence of word sequences, are often used to guide the search
through the most probable sequence of words.
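As a toy illustration of such a statistical language model (not part of this project; the small word list and the add-one smoothing are assumptions made purely for the example), the sketch below estimates bigram probabilities P(next word | current word) from counts and scores a candidate word sequence:

% Toy bigram language model: estimate word-pair probabilities from counts, then score a sentence.
corpus = {'set','radio','frequency','set','steer','point','set','radio','channel'};
vocab = unique(corpus);                     % sorted list of distinct words
V = numel(vocab);
counts = ones(V, V);                        % add-one smoothing so unseen word pairs keep nonzero probability
for k = 1:numel(corpus)-1
    i = find(strcmp(vocab, corpus{k}));     % index of the current word
    j = find(strcmp(vocab, corpus{k+1}));   % index of the following word
    counts(i, j) = counts(i, j) + 1;
end
P = counts ./ repmat(sum(counts, 2), 1, V); % row-normalize: P(i,j) = P(word j follows word i)
sentence = {'set','radio','frequency'};     % candidate word sequence to score
logp = 0;
for k = 1:numel(sentence)-1
    i = find(strcmp(vocab, sentence{k}));
    j = find(strcmp(vocab, sentence{k+1}));
    logp = logp + log(P(i, j));             % accumulate the log probability of each word pair
end
disp(logp)                                  % higher (less negative) means a more probable sequence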

The dominant recognition paradigm of the past fifteen years has been the hidden Markov model
(HMM). An HMM is a doubly stochastic model, in which the generation of the underlying
phoneme string and the frame-by-frame, surface acoustic realizations are both represented
probabilistically as Markov processes. Neural networks have also been used to estimate the
frame based scores; these scores are then integrated into HMM-based system architectures, in
what has come to be known as hybrid systems.
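For concreteness, the fragment below is a generic textbook-style sketch of the HMM forward algorithm, which computes the likelihood of an observation sequence given the model. The two-state transition matrix, emission matrix, and observation sequence are made-up illustrative values, not parameters from any system described here.

% Forward algorithm for a small discrete-output HMM (all parameter values are illustrative).
A = [0.7 0.3; 0.4 0.6];          % A(i,j): probability of moving from state i to state j
B = [0.9 0.1; 0.2 0.8];          % B(i,k): probability of emitting observation symbol k in state i
p0 = [0.5; 0.5];                 % initial state distribution
obs = [1 2 2 1];                 % observed symbol sequence (frame-by-frame surface observations)
N = size(A, 1);
T = numel(obs);
alpha = zeros(N, T);             % alpha(i,t) = P(observations 1..t, state i at time t)
alpha(:, 1) = p0 .* B(:, obs(1));                       % initialization
for t = 2:T
    alpha(:, t) = (A' * alpha(:, t-1)) .* B(:, obs(t)); % induction over time
end
likelihood = sum(alpha(:, T));   % P(full observation sequence | model)
disp(likelihood)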

An interesting feature of frame-based HMM systems is that speech segments are identified
during the search process, rather than being segmented explicitly beforehand. An alternate approach is to first identify speech
segments, then classify the segments and use the segment scores to recognize words. This
approach has produced competitive recognition performance in several tasks.

History

The first speech recognizer appeared in 1952 and consisted of a device for the recognition of
single spoken digits. Another early device was the IBM Shoebox, exhibited at the 1964 New
York World's Fair.

One of the most notable domains for the commercial application of speech recognition in the
United States has been health care and in particular the work of the medical
transcriptionist (MT). According to industry experts, at its inception, speech recognition (SR)
was sold as a way to completely eliminate transcription rather than make the transcription
process more efficient, hence it was not accepted. It was also the case that SR at that time was
often technically deficient. Additionally, to be used effectively, it required changes to the ways
physicians worked and documented clinical encounters, which many if not all were reluctant to
do. The biggest limitation to speech recognition automating transcription, however, is seen as the
software. The nature of narrative dictation is highly interpretive and often requires judgment that
may be provided by a real human but not yet by an automated system. Another limitation has
been the extensive amount of time required by the user and/or system provider to train the
software.
A distinction in automatic speech recognition (ASR) is often made between "artificial syntax systems", which are usually
domain-specific, and "natural language processing", which is usually language-specific. Each of
these types of application presents its own particular goals and challenges.

Applications

Health care

In the health care domain, even in the wake of improving speech recognition technologies,
medical transcriptionists (MTs) have not yet become obsolete. Many experts in the field
anticipate that with increased use of speech recognition technology, the services provided may be
redistributed rather than replaced. Speech recognition is also a helpful input method for blind
users. Speech recognition can be implemented in the front end or the back end of the medical
documentation process. Front-end SR is where the provider dictates into a speech-recognition
engine, the recognized words are displayed right after they are spoken, and the dictator is
responsible for editing and signing off on the document; it never goes through an MT/editor.
Back-End SR or Deferred SR is where the provider dictates into a digital dictation system, and
the voice is routed through a speech-recognition machine and the recognized draft document is
routed along with the original voice file to the MT/editor, who edits the draft and finalizes the
report. Deferred SR is being widely used in the industry currently.
Many Electronic Medical Records (EMR) applications can be more effective and may be
performed more easily when deployed in conjunction with a speech-recognition engine.
Searches, queries, and form filling may all be faster to perform by voice than by using a
keyboard.

Military
High-performance fighter aircraft

Substantial efforts have been devoted in the last decade to the test and evaluation of speech
recognition in fighter aircraft. Of particular note are the U.S. program in speech recognition for
the Advanced Fighter Technology Integration (AFTI)/F-16 aircraft (F-16 VISTA), the program
in France on installing speech recognition systems on Mirage aircraft, and programs in the UK
dealing with a variety of aircraft platforms. In these programs, speech recognizers have been
operated successfully in fighter aircraft with applications including: setting radio frequencies,
commanding an autopilot system, setting steer-point coordinates and weapons release
parameters, and controlling flight displays. Generally, only very limited, constrained
vocabularies have been used successfully, and a major effort has been devoted to integration of
the speech recognizer with the avionics system.

Some important conclusions from the work were as follows:

Speech recognition has definite potential for reducing pilot workload, but this potential was not
realized consistently.

Achievement of very high recognition accuracy (95% or more) was the most critical factor for
making the speech recognition system useful — with lower recognition rates, pilots would not
use the system.

More natural vocabulary and grammar, and shorter training times, would be useful, but only if
very high recognition rates could be maintained.

Laboratory research in robust speech recognition for military environments has produced
promising results which, if extendable to the cockpit, should improve the utility of speech
recognition in high-performance aircraft.

Working with Swedish pilots flying in the JAS-39 Gripen cockpit, Englund (2004) found
recognition deteriorated with increasing G-loads. It was also concluded that adaptation greatly
improved the results in all cases and introducing models for breathing was shown to improve
recognition scores significantly. Contrary to what might be expected, no effects of the broken
English of the speakers were found. It was evident that spontaneous speech caused problems for
the recognizer, as could be expected. A restricted vocabulary, and above all, a proper syntax,
could thus be expected to improve recognition accuracy substantially.

The Eurofighter Typhoon currently in service with the UK RAF employs a speaker-dependent
system, i.e. it requires each pilot to create a template. The system is not used for any safety
critical or weapon critical tasks, such as weapon release or lowering of the undercarriage, but is
used for a wide range of other cockpit functions. Voice commands are confirmed by visual
and/or aural feedback. The system is seen as a major design feature in the reduction of
pilot workload, and even allows the pilot to assign targets to himself with two simple voice
commands or to any of his wingmen with only five commands.

Helicopters

The problems of achieving high recognition accuracy under stress and noise pertain strongly to
the helicopter environment as well as to the fighter environment. The acoustic noise problem is
actually more severe in the helicopter environment, not only because of the high noise levels but
also because the helicopter pilot generally does not wear a facemask, which would reduce
acoustic noise in the microphone. Substantial test and evaluation programs have been carried out
in the past decade in speech recognition systems applications in helicopters, notably by the U.S.
Army Avionics Research and Development Activity (AVRADA) and by the Royal Aerospace
Establishment (RAE) in the UK. Work in France has included speech recognition in the Puma
helicopter. There has also been much useful work in Canada. Results have been encouraging,
and voice applications have included: control of communication radios; setting of navigation
systems; and control of an automated target handover system.

As in fighter applications, the overriding issue for voice in helicopters is the impact on pilot
effectiveness. Encouraging results are reported for the AVRADA tests, although these represent
only a feasibility demonstration in a test environment. Much remains to be done, both in speech
recognition and in overall speech technology, in order to consistently achieve performance
improvements in operational settings.

Battle management

Battle Management command centres generally require rapid access to and control of large,
rapidly changing information databases. Commanders and system operators need to query these
databases as conveniently as possible, in an eyes-busy environment where much of the
information is presented in a display format. Human-machine interaction by voice has the
potential to be very useful in these environments. A number of efforts have been undertaken to
interface commercially available isolated-word recognizers into battle management
environments. In one feasibility study speech recognition equipment was tested in conjunction
with an integrated information display for naval battle management applications. Users were
very optimistic about the potential of the system, although capabilities were limited.

Speech understanding programs sponsored by the Defense Advanced Research Projects Agency
(DARPA) in the U.S. have focused on this problem of natural speech interface. Speech
recognition efforts have focused on a large-vocabulary continuous speech recognition (CSR)
database designed to be representative of the naval resource management task.
Significant advances in the state-of-the-art in CSR have been achieved, and current efforts are
focused on integrating speech recognition and natural language processing to allow spoken
language interaction with a naval resource management system.

Training air traffic controllers

Training for military (or civilian) air traffic controllers (ATC) represents an excellent application
for speech recognition systems. Many ATC training systems currently require a person to act as a
"pseudo-pilot", engaging in a voice dialog with the trainee controller, which simulates the dialog
which the controller would have to conduct with pilots in a real ATC situation. Speech
recognition and synthesis techniques offer the potential to eliminate the need for a person to act
as pseudo-pilot, thus reducing training and support personnel. Air controller tasks are also
characterized by highly structured speech as the primary output of the controller, hence reducing
the difficulty of the speech recognition task.
The U.S. Naval Training Equipment Center has sponsored a number of developments of
prototype ATC trainers using speech recognition. Generally, the recognition accuracy falls short
of providing graceful interaction between the trainee and the system. However, the prototype
training systems have demonstrated a significant potential for voice interaction in these systems,
and in other training applications. The U.S. Navy has sponsored a large-scale effort in ATC
training systems, where a commercial speech recognition unit was integrated with a complex
training system including displays and scenario creation. Although the recognizer was
constrained in vocabulary, one of the goals of the training programs was to teach the controllers
to speak in a constrained language, using a vocabulary specifically designed for the ATC
task. Research in France has focused on the application of speech recognition in ATC training
systems, directed at issues both in speech recognition and in application of task-domain grammar
constraints.[4]
The USAF, USMC, US Army, and FAA are currently using ATC simulators with speech
recognition from a number of different vendors, including UFA, Inc, and Adacel Systems Inc
(ASI). This software uses speech recognition and synthetic speech to enable the trainee to control
aircraft and ground vehicles in the simulation without the need for pseudo pilots.
Another approach to ATC simulation with speech recognition has been created by Supremis[1].
The Supremis system is not constrained by rigid grammars imposed by the underlying
limitations of other recognition strategies.

Telephony and other domains

ASR in the field of telephony is now commonplace and in the field of computer gaming and
simulation is becoming more widespread. Despite the high level of integration with word
processing in general personal computing, however, ASR in the field of document production
has not seen the expected increases in use.
Improvements in mobile processor speed have made speech-enabled Symbian and
Windows Mobile smartphones feasible. Current speech-to-text programs are too large and require too
much CPU power to be practical for the Pocket PC. Speech is used mostly as part of the user
interface, for issuing predefined or custom speech commands. Leading software vendors in this
field are: Microsoft Corporation (Microsoft Voice Command), Nuance Communications
(Nuance Voice Control), Vito Technology (VITO Voice2Go), Speereo Software (Speereo Voice
Translator) and SVOX.

People with disabilities

People with disabilities can benefit from speech recognition programs. Speech recognition is
especially useful for people who have difficulty using their hands, ranging from mild repetitive
stress injuries to involved disabilities that preclude using conventional computer input devices.
In fact, people who used the keyboard a lot and developed RSI became an urgent early market
for speech recognition.[5][6] Speech recognition is used in deaf telephony, such as voicemail to
text, relay services, and captioned telephone. Individuals with learning disabilities who have
problems with thought-to-paper communication (essentially, they think of an idea but it is
processed incorrectly, causing it to end up differently on paper) can also benefit from the software.

Objective of the Study

Our objective in creating a MATLAB speech recognition program is to recognize spoken
numbers in the Tagalog language from one (“isa”) to ten (“sampu”). Our program and
algorithms can serve as a basic model for creating larger, usable programs for future or
specific use, for example speech recognition for disabled persons, particularly blind and
deaf persons, or for security purposes.

Scope and Limitation

The scope of this project is to create a program that will recognize an input analog voice
signal, store it in .wav file format, and present its output as a dialog box containing the
recognized number.

The limitation of this program is that it can only record and recognize the Tagalog
numbers “isa” (one) to “sampu” (ten). Another limitation is that the user can only use the
program reliably for a short period of time. Because the program uses the frequency content of
the user’s voice signal as the basis for comparison with the pre-stored .wav files, the spoken
input must be one of the numbers 1-10, the speaker’s voice must closely match the stored
recordings (proper speaker adaptation), and recording must take place in a noise-free
environment in order for the program to output the number actually spoken.

Project Description

This speech recognition program for numbers uses the frequency content of the sound wave stored
in MATLAB as the basis for comparison, and the comparison is made using the correlation
coefficient. The input sound is compared with the stored .wav file for each number, and the
number whose correlation coefficient is closest to 1 is declared as the output.

The user can input the voice for two (2) seconds. The user can either store his or her own voice
sample for each number or use the pre-stored voice samples. When the GUI receives the input
voice signal, the program displays its time- and frequency-domain plots. These serve as the basis
for comparison with the pre-stored voice signals, using the correlation coefficient as the
comparison measure.
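The correlation coefficient used here is the standard normalized two-dimensional correlation that MATLAB's corr2 function computes over the two magnitude spectra; a value near 1 means the spectra are nearly proportional, while a value near 0 means they have little in common. In LaTeX notation,

r = \frac{\sum_{n}\left(X_n - \bar{X}\right)\left(Y_n - \bar{Y}\right)}{\sqrt{\sum_{n}\left(X_n - \bar{X}\right)^{2}\sum_{n}\left(Y_n - \bar{Y}\right)^{2}}}

where X and Y are the magnitude spectra of the stored template and of the new recording, and \bar{X} and \bar{Y} are their mean values.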

Program Algorithm

function varargout = prototype(varargin)


% PROTOTYPE M-file for prototype.fig
% PROTOTYPE, by itself, creates a new PROTOTYPE or raises the existing
% singleton*.
%
% H = PROTOTYPE returns the handle to a new PROTOTYPE or the handle to
% the existing singleton*.
%
% PROTOTYPE('CALLBACK',hObject,eventData,handles,...) calls the local
% function named CALLBACK in PROTOTYPE.M with the given input arguments.
%
% PROTOTYPE('Property','Value',...) creates a new PROTOTYPE or raises the
% existing singleton*. Starting from the left, property value pairs are
% applied to the GUI before prototype_OpeningFcn gets called. An
% unrecognized property name or invalid value makes property application
% stop. All inputs are passed to prototype_OpeningFcn via varargin.
%
% *See GUI Options on GUIDE's Tools menu. Choose "GUI allows only one
% instance to run (singleton)".
%
% See also: GUIDE, GUIDATA, GUIHANDLES

% Edit the above text to modify the response to help prototype

% Last Modified by GUIDE v2.5 14-Apr-2010 12:36:32

% Begin initialization code - DO NOT EDIT


gui_Singleton = 1;
gui_State = struct('gui_Name', mfilename, ...
'gui_Singleton', gui_Singleton, ...
'gui_OpeningFcn', @prototype_OpeningFcn, ...
'gui_OutputFcn', @prototype_OutputFcn, ...
'gui_LayoutFcn', [] , ...
'gui_Callback', []);
if nargin && ischar(varargin{1})
gui_State.gui_Callback = str2func(varargin{1});
end

if nargout
[varargout{1:nargout}] = gui_mainfcn(gui_State, varargin{:});
else
gui_mainfcn(gui_State, varargin{:});
end
% End initialization code - DO NOT EDIT

% --- Executes just before prototype is made visible.


function prototype_OpeningFcn(hObject, eventdata, handles, varargin)
% This function has no output args, see OutputFcn.
% hObject handle to figure
% eventdata reserved - to be defined in a future version of MATLAB
% handles structure with handles and user data (see GUIDATA)
% varargin command line arguments to prototype (see VARARGIN)

% Choose default command line output for prototype


handles.output = hObject;

% Update handles structure


guidata(hObject, handles);

% UIWAIT makes prototype wait for user response (see UIRESUME)


% uiwait(handles.figure1);

% --- Outputs from this function are returned to the command line.
function varargout = prototype_OutputFcn(hObject, eventdata, handles)
% varargout cell array for returning output args (see VARARGOUT);
% hObject handle to figure
% eventdata reserved - to be defined in a future version of MATLAB
% handles structure with handles and user data (see GUIDATA)

% Get default command line output from handles structure


varargout{1} = handles.output;

% --- Executes on button press in savone.


function savone_Callback(hObject, eventdata, handles)
% hObject handle to savone (see GCBO)
% eventdata reserved - to be defined in a future version of MATLAB
% handles structure with handles and user data (see GUIDATA)

t=2; %time in seconds


fs=11025; %default sampling rate in Hz

o=wavrecord(t*fs,fs); %Record sound using PC-based audio input device


wavplay(o,fs) %Play recorded sound on PC-based audio output device
figure
subplot(2,1,1); plot(o),title('Time Domain')
subplot(2,1,2); plot(abs(fft(o))),title('Frequency Domain')
wavwrite(o,fs,'data\isa') %Write Microsoft WAVE (.wav) sound file

% --- Executes on button press in savtwo.


function savtwo_Callback(hObject, eventdata, handles)
% hObject handle to savtwo (see GCBO)
% eventdata reserved - to be defined in a future version of MATLAB
% handles structure with handles and user data (see GUIDATA)

t=2; %time in seconds


fs=11025; %default sampling rate in Hz

tw=wavrecord(t*fs,fs); %Record sound using PC-based audio input device


wavplay(tw,fs) %Play recorded sound on PC-based audio output device
figure
subplot(2,1,1); plot(tw),title('Time Domain')
subplot(2,1,2); plot(abs(fft(tw))),title('Frequency Domain')
wavwrite(tw,fs,'data\dalawa') %Write Microsoft WAVE (.wav) sound file

% --- Executes on button press in savthree.


function savthree_Callback(hObject, eventdata, handles)
% hObject handle to savthree (see GCBO)
% eventdata reserved - to be defined in a future version of MATLAB
% handles structure with handles and user data (see GUIDATA)

t=2; %time in seconds


fs=11025; %default sampling rate in Hz

th=wavrecord(t*fs,fs); %Record sound using PC-based audio input device


wavplay(th,fs) %Play recorded sound on PC-based audio output device
figure
subplot(2,1,1); plot(th),title('Time Domain')
subplot(2,1,2); plot(abs(fft(th))),title('Frequency Domain')
wavwrite(th,fs,'data\tatlo') %Write Microsoft WAVE (.wav) sound file

% --- Executes on button press in savfour.


function savfour_Callback(hObject, eventdata, handles)
% hObject handle to savfour (see GCBO)
% eventdata reserved - to be defined in a future version of MATLAB
% handles structure with handles and user data (see GUIDATA)

t=2; %time in seconds


fs=11025; %default sampling rate in Hz

fo=wavrecord(t*fs,fs); %Record sound using PC-based audio input device


wavplay(fo,fs) %Play recorded sound on PC-based audio output device
figure
subplot(2,1,1); plot(fo),title('Time Domain')
subplot(2,1,2); plot(abs(fft(fo))),title('Frequency Domain')
wavwrite(fo,fs,'data\apat') %Write Microsoft WAVE (.wav) sound file

% --- Executes on button press in savfive.


function savfive_Callback(hObject, eventdata, handles)
% hObject handle to savfive (see GCBO)
% eventdata reserved - to be defined in a future version of MATLAB
% handles structure with handles and user data (see GUIDATA)

t=2; %time in seconds


fs=11025; %default sampling rate in Hz

fi=wavrecord(t*fs,fs); %Record sound using PC-based audio input device


wavplay(fi,fs) %Play recorded sound on PC-based audio output device
figure
subplot(2,1,1); plot(fi),title('Time Domain')
subplot(2,1,2); plot(abs(fft(fi))),title('Frequency Domain')
wavwrite(fi,fs,'data\lima') %Write Microsoft WAVE (.wav) sound file

% --- Executes on button press in savsix.


function savsix_Callback(hObject, eventdata, handles)
% hObject handle to savsix (see GCBO)
% eventdata reserved - to be defined in a future version of MATLAB
% handles structure with handles and user data (see GUIDATA)

t=2; %time in seconds


fs=11025; %default sampling rate in Hz

si=wavrecord(t*fs,fs); %Record sound using PC-based audio input device


wavplay(si,fs) %Play recorded sound on PC-based audio output device
figure
subplot(2,1,1); plot(si),title('Time Domain')
subplot(2,1,2); plot(abs(fft(si))),title('Frequency Domain')
wavwrite(si,fs,'data\anim') %Write Microsoft WAVE (.wav) sound file

% --- Executes on button press in savseven.


function savseven_Callback(hObject, eventdata, handles)
% hObject handle to savseven (see GCBO)
% eventdata reserved - to be defined in a future version of MATLAB
% handles structure with handles and user data (see GUIDATA)

t=2; %time in seconds


fs=11025; %default sampling rate in Hz

se=wavrecord(t*fs,fs); %Record sound using PC-based audio input device


wavplay(se,fs) %Play recorded sound on PC-based audio output device
figure
subplot(2,1,1); plot(se),title('Time Domain')
subplot(2,1,2); plot(abs(fft(se))),title('Frequency Domain')
wavwrite(se,fs,'data\pito') %Write Microsoft WAVE (.wav) sound file

% --- Executes on button press in saveight.


function saveight_Callback(hObject, eventdata, handles)
% hObject handle to saveight (see GCBO)
% eventdata reserved - to be defined in a future version of MATLAB
% handles structure with handles and user data (see GUIDATA)

t=2; %time in seconds


fs=11025; %default sampling rate in Hz

ei=wavrecord(t*fs,fs); %Record sound using PC-based audio input device


wavplay(ei,fs) %Play recorded sound on PC-based audio output device
figure
subplot(2,1,1); plot(ei),title('Time Domain')
subplot(2,1,2); plot(abs(fft(ei))),title('Frequency Domain')
wavwrite(ei,fs,'data\walo') %Write Microsoft WAVE (.wav) sound file

% --- Executes on button press in savnine.


function savnine_Callback(hObject, eventdata, handles)
% hObject handle to savnine (see GCBO)
% eventdata reserved - to be defined in a future version of MATLAB
% handles structure with handles and user data (see GUIDATA)

t=2; %time in seconds


fs=11025; %default sampling rate in Hz

ni=wavrecord(t*fs,fs); %Record sound using PC-based audio input device


wavplay(ni,fs) %Play recorded sound on PC-based audio output device
figure
subplot(2,1,1); plot(ni),title('Time Domain')
subplot(2,1,2); plot(abs(fft(ni))),title('Frequency Domain')
wavwrite(ni,fs,'data\siyam') %Write Microsoft WAVE (.wav) sound file

% --- Executes on button press in savten.


function savten_Callback(hObject, eventdata, handles)
% hObject handle to savten (see GCBO)
% eventdata reserved - to be defined in a future version of MATLAB
% handles structure with handles and user data (see GUIDATA)

t=2; %time in seconds


fs=11025; %default sampling rate in Hz

z=wavrecord(t*fs,fs); %Record sound using PC-based audio input device


wavplay(z,fs) %Play recorded sound on PC-based audio output device
figure
subplot(2,1,1); plot(z),title('Time Domain')
subplot(2,1,2); plot(abs(fft(z))),title('Frequency Domain')
wavwrite(z,fs,'data\sampu') %Write Microsoft WAVE (.wav) sound file

% --- Executes on button press in record.


function record_Callback(hObject, eventdata, handles)
% hObject handle to record (see GCBO)
% eventdata reserved - to be defined in a future version of MATLAB
% handles structure with handles and user data (see GUIDATA)

t=2; %time in seconds


fs=11025; %default sampling rate in Hz

one=wavread('data\isa.wav');
two=wavread('data\dalawa.wav');
three=wavread('data\tatlo.wav');
four=wavread('data\apat.wav');
five=wavread('data\lima.wav');
six=wavread('data\anim.wav');
seven=wavread('data\pito.wav');
eight=wavread('data\walo.wav');
nine=wavread('data\siyam.wav');
ten=wavread('data\sampu.wav');
input=wavrecord(t*fs,fs); %Record the 2-second test utterance (note: this variable shadows MATLAB's built-in input function)

ref=strvcat('one','two','three','four','five','six','seven','eight','nine','ten');
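%ref lists the English names of the ten templates; it is kept for reference only and is not used in the matching below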
data=[ process(one,input)
process(two,input)
process(three,input)
process(four,input)
process(five,input)
process(six,input)
process(seven,input)
process(eight,input)
process(nine,input)
process(ten,input)]

index=find(data==max(data));
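%Note: if two or more templates tie for the maximum score, find returns several indices and none of the
%if-tests below will be satisfied; with real-valued correlation scores such exact ties are unlikely in practice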

wavplay(input,fs)

figure
subplot(2,1,1); plot(input),title('Time Domain')
subplot(2,1,2); plot(abs(fft(input))),title('Frequency Domain')

if index==1
h = msgbox('isa','OUTPUT');figure(h)
end

if index==2
h = msgbox('dalawa','OUTPUT');figure(h)
end

if index==3
h = msgbox('tatlo','OUTPUT');figure(h)
end

if index==4
h = msgbox('apat','OUTPUT');figure(h)
end

if index==5
h = msgbox('lima','OUTPUT');figure(h)
end

if index==6
h = msgbox('anim','OUTPUT');figure(h)
end

if index==7
h = msgbox('pito','OUTPUT');figure(h)
end

if index==8
h = msgbox('walo','OUTPUT');figure(h)
end

if index==9
h = msgbox('siyam','OUTPUT');figure(h)
end

if index==10
h = msgbox('sampu','OUTPUT');figure(h)
end

function coef=process(x,y)
%----------this function processes two signals----------%
%--------and returns the correlation coefficient--------%
x=fft(x);y=fft(y); %converts the two signals from time domain to frequency domain
x=abs(x);y=abs(y); %Gets the magnitude of the transformed signals
coef=corr2(x,y); %Computes the correlation coefficient between the two signals
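
Note that wavrecord, wavplay, wavread, and wavwrite belong to the MATLAB release used for this project and have since been removed from MATLAB, and that corr2 requires the Image Processing Toolbox. The following is a minimal standalone sketch of the same record-and-match step written with the newer audiorecorder/audioread functions; the file names simply mirror this project's data folder, and the sketch assumes the ten template .wav files have already been saved by the buttons above.

% Minimal sketch of the record-and-match step using post-2010 MATLAB audio functions.
fs = 11025;                               % sampling rate in Hz, matching the GUI callbacks above
t = 2;                                    % recording length in seconds
rec = audiorecorder(fs, 16, 1);           % 16-bit, single-channel recorder object
disp('Speak now...');
recordblocking(rec, t);                   % record for t seconds
x = getaudiodata(rec);                    % the test utterance as a column vector
names = {'isa','dalawa','tatlo','apat','lima','anim','pito','walo','siyam','sampu'};
score = zeros(1, numel(names));
for k = 1:numel(names)
    ref = audioread(fullfile('data', [names{k} '.wav']));   % stored template (assumed path)
    n = min(numel(ref), numel(x));                          % compare equal-length segments
    score(k) = corr2(abs(fft(ref(1:n))), abs(fft(x(1:n)))); % correlation of magnitude spectra
end
[~, best] = max(score);
msgbox(names{best}, 'OUTPUT');            % display the recognized Tagalog number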

User Manual and Program Flowchart

1. The buttons from “isa” (one) to “sampu” (ten) allow the user to record their desired speech
for each corresponding number.
2. The “Press this Button then Speak” button allows the user to input the speech to be
recognized; the corresponding number is then displayed.
3. MATLAB compares the input analog voice signal to the stored recordings of the numbers
from “isa” (one) to “sampu” (ten).

Conclusion and Recommendation

In this project, we have concluded that a speech recognition algorithm can be used to
convert an input voice signal into readable text. The GUI helped make the project more
attainable because of its practicality and simplicity of use.

One problem the group encountered in creating the project is that the characteristics of the
user’s voice change constantly over time. We found that as the user’s voice changes, its
frequency content changes, so the frequency spectrum produced differs from that of the
recorded template.

Creating this project required time management. We recommend that those who remake
the same project manage their time properly. In addition, the group should have equipment
that is compatible with each other.

References:
www.mathworks.com/
http://www.wikipedia.com
www.mathtools.net
www.math.siu.edu/matlab/tutorials.html
