(December 2010)
Bachelor of Technology
In
COMPUTER SCIENCE
CERTIFICATE
This is to certify that the project report entitled "SPEECH TEXT ARTIFICE" by Aayush
Sharma (07/BV/CS/017), Vinay Pareek (07/BV/CS/022), Lakshay Gaur (07/BV/CS/030), and Sahil
Sikka (07/BV/CS/032), submitted to the Department of Computer Science Engineering of
Bharati Vidyapeeth College of Engineering, New Delhi, is an authentic record of their own
work, carried out during the period of August 2010 to November 2010 under my guidance; they
completed the project work to my satisfaction.
The matter embodied in this project has not been submitted earlier for the award of any
degree or diploma, to the best of my knowledge and belief.
Date:
Project Guide
PROF. RACHNA JAIN
ACKNOWLEDGEMENT
We gratefully acknowledge the guidance provided by our project supervisor, Mrs. Rachna
Jain, throughout the development of the project: her valuable inputs, able guidance,
encouragement, whole-hearted cooperation, and constructive criticism throughout its
duration.
We take this opportunity to pay our regards to all our teachers who have directly or
indirectly helped us in this project. Last but not least, we express our thanks to all our
friends for their cooperation and support.
SYNOPSIS
Speech recognition is the process by which a computer (or other type of machine) identifies
spoken words. Basically, it means talking to your computer, and having it correctly recognize
what you are saying.
Hardware Requirements:
Sound Cards
Microphones
Computers/Processors
Applications:
The possible applications of speech technology include every task related to the
human voice. In this sense, the application fields range across speech production,
storage, transmission, and recognition. Some of the potential applications of speech
technology are mentioned below:
Speech synthesis (voice response systems)
Digital transmission and storage (optimized encoding of signals)
Speaker verification and identification (control of access, legal applications)
Aids to the handicapped
Speech recognition (automatic dictation, command and control)
It takes text as input and produces a stream of audio as output using synthetic voices. This
application can be used for proofreading documents; it serves, in the literal sense, as a
reader of text files. The properties of the synthetic voice can be changed; the user can vary
the following properties as required:
VOLUME: The volume can be set as loud or low as required by the user.
RATE: The user can decide how fast or slow the speech should be.
CONTENT
Certificate
Acknowledgement
Synopsis
1 Project Overview
  1.1 Project Objective
  1.2 Abstract
  1.3 Project Scope
2 Literature Review
  2.1 An Overview Of Speech Recognition
  2.2 History
  2.3 Types Of Speech Recognition
    2.3.1 Isolated Speech
    2.3.2 Connected Speech
    2.3.3 Continuous Speech
    2.3.4 Spontaneous Speech
  2.4 Speech Recognition Process
    2.4.1 Components Of Speech Recognition System
  2.5 Uses Of Speech Recognition Programs
  2.6 Applications
    2.6.1 From Medical Perspective
    2.6.2 From Military Perspective
    2.6.3 From Educational Perspective
  2.7 Speech Recognition Weakness And Flaws
  2.8 The Future Of Speech Recognition
  2.9 Few Speech Recognition Software
    2.9.1 X Voice
    2.9.2 ISIP
    2.9.3 Ears
4 Documentation
  4.1 System Requirements
    4.1.1 Minimum Requirements
    4.1.2 Best Requirements
  4.2 Hardware Requirements
  4.3 Software Requirements
  4.4 Context Diagram
  4.5 Sequence Diagram
  4.6 Package Diagram
6 Conclusion
  6.1 Advantages of software
  6.2 Disadvantages
  6.3 Future Enhancements
7 References
Appendices
8 Appendix A: Source Code
9 Appendix B: Snapshots
10 Appendix C: Glossary
LIST OF FIGURES
1. Speech Technology
3. Speech Synthesis
4. Speech Recognition
5. Context Diagram
6. Sequence Diagram
7. Package Diagram
8. Activity Diagram
9. Test Cases
CHAPTER 1
PROJECT OVERVIEW
This chapter introduces speech recognition technology, its development, and its
applications. The first section deals with the description of the speech recognition
process, its applications in different sectors, its flaws, and finally the future of the
technology. The later part of the report covers the speech recognition process, the code for
the software, and its working. Finally, the report concludes with the different potential
uses of the technology.
1.2 Abstract
At the initial level, an effort is made to provide help for the basic operations discussed
above, but the software can be further updated and enhanced to provide more operations.
This project has speech recognizing and speech synthesizing capabilities; though not a
complete replacement of what we call a Notepad, it is still a good text editor to be used
through voice. The software can also open Windows-based programs such as Notepad,
MS Paint, and more.
CHAPTER 2
LITERATURE REVIEW
SPEECH TECHNOLOGY
Three primary speech technologies are used in voice processing applications: stored speech,
text-to-speech, and speech recognition. Stored speech involves the production of computer
speech from an actual human voice that is stored in a computer's memory and used in any of
several ways.
Speech can also be synthesized from plain text in a process known as text-to-speech, which
also enables voice processing applications to read from a textual database.
Speech recognition is the process of deriving either a textual transcription or some form of
meaning from a spoken input.
Speech analysis can be thought of as the part of voice processing that converts human speech
to digital forms suitable for transmission or storage by computers.
Speech synthesis functions are essentially the inverse of speech analysis: they reconvert
speech data from a digital form to one that is similar to the original recording and suitable
for playback.
Speech analysis processes can also be referred to as digital speech encoding (or simply
coding), and speech synthesis can be referred to as speech decoding.
environment are the major factors on which a speech recognition engine depends [3].
2.2 History
The concept of speech recognition started somewhere in the 1940s [3]; practically, the first
speech recognition program appeared in 1952 at Bell Labs, and it recognized a digit in a
noise-free environment [4], [5].
During the 1950s and 1960s, work was done on the foundational paradigms of speech recognition
technology, that is, automation and information-theoretic models [15]. Systems of this era
recognized small vocabularies (10-100 words) of isolated words, based on simple
acoustic-phonetic properties of speech sounds [3]. The key technologies developed during this
period were filter banks and time-normalization methods [15].
In the 1970s, medium vocabularies (100-1000 words) were recognized using simple
template-based pattern recognition methods.
In the 1980s, large vocabularies (1000 words to unlimited) were used, and speech recognition
problems based on statistical methods, with a large range of networks for handling language
structures, were addressed. The key inventions of this era were the Hidden Markov Model (HMM)
and the stochastic language model, which together enabled powerful new methods for handling
the continuous speech recognition problem efficiently and with high performance [3].
In the 1990s, the key technologies developed were methods for stochastic language
understanding, statistical learning of acoustic and language models, and methods for the
implementation of large-vocabulary speech understanding systems.
After five decades of research, speech recognition technology has finally entered the
marketplace, benefiting users in a variety of ways. The challenge of designing a machine that
truly functions like an intelligent human remains a major one going forward.
2.3 Types Of Speech Recognition
Isolated-word recognition usually involves a pause between two utterances; this does not mean
that it accepts only a single word, but rather that it requires one utterance at a time [4].
Continuous speech allows the user to speak almost naturally; it is also called computer
dictation.
2.4.1 Components Of Speech Recognition System
Voice Input
Audio is input to the system with the help of a microphone; the PC sound card produces the
equivalent digital representation of the received audio [8] [9] [10].
Digitization
The process of converting an analog signal into digital form is known as digitization [8];
it involves both sampling and quantization. Sampling converts a continuous signal into a
discrete signal, while the process of approximating a continuous range of values by a finite
set of levels is known as quantization.
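To make these two steps concrete, the following sketch (in C#, the language used for this project's source; the sample rate, test tone, and bit depth are arbitrary choices for illustration) samples a continuous sine wave and quantizes each sample to a 16-bit integer, which is essentially what the sound card's A/D converter does:

using System;

class DigitizationDemo
{
    static void Main()
    {
        const int sampleRate = 8000;  // samples per second (the sampling step)
        const double frequency = 440; // an arbitrary test tone in Hz

        // One second of audio: take discrete samples of the continuous signal
        // (sampling) and map each to a 16-bit integer (quantization).
        short[] samples = new short[sampleRate];
        for (int n = 0; n < samples.Length; n++)
        {
            double t = (double)n / sampleRate;                     // sample instant
            double analog = Math.Sin(2 * Math.PI * frequency * t); // value in [-1, 1]
            samples[n] = (short)(analog * short.MaxValue);         // quantized sample
        }

        Console.WriteLine("First two quantized samples: " + samples[1] + ", " + samples[2]);
    }
}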
Acoustic Model
An acoustic model is created by taking audio recordings of speech and their text
transcriptions, and using software to create statistical representations of the sounds that
make up each word. It is used by a speech recognition engine to recognize speech [8]. The
acoustic model of the software breaks the words into phonemes [10].
Language Model
Language modeling is used in many natural language processing applications; in speech
recognition, it tries to capture the properties of a language and to predict the next word in
the speech sequence [8]. The language model of the software compares the phonemes to words in
its built-in dictionary [10].
Speech Engine
The job of the speech recognition engine is to convert the input audio into text [4]; to
accomplish this it uses all sorts of data, software algorithms, and statistics. Its first
operation is digitization, as discussed earlier: converting the audio into a format suitable
for further processing. Once the audio signal is in the proper format, the engine searches
for the best match, considering the words it knows; once the signal is recognized, it returns
the corresponding text string.
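The project's own engine is driven through SAPI interop (see Appendix A), but as a minimal sketch of the engine's job described above, the managed System.Speech API can digitize microphone input, match it against a dictation vocabulary, and return the corresponding text string:

using System;
using System.Speech.Recognition;

class EngineDemo
{
    static void Main()
    {
        // In-process recognition engine; digitization is handled internally.
        using (var engine = new SpeechRecognitionEngine())
        {
            engine.SetInputToDefaultAudioDevice();      // voice input from the microphone
            engine.LoadGrammar(new DictationGrammar()); // general dictation vocabulary

            Console.WriteLine("Speak now...");
            RecognitionResult result = engine.Recognize(); // search for the best match
            Console.WriteLine(result != null ? result.Text : "(nothing recognized)");
        }
    }
}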
2.6 Applications
Some other application areas of speech recognition technology are described below [13]:
Command and Control
ASR systems that are designed to perform functions and actions on the system are defined as
command and control systems. Utterances like "Open Netscape" and "Start a new browser" will
do just that.
Telephony
Some Voice Mail systems allow callers to speak commands instead of pressing buttons to
send specific tones.
Medical/Disabilities
Many people have difficulty typing due to physical limitations such as repetitive strain
injuries (RSI), muscular dystrophy, and many others. For example, people with difficulty
hearing could use a system connected to their telephone to convert the caller's speech to text.
2.7 Speech Recognition Weakness And Flaws
Homonyms: words that are spelled differently and have different meanings but sound the same,
for example "there" and "their", or "be" and "bee". It is a challenge for the machine to
distinguish between such words and phrases that sound alike.
Noise factor: the program requires hearing the words uttered by a human distinctly and
clearly. Any extra sound can create interference: first you need to place the system away
from noisy environments, and then speak clearly, or else the machine will get confused and
mix up the words.
2.9 Few Speech Recognition Software
2.9.1 X Voice
HomePage: http://www.compapp.dcu.ie/~tdoris/Xvoice/
http://www.zachary.com/creemer/xvoice.html
Project: http://xvoice.sourceforge.net
2.9.2 ISIP
The Institute for Signal and Information Processing (ISIP) at Mississippi State University
has made its speech recognition engine available. The toolkit includes a front-end, a
decoder, and a training module. It is a functional toolkit. This software is primarily for
developers.
The toolkit (and more information about ISIP) is available at:
2.9.3 Ears
Although Ears isn't fully developed, it is a good starting point for programmers wishing
to get started in ASR. This software is primarily for developers.
Homepage: http://www.speech.cs.cmu.edu/sphinx/Sphinx.html
Source: http://download.sourceforge.net/cmusphinx/sphinx2-0.1a.tar.gz
The NICO Artificial Neural Network toolkit is a flexible back-propagation neural network
toolkit optimized for speech recognition applications.
This software is primarily for developers.
Homepage: http://www.speech.kth.se/NICO/index.html
CHAPTER 3
3.1.1 Utterances
When a user says something, that is an utterance [13]; in other words, speaking a word or a
combination of words that means something to the computer is called an utterance. Utterances
are then sent to the speech engine to be processed.
3.1.2 Pronunciation
Part of the data a speech recognition engine uses to process a word is its pronunciation,
which represents what the speech engine thinks a word should sound like [4]. Words can have
multiple pronunciations associated with them.
3.1.3 Grammar
A grammar uses a particular set of rules to define the words and phrases that are going to be
recognized by the speech engine; more concisely, the grammar defines the domain within which
the speech engine works [4]. A grammar can be as simple as a list of words, or flexible
enough to support various degrees of variation.
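For illustration, a grammar that is "as simple as a list of words" can be built as in the sketch below, using the managed System.Speech API (an assumption for this example; the project itself drives grammars through SAPI interop, and the command words here are made up):

using System.Speech.Recognition;

class GrammarDemo
{
    static Grammar BuildCommandGrammar()
    {
        // A grammar as simple as a list of words: the engine will only
        // recognize utterances drawn from these choices.
        Choices commands = new Choices("open", "save", "clear", "exit");
        GrammarBuilder builder = new GrammarBuilder(commands);
        return new Grammar(builder) { Name = "simple commands" };
    }
}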
3.1.4 Accuracy
The performance of a speech recognition system is measurable [4]; the ability of the
recognizer can be measured by calculating its accuracy, that is, how reliably it identifies
an utterance.
3.1.5 Vocabularies
Vocabularies are the lists of words that can be recognized by the speech recognition
engine [4]. Generally, smaller vocabularies are easier for a speech recognition engine to
identify, while a large list of words is a more difficult task for the engine.
3.1.6 Training
Training can be used by users who have difficulty speaking or pronouncing certain words;
speech recognition systems with training should be able to adapt.
Speaker dependence describes the degree to which a speech recognition system requires
knowledge of a speaker‟s individual voice characteristics to successfully process speech. The
speech recognition engine can “learn” how you speak words and phrases; it can be trained to
your voice.
Speech recognition systems that require a user to train the system to his/her voice are known
as speaker-dependent systems. If you are familiar with desktop dictation systems, most are
speaker dependent. Because they operate on very large vocabularies, dictation systems
perform much better when the speaker has spent the time to train the system to his/her voice.
Speech recognition systems that do not require a user to train the system are known as
speaker-independent systems. Speech recognition in the VoiceXML world must be
speaker-independent. Think of how many users (hundreds, maybe thousands) may be calling
into your web site. You cannot require that each caller train the system to his or her voice.
The speech recognition system in a voice-enabled web application MUST successfully
process the speech of many different callers without having to understand the individual
voice characteristics of each caller.
Hardware
1. Microphone
2. Speakers
Software
1. SmartDraw 2000 (for drawing the Gantt chart and the speech recognition model)
2. Visual Paradigm for UML 7.1 (for use case and activity diagrams)
3. MS Paint
4. Microsoft Robotics DSS node
5. Windows XP SP3/Vista/7
6. Microsoft Speech API 5.0
7. Visual Studio with the .NET Framework
8. MS Office 2007 (documentation)
3.3 Methodology
As speech recognition is an emerging technology, not all developers are familiar with it.
While the basic functions of both speech synthesis and speech recognition take only a few
minutes to understand (after all, most people learn to speak and listen by age two), there
are subtle and powerful capabilities provided by computerized speech that developers will
want to understand and utilize.
Despite very substantial investment in speech technology research over the last 40 years,
speech synthesis and speech recognition technologies still have significant limitations. Most
importantly, speech technology does not always meet the high expectations of users familiar
with natural human-to-human speech communication. Understanding the limitations - as well
as the strengths - is important for effective use of speech input and output in a user interface
and for understanding some of the advanced features of the Microsoft Speech API.
An understanding of the capabilities and limitations of speech technology is also important
for developers in making decisions about whether a particular application will benefit from
the use of speech input and output.
The SAPI application programming interface (API) dramatically reduces the code overhead
required for an application to use speech recognition and text-to-speech, making speech
technology more accessible and robust for a wide range of applications.
API Overview
API for Text-to-Speech
API for Speech Recognition
The SAPI API provides a high-level interface between an application and speech engines.
SAPI implements all the low-level details needed to control and manage the real-time
operations of various speech engines.
The two basic types of SAPI engines are text-to-speech (TTS) systems and speech
recognizers. TTS systems synthesize text strings and files into spoken audio using synthetic
voices. Speech recognizers convert human spoken audio into readable text strings and files.
API for Text-to-Speech
Applications can control text-to-speech (TTS) using the ISpVoice Component Object Model
(COM) interface. Once an application has created an ISpVoice object (see the Text-to-Speech
Tutorial), the application only needs to call ISpVoice::Speak to generate speech output from
some text data. In addition, the ISpVoice interface also provides several methods for
changing voice and synthesis properties, such as the speaking rate (ISpVoice::SetRate), the
output volume (ISpVoice::SetVolume), and the current speaking voice (ISpVoice::SetVoice).
Special SAPI controls can also be inserted along with the input text to change real-time
synthesis properties like voice, pitch, word emphasis, speaking rate, and volume. This
synthesis markup (sapi.xsd), using standard XML format, is a simple but powerful way to
customize the TTS speech, independent of the specific engine or voice currently in use.
The ISpVoice::Speak method can operate either synchronously (return only when completely
finished speaking) or asynchronously (return immediately and speak as a background
process). When speaking asynchronously (SPF_ASYNC), real-time status information such
as the speaking state and current text location can be polled using ISpVoice::GetStatus. Also
while speaking asynchronously, new text can be spoken either by immediately interrupting the
current output (SPF_PURGEBEFORESPEAK) or by automatically appending the new text
to the end of the current output.
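A minimal C# sketch of these calls is shown below, using the SpeechLib COM interop assembly that the project's own source imports (the spoken text is arbitrary; SpVoice is the COM class behind ISpVoice, and the interop flag names differ slightly from the C-style SPF_ constants):

using System.Threading;
using SpeechLib; // SAPI COM interop, as imported by the project's source

class SpVoiceDemo
{
    static void Main()
    {
        SpVoice voice = new SpVoice();
        voice.Rate = 2;    // speaking rate (SAPI range is -10 to 10)
        voice.Volume = 80; // output volume (0 to 100)

        // Speak asynchronously and return immediately.
        voice.Speak("This sentence is spoken in the background.",
                    SpeechVoiceSpeakFlags.SVSFlagsAsync);

        // Interrupt the current output and start over.
        voice.Speak("Never mind, speak this instead.",
                    SpeechVoiceSpeakFlags.SVSFlagsAsync |
                    SpeechVoiceSpeakFlags.SVSFPurgeBeforeSpeak);

        voice.WaitUntilDone(Timeout.Infinite); // block until speaking finishes
    }
}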
In addition to the ISpVoice interface, SAPI also provides many utility COM interfaces for the
more advanced TTS applications.
API for Speech Recognition
Just as ISpVoice is the main interface for speech synthesis, ISpRecoContext is the main
interface for speech recognition. Like the ISpVoice, it is an ISpEventSource, which means
that it is the speech application's vehicle for receiving notifications for the requested speech
recognition events.
An application has the choice of two different types of speech recognition engines
(ISpRecognizer). A shared recognizer, which can be shared with other speech recognition
applications, is recommended for most desktop applications; an in-process (InProc)
recognizer, by contrast, gives the application exclusive control of the recognition engine.
The next step is to set up notifications for events the application is interested in. As the
ISpRecognizer is also an ISpEventSource, which in turn is an ISpNotifySource, the
application can call one of the ISpNotifySource methods from its ISpRecoContext to indicate
where the events for that ISpRecoContext should be reported. Then it should call
ISpEventSource::SetInterest to indicate which events it needs to be notified of. The most
important event is the SPEI_RECOGNITION, which indicates that the ISpRecognizer has
recognized some speech for this ISpRecoContext. See SPEVENTENUM for details on the
other available speech recognition events.
Finally, a speech application must create, load, and activate an ISpRecoGrammar, which
essentially indicates what type of utterances to recognize, i.e., dictation or a command and
control grammar. First, the application creates an ISpRecoGrammar using
ISpRecoContext::CreateGrammar. Then, the application loads the appropriate grammar,
either by calling ISpRecoGrammar::LoadDictation for dictation or one of the
ISpRecoGrammar::LoadCmdxxx methods for command and control. Finally, in order to
activate these grammars so that recognition can start, the application calls
ISpRecoGrammar::SetDictationState for dictation or ISpRecoGrammar::SetRuleState or
ISpRecoGrammar::SetRuleIdState for command and control.
When recognitions come back to the application by means of the requested notification
mechanism, the lParam member of the SPEVENT structure will be an ISpRecoResult by
which the application can determine what was recognized and for which ISpRecoGrammar of
the ISpRecoContext.
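This flow takes only a few lines of C# with the SpeechLib interop assembly; the hedged sketch below mirrors the (partly commented-out) dictation setup in the project's Speech2T.cs, with the Recognition event playing the role of the SPEI_RECOGNITION notification described above:

using System;
using SpeechLib; // SAPI COM interop, as imported by the project's source

class RecoDemo
{
    static void Main()
    {
        // Shared recognizer context, as used by the project.
        SpSharedRecoContext listener = new SpSharedRecoContext();
        listener.Recognition +=
            new _ISpeechRecoContextEvents_RecognitionEventHandler(OnRecognition);

        // Create, load, and activate a dictation grammar.
        ISpeechRecoGrammar grammar = listener.CreateGrammar(0);
        grammar.DictationLoad("", SpeechLoadOption.SLOStatic);
        grammar.DictationSetState(SpeechRuleState.SGDSActive);

        Console.WriteLine("Dictating; press Enter to quit.");
        Console.ReadLine();
    }

    static void OnRecognition(int streamNumber, object streamPosition,
                              SpeechRecognitionType recognitionType,
                              ISpeechRecoResult result)
    {
        // Extract the recognized text from the phrase information.
        Console.WriteLine(result.PhraseInfo.GetText(0, -1, true));
    }
}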
SPEECH SYNTHESIS
A speech synthesizer converts written text into spoken language. Speech synthesis is also
referred to as text-to-speech (TTS) conversion.
• Text-to-phoneme conversion: Convert each word to phonemes. Different languages have
different phoneme sets; for example, Japanese has fewer phonemes, including sounds not found
in English, such as the "ts" in "tsunami".
• Prosody analysis: Process the sentence structure, words and phonemes to determine
appropriate prosody for the sentence. Prosody includes many of the features of speech other
than the sounds of the words being spoken. This includes the pitch (or melody), the timing (or
rhythm), the pausing, the speaking rate, the emphasis on words and many other features.
Correct prosody is important for making speech sound right and for correctly conveying the
meaning of a sentence.
• Waveform production: Finally, the phonemes and prosody information are used to
produce the audio waveform for each sentence. There are many ways in which the speech can
be produced from the phoneme and prosody information. Most current systems do it in one of
two ways: concatenation of chunks of recorded human speech, or formant synthesis using
signal processing techniques based on knowledge of how phonemes sound and how prosody
affects those phonemes. The details of waveform generation are not typically important to
application developers.
The Text to Speech service is designed to provide your applications with a verbal interface. It
can be used in conjunction with the Speech Recognizer service for two-way communication
with the computer.
As a type of speech engine, much of the functionality of a synthesizer is provided by the
SpeechSynthesizer class in the System.Speech.Synthesis namespace and by other classes in
that namespace.
Inheritance Hierarchy:
System.Object
System.Speech.Synthesis.SpeechSynthesizer
Namespace: System.Speech.Synthesis
Assembly: System.Speech (in System.Speech.dll)
Syntax: public sealed class SpeechSynthesizer : IDisposable
Constructor:
Name: SpeechSynthesizer
Description: The central class of the System.Speech.Synthesis namespace; a speech
synthesizer is obtained by constructing a SpeechSynthesizer instance. SetOutputToWaveFile,
along with a Speak call, provides the formatted output of the appropriate synthesizer. In
this example, a synthesizer that speaks English is requested.
Properties:
Name: Rate
Description: The rate at which the user wants to listen.
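A short, hedged example of using this class follows (the output file path and the spoken text are placeholders for this sketch):

using System.Speech.Synthesis;

class SynthDemo
{
    static void Main()
    {
        using (var synth = new SpeechSynthesizer())
        {
            synth.Rate = -2;   // managed API range is -10 (slow) to 10 (fast)
            synth.Volume = 90; // 0 to 100

            // Route output to a .wav file, then speak the text into it.
            synth.SetOutputToWaveFile(@"C:\temp\demo.wav"); // placeholder path
            synth.Speak("Speech Text Artifice demo.");

            // Switch back to the speakers for audible output.
            synth.SetOutputToDefaultAudioDevice();
            synth.Speak("Done writing the wave file.");
        }
    }
}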
SPEECH RECOGNITION
Speech recognition (SR) converts spoken words to written text and as a result can be used to
provide user interfaces that use spoken input. The Speech Recognizer service enables you to
include speech recognition support for your application. Speech recognition requires a special
type of software, called an SR engine.
The SR engine may be installed with the operating system or at a later time with other
software. Speech-enabled packages, such as word processors and web browsers, may install
their own engines or use existing ones. Additional engines are also available through
third-party manufacturers. These engines are typically designed to support only a specific
language, and may also target a certain vocabulary, for example a vocabulary specializing in
medical or legal terminology.
Operations:
The Speech Recognizer service supports the following requests and notifications.
InsertGrammarEntry: Inserts the specified entry (or entries) of the supplied grammar into
the current grammar dictionary. If certain entries already exist, a Fault is returned and
the whole operation fails without the current dictionary being modified at all.
UpsertGrammarEntry: Inserts entries from the supplied grammar into the current dictionary
if they do not yet exist, or updates entries that already exist with entries from the
supplied grammar.
SetSrgsGrammarFile: Sets the grammar type to SRGS file and tries to load the specified
file, which has to reside inside your application's /store folder (directory). If loading
the file fails, a Fault is returned and the speech recognizer returns to the state it was
in before it processed this request. SRGS grammars require Windows Vista or Windows 7 and
will not work on Windows XP or Windows Server 2003.
EmulateRecognize: Sets the SR engine to emulate speech input, but using text (string). This
is mostly used for testing and debugging.
GrammarType: Specifies the type of grammar the SR engine will use, either a simple
dictionary grammar or an SRGS grammar.
SpeechDetected: Indicates that speech (audio) has been detected and is being processed.
SpeechRecognitionRejected: Indicates that speech was detected, but not recognized as one of
the words or phrases in the current grammar dictionary. The duration of the speech is
available as DurationInTicks.
To support SR you define a grammar (the words and phrases to be recognized) and then use the
notifications provided by the service to determine what the SR engine recognized as the
spoken input. The Speech Recognizer service supports the use of simple dictionary-style
grammars.
System.Speech.Recognition Namespace
The Recognition namespace contains Windows Desktop Speech technology types for
implementing speech recognition.
The Windows Desktop Speech Technology software offers a basic speech recognition
infrastructure that digitizes acoustical signals, and recovers words and speech elements from
audio input.
Applications manage and use grammars (sets of rules defining how specific combinations of
words and phrases are to be understood) through the general-purpose Grammar class, which
hosts runtime, persisted, or dynamically constructed grammar instances.
The SpeechRecognizer class is used to create client applications making use of a system's
current recognition technology, which is configured through the Audio Input member of the
Control Panel, and a computer's default audio input mechanism.
Inheritance Hierarchy:
System.Object
System.Speech.Recognition.SpeechRecognizer
Namespace: System.Speech.Recognition
Assembly: System.Speech (in System.Speech.dll)
Syntax: public class SpeechRecognizer : IDisposable
Inheritance Hierarchy:
System.Object
System.Speech.Recognition.Grammar
System.Speech.Recognition.DictationGrammar
Namespace: System.Speech.Recognition
Assembly: System.Speech (in System.Speech.dll)
Syntax: public class Grammar
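Putting the SpeechRecognizer and Grammar classes together, a minimal dictation client using the shared recognizer might look like the following sketch (the message loop via Windows Forms is an assumption made so the shared recognizer's events are delivered):

using System;
using System.Speech.Recognition;
using System.Windows.Forms; // provides a message loop for event delivery

class SharedRecoDemo
{
    [STAThread]
    static void Main()
    {
        // Uses the system's current recognition technology, as configured
        // through the Speech applet of the Control Panel.
        var recognizer = new SpeechRecognizer();
        recognizer.LoadGrammar(new DictationGrammar());
        recognizer.SpeechRecognized += (sender, e) =>
            Console.WriteLine("Heard: " + e.Result.Text);

        Console.WriteLine("Dictating via the shared recognizer...");
        Application.Run(); // keep pumping messages so recognition events arrive
    }
}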
The SpeechRecognizer service represents the core speech recognition service (as opposed to
the SpeechRecognizerGui service which offers the user interface component to the core
service). The core service allows for usage of simple dictionary-style grammars as well as
complex SRGS (Speech Recognition Grammar Specification) grammars, specified in XML.
It does not require any connections, and it will start up when you run the diagram. You can
also optionally start an instance of the Speech Recognizer GUI once you have a DSS node
running by using a web browser and going to the Control Panel page. Starting the service will
automatically attempt to load the default SR engine.
3.6 (c) Steps required to evolve a dictionary-style grammar for Speech-To-Text
conversion using the Microsoft Robotics DSS command node:
The SpeechRecognizer service supports the Initial State partner. The initial state is used
to configure the service. The default config file has to be called
"SpeechRecognizer.config.xml", and it specifies the commands that will be used by the
recognizer.
Start the DSS Command Prompt from the Start > All Programs menu.
Start a DSS Host node and create an instance of the service by typing the following
command:
At the bottom of the SpeechRecognizerGui web page you can define a grammar. Note that
the SpeechRecognizer only recognizes words and phrases that are in its grammar. If the
grammar is empty, then nothing will be recognized.
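The DSS service loads its SRGS grammar from the config file described above, but for illustration the same kind of command grammar can be built programmatically with the desktop API, as in this sketch (the rule name and command words are made up):

using System.Speech.Recognition;
using System.Speech.Recognition.SrgsGrammar;

class SrgsDemo
{
    static Grammar BuildSrgsGrammar()
    {
        // An SRGS document with a single root rule listing the commands
        // the recognizer should accept; anything else is rejected.
        var doc = new SrgsDocument();
        var commands = new SrgsRule("commands", new SrgsOneOf("open", "save", "clear"));
        doc.Rules.Add(commands);
        doc.Root = commands;
        return new Grammar(doc);
    }
}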
CHAPTER 4
DOCUMENTATION
All the components mentioned above must be installed prior to the execution of the system.
The performance of the system largely depends on the type of hardware used, especially the
microphone. There are special headphones which provide relatively more accurate results.
Sound cards
Speech requires relatively low bandwidth, so a high-quality 16-bit sound card is more than
adequate. Sound must be enabled, and the proper driver should be installed. Sound cards with
the 'cleanest' A/D (analog-to-digital) conversion are recommended, but most often the clarity
of the digital sample depends more on the microphone quality, and even more on the
environmental noise. Some speech recognition systems might require specific sound cards.
Microphones
A quality microphone is key when using a speech recognition system. Desktop microphones are
not suitable for speech recognition, because they have a tendency to pick up more ambient
noise. The best and most common choice is the headset style: it minimizes ambient noise
while keeping the microphone close to the mouth at all times. Headsets are available with or
without earphones (mono or stereo).
Computer/ Processors
Speech recognition applications can be heavily dependent on processing speed.
This is because a large amount of digital filtering and signal processing can take place in
ASR.
The developer machine as well as the user machine must have the following software installed
for the correct working of the system:
Visual Paradigm for UML 7.1 (for use case and activity diagrams)
MS Paint
Microsoft Robotics DSS node
Windows XP SP3/Vista/7
The context diagram shows how the other systems interact with SPEECH TEXT ARTIFICE. SAPI is
the backbone of the system; the software uses the various interfaces provided by SAPI for
speech recognition and speech synthesis. Special SAPI controls can also be inserted along
with the input text to change real-time synthesis properties like voice, pitch, word
emphasis, speaking rate, and volume. The application interacts with SAPI using the API
(application programming interface), and SAPI interacts with the recognition and TTS engines
using the DDI (device driver interface).
CONTEXT diagram
The following figure describes the sequence in which the processes are performed for the
TEXT-to-SPEECH as well as the SPEECH-to-TEXT interface.
For TTS, text is entered as input and then worked upon by the API for synthesis and
processing, passing it on to the TTS engine, which in turn provides SPEECH as output through
the included hardware.
For STT, speech is entered as input and then processed by SAPI for recognition of the
acoustics, so that the output can be transferred to the STT engine, which provides suitable
output in the form of TEXT within the text pad provided in the software.
SEQUENCE diagram
PACKAGES are UML constructs that enable you to organize model elements into groups, making
your UML diagrams simpler and easier to understand. Packages are depicted as file folders
and can be used in any UML diagram, although they are most common in USE CASE and CLASS
diagrams, because these models have a tendency to grow.
CHAPTER 5
5.1 WORKING
This software is designed to recognize speech and also has capabilities for speaking and
synthesizing; that is, it can convert speech to text and text to speech. The software, named
'SPEECH TEXT ARTIFICE', has the capability to write spoken words into its notepad-style text
area, can recognize commands such as "save", "open", and "clear", and is capable of opening
Windows programs such as Notepad, MS Paint, and Calculator through voice input.
The synthesis part of the software helps in verifying the various operations done by the
user, such as reading out the written text, and also informs the user what type of action is
being performed, such as saving a document, opening a new file, or opening a file previously
saved on the hard disk.
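As an illustration of how a recognized command can launch a Windows program, the hedged sketch below maps a few spoken words to Process.Start calls; it is not the project's exact handler, and the command words are assumptions:

using System;
using System.Diagnostics;
using System.Speech.Recognition;
using System.Windows.Forms;

class CommandDemo
{
    [STAThread]
    static void Main()
    {
        var recognizer = new SpeechRecognizer();
        recognizer.LoadGrammar(new Grammar(new GrammarBuilder(
            new Choices("notepad", "paint", "calculator"))));

        recognizer.SpeechRecognized += (sender, e) =>
        {
            // Map each spoken command to the matching Windows program.
            switch (e.Result.Text)
            {
                case "notepad": Process.Start("notepad.exe"); break;
                case "paint": Process.Start("mspaint.exe"); break;
                case "calculator": Process.Start("calc.exe"); break;
            }
        };

        Application.Run(); // message loop so recognition events arrive
    }
}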
Test Case No: 1
Status: PASSED
Test Case No: 2
Status: PASSED
Test Case No: 3
Description: Checking Up
Status: PASSED
Test Case No: 4
Status: PASSED
Test Case No: 1
Test Case No: 2
Test Case No: 3
Description: User training is done (half), but a person who hasn't done the training is
speaking.
Result: Training improves accuracy, but every user must load his/her own profile while
dictating.
CHAPTER 6
CONCLUSION
6.2 Disadvantages
Low accuracy.
Not good in noisy environments.
This thesis/project work on speech recognition started with a brief introduction to the
technology and its applications in different sectors. The project part of the report was
based on software development for speech recognition. At a later stage, we discussed the
different tools for bringing that idea into practice. After the development of the software,
it was finally tested, the results were discussed, and a few deficiencies were brought to
light. After the testing work, the advantages of the software were described, and
suggestions for further enhancement and improvement were discussed.
REFERENCES
[9] http://electronics.howstuffworks.com/gadgets/high-tech-gadgets/speechrecognition.htm/printable (last updated: 30 October 2009)
[10] http://www.jisc.ac.uk/media/documents/techwatch/ruchi.pdf
[11] http://electronics.howstuffworks.com/gadgets/high-tech-gadgets/speech-recognition3.htm
[12] http://en.wikipedia.org/wiki/Speech_recognition (visited: 6 December 2010)
[13] Stephen Cook, "Speech Recognition HOWTO", Revision v2.0, April 19, 2002. Source: http://www.scribd.com/doc/2586608/speechrecognitionhowto
[14] B.H. Juang & Lawrence R. Rabiner, "Automatic Speech Recognition - A Brief History of the Technology Development", 10/08/2004. Source: http://www.ece.ucsb.edu/Faculty/Rabiner/ece259/Reprints/354_LALI-ASRHistory-final-10-8.pdf
APPENDIX –A
SOURCE CODE
using System;
using System.Collections.Generic;
using System.Linq;
using System.Windows.Forms;
namespace demo11
{
static class Program
{
/// <summary>
/// The main entry point for the application.
/// </summary>
[STAThread]
static void Main()
{
    // Standard WinForms entry point; the original method body fell on a lost
    // page, so this is the usual template (reconstructed, not verbatim).
    Application.EnableVisualStyles();
    Application.SetCompatibleTextRenderingDefault(false);
    Application.Run(new Form1());
}
}
}

MAIN FORM :- {Form1.cs}
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using System.ServiceProcess;
using System.Management;
using System.Runtime.InteropServices;
using System.Threading;
namespace demo11
{
public partial class Form1 : Form
{
int kk; string str = "\n";
public Form1()
{
InitializeComponent();
button1.Enabled = false;
menuStrip1.Visible = false;
}
demo11.text2s tt = new text2s();
demo11.speech2t ttt = new speech2t();
// ...
tttt.Hide();
tt.Hide();
ttt.Hide();
}
// string s;
// s = string.IsNullOrEmpty(obj.GetPropertyValue("DeviceName").ToString()) ?
//     string.Empty : obj.GetPropertyValue("DeviceName").ToString();
// textBox1.Text = s.ToString();
// }
}
private void tEXTTOSPEECHToolStripMenuItem_Click(object sender, EventArgs e)
{
tt.MdiParent = this;
tt.WindowState = FormWindowState.Maximized;
tt.Show();
}
// Handler for the SPEECH TO TEXT menu item; the signature fell on a lost
// page, so the method name below is an assumption.
private void sPEECHTOTEXTToolStripMenuItem_Click(object sender, EventArgs e)
{
ttt.MdiParent = this;
ttt.WindowState = FormWindowState.Maximized;
ttt.Show();
}
// ...
tt.WindowState = FormWindowState.Maximized;
tt.Show();
menuStrip1.Visible = false;
menuStrip3.Visible = true;
menuStrip2.Visible = false;
}
// After pressing this button, the instructions written in this function will be executed
// during the loading of this page; the driver's information will be shown after a specific time.
// sp.Show();
// sp.WindowState = FormWindowState.Normal;
//}
menuStrip4.Visible = false;
displayy();
str1 = str;
}
}
// Displays the start-up text in the text box incrementally using a timer.
void displayy()
{
System.Windows.Forms.Timer tm = new System.Windows.Forms.Timer();
tm.Interval = 2500;
tm.Tick += Mytick;
tm.Start();
progressBar1.Increment(3);
}
int i = 0;
public void Mytick(object obj, System.EventArgs e)
{
if (i < 20)
{
textBox1.Text = str;
i++;
}
else
{
((System.Windows.Forms.Timer)obj).Stop();
}
button1.Enabled = true;
}
//void MyTick(object obj, System.EventArgs ea)
//{
// // sp.Hide();
// // sp.Dispose();
// //this.Show();
// //this.WindowState = FormWindowState.Maximized;
// //this.WindowState = FormWindowState.Normal;
// ((System.Windows.Forms.Timer)obj).Stop();
//}
TEXT TO SPEECH :- {Text2S.cs}
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using SpeechLib;
using System.IO;
using System.Threading;
namespace demo11
{
public partial class text2s : Form
{
int X = 50; // level shown in label2, stepped by 10 between 10 and 100 (appears to control volume or rate)
public text2s()
{
InitializeComponent();
label2.Text = X.ToString();
}
// After pressing the Speak button, the code speaks whatever it finds in the main text box.
// ...
if (X == 100)
{ button2.Enabled = false; }
label2.Text = X.ToString();
if (X > 10)
{
X = X - 10;
button2.Enabled = true ;
}
if (X == 10)
button3.Enabled = false;
label2.Text = X.ToString();
}
// After pressing this button, an open file dialog appears; the user can select any text
// (Notepad) file and see its data in the text box.
if (openfiledialog1.ShowDialog() != DialogResult.Cancel)
{
string stro = openfiledialog1.FileName.ToString();
StreamReader objreader;
objreader = new StreamReader(stro);
textBox1.Text = objreader.ReadToEnd();
objreader.Close();
}
// After pressing this button, the user can save the text box content into a text file.
if (saveFileDialog1.ShowDialog() != DialogResult.Cancel)
{
string str = saveFileDialog1.FileName.ToString();
File.WriteAllText(str, textBox1.Text);
}
}
//this button will clear content of text box.
private void button6_Click(object sender, EventArgs e)
{
textBox1.Text = "";
}
}
}
SPEECH TO TEXT :-
{Speech2T.cs}
using Microsoft.Win32;
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using SpeechLib;
using System.IO;
using System.Threading;
using System.Speech;
using System.Speech.Recognition;
using System.Speech.Synthesis.TtsEngine;
namespace demo11
{
public partial class speech2t : Form
{
public speech2t()
{
InitializeComponent();
}
public void listener_Reco(int StreamNumber, object StreamPosition,
SpeechRecognitionType RecognitionType, ISpeechRecoResult Result)
{
// The declaration of 'heard' fell on a lost page; reconstructed here:
// extract the recognized phrase text from the recognition result.
string heard = Result.PhraseInfo.GetText(0, -1, true);
textBox1.Text += heard;
textBox1.Text = textBox1.Text.ToString() + " ";
}
// After pressing this button, the application will listen for words from the user and
// write them into the text box.
SpSharedRecoContext listener;
// Grammar object
ISpeechRecoGrammar grammar;
listener = new SpeechLib.SpSharedRecoContext();
listener.Recognition += new
_ISpeechRecoContextEvents_RecognitionEventHandler(listener_Reco);
//grammar = listener.CreateGrammar(0);
//grammar.DictationLoad("",SpeechLoadOption.SLOStatic);
//grammar.DictationSetState(SpeechRuleState.SGDSActive);
//SpeechRecognitionEngine RecognitionEngine = new SpeechRecognitionEngine();
//RecognitionEngine.LoadGrammar(new DictationGrammar());
//RecognitionResult Result = RecognitionEngine.Recognize(new
//    SetInputToDefaultAudioDevice());
//StringBuilder Output = new StringBuilder();
//foreach (RecognizedWordUnit Word in Result.Words)
//{
// textBox1.Text = Result.Words.ToString();
//}
Stream myStream; // declaration reconstructed; it fell on a lost page
if (saveFileDialog1.ShowDialog() == DialogResult.OK)
{
if ((myStream = saveFileDialog1.OpenFile()) != null)
{
// Code to write the stream goes here.
myStream.Close();
}
}
// Save the dictated text to a file chosen by the user.
if (saveFileDialog1.ShowDialog() != DialogResult.Cancel)
{
string str = saveFileDialog1.FileName.ToString();
File.WriteAllText(str, textBox1.Text);
}
}
if (openfiledialog1.ShowDialog() != DialogResult.Cancel)
{
string stro = openfiledialog1.FileName.ToString();
StreamReader objreader = new StreamReader(stro);
textBox1.Text = objreader.ReadToEnd();
objreader.Close();
}
// After pressing this button, the content of the text box is erased and the user can
// dictate into a new file.
// After pressing this button, the user can open any file on the hard disk and append
// data to it.
if (openfiledialog1.ShowDialog() != DialogResult.Cancel)
{
string stro = openfiledialog1.FileName.ToString();
StreamReader objreader = new StreamReader(stro);
textBox1.Text = objreader.ReadToEnd();
objreader.Close();
}
}
// After pressing this button, the user can save his/her content to the hard disk as a
// text file for future use.
if (saveFileDialog1.ShowDialog() != DialogResult.Cancel)
{
string str = saveFileDialog1.FileName.ToString();
File.WriteAllText(str, textBox1.Text);
}
}
}
}
APPENDIX –B
SNAPSHOTS
This is the starting page when the application is run. A splash screen appears as soon as
the interface starts. After checking for the hardware and sound drivers, the information
generated is displayed. If any bugs are found, they are reported thereafter.
The main screen of our software shows various menus: FILE, VIEW, HELP, ABOUT US. Clicking
the VIEW menu opens the SPEECH TO TEXT and TEXT TO SPEECH interfaces.
APPENDIX –C
GLOSSARY
cmdAssociatePhrase: This method can be used to add new phrases for applications that have
not yet been added to the grammar list.
cmdLoadFromFile: A method by which a grammar file (XML) can be loaded onto a grammar object.
Grammar file: An XML file that stores the phrases that the recognition engine should
recognize or look for.
Grammar object: An object that can load an XML grammar file.
Phoneme: The smallest unit of sound that makes up a word; the recognition engine's acoustic
model breaks words into phonemes.
RC_FalseRecognition: An event that is fired whenever a phrase is not recognized.
Recognition engine: An engine that analyzes speech spoken via a microphone.
Speech recognition: The method by which the human voice is recognized and the consequent
action takes place.
Speech synthesis: The method by which written text is spoken by the interface.
User training: The training wizard present in Windows that trains the recognition engine
and maintains the user profile.