Speech recognition breaks the spoken words down into phonemes, the basic sounds from which syllables and words are built up. These are analysed to find the string of units that best fits an acceptable phoneme string or structure that the software can derive from its dictionary.
Introduction
Eventually, speech may be the way most people want to interact with their PCs and other computing devices. In this project, we are working to make that possible. Speech recognition is built right into Windows, representing a big step towards realizing that vision.

With Windows Java Speech Recognition, we had several speech recognition goals based on user feedback about earlier versions of our speech recognition software. When we started this project, we held a number of focus groups, which included people with and without disabilities. From those focus groups we learned what our users were struggling with and what they wanted to do, so we could build a system that solved those problems. Many things contributed to what we achieved with Windows Java Speech Recognition, but one of the key decisions was that we started by asking users what they needed.

We did something innovative in Windows that simplifies user training and makes it easier for people to learn how to use our speech recognition software: an interactive tutorial that provides both detailed instruction and practice opportunities. Instead of forcing users to read long paragraphs of text to train the system to understand how they say various words, we use the words they speak as they work through the tutorial for the same purpose. Once users complete the tutorial, there is no need for additional training unless their speaking style is very far outside the norm. In those cases, users will still have a reasonable experience based on what the system learns during the tutorial, but they will have a better one if they go back and do some additional training. In either case, the system continues to learn from them as they speak during regular use.
In the past, our technology wasn't mature enough to distinguish between commands and dictation when they were used simultaneously, so we had two separate modes of operation. Users had to tell the computer when they wanted to use voice commands and when they wanted to dictate text. It was a poor user experience, especially for people with disabilities who rely on speech technology to operate their PCs. One goal was to eliminate that burden for users by making the system work seamlessly, so the computer would understand more easily whether it was receiving a voice command or dictation.
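A single-mode system can decide between commands and dictation by first matching each utterance against its command vocabulary and falling back to dictation otherwise. The sketch below illustrates the idea in Java; the class, the exact-match command set, and the string-tagged return values are all hypothetical simplifications, not the actual Windows implementation:

```java
import java.util.Set;

public class SpeechRouter {
    // Hypothetical command vocabulary; a real system would use a grammar.
    private static final Set<String> COMMANDS =
            Set.of("open file", "save file", "close window");

    /** Returns "COMMAND:..." for a recognized command, "DICTATE:..." otherwise. */
    public static String route(String utterance) {
        String normalized = utterance.trim().toLowerCase();
        if (COMMANDS.contains(normalized)) {
            return "COMMAND:" + normalized;
        }
        return "DICTATE:" + utterance.trim();
    }
}
```

In practice the match would be probabilistic (a command grammar scored against a dictation language model), but the routing structure is the same.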
Disambiguation: ambiguous situations are resolved easily through a clarification user interface. When you say a command that can be interpreted in multiple ways, the system asks which meaning you intended.
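One way to implement such clarification is to collect the recognizer's candidate interpretations and, when more than one remains, build a numbered prompt for the user. A minimal Java sketch, assuming the candidates arrive as plain strings (a hypothetical helper, not the actual Windows UI):

```java
import java.util.List;

public class Disambiguator {
    /** If exactly one candidate remains, return it directly; otherwise build a
     *  clarification prompt listing the numbered alternatives. */
    public static String resolve(List<String> candidates) {
        if (candidates.size() == 1) {
            return candidates.get(0);
        }
        StringBuilder prompt = new StringBuilder("Did you mean:");
        for (int i = 0; i < candidates.size(); i++) {
            prompt.append(" (").append(i + 1).append(") ").append(candidates.get(i));
        }
        return prompt.toString();
    }
}
```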
Proposed System
The usability studies enabled us to discover subtle user preferences that had never occurred to us. For example, most products that use speech technology require users to learn the military alphabet (Alpha, Bravo, Charlie, etc.) to spell out words, such as street names, that they want to add to the recognized vocabulary. We thought a simple mnemonic, such as "A as in Apple" or "B as in Boy," would be easier for users, so we chose the one or two most common words that started with each letter and used those. Mnemonics turned out to be a good idea, but we didn't go far enough at first. In the usability study, we put the mnemonic on the screen and gave users a spelling exercise to complete. Much to our surprise, even though the mnemonic was right in front of them, many of the users would choose a different word to represent the letter. For example, they would see "A as in Apple," but they might say "A as in Albuquerque." Because of this feedback, Windows Vista now allows users to say "A" as in any word that starts with that letter. That was a larger technological challenge for us, but it was the best solution for our users.
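The relaxed mnemonic rule described above ("A" as in any word that starts with that letter) can be checked with simple string handling once the recognizer has produced the spoken phrase. A minimal Java sketch; the class and method names are illustrative, not part of any real API:

```java
public class SpellingMnemonic {
    /** Parses phrases like "A as in Albuquerque" and returns the letter,
     *  accepting any example word that starts with that letter.
     *  Returns null if the phrase does not fit the pattern. */
    public static Character parseLetter(String phrase) {
        String[] parts = phrase.trim().split("\\s+as\\s+in\\s+");
        if (parts.length != 2 || parts[0].length() != 1) {
            return null;
        }
        char letter = Character.toUpperCase(parts[0].charAt(0));
        char first = Character.toUpperCase(parts[1].charAt(0));
        // Accept the letter only if the example word actually starts with it.
        return (letter == first) ? letter : null;
    }
}
```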
Methodology
The production of speech can be separated into two parts: producing the excitation signal and forming the spectral shape. We can therefore draw a simplified model of speech production. The model works as follows. Voiced excitation is modeled by a pulse generator, which generates a pulse train (of triangle-shaped pulses) with its spectrum given by P(f). Unvoiced excitation is modeled by a white-noise generator with spectrum N(f). To mix voiced and unvoiced excitation, one can adjust the signal amplitude of the pulse generator (v) and of the noise generator (u). The output of both generators is added and fed into the box modeling the vocal tract, which performs the spectral shaping with the transmission function H(f). The emission characteristics of the lips are modeled by R(f). Hence, the spectrum S(f) of the speech signal is given as:

S(f) = (v P(f) + u N(f)) H(f) R(f) = X(f) H(f) R(f)   (1.2)

To influence the speech sound, the speech production model offers the following parameters:
- the mixture between voiced and unvoiced excitation (determined by v and u)
- the fundamental frequency (determined by P(f))
- the spectral shaping (determined by H(f))
- the signal amplitude (depending on v and u)

These are the technical parameters describing a speech signal. To perform speech recognition, these parameters have to be computed from the time signal (this is called speech signal analysis or acoustic preprocessing) and then forwarded to the speech recognizer. For the recognizer, the most valuable information is contained in the way the spectral shape of the speech signal changes over time. To reflect these dynamic changes, the spectral shape is determined in short intervals of time, e.g., every 10 ms. Note that if the spectrum of the speech signal were computed directly, the fundamental frequency would be implicitly contained in the measured spectrum.
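Equation (1.2) can be evaluated numerically once magnitudes for the individual spectra are known. A minimal Java sketch at a single frequency, treating all quantities as real magnitudes (a simplification of the model above; real spectra are complex-valued):

```java
public class SourceFilterModel {
    /** Evaluates S(f) = (v*P(f) + u*N(f)) * H(f) * R(f) at one frequency,
     *  with p, n, h, r the magnitudes of P(f), N(f), H(f), R(f). */
    public static double speechSpectrum(double v, double p,
                                        double u, double n,
                                        double h, double r) {
        double excitation = v * p + u * n; // X(f): mixed voiced/unvoiced excitation
        return excitation * h * r;         // vocal-tract shaping and lip radiation
    }
}
```

For example, with v = 1, P(f) = 2, u = 0.5, N(f) = 4, H(f) = 0.5, R(f) = 2, the excitation X(f) is 1·2 + 0.5·4 = 4 and S(f) = 4·0.5·2 = 4.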
Requirements

Hardware requirements:
1. PC with 2 GB hard disk and 256 MB RAM (alternatives: not applicable)
2. Headphone (if available)

Software requirements:
1. Windows 95/98/XP with MS Office (alternatives: not applicable)
2. Java 1.6 (alternative: Java 1.5)
3. Eclipse IDE (alternative: NetBeans)
4. Java Speech Recognition API
Manpower requirements: 3 to 4 students can complete this in 5 to 6 months if they work full-time on it.
Future Scope
There are a number of scenarios where speech recognition is being delivered, developed, researched or seriously discussed. As with many contemporary technologies, such as the Internet, online payment systems and mobile phone functionality, development is at least partially driven by the trio of often-perceived evils that are games, gambling and girls (pornography). Though these applications are outside the educational sphere, it is important to remember that many ICT innovations incorporated into academia over the last decade were developed in other sectors.

Computer and video games
Speech input has been used in a limited number of computer and video games, on a variety of PC and console-based platforms, over the past decade. For example, the game Seaman involved growing and controlling strange half-man, half-fish characters in a virtual aquarium. A microphone, sold with the game, allowed the player to issue one of a predetermined list of command words and questions to the fish. The accuracy of interpretation seemed variable in use; during gaming sessions, colleagues with strong accents had to speak in an exaggerated and slower manner for the game to understand their commands.

Gambling
Online gambling has become a major industry in the last four years (to the degree that it has effected changes in gambling taxation laws in the UK and other countries). Speech recognition has application in games such as multiplayer online poker, where vocal commands can be heard by the other players and, where appropriate, interpreted by the host computer in order to deal more cards, adjust the money staked and so forth.

Precision surgery
Developments in keyhole and micro surgery have clearly shown that an approach of as little invasive or non-essential surgery as possible increases success rates and shortens patient recovery times. There is occasional speculation in various medical fora regarding the use of speech recognition in precision surgery, where a procedure is partially or totally carried out by automated means.