
Developments in AI

Speech and Language Processing
November 2005, week 3
Speech Recognition
Speech recognition can be understood as a sequence of processes. However, in practice
there is feedback from one stage to a previous stage: they may not be independent.
An overview of the speech recognition process is as follows.
Acoustic processing
A speech signal is detected and analysed. The process will divide the stream of sounds
into phonological units, and then see how they might be combined into phonemes and
then words.
A woman with a fast, high-pitched voice and a man with a slow, low voice
might both produce the same utterance. The processor has to recognize that these are the
same words. Though the acoustic signals vary, similar characteristics or features of the
signal can be extracted in both cases. The process of feature extraction is a key issue in
pattern matching, whether for statistical methods, as are usually used here, or the related
neural network technology.
Similar characteristics will be extracted for different users, but for most applications the
system will be customized for each individual user's voice for good performance. This
involves each user training the completed system.
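As an illustration of feature extraction, here is a minimal sketch using the librosa Python library (a modern toolkit chosen purely for illustration; the filename is hypothetical). Mel-frequency cepstral coefficients (MFCCs) are one widely used kind of feature extracted from the acoustic signal.

```python
# A minimal sketch of acoustic feature extraction, assuming the librosa
# library is installed; "utterance.wav" is a hypothetical recording.
import librosa

signal, sample_rate = librosa.load("utterance.wav", sr=None)

# 13 Mel-frequency cepstral coefficients per analysis frame: a compact
# description of the signal that is broadly stable across speakers.
features = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=13)
print(features.shape)  # (13, number_of_frames)
```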

The acoustic analysis includes investigation of
- phonological variation: the same word may come from varying wave signals,
  depending on whether it is emphasized or not, run into the next word, or spoken
  in an emotionally charged situation.
- likely sequences of sounds: the same phoneme may come from slightly different
  signals depending on neighbouring phonemes. From the human perspective, note
  the contrasting shape of your lips when you produce the sound t in "tea" and
  "to".
- word and syllable boundaries: phonemes are grouped together to form syllables,
  and then syllables are grouped to form words. This is a hierarchical process.
  Detecting syllable boundaries and then word boundaries is a significant part of
  the process. We need to distinguish "a long" from "along", or "deep art" from
  "depart".
Markov processes
Markov Models represent sequential movement from state to state. They model a linear
stream of events, where an event at time t depends on previous events. Hidden Markov
Models (HMMs) can represent a sequence of sounds when intermediate states may not be
known. They are the most widely used models at the core of acoustic analysis.
Sometimes neural nets are used for fine discrimination between sounds, but neural nets
do not represent sequential data so effectively.
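To make the sequential idea concrete, here is a minimal sketch of the forward computation for a toy HMM; the states, observations and probabilities are invented for illustration, not taken from any real recognizer.

```python
# A toy Hidden Markov Model: hidden phoneme-like states emit observable
# acoustic feature labels; the forward algorithm sums over all state paths.
import numpy as np

states = ["t", "iy"]                # hypothetical hidden states
obs_symbols = ["burst", "voiced"]   # hypothetical acoustic labels

start = np.array([0.8, 0.2])        # P(first state)
trans = np.array([[0.3, 0.7],       # P(next state | current state)
                  [0.1, 0.9]])
emit = np.array([[0.9, 0.1],        # P(observation | state)
                 [0.2, 0.8]])

def forward(observations):
    """Likelihood of an observation sequence under the model."""
    idx = [obs_symbols.index(o) for o in observations]
    alpha = start * emit[:, idx[0]]
    for o in idx[1:]:
        alpha = (alpha @ trans) * emit[:, o]
    return alpha.sum()

print(forward(["burst", "voiced", "voiced"]))
```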

Prosodic processing
Prosodic information is analysed at the phrase level, by recognising intonation and
finding phrase and sentence boundaries. A phrase or sentence boundary can be an
important anchor point for interpreting a stream of sounds. Consider the words
"They are there. Boys, are you ready?"
"They are their boys. Are you ready?"
Prosodic processing is based on the acoustic signal. It helps determine utterance type:
question, statement etc.
In systems that include dialog management and some limited understanding of speech,
this will contribute to getting semantic information later.
Language models
The acoustic and prosodic processor will produce a set of candidate words, ranked in
order of likelihood. The most likely one will be the word giving the closest acoustic
match with the input word. The language model will indicate which word is most likely
in the context, not necessarily the word at the top of the acoustic ranking.
Two probabilities are combined: (i) the probability that a certain speech signal S will be
produced by the word w; (ii) the probability that the word is w, given its overall
probability of occurring ("some" is more frequent than "sum") and its context ("has
been" is more frequent than "has bean").
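In effect this is an application of Bayes' rule: choose the word w that maximizes P(S|w) * P(w). A minimal sketch, with invented numbers:

```python
# Hypothetical scores: acoustic likelihood P(S|w) and language-model
# prior P(w) for two homophone candidates.
candidates = {
    "some": {"acoustic": 0.40, "prior": 0.00050},
    "sum":  {"acoustic": 0.45, "prior": 0.00008},
}

# P(w|S) is proportional to P(S|w) * P(w), so rank by the product.
best = max(candidates, key=lambda w: candidates[w]["acoustic"] * candidates[w]["prior"])
print(best)  # "some" wins despite a slightly weaker acoustic match
```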
A language model in automated speech recognition (ASR) is a set of characteristics about
a language. Note the different nuances in the uses of the term model. In some cases a
model encapsulates a theory, such as a neural net model that aims to show how the past
tense of verbs is learnt, or a grammatical model based on a certain theory of grammar.
In other cases the model is just a set of statistics about the data. In Claws the language
model is a set of probabilities that one tag would follow another. The language models in
ASR are of this sort.
Language models play an important part in ASR. A speech recognizer produces a set of
candidate words from the acoustic signal. Each word will have a probability that it is the
right one, given the signal. However, the ranking of these candidate words may well be
altered when we look at their context.
Suppose the ASR processor produces as candidate words
- courts
- quartz
- caught s(noozing)
These are homophones. The acoustic signals may be very similar. Homophones are
common in English and other languages, including non-Indo-European languages. The
language model will give us information that helps get the right word when we hit a
homophone.
Building the language model
The language model (LM) is extracted from the training data, a corpus or several corpora
of text or transcribed speech. The basic idea behind using language models is that word
frequencies and word patterns that occur in the training data are likely to show up again
later. Thus using a suitable corpus is most important. The LM may include a list of words
in general use, or it may be a general list supplemented with a specialized vocabulary for
work in a certain domain. Thus, courts will have a higher probability of occurring in
legal reports and tennis commentary, quartz in geological reports. The language model
is a collection of statistics from the training corpora: the frequency of single words, word
bigrams and trigrams. These frequencies are taken as approximations to the probability of
a certain word, or word sequence, occurring.
1. Frequency of words occurring. Suppose the language domain is
legal reports
tennis reports
geological reports
TV chat show
The unigram (single word) probabilities in the language model will help rank courts and
its homophones by likelihood.
2. We also get much information from bigram and trigram probabilities.
A bigram is a word pair, such as tennis courts; a trigram is a word triple.
In the sentence "The tennis courts were too wet to play." we would have the triples
The tennis courts, tennis courts were, courts were too, were too wet, too wet to,
wet to play, to play .
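A minimal sketch of how such counts might be collected from training text (the tokenization here is deliberately naive):

```python
# Count unigrams, bigrams and trigrams in the example sentence; relative
# frequencies of these counts approximate the probabilities the LM stores.
from collections import Counter

tokens = "the tennis courts were too wet to play".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))

print(unigrams["courts"] / sum(unigrams.values()))  # unigram estimate
print(bigrams[("tennis", "courts")])                # bigram count
print(trigrams[("the", "tennis", "courts")])        # trigram count
```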
The LM holds information about bigram and trigram probabilities. Suppose the ASR
processor produces as candidate trigrams
- the eye court
- the I court
- the high court
If "the high court" has occurred a number of times in the training data, while the other
trigrams have never occurred, there is a strong probability that this is the correct one.
Reducing unpredictability
If we know more about which word is likely to be produced, the speech recognizer has an
easier task: words are more predictable. This can be approached in several ways:
- Reduce the number of words that can be used, e.g. some telephone dialogue
  systems ask you to just say a name or a number from a short list. If the domain is
  very restricted, e.g. weather forecasts, the number of words can be limited.
- Get unigram (single word) probabilities.
- Get probabilities on which words are likely to follow others.
Suppose we had the following probabilities for bigrams starting with the word "prime"
prime minister 0.98
prime cut 0.01
prime number 0.01
prime monkey 0.00
We have more predictability with this information than without it.
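A sketch of how a recognizer might use these numbers to predict, or re-rank, the next word (the probabilities are the invented ones above):

```python
# Bigram probabilities for continuations of "prime", as in the notes.
bigram_probs = {
    ("prime", "minister"): 0.98,
    ("prime", "cut"): 0.01,
    ("prime", "number"): 0.01,
    ("prime", "monkey"): 0.00,
}

def most_likely_next(word):
    """Return the most probable continuation of `word` under the bigram model."""
    followers = {w2: p for (w1, w2), p in bigram_probs.items() if w1 == word}
    return max(followers, key=followers.get)

print(most_likely_next("prime"))  # "minister"
```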
Perplexity
One reason that ASR has been able to make rapid strides is that there are metrics that can
be used to evaluate different parts of the process. One of these metrics can be applied to
language models, the perplexity measure. In informal terms, perplexity is a measure of
predictability. The lower the perplexity, the higher the predictability. High perplexity is
associated with a large number of possible word choices, low perplexity with fewer
choices. It is useful to see whether it is worth spending time extracting word trigrams to
improve the predictability of the LM (it is). It is also useful to see how the LM improves
as training corpus size increases (slow, steady increase).
Perplexity is related to another concept: entropy. In informal terms entropy is also a
measure of disorder, or unpredictability. Formally: if entropy is represented by H and
perplexity by P, then P = 2^H. These concepts come from Information Theory.
It has been found that using trigram models lowers perplexity significantly, so they are a
significant part of speech recognition language models.
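A minimal sketch of the computation, using the P = 2^H relation and invented unigram probabilities:

```python
# Perplexity of a unigram model on a test sequence: 2 raised to the
# average negative log2 probability per word (the cross-entropy H).
import math

unigram_probs = {"the": 0.06, "high": 0.001, "court": 0.0005}  # invented

def perplexity(words):
    H = -sum(math.log2(unigram_probs[w]) for w in words) / len(words)
    return 2 ** H

print(perplexity(["the", "high", "court"]))
```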
The Zipfian distribution of words and the sparse data problem
The sparse data problem is that, when we compare training texts with new, unseen
text, words, bigrams and trigrams may occur that never appeared in the training data.
This can be illuminated by looking at the Zipfian distribution of words. This is an
empirical observation, first noted by Zipf in the 1930s. A small number of words,
function words, are common, but most of the ordinary content words we use only occur
rarely. For example, in the Brown corpus of 1 million words from various domains, 40%
of the words occur only once. Now, this distinctive, highly non-linear distribution will be
more pronounced for word bigrams, and even more marked for trigrams. Consider a
corpus of newspaper articles from the Wall Street Journal, which produces text in a
comparatively limited domain. With a large enough training corpus (39 million words)
single words may be almost fully covered. Bigram coverage is harder to achieve, and for
trigrams 77% of trigrams in a new article will, on average, not have been seen before in
the 39 million word corpus. This type of distribution is typical.
If words are ranked by their frequency, the most frequent, the, being rank 1, then
frequency is approximately proportional to the reciprocal of the rank. If f is frequency, r
is rank and c is a constant, then

f = c * (1/r), approximately
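The relation is easy to check mechanically: rank words by frequency and look at the product f * r, which should be roughly constant on a real corpus (the toy text below only shows the mechanics):

```python
# Rank words by frequency (rank 1 = most frequent) and report f * r.
from collections import Counter

counts = Counter("the cat sat on the mat and the dog sat on the log".split())

ranked = sorted(counts.items(), key=lambda kv: -kv[1])
for rank, (word, freq) in enumerate(ranked, start=1):
    print(rank, word, freq, freq * rank)
```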
Methods of smoothing
Smoothing techniques are used so that events that are unseen in the training data are not
given zero probability of occurring. These are some of the methods employed.
1. Assume an unseen event has a probability related to that of an event seen once;
"share" probability of single events in the training data over unseen events too.
2. Backoff from trigram to bigram, and from bigram to unigram, since the shorter
sequences are more likely to have occurred (see the sketch after this list).
3. Use parts-of-speech instead of words. This can be useful for disambiguating some
common homophones, such as their and there: their is often followed by a
noun or adjective, there by a verb, preposition or end of sentence. However, it is not
usually employed because the computation required slows down the system. Many ASR
applications require real-time responses.
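A minimal sketch of the backoff idea (method 2), with invented counts and without the discounting a production system would add:

```python
# Fall back from trigram to bigram to unigram estimates when counts
# are missing, so unseen sequences never get zero probability.
trigram_counts = {("the", "high", "court"): 12}
bigram_counts = {("the", "high"): 55, ("high", "court"): 40}
unigram_counts = {"the": 6000, "high": 300, "court": 120}
total_words = 100_000

def backoff_prob(w1, w2, w3):
    """Estimate P(w3 | w1 w2), backing off to shorter histories."""
    if (w1, w2, w3) in trigram_counts:
        return trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)]
    if (w2, w3) in bigram_counts:
        return bigram_counts[(w2, w3)] / unigram_counts[w2]
    return unigram_counts.get(w3, 0) / total_words

print(backoff_prob("the", "high", "court"))  # trigram estimate
print(backoff_prob("a", "high", "court"))    # backs off to the bigram
```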
Improving the language model
In practice, perplexity declines as
- The size of the training corpus increases. Corpora of hundreds of millions of
  words are commonly used in commercial products.
- Information from bigrams and trigrams as well as single words is included in the
  language model. Current ASR systems are based on trigram language models.
- The training corpus includes material from the same language domain as the test
  data.

Speech recognition applications
Speech recognizers are now widely used as standalone systems, attached to ordinary
PCs, as some of you may know. They are used by some professionals such as lawyers
and surveyors, who have to produce many reports. They are useful in areas such as
radiography, where the user can examine an image and speak the findings into a speech
recognizer, without having to turn away from the image. Afterwards, the user will have to
edit the report to correct any errors. Performance with standard speech recognizers
nowadays is typically 90-97% words correct. In the past words had to be pronounced
separately, but for some time now normal, continuous speech can be used.
Training
Speech recognizers are typically trained by each user, to customise the system for
individual voices. Training continues with use, as corrections are made. For a commercial
product now available, only 10 minutes' training is needed to get performance at the 90%
level. However, performance improves with further training. Work is under way on
developing recognizers that can accept an unknown voice.
The Subtitle project
If you watch sports programmes on BBC2 you can turn on the teletext subtitles, provided
for viewers with impaired hearing. They are not like subtitles for films, which are done
in advance, but are produced in real time as the game is in progress. They are based on
speech recognition technology, developed in partnership by the UH Speech Research
group and an industrial company. In the past subtitles have been produced by highly
trained stenographers, typing at great speed. Some still are produced in this way. The
alternative is to have a speaker commenting on the sport, and having their words
converted to textual subtitles. The actual televised spoken commentary is not used for
several reasons. First, the speaker will have trained the speech recognizer to his/her voice
as effectively as possible, and will continue correcting it with use. Secondly, for some
sports such as football the commentary has to be cut down, as there is too much to fit
into the subtitles. Thirdly, some voices seem to get better recognition rates than others,
and speakers with such voices can be selected for subtitling. The speech recognizers used
are commercially available products.
The lecture capture project
Currently the Speech Research group is investigating a system in which lecturers will
have their speech converted to text, which is thrown up on a screen for the benefit of
students with impaired hearing. The lecturer wears a radio mic, and his/her words are
either displayed as straight text, or integrated into a PowerPoint presentation if wanted.

We also have a hypothesis that some overseas students who do not have English as their
mother tongue can understand written English better than spoken English.
The Verbmobil project
This is an ambitious German/Japanese project in which speech recognition is part of a
larger application. Speech input is recognized, analysed to extract the meaning of an
utterance, and then translated. English is used as an intermediary form. Though the
project has been running for about 10 years in a number of German universities and
industrial research departments, it is still in the early stages of development.
