
SPEAKER AND SPEECH RECOGNITION


CHAPTER 1

MOTIVATION

Security today, in any place, is a paramount concern. In this regard many advances have
been made beyond the simple lock and key combination. Biometric security has come a
long way in helping mankind keep digital assets secure. All its constituents, be it IRIS,
SPEECH or FACIAL recognition, have made great strides in the field of security.
They have evolved from being pure scientific studies to viable commercial solutions
in providing security. SPEAKER and SPEECH RECOGNITION are now, along with IRIS
patterns, the most widely implemented techniques. There is a certain unique
property to the speech of each individual which can be exploited to recognize the
speaker or the speech. This has duly prompted research to find ways to make robust
signal processing systems to obtain and condition speech. In these terms, DSPs are
used to obtain a high-quality speech signal, ridding it of environmental effects.
Miniaturized yet powerful processors have been put to use to identify samples
of speech. These are so powerful that they do a brute force comparison with relative
ease. But the efficiency of such algorithms is very low. Using a neural network
enables learning and adaptation to varied voices, thereby generalizing the procedure,
and requires a much smaller amount of memory. With these inherent advantages in
mind, the desire to create a robust recognition system was pursued in right earnest
by the authors of this work.

CHAPTER 2

THE PRODUCTION OF SPEECH

2.1 VOCAL TRACT

The vocal tract is the cavity in animals where sound that is produced at the sound source
(larynx in mammals; syrinx in birds) is filtered. In mammals it consists of the laryngeal
cavity, the pharynx, the oral cavity, and the nasal cavity, and in some nonhuman
mammals possibly also the air sacs. The larynx, colloquially known as the voice box, is an
organ in the neck of mammals involved in protection of the trachea and in sound
production. The larynx houses the vocal folds, and is situated just below where the tract
of the pharynx splits into the trachea and the esophagus.

Figure 1.1: The larynx

2.1.1 INNERVATION OF THE LARYNX

The larynx is innervated by branches of the vagus nerve on each side. Sensory
innervation to the glottis and laryngeal vestibule is by the internal branch of the superior
laryngeal nerve. The external branch of the superior laryngeal nerve innervates the
cricothyroid muscle. Motor innervation to all other muscles of the larynx and sensory
innervation to the subglottis is by the recurrent laryngeal nerve. While the sensory input
described above is (general) visceral sensation (diffuse, poorly localized), the vocal fold
also receives general somatic sensory innervation (proprioceptive and touch) by the
superior laryngeal nerve. Injury to the external laryngeal nerve causes weakened
phonation because the vocal cords cannot be tightened. Injury to one of the recurrent
laryngeal nerves produces hoarseness; if both are damaged, the voice is completely lost
and breathing becomes difficult.

2.1.2 INTRINSIC MUSCLES ASSOCIATED WITH THE LARYNX

• Cricothyroid muscles lengthen and stretch the vocal folds.


• Posterior cricoarytenoid muscles abduct and externally rotate the arytenoid
cartilages, resulting in abducted vocal cords.
• Lateral cricoarytenoid muscles adduct and internally rotate the arytenoid
cartilages, which can result in adducted vocal folds.
• Transverse arytenoid muscle adducts the arytenoid cartilages, resulting in
adducted vocal cords.
• Oblique arytenoid muscles narrow the laryngeal inlet by constricting the distance
between the arytenoid cartilages and epiglottis.
• Vocalis muscles adjust tension in vocal folds.
• Thyroarytenoid muscles - sphincter of vestibule, narrowing the laryngeal inlet.

2.1.3 EXTRINSIC MUSCLES ASSOCIATED WITH THE LARYNX

There are three pairs of extrinsic muscles of the larynx. All of them attach to the oblique
line of thyroid cartilage.

• Thyrohyoid muscles
• Sternothyroid muscles
• Inferior constrictor muscles

2.1.4 DESCENDED LARYNX

In most animals, including infant humans and apes, the larynx is situated very high in the
throat—a position that allows it to couple more easily with the nasal passages, so that
breathing and eating are not done with the same apparatus. However, some aquatic
mammals, large deer, and adult humans have descended larynges. An adult human,
unlike apes, cannot raise the larynx enough to directly couple it to the nasal passage.
Proponents of the aquatic ape hypothesis claim that the similarity between the descended
larynx in humans and aquatic mammals further supports their theory.

Some linguists have suggested that the descended larynx, by extending the length of the
vocal tract and thereby increasing the variety of sounds humans could produce, was a
critical element in the development of speech and language. Others cite the presence of
descended larynges in non-linguistic animals, as well as the ubiquity of nonverbal
communication and language among humans, as counterevidence against this claim.

2.1.5 CARTILAGES

There are nine cartilages, three unpaired and three paired, that support the larynx and
form its skeleton. The unpaired cartilages of the larynx are the thyroid, cricoid and
epiglottis. The paired cartilages of the larynx are the arytenoids, corniculate, and the
cuneiforms.

The vocal apparatus consists of two pairs of mucosal folds. These folds are false vocal
cords (vestibular folds) and true vocal cords (folds). The false vocal cords are covered by
respiratory epithelium, while the true vocal cords are covered by stratified squamous
epithelium. The false vocal cords are not responsible for sound production, but rather for
resonance. These false vocal cords do not contain muscle, while the true vocal cords do
have skeletal muscle. During swallowing, the backward motion of the tongue forces the
epiglottis over the laryngeal opening to prevent swallowed material from entering the
lungs; the larynx is also pulled upwards to assist this process. Stimulation of the larynx
by ingested matter produces a strong cough reflex to protect the lungs.

2.2 HUMAN VOICE

The human voice consists of sound made by a human being using the vocal folds for
talking, singing, laughing, crying, screaming, etc. Human voice is specifically that part of
human sound production in which the vocal folds (vocal cords) are the primary sound
source. Generally speaking, the mechanism for generating the human voice can be
subdivided into three parts: the lungs, the vocal folds within the larynx, and the
articulators. The lungs (the pump) must produce adequate airflow and air pressure to
vibrate vocal folds (this air pressure is the fuel of the voice). The vocal folds (vocal
cords) are a vibrating valve that chops up the airflow from the lungs into audible pulses
that form the laryngeal sound source. The muscles of the larynx adjust the length and
tension of the vocal folds to ‘fine tune’ pitch and tone. The articulators (the parts of the
vocal tract above the larynx consisting of tongue, palate, cheek, lips, etc.) articulate and
filter the sound emanating from the larynx and to some degree can interact with the
laryngeal airflow to strengthen it or weaken it as a sound source.

The vocal folds, in combination with the articulators, are capable of producing highly
intricate arrays of sound. The tone of voice may be modulated to suggest emotions such
as anger, surprise, or happiness. Singers use the human voice as an instrument for
creating music.

2.2.1 VOICE TYPES AND CORDS

Adult men and women have different vocal fold sizes, reflecting the male-female
differences in larynx size. Adult male voices are usually lower-pitched and have larger
folds. The male vocal folds (which would be measured vertically in the opposite
diagram), are between 17 mm and 25 mm in length. The female vocal folds are between
12.5 mm and 17.5 mm in length. The folds are located just above the trachea (the
windpipe which travels from the lungs). Food and drink do not pass through the cords but
instead pass through the esophagus, an unlinked tube. Both tubes are separated by the
epiglottis, a "flap" that covers the opening of the trachea while swallowing.

The folds in both sexes are within the larynx. They are attached at the back (side nearest
the spinal cord) to the arytenoid cartilages, and at the front (side under the chin) to the
thyroid cartilage. They have no outer edge as they blend into the side of the breathing
tube (the illustration is out of date and does not show this well) while their inner edges or
"margins" are free to vibrate (the hole). They have a three layer construction of an
epithelium, vocal ligament, then muscle (vocalis muscle), which can shorten and bulge
the folds. They are flat triangular bands and are pearly white in color. Above both sides
of the vocal cord is the vestibular fold or false vocal cord, which has a small sac between
its two folds. The difference in vocal fold size between men and women means that they
have differently pitched voices. Additionally, genetics also causes variances amongst the
same sex, with men and women's singing voices being categorized into types. For
example, among men, there are basses, baritones and tenors, and among women,
contraltos, mezzo-sopranos and sopranos. There are additional categories for operatic
voices. This is not the only source of difference between male and female voice. Men,
generally speaking, have a larger vocal tract, which essentially gives the resultant voice a
lower tonal quality. This is mostly independent of the vocal folds themselves.

Figure 1.2: Human Voice

2.2.2 VOICE MODULATION IN SPOKEN LANGUAGE

Human spoken language makes use of the ability of almost all persons in a given society
to dynamically modulate certain parameters of the laryngeal voice source in a consistent
manner. The most important communicative, or phonetic, parameters are the voice pitch
(determined by the vibratory frequency of the vocal folds) and the degree of separation of
the vocal folds, referred to as vocal fold adduction (coming together) or abduction
(separating). The ability to vary the ab/adduction of the vocal folds quickly has a strong
genetic component, since vocal fold adduction has a life-preserving function in keeping
food from passing into the lungs, in addition to the covering action of the epiglottis.
Consequently, the muscles that control this action are among the fastest in the body.
Children can learn to use this action consistently during speech at an early age, as they
learn to speak the difference between utterances such as "apa" (having an abductory-
adductory gesture for the p) and "aba" (having no abductory-adductory
gesture). Surprisingly enough, they can learn to do this well before the age of two by
listening only to the voices of adults around them who have voices much different than
their own, and even though the laryngeal movements causing these phonetic
differentiations are deep in the throat and not visible to them.

If an abductory movement or adductory movement is strong enough, the vibrations of the
vocal folds will stop (or not start). If the gesture is abductory and is part of a speech
sound, the sound will be called [Voiceless]. However, voiceless speech sounds are
sometimes better identified as containing an abductory gesture, even if the gesture was
not strong enough to stop the vocal folds from vibrating. This anomalous feature of
voiceless speech sounds is better understood if it is realized that it is the change in the
spectral qualities of the voice as abduction proceeds that is the primary acoustic attribute
that the listener attends to when identifying a voiceless speech sound, and not simply the
presence or absence of voice (periodic energy).

An adductory gesture is also identified by the change in voice spectral energy it produces.
Thus, a speech sound having an adductory gesture may be referred to as a "glottal stop"
even if the vocal fold vibrations do not entirely stop. Other aspects of the voice, such as
variations in the regularity of vibration, are also used for communication, and are
important for the trained voice user to master, but are more rarely used in the formal
phonetic code of a spoken language.

2.2.3 PHYSIOLOGY AND VOICE TIMBRE

The sound of each individual's voice is entirely unique not only because of the actual
shape and size of an individual's vocal cords but also due to the size and shape of the rest
of that person's body, especially the vocal tract, and the manner in which the speech
sounds are habitually formed and articulated. (It is this latter aspect of the sound of the
voice that can be mimicked by skilled performers.) Humans have vocal folds which can
loosen, tighten, or change their thickness, and over which breath can be transferred at
varying pressures. The shape of chest and neck, the position of the tongue, and the
tightness of otherwise unrelated muscles can be altered. Any one of these actions results
in a change in pitch, volume, timbre, or tone of the sound produced. Sound also resonates
within different parts of the body, and an individual's size and bone structure can
somewhat affect the sound produced.

Singers can also learn to project sound in certain ways so that it resonates better within
their vocal tract. This is known as vocal resonation. Another major influence on vocal
sound and production is the function of the larynx which people can manipulate in
different ways to produce different sounds. These different kinds of laryngeal function
are described as different kinds of vocal registers. The primary method for singers to
accomplish this is through the use of the Singer's Formant, which has been shown to be a
resonance added to the normal resonances of the vocal tract above the frequency range of
most instruments and so enables the singer's voice to carry better over musical
accompaniment.

2.2.3.1 VOCAL REGISTRATION

Vocal registration refers to the system of vocal registers within the human voice. A
register in the human voice is a particular series of tones, produced in the same vibratory
pattern of the vocal folds, and possessing the same quality. Registers originate in
laryngeal function. They occur because the vocal folds are capable of producing several
different vibratory patterns. Each of these vibratory patterns appears within a particular
range of pitches and produces certain characteristic sounds. The term register can be
somewhat confusing as it encompasses several aspects of the human voice. The term
register can be used to refer to any of the following:

• A particular part of the vocal range such as the upper, middle, or lower registers.
• A resonance area such as chest voice or head voice.
• A phonatory process
• A certain vocal timbre
• A region of the voice which is defined or delimited by vocal breaks.
• A subset of a language used for a particular purpose or in a particular social
setting.

In linguistics, a register language is a language which combines tone and vowel
phonation into a single phonological system.

Within speech pathology the term vocal register has three constituent elements: a certain
vibratory pattern of the vocal folds, a certain series of pitches, and a certain type of
sound. Speech pathologists identify four vocal registers based on the physiology of
laryngeal function: the vocal fry register, the modal register, the falsetto register, and the
whistle register.

2.2.3.2 VOCAL RESONATION

Vocal resonation is the process by which the basic product of phonation is enhanced in
timbre and/or intensity by the air-filled cavities through which it passes on its way to the
outside air. Various terms related to the resonation process include amplification,
enrichment, enlargement, improvement, intensification, and prolongation; although in
strictly scientific usage acoustic authorities would question most of them. The main point
to be drawn from these terms by a singer or speaker is that the end result of resonation is,
or should be, to make a better sound. There are seven areas that may be listed as possible
vocal resonators. In sequence from the lowest within the body to the highest, these areas
are the chest, the tracheal tree, the larynx itself, the pharynx, the oral cavity, the nasal
cavity, and the sinuses.

2.2.4 MANNER OF ARTICULATION

In linguistics (articulatory phonetics), manner of articulation describes how the tongue,
lips, jaw, and other speech organs are involved in making a sound. Often
the concept is only used for the production of consonants. For any place of articulation,
there may be several manners, and therefore several homorganic consonants.

One parameter of manner is stricture, that is, how closely the speech organs approach one
another. Parameters other than stricture are those involved in the r-like sounds (taps and
trills), and the sibilancy of fricatives. Often nasality and laterality are included in manner.
From greatest to least stricture, speech sounds may be classified along a cline as stop
consonants (with occlusion, or blocked airflow), fricative consonants (with partially
blocked and therefore strongly turbulent airflow), approximants (with only slight
turbulence), and vowels (with full unimpeded airflow). Affricates often behave as if they
were intermediate between stops and fricatives, but phonetically they are sequences of
stop plus fricative. Historically, sounds may move along this cline toward fewer strictures
in a process called lenition.

2.2.5 OTHER PARAMETERS

Sibilants are distinguished from other fricatives by the shape of the tongue and how the
airflow is directed over the teeth. Fricatives at coronal places of articulation may be
sibilant or non-sibilant, sibilants being the more common.

Taps and flaps are similar to very brief stops. However, their articulation and behavior is
distinct enough to be considered a separate manner, rather than just length.

Trills involve the vibration of one of the speech organs. Since trilling is a separate
parameter from stricture, the two may be combined. Increasing the stricture of a typical
trill results in a trilled fricative. Trilled affricates are also known.

Nasal airflow may be added as an independent parameter to any speech sound. It is most
commonly found in nasal stops and nasal vowels, but nasal fricatives, taps, and
approximants are also found. When a sound is not nasal, it is called oral. An oral stop is
often called a plosive, while a nasal stop is generally just called a nasal.

Laterality is the release of airflow at the side of the tongue. This can also be combined
with other manners, resulting in lateral approximants (the most common), lateral flaps,
and lateral fricatives and affricates.

2.2.6 INDIVIDUAL MANNERS

• Plosive or oral stop, where there is complete occlusion (blockage) of both the
oral and nasal cavities of the vocal tract, and therefore no air flow. Examples
include English /p t k/ (voiceless) and /b d g/ (voiced). If the consonant is voiced,
the voicing is the only sound made during occlusion; if it is voiceless, a plosive is
completely silent. What we hear as a /p/ or /k/ is the effect that the onset of the
occlusion has on the preceding vowel, as well as the release burst and its effect
on the following vowel. The shape and position of the tongue (the place of
articulation) determine the resonant cavity that gives different plosives their
characteristic sounds. All languages have plosives.

• Nasal stop, usually shortened to nasal, where there is complete occlusion of the
oral cavity, and the air passes instead through the nose. The shape and position of
the tongue determine the resonant cavity that gives different nasal stops their
characteristic sounds. Examples include English /m, n/. Nearly all languages have
nasals, the only exceptions being in the area of Puget Sound and a single language
on Bougainville Island.

• Fricative, sometimes called spirant, where there is continuous frication
(turbulent and noisy airflow) at the place of articulation. Examples include
English /f, s/ (voiceless), /v, z/ (voiced), etc. Most languages have fricatives,
though many have only an /s/. However, the Indigenous Australian languages are
almost completely devoid of fricatives of any kind.

• Sibilants are a type of fricative where the airflow is guided by a
groove in the tongue toward the teeth, creating a high-pitched and very
distinctive sound. These are by far the most common fricatives. Fricatives
at coronal (front of tongue) places of articulation are usually, though not
always, sibilants. English sibilants include /s/ and /z/.

• Lateral fricatives are a rare type of fricative, where the frication
occurs on one or both sides of the edge of the tongue. The "ll" of Welsh
and the "hl" of Zulu are lateral fricatives.

• Affricate, which begins like a plosive but releases into a fricative rather than
having a separate release of its own. The English letters "ch" and "j" represent
affricates. Affricates are quite common around the world, though less common
than fricatives.

• Flap, often called a tap, is a momentary closure of the oral cavity. The "tt" of
"utter" and the "dd" of "udder" are pronounced as a flap in North American
English. Many linguists distinguish taps from flaps, but there is no consensus on
what the difference might be. No language relies on such a difference. There are
also lateral flaps.

• Trill, in which the articulator (usually the tip of the tongue) is held in place and
the airstream causes it to vibrate. The double "r" of Spanish "perro" is a trill. Trills
and flaps, where there are one or more brief occlusions, constitute a class of
consonant called rhotics.

• Approximant, where there is very little obstruction. Examples include English
/w/ and /r/. In some languages, such as Spanish, there are sounds which seem to
fall between fricative and approximant.

• One use of the word semivowel is a type of approximant,
pronounced like a vowel but with the tongue closer to the roof of the
mouth, so that there is slight turbulence. In English, /w/ is the semivowel
equivalent of the vowel /u/, and /j/ (spelled "y") is the semivowel
equivalent of the vowel /i/ in this usage. Other descriptions use semivowel
for vowel-like sounds that are not syllabic, but do not have the increased
stricture of approximants. These are found as elements in diphthongs. The
word may also be used to cover both concepts.

• Lateral approximants, usually shortened to lateral, are a type of
approximant pronounced with the side of the tongue. English /l/ is a
lateral. Together with the rhotics, which have similar behavior in many
languages, these form a class of consonant called liquids.

CHAPTER 3

SPEECH PERCEPTION

Speech perception refers to the processes by which humans are able to interpret and
understand the sounds used in language. The study of speech perception is closely linked
to the fields of phonetics and phonology in linguistics and cognitive psychology and
perception in psychology. Research in speech perception seeks to understand how human
listeners recognize speech sounds and use this information to understand spoken
language. Speech research has applications in building computer systems that can
recognize speech, as well as improving speech recognition for hearing- and
language-impaired listeners.

3.1 BASICS OF SPEECH PERCEPTION

The process of perceiving speech begins at the level of the sound signal and the process
of audition. After processing the initial auditory signal, speech sounds are further
processed to extract acoustic cues and phonetic information. This speech information can
then be used for higher-level language processes, such as word recognition.

3.1.1 ACOUSTIC CUES

Figure 3.1: Spectrograms of syllables "dee" (top), "dah" (middle), and "doo"
showing how the onset formant transitions that define perceptually the consonant
[d] differ depending on the identity of the following vowel. (Formants are
highlighted by red dotted lines; transitions are the bending beginnings of the
formant trajectories).

The speech sound signal contains a number of acoustic cues that are used in speech
perception. The cues differentiate speech sounds belonging to different phonetic
categories. For example, one of the most studied cues in speech is voice onset time or
VOT. VOT is a primary cue signaling the difference between voiced and voiceless stop
consonants, such as "b" and "p". Other cues differentiate sounds that are produced at
different places of articulation or manners of articulation. The speech system must also
combine these cues to determine the category of a specific speech sound. This is often
thought of in terms of abstract representations of phonemes. These representations can
then be combined for use in word recognition and other language processes.

It is not easy to identify what acoustic cues listeners are sensitive to when perceiving a
particular speech sound:

At first glance, the solution to the problem of how we perceive speech seems
deceptively simple. If one could identify stretches of the acoustic waveform that
correspond to units of perception, then the path from sound to meaning would be
clear. However, this correspondence or mapping has proven extremely difficult to
find, even after some forty-five years of research on the problem.

If a specific aspect of the acoustic waveform indicated one linguistic unit, a series of tests
using speech synthesizers would be sufficient to determine such a cue or cues. However,
there are two significant obstacles:

1. One acoustic aspect of the speech signal may cue different linguistically relevant
dimensions. For example, the duration of a vowel in English can indicate whether
or not the vowel is stressed, or whether it is in a syllable closed by a voiced or a
voiceless consonant, and in some cases (like American English /ɛ/ and /æ/) it can
distinguish the identity of vowels. Some experts even argue that duration can help
in distinguishing what are traditionally called short and long vowels in English.
2. One linguistic unit can be cued by several acoustic properties. For example in a
classic experiment, it was shown that the onset formant transitions of /d/ differ
depending on the following vowel but they are all interpreted as the phoneme /d/
by listeners.

3.1.2 LINEARITY AND SEGMENTATION PROBLEM

Figure 3.2: A spectrogram of the phrase "I owe you". There are no clearly
distinguishable boundaries between speech sounds.

Although listeners perceive speech as a stream of discrete units (phonemes, syllables, and
words), this linearity is difficult to see in the physical speech signal (see Figure 3.2 for
an example). Speech sounds do not strictly follow one another; rather, they overlap. A
speech sound is influenced by the ones that precede it and the ones that follow. This
influence can even be exerted at a distance of two or more segments (and across syllable-
and word-boundaries).

Having disputed the linearity of the speech signal, the problem of segmentation arises:
one encounters serious difficulties trying to delimit a stretch of speech signal as
belonging to a single perceptual unit. This can be again illustrated by the fact that the
acoustic properties of the phoneme /d/ will depend on the identity of the following vowel
(because of coarticulation).

3.1.3 LACK OF INVARIANCE

The research and application of speech perception has to deal with several problems
which result from what has been termed the lack of invariance. As was suggested,
reliable constant relations between a phoneme of a language and its acoustic
manifestation in speech are difficult to find. There are several reasons for this:

• Context-induced variation. Phonetic environment affects the acoustic properties
of speech sounds. For example, /u/ in English is fronted when surrounded by
coronal consonants. Or, the VOT values marking the boundary between voiced
and voiceless stops are different for labial, alveolar and velar stops and they shift
under stress or depending on the position within a syllable.

• Variation due to differing speech conditions. One important factor that causes
variation is differing speech rate. Many phonemic contrasts are constituted by
temporal characteristics (short vs. long vowels or consonants, affricates vs.
fricatives, stops vs. glides, voiced vs. voiceless stops, etc.) and they are certainly
affected by changes in speaking tempo. Another major source of variation is
articulatory carefulness versus sloppiness which is typical for connected speech
(articulatory ‘undershoot’ is obviously reflected in the acoustic properties of the
sounds produced).

• Variation due to different speaker identity. The resulting acoustic structure of
concrete speech productions depends on the physical and psychological properties
of individual speakers. Men, women, and children generally produce voices
having different pitch. Because speakers have vocal tracts of different sizes (due
to sex and age especially) the resonant frequencies (formants), which are
important for recognition of speech sounds, will vary in their absolute values
across individuals. Dialect and foreign accent cause variation as well.

3.1.4 PERCEPTUAL CONSTANCY AND NORMALIZATION

Figure 3.3: The left panel shows the 3 peripheral American English vowels /i/, /ɑ/,
and /u/ in a standard F1 by F2 plot (in Hz). The mismatch between male, female,
and child values is apparent. In the right panel formant distances (in Bark) rather
than absolute values are plotted using the normalization procedure.

Given the lack of invariance, it is remarkable that listeners perceive vowels and
consonants produced under different conditions and by different speakers as constant
categories. It has been proposed that this is achieved by means of the perceptual
normalization process in which listeners filter out the noise (i.e. variation) to arrive at the
underlying category. Vocal-tract-size differences result in formant-frequency variation
across speakers; therefore a listener has to adjust his/her perceptual system to the acoustic
characteristics of a particular speaker. This may be accomplished by considering the
ratios of formants rather than their absolute values. This process has been called vocal
tract normalization. Similarly, listeners are believed to adjust the perception of duration
to the current tempo of the speech they are listening to – this has been referred to as
speech rate normalization.

Whether or not normalization actually takes place, and what its exact nature is, remains a
matter of theoretical controversy. Perceptual constancy is a phenomenon not specific to
speech perception only; it exists in other types of perception too.

3.1.5 CATEGORICAL PERCEPTION

Figure 3.4: Example identification (red) and discrimination (blue) functions

Categorical perception is involved in processes of perceptual differentiation. We perceive
speech sounds categorically, that is to say, we are more likely to notice the differences
between categories (phonemes) than within categories. The perceptual space between
categories is therefore warped, the centers of categories (or 'prototypes') working like a
sieve or like magnets for incoming speech sounds.

Consider an artificial continuum between a voiceless and a voiced bilabial stop, where
each new step differs from the preceding one in the amount of VOT. The first sound is a
pre-voiced [b], i.e. it has a negative VOT. Then, increasing the VOT, we get to a point
where it is zero, i.e. the stop is a plain unaspirated voiceless [p]. Gradually, adding the
same amount of VOT at a time, we reach the point where the stop is a strongly aspirated
voiceless bilabial [pʰ]. If we test the ability to discriminate between two sounds with
varying VOT values but having a constant VOT distance from each other (20 ms for
instance), listeners are likely to perform at chance level if both sounds fall within the
same category and at nearly-100% level if each sound falls in a different category.

3.1.6 TOP DOWN INFLUENCE AND SPEECH PERCEPTION

The process of speech perception is not necessarily uni-directional. That is, higher-level
language processes connected with morphology, syntax, or semantics may interact with
basic speech perception processes to aid in recognition of speech sounds. It may be the
case that it is not necessary and maybe even not possible for listener to recognize
phonemes before recognizing higher units, like words for example. After obtaining at
least a fundamental piece of information about phonemic structure of the perceived entity
from the acoustic signal, listeners are able to compensate for missing or noise-masked
phonemes using their knowledge of the spoken language.

In a classic experiment, Richard M. Warren (1970) replaced one phoneme of a word with
a cough-like sound. His subjects restored the missing speech sound perceptually without
any difficulty and what is more, they were not able to identify accurately which phoneme
had been disturbed. This is known as the phonemic restoration effect. Another basic
experiment compares recognition of naturally spoken words presented in a sentence (or at
least a phrase) and the same words presented in isolation. Perception accuracy usually
drops in the latter condition. Garnes and Bond (1976) also used carrier sentences when
researching the influence of semantic knowledge on perception. They created series of
words differing in one phoneme (bay / day / gay, for example). The quality of the first
phoneme changed along a continuum. All these stimuli were put into different sentences
each of which made sense with one of the words only. Listeners had a tendency to judge
the ambiguous words (when the first segment was at the boundary between categories)
according to the meaning of the whole sentence.

CHAPTER 4

NEURAL NETWORKS

In general a biological neural network is composed of a group or groups of chemically
connected or functionally associated neurons. A single neuron may be connected to many
other neurons and the total number of neurons and connections in a network may be
extensive. Connections, called synapses, are usually formed from axons to dendrites,
though dendrodendritic microcircuits and other connections are possible. Apart from the
electrical signaling, there are other forms of signaling that arise from neurotransmitter
diffusion, which have an effect on electrical signaling. As such, neural networks are
extremely complex.

Artificial intelligence and cognitive modeling try to simulate some properties of neural
networks. While similar in their techniques, the former has the aim of solving particular
tasks, while the latter aims to build mathematical models of biological neural systems.

In the artificial intelligence field, artificial neural networks have been applied
successfully to speech recognition, image analysis and adaptive control, in order to
construct software agents (in computer and video games) or autonomous robots. Most of
the currently employed artificial neural networks for artificial intelligence are based on
statistical estimation, optimization and control theory.

The cognitive modeling field involves the physical or mathematical modeling of the
behavior of neural systems; ranging from the individual neural level (e.g. modeling the
spike response curves of neurons to a stimulus), through the neural cluster level (e.g.
modeling the release and effects of dopamine in the basal ganglia) to the complete
organism (e.g. behavioral modeling of the organism's response to stimuli).

4.2 HISTORY OF NEURAL NETWORK ANALOGY

The concept of neural networks started in the late-1800s as an effort to describe how the
human mind performed. These ideas started being applied to computational models with
Turing's B-type machines and the perceptron.

In the early 1950s, Friedrich Hayek was one of the first to posit the idea of spontaneous order
in the brain arising out of decentralized networks of simple units (neurons). In the late
1940s, Donald Hebb made one of the first hypotheses for a mechanism of neural
plasticity (i.e. learning), Hebbian learning. Hebbian learning is considered to be a 'typical'
unsupervised learning rule and it (and variants of it) was an early model for long term
potentiation.

The Perceptron is essentially a linear classifier for classifying data x, specified by
parameters w and b and an output function f = w'x + b. Its parameters are
adapted with an ad-hoc rule similar to stochastic steepest gradient descent. Because the
inner product is a linear operator in the input space, the Perceptron can only perfectly
classify a set of data for which different classes are linearly separable in the input space,
while it often fails completely for non-separable data. While the development of the
algorithm initially generated some enthusiasm, partly because of its apparent relation to
biological mechanisms, the later discovery of this inadequacy caused such models to be
abandoned until the introduction of non-linear models into the field.
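
As an illustration of the classifier just described, the following sketch implements a perceptron with output function f = w'x + b and the classical error-driven update rule; the toy data, learning rate and number of epochs are assumptions made only for this example and are not taken from this work.

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=50):
    """Perceptron with output f = w'x + b, trained by the error-driven rule."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # Update only when the sample is misclassified (labels are -1 / +1).
            if yi * (np.dot(w, xi) + b) <= 0:
                w += lr * yi * xi
                b += lr * yi
    return w, b

# Linearly separable toy data (illustrative only).
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, b = train_perceptron(X, y)
print("weights:", w, "bias:", b)
```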

The Cognitron (1975) was an early multilayered neural network with a training
algorithm. The actual structure of the network and the methods used to set the
interconnection weights change from one neural strategy to another, each with its
advantages and disadvantages. Networks can propagate information in one direction only,
or they can bounce back and forth until self-activation at a node occurs and the network
settles on a final state. The ability for bi-directional flow of inputs between neurons/nodes
was produced with the Hopfield's network (1982), and specialization of these node layers
for specific purposes was introduced through the first hybrid network.

The parallel distributed processing of the mid-1980s became popular under the name
connectionism. The rediscovery of the back propagation algorithm was probably the main
reason behind the repopularisation of neural networks after the publication of "Learning
Internal Representations by Error Propagation" in 1986 (Though back propagation itself
dates from 1974). The original network utilized multiple layers of weight-sum units of
the type f = g (w'x + b), where g was a sigmoid function or logistic function such as used
in logistic regression. Training was done by a form of stochastic steepest gradient
descent. The employment of the chain rule of differentiation in deriving the appropriate
parameter updates results in an algorithm that seems to 'back propagate errors', hence the
nomenclature. However it is essentially a form of gradient descent. Determining the
optimal parameters in a model of this type is not trivial, and steepest gradient descent
methods cannot be relied upon to give the solution without a good starting point. In
recent times, networks with the same architecture as the back propagation network are
referred to as Multi-Layer Perceptrons. This name does not impose any limitations on the
type of algorithm used for learning.
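
A minimal sketch of such a network, weight-sum units of the type f = g(w'x + b) with a logistic g, trained by stochastic steepest gradient descent where the chain rule yields the back-propagated errors, is given below. The layer sizes, learning rate, random seed and XOR-style data are illustrative assumptions; a different seed or more epochs may be needed for convergence.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
# One hidden layer with 4 units and one output unit (illustrative sizes).
W1, b1 = rng.normal(size=(4, 2)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)   # XOR-style targets
lr = 0.5

for epoch in range(5000):
    for x, t in zip(X, T):
        # Forward pass: weight-sum units followed by the logistic function.
        h = sigmoid(W1 @ x + b1)
        y = sigmoid(W2 @ h + b2)
        # Backward pass: the chain rule gives the 'back propagated' errors.
        delta_out = (y - t) * y * (1 - y)
        delta_hid = (W2.T @ delta_out) * h * (1 - h)
        # Stochastic steepest gradient descent updates.
        W2 -= lr * np.outer(delta_out, h)
        b2 -= lr * delta_out
        W1 -= lr * np.outer(delta_hid, x)
        b1 -= lr * delta_hid

print(np.round([sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2)[0] for x in X], 2))
```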

The back propagation network generated much enthusiasm at the time and there was
much controversy about whether such learning could be implemented in the brain or not,
partly because a mechanism for reverse signaling was not obvious at the time, but most
importantly because there was no plausible source for the 'teaching' or 'target' signal.

4.3 THE BRAIN, NEURAL NETWORKS AND COMPUTERS

Neural networks, as used in artificial intelligence, have traditionally been viewed as
simplified models of neural processing in the brain, even though the relation between this
model and brain biological architecture is debated.

Biological neuron anatomy

Although the brain exhibits a great diversity of neuron shapes, dendritic trees,
axon lengths, etc., all neurons seem to process information in much the same way.
Information is transmitted in the form of electrical impulses called action potentials
via the axons from other neuron cells. Such action potentials have an amplitude
of about 100 millivolts and a frequency of approximately 1 kHz. When
the action potential arrives at the axon terminal, the neuron releases chemical
neurotransmitters which mediate the interneuron communication at specialized
connections called synapses.

Figure 4.1: A biological Neuron

A subject of current research in theoretical neuroscience is the question surrounding the
degree of complexity and the properties that individual neural elements should have to
reproduce something resembling animal intelligence.

Historically, computers evolved from the von Neumann architecture, which is based on
sequential processing and execution of explicit instructions. On the other hand, the
origins of neural networks are based on efforts to model information processing in
biological systems, which may rely largely on parallel processing as well as implicit
instructions based on recognition of patterns of 'sensory' input from external sources. In
other words, at its very heart a neural network is a complex statistical processor (as
opposed to being tasked to sequentially process and execute).

4.4 NEURAL NETWORKS AND ARTIFICIAL INTELLIGENCE

An artificial neural network (ANN), also called a simulated neural network (SNN) or
commonly just neural network (NN) is an interconnected group of artificial neurons that
uses a mathematical or computational model for information processing based on a
connectionistic approach to computation. In most cases an ANN is an adaptive system
that changes its structure based on external or internal information that flows through the
network. In more practical terms neural networks are non-linear statistical data modeling
or decision making tools. They can be used to model complex relationships between
inputs and outputs or to find patterns in data.

4.4.1 BACKGROUND

An artificial neural network involves a network of simple processing elements (artificial
neurons) which can exhibit complex global behavior, determined by the connections
between the processing elements and element parameters. Artificial neurons were first
proposed in 1943 by Warren McCulloch, a neurophysiologist, and Walter Pitts, an MIT
logician. One classical type of artificial neural network is the Hopfield net.

In a neural network model simple nodes, which can be called variously "neurons",
"neurodes", "Processing Elements" (PE) or "units", are connected together to form a
network of nodes — hence the term "neural network". While a neural network does not
have to be adaptive per se, its practical use comes with algorithms designed to alter the
strength (weights) of the connections in the network to produce a desired signal flow.

In modern software implementations of artificial neural networks the approach inspired
by biology has more or less been abandoned for a more practical approach based on
statistics and signal processing. In some of these systems neural networks, or parts of
neural networks (such as artificial neurons), are used as components in larger systems that
combine both adaptive and non-adaptive elements.

The concept of a neural network appears to have first been proposed by Alan Turing in
his 1948 paper "Intelligent Machinery".

4.4.2 COMPUTATIONAL MODELS

• The McCulloch-Pitts model

The computational model of the neuron presented by McCulloch and Pitts
is binary and operates in discrete time. The output of a neuron is 1 when
an action potential is generated, and 0 otherwise. A weight value wi is
associated with each (ith) connection to the neuron. Such weights characterize the
synapses as excitatory if wi > 0, and inhibitory if wi < 0. A neuron responds when
the sum of inhibitions and excitations is larger than a certain threshold.

• The Perceptron

The American psychologist Frank Rosenblatt proposed a computational
model of neurons he called the perceptron. The essential innovation was the introduction
of numerical interconnection weights instead of the simple inhibitory/excitatory
connections of the McCulloch-Pitts model.

• Multilayer perceptrons
A perceptron with n inputs can separate two classes of input vectors by tracing
a hyper-plane in the n-dimensional space of such input vectors. By arranging
several such perceptron units into layers and connecting the outputs of the perceptrons of
one layer with the inputs of the next layer we are able to build
multilayer perceptrons. Such an MLP is called a fully connected,
layered feedforward network since signals are fully connected between layers, and no
backward signals are transmitted. In this case, it has two layers, the output layer and a so-
called hidden layer.

4.4.3 NEURAL NETWORK TOPOLOGIES

Feed forward networks share many properties with combinatorial logic, i.e., we
have seen how to implement logic gates, and how we can use networks of logic
gates to implement complex decision surfaces, just by connecting them into layers.
Similarly, recurrent networks, that is, neural networks with recurrent connectivity
behave like sequential logic; they have a sort of short-term memory and
previous experiences may have an effect on the activation of such units. The
Hopfield network is a fully connected network introduced in the early 1980s
as a model of memory. This and other models exhibit a transient dynamic behavior, but
eventually settle into a stable state that permits them to store information.
Such networks can be used as associative memories which are content addressable, i.e., a
memorized pattern can be retrieved by specifying a part of it. More recent recurrent
neural network models include Elman networks and Jordan networks, which are widely used
when the system to be modeled exhibits temporal (dynamic) behavior.

4.4.4 LEARNING PARADIGMS

There are three major learning paradigms, each corresponding to a particular abstract
learning task. These are supervised learning, unsupervised learning and reinforcement

learning. Usually any given type of network architecture can be employed in any of those
tasks.

4.4.4.1 SUPERVISED LEARNING

In supervised learning, a set of example pairs (x, y) is given, and the
aim is to find a function f in the allowed class of functions that matches the examples. In
other words, we wish to infer the mapping implied by the data; the cost function
is related to the mismatch between our mapping and the data.

4.4.4.2 UNSUPERVISED LEARNING

In unsupervised learning we are given some data x and a cost function, which is to be
minimized, and which can be any function of x and the network's output, f. The cost function
is determined by the task formulation. Most applications fall within the domain of
estimation problems such as statistical modeling, compression, filtering, blind source
separation and clustering.
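
Clustering is one concrete instance of this formulation: the cost depends only on the data x and the model's output. The sketch below minimizes a k-means style within-cluster squared distance purely as an illustration of an unsupervised cost being driven down; it is not part of the recognition system described in this report.

```python
import numpy as np

def kmeans(X, k=2, iters=20, seed=0):
    """Minimize the within-cluster squared distance, an unsupervised cost of the data."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Model output: the index of the nearest center for every data point.
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        # Move each center to the mean of its points, reducing the cost.
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
    cost = ((X - centers[labels]) ** 2).sum()
    return labels, centers, cost

# Two illustrative, well-separated clouds of points.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(20, 2)), rng.normal(size=(20, 2)) + 5.0])
print(kmeans(X)[1:])
```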

4.4.4.3 REINFORCEMENT LEARNING

In reinforcement learning, data x is usually not given, but generated by an agent's
interactions with the environment. At each point in time t, the agent performs an action yt
and the environment generates an observation xt and an instantaneous cost ct, according to
some (usually unknown) dynamics. The aim is to discover a policy for selecting actions
that minimizes some measure of a long-term cost, i.e. the expected cumulative cost. The
environment's dynamics and the long-term cost for each policy are usually unknown, but
can be estimated. ANNs are frequently used in reinforcement learning as part of the
overall algorithm. Tasks that fall within the paradigm of reinforcement learning are
control problems, games and other sequential decision making tasks.

4.5 APPLICATIONS

The utility of artificial neural network models lies in the fact that they can be used to
infer a function from observations and also to use it. This is particularly useful in
applications where the complexity of the data or task makes the design of such a function
by hand impractical.

The tasks to which artificial neural networks are applied tend to fall within the following
broad categories:

• Function approximation, or regression analysis, including time series prediction
and modeling.
• Classification, including pattern and sequence recognition, novelty detection and
sequential decision making.
• Data processing, including filtering, clustering, blind signal separation and
compression.

Application areas include system identification and control (vehicle control, process
control), game-playing and decision making (backgammon, chess, racing), pattern
recognition (radar systems, face identification, object recognition, etc.), sequence
recognition (gesture, speech, handwritten text recognition), medical diagnosis, financial
applications, data mining (or knowledge discovery in databases, "KDD"), visualization
and e-mail spam filtering.

CHAPTER 5

OBTAINING THE DATA SET

The data was obtained using Microsoft Sound Recorder, which is available with a standard
Windows installation. The sound recording was done at a sampling frequency of
44.1 kHz. The size of each sample was 8 bits, and the recording was done through a
single channel, i.e. mono.

To build the database, ten utterances of each word were taken from three people, of which
eight were used for training and the remaining two for testing. The network was tested with
samples of many words to encompass the whole voice spectrum. The words used were
‘That’, ‘Then’, ‘hello’, ‘Tin’ and ‘Ghosh’. Since each person had made ten utterances of
each word, this resulted in 60 utterances per person or a database of 180 word utterances.

The available dataset consisted of raw recordings and had to be filtered before use. To
this end, some preprocessing in the form of filtering and windowing was carried out. In
digital signal processing, electronic noise is generally assumed to be the high-frequency part
of the signal. Thus, in order to improve the SNR, the samples were low-pass filtered with
the use of a Hamming window.
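
A sketch of this preprocessing step is given below. It assumes the recordings are stored as 8-bit mono WAV files sampled at 44.1 kHz; the file name, filter length and cut-off frequency are illustrative assumptions rather than the exact values used in this work.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import firwin, lfilter

# Read one 8-bit, mono, 44.1 kHz recording (the file name is hypothetical).
fs, samples = wavfile.read("that_speaker1_take1.wav")
samples = samples.astype(np.float64)
samples -= samples.mean()          # remove the DC offset of unsigned 8-bit data

# Low-pass FIR filter designed with a Hamming window, to suppress the
# high-frequency electronic noise and improve the SNR (assumed cut-off).
taps = firwin(numtaps=101, cutoff=4000.0, fs=fs, window="hamming")
filtered = lfilter(taps, 1.0, samples)
```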

CHAPTER 6

LINEAR PREDICTIVE CODING

Linear predictive coding (LPC) is a tool used mostly in audio signal processing and
speech processing for representing the spectral envelope of a digital signal of speech in
compressed form, using the information of a linear predictive model. It is one of the most
powerful speech analysis techniques, and one of the most useful methods for encoding
good quality speech at a low bit rate and provides extremely accurate estimates of speech
parameters.

6.1 OVERVIEW

LPC starts with the assumption that a speech signal is produced by a buzzer at the end of
a tube (voiced sounds), with occasional added hissing and popping sounds (sibilants and
plosive sounds). Although apparently crude, this model is actually a close approximation
to the reality of speech production. The glottis produces the buzz, which is characterized
by its intensity (loudness) and frequency (pitch). The vocal tract (the throat and mouth)
forms the tube, which is characterized by its resonances, which give rise to formants, or
enhanced frequency bands in the sound produced. Hisses and pops are generated by the
action of the tongue, lips and throat during sibilants and plosives.

LPC analyzes the speech signal by estimating the formants, removing their effects from
the speech signal, and estimating the intensity and frequency of the remaining buzz. The
process of removing the formants is called inverse filtering, and the remaining signal
after the subtraction of the filtered modeled signal is called the residue.

The numbers which describe the intensity and frequency of the buzz, the formants, and
the residue signal, can be stored or transmitted somewhere else. LPC synthesizes the
speech signal by reversing the process: use the buzz parameters and the residue to create
a source signal, use the formants to create a filter (which represents the tube), and run
the source through the filter, resulting in speech.

Because speech signals vary with time, this process is done on short chunks of the speech
signal, which are called frames; generally 30 to 50 frames per second give intelligible
speech with good compression.
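
The frame-by-frame processing described above can be sketched as follows; the 30 ms frame length and 25 ms hop (about 40 frames per second, within the range quoted above) and the Hamming window are illustrative choices, not necessarily those used elsewhere in this work.

```python
import numpy as np

def split_into_frames(signal, fs, frame_ms=30.0, hop_ms=25.0):
    """Cut a speech signal into short, overlapping, Hamming-windowed frames."""
    frame_len = int(fs * frame_ms / 1000.0)
    hop_len = int(fs * hop_ms / 1000.0)
    window = np.hamming(frame_len)
    frames = [signal[start:start + frame_len] * window
              for start in range(0, len(signal) - frame_len + 1, hop_len)]
    return np.array(frames)

# One second of a stand-in signal at 44.1 kHz gives roughly 40 frames.
fs = 44100
frames = split_into_frames(np.random.randn(fs), fs)
print(frames.shape)
```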

6.2 EARLY HISTORY OF LPC

According to Robert M. Gray of Stanford University, the first ideas leading to LPC
started in 1966 when S. Saito and F. Itakura of NTT described an approach to automatic
phoneme discrimination that involved the first maximum likelihood approach to speech
coding. In 1967, John Burg outlined the maximum entropy approach. In 1969 Itakura and
Saito introduced partial correlation, in May Glen Culler proposed realtime speech encoding,
and B. S. Atal presented an LPC speech coder at the Annual Meeting of the Acoustical
Society of America. In 1971 realtime LPC using 16-bit LPC hardware was demonstrated
by Philco-Ford; four units were sold.

In 1972 Bob Kahn of ARPA, with Jim Forgie (Lincoln Laboratory, LL) and Dave
Walden (BBN Technologies), started the first developments in packetized speech, which
would eventually lead to Voice over IP technology. In 1973, according to Lincoln
Laboratory informal history, the first realtime 2400 bit/s LPC was implemented by Ed
Hofstetter. In 1974 the first realtime two-way LPC packet speech communication was
accomplished over the ARPANET at 3500 bit/s between Culler-Harrison and Lincoln
Laboratories. In 1976 the first LPC conference took place over the ARPANET using the
Network Voice Protocol, between Culler-Harrison, ISI, SRI, and LL at 3500 bit/s. And
finally in 1978, Vishwanath et al. of BBN developed the first variable-rate LPC
algorithm.

6.3 LPC COEFFICIENTS REPRESENTATION

LPC is frequently used for transmitting spectral envelope information, and as such it has
to be tolerant of transmission errors. Transmission of the filter coefficients directly (see
linear prediction for definition of coefficients) is undesirable, since they are very
sensitive to errors. In other words, a very small error can distort the whole spectrum, or
worse, a small error might make the prediction filter unstable.

There are more advanced representations such as Log Area Ratios (LAR), line spectral
pairs (LSP) decomposition and reflection coefficients. Of these, especially LSP
decomposition has gained popularity, since it ensures stability of the predictor, and
spectral errors are local for small coefficient deviations.

6.3.1 THE PREDICTION MODEL

The most common representation is

x̂(n) = Σ ai x(n − i),  with the sum running over i = 1, …, p,

where x̂(n) is the predicted signal value, x(n − i) the previous observed values, and ai the
predictor coefficients. The error generated by this estimate is

e(n) = x(n) − x̂(n),

where x(n) is the true signal value.

These equations are valid for all types of (one-dimensional) linear prediction. The
differences are found in the way the parameters ai are chosen.

For multi-dimensional signals the error metric is often defined as

e(n) = || x(n) − x̂(n) ||,

where || . || is a suitably chosen vector norm.
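
These definitions translate directly into code. The short sketch below computes x̂(n) and the residual e(n) for a given set of predictor coefficients; the order-2 coefficients and the sinusoidal test signal are purely illustrative (for a pure sinusoid this predictor is exact, so the residual is essentially zero).

```python
import numpy as np

def lpc_residual(x, a):
    """Residual e(n) = x(n) - sum_i a_i x(n - i) for predictor coefficients a."""
    p = len(a)
    x = np.asarray(x, dtype=float)
    e = x.copy()                       # the first p samples have no full prediction
    for n in range(p, len(x)):
        x_hat = sum(a[i - 1] * x[n - i] for i in range(1, p + 1))
        e[n] = x[n] - x_hat
    return e

x = np.sin(0.3 * np.arange(200))       # illustrative test signal
a = [2.0 * np.cos(0.3), -1.0]          # order-2 predictor that is exact for this sinusoid
print(np.abs(lpc_residual(x, a)[2:]).max())   # close to zero
```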

6.3.2 ESTIMATING THE PARAMETERS

The most common choice in optimization of parameters ai is the root mean square
criterion, which is also called the autocorrelation criterion. In this method we minimize
the expected value of the squared error E[e^2(n)], which yields the equation

Σ ai R(i − j) = R(j),  with the sum running over i = 1, …, p,

for 1 ≤ j ≤ p, where R is the autocorrelation of the signal x(n), defined as

R(j) = E[ x(n) x(n − j) ],

where E is the expected value. In the multi-dimensional case this corresponds to
minimizing the L2 norm.

The above equations are called the normal equations or Yule-Walker equations. In matrix
form the equations can be equivalently written as

R a = r,

where the autocorrelation matrix R is a symmetric, Toeplitz matrix with elements ri,j =
R(i − j), vector r is the autocorrelation vector rj = R(j), and vector a is the parameter
vector.
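
Under the autocorrelation criterion the coefficients can therefore be obtained by estimating R(j) from a frame and solving the Toeplitz system R a = r, as sketched below with SciPy's Toeplitz solver; the prediction order p = 12 is an assumed, typical value for speech, not necessarily the one used in this work.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_autocorrelation(frame, p=12):
    """Solve the Yule-Walker normal equations R a = r for the LPC coefficients."""
    frame = np.asarray(frame, dtype=float)
    # Autocorrelation estimates R(0) ... R(p) of the (windowed) frame.
    R = np.array([np.dot(frame[:len(frame) - j], frame[j:]) for j in range(p + 1)])
    # First column of the symmetric Toeplitz matrix is R(0)...R(p-1); r is R(1)...R(p).
    return solve_toeplitz(R[:p], R[1:])

frame = np.random.randn(1024)          # stand-in for a windowed speech frame
print(lpc_autocorrelation(frame))
```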

Another, more general, approach is to minimize

E[e^2(n)] = E[ ( Σ ai x(n − i) )^2 ],  with the sum running over i = 0, …, p,

where we usually constrain the parameters ai with a0 = −1 to avoid the trivial solution.
This constraint yields the same predictor as above, but the normal equations are then

Σ ai R(i − j) = 0,  for 1 ≤ j ≤ p,

where the index i ranges from 0 to p, and R is a (p + 1) × (p + 1) matrix.

Optimization of the parameters is a wide topic and a large number of other approaches
have been proposed.

Still, the autocorrelation method is the most common and it is used, for example, for
speech coding in the GSM standard.

Solution of the matrix equation Ra = r is computationally a relatively expensive process.
The Gauss algorithm for matrix inversion is probably the oldest solution but this
approach does not efficiently use the symmetry of R and r. A faster algorithm is the
Levinson recursion proposed by Norman Levinson in 1947, which recursively calculates
the solution. Later an improvement was proposed to this algorithm called the split
Levinson recursion which requires about half the number of multiplications and
divisions. It uses a special symmetrical property of parameter vectors on subsequent
recursion levels.

6.3.3 OBTAINING THE COEFFICIENTS

Levinson recursion or Levinson-Durbin recursion is a procedure in linear algebra that recursively calculates the solution to an equation involving a Toeplitz matrix. The algorithm runs in Θ(n²) time, which is a strong improvement over Gauss-Jordan elimination, which runs in Θ(n³).

Newer algorithms, called asymptotically fast or sometimes superfast Toeplitz algorithms, can solve in Θ(n log^p n) for various p (e.g. p = 2, p = 3). Levinson recursion remains popular for several reasons; for one, it is relatively easy to understand in comparison; for another, it can be faster than a superfast algorithm for small n (usually n < 256).

The Levinson-Durbin algorithm was proposed first by Norman Levinson in 1947, improved by J. Durbin in 1960, and subsequently improved to 4n² and then 3n² multiplications by W. F. Trench and S. Zohar, respectively.

Other methods to process data include Schur decomposition and Cholesky decomposition. In comparison to these, Levinson recursion (particularly split Levinson recursion) tends to be faster computationally, but more sensitive to computational inaccuracies like round-off errors.

6.4 THE PROBLEM

Matrix equations of the following form are considered:

M x = y

The Levinson-Durbin algorithm may be used for any such equation, so long as M is a known Toeplitz matrix with a nonzero main diagonal. Here y is a known vector, and x is an unknown vector of numbers x_i yet to be determined.

For the sake of this discussion, ê_i is a vector made up entirely of zeroes, except for its i-th place, which holds the value one. Its length is implicitly determined by the surrounding context. The term N refers to the width of the matrix above, i.e. M is an N×N matrix. Finally, superscripts refer to the recursion (inductive) index, whereas subscripts denote element indices. For example (and by definition), the matrix T^n is the n×n matrix which copies the upper-left n×n block from M, that is, T^n_{i,j} = M_{i,j}.

T^n is also a Toeplitz matrix, meaning that it can be written as

T^n =
[ t_0      t_{-1}   ...  t_{-(n-1)} ]
[ t_1      t_0      ...  t_{-(n-2)} ]
[ ...      ...      ...  ...        ]
[ t_{n-1}  t_{n-2}  ...  t_0        ]
6.4.1 INTRODUCTORY STEPS

The algorithm proceeds in two steps. In the first step, two sets of vectors, called the
forward and backward vectors, are established. The forward vectors are used to help get
the set of backward vectors; then they can be immediately discarded. The backwards
vectors are necessary for the second step, where they are used to build the solution
desired.

Levinson-Durbin recursion defines the nth "forward vector", denoted f^n, as the vector of length n which satisfies

T^n f^n = ê_1

The nth "backward vector" b^n is defined similarly; it is the vector of length n which satisfies

T^n b^n = ê_n

An important simplification can occur when M is a symmetric matrix; then the two vectors are related by b^n_i = f^n_{n+1−i}, that is, they are row-reversals of each other. This can save some extra computation in that special case.

6.4.2 OBTAINING THE BACKWARD VECTORS

Even if the matrix is not symmetric, the nth forward and backward vectors may be found from the vectors of length n − 1 as follows. First, the forward vector may be extended with a zero to obtain

T^n [ f^{n-1} ; 0 ] = [ ê_1 ; ε_f^n ]

where the semicolon denotes vertical stacking. In going from T^{n-1} to T^n, the extra column added to the matrix does not perturb the solution when a zero is used to extend the forward vector. However, the extra row added to the matrix has perturbed the solution, and it has created an unwanted error term ε_f^n which occurs in the last place. The above equation gives it the value

ε_f^n = \sum_{i=1}^{n-1} M_{n,i} \, f^{n-1}_i
This error will be returned to shortly and eliminated from the new forward vector; but
first, the backwards vector must be extended in a similar (albeit reversed) fashion. For the
backwards vector,

T^n [ 0 ; b^{n-1} ] = [ ε_b^n ; ê_n ]

Like before, the extra column added to the matrix does not perturb this new backwards vector, but the extra row does. Here we have another unwanted error ε_b^n with value

ε_b^n = \sum_{i=2}^{n} M_{1,i} \, b^{n-1}_{i-1}

These two error terms can be used to eliminate each other. Using the linearity of matrices,

T^n ( α [ f^{n-1} ; 0 ] + β [ 0 ; b^{n-1} ] ) = α [ ê_1 ; ε_f^n ] + β [ ε_b^n ; ê_n ]

If α and β are chosen so that the right hand side yields ê1 or ên, then the quantity in the
parentheses will fulfill the definition of the nth forward or backward vector, respectively.
With those alpha and beta chosen, the vector sum in the parentheses is simple and yields
the desired result.

To find these coefficients, α_f^n and β_f^n are chosen such that

f^n = α_f^n [ f^{n-1} ; 0 ] + β_f^n [ 0 ; b^{n-1} ]

and respectively α_b^n and β_b^n are chosen such that

b^n = α_b^n [ f^{n-1} ; 0 ] + β_b^n [ 0 ; b^{n-1} ]

By multiplying both previous equations by T^n one gets the following equation:

α_f^n [ ê_1 ; ε_f^n ] + β_f^n [ ε_b^n ; ê_n ] = ê_1,   α_b^n [ ê_1 ; ε_f^n ] + β_b^n [ ε_b^n ; ê_n ] = ê_n

Now, all the zeroes in the middle of the two vectors above being disregarded and collapsed, only the following 2×2 equation is left:

[ 1      ε_b^n ] [ α_f^n  α_b^n ]   [ 1  0 ]
[ ε_f^n  1     ] [ β_f^n  β_b^n ] = [ 0  1 ]

With these solved for (by using the Cramer 2×2 matrix inverse formula), the new forward and backward vectors are

f^n = (1 / (1 − ε_b^n ε_f^n)) [ f^{n-1} ; 0 ] − (ε_f^n / (1 − ε_b^n ε_f^n)) [ 0 ; b^{n-1} ]

b^n = (1 / (1 − ε_b^n ε_f^n)) [ 0 ; b^{n-1} ] − (ε_b^n / (1 − ε_b^n ε_f^n)) [ f^{n-1} ; 0 ]
Performing these vector summations, then, gives the nth forward and backward vectors
from the prior ones. All that remains is to find the first of these vectors, and then some
quick sums and multiplications give the remaining ones. The first forward and backward
vectors are simply

f^1 = b^1 = [ 1 / M_{1,1} ]
6.4.3 USING THE BACKWARD VECTORS

The above steps give the N backward vectors for M. From there, a more arbitrary equation is

M x = y

The solution can be built in the same recursive way that the backward vectors were built. Accordingly, x must be generalized to a sequence of intermediate solutions x^n, from which x^N = x.

The solution is then built recursively by noticing that if

T^{n-1} [ x^{n-1}_1, ..., x^{n-1}_{n-1} ]^T = [ y_1, ..., y_{n-1} ]^T

then, extending x^{n-1} with a zero again, and defining an error constant ε_x^{n-1} where necessary:

T^{n} [ x^{n-1}_1, ..., x^{n-1}_{n-1}, 0 ]^T = [ y_1, ..., y_{n-1}, ε_x^{n-1} ]^T

We can then use the nth backward vector to eliminate the error term and replace it with the desired formula as follows:

x^n = [ x^{n-1} ; 0 ] + ( y_n − ε_x^{n-1} ) b^n

Extending this method until n = N yields the solution x = x^N.

In practice, these steps are often done concurrently with the rest of the procedure, but
they form a coherent unit and deserve to be treated as their own step.
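A minimal MATLAB sketch of the recursion described above, restricted to the symmetric Toeplitz case relevant to LPC (the backward vector is then simply the reversed forward vector); the function name and interface are illustrative only:

function x = levinson_solve(t, y)
% Sketch of Levinson-Durbin recursion for a symmetric Toeplitz system M*x = y.
% t : first column of the N-by-N symmetric Toeplitz matrix M (t(1) nonzero)
% y : known right-hand side vector
t = t(:); y = y(:);
N = length(y);
f = 1 / t(1);                        % first forward vector f^1
x = y(1) / t(1);                     % first solution vector x^1
for n = 2:N
    ef = t(n:-1:2).' * f;            % forward error eps_f^n
    denom = 1 - ef^2;                % symmetric case: eps_b^n = eps_f^n
    fprev = f;
    f = ([fprev; 0] - ef * [0; flipud(fprev)]) / denom;  % new forward vector
    b = flipud(f);                   % backward vector b^n
    ex = t(n:-1:2).' * x;            % error from extending the solution with a zero
    x = [x; 0] + (y(n) - ex) * b;    % eliminate it using the backward vector
end
end

For the Yule-Walker system of Section 6.3.2, levinson_solve(r(1:p), r(2:p+1)) should agree with the backslash solution R \ r; MATLAB's Signal Processing Toolbox also provides a built-in levinson function for that autocorrelation case.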

Figure 6.1: LPC coefficients for the word close uttered by a male speaker

CHAPTER 7

THE NEURAL NETWORK

7.1 PERCEPTRON

The perceptron is a type of artificial neural network invented in 1957 at the Cornell
Aeronautical Laboratory by Frank Rosenblatt. It can be seen as the simplest kind of
feedforward neural network: a linear classifier.

7.1.1 DEFINITION

The perceptron is a binary classifier that maps its input x (a real-valued vector) to an output value f(x) (a single binary value):

f(x) = 1 if w · x + b > 0, and f(x) = 0 otherwise

where w is a vector of real-valued weights and w · x is the dot product (which computes a weighted sum). b is the 'bias', a constant term that does not depend on any input value.

The value of f(x) (0 or 1) is used to classify x as either a positive or a negative instance, in
the case of a binary classification problem. The bias can be thought of as offsetting the
activation function, or giving the output neuron a "base" level of activity. If b is negative,
then the weighted combination of inputs must produce a positive value greater than − b in
order to push the classifier neuron over the 0 threshold. Spatially, the bias alters the
position (though not the orientation) of the decision boundary. Since the inputs are fed
directly to the output unit via the weighted connections, the perceptron can be considered
the simplest kind of feed-forward neural network.
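As a minimal sketch of this decision rule in MATLAB (the numeric values below are purely illustrative):

w = [0.4; -0.2; 0.1];              % example weight vector
b = -0.05;                         % bias term
x = [1.0; 0.3; -0.7];              % an input vector to classify
f = double(dot(w, x) + b > 0);     % 1 for a positive instance, 0 for a negative one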

7.1.2 LEARNING ALGORITHM

The learning algorithm is the same across all neurons; therefore everything that follows is applied to a single neuron in isolation. Some variables have to be defined first:

• x(j) denotes the j-th item in the n-dimensional input vector
• w(j) denotes the j-th item in the weight vector
• f(x) denotes the output from the neuron when presented with input x
• α is the learning rate, a constant with 0 < α ≤ 1

Further, it is assumed for convenience that the bias term b is zero. This is not a restriction
since an extra dimension n + 1 can be added to the input vectors x with x(n + 1) = 1, in
which case w(n + 1) replaces the bias term.

Figure 7.1: The Perceptron

The appropriate weights are applied to the inputs, and the resulting weighted sum passed
to a function which produces the output y. Learning is modeled as the weight vector
being updated for multiple iterations over all training examples. Let

D_m = {(x_1, y_1), ..., (x_m, y_m)}

denote a training set of m training examples.

Each iteration the weight vector is updated as follows. For each (x, y) pair in D_m,

w(j) := w(j) + α ( y − f(x) ) x(j)   for each weight index j

Note that this means that a change in the weight vector will only take place for a given
training example (x,y) if its output f(x) is different from the desired output y.

The initialization of w is usually performed simply by setting w(j) := 0 for all elements w(j). The training set D_m is said to be linearly separable if there exists a positive constant γ and a weight vector w such that y_i ( ⟨w, x_i⟩ + b ) > γ for all i. Novikoff (1962) proved that the perceptron algorithm converges after a finite number of iterations if the data set is linearly separable, and that the number of mistakes is bounded by (2R/γ)², where R is the maximum norm of an input vector. However, if the training set is not linearly separable, the above online algorithm is not guaranteed to converge. Note that the decision boundary of a perceptron is invariant with respect to scaling of the weight vector, i.e. a perceptron trained with initial weight vector w and learning rate α is an identical estimator to a perceptron trained with initial weight vector w / α and learning rate 1.

Thus, since the initial weights become irrelevant with an increasing number of iterations, the learning rate does not matter in the case of the perceptron and is usually just set to one.
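A minimal sketch of this online update rule in MATLAB, with a small hypothetical training set (logical AND) and the bias folded into an extra weight as described above:

X = [0 0; 0 1; 1 0; 1 1];            % hypothetical training inputs, one row per example
Y = [0; 0; 0; 1];                    % desired outputs (logical AND, linearly separable)
X = [X, ones(size(X,1), 1)];         % append a constant 1 so w(end) acts as the bias
w = zeros(size(X,2), 1);             % initialise all weights to zero
alpha = 1;                           % learning rate (set to one, as noted above)
for epoch = 1:20
    for i = 1:size(X,1)
        f = double(X(i,:) * w > 0);            % perceptron output for example i
        w = w + alpha * (Y(i) - f) * X(i,:).'; % update only when the output is wrong
    end
end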

7.2 MULTILAYER PERCEPTRON

A multilayer perceptron is a feedforward artificial neural network model that maps sets
of input data onto a set of appropriate outputs. It is a modification of the standard linear
perceptron in that it uses three or more layers of neurons (nodes) with nonlinear
activation functions, and is more powerful than the perceptron in that it can distinguish
data that is not linearly separable, or separable by a hyperplane.

7.2.1 ACTIVATION FUNCTION

If a multilayer perceptron has a linear activation function in all neurons, that is, a linear function that maps the weighted inputs to each neuron's output, then it is easily
proved with linear algebra that any number of layers can be reduced to the standard two-
layer input-output model (see perceptron). What makes a multilayer perceptron different
is that each neuron uses a nonlinear activation function which was developed to model
the frequency of action potentials, or firing, of biological neurons in the brain. This
function is modeled in several ways, but must always be normalizable and differentiable.

The two main activation functions used in current applications are both sigmoids, and are described by

y(v_i) = tanh(v_i)   and   y(v_i) = 1 / (1 + e^{−v_i})

in which the former function is a hyperbolic tangent which ranges from −1 to 1, and the latter, the logistic function, is equivalent in shape but ranges from 0 to 1. Here y_i is the output of the ith node (neuron) and v_i is the weighted sum of the input synapses. More specialized activation
functions include radial basis functions which are used in another class of supervised
neural network models.
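In MATLAB's Neural Network Toolbox these two sigmoids correspond to the tansig and logsig transfer functions that are used later in the project code; a quick sketch of their evaluation:

v = linspace(-5, 5, 11);     % example values of the weighted input sum
y_tanh = tansig(v);          % hyperbolic tangent sigmoid, range (-1, 1)
y_log  = logsig(v);          % logistic sigmoid, range (0, 1)
% equivalently: y_tanh = tanh(v); y_log = 1 ./ (1 + exp(-v));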

7.2.2 LAYERS

The multilayer perceptron consists of an input and an output layer with one or more
hidden layers of nonlinearly-activating nodes. Each node in one layer connects with a
certain weight wij to every node in the following layer.

7.2.3 LEARNING THROUGH BACKPROPAGATION

Learning occurs in the perceptron by changing connection weights (or synaptic weights)
after each piece of data is processed, based on the amount of error in the output compared
to the expected result. This is an example of supervised learning, and is carried out
through backpropagation, a generalization of the least mean squares algorithm in the
linear perceptron.

The error in output node j for the nth data point is represented by e_j(n) = d_j(n) − y_j(n), where d is the target value and y is the value produced by the perceptron. Corrections are then made to the weights of the nodes so as to minimize the energy of the error in the entire output, given by

ε(n) = (1/2) \sum_j e_j^2(n)

Using gradient descent, the change in each weight is

Δw_{ji}(n) = −η ( ∂ε(n) / ∂v_j(n) ) y_i(n)
where yi is the output of the previous neuron and η is the learning rate, which is carefully
selected to ensure that the weights converge to a response that is neither too specific nor
too general. In programming applications, η typically ranges from 0.2 to 0.8.

The derivative to be calculated depends on the input synapse sum vj, which itself varies.
It is easy to prove that for an output node this derivative can be simplified to

−∂ε(n) / ∂v_j(n) = e_j(n) φ′(v_j(n))

where φ′ is the derivative of the activation function described above, which itself does not vary. The analysis is more difficult for the change in weights to a hidden node, but it can be shown that the relevant derivative is

−∂ε(n) / ∂v_j(n) = φ′(v_j(n)) \sum_k ( −∂ε(n) / ∂v_k(n) ) w_{kj}(n)
Note that this depends on the change in weights of the kth nodes, which represent the
output layer. So to change the hidden layer weights, the output layer weights have to be
changed according to the derivative of the activation function, and so this algorithm
represents a backpropagation of the activation function.
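As a rough sketch of one such update for a single hidden layer with logistic activations (the sizes, data, and initial weights below are illustrative only, and biases are omitted for brevity):

x  = rand(13, 1);                  % one input pattern (e.g. 13 LPC coefficients)
d  = 1;                            % desired output for this pattern
W1 = 0.1 * randn(12, 13);          % input-to-hidden weights
W2 = 0.1 * randn(1, 12);           % hidden-to-output weights
eta = 0.5;                         % learning rate
sig = @(v) 1 ./ (1 + exp(-v));     % logistic activation function
y1 = sig(W1 * x);                  % hidden layer outputs
y2 = sig(W2 * y1);                 % network output
delta2 = (d - y2) .* y2 .* (1 - y2);         % output-layer local gradient e*phi'(v)
delta1 = (W2.' * delta2) .* y1 .* (1 - y1);  % hidden-layer gradients, backpropagated
W2 = W2 + eta * delta2 * y1.';     % weight change = eta * local gradient * previous output
W1 = W1 + eta * delta1 * x.';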

7.2.4 APPLICATIONS

Multilayer perceptrons using a backpropagation algorithm are the standard algorithm for
any supervised-learning pattern recognition process and the subject of ongoing research
in computational neuroscience and parallel distributed processing. They are useful in
research in terms of their ability to solve problems stochastically, which often allows one
to get approximate solutions for extremely complex problems like fitness approximation.
Currently, they are most commonly seen in speech recognition, image recognition, and
machine translation software, but they have also seen applications in other fields such as
cyber security. In general, their most important use has been in the growing field of
artificial intelligence, where the multilayer perceptron's power comes from its similarity
to certain biological neural networks in the human brain.

CHAPTER 8

IMPLEMENTATION IN MATLAB

8.1 OVERVIEW OF THE WORK DONE

Speaker recognition was done using a three layer perceptron network. The speech signal
is filtered and then approximated using a 12th-order linear prediction polynomial. The coefficients of
this polynomial are then fed to the network which is then trained using a set of the
coefficients. The network once trained is used for testing. The first layer of the network
has 13 inputs. This is the number of coefficients produced by the predictor polynomial.
The second layer has 12 neurons and this is the hidden layer. The number of neurons in
this layer has been obtained by trial. The output layer has a number of neurons equal to the number of people that need to be identified. In this case it is one.

8.2 THE MATLAB CODE

8.2.1 CREATION OF THE NETWORK

Constructing Layers

An empty network object named 'nn' is to be created in the workspace. It can be done by
typing

nn = network;

The properties of the input layer have to be defined. The number of inputs of the network is set first:

nn.numInputs = 1;

The number of neurons in the input layer has to be defined. This should of course be equal to the dimensionality of the data set. The appropriate property to set is nn.inputs{i}.size, where i is the index of the input.
nn.inputs{1}.size = 13;

The next properties to set are nn.numLayers, which sets the total number of layers in the network, and nn.layers{i}.size, which sets the number of neurons in the ith layer. To build the required network, two layers were defined (a hidden layer with 12 neurons and an output layer with 1 neuron), using:

nn.numLayers = 2;
nn.layers{1}.size = 12;
nn.layers{2}.size = 1;

Connecting Layers
To have a functional network, it is necessary to define the connections between the layers. Connecting the network input to a layer is done by setting nn.inputConnect(i) to 1 for the appropriate layer i. The connections between the layers themselves are defined in a connectivity matrix called nn.layerConnect, which can have either 0 or 1 as element entries. If element (i,j) is 1, then the outputs of layer j are connected to the inputs of layer i.
Defining the output layer is done by setting nn.outputConnect(i) to 1 for the appropriate
layer i.

Finally, since a supervised training set was used, it was necessary to also define which layers are
connected to the target values. This is done by setting nn.targetConnect(i) to 1 for the
appropriate layer i. So, for our example, the appropriate commands would be

nn.inputConnect(1) = 1;
nn.layerConnect(2, 1) = 1;
nn.outputConnect(2) = 1;
nn.targetConnect(2) = 1;

Setting Transfer Functions


Each layer has its own transfer function which is set through the
nn.layers{i}.transferFcn property. So to make the first layer and second layer use
sigmoid transfer functions the definitions were:
nn.layers{1}.transferFcn = 'logsig';

nn.layers{2}.transferFcn = 'logsig';

Weights and Biases

Defining layers that have biases is done by setting the elements of nn.biasConnect to
either 0 or 1, where nn.biasConnect(i) = 1 means layer i has biases attached to it.

An initialization procedure for the weights and biases has to be decided on. When done correctly, it should then be possible to simply issue

nn = init(nn);

to reset all weights and biases according to the chosen procedure.

The first thing to do is to set nn.initFcn. This lets each layer of weights and biases use its own initialisation routine.

nn.initFcn = 'initlay';

Exactly which function this is should of course be specified as well. This is done through
the property nn.layers{i}.initFcn for each layer. The two most practical options here are
Nguyen-Widrow initialisation ('initnw', type 'help initnw' for details), or 'initwb', which lets you choose the initialisation for each set of weights and biases separately.
nn.layers{i}.initFcn = 'initnw';

for each layer i.

Performance Functions
The two most commonly used are the Mean Absolute Error (mae) and the Mean Squared
Error (mse). The mae is usually used in networks for classification, while the mse is most
commonly seen in function approximation networks.

The performance function is set with the nn.performFcn property, for instance:

nn.performFcn = 'mse';

Train Parameters
To train the network using train, the last step is defining nn.trainFcn, and setting the
appropriate parameters in nn.trainParam. Two other useful parameters are
nn.trainParam.epochs, which is the maximum number of times the complete data set
may be used for training, and nn.trainParam.show, which is the time between status
reports of the training function.

nn.trainFcn = 'traingdx';
nn.trainParam.lr_inc = 1.05;
nn.trainParam.lr = 0.05;
nn.trainParam.mc = 0.8;
nn.trainParam.goal = 1e-6;
nn.trainParam.epochs = 3000;
nn.trainParam.show = 100;

8.2.2 THE CODE

clc;

clear all;

s1=wavread('s1');

s2=wavread('s2');

s3=wavread('s3');

s4=wavread('s4');

s5=wavread('s5');

s6=wavread('s6');

s7=wavread('s7');

s8=wavread('s8');

s9=wavread('s9');

s10=wavread('s10');

s11=wavread('s11');

y1=lpc(s1,12);

y2=lpc(s2,12);

y3=lpc(s3,12);

y4=lpc(s4,12);

y5=lpc(s5,12);

y6=lpc(s6,12);

y7=lpc(s7,12);

y8=lpc(s8,12);

y9=lpc(s9,12);

y10=lpc(s10,12);

y11=lpc(s11,12);

nn = network;

nn.numInputs = 1;

nn.inputs{1}.size = 13;

nn.numLayers = 2;

nn.layers{1}.size = 12;

nn.layers{2}.size = 1;

nn.inputConnect(1) = 1;

nn.layerConnect(2, 1) = 1;

nn.outputConnect(2) = 1;

nn.targetConnect(2) = 1;

nn.layers{1}.transferFcn = 'logsig';

nn.layers{2}.transferFcn = 'logsig';

nn.biasConnect(1)=1;

nn.biasConnect(2)=1;

nn.initFcn = 'initlay';

nn.layers{1}.initFcn = 'initnw';

nn.layers{2}.initFcn = 'initnw';

nn=init(nn);

nn.performFcn = 'mse';

nn.trainFcn = 'traingdx';

nn.trainParam.lr_inc = 1.05;

nn.trainParam.lr = 0.05;

nn.trainParam.mc = 0.8;

nn.trainParam.goal = 1e-6;

nn.trainParam.epochs = 3000;

nn.trainParam.show =100;
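The listing stops after the training parameters are set. A minimal sketch of how training and testing might then proceed is given below; the target labels and the choice of test file are assumptions, since the report does not show this part of the script:

P = [y1; y2; y4; y5; y6; y7; y8; y9]';   % 13-by-N input matrix, one LPC vector per column
T = [1 1 1 1 0 0 0 0];                   % assumed targets: 1 = enrolled speaker, 0 = others
nn = train(nn, P, T);                    % train with traingdx as configured above
s_test = wavread('s10');                 % a held-out utterance
y_test = lpc(s_test, 12);
score = sim(nn, y_test');                % output near 1 suggests the enrolled speaker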

8.3 OUTPUTS OF THE PROGRAM

8.3.1 STAGE 1: SIGNAL ACQUISITION

Figure 8.1 The signal

8.3.2 STAGE 2: LPC

Figure 8.2 LPC coefficients

8.3.3 STAGE 3: NETWORK TRAINING

Figure 8.3 Training using sample 9

Figure 8.4 Training using sample 8

Figure 8.5 Training using sample 7

Figure 8.6 Training using sample 6

Figure 8.7 Training using sample 5

Figure 8.8 Training using sample 4

Figure 8.9 Training using sample 3

Figure 8.10 Training using sample 2

Figure 8.11 Training using sample 1