
A Simple LPC Vocoder

Bob Beauchaine
EE586, Spring 2004

Linear Prediction Coefficient voice coding/decoding attempts to take advantage of the fact that sampled speech has significant correlation between samples (assuming a sufficient sampling frequency). Any signal that possesses this property is amenable to linear prediction, in which current speech samples are modeled as a linear combination of past samples. Minimizing the Mean Squared Error between the predicted samples and their actual values leads to a unique set of coefficients for a chosen filter order. Thus an LPC vocoder is, at its heart, little more than a classical MMSE system.

Vocal tract modeling

Of course, any system can be modeled in this sense as long as certain conditions are met by the input signal. This does not imply that the model is a particularly “good” fit to the real data – just that it is the best fit that can be achieved for a given model order. Fortunately, most modern speech production models of the human vocal tract consider the speech production process to be sufficiently well modeled as the output of a linear, time-varying system excited by quasi-periodic impulses, broadband random noise, or some combination of the two.

Most tractable models of human speech production consider the human vocal tract to be a series of time-varying tubes of various lengths and cross sections (see Figure 1). The excitation source produces input to the tube at one end, and the sound propagates through the series of tubes, bouncing off the walls, reflecting at the junctions of each new section, and being shaped by the instantaneous nature of the transmission path. Note that this simplified view of speech production produces an all-pole model. How well this assumption holds up in reality has profound implications for the quality of speech produced with its use, which we will discuss later.

Figure 1 – Human vocal tract acoustic model

If the vocal tract can be modeled in this fashion, we then need a model of the excitation source to complete our view of linear speech production. Analysis of speech waveforms has shown that, in the broadest of terms, speech can be broken down into two major categories.
The first type of speech is called voiced speech. It is produced by periodic excitation of the vocal tract at the larynx via vibration of the vocal cords, driven by the passage of air expelled from the lungs (this excitation source is generally referred to as a glottal pulse). Voiced speech is characterized by a pseudo-stationary fundamental frequency plus its harmonics, shaped by the resonances (formants) of the vocal tract. Voiced speech is typically associated with sounds produced by an “open” vocal tract – in English, vowels provide the best example of voiced speech.

The other primary form of speech is called unvoiced. Unvoiced speech is produced by turbulence when air is forced through a constriction in the vocal tract – by pursed lips, the tongue against the teeth, or other configurations. Unvoiced speech closely resembles random noise, showing little periodicity and little correlation between samples. Of course, nothing as complex as speech production can be reduced to a binary model without some compromise. Much more robust models segment speech into many more categories – sonorants, voiced consonants, nasals, semi-vowels, fricatives – all of which possess quantifiable differences that could in theory be used to improve LPC based speech analysis and production, but which in reality only manage to complicate the analysis.

Figure 2 shows a comparison of the spectrum of a segment of voiced and of unvoiced speech with the impulse response of a 12th order LPC filter predicted from that speech. A cursory analysis shows that the fine details of the speech are clearly lost, but that the general envelope of the spectrum is retained. It is this ability of the LPC model to track the gross spectral characteristics of the human vocal tract that makes it such a successful speech coding model.

Figure 2 – LPC modeling of voiced/unvoiced speech. Left panel: the sound ‘aah’, male speaker; right panel: the sound ‘shhh’, male speaker. Red = spectrum of original speech; blue = impulse response of the 12 pole predicted filter.

LPC Vocoder

Now that we have a rudimentary understanding of speech production, we may finally discuss the basic LPC speech encoder/decoder system (see Figure 3). First, speech is sampled at a frequency appropriate to capture all of the frequency components important for processing and recognition. For voice transmission, 10kHz is typically the sampling frequency of choice, though 8kHz is not unusual. This is because, for almost all speakers, all significant speech energy is contained in frequencies below 4kHz (although some women and children violate this assumption). The speech is then segmented into blocks for processing. Simple LPC analysis uses equal length blocks of between 10 and 30ms. Less than 10ms does not encompass a full period of some low frequency voiced sounds for male speakers. In my own experiments, male speech sounded synthetic with 10ms analysis windows, because for certain frames pitch detection became impossible. More than 30ms violates the basic principle of stationarity upon which the least squares method relies.

Figure 3 – LPC encoder block diagram
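To make the segmentation step concrete, the following is a minimal plain-MATLAB sketch (Matlab is the environment the author's vocoder used, but this code is illustrative, not the project's source). A sine wave stands in for real sampled speech, and the signal is split into non-overlapping 20ms analysis frames:

    fs = 10000;                              % 10kHz sampling rate, as above
    x  = sin(2*pi*120*(0:fs-1)'/fs);         % stand-in for one second of speech
    frameLen = round(0.020 * fs);            % 20ms frames -> 200 samples
    nFrames  = floor(length(x) / frameLen);
    frames   = zeros(frameLen, nFrames);     % one column per analysis frame
    for n = 1:nFrames
        frames(:, n) = x((n-1)*frameLen + (1:frameLen));
    end

In practice a tapered (e.g. Hanning) window is applied to each column before analysis, as the block diagram in Figure 12 shows.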

Once segmented, the speech type is determined as either voiced, unvoiced, or silence. This simple sounding task is indeed the most difficult issue in LPC vocoders – the past 25 years have seen dozens of papers published on varying methods of accomplishing this feat reliably across speakers and environments.

If the speech is classified as voiced, then its pitch must be determined. This is because the human ear, while tolerant of many errors in speech coding, is somewhat sensitive to errors in pitch. Speech that is produced with pitch errors is quite annoying and typically sounds synthetic.

Once the type of speech is determined, the LPC parameters are estimated. Crucial to the speech production model is the prediction order – the number of taps used in the estimation process. Considerable analysis of the prediction error has appeared in the literature. It has been shown that the normalized error drops steeply for prediction orders from 0 to 4, then decreases gradually up to order 12, thereafter flattening out into a region of diminishing returns¹. Thus, since LPC systems are almost universally used in low bit-rate applications, and since a prediction order greater than 14 produces little improvement, LPC coders most often use a model of order 10-14.

¹ Digital Processing of Speech Signals, L.R. Rabiner, R.W. Schafer, pg. 427.

In reality, the determination of the LPC parameters can be performed at any point in the analysis process, and in fact some V/U/S detectors make use of the LPC parameters. Once the LPC parameters are determined, the energy of the signal segment is computed. This is required at the synthesis end to equalize the energy levels of the synthesized speech. Again, the energy content of the signal is, as we shall see, a very powerful discriminator of speech type.

Figure 4 – LPC decoder
At the end of this process, we have a model of the speech segment that captures its type, energy content, pitch, and LPC parameters.
If it is not clear at this point, it should be stated that, for an LPC vocoder, no portion of the actual speech waveform is transmitted. The model alone is sufficient for the receiver to synthesize speech at the far end that can be a remarkable facsimile of the original. This allows LPC vocoders to achieve significant bit rate reduction – commercial systems are available that code down to 2.4kbps².

At the receiver (see Figure 4), the transmitted model parameters are decoded to produce a time varying vocal tract inverse filter. Depending on the V/U/S classification made by the encoder, the excitation source is chosen as either a random noise source or a periodic impulse train whose period was also transmitted by the encoder. The excitation source is scaled appropriately by the transmitted gain, and passed through the inverse filter to produce synthetic speech.

² Digital Compression for Multimedia, Jerry D. Gibson et al.
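The following is a minimal sketch of that synthesis step for a single frame, assuming the decoded parameters (voiced/unvoiced flag, pitch period in samples, gain, and inverse filter coefficients A) have already been recovered; the first-order filter and all values here are toy placeholders, not a real transmitted frame:

    L = 200;                               % samples in one synthesis frame
    A = [1 -0.9];                          % toy inverse filter A(z) for illustration
    voiced = true; pitchPeriod = 80; gain = 0.5;
    if voiced
        e = zeros(L, 1);
        e(1:pitchPeriod:L) = 1;            % periodic impulse train at the pitch period
    else
        e = randn(L, 1);                   % broadband random noise excitation
    end
    e = e * (gain / sqrt(sum(e.^2) / L));  % scale excitation to the transmitted energy
    y = filter(1, A, e);                   % all-pole synthesis filter 1/A(z)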
Computations

One should never lose sight of the fact that LPC is an attempt to significantly reduce, in real time, the bit rate required to transmit speech. Hence some discussion of the computation of the LPC parameters is merited. If the parameters cannot be computed efficiently, then a real time system cannot be realized using the technique. For any LPC system, the basic LPC parameter determination is always the same. We are faced with the need to minimize the squared prediction error, defined as

    ε = (1/M) · Σ_{k=1}^{M} [ s(k) − Σ_{i=1}^{N} a_i·s(k−i) ]²

where N is the model order, M the length of the speech segment under consideration, a_i is the ith LP coefficient, and s(k) is the kth speech sample within the analysis frame. Differentiating this equation with respect to the a's, then setting the result equal to zero, we arrive at

    Σ_{i=1}^{N} a_i · [ (1/M) Σ_{k=1}^{M} s(k−i)·s(k−j) ] = (1/M) Σ_{k=1}^{M} s(k)·s(k−j)

where j varies from 1..N. By letting the “covariance” term be represented as

    φ_ij = (1/M) Σ_{k=1}^{M} s(k−i)·s(k−j)

we can write, in matrix form, the solution to the simultaneous set of LPC equations as

    Φ·A = ψ

Because of the similarity to a covariance function, this method of LPC analysis is known as the covariance method.

An alternative formulation of the LP equations proceeds as follows. First, the limits on the sums for the analysis window are changed to plus and minus infinity, with the implicit understanding that all points outside of the current frame are zero (thus keeping the sums tractable):

    Σ_{i=1}^{N} a_i · [ (1/M) Σ_{k=−∞}^{∞} s(k−i)·s(k−j) ] = (1/M) Σ_{k=−∞}^{∞} s(k)·s(k−j)

Substituting m for k − j,

    Σ_{i=1}^{N} a_i · [ (1/M) Σ_{m=−∞}^{∞} s(m)·s(m+j−i) ] = (1/M) Σ_{m=−∞}^{∞} s(m)·s(m+j)

Because of the infinite limits, each bracketed sum now depends only on the difference j − i of the sample indices; defining

    R(j) = (1/M) Σ_{m=−∞}^{∞} s(m)·s(m+j)

we can write, in matrix form,

    R·A = C
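Per frame, the covariance method therefore reduces to filling in Φ and ψ and solving one small linear system. A sketch, under the common convention that k runs from N+1 to M so that every required past sample s(k−i) lies inside the frame (the k = 1..M form above would reach before the frame's first sample):

    s = randn(240, 1);                     % stand-in for one analysis frame
    N = 10;  M = length(s);                % model order and frame length
    Phi = zeros(N, N);  psi = zeros(N, 1);
    for i = 1:N
        for j = 1:N                        % phi(i,j) = (1/M) sum_k s(k-i)s(k-j)
            Phi(i, j) = (s(N+1-i:M-i)' * s(N+1-j:M-j)) / M;
        end
        psi(i) = (s(N+1:M)' * s(N+1-i:M-i)) / M;
    end
    a = Phi \ psi;                         % LP coefficients a_1..a_N

In a real coder the Cholesky factorization discussed below would replace the generic solve.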
The fundamental difference between these two representations is the form of the covariance or correlation matrix. In the autocorrelation formulation, the R matrix is Toeplitz. In the covariance representation, this is not the case. Toeplitz matrices lend themselves to a particularly elegant, efficient, and recursive solution known as the Durbin or Levinson-Durbin algorithm. Durbin's recursion computes the LP coefficients one order at a time in terms of the previous order's coefficients. While the covariance coefficients can be calculated somewhat more efficiently than brute force matrix inversion through Cholesky decomposition, the Levinson-Durbin algorithm has a decided computational advantage. Additionally, the filter coefficients resulting from the Levinson-Durbin algorithm produce an inherently stable inverse filter³. Both methods find widespread use, though the literature seems to favor the autocorrelation approach (although it should be noted that the Federal LPC-10 standard is based on the covariance method). One other method of note is the lattice based LP filter, which is beyond the scope of this presentation, but uses a lattice based inverse synthesis filter. For comparison, see Table 1 for an analysis of the storage and computational burden of the three methods⁴.

³ Digital Processing of Speech Signals, L.R. Rabiner, R.W. Schafer, pg. 418.
⁴ Ibid.

Table 1 – Storage and computational burden of LPC methods

                      Covariance   Autocorrelation   Lattice
    Storage
      Data            M            M                 3M
      Matrix          N²/2         N                 –
      Window          0            M                 –
    Multiplies
      Windowing       0            M                 –
      Correlation     M*N          M*N               –
      Matrix soln.    N³           N²                5*M*N
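To make the recursion concrete, here is a minimal sketch of the autocorrelation method solved with Levinson-Durbin in plain MATLAB (no toolbox calls; the random frame is a stand-in for windowed speech). Note that the intermediate values k are exactly the PARCOR coefficients that reappear in the stability discussion later:

    s = randn(240, 1);                    % stand-in for one windowed frame
    N = 12;  M = length(s);
    R = zeros(N+1, 1);                    % R(j+1) holds the autocorrelation at lag j
    for j = 0:N
        R(j+1) = (s(1:M-j)' * s(1+j:M)) / M;
    end
    a = zeros(N, 1);                      % predictor coefficients a_1..a_N
    E = R(1);                             % order-0 prediction error energy
    for m = 1:N
        k = R(m+1);                       % build the mth reflection coefficient
        for i = 1:m-1
            k = k - a(i) * R(m+1-i);
        end
        k = k / E;
        aprev = a(1:m-1);
        for i = 1:m-1                     % update the lower-order coefficients
            a(i) = aprev(i) - k * aprev(m-i);
        end
        a(m) = k;
        E = E * (1 - k^2);                % error shrinks at every order
    end
    A = [1; -a];                          % inverse filter A(z) = 1 - sum a_i z^-i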

Speech type determination

Aside from calculation of the LP coefficients, determination of the speech type for a given frame is the single most important, and single most difficult, part of the LPC vocoder process.

On the surface, the determination problem would not seem to be all that difficult. A quick survey of many of the characteristics of voiced speech versus unvoiced speech would lead one to expect that the two are easily separable. Let's take such a survey. First, let's compare the energy content of voiced and unvoiced speech segments.

Figure 5 – Log energy of the utterance of the word “six”

The central portion of Figure 5 shows the vowel (voiced) portion of the utterance “six” as spoken by a male speaker. As the graph shows, the energy content of the voiced portion of the signal is on the order of 15-20dB higher than that of the unvoiced portion, and 40dB higher than the areas of silence. This difference, which is typically maintained within a speaker for all speech, is one of the stronger indicators of speech type.
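Measured per frame, the log energy is a one-line computation; a small illustrative sketch (eps only guards the all-zero silent frame):

    s    = randn(200, 1);                 % stand-in for one analysis frame
    Es   = sum(s.^2) / length(s);         % mean squared energy of the frame
    logE = 10 * log10(Es + eps);          % energy on the dB scale used in Figure 5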
Another metric widely used in speech type analysis is the zero crossing rate. For voiced speech, the majority of the speech energy is concentrated well below 3kHz. For unvoiced speech, considerable energy content is still present at and above 3kHz. Since low frequencies imply low zero crossing rates, and high frequencies imply high zero crossing rates, the zero crossing rate normalized to the block length is highly correlated with the frequency content of the speech segment, which is in turn highly correlated with the speech type (see Figure 6). Note how the zero crossing rate is very high for the ‘s’ and the last portion of the ‘x’ sound, is much lower for the vowel portion, and is almost zero during the silent periods (notice that the word “six” has a stop in its production, where the excitation source to the vocal tract is temporarily blocked, producing momentary silence).

Figure 6 – Zero crossing rate, utterance of the word “six”

Zero crossing rate is very sensitive to any DC offset in the signal, which necessitates a high pass filter prior to its determination for signals of unknown recording origin. It is also sensitive to broadband background noise, which is a much harder problem to eliminate in practice.
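A sketch of the measurement, normalized to the block length as described above; the simple DC-blocking high pass and its pole location are illustrative choices, not a prescription:

    s = randn(200, 1);                    % stand-in for one analysis frame
    alpha = 0.99;
    s = filter([1 -1], [1 -alpha], s);    % remove DC offset before counting
    zcr = sum(abs(diff(sign(s)))) / (2 * length(s));   % crossings per sample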

As stated earlier, in some vocoders the V/U/S determination is made after the calculation of the LP coefficients. This is because there is considerable information about the nature of the speech sample embedded in those coefficients. For voiced speech, with its characteristically lower frequency content, there is usually a strong correlation from one sample to the next. Unvoiced speech, being much closer to random, does not have this property, and the difference manifests itself in properties of the LP coefficients, most notably the first LP coefficient. Consider Figure 7, which shows the first LP coefficient calculated for segments of the utterance of “six” we've been using. Notice how the first coefficient clusters around the value 1 for the unvoiced portions of the speech, but is clustered around -1 for the voiced section and for the silence. This shows some specificity in separating voiced from unvoiced speech, but little ability to distinguish between voiced speech and silence.

Figure 7 – First LP coefficient, utterance of the word “six”

Other difference metrics exist, and the topic is still under consideration 30 years after speech coding first became a hot research topic. Consider Figures 8, 9, and 10, which show a set of indicators used in a speech classification system by Atal and Rabiner⁵. Shown are histograms for the distribution of five metrics of speech type – zero crossing rate, first autocorrelation coefficient, log energy, first LPC coefficient, and LPC estimation error – measured from a variety of speakers and speech samples.

⁵ A Pattern Recognition Approach to Voiced-Unvoiced-Silence Classification with Applications to Speech Recognition, Bishnu S. Atal, Lawrence R. Rabiner, IEEE Transactions on Acoustics, Speech, and Signal Processing, June 1976.
What is obvious in these distributions is that significant differences exist between voiced speech, unvoiced speech, and silence which should make it possible to discriminate between the three. Equally obvious, however, is the overlap in the distributions, which renders any one of the metrics, taken by itself, insufficient to the task. Some manner of maximum likelihood estimator must be employed which minimizes the probability of making a false determination. One useful solution to the problem assumes a multidimensional Gaussian distribution with known mean m_i and covariance matrix W_i, where i = 1, 2, 3 corresponds to silence, unvoiced, or voiced speech. The L-dimensional Gaussian density function for such a system, given a measurement x which is a column vector of distance metrics, has the form

    g_i(x) = (2π)^(−L/2) · |W_i|^(−1/2) · exp[ −(1/2)·(x − m_i)ᵀ·W_i⁻¹·(x − m_i) ]

Our decision rule which minimizes classification errors states that the measurement vector x should be assigned to class i if

    p_i·g_i(x) ≥ p_j·g_j(x)   for all j

where p_i is the a priori probability that x belongs to class i. This expression for the decision metric can be simplified, since the logarithm is a monotonically increasing function: maximizing p_i·g_i(x) is equivalent to minimizing

    d_i(x) = (x − m_i)ᵀ·W_i⁻¹·(x − m_i) + ln|W_i| − 2·ln(p_i)

Since the last two terms don't depend on x, Atal and Rabiner claim that they introduce only a negligible bias towards the ith class, and don't provide any “significant advantage” over simplifying the decision rule to use only the measurement vector x, which thus reduces our distance metric to

    d_i(x) = (x − m_i)ᵀ·W_i⁻¹·(x − m_i)

Once all three distance measures (for the silence, unvoiced, and voiced decisions) are calculated, classical Bayesian probability provides the class probabilities as

    P(i|x) = p_i·g_i(x) / Σ_j p_j·g_j(x)

The highest P is chosen as the speech type.

The only remaining part of the speech decision metric is the calculation of the covariance matrices between the five distance metrics. This constitutes the learning portion of the algorithm. In my project, I did not have the time to perform extensive learning on speech samples. Instead, I attempted to use the values for the W matrices as provided by Rabiner and Atal, computed for various sets of training data. This implies that the decision making portion of my vocoder was based on assumptions that did not hold directly for the conditions in which my speech was recorded. As a practical point, their algorithm assumed a specific scaling of data and block size that I had to emulate because they did not use a normalized metric for the zero crossing rate. Ultimately, this technique failed to adequately determine speech type, so I fell back on a simpler method employing the first LPC coefficient, the zero crossing rate, and the signal energy, optimized for the speech signals with which I was working.
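For reference, a sketch of the simplified decision described above: with per-class means and covariance matrices in hand (placeholders below; in a real system they come from the training step just discussed), the classifier evaluates the quadratic form for each class and picks the minimum:

    x = randn(5, 1);                      % the five measured metrics for one frame
    m = {zeros(5,1), ones(5,1), 2*ones(5,1)};   % placeholder class means
    W = {eye(5), eye(5), eye(5)};               % placeholder covariance matrices
    d = zeros(3, 1);                      % 1 = silence, 2 = unvoiced, 3 = voiced
    for i = 1:3
        v = x - m{i};
        d(i) = v' * (W{i} \ v);           % (x-m_i)' * inv(W_i) * (x-m_i)
    end
    [~, speechType] = min(d);             % smallest distance wins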
Figures 8 & 9 – Speech type estimators
Figure 10 – Speech type estimators, cont'd

So why is the speech type determination so important? Recall that the speech synthesizer uses the speech type to choose the excitation source for each frame. Failure to choose the right type of excitation produces speech that, while intelligible, sounds very wrong to the human ear. Voiced speech excited with a noise source tends to sound “breathy” or, as I call it, Borg-like (an example is included on the sample slide of the PowerPoint presentation). Unvoiced speech that is excited with a periodic input sounds synthetic or “robotic”, something like one hears from a child's speech synthesizer. Occasional mistakes in the classification cause individual frames to sound wrong in the synthesized speech – an example of a fully processed sentence is included in the presentation that has a “chirp” where unvoiced speech was mistaken for voiced speech, a high pitch rate was detected, and the resultant frame stands out clearly to the ear as in error.

Pitch detection

The final important step in LPC speech processing is the detection of pitch in voiced speech segments. During normal human speech production, we vary the frequency of the glottal pulse to produce the normal ups and downs of the primary frequency content of speech, much as we do when singing, though not nearly as pronounced. While we are perceptually largely unaware of this in normal conversation, a speech synthesizer that does not preserve this part of the speech signal will sound monotonic – perhaps you've heard Stephen Hawking's speech coder, which produces all speech at a single excitation frequency (again, an example is provided in the presentation).
Theoretically, pitch detection is straightforward. A Fourier transform of the frame can typically pick out the fundamental frequency for any given frame. The problem, of course, is the processing power required to make this determination. In principle, the autocorrelation function for a frame ought to work equally well – samples that are related by the fundamental period or its harmonics should show the highest correlation. In practice, this often works, but can easily be confounded by the specifics of the vocal tract response, which may shape the envelope of the speech signal in such a way as to accentuate the harmonics of the fundamental period, creating peaks that may actually exceed the fundamental. Additionally, computing a true autocorrelation sequence is an expensive operation, requiring many multiplies.

Figure 9 – Autocorrelation of voiced speech

In fact, the information content of a full autocorrelation is excessive for the rather simpler task of pitch detection. Small excursions and noise in the signal ought not contribute anything to the pitch detection process. Therefore, several methods can be found that use a modified form of the input signal for pitch detection. One such method, proposed by Sondhi⁶, is to use a highly non-linear transformation called center clipping. In center clipped speech, a variable clipping level is determined, usually as some fixed percentage of the maximum signal amplitude within the frame. All signal samples falling below this threshold in absolute value are set to 0, and all samples above it retain their value. This “compresses” the center out of the signal, leaving only the excursions above and below the clipping threshold. The modified speech can then be autocorrelated and a peak detector employed to locate the fundamental.

Figure 11 – Center clipping: y(n) = C[x(n)], with the clipping thresholds at −CL and +CL

⁶ “New Methods of Pitch Extraction”, M.M. Sondhi, IEEE Trans. Audio and Electroacoustics, June 1968.
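A sketch of the center clipping operation as described; the 30% clipping level is a typical choice from the literature, not necessarily the one used in this project:

    s  = randn(200, 1);                   % stand-in for one analysis frame
    CL = 0.3 * max(abs(s));               % clipping level: 30% of peak amplitude
    y  = zeros(size(s));
    keep = abs(s) > CL;                   % samples outside the +/-CL band
    y(keep) = s(keep);                    % everything inside the band is zeroed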
For many applications, especially low lag voice coders, the autocorrelation is still too time consuming and unnecessarily complex. To reduce this complexity, many coders use the Average Magnitude Difference Function, or AMDF⁷. The AMDF is defined as

    γ_n(k) = Σ_m | x(n+m) − x(n+m−k) |

For a periodic sequence with period P, x(n) − x(n−k) is zero for k = 0, P, 2P, etc.: the difference between sample values goes to zero at multiples of the fundamental period. Thus, the output of the AMDF is a function that shows distinct dips at the fundamental period. A suitable negative peak detector may then be employed to extract the pitch period.

In practice, the AMDF does not add any processing capability to a pitch detector, and is no better than the autocorrelation for that purpose. Its main benefit is its computational simplicity – it requires no multiplies – and hence it has found wide acceptance in low lag speech coding applications.

⁷ “Average Magnitude Difference Function Pitch Extractor”, M.J. Ross, H.L. Shaffer, et al., IEEE Trans. Acoust. Speech and Signal Proc., October 1974.
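An illustrative AMDF pitch estimator over a plausible 50-400Hz search range (both the range and the synthetic test frame are my assumptions); note that the inner loop needs only subtractions, absolute values, and additions:

    fs = 10000;
    s  = sin(2*pi*125*(0:511)'/fs);       % stand-in frame with a 125Hz pitch
    kmin = floor(fs/400);  kmax = ceil(fs/50);
    gam = inf(kmax, 1);                   % AMDF values; unsearched lags stay inf
    for k = kmin:kmax
        gam(k) = sum(abs(s(1+k:end) - s(1:end-k)));
    end
    [~, P] = min(gam);                    % deepest dip marks the pitch period
    f0 = fs / P;                          % 125Hz for this test frame (P = 80)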
A more robust, and commensurately more complicated, pitch detector goes by the acronym SIFT, for Simple Inverse Filtering Tracking. In the SIFT algorithm, the speech is low-pass filtered to 900Hz, then decimated 5:1. This produces a 1kHz wide signal for analysis which, because of the nature of the human speech process, still retains all of the important formant information. That decimated signal is then LPC analyzed with a 4th order autocorrelation model filter. The decimated input is inverse filtered using this 4th order filter, producing the prediction error signal, which is approximately spectrally flat. Its autocorrelation is taken, and a peak picker extracts the fundamental frequency, which is then interpolated to improve its estimation accuracy. Citations in the literature seem to suggest that this technique has been used with some success.

In all of these algorithms, pitch period estimates contain errors where a harmonic was mistaken for the fundamental, or where the pitch was simply miscalculated. It has become almost universal to follow the pitch detector with a 5th order median filter. A median filter does not behave like an averaging filter, which has the form of a low pass filter. Median filters do a better job of throwing out outliers in the pitch estimate while retaining the shape of the pitch contour.
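A sketch of that smoothing applied to a per-frame pitch track; the track below is contrived, with one gross harmonic error for the filter to remove (frame edges are simply left untouched):

    pitch = [80 80 81 160 82 83 82];      % pitch periods, one doubled by mistake
    smoothed = pitch;
    for n = 3:length(pitch)-2
        smoothed(n) = median(pitch(n-2:n+2));   % 5th order median
    end
    % smoothed(4) is now 82: the outlier is gone, the contour shape is kept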
Shortcomings and Fixes

So where does the LPC model break down? The most obvious problem with LPC in general is that human speech production, while sufficiently modeled by an all pole inverse filter, is not strictly an all pole system. Several zeros may be present in speech – these come from radiation effects at the lips, from nasals (sounds like ‘m’ and ‘n’ that are produced through the nasal cavity with the mouth mostly closed), and from the glottal pulse itself. There is no simple fix for this problem in an LPC vocoder unless one is willing to model the system with a full pole-zero structure. The success of commercial vocoders that do not require this complexity, and which produce reasonable voice quality at low bit rates, is testament to how well the all pole model actually works in general.

One widely used method to account for radiation effects is pre-emphasis. This spectrally flattens the signal to ameliorate the roll-off at high frequencies. This roll-off would otherwise demand that the LP coefficients retain high precision, in direct competition with low bit rates. Typically, the speech signal is passed through a filter of the form

    H(z) = 1 − α·z⁻¹

which has a characteristic high pass quality but is computationally quite simple to implement. Alpha in the range of 0.9 to 0.95 is typical.
To reverse the effects of the pre-emphasis, the synthesized speech signal must be passed through the corresponding inverse de-emphasis filter. This approach was employed in my simple Matlab based vocoder.
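A sketch of the pre-emphasis/de-emphasis pair, with alpha = 0.9 to match the block diagram of Figure 12:

    alpha = 0.9;
    x  = randn(1000, 1);                  % stand-in for raw sampled speech
    xe = filter([1 -alpha], 1, x);        % pre-emphasis H(z) = 1 - alpha*z^-1
    % ... analysis, transmission, and synthesis operate on xe ...
    y  = filter(1, [1 -alpha], xe);       % de-emphasis 1/(1 - alpha*z^-1)
    % y matches x to within numerical round-off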
As a practical matter, stability concerns may sometimes plague the inverse synthesis filter when too few bits are used at some point in the analysis or transmission of the LPC model. Even though the autocorrelation method is guaranteed to produce a stable filter in theory, this holds only when coefficient round-off is not an issue. Any loss of precision in coefficient generation can push a pole outside of the unit circle and cause the speech synthesis to explode at the other end. A practical fix for this problem in low bit rate coders is what are known as PARCOR coefficients, or PARtial CORrelation coefficients. These coefficients model the reflections that occur at the impedance mismatches between the tubes in the vocal tract model. They are limited in magnitude to be less than 1, are easily calculated using a variation of the Levinson-Durbin recursion, and are just as easily inverted into filter tap weights at the receiver. These coefficients also provide a robust stability check on any speech frame – PARCOR coefficients exceeding 1 in absolute value indicate that instability has crept into the calculation, and an intelligent processor can deal with the problem.

Another technique used to improve the stability of the synthesis filter is called bandwidth expansion. The filter coefficients are modified by the following equation:

    a_i_new = γⁱ · a_i

where gamma is a positive constant less than 1 (typical values range from 0.988 to 0.996)⁸. The effect of this adjustment is to broaden the spectrum of the speech signal, especially around the formant peaks. This also shortens the impulse response of the inverse filter, aiding stability.

⁸ Speech Coding Algorithms, Wai C. Chu, pg. 133.
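The adjustment itself is a one-liner per frame; a sketch with an illustrative gamma drawn from the quoted range:

    a = [1.2; -0.5; 0.1];                 % illustrative LP coefficients a_1..a_3
    g = 0.994;                            % bandwidth expansion constant
    aNew = (g .^ (1:length(a))') .* a;    % a_i_new = gamma^i * a_i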
Other issues with the LPC vocoder stem from its block structured processing of speech. Transition frames are frames of speech that are moving from one type of speech to another. Traditional LPC cannot model this situation, and so is forced to make a binary decision. Pitch synchronous LPC attempts to solve this problem by fracturing the speech signal not on fixed temporal boundaries, but rather on natural speech boundaries, where each frame contains only a single type of speech.
I have no working experience with this type of encoder, and cannot attest to its worth, but it has apparently found its way into commercial products.

The binary speech type decision also causes problems for LPC. Certain kinds of sounds in speech do not fall cleanly into either the voiced or unvoiced category – these include sounds like the letter ‘z’ and some of the nasals. These sounds show a noticeable mix of excitations, appearing as noisy periodic signals. Handling this problem in LPC has created a class of mixed excitation coders that use both periodic and noise components, suitably balanced, as the excitation source for the synthesizer.

Finally, I would like to mention the vocoder that I created and tweaked to perform the experiments and to generate the sample waveforms for this presentation. Refer to Figure 12 for a block diagram. As usual, what is theoretically a simple task becomes a devil-in-the-details engineering problem. By far, the V/U/S decision block was the most difficult task, as the literature had quite properly warned me. Nonetheless, the project provided many hours of pain and pleasure, and I quite enjoyed the opportunity to build and to present it.

Figure 12 – Vocoder realization: amplitude normalization; 0.015-0.03 second frame segmentation; pre-emphasis 1 − 0.9z⁻¹; Hanning window; order 12 LPC parameter estimation. Zero crossing rate detection, log energy detection, and prediction error analysis feed the V/U/S decision and gain estimation; center clipped correlation pitch extraction is followed by a 5 sample pitch median filter. A pitch period impulse train generator or a random noise generator drives the excitation selection, the inverse LPC filter, and de-emphasis 1/(1 − 0.9z⁻¹).
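To close, a compact sketch of how the analysis-side blocks of Figure 12 fit together frame by frame; the thresholds and the simple two-way voiced/unvoiced call are illustrative stand-ins for the tuned V/U/S logic described earlier, not the project's actual code:

    fs = 10000;  alpha = 0.9;
    x  = randn(1, fs);                    % stand-in for one second of speech
    x  = filter([1 -alpha], 1, x);        % pre-emphasis
    L  = round(0.02 * fs);                % 20ms frames
    w  = 0.5 - 0.5*cos(2*pi*(0:L-1)/(L-1));    % Hanning window
    for n = 1:floor(length(x)/L)
        s = x((n-1)*L + (1:L)) .* w;      % one windowed analysis frame
        logE = 10*log10(sum(s.^2)/L + eps);
        zcr  = sum(abs(diff(sign(s)))) / (2*L);
        voiced = (logE > -20) && (zcr < 0.25);  % crude stand-in decision
        % ... order-12 LPC analysis, gain estimation, and (if voiced)
        % center clipped pitch extraction go here; the transmitted frame
        % is {V/U/S flag, gain, pitch period, LP coefficients}.
    end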
