
AN APPROACH FOR CLASSIFICATION OF DYSFLUENT AND FLUENT SPEECH USING K-NN, ANN, NAIVE BAYES & SVM

ABSTRACT

This paper presents an approach for the classification of dysfluent and fluent speech using Mel-Frequency Cepstral Coefficients (MFCC). When a person's speech flows easily and smoothly, the speech is fluent: sounds combine into syllables, syllables join into words, and words link into sentences with little effort. When someone's speech is dysfluent, it does not flow effortlessly and is irregular. A dysfluency is therefore a break in the smooth, meaningful flow of speech. Stammering is one such disorder, in which the fluent flow of speech is interrupted by the presence of dysfluencies such as repetitions, prolongations, interjections and so on. In this work we consider three types of dysfluencies, namely repetitions, prolongations and interjections, to distinguish dysfluent speech. After acquiring dysfluent and fluent speech, the speech signals are analyzed in order to extract MFCC features. The k-Nearest Neighbor (k-NN), Support Vector Machine (SVM), Naive Bayes and Artificial Neural Network (ANN) classifiers are used to classify the speech as dysfluent or fluent. 80% of the data is used for training and 20% for testing. An average accuracy of 81.46% and 93.80% is obtained for dysfluent and fluent speech respectively.

KEYWORDS

MFCC, Stammering, Fluent Speech, k-NN, ANN, SVM, Naive Bayes.

1. INTRODUCTION
Stammering, also called stuttering or dysphemia, is a speech fluency disorder that affects the flow of speech. It is a serious issue in speech pathology and a poorly understood disorder. Approximately 1% of the population suffers from this disorder, and it has been found to affect four times as many males as females. Stuttering is a subject relevant to researchers from different domains such as acoustics, pathology, speech physiology, signal analysis and psychology. Thus, this area is a multidisciplinary research field of science.
Speech fluency can be defined in terms of co-articulation, rate, effort and continuity. Continuity relates to the presence or absence of pauses, and also to the degree to which syllables and words are logically sequenced. If semantic units follow one another in a continuous and logical flow of information, the speech is perceived as fluent. If there is a break in the smooth, meaningful flow of speech, it is said to be dysfluent speech. The types of dysfluency that characterize the stuttering disorder are shown in Table 1.

2. SPEECH DATA

The speech samples are obtained from the University College London Archive of Stuttered Speech (UCLASS). The database comprises recordings of monologues, readings and conversations. There are 40 distinct speakers contributing 107 reading recordings to the database. In this work, speech samples are taken from the standard readings of 25 distinct speakers aged between 10 and 20 years. The samples were chosen to cover a wide range of age and stuttering rate. The repetition, prolongation and filled-pause (interjection) dysfluencies were segmented manually by listening to the speech signal, and the features of the segmented samples were then extracted. The fluent audio database was collected from Carnegie Mellon University; it contains two sets of 500 sentences taken from the CMU ARCTIC databases.

3. METHODOLOGY
The general procedure of fluent and dysfluent speech classification is divided into four steps, as shown in Figure 1.
Pre-Emphasis
This step is performed to improve the efficiency and accuracy of the feature extraction process. It compensates for the high-frequency components that are suppressed in the human sound production mechanism. The speech signal s(n) is passed through a high-pass filter:

s_2(n) = s(n) - a \, s(n-1)

where s_2(n) is the output signal and the suggested value of a usually lies between 0.9 and 1.0.
The z-transform of the filter is

H(z) = 1 - a z^{-1}

The aim of this stage is to boost the amount of energy in the high frequencies.
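As an illustration, a minimal pre-emphasis filter might look like the following sketch in Python (NumPy assumed; the coefficient a = 0.97 is a common default, not a value stated in the paper):

    import numpy as np

    def pre_emphasis(signal, a=0.97):
        """High-pass pre-emphasis filter: s2(n) = s(n) - a * s(n-1)."""
        return np.append(signal[0], signal[1:] - a * signal[:-1])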

Figure 1. Schematic diagram of the classification method

Segmentation
In this paper we consider three types of dysfluencies in stuttered speech, namely repetitions, prolongations and interjections; these were identified and segmented manually by listening to the recorded speech samples. The segmented samples are then subjected to feature extraction.
Feature Extraction (MFCC)

Feature extraction converts an observed speech signal into some type of parametric representation for processing and further investigation. Several feature extraction algorithms are used for this task, such as Linear Predictive Cepstral Coefficients (LPCC), Perceptual Linear Prediction (PLP) cepstra, Linear Predictive Coefficients (LPC) and Mel-Frequency Cepstral Coefficients (MFCC).
MFCC feature extraction is one of the best-known and most commonly used methods for extracting features for speech recognition. A multi-dimensional feature vector is produced for each frame of speech; in this study we consider 7 MFCCs. The method is based on human hearing perception, whose frequency resolution is approximately linear below 1 kHz and logarithmic above it. In other words, MFCC is based on the known variation of the human ear's critical bandwidth with frequency. The block diagram for computing MFCC is given in Figure 2, and the step-by-step computations are discussed briefly in the subsequent sections.
Framing

In framing, we divide the pre-emphasized signal into short frames, so that each frame can be analyzed over a short duration instead of analyzing the whole signal at once. When a Hamming window is applied to each frame, the information at the beginning and end of the frames may be lost. This can be overcome by overlapping the frames, which retains that information in the extracted feature frames. The frame length is initially set to 25 ms and, to ensure stationarity between frames, an overlap of 10 ms is used between two adjacent frames.
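A minimal framing routine, assuming NumPy and a sampling rate fs, following the paper's 25 ms frames with a 10 ms overlap between adjacent frames:

    import numpy as np

    def frame_signal(signal, fs, frame_ms=25, overlap_ms=10):
        """Split a 1-D signal into overlapping frames."""
        frame_len = int(fs * frame_ms / 1000)
        hop = frame_len - int(fs * overlap_ms / 1000)   # step between frame starts
        n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
        return np.stack([signal[i * hop : i * hop + frame_len]
                         for i in range(n_frames)])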
Windowing

The effect of spectral artifacts from the framing process can be reduced by windowing. A point-wise multiplication between the framed signal and the window function is called windowing; in the frequency domain it becomes a convolution of the short-term spectrum and the transfer function of the window. A window function is said to be good if its transfer function has a narrow main lobe and low side-lobe levels. The basic idea of applying the Hamming window is to minimize the signal discontinuities and the spectral distortion. The Hamming window function is given by:

w(n) = 0.54 - 0.46 \cos\!\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1

where w(n) is the Hamming window function. The resulting windowed signal is defined as

Y(n) = X(n) \times W(n)

where N is the number of samples in each frame, Y(n) is the output signal, X(n) is the input signal and W(n) is the Hamming window.
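Continuing the sketch above, the Hamming window is applied to every frame:

    def apply_hamming(frames):
        """Point-wise multiplication of each frame with a Hamming window."""
        N = frames.shape[1]
        n = np.arange(N)
        w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))  # w(n) as in the equation above
        return frames * w                                   # broadcasts over all frames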
Figure 2. MFCC computation

Fast Fourier Transform (FFT)

The FFT is used to convert the signal from the time domain to the frequency domain, in preparation for the next stage (Mel-frequency warping). Essentially, we perform the Fourier transform to convert the convolution of the vocal tract impulse response and the glottal pulse in the time domain into a multiplication in the frequency domain:

Y(w) = \mathrm{FFT}[h(t) * x(t)] = H(w)\,X(w)

where X(w), H(w) and Y(w) are the Fourier transforms of x(t), h(t) and y(t) respectively.
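A corresponding sketch (the 512-point FFT size is an assumption; the paper does not specify one):

    def power_spectrum(windowed_frames, nfft=512):
        """Magnitude-squared FFT of each windowed frame (one-sided spectrum)."""
        spec = np.fft.rfft(windowed_frames, n=nfft, axis=1)
        return (np.abs(spec) ** 2) / nfft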

Mel Filter Bank Processing


The frequency resolution of the human ear is approximated by using a set of triangular filter banks. The Mel frequency scale is linear up to 1000 Hz and logarithmic thereafter. The set of Mel filters is made to overlap such that their centre frequencies are equally spaced on the Mel scale. Filter banks can be implemented in both the time domain and the frequency domain; for the purposes of MFCC processing, the filter banks are implemented in the frequency domain. The filter bank arranged according to the Mel scale is shown in Figure 3.
Figure 3. Mel scale filter bank
A weighted sum of the filter spectral components is computed using the set of triangular filters shown in Figure 3, so that the output of the process approximates a Mel scale. Each filter's magnitude frequency response is triangular in shape: it is equal to unity at the centre frequency and decreases linearly to zero at the centre frequencies of the two adjacent filters. The output of each filter is the sum of its filtered spectral components. The Mel value for a particular frequency f can then be expressed using the following equation:

\mathrm{mel}(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right)
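A sketch of the triangular filter bank construction, reusing the FFT size and sampling rate from the earlier steps (the number of filters is an assumption; the paper does not state it):

    def mel(f):
        """Hz -> Mel, as in the equation above."""
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def inv_mel(m):
        """Mel -> Hz (inverse of the mapping above)."""
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    def mel_filterbank(n_filters, nfft, fs):
        """Triangular filters with centre frequencies equally spaced on the Mel scale."""
        pts = inv_mel(np.linspace(0.0, mel(fs / 2), n_filters + 2))
        bins = np.floor((nfft + 1) * pts / fs).astype(int)
        fbank = np.zeros((n_filters, nfft // 2 + 1))
        for i in range(1, n_filters + 1):
            left, centre, right = bins[i - 1], bins[i], bins[i + 1]
            for k in range(left, centre):                 # rising slope, 0 -> 1
                fbank[i - 1, k] = (k - left) / max(centre - left, 1)
            for k in range(centre, right):                # falling slope, 1 -> 0
                fbank[i - 1, k] = (right - k) / max(right - centre, 1)
        return fbank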

Discrete Cosine Transform (DCT)


The DCT is used to convert the log Mel spectrum back to the time domain; the output of this conversion is called the MFCCs. Since the speech signal is represented as a convolution between the quickly varying glottal pulse (source) and the slowly varying vocal tract impulse response (filter), the speech spectrum consists of the spectral details (high frequency) and the spectral envelope (low frequency). We have to separate the spectral envelope and the spectral details from the spectrum. The logarithm changes the multiplication into an addition; therefore, by taking the DCT of the logarithm of the magnitude spectrum, we simply convert the multiplication of the magnitudes of the Fourier transforms into an addition. The Mel frequency cepstrum can then be computed from the outcome of the previous step using the equation:
\tilde{C}_n = \sum_{k=1}^{K} (\log \tilde{S}_k) \cos\!\left[n\left(k - \tfrac{1}{2}\right)\frac{\pi}{K}\right], \quad n = 1, 2, \ldots, K

where \tilde{C}_n are the MFCCs, \tilde{S}_k is the Mel spectrum and K is the number of cepstrum coefficients.
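Putting the last two steps together, a sketch of the final MFCC computation (SciPy's DCT is assumed as the implementation; 7 coefficients are kept, as stated in the paper):

    from scipy.fftpack import dct

    def mfcc_from_power(power_frames, fbank, n_ceps=7):
        """Log Mel filter bank energies followed by a DCT; keep the first 7 MFCCs."""
        mel_energies = power_frames @ fbank.T      # S_k: output of each triangular filter
        log_mel = np.log(mel_energies + 1e-10)     # log compression (epsilon avoids log(0))
        return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_ceps]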

Classification

The k-Nearest Neighbor (k-NN), Artificial Neural Network (ANN), Naive Bayes and Support Vector Machine (SVM) classifiers are used as classification techniques in the proposed approach.
k-Nearest Neighbor (kNN)

k-NN classifies a new test case based on neighboring training examples in the feature space. k-NN is a kind of instance-based, or lazy, learning in which the function is approximated locally and all computation is deferred until classification is performed. Each test speech signal (query object) is compared with each training speech signal (training object). The object is then classified by a majority vote of its neighbors, being assigned to the class most frequent among its k nearest neighbors (k is typically a small positive integer). If k = 1, the object is simply assigned to the class of its nearest neighbor.

In this study, the distance is computed from the test speech signal to each training speech signal in the training set, and the test speech sample is assigned to the same class as the most similar, or nearest, point in the training data. The Euclidean distance is used to measure the closeness between each training sample and the test sample:


d_e(a, b) = \sqrt{\sum_{i=1}^{n} (b_i - a_i)^2}
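A minimal sketch using scikit-learn (an assumption; the paper names neither its implementation nor the value of k). Here X_train holds the MFCC feature vectors and y_train the dysfluent/fluent labels:

    from sklearn.neighbors import KNeighborsClassifier

    # k = 3 is a placeholder; the paper does not report the value of k used.
    knn = KNeighborsClassifier(n_neighbors=3, metric='euclidean')
    knn.fit(X_train, y_train)       # lazy learner: fit just stores the training data
    y_pred = knn.predict(X_test)    # majority vote among the k nearest neighbours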

Support Vector Machine (SVM)

SVM is a classification technique based on statistical learning theory. It is a supervised learning technique: using a labelled data set for training, it tries to find a decision function that best classifies the training data. The main idea of the algorithm is to find a hyperplane that defines the decision boundary separating data points of different classes. The SVM classifier finds the optimal hyperplane that correctly classifies (separates) the largest fraction of data points while maximizing the distance of either class from the hyperplane. The hyperplane is given by
w^T x + b = 0

where w is the weight vector and b is the bias.

In this work we use the one-against-one method, in which one binary SVM is trained for each pair of classes to separate members of one class from members of the other. This method allows the whole system to be trained with the maximum number of different samples for each class within a limited amount of computer memory.
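A corresponding scikit-learn sketch (again an assumption; the kernel is not specified in the paper, and for the two-class dysfluent/fluent problem a single binary SVM suffices):

    from sklearn.svm import SVC

    # decision_function_shape='ovo' selects one-against-one voting for multiclass data;
    # the RBF kernel is a common default, not taken from the paper.
    svm = SVC(kernel='rbf', decision_function_shape='ovo')
    svm.fit(X_train, y_train)
    y_pred = svm.predict(X_test)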

Artificial Neural Network (ANN)

Nowadays, ANNs are used in numerous applications owing to their parallel distributed processing, pattern learning, distributed memory, discrimination ability and error stability. An ANN is a processing model consisting of a number of simple processing units, or nodes, called neurons. Each neuron receives a weighted set of inputs and produces an output. Algorithms based on ANNs are well suited to speech recognition tasks. Inspired by the human brain, neural network models exhibit attributes such as learning, fault tolerance, generalization and adaptivity.

In this work we use the MLP (multilayer perceptron) architecture, which consists of an input layer, one or more hidden layers, and an output layer. The network is trained with the back-propagation algorithm: the input is presented to the network and propagated through the weights and nonlinear activation functions towards the output layer, and in the backward pass the error is corrected using the well-known error back-propagation algorithm.

After training, the network ultimately captures the input-output relationship in the adjusted weights of the network. The dataset is tested after training the network.
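A minimal MLP sketch in scikit-learn (the hidden-layer size and other hyperparameters are assumptions; the paper does not report them):

    from sklearn.neural_network import MLPClassifier

    # One hidden layer of 32 units is a placeholder; solver='sgd' trains the
    # weights by gradient descent with error back-propagation.
    mlp = MLPClassifier(hidden_layer_sizes=(32,), activation='logistic',
                        solver='sgd', max_iter=1000)
    mlp.fit(X_train, y_train)
    y_pred = mlp.predict(X_test)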

Naive Bayes

The Naive Bayes classifier is a simple and effective probabilistic classification method based on Bayes' theorem. It is a supervised classifier and can handle multiclass classification problems. For each class value it estimates the probability that a given instance belongs to that class. The feature values within one class are assumed to be independent of the other feature values, an assumption called class-conditional independence. Only a small amount of training data is needed to estimate the parameters for classification. The classifier is stated as

P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}

where P(A) is the prior or marginal probability of A, P(A \mid B) is the conditional probability of A given B, called the posterior probability, P(B \mid A) is the conditional probability of B given A, and P(B) is the prior or marginal probability of B, which acts as a normalizing constant.
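A sketch with a Gaussian likelihood model, a common choice for continuous MFCC features (the paper does not state which variant of Naive Bayes was used):

    from sklearn.naive_bayes import GaussianNB

    nb = GaussianNB()              # Gaussian class-conditional densities assumed
    nb.fit(X_train, y_train)       # estimates per-class mean and variance per feature
    y_pred = nb.predict(X_test)    # arg-max of the posterior P(class | features)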

4. RESULTS AND DISCUSSIONS

The samples were chosen as described in Section 2 of this paper. The database is divided into two subsets, a training set and a testing set, in the ratio 80:20. Table 2 shows the distribution of speech segments for training and testing. We first analyze the speech samples by extracting the MFCC features and then create two training sets, one for the dysfluent and one for the fluent speech samples. Once the system is trained, the test set is used to estimate the performance of the classifiers.
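The 80:20 split and the accuracy computation could be reproduced along these lines, reusing the classifier objects sketched in the previous sections (a sketch; the feature matrix X and label vector y are assumed to hold the extracted MFCC features and their labels):

    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # 80% training, 20% testing, as in the paper.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                        random_state=0)

    for name, clf in [('k-NN', knn), ('ANN', mlp), ('Naive Bayes', nb), ('SVM', svm)]:
        clf.fit(X_train, y_train)
        print(name, accuracy_score(y_test, clf.predict(X_test)))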

Table 2. The speech data

Speech samples      Total   Training   Testing
Dysfluent speech     78       62         16
Fluent speech        78       62         16

[Figure: bar chart of classification accuracy (%) for dysfluent and fluent speech using the k-NN, ANN, Naive Bayes and SVM classifiers]

Table 3. Dysfluent and fluent classification results with 3 different sets

                         k-NN                ANN                 Naive Bayes         SVM
Data Set                 Dysfluent  Fluent   Dysfluent  Fluent   Dysfluent  Fluent   Dysfluent  Fluent
Set 1                    94.2       100      89.32      100      96.75      94.87    88.45      100
Set 2                    75.50      96.15    96.87      97.44    60.32      94.87    88.625     98.7179
Set 3                    89.98      87.5     75.00      87.5     63.67      81.25    88.46      87.5
Average classification % 86.57      94.5     77.08      94.97    73.69      90.33    88.508     95.40
5. CONCLUSIONS

The speech signal can be used as a reliable indicator of speech abnormalities. We have proposed an approach to discriminate dysfluent from fluent speech based on MFCC feature analysis. Four classifiers, namely k-NN, Naive Bayes, ANN and SVM, were applied to the MFCC feature set to classify dysfluent and fluent speech. Using the k-NN classifier we obtained an average accuracy of 81.46% and 93.80% for dysfluent and fluent speech respectively, and the SVM classifier yielded an accuracy of 88.5% and 95.4% for dysfluent and fluent speech respectively. In this work we considered a combination of three types of dysfluencies that are important in the classification of dysfluent speech. In future work, the accuracy on the test data can be improved by increasing the number of samples in the training and testing data, and various other feature extraction algorithms can also be used to improve performance.
