
Vocal Emotion Recognition in Five Languages of Assam Using Features Based on MFCCs and Eigen Values of Autocorrelation Matrix in Presence of Babble Noise


Aditya Bihar Kandali#1, Aurobinda Routray#2
Department of Electrical Engineering, Indian Institute of Technology Kharagpur, Kharagpur, PIN-721302, West Bengal, India
#1abkandali@rediffmail.com, #2aurobinda.routray@gmail.com

Tapan Kumar Basu
Department of Electrical Engineering, Aliah University, DN 47, Sector 5, Salt Lake City, Kolkata, India
basutk06@rediffmail.com

Abstract - This work investigates whether vocal emotion expressions of (i) a discrete emotion can be distinguished from no-emotion (i.e. neutral), (ii) one discrete emotion can be distinguished from another, (iii) surprise, which is actually a cognitive component that could be present with any emotion, can also be recognized as a distinct emotion, and (iv) a discrete emotion can be recognized cross-lingually. This study will enable us to get more information regarding the nature and function of emotion. Furthermore, this work will help in developing a generalized vocal emotion recognition system, which will increase the efficiency of human-machine interaction systems. In this work, an emotional speech database consisting of short sentences of six full-blown basic emotions and neutral is created, with 140 simulated utterances per speaker, for five native languages of Assam. This database is validated by a Listening Test. A new feature set is proposed based on the Eigen Values of the Autocorrelation Matrix (EVAM) of each frame of the speech signal. The Gaussian Mixture Model (GMM) is used as the classifier. The performance of the proposed feature set is compared with that of Mel Frequency Cepstral Coefficients (MFCCs) at a sampling frequency of 8.1 kHz and with additive babble noise at 5 dB and 0 dB Signal-to-Noise Ratios (SNRs) under matched noise training and testing conditions.

Keywords - Full-blown Basic Emotion, Vocal Emotion, GMM, MFCC, Eigen Values of Autocorrelation Matrix

I. INTRODUCTION

Emotions are expressed explicitly by human beings in speech, face, gait and other body language, along with internal physiological signals such as muscle voltage, blood volume pressure, skin conductivity and respiration. Vocal expressions are harder to regulate than other explicit emotional signals, so it is possible to know the actual affective state of the speaker from her/his voice without any physical contact. But exact identification of emotion from voice is very difficult due to several factors. Speech consists broadly of two components coded simultaneously: (i) what is said and (ii) how it is said. The first component carries the linguistic information pronounced as per the sounds of the language. The second component is the non-linguistic (paralinguistic or suprasegmental) component, which includes the prosody of the language, i.e. pitch, intensity and speaking-rate rules that give lexical and grammatical emphasis to the spoken message, and the prosody of emotion, which expresses the affective state of the speaker. In addition, speakers also possess their own style, i.e. a characteristic articulation rate, intonation habit and loudness. Thus, isolation of the affective information, i.e. the emotion, from voice is not easy.

The present work investigates whether vocal emotion expressions of (i) a discrete emotion can be distinguished from no-emotion (i.e. neutral), (ii) one discrete emotion can be distinguished from another, (iii) surprise, which is actually a cognitive component that could be present with any emotion [1], can also be recognized as a distinct emotion, and (iv) a discrete emotion can be recognized cross-lingually. This study will enable one to get more information about the nature and function of emotion. This work will also help in developing a generalized vocal emotion recognition system, which will increase the efficiency of human-machine interaction systems. Some applications are as follows: (i) to obtain more efficient and more accurate performance of automatic speech and speaker recognition systems due to the reduced search space consisting of models corresponding to pre-recognized emotions ([2]-[6]), (ii) to design an automatic speech translator across languages that retains the emotional content ([2]-[5], [7], [8]), and (iii) to make more efficient automatic tutoring, alerting and entertainment systems [9]. Picard [10] has explained how affective computing algorithms can improve the problem-solving capability of a computer and make it more intelligent by giving it the ability to recognize and express emotions. In some experiments on vocal emotion recognition, the contents of the testing utterances are kept the same as those of the training utterances and the test utterances are uttered by the same set of speakers, i.e. the experiments are text- as well as speaker-dependent [11]. In some studies the contents of the testing utterances are different from those of the training utterances but the test utterances are uttered by the same set of speakers, i.e. the experiments are text-independent but speaker-dependent ([12]-[16]). Very few studies have been carried out where both the contents of the testing utterances and the speakers differ from those of the training condition, i.e. the experiments are text- as well as speaker-independent ([12], [17], [18]).

When a machine is trained with emotion utterances of one set of languages and tested with emotion utterances of a different set of languages, the process is called cross-lingual (or cross-cultural) vocal emotion recognition. Very few studies of cross-lingual vocal emotion recognition have been reported in the literature [19]. Amongst these, noteworthy is the study by Scherer et al. [20], conducted in nine countries of Europe, the United States and Asia on vocal emotion portrayals of anger, sadness, fear, joy and neutral voice produced by professional German actors. In this study, overall perception accuracy by human subjects was found to be 66%. Also, the patterns of confusion were found to be very similar across all countries, which suggests the existence of similar inference rules from vocal expression across cultures. Generally, accuracy decreased with increasing language dissimilarity from German in spite of the use of language-free speech samples, so their conclusion was that culture- and language-specific paralinguistic patterns may influence the decoding process. Juslin and Laukka [19] also reported that cross-cultural decoding accuracy of vocal expression of emotions is significantly higher than that expected by chance. Laukka [21] has further reported that (i) vocal expressions of discrete emotions are universally recognized, (ii) distinct patterns of voice cues correspond to discrete emotions, and (iii) vocal expressions are perceived as discrete emotion categories and not as broad emotion dimensions. The above cross-lingual experiments were done mostly with only a few European and Asian languages. So, the authors feel a need to verify these findings using a larger number of languages, and also a need to find different experimental procedures and better methods to improve the recognition scores.

In the present paper, the study of vocal emotion recognition is carried out using speech samples of five Indian languages: Assamese, Bodo (or Boro), Dimasa, Karbi, and Mising (or Mishing), which are native languages (not dialects) of the state of Assam. Assamese is an Indo-Aryan language, Bodo and Dimasa are Tibeto-Burman (Bodo-Garo group) languages, Karbi is a Tibeto-Burman (Kuki-Chin group) language and Mishing is a Tibeto-Burman (Tani group) language ([22]-[24]). Bodo and Dimasa are tonal languages and have some linguistic similarities. From a study of the above references, it is found that the total number of mono-sounds (vowels and consonants) in these languages is not more than 32. Also, the speech acoustic signal is highly non-stationary, which means that the statistics of the acoustic signal for the same speech content spoken by the same speaker are different at different times.

The present study is based on a modified Brunswikian lens model of the process of vocal communication of emotion ([25], [26]). This model motivates research to determine the proximal cues, i.e. the representation of voice acoustic cues in the basilar membrane of the cochlea, the amygdala and the auditory cortex, which lead to the perception of the vocal emotion. Based on studies by researchers ([9], [13], [25]-[27]), one can identify three broad types of proximal voice cues: (i) the fundamental frequency or pitch frequency (F0) contour, i.e. fundamental frequency variation in terms of geometric patterns; (ii) continuous acoustic variables: magnitude of fundamental frequency, intensity, speaking rate, and spectral energy distribution; and (iii) voice quality (tense, harsh or breathy), described by high-frequency energy, formant frequencies, precision of articulation and glottal waveform.
But there exist some interrelations among these three broad types of voice cues; e.g. information about the fundamental frequency contour and the voice quality is contained in the continuous acoustic variables. A description of the relationships between archetypal emotions and the voice cues is given in ([9], [13], [25]-[27]).

In this paper, a new robust feature set is proposed, based on the 5 most significant Eigen Values of the Autocorrelation Matrix (EVAM) of each frame of the speech signal. These eigen values represent the powers of the 5 most prominent frequency components in the speech [28]. Since the formants of the speech are the most prominent frequencies, the 5 most significant eigen values represent the corresponding powers (though with some additive noise), and hence good performance in automatic emotion recognition is expected with this feature set. This feature set is also expected to be robust in the presence of noise, since these eigen values correspond to the most prominent signal-subspace eigenvectors. The performance of the EVAM feature set is compared with that of the Mel Frequency Cepstral Coefficient (MFCC) [29] feature set.

In this paper, the Gaussian Mixture Model (GMM) is used as the classifier [30]. The initial means and the elements of the diagonal covariance matrices of the GMM are computed by the split-Vector Quantization algorithm [31]. Since the GMM classifier is a statistical classifier, i.e. the classification is based on computation of log-likelihoods using the probability distribution function of the training feature vectors of each emotion modeled as a GMM, a statistical comparison of the different feature sets is inherently done in the experiment. The feature vectors computed from a set of training utterances, collected from speakers of 5 native languages of Assam, are used to train a GMM for each of 6 full-blown discrete basic emotions (anger, disgust, fear, happiness, sadness, and surprise) and 1 no-emotion (i.e. neutral). After training, the classifier is tested with features from the test utterances. The mean log-likelihood of the feature vectors of one test utterance with respect to the trained GMM of each emotion class is computed, and the test utterance is assigned to the emotion class for which the mean log-likelihood is the largest.
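As a concrete illustration of the decision rule just described, the following minimal Python sketch trains one diagonal-covariance GMM per emotion and labels a test utterance with the class giving the largest mean log-likelihood. scikit-learn's GaussianMixture (with its default k-means initialization) is used here only as a stand-in for the authors' split-VQ-initialized EM implementation, and the number of mixture components is an assumed value, not taken from the paper:

# Minimal sketch of per-emotion GMM training and the mean-log-likelihood
# decision rule. GaussianMixture replaces the authors' split-VQ + EM code;
# n_components = 16 is an illustrative assumption.
import numpy as np
from sklearn.mixture import GaussianMixture

EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral"]

def train_emotion_gmms(train_features, n_components=16):
    """train_features: dict mapping emotion name -> (N_frames, n_dims) array of
    feature vectors pooled from all training utterances of that emotion."""
    gmms = {}
    for emo in EMOTIONS:
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag",
                              max_iter=200, random_state=0)
        gmm.fit(train_features[emo])
        gmms[emo] = gmm
    return gmms

def classify_utterance(gmms, utterance_features):
    """utterance_features: (T_frames, n_dims) feature matrix of one test utterance.
    Returns the emotion whose GMM gives the largest mean log-likelihood."""
    mean_ll = {emo: np.mean(gmm.score_samples(utterance_features))
               for emo, gmm in gmms.items()}
    return max(mean_ll, key=mean_ll.get)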

II. DATA COLLECTION

The utterances in native languages (not dialects) of Assam are collected from different places of the state. The subjects are mostly students and faculty members of some educational institutions of Assam. Some subjects are non-professional actors and the others are trained for the first time through a few rehearsals, so as to avoid exaggerated expressions. Thirty randomly selected volunteer subjects (3 males and 3 females per language) are requested to record emotional speech in the 5 native languages of Assam to build a database, which is named the Multilingual Emotional Speech Database of North East India (MESDNEI). The following equipment and software are used in the recording process: (i) a wearable headphone-microphone set for single-channel recording, (ii) a notebook computer with an onboard sound interface having 16-bit depth and 44.1 kHz sampling frequency, (iii) the Sony Sound Forge 7 software, and (iv) an almost noise-free small closed room.

Each subject is asked to utter a fixed set of 140 short sentences (20 per emotion) of different lengths in her/his first language only. The subjects are asked to act so as to produce facial expressions and body language along with the full-blown vocal emotion expressions while recording (simulated emotion), and to rehearse their acting a few times before the final recording. The distance between the microphone and the mouth of the subject and the line-in volume of the notebook are adjusted simultaneously such that the speech waveforms are not clipped while recording. The set of utterances is recorded in two sessions. In one session the utterances corresponding to the emotions sadness, disgust and fear are recorded, in that order. In the other session, the utterances corresponding to the emotions happiness, surprise and anger are recorded, in that order. The neutral utterances are recorded at the beginning of either of the above sessions.

III. LISTENING TEST

A listening test of the emotional utterances is carried out with the help of 6 randomly selected volunteer human judges, i.e. listeners (3 males and 3 females), for each language of the MESDNEI database, in order to validate the database and also to set up benchmark scores for the performance of the automatic vocal emotion recognizer. The listeners have never heard utterances of the languages of the MESDNEI database. Some of the listeners are different persons for each language while others remain the same, because of the unavailability of a complete set of different volunteer listeners who do not understand or have never heard utterances in the above languages. The average scores of the listening test are given in TABLE I.
TABLE I
Cross-lingual Average Vocal Emotion Recognition Success Scores (%) of the Listening Test of the Utterances [Lang: Language, As: Assamese, Bo: Bodo, Dm: Dimasa, Ka: Karbi, Mi: Mishing]

Lang     Anger  Disgust  Fear   Happiness  Sadness  Surprise  Neutral  Average
As       97.22  95.00    97.78  85.69      97.50    87.78     99.03    94.29
Bo       70.97  45.83    85.28  74.86      88.19    67.36     81.81    73.47
Dm       95.42  85.97    95.56  84.17      97.08    78.06     94.17    90.06
Ka       88.19  75.69    87.08  78.75      96.25    69.58     93.61    84.17
Mi       86.67  75.42    93.33  79.44      94.58    71.81     90.97    84.60
Average  87.69  75.58    91.81  80.58      94.72    74.92     91.92    85.32

IV. EXPERIMENTS

Experiment I: Cross-lingual Vocal Emotion Recognition. A cross-lingual automatic vocal emotion recognizer is developed as follows. A total of seven GMMs, one for each emotion, are trained using the Expectation-Maximization (EM) algorithm [30] and the Leave-One-Out (LOO) cross-validation method [32] on 10 utterances per emotion from the subjects of 4 languages (training languages), and then tested (i.e. emotion decoding by machine) on the other 10 utterances per emotion from the subjects of the left-out language (testing language), one language at a time. The Percentage Average Recognition Success Score (PARSS) of each emotion and the Mean-PARSS of all emotions are computed from the Recognition Success Scores (RSS) for all 5 combinations of train-test data.

Experiment II: Text- and Speaker-independent Vocal Emotion Recognition in Each Language. A text-independent and speaker-independent automatic vocal emotion recognizer is developed as follows. A total of seven GMMs, one for each emotion, are trained using the EM algorithm [30] and the LOO cross-validation method [32] on the first 10 utterances per emotion of 5 subjects (training speakers) of 1 language, and then tested on the other 10 utterances per emotion of the remaining subject (testing speaker) of the same language, one subject at a time. The PARSS of each emotion and the Mean-PARSS (MPARSS) of all emotions are computed from the RSS obtained for all 6 combinations of training-testing data, considering each language one at a time. Hereafter, the Mean-PARSS will be referred to as the Average.
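To make the bookkeeping of Experiment II explicit, the following Python sketch walks through the leave-one-speaker-out folds and computes the PARSS per emotion and the Average. The helper routines (load_features, train_emotion_gmms, classify_utterance) and the "train"/"test" partition labels are assumed stand-ins for the feature-extraction and GMM procedures described elsewhere in this paper, not the authors' actual code:

# Sketch of the leave-one-speaker-out protocol of Experiment II with
# assumed helper functions passed in as arguments.
import numpy as np

def experiment_ii(speakers, emotions, load_features,
                  train_emotion_gmms, classify_utterance):
    """speakers: list of the 6 speaker ids of one language.
    load_features(speaker, emotion, part) -> list of per-utterance feature
    matrices; part selects the 10 training or the 10 testing utterances."""
    hits = {e: 0 for e in emotions}
    totals = {e: 0 for e in emotions}
    for test_spk in speakers:                      # leave one speaker out
        train_spk = [s for s in speakers if s != test_spk]
        pooled = {e: np.vstack([f for s in train_spk
                                for f in load_features(s, e, "train")])
                  for e in emotions}
        gmms = train_emotion_gmms(pooled)
        for e in emotions:
            for utt in load_features(test_spk, e, "test"):
                hits[e] += classify_utterance(gmms, utt) == e
                totals[e] += 1
    parss = {e: 100.0 * hits[e] / totals[e] for e in emotions}   # % per emotion
    mparss = np.mean(list(parss.values()))                       # the "Average"
    return parss, mparss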

V. FEATURE EXTRACTION

For each utterance the following pre-processing tasks are done: (i) the end-points are detected, (ii) the dc component and silence periods are removed, (iii) the utterance is decimated to a sampling frequency of 8.1 kHz, and (iv) frames of 31.6 ms duration (256 samples) are extracted with 50% overlap with the preceding frame. The babble noise sample is downloaded from the NOISEX-92 database, which is available on the Rice University Digital Signal Processing (DSP) group home page (http://spib.rice.edu/spib/select_noise.html), and decimated to 8.1 kHz sampling frequency. The decimated babble noise of the appropriate amount (i.e. not added, or at Signal-to-Noise Ratios (SNRs) of 5 dB and 0 dB) is added to the utterances. Fourteen MFCC [29] and one total log-energy feature are computed from each Hamming-windowed frame using a bank of 24 triangular Mel-frequency filters. The MFCC features are normalized by cepstral mean subtraction.

The proposed EVAM feature set is computed as follows. Each frame is rectangular-windowed. Then, for each frame, the autocorrelation matrix with lag p = 8 is computed. The optimal lag value p = 8 is determined from the set [5-16, 32, 64] by several trials so as to get maximum performance with the minimum lag value. Then, after eigen decomposition of the autocorrelation matrix, a 5-element feature vector is formed from the 5 most significant eigen values of each frame. The computational procedure of the EVAM Feature Vector Matrix (FVM) for each utterance is shown in Fig. 1. The EVAM features are normalized by subtracting the mean and dividing by the standard deviation of the EVAM features of the training utterances.

[Figure 1 block diagram: digitized utterance -> remove dc, detect speech end-points, remove silence -> add noise -> framing with 50% overlap -> compute autocorrelation matrix with lag p -> eigen value decomposition -> FVM with the 5 most significant eigen values of one frame per row]
Figure 1: Block diagram for computation of the EVAM Feature Vector Matrix (FVM) for each utterance
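A minimal numpy sketch of the Fig. 1 pipeline is given below: noise mixing at a target SNR, framing with 50% overlap, a Toeplitz autocorrelation matrix per frame with lag p = 8, and the 5 largest eigenvalues as the frame feature vector. The (p+1) x (p+1) Toeplitz construction from the biased autocorrelation estimates is one common convention and is an assumption here (the paper specifies only the lag and the number of eigenvalues); the train-set mean/standard-deviation normalization mentioned above is omitted, and the function names are illustrative:

# Illustrative EVAM feature computation (assumptions noted in the text above).
import numpy as np

def add_noise_at_snr(signal, noise, snr_db):
    """Scale 'noise' so that the mixture has the requested SNR in dB."""
    noise = np.resize(noise, signal.shape)
    p_sig = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_sig / (p_noise * 10.0 ** (snr_db / 10.0)))
    return signal + scale * noise

def frame_signal(x, frame_len=256, hop=128):
    """Rectangular-windowed frames of 256 samples (31.6 ms at 8.1 kHz), 50% overlap."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])

def evam_features(x, p=8, n_eig=5):
    """Return an (n_frames, n_eig) matrix holding the n_eig largest eigenvalues
    of the lag-p autocorrelation matrix of each frame."""
    feats = []
    for frame in frame_signal(x):
        # biased autocorrelation estimates r[0..p]
        r = np.array([np.dot(frame[:len(frame) - k], frame[k:]) / len(frame)
                      for k in range(p + 1)])
        R = np.array([[r[abs(i - j)] for j in range(p + 1)] for i in range(p + 1)])
        eigvals = np.linalg.eigvalsh(R)          # ascending order
        feats.append(eigvals[-n_eig:][::-1])     # 5 most significant eigenvalues
    return np.array(feats)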

VI. RESULTS AND DISCUSSION

TABLE V
Average Vocal Emotion Recognition Success Scores (%) at 8.1 kHz Sampling Frequency and With Additive Babble Noise in Dimasa Language [FS: Feature Set, MF: MFCC, EV: EVAM, UC: Uncorrupted, SNR: Signal-to-Noise Ratio]

The percentage average scores of vocal emotion recognition from the original uncorrupted (UC) utterances and with additive babble noise at 5 dB and 0 dB SNRs, at a sampling frequency of 8.1 kHz under the matched noise condition, are shown in TABLE II for Experiment I and in TABLES III to VII for Experiment II. It is observed that the average performance of the EVAM feature set is equal to or above 86.9% at all SNRs, and is also much higher than that of the MFCC feature set in all cases under the matched noise training and testing condition.

TABLE II
Cross-lingual Average Vocal Emotion Recognition Success Scores (%) at 8.1 kHz Sampling Frequency and With Additive Babble Noise [FS: Feature Set, MF: MFCC, EV: EVAM, UC: Uncorrupted, SNR: Signal-to-Noise Ratio]

SNR (dB)  FS  Anger  Disgust  Fear   Happiness  Sadness  Surprise  Neutral  Average
UC        MF  75.00  53.33    78.33  73.33      96.67    55.00     66.67    71.19
UC        EV  100.0  100.0    100.0  100.0      100.0    100.0     100.0    100.0
5         MF  86.67  30.00    73.33  60.00      70.00    56.67     53.33    61.43
5         EV  98.33  91.67    100.0  98.33      100.0    96.67     100.0    97.86
0         MF  83.33  36.67    71.67  58.33      63.33    48.33     61.67    60.48
0         EV  98.33  90.00    93.33  96.67      93.33    90.00     98.33    94.29

SNR (dB)  FS  Anger  Disgust  Fear    Happiness  Sadness  Surprise  Neutral  Average
UC        MF  74.33  68.67    82.67   77.00      47.33    53.33     84.00    69.62
UC        EV  100.00 99.00    100.00  100.00     99.67    100.00    99.00    99.67
5         MF  67.00  33.00    74.00   70.00      45.67    41.00     61.67    56.05
5         EV  97.33  93.67    97.33   97.67      92.67    90.67     97.33    95.24
0         MF  74.00  21.67    62.33   68.00      42.67    30.67     64.67    52.00
0         EV  96.67  89.00    93.67   93.33      94.00    91.33     94.67    93.24

TABLE VI
Average Vocal Emotion Recognition Success Scores (%) at 8.1 kHz Sampling Frequency and With Additive Babble Noise in Karbi Language [FS: Feature Set, MF: MFCC, EV: EVAM, UC: Uncorrupted, SNR: Signal-to-Noise Ratio]

TABLE III
Average Vocal Emotion Recognition Success Scores (%) at 8.1 kHz Sampling Frequency and With Additive Babble Noise in Assamese Language [FS: Feature Set, MF: MFCC, EV: EVAM, UC: Uncorrupted, SNR: Signal-to-Noise Ratio]

SNR (dB)  FS  Anger  Disgust  Fear   Happiness  Sadness  Surprise  Neutral  Average
UC        MF  80.00  55.00    55.00  76.67      35.00    38.33     80.00    60.00
UC        EV  100.0  100.0    100.0  100.0      100.0    98.33     93.33    98.81
5         MF  63.33  48.33    48.33  55.00      25.00    16.67     46.67    43.33
5         EV  96.67  95.00    100.0  100.0      95.00    90.00     83.33    94.29
0         MF  58.33  40.00    46.67  50.00      21.67    8.33      28.33    36.19
0         EV  100.0  78.33    81.67  100.0      95.00    78.33     75.00    86.90

SNR (dB)  FS  Anger  Disgust  Fear   Happiness  Sadness  Surprise  Neutral  Average
UC        MF  65.00  85.00    60.00  90.00      16.67    48.33     70.00    62.14
UC        EV  100.0  100.0    100.0  100.0      100.0    96.67     98.33    99.29
5         MF  55.00  73.33    58.33  75.00      35.00    41.67     45.00    54.76
5         EV  90.00  93.33    100.0  95.00      96.67    91.67     98.33    95.00
0         MF  65.00  60.00    41.67  48.33      33.33    30.00     51.67    47.14
0         EV  76.67  85.00    96.67  80.00      96.67    90.00     95.00    88.57

TABLE VII
Average Vocal Emotion Recognition Success Scores (%) at 8.1 kHz Sampling Frequency and With Additive Babble Noise in Mishing Language [FS: Feature Set, MF: MFCC, EV: EVAM, UC: Uncorrupted, SNR: Signal-to-Noise Ratio]

TABLE IV
Average Vocal Emotion Recognition Success Scores (%) at 8.1 kHz Sampling Frequency and With Additive Babble Noise in Bodo Language [FS: Feature Set, MF: MFCC, EV: EVAM, UC: Uncorrupted, SNR: Signal-to-Noise Ratio]

SNR (dB)  FS  Anger  Disgust  Fear   Happiness  Sadness  Surprise  Neutral  Average
UC        MF  78.33  61.67    66.67  61.67      30.00    55.00     66.67    60.00
UC        EV  100.0  95.00    100.0  96.67      98.33    100.0     100.0    98.57
5         MF  55.00  23.33    50.00  73.33      43.33    33.33     80.00    51.19
5         EV  91.67  81.67    98.33  86.67      93.33    91.67     100.0    91.90
0         MF  63.33  33.33    38.33  43.33      41.67    26.67     81.67    46.90
0         EV  90.00  78.33    100.0  76.67      76.67    86.67     100.0    86.90

SNR (dB)  FS  Anger  Disgust  Fear   Happiness  Sadness  Surprise  Neutral  Average
UC        MF  51.67  31.67    73.33  53.33      16.67    78.33     50.00    50.71
UC        EV  100.0  93.33    100.0  100.0      96.67    98.33     91.67    97.14
5         MF  63.33  6.67     78.33  28.33      6.67     40.00     36.67    37.14
5         EV  96.67  81.67    91.67  98.33      90.00    90.00     83.33    90.24
0         MF  60.00  6.67     70.00  35.00      10.00    25.00     45.00    35.95
0         EV  91.67  73.33    100.0  95.00      76.67    88.33     83.33    86.90

VII. CONCLUSION

The proposed EVAM feature set shows much higher performance than the MFCC feature set in all cases. The lowest average scores are obtained for the Bodo language in all cases, which indicates that this part of the database is more challenging for vocal emotion recognition experiments; this can also be inferred from the Listening Test scores. This study verified that the full-blown discrete basic vocal emotions are recognized (i) from no-emotion (i.e. neutral), (ii) one from another in each language separately, and (iii) cross-lingually one from another, with accuracies much above the chance level at 5 dB and 0 dB SNRs and for the original utterances, at a sampling frequency of 8.1 kHz. The surprise vocal emotion is recognized as a distinct emotion in each single language separately and also cross-lingually across languages, with accuracy much above the chance level in all cases. Thus this work verifies that there exist distinct patterns of proximal voice cues corresponding to full-blown discrete basic emotions at a sampling frequency of 8.1 kHz when the vocal expressions are received via a transmission channel which adds (i) almost no noise (i.e. the original utterances) or (ii) babble noise at 5 dB and 0 dB SNRs. The EVAM features therefore have high potential for vocal emotion recognition from utterances received via a telephone channel, which operates at an 8 kHz sampling frequency.

ACKNOWLEDGMENT

The authors are grateful to the students and the faculty members of Jorhat Engineering College, Jorhat, Assam, and all the other people of Assam, who actively cooperated during the collection of data for the present work. The authors also express their gratitude to all the students of the Indian Institute of Technology Kharagpur, India, who voluntarily took part as listeners in the listening test. Finally, the authors acknowledge the All India Council for Technical Education, New Delhi; the Indian Institute of Technology Kharagpur, India; and the Government of Assam, India, for providing the research facilities and opportunities for the present work.

REFERENCES
[1] K. Oatley and P. N. Johnson-Laird, "Towards a Cognitive Theory of Emotions," Cognition and Emotion, vol. 1, pp. 29-50, 1987.
[2] S. Furui, Digital Speech Processing, Synthesis and Recognition, New York: Marcel Dekker, 1989.
[3] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Englewood Cliffs, NJ: Prentice Hall, 1993.
[4] D. Jurafsky and J. H. Martin, Speech and Language Processing, Englewood Cliffs, NJ: Prentice Hall, 2000.
[5] J. Holmes and W. Holmes, Speech Synthesis and Recognition, 2nd ed., New York: Taylor & Francis, 2001.
[6] P. Rose, Forensic Speaker Identification, New York: Taylor & Francis, 2002.
[7] I. R. Murray and J. L. Arnott, "Toward the simulation of emotion in synthetic speech: A review of the literature on human vocal emotion," J. Acoust. Soc. Amer., vol. 93, no. 2, pp. 1097-1108, 1993.
[8] I. R. Murray and J. L. Arnott, "Implementation and testing of a system for producing emotion-by-rule in synthetic speech," Speech Communication, vol. 16, pp. 369-390, 1995.
[9] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J. G. Taylor, "Emotion recognition in human-computer interaction," IEEE Signal Process. Mag., vol. 18, no. 1, pp. 32-80, 2001.
[10] R. W. Picard, Affective Computing, Cambridge, MA: The MIT Press, 1997.
[11] S. Ramamohan and S. Dandapat, "Sinusoidal Model-Based Analysis and Classification of Stressed Speech," IEEE Trans. Speech Audio Process., vol. 14, no. 3, pp. 737-746, 2006.
[12] A. B. Kandali, A. Routray, and T. K. Basu, "Emotion recognition from Assamese speeches using MFCC features and GMM classifier," Proc. IEEE Region 10 Conf. TENCON, Hyderabad, India, pp. 1-5, 19-21 November 2008.
[13] T. L. New, S. W. Foo, and L. C. D. Silva, "Speech emotion recognition using hidden Markov models," Speech Communication, vol. 41, pp. 603-623, 2003.
[14] A. A. Razak, A. H. M. Isa, and R. Komiya, "A Neural Network Approach for Emotion Recognition in Speech," Proc. 2nd Int. Conf. Artificial Intelligence in Engineering & Technology, Kota Kinabalu, Sabah, Malaysia, 2004.
[15] D. Ververidis, C. Kotropoulos, and I. Pitas, "Automatic Emotional Speech Classification," Proc. ICASSP, pp. I-593 - I-596, 2004.
[16] T. Vogt and E. Andre, "Comparing Feature Sets for Acted and Spontaneous Speech in View of Automatic Emotion Recognition," Proc. IEEE, 2005.
[17] A. B. Kandali, A. Routray, and T. K. Basu, "Emotion Recognition From Speeches of Some Native Languages of Assam Independent of Text and Speaker," National Seminar on Devices, Circuits and Communication, Department of E.C.E., B.I.T. Mesra, Ranchi, Jharkhand, India, 6-7 November 2008.
[18] Y. Wang and L. Guan, "An Investigation of Speech-based Human Emotion Recognition," IEEE 6th Workshop on Multimedia Signal Processing, pp. 15-18, 2004.
[19] P. N. Juslin and P. Laukka, "Communication of Emotions in Vocal Expression and Music Performance," Psychological Bulletin, vol. 129, no. 5, pp. 770-814, 2003.
[20] K. R. Scherer, R. Banse, and H. G. Wallbott, "Emotion Inferences from Vocal Expression Correlate Across Languages and Cultures," J. Cross-Cultural Psychology, vol. 32, no. 1, pp. 76-92, 2001.
[21] P. Laukka, "Vocal Expression of Emotion: Discrete-emotion and Dimensional Accounts," Comprehensive Summaries of Uppsala Dissertations from the Faculty of Social Sciences 141, Acta Universitatis Upsaliensis, Uppsala, 2004.
[22] R. J. LaPolla and G. Thurgood, Eds., The Sino-Tibetan Languages, Routledge Language Family Series, London and New York: Routledge, 2002.
[23] http://www.iitg.ernet.in/rcilts/heirarchy.htm
[24] http://www.ciil.org/Main/languages/index.htm
[25] K. R. Scherer, "Vocal Communication of Emotion: A Review of Research Paradigms," Speech Communication, vol. 40, pp. 227-256, 2003.
[26] K. R. Scherer, T. Johnstone, and G. Klasmeyer, "Vocal Expression of Emotion," Part IV, Chapter 23 in R. J. Davidson, K. R. Scherer, and H. H. Goldsmith, Eds., Handbook of Affective Science, 1st ed., Oxford University Press, 2003.
[27] D. Ververidis and C. Kotropoulos, "Emotional speech recognition: Resources, features, and methods," Speech Communication, vol. 48, pp. 1162-1181, 2006.
[28] S. L. Marple, Jr., Digital Spectral Analysis With Applications, Englewood Cliffs, NJ: Prentice Hall, 1987.
[29] S. B. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Trans. Acoust., Speech, Signal Process., vol. 28, no. 4, pp. 357-365, 1980.
[30] D. A. Reynolds and R. C. Rose, "Robust Text-Independent Speaker Identification Using Gaussian Mixture Speaker Models," IEEE Trans. Speech Audio Process., vol. 3, no. 1, pp. 72-83, 1995.
[31] Y. Linde, A. Buzo, and R. M. Gray, "An Algorithm for Vector Quantizer Design," IEEE Trans. Communications, vol. 28, no. 1, pp. 84-95, 1980.
[32] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed., New York: Academic Press, 1990.
