IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 12, NO. 6, OCTOBER 2010

Feature Analysis and Evaluation for Automatic Emotion Identification in Speech


Iker Luengo, Eva Navas, and Inmaculada Hernáez
Abstract: The definition of parameters is a crucial step in the development of a system for identifying emotions in speech. Although there is no agreement on which are the best features for this task, it is generally accepted that prosody carries most of the emotional information. Most works in the field use some kind of prosodic features, often in combination with spectral and voice quality parametrizations. Nevertheless, no systematic study has been done comparing these features. This paper presents an analysis of the characteristics of features derived from prosody, spectral envelope, and voice quality, as well as of their capability to discriminate emotions. In addition, early fusion and late fusion techniques for combining different information sources are evaluated. The results of this analysis are validated with experimental automatic emotion identification tests. Results suggest that spectral envelope features outperform the prosodic ones. Even when different parametrizations are combined, the late fusion of long-term spectral statistics with short-term spectral envelope parameters provides an accuracy comparable to that obtained when all parametrizations are combined.

Index Terms: Emotion identification, information fusion, parametrization.

I. INTRODUCTION

FEATURES extracted from the speech signal have a great effect on the reliability of an emotion identification system. Depending on these features, the system will have a certain capability to distinguish emotions and will be able to deal with speakers not seen during the training. Many works in the field of emotion recognition aim to find the most appropriate parametrization, yet there is no clear agreement on which feature set is best. One of the major problems in determining the best features for emotion identification is that there is no solid theoretical basis relating the characteristics of the voice to the emotional state of the speaker [1]. That is why most of the works in this field are based on features obtained from direct comparison of speech signals portraying different emotions. The comparison makes it possible to estimate the acoustic differences among them, identifying features that could be useful for emotion identification.
Manuscript received December 11, 2009; revised March 22, 2010; accepted May 06, 2010. Date of current version September 15, 2010. This work was supported in part by the Spanish Government under the BUCEADOR project (TEC2009-14094-C04-02) and in part by the AVIVAVOZ project (TEC2006-13694-C03-02). The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Hamid K. Aghajan. The authors are with the Department of Electronics and Telecommunications, University of the Basque Country, Bilbao 48013, Spain (e-mail: iker.luengo@ehu.es; eva.navas@ehu.es; inma.hernaez@ehu.es). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TMM.2010.2051872

The only widely known theory that describes the physiological changes caused by an emotional state [2], [3] takes the Darwinian theory [4] as a reference, considering emotions as a result of evolutionary needs. According to this theory, each emotion induces some physiological and psychological changes in order to prepare us to do something, e.g., fear prepares us to run from a danger. These changes have a certain influence on the speech characteristics, mainly on those related to intonation, intensity, and speaking rate, i.e., on prosody. For many years, automatic identification systems have used prosodic features almost exclusively, mainly because of the aforementioned theory. The relation between emotion and prosody is reflected in many works in the literature [5]–[9]. A detailed summary of these works and their conclusions is presented in [10]. Prosodic features are mostly used in the form of long-term statistics, usually estimated over the whole utterance. The most common ones are simple statistics such as mean, variance, minimum, maximum, or range, but it is also possible to use more complex representations in an attempt to retain more emotional information.

The use of prosodic features produces a recurrent confusion pattern among emotions. They seem to be able to discriminate high arousal emotions (anger, happiness) from low arousal ones (sadness, boredom) easily, but the confusion level among emotions of the same arousal level is very large [1]. However, humans are able to distinguish anger from happiness and sadness from boredom accurately. This reinforces the idea that there are some other voice characteristics useful for emotion identification. Some works in the literature use spectral measurements or voice quality features, showing the importance of these parameters for emotion identification. The vocal tract is also influenced by the emotional state, and so are the spectral characteristics of the voice. The relationship between emotion and spectral characteristics is empirically supported by several works [7], [11]–[14]. Furthermore, taking into account that prosody is determined mainly by the vocal fold activity and that the spectrum envelope is mostly influenced by the vocal tract, it is reasonable to assume that they are not very correlated. Therefore, both types of features provide different information about the emotional state of the speaker, and the combination of prosodic and spectral parametrizations may be interesting. In fact, most of the works using spectral measures use them together with prosodic features, significantly reducing the error rate with respect to using only one of the parametrizations [12], [13], [15]–[18]. The effect that emotions have on the vocal folds causes changes in the voice quality, too [19], [20].

Thus, in the last couple of years, features related to voice quality have also been used to help emotion identification [19], [21]. Nevertheless, few authors use them, due to the difficulty of extracting the glottal signal from the speech [20]. The estimation of the glottal signal can only be done with acceptable accuracy on very stable speech segments, which makes it difficult to extract voice quality features by automatic means. Finally, it is also possible to use linguistic features such as the occurrence of sighs or certain emotive words [17], [18], [22]. However, only a few authors use them: on the one hand, because in most cases an automatic speech recognition (ASR) system is needed, which increases the complexity of the system; on the other hand, because this kind of feature only makes sense with spontaneous speech, and most of the works in the field have been developed using read speech databases, often with the same texts for all emotions.

The best results are obtained when parametrizations of different nature are combined in order to increase the available information. When all features have the same temporal structure, this combination can be done by simply concatenating the feature vectors. The problem arises when the temporal structure is different for the considered parametrizations. For example, prosodic information is usually given in the form of long-term statistics, while traditional spectral envelope parametrizations like Mel frequency cepstral coefficients (MFCC) or linear prediction cepstral coefficients (LPCC) are extracted for each frame. A simple solution applied in many works is to calculate long-term statistics of the spectral features, making it possible to concatenate them to the prosodic feature vector [15]–[17]. Another alternative is to do the opposite: instead of calculating long-term statistics of the spectral features, frame-wise F0 and intensity values are used and appended directly to the MFCC or LPCC vectors [14], [23]. In [15], both strategies are compared, obtaining very similar results in both cases. A more elaborate solution uses a late fusion of classifiers [24], [25], i.e., training a different classifier with each parametrization and combining the results of these classifiers, as in [12] and [18].

Going over the literature shows a disagreement on which features are best for the identification of emotions in speech. It is widely accepted that prosodic features carry most of the emotional information, and that the combination of parametrizations of different nature improves the results. But no systematic study has been done comparing the effectiveness of each kind of parametrization or their combinations. This work attempts to fill this gap by analyzing the characteristics and appropriateness of different sets of features used for the recognition of emotions. It is focused on acoustic parameters that can be extracted directly from the speech signal without using ASR systems: prosody, spectral envelope, and voice quality. The effectiveness of different combinations of features is also studied, and the different combination approaches are compared to see which one is more appropriate: simple feature concatenation (early fusion) or classifier result combination (late fusion).

Section II describes the emotional database used in this work, as well as the acoustic processing carried out to extract the data needed for the calculation of the features. Section III describes the considered features, whereas Section IV describes the analysis of these features. Some empirical experiments are carried out and described in Section V, in order to confirm the results obtained in the analysis. Finally, the results are discussed and some conclusions are drawn.

II. WORKING DATABASE

A. Description of the Database

Both the analysis of the features and the emotion identification experiments described in this paper were carried out using the Berlin emotional speech database [26]. This database contains 535 recordings uttered by five male and five female speakers, simulating seven emotional states: anger, boredom, disgust, fear, happiness, neutral, and sadness. Although the original recordings are available at 16 kHz and 16 bits per sample, they were subsampled to 8 kHz for this work. The database corpus contains ten sentences of neutral semantic content that all speakers recorded portraying the seven emotional styles. As described in [26], in order to get emotions as natural as possible, the recordings were evaluated in a perceptual test where 20 listeners had to identify the intended emotion and score its naturalness. All recordings with an identification rate lower than 80% or an overall naturalness score under 60% were discarded, leaving the 535 sentences available in the final database. Due to the perceptual selection process, some emotions are more represented than others.

This database was chosen because it presents certain characteristics that were of interest. It is a multispeaker database, making it possible to perform speaker-independent tests. Furthermore, the perceptual selection guarantees that the portrayed emotions are highly natural. In addition, many studies on the identification of emotions have been carried out using this database, which makes it possible to compare the results with other published results.

B. Processing of the Recordings

The recordings were processed in order to get the characteristic curves and labelings needed for the feature extraction. The processing included detection of vocal activity, estimation of the glottal source signal and of the intonation curve, voiced-unvoiced labeling, pitch-period marking, and vowel detection. All this processing was performed automatically, without manual corrections.

1) Vocal Activity Detection (VAD): Silences and pauses were detected using the LTSE VAD algorithm described in [27]. The labels obtained were further processed to discard silence labels shorter than 100 ms, which usually appear as a result of detection errors and have no linguistic meaning.

2) Glottal Source Estimation: The glottal signal is needed in order to compute various features related to voice quality. However, estimating the glottal flow from the speech signal is inherently difficult, and the results may be acceptable only when the estimation is applied to very stationary speech segments. Iterative adaptive inverse filtering (IAIF) [28] was used to perform the inverse filtering and recover the glottal flow, since it is a fully automatic method that gives acceptable results with stationary signals. Nevertheless, it is assumed that the estimated glottal signal is inaccurate for nonstationary segments.

3) Intonation Curve and Voiced-Unvoiced Labeling: The intonation curve was computed with the cepstrum dynamic programming (CDP) algorithm described in [29], which uses the cepstrum transform and dynamic programming. This algorithm provides the voiced-unvoiced (VUV) labeling, too.

4) Pitch Period Marking: In addition to glottal flow estimation, pitch synchronous marks are also needed for the extraction of voice quality features. Once the intonation curves, the VUV labeling, and the glottal flow were extracted, pitch marks were placed at the negative peaks of the inverse filtering residual, which were located by simple peak picking, using the estimated F0 value as a clue to detect the next peak.

5) Vowel Detection: Vowels are some of the most stable segments in a speech signal, making them very appropriate for the computation of certain features, such as those derived from the glottal source estimation. Furthermore, vowels are always voiced and are strongly affected by intonation patterns, providing a consistent point to calculate intonation-related features such as F0 risings or fallings. The vowels in the database were automatically detected using the algorithm described in [30], which is based on a phoneme recognizer working with HMM models of clustered phonemes. Phoneme clustering was performed automatically according to the acoustical similarity of the phonemes, keeping the vowels in their own cluster. This provides a consistent and very robust set of models, capable of detecting 80% of vowel boundaries with less than 20-ms error.

III. FEATURE DEFINITION

A. Segmental Features

Segmental features are calculated once for every frame, allowing the analysis of their temporal evolution. A 25-ms Hamming window with 60% overlap is used, i.e., a new feature vector is extracted every 10 ms.

1) Spectral Envelope: Reference [11] shows that log-filter power coefficients (LFPC) outperform traditional MFCC or LPCC parameters in emotion identification, so they were chosen for the frame-wise spectral characterization. LFPC features represent the spectral envelope in terms of the energy in Mel-scaled frequency bands. Eighteen LFPC coefficients were estimated for each frame, together with their first and second derivatives, giving a total of 54 spectral features at the segment level. In order to minimize microphone distance effects, LFPC features were normalized to the mean value of the whole utterance.

2) Prosody Primitives: We refer to the intonation and intensity curves as prosody primitives, as the (suprasegmental) prosodic features are estimated from them. A new sample of intensity and F0 was estimated for every frame, together with their first and second derivatives, providing six features per frame. Intensity curves were normalized to the mean intensity value of the whole utterance in order to minimize the effect of the microphone distance. As F0 values are not defined for unvoiced frames, two feature vector streams were extracted from each recording, the first one corresponding to voiced frames (characterized by both intonation and intensity features) and the second one corresponding to unvoiced frames (characterized only by intensity values). Both streams were treated as different parametrizations throughout this work.
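As a rough illustration of this frame-wise parametrization, the sketch below computes LFPC-style features (log energies in 18 Mel-spaced bands, 25-ms Hamming window, 10-ms step) together with their first and second derivatives and the utterance-mean normalization. It assumes the librosa library is available; the exact filterbank design of [11] may differ, and the function name and default values are our own.

import numpy as np
import librosa

def lfpc_features(wav_path, sr=8000, n_bands=18):
    """Frame-wise LFPC-style features: log energy in Mel-spaced bands,
    25-ms Hamming window, 10-ms step, plus first and second derivatives,
    normalized to the utterance mean (a sketch, not the paper's code)."""
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
        n_mels=n_bands, window="hamming", power=2.0)
    lfpc = np.log(mel + 1e-10)                    # (n_bands, n_frames)
    d1 = librosa.feature.delta(lfpc, order=1)     # first derivative
    d2 = librosa.feature.delta(lfpc, order=2)     # second derivative
    feats = np.vstack([lfpc, d1, d2])             # 3 * 18 = 54 values per frame
    # Normalize to the utterance mean to reduce microphone-distance effects
    return feats - feats.mean(axis=1, keepdims=True)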

TABLE I SUPRASEGMENTAL PARAMETERS AND THE CORRESPONDING SYMBOLS USED THROUGHOUT THIS DOCUMENT

B. Suprasegmental Features

Suprasegmental features represent long-term information, estimated over time intervals longer than a frame. In this work, this interval has been defined as the time between two consecutive pauses. It is expected that speech pauses correspond roughly to linguistic stops in the message, so this approach is very similar to using the whole utterance as integration time. However, as pauses can be detected automatically with the VAD algorithm, this method makes it possible to adapt the parametrization algorithm to work with direct audio input if necessary.

1) Spectral Statistics: For the suprasegmental characterization of the spectrum, long-term statistics of the LFPC were calculated. For each of the 18 LFPC coefficients and their first and second derivatives, six statistics were computed, as shown in Table I. In the end, 18 × 3 × 6 = 324 suprasegmental spectral features were extracted.
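A minimal sketch of how such long-term statistics can be computed over one inter-pause segment follows. Since Table I is not reproduced here, the six statistics below (mean, variance, minimum, maximum, range, and median) are illustrative stand-ins for the ones actually used; only the resulting dimensionality (54 × 6 = 324) follows the text.

import numpy as np

def longterm_stats(frames):
    """Six example long-term statistics per feature row.

    `frames` is a (n_feats, n_frames) array of segmental features
    (e.g., the 54 LFPC values plus deltas) inside one inter-pause segment.
    The statistics used here are illustrative stand-ins for Table I."""
    stats = [frames.mean(axis=1),
             frames.var(axis=1),
             frames.min(axis=1),
             frames.max(axis=1),
             frames.max(axis=1) - frames.min(axis=1),   # range
             np.median(frames, axis=1)]
    return np.concatenate(stats)        # 54 * 6 = 324 suprasegmental values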

2) Prosody: Prosodic features are divided into five categories, according to the nature of the information they represent. Altogether, 54 prosodic features were defined.

Intonation Statistics: Intonation features were calculated as the same six statistics presented in Table I, applied to the F0 values and their first and second derivatives, giving 18 parameters as a result. Only frames detected as voiced were used for the computation of the statistics.

Intensity Statistics: Following the same approach as with the intonation, intensity features were composed of the same statistics applied to the intensity values and their first and second derivatives, providing 18 new parameters.

Speech Rate Features: They were defined as the mean and variance of the vowel duration, as shown in Table I.

Regression Features: In each detected vowel, a linear regression was estimated for the F0 and intensity values. Then the absolute value of the regression line slopes was calculated, and the six features listed in Table I were extracted (a sketch of this computation is given after this subsection). With these features, we try to combine long integration intervals (with a length comparable to a sentence) and short ones (with an approximate duration of a phoneme), as done by other authors [31].

Sentence-End Features: Prosodic values at the end of the sentences may provide additional information about the emotion. For example, an increase in F0 with respect to the rest of the sentence can represent surprise, whereas a lower intensity is usually related to sadness. Therefore, ten more features were extracted from the last vowel detected in the integration time, as shown in Table I. The normalized values are defined as the corresponding non-normalized ones divided by the mean value over all the vowels detected in the integration segment.

3) Voice Quality Features: Features related to voice quality are extracted from the glottal source signal and from the pitch period marks. They were computed only for vocalic segments, in order to consider only segments with reliable glottal source estimation. The features specified in Table I were calculated for each vowel, and the values corresponding to vowels in the same integration segment were averaged in order to obtain a single feature vector for the whole integration time. Jitter and shimmer were estimated using the five-point period perturbation quotient (ppq5) and five-point amplitude perturbation quotient (apq5) values as defined in Praat.¹ The normalized amplitude quotient (NAQ) [32] is estimated for every glottal pulse, so the NAQ value for a vowel was calculated by averaging the NAQ values obtained all along that vowel. Similarly, spectral tilt and spectral balance [33] are calculated for every frame, and the value for a vowel was calculated by averaging the values all along the vowel.
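A minimal sketch of two of the vowel-level measurements mentioned above: the absolute regression slopes of F0 and intensity within a vowel, and jitter as a five-point period perturbation quotient following the Praat definition. The function names and the 10-ms frame step are assumptions, not values taken from the paper.

import numpy as np

def vowel_regression_slopes(f0, intensity, frame_step=0.010):
    """Absolute slopes of linear regressions fitted to the F0 and intensity
    samples of one detected vowel (frame_step in seconds is an assumption)."""
    t_f0 = np.arange(len(f0)) * frame_step
    t_int = np.arange(len(intensity)) * frame_step
    f0_slope = np.polyfit(t_f0, f0, deg=1)[0]
    int_slope = np.polyfit(t_int, intensity, deg=1)[0]
    return abs(f0_slope), abs(int_slope)

def jitter_ppq5(periods):
    """Five-point period perturbation quotient: mean absolute difference
    between each period and the average of it and its four nearest
    neighbours, divided by the mean period (Praat's ppq5 definition).
    Requires at least five pitch periods."""
    periods = np.asarray(periods, dtype=float)
    local = np.array([abs(periods[i] - periods[i - 2:i + 3].mean())
                      for i in range(2, len(periods) - 2)])
    return local.mean() / periods.mean()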
¹ http://www.praat.org

TABLE II DISCRIMINABILITY VALUES OF THE FEATURE FAMILIES MEASURED WITH THE J CRITERION

IV. FEATURE ANALYSIS AND SELECTION

A. Inter-Emotion and Intra-Emotion Dispersion

The capability of a parameter set to retain the emotional characteristics and to discard the remaining attributes of the speech can be measured in terms of inter-emotion dispersion and intra-emotion dispersion. A large inter-emotion dispersion would mean that the features take very different values for each emotion, separating the distributions and making the classification task easier. A reduced intra-emotion dispersion reflects the consistency of the features within a given emotion. The relation between intra-emotion and inter-emotion dispersions provides a measure of the overlap of the class distributions. This relation can be estimated using the following criterion [34]:

J = \operatorname{tr}\left( S_w^{-1} S_b \right) \qquad (1)

where \operatorname{tr}(\cdot) denotes the trace of a matrix and S_w and S_b are the intra-class and inter-class dispersion matrices, respectively:

S_w = \frac{1}{N} \sum_{k=1}^{K} \sum_{x_i \in C_k} (x_i - \mu_k)(x_i - \mu_k)^{T} \qquad (2)

S_b = \frac{1}{N} \sum_{k=1}^{K} N_k (\mu_k - \mu)(\mu_k - \mu)^{T} \qquad (3)

N being the number of training samples, K the number of emotions, N_k the number of samples x_i in class C_k, \mu_k the centroid of the class, and \mu the global mean:

\mu_k = \frac{1}{N_k} \sum_{x_i \in C_k} x_i \qquad (4)

\mu = \frac{1}{N} \sum_{i=1}^{N} x_i \qquad (5)

The J criterion is often used in discriminant analysis, and it is a generalization of the well-known Fisher criterion (6) to the multiclass and multidimensional case [35]:

F = \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2} \qquad (6)

J values were computed for each feature family as a first estimation of their capability to discriminate emotions. The results are presented in Table II. If we focus on suprasegmental features, it is observed that prosody is less discriminative than the spectral envelope statistics. This result is especially significant, as prosodic features are by far the most used ones in the literature for emotion identification.
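The following sketch shows how such a separability measure can be computed from a feature matrix and its emotion labels, assuming the trace-based form of the criterion given in (1). The synthetic data and function name are our own, used only to illustrate the computation.

import numpy as np

def j_criterion(X, y):
    """Class separability J = tr(Sw^-1 Sb), Eqs. (1)-(5).

    X: (n_samples, n_features) feature matrix.
    y: (n_samples,) integer emotion labels."""
    n, d = X.shape
    mu = X.mean(axis=0)                          # global mean, Eq. (5)
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for k in np.unique(y):
        Xk = X[y == k]
        mu_k = Xk.mean(axis=0)                   # class centroid, Eq. (4)
        diff = Xk - mu_k
        Sw += diff.T @ diff                      # intra-class scatter, Eq. (2)
        dm = (mu_k - mu)[:, None]
        Sb += len(Xk) * (dm @ dm.T)              # inter-class scatter, Eq. (3)
    Sw /= n
    Sb /= n
    # Pseudo-inverse guards against a singular Sw when features are redundant
    return float(np.trace(np.linalg.pinv(Sw) @ Sb))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two synthetic "emotions" with shifted means: a larger shift gives a larger J
    X = np.vstack([rng.normal(0.0, 1.0, (100, 5)),
                   rng.normal(2.0, 1.0, (100, 5))])
    y = np.repeat([0, 1], 100)
    print("J =", j_criterion(X, y))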

Fig. 1. (a) Scatter plot of suprasegmental prosodic, (b) suprasegmental spectral, (c) segmental prosodic, and (d) segmental spectral features, projected over the two most discriminant directions by LDA. An: anger, Bo: boredom, Di: disgust, Fe: fear, Ha: happiness, Ne: neutral, Sa: sadness.

The difference shown in Table II could be partly justified by the fact that 324 statistics are used in the spectral parametrization and only 54 in the prosodic one. But when the calculation is repeated using only the best 54 spectral features (see the feature selection procedure in Section IV-C), they still seem to discriminate emotions better, obtaining a higher J value.

To confirm this result and to show the discriminability of each set of features visually, an LDA transformation was applied to both spectral and prosodic features. The two most discriminant directions are represented graphically in Fig. 1. As can be seen, emotions are less overlapped when using spectral features than when using prosodic ones. It is remarkable that, in both cases, the most discriminant direction (horizontal axis) seems to be related to the activation level, shifting high activation emotions (anger, happiness, and fear) to one side and low activation emotions (sadness and boredom) to the other. In Fig. 1(a), we can already identify the confusions usually described in the literature when using prosodic features, for example between anger and happiness. Disgust is also frequently reported to be hard to detect with prosodic features, and this is reflected in the scatter plot, too: the distribution of disgust samples has a large dispersion and is strongly overlapped with other emotions, especially with the neutral style. On the contrary, it is clearly separated from the other emotions when using spectral features.
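For reference, a projection like the one in Fig. 1 can be obtained with a standard LDA implementation. The sketch below uses scikit-learn on synthetic data; the real features and any preprocessing applied for the figure are not specified here.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def lda_projection(X, y):
    """Project features onto the two most discriminant LDA directions,
    as used for the scatter plots of Fig. 1 (sketch on synthetic data)."""
    lda = LinearDiscriminantAnalysis(n_components=2)
    return lda.fit_transform(X, y)          # (n_samples, 2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Three synthetic classes with shifted means stand in for the emotions
    X = np.vstack([rng.normal(m, 1.0, (50, 10)) for m in (0.0, 1.5, 3.0)])
    y = np.repeat([0, 1, 2], 50)
    print(lda_projection(X, y).shape)       # (150, 2)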

Going back to the discriminability values in Table II, it can be seen that the features related to voice quality seem to provide very little information, at least the ones considered in this work. Nevertheless, when these features are concatenated with the prosodic ones, they contribute to increasing the J value from 6.47 to 7.14. This is not so surprising if we take into account that features without discrimination power by themselves may be useful when combined with others [36], as in this case. The concatenation of all suprasegmental features provides the highest discrimination, suggesting that the information captured by the spectrum, the prosody, and the voice quality is complementary.

Regarding segmental features, it can be seen that their capability to separate emotions is almost nonexistent. This is clearly observed in Fig. 1(c) and (d), where the two most discriminant directions given by LDA are represented for prosodic primitives and LFPC features: all emotions are completely overlapped. Due to the short-term nature of the segmental parametrization, each LFPC feature vector reflects the spectral envelope of a single frame, i.e., it represents the characteristic vocal tract filter for the phonemes articulated in that frame. As the spectral envelope differs much more across phonemes than across emotions, the intra-class dispersion is very large, increasing the overlap among emotions. In the case of intonation and intensity, a similar effect occurs, as the frame-wise samples show more variation due to the linguistic content than to the emotional content of the utterances.

TABLE III UNSUPERVISED CLUSTERING RESULTS FOR SPECTRAL STATISTICS

TABLE IV UNSUPERVISED CLUSTERING RESULTS FOR PROSODIC STATISTICS

This does not mean that the features are useless. Both the J criterion and the LDA are optimal only if the classes have normal homoscedastic distributions, i.e., if they have the same covariance matrices [35]. They cannot make use of the subtle differences in the shape of the distributions, which can be captured if the right classification engine is applied. Usually Gaussian mixture models (GMMs) are used for this task, as they can capture small differences among distributions, provided that enough training samples are available. As segmental features are extracted once every 10 ms, there are indeed enough training samples to train robust and accurate models. In fact, GMMs are very popular in emotion identification when frame-wise features are used. Unfortunately, this means that no conclusions can be extracted from the J measures for segmental parametrizations. The only way to get an estimation of their capability to distinguish emotions is to perform empirical tests of automatic identification of emotions. The results of these empirical tests are presented in Section V.

B. Unsupervised Clustering

The J values given above provide a clue about the discrimination power of each feature family, and the scatter plot of the most discriminant directions estimated by LDA gives a visual representation of it. But using only two directions is somewhat unrealistic, since the addition of more dimensions could provide more separation among the classes. Unfortunately, it is not possible to give an understandable graphical representation of more than two dimensions. However, it is possible to obtain descriptive results that can provide some insight into what happens when all features are used. For this purpose, a blind unsupervised clustering was performed using the k-means algorithm. If the emotional classes are correctly separated, the resulting clusters should correspond to the emotions. The outcome of this clustering is shown in Tables III and IV for suprasegmental features. No clustering was performed for segmental features because their distributions are so overlapped that the algorithm would not be able to locate the classes correctly.

The clustering is able to identify the emotions quite accurately, with the spectral parametrization performing better than the prosodic one, as predicted by the J values. Using spectral statistics, almost all samples belonging to a given emotion are assigned to the same cluster, whereas prosodic features exhibit the typical confusions among emotions: anger with happiness, and neutral with boredom and sadness. If we consider these tables as classification confusion matrices, prosodic features would achieve an overall accuracy of 75.87%, whereas spectral statistics would get 98.68%. Note that the expected accuracy of an emotion identification system is much lower, because the test utterances will not be seen during the training.
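A sketch of how such a clustering evaluation can be reproduced: blind k-means followed by a majority-vote assignment of clusters to emotions, which yields the confusion-matrix accuracy quoted above. The k-means settings (number of initializations, seed) are assumptions, as the paper does not report them.

import numpy as np
from sklearn.cluster import KMeans

def clustering_accuracy(X, y, n_clusters=7, seed=0):
    """Blind k-means clustering of suprasegmental features, then map each
    cluster to its most frequent emotion and compute the overall accuracy.
    y must contain integer emotion labels."""
    y = np.asarray(y)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X)
    correct = 0
    for c in range(n_clusters):
        members = y[labels == c]
        if members.size:
            # Majority vote: assign the cluster to its dominant emotion
            correct += np.bincount(members).max()
    return correct / len(y)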

C. Feature Selection

The use of noninformative or redundant features may decrease the accuracy of a classifier, due to the confusion they add to the system. A feature selection algorithm can help identify the truly useful features, reducing the dimensionality of the parametrization and making the classifier run faster and more accurately. Furthermore, detecting the discriminative features may provide a deeper understanding of the influence of the emotions on the acoustic characteristics of the voice.

The minimal-redundancy-maximal-relevance (mRMR) algorithm [37] has been used to get a ranking of the features, from the most to the least significant one. This algorithm selects the features that maximize the mutual information between the training samples and their classes (maximal relevance) and simultaneously minimize the dependency among the selected features (minimal redundancy). mRMR has been applied to all five parametrization families defined in Section III, as well as to their combinations. The ranking of these combinations, presented in Table V, is very interesting, as it shows which parametrization is preferred. When all suprasegmental features are concatenated, among the first ten features ranked we find four prosodic and six spectral features. Looking further, up to position 30, we find ten prosodic, one voice quality, and 19 spectral features. These results are in line with the ones obtained in [16] and [17], supporting the previous results in Section IV-A, which suggest that spectral suprasegmental features provide more information about the emotion than prosodic ones.

Voice quality features are in general ranked low in the list, even though some studies have claimed their relation with the emotional state of the speaker [19], [20]. The low ranking is probably due to the automatic nature of the parametrization. In works dealing with voice quality, features are usually extracted with human intervention, providing very accurate values. In the case presented in this paper, all processing was fully automatic, and even though voice quality features were extracted only for vowels (which are supposedly very stable), the resulting estimation errors may increase the confusability in the system, making them not very suitable for identification. Errors during the automatic vowel detection may further increase this confusability. These results are also confirmed by the low J values obtained by the voice quality features.
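A greedy mRMR-style ranking can be sketched as below, using mutual information estimates from scikit-learn in the difference (relevance minus redundancy) form. The original implementation of Peng et al. [37] uses its own MI estimator and discretization, so this is only an approximation for illustration.

import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_ranking(X, y, n_select=None, seed=0):
    """Greedy mRMR-style ranking (difference form).

    Relevance: I(feature; class).  Redundancy: mean I(feature; already selected).
    Slow for large feature sets; intended only as a readable sketch."""
    n_feats = X.shape[1]
    n_select = n_feats if n_select is None else n_select
    relevance = mutual_info_classif(X, y, random_state=seed)
    selected, remaining = [], list(range(n_feats))
    while remaining and len(selected) < n_select:
        best, best_score = None, -np.inf
        for i in remaining:
            if selected:
                redundancy = mutual_info_regression(
                    X[:, selected], X[:, i], random_state=seed).mean()
            else:
                redundancy = 0.0
            score = relevance[i] - redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return selected          # feature indices, most to least significant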

TABLE V BEST TEN FEATURES RANKED FOR FEATURE COMBINATIONS

Fig. 2. Nested double cross-validation. In the (a) outer level, five blocks are defined according to speakers. Four are used for training and the last one for speaker-independent testing. In the (b) inner level, the recordings in the training set are randomly rearranged to form five new sub-blocks for development purposes.

The ranking for segmental parametrizations was performed separately for the voiced and unvoiced streams. Following the approach taken with the prosodic primitives, LFPC features were also divided into voiced and unvoiced streams, so that the frame-wise spectral parametrization and the prosodic primitives can be easily combined by concatenation. In the combination of the voiced streams, the three intonation features (F0 and its first and second derivatives) are ranked among the ten best ones, while the intensity deltas are between positions 10 and 20. The frame intensity itself is far below in the ranking for the voiced stream, but appears in second position for the unvoiced one. Having the prosodic primitives ranked so high in the list corroborates the importance of these features for emotion identification.

V. EXPERIMENTAL EVALUATION

A. Experimental Framework

In order to validate the results of the feature analysis, emotion identification tests were carried out on the Berlin database. Suprasegmental features were modeled with SVMs using an RBF kernel, whereas GMMs were used for segmental features. This way, the characteristics of each parametrization, as shown in Section IV-A, can be exploited. On the one hand, GMMs should be able to model the subtle differences in the distribution of the highly overlapped segmental features, thanks to the large number of training samples provided by the frame-wise parametrization. On the other hand, SVMs can take advantage of the larger separability of suprasegmental features. Furthermore, the high generalization capability exhibited by SVMs [38] will guarantee the creation of robust models, even though the suprasegmental parametrization provides very few training samples. In the case of SVMs, the one-vs-all approach [39] was used for the multiclass classification.

The experimental framework was designed as a nested double cross-validation (see Fig. 2). The outer level ensures speaker-independent results, where the speakers of the test recordings have not been seen during the training. The inner level is intended to provide development tests for the optimization of the classifiers. The speakers in the database are divided into five blocks B1, ..., B5 for the outer level. Each block contains one male and one female speaker, so that gender balance is kept within the blocks. For the i-th loop, the blocks Bj with j ≠ i define the training set, leaving block Bi for testing. For the inner loop, the sentences available in the training set are randomly distributed into five sub-blocks, which are then used for the inner-level cross-validation. Once the five inner-level loops have ended, their results are gathered and used to estimate optimal values for the number of mixtures in the GMM, the RBF kernel spread, the SVM misclassification cost, and the optimum number of features. Finally, the whole training set is used to train the system for the i-th loop of the outer level, and the testing is performed on block Bi.
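A sketch of the index bookkeeping for this nested double cross-validation, assuming scikit-learn splitters: the outer folds leave one speaker block out, and the inner folds ignore speaker identity, as in Fig. 2. The block assignment and the absence of stratification are assumptions.

import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, KFold

def nested_speaker_cv(n_utts, speaker_block, n_inner=5, seed=0):
    """Yield (train_idx, test_idx, inner_folds) for the nested double
    cross-validation of Fig. 2. `speaker_block` maps each utterance to one
    of the five speaker blocks; inner folds are random within the training set."""
    idx = np.arange(n_utts).reshape(-1, 1)
    outer = LeaveOneGroupOut()                      # one speaker block held out
    for train_idx, test_idx in outer.split(idx, groups=speaker_block):
        inner = KFold(n_splits=n_inner, shuffle=True, random_state=seed)
        # Inner folds are positions within train_idx; map back with train_idx[pos]
        inner_folds = list(inner.split(train_idx.reshape(-1, 1)))
        yield train_idx, test_idx, inner_folds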

B. Selection of the Number of Features

In order to estimate the optimum number of features, development tests were repeated adding one feature at a time, according to the ranking obtained in Section IV-C. Fig. 3 shows the resulting accuracy in the development tests using suprasegmental features, as a function of the number of features. According to these results, we can observe that, even though the voice quality parametrization showed a very low class separation (see Table II), it does not perform so badly considering the low number of features it uses. With all five voice quality features, the system gets 49.4% correct classifications, whereas with the best five prosodic features it gets 50.7%. Furthermore, the combination of prosody and voice quality seems to be beneficial, as predicted by the estimated class separation values: the best prosodic system reaches a maximum of 65.5% with 39 features, whereas the combination obtains 67.4% with 17 features. Not only does the accuracy increase, but it also reaches its maximum with fewer features.

Suprasegmental spectrum statistics clearly outperform prosody, at least when more than 15 features are used, which again confirms the conclusions obtained with the class separability analysis. Spectrum statistics reach a maximum of 75.4% accuracy with 96 features and keep steady from then on. The combination of all suprasegmental parametrizations obtains the best results if more than 25 features are used, reaching an almost steady state at 152 features with 77.9% accuracy, with a marginally better absolute maximum of 78.6% with 247 features.
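The development sweep used to pick the number of features can be sketched as follows: features are added one at a time in ranking order and a fresh classifier is trained and scored each time. The helper names are our own, and make_clf stands for any classifier constructor (for example, an RBF SVM).

import numpy as np

def accuracy_sweep(ranking, X_tr, y_tr, X_dev, y_dev, make_clf):
    """Development sweep over the number of features, in mRMR ranking order.

    ranking: list of feature indices, best first (Section IV-C).
    make_clf: callable returning a fresh, untrained classifier."""
    accs = []
    for n in range(1, len(ranking) + 1):
        cols = ranking[:n]
        clf = make_clf().fit(X_tr[:, cols], y_tr)
        accs.append((clf.predict(X_dev[:, cols]) == y_dev).mean())
    best_n = int(np.argmax(accs)) + 1        # optimum number of features
    return best_n, accs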

Fig. 3. Development results for suprasegmental parametrizations as a function of the number of features. (a) shows a zoomed view of the results with few features.

TABLE VI ACCURACY ON DEVELOPMENT AND TEST RESULTS WITH SELECTED NUMBER OF FEATURES AND ALL FEATURES

Fig. 4. Development results for segmental parametrizations as a function of the number of features.

The development results for segmental features are represented in Fig. 4. None of the curves reach a real saturation point, as the accuracy continues growing as new features are added, but it can be seen that the improvement is very small once 20 features have been used. For example, the LFPC parameters get 70.9% accuracy with 20 features, and when all 54 are used, this number increases only to 72.9%. When voiced and unvoiced frames are treated separately, the accuracy of the LFPC decreases, which seems reasonable as there are approximately half as many training samples for each stream. Results with prosodic primitives seem rather modest, reaching 65.3% accuracy in the voiced stream and 50.8% in the unvoiced one. But it should be kept in mind that only six and three features are used, respectively. If, for example, only the best six features in the voiced LFPC stream are kept, the system gets 58.8% accuracy. However, adding the prosodic primitives to the LFPC features does not improve the results significantly. According to the ranking shown in Table V, prosodic primitives are more informative than most LFPC parameters, as they are among the first 10 or 20 features selected.

However, this is true only if few features are selected, e.g., fewer than 15. When the number of features increases, the LFPC features that are added compensate for the information of the prosodic primitives, so that in the end the combination has no effect on the overall accuracy.

Table VI summarizes the development results for each parametrization family and their combinations, both with the estimated optimal number of features and with the complete set. The final speaker-independent test results are also presented for each case. Looking at these final test results, it can be concluded that the spectral statistics are the best isolated parametrization, reaching 70.5% accuracy with 96 features (and almost the same accuracy with all 324 features). Among the feature combinations, the best result is obtained combining all suprasegmental features, providing 72.2% correct answers with 152 features and 72.5% with all 383.

C. Late Fusion Combination

So far, early fusion schemes have been used, i.e., concatenating parametrizations. But it may be interesting to see whether results improve with a late fusion system, i.e., combining not the features themselves but the results of the classifiers trained with them. Furthermore, late fusion allows combining the information captured by parametrizations of different temporal structure, such as segmental and suprasegmental features or the voiced and unvoiced streams. An SVM-based fusion system [40], [41] has been used for this task. Given an utterance to be classified and a set of classifiers, a vector is formed with the scores provided by the classifiers for each emotion. This score vector is then classified by the fusion SVM to get the final decision. With an appropriate training, the SVM is expected to learn the score patterns of the errors and hits of the classifiers and improve the results. The scores of the development tests were used for this training.
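A minimal sketch of such a score-level fusion with an SVM, assuming each utterance is represented by the concatenated per-emotion scores of the individual classifiers. The kernel parameters would be tuned on development data, and the names here are our own.

import numpy as np
from sklearn.svm import SVC

def train_fusion_svm(dev_scores, dev_labels, C=1.0, gamma="scale"):
    """Late-fusion SVM: dev_scores has shape
    (n_utterances, n_classifiers * n_emotions) and holds the per-emotion
    scores of every individual classifier on the development data."""
    fusion = SVC(C=C, gamma=gamma, kernel="rbf")
    fusion.fit(dev_scores, dev_labels)
    return fusion

def fuse(fusion, test_scores):
    """Final emotion decision for each test utterance from its score vector."""
    return fusion.predict(test_scores)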

TABLE VII ACCURACY ON LATE FUSION TEST RESULTS WITH SELECTED NUMBER OF FEATURES AND ALL FEATURES

Results for the late fusion system are shown in Table VII. The fusion of suprasegmental features (column 1) achieves very similar results with both early and late fusion systems. Segmental features (column 2), on the other hand, show a great improvement. However, this improvement is partly due to the combination of both the voiced and unvoiced streams of LFPC and prosodic primitives, which is not possible with early fusion. Modeling the voiced and unvoiced streams separately in segmental features and combining them afterwards through late fusion provides good results. While the LFPC parametrization obtains 69.9% accuracy with 20 features and 72.2% with all 54, the fusion of its streams (column 3) gets 72.0% and 76.5%, respectively (an error reduction of 7% and 15%, respectively). A noticeable improvement can also be seen when combining the streams of prosodic primitives (column 4). Therefore, we decided to keep the voiced and unvoiced streams separated.

The late fusion system can also be used to combine segmental and suprasegmental systems. Combining the results from prosody and LFPC (column 5) yields a similar accuracy to combining the results from long-term spectral statistics and frame-wise prosodic primitives (column 6). In both cases, the accuracy is higher than when using the spectral statistics alone (the best isolated system) and slightly better than using the early fusion of all suprasegmental features (the best early fusion system). When combining LFPC with spectral statistics (column 7) or prosody with prosodic primitives (column 8), there is also a significant improvement. This means that fusing systems that use features of the same acoustic origin but different time spans can also help to reduce classification errors.

As a last experiment, all features were combined with the late fusion system (column 9): suprasegmental prosody, voice quality, and spectral statistics, together with segmental LFPC and prosody primitives, with separate voiced and unvoiced streams. Altogether, seven classifiers were combined in this last test. The obtained results are the best among all the systems tested: 78.3% for all features and 76.8% for the selected ones, which implies a 20% error reduction with respect to the best early fusion system.

D. Analysis of the Results by Emotion

Altogether, it seems that spectral characteristics provide higher identification accuracies than prosodic ones. Nevertheless, prosodic features may be more appropriate to identify certain emotions, even though they are not the best parametrization overall. The identification rates of each emotion have been examined separately, in order to verify whether or not this is the case.

Fig. 5. Comparison of the identification accuracy for each emotion with prosodic and spectral parameters.

Fig. 5 presents the identification rates obtained for each emotion with some representative parametrizations: prosodic and spectral statistics in the case of suprasegmental features, and the late fusion of the voiced and unvoiced streams of LFPC and prosody primitives in the case of segmental features. It can be observed that fear and happiness are the worst identified emotions, with less than 50% of recordings correctly classified in most cases. Looking into the results, we observed that happiness is mostly confused with anger, while fear is confused with neutral when using suprasegmental features and with neutral and happiness when using segmental parametrizations. This result suggests that the considered parametrizations are not suitable to capture the characteristics of these emotions, and that other features should be taken into account.

Both anger and neutral get similar accuracies with prosodic and spectral long-term statistics. On the other hand, LFPC seems a more suitable short-term parametrization than the prosody primitives for these emotions. In the case of boredom and sadness, it is the segmental features that obtain similar identification rates, whereas among the suprasegmental parametrizations, the spectral statistics get better results. Finally, disgust seems to be very difficult to detect with prosody-related features, but the accuracy increases significantly with spectral characteristics, both in segmental and suprasegmental parametrizations. Spectral features provide similar or higher accuracies than prosodic ones for all emotions. This suggests that, even when each emotion is considered on its own, spectral characteristics are more suitable for emotion identification.

VI. CONCLUSION

The main goal of the work presented here was to analyze features related to prosody, spectral envelope, and voice quality regarding their capability to separate emotions. This analysis sheds light on an aspect of the automatic recognition of emotions in speech that is not fully covered in the literature.

Although there are many studies that analyze how individual features change according to the emotional state of the speaker [7], [8], [42], [43], they do not evaluate the behavior of the whole feature set. This can lead to inaccurate conclusions about the usefulness of a given parametrization, as the features that are most discriminant individually may not be the most discriminant when taken together. Indeed, most of these studies show that, when considered individually, changes in the prosodic features are more significant than changes in the spectral parameters. However, the results presented in this work suggest that if the whole feature set is considered, spectral envelope parametrizations are more informative than prosodic ones.

The analysis presented in this paper has been performed by means of discriminability measures, unsupervised clustering, and feature ranking. The results have been confirmed with empirical experiments of automatic identification of emotions. The discriminability criterion that has been used provides a way to estimate the performance of the whole feature set instead of individual parameters. These measures have been complemented with unsupervised clustering to see whether or not the parametrizations provide enough class separation. Both methods reveal that long-term statistics of the spectral envelope provide larger separation among emotions than traditional prosodic features. This explains the higher accuracy of the former in the experimental tests. The measures also show that combining suprasegmental parametrizations of spectral envelope, prosody, and voice quality increases the separation among emotions, confirming that the use of features extracted from different information sources helps in reducing the identification error.

Unfortunately, these methods are not suitable for the analysis of segmental parametrizations. Instead, a feature ranking algorithm has been applied to both segmental and suprasegmental parametrizations to detect the most discriminant features. Although many papers in the field mention the use of some kind of feature selection algorithm, only a few of them discuss the outcome of this selection. When this outcome is provided, long-term spectral statistics are usually selected first, over prosodic ones [16], [17], suggesting that suprasegmental spectral features provide more information about the emotional state of the speaker. The feature selection results shown in the present paper are in agreement with this conclusion.

Some works in the literature rely entirely on empirical experiments to determine the capacity of a given parametrization to identify emotions. Although this approach is suitable to find out the most discriminant parametrization, it gives no further information about the relation among these features. Furthermore, most works that use a combination of different sets of features (e.g., prosodic and spectral) provide only the accuracy results for that combination, but not separately for each feature set, so it is not possible to deduce which set is more informative. In this work, we have analyzed each feature set independently and in combination with the others. The experimental results have also been given for each parametrization and for their combinations. This way, it is possible to measure the accuracy gain when different feature sets are combined, and to see whether or not this combination is advantageous. For example, even though segmental prosodic primitives were ranked high by the feature selection algorithm, it has been shown that the combination of frame-wise LFPC and prosodic primitives gets similar accuracy as the LFPC alone.

Only with a reduced number of features is this combination helpful (see Fig. 4). Also for suprasegmental parametrizations, prosodic features perform better than the spectral ones only when few features are used [Fig. 3(a)]. According to these results, we can say that traditional prosodic features seem to be the most appropriate for automatic identification of emotions only if they are considered individually or in a very reduced feature set. But if large feature sets are considered, spectral features outperform them.

The analysis presented in this paper has been performed using an acted speech emotional database. In order to see whether the results are applicable to real-life systems, they should be validated using different databases, especially databases of natural emotions. Nevertheless, the database used in this work is supposed to contain highly natural emotion portrayals, so it is very likely that these conclusions are valid to a great extent. A previous work using the Aibo database of natural emotions [44] showed similar conclusions, with frame-wise MFCC features outperforming long-term prosodic statistics [45]. Also in [12], [13], and [46], MFCC features get better results than suprasegmental ones.

The results from the late fusion tests suggest that the combined use of parametrizations extracted from the same information source but with different time scales, i.e., segmental and suprasegmental features, increases the accuracy of the system. The difference in the time scales and in the classifiers makes each subsystem retain different characteristics of the emotions. In fact, one of the best results has been achieved with the late fusion of long-term LFPC statistics with the voiced and unvoiced streams of frame-wise LFPC features. This system has only been outperformed by the late fusion of all segmental and suprasegmental features, including spectral envelope, prosody, and voice quality. However, this last system is much more complex, and the obtained improvement is very small. The use of all features requires estimating LFPC, intonation values, voiced-unvoiced decisions, pitch period marks, inverse filtering, and vowel positions. This makes the parametrization step very complex and time-consuming. Furthermore, it uses seven different classifiers prior to the fusion. The spectral system, on the other hand, needs only the LFPCs, some simple statistics obtained from them, and the voiced-unvoiced decisions in order to separate the streams, reducing the number of classifiers to three. The difference in accuracy may not justify increasing the complexity of the system to such a degree.

We are not claiming that features extracted from prosody are useless. Several papers show that humans are able to identify emotions in prosodic copy-synthesis experiments [5], [47], confirming that prosody does carry a great amount of emotional information, at least for some emotions. But the traditional prosodic representations may not be well suited to capture this information. On the one hand, long-term statistics estimated over the whole sentence lose the information of specific characteristic prosodic events. On the other hand, short-term prosodic primitives do not capture the prosodic structure correctly, which is suprasegmental by definition. The results suggest that a new, more elaborate representation is needed to effectively extract the emotional information contained in the prosody.

REFERENCES
[1] K. R. Scherer, Vocal communication of emotion: A review of research paradigms, Speech Commun., vol. 40, pp. 227–256, Apr. 2003.
[2] K. R. Scherer, Psychological models of emotion, in The Neuropsychology of Emotion, J. Borod, Ed. Oxford, U.K.: Oxford Univ. Press, 2000, ch. 6, pp. 137–166.
[3] P. Ekman, An argument for basic emotions, Cognit. Emotion, vol. 6, pp. 169–200, 1992.
[4] C. Darwin, The Expression of the Emotions in Man and Animals, 3rd ed. Oxford, U.K.: Oxford Univ. Press, 1998.
[5] F. Burkhardt and W. F. Sendlmeier, Verification of acoustical correlates of emotional speech using formant-synthesis, in Proc. ISCA Tutorial and Research Workshop Speech and Emotion, Belfast, Ireland, Sep. 2000, pp. 151–156.
[6] C. F. Huang and M. Akagi, A three-layered model for expressive speech perception, Speech Commun., vol. 50, pp. 810–828, Oct. 2008.
[7] K. R. Scherer, R. Banse, H. G. Wallbott, and T. Goldbeck, Vocal cues in emotion encoding and decoding, Motiv. Emotion, vol. 15, no. 2, pp. 123–148, 1991.
[8] M. Schröder, Speech and emotion research, Ph.D. dissertation, Universität des Saarlandes, Saarbrücken, Germany, 2003.
[9] E. Navas, I. Hernáez, A. Castelruiz, J. Sánchez, and I. Luengo, Acoustic analysis of emotional speech in standard Basque for emotion recognition, in Progress in Pattern Recognition, Image Analysis and Applications, ser. Lecture Notes in Computer Science. Berlin, Germany: Springer, Oct. 2004, vol. 3287, pp. 386–393.
[10] D. Erickson, Expressive speech: Production, perception and application to speech synthesis, Acoust. Sci. Tech., vol. 26, pp. 317–325, 2005.
[11] T. L. Nwe, S. W. Foo, and L. C. de Silva, Speech emotion recognition using hidden Markov models, Speech Commun., vol. 41, pp. 603–623, Jun. 2003.
[12] S. Kim, P. G. Georgiou, S. Lee, and S. Narayanan, Real-time emotion detection system using speech: Multi-modal fusion of different timescale features, in Proc. Int. Workshop Multimedia Signal Processing, Crete, Greece, Oct. 2007, pp. 48–51.
[13] B. Vlasenko, B. Schuller, A. Wendemuth, and G. Rigoll, Frame vs. turn-level: Emotion recognition from speech considering static and dynamic processing, in Affective Computing and Intelligent Interaction, ser. Lecture Notes in Computer Science. Berlin, Germany: Springer, 2007, vol. 4738, pp. 139–147.
[14] J. Nicholson, K. Takahashi, and R. Nakatsu, Emotion recognition in speech using neural networks, Neural Comput. Appl., vol. 9, pp. 290–296, Dec. 2000.
[15] O. W. Kwon, K. Chan, J. Hao, and T. W. Lee, Emotion recognition by speech signals, in Proc. Eurospeech, Geneva, Switzerland, 2003, pp. 125–128.
[16] T. Vogt and E. André, Improving automatic emotion recognition from speech via gender differentiation, in Proc. LREC, Genoa, Italy, May 2006.
[17] B. Schuller, R. Müller, M. Lang, and G. Rigoll, Speaker independent emotion recognition by early fusion of acoustic and linguistic features within ensembles, in Proc. Interspeech, Lisbon, Portugal, Sep. 2005, pp. 805–808.
[18] R. López-Cózar, Z. Callejas, M. Kroul, J. Nouza, and J. Silovský, Two-level fusion to improve emotion classification in spoken dialogue systems, in Graphics Recognition. Recent Advances and New Opportunities, ser. Lecture Notes in Computer Science. Berlin, Germany: Springer, 2008, vol. 5246, pp. 617–624.
[19] M. Lugger and B. Yang, The relevance of voice quality features in speaker independent emotion recognition, in Proc. ICASSP, Honolulu, HI, Apr. 2007, vol. 4, pp. 17–20.
[20] C. Gobl and A. Ní Chasaide, The role of voice quality in communicating emotion, mood and attitude, Speech Commun., vol. 40, pp. 189–212, Apr. 2003.
[21] R. Tato, R. Santos, R. Kompe, and J. Pardo, Emotional space improves emotion recognition, in Proc. ICSLP, Sep. 2002, pp. 2029–2032.
[22] R. Müller, B. Schuller, and G. Rigoll, Enhanced robustness in speech emotion recognition combining acoustic and semantic analyses, in Proc. From Signals to Signs of Emotion and Vice Versa, Santorini, Greece, Sep. 2004.
[23] A. Nogueiras, A. Moreno, A. Bonafonte, and J. B. Mariño, Speech emotion recognition using hidden Markov models, in Proc. Eurospeech, Aalborg, Denmark, Sep. 2001, pp. 2679–2682.

[24] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, On combining classifiers, IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 3, pp. 226–239, Mar. 1998.
[25] D. Ruta and B. Gabrys, An overview of classifier fusion methods, Comput. Inf. Syst., vol. 7, no. 1, pp. 1–10, Feb. 2000.
[26] F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, and B. Weiss, A database of German emotional speech, in Proc. Interspeech, Lisbon, Portugal, Sep. 2005, pp. 1517–1520.
[27] J. Ramirez, J. C. Segura, C. Benitez, A. de la Torre, and A. Rubio, Efficient voice activity detection algorithms using long-term speech information, Speech Commun., vol. 42, pp. 271–287, Apr. 2004.
[28] P. Alku, H. Tiitinen, and R. Näätänen, A method for generating natural-sounding speech stimuli for cognitive brain research, Clin. Neurophysiol., vol. 110, pp. 1329–1333, Aug. 1999.
[29] I. Luengo, I. Saratxaga, E. Navas, I. Hernáez, J. Sánchez, and I. Sainz, Evaluation of pitch detection algorithms under real conditions, in Proc. ICASSP, Honolulu, HI, Apr. 2007, pp. 1057–1060.
[30] I. Luengo, E. Navas, J. Sánchez, and I. Hernáez, Detección de vocales mediante modelado de clusters de fonemas, Procesado del Lenguaje Natural, vol. 43, pp. 121–128, Sep. 2009.
[31] F. Ringeval and M. Chetouani, Exploiting a vowel based approach for acted emotion recognition, in Verbal and Nonverbal Features of Human-Human and Human-Machine Interaction, ser. Lecture Notes in Computer Science. Berlin, Germany: Springer, Oct. 2008, vol. 5042, pp. 243–254.
[32] T. Bäckström, P. Alku, and E. Vilkman, Time-domain parameterization of the closing phase of glottal airflow waveform from voices over a large intensity range, IEEE Trans. Speech Audio Process., vol. 10, no. 3, pp. 186–192, Mar. 2002.
[33] R. van Son and L. Pols, An acoustic description of consonant reduction, Speech Commun., vol. 28, pp. 125–140, Jun. 1999.
[34] K. Fukunaga, Introduction to Statistical Pattern Recognition. New York: Academic, 1990.
[35] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. New York: Wiley, 2001.
[36] I. Guyon and A. Elisseeff, An introduction to variable and feature selection, J. Mach. Learn. Res., vol. 3, pp. 1157–1182, Mar. 2003.
[37] H. Peng, F. Long, and C. Ding, Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy, IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 8, pp. 1226–1238, Aug. 2005.
[38] C. J. Burges, A tutorial on support vector machines for pattern recognition, Data Min. Knowl. Discov., vol. 2, pp. 121–167, 1998.
[39] C. W. Hsu and C. J. Lin, A comparison of methods for multi-class support vector machines, IEEE Trans. Neural Netw., vol. 13, no. 2, pp. 415–425, Mar. 2002.
[40] J. Fierrez-Aguilar, D. Garcia-Romero, J. Ortega-Garcia, and J. Gonzalez-Rodriguez, Adapted user-dependent multimodal biometric authentication exploiting general information, Pattern Recognit. Lett., vol. 26, no. 16, pp. 2628–2639, Dec. 2005.
[41] B. Gutschoven and P. Verlinde, Multi-modal identity verification using support vector machines (SVM), in Proc. Int. Conf. Information Fusion, Paris, France, Jul. 2000, vol. 2, pp. 3–8.
[42] R. Banse and K. R. Scherer, Acoustic profiles in vocal emotion expression, J. Personal. Social Psychol., vol. 70, no. 3, pp. 614–636, 1996.
[43] L. Devillers, I. Vasilescu, and L. Vidrascu, F0 and pause features analysis for anger and fear detection in real-life spoken dialogs, in Proc. Speech Prosody, Nara, Japan, Mar. 2004, pp. 205–208.
[44] A. Batliner, S. Steidl, B. Schuller, D. Seppi, K. Laskowski, T. Vogt, L. Devillers, L. Vidrascu, N. Amir, L. Kessous, and V. Aharonson, Combining efforts for improving automatic classification of emotional user states, in Proc. Information Society - Language Technologies Conf. (IS-LTC), Ljubljana, Slovenia, Oct. 2006, pp. 240–245.
[45] I. Luengo, E. Navas, and I. Hernáez, Combining spectral and prosodic information for emotion recognition in the Interspeech 2009 Emotion Challenge, in Proc. Interspeech, Brighton, U.K., Sep. 2009, pp. 332–335.
[46] I. Luengo, E. Navas, I. Hernáez, and J. Sánchez, Automatic emotion recognition using prosodic parameters, in Proc. Interspeech, Lisbon, Portugal, Sep. 2005, pp. 493–496.
[47] E. Navas, I. Hernáez, and I. Luengo, An objective and subjective study of the role of semantics in building corpora for TTS, IEEE Trans. Speech Audio Process., vol. 14, no. 4, pp. 1117–1127, Jul. 2006.

Iker Luengo received the telecommunication engineering degree from the University of the Basque Country, Bilbao, Spain, in 2003 and is currently pursuing the Ph.D. degree in telecommunications. He has been a researcher in the AhoLab Signal Processing Group in the Electronics and Telecommunications Department since 2003. He has participated as a research engineer in government-funded R&D projects, focused on emotional speech, speaker recognition, diarization of meetings, and speech prosody. Mr. Luengo is a member of the International Speech Communication Association (ISCA) and the Spanish thematic network on Speech Technologies (RTTH).

Eva Navas received the telecommunication engineering degree and the Ph.D. degree from the Department of Electronics and Telecommunications of the University of the Basque Country, Bilbao, Spain. Since 1999, she has been a researcher at the AhoLab Signal Processing Group. She is currently teaching at the Faculty of Industrial and Telecommunication Engineering in Bilbao. She has participated as a research engineer in government-funded R&D projects as well as in privately-funded research contracts. Her research is focused on expressive speech characterization, recognition, and generation. Dr. Navas is a member of the International Speech Communication Association (ISCA), the Spanish thematic network on Speech Technologies (RTTH), and the European Center of Excellence on Speech Synthesis (ECESS).

Inmaculada Hernáez received the telecommunications engineering degree from the Universitat Politècnica de Catalunya, Barcelona, Spain, and the Ph.D. degree in telecommunications engineering from the University of the Basque Country, Bilbao, Spain, in 1987 and 1995, respectively. She is a Full Professor in the Electronics and Telecommunications Department, Faculty of Engineering, University of the Basque Country, in the area of signal theory and communications. She is a founding member of the AhoLab Signal Processing Research Group. Her research interests are signal processing and all aspects related to speech processing. She is also interested in the development of speech resources and technologies for the Basque language. Dr. Hernáez is a member of the International Speech Communication Association (ISCA), the Spanish thematic network on Speech Technologies (RTTH), and the European Center of Excellence on Speech Synthesis (ECESS).
