1. Introduction
Vietnamese has been spoken for thousands of years and is closely related to the Austro-Asiatic languages. Today, with a population of more than 80 million people, Vietnamese is clearly one of the main languages of Southeast Asia. Vietnamese has around seven thousand word units [2]. Many of them carry a meaning on their own, while others must be combined to form a meaningful word. In the latter case, a meaningful word usually consists of two or three word units, which is similar to Chinese.
Fig 1 and Fig 2 show all the characters and diacritics used in the modern Vietnamese writing system. From the phonetic point of view, Vietnamese is a monosyllabic, tonal language. That means every word unit consists of a single syllable, and the meaning of each word unit depends on the tone (basically a specific pitch and glottalization pattern) with which it is pronounced. There are six distinct tones. The first one is not marked (denoted by none), and the other five are indicated by diacritics applied to the main vowel of the syllable, as shown in Fig 3.
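Because the tone mark is a combining diacritic on the main vowel, it can be read off programmatically after Unicode decomposition. The following is a minimal Python sketch (standard library only) that classifies a syllable's tone, assuming the five combining marks below are the only tone diacritics; vowel-quality marks such as the circumflex are left untouched by the lookup:

```python
import unicodedata

# Combining marks for the five marked Vietnamese tones;
# the unmarked "ngang" tone is the default.
TONE_MARKS = {
    "\u0301": "sắc",    # acute accent
    "\u0300": "huyền",  # grave accent
    "\u0309": "hỏi",    # hook above
    "\u0303": "ngã",    # tilde
    "\u0323": "nặng",   # dot below
}

def tone_of(syllable: str) -> str:
    """Return the tone name of a Vietnamese syllable."""
    # NFD decomposition separates base letters from combining diacritics,
    # so the tone mark shows up as its own character.
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            return TONE_MARKS[ch]
    return "ngang"  # no tone mark present

print(tone_of("ma"))  # ngang
print(tone_of("mà"))  # huyền
print(tone_of("mã"))  # ngã
```

Vowel-quality diacritics (â, ă, ơ, ...) decompose to different combining marks, so syllables carrying both a quality mark and a tone mark are still classified correctly.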
Tone poses the main problem for the recognition of Southeast Asian languages, including Chinese [6], Thai [7] and Vietnamese, because it spreads over the whole syllable, which makes it difficult to correctly specify the desired basic units. Most current modeling strategies for Vietnamese are based on the decomposition of the syllable. Among them, decomposition of the syllable into initial and final parts, with some modifications, has become dominant. As for the tonal problem, there are two ways to deal with it: modeling tones separately or together with the basic units. In previous papers, Quan et al. have reported robust approaches for Vietnamese recognition, covering handwritten and printed character images, isolated word speech recognition [1], and preliminary results on acoustic modeling for continuous speech recognition [2-5]. In this paper, we present our recent improvements in acoustic modeling for Vietnamese LVCSR. Specifically, diphthongs and triphthongs are also explored in the acoustic model, instead of using only monophthongs as in the previous approach. With the new model, the number of basic units increases compared to the former, but the average number of HMMs per word unit is reduced significantly. Experimental results on the Vietnamese Broadcast News corpus (VNBN) show that the new model reduces both the word error rate and the recognition time. The paper is organized as follows. In Section 2, we briefly describe the baseline system, including the training and test data and the acoustic and language modeling. The new improvements in acoustic modeling are presented in Section 3. Finally, Section 4 closes with some notes and remarks.
2. The Baseline System

2.1 Training and Test Data

Table 1. Data for training and testing

The whole VNBN corpus was collected from February to June 2007 from VOH, the Ho Chi Minh City broadcaster, and consists of a total of 27 hours of speech. It was manually transcribed and segmented into sentences, resulting in a total of 7496 sentences and a vocabulary size of 3651 words. The corpus was further divided into two sets, training and testing, as shown in Table 1. All the speech was sampled at 16 kHz with 16 bits, and was further parameterized into 12-dimensional MFCCs, energy, and their delta and acceleration coefficients (39-dimensional front-end parameter vectors).

2.2 Acoustic Modeling

Figure 4. Initial-Final units for Vietnamese

As depicted in Fig 4, we follow the usual approach for Chinese acoustic modeling [6], in which each syllable is decomposed into an initial and a final part. While most Vietnamese syllables consist of both an Initial and a Final, some have only the Final. The Initial part always corresponds to a consonant. The Final part includes a main sound, or monophthong, plus tone and an optional ending sound. This decomposition results in a total of 47 phones, as shown in Figure 4. With this decomposition scheme, a context-dependent model can be built straightforwardly.

Figure 5. Construction of triphones

Figure 5 illustrates the process of making triphones from a syllable. We use the tree-based state tying technique, in which a set of questions was designed based on Vietnamese linguistic knowledge.
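The Initial-Final split can be sketched as a longest-prefix match of the syllable against the inventory of initial consonants. The Python sketch below is illustrative only: the initial list is a subset written without diacritics, not the paper's full 47-phone inventory:

```python
# Hypothetical rule-based split of a (toneless, ASCII-ized) Vietnamese
# syllable into Initial and Final parts. The initial inventory below is
# an illustrative subset, sorted so the longest candidates are tried first
# (e.g. "ngh" before "ng" before "n").
INITIALS = sorted(
    ["b", "c", "ch", "d", "g", "gh", "gi", "h", "k", "kh", "l", "m",
     "n", "ng", "ngh", "nh", "ph", "qu", "r", "s", "t", "th", "tr",
     "v", "x"],
    key=len, reverse=True)

def split_syllable(syllable: str):
    """Return (initial, final); initial is '' for Final-only syllables."""
    for ini in INITIALS:
        if syllable.startswith(ini):
            return ini, syllable[len(ini):]
    return "", syllable  # syllable with no Initial part

print(split_syllable("xuyen"))   # ('x', 'uyen')
print(split_syllable("nghiem"))  # ('ngh', 'iem')
print(split_syllable("anh"))     # ('', 'anh')
```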
A decision tree is built using a top-down sequential optimization procedure starting from the root node of the tree. Initially, all data with the same central unit are pooled together at the root node, which is also the only leaf node. Each leaf node is then split according to the question that yields the maximum likelihood increase on the training data. The process is repeated until the likelihood increase is smaller than a predefined threshold. Figure 6 shows the split process of the decision tree for the main sound.
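The greedy tree-growing loop above can be sketched as follows. For brevity, the HMM state occupation statistics are replaced by 1-D samples scored under a single-Gaussian ML fit, and the question set is a dictionary of hypothetical boolean predicates over phonetic tags; a real system would use the multivariate sufficient statistics of the tied states:

```python
import math

def gauss_loglik(xs):
    """Log-likelihood of 1-D samples under their own ML Gaussian fit."""
    n = len(xs)
    if n < 2:
        return 0.0
    mean = sum(xs) / n
    var = max(sum((x - mean) ** 2 for x in xs) / n, 1e-6)  # floor variance
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def grow_tree(items, questions, threshold):
    """items: list of (tags, value); questions: {name: predicate(tags)}.

    Greedily pick the question with the largest likelihood gain; stop
    splitting when the best gain falls below the threshold.
    """
    values = [v for _, v in items]
    best = None
    for name, ask in questions.items():
        yes = [v for t, v in items if ask(t)]
        no = [v for t, v in items if not ask(t)]
        if not yes or not no:
            continue  # question does not actually split this node
        gain = gauss_loglik(yes) + gauss_loglik(no) - gauss_loglik(values)
        if best is None or gain > best[0]:
            best = (gain, name, ask)
    if best is None or best[0] < threshold:
        return ("leaf", values)  # likelihood increase too small: stop
    _, name, ask = best
    yes_items = [it for it in items if ask(it[0])]
    no_items = [it for it in items if not ask(it[0])]
    return (name,
            grow_tree(yes_items, questions, threshold),
            grow_tree(no_items, questions, threshold))
```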
Table 2. Evaluation of the bigram and trigram language models
2.4 Recognition
In this stage, we use a Vietnamese continuous speech recognizer whose acoustic model and language model were described in the previous sections. The recognizer uses the Hidden Markov Model (HMM) technique for acoustic modeling [8]. Gaussian mixtures are employed as the state-conditioned output probability distributions. Each basic unit is modeled by a three-state HMM, excluding the non-emitting initial and final states. By increasing the number of Gaussian mixture components, we obtained significant improvements in word accuracy, as shown in Figure 7. In the best case, the word accuracy rate reached 85.39%.
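The state-conditioned mixture output probability can be sketched as below, using 1-D features for brevity (the actual system scores 39-dimensional front-end vectors) and the usual log-sum-exp trick for numerical stability:

```python
import math

def log_gmm(x, weights, means, variances):
    """Log p(x) under a 1-D diagonal Gaussian mixture.

    One such density is attached to each of the three emitting HMM
    states of a basic unit; weights must sum to 1.
    """
    logs = [
        math.log(w) - 0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        for w, m, v in zip(weights, means, variances)
    ]
    # Log-sum-exp: factor out the largest term to avoid underflow.
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs))

# Two-component mixture evaluated at x = 0.5:
print(log_gmm(0.5, [0.6, 0.4], [0.0, 1.0], [1.0, 1.0]))
```

Adding mixture components refines each state's density, which is why word accuracy improves with the number of Gaussians until the training data is spread too thin.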
3. The New Acoustic Modeling

Figure 6. The new acoustic modeling

With the new acoustic model, the number of HMMs for each syllable is almost the same. Specifically, the two syllables in the previous example, xuyên and xin, are now each modeled by 3 HMMs. Table 3 shows some other examples of using this model to represent syllables.

Table 3. Examples of representing syllables with the new model

Syllable (S)   Initial (I)   Main (V)   Tone (T)   Final (F)
tiếng          t             iê         <´>        ng
toà            t             oa         <`>        -
nghiễm         ngh           iê         <~>        m
yên            -             yê         <none>     n
Figure 7. WACs of the two acoustic models

With the new acoustic modeling, the accuracy rate reached 86.06% in the best case, a 0.67% absolute improvement over the old model. Moreover, the run-time of the recognition system using the new acoustic model was also reduced. Table 4 shows the run-time of the system with the two acoustic models. Indeed, the recognition system using the new acoustic model runs faster than the one using the old acoustic model. It is also important to note that with the new model, the recognition system executes faster than real time. This is because the number of HMMs used to model syllables is smaller in the new model than in the old one.

Table 4. Run-time comparison of the two models

Model              Real-time factor
Old AC modeling    1.05
New AC modeling    0.90
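The real-time factor in Table 4 is simply the decoding time divided by the duration of the audio. The timings below are hypothetical numbers chosen only to reproduce the two factors reported above:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF < 1.0 means the recognizer runs faster than real time."""
    return processing_seconds / audio_seconds

# Hypothetical timings for one minute of audio, matching Table 4:
print(real_time_factor(63.0, 60.0))  # 1.05 -> old model, slower than real time
print(real_time_factor(54.0, 60.0))  # 0.9  -> new model, faster than real time
```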
4. Conclusion
In this paper, we presented a systematic performance comparison among various levels of acoustic modeling for continuous Vietnamese speech recognition. The main outcome of this work is the confidence it provides for choosing the right basic units for acoustic modeling in a Vietnamese LVCSR system. Experimental results showed that, by exploring the use of diphthongs and triphthongs in acoustic modeling, the performance of the recognition system was improved in both word accuracy and execution time.
References
[1] Quan Vu et al., "A Robust Method for the Vietnamese Handwritten and Speech Recognition", ICPR 2002, Canada, pp. 732-735, 2002.
[2] Ha Nguyen, Quan Vu, "Selection of Basic Units for Vietnamese Large Vocabulary Continuous Speech Recognition", The 4th IEEE RIVF 2006, Ho Chi Minh City, Vietnam, Feb 2006.
[3] Ha Nguyen, Quan Vu, "Preliminary Experiments on Acoustic Modeling for Vietnamese Large Vocabulary Continuous Speech Recognition", SPECOM 2005, Patras, Greece, pp. 527-530, 2005.
[4] Khoa Trinh, Quan Vu et al., "An Empirical Study of Multi-pass Decoding for Vietnamese LVCSR", SLTU 2008, Hanoi, Vietnam, 2008.
[5] Quan Vu et al., "Vietnamese Automatic Speech Recognition: the FLaVoR Approach", in Proc. International Symposium on Chinese Spoken Language Processing, Singapore, December 2006.
[6] J. J. Wu et al., "Modeling Context-dependent Phonetic Units in a Continuous Speech Recognition System for Mandarin Chinese", ICSLP 96, Philadelphia, USA, 1996.
[7] Sinaporn Suebvisai et al., "Thai Automatic Speech Recognition", ICASSP 2005, Philadelphia, PA, USA, pp. 857-860, 2005.
[8] S. Young et al., The HTK Book (for HTK Version 3.2), Cambridge University, 2004.
[9] Andreas Stolcke, "SRILM - An Extensible Language Modeling Toolkit", in Proc. Intl. Conf. Spoken Language Processing, Denver, Colorado, pp. 901-904, 2002.