
2009 International Conference on Asian Language Processing

Advances in Acoustic Modeling for Vietnamese LVCSR


Tuan Nguyen, Quan Vu
University of Science, VNU-HCM, Vietnam
vhquan@fit.hcmus.edu.vn

Abstract
In this paper, we present our experiments on the selection of basic phonetic units for Vietnamese large vocabulary continuous speech recognition (LVCSR). Two acoustic models were compared. The first model uses only monophthongs as vowel phonemes [2], while the second, proposed in this paper, also uses diphthongs and triphthongs as phonemes. The two models were trained and evaluated on a Broadcast News corpus containing 27 hours of acoustic training data and 1 hour of acoustic testing data. In addition, a 146M-word collection of newspaper text was employed for building the language models. Experimental results indicate significant improvements in both word accuracy and execution time. With the second acoustic model, the word accuracy rate reaches 86.06% in the best case and recognition runs faster than real time.

1. Introduction
Spoken Vietnamese has developed over thousands of years and is closely related to the Austroasiatic languages. Today, with a population of more than 80 million people, Vietnamese is one of the main languages of Southeast Asia. Vietnamese has around seven thousand word units [2]. Many of them carry a meaning on their own, while others have to be combined to form a meaningful word. In the second case, a meaningful word usually consists of two or three word units, which is similar to Chinese.

However, the Vietnamese writing system is completely different from those of the other countries in Southeast Asia. In the 17th century, Western missionaries came to Vietnam to preach Christianity and used the Latin alphabet to transcribe Vietnamese. Step by step, this kind of writing replaced the traditional writing systems, which were based on Chinese characters and ancient Vietnamese characters, and became the official Vietnamese writing system of today.

Figure 1. Vietnamese alphabet and its pronunciation

Figure 2. Diacritics used for representing 6 tones

Fig 1 and Fig 2 show all the characters and diacritics used in the modern Vietnamese writing system. From the phonetic point of view, Vietnamese is a monosyllabic, tonal language. That means every word unit consists of a single syllable, and the meaning of each word unit depends on the tone (basically a specific pitch and glottalization pattern) with which it is pronounced. There are six distinct tones. The first one is not marked (denoted by none), and the other five are indicated by diacritics applied to the main vowel of the syllable, as shown in Fig 3.

Figure 3. The representation of tones in vowels


978-0-7695-3904-1/09 $26.00 2009 IEEE DOI 10.1109/IALP.2009.66

Tone poses the main problem for the recognition of tonal Asian languages, including Chinese [6], Thai [7] and Vietnamese, because it spreads over the whole syllable, which makes it difficult to specify the basic units correctly. Most current modeling strategies for Vietnamese are based on the decomposition of the syllable. Among them, decomposition into initial and final parts, with some modifications, is the dominant approach. There are two ways to deal with tone: modeling it separately, or together with the basic units.

In previous papers, Quan et al. reported robust approaches for the recognition of Vietnamese characters, including both handwritten and printed images, and for isolated word speech recognition [1], as well as preliminary results on acoustic modeling for continuous speech recognition [2-5]. In this paper, we present our recent improvements in acoustic modeling for Vietnamese LVCSR. Specifically, diphthongs and triphthongs are also used as units in the acoustic model, instead of only monophthongs as in the previous approach. With the new model, the number of basic units increased compared to the former one, but the average number of HMMs per word unit was reduced significantly. Experimental results on the Vietnamese Broadcast News Corpus (VNBN) show that the new model reduces both the word error rate and the recognition time.

The paper is organized as follows. Section 2 briefly describes the baseline system, including the training and test data, the acoustic modeling and the language modeling. The improvements in acoustic modeling are presented in detail in Section 3. Finally, Section 4 concludes with some notes and remarks.

2. The baseline system

In this section, we describe the Vietnamese Broadcast News transcription system: the Vietnamese Broadcast News Corpus (VNBN), the acoustic modeling, the language modeling and the baseline results. Details of the system can be found in [2].

2.1 The VNBN corpus

The whole VNBN corpus was collected from February to June 2007 from VOH, the Ho Chi Minh City broadcaster, and consisted of a total of 27 hours. The recordings were manually transcribed and segmented into sentences, which resulted in a total of 7496 sentences and a vocabulary size of 3651 words. The corpus was further divided into two sets, training and testing, as shown in Table 1. All the speech was sampled at 16 kHz with 16 bits per sample. It was then parameterized into 12-dimensional MFCCs plus energy, together with their delta and acceleration coefficients (39 front-end parameters in total).

Dialect   Training                      Testing
          Length (hours)   #Sentence    Length (hours)   #Sentence
Saigon    27               7267         1                202
Table 1. Data for training and testing

2.2 Acoustic Model

Figure 4. Initial-Final units for Vietnamese

As depicted in Fig 4, we follow the usual approach for Chinese acoustic modeling [6], in which each syllable is decomposed into an Initial and a Final part. While most Vietnamese syllables consist of an Initial and a Final, some of them have only the Final. The Initial part always corresponds to a consonant. The Final part consists of a main sound (a monophthong) plus tone and an optional ending sound. This decomposition results in a total of 47 phones, as shown in Figure 4. With this decomposition scheme, context-dependent models can be built straightforwardly. Figure 5 illustrates the process of making triphones from a syllable.

Figure 5. Construction of triphones

We use the tree-based state tying technique, for which a set of questions was designed based on Vietnamese linguistic knowledge.

Figure 6. Decision tree-based state tying


A decision tree is built using a top-down sequential optimization procedure starting from the root node of the tree. Initially, all data with the same central unit are pooled together at the root node, which is also the only leaf node. Each leaf node is then split according to the question that yields the maximum likelihood increase on the training data. The process is repeated until the likelihood increase is smaller than a predefined threshold. Figure 6 shows the splitting process of the decision tree for a main sound.
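The splitting procedure described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the system's actual implementation: each training sample is reduced to one scalar value, every pool is scored with a single maximum-likelihood Gaussian, and the question names, the variance floor, the threshold and the toy data are all hypothetical.

```python
import math

VAR_FLOOR = 1e-3  # assumed variance floor so that log() stays finite


def pool_loglik(values):
    """Log-likelihood of scalar samples under their own ML Gaussian."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    var = max(sum((v - mean) ** 2 for v in values) / n, VAR_FLOOR)
    return -0.5 * n * (1.0 + math.log(2.0 * math.pi * var))


def best_split(samples, questions):
    """Return (question name, likelihood gain) of the best split, or (None, 0.0)."""
    parent = pool_loglik([v for _, v in samples])
    best = (None, 0.0)
    for name, members in questions.items():
        yes = [v for c, v in samples if c in members]
        no = [v for c, v in samples if c not in members]
        if not yes or not no:
            continue  # a question that does not divide the pool is useless
        gain = pool_loglik(yes) + pool_loglik(no) - parent
        if gain > best[1]:
            best = (name, gain)
    return best


def grow(samples, questions, threshold):
    """Split top-down until the best likelihood gain falls below the threshold."""
    name, gain = best_split(samples, questions)
    if name is None or gain < threshold:
        return {"leaf": sorted({c for c, _ in samples})}
    members = questions[name]
    yes = [s for s in samples if s[0] in members]
    no = [s for s in samples if s[0] not in members]
    return {"question": name, "gain": gain,
            "yes": grow(yes, questions, threshold),
            "no": grow(no, questions, threshold)}


# Toy data: (left context, scalar feature) for one central unit.
samples = [("t", 0.0), ("t", 0.2), ("x", 0.1), ("x", -0.1),
           ("m", 5.0), ("m", 5.2), ("n", 4.9), ("n", 5.1)]
# Hypothetical questions about the left context.
questions = {"L_Nasal": {"m", "n"}, "L_is_t": {"t"}}

tree = grow(samples, questions, threshold=1.0)
print(tree)
```

In this toy run the nasal question separates the two clusters at the root, and the remaining candidate split gains too little to pass the threshold, so the contexts under each leaf would share tied states.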

2.3 Language modeling


Both bigram and trigram language models (LMs) were developed, mainly by exploiting newspaper text sources available on the Internet. In particular, a 146M-word collection of newspaper text was employed. Numeric expressions occurring in the text were replaced by the corresponding words. A 5K-word lexicon was selected from all words occurring in the corpus. In order to evaluate the language models, 135 test utterances containing 8253 words were randomly selected from the VOH data. Table 2 shows the perplexity of the bigram and trigram LMs on the test set. The trigram LM is clearly better than the bigram LM: its perplexity is 135.5, compared to 212.6 for the bigram.

          utterance   var    perplexity
Bigram    135         36.4   212.6
Trigram   135         39.9   135.5
Table 2. Evaluation of language model

2.4 Recognition
In this stage, we use a Vietnamese continuous speech recognizer whose acoustic model and language model were described in the previous sections. The recognizer uses the Hidden Markov Model (HMM) technique for acoustic modeling [8]. Gaussian mixtures are employed as the state-conditioned output probability distributions. Each basic unit was modeled by a three-state HMM, not counting the initial and final states, which have no output distributions. By increasing the number of Gaussian mixture components, we obtained significant improvements in word accuracy, as shown in Figure 5. In the best case, the word accuracy rate reached 85.39%.

Figure 5. Word accuracy rates vs. number of Gaussian mixtures

3. Improvements in Acoustic Modeling

3.1 A new acoustic modeling
The acoustic modeling described in the previous section has two advantages. Firstly, the number of basic units or phonemes is relatively small (44 phonemes). Secondly, as we treat tone as a distinct phoneme that immediately follows the main sound, the context-dependent model for tone can be built straightforwardly. This means that the recognition of tones is fully integrated into the system in a single recognition pass. However, there is a problem with this approach. Since Vietnamese is a monosyllabic, tonal language, syllable durations in the Broadcast News data do not differ significantly. With the previous acoustic modeling, however, the number of HMMs used to model each syllable differs significantly. For example, three syllables, namely , xuyên, xin, have similar durations, but the numbers of HMMs for modeling them are 1, 5 and 3 respectively. This imbalance affects the performance of the recognition system. To overcome this problem, we propose a new model which uses both diphthongs and triphthongs as basic units, instead of only monophthongs as in the previous model. Fig. 6 shows the new acoustic model. The new model has a total of 77 phonemes, of which 29 phonemes represent diphthongs, 8 phonemes triphthongs, 10 phonemes monophthongs, 23 phonemes consonants and 5 phonemes tones.

Figure 6. The new acoustic modeling

With the new acoustic model, the number of HMMs per syllable is nearly uniform. Specifically, the two syllables xuyên and xin from the previous example are now each modeled by 3 HMMs. Table 3 shows some further examples of syllables represented with this model.

Syllable (S)   Initial (I)   Main (V)   Tone (T)   Final (F)
ting           t             i          <>         ng
ta             t             oa         <`>
nghim          ngh           i          <~>        m
yn                           y          <>         n
Table 3. Examples of syllable representation in the new acoustic model
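The effect of the larger unit inventory on the number of HMMs per syllable can be illustrated with a small Python sketch. It assumes that the two example syllables are the unmarked-tone syllables xuyên and xin, that one HMM is used per basic unit, and that a syllable is split by greedy longest match; the two tiny inventories below are hypothetical subsets chosen for the example, not the full unit sets of either model.

```python
import unicodedata


def nfc(text):
    """Normalize to composed form so that 'ê' is a single character."""
    return unicodedata.normalize("NFC", text)


# Hypothetical subsets of the unit inventories (assumption for illustration).
OLD_UNITS = {nfc(u) for u in ["x", "n", "u", "y", "ê", "i"]}  # monophthongs only
NEW_UNITS = OLD_UNITS | {nfc("uyê")}                          # plus one triphthong


def segment(syllable, inventory):
    """Split a syllable into units by greedy longest match."""
    syllable = nfc(syllable)
    longest = max(len(u) for u in inventory)
    units, pos = [], 0
    while pos < len(syllable):
        for size in range(min(longest, len(syllable) - pos), 0, -1):
            piece = syllable[pos:pos + size]
            if piece in inventory:
                units.append(piece)
                pos += size
                break
        else:
            raise ValueError(f"no unit matches {syllable[pos:]!r}")
    return units


for syllable in ["xuyên", "xin"]:
    old = segment(syllable, OLD_UNITS)
    new = segment(syllable, NEW_UNITS)
    print(syllable, len(old), old, len(new), new)
```

With the monophthong-only inventory the two syllables need 5 and 3 units, whereas with the triphthong available both need 3, which matches the counts quoted in the text.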

3.2 Experimental results

For a fair comparison, the two acoustic models were trained and evaluated in the same system, with the same training and testing data as used for the baseline. Fig 7 illustrates the word accuracy rates obtained with the two acoustic models, namely the old AC modeling and the new AC modeling. The x-axis is the number of Gaussian mixture components used by the recognizer. The y-axis is the word accuracy rate.

Figure 7. WACs of the two acoustic modeling

With the new acoustic modeling, the accuracy rate reached 86.06% in the best case, an absolute improvement of 0.67% over the old model. Moreover, the run-time of the recognition system was also reduced with the new AC model. Table 4 shows the run-time of the system with the two acoustic models. The recognition system that uses the new AC model runs faster than the one that uses the old AC model. It is also important to note that, with the new model, the recognition system runs faster than real time. This is because the number of HMMs for modeling syllables in the new model is smaller than in the old model.

                   Real-time factor
Old AC Modeling    1.05
New AC Modeling    0.90
Table 4. Run-time comparison of the two models
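The figures reported for the two models can be cross-checked with a short Python calculation. The sketch assumes that the word error rate is 100 minus the word accuracy rate and that the real-time factor is decoding time divided by audio duration; the variable and function names are chosen for this example only.

```python
# Best-case word accuracy rates (%) and real-time factors from the text.
OLD_WAC, NEW_WAC = 85.39, 86.06
OLD_RTF, NEW_RTF = 1.05, 0.90


def decoding_seconds(rtf, audio_seconds):
    """Decoding time implied by a real-time factor (assumed: time / duration)."""
    return rtf * audio_seconds


abs_gain = NEW_WAC - OLD_WAC                  # absolute accuracy gain in points
old_wer = 100.0 - OLD_WAC                     # assumed WER = 100 - accuracy
new_wer = 100.0 - NEW_WAC
rel_wer_reduction = (old_wer - new_wer) / old_wer
rel_time_reduction = (OLD_RTF - NEW_RTF) / OLD_RTF

test_audio = 3600.0                           # the test set is 1 hour long
print(f"absolute accuracy gain: {abs_gain:.2f} points")
print(f"relative WER reduction: {100 * rel_wer_reduction:.1f}%")
print(f"relative run-time reduction: {100 * rel_time_reduction:.1f}%")
print(f"old model decoding time: {decoding_seconds(OLD_RTF, test_audio):.0f} s")
print(f"new model decoding time: {decoding_seconds(NEW_RTF, test_audio):.0f} s")
```

A real-time factor below 1 means that decoding the one-hour test set takes less than one hour, which is the case only for the new model.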

4. Conclusion
In this paper, we presented a systematic performance comparison of different levels of acoustic modeling for continuous Vietnamese speech recognition. The main outcome of this research is greater confidence in choosing the right basic units for acoustic modeling in a Vietnamese LVCSR system. Experimental results showed that using diphthongs and triphthongs in acoustic modeling improves the performance of the recognition system in both word accuracy and execution time.


5. References
[1] Quan Vu et al., "A Robust Method for the Vietnamese Handwritten and Speech Recognition", ICPR 2002, pp. 732-735, Canada, 2002.
[2] Ha Nguyen, Quan Vu, "Selection of Basic Units for Vietnamese Large Vocabulary Continuous Speech Recognition", The 4th IEEE RIVF 2006, Ho Chi Minh City, Vietnam, Feb 2006.
[3] Ha Nguyen, Quan Vu, "Preliminary Experiments on Acoustic Modeling for Vietnamese Large Vocabulary Continuous Speech Recognition", SPECOM 2005, Patras, Greece, pp. 527-530.
[4] Khoa Trinh, Quan Vu et al., "An Empirical Study of Multipass Decoding for Vietnamese LVCSR", SLTU08, Hanoi, Vietnam, 2008.
[5] Quan Vu et al., "Vietnamese Automatic Speech Recognition: the FLaVoR Approach", in Proc. International Symposium on Chinese Spoken Language Processing, Singapore, December 2006.
[6] J.J. Wu et al., "Modeling Context-dependent Phonetic Units in a Continuous Speech Recognition System for Mandarin Chinese", ICSLP 96, Philadelphia, USA, 1996.
[7] Sinaporn Suebvisai et al., "Thai Automatic Speech Recognition", ICASSP 2005, Philadelphia, PA, USA, pp. 857-860, 2005.
[8] S. Young et al., The HTK Book (for HTK Version 3.2), Cambridge University, 2004.
[9] Andreas Stolcke, "SRILM - An Extensible Language Modeling Toolkit", in Proc. Intl. Conf. Spoken Language Processing, Denver, Colorado, pp. 901-904, 2004.

