
International Journal of Wisdom Based Computing, Vol. 1 (2), August 2011


Scope for performance enhancement of CMU Sphinx by parallelising with OpenCL

Harish S C, Balaji D, Vignesh M, Deepan Kumar P, Adinarayanan V
Department of Information Technology Amrita Vishwa Vidhyapeetham, Coimbatore

Abstract: An Automatic Speech Recognition (ASR) system that utilises the many-core architecture of the Graphics Processing Unit (GPU) enables a myriad of emerging applications, such as mobile speech recognition, multimedia content transcription, and voice-based language translation. This article discusses the feasibility of, and the challenges in, enhancing the performance of CMU Sphinx-3.08 by parallelising its data-parallel parts with OpenCL so that they can utilise the many-core GPU architecture. The evolution of high-performance GPUs opens a new path to improving the efficiency of ASR by increasing the vocabulary size. We examine the trade-offs in implementing the data-parallel parts of Sphinx-3.08 using OpenCL on GPUs.

Keywords: ASR, GPU, OpenCL, CMU Sphinx

I. INTRODUCTION

In this age of the many-core revolution, we have reached a stage where there is a paradigm shift: instead of increasing the clock speed to improve software performance, the software itself must be redesigned to utilise multicore and many-core platforms efficiently. The multicore revolution has brought about a situation in which the development of ever-faster serial microprocessors has halted. With the advent of the fusion processor, for applications to work faster, software must utilise parallel hardware, i.e., single-chip multiprocessors or many-core GPUs.

Automatic Speech Recognition (ASR) systems are widely used in applications ranging from mobile phones to space missiles; for instance, ASR can help an airplane pilot control the flight with voice commands. Yet the quality of this recognition is still far below human ability. Added to that, many time-critical applications are unable to use ASR because of the heavy latency of processing speech with a large vocabulary. The architecture of an ASR system has data-parallel parts that can utilise the many-core GPU architecture [1]. If these data-parallel parts are tuned for the many-core GPU architecture, ASR would run efficiently even for a large vocabulary size, breaking the barrier posed by this latency.

The steps in parallelising a typical program so that it utilises a many-core platform by executing as a GPU fragment program, and the challenges faced in applying those steps to Sphinx-3.08, are the main focus of this paper. The trade-offs in parallelising for many-core GPU platforms are also discussed.

II. SPHINX 3.08

Fig 1: Overall architecture of CMU Sphinx-3.08

CMU Sphinx 3.08 includes the following mode executables in the package: align, allphone, dag, livepretend, livedecode, astar, decode, and decode_anytopo [2]. They are of four types: batch mode decoders, live mode simulators, live mode demonstrations, and search programs with a specific purpose. The functions of these executables are:
1) decode: a batch mode decoder which works with cepstral files. It uses a tree lexicon with optimisations on GMM computation and cross-word triphone traversal, and can also decode using a finite state grammar and different implementations of search. It is known as s3 fast.
2) decode_anytopo: a batch mode decoder which works with cepstral files. It uses a flat lexicon with optimisations on trigrams and cross-word triphones. It is known as s3 slow.
3) livepretend: a live mode simulator which uses the engine of the decode executable for decoding.



4) livedecode: a live mode demonstration. It also uses the engine of the decode executable for decoding, and allows users to recognise speech in a push-button manner.
5) align: computes the state-, phone-, and word-level alignment given the transcription of an utterance.
6) allphone: recognises the phonemes of an utterance with full triphone expansion.
7) dag: performs the second-stage rescoring of the lattice, searching for the best path within the lattice under a given trigram language model.
8) astar: generates an N-best list from a given lattice; it is useful for N-best rescoring.

III. INFERENCE ENGINE OPTIMIZATION

There are four steps in optimising the HMM-based inference engine of an ASR [3]: finding unique labels, parallelising the observation probability computation, next-state likelihood computation, and next-state pruning.

Finding unique labels: In a CPU-based inference engine, unique labels are found by sorting followed by compaction. In a GPU-based inference engine, a lookup table over a statically compiled recognition network is used instead: each thread running on the many-core device looks at one potential next state and sets the flag for the label of that state. This reduces a complex sorting problem to a parallel hashing problem.

Parallelising observation probability computation: The probability computations for distinct unique labels are independent of each other, so they are parallel in nature. In a GPU-based inference engine, the computation of one GMM is assigned to one thread block on the GPU. The threads within the thread block must be synchronised, because the observation probability is the sum of the weighted distances from each Gaussian in the given mixture.

Next-state likelihood computation: There are two scenarios for calculating the next state: a transition within a word, and a transition between two words. A within-word transition involves very few next states and is determined by the vocabulary, whereas in a word-to-word transition the final state of one word is connected to the first state of the next word. For word-to-word transitions we use the probability from the bigram model; in its absence, we calculate it from the unigram probability of the first state and the back-off constant of the end state, as in [1].

IV. THE OPENCL PROGRAMMING MODEL

The OpenCL standard provides a generic API for executing programs on systems with different types of computational devices, such as GPUs, multicore CPUs, and other accelerators [4]. In OpenCL, the program is executed on a computational device. Each device has one or more processor cores, and these cores are made of one or more single-instruction multiple-data (SIMD) processing elements.

A. GPU-based Programming Model

GPUs, by their inherently data-parallel nature, are most suitable for hosting applications that are computationally intense and have a high data-transfer demand. In order for a system to tap the potential of GPUs, the host application must abide by a specific set of programming-model constraints posed by the GPU's architecture. The following is a description of this GPU programming model from the standpoint of the graphics API and from that of the stream programming model.

B. The Stream Programming Model

The stream programming model provides a way of looking at applications from a perspective that instantly reveals the system's data-parallelisation capability. In stream programming, the application is partitioned into data segments, or streams, that flow between the arithmetically intense kernels that operate on these streams. This kernel/stream decomposition offers insight into the system's parallelisability [5]. In general-purpose GPU computing, the fragment processor offers higher arithmetic performance than the vertex processor, since fragments hold a larger share than vertices in a typical graphics scenario. A GPU-based program is generally structured [6] as follows:

Segmentation: The first step is to analyse and spot the data-parallel sections of the application. These sections must be programmatically independent of each other, and are treated as the kernels of the GPU model; the kernels are implemented on the fragment processor. The input and output of these segments are stored as texture streams, which the kernels act upon to perform the desired operation.

Defining output: The kernel's output is specified in the form of a quadrilateral positioned parallel to the image plane, covering a rectangular array of pixels. The GPU's rasterizer generates a fragment for every pixel in the quad.

Kernel munching: The fragments that are created are then individually processed by the active kernel fragment program. All fragments are computed by the same fragment program, which reads from arbitrary memory locations and writes to a frame-buffer location specific to the fragment.

Extracting output: The fragment computation results in a single value or a vector. The value may be the end result, or intermediate data (stored as textures) needed for forthcoming computations in multi-pass systems. Single-pass systems limit the maximal complexity that a system can bear, but multi-pass systems allow arbitrary complexity.





V. APPROACHES IN PARALLELIZING SPHINX

Automatic Speech Recognition (ASR) has consistently exploited advancements in computational capability, and over the decades there have been many experiments in parallelising speech recognition. We discuss the challenges in two key approaches to parallelising the inference engine.

Fig 2: Approaches in parallelizing Sphinx

The approaches are as follows:

1) Implementing the computationally intensive parts on the GPU: Sphinx has a Hidden Markov Model (HMM) based inference engine, in which the outer iteration processes a single input vector at a time [3]. [7] demonstrates an approximately 5x speedup when the computation-intensive phases (i.e., the observation probability computations) are implemented on the GPU and the communication-intensive phases (i.e., the Viterbi search) are mapped onto the host processor. However, this software architecture incurs a significant penalty for copying the intermediate results between the GPU and the host processor [8], so ways must be devised to minimise the copying between the two. [8] achieved a 17.7x speedup for the compute-intensive phases and a 4.4x speedup for the communication-intensive phases. Likewise, in Sphinx-3.08, implementing the compute-intensive phases on the GPU using OpenCL while keeping the communication-intensive phases on the host processor would lead to significant overhead for copying the intermediate results between them.

2) Implementing all parts of the inference engine on the GPU: Implementing the entire Sphinx inference engine on the GPU would eliminate the extra copying overhead. However, the inference engine uses the following C99 standard headers: stdio.h, string.h, assert.h, stdlib.h, ctype.h, limits.h, and time.h, which the OpenCL 1.1 standard [9] does not support. All functions from those headers that the inference engine relies on must be rewritten according to the OpenCL 1.1 standard in order to implement the inference engine in OpenCL; rewriting them would be the appropriate solution to this problem.

VI. CONCLUSION

Being a technology with undeniable consumer utility and untapped commercial potential, Automatic Speech Recognition has always tried to capitalise on the benefits of powerful upcoming computational platforms. With the emergence of parallel multicore and many-core processors, we see immense opportunities for processing speech with a large vocabulary set in real-time applications, leading to intuitive human-computer interfaces, natural computer applications, and smarter mobile devices. We have presented our directions on the possibilities of harvesting the potential of contemporary Graphics Processing Units to give rise to more accurate, agile, and efficient automatic speech recognition systems, together with our on-going research work, focusing especially on the opportunities and challenges of parallelising the Sphinx-3.08 ASR framework. Our recommendations on future research in ASR may serve as a guide for further enhancements to Sphinx-3.08 for efficient utilisation of Graphics Processing Units in speech recognition.

REFERENCES
[1] J. Chong, Y. Yi, A. Faria, N. R. Satish, and K. Keutzer, "Data-Parallel Large Vocabulary Continuous Speech Recognition on Graphics Processors", Proceedings of the 1st Annual Workshop on Emerging Applications and Many Core Architecture (EAMA), June 2008.
[2] A. Chan, E. Gouvêa, R. Singh, M. Ravishankar, R. Rosenfeld, Y. Sun, D. Huggins-Daines, and M. Seltzer, "The Hieroglyphs: Building Speech Applications Using CMU Sphinx and Related Resources".
[3] J. Chong, K. You, Y. Yi, E. Gonina, C. Hughes, W. Sung, and K. Keutzer, "Scalable HMM Based Inference Engine in Large Vocabulary Continuous Speech Recognition", IEEE International Conference on Multimedia and Expo (ICME 2009), June 2009.
[4] J. Stone, D. Gohara, and G. Shi, "OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems", Computing in Science & Engineering, May-June 2010.
[5] J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A. Lefohn, and T. J. Purcell, "A Survey of General-Purpose Computation on Graphics Hardware", EUROGRAPHICS 2005 State of the Art Report, 2005.
[6] M. Harris, "Mapping Computational Concepts to GPUs", in GPU Gems 2, M. Pharr, Ed. Addison-Wesley, Mar. 2005, ch. 31, pp. 493-508.
[7] P. R. Dixon, T. Oonishi, and S. Furui, "Fast Acoustic Computations Using Graphics Processors", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2009), April 2009.
[8] J. Chong, G. Friedland, A. Janin, N. Morgan, and C. Oei, "Opportunities and Challenges of Parallelizing Speech Recognition", HotPar'10: Proceedings of the 2nd USENIX Conference on Hot Topics in Parallelism, June 2010.
[9] A. Munshi, Ed., "The OpenCL Specification, Version 1.1", Khronos OpenCL Working Group, November 2010, ch. 6, p. 192.