
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 15, NO. 7, JULY 2007

801

SIMD Processor-Based Turbo Decoder Supporting Multiple Third-Generation Wireless Standards


Myoung-Cheol Shin, Member, IEEE, and In-Cheol Park, Senior Member, IEEE
Abstract: A programmable turbo decoder is designed to support multiple third-generation wireless communication standards. We propose a hybrid architecture of hardware and software, which has small size, low power, and high performance like hardware implementations, as well as the flexibility and programmability of software. It mainly consists of a configurable hardware soft-input soft-output (SISO) decoder and a 16-b single-instruction multiple-data processor, which is equipped with five processing elements and special instructions customized for interleaving in order to provide interleaved data at the speed of the hardware SISO. A fast and flexible software implementation of the block interleaving algorithm is also proposed. The interleaver generation is split into two parts, preprocessing and on-the-fly generation, to reduce the timing overhead of changing the interleaver structure. We present detailed descriptions of the interleaving implementation applied to the W-CDMA and cdma2000 standard turbo codes. The decoder occupies 8.90 mm² in a 0.25-μm CMOS with five metal layers and exhibits the maximum decoding rate of 5.48 Mb/s.

Index Terms: cdma2000, parallel algorithm, turbo code, turbo interleaver, W-CDMA.

I. INTRODUCTION

As turbo codes [1], or parallel concatenated convolutional codes, have extremely impressive performance, they have entered the field of standardized systems in recent years. One of the most important examples is the channel coding for high-speed data transmission of the third-generation (3G) mobile communication systems such as W-CDMA [2] and cdma2000 [3]. Flexible and programmable turbo decoders are required for 3G communications for two reasons: 1) global roaming is recommended between different 3G standards and 2) even within a standard, the turbo coding frame size may change on a frame-by-frame basis.

The turbo decoder consists of interleavers and soft-input-soft-output (SISO) decoders that decode recursive systematic convolutional (RSC) codes. Flexible and programmable implementation is especially needed for the turbo interleaver, as each 3G standard has a distinct and complicated interleaver. The simplest approach to implement an interleaver is to store the interleaved patterns in a ROM. This approach is not adequate for a turbo decoder supporting multiple wireless standards, as it needs a large ROM to store all of the possible interleaved patterns. Though some implementations based on digital signal processors (DSPs) have programmability that may support various standards, their performance is far below the maximum bit rate of up-to-date wireless systems [4], [5]. In [6], Bekooij et al. proposed a flexible turbo decoder as a form of very long instruction word (VLIW) microprocessor. However, dedicated hardware blocks, which are not programmable, are employed to implement its interleaver and SISO decoder.

In this paper, we propose a multiple-standard turbo decoder implemented with a combination of a dedicated hardware part, which processes the computationally intensive but regular tasks such as SISO decoding, and a software part running on a programmable single-instruction-multiple-data (SIMD) processor for the tasks that require flexibility. The turbo interleaving, which differs largely depending on the standards, is also implemented in software. In this way, we can achieve the small area, high performance, and low power consumption of hardware, as well as the flexibility and programmability of software needed to support multiple standards.

In addition, we address a software interleaving implementation that can change the interleaver structure in a very short period of time and requires only a small amount of memory. The interleaver construction is split into two parts, preprocessing and on-the-fly generation, in order to hide the timing overhead of interleaver changing effectively. Since the proposed SIMD processor is equipped with instructions specialized for turbo interleaving algorithms, the processor can provide interleaved data at the speed of the hardware SISO decoder.

The remainder of this paper is organized as follows. Section II presents the fundamentals of turbo coding and iterative decoding, leading to a discussion on the turbo codes adopted by 3G wireless systems in Section III.
In Section IV, the proposed turbo decoder is described in detail. Section V explains the proposed interleaver implementation, and Section VI describes its application to standards including W-CDMA and cdma2000. In Section VII, experimental results on the implemented chip are presented. Finally, we make concluding remarks in Section VIII.

II. ITERATIVE DECODING OF TURBO CODES

Here, we summarize the most essential part of the iterative turbo decoding and the notations to be used in the following sections. A turbo encoder consists of two binary RSC encoders separated by an $N$-bit interleaver, together with an optional puncturing mechanism [1]. A typical example is shown in Fig. 1. The function of the interleaver is to take each incoming frame of data bits and rearrange them in a pseudorandom fashion prior to

Manuscript received November 8, 2002; revised March 29, 2007. This work was supported in part by the Institute of Information Technology Assessment through the ITRC and IC Design Education Center (IDEC). M.-C. Shin was with the Korea Advanced Institute of Science and Technology, Daejeon 305-701, Korea. He is now with Samsung Electronics Company, Ltd., Giheung, 449-711, Korea (e-mail: mch.shin@samsung.com). I.-C. Park is with the Department of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon 305-701, Korea (e-mail: icpark@ee.kaist.ac.kr). Digital Object Identifier 10.1109/TVLSI.2007.899237

1063-8210/$25.00 © 2007 IEEE


the second encoding. Unlike the other interleavers mainly used to spread burst errors, it is important that the turbo interleaver should sort the bits as randomly as possible so that no apparent order, or uniformity, can be found. Any uniformity may degrade the error-correcting performance of turbo decoding. The puncturing mechanism periodically deletes some code bits to increase the coding rate.

Fig. 1. Turbo encoder for cdma2000.

As shown in Fig. 1, the encoder processes the input sequence $u = (u_1, \ldots, u_N)$ of $N$ bits to produce the systematic sequence $c^s$ and parity sequences $c^p$, where $c^s = u$. All of these sequences are binary bits, but the modulated sequences $x^s$ and $x^p$ are generally assumed to have the value of $\pm 1$ following the equations:

$$x_k^s = 2c_k^s - 1, \qquad x_k^p = 2c_k^p - 1 \tag{1}$$

where $k = 1, \ldots, N$.

The general structure of the iterative turbo decoder is shown in Fig. 2, where a superscript marks an interleaved sequence [1]. Two component decoders are linked by interleavers in a feedback structure. Component decoders are usually called SISO decoders as they deal with the probabilities of the inputs instead of hard decisions. For an encoder input bit $u_k$, the soft input and output of the SISO decoder are typically represented in terms of the so-called log likelihood ratio (LLR) given by

$$L(u_k) = \ln \frac{P(u_k = 1)}{P(u_k = 0)} \tag{2}$$

Each SISO receives the systematic channel bits $y^s$, which are modeled as the sum of modulated bits $x^s$ and channel noise with variance $\sigma^2$, parity bits $y^p$, similarly the sum of $x^p$ and channel noise, and the extrinsic information $L_e$ coming from the other SISO as additional a priori information. As the feedback iterations go on, the bit-error-rate (BER) performance of the decoded bits improves due to the extrinsic information. Since the improvement obtained with additional iterations decreases as the number of iterations increases, four to eight iterations are usually used.

The SISO that is optimal in terms of minimizing the decoded BER should produce the a posteriori probability that each encoder input bit was 1 or 0 given the received symbol sequence $y$ and a priori probability. Bahl et al. originally proposed this maximum a posteriori probability (MAP) algorithm [1], [7]. As it requires a huge dynamic range of intermediate values and a large number of expensive operations such as multiplications and divisions, its simplified versions, Log-MAP or Max-Log-MAP algorithms [8], are usually used in hardware implementation. If we use capital Greek letters to indicate the logarithm of the variables, for example, $A_k(s) = \ln \alpha_k(s)$ and $\Gamma_k(s', s) = \ln \gamma_k(s', s)$, where $s$ is a current trellis state and $s'$ is a previous state, the algorithms can be represented with the following equations. The branch metric of each transition is calculated as

$$\Gamma_k(s', s) = \frac{1}{2} x_k^s \left( L_a(u_k) + L_c y_k^s \right) + \frac{1}{2} L_c x_k^p y_k^p \tag{3}$$

where $L_a(u_k)$ is the logarithm of a priori probability, which is equal to the extrinsic LLR output $L_e(u_k)$ obtained from the previous SISO, and $L_c$ is the channel reliability measure given as $L_c = 2/\sigma^2$. The forward path metric is computed recursively as

$$A_k(s) = \max^*_{s'} \left( A_{k-1}(s') + \Gamma_k(s', s) \right) \tag{4}$$

In the Max-Log-MAP algorithm, the $\max^*$ operation is approximated as the real maximum operation, which causes some BER performance degradation. However, the Log-MAP algorithm calculates the $\max^*$ operation exactly by adding a correction term as follows:

$$\max^*(a, b) = \ln\left(e^a + e^b\right) = \max(a, b) + \ln\left(1 + e^{-|a - b|}\right) \tag{5}$$

The correction function can be implemented with a small lookup table (LUT). Similarly, the backward metric is calculated as

$$B_{k-1}(s') = \max^*_{s} \left( B_k(s) + \Gamma_k(s', s) \right) \tag{6}$$

The logarithm of a posteriori probability for each input bit is then obtained by

$$L(u_k \mid y) = \max^*_{(s', s):\, u_k = 1} \left( A_{k-1}(s') + \Gamma_k(s', s) + B_k(s) \right) - \max^*_{(s', s):\, u_k = 0} \left( A_{k-1}(s') + \Gamma_k(s', s) + B_k(s) \right) \tag{7}$$

and the extrinsic information to be fed to the next SISO by

$$L_e(u_k) = L(u_k \mid y) - L_a(u_k) - L_c y_k^s \tag{8}$$
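As a minimal sketch of the correction-term idea behind (5), the following Python functions compare the exact Log-MAP operation with the Max-Log-MAP approximation and with a table-based correction. The function names, the eight-entry table, and the 0.25 step are assumptions chosen for illustration; they are not taken from the implemented decoder.

```python
import math

def max_star_exact(a: float, b: float) -> float:
    """Log-MAP: ln(e^a + e^b), written as a maximum plus a correction term."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

def max_star_maxlog(a: float, b: float) -> float:
    """Max-Log-MAP: the correction term is dropped."""
    return max(a, b)

# Hypothetical correction table: 8 entries indexed by |a - b| in steps of 0.25.
LUT_STEP = 0.25
CORRECTION_LUT = [math.log1p(math.exp(-i * LUT_STEP)) for i in range(8)]

def max_star_lut(a: float, b: float) -> float:
    """Log-MAP with the correction term read from the small table above."""
    index = int(abs(a - b) / LUT_STEP)
    correction = CORRECTION_LUT[index] if index < len(CORRECTION_LUT) else 0.0
    return max(a, b) + correction
```

The table version falls back to the plain maximum once the two operands are far apart, where the correction term is close to zero anyway.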

SHIN AND PARK: SIMD PROCESSOR-BASED TURBO DECODER SUPPORTING MULTIPLE THIRD-GENERATION WIRELESS STANDARDS

803

Fig. 2. Iterative turbo decoder structure.

TABLE I DIFFERENCES BETWEEN THE cdma2000 AND W-CDMA TURBO CODES

III. TURBO CODES IN 3G STANDARDS

We discuss here the similarities and differences of the turbo codes of W-CDMA [2] and cdma2000 [3], which are shown in Table I, and why a multistandard turbo decoder requires a programmable interleaver. Fig. 1 shows the turbo encoder architecture of cdma2000. Rate-1/2, -1/3, -1/4, and -1/5 turbo codes are realized with appropriate puncturing patterns. For W-CDMA, the code rate is fixed to 1/3 and the encoder is obtained simply by removing the path to the second parity output of each constituent encoder from Fig. 1. Code rates other than 1/3 can be implemented by applying the rate-matching process [2]. The constraint length of the RSC code for W-CDMA and cdma2000 is four, and the number of states is eight. The RSC decoder of W-CDMA is the same as that of cdma2000 when the rate is 1/3.

Since the turbo codes work in a frame-wise fashion, the trellises of the two constituent encoders need to be terminated at the end of a frame. For both cdma2000 and W-CDMA systems, turbo codes are terminated in a similar way. The dotted lines in Fig. 1 are active during the trellis termination. Tail bits come from the shift registers in order to bring the trellises back to the all-zero state. Because the contents of the shift registers are different in each constituent encoder, the tail bit sequences are different and the systematic code bit of the second encoder should be transmitted. The cdma2000 termination sequence varies depending on the rate, and the termination sequence of W-CDMA is the same as that of the rate-1/2 cdma2000 turbo code. We can conclude that a SISO compatible with both standards can be implemented without much difficulty, as the RSC code of W-CDMA is actually a subset of cdma2000.

Though the turbo decoders for cdma2000 and W-CDMA share the constituent encoder structure, they have significant differences in the interleaver implementation, because their interleaver frame size, individual operations, and other parameters used in generating interleaved addresses are quite different. Compared with W-CDMA, the cdma2000 interleaver is much simpler and suitable for hardware implementation. On the other hand, the W-CDMA interleaver is more complicated and not straightforward to implement in hardware. The detailed descriptions of both interleavers are given in Section VI. Due to those differences, only a fully programmable processor is appropriate for a unified interleaver used for multiple standards.

IV. TURBO DECODER ARCHITECTURE

As shown in Fig. 3, the proposed turbo decoder employs a hybrid implementation of hardware and software. By combining a hardware SISO decoder and a programmable processor, the performance of hardware and the flexibility of software can be achieved together. It has the simplest time-multiplex architecture [8] that contains only one SISO, one interleaver, and one extrinsic LLR ($L_e$) memory, which are repeatedly used for both the first and the second SISO decoding processes in one iteration. The data are accessed in a sequential order for the first SISO decoding of an iteration and in an interleaved order for the second decoding. Since the write address is the same as the


Fig. 3. Block diagram of the proposed turbo decoder.

read address: 1) the content in memory is always sequential; 2) the read address can be buffered in order to use it later as the write address; and 3) the decoder contains no deinterleaver but only one interleaver. As shown by the dotted lines in Fig. 3, the SIMD processor calculates the interleaved addresses of various standards, and the address queue whose length is equal to the SISO latency saves the read addresses in order to use them again as the write addresses. During the first SISO decoding process of an iteration, which does not need an interleaver, the processor controls the hardware blocks, interfaces with the external host, and processes the trellis termination and a stopping criterion. The processor stops the decoding iteration to reduce power consumption based on the cyclic redundancy check (CRC) or a simple stopping criterion presented in [10], which indicates whether or not further iterations may improve BER.

A. SIMD Processor

To keep pace with the hardware SISO decoder, parallel processing is indispensable in the design of a programmable interleaver. A simple SIMD architecture depicted in Fig. 4 is chosen, as it is suitable for the simple and repetitive interleaved address generation and has simpler control and lower power consumption than VLIW or superscalar architectures. The number of processing elements (PEs) of the SIMD processor is set to five because the number of rows in the W-CDMA interleaver, which is the unit of interleaved address generation, is 5, 10, or 20. That of cdma2000 varies between 18 and 24 if we discard the rows that always produce invalid addresses. The processor employs a Harvard architecture, i.e., its instruction memory and data memory are separated, and has a separate I/O port. Thus the processor can fetch an instruction, load data, and send an interleaved address through the I/O port at the same time.
The bit widths of instructions and data are all 16, and the processor has four pipeline stages: IF (instruction fetch), ID (instruction decode), EX (execution), and MEM (memory operation). The write back is carried out at the end of the MEM stage. The branch delay is one cycle and the processor always executes the instruction in a branch delay slot.
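The single-memory addressing described at the start of this section, where each read address is queued and reused as the write address, can be illustrated with a toy Python model. This is only a sketch under simplifying assumptions: the SISO is modeled as a per-element function with a fixed latency, and the names `second_half_iteration`, `siso`, and `siso_latency` are hypothetical.

```python
from collections import deque

def second_half_iteration(llr_memory, interleaved_addresses, siso_latency, siso):
    """Read in interleaved order and write each result back where it was read."""
    address_queue = deque()   # read addresses waiting to be reused as write addresses
    in_flight = deque()       # results still inside the modeled SISO pipeline
    for address in interleaved_addresses:
        address_queue.append(address)
        in_flight.append(siso(llr_memory[address]))
        if len(in_flight) > siso_latency:
            # The oldest result leaves the SISO; its write address is the
            # oldest buffered read address.
            llr_memory[address_queue.popleft()] = in_flight.popleft()
    while in_flight:          # drain the remaining results at the end of the frame
        llr_memory[address_queue.popleft()] = in_flight.popleft()
    return llr_memory

memory = [10, 20, 30, 40, 50]
pattern = [3, 0, 4, 1, 2]     # hypothetical interleaved read order
updated = second_half_iteration(memory, pattern, 2, lambda value: value + 1)
```

Because every value returns to the location it was read from, the memory stays in sequential order and no separate deinterleaver is needed.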

The first PE, PE0, plays the role of controlling the other PEs and functions as a complete scalar processor. The scalar instructions running only on PE0 include control instructions such as branch and call and multicycle operations such as multiply, divide, and remainder. Basic arithmetic/logic instructions and a few customized instructions for interleaving, which are described in detail in Section V, can be performed by all of the PEs. All of the common register files of the five PEs form a five-element vector register file of 16 entries, as shown in Fig. 4. PE0 has an additional 16-entry scalar register file to store nonvector data or special control data.

The five PEs share a single data memory port and a single I/O port in order to save memory access power consumption and support a simple I/O interface. For this reason, a SIMD instruction is not executed in all PEs at the same time, but executed in one PE after another so that a data memory port and an I/O port can be shared in a time-multiplexed fashion. An example of this situation is shown in Fig. 5. The rectangles filled with light gray represent the data memory instructions and the dark gray the I/O instructions. The figure shows that those instructions share a port without any conflict.

B. SISO Decoder

As shown in Fig. 6, the architecture of the proposed SISO decoder is similar to the memory architecture presented in [11], where the window size is $L$. Input data are read into a memory and used three times for calculating the forward metrics $A$, the backward metrics $B$, and the extrinsic LLRs $L_e$. The SISO decoder contains one additional memory for temporarily storing the computed metrics and an additional section of add-compare-select-add (ACSA) units to calculate dummy backward metrics for the sliding window algorithm [11].

To support multiple standards, the SISO decoder employs configurable ACSA units shown in Fig. 7. The ACSA can support the rate of 1/2 to 1/5 turbo codes with arbitrary transfer functions by configuring the input multiplexers in Fig. 7.
For branch metric $\Gamma$ calculation, inputs are added in a carry save adder (CSA) to reduce the timing delay of multiple additions. Generally, the Log-MAP algorithm outperforms the Max-Log-MAP algorithm [8]. However, Worm et al. demonstrated in [12] that the Max-Log-MAP is more tolerant to the channel estimation error than the Log-MAP algorithm. The ACSA unit is therefore designed to select one of the two algorithms with the last multiplexer in Fig. 7. If we can obtain reliable channel estimation such as received signal strength indication (RSSI), we can choose the Log-MAP algorithm for better performance. On the other hand, if nothing is known about the channel, we should use the Max-Log-MAP to avoid errors caused by the misestimation of the channel.

The bit widths of data and internal operations are determined based on the approach presented in [13]. Let us use the notation of the fixed-point representation as $(n, f)$, where $n$ and $f$ represent the total bit width and the bit width of the fractional part, respectively. Based on the result of a performance simulation, we selected (5, 2) for inputs and (7, 2) for $L_e$ at the cost of slight performance degradation instead of (6, 3) and (8, 3), respectively, which produce almost ideal results. By choosing the smaller widths, the power and area of the buffer memory were saved by 15.6%, and, in addition, the size and speed of the LUT for Log-MAP were improved dramatically. Given the bit width of inputs, we can derive the required bit width of internal calculations, which is ten in this case [13].

Fig. 4. Architecture of the SIMD processor.

Fig. 5. Execution timing of the processing elements.

Fig. 7. ACSA unit used to calculate a forward metric $A_k(s)$.

V. TURBO INTERLEAVER IMPLEMENTATION

Here, we describe common features of 3G standard turbo interleavers and present the proposed interleaved address generation algorithm and the specialized instructions to implement these multiple turbo interleavers with the proposed SIMD processor.

A. Prunable Interleavers

Standard turbo codes commonly employ block interleavers. Although the complicated operations and parameters used in interleaving are quite different, W-CDMA and cdma2000 share the general concept of prunable block interleavers presented in

Fig. 6. Overall architecture of the SISO decoder.


Fig. 8. Prunable interleaver with $N$ = 18. (a) Incoming data indexes. (b) Intrarow permutation. (c) Interrow permutation.

[14]. They can implement interleavers of arbitrary size by first building the mother interleaver of a predefined size and then pruning unnecessary indexes. Data are written in a two-dimensional (2-D) mother interleaver matrix row by row and permuted within each row. After the rows are also permuted with one another, the interleaved data are read out column by column.

An example of a simple prunable interleaver with $N = 18$ is shown in Fig. 8, where the data indexes are written in a matrix form. The intrarow permutation rule applied to Fig. 8(b) is

$$y_{i,j} = b_i + (j \cdot q_i) \bmod C \tag{9}$$

where $y_{i,j}$ is the permuted index of the $i$th row and $j$th column, $C$ is the number of columns, $b_i$ is the base, and $q_i$ is the increment. Fig. 8(c) shows the interrow permutation result, which will be read out column by column as a sequence of $N = 18$ indexes, since the elements exceeding the range of interest, 18 and 19, are pruned.

If the interleaver size changes by more than the granularity of the mother interleavers, the entire interleaver structure should be reconstructed. All of the 3G mobile communication standards support variable bit rates that may change the interleaver size on a 10-ms or 20-ms frame basis, and W-CDMA even supports multiple separately coded transport channels. This means the interleaver structure may change every 10 ms. Therefore, an efficient method to change the interleaver structure in a short period of time is essential to the turbo decoding of 3G communications.

B. Preprocessing and On-the-Fly Generation

Generating the whole interleaved address pattern at once is time-consuming and requires a large memory to store the pattern. We propose a solution to this problem by dividing the interleaver construction into two parts: preprocessing for interleaving and incremental on-the-fly address generation. When the bit rate changes, only the preprocessing is performed to prepare a relatively small number of seed parameters and variables selected to make the on-the-fly generation as simple as possible. Whenever the interleaved address sequence is required, the processor generates it column by column using the parameters and variables. This dividing technique reduces the timing overhead when the interleaver structure changes. It also does not require a large memory because it does not save the whole interleaved address pattern, but only the minimal seed data to calculate it.

This scheme exploits the properties of prunable interleavers. Let us explain our approach with the example of Fig. 8. In order to remove the computationally expensive multiplication and modulo operations that take a lot of clock cycles from (9), we use an incremental vector $w$ instead of $q$ and a cumulative vector $x$ as follows:

$$w_i = q_i \bmod C, \qquad x_{i,j} = (j \cdot q_i) \bmod C \tag{10}$$

Then, (9) can be rewritten as

$$y_{i,j} = b_i + x_{i,j} \tag{11}$$

and $x_{i,j}$ can be obtained recursively as

$$x_{i,j} = (x_{i,j-1} + w_i) \bmod C \tag{12}$$

where $x_{i,0} = 0$. Since $0 \le x_{i,j-1} < C$ and $0 \le w_i < C$, we have $0 \le x_{i,j-1} + w_i < 2C$. Then, (12) is equivalent to

$$x_{i,j} = \begin{cases} x_{i,j-1} + w_i - C, & \text{if } x_{i,j-1} + w_i \ge C \\ x_{i,j-1} + w_i, & \text{otherwise} \end{cases} \tag{13}$$

where the multiplication and modulo operations are replaced by cheaper operations, the multiplication by an addition, and the modulo by a subtract if greater or equal (SUBGE) instruction described in Section V-C.

Fig. 9. Proposed implementation applied to Fig. 8. (a) Preprocessing for interleaving. (b) Incremental on-the-fly address generation.

As shown in Fig. 9(a), $b$, $w$, and $x$ for the first column of the block interleaver are calculated and stored in vector registers of the SIMD processor in the preprocessing for interleaving. The figure shows that they are stored in the order of the interrow permutation in advance so as to simplify the on-the-fly generation. In the column-by-column on-the-fly address generation shown in Fig. 9(b), the SIMD processor updates $x$ according to (13) and calculates the addresses based on (11). Calculated addresses are sent to the address queue if they are smaller than $N$.

C. SIMD Instructions for Turbo Interleavers

To speed up the interleaved address generation, we introduced three SIMD processor instructions: STOLT (store to output port if less than), SUBGE (subtract if greater or equal), and LOOP. Each of them substitutes a sequence of three ordinary instructions but takes only one clock cycle to execute.
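The split into preprocessing and on-the-fly generation can be sketched in Python as follows, assuming the intrarow rule has the form "base plus (column index times increment) modulo the number of columns". The base and increment values in the usage lines are hypothetical and are not those of Fig. 8; the helper names are likewise illustrative. The `subge` helper mimics the conditional subtraction, and the `address < size` test mimics the pruning performed by STOLT.

```python
def subge(value, modulus):
    """Emulate SUBGE: subtract the modulus only if value >= modulus."""
    return value - modulus if value >= modulus else value

def preprocess(bases, increments, num_columns):
    """Seed data for column 0: bases, reduced increments w, cumulative values x."""
    w = [q % num_columns for q in increments]
    x = [0] * len(bases)
    return list(bases), w, x

def generate_addresses(bases, increments, num_columns, size):
    """Yield interleaved addresses column by column, pruning those >= size."""
    b, w, x = preprocess(bases, increments, num_columns)
    for _ in range(num_columns):
        for row in range(len(b)):
            address = b[row] + x[row]                      # base plus cumulative value
            if address < size:                             # pruning of out-of-range indexes
                yield address
            x[row] = subge(x[row] + w[row], num_columns)   # add, then conditional subtract

def direct_addresses(bases, increments, num_columns, size):
    """Reference version that evaluates the multiply-and-modulo rule directly."""
    addresses = []
    for j in range(num_columns):
        for b, q in zip(bases, increments):
            address = b + (j * q) % num_columns
            if address < size:
                addresses.append(address)
    return addresses

# Hypothetical 4 x 5 example; rows are listed in an already permuted order.
BASES = [10, 0, 15, 5]
INCREMENTS = [3, 1, 7, 2]
incremental = list(generate_addresses(BASES, INCREMENTS, 5, 18))
reference = direct_addresses(BASES, INCREMENTS, 5, 18)
```

Only the three short vectors need to be kept between frames; the full address pattern is never stored.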


The STOLT instruction sends an address output to the queue only if the generated address is in the range of the interleaver size, which is the same as the pruning mentioned earlier. It is equivalent to a sequence of three instructions: CMP (compare), BLT (branch if less than), and STO (store to output port). Another conditional instruction, SUBGE, which is equivalent to the sequence of CMP, BGE (branch if greater or equal), and SUB (subtract), is quite useful for the block interleavers that commonly use modulo operations. It substitutes a modulo or remainder operation $a \bmod b$ if the condition $0 \le a < 2b$ is satisfied, which corresponds to (13). The LOOP instruction is adopted from DSP processors to reduce the loop overhead. This instruction corresponds to the sequence of CMP, BNE (branch if not equal), and SUB instructions, so that it decrements the loop count and branches at once.

Using the special instructions, we can reduce the lengths of the on-the-fly generation program loop of W-CDMA and cdma2000 to six and five instructions, respectively. This indicates that the five PEs can provide one address per cycle for cdma2000 turbo decoding.

In addition to the three special instructions, there are a few instructions necessary to implement interleavers. The MUL (multiply) and REM (remainder) operations, which take several cycles to complete, are required in the preprocessing of the W-CDMA interleaver. The MASK instruction that disables one or more of the PEs is convenient when the column length of the block interleaver is not a multiple of five.

VI. APPLICATION TO STANDARDS

We have applied the technique described in Section V to W-CDMA [2] and cdma2000 [3] turbo interleavers. We also applied the proposed turbo decoder and turbo interleaver implementation to a few other standard turbo codes. Multistandard turbo decoding can be realized by loading several interleaver programs and control programs into the memory and switching between them whenever needed.

A. W-CDMA

The W-CDMA turbo interleaver is quite similar to the example of Fig. 8 but much more complex. A prime number $p$ and its associated primitive root $v$ are involved in the whole interleaver generation process [2]. First, the number of rows $R$ of the block interleaver matrix, the prime number $p$, and the number of columns $C$ are determined. $R$ is determined as 5, 10, or 20, then $p$ as the minimum prime number such that $N \le R \times (p + 1)$, and $C$ as $p - 1$, $p$, or $p + 1$, depending on the range of the given interleaver frame size $N$. The interrow permutation rule is also determined by $N$ and the table in [2]. The intrarow permutation rule in most cases is

$$U_i(j) = s\left( (j \times r_i) \bmod (p - 1) \right) \tag{14}$$

where

$$s(j) = \left( v \times s(j - 1) \right) \bmod p, \qquad s(0) = 1 \tag{15}$$

and $r_i$ is the interrow-permuted version of the prime integer $q_i$, which is the least prime integer such that $\mathrm{g.c.d}(q_i, p - 1) = 1$, $q_i > 6$, and $q_i > q_{i-1}$ for each $i = 1, \ldots, R - 1$.
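The quantities prepared for the W-CDMA interleaver can be sketched in Python as below. This is an illustrative sketch, not the processor program: the primitive root is found here by brute-force search as an assumption standing in for the table lookup in [2], the first row value is taken as 1 following [2], and all function names are hypothetical. The last two functions check that the intrarow index of (14) can be produced with an addition and a conditional subtraction instead of a multiplication and a modulo.

```python
from math import gcd

def is_prime(n):
    """Trial-division primality test, adequate for the small values used here."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def least_primitive_root(p):
    """Brute-force search (assumption: stands in for the table in the standard)."""
    for v in range(2, p):
        value, order = v, 1
        while value != 1:
            value = (value * v) % p
            order += 1
        if order == p - 1:
            return v
    raise ValueError("no primitive root found")

def base_sequence(p, v):
    """Recursion s(j) = (v * s(j-1)) mod p with s(0) = 1, for j = 0 .. p-2."""
    s = [1]
    for _ in range(1, p - 1):
        s.append((v * s[-1]) % p)
    return s

def row_primes(num_rows, p):
    """q_0 = 1, then increasing primes above 6 that are coprime with p - 1."""
    q = [1]
    candidate = 7
    while len(q) < num_rows:
        if is_prime(candidate) and gcd(candidate, p - 1) == 1:
            q.append(candidate)
        candidate += 1
    return q

def intrarow_direct(s, r, p):
    """Evaluate s((j * r) mod (p - 1)) with a multiplication and a modulo."""
    return [s[(j * r) % (p - 1)] for j in range(p - 1)]

def intrarow_incremental(s, r, p):
    """Same sequence using w = r mod (p - 1), an addition, and a conditional subtract."""
    w, x, out = r % (p - 1), 0, []
    for _ in range(p - 1):
        out.append(s[x])
        x += w
        if x >= p - 1:
            x -= p - 1
    return out
```

For example, with the small prime $p = 7$ the search returns $v = 3$ and the base sequence is 1, 3, 2, 6, 4, 5.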

Fig. 10. Pseudocode of the W-CDMA on-the-y address generation when R = 5.

For a more detailed description of the W-CDMA interleaver, readers are referred to [2].

We can divide the complicated process into a preprocessing part and an on-the-fly generation such that no timing overhead exists in changing the frame rate. The computationally expensive modulo operations required in the on-the-fly part are substituted by the SUBGE instructions.

In the preprocessing part, $R$, $p$, and $C$ are found following the method explained above. Then, the permutation function $s(j)$ is obtained using the recursive (15) and saved in the data memory, which is the most time-consuming part of the preprocessing. Finally, for the on-the-fly generation, we save the seed variable vectors of length $R$: address base $b$, cumulative variable $x$, and increment value $w$, where $w_i = r_i \bmod (p - 1)$. We save $w$ instead of $r$ in order to replace the modulo operations with SUBGEs in the on-the-fly generation part. Since W-CDMA allows $R$ to be 5, 10, or 20 and the SIMD processor has five PEs, each vector is stored in four separate vector registers when $R = 20$.

The pseudocode of the on-the-fly address generation for W-CDMA when $R = 5$ is shown in Fig. 10. Each line corresponds to an instruction of the processor code. Line 2 corresponds to SUBGE, line 5 corresponds to STOLT, and line 6 corresponds to LOOP. The SUBGE safely substitutes the modulo operation because the condition is satisfied ($x < p - 1$ and $w < p - 1$ before they are added). If $R = 10$ or 20, lines 1 to 5 are repeated twice or four times to produce an entire column of the interleaver matrix with five PEs. The actual order of processor instructions is changed from the pseudocode to avoid data dependencies that stall the processor pipeline.

B. cdma2000

The cdma2000 turbo interleaver has a dimension of $32 \times 2^n$, where $n$ is an integer between four and ten. The relation between two neighboring entries in a row is $x(j + 1) = (x(j) + c) \bmod 2^n$, where $j$ is the column index, $x(0) = c$, and $c$ is a row-specific value found in an LUT given in [3]. The rows are then shuffled according to a bit-reversal rule. This procedure is equivalent to the bit-level processing illustrated in Fig. 11 [3].
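The row recurrence and the bit-reversal shuffling can be sketched in Python as follows. The increment value used in the usage lines is hypothetical and is not an entry of the LUT in [3]; treating the bit-reversed row index as the upper address bits is an assumption based on the bit-reversal rule mentioned above, and the function names are illustrative.

```python
def bit_reverse(value, width=5):
    """Reverse the lowest `width` bits of value (5 bits cover the 32 rows)."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result

def row_offsets(c, n):
    """Offsets of one row from x(j+1) = (x(j) + c) mod 2^n with x(0) = c."""
    modulus = 1 << n
    w = c % modulus
    x = w
    offsets = []
    for _ in range(modulus):
        offsets.append(x)
        x += w
        if x >= modulus:      # conditional subtraction replaces the modulo
            x -= modulus
    return offsets

def row_base(row, n):
    """Assumed address base of a row: its bit-reversed index placed above n bits."""
    return bit_reverse(row) << n

offsets = row_offsets(5, 4)   # hypothetical odd increment c = 5 with n = 4
```

With an odd increment, each row visits every offset in the range exactly once, so only the increment and the current offset need to be kept per row.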
As one can expect from the figure, the turbo interleaver of the cdma2000 standard is easy to implement in hardware.

The preprocessing of the cdma2000 is quite simple. Like the W-CDMA turbo interleaver, we save the address base $b_i$ for each row $i$, address offset $x_i$, and increment $w_i$ obtained by the LUT into the vector registers of the SIMD processor. If $b_i \ge N$, then $b_i$, $x_i$, and $w_i$ are all discarded and their spaces are overwritten with those of the following rows. Because the address generated in the $i$th row is always larger than or equal to $b_i$, such rows will always produce an invalid address if we do not discard them. After discarding those rows, the number of remaining rows becomes 18 to 24 out of 32 for any of the 12 predefined cdma2000 interleavers. The pseudocode of the on-the-fly generation part is shown in Fig. 12 and is almost identical to that of the W-CDMA.

Fig. 11. Turbo interleaver generation procedure of cdma2000.

Fig. 12. Pseudocode of the cdma2000 on-the-fly address generation.

C. Other Standards

The Forward Link Only (FLO) mobile multicasting standard adopted the turbo code of cdma2000 with a fixed frame size [15]. The FLO turbo decoder can be implemented in the same way as that of cdma2000.

The turbo interleaver used in the Consultative Committee for Space Data Systems (CCSDS) standard [16] also belongs to the category of prunable interleavers. The proposed SIMD processor can implement the interleaver, and only four instructions are enough to form the on-the-fly generation loop. However, the constraint length of the CCSDS turbo code is five, which requires a SISO decoder that is twice as large as that of the other standards.

The turbo codes adopted by the Digital Video Broadcasting standard of return channel via satellite (DVB-RCS) [17] and IEEE Standard 802.16 of broadband wireless access (BWA) systems [18] are double binary circular recursive systematic convolutional (CRSC) codes. The double binary encoders split input data into two sequences and introduce nonuniformity by simply shuffling them with each other instead of applying the complex nonuniform interleaving to a single data sequence.

TABLE II SUMMARY OF THE CHIP IMPLEMENTATION

We can implement a turbo decoder for DVB-RCS and IEEE Standard 802.16 using the proposed architecture shown in Fig. 3. The control of the data flow from the memory to the SISO decoder will become a little more complex. However, the software interleavers can be very easily implemented by utilizing the SUBGE instruction and four-way parallel processing of the proposed SIMD processor.

All of the 3G standards adopted convolutional codes as the channel coding as well as turbo codes. Since both the turbo decoders and the Viterbi decoders are based on similar trellis decoding, most parts of the decoders can be shared and reused [19]. The proposed turbo decoder in Fig. 3 can also work as a Viterbi decoder by rearranging the data flow, by repeatedly using only one group of ACSA units in Fig. 6, and by utilizing the SIMD processor for traceback.

VII. EXPERIMENTAL RESULTS

We implemented an entire turbo decoder that supports both W-CDMA and cdma2000 1x RTT [3] turbo codes in a 0.25-μm CMOS technology [20]. The summary of the chip characteristics and the layout are given in Table II and Fig. 13, respectively. The maximum operating frequency of 135 MHz at 2.5 V and energy/bit/iteration of 6.89 nJ were obtained by measuring 40 working samples on a test machine. The estimated maximum data rate of 5.48 Mb/s indicates that the decoder can easily cover 3G standards whose maximum rate is 2 Mb/s.

Table III compares the complexity of the proposed turbo decoder and that of [19], where a turbo decoder is implemented for the W-CDMA standard. As can be seen in Table III, our implementation is smaller than that in [19]. Although the work in [19] contains more circuits such as the first interleaver of the W-CDMA receiver and the bit precision of its computation is larger than ours, we can safely claim that the proposed multistandard turbo decoder based on a programmable processor is comparable in size to average hardware turbo decoder implementations.
We measured the BER performance of the proposed turbo decoder with a simulation program written in C, assuming an additive white Gaussian noise (AWGN) channel. We used the W-CDMA turbo code, the Log-MAP decoding algorithm, and the CRC stopping criterion. The performances of the proposed decoder and an ideal decoder are shown in Fig. 14. The ideal turbo decoder is implemented with double-precision floating-point arithmetic and the original Log-MAP algorithm, which exploits neither the sliding window technique nor the stopping criterion. We obtained the BER curves on the right when the interleaver frame size is 1024 and the maximum number of iterations is six, and the others when the frame size is 5114 and the maximum number of iterations is ten. Fig. 14 shows that the performance degradation of our implementation is within 0.05 dB compared with the ideal case, which is mainly due to finite-precision effects such as saturation and quantization error.

As shown in Table IV, the performance of the proposed interleaving algorithm is analyzed in terms of the cycle counts for four critical interleaver sizes of the W-CDMA turbo decoding. For brevity, the simpler and faster cdma2000 interleaver case is omitted. The last column of Table IV shows that the on-the-fly address generation is almost as fast as one address per clock cycle for large interleaver sizes, which imply high bit rates. In addition, the preprocessing time is shorter than the SISO decoding time, which completely hides the preprocessing delay, as the preprocessing completes during the first SISO decoding. The second SISO decoding with the interleaved addresses can start as soon as the first SISO decoding finishes, since the SIMD processor is ready for the on-the-fly generation of the new interleaver structure. Since a small interleaver size implies a low bit rate in 3G systems, the relatively larger overheads for small-sized interleavers shown in the first and second rows of Table IV actually have a negligible effect on the system speed.

Fig. 13. Micrograph of the chip.

TABLE III AREA COMPARISON OF TURBO DECODER IMPLEMENTATIONS

Fig. 14. BER performance comparison with the ideal turbo decoder.

TABLE IV CYCLE COUNTS FOR THE MOST CRITICAL INTERLEAVER SIZES OF W-CDMA

VIII. CONCLUSION

We have presented a turbo decoder designed for multiple 3G wireless communication standards. It contains a configurable hardware SISO decoder and a 16-b SIMD processor with five PEs and specialized instructions to perform incremental block interleaving. The performance and power efficiency of the hardware and the flexibility of the software are thus achieved together.
In addition, a fast incremental software implementation of turbo interleavers has been proposed, which is suitable for real-time structure changes such as VBR in 3G wireless communications. To hide the timing overhead of changing the interleaver, the interleaver generation is split into two parts, preprocessing and incremental on-the-fly generation. The proposed decoder, implemented in a 0.25-μm CMOS technology, achieves 5.48 Mb/s, which is sufficient for 3G communication standards. It can decode both W-CDMA and cdma2000 bit streams by changing the software running on the SIMD processor. The size of the multistandard turbo decoder is comparable to that of average W-CDMA hardware turbo decoders. The proposed software implementation of the turbo interleaver running on the SIMD processor is sufficiently fast that the timing overhead of changing the interleaver is hidden in most cases, and the rate of interleaved address generation for the 3G standards is almost one interleaved address per cycle.
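The idea of incremental generation summarized above can be illustrated with a minimal sketch. It assumes that the SUBGE instruction means "subtract the modulus if the value is greater than or equal to it", which is an interpretation rather than a definition taken from this section, and it models only a single scalar lane instead of the SIMD processing elements. The function names are hypothetical.

```python
def subge(value, modulus):
    """Assumed SUBGE semantics: subtract modulus when value >= modulus."""
    return value - modulus if value >= modulus else value


def incremental_mod_sequence(step, modulus, count):
    """Return (j * step) mod modulus for j = 0..count-1.

    The per-element work is one addition and one conditional subtraction,
    so no multiplication or division is needed inside the loop.
    """
    increment = step % modulus  # reduced once, as a preprocessing step
    accumulator = 0
    sequence = []
    for _ in range(count):
        sequence.append(accumulator)
        # accumulator and increment are both below modulus, so one
        # conditional subtraction is enough to stay in range.
        accumulator = subge(accumulator + increment, modulus)
    return sequence


addresses = incremental_mod_sequence(step=7, modulus=10, count=6)
```

Splitting the work this way mirrors the division between preprocessing (reducing the increment once) and on-the-fly generation (one cheap update per address).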


REFERENCES

[1] C. Berrou, A. Glavieux, and P. Thitimajshima, "Near Shannon limit error-correcting coding and decoding: Turbo-codes," in Proc. Int. Conf. Commun., 1993, pp. 1064–1070.
[2] 3rd Generation Partnership Project; Technical Specification Group Radio Access Network; Multiplexing and Channel Coding (FDD), 3GPP TS 25.212 v4.6.0, Sep. 2002.
[3] Physical Layer Standard for cdma2000 Spread Spectrum Systems, 3GPP2 C.S0002-B, ver. 1.00, Apr. 2002.
[4] J. P. Woodard, "Implementation of high rate turbo decoders for third generation mobile communication," Proc. IEE Colloq. Turbo Codes Digit. Broadcasting, pp. 12/1–12/6, 1999.
[5] U. Walther and G. P. Fettweis, "DSP implementation issues for UMTS-channel coding," in Proc. ICASSP, 2000, pp. 3219–3222.
[6] M. Bekooij, J. Dielissen, F. Harmsze, S. Sawitzki, J. Huisken, A. van der Werf, and J. van Meerbergen, "Power-efficient application-specific VLIW processor for turbo decoding," in IEEE Int. Solid State Circuits Conf. Dig. Tech. Papers, 2001, pp. 180–181.
[7] L. R. Bahl, J. Cocke, F. Jelinek, and J. Raviv, "Optimal decoding of linear codes for minimizing symbol error rate," IEEE Trans. Inf. Theory, vol. IT-20, no. 2, pp. 284–287, Mar. 1974.
[8] P. Robertson, E. Villebrun, and P. Hoeher, "A comparison of optimal and sub-optimal MAP decoding algorithms operating in the log domain," in Proc. Int. Conf. Commun., Seattle, WA, 1995, pp. 1009–1013.
[9] J. Vogt, K. Koora, A. Finger, and G. Fettweis, "Comparison of different turbo decoder realization for IMT-2000," in Proc. IEEE GLOBECOM, 1999, pp. 2704–2708.
[10] R. Y. Shao, S. Lin, and M. P. C. Fossorier, "Two simple stopping criteria for turbo decoding," IEEE Trans. Commun., vol. 47, no. 8, pp. 1117–1120, Aug. 1999.
[11] G. Masera, G. Piccinini, M. R. Roch, and M. Zamboni, "VLSI architectures for turbo codes," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 7, no. 3, pp. 369–379, Sep. 1999.
[12] A. Worm, "Turbo-decoding without SNR estimation," IEEE Commun. Lett., vol. 4, no. 6, pp. 193–195, Jun. 2000.
[13] G. Montorsi and S. Benedetto, "Design of fixed-point iterative decoders for concatenated codes with interleavers," IEEE J. Sel. Areas Commun., vol. 19, no. 5, pp. 871–882, May 2001.
[14] M. Eroz and A. R. Hammons Jr., "On the design of prunable interleavers for turbo codes," in Proc. Veh. Technol. Conf., Houston, TX, 1999, pp. 1669–1673.
[15] Forward Link Only Air Interface Specification for Terrestrial Mobile Multimedia Multicast, TIA-1099, rev. 6, Aug. 2006.
[16] Recommendation for Space Data System Standards: Telemetry Channel Coding, CCSDS 101.0-B-6, Oct. 2002, Blue Book.
[17] Digital Video Broadcasting (DVB): Interaction Channel for Satellite Distribution Systems, EN 301 790, v1.4.1, Sep. 2005.
[18] IEEE Standard for Local and Metropolitan Area Networks, Part 16: Air Interface for Fixed Broadband Wireless Access Systems, IEEE Standard 802.16-2004, Oct. 2004.
[19] M. Bickerstaff, D. Garrett, T. Prokop, C. Thomas, B. Widdup, G. Zhou, C. Nicol, and R.-H. Yan, "A unified turbo/Viterbi channel decoder for 3GPP mobile wireless in 0.18 μm CMOS," in IEEE Int. Solid State Circuits Conf. Dig. Tech. Papers, 2002, pp. 124–125.
[20] M.-C. Shin and I.-C. Park, "A programmable turbo decoder for multiple 3G wireless standards," in IEEE Int. Solid State Circuits Conf. Dig. Tech. Papers, 2003, pp. 154–155.

Myoung-Cheol Shin (S'97–M'07) received the B.S., M.S., and Ph.D. degrees in electrical engineering and computer science from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, in 1996, 1998, and 2004, respectively. In 2003, he joined the System LSI Division, Samsung Electronics Company, Ltd., Giheung, Korea, as a Senior Engineer working on VLSI architectures and implementations of 3G wireless communication receivers and mobile digital broadcasting receivers. He holds one patent in this area.

In-Cheol Park (S'88–M'92–SM'02) received the B.S. degree in electronic engineering from Seoul National University, Seoul, Korea, in 1986, and the M.S. and Ph.D. degrees in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, in 1988 and 1992, respectively. Since June 1996, he has been an Assistant Professor and is now a Professor with the School of Electrical Engineering and Computer Science, KAIST. Prior to joining KAIST, he was with the IBM T. J. Watson Research Center, Yorktown, NY, from May 1995 to May 1996, where he performed research on high-speed circuit design. His current research interests include computer-aided design algorithms for high-level synthesis and VLSI architectures for general-purpose microprocessors. Prof. Park was the recipient of the Best Paper Award at ICCD in 1999 and the Best Design Award at ASP-DAC in 1997.
