

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-I: REGULAR PAPERS, VOL. 51, NO. 3, MARCH 2004

Eliminating the Fanout Bottleneck in Parallel Long BCH Encoders


Keshab K. Parhi, Fellow, IEEE
Abstract: Long BCH codes can achieve about 0.6-dB additional coding gain over Reed-Solomon codes with similar code rate in long-haul optical communication systems. BCH encoders are conventionally implemented by a linear feedback shift register architecture. Encoders of long BCH codes may suffer from the effect of large fanout, which may reduce the achievable clock speed. The data rate requirement of optical applications requires parallel implementations of the BCH encoders. In this paper, a novel scheme based on look-ahead computation and retiming is proposed to eliminate the effect of large fanout in parallel long BCH encoders. For a (2047, 1926) code, compared to the original parallel BCH encoder architecture, the modified architecture can achieve a speedup of 132%.

Index Terms: BCH, cyclic redundancy checking, encoder, fanout, linear feedback shift register (LFSR), look ahead, parallel processing, retiming, unfolding.

I. INTRODUCTION

BCH codes are among the most widely used error-correcting codes. Compared to Reed-Solomon codes, BCH codes can achieve around 0.6 dB additional coding gain over the additive white Gaussian noise (AWGN) channel with similar rate and codeword length. High-rate Reed-Solomon codes of length greater than or equal to 255 have wide applications, such as in long-haul optical communication systems used in International Telecommunication Union-Telecommunication Standardization Sector (ITU-T) G.975, magnetic recording systems, and digital communications. Hence, long BCH codes are of great interest.

The encoders of BCH codes are conventionally implemented by a linear feedback shift register (LFSR) architecture. While such an architecture is simple and can run at very high frequency, it suffers from the serial-in and serial-out limitation. In optical communication systems, where throughput of over 1 Gbps is usually desired, the clock frequency of such LFSR-based encoders cannot keep up with the data transmission rate, and thus parallel processing must be employed.

Meanwhile, long BCH encoders face another problem. Due to the large number of nonzero coefficients in the generator polynomial, the effect of large fanout on the gate delay is no longer negligible. The delay of a single XOR gate grows linearly with the fanout. When the BCH code has a generator polynomial with a large number of nonzero coefficients, there will be some XOR gates with a large fanout in the LFSR architecture, which can slow down the encoder significantly. For example, the delay of one XOR gate with 512 gates as load is 96 times the delay of one XOR gate with only one gate as load, based on HSPICE simulation using a 0.35-μm library.

Various parallel LFSR architectures have been developed in previous works [1]-[3]. However, none of these have addressed the effect of large fanout in the case of long BCH codes. In this paper, a novel scheme based on look-ahead computation is proposed to eliminate the effect of large fanout in the parallel BCH encoders. The unfolding technique [4] is applied to achieve parallel processing. Without loss of generality, only binary BCH codes are considered. Parts of this paper are based on [5].

The structure of this paper is as follows. Section II contains a brief description of the LFSR-based BCH encoder architecture. In Section III, a novel parallel encoder architecture, which eliminates the effect of large fanout, is explained in detail. The performance analysis of an example is described in Section IV. Section V provides conclusions.

Manuscript received July 18, 2003; revised November 11, 2003. This paper was recommended by Associate Editor Y. Wang. The author is with the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455 USA (e-mail: parhi@ece.umn.edu). Digital Object Identifier 10.1109/TCSI.2004.823655

II. BCH ENCODER ARCHITECTURE

An $(n, k)$ binary BCH code encodes a $k$-bit message into an $n$-bit code word. A $k$-bit message $(m_{k-1}, m_{k-2}, \ldots, m_0)$ can be considered as the coefficients of a degree $k-1$ polynomial $m(x) = m_{k-1}x^{k-1} + m_{k-2}x^{k-2} + \cdots + m_0$, where $m_i \in GF(2)$. Meanwhile, the corresponding $n$-bit code word $(c_{n-1}, c_{n-2}, \ldots, c_0)$ can be considered as the coefficients of a degree $n-1$ polynomial $c(x) = c_{n-1}x^{n-1} + c_{n-2}x^{n-2} + \cdots + c_0$, where $c_i \in GF(2)$. The encoding of BCH codes can be simply expressed by

$$c(x) = m(x)\,g(x)$$

where the degree $n-k$ polynomial $g(x) = g_{n-k}x^{n-k} + g_{n-k-1}x^{n-k-1} + \cdots + g_0$ is the generator polynomial of the BCH code. Usually, $g_{n-k} = g_0 = 1$. However, systematic encoding is generally desired, since message bits are just part of the code word. The systematic encoding can be implemented by

$$c(x) = m(x)\,x^{n-k} + \mathrm{Rem}\big(m(x)\,x^{n-k}\big)_{g(x)} \qquad (1)$$

where $\mathrm{Rem}(f(x))_{g(x)}$ denotes the remainder polynomial of dividing $f(x)$ by $g(x)$. The architecture of a systematic BCH encoder is shown in Fig. 1. During the first $k$ clock cycles, the two switches are connected to the "a" port, and the $k$-bit message is input to the LFSR serially with most significant bit (MSB) first.

1057-7122/04$20.00 © 2004 IEEE


Fig. 1. Serial BCH encoder architecture.

Fig. 2. Retimed LFSR.

Fig. 3. (a) LFSR example. (b) 3-unfolded version of LFSR in (a).

Meanwhile, the message bits are also sent to the output to form the systematic part of the code word. After $k$ clock cycles, the switches are moved to the "b" port. At this point, the $n-k$ registers contain the coefficients of $\mathrm{Rem}(m(x)x^{n-k})_{g(x)}$. The remainder bits are then shifted out of the registers to the code word output bit by bit to form the remaining systematic code word bits. For binary BCH, the multipliers in Fig. 1 can be replaced by connection or no connection when $g_i$ is 1 or 0, respectively. The critical path of this architecture consists of two XOR gates, and the output of the right-most XOR gate is input to all the other XOR gates. In the case of long BCH codes, this architecture may suffer from the long delay of the right-most XOR gate caused by the large fanout.

Although the serial architecture of the BCH encoder is quite straightforward, parallel architectures must be employed when it cannot run as fast as the application requires. The fanout bottleneck will also exist in parallel architectures.

III. PARALLEL BCH ENCODER WITH ELIMINATED FANOUT BOTTLENECK

In the serial BCH encoder in Fig. 1, the effect of large fanout can always be eliminated by retiming [7]. To make notations simple, we refer to the input to the right-most XOR gate, which is the delayed output of the second XOR gate from the right, as the horizontal input (H input). In Fig. 1, there is at least one register at the H input of the right-most XOR gate. Meanwhile, registers can be added to the message input. Therefore, as shown in Fig. 2, retiming can always be performed along the dotted cutset by removing one register from each input to the right-most XOR gate and adding one to the output. For the purpose of clarity, switches are removed from the LFSR in Fig. 2 and the other figures in the remainder of the paper. However, if unfolding is applied directly to Fig. 1, retiming cannot be applied in an obvious way to eliminate the large fanout.
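The behavior of the serial encoder of Section II can be checked in software. The following is a minimal sketch, not the circuit itself: polynomials over GF(2) are stored as Python integers (bit i is the coefficient of x^i), and the function names `gf2_mod`, `lfsr_remainder`, and `encode_systematic` are hypothetical names chosen for illustration. The small generator polynomial x^3 + x + 1 is an assumption used only to keep the example readable.

```python
def gf2_mod(dividend: int, divisor: int) -> int:
    """Remainder of GF(2) polynomial division (bit i = coefficient of x^i)."""
    dlen = divisor.bit_length()
    while dividend.bit_length() >= dlen:
        dividend ^= divisor << (dividend.bit_length() - dlen)
    return dividend


def lfsr_remainder(message_bits, g: int) -> int:
    """Clock the message, MSB first, through an LFSR that divides by g(x)."""
    r = g.bit_length() - 1            # number of registers, n - k
    mask = (1 << r) - 1
    state = 0                         # bit i holds the register for x^i
    for bit in message_bits:
        # Right-most XOR gate: message bit XOR output of the last register.
        feedback = bit ^ (state >> (r - 1))
        state = (state << 1) & mask   # shift every register by one position
        if feedback:
            state ^= g & mask         # feed back wherever g_i = 1
    return state


def encode_systematic(message_bits, g: int):
    """Message bits followed by the n - k remainder bits, as in (1)."""
    r = g.bit_length() - 1
    parity = lfsr_remainder(message_bits, g)
    return list(message_bits) + [(parity >> i) & 1 for i in range(r - 1, -1, -1)]


if __name__ == "__main__":
    g = 0b1011                        # assumed toy generator: x^3 + x + 1
    print(encode_systematic([1, 0, 0, 1], g))   # [1, 0, 0, 1, 1, 1, 0]
```

In this sketch the feedback value is the signal that fans out to every tap of the generator polynomial, which is the net whose load grows with the number of nonzero coefficients.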
The original architecture can be expressed by a data flow graph (DFG) as nodes connected by paths with delays. Each XOR gate in the LFSR is a node in the corresponding DFG. In the $J$-unfolded architecture, there are $J$ copies of each node with the same function as in the original architecture [4, Ch. 5]. However, the total number of delay elements does not change. Assuming there is a path from node $U$ to node $V$ in the original architecture with $w$ delay elements, in the $J$-unfolded architecture, node $U_i$ is connected to node $V_{(i+w)\%J}$ with $\lfloor (i+w)/J \rfloor$ delay elements, where $U_i$, $V_j$ ($0 \le i, j < J$) are copies of nodes $U$ and $V$, respectively. Therefore, if the unfolding factor $J$ is greater than $w$, there will be $w$ paths with one delay element and $J-w$ paths without any delay element in the unfolded architecture. For example, Fig. 3(a) shows an LFSR for an example generator polynomial. In this example, there are two registers in the path connecting the output of the left XOR gate and the input of the right XOR gate. In the 3-unfolded architecture illustrated in Fig. 3(b), there are two paths from the output of the copies of the left XOR gate to the input of the copies of the right XOR gate with one delay, and another path without any delay. The unfolded LFSR in Fig. 3(b) cannot be retimed to eliminate the fanout problem for each copy of the right XOR gate.

If the generator polynomial can be expressed as

$$g(x) = x^{t_1} + x^{t_2} + \cdots + x^{t_{l-1}} + 1 \qquad (2)$$

where $t_1, t_2, \ldots, t_{l-1}$ are positive integers with $t_1 > t_2 > \cdots > t_{l-1}$, and $l$ is the total number of nonzero terms of $g(x)$, there are $t_1 - t_2$ consecutive registers at the H input of the right-most XOR gate in Fig. 1. If a $J$-unfolded BCH encoder is desired, $t_1 - t_2 \ge J$ needs to be satisfied to ensure that there is at least one register at the H input of each of the $J$ copies of the right-most XOR gate, so that retiming can be applied to move one register to the output. Meanwhile, $J$ registers need to be added to the message input to enable retiming. In the case of $t_1 - t_2 < J$, the generator polynomial needs to be modified to enable retiming of the right-most XOR gate in


Fig. 4. Block diagram of modified BCH encoding.

the $J$-unfolded architecture. Assuming the original BCH code uses generator polynomial $g(x)$ with degree $n-k$, the message input $m(x)$ multiplied by $x^{n-k}$ can be written as

$$m(x)\,x^{n-k} = q(x)\,g(x) + r(x) \qquad (3)$$

where $q(x)$ and $r(x)$ represent the quotient and remainder polynomials of dividing $m(x)x^{n-k}$ by $g(x)$, respectively. Multiplying both sides of (3) by $p(x)$, we get

$$m(x)\,p(x)\,x^{n-k} = q(x)\,\big(g(x)\,p(x)\big) + r(x)\,p(x) \qquad (4)$$

Let $g'(x) = g(x)p(x)$ be expressed as

$$g'(x) = x^{t'_1} + x^{t'_2} + \cdots + x^{t'_{l'-1}} + 1 \qquad (5)$$

where $t'_1, t'_2, \ldots, t'_{l'-1}$ are positive integers with $t'_1 > t'_2 > \cdots > t'_{l'-1}$, and $l'$ is the total number of nonzero terms of $g'(x)$. Since $\deg(r(x)) < \deg(g(x))$, it can be derived that $\deg(r(x)p(x)) < \deg(g'(x))$, where $\deg(\cdot)$ denotes the degree of a polynomial. Hence, $r(x)p(x)$ can be considered as the remainder of dividing $m(x)p(x)x^{n-k}$ by $g'(x)$. Dividing this remainder by $p(x)$, the quotient is exactly the same as the remainder of dividing $m(x)x^{n-k}$ by $g(x)$. Thus, if we can find $p(x)$ such that the coefficients in $g'(x)$ satisfy $t'_1 - t'_2 \ge J$, the LFSR to implement dividing by $g'(x)$ will have at least $J$ consecutive registers at the H input of the right-most XOR gate. Accordingly, in the $J$-unfolded LFSR, the H input of each copy of the right-most XOR gate will contain at least one register, so that retiming can be applied to eliminate the effect of large fanout.

Such $p(x)$, as well as the modified generator polynomial $g'(x)$, can be found by Algorithm A, written in pseudocode. In Algorithm A, $D$ is set to the desired $t'_1 - t'_2$ in $g'(x)$, and $g'(x)$ has $D-1$ consecutive zero coefficients after the highest power term at the end of the algorithm. It may be noted that the process is similar to clustered look-ahead computation [6]. Algorithm A consists of an initialization, a while loop, and a final step.

Algorithm A is applied to transform the encoder computation of a BCH (255, 233) code in the example below.

Example I: Given a BCH(255, 233) code using generator polynomial $g(x)$, we want to find $p(x)$ such that $g'(x) = g(x)p(x)$ has the desired difference $t'_1 - t'_2$. In this example, $D$ should be set to 8 at the beginning of Algorithm A. The algorithm continues after iterations I and II and stops after iteration III, followed by the final step.

Fig. 5. Step 1 of the modified BCH encoding.

According to (4), the modified method of finding $\mathrm{Rem}(m(x)x^{n-k})_{g(x)}$ in the BCH encoding can be implemented by the steps illustrated in Fig. 4. Each step is explained in the remainder of this section using the $g(x)$, $p(x)$, and $g'(x)$ derived in Example I. The first step in Fig. 4 is to multiply the message input polynomial $m(x)$ by $p(x)$. This can be implemented by adding delayed message inputs according to the coefficients of $p(x)$. For example, using the $p(x)$ derived in Example I, this step can be implemented by the diagram in Fig. 5. The four taps correspond to the four nonzero terms of $p(x)$, as shown in Fig. 5. After $m(x)p(x)$ is computed, it is fed into the second block to compute $\mathrm{Rem}(m(x)p(x)x^{n-k})_{g'(x)}$ by using an LFSR architecture similar to that in Fig. 1. However, since $\deg(g'(x)) > n-k$, the product of $m(x)p(x)$ and $x^{n-k}$ should be added to the output


Fig. 6. Step 2 of the modified BCH encoding.

Fig. 7. Step 3 of the modified BCH encoding.

of the $(n-k)$th register from left, instead of being added to the output of the right-most register. The addition of $m(x)p(x)$ can break the consecutive registers at the H input of the right-most XOR gate in the LFSR. The implementation of the second step using the BCH code in Example I is illustrated in Fig. 6. As can be observed in Fig. 6, there are $t'_1 - t'_2$ consecutive registers at the H input of the right-most XOR gate according to $g'(x)$. However, after adding $m(x)p(x)$ to the output of the 32nd register, only six consecutive registers are left. Therefore, at most 6-unfolding can be applied to Fig. 6 without suffering from the large fanout problem. At the end of Algorithm A, only a reduced number of consecutive registers is left after adding $m(x)p(x)$, and the unfolding factor $J$ cannot exceed this number. Therefore, $D$ is usually set larger than the desired unfolding factor $J$ to ensure that enough consecutive registers remain at the end of Algorithm A.

Alternatively, at the expense of a slight increase in the critical path and latency, the delays at the input of the last XOR gate can be retimed and moved to the output of this XOR gate. For example, the five delays at the H input of the last XOR gate in Fig. 6 can be retimed and moved to the output of this XOR gate. This requires first adding five delays to the data input. The penalty is the increase in the critical path to two XOR gates in the serial encoder.

In the third step, $\mathrm{Rem}(m(x)p(x)x^{n-k})_{g'(x)}$ needs to be divided by $p(x)$ to get the final result. An architecture similar to that in Fig. 1 can also be used, except that the input data is added to the input of the left-most register, since the input polynomial does not need to be multiplied by any power of $x$. For example, the third step of the modified BCH encoding using the $p(x)$ derived in Example I is illustrated in Fig. 7.

Unfolding the modified BCH encoder in Fig. 4 by factor $J$, a parallel architecture capable of processing $J$ message bits at a time is derived. In the $J$-unfolded block computing $m(x)p(x)$, no feedback loop exists. Thus, it can be pipelined to achieve the desired clock frequency.
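The three steps of Fig. 4 can be checked numerically. The sketch below is an illustration under stated assumptions, not a transcription of Algorithm A: `find_multiplier` is a hypothetical helper that builds a multiplier polynomial p(x) by repeatedly cancelling the second-highest term of g(x)p(x), in the spirit of clustered look-ahead, until the two highest powers of the product differ by at least D. Polynomials are Python integers (bit i is the coefficient of x^i), g(x) is assumed to have a constant term of 1, and the toy generator x^3 + x + 1 is an assumption made for brevity.

```python
def degree(poly: int) -> int:
    return poly.bit_length() - 1


def gf2_mul(a: int, b: int) -> int:
    """Carry-less (GF(2)) polynomial multiplication."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def gf2_divmod(dividend: int, divisor: int):
    """Quotient and remainder of GF(2) polynomial division."""
    quotient = 0
    while degree(dividend) >= degree(divisor):
        shift = degree(dividend) - degree(divisor)
        quotient ^= 1 << shift
        dividend ^= divisor << shift
    return quotient, dividend


def top_gap(poly: int) -> int:
    """Difference between the two highest powers of a polynomial."""
    t1 = degree(poly)
    return t1 - degree(poly ^ (1 << t1))


def find_multiplier(g: int, D: int):
    """Return (p, g*p) with at least D - 1 zeros below the leading term of g*p."""
    scale = D - 1                 # work with x^(D-1) * p(x) to keep exponents >= 0
    p = 1 << scale
    gp = g << scale
    while top_gap(gp) < D:
        shift = scale - top_gap(gp)
        p ^= 1 << shift           # add the term that cancels the second-highest power
        gp ^= g << shift
    common = (p & -p).bit_length() - 1   # drop the common factor x^common
    return p >> common, gp >> common


def three_step_remainder(m: int, g: int, p: int) -> int:
    """Step 1: m*p. Step 2: remainder modulo g*p. Step 3: divide by p."""
    r = degree(g)
    step1 = gf2_mul(m, p)
    _, step2 = gf2_divmod(step1 << r, gf2_mul(g, p))
    step3, leftover = gf2_divmod(step2, p)
    assert leftover == 0          # r(x)p(x) is always divisible by p(x)
    return step3


if __name__ == "__main__":
    g = 0b1011                                  # assumed toy generator
    p, g_mod = find_multiplier(g, 4)
    print(bin(p), bin(g_mod), top_gap(g_mod))   # 0b1011 0b1000101 4
```

For every message tried, `three_step_remainder` returns the same value as dividing $m(x)x^{n-k}$ by $g(x)$ directly, which is the identity stated by (4).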
In the second block, since the LFSR of the modified generator polynomial $g'(x)$ has at least $J$ registers at the H input of the right-most XOR gate, retiming can be applied to the $J$-unfolded architecture to eliminate the effect of large fanout after adding $J$ registers to the output of $m(x)p(x)$.
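The condition on the number of registers follows from the standard unfolding rule of [4]: an edge carrying w delays becomes J edges, where copy i of the source feeds copy (i + w) mod J of the destination through floor((i + w) / J) delays. The sketch below tabulates this for a single edge; the function names are hypothetical and chosen only for illustration.

```python
def unfolded_delays(w: int, J: int):
    """Unfold an edge U -> V with w delays by factor J.

    Returns (i, j, d) triples: copy U_i feeds copy V_j through d delays.
    """
    return [(i, (i + w) % J, (i + w) // J) for i in range(J)]


def every_copy_has_register(w: int, J: int) -> bool:
    """True when each unfolded copy of the edge keeps at least one delay."""
    return all(d >= 1 for _, _, d in unfolded_delays(w, J))


if __name__ == "__main__":
    # Two registers, 3-unfolding: two edges keep one delay, one edge has none.
    print(unfolded_delays(2, 3))           # [(0, 2, 0), (1, 0, 1), (2, 1, 1)]
    print(every_copy_has_register(2, 3))   # False
    print(every_copy_has_register(8, 8))   # True
```

With w at least J, every copy of the right-most XOR gate keeps a register at its H input, which is what makes the retiming step possible.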

Although the fanout problem does not exist in the third block in Fig. 4, it can exist in the unfolded architecture. Since the polynomial $p(x)$ enables $g'(x)$ to have $D-1$ consecutive zero coefficients after the highest power term, the difference of the highest two powers of $p(x)$ is equal to $t_1 - t_2$. After $J$-unfolding is applied, there are some copies of the right-most XOR gate whose fanout is determined by the number of nonzero terms in $p(x)$. In the worst case, the number of nonzero terms in $p(x)$ is at most $D$, and $D$ is set as small as possible, which is a little larger than the unfolding factor $J$. Usually, the desired unfolding factor is far less than the length of $g(x)$ in long BCH codes. Hence, the delay caused by the fanout of dividing by $p(x)$ is far less than that of dividing by $g(x)$ in the original BCH encoders. For example, in the 8-unfolded architecture of the BCH code in Example I, an appropriately chosen $D$ can meet the requirements. There are at most three XOR gates connected to the copies of the right-most XOR gate in the 8-unfolded LFSR of dividing by $p(x)$, and the delay caused by that is 1.17 ns according to HSPICE simulation using a 0.35-μm library. However, the right-most XOR gate in the unfolded LFSR to implement dividing by $g(x)$ in the original encoder may have 20 XOR gates connected to it. The delay caused by the fanout of 20 XOR gates is 2.87 ns, which is 2.45 times that in the LFSR to divide by $p(x)$. As the BCH codes get longer, we expect the difference to be larger.

Meanwhile, $J$-unfolding makes the iteration bound $J$ times that in the original architecture. The longer delay caused by more cascaded XOR gates in the unfolded architecture can be reduced by applying tree structures to the cascaded XORs. Using tree structures, the delay caused by cascaded XOR gates is much less than that caused by large fanout in long BCH codes.

IV. PERFORMANCE ANALYSIS OF AN EXAMPLE

The BCH(2047, 1926) code can achieve better performance than a Reed-Solomon (255, 239) code with similar length and higher rate over the AWGN channel.
This BCH code can be formed by a generator polynomial $g(x)$ with degree 121 and 61 nonzero terms. Multiplying $g(x)$ by a $p(x)$ that has 17 nonzero terms and degree 32 yields a $g'(x)$ with the desired difference between its two highest powers. Using the modified algorithm, a 32-unfolded binary BCH (2047, 1926) code can be implemented using 3200 XOR gates. Let $T_{XOR}$ denote the delay of an XOR gate. The critical path of the 32-unfolded BCH encoder follows from the iteration bound of the LFSR that computes $\mathrm{Rem}(m(x)p(x)x^{n-k})_{g'(x)}$ and the iteration bound of the LFSR for dividing by $p(x)$ in the third step. Using a tree structure, the critical path can be reduced to four cascaded XOR gates. Using HSPICE with a 0.35-μm library, the critical path is estimated to be 3.94 ns after taking the fanout effect of 15


TABLE I PERFORMANCE COMPARISON

REFERENCES
[1] T.-B. Pei and C. Zukowski, "High-speed parallel CRC circuits in VLSI," IEEE Trans. Commun., vol. 40, pp. 653-657, Apr. 1992.
[2] J. H. Derby, "High-speed CRC computation using state-space transformation," in Proc. Global Telecommunications Conf. 2001, GLOBECOM '01, vol. 1, pp. 166-170.
[3] R. J. Glaise, "A two-step computation of cyclic redundancy code CRC-32 for ATM networks," IBM J. Res. Dev., vol. 41, pp. 705-709, 1997.
[4] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. New York: Wiley, 1999.
[5] K. K. Parhi, "System and method for generating cyclic codes for error control in digital communications," U.S. Patent Application 678 910, 2002.
[6] K. K. Parhi and D. G. Messerschmitt, "Pipeline interleaving and parallelism in recursive digital filters. I. Pipelining using scattered look-ahead and decomposition," IEEE Trans. Acoustics, Speech, Signal Processing, vol. 37, pp. 1099-1117, July 1989.
[7] C. Leiserson, F. Rose, and J. Saxe, "Optimizing synchronous circuitry by retiming," in Proc. 3rd Caltech Conf. VLSI, 1983, pp. 87-116.

gates into consideration. The part that multiplies the message $m(x)$ by $p(x)$ can be pipelined, and its latency is not counted in the total number of clock cycles needed to encode one message block. In the original encoder architecture, the iteration bound is $T_{XOR}$. Hence, the iteration bound in the 32-unfolded original architecture is $32\,T_{XOR}$, while the output of some copies of the right-most XOR gate is connected to 60 gates. The estimated critical path of the 32-unfolded original architecture is 9.72 ns from HSPICE simulation using 0.35-μm technology. Table I shows comparison results of the modified BCH parallel encoder and the original parallel encoder. Compared with the original encoder, the modified encoder can achieve a speedup of 132%.

V. CONCLUSION

A novel parallel implementation of long BCH encoders has been proposed in this paper. The proposed encoder does not suffer from the effect of large fanout. This scheme can also be used in cyclic redundancy checking to reduce the fanout bottleneck. After the fanout effect has been taken care of, further speedup can be achieved by reducing the critical path. The critical path usually lies in the division part. Future work will be directed toward reducing the critical path of this part or making this part pipelinable after further algorithmic modifications.

ACKNOWLEDGMENT

The author is grateful to X. Zhang for her help in the preparation of this paper. This work was carried out while the author was at Broadcom Corporation, Irvine, CA, while on leave from the University of Minnesota.

Keshab K. Parhi (S'85-M'88-SM'91-F'96) received the B.Tech., M.S.E.E., and Ph.D. degrees from the Indian Institute of Technology, Kharagpur, India, the University of Pennsylvania, Philadelphia, and the University of California at Berkeley, in 1982, 1984, and 1988, respectively. Since 1988, he has been with the University of Minnesota, Minneapolis, where he is currently a Distinguished McKnight University Professor in the Department of Electrical and Computer Engineering. His research addresses VLSI architecture design and implementation of physical layer aspects of broadband communications systems. He is currently working on error-control coders and cryptography architectures, high-speed transceivers, ultra wideband systems, and quantum error-control coders and quantum cryptography. He has published over 350 papers, has authored the textbook VLSI Digital Signal Processing Systems (New York: Wiley, 1999), and coedited the reference book Digital Signal Processing for Multimedia Systems (New York: Marcel Dekker, 1999).

Dr. Parhi is the recipient of numerous awards including the 2003 IEEE Kiyo Tomiyasu Technical Field Award, the 2001 IEEE W.R.G. Baker prize paper award, and a Golden Jubilee award from the IEEE Circuits and Systems Society in 1999. He has served on Editorial Boards of IEEE TRANSACTIONS ON VLSI SYSTEMS, IEEE TRANSACTIONS ON SIGNAL PROCESSING, IEEE SIGNAL PROCESSING LETTERS, IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS, and IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-II, currently serves on Editorial Boards of the IEEE Signal Processing Magazine and Journal of VLSI Signal Processing Systems, and is the current Editor-in-Chief of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-I: REGULAR PAPERS for 2004-2005. He served as Technical Program Cochair of the 1995 IEEE VLSI Signal Processing Workshop and the 1996 ASAP Conference, and as the General Chair of the 2002 IEEE Workshop on Signal Processing Systems. He was a Distinguished Lecturer for the IEEE Circuits and Systems Society from 1997 to 1999.
