2009D020047150 S1Ver2

Doctoral Thesis

Design and Implementation of
High-Performance Radix-4 Turbo Decoder
for Multiple 4G Standards
(Kim, Ji-Hoon)

School of Electrical Engineering & Computer Science
Division of Electrical Engineering
KAIST
2009


Advisor: Professor Park, In-Cheol
by
Kim, Ji-Hoon
School of Electrical Engineering & Computer Science
Division of Electrical Engineering
KAIST
A thesis submitted to the faculty of the KAIST in partial
fulfillment of the requirements for the degree of Doctor of
Philosophy in Engineering in the School of Electrical
Engineering and Computer Science, Division of Electrical
Engineering
Daejeon, Korea
2009. 4. 29.
Approved by
Professor In-Cheol Park
Professor Park, In-Cheol
Advisor
DEE
20047150
Kim, Ji-Hoon. Design and Implementation of High-Performance Radix-4 Turbo Decoder for
Multiple 4G Standards. School of Electrical Engineering & Computer Science, Division of
Electrical Engineering. 2009. 85 p. Advisor: Prof. Park, In-Cheol. Text in English.
Abstract
Recently, turbo codes have been adopted for high-speed data transmission in 4G
communication systems such as Mobile WiMAX (IEEE 802.16e standard) and 3GPP-LTE,
in double-binary and single-binary form, respectively. In particular, the
double-binary convolutional turbo code (CTC) offers advantages over the classical
single-binary turbo code. However, compared with the classical single-binary turbo
code, a nonbinary turbo code is much more complex to implement in hardware, and its
decoding requires more memory, especially for storing the extrinsic information
exchanged between the two soft-input soft-output (SISO) decoders. Additionally,
because decoding is iterative, implementing a high-performance turbo decoder for
next-generation mobile communication systems is challenging. Also, as the need to
support multiple standards in a single mobile handheld device increases, the efficient
implementation of advanced channel decoders, which are the most area-consuming and
computationally intensive blocks in a baseband modem, becomes more important.
To deal with these issues in resource-limited handheld systems, this dissertation
presents solutions at three levels: algorithm, architecture, and implementation.
As algorithmic solutions, two techniques are proposed that are especially suitable for
nonbinary and high-radix single-binary turbo decoding. The first, an energy-efficient
SISO decoder based on border metric encoding, eliminates the complex dummy
calculation at the cost of a small memory that holds encoded border metrics. Because
the border memory is small and accessed infrequently, the energy consumed for
SISO decoding is greatly reduced. The second, which reduces the memory size required
for double-binary turbo decoding, is a new method to convert the symbol-level extrinsic
information to bit-level information and vice versa. By exchanging bit-level
extrinsic information, the number of extrinsic information values to be exchanged in
double-binary turbo decoding is reduced to the same amount as in single-binary turbo
decoding. Since the extrinsic information memory is a significant part of the total,
the proposed method is effective in reducing the total memory size needed in a
double-binary turbo decoder.
To verify the proposed algorithmic solutions, two chips have been implemented. The
first chip contains a double-binary turbo decoder for the Mobile WiMAX standard with
a dedicated hardware interleaver and is fabricated in a 0.13 µm CMOS process. The
decoder is based on a time-multiplexing architecture consisting of a single optimized
SISO decoder and a low-complexity hardware interleaver, and it provides up to 50Mb/s
at a frequency of 200MHz with a simple early stopping criterion that exploits the
bit-level extrinsic information. The second chip presents a unified radix-4 turbo
decoder architecture that supports both Mobile WiMAX and 3GPP-LTE. To achieve a
decoding rate of more than 100Mb/s, the chip consists of eight retimed radix-4 SISO
decoders and a dual-mode parallel hardware interleaver that supports both standards.
The second chip achieves more than 400Mb/s at a frequency of 250MHz with a simple
early stopping criterion. Because the retiming technique makes the peak operating
frequency relatively high, the supply voltage can be scaled down; under voltage
scaling the chip achieves an energy efficiency of 0.34nJ/bit/iteration while still
delivering more than 100Mb/s with a fixed eight iterations.
Contents
CHAPTER 1 INTRODUCTION
1.1 Motivation
1.2 Previous Works
1.3 Contributions
CHAPTER 2 BACKGROUNDS
2.1 Digital Communication System
2.2 Introduction to Turbo Codes
2.2.1 Turbo Code Encoder Structure
2.2.2 Turbo Decoding
2.2.3 Decoding Algorithm for Turbo Codes
2.3 Turbo code in Mobile WiMAX
2.3.1 Encoding
2.3.2 Decoding
2.4 Turbo code in 3GPP-LTE
2.4.1 Encoding
2.4.2 Decoding
CHAPTER 3 BORDER METRIC ENCODING
3.1 Radix-4 SISO Decoding
3.1.1 Sliding Window for nonbinary SISO Decoding
3.2 Proposed Border Metric Encoding
3.3 Experimental Results
CHAPTER 4 BIT-LEVEL EXTRINSIC INFORMATION EXCHANGE
4.1 Extrinsic Information in Double-Binary Turbo Codes
4.1.1 Symbol-level Extrinsic Information in Double-Binary Turbo Codes
4.1.2 Memory Requirement in Double-Binary Turbo Decoder
4.2 Proposed Bit-Level Extrinsic Information Exchange
4.2.1 Bit-level Extrinsic Information for Double-Binary Turbo Codes
4.2.2 Symbol-to-Bit Conversion of Extrinsic Information
4.2.3 Bit-to-Symbol Conversion of Extrinsic Information
4.3 Experimental Results
4.3.1 Hardware Implementation
CHAPTER 5 A 50MBPS DOUBLE-BINARY CIRCULAR TURBO DECODER FOR MOBILE WIMAX
5.1 Proposed Chip Architecture
5.1.1 Low-Complexity SISO Decoder Design
5.1.2 Bit-level Extrinsic Information Exchange
5.1.3 Dedicated Hardware Interleaver
5.1.4 Dedicated Double-Flow Hardware Interleaver
5.1.5 Early Stopping Criterion
5.2 Implementation Results
CHAPTER 6 A UNIFIED PARALLEL RADIX-4 TURBO DECODER FOR MOBILE WIMAX AND 3GPP-LTE
6.1 Proposed Chip Architecture
6.1.1 Parallel Turbo Decoding
6.1.2 Unified Radix-4 SISO Decoder with Retiming
6.1.3 Memory-Sharing with Bit-level Extrinsic Information
6.1.4 Dual-Mode Hardware Interleaver
6.2 Implementation Results
CHAPTER 7 CONCLUSIONS
REFERENCE
List of Figures
Figure 1.1: The Need for Supporting Multiple Standards
Figure 1.2: Research Overview
Figure 1.3: Proposed Solutions for Nonbinary CTC Decoder Implementation
Figure 2.1: Model of a digital communication system
Figure 2.2: Turbo code encoder structure
Figure 2.3: A turbo decoder structure
Figure 2.4: Double-binary CRSC constituent encoder used by WiMAX
Figure 2.5: A decoder for the WiMAX turbo code
Figure 2.6: A Turbo Encoder for 3GPP-LTE
Figure 2.7: A Trellis Diagram of a 3GPP-LTE Turbo Encoder
Figure 3.1: Trellis Diagrams
Figure 3.2: Sliding window diagrams
Figure 3.3: 3-bit border metric encoding function
Figure 3.4: BER performance comparison with 8 iterations for 4800-bit frame
Figure 3.5: BER performance of 1920-bit frame according to the number of iterations
Figure 4.1: Memory Requirements in Double-Binary Turbo Decoder
Figure 4.2: Block diagram of the proposed bit-level extrinsic information exchange
Figure 4.3: Proposed Bit-Level Extrinsic Information Exchange
Figure 4.4: Comparison of BER performance of 8 iterations for 1920-bit frame
Figure 4.5: Block diagram of the proposed double-binary turbo decoder
Figure 4.6: Block diagram and complexity of the proposed bit-to-symbol converter
Figure 5.1: Branch metric memory width comparison
Figure 5.2: Block diagram of the proposed two converters
Figure 5.3: Interleaving procedure for the WiMAX
Figure 5.4: Interleaver structure based on the incremental calculation
Figure 5.5: Need of LIFO for Interleaved Address
Figure 5.6: Double-flow hardware interleaver based on incremental calculation
Figure 5.7: Double-flow hardware interleaver based on incremental calculation
Figure 5.8: Block diagram of the proposed double-binary turbo decoder
Figure 5.9: Average number of iterations for the proposed turbo decoder
Figure 5.10: Comparison of BER performance for 1920-bit frame
Figure 5.11: Die photo of the proposed double-binary turbo decoder chip
Figure 6.1: Overall Unified Turbo Decoder Architecture with Time-Multiplexing
Figure 6.2: The Proposed Chip Architecture with Eight SISO Decoders
Figure 6.3: Add-Compare-Select (ACS) block with Retiming
Figure 6.4: Sliding Window with Register Retiming
Figure 6.5: Input Frame Memory Configurations
Figure 6.6: Dual-Mode Dedicated Hardware Interleaver
Figure 6.7: FER Performance and Average Iteration Number with Early Termination in an AWGN Channel
Figure 6.8: Memory Size Reduction in the Proposed Architecture
Figure 6.9: Micrograph of the Chip
List of Tables
Table 1.1 Differences between the 3GPP-LTE and Mobile WiMAX Turbo Codes
Table 3.1 Simulation environment
Table 3.2 Encoded values for border metrics
Table 3.3 Single-port SRAM size required for a SISO decoder
Table 3.4 Energy consumptions of SISO decoders
Table 4.1 Simulation environment
Table 4.2 Memory Configuration for one SISO Decoder
Table 4.3 Memory Configuration for the Extrinsic Information
Table 4.4 Single-port SRAM Size required for the Turbo Decoder
Table 5.1 CTC Interleaver Parameters for WiMAX
Table 5.2 Single-port SRAM Size Required for the Turbo Decoder
Table 6.1 Comparison of Decoder Implementation
List of Abbreviations
4G: 4th Generation
RSC: Recursive Systematic Convolutional
CTC: Convolutional Turbo Code
SISO: Soft-Input Soft-Output
LLR: Log Likelihood Ratio
ML: Maximum Likelihood
SOVA: Soft-Output Viterbi Algorithm
MAP: Maximum a posteriori
APP: a posteriori Probability
ECC: Error Correction Coding
FEC: Forward Error Correction
BPSK: Binary Phase Shift Keying
OFDMA: Orthogonal Frequency Division Multiple Access
NLOS: Non-Line-of-Sight
AWGN: Additive White Gaussian Noise
ARP: Almost Regular Permutation
QPP: Quadratic Polynomial Permutation
Chapter 1
Introduction
The turbo code, introduced in 1993, is one of the most powerful forward error correction
channel codes and provides near-optimal bit-error rates (BERs), that is, within 0.5 dB of
Shannon's limit at a BER of 10^-5 [1]. Owing to this remarkable performance, turbo codes
have been adopted in many standardized mobile radio systems.
Recent advances in convolutional turbo codes (CTCs) have attracted much interest in their
applications. The conventional CTC suffers from a high error floor due to its relatively
small minimum Hamming distance, and from performance degradation due to puncturing.
The nonbinary CTC has recently emerged and appears to resolve many flaws of the classical
single-binary CTC [2]. In addition, the concept of the tail-biting convolutional code has
been applied to the CTC. The tail-biting code, also called a circular code, improves the
spectral efficiency of the CTC because it eliminates the tail bits otherwise used to
terminate the state of the encoder.
Recently, turbo codes have been adopted for high-speed data transmission in 4G
mobile communication systems such as Mobile WiMAX (IEEE 802.16e standard) and
3GPP-LTE, in double-binary and single-binary form, respectively.
1.1 Motivation
Little research has been dedicated to the hardware implementation of the double-binary
turbo decoder, although previous work on classical single-binary turbo codes can be
applied to nonbinary turbo codes [4]-[11]. Compared with the classical single-binary
turbo code, a nonbinary turbo code is much more complex to implement in hardware, and
its decoding requires more memory, especially for storing the extrinsic information
exchanged between the two soft-input soft-output (SISO) decoders.
Figure 1.1: The Need for Supporting Multiple Standards

In addition, as the need to support multiple standards in a single handheld device
increases, as shown in Figure 1.1, the efficient implementation of advanced channel
decoders, which are the most area-consuming and computationally intensive blocks in a
baseband modem, becomes more important. Accordingly, a unified decoder architecture
that can support multiple standards becomes necessary, since separate implementations
for different standards require substantial hardware resources and therefore a large
silicon area. Since the turbo codes adopted in 3GPP-LTE and Mobile WiMAX differ from
each other, as summarized in Table 1.1, the efficient implementation of a unified turbo
decoder supporting both 3GPP-LTE and Mobile WiMAX is important for future mobile
handheld devices. Also, due to the iterative decoding behavior and the long critical
path, implementing a high-performance turbo decoder for next-generation mobile
communication systems is challenging.
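To make the difference in frame sizes concrete, the following minimal Python sketch enumerates the frame sizes listed in Table 1.1. The function and constant names are hypothetical, and the lower bounds of f in the last three LTE ranges are assumed to start at 1 so that the boundary sizes 512, 1024, and 2048 are not counted twice.

```python
def lte_frame_sizes():
    """Enumerate the 3GPP-LTE frame sizes N of Table 1.1 (in bits)."""
    sizes = [40 + 8 * f for f in range(0, 60)]       # 40 .. 512, step 8
    # Assumption: f starts at 1 below, to avoid repeating 512, 1024, 2048.
    sizes += [512 + 16 * f for f in range(1, 33)]    # 528 .. 1024, step 16
    sizes += [1024 + 32 * f for f in range(1, 33)]   # 1056 .. 2048, step 32
    sizes += [2048 + 64 * f for f in range(1, 65)]   # 2112 .. 6144, step 64
    return sizes


# Mobile WiMAX frame sizes of Table 1.1, counted in pairs of bits.
WIMAX_FRAME_SIZES_PAIRS = [24, 36, 48, 72, 96, 108, 120, 144, 180, 192, 216,
                           240, 480, 960, 1440, 1920, 2400]
```

A unified decoder has to handle every size in both lists, which is one reason the interleaver and memory organization are central to the later chapters.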
1.2 Previous Works

Table 1.1 Differences between the 3GPP-LTE and Mobile WiMAX Turbo Codes
Standards | 3GPP-LTE | Mobile WiMAX
RSC code type | Single-Binary | Double-Binary
Constraint length | |
Trellis termination type | Appending the bits that make both encoder states all zero and sending the resulting codes | Tail-Biting (Circular Coding)
Interleaver | QPP Interleaver | ARP Interleaver
Frame size (N) | N = 40 + 8f, 0 ≤ f ≤ 59; N = 512 + 16f, 0 < f ≤ 32; N = 1024 + 32f, 0 < f ≤ 32; N = 2048 + 64f, 0 < f ≤ 64 | 24, 36, 48, 72, 96, 108, 120, 144, 180, 192, 216, 240, 480, 960, 1440, 1920, 2400 (pairs)
There have been studies on double-binary turbo decoding to lower the hardware
complexity [12]-[14]. For a double-binary SISO decoding algorithm, based on the
maximum a posteriori (MAP) algorithm [1], the constant log-MAP algorithm has been
reported for double-binary turbo decoding [12]. By allowing the constant correction term
in the log-MAP algorithm for double-binary SISO decoding, a performance improvement was
observed.
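The role of the constant correction term can be illustrated with a minimal Python sketch. The log-MAP algorithm evaluates ln(e^a + e^b) = max(a, b) + ln(1 + e^-|a-b|); the max-log-MAP algorithm drops the second term, and the constant log-MAP algorithm replaces it with a constant when |a - b| is small. The function names, the constant c = 0.375, and the threshold t = 2.0 are illustrative assumptions and are not taken from [12].

```python
import math


def max_star_exact(a, b):
    """Log-MAP: exact ln(exp(a) + exp(b))."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def max_log(a, b):
    """Max-log-MAP: the correction term is dropped."""
    return max(a, b)


def max_star_constant(a, b, c=0.375, t=2.0):
    """Constant log-MAP: add a constant c when the inputs are close.

    c and t are assumed, illustrative values.
    """
    return max(a, b) + (c if abs(a - b) < t else 0.0)
```

For inputs that are close together, the constant version lies between the max-log value and the exact value, which is the source of the performance improvement over max-log-MAP at a small hardware cost.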
Due to the tail-biting property, the initial values of the forward and backward
metrics are not explicitly specified. In [13], a simple method to determine the initial
state in circular turbo decoding is presented. It has been reported that using the
information from the previous iteration gives better performance and lower computational
complexity than the pre-computing method [15].
To reduce the large extrinsic information memory, two techniques have been
introduced in [14]. The first one, bitwise approximation of the extrinsic information,
reduces the three extrinsic information values to two in double-binary turbo
decoding by modifying the SISO decoding structure. However, it leads to a severe
BER performance degradation, larger than about 0.5dB. Also, it is well
known that non-uniform quantization can be applied to reduce the extrinsic information
memory size, since the extrinsic information does not need to be exact in decoding [4][5].
Exploiting this property, the second technique uses a block-scaling method in which a
common shift index is shared by the three extrinsic information values. This method
greatly reduces the extrinsic information memory size with negligible performance
degradation, although the number of extrinsic information values is still three.
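The idea of sharing a common shift index can be sketched as follows. This is only an illustration of block scaling in general: the function names, the 4-bit mantissa width, and the rule for choosing the shift are assumptions, and the exact format used in [14] is not reproduced here.

```python
def block_scale(values, mantissa_bits=4):
    """Store a block of integer values as short mantissas plus one shared shift.

    The shift is the smallest right shift for which every value fits in a
    signed mantissa of mantissa_bits bits (assumed selection rule).
    """
    lo = -(1 << (mantissa_bits - 1))
    hi = (1 << (mantissa_bits - 1)) - 1
    shift = 0
    while any(not lo <= (v >> shift) <= hi for v in values):
        shift += 1
    return shift, [v >> shift for v in values]


def block_unscale(shift, mantissas):
    """Recover approximate values from the shared shift and the mantissas."""
    return [m << shift for m in mantissas]
```

For example, the three values 100, -37, and 5 are stored as the shift 4 and the mantissas 6, -3, and 0, and are recovered as 96, -48, and 0: large values keep their magnitude approximately, while small values in the same block lose precision.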
In addition, there have been several turbo decoder implementations for single-binary
turbo codes [9][20]. To support multiple 3G standards such as CDMA2000 and W-CDMA,
a programmable single-instruction multiple-data (SIMD) processor has been proposed
for interleaving, in order to provide interleaved data at the speed of the hardware
SISO decoder [20]. Compared to a ROM-based interleaver, which needs a large ROM to store
all possible interleaving patterns, this approach achieves the small area, high
performance, and low power consumption of hardware together with the flexibility and
programmability of software needed to support multiple standards.
Also, to support higher user data rates of up to 24Mb/s, a radix-4 log-MAP turbo decoder
for 3GPP-HSDPA has been introduced in [9]. Its log-MAP SISO decoder processes two
received symbols per clock cycle using a windowed radix-4 architecture, doubling the
throughput at a given clock rate compared with a similar radix-2 architecture.
1.3 Contributions
The major contribution of this dissertation is to present algorithmic modifications for
low-complexity hardware implementation, architectural solutions, and several
optimizations for high-performance turbo decoding capable of supporting two
4G communication standards, as illustrated in Figure 1.2 and Figure 1.3. The
contributions can be categorized as follows.
The first one is the energy-efficient SISO decoding structure for nonbinary turbo
decoders. With the border metric encoding scheme, the complex dummy calculation in
nonbinary turbo decoding can be avoided at the cost of a small memory that holds
encoded border metrics. Because the border memory is small and accessed infrequently,
the energy consumed for SISO decoding is greatly reduced.
Figure 1.2: Research Overview (panels: Algorithm, Architecture, Implementation)
The second one is the bit-level extrinsic information exchange. To reduce the
memory size required for double-binary turbo decoding, a new method to convert the
symbol-level extrinsic information to bit-level information and vice versa is presented.
By exchanging the bit-level extrinsic information rather than the symbol-level extrinsic
information, the number of extrinsic information values to be exchanged in double-binary
turbo decoding is reduced to the same amount as in single-binary turbo decoding. Compared
to the bitwise approximation of extrinsic information in [14], the proposed method does
not require any modification to the conventional double-binary SISO decoder structure.
The proposed method deals with the symbol-to-bit conversion and bit-to-symbol conversion
of the extrinsic information for the double-binary turbo code. Since the extrinsic
information memory is a significant part of the total, the proposed method is effective
in reducing the total memory size needed in a double-binary turbo decoder with negligible
performance degradation.
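As a rough illustration of what such a conversion involves, the following Python sketch uses a common max-log marginalization. It is an assumption made for illustration and is not necessarily the conversion derived in Chapter 4. Each symbol-level value is taken to be ln(P(symbol)/P(symbol 00)) for a symbol made of the bit pair (a, b), and positive bit-level values are taken to favor bit 1; the function names are hypothetical.

```python
def symbol_to_bit(l01, l10, l11):
    """Three symbol-level values -> two bit-level values (max-log marginals)."""
    l00 = 0.0  # reference symbol 00
    la = max(l10, l11) - max(l00, l01)  # symbols with a = 1 versus a = 0
    lb = max(l01, l11) - max(l00, l10)  # symbols with b = 1 versus b = 0
    return la, lb


def bit_to_symbol(la, lb):
    """Two bit-level values -> three symbol-level values (l01, l10, l11).

    Assumes the two bits are independent, so the values simply add.
    """
    return lb, la, la + lb
```

Under these assumptions, two values per symbol are stored instead of three, and converting bit-level values to symbol-level values and back returns the original bit-level values.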
Figure 1.3: Proposed Solutions for Nonbinary CTC Decoder Implementation
The third one is the complete decoder architecture of the double-binary circular
turbo decoder for Mobile WiMAX. To lower the overall hardware complexity, in addition
to the above methods, a dedicated hardware interleaver is designed. By generating the
interleaved addresses on the fly, the proposed turbo decoder achieves small area and
low power consumption, since there is no need to include a large interleaver memory.
Also, to reduce the critical path delay, a retimed architecture for double-binary SISO
decoding is presented. In addition, to avoid unnecessary iterations in good channel
conditions, a simple early stopping criterion for the double-binary turbo decoder is
presented. The proposed stopping criterion uses the sign values of the incoming bit-level
extrinsic information and the hard-decision values.
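A criterion of this kind can be sketched in a few lines of Python. The exact rule is given later in the thesis; the version below is an assumption for illustration only: iteration stops when the sign of every incoming bit-level extrinsic value agrees with the corresponding hard decision, with a positive value taken to mean bit 1. The function name is hypothetical.

```python
def should_stop(extrinsic, hard_bits):
    """Return True when every extrinsic sign agrees with its hard decision.

    Assumed convention: a positive extrinsic value favors bit 1.
    """
    return all((e > 0) == (b == 1) for e, b in zip(extrinsic, hard_bits))
```

Because only sign bits and hard decisions are compared, a check of this form needs no multipliers or extra memory, which fits the goal of a simple criterion.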
Finally, to support multiple 4G mobile communication systems such as Mobile
WiMAX and 3GPP-LTE which require high-speed data transmission, the unified parallel
radix-4 turbo decoder architecture is proposed.
Chapter 2
Backgrounds
The efficient design of a communication system that enables reliable high-speed
service is challenging. Efficient design refers to the efficient use of the primary
communication resources, namely, power and bandwidth. The reliability of such systems is
usually measured by the signal-to-noise ratio (SNR) required to achieve a specific error
rate. Also, a bandwidth-efficient communication system that is perfectly reliable, or as
reliable as possible, while using as low an SNR as possible is desired.
Error correction coding (ECC) is a technique that improves the reliability of
communication over a noisy channel. The use of the appropriate ECC allows a
communication system to operate at very low error rates, using low to moderate SNR
values, enabling reliable high-speed multimedia services over a noisy channel.
2.1 Digital Communication System

The information source generates a message containing information that is to be
transmitted to the receiver. In a digital communication system, shown in Figure 2.1, the
outputs of the information source are converted into a sequence of bits. This sequence of
bits might contain too much redundancy. Ideally, the source encoder removes redundancy
and represents the source output sequence with as few bits as possible. Note that the
redundancy in the source is different from the redundancy inserted intentionally by the
error correcting code.
The encrypter encodes the data for security purposes. Encryption is the most effective
way to achieve data security. The three components, information source, source encoder and
encrypter, can be seen as a single component called the source. The binary sequence is the
output of the source. The number of bits the source generates per second is the data rate,
in units of bits per second (bps or bits/s).
Figure 2.1: Model of a digital communication system
The primary goal of the channel encoder is to increase the reliability of transmission
within the constraints of signal power, system bandwidth and computational complexity.
This can be achieved by introducing structured redundancy into transmitted signals.
Channel coding is used in digital communication systems to correct transmission errors
caused by noise, fading and interference. The channel encoder assigns to each message a
longer message called a codeword. This usually results in either a lower data transmission
rate or increased channel bandwidth relative to an un-coded system. To make the
communication system less vulnerable to channel impairments, the channel encoder
generates codewords that are as different as possible from one another.
Since the transmission medium is a waveform medium, the sequence of bits generated
by the channel encoder cannot be transmitted directly through this medium. The main
goals of modulation are not only to match the signal to the transmission medium, enable
simultaneous transmission of a number of signals over the same physical medium and
increase the data rate, but also to achieve this by the efficient use of the two primary
resources of a communication system, namely, transmitted power and channel bandwidth.
A communication channel refers to the combination of the physical medium (copper wires,
radio medium or optical fiber) and the electronic or optical devices (equalizers,
amplifiers) that are part of the path followed by a signal, as shown in Figure 2.1.
Channel noise, fading and interference corrupt the transmitted signal and cause errors
in the received signal. This thesis considers only AWGN channels, which ultimately limit
system performance. Note that many interference sources and background noise can be
modeled as AWGN due to the central limit theorem.

At the receiving end of the communication system, the demodulator processes the
channel-corrupted transmitted waveform and makes a hard or soft decision on each symbol.
If the demodulator makes a hard decision, its output is a binary sequence and the
subsequent channel decoding process is called hard-decision decoding. A hard decision in
the demodulator results in some irreversible information loss. If the demodulator passes the
soft output of the matched filter to the decoder, the subsequent channel decoding process is
called soft-decision decoding.
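The difference between the two kinds of decision can be illustrated with a minimal Python sketch for BPSK over an AWGN channel. The mapping of bit b to the symbol x = 2b - 1, equiprobable bits, and the function names are assumptions made for this example; under them the soft output is the log-likelihood ratio ln(P(b=1|y)/P(b=0|y)) = 2y/sigma^2.

```python
def hard_decision(y):
    """Hard decision: keep only the sign of the received sample."""
    return 1 if y > 0 else 0


def bpsk_llr(y, sigma):
    """Soft decision: LLR of a BPSK sample y in AWGN with noise std sigma.

    Assumes bit b is sent as x = 2b - 1 and both bit values are equally likely.
    """
    return 2.0 * y / (sigma ** 2)
```

For instance, the received samples 0.1 and 2.0 both give the hard decision 1, but with sigma = 1 their LLRs are 0.2 and 4.0. The reliability information that the hard decision discards is exactly what a soft-decision decoder exploits.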
The channel decoder works separately from the modulator/demodulator and has the
goal of estimating the output of the source encoder based on the encoder structure and a
decoding algorithm. In general, with soft-decision decoding, approximately 2 dB and 6 dB
of coding gain with respect to hard-decision decoding can be obtained in AWGN channels
and fading AWGN channels, respectively.
If encryption is used, the decrypter converts encrypted data back into its original form.
The source decoder transforms the sequence at its input based on the source encoding rule
into a sequence of data, which will be used by the information sink to construct an estimate
of the message. These three components, decrypter, source decoder and information sink
can be represented as a single component called the sink, as far as the rest of the
communication system is concerned. The binary sequence is the input to the sink.
2.2 Introduction to Turbo Codes

It is well known from information theory that a random code of sufficient length is
capable of approaching the Shannon limit, provided one uses maximum likelihood (ML)
decoding. Unfortunately, the complexity of ML decoding increases with the size of
codeword up to the point where decoding becomes impractical. Thus, a practical decoding
of long codes requires that the code possess some structure. Coding theorists have been
trying to develop codes that combine two seemingly conflicting principles: (a)
randomness, to achieve high coding gain and so approach the Shannon limit, and (b)
structure, to make decoding practical. In 1993, Berrou et al. introduced a new coding
scheme that combines these two seemingly conflicting principles in an elegant way. They
introduced randomness through an interleaver and structure by employing parallel
concatenated convolutional codes. These codes are called turbo codes and offer an
excellent tradeoff between complexity and error correcting capability. Concatenated codes
are very powerful error correcting codes that are capable of closely approaching the
Shannon limit by using iterative decoding [1].
Figure 2.2: Turbo code encoder structure. (a) General structure of turbo codes. (b)
Typical structure of turbo codes.
2.2.1 Turbo Code Encoder Structure

A turbo code encoder consists of three building blocks: constituent encoders,
interleavers and a puncturing unit. The constituent encoders are used in parallel and each
interleaver scrambles the information symbols before feeding them into the corresponding
constituent encoder. The puncturing unit is used to achieve higher code rates. In general,
turbo codes can have more than two parallel constituent convolutional encoders, where
each encoder is fed with a scrambled version of the information symbol $u$. Figure 2.2(a)
shows the general architecture of turbo codes, where the outputs $u$ and $P_i$ ($i = 1, \ldots, F$) are
known as the systematic part and the parity part, respectively. In practice, most
applications use only two constituent encoders where only the input to the second encoder
is scrambled as shown in Figure 2.2(b).
2.2.1.1 The Constituent Encoders

Turbo codes use recursive systematic convolutional (RSC) encoders. The use of
recursive, or feedback, encoders prevents the encoders from being driven back to the all-zero
state by zero symbols. Since u is permuted before entering ENC2, it is likely that one of
the RSC code outputs will have high weight. This discussion does not mean that turbo
codes exhibit very high minimum distances. In fact, achieving high minimum distances
requires the use of a well-designed interleaver of sufficient length. Finding such an
interleaver is not trivial. The systematic part helps the iterative decoding to provide better
convergence. Note that the systematic part prevents the turbo codes from being
catastrophic if no data puncturing is involved. If the systematic part is punctured, two
different input sequences can produce the same codeword making the codes catastrophic.
Since repetition codes are not good codes, the systematic part from only one of the
constituent encoders is transmitted.
2.2.1.2 Interleaving
Interleaving refers to the process of permuting symbols in the information sequence
before it is fed to the second constituent encoder. The primary function of the interleaver is
the creation of a code with good distance properties. Note that interleaving alone cannot
achieve good distance properties unless it is used together with recursive constituent
encoders. De-interleaving acts on the interleaved information sequence and restores the
sequence to its original order.
Achieving good distance properties is a common criterion for interleaver design. This
fits very well with the concept of maximum likelihood (ML) decoding. Unfortunately,
turbo decoding is not guaranteed to perform ML decoding, because of the independence
assumption made on the sequence to be decoded and the probabilistic information (known
as extrinsic information) passed between constituent decoders. This suggests an additional
design criterion based on the correlation between the extrinsic information.
2.2.1.3 Puncturing
Puncturing refers to the process of removing certain bits from the codeword. The
purpose of puncturing is to increase the overall code rate. It is common to puncture only
the parity symbols of the first and second encoders.
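As an illustration of parity-only puncturing, the following minimal sketch raises a rate-1/3 turbo code to rate 1/2 by keeping every systematic bit and alternating between the two parity streams. The alternating pattern and the function name `puncture_rate_half` are assumptions chosen for illustration; the text above does not prescribe a particular puncturing pattern.

```python
def puncture_rate_half(systematic, parity1, parity2):
    """Puncture a rate-1/3 turbo codeword to rate 1/2.

    Assumed pattern (illustrative only): every systematic bit is kept,
    parity1 is kept at even positions and parity2 at odd positions.
    """
    if not (len(systematic) == len(parity1) == len(parity2)):
        raise ValueError("all three streams must have the same length")
    out = []
    for k, s in enumerate(systematic):
        out.append(s)
        # Only the parity symbols are punctured, never the systematic ones.
        out.append(parity1[k] if k % 2 == 0 else parity2[k])
    return out


codeword = puncture_rate_half([1, 0, 1, 1], [0, 0, 1, 1], [1, 1, 0, 0])
print(codeword)
```

With N information bits the punctured codeword carries 2N bits, so the overall rate is 1/2 instead of 1/3.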
2.2.2 Turbo Decoding

The iterative turbo decoder consists of two component decoders serially concatenated
via an interleaver, identical to the one in the encoder, as shown in Figure 2.3.
The first SISO decoder takes as input the received information sequence $y_k^s$ and the
received parity sequence generated by the first encoder, $y_k^{p1}$. The decoder then produces
extrinsic information, denoted as $L_{1e}$, which is interleaved and used to produce an
improved estimate of the a priori probabilities of the information sequence for the second
decoder.
The other two inputs to the second SISO decoder are the interleaved received
information sequence $\tilde{y}_k^s$ and the received parity sequence produced by the second
encoder, $y_k^{p2}$. The second SISO decoder also produces extrinsic information, $L_{2e}$, which is
used to improve the estimate of the a priori probabilities for the information sequence at
the input of the first SISO decoder. This iterative operation improves the decoder
performance relative to a single pass through a serially concatenated decoder. The feedback
loop is a distinguishing feature of this decoder, and the name turbo code is given with
reference to the principle of the turbo engine. After a certain number of iterations, the soft
outputs of both SISO decoders stop producing further performance improvements. Then
the last stage of decoding makes a hard decision after de-interleaving the log-likelihood
ratio (LLR), denoted as $L_r$.
2.2.3 Decoding Algorithm for Turbo Codes

Turbo codes require SISO decoders to generate extrinsic information and LLRs. Either the
maximum a posteriori (MAP) algorithm [1] or the soft-output Viterbi algorithm
(SOVA) can be used for the component decoders. MAP-based turbo decoders generally
have much better performance than SOVA-based turbo decoders. In this work, we focus
on the MAP algorithm.

Figure 2.3: A turbo decoder structure
2.2.3.1 MAP Algorithm

Let $u = (u_1, u_2, \ldots, u_N)$ be a set of binary variables representing information bits, where
$N$ denotes the frame size. In the systematic encoders, one of the outputs $x^s = (x_1^s, x_2^s, \ldots, x_N^s)$
is identical to the information sequence $u$. The other is the parity information sequence
output $x^p = (x_1^p, x_2^p, \ldots, x_N^p)$. The noisy versions of the outputs are $y^s = (y_1^s, y_2^s, \ldots, y_N^s)$ and
$y^p = (y_1^p, y_2^p, \ldots, y_N^p)$. Let $R_1^N = (R_1, R_2, \ldots, R_k, \ldots, R_N)$ denote the received sequence, where
$R_k = (y_k^s, y_k^p)$.
We assume that binary phase shift keying (BPSK) modulation is used to map each binary
symbol into a signal from the { +1, -1} modulation signal set. In the MAP decoder, the
decoder decides whether uk = +1 or uk = -1 depending on the following log-likelihood ratio
(LLR).
$$ L_R(u_k) = \log \frac{P(u_k = +1 \mid R_1^N)}{P(u_k = -1 \mid R_1^N)} \tag{2.1} $$
In the final operation, the decoder makes a hard decision by comparing LR(uk) to a
threshold equal to zero, as shown in the expression (2.2).
$$ \hat{u}_k = \begin{cases} 1 & \text{if } L_R(u_k) \ge 0 \\ 0 & \text{otherwise} \end{cases} \tag{2.2} $$
We can compute the APPs in (2.1) as

$$ P(u_k = +1 \mid R_1^N) = \sum_{U^+} P(S_{k-1} = s', S_k = s \mid R_1^N) = \sum_{U^+} \frac{P(S_{k-1} = s', S_k = s, R_1^N)}{P(R_1^N)} \tag{2.3} $$

where $S_k$ is the encoder state at time $k$, $U^+$ is the set of pairs $(s', s)$ for the state transitions
$(S_{k-1} = s') \to (S_k = s)$ which correspond to the event $u_k = +1$, and $U^-$ is similarly defined.
Also

$$ P(u_k = -1 \mid R_1^N) = \sum_{U^-} P(S_{k-1} = s', S_k = s \mid R_1^N) = \sum_{U^-} \frac{P(S_{k-1} = s', S_k = s, R_1^N)}{P(R_1^N)} \tag{2.4} $$
The log-likelihood ratio is then

$$ L_R(u_k) = \log \frac{\sum_{U^+} P(S_{k-1} = s', S_k = s, R_1^N)}{\sum_{U^-} P(S_{k-1} = s', S_k = s, R_1^N)} \tag{2.5} $$
By several applications of Bayes' rule, we have

$$
\begin{aligned}
P(s', s, R_1^N) &= P(s', s, R_1^{k-1}, R_k, R_{k+1}^N) \\
&= P(R_{k+1}^N \mid s', s, R_1^{k-1}, R_k)\, P(s', s, R_1^{k-1}, R_k) \\
&= P(R_{k+1}^N \mid s', s, R_1^{k-1}, R_k)\, P(s, R_k \mid s', R_1^{k-1})\, P(s', R_1^{k-1}) \\
&= P(R_{k+1}^N \mid s)\, P(s, R_k \mid s')\, P(s', R_1^{k-1}) \\
&= \beta_k(s)\, \gamma_k(s', s)\, \alpha_{k-1}(s')
\end{aligned}
\tag{2.6}
$$
The log-likelihood ratio can be written as

$$ L_R(u_k) = \ln \frac{\sum_{U^+} \alpha_{k-1}(s')\, \gamma_k(s', s)\, \beta_k(s)}{\sum_{U^-} \alpha_{k-1}(s')\, \gamma_k(s', s)\, \beta_k(s)} \tag{2.7} $$

where $\alpha_{k-1}(s')$ is the forward metric, $\beta_k(s)$ is the backward metric and $\gamma_k(s', s)$ is the
branch metric. They are defined as

$$ \alpha_k(s) = P(S_k = s, R_1^k) \tag{2.8} $$

$$ \gamma_k(s', s) = P(S_k = s, R_k \mid S_{k-1} = s') \tag{2.9} $$

$$ \beta_k(s) = P(R_{k+1}^N \mid S_k = s) \tag{2.10} $$
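We can obtain $\alpha_k(s)$ defined in (2.8) as

$$
\begin{aligned}
\alpha_k(s) &= P(s, R_1^k) \\
&= \sum_{s'} P(s', s, R_1^k) \\
&= \sum_{s'} P(s, R_k \mid s', R_1^{k-1})\, P(s', R_1^{k-1}) \\
&= \sum_{s'} P(s, R_k \mid s')\, P(s', R_1^{k-1}) \\
&= \sum_{s'} \gamma_k(s', s)\, \alpha_{k-1}(s')
\end{aligned}
\tag{2.11}
$$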
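We can obtain $\beta_k(s)$ defined in (2.10) as

$$
\begin{aligned}
\beta_{k-1}(s') &= P(R_k^N \mid s') \\
&= \sum_{s} P(R_k^N, s \mid s') \\
&= \sum_{s} P(R_{k+1}^N \mid s', s, R_k)\, P(s, R_k \mid s') \\
&= \sum_{s} P(R_{k+1}^N \mid s)\, P(s, R_k \mid s') \\
&= \sum_{s} \beta_k(s)\, \gamma_k(s', s)
\end{aligned}
\tag{2.12}
$$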
The recursion for $\alpha_k(s)$ is initialized according to

$$ \alpha_0(s) = \begin{cases} 1 & s = 0 \\ 0 & s \ne 0 \end{cases} \tag{2.13} $$

which makes the reasonable assumption that the component encoder is initialized to the
zero state. The recursion for $\beta_k(s)$ is initialized according to

$$ \beta_N(s) = \begin{cases} 1 & s = 0 \\ 0 & s \ne 0 \end{cases} \tag{2.14} $$

which assumes that termination bits have been appended at the end of the data word so
that the component encoder is again in state zero at time N.
All that remains at this point is the computation of $\gamma_k(s', s) = P(s, R_k \mid s')$. Observe that
$\gamma_k(s', s)$ may be written as

$$
\begin{aligned}
\gamma_k(s', s) &= \frac{P(s', s)}{P(s')} \cdot \frac{P(s', s, R_k)}{P(s', s)} \\
&= P(s \mid s')\, P(R_k \mid s', s) \\
&= P(u_k)\, P(R_k \mid u_k)
\end{aligned}
\tag{2.15}
$$

where the event $u_k$ corresponds to the event $s' \to s$. Note that $P(s \mid s') = P(s' \to s) = 0$ if $s$ is
not a valid state from state $s'$. Hence, $\gamma_k(s', s) = 0$ if $s' \to s$ is not valid and, otherwise,

$$ \gamma_k(s', s) = \frac{P(u_k)}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{\| y_k - x_k \|^2}{2\sigma^2}\right] \tag{2.16} $$

where it is assumed that the code is transmitted over an AWGN channel and $\sigma^2$ is the noise
variance.
2.2.3.2 Max-log-MAP Algorithm

In order to avoid the complexity of multiplication and division operation in (2.7), (2.11),
and (2.12), the computations are converted into logarithmic domain. The metrics in the
new domain are defined as follows:
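$$ \tilde{\alpha}_k(s) = \ln(\alpha_k(s)) \tag{2.17} $$

$$ \tilde{\beta}_k(s) = \ln(\beta_k(s)) \tag{2.18} $$

$$ \tilde{\gamma}_k(s', s) = \ln(\gamma_k(s', s)) \tag{2.19} $$

The expression (2.17) is rewritten as

$$
\begin{aligned}
\tilde{\alpha}_k(s) &= \ln(\alpha_k(s)) \\
&= \ln\Bigl(\sum_{s'} \alpha_{k-1}(s')\, \gamma_k(s', s)\Bigr) \\
&= \ln\Bigl(\sum_{s'} \exp\bigl(\tilde{\alpha}_{k-1}(s') + \tilde{\gamma}_k(s', s)\bigr)\Bigr)
\end{aligned}
\tag{2.20}
$$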
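These log-domain forward metrics are initialized as

$$ \tilde{\alpha}_0(s) = \begin{cases} 0 & s = 0 \\ -\infty & s \ne 0 \end{cases} \tag{2.21} $$

The expression (2.18) is rewritten as

$$
\begin{aligned}
\tilde{\beta}_{k-1}(s') &= \ln(\beta_{k-1}(s')) \\
&= \ln\Bigl(\sum_{s} \exp\bigl(\tilde{\beta}_k(s) + \tilde{\gamma}_k(s', s)\bigr)\Bigr)
\end{aligned}
\tag{2.22}
$$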
with initial conditions

$$ \tilde{\beta}_N(s) = \begin{cases} 0 & s = 0 \\ -\infty & s \ne 0 \end{cases} \tag{2.23} $$

under the assumption that the encoder has been terminated.

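As before, $L_R(u_k)$ is computed as

$$
\begin{aligned}
L_R(u_k) &= \ln \frac{\sum_{U^+} \alpha_{k-1}(s')\, \gamma_k(s', s)\, \beta_k(s)}{\sum_{U^-} \alpha_{k-1}(s')\, \gamma_k(s', s)\, \beta_k(s)} \\
&= \ln\Bigl[\sum_{U^+} \exp\bigl(\tilde{\alpha}_{k-1}(s') + \tilde{\gamma}_k(s', s) + \tilde{\beta}_k(s)\bigr)\Bigr]
 - \ln\Bigl[\sum_{U^-} \exp\bigl(\tilde{\alpha}_{k-1}(s') + \tilde{\gamma}_k(s', s) + \tilde{\beta}_k(s)\bigr)\Bigr]
\end{aligned}
\tag{2.24}
$$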
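These expressions can be simplified by using the approximation

$$ \max(x, y) \approx \ln(e^x + e^y) \tag{2.25} $$

Given the max function, we may now rewrite (2.20), (2.22), and (2.24) as

$$ \tilde{\alpha}_k(s) = \max_{s'} \bigl[\tilde{\alpha}_{k-1}(s') + \tilde{\gamma}_k(s', s)\bigr] \tag{2.26} $$

$$ \tilde{\beta}_{k-1}(s') = \max_{s} \bigl[\tilde{\beta}_k(s) + \tilde{\gamma}_k(s', s)\bigr] \tag{2.27} $$

$$ L_R(u_k) = \max_{U^+} \bigl[\tilde{\alpha}_{k-1}(s') + \tilde{\gamma}_k(s', s) + \tilde{\beta}_k(s)\bigr] - \max_{U^-} \bigl[\tilde{\alpha}_{k-1}(s') + \tilde{\gamma}_k(s', s) + \tilde{\beta}_k(s)\bigr] \tag{2.28} $$

As shown in the above operations, the multiplications in the MAP algorithm are replaced by
additions in the Max-log-MAP algorithm, which accounts for its low complexity. The
calculation of $\tilde{\gamma}_k(s', s)$ will be given in Section 2.2.3.3.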
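The effect of the max approximation can be checked numerically. The sketch below compares the exact log-sum $\ln(e^x + e^y)$ with $\max(x, y)$ and performs one forward-recursion step in the log domain in both ways. The function names `max_star` and `forward_step`, the two-state toy trellis, and the use of `None` to mark invalid transitions are assumptions made only for this illustration.

```python
import math


def max_star(x, y):
    """Exact ln(e^x + e^y), computed as max(x, y) plus a correction term."""
    if x == -math.inf:
        return y
    if y == -math.inf:
        return x
    return max(x, y) + math.log1p(math.exp(-abs(x - y)))


def forward_step(alpha_prev, gamma, exact=False):
    """One log-domain forward recursion step.

    alpha_prev[s_prev] : log-domain forward metric at time k-1
    gamma[s_prev][s]   : log-domain branch metric (None = invalid transition)
    exact=False uses the max approximation (Max-log-MAP),
    exact=True uses the exact log-sum (log-domain MAP).
    """
    combine = max_star if exact else max
    alpha = []
    for s in range(len(alpha_prev)):
        acc = -math.inf
        for s_prev, a in enumerate(alpha_prev):
            g = gamma[s_prev][s]
            if g is not None:
                acc = combine(acc, a + g)
        alpha.append(acc)
    return alpha


# Toy two-state example (values chosen arbitrarily for illustration).
alpha_prev = [0.0, -1.0]
gamma = [[0.5, -0.5], [-0.2, 0.3]]
print(forward_step(alpha_prev, gamma))              # Max-log-MAP
print(forward_step(alpha_prev, gamma, exact=True))  # exact log-sum
```

For two operands, the approximation error is the correction term $\ln(1 + e^{-|x - y|})$, which is at most $\ln 2$ and vanishes as the two operands move apart.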
2.2.3.3 Calculation of Branch Metrics and Extrinsic Information

The extrinsic information takes the role of a priori information in the iterative decoding
algorithm.

$$ L_e(u_k) = \ln\left(\frac{P(u_k = +1)}{P(u_k = -1)}\right) \tag{2.29} $$

The a priori term $P(u_k)$ shows up in (2.16) in the expression for $\gamma_k(s', s)$. In the log-domain, (2.16) becomes

$$ \tilde{\gamma}_k(s', s) = \ln P(u_k) - \ln(\sqrt{2\pi}\,\sigma) - \frac{\| y_k - x_k \|^2}{2\sigma^2} \tag{2.30} $$
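Now observe that we may write from (2.29)

$$
\begin{aligned}
P(u_k) &= \left(\frac{\exp[-L_e(u_k)/2]}{1 + \exp[-L_e(u_k)]}\right) \exp[u_k L_e(u_k)/2] \\
&= A_k \exp[u_k L_e(u_k)/2]
\end{aligned}
\tag{2.31}
$$

where the first equality follows since it equals

$$
\begin{aligned}
&\left(\frac{\sqrt{P_-/P_+}}{1 + P_-/P_+}\right) \sqrt{P_+/P_-} = P_+ \quad \text{when } u_k = +1 \text{ and} \\
&\left(\frac{\sqrt{P_-/P_+}}{1 + P_-/P_+}\right) \sqrt{P_-/P_+} = P_- \quad \text{when } u_k = -1,
\end{aligned}
\tag{2.32}
$$

where we have defined $P_+ = P(u_k = +1)$ and $P_- = P(u_k = -1)$ for
convenience.
Substitution of (2.31) into (2.30) yields

$$ \tilde{\gamma}_k(s', s) = \ln\bigl(A_k / (\sqrt{2\pi}\,\sigma)\bigr) + \frac{u_k L_e(u_k)}{2} - \frac{\| y_k - x_k \|^2}{2\sigma^2} \tag{2.33} $$

where we will see that the first term may be ignored.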

Thus, the extrinsic information received from a companion decoder is included in the
computation through the branch metric $\tilde{\gamma}_k(s', s)$. The rest of the algorithm proceeds as before
using equations (2.26), (2.27) and (2.28).
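Using the fact that

$$
\begin{aligned}
\| y_k - x_k \|^2 &= (y_k^s - x_k^s)^2 + (y_k^p - x_k^p)^2 \\
&= (y_k^s)^2 - 2 x_k^s y_k^s + (x_k^s)^2 + (y_k^p)^2 - 2 x_k^p y_k^p + (x_k^p)^2
\end{aligned}
\tag{2.34}
$$

and that only the terms dependent on $U^+$ or $U^-$, $2 x_k^s y_k^s$ and $2 x_k^p y_k^p$, survive after
the subtraction in (2.26), (2.27) and (2.28), (2.33) is rewritten as follows.

$$ \tilde{\gamma}_k(s', s) = \frac{u_k L_e(u_k)}{2} + \frac{x_k^s y_k^s}{\sigma^2} + \frac{x_k^p y_k^p}{\sigma^2} \tag{2.35} $$

Given $L_C = 2/\sigma^2$, we have

$$ \tilde{\gamma}_k(s', s) = \frac{u_k L_e(u_k)}{2} + \frac{L_C}{2} x_k^s y_k^s + \frac{L_C}{2} x_k^p y_k^p \tag{2.36} $$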
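Upon substitution of (2.36) into (2.28), we have

$$
\begin{aligned}
L_R(u_k) = {} & \max_{U^+} \Bigl[\tilde{\alpha}_{k-1}(s') + \frac{u_k L_e(u_k)}{2} + \frac{L_C}{2} x_k^s y_k^s + \frac{L_C}{2} x_k^p y_k^p + \tilde{\beta}_k(s)\Bigr] \\
& - \max_{U^-} \Bigl[\tilde{\alpha}_{k-1}(s') + \frac{u_k L_e(u_k)}{2} + \frac{L_C}{2} x_k^s y_k^s + \frac{L_C}{2} x_k^p y_k^p + \tilde{\beta}_k(s)\Bigr]
\end{aligned}
\tag{2.37}
$$

Now note that $\frac{u_k L_e(u_k)}{2} + \frac{L_C}{2} x_k^s y_k^s = \frac{L_e(u_k)}{2} + \frac{L_C}{2} y_k^s$ under the first max() operation in
(2.37), where $u_k = x_k^s = +1$, and $\frac{u_k L_e(u_k)}{2} + \frac{L_C}{2} x_k^s y_k^s = -\frac{L_e(u_k)}{2} - \frac{L_C}{2} y_k^s$ under the second max()
operation, where $u_k = x_k^s = -1$. Using the
definition for max(), it is easy to see that these terms may be isolated out so that

$$
\begin{aligned}
L_R(u_k) = {} & L_C y_k^s + L_e(u_k) + \max_{U^+} \Bigl[\tilde{\alpha}_{k-1}(s') + \frac{L_C}{2} x_k^p y_k^p + \tilde{\beta}_k(s)\Bigr] \\
& - \max_{U^-} \Bigl[\tilde{\alpha}_{k-1}(s') + \frac{L_C}{2} x_k^p y_k^p + \tilde{\beta}_k(s)\Bigr]
\end{aligned}
\tag{2.38}
$$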
The interpretation of this new expression for $L_R(u_k)$ is that the first term is likelihood
information received directly from the channel, the second is extrinsic likelihood
information received from a companion decoder, and the third term ($\max_{U^+}[\cdot] - \max_{U^-}[\cdot]$) is
extrinsic likelihood information to be passed to a companion decoder. Note that this third
term is likelihood information gleaned from received parity not available to the companion
decoder. Using the notation $L_{e,OUT}(u_k)$ for extrinsic information to be passed and $L_{e,IN}(u_k)$
for extrinsic information received, we have

$$ L_R(u_k) = L_{e,IN}(u_k) + L_C y_k^s + L_{e,OUT}(u_k) \tag{2.39} $$

Extrinsic information which will be passed to the companion decoder is calculated as
follows.

$$ L_{e,OUT}(u_k) = L_R(u_k) - L_C y_k^s - L_{e,IN}(u_k) \tag{2.40} $$
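Equation (2.40) amounts to a per-bit subtraction. A minimal sketch is given below, assuming the channel reliability is $L_C = 2/\sigma^2$ as defined above; the function names `channel_reliability` and `extrinsic_out` and the sample values are hypothetical.

```python
def channel_reliability(noise_variance):
    """L_C = 2 / sigma^2 for BPSK over an AWGN channel."""
    return 2.0 / noise_variance


def extrinsic_out(llr, y_sys, le_in, lc):
    """Extrinsic output per (2.40): L_e,OUT = L_R - L_C * y^s - L_e,IN.

    llr   : a posteriori LLRs L_R(u_k) produced by the SISO decoder
    y_sys : received systematic samples y_k^s
    le_in : extrinsic information received from the companion decoder
    lc    : channel reliability L_C
    """
    return [l - lc * y - e for l, y, e in zip(llr, y_sys, le_in)]


lc = channel_reliability(1.0)  # sigma^2 = 1.0 gives L_C = 2.0
print(extrinsic_out([3.0, -2.0], [0.5, -0.25], [1.0, 0.5], lc))
```

Removing the channel and a priori contributions before the exchange ensures that the companion decoder receives only information it did not already have.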
2.3 Turbo code in Mobile WiMAX

Mobile WiMAX is a rapidly growing broadband wireless access technology based on
the IEEE 802.16 standard [3]. It utilizes Orthogonal Frequency Division Multiple Access
(OFDMA) as the radio access method for improved multipath performance in non-line-of-sight (NLOS) environments and promises to deliver high data rates over large areas to a
large number of users in the near future. This addition to current broadband options
such as DSL, cable, and Wi-Fi promises to rapidly provide broadband access to locations
in the world's rural and developing areas where broadband is currently unavailable, as well
as competing for urban market share.
Recently, to improve system gain and non-line-of-sight (NLOS) coverage, the double-binary tail-biting convolutional turbo code (CTC) has been adopted in the IEEE 802.16
standard (WiMAX) because of the following advantages over the classical single-binary turbo code.
(a) Double-binary turbo codes double the decoding rate in a hardware implementation,
because they allow memory access of two bits at each time instant. The reason is
that the extrinsic information, which must be passed
to the next decoder after interleaving or de-interleaving, represents two bits at each
time instant. Doubling the decoding rate reduces the latency of the
decoder by one half.
(b) Double-binary turbo codes reduce the sensitivity to puncturing. This can be
explained as follows. Since the rate 1/2 double-binary recursive systematic
convolutional (RSC) encoder produces two parity streams, most of the code rates
can be obtained by simply ignoring one of these parity streams and puncturing the
other (if necessary). Ignoring one of the two parity streams results in a new RSC
encoder with a single parity stream. This single parity stream is less punctured
compared to similar single-binary convolutional RSC encoders, which results in
less sensitivity to puncturing.
(c) Double-binary turbo codes reduce the correlation effects between component
decoders, which leads to improved convergence [3].
For practical purposes, it is important to reduce the computational complexity of turbo
decoding. An approach for reducing the computational complexity of maximum a
posteriori (MAP) decoding [1] has been introduced in the previous section for single-binary turbo codes, where only two branches enter and leave each state.
In double-binary turbo codes, by contrast, four branches enter and leave each state.
2.3.1 Encoding
The CRSC constituent encoder used by WiMAX is shown in Figure 2.4. The encoder is
fed blocks of $k$ message bits which are grouped into $N = k/2$ couples. In Figure 2.4, A
represents the first bit of the couple, and B represents the second bit. The two parity bits
are denoted W and Y. For ease of exposition, subscripts are left off the figure, but below a
single subscript is used to denote the time index $k \in \{0, \ldots, N-1\}$ and an optional second
subscript is used on the parity bits W and Y to indicate which of the two constituent encoders
produced them.
Let the vector $S_k = [S_{k,1}\ S_{k,2}\ S_{k,3}]^T$, $S_{k,m} \in \{0, 1\}$, denote the state of the encoder at time
$k$. Note that although the inputs and outputs of the encoder are defined over GF(4), only
binary values are stored within the shift register and thus the encoder has just eight states.
The encoder state at time $k+1$ is related to the state at time $k$ by

$$ S_{k+1} = G S_k + X_k \tag{2.41} $$

where

$$ X_k = \begin{bmatrix} A_k \oplus B_k \\ B_k \\ B_k \end{bmatrix}, \qquad G = \begin{bmatrix} 1 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix} \tag{2.42} $$
Figure 2.4: Double-binary CRSC constituent encoder used by WiMAX

Because of the tail-biting nature of the code, the block must be encoded twice by each
constituent encoder. During the first pass at encoding, the encoder is initialized to the all-zeros state, $S_0 = [0\ 0\ 0]^T$. After the block is encoded, the final state of the encoder $S_N$ is
used to derive the circulation state

$$ S_c = (I + G^N)^{-1} S_N \tag{2.43} $$

where the above operations are over GF(2). In practice, the circulation state $S_c$ can be
found from $S_N$ by using a lookup table [3]. Once the circulation state is found, the data is
encoded again. This time, the encoder is set to start in state $S_c$ and will be guaranteed to
also end in state $S_c$.
The first encoder operates on the data in its natural order, yielding parity couples $\{W_{k,1}, Y_{k,1}\}$. The second encoder operates on the data after it has been interleaved.
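The two-pass procedure can be sketched directly from (2.41)-(2.43). The code below tracks only the encoder state (parity generation is omitted), and it solves $(I + G^N) S_c = S_N$ by trying all eight states rather than by the lookup table mentioned above. The function names and the sample message are hypothetical. For this $G$, the matrix $I + G^N$ is singular when $N$ is a multiple of 7, in which case no unique circulation state exists and the sketch raises an error.

```python
import itertools

# State-transition matrix G of (2.42); all arithmetic is over GF(2).
G = [[1, 0, 1],
     [1, 0, 0],
     [0, 1, 0]]


def mat_vec(m, v):
    """Matrix-vector product over GF(2)."""
    return [sum(a & b for a, b in zip(row, v)) % 2 for row in m]


def mat_mul(a, b):
    """Matrix-matrix product over GF(2)."""
    n = len(a)
    return [[sum(a[i][k] & b[k][j] for k in range(n)) % 2 for j in range(n)]
            for i in range(n)]


def mat_pow(m, e):
    """m raised to the power e over GF(2) by repeated multiplication."""
    n = len(m)
    result = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for _ in range(e):
        result = mat_mul(result, m)
    return result


def run_encoder(couples, state):
    """Apply S_{k+1} = G S_k + X_k for every couple (A_k, B_k)."""
    for a, b in couples:
        x = [a ^ b, b, b]
        state = [g ^ xi for g, xi in zip(mat_vec(G, state), x)]
    return state


def circulation_state(couples):
    """First pass from the all-zeros state, then solve (I + G^N) S_c = S_N."""
    n = len(couples)
    s_n = run_encoder(couples, [0, 0, 0])
    g_n = mat_pow(G, n)
    m = [[g_n[i][j] ^ (1 if i == j else 0) for j in range(3)] for i in range(3)]
    # Exhaustive search over the eight states; equivalent to applying the
    # inverse of (I + G^N) whenever that inverse exists.
    solutions = [list(c) for c in itertools.product((0, 1), repeat=3)
                 if mat_vec(m, list(c)) == s_n]
    if len(solutions) != 1:
        raise ValueError("I + G^N is singular for this block length")
    return solutions[0]


message = [(1, 0), (0, 1), (1, 1), (0, 0), (1, 0), (1, 1)]
s_c = circulation_state(message)
print(s_c, run_encoder(message, s_c))  # start and end states of the second pass
```

Starting the second pass in the computed state returns the encoder to the same state after the last couple, which is the tail-biting property used by the decoder.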
2.3.2 Decoding
Decoding is complicated by the fact that the constituent codes are double-binary and
circular. As with conventional turbo codes, decoding involves the iterative exchange of
extrinsic information between the two component decoders. While decoding can be
performed in the probability domain, the log-domain is preferred since the low complexity
Max-log-MAP algorithm can then be applied. Unlike the decoder for a single-binary turbo
code, which can represent each binary symbol as a single log-likelihood ratio, the decoder
for a double-binary code requires three log-likelihood ratios. For example, the likelihood
ratios for message couple $(A_k, B_k)$ can be represented in the form

$$ \Lambda_{a,b}(A_k, B_k) = \log \frac{P(A_k = a, B_k = b)}{P(A_k = 0, B_k = 0)} \tag{2.44} $$

where $(a, b)$ can be (0, 1), (1, 0), or (1, 1).

Figure 2.5: A decoder for the WiMAX turbo code

An iterative decoder that can be used to decode the WiMAX turbo code is shown in
Figure 2.5. The goal of each of the two constituent decoders is to update the set of log-likelihood ratios associated with each message couple. In the figure and in the following
discussion, $\Lambda^{(i)}_{a,b}(A_k, B_k)$ denotes the set of LLRs corresponding to the message couple at
the input of the decoder and $\Lambda^{(o)}_{a,b}(A_k, B_k)$ is the set of LLRs at the output of the decoder.
Each decoder is provided with $\Lambda^{(i)}_{a,b}(A_k, B_k)$ along with the received values of the parity
bits generated by the corresponding encoder (in LLR form). Using these inputs and
knowledge of the code constraints, it is able to produce the updated LLRs $\Lambda^{(o)}_{a,b}(A_k, B_k)$ at
its output.
As with single-binary turbo codes, extrinsic information is passed to the other
constituent decoder instead of the raw LLRs. This prevents the positive feedback of
previously resolved information. Extrinsic information is found by simply subtracting the
appropriate input LLR from each output LLR, as indicated in Figure 2.5.
The extrinsic information that is passed between the two decoders must be interleaved
or de-interleaved so that it is in the proper sequence at the input of the other decoder.
2.3.2.1 Max-log-MAP Algorithm for Decoding

The extension of Max-log-MAP algorithms to the double-binary case is fairly
straightforward. In the double-binary turbo codes, the three log-likelihood ratio outputs of
the k-th symbol are expressed as follows.
$$
\begin{aligned}
\Lambda_k^{(z)} = {} & \max_{(s_k \to s_{k+1},\, z)} \bigl[\alpha_k(s_k) + \gamma_{k+1}(s_k \to s_{k+1}) + \beta_{k+1}(s_{k+1})\bigr] \\
& - \max_{(s_k \to s_{k+1},\, 00)} \bigl[\alpha_k(s_k) + \gamma_{k+1}(s_k \to s_{k+1}) + \beta_{k+1}(s_{k+1})\bigr]
\end{aligned}
\tag{2.45}
$$

where $z$ belongs to {01, 10, 11}, $s_k$ is the state of the encoder at time $k$, and $\alpha$, $\beta$ and $\gamma$
are the forward, backward, and branch metrics, respectively. The metrics are calculated
as expressed in equations (2.46), (2.47) and (2.48), where A is the set of states at time $k-1$
connected to state $s_k$, and B is the set of states at time $k+1$ connected to state $s_k$.

$$ \alpha_k(s_k) = \max_{s_{k-1} \in A} \bigl[\alpha_{k-1}(s_{k-1}) + \gamma_k(s_{k-1} \to s_k)\bigr] \tag{2.46} $$

$$ \beta_k(s_k) = \max_{s_{k+1} \in B} \bigl[\beta_{k+1}(s_{k+1}) + \gamma_{k+1}(s_k \to s_{k+1})\bigr] \tag{2.47} $$

$$
\begin{aligned}
\gamma_k(s_k \to s_{k+1}) &= \ln\bigl(P(y_k \mid x_k)\, P(u_k = z)\bigr) \\
&= \frac{L_c}{2} \bigl(x_k^{s1} y_k^{s1} + x_k^{s2} y_k^{s2} + x_k^{p1} y_k^{p1} + x_k^{p2} y_k^{p2}\bigr) + L_{e,IN}^{(z)}
\end{aligned}
\tag{2.48}
$$

where $z$ belongs to {00, 01, 10, 11}, $u_k$ is the input symbol consisting of two bits, $P(u_k)$
is the a priori probability of $u_k$, and $x_k$ and $y_k$ are the transmitted and received codewords
associated with $u_k$, respectively. The superscripts p and s denote the parity bits and
systematic bits, respectively. In (2.48), $L_{e,IN}^{(z)}$ is the extrinsic information received from the
other SISO decoder and the code is assumed to be transmitted through an AWGN channel
with noise variance $\sigma^2$. Since the Max-log-MAP decoding algorithm is independent of
the signal-to-noise ratio (SNR), $L_c = 2/\sigma^2$ is usually set to a constant value, although it
can be obtained from channel estimation [8].
After the turbo decoder has completed a fixed number of iterations or met some other
convergence criterion, a final decision on the bits must be made. This is accomplished by
computing the LLR of each bit in the couple $(A_k, B_k)$ according to

$$
\begin{aligned}
\Lambda(A_k) &= \max\bigl(\Lambda_k^{10}, \Lambda_k^{11}\bigr) - \max\bigl(\Lambda_k^{00}, \Lambda_k^{01}\bigr) \\
\Lambda(B_k) &= \max\bigl(\Lambda_k^{01}, \Lambda_k^{11}\bigr) - \max\bigl(\Lambda_k^{00}, \Lambda_k^{10}\bigr)
\end{aligned}
\tag{2.49}
$$

where $\Lambda_k^{00} = 0$. The hard bit decisions can be found by comparing each of these likelihood
ratios to a threshold.
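The conversion from the three symbol-level LLRs to the two bit-level LLRs of (2.49) can be sketched as follows. The function names are hypothetical, and the threshold of zero with a "greater than or equal" comparison is an assumption, since the text only states that a threshold is used.

```python
def couple_bit_llrs(l01, l10, l11):
    """Bit LLRs of a couple (A_k, B_k) per (2.49), with Lambda^00 = 0."""
    l00 = 0.0
    llr_a = max(l10, l11) - max(l00, l01)  # A = 1 for z in {10, 11}
    llr_b = max(l01, l11) - max(l00, l10)  # B = 1 for z in {01, 11}
    return llr_a, llr_b


def hard_decision(llr, threshold=0.0):
    """Assumed decision rule: decide 1 when the LLR reaches the threshold."""
    return 1 if llr >= threshold else 0


llr_a, llr_b = couple_bit_llrs(l01=-1.0, l10=2.0, l11=0.5)
print(llr_a, llr_b, hard_decision(llr_a), hard_decision(llr_b))
```

In this example the symbol (1, 0) has the largest symbol LLR, and the bit decisions recover the same couple.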
2.4 Turbo code in 3GPP-LTE

The newly evolved standard, 3GPP long term evolution (3GPP-LTE), which is the
successor to the GSM/UMTS mobile standards, is considered to be a major step towards 4th
generation (4G) mobile broadband systems. The channel coding in LTE involves a turbo
code with an internal interleaver based on the quadratic permutation polynomial (QPP).
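A QPP interleaver maps the output index $i$ to the input index $\pi(i) = (f_1 i + f_2 i^2) \bmod K$. The sketch below builds this permutation and applies it in both directions. The parameters $K = 40$, $f_1 = 3$, $f_2 = 10$ are assumed here for illustration (they are intended to match the smallest block size of the LTE specification and should be checked against the standard); the function names are hypothetical.

```python
def qpp_interleaver(k, f1, f2):
    """Permutation pi(i) = (f1*i + f2*i^2) mod K of a QPP interleaver."""
    return [(f1 * i + f2 * i * i) % k for i in range(k)]


def interleave(seq, perm):
    """Output position i takes the element at input position perm[i]."""
    return [seq[p] for p in perm]


def deinterleave(seq, perm):
    """Inverse of interleave(): restores the original order."""
    out = [None] * len(perm)
    for i, p in enumerate(perm):
        out[p] = seq[i]
    return out


perm = qpp_interleaver(40, 3, 10)  # assumed example parameters
data = list(range(100, 140))
assert deinterleave(interleave(data, perm), perm) == data
print(perm[:8])
```

Because the permutation is given by a closed-form polynomial, addresses can be generated on the fly instead of being stored in a table.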
2.4.1 Encoding
Figure 2.6 shows the structure of a 3GPP-LTE turbo encoder. The transfer function of
each component encoder is given by the following equation.

$$ G(D) = \left[1,\ \frac{g_1(D)}{g_0(D)}\right] \tag{2.50} $$

where $g_0(D) = 1 + D^2 + D^3$ and $g_1(D) = 1 + D + D^3$.
The trellis diagram of a 3GPP-LTE turbo encoder is shown in Figure 2.7. A trellis
diagram is a state diagram which explicitly shows all possible state transitions of the
component encoder at each discrete time instant. The component encoder has eight states.
Since turbo codes are recursive, it is not possible to terminate the trellis by transmitting
zero tail bits. Trellis termination means driving the encoder to the all-zero state. This is
required at the end of each block to make sure that the initial state for the next block is the
all-zero state. The tail bits depend on the state of the component encoder after N
information bits. A simple solution to this problem is shown in Figure 2.6. A switch in each
parallel component encoder is in position A for the first N clock cycles and in position
B for 3 additional cycles. This will drive the encoder to the all-zero state. Trellis
termination is based on setting the input to the first shift register to zero. This will flush the
register with zeros after 3 shifts. The transmitted bits for trellis termination shall then be:

$$ x^{s}_{N+1},\ x^{p1}_{N+1},\ x^{s}_{N+2},\ x^{p1}_{N+2},\ x^{s}_{N+3},\ x^{p1}_{N+3},\ x^{s}_{N+1},\ x^{p2}_{N+1},\ x^{s}_{N+2},\ x^{p2}_{N+2},\ x^{s}_{N+3},\ x^{p2}_{N+3} $$

where N is the number of bits. The first six bits are the tail bits of the first component
encoder and the last six are those of the second component encoder.

Figure 2.6: A Turbo Encoder for 3GPP-LTE

Figure 2.7: A Trellis Diagram of a 3GPP-LTE Turbo Encoder
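The component encoder of (2.50) and its termination can be sketched as follows. The register names `r1`, `r2`, `r3` and the function name are hypothetical; the sketch models a single component encoder and assumes that, with the switch in position B, the encoder input equals the feedback value so that a zero enters the first register.

```python
def rsc_encode_terminated(bits):
    """Encode with G(D) = [1, (1 + D + D^3) / (1 + D^2 + D^3)] and terminate.

    Returns (systematic, parity, tail_systematic, tail_parity, final_state).
    """
    r1 = r2 = r3 = 0  # shift register contents, r1 is the newest value
    systematic, parity = [], []
    for u in bits:
        a = u ^ r2 ^ r3             # feedback polynomial g0 = 1 + D^2 + D^3
        systematic.append(u)
        parity.append(a ^ r1 ^ r3)  # feedforward polynomial g1 = 1 + D + D^3
        r1, r2, r3 = a, r1, r2

    tail_systematic, tail_parity = [], []
    for _ in range(3):
        u = r2 ^ r3                 # position B: input cancels the feedback
        a = 0                       # so a zero is shifted into the register
        tail_systematic.append(u)
        tail_parity.append(a ^ r1 ^ r3)
        r1, r2, r3 = a, r1, r2
    return systematic, parity, tail_systematic, tail_parity, (r1, r2, r3)


result = rsc_encode_terminated([1, 0, 0, 0])
print(result)
```

After the three tail cycles the register holds only zeros, regardless of the information bits.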
2.4.2 Decoding
The decoding of single-binary turbo codes based on the MAP algorithm is
described in Section 2.2.3. In this section, radix-4 single-binary turbo decoding based on
the Max-log-MAP algorithm is presented.
2.4.2.1 Radix-4 Single-Binary Max-log-MAP Algorithm for Decoding

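By merging two trellis sections, the SISO decoder can process two bits per cycle
[18][19]. Accordingly, the forward, backward, and branch metrics, denoted by $\alpha$, $\beta$, and $\gamma$,
respectively, can be defined as follows.

$$ \alpha_k(s_k) = \max_{s_{k-2},\, s_{k-1}} \bigl[\alpha_{k-2}(s_{k-2}) + \gamma_k(s_{k-2} \to s_k)\bigr] \tag{2.51} $$

$$ \beta_k(s_k) = \max_{s_{k+1},\, s_{k+2}} \bigl[\beta_{k+2}(s_{k+2}) + \gamma_k(s_k \to s_{k+2})\bigr] \tag{2.52} $$

$$
\begin{aligned}
\gamma_k(s_k \to s_{k+2}) &= \ln\bigl(P(y_k \mid x_k)\, P(v_k = 1)\bigr) + \ln\bigl(P(y_{k+1} \mid x_{k+1})\, P(v_{k+1} = 1)\bigr) \\
&= x_k^s y_k^s + x_k^p y_k^p + L_{e,IN}(v_k) + x_{k+1}^s y_{k+1}^s + x_{k+1}^p y_{k+1}^p + L_{e,IN}(v_{k+1})
\end{aligned}
\tag{2.53}
$$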
where $s_k$ is the state of the encoder at time $k$ and $v_k$ is the input bit. Also, $P(v_k)$ is the a priori
probability of $v_k$, and $x_k$ and $y_k$ are the transmitted and received codewords associated with $v_k$,
respectively. The superscripts p and s denote the parity bits and systematic bits,
respectively. In (2.53), $L_{e,IN}(v_k)$ is the extrinsic information received from the other SISO
decoder. As indicated in (2.46)-(2.48) and (2.51)-(2.53), radix-4 single-binary SISO
decoding is almost the same as double-binary SISO decoding, which enables an
efficient unified SISO decoder implementation for both decodings.
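The claim that merging two trellis sections leaves the Max-log-MAP forward metrics unchanged can be checked with a small experiment. The sketch below assumes an 8-state trellis with the feedback polynomial $1 + D^2 + D^3$ and integer (fixed-point) metrics so that the comparison is exact; the function names and the random branch metrics are hypothetical and serve only as test data.

```python
import random

NUM_STATES = 8


def next_state(state, bit):
    """State transition of an 8-state RSC trellis (feedback 1 + D^2 + D^3)."""
    r1, r2, r3 = (state >> 2) & 1, (state >> 1) & 1, state & 1
    a = bit ^ r2 ^ r3
    return (a << 2) | (r1 << 1) | r2


def radix2_step(alpha, gamma):
    """One radix-2 forward step; gamma[s][bit] is the metric of the branch
    leaving state s with the given input bit."""
    new = [float("-inf")] * NUM_STATES
    for s in range(NUM_STATES):
        for bit in (0, 1):
            t = next_state(s, bit)
            new[t] = max(new[t], alpha[s] + gamma[s][bit])
    return new


def radix4_step(alpha, gamma_a, gamma_b):
    """One radix-4 forward step over two merged trellis sections. The metric
    of a two-bit branch is the sum of the two one-bit branch metrics."""
    new = [float("-inf")] * NUM_STATES
    for s in range(NUM_STATES):
        for b0 in (0, 1):
            mid = next_state(s, b0)
            for b1 in (0, 1):
                t = next_state(mid, b1)
                merged = gamma_a[s][b0] + gamma_b[mid][b1]
                new[t] = max(new[t], alpha[s] + merged)
    return new


random.seed(0)
alpha0 = [random.randint(-20, 20) for _ in range(NUM_STATES)]
g_a = [[random.randint(-30, 30) for _ in (0, 1)] for _ in range(NUM_STATES)]
g_b = [[random.randint(-30, 30) for _ in (0, 1)] for _ in range(NUM_STATES)]
assert radix4_step(alpha0, g_a, g_b) == radix2_step(radix2_step(alpha0, g_a), g_b)
```

Each state of the merged trellis has four incoming branches, which is the same situation as in the double-binary trellis and is why one four-input max unit can serve both codes.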
Chapter 3
This chapter presents an energy-efficient soft-input soft-output (SISO) decoder based
on border metric encoding, which is especially suitable for nonbinary circular / high-radix
single-binary turbo codes. In the proposed method, the size of the branch memory is
reduced to half and the dummy calculation is removed at the cost of a small-sized memory
that holds encoded border metrics. Due to the infrequent accesses to the border memory
and its small size, the energy consumed for SISO decoding is reduced by 25.3%.
3.1 Radix-4 SISO Decoding

As expressed in the previous chapter, the metric calculation complexity of the
nonbinary / high-radix single-binary turbo codes is higher than that of the single-binary
turbo codes. For the double-binary / radix-4 single-binary turbo codes, the number of
branches connected to each trellis state is increased from two to four as shown in Figure
3.1. Since a max operation with four operands can be implemented by using three max
operations with two operands as shown in (3.1), the hardware complexity is almost three
times higher than that of the classical single-binary turbo codes if the four-operand max
operation is computed in a cycle.
$$ \max(a, b, c, d) = \max\bigl(\max(a, b), \max(c, d)\bigr) \tag{3.1} $$
It is possible to compute the four-operand max operation serially using a two-operand max
operator, but this structure requires more than one cycle and additional buffers to hold the
intermediate values. Moreover, the serial max computation results in severe throughput
degradation, as the forward and backward metrics are recursively defined using the
previously calculated metrics. Compared to the single-binary SISO decoders [6][7], the
wordlength of internal metrics should be increased in hardware implementation, as the
number of terms to be added in the branch metric calculation is increased from three to
five as expressed in (2.48).

Figure 3.1: Trellis Diagrams for (a) Radix-4 Single-Binary Turbo Code in 3GPP-LTE and (b) Double-Binary Turbo Code in Mobile WiMAX. Branch labels are given as
$x_k^s x_{k+1}^s / x_k^p x_{k+1}^p$ in (a) and $x_k^{s1} x_k^{s2} / x_k^{p1} x_k^{p2}$ in (b).
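The four-operand max of (3.1) and its use in a radix-4 add-compare-select operation can be sketched as follows. The function names `max2`, `max4` and `acs_radix4` are hypothetical, and the sketch models only the arithmetic, not the cycle timing of a hardware unit.

```python
def max2(a, b):
    """Two-operand max, the basic comparator."""
    return a if a >= b else b


def max4(a, b, c, d):
    """Four-operand max built from three two-operand max operations (3.1)."""
    return max2(max2(a, b), max2(c, d))


def acs_radix4(prev_metrics, branch_metrics):
    """Add-compare-select for one state with four incoming branches.

    prev_metrics   : state metrics of the four predecessor states
    branch_metrics : metrics of the four branches entering the state
    """
    sums = [p + g for p, g in zip(prev_metrics, branch_metrics)]
    return max4(*sums)


print(max4(3, -1, 7, 2))
print(acs_radix4([1, 2, 3, 4], [10, 0, -5, 1]))
```

The two inner comparisons are independent of each other, so the tree form has a depth of two comparators, whereas a serial evaluation needs three dependent comparisons.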
3.1.1 Sliding Window for nonbinary SISO Decoding

The sliding window technique is effective in reducing the memory size required to
store metric values. A large frame is split into a number of small windows and the MAP
decoding is applied to each window independently [16]. Figure 3.2(a) shows the
conventional sliding window diagram where forward metrics are calculated prior to
backward metrics [6]. In the sliding window technique, however, the initial values at the
border of each window are also required. To obtain reliable initial values for each
window, the dummy calculation is performed for the backward metrics as shown in Figure
3.2(a). If the window size is sufficiently long, the initial values obtained by the dummy
calculation do not degrade performance.

Figure 3.2: Sliding window diagrams (a) with dummy calculation and (b) with
border memory. The diagrams plot trellis time in blocks of L, with N = 7L, against
processing time in units of T. In (a) the dummy calculation is active with a duty ratio of 1;
in (b) the border memory load and store are active with a duty ratio of 2/WS.

Another way to obtain reliable border metric values is to use the values of the
previous iteration, which has been adopted for the classical radix-2 single-binary turbo
codes [17]. In this dissertation, this approach is employed with modifications for the double-binary / radix-4 single-binary turbo codes. For each window, the last backward metric is
stored in a memory called the border memory. The stored border metrics are loaded in the
next iteration and used as the initial backward metric values at the borders, as
illustrated in Figure 3.2(b). Since there is no stored value in the first iteration, all states at
the borders are assumed to be equiprobable in the first iteration. Compared to the
conventional method based on the dummy calculation, this approach results in slight
performance degradation for the earlier iterations, but the performance degradation
disappears after a few iterations. By using the metrics stored in the previous iteration, we
can completely avoid the dummy backward metric calculation. Additionally, the size of the
branch metric memory is reduced to half since the number of processes in which the
branch metrics participate is changed from four to two.
3.2 Proposed Border Metric Encoding

As described in the previous section, an additional memory is needed to hold the border
metric values of the previous iteration. Although the sliding window with the border
memory can eliminate the need of the dummy calculation, the border memory size is
considerable. To achieve more area-efficient and energy-efficient turbo decoding, the
border memory should be minimized. If the maximum frame size is Nmax, the number of
states in trellis is K, and state metric values are represented in P bits, then the border
memory size (BM) is defined as follows.
$$ BM = \left(\frac{N_{max}}{WS} - 1\right) \times K \times P \tag{3.2} $$
where WS is the window size. Since Nmax and K are fixed for a standard, the border
memory size depends only on the window size and the wordlength of state metrics. To
reduce the border memory size, we can either increase the window size or decrease the
wordlength of state metrics. Increasing the window size, however, increases the sizes of
the memories storing the forward and branch metrics, and the window size is usually set to
32 for an 8-state trellis. Therefore, we should decrease P to reduce the overall border memory
size. Otherwise the sliding window associated with the border memory may not be suitable
for hardware implementation, because a large border memory is indispensable for
3GPP-LTE, whose Nmax is 6144 (3072 in the case of radix-4 processing), and for WiMAX,
whose Nmax is 2400.

Table 3.1 Simulation environment

SISO Algorithm     Max-log-MAP
Window Size        32
Quantization       Received input : (4, 2)
                   Branch Metric : (10, 2)
                   State Metrics : (10, 2)
                   Extrinsic Information : (8, 2)
                   LLR value : (11, 2)

Figure 3.3: 3-bit border metric encoding function (horizontal axis: original value; vertical axis: encoded value)
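One way to store a border metric in fewer bits is a staircase mapping of the kind plotted in Figure 3.3, where each metric is replaced by a signed power of two or by zero. The sketch below is a minimal illustration under stated assumptions: the magnitude is rounded down to the largest allowed level not exceeding it, magnitudes below the smallest level map to zero, and the mapping is symmetric in sign. The level sets correspond to the 3-bit and 4-bit encodings; the function and constant names are hypothetical.

```python
LEVELS_3BIT = (16, 32, 64)                    # encoded set {0, +-16, +-32, +-64}
LEVELS_4BIT = (4, 8, 16, 32, 64, 128, 256)    # encoded set {0, +-4, ..., +-256}


def encode_border_metric(value, levels=LEVELS_3BIT):
    """Map a border metric to a signed power-of-two level (assumed rule).

    The magnitude is floored to the largest level not exceeding it, which
    also clips magnitudes above the largest level; the sign is preserved.
    """
    magnitude = abs(value)
    encoded = 0
    for level in levels:  # levels are listed in ascending order
        if magnitude >= level:
            encoded = level
    return encoded if value >= 0 else -encoded


for v in (100, -20, 10, 511, -512):
    print(v, encode_border_metric(v), encode_border_metric(v, LEVELS_4BIT))
```

The 3-bit level set has seven distinct values and the 4-bit set has fifteen, so each encoded metric fits in 3 or 4 bits instead of the 10 bits of the original state metric.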
The reduction of the border memory can be realized by allowing only a few values to represent
the border metrics. Though the reliability of the border metric is slightly decreased due to
the loss of accuracy, this can be totally recovered after a few trellis stages. A simple
encoding with low hardware complexity is to floor the original metric value to
the closest power-of-two number. The experimental environment for WiMAX is
indicated in Table 3.1, where (q, f) denotes a quantization scheme that uses q bits in total
and f bits to represent the fractional part. The final quantization schemes shown in Table
3.1 are determined by performing several simulations and referring to [6] and [8]. The
encoding function for the proposed 3-bit encoding is depicted in Figure 3.3. The encoding
function for the 4-bit encoding can be similarly defined. Possible values at the border are
listed in Table 3.2. As the range of the original border metrics is [-512, +511], which can be
represented with 10 bits, the proposed border metric encoding can be obtained by limiting
the value to [-256, +256] for the 4-bit encoding and [-64, +64] for the 3-bit encoding and
by allowing only power-of-two values.

Table 3.2 Encoded values for border metrics

Encoding Scheme    Encoded values for border metric
4-bit encoding     ±256, ±128, ±64, ±32, ±16, ±8, ±4, 0
3-bit encoding     ±64, ±32, ±16, 0

Figure 3.4: BER performance comparison with 8 iterations for 4800-bit frame.

Figure 3.5: BER performance of 1920-bit frame according to the number of
iterations (curves for the 1st, 2nd, 4th, 6th and 8th iterations).

In Figure 3.4, the BER performance of the
proposed encoding is compared with those of various methods. The schemes in which the
border metric is initialized with the value of the previous iteration degrade the performance
by about 0.02 dB in the waterfall region. If the SNR is higher than 1 dB, however, the
proposed 4-bit encoding shows about 0.1 dB better performance than the classical method
which uses the dummy calculation. It is well-known that we can obtain better performance
for the Max-log-MAP algorithm by scaling the extrinsic information [10]. Since the floor
function reduces the border metrics, its effect is similar to the extrinsic information scaling.
When the SNR is high, the performance of the 3-bit encoding is degraded by about 0.1 dB
compared to the other schemes as the values are restricted to a relatively small region. The
size of the border memory can be reduced significantly by using the proposed encoding. As
the iteration proceeds, the BER degradation resulting from the proposed encoding scheme
becomes negligible as shown in Figure 3.5. It has been reported that we can achieve higher
bandwidth efficiency for triple-binary turbo codes [18] and obtain better performance if the
number of states in the trellis increases [19]. Since these two factors increase the
complexity of the dummy backward metric calculation, the sliding window associated with
the proposed border metric encoding can be more effective for the future turbo codes.
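
The power-of-two flooring described in this section can be sketched as follows. This is an illustrative model, not the synthesized logic: the function names are hypothetical, the metrics are treated as plain integers, negative values are assumed to be handled symmetrically (the magnitude is floored and the sign kept, as Figure 3.3 suggests), and magnitudes below the smallest level in Table 3.2 are assumed to map to 0.

```python
def encode_border_metric(value: int, max_level: int, min_level: int) -> int:
    """Clamp |value| to max_level, floor it to a power of two, and keep the sign.

    Magnitudes below min_level are mapped to 0 (assumption based on Table 3.2).
    """
    magnitude = min(abs(value), max_level)
    if magnitude < min_level:
        return 0
    floored = 1 << (magnitude.bit_length() - 1)  # largest power of two <= magnitude
    return floored if value >= 0 else -floored


def encode_4bit(value: int) -> int:
    # Levels 256, 128, ..., 4, 0 from Table 3.2
    return encode_border_metric(value, 256, 4)


def encode_3bit(value: int) -> int:
    # Levels 64, 32, 16, 0 from Table 3.2
    return encode_border_metric(value, 64, 16)


for original in (511, 100, 40, 10, 3, -40, -512):
    print(original, encode_4bit(original), encode_3bit(original))
```

With a sign and eight magnitudes the 4-bit scheme uses 15 distinct codes, and the 3-bit scheme uses 7, which is why they fit in 4 and 3 bits respectively.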
3.3 Experimental Results

With the quantization indicated in Table 3.1, a Max-log-MAP decoder based on the
proposed border metric encoding was described in Verilog-HDL and synthesized with a
0.18 μm 4-Metal CMOS standard-cell library and compiled SRAM memories. Design
Compiler and Power Compiler of Synopsys were used for the synthesis and power
estimation, respectively. Switching activities resulting from gate-level simulation were
annotated for gate-level power estimation. The window size is set to 32 and the 4-bit
border metric encoding is employed. In the hardware implementation, the forward metrics
and backward metrics are normalized by subtracting the value of state 0, α_k(s0) and
β_k(s0), from other metrics at the same trellis stage in order to avoid overflow in state
metrics, which also eliminates the need to store the metric value of state 0. Since the SISO
decoder takes two systematic bits and two parity bits as inputs, the number of possible
branch metrics is 16 while the number of possible branch metrics is 4 in the classical
single-binary turbo codes. Among the 16 possible branch metrics, only 8 branch metrics
are distinguishable and sufficient to derive the others. Although the number of branch
metrics to be stored is reduced by half, the branch memory size is still considerable if the
conventional sliding window with the dummy calculation is adopted as indicated in Table
3.3. Even in the case that the sliding window is associated with the border memory, the
total memory size is increased because of the border memory requirement. By applying the
proposed border metric encoding method, the total memory size needed in the SISO
decoder is reduced by 20.7% as summarized in Table 3.3.
In Table 3.4, the energy consumption of the proposed SISO decoder is compared with
that of the conventional decoder, which is measured for 1.2 dB SNR and 8 iterations at the
operating frequency of 200MHz. Due to the increased computational complexity of the
double-binary turbo codes, the energy consumption of the SISO logic is also increased
compared to the classical single-binary turbo codes [6][7]. As shown in Table 3.4, the energy
consumption of the SISO logic is reduced by eliminating the dummy calculation. Also, as
shown in Figure 3.2(b), the energy consumption of the border memory is very low because
the memory is small and infrequently accessed. While processing a window, we need to
access the border memory only twice: once for a load and once for a store. For the
case of dummy calculation, however, the dummy calculation logic should operate almost
all the time as indicated in Figure 3.2(a). Therefore, the proposed SISO decoder can reduce
the energy consumption by 25.3% compared to the conventional SISO decoder based on
the dummy calculation and the table-based interleaver.

Table 3.3 Single-port SRAM size required for a SISO decoder

| Memory         | With Dummy Calculation | With Border Memory (No Encoding) | With Border Memory (4-bit Encoding) |
|----------------|------------------------|----------------------------------|-------------------------------------|
| Forward Memory | 2 banks, 32*(10*7) bits/bank, 4480 bits | 2 banks, 32*(10*7) bits/bank, 4480 bits | 2 banks, 32*(10*7) bits/bank, 4480 bits |
| Branch Memory  | 4 banks, 32*(10*8) bits/bank, 10240 bits | 2 banks, 32*(10*8) bits/bank, 5120 bits | 2 banks, 32*(10*8) bits/bank, 5120 bits |
| Border Memory  | N.A. | [(2400/32)-1]*(10*7) bits = 5180 bits | [(2400/32)-1]*(4*7) bits = 2072 bits |
| Total          | 14720 bits (100%) | 14780 bits (100.4%) | 11672 bits (79.3%) |

Table 3.4 Energy consumptions of SISO decoders

| Component      | With Dummy Calculation    | With Border Memory (4-bit Encoding) |
|----------------|---------------------------|-------------------------------------|
| SISO Logic     | 2347.9 pJ/bit/iter        | 1876.9 pJ/bit/iter                  |
| Branch Memory  | 1288.4 pJ/bit/iter        | 649.9 pJ/bit/iter                   |
| Forward Memory | 559.1 pJ/bit/iter         | 559.1 pJ/bit/iter                   |
| Border Memory  | N.A.                      | 49.4 pJ/bit/iter                    |
| Total          | 4195.4 pJ/bit/iter (100%) | 3135.3 pJ/bit/iter (74.7%)          |
Chapter 4
Bit-Level Extrinsic Information Exchange
The nonbinary turbo code has many advantages over the single-binary turbo code, but
its decoding requires more memory especially for storing the extrinsic information to be
exchanged between the two soft-input soft-output (SISO) decoders. To reduce the memory
size required for double-binary turbo decoding, this chapter presents a new method to
convert the symbolic extrinsic information to the bit-level information and vice versa. By
exchanging the bit-level extrinsic information, the number of extrinsic information values
to be exchanged in double-binary turbo decoding is reduced to the same amount as in
single-binary turbo decoding. Since the size of the extrinsic information memory is significant,
the proposed method is effective in reducing the total memory size needed in double-binary
turbo decoders. A double-binary turbo decoder is designed for the WiMAX standard
to verify the proposed method, which reduces the total memory size by 28.4%.
4.1 Extrinsic Information in Double-Binary Turbo Codes

A typical turbo decoder consists of two SISO decoders serially concatenated via an
interleaver. Focusing on non-binary turbo decoding, we describe in this Section the
conventional symbol-level extrinsic information and the implementation issues.
4.1.1 Symbol-level Extrinsic Information in Double-Binary Turbo Codes

In an m-ary turbo code where a symbol is represented in m bits, the number of possible
symbol-level extrinsic information values is 2^m - 1 [18]. Since the value of m is two for the
double-binary turbo code, three symbol-level extrinsic information values are defined as
follows [13][14].
\[ L_e^{z} = \ln \frac{p(u_k = z)}{p(u_k = 00)} \;\Leftrightarrow\; p(u_k = z) = p(u_k = 00)\,\exp[L_e^{z}] \tag{4.1} \]
where z belongs to {01, 10, 11}, u_k is the input symbol consisting of two bits, and p(·)
means the probability. The extrinsic information is exchanged iteratively between the two
SISO decoders during the whole decoding process. As indicated in (4.1), the extrinsic
information in the double-binary turbo code is defined as the log-ratio of the probabilities of
two input symbols, each of which consists of two bits. In non-binary turbo decoding, more extrinsic
information values are to be exchanged compared to the classical single-binary turbo
decoding that stores only one extrinsic information value. To store the increased number of
extrinsic information values, therefore, a large memory is needed in implementing a
nonbinary turbo decoder.
4.1.2 Memory Requirement in Double-Binary Turbo Decoder

A typical turbo decoder is based on the time-multiplex architecture that contains only
one SISO decoder, one interleaver, and one extrinsic memory. The first and second SISO
decoding processes of an iteration are time-multiplexed in the architecture [20]. To achieve
a high throughput, several SISO decoders can be employed to decode the turbo code in
parallel [18]. For the SISO decoder, the sliding window technique, where a large frame is
split into a number of small windows and the MAP decoding is applied to each window
independently, is widely used to reduce the memory needed to store metric values [16]. To
avoid the complex dummy metric calculation required in the sliding window technique, we
can adopt the border memory in nonbinary turbo decoding as described in the previous
chapter. However, the extrinsic information memory cannot be reduced even if the
sliding window technique is employed, and the size is determined by the largest frame size
that is 2400 pairs in the WiMAX [3].
The experimental environment for the WiMAX is indicated in Table 4.1, where (q, f)
denotes a quantization scheme that uses q bits in total and f bits to represent the fractional
part. Taking into account the quantization scheme indicated in Table 4.1, the memory size
required for a double-binary SISO decoder is summarized in Table 4.2, and that for storing
extrinsic information values in Table 4.3. It is crucial to reduce the extrinsic information
memory even if several SISO decoders are adopted for parallel decoding, as the extrinsic
information memory is much bigger than the memory required in SISO decoding as
indicated in Tables 4.2 and 4.3 and Figure 4.1.

Table 4.1 Simulation environment

| Parameter      | Value                        |
|----------------|------------------------------|
| SISO Algorithm | Max-log-MAP                  |
| Window Size    | 32                           |
| Quantization   | Received input : (4, 2)      |
|                | Branch Metric : (10, 2)      |
|                | State Metrics : (10, 2)      |
|                | Extrinsic Information : (8, 2) |
|                | LLR value : (11, 2)          |

Table 4.2 Memory Configuration for one SISO Decoder

| Memory                | Configuration                | Size      |
|-----------------------|------------------------------|-----------|
| Forward Metric Memory | 2 banks, 32*(10*7) bits/bank | 4480 bits |
| Branch Metric Memory  | 2 banks, 32*(10*8) bits/bank | 5120 bits |
| Border Metric Memory  | [(2400/32)-1]*(10*7) bits    | 5180 bits |

Table 4.3 Memory Configuration for the Extrinsic Information

| Extrinsic Info. Memory | Configuration             | Size        |
|------------------------|---------------------------|-------------|
| Le01                   | 2 banks, 8*2400 bits/bank | 38400 bits  |
| Le10                   | 2 banks, 8*2400 bits/bank | 38400 bits  |
| Le11                   | 2 banks, 8*2400 bits/bank | 38400 bits  |
| Total                  |                           | 115200 bits |
It has been reported that higher-order non-binary turbo codes are appropriate to achieve
higher data rate and bandwidth efficiency [13][19]. In the higher-order non-binary turbo
decoders such as triple-binary (m=3) and quaternary (m=4) turbo decoders [18][21], the
extrinsic information memory becomes much bigger, since the number of extrinsic
information values to be stored is proportional to 2^m - 1.

Figure 4.1: Memory Requirements in Double-Binary Turbo Decoder
4.2 Proposed Bit-Level Extrinsic Information Exchange

In this Section, we propose a new bit-level extrinsic information exchange method that
can reduce the number of extrinsic values to be exchanged between two SISO decoders by
deriving bit-level extrinsic information values from symbol-level extrinsic information
values and vice versa as shown in Figure 4.2. Specifically, we present two conversions,
symbol-to-bit conversion and bit-to-symbol conversion of the extrinsic information. With
the simple conversions, the number of values to be exchanged can be reduced from the
number of possible symbols, 2^m - 1, to the number of bits in a symbol, m, without inducing
any modifications to the conventional symbol-based double-binary SISO decoders.
Regardless of the bit-width of the extrinsic information, therefore, the proposed method is
effective in reducing the size of the extrinsic information memory.
4.2.1 Bit-level Extrinsic Information for Double-Binary Turbo Codes

Figure 4.2: Block diagram of the proposed bit-level extrinsic information exchange
(double-binary SISO decoder #1 outputs Le01, Le10, Le11; the symbol-to-bit converter (SBC),
shared with the hard decision unit, produces the bit-level values LbeA and LbeB; these are
stored in the reduced-size extrinsic memory and read in an interleaved order; the
bit-to-symbol converter (BSC), with negligible hardware overhead, restores Le01, Le10, Le11
for double-binary SISO decoder #2)

Like symbol-level extrinsic information in (4.1), two bit-level extrinsic information
values are defined as follows.
\[ \begin{aligned}
L_{be}^{A} &= \ln \frac{p(A=1)}{p(A=0)} \;\Leftrightarrow\; p(A=1) = p(A=0)\,\exp[L_{be}^{A}] \\
L_{be}^{B} &= \ln \frac{p(B=1)}{p(B=0)} \;\Leftrightarrow\; p(B=1) = p(B=0)\,\exp[L_{be}^{B}]
\end{aligned} \tag{4.2} \]
where the input symbol u_k consists of a pair of two bits, A and B, i.e., u_k = AB. The bit-level
probabilities in (4.2) can be derived from the symbol-level probabilities in (4.1) as
described below.
\[ p(A=0) = p(u_k=00) + p(u_k=01) \tag{4.3a} \]
\[ p(A=1) = p(u_k=10) + p(u_k=11) \tag{4.3b} \]
\[ p(B=0) = p(u_k=00) + p(u_k=10) \tag{4.3c} \]
\[ p(B=1) = p(u_k=01) + p(u_k=11) \tag{4.3d} \]
Considering the basic property of the probabilities,
\[ p(A=0) + p(A=1) = p(A=0)\,(1 + \exp[L_{be}^{A}]) = 1 \tag{4.4a} \]
\[ p(B=0) + p(B=1) = p(B=0)\,(1 + \exp[L_{be}^{B}]) = 1 \tag{4.4b} \]
Similarly, for the symbol-level probabilities,
\[ \begin{aligned}
& p(u_k=00) + p(u_k=01) + p(u_k=10) + p(u_k=11) \\
&\quad = p(u_k=00)\,(1 + \exp[L_e^{01}] + \exp[L_e^{10}] + \exp[L_e^{11}]) = 1
\end{aligned} \tag{4.5} \]
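From the above properties, p(u_k = 00) can be expressed as follows.

\[ p(u_k=00) = \frac{1}{(1+\exp[L_e^{01}])\,(1+\exp[L_{be}^{A}])} = \frac{1}{(1+\exp[L_e^{10}])\,(1+\exp[L_{be}^{B}])} \tag{4.6} \]
Additionally, the following equation can be derived from (4.3).

\[ p(u_k=11) - p(u_k=00) = p(B=1) - p(A=0) = p(A=1) - p(B=0) \tag{4.7} \]
Using (4.2), (4.4), and (4.7), we can derive

\[ \begin{aligned}
\exp[L_e^{11}] &= \frac{1}{p(u_k=00)} \left( \frac{\exp[L_{be}^{A}]}{1+\exp[L_{be}^{A}]} - \frac{1}{1+\exp[L_{be}^{B}]} \right) + 1 \\
&= \frac{1}{p(u_k=00)} \left( \frac{\exp[L_{be}^{B}]}{1+\exp[L_{be}^{B}]} - \frac{1}{1+\exp[L_{be}^{A}]} \right) + 1.
\end{aligned} \tag{4.8} \]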
By applying (4.6) to (4.8),
\[ \begin{aligned}
\exp[L_e^{11}] &= \exp[L_{be}^{A}]\,(1+\exp[L_e^{01}]) - \exp[L_e^{10}] \\
&= \exp[L_{be}^{B}]\,(1+\exp[L_e^{10}]) - \exp[L_e^{01}].
\end{aligned} \tag{4.9} \]
Since p(A=0) p(B=1) - p(A=1) p(B=0) = p(u_k=01) - p(u_k=10), we can obtain the
following relations from (4.6).
\[ \exp[L_e^{01}] = \frac{\exp[L_{be}^{B}] - \exp[L_{be}^{A}]}{1+\exp[L_{be}^{A}]} + \frac{1+\exp[L_{be}^{B}]}{1+\exp[L_{be}^{A}]}\,\exp[L_e^{10}] \tag{4.10a} \]
\[ \exp[L_e^{10}] = \frac{\exp[L_{be}^{A}] - \exp[L_{be}^{B}]}{1+\exp[L_{be}^{B}]} + \frac{1+\exp[L_{be}^{A}]}{1+\exp[L_{be}^{B}]}\,\exp[L_e^{01}] \tag{4.10b} \]
By considering both (4.9) and (4.10), we can obtain the relations among the symbol-level
extrinsic information values with the bit-level extrinsic information values.
\[ \begin{aligned}
\exp[L_e^{11}] &= \frac{\exp[L_{be}^{A}]\,(1+\exp[L_{be}^{B}])}{1+\exp[L_{be}^{A}]} + \frac{\exp[L_{be}^{A}+L_{be}^{B}] - 1}{1+\exp[L_{be}^{A}]}\,\exp[L_e^{10}] \\
&= \frac{\exp[L_{be}^{B}]\,(1+\exp[L_{be}^{A}])}{1+\exp[L_{be}^{B}]} + \frac{\exp[L_{be}^{A}+L_{be}^{B}] - 1}{1+\exp[L_{be}^{B}]}\,\exp[L_e^{01}]
\end{aligned} \tag{4.11} \]
Based on the relations discussed above, we can derive the symbol-to-bit conversion and
bit-to-symbol conversion of the extrinsic information.
4.2.2 Symbol-to-Bit Conversion of Extrinsic Information

From (4.1)-(4.3), we can derive two equations which relate the bit-level probabilities to
the extrinsic information values as shown below.
\[ \begin{aligned}
p(A=1) &= p(u_k=10) + p(u_k=11) \\
&= p(A=0)\,\exp[L_{be}^{A}] = [\,p(u_k=00) + p(u_k=01)\,]\,\exp[L_{be}^{A}]
\end{aligned} \tag{4.12a} \]
\[ \begin{aligned}
p(B=1) &= p(u_k=01) + p(u_k=11) \\
&= p(B=0)\,\exp[L_{be}^{B}] = [\,p(u_k=00) + p(u_k=10)\,]\,\exp[L_{be}^{B}]
\end{aligned} \tag{4.12b} \]
As a consequence, two bit-level extrinsic information values can be obtained from (4.12)
as follows.
\[ \begin{aligned}
L_{be}^{A} &= \ln \frac{p(u_k=10) + p(u_k=11)}{p(u_k=00) + p(u_k=01)}
= \ln \frac{\exp[L_e^{10}] + \exp[L_e^{11}]}{1 + \exp[L_e^{01}]}
\approx \max(L_e^{10}, L_e^{11}) - \max(0, L_e^{01}) \\
L_{be}^{B} &= \ln \frac{p(u_k=01) + p(u_k=11)}{p(u_k=00) + p(u_k=10)}
= \ln \frac{\exp[L_e^{01}] + \exp[L_e^{11}]}{1 + \exp[L_e^{10}]}
\approx \max(L_e^{01}, L_e^{11}) - \max(0, L_e^{10})
\end{aligned} \tag{4.13} \]
As expressed in (4.13), two bit-level extrinsic information values can be obtained from
three symbol-level extrinsic information values. Therefore, the number of values to be
stored in the memory can be reduced from three to two.
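
The max-based approximation in (4.13) maps directly onto two subtractions and three comparisons. The sketch below is an illustrative software model only; the function name and argument order are assumptions, and the inputs are treated as plain integers rather than the (8, 2) fixed-point format of Table 4.1.

```python
def symbol_to_bit(le01: int, le10: int, le11: int) -> tuple[int, int]:
    """Symbol-to-bit conversion using the max approximation of (4.13).

    Returns (LbeA, LbeB). The symbol 00 is the reference, so its value is 0.
    """
    lbe_a = max(le10, le11) - max(0, le01)
    lbe_b = max(le01, le11) - max(0, le10)
    return lbe_a, lbe_b


# Three symbol-level values go in, two bit-level values come out.
print(symbol_to_bit(3, 5, 8))
print(symbol_to_bit(-4, -2, -7))
```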
4.2.3 Bit-to-Symbol Conversion of Extrinsic Information

To keep the compatibility with the conventional symbol-based SISO decoder, we
should retrieve the symbol-level extrinsic information values from the bit-level extrinsic
information values as illustrated in Figure 4.2. Using the relations derived above, the
proposed bit-to-symbol conversion of the extrinsic information can be classified into four
cases by taking into account the sign values of the two bit-level extrinsic information
values.
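4.2.3.1 Case-I : LbeA ≥ 0 && LbeB ≥ 0

From (4.8)-(4.11), we can relate Le11 with the two bit-level extrinsic information values
as follows.
\[ \begin{aligned}
\exp[L_e^{11}] &= \frac{1}{p(u_k=00)} \left( \frac{\exp[L_{be}^{A}]}{1+\exp[L_{be}^{A}]} - \frac{1}{1+\exp[L_{be}^{B}]} \right) + 1 \\
&\approx \frac{1}{p(A=0)\,p(B=0)} \left( \frac{\exp[L_{be}^{A}]}{1+\exp[L_{be}^{A}]} - \frac{1}{1+\exp[L_{be}^{B}]} \right) + 1 \\
&= (1+\exp[L_{be}^{A}])\,(1+\exp[L_{be}^{B}]) \left( \frac{\exp[L_{be}^{A}]}{1+\exp[L_{be}^{A}]} - \frac{1}{1+\exp[L_{be}^{B}]} \right) + 1 \\
&\approx \exp[L_{be}^{A}] + \exp[L_{be}^{B}]
\end{aligned} \tag{4.14} \]
where p(u_k = 00) is approximated to p(A=0) p(B=0). From (4.14), Le11 can be
determined as follows.
\[ L_e^{11} \approx \ln\left( \exp[L_{be}^{A}] + \exp[L_{be}^{B}] \right) \approx \max(L_{be}^{A}, L_{be}^{B}) \tag{4.15} \]
Additionally, we can use (4.11) to approximate Le10 as follows.

\[ \begin{aligned}
L_e^{10} &= \ln\left\{ \frac{(1+\exp[L_{be}^{A}])\,\exp[L_e^{11}]}{\exp[L_{be}^{A}+L_{be}^{B}] - 1} - \frac{\exp[L_{be}^{A}]\,(\exp[L_{be}^{B}] + 1)}{\exp[L_{be}^{A}+L_{be}^{B}] - 1} \right\} \\
&\approx \ln\left\{ \exp[-L_{be}^{B}]\,\exp[L_e^{11}] - 1 \right\} \\
&\approx \ln \exp[L_e^{11} - L_{be}^{B}] = L_e^{11} - L_{be}^{B} \approx \max(L_{be}^{A}, L_{be}^{B}) - L_{be}^{B}
\end{aligned} \tag{4.16} \]
Similarly, Le01 ≈ Le11 - LbeA ≈ max(LbeA, LbeB) - LbeA.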
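4.2.3.2 Case-II : LbeA ≥ 0 && LbeB < 0

Using the relations between Le01 and Le10 in (4.10a)-(4.10b), we can obtain approximate
Le01 and Le10.
\[ \begin{aligned}
L_e^{10} &= \ln\left\{ \frac{\exp[L_{be}^{A}] - \exp[L_{be}^{B}]}{1+\exp[L_{be}^{B}]} + \frac{1+\exp[L_{be}^{A}]}{1+\exp[L_{be}^{B}]}\,\exp[L_e^{01}] \right\} \\
&\approx \ln \frac{\exp[L_{be}^{A}] - \exp[L_{be}^{B}]}{1+\exp[L_{be}^{B}]} \quad (\because\; p(u_k=01) \ll p(u_k=00)) \\
&\approx L_{be}^{A} \quad (\because\; L_{be}^{A} \ge 0 \;\&\&\; L_{be}^{B} < 0)
\end{aligned} \tag{4.17} \]
\[ \begin{aligned}
L_e^{01} &= \ln\left\{ \frac{\exp[L_{be}^{B}] - \exp[L_{be}^{A}]}{1+\exp[L_{be}^{A}]} + \frac{1+\exp[L_{be}^{B}]}{1+\exp[L_{be}^{A}]}\,\exp[L_e^{10}] \right\} \\
&\approx -\ln\left( 1+\exp[L_{be}^{A}] \right) + L_e^{10} \approx -L_{be}^{A} + L_e^{10} \approx 0 \quad (\text{from (4.17), } L_e^{10} \approx L_{be}^{A})
\end{aligned} \tag{4.18} \]
According to (4.9), Le11 can be expressed as follows.
\[ \begin{aligned}
L_e^{11} &= \ln\left\{ \exp[L_{be}^{B}]\,(1+\exp[L_e^{10}]) - \exp[L_e^{01}] \right\} \\
&\approx L_{be}^{B} + L_e^{10} \approx L_{be}^{A} + L_{be}^{B} \quad (\text{from (4.17), } L_e^{10} \approx L_{be}^{A})
\end{aligned} \tag{4.19} \]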
4.2.3.3 Case-III : LbeA < 0 && LbeB ≥ 0

In this case, symbol-level extrinsic information values can be retrieved similarly as
discussed in Case-II. Therefore,
\[ L_e^{01} \approx L_{be}^{B}, \qquad L_e^{10} \approx 0, \qquad L_e^{11} \approx L_{be}^{A} + L_{be}^{B} \tag{4.20} \]
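4.2.3.4 Case-IV : LbeA < 0 && LbeB < 0

Since the two bit-level extrinsic information values are less than zero in this case, we
can roughly approximate p(u_k = 00) as follows.
\[ p(u_k=00) \approx p(A=0)\,p(B=0) \approx \frac{1}{1 + \exp[L_{be}^{A}] + \exp[L_{be}^{B}]} \quad (\because\; L_{be}^{A} < 0 \;\&\&\; L_{be}^{B} < 0) \tag{4.21} \]
According to (4.8) and (4.21),
\[ \begin{aligned}
L_e^{11} &\approx \ln\left\{ \left( 1 + \exp[L_{be}^{A}] + \exp[L_{be}^{B}] \right) \left( \frac{\exp[L_{be}^{A}]}{1+\exp[L_{be}^{A}]} - \frac{1}{1+\exp[L_{be}^{B}]} \right) + 1 \right\} \\
&\approx \ln \frac{\exp[L_{be}^{A} + L_{be}^{B}]}{\exp[L_{be}^{A}] + \exp[L_{be}^{B}]} \quad (\because\; L_{be}^{A} < 0 \;\&\&\; L_{be}^{B} < 0) \\
&\approx L_{be}^{A} + L_{be}^{B} - \max(L_{be}^{A}, L_{be}^{B})
\end{aligned} \tag{4.22} \]
We can approximate (4.11) by considering LbeA < 0 and LbeB < 0.

\[ \begin{aligned}
\exp[L_e^{11}] &\approx \exp[L_{be}^{A}]\,(1+\exp[L_{be}^{B}]) + (\exp[L_{be}^{A}+L_{be}^{B}] - 1)\,\exp[L_e^{10}] \\
&\approx \exp[L_{be}^{B}]\,(1+\exp[L_{be}^{A}]) + (\exp[L_{be}^{A}+L_{be}^{B}] - 1)\,\exp[L_e^{01}] \\
\Rightarrow\; & \exp[L_{be}^{A}] - \exp[L_e^{10}] \approx \exp[L_{be}^{B}] - \exp[L_e^{01}] \approx 0 \quad (\because\; p(u_k=11) \ll p(u_k=00))
\end{aligned} \tag{4.23} \]
Therefore, Le01 ≈ LbeB and Le10 ≈ LbeA.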
As described above, the symbol-level extrinsic information values can be retrieved

from the bit-level extrinsic information values by using simple operations such as addition
and maximum as illustrated in Figure 4.3. Similarly, the symbol-to-bit conversion can be
derived for higher-order non-binary turbo codes by expressing the bit-level probabilities
with the symbol-level probabilities. After the symbol-level extrinsic information is
expressed with the bit-level extrinsic information, the bit-to-symbol conversion can be
obtained by applying appropriate approximations according to the sign values of the
bit-level extrinsic information.
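
The four cases above can be collected into a single function. The following is a minimal sketch under stated assumptions: the function name and the (Le01, Le10, Le11) return order are hypothetical, values are plain integers, and the case results are taken from (4.15)-(4.16), (4.17)-(4.19), (4.20), and (4.22)-(4.23) respectively.

```python
def bit_to_symbol(lbe_a: int, lbe_b: int) -> tuple[int, int, int]:
    """Bit-to-symbol conversion by sign case; returns (Le01, Le10, Le11)."""
    if lbe_a >= 0 and lbe_b >= 0:
        # Case I: Le11 = max, Le01 = Le11 - LbeA, Le10 = Le11 - LbeB
        le11 = max(lbe_a, lbe_b)
        return le11 - lbe_a, le11 - lbe_b, le11
    if lbe_a >= 0:
        # Case II (LbeB < 0): Le01 = 0, Le10 = LbeA, Le11 = LbeA + LbeB
        return 0, lbe_a, lbe_a + lbe_b
    if lbe_b >= 0:
        # Case III (LbeA < 0): Le01 = LbeB, Le10 = 0, Le11 = LbeA + LbeB
        return lbe_b, 0, lbe_a + lbe_b
    # Case IV: Le11 = LbeA + LbeB - max(LbeA, LbeB), i.e. min(LbeA, LbeB)
    return lbe_b, lbe_a, lbe_a + lbe_b - max(lbe_a, lbe_b)


for pair in ((5, 3), (4, -1), (-1, 4), (-2, -6)):
    print(pair, bit_to_symbol(*pair))
```

Only additions, subtractions, and one maximum are needed per case, which is consistent with the small converter complexity reported for this scheme.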
4.3 Experimental Results

The proposed conversion method is applied to decode the turbo code specified in the
WiMAX standard. With the experimental environment denoted in Table 4.1, Figure 4.4
shows the BER performance of a turbo decoder employing the proposed conversion
method, and compares it with that of the conventional turbo decoder [22], described in the
previous chapter, which uses the symbol-level extrinsic information directly. The BER
performances are measured for a coding rate of 1/2. Though some approximations are
applied in deriving the proposed conversions, they lead to only a slight degradation of the
signal-to-noise ratio (SNR), less than about 0.1 dB, since the extrinsic information does not
need to be exact in decoding [4][5]. Consequently, we can reduce the number of extrinsic
information values to be exchanged without inducing a considerable loss of error
correcting capability.

Figure 4.3: Proposed Bit-Level Extrinsic Information Exchange
(the symbol-to-bit converter (SBC) after SISO decoder #1 computes
LbeA = max(Le10, Le11) - max(0, Le01) and LbeB = max(Le01, Le11) - max(0, Le10); the
bit-to-symbol converter (BSC) before SISO decoder #2 selects one of the four cases below,
where MAX = max(LbeA, LbeB) and MIN = min(LbeA, LbeB))

| Case (sign of LbeA / LbeB) | Le01       | Le10       | Le11        |
|----------------------------|------------|------------|-------------|
| Case-I (+/+)               | MAX - LbeA | MAX - LbeB | MAX         |
| Case-II (+/-)              | 0          | LbeA       | LbeA + LbeB |
| Case-III (-/+)             | LbeB       | 0          | LbeA + LbeB |
| Case-IV (-/-)              | LbeB       | LbeA       | MIN         |

Figure 4.4: Comparison of BER performance of 8 iterations for 1920-bit frame.

Table 4.4 Single-port SRAM Size required for the Turbo Decoder

| Memory                 | Conventional [22]        | Proposed                 |
|------------------------|--------------------------|--------------------------|
| Forward Metric Memory  | 4480 bits/SISO           | 4480 bits/SISO           |
| Branch Metric Memory   | 5120 bits/SISO           | 5120 bits/SISO           |
| Border Metric Memory   | 2*5180 bits = 10360 bits | 2*5180 bits = 10360 bits |
| Extrinsic Info. Memory | 115200 bits              | 76800 bits               |
| Total (SISO x 1)       | 135160 bits (100%)       | 96760 bits (71.6%)       |
| Total (SISO x 5)       | 173560 bits (100%)       | 135160 bits (77.9%)      |
4.3.1 Hardware Implementation

With the quantization scheme indicated in Table 4.1, a turbo decoder associated with
the proposed conversion was designed in Verilog-HDL and synthesized with a 0.18 μm
4-Metal CMOS standard-cell library and compiled SRAM memories. The turbo decoder is
based on the time-multiplex architecture as shown in Figure 4.5. The memory size required
for the turbo decoder is summarized in Table 4.4. Since a separate border memory is
needed for each SISO decoding, two border memories are integrated in the decoder
implementation. By adopting the proposed bit-level extrinsic information exchange
method, the memory size required for the extrinsic information is reduced to two-thirds of
that of the conventional method as denoted in Table 4.4. When several SISO decoders are adopted
to achieve a higher throughput, the size of the state metric memory should be increased,
but the proposed conversion is still effective in reducing the total memory size, as the
extrinsic information memory is much larger than the state metric memory as indicated in
Table 4.4. Note that a parallel turbo decoder equipped with five SISO decoders is also
considered in Table 4.4. Additional techniques such as the non-uniform quantization of
extrinsic information [4][5] can be applied to reduce the extrinsic memory further, but the
extrinsic memory is still much bigger than other metric memories. The proposed
symbol-to-bit conversion can be performed in the hard-decision unit that makes hard-decided
values at the end of the decoding process [13], since the relations expressed in (4.13) are
the same as those needed to calculate the log-likelihood ratio values. Therefore, the only
hardware overhead caused by the proposed conversion method is the bit-to-symbol
converter (BSC) illustrated in Figure 4.6. As denoted in Figure 4.6, the complexity of the
proposed BSC is negligible compared to that of the total SISO decoder including the
dedicated hardware interleaver and the hard-decision unit.

Figure 4.5: Block diagram of the proposed double-binary turbo decoder
(SISO decoder with branch, forward, and border memories and a metric calculation unit; SBC
and BSC around the bit-level extrinsic memory; an address interleaver supplies the read
address, and an address queue supplies the write address delayed by the SISO latency)

Figure 4.6: Block diagram and complexity of the proposed bit-to-symbol converter
(gate counts: SISO decoder with interleaver, 47501 gates (100%); bit-to-symbol converter,
430 gates (0.9%))
Since the number of values per symbol required for m-ary extrinsic information is 2^m - 1,
the memory saving resulting from the proposed bit-level extrinsic information exchange
increases as m increases, although no results on performance loss have been reported yet
for m larger than 2.
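
The growth of the saving with m can be illustrated with a short calculation. This is a sketch under the WiMAX figures used in this chapter (2 banks, 8-bit values, 2400 symbols); the function names are hypothetical, and the m = 3 and m = 4 rows are pure arithmetic, not results reported in this work.

```python
def extrinsic_values_per_symbol(m: int, bit_level: bool) -> int:
    """Values stored per symbol: m for bit-level exchange, 2^m - 1 for symbol-level."""
    return m if bit_level else 2 ** m - 1


def extrinsic_memory_bits(m: int, bit_level: bool, n_max: int = 2400,
                          width: int = 8, banks: int = 2) -> int:
    """Extrinsic memory size, assuming the bank layout of Table 4.3."""
    return banks * width * n_max * extrinsic_values_per_symbol(m, bit_level)


for m in (2, 3, 4):
    symbol_bits = extrinsic_memory_bits(m, bit_level=False)
    bit_bits = extrinsic_memory_bits(m, bit_level=True)
    print(m, symbol_bits, bit_bits, round(bit_bits / symbol_bits, 3))
```

For m = 2 this reproduces the 115200-bit and 76800-bit entries of Table 4.4.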
Chapter 5
A 50Mbps Double-Binary Circular Turbo
Decoder for Mobile WiMAX
This chapter presents a double-binary turbo decoder developed for the WiMAX
standard. To reduce the large extrinsic memory needed in double-binary turbo decoding,
the proposed decoder exchanges the bit-level extrinsic information values rather than the
traditional symbol-level extrinsic information values by including two simple converters.
In addition, an optimized SISO decoder structure and a low-complexity hardware
interleaver are presented to achieve an area-efficient decoder implementation by generating
interleaved addresses for two data flows simultaneously. To verify the proposed
architecture, a double-binary turbo decoder is designed for WiMAX using a 0.13 μm
CMOS process. The decoder occupies an area of 2.24 mm² and provides up to 50 Mbps
throughput at 200 MHz while employing only a single SISO decoder.
5.1 Proposed Chip Architecture

For an efficient turbo decoder implementation, an optimized SISO decoder is necessary
especially when several SISO decoders are adopted to achieve high throughput. In addition,
by reducing the number of extrinsic information values, overall turbo decoder complexity
can be lowered. Focusing on the time-multiplex turbo decoder, we propose a new
double-binary turbo decoder architecture that includes a double-flow hardware interleaver.
5.1.1 Low-Complexity SISO Decoder Design

As the complexity of the metric calculation increases, the sliding window with border
memory is efficient for non-binary turbo decoding, because it can eliminate the need for the
complex dummy backward calculation. In the proposed architecture, we adopt the 4-bit
border metric encoding presented in Chapter 3 to reduce the border memory size. By applying the
border metric encoding, since the border memory size is constant regardless of the number
of SISO decoders in a turbo decoder, we can remove all complex dummy metric
calculation units at the expense of a small border memory and slight performance
degradation.
In addition, to lower the complexity of the SISO decoder, we exploit the branch metric
recovery scheme. Branch metrics, γ, are defined in double-binary turbo code as follows.
\[ \gamma_k = \left( x_k^{s1} y_k^{s1} + x_k^{s2} y_k^{s2} + x_k^{p1} y_k^{p1} + x_k^{p2} y_k^{p2} \right) + L_{e,IN}^{(z)} \tag{5.1} \]
where z belongs to {01, 10, 11}, x_k and y_k are transmitted and received codewords,
respectively, and we assume the binary phase shift keying modulation. The superscripts p
and s denote the parity bits and systematic bits, respectively. In (5.1), L_{e,IN}^{(z)} is the extrinsic
information received from the other SISO decoder. Since there are 16 unique branch
metrics in the double-binary turbo code, the branch metric memory size becomes
significant. Therefore, we propose a new branch metric recovery scheme which does not
store whole branch metric values, but stores only essential values required to recover the
branch metrics. From (5.1), we can obtain the following relation:
\[ \gamma_k = -\left( y_k^{s1} + y_k^{s2} \right) + x_k^{p1} y_k^{p1} + x_k^{p2} y_k^{p2} + \left( L_{e,IN}^{(z)} + L_i^{(z)} \right) \tag{5.2} \]
where L_i^{(z)} is the intrinsic information defined as follows.

\[ L_i^{01} = 2 y^{s2}, \qquad L_i^{10} = 2 y^{s1}, \qquad L_i^{11} = 2 y^{s1} + 2 y^{s2} \tag{5.3} \]
Therefore, by storing only essential sub-metrics as shown in Figure 5.1, we can
significantly reduce the memory size required to store branch metrics at the expense of the
simple calculation expressed in (5.2). We can reduce the branch memory size further by
keeping the bit-level extrinsic information and recovering the symbol-level extrinsic
information values from the bit-level values later, as indicated in Figure 5.1 as Step 4.

Figure 5.1: Branch metric memory width comparison
(memory size = width * depth; depth: 32 (window size))

| Step        | Stored values                                                                  | Width           |
|-------------|--------------------------------------------------------------------------------|-----------------|
| Step 1      | 16 unique branch metrics (16 x 10 bits)                                        | 160 bits (100%) |
| Step 2 [22] | 8 partial branch metrics (8 x 7 bits); 3 extrinsic info. Le01, Le10, Le11 (3 x 8 bits) | 80 bits (50%)   |
| Step 3      | Li(z) + Le(z) (3 x 9 bits); yp1, yp2 (2 x 4 bits); ys1 + ys2 (5 bits)           | 40 bits (25%)   |
| Step 4      | LeA, LeB (2 x 8 bits); ys1, ys2 (2 x 4 bits); yp1, yp2 (2 x 4 bits)             | 32 bits (20%)   |
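
The equivalence between the directly computed branch metric (5.1) and the recovered form (5.2) with the intrinsic terms of (5.3) can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, the BPSK mapping 0 -> -1, 1 -> +1 is an assumption (it is the mapping under which (5.3) holds), symbols are written as (A, B) = (s1, s2), and the sample values are arbitrary.

```python
from itertools import product


def bpsk(bit: int) -> int:
    """Assumed BPSK mapping: 0 -> -1, 1 -> +1."""
    return 2 * bit - 1


def branch_metric_direct(sys_bits, par_bits, y_s, y_p, l_e):
    """Direct form of (5.1): full correlation plus incoming extrinsic value."""
    corr = sum(bpsk(b) * y for b, y in zip(sys_bits, y_s))
    corr += sum(bpsk(b) * y for b, y in zip(par_bits, y_p))
    return corr + l_e[sys_bits]


def branch_metric_recovered(sys_bits, par_bits, y_s, y_p, l_e):
    """Recovered form of (5.2) using the intrinsic information of (5.3)."""
    l_i = {(0, 0): 0, (0, 1): 2 * y_s[1], (1, 0): 2 * y_s[0],
           (1, 1): 2 * y_s[0] + 2 * y_s[1]}
    parity = sum(bpsk(b) * y for b, y in zip(par_bits, y_p))
    return -(y_s[0] + y_s[1]) + parity + (l_e[sys_bits] + l_i[sys_bits])


# Arbitrary received samples and extrinsic values (symbol 00 is the reference).
y_s, y_p = (3, -5), (2, 7)
l_e = {(0, 0): 0, (0, 1): 4, (1, 0): -6, (1, 1): 9}
all_equal = all(
    branch_metric_direct(s, p, y_s, y_p, l_e) == branch_metric_recovered(s, p, y_s, y_p, l_e)
    for s, p in product(product((0, 1), repeat=2), repeat=2)
)
print(all_equal)
```

All 16 systematic/parity combinations agree, so only the sub-metrics of Figure 5.1 need to be stored.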
5.1.2 Bit-level Extrinsic Information Exchange

As discussed in Chapter 4, there are three symbol-level extrinsic information values to
be exchanged between two SISO decoders. To reduce the size, the extrinsic information
can be stored in bit-level. Two converters, symbol-to-bit converter and bit-to-symbol
converter, can be implemented with simple operations such as addition and maximum as
shown in Figure 5.2. Regardless of the bitwidth of the extrinsic information, which can be
reduced by non-uniform quantization [4][5], the proposed architecture can reduce the
extrinsic information memory by decreasing the number of extrinsic information values to
be exchanged.
5.1.3 Dedicated Hardware Interleaver

In turbo codes, the interleaver is involved in both encoding and decoding. The most
straightforward way to implement the address interleaving is to store interleaved addresses
in a memory [14]. In case of the WiMAX, the memory size required to store all the
interleaving patterns is about 90K bits since every frame size is associated with a different
interleaving pattern. This large-sized memory leads to significant area occupation and
power consumption. For 3G wireless systems, a dedicated hardware interleaver that
60
Le11 Le10
LbeA LbeB
Le01
ADD
MAX
MAX
MAX
MAX
MAX
+
ADD
ADD
ADD
ADD
ADD
Le11
LbeB
LbeA
(a)
Le10
Le01
(b)
Figure 5.2: Block diagram of the proposed (a) symbol-to-bit converter and (b) bit-tosymbol converter.
Step 1 : Switch alternate couples

for i = 0 ... N-1
If (i mod 2 == 1) let (Ai, Bi) = (Bi, Ai) (i.e., switch the couple)
Step 2 : Inter-symbol permutation
for i = 0 ... N-1
switch i mod 4 :
case 0 : P(i) = (P0i + 1)mod N
case 1 : P(i) = (P0i + 1 + N/2 + P1)mod N
case 2 : P(i) = (P0i + 1 + P2)mod N
case 3 : P(i) = (P0i + 1 + N/2 + P3)mod N
Figure 5.3: Interleaving procedure for the WiMAX

generates interleaved addresses on-the-fly has been proposed to achieve small area [20].
Such a dedicated interleaver is also effective in reducing power consumption as there is no
need to include a large-sized memory. Due to the property of the interleaver adopted in the
WiMAX, the dedicated hardware interleaver can be implemented efficiently. Figure 5.3
describes how to calculate the interleaved addresses on-the-fly for the WiMAX, where P0,
P1, P2 and P3 are determined according to the frame length, N, as denoted in Table 5.1 [3].
The first step can be simply accomplished by switching the values according to the least
significant bit (LSB) of the address. Figure 5.4 illustrates the hardware structure for the
Table 5.1 CTC Interleaver Parameters for WiMAX

N (pair)    P0    P1    P2    P3
24           5     0     0     0
36          11    18     0    18
48          13    24     0    24
72          11     6     0     6
96           7    48    24    72
108         11    54    56     2
120         13    60     0    60
144         17    74    72     2
180         11    90     0    90
192         11    96    48   144
216         13   108     0   108
240         13   120    60   180
480         53    62    12     2
960         43    64   300   824
1440        43   720   360   540
1920        31     8    24    16
2400        53    66    24     2
P(i) = (P0*i + Init) mod N
     = [(P0*i) mod N + (Init) mod N] mod N
Figure 5.4: Interleaver structure based on the incremental calculation
(Init0 to Init3 are selected by Addr[1:0]; P0 is accumulated in a flip-flop; N is subtracted according to the MSB (sign) of the difference; the output is the permutated address.)

second step, that is, the inter-symbol permutation. Since the input address increases
sequentially, the permuted address can be generated by accumulating P0 and adding the
result to an initial value selected by the two LSBs of the address. To replace the
complicated modulo operation, N is subtracted in Figure 5.4 whenever an intermediate
value is not less than N. The initial values are pre-calculated and kept in a small table.
Figure 5.5: Need of LIFO for Interleaved Address
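The incremental calculation can be illustrated with a short software model. The following Python sketch is an illustration under stated assumptions, not the RTL of the chip: the function names are hypothetical, the four initial values are taken directly from the constants of Figure 5.3, and P0 is assumed to be smaller than N (which holds for every row of Table 5.1). It compares the accumulator-based address generation, where each modulo is replaced by one conditional subtraction, with a direct evaluation of P(i).

```python
def initial_values(N, P1, P2, P3):
    """Four initial values selected by the two LSBs of the address (Figure 5.3, step 2)."""
    return [1 % N, (1 + N // 2 + P1) % N, (1 + P2) % N, (1 + N // 2 + P3) % N]


def direct_addresses(N, P0, P1, P2, P3):
    """Reference model: evaluate P(i) = (P0*i + Init) mod N with a true modulo."""
    init = initial_values(N, P1, P2, P3)
    return [(P0 * i + init[i % 4]) % N for i in range(N)]


def incremental_addresses(N, P0, P1, P2, P3):
    """Accumulate P0; every modulo becomes a single conditional subtraction of N."""
    init = initial_values(N, P1, P2, P3)
    acc = 0  # always holds (P0 * i) mod N
    out = []
    for i in range(N):
        addr = acc + init[i & 3]   # both operands are below N, so addr < 2N
        if addr >= N:
            addr -= N
        out.append(addr)
        acc += P0                  # requires P0 < N
        if acc >= N:
            acc -= N
    return out


# Parameters for N = 24 and N = 2400 as listed in Table 5.1.
small = incremental_addresses(24, 5, 0, 0, 0)
large = incremental_addresses(2400, 53, 66, 24, 2)
```

Because both the accumulator and the selected initial value stay below N, one subtraction per addition is enough, which is the reason the modulo operator can be dropped.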
5.1.4 Dedicated Double-Flow Hardware Interleaver

As shown in Figure 5.5, the turbo decoder requires a Last-In First-Out (LIFO) memory
that holds the interleaved addresses. To remove this address LIFO, a double-flow interleaver
based on the hardware interleaver of Section 5.1.3 is developed, which generates the write
addresses as well as the read addresses for the extrinsic information memory, as
shown in Figure 5.6. With two sets of initial values, two addresses can be generated on-the-fly
from one shared accumulator. Since the write addresses are generated on-the-fly, the
address LIFO required to hold interleaved addresses in the time-multiplex architecture [20]
can be removed. The initial values can be managed with a small look-up table
and simple update logic.
P(W-1-i) = (P0*(W-1-i) + Init) mod N
         = [-(P0*i) mod N + (P0*(W-1)) mod N + (Init) mod N] mod N
P(i) = (P0*i + Init) mod N
     = [(P0*i) mod N + (Init) mod N] mod N
Figure 5.6: Double-flow hardware interleaver based on incremental calculation.
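As a rough software analogue of the double-flow idea, the Python sketch below derives an increasing and a decreasing address stream from one shared accumulator. It is a sketch under assumptions: the function name is hypothetical, the window is assumed to start at index 0 and to have length W, and the decreasing-flow table is assumed to be pre-computed as (P0*(W-1) + Init) mod N, following the two expressions shown with Figure 5.6.

```python
def double_flow(N, P0, init, W):
    """Yield (increasing_addr, decreasing_addr) for i = 0..W-1 from one accumulator.

    init holds the four initial values of Figure 5.3, already reduced mod N.
    increasing_addr is P(i); decreasing_addr is P(W-1-i).
    """
    # Pre-computed table for the decreasing flow: (P0*(W-1) + Init_k) mod N.
    init_dec = [(P0 * (W - 1) + v) % N for v in init]
    acc = 0  # (P0 * i) mod N, shared by both flows
    for i in range(W):
        inc = acc + init[i & 3]
        if inc >= N:
            inc -= N
        # Table entry is chosen by the two LSBs of the decreasing index W-1-i.
        dec = init_dec[(W - 1 - i) & 3] - acc
        if dec < 0:
            dec += N
        yield inc, dec
        acc += P0
        if acc >= N:
            acc -= N


# N = 24, P0 = 5, P1 = P2 = P3 = 0 (Table 5.1) gives the initial values below.
pairs = list(double_flow(24, 5, [1, 13, 1, 13], 24))
```

Only the small initial-value table differs between the two flows, so the second address stream costs one extra adder and one conditional correction rather than a second accumulator.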
5.1.5 Early Stopping Criterion

Since turbo decoding proceeds in an iterative fashion, once further iterations fail to improve
the decoding accuracy, the iterative process should be terminated by a stopping criterion
in order to reduce the decoding delay or, more importantly, to reduce the power consumption
of power-constrained systems such as mobile hand-held devices. In the proposed
double-binary turbo decoder, a simple stopping criterion that utilizes the bit-level
extrinsic information is devised as follows.
5.1.5.1 Proposed Stopping Criterion for Double-Binary Turbo Decoder

When the double-binary SISO decoder generates the LLR values, the proposed
hardware compares the four sign values which are defined as follows.
$$S_e^A = \mathrm{sign}\left(L_{be,IN}^{A}\right), \quad S_e^B = \mathrm{sign}\left(L_{be,IN}^{B}\right), \quad S_{llr}^A = \mathrm{sign}\left(LLR^A\right), \quad S_{llr}^B = \mathrm{sign}\left(LLR^B\right) \tag{5.4}$$
At the end of SISO decoding, the proposed double-binary turbo decoder stops if the
following condition is satisfied for all pairs in a frame.
$$\left(S_e^A = S_{llr}^A\right) \;\&\&\; \left(S_e^B = S_{llr}^B\right) \tag{5.5}$$
The proposed stopping criterion can be implemented with low hardware complexity as
Figure 5.7: Hardware implementation of the proposed stopping criterion (inputs SeA, SllrA, SeB and SllrB; a flip-flop with RST, EN and CLK; output STOP).

Figure 5.8: Block diagram of the proposed double-binary turbo decoder
(BSC: Bit-to-Symbol Converter, SBC: Symbol-to-Bit Converter)

shown in Figure 5.7. If the above condition is satisfied for all pairs in a frame, then STOP
indicates 1. However, if there is a pair which cannot satisfy the above condition, then the
value of STOP stays 0 during the SISO decoding. The flip-flop used in this scheme is reset
when the new SISO decoding starts.
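The behavior of the criterion can be sketched in software as follows. This is a minimal behavioral model, not the circuit of Figure 5.7: the function names are hypothetical, and the sign convention (1 for negative values, 0 otherwise) is an assumption that does not affect the result because only equality of signs is tested.

```python
def sign_bit(x):
    """Sign value of a soft value; the 1-for-negative convention is an assumption."""
    return 1 if x < 0 else 0


def stop_after_siso(lbe_a_in, lbe_b_in, llr_a, llr_b):
    """Return True when condition (5.5) holds for every pair of the frame.

    lbe_a_in, lbe_b_in: incoming bit-level extrinsic values for bits A and B.
    llr_a, llr_b: LLR values produced by the current SISO decoding.
    """
    stop = True  # state after the reset at the start of a SISO decoding
    for ea, eb, la, lb in zip(lbe_a_in, lbe_b_in, llr_a, llr_b):
        agree = sign_bit(ea) == sign_bit(la) and sign_bit(eb) == sign_bit(lb)
        stop = stop and agree  # a single disagreeing pair keeps STOP at 0
    return stop


converged = stop_after_siso([3, -2, 5], [-1, -4, 2], [7, -9, 6], [-3, -8, 4])
not_yet = stop_after_siso([3, -2, 5], [-1, -4, 2], [7, 9, 6], [-3, -8, 4])
```

No decision bits from a previous iteration have to be stored, since the comparison uses only values that are already available during the current SISO decoding.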
5.2 Implementation Results

To verify the proposed architecture, we implemented a double-binary turbo decoder for
WiMAX. The proposed time-multiplex turbo decoder consists of a single SISO decoder,
two converters, the double-flow dedicated interleaver, and the memory that holds the bit-level
extrinsic information, as shown in Figure 5.8. In an iteration, the data are accessed in
sequential order for the first SISO decoding and in interleaved order for the second
Figure 5.9: Average number of iterations for the proposed turbo decoder
(plot annotations: 8 maximum iterations; 24.26Mbps, 32.35Mbps, 50.20Mbps)
Figure 5.10: Comparison of BER performance for 1920-bit frame.

decoding. To avoid unnecessary iterations at a high signal-to-noise ratio (SNR), a simple
early stopping criterion is employed in the implementation: the sign values of the
incoming bit-level extrinsic information are compared with the hard-decision bits.
The effect of the stopping criterion is shown in Figure 5.9. The BER performance of
the proposed turbo decoder is shown in Figure 5.10, where it is compared with that of a
conventional turbo decoder based on the symbol-level extrinsic information and the
traditional sliding window technique [22]. Though some approximations are applied in
deriving the proposed conversion, they lead to only a slight SNR degradation of about
Table 5.2 Single-port SRAM Size Required for the Turbo Decoder

                          Conventional [22]            Proposed
Forward Metric Memory     4480 bits/SISO               4480 bits/SISO
Branch Metric Memory      5120 bits/SISO               2048 bits/SISO
Border Metric Memory      2*2072 bits = 4144 bits      2*2072 bits = 4144 bits
Extrinsic Info. Memory    115200 bits                  76800 bits
Total (SISO x 1)          128944 bits (100%)           87472 bits (67.8%)
Total (SISO x 4)          157744 bits (100%)           107056 bits (67.9%)

Figure 5.11: Die photo of the proposed double-binary turbo decoder chip
(die labels: Core, Test ROM, Border Buffer, One SISO Decoder, Dedicated Interleaver, MEMORY)

Technology                    130nm 1-poly 6 Metal
Size                          1.4mm x 1.6mm
Gate Count (NAND2 Equiv.)     64.2K Gates
Operating Frequency           200MHz
0.1dB, since the extrinsic information does not need to be exact in decoding [4][5]. The
memory size required for the proposed turbo decoder is summarized in Table 5.2. By
adopting the proposed bit-level extrinsic information exchange, the memory size required
for the extrinsic information is reduced to two-thirds of that of the conventional method, as
listed in Table 5.2. When several SISO decoders are adopted to achieve a higher throughput [23],
the size of the state metric memory increases, but the proposed conversion is
still effective in reducing the total memory size, as the extrinsic information memory is
much larger than the state metric memory, as indicated in Table 5.2.
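The totals of Table 5.2 can be reproduced with a few lines of arithmetic. The Python sketch below assumes, consistently with the table entries, that the forward and branch metric memories are replicated per SISO decoder while the border metric memory and the extrinsic information memory are shared; the helper name and dictionary layout are hypothetical.

```python
def total_bits(mem, n_siso):
    """Total SRAM bits: per-SISO memories are replicated, shared memories are not."""
    per_siso = mem["forward"] + mem["branch"]
    shared = mem["border"] + mem["extrinsic"]
    return n_siso * per_siso + shared


# Sizes in bits, copied from Table 5.2.
conventional = {"forward": 4480, "branch": 5120, "border": 4144, "extrinsic": 115200}
proposed = {"forward": 4480, "branch": 2048, "border": 4144, "extrinsic": 76800}

for n in (1, 4):
    conv = total_bits(conventional, n)
    prop = total_bits(proposed, n)
    print(f"SISO x {n}: {conv} -> {prop} bits ({100 * prop / conv:.1f}%)")
```

The extrinsic information memory dominates both totals, which is why the relative saving stays close to two-thirds even when four SISO decoders are used.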
Figure 5.11 summarizes the implementation results. The proposed turbo decoder is
implemented in a 0.13µm 1-poly 6-metal standard CMOS process. The decoder occupies
2.24mm² and takes 4,948 cycles per iteration to process a 2400-pair (4800-bit) frame.
As a result, the proposed decoder provides up to 50Mbps at a clock frequency of 200MHz.
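The relation between the cycle count and the throughput can be checked with a small calculation. The Python sketch below assumes that every iteration costs the same 4,948 cycles and that there is no additional per-frame overhead; the function name is hypothetical.

```python
def throughput_mbps(frame_bits, cycles_per_iteration, iterations, f_clk_hz):
    """Decoded information bits per second, in Mb/s, for a fixed iteration count."""
    seconds_per_frame = cycles_per_iteration * iterations / f_clk_hz
    return frame_bits / seconds_per_frame / 1e6


eight_iterations = throughput_mbps(4800, 4948, 8, 200e6)
four_iterations = throughput_mbps(4800, 4948, 4, 200e6)
print(f"{eight_iterations:.2f} Mb/s with 8 iterations")
print(f"{four_iterations:.2f} Mb/s with 4 iterations")
```

Under these assumptions eight iterations give about 24.3Mb/s, and a rate near 50Mb/s corresponds to roughly four iterations on average, which is what the early stopping criterion makes possible at high SNR.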
Chapter 6
A Unified Parallel Radix-4 Turbo Decoder
for Mobile WiMAX and 3GPP-LTE
This chapter presents a unified parallel radix-4 turbo decoder architecture developed
to support both the Mobile WiMAX and the 3GPP-LTE standards. To achieve a
decoding rate of more than 100Mb/s with low power consumption, the proposed decoder
mainly consists of eight retimed radix-4 SISO decoders and a dual-mode parallel hardware
interleaver that supports both the almost regular permutation (ARP) interleaver and the
quadratic polynomial permutation (QPP) interleaver defined in the two standards. A prototype
chip supporting both Mobile WiMAX and 3GPP-LTE standards is fabricated using a
0.13µm CMOS process. The decoder core occupies 10.7mm² and achieves a decoding
rate of more than 100Mb/s with eight iterations at an energy efficiency of
0.31nJ/bit/iter for both standards.
6.1 Proposed Chip Architecture

Figure 6.1 shows the architecture of the proposed unified turbo decoder, where one
SISO decoder is responsible for the two decoding steps of one iteration in a time-multiplexed
manner. In addition, the decoder contains two low-complexity converters needed for the bit-level
extrinsic information exchange, which reduces the number of extrinsic information
values from three to two in the case of double-binary turbo decoding. Extrinsic information
values are exchanged between the two SISO decodings by storing them in, and loading them
from, an extrinsic information memory. The memory is accessed in sequential order for the first
decoding and in interleaved order for the second decoding. The SISO decoder
Figure 6.1: Overall Unified Turbo Decoder Architecture with Time-Multiplexing
(SBC: Symbol-to-Bit Converter, BSC: Bit-to-Symbol Converter. Some blocks are marked as not activated in 3GPP-LTE mode. Early Termination Unit: stop decoding if (dA == Sign(LbeA)) && (dB == Sign(LbeB)) for all pairs in a frame.)

calculates the extrinsic information values iteratively, and stops decoding when the
stopping criterion for radix-4 processing introduced in Chapter 5 is satisfied or the
iteration number reaches the pre-determined limit. At the last SISO decoding, hard-decision
values are calculated and stored in a decision memory. This section explains
in detail the proposed chip architecture, which enables the hardware resources to be shared
by both standards, Mobile WiMAX and 3GPP-LTE.
6.1.1 Parallel Turbo Decoding

In the proposed chip, to support the high-speed data transmission of 4G mobile
communication systems such as Mobile WiMAX and 3GPP-LTE, eight SISO decoders are
adopted, and some or all of them operate in parallel. The overall architecture of
the proposed parallel turbo decoder chip is depicted in Figure 6.2. The number of SISO
decoders involved in the decoding, M, is scalable according to the frame size, N.
Each SISO decoder decodes a sub-frame of length L = N/M. If M is less than eight, not all the
Figure 6.2: The Proposed Chip Architecture with Eight SISO Decoders
(Frame MEM #0 to #7, Extrinsic MEM #0 to #7 and Decision MEM #0 to #7 are connected to Radix-4 SISO #0 to #7 through the READ / WRITE exchange network.)
SISO decoders are activated. The clocks of the deactivated SISO decoders are gated to
reduce their power consumption. Each SISO decoder reads/writes the
appropriate values from/to the specific memory determined by the address provided by the
interleaver, and can access the exchange network independently without collision owing to
the collision-free property of the ARP/QPP interleavers defined in the Mobile WiMAX/3GPP-LTE standards [31].
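The collision-free property can be checked by brute force for a given permutation. The Python sketch below is illustrative only: it assumes that memory bank t stores the positions t*L to (t+1)*L-1 and that all M decoders process the same offset j of their sub-frames in the same cycle. The function names are hypothetical, and the QPP coefficients f1 = 3 and f2 = 10 for N = 40 are used merely as example inputs; actual values should be taken from the standard [27].

```python
def qpp(i, N, f1, f2):
    """Quadratic polynomial permutation address for index i."""
    return (f1 * i + f2 * i * i) % N


def is_collision_free(perm, M):
    """True if, in every cycle, the M parallel accesses fall into M different banks."""
    N = len(perm)
    L = N // M  # sub-frame length handled by one SISO decoder
    for j in range(L):
        banks = {perm[j + t * L] // L for t in range(M)}
        if len(banks) != M:
            return False
    return True


N, M = 40, 8
perm = [qpp(i, N, 3, 10) for i in range(N)]
print(is_collision_free(perm, M))

# A counter-example: after this swap two decoders would hit bank 0 in cycle 0.
broken = list(range(N))
broken[1], broken[5] = broken[5], broken[1]
print(is_collision_free(broken, M))
```

When the check passes, each SISO decoder can be connected to the memories through a simple exchange network without arbitration or stall logic.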
6.1.2 Unified Radix-4 SISO Decoder with Retiming

As described in Chapter 2, the metrics required in the SISO decoder can be calculated
with almost the same hardware in both modes because their mathematical expressions are similar.
For the branch metrics, the branch metric recovery scheme proposed in the first chip
implementation can be applied to (6), leading to a reduction of the branch memory. Also, for the
calculation of the forward and backward metrics, a max operation with four operands is
required, as shown in Figure 6.3(a).
However, the long computation time of the cascaded 2-input ACS units shown in
Figure 6.3 forms the critical path of the decoder and limits the operating frequency. To
reduce the critical path delay, retiming is applied, which is a transformation technique used to change
Figure 6.3: Add-Compare-Select (ACS) block with Retiming
(Annotations: Td = 2TCMP + 2TADD; Td = 2TCMP + TADD; Td = 2TCMP (CMP & ADD are overlapped); "Not Stored in Forward Buffer"; "Metric to be stored in Forward Buffer".)
the locations of delay elements in a circuit without affecting the input/output
characteristics of the circuit [24]. After applying retiming and migrating the common
operator, the critical path delay can be reduced significantly at the cost of increased
hardware resources, as indicated in Figure 6.3. However, in this case, the metrics to be kept
in the registers during the recursion are different from the original forward/backward metrics
Figure 6.4: Sliding Window with Register Retiming
(horizontal axis: Processing Time; phases: Calc. & Store, Calc. & Load, Border Metric Load, Border Metric Store, LLR & Le Calc.)
as shown in Figure 6.3. The new metrics to be kept during the recursion in the
forward/backward directions can be expressed as follows.
$$\tilde{\alpha}_k^{(z)}(s_k) = \alpha_k(s_k) + \gamma_{k+1}^{(z)}(s_k \to s_{k+1}) \tag{6.1}$$
$$\tilde{\beta}_{k+1}^{(z)}(s_k) = \gamma_{k+1}^{(z)}(s_k \to s_{k+1}) + \beta_{k+1}(s_{k+1}) \tag{6.2}$$
where z belongs to {00,01,10,11}. Therefore, the number of metrics to be kept during
the recursion increases from 8 to 32. To avoid an increase of the forward metric buffer,
which holds the metric values to be used for LLR calculation, a recovery circuit that
retrieves the original forward/backward metrics is added, as shown in Figure 6.3. Therefore,
the forward metric buffer holds the original metric values rather than the new metrics
defined in (6.1)-(6.2), which avoids an increase of the buffer size.
Figure 6.4 shows the sliding window of the proposed SISO decoder. For efficient LLR
calculation, during the second half of the window, the original metrics loaded from
the internal buffer and the new metrics defined in (6.1)-(6.2) are considered together as
follows.
$$\Lambda_k^{(z)} = \max_{(s_k \to s_{k+1},\, z)} \left[ \tilde{\alpha}_k^{(z)}(s_k) + \beta_{k+1}(s_{k+1}) \right] - \max_{(s_k \to s_{k+1},\, 00)} \left[ \tilde{\alpha}_k^{(00)}(s_k) + \beta_{k+1}(s_{k+1}) \right] \tag{6.3}$$
$$\Lambda_k^{(z)} = \max_{(s_k \to s_{k+1},\, z)} \left[ \alpha_k(s_k) + \tilde{\beta}_{k+1}^{(z)}(s_k) \right] - \max_{(s_k \to s_{k+1},\, 00)} \left[ \alpha_k(s_k) + \tilde{\beta}_{k+1}^{(00)}(s_k) \right] \tag{6.4}$$
where z belongs to {01,10,11}.

Figure 6.5: Input Frame Memory Configurations for (a) Double-Binary Turbo Decoding Mode and (b) Radix-4 Single-Binary Turbo Decoding Mode
(In (b), the systematic and parity inputs are stored in separate memories for the even time index, k = 0, 2, ..., and the odd time index, k = 1, 3, ....)


Accordingly, the turbo decoder can operate at a higher operating frequency owing to the
reduced critical path delay, which leads to a higher peak throughput. Alternatively, for a given
throughput specification, the supply voltage can be lowered, which lowers the operating
frequency of the decoder and reduces the power consumption.
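The following Python sketch illustrates why keeping the retimed metrics does not change the LLR: a forward metric with the branch metric already added, or a branch metric with the backward metric already added, yields the same maximum as the original three-term sum. It is a numerical illustration under assumptions, not the decoder datapath: the trellis connectivity in next_state is hypothetical, the metric values are random integers, and all function names are invented for this example.

```python
import random

NUM_STATES = 8
INPUTS = (0b00, 0b01, 0b10, 0b11)


def next_state(s, z):
    """Hypothetical trellis connectivity, used only to make the example runnable."""
    return (2 * s + z) % NUM_STATES


def llr_original(alpha, gamma, beta, z):
    """Max-log LLR of symbol z against 00 from the original metrics."""
    def best(u):
        return max(alpha[s] + gamma[s][u] + beta[next_state(s, u)]
                   for s in range(NUM_STATES))
    return best(z) - best(0)


def llr_forward_retimed(alpha_t, beta, z):
    """Same LLR from retimed forward metrics alpha_t[s][u] = alpha[s] + gamma[s][u]."""
    def best(u):
        return max(alpha_t[s][u] + beta[next_state(s, u)] for s in range(NUM_STATES))
    return best(z) - best(0)


def llr_backward_retimed(alpha, beta_t, z):
    """Same LLR from retimed backward metrics beta_t[s][u] = gamma[s][u] + beta[next]."""
    def best(u):
        return max(alpha[s] + beta_t[s][u] for s in range(NUM_STATES))
    return best(z) - best(0)


rng = random.Random(0)
alpha = [rng.randint(-64, 63) for _ in range(NUM_STATES)]
beta = [rng.randint(-64, 63) for _ in range(NUM_STATES)]
gamma = [[rng.randint(-32, 31) for _ in INPUTS] for _ in range(NUM_STATES)]

# 8 states x 4 symbols = 32 retimed metrics per direction instead of 8.
alpha_t = [[alpha[s] + gamma[s][z] for z in INPUTS] for s in range(NUM_STATES)]
beta_t = [[gamma[s][z] + beta[next_state(s, z)] for z in INPUTS]
          for s in range(NUM_STATES)]
```

The sketch also makes the cost visible: each direction keeps 32 values instead of 8, which is why the recovery of the original metrics before buffering matters for the memory size.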
6.1.3 Memory-Sharing with Bit-level Extrinsic Information

In turbo decoder implementation, the input frame memory and the extrinsic information
memory occupy significant area, as described in Chapter 4. However, the extrinsic
information memory for the double-binary turbo decoder can be reduced by applying the
bit-level extrinsic information exchange performed in the symbol-to-bit / bit-to-symbol
converters shown in Figure 6.1. In this case, two extrinsic information values are exchanged
between the two SISO decodings even though three extrinsic information values exist for
double-binary turbo codes. Therefore, as shown in Figure 6.1, the extrinsic information memory can
be shared with radix-4 single-binary turbo decoding, which also requires two extrinsic
information values to be exchanged.
As shown in Figure 6.5(a), six memory banks are required to store input frames in
double-binary turbo decoding mode (two systematic inputs and four parity inputs). For
input memory sharing, the parity property of the QPP interleaver in 3GPP-LTE, which means
that even (odd) positions in the input are mapped to even (odd) positions in the output, is
exploited. Even in the second SISO decoding, where the values are read in an interleaved
order, the memory configuration shown in Figure 6.5(b) prevents the two values required in
each cycle from residing in the same memory, owing to the parity property. Therefore, the
proposed memory partitioning according to the parity of the time index is useful in radix-4
processing.
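The parity property is easy to verify numerically. The Python sketch below assumes an even frame size N, an odd f1 and an even f2, under which f1*i + f2*i*i has the same parity as i; the coefficients f1 = 3 and f2 = 10 for N = 40 are example inputs only, and the function names are hypothetical.

```python
def qpp(i, N, f1, f2):
    """Quadratic polynomial permutation address for index i."""
    return (f1 * i + f2 * i * i) % N


def preserves_parity(N, f1, f2):
    """True if every even index maps to an even address and every odd one to an odd."""
    return all(qpp(i, N, f1, f2) % 2 == i % 2 for i in range(N))


def banks_per_cycle(N, f1, f2):
    """Parity bank (0 = even, 1 = odd) of the two addresses read in radix-4 cycle m."""
    return [(qpp(2 * m, N, f1, f2) % 2, qpp(2 * m + 1, N, f1, f2) % 2)
            for m in range(N // 2)]


print(preserves_parity(40, 3, 10))
print(set(banks_per_cycle(40, 3, 10)))
```

Because the two addresses of a cycle always land in different parity banks, two single-port memories are sufficient for the interleaved radix-4 reads.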
6.1.4 Dual-Mode Hardware Interleaver

Two standards, Mobile WiMAX and 3GPP-LTE, employ different interleaving
procedures: the ARP interleaver and the QPP interleaver, respectively [3][27]. To support
both standards with a single piece of hardware, a dual-mode hardware interleaver is designed
instead of using separate RAM-based interleavers, which would incur a huge area overhead [20].
For a frame size N, the ARP interleaver of size N is defined as follows.
$$\Pi(i) = \left( P_0 \cdot i + O + d(i) \right) \bmod N \tag{6.5}$$
where 0 ≤ i ≤ N-1 is the sequential index of the symbol positions after interleaving, Π(i) is
the symbol index before interleaving corresponding to position i, P0 and O are constant
values defined in the standard, and d(i) is a constant value determined by the two LSBs of i.
An efficient implementation of the ARP interleaver based on incremental calculation was
proposed in the first chip implementation, where P0 is accumulated and added to an initial
value selected by the two LSBs, as shown in Figure 6.6.
Similarly, the QPP interleaver, defined as follows, can be rearranged to share the
hardware resources with the ARP interleaver.
$$\Pi(i) = \left( f_1 \cdot i + f_2 \cdot i^2 \right) \bmod N \tag{6.6}$$
where f1 and f2 are the coefficients defined in the standard [27].
By rearranging (6.6) in a recursive form, we obtain the following relation.
Figure 6.6: Dual-Mode Dedicated Hardware Interleaver
(mode == 0 for Mobile WiMAX, mode == 1 for 3GPP-LTE; outputs Π(2m)/Π(i) and Π(2m+1); the block generating Π(2m+1) is not activated in Mobile WiMAX mode; the constant 8f2 feeds the increment registers.)
$$\Pi(i+1) = \left( f_1 (i+1) + f_2 (i+1)^2 \right) \bmod N = \left( \Pi(i) + \Gamma(i) \right) \bmod N \tag{6.7}$$
where Γ(i) = (f1 + 2*f2*i + f2) mod N. Since the QPP interleaver should generate two
interleaved addresses, Π(2m) and Π(2m+1) with 0 ≤ m ≤ N/2-1, in each cycle to support
radix-4 single-binary turbo decoding, the dual-mode interleaver requires two interleaving
blocks, one of which is disabled in Mobile WiMAX mode as illustrated in Figure 6.6.
Accordingly, with the proposed dual-mode hardware interleaver, the interleaving
patterns for all the frame sizes can be generated on-the-fly with small look-up tables that
keep the constant values, while supporting both ARP interleaving and radix-4 QPP
interleaving.
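The recursive form can be extended to a step of two indices per cycle, which is one plausible reading of the 8f2 constant in Figure 6.6. The Python sketch below is an assumption-labeled model rather than the implemented circuit: the function names are hypothetical, the increments follow from Π(i+2) - Π(i) = 2*f1 + 4*f2 + 4*f2*i, and the coefficients used in the example are illustrative inputs.

```python
def add_mod(a, b, N):
    """Add two values already in [0, N) using one conditional subtraction."""
    s = a + b
    return s - N if s >= N else s


def qpp_radix4(N, f1, f2):
    """Generate (PI(2m), PI(2m+1)) for m = 0..N/2-1 without multipliers in the loop."""
    step = (8 * f2) % N                  # second-order increment shared by both flows
    even, odd = 0, (f1 + f2) % N         # PI(0) and PI(1)
    g_even = (2 * f1 + 4 * f2) % N       # PI(2) - PI(0)
    g_odd = (2 * f1 + 8 * f2) % N        # PI(3) - PI(1)
    out = []
    for _ in range(N // 2):
        out.append((even, odd))
        even = add_mod(even, g_even, N)
        odd = add_mod(odd, g_odd, N)
        g_even = add_mod(g_even, step, N)
        g_odd = add_mod(g_odd, step, N)
    return out


def qpp_direct(i, N, f1, f2):
    """Reference: direct evaluation of (6.6)."""
    return (f1 * i + f2 * i * i) % N


pairs = qpp_radix4(40, 3, 10)
```

Each flow needs only two adders with conditional subtraction and two registers, so the structure resembles the accumulator used for the ARP interleaver and can share part of its hardware.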
6.2 Implementation Results

To verify the proposed architecture, a unified radix-4 turbo decoder is implemented for
Mobile WiMAX and 3GPP-LTE with eight SISO decoders as shown in Figure 6.2. To
avoid unnecessary iterations at a high signal-to-noise ratio (SNR), as indicated in Figure
6.1, the stopping criterion for double-binary / radix-4 single-binary turbo decoding
proposed in the first chip implementation is adopted. Compared to the HDA criterion
[9][20], it does not require additional memory to store the decision bits and has no error
floor at high SNR. Figure 6.7 shows the frame-error rate (FER) performance obtained by
applying the proposed early termination and the effect of the stopping criterion. Table 6.1
compares the characteristics of the developed turbo decoder with previous turbo decoders.
A prototype chip containing eight SISO decoders, hardware interleavers, and on-chip dual-port
SRAMs is fabricated in a 0.13µm CMOS process with 8 metal layers. The decoder
core occupies 10.7mm² and operates at a maximum frequency of 250MHz owing to the
retiming technique actively applied to the complex ACS operation, as shown in Figure
6.3.

Table 6.1 Comparison of Decoder Implementation

Publication               2nd Chip                     1st Chip          [30]              [9]                [20]
                          Implementation               Implementation
Standard Compliant        Mobile WiMAX / 3GPP-LTE      WiMAX             HSDPA             HSDPA              UMTS / CDMA2000
Decoding Algorithm        Double-Binary /              Double-Binary     Radix-2           Radix-4            Radix-2
                          Radix-4 Single-Binary        Max-log-MAP       Single-Binary     Single-Binary      Single-Binary
                          Max-log-MAP                                    Max-log-MAP       Log-MAP            Log-MAP
CMOS [µm]                 0.13                         0.13              0.13              0.18               0.25
Supply Voltage [V]        1.2 (0.9*)                   1.2               1.2               1.8                2.5
# of SISO Decoders        8                            1
Gate Count                800K (including 300K         64.2K             44.1K             410K               34.4K
                          for Buffers)
Core Area [mm²]           10.7                         2.2               1.2               14.5 (7.3**)       8.9 (2.2**)
Max. Throughput [Mb/s]    187.5 / 186.0                24.3              14.0              18.0 (24.9**)      4.1 (7.9**)
with 8 iterations         @ 1.2V, 250MHz
                          105* / 104*
                          @ 0.9V, 140MHz
Energy Efficiency         0.61 @ 1.2V, 250MHz          0.63              0.7               10.0 (2.7**)       6.9 (0.9**)
[nJ/bit/iteration]        0.34* @ 0.9V, 140MHz

*  Optimistic estimation based on tpd ~ vdd / (vdd-vth)^2, Energy ~ vdd^2
** Optimistic technology scaling to 0.13µm assumed [30]: A ~ 1/s^2, tpd ~ 1/s, Pdyn ~ 1/s^3

As denoted in Table 6.1 and Table 6.2, the proposed decoder achieves an energy
Figure 6.7: FER Performance and Average Iteration Number with Early Termination in an AWGN Channel
(Plot annotations: frame sizes WiMAX 4800 bits / LTE 6144 bits and WiMAX 480 bits / LTE 480 bits; Vdd = 1.2V: 187.5Mb/s @ WiMAX, 186.0Mb/s @ LTE and 408Mb/s @ WiMAX, 478Mb/s @ LTE; Vdd = 0.9V: 105Mb/s @ WiMAX, 104Mb/s @ LTE and 228Mb/s @ WiMAX, 267.5Mb/s @ LTE.)
Figure 6.8: Memory Size Reduction in the Proposed Architecture

efficiency of 0.34nJ/bit/iteration while achieving more than 100Mb/s with fixed eight
iterations when supply voltage is scaled since the peak operating frequency is relatively
high due to the retiming. Figure 6.8 illustrates the memory size reduction in the proposed
Figure 6.9: Micrograph of the Chip
(3.10mm x 3.45mm; labels: 8 Radix-4 SISO Decoders / Dual-Mode Interleaver, MEM #0 to MEM #7, Exchange Network)

architecture. TYPE A refers to the conventional double-binary turbo decoder based on
symbol-level extrinsic information and dummy calculation. TYPE B refers to the double-binary
turbo decoder based on symbol-level extrinsic information and border metric
encoding. The border memory, whose size is independent of the number of SISO
decoders, reduces the branch memory size for all the SISO decoders adopted in the
parallel turbo decoder. "Proposed" additionally includes the branch memory optimization
introduced in the first chip implementation. A die photo is shown in Figure 6.9, where the
parallelism is clearly visible. The complexity of the exchange network, which delivers
the values between the memories and the SISO decoders, is significantly lowered owing to the
reduced number of extrinsic information values achieved with the bit-level extrinsic
information technique. To avoid routing congestion, the exchange network is routed over the memory macros.
Chapter 7
Conclusions
For efficient nonbinary/high-radix single-binary turbo decoding, two techniques are
proposed. The first one, an energy-efficient SISO decoder based on border metric encoding,
eliminates the complex dummy calculation at the cost of a small memory that holds the
encoded border metrics. Owing to the infrequent accesses to the border memory and its small
size, the energy consumed for SISO decoding is greatly reduced.
Also, to reduce the memory size required for double-binary turbo decoding, a new
method to convert the symbol-level extrinsic information to bit-level information and vice
versa is presented. By exchanging the bit-level extrinsic information, the number of
extrinsic information values to be exchanged in double-binary turbo decoding is reduced to
the same number as in single-binary turbo decoding. Since the size of the extrinsic
information memory is significant, the proposed method is effective in reducing the total
memory size needed in a double-binary turbo decoder.
To verify the proposed methods, two chips have been implemented. The first chip
contains a double-binary turbo decoder for the Mobile WiMAX standard with the dedicated
hardware interleaver and is fabricated using a 0.13µm CMOS process. The decoder is based
on the time-multiplexing architecture consisting of a single optimized SISO decoder and a
low-complexity hardware interleaver, and it provides up to 50Mb/s at a frequency of 200MHz
with a simple early stopping criterion exploiting the bit-level extrinsic information. The second
chip implements the unified radix-4 turbo decoder architecture, which supports both Mobile
WiMAX and 3GPP-LTE. To achieve a decoding rate of more than 100Mb/s, the
chip consists of eight retimed radix-4 SISO decoders and a dual-mode parallel hardware
interleaver that supports both standards. The second chip achieves more than 400Mb/s at a
frequency of 250MHz with the simple early stopping criterion. It also achieves
an energy efficiency of 0.34nJ/bit/iteration while delivering more than 100Mb/s with a fixed
eight iterations when the supply voltage is scaled, since the peak operating frequency is
relatively high owing to the retiming technique.
References
[1] C. Berrou, A. Glavieux, and P. Thitimajshima, "Near Shannon limit error correcting
coding and decoding: Turbo codes," in Proc. Int. Conf. Commun., May 1993, pp. 1064-1070.
[2] C. Douillard and C. Berrou, "Turbo Codes With Rate-m/(m+1) Constituent
Convolutional Codes," IEEE Trans. Commun., vol. 53, no. 10, pp. 1630-1638, Oct. 2005.
[3] Part 16: Air Interface for Fixed and Mobile Broadband Wireless Access Systems,
Amendment for Physical and Medium Access Control Layers for Combined Fixed and
Mobile Operation in Licensed Bands, IEEE Std 802.16e/D5-2004, Nov. 2004.
[4] D. Garrett, B. Xu, and C. Nicol, "Energy efficient turbo decoding for 3G mobile," in
Proc. ISLPED'01, Aug. 2001, pp. 328-333.
[5] J. Vogt, J. Ertel, and A. Finger, "Reducing bit width of extrinsic memory in turbo
decoder realizations," Electron. Lett., vol. 36, no. 20, pp. 1714-1716, Sep. 2000.
[6] D. S. Lee and I. C. Park, "Low-power log-MAP decoding based on reduced metric
memory access," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 53, no. 6, pp. 1244-1253,
Jun. 2006.
[7] H. M. Choi, J. H. Kim, and I. C. Park, "Low-power hybrid turbo decoding based on
reverse calculation," in Proc. IEEE Int. Symp. Circuits Syst., 2006, pp. 2053-2056.
[8] A. Worm, P. Hoeher, and N. Wehn, "Turbo decoding without SNR estimation," IEEE
Commun. Lett., vol. 4, pp. 193-195, June 2000.
[9] M. A. Bickerstaff, L. M. Davis, C. Thomas, D. Garrett, and C. Nicol, "A 24Mb/s radix-4
logMAP Turbo decoder for 3GPP-HSDPA mobile wireless," in Proc. IEEE Int. Solid-State
Circuits Conf. (ISSCC), Feb. 2003, pp. 150-151.
[10] J. Vogt and A. Finger, "Improving the Max-log-MAP turbo decoder," Electron. Lett.,
vol. 36, no. 23, pp. 1937-1939, Jun. 2000.
[11] S. J. Lee, R. Shanbhag, and A. C. Singer, "Area-Efficient High-Throughput MAP
Decoder Architecture," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 13, no. 8,
pp. 921-933, Aug. 2005.
[12] S. Papaharalabos, P. Sweeney, and B. G. Evans, "Constant log-MAP decoding
algorithm for duo-binary turbo codes," Electron. Lett., vol. 42, no. 12, pp. 709-710, Jun.
2006.
[13] C. Zhan, T. Arslan, A. T. Erdogan, and S. MacDougall, "An efficient decoder scheme
for double binary circular turbo codes," in Proc. IEEE ICASSP'06, May 2006, pp. IV-229 to
IV-232.
[14] S. M. Park, J. Kwak, and K. Lee, "Extrinsic Information Memory Reduced
Architecture for Non-Binary Turbo Decoder Implementation," in Proc. IEEE Vehicular
Technology Conference, May 2008, pp. 539-543.
[15] J. B. Anderson and S. M. Hladik, "Tailbiting MAP decoders," IEEE J. Sel. Areas
Commun., vol. 16, no. 2, pp. 297-302, Feb. 1998.
[16] A. J. Viterbi, "An intuitive justification and a simplified implementation of the MAP
decoder for convolutional codes," IEEE J. Sel. Areas Commun., vol. 16, no. 2, pp. 260-264,
Feb. 1998.
[17] A. Abbasfar and K. Yao, "An efficient and practical architecture for high speed turbo
decoders," in Proc. IEEE Vehicular Technol. Conf., Oct. 2003, pp. 337-341.
[18] Y. Gao and M. R. Soleymani, "Triple-binary circular recursive systematic
convolutional turbo codes," in Proc. 5th Int. Symp. Wireless Personal Multimedia Commun.,
2002, vol. 3, pp. 951-955.
[19] C. Berrou, M. Jezequel, C. Douillard, and S. Kerouedan, "The advantages of
nonbinary turbo codes," in Proc. IEEE Inf. Theory Workshop, Sep. 2001, pp. 61-63.
[20] M. C. Shin and I. C. Park, "SIMD Processor-Based Turbo Decoder Supporting
Multiple Third-Generation Wireless Standards," IEEE Trans. Very Large Scale Integr.
(VLSI) Syst., vol. 15, no. 7, pp. 801-810, July 2007.
[21] C. Berrou and M. Jezequel, "Non-binary convolutional codes for turbo coding,"
Electron. Lett., vol. 35, no. 1, pp. 39-40, Jan. 1999.
[22] J. H. Kim and I. C. Park, "Double-Binary Circular Turbo Decoding Based on Border
Metric Encoding," IEEE Trans. Circuits Syst. II, Express Briefs, vol. 55, no. 1, pp. 79-83,
Jan. 2008.
[23] C. C. Wong et al., "A 0.22nJ/b/iter 0.13µm Turbo Decoder Chip Using Inter-Block
Permutation Interleaver," in Proc. IEEE Custom Integrated Circuits Conf., Sep. 2007, pp.
273-276.
[24] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation.
New York: Wiley, 1999.
[25] C. Berrou et al., "Designing good permutations for turbo codes: towards a single
model," in Proc. Int. Conf. Commun., May 2004, pp. 341-345.
[26] B. Bougard et al., "A scalable 8.7nJ/bit 75.6 Mb/s parallel concatenated convolutional
(turbo-) codec," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 2003,
pp. 152-153.
[27] 3GPP, "Multiplexing and channel coding," 3GPP TS 36.212, V8.2.0, Mar. 2008.
[28] J. H. Kim and I. C. Park, "Bit-Level Extrinsic Information Exchange Method for
Double-Binary Turbo Codes," IEEE Trans. Circuits Syst. II, Express Briefs, vol. 56, no. 1,
pp. 81-85, Jan. 2009.
[29] J. H. Kim and I. C. Park, "A 50Mbps Double-Binary Turbo Decoder for WiMAX
Based on Bit-level Extrinsic Information Exchange," in Proc. IEEE Asian Solid-State Circuit
Conference (A-SSCC), 2008, pp. 305-308.
[30] C. Benkeser, A. Burg, T. Cupaiuolo, and Q. Huang, "A 58mW 1.2mm² HSDPA Turbo
Decoder ASIC in 0.13µm CMOS," in ISSCC Dig. Tech. Papers, 2008, pp. 264-265.
[31] J. Kwak and K. Lee, "Design of dividable interleaver for parallel decoding in turbo
codes," IET Electron. Lett., vol. 38, no. 22, pp. 1362-1364, Oct. 2002.
