

Aliazam Abbasfar

Turbo-like Codes
Design for High Speed Decoding
A C.I.P. Catalogue record for this book is available from the Library of Congress.

ISBN-10 1-4020-6390-3
ISBN-13 978-1-4020-6390-9

Published by Springer,
P.O. Box 17, 3300 AA Dordrecht, The Netherlands.
www.springeronline.com

Printed on acid-free paper

All Rights Reserved
© 2007 Springer
No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work.
Dedicated to my wife
Contents
List of Figures
List of Tables
Acknowledgments
Abstract

1 Introduction
   1.1 Outline
2 Turbo Concept
   2.1 Turbo Codes and Turbo-like Codes
      2.1.1 Turbo Codes
      2.1.2 Repeat–Accumulate Codes
      2.1.3 Product Codes
   2.2 Iterative Decoding
   2.3 Probability Propagation Algorithms
   2.4 Message-passing Algorithm
   2.5 Graphs with Cycles
   2.6 Codes on Graph
      2.6.1 Parity-check Codes
      2.6.2 Convolutional Codes
      2.6.3 Turbo Codes
3 High-speed Turbo Decoders
   3.1 Introduction
   3.2 BCJR Algorithm
   3.3 Turbo Decoding
   3.4 Pipelined Turbo Decoder
   3.5 Parallel Turbo Decoder
   3.6 Speed Gain and Efficiency
      3.6.1 Definitions
      3.6.2 Simulation Results
   3.7 Interleaver Design
      3.7.1 Low Latency Interleaver Structure
      3.7.2 Interleaver Design Algorithm
      3.7.3 Simulation Results
   3.8 Hardware Complexity
   3.9 Conclusion
4 Very Simple Turbo-like Codes
   4.1 Introduction
      4.1.1 Bounds on the ML Decoding Performance of Block Codes
      4.1.2 Density Evolution Method
   4.2 RA Codes
      4.2.1 ML Analysis
      4.2.2 DE Analysis
   4.3 RA Codes with Puncturing
      4.3.1 ML Analysis
      4.3.2 Performance of Punctured RA Codes with ML Decoding
      4.3.3 DE Analysis
   4.4 ARA Codes
      4.4.1 ML Analysis
      4.4.2 Performance of ARA Codes with ML Decoding
      4.4.3 DE Analysis
   4.5 Other Precoders
      4.5.1 Accumulator with Puncturing
   4.6 Hardware Complexity
   4.7 Conclusion
5 High Speed Turbo-like Decoders
   5.1 Introduction
   5.2 Parallel ARA Decoder
   5.3 Speed Gain and Efficiency
   5.4 Interleaver Design
   5.5 Projected Graph
      5.5.1 Parallel Turbo Decoder
      5.5.2 Other Known Turbo-like Codes
      5.5.3 Parallel LDPC Codes
      5.5.4 More Accumulate–Repeat–Accumulate Codes
   5.6 General Hardware Architecture
   5.7 Conclusion
References
Index
List of Figures
1 The block diagram of a PCCC encoder
2 The block diagram of a SCCC encoder
3 An example of a HCCC encoder
4 Repeat–Accumulate code block diagram
5 Block diagram of a product code
6 The iterative turbo decoding block diagram
7 Examples of Tanner graphs: (a) tree, (b) with cycles
8 Probabilistic graphs
9 Variable x and its connections in the graph
10 One constraint node and its connections in the graph
11 A tree graph
12 Tanner graph for Hamming code, H(7,4)
13 Tanner graph for regular LDPC (3,5)
14 Convolutional code Tanner graph
15 The Tanner graph of convolutional codes with state variables
16 A trellis section
17 An example of the graph of a PCCC
18 The messages in a convolutional code
19 Block diagram of the SISO
20 Timing diagram of the traditional SISO
21 Message passing between the constituent codes of turbo codes
22 The iterative decoding structure
23 Pipelined turbo decoder
24 Parallel turbo decoder structure
25 Timing diagram of the parallel SISOs
26 Timing diagram of the parallel SISOs in vector notation
27 Partitioned graph of a simple PCCC
28 Parallel turbo decoder with shared processors for two constituent codes
29 Performances of parallel decoder
30 Efficiency and speed gain
31 Efficiency vs. signal to noise ratio
32 (a) Bit sequence in matrix form (b) after row interleaver (c) a conflict-free interleaver (d) bit sequence in sequential order (e) the conflict-free interleaved sequence
33 Data and extrinsic sequences in two consecutive iterations for turbo decoder with reverse interleaver
34 Sequences in two consecutive iterations for parallel turbo decoder with reverse interleaver
35 Scheduling diagram of the parallel decoder
36 The flowchart of the algorithm
37 Performance comparison for B = 1,024
38 Performance comparison for B = 4,096
39 (a) Alpha recursion (b) beta recursion (c) extrinsic computation
40 Probability density function of messages in different iterations
41 Constituent code model for density evolution
42 Constituent code model for density evolution
43 SNR improvement in iterative decoding
44 Repeat–Accumulate code block diagram
45 Density evolution for RA codes (q = 3)
46 Accumulator with puncturing and its equivalent for p = 3
47 Block diagram of accumulator with puncturing
48 Block diagram of check_4 code and its equivalents
49 Normalized distance spectrum of RA codes with puncturing
50 Density evolution for RA codes with puncturing (q = 4, p = 2)
51 The block diagram of the precoder
52 ARA(3,3) BER performance bound
53 ARA(4,4) BER performance bound
54 Normalized distance spectrum of ARA codes with puncturing
55 Density evolution for ARA codes with puncturing (q = 4, p = 2)
56 Performance of ARA codes using iterative decoding
57 The block diagram of the new precoder
58 Tanner graph for new ARA code
59 Performance of the new ARA code
60 The partitioned graph of ARA code
61 Parallel turbo decoder structure
62 Projected graph
63 Projected graph with conflict-free interleaver
64 A PCCC projected graph with conflict-free interleaver
65 (a) PCCC with 3 component codes (b) SCCC (c) RA(3) (d) IRA(2,3)
66 A parallel LDPC projected graph
67 Simple graphical representation of a LDPC projected graph
68 ARA code without puncturing
69 (a) Rate 1/3 ARA code (b) rate 1/2 ARA code
70 (a) Rate 1/2 ARA code (b) new rate 1/3 ARA code (c) new rate 1/4 ARA code
71 Improved rate 1/2 ARA codes
72 Irregular rate 1/2 ARA codes
73 Irregular ARA code family for rate > 1/2
74 Parallel decoder hardware architecture
75 Window processor hardware architecture
List of Tables
I Probability Definitions
II State Constraint
III The Decoder Parameters
IV Characteristic Factors for the Parallel Decoder @ SNR = 0.7 dB (BER = $10^{-8}$)
V An Example of the Interleaver
VI Cutoff Thresholds for RA Codes with Puncturing
VII Cutoff Threshold for Rate 1/2 ARA Codes
VIII Cutoff Threshold for ARA Codes with Rate < 1/2
IX Cutoff Threshold for Improved ARA Codes with Rate < 1/2
X Cutoff Threshold for ARA Codes with Rate > 1/2
Acknowledgments
First and foremost, I would like to express my deepest gratitude to my wife for her
patience and sacrifices throughout this research. She has been a constant source
of assistance, support, and encouragement. My heartfelt thanks go to my parents
for their generous love, encouragement, and prayers. Their sacrifices have been
my inspiration throughout my career and I am deeply indebted to them in all my
successes and accomplishments. Words cannot express the deep feeling of gratitude
I have for my family.
There are so many people that I would like to thank for making my experience
at UCLA truly one of a kind. I would like to thank my advisor Professor Kung Yao
for all the help, support, and opportunities he has provided me over these years. His
valuable advice, guidance, and unconditional support helped me overcome all the
challenges of the doctoral process. I also want to thank Dr. Flavio Lorenzelli for his
support and fruitful discussions throughout my research.
My special thanks go to Dr. Dariush Divsalar, who has been the motivating force behind my entering the specific field of channel coding. I have gained most of my knowledge in the field from discussions with him. He has had a profound effect on my Ph.D.
both as a mentor and a colleague.
Finally, I would also like to thank Professor Parviz Jabehdar-Maralani of the
University of Tehran for his continued support and encouragement.
Abstract
The advent of turbo codes has sparked tremendous research activities around the
theoretical and practical aspects of turbo codes and turbo-like codes. The crucial
novelty in these codes is the iterative decoding.
In this work, first a novel high-speed turbo decoder is presented that exploits parallelization. Parallelism is achieved very efficiently by exploiting the message-
passing algorithm. It has been shown that very large speed gains can be achieved
by this scheme while the efficiency is maintained reasonably high. Memory access, which poses a practical problem for the proposed parallel turbo decoder, is solved by introducing the conflict-free interleaver. The latency is further improved by designing a special kind of conflict-free interleaver. Furthermore, an algorithm to design such an interleaver is presented. Simulation results show that the performance of the turbo code is not sacrificed by using the interleaver with the proposed structure.
Although the turbo code has near Shannon-capacity performance and the proposed architecture for the parallel turbo decoder provides very efficient and highly regular hardware, the circuit is still very complex and demanding for very high-speed decoding. Therefore, it becomes necessary to find turbo-like codes that not only achieve excellent error correction capability, but are also very simple. As a result, a class of new codes for different rates and block sizes, called Accumulate–Repeat–Accumulate (ARA) codes, was invented during this search. The performance of ARA codes is analyzed; and it has been shown that some ARA codes perform very close to random codes, which achieve the Shannon limit.
The architecture for the high-speed ARA decoder is presented and practical issues are discussed. This leads us to a general class of turbo-like codes with parallelism capability, i.e. codes with projected graphs. It is shown that the parallel turbo decoder, discussed earlier, is in the same class. The projected graph provides a powerful and yet simple method for designing parallelizable turbo-like codes.
Chapter 1
Introduction
Efficient and reliable data communication over noisy channels has been pursued for many decades. Applications such as wire-line modems, wireless
and satellite communications, Internet data transfer, digital radio broadcasting, and
data storage devices are only a few examples that accelerated the development of
data communication systems.
The issue of efciency and reliability in communication systems was fundamen-
tally addressed by Shannon [30] in 1948. Shannon introduced the capacity (C) for a
noisy channel, which determines the maximum data rate (R) that can be transferred
over the channel reliably (i.e. without any error). In other words, there exists a coding
scheme of rate (R < C) with arbitrarily small error probability. The proof of this
was done in a nonconstructive way, which means that it does not give any method
for construction of capacity-approaching codes.
The pursuit of capacity-approaching codes took almost 50 years. The intro-
duction of turbo codes by Berrou, Glavieux, and Thitimajshima [9] was a major
breakthrough in the world of practical capacity-approaching codes causing a rev-
olution in the eld of error-correcting codes. As a result, several other classes of
capacity-approaching codes were rediscovered and invented including Low-Density
Parity-Check (LDPC) codes [14], RepeatAccumulate (RA) codes [12], and product
codes.
The trend in data communications is towards high data rate applications which
require high-speed decoding. Although excellent codes have been proposed and their efficient decoding algorithms are known, the design of codes that are suitable for
high-speed applications is still a challenging task. This includes design of high-speed
decoding architectures, as well as low complexity codes, which are naturally more
suitable for parallelism.
1.1 Outline
The advent of turbo codes has sparked tremendous research activities around the
theoretical and practical aspects of turbo codes and turbo-like codes. This study
introduces different types of turbo codes and turbo-like codes. These codes include
the Parallel Concatenated Convolutional Code (PCCC), originally introduced by
Berrou [9], Serial Concatenated Convolutional Codes (SCCC) later introduced in
[7], RA codes [12], and product codes.
The common property among turbo-like codes is that they consist of very simple constituent codes that are connected to each other with random or pseudorandom interleavers. The crucial novelty in these codes is the iterative decoding. This means that the constituent codes are decoded separately, which is efficient and practically feasible since they are very simple codes. Then, they pass new information to each other in the course of a few iterations.
It has been shown that iterative decoding is a generalization of the well-known
probability or belief propagation algorithm. The belief propagation algorithm that
has been essential for the development of new ideas throughout this work is described in the context of coding. The basic theorems for this algorithm are explained and proven in the following paragraphs. This is then followed by a description of the computational algorithm. The probability propagation algorithm is proven for tree-structured graphs, i.e. graphs without any cycle. In fact, the graphical representation of any problem solved by this algorithm is the centerpiece of the algorithm. The generalization of the algorithm for graphs with cycles is presented later on.
Representation of codes on graph is the next step towards characterization of the
iterative decoding as an example of the probability propagation algorithm. The graph
representations are presented for a few codes that are commonly used in turbo-like
codes.
In Chapter 3, first the traditional turbo decoder is introduced. Second, a novel high-speed turbo decoder is presented that exploits parallelization. Parallelism is achieved very efficiently by exploiting the message-passing algorithm. Finally, two characterization factors for the proposed parallel turbo decoder are investigated: speed gain and efficiency. It has been shown by simulations that very large speed gains can be achieved by this scheme while the efficiency is maintained reasonably high.
Memory access poses a practical problem for the proposed parallel turbo decoder.
This problem is solved in the next section. The conflict-free interleaver is introduced to address the memory access problem. The latency is further improved by designing a special kind of conflict-free interleaver. Lastly, an algorithm to design such an interleaver is presented. Simulation results show that the performance of the turbo code is not sacrificed by using this interleaver.
Hardware complexity, which is of major concern, is investigated and the overall
architecture of the hardware for high-speed turbo decoder is presented.
Although the turbo code has near Shannon-capacity performance and the proposed architecture for the parallel turbo decoder gives very efficient and highly regular hardware, the circuit is still very complex and demanding for very high-speed decoding. Therefore, Chapter 4 examines simple turbo-like codes that can achieve excellent error correction capability. In order to investigate the performance of the turbo-like codes, some useful analysis tools are used. Some maximum likelihood (ML) performance bounds are briefly explained, which evaluate the performance of codes under ML decoding. The density evolution (DE) method, which analyzes the behavior of turbo-like codes under iterative decoding, is also described.
01-Abbasfar-c01 SDO003-abbasfar (Typeset by spi publisher services, Delhi) 3 of 4 May 14, 2007 20:35
1.1 Outline 3
The RA code provided valuable insight in the search for low-complexity turbo-like codes. A class of new codes for different rates and block sizes, called ARA codes, was invented during this search. The performance of ARA codes is analyzed and some are shown to perform very close to random codes, which achieve the Shannon limit. The simplicity of ARA codes allows us to build a high-speed ARA decoder with very low complexity hardware.
Chapter 5 first presents the architecture for the high-speed ARA decoder and then discusses practical issues. This leads us to a general structure for turbo-like codes with parallelization capability. The concept of the projected graph is presented. In fact, codes with projected graphs comprise a class of turbo-like codes. The parallel turbo decoder discussed earlier is in the same class. The projected graph provides a powerful and yet simple method for designing parallelizable turbo-like codes, which is used in future research.
Finally, future research directions are discussed in light of this study's findings.
Chapter 2
Turbo Concept
2.1 Turbo Codes and Turbo-like Codes
2.1.1 Turbo Codes
The turbo code introduced by Berrou [9] is a PCCC, which consists of two or more
convolutional codes encoding different permuted versions of a block of information.
A block diagram of a PCCC turbo encoder is shown in Figure 1.
Each convolutional code is called a constituent code. Constituent codes may be the same or different, systematic or nonsystematic. However, they should be of recursive type in order to have good performance. In most cases $C_0$ is a systematic code, which means that the input sequence is transmitted along with the coded sequences; the overall code is then systematic too. If the code rates of the constituent codes are $r_0, r_1, \ldots, r_n$, then the overall rate is $r_0 \| r_1 \| \cdots \| r_n$; like parallel resistors.
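Read literally, the parallel-resistor analogy gives the following combination rule (a worked reading of the analogy, stated here for concreteness rather than quoted from the text):

$$r_0 \| r_1 \| \cdots \| r_n = \left( \frac{1}{r_0} + \frac{1}{r_1} + \cdots + \frac{1}{r_n} \right)^{-1},$$

since branch $i$ transmits $1/r_i$ coded bits per information bit. For instance, treating the systematic branch as rate $r_0 = 1$ and two parity-only recursive encoders as rate $r_1 = r_2 = 1$ yields the familiar overall rate $(1 + 1 + 1)^{-1} = 1/3$ of the classic turbo code.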
Later on, SCCCs were introduced in [7]. A block diagram of a SCCC is drawn in Figure 2.
The block of information is encoded by the first constituent code. Then the output is interleaved and encoded by the second constituent code, and so on and so forth. The output of the last stage is sent over the communication channel. If the code rates of the constituent codes are $r_1, r_2, \ldots, r_n$, the overall rate is $r_1 r_2 \cdots r_n$; for example, two rate-1/2 constituent codes give an overall rate of 1/4. To obtain a systematic code, all constituent codes should be systematic.
We can combine the PCCC and SCCC codes to come up with various Hybrid
Concatenated Convolutional Codes (HCCCs). Such a HCCC is shown in Figure 3.
We can extend the above-described turbo codes to obtain more generalized codes.
In the above codes all the constituent codes are convolutional. If we remove this
limitation and let them be arbitrary block codes, then we will have a broader class
of codes called turbo-like codes. Some examples of turbo-like codes are given in the
sequel.
Fig. 1 The block diagram of a PCCC encoder

Fig. 2 The block diagram of a SCCC encoder

Fig. 3 An example of a HCCC encoder

Fig. 4 Repeat–Accumulate code block diagram

Fig. 5 Block diagram of a product code

2.1.2 Repeat–Accumulate Codes

Perhaps the simplest type of turbo-like code is the RA code, which makes it very attractive for analysis. The general block diagram of this code is given in Figure 4. It is a serial concatenation of two constituent codes: a repetition code and an
accumulate code. An information block of length N is repeated q times and
interleaved to make a block of size qN, and then followed by an accumulator.
Using the rate formula for serial concatenated codes, the code rate for RA codes is 1/q.
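To make the encoding concrete, a minimal Python sketch of an RA(q) encoder follows (the function name, toy sizes, and fixed random seed are illustrative assumptions, not taken from the text):

    import numpy as np

    def ra_encode(u, q, interleaver):
        """Repeat q times, permute, then accumulate (Figure 4)."""
        repeated = np.repeat(u, q)            # repetition code: N -> qN bits
        permuted = repeated[interleaver]      # pseudorandom interleaver
        return np.cumsum(permuted) % 2        # accumulator: c[n] = c[n-1] + x[n] mod 2

    # Example: N = 4 information bits, q = 3, a fixed random permutation.
    rng = np.random.default_rng(0)
    u = np.array([1, 0, 1, 1])
    pi = rng.permutation(3 * len(u))
    c = ra_encode(u, 3, pi)                   # 12 coded bits: overall rate 1/3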
2.1.3 Product Codes
The serial concatenation of two block codes with a row–column interleaver results
in a product code. In fact, each block code consists of several identical smaller block
codes that construct rows/columns of the code word. A block diagram of this code is
shown in Figure 5.
2.2 Iterative Decoding
The crucial novelty in turbo codes is the introduction of iterative decoding. The other
key component is the random or pseudorandom interleaver, which is discussed later.
Having exploited these concepts, turbo codes achieve excellent performance with a
moderate complexity.
The iterative decoding algorithm is based on maximum a posteriori (MAP) estimation of the input sequence. However, since it is difficult to find the MAP solution by considering all the observations at the same time, the MAP decoding is performed on the observations of each constituent code separately. This is explained here for a PCCC with two constituent codes. Since the two codes have been produced from one input sequence, the a posteriori probability (APP) of the data bits coming from the first constituent decoder can be used by the second decoder and vice versa. Therefore the decoding process is carried out iteratively. At the beginning we do not have any information about the input sequence. The MAP decoding of the first constituent code is performed without any prior knowledge of the input sequence. This process
generates the APPs of the input sequence bits, which can be exploited by the second decoder. The information passed to the other constituent decoder is called extrinsic information.
The BCJR algorithm [5] is an efficient algorithm that recursively computes the APPs for a convolutional code. In [6] a general unit, called the SISO, is introduced that generates the APPs in the most general case. Chapter 3 explains the BCJR algorithm based on decoding on graphs with a tree structure.
Since the second constituent code uses the permuted version of the input sequence, the extrinsic information should also be permuted before being used by the second decoder. Likewise, the extrinsic information of the second decoder is to be permuted in reverse order for the next iteration of the first decoder. The iterative decoding block diagram is shown in Figure 6.
Fig. 6 The iterative turbo decoding block diagram
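The data flow of Figure 6 can be sketched as follows. Here siso_stub is only a placeholder (a real SISO runs the BCJR algorithm of Chapter 3); the sketch shows the permute/de-permute bookkeeping between the two decoders, under assumed LLR-style messages:

    import numpy as np

    def siso_stub(channel_llr, apriori_llr):
        # Placeholder for a real SISO: just combines its two inputs so the
        # message flow of Figure 6 can be traced end to end.
        return channel_llr + apriori_llr

    def iterative_decode(llr1, llr2, perm, iters=4):
        inv = np.argsort(perm)                # inverse interleaver I^-1
        ext = np.zeros_like(llr1)             # no prior information at the start
        for _ in range(iters):
            ext = siso_stub(llr1, ext)        # decoder 1 (natural order)
            ext = siso_stub(llr2, ext[perm])  # decoder 2 (interleaved order)
            ext = ext[inv]                    # back to natural order
        return llr1 + ext                     # combine for final bit decisions

    perm = np.random.default_rng(1).permutation(8)
    decisions = iterative_decode(np.ones(8), np.ones(8), perm) > 0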
2.3 Probability Propagation Algorithms
It has been shown that iterative decoding is a generalization of the well-known probability or belief propagation algorithm. This algorithm was developed in the artificial intelligence and expert systems literature, most notably by Pearl [25] and Lauritzen and Spiegelhalter [19]. The connection between Pearl's belief propagation algorithm and coding was first discovered by MacKay and Neal [22, 23], who showed that the Gallager algorithm [14] for decoding LDPC codes is essentially an instance of belief propagation. McEliece et al. [24] independently showed that turbo decoding is also an instance of belief propagation. We describe this algorithm in a way that is more suitable for coding applications.
If we have some variables that are related to each other, there is a bipartite graph
representation showing their dependence, which is called the Tanner graph [31]. The
nodes are divided between variable nodes (circles) and constraint nodes (squares).
Two examples of such graphs are shown in Figure 7.
If we have some noisy observations of the variables, then it becomes a probabilistic graph. In some communication systems some of the variables are not sent through the channel; hence, there is no observation available for them at the receiver side. Therefore, we should distinguish between observed and unobserved variables. Two examples of probabilistic graphs are shown in Figure 8.
We denote all variables with the vector x and their observations with the vector y. Some useful definitions are listed in Table I.
Fig. 7 Examples of Tanner graphs: (a) tree, (b) with cycles
Fig. 8 Probabilistic graphs: (a) tree, (b) loopy (with observed variable nodes, unobserved variable nodes, and constraint nodes)
The probability propagation algorithm is proven for graphs with a tree structure. Given a graphical code model, the probability propagation algorithm is used to compute the APPs of the variables very efficiently. However, for graphs with cycles there is no known efficient algorithm for computing the exact APPs, which makes that practically infeasible. The probability propagation algorithm can be extended to graphs with cycles by proceeding with the algorithm as if there were no cycle in the graph. Although there is no solid proof of convergence in this case, it usually gives excellent results in practice. The performance of turbo codes is a testimony to the effectiveness of the algorithm. The following theorems are very helpful in explaining the algorithm.
Theorem 1: In a probabilistic tree, the likelihood of each variable can be decomposed into the local likelihood and some independent terms, which are called messages or extrinsics in the case of turbo decoding.
We prove this theorem using the graph shown in Figure 9.
In Figure 9 we have lumped all the variables connecting to each edge into one vector and denoted them and their observations by $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3$ and $\mathbf{y}_1, \mathbf{y}_2, \mathbf{y}_3$; i.e. $\mathbf{x} = \{x, \mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3\}$ and $\mathbf{y} = \{y, \mathbf{y}_1, \mathbf{y}_2, \mathbf{y}_3\}$. The likelihood of variable $x$ is determined by marginalization. We have

$$P(\mathbf{y} \mid x) = \sum_{\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3} P(\mathbf{x}_1, \mathbf{y}_1, \mathbf{x}_2, \mathbf{y}_2, \mathbf{x}_3, \mathbf{y}_3 \mid x)\, P(y \mid x) \qquad (1)$$
Table I Probability Definitions

Definition                              Probability
Likelihood of x                         $P(\mathbf{y} \mid x)$
Local likelihood of x                   $P(y \mid x)$
Local reliability of x                  $P(x, y) = P(y \mid x)P(x)$
Reliability of x                        $P(x, \mathbf{y}) = P(\mathbf{y} \mid x)P(x)$
A Posteriori Probability (APP) of x     $P(x \mid \mathbf{y}) = P(x, \mathbf{y}) / P(\mathbf{y})$
Fig. 9 Variable x and its connections in the graph
It can be simplified as follows:

$$P(\mathbf{y} \mid x) = \Bigl[\sum_{\mathbf{x}_1} P(\mathbf{x}_1, \mathbf{y}_1 \mid x)\Bigr] \Bigl[\sum_{\mathbf{x}_2} P(\mathbf{x}_2, \mathbf{y}_2 \mid x)\Bigr] \Bigl[\sum_{\mathbf{x}_3} P(\mathbf{x}_3, \mathbf{y}_3 \mid x)\Bigr] P(y \mid x) \qquad (2)$$

$$P(\mathbf{y} \mid x) = P(\mathbf{y}_1 \mid x)\, P(\mathbf{y}_2 \mid x)\, P(\mathbf{y}_3 \mid x)\, P(y \mid x) \qquad (3)$$

$$P(\mathbf{y} \mid x) = IM_1(x)\, IM_2(x)\, IM_3(x)\, P(y \mid x)$$
In the above formula we have three incoming messages (IM), each one coming from one part of the graph, i.e. a subgraph. Since the variable is connected to each subgraph by one edge, there is a correspondence between the IMs and the edges connected to a variable node. In other words, the IMs are communicated along the edges of the graph towards the variable nodes.
As the mathematical expression shows, the incoming message is the likelihood of the variable x given only the observations that are connected to x via each edge. Since the graph is a tree, for every variable node a decomposition of the graph into some disjoint subgraphs is always possible. That proves theorem 1.
Furthermore, we define the outgoing messages (OM) as follows:

$$OM_1(x) = P(\mathbf{y}_2, \mathbf{y}_3, y \mid x) = IM_2(x)\, IM_3(x)\, P(y \mid x) \qquad (4)$$
$$OM_2(x) = P(\mathbf{y}_1, \mathbf{y}_3, y \mid x) = IM_1(x)\, IM_3(x)\, P(y \mid x) \qquad (5)$$
$$OM_3(x) = P(\mathbf{y}_1, \mathbf{y}_2, y \mid x) = IM_1(x)\, IM_2(x)\, P(y \mid x) \qquad (6)$$

Therefore, we have

$$P(\mathbf{y} \mid x) = IM_1(x)\, OM_1(x) = IM_2(x)\, OM_2(x) = IM_3(x)\, OM_3(x) \qquad (7)$$
These messages are communicated along the edges emanating from variable x towards the constraint nodes. Because every edge of the graph connects one variable node to a constraint node, there are two messages associated with each edge: one incoming and the other outgoing. Therefore the message indices can be regarded as edge indices. We will use these messages in theorem 2.
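A tiny numerical check of Eqs. (4)-(7) for a binary variable with three edges; the probability values are arbitrary:

    import numpy as np

    local = np.array([0.6, 0.4])              # local likelihood [P(y|x=0), P(y|x=1)]
    IM = [np.array([0.9, 0.1]),               # incoming messages IM_1..IM_3
          np.array([0.3, 0.7]),
          np.array([0.5, 0.5])]

    # Eqs. (4)-(6): the outgoing message on edge k is the product of the
    # local likelihood and all incoming messages except the one on edge k.
    OM = [local * np.prod([m for j, m in enumerate(IM) if j != k], axis=0)
          for k in range(3)]

    # Eq. (7): IM_k * OM_k gives the same total likelihood on every edge.
    for k in range(3):
        print(IM[k] * OM[k])                  # identical rows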
Fig. 10 One constraint node and its connections in the graph
It should be noted that messages are functions of the variable. If the variable is a quantized variable, the messages should be computed for all the levels of the variable. For example, for binary variables the message includes the likelihoods of being 0 and 1. However, in practice the likelihood ratio is used to simplify the operations and to save storage memory.
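For instance, in the binary case the pair of likelihoods collapses to a single log-likelihood ratio, and products of messages become sums (a small illustrative check):

    import numpy as np

    def to_llr(msg):
        # msg = [P(y|x=0), P(y|x=1)] -> log-likelihood ratio
        return np.log(msg[0] / msg[1])

    m1, m2 = np.array([0.9, 0.1]), np.array([0.3, 0.7])
    print(to_llr(m1) + to_llr(m2))            # sum of LLRs ...
    print(to_llr(m1 * m2))                    # ... equals the LLR of the product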
Theorem 2: For each constraint node the IM on one edge can be computed from the
OMs on the other edges connected to the constraint node.
We prove this theorem using the graph shown in Figure 10.
In Figure 10 we have lumped all the variables connecting to each edge into one vector and denoted them and their observations by $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3, \mathbf{y}_1, \mathbf{y}_2, \mathbf{y}_3$. Each edge is directly connected to one variable, which is denoted by $x_1 \in \mathbf{x}_1$, $x_2 \in \mathbf{x}_2$, and $x_3 \in \mathbf{x}_3$. Therefore we have $\mathbf{x}_1 = \{x_1, \mathbf{x}'_1\}$, $\mathbf{x}_2 = \{x_2, \mathbf{x}'_2\}$, and $\mathbf{x}_3 = \{x_3, \mathbf{x}'_3\}$. The likelihood of variable $x$ is determined by marginalization. We have

$$\begin{aligned}
IM(x) &= P(\mathbf{y}_1, \mathbf{y}_2, \mathbf{y}_3 \mid x) = \sum_{\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3} P(\mathbf{x}_1, \mathbf{y}_1, \mathbf{x}_2, \mathbf{y}_2, \mathbf{x}_3, \mathbf{y}_3 \mid x) \\
&= \sum_{x_1, x_2, x_3} \Bigl[ \sum_{\mathbf{x}'_1, \mathbf{x}'_2, \mathbf{x}'_3} P(\mathbf{x}'_1, \mathbf{y}_1, \mathbf{x}'_2, \mathbf{y}_2, \mathbf{x}'_3, \mathbf{y}_3 \mid x_1, x_2, x_3) \Bigr] P(x_1, x_2, x_3 \mid x) \\
&= \sum_{x_1, x_2, x_3} \Bigl[ \sum_{\mathbf{x}'_1} P(\mathbf{x}'_1, \mathbf{y}_1 \mid x_1) \Bigr] \Bigl[ \sum_{\mathbf{x}'_2} P(\mathbf{x}'_2, \mathbf{y}_2 \mid x_2) \Bigr] \Bigl[ \sum_{\mathbf{x}'_3} P(\mathbf{x}'_3, \mathbf{y}_3 \mid x_3) \Bigr] P(x_1, x_2, x_3 \mid x) \\
&= \sum_{x_1, x_2, x_3} \bigl[ P(\mathbf{y}_1 \mid x_1)\, P(\mathbf{y}_2 \mid x_2)\, P(\mathbf{y}_3 \mid x_3)\, P(x_1, x_2, x_3 \mid x) \bigr] \\
&= \sum_{x_1, x_2, x_3} \bigl[ OM_1(x_1)\, OM_2(x_2)\, OM_3(x_3)\, P(x_1, x_2, x_3 \mid x) \bigr]
\end{aligned} \qquad (8)$$
As we see, the IM is obtained by marginalizing the appropriately weighted product of the OMs. The weight function explicitly shows the effect of the constraint.
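As a concrete instance of Eq. (8), take a single parity check x + x1 + x2 + x3 = 0 (mod 2) as the constraint (an assumed example; the derivation holds for any constraint), so that P(x1, x2, x3 | x) is simply an indicator over the satisfying configurations:

    import numpy as np
    from itertools import product

    OM = [np.array([0.8, 0.2]),               # OM_k(x_k), k = 1..3
          np.array([0.6, 0.4]),
          np.array([0.5, 0.5])]

    IM = np.zeros(2)
    for x in (0, 1):
        for x1, x2, x3 in product((0, 1), repeat=3):
            if (x + x1 + x2 + x3) % 2 == 0:   # constraint weight is 1 here
                IM[x] += OM[0][x1] * OM[1][x2] * OM[2][x3]
    print(IM)                                 # incoming message toward x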
2.4 Message-passing Algorithm
Theorems 1 and 2 provide the basic operations needed for computing the reliability of all the variables in the graph in an efficient way. The algorithm is called message passing, which is essentially the marginalization algorithm. There are many variants of the algorithm that differ only in the scheduling of the computations. Two important versions of the algorithm are described in the sequel.
Efficient schedule: In this schedule the algorithm starts from the OMs of the leaf vertices in the graph, which are simply the local likelihoods of the variables. Messages propagate from the leaves toward the inside of the graph and then back towards the leaves; on the way back they are IMs. A message is computed only when all the messages it requires are ready. The order of message computation for the graph shown in Figure 11 is:
$OM_0, OM_1, OM_2, OM_3, OM_4, OM_5, OM_6$
$IM_7, IM_8, IM_9$
$OM_{10}, OM_{11}, OM_{12}$
$IM_0, IM_{10}, IM_{11}, IM_{12}$
$OM_7, OM_8, OM_9$
$IM_1, IM_2, IM_3, IM_4, IM_5, IM_6$

Fig. 11 A tree graph
Flooding schedule: In this schedule all messages are initialized with the local likelihoods of the variables. In each step all the messages are computed, regardless of the location of the edge in the graph and the status of other messages. Although this schedule is not efficient, it is the fastest algorithm that can be conceived. After several steps all the messages converge to the correct values. The maximum number of steps is the depth of the graph. However, in big graphs the number of steps needed for a given accuracy of the messages is much less than the depth of the graph. This scheduling is shown in the following, where each row is one step over all the edges; as the steps proceed, more and more of the messages become final.
$OM_0, OM_1, OM_2, \ldots, OM_{12}$
$IM_0, IM_1, IM_2, \ldots, IM_{12}$
$OM_0, OM_1, OM_2, \ldots, OM_{12}$
$IM_0, IM_1, IM_2, \ldots, IM_{12}$
$OM_0, OM_1, OM_2, \ldots, OM_{12}$
$IM_0, IM_1, IM_2, \ldots, IM_{12}$
2.5 Graphs with Cycles
Although the message-passing algorithm is proven only for cycle-free graphs, it has been used for graphs with cycles as well. The success of turbo codes shows the importance of the extension of this algorithm to graphs with cycles. All good turbo-like codes have graphs with cycles; in fact, they are very rich in cycles, which makes the codes more powerful. However, it is well known that short cycles should be avoided in order to have good performance.
In essence, the message-passing algorithm does not change for this class of graphs. It simply ignores the presence of loops. The messages propagate through the graph based on the same rules, i.e. theorems 1 and 2. Because of the loops, the effect of one observation can be fed back to itself after some steps; i.e. the messages passed along a loop come back to their origin. This creates some correlation between messages, which makes theorems 1 and 2 no longer valid. To alleviate this effect, short cycles are removed from the graph. In this case the effect of one variable on the returning message is well attenuated.
Different schedulings can be used in the message-passing algorithm. The efficient scheduling is not applicable here because it relies on the graph having a tree structure. However, some schedulings are more efficient than others. The flooding scheduling can be used for very fast and low latency applications. We usually stop the message passing when the messages are well propagated.
Fig. 12 Tanner graph for Hamming code, H(7,4)
2.6 Codes on Graph
The natural setting for codes decoded by iterative algorithms is their graph representation. In this section the graph representation of some commonly used codes is illustrated and explained.
2.6.1 Parity-check Codes
Parity-check codes are binary linear block codes. The parity-check matrix determines the constraints between the binary variables. As an example, the parity-check matrix of the Hamming code H(7,4) is given as follows:

$$H = \begin{bmatrix} 1 & 0 & 1 & 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 & 0 & 0 & 1 \end{bmatrix} \qquad (9)$$
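Each row of H is one constraint (square) node of Figure 12 and each column one variable (circle) node. A word belongs to the code iff it satisfies every constraint, i.e. Hc = 0 (mod 2); a small sketch (the helper name is illustrative):

    import numpy as np

    H = np.array([[1, 0, 1, 1, 1, 0, 0],      # parity-check matrix of Eq. (9)
                  [1, 1, 1, 0, 0, 1, 0],
                  [1, 1, 0, 1, 0, 0, 1]])

    def is_codeword(c):
        return not np.any(H @ c % 2)          # all three parity checks satisfied

    print(is_codeword(np.array([1, 0, 0, 0, 1, 1, 1])))   # True
    print(is_codeword(np.array([1, 0, 0, 0, 0, 0, 0])))   # False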
The Tanner graph for the Hamming code H(7,4) is shown in Figure 12.
LDPC codes are those codes that have a very sparse parity-check matrix. A regular LDPC code, first introduced by Gallager [14], has a matrix with a fixed number of nonzero elements in each row and column. The Tanner graph for a regular LDPC code whose variable and check nodes have degrees 3 and 5 is shown in Figure 13.
Fig. 13 Tanner graph for regular LDPC (3,5)
Fig. 14 Convolutional code Tanner graph
Luby et al. [20, 21] found that using irregular LDPC codes and optimizing the degree distributions of the variable and check nodes gives superior codes. Building on the analytical techniques developed by Luby et al., Richardson et al. [27, 28] designed long irregular LDPC codes that practically achieve the Shannon limit.
2.6.2 Convolutional Codes
In convolutional codes, the information bits go through a filter. The code bits are derived from a few previous information and code bits. The dependency is defined by the generator polynomials. For example, for a recursive convolutional code with feedback polynomial $\mathrm{Den}(D) = 1 + D + D^2$ and forward polynomial $\mathrm{Num}(D) = 1 + D^2$, the following constraint holds for all $n$:

$$c[n] + c[n-1] + c[n-2] + u[n] + u[n-2] = 0$$
$$u[-1] = u[-2] = 0 \qquad (10)$$
$$c[-1] = c[-2] = 0$$

where $u[n]$ is the $n$th information bit, $c[n]$ is the $n$th code bit, and the plus sign is a modulo-2 addition operator. The above constraints are depicted in Figure 14.
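Rearranged for c[n], the constraint of Eq. (10) acts directly as a recursive encoder; a plain Python transcription (the function name is illustrative):

    def encode(u):
        # c[n] = c[n-1] + c[n-2] + u[n] + u[n-2] (mod 2), with
        # u[-1] = u[-2] = c[-1] = c[-2] = 0.
        c = []
        for n in range(len(u)):
            cn = ((c[n - 1] if n >= 1 else 0) ^ (c[n - 2] if n >= 2 else 0)
                  ^ u[n] ^ (u[n - 2] if n >= 2 else 0))
            c.append(cn)
        return c

    print(encode([1, 0, 0, 0, 0, 0]))   # impulse response of (1+D^2)/(1+D+D^2)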
However, this graph representation has many short cycles (length 4), which is not desirable for the message-passing algorithm. With the introduction of state variables we can eliminate all the loops. State variables are fictitious variables that are not part of the code word; i.e. they are not sent over the channel. The size of a state variable is $(K-1)$ bits, where $K$ is the constraint length of the convolutional code; i.e. it takes on values between 0 and $2^{K-1} - 1$. The graph of the convolutional code for an 8-bit information block with the state variables is drawn in Figure 15.
Fig. 15 The Tanner graph of convolutional codes with state variables
Table II State Constraint

S[n]   u[n]   c[n]   S[n+1]
  0      0      0      0
  0      1      1      2
  1      0      1      0
  1      1      0      2
  2      0      0      1
  2      1      1      3
  3      0      1      1
  3      1      0      3
Fig. 16 A trellis section
Fig. 17 An example of the graph of a PCCC
The state constraint shows the relationship between the code word bits and the state variables. For the convolutional code in the previous example, the state constraint only allows the cases that are listed in Table II.
The state constraint can also be described by the trellis section of the convolutional code, which is shown in Figure 16; the label on each edge is u/uc.
2.6.3 Turbo Codes
The graph of a turbo code is derived by connecting the graphs of its constituent convolutional codes. The graphical representation of a rate 1/3 PCCC with a block size of 8 is shown in Figure 17.
The graphs of other codes can be drawn the same way. It should be noted that despite the fact that the graph of a constituent convolutional code is loopless, the overall graph has many loops.
Chapter 3
High-speed Turbo Decoders
3.1 Introduction
The graphical representation of a code gives the natural setting for iterative decoding, which is basically the message-passing algorithm. In order to grasp the high-speed turbo decoding architecture, it is essential that the traditional approach to turbo decoding is appreciated first.
In traditional turbo decoding each constituent code is processed one at a time. The algorithm for efficiently computing the APPs of the bits is called the BCJR algorithm, after its inventors [5]. It presents an efficient way to compute the APPs of the bits for any convolutional code. In [4] it is shown that the BCJR algorithm is indeed the message-passing algorithm. Here we briefly describe the structure of this algorithm and relate it to the message-passing algorithm.
3.2 BCJR Algorithm
The three main steps of this algorithm are as follows:
Forward Recursion (FR): In this step we compute the likelihood of all the states in the trellis given the past observations. Starting from a known state, we go forward along the trellis and compute the likelihood of all the states in one trellis section from the likelihood of the states in the previous trellis section. This recursive scheme is continued until the likelihoods of all the states, which are called alpha variables, are computed in the forward direction.
The forward recursion is exactly equivalent to computing the forward messages, as shown in Figure 18. Starting from the first state, which is a leaf vertex, forward messages are computed recursively from the observations and the previous forward messages. Since the state variables are not observed from the channel, the IM and OM for each state variable are the same; hence, only one of them is drawn in Figure 18.
It is very instructive to note that the forward messages are the likelihoods of the states given the past observations. In the sequel we will use alpha variables and forward messages interchangeably.

Fig. 18 The messages in a convolutional code
Backward Recursion (BR): This step is quite similar to the forward recursion. Starting from a known state at the end of the block, we compute the likelihood of the
previous states in one trellis section. Therefore we compute the likelihood of all the states in the trellis given the future observations; these are called beta variables. This recursive processing is continued until the beginning of the trellis.
The backward recursion is exactly equivalent to computing the backward messages, as shown in Figure 18. Starting from the last state, which is a leaf vertex, backward messages are computed recursively from the observations and the previous backward messages. It is very instructive to note that the backward messages are the likelihoods of the states given the future observations. We will use beta variables and backward messages interchangeably.
Output Computation (OC): Once the forward and backward likelihoods of the states are computed, the extrinsic information can be computed from them. The extrinsic information can be viewed as the likelihood of each bit given all the observations.
Output computation is equivalent to computing the IMs for the code word bits. These messages match the definition of extrinsic information. As we observe in Figure 18, the likelihoods of both the input and output bits of a convolutional code are computed. However, depending on the connections between the constituent codes in a turbo code, some of the extrinsics might not be used.
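A compact probability-domain sketch of the FR and BR steps on a generic trellis follows (array shapes and the per-step normalization are implementation assumptions, not from the text; the OC step would then combine alpha, gamma, and beta over the transitions carrying u = 0 versus u = 1):

    import numpy as np

    def bcjr_recursions(gamma, alpha0, betaN):
        # gamma[n, s, s'] = P(observation n, next state s' | state s),
        # zero for transitions absent from the trellis.
        N, S, _ = gamma.shape
        alpha = np.zeros((N + 1, S)); alpha[0] = alpha0
        beta = np.zeros((N + 1, S)); beta[N] = betaN
        for n in range(N):                       # forward recursion (FR)
            a = alpha[n] @ gamma[n]
            alpha[n + 1] = a / a.sum()           # normalize for numerical stability
        for n in range(N - 1, -1, -1):           # backward recursion (BR)
            b = gamma[n] @ beta[n + 1]
            beta[n] = b / b.sum()
        return alpha, beta

    # Toy call: 4 trellis sections, 2 states, known start, unknown end.
    g = np.full((4, 2, 2), 0.25)
    alpha, beta = bcjr_recursions(g, np.array([1.0, 0.0]), np.array([0.5, 0.5]))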
Benedetto et al. [6] introduced a general unit, called the SISO, which generates the APPs in the most general case. It should be noted that the SISO is a block which implements the BCJR algorithm. The inputs to the SISO block are the observations ($r_1$ or $r_2$), the initial values for the alpha and beta variables ($\alpha_0$ and $\beta_N$), and the extrinsics coming from other SISOs. The outputs are the alpha and beta variables at the end of the forward and backward recursions, which are not used any more, and the new extrinsics that will pass to the other SISO. The block diagram of a SISO for a convolutional code of length N is sketched in Figure 19.
Fig. 19 Block diagram of the SISO

In the traditional realization of the SISO, the timing schedule for the three mentioned steps is as follows. The backward recursion is done completely for the entire block and all beta variables are stored in a memory. Then, the forward recursion starts from the first trellis section and computes the alpha variables one by one. Since at this time both the alpha and beta variables are available for the first trellis section, the extrinsic for the first bit is computed at this point. Therefore the extrinsic computation is done along with the forward recursion. The sequence of variables in time is shown in Figure 20; alpha and beta variables are denoted by a and b, and incoming and outgoing extrinsics are denoted by y and x. We could exchange the order in which the forward and backward recursions are done; however, that scheduling outputs the results in the reverse order.

Fig. 20 Timing diagram of the traditional SISO
3.3 Turbo Decoding
Once two or more convolutional codes are connected together with interleavers,
turbo codes are produced. The decoding is performed by passing the messages
between the constituent codes. Each constituent code receives the incoming extrin-
sics and computes new extrinsics by using the BCJR algorithm. The new extrinsics
are used by other constituent codes. Figure 21 sketches the message passing between
the constituent codes for a simple PCCC and SCCC.
Fig. 21 Message passing between the constituent codes of turbo codes

The decoding starts from one constituent code and proceeds to the other constituent codes. This process is considered as one iteration. It takes several iterations to obtain
very good performance for the bit decisions. Turbo codes are designed such that the probability density of the extrinsics is shifted toward higher values in each iteration. This phenomenon is known as density evolution (DE), which is also used to compute the capacity of codes under iterative decoding. The decisions are made based on the messages on the information bits; hence, the decisions get better and better with more iterations.
Since the second constituent code uses the permuted version of the bit sequence, the extrinsic information should also be permuted before being used by the second SISO. Likewise, the extrinsic information of the second SISO is to be permuted in reverse order for the next iteration of the first SISO. The structure of the iterative decoder is shown in Figure 22.

Fig. 22 The iterative decoding structure
The SISO block processes the observations serially and outputs the extrinsics serially. Hence, in the sequel the traditional iterative turbo decoding is referred to as serial decoding. It should be noted that only one SISO is working at a time.
Without any loss of generality, only the simple PCCC, which is the most popular turbo code, is investigated from now on. The methods described here can be applied to all other turbo codes with simple modifications.
3.4 Pipelined Turbo Decoder
One possible way to speed up the decoding is to perform the iterations in a pipelined way. In this method there is one exclusive SISO for each constituent code at a certain iteration. The constituent codes are decoded and the results passed to the next stage for further iterations. The block diagram of such a decoder is drawn in Figure 23.
All the SISOs are running at the same time, but working on different received blocks of coded data. In this method the decoding rate has been increased by 2I times, where I is the number of iterations. Although we have achieved some speed gain, the latency remains the same. To get the bit decisions we have to wait until all the stages are finished for a block of data.
Fig. 23 Pipelined turbo decoder
The increase in the decoding rate comes at the expense of more hardware. As we see in Figure 23, the hardware needed for the decoding is also increased 2I times. This hardware consists of the memory to store the observations and extrinsics and the logic to perform the SISO computations.
This method has another disadvantage: the number of iterations is fixed and cannot be changed, whereas the ability to adapt it is essential for power reduction. Therefore this method is not interesting for high-speed turbo decoding.
3.5 Parallel Turbo Decoder
In this section, we present a novel method for iterative decoding of turbo codes that can be used for very high-speed decoders. This method was first introduced by Abbasfar and Yao [3]. Although this method is applicable to every turbo code, we will explain it in the case of a block PCCC code. The algorithm is as follows.
First of all, the received data for each constituent code are divided into several contiguous nonoverlapping sub-blocks, called windows. Then, each window is decoded separately in parallel using the BCJR algorithm. In other words, each window processor is a vector decoder. However, the initial values for the alpha and beta variables come from the previous iteration of the adjacent windows. Since all the windows are being processed at the same time, in the next iteration the initial values for all of them are ready to load. Moreover, there is no extra processing needed for the initialization of the state probabilities at each iteration. The size of the windows is a very important parameter that will be discussed later. The structure of the decoder is shown in Figure 24.
The timing diagram of the messages for one constituent code is shown in Figure 25.
The above timing diagram can be simplified by using vector notation, which is shown in Figure 26. The variables that are computed at the same time are simply replaced with a vector. Each vector has M elements, which belong to the different window processors (SISOs). For example, we have $\mathbf{a}_0 = [a_0\ a_N\ a_{2N}\ \cdots\ a_{MN-N}]^T$ and $\mathbf{b}_0 = [b_0\ b_N\ b_{2N}\ \cdots\ b_{MN-N}]^T$.
This notation is the generalization of the serial decoder. It will also help in appreciating the new interleaver structure for the parallel decoder, discussed later.

Fig. 25 Timing diagram of the parallel SISOs

Fig. 26 Timing diagram of the parallel SISOs in vector notation
The proposed structure stems from the message-passing algorithm itself. We
have only partitioned the graph into some subgraphs and used parallel scheduling
03-Abbasfar-c03 SDO003-abbasfar (Typeset by spi publisher services, Delhi) 24 of 38 July 17, 2007 16:37
24 3 High-speed Turbo Decoders
Fig. 24 Parallel turbo decoder structure
for different partitions. Partitioning helps us to parallelize the decoding of
one constituent code. The graph of a PCCC and its partitions is shown in
Figure 27.
There are two types of messages that are communicated between sub-blocks. First,
the messages associated with the information bits, i.e. the extrinsic information,
which is communicated between two constituent codes in the traditional approach.
Second, the messages that are related to the states in window boundaries; we call
them state messages. In fact we have introduced new messages that are passed
between sub-blocks at each iteration. These messages are the same as alpha and
SISO
1
:
Backward: y
N1
y
N2
. . . y
1
y
0
b
N
b
N1
. . . b
2
b
1
b
0
Forward: y
0
y
1
. . . y
N2
y
N1
a
0
a
1
. . . a
N2
a
N1
a
N
Output: x
0
x
1
. . . x
N2
x
N1
SISO
2
:
Backward: y
2N1
y
2N2
. . . y
N+1
y
N
B
2N
b
2N1
. . . b
N+2
b
N+1
b
N
Forward: y
N
y
N+1
. . . y
2N2
y
2N1
a
N
a
N+1
. . . a
2N2
a
2N1
a
2N
Output: x
N
x
N+1
. . . x
2N2
x
2N1
.
.
.
SISO
M
:
Backward: y
MN1
y
MN2
. . . y
(M1)N+1
y
(M 1)N
b
MN
b
MN1
. . . b
(M1)N+2
b
(M1)N+1
b
(M1)N
Forward: y
(M1)N
y
(M1)N+1
. . . y
MN2
y
MN1
a
(M1)N
a
(M1)N+1
. . . a
MN2
a
MN1
a
N
Output: x
(M1)N
x
(M1)N+1
. . . x
MN2
x
MN1
Fig. 25 Timing diagram of the parallel SISOs
03-Abbasfar-c03 SDO003-abbasfar (Typeset by spi publisher services, Delhi) 25 of 38 July 17, 2007 16:37
3.5 Parallel Turbo Decoder 25
Backword: y
N1
y
N2
y
1
y
0
b
N
b
N1
b
2
b
1
b
0
Forword: y
0
y
1
y
N2
y
N1
a
0
a
1
a
N2
a
N1
a
N
Output: x
0
x
1
x
N2
x
N1
Fig. 26 Timing diagram of the parallel SISOs in vector notation
beta variables that are computed in forward and backward recursion of the BCJR
algorithm. In the rst iteration there is no prior knowledge available about the state
probabilities. Therefore the messages are set to equal probability for all the states. In
each iteration, these messages are updated and passed across the border of adjacent
partitions.
The optimum way to process a window is the serial processing using forward
and backward recursions; i.e. BCJR algorithm. Therefore each window processor
is a SISO. The processing of the windows in two constituent codes can be run in
parallel. However, when we discuss the interleaver for the parallel decoder, we will
nd out that this is not necessary. In parallel turbo code, when the constituent codes
are the same we can share the SISO blocks for all constituent codes. Therefore the
architecture of the decoder of the choice only needs half of the processors as it is
shown in Figure 28.
Table III shows the parameters of a parallel decoder. For window size at two
extremes, the approach is reduced to known methods. If window size is B, and the
number of windows is 1, it turns out to the traditional approach. If the window size
is 1, the architecture reduces to a fully parallel decoder, which was proposed by
Frey et al. [13]. It should be noted that the memory requirement for all cases is the
same.
2 3 4 5 6 1 7 0
0 1 2 3 4 5 6 7 c1
x 8 x x x x x x x
0 1 2 3 4 5 6 7
2 3 4 5 6 1 7 0
0 1 2 3 4 5 6 7
c2
x 8 x x x x x x x
Fig. 27 Partitioned graph of a simple PCCC
03-Abbasfar-c03 SDO003-abbasfar (Typeset by spi publisher services, Delhi) 26 of 38 July 17, 2007 16:37
26 3 High-speed Turbo Decoders
W
1
a
0
b
0
x
2
W
2
a
N
b
N
...
W
M
a
2N
b
2N
a
MN
b
MN
Interleaver/Deinterleaver
y
2
x
1
y
1
x
M
y
M
a
MN-N
b
MN-N
Fig. 28 Parallel turbo decoder with shared processors for two constituent codes
Processing time is the time needed to decode one block. Since all windows are
processed at the same time, each SISO is done after T
w
. We assume that all message-
passing computation associated with one state constraint node is done in one clock
cycle (T
clk
). We have I iterations and each iteration has two constituent codes, so it
takes 2I T
W
to complete the decoding. It is worth mentioning that the processing
time determines the latency as well. Therefore any speed gain is equivalent to lower
latency.
Processing load is the amount of computations that we need. The processing load
for each SISO is proportional to the number of the state constraints. Hence, it is kB,
where k is a constant factor which depends on the complexity of the state constraints.
It should be noted that processing load in serial and parallel SISO are the same.
Therefore the total processing load is 2I kB.
3.6 Speed Gain and Efciency
3.6.1 Denitions
Two characteristic factors should be studied as performance gures. One is the
speed gain and the other is the efciency. In ideal parallelization the efciency
is always 1. It means that there is no extra processing load needed for parallel
processing.
Table III The Decoder Parameters
Parameter Denition
N Window size
M Number of windows (SISOs)
B = M N Block size
I Number of iterations
T
W
= 2N T
clk
Window Processing Time
T = 2I T
W
Processing Time (Latency)
P = k 2I B Processing Load
03-Abbasfar-c03 SDO003-abbasfar (Typeset by spi publisher services, Delhi) 27 of 38 July 17, 2007 16:37
3.6 Speed Gain and Efciency 27
They are dened as follows:
Speed gain = T
0
/T
Efciency = P
0
/P
Where T
0
and P
0
are the processing time and processing load for the serial
approach, i.e. W = B case. The factors can be further simplied to:
Speed gain = M I
0
/I
Efciency = I
0
/I
This is a very interesting result. The speed gain and the efciency are proportional
to the ratio between number of iterations needed for serial case and parallel case.
If the number of iterations required for the parallel case is the same as the serial
case, we enjoy a speed gain of M without degrading the efciency, which is ideal
parallelization. Therefore we should look at the number of iterations required for a
certain performance to further quantify the characteristic factors. In next section we
will investigate these factors with some simulations.
3.6.2 Simulation Results
For simulations, a PCCC with block size of 4,800 is chosen. The rst constituent
code is a rate 1/2 of systematic code and the second code is a rate one nonsystematic
recursive code. The feed forward and feedback polynomials are the same for both
codes and are 1 + D + D
3
and 1 + D
2
+ D
3
, respectively. Thus coding rate is 1/3.
The simulated channel is an additive white Gaussian noise (AWGN) channel. The bit
error rate (BER) performance of the proposed high-speed decoder has been simulated
for window sizes of N = 256, 128, 64, 48, 32, 16, 8, 4, 2, and 1.
The rst observation was that this structure does not sacrice performance for
speed. We can always increase the maximum number of iterations to get simi-
lar performance as of the serial decoder. The maximum number of iterations for
each case is chosen such that the BER performance of the decoder equals that
of the serial decoder after 10 iterations (I
0
= 10). Figure 29 shows the BER
performance of the decoder with different window sizes. The curves are almost
indistinguishable.
However, in practice, the iterations are stopped based on a criterion that shows
the decoded data is reliable or correct. We have simulated such a stopping criterion
in order to obtain the average number of iterations needed. The stopping rule that
we use is the equality between the results of two consecutive iterations. The average
number of iterations is used for the efciency computation. The average number of
iterations for low signal to noise ratio is the maximum number of iterations for each
window size.
Efciency and speed gain of the parallel decoder with different window sizes is
shown in Figure 30. It clearly shows that we have to pay some penalty in order
03-Abbasfar-c03 SDO003-abbasfar (Typeset by spi publisher services, Delhi) 28 of 38 July 17, 2007 16:37
28 3 High-speed Turbo Decoders
Fig. 29 Performances of parallel decoder
Fig. 30 Efciency and speed gain
03-Abbasfar-c03 SDO003-abbasfar (Typeset by spi publisher services, Delhi) 29 of 38 July 17, 2007 16:37
3.6 Speed Gain and Efciency 29
Table IV Characteristic Factors for the Parallel Decoder @SNR = 0.7 dB (BER = 10E 8)
Window size Max # of iterations Avg. # of iterations Speed gain Efciency (%)
64 12 5.0 63 84
32 14 5.8 109 72
16 18 7.4 170 57
8 25 10.4 242 40
4 42 16.3 310 26
2 65 28.3 356 15
1 120 52.0 386 8
to achieve the speed gain. Also we observe that the efciency of parallel decoder
decreases gracefully for window sizes greater than 32. The efciency is degraded
dramatically for very small windows, which prohibits us to get speed gain as well.
However, the speed gain is a decreasing function.
As a summary, in Table IV, the maximumnumber of iterations, the average number
of iterations, and the characteristic factors are tabulated for different window sizes at
E
b
/N
0
= 0.7 (BER = 1e 8).
Efciency curves with respect to SNR are illustrated in Figure 31. The interesting
observation in the efciency curves is the atness of the curves. In other words, the
Fig. 31 Efciency vs. signal to noise ratio
03-Abbasfar-c03 SDO003-abbasfar (Typeset by spi publisher services, Delhi) 30 of 38 July 17, 2007 16:37
30 3 High-speed Turbo Decoders
efciency of the parallel decoder is almost constant in all SNR. This observation
translates to almost constant speed gain over the whole SNR range.
3.7 Interleaver Design
In traditional decoder, the extrinsics are usually stored in a memory whose locations
is accessed one at a time. The order in which the extrinsic memory is accessed is
different for two constituent codes because of the interleavers. It is usually accessed
in sequential order for the rst constituent code and in interleaved order for other
constituent codes. In practice, memory addressing is used to access the memory in
desired order.
Although the message-passing algorithm allows us to parallelize the decoding
process, accessing so many extrinsics at the same time poses a practical problem.
Since M SISOs are running at the same time, M extrinsics are being used simulta-
neously. Having a memory with M read and write ports is practically not feasible,
especially for large M. The solution is to have M submemories each one of which
is accessed by only one SISO. This should be true not only for extrinsics with
sequential orders but also for interleaved orders. Therefore, it imposes a constraint
on the interleavers. Such an interleaver is called conict free. If the bit sequence is
put in a matrix with rows corresponding to bits processed with one SISO, the general
structure for conict-free interleaver is derived by random row permutation followed
by column permutations. An example for M = 5 and N = 4 is shown in Figure 32.
The submemories are distinguished with different colors. Each column of the matrix
is processed at the same time. As we see at each time all the submemories are used
and not one is accessed twice. This is true for both sequential and interleaved order.
4
8
12
1
5
9
13
2
6
10
14
3
7
11
15
16 17 18 19
0
7
8
13
3
4
10
14
2
5
11
12
0
6
9
15
18 17 19 16
1
1
7
8
18
3
4
10
14
2
11
12
19
16
13 5 15
9
6
0
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
8 3 11 16 1 14 19 0 18 4 12 9 7 2 6 13 10 17 5 15
(a) (b) (c)
(d)
(e)
17
Fig. 32 (a) Bit sequence in matrix form; (b) after row interleaver; (c) a conict-free interleaver;
(d) bit sequence in sequential order; (e) the conict-free interleaved sequence
03-Abbasfar-c03 SDO003-abbasfar (Typeset by spi publisher services, Delhi) 31 of 38 July 17, 2007 16:37
3.7 Interleaver Design 31
With conict-free interleavers the extrinsic memory access problem is solved. In
the next section we present an improved interleaver that almost halves the latency.
Although this structure is applicable for every turbo code, we will explain it in the
case of a block PCCC code.
3.7.1 Low Latency Interleaver Structure
To explain the interleaver structure we start with the reverse interleaver in a serial
decoder. When the reverse interleaver is used, it is observed that the next iteration
can start processing as soon as the rst extrinsic is ready and every new computed
extrinsic is used right away. This property is true only for the reverse interleaver.
The reason for this property is that the sequence of extrinsics computed in the
current iterations matches the sequence needed in the backward recursion in the
next iteration. Figure 33 shows the observation and extrinsic sequences used in two
consecutive iterations. The indices used for alpha and beta variables denote the trellis
stage. For extrinsics they denote the bit number in the block. The alpha and beta
variables in two iterations are totally independent, although for simplicity the same
notation is used for them.
As it is observed from the sequences in Figure 33 the output sequence of the
second iteration is compatible with the input sequence of the rst interleaver,
which means that we can repeat this pipelining process for succeeding iterations as
well.
This phenomenon results in two very important advantages. First, the latency of
the decoder decreases by almost a factor of two, which translates to speed gain as
well. We have,
Processing time (Latency) = (2I +1) T
w
/2 = (I +1/2) T
w
C1 :
y
N1
y
N2
y
1
y
0
b
N
b
N1
b
2
b
1
b
0
y
0
y
1
y
N2
y
N1
a
0
a
1
a
N2
a
N1
a
N
x
0
x
1
x
N2
x
N1
extrinsic outputs
C2 :
x
0
x
1
x
N2
x
N1
extrinsic inputs (interleaved)
b
N
b
N1
b
2
b
1
b
0
x
N1
x
N2
x
1
x
0
a
0
a
1
a
N2
a
N1
a
N
y
N1
y
N2
y
1
y
0
Fig. 33 Data and extrinsic sequences in two consecutive iterations for turbo decoder with reverse
interleaver
03-Abbasfar-c03 SDO003-abbasfar (Typeset by spi publisher services, Delhi) 32 of 38 July 17, 2007 16:37
32 3 High-speed Turbo Decoders
y
N1
y
N2
. . . y
1
y
0
b
N
b
N1
. . . b
2
b
1
b
0
y
0
y
1
. . . y
N2
y
N1
a
0
a
1
. . . a
N2
a
N1
a
N
x
0
x
1
. . . x
N2
x
N1
extrinsic outputs
.
.
.
.
.
.
.
?
.
.
.
.
.
.
.
?
.
.
.
.
.
.
.
?
.
.
.
.
.
.
.
?

0

1
. . .
N2

N1
interleavers
.
.
.
.
.
.
.
?
.
.
.
.
.
.
.
?
.
.
.
.
.
.
.
?
.
.
.
.
.
.
.
?
x
0
x
1
. . . x
N2
x
N1
extrinsic inputs
b
N
b
N1
. . . b
2
b
1
b
0
x
N1
x
N2
. . . x
1
x
0
a
0
a
1
. . . a
N2
a
N1
a
N
y
N1
y
N2
. . . y
1
y
0
Fig. 34 Sequences in two consecutive iterations for parallel turbo decoder with reverse interleaver
This advantage comes at the expense of running both the forward and backward
recursion circuitry all the time. There is no extra hardware needed, though.
Second, the extrinsic information are not required to be stored in a memory
and retrieved later on. This memory reduction is due to in place processing of the
extrinsics.
Despite these advantages, this interleaver is never used for turbo codes. The reason
is the poor BER performance of it. In the sequel we use the reverse interleaver in
the context of the parallel turbo decoder. By this we get around this problem by
incorporating one more permutation while we still exploit the advantages.
For the parallel decoder; the timing diagram for two consecutive iterations with
the proposed interleaver is shown in Figure 34.
The idea is to use the same vector with permuted elements. So x
0
in the next
iteration is permuted,
0
(x
0
), and y
0
will be replaced by
1
0
(x
0
). If we did not
have the
0
,
1
, . . . ,
N1
interleavers, there would be nothing more than M parallel
decoders each one working on a separate block. However, the presence of the
interleavers creates a randomness that will improve the performance of the code
while the architecture of the code is almost intact, i.e. the advantages of the reverse
interleaver are still in place.
The permutations are done differently for each vector. Therefore the interleaver
block is time variant, but memoryless. Because the number of parallel blocks is
usually small, the interleaver implementation is feasible. Moreover, it does not have
FR 1
BR 1
FR 1'
OC 1
Int
FR 2
BR 1'
OC 1'
Int
-1
FR I'
BR I
OC I
BR I'
OC I'
Int Int
-1
...
...
...
...
Fig. 35 Scheduling diagram of the parallel decoder
03-Abbasfar-c03 SDO003-abbasfar (Typeset by spi publisher services, Delhi) 33 of 38 July 17, 2007 16:37
3.7 Interleaver Design 33
the memory access problem due to its memoryless structure, which is the main
problem in the parallelization of turbo decoder. The scheduling diagram of the
parallel decoder is shown in Figure 35.
3.7.2 Interleaver Design Algorithm
The structure of the interleaver can be best understood by organizing the bits in
a matrix with each row having the bits processed in a SISO. Each column of this
matrix is a vector that is computed at a time. We have a two-step interleaver: a
reverse row interleaver and a random column interleaver. If the matrix elements
are denoted by P
i, j
, where i = 0, 1, . . . , M1 and j = 0, 1, . . . , N1, then the
equivalent interleaver sequence is {Q
n
} = {N*P
i, j
+ N j }.
As an example, a turbo code with block length of 20 is decomposed into M = 5
sub-blocks of N =4 bits. Therefore, there are 5 SISOs working in parallel; each one
works on a block of 4 bits. Table V shows an example of the interleaver in matrix
format. The equivalent interleaver is {11, 2, 5, 4, 15, 18, 9, 12, 19, 14, 1, 16, 3, 6, 13,
8, 7, 10, 17, 0}.
In this section we will explain howto design such an interleaver. The algorithmhas
two main steps: constructing a random interleaver and updating based on a certain
constraint.
Using the matrix format for the interleaver, we initialize the interleaver design
by taking a random interleaver for each column. In the next step we will update the
interleaver by applying a certain constraint. To update the interleaver we use column-
wise bit swapping, which ensures that the structure of the interleaver is preserved.
Since the constraints for interleaved designs are usually applicable for one-
dimensional interleaver, to update the interleaver it is best to compute the equivalent
interleaver, which can be done on the y. The simplest constraint that we can use is
the spread of the interleaver [7]. In other words, we design an S-random interleaver
with the proposed structure. An ordinary S-random can be viewed as a special case
for this structure, i.e. when we have only one column. Therefore, this algorithm not
only presents an algorithmwith the proposed structure, but also gives a fast algorithm
for designing S-random interleavers. The owchart of this algorithm is shown in
Figure 36.
Starting from the rst row, the rst SISO bits, and the constraint for each bit is
checked given the previously designed bits. If the constraint is met, we go to the next
Table V An Example of the Conict-Free Interleaver
Bit index in SISO 0 1 2 3
Interleaved bit index 3 2 1 0
SISO
0
gets from SISO 2 0 1 1
SISO
1
gets from SISO 3 4 2 3
SISO
2
gets from SISO 4 3 0 4
SISO
3
gets from SISO 0 1 3 2
SISO
4
gets from SISO 1 2 4 0
03-Abbasfar-c03 SDO003-abbasfar (Typeset by spi publisher services, Delhi) 34 of 38 July 17, 2007 16:37
34 3 High-speed Turbo Decoders
Start
Construct a matrix
with N random
interleaver
columns; Pi,j
i = 0, j = 0
n = N x i + j ;
Qn = N*Pi,j + Nj
j++;
if( j =N) { j =0, i++}
k = N*m + j ;
Qk = N*Pm,j + N-j
m = i;
Qk
satisfys the
constraint
?
Pi,j <-> Pm,j
m > M?
m = m+1;
i > M ?
Done
k = N*m + j;
Qk = N*Pm,j + Nj
m = i1;
Qn & Qk
satisfy the
constraint
?
m < 0 ?
m = m 1;
Yes
No
Yes
No
Yes
No
No
Yes
Yes
No
Fig. 36 The owchart of the algorithm
03-Abbasfar-c03 SDO003-abbasfar (Typeset by spi publisher services, Delhi) 35 of 38 July 17, 2007 16:37
3.8 Hardware Complexity 35
bit. Otherwise we check the remaining elements in the column in place of this bit. If
one of them satises the constraint, we exchange the indices in the column and go
to the next bit. If not, we try exchanging this bit with the previously designed bits
in the column. In this situation, when we exchange two bits the constraint for both
bits should be satised. If none of the previously designed bits can be exchanged
with this bit, then the algorithm fails. There are two options available in case the
algorithm fails: one is to make the constraint milder and the other one is to redo
everything with a new random matrix.
We have observed that this algorithm is very fast for S-random interleaver design
and it does not fail when the spread is less than sqrt(B/2). The maximum spread
that we can achieve for the structured interleaver is slightly smaller than that of the
ordinary one. Therefore, one should expect some degradation in performance.
A much more complicated constraint can be used in the algorithm in order to
improve the code. These constraints usually depend strictly on the code.
3.7.3 Simulation Results
For simulations, two PCCCs with block sizes of 1,024 and 4,096 are chosen. The
rst constituent code is a rate 1/2 systematic code and the second code is a rate
one nonsystematic recursive code. The feed forward and feedback polynomials are
the same for both codes and are 1 + D + D
3
and 1 + D
2
+ D
3
, respectively. Thus
coding rate is 1/3. The simulated channel is an AWGN channel.
Two interleavers with block length of 1,024 (M = 32, N = 32) and 4,096 (M =
128, N = 32) have been designed with the proposed algorithm. The BER perfor-
mance of the decoders has been simulated and compared with that of the serial
decoder with S-random interleaver. The maximum number of iterations for each case
is 10.
The performance comparison for the 1,024 case is illustrated in Figure 37. The
proposed two-dimensional S-random interleaver is called S
2
-random. As we see in
the gure, the performances are almost the same. The S-random interleaver has a
slightly better error oor.
The performances for the 4,096 case is shown in Figure 38. The difference
between the error oors is more noticeable. However, the codes have equal threshold
in both cases. The error oor can be reduced with a more powerful constraint in the
interleaver design algorithm.
3.8 Hardware Complexity
After the speed gain and efciency were discussed, we are going to investigate
hardware complexity for parallel turbo decoder. The turbo decoder hardware consists
of two major parts: logic and memory.
03-Abbasfar-c03 SDO003-abbasfar (Typeset by spi publisher services, Delhi) 36 of 38 July 17, 2007 16:37
36 3 High-speed Turbo Decoders
EB/No
Fig. 37 Performance comparison for B = 1,024
The memory requirement for the parallel decoder consists of the following:
Observations: These variables are the observation values received from channel.
They usually are the logarithm of likelihood ratios (LLR) of the code word bits. The
message computations are easier using these variables. The size of the memory is
the number of bits in a code word. For a rate 1/3 turbo code with block size of B,
the memory size is 3B. Usually the width of this memory is 45 bits. This is a read-
only memory during the decoding (all iterations). This memory is the same for serial
decoder.
Extrinsics: The extrinsics are stored in a memory to be used later by other
constituent codes. The size of the memory depends on the number of connections
between constituent codes. For a simple PCCC, there are only B extrinsics needed.
This memory is a read/write with a width of 69 bits. For conict-free interleavers,
this memory is divided into M sub-blocks each one accessed independently. The
total memory, however, is the same as serial decoder.
Beta variables: These variables should be kept in a memory in backward recursion
before it is used to compute the extrinsics. Each beta variable is actually an array. The
size of the array is 2
(K1)
1, where K is the constraint length of the convolutional
code; i.e. the number of possible values for a state variable. For an 8-state convolu-
tional code the size of memory for beta variables is 8B. Each element in the array
has usually 812 bits. This memory is a big portion of memory requirements for the
03-Abbasfar-c03 SDO003-abbasfar (Typeset by spi publisher services, Delhi) 37 of 38 July 17, 2007 16:37
3.8 Hardware Complexity 37
EB/No
Fig. 38 Performance comparison for B = 4,096
parallel decoder. In serial decoder there is a sliding window approach that reduces
this memory at the expense of some more processing.
The logic that is needed for processing is mainly for the message computation.
There are three operations needed in message computations that have been shown in
Figure 39.
n n
+1
n
x
n
n n
+1
n
x
n
n n
+1
n
x
n
(a) (b) (c)
Fig. 39 (a) alpha recursion; (b) beta recursion; (c) extrinsic computation
03-Abbasfar-c03 SDO003-abbasfar (Typeset by spi publisher services, Delhi) 38 of 38 July 17, 2007 16:37
38 3 High-speed Turbo Decoders
All of the above operations are done in one clock cycle. In the parallel decoder we
have M SISOblocks. Therefore, compared to serial decoder the above computational
logic is increased by M times.
Contrary to pipelined turbo decoder, the complexity is not increased proportional
to the speed gain.
3.9 Conclusion
We have proposed an efcient architecture for parallel implementation of turbo
decoders. The advantage of this architecture is that the increase in the processing load
due to parallelization is minimal. Simulation results demonstrate that this structure
not only achieves some orders of magnitude in speed gain, but also maintains the
efciency in processing. Also we have shown that the efciency and the speed gain
of this architecture are almost independent of the SNR.
We also have proposed a novel interleaver structure for parallel turbo decoder.
The advantages of this architecture are low latency, high speed, and the feasibility
of the implementation. Simulation results show that we can achieve very good BER
performance by this architecture as well. We also presented a fast algorithm to design
such an interleaver, which can be used for designing S-random and other interleavers
by just changing the constraint.
The regularity of the recently proposed architecture for parallel turbo decoder and
the advantages of the proposed interleaver make this the architecture of choice for
VLSI implementation of high-speed turbo decoders.
04-Abbasfar-c04 SDO003-abbasfar (Typeset by spi publisher services, Delhi) 39 of 66 August 1, 2007 13:41
Chapter 4
Very Simple Turbo-like Codes
4.1 Introduction
In searching for simple turbo-like codes RA codes are very inspiring. They are
perhaps the simplest turbo-like codes. Surprisingly, they achieve good performance
too. Simplicity of these codes lends itself to a more comprehensive analysis of their
performance. Divsalar et al. have shown the performance of these codes with ML
decoding and proven that they can achieve near Shannon limit performance [12].
Moreover, they have proved that it achieves the Shannon limit when the rate goes to
zero.
However, RA codes cannot compete with turbo codes or well-designed LDPCs
as far as performance is concerned. To improve the performance of RA codes Jin
proposed Irregular RepeatAccumulate (IRA) codes [16,17]. He also presented a
method for designing very good IRA codes for binary erasure and additive white
Gaussian channels. He showed that they outperform turbo codes for codes with very
large block sizes. However, IRA codes lose both the regularity and simplicity at the
expense of performance.
In this chapter we show that with some simple modications RA codes can
be transformed into very powerful codes while maintaining the simplicity. The
modications include simple puncturing and precoding. First RA codes with reg-
ular puncturing are analyzed using iterative decoding, as well as ML decoding.
ML decoding performance is shown by a tight bound using the weight distrib-
ution of RA codes with puncturing. In fact, with increasing both the repetition
and puncturing the code rate remains the same whereas the performance gets
better.
Then we present ARA codes. These codes not only are very simple, but also
achieve excellent performance. The performance of these codes with ML decoding
is illustrated and compared to random codes by very tight bounds. It is shown that
there are some simple codes that perform extremely close to Shannon limit with
ML decoding. The performance of ARA codes using iterative decoding is also
investigated and compared to ML decoding later on.
RA and ARA codes can be classied as LDPC codes. Despite the fact that LDPC
codes in general may have a very computationally involved encoder, RA and ARA
codes have a simple encoder structure. ARA codes, especially, allows us to generate
a wide range of LDPC codes with various degree distributions including variable
Aliazam Abbasfar, Turbo-Like Codes, 3965. 39
c Springer 2007
04-Abbasfar-c04 SDO003-abbasfar (Typeset by spi publisher services, Delhi) 40 of 66 August 1, 2007 13:41
40 4 Very Simple Turbo-like Codes
nodes with degree one (RA and IRA code structures do not allow degree one variable
nodes). They are able to generate LDPC codes with various code rates and data frame
sizes, and with a performance close to the Shannon capacity limit. The proposed
coding structure also allows constructing very high-speed iterative decoders using
belief propagation (message passing) algorithm.
First we describe briey some of the tools that we use for analyzing the perfor-
mance of a turbo-like code.
4.1.1 Bounds on the ML Decoding Performance of Block Codes
Since there is no practical ML decoding algorithm available for block codes with
large block size, we use the performance bounds to obtain some insight on codes
behavior. Using the classic union bound, the frame (word) error rate (FER) and BER
for a (N, K) linear block code decoded by an ML criterion over an AWGN channel
is upper-bounded by
FER

d=d min
A
d
Q
_
_
_
2d r
E
b
N
0
_
_
(11)
BER

d=d min
K

w=1
w
K
A
w,d
Q
_
_
_
2d r
E
b
N
0
_
_
(12)
Where r denotes the code rate, E
b
/N
o
is the signal to noise ratio, d is the
Hamming distance of code words, d
mi n
is the minimum distance between code
words, w
d
is the average input error weight, A
d
is the cardinality of code words
with distance d, A
w,d
is the cardinality of code words with input and output weight
of w and d, K is the block length, N is the code word length, and Q denotes the
complementary error function dened as
Q(x) =
1

_
x
e
u
2
/2
du (13)
However, this bound is not very tight in low signal to noise ratios. There are tighter
bounds like Viterbi-Viterbi [33], Poltyrev [26], and Divsalar [10] bounds. Divsalar
bound is very attractive since it provides tight bounds with closed form expressions
for bit-error and word-error probabilities. Here we describe this bound by dening
some new variables:
= d/N
a() = ln(A
d
)/N (14)
c = r
E
b
N
0
04-Abbasfar-c04 SDO003-abbasfar (Typeset by spi publisher services, Delhi) 41 of 66 August 1, 2007 13:41
4.1 Introduction 41
Where , a(), and c are normalized weight, weight distribution, and SNR respec-
tively. Then we dene the following:
c
0
() = (1 e
2a()
)
1
2
f (c, ) =
_
c
c
0
()
+ 2c + c
2
c 1
(15)
Then we dene the exponent:
E(c, d) =
_

_
1
2
ln[1 2c
0
() f (c, )] +
cf (c, )
1 + f (c, )
, c
0
() < c <
e
2a()
1
2(1 )
a() + c, otherwise
(16)
Based on Divsalar bound the upper bound on the FER is given by:
P
e

d=d min
min
_
e
NE(c,d)
, e
na()
Q
_

2cd
__
(17)
The BER is upper bounded by the same formula, but the denition of r() should
be changed to the following:
a() = ln
_
_

w
w
K
A
w,d
_
_
/N (18)
It also obtains a closed form expression for the minimum SNR threshold that serve
as a tight upper bound on maximum-likelihood capacity of nonrandom codes, which
is the following:
_
E
b
N
0
_
min
=
1
r
max
01r
[c
0
()] (19)
The above threshold holds when N goes to innity. The only information required
for the error probability bounds and the threshold is the weight distribution of the
code. Fortunately, the codes that are investigated are simple enough to derive their
weight distribution.
4.1.2 Density Evolution Method
The analysis of iterative decoders for turbo-like codes with short blocks is an open
problem. However, there is an asymptotic solution for this; i.e. when N goes to
innity. This analysis is based on the DE method proposed by Richardson and
Urbanke [27]. In this method the probability density function of the messages passed
between the constituent codes is tracked as this density evolves from iteration to
iteration. They used this method to compute the asymptotic threshold of LDPC codes
over a binary input AWGN channel. Divsalar et al. [11] generalized this method for
turbo and turbo-like codes.
04-Abbasfar-c04 SDO003-abbasfar (Typeset by spi publisher services, Delhi) 42 of 66 August 1, 2007 13:41
42 4 Very Simple Turbo-like Codes
Fig. 40 Probability density function of messages in different iterations
To explain the DE phenomena, we assume that all the code word bits are zero; i.e.
value 1 sent in the transmitter. We sketch the probability density function of the OMs
for one constituent codes in different iterations. Example of those curves is shown in
Figure 40.
As we see the density functions evolves towards higher means. We can approx-
imate the density functions by Gaussian approximation, as Wiberg did in his dis-
sertation [34]. The bit decisions are made based on the messages, so the bit-error
probability depends on the SNR of messages. Therefore what is really important
is that the SNR of the density functions should increase in order to obtain better
performance as the iterations go on.
The messages are passed between the constituent codes. Hence, the constituent
codes get the evolved message as input and generate new messages at the output.
So what we need to know is the SNR transfer function for each constituent codes.
Using the transfer functions we can track the behavior of the SNR of the messages
G
Eb/ No
SNR
in
SNR
out
Fig. 41 Constituent code model for density evolution
04-Abbasfar-c04 SDO003-abbasfar (Typeset by spi publisher services, Delhi) 43 of 66 August 1, 2007 13:41
4.1 Introduction 43
G
1
SNR
in
SNR
out
G
2
Eb/ No
Fig. 42 Constituent code model for density evolution
as they are passing between the constituent codes. Therefore, we sketch a general
model for each constituent code as in Figure 41. In this model we have one more
parameter that is the operating E
b
/N
0
. This is the E
b
/N
0
of the observations the
leaf node messages that are fed to this constituent code and acts like a bias for the
constituent code.
The SNR transfer function is denoted by G that indicates the following relation-
ship:
SNR
out
= G(SNR
in
) (20)
It should be noted that the transfer function is implicitly dependent on the operat-
ing E
b
/N
0
. The transfer function is usually derived by Monte Carlo simulation using
the Gaussian approximation or the real density function.
For turbo-like codes with two constituent codes the overall block diagram of the
iterative decoding is shown in Figure 42.
We have:
SNR
out
= G
1
(SNR
in
) (21)
SNR
in
= G
2
(SNR
out
) (22)
Suppose we start with the rst constituent code. There is no prior messages at this
time; i.e. SNR
in
= 0. The observations help to generate new messages with some
SNR
out
= G
1
(SNR
in
) > 0. This messages are passed to the second constituent code
and output messages have the SNR = G
2
(G
1
(SNR
in
)), which is the SNR
in
of the
rst constituent code for the next iteration. In order to obtain a better SNR at each
iteration we should have the following:
G
2
(G
1
(SNR
in
)) > SNR
in
; for any SNR
in
(23)
Since G
2
is strictly ascending function, it is reversible and we can write an equivalent
relation:
G
1
(SNR
in
) > G
1
2
(SNR
in
); for any SNR
in
(24)
If the above constraint holds, then the iteratively decoding will result in correct
information bits (SNR goes to innity). The minimum operating E
b
/N
0
that this
constraint holds is denoted as the capacity of the code with iteratively decoding.
We usually draw G
1
(SNR) and G
1
2
(SNR) in one gure. These curves give us
some sense about the convergence speed as well. An example of such curves and the
04-Abbasfar-c04 SDO003-abbasfar (Typeset by spi publisher services, Delhi) 44 of 66 August 1, 2007 13:41
44 4 Very Simple Turbo-like Codes
Fig. 43 SNR improvement in iterative decoding
way SNR improves in iterations is shown in Figure 43. As we see the speed of SNR
improvement depends on the slopes of G
1
and G
2
.
We use the transfer function curves to analyze the performance of turbo-like codes
with iterative decoding.
4.2 RA Codes
RA codes are the simplest codes among turbo-like codes, which make them very
attractive for analysis. The general block diagram of this code is drawn in Figure 44.
An information block of length N is repeated q times and interleaved to make a
block of size qN, and then followed by an accumulator.
rep(q)
u
ACC I
N qN qN qN
Fig. 44 RepeatAccumulator code block diagram
04-Abbasfar-c04 SDO003-abbasfar (Typeset by spi publisher services, Delhi) 45 of 66 August 1, 2007 13:41
4.2 RA Codes 45
The accumulator can be viewed as a truncated rate one recursive convolutional
code with transfer function of 1/(1 + D), but sometimes it is better to think of it as a
block code whose input block [x
1
, x
2
, . . . , x
n
] and output block [y
1
, y
2
, . . . , y
n
] are
related by the following:
y
1
= x
1
y
2
= x
1
+ x
2
y
3
= x
1
+ x
2
+ x
3
. . . .
y
n
= x
1
+ x
2
+ x
3
+ + x
n
(25)
4.2.1 ML Analysis
For ML analysis we need the weight distribution of the code. We use the concept
of uniform interleaver [8] to compute the overall input-output weight enumerator
(IOWE). Therefore, we need to compute the IOWE of both repetition code and the
accumulator. For repetition code it is simply the following:
A
rep(q)
w,d
=
_

_
_
N
w
_
; d = qw
0; otherwise
(26)
It can be expressed as
A
rep(q)
w,d
=
_
N
w
_
(d qw) (27)
where () is the Kronecker delta function.
The IOWE of the accumulator is:
A
acc
w,d
=
_
N d
w/2
__
d 1
w/2 1
_
(28)
Where x and x denote the largest integer smaller than x and smallest integer
larger than x, respectively.
04-Abbasfar-c04 SDO003-abbasfar (Typeset by spi publisher services, Delhi) 46 of 66 August 1, 2007 13:41
46 4 Very Simple Turbo-like Codes
Having the IOWE of the repeat and accumulate codes, we can compute the IOWE
of the RA code using the uniform interleaver.
A
RA(q)
w,d
=
qN

h =0
A
rep(q)
w,h
A
acc
h,d
_
qN
h
_ =
_
N
w
_
_
qN
qw
_
_
qN d
qw/2
__
d 1
qw/2 1
_
(29)
4.2.2 DE Analysis
The RA codes consist of two component codes: repeat and accumulate codes. Hence,
the messages are exchanged between these two codes. We use Gaussian approxima-
tion to obtain the SNR transfer functions of the constituent codes, which are shown
in Figure 45.
In accumulator code we have nonzero SNR even with zero messages in the input
coming from repetition code. This is because the observations help to generate
some nonzero messages. Therefore, the accumulator is able to jumpstart the iterative
decoding. On the other hand, the repetition code has a straight line transfer function
Fig. 45 Density evolution for RA codes (q = 3)
04-Abbasfar-c04 SDO003-abbasfar (Typeset by spi publisher services, Delhi) 47 of 66 August 1, 2007 13:41
4.3 RA Codes with Puncturing 47
Fig. 46 Accumulator with punctur-
ing and its equivalent for p = 3
with a slope of 2; the reverse function is shown in Figure 45. SNR
out
is zero when
SNR
in
is zero. This is justied since there is no channel observation available for this
code. The curves are almost touching; hence, the threshold of the RA code (rate 1/3)
is almost 0.5 dB.
4.3 RA Codes with Puncturing
4.3.1 ML Analysis
To compute the IOWE of the RA codes with puncturing we use the equivalent
encoder depicted in Figure 46 instead of the accumulator with puncturing.
As we see, the equivalent graph is a concatenated code of a regular check code
and an accumulator, which is shown in Figure 47.
Since the check code is regular and memoryless, the presence of any interleaver
between two codes does not change the IOWE of the overall code. In order to
compute the IOWE for this code we insert a uniform interleaver between two codes.
The next step is to compute the IOWE of the check code. The IOWE can
be expressed in a simple closed-form formula if we use the two-dimensional Z-
transform. The inverse Z-transform results in A
c
w,d
. We start with N = 1, i.e. we
have only one parity check. We have
A
c
(W, D) = E
p
(W) + O
p
(W)D (30)
Check(p) Acc
pN N N
Check(p) p
pN N N
Acc
N
Fig. 47 Block diagram of accumulator with puncturing
04-Abbasfar-c04 SDO003-abbasfar (Typeset by spi publisher services, Delhi) 48 of 66 August 1, 2007 13:41
48 4 Very Simple Turbo-like Codes
where
E
p
(W) = Even[(1 + W)
p
] (31)
and
O
p
(W) = Odd[(1 + W)
p
] (32)
Since there are N independent check nodes in the code, the IOWE can be written
in Z-transform as:
A
c
(W, D) = (E
p
(W) + O
p
(W)D)
N
=
N

d=0
_
N
d
_
E
p
(W)
Nd
O
p
(W)
d
D
d
(33)
The IOWE is obtained by taking the inverse Z-transform. The closed-form expres-
sion for A
w,d
for arbitrary p is very complicated. Instead we derive the IOWE for
p = 2, 3, and 4, which are practically more useful.
4.3.1.1 Case p = 2
Using the general formula in Z-transform we have:
A
c(2)
(W, D) = (1 + W
2
+ 2WD)
N
(34)
It can be expanded as following:
N

d=0
_
N
d
_
(1 + W
2
)
Nd
(2W)
d
D
d
=
N

d=0
_
N
d
_
_
_
Nd

j =0
_
N d
j
_
W
2 j
(2W)
d
_
_
D
d
(35)
Therefore the IOWE can be expressed as
A
c(2)
w,d
=
_

_
_
N
d
__
N d
j
_
2
d
; w = d + 2 j for j = 0, . . . ,N d
0; otherwise
(36)
It can be expressed concisely as
A
c(2)
w,d
=
_
N
d
_
Nd

j =0
_
N d
j
_
2
d
(w d 2 j ) (37)
04-Abbasfar-c04 SDO003-abbasfar (Typeset by spi publisher services, Delhi) 49 of 66 August 1, 2007 13:41
4.3 RA Codes with Puncturing 49
where (x) is the Kronecker delta function.
Example: N = 3
A =
_
_
_
_
_
_
_
_
_
_
1 0 0 0
0 6 0 0
3 0 12 0
0 12 0 8
3 0 12 0
0 6 0 0
1 0 0 0
_

_
=
_
_
_
_
_
_
_
_
_
_
1 0 0 0
0 2 0 0
3 0 4 0
0 4 0 8
3 0 4 0
0 2 0 0
1 0 0 0
_

_
_
_
_
_
1 0 0 0
0 3 0 0
0 0 3 0
0 0 0 1
_

_
(38)
The second matrix is canceled out when we concatenate this code to other codes
with a uniform interleaver.
4.3.1.2 Case p = 3
Starting from general formula in Z-transform we have:
A
c(3)
(W, D) = (1 + 3W
2
+ (3W + W
3
)D)
N
(39)
It can be expanded as following:
A
c(3)
(W, D) =
N

d=0
_
N
d
_
(1 + 3W
2
)
Nd
(3W + W
3
)
d
D
d
(40)
A
c(3)
(W, D) =
N

d=0
_
N
d
_
__
Nd

i =0
_
N d
i
_
3
i
W
2i
_

_
d

i =0
_
d
i
_
3
i
(W)
2(di )
_
W
d
_
D
d
(41)
It can be written as:
A
c(3)
(W, D) =
N

d=0
_
N
d
_
__
Nd

ii=0
_
N d
ii
_
3
ii
W
2ii
_

_
d

i =0
_
d
i
_
3
(di )
(W)
2i
_
W
d
_
D
d
(42)
Then we have
A
c(3)
(W, D) =
N

d=0
_
N
d
_
_
d

i =0
Nd

ii=0
_
d
i
__
N d
ii
_
3
ii+(di )
W
d+2i +2ii
_
D
d
(43)
04-Abbasfar-c04 SDO003-abbasfar (Typeset by spi publisher services, Delhi) 50 of 66 August 1, 2007 13:41
50 4 Very Simple Turbo-like Codes
If we let j = i + ii, we have
A
c(3)
(W, D) =
N

d=0
_
N
d
_
_
N

j =0
_
min( j,d)

i =max(0, j N+d)
_
d
i
_

_
N d
j i
_
3
d+j 2i
W
d+2 j
__
D
d
(44)
Therefore, it is easy to show that the IOWE becomes:
A
c(3)
w,d
=
_

_
_
N
d
_
min( j,d)

i =max(0, j N+d)
__
d
i
__
N d
j i
_
3
d+j 2i
_
;
w = d + 2 j for j = 0, . . . ,N
0; otherwise
(45)
It can be written as
A
c(3)
w,d
=
_
N
d
_
N

j =0
_
_
_
min( j,d)

i =max(0, j N+d)
__
d
i
__
N d
j i
_
3
d+j 2i
_
(w d 2 j )
_
_
_
(46)
where () is the Kronecker delta function.
Meanwhile we have the following property:
A
c(3)
w,d
= A
c(3)
3Nw,Nd
(47)
This property can be proven very easily by taking the complements of three input
bits to a check. The output of the check is also inverted. If the number of nonzero
input and output bits are w and d, respectively, the number of nonzero bits in
a complemented version is 3N w and N d. This proves the property. This
property helps to save some computations.
Example: N = 3
A =
_
_
_
_
_
_
_
_
_
_
_
_
_
_
_
_
1 0 0 0
0 9 0 0
9 0 27 0
0 57 0 27
27 0 99 0
0 99 0 27
27 0 57 0
0 27 0 9
0 0 9 0
0 0 0 1
_

_
=
_
_
_
_
_
_
_
_
_
_
_
_
_
_
_
_
1 0 0 0
0 3 0 0
9 0 9 0
0 19 0 27
27 0 33 0
0 33 0 27
27 0 19 0
0 9 0 9
0 0 3 0
0 0 0 1
_

_
_
_
_
_
1 0 0 0
0 3 0 0
0 0 3 0
0 0 0 1
_

_
(48)
04-Abbasfar-c04 SDO003-abbasfar (Typeset by spi publisher services, Delhi) 51 of 66 August 1, 2007 13:41
4.3 RA Codes with Puncturing 51
Check(4)
4N N
Check(2)
4N N 2N
Check(2)
p
N 2N
Check(2)
2N 4N
Check(2)
Fig. 48 Block diagram of check_4 code and its equivalents
4.3.1.3 Case p = 4
The code for this case can be viewed as a concatenated code as shown in Figure 48.
Because the check code is regular and memoryless, we can put any interleaver
between the codes without changing the IOWE of the overall code.
By using a uniform interleaver and the results found for case p = 2 the IOWE can
be written as:
A
c(4)
w,d
=
2N

h=0
A
c(2)
w,h
A
c(2)
h,d
_
2N
h
_ (49)
Using the result for case p = 2, we obtain
A
c(4)
w,d
=
_
N
d
_
Nd

j =0
_
_
_
2Nd2 j

i =0
__
N d
j
__
2N d 2 j
i
_
2
2d+2 j
_
(w d 2i 2 j )
_
_
_
(50)
Example: N = 3
A
c(4)
=
_
_
_
_
_
_
_
_
_
_
_
_
_
_
_
_
_
_
_
_
_
_
1 0 0 0
0 12 0 0
18 0 48 0
0 156 0 64
111 0 384 0
0 600 0 192
252 0 672 0
0 600 0 192
111 0 384 0
0 156 0 64
18 0 48 0
0 12 0 0
1 0 0 0
_

_
=
_
_
_
_
_
_
_
_
_
_
_
_
_
_
_
_
_
_
_
_
_
_
1 0 0 0
0 4 0 0
18 0 16 0
0 52 0 64
111 0 128 0
0 200 0 192
252 0 224 0
0 200 0 192
111 0 128 0
0 52 0 64
18 0 16 0
0 4 0 0
1 0 0 0
_

_
_
_
_
_
1 0 0 0
0 3 0 0
0 0 3 0
0 0 0 1
_

_
(51)
04-Abbasfar-c04 SDO003-abbasfar (Typeset by spi publisher services, Delhi) 52 of 66 August 1, 2007 13:41
52 4 Very Simple Turbo-like Codes
This method can be applied for any p that can be decomposed into two smaller
numbers.
Having computed the IOWE of the check code, we can use the uniform interleaver
formula to come up with the IOWE of the accumulator with puncturing. We have:
A
acc( p)
w,d
=
N

h=0
A
c( p)
w,h
A
acc
h,d
_
N
h
_ (52)
The simplied expressions for cases p = 2, 3, and 4 are as follows:
A
acc(2)
w,d
=
N

h=0
Nh

j =0
_
N h
j
__
N d
h/2
__
d 1
h/2 1
_
2
h
(w h 2 j ) (53)
A
acc(3)
w,d
=
N

h=0
N

j =0
_
_
_
min( j,h)

i =max(0, j N+h)
_
h
i
__
N h
j i
__
N d
h/2
__
d 1
h/2 1
_
3
h+j 2i
(w h 2 j )
_
_
_
(54)
A
acc(4)
w,d
=
N

h=0
Nh

j =0
_
_
_
2Nh2 j

i =0
_
N h
j
__
2N h 2 j
i
__
N d
h/2
__
d 1
h/2 1
_
2
2h+2 j
(w h 2i 2 j )
_
_
_
(55)
It should be noted that despite the fact that we use a uniform interleaver to obtain the
IOWE, we come up with the exact IOWE for accumulator with puncturing.
The next step is to nd the IOWE of the RAcode with puncturing, which is derived
in case of a uniform interleaver after repetition.
A
rep(q)acc( p)
w,d
=
qN

h=0
A
rep(q)
w,h
A
acc(p)
h,d
_
qN
h
_ (56)
Therefore, the closed form expressions for IOWE of RA( p = 2, q = 2), RA( p =
3, q = 3), and RA( p = 4, q = 4) will be the following:
A
rep(2)acc(2)
w,d
=
_
N
w
_
_
2N
2w
_
N

h=0
Nh

j =0
_
N h
j
__
N d
h/2
__
d 1
h/2 1
_
2
h
(2w h 2 j ) (57)
04-Abbasfar-c04 SDO003-abbasfar (Typeset by spi publisher services, Delhi) 53 of 66 August 1, 2007 13:41
4.3 RA Codes with Puncturing 53
A
rep(3)acc(3)
w,d
=
_
N
w
_
_
3N
3w
_
N

h=0
N

j =0
min( j,h)

i =max(0, j N+h)
_
h
i
__
N h
j i
_

_
N d
h/2
__
d 1
h/2 1
_
3
h+j 2i
(3w h 2 j ) (58)
A
rep(4)acc(4)
w,d
=
_
N
w
_
_
4N
4w
_
N

h=0
Nh

j =0
2Nh2 j

i =0
_
N h
j
__
2N h 2 j
i
_

_
N d
h/2
__
d 1
h/2 1
_
2
2h+2 j
(4wh2i 2 j ) (59)
The above expressions are IOWE of the nonsystematic RA codes with regular
puncturing. However, in most cases we need to have systematic codes. It is very easy
to compute the IOWE of a systematic code based on its nonsystematic code. The
following formula shows the conversion.
A
sysrep(q)acc( p)
w,d
= A
rep(q)acc( p)
w,dw
(60)
4.3.2 Performance of Punctured RA Codes with ML Decoding
RA codes are usually nonsystematic codes, i.e. the information block is not sent
along with the output of the accumulator. However, the RA codes with puncturing
should be systematic in order to be decodable by iterative decoding. This constraint
is because the messages passed towards information variables are always zero; hence
not improving.
The normalized distance spectrum of some rate 1/2 codes for a block size of 4,000
are illustrated in Figure 49. These codes are RA code (q = 2), systematic RA code
with puncturing (q = 3, p = 3), (q = 4, p = 4), and random code. The distance
spectrum of a (n, k) random code is
A
random
w,d
=
_
n
w
__
k
d
_
2
k
(61)
A
random
d
=
_
k
d
_
2
nk
(62)
To compute the cutoff thresholds of these codes using Divsalar bound, the nor-
malized distance spectrum is computed when N goes to innity, i.e. the asymptotic
expression of r(). For random codes and the asymptotic expression of r() for
04-Abbasfar-c04 SDO003-abbasfar (Typeset by spi publisher services, Delhi) 54 of 66 August 1, 2007 13:41
54 4 Very Simple Turbo-like Codes
Fig. 49 Normalized distance spectrum of RA codes with puncturing
random codes with code rate R
c
is
r() = H() + (R
c
1) ln 2 (63)
where H() is the binary entropy function. The asymptotic expression of r() for RA
code with repetition q can be obtained as:
r() = max
0<e<1/q
_
1 q
q
H(qe) + (1 )H
_
qe
2(1 )
_
+ H
_
qe
2
_
_
(64)
Now we obtain the asymptotic expression of r() for systematic punctured RA
(q = 3, p = 3). After summing (4.60) over w, we let = d/2N for 0 < < 1, =
h/2N for 0 < < 1/2,
1
= i /2N for max(0,
2
+ 1/2) <
1
< min(
2
, h), and
Table VI Cut-off Thresholds for RA Codes with Puncturing
Cutoff RA_punc. RA_punc. Random Shannon
threshold RA (q =2) (q = 3, p = 3) (q = 4, p = 4) code limit
Rate
1
/
2
3.38 dB 1.49 dB 0.87 dB 0.308 dB 0.184 dB
04-Abbasfar-c04 SDO003-abbasfar (Typeset by spi publisher services, Delhi) 55 of 66 August 1, 2007 13:41
4.3 RA Codes with Puncturing 55

2
= j /2N for 0 <
2
> 1/2. Also (2
2
+ )/3 < min(0.5, ).
r() = max
,
1
.
2
_
_
_
H
_
4
2
+ 2
3
_
+ H
_

_
+ (1/2 ) H
_

2

1
1/2
_
+(
2
+ 2
1
) ln(3) +
_
1/2 +
2
2
+
3
_
H
_
/2
1/2 +
2
2
+
3
_
+
_

2
2
+
3
_
H
_
/2

2
2
+
3
_
_
_
_
(65)
To derive the asymptotic expression of r() for RA(q = 4, p = 4), we let =
d/2N for 0 < < 1, = h/2N for 0 < < 1/2,
1
= i /2N for 0 <
1
< 1 2
2
,
and
2
= j /2N for 0 <
2
< 1 . Also (2
1
+ 2
2
+ )/4 < min(0.5, ).
r() = max
,
1
.
2
_
_
_

3
2
H
_
2
1
+ 2
2
+
2
_
+ (1 2
2
)H
_

1
1 2
2
_
+(2
2
+ 2) ln(2)+(1/2 ) H
_

2
1/2
_
+
_
1/2 +
2
1
+ 2
2
+
4
_
H
_
/2
1/2 +
2
1
+2
2
+
4
_
+
_

2
1
+2
2
+
4
_
H
_
/2

2
1
+2
2
+
4
_
_
_
_
(66)
The thresholds for different codes have been computed using a brute force search
and are compared in Table VI.
In fact, by increasing both repetition and puncturing the rate remains the same,
whereas the performance gets better. It is very intriguing to know how much the
codes improve as repetition increases and whether they can achieve Shannon limit
with innite repetition and puncturing.
However, this is not very interesting from the practical point of view. The com-
plexity of code goes up with large number of repetitions and it is not desirable
for high-speed decoders. In next section we improve the code with another simple
modication.
4.3.3 DE Analysis
Here the two component codes are repeat code and accumulate code with punc-
turing. Using Gaussian approximation to obtain the SNR transfer functions of the
constituent codes, we get the transfer functions that are shown, in Figure 50.
04-Abbasfar-c04 SDO003-abbasfar (Typeset by spi publisher services, Delhi) 56 of 66 August 1, 2007 13:41
56 4 Very Simple Turbo-like Codes
Fig. 50 Density evolution for RA codes with puncturing (q = 4, p = 2)
Puncturing the accumulators makes the SNR transfer function to start from (0, 0)
point. Therefore the information bits is sent; systematic code; to enable the iterative
decoding to start. Compared to RA code (q = 3), which has the same rate, an
improvement of 0.1dB in threshold is obtained. Since we have more repetition, the
improvement comes at the expense of more complexity. However, the threshold is
still far from the good turbo-like codes.
4.4 ARA Codes
In this section we investigate the effect of an accumulator precoder on the perfor-
mance of RAcodes. It is shown that we can obtain some codes that achieve extremely
near Shannon capacity performance with ML decoding. Unfortunately, the iterative
decoding algorithm cannot achieve the same performance. However, there are still
some good codes that have performance comparable to turbo codes. The main
advantages of these codes are the simplicity and the capability of parallelization.
Therefore, practical high-speed decoders can be conceived for these codes.
The accumulator precoder is a rate one encoder used to improve the performance
of the RA codes. However, only a portion of the information block goes to the
accumulator. This is mainly because of message-passing algorithm. In other words,
04-Abbasfar-c04 SDO003-abbasfar (Typeset by spi publisher services, Delhi) 57 of 66 August 1, 2007 13:41
4.4 ARA Codes 57
ACC
N-M
M
N-M
Fig. 51 The block diagram of the precoder
M bits are passed through without any change and the rest (N M bits) goes through
an accumulator. M is considered a parameter in code design. The effect of this
parameter is studied in ML and iterative decoding. Then, it is optimized for achieving
the best performance. The block diagram of the precoder is shown in Figure 51.
4.4.1 ML Analysis
In order to nd the performance of the code we need to compute the IOWE of the
precoder. It is easily computed using the IOWE of the accumulator code as follows:
A
pre
w,d
=
M

m=0
_
M
m
_
A
acc
wm,dm
(67)
Therefore the IOWE of the overall code can be written as:
A
prerep(q)acc( p)
w,d
=
N

h=0
A
pre
w,h
A
rep(q)acc( p)
h,d
_
N
h
_ (68)
For systematic ARA code ( p = 3, q = 3), we have
A
prerep(3)acc(3)
w,d
=
M

m=0
N

k=0
_
M
m
_
_
3N
3k
_
N

h=0
N

j =0
min( j,h)

i =max(0, j N+h)
_
h
i
__
N h
j i
_

_
N d
h/2
__
d 1
h/2 1
__
N M k + m
(w m)/2
_

_
k m 1
(w m)/2 1
_
3
h+j 2i
(3k h 2 j ) (69)
A
prerep(4)acc(4)
w,d
=
M

m=0
N

k=0
_
M
m
_
_
4N
4k
_
N

h=0
Nh

j =0
2Nh2 j

i =0
_
N h
j
__
2N h 2 j
i
_

_
N d
h/2
__
d 1
h/2 1
__
N M k + m
(w m)/2
_

_
k m 1
(w m)/2 1
_
2
2h+2 j
(4k h 2i 2 j ) (70)
04-Abbasfar-c04 SDO003-abbasfar (Typeset by spi publisher services, Delhi) 58 of 66 August 1, 2007 13:41
58 4 Very Simple Turbo-like Codes
Fig. 52 ARA(3,3) BER performance bound
The above expressions are IOWE of the nonsystematic ARA codes with regular
puncturing. The IOWE of their systematic codes are derived by the following con-
version.
A
sysARA(q, p)
w,d
= A
ARA(q, p)
w,dw
(71)
4.4.2 Performance of ARA Codes with ML Decoding
Divsalar BER performance bound of the ARA(3,3) and ARA(4,4) for different
Ms are shown in Figures 52 and 53. It is observed that the more number of bits
accumulates in the precoder, the lower the code threshold becomes. However, the
improvement stops at a certain point, which is M = 1/5 N for ARA(3,3) and
M = 2/5 N for ARA(4,4). It is obvious that when M = N the codes turn into
RA with puncturing. It is very heartening that the performance of the ARA(4,4)
approaches very closely to that of randomcodes for the same block size in low E
b
/N
0
region.
It is very instructive to observe the distance spectrum of these codes (For optimum
M). As we see in Figure 54, the only difference between the distance spectrum of
these codes and a random code is in the low-distance region, which causes the error
oor.
04-Abbasfar-c04 SDO003-abbasfar (Typeset by spi publisher services, Delhi) 59 of 66 August 1, 2007 13:41
4.4 ARA Codes 59
Fig. 53 ARA(4,4) BER performance bound
Now we obtain the asymptotic expression of r() for systematic punctured
ARA(q = 3, p = 3). After summing (4.71) over w, we let = M/2N for 0 <
< 1/2,
1
= m/2N for 0 <
1
< ,
2
= (w m)/2N for 0 <
2
< 1/2, = d/2N
for 0 < < 1, = h/2N for 0 < < 1/2,
1
= i /2N for max(0,
2
+ 1/2) <

1
< min(
2
, h), and
2
= j /2N for 0 <
2
< 1/2. Also (2
2
+ )/3 < min(0.5, ).
r() = max

1
,
2
,,
1
.
2
_
_
_
H
_

3
2
H
_
4
2
+ 2
3
_
+ (
2
+ 2
1
) ln(3)
+ H
_

_
+ (1/2 ) H
_

2

1
1/2
_
+ (
1

2
)H
_
/2

1

2
_
+(1/2 +
1
+
2
)H
_
/2
1/2 +
1
+
2
_
+
_
2
2
+
3

1
_
H
_

2
/2
2
2
+
3

1
_
+
_
1/2
2
2
+
3
+
1
_
H
_

2
/2
1/2
2
2
+
3
+
1
_
_
_
_
(72)
04-Abbasfar-c04 SDO003-abbasfar (Typeset by spi publisher services, Delhi) 60 of 66 August 1, 2007 13:41
60 4 Very Simple Turbo-like Codes
Fig. 54 Normalized distance spectrum of ARA codes with puncturing
To derive the asymptotic expression of r() for ARA(q =4, p =4), we let = M/2N
for 0 < <1/2,
1
= m/2N for 0 <
1
< ,
2
=(w m)/2N for 0 <
2
<1/2 ,
= d/2N for 0 < < 1, = h/2N for 0 < < 1/2,
1
= i /2N for 0 <
1
< 1
2
2
, and
2
= j /2N for 0 <
2
<1 . Also (2
1
+ 2
2
+ )/4 < min(0.5, ).
r() = max

1
,
2
,,
1
.
2
_
_
_
H
_

_
2H
_
2
1
+ 2
2
+
2
_
+ (1 2
2
)
H
_

1
1 2
2
_
+ (1/2 +
1
+
2
)H
_
/2
1/2 +
1
+
2
_
+(
1

2
)H
_
/2

1

2
_
+
_
1/2 +
2
1
+ 2
2
+
4
+
1
_
H
_

2
/2
1/2 +
2
1
+2
2
+
4
+
1
_
+
_
2
1
+ 2
2
+
4

1
_
H
_

2
/2
2
1
+2
2
+
4

1
_
+ (2
2
+ 2) ln(2)
_
_
_
(73)
04-Abbasfar-c04 SDO003-abbasfar (Typeset by spi publisher services, Delhi) 61 of 66 August 1, 2007 13:41
4.4 ARA Codes 61
Table VII Cutoff threshold for rate-1/2 ARA codes

Code:              ARA punc. (q = 3, p = 3)   ARA punc. (q = 4, p = 4)   Random code   Shannon limit
Cutoff threshold:  0.509 dB                   0.310 dB                   0.308 dB      0.184 dB
Table VII tabulates the Divsalar cutoff threshold for the same codes as in Figure 54. As expected from the BER performance bound, the cutoff threshold of ARA(4,4) is extremely close to that of a random code.
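For orientation, bounds of this kind are computed from the IOWE. The sketch below evaluates a plain union bound, which is looser than the Divsalar bound used in the text but takes the same inputs; names and layout are ours:

```python
from math import erfc, sqrt

def union_bound_ber(A, N, rate, ebno_db):
    """Plain union bound on the BER of ML decoding from an IOWE table.
    A[w][d] counts codewords with input weight w and output weight d;
    N is the input block length; rate is the code rate."""
    ebno = 10 ** (ebno_db / 10)
    ber = 0.0
    for w in range(1, len(A)):
        for d in range(1, len(A[w])):
            if A[w][d]:
                # Q(sqrt(2 d R Eb/N0)) = 0.5 erfc(sqrt(d R Eb/N0))
                ber += (w / N) * A[w][d] * 0.5 * erfc(sqrt(d * rate * ebno))
    return ber
```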
4.4.3 DE Analysis
Unfortunately, iterative decoding does not perform as well as ML decoding. Moreover, the difference between the performances of two codes under iterative decoding cannot be predicted from their ML decoding performance.
The effect of the precoder in iterative decoding is very clear in Figure 55, where the accumulator and the repetition are regarded as one constituent code. The SNR transfer functions of this code and of the simple repetition code are shown for comparison. The noticeable difference is a shift in the curve, which improves the threshold by almost 0.5 dB.

Fig. 55 Density evolution for ARA codes with puncturing (q = 4, p = 2)

Fig. 56 Performance of ARA codes using iterative decoding
We have used the DE method to optimize the ARA codes for iterative decoding. ARA(4,4) and ARA(3,3) achieve their best performance with M = 0.7N and M = 0.5N, respectively. The performance of these codes is illustrated in Figure 56.
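In practice the DE optimization reduces to a threshold search: for each candidate Eb/N0 the density-evolution recursion is run to see whether it converges, and the boundary is found by bisection. The sketch below shows only the search scaffolding; de_converges stands in for the code-specific DE recursion and is a hypothetical callback:

```python
def find_threshold(de_converges, lo_db=-1.0, hi_db=3.0, tol=0.01):
    """Bisect for the smallest Eb/N0 (dB) at which density evolution
    converges. de_converges(ebno_db) -> bool runs the DE recursion for
    the code under test (a hypothetical, code-specific callback)."""
    assert not de_converges(lo_db) and de_converges(hi_db)
    while hi_db - lo_db > tol:
        mid = 0.5 * (lo_db + hi_db)
        if de_converges(mid):
            hi_db = mid   # converges: the threshold is at or below mid
        else:
            lo_db = mid   # fails: the threshold is above mid
    return hi_db
```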
4.5 Other Precoders
Although the ML decoding performance of ARA codes with a simple accumulator precoder is very close to that of random codes, there is no known practical method of actually performing this decoding. Therefore, we look for codes that have good performance with iterative decoding. We have observed that very good codes can be obtained with different precoders. In this section we introduce some of these precoders.
Fig. 57 The block diagram of the new precoder: all N information bits enter the accumulator (ACC); pattern P0 keeps N − M of the information bits and pattern P1 keeps M of the accumulator outputs
4.5.1 Accumulator with Puncturing
A simple modification is to use puncturing at the output of the accumulator in the precoder. A general block diagram of this precoder is drawn in Figure 57. P0 and P1 denote puncturing with two different patterns; the two puncturing blocks usually have complementary patterns.
Two points are worth noting here. First, in this precoder all the information bits go through the accumulator, in contrast to the plain accumulator precoder. Second, the precoder rate is one.
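A bit-level sketch of this precoder may help (our illustration; the alternating complementary patterns and M = N/2 are one possible configuration, not prescribed by the block diagram):

```python
def punctured_acc_precoder(u):
    """Rate-one precoder of Fig. 57 (sketch): accumulate all information
    bits, then keep even-indexed information bits (pattern P0) and
    odd-indexed accumulator outputs (pattern P1). This assumes M = N/2
    and alternating complementary patterns, one possible choice."""
    s, acc = 0, []
    for b in u:               # accumulator: s_k = s_{k-1} XOR u_k
        s ^= b
        acc.append(s)
    return [u[k] if k % 2 == 0 else acc[k] for k in range(len(u))]
```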
Using the above precoder, we have designed a new ARA code with better performance. The optimum value for M is N/2; therefore, every other bit is punctured, both from the information bits and from the accumulated sequence, with complementary patterns. The RA code is RA(q = 3, p = 3). The Tanner graph for this code is shown in Figure 58.
This code has a 0.1 dB better threshold and a lower error floor, as shown in Figure 59.
Fig. 58 Tanner graph for the new ARA code
Fig. 59 Performance of the new ARA code
4.6 Hardware Complexity
Since the building blocks of the decoder are repetition and check nodes, as in LDPC codes, and the message-passing updates for these blocks are very simple, the hardware complexity is very low as far as logic is concerned.
The memory requirement depends on the number of edges in the graph. ARA codes have quite a small number of edges, which results in a low memory requirement.
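For instance, with log-likelihood-ratio messages the two node types reduce to a few additions and comparisons. A minimal min-sum sketch (min-sum is a common simplification; the text does not commit to a particular update rule here):

```python
def variable_node_update(channel_llr, incoming):
    """Repetition (variable) node: each outgoing LLR is the sum of the
    channel LLR and all incoming LLRs except the one on that edge."""
    total = channel_llr + sum(incoming)
    return [total - m for m in incoming]

def check_node_update(incoming):
    """Check node, min-sum rule: outgoing magnitude is the minimum of the
    other magnitudes; outgoing sign is the product of the other signs."""
    out = []
    for i in range(len(incoming)):
        others = incoming[:i] + incoming[i + 1:]
        sign = 1.0
        for m in others:
            sign = sign if m >= 0 else -sign
        out.append(sign * min(abs(m) for m in others))
    return out
```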
4.7 Conclusion
This study proposes a novel coding structure which is not only very simple, but also achieves performance comparable to or better than the best practical turbo codes and LDPC codes. The ML analysis showed that in some cases these codes come extremely close to random codes, which achieve the Shannon limit.
The proposed coding scheme generates a family of LDPC codes for various code rates and data frame sizes, with performance close to the Shannon capacity limit.
Unlike general LDPC codes, they also have very simple encoding. The main innovation is the inclusion of a very simple precoder, constructed from parallel punctured accumulators. Such a precoder improves the performance.
The regularity and simplicity of the proposed coding structure also allow the construction of very high-speed iterative decoders using the message-passing algorithm.
Chapter 5
High Speed Turbo-like Decoders
5.1 Introduction
This chapter presents the architecture for high-speed decoding of ARA codes, in which the message-passing algorithm enables us to achieve parallelism. Simulations have shown that efficiency is not compromised in order to obtain speed gains.
As in the parallel turbo decoder, memory access poses a practical problem. We extend the concept of the conflict-free interleaver to address this problem. This leads to the introduction of a new class of turbo-like codes that can be decoded very fast. It is shown that the proposed high-speed turbo and ARA decoders are among the codes in this class. The general architecture for decoding this class is presented.
5.2 Parallel ARA Decoder
To build high-speed decoders for ARA codes we follow a path similar to the one used for high-speed turbo decoders. The basic idea is to partition the graph into several subgraphs and let them work in parallel. For hardware regularity it is desirable that the subgraphs be identical or have minimal variety. As an example, consider the ARA code shown in Figure 58; its partitioned graph is drawn in Figure 60.
Each subgraph is decoded using the message-passing algorithm. Since the subgraphs have a tree structure, efficient scheduling provides the fastest decoding method. Usually the decoding of each subgraph is done serially, which lowers the complexity. The hardware entity that performs the decoding for one subgraph is called a subgraph processor, or window processor, since each subgraph corresponds to a window of the codeword.
There are three types of messages communicated within and between subgraphs: internal messages correspond to edges within one subgraph; border messages relate to edges connecting two adjacent subgraphs; external messages are passed between subgraphs through the interleaver, i.e., they correspond to edges with global span. External messages are called extrinsics, after their counterparts in turbo codes.
Fig. 60 The partitioned graph of the ARA code
We need memory for storing all the messages. The memory for internal messages is local to the subgraph processors, whereas external messages reside in a global memory to which all subgraphs must have access. Border messages are usually stored in registers that are part of the subgraph processors; they are exchanged with neighboring subgraphs at the end of each iteration.
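Putting the three message types together, one iteration of the parallel decoder can be modeled structurally as follows (a sketch with our own data layout, not the actual hardware schedule):

```python
def parallel_iteration(windows, ext_mem, interleaver, decode_window):
    """One iteration of the parallel decoder (structural sketch).
    windows       : list of M dicts with keys 'border_left',
                    'border_right', and 'ext_out' (plus internal state)
    ext_mem       : dict acting as the global extrinsic memory
    interleaver   : maps (window index, edge index) to an ext_mem slot
    decode_window : callback doing the tree-schedule message passing"""
    M = len(windows)
    for w in windows:                    # internal messages, local memory
        decode_window(w)
    for n in range(M):                   # border messages: register swap
        nxt = windows[(n + 1) % M]
        windows[n]['border_right'], nxt['border_left'] = \
            nxt['border_left'], windows[n]['border_right']
    for n, w in enumerate(windows):      # extrinsics via the interleaver
        for e, msg in enumerate(w['ext_out']):
            ext_mem[interleaver[(n, e)]] = msg
```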
Therefore the architecture of the decoder is as in Figure 61, in which a and b are border messages and x and y are extrinsics; internal messages are not shown. This architecture is very similar to the parallel turbo decoder. The only difference is that there are two different window processors here, denoted W and W′.

Fig. 61 Parallel turbo decoder structure
5.3 Speed Gain and Efficiency

Unlike in turbo decoders, parallel processing here does not cost much additional computation, because LDPC codes are inherently parallel. What we are doing is making the parallelization practically feasible.
5.4 Interleaver Design
Although the message-passing algorithm allows us to parallelize the decoding process, accessing so many extrinsics at the same time poses a practical problem. Since M window processors run at the same time, M extrinsics are used simultaneously. The extrinsic memory is therefore organized in M banks in order to facilitate this simultaneous access; i.e., M locations, one per bank, are accessed simultaneously.
As we discussed for the parallel turbo decoder, the interleaver should be such that the window processors get their extrinsics from different memory banks in interleaved order as well. This forces us to use the conflict-free interleaver presented for the parallel turbo decoder. However, in this section we look at this problem from a graphical point of view. The parallel decoder comprises M identical processors running in parallel. We put the partitions in parallel planes and then look at the projected graph. The projected graph for the ARA code of Figure 60 is shown in Figure 62.
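The memory-bank constraint is easy to state in code: at every access step, the M extrinsic addresses must fall in M distinct banks. A small checker under one assumed position-to-bank mapping (our convention, not necessarily the hardware's):

```python
def is_conflict_free(pi, M):
    """Check the banked-memory condition for interleaver pi of length K.
    Assumed mapping: processor n owns positions n*W .. (n+1)*W - 1
    (W = K/M), a position's bank is position // W, and at local step t
    processor n reads the extrinsic at pi[n*W + t]."""
    K = len(pi)
    assert K % M == 0
    W = K // M
    for t in range(W):
        banks = {pi[n * W + t] // W for n in range(M)}
        if len(banks) != M:   # two processors would hit the same bank
            return False
    return True
```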
The projected graph can be viewed as the vectorized version of the actual graph. In other words, there is a message vector associated with every edge in the projected graph. The structure of the message memories is such that only one message vector is accessible at a time. The interleaver should preserve each message vector in its entirety, but permutation is allowed within a vector. The permutation within a vector is a permutation among the window processors, or among the different planes in the overall graph. This permutation does not change the projected graph.

Fig. 62 Projected graph

Therefore the interleaver consists of several independent permutations within the message vectors. The way the vectors are connected between the two constituent codes is another flexibility in the interleaver design. An example of a projected graph with a conflict-free interleaver is shown in Figure 63. The dashed edges indicate that permutation is allowed within a vector.

Fig. 63 Projected graph with conflict-free interleaver
The above connections not only guarantee the conflict-free structure, but also ensure that messages propagate throughout the graph. The projected graph therefore provides a very useful approach for designing turbo-like codes for high-speed decoding.
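Equivalently, a conflict-free interleaver can be generated directly from the projected graph by assigning one independent permutation of the M planes to each projected edge; a minimal construction sketch (ours):

```python
import random

def lift_interleaver(projected_edges, M, seed=0):
    """Build a conflict-free interleaver from a projected graph (sketch).
    Each projected edge carries an M-long message vector; the interleaver
    applies an independent permutation of the M planes to every vector,
    so by construction it never mixes different vectors."""
    rng = random.Random(seed)
    return {edge: rng.sample(range(M), M) for edge in projected_edges}

# Example: pi = lift_interleaver(['e0', 'e1', 'e2'], M=8)
# pi['e0'][k] tells plane k where its message on projected edge 'e0' goes.
```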
5.5 Projected Graph
In this section, a design methodology for turbo-like codes based on the projected graph is presented. There are two ways to design codes based on projected graphs.
The first approach is the one used so far to parallelize the decoder. It is based on partitioning an existing code graph into subgraphs, and it works on any regular or semiregular graph. The projected graph includes one partition of each component code, which is called a component graph. The component graphs are connected with conflict-free interleavers. This method was used for the ARA code whose projected graph is shown in Figure 63.

Fig. 64 A PCCC projected graph with conflict-free interleaver

It is shown that the parallel turbo decoder is a member of this class. Later on, LDPC codes based on the projected graph are also introduced.
5.5.1 Parallel Turbo Decoder
The parallel turbo decoder proposed in Chapter 3 is one example of a decoder based on the projected graph. The projected graph of such a code is illustrated in Figure 64.
It is very instructive to note that the reverse interleaver, used for decreasing the latency, is clearly shown connecting the edges of the two component graphs in reverse order. From the projected-graph argument, we can see that the interleaver structure is just several independent permutations. The number of permutations needed equals the window size; here it is 4.
5.5.2 Other Known Turbo-like Codes
The class of codes based on the projected graph covers a wide range of turbo-like codes. This section introduces some known turbo and turbo-like codes with their projected graphs. Figure 65 illustrates the projected graphs for a parallel turbo code with three constituent codes, a serial turbo code, an RA code, and an IRA code.
Fig. 65 (a) PCCC with three component codes; (b) SCCC; (c) RA(3); (d) IRA(2,3)
The second approach is to design the code by designing its projected graph directly. In this method we design the component graphs and the connections between them so as to obtain good performance. In other words, the partitions are designed first and then put together to create the constituent codes.
This approach is very appealing because the resulting code is definitely parallelizable, and the performance of the code can be analyzed very efficiently through its component graphs. The first example of this approach is parallel LDPC codes, explained in the following section.
5.5.3 Parallel LDPC Codes
This section explains how to design LDPC codes with parallel decoding capability. This class of LDPC codes was independently discovered by Richardson et al. [29], who call them vector LDPC codes. Thorpe [32] introduced LDPC codes based on protographs, which is basically the same concept. In this section we present this class as codes with a projected graph.
There are two component graphs in these codes: one contains only single parity-check codes (of variable degree) and the other only repetition codes (of variable degree). One example of such a code is shown in Figure 66.
There are some noticeable facts about this projected graph. All variable nodes are in one component graph, which means that all the observations are stored and processed in one kind of window processor. Variable and check nodes can have different degrees; therefore this structure is capable of implementing both regular and irregular LDPC codes. The degree distribution of variable and check nodes is known from the projected graph. There are no local messages, and no border messages are passed between adjacent subgraphs; in other words, we only have external edges. Therefore, the projected graph can be represented graphically as a simple Tanner graph. The graphical representation of the above example is shown in Figure 67. The number of interleavers needed for this code is equal to the number of edges of the projected graph.
The only disadvantage of this design method is that it does not provide an efficient encoder. Sometimes simple encoding is not possible for codes designed this way.
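Concretely, designing by projected graph amounts to replacing each edge of a small base Tanner graph with an M × M permutation. The sketch below (ours) lifts a binary base matrix this way; it also hints at why encoding is not automatically easy, since the lifted parity-check matrix need not have any convenient encoding structure:

```python
import random

def lift_base_matrix(B, M, seed=0):
    """Lift a binary base parity-check matrix B (the projected graph) by
    an expansion factor M: each 1 becomes a random M x M permutation
    matrix and each 0 an all-zero block. Returns the lifted matrix H."""
    rng = random.Random(seed)
    rows, cols = len(B), len(B[0])
    H = [[0] * (cols * M) for _ in range(rows * M)]
    for r in range(rows):
        for c in range(cols):
            if B[r][c]:
                perm = rng.sample(range(M), M)
                for k in range(M):
                    H[r * M + k][c * M + perm[k]] = 1
    return H
```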
Fig. 66 A parallel LDPC projected graph
Fig. 67 Simple graphical representation of an LDPC projected graph
5.5.4 More Accumulate-Repeat-Accumulate Codes
In this section some improved ARA codes, designed based on projected graphs, are introduced. These are ARA codes with different code rates. The thresholds of these codes are also compared with the channel-capacity threshold.
5.5.4.1 Code Rates <1/2
The first idea that comes to mind for lowering the rate is to transmit the bits that were punctured in the accumulator. We start with the ARA code without puncturing to obtain the lowest code rate. The projected graph for this code is shown in Figure 68; the code rate is 1/4 in this case.
By puncturing more bits in the accumulator, ARA codes with higher rates are obtained. The projected graphs for some of the codes derived this way are drawn in Figure 69. It should be noted that codes with rates larger than 1/2 cannot be constructed by puncturing more bits, because iterative decoding fails to start when so many bits are punctured.
The thresholds for these codes are listed in Table VIII.
Although the construction of the above codes is very simple, they are not competitive with very good codes as far as performance is concerned. With some minor changes in the projected graph of the rate-1/2 code, better codes with lower rates are obtained. The projected graphs for these codes are shown in Figure 70.
The rate-1/3 code is derived by simply encoding zero bits instead of one information bit, i.e., variable node 1 in the upper row. The rate-1/4 code is constructed by sending the intermediate variable that is punctured in the rate-1/2 and rate-1/3 codes. The thresholds of these codes are listed in Table IX and compared to the Shannon limit.

Fig. 68 ARA code without puncturing

Fig. 69 (a) Rate 1/3 ARA code; (b) rate 1/2 ARA code

Table VIII Cutoff threshold for ARA codes with rate < 1/2

Rate           1/2        1/3       1/4
Threshold      0.51 dB    0.09 dB   0.02 dB
Shannon limit  0.184 dB   -0.5 dB   -0.7 dB
5.5.4.2 Code Rates = 1/2
The ML analysis of ARA codes showed that their performance improves as more bits are accumulated in the precoder. This was very inspiring in our search for better ARA codes.
Figure 71 illustrates two codes that precode 2/3 and 3/4 of the information bits. The thresholds for these codes are 0.4 dB and 0.33 dB, respectively, which are 0.1 dB and 0.17 dB better than the best rate-1/2 code we had designed so far.
So far the repetitions used in the codes have been the same for all variables. Using variable repetitions gives further improvement. An example of an ARA code with irregular repetition is shown in Figure 72; the threshold for this code is 0.367 dB.
It is very important to understand that the structure of the projected graph is of major importance in achieving good performance. The projected graphs presented here are the result of an extensive trial-and-error search.
Fig. 70 (a) Rate 1/2 ARA code; (b) new rate 1/3 ARA code; (c) new rate 1/4 ARA code
Table IX Cutoff threshold for improved ARA codes with rate < 1/2

Rate           1/2        1/3        1/4
Threshold      0.51 dB    -0.05 dB   -0.15 dB
Shannon limit  0.184 dB   -0.5 dB    -0.7 dB
Fig. 71 Improved rate 1/2 ARA codes
Fig. 72 Irregular rate 1/2 ARA codes
Table X Cutoff threshold for ARA codes with rate > 1/2

Rate            4/7     5/8     2/3     7/10    8/11    3/4     10/13
Threshold (dB)  0.700   1.006   1.272   1.506   1.710   1.894   2.057
Shannon limit   0.530   0.815   1.059   1.272   1.459   1.626   1.777
Difference      0.170   0.191   0.213   0.234   0.251   0.268   0.280
5.5.4.3 Code Rates >1/2
In this section we present a family of ARA codes derived from the rate-1/2 irregular ARA code of Figure 72. The projected graph of these codes is shown in Figure 73, and the thresholds of this family are listed for different rates in Table X, which also shows how close they are to the Shannon limit.
5.6 General Hardware Architecture
In this section we present a general hardware architecture for implementing parallel turbo-like decoders. Without any loss of generality we focus on turbo-like codes with two constituent codes. This can easily be extended to codes with several constituent codes by grouping them into two combined constituent codes. The general hardware architecture is shown in Figure 74. EXTn denotes the external memory for the nth window processor.
Fig. 73 Irregular ARA code family for rate > 1/2

Since the processors are identical and run in parallel, the scheduling is the same for all of them. Therefore, only one scheduling controller is needed for each constituent code. The scheduling controller determines which message vector is accessed and which permutation is used. The permutor is a memoryless block that permutes the message vector on the fly. Since the message vectors are permuted differently, the permutor should be programmable. If M, the number of window processors, is large, the permutor can be the bottleneck of the hardware design.

Fig. 74 Parallel decoder hardware architecture
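Behaviorally, the permutor applies one selected permutation to the M-wide message vector per access; a one-line model (ours):

```python
def permutor(vector, perm):
    """Behavioral model of the permutor: apply the selected permutation
    to an M-wide message vector in a single step. The scheduling
    controller supplies perm (the Permutation Select input) along with
    the address of the message vector being accessed."""
    return [vector[perm[k]] for k in range(len(perm))]
```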
The architecture of one window processor is depicted in Figure 75.
AM and BM denote the registers that contain the border messages. The observation memory is loaded at the beginning of the decoding and remains intact until the end. This memory is not necessary for all window processors.

Fig. 75 Window processor hardware architecture
5.7 Conclusion
In this chapter, an architecture for high-speed decoding of ARA codes was presented. Two major issues in high-speed decoding were addressed: parallel processing and the memory access problem. This led to the introduction of a new class of turbo-like codes that can be decoded very fast: the codes with a projected graph. This classification provides an alternative method of designing turbo-like codes for high-speed decoding. It was shown that the proposed high-speed turbo and ARA decoders are among the codes in this class. The general architecture for decoding this class of codes was also presented.
The generalized coding structure developed during this research is a powerful approach toward designing turbo-like codes that are suitable for high-speed decoding. However, some areas are not yet covered or could complement this research; they are described as follows.
First, in designing ARA codes the focus was on improving the threshold. However, another important aspect of performance, the error floor, is usually ignored. Two important factors affect the error floor of a code: the code structure and the interleaver design. The code structure is selected to obtain a certain threshold; therefore, the interleaver design is used to improve the error floor. ARA codes with pseudo-random interleavers usually have high error floors. We have been able to improve the error floor by orders of magnitude through manual changes in the interleavers. It is very important to find a systematic way to design or modify interleavers for a low error floor. The design of algorithmic interleavers is a more challenging topic, which is of more practical interest.
Second, the search for good codes based on their projected graphs is very rewarding. Since the structure of such a code guarantees high-speed decoding capability, the only concern is the performance of the code. Moreover, the projected graph is desired to be very simple, which makes the search easier. One simple way of approaching this problem is to start with known projected graphs and make some changes; analysis of the resulting code determines whether a change is good or not. We have pursued this approach, and some preliminary results show its effectiveness.
References
1. A. Abbasfar and K. Yao, "An efficient and practical architecture for high speed turbo decoders," Proceedings of VTC, Vol. 1, October 2003, pp. 337-341.
2. A. Abbasfar and K. Yao, "Interleaver design for high speed turbo decoders," Proceedings of SPIE, Vol. 5205, August 2003, pp. 282-290.
3. A. Abbasfar and K. Yao, "An efficient architecture for high-speed turbo decoders," Proceedings of ICASSP 2003, April 2003, pp. IV-521-IV-524.
4. S. Aji and R.J. McEliece, "The generalized distributive law," IEEE Trans. Inform. Theory, March 2000, 46(2), 325-343.
5. L.R. Bahl, J. Cocke, F. Jelinek, and J. Raviv, "Optimal decoding of linear codes for minimum symbol error rate," IEEE Trans. Inform. Theory, March 1974, 284-287.
6. S. Benedetto, D. Divsalar, G. Montorsi, and F. Pollara, "Soft-input soft-output APP module for iterative decoding of concatenated codes," IEEE Commun. Lett., January 1997, 22-24.
7. S. Benedetto, D. Divsalar, G. Montorsi, and F. Pollara, "Serial concatenation of interleaved codes: performance analysis, design, and iterative decoding," IEEE Trans. Inform. Theory, May 1998, 44(3), 909-926.
8. S. Benedetto and G. Montorsi, "Unveiling turbo codes: some results on parallel concatenated codes," IEEE Trans. Inform. Theory, March 1996, 42(2), 409-428.
9. C. Berrou, A. Glavieux, and P. Thitimajshima, "Near Shannon limit error correcting coding and decoding: turbo codes," Proceedings of the 1993 IEEE International Conference on Communications, Geneva, Switzerland, May 1993, pp. 1064-1070.
10. D. Divsalar, "A simple tight bound on error probability of block codes with application to turbo codes," JPL TMO Progress Report 42-139, November 1999, pp. 1-35.
11. D. Divsalar, S. Dolinar, and F. Pollara, "Iterative turbo decoder analysis based on Gaussian density evolution," IEEE J. Select. Areas Commun., May 2001, 19(5), 891-907.
12. D. Divsalar, H. Jin, and R.J. McEliece, "Coding theorems for turbo-like codes," Proceedings of the 36th Allerton Conference on Communication, Control and Computing, Allerton House, Monticello, IL, September 1998, pp. 201-210.
13. B.J. Frey, F.R. Kschischang, and P.G. Gulak, "Concurrent turbo-decoding," Proceedings of the IEEE International Symposium on Information Theory, Ulm, Germany, July 1997, p. 431.
14. R. Gallager, Low-Density Parity-Check Codes, MIT Press, Cambridge, MA, 1963.
15. J. Hsu and C.H. Wang, "A parallel decoding scheme for turbo codes," IEEE Symposium on Circuits and Systems, Vol. 4, Monterey, June 1998, pp. 445-448.
16. H. Jin, Analysis and Design of Turbo-like Codes, Ph.D. thesis, California Institute of Technology, Pasadena, 2001.
17. H. Jin, A. Khandekar, and R. McEliece, "Irregular repeat-accumulate codes," in: Proceedings of the 2nd International Symposium on Turbo Codes, Brest, France, 2000, pp. 1-8.
18. F.R. Kschischang and B.J. Frey, "Iterative decoding of compound codes by probability propagation in graphical models," IEEE J. Select. Areas Commun., February 1998, 16(2), 219-230.
19. S.L. Lauritzen and D.J. Spiegelhalter, "Local computations with probabilities on graphical structures and their applications in expert systems," J. R. Stat. Soc. B, 1988, 50, 157-224.
20. M. Luby, M. Mitzenmacher, M.A. Shokrollahi, D.A. Spielman, and V. Stemann, "Practical loss-resilient codes," Proceedings of the 29th Symposium on Theory of Computing, 1997, pp. 150-157.
21. M. Luby, M. Mitzenmacher, M.A. Shokrollahi, and D.A. Spielman, "Improved low-density parity-check codes using irregular graphs," IEEE Trans. Inform. Theory, 2001, 47, 585-598.
22. D.J.C. MacKay and R.M. Neal, "Good codes based on very sparse matrices," in: C. Boyd (ed.), Cryptography and Coding, 5th IMA Conference, No. 1025 in Lecture Notes in Computer Science, Springer, Berlin, 1995, pp. 100-111.
23. D.J.C. MacKay, "Good error correcting codes based on very sparse matrices," IEEE Trans. Inform. Theory, 1999, 45(2), 399-431.
24. R.J. McEliece, D.J.C. MacKay, and J.F. Cheng, "Turbo decoding as an instance of Pearl's belief propagation algorithm," IEEE J. Select. Areas Commun., February 1998, 16(2), 140-152.
25. J. Pearl, "Fusion, propagation, and structuring in belief networks," Artif. Intell., 1986, 29, 242-288.
26. G. Poltyrev, "Bounds on the decoding error probability of binary linear codes via their spectra," IEEE Trans. Inform. Theory, 40(10), 1261-1271.
27. T. Richardson and R. Urbanke, "The capacity of low density parity check codes under message passing decoding," IEEE Trans. Inform. Theory, February 2001, 47(2), 599-618.
28. T. Richardson, M.A. Shokrollahi, and R. Urbanke, "Design of capacity-approaching irregular low-density parity-check codes," IEEE Trans. Inform. Theory, February 2001, 47(2), 619-637.
29. Richardson et al., "Methods and apparatus for decoding LDPC codes," United States Patent 6,633,856, October 14, 2003.
30. C.E. Shannon, "A mathematical theory of communication," Bell Syst. Tech. J., 1948, 27, 379-423.
31. R.M. Tanner, "A recursive approach to low complexity codes," IEEE Trans. Inform. Theory, 1981, IT-27, 533-547.
32. J. Thorpe, "Low Density Parity Check (LDPC) codes constructed from protographs," JPL IPN Progress Report 42-154, August 15, 2003.
33. A.J. Viterbi and A.M. Viterbi, "An improved union bound for binary input linear codes on the AWGN channel, with applications to turbo decoding," Proceedings of the IEEE Information Theory Workshop, February 1998.
34. N. Wiberg, Codes and Decoding on General Graphs, Linköping Studies in Science and Technology, Dissertation No. 440, Linköping University, Linköping, Sweden, 1996.
Index
A
Accumulate-Repeat-Accumulate, v, xiv, 4, 56, 102
ARA code, 4
ARA codes, iv, v, viii, ix, xiv, 4, 56, 80, 82,
84, 87, 88, 90, 92, 102, 105, 106, 109, 110
ARA decoder, v, xiv, 4, 92, 109
B
Backward Recursion, 25
BCJR algorithm, iii, 8, 24, 26, 27, 31, 33,
34
belief propagation algorithm, 2, 9, 112
bipartite graph, 9
block codes, iv, 6, 7, 18, 56
C
Codes on graph, iii, 18
Conflict-free interleaver, 3
conflict-free interleavers, 42, 52, 97
Constituent code, viii, 61
constituent codes, vi, vii, 2, 5, 6, 7, 8, 26, 27,
28, 29, 31, 33, 34, 35, 40, 52, 59, 60, 61,
65, 79, 96, 98, 100, 107, 108
convolutional code, vi, 5, 8, 20, 21, 23, 24,
26, 52, 64
convolutional codes, 5, 6, 20, 21, 22, 27
Convolutional codes, iii, vi, 20, 21
D
density evolution, viii, 4, 28, 59, 61,
111
Density evolution, iv, 59
E
efficiency, iv, v, xiii, 1, 3, 35, 36, 37, 38, 40, 53, 92, 94
Efficient schedule, 16
extrinsic information, 8, 25, 26, 28, 33,
44
extrinsics, 11, 26, 27, 28, 29, 30, 40, 41, 42,
44, 52, 93, 94, 95
F
Flooding schedule, 16
Forward recursion, 24
G
graph representation, 18, 20, 22, 24
graphs with cycles, 3, 11
Graphs with cycles, iii, 17
H
hardware architecture, v, ix, 107, 108, 109
Hardware complexity, iv, v, 3, 51, 90
high-speed decoding, xiii, 2, 4, 92, 109,
110
I
interleaver, iv, vii, ix, x, xiii, 3, 7, 32, 34, 41,
42, 43, 44, 45, 46, 47, 49, 50, 53, 54, 64,
65, 67, 69, 72, 73, 74, 92, 93, 95, 96, 97,
98, 101, 110
Interleaver, iv, v, 40, 45, 95, 111
IOWE, 64, 65, 67, 68, 69, 70, 72, 73, 74, 75,
81, 82
IRA codes, 55
Irregular Repeat-Accumulate, 55
iterative decoding, vii, viii, xiii, 2, 3, 4, 7, 8,
9, 24, 28, 29, 30, 55, 56, 61, 63, 66, 75,
80, 86, 87, 88, 102, 111
L
latency, iv, xiii, 3, 18, 30, 35, 42, 43, 53, 98
LDPC codes, v, 1, 9, 19, 55, 56, 59, 90, 91,
94, 97, 100, 101, 113
Low-density parity-check, 1, 19, 112
M
MAP decoding, 8
Memory access, xiii, 3
message passing, 15, 18, 20, 24, 27, 56, 80,
90, 91, 92
message-passing algorithm, xiii, 3, 17, 24,
32, 41, 95
ML decoding, iv, v, 4, 55, 56, 75, 80, 82,
86, 88
P
Parallel concatenated convolutional code, 2,
5
parallelization, xiii, 3, 4, 35, 36, 45, 53, 80,
94
Parity-check codes, iii, 18
pipelining, 43
precoder, viii, 80, 81, 82, 86, 88, 89,
105
precoding, 55
probabilistic graph, 10
probability propagation algorithm, 3, 11
processing load, 35, 36, 53
projected graph, ix, 4, 95, 96, 97, 98, 100,
101, 102, 104, 106, 109, 110
protograph, 100
puncturing, iv, v, viii, ix, 55, 67, 73, 74, 75,
76, 78, 79, 82, 84, 87, 88, 102
R
RA codes, iii, iv, viii, x, 1, 2, 6, 55, 63, 65,
66, 67, 75, 76, 78, 79, 80
Repeat-Accumulate codes, 63
repetition codes, 100
S
scheduling, 15, 16, 18, 27, 32, 45, 93, 108
Serial concatenated convolutional codes, 2, 5
serial decoder, 32, 37, 42, 49, 52, 53
Shannon limit, xiv, 4, 19, 55, 56, 78, 91, 105,
111
SISO, vi, 8, 26, 27, 28, 29, 30, 34, 35, 41,
45, 46, 47, 53
sparse parity-check matrix, 19
speed gain, vii, 3, 30, 35, 36, 38, 39, 40, 43,
51, 53
Speed gain, iv, v, 35, 36, 94
speed gain and efficiency, 3, 51
S-random interleavers, 47
state constraint, 21, 35
state variables, vi, 20, 21, 25
systematic code, 5, 6, 36, 49, 75, 79
T
Tanner graph, vi, viii, 9, 18, 19, 20, 21, 89
turbo codes, vi, xiii, 1, 2, 6, 7, 11, 17, 22, 27,
28, 29, 30, 44, 55, 80, 93, 111, 112
Turbo codes, iii, 5, 22, 28
turbo decoding, vi, 9, 11, 24, 29, 30
turbo encoder, 5
turbo-like code, 2, 4, 56
turbo-like codes, iii, iv, v, xiii, xiv, 2, 3, 4, 5,
6, 17, 55, 59, 61, 63, 80, 92, 97, 98, 107,
109
W
window processor, 31, 34, 93, 101, 107, 108