Philip P. Dang†
STMicroelectronics Inc., San Diego, California, USA
Trac D. Tran
ECE Department, Johns Hopkins University, Baltimore, Maryland, USA
In this article, we present the BinDCT algorithm, a fast approximation of the Discrete Cosine Transform (DCT), and its efficient VLSI architectures for hardware implementation. The design objective is to meet the real-time constraints of embedded systems. Two VLSI architectures are proposed. The first architecture is targeted at low complexity applications such as videophones, digital cameras, and digital camcorders. The second architecture is designed for high performance applications, which include high definition TV and digital cinema. To meet the real-time constraints of these applications, we decompose the structure of the BinDCT algorithm into simple matrices and map them onto multi-stage pipeline architectures. The proposed low complexity 2-D BinDCT architecture can be realized at a cost of 10 integer adders, 80 registers, and 384 bytes of embedded memory. The high performance architecture can be implemented with 30 additional adders. These designs can calculate the real-time DCT/IDCT for CIF-format video applications at a 5 MHz clock rate with a 1.55 volt power supply. With its high performance and low power consumption, the BinDCT coprocessor is an excellent candidate for real-time DCT-based image and video processing applications.
DCT:

Y(m) = s(m) \sum_{n=0}^{N-1} x(n) \cos\left( \frac{(2n+1)m\pi}{2N} \right)    (1)

IDCT:

x(n) = \sum_{m=0}^{N-1} s(m) Y(m) \cos\left( \frac{(2n+1)m\pi}{2N} \right)    (2)

where x(n) are the input samples in the spatial domain, Y(m) are the DCT coefficients in the frequency domain,

s(m) = \begin{cases} \sqrt{1/N}, & m = 0 \\ \sqrt{2/N}, & m \neq 0 \end{cases}    (3)

and m = 0, 1, 2, ..., N − 1.

Figure 1. Plane rotation structure.

The DCT can be represented in terms of matrix multiplication. The 8-point DCT derived from Eq. (1) for N = 8 can be written in matrix form as follows:

\begin{bmatrix} Y[0] \\ Y[1] \\ Y[2] \\ Y[3] \\ Y[4] \\ Y[5] \\ Y[6] \\ Y[7] \end{bmatrix}
= \frac{1}{2}
\begin{bmatrix}
 c_4 &  c_4 &  c_4 &  c_4 &  c_4 &  c_4 &  c_4 &  c_4 \\
 c_1 &  c_3 &  c_5 &  c_7 & -c_7 & -c_5 & -c_3 & -c_1 \\
 c_2 &  c_6 & -c_6 & -c_2 & -c_2 & -c_6 &  c_6 &  c_2 \\
 c_3 & -c_7 & -c_1 & -c_5 &  c_5 &  c_1 &  c_7 & -c_3 \\
 c_4 & -c_4 & -c_4 &  c_4 &  c_4 & -c_4 & -c_4 &  c_4 \\
 c_5 & -c_1 &  c_7 &  c_3 & -c_3 & -c_7 &  c_1 & -c_5 \\
 c_6 & -c_2 &  c_2 & -c_6 & -c_6 &  c_2 & -c_2 &  c_6 \\
 c_7 & -c_5 &  c_3 & -c_1 &  c_1 & -c_3 &  c_5 & -c_7
\end{bmatrix}
\begin{bmatrix} x[0] \\ x[1] \\ x[2] \\ x[3] \\ x[4] \\ x[5] \\ x[6] \\ x[7] \end{bmatrix}    (4)

where

c_k = \cos\frac{k\pi}{16}    (5)

From the matrix representation, one can see that a direct implementation of the DCT costs 64 multiplications and 56 additions. In their landmark study,1 Ahmed, Natarajan, and Rao presented an N-point DCT that can be computed with a 2N-point FFT and some additional post-processing. Using even/odd decomposition, Ahmed, Natarajan, and Rao reduced the complexity of the 8 × 8 DCT matrix to 22 multiplications and 28 additions.2

Fast DCT Algorithms

During the last two decades, many fast DCT/IDCT algorithms have been proposed, since the DCT and IDCT have become essential components of numerous international standards. These works include Chen et al.,3 Wang,4 Lee,5 Vetterli and Nussbaumer,6 Suehiro and Hatori,7 Hou,8 Malvar,9 and Loeffler et al.10 The common objective of these algorithms is to reduce the number of multiplications in the DCT. The basic difference among them is the way they factor the DCT matrix into simple matrices. The number of additions in these algorithms is 29, and the number of multiplications is between 11 and 16.

For lossy coding applications, Arai, Agui, and Nakajima11 showed that eight multiplications can be absorbed by the quantization operations, so that the computation of the scaled 8 × 8 DCT can be reduced to five multiplications and 29 additions. The computational complexity of fast 8-point 1-D DCT algorithms is summarized in Table I.

TABLE I. Computational Complexity of Fast 8-Point 1-D DCT Algorithms

                 Chen3  Wang4  Lee5  Vetterli6  Suehiro7  Hou8  Loeffler10  Arai11  BinDCT13
Multiplication    16     13     12      12         12      12       11         5        0
Addition          26     29     29      29         29      29       29        29       30
Shift              0      0      0       0          0       0        0         0       13

BinDCT Algorithm

Motivation

One of the disadvantages of the aforementioned fast DCT algorithms is that they still need floating point multiplications. These operations are slow in software and cost more silicon in hardware implementations. In addition, the long execution time of floating point operations leads to higher power consumption. Most mobile multimedia devices, on the other hand, have limited power, CPU, and memory resources. There is therefore a need to investigate and develop efficient, low complexity orthogonal transformations for these applications.

Approach

Using the lifting-based approach, Tran12 showed that fast and accurate approximations of the DCT can be obtained without using any multiplication. The proposed algorithm is called the BinDCT. The basic idea of the BinDCT algorithm is that each plane rotation (Fig. 1) in a fast DCT structure, such as those in Refs. [3] and [10], can be approximated by at most three dyadic lifting steps. In their article, Liang and Tran13 showed that both the forward and the inverse DCT can be implemented using only shift and addition operations.

The decomposition of a plane rotation into three lifting steps is illustrated in Fig. 2(a). This operation can be represented in the matrix form
BinDCT and Its Efficient VLSI Architectures for Real-Time Embedded Applications Vol. 49, No. 2, March/April 2005 125
\begin{bmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{bmatrix}
= \begin{bmatrix} 1 & p \\ 0 & 1 \end{bmatrix}
  \begin{bmatrix} 1 & 0 \\ u & 1 \end{bmatrix}
  \begin{bmatrix} 1 & p \\ 0 & 1 \end{bmatrix}    (6)

where

p = \frac{\cos\alpha - 1}{\sin\alpha}, \quad u = \sin\alpha.

Note that each lifting step is a biorthogonal transform, and its inverse also has a simple lifting structure, i.e.,

\begin{bmatrix} 1 & x \\ 0 & 1 \end{bmatrix}^{-1} = \begin{bmatrix} 1 & -x \\ 0 & 1 \end{bmatrix}    (7)

\begin{bmatrix} 1 & 0 \\ x & 1 \end{bmatrix}^{-1} = \begin{bmatrix} 1 & 0 \\ -x & 1 \end{bmatrix}    (8)

\begin{bmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{bmatrix}^{-1}
= \begin{bmatrix} 1 & -p \\ 0 & 1 \end{bmatrix}
  \begin{bmatrix} 1 & 0 \\ -u & 1 \end{bmatrix}
  \begin{bmatrix} 1 & -p \\ 0 & 1 \end{bmatrix}    (9)
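The factorization in Eq. (6) is easy to check numerically. The short Python sketch below (our illustration; the function names are ours, not from the article) multiplies out the three lifting matrices and compares the product with the plane rotation matrix.

```python
import math

def rotation(alpha):
    # Plane rotation matrix from the left-hand side of Eq. (6)
    c, s = math.cos(alpha), math.sin(alpha)
    return [[c, -s], [s, c]]

def matmul(A, B):
    # 2x2 matrix product
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def lifting_rotation(alpha):
    # Right-hand side of Eq. (6): p = (cos a - 1)/sin a, u = sin a
    p = (math.cos(alpha) - 1.0) / math.sin(alpha)
    u = math.sin(alpha)
    P = [[1.0, p], [0.0, 1.0]]   # upper (shear) lifting step
    U = [[1.0, 0.0], [u, 1.0]]   # lower lifting step
    return matmul(matmul(P, U), P)
```

For any angle with sin α ≠ 0, the two functions agree to machine precision, which is what makes the lifting substitution exact before the lifting coefficients are rounded to dyadic values.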
\mathrm{BinDCT} =
\begin{bmatrix}
 1/2    &  1/2  &  1/2    &  1/2    &  1/2    &  1/2   &  1/2  &  1/2   \\
 1/2    &  1/2  &  3/16   &  0      &  0      & -3/16  & -1/2  & -1/2   \\
 55/128 &  3/16 & -3/16   & -55/128 & -55/128 & -3/16  &  3/16 &  55/128 \\
 9/32   & -1/8  & -19/64  & -1/4    &  1/4    &  19/64 &  1/8  & -9/32  \\
 1/4    & -1/4  & -1/4    &  1/4    &  1/4    & -1/4   & -1/4  &  1/4   \\
 7/16   & -3/4  &  7/32   &  1/2    & -1/2    & -7/32  &  3/4  & -7/16  \\
 -3/16  &  1/2  & -1/2    &  3/16   &  3/16   & -1/2   &  1/2  & -3/16  \\
 -1/16  &  1/4  & -13/32  &  1/2    & -1/2    &  13/32 & -1/4  &  1/16
\end{bmatrix}    (10)
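Every entry of the BinDCT matrix in Eq. (10) is a dyadic rational, i.e., an integer divided by a power of two, which is what allows a pure shift-and-add realization. The Python check below (our illustration, not part of the article) encodes the matrix with exact fractions and verifies this property.

```python
from fractions import Fraction as F

# Forward BinDCT matrix, entries copied from Eq. (10)
BINDCT_C = [
    [F(1, 2)] * 8,
    [F(1, 2), F(1, 2), F(3, 16), 0, 0, F(-3, 16), F(-1, 2), F(-1, 2)],
    [F(55, 128), F(3, 16), F(-3, 16), F(-55, 128),
     F(-55, 128), F(-3, 16), F(3, 16), F(55, 128)],
    [F(9, 32), F(-1, 8), F(-19, 64), F(-1, 4),
     F(1, 4), F(19, 64), F(1, 8), F(-9, 32)],
    [F(1, 4), F(-1, 4), F(-1, 4), F(1, 4),
     F(1, 4), F(-1, 4), F(-1, 4), F(1, 4)],
    [F(7, 16), F(-3, 4), F(7, 32), F(1, 2),
     F(-1, 2), F(-7, 32), F(3, 4), F(-7, 16)],
    [F(-3, 16), F(1, 2), F(-1, 2), F(3, 16),
     F(3, 16), F(-1, 2), F(1, 2), F(-3, 16)],
    [F(-1, 16), F(1, 4), F(-13, 32), F(1, 2),
     F(-1, 2), F(13, 32), F(-1, 4), F(1, 16)],
]

def is_dyadic(x):
    # A dyadic rational k / 2^m can be applied with shifts and adds only;
    # the denominator of a Fraction in lowest terms must be a power of two.
    d = F(x).denominator
    return d & (d - 1) == 0

assert all(is_dyadic(e) for row in BINDCT_C for e in row)
```

Applying the matrix to a constant input also confirms that only the DC coefficient is nonzero, as expected of a DCT approximation.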
The plane rotation structure can also be implemented by the scaled lifting structure (Fig. 2(c)) to further reduce complexity. In this approach, rotation angles can be approximated with only two dyadic lifting steps, which can be implemented using shift-and-add binary operations.

BinDCT Algorithm

Figure 3(a) depicts Chen's factorization of an 8-point forward DCT. In this figure,

s_n = \sin\left(\frac{n\pi}{16}\right) \quad \text{and} \quad c_n = \cos\left(\frac{n\pi}{16}\right).

Chen's factorization can be realized in five stages. Using the lifting steps and scaled lifting structures, we derive a general lifting structure for the BinDCT family based on Chen's factorization (Fig. 3(b)). In this design, the scaled lifting structure is utilized to reduce the computational complexity, since most of the scaled lifting factors can be incorporated into the quantization stage.

By varying the approximation accuracy, different versions of the BinDCT can be obtained.12,13 Equation (10) is version C of the forward BinDCT algorithm; its lifting parameters and scale factors are shown in Fig. 3(c). This particular implementation requires 30 additions and 13 shift operations. The reader can verify that

Figure 3. (a) Flow chart of Chen's factorization of the 1-D 8-point forward DCT; (b) flow chart of the general lifting structure for the 1-D 8-point forward DCT using Chen's factorization; (c) flow chart of the 1-D 8-point forward BinDCT.
TABLE II. PSNR (dB) of the Reconstructed Image with Different Inverse DCT Algorithms
Quality Factor IJG Integer DCT IJG Fast Integer DCT BinDCT C4 BinDCT L3
100 58.85 45.02 44.38 50.12
90 40.79 40.53 40.52 40.66
80 38.51 38.39 38.39 38.44
60 36.43 36.36 36.38 36.38
40 35.11 35.06 35.06 35.06
20 32.95 32.92 32.92 33.31
10 30.40 30.39 30.39 30.37
5 27.33 27.32 27.32 27.30
when the quality factor is 100, the BinDCT result is 10.3 dB better than that of the IJG's fastest integer DCT. In terms of the compression ratio, the compressed file size with BinDCT-C7 is about 1–3% smaller than that with the floating-point DCT, whereas the compressed size with BinDCT-C4 is slightly larger than the latter, but the difference is less than 0.5% in most cases.

Design Approach

For efficient VLSI implementation of the BinDCT algorithm, we first decompose the basic BinDCT structure in Eq. (10) into five simple matrices. Each matrix is designed as a single stage in the proposed pipeline architecture. The matrix factorization form of the BinDCT algorithm is shown in Eq. (11).
E =
\begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0
\end{bmatrix}    (16)
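Since E in Eq. (16) has exactly one 1 per row, it is a pure permutation and costs no arithmetic; in hardware it is realized by wiring, and in software it reduces to an index map, as in this sketch (our illustration; `PERM` is read directly off the rows of E).

```python
# Output permutation E from Eq. (16), expressed as an index map:
# output i takes input PERM[i] (row i of E has its single 1 in column PERM[i]).
PERM = [0, 7, 3, 6, 1, 5, 2, 4]

def apply_E(x):
    # Pure rewiring: no adders, just register-to-register hardwiring
    return [x[i] for i in PERM]
```

Because E is a permutation matrix, its inverse is simply its transpose, i.e., the inverse index map.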
TABLE III. The Truth Table of Stage 1

S2 S1 S0   Function
0  0  0    Y0 = X0 + X7
0  0  1    Y1 = X1 + X6
0  1  0    Y2 = X2 + X5
0  1  1    Y3 = X3 + X4
1  0  0    Y4 = X0 − X7
1  0  1    Y5 = X1 − X6
1  1  0    Y6 = X2 − X5
1  1  1    Y7 = X3 − X4

TABLE IV. The Truth Table of Stage 2

S2 S1 S0   Function
0  0  0    T0 = (1/4)X5 + (1/64)X5 = (17/64)X5
0  0  1    T1 = (17/64)X5 + (32/64)X5 = (49/64)X5
0  1  0    T2 = (1/4)X5 + (1/8)X5 = (3/8)X5
0  1  1    T3 = (1/2)X6 + (1/8)X6 = (5/8)X6
1  0  0    Y5 = (5/8)X6 − (49/64)X5
1  0  1    Y6 = X6 − (3/8)X5
1  1  0    —
1  1  1    —
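Table III is simply the butterfly that splits the input into sums and differences of mirrored samples; a single time-shared adder produces one output per clock cycle. In software the eight cycles collapse into one function (our illustration, not from the article).

```python
def stage1(x):
    # Table III: select codes 000-011 form sums Y0..Y3,
    # select codes 100-111 form differences Y4..Y7.
    assert len(x) == 8
    sums = [x[i] + x[7 - i] for i in range(4)]    # Y0..Y3
    diffs = [x[i] - x[7 - i] for i in range(4)]   # Y4..Y7
    return sums + diffs
```

This even/odd split is the first step of Chen's factorization, which is why Stage 1 needs only additions and no shifts.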
Architecture of Stage 2
Stage 2 performs the computations for the matrix B. The architecture of Stage 2 is similar to that of Stage 1. Stage 2 includes two 8-to-1 multiplexers, one 9-bit integer adder, one 1-to-8 demultiplexer, and an internal register bank. Inputs X0−X4 and X7 are connected directly to the output registers. Inputs X5 and X6 are shifted by predefined amounts and stored in the internal register bank. The multiplexers select values from the register bank and feed them into the inputs of the adder. The demultiplexer distributes the output of the adder to the output registers Y5 and Y6 or sends it back to the register file. The truth table of Stage 2 is summarized in Table IV. Note that Stage 2 takes only six cycles to calculate Y5 and Y6; the last two cycles are don't-cares. The architecture of Stage 2 is illustrated in Fig. 6. The synthesized circuit of Stage 2 is shown in Fig. 11.
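The dyadic constants in Table IV are exactly what makes the single shared adder sufficient: every multiple of X5 and X6 is accumulated from right shifts. The sketch below (our illustration; integer right shifts stand in for the hardware wiring, and truncation effects are ignored by choosing inputs that are multiples of 64) walks through the six useful cycles of Stage 2.

```python
def stage2(x5, x6):
    # Multiplierless evaluation of Table IV using only shifts and adds.
    t0 = (x5 >> 2) + (x5 >> 6)   # (1/4 + 1/64) X5 = (17/64) X5
    t1 = t0 + (x5 >> 1)          # (17/64 + 32/64) X5 = (49/64) X5
    t2 = (x5 >> 2) + (x5 >> 3)   # (1/4 + 1/8) X5 = (3/8) X5
    t3 = (x6 >> 1) + (x6 >> 3)   # (1/2 + 1/8) X6 = (5/8) X6
    y5 = t3 - t1                 # (5/8) X6 - (49/64) X5
    y6 = x6 - t2                 # X6 - (3/8) X5
    return y5, y6
```

In the actual circuit each line above is one pass through the shared adder, which is why the stage needs six cycles but only four additions beyond the shift wiring.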
Figure 10. The synthesized circuit of Stage 1.
Figure 11. The synthesized circuit of Stage 2.
Figure 12. The synthesized circuit of Stage 3.
Figure 13. The synthesized circuit of Stage 4.
Figure 14. The synthesized circuit of Stage 5.
The basic function of Stage 5 is to re-organize the output data before they are stored into the buffer or memory. This matrix can be easily realized in hardware by appropriately hardwiring the input registers to the output registers. The architecture of Stage 5 is illustrated in Fig. 9, and the synthesized circuit of Stage 5 is illustrated in Fig. 14.

Transpose Buffer
To compute the separable 2-D BinDCT, the outputs from the row decomposition are stored in a buffer. This buffer contains an 8 × 8 block of 64 12-bit coefficients. The transpose operations are managed by a control unit, which allows inputs to be written row-wise and outputs to be read column-wise.

To improve the throughput of the 2-D BinDCT calculation, two 8 × 8 buffers are used. While the second 1-D BinDCT circuitry reads data from the first buffer, the outputs of the first 1-D BinDCT for the next 8 × 8 block are written into the second buffer. After 64 clock cycles, the address of the output port of the first dimension and the address of the input port of the second 1-D BinDCT are switched. These ping-pong operations repeat every 64 clock cycles until the transformation of the whole image is completed.

2-D BinDCT Architecture
The system architecture of the 2-D BinDCT coprocessor is depicted in Fig. 15. The 2-D BinDCT coprocessor includes two 1-D BinDCT modules and two transpose buffers. The BinDCT coprocessor receives an 8 × 8 block of inputs. It computes the first 1-D DCT, transposes the data in the 8 × 8 block buffer, and then calculates the second 1-D DCT. The hardware cost of the low power architecture for the 8-point 2-D BinDCT is 10 adders, 80 registers, and 384 bytes of embedded memory.

High Performance VLSI Architecture
The architecture for high performance applications is similar to the low power VLSI architecture. The basic difference is that more arithmetic units are used in each stage. For instance, four adders are used in Stages 1 through 3, and eight adders are implemented in Stage 4. The latency of each stage is reduced to only 2 clock cycles. In addition, Stage 5 is eliminated: the outputs of Stage 4 are hardwired such that the 8 outputs from Stage 4 are stored in memory column-wise and read row-wise. The hardware cost of the high performance VLSI architecture for the 8-point 2-D BinDCT is 40 adders, 80 registers, and 384 bytes of embedded memory.

Performance Analysis
Computational Complexity
The complexity distribution in the proposed five-stage pipeline BinDCT architecture is summarized in Table VIII.

From the algorithm presented above, Stage 1 and Stage 3 each require 8 additions. At first glance, Stage 2 seems to require 6 additions and 7 shift operations. However,

b6 = a6 + [a5 + (a5 << 1)] >> 3

TABLE VIII. Complexity Distribution in the Proposed Five-Stage Pipeline BinDCT Architecture

Stage     Additions   Shifts
Stage 1       8          0
Stage 2       4          4
Stage 3       8          0
Stage 4      10          9
Stage 5       0          0
Total:       30         13

TABLE IX. The Data Bus Required for 1-D Forward BinDCT

Stage   Input Vector   Bits   Output Vector   Bits
1       f              8      A*f             9
2       A*f            9      B*A*f           10
3       B*A*f          10     C*B*A*f         11
4       C*B*A*f        11     D*C*B*A*f       12
5       D*C*B*A*f      12     E*D*C*B*A*f     12
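The ping-pong operation of the two transpose buffers can be sketched behaviorally (our Python illustration; the actual design is a register file driven by a control unit that swaps roles every 64 clock cycles).

```python
class PingPongTranspose:
    """Double-buffered 8x8 transpose between the two 1-D transform passes.

    While the second-pass logic reads one buffer column-wise, the
    first-pass logic writes the next block into the other buffer
    row-wise; the roles swap once per block (64 cycles in hardware).
    """

    def __init__(self):
        self.bufs = [[[0] * 8 for _ in range(8)] for _ in range(2)]
        self.write_sel = 0  # buffer currently being written

    def write_row(self, r, row):
        # First 1-D pass writes its output row-wise
        self.bufs[self.write_sel][r] = list(row)

    def read_col(self, c):
        # Second 1-D pass reads the *other* buffer column-wise
        buf = self.bufs[1 - self.write_sel]
        return [buf[r][c] for r in range(8)]

    def swap(self):
        # In hardware this is the address swap issued every 64 cycles
        self.write_sel = 1 - self.write_sel
```

Reading a column of the previously written block is exactly the transpose, so no data are ever physically moved; only the read/write addressing changes.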
TABLE X. The Data Bus Required for 2-D Forward BinDCT

Stage   Input Vector   Bits   Output Vector   Bits

TABLE XI. Profile of Low Power and High Performance 2-D BinDCT Coprocessors

Architecture   High Performance BinDCT   Low Power BinDCT
different architecture. As a result, it is difficult to make a fair comparison between two architectures. The IDCT processor in Ref. [21], for instance, used a digit-serial construction to reduce the overall chip size; this approach, however, increases the latency. The ROM-based Distributed Arithmetic (DA) approach is used in Refs. [15], [24], and [26] to eliminate the need for a multiplier. The disadvantage of this implementation is that the ROM size grows exponentially with the input precision. The common approach in many designs uses a multiply-accumulate cell (MAC) for the DCT/IDCT computation.23,25 The MAC implementation, however, has to deal with several design issues, including floating point operations and finite register length effects. Moreover, the implementation of multipliers requires a larger chip area and higher power consumption.
BinDCT and Its Efficient VLSI Architectures for Real-Time Embedded Applications Vol. 49, No. 2, March/April 2005 135
TABLE XIII. DCT/IDCT Processor Comparison
Authors Technology (µm) Area (mm2) Transistors Voltage (V) Clock rate (MHz) Power (mW) Throughput (Mpel/s)
Defilippis et al.15 2.00 22.50
Jutand et al.16 0.80 41.40
Chau et al. 17 0.80 21.60 2200
D. Slawecki 18 2.00 72.68 67,929 5.00 50 1000
T. Kuroda19 0.30 4.00 120,000 0.90 150 10
Y.P. Lee20 0.60 50.60 152,017 2.00 100 138
C.-Y. Hung et al.21 0.35 24,000 100 26.0
Katayama et al.22 0.35 70,700 54 27.0
Rambaldi et al. 23 0.50 200,000 27 27.0
Okamoto et al. 24 0.50 4.50 40 14.3
Xanthopoulos et al. 25 0.70 20.70 160,000 14 14.0
T.-S. Chang et al.26 0.60 24,000 55 23.6
SGS Thompson STV320027 311.52 5.00 15 500
SGS Thompson STV320828 196.00 5.00 27 750
SGS Thompson IMSA12129 5.00 20 1500
Gong et al.31 0.25 1.50 125 125
Low Power BinDCT 0.18 0.07 1.55 5 12 5
High Performance BinDCT 0.18 0.12 1.55 5 48.6 20
Compared with the reference designs in Table XIII, the proposed architecture is very competitive in terms of performance and quality. Our designs obtain high throughput at a very low clock frequency: at 5 MHz, the proposed low power VLSI architecture achieves a 5 Mbytes/s throughput. This performance is sufficient for videoconferencing applications using the CIF format. For higher video bit rates, we can scale the system clock of the BinDCT coprocessor to meet the real-time requirement. In addition, the area and power consumption of our designs are significantly smaller than those of the reference designs. The proposed low power VLSI architecture takes only 0.07 mm2 of chip area and consumes 12 mW, which makes it a good candidate for mobile multimedia applications. The high performance VLSI architecture, on the other hand, can obtain a 20 Mbytes/s throughput at 5 MHz, or a 120 Mbytes/s throughput at 30 MHz. In addition, the modularity of the proposed architecture simplifies the design verification effort. Moreover, the scalability of the algorithm gives us the flexibility to meet the real-time constraint at different bit rates.

For the high performance architecture, we compare our design with the latest reported DCT architecture.30 The proposed architecture is 12 times smaller and uses a 25 times slower clock rate; these factors translate into approximately 300 times lower power consumption. If both processors run at the same 125 MHz clock rate, the throughput of the proposed system will be four times that of the reference design in Ref. [30].

Summary
This article presents efficient VLSI architectures and the implementation results of the BinDCT algorithm. The proposed architectures are organized as multi-stage pipelines. As a result, they can achieve a high throughput, which allows them to meet the real-time requirements of video processing, while requiring very low power. The hardware implementation of the low complexity BinDCT architecture