
2013 13th International Symposium on Communications and Information Technologies (ISCIT)

Design and Implementation of a Reconfigurable SoC for High-Definition Video Applications

SUN Shu-Wei, LIU Xiang-Yuan
College of Computer, National University of Defense Technology, Changsha, China

LIU Lei-Bo
School of Information Science and Technology, Tsinghua University, Beijing, China

CAO Peng
National ASIC System Engineering Center, Southeast University, Nanjing, China

Abstract: This paper proposes a reconfigurable SoC architecture based on a large-scale array of reconfigurable processing elements (PEs), a high-performance RISC core, and several on-chip embedded peripherals, which are tightly coupled through AMBA2.0 system buses. The large-scale PE array processes video signals of different standards under contexts that are loaded dynamically. The embedded peripherals are responsible for the input of media stream data and the output of the decoded multimedia data to the display, while the RISC core handles the initialization of the peripherals and the reconfigurable PEs, the preprocessing of media stream data, audio decoding, the synchronization between audio and video data, and other scheduling functions. A prototype SoC chip is implemented in a 65nm CMOS process, and test results show that the reconfigurable SoC achieves real-time decoding of 1920*1080 @ 30fps video for the H.264, AVS, and MPEG-2 standards.

Keywords: reconfigurable PE array, SoC, high-definition video decoding, H.264, AVS, MPEG-2

I. INTRODUCTION
With the development of embedded microprocessor technology, consumers demand higher-quality electronics, and high-definition (HD) video applications are becoming increasingly popular. A series of products, such as the SMP86xx SoC from Sigma Designs [1], the RTDxx SoC from Realtek [2], the TMS320DM814x SoC from TI [3], and the CE4100 SoC from Intel [4], target applications using video standards such as MPEG-2/4, H.264, AVS, and VC-1. These processors deliver HD video decoding performance by integrating hard-wired multimedia coprocessors, which support only specific video codec algorithms. Supporting more algorithms and video standards requires integrating extra coprocessors, which wastes silicon area and power.

Some programmable processors, such as the Intel Pentium 4 and the TI TMS320DM64x DSP, use single instruction, multiple data (SIMD) instructions and multithreading to support video decoding. These processors generally run at high clock frequencies and consume considerable power, yet provide poor performance in video applications [5][6]. Reconfigurable processors have been studied over the last decades, and some, such as XPP [7][8], have been used in video applications. However, no real-time HD video decoding results have been reported so far.

This paper proposes a reconfigurable SoC architecture for HD video applications. The SoC is composed of a high-performance RISC core, a large-scale reconfigurable PE array, and several on-chip embedded peripherals, which are tightly coupled through AMBA2.0 system buses. Task-level pipelining and large-scale data-level parallelism are employed to speed up video decoding. We design and implement a prototype SoC chip in a 65nm CMOS process, and test results show that, when working at 225MHz, the reconfigurable SoC achieves real-time decoding of 1920*1080 @ 30fps (frames per second) video for the H.264, AVS, and MPEG-2 standards.

The rest of the paper is organized as follows: Section II provides an overview of the reconfigurable processing element (PE) and the PE array; Section III details the architecture of the SoC; Section IV gives the test results of the prototype SoC chip; and Section V concludes our work.

II. THE RECONFIGURABLE PES ARRAY
The reconfigurable PE array consists of 512 coarse-grained reconfigurable processing elements, organized as eight 8x8 PE sub-arrays (PE8x8). Each PE8x8 is a coarse-grained reconfigurable computing array composed of 64 16-bit processing elements and the routing of a reconfiguration network. A PE8x8, a context interface, and a data buffer interface (DBI) together form the basic functional unit, RCA8x8, as shown in Fig. 1. The context interface handles control flow; the context information stored in it can be updated dynamically to support different algorithms and applications. The DBI is a flexible data exchange unit containing two asymmetric FIFOs. It can fetch data from both the on-chip and off-chip memories of the SoC system.

978-1-4673-5580-3/13/$31.00 2013 IEEE 434


TABLE I. FUNCTION OF THE PROCESSING ELEMENT

A+B     A>>B        A>B?        (A+B)>>C
A-B     A<<B        A==B?       (A+B)<<C
A&B     A+(B>>C)    A<B?        (A-B)>>C
A|B     A-(B>>C)    A>=B?       (A-B)<<C
A^B     A+(B<<C)    A<=B?       AxB_L
A~^B    A-(B<<C)    A!=B?       AxB_H
~A      |A-B|       (A>>C)-B    Clip(A,-B,B)
A       C?A:B       (A<<C)-B    Clip(A,0,B)
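To make a few entries of Table I concrete, the following is a minimal behavioral sketch in Python of a subset of the PE functions. It is not the authors' implementation. The names `wrap16`, `clip`, `PE_OPS`, and `pe_execute` are hypothetical, and several points the paper does not specify are assumed here: operands are signed 16-bit two's-complement values, results wrap on overflow, right shifts are arithmetic, intermediate sums are wide enough not to overflow before the shift, `Clip(A,lo,hi)` clamps A into [lo, hi], and `AxB_L` / `AxB_H` are the low and high 16 bits of the product.

```python
MASK16 = 0xFFFF  # assumed PE word width: 16 bits


def wrap16(value):
    """Wrap a Python int to a signed 16-bit two's-complement value (assumption)."""
    value &= MASK16
    return value - 0x10000 if value & 0x8000 else value


def clip(value, low, high):
    """Clamp value into the closed range [low, high] (assumed meaning of Clip)."""
    return max(low, min(value, high))


# A subset of the Table I functions, keyed by the table's own notation.
PE_OPS = {
    "A+B":          lambda a, b, c: wrap16(a + b),
    "A-B":          lambda a, b, c: wrap16(a - b),
    "|A-B|":        lambda a, b, c: wrap16(abs(a - b)),
    "A+(B>>C)":     lambda a, b, c: wrap16(a + (b >> c)),
    "(A+B)>>C":     lambda a, b, c: wrap16((a + b) >> c),
    "C?A:B":        lambda a, b, c: a if c else b,
    "Clip(A,0,B)":  lambda a, b, c: clip(a, 0, b),
    "Clip(A,-B,B)": lambda a, b, c: clip(a, -b, b),
    "AxB_L":        lambda a, b, c: (a * b) & MASK16,          # low half of product
    "AxB_H":        lambda a, b, c: ((a * b) >> 16) & MASK16,  # high half of product
}


def pe_execute(opcode, a, b=0, c=0):
    """Evaluate one PE function selected by its opcode string."""
    return PE_OPS[opcode](a, b, c)
```

Under these assumptions, `pe_execute("Clip(A,0,B)", 300, 255)` saturates a value to the 8-bit pixel range, and `pe_execute("|A-B|", 3, 10)` gives the absolute difference used in operations such as sum-of-absolute-differences.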

Fig. 1. Architecture of RCA8x8.

The reconfigurable PE is illustrated in Fig. 2, and its functions are listed in Table I. Besides the processing elements and the data path, each PE8x8 contains a temporary register file, which collects the results of the PEs in one row and supplies retained data to the inputs of the PEs in one column. RCA8x8 also integrates an extension interface that allows PE8x8 arrays to be combined easily into a larger array. Horizontal scaling is realized by sharing 4 columns of temp registers, while vertical expansion is realized by using a group of multiplexers on the bottom route and choosing the data-transfer direction of neighboring PEs.

Fig. 2. Architecture of the Processing Element. (Diagram labels: Multiplexers A, B, and C with Select A, Select B, and Select C; Operand A and Operand B; ALU with opcode input; Temp Reg and Results Reg; inputs include temp data and results from the previous line.)

III. SOC ARCHITECTURE

The reconfigurable PE array acts as the video processing engine and provides real-time HD video decoding performance. In addition, a high-performance RISC core is integrated as the main system control unit, which handles tasks such as system booting, peripheral initialization, and procedure scheduling. The on-chip peripherals handle the input of multimedia stream data and the output of the decoded video and audio data. These components are tightly coupled through the AMBA2.0 system buses: components that demand high data bandwidth, such as the external memory controller and the video output interface controller, use the AHB protocol, while components that do not, such as the IIC controller and GPIO, use the APB protocol. The architecture of the SoC is illustrated in Fig. 3.

Fig. 3. Architecture of the Reconfigurable SoC

A. The Reconfigurable PEs Array Sub-System

The reconfigurable PE array sub-system is composed of a secondary RISC core, two PE arrays (RPU0 and RPU1), an array of configuring units (uPA), a local on-chip SSRAM bank, the external memory interface, and the system-bus controller and interfaces, as illustrated in Fig. 4. Each PE array (RPU0 and RPU1) is composed of 256 reconfigurable processing elements, organized as 4 RCA8x8 units, and acts as the main data processing engine. The uPA is composed of six configuring units, which produce the appropriate contexts for the reconfigurable elements. The 128KB local on-chip SSRAM serves as program buffers, context buffers, and temporary data buffers. In addition, the external memory interface can extend the buffer capacity by up to 16MB of off-chip SSRAM.
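The array hierarchy described in Sections II and III-A (two RPUs, each with four RCA8x8 units of 8x8 PEs) can be checked with a short sketch. The constants come from the text; the flat PE numbering and the function name `locate_pe` are hypothetical and serve only to illustrate how the 512 PEs decompose.

```python
ROWS, COLS = 8, 8        # one PE8x8 sub-array
RCAS_PER_RPU = 4         # "organized as 4 RCA8x8"
RPUS = 2                 # RPU0 and RPU1

PES_PER_RCA = ROWS * COLS
PES_PER_RPU = PES_PER_RCA * RCAS_PER_RPU
TOTAL_PES = PES_PER_RPU * RPUS


def locate_pe(index):
    """Map a flat PE index (hypothetical numbering) to (rpu, rca, row, col)."""
    if not 0 <= index < TOTAL_PES:
        raise ValueError("PE index out of range")
    rpu, rest = divmod(index, PES_PER_RPU)
    rca, rest = divmod(rest, PES_PER_RCA)
    row, col = divmod(rest, COLS)
    return rpu, rca, row, col


print(PES_PER_RPU, TOTAL_PES)
```

The totals agree with the 256 PEs per RPU and the 512 PEs (eight 8x8 sub-arrays) stated in the text.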

Fig. 4. Architecture of the Reconfigurable PEs Array Sub-System

The AMBA2.0 system interface between the PE array sub-system and the other functional components of the SoC is composed of an AHB master controller interface, an AHB slave controller interface, and an APB slave controller interface. The PE array sub-system can access the whole memory space through the AHB master controller interface. The AHB slave controller interface is used by the main RISC core of the SoC to transfer programs and contexts to the on-chip SRAM, while the APB slave controller interface is used to initialize and start up the PE array sub-system.

The boot process of the SoC is as follows. First, the main RISC core starts the clock system of the PE array sub-system and releases its local memory system and data path from reset. Second, the main RISC core transfers the boot program of the PE array sub-system to its program buffer. Then the PE array sub-system leaves the reset state, and the secondary RISC core runs the initialization program, fetches the context information and multimedia stream data from the system memory of the SoC, and starts the PE array to decode the video bitstream. The reconstructed video data are transferred to the system memory of the SoC through the AHB master interface, where they wait to be displayed.

B. External Memory Interface Controller

The external memory interface includes an SD card interface and a main memory interface. The SD card interface supports the SD3.0 and SDIO2.0 standard protocols. It comprises an AHB master controller and an AHB slave controller, so it can work both in AHB master mode, transferring data between the SD card and system memory automatically, and in AHB slave mode, transferring data under the instructions of the main RISC core or the DMA controller.

The main memory interface supports both asynchronous memory devices, such as ROM and Flash, and synchronous memory devices, such as pipelined synchronous-burst SRAM (SBSRAM) and synchronous DRAM (SDRAM). The asynchronous and synchronous spaces are distinguished by memory address; the capacity of the former is up to 256MB (megabytes) and that of the latter up to 1GB (gigabytes). Because the asynchronous memories retain their data when powered off, they store the initialization program and other software of the SoC system. The synchronous memories, which lose their data when powered off but can be accessed at a much higher speed, serve as the main buffers for temporary data when the SoC is in its normal working state.

AMBA2.0 buses use a time-sharing mechanism, and the bus bandwidth is shared by all the masters of the system. Each master occupies the buses for a time slice, and the masters use the buses in turn according to their priorities. Read and write operations are performed serially through the buses under a handshake mechanism. In the AMBA2.0 protocol, a second operation cannot be issued until the first one completes, so operations cannot be pipelined. When accessing the off-chip main memory, several clock cycles are spent setting up the connection between the external memory interface of the SoC and the off-chip memory banks, and additional cycles pass before the data are read back from the memory banks. The buses are idle during these cycles, which wastes a large share of the bus bandwidth.

To improve the utilization and effective bandwidth of the system buses, two 16-entry buffers are adopted, one storing read information and the other storing write information. All write operations go through the write buffer: the write command and data are first buffered and then issued to the external memory interface. The write buffer thus hides the set-up time and transfer delays from the system buses, reducing the time the buses are occupied by write operations and improving their utilization. For read operations, a split mechanism is adopted to reduce the time the system buses are occupied. After receiving a read command, the external memory controller immediately starts reading the off-chip memory at burst granularity and at the same time sends a split response to the master that requested the read, which releases the system buses. After the data have been read back from the off-chip memory and buffered in the read buffer, the external memory interface controller sends a split acknowledgement to the system bus controller, which then grants the buses to the requesting master a second time so that the read operation can complete. This mechanism releases the buses while data are being read from the off-chip memory, which further improves bus utilization. When read and write operations arrive at the system buses simultaneously, read operations have higher priority for bus access than write operations.

C. LCD Controller

The LCD controller fetches the decoded video data from the off-chip system memory and sends them to the liquid crystal display (LCD) control module in the appropriate data mode and protocol. It is designed as an AHB bus master, which can read the decoded video data automatically after being configured and initialized by the main RISC core. Video data in RGB format can be read back and then sent out to
display directly. Video data in other formats must first be converted to RGB format. For video of the same size, data in RGB format take twice the space of data in YUV420 format, which means that reading the video data in YUV420 format reduces the required bandwidth by 50%. For HD video, the amount of RGB data to display in one second is up to 187 million bytes, while it is only 93 million bytes for YUV420 data (1920*1080 pixels at 30fps, with 3 bytes per pixel for RGB and 1.5 bytes per pixel for YUV420).

To support YUV420 video data in the LCD controller, two stages of buffers are adopted. The first stage is composed of three 1024-entry 8-bit SRAM banks, which buffer the luma data and the two types of chroma data. Each 1024-byte buffer of the first stage can store half a row of luma data or a full row of chroma data for an HD 1080p picture. After entering the first-stage buffer, each luma sample is read out once while each chroma sample is read twice to produce one set of RGB data per pixel. The converted RGB data are buffered in the second stage, a 16-entry 24-bit SSRAM bank, and are then read out by the display controller and sent to the LCD control module.

D. Other Peripherals On-Chip

Several other peripherals are integrated in the SoC, such as the digital audio output interface (DAOI), IIC, GPIO (general-purpose input and output), UART, and timers. These peripherals demand less bandwidth and are connected to the system through APB buses. The DAOI controller can output digital audio in I2S or S/PDIF format, and can thus be connected directly to a digital audio device or digital earphone. The IIC controller can work in both master and slave mode. The LCD interface, digital audio output interface, and IIC can be converted to an HDMI interface through a SiI9134 [10] at the development board level.

IV. SOC PROTOTYPE CHIP AND TESTING RESULTS

A prototype chip of the reconfigurable SoC, named YHFT-RmSOC, is implemented in a 65nm CMOS process, and the gate count of the whole chip is 38 million. The layout, die, and package of YHFT-RmSOC are illustrated in Fig. 5. The die size is 9.86mm x 9.86mm, and the package has 484 pins.

A development board is designed and implemented, in which the reconfigurable SoC acts as the main processing engine. Through the external memory interface, two Flash devices (SST39VF6401) are connected to store the initialization program, and four SDRAM devices (MT48LC32M16A2) serve as the main working memory. A high-speed SD card stores the original multimedia stream files to be decoded, and two SSRAM devices (K7N643645M) are connected as extended decoding memory for the PE array. A SiI9134 device converts the signals from the LCD interface, DAOI, and IIC to HDMI signals, as illustrated in Fig. 6.

The decoding programs for H.264, AVS, and MPEG-2 are each mapped onto the reconfigurable SoC [11][12]. Under the test conditions, the main parts of the SoC work at 225MHz and the external memory interface controller works at 125MHz. The test results show that the reconfigurable SoC achieves real-time decoding of 1920*1080 @ 30fps video for the H.264, AVS, and MPEG-2 standards. A screenshot of the decoded video is also shown in Fig. 6.

Fig. 6. Development board of SoC and video screenshot

The performance and power consumption of different projects are compared in Table II, which shows that the reconfigurable SoC proposed in this paper achieves the best ratio of performance to power consumption among the projects that support decoding of videos in multiple standards. In addition, because it adopts reconfigurable computing, the SoC of this paper is also capable of handling other algorithms and applications.

TABLE II. COMPARISON OF PERFORMANCE AND POWER CONSUMPTION
Projects        Working frequency   Power consumption   Video decoding performance
This paper      225MHz              1.3W                30fps@HD(1920*1080)
SMP8646 [1]     800MHz              /                   30fps@HD(1920*1080)
DM8147 [3]      600MHz              6W                  60fps@HD(1920*1080)
CE4100 [4]      1.2GHz              7~9W                60fps@HD(1920*1080)
Pentium 4 [5]   3.7GHz              89W                 26.7fps@HD(1920*1080)
DM6446 [6]      600MHz              1.2W                28.1fps@CIF(352*288)
XPP 3 [7]       450MHz              3.4W                24fps@HD(1920*1080)
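One way to read the performance-to-power comparison in Table II is to normalize each entry to decoded pixels per second per watt. The sketch below does this in Python; the metric itself is an assumption made for illustration, since the paper does not define how the ratio is computed. SMP8646 is omitted because the table gives no power figure for it, and for CE4100 the lower bound of the 7~9W range is assumed, which is the most favorable choice for that device. The names `TABLE_II` and `mpixels_per_second_per_watt` are hypothetical.

```python
HD = (1920, 1080)
CIF = (352, 288)

# (resolution, frames per second, power in watts), taken from Table II.
TABLE_II = {
    "This paper": (HD, 30.0, 1.3),
    "DM8147":     (HD, 60.0, 6.0),
    "CE4100":     (HD, 60.0, 7.0),   # table lists 7~9W; lower bound assumed
    "Pentium 4":  (HD, 26.7, 89.0),
    "DM6446":     (CIF, 28.1, 1.2),
    "XPP 3":      (HD, 24.0, 3.4),
}


def mpixels_per_second_per_watt(resolution, fps, watts):
    """Decoded pixel throughput per watt, in millions of pixels per second per W."""
    width, height = resolution
    return width * height * fps / watts / 1e6


ratios = {name: mpixels_per_second_per_watt(*row) for name, row in TABLE_II.items()}
best = max(ratios, key=ratios.get)

for name, value in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:12s} {value:7.2f} Mpixel/s per W")
```

This simple normalization ignores differences in codec, bitstream complexity, and process technology, so it is only a rough indicator of the relative efficiency of the listed designs.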

Fig. 5. Layout, die and package of SoC

V. CONCLUSIONS
This paper proposes an SoC architecture for multimedia applications based on a large-scale reconfigurable PE array, a high-performance RISC core, and embedded peripherals, which are tightly coupled through AMBA2.0 system buses. The large-scale PE array is used to process video signals with
different standards under contexts that are loaded dynamically. The embedded peripherals are responsible for the input of the media stream and the output of the multimedia data to the display, while the RISC core handles the initialization of the peripherals and the reconfigurable PEs, the preprocessing of media stream data, audio decoding, the synchronization of audio and video data, and other scheduling functions. A prototype SoC chip is designed and implemented in a 65nm CMOS process, and the test results show that the reconfigurable SoC achieves real-time decoding of 1920*1080 @ 30fps (frames per second) video for the H.264, AVS, and MPEG-2 standards at a frequency of 225MHz.

ACKNOWLEDGMENT
This work is supported by the 863 High Tech Project "Design and Implement of the Reconfigurable SoC for the Embedded High-Performance Multimedia Applications" of China (No. 2009AA011704).

REFERENCES
[1] Sigma Designs: SMP86xx Development Kit, www.sigmadesigns.com
[2] Realtek RTDxx products, www.realtek.com.tk
[3] TMS320DM814x DaVinci Digital Multi-Media Processor, 2011, www.ti.com.cn
[4] Intel Atom TM Processor CE 4100. www.intel.com/go/consumer electronics
[5] Vanghn I, Jeff M, Bob R. Real-Time H.264/AVC Codec on Intel Architecture. International Conference on Image Processing, 2004, pp: 757-760.
[6] Kong X P, Lin H Z, Huang L F, Lin J N. Optimization of x264 Decoder Based on DaVinci Technology. International Conference on Biomedical Engineering and Computer Science, April 2010, pp: 1-4.
[7] Ganesan M K A, Singh S, May F, Becker J. H.264 Decoder at HD Resolution on a Coarse Grain Dynamically Reconfigurable Architecture. International Conference on Field Programmable Logic and Applications, August 2007, pp: 467-471.
[8] Davide R, Fabio C, Simone S, Stefano P, Roberto G. A Heterogeneous Digital Signal Processor for Dynamically Reconfigurable Computing. IEEE Journal of Solid-State Circuits, Vol.45, No.8, August 2010, pp: 1615-1626.
[9] Zhu M, Liu L B, Yin S Y, Wang Y S, Wang W J, Wei S J. A Reconfigurable Multi-Processor SoC for Media Applications. International Symposium on Circuits and Systems, May 2010, pp: 2011-2014.
[10] Silicon Image SiI9134 VastLane TM HDMI Transmitter. www.siliconimage.com
[11] Geng T S, Liu L B, Yin S Y, Zhu M, Jia W, Wei S J. Parallel Implementation of Computing-Intensive Decoding Algorithms of H.264 on Reconfigurable SoC. International Symposium on Circuits and Systems, May 2010, pp: 1153-1156.
[12] Zhao J, Zhou L, Yu Q D, Chen J. An Efficient Implementation of Motion Compensation for AVS HD Application Based on a Coarse-Grained Reconfigurable Processor. International Conference on Solid-State and Integrated Circuit Technology, August 2010, pp: 596-598.

