
LOW-POWER H.264 ARCHITECTURES FOR MOBILE COMMUNICATION

1. INTRODUCTION
Video compression plays an important role in today's wireless communications. It
allows raw video data to be compressed before it is sent through a wireless channel.
However, video compression is computation-intensive and dissipates a significant amount of
power, which is a major limitation in today's portable devices: existing multimedia devices
can only play video for a few hours before the battery is depleted. [1]
The latest video compression standard, MPEG-4 AVC/H.264, gives a 50% improvement
in compression efficiency compared to previous standards. However, the coding gain comes at
the expense of increased computational complexity at the encoder. Motion estimation (ME)
has been identified as the main source of power consumption in video encoders, consuming
50% to 90% of the total power used in video compression. The introduction of variable block
size partitions and multiple reference frames in the standard results in increased computational
load and memory bandwidth during motion prediction.
Block-based ME has been widely adopted by the industry due to its simplicity and
ease of implementation. Each frame is partitioned into blocks of 16 × 16 pixels, known as
macro blocks (MBs). Full-search ME predicts the current MB by finding the candidate that
gives the minimum sum of absolute differences (SAD), as follows: [1]
$$\mathrm{SAD}(i,j)=\sum_{k=0}^{M-1}\sum_{l=0}^{N-1}\left|\,C(k,l)-R(i+k,\,j+l)\,\right| \qquad (1)$$

where C(k, l) is the current macro block and R(i + k, j + l) is the candidate MB located
in the search window within the previously encoded frame. From (1), the power consumption
in ME is affected by the number of candidates and the total computation needed to calculate
the matching costs. Thus the power can be reduced by minimizing these parameters. [1]
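To make the cost model concrete, the following sketch implements equation (1) as a plain full-search block matcher. It is a minimal illustration under our own assumptions (the NumPy framing and the name full_search are ours, not part of the referenced architecture):

    import numpy as np

    def full_search(current_mb, ref_frame, x, y, p=8):
        """Exhaustive search for the candidate minimizing the SAD of eq. (1).

        current_mb : M x N block of the current frame
        ref_frame  : the previously encoded frame
        (x, y)     : top-left position of the current MB in the frame
        p          : search range; candidates lie within [-p, p] of (x, y)
        """
        m, n = current_mb.shape
        rows, cols = ref_frame.shape
        best_sad, best_mv = float("inf"), (0, 0)
        for i in range(-p, p + 1):
            for j in range(-p, p + 1):
                # Skip candidates that fall outside the reference frame.
                if not (0 <= x + i and x + i + m <= rows
                        and 0 <= y + j and y + j + n <= cols):
                    continue
                cand = ref_frame[x + i:x + i + m, y + j:y + j + n]
                sad = int(np.abs(current_mb.astype(int) - cand.astype(int)).sum())
                if sad < best_sad:
                    best_sad, best_mv = sad, (i, j)
        return best_mv, best_sad

Every candidate in the (2p + 1)^2 window is visited, which is why the power cost is proportional to both the number of candidates and the per-candidate computation.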
Furthermore, to maximize the available battery energy, the computational power
should be adapted to the supply power, picture characteristics, and available bandwidth.
Because these parameters change over time, the ME computation should be adaptable to
different scenarios without degrading the picture quality. Pixel truncation can be used to
reduce the computational load by allowing us to disable the hardware that processes the
truncated bits. While previous studies focused on fixed block size ME (16 × 16 pixels), very
little work has been done on the effect of pixel truncation for smaller block sizes. The
latest MPEG-4 standard, MPEG-4 AVC/H.264, allows variable block size motion estimation
(VBSME), defining 16×16, 16×8, 8×16, 8×8, 8×4, 4×8, and 4×4 block sizes. At smaller
block partitions, a better prediction is achieved for objects with complex motion. [2]
Truncating pixels at a 16×16 block size results in acceptable performance, as shown in the
literature. However, at smaller block sizes, the number of pixels involved during motion
prediction is reduced. Due to the truncation error, smaller blocks tend to yield multiple
matched candidates, which can lead to the wrong motion vector. Thus, truncating
pixels using smaller blocks results in poor prediction.
We have implemented a low-power algorithm and architecture for ME using pixel truncation
for smaller block sizes. The search is performed in two steps: 1) truncation mode and 2)
refinement mode. This method reduces the computational cost and memory access without
significantly degrading the prediction accuracy. In this project, we perform an in-depth
analysis of this technique and extend it to a complete H.264 system. [1]
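The two-step idea can be sketched as follows: a coarse pass matches on the truncated most significant bits, and a refinement pass re-evaluates only a short list of surviving candidates at full 8-bit resolution. The candidate short-listing policy and the names (two_step_search, keep) are our illustrative assumptions, not the exact published algorithm:

    import numpy as np

    def truncated_sad(c, r, ntb):
        """SAD computed on pixels with the NTB least significant bits dropped."""
        return int(np.abs((c >> ntb).astype(int) - (r >> ntb).astype(int)).sum())

    def two_step_search(current_mb, ref_frame, x, y, p=8, ntb=6, keep=8):
        m, n = current_mb.shape
        rows, cols = ref_frame.shape
        # Step 1: truncation mode -- coarse SAD over the whole search range.
        coarse = []
        for i in range(-p, p + 1):
            for j in range(-p, p + 1):
                if (0 <= x + i and x + i + m <= rows
                        and 0 <= y + j and y + j + n <= cols):
                    cand = ref_frame[x + i:x + i + m, y + j:y + j + n]
                    coarse.append((truncated_sad(current_mb, cand, ntb), (i, j)))
        coarse.sort(key=lambda t: t[0])
        # Step 2: refinement mode -- full 8-bit SAD on the best few candidates.
        best_sad, best_mv = float("inf"), (0, 0)
        for _, (i, j) in coarse[:keep]:
            cand = ref_frame[x + i:x + i + m, y + j:y + j + n]
            sad = int(np.abs(current_mb.astype(int) - cand.astype(int)).sum())
            if sad < best_sad:
                best_sad, best_mv = sad, (i, j)
        return best_mv, best_sad

Only the coarse pass touches every candidate, and it does so on 8 - NTB bits; the full-precision arithmetic runs on a handful of candidates, which is where the computation and memory-access savings come from.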

1.1. Introduction to Compression:


Compression is the process of reducing the size of the data sent, thereby reducing the
bandwidth required for the digital representation of a signal. Many inexpensive video and
audio applications are made possible by the compression of signals. Compression technology
can reduce transmission time, because less data is transmitted, and it decreases storage
requirements for the same reason. However, signal quality, implementation complexity, and
the introduction of communication delay are potential negative factors that should be
considered when choosing a compression technology. [2]
Video and audio signals can be compressed because of the spatial, spectral, and
temporal correlation inherent in these signals. Spatial correlation is the correlation between
neighboring samples in an image frame. Temporal correlation refers to the correlation between
samples in the same pixel position in different frames. Spectral correlation is the correlation
between samples of the same source from multiple sensors.

1.2. Practical Importance of Image compression:

Typical sizes of image or video data without any compression are in the range of
several gigabytes (for roughly 1.5 hours of video). Storing this much data is impractical,
and transmitting it is also infeasible, as it requires very high bandwidth. So, in order to
store or transmit the data, we need to compress it to meet these requirements. [1]

Fig. 1.1: Block diagram of an image compression/decompression system

2. LITERATURE SURVEY
2.1. Low-Power H.264 Architecture:


The latest video compression standard, MPEG-4 AVC/H.264, gives a 50% improvement
in compression efficiency compared to previous standards. However, the coding gain comes at
the expense of increased computational complexity at the encoder. Motion estimation (ME)
has been identified as the main source of power consumption in video encoders, consuming
50% to 90% of the total power used in video compression. The introduction of variable block
size partitions and multiple reference frames in the standard results in increased computational
load and memory bandwidth during motion prediction. [2]
Block-based motion estimation (ME) has been widely adopted by the industry due to
its simplicity and ease of implementation. Each frame is partitioned into blocks of 16 × 16
pixels, known as macro blocks (MBs).
Full-search ME predicts the current MB by finding the candidate that gives the
minimum sum of absolute differences (SAD), as follows:
$$\mathrm{SAD}(i,j)=\sum_{k=0}^{M-1}\sum_{l=0}^{N-1}\left|\,C(k,l)-R(i+k,\,j+l)\,\right|$$

where C(k, l) is the current MB, and R(i + k, j + l) is the candidate MB located in
the search window within the previously encoded frame. From the above equation, the power
consumption in ME is affected by the number of candidates and the total computation needed
to calculate the matching cost. Thus, the power can be reduced by minimizing these
parameters. [1]

Pixel truncation (normally a pixel is represented with 8 bits; here, instead of using
all 8 bits per pixel, we use fewer bits) can be used to reduce the
computational load by allowing us to disable the hardware that processes the truncated bits.
While previous studies focused on fixed block size ME (16 × 16 pixels), very little work has
been done on the effect of pixel truncation for smaller block sizes. The latest MPEG-4
standard, MPEG-4 AVC/H.264, allows variable block size motion estimation (VBSME).
At smaller block partitions, a better prediction is achieved for objects with complex motion.


Truncating pixels at a 16×16 block size results in acceptable performance. However,
at smaller block sizes, the number of pixels involved during motion prediction is reduced.
Due to the truncation error, smaller blocks tend to yield multiple matched candidates,
which can lead to the wrong motion vector. Thus, truncating pixels using smaller blocks
results in poor prediction. [2]
The search is performed in two steps:
1) Truncation mode and
2) Refinement mode.
This method reduces the computational cost and memory access without significantly
degrading the prediction accuracy.

3. EFFECT OF PIXEL TRUNCATION FOR VBSME

For video applications, data is highly correlated, and the switching activity is
distributed non-uniformly. Since the LSBs of a data word experience higher switching
activity, significant power reduction can be achieved by truncating these bits. In general,
about a 50% switching activity reduction is obtained if we truncate up to three LSBs. Further
reduction can be achieved if the number of truncated bits (NTB) is increased. For example,
if the NTB is set to 6, the switching activity can be reduced by 80% to 90%. This makes pixel
truncation attractive for minimizing power in ME. [2]
Table I shows the cumulative distribution function (CDF) of the SAD values obtained
during ME using five Foreman sequences. The SAD is grouped into five categories: 0%
represents the percentage of candidates with SAD = 0, <5% represents the percentage with
SAD < 5% of SADmax, and so on.
TABLE 1
CDF OF CALCULATED SAD DURING MOTION ESTIMATION USING THE
Foreman SEQUENCE (SEARCH RANGE P = ±8)

Block Size   NTB |   0%   <5%   <10%   <20%   <40%
16 x 16       0  |    0    25    60     94    100
16 x 16       4  |   0.2   25    60     94    100
8 x 8         0  |    0    35    58     87     98
8 x 8         4  |    5    35    58     87     98
4 x 4         0  |    0    58    58     81     99
4 x 4         4  |   12    58    58     81     99
For the 16×16 block size with NTB = 4, the percentage of SAD = 0 is close to that of the
untruncated case (NTB = 0). This shows that for a 16×16 block size, the truncated pixels are
likely to give the same matched candidate as the untruncated pixels. However, for a 4×4
block with NTB = 4, the percentage of SAD = 0 is 12%, compared to 0% for NTB = 0. This
shows that there are more matched candidates using truncated pixels for the 4×4 block size,
which could lead to incorrect motion vectors. [2]
To illustrate the effect of pixel truncation on VBSME, the average PSNR computed over
50 predicted frames of the Foreman sequence (QCIF at 30 frames/s) is given in Table II.
TABLE II
AVERAGE FULL-SEARCH PSNR (dB) FOR VARIOUS NTB USING SAD AS
MATCHING CRITERION (SEARCH RANGE P = ±8)

        16 x 16   Diff.  |  8 x 8   Diff.  |  4 x 4   Diff.
NTB 0    33.11    --     |  34.89   --     |  36.82   --
NTB 2    33.12    +0.01  |  34.85   -0.03  |  36.75   -0.07
NTB 4    33.03    -0.08  |  34.35   -0.54  |  34.66   -2.16
NTB 6    31.79    -1.33  |  30.29   -4.60  |  27.46   -9.36

The frames are predicted using the full-search algorithm at different block sizes and NTBs.
From Table II, for full pixel resolution (NTB = 0), the prediction accuracy improves as the
block size decreases. This is reflected by a higher PSNR for predictions using a 4×4 block
compared to a 16×16 block. For NTB = 4, a small PSNR drop is observed for a block size of
16×16 (0.08 dB) compared to untruncated pixels. The PSNR drop for predictions using
smaller block sizes is higher, with 0.54 dB and 2.16 dB drops for block sizes of 8×8 and
4×4, respectively. [2]
As we increase the NTB to 6, the PSNR drop for the smaller blocks increases rapidly.
The PSNR drop for the 16×16 block size is 1.33 dB. However, for the 8×8 and 4×4 block
sizes, the PSNR drop increases to 4.6 and 9.36 dB, respectively. This shows that plain pixel
truncation is not suitable for smaller block sizes. In the H.264 standard, substantial
improvement in motion prediction is gained by using smaller blocks. Therefore, it is
important to improve the PSNR, especially for smaller block partitions.
For video coding systems, motion estimation can remove most of the temporal
redundancy, so a high compression ratio can be achieved. Among the various ME algorithms,
the full-search block matching algorithm (FSBMA) is usually adopted because of its good
quality and regular computation. In FSBMA, the current frame is partitioned into many small
macro blocks (MBs) of size N × N, and for each MB in the current frame, the reference block
that is most similar to the current MB is sought in a search range of size [-P, P]. [2]
Although FSBMA provides the best quality among the various ME algorithms, it
consumes the largest computation power. In general, the computational complexity of ME
amounts to 50% to 90% of a typical video coding system. Hence, a hardware accelerator for
ME is required.


Variable block size motion estimation (VBSME) is a newer coding technique that provides
more accurate predictions than traditional fixed block size motion estimation.

Fig 3.1: Different architectures used for VBSME [2]

4. MOTION ESTIMATION
4.1 Introduction:
A standard movie, also known as a motion picture, can be defined as a
sequence of several scenes. A scene is a sequence of several seconds of
motion recorded without interruption, and usually lasts at least three seconds. A movie in
the cinema is shown as a sequence of still pictures at a rate of 24 frames per second.
Similarly, a TV broadcast consists of a transmission of 30 frames per second (NTSC, and
some flavors of PAL, such as PAL-M), 25 frames per second (PAL, SECAM), or anything
from 5 to 30 frames per second for typical videos on the Internet. [4]


The name motion picture comes from the fact that a video, once encoded, is nothing
but a sequence of still pictures shown at a reasonably high frequency, which gives the
viewer the illusion of continuous animation. Each frame is shown for one
small fraction of a second, more precisely 1/k seconds, where k is the number of frames per
second. Coming back to the definition of a scene, where the frames are captured without
interruption, one can expect consecutive frames to be quite similar to one another, as very
little time passes before the next frame is captured. With all this in mind we can
conclude that each scene is composed of at least 3k frames (since a scene is at least
3 seconds long). In the NTSC case, for example, this means that a movie is composed of a
sequence of various segments (scenes), each of which has at least 90 frames similar to one
another. [4]
Before going further with the details of motion estimation, we need to describe briefly
how a video sequence is organized. As mentioned earlier, a video is composed of a number of
pictures. Each picture is composed of a number of pixels or pels (picture elements). A video
frame has its pixels grouped in 8×8 blocks. The blocks are then grouped into macro blocks
(MBs), which are composed of 4 luminance blocks each (plus the equivalent chrominance
blocks). Macro blocks are then organized in groups of blocks (GOBs), which are grouped into
pictures (or into layers and then pictures). Pictures are further grouped into scenes, as described
above, and scenes are grouped into movies. Motion estimation is often performed
in the macro block domain. For simplicity's sake we'll refer to the macro blocks as blocks, but
we should remember that most often the macro block domain is the one in use for motion
estimation. [4]
The idea of motion estimation is that a block b of a current frame C is sought in
a previous (or future) frame R. If a block of pixels that is similar enough to block b is found
in R, then instead of transmitting the whole block, just a motion vector is transmitted.
Ideally, a given macro block would be sought in the whole reference frame;
however, due to the computational complexity of the motion estimation stage, the search is
usually limited to a pre-defined region around the macro block. Most often such a region
extends 15 or 7 pixels in all four directions in a given reference frame. The search region is
often denoted by the interval [-p, p], meaning that it includes p pixels in all directions. [4]

4.2. Video Compression Model:


The video compression model is a two-stage procedure. The first stage takes
advantage of the temporal redundancy, and is followed by a procedure similar to that used
for lossy image compression, which aims at exploiting the spatial redundancy. In the temporal
redundancy exploitation stage, we have motion estimation of the current frame (C) using the
reference frame (R). The first stage produces both a set of motion vectors (i, j) and
difference macro blocks (C - R). The difference macro blocks then go through the second
stage, which exploits spatial redundancy. One may notice that the difference frame usually
has very high spatial redundancy, because it only stores the differences of motion-estimated
macro blocks, as well as macro blocks for which a good match is not found in the reference
frame(s). [3]

4.3. Matching Criteria:


Let (x, y) denote the location of the current macro block. The pixels of the current
macro block can then be denoted by C(x+k, y+l), while the pixels in the reference frame can
be denoted by R(x+i, y+j). We now define a cost function based on the Mean Absolute Error
(MAE), also called the Mean Absolute Difference (MAD) and closely related to the Sum of
Absolute Differences (SAD). The matching block is the R(x+i, y+j) for which the MAE is
minimized; the corresponding (i, j) defines the motion vector. [4]
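For reference, the MAE criterion described above can be written out explicitly. This is a standard formulation consistent with the SAD definition in (1), not an equation reproduced from this document:

$$\mathrm{MAE}(i,j)=\frac{1}{MN}\sum_{k=0}^{M-1}\sum_{l=0}^{N-1}\left|\,C(x+k,\,y+l)-R(x+i+k,\,y+j+l)\,\right|$$

The motion vector is the (i, j) in the search window that minimizes this cost; SAD is the same expression without the 1/(MN) normalization.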
It is important to notice that the MAE is not the only option for the matching criterion, as
one can use the Mean Square Error and other expressions as well. MAE is, however, often
selected due to its computational simplicity. It is also worth mentioning that when the
minimum MAE is higher than a given threshold, the block is said to be not found. That
means that the motion estimation failed and the block is to be encoded without exploiting
temporal redundancy with regard to the reference frame R. We can observe this phenomenon
when there is a scene change, when a new object is inserted in the scene between the reference
frame and the current frame, or when the motion goes beyond the search area [-p, p]. [4]
Motion estimation doesn't have to be performed on a single reference frame. In some
cases bi-directional motion estimation is performed. That means that besides looking for a
macro block in a previous reference frame R, the block is also sought in a reference frame
F in the future. That is very useful for the case where a new object is inserted into the scene,
as it will be found in a future reference frame F even though it is not present in a previous
reference frame R. More generally, multiple frames in the past and future can be used by a
motion estimator, but most often we use either a single frame in the past, or one in the past
and one in the future, as described herein. [4]
As we have seen, the temporal prediction technique used in MPEG video is based on
motion estimation. The basic premise of motion estimation is that in most cases, consecutive
video frames will be similar except for changes induced by objects moving within the frames.
In the trivial case of zero motion between frames (and no other differences caused by noise,
etc.), it is easy for the encoder to efficiently predict the current frame as a duplicate of the
prediction frame. When this is done, the only information necessary to transmit to the
decoder becomes the syntactic overhead necessary to reconstruct the picture from the original
reference frame. When there is motion in the images, the situation is not as simple.

4.4. Algorithm for Motion Estimation:


Motion estimation is the most computationally intensive procedure in standard
video compression. Seeking a match for a macro block in a [-p, p] search region leads to
(2p + 1)^2 search locations, each of which requires M × N pixels to be compared, where M and N
give the size of the source macro block. With F the number of reference frames being
considered in the matching process, the procedure reaches a total of (2p + 1)^2 × M × N × F
evaluations of the Mean Absolute Error term |C(x + k, y + l) - R(x + i + k, y + j + l)|,
which involves 3 operations per pixel. Considering a standard 16×16 macro block
with p = 15 and F = 1 in a VGA-sized video stream (640×480 pixels), we have 40 × 30 = 1200
macro blocks per frame. That leads to a grand total of 778.41M operations for a single frame.
In a 30-fps video stream that entails 23.3523G operations per second for motion
estimation alone. If bi-directional motion estimation is in place we reach 46.7046G
operations per second. [3]
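Summarizing the counting argument above as one expression (our restatement of the quantities defined in the text, not a formula reproduced from the source):

$$\text{Ops/s}=(2p+1)^{2}\times MN\times 3\times F\times \frac{\text{MBs}}{\text{frame}}\times \frac{\text{frames}}{\text{s}}$$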
The scheme described herein is referred to as exhaustive search. If a video does not need
to be processed in real time, it may be the best option, as the exhaustive search ensures that
we will find the best match for every macro block. In some cases, however, we won't be able to
afford an exhaustive search. Some algorithms have been developed that aim at finding a
sub-optimal match in much less time than the exhaustive search.
The temporal prediction technique used in MPEG video is based on motion
estimation. The basic premise of motion estimation is that in most cases, consecutive video
frames will be similar except for changes induced by objects moving within the frames. In the
trivial case of zero motion between frames (and no other differences caused by noise, etc.), it
is easy for the encoder to efficiently predict the current frame as a duplicate of the prediction
frame. When this is done, the only information necessary to transmit to the decoder becomes
the syntactic overhead necessary to reconstruct the picture from the original reference frame.
When there is motion in the images, the situation is not as simple.[4]
The figure shows an example of a frame with 2 stick figures and a tree. The second half of
the figure is an example of a possible next frame, where panning has resulted in the tree moving
down and to the right, and the figures have moved farther to the right because of their own
movement in addition to the panning. The problem for motion estimation to solve is how to
adequately represent the changes, or differences, between these two video frames.

4.5. Motion Estimation Example:


The way that motion estimation goes about solving this problem is that a
comprehensive 2-dimensional spatial search is performed for each luminance macro block.
Motion estimation is not applied directly to chrominance in MPEG video, as it is assumed
that the color motion can be adequately represented with the same motion information as the
luminance. It should be noted at this point that MPEG does not define how this search should
be performed. This is a detail that the system designer can choose to implement in one of
many possible ways. This is similar to the bit-rate control algorithms discussed previously, in
the respect that complexity vs. quality issues need to be addressed relative to the individual
application. It is well known that a full, exhaustive search over a wide 2-dimensional area
yields the best matching results in most cases, but this performance comes at an extreme
computational cost to the encoder. As motion estimation usually is the most computationally
expensive portion of the video encoder, some lower cost encoders might choose to limit the
pixel search range, or use other techniques such as telescopic searches, usually at some cost
to the video quality. [3]
The figure shows an example of a particular macro block from Frame 2, relative to various
macro blocks of Frame 1. As can be seen, the top frame has a bad match with the macro
block to be coded. The middle frame has a fair match, as there is some commonality between
the 2 macro blocks. The bottom frame has the best match, with only a slight error between the
2 macro blocks. Because a relatively good match has been found, the encoder assigns motion
vectors to the macro block, which indicate how far horizontally and vertically the macro
block must be moved so that a match is made. As such, each forward and backward predicted
macro block may contain 2 motion vectors, so true bi-directionally predicted macro blocks
will utilize 4 motion vectors.


4.6. Motion Estimation Macro blocks Example:


The figure shows how a potential predicted Frame 2 can be generated from Frame 1 by using
motion estimation. In this figure, the predicted frame is subtracted from the desired frame,
leaving a (hopefully) less complicated residual error frame that can then be encoded much
more efficiently than before motion estimation. The more accurately the motion is estimated
and matched, the more likely it is that the residual error will approach zero and the coding
efficiency will be high. Further coding efficiency is accomplished by taking advantage of the
fact that motion vectors tend to be highly correlated between macro blocks. Because of this,
the horizontal component is compared to the previously valid horizontal motion vector and
only the difference is coded. The same difference is calculated for the vertical component
before coding. These difference codes are then described with a variable-length code for
maximum compression efficiency. [3]

4.7. Final Motion Estimation Prediction:


Of course, not every macro block search will result in an acceptable match. If the
encoder decides that no acceptable match exists (again, the "acceptable" criterion is not
defined by MPEG and is up to the system designer), then it has the option of coding that
particular macro block as an intra macro block, even though it may be in a P or B frame. In
this manner, high quality video is maintained at a slight cost to coding efficiency.

5. MOVING PICTURE EXPERTS GROUP


The Moving Picture Experts Group (MPEG) was formed by the ISO to set standards for
audio and video compression and transmission. It was established in 1988, and its first
meeting was in May 1988 in Ottawa, Canada. As of late 2005, MPEG had grown to include
approximately 350 members per meeting from various industries, universities, and research
institutions. MPEG's official designation is ISO/IEC JTC1/SC29/WG11 - Coding of moving
pictures and audio (ISO/IEC Joint Technical Committee 1, Subcommittee 29, Working Group
11).
The Joint Video Team (JVT) is a joint project between ITU-T SG16/Q.6 and ISO/IEC
JTC1/SC29/WG11 for the development of new video coding recommendations and
international standards.

5.1. Overview:
Compression methodology:
The MPEG compression methodology is considered asymmetric, in that the encoder is
more complex than the decoder. The encoder needs to be algorithmic or adaptive, whereas the
decoder is 'dumb' and carries out fixed actions. This is considered advantageous in
applications such as broadcasting, where the number of expensive complex encoders is small
but the number of simple inexpensive decoders is large. The ISO's approach to
standardization in MPEG is considered novel because it is not the encoder which is
standardized; instead, the way in which a decoder shall interpret the bit stream is defined. A
decoder which can successfully interpret the bit stream is said to be compliant. The advantage
of standardizing the decoder is that over time the encoding algorithms can improve, yet
compliant decoders will continue to function with them. The MPEG standards give very little
information regarding the structure and operation of the encoder, and implementers can supply
encoders using proprietary algorithms. This gives scope for competition between different
encoder designs, which means that better designs can evolve; users will have greater
choice, because different levels of cost and complexity can exist in a range of coders, yet a
compliant decoder will operate with them all.

MPEG also standardizes the protocol and syntax under which it is possible to combine
or multiplex audio data with video data to produce a digital equivalent of a television
program. Many such programs can be multiplexed together, and MPEG defines the way in
which such multiplexes can be created and transported.

5.2. Sub Groups:


Coding of moving pictures and audio has the following subgroups:
1. Requirements
2. Systems
3. Video
4. Audio
5. 3D Graphics compression
6. Test


6. HARDWARE IMPLEMENTATION

In the hardware implementation, we propose architectures to implement the two-step
algorithm. First, the conventional ME architecture that is used in our analysis is
reviewed. Next we discuss the architectures needed to support the two-step method. The area
and power overhead for the computation and memory units are also investigated. Based on
these analyses, we propose three low-power ME architectures with different area and power
efficiencies. In this project, we implement the ME architecture based on the 2-D ME discussed
earlier. We choose 2-D ME because it can cope with the high computational needs of the
real-time requirements of H.264 using a lower clock frequency than a 1-D architecture. [5]
6.1. Computation unit:


The figure shows the functional units in the conventional 2-D ME (me_sad). The ME consists
of the search area (SA) memory, a processing array which contains 256 processing elements
(PEs), an adder tree, a comparator, and a decision unit. The search area memory consists of 16
memory banks, where each bank stores 8-bit pixels in H × W / N words in total, where H and
W are the search area window's height and width, respectively, and N is the MB's width.
During motion prediction, 16 pixels are read from the 16 memory banks simultaneously. The
data in the memory are stored in a ladder-like manner to avoid delay during the scanning. At
each initial search, the current and the first candidate MB are loaded into the processing
array's registers. The array then calculates the matching costs for one candidate per clock
cycle. The 256 absolute differences from the PEs are summed by the adder tree, which outputs
the SADs for the 41 block partitions. The adder tree reuses the SADs of the 4×4 blocks to
calculate the larger block partitions. In total, the adder tree calculates the 41 partitions per
clock cycle. [5]
Throughout the scanning process, the comparator updates the minimum SAD and the
respective candidate location for each of the 41 block partitions. Once the scanning is
complete, the decision unit outputs the best MB partition and its motion vectors. The ME
requires 256 clock cycles to scan all candidates.
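The SAD-reuse step can be illustrated in software: the sixteen 4×4 SADs are computed once and then summed into the larger partitions, so no pixel is revisited. The helper below is our own sketch of that combining scheme (the indexing conventions are ours, not the RTL):

    import numpy as np

    def vbsme_sads(current_mb, candidate_mb):
        """Compute SADs for H.264 partitions of a 16x16 MB by reusing 4x4 SADs.

        Returns a dict keyed by (block_w, block_h, bx, by). The 41 partitions:
        16 of 4x4, 8 of 8x4, 8 of 4x8, 4 of 8x8, 2 of 8x16, 2 of 16x8, 1 of 16x16.
        """
        diff = np.abs(current_mb.astype(int) - candidate_mb.astype(int))
        # Base level: 4x4 SADs, indexed by block coordinates (bx, by) in 0..3.
        sad4 = diff.reshape(4, 4, 4, 4).sum(axis=(1, 3))   # sad4[by, bx]
        sads = {}
        for by in range(4):
            for bx in range(4):
                sads[(4, 4, bx, by)] = sad4[by, bx]
        # Larger partitions are sums of 4x4 SADs -- no pixel is re-read.
        for by in range(4):
            for bx in range(0, 4, 2):
                sads[(8, 4, bx, by)] = sad4[by, bx] + sad4[by, bx + 1]
        for by in range(0, 4, 2):
            for bx in range(4):
                sads[(4, 8, bx, by)] = sad4[by, bx] + sad4[by + 1, bx]
        for by in range(0, 4, 2):
            for bx in range(0, 4, 2):
                sads[(8, 8, bx, by)] = sad4[by:by + 2, bx:bx + 2].sum()
        for bx in range(0, 4, 2):
            sads[(8, 16, bx, 0)] = sad4[:, bx:bx + 2].sum()
        for by in range(0, 4, 2):
            sads[(16, 8, 0, by)] = sad4[by:by + 2, :].sum()
        sads[(16, 16, 0, 0)] = sad4.sum()
        return sads   # 16 + 8 + 8 + 4 + 2 + 2 + 1 = 41 entries

The hardware analogue is that the sixteen 4×4 SADs appear at the first level of the adder tree, and each larger partition is just an extra layer of additions, so all 41 results are available in the same cycle.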

For me_sad, the input and output of each PE are 8 bits wide, as shown in Fig. 1(a).
The input to the adder tree is 8 bits wide, and the SAD output is 12 to 16 bits wide,
depending on the partition size. These data are then input to the comparator, together with
the current search location information. [5]
Using a similar architecture to me_sad, the DPC-based ME (me_dpc) requires only two bits
for the current and reference pixel inputs, as shown in Fig. 1(b). Furthermore, the matching
cost is calculated using Boolean logic (XOR and OR) rather than the arithmetic operations of
the SAD-based PE. This makes the overall area of the 256 PEs in me_dpc much smaller than
in me_sad. The reduced output bitwidth of the DPC-based PE also reduces the bitwidth
required for the adder tree and comparator unit. The input and output of the adder tree are
1 bit and 5 to 9 bits wide, respectively. A similar bitwidth applies to the comparator's input.
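As a rough software analogue of the Boolean matching cost: if each pixel is reduced to its two MSBs, a candidate pixel "mismatches" when either bit differs, which can be computed with XOR and OR alone. This is our interpretation of the DPC cost for illustration; the exact per-bit logic of the me_dpc PE may differ:

    def dpc_cost(cur_bits, ref_bits):
        """Count mismatching pixels from 2-bit pixel values using only XOR/OR.

        cur_bits, ref_bits: sequences of 2-bit integers (the two MSBs of each
        pixel). A pixel contributes 1 to the cost when any of its bits differs.
        """
        cost = 0
        for c, r in zip(cur_bits, ref_bits):
            x = c ^ r                      # XOR: per-bit difference
            cost += (x >> 1) | (x & 1)     # OR of the two difference bits
        return cost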
Table (1) compares the area (mm2), the total equivalent gates (based on a 2-input NAND
gate), and the power consumption (mW) of the me_sad and me_dpc computational units. The
comparisons are based on synthesis results using 0.13-μm UMC CMOS technology. [5]
TABLE (1)
ME_SAD AND ME_DPC AREA (mm2), TOTAL EQUIVALENT GATES
(BASED ON 2-INPUT NAND GATE) AND POWER (mW)

                          me_sad                      me_dpc
Modules            Area    Gates    Power  |  Area    Gates    Power
256 PE             0.90   173611    28.67  |  0.32    61728     2.31
Adder tree         0.13    25077     5.53  |  0.04     7716     0.99
Comparator Unit    0.11    21219     1.25  |  0.09    17361     0.86
Decision Unit      0.10    19290     0.54  |  0.07    13503     0.50
Total              1.24   239198    36.00  |  0.52   100309     4.66

The table shows that me_sad's area is dominated by the 256 PEs (73%). Thus, with the
significantly smaller area for the 256 PEs, me_dpc requires less area than me_sad; overall,
me_dpc requires 42% of the me_sad area.
Based on the above analysis, we propose two types of architectures for the ME
computation unit that can perform low-resolution and full-resolution searches. These are
me_split and me_combine, as shown in Fig. 3(a) and (b).
Me_split implements me_sad and me_dpc as two separate modules, as shown in Fig.
3(a). During the low-resolution search, me_sad is switched off, while me_dpc performs
the search. The second step uses me_sad, while me_dpc is switched off. This
architecture allows only the necessary bit size to be used during the different search modes.


While potential power savings are possible, this architecture requires additional area for the
adder tree, comparator, and decision unit to support the low-resolution search.

Fig. 3(a): me_split computational unit

Fig. 3(b): me_combined computational unit


Because the functions of the adder tree, the comparator, and the decision unit are
similar for both me_sad and me_dpc, me_combined shares these units between the
low-resolution search and the full pixel resolution search, as shown in Fig. 3(b). This
architecture results in a much smaller area compared to me_split. However, higher power
consumption is expected during the low-resolution search, because the adder tree, comparator,
and decision unit operate at a higher bit size than needed. [5]

6.1.1. Memory architecture (Search Area Memory Organization)


The conventional ME architecture implements the SA memory using a single-port static
random access memory (SRAM) with one pixel (8 bits) per word. To implement the two-step
search, we need to access the first two MSBs of each pixel during the first search and all 8 bits
in the second stage. Thus, the pixels need to be stored so as to allow two reading modes. For
this, three types of memory architectures are proposed: (a) 8-bit memory (mem8), (b) 2-bit
and 8-bit memory (mem28), and (c) 8-bit memory with prearranged data and transposed
registers (mem8pre), as shown in Fig. (4). [3]

Fig. 6.1: SA memory arrangement: (a) mem8, (b) mem28, and (c) mem8pre
Mem8 stores the data in the same way as the conventional ME. We access 8-bit
data during both the low-resolution and the refinement stage. However, during the
low-resolution search, the lower six bits are not used by the PEs. Because the memory is
accessed during both the low-resolution and the refinement stage, this results in a higher
memory bandwidth than the conventional ME architecture.
To overcome the problem in mem8, mem28 uses two types of memory: 2-bit and 8-bit.
The 2-bit memory stores the first two MSBs of each datum, and the 8-bit memory stores
the complete full pixel bitwidth. During the low-resolution search, the data from the 2-bit
memory are accessed. This allows only the required bits to be accessed, without wasting any
power during the low-resolution stage. In the refinement stage, the 8-bit memory is read into
the PEs. Although this architecture potentially reduces the memory bandwidth and power
consumption, it needs additional area for the 2-bit memory. [3]
In mem8pre, the data are prearranged before being stored in the 8-bit memory. Four pixels
are grouped together and then transposed according to their bit position, as shown in Fig. (5).

Fig. 6.2: Storing 8-bit pixels in 8-bit memory: (a) conventional arrangement and (b) mem8pre.

During the low-resolution search, we read only the memory locations that store the
first two MSBs of the original pixels. Thus, the total memory accessed during the
low-resolution stage is one-fourth of the conventional full pixel access. [3]
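The prearrangement can be pictured as a bit-plane transpose over groups of four pixels: each stored word holds the same two bit-positions of four different pixels, so one read returns the two MSBs of four pixels at once. The sketch below is our own model of that idea (the exact word layout in mem8pre may differ):

    def prearrange(px):
        """Transpose four 8-bit pixels into four 8-bit words by bit position.

        Word w holds bits (7 - 2w) and (6 - 2w) of all four pixels, so word 0
        alone carries the two MSBs of every pixel in the group.
        """
        assert len(px) == 4
        words = []
        for w in range(4):
            hi, lo = 7 - 2 * w, 6 - 2 * w
            word = 0
            for i, p in enumerate(px):
                word |= ((p >> hi) & 1) << (2 * i + 1)   # high bit of the pair
                word |= ((p >> lo) & 1) << (2 * i)       # low bit of the pair
            words.append(word)
        return words

    def read_two_msbs(words):
        """Low-resolution read: one access to word 0 gives all four 2-bit pixels."""
        return [(words[0] >> (2 * i)) & 0b11 for i in range(4)]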
In the full-resolution search, we read the four memory locations that contain the first
through eighth bits in four clock cycles. Delay buffers, as shown in Fig. (4), realign these
words to match the original 8-bit pixels. By prearranging the pixels this way, we can use the
same memory size as in the conventional full search while retaining the ability to access the
first two MSBs as well as the full bit resolution. The drawback of this approach is that it
needs additional circuitry to transpose and realign the pixels during motion prediction. The
estimated bandwidth for the above three memory architectures is shown in Table (2).


TABLE (2)
MEMORY BANDWIDTH FOR DIFFERENT ARCHITECTURES

           Low-resolution     High-resolution
Mem8       NWH x 8-bit        (WH/4) x 8-bit
Mem28      NWH x 2-bit        (N·WH/4) x 8-bit
Mem8pre    NWH x 2-bit        (N·WH/4) x 8-bit

The memories for the search area data consume a non-trivial hardware cost and power
dissipation. For instance, the memory modules of the 2-D design occupy almost 50% of the
die area. In fact, the memory organization is a critical issue for the system design, especially
in H.264, which applies multiple reference frames. In this section, a memory mapping
algorithm is proposed to reduce the number of memory partitions in our architecture.
Consequently, the hardware cost and the power consumption are both optimized.


Fig. 6.3: Memory Mapping Algorithm


For the m-PEG configuration, where m ∈ {1, 2, 4, 8, 16}, we use the following algorithm
to organize the search area memories. The search area is (M+15) pixels wide and (N+15)
pixels high. In order to make the physical implementation convenient, one pixel is extended
in both the vertical and horizontal directions, so the search area memory size is
(M+16) × (N+16) pixels. It is divided into (M+16)/m logic partitions, and each logic partition
is m pixels wide, as illustrated in Fig. 5. There are p = ⌈(15 + m)/m⌉ physical partitions, and
each partition is also m pixels wide. The l-th logical partition is mapped to the (l mod p)-th
physical partition, and its base address is ⌊l/p⌋·(N+16). The depth of each physical memory
partition is ⌈(M+16)/(m·p)⌉·(N+16). [3]
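The mapping rule can be written directly as a small function; this is a literal transcription of the formulas above, where the floor/ceiling placement follows our reading of the bracket notation:

    from math import ceil

    def map_partition(l, m, M, N):
        """Map logical search-area partition l to its physical memory location.

        m    : partition width in pixels (PEG count), m in {1, 2, 4, 8, 16}
        M, N : search width/height minus 15; search area is (M+16) x (N+16)
        Returns (physical partition index, base address, partition depth).
        """
        p = ceil((15 + m) / m)            # number of physical partitions
        phys = l % p                      # l-th logical -> (l mod p)-th physical
        base = (l // p) * (N + 16)        # base address of logical partition l
        depth = ceil((M + 16) / (m * p)) * (N + 16)
        return phys, base, depth

    # 16-PEG example from the text: search area 64 x 48 (M = 48, N = 32).
    # p = 2, so L0, L2 map to module M0 (bases 0 and 48) and L1, L3 to M1,
    # each module being ceil(64 / 32) * 48 = 96 rows deep.
    for l in range(4):
        print(l, map_partition(l, 16, 48, 32))

This reproduces the layout described in the example below: two physical modules, each 96 pixels high, with L2 stacked above L0 and L3 above L1.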


For example, suppose the search width is 48 and the search height is 32, so the search area is
63 × 47 pixels. One pixel is extended in both the vertical and horizontal directions; thus, the
memory capacity for the search area is 64 × 48 pixels. When 16 PEGs are configured, the
memory is divided into 4 logic partitions, labeled L0, L1, L2, and L3, as illustrated in
Fig. 6(a). Each solid-line rectangle represents one 16-pixel-wide and 48-pixel-high logic
partition. The ME processing includes three stages. In the first stage, the area covered by the
slash pattern is searched, so L0 and L1 are active. In the second stage, the sub search area is
moved 16 pixels horizontally; the rectangle with the backslash pattern includes these search
candidates, and L1 and L2 are active in this stage. In the last stage, L2 and L3 are used, and
the rectangle filled with dots represents this sub search area. The intuitive approach is to
implement these logic partitions with four memory modules, each module being 48 words ×
128 bits. But this method causes low memory I/O utilization, which is just 48%. When the
search width is increased, the memory utilization becomes even worse. [3]

(a) Logic Memory Partitions

(b) Physical Memory Partitions.


Fig. 6.4: 16-PEG Design Layout


Based on the proposed mapping algorithm, just two memory modules are required for
the search area data storage, as shown in Fig. 6(b). Each memory module is 16 pixels wide
and 96 pixels high. L2 and L0 are stacked up and mapped to M0; L3 and L1 are mapped to
M1 in the same way. The output pixels dispatched to the PEGs come from the outputs of M0
and M1. In the first search stage, the read pointers of M0 and M1 both begin from row 0.
The 16 most significant pixels (MSPs) of REF come from M0_O, and the remaining 15 least
significant pixels (LSPs) come from M1_O[127:8]. In the second search stage, the read
pointer of M0 starts from row 48, which is the start point of the L2 logic partition. The
positions of M0_O and M1_O in REF are exchanged; namely, the 16 MSPs of REF
come from M1_O and the remaining 15 LSPs come from M0_O[127:8]. In the third stage,
both read pointers are initialized to row 48, and the format of REF is the same as in the first
stage. [3]
Based on the proposed memory mapping algorithm, the I/O utilization can reach
96.9%, and just 2 memory modules are required. Compared with the intuitive memory
architecture, a 41% hardware cost saving is obtained.


6.1.2. Processing element:


A processing element (PE) is a generic term for a hardware element that executes a stream of
instructions. The context defines what unit of hardware is considered a processing element
(e.g., a core, a processor, or a computer). Consider a cluster of SMP workstations. In some
programming environments, each workstation is viewed as executing a single instruction
stream; in this case, a processing element is a workstation. A different programming
environment running on the same hardware, however, may view each processor or core of the
individual workstations as executing an individual instruction stream; in this case, the
processing element is the processor or core rather than the workstation. [21]
To integrate more general-purpose PEs and to make a practical vision chip which can
be used in real systems, a new and much simpler architecture called
S3PE (Simple and Smart Sensory Processing Elements) has been developed. We introduce the
details of the architecture below.
The block diagram of the whole chip is shown in the figure below. Each PE is directly
connected to a photo-detector, an output circuit, and its four neighboring PEs. The input
image signals are A/D-converted and transmitted in parallel to all PEs. The instruction codes
from the external pins are transmitted to all the PEs and processed simultaneously (SIMD-type
processing). The resulting data are transmitted to the output circuit, and the feature
quantities are extracted and transmitted to the external pins. [21]


Structure of the PE
The block diagram of the PE is shown in Fig. 2. Each PE consists of an ALU, local
memory, and three registers. The ALU takes charge of the calculations, the memory data
recording, and I/O. Two registers, called the A register and the B register, read data from the
memory, and the ALU performs an operation on the data. The result is then fetched by the Z
register and written back into the memory. This process is defined as a single cycle, and by
performing several cycles various kinds of algorithms can be processed. [21]
The block diagram of the ALU is shown in Fig. 3. The ALU can process one of 10
logical and 8 arithmetic operations at a time. They are all binary operations, and multi-bit
operations are processed by repeating single operations serially.
The local memory has a 5-bit address space and consists of a 24-bit RAM and an 8-bit
memory-mapped I/O connected to a sensor, an output circuit, the four neighboring PEs, and
ground. Each bit can be randomly accessed. The address map is shown in Table 1. [21]


The function of the ALU and the size of the memory have proven to be sufficient for most
early visual processing algorithms, which are often used in vision applications.


The sum of absolute differences may be used for a variety of purposes, such as object
recognition, the generation of disparity maps for stereo images, and motion estimation for
video compression.
SAD is an algorithm used in video compression that adds up the absolute differences
between corresponding elements in the macro blocks of video frames. When coding or
compressing video, the similarities between video frames should be exploited to achieve
better compression ratios. Using the usual coding techniques on moving objects within a
video scene diminishes the compression efficiency, because they only consider the pixels
located at the same position in the video frames. Motion estimation and the SAD algorithm
are used to capture such movements more accurately for better compression efficiency. [23]

6.1.3. Adder tree:


An adder is a computational device that performs arithmetic addition.
Circuit Description:
This circuit counts the number of active (1) bits in the input word (also known as the
Hamming weight). A single full adder can be used to calculate the Hamming weight of a
three-bit input word. For larger input word widths, a tree of adders can be used to combine
the intermediate sums from the previous stages.
To keep the circuit layout readable, the carry-in inputs of the multi-bit adders are tied
to GND (logical 0) in the example. Naturally, these carry-in inputs could be connected to
additional input pins, which would allow for 15 bits of input with the same number of adders.
As 15 is also the highest number that one three-bit adder can generate, a 16-bit bit counter
would obviously require another adder, and so on.
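A software rendering of the same adder-tree idea counts bits pairwise, level by level, rather than serially (an illustrative sketch; the 16-bit width is an arbitrary choice of ours):

    def hamming_weight_tree(word, width=16):
        """Count set bits with a balanced adder tree, as in the circuit above.

        Level 0 holds the individual bits; each level sums adjacent pairs,
        mirroring how full adders combine intermediate sums in hardware.
        """
        sums = [(word >> i) & 1 for i in range(width)]
        while len(sums) > 1:
            if len(sums) % 2:              # odd count: pad so pairs line up
                sums.append(0)
            sums = [sums[i] + sums[i + 1] for i in range(0, len(sums), 2)]
        return sums[0]

    assert hamming_weight_tree(0b1011_0010_0110_0001) == 7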


6.1.4. Comparator unit:


Definitions of comparator:
- In electronics, a comparator is a device which compares two voltages or currents and
switches its output to indicate which is larger.
- Any device for comparing a physical property of two objects, or an object with a
standard; an electronic device that compares two voltages, currents, or streams of data.
- A machine used for looking for parallax motion, proper motion, asteroids, or variable
stars by quickly alternating between viewing two photographic plates from two
different times.


Op-amp voltage comparator:

A simple op-amp comparator


An operational amplifier (op-amp) has a well-balanced difference input and a very
high gain. The similarity of these characteristics allows op-amps to serve as comparators in
some applications.
A standard op-amp operating in open-loop configuration (without negative feedback) can be
used as a comparator. When the non-inverting input (V+) is at a higher voltage than the
inverting input (V-), the high gain of the op-amp causes it to output the most positive voltage
it can. When the non-inverting input (V+) drops below the inverting input (V-), the op-amp
outputs the most negative voltage it can. Since the output voltage is limited by the supply
voltage, for an op-amp that uses a balanced, split supply (powered by ±VS), this action can be
written:
Vout = Ao(V1 - V2)
In practice, using an operational amplifier as a comparator presents several disadvantages as
compared to using a dedicated comparator:
1. Op-amps are designed to operate in the linear mode with negative feedback. Hence,
an op-amp typically has a lengthy recovery time from saturation. Almost all op-amps
have an internal compensation capacitor which imposes slew rate limitations for high
frequency signals. Consequently an op-amp makes a sloppy comparator with
propagation delays that can be as slow as tens of microseconds.
2. Since op-amps do not have any internal hysteresis an external hysteresis network is
always necessary for slow moving input signals.
3. The quiescent current specification of an op-amp is valid only when the feedback is
active. Some op-amps show an increased quiescent current when the inputs are not
equal.


4. A comparator is designed to produce well-limited output voltages that easily interface
with digital logic. Compatibility with digital logic must be verified when using an
op-amp as a comparator.
Dedicated voltage comparator chips:

Several voltage comparator ICs:


A dedicated voltage comparator will generally be faster than a general-purpose
operational amplifier pressed into service as a comparator. A dedicated voltage comparator
may also contain additional features such as an accurate, internal voltage reference, an
adjustable hysteresis and a clock gated input.
A dedicated voltage comparator chip such as the LM339 is designed to interface with a
digital logic interface (TTL or CMOS). The output is a binary state often used to
interface real-world signals to digital circuitry (see analog-to-digital converter). If there is a
fixed voltage source from, for example, a DC adjustable device in the signal path, a
comparator is just the equivalent of a cascade of amplifiers. When the voltages are nearly
equal, the output voltage will not fall into one of the logic levels, and analog signals will
enter the digital domain with unpredictable results. To make this range as small as possible,
the amplifier cascade has high gain. The circuit consists mainly of bipolar transistors, except
perhaps in the input stage, which will likely use field-effect transistors. For very high
frequencies, the input impedance of the stages is low. This reduces the saturation of the slow,
large P-N junction bipolar transistors that would otherwise lead to long recovery times. Fast
small Schottky diodes, like those found in binary logic designs, improve the performance
significantly, though the performance still lags that of circuits with amplifiers using analog
signals. Slew rate has no meaning for these devices. For applications in flash ADCs, the
distributed signal across 8 ports matches the voltage and current gain after each amplifier, and
resistors then behave as level-shifters.
The LM339 accomplishes this with an open-collector output. When the inverting
input is at a higher voltage than the non-inverting input, the output of the comparator
connects to the negative power supply. When the non-inverting input is higher than the
inverting input, the output is 'floating' (has a very high impedance to ground).
Output type:

Fig. 6.5: A Low-Power CMOS Clocked Comparator


Because comparators have only two output states, their outputs are near zero or near
the supply voltage. Bipolar rail-to-rail comparators have a common-emitter output that
produces a small voltage drop between the output and each rail. That drop is equal to the
collector-to-emitter voltage of a saturated transistor. When output currents are light, the output
voltages of CMOS rail-to-rail comparators, which rely on a saturated MOSFET, range closer
to the rails than their bipolar counterparts.
On the basis of their outputs, comparators can also be classified as open-drain or push-pull.
Comparators with an open-drain output stage use an external pull-up resistor to a positive
supply that defines the logic high level. Open-drain comparators are more suitable for
mixed-voltage system design. Since the output is high impedance at the logic high level,
open-drain comparators can also be used to connect multiple comparators onto a single bus.
A push-pull output does not need a pull-up resistor and can also source current, unlike an
open-drain output.
Input voltage range:
The input voltages must stay within the limits specified by the manufacturer. Early integrated
comparators, like the LM111 family, and certain high-speed comparators like the LM119
family, require input voltage ranges substantially lower than the power supply voltages (±15
V vs. 36 V).[1] Rail-to-rail comparators allow any input voltage within the power supply
range. When powered from a bipolar (dual-rail) supply,
VS- ≤ V+, V- ≤ VS+
or, when powered from a unipolar TTL/CMOS power supply:
0 ≤ V+, V- ≤ Vcc
Specific rail-to-rail comparators with p-n-p input transistors, like the LM139 family,
allow the input potential to drop 0.3 V below the negative supply rail, but do not allow it to
rise above the positive rail.[2] Specific ultra-fast comparators, like the LMH7322, allow the
input signal to swing below the negative rail and above the positive rail, although by a narrow
margin of only 0.2 V.[3] The differential input voltage (the voltage between the two inputs) of
a modern rail-to-rail comparator is usually limited only by the full swing of the power supply.

6.1.5. Decision unit:


Because the functions of the adder tree, the comparator, and the decision unit are
similar for both me_sad and me_dpc, me_combined shares these units between the
low-resolution search and the full pixel resolution search, as shown in Fig. 3(b). This
architecture results in a much smaller area compared to me_split. However, higher power
consumption is expected during the low-resolution search, because the adder tree, comparator,
and decision unit operate at a higher bit size than needed.
From the above discussion, we propose three different architectures that can perform both
low-resolution and full-resolution searches. By combining different computation and memory
units, we propose the following architectures:
(1) me_split + mem28 (ms_m8)
(2) me_combine + mem8 (mc_8)
(3) me_combine + mem8pre (mc_m8p)
In these architectures, both the low-resolution and the full-resolution search can be performed.
With the proper configuration, the conventional full-search algorithm can be used during
normal conditions to ensure a high quality picture at the output. In conditions where energy is
limited, the two-step search can reduce the energy consumption without significantly
degrading the output picture quality.

7. H.264/AVC
7.1. Overview of H.264:


H.264/AVC is the newest international video coding standard, jointly
developed by the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving
Picture Experts Group (MPEG). Compared with previous standards, H.264/AVC can provide
much better peak signal-to-noise ratio (PSNR) and visual quality. This high performance is
mainly due to the many new techniques adopted by H.264/AVC, such as variable block size
motion compensation, quarter-sample-accurate motion compensation, multiple reference
picture motion compensation, in-the-loop deblocking filtering, and so on. [13]
In H.264/AVC, motion estimation (ME) is conducted on different block sizes,
including 4×4, 4×8, 8×4, 8×8, 8×16, 16×8, and 16×16. During ME, all the block sizes inside
one macro block (MB) are processed, and the block mode with the best R-D cost is then
chosen; this process is named VBSME. Because of the intensive computation of ME, a
hardware accelerator is essential for a real-time encoding system. The full-search algorithm is
widely used because it has the following merits: (1) its performance is superior to other, fast
algorithms and stable across applications; (2) its processing time is predictable and fixed; (3)
its control logic and memory access are simple and regular. [13]
In the H.264/AVC reference software, the best matching position for one block is decided
by the sum of absolute differences (SAD) and the coding cost of the motion vector difference
(MVD). However, the calculation of the MVD needs the exact motion vectors (MVs) of the
left, top, and top-right neighboring blocks. Therefore, the four 8×8 sub-partitions in one MB
have to be processed in sequence. This inherent data dependency in the integer ME (IME)
algorithm makes the parallel processing of all 41 blocks within one MB infeasible. A modified
IME algorithm has been proposed in the literature in which the MVD cost is not taken into
account and SAD is the only criterion in IME processing. Because this algorithm avoids the
data dependency caused by the MVD with negligible PSNR loss, it is more suitable for
hardware implementation. [13]
Based on the above algorithm, an efficient 2-D VBSME full-search architecture has been
proposed. The design has 256 PEs and can achieve a high throughput, which makes it
suitable for real-time high-resolution video processing. However, this architecture also has
some demerits: (1) the search area data memory is partitioned into 16 modules in the column
direction to realize the memory interleaving scheme; (2) in order to fully utilize the 2-D PE
array, the search area memory should be further divided in the row direction.


7.2. H.264 CODEC:


In common with earlier standards (such as MPEG-1, MPEG-2 and MPEG-4), the H.264
draft standard does not explicitly define a CODEC. Rather, the standard defines the syntax of
an encoded video bit stream together with the method of decoding this bit stream. In practice,
however, a compliant encoder and decoder are likely to include the functional elements
shown in Figs. 7.1 and 7.2. Whilst the functions shown in these figures are likely to be
necessary for compliance, there is scope for considerable variation in the structure of the
CODEC. The basic functional elements are little different from those of previous standards;
the important changes in H.264 occur in the details of each functional element.
The encoder includes two dataflow paths: a 'forward' path (left to right, shown in
blue) and a 'reconstruction' path (right to left, shown in magenta). The data flow path in the
decoder (Fig. 7.2) is shown from right to left to illustrate the similarities between encoder and
decoder. [13]

(Fig 7.1 AVC Encoder)


(Fig 7.2 AVC Decoder)


7.3. Encoder (forward path):
An input frame Fn is presented for encoding. The frame is processed in units of a macro block (corresponding to 16×16 pixels in the original image). Each macro block is encoded in intra or inter mode. In either case, a prediction macro block P is formed based on a reconstructed frame. In intra mode, P is formed from samples in the current frame n that have previously been encoded, decoded and reconstructed (uF'n in the figures; note that the unfiltered samples are used to form P). In inter mode, P is formed by motion-compensated prediction from one or more reference frame(s). In the figures, the reference frame is shown as the previously encoded frame F'n-1; however, the prediction for each macro block may be formed from one or two past or future frames (in time order) that have already been encoded and reconstructed.[13]

The prediction P is subtracted from the current macro block to produce a residual or difference macro block Dn. This is transformed (using a block transform) and quantized to give X, a set of quantized transform coefficients. These coefficients are re-ordered and entropy encoded. The entropy-encoded coefficients, together with the side information required to decode the macro block, form the compressed bit stream. This is passed to a network abstraction layer (NAL) for transmission or storage.

7.4. Encoder (reconstruction path):


The quantized macro block coefficients X are decoded in order to reconstruct a frame for encoding of further macro blocks. The coefficients X are re-scaled (Q⁻¹) and inverse transformed (T⁻¹) to produce a difference macro block D'n. This is not identical to the original difference macro block Dn; the quantization process introduces losses, and so D'n is a distorted version of Dn.

The prediction macro block P is added to D'n to create a reconstructed macro block uF'n (a distorted version of the original macro block). A filter is applied to reduce the effects of blocking distortion, and a reconstructed reference frame F'n is created from a series of macro blocks.[13]

7.5. Decoder:
The decoder receives a compressed bit stream from the NAL. The data elements are entropy decoded and re-ordered to produce a set of quantized coefficients X. These are re-scaled and inverse transformed to give D'n (identical to the D'n shown in the encoder). Using the header information decoded from the bit stream, the decoder creates a prediction macro block P, identical to the original prediction P formed in the encoder. P is added to D'n to produce uF'n, which is then filtered to create the decoded macro block F'n.

It should be clear from the figures and from the discussion above that the purpose of the reconstruction path in the encoder is to ensure that both encoder and decoder use identical reference frames to create the prediction P. If this is not the case, then the predictions P in the encoder and decoder will not be identical, leading to an increasing error or "drift" between the encoder and decoder. The toy example below makes this concrete.
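
The following C toy (a hypothetical scalar quantizer standing in for the full transform and quantization chain) shows why predicting from reconstructed samples keeps both sides in step:

    #include <stdio.h>

    /* Toy scalar quantizer: the division loses information, so only the
       dequantized value is common to encoder and decoder. */
    static int quantize(int d, int qstep)   { return d / qstep; }
    static int dequantize(int q, int qstep) { return q * qstep; }

    int main(void)
    {
        int cur = 107, pred = 100, qstep = 4;
        int level = quantize(cur - pred, qstep);          /* transmitted */
        int enc_recon = pred + dequantize(level, qstep);  /* encoder's uF'n */
        int dec_recon = pred + dequantize(level, qstep);  /* decoder's uF'n */
        /* Both equal 104, not the original 107: the loss is identical on
           both sides, so the predictions match and no drift accumulates. */
        printf("%d %d\n", enc_recon, dec_recon);
        return 0;
    }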

7.6. Introduction to H.264:


What is H.264?
Well, simply speaking, it is a kind of video format; we all know video formats like MPEG-2, DivX and XviD. H.264 came after them and is a more advanced codec, because it aims at achieving the same video quality as DivX at half the file size.
Then what is H.264/AVC?
AVC is the abbreviated form of Advanced Video Coding. It is actually the same thing as H.264; we can also call it H.264/AVC, H.264/MPEG-4 AVC, or MPEG-4 Part 10.


The latest video compression standard, H.264 (also known as MPEG-4 part 10/AVC
for advanced video coding), is expected to become the video standard of choice in the coming
years.
H.264 is an open, licensed standard that supports the most efficient video compression techniques available today. Without compromising the image quality, an H.264 encoder can reduce the size of a digital video file by more than 80% compared with the Motion JPEG format, and by as much as 50% compared with the MPEG-4 Part 2 standard. This means that much less network bandwidth and storage space are required for a video file. Or, seen another way, much higher video quality can be achieved for a given bit rate.[13]

Jointly defined by standardization organizations in the telecommunications and IT industries, H.264 is expected to be more widely adopted than previous standards.

H.264 has already been introduced in new electronic gadgets such as mobile phones and digital video players, and has gained fast acceptance among end users. Service providers such as online video storage and telecommunication companies are also beginning to adopt H.264.

In the video surveillance industry, H.264 will most likely find the quickest traction in applications where there are demands for high frame rates and high resolution, such as in the surveillance of highways, airports and casinos, where the use of 30/25 (NTSC/PAL) frames per second is the norm. This is where the economies of reduced bandwidth and storage needs will deliver the biggest savings.[13]

H.264 is also expected to accelerate the adoption of megapixel cameras, since the highly efficient compression technology can reduce the large file sizes and bit rates generated without compromising the image quality. There are trade-offs, however: while H.264 provides savings in network bandwidth and storage costs, it requires higher-performance network cameras and monitoring stations.

7.7. Development of H.264:


H.264 is the result of a joint project between the ITU-T's Video Coding Experts Group and the ISO/IEC Moving Picture Experts Group (MPEG). ITU-T is the sector that coordinates telecommunication standards on behalf of the International Telecommunication Union. ISO stands for the International Organization for Standardization, and IEC stands for the International Electrotechnical Commission, which oversees standards for all electrical, electronic and related technologies. H.264 is the name used by ITU-T, while ISO/IEC has named it MPEG-4 Part 10/AVC since it was presented as a new part in its MPEG-4 suite. The MPEG-4 suite includes, for example, MPEG-4 Part 2, which is a standard that has been used by IP-based video encoders and network cameras.[13]
Designed to address several weaknesses in previous video compression standards, H.264 delivers on its goals of supporting:
1. Implementations that deliver an average bit rate reduction of 50%, given a fixed video quality, compared with any other video standard
2. Error robustness so that transmission errors over various networks are tolerated
3. Low latency capabilities and better quality for higher latency
4. Straightforward syntax specification that simplifies implementations
5. Exact-match decoding, which defines exactly how numerical calculations are to be made by an encoder and a decoder to avoid errors from accumulating


H.264 also has the flexibility to support a wide variety of applications with very different bit rate requirements. For example, in entertainment video applications (which include broadcast, satellite, cable and DVD), H.264 will be able to deliver a performance of between 1 and 10 Mbit/s with high latency, while for telecom services, H.264 can deliver bit rates below 1 Mbit/s with low latency.

7.8. How video compression works?


Video compression is about reducing and removing redundant video data so that a digital video file can be effectively sent and stored. The process involves applying an algorithm to the source video to create a compressed file that is ready for transmission or storage. To play the compressed file, an inverse algorithm is applied to produce a video that shows virtually the same content as the original source video. The time it takes to compress, send, decompress and display a file is called latency. The more advanced the compression algorithm, the higher the latency, given the same processing power.[13]


A pair of algorithms that work together is called a video codec (encoder/decoder). Video codecs that implement different standards are normally not compatible with each other; that is, video content that is compressed using one standard cannot be decompressed with a different standard. For instance, an MPEG-4 Part 2 decoder will not work with an H.264 encoder. This is simply because one algorithm cannot correctly decode the output from another algorithm, but it is possible to implement many different algorithms in the same software or hardware, which would then enable multiple formats to be compressed.
Different video compression standards utilize different methods of reducing data, and
hence, results differ in bit rate, quality and latency.
Results from encoders that use the same compression standard may also vary because
the designer of an encoder can choose to implement different sets of tools defined by a
standard. As long as the output of an encoder conforms to a standard's format and decoder, it
is possible to make different implementations. This is advantageous because different
implementations have different goals and budgets. Professional non-real-time software
encoders for mastering optical media should have the option of being able to deliver better
encoded video than a real-time hardware encoder for video conferencing that is integrated in
a hand held device. A given standard, therefore, cannot guarantee a given bit rate or quality.
Furthermore, the performance of a standard cannot be properly compared with other
standards, or even other implementations of the same standard, without first defining how it
is implemented.[13]
A decoder, unlike an encoder, must implement all the required parts of a standard in order to decode a compliant bit stream. This is because a standard specifies exactly how a decompression algorithm should restore every bit of a compressed video.

The graph below provides a bit rate comparison, given the same level of image quality, among the following video standards: Motion JPEG, MPEG-4 Part 2 (no motion compensation), MPEG-4 Part 2 (with motion compensation) and H.264 (baseline profile). It shows that an H.264 encoder generated up to 50% fewer bits per second for a sample video sequence than an MPEG-4 encoder with no motion compensation, and at least six times fewer bits than the Motion JPEG format.


7.9. H.264 Profiles and levels:


H.264 has seven profiles, each targeting a specific class of applications. Each profile defines what feature set the encoder may use and limits the decoder implementation complexity.

Network cameras and video encoders will most likely use a profile called the baseline profile, which is intended primarily for applications with limited computing resources. The baseline profile is the most suitable given the available performance in a real-time encoder that is embedded in a network video product. The profile also enables low latency, which is an important requirement of surveillance video, and is particularly important for enabling real-time pan/tilt/zoom (PTZ) control in PTZ network cameras.

H.264 has eleven levels, or degrees of capability, to limit performance, bandwidth and memory requirements. Each level defines the bit rate and the encoding rate in macro blocks per second for resolutions ranging from QCIF to HDTV and beyond. The higher the resolution, the higher the level required.


7.10. Understanding of frames:


Depending on the H.264 profile, different types of frames such as I-frames, P-frames
and B-frames, may be used by an encoder.
An I-frame, or intra frame, is a self-contained frame that can be independently decoded without any reference to other images. The first image in a video sequence is always an I-frame. I-frames are needed as starting points for new viewers, or as resynchronization points if the transmitted bit stream is damaged. I-frames can be used to implement fast-forward, rewind and other random access functions. An encoder will automatically insert I-frames at regular intervals or on demand if new clients are expected to join in viewing a stream. The drawback of I-frames is that they consume many more bits, but on the other hand, they do not generate many artifacts.

A P-frame, which stands for predictive inter frame, makes references to parts of earlier I- and/or P-frame(s) to code the frame. P-frames usually require fewer bits than I-frames, but a drawback is that they are very sensitive to transmission errors because of their complex dependency on earlier P- and I-reference frames.

A B-frame, or bi-predictive inter frame, is a frame that makes references to both an earlier reference frame and a future frame.

Figure-2. A typical sequence with I, B, and P-frames. A P-frame may only reference
preceding I- or P-frames, while a B-frame may reference both preceding and succeeding I- or
P-frames.


When a video decoder restores a video by decoding the bit stream frame by frame,
decoding must always start with an I-frame. P-frames and B-frames, if used, must be decoded
together with the reference frame(s).
In the H.264 baseline profile, only I- and P-frames are used. This profile is ideal for network cameras and video encoders, since low latency is achieved because B-frames are not used.

7.11. Basic methods of reducing data:


A variety of methods can be used to reduce video data, both within an image frame
and between a series of frames.
Within an image frame, data can be reduced simply by removing unnecessary
information, which will have an impact on the image resolution.
In a series of frames, video data can be reduced by such methods as difference coding, which
is used by most video compression standards including H.264. In difference coding, a frame is compared with a reference frame.

Figure-3. With the Motion JPEG format, the three images in the above sequence are coded and sent as separate unique images (I-frames), with no dependencies on each other.


Figure-4. With difference coding (used in most video compression standards including H.264), only the first image (I-frame) is coded in its entirety. In the two following images (P-frames), references are made to the first picture for the static elements (i.e., the house), while the moving parts (i.e., the running man) are coded using motion vectors, thus reducing the amount of information that is sent and stored.
The amount of encoding can be further reduced if detection and encoding of differences is based on blocks of pixels (macro blocks) rather than on individual pixels; therefore, bigger areas are compared and only blocks that are significantly different are coded. The overhead associated with indicating the location of areas to be changed is also reduced.

Difference coding, however, would not significantly reduce data if there was a lot of motion in a video. Here, techniques such as block-based motion compensation can be used. Block-based motion compensation takes into account that much of what makes up a new frame in a video sequence can be found in an earlier frame, but perhaps in a different location. This technique divides a frame into a series of macro blocks. Block by block, a new frame (for instance, a P-frame) can be composed or predicted by looking for a matching block in a reference frame. If a match is found, the encoder simply codes the position where the matching block is to be found in the reference frame. Coding the motion vector, as it is called, takes up fewer bits than if the actual content of a block were to be coded. A minimal sketch of this copy step follows.
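
As an illustration, the following C sketch (function name and frame layout are our assumptions, not part of the standard) rebuilds one block of a predicted frame from a motion vector:

    /* Minimal block-based motion compensation: copy the block that the
       motion vector (mvx, mvy) points to in the reference frame into the
       predicted frame. Caller must keep the displaced block in bounds. */
    void copy_block(unsigned char *pred, const unsigned char *ref,
                    int stride, int bx, int by, int mvx, int mvy, int bsize)
    {
        for (int y = 0; y < bsize; y++)
            for (int x = 0; x < bsize; x++)
                pred[(by + y) * stride + (bx + x)] =
                    ref[(by + y + mvy) * stride + (bx + x + mvx)];
    }

Transmitting just (mvx, mvy) plus a small residual costs far fewer bits than the 256 samples of a 16×16 block.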


Figure-7.3: Illustration of block-based motion compensation

7.12. Applications of H.264/AVC:


The H.264 video format has a very broad application range that covers all forms of
digital compressed video from low bit-rate Internet streaming applications to HDTV
broadcast and Digital Cinema applications with nearly lossless coding. With the use of H.264,
bit rate savings of 50% or more are reported. For example, H.264 has been reported to give
the same Digital Satellite TV quality as current MPEG-2 implementations with less than half
the bit rate, with current MPEG-2 implementations working at around 3.5 Mbit/s and H.264
at only 1.5 Mbit/s.
To ensure compatibility and problem-free adoption of H.264/AVC, many standards
bodies have amended or added to their video-related standards so that users of these standards
can employ H.264/AVC.
Both the Blu-ray Disc format and the now-discontinued HD DVD format include the H.264/AVC High Profile as one of three mandatory video compression formats. Sony has also chosen this format for their Memory Stick Video format.


The Digital Video Broadcast project (DVB) approved the use of H.264/AVC for
broadcast television in late 2004.
The Advanced Television Systems Committee (ATSC) standards body in the United
States approved the use of H.264/AVC for broadcast television in July 2008, although the
standard is not yet used for fixed ATSC broadcasts within the United States. It has also been
approved for use with the more recent ATSC-M/H (Mobile/Handheld) standard, using the
AVC and SVC portions of H.264.
AVCHD is a high-definition recording format designed by Sony and Panasonic that uses
H.264 (conforming to H.264 while adding additional application-specific features and
constraints).
AVC-Intra is an intraframe-only compression format, developed by Panasonic.
The CCTV (Closed-Circuit TV) or video surveillance market has included the technology in many products. Prior to this technology, the compression formats used within the industry's DVRs (digital video recorders) were generally of low compression capability. With the application of H.264 compression technology to the video surveillance industry, the quality of video recordings improved substantially. Starting in 2008, some in the surveillance industry promoted the H.264 technology as synonymous with "high quality" video.

7.13. Features of H.264/AVC:


H.264/AVC/MPEG-4 Part 10 contains a number of new features that allow it to compress
video much more effectively than older standards and to provide more flexibility for
application to a wide variety of network environments. In particular, some such key features
include:
Multi-picture inter-picture prediction including the following features:
Using previously-encoded pictures as references in a much more flexible way than in past
standards, allowing up to 16 reference frames (or 32 reference fields, in the case of interlaced
encoding) to be used in some cases. This is in contrast to prior standards, where the limit was
typically one; or, in the case of conventional "B pictures", two. This particular feature usually allows modest improvements in bit rate and quality in most scenes. But in certain types of
scenes, such as those with repetitive motion or back-and-forth scene cuts or uncovered
background areas, it allows a significant reduction in bit rate while maintaining clarity.[13]
Variable block-size motion compensation (VBSMC) with block sizes as large as 16×16 and as small as 4×4, enabling precise segmentation of moving regions. The supported luma prediction block sizes include 16×16, 16×8, 8×16, 8×8, 8×4, 4×8, and 4×4, many of which can be used together in a single macro block. Chroma prediction block sizes are correspondingly smaller according to the chroma subsampling in use.
The ability to use multiple motion vectors per macro block (one or two per partition), with a maximum of 32 in the case of a B macro block constructed of 16 4×4 partitions. The motion vectors for each 8×8 or larger partition region can point to different reference pictures.
The ability to use any macro block type in B-frames, including I-macro blocks, resulting in much more efficient encoding when using B-frames; this feature was notably left out of MPEG-4 ASP.[13]
Six-tap filtering for derivation of half-pel luma sample predictions, for sharper sub-pixel motion compensation. Quarter-pixel motion is derived by linear interpolation of the half-pel values, to save processing power.
Quarter-pixel precision for motion compensation, enabling precise description of the displacements of moving areas. For chroma, the resolution is typically halved both vertically and horizontally, so the motion compensation of chroma uses one-eighth-pixel grid units. (Both interpolation steps are sketched below.)
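
A minimal C sketch of these two interpolation steps (the 6-tap weights 1, -5, 20, 20, -5, 1 are the standard ones; the function names are ours):

    /* p[] holds six consecutive integer-pel luma samples centered on the
       half-pel position to be interpolated. */
    static unsigned char clip255(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

    unsigned char half_pel(const unsigned char p[6])
    {
        int v = p[0] - 5*p[1] + 20*p[2] + 20*p[3] - 5*p[4] + p[5];
        return clip255((v + 16) >> 5);            /* round, divide by 32 */
    }

    unsigned char quarter_pel(unsigned char a, unsigned char b)
    {
        return (unsigned char)((a + b + 1) >> 1); /* linear interpolation */
    }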
Weighted prediction, allowing an encoder to specify the use of a scaling and offset when performing motion compensation, and providing a significant benefit in performance in special cases, such as fade-to-black, fade-in, and cross-fade transitions. This includes implicit weighted prediction for B-frames and explicit weighted prediction for P-frames.
Spatial prediction from the edges of neighboring blocks for "intra" coding, rather than the "DC"-only prediction found in MPEG-2 Part 2 and the transform coefficient prediction found in H.263v2 and MPEG-4 Part 2. This includes luma prediction block sizes of 16×16, 8×8, and 4×4 (of which only one type can be used within each macro block).[13]
Lossless macro block coding features including:
A lossless "PCM macro block" representation mode in which video data samples are
represented directly,[16] allowing perfect representation of specific regions and allowing a
strict limit to be placed on the quantity of coded data for each macro block.
An enhanced lossless macro block representation mode allowing perfect
representation of specific regions while ordinarily using substantially fewer bits than the
PCM mode.
Flexible interlaced-scan video coding features, including:
Macro block-adaptive frame-field (MBAFF) coding, using a macro block pair structure for pictures coded as frames, allowing 16×16 macro blocks in field mode (compared with MPEG-2, where field mode processing in a picture that is coded as a frame results in the processing of 16×8 half-macro blocks).
Picture-adaptive frame-field coding (PAFF or PicAFF) allowing a freely-selected
mixture of pictures coded either as progressive frames where both fields are combined or as
individual single fields.
New transform design features, including:
An exact-match integer 4×4 spatial block transform, allowing precise placement of residual signals with little of the "ringing" often found with prior codec designs. This is conceptually similar to the well-known DCT design, but simplified and made to provide exactly-specified decoding.
An exact-match integer 8×8 spatial block transform, allowing highly correlated regions to be compressed more efficiently than with the 4×4 transform. This is conceptually similar to the well-known DCT design, but simplified and made to provide exactly-specified decoding. (The 4×4 case is sketched below.)
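
A C sketch of the 4×4 forward core transform (the per-coefficient scaling that H.264 folds into quantization is omitted; adds and shifts only, so encoder and decoder can match bit-exactly):

    /* Apply the core matrix [1 1 1 1; 2 1 -1 -2; 1 -1 -1 1; 1 -2 2 -1]
       to rows, then to columns, using butterfly additions. */
    void forward4x4(const int in[4][4], int out[4][4])
    {
        int tmp[4][4];
        for (int i = 0; i < 4; i++) {          /* rows */
            int s03 = in[i][0] + in[i][3], d03 = in[i][0] - in[i][3];
            int s12 = in[i][1] + in[i][2], d12 = in[i][1] - in[i][2];
            tmp[i][0] = s03 + s12;
            tmp[i][1] = 2 * d03 + d12;
            tmp[i][2] = s03 - s12;
            tmp[i][3] = d03 - 2 * d12;
        }
        for (int j = 0; j < 4; j++) {          /* columns */
            int s03 = tmp[0][j] + tmp[3][j], d03 = tmp[0][j] - tmp[3][j];
            int s12 = tmp[1][j] + tmp[2][j], d12 = tmp[1][j] - tmp[2][j];
            out[0][j] = s03 + s12;
            out[1][j] = 2 * d03 + d12;
            out[2][j] = s03 - s12;
            out[3][j] = d03 - 2 * d12;
        }
    }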


Adaptive encoder selection between the 4×4 and 8×8 transform block sizes for the integer transform operation.
A secondary Hadamard transform performed on "DC" coefficients of the primary spatial transform applied to chroma DC coefficients (and also luma in one special case) to obtain even more compression in smooth regions.
A quantization design including:
Logarithmic step size control for easier bit rate management by encoders and simplified inverse-quantization scaling (illustrated below)
Frequency-customized quantization scaling matrices selected by the encoder for perceptual-based quantization optimization
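
A small C illustration of the logarithmic step-size control: the quantizer step roughly doubles for every increase of 6 in QP, so 52 integer QP values cover a very wide bit-rate range (the base values below are the commonly tabulated ones; treat the exact numbers as illustrative):

    #include <stdio.h>

    static const double qstep_base[6] = {0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125};

    double qstep(int qp)                 /* qp in 0..51 */
    {
        return qstep_base[qp % 6] * (double)(1 << (qp / 6));
    }

    int main(void)
    {
        for (int qp = 0; qp <= 51; qp += 6)
            printf("QP=%2d Qstep=%g\n", qp, qstep(qp)); /* 0.625 ... 160 */
        return 0;
    }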
An in-loop deblocking filter that helps prevent the blocking artifacts common to other
DCT-based image compression techniques, resulting in better visual appearance and
compression efficiency
An entropy coding design including:
Context-adaptive binary arithmetic coding (CABAC), an algorithm to losslessly
compress syntax elements in the video stream knowing the probabilities of syntax elements in
a given context. CABAC compresses data more efficiently than CAVLC but requires
considerably more processing to decode.
Context-adaptive variable-length coding (CAVLC), which is a lower-complexity
alternative to CABAC for the coding of quantized transform coefficient values. Although
lower complexity than CABAC, CAVLC is more elaborate and more efficient than the
methods typically used to code coefficients in other prior designs.
A common simple and highly structured variable length coding (VLC) technique for many of the syntax elements not coded by CABAC or CAVLC, referred to as Exponential-Golomb coding (or Exp-Golomb); a sketch of this code follows.
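
A minimal C sketch of unsigned Exp-Golomb encoding (bits are printed as characters for clarity; e.g. 0 -> "1", 1 -> "010", 3 -> "00100"):

    #include <stdio.h>

    /* Write the binary form of (v + 1) preceded by a number of leading
       zeros equal to its bit length minus one. */
    void exp_golomb(unsigned v)
    {
        unsigned code = v + 1;
        int len = 0;
        for (unsigned t = code; t > 0; t >>= 1)  /* bit length of code */
            len++;
        for (int i = 0; i < len - 1; i++)        /* len-1 leading zeros */
            putchar('0');
        for (int i = len - 1; i >= 0; i--)       /* code, MSB first */
            putchar((code >> i) & 1 ? '1' : '0');
        putchar('\n');
    }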
Loss resilience features including:


A Network Abstraction Layer (NAL) definition allowing the same video syntax to be
used in many network environments. One very fundamental design concept of H.264 is to generate self-contained packets, to remove the header duplication as in MPEG-4's Header Extension Code (HEC). This was achieved by decoupling information relevant to more than
one slice from the media stream. The combination of the higher-level parameters is called a
parameter set. The H.264 specification includes two types of parameter sets: Sequence
Parameter Set (SPS) and Picture Parameter Set (PPS). An active sequence parameter set
remains unchanged throughout a coded video sequence, and an active picture parameter set
remains unchanged within a coded picture. The sequence and picture parameter set structures
contain information such as picture size, optional coding modes employed, and macroblock
to slice group map.
Flexible macro block ordering (FMO), also known as slice groups, and arbitrary slice
ordering (ASO), which are techniques for restructuring the ordering of the representation of
the fundamental regions (macro blocks) in pictures. Typically considered an error/loss
robustness feature, FMO and ASO can also be used for other purposes.
Data partitioning (DP), a feature providing the ability to separate more important and
less important syntax elements into different packets of data, enabling the application of
unequal error protection (UEP) and other types of improvement of error/loss robustness.
Redundant slices (RS), an error/loss robustness feature allowing an encoder to send an
extra representation of a picture region (typically at lower fidelity) that can be used if the
primary representation is corrupted or lost.
Frame numbering, a feature that allows the creation of "sub-sequences", enabling
temporal scalability by optional inclusion of extra pictures between other pictures, and the
detection and concealment of losses of entire pictures, which can occur due to network packet
losses or channel errors.
Switching slices, called SP and SI slices, allowing an encoder to direct a decoder to jump
into an ongoing video stream for such purposes as video streaming bit rate switching and
"trick mode" operation. When a decoder jumps into the middle of a video stream using the
SP/SI feature, it can get an exact match to the decoded pictures at that location in the video
stream despite using different pictures, or no pictures at all, as references prior to the switch.


A simple automatic process for preventing the accidental emulation of start codes, which
are special sequences of bits in the coded data that allow random access into the bit stream
and recovery of byte alignment in systems that can lose byte synchronization.
Supplemental enhancement information (SEI) and video usability information (VUI),
which are extra information that can be inserted into the bit stream to enhance the use of the
video for a wide variety of purposes.
Auxiliary pictures, which can be used for such purposes as alpha compositing.
Support of monochrome, 4:2:0, 4:2:2, and 4:4:4 chroma sub sampling (depending on the
selected profile).
Support of sample bit depth precision ranging from 8 to 14 bits per sample (depending on
the selected profile).
The ability to encode individual color planes as distinct pictures with their own slice
structures, macro block modes, motion vectors, etc., allowing encoders to be designed with a
simple parallelization structure (supported only in the three 4:4:4-capable profiles).
Picture order count, a feature that serves to keep the ordering of the pictures and the
values of samples in the decoded pictures isolated from timing information, allowing timing
information to be carried and controlled/changed separately by a system without affecting
decoded picture content.
These techniques, along with several others, help H.264 to perform significantly better
than any prior standard under a wide variety of circumstances in a wide variety of application
environments. H.264 can often perform radically better than MPEG-2 video, typically obtaining the same quality at half of the bit rate or less, especially in high-bit-rate and high-resolution situations.
Like other ISO/IEC MPEG video standards, H.264/AVC has a reference software implementation that can be freely downloaded. Its main purpose is to give examples of H.264/AVC features, rather than being a useful application per se. Some reference hardware design work is also under way in the Moving Picture Experts Group. The features mentioned above are the complete features of H.264/AVC, covering all profiles of H.264. A profile for a codec is a set of features of that codec identified to meet a certain set of specifications of intended applications. This means that many of the features listed are not supported in some profiles. Various profiles of H.264/AVC are discussed in the next section.[13]

7.14. Advantages of H.264 over MPEG-4:


As we know, H.264 is better than MPEG-4 because H.264 can compress the video into smaller block sizes while maintaining the same video quality, compared with MPEG-4.
1. Performance at low bit rates, especially on networks. For example, on a network with a low bandwidth of 256 Kbps, if you set an H.264 stream to 385 Kbps, the display is still smooth; but if you set an MPEG-4 stream to 385 Kbps, the display will not be as smooth as the H.264 stream.
2. Storage capacity: This is the most obvious benefit. You can set H.264 to a lower bit rate, and the two displays of the H.264 and MPEG-4 streams will still have the same quality. Since a lower bit rate suffices for H.264, the storage required for H.264 will be smaller compared to MPEG-4 streaming.

8. FPGA
Definitions of FPGA on the Web:
1. A field-programmable gate array (FPGA) is an integrated circuit designed to be configured by the customer or designer after manufacturing, hence "field-programmable".

...

2. This device is similar to the gate array, defined above, with the device shipped to the user with general-purpose metallization pre-fabricated, often with variable-length segments.
3. Field-programmable gate array. A chip composed of an array of configurable logic
cells (also called logic blocks). Each cell can be configured, or programmed, to
perform one of a variety of simple functions, such as computing the logical AND of
two inputs. ...

8.1. What is an FPGA?


Before the advent of programmable logic, custom logic circuits were built at the board
level using standard components, or at the gate level in expensive application-specific
(custom) integrated circuits. The FPGA is an integrated circuit that contains many (64 to
over 10,000) identical logic cells that can be viewed as standard components. Each logic cell
can independently take on any one of a limited set of personalities. The individual cells are
interconnected by a matrix of wires and programmable switches. A user's design is
implemented by specifying the simple logic function for each cell and selectively closing the
switches in the interconnect matrix. The array of logic cells and interconnects form a fabric
of basic building blocks for logic circuits. Complex designs are created by combining these
basic blocks to create the desired circuit.

LOW-POWER H.264 ARCHITECTURES FOR MOBILE COMMUNICATION

Fig 8.1: SPARTAN-3E development board

8.2. What does a logic cell do?


The logic cell architecture varies between different device families. Generally speaking, each logic cell combines a few binary inputs (typically between 3 and 10) to one or two outputs according to a Boolean logic function specified in the user program. In most families, the user also has the option of registering the combinatorial output of the cell, so that clocked logic can be easily implemented. The cell's combinatorial logic may be physically implemented as a small look-up table memory (LUT) or as a set of multiplexers and gates. LUT devices tend to be a bit more flexible and provide more inputs per cell than multiplexer cells at the expense of propagation delay. A small software model of a LUT follows.
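
To make the LUT idea concrete, here is a C model of a 4-input logic cell (the type name and the example constant are ours, not a vendor API):

    /* The 16-bit truth table holds one output bit per input combination,
       so any Boolean function of 4 inputs can be "programmed" simply by
       changing the constant. */
    typedef unsigned short lut4_t;                /* 16 configuration bits */

    int lut4(lut4_t truth_table, int a, int b, int c, int d)
    {
        int index = (d << 3) | (c << 2) | (b << 1) | a;  /* 0..15 */
        return (truth_table >> index) & 1;
    }

    /* Example: truth_table = 0x8000 configures the cell as a 4-input AND
       gate (output is 1 only for index 15, i.e. a=b=c=d=1). */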

8.3. So what does 'Field Programmable' mean?


Field Programmable means that the FPGA's function is defined by a user's program
rather than by the manufacturer of the device. A typical integrated circuit performs a particular function defined at the time of manufacture. In contrast, the FPGA's function is
defined by a program written by someone other than the device manufacturer. Depending on
the particular device, the program is either 'burned' in permanently or semi-permanently as
part of a board assembly process, or is loaded from an external memory each time the device
is powered up. This user programmability gives the user access to complex integrated
designs without the high engineering costs associated with application specific integrated
circuits.

8.4. How are FPGA programs created?


Individually defining the many switch connections and cell logic functions would be a
daunting task. Fortunately, this task is handled by special software. The software translates a
user's schematic diagrams or textual hardware description language code then places and
routes the translated design. Most of the software packages have hooks to allow the user to
influence implementation, placement and routing to obtain better performance and utilization
of the device. Libraries of more complex function macros (eg. adders) further simplify the
design process by providing common circuits that are already optimized for speed or area.

8.5. FPGA Spartan-3E:


Designers of gate-centric solutions face a problem: increasing design functionality while also minimizing device costs. This has often meant sacrificing either features or cost effectiveness.

The Spartan-3E FPGA family offers the low cost and platform features you are looking for, making it ideal for gate-centric programmable logic designs. Spartan-3E is the seventh in the groundbreaking low-cost Spartan series and the third Xilinx family manufactured with advanced 90nm process technology. Spartan-3E FPGAs deliver up to 1.6 million gates, up to 376 I/Os, and a versatile platform FPGA architecture with the lowest cost per logic in the industry.

This combination of state-of-the-art low-cost manufacturing and a cost-efficient architecture provides unprecedented price points and value. The features and capabilities of the Spartan-3E family are optimized for high-volume, low-cost applications, and the Xilinx supply chain is ready to fulfill your production requirements.


8.6. Features of FPGA Spartan-3E:


Spartan-3E low cost features: the Spartan-3E reduces system cost by offering the lowest cost per logic of any FPGA family, supporting the lowest-cost configuration solutions including commodity serial and parallel flash memories, and efficiently integrating the functions of many chips into a single FPGA.
Advanced low cost features:
1. Five devices with 100K to 1.6M system gates
2. From 66 to 376 I/Os with package and density migration
3. Up to 648 Kbits of block RAM and up to 231 Kbits of distributed RAM
4. Up to 36 embedded 18×18 multipliers for high-performance DSP applications
5. Up to eight digital clock managers

Cost saving system interfaces and solutions:
1. Support for Xilinx Platform Flash as well as commodity serial (SPI) and byte-wide flash memory for configuration
2. Easy-to-implement interfaces to DDR memory
3. Support for 18 common I/O standards including PCI 33/66, PCI-X, mini-LVDS and RSDS
Industry-leading design tools and IP:
1. ISE design tools to shorten design and verification
2. Hundreds of pre-verified, pre-optimized intellectual property (IP) cores and reference designs
3. ChipScope Pro system debugging environment
Easy to use, low cost FPGA development system:
1. Complete Spartan-3E starter kit available for only $149 USD
2. Includes XC3S500E FPGA, SPI flash, 32 MB DDR memory, and support for USB2.0

8.7. Xilinx Spartan-3E 1200K Gates:


IC: Xilinx Spartan-3E FPGA, 1200K gates. Connectors: USB2 port, Hirose FX2, four 12-pin Pmod connectors, VGA, PS/2, and serial. Programming: Digilent USB2 port providing board power, programming and data transfers.

Fig 8.2 : Xilinx Spartan 3E FPGA (Nexys-2)

The Nexys-2 is a powerful digital system design platform built around a Xilinx Spartan-3E FPGA. With 16 Mbytes of fast SDRAM and 16 Mbytes of Flash ROM, the Nexys-2 is ideally suited to embedded processors like Xilinx's 32-bit RISC MicroBlaze. The on-board high-speed USB2 port, together with a collection of I/O devices, data ports, and expansion connectors, allows a wide range of designs to be completed without the need for any additional components.

Features:

1. Xilinx Spartan-3E FPGA, 500K or 1200K gates
2. USB2 port providing board power, device configuration, and high-speed data transfers
3. Works with ISE/WebPack and EDK
4. 16MB fast Micron PSDRAM
5. 16MB Intel StrataFlash ROM
6. Xilinx Platform Flash ROM
7. High-efficiency switching power supplies (good for battery-powered applications)
8. 50MHz oscillator, plus a socket for a second oscillator
9. 75 FPGA I/Os routed to expansion connectors (one high-speed Hirose FX2 connector with 43 signals and four 2x6 Pmod connectors)
10. All I/O signals are ESD and short-circuit protected, ensuring a long operating life in any environment
11. On-board I/O includes eight LEDs, a four-digit seven-segment display, four pushbuttons, and eight slide switches
12. Ships in a DVD case with a high-speed USB2 cable

Specifications:


Supply voltage levels necessary for preserving the RAM contents:

Symbol    Description                                   Min    Units
VDRINT    VCCINT level required to retain RAM data      1.0    V
VDRAUX    VCCAUX level required to retain RAM data      2.0    V

Absolute maximum ratings:

Symbol   Description                        Conditions                       Min     Max            Units
VCCINT   Internal supply voltage                                             -0.5    1.32           V
VCCAUX   Auxiliary supply voltage                                            -0.5    3.00           V
VCCO     Output driver supply voltage                                        -0.5    3.75           V
VREF     Input reference voltage                                             -0.5    VCCO + 0.5     V
VIN      Voltage applied to all user I/O    Driver in a high-impedance
         pins and dual-purpose pins         state: Commercial                -0.95   4.4            V
                                            Industrial                       -0.85   4.3            V
         Voltage applied to all             All temp. ranges                 -0.5    VCCAUX + 0.5   V
         dedicated pins
IIK      Input clamp current per I/O pin    -0.5 V < VIN < (VCCO + 0.5 V)    --      100            mA
VESD     Electrostatic discharge voltage    Human body model                 --      2000           V
                                            Charged device model             --      500            V
                                            Machine model                    --      200            V
TJ       Junction temperature                                                --      125            °C
TSTG     Storage temperature                                                 -65     150            °C

Power supply specifications:

Supply voltage thresholds for power-on reset:

Symbol     Description                             Min    Max    Units
VCCINT     Threshold for the VCCINT supply         0.4    1.0    V
VCCAUXT    Threshold for the VCCAUX supply         0.8    2.0    V
VCCO2T     Threshold for the VCCO Bank 2 supply    0.4    1.0    V

9. SCREEN SHOTS OF VHDL CODE SIMULATION

Steps to simulate and dump the program into an FPGA Spartan-3E Kit:
[1] Open the Xilinx 9.2 software icon on the desktop.
[2] Click on the FILE option on the tool bar and select a new project. (An empty window will be opened.)
[3] Write the program and save it.
[4] Click on Synthesize-XST and check the syntax.
[5] After checking the syntax successfully, write the test bench program.
[6] For the test bench waveform, open the test bench program, check the syntax and simulate the behavioral model. (The timing diagram will appear.)
[7] Now connect the FPGA Spartan-3E kit to the system.
[8] Click on user constraints and assign package pins.
[9] Next, generate the post-synthesis simulation model and generate the programming file.
[10] Click on the DIGILENT software icon, then Adept, then Export.
[11] Here, assign the program to the programming chain and initialize it.
[12] Then the program will be dumped into the FPGA Spartan-3E kit, and the output will be displayed as an indication on the LEDs.

FLOW CHARTS
Flow chart for SAD Flow:
1. Buffer the current frame
2. Reconstruct the previous frame
3. Fetch pixels from the current frame and the previous frame
4. Process the pixels (subtraction)
5. Take the absolute values of the differences
6. Accumulate the absolute differences
7. Compare for the minimum SAD
8. Decide MV_ROW, MV_COL

(A C model of this flow follows.)
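
This sketch assumes a 16×16 block, an in-bounds search range, and that cur and prev point at the co-located block origins (all names are ours):

    #include <limits.h>
    #include <stdlib.h>

    void sad_flow(const unsigned char *cur, const unsigned char *prev,
                  int stride, int range, int *mv_row, int *mv_col)
    {
        int best = INT_MAX;
        for (int r = -range; r <= range; r++) {
            for (int c = -range; c <= range; c++) {
                int sad = 0;                       /* accumulate |diff| */
                for (int y = 0; y < 16; y++)
                    for (int x = 0; x < 16; x++)
                        sad += abs(cur[y * stride + x] -
                                   prev[(y + r) * stride + (x + c)]);
                if (sad < best) {                  /* keep minimum SAD */
                    best = sad;
                    *mv_row = r;
                    *mv_col = c;
                }
            }
        }
    }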

Flow chart for DPC Flow:

1. Buffer the current frame
2. Reconstruct the previous frame
3. Fetch pixels from the current frame and the previous frame
4. Truncate if necessary
5. Process the pixels (XOR operation)
6. Accumulate the differences
7. Compare for the minimum DPC
8. Decide MV_ROW, MV_COL

(A C model of this flow follows.)
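
One plausible reading of the XOR step is to compare truncated pixels and count the mismatches; TRUNC_BITS and all names are assumptions of this sketch:

    #include <limits.h>

    #define TRUNC_BITS 4   /* keep the 4 MSBs of each 8-bit pixel */

    void dpc_flow(const unsigned char *cur, const unsigned char *prev,
                  int stride, int range, int *mv_row, int *mv_col)
    {
        int best = INT_MAX;
        for (int r = -range; r <= range; r++) {
            for (int c = -range; c <= range; c++) {
                int dpc = 0;
                for (int y = 0; y < 16; y++)
                    for (int x = 0; x < 16; x++) {
                        unsigned a = cur[y * stride + x] >> (8 - TRUNC_BITS);
                        unsigned b = prev[(y + r) * stride + (x + c)] >> (8 - TRUNC_BITS);
                        if (a ^ b)   /* truncated pixels disagree */
                            dpc++;
                    }
                if (dpc < best) {    /* keep minimum difference count */
                    best = dpc;
                    *mv_row = r;
                    *mv_col = c;
                }
            }
        }
    }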

[1] Click on Xilinx, click on the FILE option, create a new project (write the program), save it, then click on Synthesize-XST and check the syntax.


[2] For behavioral simulation, open the test bench program, save it, check the syntax with the Xilinx ISE simulator, and simulate the behavioral model (then we can get the timing diagram).


[3] After simulating the behavioral model, we can get the timing diagram as shown below.


[4] Now generate the programming file and check it out.


[5] Input and output pin configuration of the FPGA Spartan-3E kit.


[6] After connecting the hardware to the CPU, first we have to initialize the chain.


[7] After initialization, add the source file to the chain.


[8] Now run the programming chain; the program will be dumped into the FPGA Spartan-3E kit, and the output will be displayed as an indication on the LEDs.


10. CONCLUSION


This paper has presented a method to reduce the computational cost and memory access for VBSME using pixel truncation. VBSME is a new coding technique and provides more accurate predictions compared to traditional fixed block size motion estimation (FBSME). With FBSME, if an MB consists of two objects with different motion directions, the coding performance of this MB is worse. On the other hand, for the same condition, with VBSME the MB can be divided into smaller blocks in order to fit the different motion directions. Hence the coding performance is improved.

However, for motion prediction using smaller block sizes, pixel truncation reduces the motion prediction accuracy. So here we have proposed a two-step search to improve the frame prediction using pixel truncation. Our method reduces the total computation and memory access compared to the conventional method without significantly degrading the picture quality. The results theoretically show that the proposed architectures are able to save up to 53% energy compared to the conventional full-search ME architecture, which is equivalent to a 40% energy saving over the conventional H.264 system. This makes such architectures attractive for H.264 applications in future mobile devices.

11. REFERENCES

[1] Advanced Video Coding for Generic Audiovisual Services, ITU-T Recommendation H.264 & ISO/IEC 14496-10 (MPEG-4) AVC, 2005.
[2] C.-Y. Chen, S.-Y. Chien, Y.-W. Huang, T.-C. Chen, T.-C. Wang, and L.-G. Chen, "Analysis and architecture design of variable block size motion estimation for H.264/AVC," IEEE Trans. Circuits Syst.
[3] A. Bahari, T. Arslan, and A. T. Erdogan, "Low computation and memory access for variable block size motion estimation using pixel truncation," in Proc. IEEE Workshop Signal Process. Syst., 2007, pp. 681-685.
[4] Z.-L. He, C.-Y. Tsui, K.-K. Chan, and M. L. Liou, "Low-power VLSI design for motion estimation using adaptive pixel truncation," IEEE Trans. Circuits Syst. Video Technol., vol. 10, no. 5, pp. 669-678, Aug. 2000.
[5] A. Bahari, T. Arslan, and A. T. Erdogan, "Low power hardware architecture for VBSME using pixel truncation," in Proc. IEEE Workshop Signal Process. Syst., 2007, pp. 681-685.
[6] B. Natarajan, V. Bhaskaran, and K. Konstantinides, "Low complexity block-based motion estimation via one-bit transform," IEEE Trans. Circuits Syst. Video Technol., vol. 7, no. 4, pp. 702-706.
[7] A. Erturk and S. Erturk, "Two-bit transform for binary block motion estimation," IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 7, pp. 938-946, Jul. 2005.
[8] S. Lee, J. M. Kim, and S.-I. Chae, "New motion estimation algorithm using adaptively quantized low bit-resolution image and its VLSI architecture for MPEG-2 video encoding," IEEE Trans. Circuits Syst. Video Technol., vol. 8, no. 6, pp. 734-744, Oct. 1998.
[9] Y. Chan and S. Kung, "Multi-level pixel difference classification methods," in Proc. IEEE Int. Conf. Image Process., vol. 3, Washington, D.C., 1995, pp. 252-255.
[10] V. G. Moshnyaga, "MSB truncation scheme for low-power video processors," in Proc. IEEE Int. Symp. Circuits Syst., vol. 4, Orlando, FL, 1999, pp. 291-294.
[11] M.-J. Chen, L.-G. Chen, T.-D. Chiueh, and Y.-P. Lee, "A new block matching criterion for motion estimation and its implementation," IEEE Trans. Circuits Syst. Video Technol., vol. 5, no. 3, pp. 231-236.
[12] S. Sharma, P. Mishra, S. Sawant, C. P. Mammen, and V. M. Gadre, "Pre-decision strategy for coded/non-coded MBs in MPEG-4," in Proc. Int. Conf. Signal Process. Commun. (SPCOM), Bangalore, India, 2004, pp. 501-505.
[13] T.-C. Chen, Y.-H. Chen, S.-F. Tsai, S.-Y. Chien, and L.-G. Chen, "Fast algorithm and architecture design of low-power integer motion estimation for H.264/AVC," IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 5, pp. 568-577, May 2007.
[14] M. Miyama, J. Miyakoshi, Y. Kuroda, K. Imamura, H. Hashimoto, and M. Yoshimoto, "A sub-mW MPEG-4 motion estimation processor core for mobile video application," IEEE Trans. Circuits Syst., vol. 39, no. 9, pp. 1562-1570, Sep. 2004.
[15] Video Coding for Low Bit Rate Communication, ITU-T Recommendation H.263, Feb. 1998.
[16] Information Technology -- Coding of Audio-Visual Objects -- Part 2: Visual, ISO/IEC 14496-2, 1999.
[17] S. Srinivasan, J. Hsu, T. Holcomb, K. Mukerjee, S. L. Regunathan, B. Lin, J. Liang, M.-C. Lee, and J. Ribas-Corbera, "Windows Media Video 9: Overview and applications," Signal Process.: Image Commun., vol. 19, pp. 851-875, Sep. 2004.
[18] Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification, May 2003.
[19] K. M. Yang, M. T. Sun, and L. Wu, "A family of VLSI designs for the motion compensation block-matching algorithm," IEEE Trans. Circuits Syst., vol. 36, no. 10, pp. 1317-1325, Oct. 1989.
[20] T. Komarek and P. Pirsch, "Array architectures for block matching algorithms," IEEE Trans. Circuits Syst., vol. 36, no. 10, pp. 1301-1308, Oct. 1989.
[21] Y. Nakabo and M. Ishikawa, "High speed target tracking using 1 ms visual feedback system," video proc., Int. Conf. Robotics and Automation.
[22] I. Ishii, Y. Nakabo, and M. Ishikawa, "Target tracking algorithm for 1 ms visual feedback system using massively parallel processing," in Proc. Int. Conf. Robotics and Automation, pp. 2309-2314.
[23] M. Ishikawa, A. Morita, and N. Takayanagi, "High speed vision system using massively parallel processing," in Proc. Int. Conf. Intelligent Robots and Systems, pp. 373-377, 1992.
