
2K6EC 705(F): Data Compression

Handout

1 Video Signal Representation


1.1 Continuous-time
There are mainly three standards, namely NTSC (American), PAL (European) and SECAM (French). The features of
NTSC are

• There are 525 lines per frame, and 30 frames per second. More lines per frame and more frames per
second require more bandwidth to transmit the video signal; the figure of 525 lines was arrived at considering
this bandwidth constraint. However, refreshing 525 lines 30 times per second is not sufficient to avoid flicker. So the 525 lines are
displayed in two phases called "interlaced fields": the first 262.5 lines in one field (the set of odd-numbered lines)
and the next 262.5 lines in the other field (the set of even-numbered lines). The former ends with a half-line, and the
latter starts with a half-line. This mechanism of interlaced fields results in a rate of 60 fields per second.

• Some lines are lost during display because of the delay incurred while the electron gun (or whatever mechanism
projects the video signal onto a luminous material) retraces to its starting position after a full scan of a
frame. After accounting for this delay, we are left with an effective 486 active lines per frame.

• For a color signal, we should have three independent components for the video signal, i.e., red, blue
and green signals. However, for backward compatibility, the same signal should be displayable on a device that
supports only black-and-white display. To achieve this, a composite video signal format was devised.
According to the NTSC standard, the composite video signal has 3 components.

1. A luminance component (can contain high frequencies)

Y = 0.299R + 0.587G + 0.114B (1)

2. Two chrominance components (reasonably low pass signals)

Cb = B − Y (2)
Cr = R − Y (3)
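
As a quick illustration, Eqs. (1)-(3) can be computed as below. This is a minimal sketch in Python, assuming R, G and B are already normalized to [0, 1]; the function name is ours, not part of any standard.

def rgb_to_ycbcr(r, g, b):
    """Compute the luminance and chrominance components of
    Eqs. (1)-(3) from R, G, B values normalized to [0, 1]."""
    y = 0.299 * r + 0.587 * g + 0.114 * b  # Eq. (1): luminance
    cb = b - y                             # Eq. (2): chrominance
    cr = r - y                             # Eq. (3): chrominance
    return y, cb, cr

# A pure red pixel carries little luminance but a large Cr component.
print(rgb_to_ycbcr(1.0, 0.0, 0.0))  # (0.299, -0.299, 0.701)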

1.2 Discrete-time
There are many standards for discrete-time video signals. One of the first attempts to standardise the discrete-time
representation of video signals (irrespective of NTSC, PAL, etc.) was by the CCIR; this standard is BT 601-2 and is popularly
known as CCIR 601. The main features of the standard are

• The base sampling frequency is 3.375 MHz (3.375 × 10^6 samples per second). Integer multiples of the base
sampling frequency, up to a maximum of 4 times, are used to sample the Y, Cb and Cr components. Usually a 4:2:2
sampling format is used (meaning Y, Cb and Cr are sampled at 4 × 3.375 MHz, 2 × 3.375 MHz and 2 × 3.375 MHz respectively). Sampling
at 4 × 3.375 MHz = 13.5 MHz results in

Samples per line = (3.375 × 10^6 × 4) / (486 × 30) ≈ 925    (4)

If you account for the actual (active) video only (refer to textbooks on TV), this reduces to 720 samples per line. That means
360 samples per line for the Cb and Cr components.
Figure 1: Two subsequent frames (Fig 18.1)

• The sampled values Ys, Cbs and Crs of Y, Cb and Cr are normalized so that Ys takes a value between 0 and 1,
whereas Cbs and Crs lie in the range −0.5 to 0.5. These are then scaled and shifted to get the following discrete-time
signal triple (Y, U, V) (see the sketch after this list):

Y = 219Ys + 16 (5)
U = 224Cbs + 128 (6)
V = 224Crs + 128 (7)

Thus, Y ∈ [16, 235]; U, V ∈ [16, 240]

• If you convert an NTSC continuous-time signal to the CCIR 601 format, it results in a 720 × 480 frame for the Y component,
and 360 × 480 frames for the U and V components.
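
The scaling and shifting of Eqs. (5)-(7) can likewise be sketched as below, assuming normalized inputs Ys in [0, 1] and Cbs, Crs in [−0.5, 0.5]; the function name is ours.

def to_digital_levels(ys, cbs, crs):
    """Scale and shift normalized samples into the CCIR 601
    digital ranges: Y in [16, 235], U and V in [16, 240]."""
    y = 219 * ys + 16     # Eq. (5)
    u = 224 * cbs + 128   # Eq. (6)
    v = 224 * crs + 128   # Eq. (7)
    return y, u, v

# Extremes of the normalized ranges map to the stated limits.
print(to_digital_levels(0.0, -0.5, -0.5))  # (16.0, 16.0, 16.0)
print(to_digital_levels(1.0, 0.5, 0.5))    # (235.0, 240.0, 240.0)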

Other standards for video signal representation are the Common Intermediate Format (CIF) and Quarter CIF (QCIF). MPEG
algorithms use a subsampled version of the CCIR 601 format known as the MPEG-SIF format. (Refer Fig. 18.8 of [1])

2 Principle of Video Compression


Why can't we simply apply image compression to each frame of the video and thus achieve video compression? There
is a catch. Consider an image compression strategy that introduces a random change in the average intensity of the
pixels of the image, and suppose two subsequent frames are subjected to this compression. This may introduce annoying
artefacts for a viewer who is observing a moving object contained in both of these frames. This demands a different
strategy for video compression, known as Motion Compensation.

2.1 Motion Compensation


Typically there is little change in content between subsequent frames, so predictive techniques are well suited.
However, naive predictive techniques will fail, owing to the very nature of subsequent frames of a video.
Example 1: Refer Fig. 18.1 in [1]. The difference between the two images is that the face is shifted to the right,
whereas the triangle is shifted to the left. If the absolute difference is taken, it results in an image like the one in Fig.
18.2 in [1]. Ironically, this contains more information than both of the original images. So blindly encoding the difference
is not an appropriate compression technique for video.
We can, however, observe an interesting pattern. To identify it, first note that the similarity between
subsequent frames in a video is by virtue of moving objects. So if an object in one frame produces an intensity
level I at location (i0, j0), then in the next frame the same intensity level I will appear at a different
location (i1, j1). The differential/predictive encoding strategy should be based on this principle; it is called Motion
Compensation.
Figure 2: Absolute difference between the frames (Fig 18.2)

Figure 3: Motion-Compensated Prediction (Fig 18.3)

• In block-based motion compensation, the frame is split into M × M blocks. When a block in a frame is encoded,
we search the previous frame for a block of the same size that "closely matches" the block currently being encoded.

• The closeness of a match is measured by the sum of absolute differences between intensity levels of corresponding
pixels.

• If even the best match differs from the current block by more than a specified threshold, the block is encoded without the benefit of prediction.

• If a match within the threshold is found, the block is encoded with a "motion vector". If (a, b) is the location
(of the upper-left corner pixel) of the block being encoded and (c, d) the location of the closest-matching
block of the previous frame, then the motion vector is defined as (c − a, d − b).

Example 2: Suppose a frame is split into 8 × 8 blocks. The block being encoded spans (24, 40) to (31, 47), and it is most
closely matched by the block of pixels spanning (21, 43) to (28, 50). Then the motion vector is (21 − 24, 43 − 40) = (−3, 3).
Refer Fig. 18.3 of [1].
The number of computations required for motion compensation is quite large.
Example 3: If the block size is 8 × 8, each block comparison requires 64 difference computations. Now suppose
a matching block is to be found within a 20 × 20-pixel square search space; how many block comparisons do we have to make?
Assume our block sits at the centre of this square. That leaves 6 pixels of slack on each of the four sides, so the
upper-left corner of a candidate block can be displaced by up to ±6 pixels horizontally and vertically.
The total number of block comparisons therefore turns out to be 13 × 13 = 169.
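
A minimal sketch of such an exhaustive search is given below, assuming 8 × 8 blocks, NumPy frames, and the ±6-pixel corner displacement of Example 3; all names are ours.

import numpy as np

def best_match(prev, cur, top, left, block=8, radius=6):
    """Exhaustive block matching: find the block of `prev` with the
    smallest sum of absolute differences (SAD) to the block x block
    region of `cur` whose upper-left corner is at (top, left)."""
    target = cur[top:top + block, left:left + block].astype(np.int32)
    best_sad, best_mv = None, None
    for dy in range(-radius, radius + 1):       # 13 vertical offsets
        for dx in range(-radius, radius + 1):   # 13 horizontal offsets
            r, c = top + dy, left + dx
            if r < 0 or c < 0 or r + block > prev.shape[0] \
                    or c + block > prev.shape[1]:
                continue                        # candidate leaves the frame
            cand = prev[r:r + block, c:c + block].astype(np.int32)
            sad = int(np.abs(target - cand).sum())  # 64 differences
            if best_sad is None or sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)   # motion vector (c-a, d-b)
    return best_mv, best_sad

prev = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
cur = np.roll(prev, (2, -3), axis=(0, 1))   # content moves down 2, left 3
print(best_match(prev, cur, 24, 24))        # typically ((-2, 3), 0)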
To reduce the number of computations, there are two strategies.
1. Increase the size of the block: This reduces the number of block comparisons. However, the difference
between two blocks can be larger in this case, as there is a greater chance that several objects fall into the
same block, and the net degrees of freedom for change increase.

2. Decrease the search space: This also reduces the number of block comparisons. However, the
probability of finding a match between blocks decreases, simply because the search space is smaller.
So for both of the above strategies, there is a trade-off between the number of computations and the probability of
finding a closely matching block.

3 Video Compression Standards


There are many standards for video compression. We will look into

1. ITU-T H.261

2. MPEG 1

3. MPEG 2

4. MPEG 4

3.1 ITU-T H.261


• This video compression standard mainly finds application in the transmission of video data, as in videophone
and videoconferencing.

• Assumes CIF or QCIF format

• Frame rate of 30 frames per second

• Bit rate is p × 64 kbits per second, where p varies from 1 to 30.

• Each frame is divided into blocks of size 8 × 8 pixels.

• The difference between a block of the current frame and the closest-matching block of the previous frame is transformed
and quantized, and the quantizer levels are encoded using a variable-length code. These difference samples, together with the
motion vector, represent the digital video. While encoding, the encoded frames are reconstructed by
a local decoder and stored in a frame store; this is used for motion prediction.

The block diagram of the standard is given below.

Figure 4: H.261 Encoder (Fig 18.10)


3.1.1 Motion Compensation
It uses block-based motion compensation. As pointed out earlier, one of the difficulties in doing motion compensation
is its huge computational complexity. The H.261 algorithm balances the trade-off between the number of computations
and the probability of finding a closely-matched block using a hybrid strategy. 8 × 8-pixel blocks of luminance (Y)
and chrominance (U and V) pixels are organized into macroblocks. A macroblock consists of 4 Y blocks, plus 1 U and
1 V block that spatially correspond to the luminance blocks. For each macroblock, we search the previous
frame for the closest-matching macroblock. The match is identified by searching only the luminance blocks. The search
area is limited to ±15 pixels in the horizontal and vertical directions around the macroblock being considered.
H.261 conceives the CIF frame in a hierarchical manner as shown in the figure below. One CIF frame consists of
12 Groups of Blocks (GOB), and each GOB contains 3 rows of macroblocks with 11 macroblocks in a row. Each
layer has its own header information.

Figure 5: Hierarchical layers of a CIF frame

3.1.2 The transform


The transform of the pixels is carried out using an 8 × 8 DCT. For a motion-compensated block, what is transformed is not
the original pixel values, but the difference between the pixel values and those of the closest-matching block of the previous
frame. The encoder is said to be in "intra" mode if it directly encodes the input image, and in "inter" mode if it encodes
differences.

3.1.3 Quantization and Coding


One complication is that the encoder may switch between the intra and inter modes of operation. This changes the dynamic
range of the values to be quantized. H.261 adapts to this by switching between 32 different quantizers.

• The intra-DC coefficient is quantized using a uniform midrise quantizer with step size 8.

• The inter-DC coefficient and the other AC coefficients are quantized using uniform midtread quantizers with variable
step sizes 2, 4, 6, . . . , 62.

• Which set of quantizers is used for each of the coefficients is mentioned in the header of the GOB. If it is overridden
in a particular macroblock, this is specified in the macroblock header.
After quantization, if a macroblock contains blocks with no non-zero coefficients, those blocks need not be trans-
mitted/stored. This information is kept in the macroblock header using a variable called the Coded Block Pattern (CBP).
The coding is done similarly to JPEG: frequently occurring combinations are coded with a variable-length code,
and the other combinations are coded using 20-bit codewords.
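
The midrise/midtread distinction above can be illustrated with a minimal sketch; this is not the exact H.261 reconstruction table, just the two quantizer shapes.

import math

def midrise(x, step=8):
    """Uniform midrise quantizer: zero is a decision boundary, not an
    output level, so reconstructions are odd multiples of step/2
    (as used for the intra-DC coefficient, step size 8)."""
    return (math.floor(x / step) + 0.5) * step

def midtread(x, step):
    """Uniform midtread quantizer: zero is an output level, so small
    inputs quantize to exactly 0 (step sizes 2, 4, ..., 62)."""
    return round(x / step) * step

print(midrise(3.0))       # 4.0 -- never exactly zero
print(midtread(3.0, 8))   # 0   -- a dead zone around zero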

3.1.4 Loop filter


The loop filter is used to avoid a problem that arises from motion compensation. When a block of a frame is compared
with blocks of the previous frame, sharp edges in the frame can result in large prediction errors. To minimize the
prediction error (and thereby reduce the required dynamic range of the quantizer), the prediction block (the block used for
comparison) can be passed through a smoothing filter before the comparison. This filter is called the loop filter.

3.1.5 Rate Control


For the transmission of video data it is important to keep the data rate within a limit, and thus avoid buffer overflow. If the
transmission buffer is relatively empty, more coefficients from the transform coder can be encoded: the rate controller decreases
the quantizer step size when the bit rate into the buffer is low. If the buffer requires a lower bit rate, quantizers with
larger step sizes are chosen.

3.2 MPEG 1
• The important application is video storage.

• The most important feature is random access capability.

• It is very similar to H.261, but has the modifications required to implement random access capability.

• Though highly flexible, MPEG 1 suggests constraints on certain parameters. For instance, there are
suggested constraints on picture size: horizontal picture size ≤ 768 pixels and vertical picture size
≤ 576 pixels. These suggested constraints are called the Constrained Parameter Bitstream (CPB). For the CPB case, MPEG
1 achieves bit rates between 1 and 1.5 Mbps.

3.2.1 Random Access Capability of MPEG 1


Viewers need not start viewing a video from the beginning. But if the video frames are predictively coded, how
can somebody start viewing the video from the middle? This problem is solved in MPEG 1. The capability of a
video encoding strategy to start reproducing video frames from the middle is called random access capability. How
is it achieved?

• There are three types of frames: I frames, P frames and B frames.

– I frames: Coded without reference to any other frame. They do not exploit the temporal correlation of
the video.
– P (Predictive Coded) frames: Encoded with reference to previous frames.
– B (Bidirectionally Predictive Coded) frames: Encoded with reference to both past and future frames.

• I frames and P frames are called anchor frames. I frames reduce compression efficiency;
B frames, which are encoded with reference to past and future anchor frames, achieve better compression,
and thus compensate for the reduction due to I frames. In some cases, prediction based
on future frames yields far better results. For instance, in videos such as advertisement films there can be
sudden scene changes; in that case, predicting a frame from a future frame rather than a past frame gives
a better compression ratio.

• Since a future anchor frame must be available in order to encode B frames, the "display order" of frames (the order
in which frames are shown to the user) and the "bitstream order" (the order in which frames are processed for
encoding) can be different.
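
A minimal sketch of this reordering, assuming an illustrative I B B P display pattern (the frame names and the pattern are ours):

def bitstream_order(display_order):
    """Move each anchor (I or P) ahead of the B frames that depend on
    it, since B frames need a future anchor available for decoding."""
    out, held_b = [], []
    for frame in display_order:
        if frame.startswith('B'):
            held_b.append(frame)   # hold B frames back...
        else:
            out.append(frame)      # ...emit the anchor first,
            out.extend(held_b)     # then the B frames it anchors
            held_b = []
    return out + held_b

display = ['I1', 'B2', 'B3', 'P4', 'B5', 'B6', 'P7']
print(bitstream_order(display))
# ['I1', 'P4', 'B2', 'B3', 'P7', 'B5', 'B6']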

3.3 MPEG 2
While MPEG 1 was tailored from H.261 with the video storage application in mind, MPEG 2 was designed to
provide a generic, application-independent standard. To achieve this, MPEG 2 adopts a "toolkit" approach: a user
can choose a set of algorithms and a set of constraints from a basket of algorithms and constraints. These are
termed "profiles" and "levels". A profile defines the algorithms, and a level defines the constraints.

• There are 5 profiles: simple, main, SNR-scalable, spatially scalable and high.

– Simple: The MPEG 1 algorithm without B frames. B frames increase complexity, and hence they are avoided
in the simple profile.
– Main: Same as MPEG 1.
– The remaining 3 profiles use multiple bitstreams to encode the video. There is a base stream with a lower
bit rate, which alone can be decoded to reconstruct the video; the other bitstreams enhance its quality.

• There are 4 levels: low, main, high 1440, and high.

– Low: Frame size of 352 × 240
– Main: Frame size of 720 × 480
– High 1440: Frame size of 1440 × 1152
– High: Frame size of 1920 × 1152

• Not every combination of profile and level is allowed.

– The simple profile is allowed only at the main level.
– The main profile is allowed at all levels.
– The SNR-scalable profile is allowed at the main and low levels.
– The spatially scalable profile is allowed only at high 1440.
– The high profile is allowed at all levels except low.

• There is an ordering of profiles: a video compressed using one profile can be decoded by decoders of the
same or a higher profile.

• MPEG 2 also allows different kinds of motion-compensated prediction. MPEG 2 allows interlaced video,
and hence prediction can also be done on fields.
3.4 MPEG 4
This standard is the result of a joint collaboration between ITU's Video Coding Experts Group (VCEG) and ISO's
Moving Picture Experts Group (MPEG). So the standard is known by various names: H.264, MPEG 4 Part 10, and
MPEG-4 AVC (Advanced Video Coding). The standard is in principle similar to the previous schemes, but differs
hugely in the algorithms used.

• The basic block diagram is the same as that of H.261. There are inter and intra pictures: inter pictures are obtained
by subtracting a motion-compensated prediction from the original, while intra pictures are the same as the original. The
difference or original values are transform coded, quantized and encoded using variable-length codes.

• The hierarchical structure of the frame (the macroblock structure) is the same. However, the smallest accessible block in the previ-
ous schemes was 8 × 8 pixels; in H.264 each block can be further divided into 8 × 4, 4 × 8 and 4 × 4 pixel units.
Similarly, macroblocks can be divided into 16 × 8 and 8 × 16 units. During prediction, comparison can be done at
these finer levels.

• The H.264 standard is substantially more flexible than previous standards, with a much broader range of
applications. In terms of performance, it claims a 50 percent reduction in bit rate over previous standards
for equivalent perceptual quality.

3.4.1 Motion Compensated Prediction


• As in earlier MPEG standards, there are I frames, P frames and B frames.

• Division into smaller blocks allows finer movements of objects to be tracked, and hence allows better prediction,
leading to lower bit rates. However, this increases the dynamic range of the motion vectors, consuming bit
resources to represent them. The facility to have rectangular regions (e.g., 4 × 8 pixel units)
helps track object activity better.

• Compensation is achieved with "quarter-pixel" accuracy. The reference picture is expanded by interpolating
neighbouring pixels. Then the four edges of the block are smoothed using filters, and then the comparison is done.

• The sequence of motion vectors obtained after compensation is not transmitted/stored as such. The motion
vectors are further differentially encoded: the median of three neighbouring motion vectors is used to
predict the current motion vector, and only the prediction error is encoded (see the sketch after this list).

• B frames exist in MPEG 4 as well. So there are two motion vectors associated with B frames: one by virtue of
comparison with a past frame, and the other due to a future frame. The prediction of each pixel is a weighted average
of the predictions due to the two motion vectors individually.

• There is a type of macroblock called the P-skip macroblock. Here, motion compensation is done at the macroblock
level, and the prediction error is not transmitted. It is useful when there is little change in the video.
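
A minimal sketch of the median-based motion vector prediction mentioned in the list above, assuming the three neighbours are the blocks to the left, above, and above-right (a common choice; the names are ours):

def predict_mv(left, above, above_right):
    """Predict a motion vector as the component-wise median of
    three neighbouring motion vectors."""
    median = lambda a, b, c: sorted((a, b, c))[1]
    return tuple(median(*xs) for xs in zip(left, above, above_right))

current = (4, -2)                              # current motion vector
pred = predict_mv((3, -1), (5, -2), (2, 0))    # -> (3, -1)
residual = (current[0] - pred[0], current[1] - pred[1])
print(pred, residual)                          # only (1, -1) is coded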

3.4.2 Transform
The transform used is NOT the 8 × 8 DCT, but a 4 × 4 DCT-like integer transform. The transform matrix is given below.

H = [  1   1   1   1
       2   1  −1  −2
       1  −1  −1   1
       1  −2   2  −1 ]    (8)
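
A minimal sketch of applying this transform to a 4 × 4 block X as H X H^T; the scaling that compensates for the integer approximation is folded into quantization, as noted in Section 3.4.4.

import numpy as np

H = np.array([[1,  1,  1,  1],
              [2,  1, -1, -2],
              [1, -1, -1,  1],
              [1, -2,  2, -1]])

def forward_transform(block):
    """Apply the 4x4 DCT-like integer transform of Eq. (8):
    rows first, then columns, i.e. H @ X @ H.T."""
    return H @ block @ H.T

X = np.arange(16).reshape(4, 4)   # a toy 4x4 residual block
print(forward_transform(X))       # integer coefficients, DC at [0, 0]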
3.4.3 Intra Prediction

Figure 6: Intra Prediction (Fig 18.17)

• One problem with the earlier schemes was that I frames escaped any predictive compression. In MPEG 4, I frames
are compressed using spatial prediction strategies.

• The strategy is explained in Fig 18.17 of [1]. Each of the 16 pixels, labelled a, b, . . . , p, of the 4 × 4 block is predicted using
the 13 neighbouring pixels A, B, . . . , L and Q, as shown in the figure. There are 9 prediction modes, numbered
0, 1, . . . , 8. A mode is identified by its direction as shown in the figure. For example, if the mode is 3, then
pixel D is used to predict pixels c, f and i. Similarly, each mode specifies a direction of prediction.
No direction is assigned to mode 2, called the DC mode, in which all pixels are predicted as the average
value of the boundary pixels (see the sketch after this list).
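
A minimal sketch of mode 2 (DC) prediction, assuming only the four pixels above and the four to the left are used for the average (the figure's 13-pixel neighbourhood also serves the directional modes):

def dc_mode_predict(top, left):
    """Mode 2 (DC) intra prediction for a 4x4 block: every pixel is
    predicted as the rounded average of the boundary pixels."""
    boundary = list(top) + list(left)
    dc = (sum(boundary) + len(boundary) // 2) // len(boundary)
    return [[dc] * 4 for _ in range(4)]

# All 16 predictions equal the mean (99) of the eight neighbours.
print(dc_mode_predict([100, 102, 98, 100], [96, 97, 99, 100]))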

3.4.4 Quantization
• Prior to quantization, the transforms of the 16 × 16 luminance residuals and the 8 × 8 chrominance residuals of
the macroblock-based intra prediction are processed to further remove redundancy. Recall that macroblock-
based prediction is used in smooth regions of the I picture; therefore it is very likely that the DC coefficients
of the 4 × 4 transforms are heavily correlated. To remove this redundancy, a discrete Walsh-Hadamard trans-
form is applied to the DC coefficients in the macroblock. In the case of the luminance block, this is a 4 × 4
transform of the sixteen DC coefficients. The smaller chrominance block contains four DC coefficients, so
a 2 × 2 discrete Walsh-Hadamard transform is used.

• H.264 uses 52 uniform scalar quantizers; the step size doubles for every sixth quantizer. A quantizer is denoted by
its step size Qstep. Since the transform used in MPEG 4 involves some approximations, a scaling is done
along with quantization. If θ(i,j) is the (i, j)-th transform coefficient, the quantizer label with step size
Qstep is

l(i,j) = sign(θ(i,j)) ⌊ |θ(i,j)| α(i,j)(Qstep) / Qstep ⌋    (9)

where α(i,j)(Qstep) is the scaling factor (see the sketch after this list).

3.4.5 Coding
• There are two options for coding:

– Option 1: Exponential Golomb codes for coding parameters, and context-adaptive variable-length
codes (CAVLC) to encode quantized values.
– Option 2: Binarize everything, and then use a context-adaptive binary arithmetic code (CABAC).
• Option 1:

– An exponential Golomb code for a non-negative integer x is obtained as the unary code for M =
⌊log2(x + 1)⌋ concatenated with the M least significant bits of the binary representation of x + 1. The unary code for a number
x is given as x zeros followed by a 1. The exponential Golomb code for zero is 1. (A sketch of this encoder follows the list.)
– Quantizer labels (quantized values) are scanned in a zig-zag manner. Usually the last coefficient values
(high-frequency components) have a magnitude of 1, and this redundancy is exploited in the coding.
Let N be the number of non-zero coefficients and T the number of trailing ±1's (in magnitude), where the
maximum value of T is taken to be 3. The (N, T) pair is encoded first, and then encoding is done
only for the N coefficients. The T trailing ±1's are encoded with a 0 representing +1 and a 1 representing
−1. Then the non-zero labels are scanned in reverse order (starting from the high-frequency components)
and encoded up to the last non-zero label. The total number of zeros is also encoded. Then the runs of zeros
between two non-zero labels are encoded, starting from the last non-zero label.

• Option 2:

– This option provides a better compression ratio.
– All data are first binarized. Different binarization methods are employed for different types of data. They
include unary codes, truncated unary codes, exponential Golomb codes, and fixed-length codes, plus five
specific binary trees for encoding macroblock and sub-macroblock types.
– After binarization, the binary strings are encoded in one of two ways. Redundant strings are encoded
using CABAC; binary strings that are essentially random bypass the arithmetic coder.
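
Finally, a minimal sketch of the exponential Golomb encoder described under Option 1:

def exp_golomb(x):
    """Exponential Golomb code for a non-negative integer x:
    M = floor(log2(x + 1)) zeros, then the binary representation
    of x + 1 (whose leading 1 is the terminating unary bit)."""
    binary = bin(x + 1)[2:]     # x + 1 in binary; M + 1 bits
    m = len(binary) - 1         # M = floor(log2(x + 1))
    return '0' * m + binary

for x in range(5):
    print(x, exp_golomb(x))     # 1, 010, 011, 00100, 00101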

References
[1] Khalid Sayood, Introduction to Data Compression, 3rd ed., Elsevier, 2006.
