
Design and Implementation of a Mobile Videophone

Ben Appleton
Department of Information Technology and Electrical Engineering,
University of Queensland

Submitted for the degree of
Bachelor of Engineering (Honours)
in the division of Signal and Image Processing

October 2001

2/123 Macquarie St
St Lucia, Q4067

The Dean
School of Electrical Engineering
University of Queensland
St Lucia, Q4072

Dear Professor Simmons,

In accordance with the requirements of the degree of Bachelor of Electrical
Engineering (Honours) in the division of Electrical Engineering, I present the
following thesis entitled Design and Implementation of a Mobile Videophone. This
work was performed under the supervision of Dr Vaughan Clarkson.

I declare that the work submitted in this thesis is my own, except as acknowledged in
the text and footnotes, and has not been previously submitted for a degree at the
University of Queensland or any other institution.

Yours Sincerely,

Ben Appleton





To my wife, Jenna
Acknowledgements


This thesis could not have been completed without the support and assistance of a
number of important people. First and foremost my thanks go to Dr Vaughan
Clarkson, my supervisor, for always having an open door. His advice and assistance
over the last year have been invaluable.

My appreciation is also extended to Dr Brian Lovell, for his valuable teaching in the
field of DSP and his boundless enthusiasm for grand projects.

To my wife Jenna, I am grateful for her patience, love and support.

Thanks go to my parents, Charles and Christine, for their love and guidance over the
last 22 years. They have given me a great head start on life.

To Elliot, thanks for helping me tackle a project of such magnitude.

Finally, my thanks go to God, my Creator and Sustainer. "Remember your Creator in
the days of your youth" (Ecclesiastes 12:1).

Abstract

Mobile videoconferencing is an exciting field of research with considerable potential.
Applications include advanced telecommuting, remote job interviews, improved
distance learning, flexible telemedicine, and the obvious benefit to interpersonal
communications. However, with the majority of potential videophone manufacturers
using compression technology a decade out of date, current solutions are aimed at
high bandwidth communication channels which are many years from implementation.

This thesis takes a different approach. Recent advances in video compression
research have produced techniques which are strong enough to allow video
conferencing over the existing low bandwidth mobile channels. The aim of this
project is to build a device which plugs into an existing mobile phone, extending its
capabilities to 2-way audio and video communication. By combining Set Partitioning
in Hierarchical Trees with a new motion field estimation technique, a video codec is
designed to operate in real time at data rates as low as 12kbps. A Variable Bit Rate
Linear Predictive Coding speech compression scheme is developed which requires
only 600bps, greatly increasing the bandwidth available to video.

Overall the performance of the mobile videophone is excellent. Given the very low
available bandwidth, the quality of both the audio and the video is good.
techniques show great potential for commercial videophones over existing mobile
channels.

Table of Contents
DESIGN AND IMPLEMENTATION OF A MOBILE VIDEOPHONE............... I
ACKNOWLEDGEMENTS .................................................................................... VII
ABSTRACT............................................................................................................... IX
TABLE OF CONTENTS ......................................................................................... XI
LIST OF FIGURES................................................................................................. XV
CHAPTER 1 - INTRODUCTION........................................................................ 1
1.1 AIMS............................................................................................................... 1
1.2 OVERVIEW...................................................................................................... 2
CHAPTER 2 - LITERATURE REVIEW........................................................... 3
2.1 OVERVIEW...................................................................................................... 3
2.2 GENERAL VIDEOCONFERENCING .................................................................... 3
2.3 MOBILE PHONE NETWORKS............................................................................ 4
2.4 COMPRESSION RESEARCH............................................................................... 4
2.5 RELEVANCE .................................................................................................... 5
CHAPTER 3 - EXISTING THEORY.................................................................. 7
3.1 DATA COMPRESSION....................................................................................... 7
3.1.1 Speech .................................................................................................... 7
3.1.2 Audio...................................................................................................... 7
3.1.3 Images.................................................................................................... 7
3.1.4 Video ...................................................................................................... 7
3.2 INFORMATION THEORY................................................................................... 8
3.2.1 Entropy and Data Compaction.............................................................. 8
3.2.2 Distortion and Data Compression....................................................... 10
3.3 IMAGE AND VIDEO COMPRESSION ................................................................ 11
3.3.1 Vector Quantisation vs Transform Coding.......................................... 11
3.3.2 The Karhunen-Loeve Transform.......................................................... 12
3.3.3 Scale Invariance and Consequences.................................................... 13
3.3.4 Colour Spaces ...................................................................................... 14
3.4 WAVELETS.................................................................................................... 15
3.4.1 Introduction to Wavelets...................................................................... 15
3.4.2 Wavelet Transform Mathematics ......................................................... 16
3.4.3 Time-Frequency Tilings....................................................................... 17
3.4.4 Biorthogonal Wavelet Filter Banks ..................................................... 18
3.4.5 Multi-Dimensional Wavelet Transform............................................... 19
3.4.6 Coefficient Tree Representation .......................................................... 20
3.4.7 Signal Estimation and Compression in the Wavelet Domain.............. 22
3.5 EMBEDDED ZEROTREE WAVELETS AND SET PARTITIONING IN HIERARCHICAL
TREES 24
3.5.1 Successive Approximation ................................................................... 25
3.5.2 Partial ordering by magnitude............................................................. 25
3.5.3 Spatial Orientation Trees..................................................................... 26
3.5.4 The EZW and SPIHT algorithms ......................................................... 27
3.5.5 Properties of EZW and SPIHT............................................................. 29
3.6 MOTION-BASED VIDEO CODING..................................................................... 30
3.6.1 Underlying Model ................................................................................ 30
3.6.2 Residual Image..................................................................................... 30
3.7 SPEECH COMPRESSION.................................................................................. 32
3.8 LINEAR PREDICTIVE CODING........................................................................ 32
3.8.1 Speech Production Model .................................................................... 32
3.8.2 Vocal Tract Filter Analysis.................................................................. 33
3.8.3 Voicing and pitch determination.......................................................... 35
3.8.4 Voice Synthesis..................................................................................... 36
3.8.5 LPC-10e Specifics................................................................................ 36
CHAPTER 4 - THEORETICAL CONTRIBUTION....................................... 38
4.1 MOTION ESTIMATION ................................................................................... 38
4.1.1 Model-based Motion Estimation.......................................................... 38
4.1.2 Algorithm 2 - Model-Based Surface Tracking..................................... 40
4.2 VARIABLE BIT RATE LINEAR PREDICTIVE CODING....................................... 41
CHAPTER 5 - DESIGN AND IMPLEMENTATION..................................... 44
5.1 SPECIFICATIONS............................................................................................ 44
5.1.1 Video and Audio Quality...................................................................... 44
5.1.2 Real-time Compression and Decompression....................................... 45
5.1.3 Portability ............................................................................................ 45
5.2 SOFTWARE COMPONENTS ............................................................................. 46
5.2.1 Audio Compression.............................................................................. 46
5.2.2 Video Compression.............................................................................. 46
5.2.3 Integration of Audio and Video Codecs............................................... 47
5.3 IMPLEMENTATION......................................................................................... 49
CHAPTER 6 - RESULTS ................................................................................... 50
6.1 IMAGE COMPRESSION ................................................................................... 50
6.2 AUDIO COMPRESSION ................................................................................... 51
6.3 VIDEO COMPRESSION.................................................................................... 52
6.3.1 Typical Quality..................................................................................... 52
6.3.2 Early Behaviour................................................................................... 53
6.4 COMPUTATIONAL RESULTS........................................................................... 54
6.5 PROTOTYPE SYSTEM..................................................................................... 54
CHAPTER 7 - DISCUSSION............................................................................. 56
7.1 ACHIEVEMENTS ............................................................................................ 56
7.1.1 Extremely Low Bit Rate Speech Codec................................................ 56
7.1.2 A New Very Low Bit Rate Video Codec............................................... 56
7.1.3 A New Fast Motion Estimation Algorithm........................................... 56
7.2 FUTURE WORK ............................................................................................. 57
CHAPTER 8 - CONCLUSION .......................................................................... 58
CHAPTER 9 - REFERENCES........................................................................... 60
CHAPTER 10 - APPENDICES............................................................................ 66
10.1 DWT.C ........................................................................................................... 1
10.2 SPIHT.C ......................................................................................................... 1
10.3 MEC.C............................................................................................................ 1
10.4 VIDCODEC.C................................................................................................ 1

List of Figures
Figure 1 Original Signal............................................................................................ 11
Figure 2 Quantised signal ......................................................................................... 11
Figure 3 Super-Symbol (Pixel Pair) Distribution for a Sample Image..................... 12
Figure 4 Example: Daubechies 7-9 Mother Wavelet................................................ 16
Figure 5 - Sampled Signal............................................................................................ 17
Figure 6 - Discrete Fourier Transform......................................................................... 17
Figure 7 - Short-Time Fourier Transform.................................................................... 17
Figure 8 - Discrete Wavelet Transform....................................................................... 17
Figure 9 - Quadrature Mirror Filter and Perfect Reconstruction................................. 19
Figure 10 - Wavelet Coefficient Arrangement ............................................................ 20
Figure 11 Full-Scale Image....................................................................................... 21
Figure 12 1st Scale..................................................................................................... 21
Figure 13 2nd Scale.................................................................................................... 21
Figure 14 Completed Wavelet Transform................................................................ 21
Figure 15 - A Binary Tree of 1-D Wavelet Coefficients ............................................. 22
Figure 16 LPC Speech Production Model ................................................................ 33
Figure 17 - Previous Frame.......................................................................................... 40
Figure 18 - Current Frame ........................................................................................... 40
Figure 19 Motion Field ............................................................................................... 40
Figure 20 - Bit Allocation Scheme for VBR LPC....................................................... 42
Figure 21 - Project Block Diagram.............................................................................. 45
Figure 22 - Bit Allocation for Compressed Data Packets............................................ 48
Figure 23 High Compression Rate Comparison JPEG vs SPIHT ......................... 50
Figure 24 Low Compression Rate Example ............................................................. 51
Figure 25 Sample Input Sequence ............................................................................ 52
Figure 26 - Sample Output at 25kbps .......................................................................... 52
Figure 27 - Frames 1, 2, 3, 10 of Sample Video.......................................................... 53
Figure 28 - Sample PSNR over time............................................................................ 53
Figure 29 - Prototype Client/Server Interface.............................................................. 54


Chapter 1 - Introduction

Mobile videoconferencing is a commercial gold mine, widely recognised as the
next step forward in mobile communications. The introduction of commonly
available mobile videoconferencing will bring a host of applications including
advanced telecommuting, remote job interviews, telemarketing, improved distance
learning, telemedicine, and the obvious advantage to interpersonal communications.
Many big players in the communication and computing industry are planning to
launch products in this market over the next 5 years.

There is however one serious obstacle which must first be overcome. This is the
difficulty of sending a high-bandwidth video signal through a low-bandwidth
mobile phone system. It is one of the major problems driving two large fields of
research: high-bandwidth wireless communication systems such as 3G¹ networks,
and new low-bandwidth videoconferencing standards such as H.324/M and T.120.
It is inevitable that within the next decade these efforts will converge to produce
high-quality mobile videoconferencing accessible by the general public.

Clearly an undergraduate thesis cannot produce a commercial mobile phone
network, so any solution to the mobile videoconferencing problem must accept the
current timeline and work with the channels available today. However, the
compression technology currently in use lags well behind what is being published
by the research groups. Audio and video compression techniques are emerging
which are sufficiently strong to allow videoconferencing over the existing low
bandwidth mobile channels. Therefore it makes sense to bring together the
strongest compression techniques currently available to make a mobile videophone
which operates at the present data rates. This thesis is a proof of concept.

1.1 Aims


¹ 3rd Generation GSM, the Global System for Mobile communications

The aim of this thesis is to design and implement a mobile videophone operating
over a GSM or GPRS channel. The final product should consist of:
1. A video camera and display
2. A speaker and a microphone
3. A processor

The central requirements for the final product are:
1. Real-time operation
2. Very low bit rate
3. Good quality video
4. Acceptable quality audio
5. Scalability
6. Portability
These are reviewed in greater detail in Chapter 5, Design and Implementation.

1.2 Overview

This section summarises the structure of the thesis. Chapter 2 covers the current
state of videoconferencing, mobile telecommunications, compression research and
their relevance to this thesis. Chapter 3 details the existing theory in the field of
compression, particularly wavelets and linear predictive coding, as well as the
widely accepted motion paradigm for video coding.

Chapter 4 describes the two major new contributions of this thesis: a fast model-
based motion estimation algorithm and a variable bit rate speech codec. Chapter 5
describes the design and implementation of the mobile videoconferencing system,
based upon the previous theoretical sections. Chapter 6 describes the results
obtained for the compression modules, as well as a prototype system designed to
demonstrate its potential, while Chapter 7 discusses the three major achievements
of the thesis and possible further work in the area.

Chapter 8 concludes the thesis with a summary of the body of work.
Chapter 2 - Literature Review

This chapter reviews the state of the art in videoconferencing, particularly mobile
videoconferencing. It also assesses the current state of mobile phone networks and
their predicted improvements over the next decade. A brief review of current
compression technology for audio and video follows.

2.1 Overview

Of all of the publications reviewed, Video Coding for Mobile Handheld
Conferencing [Faichney & Gonzalez 1999] is the most generally relevant. This
paper considers the potential implementations of handheld video conferencing
systems. In particular it examines the major competing video compression schemes
currently available, including SPIHT [Said & Pearlman 1996]. It also assesses two
handheld platforms for their implementation, a Newton MessagePad 2100 and a
Hewlett Packard 620LX. The undergraduate thesis on which this paper is based is
quite similar to our own and hence of great interest. Although it does not include
audio and is implemented on a pre-existing platform, it still practically
demonstrates the current state of mobile videoconferencing and what our thesis may
achieve.

2.2 General Videoconferencing

Video is one of the most demanding forms of data communication possible. In its
native form it requires an extremely high bandwidth, around 70 Mbps [Duran &
Sauer 1997], which must be delivered with a delay of less than 200msec and very
little temporal jitter.

Traditional approaches to videoconferencing require one of the following: [VC
2001]
1. A PC with a 56kbps modem.
2. Dedicated videoconferencing hardware with an ISDN line, or
3. A PDA and a mobile phone with extra audio-visual hardware (See also
[Faichney & Gonzalez 1999])
These achieve varying degrees of success, but generally realise only some of the
promise of videoconferencing.

At present very few mobile videoconferencing solutions are available. Many
companies are advertising products under development aimed at 3G networks
which will supply in excess of 1Mbps. For example, Toshiba and Winnov are
targeting their videophones at the $1000 mark [MobileInfo 2001], not including the
laptop computer on which they run. These are naturally expected to achieve very
good quality audio and video.

2.3 Mobile Phone Networks

The very low bandwidth of current wireless communication systems is one facet of
the obstacle. Currently, GSM Phase 1 phones can achieve between 9.6kbps
[Buckingham 2001] and 14.4kbps [MDWC 2001]. They have very low reliability,
exhibiting high error rates (BER up to $10^{-3}$) and occasional channel fading.

General Packet Radio Service (GPRS) is Phase 2 of GSM. Early GPRS terminals
are expected to become publicly available in late 2001 and are predicted to have a
bandwidth of 28.8kbps [Buckingham 2001].

3rd Generation GSM, known as 3G, is expected to emerge over the next decade. The
exact specifications are still under consideration by the ITU, however it is expected
to deliver data rates exceeding 1Mbps [3G 2001].


2.4 Compression Research

The paper which first inspired this thesis is a masters thesis [van der Walle 1995]
published in 1995 from the University of Waterloo. This thesis analysed the
mathematics of fractal image compression and wavelet image compression, two
competing views of image compression prevalent at the time, to bridge the gap
between them.

A relatively recent image compression scheme based on wavelets is Set Partitioning
in Hierarchical Trees [Said & Pearlman 1996]. SPIHT is based on the properties of
a wavelet-transformed image, and has demonstrated remarkably good results at low
bit rates.

A couple of video compression standards have recently been produced targeting
mobile videoconferencing. One evolving standard, H.324/M (M for mobile) is an
extension of the existing H.324 standard for videoconferencing over Public
Switched Telephone Networks. It is specifically designed for mobile terminals,
allowing for Bit Error Rates of up to 0.05 and the very low bandwidth normally
available [Ohr 1996]. MPEG², responsible for one of the first major video
compression standards, is working on a new set of standards: MPEG4, MPEG7 and
MPEG21 [MPEG 2001]. These standards are focussed primarily on the intelligent
handling of video by providing context, and although MPEG4 contains the facility
for low bandwidth video it is only a side-note in the standard. The audio standard
also does not support very low bandwidths.

Surprisingly the strongest voice compression algorithm publicly available is over 30
years old. LPC-10e³, a voice compression scheme based on linear predictive
coding, produces a 2.4kbps fixed-rate output. It is based on a simple model of the
human voice, producing moderate quality speech [Rabiner & Schafer 1978].

2.5 Relevance


² Moving Picture Experts Group
³ US Federal Standard 1015

Our project brings together much of the above. We are aiming at an order of
magnitude lower bandwidth than modern video conferencing systems, so we have
developed our own video compression technique and our own improved LPC
speech codec. We also aim to support the current standard mobile channel (GSM
Phase 1) while providing a scalable codec to allow for the advances in mobile
services expected over the next decade.

Chapter 3 - Existing Theory

3.1 Data Compression

3.1.1 Speech

Speech coding is used for low quality telephony and higher quality
teleconferencing. Telephone speech is sampled at 8kHz with 8-bit precision,
producing a PCM code at 64kb/s. Model based analysis allows intelligible speech
at rates as low as 2.4kb/s.

3.1.2 Audio

Audio signals are a more general class of signals. On a CD, audio is sampled at
44.1kHz with 16-bit precision, producing a PCM code at 706kb/s. Model based
analysis is thwarted by the complexity of general audio, so at present the best audio
codecs are transform coders. CD quality sound can be achieved with 128kb/s while
perceptually lossless sound can be achieved with 64kb/s.

3.1.3 Images

A standard greyscale image has 512 by 512 pixels with 8-bit precision. Similar to
audio, images are currently too complicated for model-based compression so
transform codecs are used. Perceptually lossless compression may be achieved at
8-fold compression, while good quality images can be achieved at 30-fold
compression.

3.1.4 Video

Applications of digital video range from low quality videophones and
videoconferencing to high resolution digital television. Currently the best
compression algorithms remove the temporal redundancy with motion
compensation. Each frame is predicted from previous frames by compensating for
the motion encoded by the transmitter. An error image, called a residual, is
compressed with an image compression scheme.

For teleconferencing, Quarter Common Intermediate Format (QCIF) colour video
has only 176 by 144 pixels [Cherriman 1996]. A maximum of 30 frames per
second are transmitted, however 10 or 15 is usual. If the motion in the video is
relatively simple then moderate quality video can be obtained at 128kb/s.

3.2 Information Theory

In 1948 Claude Shannon, a researcher for Bell Labs, published a number of
remarkable theorems [Shannon 1948] which have formed the foundation for the
fields of data compression and error coding. He established that the fundamental
measure of the information contained in any message, including text, audio and
video, is the minimum number of bits required to represent it. This measure of
information he named entropy, explicitly showing a fundamental link between
information theory and physics that is still being explored today.

To aid the following discussion we begin with some definitions:
Data compaction is the invertible process of representing data by a shorter
bit sequence, and is said to be lossless.
Data compression is the process of approximately representing data in such
a way as to minimise an error metric, called the distortion.
Data compression typically obtains a much shorter representation than data
compaction.

3.2.1 Entropy and Data Compaction

Following Shannon, we begin by considering a memoryless source of information
which produces symbols $s_k$ from a symbol space S. A memoryless source is one in
which each successive symbol is independent of all previous symbols.

Consider two variables $x$ and $y$, with probability distribution functions $f_X(x)$ and
$f_Y(y)$. $x$ and $y$ are called independent if and only if $f_X(x \mid y) = f_X(x)$. In other
words, the information obtained from y in no way contributes to the information
known about x; it does not alter the distribution of x.

We define the information obtained from the reception of a symbol $s_k$ from a
memoryless source as follows:
$$ I(s_k) = \log_2\!\left(\frac{1}{p_k}\right) = -\log_2 p_k $$
where $p_k = f_S(s_k)$ is the probability of receiving symbol $s_k$.

This definition implies that the less probable a given symbol is, the more
information we gain from its reception. If we take the expected value of the
information contained in a random symbol produced by the source:
$$ H(M) = E\left[I(s_k)\right] = -\sum_{k} p_k \log_2 p_k $$
where $M$ is the message space,
then $H(M)$ is Shannon's entropy for a single symbol output by the source.

The definition of symbol is very flexible. We can define a symbol to be an entire
message or to be the components of a message. For example a symbol could be a
single English letter or a whole word. A larger symbol set provides greater
opportunity for the exploitation of dependencies within the data.
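
As a concrete illustration, the short C routine below computes the entropy of a
memoryless source from its symbol probabilities. It is a minimal sketch, assuming the
probabilities are already known and sum to 1; it is not part of the codec software.

    #include <math.h>

    /* Entropy of a memoryless source, in bits per symbol.
     * p[] holds the probabilities of the n symbols and is assumed to sum to 1.
     * Symbols of zero probability contribute nothing to the sum. */
    double source_entropy(const double *p, int n)
    {
        double H = 0.0;
        for (int i = 0; i < n; i++) {
            if (p[i] > 0.0)
                H -= p[i] * log2(p[i]);
        }
        return H;
    }

For example, a source emitting two equally likely symbols has an entropy of exactly
1 bit per symbol, while any bias between the two symbols lowers the entropy below
1 bit.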

For a message composed of independent symbols we can simply sum the
information from each symbol to obtain the entropy of the message. To show this,
consider two independent random symbols X and Y taken from the same symbol
space:
$$ f_{XY}(x, y) = f_X(x)\, f_Y(y) = p_x p_y $$

So the entropy of the combined symbol XY is:
$$ H(XY) = E\left[I(XY)\right] = E\left[-\log_2 (p_x p_y)\right] = E\left[-\log_2 p_x - \log_2 p_y\right] = E\left[I(X) + I(Y)\right] $$

Therefore $H(XY) = H(X) + H(Y)$ for independent X and Y.

The difference between a message's length and its entropy is known as its
redundancy. Redundancy is additional information which reinforces existing
knowledge rather than adding new information. Most data is very redundant; see
[Shannon 1948] for an example of the redundancy of English text.

Redundancy allows for a reconstruction of the information from an incomplete
fraction of the message. Artificially introduced redundancy is used extensively in
coding theory. Conversely the removal of redundancy results in a shorter message,
which is the goal of compaction and compression.

The concept of entropy allows us to draw a parallel between data compaction and
the classic game of 20 questions. In the game of 20 questions one person (the
transmitter) chooses an object (the message for transmission), while another (the
receiver) is given 20 questions (bits of information) to determine the object. The
receiver can then differentiate between any of $2^{20}$ different objects. Here both
players must agree upon what constitutes a valid object (the message space) before
they play; this is the shared information.

If a message can be isomorphically transformed to a sequence of independent,
symmetric binary decisions then it is guaranteed to achieve Shannon's entropy.
By establishing a suitable message space (coding scheme) that is shared by sender
and receiver, each data bit subdivides the space into two equally likely subsets until
only one element is left, the intended message. The answers to these questions
form the compacted data, while the message space is the information shared
between sender and receiver.

3.2.2 Distortion and Data Compression

So far we have dealt only with data compaction, which is the lossless compression
of a redundant message. However by discarding insignificant components of the
data much higher compression gains may be made. Data compression is the
technique of approximating the data within an acceptable level of distortion. The
choice of distortion measure depends on both the type of data and the intended
recipient.

For image and video compression a widely accepted distortion measure is the Peak
Signal to Noise Ratio (PSNR). It is defined as:
$$ \mathrm{PSNR} = 10 \log_{10}\!\left(\frac{255^2}{\mathrm{MSE}}\right) \ \mathrm{dB} $$
where MSE is the Mean Squared Error between the original image and the
reconstructed image in 8-bit format. PSNR for video is measured frame-by-frame.
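
A minimal sketch of this measure in C, assuming 8-bit greyscale images stored as flat
arrays (the function name and layout are illustrative, not those of the thesis code):

    #include <math.h>

    /* PSNR in dB between an original and a reconstructed 8-bit image of n pixels. */
    double psnr_8bit(const unsigned char *orig, const unsigned char *recon, int n)
    {
        double mse = 0.0;
        for (int i = 0; i < n; i++) {
            double d = (double)orig[i] - (double)recon[i];
            mse += d * d;
        }
        mse /= n;
        if (mse == 0.0)
            return INFINITY;    /* identical images */
        return 10.0 * log10(255.0 * 255.0 / mse);
    }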

3.3 Image and Video Compression

3.3.1 Vector Quantisation vs Transform Coding
Quantisation is the act of approximating a data symbol by one from a smaller
symbol set. In doing so we require fewer bits to represent the symbol, at the cost of
some distortion. In Scalar Quantisation (SQ) we quantise each element of a signal
separately.


Figure 1 Original Signal

Figure 2 Quantised signal
Table 1 Scalar Quantisation Example

In Vector Quantisation (VQ) we group data symbols into super-symbols before
quantisation. Quantisation takes place in a multi-dimensional symbol space where
the interdependency of neighbouring data elements can be exploited to reduce
distortion.

Figure 3 Super-Symbol (Pixel Pair) Distribution for a Sample Image

Figure 3 above shows the strong correlation between neighbouring pixels in a
typical image. Vector Quantisation takes advantage of dependencies between
neighbouring pixels while Scalar Quantisation does not. If we combine our entire
signal into a single super-symbol we can achieve the strongest possible
compression. However VQ is a great deal slower and more complicated to
implement than SQ, exhibiting exponential growth in computation as the super-
symbol dimension is increased.

Transforms may be used to remove dependencies so that VQ achieves little gain
over SQ. So by combining a suitable transform with SQ we may achieve fast
compression with a comparable bit rate.
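
For concreteness, a uniform scalar quantiser can be written in a few lines of C. The
step size and mid-interval reconstruction below are illustrative assumptions rather
than the exact scheme behind Figures 1 and 2.

    #include <math.h>

    /* Uniform scalar quantisation: map a sample to an integer index, and
     * reconstruct it as the centre of the corresponding interval. */
    int sq_encode(double x, double step)
    {
        return (int)floor(x / step);
    }

    double sq_decode(int index, double step)
    {
        return (index + 0.5) * step;    /* centre of the quantisation interval */
    }

Vector quantisation replaces the scalar x with a block of samples and the scalar
intervals with multi-dimensional cells, which is where the exponential growth in
computation noted above arises.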

3.3.2 The Karhunen-Loeve Transform

The Karhunen-Loeve Transform (KLT) is an optimal technique for decorrelating
the components of a signal. Also known as Principal Component Analysis or
Proper Orthogonal Decomposition, it diagonalises the covariance matrix of a signal
to produce a set of coefficients that are often close to independent, concentrating the
information content of a signal into the first few coefficients.

3.3.2.1 Derivation

Let $X \in \mathbb{R}^N$ be a (column) data vector of zero mean, drawn from a known
distribution D. Let $\Sigma$ be the autocovariance matrix of X:
$$ \Sigma = E\left(X X^T\right) $$
$\Sigma$ quantifies the correlations between data elements in X.

As $\Sigma$ is a symmetric matrix it will have orthogonal eigenvectors. Let $T$ be the
orthogonal matrix of normalised eigenvectors of $\Sigma$. We now apply a change of
basis to the data vector X:
$$ X' = T X $$

Consider the resulting autocovariance matrix $\Sigma'$ of $X'$:
$$ \Sigma' = E\left(X' X'^T\right) = E\left(T X (T X)^T\right) = T\, E\left(X X^T\right) T^T = T \Sigma T^{-1} = \Lambda $$

where $\Lambda$ is the diagonal matrix of eigenvalues of $\Sigma$, equivalently the variances of
each symbol in the new basis for X.

So the KLT changes the basis of the data to remove all inter-symbol correlations.
This allows SQ to achieve comparable compression to VQ. Unfortunately due to its
dense matrix formulation the KLT is not well suited to fast data compression, and is
unable to utilise the non-linear dependencies seen in image and video compression.
However, it forms an ideal theoretical viewpoint from which to analyse other
transforms in compression.
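
To make the derivation concrete, the following C sketch decorrelates a set of
zero-mean 2-D samples by rotating them into the eigenbasis of their estimated 2x2
autocovariance matrix, for which a closed-form rotation angle exists. It illustrates the
principle only and is not the transform used later in this thesis.

    #include <math.h>

    /* Decorrelate n zero-mean 2-D samples (x[i], y[i]) in place by rotating
     * them into the eigenbasis of their 2x2 autocovariance matrix. */
    void klt_2d(double *x, double *y, int n)
    {
        double sxx = 0.0, sxy = 0.0, syy = 0.0;
        for (int i = 0; i < n; i++) {
            sxx += x[i] * x[i];
            sxy += x[i] * y[i];
            syy += y[i] * y[i];
        }
        /* Rotation angle that diagonalises [[sxx, sxy], [sxy, syy]]. */
        double theta = 0.5 * atan2(2.0 * sxy, sxx - syy);
        double c = cos(theta), s = sin(theta);
        for (int i = 0; i < n; i++) {
            double u =  c * x[i] + s * y[i];
            double v = -s * x[i] + c * y[i];
            x[i] = u;
            y[i] = v;
        }
    }

Applied to the pixel pairs of Figure 3, such a rotation aligns the first axis with the
long axis of the distribution, so that most of the energy is carried by a single
decorrelated component.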

3.3.3 Scale Invariance and Consequences

One defining property of images is their scale invariance. For example, if we take
a picture of a scene from a large distance and then take a second picture from a near
distance the two images will have the same statistical properties.

Scale invariance in its direct form motivates fractal image compression [van der
Walle 1995], while the equivalent statement in the Fourier domain motivates
wavelet image compression. We now give a simple derivation of the statistical
distribution of the Fourier coefficients of a scale invariant signal.

Scale invariance in 1-D may be expressed simply as
$$ i(x) \sim i(sx) $$
where $i(x)$ is the intensity as a function of position and $s$ is the scaling factor.
$A \sim B$ means that the random functions A and B are drawn from the same
distribution.

By the Fourier scaling theorem, if we let $I(f) = F\{i(x)\}$ then we obtain
$$ F\{i(sx)\} = \frac{1}{s}\, I\!\left(\frac{f}{s}\right) $$
Scale invariance can thus be translated to the frequency domain:
$$ i(sx) \sim i(x) \;\Rightarrow\; I\!\left(\frac{f}{s}\right) \sim s\, I(f) \;\Rightarrow\; I(f) \propto \frac{1}{f} $$
So we observe that the amplitudes of the Fourier coefficients of an image follow a
1/f decay. This motivates a transform based not on a linearly spaced set of
frequencies, as in the Fourier transform, but instead on a logarithmically spaced set
of frequencies.

3.3.4 Colour Spaces

The standard Red-Green-Blue format for colour representation is highly redundant
as the three components are strongly correlated. Many more efficient colour spaces
are available. These are typically obtained by computing a KLT basis from a large
image database, or by determining a perceptual colour space based on human
psychovisual experiments. The most commonly used colour space is a Luminance-
Chrominance space known as YUV. Typically converting from RGB to YUV
allows a compression gain of 2 [Bourke 2000].
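
A sketch of the conversion for a single pixel, assuming the commonly used ITU-R
BT.601 coefficients (the thesis implementation may use slightly different constants):

    /* RGB to YUV (luminance-chrominance) conversion for one pixel, using the
     * common ITU-R BT.601 coefficients. r, g, b are in the range [0, 255]. */
    void rgb_to_yuv(double r, double g, double b, double *y, double *u, double *v)
    {
        *y =  0.299 * r + 0.587 * g + 0.114 * b;
        *u = -0.147 * r - 0.289 * g + 0.436 * b;   /* 0.492 * (B - Y) */
        *v =  0.615 * r - 0.515 * g - 0.100 * b;   /* 0.877 * (R - Y) */
    }

After such a conversion most of the signal energy is concentrated in the luminance
channel, so the two chrominance channels can be subsampled or allocated fewer bits,
which is the source of the compression gain quoted above.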

3.4 Wavelets

3.4.1 Introduction to Wavelets

The classic tool of signal analysis has been the Fourier Transform. It represents a
signal as a set of complex sinusoidal components of infinite duration. As a
complex sinusoid is an eigenfunction of a linear system the Fourier Transform has
found applications in systems for signal processing, control and communication,
and indeed for modelling any linear time-invariant system.

The Fourier Transform provides an excellent model for stationary (time-invariant)
signals and systems, and when using it for signal analysis we implicitly assume that
the signal is stationary. However speech, images and video clearly are not
stationary processes. They are spatially and temporally variant - within a single
signal there occur both periods of silence and of noise, transients and oscillations.

Researchers have taken many different approaches to developing transforms for
time-varying signals. In 1910 Haar, a functional analyst, developed a multi-scale
transform for piecewise constant functions. In 1946 Gabor, a physicist, developed a
multi-scale transform based on Gaussian-windowed complex sinusoids for
modelling coherent states in quantum mechanics. In recent decades researchers in
signal processing have devised the Short-Time Fourier Transform and the
Windowed Fourier Transform to attempt to capture some sense of the time-varying
nature of signals. The Short-Time Fourier Transform is a somewhat ad-hoc
approach and consequently suffers from an inherent discontinuity between
processing blocks as well as a fixed scale of operation.

Over the last decade these approaches have converged into the Wavelet Transform.
The Wavelet Transform has brought together researchers from diverse fields such
as theoretical physics, signal processing, functional analysis, geology and
surveying, providing a unifying view of many phenomena. It provides a new view
of signal processing in which time and frequency are represented simultaneously
and which naturally represents objects at all scales.

3.4.2 Wavelet Transform Mathematics

The Wavelet Transform represents a signal as composed of a set of dilations and
translations of a single waveform, the wavelet. A wavelet is a single function of
zero mean:
$$ \int_{-\infty}^{\infty} \psi(t)\, dt = 0 $$
that is dilated with a scale parameter s, and translated by u to produce a wavelet
atom:
$$ \psi_{u,s}(t) = \frac{1}{\sqrt{s}}\, \psi\!\left(\frac{t - u}{s}\right) $$
The wavelet coefficient of a function f(t) at the scale s and position u is computed
by correlating f(t) with a wavelet atom:
$$ W\{f\}(u, s) = \int_{-\infty}^{\infty} f(t)\, \frac{1}{\sqrt{s}}\, \psi\!\left(\frac{t - u}{s}\right) dt $$
The function $\psi(t)$ is also known as the mother wavelet.



Figure 4 Example: Daubechies 7-9 Mother Wavelet

3.4.3 Time-Frequency Tilings

One way to view the wavelet transform and its relatives is by observing their time-
frequency tilings. The time-frequency tiling is used to describe how a transform is
able to resolve time and frequency components of a signal. Each rectangle
describes the area of influence of a single transform coefficient. In an orthogonal
transform each rectangle must have the same area.


Figure 5 - Sampled Signal

Figure 6 - Discrete Fourier Transform

Figure 7 - Short-Time Fourier Transform

Figure 8 - Discrete Wavelet Transform
Table 2 - Time-Frequency Tilings

Figure 5 shows a sampled signal. In this form the coefficients have good temporal
resolution but no frequency resolution.

Figure 6 shows the Discrete Fourier Transform. Each coefficient represents a
narrow frequency range, spread over the entire time range. Thus the coefficients
have no temporal resolution; the DFT implicitly assumes time-invariance.

Figure 7 shows the Short-Time Fourier Transform. In this transform the data has
been broken up into temporal blocks, allowing a coefficient to represent a transient
frequency component. Although this simplified diagram is unable to show it, the
rectangles do not overlap so discontinuities exist between temporal blocks.

Figure 8 shows the Discrete Wavelet Transform. In this transform the high
frequency components influence a short time period, while the low frequency
components influence a correspondingly long time period. The rectangles in this
diagram overlap substantially, such that altering any coefficient will smoothly alter
the surrounding neighbourhood. This prevents a compression scheme based on the
wavelet transform from introducing blocky artefacts. The wavelet transform acts
as a spatially adaptive filter, generating a large number of significant coefficients in
the vicinity of signal discontinuities and relatively low amplitudes in smooth
regions.

3.4.4 Biorthogonal Wavelet Filter Banks

The discrete wavelet transform handles signal boundaries by periodic extension.
However this creates an artificial discontinuity across the boundary, generating high
amplitude coefficients in its vicinity. When compressing the signal these falsely
significant coefficients require additional bits for no additional gain in fidelity.

Similar to the Discrete Cosine Transform a wavelet transform can be constructed
which allows the signal to be symmetrically extended, removing the artificial
discontinuity. In order to allow perfect reconstruction under these conditions the
filter must be symmetric. However for a particular choice of wavelet to be
admissible it must be self-orthogonal when dilated or translated. A symmetric
wavelet cannot be self-orthogonal when dilated though, leading to a contradiction.

The solution is to use a pair of wavelets that are mutually orthogonal but not self-
orthogonal. These biorthogonal filters may then be symmetric, allowing perfect
reconstruction for a symmetrically extended signal.

The Biorthogonal Wavelet Transform (BWT) is an orthogonal transform, allowing
for coefficient extraction by correlation. Note that the term BWT is used
specifically for biorthogonal wavelets, while the term DWT is used to denote the
wavelet transform in general. Due to the octave-band split in the frequency domain
the DWT has an efficient implementation as a pyramid of filters. Quadrature-
Mirror Filters developed from sub-band coding for audio compression are used to
ensure perfect reconstruction. Consequently the computation time is near-linear in
the number of coefficients.

The pyramidal algorithm initially splits the signal into a high frequency and low
frequency stream. The procedure is then repeated, recursively splitting the low
frequency component into successively lower octaves.


Figure 9 - Quadrature Mirror Filter and Perfect Reconstruction

Here we see the signal $a_0[n]$ split into a low frequency component or average $a_1[n]$
and a high frequency, or detail, component $d_1[n]$. The perfect reconstruction by $\tilde{h}$
and $\tilde{g}$ is also shown. In actual implementation symmetric polyphase filters may be
used to quarter the number of multiplies required.
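
As a minimal illustration of the analysis/synthesis structure of Figure 9, the following
C sketch uses the orthogonal Haar filter pair, chosen for brevity rather than the
symmetric biorthogonal filters discussed above, and assumes an even-length signal:

    #include <math.h>

    /* One level of a Haar filter bank: split a[0..n-1] (n even) into n/2
     * averages and n/2 details. */
    void haar_analysis(const double *a, double *avg, double *det, int n)
    {
        for (int i = 0; i < n / 2; i++) {
            avg[i] = (a[2 * i] + a[2 * i + 1]) / sqrt(2.0);
            det[i] = (a[2 * i] - a[2 * i + 1]) / sqrt(2.0);
        }
    }

    /* Perfect reconstruction of the original signal from averages and details. */
    void haar_synthesis(const double *avg, const double *det, double *a, int n)
    {
        for (int i = 0; i < n / 2; i++) {
            a[2 * i]     = (avg[i] + det[i]) / sqrt(2.0);
            a[2 * i + 1] = (avg[i] - det[i]) / sqrt(2.0);
        }
    }

The full pyramid is obtained by applying the analysis step recursively to the average
channel, exactly as described above.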

For further algorithmic detail, see the attached code [Appendices, 10.1 DWT.c].
For further theoretical detail, see [Mallat 1998], [Daubechies 1992] or [Cohen
1995].

3.4.5 Multi-Dimensional Wavelet Transform

The wavelet transform may be simply extended to any N-dimensional signal. As
the wavelet transform is linear this may be achieved by transforming the signal
separably along each axis.

By convention, we store the coefficients in the following arrangement for a 2-
dimensional image transform:


Figure 10 - Wavelet Coefficient Arrangement
HH is the x-detail, y-detail component,
HL is the x-detail, y-average component, and
LH is the x-average, y-detail component.

The scales are shaded from light for lowest scale to dark for higher scales.

3.4.6 Coefficient Tree Representation

If a discontinuity exists in a signal then any wavelet that lies across it will have a
significant amplitude. A wavelet transform can then be seen as a multi-scale edge
detector, where successively lower frequencies represent smaller versions of the
same edge map. This can be seen in the following figures which show several
stages of a periodic DWT (Table 3). Observe how the periodic extension scheme
creates large coefficients along the edges of the image.


Figure 11 Full-Scale Image

Figure 12 1st Scale

Figure 13 2nd Scale

Figure 14 Completed Wavelet Transform
Table 3 Successive Scales of a 2-D Wavelet Transform

The similarity of the edge map between scales motivates the view of the wavelet
coefficients as a binary tree. To each node we associate two children, the two
coefficients within the same time period in the next higher octave. At the root of
the tree is the signal component at the highest scale (lowest frequency). If a
discontinuity causes a particular coefficient to have a significant amplitude then its
ascendants will be significant also. This property of the wavelet transform may be
used to aid in the efficient localisation of signal discontinuities.


Figure 15 - A Binary Tree of 1-D Wavelet Coefficients

The concept of a binary tree of wavelet coefficients extends simply to a $2^N$-ary tree
in N dimensions.
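
In an implementation the tree requires no explicit pointers: with the coefficient
arrangement of Figure 10, a parent at position (x, y) maps to its four children in the
next finer scale by simple index arithmetic. The dyadic mapping sketched below is
the usual convention and is an assumption here rather than an excerpt of the
attached code.

    /* The four children of the wavelet coefficient at (x, y) in a 2-D
     * transform; in N dimensions the same doubling yields 2^N children. */
    void children_of(int x, int y, int child_x[4], int child_y[4])
    {
        int cx = 2 * x, cy = 2 * y;
        child_x[0] = cx;     child_y[0] = cy;
        child_x[1] = cx + 1; child_y[1] = cy;
        child_x[2] = cx;     child_y[2] = cy + 1;
        child_x[3] = cx + 1; child_y[3] = cy + 1;
    }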

3.4.7 Signal Estimation and Compression in the Wavelet Domain

The wavelet transform is an orthogonal transform, or equivalently a rotation in N-
space for a signal composed of N elements. As such it preserves the $L_2$ metric,
which is the square root of the sum of the squared differences between two signals.

PSNR, the measure of distortion introduced in section 3.2.2, is therefore preserved
under the wavelet transform. Therefore when compressing an image we find that
minimising the distortion in the wavelet domain is equivalent to minimising the
distortion in the spatial domain. This allows a very simple treatment of the
distortion in compression schemes.

Incidentally, recent research has suggested that both the human audio and visual
mechanisms physically sense data in the wavelet packet domain, an extension to the
wavelet domain. This would suggest that the wavelet domain is well suited to
perceptual coding [Majani 1994].

The DWT and its relatives may be viewed as an approximation to the KLT. Taking
into account the statistical properties of images we expect to observe both scale and
translation invariance in the autocovariance matrix $\Sigma$. The DWT explicitly assumes
such a structure, thus requiring only a small set of wavelet coefficients as shared
knowledge rather than the entire matrix. This approximation produces good
results for typical images and greatly reduces the computational load of the
transform. See [van der Walle 1995] for a more formal discussion of the
relationship between the DWT and the KLT.

3.5 Embedded Zerotree Wavelets and Set Partitioning in
Hierarchical Trees

EZW and SPIHT are two efficient schemes for representing the wavelet coefficients
of an image. They are among the best available image compression schemes
currently under research. Since its invention in the mid-1990s SPIHT has become
a de facto baseline against which all other image compression schemes are
measured. We now discuss the underlying principles and operation of EZW and
SPIHT, before summarising their advantages for real-time video compression.

As we have observed, the wavelet transform is an approximation of the Karhunen-
Loeve transform for images. It removes the correlations between pixels in the
image, producing a set of coefficients heavily biased towards 0. This allows a lossy
coding scheme to discard the majority of the coefficients to obtain an efficient
approximation of an image, sending only the significant coefficients for
reconstruction by the receiver.

However specifying the positions of the significant coefficients occupies a
substantial portion of the bit budget, particularly at low bit rates. Reducing this
overhead requires some additional insight into the wavelet transform of an image.
A Spatial Orientation Tree (SOT) is an efficient representation of the positions of
the significant coefficients.

The EZW and SPIHT schemes are based on the following three concepts:
1. Successive approximation of the significant wavelet coefficients from the
most significant bit down to the least significant.
2. A partial and reversible ordering of the wavelet coefficients by magnitude.
3. Utilisation of the self-similarity of the wavelet coefficients across different
scales in a SOT.

The partial ordering of the wavelet coefficients allows them to be sent in order of
importance. The successive approximation/refinement of the coefficients ensures
that at any stage the minimum Sum of Square Error (SSE) approximation of the
coefficients has been transmitted. The SOT allows the efficient representation of
the positions of the significant coefficients.

EZW and SPIHT differ only in their choice of SOT, so we begin by describing the
first two concepts.

3.5.1 Successive Approximation

Successive approximation is the process of using a fixed-point binary representation
of each coefficient, sending the nth bit of each coefficient in the image, before
decrementing n. It is also known as bit-plane coding as it treats the image as a
number of planes (one for each bit number) and compresses the corresponding
binary array. At any stage the coefficient's value is specified to within a certain
range, the width of which is halved with each successive approximation. The
receiver tentatively assumes that the coefficient is in the centre of the current range
at every stage in order to minimise SSE. The threshold in this case is $T = 2^n$,
corresponding to the nth bit. All coefficients insignificant with respect to this
threshold are hence known to be a 0 in the corresponding bit plane and need not be
transmitted. As the threshold is repeatedly halved the range of each significant
coefficient halves accordingly, quartering their contribution to the SSE. New
coefficients become significant at each stage and their position must be encoded.
Any coefficient that is already significant is simply refined by sending its next
lower bit. For the same number of bits, successive approximation allows a large
number of coefficients to be known approximately instead of a small number of
coefficients to be known exactly.
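
A bare-bones sketch of this bit-plane loop in C, with the tree structures of the
following sections deliberately omitted so that one significance or refinement bit is
emitted per coefficient per plane (names and the output routine are illustrative only):

    #include <stdio.h>
    #include <stdlib.h>

    /* Stand-in for the real bit stream output. */
    static void emit_bit(int bit) { putchar(bit ? '1' : '0'); }

    /* Bit-plane (successive approximation) coding of the integer wavelet
     * coefficients c[0..n-1], where nbits is one more than the highest
     * significant bit position. */
    void bitplane_encode(const int *c, int n, int nbits)
    {
        for (int bit = nbits - 1; bit >= 0; bit--) {
            int T = 1 << bit;                  /* current threshold T = 2^bit */
            for (int i = 0; i < n; i++) {
                int mag = abs(c[i]);
                if (mag >= 2 * T) {            /* already significant: refine */
                    emit_bit((mag >> bit) & 1);
                } else if (mag >= T) {         /* newly significant           */
                    emit_bit(1);
                    emit_bit(c[i] < 0);        /* sign                        */
                } else {
                    emit_bit(0);               /* still insignificant         */
                }
            }
        }
    }

The purpose of the spatial orientation trees described next is to replace the long runs
of insignificance bits in this loop with a few set-level decisions.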

3.5.2 Partial ordering by magnitude

Partial ordering by magnitude is the process of specifying the set of coefficients that
are significant with respect to the current threshold T. Once a coefficient becomes
significant, it remains significant at each lower threshold such that the SOT does
not need to take it into account. It is a reversible sort of the coefficients by
magnitude, such that the largest coefficients are sent first. This ensures that at any
stage the compressed bit stream achieves the smallest SSE possible.

3.5.3 Spatial Orientation Trees

The Spatial Orientation Tree is responsible for obtaining an efficient representation
of the positions of the significant coefficients. Although the wavelet coefficients
are uncorrelated, they are not all independent. The magnitudes of the coefficients at
the lowest scale are somewhat equivalent to an edge map of the image, consisting
of an x-edge subimage, a y-edge subimage, and an xy-edge (corner) subimage.
Each successively higher scale represents edges on a larger scale, with a
correspondingly lower spatial resolution. As a result the edge map at a given scale is
approximately a downsampled version of the edge map at the next lower scale.
This scale space of edges allows us to state the following heuristic:

If a coefficient is insignificant with respect to a threshold T (|x| < T), then all
descendents are likely to be insignificant also.

This extra redundancy allows us to efficiently determine the position of the
significant wavelet coefficients using a Spatial Orientation Tree (SOT). The choice
of SOT is so fundamental that each compression scheme is named after it.

3.5.3.1 EZW's SOT

The Embedded Zerotree Wavelet coder uses this property in its immediate form.
The significance map of the wavelet coefficients forms a quadtree, with the
property that an insignificant node is very likely to have insignificant descendants.
EZW assigns the following labels to each node at a given threshold:

Table 4 EZW Coefficient Labels
Label Description
P (for Positive) Positive Significant coefficient
N (for Negative) Negative Significant coefficient
Z (for Zero) Insignificant, but with a significant descendant
T (for zero-Tree root) Insignificant, with insignificant descendants

If a node is labelled as a zero-tree root then none of its children's labels need to be
transmitted. Thus the vast majority of the insignificant coefficients are labelled by
a few zero tree roots in the highest levels of the tree, producing an extremely
efficient representation.

3.5.3.2 SPIHT's SOT

EZW's SOT is inefficient, as a coefficient that is already known to be significant
must be relabelled significant at each lower threshold. Set Partitioning in
Hierarchical Trees avoids this by treating the insignificant coefficients as subtrees
represented by their root node. As new coefficients become significant during
encoding, these subtrees are repeatedly split into smaller subtrees until the
significant coefficients have been extracted. The following three lists are
maintained in synchronisation by both the encoder and the decoder:
1. The list of insignificant sets, or subtrees (LIS)
2. The list of significant pixels (LSP)
3. The list of insignificant pixels (LIP)
The list of insignificant pixels arises in the later stages of encoding as some trees
are split down to their component pixels. Note that every element in an
insignificant set is insignificant.

There are two types of entries in the LIS:
1. Type A entries specify that all descendants of the given node are insignificant.
2. Type B entries specify that all 2nd descendants (grandchildren) and lower are
insignificant.
The use of two types of sets is found to produce a more efficient representation than
a single type allows.

3.5.4 The EZW and SPIHT algorithms

Both algorithms are very similar with the exception of the SOT, so here we describe
only SPIHT. For a more detailed description of the algorithms see the original
papers [Shapiro 1993] and [Said & Pearlman 1996]. Both EZW and SPIHT have
symmetric encoding/decoding algorithms, which means that the encoding algorithm
is the same as the decoding algorithm with each output step replaced by an input
step.

3.5.4.1 Algorithm 1 SPIHT encoding
1. Initialisation:
1.1. Output n, the highest significant bit position.
1.2. Set the LSP as empty
1.3. Add the significant coefficients' co-ordinates to the LIP, and those with
descendants to the LIS as type A entries.
2. Sorting pass
2.1. For each entry in the LIP:
2.1.1. Output their significance at the current threshold
2.1.2. If they have become significant, move them to the LSP and output
their sign.
2.2. For each entry in the LIS:
2.2.1. If the entry is of type A, then
Output the significance of the set
If it became significant, split it into its 4 Type B subsets and append
them to the end of the LIS
Add each child of the root to the end of the LSP or LIP as
appropriate, outputting their significance. Output their sign if they
have become significant.
2.2.2. If the entry is of type B, then
Output the significance of the set
If it became significant, split it into its 4 Type A subsets and append
them to the end of the LIS
Remove this entry from the LIS.
3. Refinement pass
3.1. For each entry in the LSP, except the new entries, output the n-th most
significant bit of |x|.
4. Quantisation step update
4.1. Decrement n (halve the threshold) and go back to Step 2.

For further detail, see the attached code [Appendices, 10.2 SPIHT.c]. This
algorithm has been extended to colour by transformation to YUV space before
coding the three components independently. When interlaced, the three coded
streams compete for bit allocation such that the PSNR is maximised at each point.

3.5.5 Properties of EZW and SPIHT

Both schemes produce an embedded representation of the image, such that the
compressed bit stream may be truncated to any length without recompression and
still achieve the best possible compression for the corresponding number of bits.
This ensures completely controlled Variable Bit Rate coding, allowing an image to
be coded for either a target quality or bit length.

EZW and SPIHT are also substantially faster than their rivals, particularly in the
compression stage where most image coders fall down. The only difference in
speed is a preprocessing stage in the encoder where we calculate the significance of
each tree, removing the need to test each coefficient of a subtree in Step 2 above.
This reduces the encoding algorithm from $O(N^4)$ to $O(N^2)$ for an NxN image. A
fast encoding algorithm is very important for a real-time video codec.
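
That preprocessing stage can be sketched as a single bottom-up pass which records,
for every node, the largest magnitude in its subtree, so that the later significance test
of a whole set is a single comparison. The heap-style array layout below is an
illustrative assumption, not the layout of the attached SPIHT.c.

    #include <stdlib.h>

    /* max_mag[i] = largest |coefficient| in the subtree rooted at node i, for a
     * complete binary tree stored in heap order (children of i are 2i and
     * 2i + 1, nodes numbered 1..n). One bottom-up pass fills the table. */
    void build_subtree_max(const int *coeff, int *max_mag, int n)
    {
        for (int i = n; i >= 1; i--) {
            int m = abs(coeff[i]);
            if (2 * i <= n && max_mag[2 * i] > m)
                m = max_mag[2 * i];
            if (2 * i + 1 <= n && max_mag[2 * i + 1] > m)
                m = max_mag[2 * i + 1];
            max_mag[i] = m;
        }
    }

    /* A set rooted at node i is significant at bit plane nbit if and only if
     * its largest magnitude reaches the threshold 2^nbit. */
    int set_significant(const int *max_mag, int i, int nbit)
    {
        return max_mag[i] >= (1 << nbit);
    }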

An additional advantage becomes evident from an alternate view of both schemes.
Each scheme transmits the values of the significant coefficients and discards the
coefficients of low magnitude. This is equivalent to a near-optimal wavelet image
denoising scheme [Donoho & Johnstone 1994], so compression at high bit-rates
actually cleans up noisy images to improve their quality!

3.6 Motion-based video coding

3.6.1 Underlying Model

Video is composed of a sequence of successive images, or frames, of a scene. Each
frame is closely related to its predecessor, with deformations due to objects moving
within the scene and new information in the form of new objects entering the scene.
By suitably modelling these deformations of the scene between frames we may
remove a large amount of redundancy in the video data.

This view of video data motivates the widely used video compression paradigm of
motion estimation and compensation. Motion estimation is the process of
determining the motion between two successive frames of a video sequence, and is
performed in the transmitter. The set of motion vectors, one assigned to each pixel,
is known as a motion field. Motion compensation is the process of applying the
motion field in the receiver to the previous frame to produce an estimate of the
current frame. A residual image, the difference between the motion estimated
frame and the actual frame, is also transmitted to account for new objects entering
the scene.

3.6.2 Residual Image

Some points will not be present in both frames. For instance an object in the
foreground may move to reveal a portion of the background that was previously
hidden. These points clearly cannot be matched from the previous frame, so the
motion field will not successfully predict their values.

In order to simplify the representation of the motion field, we assign such points a
locally average motion vector. This creates a complete motion field by filling in the
gaps in a smooth way such that the motion field may still be easily compressed.

This leaves the residual image to represent the actual content of the image at that
location. The residual image is defined as the difference between the actual frame,
and the predicted frame based on the previous frame and the corresponding motion
field.
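
Using the notation introduced for the motion field in Section 4.1, one way to write
this definition is:

R_n(x, y) = I_n(x, y) - I_{n-1}( x + V_x(x, y), y + V_y(x, y) )

where I_n is the current frame, I_{n-1} the previous frame and (V_x, V_y) the motion
field applied in the receiver.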
3.7 Speech Compression

There are currently a wide variety of speech compression standards available. At
low bit rates ranging from 2.4 to 4.8kbps are the LPC model based vocoders such as
LPC-10e, CELP, MELP, and the more advanced Mixed-Band Excitation vocoders.
At the higher bit rates ranging upwards from 8kbps are more general audio codecs,
such as G.711, G.722 and G.728 that have been developed for the H.320
videoconferencing standard, and G.729 which has been developed for mobile
telecommunications [G.729 1996]. Given the very low available bandwidth the
model-based vocoders are of particular interest in this thesis.

3.8 Linear Predictive Coding

3.8.1 Speech Production Model

Much of the following theory is based on Rabiner and Schafer [Rabiner & Schafer
1978].

Linear Predictive Coding is based on a simple speech production model consisting
of an excitation source and a time-varying filter.

The fundamental blocks of speech are called phonemes. Phonemes are classified
into vowels (such as a, e, i, o, u), diphthongs (oa, ou, ow), semivowels (w, l, r),
nasals (m, n), unvoiced fricatives (f, th, s), voiced fricatives (v, z), voiced stops (b,
d, g), unvoiced stops (p, t, k) and affricatives (j, h). This division can be simplified
into the two major categories of phonemes: voiced and unvoiced.

Voiced phonemes are produced by a series of periodic pulses from the vocal cords.
The period between pulses determines the voice's pitch frequency. As these pulses
travel along the vocal tract, the oesophagus, tongue and nasal cavity act as an
acoustic filter, shaping the frequency spectrum of the speech waveform. The
resulting filter varies slowly relative to the pitch period, which is typically between
1 and 10 milliseconds.

Unvoiced phonemes are generated by forming a constriction at some point in the
vocal tract, usually in the mouth, forcing the air through the constriction at a high
enough velocity to produce turbulence. This turbulent flow produces white noise
which is acoustically filtered by the mouth and nasal cavity.

The vocal tract filter can be modelled as an all-pole filter. This assumption is only
true for non-nasal phonemes, as the vocal tract lacks side-branches to produce
zeroes. For nasal phonemes it is considered a satisfactory approximation as zeros
are less perceptible than poles.

This simple model of speech production therefore represents the speech by its
amplitude, voicing decision, pitch (if voiced), and vocal tract filter. The speech is
broken up into 30 to 50 frames per second for analysis and synthesis. By extracting
the parameters for each frame in the transmitter and reconstructing the speech at the
receiver we can very efficiently represent the speech waveform with quite
acceptable quality.


Figure 16 LPC Speech Production Model

3.8.2 Vocal Tract Filter Analysis

Successful LPC speech coding requires the accurate extraction of the model
parameters: voicing, pitch, amplitude, and the filter. The heart of the vocoder is in
the extraction of the filter coefficients. Given the filter model assumed above, the
speech waveform is the result of filtering the source by an all-pole filter. Therefore
we can retrieve the source waveform from a frame of speech by passing it through
the inverse (all-zero) filter.

For the system in Figure 16 above, the speech samples s(n) are related to the
excitation u(n) by:

s(n) = \sum_{k=1}^{p} a_k \, s(n-k) + u(n)

where p is the order of the filter.

The filter coefficients a_k are assumed constant over the analysis frame. This
assumption of a weakly stationary signal is approximately true over short time
periods, typically fewer than 10 pitch periods.

So the all-zero filter's output is:

u'(n) = s(n) - \sum_{k=1}^{p} a_k \, s(n-k)

where u'(n) is an approximation to u(n), the unknown source waveform.

To determine the filter's coefficients \{a_k\}, we calculate the waveform's covariance
function \{r_k\} over the frame under analysis and solve the corresponding
Yule-Walker equations:

\begin{bmatrix}
r_0     & r_1     & \cdots & r_{p-1} \\
r_1     & r_0     & \cdots & r_{p-2} \\
\vdots  & \vdots  & \ddots & \vdots  \\
r_{p-1} & r_{p-2} & \cdots & r_0
\end{bmatrix}
\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix}
=
\begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_p \end{bmatrix}


There are a number of fast solutions to the Yule-Walker equations which make use
of the symmetric Toeplitz structure of the covariance matrix. A recursive solution
known as the Durbin-Levinson algorithm may be used if p is unknown, however
this will not be necessary in the following.

This procedure extracts the filter coefficients, and with little extra effort we may
then extract the source waveform. The resultant coefficients are subsequently
quantised for inclusion in the encoded frame packet.
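
As an illustration of how the Toeplitz structure is exploited, the following is a
minimal sketch of the Durbin-Levinson recursion mentioned above, in the style of the
appendix code. It is a textbook formulation rather than the routine used in the MVP,
and it assumes r[0] is non-zero.

/* Solve the order-p Yule-Walker equations by the Durbin-Levinson recursion.
   r[0..p] : covariance sequence of the analysis frame
   a[1..p] : output filter coefficients (a[0] is set to 1)
   Returns the final prediction error energy.  The reflection coefficient k
   found at each order corresponds to the coefficients whose bit allocation
   is given in Section 3.8.5. */
static double levinsonDurbin(const double *r, double *a, int p)
{
    double E = r[0];                      /* prediction error at order 0 */
    int i, j;

    a[0] = 1.0;
    for (i = 1; i <= p; i++) {
        double k = r[i];                  /* numerator of the reflection coefficient */
        for (j = 1; j < i; j++)
            k -= a[j] * r[i - j];
        k /= E;

        a[i] = k;                         /* extend the coefficient set to order i */
        for (j = 1; j <= i / 2; j++) {    /* symmetric in-place update of a[1..i-1] */
            double tmp = a[j] - k * a[i - j];
            a[i - j] -= k * a[j];
            a[j] = tmp;
        }
        E *= (1.0 - k * k);               /* updated prediction error */
    }
    return E;
}

The recursion takes O(p^2) operations rather than the O(p^3) of a general linear
solver, and produces the solutions for every order up to p as a by-product.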

3.8.3 Voicing and pitch determination

While the voicing and pitch can be detected directly from the speech waveform a
more robust approach is to observe the source waveform. This technique is
preferred as the filter estimation allows us to generate the source waveform with
little extra computation.

The approach we discuss here measures the pitch frequency of the source
waveform. If the source waveform does not have a well-defined pitch then it is
designated unvoiced. The calculation of the pitch period measures the Average
Magnitude Difference Function over the range of possible pitch periods \tau:

AMDF(\tau) = \frac{1}{T} \int_0^T \left| u(t) - u(t - \tau) \right| \, dt

The source waveform is delayed by the candidate pitch period and subtracted from
itself, with the average magnitude of the resulting difference determining the success
of the match. Once the best match has been found the voicing decision is made based
on the AMDF's value, and the voicing and pitch are included in the frame packet.
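
A minimal discrete-time sketch of this search is given below; the lag range and the
treatment of the voicing threshold are illustrative rather than values from the MVP
implementation.

#include <math.h>

/* Search for the lag (in samples) that minimises the Average Magnitude
   Difference Function of the source waveform u[0..frameLen-1].
   minLag/maxLag bound the candidate pitch periods; *bestAmdf receives the
   minimum AMDF value so the caller can make the voicing decision. */
static int amdfPitchSearch(const float *u, int frameLen,
                           int minLag, int maxLag, float *bestAmdf)
{
    int lag, n, bestLag = minLag;
    float best = 1e30f;

    for (lag = minLag; lag <= maxLag && lag < frameLen; lag++) {
        float sum = 0.0f;
        for (n = lag; n < frameLen; n++)
            sum += fabsf(u[n] - u[n - lag]);
        sum /= (float)(frameLen - lag);      /* average magnitude difference */
        if (sum < best) {
            best = sum;
            bestLag = lag;
        }
    }
    *bestAmdf = best;
    return bestLag;                          /* candidate pitch period */
}

A frame would then be declared unvoiced if the returned minimum AMDF is not
sufficiently small relative to the frame's average magnitude.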

In practice making a reliable voicing decision is quite difficult so additional cues
are used. Voiced frames tend to have a lot more energy than unvoiced frames so
the speech amplitude gives an indication of the voicing. We may also delay the
speech to allow a soft decision based on the voicing of past and future frames. As
the voicing may change abruptly mid-frame, a separate voicing decision is often
made for both halves of the frame at the cost of one extra bit per frame. In part it is
the difficulty of making accurate voicing decisions that has led to the wide variety
of extensions to LPC.

3.8.4 Voice Synthesis

The speech production model of Figure 16 suggests a very simple speech synthesis
scheme. For each frame generate a pulse train or white noise according to the
voicing and pitch, pass it through the all-pole filter whose coefficients were
extracted in the transmitter, and amplify the resulting speech waveform to the
desired volume.
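
A minimal per-frame sketch of this synthesis step is shown below; it drives the
order-p all-pole filter of Section 3.8.2 with either a pulse train or white noise,
ignores the inter-frame smoothing discussed next, and uses rand() as a stand-in
noise source.

#include <stdlib.h>

/* Synthesise one frame of speech from its LPC parameters.
   a[1..p]      : all-pole filter coefficients
   hist[0..p-1] : the previous p output samples, carried across frames
   voiced       : non-zero selects a pulse-train excitation, zero selects noise
   pitch        : pitch period in samples (ignored when unvoiced)
   gain         : excitation amplitude.  Assumes frameLen >= p.              */
static void synthesiseFrame(float *out, int frameLen, const double *a, int p,
                            float *hist, int voiced, int pitch, float gain)
{
    int n, k;
    for (n = 0; n < frameLen; n++) {
        /* Excitation: periodic unit pulses if voiced, white noise otherwise */
        float u = voiced ? ((pitch > 0 && n % pitch == 0) ? 1.0f : 0.0f)
                         : ((float)rand() / (float)RAND_MAX - 0.5f);

        /* All-pole filter: s(n) = sum_k a_k s(n-k) + gain * u(n) */
        float s = gain * u;
        for (k = 1; k <= p; k++) {
            float past = (n - k >= 0) ? out[n - k] : hist[p + (n - k)];
            s += (float)a[k] * past;
        }
        out[n] = s;
    }
    for (k = 0; k < p; k++)               /* save history for the next frame */
        hist[k] = out[frameLen - p + k];
}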

In practice though the realistic regeneration of continuous speech from the LPC
parameters is hindered by the inherent discontinuity between speech frames. A
good synthesiser must produce a smooth transition between successive frames,
taking into account the voicing of each frame and the synchronisation of pitch
pulses. In general it must also ensure that the filter coefficients and pitch period
vary smoothly.

3.8.5 LPC-10e Specifics

The LPC-10e standard accepts 16-bit speech samples at 8kHz. It decomposes the
signal into analysis frames of 180 speech samples, generating 44.4 frames per
second. Each frame is analysed with a 10-pole filter model using the Yule-Walker
equations, followed by pitch determination using the AMDF to detect the signal
periodicity. The voicing decision is made based on the minimum AMDF value and
the presence or absence of voicing in neighbouring frames. The amplitude is
simply taken from the RMS value for the frame. Pre-filtering is applied to attenuate
frequencies over 700Hz using a human auditory model [Rabiner & Schafer 1978].

Each frame is coded separately into a 54-bit data frame, producing a compressed
data stream at a fixed rate of 2.4kbps:

Table 5 LPC-10e Bit Allocation
Component                      Bits Allocated
Frame Synchronisation          1
Pitch and Voicing              7
Amplitude                      5
Reflection Coefficients 1-4    5 each
Reflection Coefficients 5-8    4 each
Reflection Coefficient 9       3
Reflection Coefficient 10      2
The reflection coefficients are derived from the filter coefficients \{a_k\} introduced
in Section 3.8.2 above.

As the pitch does not exist for an unvoiced frame, the pitch and voicing bits are
combined into a single symbol. For unvoiced frames only the first 4 reflection
coefficients are encoded, as unvoiced frames typically have fewer poles [Rabiner &
Schafer 1978]. The remaining 21 bits are used to Hamming encode the 28 most
significant bits of the frame, which are the frame synchronisation, pitch, voicing,
amplitude and the significant bits of the reflection coefficients.

The synthesis module in the decoder ensures that the pitch pulses are aligned across
frames as well as smoothing the amplitude and pitch frequency between frames. It
also generates smooth transitions between voiced and unvoiced frames. Post-
filtering is applied to the output to remove the pre-filtering in the encoder.
Chapter 4 - Theoretical Contribution

4.1 Motion Estimation

Motion estimation typically dominates the computational load of a video
compression algorithm. Some naïve implementations break each frame into blocks,
exhaustively searching for the displacement of each block between frames that
yields the best match. For example, Common Intermediate Format (CIF) video has
352x288 pixels at 15 frames per second [Cherriman 1996]. Exhaustive motion
estimation with a maximum displacement of 30 pixels requires approximately
5.7*10^9 pixel comparisons per second, well beyond the capabilities of most
processors. To complicate this further we often require half-pixel resolution in the
motion field, more than quadrupling the processing. In order to reduce this to a
reasonable load we must restrict our search space by modelling the structure of the
motion field.

4.1.1 Model-based Motion Estimation

Typically in videoconferencing a scene is composed of a relatively small number of
objects moving independently. These objects may be faces, people or the
background furniture. Each individual object will tend to translate, rotate and
deform slowly relative to the frame rate. Translations are the simplest form of
motion to analyse and synthesize, producing a constant region in the motion field.
Rotations occur in three dimensions and produce a smoothly varying motion field
across the surface of an object. Deformations, such as a person moving their mouth
as they speak, tend to stretch the object smoothly between frames. Thus it seems
reasonable to assume that regardless of the type of motion, the values of the motion
field at two neighbouring points within an object will be quite close.

In contrast we will observe discontinuities in the motion field between two objects
moving at different velocities. As we assume that such motion is independent there
will be no correlation to exploit between the two neighbouring velocities.

In determining the motion field we break the image into 4x4 pixel blocks. Each
block is compared to the previous frame within a dynamic search range using the
L2, or Sum of Squared Error, metric. If a match with sufficiently low SSE is found
then the corresponding displacement forms the value of the motion field for that block.

Let I_n be the nth frame of the video sequence. The motion field V specifies the
offset between each pixel of I_n and its best match in I_{n-1} as follows:

I_n(x, y) \approx I_{n-1}( x + V_x(x, y), y + V_y(x, y) )

Within an object in a scene we impose the following smoothness constraint on V:

\| \nabla V \| \le M

where M is the maximum motion field gradient. This hard constraint reduces the
size of the search space.

Too large a choice of M relaxes the restriction on the motion field, allowing an object
to deform or rotate rapidly. This increases the search space though, requiring more
computation and generally producing a noisy motion field estimate. Too small a
choice of M forces the scene to split into many more objects, requiring more
exhaustive searches and again more computation. The optimal value of M is
determined empirically, proportional to the video resolution and inversely
proportional to the frame rate.
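
A minimal sketch of the constrained block comparison is shown below: the search is
centred on the motion vector already found for a neighbouring block and limited to a
radius of M, and a block is accepted only if the caller finds the returned SSE below
its matching tolerance. The luminance-only frame layout and helper names are
assumptions for illustration.

#include <limits.h>

#define BLOCK 4   /* block size in pixels (4x4 blocks, as above) */

/* Sum of squared errors between the BLOCKxBLOCK block of the current frame at
   (bx, by) and the previous frame displaced by the candidate motion vector
   (vx, vy).  Frames are width*height arrays of 8-bit luminance samples, and
   (bx, by) is assumed to lie at least BLOCK pixels inside the frame. */
static long blockSSE(const unsigned char *cur, const unsigned char *prev,
                     int width, int height, int bx, int by, int vx, int vy)
{
    long sse = 0;
    int x, y;
    for (y = 0; y < BLOCK; y++) {
        for (x = 0; x < BLOCK; x++) {
            int cx = bx + x, cy = by + y;
            int px = cx + vx, py = cy + vy;
            int diff;
            if (px < 0 || py < 0 || px >= width || py >= height)
                return LONG_MAX;          /* displaced block leaves the frame */
            diff = (int)cur[cx + width * cy] - (int)prev[px + width * py];
            sse += (long)diff * diff;
        }
    }
    return sse;
}

/* Search the (2M+1)x(2M+1) window centred on a neighbour's motion vector
   (nvx, nvy) and return the displacement with the lowest SSE. */
static void constrainedSearch(const unsigned char *cur, const unsigned char *prev,
                              int width, int height, int bx, int by,
                              int nvx, int nvy, int M,
                              int *vx, int *vy, long *bestSSE)
{
    int dx, dy;
    *vx = nvx;
    *vy = nvy;
    *bestSSE = LONG_MAX;
    for (dy = -M; dy <= M; dy++) {
        for (dx = -M; dx <= M; dx++) {
            long sse = blockSSE(cur, prev, width, height, bx, by,
                                nvx + dx, nvy + dy);
            if (sse < *bestSSE) {
                *bestSSE = sse;
                *vx = nvx + dx;
                *vy = nvy + dy;
            }
        }
    }
}

In the surface-tracking algorithm described below, only the seed blocks require an
exhaustive search over the full displacement range; all other blocks use this
constrained search.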


Figure 17 - Previous Frame

Figure 18 - Current Frame

Figure 19 - Motion Field V_y

Table 6 Motion Field Example

The pair of sequential frames shown above (Table 6) demonstrates the key features
of a motion field. The background is stationary and so produces a large flat area in
the motion field estimate. The face is moving downwards and produces the bright
face-shaped region. Careful comparison of the two frames shows that the right
eyebrow and the mouth are moving relative to the face, as can be clearly seen in the
motion field estimate.

This simplified model of the motion in a video sequence suggests the following
algorithm:
4.1.2 Algorithm 2 - Model-Based Surface Tracking
1. Select a few points distributed throughout the image, designated the seeds.
2. Accurately and robustly determine the motion of the seeds through an
exhaustive search. Now each neighbouring point is likely to lie within the
object also, and hence is likely to have a motion vector of similar value.
3. Breadth First Search outwards from the seed, determining the motion of
each point in the object within the maximum perturbation M of its
neighbours' values.
4. If a match cannot be found for a block within the matching tolerance then it
is likely to lie outside of the object, so don't accept a match. Continue
matching points until the BFS halts.
5. Choose as a new seed point the point furthest from all currently known
objects. Heuristically this increases the probability that it lies in the centre
of an object where the most reliable matches tend to be made. Determine
this point by ultimate erosion of the set of unmatched points. Go back to
step 2, obtaining a motion vector for each point in the new object.
6. Continue finding new objects and determining their motion fields until one
of the following conditions is met:
a. A real-time deadline occurs
b. No remaining points may be matched
7. Fill any gaps in the motion field estimate by local averaging.

For further detail, see the attached code [Appendices, 10.3 MEC.c].

4.1.2.1 High Frame Rate Limit

The values of the motion field are inversely proportional to the frame rate. As ∇
is a linear operator we find that for high frame rates M can be reduced proportionally.
As the search size is proportional to M^2, we see that this algorithm is
O(1 / (frame rate)^2) per frame. As this algorithm is run once per frame, we then
see that as the frame rate increases the absolute time spent on motion estimation
decreases proportionately.

4.2 Variable Bit Rate Linear Predictive Coding

The LPC-10e speech codec assumes a fixed channel bandwidth of 2.4kbps,
introducing additional redundancy into the data stream in frames where the entire
bandwidth is not needed. However in many situations a variable bit rate codec is
more desirable as it is able to achieve a lower average bit rate.

LPC-10e may be improved by introducing silence detection. Silence may be
detected from the amplitude and voicing of the speech frame. The corresponding
data frame need not be sent, with a suitable overhead to ensure that the receiver
maintains frame synchronisation.

The compression may be improved further still by removing the Hamming bits
from the unvoiced frames, producing a shorter data frame. As the pitch and voicing
data precedes the unused bits of an unvoiced frame, the receiver will be able to
determine the frame data length without further synchronisation information. See
figure 20 for the structure of a compressed data packet.
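
One plausible packing consistent with the figures quoted in the next paragraph (a
one-bit flag per frame, a full 54-bit LPC-10e frame when voiced, and a shorter frame
with the Hamming bits removed when unvoiced) can be sketched as follows; the exact
field widths and the silence threshold are illustrative assumptions rather than the
MVP's bitstream definition.

enum frameType { FRAME_SILENT, FRAME_UNVOICED, FRAME_VOICED };

/* Illustrative frame lengths in bits: a one-bit silence flag always sent,
   followed by a full LPC-10e frame (54 bits) when voiced or a frame with the
   Hamming bits removed when unvoiced. */
#define BITS_SILENCE_FLAG  1
#define BITS_VOICED       54
#define BITS_UNVOICED     33

/* Classify a frame from its RMS amplitude and voicing decision, and return
   the number of bits the encoder emits for it.  silenceThresh is a tuning
   parameter of the silence detector. */
static int vbrFrameBits(float rmsAmplitude, int voiced, float silenceThresh,
                        enum frameType *type)
{
    if (rmsAmplitude < silenceThresh) {
        *type = FRAME_SILENT;
        return BITS_SILENCE_FLAG;
    }
    *type = voiced ? FRAME_VOICED : FRAME_UNVOICED;
    return BITS_SILENCE_FLAG + (voiced ? BITS_VOICED : BITS_UNVOICED);
}

At 40 frames per second these lengths reproduce the 40bps silence figure and come
close to the quoted instantaneous peak rate for voiced speech.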

To assist in the later integration of audio and video data frames, the audio frame
rate is altered to 40fps. The resulting audio stream averages 600bps during
conversation, ranging from 40bps during silence to 2.1kbps instantaneous peak rate.


Figure 20 - Bit Allocation Scheme for VBR LPC

Chapter 5 - Design and Implementation

5.1 Specifications

The aim of the project was to build a device known as the Mobile Video Phone or
MVP. It was to plug into a mobile phone, extending its capabilities to 2-way audio
and video communication. Naturally such a device would include:
1. A video camera
2. A video display
3. A microphone
4. A speaker
5. A processor

Key design requirements for the MVP were:
1. Good quality video
2. Acceptable quality audio
3. Real-time compression and decompression
4. Very low bitrate compression
5. Scalability
6. Portability

5.1.1 Video and Audio Quality

Clearly to be accepted by the consumer a mobile videophone must have satisfactory
quality audio and video compression. GSM Phase 1 phones can achieve between
9.6kbps [Buckingham 2001] and 14.4kbps [MDWC 2001], all of which is devoted
to audio data, while the MVP must send both audio and video through the same
channel. Thus it was decided to emphasise the video quality at the expense of the
audio quality, while maintaining an intelligible quality for the audio. With the very
limited bandwidth available, the resolution and frame rate of the video were kept to a
moderate 128x128 colour pixels at 10 frames per second, equivalent to 4Mbps prior
to compression. However with the bit allocation scheme developed and the
embedded nature of SPIHT the codec is scalable with bandwidth, allowing for
increased video quality with future improvements in the mobile phone networks.

5.1.2 Real-time Compression and Decompression

The low channel rate of 12-25kbps required very strong compression of both audio
and video. Real-time compression and decompression of video requires a very fast
processor and well-designed algorithms. In addition it restricts the delay of the
codec, reducing the extent to which the codec may look ahead to future frames.
Thus schemes such as 3D SPIHT are inapplicable or severely limited in such an
application. Speech is typically 8-16 bits wide with a sampling rate of 8-22kHz and
requires relatively little processing.

5.1.3 Portability

For portability a fast and low power processor is required, ideally one which is
optimised for the manipulation of audio and video data. The natural choice of
processor for such a task is a Digital Signal Processor or DSP. While more recent
processing technologies such as Field Programmable Gate Arrays are capable of
higher speeds, the high implementation complexity of such a solution is a strong
disincentive. For ease of implementation we selected the TMS320C6701, the fastest
available floating point DSP at the time.


Figure 21 - Project Block Diagram

The project was divided into two areas, hardware and software. The software
includes the audio and video compression, error control and synchronisation, and
drivers for the audio and visual interfaces. The hardware includes the camera and
converter, display, microphone and speaker, and supporting circuitry for the DSP.
This thesis is predominantly concerned with the software; for a detailed
examination of the hardware see Elliot Hill's thesis [Hill 2001].

5.2 Software Components

5.2.1 Audio Compression

As described in Section 4.2 an enhanced VBR LPC audio codec was developed.
The compressed bitstream averaged 600bps during conversation, with a peak rate of
2400bps. Initial testing indicated that the quality was quite acceptable for voice
communication with low background noise. As a variable rate codec the output bit
stream is very sensitive to errors and requires a synchronisation overhead.

5.2.2 Video Compression

The video compression was based on motion compensation with a residual image
taking account of new image components. Both the motion field and the residual
image were compressed using the SPIHT image compression scheme. As SPIHT is
lossy both the motion field and the residual must be decompressed in the transmitter
to maintain synchronisation, generating some computational overhead. In exchange
such a scheme greatly compresses the motion component, significantly improving
the video quality.

As SPIHT generates an embedded output stream it can be cut after compression to
the desired length, allowing complete control over the bit rate of each component.
This control is exploited in the design of the compression system to efficiently
utilise all of the available bandwidth.

5.2.3 Integration of Audio and Video Codecs

For robust transmission over a wireless channel the audio and video data must be
combined into a single stream. The combination must take into account the
variable output bit rates and levels of control of each of the data components.
Where necessary it must include synchronisation data so that the decoder can parse
the data packet, extracting the relevant components for each step of the
decompression.

The audio codec's output rate is uncontrolled but contains embedded
synchronisation information. As such it may be placed at the beginning of the data
packet without specifying its length. The audio codec is altered from its original
44.44 frames per second to 40 frames per second to simplify the interleaving of
audio and video frames.

The residual image data is dependent on the motion field data, and so is placed
afterwards. Poor quality motion data will produce high residuals and lose
information from the scene, so it must be coded to a fixed quality rather than bit
length. However a maximum bit length restriction is placed on the motion data to
ensure that it does not overflow the data packet length. As the SPIHT scheme does
not embed information about its length, the motion field data must have a header
appended.

The residual is placed last. Taking full advantage of the SPIHT scheme we simply
code the residual to fill the remaining data bits. As the data frames are of known
fixed length, the residual data length is known at decompression time and need not
be included.
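
A minimal sketch of assembling one such fixed-length packet is given below. The
coder entry points (encodeAudioFrame, encodeMotionField, encodeResidual, writeBits)
are hypothetical names standing in for the audio, motion and residual coders
described above, and the header width is left as a parameter; the actual layout is
the one shown in Figure 22.

/* Hypothetical coder entry points; each writes into the packet starting at
   the given bit offset and returns (where applicable) the number of bits
   written. */
extern int  encodeAudioFrame(unsigned char *buf, int bitOffset);
extern int  encodeMotionField(unsigned char *buf, int bitOffset, int maxBits);
extern void encodeResidual(unsigned char *buf, int bitOffset, int numBits);
extern void writeBits(unsigned char *buf, int bitOffset, int numBits, int value);

/* Assemble one fixed-length data packet of packetBits bits: audio first
   (self-delimiting, no length field), then a length header and the motion
   field data (coded to a target quality, capped at maxMotionBits), and
   finally the embedded residual stream truncated to fill the remainder. */
static void buildPacket(unsigned char *packet, int packetBits,
                        int motionHeaderBits, int maxMotionBits)
{
    int used = 0;
    int motionBits;

    used += encodeAudioFrame(packet, used);

    motionBits = encodeMotionField(packet, used + motionHeaderBits, maxMotionBits);
    writeBits(packet, used, motionHeaderBits, motionBits);
    used += motionHeaderBits + motionBits;

    encodeResidual(packet, used, packetBits - used);
}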


Figure 22 - Bit Allocation for Compressed Data Packets

Figure 22 above shows the bit allocation scheme used in the MVP. The diagram is
to scale, based on the average length of each data component at 12kbps. Larger
bandwidths simply increase the bits available for residue data.

For further detail, see the attached code [Appendices, 10.4 VIDCODEC.c].

5.2.3.1 Low Motion Limit

An advantage of this bit allocation scheme is found in its great flexibility. In low
motion video the motion field data reduces to as little as 1% of the bandwidth,
leaving the vast majority of the bandwidth for the residual image. In this case the
video compression algorithm collapses to a progressive SPIHT image codec, giving
an extremely high quality image of the scene after a relatively short period of time.

In general any region of smooth motion in a video sequence will become steadily
clearer over time, as the residual image progressively codes the region to produce
an exact copy of the scene. Thus the vast majority of the bandwidth is devoted to
progressively coding objects in the scene. It is only when the motion estimation
loses tracking that some quality is lost unnecessarily.

New objects entering a scene naturally take up more of the bandwidth. As the
residual image is SPIHT coded, and SPIHT is a spatially adaptive codec, new
objects produce large coefficients in the residual image and hence are allocated the
majority of the bit budget. Thus new objects in an otherwise static scene very
rapidly converge to a high quality replica.

5.3 Implementation

The target platform consisting of the DSP, audio and video interface hardware and
support circuitry was to be constructed by Elliot Hill [Hill 2001]. However toward
the end of the project it was decided that the target platform was too far behind
schedule to be used. Instead I am developing a demonstration system consisting of
a laptop computer with a GPRS phone, downloading and decompressing a pre-
compressed video stream in real time. This is near completion, due to be finalised
by the 30th of October for demonstration at the University of Queensland's
Innovation Expo.

Chapter 6 - Results

This chapter outlines the results obtained from the individual system components as
well as from the entire MVP system. It also discusses the performance of these
components and compares this with their expected behaviour.

6.1 Image Compression

The SPIHT-based image coding scheme works remarkably well, achieving very
good quality images at low compression rates and acceptable images at extremely
high compression rates.

JPEG Image (512x512x24bit)
360x Compression
2202 bytes
SPIHT Image (512x512x24bit)
800x Compression
983 bytes

Figure 23 High Compression Rate Comparison JPEG vs SPIHT


Original Image (512x512x24bit)
786kB
SPIHT Image (512x512x24bit)
50x Compression
15.7kB

Figure 24 Low Compression Rate Example

The 800 times compressed image is quite blurry but still recognisable. The 50
times compressed image is near perfect, showing only very slight blurring effects.
Observe that the artefacts generated by the wavelet compression scheme are less
noticeable than the blocking artefacts typically seen in a JPEG compressed image.

6.2 Audio Compression

The audio codec, a Variable Bit Rate Linear Predictive Coding scheme, was found
to produce a very low bitrate data stream averaging 600bps. At this rate the fraction
of the mobile phone bandwidth used by audio is relatively trivial so further
improvements in the audio compression rate are unnecessary.

It is difficult to demonstrate the quality of the compressed audio on paper. Due to
the analysis-synthesis compression technique there are no simple measures of
quality parallel to the use of PSNR in image compression. However, in the author's
admittedly subjective analysis, the audio is considered to be of moderate quality.
The speech is easily intelligible and the speaker may be recognised from the output.
Onsets of words are coded well, however some words ending in voiced stops have
noticeable coding artefacts afterwards. Due to its origins in LPC-10e this codec's
output is of closely comparable quality.

6.3 Video Compression

In this section we view the results of the video compression. Overall the quality of
the decompressed video is quite good for the low bandwidth available to GPRS
(25kbps) phones, with the quality for GSM (12kbps) phones considered
satisfactory.

6.3.1 Typical Quality


Figure 25 Sample Input Sequence


Figure 26 - Sample Output at 25kbps

In figure 26 above we see the typical output quality after 15 seconds of video has
been received. Some blurring is still evident.

6.3.2 Early Behaviour


Figure 27 - Frames 1, 2, 3, 10 of Sample Video

Figure 27 above shows the typical behaviour of the first few frames. In the first
frame there is no previous information about the scene, so the image is of very low
quality. Successive frames show the information flowing into the image in the form
of the wavelet blobs which make up the scene. After 1 second (frame 10) the
video begins to resemble the scene of interest.

[Plot: "Sample 1 - Ben Talking" - PSNR (dB) against frame number, with separate
curves for GSM, GPRS and the inter-frame camera noise level.]

Figure 28 - Sample PSNR over time

The above plot shows how the quality generally increases over time. Decay in the
quality is due to loss of tracking in the scene or to a period of motion. The line
labelled Noise is the inter-frame camera noise; heuristically if the quality goes
above this line then the codec is transmitting the noise in the image and hence
wasting bits.

6.4 Computational Results

The decompression and display system runs in real-time on a 750MHz Pentium III
laptop computer with 64MB of memory. The code consists of 6000 lines of C code
with a 1000 line C++ front end and has not been optimised to a commercial
standard. Initially targeted at the C6701 floating point DSP, the majority of the
processing is performed in floating point. It is expected that significant speed gains
may be made by conversion to a fixed-point or integer implementation, particularly
with the use of the MMX SIMD instructions available on a Pentium III or higher
processor.

The encoding of audio and video operates at 3.3 frames per second on the same
computer described above. The lower encoding rate is due to two reasons. Firstly,
the encoder must decode its own output in order to maintain synchronisation with
the receiver. Secondly, motion estimation between frames is a significant
computational burden. Due to this low compression rate the demonstration system
must use precompressed audio and video data.

6.5 Prototype System


Figure 29 - Prototype Client/Server Interface

Figure 29 above shows the user interface for the prototype client/server video-over-
IP system. It allows video and audio transfer over any TCP/IP connection. The
GPRS phone has a serial connector and plugs directly into the laptop computer as a
modem. Thus the system is mobile, performing real-time decompression on
precompressed video sequences. Applications of this prototype could include video
on demand, or with a sufficiently fast compression server, TV over a mobile phone
and mobile surveillance. A fast implementation would allow bi-directional mobile
videoconferencing, the ultimate goal of this project.

Chapter 7 - Discussion

Overall the performance of the mobile videophone is excellent. Given the very low
bandwidth of the channel, the quality of the audio and video is good. This chapter
outlines the key achievements of this system as well as its limitations. Possible
improvements and extensions are discussed for future work in the field.

7.1 Achievements

7.1.1 Extremely Low Bit Rate Speech Codec

This thesis describes the design and implementation of an extremely low bit rate
(600bps) speech codec. It is a Variable Bit Rate Linear Predictive Coding speech
compression scheme based on LPC-10e, with silence detection and voicing
dependent bit rate to reduce the average bit rate. The output quality is low but
acceptable, demonstrating both intelligibility and speaker recognisability.

7.1.2 A New Very Low Bit Rate Video Codec

Taken from a combination of existing theory and new algorithms, the original video
codec presented in this thesis achieves good quality video at the Australian GPRS
bit rate of 25kbps. It demonstrates real-time decoding on a standard laptop
computer, suitable for early mobile videoconferencing applications. The encoding
scheme is near real-time, and it is expected that suitable optimisations such as the
conversion to integer arithmetic would produce a real-time compression scheme.

7.1.3 A New Fast Motion Estimation Algorithm

A new algorithm is presented for fast motion estimation. A simple model of the
motion field in a typical video sequence is developed which assumes a small
number of objects moving independently in a scene. This model inspires an
algorithm that acquires the motion vector of an object and then tracks the motion
surface outwards, determining each object's motion field in turn. Although it does
not provide an optimal motion field, such an approach drastically reduces the size
of the search space to allow real-time motion estimation. Its fundamental model
causes it to produce smooth, easily compressible motion field estimates. The
algorithm is shown to produce less computational loading at higher frame rates,
which suggests its application to high frame rate video coding.

7.2 Future Work

Mobile videoconferencing and more general video conferencing are enormous
fields of research. Current research includes the compression of audio and video,
video-specific error correction, packet loss recovery, network latency hiding
schemes and even multimedia-driven network designs. However from this thesis a
number of specific avenues for future work have emerged.

At present the video compression scheme restricts the video frame dimensions to
powers of 2 due to the dyadic nature of the Discrete Wavelet Transform. This has
not presented a problem in this thesis as the video resolution is under our control.
However there are many situations in which a more general video codec would be
desirable. Wavelet packets and wavelet lifting extend the simple DWT to allow
non-dyadic wavelet transforms [Vetterli & Kovacevic 1995] allowing arbitrary
image sizes to be transformed. This would generate a different coefficient tree
structure, requiring the SPIHT scheme to be altered accordingly. The resulting
codec would then handle arbitrary video frame dimensions.

Data errors and data packet losses have not been dealt with in this thesis. The video
codec developed in this thesis displays strong dependence upon past data; if an
error occurs in a frame or if a data packet is lost then all subsequent frames will
exhibit noise. For future work an error correction scheme specific to the channel of
interest should be implemented. See [Vanstone & Oorschot 1989] and [Haykin
1994] for an introduction to error correction in communication systems.

Chapter 8 - Conclusion

This thesis described the theory, design and implementation of a mobile
videophone. The system demonstrates good performance at 25kbps (GPRS), with
satisfactory performance at 12kbps (GSM). A variable bit rate speech codec based
on LPC-10e has been developed requiring only 600bps, significantly increasing the
bandwidth available for video. The original video compression developed is
scalable to ensure that the quality improves with future increases in mobile
communications bandwidth. A prototype has been implemented on a laptop
computer connected to a GPRS phone, demonstrating the system's potential.

This thesis covered the following: Chapter 1 introduced the need for and the
applications of videoconferencing. Chapter 2 reviewed the present state of mobile
videoconferencing, including current compression research, mobile
telecommunications and more general videoconferencing. Chapter 3 explored the
wealth of theory in the field of video and speech compression.

Chapter 4 described the two major contributions of new work in this thesis: a fast
motion estimation algorithm using surface tracking and a variable bit rate speech
codec based on LPC-10e. Chapter 5 applied the theory developed in earlier
chapters to the design and implementation of the mobile videoconferencing system.
Chapter 6 described a prototype system designed to demonstrate the potential
applications of this work, and the results for each compression module. Finally,
Chapter 7 discussed the three major achievements of the thesis and suggested
further work in the area.

Chapter 9 - References

[Bourke 2000]
YCC Colour Space and Image Compression
http://astronomy.swin.edu.au/pbourke/colour/ycc/
April 2000.


[Buckingham 2001]
Simon Buckingham, An Introduction to the General Packet Radio Service,
http://www.gsmworld.com/technology/yes2gprs.html, Mobile Lifestreams
Ltd. (Issued Jan 2001)


[Cherriman 1996]
Dr. Peter Cherriman, H.261 Video Coding
http://www-mobile.ecs.soton.ac.uk/peter/h261/h261.html
(14th September 1996)


[Cohen 1995]
L. Cohen, Time-Frequency Analysis, Prentice-Hall, Upper Saddle River,
N.J., 1995.


[Conklin & Hemami 1997]
G. Conklin, S Hemami, Multi-Resolution Motion Estimation,
Proceedings of ICASSP 97, Munich, Germany, April 1997.


[Daubechies 1992]
I. Daubechies, Ten Lectures on Wavelets, Capital City Press, Montpelier,
Vermont, 1992.


[Donoho & Johnstone 1994]
D. Donoho and I. Johnstone, Ideal Spatial Adaptation by Wavelet Shrinkage,
Biometrika, 81:425-455, Dec 1994.


[Duran & Sauer 1997]
Duran and Sauer, Mainstream Videoconferencing: A Developers Guide to
Distance Multimedia, Addison-Wesley, 1997.


[Faichney & Gonzalez 1999]
J. Faichney, R. Gonzalez, Video Coding for Mobile Handheld
Videoconferencing, Proc. IASTED International Conference, Internet and
Multimedia Systems and Applications, Nassau, Bahamas, 1999.


[G.729 1996]
Recommendation G.729, Conjugate-Structure Algebraic Codebook Excited
Linear Predictive Speech Coder
International Telecommunications Union, Place des Nations, CH-1211
Geneva 20, Switzerland.
Approved March 1996, in force at 13th October 2001.


[Haykin 1994]
S. Haykin, Communication Systems, John Wiley & Sons, Inc, New York,
1994.


[Hickman 1997]
Angela Hickman, Wireless Videoconferencing, PC Magazine, January
1997.


[Hill 2001]
E. Hill, Hardware Design of a GPRS Videophone, Undergraduate thesis,
University of Queensland, Brisbane, Australia, Department of Information
Technology and Electrical Engineering, 2001.


[Mallat 1998]
S. Mallat, A Wavelet Tour of Signal Processing, Academic Press Ltd., San
Diego, CA, 1998.


[Majani 1994]
E. Majani, Biorthogonal Wavelets for Image Compression, Proc. SPIE,
VCIP 1994, Vol. 2308, pp. 478-488, Sept. 1994.


[Marshall 2001]
D. Marshall, Optical Flow,
http://www.cs.cf.ac.uk/Dave/Vision_lecture/node45.html (Current 18/4/01)


[MDWC 2001]
Mobile Device Wireless Connectivity,
http://www.microsoft.com/mobile/enterprise/whitepapers.asp, Microsoft
(March 2001)


[MobileInfo 2001]
MobileInfo.com, Mobile Videoconferencing,
http://www.mobileinfo.com/Hot_Topics/videoconfencing.htm, 10th August 2001


[MPEG 2001]
Moving Pictures Expert Group, MPEG 4 Applications Coding of Moving
Pictures and Audio, March 1999/Seoul.


[Ohr 1996]
Stephan Ohr, ITU effort eyes mobile video phone,
http://www.icsl.ucla.edu/~luttrell/pubs/eetimes.html, October 1996.


[Rabiner & Schafer 1978]
L. Rabiner, R. Schafer, Digital Processing of Speech Signals, Prentice-Hall,
Englewood Cliffs, N.J., 1978.


[Said & Pearlman 1996]
A. Said and W. Pearlman, A New Fast and Efficient Image Codec Based
on Set Partitioning in Hierarchical trees, IEEE Transactions on Circuits
and Systems for Video Technology, Vol. 6, June 1996.


[Shannon 1948]
C. E. Shannon, A mathematical theory of communication, Bell Systems
Technical Journal, vol. 27, pp. 379-423 and 623-656, July and October
1948.


[Shapiro 1993]
J. Shapiro, Embedded Image Coding Using Zerotrees of Wavelet
Coefficients, IEEE Transactions on Signal Processing, Vol. 41, No. 12,
December 1993, pp 3445-3462.


[Storn 1996]
Rainer Storn, Echo Cancellation Techniques for Multimedia Applications -
a Survey, Berkeley, CA, 1996.


[Vanstone & Oorschot 1989]
S. A. Vanstone and P. C. Oorschot, An Introduction to Error Correcting
Codes with Applications, Kluwer Academic Publishers, Norwell,
Massachusetts, 1989.


[van der Walle 1995]
A. van der Walle, Relating Fractal Image Compression to Transform
Methods, Master's thesis, Univ. of Waterloo, Ontario, Canada, Department
of Applied Mathematics, 1995.


[VC 2001]
Videoconferencing,
http://disc-nt.cba.uh.edu/rudy/spring98/day/6/TeamAUSMUS1.htm
(Current 13th October 2001)


[Vetterli & Kovacevic 1995]
M. Vetterli and J. Kovacevic, Wavelets and Subband Coding, Prentice-Hall
PTR, Upper Saddle River, N.J., 1995.


[Vfaxman 2001]
Video Conferencing for the Public,
http://www.vfaxman.com/applications2.htm, Faxman Communications Pty
Ltd. (Current June 2001)


[3G 2001]
3G The Future of Communications,
http://www.gsmworld.com/technology/3g_future.html
Mobile Lifestreams Ltd. (Issued April 26, 2001)



Chapter 10 - Appendices



10.1 DWT.c

This appendix contains the C code written to perform forward and inverse Biorthogonal Wavelet Transforms on colour images.



/*****************************************************************************
DWT.c
A group of functions to perform wavelet transforms and inverse transforms
*****************************************************************************/
#include "DWT.h"

/* - toYUV:
Converts an RGB image into a YUV image - a simple matrix operation
*/
int toYUV(float *in, int imageSize) {
int x, y;
for (x = 0; x < imageSize; x++) {
for (y = 0; y < imageSize; y++) {
float R, G, B;
//RGB to YUV
// Y =  0.299 R + 0.587 G + 0.114 B
// U = -0.146 R - 0.288 G + 0.434 B
// V =  0.617 R - 0.517 G - 0.100 B

R = *(in + RED + 3*(x + imageSize*y));
G = *(in + GREEN + 3*(x + imageSize*y));
B = *(in + BLUE + 3*(x + imageSize*y));

*(in + YCOMP + 3*(x + imageSize*y)) = 0.299f*R + 0.587f*G + 0.114f*B;
*(in + UCOMP + 3*(x + imageSize*y)) = -0.146f*R - 0.288f*G + 0.434f*B;
*(in + VCOMP + 3*(x + imageSize*y)) = 0.617f*R + -0.517f*G - 0.100f*B;
}
}
return 0;
}

/* - toRGB:
Converts a YUV image into an RGB image - a simple matrix operation
*/
int toRGB(float *in, int imageSize) {
int x, y;
for (x = 0; x < imageSize; x++) {
for (y = 0; y < imageSize; y++) {
float Y, U, V;
//YUV to RGB
// R = 1.0000 Y - 0.0009 U + 1.1359 V
// G = 1.0000 Y - 0.3959 U - 0.5783 V
// B = 1.0000 Y + 2.0411 U - 0.0016 V

Y = *(in + YCOMP + 3*(x + imageSize*y));
U = *(in + UCOMP + 3*(x + imageSize*y));
V = *(in + VCOMP + 3*(x + imageSize*y));

*(in + RED + 3*(x + imageSize*y)) = Y - 0.0009f*U + 1.1359f *V;
*(in + GREEN + 3*(x + imageSize*y)) = Y - 0.3959f*U - 0.5783f *V;
*(in + BLUE + 3*(x + imageSize*y)) = Y + 2.0411f*U - 0.0016f *V;

}
}
return 0;
}

/* - fwt2:
2D forward wavelet transform a colour image. CAUTION - In-place!
YUV BUILT-IN

Note: Not simply a sequential fwt in both directions - we have
to do the separable filtering at the same level before downsampling!

CAREFUL OF ACCIDENTALLY CYCLING SUBBANDS RELATIVE TO EACH OTHER!!!!
*/
int fwt2(float *in, int imageSize, float *motherWavelet, int motherLength,
         float *fatherWavelet, int fatherLength, int numColours) {

int s, row, col, colour, i;
float *tempBuffer;

int log2imageSize = (int)(log(imageSize)/log(2) + 0.5);

tempBuffer = (float *)malloc(imageSize*numColours*sizeof(float));
assert(tempBuffer != NULL);

// Wavelet transform all colours at once (interlaced data)
// For each scale, split the row/column intensity signals into octaves and downsample losslessly
for (s = log2imageSize; s > 0; s--) {
for (row = 0; row < (1<<s); row++) {
// Copy the input data to the temporary buffer and wipe the row.
// I'm using a partial in-place algorithm to minimise the memory-shifting overhead
memcpy(tempBuffer, in + numColours*imageSize*row, (1<<s)*numColours*sizeof(float));
memset(in + numColours*imageSize*row, 0, (1<<s)*numColours*sizeof(float));

// QMF and downsample, storing the wavelet coefficients in place
downHPF(tempBuffer, (1<<s), numColours, numColours, fatherWavelet, fatherLength, in +
numColours*((1<<(s-1)) + imageSize*row));
downLPF(tempBuffer, (1<<s), numColours, numColours, motherWavelet, motherLength, in +
numColours*(imageSize*row));
}
for (col = 0; col < (1<<s); col++) {
// Copy the input data to the temporary buffer and wipe the row
// Column-copy - can't use memcpy
for (i = 0; i < (1<<s); i++) {
for (colour = 0; colour < numColours; colour++) {
*(tempBuffer + colour + numColours*i) = *(in + colour + numColours*(col + imageSize*i));
*(in + colour + numColours*(col + imageSize*i)) = 0;
}
}

// QMF and downsample, storing the wavelet coefficients in place
downHPF(tempBuffer, (1<<s), imageSize*numColours, numColours, fatherWavelet, fatherLength, in +
numColours*(col + (1<<(s-1))*imageSize));
downLPF(tempBuffer, (1<<s), imageSize*numColours, numColours, motherWavelet, motherLength,
in + numColours*col);
}
}

/* Free the memory allocated for temporary storage */
free((void *)tempBuffer);

return 0;
}


/* - iwt2:
2D inverse wavelet transform a given image.
Note: Not simply a sequential fwt in both directions - we have
to do the separable filtering at the same level before downsampling!
*/
// CAREFUL OF ACCIDENTALLY CYCLING SUBBANDS RELATIVE TO EACH OTHER!!!!
int iwt2(float *in, int imageSize, float *motherWavelet, int motherLength, float *fatherWavelet, int
fatherLength, int numColours) {
int s, row, col, colour, i;
float *tempBuffer;

int log2imageSize = (int)(log(imageSize)/log(2) + 0.5);

tempBuffer = (float *)malloc(imageSize*numColours*sizeof(float));
assert(tempBuffer != NULL);

// Inverse wavelet transform all colours at once (interlaced data)
// For each scale, upsample and anti-alias the two octaves and recombine them
for (s = 1; s <= log2imageSize; s++) {
for (col = 0; col < (1<<s); col++) {
// We write to a temporary buffer before copying it over that column in the image
memset(tempBuffer, 0, (1<<s)*numColours*sizeof(float));

// QMF and upsample to obtain the image coefficients
upHPF(in + numColours*(col + imageSize*(1<<(s-1))), (1<<(s-1)), imageSize*numColours,
numColours, motherWavelet, motherLength, tempBuffer);
upLPF(in + numColours*col, (1<<(s-1)), imageSize*numColours, numColours, fatherWavelet,
fatherLength, tempBuffer);

// Store the data back in the image (column-copy)
for (i = 0; i < (1<<s); i++) {
for (colour = 0; colour < numColours; colour++) {
*(in + colour + numColours*(col + imageSize*i)) = *(tempBuffer + colour + numColours*i);
}
}
}
for (row = 0; row < (1<<s); row++) {
// We write to a temporary buffer before copying it over that column in the image


memset(tempBuffer, 0, (1<<s)*numColours*sizeof(float));

// QMF and upsample to obtain image coefficients
upHPF(in + numColours*(imageSize*row + (1<<(s-1))), (1<<(s-1)), numColours, numColours,
motherWavelet, motherLength, tempBuffer);
upLPF(in + numColours*imageSize*row, (1<<(s-1)), numColours, numColours, fatherWavelet,
fatherLength, tempBuffer);

// Store the data back in the image (column-copy)
memcpy(in + numColours*imageSize*row, tempBuffer, (1<<s)*numColours*sizeof(float));
}
}

// Free the temporary buffer space
free((void *)tempBuffer);

return 0;
}


/* - downHPF:
A function to high-pass filter and downsample by two.
Used by the forward wavelet transform function.
A symmetrically-extended convolution, with built-in ODD downsampling.
*/
int downHPF(float *in, int inLength, int dataStep, int numColours, float *filter, int filterLength, float *out)
{
int oneSidedLength = (filterLength - 1)/2;
int inIndex, outIndex, filterIndex, colour;
int modBase = 2*(inLength - 1);
int outLength = inLength/2;
char boundsApply;
float filterCoefficient;
float *inPtr, *outPtr;

for (outIndex = 0; outIndex < outLength; outIndex++) {
outPtr = (out + dataStep*outIndex);

/* Check outside the loop if the bounds are going to apply for this output index */
boundsApply = (outIndex < oneSidedLength/2 || outIndex >= (outLength - 1 - oneSidedLength/2));

for (filterIndex = -oneSidedLength; filterIndex <= oneSidedLength; filterIndex++) {
inIndex = 2*outIndex+1 - filterIndex;

if (boundsApply) {
if (inIndex < 0) inIndex = -inIndex;
inIndex %= modBase; // Mod it to the correct range
if (inIndex >= inLength) inIndex = modBase - inIndex;
}

// Accumulate the output value
filterCoefficient = *(filter + filterIndex + oneSidedLength);
inPtr = (in + numColours*inIndex);
if ((filterIndex&1) == 0) { // Quick way to check if we should negate for mirror-filtering
for (colour = 0; colour < numColours; colour++) {
*(outPtr + colour) += *(inPtr + colour) * filterCoefficient;
}
} else {
for (colour = 0; colour < numColours; colour++) {
*(outPtr + colour) -= *(inPtr + colour) * filterCoefficient;
}
}
}
}

return 0;
}


/* - downLPF:
A function to low-pass filter and downsample by two.
Used by the forward wavelet transform function.
A symmetric convolution algorithm, with built-in EVEN downsampling.
*/
int downLPF(float *in, int inLength, int dataStep, int numColours, float *filter, int filterLength, float *out)
{
int oneSidedLength = (filterLength - 1)/2;
int inIndex, outIndex, filterIndex, colour;
int outLength = inLength/2;
int modBase = 2*(inLength - 1);
char boundsApply;
float filterCoefficient;
float *inPtr, *outPtr;

for (outIndex = 0; outIndex < outLength; outIndex++) {
outPtr = (out + dataStep*outIndex);

/* Check outside the loop if the bounds are going to apply for this output index */


boundsApply = (outIndex < oneSidedLength/2 || outIndex >= (outLength - 1 - oneSidedLength/2));

for (filterIndex = -oneSidedLength; filterIndex <= oneSidedLength; filterIndex++) {
// Symmetrically wrap boundaries, don't repeat edges
inIndex = 2*outIndex - filterIndex;

if (boundsApply) {
if (inIndex < 0) inIndex = -inIndex;
inIndex %= modBase; // Mod it to the correct range
if (inIndex >= inLength) inIndex = modBase - inIndex;
}

// Accumulate the output value
filterCoefficient = *(filter + filterIndex + oneSidedLength);
inPtr = (in + numColours*inIndex);
for (colour = 0; colour < numColours; colour++) {
*(outPtr + colour) += *(inPtr + colour) * filterCoefficient;
}
}
}

return 0;
}


/* - upHPF:
A function to reverse the HP quadrature mirror filtering of the
forward wavelet transform
A variation on symmConv to include ODD upsampling
*/
int upHPF(float *in, int inLength, int dataStep, int numColours, float *filter, int filterLength, float *out) {
// Upsample, convolute
// out = symmConv(upsample(input), filter);
int oneSidedLength = (filterLength - 1)/2;
int inIndex, outIndex, filterIndex, colour, startIndex;
int outLength = 2*inLength;
int modBase = 2*(outLength - 1);
char boundsApply, negFilterCoeff;
float filterCoefficient;
float *inPtr, *outPtr;

for (outIndex = 0; outIndex < outLength; outIndex++) {
outPtr = (out + numColours*outIndex);

/* Check outside the loop if the bounds are going to apply for this output index */
boundsApply = (outIndex <= oneSidedLength*2 || outIndex > (outLength - 1 - oneSidedLength*2));

startIndex = -oneSidedLength;
if (((outIndex + startIndex)&1) == 0) startIndex++;

/* Check outside the loop to see if we should negate the filter coefficients */
negFilterCoeff = ((startIndex&1) != 0);

for (filterIndex = startIndex; filterIndex <= oneSidedLength; filterIndex+=2) {
// Symmetrically wrap boundaries, don't repeat edges
//outIndex = inIndex*2+1 + filterIndex;
inIndex = outIndex - filterIndex; // Then subtract 1 and divide by 2

if (boundsApply) {
if (inIndex < 0) inIndex = -inIndex;
inIndex %= modBase; // Mod it to the correct range
if (inIndex >= outLength) inIndex = modBase - inIndex;
}

inIndex >>= 1; // Implicitly subtract 1, then divide by 2

// Accumulate the output value
filterCoefficient = *(filter + filterIndex + oneSidedLength);
inPtr = (in + dataStep*inIndex);
if (!negFilterCoeff) { // Quick way to check if we should negate
for (colour = 0; colour < numColours; colour++) {
*(outPtr + colour) += *(inPtr + colour) * filterCoefficient;
}
} else {
for (colour = 0; colour < numColours; colour++) {
*(outPtr + colour) -= *(inPtr + colour) * filterCoefficient;
}
}
}
}

return 0;
}


/* - upLPF:
A function to reverse the LP quadrature mirror filtering of the
forward wavelet transform


A variation on symmConv to include EVEN upsampling
*/
int upLPF(float *in, int inLength, int dataStep, int numColours, float *filter, int filterLength, float *out) {
int oneSidedLength = (filterLength - 1)/2;
int inIndex, outIndex, filterIndex, colour, startIndex;
int outLength = 2*inLength;
int modBase = 2*(outLength - 1);
char boundsApply;
float filterCoefficient;
float *inPtr, *outPtr;

for (outIndex = 0; outIndex < outLength; outIndex++) {
outPtr = (out + numColours*outIndex);

/* Check outside the loop if the bounds are going to apply for this output index */
boundsApply = (outIndex < oneSidedLength*2 || outIndex > (outLength - 1 - oneSidedLength*2));

startIndex = -oneSidedLength;
if (((outIndex + startIndex)&1) != 0) startIndex++;

for (filterIndex = startIndex; filterIndex <= oneSidedLength; filterIndex+=2) {
// Symmetrically wrap boundaries, don't repeat edges
// outIndex = inIndex*2 + filterIndex;
inIndex = outIndex - filterIndex; // Then divide by 2

if (boundsApply) {
if (inIndex < 0) inIndex = -inIndex;
inIndex %= modBase; // Mod it to the correct range
if (inIndex >= outLength) inIndex = modBase - inIndex;
}

inIndex >>= 1;

// Accumulate the output value
filterCoefficient = *(filter + filterIndex + oneSidedLength);
inPtr = (in + dataStep*inIndex);
for (colour = 0; colour < numColours; colour++) {
*(outPtr + colour) += *(inPtr + colour) * filterCoefficient;
}
}
}

return 0;
}


10.2 SPIHT.c

This appendix contains the C code written to perform SPIHT encoding and decoding of wavelet-transformed images.



/* - SPIHT:
A class of functions to convert between the wavelet coefficients of an image
and the corresponding Set-Partitioning In Hierarchical Trees representation.
*/
#include "SPIHT.h"

/* - initSPIHT:
Initialise an SPIHTSTRUCT object.
*/
int initSPIHT(float *wc, int imageSize, int numColours, char writing, int threshExp,
SPIHTSTRUCT *thisSPIHT) {

int xIndex, yIndex, colour, i;
PIXELSTRUCT tempPixelStruct;
SETSTRUCT tempSetStruct;

// Store the dimensions of the wavelet data
thisSPIHT->numColours = numColours;
thisSPIHT->imageSize = imageSize;

// Default to an arbitrarily small number
thisSPIHT->haltThresh = -BIGNUM;

// Allocate memory for the intermediate variables
thisSPIHT->exp = (int *)calloc(imageSize*imageSize*numColours, sizeof(int));
thisSPIHT->treeExp = (int *)calloc(imageSize*imageSize*numColours, sizeof(int));
thisSPIHT->positive = (char *)calloc(imageSize*imageSize*numColours, sizeof(char));
assert(thisSPIHT->exp != NULL);
assert(thisSPIHT->treeExp != NULL);
assert(thisSPIHT->positive != NULL);

// Allocate memory for the three lists
thisSPIHT->lis = (SETSTRUCT *)malloc(imageSize*imageSize*numColours*sizeof(SETSTRUCT));
thisSPIHT->lsp = (PIXELSTRUCT
*)malloc(imageSize*imageSize*numColours*sizeof(PIXELSTRUCT));
thisSPIHT->lip = (PIXELSTRUCT
*)malloc(imageSize*imageSize*numColours*sizeof(PIXELSTRUCT));
assert(thisSPIHT->lis != NULL);
assert(thisSPIHT->lsp != NULL);
assert(thisSPIHT->lip != NULL);
thisSPIHT->lisLength = 0;
thisSPIHT->lspLength = 0;
thisSPIHT->lipLength = 0;

thisSPIHT->wc = wc;

/* Precalculate the exponents of each coefficient, for a faster encoding algorithm! */
if (writing) {
getExponents(thisSPIHT);
for (colour = 0; colour < thisSPIHT->numColours; colour++) {
getTreeExponents(thisSPIHT, 0, 0, colour);
}
}

/*
First send the highest threshold (not including DC coefficients)
Then send the bits of the three DC coefficients (which are typically 3-4 thresholds above the
remainder of the tree) from the highest possible (fixed knowledge at coder/decoder) bit down to
just above the highest threshold.
Initialise the DC coefficients to significant/insignificant as appropriate.
Start the algorithm at the specified threshold.
*/
if (writing) {
// First determine the highest threshold and send it
thisSPIHT->highestThresh = 0;
for (colour = 0; colour < thisSPIHT->numColours; colour++) {
for (xIndex = 0; xIndex < 2; xIndex++) {
for (yIndex = 0; yIndex < 2; yIndex++) {


int ptrOffset;
if ((xIndex == 0) && (yIndex == 0)) continue;
ptrOffset = colour + thisSPIHT->numColours*(xIndex + thisSPIHT->imageSize*yIndex);

if (*(thisSPIHT->treeExp + ptrOffset) > thisSPIHT->highestThresh)
thisSPIHT->highestThresh = *(thisSPIHT->treeExp + ptrOffset);
}
}
}

// 'Guaranteed' 0 <= highestThreshold < 16, therefore send only those 4 bits
for (i = 0; i < 4; i++) {
pack((thisSPIHT->highestThresh & (1<<i)) > 0, thisSPIHT->bitBuffer);
}

// Now output the DC bits down to the starting threshold
// Automatically assume that they're significant - simplifies things and is usually true
for (colour = 0; colour < thisSPIHT->numColours; colour++) {
// Send the sign of the coefficient
pack(*(thisSPIHT->wc + colour) > 0, thisSPIHT->bitBuffer);

// Send the bits of the coefficients
for (i = threshExp; i > (int)thisSPIHT->highestThresh; i--) {
int outputBit;
if (i >= 0)
outputBit = ((int)(ABS(*(thisSPIHT->wc + colour))) & (1<<i)) > 0;
else {
outputBit = (int)(ABS(*(thisSPIHT->wc + colour))) / pow(2.0, i - 1);
outputBit &= 1; // Mod 2 to give lsb only
}

pack(outputBit, thisSPIHT->bitBuffer);
}
}
} else {
// Read in the highest (non-DC) threshold
thisSPIHT->highestThresh = 0;
// 'Guaranteed' 0 <= highestThreshold < 16, therefore read only those 4 bits
for (i = 0; i < 4; i++) {
if(unpack(thisSPIHT->bitBuffer)) thisSPIHT->highestThresh += (1<<i);
}

// Now input the DC bits down to the starting threshold
for (colour = 0; colour < thisSPIHT->numColours; colour++) {
// Read the sign of the coefficient
*(thisSPIHT->positive + colour) = unpack(thisSPIHT->bitBuffer);

// Read the bits of the coefficients
for (i = threshExp; i > (int)thisSPIHT->highestThresh; i--) {
if (unpack(thisSPIHT->bitBuffer)) {
if (*(thisSPIHT->positive + colour)) {
if (i >= 0)
*(thisSPIHT->wc + colour) += (1<<i);
else {
*(thisSPIHT->wc + colour) += pow(2.0, i);
}
}
else {
if (i >= 0)
*(thisSPIHT->wc + colour) -= (1<<i);
else {
*(thisSPIHT->wc + colour) -= pow(2.0, i);
}
}
}
}
}

// Place the coefficient in the centre of the uncertainty interval
i = (int)thisSPIHT->highestThresh + 1;
for (colour = 0; colour < thisSPIHT->numColours; colour++) {
if (*(thisSPIHT->positive + colour)) {
if (i >= 1) {
*(thisSPIHT->wc + colour) += (1<<(i-1));
} else {
*(thisSPIHT->wc + colour) += pow(2.0, i - 1);
}
} else {
if (i >= 1) {
*(thisSPIHT->wc + colour) -= (1<<(i-1));
} else {
*(thisSPIHT->wc + colour) -= pow(2.0, i - 1);
}
}
}
}

// Set up the initial lists


for (xIndex = 0; xIndex < 2; xIndex++) {
for (yIndex = 0; yIndex < 2; yIndex++) {
for (colour = 0; colour < thisSPIHT->numColours; colour++) {
if (xIndex == 0 && yIndex == 0) { // Make sure we have no sets containing themselves!
// DC coefficients default to significant, due to my low-overhead initialisation
tempPixelStruct.x = xIndex;
tempPixelStruct.y = yIndex;
tempPixelStruct.col = colour;
addToLSP(&tempPixelStruct, thisSPIHT);
} else {
// Add to the list of insignificant pixels
tempPixelStruct.x = xIndex;
tempPixelStruct.y = yIndex;
tempPixelStruct.col = colour;
addToLIP(&tempPixelStruct, thisSPIHT);

// Add to the list of insignificant sets
tempSetStruct.isdtype = TRUE;
tempSetStruct.x = xIndex;
tempSetStruct.y = yIndex;
tempSetStruct.col = colour;
addToLIS(&tempSetStruct, thisSPIHT);
}
}
}
}

return 0;
}


/* - closeSPIHT:
Destroy the SPIHT object and its variables, freeing all allocated memory
*/
int closeSPIHT(SPIHTSTRUCT *thisSPIHT) {
free((void *)thisSPIHT->exp);
free((void *)thisSPIHT->treeExp);
free((void *)thisSPIHT->positive);

free((void *)thisSPIHT->lis);
free((void *)thisSPIHT->lsp);
free((void *)thisSPIHT->lip);
return 0;
}


/* - fromSPIHT:
Convert this SPIHTSTRUCT's bitBuffer into the corresponding wavelet coefficients
*/
int fromSPIHT(SPIHTSTRUCT *thisSPIHT) {
int xIndex, yIndex, colour;
int i, ptrOffset, oldLSPLength;
int dx, dy;
int threshExp = thisSPIHT->highestThresh;

// Fill in the coefficient values from the input bit buffer
for (; !thisSPIHT->bitBuffer->endOfBuffer; threshExp--) {
// Store the old LSP length, so that the refinement step only treats coefficients from earlier passes
oldLSPLength = thisSPIHT->lspLength;

// DEBUGGING
/*
if (threshExp >= 8) {
printf("Threshold exponent: %d\n", threshExp);
for (xIndex = 0; xIndex < 4; xIndex++) {
for (yIndex = 0; yIndex < 4; yIndex++) {
for (colour = 0; colour < thisSPIHT->numColours; colour++) {
printf("%f\t", *(thisSPIHT->wc + colour + thisSPIHT->numColours*(xIndex + thisSPIHT-
>imageSize*yIndex)));
}
printf("\n");
}
}
printf("\n");

printSets(thisSPIHT);
}
*/

// Check the lone pixels first:
i = 0;
while(i < thisSPIHT->lipLength) {
// Is this pixel significant?
if (unpack(thisSPIHT->bitBuffer)) {
// If this pixel from the lip is significant, move it to the lsp
xIndex = (thisSPIHT->lip + i)->x;
yIndex = (thisSPIHT->lip + i)->y;


colour = (thisSPIHT->lip + i)->col;
ptrOffset = colour + thisSPIHT->numColours*(xIndex + thisSPIHT->imageSize*yIndex);

addToLSP((thisSPIHT->lip + i), thisSPIHT);
removeFromLIP(i, thisSPIHT);

// Read its sign also, updating the wavelet coefficients in the process
// Put coefficient at centre of uncertainty interval!
if (unpack(thisSPIHT->bitBuffer)) { // If positive
*(thisSPIHT->positive + ptrOffset) = TRUE;
if (threshExp >= 1)
*(thisSPIHT->wc + ptrOffset) = (3<<(threshExp-1));
else {
*(thisSPIHT->wc + ptrOffset) = (3*pow(2.0, threshExp - 1));
}
} else { // If negative
*(thisSPIHT->positive + ptrOffset) = FALSE;
if (threshExp >= 1)
*(thisSPIHT->wc + ptrOffset) = -(3<<(threshExp - 1));
else {
*(thisSPIHT->wc + ptrOffset) = -(3*pow(2.0, threshExp - 1));
}
}
} else {
// Only increment the index if we found an insignificant pixel, otherwise we've
// moved the end of the list to our current position (and it must be checked)
i++;
}
//if (thisSPIHT->bitBuffer->endOfBuffer) return 0;
}
if (thisSPIHT->bitBuffer->endOfBuffer) return 0;

// Check the insignificant sets next:
i = 0;
while (i < thisSPIHT->lisLength) {
// Is this set significant?
if (unpack(thisSPIHT->bitBuffer)) {
// This set from the lis is significant, so break it down
xIndex = (thisSPIHT->lis + i)->x;
yIndex = (thisSPIHT->lis + i)->y;
colour = (thisSPIHT->lis + i)->col;
//ptrOffset = colour + thisSPIHT->numColours*(xIndex + thisSPIHT->imageSize*yIndex);

// Break the set down
if ((thisSPIHT->lis + i)->isdtype) { // If we've got a D-type set
// Ensure that we don't generate an empty L-type set
if (MAX(xIndex, yIndex) < thisSPIHT->imageSize/4) {
(thisSPIHT->lis + i)->isdtype = FALSE; // Convert the set to L-type
} else {
removeFromLIS(i, thisSPIHT); // Remove this now empty set
}

// Add each of the offspring to the LIP/LSP as needed
for (dx = 0; dx < 2; dx++) {
for (dy = 0; dy < 2; dy++) {
int childPtrOffset;
PIXELSTRUCT tempPixelStruct;
tempPixelStruct.x = 2 * xIndex + dx;
tempPixelStruct.y = 2 * yIndex + dy;
tempPixelStruct.col = colour;

childPtrOffset = colour
+ thisSPIHT->numColours*(2*xIndex + dx + thisSPIHT->imageSize*(2*yIndex + dy));

if(unpack(thisSPIHT->bitBuffer)) { // If significant
addToLSP(&tempPixelStruct, thisSPIHT);

// Include sign information
if (unpack(thisSPIHT->bitBuffer)) { // If positive
*(thisSPIHT->positive + childPtrOffset) = TRUE;
if (threshExp >= 1)
*(thisSPIHT->wc + childPtrOffset) = (3<<(threshExp-1));
else {
*(thisSPIHT->wc + childPtrOffset) = (3*pow(2.0, threshExp - 1));
}
} else { // If negative
*(thisSPIHT->positive + childPtrOffset) = FALSE;
if (threshExp >= 1)
*(thisSPIHT->wc + childPtrOffset) = -(3<<(threshExp-1));
else {
*(thisSPIHT->wc + childPtrOffset) = -(3*pow(2.0, threshExp - 1));
}
}
} else { // If insignificant
addToLIP(&tempPixelStruct, thisSPIHT);
}
}
}


} else { // Else if we've got an L-type set
// Remove the old L-type set
removeFromLIS(i, thisSPIHT);

// Split it into the 4 D-type sets
for (dx = 0; dx < 2; dx++) {
for (dy = 0; dy < 2; dy++) {
// Build the set
SETSTRUCT tempSetStruct;
tempSetStruct.x = 2 * xIndex + dx;
tempSetStruct.y = 2 * yIndex + dy;
tempSetStruct.col = colour;
tempSetStruct.isdtype = TRUE;

// Append the set to the list
addToLIS(&tempSetStruct, thisSPIHT);
}
}
}
} else {
// Only increment the index if we found an insignificant set, otherwise we've
// moved the end of the list to our current position (and it must be checked)
i++;
}
}
if (thisSPIHT->bitBuffer->endOfBuffer) return 0;

// Keep coefficient at centre of uncertainty interval
for (i = 0; i < oldLSPLength; i++) {
xIndex = (thisSPIHT->lsp + i)->x;
yIndex = (thisSPIHT->lsp + i)->y;
colour = (thisSPIHT->lsp + i)->col;
ptrOffset = colour + thisSPIHT->numColours*(xIndex + thisSPIHT->imageSize*yIndex);

if (unpack(thisSPIHT->bitBuffer)) {
if (*(thisSPIHT->positive + ptrOffset)) { // If positive
if (threshExp >= 1)
*(thisSPIHT->wc + ptrOffset) += (1<<(threshExp-1));
else {
*(thisSPIHT->wc + ptrOffset) += (pow(2.0, threshExp - 1));
}
} else { // Else if negative
if (threshExp >= 1)
*(thisSPIHT->wc + ptrOffset) -= (1<<(threshExp-1));
else {
*(thisSPIHT->wc + ptrOffset) -= (pow(2.0, threshExp - 1));
}
}
} else {
if (*(thisSPIHT->positive + ptrOffset)) { // If positive
if (threshExp >= 1)
*(thisSPIHT->wc + ptrOffset) -= (1<<(threshExp-1));
else {
*(thisSPIHT->wc + ptrOffset) -= (pow(2.0, threshExp - 1));
}
} else { // Else if negative
if (threshExp >= 1)
*(thisSPIHT->wc + ptrOffset) += (1<<(threshExp-1));
else {
*(thisSPIHT->wc + ptrOffset) += (pow(2.0, threshExp - 1));
}
}
}
}
if (thisSPIHT->bitBuffer->endOfBuffer) return 0;
}

return 0;
}


/* - toSPIHT:
A function to convert the wavelet coefficients into the corresponding SPIHT representation in this SPIHTSTRUCT's bitBuffer
*/
int toSPIHT(SPIHTSTRUCT *thisSPIHT) {
int i, oldLSPLength;
int xIndex, yIndex, colour, ptrOffset;
int dx, dy;
int threshExp = thisSPIHT->highestThresh;

// Halt early if we drop below the halting threshold
for (; !(thisSPIHT->bitBuffer->endOfBuffer || (threshExp <= thisSPIHT->haltThresh)); threshExp--) {
char significant;

// Store the old LSP length, so that the refinement step only treats coefficients from earlier passes
oldLSPLength = thisSPIHT->lspLength;

// Check the lone pixels first:


i = 0;
while(i < thisSPIHT->lipLength) {
// Get the info about this pixel from the lis
xIndex = (thisSPIHT->lip + i)->x;
yIndex = (thisSPIHT->lip + i)->y;
colour = (thisSPIHT->lip + i)->col;
ptrOffset = colour + thisSPIHT->numColours*(xIndex + thisSPIHT->imageSize*yIndex);

// Establish the significance of this pixel and let the decoder know
significant = (*(thisSPIHT->exp + ptrOffset) == threshExp);

if (significant) pack(TRUE, thisSPIHT->bitBuffer);
else pack(FALSE, thisSPIHT->bitBuffer);

if (significant) { // If significant
// This pixel is significant, so we move it from the lip to the lsp
addToLSP((thisSPIHT->lip + i), thisSPIHT);
removeFromLIP(i, thisSPIHT);

// Output its sign also
pack(*(thisSPIHT->wc + ptrOffset) > 0, thisSPIHT->bitBuffer);
} else {
// Only increment the index if we found an insignificant pixel, otherwise we've
// moved the end of the list to our current position (and it must be checked)
i++;
}
}
if (thisSPIHT->bitBuffer->endOfBuffer) return 0;

// Check the insignificant sets next:
i = 0;
while (i < thisSPIHT->lisLength) {
// Get the info about this pixel from the lis
char significant;
xIndex = (thisSPIHT->lis + i)->x;
yIndex = (thisSPIHT->lis + i)->y;
colour = (thisSPIHT->lis + i)->col;
//ptrOffset = colour + thisSPIHT->numColours*(xIndex + thisSPIHT->imageSize*yIndex);

// Establish the significance of this set and let the decoder know
if((thisSPIHT->lis + i)->isdtype) { // If it's a D-type set
// Test to see if any child's treeExp is significant
char significant = FALSE;
for (dx = 0; (dx < 2) && !significant; dx++) {
for (dy = 0; (dy < 2) && !significant; dy++) {
int thisTreeExp = *(thisSPIHT->treeExp + colour
+ thisSPIHT->numColours*(2*xIndex + dx + thisSPIHT->imageSize*(2*yIndex + dy)));
if (thisTreeExp == threshExp) significant = TRUE;
}
}

// Tell the decoder the result of the test
pack(significant, thisSPIHT->bitBuffer);

// If it was significant, break down this D-type set
if (significant) {
// Ensure that we don't generate an empty L-type set
if (MAX(xIndex, yIndex) < thisSPIHT->imageSize/4) {
(thisSPIHT->lis + i)->isdtype = FALSE; // Convert the set to L-type
} else {
removeFromLIS(i, thisSPIHT); // Remove this now empty set
}

// Add each of the offspring to the LIP/LSP as needed
for (dx = 0; dx < 2; dx++) {
for (dy = 0; dy < 2; dy++) {
// Get the info about this pixel from the lis
char significant;
int childPtrOffset;
PIXELSTRUCT tempPixelStruct;
tempPixelStruct.x = 2 * xIndex + dx;
tempPixelStruct.y = 2 * yIndex + dy;
tempPixelStruct.col = colour;
childPtrOffset = colour
+ thisSPIHT->numColours*(2*xIndex + dx + thisSPIHT->imageSize*(2*yIndex + dy));

// Establish the significance of this pixel and let the decoder know
significant = (*(thisSPIHT->exp + childPtrOffset) == threshExp);
if (significant) {
pack(TRUE, thisSPIHT->bitBuffer);
addToLSP(&tempPixelStruct, thisSPIHT);

// Include sign information
pack(*(thisSPIHT->wc + childPtrOffset) > 0, thisSPIHT->bitBuffer);
} else {
pack(FALSE, thisSPIHT->bitBuffer);
addToLIP(&tempPixelStruct, thisSPIHT);
}


}
}
} else { // If this set was insignificant
// Only increment the index if we found an insignificant set, otherwise we've
// moved the end of the list to our current position (and it must be checked)
i++;
}
} else { // Break down this L-type set
// Test to see if any child's treeExp is significant
char significant = FALSE;
for (dx = 0; (dx < 4) && !significant; dx++) {
for (dy = 0; (dy < 4) && !significant; dy++) {
int thisTreeExp = *(thisSPIHT->treeExp + colour
+ thisSPIHT->numColours*(4*xIndex + dx + thisSPIHT->imageSize*(4*yIndex + dy)));
if (thisTreeExp == threshExp) {
significant = TRUE;
}
}
}

// Tell the decoder the result of the test
pack(significant, thisSPIHT->bitBuffer);

// If it was significant, break down this L-type set
if (significant) {
// Remove the old L-type set
removeFromLIS(i, thisSPIHT);

// Split it into the 4 D-type sets
for (dx = 0; dx < 2; dx++) {
for (dy = 0; dy < 2; dy++) {
// Build the set
SETSTRUCT tempSetStruct;
tempSetStruct.x = 2*xIndex + dx;
tempSetStruct.y = 2*yIndex + dy;
tempSetStruct.col = colour;
tempSetStruct.isdtype = TRUE;

// Append the set to the list
addToLIS(&tempSetStruct, thisSPIHT);
}
}
} else { // If this set was insignificant
// Only increment the index if we found an insignificant set, otherwise we've
// moved the end of the list to our current position (and it must be checked)
i++;
}
}
}
if (thisSPIHT->bitBuffer->endOfBuffer) return 0;

// Finally, refine the coefficients (only the old ones - new ones are implicitly 1)
for (i = 0; i < oldLSPLength; i++) {
// Get the info about this pixel from the lis
char outputBit;
xIndex = (thisSPIHT->lsp + i)->x;
yIndex = (thisSPIHT->lsp + i)->y;
colour = (thisSPIHT->lsp + i)->col;
ptrOffset = colour + thisSPIHT->numColours*(xIndex + thisSPIHT->imageSize*yIndex);

// Output the current bit of this coefficient
if (threshExp >= 0)
outputBit = ((int)(ABS(*(thisSPIHT->wc + ptrOffset))) & (1<<threshExp)) > 0;
else {
outputBit = (int)(ABS(*(thisSPIHT->wc + ptrOffset))) / pow(2.0, threshExp - 1);
outputBit &= 1; // Mod 2 to give lsb only
}

pack(outputBit, thisSPIHT->bitBuffer);
}
if (thisSPIHT->bitBuffer->endOfBuffer) return 0;
}

return 0;
}


/* - getExponents:
A function to precalculate and store the single-coefficient exponents
*/
int getExponents(SPIHTSTRUCT *thisSPIHT) {
int i;
int exponent;
int loopLimit = SQR(thisSPIHT->imageSize) * thisSPIHT->numColours;
// Find the exponents of each of the coefficients
for (i = 0; i < loopLimit; i++) {
frexp(*(thisSPIHT->wc + i), &exponent);
*(thisSPIHT->exp + i) = (char)(exponent - 1);


}
return 0;
}


/* - getTreeExponents:
A function to precalculate and store the whole-tree exponents. Returns the resulting
tree exponent for recursive simplicity.
*/
char getTreeExponents(SPIHTSTRUCT *thisSPIHT, int xIndex, int yIndex, int col) {
int *treeExpPtr =
(thisSPIHT->treeExp + col + thisSPIHT->numColours*(xIndex + thisSPIHT->imageSize*yIndex));
int *expPtr =
(thisSPIHT->exp + col + thisSPIHT->numColours*(xIndex + thisSPIHT->imageSize*yIndex));
char temp;
// The tree exponent is always at least the exponent itself, so initialise:
*treeExpPtr = *expPtr;

// Find the tree exponent, based on the tree exponents of this node's children
if (MAX(xIndex, yIndex) >= (thisSPIHT->imageSize/2)) { // If this node has no children
} else { // Else this node has children
if (!(xIndex == 0 && yIndex == 0)) { // Don't bother with the DC coefficient
temp = getTreeExponents(thisSPIHT, xIndex*2, yIndex*2, col);
if (temp > *treeExpPtr) *treeExpPtr = temp;
temp = getTreeExponents(thisSPIHT, xIndex*2, yIndex*2 + 1, col);
if (temp > *treeExpPtr) *treeExpPtr = temp;
temp = getTreeExponents(thisSPIHT, xIndex*2 + 1, yIndex*2, col);
if (temp > *treeExpPtr) *treeExpPtr = temp;
temp = getTreeExponents(thisSPIHT, xIndex*2 + 1, yIndex*2 + 1, col);
if (temp > *treeExpPtr) *treeExpPtr = temp;
} else {
// If we're dealing with the DC coefficient, we must kick start the lower orders
temp = getTreeExponents(thisSPIHT, 0, 1, col);
if (temp > *treeExpPtr) *treeExpPtr = temp;
temp = getTreeExponents(thisSPIHT, 1, 0, col);
if (temp > *treeExpPtr) *treeExpPtr = temp;
temp = getTreeExponents(thisSPIHT, 1, 1, col);
if (temp > *treeExpPtr) *treeExpPtr = temp;
}
}

return *treeExpPtr;
}


/* - addToLIP:
A function to add a new pixel structure to the list of insignificant pixels
*/
int addToLIP(PIXELSTRUCT *tempPixelStruct, SPIHTSTRUCT *thisSPIHT) {
memcpy(thisSPIHT->lip + thisSPIHT->lipLength, tempPixelStruct, sizeof(PIXELSTRUCT));
thisSPIHT->lipLength++;
return 0;
}


/* - addToLSP:
A function to add a new pixel structure to the list of significant pixels
*/
int addToLSP(PIXELSTRUCT *tempPixelStruct, SPIHTSTRUCT *thisSPIHT) {
memcpy(thisSPIHT->lsp + thisSPIHT->lspLength, tempPixelStruct, sizeof(PIXELSTRUCT));
thisSPIHT->lspLength++;
return 0;
}


/* - addToLIS:
A function to add a new set structure to the list of insignificant sets
*/
int addToLIS(SETSTRUCT *tempSetStruct, SPIHTSTRUCT *thisSPIHT) {
memcpy(thisSPIHT->lis + thisSPIHT->lisLength, tempSetStruct, sizeof(SETSTRUCT));
thisSPIHT->lisLength++;
return 0;
}


/* - removeFromLIP:
A function to remove a pixel structure from the list of insignificant pixels
*/
int removeFromLIP(int index, SPIHTSTRUCT *thisSPIHT) {
// Move the final entry to replace the one being removed
thisSPIHT->lipLength--;
memcpy(thisSPIHT->lip + index, thisSPIHT->lip + thisSPIHT->lipLength, sizeof(PIXELSTRUCT));
return 0;
}


/* - removeFromLIS:
A function to remove a set structure from the list of insignificant sets


*/
int removeFromLIS(int index, SPIHTSTRUCT *thisSPIHT) {
// Move the final entry to replace the one being removed
thisSPIHT->lisLength--;
memcpy(thisSPIHT->lis + index, thisSPIHT->lis + thisSPIHT->lisLength, sizeof(SETSTRUCT));
return 0;
}

/* - printSets:
A function to print out the set structure, for debugging purposes
*/
int printSets(SPIHTSTRUCT *thisSPIHT) {
int i;
printf("LIS:\n");
for (i = 0; i < thisSPIHT->lisLength; i++) {
int xIndex, yIndex, colour;
xIndex = (thisSPIHT->lis + i)->x;
yIndex = (thisSPIHT->lis + i)->y;
colour = (thisSPIHT->lis + i)->col;

printf("X: %d\t Y: %d\t C: %d\t Type: %d\n", xIndex, yIndex, colour, (thisSPIHT->lis + i)->isdtype);
}
printf("LSP:\n");
for (i = 0; i < thisSPIHT->lspLength; i++) {
int xIndex, yIndex, colour;
float wcVal;
xIndex = (thisSPIHT->lsp + i)->x;
yIndex = (thisSPIHT->lsp + i)->y;
colour = (thisSPIHT->lsp + i)->col;
wcVal = *(thisSPIHT->wc + colour + thisSPIHT->numColours*(xIndex + thisSPIHT->imageSize*yIndex));

printf("X: %d\t Y: %d\t C: %d\t", xIndex, yIndex, colour);
printf("Value: %f\n", wcVal);
}
printf("LIP:\n");
for (i = 0; i < thisSPIHT->lipLength; i++) {
int xIndex, yIndex, colour;
float wcVal;
xIndex = (thisSPIHT->lip + i)->x;
yIndex = (thisSPIHT->lip + i)->y;
colour = (thisSPIHT->lip + i)->col;
wcVal = *(thisSPIHT->wc + colour + thisSPIHT->numColours*(xIndex + thisSPIHT->imageSize*yIndex));

printf("X: %d\t Y: %d\t C: %d\t", xIndex, yIndex, colour);
printf("Value: %f\n", wcVal);
}

return 0;
}


10.3 MEC.c

This appendix contains the C code written to perform motion estimation and compensation on successive video frames in YUV format.
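For orientation, the sketch below shows the intended usage: MECestimate produces one motion vector per block, the field is upsampled to one vector per pixel, and MECcompensate warps the previous frame to predict the current one, following the same pattern as codePFrame in VIDCODEC.c (Section 10.4). The wrapper function predictFrame and its buffer names are illustrative assumptions, not part of the codec.

#include <stdlib.h>
#include "MEC.h"

/* Predict thisFrame from lastFrame (illustrative only); predicted is a caller-supplied
   output buffer of the same size as the input frames. */
void predictFrame(float *lastFrame, float *thisFrame, float *predicted,
                  int imageSize, int numColours, int blockSize) {
    int x, y, c;
    int MVLength = imageSize / blockSize;
    float *blockMV = (float *)malloc(MVLength*MVLength*2*sizeof(float));
    float *pixelMV = (float *)malloc(imageSize*imageSize*2*sizeof(float));

    // One half-pel-accurate motion vector per blockSize x blockSize block
    MECestimate(lastFrame, thisFrame, imageSize, numColours, blockMV, blockSize);

    // Upsample the block-wise field to one vector per pixel (nearest block)
    for (x = 0; x < imageSize; x++)
        for (y = 0; y < imageSize; y++)
            for (c = 0; c < 2; c++)
                *(pixelMV + c + 2*(x + imageSize*y)) =
                    *(blockMV + c + 2*(x/blockSize + MVLength*(y/blockSize)));

    // Warp lastFrame by the pixel-wise field to form the prediction of thisFrame
    MECcompensate(lastFrame, predicted, imageSize, numColours, pixelMV);

    free((void *)blockMV);
    free((void *)pixelMV);
}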



/* - MEC (Motion Estimation/Compensation)
Estimate the motion between two frames, or compensate for the given motion from the
previous frame.
The three main aims of the algorithm:
1 - Accurate - Generates low-energy residual images
2 - Smooth - Generates 1/f-process residual images (good for wavelet coding)
3 - Fast - Real-time hard computational limit
*/

#include "MEC.h"

/* - MECestimate:
A function to estimate the motion between two consecutive frames of a video sequence,
given those two frames.
Places the resulting motion field in the *MV buffer.

My algorithm:
Start by estimating 4 seed blocks. Throw away the poor matches
as likely occlusions, etc.

Grow each region around the seed (BFS), halting at discontinuities.
Discontinuities are found where the motion vectors suddenly
decorrelate spatially!

Erode each undetermined area, yielding a few more seed blocks for
estimation.

Repeat until time has run out or a sufficient number of blocks are matched?
(Halt condition not determined yet)
*/
// MVs are stored block-wise!
// Input frames must be in YUV format!!!
int MECestimate(float *lastFrame, float *thisFrame, int imageSize, int numColours, float *MV,
int blockSize) {

float matchingThreshold = 400 * blockSize;
int MVLength = imageSize/blockSize;
int iteration;

// BlockList is a list of blocks; completed if earlier in list, active if later in list
BLOCKSTRUCT *blockList = (BLOCKSTRUCT *)malloc(MVLength*MVLength*sizeof(BLOCKSTRUCT));
int blockListLength = 0;
int blockListIndex;
char *estimated = (char *)calloc(MVLength*MVLength, sizeof(char)); //Initially false
float error;

BLOCKSTRUCT tempBlock;

// Start by testing some candidate seeds - a sparse 2x2 array of seed blocks
// Use larger blocks for better matches (hopefully!)
int xb, yb, dx, dy;
for (xb = MVLength/4; xb < MVLength; xb += MVLength/2) {
for (yb = MVLength/4; yb < MVLength; yb += MVLength/2) {
// Obtain a temporary match
error = bestMatch(lastFrame, thisFrame, imageSize, numColours, xb*blockSize, yb*blockSize,
0, 0, SEARCHRANGE, blockSize*2, BIGNUM, (MV + 2*(xb + MVLength*yb))); // Value returned implicitly

// Keep the match regardless of quality, as it's the best possible for this block
if(error < BIGNUM) {
for (dx = 0; dx <= 1; dx++) {
for (dy = 0; dy <= 1; dy++) {
// Set up a block and append it to the block list
tempBlock.x = xb + dx;
tempBlock.y = yb + dy;
if (tempBlock.x < 0 || tempBlock.x >= MVLength || tempBlock.y < 0 || tempBlock.y >= MVLength)
continue;
addToBlockList(&tempBlock, blockList, &blockListLength);



// Update the block states (stored spatially for fast access)
*(estimated + tempBlock.x + MVLength*tempBlock.y) = TRUE;

// Copy this motion vector to the other blocks in the 2x2 block seed
if (dx == 0 && dy == 0) continue;
memcpy((MV + 2*(xb+dx + MVLength*(yb+dy))), (MV + 2*(xb + MVLength*yb)), 2*sizeof(float));
}
}
}
}
}

// Perform region growth on the seeds until regions cease to grow, then generate new seeds
// and repeat
for (iteration = 0; iteration < 10; iteration++) {
/*
Now go through the block list, BFSing outwards (implicitly due to list) until no more
blocks can be estimated from their neighbours
*/
blockListIndex = 0;
while (blockListIndex < blockListLength) {
// For each block, estimate the motion of its neighbours

xb = (blockList + blockListIndex)->x;
yb = (blockList + blockListIndex)->y;
// For each new 4-connected neighbour:
for (dx = -1; dx <= 1; dx++) {
for (dy = -1; dy <= 1; dy++) {
float startMVx, startMVy, error;
int nxb, nyb;
if (((dx + dy + 2) % 2) == 0) continue; // 4-connected search

// Determine neighbouring block, if legal
nxb = xb + dx; // Neighbour X Block
nyb = yb + dy; // Neighbour Y Block
if (nxb < 0 || nxb >= MVLength) continue;
if (nyb < 0 || nyb >= MVLength) continue;

// Don't re-process a block if it's already done!
if (*(estimated + nxb + MVLength*nyb)) continue;

startMVx = *(MV + XMV + 2*(xb + MVLength*yb));
startMVy = *(MV + YMV + 2*(xb + MVLength*yb));

// Get the best match within the range +-0.5, using estimate.
error = bestMatch(lastFrame, thisFrame, imageSize, numColours, nxb*blockSize,
nyb*blockSize, startMVx, startMVy, 0.5, blockSize,
matchingThreshold, (MV + 2*(nxb + MVLength*nyb)));

// If a good match was found, append it to the end of the block list and set it as estimated
if (error < matchingThreshold) {
// ADD TO LIST
tempBlock.x = nxb;
tempBlock.y = nyb;
addToBlockList(&tempBlock, blockList, &blockListLength);

// Set as estimated
*(estimated + nxb + MVLength*nyb) = TRUE;
}
}
}
blockListIndex++;
}
blockListLength = 0; // Effectively wipe the list of the old blocks

// At the end of each iteration, generate some new seeds:

// Feed the 'estimated' array to a function which returns the centre block of the largest
// undetermined region (by ultimate erosion)
// erode returns false if every block is already estimated

if (!erode(estimated, MVLength, &tempBlock)) break;
xb = tempBlock.x;
yb = tempBlock.y;

// Now estimate the motion at the seed position.
// Allow any match here; we need a starting point, and if we don't accept it we loop forever
bestMatch(lastFrame, thisFrame, imageSize, numColours, xb*blockSize, yb*blockSize, 0, 0,
SEARCHRANGE, blockSize*2, 1000000, (MV + 2*(xb + MVLength*yb)));

for (dx = 0; dx <= 1; dx++) {
for (dy = 0; dy <= 1; dy++) {
// Set up a block and append it to the block list
tempBlock.x = xb + dx;
tempBlock.y = yb + dy;
if (tempBlock.x < 0 || tempBlock.x >= MVLength || tempBlock.y < 0 || tempBlock.y >= MVLength)
continue;


addToBlockList(&tempBlock, blockList, &blockListLength);

// Update the block states (stored spatially for fast access)
*(estimated + tempBlock.x + MVLength*tempBlock.y) = TRUE;

// Copy this motion vector to the other blocks in the 2x2 block seed
if (dx == 0 && dy == 0) continue;
memcpy((MV + 2*(xb+dx + MVLength*(yb+dy))), (MV + 2*(xb + MVLength*yb)), 2*sizeof(float));
}
}

} // End iteration bracket

fillUndetermined(lastFrame, thisFrame, imageSize, numColours, MV, MVLength, estimated);

// Display whether they've been estimated or not
/*
for (x = 0; x < MVLength; x++) {
for (y = 0; y < MVLength; y++) {
for (component = 0; component < 2; component++) {
*(MV + component + 2*(x + MVLength*y)) = *(estimated + x + MVLength*y)*SEARCHRANGE -
SEARCHRANGE/2;
}
}
}
*/

// Free the memory allocated for this algorithm
free((void *)blockList);
free((void *)estimated);

return 0;
}


/* - MECcompensate:
A function to compensate for the motion between two consecutive frames of a video sequence,
given the first frame and the motion field.
Places the resulting frame in the *thisFrame buffer.

Assume that motion vectors are always 2-D vectors, in an array of the same size as the image
*/
int MECcompensate(float *lastFrame, float *thisFrame, int imageSize, int numColours, float *MV) {
// The source pixel co-ordinate in lastFrame
int oldX, oldY, col;

// The destination pixel co-ordinate in thisFrame
float newX, newY;

// The fractional component of the (floating-point) destination pixel co-ordinate
float dx, dy;

// The variables used by the bilinear interpolater
int floorX, ceilX, floorY, ceilY;
float alpha, beta, gamma, delta;

for (oldX = 0; oldX < imageSize; oldX++) {
for (oldY = 0; oldY < imageSize; oldY++) {
// Bilinearly interpolate for smooth ME - may not be necessary if speed needed
newX = oldX + *(MV + XMV + 2*(oldX + imageSize*oldY));
newY = oldY + *(MV + YMV + 2*(oldX + imageSize*oldY));

// Saturate, to prevent black spots at edges - effectively continues boundary
if (newX > imageSize - 1) newX = imageSize - 1;
if (newY > imageSize - 1) newY = imageSize - 1;
if (newX < 0) newX = 0;
if (newY < 0) newY = 0;

// Attempting efficient bilinear interpolation
dx = newX - (int)newX;
floorX = (int)newX;
ceilX = floorX + 1;
if (ceilX >= imageSize) ceilX = floorX; // Don't let ceilX go out of bounds either

dy = newY - (int)newY;
floorY = (int)newY;
ceilY = floorY + 1;
if (ceilY >= imageSize) ceilY = floorY;

// Precalculate the bilinear coefficients for greater speed inside the loop
alpha = (1.0-dx)*(1.0-dy);
beta = (1.0-dx)*dy;
gamma = dx*(1.0-dy);
delta = dx*dy;

for (col = 0; col < numColours; col++) {
float pixelVal = alpha * *(lastFrame + col + numColours*(floorX + imageSize*floorY));
pixelVal += beta * *(lastFrame + col + numColours*(floorX + imageSize*ceilY));


pixelVal += gamma * *(lastFrame + col + numColours*(ceilX + imageSize*floorY));
pixelVal += delta * *(lastFrame + col + numColours*(ceilX + imageSize*ceilY));

*(thisFrame + col + numColours*(oldX + imageSize*oldY)) = pixelVal;
}
}
}

return 0;
}

/* - addToBlockList:
A function to add a new block structure to the end of the current list of blocks
*/
int addToBlockList(BLOCKSTRUCT *block, BLOCKSTRUCT *list, int *listLength) {
memcpy((list + *listLength), block, sizeof(BLOCKSTRUCT));
(*listLength)++;
return 0;
}


/* - erode:
Erode a given array down to find the centre of the largest region
We're eroding the FALSE regions - beware of seemingly inverted logic
*/
int erode(char *estimated, int MVLength, BLOCKSTRUCT *centreBlock) {
char *regions, *newRegions;
char halt;

// Check that a block is actually found
centreBlock->x = -1;
centreBlock->y = -1;

// Copy the estimated array into the regions array
regions = (char *)malloc(MVLength*MVLength*sizeof(char));
memcpy(regions, estimated, MVLength*MVLength*sizeof(char));

// Copy the estimated array into the newRegions array
newRegions = (char *)malloc(MVLength*MVLength*sizeof(char));
memcpy(newRegions, estimated, MVLength*MVLength*sizeof(char));

// Erode until nothing left, keeping track of at least one potential seed
halt = FALSE;
while(!halt) {
int xb, yb, dx, dy;

halt = TRUE;

// For each unestimated region
for (xb = 0; xb < MVLength; xb++) {
for (yb = 0; yb < MVLength; yb++) {
if (!*(regions + xb + MVLength*yb)) { // If this region has not been estimated...
// See if any 4-connected neighbour has been estimated - if so, set this block as estimated
char keepBlock = TRUE;
for (dx = -1; (dx <= 1) && keepBlock; dx++) {
for (dy = -1; (dy <= 1) && keepBlock; dy++) {
int tx, ty;

if (((dx + dy + 2) % 2) == 0) continue; // 4-connected search

// Check I'm not out of bounds
tx = xb + dx;
ty = yb + dy;
if (tx < 0 || tx >= MVLength || ty < 0 || ty >= MVLength) {
// If we've found an edge block, automatically erode - they don't make good seeds
keepBlock = FALSE;
continue;
}

// If any neighbouring block has been estimated, don't keep this block
if (*(regions + tx + MVLength*ty)) {
keepBlock = FALSE;
}
}
}

if (!keepBlock) {
// Set this point as TRUE ('estimated') - thus eroding the unestimated regions
*(newRegions + xb + MVLength*yb) = TRUE;

// Store this as the new centre block
// Only if it's within bounds
if (!(xb < 0 || xb >= MVLength-1 || yb < 0 || yb >= MVLength-1)) {
centreBlock->x = xb;
centreBlock->y = yb;
}

// Let the algorithm know not to halt yet


if (halt) halt = FALSE;
}
}
}
}

// Copy the new regions over the old for the next iteration
if (!halt) memcpy(regions, newRegions, MVLength*MVLength*sizeof(char));
}

free((void *)regions);
free((void *)newRegions);

// Let the caller know if we found any block to keep!
return (centreBlock->x != -1);
}

/* - fillUndetermined:
Fill in the undetermined regions of the motion vector array
- Dilate the known regions into the unknown ones
*/
int fillUndetermined(float *lastFrame, float *thisFrame, int imageSize, int numColours,
float *MV, int MVLength, char *estimated) {

int blockSize = imageSize/MVLength;
char *done, *newDone;
char halt;
float *tempMV;
float startMVx, startMVy;
float error, leastError;
done = (char *)malloc(MVLength*MVLength*sizeof(char));
newDone = (char *)malloc(MVLength*MVLength*sizeof(char));
tempMV = (float *)malloc(2*sizeof(float));

// Copy the estimated array across to initialise
memcpy(done, estimated, MVLength*MVLength*sizeof(char));
memcpy(newDone, estimated, MVLength*MVLength*sizeof(char));

halt = FALSE;
while(!halt) {
int xb, yb, dx, dy, tx, ty;

halt = TRUE;

// (xb, yb) is the unestimated block
// (tx, ty) is a neighbouring estimated block
for (xb = 0; xb < MVLength; xb++) {
for (yb = 0; yb < MVLength; yb++) {
if (!*(done + xb + MVLength*yb)) {
// If not done, take best neighbour as match
leastError = BIGNUM;
for (dx = -1; dx <= 1; dx++) {
for (dy = -1; dy <= 1; dy++) {
if (((dx + dy + 2) % 2) == 0) continue; // 4-connected search

tx = xb + dx;
ty = yb + dy;
if (tx < 0 || tx >= MVLength || ty < 0 || ty >= MVLength) continue;

if (!*(done + tx + MVLength*ty)) continue;

// Estimate the motion based on all neighbours, and take the best
startMVx = *(MV + XMV + 2*(tx + MVLength*ty));
startMVy = *(MV + YMV + 2*(tx + MVLength*ty));
error = bestMatch(lastFrame, thisFrame, imageSize, numColours, xb*blockSize, yb*blockSize,
startMVx, startMVy, 1, blockSize, leastError, tempMV);
if (error < leastError) {
leastError = error;
memcpy((MV + 2*(xb + MVLength*yb)), tempMV, 2*sizeof(float));
}
}
}
// Update node status and algorithm halt flag
if (leastError < BIGNUM) {
*(newDone + xb + MVLength*yb) = TRUE;
if (halt) halt = FALSE;
}
}
}
}
// Copy the changes across for the next iteration
memcpy(done, newDone, MVLength*MVLength*sizeof(char));
}

// Free the memory used by this function
free((void *)done);
free((void *)newDone);
free((void *)tempMV);



return 0;
}


/* - bestMatch:
Determine the best match for the given block between the previous and current frames, starting
the search from the specified motion vector estimate.
*/
float bestMatch(float *lastFrame, float *thisFrame, int imageSize, int numColours, int startX,
int startY, float startMVx, float startMVy, float searchRange, int thisBlockSize, float lse,
float *MVPtr) {

float MVx, MVy, sse;

// Search exhaustively around the given starting point for the best match
// START AT CENTRE FOR GREATER SPEED
sse = fastError(lastFrame, thisFrame, imageSize, numColours, startX, startY,
startMVx, startMVy, lse, thisBlockSize);

// Check to see if this vector is better than the previous best...
if (sse < lse) {
lse = sse;
*(MVPtr + XMV) = startMVx;
*(MVPtr + YMV) = startMVy;
}

for (MVx = -searchRange + startMVx; MVx <= searchRange + startMVx; MVx+= 0.5) {
for (MVy = -searchRange + startMVy; MVy <= searchRange + startMVy; MVy+= 0.5) {
if (MVx == startMVx && MVy == startMVy) continue;

// Calculate SSE for the given position
sse = fastError(lastFrame, thisFrame, imageSize, numColours, startX, startY,
MVx, MVy, lse, thisBlockSize);

sse *= 1.5; //Emphasise non-relatively-moving blocks!

// Check to see if this vector is better than the previous best...
if (sse < lse) {
lse = sse;
*(MVPtr + XMV) = MVx;
*(MVPtr + YMV) = MVy;
}
}
}

return lse;
}


/* - fastError:
Return the SS error for the given floating-point MV - must be half-pixel res!
*/
float fastError(float *lastFrame, float *thisFrame, int imageSize, int numColours,
int cornerX, int cornerY, float MVx, float MVy, float lse, int thisBlockSize) {
// currCorner is corner of block in current frame
// MV is motion vector (pointing backwards in time!)
char boundsApply;
char halfX, halfY;
int i, j;
float sse = 0;

float lastX, lastY;
float pixelVal, error;
int floorX, floorY, ceilX, ceilY;
int colour;

// Determine where half-pel accuracy is required
halfX = !(MVx == (int)MVx);
halfY = !(MVy == (int)MVy);

// Determine before the main loop whether bounds apply here
boundsApply = FALSE;
if (cornerX + MVx < 0) boundsApply = TRUE;
else if (cornerX + MVx + thisBlockSize + 1 >= imageSize - 1) boundsApply = TRUE;
else if (cornerY + MVy < 0) boundsApply = TRUE;
else if (cornerY + MVy + thisBlockSize + 1 >= imageSize - 1) boundsApply = TRUE;

// Calculate SSE for the given position
// (i, j) is the current pixel in the current frame, cycling over whole block
for (i = cornerX; i < cornerX + thisBlockSize; i++) {
for (j = cornerY; j < cornerY + thisBlockSize; j++) {

// (lastX, lastY) is the current pixel in the previous frame
lastX = (float)i + MVx;
lastY = (float)j + MVy;
floorX = (int)lastX;
floorY = (int)lastY;



// Saturate, to prevent black spots at edges
if (boundsApply) {
if (floorX < 0) floorX = 0;
else if (floorX >= imageSize - 1) floorX = imageSize - 1;
if (floorY < 0) floorY = 0;
else if (floorY >= imageSize - 1) floorY = imageSize - 1;
}

ceilX = floorX + 1;
ceilY = floorY + 1;

if (boundsApply) {
if (ceilX >= imageSize - 1) ceilX = floorX;
if (ceilY >= imageSize - 1) ceilY = floorY;
}

// Bilinearly interpolate to half-pixel resolution:
for (colour = 0; colour < 3; colour++) {
if (!halfX && !halfY)
pixelVal = *(lastFrame + colour + numColours*(floorX + imageSize*floorY));
else if (!halfX && halfY)
pixelVal = (*(lastFrame + colour + numColours*(floorX + imageSize*floorY))
+ *(lastFrame + colour + numColours*(floorX + imageSize*ceilY)))/2;
else if (halfX && !halfY)
pixelVal = (*(lastFrame + colour + numColours*(floorX + imageSize*floorY))
+ *(lastFrame + colour + numColours*(ceilX + imageSize*floorY)))/2;
else if (halfX && halfY)
pixelVal = (*(lastFrame + colour + numColours*(floorX + imageSize*floorY))
+ *(lastFrame + colour + numColours*(floorX + imageSize*ceilY))
+ *(lastFrame + colour + numColours*(ceilX + imageSize*floorY))
+ *(lastFrame + colour + numColours*(ceilX + imageSize*ceilY)))/4;

error = pixelVal - *(thisFrame + colour + numColours*(i + imageSize*j)); // Try Y-component only?

// Changing the following line alters the error type
sse += SQR(error);

// Early breakout - 2.5 times faster!
if (sse > lse) {
return 1000000;
}
}
}
}

return sse;
}


10.4 VIDCODEC.c

This appendix contains the C code written to encapsulate the low-level compression and decompression functions and to perform bit allocation and parsing.
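For orientation, the sketch below shows how the codec object is driven at the transmitter. The bit buffers (frameBitBuffer, overheadBitBuffer), the frame buffers and the bit budget numVideoDataBits are illustrative names for resources prepared and managed by the surrounding system code, which is not reproduced in this appendix.

#include "VIDCODEC.h"

/* Illustrative transmitter-side driver (a sketch only). Note that codeIFrame
   wavelet-transforms its input in place, so iFrameCopy should be a working copy
   of the first captured frame; frame0 and frame1 are the first two original frames. */
void encodeTwoFrames(float *iFrameCopy, float *frame0, float *frame1,
                     BITBUFFERSTRUCT *frameBitBuffer, BITBUFFERSTRUCT *overheadBitBuffer,
                     int imageSize, int numVideoDataBits) {
    VIDCODECSTRUCT codec;

    initVIDCODEC(imageSize, TRUE, &codec);        // TRUE = coding (transmitter) side
    codec.overheadBitBuffer = overheadBitBuffer;  // Carries the per-frame MV bit count

    // Independent frame to start the sequence
    codeIFrame(iFrameCopy, frameBitBuffer, &codec);

    // Subsequent frames are coded predictively from the previous original frame;
    // the frame bit buffer is assumed to be reset by the caller between frames
    codePFrame(frame0, frame1, frameBitBuffer, &codec, numVideoDataBits);

    closeVIDCODEC(&codec);
}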



/* - VIDCODEC (Video Compression and Decompression)
Encapsulates the video compression and decompression into an object-like form.
VIDCODECSTRUCT stores the state of the codec
I/O is by simple RGB buffers, with the caller responsible for interfacing to file/screen
This class operates on a frame-by-frame basis; the caller is responsible for keeping
track of frame numbers, etc.
*/
#include "VIDCODEC.h"

/* - initVIDCODEC:
Initialise a VIDCODECSTRUCT object.
*/
int initVIDCODEC(int imageSize, char writing, VIDCODECSTRUCT *thisVIDCODEC) {

// Store the dimensions of the data
thisVIDCODEC->imageSize = imageSize;
thisVIDCODEC->log2imageSize = (int)( log(imageSize)/log(2) + 0.5);

// Store whether we're coding or decoding:
thisVIDCODEC->writing = writing;

// Allocate memory for the receiver's frame (the shared knowledge)
thisVIDCODEC->sharedFrame = (float *)calloc(imageSize*imageSize*NUMCOLOURS, sizeof(float));
assert(thisVIDCODEC->sharedFrame != NULL);

return 0;
}


/* - closeVIDCODEC:
Destroy the VIDCODECSTRUCT object and its variables
*/
int closeVIDCODEC(VIDCODECSTRUCT *thisVIDCODEC) {
free((void *)thisVIDCODEC->sharedFrame);
return 0;
}


/* - codePFrame:
Code a Predicted frame. This consists of:
1 - Estimating the motion between the two perfect input frames
2 - Compressing/decompressing the motion vectors (sending the compressed version)
3 - Compensating the sharedFrame by the (compressed) motion vectors
4 - Calculating the residual by subtraction
5 - Compressing/decompressing the residual (sending the compressed version)
6 - Adding the residual back on -> storing it as the sharedFrame
*/
int codePFrame(float *lastFrame, float *thisFrame, BITBUFFERSTRUCT *MVResBitBuffer,
VIDCODECSTRUCT *thisVIDCODEC, int numVideoDataBits) {

int x, y, component, i;
float *blockMV, *MV, *tempImage, *residue;
int mvBits;

float *motherWavelet, *fatherWavelet;
int motherLength = 9, fatherLength = 7;

SPIHTSTRUCT spiht;
spiht.bitBuffer = MVResBitBuffer;

// Estimate the motion between two frames
blockMV = (float *)malloc(SQR(thisVIDCODEC->imageSize/BLOCKSIZE)*2*sizeof(float));

MECestimate(lastFrame, thisFrame, thisVIDCODEC->imageSize, NUMCOLOURS, blockMV,
BLOCKSIZE);

// Convert the blockwise MV's to pixelwise MVs - simplistic upsampling
MV = (float *)malloc(SQR(thisVIDCODEC->imageSize)*2*sizeof(float));
for (x = 0; x < thisVIDCODEC->imageSize; x++) {


for (y = 0; y < thisVIDCODEC->imageSize; y++) {
for (component = 0; component < 2; component++) {
*(MV + component + 2*(x + thisVIDCODEC->imageSize*y)) =
*(blockMV + component + 2*(x/BLOCKSIZE + (thisVIDCODEC->imageSize/BLOCKSIZE)*(y/BLOCKSIZE)));
}
}
}
// I've finished with the block motion vectors now
free((void *)blockMV);

// COMPRESS/DECOMPRESS MVs
motherWavelet = (float *)malloc(motherLength * sizeof(float));
fatherWavelet = (float *)malloc(fatherLength * sizeof(float));

*(motherWavelet + 0) = .037828455506995f;
*(motherWavelet + 1) = -.023849465019380f;
*(motherWavelet + 2) = -.11062440441842f;
*(motherWavelet + 3) = .37740285561265f;
*(motherWavelet + 4) = .85269867900940f; //
*(motherWavelet + 5) = .37740285561265f;
*(motherWavelet + 6) = -.11062440441842f;
*(motherWavelet + 7) = -.023849465019380f;
*(motherWavelet + 8) = .037828455506995f;

*(fatherWavelet + 0) = -.064538882628938f;
*(fatherWavelet + 1) = -.040689417609558f;
*(fatherWavelet + 2) = .41809227322221f;
*(fatherWavelet + 3) = .78848561640566f; //
*(fatherWavelet + 4) = .41809227322221f;
*(fatherWavelet + 5) = -.040689417609558f;
*(fatherWavelet + 6) = -.064538882628938f;

fwt2(MV, thisVIDCODEC->imageSize, motherWavelet, motherLength, fatherWavelet, fatherLength, 2);

// *****************************************************************************
// Compress the motion vectors
initSPIHT(MV, thisVIDCODEC->imageSize, 2, TRUE, thisVIDCODEC->log2imageSize + 3, &spiht);
spiht.haltThresh = MVHALTTHRESH; // Allow it to exit early
toSPIHT(&spiht);
// Round the coded bits down to the desired size - effectively only send full bytes
mvBits = spiht.bitBuffer->currentBitNum;
closeSPIHT(&spiht);

// Store the number of bits used to code the motion vectors
for (i = 0; i < 9; i++) {
pack((mvBits & (1<<i)) > 0, thisVIDCODEC->overheadBitBuffer);
}

// Now decompress
memset(MV, 0, SQR(thisVIDCODEC->imageSize)*2*sizeof(float)); // WIPE IT FIRST!

// Restart the MVbitBuffer for re-reading by the decoder
MVResBitBuffer->maxBitLength = MVResBitBuffer->currentBitNum;
MVResBitBuffer->currentBitNum = 0;
MVResBitBuffer->endOfBuffer = FALSE;
initSPIHT(MV, thisVIDCODEC->imageSize, 2, FALSE, thisVIDCODEC->log2imageSize + 3, &spiht);
fromSPIHT(&spiht);
closeSPIHT(&spiht);

// Inverse wavelet transform the MVs
iwt2(MV, thisVIDCODEC->imageSize, motherWavelet, motherLength, fatherWavelet, fatherLength, 2);


// Motion compensate
// Compensate for the motion in the image and test the accuracy of this technique
tempImage = (float *)malloc(SQR(thisVIDCODEC->imageSize)*NUMCOLOURS*sizeof(float));

// MV is the decompressed motion vectors
MECcompensate(thisVIDCODEC->sharedFrame, tempImage, thisVIDCODEC->imageSize,
NUMCOLOURS, MV);

// The motion vectors are now finished with
free((void *)MV);

memcpy(thisVIDCODEC->sharedFrame, tempImage, SQR(thisVIDCODEC->imageSize)*NUMCOLOURS*sizeof(float));

// sharedFrame now contains the motion compensated new frame
free((void *)tempImage);

// RESIDUE
residue = (float *)malloc(SQR(thisVIDCODEC->imageSize)*NUMCOLOURS*sizeof(float));

// residue = actual frame - estimate
for (i = 0; i < SQR(thisVIDCODEC->imageSize)*NUMCOLOURS; i++) {
*(residue + i) = *(thisFrame + i) - *(thisVIDCODEC->sharedFrame + i);
}



fwt2(residue, thisVIDCODEC->imageSize, motherWavelet, motherLength, fatherWavelet, fatherLength,
NUMCOLOURS);

// Compress the residues
MVResBitBuffer->maxBitLength = numVideoDataBits;
MVResBitBuffer->endOfBuffer = FALSE;
// Continuing from the point at which the MVs ended
initSPIHT(residue, thisVIDCODEC->imageSize, NUMCOLOURS, TRUE, thisVIDCODEC->log2imageSize + 2, &spiht);
toSPIHT(&spiht);
closeSPIHT(&spiht);

// Now convert it back again
setmem(residue, SQR(thisVIDCODEC->imageSize)*NUMCOLOURS*sizeof(float), 0); // WIPE IT FIRST!
// Restart the resBitBuffer for decoding again
MVResBitBuffer->currentBitNum = mvBits; // Go back to where the MVs ended
MVResBitBuffer->endOfBuffer = FALSE;
initSPIHT(residue, thisVIDCODEC->imageSize, NUMCOLOURS, FALSE, thisVIDCODEC->log2imageSize + 2, &spiht);
fromSPIHT(&spiht);
closeSPIHT(&spiht);

iwt2(residue, thisVIDCODEC->imageSize, motherWavelet, motherLength, fatherWavelet, fatherLength,
NUMCOLOURS);

// I'm finished with the wavelets now
free((void *)motherWavelet);
free((void *)fatherWavelet);

// Result seen by decoder = estimate + residue
for (i = 0; i < SQR(thisVIDCODEC->imageSize)*NUMCOLOURS; i++) {
// result = motion compensated frame + decoded residue
*(thisVIDCODEC->sharedFrame + i) += *(residue + i);
}
free((void *)residue);

return 0;
}


/* - decodePFrame:
Decode a Predicted frame. This consists of:
1 - Decompressing the motion vectors
2 - Compensating the sharedFrame by the motion vectors
3 - Decompressing the residual
4 - Adding the residual on -> storing it as the sharedFrame
*/
int decodePFrame(BITBUFFERSTRUCT *MVResBitBuffer, VIDCODECSTRUCT *thisVIDCODEC,
int numVideoDataBits) {

int i;
float *tempImage, *MV, *residue;
float *motherWavelet, *fatherWavelet;
int motherLength = 9, fatherLength = 7;

int mvBits; // MV is coded adaptively for quality, with residue taking up remaining space

SPIHTSTRUCT spiht;

motherWavelet = (float *)malloc(motherLength * sizeof(float));
fatherWavelet = (float *)malloc(fatherLength * sizeof(float));

*(motherWavelet + 0) = .037828455506995f;
*(motherWavelet + 1) = -.023849465019380f;
*(motherWavelet + 2) = -.11062440441842f;
*(motherWavelet + 3) = .37740285561265f;
*(motherWavelet + 4) = .85269867900940f; //
*(motherWavelet + 5) = .37740285561265f;
*(motherWavelet + 6) = -.11062440441842f;
*(motherWavelet + 7) = -.023849465019380f;
*(motherWavelet + 8) = .037828455506995f;

*(fatherWavelet + 0) = -.064538882628938f;
*(fatherWavelet + 1) = -.040689417609558f;
*(fatherWavelet + 2) = .41809227322221f;
*(fatherWavelet + 3) = .78848561640566f; //
*(fatherWavelet + 4) = .41809227322221f;
*(fatherWavelet + 5) = -.040689417609558f;
*(fatherWavelet + 6) = -.064538882628938f;

// Decode the motion vectors
spiht.bitBuffer = MVResBitBuffer;

MV = (float *)calloc(SQR(thisVIDCODEC->imageSize)*2, sizeof(float));



// Obtain the number of MV bits from the overhead bit buffer
mvBits = 0;
for (i = 0; i < 9; i++) {
if(unpack(thisVIDCODEC->overheadBitBuffer)) mvBits += (1<<i);
}

spiht.bitBuffer->maxBitLength = mvBits;
initSPIHT(MV, thisVIDCODEC->imageSize, 2, FALSE, thisVIDCODEC->log2imageSize + 3, &spiht);
fromSPIHT(&spiht);
closeSPIHT(&spiht);

iwt2(MV, thisVIDCODEC->imageSize, motherWavelet, motherLength, fatherWavelet, fatherLength, 2);

// Motion compensate
// Compensate for the motion in the image and test the accuracy of this technique
tempImage = (float *)malloc(SQR(thisVIDCODEC->imageSize)*NUMCOLOURS*sizeof(float));

// MV is the decompressed motion vectors
MECcompensate(thisVIDCODEC->sharedFrame, tempImage, thisVIDCODEC->imageSize,
NUMCOLOURS, MV);

// The motion vectors are now finished with
free((void *)MV);

memcpy(thisVIDCODEC->sharedFrame, tempImage, SQR(thisVIDCODEC->imageSize)*NUMCOLOURS*sizeof(float));

// sharedFrame now contains the motion compensated new frame
free((void *)tempImage);

// Decode the residue
residue = (float *)calloc(SQR(thisVIDCODEC->imageSize)*NUMCOLOURS, sizeof(float));

// Decompressing the residue
spiht.bitBuffer->maxBitLength = numVideoDataBits; // Extending the buffer's readable portion
spiht.bitBuffer->endOfBuffer = FALSE;
initSPIHT(residue, thisVIDCODEC->imageSize, NUMCOLOURS, FALSE, thisVIDCODEC->log2imageSize + 2, &spiht);
fromSPIHT(&spiht);
closeSPIHT(&spiht);

iwt2(residue, thisVIDCODEC->imageSize, motherWavelet, motherLength, fatherWavelet, fatherLength,
NUMCOLOURS);

// Result seen by decoder = estimate + residue
for (i = 0; i < SQR(thisVIDCODEC->imageSize)*NUMCOLOURS; i++) {
*(thisVIDCODEC->sharedFrame + i) += *(residue + i);
}

// We've now finished with the residue buffer
free((void *)residue);

// We've finished with the biorthogonal wavelets
free((void *)motherWavelet);
free((void *)fatherWavelet);

return 0;
}

/* - codeIFrame:
Code an Independent frame. This consists of:
1 - Coding the frame (using SPIHT)
2 - Decoding the frame -> storing it as the sharedFrame
*/
int codeIFrame(float *Iframe, BITBUFFERSTRUCT *bitBuffer, VIDCODECSTRUCT *thisVIDCODEC)
{
float *motherWavelet, *fatherWavelet;
int motherLength = 9, fatherLength = 7;
SPIHTSTRUCT spiht;
motherWavelet = (float *)malloc(motherLength * sizeof(float));
fatherWavelet = (float *)malloc(fatherLength * sizeof(float));

*(motherWavelet + 0) = .037828455506995f;
*(motherWavelet + 1) = -.023849465019380f;
*(motherWavelet + 2) = -.11062440441842f;
*(motherWavelet + 3) = .37740285561265f;
*(motherWavelet + 4) = .85269867900940f; //
*(motherWavelet + 5) = .37740285561265f;
*(motherWavelet + 6) = -.11062440441842f;
*(motherWavelet + 7) = -.023849465019380f;
*(motherWavelet + 8) = .037828455506995f;

*(fatherWavelet + 0) = -.064538882628938f;
*(fatherWavelet + 1) = -.040689417609558f;
*(fatherWavelet + 2) = .41809227322221f;


*(fatherWavelet + 3) = .78848561640566f; //
*(fatherWavelet + 4) = .41809227322221f;
*(fatherWavelet + 5) = -.040689417609558f;
*(fatherWavelet + 6) = -.064538882628938f;

// Begin image compression/decompression cycle
fwt2(Iframe, thisVIDCODEC->imageSize, motherWavelet, motherLength, fatherWavelet, fatherLength,
NUMCOLOURS);

// Compress and decompress the image
spiht.bitBuffer = bitBuffer;
initSPIHT(Iframe, thisVIDCODEC->imageSize, NUMCOLOURS, TRUE, thisVIDCODEC->log2imageSize + 7, &spiht);
toSPIHT(&spiht);
closeSPIHT(&spiht);

free((void *)motherWavelet);
free((void *)fatherWavelet);

// Restart the bit buffer, so that the decoder can read the compressed data
bitBuffer->currentBitNum = 0;
bitBuffer->endOfBuffer = FALSE;
// Convert it back again, so that sharedFrame is kept the same for tx and rx
decodeIFrame(bitBuffer, thisVIDCODEC);

return 0;
}

/* - decodeIFrame:
Decode an Independent frame. This consists of:
1 - Decoding the frame (using SPIHT) -> storing it as the sharedFrame
*/
int decodeIFrame(BITBUFFERSTRUCT *bitBuffer, VIDCODECSTRUCT *thisVIDCODEC) {
float *motherWavelet, *fatherWavelet;
int motherLength = 9, fatherLength = 7;
SPIHTSTRUCT spiht;

motherWavelet = (float *)malloc(motherLength * sizeof(float));
fatherWavelet = (float *)malloc(fatherLength * sizeof(float));

*(motherWavelet + 0) = .037828455506995f;
*(motherWavelet + 1) = -.023849465019380f;
*(motherWavelet + 2) = -.11062440441842f;
*(motherWavelet + 3) = .37740285561265f;
*(motherWavelet + 4) = .85269867900940f; //
*(motherWavelet + 5) = .37740285561265f;
*(motherWavelet + 6) = -.11062440441842f;
*(motherWavelet + 7) = -.023849465019380f;
*(motherWavelet + 8) = .037828455506995f;

*(fatherWavelet + 0) = -.064538882628938f;
*(fatherWavelet + 1) = -.040689417609558f;
*(fatherWavelet + 2) = .41809227322221f;
*(fatherWavelet + 3) = .78848561640566f; //
*(fatherWavelet + 4) = .41809227322221f;
*(fatherWavelet + 5) = -.040689417609558f;
*(fatherWavelet + 6) = -.064538882628938f;

// Read the image into the shared frame buffer
setmem(thisVIDCODEC->sharedFrame, SQR(thisVIDCODEC->imageSize)*NUMCOLOURS*sizeof(float), 0); // WIPE IT FIRST!

// Link the 2 structures used by the image decoding algorithm
spiht.bitBuffer = bitBuffer;

// Initialise the decoder with the required state information
initSPIHT(thisVIDCODEC->sharedFrame, thisVIDCODEC->imageSize, NUMCOLOURS,
FALSE, thisVIDCODEC->log2imageSize + 7, &spiht);

// Decode the data
fromSPIHT(&spiht);

// Feed the structure to my garbage collector
closeSPIHT(&spiht);

// Inverse wavelet transform and invert the colour transform too
iwt2(thisVIDCODEC->sharedFrame, thisVIDCODEC->imageSize, motherWavelet, motherLength,
fatherWavelet, fatherLength, NUMCOLOURS);

free((void *)motherWavelet);
free((void *)fatherWavelet);

return 0;
}
