
SIFT-BASED IMAGE COMPRESSION

Huanjing Yue¹*, Xiaoyan Sun², Feng Wu², Jingyu Yang¹

¹Tianjin University, Tianjin, China
²Microsoft Research Asia, Beijing, China

¹{dayueer, yjy}@tju.edu.cn, ²{xysun, fengwu}@microsoft.com

ABSTRACT

This paper proposes a novel image compression scheme
based on the local feature descriptor - Scale Invariant
Feature Transform (SIFT). The SIFT descriptor
characterizes an image region invariantly to scale and
rotation. It is used widely in image retrieval. By using SIFT
descriptors, our compression scheme is able to make use of
external image contents to reduce visual redundancy among
images. The proposed encoder compresses an input image by SIFT descriptors rather than pixel values. To reduce the coding bits, it separates the SIFT descriptors of the image into two groups: a visual description, which is a significantly subsampled image with key SIFT descriptors embedded, and a set of differential SIFT descriptors. The corresponding decoder generates the SIFT descriptors from the visual description and the differential set. The SIFT descriptors are used in our SIFT-based matching to retrieve candidate predictive patches from a large image dataset. These candidate patches are then integrated into the visual description, producing the final reconstructed image. Our preliminary but promising results demonstrate the effectiveness of the proposed image coding scheme towards perceptual quality. The proposed scheme provides a feasible approach to making use of the visual correlation among images.

Index Terms: Image compression, image coding, SIFT, perceptual quality, local feature descriptor, feature extraction
1. INTRODUCTION
Image compression, a crucial component of digital image
applications, has been improved greatly over the past two
decades. A number of advanced image compression
schemes have been introduced since the well-known JPEG
standard, driven by the explosion of digital images over the
Internet. State-of-the-art mainstream image coding schemes
such as JPEG 2000, MPEG-4 AVC/H.264 intra coding, and
the emerging HEVC intra coding (HEVC in short) [1] are
examples that greatly outperform previous generations in
terms of coding efficiency. Digging into the detailed
techniques, one can easily observe that these mainstream
coding schemes make use of the statistical correlation
among pixels within an image for compression. From JPEG
to the latest HEVC, an increasing number of prediction
directions, partitions and transforms have been introduced in
the pursuit of high coding efficiency; small improvements
are accomplished with the high cost of increased complexity
in both the encoder and decoder. One cannot help worrying about the future of mainstream compression schemes if they rely only on techniques based on classical information theory that exploit the statistical redundancy within an image for high pixel-level fidelity.
Based on the human visual system (HVS) [2], visual
redundancy has been considered in addition to statistical
redundancy in the development of compression techniques
toward perceptual quality. These schemes compress an
image by identifying and utilizing features within the image
to achieve high coding efficiency [3]. For example, edge-based and segmentation-based coding schemes [4][5] have been presented based on the human eye's ability to identify edges and group similar regions. Nevertheless, the development of these schemes, though promising, is influenced greatly by the availability and effectiveness of HVS models as well as of certain algorithms, such as edge detection and segmentation tools.
Moreover, they still only consider the redundancy, either
statistical or visual, within one image.
Image compression using external image datasets has
also been investigated. Generally speaking, vector
quantization (VQ) [6] can be regarded as an example in
which the codebooks are trained from an external image
dataset. Later, visual patterns learned at patch level are
introduced in vector quantization for the compression of
edge regions [7][8]. However, these VQ-based schemes measure correlations by pixel values in the training and matching processes, which makes them incapable of utilizing the visual correlation effectively.
With the explosion of digital images over the Internet, there is a trend to investigate new technologies that make use of this massive number of images. Recent efforts have shed light on the fact that local feature descriptors not only help enhance image search but also provide hints for visually interpreting image content. Local feature descriptors describe image regions by gradient histograms around selected key-points and are invariant to image scale and rotation.
Based on local feature descriptors, such as SIFT [9], the
* This work was done during Huanjing Yue's internship at Microsoft Research Asia.


visual correlations among images are explored in order to
help solve the problems of image inpainting and completion
[10][11]. Later on, Weinzaepfel et al. demonstrate that
meaningful reconstruction is achievable merely based on
SIFT descriptors [12]. The follow-up work presented in the
project report [13] tries to reconstruct an image using SIFT
and Speeded-Up Robust Features (SURF) descriptors [14],
with or without scale information. All these works
encourage us to investigate high performance compression
solutions given a large number of images and the
corresponding local feature descriptors.
In this paper, we propose a pilot study on image
compression using local feature descriptors to explore the
visual redundancy among images. We adopt the SIFT
descriptor which is widely used in image search and object
recognition. It is the SIFT descriptors instead of the true
pixel values that are encoded in our compression scheme.
However, coding the SIFT descriptors directly would consume a large number of bits. For an efficient representation, we propose not compressing the SIFT descriptors directly but separating them into two groups and encoding each group accordingly. One group, named the visual description, is a heavily simplified version of the original image in which the key SIFT descriptors are embedded; it is coded as a common image. The other, called the differential SIFT set, consists of the rest of the important SIFT descriptors and is coded as assisting information. At the decoder side, the SIFT descriptors
extracted from the two groups are used to retrieve the
candidate patches from a large image dataset based on the
similarity between SIFT descriptors. The final reconstructed
image is produced by integrating the candidate patches into
the visual description. Experimental results demonstrate that our scheme is promising in significantly reducing the visual redundancy of images on top of current transform-based coding schemes. Visually comparable reconstructions
are achievable with our SIFT-based image compression in
comparison to mainstream image coding schemes.
The rest of this paper is organized as follows: Section 2 introduces the basic idea of the SIFT descriptor; Section 3
presents our SIFT-based compression scheme in detail;
Experimental results are reported in Section 4; Section 5
concludes the paper and discusses future work.
2. SIFT DESCRIPTOR
The SIFT descriptor presented in [9] provides a solution that
characterizes an image region invariantly to image scale and
rotation so that it can be used to perform robust matching
between different views of an object or scene.
Given an input image $I$, the set of its SIFT descriptors is denoted as $Z_I$:

$$Z_I = \Phi(I), \qquad (1)$$

where $\Phi(\cdot)$ represents the SIFT generation process.
Each SIFT descriptor $z_i \in Z_I$ is defined as

$$z_i = \{v_i, x_i, s_i, o_i\}, \qquad (2)$$

where $v_i \in \mathbb{R}^l$ is an $l$-dimensional appearance descriptor, and $x_i = (x_i, y_i)$, $s_i$, and $o_i$ represent the spatial coordinates, scale, and dominant gradient orientation of the $i$-th key-point, respectively.
The SIFT key-points are determined by finding the local maxima and minima of the difference of Gaussians across multiple scales. Each key-point is assigned one or more dominant orientations based on the directions of the local image gradients, which are calculated from the neighboring pixels around the key-point in the Gaussian-blurred image. Note that low-contrast candidates and edge response points are discarded from the key-point set.
point set. For detailed information about SIFT generation,
please refer to [9].
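As a concrete illustration of (1) and (2), the minimal sketch below extracts SIFT descriptors with OpenCV; the paper does not name an implementation, so this choice, like the input file name, is an assumption.

```python
# Extracting Z_I and the fields of eq. (2) with OpenCV's SIFT; the
# paper does not specify an implementation, so this is illustrative.
import cv2

img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file
sift = cv2.SIFT_create()
keypoints, vectors = sift.detectAndCompute(img, None)  # Z_I = Phi(I)

for kp, v in zip(keypoints, vectors):
    x, y = kp.pt      # spatial coordinates x_i = (x_i, y_i)
    s = kp.size       # scale s_i of the key-point
    o = kp.angle      # dominant gradient orientation o_i, in degrees
    assert v.shape == (128,)   # the appearance descriptor v_i
```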
We observe that it is unrealistic to directly code all the SIFT descriptors of one image in our compression scheme. In a SIFT descriptor, the vector $v_i$ is defined as a histogram of the gradient orientations around key-point $x_i$. The histogram is a three-dimensional lattice with 4 bins for each spatial direction and 8 bins for the orientation, for a total of $4 \times 4 \times 8 = 128$ bins. Together with the information on location, scale, and orientation, directly coding a SIFT descriptor is thus very bit-consuming.
Moreover, a SIFT descriptor contains less information than the actual pixels of an image; in particular, it lacks true grey-level and color information. Matching between SIFT descriptors, though efficient for search and recognition, is not accurate enough by itself for image compression. It is therefore necessary to investigate SIFT-based representations suited to image compression.
3. SIFT-BASED IMAGE COMPRESSION
Our SIFT-based image compression scheme is introduced in
this section. First, the framework of our coding scheme is
presented. Then, the modules employed in the encoder and
decoder will be discussed in detail.
3.1. Overview
The basic idea of our SIFT-based image coding scheme is
illustrated in Fig. 1.
On the encoder side, the visual description containing
key SIFT descriptors is first extracted and encoded. The set of differential SIFT descriptors, containing the rest of the important SIFT descriptors, is then generated and coded as auxiliary information. These two together form the
final coded bit stream.
The corresponding decoder decodes the bit stream and obtains the two groups of SIFT descriptors. The SIFT descriptors extracted from the two groups are then used in SIFT-based matching to retrieve candidate patches from a large image database. Finally, the candidate patches are integrated into the visual description to generate
the final reconstructed image.
3.2. Encoder

Our encoder compresses an image based on the SIFT
descriptors. Instead of coding the SIFT descriptors directly,
we propose reforming the SIFT descriptors into two
representative groups and coding them accordingly. In this
subsection, the encoding modules shown in Fig. 1 are
presented in detail.
3.2.1. Visual description extraction
The SIFT descriptors, as we know, characterize the features
around key-points extracted from a series of smoothed and
resampled images. Consistent with the human visual system, salient features become dominant and fine features fade as the scale level increases. The salient features that remain explicit in the smoothed and resampled images are valuable in our image representation, as they outline the semantic meaning of the visual content.
We are also aware that a SIFT descriptor lacks the grey values and color information of the featured region, so it is difficult to recover this information from SIFT descriptors alone; moreover, each SIFT descriptor contains a 128-dimensional vector, which is very expensive in terms of bit consumption.
Therefore, we propose separating the SIFT descriptors
into two groups for efficient coding. The visual description
is one of the groups that contains not only the SIFT
descriptors of the salient features but also basic grey and
color information of the original image.
A visual description of image $I(x, y)$ is defined as

$$I_v(x, y, \sigma) = \downarrow_k \big( G(x, y, \sigma) * I(x, y) \big), \qquad (3)$$

where $*$ is the convolution operation in $x$ and $y$, $\downarrow_k$ is sub-sampling by a factor of $k$, and the Gaussian kernel $G(x, y, \sigma)$ is defined as

$$G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} \, e^{-(x^2+y^2)/(2\sigma^2)}. \qquad (4)$$

In other words, a visual description $I_v$ is a smoothed and sub-sampled version of the original image.
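A minimal sketch of (3) and (4), assuming OpenCV: Gaussian smoothing followed by decimation. The paper fixes only $k = 16$ (Section 4); the value of $\sigma$ below is an assumption.

```python
# Sketch of eq. (3)-(4): Gaussian smoothing then sub-sampling by k.
# sigma is an assumption; the paper fixes only k = 16 (Section 4).
import cv2

def visual_description(img, k=16, sigma=2.0):
    # G(x, y, sigma) * I(x, y): low-pass filtering before decimation
    blurred = cv2.GaussianBlur(img, (0, 0), sigmaX=sigma)
    # the downarrow_k operator: keep every k-th pixel in each dimension
    return blurred[::k, ::k]

img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)
I_v = visual_description(img)   # roughly 256:1 fewer samples for k = 16
```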
3.2.2. Visual SIFT encoding
Since the visual description is defined as a sub-sampled version of the original image, state-of-the-art image compression methods, such as JPEG 2000 and HEVC, can be used in the encoding process. After compression, a lossy version of the visual description, $\hat{I}_v$, is produced:

$$\hat{I}_v = D(E(I_v)), \qquad (5)$$

where $E(\cdot)$ and $D(\cdot)$ denote the encoding and decoding processes of the visual description, respectively.
In our current solution, we adopt the intra coding process of HEVC for the compression of the visual description. Please refer to [1] for detailed information on HEVC intra coding.
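The paper's codec for this step is HEVC intra coding [1]. As a runnable stand-in only (not the authors' codec), the sketch below realizes (5) with OpenCV's JPEG 2000 writer, which the paper also names as an option; the compression parameter and file names are assumptions.

```python
# Eq. (5) with JPEG 2000 as a stand-in codec for E(.) and D(.);
# the paper's actual choice is HEVC intra coding [1].
import cv2

I_v = cv2.imread("I_v.png")                          # hypothetical input
cv2.imwrite("I_v.jp2", I_v,
            [cv2.IMWRITE_JPEG2000_COMPRESSION_X1000, 200])  # E(I_v)
I_v_hat = cv2.imread("I_v.jp2")                      # D(E(I_v)): lossy I_v
```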
3.2.3. Differential SIFT extraction
According to (1), the sets of SIFT descriptors extracted from images $I$ and $\hat{I}_v$ are defined as

$$Z_I = \Phi(I), \qquad (6)$$

$$Z_{\hat{I}_v} = \Phi(\hat{I}_v). \qquad (7)$$

Note that the SIFT descriptors in $Z_{\hat{I}_v}$ are achievable on the decoder side, as they can be extracted from $\hat{I}_v$. The differential SIFT set should then be

$$Z_I' = Z_I \setminus (Z_I \cap Z_{\hat{I}_v}). \qquad (8)$$

However, we notice that not all the SIFT descriptors in $Z_I'$ contribute to the final reconstruction; it is neither necessary nor feasible to deliver all of them to the decoder side.
Therefore, we propose including only the important SIFT descriptors in the differential SIFT set $Z_I^d$. The importance of a SIFT descriptor is determined by its scale parameter in our current solution. Consequently, we sort the SIFT descriptors $z_i \in Z_I'$ according to the scale parameter $s_i$: the larger the $s_i$, the higher the priority. The first $N$ descriptors are then selected, forming the differential SIFT set $Z_I^d$ with $|Z_I^d| = N$.
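The set difference of (8) and the top-$N$ selection might be sketched as follows. How the scheme decides that a descriptor of $I$ is already recoverable from $\hat{I}_v$ is not detailed in the paper, so the nearest-neighbour distance threshold below, like the file names, is an illustrative assumption.

```python
# Sketch of eq. (6)-(8) and the top-N selection (N = 100, Sec. 4).
# The distance threshold approximating Z_I ∩ Z_{I_v_hat} is a
# hypothetical choice, not from the paper.
import cv2

I = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)      # original image
I_v_hat = cv2.imread("I_v.jp2", cv2.IMREAD_GRAYSCALE)  # decoded visual desc.

sift = cv2.SIFT_create()
kp_I, v_I = sift.detectAndCompute(I, None)          # Z_I, eq. (6)
kp_v, v_v = sift.detectAndCompute(I_v_hat, None)    # Z_{I_v_hat}, eq. (7)

# a descriptor of I counts as already present in Z_{I_v_hat} if its
# nearest neighbour there is closer than THRESH; the rest form Z_I'
matcher = cv2.BFMatcher(cv2.NORM_L2)
THRESH = 200.0
diff = [kp_I[m.queryIdx] for m in matcher.match(v_I, v_v)
        if m.distance > THRESH]

# sort Z_I' by scale, largest first, and keep the top N descriptors
diff.sort(key=lambda kp: kp.size, reverse=True)
Z_I_d = diff[:100]
```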
3.2.4. Differential SIFT encoding
For each descriptor $z_i = \{v_i, x_i, s_i, o_i\}$ in the differential SIFT set $Z_I^d$, we propose encoding only the key-point position $x_i$, scale $s_i$, and orientation $o_i$. The $l$-dimensional vector $v_i$ is excluded from the encoding process to further decrease the data size of the differential SIFT set. The coded differential descriptors thus form $\hat{Z}_I^d$:

$$\hat{Z}_I^d = \{\hat{z}_i\}, \quad \text{where } \hat{z}_i = z_i \setminus v_i = \{x_i, s_i, o_i\} \text{ and } z_i \in Z_I^d. \qquad (9)$$

The discarded vector $v_i$ will be approximated on the decoder side by extracting it from the visual description $\hat{I}_v$.

Fig. 1. Framework of our SIFT-based image compression scheme. The left and right portions show the encoding and decoding
processes, respectively.

We use a simple fixed-length lossless compression method to code the differential descriptors, assigning 47 bits to each differential SIFT descriptor $\hat{z}_i$.
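The paper specifies 47 bits per coded descriptor but not the per-field allocation, so the 13/13/12/9 split below (x, y, quantized scale, orientation) is a hypothetical layout that happens to sum to 47 bits.

```python
# Fixed-length packing sketch for eq. (9); the bit allocation is a
# hypothetical choice, only the 47-bit total comes from the paper.
def pack_descriptor(x, y, s, o):
    xq = int(round(x)) & 0x1FFF             # 13 bits: column (up to 8191)
    yq = int(round(y)) & 0x1FFF             # 13 bits: row
    sq = min(int(s * 64), (1 << 12) - 1)    # 12 bits: scale in steps of 1/64
    oq = int(o / 360.0 * 512) & 0x1FF       # 9 bits: 512 orientation levels
    return (xq << 34) | (yq << 21) | (sq << 9) | oq   # one 47-bit word

word = pack_descriptor(412.7, 108.3, 5.25, 271.0)
assert word < (1 << 47)
```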
3.3. Decoder
The decoding process of our SIFT-based image compression
is illustrated in the right portion of Fig. 1.
3.3.1. Decoding of visual description and differential SIFT
Corresponding to the encoding process, we employ the intra decoding scheme of HEVC to decode the visual description and a simple fixed-length decoding scheme to interpret the differential descriptors. We then obtain two sets of information: the reconstructed visual description $\hat{I}_v$ and the differential descriptors $\hat{Z}_I^d$.
3.3.2. SIFT-based matching
The SIFT descriptors used in the matching module come from both $\hat{I}_v$ and $\hat{Z}_I^d$. We first obtain the set of SIFT descriptors $Z_{\hat{I}_v}$ from image $\hat{I}_v$ by (7). Then the image $\hat{I}_v$ is up-sampled to the same resolution as the original image using a convolution with the Lanczos kernel $L$:

$$\hat{I}_u(x, y, k) = L(x, y, k) * \hat{I}_v(x, y), \qquad (10)$$

where $k$ is the up-sampling ratio. The resulting image $\hat{I}_u$ is used to generate the vectors $\{v_i\}$ based on the decoded $\hat{Z}_I^d$, producing the set of approximated differential SIFT descriptors $\tilde{Z}_I^d$:

$$\tilde{Z}_I^d = \Phi(\hat{I}_u \mid \hat{Z}_I^d). \qquad (11)$$

In other words, the SIFT descriptors are extracted with regard to the scale and orientation information at the key-point positions denoted by $\hat{Z}_I^d$. The two sets, $\tilde{Z}_I^d$ and $Z_{\hat{I}_v}$, together form our set of SIFT descriptors $\tilde{Z}_I$ for matching:

$$\tilde{Z}_I = \tilde{Z}_I^d \cup Z_{\hat{I}_v}. \qquad (12)$$
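A sketch of (10) and (11): Lanczos up-sampling of $\hat{I}_v$, followed by computing SIFT vectors at the decoded key-point positions, scales, and orientations. The sample decoded values and file name are placeholders.

```python
# Sketch of eq. (10)-(11): Lanczos up-sampling of I_v_hat, then SIFT
# vectors computed at the decoded key points of Z_I_d_hat.
import cv2

I_v_hat = cv2.imread("I_v.jp2", cv2.IMREAD_GRAYSCALE)
k = 16                                              # up-sampling ratio
I_u = cv2.resize(I_v_hat, None, fx=k, fy=k,
                 interpolation=cv2.INTER_LANCZOS4)  # eq. (10)

decoded = [(412.7, 108.3, 5.25, 271.0)]             # hypothetical (x, y, s, o)
kps = [cv2.KeyPoint(x, y, s, o) for (x, y, s, o) in decoded]

sift = cv2.SIFT_create()
kps, v = sift.compute(I_u, kps)                     # approximated v_i, eq. (11)
```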
Let $U$ denote the external image database and $Z_U$ denote the set of SIFT descriptors extracted from $U$:

$$Z_U = \bigcup_{I_j \in U} Z_{I_j}, \quad \text{where } Z_{I_j} = \Phi(I_j). \qquad (13)$$

For each SIFT descriptor $z_i = \{v_i, x_i, s_i, o_i\}$ in $\tilde{Z}_I$, its matching SIFT descriptor $z_t = \{v_t, x_t, s_t, o_t\}$ in $Z_U$ is retrieved if the $L_2$ distance satisfies

$$\|v_i - v_{t^-}\|_2 > \alpha \, \|v_i - v_t\|_2, \qquad (14)$$

where

$$t = \arg\min_{v_{t'} \in Z_U} \|v_i - v_{t'}\|_2, \qquad (15)$$

$$t^- = \arg\min_{v_{t'} \in (Z_U \setminus v_t)} \|v_i - v_{t'}\|_2. \qquad (16)$$

That is, a matched descriptor $z_t$ of $z_i$ is the one whose distance to $z_i$ is not only the shortest but also $\alpha$ times shorter than those of the rest of the descriptors.
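The acceptance rule of (14)-(16) is a nearest-vs-second-nearest ratio test; a minimal sketch, assuming the descriptors have already been collected as float32 arrays:

```python
# Sketch of eq. (14)-(16): keep the nearest database descriptor z_t
# only if the second-nearest is more than alpha times farther away
# (alpha = 1.5 in Section 4).
import cv2

ALPHA = 1.5

def match_descriptors(v_query, v_db):
    """v_query, v_db: float32 arrays of 128-dim SIFT vectors."""
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    pairs = matcher.knnMatch(v_query, v_db, k=2)   # t and t^- per query
    accepted = {}
    for p in pairs:
        if len(p) < 2:
            continue                               # database too small
        best, second = p
        if second.distance > ALPHA * best.distance:    # eq. (14)
            accepted[best.queryIdx] = best.trainIdx    # z_i -> z_t
    return accepted
```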
If $z_i$ finds its matching descriptor $z_t$, the corresponding patch $f_t$ denoted by $z_t$ is extracted from the corresponding image $I_j$ in $U$. The fragment $f_t$ in our scheme is a rectangle centered at $(x_t, y_t)$ in $I_j$. It is further scaled and rotated to generate the candidate fragment $f_i$ of $z_i$:

$$f_i = T_o T_s f_t, \qquad (17)$$

where $T_o$ and $T_s$ are the rotation and scaling matrices derived from the orientation and scale parameters of $z_t$ and $z_i$, respectively. Otherwise, $f_i$ is null if no matching descriptor is available.
Repeating the matching process until all the descriptors in $\tilde{Z}_I$ are checked, we obtain a set of candidate fragments.
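The scale-and-rotate adaptation of (17) can be realized as a single affine warp driven by the scale ratio and orientation difference between $z_i$ and $z_t$; a sketch with illustrative variable names:

```python
# Sketch of eq. (17): rotate and scale a retrieved fragment f_t so it
# matches the orientation o_i and scale s_i of z_i (angles in degrees).
import cv2

def adapt_fragment(f_t, s_i, o_i, s_t, o_t):
    h, w = f_t.shape[:2]
    scale = s_i / s_t                    # T_s: scale ratio between z_i, z_t
    angle = o_i - o_t                    # T_o: orientation difference
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    return cv2.warpAffine(f_t, M, (w, h))
```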
3.3.3. Integration
Before integration, we first sort the retrieved fragments from largest to smallest according to their size, obtaining the ordered set $F = \{f_j\}$. The fragments are then merged into the image $\hat{I}_u$ sequentially, based on the key-point positions denoted by the corresponding descriptors.
For each fragment $f_j$, the integration consists of four steps: normalization, positioning, alignment, and merging.
First, the luminance magnitudes of $f_j$ are normalized by

$$f_j'(x, y) = f_j(x, y) - \bar{f}_j, \qquad (18)$$

where $\bar{f}_j$ is the average of $f_j$.
Second, to reduce the aliasing effect caused by the up-sampling, we shift the matching fragment $f_j'$ within the corresponding region in $\hat{I}_u$. The cover position $r_j^*$ of fragment $f_j'$ in $\hat{I}_u$ is determined by

$$r_j^* = \arg\min_{r_j \in R_j} M(f_j', r_j), \qquad (19)$$

where $R_j$ is the search range centered at the key-point position $x_i$ denoted by $f_j'$, and $M(a, b)$ measures the similarity of $a$ and $b$.
Third, we align the fragment $f_j'$ to image $\hat{I}_u$ at region $r_j^*$ by multiplying by a homography matrix $H$:

$$\tilde{f}_j = H f_j'. \qquad (20)$$

The homography matrix is calculated using RANSAC [17] by matching feature points between the fragment and region $r_j^*$.
Finally, the transformed fragment $\tilde{f}_j$ is merged into image $\hat{I}_u$. The boundary effect is reduced using the iterative Gauss-Seidel SOR algorithm [16], which is widely used in Poisson image editing.
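The four integration steps for a single fragment might be sketched as follows. The paper gives the step order and criteria but not the search-window size, the feature matching inside a fragment, or the exact merging solver; those choices below, including OpenCV's seamlessClone standing in for the Poisson-editing step [16], are assumptions.

```python
# Sketch of steps (18)-(20) plus Poisson-style merging for one
# grayscale fragment f_j; window size and guards are assumptions.
import cv2
import numpy as np

def integrate_fragment(f_j, I_u, x_kp, y_kp, win=8):
    h, w = f_j.shape
    f_n = f_j.astype(np.float32) - f_j.mean()          # eq. (18)

    # eq. (19): slide the fragment in a window around the key-point
    # and keep the MSE-minimizing cover position r_j*
    best, best_mse = None, np.inf
    for dy in range(-win, win + 1):
        for dx in range(-win, win + 1):
            y0, x0 = int(y_kp) + dy, int(x_kp) + dx
            if y0 < 0 or x0 < 0:
                continue                               # window left the image
            region = I_u[y0:y0 + h, x0:x0 + w].astype(np.float32)
            if region.shape != f_n.shape:
                continue
            mse = float(np.mean((region - region.mean() - f_n) ** 2))
            if mse < best_mse:
                best, best_mse = (y0, x0), mse
    if best is None:
        return I_u
    y0, x0 = best

    # eq. (20): RANSAC homography [17] from SIFT correspondences
    # between the fragment and its cover region
    sift = cv2.SIFT_create()
    kp_f, v_f = sift.detectAndCompute(f_j, None)
    kp_r, v_r = sift.detectAndCompute(I_u[y0:y0 + h, x0:x0 + w], None)
    if v_f is None or v_r is None:
        return I_u
    m = cv2.BFMatcher(cv2.NORM_L2).match(v_f, v_r)
    if len(m) < 4:
        return I_u                                     # too few points for H
    src = np.float32([kp_f[i.queryIdx].pt for i in m]).reshape(-1, 1, 2)
    dst = np.float32([kp_r[i.trainIdx].pt for i in m]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC)
    if H is None:
        return I_u
    warped = cv2.warpPerspective(f_j, H, (w, h))

    # merging: Poisson-style blending [16] via OpenCV's seamlessClone
    mask = np.full((h, w), 255, np.uint8)
    center = (x0 + w // 2, y0 + h // 2)
    out = cv2.seamlessClone(cv2.cvtColor(warped, cv2.COLOR_GRAY2BGR),
                            cv2.cvtColor(I_u, cv2.COLOR_GRAY2BGR),
                            mask, center, cv2.NORMAL_CLONE)
    return cv2.cvtColor(out, cv2.COLOR_BGR2GRAY)
```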
This process continues until all the fragments in $F$ are handled. Notice that a fragment $f_j'$ will not be merged if its corresponding region $r_j^*$ is fully covered by previously integrated fragments. Also, we perform the fragment integration only on the luminance component; the chrominance information of image $\hat{I}_u$ is used directly to produce the final reconstructed image $\hat{I}$.

4. EXPERIMENTAL RESULTS
We evaluate the performance of our SIFT-based image
compression scheme in comparison with those of JPEG and
the HEVC intra coding [19]. The parameters used in our scheme are fixed. The factor $k$ in (3) and (10) is set to 16; that is, the sub-sampling ratio in our scheme is 256:1. $\alpha$ in (14) is set to 1.5. $N$ equals 100 in the following tests, so the first 100 differential SIFT descriptors are encoded and sent to the decoder side. The similarity measure $M(a, b)$ in (19) is MSE. The results of HEVC are generated using the default parameters.
The comparison results of JPEG, HEVC and our scheme
are shown in Fig. 2 to Fig. 4. Our compression results are
achieved by using the INRIA Holidays database [18] that
consists of 1491 images. In the experiments, the test images
shown in this paper are excluded from the searched database.
Additionally, the waterfall image shown in Fig. 4 tests the
performance when database images are obtained at the same
GPS location.
It can be observed that, compared with JPEG, our scheme consistently achieves better visual quality. Compared with HEVC, our results are more vivid and smooth, whereas HEVC shows block artifacts in all the reconstructed images. We also acknowledge that some mismatched fragments are attached to the reconstructed images and that the details in uncovered regions are smoothed; these could be refined by employing editing tools such as synthesis or inpainting schemes. Notice that our scheme achieves these results at consistently higher compression ratios.
5. CONCLUSIONS
This paper proposes a novel image compression scheme
based on local feature descriptors. Our compression scheme
interprets images by SIFT descriptors instead of pixel values.
The SIFT descriptors are manipulated into two groups in our
encoder for efficient compression. A significantly down-
sampled image is first generated as the visual description of
the input image that contains the key SIFT descriptors. Then
the rest of the important SIFT descriptors are simplified,
forming the differential SIFT descriptor set. These two
groups are then compressed separately. The corresponding
decoder makes use of the visual correlation among images
through SIFT-based matching for image reconstruction. Our
preliminary results demonstrate the feasibility of
compression using local feature descriptors. The perceptual
quality of our reconstructed images is comparable with that
of HEVC and JPEG.
Our cloud-based image coding scheme is not a replacement for conventional image coding. Our solution aims at new scenarios, e.g., compression and reconstruction for Internet or cloud applications where a large-scale image set is always available. Although the results are preliminary, we believe this is a promising step towards much larger, "Internet-based" image compression.
On the other hand, there are several limitations in the
current scheme towards which we should direct future
research. For example, enhanced local feature descriptors as well as visual descriptions can be developed to better match the compression task. The bits for coding the differential SIFT descriptors can be reduced with a
dedicated compression solution. We also notice that the
quality of the reconstructions can be further improved by
introducing advanced techniques in matching, alignment
and merging. We hope to further improve the reconstruction
by introducing more reliable features and constraints in the
future.
6. REFERENCES
[1] T. Wiegand, W.-J. Han, B. Bross, J.-R. Ohm, and G. J. Sullivan, "WD1: Working Draft 1 of High-Efficiency Video Coding," JCTVC-C403, Guangzhou, China, Oct. 2010.
[2] J. A. Saghri, P. S. Cheatham, and A. Habibi, "Image quality measure based on a human visual system model," Opt. Eng., vol. 28, no. 7, pp. 813-818, 1989.
[3] M. M. Reid, R. J. Millar, and N. D. Black, "Second-generation image coding: an overview," ACM Computing Surveys, vol. 29, no. 1, March 1997.
[4] D. Liu, X. Sun, F. Wu, and Y.-Q. Zhang, "Edge-oriented Uniform Intra Prediction," IEEE Trans. Image Processing, vol. 17, no. 10, pp. 1827-1836, Oct. 2008.
[5] W. S. Lee, D. Zonoobi, and A. A. Kassim, "Hierarchical Segmentation-Based Image Coding Using Hybrid Quad-Binary Trees," IEEE Trans. Image Processing, vol. 18, no. 6, p. 1284, 2009.
[6] N. M. Nasrabadi and R. A. King, "Image coding using vector quantization: A review," IEEE Trans. Commun., vol. 36, pp. 957-971, 1988.
[7] G. Qiu, M. R. Varley, and T. J. Terrell, "Image coding based on visual vector quantization," IEE International Conference on Image Processing and its Applications, pp. 301-305, July 1995.
[8] F. Wu and X. Sun, "Image compression by visual pattern vector quantization (VPVQ)," Data Compression Conference 2008, pp. 282-291, 2008.
[9] D. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
[10] J. Hays and A. A. Efros, "Scene completion using millions of photographs," ACM Trans. Graphics (Proc. SIGGRAPH 2007), 26(3), 2007.
[11] O. Whyte, J. Sivic, and A. Zisserman, "Get out of my picture! Internet-based inpainting," in BMVC, 2009.
[12] P. Weinzaepfel, H. Jégou, and P. Pérez, "Reconstructing an image from its local descriptors," in Computer Vision and Pattern Recognition, Jun. 2011.
[13] M. Daneshi and J. Guo, http://www.stanford.edu/class/ee368/Project_11/Reports/Daneshi_Guo_Image_Reconstruction_from_Descriptors.pdf
[14] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "Speeded-up robust features (SURF)," Comput. Vis. Image Underst., vol. 110, pp. 346-359, June 2008.
[15] V. Kwatra, A. Schödl, I. Essa, G. Turk, and A. Bobick, "Graphcut textures: image and video synthesis using graph cuts," SIGGRAPH 2003, July 2003.
[16] P. Pérez, M. Gangnet, and A. Blake, "Poisson image editing," ACM Trans. Graph. (SIGGRAPH), 22(3):313-318, 2003.
[17] M. A. Fischler and R. C. Bolles, "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography," Communications of the ACM, vol. 24, no. 6, pp. 381-395, 1981.
[18] H. Jégou, M. Douze, and C. Schmid, "Hamming embedding and weak geometric consistency for large scale image search," in ECCV, October 2008.
[19] HM 3.0. Available: http://hevc.kw.bbc.co.uk/trac/

Fig. 2. Visual evaluation of the reconstructed images. From left to right: the original, our scheme, HEVC, and JPEG. The coded bit numbers are denoted at the bottom of the images.

Fig. 3. Comparison results for the image portions denoted by the red rectangles in Fig. 2. From left to right: reconstructed images of our scheme, HEVC, and JPEG.
Fig. 4. Comparison results. For each pair, the left side is the original image and the right is our reconstructed image. File sizes are denoted at the bottom of the images.
