While there has been a lot of recent work on object recognition and image understanding, the
focus has been on carefully establishing mathematical models for images, scenes, and
objects. In this paper, we propose a novel, nonparametric approach for object recognition and
scene parsing using a new technology we name label transfer. For an input image, our system
first retrieves its nearest neighbors from a large database containing fully annotated images.
Then, the system establishes dense correspondences between the input image and each of the
nearest neighbors using the dense SIFT flow algorithm which aligns two images based on
local image structures. Finally, based on the dense scene correspondences obtained from
SIFT flow, our system warps the existing annotations and integrates multiple cues in a
Markov random field framework to segment and recognize the query image. Promising
experimental results have been achieved by our nonparametric scene parsing system on
challenging databases. Compared to existing object recognition approaches that require
training classifiers or appearance models for each object category, our system is easy to
implement, has few parameters, and embeds contextual information naturally in the
retrieval/alignment procedure.
Chapter 1
Introduction
Scene parsing, or recognizing and segmenting objects in an image, is one of the core
problems of computer vision.
However, these learning-based methods do not, in general, scale well with the number of
object categories. For example, to include more object categories in an existing system, we
need to train new models for the new categories and, typically, adjust system parameters.
Training can be a tedious job if we want to include thousands of object categories in a scene
parsing system. In addition, the complexity of contextual relationships among objects also
increases rapidly as the quantity of object categories expands.
Recently, the emergence of large databases of images has opened the door to a new family
of methods in computer vision. Large database-driven approaches have shown the potential
for nonparametric methods in several applications. Instead of training sophisticated
parametric models, these methods try to reduce the inference problem for an unknown image
to that of matching to an existing set of annotated
images. In the authors estimate the pose of a human, relying on 0.5 million training examples.
In the proposed algorithm can fill holes on an input image by introducing
elements that are likely to be semantically correct through searching a large image database.
In a system is designed to infer the possible object categories that may appear in an image
by retrieving similar images in a large database. Moreover, the authors in showed that with
a database of 80 million images, even simple SSD match can give semantically meaningful
parsing for 32 -32 images.
In this paper, we propose a novel, nonparametric scene parsing system to transfer the labels
from existing samples in a large database to annotate an image, as illustrated in Fig. 1. For a query image (Fig. 1a), our system first retrieves the top matches in a large, annotated image database using a combination of GIST matching and SIFT flow. Since these top matches are labeled, we transfer their annotations to the query image and obtain the scene parsing result shown in Fig. 1d. For comparison, the ground-truth user annotation of the query is displayed in Fig. 1e. Our system is able to generate promising scene parsing results if images from the same scene type as the query are retrieved from the annotated database.
However, it is nontrivial to build an efficient and reliable scene parsing system using dense
scene alignment. To account for the multiple annotation suggestions from the top matches,
a Markov random field model is used to merge multiple cues (e.g., likelihood, prior, and
spatial smoothness) into a robust annotation. Promising experimental results are achieved
on images from the LabelTransfer database. Our goal is to explore the performance of scene parsing through the transfer of labels from existing annotated images, rather than to build a comprehensive object recognition system. We show, however, that our system outperforms existing approaches on our databases.
Fig. 1. For a query image (a), our system finds the top matches (b) (three are shown here) using scene retrieval and a SIFT flow
matching algorithm. The annotations of the top matches (c) are transferred and integrated to parse the input image, as shown in
(d). For comparison, the ground-truth user annotation of (a) is shown in (e).
Chapter 2
MATLAB
2.1: Introduction
MATLAB (matrix laboratory) is a fourth-generation high-level programming language and
interactive environment for numerical computation, visualization and programming.
It has numerous built-in commands and math functions that help you in mathematical
calculations, generating plots, and performing numerical methods.
It provides vast library of mathematical functions for linear algebra, statistics, Fourier
analysis, filtering, optimization, numerical integration and solving ordinary
differential equations.
It provides built-in graphics for visualizing data and tools for creating custom plots.
2.2: Uses
MATLAB is widely used as a computational tool in science and engineering, encompassing the fields of physics, chemistry, mathematics, and all engineering streams. It is used in a wide range of applications.
Chapter 3
Related Work
Object recognition is an area of research that has greatly evolved over the last decade. Many works focusing on single-class modeling, such as faces, digits, characters, and pedestrians, have proven successful and, in some cases, these problems have been largely deemed solved. Recent efforts have turned mainly to multiclass object recognition. In creating an object detection system, there are many basic building blocks to take into account; feature description and extraction is the first stepping stone. Examples of descriptors include gradient-based features such as SIFT and HOG, shape context, and patch statistics. Selected feature descriptors can then be applied to images either in a sparse manner, by selecting the top key points with the highest response from the feature descriptor, or densely, by observing feature statistics across the image.
Sparse key point representations are often matched among pairs of images. Since the generic
problem of matching two sets of key points is NP-hard, approximation algorithms have been
developed to efficiently compute key point matches minimizing error rates (e.g., the pyramid
match kernel and vocabulary trees). On the other hand, dense representations have been
handled by modeling distributions of the visual features over neighborhoods in the image or
in the image as a whole. We chose the dense representation in the paper due to recent
advances in dense image matching.
At a higher level, we can also distinguish two types of object recognition approaches:
parametric approaches that consist of learning generative/discriminative models, and
nonparametric approaches that rely on image retrieval and matching. In the parametric family
we can find numerous template-matching methods, where classifiers are trained to
discriminate between an image window containing an object and one containing background. However, these methods assume that objects are mostly rigid and undergo little or no deformation. To account for articulated objects, constellation models have been designed to model objects as ensembles of parts, considering spatial information, depth-ordering information, and multiresolution modes.
Recently, the idea of integrating humans in the loop via crowdsourcing for visual recognition of specialized classes, such as plant and animal species, has emerged. This method distills the description of an object into fewer than 20 discriminative questions that humans can answer after visually inspecting the image.
In the realm of nonparametric methods we find systems such as Video Google, a system that allows users to specify a visual query of an object in a video and subsequently retrieve instances of the same object across the movie. In another nonparametric system, a previously unknown query image is matched against a densely labeled image database; the nearest neighbors are used to build a label probability map for the query, which is further used to prune out object detectors of classes that are unlikely to appear in the image. Nonparametric methods have also been widely used to retrieve similar images from web data. For example, a customized distance function is used at the retrieval stage to compute the distance between a query image and the images in the training set, which subsequently cast votes to infer the object class of the query. In the same spirit, our nonparametric label transfer system avoids modeling object appearances explicitly, as our system parses a query image using the annotations of similar images in a training database and dense image correspondences.
Recently, several works have also considered contextual information in object detection to clean up and reinforce individual results. Contextual cues that have been used include object-level co-occurrences, spatial relationships, and 3D scene layout; more detailed and comprehensive studies and benchmarks of contextual methods can be found in the literature. Instead of explicitly modeling context, our model incorporates context implicitly, as object co-occurrences and spatial relationships are retained in label transfer. An earlier version of our work appeared previously; in this paper, we explore the label-transfer framework in depth with more thorough experiments and insights. Other recent papers have also introduced similar ideas. For instance, in one approach, over-segmentation is performed on the query image and segment-based classifiers trained on the nearest neighbors are applied to recognize each segment; in another, scene boundaries are discovered from the common edges shared by nearest neighbors.
Fig. 2. System pipeline. There are three key algorithmic components (rectangles) in our system: scene
retrieval, dense scene alignment, and label transfer. The ovals denote data representations.
Chapter 4
System Overview
Fig. 2 shows the pipeline of our system, which consists of the following three algorithmic
modules:
1. Scene retrieval: Given a query image, use scene retrieval techniques to find a set of nearest neighbors that share a similar scene configuration (including objects and their relationships) with the query.
2. Dense scene alignment: Establish dense scene correspondence between the query image and each of the retrieved nearest neighbors. Choose the nearest neighbors with the top matching scores as voting candidates.
3. Label transfer: Warp the annotations from the voting candidates to the query image according to the estimated dense correspondence. Reconcile multiple labelings and impose spatial smoothness under a Markov random field (MRF) model.
Although we choose concrete algorithms for each module in this paper, any algorithm that fits a module can be plugged into our nonparametric scene parsing system.
For example, we use SIFT flow for dense scene alignment, but it would also suffice to use
sparse feature matching and then propagate sparse correspondences to produce dense
counterparts.
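The modular pipeline above can be sketched in code. The following is a minimal, self-contained toy in Python, not the authors' implementation: retrieval is plain Euclidean distance on a global descriptor standing in for GIST, dense alignment is skipped (identity flow), and the MRF is replaced by a per-pixel majority vote; the function names `retrieve_neighbors` and `parse_scene` are hypothetical.

```python
import numpy as np

def retrieve_neighbors(query_feat, db_feats, k):
    """Scene retrieval: indices of the K nearest database images by
    Euclidean distance on a global descriptor (a stand-in for GIST)."""
    d = np.linalg.norm(db_feats - query_feat, axis=1)
    return np.argsort(d)[:k]

def parse_scene(query_feat, db_feats, db_labels, K=3, M=2):
    """Toy pipeline: retrieval -> (identity) alignment -> majority-vote
    label transfer. The real system uses SIFT flow and an MRF here."""
    idx = retrieve_neighbors(query_feat, db_feats, K)
    # Reranking by alignment score is skipped; keep the top-M candidates.
    candidates = idx[:M]
    votes = db_labels[candidates]            # (M, H, W) "warped" annotations
    # Per-pixel majority vote replaces the MRF inference.
    out = np.zeros(votes.shape[1:], dtype=int)
    for p in np.ndindex(out.shape):
        vals, counts = np.unique(votes[(slice(None),) + p], return_counts=True)
        out[p] = vals[np.argmax(counts)]
    return out
```

Any of the three stand-ins can be swapped for the concrete algorithms the chapter names without changing the surrounding structure.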
A key component of our system is a large, dense, and annotated image database. In this
paper, we use two sets of databases, both annotated using the LabelMe online annotation tool
to build and evaluate our system. The first is the LabelMe Outdoor (LMO) database
containing 2,688 fully annotated images, most of which are outdoor scenes including street,
beach, mountains, fields, and buildings. The second is the SUN database containing 9,566
fully annotated images, covering both indoor and outdoor scenes; in fact, LMO is a subset
of SUN.
1. Other scene parsing and image understanding systems also require such a database.
We do not require more than others.
We use the LMO database to explore our system in-depth, and also report the results on the
SUN database.
Before jumping into the details of our system, it is helpful to look at the statistics of the LMO
database. The 2,688 images in LMO are randomly split into 2,488 for training and 200 for
testing. We chose the top 33 object categories with the most labeled pixels. The pixels that are not labeled, or are labeled as other object categories, are treated as the 34th category: “unlabeled.” The per-pixel frequency count of these object categories in the training set is
shown at the top of Fig. 3. The color of each bar is the average RGB value of the
corresponding object category from the training data with saturation and brightness boosted
for visualization purposes. The top 10 object categories are sky, building, mountain, tree,
unlabeled, road, sea, field, grass, and river. The spatial priors of these object categories are
displayed at the bottom of Fig. 3, where white denotes zero probability and the saturation of
color is directly proportional to its probability. Note that, consistent with common
knowledge, sky occupies the upper part of the image grid and field occupies the lower part.
Furthermore, there are only limited samples for the sun, cow, bird, and moon classes.
Fig. 3. Top: The per-pixel frequency counts of the object categories in our data set (sorted in
descending order). The color of each bar is the average RGB value of each object category
from the training data with saturation and brightness boosted for visualization. Bottom: The
spatial priors of the object categories in the database. White means zero and the saturated
color means high probability.
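The per-pixel frequency counts and spatial priors described above are simple to compute from the annotation maps. A sketch in Python, assuming equally sized integer label maps; `spatial_priors` is a hypothetical helper, not code from the paper:

```python
import numpy as np

def spatial_priors(annotations, num_labels):
    """hist_l(p): for each object category l, the fraction of training
    images whose annotation assigns label l to pixel p."""
    anns = np.stack(annotations)                  # (N, H, W) label maps
    n = anns.shape[0]
    priors = np.zeros((num_labels,) + anns.shape[1:])
    for l in range(num_labels):
        priors[l] = (anns == l).sum(axis=0) / n   # occurrence count / N
    return priors
```

Summing `priors[l]` over all pixels and dividing by the image area gives the per-pixel frequency count of category l plotted at the top of Fig. 3.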
Chapter 5
Image Processing
MATLAB supports several graphics file formats, including:
BMP
HDF
JPEG
PCX
TIFF
XWD
Most images you find on the Internet are JPEG images; JPEG is the name of one of the most widely used compression standards for images. If you have stored an image, you can usually tell from the suffix which format it is stored in.
If an image is stored as a JPEG image on your disk, we first read it into MATLAB. However, in order to start working with an image, for example to perform a wavelet transform on it, we must convert it into a different format. This section explains four common formats.
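In MATLAB, this first step is typically `I = imread('photo.jpg');` followed by a conversion such as `D = im2double(I);`, which rescales 8-bit intensities to doubles in [0, 1]. The rescaling itself is trivial; here is a pure-Python sketch of the uint8 case, with a grayscale image represented as nested lists:

```python
def im2double(img_uint8):
    """Rescale 8-bit intensities (0..255) to floats in [0, 1], mirroring
    MATLAB's im2double for uint8 input."""
    return [[px / 255.0 for px in row] for row in img_uint8]
```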
Chapter 6
Segmentation
6.1: Introduction
Natural images consist of an overwhelming number of visual patterns generated by very diverse stochastic processes in nature. The objective of image understanding is to parse an input image into its constituent patterns. Depending on the type of patterns that a task is interested in, the parsing problem is called, respectively: 1) image segmentation, for homogeneous grey/color/texture region processes; 2) perceptual grouping, for point, curve, and general graph processes; 3) object recognition, for text and objects.
6.2: Segmentation
In computer vision, segmentation refers to the process of partitioning a digital image into
multiple regions (sets of pixels). The goal of segmentation is to simplify and/or change the
representation of an image into something that is more meaningful and easier to analyze.
Image segmentation is typically used to locate objects and boundaries (lines, curves, etc.) in
images.
The result of image segmentation is a set of regions that collectively cover the entire image,
or a set of contours extracted from the image. Each of the pixels in a region is similar with
respect to some characteristic or computed property, such as color, intensity, or texture.
Adjacent regions are significantly different with respect to the same characteristic(s).
Segmentation algorithms are generally based on one of two basic properties of intensity values:
Discontinuity: partitioning an image based on abrupt changes in intensity (such as edges).
Similarity: partitioning an image into regions that are similar according to a set of predefined criteria.
For intensity images (i.e. those represented by point-wise intensity levels) four popular
approaches are: threshold techniques, edge-based methods, region-based techniques, and
connectivity-preserving relaxation methods.
Threshold techniques, which make decisions based on local pixel information, are effective when the intensity levels of the objects fall squarely outside the range of levels in the background. Because spatial information is ignored, however, blurred region boundaries can create havoc. Edge-based methods center around contour detection: their weakness in connecting broken contour lines makes them, too, prone to failure in the presence of blurring.
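A global threshold of the kind just described is only a few lines of code. The sketch below, in plain Python and purely illustrative, pairs a fixed-threshold segmentation with a simple ISODATA-style iteration that picks the threshold automatically as the midpoint of the two class means:

```python
def threshold_segment(gray, t):
    """Global thresholding: label a pixel as object (1) when its intensity
    exceeds t, background (0) otherwise. Effective only when object and
    background intensity ranges are well separated."""
    return [[1 if px > t else 0 for px in row] for row in gray]

def isodata_threshold(gray, eps=0.5):
    """Iterative (ISODATA-style) threshold selection: repeatedly set t to
    the midpoint of the two class means until it stabilizes."""
    pix = [px for row in gray for px in row]
    t = sum(pix) / len(pix)
    while True:
        lo = [p for p in pix if p <= t]
        hi = [p for p in pix if p > t]
        t_new = 0.5 * ((sum(lo) / len(lo) if lo else t) +
                       (sum(hi) / len(hi) if hi else t))
        if abs(t_new - t) < eps:
            return t_new
        t = t_new
```

Because the decision is per pixel, spatial information is ignored, which is exactly the weakness the text notes for blurred boundaries.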
A region-based method usually proceeds as follows: The image is partitioned into connected regions by grouping neighboring pixels of similar intensity levels. Adjacent regions are then merged under some criterion, involving perhaps homogeneity or sharpness of region boundaries. Overly stringent criteria create fragmentation; lenient ones overlook blurred boundaries and overmerge. Hybrid techniques using a mix of the methods above are also popular.
A connectivity-preserving, relaxation-based segmentation method, usually referred to as the active contour model, has also been proposed. The main idea is to start with some initial boundary shape, represented in the form of spline curves, and iteratively modify it by applying various shrink/expansion operations according to some energy function. Although the energy-minimizing model is not new, coupling it with the maintenance of an "elastic" contour model gives it an interesting new twist. As usual with such methods, getting trapped in a local minimum is a risk against which one must guard; this is no easy task.
Chapter 7
Delta E or Delta E*
Delta E is the standard calculation metric that correlates with human visual judgment of the difference between two perceived colors. This standard quantifies the difference and is used to calculate the deviation from benchmark standards, which allows a tolerance level to be set (based on L*a*b* coordinates). Generally speaking, the lower the Delta E number, the more closely a display's reproduced color matches the input color. The Commission Internationale de l'Éclairage (International Commission on Illumination), or CIE, has established Delta E as the standard color distance metric, revising past definitions to account for the human eye being more sensitive to certain colors. To address these issues, in addition to finding accepted tolerance levels, the CIELAB scale can be used, with the underlying theory being that no color can be both red and green, nor both blue and yellow, at the same time. With a lightness scale placed vertically in the center, colors can then be expressed with single values.
The term Delta E, or ∆E, is used to describe color differences in the CIELAB color space. The term stems from the Greek letter delta, which is used in science to denote difference. The E stands for Empfindung, a German word meaning sensation. Put together, ∆E denotes a difference in sensation.
What led to the creation of these two color spaces and their associated color difference formulas? It is a long and interesting history that began with Isaac Newton hypothesizing the origin of white light.
It is known today that both Helmholtz and Hering were correct in their theories of human
color vision. The human vision system is complex and starts with receptors that are sensitive
to the long (L), medium (M), and short (S) wavelengths of the electromagnetic region that
make up visible light. The vision system then transforms these visual stimuli into opponent
color signals and sends them to the brain for processing.
7.4: Working
CIE L*, a*, b* color values provide a complete numerical descriptor of the color, in a
rectangular coordinate system.
Delta L*, delta a* and delta b* values provide a complete numerical descriptor of the
color differences between a Sample or lot and a Standard or target color.
In the case of dL*, da*, db*, the higher the value, the greater the difference in that dimension.
Delta E* (Total Color Difference) is calculated based on delta L*, a*, b* color differences
and represents the distance of a line between the sample and standard.
In addition to quantifying the overall color difference between a sample and standard color,
delta E* was intended to be a single number metric for Pass/Fail tolerance decisions.
Effectively a delta E* tolerance value defines an acceptance sphere around the standard or
target color.
The lower the delta E* value is, the closer the sample is to the standard. A delta E* value of
0.00 means the color of the sample is identical to the color of the standard.
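Under the original CIE76 definition, the computation described above is simply the Euclidean distance in L*a*b*:

```python
import math

def delta_e_76(lab_sample, lab_standard):
    """CIE76 total color difference: the straight-line (Euclidean) distance
    between two colors in CIELAB space, dE* = sqrt(dL*^2 + da*^2 + db*^2)."""
    dL = lab_sample[0] - lab_standard[0]
    da = lab_sample[1] - lab_standard[1]
    db = lab_sample[2] - lab_standard[2]
    return math.sqrt(dL * dL + da * da + db * db)
```

A Pass/Fail tolerance of, say, 2.0 then accepts exactly the samples inside a sphere of radius 2.0 around the standard color.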
Please note that while CIE L*, a*, b* are being used in this example, delta E (no star)
represents the overall sample difference in Hunter L, a, b coordinates.
The "E" in delta E or delta E* is derived from "Empfindung", the German word for
sensation. Delta E means a "difference in sensation" for any delta E-type metric, CIE or
Hunter.
The Color Measurement Committee of the Society of Dyers and Colourists published a new equation for determining color differences in 1986. This equation became known as the ∆E CMC color difference formula. The CMC's goal was to derive a formula that better handled the small color differences found in the colorant industries.[4] This was accomplished by adding weighting factors to the equations that made them correlate better with what the eye senses. The equation uses ellipsoids to create weighting factors for the lightness and chroma factors.[6]
It is known that changes in lightness are harder to perceive than changes in chroma. The ∆E CMC formula takes this into account by introducing a lightness-to-chroma factor, traditionally set to 2:1. Hue is a constant defined as 1.[6] The CMC color difference equation has been adopted by the colorant industry and by the graphic arts as a more accurate color difference equation. The CMC equation became an ISO standard for the textile industry in 1995.
Chapter 8
System Design
The objective of scene retrieval is to retrieve a set of nearest neighbors in the database for a given query image. There exist several ways to define a nearest neighbor set. The most common definition consists of taking the K closest points to the query (K-NN). Another model, ε-NN, widely used in texture synthesis, considers all of the neighbors within (1 + ε) times the minimum distance from the query. We generalize these two types to ⟨K, ε⟩-NN, and define it as

N(x) = { y_i | dist(x, y_i) ≤ (1 + ε) · dist(x, y_1), 1 ≤ i ≤ K },

where y_1 is the nearest neighbor of x and the y_i are sorted by increasing distance to the query x. As ε → ∞, ⟨K, ∞⟩-NN reduces to K-NN. As K → ∞, ⟨∞, ε⟩-NN reduces to ε-NN. The ⟨K, ε⟩-NN representation gives us the flexibility to deal with the density variation of the graph, as shown in Fig. 5. We will show how K affects the performance in the experimental section. In practice, we found that ε = 5 is a good parameter, and we will use it throughout our experiments. Nevertheless, a dramatic improvement of ⟨K, ε⟩-NN over K-NN is not expected, as sparse samples are few in our databases.
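The ⟨K, ε⟩-NN rule can be sketched directly from its definition: keep the K closest samples, then discard any whose distance exceeds (1 + ε) times the minimum. A minimal Python version, assuming the distances to the query are precomputed:

```python
def k_eps_nn(dists, K, eps):
    """<K, eps>-NN: indices of samples whose distance to the query is at
    most (1 + eps) times the minimum distance, capped at the K closest.
    dists[i] is the distance from the query to database sample i."""
    order = sorted(range(len(dists)), key=lambda i: dists[i])
    d_min = dists[order[0]]
    return [i for i in order[:K] if dists[i] <= (1 + eps) * d_min]
```

With a very large eps this degenerates to plain K-NN; with K equal to the database size it degenerates to ε-NN, matching the two limiting cases in the text.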
We have not yet defined the distance function dist(·, ·) between two images. Measuring image similarities/distances is still an active research area; systematic studies of image features for scene recognition can be found in the literature. In this paper, three distances are used: the Euclidean distance of GIST, the spatial pyramid histogram intersection of HOG visual words, and the spatial pyramid histogram intersection of the ground-truth annotation. For the HOG distance, we use the standard pipeline of computing HOG features on a dense grid and quantizing the features into visual words over a set of images using k-means clustering. The ground truth-based distance metric is used to estimate an upper bound of our system for evaluation purposes. Both the HOG and the ground-truth distances are computed in the same manner. The ground-truth distance is computed by building histograms of pixel-wise labels. To include spatial information, the histograms are computed by dividing an image into 2×2 windows and concatenating the four histograms into a single vector. Histogram intersection is used to compute the ground-truth distance. We obtain the HOG distance by replacing pixel-wise labels with HOG visual words.
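The ground-truth distance above amounts to: split the label map into 2×2 spatial windows, histogram the labels in each, concatenate, and compare with normalized histogram intersection. A sketch in plain Python (the HOG variant would just swap label indices for visual-word indices):

```python
def label_histogram(patch, num_labels):
    """Count label occurrences in one spatial window."""
    h = [0] * num_labels
    for row in patch:
        for l in row:
            h[l] += 1
    return h

def pyramid_histogram(label_map, num_labels):
    """Concatenate the label histograms of the four 2x2 spatial windows."""
    H, W = len(label_map), len(label_map[0])
    h2, w2 = H // 2, W // 2
    hist = []
    for r0, r1 in ((0, h2), (h2, H)):
        for c0, c1 in ((0, w2), (w2, W)):
            patch = [row[c0:c1] for row in label_map[r0:r1]]
            hist += label_histogram(patch, num_labels)
    return hist

def histogram_intersection(h1, h2):
    """Similarity in [0, 1] after normalization; 1.0 means identical."""
    s1, s2 = sum(h1), sum(h2)
    return sum(min(a / s1, b / s2) for a, b in zip(h1, h2))
```

Histogram intersection is a similarity, so a distance can be taken as one minus the intersection score.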
In Fig. 4, we show the importance of the distance metric, as it defines the neighborhood structure of the large image database. We randomly selected 200 images from the LMO database and computed pair-wise image distances using GIST (top) and the ground-truth annotation (bottom). Then, we used multidimensional scaling (MDS) to map these images to points on a 2D grid for visualization. Although the GIST descriptor is able to form a reasonably meaningful image space, where semantically similar images are clustered, the image space defined by the ground-truth annotation truly reveals the underlying structure of the image database. This will be further examined in the experimental section.

Fig. 4. The structure of a database depends on the image distance metric. Top: The ⟨K, ε⟩-NN graph of the LabelMe Outdoor database visualized by scaled MDS using the GIST feature as distance. Bottom: The ⟨K, ε⟩-NN graph of the same database visualized using the pyramid histogram intersection of the ground-truth annotation as distance. Left: RGB images; right: annotation images. Notice how the ground-truth annotation emphasizes the underlying structure of the database. In (c) and (d), we see that the image content changes from urban streets (right), to highways (middle), to nature scenes (left) as we pan from right to left. Eight hundred images are randomly selected from LMO for this visualization.
As our goal is to transfer the labels of existing samples to parse an input image, it is essential to find dense correspondences between images across scenes. In our previous work, we demonstrated that SIFT flow is capable of establishing semantically meaningful correspondences between two images by matching local SIFT descriptors. We further extended SIFT flow into a hierarchical computational framework to improve the performance. In this section, we provide a brief explanation of the algorithm; we refer the reader to the original publications for a detailed description.
Similarly to optical flow, the task of SIFT flow is to find a dense correspondence between two images. Let p = (x, y) contain the spatial coordinate of a pixel, and let w(p) = (u(p), v(p)) be the flow vector at p. Denote by s1 and s2 the per-pixel SIFT descriptors [30] of the two images,2 and let ε contain all the spatial neighborhoods (a four-neighbor system is used). The energy function for SIFT flow is defined as:

E(w) = Σ_p min( ||s1(p) − s2(p + w(p))||_1 , t )                          (2)
     + Σ_p η ( |u(p)| + |v(p)| )                                          (3)
     + Σ_{(p,q)∈ε} min( α|u(p) − u(q)|, d ) + min( α|v(p) − v(q)|, d ),   (4)
which contains a data term, a small displacement term, and a smoothness term (a.k.a. spatial regularization). The data term in (2) constrains the SIFT descriptors to be matched along the flow vector w(p). The small displacement term in (3) constrains the flow vectors to be as small as possible when no other information is available. The smoothness term in (4) constrains the flow vectors of adjacent pixels to be similar. In this objective function, truncated L1 norms are used in both the data term and the smoothness term to account for matching outliers and flow discontinuities, with t and d as the respective thresholds.
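To make the three terms concrete, here is a toy evaluator of this objective for a given flow field. It uses scalar "descriptors" per pixel instead of 128-D SIFT vectors, η and α denote the weights on the displacement and smoothness terms, and out-of-range correspondences are simply clamped; it is an illustration, not the paper's optimizer:

```python
def sift_flow_energy(s1, s2, flow, eta, alpha, t, d):
    """Evaluate the SIFT flow objective for a given flow field on toy
    single-channel 'descriptor' images: truncated-L1 data term, small-
    displacement term, and truncated-L1 smoothness term over the
    4-neighborhood. flow[y][x] = (u, v) with integer offsets."""
    H, W = len(s1), len(s1[0])
    E = 0.0
    for y in range(H):
        for x in range(W):
            u, v = flow[y][x]
            x2, y2 = min(max(x + u, 0), W - 1), min(max(y + v, 0), H - 1)
            E += min(abs(s1[y][x] - s2[y2][x2]), t)       # data term (2)
            E += eta * (abs(u) + abs(v))                  # displacement (3)
            for dy, dx in ((0, 1), (1, 0)):               # right/down pairs
                ny, nx = y + dy, x + dx
                if ny < H and nx < W:
                    u2, v2 = flow[ny][nx]
                    E += min(alpha * abs(u - u2), d)      # smoothness (4), u
                    E += min(alpha * abs(v - v2), d)      # smoothness (4), v
    return E
```

Minimizing this energy over all flow fields is what the belief propagation algorithm discussed next performs.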
While SIFT flow has demonstrated the potential for aligning images across scenes [29], the original implementation scales poorly with respect to the image size. In SIFT flow, a pixel in one image can literally match any other pixel in another image. Suppose the image has h² pixels; then the time and space complexity of the belief propagation algorithm to estimate the SIFT flow is O(h⁴). As reported in [29], the computation time for 145×105 images with an 80×80 searching neighborhood is 50 seconds. The original implementation of SIFT flow would require more than two hours to process a pair of 256×256 images in our database, with a memory usage of 16 GB to store the data term. To address this performance drawback, a coarse-to-fine SIFT flow matching scheme was designed to significantly improve the performance. As illustrated in Fig. 6, the basic idea consists of estimating the flow at a coarse level of the image grid, and then gradually propagating and refining the flow from coarse to fine; please refer to [28] for details. As a result, the complexity of this coarse-to-fine algorithm is O(h² log h), a significant speedup compared to O(h⁴). The matching between two 256×256 images takes 31 seconds on a workstation with two quad-core 2.67 GHz Intel Xeon CPUs and 32 GB of memory, in a C++ implementation. We also discovered that the coarse-to-fine scheme not only runs significantly faster, but also achieves lower energies most of the time compared to the ordinary matching algorithm.

2. SIFT descriptors are computed at each pixel using a 16×16 window. The window is divided into 4×4 cells, and the image gradients within each cell are quantized into an 8-bin histogram. Therefore, the pixel-wise SIFT feature is a 128D vector.

Fig. 5. An image database can be nonuniform, as illustrated by some random 2D points. The green node (A) is surrounded densely by neighbors, whereas the red node (B) resides in a sparse area. If we use K-NN (K = 5), then some samples (orange nodes) far away from the query (B) can be chosen as neighbors. If, instead, we use ε-NN and choose the radius as shown in the picture, then there can be too many neighbors for a sample such as (A). The combination, ⟨K, ε⟩-NN, shown as gray edges, provides a good balance between these two criteria.

Fig. 6. An illustration of our coarse-to-fine pyramid SIFT flow matching. The green square denotes the searching window for p_k at each pyramid level k. For simplicity, only one image is shown here, where p_k is on image s1, and c_k and w(p_k) are on image s2. The details of the algorithm can be found in [28].

Fig. 7. For a query image, we first find a ⟨K, ε⟩-nearest neighbor set in the database using GIST matching [34]. The nearest neighbors are reranked using SIFT flow matching scores, and they form a top-M voting candidate set. The annotations are transferred from the voting candidates to parse the query image.
Some SIFT flow examples are shown in Fig. 8, where dense SIFT flow fields (Fig. 8f) are obtained between the query images (Fig. 8a) and the nearest neighbors (Fig. 8c). It is easy to verify that the warped SIFT images (Fig. 8h) based on the SIFT flows (Fig. 8f) look very similar to the SIFT images (Fig. 8b) of the inputs (Fig. 8a), and that the SIFT flow fields (Fig. 8f) are piecewise smooth. The essence of SIFT flow is manifested in Fig. 8g, where the same flow field is applied to warp the RGB image of the nearest neighbor to the query. SIFT flow tries to hallucinate the structure of the query image by smoothly shuffling the pixels of the nearest neighbors. Because of the intrinsic similarities within each object category, it is not surprising that, by aligning image structures, objects of the same category are often matched. In addition, it is worth noting that one object in the nearest neighbor can correspond to multiple objects in the query, since the flow is asymmetric. This allows labels to be reused to parse multiple object instances.
Now that we have a large database of annotated images and a technique for establishing dense correspondences across scenes, we can transfer the existing annotations to parse a query image through dense scene alignment. For a given query image, we retrieve a set of ⟨K, ε⟩-nearest neighbors in our database using GIST matching [34]. We then compute the SIFT flow from the query to each nearest neighbor, and use the achieved minimum energy (defined in (2)-(4)) to rerank the ⟨K, ε⟩-nearest neighbors. We further select the top M-ranked retrievals (M ≤ K) to create our voting candidate set. This voting set will be used to transfer its contained annotations to the query image. This procedure is illustrated in Fig. 7.
Under this setup, scene parsing can be formulated as the following label transfer problem.
For a query image I with its corresponding SIFT image s, we have a set of voting candidates
{s_i, c_i, w_i}, i = 1, ..., M, where s_i, c_i, and w_i are the SIFT image, annotation, and
SIFT flow field (from s to s_i) of the ith voting candidate, respectively. c_i is an integer
image where c_i(p) ∈ {1, ..., L} is the index of the object category for pixel p. We want to
obtain the annotation c for the query image by transferring c_i to the query image according
to the dense correspondence w_i.
We build a probabilistic Markov random field model to integrate multiple labels, prior
information on the object categories, and spatial smoothness of the annotation to parse
image I. Similarly to [43], the posterior probability is defined by

  −log P(c | I, s, {s_i, c_i, w_i}) = Σ_p ψ(c(p); s, {s_i}) + α Σ_p λ(c(p)) + β Σ_{p,q} μ(c(p), c(q); I) + log Z,

where Z is the normalization constant of the probability. This posterior contains three
components: likelihood, prior, and spatial smoothness. The likelihood term is defined as

  ψ(c(p) = l) = min_{i ∈ Ω_{p,l}} ||s(p) − s_i(p + w_i(p))|| if Ω_{p,l} ≠ ∅, and ψ(c(p) = l) = τ if Ω_{p,l} = ∅,

where Ω_{p,l} = {i : c_i(p + w_i(p)) = l}, l = 1, ..., L, is the index set of the voting
candidates whose label is l after being warped to pixel p, and τ is set to the maximum
difference of SIFT features: τ = max_{s1,s2,p} ||s1(p) − s2(p)||. The prior term λ(c(p) = l)
indicates the prior probability that object category l appears at pixel p. This is obtained
by counting the occurrence of each object category at each location in the training set:

  λ(c(p) = l) = −log hist_l(p),

where hist_l(p) is the spatial histogram of object category l. The smoothness term is defined
to bias neighboring pixels toward the same label when no other information is available, with
a probability that depends on the image edges: the stronger the luminance edge between two
neighboring pixels, the more likely they are to have different labels. Notice that the energy
function is controlled by four parameters: K and M, which decide the mode of the model, and
α and β, which control the influence of the spatial prior and the smoothness. Once the
parameters are fixed, we again use the BP-S algorithm to minimize the energy.
The algorithm converges in two seconds on a workstation with two quad-core 2.67 GHz Intel
Xeon CPUs. A significant difference between our model and that in [43] is that ours has fewer
parameters because of the nonparametric nature of our approach, whereas [43] requires training
classifiers for each category. In addition, color information is not included in our model at
present, as the color distribution of each object category is diverse in our databases.
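To make the terms concrete, the following Python sketch assembles per-pixel data costs in the spirit of the likelihood and prior terms above. The array shapes, the parameter names, and the assumption that the candidate SIFT images and labels have already been warped to the query grid are choices of this sketch; the pairwise smoothness term and BP-S inference are omitted, and the helper at the end simply takes a per-pixel argmin, which corresponds to dropping the smoothness term entirely.

```python
import numpy as np

def unary_costs(s, warped_sifts, warped_labels, hist, num_labels, tau, alpha):
    """Per-pixel data costs for the label-transfer model (a sketch).

    s             : (H, W, d) dense descriptors of the query image.
    warped_sifts  : list of (H, W, d) candidate descriptor images, pre-warped
                    to the query grid.
    warped_labels : list of (H, W) candidate annotations, pre-warped likewise.
    hist          : (L, H, W) spatial frequency of each label in the training set.
    Returns (H, W, L): the minimum descriptor distance over the candidates that
    vote for each label (tau where no candidate votes), plus a weighted
    negative-log spatial prior.
    """
    H, W, _ = s.shape
    cost = np.full((H, W, num_labels), tau, dtype=float)
    for sw, cw in zip(warped_sifts, warped_labels):
        d = np.linalg.norm(s - sw, axis=2)   # distance to this warped candidate
        for l in range(num_labels):
            vote = cw == l                   # pixels where this candidate votes l
            cost[..., l][vote] = np.minimum(cost[..., l][vote], d[vote])
    prior = -np.log(np.transpose(hist, (1, 2, 0)) + 1e-8)
    return cost + alpha * prior

def parse_without_smoothness(costs):
    # With the pairwise term dropped, inference reduces to an argmin per pixel;
    # the full model would instead run belief propagation on these unary costs.
    return np.argmin(costs, axis=2)
```

Running belief propagation over these unary costs plus an edge-dependent pairwise term would recover the full model described in the text.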
Fig. 8. System overview. For a query image, our system uses scene retrieval techniques such as [34] to find
⟨K, ε⟩-nearest neighbors in our database. We apply coarse-to-fine SIFT flow to align the query image to the
nearest neighbors, and obtain the top M as voting candidates (M = 3 here). (c), (d), (e): The RGB image, SIFT
image, and user annotation of the voting candidates. (f): The inferred SIFT flow field, visualized using the
color scheme shown on the left (hue: orientation; saturation: magnitude). (g), (h), (i): The warped versions
of (c), (d), (e) with respect to the SIFT flow in (f). Notice the similarity between (a) and (g), and between
(b) and (h). Our system combines the votes from multiple candidates and generates the scene parsing in (j) by
optimizing the posterior. (k): The ground-truth annotation of (a).
Chapter 9
Results
EXPECTED OUTPUT:
OUTPUT 1: A blue-colored portion of the beach.
OUTPUT 2: An onion.
[OUTPUTS 3-6: result images not reproduced in this text.]
CONCLUSION
The future of image processing and scene parsing may involve scanning the
heavens for other intelligent life in space, and new intelligent digital species
created entirely by research scientists in various nations will draw on
advances in image processing applications. Owing to advances in image processing
and related technologies, there may be millions of robots in the world within a
few decades, transforming the way the world is managed. Advances in image
processing and artificial intelligence will enable spoken commands,
anticipating the information requirements of governments, translating languages,
recognizing and tracking people and things, diagnosing medical conditions,
performing surgery, reprogramming defects in human DNA, and automatically
driving all forms of transport. With the increasing power and sophistication of
modern computing, the concept of computation can go beyond present limits, and
in the future image processing technology may advance to the point where the
human visual system can be replicated. The trend in remote sensing is toward
improved sensors that record the same scene in many spectral channels. Graphics
data is also becoming increasingly important in image processing applications.
Future image processing applications of satellite-based imaging range from
planetary exploration to surveillance.
REFERENCES
1. E.H. Adelson, “On Seeing Stuff: The Perception of Materials by Humans and
Machines,” Proc. SPIE, vol. 4299, pp. 1-12, 2001.
2. S. Belongie, J. Malik, and J. Puzicha, “Shape Context: A New Descriptor for
Shape Matching and Object Recognition,” Proc. Advances in Neural
Information Processing Systems, 2000.
3. A. Berg, T. Berg, and J. Malik, “Shape Matching and Object Recognition
Using Low Distortion Correspondence,” Proc. IEEE Conf. Computer Vision
and Pattern Recognition, 2005.
4. I. Borg and P. Groenen, Modern Multidimensional Scaling: Theory and
Applications, second ed. Springer-Verlag, 2005.
5. M.J. Choi, J.J. Lim, A. Torralba, and A. Willsky, “Exploiting Hierarchical
Context on a Large Database of Object Categories,” Proc. IEEE Conf.
Computer Vision and Pattern Recognition, 2010.
6. D. Crandall, P. Felzenszwalb, and D. Huttenlocher, “Spatial Priors for Part-
Based Recognition Using Statistical Models,” Proc. IEEE Conf. Computer
Vision and Pattern Recognition, 2005.
7. G. Edwards, T. Cootes, and C. Taylor, “Face Recognition Using Active
Appearance Models,” Proc. European Conf. Computer Vision, 1998.
8. P. Felzenszwalb and D. Huttenlocher, “Pictorial Structures for Object
Recognition,” Int’l J. Computer Vision, vol. 61, no. 1, pp. 55-79, 2005.
12. A. Frome, Y. Singer, and J. Malik, “Image Retrieval and Classification Using
Local Distance Functions,” Proc. Advances in Neural Information Processing
Systems, 2006.
13. A. Gupta and L.S. Davis, “Beyond Nouns: Exploiting Prepositions and
Comparative Adjectives for Learning Visual Classifiers,” Proc. European
Conf. Computer Vision, 2008.
15. B.C. Russell, A. Torralba, K.P. Murphy, and W.T. Freeman, “LabelMe: A
Database and Web-Based Tool for Image Annotation,” Int’l J. Computer
Vision, vol. 77, nos. 1-3, pp. 157-173, 2008.
APPENDICES
PROGRAM:
% The RGB image is converted to LAB color space and then the user draws
% some freehand-drawn irregularly shaped region to identify a color.
% The Delta E (the color difference in LAB color space) is then calculated
% for every pixel in the image between that pixel's color and the average
% LAB color of the drawn region. The user can then specify a number that
% says how close to that color they would like to be. The software will
% then find all pixels within that specified Delta E of the color of the
% drawn region.
function DeltaE()
clc; % Clear command window.
clear; % Delete all variables.
close all; % Close all figure windows except those created by imtool.
% imtool close all; % Close all figure windows created by imtool.
workspace; % Make sure the workspace panel is showing.
try
% Check that user has the Image Processing Toolbox installed.
hasIPT = license('test', 'image_toolbox');
if ~hasIPT
% User does not have the toolbox installed.
message = sprintf('Sorry, but you do not seem to have the Image Processing Toolbox.\nDo you want to try to continue anyway?');
reply = questdlg(message, 'Toolbox missing', 'Yes', 'No', 'Yes');
if strcmpi(reply, 'No')
% User said No, so exit.
return;
end
end
% Ask user if they want to use a demo image or their own image.
message = sprintf('Do you want to use a standard demo image,\nor pick one of your own?');
reply2 = questdlg(message, 'Which Image?', 'Demo','My Own', 'Demo');
% Open an image.
if strcmpi(reply2, 'Demo')
% Read standard MATLAB demo image.
message = sprintf('Which demo image do you want to use?');
selectedImage = questdlg(message, 'Which Demo Image?', 'Onions', 'Peppers', 'Stained Fabric', 'Onions');
if strcmp(selectedImage, 'Onions')
fullImageFileName = 'onion.png';
elseif strcmp(selectedImage, 'Peppers')
fullImageFileName = 'peppers.png';
else
fullImageFileName = 'fabric.png';
end
else
% They want to pick their own.
% Change default directory to the one containing the standard demo images
% for the MATLAB Image Processing Toolbox.
originalFolder = pwd;
folder = fullfile(matlabroot, '\toolbox\images\imdemos');
if ~exist(folder, 'dir')
folder = pwd;
end
cd(folder);
% Browse for the image file.
[baseFileName, folder] = uigetfile('*.*', 'Specify an image file');
fullImageFileName = fullfile(folder, baseFileName);
% Set current folder back to the original one.
cd(originalFolder);
selectedImage = 'My own image'; % Needed for the threshold selection statement later.
end
% Check to see that the image exists. (Mainly to check on the demo images.)
if ~exist(fullImageFileName, 'file')
message = sprintf('This file does not exist:\n%s', fullImageFileName);
WarnUser(message);
return;
end
% Display the masked Delta E image - the delta E within the masked region only.
subplot(3, 4, 6);
imshow(maskedDeltaE, []);
caption = sprintf('Delta E between image within masked region\nand mean color within masked region.\n(With amplified intensity)');
title(caption, 'FontSize', fontSize);
% Display the Delta E image - the delta E over the entire image.
subplot(3, 4, 7);
imshow(deltaE, []);
caption = sprintf('Delta E Image\n(Darker = Better Match)');
title(caption, 'FontSize', fontSize);
% Find out how close the user wants to match the colors.
prompt = {sprintf('First, examine the histogram.\nThen find pixels within this Delta E (from the average color in the region you drew):')};
dialogTitle = 'Enter Delta E Tolerance';
numberOfLines = 1;
% Set the default tolerance to be the mean delta E in the masked region
% plus three standard deviations.
strTolerance = sprintf('%.1f', meanMaskedDeltaE + 3 * stDevMaskedDeltaE);
defaultAnswer = {strTolerance}; % Suggest this number to the user.
response = inputdlg(prompt, dialogTitle, numberOfLines, defaultAnswer);
% Update tolerance with user's response.
tolerance = str2double(cell2mat(response));
% Let them interactively select the threshold with the threshold() m-file.
% (Note: This is a separate function in a separate file in my File Exchange.)
% threshold(deltaE);
% Mask the image with the matching colors and extract those pixels.
matchingColors = bsxfun(@times, rgbImage, cast(binaryImage, class(rgbImage)));
subplot(3, 4, 10);
imshow(matchingColors);
caption = sprintf('Matching Colors (Delta E <= %.1f)', tolerance);
title(caption, 'FontSize', fontSize);
% Mask the image with the NON-matching colors and extract those pixels.
nonMatchingColors = bsxfun(@times, rgbImage, cast(~binaryImage, class(rgbImage)));
subplot(3, 4, 11);
imshow(nonMatchingColors);
caption = sprintf('Non-Matching Colors (Delta E > %.1f)', tolerance);
title(caption, 'FontSize', fontSize);
% Display credits: the MATLAB logo and my name.
ShowCredits(); % Display logo in plot position #12.
catch ME
errorMessage = sprintf('Error running this m-file:\n%s\n\nThe error message is:\n%s', ...
    mfilename('fullpath'), ME.message);
errordlg(errorMessage);
end
return; % from DeltaE()
% ---------- End of main function ---------------------------------
%-----------------------------------------------------------------------------
% Have the user draw a box region over the image.
%-----------------------------------------------------------------------------
function [xCoords, yCoords, roiPosition] = DrawBoxRegion(handleToImage)
try
% Open a temporary full-screen figure if requested.
enlargeForDrawing = true;
axes(handleToImage);
if enlargeForDrawing
hImage = findobj(gca,'Type','image');
numberOfImagesInside = length(hImage);
if numberOfImagesInside > 1
imageInside = get(hImage(1), 'CData');
else
imageInside = get(hImage, 'CData');
end
hTemp = figure;
hImage2 = imshow(imageInside, []);
[rows columns NumberOfColorBands] = size(imageInside);
set(gcf, 'Position', get(0,'Screensize')); % Maximize figure.
end
txtInfo = sprintf('Draw a box over the unstained fabric by clicking and dragging over the image.\nDouble click inside the box to finish drawing.');
text(10, 40, txtInfo, 'color', 'r', 'FontSize', 24);
hBox = imrect;
roiPosition = wait(hBox);
roiPosition
% Erase all previous lines.
if ~enlargeForDrawing
axes(handleToImage);
% ClearLinesFromAxes(handles);
end
%-----------------------------------------------------------------------------
function [mask] = DrawFreehandRegion(handleToImage, rgbImage)
try
fontSize = 14;
% Open a temporary full-screen figure if requested.
enlargeForDrawing = true;
axes(handleToImage);
if enlargeForDrawing
hImage = findobj(gca,'Type','image');
numberOfImagesInside = length(hImage);
if numberOfImagesInside > 1
imageInside = get(hImage(1), 'CData');
else
imageInside = get(hImage, 'CData');
end
hTemp = figure;
hImage2 = imshow(imageInside, []);
[rows columns NumberOfColorBands] = size(imageInside);
set(gcf, 'Position', get(0,'Screensize')); % Maximize figure.
end
message = sprintf('Left click and hold to begin drawing.\nSimply lift the mouse button to finish.');
text(10, 40, message, 'color', 'r', 'FontSize', fontSize);
% Now, finally, have the user freehand draw the mask in the image.
hFH = imfreehand();
% Once we get here, the user has finished drawing the region.
% Create a binary image ("mask") from the ROI object.
mask = hFH.createMask();
%-----------------------------------------------------------------------------
% Get the average lab within the mask region.
function [LMean, aMean, bMean] = GetMeanLABValues(LChannel, aChannel, bChannel, mask)
try
LVector = LChannel(mask); % 1D vector of only the pixels within the masked area.
LMean = mean(LVector);
aVector = aChannel(mask); % 1D vector of only the pixels within the masked area.
aMean = mean(aVector);
bVector = bChannel(mask); % 1D vector of only the pixels within the masked area.
bMean = mean(bVector);
catch ME
errorMessage = sprintf('Error running GetMeanLABValues:\n\n\nThe error message is:\n%s', ...
    ME.message);
WarnUser(errorMessage);
end
return; % from GetMeanLABValues
%==========================================================================
function WarnUser(warningMessage)
uiwait(warndlg(warningMessage));
return; % from WarnUser()
%==========================================================================
function msgboxw(message)
uiwait(msgbox(message));
return; % from msgboxw()
%==========================================================================
% Plots the histograms of the pixels in both the masked region and the entire image.
function PlotHistogram(maskedRegion, doubleImage, plotNumber, caption)
try
fontSize = 14;
subplot(plotNumber(1), plotNumber(2), plotNumber(3));
% Find out where the edges of the histogram bins should be.
maxValue1 = max(maskedRegion(:));
maxValue2 = max(doubleImage(:));
maxOverallValue = max([maxValue1 maxValue2]);
edges = linspace(0, maxOverallValue, 100);
catch ME
errorMessage = sprintf('Error running PlotHistogram:\n\n\nThe error message is:\n%s', ...
    ME.message);
WarnUser(errorMessage);
end
return; % from PlotHistogram
%==========================================================================
% Shows vertical lines going up from the X axis to the curve on the plot.
function lineHandle = PlaceVerticalBarOnPlot(handleToPlot, x, lineColor)
try
% If the plot is visible, plot the line.
if strcmpi(get(handleToPlot, 'Visible'), 'on')
axes(handleToPlot); % Make the given plot axes the current axes.
% Make sure x location is in the valid range along the horizontal X axis.
XRange = get(handleToPlot, 'XLim');
maxXValue = XRange(2);
if x > maxXValue
x = maxXValue;
end
% Erase the old line.
%hOldBar=findobj('type', 'hggroup');
%delete(hOldBar);
% Draw a vertical line at the X location.
hold on;
yLimits = ylim;
lineHandle = line([x x], [yLimits(1) yLimits(2)], 'Color', lineColor, 'LineWidth', 3);
hold off;
end
catch ME
errorMessage = sprintf('Error running PlaceVerticalBarOnPlot:\n\n\nThe error message is:\n%s', ...
    ME.message);
WarnUser(errorMessage);
end
return; % End of PlaceVerticalBarOnPlot