
3D Reconstruction Based on Image Pyramid and Block Matching

CHAPTER 1
INTRODUCTION TO 3D
1.1 3D RECONSTRUCTION FROM SINGLE 2D IMAGES
Given a single image of an everyday object, a sculptor can recreate its 3D shape (i.e.,
produce a statue of the object), even if that particular object has never been seen before.
Presumably, it is familiarity with the shapes of similar 3D objects (i.e., objects from the
same class), and with how they appear in images, that enables the artist to estimate the
shape. This might not be the exact shape of the object, but it is often a good enough
estimate for many purposes.
In general, the problem of 3D reconstruction from a single 2D image is ill posed,
since different shapes may give rise to the same intensity patterns.

To solve this, additional constraints are required. Here, we constrain the
reconstruction process by assuming that similarly looking objects from the same class
(e.g., faces, fish) have similar shapes. We maintain a set of 3D objects, selected as
examples of a specific class. We use these objects to produce a database of images of
objects in the class (e.g., by standard rendering techniques), along with their respective
depth maps. These provide examples of feasible mappings from intensities to shapes, and
are used to estimate the shapes of objects in query images.
Methods for single image reconstruction commonly use cues such as shading,
silhouette shapes, texture, and vanishing points. These methods restrict the allowable
reconstructions by placing constraints on the properties of reconstructed objects (e.g.,
reflectance properties, viewing conditions, and symmetry). A few approaches explicitly
use examples to guide the reconstruction process. One approach reconstructs outdoor
scenes assuming they can be labelled as ground, sky, and vertical billboards.
The target of the system is a geometric model of the scene. So here we consider
geometric reconstruction, and not photometric (or image-based) reconstruction, which
directly generates new views of a scene without (completely) reconstructing its 3D
structure. With the purpose and application context stated, we set the limits as follows:

Dept. of ECE, MRITS



Static scenes: There are no moving objects, or the movement of objects is relatively small.

Un-calibrated cameras: The input data is captured by an un-calibrated camera, i.e. the
camera's intrinsic parameters, such as the focal length, are unknown.

Varying intrinsic camera parameters: The camera's intrinsic parameters (e.g. focal
length) can vary freely. Together with the previous assumption, this adds flexibility to the
system.

1.2 3D RECONSTRUCTION FROM VIDEO SEQUENCES


The application-oriented description of 3D reconstruction from video sequences
(shortly called 3D reconstruction) consists of the following steps:

1. The process starts with the data capturing step, in which a person moves around
and captures a static scene using a hand-held camera.

2. The recorded video sequence is then pre-processed (e.g. selecting frames,
removing noise, normalizing illumination).

3. After that, the video sequence is processed to produce a 3D model of the scene.

4. Finally, the 3D model can be rendered, or exported for editing using 3D modeling
tools.

Fig.1.1: Main tasks of 3D reconstruction

The 3D reconstruction (step 3) can be divided into 4 main tasks, which are as follows:

1. Feature detection and matching: The objective of this step is to find the same
features in different images and match them.

2. Structure and motion recovery: This step recovers the structure and motion of
the scene (i.e. the 3D coordinates of detected features, and the position, orientation and
parameters of the camera at the capturing positions).

3. Stereo mapping: This step creates a dense matching map. In conjunction with the
structure recovered in the previous step, this enables building a dense depth map.

4. Modeling: This step includes the procedures needed to make a real model of the
scene (e.g. building mesh models, mapping textures).

Some define the input as an image sequence, but fig.1.1 defines it as a video
sequence, since our practical objective is a system that does reconstruction from video. By
defining it like that, we want to clearly state that the intermediate step to go from video to
image sequences (i.e. frame selection) is a part of the reconstruction process.
1.2.1 Feature Detection and Matching
This process creates relations used by the next step, structure and motion
recovery, by detecting and matching features in different images. Until now, the features
used in structure recovery processes are points and lines. So here features are understood
as points or lines.

Fig.1.2: Pollefeys 3D modelling framework


Detectors: Given an image, a feature detector is a process that detects features in the
image. The most important information a detector gives is the location of the features, but
other characteristics, such as scale, can also be detected. Two characteristics that a good
detector needs are repeatability and reliability. Repeatability means that the same feature can
be detected in different images. Reliability means that a detected point should be
distinctive enough that the number of its matching candidates is small.


Descriptors: Suppose there are two images (from two different views) of a scene, and
some features have already been extracted from them. To find corresponding pairs of
features, we need feature descriptors. A descriptor is a process that takes information
about the features and the image and produces descriptive information, i.e. feature
descriptions, which are usually presented in the form of feature vectors. The descriptions
are then used to match a feature to one in another image.
A descriptor should be invariant to rotation, scaling, and affine transformation, so
that the same feature in different images is characterized by almost the same value, and
distinctive, to reduce the number of possible matches.
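To make the matching step concrete, here is a minimal Python sketch (not part of any system described in this report) of nearest-neighbour descriptor matching; the ratio test used here is one common way to enforce the distinctness requirement, and the toy descriptors are purely illustrative:

```python
import numpy as np

def match_descriptors(desc_a, desc_b, ratio=0.8):
    """Match feature descriptors by nearest neighbour with a ratio test.

    desc_a, desc_b: arrays of shape (n, d) holding feature vectors.
    Returns (index_in_a, index_in_b) pairs. A match is kept only when the
    best candidate is clearly better than the second best, i.e. when the
    feature is distinctive enough.
    """
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)  # distances to all candidates
        order = np.argsort(dists)
        best, second = order[0], order[1]
        if dists[best] < ratio * dists[second]:
            matches.append((i, int(best)))
    return matches

# Toy descriptors: the first two rows of b are close to a's rows; the
# third is a distractor that should never win.
a = np.array([[1.0, 0.0], [0.0, 1.0]])
b = np.array([[0.9, 0.1], [0.1, 0.9], [5.0, 5.0]])
print(match_descriptors(a, b))  # [(0, 0), (1, 1)]
```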

Fig.1.3 : Feature detection and matching process


The roles of detectors and descriptors in the feature detection and matching step
are shown in fig.1.3.
The following describes interest points and lines using the two concepts of
detectors and descriptors.



1.2.2 Interest Points
Here, a point feature is called an interest point. One definition of an interest point is
"any point in the image for which the signal changes two-dimensionally".
1.2.2.1 Point detectors
Classification: Point detectors are classified into three categories: contour based,
intensity based, and parametric model based ones.
Contour based detectors: These detectors first extract contours from images and then
find points that have special characteristics, e.g. junctions, endings, or curvature maxima.
A multi-scale framework can be utilized to get more robust results.
Intensity based detectors: These detectors find interest points by examining the
intensity change around points. To measure the change, first and second derivatives of
the image are used in many different forms and combinations.
Parametric model based detectors: Points are found by matching models/templates
(e.g. of L-corners) to an image.
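As an illustration of an intensity based detector, the following Python sketch computes a Harris-style corner response from first derivatives; the 3x3 window and the constant k = 0.04 are conventional choices for this kind of detector, not values taken from this report:

```python
import numpy as np

def harris_response(img, k=0.04):
    """Harris corner response for a grayscale image (2D float array).

    Uses finite-difference gradients and a 3x3 box window; a practical
    detector would add Gaussian smoothing and non-maximum suppression.
    """
    img = img.astype(float)
    Iy, Ix = np.gradient(img)                  # first derivatives
    Ixx, Iyy, Ixy = Ix * Ix, Iy * Iy, Ix * Iy

    def box(a):                                # 3x3 box sum (structure tensor)
        p = np.pad(a, 1)
        return sum(p[i:i + a.shape[0], j:j + a.shape[1]]
                   for i in range(3) for j in range(3))

    Sxx, Syy, Sxy = box(Ixx), box(Iyy), box(Ixy)
    # Corner response: det(M) - k * trace(M)^2.
    return Sxx * Syy - Sxy ** 2 - k * (Sxx + Syy) ** 2

# A bright square on a dark background: its corners respond strongly,
# while points on a straight edge do not.
img = np.zeros((20, 20))
img[5:15, 5:15] = 1.0
R = harris_response(img)
print(R[5, 5] > R[5, 10])  # corner scores higher than an edge point: True
```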

1.2.2.2 Point descriptors


This subsection describes point descriptors and their application to matching for 3D
reconstruction.

Fig.1.4: Interest points detected by SIFT (green marks).



Classification: Point descriptors are classified into the following categories:
Distribution based descriptors: Histograms are used to represent the characteristics of
a region. The characteristics could be pixel intensity, distance from the centre point,
relative ordering of intensity, or gradient.
Spatial-frequency descriptors: These techniques come from the domain of texture
classification and description. Texture description using Gabor filters is standardized in
MPEG-7.
Differential descriptors: A set of local derivatives (the local jet) is used to describe an
interest region; descriptors of this kind have also been used to evaluate the reliability of
detectors.
Moments: Moments are used to describe a region. The central moments of a region, in
combination with the moment order and degree, form the invariants.

1.3 LINES
Two-view projective reconstruction can use only point correspondences, but in
three or more view structure recovery it is possible to use line correspondences as well.
1.3.1 Line detection
Line detection usually includes edge detection, followed by line extraction.
1.3.2 Edge detection
The key to solving the problem is the intensity change, which is shown via the
gradient of the image. Edge detectors usually follow the same routine: smoothing,
applying edge enhancement filters, thresholding, and edge tracing.
Evaluations of edge detectors are inconsistent and not convergent, for reasons such
as unclear objectives and varying parameters. One series of evaluations tests algorithms
in different tasks, with the application acting as a black box; one of these tasks is structure
from motion. That evaluation shows that, overall, the Canny detector is the most suitable
because of its performance: fastest speed and low sensitivity to parameter variation.
However, the structure from motion algorithm used there is not a three-view one and uses
line segments rather than lines; also, the intermediate processing (line extraction and
correspondence) that would affect the final result is fixed. Thus the result is not concrete
enough.
1.3.3 Line Extraction
Extracting lines can be done in several ways. The Hough transform is famous in
curve fitting; despite its long history, the Hough transform and its extensions are still
widely used. A simpler approach connects line segments under a limit on angle change,
and then uses the least median of squares method to fit the connected paths into lines. As
with edge detection, no complete and concrete evaluation of line extraction is available.
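The Hough transform mentioned above can be sketched in a few lines of Python, under simplifying assumptions (a fixed accumulator resolution and edge points that have already been extracted):

```python
import numpy as np

def hough_lines(points, shape, n_theta=180, n_rho=200):
    """Minimal Hough transform for lines over a set of edge points.

    Each point votes for all (rho, theta) line parameters passing through
    it, using the normal form rho = x*cos(theta) + y*sin(theta).
    Returns the (rho, theta) cell with the most votes.
    """
    h, w = shape
    diag = np.hypot(h, w)
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    rhos = np.linspace(-diag, diag, n_rho)
    acc = np.zeros((n_rho, n_theta), dtype=int)
    for x, y in points:
        r = x * np.cos(thetas) + y * np.sin(thetas)   # rho for each theta
        idx = np.round((r + diag) / (2 * diag) * (n_rho - 1)).astype(int)
        acc[idx, np.arange(n_theta)] += 1             # cast the votes
    i, j = np.unravel_index(np.argmax(acc), acc.shape)
    return rhos[i], thetas[j]

# Points on the horizontal line y = 10 inside a 100x100 image: the winner
# should have theta near pi/2 (normal along y) and rho near 10.
pts = [(x, 10) for x in range(20, 80)]
rho, theta = hough_lines(pts, (100, 100))
print(abs(theta - np.pi / 2) < 0.05, abs(rho - 10) < 2)
```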

1.3.4 Line matching


Lines can be matched based on attributes such as orientation, length, or extent of
overlap. Matching strategies such as nearest line, or additional-view verification, can be
used to increase speed and accuracy. Optical flow can be employed in the case of a short
baseline. Matching groups of lines (graph matching) is more accurate than matching
individual lines. Beardsley et al. use geometric constraints in both the two-view and
three-view cases to match lines; the constraints are found by a robust method using
corresponding points.

Lines are highly structured features and give stronger constraints. They are
plentiful and easy to extract in scenes dominated by artificial objects, e.g. urban
architecture. However, the fact that evaluations of line extraction and matching for
structure recovery are not complete and concrete is probably the reason why, although the
theory of three-view reconstruction with lines has been available for a long time, methods
for structure recovery usually use point correspondences. One of the few works that uses
line correspondences and trifocal tensors is that of Beardsley, but even there lines are not
used directly: point correspondences are used first to recover the geometric information.

1.4 STRUCTURE AND MOTION RECOVERY


The second task, structure and motion recovery, recovers the structure of the scene
and the motion information of the camera. The motion information is the position,
orientation, and intrinsic parameters of the camera at the captured views. The structure
information is captured by the 3D coordinates of the features.
Given feature correspondences, the geometric constraints among views can be
established. The projection matrices that represent the motion information then can be
recovered. Finally, 3D coordinates of features, i.e. structure information can be computed
via triangulation.
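The triangulation step can be sketched with the standard linear (DLT) method; the two camera matrices below are hypothetical examples for illustration, not cameras from this work:

```python
import numpy as np

def triangulate(projections, points2d):
    """Linear (DLT) triangulation of one 3D point from multiple views.

    projections: list of 3x4 camera matrices P_i.
    points2d: list of (x, y) observations of the same point.
    Each view contributes the rows x*P[2]-P[0] and y*P[2]-P[1] to a
    homogeneous system A X = 0, solved by SVD; this generalizes directly
    from two rays to a least squares solution over many rays.
    """
    rows = []
    for P, (x, y) in zip(projections, points2d):
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    _, _, vt = np.linalg.svd(np.array(rows))
    X = vt[-1]                    # null vector of A
    return X[:3] / X[3]           # de-homogenize

# Two hypothetical cameras: identity pose, and a translation along x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, 0.2, 4.0])
x1 = P1 @ np.append(X_true, 1)
x2 = P2 @ np.append(X_true, 1)
obs = [x1[:2] / x1[2], x2[:2] / x2[2]]
print(np.allclose(triangulate([P1, P2], obs), X_true))  # True
```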

Fig.1.5: Structure and motion recovery process

1.5 ADVANTAGES AND PROBLEMS OF USING VIDEO SEQUENCES

It is possible to do 3D reconstruction from still images, but in practice it is more
natural to use video sequences, since this eases the capturing process and provides more
complete data. However, problems also arise. The following describes the advantages and
the problems of using video sequences as input.
1.5.1 Advantages
The most important advantage of using video sequences as input is the higher
quality one can obtain. Both geometric accuracy and visual quality can be improved by
exploiting the redundancy of the data. Intuitively, more back-projecting rays from a
point's projections further limit the possible 3D coordinates of the point. The best texture,
found by selecting the best view or by super-resolution, can be used to get better
visualization quality.


Image sequences also enable techniques to deal with shadows, shading and
highlights.
Other advantages are automation and flexibility. Capturing data with a hand-held
camera is more comfortable, since a person does not have to worry about missing
information or consider whether the captured information is enough for reconstruction.
And regarding processing time, instead of manually selecting some images from a video,
it is better to have a system that can do everything automatically.
1.5.2 Problems
To take advantage of the use of video sequences we have to deal with some
problems, ranging from pre-processing (frame selection, sequence segmentation), during
processing as has been seen in previous sub-sections, to post-processing (bundle
adjustment, structure fusion).
Frame selection: Among a number of frames, selecting good frames will improve the
reconstruction result. Good frames are ones that have proper geometric attributes and
good photometric quality. The problem is related to the estimation of views' position and
orientation and photometric quality evaluation.
Sequence segmentation: Reconstruction algorithms assume that a sequence is
continuously captured. The sequence should be broken into proper scene parts, which are
reconstructed separately and fused later.
Structure fusion: Results of processing different video segments (generated either by
different captures or by segmentation) must be fused together to create a final unique
result.
Bundle adjustment: The reconstruction process includes local updates (e.g. feature
matching, structure updates) and biased assumptions (e.g. use of the first view's
coordinate system). These lead to inconsistency and accumulated errors in the global
result. There should be a global optimization step to produce a unique, consistent result.


1.6 CRITICAL CASES


A critical case happens when it is impossible to make a metric reconstruction
from the input data, either because of the characteristics of the scene or because of the
capturing positions.
In practice, metric reconstruction from video sequences captured by a person
using a hand-held camera hardly falls into an absolute critical case. However, nearly
critical cases are common in practice, e.g. a camera moving along a wall or on an elliptic
orbit around the object. That is why studying critical configurations and detecting those
cases is extremely important to create a robust reconstruction method, or select the most
suitable method for the case.
There are two kinds of critical cases: (i) critical surfaces, or critical configurations,
and (ii) critical motion sequences (of the camera). The first class depends on the observed
points; the latter depends only on the camera motion, i.e. it can happen with any scene. A
"brute force" approach tries several algorithms and selects the best one; this, however,
only helps to reject a critical case, not to pick the proper method for it. Some important
notes about critical cases are:
Normal cases under some conditions, e.g. calibrated or fixed intrinsic parameters, can
turn into critical ones when the conditions change. The more un-calibrated the camera, i.e.
the fewer of the camera's parameters that are known, the more ambiguous the
reconstruction will be.

1.7 IMAGE-BASED 3D RECONSTRUCTION


Image-based 3D reconstruction is an active field of research in Photogrammetry
and Computer Vision. The need for detailed 3D models for mapping and navigation,
inspection, cultural heritage conservation, or photorealistic image-based rendering for the
entertainment industry has led to the development of several techniques to recover the
shape of objects. To achieve precise and highly detailed reconstructions, active laser
scanning is often employed, providing 2.5D range images and the respective 3D point
cloud at a metric scale.

On the other hand, laser-based methods are complex to handle for large scale
outdoor scenes, especially for aerial data acquisition. In contrast, passive image-based
methods that utilize multiple overlapping views are easily deployable and are low cost
compared to laser scanning, but require some post-processing effort to derive depth
information. In this work we investigate how redundancy and baseline influence the
depth accuracy of multiple-view matching methods.
In particular, we perform synthetic experiments on a typical aerial camera
network that corresponds to a 2D flight pattern with 80% forward-overlap and 60% side-lap,
as shown in fig.1.6. By covariance analysis of triangulated scene points, the theoretical
bound of depth accuracy is determined according to the triangulation angle and the
number of measurements (i.e. the redundancy).
One of the main findings is that true multi-view matching/triangulation
outperforms fused two-view stereo results by at least one order of magnitude in terms of
depth accuracy. Furthermore, we present a fast, accurate and robust matching and
reconstruction technique, suitable for high resolution images of large scale scenes, that is
able to compete by leveraging the redundancy of many views. The solution for multi-view
reconstruction is based on pair-wise stereo, employing efficient and robust optical
flow that is restricted to the epipolar geometry. Unlike standard aerial

Fig.1.6(a): The view network, a sparse reconstruction and uncertainties (magnified by
1000 for better visibility) for selected 3D points on a regularly sampled grid on the
ground plane.


Fig.1.6(b): Reconstructed dense point cloud from our multi-view method of an urban
scene.

matching approaches that rely on 2.5D data fusion of pair-wise stereo depth maps, our
correspondence chaining (i.e. measurement linking) and triangulation approach takes full
advantage of the achievable baseline (i.e. triangulation angles). In contrast to voxel-based
approaches, polygonal meshes and local patches, we focus on algorithms representing
geometry as a set of depth maps. This eliminates the need for resampling the geometry in
the three-dimensional domain, and it can be easily parallelized. We evaluate the approach
on a multi-view benchmark data set that provides accurate ground truth, and on large
scale aerial images.

1.8 UNCERTAINTY OF SCENE POINTS


The depth uncertainty of a rectified stereo pair can be directly determined from
the disparity error Δd:

Δz = −(z² / (f · b)) · Δd ................................ (1)

where z is the point depth, f the focal length and b the image baseline. Hence the depth
precision is mainly a function of the ray intersection angle. In contrast, for multi view
image matching and triangulation, the redundancy not only implies more measurements
but additionally constrains the 3D point location through multiple ray intersections. These
entities are not independent but are coupled, since they rely on the network geometric
configuration that determines image overlap (i.e. redundancy) and baseline,
simultaneously. Given a photogrammetric network of cameras and correspondences with
known error distribution, the precision of triangulated points can be determined from the
3D confidence ellipsoid (i.e. covariance matrix CX) as shown in fig. 1.7. An empirical
estimate of the covariance ellipsoid corresponding to multi view triangulation can be
computed by statistical simulation. For the moment we assume that camera orientations
and 3D structure are fixed and known.
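A small numeric sketch of equation (1) follows; the camera numbers (flying height, focal length in pixels, baselines) are hypothetical values chosen only for illustration:

```python
# Depth uncertainty of a rectified stereo pair, following equation (1):
# the error grows quadratically with depth and shrinks with the baseline.
def depth_uncertainty(z, f, b, disparity_error):
    """|dz| = z^2 / (f * b) * dd, with f in pixels, z and b in metres."""
    return z * z / (f * b) * disparity_error

# Hypothetical numbers: 900 m flying height, 10000-pixel focal length,
# and a 1-pixel matching error.
dz_short = depth_uncertainty(900.0, 10000.0, 100.0, 1.0)   # 100 m baseline
dz_long = depth_uncertainty(900.0, 10000.0, 500.0, 1.0)    # 500 m baseline
print(dz_short, dz_long)  # the 5x longer baseline gives 5x better precision
```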

The cameras are distributed along a 2D grid (corresponding to flight paths) in
order to achieve an 80% forward overlap and 60% side-lap, as shown in fig.1.7.
Corresponding to a large-format digital aerial camera (e.g. the UltraCam-D from
Microsoft), the image resolution is set to 7500 × 11500 pixels with a field of view of 54°.
Furthermore, 3D points are evenly distributed on a 2D plane that corresponds to the bald
earth surface observed from a flying height of 900 m. Therefore, an average Ground
Sampling Distance (GSD) of 8 cm/pixel is achieved.
Given the cameras Pᵢ, i = 1…N (i.e. calibration and poses) and 3D points Xⱼ, j = 1…M,
the respective ground-truth projections are produced as xᵢⱼ = Pᵢ Xⱼ. Therefore, for every 3D
point a point-track (i.e. a set of 2D measurements) is generated, m = (⟨x₁, y₁⟩, ⟨x₂, y₂⟩, …,
⟨xₖ, yₖ⟩). Next, the 2D projections are perturbed by zero-mean Gaussian isotropic noise,
x̂ = x + N(0, Σ), with

Σ = σₓ² I ................................ (2)

and standard deviation σₓ = 1 pixel (i.e. ≈ 8 cm GSD). Given the perturbed point tracks
m̂ = (⟨x̂₁, ŷ₁⟩, ⟨x̂₂, ŷ₂⟩, …, ⟨x̂ₖ, ŷₖ⟩) and the ground-truth projection matrices Pᵢ, the 3D
position of the respective point in space is determined. This process requires the
intersection of at least two known rays in space. Hence, we use a linear triangulation
method to determine the 3D position of point tracks. This method generalizes easily to
the intersection of multiple rays, providing a least squares solution.
Optionally, a non-linear optimizer based on the Levenberg-Marquardt algorithm is used to
refine the 3D point by minimizing the projection error. Through Monte Carlo simulation
on the perturbed measurement vectors m̂ we obtain a distribution of 3D points Xᵢ around
a mean position X̂. From the Law of Large Numbers it follows that for a large number N
of simulations,
one can approximate the mean 3D position by

X̂ = (1/N) Σᵢ₌₁ᴺ Xᵢ ................................ (3)

and its respective covariance matrix by

C_X = E_N[(Xᵢ − E_N[Xᵢ]) (Xᵢ − E_N[Xᵢ])ᵀ] ................................ (4)

Using the singular value decomposition, the covariance matrix can then be diagonalized,

C_X = U · diag(σ₁², σ₂², σ₃²) · Vᵀ ................................ (5)

where the columns of U represent the main axes of the covariance ellipsoid and σ₁, σ₂, σ₃
are the respective standard deviations. The decomposition of the covariance matrix in
equation (5) into its main axes directly relates to the uncertainty in the x-y and z
directions. Under the assumption of fronto-parallel image acquisition, the largest singular
value corresponds to the uncertainty in depth, and σ₂ and σ₃ to the uncertainty in the x-y
directions, respectively.
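The Monte Carlo procedure of equations (2)-(5) can be sketched in a few lines of Python; the two-camera setup and noise level below are hypothetical stand-ins for the aerial network described above:

```python
import numpy as np

rng = np.random.default_rng(0)

def triangulate(Ps, obs):
    """Linear (DLT) triangulation of one point from several views."""
    rows = []
    for P, (x, y) in zip(Ps, obs):
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    _, _, vt = np.linalg.svd(np.array(rows))
    X = vt[-1]
    return X[:3] / X[3]

# Two hypothetical views of a point at depth 10, baseline 2.
Ps = [np.hstack([np.eye(3), np.zeros((3, 1))]),
      np.hstack([np.eye(3), np.array([[-2.0], [0.0], [0.0]])])]
X_true = np.array([0.0, 0.0, 10.0])
clean = []
for P in Ps:
    x = P @ np.append(X_true, 1)
    clean.append(x[:2] / x[2])

# Perturb the projections with isotropic Gaussian noise, re-triangulate,
# and estimate the mean and covariance (equations 3 and 4).
sigma = 1e-3
samples = np.array([
    triangulate(Ps, [m + rng.normal(0.0, sigma, 2) for m in clean])
    for _ in range(2000)
])
mean = samples.mean(axis=0)                    # equation (3)
C = np.cov(samples.T)                          # equation (4)
svals = np.linalg.svd(C, compute_uv=False)     # diagonalization, eq. (5)
# Depth (z) is the most uncertain direction in this geometry.
print(np.allclose(mean, X_true, atol=0.1), C[2, 2] > C[1, 1])
```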

1.9 LITERATURE SURVEY


With the advent of the multimedia age and the spread of the Internet, video storage on
CD/DVD and video communication have been gaining a lot of popularity. The ISO
Moving Picture Experts Group (MPEG) video coding standards pertain to compressed
video storage on physical media like CD/DVD, whereas the International
Telecommunications Union (ITU) standards address real-time point-to-point or
multi-point communication over a network. The former has the advantage of having
higher bandwidth available for data transmission.

In either standard, the basic flow of the entire compression-decompression process
is largely the same, and is shown in fig. 1.7. The encoding side estimates the motion in the
current frame with respect to a previous frame. A motion compensated image for the
current frame is then created, built from blocks of the image from the previous frame. The
motion vectors for the blocks used in motion estimation are transmitted, along with the
difference between the compensated image and the current frame, which is also JPEG
encoded and sent. The encoded image that is sent is then decoded at the encoder and used
as a reference frame for the subsequent frames. The decoder reverses the process and
reconstructs a full frame.

The whole idea behind motion-estimation-based video compression is to save on
bits by sending JPEG encoded difference images, which inherently have less energy and
can be highly compressed, as compared to sending a full frame that is JPEG encoded.
Motion JPEG, where all frames are JPEG encoded, achieves anything between 10:1 and
15:1 compression, whereas MPEG can achieve a compression ratio of 30:1 and is also
usable at ratios around 100:1. It should be noted that the first frame is always sent in full,
and so are some other frames at some regular interval (like every 6th frame). The
standards do not specify this, and it might change with every video sent, based on the
dynamics of the video.
The most computationally expensive and resource hungry operation in the entire
compression process is motion estimation. Hence, this field has seen the highest activity
and research interest in the past two decades. The algorithms that have been implemented
are Exhaustive Search (ES), Three Step Search (TSS), New Three Step Search (NTSS),
Simple and Efficient TSS (SES), Four Step Search (4SS), Diamond Search (DS), and
Adaptive Rood Pattern Search (ARPS).


Fig.1.7: MPEG / H.26x video compression process flow



1.10 BLOCK MATCHING ALGORITHMS


The underlying supposition behind motion estimation is that the patterns
corresponding to objects and background in a frame of a video sequence move within the
frame to form the corresponding objects in the subsequent frame. The idea behind block
matching is to divide the current frame into a matrix of macro blocks, each of which is
then compared with the corresponding block and its adjacent neighbours in the previous
frame to create a vector that stipulates the movement of a macro block from one location to
another in the previous frame. This movement, calculated for all the macro blocks
comprising a frame, constitutes the motion estimated in the current frame. The search
area for a good macro block match is constrained to p pixels on all four sides of the
corresponding macro block in the previous frame. This p is called the search parameter.

Fig.1.8: Block matching a macro block of side 16 pixels and a search parameter p of
size 7 pixels.

Larger motions require a larger p, and the larger the search parameter, the more
computationally expensive the process of motion estimation becomes. Usually the macro
block is taken as a square of side 16 pixels, and the search parameter p is 7 pixels. The
idea is represented in fig.1.8. The matching of one macro block with another is based on
the output of a cost function. The macro block that results in the least cost is the one that
matches the current block most closely. There are various cost functions, of which the
most popular and least computationally expensive is the Mean Absolute Difference
(MAD), given by equation (6). Another cost function is the Mean Squared Error (MSE),
given by equation (7).

MAD = (1/N²) Σᵢ₌₀^(N−1) Σⱼ₌₀^(N−1) |Cᵢⱼ − Rᵢⱼ| ................................ (6)

MSE = (1/N²) Σᵢ₌₀^(N−1) Σⱼ₌₀^(N−1) (Cᵢⱼ − Rᵢⱼ)² ................................ (7)

where N is the side of the macro block, and Cᵢⱼ and Rᵢⱼ are the pixels being compared in
the current macro block and the reference macro block, respectively. The Peak
Signal-to-Noise Ratio (PSNR), given by equation (8), characterizes the motion
compensated image that is created using the motion vectors and macro blocks from the
reference frame:

PSNR = 10 log₁₀ [(peak value)² / MSE] ................................ (8)

where the peak value is 255 for 8-bit images.
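For concreteness, these cost functions can be written directly in Python; the helper names are illustrative, and the PSNR helper assumes 8-bit pixels with a peak value of 255:

```python
import numpy as np

def mad(c, r):
    """Mean Absolute Difference between two N x N macro blocks."""
    return np.abs(c.astype(float) - r.astype(float)).mean()

def mse(c, r):
    """Mean Squared Error between two macro blocks."""
    d = c.astype(float) - r.astype(float)
    return (d * d).mean()

def psnr(c, r, peak=255.0):
    """10 * log10(peak^2 / MSE) of the compensated block."""
    return 10.0 * np.log10(peak * peak / mse(c, r))

# Two flat 16x16 blocks differing by 2 grey levels everywhere.
c = np.full((16, 16), 120, dtype=np.uint8)
r = np.full((16, 16), 118, dtype=np.uint8)
print(mad(c, r), mse(c, r))  # 2.0 4.0
```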

(a)Exhaustive Search (ES)


This algorithm, also known as Full Search, is the most computationally expensive
block matching algorithm of all. It calculates the cost function at each possible location
in the search window. As a result, it finds the best possible match and gives the highest
PSNR of any block matching algorithm. Fast block matching algorithms try to achieve
the same PSNR with as little computation as possible. The disadvantage of ES is that the
larger the search window gets, the more computations it requires.
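A minimal Python sketch of exhaustive search over a synthetic pair of frames follows; the frame contents and block position are illustrative only:

```python
import numpy as np

def exhaustive_search(ref, cur, block, p=7, N=16):
    """Full search over every offset in a (2p+1) x (2p+1) window.

    ref, cur: reference and current frames (2D arrays).
    block: (row, col) of the macro block's top-left corner in cur.
    Returns the motion vector (dy, dx) minimizing the MAD cost.
    """
    r0, c0 = block
    target = cur[r0:r0 + N, c0:c0 + N].astype(float)
    best, best_cost = (0, 0), np.inf
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            r, c = r0 + dy, c0 + dx
            if r < 0 or c < 0 or r + N > ref.shape[0] or c + N > ref.shape[1]:
                continue  # candidate block falls outside the frame
            cand = ref[r:r + N, c:c + N].astype(float)
            cost = np.abs(target - cand).mean()
            if cost < best_cost:
                best, best_cost = (dy, dx), cost
    return best

# Synthetic test: the current frame is the reference shifted by (3, -2).
rng = np.random.default_rng(1)
ref = rng.integers(0, 256, (64, 64)).astype(np.uint8)
cur = np.roll(np.roll(ref, -3, axis=0), 2, axis=1)
print(exhaustive_search(ref, cur, (24, 24)))  # (3, -2)
```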
(b)Three Step Search (TSS)
The general idea is represented in fig.1.9. It starts with the search location at the
centre and sets the step size S = 4, for a usual search parameter value of 7. It then
searches at the eight locations ±S pixels around location (0,0). From these nine locations
searched so far, it picks the one giving the least cost and makes it the new search origin.
It then sets the new step size S = S/2, and repeats a similar search twice more until S = 1.
At that point it finds the location with the least cost, and the macro block at that location
is the best match. The calculated motion vector is then saved for transmission. TSS gives
a flat reduction in computation by a factor of 9, so that for p = 7, ES computes the cost for
225 macro blocks whereas TSS computes the cost for 25 macro blocks. The idea behind
TSS is that the error surface due to motion in every macro block is unimodal. A unimodal
surface is a bowl-shaped surface such that the weights generated by the cost function
increase monotonically from the global minimum.
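The three steps can be sketched as follows; the smooth synthetic frame is chosen so that the unimodality assumption holds (on strongly textured real frames, TSS can be trapped in local minima):

```python
import numpy as np

def mad_cost(ref, cur, block, dy, dx, N=16):
    """MAD between the current macro block and a displaced reference block."""
    r0, c0 = block
    r, c = r0 + dy, c0 + dx
    if r < 0 or c < 0 or r + N > ref.shape[0] or c + N > ref.shape[1]:
        return np.inf
    return np.abs(cur[r0:r0 + N, c0:c0 + N] - ref[r:r + N, c:c + N]).mean()

def three_step_search(ref, cur, block, N=16):
    """Three Step Search: check 9 points, then halve the step S = 4, 2, 1."""
    dy = dx = 0
    S = 4
    while S >= 1:
        # The current origin plus the 8 locations +/- S around it.
        candidates = [(dy + i * S, dx + j * S)
                      for i in (-1, 0, 1) for j in (-1, 0, 1)]
        dy, dx = min(candidates,
                     key=lambda v: mad_cost(ref, cur, block, v[0], v[1], N))
        S //= 2
    return dy, dx

# A smooth synthetic frame (quadratic ramp), shifted by the true motion
# vector (5, -3), matching the example in the TSS figure caption.
yy, xx = np.mgrid[0:64, 0:64]
ref = (yy ** 2 + xx ** 2).astype(float)
cur = np.roll(np.roll(ref, -5, axis=0), 3, axis=1)
print(three_step_search(ref, cur, (24, 24)))  # (5, -3)
```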


Fig.1.9: Three Step Search procedure. The motion vector is (5, -3).

(c)New Three Step Search (NTSS)


NTSS improves on TSS results by providing a centre-biased searching scheme
and having provisions for a half-way stop to reduce computational cost. It was one of the
first widely accepted fast algorithms, and was frequently used for implementing earlier
standards like MPEG-1 and H.261. TSS uses a uniformly allocated checking pattern for
motion detection and is prone to missing small motions. The NTSS process is illustrated
graphically in fig.1.10. In the first step, 16 points are checked in addition to the search
origin for the lowest weight using a cost function. Of these additional search locations, 8
are a distance of S = 4 away (similar to TSS) and the other 8 are at S = 1 away from the
search origin. If the lowest cost is at the origin then the search is stopped right there, and
the motion vector is set to (0, 0). If the lowest weight is at any one of the 8 locations at
S = 1, then we change the origin of the search to that point and check the weights
adjacent to it.


Fig.1.10: New Three Step Search block matching.


Big circles are the checking points in the first step of TSS, and the squares are the
extra 8 points added in the first step of NTSS. Triangles and diamonds are the second step
of NTSS, showing the 3 points and 5 points being checked when the least weight in the
first step is at one of the 8 neighbours of the window centre.


Fig.1.11: Search patterns corresponding to each selected quadrant: (a) Shows all
quadrants (b) quadrant I is selected (c) quadrant II is selected (d) quadrant III is
selected (e) quadrant IV is selected.

Depending on which point it is, we end up checking 5 points or 3 points (Fig. 1.11(b)
& (c)). The location that gives the lowest weight is the closest match, and the motion vector
is set to that location. On the other hand, if the lowest weight after the first step was at one
of the 8 locations at S = 4, then we follow the normal TSS procedure. Hence, although this
process might need a minimum of 17 points to check for every macro block, it also has a
worst-case scenario of 33 locations to check.
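The 17-point first step and its halfway-stop checks can be sketched in Python/NumPy (the MAD cost function and frame layout are illustrative assumptions):

```python
import numpy as np

def mad(a, b):
    return np.mean(np.abs(a.astype(float) - b.astype(float)))

def ntss_first_step(ref, cur, top, left, size=16):
    """NTSS first step: check the search origin, the 8 points at S = 4 and
    the 8 extra points at S = 1; report the best offset and the two
    halfway-stop conditions."""
    block = cur[top:top + size, left:left + size]
    offsets = [(0, 0)]
    for s in (4, 1):
        offsets += [(dy, dx) for dy in (-s, 0, s) for dx in (-s, 0, s)
                    if (dy, dx) != (0, 0)]
    best, best_cost = (0, 0), float("inf")
    for dy, dx in offsets:                   # 17 candidate locations in all
        y, x = top + dy, left + dx
        if 0 <= y <= ref.shape[0] - size and 0 <= x <= ref.shape[1] - size:
            c = mad(block, ref[y:y + size, x:x + size])
            if c < best_cost:
                best, best_cost = (dy, dx), c
    stop_now = best == (0, 0)                             # MV is (0, 0): stop
    refine_locally = max(abs(best[0]), abs(best[1])) == 1 # winner at S = 1
    return best, stop_now, refine_locally
```

When `refine_locally` is true, only the 3 or 5 unchecked neighbours of the winner need to be examined in the second step; otherwise the normal TSS procedure continues from the S = 4 winner.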
(d) Simple and Efficient Search (SES)
SES is another extension of TSS and exploits the assumption of a unimodal error
surface. The main idea behind the algorithm is that for a unimodal surface there cannot be
two minima in opposite directions, and hence the 8-point fixed pattern search of TSS
can be changed to incorporate this and save on computations. The algorithm still has
three steps like TSS, but the innovation is that each step has a further two phases. The
search area is divided into four quadrants and the algorithm checks three locations A, B
and C as shown in Fig. 1.11. A is at the origin, and B and C are S = 4 locations away
from A in orthogonal directions. Depending on the weight distribution amongst the three,
the second phase selects a few additional points, as shown in Fig. 1.11. The rules for
determining the search quadrant for the second phase are as follows:



If MAD(A) ≥ MAD(B) and MAD(A) ≥ MAD(C), select (b);
If MAD(A) ≥ MAD(B) and MAD(A) < MAD(C), select (c);
If MAD(A) < MAD(B) and MAD(A) < MAD(C), select (d);
If MAD(A) < MAD(B) and MAD(A) ≥ MAD(C), select (e).
Once the points to check in the second phase are selected, we find the location with
the lowest weight and set it as the origin. We then change the step size as in TSS and
repeat the above SES procedure until we reach S = 1. The location with the lowest
weight is then noted down in terms of a motion vector and transmitted. An example
process is illustrated in Fig. 1.12.

Fig.1.12: The SES procedure. The motion vector is (3, 7) in this example.
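The quadrant-selection rules can be written as a short function (one plausible reading, since the comparison operators did not survive extraction cleanly; the placement of B and C relative to A is an assumption):

```python
def select_quadrant(mad_a, mad_b, mad_c):
    """Second-phase quadrant selection of SES (A is the origin; B and C lie
    S pixels away from A in orthogonal directions).  Returns the label of
    the selected search pattern from Fig. 1.11."""
    if mad_a >= mad_b and mad_a >= mad_c:
        return "b"        # quadrant I
    if mad_a >= mad_b:    # and mad_a < mad_c
        return "c"        # quadrant II
    if mad_c > mad_a:     # mad_a < mad_b and mad_a < mad_c
        return "d"        # quadrant III
    return "e"            # mad_a < mad_b and mad_a >= mad_c, quadrant IV
```

Note that the four conditions are exhaustive: whatever the three weights are, exactly one quadrant is selected, which is what allows SES to drop the remaining checking points of the TSS pattern.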

Although this algorithm saves a lot of computation compared to TSS, it was
not widely accepted, for two reasons. Firstly, in reality the error surfaces are not strictly
unimodal and hence the PSNR achieved is poor compared to TSS. Secondly, there was
another algorithm, Four Step Search, published a year earlier, which presented
lower computational cost than TSS and gave significantly better PSNR.
(e) Four Step Search (4SS)
Similar to NTSS, 4SS also employs centre-biased searching and has a halfway
stop provision. 4SS sets a fixed pattern size of S = 2 for the first step, no matter what the
search parameter p value is. Thus it looks at 9 locations in a 5x5 window. If the least
weight is found at the centre of the search window, we jump to the fourth step. If the least
weight is at one of the eight locations other than the centre, then we make it the search
origin and move to the second step. The search window is still maintained at 5x5 pixels.
Depending on where the least weight location was, we might end up checking weights at
3 locations or 5 locations. The patterns are shown in Fig. 1.13.

Fig.1.13: Search patterns of the FSS. (a) First step (b) Second/Third step
(c) Second/Third step (d) Fourth step.

Once again, if the least weight location is at the centre of the 5x5 search window
we jump to the fourth step, or else we move on to the third step. The third step is exactly
the same as the second step. In the fourth step the window size is dropped to 3x3, i.e. S = 1. The
location with the least weight is the best matching macro block, and the motion vector is
set to point to that location. A sample procedure is shown in Fig. 1.14. This search
algorithm has a best case of 17 checking points and a worst case of 27 checking points.

(f) Diamond Search (DS)

The DS algorithm is exactly the same as 4SS, but the search point pattern is changed
from a square to a diamond, and there is no limit on the number of steps that the
algorithm can take. DS uses two different types of fixed patterns: one is the Large
Diamond Search Pattern (LDSP) and the other is the Small Diamond Search Pattern
(SDSP). These two patterns and the DS procedure are illustrated in Fig. 1.14. Just like in
FSS, the first step uses LDSP, and if the least weight is at the centre location we jump to
the last step. The subsequent steps, except the last one, are also similar and use LDSP, but
the number of points where the cost function is checked is either 3 or 5, as illustrated in
the second and third steps of the procedure shown in Fig. 1.14.

Fig. 1.14: Diamond Search procedure.


This figure shows the large diamond search pattern and the small diamond search
pattern. It also shows an example path to the motion vector (-4, -2) in five search steps:
four applications of LDSP and one of SDSP.
The last step uses SDSP around the new search origin, and the location with the
least weight is the best match. As the search pattern is neither too small nor too big, and
given that there is no limit to the number of steps, this algorithm can find the global
minimum very accurately. The end result should see a PSNR close to that of ES while the
computational expense should be significantly less.
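The two diamond patterns and the stopping rule can be sketched as follows (a Python/NumPy sketch; the MAD cost function is an illustrative assumption):

```python
import numpy as np

LDSP = [(0, 0), (-2, 0), (-1, -1), (-1, 1), (0, -2),
        (0, 2), (1, -1), (1, 1), (2, 0)]           # large diamond, 9 points
SDSP = [(0, 0), (-1, 0), (0, -1), (0, 1), (1, 0)]  # small diamond, 5 points

def mad(a, b):
    return np.mean(np.abs(a.astype(float) - b.astype(float)))

def diamond_search(ref, cur, top, left, size=16):
    """DS sketch: repeat LDSP until its minimum falls on the centre, then
    refine once with SDSP."""
    block = cur[top:top + size, left:left + size]

    def cost(y, x):
        if 0 <= y <= ref.shape[0] - size and 0 <= x <= ref.shape[1] - size:
            return mad(block, ref[y:y + size, x:x + size])
        return float("inf")                  # candidate outside the frame

    cy, cx = top, left
    while True:
        dy, dx = min(LDSP, key=lambda o: cost(cy + o[0], cx + o[1]))
        if (dy, dx) == (0, 0):               # minimum at the centre: done
            break
        cy, cx = cy + dy, cx + dx
    dy, dx = min(SDSP, key=lambda o: cost(cy + o[0], cx + o[1]))
    return cy + dy - top, cx + dx - left
```

The loop terminates because the centre cost strictly decreases on every move; ties are resolved in favour of the centre, which triggers the switch to SDSP.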
(g) Adaptive Rood Pattern Search (ARPS)
The ARPS algorithm makes use of the fact that the general motion in a frame is
usually coherent, i.e. if the macro blocks around the current macro block moved in a
particular direction, then there is a high probability that the current macro block will also
have a similar motion vector. This algorithm uses the motion vector of the macro block to
its immediate left to predict its own motion vector. An example is shown in Fig. 1.15.

Fig.1.15: Adaptive Rood Pattern: the predicted motion vector is (3, -2), and the step
size S = Max (|3|, |-2|) = 3.



The predicted motion vector points to (3, -2). In addition to checking the location
pointed to by the predicted motion vector, the algorithm also checks points distributed in
a rood pattern, as shown in Fig. 1.15, at a step size of S = Max (|X|, |Y|), where X and Y
are the x-coordinate and y-coordinate of the predicted motion vector. This rood pattern
search is always the first step. It directly puts the search in an area where there is a high
probability of finding a good matching block.
The point that has the least weight becomes the origin for subsequent search steps,
and the search pattern is changed to SDSP. The procedure keeps applying SDSP until the
least weighted point is found at the centre of the SDSP. A further small
improvement to the algorithm is to check for Zero Motion Prejudgment, by which
the search is stopped halfway if the least weighted point is already at the centre of the
rood pattern.
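A sketch of the two stages, the rood pattern followed by SDSP refinement, assuming the MAD cost; the fallback step of 2 when the predicted vector is (0, 0) is our assumption, following the text's rule for first-column macro blocks:

```python
import numpy as np

SDSP = [(0, 0), (-1, 0), (0, -1), (0, 1), (1, 0)]  # small diamond pattern

def mad(a, b):
    return np.mean(np.abs(a.astype(float) - b.astype(float)))

def arps(ref, cur, top, left, predicted, size=16):
    """ARPS sketch: a rood pattern sized by the predicted motion vector,
    followed by SDSP refinement until the minimum sits at the centre.
    `predicted` is the (dy, dx) motion vector of the left neighbour."""
    block = cur[top:top + size, left:left + size]

    def cost(y, x):
        if 0 <= y <= ref.shape[0] - size and 0 <= x <= ref.shape[1] - size:
            return mad(block, ref[y:y + size, x:x + size])
        return float("inf")

    s = max(abs(predicted[0]), abs(predicted[1])) or 2  # S = Max(|X|, |Y|)
    # rood arms plus the predicted location; a set avoids double computation
    rood = {(0, 0), (-s, 0), (s, 0), (0, -s), (0, s), tuple(predicted)}
    cy, cx = min(((top + dy, left + dx) for dy, dx in rood),
                 key=lambda p: cost(*p))
    while True:                                # SDSP until centred minimum
        dy, dx = min(SDSP, key=lambda o: cost(cy + o[0], cx + o[1]))
        if (dy, dx) == (0, 0):
            break
        cy, cx = cy + dy, cx + dx
    return cy - top, cx - left
```

Building the rood as a set also implements the remark below about avoiding a repeated computation when the predicted vector happens to coincide with one of the rood arms.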

Fig.1.16: Search points per macro block while computing the PSNR performance of
fast block matching algorithms.
The main advantage of this algorithm over DS is that if the predicted motion vector is
(0, 0), it does not waste computational time doing LDSP; rather, it directly starts using

SDSP. Furthermore, if the predicted motion vector is far away from the centre, then again
ARPS saves on computations by directly jumping to that vicinity and using SDSP,
whereas DS takes its time doing LDSP.
Care has to be taken not to repeat the computations at points that were checked
earlier. Care also needs to be taken when the predicted motion vector turns out to match
one of the rood pattern locations, so as to avoid double computation at that point. For
macro blocks in the first column of the frame, the rood pattern step size is fixed at 2 pixels.

Fig.1.17: PSNR performance of Fast Block Matching Algorithms.

1.11 THESIS OUTLINE

Chapter 1 dealt with the general concept of 3D reconstruction and with the block
matching methods used on the way from 2D images to a 3D model. The methods covered
are Exhaustive Search (ES), Three Step Search (TSS), New Three Step Search (NTSS),
Simple and Efficient Search (SES), Four Step Search (4SS), Diamond Search (DS) and
Adaptive Rood Pattern Search (ARPS).



Chapter 2 deals with the stereo vision algorithms. Chapter 3 deals with the
implementation of algorithms. Chapter 4 deals with the simulation results. Chapter 5
describes the conclusion.


CHAPTER 2
STEREO VISION ALGORITHMS
2.1 INTRODUCTION TO STEREO VISION
The stereo correspondence problem has historically been, and continues to be, one of
the most investigated topics in computer vision, and a large body of literature on it
has been published. The correspondence problem in computer vision concerns the
matching of points, or other kinds of primitives, in two or more images such that the
matched elements are the projections of the same physical element in the 3D scene; the
resulting displacement of a projected point in one image with respect to the other is
termed the disparity. Similarity is the guiding principle for solving the correspondence
problem; however, the stereo correspondence problem is an ill-posed task, and in order
to make it tractable it is usually necessary to exploit some additional information or
constraints.
The most popular constraint is the epipolar constraint, which can reduce the
search to one dimension rather than two. Other constraints commonly used are the
disparity uniqueness constraint and the continuity constraint.
The origin of the word stereo is the Greek word stereos, which means firm or
solid; with stereo vision, objects are seen solid, in three dimensions, with range. In
stereo vision, the same scene is captured using two sensors from two different angles. The
two captured images have many similarities and a smaller number of differences. In
human vision, the brain combines the two captured images by matching the
similarities and integrating the differences to obtain a three-dimensional model of the
viewed objects.
In machine vision, the three-dimensional model of the captured objects is obtained
by finding the similarities between the stereo images and using projective geometry to
process these matches. The main difficulty of reconstruction using stereo is finding
matching correspondences between the stereo pair.



The latest trends in the field mainly pursue real-time execution speeds as well as
decent accuracy. As indicated by this survey, the algorithms' theoretical matching cores
are quite well established, leading researchers towards innovations resulting in more
efficient hardware implementations.
Detecting conjugate pairs in stereo images is a challenging research problem
known as the correspondence problem, i.e., to find for each point in the left image the
corresponding point in the right one. To determine these two points from a conjugate
pair, it is necessary to measure the similarity of the points. A point to be matched
without any ambiguity should be distinctly different from its surrounding pixels. Several
algorithms have been proposed in order to address this problem, and every
algorithm makes use of a matching cost function so as to establish correspondence
between two pixels.
The most common ones are the absolute intensity differences (AD), the squared
intensity differences (SD) and the normalized cross correlation (NCC); evaluations of
various matching costs can be found in the literature. Usually, the matching costs are
aggregated over support regions. Those support regions, often referred to as support or
aggregating windows, can be square or rectangular, fixed-sized or adaptive. The
aggregation of the aforementioned cost functions leads to the core of most stereo vision
methods,
which can be mathematically expressed as follows. For the case of the sum of
absolute differences (SAD):

SAD(x, y, d) = Σ_(x,y)∈W |I_l(x, y) - I_r(x, y - d)|                      (1)

For the case of the sum of squared differences (SSD):

SSD(x, y, d) = Σ_(x,y)∈W (I_l(x, y) - I_r(x, y - d))^2                    (2)

And for the case of the NCC:

NCC(x, y, d) = Σ_(x,y)∈W I_l(x, y)·I_r(x, y - d) / sqrt(Σ_(x,y)∈W I_l^2(x, y) · Σ_(x,y)∈W I_r^2(x, y - d))   (3)
where I_l and I_r are the intensity values in the left and right image, (x, y) are the pixel
coordinates, d is the disparity value under consideration and W is the aggregated support


region. The selection of the appropriate disparity value for each pixel is performed
afterwards. The simpler algorithms make use of the winner-takes-all (WTA) method of
disparity selection:

D(x, y) = arg min_d SAD(x, y, d)                                          (4)

i.e., for every pixel (x, y), the disparity d with the minimum aggregated cost is
selected. Equation (4) refers to the SAD method, but any other cost could be used
instead. However, in many cases disparity selection is an iterative process, since each
pixel's disparity depends on its neighbouring pixels' disparities. As a result, more than
one iteration is needed in order to find the best set of disparities. This stage differentiates
the local from the global algorithms, which will be analyzed below. An additional
disparity refinement step is frequently used.
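Equations (1) and (4) combine into a short sketch (Python/NumPy; the square box-window aggregation and the image layout are illustrative assumptions):

```python
import numpy as np

def wta_disparity(left, right, max_disp, win=4):
    """Aggregated-SAD matching with winner-takes-all selection, a direct
    sketch of Equations (1) and (4): cost(d) sums |Il(x, y) - Ir(x, y - d)|
    over a square support window W, and D = argmin_d cost(d)."""
    h, w = left.shape
    costs = np.full((max_disp + 1, h, w), np.inf)
    for d in range(max_disp + 1):
        diff = np.abs(left[:, d:].astype(float) - right[:, :w - d].astype(float))
        pad = np.pad(diff, win, mode="edge")
        agg = np.zeros_like(diff)
        for dy in range(2 * win + 1):        # box-filter aggregation over W
            for dx in range(2 * win + 1):
                agg += pad[dy:dy + h, dx:dx + diff.shape[1]]
        costs[d, :, d:] = agg
    return np.argmin(costs, axis=0)          # WTA disparity selection
```

On a synthetic pair in which the left image is the right image shifted by a constant number of columns, the sketch recovers that constant as the disparity of every interior pixel.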

2.2 GOAL OF STEREO VISION

The goal of stereo vision is the recovery of the 3D structure of a scene using two or
more images of the 3D scene, each acquired from a different viewpoint in space. The
images can be obtained using multiple cameras or one moving camera. The term
binocular vision is used when two cameras are employed.

Fig2.1: General setup of cameras




2.2.1 Stereo setup and terminology
Fixation point: the point of intersection of the optical axes.
Baseline: the distance between the centres of projection.
Epipolar plane: the plane passing through the centres of projection and the point in the
scene.
Epipolar line: the intersection of the epipolar plane with the image plane.
Conjugate pair: any point in the scene that is visible in both cameras is projected to
a pair of image points in the two images.
Disparity: the distance between corresponding points when the two images are
superimposed.
Disparity map: the disparities of all points form the disparity map (it can be displayed as
an image).

Fig2.2: Internal projection of camera in stereo vision


Figure2.3: Two cameras in arbitrary position and orientation


2.2.2 Triangulation - the principle underlying stereo vision
The 3D location of any visible object point in space is restricted to the straight line
that passes through the centre of projection and the projection of the object point.
Binocular stereo vision determines the position of a point in space by finding the
intersection of the two lines passing through the centre of projection and the projection of
the point in each image.
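For the special case of a rectified pair with parallel optical axes, baseline B and focal length f, this intersection reduces to the well-known relation Z = f·B/d, where d is the disparity. A minimal sketch (the parameter names are ours):

```python
def depth_from_disparity(focal_px, baseline, disparity_px):
    """Depth of a scene point from its disparity for a rectified stereo pair
    with parallel optical axes: Z = f * B / d.  `focal_px` is the focal
    length in pixels and `baseline` the camera separation (e.g. in metres)."""
    if disparity_px <= 0:
        raise ValueError("a zero or negative disparity gives no finite depth")
    return focal_px * baseline / disparity_px
```

For example, a point imaged with a disparity of 100 pixels by cameras with an 800-pixel focal length and a 0.25 m baseline lies 2 m away; the inverse relation between disparity and depth is why nearby objects show large disparities.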

Fig2.4: Positions of binocular




2.2.3 The problems of stereo

The correspondence problem.

The reconstruction problem.

2.2.4 The correspondence problem

Finding pairs of matched points such that each point in the pair is the projection

of the same 3D point.

Triangulation depends crucially on the solution of the correspondence problem.

Ambiguous correspondence between points in the two images may lead to several

different consistent interpretations of the scene.

Fig2.5: Correspondence problem in stereo vision


2.2.5 The reconstruction problem
Given the corresponding points, we can compute the disparity map.
The disparity map can be converted to a 3D map of the scene (i.e., the 3D
structure can be recovered) if the stereo geometry is known.


Fig2.6: Reconstruction problem in stereo vision

2.3 STEREO CORRESPONDENCE


The existing techniques for general two-view stereo correspondence roughly fall
into two categories: local methods and global methods. Local methods use only the small
neighborhoods surrounding the pixels, while global methods optimize some global
(energy) function. Local methods, such as block matching, gradient-based optimization,


and feature matching, can be very efficient, but they are sensitive to locally ambiguous
regions in images (e.g., occlusion regions or regions with uniform texture).
Global methods, such as dynamic programming, intrinsic curves, graph cuts, and
belief propagation, can be less sensitive to these problems since global constraints
provide additional support for regions that are difficult to match locally. However, these
methods are computationally more expensive.
Stereo correspondence algorithms can be grouped into those producing sparse
output and those giving a dense result. Feature-based methods stem from human vision
studies and are based on matching segments or edges between two images, thus resulting
in a sparse output. This disadvantage, serious for many purposes, is counterbalanced by
the accuracy and speed obtained. However, contemporary applications demand more and
more dense output.
In order to categorize and evaluate them, a common context has been proposed.
According to this, dense matching algorithms are classified into local and global ones.
Local methods trade accuracy for speed. They are also referred to as window-based
methods because disparity computation at a given point depends only on intensity values
within a finite support window. Global methods (energy-based), on the other hand, are
time consuming but very accurate.
Their goal is to minimize a global cost function, which combines data and
smoothness terms, taking into account the whole image. Of course, there are many other
methods that are not strictly included in either of these two broad classes. The issue of
stereo matching has recruited a variety of computational tools. Advanced computational
intelligence techniques are not uncommon and present interesting and promising
results.
While the aforementioned categorization involves stereo matching algorithms in
general, in practice it is valuable for software implemented algorithms only. Software
implementations make use of general purpose personal computers (PC) and usually result
in considerably long running times. However, this is not an option when the objective is


the development of autonomous robotic platforms, simultaneous localization and
mapping (SLAM) or virtual reality (VR) systems.
Such tasks require real-time, efficient performance and demand dedicated
hardware, and consequently specially developed and optimized algorithms. Only a small
subset of the already proposed algorithms is suitable for hardware implementation.
Hardware implemented algorithms are characterized by their theoretical algorithm as
well as by the implementation itself. There are two broad classes of hardware
implementations: the field-programmable gate array (FPGA) and the application-specific
integrated circuit (ASIC) based ones. Figure 2 depicts an ASIC chip (a) and an FPGA
development board (b). Each one can execute stereo vision algorithms without the
necessity of a PC, saving volume, weight and consumed energy. However, the evolution
of FPGAs has made them an appealing choice due to their small prototyping times, their
flexibility and their good performance.

2.4 STEREO MATCHING ALGORITHMS


The issue of stereo correspondence is of great importance in the fields of machine
vision, computer vision, depth measurement and environment reconstruction, as well as
in many other aspects of production, security, defense, exploration, and entertainment.
Calculating the distance of various points, or of any other primitive, in a scene relative to
the position of a camera is one of the important tasks of a computer vision system.
The most common method for extracting depth information from intensity images
is by means of a pair of synchronized camera signals acquired by a stereo rig. The
point-by-point matching between the two images from the stereo setup derives the depth
images, or the so-called disparity maps. This matching can be done as a one-dimensional
search if accurately rectified stereo pairs, in which horizontal scan lines reside on the
same epipolar line, are assumed, as shown in Figure 2.7. A point P1 in one image plane
may have arisen from any of the points on the line C1P1, and may appear in the alternate
image plane at any point on the so-called epipolar line.



Thus, the search is theoretically reduced to within a scan line, since corresponding
pairs of points reside on the same epipolar line. The difference in the horizontal
coordinates of these points is the disparity. The disparity map consists of all the disparity
values of the image. Having extracted the disparity map, problems such as 3D
reconstruction, positioning, mobile robot navigation, obstacle avoidance, etc., can be
dealt with in a more efficient way.

Fig 2.7: Geometry of epipolar lines, where C1 and C2 are the left and right camera
lens centers, respectively. Point P1 in one image plane may have arisen from any of
points in the line C1P1, and may appear in the alternate image plane at any point on
the epipolar line E2.
As numerous methods have been proposed, this section aspires to review the
most recent ones. Most of the results presented in the rest of this chapter are based on
the image sets and tests provided there.
The most common image sets are presented in Figure 2.8. Table 2.1 summarizes
their sizes as well as the number of disparity levels. Experimental results based on these
image sets are given where available. The preferred metric adopted in this chapter, in
order to depict the quality of the resulting disparity maps, is the percentage of pixels
whose absolute disparity error is greater than 1 in the unoccluded areas of the image.
This metric, considered the most representative of the result quality, was used so as to



make comparison easier. Other metrics, like the error rate and the root mean square error,
are also employed.

Fig.2.8:Left image of the stereo pair (left) and ground truth (right) for the Tsukuba
(a), Sawtooth (b), Map (c), Venus (d), Cones (e) and Teddy (f) stereo pair.
The speed with which the algorithms process input image pairs is expressed in
frames per second (fps). This metric depends of course on the computational platform
used and on the kind of implementation. Inevitably, speed results are not directly
comparable.
                   Tsukuba   Map       Sawtooth  Venus     Cones     Teddy
Size in pixels     384x288   284x216   434x380   434x383   450x375   450x375
Disparity levels   16        30        20        20        60        60

Table 2.1: Characteristics of the most common image sets


2.5 DENSE DISPARITY ALGORITHMS

Methods that produce dense disparity maps gain popularity as the available
computational power grows. Moreover, contemporary applications benefit from, and
consequently demand, dense depth information. Therefore, during the latest years efforts
towards this direction have been reported much more frequently than towards the
direction of sparse results.
Dense disparity stereo matching algorithms can be divided into two general classes,
according to the way they assign disparities to pixels. Firstly, there are algorithms that
decide the disparity of each pixel according to the information provided by its local,
neighboring pixels. There are, however, other algorithms which assign disparity values to
each pixel depending on information derived from the whole image. Consequently, the
former are called local methods and the latter global ones.
2.5.1 Local methods
Local methods are usually fast and can at the same time produce decent results.
Several new methods have been presented. In Figure 2.9 a Venn diagram presents the
main characteristics of the local methods discussed below. Under the term color usage we
have grouped the methods that take advantage of the chromatic information of the image
pair. Any algorithm can process color images, but not every one can use that information
in a beneficial way. Furthermore, in Figure 2.9 NCC stands for the use of normalized
cross correlation and SAD for the use of the sum of absolute differences as the matching
cost function.
As expected, the use of SAD as the matching cost is far more widespread than any
other. One method uses the sum of absolute differences (SAD) correlation measure for
RGB color images. It achieves high speed and reasonable quality. It makes use of the
left-to-right consistency and uniqueness constraints and applies a fast median filter to the
results.
It can achieve 20 fps for a 160x120 pixel image size, making this method suitable
for real-time applications. The PC platform is Linux on a dual-processor 800MHz
Pentium III system with 512 MB of RAM.

Fig.2.9: Diagrammatic representation of the local methods categorization.


Another fast area-based stereo matching algorithm, which uses SAD as the error
function, is based on the uniqueness constraint: it rejects previous matches as
soon as better ones are detected. In contrast to bidirectional matching algorithms, this one
performs only one matching phase, while having similar results. The results obtained
are tested for reliability and are sub-pixel refined. It produces dense disparity maps in
real-time using an Intel Pentium III processor running at 800MHz. The algorithm
achieves a speed of 39.59 fps for 320x240 pixels and 16 disparity levels, and the root
mean square error for the standard Tsukuba pair is 5.77%.
Another method's object is to achieve minimum segmentation. Its experimental
results indicate 1.77%, 0.61%, 3.00%, and 7.63% error percentages. The execution speed
of the algorithm varies from 1 to 0.2 fps on a 2.4GHz processor.
Another method that presents almost real-time performance makes use of a
refined implementation of the SAD method and a left-right consistency check. The errors
in the problematic regions are reduced using different-sized correlation windows. Finally,
a median filter is used in order to interpolate the results. The algorithm is able to process



7 fps for 320x240 pixel images and 32 disparity levels. These results are obtained using
an Intel Pentium 4 processor at 2.66GHz.
A window-based method for correspondence search that uses varying
support-weights is also presented. The support-weights of the pixels in a given support
window are adjusted based on color similarity and geometric proximity to reduce image
ambiguity. The difference between pixel colors is measured in the CIE Lab color space,
because the distance of two points in this space is analogous to the stimulus perceived by
the human eye. The running time for an image pair with a 35x35 pixel support window
is about 0.016 fps on an AMD 2700 processor. The error ratio is 1.29%, 0.97%, 0.99%,
and 1.13% for the Tsukuba, Sawtooth, Venus, and Map image sets, respectively. These
figures can be further improved through a left-right consistency check.
For given input images, specular-free two-band images can be generated. The
similarity between pixels of these input-image representations can be measured using
various correspondence search methods, such as the simple SAD-based method, the
adaptive support-weights method and the dynamic programming (DP) method. This
pre-processing step can be performed in real time and compensates satisfactorily for
specular reflections.
On the other hand, another method uses the zero mean normalized cross correlation
(ZNCC) as matching cost. This method integrates a neural network (NN) model, which
uses the least-mean-square delta rule for training. The NN decides on the proper window
shape and size for each support region. The results obtained are satisfactory, but the
0.024 fps running speed reported for the common image sets, on a Windows platform
with a 300MHz processor, renders this method unsuitable for real-time applications.
Based on the same matching cost function, a more complex area-based method is
proposed in a perceptual organization framework, in which both binocular and
monocular cues are utilized. An initial matching is performed by a combination of
normalized cross correlation techniques. The correct matches are selected for each pixel
using tensor voting. Matches are then grouped into smooth surfaces. Disparities for the
unmatched pixels are assigned so as to ensure smoothness in terms of both surface


orientation and color. The percentage of unoccluded pixels whose absolute disparity
error is greater than 1 is 3.79, 1.23, 9.76, and 4.38 for the standard image sets. The
execution speed reported is about 0.002 fps for an image pair with 20 disparity levels
running on an Intel Pentium 4 processor at 2.8GHz.
There are, of course, more hardware-oriented proposals as well. Many of them
take advantage of contemporary powerful graphics hardware to achieve enhanced
results in terms of processing time and data volume. A hierarchical disparity estimation
algorithm implemented on a programmable 3D graphics processing unit (GPU) has been
reported; this method can process either rectified or uncalibrated image pairs.
Bidirectional matching is utilized in conjunction with a locally aggregated sum of
absolute intensity differences.
Moreover, an architecture based on Cellular Automata (CA) has been presented for
real-time extraction of disparity maps. It is capable of processing 1 Mpixel image pairs at
more than 40 fps. The core of the algorithm relies on matching the pixels of each scan-line
using a one-dimensional window and the SAD matching cost; the method also involves a
pre-processing mean filtering step and a post-processing CA-based filtering step.
CA are models of physical systems in which space and time are discrete and
interactions are local. They can easily handle complicated boundary and initial
conditions. In CA analysis, physical processes and systems are described by a cell array
and a local rule, which defines the new state of a cell depending on the states of its
neighbors. All cells can work in parallel, since each cell can independently update its
own state. Therefore the proposed CA algorithm is massively parallel and is an ideal
candidate to be implemented in hardware.
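A minimal illustration of such a local-rule update (a generic 1D binary automaton with a wrap-around neighbourhood, not the disparity-filtering rule of the method above) can be sketched as:

```python
def ca_step(cells, rule):
    """One synchronous update of a 1D binary cellular automaton:
    every cell's new state depends only on its local 3-cell
    neighbourhood, so all cells could update in parallel."""
    n = len(cells)
    return [rule[(cells[(i - 1) % n], cells[i], cells[(i + 1) % n])]
            for i in range(n)]

# Rule 90 (XOR of the two neighbours) as a lookup table keyed by
# the (left, centre, right) neighbourhood.
rule90 = {(l, c, r): l ^ r for l in (0, 1) for c in (0, 1) for r in (0, 1)}
```

Because the rule is a pure local lookup, a hardware realization can evaluate every cell in the same clock cycle, which is the property the text refers to.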
2.5.2 Global methods
Contrary to local methods, global ones produce very accurate results. Their goal is
to find the optimum disparity function d = d (x, y) which minimizes a global cost
function E, which combines data and smoothness terms.
E(d) = E_data(d) + k · E_smooth(d)    (5)

where E_data takes into consideration the (x, y) pixel values throughout the image,
E_smooth provides the algorithm's smoothing assumptions, and k is a weight factor.
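For illustration, the cost in Equation (5) can be evaluated for a candidate integer disparity map. The specific term choices below (an absolute-difference data term and a first-order smoothness term) are assumptions for the sketch, not the unique definition used by the surveyed methods:

```python
import numpy as np

def global_energy(disp, left, right, k=0.5):
    """E(d) = E_data(d) + k * E_smooth(d): absolute-difference data
    term plus a penalty on disparity jumps between neighbours.
    `disp` must hold integer disparities."""
    h, w = disp.shape
    e_data = 0.0
    for y in range(h):
        for x in range(w):
            xr = x - disp[y, x]            # matching right-image column
            if 0 <= xr < w:
                e_data += abs(float(left[y, x]) - float(right[y, xr]))
    e_smooth = (np.abs(np.diff(disp, axis=0)).sum()
                + np.abs(np.diff(disp, axis=1)).sum())
    return e_data + k * e_smooth
```

A global method searches for the disparity map that minimizes this quantity, trading photometric fit (E_data) against surface smoothness (E_smooth) through the weight k.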

The main disadvantage of the global methods is that they are more time-consuming
and computationally demanding. The source of these characteristics is the iterative
refinement approaches that they employ. They can be roughly divided into those
performing a global energy minimization and those pursuing the minimum for
independent scan-lines using DP.

In Figure 2.10 the main characteristics of the global algorithms discussed below
are presented. It is clear that recently published works prefer global optimization over
DP. This observation is not surprising, considering that the term global optimization
actually covers quite a few different methods. Additionally, DP tends to produce
inferior, and thus less impressive, results. Therefore, applications that do not have
running-speed constraints preferably utilize global optimization methods.

2.6 REVIEW OF STEREO VISION ALGORITHMS

Fig.2.10: Diagrammatic representation of the global methods categorization


2.6.1. Global optimization
The algorithms that perform global optimization take into consideration the whole
image in order to determine the disparity of every single pixel. An increasing portion of
the global optimization methodologies involves segmentation of the input images
according to their colors.
One algorithm presented uses color segmentation. Each segment is described by a
planar model and assigned to a layer using a mean shift based clustering algorithm. A
global cost function is used that takes into account the summed absolute differences,
the discontinuities between segments and the occlusions. The assignment of segments to
layers is iteratively updated until the cost function no longer improves. The experimental
results indicate that the percentage of unoccluded pixels whose absolute disparity error
is greater than 1 is 1.53, 0.16, and 0.22 for the image sets, respectively.
The stereo matching algorithm proposed in another work makes use of color
segmentation in conjunction with the graph cuts method. The reference image is divided
into non-overlapping segments using the mean shift color segmentation algorithm. Thus,
a set of planes in the disparity space is generated. The goal of minimizing an energy
function is faced in the segment rather than the pixel domain. A disparity plane is fitted
to each segment using the graph cuts method. This algorithm presents good performance
in the textureless and occluded regions as well as at disparity discontinuities. The running
speed reported is 0.33 fps for a 384×288 pixel image pair when tested on a 2.4GHz
Pentium 4 PC. The percentage of bad matched pixels for the four image sets (the last
being Map) is found to be 1.23, 0.30, 0.08, and 1.49, respectively.
The ultimate goal of the work described is to render dynamic scenes with
interactive viewpoint control produced by a few cameras. A suitable color
segmentation-based algorithm is developed and implemented on a programmable ATI
9800 PRO GPU. Disparities within segments must vary smoothly, each image is treated
equally, occlusions are modelled explicitly, and consistency between disparity maps is
enforced, resulting in higher quality depth maps. The results for each pixel are refined in
conjunction with the others.
In another method that uses the concept of image color segmentation, an initial
disparity map is calculated using an adapting window technique. The segments are
combined into larger layers iteratively. The assignment of segments to layers is
optimized using a global cost function. The quality of the disparity map is measured by
warping the reference image to the second view, comparing it with the real image, and
calculating the color dissimilarity.
For the 384×288 pixel and the 434×383 pixel Venus test sets, the algorithm
produces results at a 0.05 fps rate. For the 450×375 pixel Teddy image pair, the running
speed decreases to 0.01 fps due to the increased scene complexity. Running speeds refer
to an Intel Pentium 4 2.0GHz processor. The root mean square error obtained is 0.73 for
the first set, 0.31 for the Venus set, and 1.07 for the Teddy image pair.
Moreover, Sun and his colleagues presented a method which treats the two
images of a stereo pair symmetrically within an energy minimization framework that can
also embody color segmentation as a soft constraint. This method enforces that the
occlusions in the reference image are consistent with the disparities found for the other
image. Belief propagation iteratively refines the results. Moreover, results for the version
of the algorithm that incorporates segmentation are better.
The percentage of pixels with disparity error larger than 1 is 0.97, 0.19, 0.16, and
0.16 for the four image sets (the last being Map), respectively. The running speed for the
aforementioned data sets is about 0.02 fps tested on a 2.8GHz Pentium 4 processor.
Color segmentation is utilized in yet another work. The matching cost used here is
a self-adapting dissimilarity measure that takes into account the sum of absolute intensity
differences as well as a gradient based measure. Disparity planes are extracted using a
technique insensitive to outliers. Disparity plane labelling is performed using belief
propagation. Execution speed varies between 0.07 and 0.04 fps on a 2.21GHz AMD
Athlon 64 processor. The results indicate 1.13, 0.10, 4.22, and 2.48 percent of bad
matched pixels in non-occluded areas for the image sets, respectively.

Finally, one more algorithm that utilizes energy minimization, color
segmentation, plane fitting and repeated application of hierarchical belief propagation
has been presented. This algorithm takes into account a color-weighted correlation
measure. Discontinuities and occlusions are properly handled. The percentage of pixels
with disparity error larger than 1 is 0.88, 0.14, 3.55, and 2.90 for the Tsukuba, Venus,
Teddy, and Cones image sets, respectively.
In another work, two new symmetric cost functions for global stereo methods are
proposed: a symmetric data cost function for the likelihood, as well as a symmetric
discontinuity cost function for the prior in the MRF model for stereo. Both the reference
image and the target image are taken into account to improve performance, without
modelling half-occluded pixels explicitly and without using color segmentation. The use
of both proposed symmetric cost functions in conjunction with a belief propagation based
stereo method is evaluated.
Experimental results for standard test bed images show that the performance of
the belief propagation based stereo method is greatly improved by the combined use of
the proposed symmetric cost functions. The percentage of pixels badly matched for the
non-occluded areas was found 1.07, 0.69, 0.64, and 1.06 for the image sets, respectively.
The incorporation of Markov random fields (MRF) as a computational tool is also a
popular approach.
A method based on Bayesian estimation theory with a prior MRF model for
the assigned disparities has also been described; the continuity, coherence and occlusion
constraints, as well as the adjacency principle, are taken into account. The optimal
estimator is computed using a Gauss-Markov random field model for the corresponding
posterior marginal, which results in a diffusion process in the probability space. The
results are accurate but the algorithm is not suitable for real-time applications, since it
needs a few minutes to process a 256×255 stereo pair with up to 32 disparity levels, on
an Intel Pentium III running at 450MHz.
On the other hand, another approach treats every pixel of the input images as
generated either by a process responsible for the pixels visible from the reference
camera, which obey the constant brightness assumption, or by an outlier process,
responsible for the pixels that cannot be corresponded. Depth and visibility are jointly
modelled as a hidden MRF, and the spatial correlations of both are explicitly accounted
for by defining a suitable Gibbs prior distribution. An expectation maximization (EM)
algorithm keeps track of which points of the scene are visible in which images, and
accounts for visibility configurations. The percentages of pixels with disparity error
larger than 1 are 2.57, 1.72, 6.86 and 4.64 for the image sets, respectively.
Moreover, a stereo method specifically designed for image-based rendering has
been described. This algorithm uses over-segmentation of the input images and computes
matching values over entire segments rather than single pixels. Color-based segmentation
preserves object boundaries. The depths of the segments for each image are computed
using loopy belief propagation within an MRF framework. Occlusions are also
considered. The percentage of bad matched pixels in the unoccluded regions is 1.69,
0.50, 6.74, and 3.19 for the Tsukuba, Venus, Teddy, and Cones image sets, respectively.
The aforementioned results refer to a 2.8GHz PC platform.
An algorithm based on a hierarchical calculation of a mutual information based
matching cost has also been proposed. Its goal is to minimize a proper global energy
function, not by iterative refinements but by aggregating matching costs for each pixel
from all directions. The final disparity map is sub-pixel accurate and occlusions are
detected. The processing speed for the image set is 0.77 fps. The error in unoccluded
regions is found to be less than 3% for all the standard image sets. Calculations are made
on an Intel Xeon processor running at 2.8GHz.
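The mutual information between two images can be estimated from their joint grey-level histogram. The sketch below is a generic estimator (the bin count and binning scheme are illustrative assumptions), not the hierarchical scheme of the method above:

```python
import numpy as np

def mutual_information(img1, img2, bins=8):
    """Mutual information I(L;R) from the joint grey-level histogram:
    high when one image's intensities predict the other's well."""
    joint, _, _ = np.histogram2d(img1.ravel(), img2.ravel(), bins=bins)
    p = joint / joint.sum()                 # joint probability table
    px = p.sum(axis=1, keepdims=True)       # marginal of img1
    py = p.sum(axis=0, keepdims=True)       # marginal of img2
    nz = p > 0                              # avoid log(0)
    return float((p[nz] * np.log(p[nz] / (px @ py)[nz])).sum())
```

Because it depends only on the statistical relation between the two intensity distributions, mutual information tolerates radiometric differences between the cameras, which is why it is attractive as a stereo matching cost.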
Mutual information is once again used as a cost function in a further work. The
extensions applied there result in intensity-consistent disparity selection for untextured
areas and discontinuity-preserving interpolation for filling holes in the disparity maps. It
successfully treats complex shapes and uses planar models for untextured areas.
Bidirectional consistency checking, sub-pixel estimation, as well as invalid-disparity
interpolation are performed.
The experimental results indicate that the percentages of bad matching pixels in
unoccluded regions are 2.61, 0.25, 5.14, and 2.77 for the Tsukuba, Venus, Teddy and
Cones image sets, respectively, with 64 disparity levels searched each time. However, the
reported running speed on a 2.8GHz PC is less than 1 fps.
The dense disparity estimation is accomplished by a region dividing technique
that uses a Canny edge detector and a simple SAD function. The results are refined by
regularizing the vector fields by means of minimizing an energy function. The root mean
square error obtained from this method is 0.9278 and 0.9094 for the image pairs. The
running speed is 0.15 fps and 0.105 fps, respectively, on a Pentium 4 PC running
Windows XP.
An uncommon measure is used in another work, which describes an algorithm
focused on achieving contrast-invariant stereo matching. It relies on multiple spatial
frequency channels for local matching. The measure for this stage is the deviation of the
phase difference from zero. The global solution is found by a fast non-iterative left-right
diffusion process. Occlusions are found by enforcing the uniqueness constraint. The
algorithm is able to handle significant changes in contrast between the two images and
can handle noise in one of the frequency channels.
Another algorithm that generates high quality results in real time has also been
reported. It is based on the minimization of a global energy function comprising a data
and a smoothness term. Hierarchical belief propagation iteratively optimizes the
smoothness term, achieving fast convergence by removing the redundant computations
involved. In order to accomplish real-time operation, the authors take advantage of the
parallelism of graphics hardware (GPU). Experimental results indicate a 16 fps
processing speed for 320×240 pixel self-recorded images with 16 disparity levels. The
percentages of bad matching pixels in unoccluded regions for the image sets are found to
be 1.49, 0.77, 8.72, and 4.61. The computer used is a 3GHz PC and the GPU is an
NVIDIA 7900 GTX graphics card with 512MB of video memory.
Another work indicates that the computational cost of the graph cuts stereo
correspondence technique can be efficiently decreased by using the results of a simple
local stereo algorithm to limit the disparity search range. The idea is to analyze and
exploit the failures of local correspondence algorithms. This method can accelerate the
processing by a factor of 2.8, compared to the sole use of graph cuts, while the resulting
energy is worse only by an average of 1.7%. These results proceed from an analysis done
on a large dataset of 32 stereo pairs using a Pentium 4 PC at 2.6GHz.
2.6.2. Dynamic programming
Many researchers develop stereo correspondence algorithms based on DP. This
methodology is a fair trade-off between the complexity of the computations needed and
the quality of the results obtained. In every aspect, DP stands between the local
algorithms and the global optimization ones. However, its computational complexity still
renders it a less preferable option for hardware implementation.
The work presents a unified framework that allows the fusion of any partial
knowledge about disparities, such as matched features and known surfaces within the
scene. It combines the results from corner, edge and dense stereo matching algorithms to
impose constraints that act as guide points to the standard DP method. The result is a
fully automatic dense stereo system with up to four times faster running speed and greater
accuracy compared to results obtained by the sole use of DP.
In one approach, one or more candidates for the true disparity of each pixel are
assigned by local matching using oriented spatial filters. Afterwards, a two-pass DP
technique that performs optimization both along and between the scan-lines is applied.
The result is a reduction of false matches as well as of the typical inter-scan-line
inconsistency problem.
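The core DP idea can be illustrated on a single scan-line. The sketch below is a simplified one-pass version (an absolute-difference cost and a linear penalty p on disparity changes between neighbouring pixels, followed by backtracking), not the two-pass scheme described above:

```python
import numpy as np

def dp_scanline(left_row, right_row, max_disp, p=1.0):
    """Dynamic-programming stereo on one scan-line: minimise the
    per-pixel matching cost plus a smoothness penalty, then
    backtrack the optimal disparity path."""
    w = len(left_row)
    big = 1e9
    cost = np.full((w, max_disp + 1), big)
    for x in range(w):
        for d in range(max_disp + 1):
            if x - d >= 0:                       # stay inside the right image
                cost[x, d] = abs(float(left_row[x]) - float(right_row[x - d]))
    acc = cost.copy()                            # accumulated path costs
    back = np.zeros((w, max_disp + 1), dtype=int)
    for x in range(1, w):
        for d in range(max_disp + 1):
            prev = acc[x - 1] + p * np.abs(np.arange(max_disp + 1) - d)
            back[x, d] = int(np.argmin(prev))
            acc[x, d] += prev[back[x, d]]
    disp = np.zeros(w, dtype=int)
    disp[-1] = int(np.argmin(acc[-1]))
    for x in range(w - 1, 0, -1):                # follow the back-pointers
        disp[x - 1] = back[x, disp[x]]
    return disp
```

Because each scan-line is optimized independently here, the result can show the streaking effect mentioned later in the text; the two-pass and tree-based variants exist precisely to suppress it.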
In another work, the per-pixel matching costs are aggregated in the vertical
direction only, resulting in improved inter-scan-line consistency and sharp object
boundaries. This work exploits color and distance proximity based weight assignment for
the pixels inside a fixed support window, as previously reported. Real-time performance
is achieved due to the parallel use of the CPU and the GPU of a computer. This
implementation can process

320×240 pixel images with 16 disparity levels at 43.5 fps and 640×480 pixel images with
16 disparity levels at 9.9 fps.
On the contrary, another algorithm applies the DP method not across individual
scan-lines but to a tree structure. Thus the minimization procedure accounts for all the
pixels of the image, compensating for the known streaking effect without being iterative.
The reported running speed is a couple of frames per second for the tested image pairs,
so real-time implementations are feasible. At the same time, the results obtained are
comparable to those of the time-consuming global methods.
In a subsequent work, the pixel-tree approach of the previous work is replaced by
a region-tree one. First, the image is color-segmented using the mean-shift algorithm.
During stereo matching, a corresponding energy function defined on such a region-tree
structure is optimized using the DP technique. Occlusions are handled by compensating
for border occlusions and by applying cross checking. The obtained results indicate that
the percentage of bad matched pixels in unoccluded regions is 1.39, 0.22, 7.42, and
6.31 for the Tsukuba, Venus, Teddy, and Cones image sets. The running speed, on a
1.4GHz Intel Pentium M processor, ranges from 0.1 fps with 16 disparity levels to 0.04
fps for the Cones dataset with 60 disparity levels.
2.6.3 Other methods
There are of course other methods, producing dense disparity maps, which can be
placed in neither of previous categories. The below discussed methods use either
wavelet-based techniques or combinations of various techniques .Such a method, based
on the continuous wavelet transform (CWT) is found. It makes use of the redundant
information that results from the CWT. Using 1D orthogonal and bi-orthogonal wavelets
as well as 2D orthogonal wavelet the maximum matching rate obtained is 88.22% for the
image pair. Up sampling the pixels in the horizontal direction by a factor of two, through
zero insertion, further decreases the noise and the matching rate is increased to 84.91%.
Another work presents an algorithm based on non-uniform rational B-spline
(NURBS) curves. The curves replace the edges extracted with a wavelet based method.
The NURBS are projective invariant and so they reduce false matches due to distortion
and image noise. Stereo matching is then obtained by estimating the similarity between
projections of curves of one image and curves of the other image. A 96.5% matching rate
for a self-recorded image pair is reported for this method.
Finally, a different way of confronting the stereo matching issue is to investigate
the possibility of fusing the results from spatially differentiated (stereo vision) scenery
images with those from temporally differentiated (structure from motion) ones. This
approach takes advantage of both methods' merits, improving the overall performance.

2.7 SPARSE DISPARITY ALGORITHMS


Algorithms resulting in sparse, or semi-dense, disparity maps tend to be less
attractive, as most contemporary applications require dense disparity information.
However, they are very useful when fast depth estimation is required and, at the same
time, detail across the whole picture is not so important. This type of algorithm tends to
focus on the main features of the images, leaving occluded and poorly textured areas
unmatched. Consequently, high processing speeds and accurate but limited-density
results are achieved. Very interesting ideas flourish in this direction, but since
contemporary interest is directed towards dense disparity maps, only a few indicative
algorithms are discussed here.
One algorithm detects and matches dense features between the left and right
images of a stereo pair, producing a semi-dense disparity map. A dense feature is a
connected set of pixels in the left image and a corresponding set of pixels in the right
image, such that the intensity edges on the boundary of these sets are stronger than their
matching error. All of these are computed during the stereo matching process.
Another method developed is based on the same basic concepts as the former one.
The main difference is that this one uses the graph cuts algorithm for the dense feature
extraction. As a consequence, this algorithm produces semi-dense results with significant
accuracy in areas where features are detected. The results are significantly better in terms
of density and error percentage but require longer running times. The total error in the
non-occluded regions is 0.36% and the running speed is 0.17 fps.

On the other hand, a DP algorithm called reliability-based dynamic programming
(RDP) uses a different measure to evaluate the reliabilities of matches. According to this
measure, the reliability of a proposed match is the cost difference between the globally
best disparity assignment that includes the match and the best one that does not include it.
The main aim of research on stereo vision is to use imaging technology to obtain
distance (depth) information from (multiple) images of the objects in a scene. The basic
method of stereo vision is to observe the same target from two or more points of view,
acquire images from the different perspectives, and then, using the principles of visual
imaging, calculate the relative pixel locations of corresponding points in the different
images, from which the spatial location of the target objects can be inferred.

2.8 ALGORITHM IMPLEMENTATION


A complete stereo vision system can generally be divided into six modules: image
acquisition, camera calibration (which can sometimes be skipped), feature extraction,
stereo matching, three-dimensional reconstruction, and post-processing. Because auto
body panels emphasize the overall artistic effect of curved surfaces, measurement
accuracy is not the first concern in the course of measuring the car body; instead, the aim
is to make the development time shorter.

Fig.2.11: Existing Algorithm of Stereo Vision

I. Image Acquisition
There are many ways to collect image for three-dimensional information,
depending on the application of the occasion and purpose. Usually use CCD camera or
CMOS camera device and after pre-treatment device to obtain the intrinsic landscape
images. The basic rule is those two different locations, or a video camera (CCD) through
moving or rotating with a scene for stereo image pairs.
II. Camera Calibration
In stereo vision research, the three-dimensional geometry of objects is usually
computed from the image information obtained by the camera; this covers both
reconstruction and object recognition. The relationship between a point on a spatial
three-dimensional surface and its corresponding image point is determined by the
geometric model of camera imaging.
The parameters of this geometric model are the camera parameters. In most
conditions, these parameters can be obtained by experiment and calculation; this process
is known as camera calibration.
Therefore, camera calibration is the determination of the camera position and
attribute parameters (internal parameters such as focal length, lens distortion coefficient
and the uncertainty factor, and external parameters such as the image rotation matrix and
translation vector) and the establishment of an imaging model that determines the
correspondence between points in the object space coordinates and points in the image
plane, as well as between image points.
To obtain three-dimensional images, the camera calibration must not only meet
the accuracy requirements but also consider differences in viewpoint, illumination
conditions, camera performance, scenery features and other factors. The establishment of
an effective camera model not only allows the three-dimensional information of the
scene to be accurately recovered, but also helps to solve the stereo matching problem.
The commonly used camera calibration method is single-camera calibration: the
inside and outside parameters of the two cameras are obtained separately, and then the
position relationship between the two cameras is created through a set of tie points given
in the same world coordinates. The commonly used methods of single camera calibration
are:
(a) The traditional photogrammetric calibration
At least 17 parameters are needed to describe the three-dimensional structural
relations between the images and the object space, so the computation is huge.
(b) The direct linear transformation (DLT)
This involves fewer parameters and is easy to calculate. However, the DLT
algorithm does not take into account the camera distortion factor, and its accuracy also
depends on the selection of the relevant parameters and landmarks.
(c) Perspective transformation matrix
This is a linear model based on the camera perspective transformation matrix,
which is simple and practical. A model of the camera parameters is created from the
perspective transformation matrix of the camera imaging, without requiring initial
values: as long as the spatial coordinates and the corresponding image coordinates of a
set of reference points are determined, the parameters can be solved with a linear
method, so this method can run in real time.
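The linear solve behind methods (b) and (c) can be sketched as the classic DLT estimation of the 3×4 projection matrix from world/image point correspondences (lens distortion is ignored, as the text notes; the function name is an assumption for the example):

```python
import numpy as np

def dlt_projection(world_pts, image_pts):
    """Direct linear transformation: estimate the 3x4 perspective
    projection matrix P from world/image correspondences by solving
    the homogeneous system A p = 0 with an SVD (P is recovered up
    to scale; at least 6 non-degenerate points are needed)."""
    rows = []
    for (X, Y, Z), (u, v) in zip(world_pts, image_pts):
        rows.append([X, Y, Z, 1, 0, 0, 0, 0, -u*X, -u*Y, -u*Z, -u])
        rows.append([0, 0, 0, 0, X, Y, Z, 1, -v*X, -v*Y, -v*Z, -v])
    _, _, vt = np.linalg.svd(np.asarray(rows, float))
    return vt[-1].reshape(3, 4)   # right singular vector of the
                                  # smallest singular value
```

Each correspondence contributes two rows, so twelve unknowns (up to scale) require at least six points, which matches the point-count recommendation given for calibration later in this section.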
(d) Two-step method
First, the perspective transformation matrix method is used to solve for the
camera parameters as a linear system. The parameters obtained are then used as initial
values and, taking the distortion factors into account, the solution is refined using a
nonlinear optimization method; the calibration accuracy is high.
(e) Double-plane calibration method
Two-camera calibration needs precise external parameters. As the structural
configuration is difficult to measure accurately, the distance and angle between the two
cameras are limited. Therefore at least 6 (and preferably 10 or more) points of known
world coordinates are generally required in order to obtain a satisfactory parameter
matrix. The actual measurement process is complex, but the effect is not necessarily
ideal, which greatly limits its scope of application. Double-lens camera calibration also
needs to consider nonlinear correction, range and precision, so it currently sees fewer
outdoor applications.
III. Feature Extraction
Feature extraction extracts from the collected images the image features used for
three-dimensional correspondence, so it is a key step preceding stereo matching. Since
there is currently no universally applicable theory of image feature extraction, a diversity
of matching characteristics is used in stereo vision. Currently, the main commonly used
features take the form of point features, linear features and regional features. In general,
large-scale features contain rich image information, occur in small numbers in an image,
and are easy to match quickly.
The image features that can be extracted include color, texture, spatial relations,
shape, and other statistical characteristics. Image feature extraction and expression are
the basis of content-based image retrieval technology. Broadly speaking, image features
include text-based features (such as keywords, comments, etc.) and visual features (such
as color, texture, shape, object surface, etc.). Visual features can further be divided into
general visual features and visual features related to specific domains.
The former describe features common to all images, unrelated to the specific type
or content of the image, including color, texture and shape; the latter are based on some
prior knowledge (or assumptions) about the image content and are closely tied to specific
applications, such as facial features, fingerprints, or license plate characteristics.
There are many feature extraction algorithms; depending on the application,
reliable feature detection and feature localization should be chosen. Many characteristics
of three-dimensional objects can be calculated from features in the two-dimensional
images. In addition, due to differences in the image acquisition process, some features
can be easily calculated, and some cannot.
The commonly used feature extraction methods are edge detection based on
texture, shape and contour extraction. Implementations of edge detection concentrate
either on geometry-based extraction over the edge map, searching for the points of
maximum curvature along the edges, or on interest operators, searching the grey-scale
image for the points where the value of the interest operator is maximal.
Specific edge detection methods can be divided into three categories. The first is
the use of differential edge operators. Well-known edge detection operators are the
Roberts operator, Sobel operator, Prewitt operator, Kirsch operator, Laplacian, Canny
operator and LoG operator, as well as a number of operators that improve upon these.
Among them, edge detection operators in second-derivative form are the primary means
of edge detection.
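As an example of a first-derivative operator from the list above, a direct (deliberately unoptimized) Sobel gradient-magnitude computation can be sketched as:

```python
import numpy as np

def sobel_magnitude(img):
    """First-derivative edge detection: apply the horizontal and
    vertical Sobel kernels at every interior pixel and combine the
    gradient components into a magnitude."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
    ky = kx.T                      # vertical kernel is the transpose
    h, w = img.shape
    out = np.zeros((h, w))
    f = img.astype(float)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            patch = f[y - 1:y + 2, x - 1:x + 2]
            gx = (patch * kx).sum()
            gy = (patch * ky).sum()
            out[y, x] = np.hypot(gx, gy)
    return out
```

A step in intensity produces a strong response along the step and zero response in flat regions; thresholding this magnitude yields the binary edge map that the contour-based methods operate on.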
The second category comprises edge detection methods based on wavelet and
fractal theory. Wavelet-based edge detection can detect image edge features at different
scales. In fractal-based edge detection the selected template cannot be too small (too
small a template cannot adequately reflect the fractal characteristics), so the detected
edges are coarse and need further refinement, and the positioning accuracy is not high.
The third category consists of morphology-based edge detection methods.
These algorithms all have advantages and disadvantages, which must be weighed
against the specific situation of the actual application. Before feature extraction, the
acquired images need to be pre-processed. Because noise sources exist in the image
acquisition process, such treatment can significantly improve image quality and make
the image features more prominent.
IV. Stereo Matching
Stereo matching is the most complex and difficult part of stereo vision. Stereo
matching establishes, based on the data obtained after feature extraction, the
correspondence between features: pixels in different images that correspond to the same
point in space are paired, and the result is the corresponding disparity image.
The projected image is a two-dimensional representation of the scene, shaped by
a wealth of influences (such as lighting, background, geometry, environmental
characteristics, distortion, etc.). The final grey value of a pixel reflects all of these
factors together, so recovering them from a single grey value is very difficult. Therefore,
unambiguous image matching is very hard, and at present matching methods cannot
recover the parallax of all image points. According to the matching primitives used,
stereo matching can be divided into three categories: regional matching, feature
matching and phase matching. Regional matching and feature matching are both
performed under certain constraint conditions.
The purpose of matching is to obtain the disparity between images. Common
constraints are the epipolar constraint, consistency constraints, the uniqueness constraint
and the continuity constraint. The scale-invariant feature transform (SIFT) is an
algorithm in computer vision to detect and describe local features in images. Applications
include object recognition, robotic mapping and navigation, image stitching, 3D
modeling, gesture recognition, video tracking, and match moving.
V. 3D Reconstruction
Once the parallax image is obtained by stereo matching, the depth of the matched
points can be computed; the depths of the remaining points are then obtained by
interpolating the depths of the matched points, i.e., by interpolating the discrete data at
points that have no apparent disparity of their own. From the data obtained, the
three-dimensional reconstruction recovers the 3D information of the scene.
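For a rectified pair with focal length f (in pixels) and baseline B, the depth of a matched point follows from triangulation as Z = f*B/d, where d is the disparity. The conversion can be sketched in NumPy as follows (the focal length and baseline values below are illustrative, not taken from the project):

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m):
    """Convert a disparity map (pixels) to a depth map (metres).

    depth = focal * baseline / disparity; zero disparities (no match,
    or a point at infinity) are mapped to infinity.
    """
    disparity = np.asarray(disparity, dtype=float)
    depth = np.full_like(disparity, np.inf)
    valid = disparity > 0
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

# Illustrative values: 700 px focal length, 0.1 m baseline.
depth = disparity_to_depth(np.array([[7.0, 14.0], [0.0, 35.0]]), 700.0, 0.1)
```

Larger disparities map to smaller depths: here the point with disparity 35 is the closest, at 0.2 m.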
VI. Post-processing
Post-processing includes two aspects: error correction and accuracy improvement.
Stereo matching is affected by geometric distortion and by noise interference between the
images. In addition, periodic patterns, smooth regions, occlusion effects, and constraints
that are not strictly satisfied all produce errors in the disparity map, so error detection
and correction is an important part of post-processing. Depending on the cause of the
error, it is usually necessary to select the appropriate means and

methods to correct them. The precision of the three-dimensional reconstruction model of
a space object clearly depends on maintaining very high accuracy at every step, which in
turn depends on the choice of algorithms and the calculation accuracy. To improve
accuracy, the usual pixel-level stereo disparity can be further refined to achieve sub-pixel
accuracy.
The disadvantage of stereo matching is that it is a very complex and timeconsuming process, since many parameters must be calculated. To overcome this
disadvantage, the block matching and image pyramiding techniques are used in the
proposed method.


CHAPTER 3
IMAGE PYRAMID AND BLOCK MATCHING
3.1 ALGORITHM IMPLEMENTATION

Fig.3.1: Proposed algorithm


This algorithm shows how to compute the depth map between two rectified stereo
images. In this paper we use block matching, which is the standard algorithm for
high-speed stereo vision in hardware systems. We first explore basic block matching,
then apply dynamic programming to improve accuracy and image pyramiding to improve
speed.
Stereo vision is the process of recovering depth from camera images by
comparing two or more views of the same scene. Simple binocular stereo uses only two
images, typically taken with parallel cameras separated by a horizontal distance
known as the "baseline." The output of the stereo computation is a disparity map (which

is translatable to a range image) that tells how far each point in the physical scene is from
the camera.
Step 1: Read Stereo Image Pair
Here we read in the color stereo image pair and convert the images to grayscale
for the matching process. Using color images may provide some improvement in
accuracy, but it is more efficient to work with one-channel images. For this we use
the Image Data Type Converter and the Color Space Converter System objects. Below
we show the left camera image and a color composite of both images, so that the
disparity between them is easily seen.
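The project performs this step with MATLAB System objects; as a language-neutral illustration, the grayscale conversion can be sketched in NumPy using the standard BT.601 luma weights (the toy image below is a stand-in, not the project's data):

```python
import numpy as np

def to_grayscale(rgb):
    """Convert an H-by-W-by-3 RGB image to a single-channel grayscale
    image using the standard BT.601 luma weights."""
    weights = np.array([0.299, 0.587, 0.114])
    return np.asarray(rgb, dtype=float) @ weights

left_rgb = np.ones((4, 4, 3))       # toy stand-in for the left camera image
left_gray = to_grayscale(left_rgb)  # each pixel: 0.299 + 0.587 + 0.114 = 1.0
```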
Step 2: Basic Block Matching
Block matching is a way of locating matching blocks in a sequence of digital
video frames, originally for the purpose of motion estimation. The basic idea is that, to
estimate the disparity at a point in the left image, we take a reference block surrounding
that point and then search the right image for the closest-matching block.
The matching criterion used in block matching is the sum of absolute differences
(SAD): it computes the intensity differences between each block in the left image and
candidate blocks in the right image. Candidate pairs at different disparities are compared
until the best-matching pair is determined; the disparity associated with the smallest SAD
value is selected as the best match.
Next we perform basic block matching. For every pixel in the right image, we
extract the 7-by-7-pixel block around it and search along the same row in the left image
for the block that best matches it. Here we search in a range of ±15 pixels around
the pixel's location in the first image, and we use the sum of absolute differences (SAD)
to compare the image regions. We need only search over columns and not over rows
because the images are rectified. We use the Template Matcher System object to perform
this block matching between each block and the region of interest.
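The search described above can be sketched as follows. This is an illustrative NumPy version with a small block size and search range, not the project's MATLAB implementation (which uses 7-by-7 blocks and a ±15-pixel range):

```python
import numpy as np

def sad_disparity(left, right, block=3, max_disp=4):
    """Basic block matching: for each pixel in the right image, search
    along the same row in the left image (rectified pair) and keep the
    disparity with the smallest sum of absolute differences (SAD)."""
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    h, w = right.shape
    half = block // 2
    disp = np.zeros((h, w), dtype=int)
    for r in range(half, h - half):
        for c in range(half, w - half):
            patch = right[r - half:r + half + 1, c - half:c + half + 1]
            best, best_d = np.inf, 0
            for d in range(max_disp + 1):
                cc = c + d          # the match shifts rightward in the left image
                if cc + half >= w:
                    break
                cand = left[r - half:r + half + 1, cc - half:cc + half + 1]
                sad = np.abs(patch - cand).sum()
                if sad < best:
                    best, best_d = sad, d
            disp[r, c] = best_d
    return disp

# Toy pair: the left image is the right image shifted 2 pixels to the right,
# so the recovered disparity inside the square should be 2.
right = np.zeros((7, 9)); right[2:5, 2:5] = 1.0
left = np.roll(right, 2, axis=1)
d = sad_disparity(left, right)
```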



Step 3: Sub-pixel Estimation
Sub-pixel estimation is also used for motion de-blurring: motion blur appears in
almost all practical video systems, whether from a moving camera with a stationary
scene or a stationary camera with a moving scene.
The disparity estimates returned by block matching are all integer-valued, so the
above depth map exhibits contouring effects, with no smooth transitions between regions
of different disparity. This can be ameliorated by incorporating sub-pixel computation
into the matching metric. Previously we took only the location of the minimum cost as
the disparity, but now we also take into consideration the two neighboring cost values.
We fit a parabola to these three values and analytically solve for its minimum to get the
sub-pixel correction.
Re-running basic block matching with this refinement, we achieve the result
below, where the contouring effects are mostly removed and the disparity estimates are
correctly refined. This is especially evident along the walls.
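The parabola fit has a simple closed form: given the cost at the best integer disparity and at its two neighbors, the vertex of the interpolating parabola gives the sub-pixel correction. A sketch (assuming the minimum is not at the boundary of the search range):

```python
def subpixel_offset(c_prev, c_min, c_next):
    """Fit a parabola through three cost samples at disparities
    d-1, d and d+1, and return the offset of its vertex from d.
    The offset lies in (-0.5, 0.5) when c_min is the true minimum."""
    denom = c_prev - 2.0 * c_min + c_next
    if denom == 0:              # flat cost curve: keep the integer disparity
        return 0.0
    return 0.5 * (c_prev - c_next) / denom

# Costs 4, 1, 2 around the best integer disparity: the true minimum lies
# a quarter pixel toward the lower-cost neighbour.
offset = subpixel_offset(4.0, 1.0, 2.0)
```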
Step 4: Dynamic Programming
Dynamic programming is a general approach to making a sequence of interrelated
decisions in an optimum way.
Overview of this method:
1) Define a small part of the whole problem and find an optimum solution to this
small part.
2) Enlarge this small part slightly and find an optimum solution to the new problem
using the previously found optimum solution.
3) Continue with step 2 until you have enlarged sufficiently that the current problem
encompasses the original problem.
4) Track back the solution to the whole problem from the optimum solutions to the
small problems solved along the way.



As mentioned above, basic block matching creates a noisy disparity image. This
can be improved by introducing a smoothness constraint. Basic block matching chooses
the optimal disparity for each pixel based on its own cost function alone. Now we allow a
pixel to have a disparity with a possibly sub-optimal local cost, provided this extra cost is
offset by increased agreement in disparity with the pixel's neighbors.
The problem of finding the optimal disparity estimates for a row of pixels now
becomes one of finding the "optimal path" from one side of the image to the other. To
find this optimal path, we use the underlying block matching metric as the cost function
and constrain the disparities to only change by a certain amount between adjacent pixels.
This is a problem that can be solved efficiently using the technique of dynamic
programming.
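The scanline optimization described above can be sketched as a Viterbi-style dynamic program over a per-row cost volume. The cost values and smoothness penalty below are illustrative assumptions, not the project's parameters:

```python
import numpy as np

def scanline_dp(cost, penalty=1.0):
    """Choose one disparity per pixel of a single scanline so that the sum
    of matching costs plus penalty * |d[i] - d[i-1]| is minimal.
    cost is a (num_pixels, num_disparities) array; returns the optimal path."""
    n, m = cost.shape
    total = cost[0].astype(float)
    back = np.zeros((n, m), dtype=int)
    for i in range(1, n):
        # trans[j, k]: cost of arriving at disparity j from previous disparity k
        trans = total[None, :] + penalty * np.abs(
            np.arange(m)[:, None] - np.arange(m)[None, :])
        back[i] = np.argmin(trans, axis=1)
        total = cost[i] + trans[np.arange(m), back[i]]
    # backtrack the optimal path from the best final disparity
    path = np.empty(n, dtype=int)
    path[-1] = int(np.argmin(total))
    for i in range(n - 1, 0, -1):
        path[i - 1] = back[i, path[i]]
    return path

# A noisy scanline: pixel 2 alone would prefer disparity 3, but the
# smoothness term pulls it back into agreement with its neighbours.
cost = np.array([[0.0, 5, 5, 5],
                 [0.0, 5, 5, 5],
                 [0.4, 5, 5, 0.0],
                 [0.0, 5, 5, 5],
                 [0.0, 5, 5, 5]])
path = scanline_dp(cost, penalty=1.0)
```

Jumping to disparity 3 at pixel 2 would save 0.4 in matching cost but add 6.0 in smoothness penalty, so the optimal path stays at disparity 0 throughout.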
Step 5: Image Pyramid
Image pyramiding develops filter-based representations to decompose images
into information at multiple scales, to extract features and structures of interest, and to
attenuate noise.
Motivation:
Extract image features such as edges at multiple scales
Redundancy reduction and image modeling for efficient coding
Image enhancement/restoration
Image analysis/synthesis
While dynamic programming can improve the accuracy of the stereo image, basic
block matching is still an expensive operation, and dynamic programming only adds to
the burden. One solution is to use image pyramiding and a telescopic search to guide the
block matching. The example below performs this telescopic stereo matching using a
four-level image pyramid.
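The pyramid construction can be sketched as follows; simple 2x2 block averaging is used here as a stand-in for the filtering actually applied, and the comment notes how the telescopic search uses the levels:

```python
import numpy as np

def build_pyramid(img, levels=4):
    """Build an image pyramid by repeated 2x2 block averaging
    (a simple stand-in for Gaussian filtering plus downsampling)."""
    img = np.asarray(img, dtype=float)
    pyramid = [img]
    for _ in range(levels - 1):
        h, w = pyramid[-1].shape
        coarse = pyramid[-1][:h - h % 2, :w - w % 2]
        coarse = coarse.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
        pyramid.append(coarse)
    return pyramid

# Telescopic search: block matching runs first on the coarsest level with a
# small range; each finer level doubles the estimate passed down and only
# refines it locally, instead of searching the full +/-15 px range.
pyr = build_pyramid(np.zeros((64, 80)), levels=4)
shapes = [p.shape for p in pyr]
```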



Step 6: Combined Pyramiding and Dynamic Programming
Finally, we merge the above techniques and run dynamic programming along with
image pyramiding, where the dynamic programming is run on the disparity estimates
output by every pyramid level. The results compare well with the highest-quality results
we have obtained so far, and are achieved at a reduced computational burden versus
basic block matching. It is also possible to use sub-pixel methods with dynamic
programming, and we show the results of all three techniques in the second image. As
before, sub-pixel refinement reduces contouring effects and clearly improves accuracy.

3.2 ADVANTAGES
Reduced computational time
Improved accuracy
Efficient technique

3.3 APPLICATIONS
Object recognition.
Position determination.
Shape and size detection.
Product processing and assembly.
Obstacle avoidance and navigation.


CHAPTER 4
EXPERIMENTAL RESULTS AND DISCUSSIONS

Fig 4.1: Right image

Fig 4.2: Left image


Initially we read the stereo image pair and convert it into grayscale for the
matching process. The above figures show the captured images.
Figure 4.3 below shows the color composite of the stereo pair after conversion to
grayscale.


Fig 4.3: Color composite image


After reading the stereo image pair, block matching is performed for every pixel
in the image to obtain the disparity. The resultant image after block matching is shown in
figure 4.4.
The disparity estimates returned by block matching are integer values; because of
this, the depth map exhibits contouring effects, which can be overcome by sub-pixel
estimation.

Fig 4.4: Depth map from basic block matching


Figure 4.5 shows the reduction of the contouring effects by sub-pixel estimation.

Fig 4.5: Basic block matching with sub-pixel accuracy


The above step creates a noisy disparity image; this can be improved by a
smoothness constraint, i.e., dynamic programming. Figure 4.6 shows the enhancement of
the block-matching disparity.

Fig 4.6: Block matching with dynamic programming


To overcome the complexity of the block matching technique, a new technique,
image pyramiding, is introduced; it improves the speed of operation.


Finally, pyramiding and dynamic programming are combined, where dynamic
programming is applied to the disparity estimates output at every pyramid level. Figure
4.7 shows the combination of the pyramid and block matching.

Fig 4.7: 4-level pyramid with dynamic programming


Finally, all the techniques are combined, applying the pyramid with dynamic
programming and sub-pixel estimation to reduce the contouring effects.
The figure 4.8 shows the final disparity image.

Fig 4.8: Pyramid with dynamic programming and sub-pixel accuracy



The figure 4.9 shows the 3D reconstructed image.

Fig 4.9: 3D reconstructed image


CHAPTER 5
CONCLUSION AND FUTURE SCOPE
The stereo matching process based on the block matching technique is able to
establish correspondences by matching image pixel intensities. The output is a disparity
map, which stores the depth or distance of each pixel in an image: each pixel in the map
corresponds to the depth at that point rather than a gray shade or color. In future work,
the peak signal-to-noise ratio can be improved by improved stereo vision techniques.


REFERENCES
1. Hartley, R. and Zisserman, A., 2004. Multiple View Geometry in Computer Vision,
second edition. Cambridge University Press.
2. Blostein, S. and Huang, T., 1987. Quantization error in stereo triangulation. In: IEEE
Int. Conf. on Computer Vision.
3. Hartley, R., Gupta, R. and Chang, T., 1992. Stereo from uncalibrated cameras. In:
IEEE Conf. on Computer Vision and Pattern Recognition.
4. Koch, R., Pollefeys, M. and Van Gool, L., 2000. Realistic surface reconstruction of 3D
scenes from uncalibrated image sequences. In: Visualization and Computer Animation.
5. Hernandez, C., Schmitt, F. and Cipolla, R., 2007. Silhouette coherence for camera
calibration under circular motion. In: IEEE Trans. on Pattern Analysis and Machine
Intelligence.
6. Lazebnik, S., Furukawa, Y. and Ponce, J., 2006. Projective visual hulls. IJCV.
7. Szeliski, R., 1993. Rapid octree construction from image sequences. Computer Vision,
Graphics and Image Processing, 58(1), pp. 23-32.
8. Matusik, W., Buehler, C., Raskar, R., Gortler, S. and McMillan, L., 2000. Image-based
visual hulls. In: ACM SIGGRAPH.
9. Khan, S., Yan, P. and Shah, M., 2007. A homographic framework for the fusion of
multi-view silhouettes. In: IEEE Int. Conf. on Computer Vision.
10. Lowe, D. G., 1999. Object recognition from local scale-invariant features. In:
Proceedings of the International Conference on Computer Vision, vol. 2, pp. 1150-1157.
11. Barnard, S. T. and Fischler, M. A., 1982. Computational stereo. ACM Computing
Surveys, vol. 14, pp. 553-572.
12. Dhond, U. R. and Aggarwal, J. K., 1989. Structure from stereo - a review. IEEE
Trans. Systems, Man, and Cybernetics, vol. 19, no. 6, pp. 1489-1510.


APPENDIX
SOFTWARE IMPLEMENTATION
Software Requirement
MATLAB
MATLAB is a high-performance language for technical computing. It integrates
computation, visualization, and programming in an easy-to-use environment. MATLAB
stands for "matrix laboratory." It was written originally to provide easy access to matrix
software developed by the LINPACK (linear system package) and EISPACK
(eigensystem package) projects. MATLAB is therefore built on a foundation of
sophisticated matrix software in which the basic element is a matrix that does not require
pre-dimensioning.
Typical uses of MATLAB
The typical usage areas of MATLAB are:
Math and computation
Algorithm development
Data acquisition
Data analysis, exploration and visualization
Scientific and engineering graphics


MATLAB is an interactive system whose basic data element is an array that
does not require dimensioning. This allows you to solve many technical computing
problems, especially those with matrix and vector formulations, in a fraction of the time it
would take to write a program in a scalar non-interactive language such as C or
FORTRAN.
MATLAB features a family of add-on application-specific solutions called
toolboxes. Very important to most users of MATLAB, toolboxes allow you to learn and
apply specialized technology. Toolboxes are comprehensive collections of MATLAB
functions (M-files) that extend the MATLAB environment to solve particular classes of
problems. Areas in which toolboxes are available include signal processing, image
processing, control systems, neural networks, fuzzy logic, wavelets, simulation, and
many others.
Features of MATLAB
Advanced algorithms for high-performance numerical computation, especially in
the field of matrix algebra.
A large collection of predefined mathematical functions and the ability to define
one's own functions.
Two- and three-dimensional graphics for plotting and displaying data.
Fig: Features and capabilities of MATLAB (the programming language with user-written
and built-in functions, graphics, computation, the external interface to C and FORTRAN,
and toolboxes such as signal processing, image processing, control systems, and neural
networks)

MATLAB Development Environment
The MATLAB system consists of five main parts:
Development Environment
The MATLAB Mathematical Function Library
The MATLAB Language
Graphical User Interface (GUI) construction
The MATLAB Application Program Interface (API)

Development Environment
This is the set of tools and facilities that help you use MATLAB functions and
files. Many of these tools are graphical user interfaces. It includes the MATLAB desktop
and Command Window, a command history, an editor and debugger, and browsers for
viewing help, the workspace, files, and the search path.
The MATLAB Mathematical Function Library
This is a vast collection of computational algorithms ranging from elementary
functions like sum, sine, cosine, and complex arithmetic, to more sophisticated functions
like matrix inverse, matrix eigenvalues, Bessel functions, and fast Fourier transforms.
The MATLAB Language
This is a high-level matrix/array language with control flow statements, functions,
data structures, input/output, and object-oriented programming features. It allows both
"programming in the small" to rapidly create quick and dirty throw-away programs, and
"programming in the large" to create complete large and complex application programs.
The GUI construction
MATLAB has extensive facilities for displaying vectors and matrices as graphs,
as well as annotating and printing these graphs. It includes high-level functions for two-
dimensional and three-dimensional data visualization, image processing, animation, and
presentation graphics. It also includes low-level functions that allow you to fully
customize the appearance of graphics as well as to build complete graphical user
interfaces on your MATLAB applications.
The Graphical User Interface (GUI) is an interactive system that helps to establish
good communication between the user and the program. The functional operation of
the GUI is comparable to Applets in JAVA. The MATLAB Toolbox provides many
functions to create GUI main frames. GUIs can be created with GUIDE (the Graphical
User Interface Development Environment), which is a package in the MATLAB Toolbox.
The GUI makes the process easy to operate and reduces the risk of error. GUIDE
provides a set of tools for creating graphical user interfaces (GUIs); these tools greatly
simplify the process of designing and building GUIs. We can use the GUIDE tools to:
Lay out the GUI: In the Layout Editor, we can lay out a GUI easily by clicking and
dragging GUI components -- such as panels, buttons, text fields, sliders, menus, and so
on -- into the layout area.

Program the GUI: GUIDE automatically generates an M-file that controls how the
GUI operates. The M-file initializes the GUI and contains a framework for all the GUI
callbacks -- the commands that are executed when a user clicks a GUI component. Using
the M-file editor, we can add code to the callbacks to perform the required functions.
GUIDE stores a GUI in two files, which are generated the first time we save or
run the GUI:
A FIG-file, with extension .fig, which contains a complete description of the GUI
layout and the components of the GUI: push buttons, menus, axes, and so on.
An M-file, with extension .m, which contains the code that controls the GUI,
including the callbacks for its components.
These two files correspond to the tasks of laying out and programming the GUI.
When we lay out the GUI in the Layout Editor, our work is stored in the FIG-file; when
we program the GUI, our work is stored in the M-file.



The MATLAB Application Program Interface (API)
This is a library that allows you to write C and FORTRAN programs that interact
with MATLAB. It includes facilities for calling routines from MATLAB (dynamic
linking), calling MATLAB as a computational engine, and reading and writing
MAT-files.
MATLAB Desktop
The MATLAB Desktop is the main MATLAB application window. The desktop
contains five sub-windows: the command window, the workspace browser, the current
directory window, the command history window, and one or more figure windows,
which are shown only when the user displays a graphic.

Fig MATLAB desktop


The command window is where the user types MATLAB commands and
expressions at the prompt (>>) and where the output of those commands is displayed.
MATLAB defines the workspace as the set of variables that the user creates in a work
session. The workspace browser shows these variables and some information about them.
Double-clicking on a variable in the workspace browser launches the Array Editor, which
can be used to obtain information about, and in some instances edit, certain properties of
the variable.
The Current Directory tab above the workspace tab shows the contents of the
current directory, whose path is shown in the current directory window. For example, in
the Windows operating system the path might be as follows: C:\MATLAB\Work,
indicating that the directory work is a subdirectory of the main directory MATLAB,
which is installed in drive C. Clicking on the arrow in the current directory window
shows a list of recently used paths, and clicking on the button to the right of the window
allows the user to change the current directory.
MATLAB uses a search path to find M-files and other MATLAB-related files,
which are organized in directories in the computer file system. Any file run in MATLAB
must reside in the current directory or in a directory that is on the search path. By default,
the files supplied with MATLAB and MathWorks toolboxes are included in the search
path. The easiest way to see which directories are on the search path, or to add or modify
the search path, is to select Set Path from the File menu on the desktop and then use the
Set Path dialog box. It is good practice to add any commonly used directories to the
search path, to avoid repeatedly having to change the current directory.
The Command History window contains a record of the commands a user has
entered in the command window, including both current and previous MATLAB
sessions. Previously entered MATLAB commands can be selected and re-executed from
the command history window by right-clicking on a command or sequence of commands;
this launches a menu from which various options can be selected in addition to executing
the commands. This is a useful feature when experimenting with various commands in a
work session.



Using the MATLAB Editor to create M-files
The MATLAB editor is both a text editor specialized for creating M-files and a
graphical MATLAB debugger. The editor can appear in a window by itself, or it can be a
sub-window in the desktop. M-files are denoted by the extension .m, as in pixelup.m.
The MATLAB editor window has numerous pull-down menus for tasks such as
saving, viewing, and debugging files. Because it performs some simple checks and also
uses color to differentiate between various elements of code, this text editor is
recommended as the tool of choice for writing and editing M-functions. Typing edit
filename at the prompt opens the M-file filename.m in an editor window, ready for
editing. As noted earlier, the file must be in the current directory, or in a directory on the
search path.
Getting Help
The principal way to get help online is to use the MATLAB Help Browser, opened
as a separate window either by clicking on the question mark symbol (?) on the desktop
toolbar or by typing helpbrowser at the prompt in the command window. The Help
Browser is a web browser integrated into the MATLAB desktop that displays Hypertext
Markup Language (HTML) documents. It consists of two panes: the help navigator pane,
used to find information, and the display pane, used to view the information.
Self-explanatory tabs in the navigator pane are used to perform a search.
