
Automated 3-D Modeling from a Single Uncalibrated Image
K.V.Shailesh, 200401120
Prof. Suman Mitra
Evaluation Committee no: 2
Abstract - This project aims at developing a method to automatically generate a 3-D model from a single uncalibrated image. We do not aim to identify the various objects and shapes in an image but coarsely segregate the image into 3 main categories, namely sky, vertical and horizontal. The next step is to make the computer learn the perspective in the image and generate a 3D interpretation of the single input image. Our major contributions are segmenting the image into the 3 categories mentioned above and developing algorithms for machine learning. This methodology would ultimately aid in the development of virtual walkthroughs and 3D models without the need to go into their technical details.
Index Terms - Image processing, image segmentation, perspective viewing, projective geometry, texturing.
I. INTRODUCTION
Increasing interest in the entertainment and communication design industry has led to significant developments in the field of 3D graphics. As a result, we are able to enjoy virtual reality gaming with stunning graphics, high definition animation films and interactive virtual tours. A subset of this boom is 3D modeling based on image-based rendering, which involves the use of photographs for generating 3D models that are further used in developing gaming and virtual tour environments. However, the development of these environments has strictly remained a domain of professionals and often requires a great amount of technical expertise, multiple photographs shot using calibrated cameras, and special equipment.
This project allows us to bypass all the technical complexity mentioned above by embedding it into an
automated framework. The application of this project can vary
from the development of personalized virtual tours and
walkthroughs to FPS (First-person shooter) game
development. The application takes as input any image with
considerable depth of field and automatically generates a 3D
VRML model as the output, allowing us to walk through it or
further integrate it with any other application.
II. PROBLEM STATEMENT
To develop a methodology for automatically generating 3D models in VRML from single images. To achieve this, the main tasks are: to develop image segmentation algorithms based on the characteristics of perspective images and on geometric cues extracted from the images; to derive a formula for generating the front view of various structures (in the input image) using fundamentals of projective geometry; and finally, to automate the whole process and implement it using MATLAB R2006a.
III. RELATED WORK
The image based rendering approaches developed so far include QuickTime VR [2], which requires a large number of photographs taken using special equipment. PhotoBuilder [3] is similar to this project but still requires more than 2 images and a considerable effort on the user's part in labeling the corresponding parts of the images so that the translation and rotation matrices can be obtained. Other projects that match 3D objects to 2D images, such as [4], require high levels of accuracy and aid mainly in robot navigation. Tour into the Picture [5] generates an output similar to this project. It treats the scene as an axis-aligned box, typically comprising a floor, a ceiling, 2 side walls and a backdrop. Using a spidery mesh interface, the user specifies the coordinates of the box and the vanishing point. The software creates good results but works only for scenes which have one-point perspective. Our project is mainly inspired by Automatic Photo Pop-Up [1], which also creates a VRML model as the output using a single image. The idea of partitioning an image into 3 categories, namely sky, vertical and horizontal, has been borrowed from this project.
IV. PERCEIVING 3D
Humans can easily grasp a scene and even imagine reasonably well what it would look like from a different viewpoint. Even with no prior knowledge of optics or projective geometry, this is possible because the human mind builds up statistical knowledge through observation. Surprisingly, this statistical knowledge is governed by certain rules which have long been used in perspective viewing (graphic arts) and projective geometry for creating 3D interpretations on a 2D plane. We exploit the same basic set of rules to make our computers perceive these images as we do. They are:
Foreshortening: All perspective images assume a viewer a certain distance away from the drawing. Objects are scaled relative to that viewer. Additionally, an object is often not scaled evenly: a circle often appears as an ellipse and a square can appear as a trapezoid. This distortion is referred to as foreshortening.
Horizon Line: Perspective drawings typically have an
often implied horizon line directly opposite to the viewer's
eye. It represents objects infinitely far away which have
shrunk into the distance, to the infinitesimal thickness of a
line. It is analogous to (and named after) the Earth's
horizon.
Vanishing Points: Any perspective representation of a
scene that includes parallel lines has one or more
vanishing points. A one-point perspective drawing means
that the drawing has a single vanishing point, usually
opposite the viewer's eye and usually on the horizon line.
All lines parallel with the viewer's line of sight recede to
the horizon towards this vanishing point. This is the
standard "receding railroad tracks" phenomenon. A twopoint drawing would have lines parallel to two different
angles. Any number of vanishing points is possible in an
image, one for each set of parallel lines that are at an
angle relative to the plane of the drawing.


V. APPROACH
Before we begin, we form certain basic assumptions about our set of input images and derive some conclusions based on observation and on basic image processing operations:
I. Assumptions
1. Every image has a sky in the upper part.
2. Every image has at least one vanishing point, inside or
outside the image boundaries.
3. Every image has sufficient depth of view.
4. Every image in the input data set has straight lines.
II. Conclusions
1. Sky has almost uniform gradient (Observed by edge
detection).
2. Vertical structures have vertical lines (slope = 90°).
3. Vertical lines (slope = 90°) only diminish with distance; their inclination remains unaltered (based on statistical analysis using the linear Hough transform).
To address the above problem we have divided the whole process into 7 steps, as shown in Figure 1.
FIGURE 1
THE STEPS INVOLVED IN THE DESIGN OF THE ALGORITHM

Step1. Detecting an Edge
Edges characterize boundaries and are therefore a problem of fundamental importance in image processing. Edges in images are areas with strong intensity contrasts: a jump in intensity from one pixel to the next. Edge detection significantly reduces the amount of data and filters out useless information, while preserving the important structural properties of an image. According to [6], different edge detection methods may be grouped into two categories: gradient and Laplacian. The gradient method detects the edges by looking for the maximum and minimum in the first derivative of the image. This method of locating an edge is
characteristic of the gradient filter family of edge detection
filters and includes the Sobel, Roberts Cross, Prewitt and
Canny methods. Furthermore, when the first derivative is at a
maximum, the second derivative is zero. As a result, another
alternative to finding the location of an edge is to locate the
zeros in the second derivative. This method is known as the
Laplacian. Filtering is mandatory in both cases in order to remove noise, obtain a smoother function and achieve more precise detection of edges.
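As a concrete illustration, the following minimal MATLAB sketch applies two gradient-family detectors from the Image Processing Toolbox to a photograph; the file name 'scene.jpg' and the median-filter size are hypothetical choices, not part of the project's pipeline.

    % Minimal sketch: gradient-based edge detection on a single photograph.
    % 'scene.jpg' is a placeholder file name.
    rgb  = imread('scene.jpg');          % input photograph
    gray = rgb2gray(rgb);                % detect edges on intensity only
    gray = medfilt2(gray, [3 3]);        % light filtering to suppress noise

    bwSobel = edge(gray, 'sobel');       % first-derivative (gradient) detector
    bwCanny = edge(gray, 'canny');       % gradient detector with hysteresis thresholding

    figure;
    subplot(1, 2, 1); imshow(bwSobel); title('Sobel edges');
    subplot(1, 2, 2); imshow(bwCanny); title('Canny edges');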
Running edge detection eliminates a major portion of the unnecessary information in the image, giving us the edges of almost every distinguishable object in the picture. Now, to be able to understand the geometry of the images, we need further classification of the edges. Lines embedded in an image give a more precise approximation of the perspective of the image than any curve. One more reason for considering lines is that they are easier to detect and manipulate, which is required in later steps. So, the next step is to extract only the lines from the edges detected so far.
Step2. Detecting a Line
The Hough transform addresses this problem by making it
possible to perform groupings of edge points into object
candidates by performing an explicit voting procedure over a
set of parameterized image objects.
The Hough transform is capable of extracting any geometrical shape, such as a circle or an ellipse, the complexity of which increases with the number of parameters. The simplest case of the Hough transform is the linear transform for detecting straight lines. In the image space, the straight line can be
described as y = mx + b and can be graphically plotted for
each pair of image points (x,y). In the Hough transform, the
main idea is to consider the characteristics of the straight line
not as image points x or y, but in terms of its parameters, here
the slope parameter m and the intercept parameter b. Based on
that fact, the straight line y = mx + b can be represented as a
point (b, m) in the parameter space.
Reference [7] shows that if one uses the slope and intercept parameters m and b, one faces the problem that vertical lines give rise to unbounded values of the parameters m and b. For computational reasons, it is therefore better to parameterize the lines in the Hough transform with two other parameters, commonly referred to as r and θ (theta). The parameter r represents the distance between the line and the origin, while θ is the angle of the vector from the origin to the closest point on the line. Using this parameterization, the equation of the line can be written as:

    y = -(cos θ / sin θ) x + r / sin θ        (1)

which can be rearranged to

    r = x cos θ + y sin θ                     (2)
It is therefore possible to associate to each line of the image a couple (r, θ), which is unique if θ ∈ [0, π) and r ∈ R, or if θ ∈ [0, 2π) and r ≥ 0. The (r, θ) plane is sometimes referred to as the Hough space for the set of straight lines in two dimensions.
For implementation, the Hough transform algorithm [8] uses an array, called the accumulator, to detect the existence of a line (2). The dimension of the accumulator is equal to the number of unknown parameters of the Hough transform problem; in this case there are two unknown parameters, r and θ.

FIGURE 2
OUTPUT OF LINE DETECTION USING HOUGH TRANSFORM [8]

For each pixel and its neighborhood, the Hough transform algorithm determines if there is enough evidence of an edge at that pixel. If so, it calculates the parameters of that line, looks for the accumulator bin that the parameters fall into, and increases the value of that bin. By finding the bins with the highest values, typically by looking for local maxima in the accumulator space, the most likely lines can be extracted. The simplest way of finding these peaks is by applying a threshold, which in simple terms determines the number of lines detected. Next, these extracted lines are stored in 2 arrays, in (r, θ) and (x1, x2, y1, y2) forms.
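The project itself uses the File Exchange implementation [8]; purely as an illustration of the same accumulator-and-peaks idea, the sketch below uses the hough, houghpeaks and houghlines functions of the Image Processing Toolbox on the binary edge image bw from Step 1. The peak count and thresholds are hypothetical tuning values.

    % Minimal sketch: extract straight lines from the edge image 'bw'
    % via the (r, theta) accumulator, its peaks, and the corresponding segments.
    [H, theta, rho] = hough(bw);                               % accumulator over (r, theta)
    peaks = houghpeaks(H, 20, 'Threshold', 0.3 * max(H(:)));   % local maxima = likely lines
    lines = houghlines(bw, theta, rho, peaks, 'FillGap', 5, 'MinLength', 20);

    % Store each detected segment in (r, theta) form and as end points.
    rt = [[lines.rho]', [lines.theta]'];
    xy = zeros(numel(lines), 4);                               % rows: x1 y1 x2 y2
    for k = 1:numel(lines)
        xy(k, :) = [lines(k).point1, lines(k).point2];
    end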
Step3. Identifying the Vanishing Points
Using the arrays containing the detected lines from Step 2 and the definition of vanishing points, we compute all the possible nC2 intersection points of the extracted lines. The arithmetic mean of these intersection points gives the approximate position of the vanishing point. Further, according to Conclusion.3, vertical lines (85° < slope < 95°) do not contribute to the computation of vanishing points. Therefore, excluding these lines gives us (n - nm)C2 intersection points, where nm is the number of near-vertical lines, and a much better approximation of the vanishing point as:

    (Xf, Yf) = ( Σ_{i=1}^{l(dp)} dp(xi, yi) ) / l(dp)        (3)

where dp() is the array containing all the intersection points computed by ignoring the vertical lines, and l(dp) is the number of such points.
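A minimal sketch of equation (3) is given below. It assumes the end-point array xy (rows of x1 y1 x2 y2) of the non-vertical segments from Step 2; the 2x2 intersection solve is our own illustrative formulation, not the project's exact code.

    % Minimal sketch: vanishing point as the mean of all pairwise intersections
    % of the non-vertical lines (xy is n-by-4, one segment per row: x1 y1 x2 y2).
    n  = size(xy, 1);
    dp = [];                                       % intersection points (xi, yi)
    for i = 1:n-1
        for j = i+1:n
            % Write each supporting line as a*x + b*y = c.
            a1 = xy(i,4) - xy(i,2);  b1 = xy(i,1) - xy(i,3);  c1 = a1*xy(i,1) + b1*xy(i,2);
            a2 = xy(j,4) - xy(j,2);  b2 = xy(j,1) - xy(j,3);  c2 = a2*xy(j,1) + b2*xy(j,2);
            D = a1*b2 - a2*b1;                     % zero when the two lines are parallel
            if abs(D) > eps
                dp(end+1, :) = [(c1*b2 - c2*b1)/D, (a1*c2 - a2*c1)/D]; %#ok<AGROW>
            end
        end
    end
    vp = mean(dp, 1);                              % (Xf, Yf) of equation (3)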


FIGURE 3
VANISHING POINT APPROXIMATION. LEFT-CONSIDERING VERTICAL LINES.
RIGHT-IGNORING VERTICAL LINES. RED CROSSES REPRESENT INTERSECTION
POINTS AND GREEN DOT REPRESENTS VANISHING POINT

Step4. Image Segmentation


We broadly classify the input image into 3 categories, namely sky, vertical and horizontal [1]. This segmentation is based on the characteristic properties of the respective regions.
4.1. Sky: Compared to other parts of the image, the sky has a relatively uniform gradient. Using Assumption.1 and Conclusion.1 we have analysed 3 algorithms, all of which generate a binary mask equal in size to the image; applying the mask to the input image removes the sky:
a) Vertical Traversal (Gradient Based): The image is converted to grayscale so that the 3 RGB planes are merged into one and the gradient can be measured irrespective of colour. Every column, from the first to the last (the width of the image), is traversed vertically, checking every adjacent pixel until a significant change in the gradient is encountered. Thus, every column has a string of zeros in its upper part and ones below the critical pixel. This gives us the required binary mask (a minimal sketch of this traversal is given after this list).
b) Region Growing (Gradient Based): Instead of comparing the pixels in one direction as in (a), this algorithm takes as input a seed pixel (Assumption.1) and, starting from it, compares it with its edge pixels, which are at [-1 0; 1 0; 0 -1; 0 1] given the seed is at [0 0]. All the edge pixels lying within a given threshold are considered part of the region, and the one with the least difference in intensity is assigned as the new seed. This process goes on till all the edge pixels lie outside the gradient threshold. This technique is specifically used in medical sciences for marking tumors in an X-ray or an MRI image [9].
c) K-Means Clustering using the L*a*b* Colour Space
(Colour Based): This method uses the colour of the sky as the filtering parameter. The input image is converted from RGB to the L*a*b* colour space, which consists of the luminosity layer L*, the chromaticity layer a* (red-green) and the chromaticity layer b* (blue-yellow). This conversion is done because we are only concerned with the colour of the sky and not the intensity (which is kept separately in the luminosity layer). The K-Means clustering function then operates on the a*b* layers and returns an index for every pixel in the image. The index is a natural number (1 to k) representing each of the k clusters. Since we do not know the index of the sky pixels, we use Assumption.1 to find the index of a sky pixel and set all the pixels with that index to zero and the rest to one, thus generating the required binary mask.
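As a minimal sketch of algorithm (a), the code below walks down each column of the grayscale version of the input RGB image rgb until the jump between adjacent pixels exceeds a threshold; the threshold value is a hypothetical tuning parameter, and the obstacle cases listed in Table I are deliberately ignored.

    % Minimal sketch of (a) Vertical Traversal: build the sky mask column by column.
    gray = double(rgb2gray(rgb));              % merge the RGB planes into one plane
    [rows, cols] = size(gray);
    thresh = 20;                               % hypothetical gradient threshold (0-255 scale)
    mask = ones(rows, cols);                   % 0 = sky, 1 = keep

    for c = 1:cols
        r = 1;
        while r < rows && abs(gray(r+1, c) - gray(r, c)) < thresh
            r = r + 1;                         % still inside the smooth sky region
        end
        mask(1:r, c) = 0;                      % zeros above the critical pixel, ones below
    end

    skyRemoved = rgb .* uint8(repmat(mask, [1 1 3]));   % apply the binary mask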

TABLE I
PROS AND CONS OF THE 3 ALGORITHMS TESTED

1. Vertical Traversal
+ Faster: 1.135675 s for a 478x434 px image.
- Obstacles such as birds or wires encountered while traversing from top to bottom are treated as part of the vertical structures, so the sky below them is left undetected.
- Unable to remove clouds.

2. Region Growing
+ As all the edge pixels of every pixel are considered, it is more versatile than 1.
- Slower: 14.071094 s for a 478x434 px image.
- Obstacles dividing the sky into 2 disjoint regions cannot be overcome, e.g. an overhead transmission line.

3. K-Means Clustering using the L*a*b* Colour Space
+ As it is colour-based segmentation, it is more efficient at removing sky (blue) and clouds (white/grey).
- Parts of the image other than the sky that have the same colour set as the sky are also removed.
- Time: 18.220051 s for a 478x434 px image.
- Gives erroneous results for sunset/sunrise scenes, where the hue is nowhere close to blue or white.

Region Growing is concluded to be the solution to this problem, as its only drawback can be countered using a bi-directional Gaussian filter, which smoothens any obstacle such as a wire or even clouds, thus improving the efficiency to a great extent. The output can be seen in Figure 4.

FIGURE 4
IMAGE ON THE LEFT IS THE ORIGINAL AND THE RIGHT SIDE SHOWS THE OUTPUT OF THE MODIFIED IMPLEMENTATION OF SKY SEGMENTATION. NOTE THAT THE OBSTACLES IN THE SKY HAVE BEEN HANDLED EFFECTIVELY.
TABLE II
TEST RESULTS FROM THE 3 ALGORITHMS FOR REMOVING SKY
(COLUMNS: VERTICAL TRAVERSAL, REGION GROWING, K-MEANS CLUSTERING)

4.2. Vertical: The vertical region is sandwiched between the sky on the upper side and the horizontal ground on the lower side. After the removal of the sky we need to find the boundary between the vertical and the horizontal. We make use of Conclusion.2 to achieve this task. By joining the lower extremities of all the vertical line segments we get the required boundary, as wherever the vertical structures end, the horizontal plane begins. For the implementation we run line detection [8] again with a much higher resolution in order to detect as many lines as possible. Then, we separate the vertical and non-vertical lines as done for the vanishing point approximation. The next step is to divide the whole image into n columns. The lowermost extremity among the vertical line segments in each of these n columns is marked, and the roipoly() function of MATLAB gives us a polygonal mask to separate the horizontal.
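A minimal sketch of this step is shown below. It assumes the input image rgb and the near-vertical segments in an array vxy (rows of x1 y1 x2 y2) from the second line-detection pass; the number of columns n = 20 is a hypothetical choice.

    % Minimal sketch: polygonal ground mask from the lowest end points of the
    % near-vertical segments in each of n image columns, rasterized with roipoly().
    rows = size(rgb, 1);  cols = size(rgb, 2);
    n = 20;                                       % hypothetical number of columns
    edgesX = linspace(1, cols, n + 1);            % column strip boundaries
    pts = [vxy(:,1) vxy(:,2); vxy(:,3) vxy(:,4)]; % both end points of every segment
    bndX = [];  bndY = [];
    for k = 1:n
        in = pts(:,1) >= edgesX(k) & pts(:,1) < edgesX(k+1);
        if any(in)
            [y, idx] = max(pts(in, 2));           % lowest extremity (largest row index)
            xs = pts(in, 1);
            bndX(end+1) = xs(idx);  bndY(end+1) = y;   %#ok<AGROW>
        end
    end
    % Close the polygon through the bottom corners of the image and rasterize it.
    groundMask = roipoly(rgb(:,:,1), [bndX, cols, 1], [bndY, rows, rows]);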

FIGURE 5
RED REGION REPRESENTS THE MASK TO SELECT THE GROUND. DETECTED
VERTICAL LINE SEGMENTS ARE SHOWN IN BLUE



4.3. Horizontal: The only region left after segmenting the sky and the vertical is the horizontal. This sequence of segmentation is important because the horizontal, in general, has no defining characteristic. Gradient-based region growing cannot be used, as the variations in gradient can be very erratic from case to case. The same holds for texture-based segmentation.
Step5. Image Cropping
After segmenting the image into the 3 regions, the horizontal and vertical regions are cropped and separated. The vertical region is further sliced about the vanishing point computed in Step 3. This gives us trapeziums representing planes in the vertical and horizontal regions.
Step6. Spatial Transform
The various trapeziums so obtained are the projections of the
respective real world planes onto the 2D window which forms
our image. So, effectively these trapeziums can be mapped to
rectangles which would represent the front view of those
structures.


FIGURE 6
PROJECTION OF THE BLUE RECTANGLE ON THE PLANE Z=1

Figure 6 shows the rectangle (blue) projecting as a trapezium on the plane z = 1. The height of the rectangle is 2h and its width is 2b, which we need to calculate. The height of the shorter of the parallel edges of the trapezium is 2h'. The equations of the lines L1 and L2 are:
    l1, l2 :  x / (b - w) = y / h' = z / 1        (4)
Now, applying the constraint that the distance between P1 and P2 is 2h, we get:

    P1 = ( 1/h', h', h(b - w)/h' )                (5)
Once P1 is calculated, we know that the distance between P1 and (b, h, 1) is 2b (the width of the rectangle), which gives:

    b = [ w(1 - h') + √( w^2 h'^2 - h^2 h'^2 + 2hh' + 2h'h^2 + 2h'^3 - 4hh'^2 ) ] / (1 - 2h')        (6)

For the implementation we use a projective transformation in MATLAB to map the trapezium to a rectangle, thus generating the front view.
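As an illustration of this mapping with the Image Processing Toolbox, the sketch below fits a projective transform from the four trapezium corners to the four corners of the target rectangle using maketform('projective', ...) and resamples with imtransform; all corner coordinates and output dimensions here are hypothetical.

    % Minimal sketch: warp a trapezium onto a rectangle to obtain the front view.
    % Corner coordinates and the output size are placeholder values.
    trap = [120 300;  360 260;  360  60;  120  20];   % trapezium corners (x, y), clockwise
    rectW = 240;  rectH = 200;                         % desired 2b-by-2h rectangle in pixels
    rect = [1 rectH;  rectW rectH;  rectW 1;  1 1];    % corresponding rectangle corners

    T = maketform('projective', trap, rect);           % fit the homography from 4 point pairs
    frontView = imtransform(rgb, T, 'XData', [1 rectW], 'YData', [1 rectH]);
    imshow(frontView);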

Step7. 3D Modeling and Texturing
The 3D models are generated in VRML and basically comprise planes perpendicular to one another. The size, position and orientation of the rectangular planes are derived from the respective front views generated in Step 6. Further, the planes are textured using the same images.
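The planes themselves were authored with V-Realm Builder; purely to illustrate what such a textured plane looks like in VRML, the MATLAB sketch below writes one axis-aligned rectangle with an image texture to a .wrl file. The file names and the 4 x 3 plane size are hypothetical.

    % Minimal sketch: emit a single textured rectangular plane as VRML 2.0.
    % 'scene.wrl' and 'wall.png' (a front view from Step 6) are placeholders.
    fid = fopen('scene.wrl', 'w');
    fprintf(fid, '#VRML V2.0 utf8\n');
    fprintf(fid, 'Shape {\n');
    fprintf(fid, '  appearance Appearance { texture ImageTexture { url "wall.png" } }\n');
    fprintf(fid, '  geometry IndexedFaceSet {\n');
    fprintf(fid, '    coord Coordinate { point [ -2 0 0, 2 0 0, 2 3 0, -2 3 0 ] }\n');
    fprintf(fid, '    coordIndex [ 0 1 2 3 -1 ]\n');
    fprintf(fid, '    texCoord TextureCoordinate { point [ 0 0, 1 0, 1 1, 0 1 ] }\n');
    fprintf(fid, '  }\n');
    fprintf(fid, '}\n');
    fclose(fid);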
VI. CONCLUSION
Automation of the whole process has been achieved successfully. An algorithm for machine learning of perspective images and effective image segmentation algorithms for the 3 regions have been developed. The project has certain limitations, such as limited viewing due to the single image, blurring of certain parts of the image due to the lack of clarity at depth, and the inability to operate on images having no straight lines. However, the first 2 limitations can be eliminated by using multiple images and image registration techniques.
The code is developed using MATLAB R2006a with the Image Processing Toolbox 6.1. The 3D models have been created using V-Realm Builder 2.0, which comes along with the Virtual Reality Toolbox 4.7.

FIGURE 7
SNAPSHOTS OF THE OUTPUTS GENERATED FOR 2 IMAGES

VII. ACKNOWLEDGEMENT
I convey my sincere thanks to Prof. Suman Mitra for
providing his valuable and timely guidance.
REFERENCES
[1] Hoiem, D., Efros, A. A., and Hebert, M., "Automatic Photo Pop-up," ACM SIGGRAPH, 2005.
[2] Chen, E., "QuickTime VR - an image-based approach to virtual environment navigation," ACM SIGGRAPH, 29-38, 1995.
[3] Cipolla, R., Robertson, D., and Boyer, E., "PhotoBuilder - 3D models of architectural scenes from uncalibrated images," IEEE Int. Conf. on Multimedia Computing and Systems, vol. I, 25-31, 1999.
[4] Beveridge, J. R., and Riseman, E. M., "Hybrid Weak-Perspective and Full-Perspective Matching," IEEE, 1992.
[5] Horry, Y., Anjyo, K.-I., and Arai, K., "Tour into the picture: using a spidery mesh interface to make animation from a single image," ACM SIGGRAPH, 225-232, 1997.
[6] www.pages.drexel.edu/~weg22/edge.html
[7] Duda, R. O., and Hart, P. E., "Use of the Hough Transformation to Detect Lines and Curves in Pictures," Comm. ACM, vol. 15, no. 1, 11-15, Jan. 1972.
[8] Peng, T., "Detecting Lines in grayscale image using Hough Transform," http://www.mathworks.com/matlabcentral/fileexchange/loadCategory.do, Dec 2005.
[9] Kroon, D., "Segmentation by growing a region from seed point using intensity mean measure," http://www.mathworks.com/matlabcentral/fileexchange/loadCategory.do, March 2008.
