
Computer Vision Notes 

Don’t forget to bring a calculator.

Confirmed Midterm Exam Guide (Kisi-kisi UTS) 


● Point-based processing: Image transformation, Histogram equalization 
● Area-based processing: Filtering → Convolution and Correlation 
● Canny Edge Detector (Explain how the edge detector works, step-by-step) 
● Harris Corner Detector (Most probably explaining how it works step-by-step again) 
● Case: SIFT and SURF explanation (according to Pak Diaz), paper-related, in the following link.

Point-based Processing 

Image Transformation (Transformasi Citra) 


For an in-depth explanation, you may open the following link.
 
Image transformation can be achieved through matrix multiplication. 

Rotation (Rotasi) 
The following formula is used to rotate an image, where θ (theta) is the angle of rotation:

$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}$$
Easy way to remember rotation

Say you want to rotate a vector $(x, y)$ by 90 degrees, twice, counterclockwise.

In the first rotation, the coordinates of the vector become:

$(-y, x)$

In the second rotation, the coordinates become:

$(-x, -y)$

Now, mathematically, we can do this 90-degree rotation by multiplying some unknown 2×2 matrix with the vector, twice:
 
First multiplication:

$$\begin{bmatrix} a & b \\ c & d \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}$$

The result would be

$$\begin{bmatrix} ax + by \\ cx + dy \end{bmatrix} = \begin{bmatrix} -y \\ x \end{bmatrix}$$
Second multiplication:

$$\begin{bmatrix} a & b \\ c & d \end{bmatrix} \begin{bmatrix} -y \\ x \end{bmatrix}$$

The result would be

$$\begin{bmatrix} -ay + bx \\ -cy + dx \end{bmatrix} = \begin{bmatrix} -x \\ -y \end{bmatrix}$$
Solving these equations, the full result matrix with entries a, b, c, d is

$$\begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}$$

Since cos 90° = 0, sin 90° = 1, and −(sin 90°) = −1, we can rewrite the result matrix as:

$$\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$$
Hurray ​\(^ω^\)  
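To check the trick numerically, here is a minimal NumPy sketch (the example vector is our own, not from the notes):

```python
import numpy as np

def rotation_matrix(theta_deg):
    """2x2 counterclockwise rotation matrix for an angle in degrees."""
    t = np.deg2rad(theta_deg)
    return np.array([[np.cos(t), -np.sin(t)],
                     [np.sin(t),  np.cos(t)]])

R = rotation_matrix(90)
v = np.array([1.0, 0.0])           # example vector (1, 0)

v1 = R @ v                         # first 90-degree rotation  -> (0, 1)
v2 = R @ v1                        # second 90-degree rotation -> (-1, 0)
print(np.round(v1), np.round(v2))  # [0. 1.] [-1. -0.]
```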

Shearing (Shear) 
Shearing (a.k.a. skewing) is an operation that displaces lines vertically or horizontally, depending on the shear matrix used.
 
There are two types of shearing: 
 
● Vertical 
This type of shearing displaces lines vertically, depending on the values of α and x:

$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ \alpha & 1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}, \qquad y' = \alpha x + y$$
● Horizontal 
This type of shearing displaces lines horizontally, depending on the values of α and y:

$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} 1 & \alpha \\ 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}, \qquad x' = x + \alpha y$$
Scaling 
A transformation that enlarges or shrinks the image by a certain scale factor (constant).
 
There are two kinds of scaling transformations: 
 
● Uniform/Isotropic scaling (same constant)
This type of scaling uses the same scale factor for the x and y components of the vector:

$$\begin{bmatrix} s & 0 \\ 0 & s \end{bmatrix}$$
● Non-uniform/Anisotropic scaling (different constants)
This type of scaling uses different scale factors for the x and y components of the vector:

$$\begin{bmatrix} s_x & 0 \\ 0 & s_y \end{bmatrix}$$
Translation (Translasi) 
A transformation that moves every component of the image by a given distance. It cannot be written as the product of a 2×2 matrix with a 2×1 vector.
 
● Homogeneous Coordinates (Koordinat Homogen) 

To allow translation, the image must use homogeneous coordinates, where the 2D vector $(x, y)$ is represented as a 3D vector $(x, y, z)$, with $z$ acting as a scale for the $x$ and $y$ components.
 
● Translation with Homogeneous Coordinates (Translasi dalam Koordinat Homogen) 
Translation can be written as the product of a 3×3 matrix with a homogeneous vector (with $z = 1$):

$$\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & \alpha \\ 0 & 1 & \beta \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}$$
Where 𝑥 is moved by 𝛼 units and 𝑦 by 𝛽. 

Converting a 2x2 matrix to 3x3 for homogeneous coordinates

The 2×2 transformation matrix can be converted to a 3×3 matrix for transformation with homogeneous coordinates:

$$\begin{bmatrix} a & b \\ c & d \end{bmatrix} \rightarrow \begin{bmatrix} a & b & 0 \\ c & d & 0 \\ 0 & 0 & 1 \end{bmatrix}$$
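A minimal sketch of composing transformations in homogeneous coordinates (the point and the offsets are made-up examples):

```python
import numpy as np

def translation(alpha, beta):
    """3x3 homogeneous translation: moves x by alpha and y by beta."""
    return np.array([[1, 0, alpha],
                     [0, 1, beta],
                     [0, 0, 1]], dtype=float)

def rotation_h(theta_deg):
    """2x2 rotation embedded in a 3x3 homogeneous matrix."""
    t = np.deg2rad(theta_deg)
    return np.array([[np.cos(t), -np.sin(t), 0],
                     [np.sin(t),  np.cos(t), 0],
                     [0,          0,         1]])

p = np.array([2.0, 1.0, 1.0])      # point (2, 1) with z = 1
# Rotate 90 degrees, then translate by (5, -3), as one matrix product:
M = translation(5, -3) @ rotation_h(90)
print(np.round(M @ p))             # [ 4. -1.  1.], i.e. (-1, 2) + (5, -3)
```

Because translation is now just another matrix, any chain of transformations collapses into a single 3×3 matrix product.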

Histogram Equalization 
To calculate the equalized histogram, use the CDF (cumulative distribution function):

$$\mathrm{CDF}(k) = \frac{1}{MN}\sum_{j=0}^{k} f_j$$

By calculating the CDF, we can obtain the new intensity of every level by rounding each $f_N$ result (the table below rounds to the nearest integer; the worked example later in these notes floors instead):

$$f_N = \mathrm{CDF}(k) \times (L - 1)$$

L = number of intensity levels
f_k = frequency of intensity k (so the sum above is the cumulative frequency)
 
Example:

| Intensity | f_k | CDF   | f_N  | New intensity | New f_k   |
|-----------|-----|-------|------|---------------|-----------|
| 0         | 2   | 2/25  | 0.56 | 1             | 2         |
| 1         | 4   | 6/25  | 1.68 | 2             | 4         |
| 2         | 5   | 11/25 | 3.08 | 3             | 5         |
| 3         | 2   | 13/25 | 3.64 | 4             | ↓         |
| 4         | 3   | 16/25 | 4.48 | 4             | 2 + 3 = 5 |
| 5         | 3   | 19/25 | 5.32 | 5             | 3         |
| 6         | 3   | 22/25 | 6.16 | 6             | 3         |
| 7         | 3   | 25/25 | 7    | 7             | 3         |

(Intensities 3 and 4 both map to 4, so their frequencies merge: 2 + 3 = 5.)

  
 

Intensity Transformation (point operators) 


Image Negative

$$s = (L - 1) - r$$

where s is the output intensity value for input intensity r.

Using the equation above, we reverse the intensity levels of an image to produce the equivalent of a photographic negative.

This type of processing is suited for enhancing white or gray detail embedded in mostly dark regions of an image.

 
Log Transformation

$$s = c \log(1 + r)$$

where c is a constant (usually 1) and $r \ge 0$.

This type of transformation is suited for expanding the dark values in an image while compressing the high-intensity values.
 
 
We can see from Figure 3.3 that:
● The log function maps a narrow range of low input intensities to a wide range of output levels, and a wide range of high input intensities to a narrow range of output levels.
● The inverse log function does the opposite (low intensity -> narrow output, high intensity -> wide output).
 
Power Law (Gamma) Transformation

$$s = c\,r^{\gamma}$$

where s is the output intensity value, and c and γ are positive constants.


 
● Gamma transformation is more versatile than the log transformation for compressing intensity values.
● A variety of devices used for image capture, printing, and display respond according to a power law. The process used to correct these power-law response phenomena is called gamma correction.
 
 
 
We can see from Figure 3.6 that:
● Fractional values (0 < γ < 1) map a narrow range of low input intensities to a wider range of output values, while the opposite is true for high input intensities. (Lowering a fractional gamma may reduce the contrast of an image and make it look “washed out”.)
● Values of γ greater than 1 map a wide range of low input intensities to a narrow range of output values, while the opposite is true for high input intensities.
● The gamma transformation becomes the identity transformation when γ = 1.
 
Example of gamma correction:

A CRT device has an intensity-to-voltage response that is a power function with exponent γ = 2.5. Looking at Figure 3.6 (γ = 2.5), the response of the CRT tends to produce a darker image. We see in Figure 3.7(b) that indeed the image viewed on the CRT monitor is darker than the original image in Figure 3.7(a).

Thus we need to apply gamma correction with the power-law transformation

$$s = r^{1/2.5} = r^{0.4}$$

with c = 1 before displaying the image on the CRT monitor.

 
Gamma correction is useful for:
● Displaying an image accurately on a computer screen.
● Reproducing the colors of an image correctly (gamma changes not only the intensity values but also the ratio of red, green, and blue in a color image).
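All three point operators above are one-liners on a NumPy array. A sketch, assuming an 8-bit grayscale image (L = 256); the scaling constants are our own choices to keep outputs in range:

```python
import numpy as np

L = 256                                   # number of intensity levels (8-bit)
img = np.arange(256, dtype=np.uint8)      # stand-in for a real grayscale image
r = img.astype(float)

negative = (L - 1) - r                    # s = (L-1) - r

c = (L - 1) / np.log(L)                   # chosen so outputs stay in [0, 255]
log_t = c * np.log(1 + r)                 # s = c * log(1 + r)

gamma = 0.4                               # fractional gamma brightens the image
power = (L - 1) * (r / (L - 1)) ** gamma  # s = c * r^gamma on normalized input

out = np.clip(power, 0, L - 1).astype(np.uint8)
```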
 
Histogram Equalization 
 
The probability of occurrence of an input intensity level in a digital image is approximated by:

$$p_r(r_j) = \frac{n_j}{MN}$$
where
● $r_j$ is the input intensity level j (0 to L−1, where L is the color bit depth, i.e., the number of bins).
● $n_j$ is the number of pixels that have intensity level j.
● M is the number of pixel rows and N is the number of pixel columns (for example, if the image resolution is 640 × 480, then MN = 307200).

The discrete form of the transformation function (the CDF) is:

$$s_k = T(r_k) = (L-1)\sum_{j=0}^{k} p_r(r_j) = \frac{L-1}{MN}\sum_{j=0}^{k} n_j$$

where $s_k$ (in the output image) is the mapping of each corresponding pixel with intensity $r_k$ (in the input image).
 
Example:
Say there is a 3-bit image represented as a 5×5 matrix:

5 6 3 1 5
1 2 5 3 3
6 4 1 7 7
3 4 0 6 2
2 7 5 0 5
 
We can calculate the frequency of each intensity value:

| Intensity $r_j$ | Frequency $n_j$ |
|-----------------|-----------------|
| 0               | 2               |
| 1               | 3               |
| 2               | 3               |
| 3               | 4               |
| 4               | 2               |
| 5               | 5               |
| 6               | 3               |
| 7               | 3               |

Since a 3-bit image contains 8 different intensity values, (L−1) = (8−1) = 7, and MN = 5×5 = 25.
 
The equation becomes:

$$s_k = \frac{7}{25}\sum_{j=0}^{k} n_j$$

Calculate each $s_k$ from 0 to 7:
s_0 = 7/25 × 2 = 0.56
s_1 = 7/25 × (2 + 3) = 1.4
s_2 = 7/25 × (2 + 3 + 3) = 2.24
s_3 = 7/25 × (2 + 3 + 3 + 4) = 3.36
s_4 = 7/25 × (2 + 3 + 3 + 4 + 2) = 3.92
s_5 = 7/25 × (2 + 3 + 3 + 4 + 2 + 5) = 5.32
s_6 = 7/25 × (2 + 3 + 3 + 4 + 2 + 5 + 3) = 6.16
s_7 = 7/25 × (2 + 3 + 3 + 4 + 2 + 5 + 3 + 3) = 7
 
Floor all fractional results, since pixel values cannot be fractions (IIRC PaoPao said to round down):

s_0 = 0 (no change)
s_1 = 1 (no change)
s_2 = 2 (no change)
s_3 = 3 (no change)
s_4 = 3 (changed)
s_5 = 5 (no change)
s_6 = 6 (no change)
s_7 = 7 (no change)
 
Since only the pixels of intensity 4 are mapped to 3 in the output image, we replace intensity 4 with 3, and the output image matrix becomes (changes in bold):

5 6 3 1 5
1 2 5 3 3
6 **3** 1 7 7
3 **3** 0 6 2
2 7 5 0 5

(Admittedly a poor example, since the histogram distribution was fairly balanced to begin with.)
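The whole worked example can be reproduced in a few NumPy lines; a sketch that follows the flooring convention used above:

```python
import numpy as np

def equalize(img, L=8):
    """Histogram equalization: s_k = floor((L-1)/MN * cumulative count)."""
    counts = np.bincount(img.ravel(), minlength=L)   # n_j per intensity
    cdf = np.cumsum(counts)                          # cumulative frequency
    s = np.floor((L - 1) * cdf / img.size).astype(img.dtype)
    return s[img]                                    # map every pixel r -> s_r

img = np.array([[5, 6, 3, 1, 5],
                [1, 2, 5, 3, 3],
                [6, 4, 1, 7, 7],
                [3, 4, 0, 6, 2],
                [2, 7, 5, 0, 5]])
print(equalize(img))   # only intensity 4 changes, mapping to 3
```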
 
Spatial Transformation (Neighbourhood operations)
 
Definition of filters 
 
There are two kinds of filter:
● Low-pass filter: a filter that passes low frequencies; the effect produced by this filter is blurring/smoothing an image (such filters are also called averaging filters).
● High-pass filter: a filter that passes high frequencies; the effect produced by this filter is sharpening (if the result of the filter is added back to the original image).
 
We can achieve these effects by using spatial filters (also called spatial masks). A spatial filter consists of:
1. A neighbourhood, typically a small rectangle.
2. A predefined operation that is performed on the image pixels encompassed by the neighbourhood.

Spatial filtering creates a new pixel (in the output image) with coordinates equal to the coordinates of the center of the neighbourhood. If the operation performed on the image is linear, the filter is called a linear spatial filter; otherwise the filter is nonlinear.
Spatial Correlation and Convolution 
 
There are two methods of spatial filtering:
1. Correlation is the process of moving the filter mask over the image and computing the sum of products at each location.
2. Convolution is the same as correlation, but the filter mask is rotated 180 degrees first.

Note that if the filter mask is symmetric, then correlation and convolution lead to the same result.
 
 

 
Here is a step-by-step video on how to convolve a mask with an image:
https://youtu.be/XuD4C8vJzEQ?t=185
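To make the sum of products and the 180-degree rotation explicit, here is a direct (unoptimized) sketch with a toy image:

```python
import numpy as np

def correlate2d(image, mask):
    """Slide the mask over the image and take the sum of products at
    every location (zero padding keeps the output the same size)."""
    kh, kw = mask.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)))
    out = np.zeros_like(image, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * mask)
    return out

def convolve2d(image, mask):
    """Convolution = correlation with the mask rotated 180 degrees."""
    return correlate2d(image, mask[::-1, ::-1])

image = np.array([[0, 0, 0, 0, 0],
                  [0, 0, 1, 0, 0],
                  [0, 0, 0, 0, 0]])
box = np.ones((3, 3)) / 9.0   # symmetric box filter
# For a symmetric mask both operations give the same result:
print(np.allclose(correlate2d(image, box), convolve2d(image, box)))  # True
```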

 
 
 
Smoothing Spatial Filters (averaging)

● Smoothing is analogous to integration.
● Smoothing filters are used for blurring (removal of small details in an image) and noise reduction.

Blurring occurs because replacing each pixel with the average intensity of its neighbourhood reduces sharp transitions in intensity between adjacent pixels, which also leads to noise reduction. However, edges (which are also characterized by sharp intensity transitions) get blurred as well.

 
● The mask in Figure 3.32(a) is called a box filter because all the coefficients in the matrix are the same.
● The mask in Figure 3.32(b) is called a weighted average filter; this terminology indicates that pixels are multiplied by different coefficients, giving more importance/weight to some pixels (in this case, the closer a pixel is to the center, the larger its coefficient).

Sharpening Spatial Filters

● Sharpening is analogous to differentiation.
● Sharpening filters are based on first-order and second-order derivatives.
● All the coefficients in the mask must sum to 0 (an image with constant intensity must have zero derivative).
 
Unsharp Masking and High Boost Filtering

Unsharp masking is sharpening an image by subtracting an unsharp (smoothed) version of the original image from the original image.
High-boost filtering is multiplying the mask created from unsharp masking by a constant k > 1.

The process of unsharp masking and high-boost filtering is (a sketch follows this list):
1. Blur the original image.
2. Subtract the blurred image from the original (this difference is the mask).
3. Multiply the mask by some constant k > 1.
4. Add the multiplied mask to the original image.
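A sketch of these four steps, using SciPy's Gaussian filter for the blur (k and sigma are arbitrary example values):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def high_boost(image, k=1.5, sigma=1.0):
    """Unsharp masking (k = 1) / high-boost filtering (k > 1)."""
    img = image.astype(float)
    blurred = gaussian_filter(img, sigma)   # step 1: blur
    mask = img - blurred                    # step 2: unsharp mask
    sharpened = img + k * mask              # steps 3-4: weight and add back
    return np.clip(sharpened, 0, 255).astype(np.uint8)

# usage: sharp = high_boost(gray_image, k=2.0)
```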

 
Edge Detection

How does a computer define an edge? It is a sudden change in colour, or in colour intensity.
Mathematical definition: an edge is a zero-crossing point of the second derivative.

All you have to understand is that the derivative is the gradient (slope) at any point on the graph.
The first-derivative graph is obtained by calculating the gradient at every point of the colour-intensity graph.
The second-derivative graph is obtained by calculating the gradient at every point of the first-derivative graph.
First and second order derivative

The first-order derivative in a digital image is defined as the partial derivative with respect to both x and y:

$$\frac{\partial f}{\partial x} = f(x+1, y) - f(x, y), \qquad \frac{\partial f}{\partial y} = f(x, y+1) - f(x, y)$$

And for the second-order derivative:

$$\frac{\partial^2 f}{\partial x^2} = f(x+1, y) + f(x-1, y) - 2f(x, y), \qquad \frac{\partial^2 f}{\partial y^2} = f(x, y+1) + f(x, y-1) - 2f(x, y)$$
 
 
Laplacian Edge Detection

The second-order derivative in image processing is implemented using the Laplacian operator.

The Laplacian operator is defined as the sum of the second-order derivatives with respect to x and y:

$$\nabla^2 f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2}$$

The equation above can be implemented as a 3×3 filter mask:

$$\begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix} \qquad \begin{bmatrix} 1 & 1 & 1 \\ 1 & -8 & 1 \\ 1 & 1 & 1 \end{bmatrix}$$

● The left mask does not take diagonal pixels into account when computing the derivative and is invariant to 90-degree rotation.
● The right mask is an extension of the original equation that also takes the diagonal pixels into account, and is invariant to 90- and 45-degree rotation.
● A rotation-invariant mask is called an isotropic filter.
● We can sharpen an image by adding the result of filtering with the Laplacian mask to the original image.
 
 
The first-order derivative in image processing is implemented using the Sobel masks, among others.
 
Sobel Edge Detection

The two principal Sobel masks are:

$$\begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix} \qquad \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}$$

● One mask computes the derivative in the horizontal direction, the other in the vertical direction.
● The masks in the second row of the original figure compute the derivative in the diagonal directions.
● Sobel also smooths the image while differentiating.

Canny Edge Detector 


The Canny edge detector is an edge detection operator that uses a multi-stage algorithm to detect a wide range of edges in images.

The Canny edge detection algorithm consists of the following steps:

1. Noise reduction with a Gaussian filter
Convolve the image with a Gaussian filter to remove noise, since edge detection is highly sensitive to noise.
2. Gradient magnitude
First, we detect edge intensity and direction by calculating the gradient of the image using the Sobel filter along the x and y axes. Then we calculate the magnitude as the hypotenuse of the x and y derivatives. Finally, we calculate the direction (in degrees) of the gradient as the arctangent of the ratio of the y and x derivatives.
3. Non-maxima suppression
Non-maxima suppression is a method to eliminate spurious (read: false) edges and corners in Canny and Harris. By using it, you get only the true edges and corners. Basically, NMS checks the direction of the edge or corner and then checks the surrounding pixels, eliminating pixels with low intensity and keeping the high-intensity pixels.
4. Double thresholding
Set a high threshold to identify strong pixels (intensity higher than the high threshold) and a low threshold to identify non-relevant pixels (intensity lower than the low threshold). Pixels with intensity between the high and low thresholds are flagged as weak.
5. Hysteresis thresholding
Hysteresis helps us decide whether the pixels flagged by double thresholding should be considered strong or non-relevant. A weak pixel is transformed into a strong pixel if there is at least one strong pixel around it.
 
More details about Canny here!
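If you just want to see the five steps in action, OpenCV bundles steps 2-5 into a single call, so only the Gaussian blur of step 1 appears explicitly; a sketch with a made-up file name and thresholds:

```python
import cv2

# Hypothetical file name; any grayscale image works.
img = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)

# Step 1 done explicitly; cv2.Canny then performs the gradient computation,
# non-maxima suppression, double thresholding, and hysteresis (steps 2-5).
blurred = cv2.GaussianBlur(img, (5, 5), 1.4)
edges = cv2.Canny(blurred, threshold1=50, threshold2=150)

cv2.imwrite("edges.jpg", edges)
```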
Harris Corner Detector 
 
The Harris corner detector algorithm consists of the following steps:
1. Convert the image to grayscale and compute the image derivatives (optionally smooth it first).
2. Find the second moment matrix / structure tensor matrix M by approximating the response difference using a first-order Taylor expansion.
3. Plug the eigenvalues of the matrix M into the corner response function to get the response value R.
4. Perform non-maxima suppression on the list of candidate corners (R) to find the correct corners.
(gonna explain all of this later, maybe)
 
Step 1:
See APPENDIX A for the gradient, and the Smoothing Spatial Filters section for smoothing.
 
Step 2:
In this step we're going to use the SSD (sum of squared differences) to extract the structure tensor matrix. The purpose of this is to find the biggest response difference when we move the window in any direction (finding a candidate corner point, in a nutshell).
 
This window operation is mathematically defined as:

$$E(u, v) = \sum_{x,y} w(x, y)\,\big[I(x+u,\, y+v) - I(x, y)\big]^2$$

● E is the difference between the original and the moved window.
● u is the window's displacement in the x direction.
● v is the window's displacement in the y direction.
● w(x, y) is the window at position (x, y). This acts like a mask, ensuring that only the desired window is used.
● I is the intensity of the image at a position (x, y).
● I(x+u, y+v) is the intensity of the moved window.
● I(x, y) is the intensity of the original.

Let's ignore w(x, y) for now and focus on the squared difference:

$$\sum_{x,y} \big[I(x+u,\, y+v) - I(x, y)\big]^2$$

We can approximate $I(x+u, y+v)$ using a first-order multivariate Taylor expansion, and the equation becomes:

$$\sum_{x,y} \big[I(x, y) + u I_x + v I_y - I(x, y)\big]^2$$

In the above equation, $I(x, y)$ cancels out, so expanding gives:

$$\sum_{x,y} u^2 I_x^2 + 2uv\, I_x I_y + v^2 I_y^2$$

This can be turned into a matrix-vector multiplication form (since the summation depends only on x and y, we can leave $\begin{bmatrix} u & v \end{bmatrix}$ and its transpose outside the summation):

$$E(u, v) \approx \begin{bmatrix} u & v \end{bmatrix} \left( \sum_{x,y} \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix} \right) \begin{bmatrix} u \\ v \end{bmatrix}$$

Now we can extract the matrix in parentheses, which is called the structure tensor matrix / second moment matrix M (adding back w(x, y), since it also sits inside the summation):

$$M = \sum_{x,y} w(x, y) \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}$$

Now the window operation is simplified into:

$$E(u, v) \approx \begin{bmatrix} u & v \end{bmatrix} M \begin{bmatrix} u \\ v \end{bmatrix}$$
Step 3:

Compute the eigenvalues of every matrix M (one for each x, y coordinate).

Forgot how to calculate eigenvalues? Here's a refresher: solve

$$\det(M - \lambda I) = 0$$

Then plug them into the response function.

The response function is defined as:

$$R = \det(M) - k\,(\operatorname{trace} M)^2 = \lambda_1 \lambda_2 - k\,(\lambda_1 + \lambda_2)^2$$

where k is an empirically chosen constant (typically 0.04 to 0.06).
Step 4:

Perform NMS on the list of R corner coordinates to find the best corners and eliminate unnecessary corner candidates that do not lie on the 'true edges' (see APPENDIX A).
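These four steps map directly onto OpenCV's cornerHarris; a sketch (the file name, blockSize, and the final thresholding are our own choices, and the thresholding is only a crude stand-in for full NMS):

```python
import cv2
import numpy as np

img = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical file name

# blockSize = neighbourhood window for M, ksize = Sobel aperture (step 1),
# k = the empirical constant in R = det(M) - k * trace(M)^2 (step 3).
R = cv2.cornerHarris(np.float32(img), blockSize=2, ksize=3, k=0.04)

# Crude thresholding of the response map as a stand-in for NMS (step 4):
corners = np.argwhere(R > 0.01 * R.max())
print(len(corners), "candidate corners")
```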

Blob Detection 
In computer vision, blob detection methods are aimed at detecting regions in a digital image that differ in properties, such as brightness or color, compared to surrounding regions.
 
Methods:  
● Laplacian of Gaussian (LoG) 
● Difference of Gaussians (DoG) 
● Determinant of Hessian (DoH) 
 
Laplacian of Gaussian

Given an input image I, create a Gaussian-blurred version of it, G. Applying the Laplacian operator to the blurred image (taking the second derivative, which is the very definition of an edge, if you remember) gives you the LoG. (source: http://fourier.eng.hmc.edu/e161/lectures/gradient/node8.html)
 
Difference of Gaussians

Given an input image I, create multiple Gaussian-blurred versions of it with different kernel sizes and take the differences between consecutive pairs; this approximates the Laplacian. (SIFT uses this.)
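A sketch of DoG with SciPy; sigma and the scale ratio k ≈ 1.6 (the classic LoG-approximating choice) are conventional values, not from the notes:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def difference_of_gaussians(image, sigma=1.0, k=1.6):
    """Blur the image at two nearby scales and subtract; the result
    approximates the Laplacian of Gaussian at that scale."""
    img = image.astype(float)
    return gaussian_filter(img, sigma) - gaussian_filter(img, k * sigma)
```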
 
 
 
 
Determinant of Hessian

The Hessian operator is, simply put, a better version of the Laplacian operator. (SURF uses this.)
(source: https://en.wikipedia.org/wiki/Blob_detection#The_determinant_of_the_Hessian)

This is simply because the Hessian operator contains more information: it contains all the possible second-order partial derivatives, whereas the Laplacian operator only stores the sum of the second-order partial derivatives. The Hessian matrix looks like:

$$H = \begin{bmatrix} \dfrac{\partial^2 f}{\partial x^2} & \dfrac{\partial^2 f}{\partial x \partial y} \\ \dfrac{\partial^2 f}{\partial y \partial x} & \dfrac{\partial^2 f}{\partial y^2} \end{bmatrix}$$
Hough Transform 
The Hough transform is a feature extraction technique used in image analysis, computer vision, and digital image processing.[1] The purpose of the technique is to find imperfect instances of objects within a certain class of shapes by a voting procedure. This voting procedure is carried out in a parameter space, from which object candidates are obtained as local maxima in a so-called accumulator space that is explicitly constructed by the algorithm for computing the Hough transform.
The classical Hough transform was concerned with the identification of lines in the image, but it has since been extended to identifying positions of arbitrary shapes, most commonly circles or ellipses.
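A sketch of the voting procedure in practice, using OpenCV's classical line transform (the file name and vote threshold are made up):

```python
import cv2
import numpy as np

img = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical file name
edges = cv2.Canny(img, 50, 150)                      # Hough votes on an edge map

# Each detected line comes back as (rho, theta) in parameter space;
# `threshold` is the minimum number of accumulator votes.
lines = cv2.HoughLines(edges, rho=1, theta=np.pi / 180, threshold=100)
if lines is not None:
    for rho, theta in lines[:, 0]:
        print(f"rho={rho:.1f}, theta={np.degrees(theta):.1f} deg")
```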

Image Descriptors 
● Most features can be thought of as templates, histograms (counts), or combinations 
● The ideal descriptor should be 
○ Robust and Distinctive 
○ Compact and Efficient 
● Most available descriptors focus on edge/gradient information 
○ Capture texture information 
○ Color rarely used 
 
Main Components 
1. Detection: Identify the interest points 
2. Description: Extract vector feature descriptor surrounding each interest point. 
3. Matching: Determine correspondence between descriptors in two views 
Scale Invariant Feature Transform (SIFT) 
https://en.wikipedia.org/wiki/Scale-invariant_feature_transform 
 
 
 
The scale-invariant feature transform (SIFT) is a feature detection algorithm in computer vision to detect and describe local features in images, published by David Lowe in 1999. SIFT can robustly identify objects even among clutter and under partial occlusion, because the SIFT feature descriptor is invariant to uniform scaling, orientation, and illumination changes, and partially invariant to affine distortion.

Affine distortion example:
An image of a fern-like fractal that exhibits affine self-similarity.
 
SIFT keypoints of objects are first extracted from a set of reference images and stored in a database. An object is recognized in a new image by individually comparing each feature from the new image to this database and finding candidate matching features based on the Euclidean distance of their feature vectors.
Key locations are defined as maxima and minima of the result of a difference of Gaussians (DoG) function applied in scale space to a series of smoothed and resampled images.

SIFT uses a modified version of the k-d tree (a binary search tree that stores k-dimensional coordinates) called the best-bin-first search method[14] that can identify the nearest neighbors with high probability using only a limited amount of computation. The BBF algorithm uses a modified search ordering for the k-d tree algorithm so that bins in feature space are searched in the order of their closest distance from the query location. This search order requires the use of a heap-based priority queue for efficient determination of the search order. The best candidate match for each keypoint is found by identifying its nearest neighbor in the database of keypoints from training images. The nearest neighbors are defined as the keypoints with minimum Euclidean distance from the given descriptor vector. The probability that a match is correct can be determined by taking the ratio of the distance from the closest neighbor to the distance of the second closest.

Lowe[3] rejected all matches in which the distance ratio is greater than 0.8, which eliminates 90% of the false matches while discarding less than 5% of the correct matches. To further improve efficiency, the best-bin-first search was cut off after checking the first 200 nearest-neighbor candidates. For a database of 100,000 keypoints, this provides a speedup over exact nearest-neighbor search by about 2 orders of magnitude, yet results in less than a 5% loss in the number of correct matches.
SIFT uses the Hough transform to identify clusters of features with a consistent interpretation, by using each feature to vote for all object poses that are consistent with the feature. When clusters of features are found to vote for the same pose of an object, the probability of the interpretation being correct is much higher than for any single feature.

Finally, outliers can now be removed by checking for agreement between each image feature and the model, given the parameter solution. Given the linear least squares solution (linear regression), each match is required to agree within half the error range that was used for the parameters in the Hough transform bins. As outliers are discarded, the linear least squares solution is re-solved with the remaining points, and the process is iterated. If fewer than 3 points remain after discarding outliers, the match is rejected. In addition, a top-down matching phase is used to add any further matches that agree with the projected model position, which may have been missed from the Hough transform bin due to the similarity-transform approximation or other errors.

The final decision to accept or reject a model hypothesis is based on a detailed probabilistic model.[15] This method first computes the expected number of false matches to the model pose, given the projected size of the model, the number of features within the region, and the accuracy of the fit. A Bayesian probability analysis then gives the probability that the object is present based on the actual number of matching features found. A model is accepted if the final probability for a correct interpretation is greater than 0.98. Lowe's SIFT-based object recognition gives excellent results except under wide illumination variations and under non-rigid transformations.
 
SIFT consists of the following steps:
1. Scale-space extrema detection
Use the difference of Gaussians (DoG) to identify potential interest points that are invariant to scale and orientation.
2. Keypoint localization
Reject low-contrast points and eliminate edge responses.
3. Orientation assignment
Each keypoint is assigned one or more orientations based on local image gradient directions, to achieve invariance to rotation.
4. Keypoint descriptor
Compute a descriptor vector for each keypoint, such that the descriptor is highly distinctive and partially invariant to the remaining variations.
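A sketch of the detection/description/matching pipeline with OpenCV, including Lowe's 0.8 ratio test from above (file names are hypothetical; BFMatcher stands in for best-bin-first; SIFT_create needs OpenCV >= 4.4):

```python
import cv2

# Hypothetical file names for a reference object and a cluttered scene.
img1 = cv2.imread("object.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)  # keypoints + descriptors
kp2, des2 = sift.detectAndCompute(img2, None)

# Brute-force matcher instead of best-bin-first; the ratio test keeps a
# match only if it is clearly closer than the second-best candidate.
matcher = cv2.BFMatcher()
matches = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.8 * n.distance]
print(len(good), "matches passed the 0.8 ratio test")
```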
 

Speeded up robust features (SURF) 


SURF is partly inspired by the scale-invariant feature transform (SIFT) descriptor. The standard version of SURF is several times faster than SIFT and claimed by its authors to be more robust against different image transformations than SIFT, but SURF is sometimes less accurate when faced with rotations. It is also only partially invariant to affine distortion, like SIFT, meaning it can sometimes be inaccurate.
To detect interest points, SURF uses an integer approximation of the determinant of Hessian blob detector, which can be computed with 3 integer operations using a precomputed integral image. Its feature descriptor is based on the sum of the Haar wavelet responses around the point of interest. These can also be computed with the aid of the integral image.

APPENDIX A 
Image gradient 
 
Let's define the derivative with respect to x as $g_x$ and with respect to y as $g_y$:

$$g_x = \frac{\partial f}{\partial x}, \qquad g_y = \frac{\partial f}{\partial y}$$

(see the First and second order derivative section for an explanation of the derivation).

To find the edge strength and direction at location (x, y) in image f, we need to compute the gradient:

$$\nabla f = \begin{bmatrix} g_x \\ g_y \end{bmatrix}$$

The magnitude (length) of the gradient vector, denoted M(x, y), is its Euclidean norm:

$$M(x, y) = \sqrt{g_x^2 + g_y^2}$$

The direction of the gradient vector at point (x, y), given as an angle with respect to the x axis, is:

$$\alpha(x, y) = \tan^{-1}\!\left(\frac{g_y}{g_x}\right)$$
 
We can use gradient operators (such as the Sobel masks) to compute edge direction and strength.
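A sketch computing M(x, y) and α(x, y) with Sobel derivatives (the file name is hypothetical):

```python
import cv2
import numpy as np

img = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical file name

gx = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)   # g_x = df/dx
gy = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)   # g_y = df/dy

M = np.hypot(gx, gy)                             # M(x, y) = sqrt(gx^2 + gy^2)
alpha = np.degrees(np.arctan2(gy, gx))           # angle w.r.t. the x axis
```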
 

 
 
Non-Maxima suppression

Example:

Let $d_1$, $d_2$, $d_3$, and $d_4$ denote the four basic edge directions for a 3×3 region: horizontal (0° and ±180°), −45°, vertical (+90° and −90°), and +45°, respectively. We can formulate the following non-maxima suppression scheme for a 3×3 region centered at point (x, y) in the image:
1. Compute the gradient magnitude M(x, y) and angle α(x, y).
2. Find the direction $d_k$ that is closest to α(x, y).
a. For example: if α(x, y) = 20°, the closest direction to the gradient (the edge normal) is horizontal, since (20 − 0) = 20 < (45 − 20) = 25.
b. Since the edge direction is perpendicular to the edge normal, the edge direction is 0 + 90 = +90° and 0 − 90 = −90° (the vertical direction).
3. If the value of M(x, y) is less than at least one of its two neighbors along $d_k$, let f(x, y) = 0 (suppression); otherwise, let f(x, y) = M(x, y).
a. Continuing the example in 2: the two neighbors along the horizontal direction $d_1$ are (x, y − 1) and (x, y + 1), with x as the row index and y as the column index.
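A sketch of this scheme in NumPy, comparing each pixel with its two neighbours along the direction closest to the gradient angle (the 22.5° bin boundaries are a common convention, not from the notes):

```python
import numpy as np

def non_maxima_suppression(M, alpha):
    """Keep a pixel only if it is a local maximum along the direction
    closest to its gradient angle (M = magnitude, alpha = degrees).
    Border pixels are skipped for simplicity."""
    out = np.zeros_like(M)
    ang = np.mod(alpha, 180.0)        # directions are symmetric modulo 180
    H, W = M.shape
    for i in range(1, H - 1):
        for j in range(1, W - 1):
            a = ang[i, j]
            if a < 22.5 or a >= 157.5:   # ~horizontal gradient: row neighbours
                n1, n2 = M[i, j - 1], M[i, j + 1]
            elif a < 67.5:               # ~ +45 degree gradient
                n1, n2 = M[i - 1, j + 1], M[i + 1, j - 1]
            elif a < 112.5:              # ~vertical gradient: column neighbours
                n1, n2 = M[i - 1, j], M[i + 1, j]
            else:                        # ~ -45 degree gradient
                n1, n2 = M[i - 1, j - 1], M[i + 1, j + 1]
            if M[i, j] >= n1 and M[i, j] >= n2:
                out[i, j] = M[i, j]
    return out
```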
 
 
 
Characteristics of Good Feature Detectors (may come up in theory)
● Repeatability
○ The same feature can be found in several images despite geometric and photometric transformations
● Saliency
○ Each feature is distinctive
● Compactness and efficiency
○ Many fewer features than image pixels
● Locality
○ A feature occupies a relatively small area of the image; robust to clutter and occlusion
Criteria for Optimal Edge Detection (this too)
● Good detection
○ The optimal detector must minimize the probability of false positives (detecting spurious edges caused by noise), as well as that of false negatives (missing real edges)
● Good localization
○ The edges detected must be as close as possible to the true edges
● Single response constraint
○ The detector must return one point only for each true edge point, that is, minimize the number of local maxima around the true edge (created by noise)
