
Data Structures Assignment 5

(Hashing)
Clarifications:
- The requirement of using a 2-4 tree for overflow chaining in the first hash table is
relaxed for this assignment. You may use any auxiliary data structure for overflow
chaining in the first hash table.
- The image file extension is .jpg instead of .jpeg (changes have been made below).
- Assume that the image files given as input for duplicate checking are in
your working directory. The images for which you create the data structure are in the
subdirectory Pictures.
- Please see Moodle for other clarifications.

Too many pictures too little space...

With cameras now available on mobile devices, people generate a huge number of digital
photographs. Furthermore, there are tools for easy sharing of pictures. Given these two
developments, it is becoming increasingly difficult to keep track of one’s pictures. The
goal of this assignment is to develop software to manage the large number of digital
pictures that one possesses. In this assignment, we aim to resolve just one of the issues
in managing photographs: the problem of “duplicates”.

Duplicates:

Often you receive a picture from a friend that is already stored on your computer but,
unaware of this fact, you end up storing a duplicate copy. Storing a large number of
duplicates wastes a lot of space on your hard disk. What you want from your software is
to quickly determine whether a given digital picture is already in your storage. Note
that checking whether two picture files are the same is a costly operation (since you
have to do a pixel-by-pixel match). Furthermore, comparing the given picture to a picture
stored on the hard disk involves bringing the picture file from the hard disk into
memory, which we know is a very costly operation. So, ideally, you would like to
minimize the number of actual picture comparisons you need to do.

Hashing:

Hashing provides a nice solution for resolving some of the above-mentioned issues. We
will use two hash tables Hash1 and Hash2 of size m1 and m2 respectively. We come up
with simple hash functions h1 and h2 that map a given digital image to locations in the
respective hash tables. In the first hash table, we resolve collisions by maintaining a
2-4 tree for each bucket in the hash table. In the second hash table, we do open addressing
with linear probing. The details of our scheme follow. First, we try to understand what
constitutes a digital image.

Digital Image:

A colour digital image can be defined by a 2-D array of “pixels”. Each pixel has 3
different attributes which together define the colour of the pixel. The 3 components are:

1. Red value: The intensity of the red colour. This is an integer in the range 0-255.

2. Green value: The intensity of the green colour. This is an integer in the range 0-255.

3. Blue value: The intensity of the blue colour. This is an integer in the range 0-255.

So, a pixel is defined by an “RGB” triple (r, g, b) and the image is defined by a 2-D
array I[h, w] of these pixels. Here h is the height of the image and w is the width of the
image.
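
As a minimal sketch of the pixel representation described above, the following Java snippet extracts the (r, g, b) triple of one pixel from a java.awt.image.BufferedImage. The class name PixelDemo and the method rgbAt are hypothetical and used only for illustration; the starter code provided with the assignment may read images differently.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper; the class and method names are illustrative only.
final class PixelDemo {
    /** Returns {r, g, b} for the pixel at the given row and column. */
    static int[] rgbAt(BufferedImage img, int row, int col) {
        // BufferedImage.getRGB takes (x, y), i.e. (column, row),
        // and returns the colour packed as 0xAARRGGBB.
        int packed = img.getRGB(col, row);
        int r = (packed >> 16) & 0xFF;
        int g = (packed >> 8) & 0xFF;
        int b = packed & 0xFF;
        return new int[] { r, g, b };
    }

    public static void main(String[] args) {
        // Build a tiny 4x3 image in memory instead of reading a file.
        BufferedImage img = new BufferedImage(4, 3, BufferedImage.TYPE_INT_RGB);
        img.setRGB(2, 1, (200 << 16) | (100 << 8) | 50);
        int[] p = rgbAt(img, 1, 2);
        System.out.println(p[0] + " " + p[1] + " " + p[2]); // prints "200 100 50"
    }
}
```

Note that the array I[h, w] in the text is indexed by row first, while getRGB expects the column first; mixing the two up is an easy mistake to make.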

The hash functions:

We will be maintaining two hash tables and we will need two hash functions. The first
hash function maps a digital image to a bucket number (i.e., {0,1,...,m1-1}) of the first
hash table. We define this hash function in the following manner.

• Let the image be denoted by the array I[h, w].
• P1 = I[0.25*h, 0.25*w] and G1 = 0.30*P1.R + 0.59*P1.G + 0.11*P1.B
• P2 = I[0.25*h, 0.75*w] and G2 = 0.30*P2.R + 0.59*P2.G + 0.11*P2.B
• P3 = I[0.75*h, 0.25*w] and G3 = 0.30*P3.R + 0.59*P3.G + 0.11*P3.B
• P4 = I[0.75*h, 0.75*w] and G4 = 0.30*P4.R + 0.59*P4.G + 0.11*P4.B
• P5 = I[0.50*h, 0.50*w] and G5 = 0.30*P5.R + 0.59*P5.G + 0.11*P5.B
• HC1(I) = (83*G1 + 137*G2 + 257*G3 + 577*G4 + 769*G5)
• H1(I) = HC1(I) (mod m1)

(All the multiplications above are integer multiplications)
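
The definition of HC1 and H1 can be sketched in Java as follows. This is one possible reading, not the reference implementation: it assumes the image is held as an array pixels[row][col] of packed 0xRRGGBB values, that fractional indices such as 0.25*h are truncated to integers, and that each grey value Gi is computed with integer arithmetic as (30*R + 59*G + 11*B) / 100. The names Hash1Demo, gray, hc1 and h1 are hypothetical. Check Moodle for the rounding convention the graders expect.

```java
// Hypothetical sketch; class, method and parameter names are not from the assignment.
final class Hash1Demo {
    /** Grey value of a packed 0xRRGGBB pixel, using integer arithmetic only (an assumption). */
    static int gray(int packed) {
        int r = (packed >> 16) & 0xFF;
        int g = (packed >> 8) & 0xFF;
        int b = packed & 0xFF;
        return (30 * r + 59 * g + 11 * b) / 100;
    }

    /** HC1 for an image stored as pixels[row][col]; indices are truncated (an assumption). */
    static int hc1(int[][] pixels) {
        int h = pixels.length;
        int w = pixels[0].length;
        int g1 = gray(pixels[h / 4][w / 4]);         // I[0.25*h, 0.25*w]
        int g2 = gray(pixels[h / 4][3 * w / 4]);     // I[0.25*h, 0.75*w]
        int g3 = gray(pixels[3 * h / 4][w / 4]);     // I[0.75*h, 0.25*w]
        int g4 = gray(pixels[3 * h / 4][3 * w / 4]); // I[0.75*h, 0.75*w]
        int g5 = gray(pixels[h / 2][w / 2]);         // I[0.50*h, 0.50*w]
        // Each Gi is at most 255, so the sum is at most 255 * 1823 and fits in an int.
        return 83 * g1 + 137 * g2 + 257 * g3 + 577 * g4 + 769 * g5;
    }

    /** H1(I) = HC1(I) mod m1, a bucket number in {0, ..., m1-1}. */
    static int h1(int[][] pixels, int m1) {
        return hc1(pixels) % m1;
    }
}
```

For an all-white image every Gi is 255, so HC1 = 255 * (83 + 137 + 257 + 577 + 769) = 255 * 1823 = 464865, and with m1 = 10 the bucket is 5.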

The second hash function maps an image to a bucket number (i.e., {0,1,...,m2-1}) in the
second hash table. The second hash function is defined in a similar manner as the first
hash function.

• Let the image be denoted by the array I[h, w].
• P1 = I[0.50*h, 0.25*w] and G1 = 0.30*P1.R + 0.59*P1.G + 0.11*P1.B
• P2 = I[0.50*h, 0.75*w] and G2 = 0.30*P2.R + 0.59*P2.G + 0.11*P2.B
• P3 = I[0.25*h, 0.50*w] and G3 = 0.30*P3.R + 0.59*P3.G + 0.11*P3.B
• P4 = I[0.75*h, 0.50*w] and G4 = 0.30*P4.R + 0.59*P4.G + 0.11*P4.B
• P5 = I[0.50*h, 0.50*w] and G5 = 0.30*P5.R + 0.59*P5.G + 0.11*P5.B
• HC2(I) = (193*G1 + 317*G2 + 571*G3 + 647*G4 + 857*G5)
• H2(I) = HC2(I) (mod m2)

(All the multiplications above are integer multiplications)

Implementation details:

You are given 9144 distinct digital photographs to begin with. You can download these
images from here. When you unzip the file in your working directory, you will find all
the digital images in a directory named Pictures. The pictures are named x.jpg where x is
an integer between 0 and 9143. Given n as an input, your program should first iteratively
read all the image files named 0.jpg through (n-1).jpg in the Pictures directory and
initialize the Data Structure that you will be using to detect duplicates. Here is how you
initialise the data structure. Let us call this operation “indexing”.

For a given image I that has name x.jpg, do the following:

1. Insert HC1(I) into the 2-4 tree corresponding to the location H1(I) of the first
hash table. Note that Hash1[H1(I)] stores a reference to the root of the 2-4 tree.

2. Insert the key-value pair (HC2(I), (HC1(I), x)) (here the tuple (HC1(I), x) is the
value) into the second hash table. Use linear probing with probe sequence H2(I),
H2(I)+1,... etc.
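
Step 2 above can be sketched as follows. This is a minimal illustration under stated assumptions: the second table is held in parallel arrays, and the probe sequence wraps around to slot 0 after reaching the end of the table (the text gives only "H2(I), H2(I)+1, ..." and does not spell this out). The class name Hash2Demo and its fields are hypothetical.

```java
// Hypothetical sketch; names are illustrative, and wrap-around at the end
// of the table is an assumption.
final class Hash2Demo {
    private final int[] hc2Keys;    // key: HC2(I)
    private final int[] hc1Values;  // first half of the value: HC1(I)
    private final int[] imageIds;   // second half of the value: x from x.jpg
    private final boolean[] used;   // true once a slot holds an entry

    Hash2Demo(int m2) {
        hc2Keys = new int[m2];
        hc1Values = new int[m2];
        imageIds = new int[m2];
        used = new boolean[m2];
    }

    /** Inserts (hc2, (hc1, x)) and returns the slot used, or -1 if the table is full. */
    int insert(int hc2, int hc1, int x) {
        int m2 = used.length;
        int start = hc2 % m2; // H2(I)
        for (int i = 0; i < m2; i++) {
            int slot = (start + i) % m2; // H2(I), H2(I)+1, ... with wrap-around
            if (!used[slot]) {
                used[slot] = true;
                hc2Keys[slot] = hc2;
                hc1Values[slot] = hc1;
                imageIds[slot] = x;
                return slot;
            }
        }
        return -1; // cannot happen while m2 > n, as the assignment requires
    }
}
```

Because duplicate keys are allowed (two different images may share the same HC2), insertion never overwrites an occupied slot; it always continues to the next free one.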

After you have initialized your data structure in the above manner, your program should
prompt on the console for the filename of an image to be checked for duplicates. To check
whether a duplicate of a given image I exists, you do the following:

1. Search the 2-4 tree corresponding to the location H1(I), for the key HC1(I).

2. If the search fails, then output “DUPLICATE NOT PRESENT”.

3. In case the search succeeds, do the following:

a. Search the second hash table for all occurrences of the key HC2(I).

b. If HC2(I) is not present, then output “DUPLICATE NOT PRESENT”.

c. In case the search succeeds, let the values corresponding to the matched keys
be (P1, Q1), (P2, Q2), ..., (Pk, Qk). Do the following:

i. For every 1<=i<=k, if Pi = HC1(I), then read the image file Qi.jpg and
check if the image is the same as I. If so, output “DUPLICATE PRESENT: Qi”.

4. If no duplicate was reported in step 3, output “DUPLICATE NOT PRESENT”.
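
The lookup procedure above can be sketched as a single method. This is an illustration under several assumptions, not the required design: the first table uses one java.util.Set per bucket in place of a 2-4 tree (the clarifications allow any auxiliary structure), the second table is stored in parallel arrays with wrap-around probing, and the pixel-by-pixel comparison with Qi.jpg is abstracted as a predicate sameImageAs. All names are hypothetical.

```java
import java.util.List;
import java.util.Set;
import java.util.function.IntPredicate;

// Hypothetical sketch; every name below is illustrative and not part of the assignment.
final class DuplicateCheckDemo {
    /**
     * hash1: one set of HC1 values per bucket (stand-in for the per-bucket tree).
     * keys, hc1Vals, ids, used: the second table as parallel arrays.
     * sameImageAs: stand-in for reading <id>.jpg and comparing it pixel by pixel.
     */
    static String check(int hc1, int hc2, List<Set<Integer>> hash1,
                        int[] keys, int[] hc1Vals, int[] ids, boolean[] used,
                        IntPredicate sameImageAs) {
        // Steps 1 and 2: look up HC1 in the bucket chosen by H1.
        if (!hash1.get(hc1 % hash1.size()).contains(hc1)) {
            return "DUPLICATE NOT PRESENT";
        }
        // Step 3: walk the probe sequence H2, H2+1, ... until an empty slot.
        int m2 = keys.length;
        for (int i = 0; i < m2; i++) {
            int slot = (hc2 % m2 + i) % m2;
            if (!used[slot]) {
                break; // no deletions are assumed, so an empty slot ends the run
            }
            // The costly image comparison runs only when both signatures match.
            if (keys[slot] == hc2 && hc1Vals[slot] == hc1 && sameImageAs.test(ids[slot])) {
                return "DUPLICATE PRESENT: " + ids[slot];
            }
        }
        // Step 4: nothing matched.
        return "DUPLICATE NOT PRESENT";
    }
}
```

The short-circuit && keeps the disk read as the last test, which is the point of the scheme: a file is loaded only after both signatures agree.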

After checking for a duplicate, your program should ask the user for another filename
to check.

You have to use the 2-4 tree that you have implemented in your previous assignment.

Your main class CheckDuplicates, in the file CheckDuplicates.java, will read input
from the console and write output to the console. You run your program using the
command:

java CheckDuplicates <n>, <m1>, <m2>

Here n is the number of images to be indexed. For example, if n = 10, then you index
only the images 0.jpg through 9.jpg.
m1 and m2 (m2 > n) are the sizes of the hash tables used in your implementation.

Here is some code that you may use to get started. This includes image reading.

EXAMPLE:

java CheckDuplicates 1000, 10, 2000

WITH INPUT:

check1.jpg
check2.jpg

THE OUTPUT IS:

DUPLICATE NOT PRESENT
DUPLICATE PRESENT: 50

The above is in the case that check1.jpg is not one of the images 0.jpg through 999.jpg
and check2.jpg is the same as 50.jpg.

Some ideas behind the assignment

Why are you doing whatever you are doing above?

The primary goal is to give you a chance to implement these data structures. However,
there is some logic behind the above construction. Note that one of our main goals was
to minimize the number of disk I/O operations that one has to do. To achieve this, we
consider very short “signatures” of the images present in our database (HC1 and HC2
give these signatures).
We store these short signatures of images in hash tables. If the signatures of the
images are evenly distributed, then each location of our hash table will be lightly
loaded and all operations will be quick.
Instead of the linked list that is typically used in chaining, we use a 2-4 tree in
order to save some time. Remember, a search runs in linear time on a linked list but in
logarithmic time on a 2-4 tree. The intuition for using two hash tables instead of one
is that it is very unlikely that an image that is not present in the storage will match
both the HC1(I) and HC2(I) values of an image in the database.

Why use the 2-4 tree implementation of the previous assignment?

Again, the main reason is that you can reuse your 2-4 tree implementation from the
previous assignment. Another reason for using this implementation is that you can easily
store a copy of your 2-4 tree on the hard disk when you want to terminate your program
(for instance, to shut down your computer). You just write the entire array into a file,
and when you restart your program, you read the file and initialize your array to get
the 2-4 tree back. This is a very useful feature in our current problem. You do not want
to index your entire collection of pictures each time you want to check for duplicates.
Ideally, you would index the database once and then use the data structure that you have
created to check for duplicates in the future. This data structure should be such that
it can be easily stored in and recovered from secondary memory.
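
As a rough sketch of the save-and-restore idea, the following Java code writes an array to a file and reads it back. It assumes, purely for illustration, that the tree from the previous assignment is backed by a single int[]; your own array layout may differ, and the names ArrayStoreDemo, save and load are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch; assumes the tree is stored in one int[] (an assumption).
final class ArrayStoreDemo {
    /** Writes the array length followed by every element. */
    static void save(int[] data, Path file) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeInt(data.length);
            for (int v : data) {
                out.writeInt(v);
            }
        }
    }

    /** Reads back an array written by save. */
    static int[] load(Path file) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            int[] data = new int[in.readInt()];
            for (int i = 0; i < data.length; i++) {
                data[i] = in.readInt();
            }
            return data;
        }
    }
}
```

The second hash table could be stored the same way, so that a later run skips indexing entirely and goes straight to answering queries.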
