
Modification of System cache for effective implementation of Bloom Filters


Gurudatha.M
M050224CS
Guided by: Dr. Priya Chandran
April 26, 2006

Department Of Computer Engineering


National Institute Of Technology, Calicut
Kerala, 673601
2005

ACKNOWLEDGEMENT

I have been fortunate to have Dr. Priya Chandran, Department of Computer Engineering, as my guide, whose timely guidance, advice and inspiration helped me in the preparation of this mini-project. I express my sincere gratitude to her for having guided me through this work.

I am thankful to Dr. M. P. Sebastian, Head of the Computer Engineering Department, for his encouragement and for giving me this opportunity. I also thank all my friends who helped me throughout this work.

Gurudatha.M
(M050224CS)

CERTIFICATE

This is to certify that the seminar entitled "Modification of System cache for effective implementation of Bloom Filters" is a bona fide record of the mini-project presented by Mr. Gurudatha M (M050224CS) under our supervision and guidance. The seminar report has been submitted to the Department of Computer Engineering of the National Institute of Technology, Calicut, in partial fulfilment of the degree of Master of Technology in Computer Science and Engineering.

Dr. M. P. Sebastian
Head of Department
Dept. of Computer Engg.
NIT Calicut

Dr. Priya Chandran
Assistant Professor
Dept. of Computer Engg.
NIT Calicut

Contents

1 Abstract
2 Introduction
3 Spectral Bloom Filters
  3.1 Minimal Increase
  3.2 Recurring Minimum
  3.3 Comparison
4 DineroIV
5 Modifications
6 Conclusion
7 References

1 Abstract
The Spectral Bloom filter (SBF) is a space-efficient data structure used to efficiently test whether an object exists in a given set. Two optimizations of this randomized structure, Recurring Minimum and Minimal Increase, are presented, and the two are compared. DineroIV, a trace-driven cache simulator tool, and its uses are then discussed. For Bloom filters to be utilized effectively, the data must be checked in a small amount of time, however fast the algorithm itself is; for Bloom filters and all other search techniques to be fast, the best approach is to keep as much of the data as possible in the cache. Finally, the modifications to be made to the DineroIV code so that Spectral Bloom filters can be implemented effectively are described.

2 Introduction
Data mining is the process of selecting the required data from a huge amount of data. Nowadays, in most applications, it has become increasingly important to search data within a small amount of time. This led to the idea of hash functions, for which the average search time per element is constant, i.e. O(1). The need, however, grew even beyond hash functions.

Spectral Bloom filters are data structures with the ability to compactly represent a set and effectively filter out any element that does not belong to the set, with a small error probability. With this data structure it is easy to search for an item within a set with a small false-positive error. Two query methods are used with Spectral Bloom filters: in the Minimum Selection method the minimum value among an element's counters is returned, while in the Recurring Minimum method elements with a single minimum are kept in a secondary SBF, which is checked for a recurring minimum before a value is returned.

DineroIV is a cache simulator tool: given an input trace, it simulates the cache, records all the memory requests, counts the number of misses, hits and memory references in the trace, and outputs the results.

In this paper we explain the modifications to be made to the Dinero tool so that the Spectral Bloom filter can be implemented efficiently. By suitably modifying the tool's parameters, such as the number of cache levels and the number of memory types in each cache, we can develop a use-specific system.

3 Spectral Bloom Filters
The Spectral Bloom Filter (SBF) is a data structure that stores a set S of elements, replacing the bit vector V of the original Bloom filter with a vector C of m counters. The counters in C roughly represent the multiplicities of items in S; all counters in C are initially set to 0. In the basic implementation, when inserting an item s, the counters $C_{h_1(s)}, C_{h_2(s)}, \ldots, C_{h_k(s)}$ are increased by 1. The SBF stores the frequency of each item, and it also allows deletions, by decreasing the same counters. Consequently, updates are also allowed (by performing a delete and then an insert).

Let S be a multi-set of keys taken from a universe U. For x ∈ U let $f_x$ be the frequency of x in S. Let $v_x = \{C_{h_1(x)}, C_{h_2(x)}, \ldots, C_{h_k(x)}\}$ be the sequence of values stored in the k counters representing x, and $\hat{v}_x = \{\hat{v}_x^1, \hat{v}_x^2, \ldots, \hat{v}_x^k\}$ be a sequence consisting of the same items as $v_x$, sorted in non-decreasing order; i.e. $m_x = \hat{v}_x^1$ is the minimal value observed in those k counters.

To add a new item x ∈ U to the SBF, the counters $C_{h_1(x)}, C_{h_2(x)}, \ldots, C_{h_k(x)}$ are increased by 1. The Spectral Bloom Filter for a multi-set S can be computed by repeatedly inserting all the items from S. The same logic applies when dealing with streaming data: as the data flows, it is hashed into the SBF by a series of insertions.

Querying the SBF: A basic query for the SBF on an item x ∈ U returns an estimate of $f_x$. The SBF error, denoted $E_{SBF}$, is defined as the probability that for an arbitrary element z (not necessarily a member of S), $\hat{f}_z \ne f_z$. The basic estimator, denoted the Minimum Selection (MS) estimator, is $\hat{f}_x = m_x$.
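As an illustration, the following minimal C sketch implements the basic SBF operations just described: insertion, the Minimum Selection query, and deletion. The table size M, the number of hash functions K, and the FNV-based double-hashing scheme are illustrative choices of ours, not prescribed by the SBF construction.

    #include <stddef.h>
    #include <stdint.h>

    #define M 1024                  /* number of counters in C (illustrative) */
    #define K 4                     /* number of hash functions (illustrative) */

    typedef struct { uint32_t c[M]; } sbf_t;    /* all counters start at 0 */

    /* i-th hash of a key: FNV-1a mixed with i, standing in for k
       independent hash functions h_1 .. h_k */
    static size_t h(int i, const char *key)
    {
        uint32_t v = 2166136261u;
        for (; *key; key++)
            v = (v ^ (uint8_t)*key) * 16777619u;
        return (v + (uint32_t)i * 0x9e3779b9u) % M;
    }

    /* basic insert: increase all k counters C_{h_i(x)} by 1 */
    void sbf_insert(sbf_t *f, const char *x)
    {
        for (int i = 0; i < K; i++)
            f->c[h(i, x)]++;
    }

    /* Minimum Selection estimator: return m_x, the minimum of the k counters */
    uint32_t sbf_query_ms(const sbf_t *f, const char *x)
    {
        uint32_t m = UINT32_MAX;
        for (int i = 0; i < K; i++)
            if (f->c[h(i, x)] < m)
                m = f->c[h(i, x)];
        return m;
    }

    /* delete: reverse the insert (an update is a delete followed by an insert) */
    void sbf_delete(sbf_t *f, const char *x)
    {
        for (int i = 0; i < K; i++)
            f->c[h(i, x)]--;
    }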

Deletions and sliding window maintenance: Deleting an item x ∈ U from the SBF is achieved simply by reversing the actions taken when inserting x, namely decreasing by 1 the counters $C_{h_1(x)}, C_{h_2(x)}, \ldots, C_{h_k(x)}$. In sliding-window scenarios, in case the data within the current window is available (as it is in data warehouse applications), the sliding window can be maintained simply by performing deletions of the out-of-date data.
The following optimizations are used in the implementation to improve query performance when the threshold T is greater than one, the threshold being the minimum frequency $f_x$ of interest, i.e. only items with $f_x > T$ are reported.

3.1 Minimal Increase
The minimal counter of a hashed item holds better information than the other counters: fewer data items hash to it, so it is the least inflated. The Minimal Increase method takes this observation into account for effectively finding an element in the set. In the Minimal Increase method, when inserting an element, only the minimal counters of that element are increased, so the other counters are not increased until the minimal counters catch up with them. This is done as follows:

Minimal Increase: When performing an insert of an item x, increase only the counters that equal $m_x$ (its minimal counters). When performing a lookup query, return $m_x$. For the insertion of r occurrences of x this method can be executed iteratively, or instead the smallest counter(s) can be increased by r and every other counter set to the maximum of its old value and $m_x + r$.
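Reusing the sbf_t type and helpers from the earlier sketch, a minimal sketch of the Minimal Increase insert might look as follows; the batch variant exploits the max() rule quoted above.

    /* Minimal Increase insert: bump only the counters currently equal to m_x */
    void sbf_insert_mi(sbf_t *f, const char *x)
    {
        uint32_t m = sbf_query_ms(f, x);        /* m_x */
        for (int i = 0; i < K; i++) {
            size_t j = h(i, x);
            if (f->c[j] == m)
                f->c[j]++;                      /* larger counters stay as they are */
        }
    }

    /* insert r occurrences at once: raising every counter below m_x + r to that
       value increases the smallest counter(s) by r and takes the maximum of the
       old value and m_x + r for the rest */
    void sbf_insert_mi_r(sbf_t *f, const char *x, uint32_t r)
    {
        uint32_t m = sbf_query_ms(f, x);
        for (int i = 0; i < K; i++) {
            size_t j = h(i, x);
            if (f->c[j] < m + r)
                f->c[j] = m + r;
        }
    }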

Minimal Increase and deletions: This method has a big drawback in that it does not support deletions, since the minimal counter is only incremented during insertions. In fact, when deletions are allowed, the Minimal Increase algorithm introduces a new kind of error: false-negative errors. This result is salient in the experiments dealing with deletions and sliding-window approaches, where the Minimal Increase method becomes unattractive because of its poor performance, caused mostly by false-negative errors.

3.2 Recurring Minimum


In this method the assumption is that most of the elements in the set S have recurring minimums, and the number of elements with a single minimum is small. The elements with singular minimums have a high probability of being in error and are handled as separate cases. To handle these cases, two filters are used: a primary SBF holding the elements with a recurring minimum, and a secondary SBF holding the elements with a singular minimum.

To search for a particular element in the set, first compute its hash functions and examine the corresponding counters. If there is a recurring minimum, the search succeeds and that value is reported; otherwise, if there is a singular minimum, the element is searched for in the secondary SBF.

The algorithm: When adding an item x, increase the counters of x in the primary SBF. Then check whether x has a recurring minimum. If so, continue normally. Otherwise (if x has a single minimum), look for x in the secondary SBF. If found, increase its counters there; otherwise add x to the secondary SBF, with an initial value that equals its minimal value from the primary SBF. When performing a lookup for x, check whether x has a recurring minimum in the primary SBF. If so, return the minimum. Otherwise, perform a lookup for x in the secondary SBF. If the returned value is greater than 0, return it; otherwise, return the minimum from the primary SBF.
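Continuing with the helpers from the earlier sketches, the algorithm just described could be coded roughly as below. The way the secondary SBF is seeded with x's primary minimum (raising each of its counters to at least that value) is our reading of the description, not code from [1].

    typedef struct { sbf_t primary, secondary; } rm_sbf_t;

    /* does x have a recurring (repeated) minimum among its k counters? */
    static int recurring_min(const sbf_t *f, const char *x)
    {
        uint32_t m = sbf_query_ms(f, x);
        int hits = 0;
        for (int i = 0; i < K; i++)
            if (f->c[h(i, x)] == m)
                hits++;
        return hits > 1;
    }

    void rm_insert(rm_sbf_t *f, const char *x)
    {
        sbf_insert(&f->primary, x);
        if (recurring_min(&f->primary, x))
            return;                            /* recurring minimum: continue normally */
        if (sbf_query_ms(&f->secondary, x) > 0) {
            sbf_insert(&f->secondary, x);      /* x already tracked in the secondary SBF */
        } else {
            /* seed x in the secondary SBF with its minimal value from the primary */
            uint32_t m = sbf_query_ms(&f->primary, x);
            for (int i = 0; i < K; i++) {
                size_t j = h(i, x);
                if (f->secondary.c[j] < m)
                    f->secondary.c[j] = m;
            }
        }
    }

    uint32_t rm_lookup(const rm_sbf_t *f, const char *x)
    {
        if (recurring_min(&f->primary, x))
            return sbf_query_ms(&f->primary, x);
        uint32_t v = sbf_query_ms(&f->secondary, x);
        return v > 0 ? v : sbf_query_ms(&f->primary, x);
    }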

Deletions and sliding window maintenance: Deleting x when using Recurring Minimum is essentially a reversal of the increase operation: first decrease its counters in the primary SBF; then, if it has a single minimum (or if it exists in the secondary SBF $B_f$), decrease its counters in the secondary SBF, unless at least one of them is 0. Since insertions are performed in both the primary and the secondary SBF, there can be no false-negative situations when deleting items. A sliding window is easily implemented as a series of deletions, assuming that the out-of-scope data is available.

3.3 Comparison
Error rates: The MS algorithm provides the same error rates as the original Bloom filter. The Minimal Increase method performs better than both the RM and MS algorithms. The RM algorithm is not as good as MI, but is consistently better than the MS algorithm.
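For reference, the false-positive rate of the original Bloom filter, which the MS estimator therefore inherits, is approximately

    $E_{SBF} \approx \left(1 - e^{-kn/m}\right)^k$

for m counters, k hash functions and n distinct items in S.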

Memory overhead: Although the RM algorithm performs better than the MS algorithm, the requirement of maintaining a second SBF makes RM need more memory. MI is the best of the three, as the number of updates and the amount of data kept in memory are much smaller than for the other two.

Complexity: The RM algorithm is the most complex method, because of the hashing into two SBFs; its complexity naturally increases. The MS method is the simplest.

Deletions: Both the MS and RM methods support these operations. The MI algorithm does not, and may produce false-negative errors if used for them.

4 DineroIV
Dinero IV is a uniprocessor trace-driven cache simulator tool. A trace is a sequence of memory references made during the execution of a program. Its main job is to simulate the memory requests and cache misses, keeping all the information regarding the execution of a particular program.

The input to DineroIV is a sample trace of the execution of a program, supplied in the .din format. DineroIV then simulates the behavior of the cache. For this, it maintains the cache blocks in the form of a stack; the stack contains the memory blocks requested by the trace, and each stack item can also be hashed to a number of blocks.
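For illustration, a traditional din trace is simply a sequence of (access type, hexadecimal address) pairs, one reference per line, where the access type is 0 for a data read, 1 for a data write and 2 for an instruction fetch; the addresses below are made up:

    2 0
    0 1000
    1 1004
    2 4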

The tool takes the cache sizes, associativity, replacement policies and the sample trace as input. It first checks the validity of the arguments with respect to the cache levels, and then initializes the stack according to those arguments. After initializing the stack and setting the replacement policies, it keeps reading references from the trace and updating the corresponding statistics, which it finally summarizes in its output.
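As an example, an invocation along the following lines (option names as in the DineroIV documentation; exact spellings should be checked against the installed version) would simulate split 16 KB level-1 instruction and data caches with 32-byte blocks on a din-format trace:

    dineroIV -l1-isize 16k -l1-dsize 16k -l1-ibsize 32 -l1-dbsize 32 \
             -informat d < trace.din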

DineroIV divides the cache into a maximum of five levels, where each level is split into three parts: instruction, data and unified caches; for each it maintains the number of blocks and sub-blocks, the associativity, etc. Dinero IV keeps track of the size of each access, along with the address and access type. When reading the din format, Dinero IV forces all accesses to be 4-byte aligned and assumes a size of 4, so the same trace read by Dinero IV in din format may produce different results than if read directly in a richer format, e.g., extended din or one of the pixie formats. The richer formats carry correct size information, while Dinero IV must guess the size in the din format (always 4 bytes). References can cross block or sub-block boundaries. A multi-block reference is handled by processing the first block and deferring the rest as a separate reference; the number of times this is done is reported in the d4cache multi-block field.

5 Modifications
For Bloom filters to be utilized effectively, the data must be checked in a small amount of time, however fast the algorithm itself is. For Bloom filters and all other search techniques to be fast, the best approach is to keep as much of the data as possible in the cache. The necessary modifications are therefore:

• The stack which DineroIV uses must be divided into two parts: one for ordinary cache purposes and another for storing the contents of the set S in which elements are to be searched, so that most of the set S stays in memory rather than being brought in every time.

• Some of the arguments need to be given statically rather than dynamically; for example, the number of cache levels must be set to 2 and the sizes must also be fixed, and the initialize cache() function must be modified accordingly.

• The allocation policies need to be modified, giving some default options such as prefetch-always for the second cache containing the set S; modifications are needed in the d4rep * and d4prefetch * functions.

• The d4 find() and d4findnth() functions, which help in finding a node in the stack, need to be modified so as to allow this searching.

• The code of the d4ref function must be modified according to the way an element is searched for in the set S, whether by Minimal Increase or Recurring Minimum, allowing both random search and varying sizes such that the hash time is O(1); a sketch follows this list.
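As a sketch of this last point, one hypothetical way of wiring the SBF into the reference path is a thin wrapper that consults the filter for the set S before paying for the full stack search. Only d4ref(), d4cache and d4memref are taken from the DineroIV source [3]; the wrapper name, the set_s_filter variable, the LINE constant and the keying of the SBF by block address are illustrative choices of ours, and either sbf_query_ms() or rm_lookup() from Section 3 could serve as the membership test:

    #include <stdio.h>
    /* #include "d4.h"  -- DineroIV's header, declaring d4cache, d4memref, d4ref() */

    #define LINE 32u                 /* illustrative block size */
    static sbf_t set_s_filter;       /* SBF over the set S, from Section 3 */

    /* hypothetical wrapper: skip the stack search for addresses not in S */
    void d4ref_with_sbf(d4cache *c, d4memref r)
    {
        char key[32];
        snprintf(key, sizeof key, "%lx", (unsigned long)(r.address / LINE));
        if (sbf_query_ms(&set_s_filter, key) == 0)
            return;                  /* SBF has no false negatives: not in S */
        d4ref(c, r);                 /* otherwise, normal DineroIV handling */
    }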

6 Conclusion
Spectral Bloom filters can be used more effectively than many searching methods, with a small error rate, and can be implemented effectively with the available resources. DineroIV can be efficiently modified to incorporate the required static or dynamic policies.
The search for an element in a set, or data mining in general, can be done efficiently by combining the two methods, namely the Bloom filters and the modified cache that holds both the set and the ordinary cache contents, making the fetching of elements easy. Hence, with the specified modifications, there is a good chance of reducing the search time for an element.

7 References
[1] Saar Cohen and Yossi Matias, "Spectral Bloom Filters", Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, California, 2003.

[2] B. H. Bloom, "Space/time trade-offs in hash coding with allowable errors", Communications of the ACM, 13(7):422-426, 1970.

[3] Mark D. Hill, Dinero IV, http://www.cs.wisc.edu/~markhill/DineroIV/

