
Average number of collisions in a hash function

Sergei Winitzki
2001-10-13 to December 30, 2008
1 Statistics of random hash function
1.1 Formulation of the problem
A $p$-bit hash function is a function from the nonnegative integers $\mathbb{N}$ to the integer range $\{0, 1, \dots, 2^p - 1\}$. Such functions are used as checksums on data files. A data file is considered as a stream of bits, that is, a binary representation of a nonnegative integer. If the hash function gives different results on two files, the files are surely different. For example, the MD5 sum is a 128-bit hash function frequently used to verify file integrity. A good hash function will yield different results for even slightly different files; heuristically, a good hash function yields a random value. However, it is clear that there will be, by pure chance, some cases where different inputs yield the same hash value. These are called hash collisions. The problem is to estimate the frequency of hash collisions, assuming a perfect hash, i.e. that the hash values are perfectly random, uniformly distributed numbers in the hash range.
Therefore, the problem of finding the frequency of hash collisions is equivalent to the following mathematical problem. Suppose $x_1, \dots, x_n$ are independent, uniformly randomly chosen integers, each ranging from 1 to $N$ (in the case of a $p$-bit hash function, we choose $N = 2^p$). We need to compute the average number of different integers in the set $\{x_1, \dots, x_n\}$. We would also like to compute the average number of pair collisions, triple collisions, etc.
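
Before any analysis, the quantity in question can be estimated by direct simulation. The following is a minimal sketch, not part of the original derivation: the function names, the trial count, and the small values of $n$ and $N$ are illustrative assumptions, and Python's pseudorandom generator stands in for the perfect hash.

```python
import random

def count_distinct(n, N, rng):
    """Draw n integers uniformly from 1..N and count the distinct values."""
    return len({rng.randint(1, N) for _ in range(n)})

def estimate_average_distinct(n, N, trials=20000, seed=0):
    """Monte Carlo estimate of the average number of distinct integers."""
    rng = random.Random(seed)  # fixed seed so that the run is reproducible
    total = sum(count_distinct(n, N, rng) for _ in range(trials))
    return total / trials

if __name__ == "__main__":
    n, N = 30, 50  # illustrative sizes, chosen so that collisions are frequent
    avg = estimate_average_distinct(n, N)
    print(f"average number of distinct values: {avg:.3f}")
    print(f"average number of repeated draws:  {n - avg:.3f}")
```

Such an estimate is only practical for small $N$; it is useful mainly as a cross-check of exact results.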
1.2 The basic generating function
One drawing of $n$ integers can be described if we specify how many times each possible integer from the set $\{1, \dots, N\}$ is selected. Consider the probability $p(n; s_1, \dots, s_N)$ that the integer $i$ is selected $s_i$ times ($i = 1, \dots, N$). The generating function for this probability can be defined as
$$G(n; q_1, \dots, q_N) = \sum_{s_i \geq 0} q_1^{s_1} \dots q_N^{s_N} \, p(n; s_1, \dots, s_N). \tag{1}$$
For $n = 1$ we have
$$p(1; s_1, \dots, s_N) = \begin{cases} \frac{1}{N}, & \text{if only one of } s_i \text{ is 1}, \\ 0, & \text{otherwise.} \end{cases}$$
So the generating function for $n = 1$ is simply
$$G(1; q_1, \dots, q_N) = \frac{1}{N} (q_1 + \dots + q_N). \tag{2}$$
For $n > 1$ the generating function is equal to the product of the $n$ (identical) generating functions (2), since the $n$ drawings are independent:
$$G(n; q_1, \dots, q_N) = \frac{1}{N^n} (q_1 + \dots + q_N)^n = \frac{1}{N^n} \sum_{s_i \geq 0;\ \sum_i s_i = n} \frac{n!}{s_1! \dots s_N!} \, q_1^{s_1} \dots q_N^{s_N}. \tag{3}$$
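
The expansion in Eq. (3) can be checked by brute force for small $n$ and $N$: every one of the $N^n$ drawings is equally likely, so the probability of an occupancy vector $(s_1, \dots, s_N)$ should equal the multinomial coefficient $n!/(s_1! \dots s_N!)$ divided by $N^n$. The sketch below uses hypothetical helper names and exact rational arithmetic; it is an illustration, not part of the original text.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from math import factorial

def occupancy_probabilities(n, N):
    """Enumerate all N**n equally likely drawings.

    Returns a dict mapping each occupancy vector (s_1, ..., s_N) to its probability.
    """
    counts = Counter()
    for drawing in product(range(1, N + 1), repeat=n):
        occupancy = tuple(drawing.count(i) for i in range(1, N + 1))
        counts[occupancy] += 1
    return {s: Fraction(c, N**n) for s, c in counts.items()}

def multinomial_probability(s, N):
    """Coefficient of q_1^{s_1}...q_N^{s_N} in ((q_1+...+q_N)/N)^n, with n = sum(s)."""
    n = sum(s)
    coeff = factorial(n)
    for s_i in s:
        coeff //= factorial(s_i)  # each intermediate quotient is an integer
    return Fraction(coeff, N**n)

if __name__ == "__main__":
    n, N = 4, 3
    probs = occupancy_probabilities(n, N)
    assert all(p == multinomial_probability(s, N) for s, p in probs.items())
    print(len(probs), "occupancy vectors; total probability", sum(probs.values()))
```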
This generating function contains, in principle, the complete information about the probabilities of drawing various sets of integers. Our task now is to use this generating function for the computations we need to perform.
1.3 Average number of different integers
Each possible drawing of the $n$ random integers is represented in the generating function $G$ by a term such as $q_1 q_3^2 q_4$, which signifies a drawing of $\{1, 3, 3, 4\}$. The number of different integers in this drawing is 3. The generating function $G$ is the sum of all these terms with the coefficients equal to the probabilities of the drawings. The average number of different integers will be computed if we replace in $G(n; q_1, \dots, q_N)$ every term $q_1^{s_1} \dots q_N^{s_N}$ by the number of different $q_i$'s in that term. The number of different $q_i$'s in the term $q_1^{s_1} \dots q_N^{s_N}$ can be computed as $f(s_1) + \dots + f(s_N)$, where the function $f(s)$ is defined as
$$f(s) = \begin{cases} 0, & s = 0, \\ 1, & s \geq 1. \end{cases} \tag{4}$$
So we only need to replace $q_1^{s_1} \dots q_N^{s_N}$ by $f(s_1) + \dots + f(s_N)$.
An elegant way of doing this is to find an explicit formula for a linear map from polynomials in $\{q_i\}$ to integers, so that $q_1^{s_1} \dots q_N^{s_N}$ is mapped to $f(s_1) + \dots + f(s_N)$. This map can be found as follows.
First let us try to find the map for just one variable. We need a formula for a linear map such that $q^s$ is mapped into $f(s)$. In particular, we need $f(s) = 1$ for all $s \geq 1$. In other words, $q^2$ is equivalent to $q$ after the map; this suggests that $q$ should be replaced by a projection matrix. However, once we have the idea of using a matrix, we do not need to limit ourselves to a particular choice of $f(s)$. Let us keep $f(s)$ general and substitute instead of $q$ some matrix $T$ such that $T^s$ is mapped into $f(s)$. This can be arranged if we choose some vector $u \in V$ and some covector $v^* \in V^*$ such that
$$\langle v^*, T^s u \rangle = f(s), \tag{5}$$
where the operator $T$ acts in the vector space $V$. This construction yields a linear map from polynomials in $q$ into numbers, such that $q^s$ is mapped into $f(s)$.
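Now let us generalize to $N$ variables $\{q_i\}$. We need a linear map that yields $f(s_1) + \dots + f(s_N)$. This suggests that we use a direct sum of $N$ copies of the linear space $V$ and substitute instead of $q_i$ the operators
$$T_i \equiv 1_V \oplus \dots \oplus T \oplus \dots \oplus 1_V \in \mathrm{End}(V \oplus \dots \oplus V), \tag{6}$$
where the operator $T$ acts on the $i$-th copy of $V$ and $1_V$ is the identity operator in $V$. We now define the vector $\mathbf{u}$ and the covector $\mathbf{v}^*$,
$$\mathbf{u} \equiv u \oplus \dots \oplus u, \quad \mathbf{v}^* \equiv v^* \oplus \dots \oplus v^*, \tag{7}$$
and verify that
$$\langle \mathbf{v}^*, T_i^s \mathbf{u} \rangle = f(s). \tag{8}$$
(Each of the $N - 1$ copies of $V$ on which $T_i$ acts as the identity contributes $\langle v^*, u \rangle = f(0)$, which vanishes for the $f(s)$ of Eq. (4).) When we substitute $T_i$ instead of $q_i$ in a polynomial term $q_1^{s_1} \dots q_N^{s_N}$, we obtain an operator $T^{s_1} \oplus \dots \oplus T^{s_N}$, which will yield
$$\langle \mathbf{v}^*, (T^{s_1} \oplus \dots \oplus T^{s_N}) \mathbf{u} \rangle = f(s_1) + \dots + f(s_N). \tag{9}$$
Therefore, we have constructed a linear map that can be applied directly to the polynomial $G(n; q_1, \dots, q_N)$ to yield the average number of different integers if $f(s)$ is chosen as shown above.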
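
The construction of Eqs. (5) to (9) can be tried out on a small concrete case. The sketch below assumes one particular realization that is not given in the text: $V$ is two-dimensional, $T$ is a specific $2 \times 2$ projector, and $u$, $v^*$ are chosen so that $\langle v^*, u \rangle = 0$ and $\langle v^*, T u \rangle = 1$. The helper names are hypothetical, and an operator on the direct sum is stored simply as the list of its $N$ diagonal blocks.

```python
# Assumed realization of Eq. (5) for the f(s) of Eq. (4); other choices work too.
T = [[1, 0], [1, 0]]   # a projector: T*T == T
I = [[1, 0], [0, 1]]   # identity operator 1_V
u = [1, 0]             # vector u
v = [0, 1]             # covector v*

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def matpow(A, s):
    R = I
    for _ in range(s):
        R = matmul(R, A)
    return R

def pair(A):
    """Evaluate <v*, A u> for a 2x2 matrix A."""
    return sum(v[i] * A[i][j] * u[j] for i in range(2) for j in range(2))

def f(s):
    return 0 if s == 0 else 1

def T_i(i, N):
    """Diagonal blocks of 1_V + ... + T + ... + 1_V, with T in position i (Eq. 6)."""
    return [T if j == i else I for j in range(N)]

def block_product(A, B):
    return [matmul(a, b) for a, b in zip(A, B)]

def monomial_value(s):
    """Substitute T_i for q_i in q_1^{s_1}...q_N^{s_N}, then apply the pairing of Eq. (9)."""
    N = len(s)
    op = [I] * N
    for i, s_i in enumerate(s):
        for _ in range(s_i):
            op = block_product(op, T_i(i, N))
    return sum(pair(block) for block in op)

if __name__ == "__main__":
    assert all(pair(matpow(T, s)) == f(s) for s in range(6))  # Eq. (5)
    s = (1, 0, 2, 1)          # the term q_1 q_3^2 q_4, i.e. the drawing {1, 3, 3, 4}
    print(monomial_value(s))  # prints 3, the number of different integers
```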
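Let us perform this computation using the explicit form of $G(n; q_1, \dots, q_N)$. We substitute $T_i$ instead of $q_i$ and obtain
$$G(n; T_1, \dots, T_N) = \left( \frac{T_1 + \dots + T_N}{N} \right)^n.$$
The operator $T_1 + \dots + T_N$ can be simplified to
$$T_1 + \dots + T_N = [(N - 1) 1_V + T] \oplus \dots \oplus [(N - 1) 1_V + T].$$
Let us denote for brevity
$$Q \equiv \frac{N - 1}{N} 1_V + \frac{1}{N} T.$$
Then we can write
$$G(n; T_1, \dots, T_N) = Q^n \oplus \dots \oplus Q^n. \tag{10}$$
Now we can evaluate the application
$$\langle \mathbf{v}^*, G(n; T_1, \dots, T_N) \mathbf{u} \rangle = N \langle v^*, Q^n u \rangle,$$
where $\mathbf{u}$ and $\mathbf{v}^*$ on the left-hand side are the direct sums of $N$ copies of $u$ and $v^*$ from Eq. (7).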
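Consider the function $f(s)$ defined by Eq. (4). One can certainly choose an operator $T$ and vectors $u$, $v^*$ such that Eq. (5) holds for this $f(s)$. Then we find
\begin{align}
\langle v^*, Q^n u \rangle &= \frac{1}{N^n} \sum_{k=0}^{n} \binom{n}{k} (N - 1)^{n-k} \langle v^*, T^k u \rangle \notag \\
&= \frac{1}{N^n} \sum_{k=0}^{n} \binom{n}{k} (N - 1)^{n-k} f(k) \tag{11} \\
&= \frac{1}{N^n} \sum_{k=1}^{n} \binom{n}{k} (N - 1)^{n-k} = \frac{N^n - (N - 1)^n}{N^n}. \notag
\end{align}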
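
The last line of Eq. (11) can be cross-checked by actually raising a concrete $Q$ to the $n$-th power. The sketch below assumes the same kind of explicit realization as one might pick for Eq. (5), namely a $2 \times 2$ projector $T$ with $\langle v^*, u \rangle = 0$ and $\langle v^*, T u \rangle = 1$; this choice and the function names are assumptions made for illustration only.

```python
from fractions import Fraction

# Assumed realization of Eq. (5) for the f(s) of Eq. (4).
T = [[1, 0], [1, 0]]   # projector, T*T == T
u = [1, 0]             # vector u
v = [0, 1]             # covector v*

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def pairing_Qn(n, N):
    """Return <v*, Q^n u> exactly, with Q = ((N-1)/N) 1_V + (1/N) T."""
    Q = [[Fraction(N - 1, N) * (i == j) + Fraction(T[i][j], N) for j in range(2)]
         for i in range(2)]
    P = [[Fraction(1), Fraction(0)], [Fraction(0), Fraction(1)]]  # identity
    for _ in range(n):
        P = matmul(P, Q)
    return sum(v[i] * P[i][j] * u[j] for i in range(2) for j in range(2))

def closed_form(n, N):
    """Right-hand side of the last line of Eq. (11)."""
    return Fraction(N**n - (N - 1)**n, N**n)

if __name__ == "__main__":
    for n, N in [(1, 2), (2, 2), (5, 7), (10, 16)]:
        assert pairing_Qn(n, N) == closed_form(n, N)
    print("Eq. (11) agrees with the explicit matrix power")
```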
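Therefore the average number of distinct integers is
$$n_d = N \left[ 1 - \left( 1 - N^{-1} \right)^n \right]. \tag{12}$$
This formula determines the average number of collisions in a perfect hash function: on average, $n - n_d$ of the $n$ values coincide with a value that has already appeared.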
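
For small $n$ and $N$, Eq. (12) can be compared with an exhaustive average over all $N^n$ equally likely drawings. The following sketch uses exact rational arithmetic; the function names are illustrative and not from the original text.

```python
from fractions import Fraction
from itertools import product

def average_distinct_bruteforce(n, N):
    """Average number of distinct values over all N**n equally likely drawings."""
    total = sum(len(set(drawing)) for drawing in product(range(N), repeat=n))
    return Fraction(total, N**n)

def average_distinct_formula(n, N):
    """Eq. (12): N * (1 - (1 - 1/N)**n), evaluated exactly."""
    return N * (1 - (1 - Fraction(1, N))**n)

if __name__ == "__main__":
    for n in range(1, 6):
        for N in range(1, 5):
            assert average_distinct_bruteforce(n, N) == average_distinct_formula(n, N)
    print("Eq. (12) matches exhaustive enumeration for n <= 5, N <= 4")
```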
As a realistic example, let us assume that we have computed the 32-bit hash sums of one million different files. How many different hash sums do we have on the average? We substitute $N = 2^{32}$ and $n = 10^6$ into Eq. (12) and find
$$n_d(2^{32}, 10^6) \approx 10^6 - 116.4,$$
which means that, on average, about 116 files will have the same hash sum as some other file even though the files are different. So we need to use a larger hash range; with $N = 2^{64}$ we find
$$n_d(2^{64}, 10^6) \approx 10^6 - 2.7 \cdot 10^{-8}. \tag{13}$$
This indicates a negligible chance of hash collisions. Therefore, a 64-bit hash sum is sufficient for a million files.
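
These numbers are delicate to reproduce in ordinary double-precision arithmetic, because $1 - 1/N$ rounds to exactly 1 for $N = 2^{64}$. One way around this, shown here as a sketch rather than as the method used by the author, is to evaluate Eq. (12) with Python's `decimal` module at an increased working precision (60 digits is an assumed, comfortably sufficient value).

```python
from decimal import Decimal, getcontext

getcontext().prec = 60  # enough digits to resolve 1 - 1/N for N = 2**64

def expected_collisions(n, N):
    """Return n - n_d, with n_d = N * (1 - (1 - 1/N)**n) taken from Eq. (12)."""
    N = Decimal(N)
    n_d = N * (1 - (1 - 1 / N) ** n)
    return Decimal(n) - n_d

if __name__ == "__main__":
    n = 10**6
    for bits in (32, 64):
        print(f"{bits}-bit hash: n - n_d = {float(expected_collisions(n, 2**bits)):.4g}")
```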
Let us perform an asymptotic estimate of the collision rate for very large $N$. We may expand Eq. (12) as
$$n_d \approx N \left[ 1 - \left( 1 - \frac{n}{N} + \frac{n(n - 1)}{2N^2} \right) \right] = n - \frac{n(n - 1)}{2N}.$$
Therefore, the collision rate is negligible ($n - n_d \ll 1$) when $N \gg n^2$.
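
The quality of this approximation can be inspected numerically by comparing the exact value of $n - n_d$ from Eq. (12) with the leading term $n(n-1)/(2N)$ as $N$ grows. The sketch below uses exact fractions; the function names and sample values are illustrative assumptions.

```python
from fractions import Fraction

def collisions_exact(n, N):
    """n - n_d with n_d from Eq. (12), as an exact fraction."""
    return n - N * (1 - (1 - Fraction(1, N))**n)

def collisions_leading(n, N):
    """Leading asymptotic term n(n-1)/(2N)."""
    return Fraction(n * (n - 1), 2 * N)

if __name__ == "__main__":
    n = 100
    for N in (10**3, 10**4, 10**6, 10**8):
        exact = collisions_exact(n, N)
        approx = collisions_leading(n, N)
        print(f"N = {N:>9}: exact = {float(exact):.6g}, leading term = {float(approx):.6g}")
```

For $n = 100$ the two values approach each other as $N$ increases, and both drop well below 1 once $N$ is much larger than $n^2 = 10^4$.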
1.4 Average number of pairs, triples, etc.
If we wanted to find the average number of pairs (integers that are drawn exactly twice), we could replace the term $q_1^{s_1} \dots q_N^{s_N}$ in the generating function $G(n; q_1, \dots, q_N)$ by $f_2(s_1) + \dots + f_2(s_N)$, where $f_2(s)$ is defined as
$$f_2(s) \equiv \delta_{s2} = \begin{cases} 1, & s = 2; \\ 0, & \text{otherwise.} \end{cases}$$
We can similarly consider the triples or, more generally, $p$-tuples of coincident integers, by taking the function $f_p(s) \equiv \delta_{sp}$. We can describe all these $p$-tuples at once if we consider the generating function of the average number of $p$-tuples; this means introducing an additional formal parameter $t$ and defining
$$f(s; t) \equiv \sum_{p \geq 0} f_p(s) \, t^p = t^s.$$
Hence, we use the same derivation as in the previous section up to Eq. (11), but now we substitute the function $f(s) = t^s$ instead of the previously used $f(s)$ in Eq. (11). Then we find
\begin{align}
\langle \mathbf{v}^*, G(n; T_1, \dots, T_N) \mathbf{u} \rangle &= N \langle v^*, Q^n u \rangle \notag \\
&= \frac{N}{N^n} \sum_{k=0}^{n} \binom{n}{k} (N - 1)^{n-k} t^k = \frac{(N - 1 + t)^n}{N^{n-1}}. \tag{14}
\end{align}
The average number of pairs is read off from Eq. (14) as the coefficient at $t^2$. The average number of $p$-tuples is
$$n_p = \binom{n}{p} \frac{(N - 1)^{n-p}}{N^{n-1}}.$$
For example, with $p = 2$ we find
$$n_2 = \frac{n(n - 1)}{2} \frac{(N - 1)^{n-2}}{N^{n-1}} = \frac{n(n - 1)}{2N} \left( 1 - \frac{1}{N} \right)^{n-2}.$$
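
The formula for $n_p$ can be verified for small $n$ and $N$ by counting, over all $N^n$ drawings, how many integers occur exactly $p$ times. The sketch below is an illustrative check with hypothetical function names, not part of the original derivation.

```python
from fractions import Fraction
from itertools import product
from math import comb

def average_p_tuples_bruteforce(n, N, p):
    """Average number of integers drawn exactly p times, over all N**n drawings."""
    total = 0
    for drawing in product(range(N), repeat=n):
        total += sum(1 for i in range(N) if drawing.count(i) == p)
    return Fraction(total, N**n)

def average_p_tuples_formula(n, N, p):
    """Coefficient at t**p of (N - 1 + t)**n / N**(n - 1)."""
    return Fraction(comb(n, p) * (N - 1)**(n - p), N**(n - 1))

if __name__ == "__main__":
    n, N = 4, 3
    for p in range(n + 1):
        assert average_p_tuples_bruteforce(n, N, p) == average_p_tuples_formula(n, N, p)
    # Every one of the n draws belongs to exactly one p-tuple, so sum of p * n_p is n.
    assert sum(p * average_p_tuples_formula(n, N, p) for p in range(n + 1)) == n
    print("n_p matches exhaustive enumeration for n = 4, N = 3")
```

The case $p = 0$ counts the integers that are never drawn, so $n_0 = N - n_d$, in agreement with the result of the previous section.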
1.5 Remarks
Perhaps the calculation can be performed directly, without the procedure of substituting some complicated operators into the generating function. Maybe one can directly consider the generating function of the average number of $p$-tuples, starting with Eq. (11).