
9/17/2015

5.4.1 The birthday paradox

• How many people must there be in a room before there is a 50% chance that two of them were born on the same day of the year?
• The answer is surprisingly few
• The paradox is that it is in fact far fewer
– than the number of days in a year, or
– even half the number of days in a year

MAT-72006 AADS, Fall 2015 17-Sep-15 181

An analysis using indicator random variables

• We use indicator random variables to provide a simple but approximate analysis of the birthday paradox
• For each pair (i, j) of the k people in the room, define the indicator random variable X_ij, for 1 ≤ i < j ≤ k, by
X_ij = I{i and j have the same birthday}
= 1 if i and j have the same birthday,
0 otherwise


• Once the birthday of person i is chosen, the probability that person j's birthday is chosen to be the same day is 1/n, where n = 365
E[X_ij] = Pr{i and j have the same birthday} = 1/n
• Let X be a random variable counting the number of pairs of individuals having the same birthday,
X = Σ_{i=1}^{k} Σ_{j=i+1}^{k} X_ij

• Taking expectations of both sides and applying linearity of expectation, we obtain
E[X] = E[Σ_{i=1}^{k} Σ_{j=i+1}^{k} X_ij] = Σ_{i=1}^{k} Σ_{j=i+1}^{k} E[X_ij]
= (k choose 2) · (1/n) = k(k − 1)/(2n)
• When k(k − 1) ≥ 2n, the expected number of pairs of people with the same birthday is at least 1


• Thus, if we have at least √(2n) + 1 individuals in a room, we can expect at least two to have the same birthday
• For n = 365, if k = 28, the expected number of pairs with the same birthday is
(28 · 27)/(2 · 365) ≈ 1.0356
• With at least 28 people, we expect to find at least one matching pair of birthdays
• Mars has n = 687 days, so there we need k = 38 aliens
• Analysis using only probabilities gives a different exact number of people (23 for n = 365), but the same asymptotically: Θ(√n)
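The expectation derived above is easy to check numerically. A minimal sketch (not part of the course material; the function names are mine):

```python
def expected_matching_pairs(k, n=365):
    """E[X] = C(k,2) * (1/n) = k*(k-1)/(2n): expected number of
    birthday-sharing pairs among k people with n equally likely birthdays."""
    return k * (k - 1) / (2 * n)

def smallest_k(n=365):
    """Smallest k for which the expected number of pairs reaches 1."""
    k = 1
    while expected_matching_pairs(k, n) < 1:
        k += 1
    return k
```

For Earth, smallest_k(365) returns 28, and for Mars, smallest_k(687) returns 38, matching the slide.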

5.4.2 Balls and bins

• Consider tossing n identical balls randomly into b bins, numbered 1, 2, …, b
• Tosses are independent, and on each toss the ball is equally likely to end up in any bin
• The probability that a tossed ball lands in any given bin is 1/b
• The ball-tossing process is a sequence of Bernoulli trials with probability of success 1/b, where success means that the ball falls in the given bin


• How many balls fall in a given bin?
– The number of balls that fall in a given bin follows the binomial distribution b(k; n, 1/b)
– If we toss n balls, the expected number of balls that fall in the given bin is n/b
• How many balls must we toss, on the average, until a given bin contains a ball?
– The number of tosses until the given bin receives a ball follows the geometric distribution with probability 1/b, and
– the expected number of tosses until success is 1/(1/b) = b
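As a sanity check on the geometric expectation, one can sum k · p · (1 − p)^(k−1) for p = 1/b and watch it converge to b. A small sketch (my own, using a truncated sum in place of the infinite series):

```python
def expected_tosses_until_hit(b, terms=200000):
    """Truncated mean of a geometric distribution with success
    probability p = 1/b; the full infinite sum equals 1/p = b."""
    p = 1.0 / b
    return sum(k * p * (1 - p) ** (k - 1) for k in range(1, terms + 1))
```

For b = 10 the truncated sum is already indistinguishable from 10.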

• How many balls must we toss until every bin contains at least one ball?
– Call a toss in which a ball falls into an empty bin a “hit”
– We want to know the expected number n of tosses required to get b hits
– We can partition the n tosses into stages
– The i-th stage consists of the tosses after the (i − 1)st hit until the i-th hit
– The first stage consists of the first toss, since we are guaranteed to have a hit when all bins are empty


– During the i-th stage, i − 1 bins contain balls and b − i + 1 bins are empty
– For each toss in the i-th stage, the probability of obtaining a hit is (b − i + 1)/b
– Let n_i be the number of tosses in the i-th stage
– The number of tosses required to get b hits is n = Σ_{i=1}^{b} n_i
– Each n_i has a geometric distribution with probability of success (b − i + 1)/b, so
E[n_i] = b/(b − i + 1)

E[n] = E[Σ_{i=1}^{b} n_i] = Σ_{i=1}^{b} E[n_i]
= Σ_{i=1}^{b} b/(b − i + 1)
= b Σ_{i=1}^{b} 1/i
= b(ln b + O(1))

• By the harmonic series bound, Σ_{i=1}^{b} 1/i = ln b + O(1)
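The sum Σ b/(b − i + 1) = b · H_b can be evaluated directly and compared against the b ln b approximation. A quick sketch (the function name is mine):

```python
import math

def expected_tosses_to_fill(b):
    """E[n] = sum_{i=1}^{b} b/(b-i+1) = b * H_b (b times the b-th
    harmonic number), which is b ln b + O(b)."""
    return sum(b / (b - i + 1) for i in range(1, b + 1))
```

For b = 1000 this gives about 7485 tosses, versus b ln b ≈ 6908; the gap is the O(b) term of the harmonic series.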


• It therefore takes approximately b ln b tosses before we can expect that every bin has a ball
• This problem is also known as the coupon collector’s problem, which says that a person trying to collect each of b different coupons expects to acquire approximately b ln b randomly obtained coupons in order to succeed

II (Sorting and)
Order Statistics
Heapsort
Quicksort
Sorting in Linear Time
Medians and Order Statistics


9 Medians and Order Statistics

• The i-th order statistic of a set of n elements is the i-th smallest element
– E.g., the minimum of a set of n elements is the first order statistic (i = 1), and the maximum is the n-th order statistic (i = n)
• A median is the “halfway point” of the set
• When n is odd, the median is unique, occurring at i = (n + 1)/2

• When n is even, there are two medians, occurring at i = n/2 and i = n/2 + 1
• Thus, regardless of the parity of n, medians occur at
– i = ⌊(n + 1)/2⌋ (the lower median) and
– i = ⌈(n + 1)/2⌉ (the upper median)
• For simplicity, we use “the median” to refer to the lower median


• The problem: selecting the i-th order statistic from a set of n distinct numbers
• We assume that the set contains distinct numbers
– virtually everything extends to the situation in which a set contains repeated values
• We formally specify the problem as follows:
– Input: A set A of n (distinct) numbers and an integer i, with 1 ≤ i ≤ n
– Output: The element x ∈ A that is larger than exactly i − 1 other elements of A

• We can solve the problem in O(n lg n) time by heapsort or merge sort and then simply indexing the i-th element in the output array
• There are faster algorithms
• First, we examine the problem of selecting the minimum and maximum of a set of n elements
• Then we analyze a practical randomized algorithm that achieves an O(n) expected running time, assuming distinct elements
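The O(n lg n) baseline mentioned above is one line in most languages. A sketch (1-indexed rank, to match the slides):

```python
def select_by_sorting(a, i):
    """Return the i-th order statistic (i-th smallest, 1-indexed)
    by sorting first: O(n lg n) time."""
    return sorted(a)[i - 1]
```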


9.1 Minimum and maximum

• How many comparisons are necessary to determine the minimum of a set of n elements?
• We can easily obtain an upper bound of n − 1 comparisons:
– examine each element of the set in turn and keep track of the smallest element seen so far
• In the following procedure, we assume that the set resides in array A, where A.length = n

MINIMUM(A)
1. min ← A[1]
2. for i ← 2 to A.length
3.   if min > A[i]
4.     min ← A[i]
5. return min

• We can, of course, find the maximum with n − 1 comparisons as well


• This is the best we can do, since we can obtain a lower bound of n − 1 comparisons
– Think of any algorithm that determines the minimum as a tournament among the elements
– Each comparison is a match in the tournament in which the smaller of the two elements wins
– Observing that every element except the winner must lose at least one match, we conclude that n − 1 comparisons are necessary to determine the minimum
• Hence, the algorithm MINIMUM is optimal w.r.t. the number of comparisons performed
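The n − 1 bound is easy to verify by instrumenting the scan from the MINIMUM procedure. A sketch with an explicit comparison counter (the counter is my addition):

```python
def minimum_with_count(a):
    """Linear scan for the minimum; performs exactly len(a) - 1
    element comparisons, matching both the upper and lower bound."""
    smallest = a[0]
    comparisons = 0
    for x in a[1:]:
        comparisons += 1
        if smallest > x:
            smallest = x
    return smallest, comparisons
```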

Simultaneous minimum and maximum

• Sometimes, we must find both the minimum and the maximum of a set of n elements
• For example, a graphics program may need to scale a set of (x, y) data to fit onto a rectangular display screen or other graphical output device
• To do so, the program must first determine the minimum and maximum value of each coordinate


• Θ(n) comparisons is asymptotically optimal:
– Simply find the minimum and maximum independently, using n − 1 comparisons for each, for a total of 2n − 2 comparisons
• In fact, we can find both the minimum and the maximum using at most 3⌊n/2⌋ comparisons by maintaining both the minimum and maximum elements seen thus far
• Rather than processing each element of the input by comparing it against the current minimum and maximum, we process elements in pairs

• We compare pairs of input elements first with each other, and then we compare the smaller with the current minimum and the larger with the current maximum, at a cost of 3 comparisons for every 2 elements
• If n is odd, we set both the minimum and maximum to the value of the first element, and then we process the rest of the elements in pairs
• If n is even, we perform 1 comparison on the first 2 elements to determine the initial values of the minimum and maximum, and then process the rest of the elements in pairs as in the case for odd n


• If n is odd, then we perform
3⌊n/2⌋
comparisons
• If n is even, we perform 1 initial comparison followed by
3(n − 2)/2
comparisons, for a total of 3n/2 − 2
• Thus, in either case, the total number of comparisons is at most 3⌊n/2⌋
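The pairwise scheme above can be sketched as follows, with a counter to confirm the 3⌊n/2⌋ bound (the counting is my addition; the pairing logic follows the slide):

```python
def min_max_pairs(a):
    """Simultaneous minimum and maximum using at most
    3 * floor(n/2) comparisons, processing elements in pairs."""
    n = len(a)
    comparisons = 0
    if n % 2 == 1:
        lo = hi = a[0]          # odd n: first element seeds both
        start = 1
    else:
        comparisons += 1        # even n: 1 comparison seeds min and max
        if a[0] < a[1]:
            lo, hi = a[0], a[1]
        else:
            lo, hi = a[1], a[0]
        start = 2
    for i in range(start, n, 2):
        comparisons += 3        # 3 comparisons per remaining pair
        x, y = a[i], a[i + 1]
        if x < y:
            small, large = x, y
        else:
            small, large = y, x
        if small < lo:
            lo = small
        if large > hi:
            hi = large
    return lo, hi, comparisons
```

With n = 5 this uses 6 = 3⌊5/2⌋ comparisons; with n = 6 it uses 7 = 3·6/2 − 2.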

9.2 Selection in expected linear time

• The selection problem appears more difficult than finding a minimum, but the asymptotic running time for both is the same: Θ(n)
• A divide-and-conquer algorithm RANDOMIZED-SELECT is modeled after the quicksort algorithm
• Unlike quicksort, RANDOMIZED-SELECT works on only one side of the partition
• Whereas quicksort has an expected running time of Θ(n lg n), the expected running time of RANDOMIZED-SELECT is Θ(n), assuming that the elements are distinct


• RANDOMIZED-SELECT uses the procedure RANDOMIZED-PARTITION of RANDOMIZED-QUICKSORT

RANDOMIZED-PARTITION(A, p, r)
1. i ← RANDOM(p, r)
2. exchange A[r] with A[i]
3. return PARTITION(A, p, r)

Partitioning the array

PARTITION(A, p, r)
1. x ← A[r]
2. i ← p − 1
3. for j ← p to r − 1
4.   if A[j] ≤ x
5.     i ← i + 1
6.     exchange A[i] with A[j]
7. exchange A[i + 1] with A[r]
8. return i + 1


• PARTITION always selects an element x = A[r] as a pivot element around which to partition the subarray A[p..r]
• As the procedure runs, it partitions the array into four (possibly empty) regions
• At the start of each iteration of the for loop in lines 3–6, the regions satisfy the properties listed below


• At the beginning of each iteration of the loop of lines 3–6, for any array index k:
1. If p ≤ k ≤ i, then A[k] ≤ x
2. If i + 1 ≤ k ≤ j − 1, then A[k] > x
3. If k = r, then A[k] = x

• Indices between j and r − 1 are not covered by any case, and the values in these entries have no particular relationship to the pivot x
• The running time of PARTITION on the subarray A[p..r] is Θ(n), where n = r − p + 1

• Return the i-th smallest element of the array A[p..r]

RANDOMIZED-SELECT(A, p, r, i)
1. if p == r
2.   return A[p]
3. q ← RANDOMIZED-PARTITION(A, p, r)
4. k ← q − p + 1
5. if i == k // the pivot value is the answer
6.   return A[q]
7. elseif i < k
8.   return RANDOMIZED-SELECT(A, p, q − 1, i)
9. else return RANDOMIZED-SELECT(A, q + 1, r, i − k)


• Line (1) checks for the base case of the recursion
• Otherwise, RANDOMIZED-PARTITION partitions A[p..r] into two (possibly empty) subarrays A[p..q − 1] and A[q + 1..r] s.t. each element of the former is ≤ A[q], which is < each element of the latter
• Line (4) computes k, the number of elements in A[p..q]
• Line (5) checks whether A[q] is the i-th smallest element
• Otherwise, we determine in which of the two subarrays the i-th smallest element lies
• If i < k, then the desired element lies on the low side of the partition, and line (8) recursively selects it

• If i > k, then the desired element lies on the high side of the partition
• Since we already know k values that are smaller than the i-th smallest element of A[p..r] (namely, the elements of A[p..q]), the desired element is the (i − k)-th smallest element of A[q + 1..r], which line (9) finds recursively
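The pseudocode above transcribes almost line for line into Python (0-indexed arrays, rank i still 1-indexed; random.randint plays the role of RANDOM(p, r)):

```python
import random

def partition(a, p, r):
    """Lomuto partition of a[p..r] around pivot a[r];
    returns the pivot's final index."""
    x = a[r]
    i = p - 1
    for j in range(p, r):
        if a[j] <= x:
            i += 1
            a[i], a[j] = a[j], a[i]
    a[i + 1], a[r] = a[r], a[i + 1]
    return i + 1

def randomized_select(a, p, r, i):
    """Return the i-th smallest element of a[p..r] (rank i is 1-indexed).
    Mutates a in place, like the pseudocode."""
    if p == r:
        return a[p]
    # RANDOMIZED-PARTITION: swap a random element into position r, then partition
    s = random.randint(p, r)
    a[s], a[r] = a[r], a[s]
    q = partition(a, p, r)
    k = q - p + 1                       # number of elements in a[p..q]
    if i == k:                          # the pivot value is the answer
        return a[q]
    elif i < k:
        return randomized_select(a, p, q - 1, i)
    else:
        return randomized_select(a, q + 1, r, i - k)
```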


• The worst-case running time for RANDOMIZED-SELECT is Θ(n²), even to find the minimum
• The algorithm has a linear expected running time, though
• Because it is randomized, no particular input elicits the worst-case behavior
• Let the running time of RANDOMIZED-SELECT on an input array A[p..r] of n elements be a random variable T(n)
• We obtain an upper bound on E[T(n)] as follows

• The procedure RANDOMIZED-PARTITION is equally likely to return any element as the pivot
• Therefore, for each k such that 1 ≤ k ≤ n, the subarray A[p..q] has k elements (all less than or equal to the pivot) with probability 1/n
• For k = 1, 2, …, n, define indicator random variables X_k where
X_k = I{the subarray A[p..q] has exactly k elements}


• Assuming that the elements are distinct, we have E[X_k] = 1/n
• When we choose A[q] as the pivot element, we do not know, a priori,
1) if we will terminate immediately with the correct answer,
2) recurse on the subarray A[p..q − 1], or
3) recurse on the subarray A[q + 1..r]

• This decision depends on where the i-th smallest element falls relative to A[q]

• Assuming T(n) to be monotonically increasing, we can upper-bound the time needed for a recursive call by the time needed for a call on the largest possible input
• To obtain an upper bound, we assume that the i-th element is always on the side of the partition with the greater number of elements
• For a given call of RANDOMIZED-SELECT, the indicator random variable X_k has value 1 for exactly one value of k, and it is 0 for all other k
• When X_k = 1, the two subarrays on which we might recurse have sizes k − 1 and n − k


• We have the recurrence
T(n) ≤ Σ_{k=1}^{n} X_k · (T(max(k − 1, n − k)) + O(n))
= Σ_{k=1}^{n} X_k · T(max(k − 1, n − k)) + O(n)

• Taking expected values, we have
E[T(n)] ≤ E[Σ_{k=1}^{n} X_k · T(max(k − 1, n − k))] + O(n)
= Σ_{k=1}^{n} E[X_k · T(max(k − 1, n − k))] + O(n)

= Σ_{k=1}^{n} E[X_k] · E[T(max(k − 1, n − k))] + O(n)
= (1/n) Σ_{k=1}^{n} E[T(max(k − 1, n − k))] + O(n)

• The first equation on this slide follows by independence of the random variables X_k and T(max(k − 1, n − k))
• Consider the expression
max(k − 1, n − k) = k − 1 if k > ⌈n/2⌉,
n − k if k ≤ ⌈n/2⌉


• If n is even, each term from T(⌈n/2⌉) up to T(n − 1) appears exactly twice in the summation, and if n is odd, all these terms appear twice and T(⌊n/2⌋) appears once
• Thus, we have
E[T(n)] ≤ (2/n) Σ_{k=⌊n/2⌋}^{n−1} E[T(k)] + O(n)

• We can show that E[T(n)] = O(n) by substitution
• In summary, we can find any order statistic, and in particular the median, in expected linear time, assuming that the elements are distinct

III Data Structures

Elementary Data Structures


Hash Tables
Binary Search Trees
Red-Black Trees
Augmenting Data Structures


Dynamic sets

• Sets are fundamental to computer science
• Algorithms may require several different types of operations to be performed on sets
• For example, many algorithms need only the ability to
– insert elements into, delete elements from, and test membership in a set
• We call a dynamic set that supports these operations a dictionary

Operations on dynamic sets

• Operations on a dynamic set can be grouped into queries and modifying operations

SEARCH(S, k)
• Given a set S and a key value k, return a pointer x to an element in S such that x.key = k, or NIL if no such element belongs to S

INSERT(S, x)
• Augment the set S with the element pointed to by x. We assume that any attributes in element x have already been initialized

DELETE(S, x)
• Given a pointer x to an element in the set S, remove x from S. Note that this operation takes a pointer to an element x, not a key value

MINIMUM(S)
• A query on a totally ordered set S that returns a pointer to the element of S with the smallest key


MAXIMUM(S)
• Return a pointer to the element of S with the largest key

SUCCESSOR(S, x)
• Given an element x whose key is from a totally ordered set S, return a pointer to the next larger element in S, or NIL if x is the maximum element

PREDECESSOR(S, x)
• Given an element x whose key is from a totally ordered set S, return a pointer to the next smaller element in S, or NIL if x is the minimum element

• We usually measure the time taken to execute a set operation in terms of the size of the set
• E.g., we later describe a data structure that can support any of the operations on a set of size n in time O(lg n)

11 Hash Tables

• Often one only needs the dictionary operations INSERT, SEARCH, and DELETE
• A hash table effectively implements dictionaries
• In the worst case, searching for an element in a hash table takes Θ(n) time
• In practice, hashing performs extremely well
• Under reasonable assumptions, the average time to search for an element is O(1)


11.2 Hash tables

• When a set K of keys is stored in a dictionary, it is usually much smaller than the universe U of all possible keys
• A hash table requires storage Θ(|K|), while searching for an element in it takes only O(1) time
• The catch is that this bound is for the average-case time

• An element with key k is stored in slot h(k); that is, we use a hash function h to compute the slot from the key k
• h: U → {0, 1, …, m − 1} maps the universe U of keys into the slots of hash table T[0..m − 1]
• The size m of the hash table is typically much smaller than |U|
• We say that an element with key k hashes to slot h(k), or that h(k) is the hash value of key k
• If two keys hash to the same slot, we have a collision
• Effective techniques resolve the conflict


• The ideal solution avoids collisions altogether
• We might try to achieve this goal by choosing a suitable hash function h
• One idea is to make h appear to be “random”, thus minimizing the number of collisions
• Of course, h must be deterministic so that a key k always produces the same output h(k)
• Because |U| > m, there must be at least two keys that have the same hash value;
– avoiding collisions altogether is therefore impossible


Collision resolution by chaining

• In chaining, we place all the elements that hash to the same slot into the same linked list
• Slot j contains a pointer to the head of the list of all stored elements that hash to j
• If no such elements exist, slot j contains NIL
• The dictionary operations on a hash table are easy to implement when collisions are resolved by chaining

CHAINED-HASH-INSERT(T, x)
1. insert x at the head of list T[h(x.key)]

CHAINED-HASH-SEARCH(T, k)
1. search for an element with key k in list T[h(k)]

CHAINED-HASH-DELETE(T, x)
1. delete x from the list T[h(x.key)]


• The worst-case running time of insertion is O(1)
• It is fast in part because it assumes that the element x being inserted is not already present in the table
• For searching, the worst-case running time is proportional to the length of the list;
– we analyze this operation more closely soon
• We can delete an element (given a pointer to it) in O(1) time if the lists are doubly linked
• With singly linked lists, to delete x, we would first have to find x in the list T[h(x.key)]
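The three chained-hash operations map naturally onto Python lists standing in for linked lists. A sketch (the slot count m and the use of Python's built-in hash are my choices; the slides leave h unspecified):

```python
class ChainedHashTable:
    """Dictionary with collision resolution by chaining.
    h(k) = hash(k) mod m; each slot holds a chain of (key, value) pairs."""

    def __init__(self, m):
        self.m = m
        self.slots = [[] for _ in range(m)]

    def _h(self, key):
        return hash(key) % self.m

    def insert(self, key, value):
        # CHAINED-HASH-INSERT: insert at the head of the chain, O(1)
        self.slots[self._h(key)].insert(0, (key, value))

    def search(self, key):
        # CHAINED-HASH-SEARCH: scan the chain for the key
        for k, v in self.slots[self._h(key)]:
            if k == key:
                return v
        return None  # NIL

    def delete(self, key):
        # CHAINED-HASH-DELETE: remove the first matching entry from the chain
        chain = self.slots[self._h(key)]
        for idx, (k, _) in enumerate(chain):
            if k == key:
                del chain[idx]
                return
```

Note that deletion here takes a key and scans the chain, as with singly linked lists; the O(1) deletion described above requires a pointer into a doubly linked chain.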

Analysis of hashing with chaining

• Given a hash table T with m slots that stores n elements, define the load factor α = n/m, i.e., the average number of elements stored in a chain
• Our analysis will be in terms of α, which can be < 1, = 1, or > 1

• In the worst case, all n keys hash to one slot
• The worst-case time for searching is thus Θ(n) plus the time to compute the hash function


• Average-case performance of hashing depends on how well h distributes the set of keys to be stored among the m slots, on the average
• Simple uniform hashing (SUH):
– Assume that an element is equally likely to hash into any of the m slots, independently of where any other element has hashed to
• For j = 0, 1, …, m − 1, let us denote the length of the list T[j] by n_j, so
n = n_0 + n_1 + ⋯ + n_{m−1},
and the expected value of n_j is
E[n_j] = α = n/m

Theorem 11.1 In a hash table which resolves collisions by chaining, an unsuccessful search takes average-case time Θ(1 + α), under the SUH assumption.

Proof Under SUH, any key k not already in the table is equally likely to hash to any of the m slots.
The time to search unsuccessfully for k is the expected time to search through list T[h(k)], which has expected length E[n_{h(k)}] = α.
Thus, the expected number of elements examined is α, and the total time required (incl. computing h(k)) is Θ(1 + α).


Theorem 11.2 In a hash table which resolves collisions by chaining, a successful search takes average-case time Θ(1 + α), under SUH.

Proof We assume that the element x being searched for is equally likely to be any of the n elements stored in the table.
The number of elements examined during the search for x is one more than the number of elements that appear before x in x's list.
Elements before x in the list were all inserted after x was inserted.

• Let us take the average (over the n table elements x) of the expected number of elements added to x's list after x was added to the list, plus 1
• Let x_i, i = 1, 2, …, n, denote the i-th element inserted into the table, and let k_i = x_i.key
• For keys k_i and k_j, define the indicator random variable X_ij = I{h(k_i) = h(k_j)}
• Under SUH, Pr{h(k_i) = h(k_j)} = 1/m, and by Lemma 5.1, E[X_ij] = 1/m


• Thus, the expected number of elements examined in a successful search is
E[(1/n) Σ_{i=1}^{n} (1 + Σ_{j=i+1}^{n} X_ij)]
= 1 + (1/n) Σ_{i=1}^{n} Σ_{j=i+1}^{n} E[X_ij]
= 1 + (1/n) Σ_{i=1}^{n} Σ_{j=i+1}^{n} 1/m

= 1 + (1/(nm)) Σ_{i=1}^{n} (n − i)
= 1 + (1/(nm)) (Σ_{i=1}^{n} n − Σ_{i=1}^{n} i)
= 1 + (1/(nm)) (n² − n(n + 1)/2)
= 1 + (n − 1)/(2m)
= 1 + α/2 − α/(2n)

Thus, the total time required for a successful search is
Θ(2 + α/2 − α/(2n)) = Θ(1 + α).
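The closed form just derived can be coded up to see how the two equivalent expressions agree; a sketch (the names are mine):

```python
def expected_probes_successful(n, m):
    """Expected elements examined in a successful search under simple
    uniform hashing: 1 + (n - 1)/(2m), equivalently
    1 + alpha/2 - alpha/(2n) with alpha = n/m."""
    alpha = n / m
    return 1 + alpha / 2 - alpha / (2 * n)
```

For n = 100 keys in m = 50 slots (α = 2), this gives 1.99 expected probes, matching 1 + (n − 1)/(2m) = 1 + 99/100.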


• If the number of hash-table slots is at least proportional to the number of elements in the table, we have n = O(m) and, consequently,
α = n/m = O(m)/m = O(1)

• Thus, searching takes constant time on average
• Insertion takes O(1) worst-case time
• Deletion takes O(1) worst-case time when the lists are doubly linked
• Hence, we can support all dictionary operations in O(1) time on average
