
Data Structures and Algorithms

(CS210/ESO207/ESO211)
Lecture 36

Sorting
beyond O(n log n) bound
Overview of today's lecture


The sorting algorithms you studied till now

Integer sorting

Solving 2 problems from Practice sheet 6 and one problem
from Practice sheet 5.




Sorting algorithms studied till now
Algorithms for Sorting n elements
Insertion sort: O(n^2)
Selection sort: O(n^2)
Bubble sort: O(n^2)
Merge sort: O(n log n)
Quick sort: worst case O(n^2), average case O(n log n)
Heap sort: O(n log n)

Question: What is common among these algorithms?
Answer: All of them are allowed to use only the comparison operation to perform sorting.

Question: Can we sort in O(n) time?

The answer depends upon
the model of computation.
the domain of input.


Theorem (to be proved in CS345): Every comparison based sorting algorithm must perform Ω(n log n) comparisons in the worst case.
Word RAM model of computation:
Characteristics
Word is the basic storage unit of RAM. A word is a collection of a few bytes.

Each input item (number, name) is stored in binary format.

RAM can be viewed as a huge array of words. Any arbitrary location of
RAM can be accessed in the same time irrespective of the location.

Data as well as Program reside fully in RAM.

Each arithmetic or logical operation (+, -, *, /, or, xor, ...) involving a constant number of words takes a constant number of steps by the CPU.
More precisely: each arithmetic or logical operation (+, -, *, /, or, xor, ...) involving O(log n) bits takes a constant number of steps by the CPU, where n is the number of bits of the input instance.
Integer sorting
Counting sort: algorithm for sorting integers



Input: An array A storing n integers in the range [0..k-1].
Output: Sorted array A.
Running time: O(n + k) in word RAM model of computation.
Extra space: O(k)

Example (n = 8, k = 6):

Index:  0 1 2 3 4 5 6 7
A:      2 5 3 0 2 3 0 3

Value:  0 1 2 3 4 5
Count:  2 0 2 3 0 1
Place:  2 2 4 7 7 8

B is filled by scanning A from right to left:
A[7] = 3 is placed at B[6], and Place[3] becomes 6.
A[6] = 0 is placed at B[1], and Place[0] becomes 1.
A[5] = 3 is placed at B[5], and Place[3] becomes 5.
Counting sort: algorithm for sorting integers
Algorithm (A[0..n-1], k)
For j = 0 to k-1 do Count[j] ← 0;
For i = 0 to n-1 do Count[A[i]] ← Count[A[i]] + 1;

For j = 0 to k-1 do Place[j] ← Count[j];
For j = 1 to k-1 do Place[j] ← Place[j-1] + Count[j];

For i = n-1 to 0 do
{ B[Place[A[i]]-1] ← A[i];
  Place[A[i]] ← Place[A[i]]-1;
}
return B;
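The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the lecture's own code: the function name counting_sort is an assumption, and the input is assumed to contain only integers in the range [0..k-1].

```python
def counting_sort(A, k):
    """Stable counting sort of integers in the range [0, k-1] (sketch)."""
    n = len(A)

    # Count[v] = number of occurrences of value v in A.
    count = [0] * k
    for x in A:
        count[x] += 1

    # Place[v] = number of elements of A that are <= v (prefix sums of Count).
    place = count[:]
    for v in range(1, k):
        place[v] = place[v - 1] + count[v]

    # Scan A from right to left so that equal keys keep their relative order.
    B = [None] * n
    for i in range(n - 1, -1, -1):
        v = A[i]
        B[place[v] - 1] = v
        place[v] -= 1
    return B


print(counting_sort([2, 5, 3, 0, 2, 3, 0, 3], 6))  # [0, 0, 2, 2, 3, 3, 3, 5]
```

The three loops take O(k), O(k) and O(n) steps beyond the O(n) counting pass, which matches the O(n + k) bound.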
Counting sort: algorithm for sorting integers

Note: The algorithm performs arithmetic operations involving O(log n + log k) bits. In word RAM model, it takes O(1) time for such an operation.

Theorem: An array storing n integers in the range [0..k-1] can be sorted in O(n+k) time and using total O(n+k) space in word RAM model.

For k = O(n), we get an optimal algorithm for sorting. But what if k is large?

In the next class:
We shall discuss an algorithm for sorting n integers in the range [0..n^d] in O(dn) time and using O(n) space in word RAM model.


Practice sheet 6
We shall solve exercises 5 and 1 from this sheet
Important note




Though the solution is provided for this problem here, one should NOT feel that such
a problem will be asked in the end sem exam of this course. It was a mistake of the
instructor to put it in the practice sheet.
Problem 5 of practice sheet 6.
Description (in terms of intervals):
Given a set A of n intervals, compute a smallest set B of intervals from A such that every interval I in A\B overlaps (intersects) some interval in B.









[Figure: an instance A of intervals. The set of green intervals is a solution but not an optimal solution.]
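To make the problem statement concrete, here is a small sketch of a solution checker and an exponential brute-force solver. Everything in it is an assumption for illustration: intervals are closed (start, finish) tuples, two intervals overlap when they share a point, all intervals are distinct, and the instance A is hypothetical rather than the one drawn on the slide.

```python
from itertools import combinations


def overlaps(x, y):
    """Closed intervals (start, finish) overlap if they share a point."""
    return x[0] <= y[1] and y[0] <= x[1]


def is_solution(A, B):
    """True if every interval of A outside B is overlapped by some interval of B."""
    return all(any(overlaps(i, b) for b in B) for i in A if i not in B)


def brute_force_smallest(A):
    """Try subsets of A in increasing size; exponential, for tiny inputs only."""
    for size in range(len(A) + 1):
        for B in combinations(A, size):
            if is_solution(A, B):
                return list(B)


A = [(0, 2), (1, 5), (4, 20), (6, 7), (8, 9), (10, 11)]
print(is_solution(A, [(1, 5), (6, 7), (8, 9), (10, 11)]))  # True: valid, but uses 4 intervals
print(len(brute_force_smallest(A)))                        # 2
```

Such a brute-force solver is only useful for checking a faster algorithm on small instances.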
Solution of Problem 5 of practice sheet 6.








Let I* be the interval with earliest finish time.
Let I' be the interval with maximum finish time overlapping I*.
Lemma1: There is an optimal solution for set A that contains I'.
[Figure: instance A with the intervals I* and I' marked.]
Solution of Problem 5 of practice sheet 6.
Question: How to obtain a smaller instance A' using this greedy approach?
Naive approach (again inspired by the job scheduling problem): remove from A all intervals which overlap with I'. The remaining set is A'.
This approach does not work! Here is a counterexample.

[Figure: counterexample instance A with the intervals I*, I' and I'' marked.]

The problem is that some deleted interval (in this case I'') could have been used for intersecting many intervals if it were not deleted. But deleting it from the instance prevents it from being selected in the solution.
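The failure of the naive approach can be reproduced on a small instance. The sketch below is illustrative only: the instance is hypothetical (it is not the one drawn on the slide), intervals are closed (start, finish) tuples, and naive_greedy is an assumed name. Here I* = (0, 2), I' = (1, 5), and (4, 20) plays the role of the deleted interval that should have been kept.

```python
def overlaps(x, y):
    """Closed intervals (start, finish) overlap if they share a point."""
    return x[0] <= y[1] and y[0] <= x[1]


def naive_greedy(A):
    """Naive rule: pick I', then delete every interval overlapping I'."""
    remaining, chosen = list(A), []
    while remaining:
        # I*: interval with earliest finish time.
        i_star = min(remaining, key=lambda t: t[1])
        # I': interval with maximum finish time overlapping I*.
        i_prime = max((y for y in remaining if overlaps(i_star, y)),
                      key=lambda t: t[1])
        chosen.append(i_prime)
        remaining = [y for y in remaining if not overlaps(i_prime, y)]
    return chosen


A = [(0, 2), (1, 5), (4, 20), (6, 7), (8, 9), (10, 11)]
print(naive_greedy(A))  # [(1, 5), (6, 7), (8, 9), (10, 11)]
```

The naive rule selects four intervals, although the two intervals (1, 5) and (4, 20) already overlap everything else.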
Overview of the approach


In order to make sure we do not delete intervals (like I'' in the previous slide) that are essential for covering many other intervals, we make some observations and introduce the notion of a uniquely covered interval. It turns out that we need to keep I'' in the smaller instance if there is an interval there which is uniquely covered by I''. Otherwise, we may discard I''.
An Observation
We can delete all intervals whose finish time is before the finish time of I', because any interval overlapped by such an interval will anyway be overlapped by I'. Let us consider intervals which overlap with I' but have finish time greater than that of I'. In the example shown below, these are the three intervals which cross the red line (the finish time of I').

[Figure: instance A with I*, I' and I'' marked; three intervals cross the red line.]

Observation1: Among the intervals crossing the red line, we need to keep only the interval which has maximum finish time (I'' in this picture).
Proof: Notice that each of these intervals is anyway intersected by I'. As far as using them to intersect other intervals is concerned, we may as well choose I'' for this purpose.
So from now onwards, we shall assume that there is exactly one interval I'' in A which overlaps I' (intersects the red line) and has finish time larger than that of I'.
Uniquely covered interval







I2 is said to be uniquely covered by I1 if
1. I2 is fully covered by I1, and
2. every interval overlapping I2 is also fully covered by I1.
Lemma2: There is an optimal solution containing I1.
Proof: Surely I2 or some other interval overlapping it must be in the optimal solution. If we replace that interval by I1, we still get a solution of the same size and hence an optimal solution.

[Figure: interval I2 lying inside interval I1.]
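The definition can be written as a predicate. This is a sketch under assumptions: intervals are closed (start, finish) tuples, and the names fully_covers and is_uniquely_covered as well as the sample instance are hypothetical.

```python
def overlaps(x, y):
    """Closed intervals (start, finish) overlap if they share a point."""
    return x[0] <= y[1] and y[0] <= x[1]


def fully_covers(outer, inner):
    """True if inner lies entirely within outer."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def is_uniquely_covered(i2, i1, A):
    """True if I2, and every interval of A overlapping I2, is fully covered by I1."""
    return fully_covers(i1, i2) and all(
        fully_covers(i1, y) for y in A if overlaps(y, i2))


A = [(4, 20), (6, 7), (8, 9), (15, 19), (19, 25)]
print(is_uniquely_covered((6, 7), (4, 20), A))    # True
print(is_uniquely_covered((15, 19), (4, 20), A))  # False: (19, 25) overlaps it but sticks out
```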



We are now ready to describe the construction of A' from A. There will be two cases. We shall then prove that |Opt(A)| = |Opt(A')| + 1 for each of these cases.

Important note:
The reader is advised to fully understand Lemma1, Lemma2, Observation1, and the notion of a uniquely covered interval. Also fully internalize the notations I*, I', and I''. This will help the reader understand the rest of the solution.
Constructing A' from A

[Figure: instance A with I*, I' and I'' marked; the red line is the finish time of I'.]

We need to take care of intervals whose starting point is to the right of the red line (finish time of I'). We can partition these intervals into two sets.
D: those which overlap with I''.
E: those that start after the end of I'' and hence do not overlap with I''.
Now we shall describe the two cases for the construction of A'.

Case1: There is an interval in D uniquely covered by I''
[Figure: instance A and the smaller instance A', which consists of I'', D and E.]
Constructing A' from A

If there is an interval in D uniquely covered by I'', then we define A' as follows. Remove all intervals from A which overlap with I' (this was our usual way of defining A' in our wrong solution). Now add I'' to this set. This set is the smaller instance A' for Case 1.

We shall now define A' for Case 2.

Constructing A' from A

Case2: There is no interval uniquely covered by I''
[Figure: instance A (with I*, I', I'', D and E) and the smaller instance A', which consists of D and E only.]
Constructing A' from A

If there is no interval in D uniquely covered by I'', then we define A' as follows. Remove all intervals from A which overlap with I' (this was our usual way of defining A' in our wrong solution). This set is the smaller instance A' for Case 2.

Theorem1: |Opt(A)| = |Opt(A')| + 1

We shall prove this theorem for Case 1 as well as Case 2.
Case1: There is an interval in D uniquely covered by I''
|Opt(A)| ≤ |Opt(A')| + 1

[Figure: instance A (with I*, I', I'', D and E) and the smaller instance A' (with I'', D and E).]
Using Lemma2, it follows that there is an optimal solution for A' containing I''.
What should we add to this solution to get a solution for A?
We need to add just I' to get a solution for A and we are done.
Case1: There is an interval in D uniquely covered by I''
|Opt(A')| ≤ |Opt(A)| - 1

[Figure: instance A (with I*, I', I'', D and E) and the smaller instance A' (with I'', D and E).]
Using Lemma1 and Lemma2, it follows that there is an optimal solution for A containing I' and I''.
We need to just remove I' from this optimal solution for A to get a solution for A' and we are done.

This finishes the proof of Theorem1 for Case 1.
We shall now analyze Case 2 and prove Theorem1 for this case as well.
Case2: There is no interval uniquely covered by I''
|Opt(A)| ≤ |Opt(A')| + 1

[Figure: instance A (with I*, I', I'', D and E) and the smaller instance A' (with D and E).]
Consider any optimal solution for A'. Note that this optimal solution takes care of D and E.
So we just need to take care of intervals from A which intersect the red line.
These are taken care of by adding I' to this solution. We are done.
Case2: There is no interval uniquely covered by I''
|Opt(A')| ≤ |Opt(A)| - 1

[Figure: instance A (with I*, I', I'', D and E) and the smaller instance A' (with D and E); the violet line is the finish time of I''.]
Using Lemma1, it follows that there is an optimal solution for A containing I'.
If I'' is not in this optimal solution, we can see that removing I' from this optimal solution gives a valid solution for A'.
So let us consider the case when I'' is present in the optimal solution of A.
The problem is that I'' is not present in A', so we need a substitute for I'' from A'.
Notice that I'' can serve the purpose of overlapping intervals from D only. So we should search for a substitute for I'' from D only.
We replace I'' by the interval from D which intersects the violet line and has earliest start time. See the following slide for its justification.
Let J be the interval in D which intersects the violet vertical line (has finish time greater than that of I'') and has earliest start time. It suffices to show that every interval of D overlaps with J. We proceed as follows. Consider any interval K in D. There are two cases.
1. Finish time of K is less than that of I''. In other words, K does not intersect the violet line. In this case, there must be some other interval in D that overlaps K and intersects the violet line (otherwise, K would be uniquely covered by I''); since the start time of J is no later than that of this interval, K is overlapped by J as well.
2. Finish time of K is more than that of I''. In other words, K does intersect the violet line. Hence K overlaps with J as well, since the latter also intersects the violet line.
Hence if we remove I' and I'' from the given optimal solution of A, and add J to it, we get a solution for A'. Since an optimal solution for A' has to be smaller than or equal in size to this solution, we get |Opt(A')| ≤ |Opt(A)| - 1 for Case 2.

Hence we have proved Theorem1: |Opt(A)| = |Opt(A')| + 1
Now we need to design the algorithm for our problem based on the greedy strategy that we used for constructing A' from A.
Simplification and efficient implementation of
the algorithm

Though the algorithm looks quite complex, it is, as will soon become clear, quite simple to implement. We first introduce some notations to facilitate a clean representation of the algorithm.

Notations:
f(I): finish time of interval I.
Maxf(I,A): maximum finish time of an interval from A that overlaps with I. (If no interval overlaps with I, then Maxf(I,A) = f(I).)
Maxf-Interval(I,A): the interval from A with maximum finish time that overlaps with I. (If no interval overlaps with I, then Maxf-Interval(I,A) = I.)
Cover: set of intervals selected till now. (At the end of the algorithm, Cover will be an optimal solution.)
][: Empty interval.
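The notations f, Maxf and Maxf-Interval translate directly into small helper functions. The sketch below assumes closed (start, finish) tuples; the Python names and the sample instance are hypothetical.

```python
def overlaps(x, y):
    """Closed intervals (start, finish) overlap if they share a point."""
    return x[0] <= y[1] and y[0] <= x[1]


def f(interval):
    """Finish time of an interval."""
    return interval[1]


def maxf_interval(i, A):
    """Interval of A with maximum finish time that overlaps i; i itself if none does."""
    candidates = [y for y in A if overlaps(i, y)]
    return max(candidates, key=f) if candidates else i


def maxf(i, A):
    """Maximum finish time of an interval of A that overlaps i; f(i) if none does."""
    return f(maxf_interval(i, A))


A = [(0, 2), (1, 5), (4, 20)]
print(maxf_interval((0, 2), A))    # (1, 5)
print(maxf((0, 2), A))             # 5
print(maxf_interval((30, 31), A))  # (30, 31), since nothing in A overlaps it
```

Each call scans A once, so it costs O(n) time for an instance of n intervals.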
Algorithm

I'' ← ][; Cover ← ∅; A' ← A;
While A' <> ∅ do
{ If (I'' = ][)
  { let I* be the interval in A' with earliest finish time;
    let I' ← Maxf-Interval(I*, A');
    Cover ← Cover U {I'};
    I'' ← Maxf-Interval(I', A');
    remove all intervals from A' that are overlapped by I';
  }
  Else If (there is an interval K in A' with Maxf(K, A') < f(I''))
  { I' ← I'';
    Cover ← Cover U {I'};
    I'' ← Maxf-Interval(I', A');
    remove all intervals from A' that are overlapped by I';
  }
  Else I'' ← ][;
}
return Cover;
Algorithm
(further refinements of the same algo)



I'' ← ][; Cover ← ∅; A' ← A;
While A' <> ∅ do
{ let I* be the interval in A' with earliest finish time;
  If (Maxf(I*, A') < f(I'')) I' ← I'';
  Else I' ← Maxf-Interval(I*, A');
  Cover ← Cover U {I'};
  I'' ← Maxf-Interval(I', A');
  remove all intervals from A' that are overlapped by I';
}
return Cover;

Here the test Maxf(I*, A') < f(I'') is taken to fail when I'' is the empty interval ][.
It is easy to observe that each iteration of the while loop can be implemented in O(n) time.
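A possible Python rendering of the refined algorithm is sketched below. It reflects one reading of the pseudocode and is not the lecture's own code: intervals are assumed to be closed (start, finish) tuples, None stands for the empty interval ][, and the names smallest_overlapping_set, carried (for I'') and remaining (for A') are hypothetical.

```python
def overlaps(x, y):
    """Closed intervals (start, finish) overlap if they share a point."""
    return x[0] <= y[1] and y[0] <= x[1]


def maxf_interval(i, pool):
    """Interval of pool with maximum finish time that overlaps i; i itself if none does."""
    candidates = [y for y in pool if overlaps(i, y)]
    return max(candidates, key=lambda t: t[1]) if candidates else i


def smallest_overlapping_set(A):
    remaining = list(A)   # A'
    cover = []            # Cover
    carried = None        # I''; None plays the role of the empty interval ][
    while remaining:
        i_star = min(remaining, key=lambda t: t[1])   # earliest finish time in A'
        reach = maxf_interval(i_star, remaining)
        if carried is not None and reach[1] < carried[1]:
            i_prime = carried        # I* is uniquely covered by I'': select I''
        else:
            i_prime = reach          # otherwise select Maxf-Interval(I*, A')
        cover.append(i_prime)
        carried = maxf_interval(i_prime, remaining)
        remaining = [y for y in remaining if not overlaps(i_prime, y)]
    return cover


print(smallest_overlapping_set([(0, 2), (1, 5), (4, 20), (6, 7), (8, 9), (10, 11)]))
# [(1, 5), (4, 20)]
```

Every iteration removes at least I* from A', so there are at most n iterations, each doing a constant number of O(n) scans.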
Proof of correctness for the algorithm
Though we derived a proof of correctness while arriving at the algorithm, the same can be given now as well. This may be helpful if you are not interested in the way we arrived at the algorithm and just wish to see the correctness of the algorithm.

Let Overlapped = A\A'. In plain words, Overlapped is the set of intervals from A which are overlapped by some interval from Cover.

In the beginning of an iteration, the following assertions hold:
1. There is an optimal solution for A containing Cover.
2. Every interval from Overlapped is overlapped by an interval from Cover, and I'' is an interval with maximum finish time from the set Overlapped.
3. Every interval from A' has start time greater than the finish time of any interval from Cover.

The above assertions can be proved by induction on the number of iterations. The arguments needed are a small collection of the arguments used for proving Theorem1.


Concluding slide for exercise 5



Theorem:
There is an O(n^2) time algorithm for computing a smallest subset of intervals overlapping a given set of intervals.
Problem 1 of practice sheet 6



Given an array A storing n elements, and a number k, compute the k elements nearest to the median. Time complexity should be O(n).

Hint: Use the following tools.
Divide and conquer strategy as used in problem 2 of the same practice sheet.
Linear time median finding algorithm.
You need to divide the problem to half the size in each step.
Firstly, we may prune our search domain from n to 2k as follows.
Find the median, let it be m.
Find the element with rank n/2 - k, let it be x. Remove all elements smaller than x (justify it).
Find the element with rank n/2 + k, let it be y. Remove all elements greater than y (justify it).
Time spent till now is O(n).
The k nearest elements of the median are surely among these remaining 2k elements.
Now find the element with rank k/2 among the remaining elements, let it be p. Find the element with rank 3k/2, let it be q. If p is closer to m than q, then we can conclude the following:
1. All elements greater than p and less than m must be among the set of k nearest elements from m. These k/2 elements are eliminated from the input and added to our solution.
2. None of the elements which are greater than q can be among the set of k nearest elements from m. These k/2 elements are also removed from the input.
(The case when q is closer to m than p is symmetric.)
In this way, we have found k/2 nearest elements from m. Moreover, the input has reduced from 2k to k. Keep repeating it. We get the k nearest elements from m in O(n) time.
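For comparison, here is a short sketch that reaches the same goal by a different and simpler route than the halving scheme above: it finds the median by selection and then selects on the distances |x - m| directly. It uses randomized quickselect (expected O(n) time) in place of the deterministic linear time median finding algorithm. The function names are hypothetical, the median itself is assumed not to count among its k nearest elements, 1 <= k <= n-1 is assumed, and ties are broken arbitrarily.

```python
import random


def select(items, rank, key=lambda x: x):
    """Return an element of the given 1-based rank (ordered by key); expected O(n)."""
    items = list(items)
    while True:
        pivot = key(random.choice(items))
        smaller = [x for x in items if key(x) < pivot]
        equal = [x for x in items if key(x) == pivot]
        if rank <= len(smaller):
            items = smaller
        elif rank <= len(smaller) + len(equal):
            return equal[0]
        else:
            rank -= len(smaller) + len(equal)
            items = [x for x in items if key(x) > pivot]


def k_nearest_to_median(A, k):
    """k elements of A nearest to the (lower) median, excluding the median itself."""
    m = select(A, (len(A) + 1) // 2)
    others = list(A)
    others.remove(m)                       # drop one copy of the median
    # Distance of the k-th nearest element from m.
    cutoff = abs(select(others, k, key=lambda x: abs(x - m)) - m)
    nearer = [x for x in others if abs(x - m) < cutoff]
    ties = [x for x in others if abs(x - m) == cutoff]
    return nearer + ties[:k - len(nearer)]


print(sorted(k_nearest_to_median([10, 1, 7, 3, 12, 5, 8], 2)))  # [5, 8]
```

In the example the median is 7, and its two nearest elements are 8 (distance 1) and 5 (distance 2).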
Finding DFS tree from start and finish time
There was a problem in practice sheet 5 where, given the start time and finish time of a DFS traversal for all vertices, the aim is to compute the DFN number and the DFS tree.

A few students were facing the problem of determining the children of a node in the DFS tree. An easy way to achieve this goal is an indirect one:
In order to compute the children of each vertex in the DFS tree, it suffices to compute the parent of each vertex. We can do the latter task as follows.
Among all neighbours of a vertex u, find all those vertices whose start time is smaller than that of u. All these vertices are ancestors of u. Which among them is the parent of u? Surely, the vertex with maximum start time. So we can compute the parent of vertex u in O(deg(u)) time. Time spent over all vertices will be O(m+n). Hence we can compute the children of each vertex in the DFS tree, and hence the entire DFS tree structure, in O(m+n) time.
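The parent computation described above can be sketched as follows. The sketch assumes an undirected graph given as adjacency lists (a dict from vertex to list of neighbours) and a dict of DFS start times; the function names and the sample graph are hypothetical. The root of a DFS tree has no neighbour with a smaller start time, so its parent is reported as None.

```python
def dfs_parents(adj, start):
    """Parent of each vertex in the DFS tree, computed from start times only."""
    parent = {}
    for u in adj:
        # Neighbours that started earlier than u are ancestors of u.
        earlier = [v for v in adj[u] if start[v] < start[u]]
        # The parent is the ancestor neighbour with maximum start time.
        parent[u] = max(earlier, key=lambda v: start[v]) if earlier else None
    return parent


def children_from_parents(parent):
    """Invert the parent map to get the children of each vertex."""
    children = {u: [] for u in parent}
    for u, p in parent.items():
        if p is not None:
            children[p].append(u)
    return children


# Edges: a-b, a-c, b-c, c-d. A DFS from a visits a, b, c, d in this order.
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
start = {"a": 1, "b": 2, "c": 3, "d": 4}
parent = dfs_parents(adj, start)
print(parent)                         # {'a': None, 'b': 'a', 'c': 'b', 'd': 'c'}
print(children_from_parents(parent))  # {'a': ['b'], 'b': ['c'], 'c': ['d'], 'd': []}
```

Each adjacency list is scanned once, which gives the O(m+n) bound stated above.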
