ENGINEERING LAB MANUAL
Prepared By: Vibha Lahane
List of Practicals
Group A
1. Using Divide and Conquer Strategies, design a function for Binary Search using C++/Java/Python/Scala.
2. Using Divide and Conquer Strategies, design a class for Concurrent Quick Sort using C++.

Assignment No: 01
Title: Using Divide and Conquer Strategies, design a function for Binary Search using C++/Java/Python/Scala.
Prerequisites:
Knowledge of writing programs in C++.
Objectives:
To learn the concept of Divide and Conquer Strategy.
To study the design and implementation of Binary Search algorithm.
Theory:
Divide and Conquer strategy:
A divide and conquer algorithm works by recursively breaking down a problem into two or more sub-
problems of the same (or related) type, until these become simple enough to be solved directly. The
solutions to the sub-problems are then combined to give a solution to the original problem.
This technique is the basis of efficient algorithms for all kinds of problems, such as sorting (e.g.,
quicksort, merge sort), multiplying large numbers, syntactic analysis (e.g., top-down parsers) and
computing the discrete Fourier transform (FFTs).
Searching
Sequential Algorithm
function sequential(T[1 .. n], x)
This algorithm clearly takes a time in Θ(r), where r is the index returned: this is Θ(n) in the worst case and Θ(1) in the best case. If we assume that all the elements of T are distinct, that x is indeed somewhere in the array, and that it is equally likely to be at any position, the average time is also in Θ(n).
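The body of the sequential search is not reproduced above. A minimal C++ sketch of such a search over a sorted array (returning the first index whose element is at least x, consistent with the Θ(r) analysis) might look like this:

#include <vector>

// Sequential (linear) search in a sorted array T[0..n-1].
// Returns the first index i with T[i] >= x, or n if every element is smaller.
// Running time is proportional to the index returned: O(n) worst case, O(1) best case.
int sequentialSearch(const std::vector<int>& T, int x) {
    for (std::size_t i = 0; i < T.size(); ++i) {
        if (T[i] >= x) {
            return static_cast<int>(i);
        }
    }
    return static_cast<int>(T.size());   // x is larger than every element
}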
Binary Search
The binary search algorithm begins by comparing the target value to value of the middle element of the
sorted array. If the target value is equal to the middle element's value, the position is returned. If the
target value is smaller, the search continues on the lower half of the array, or if the target value is
larger, the search continues on the upper half of the array. This process continues until the element is
found and its position is returned, or there are no more elements left to search for in the array and a
"not found" indicator is returned.
Binary search can be applied to sorted list only. It searches sorted lists using a divide and conquer
technique. On each iteration the search domain is cut in half, until the result is found. The
computational complexity of a binary search is O(log n).
function binrec(T[i .. j], x)
{ binary search for x in sorted subarray T[i .. j] }
if i = j then return i
k ← (i + j + 1) div 2
if x < T[k] then return binrec(T[i .. k - 1], x)
else return binrec(T[k .. j], x)
Binary searching is the algorithm used to look up a word in a dictionary or a name in a telephone
directory. It is probably the simplest application of divide-and-conquer. It can be applied to a sorted list
only.
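As one possible C++ implementation of the assignment (a sketch, not the only acceptable design), a recursive divide-and-conquer binary search can be written as follows:

#include <iostream>
#include <vector>

// Recursive binary search in the sorted range T[low..high].
// Returns the index of x, or -1 if x is not present.
int binarySearch(const std::vector<int>& T, int x, int low, int high) {
    if (low > high) {
        return -1;                        // empty range: not found
    }
    int mid = low + (high - low) / 2;     // middle element of the current range
    if (T[mid] == x) {
        return mid;
    } else if (x < T[mid]) {
        return binarySearch(T, x, low, mid - 1);    // search the lower half
    } else {
        return binarySearch(T, x, mid + 1, high);   // search the upper half
    }
}

int main() {
    std::vector<int> data = {2, 5, 8, 12, 16, 23, 38, 56, 72, 91};
    int pos = binarySearch(data, 23, 0, static_cast<int>(data.size()) - 1);
    if (pos >= 0)
        std::cout << "Found at index " << pos << "\n";
    else
        std::cout << "Not found\n";
    return 0;
}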
Conclusion:
The concept of divide and conquer strategy is studied and binary search algorithm is implemented
using C++.
FAQs:
1) What is Divide and Conquer approach? Also explain its advantages.
3) Explain the need of analysis of algorithm with respect to complexities as well as techniques
used for analysis.
4) Compute time complexity and space complexity of your program. Also give the proper
justification for same.
5) Compare the conventional Binary Search algorithm and the Divide and Conquer Binary Search
algorithm. Also explain the advantages of Divide and Conquer approach in terms of quick sort.
6) Compare Divide and Conquer, Concurrent programming, Backtracking, and Branch and Bound approaches.
Assignment No: 02
Title: Using Divide and Conquer Strategies design a class for Concurrent Quick Sort using
C++.
Prerequisites:
Knowledge of writing programs in C++.
Objectives:
To learn the concept of Divide and Conquer Strategy.
To study the design and implementation of Quick Sort algorithm.
Theory:
Divide and Conquer strategy:
A divide and conquer algorithm works by recursively breaking down a problem into two or more sub-
problems of the same (or related) type, until these become simple enough to be solved directly. The
solutions to the sub-problems are then combined to give a solution to the original problem.
This technique is the basis of efficient algorithms for all kinds of problems, such as sorting (e.g.,
quicksort, merge sort), multiplying large numbers, syntactic analysis (e.g., top-down parsers) and
computing the discrete Fourier transform (FFTs).
Sorting
Quick Sort
The sorting algorithm invented by Hoare, usually known as "quicksort", is also based on the idea of
divide-and-conquer. As a first step, this algorithm chooses one of the items in the array to be sorted as
the pivot. The array is then partitioned on either side of the pivot, elements are moved in such a way
that those greater than the pivot are placed on its right, whereas all the others are moved to its left. If
now the two sections of the array on either side of the pivot are sorted independently by recursive calls
of the algorithm, the final result is a completely sorted array, no subsequent merge step being necessary.
To balance the sizes of the two sub instances to be sorted, we would like to use the median element as
the pivot. Finding the median takes more time than it is worth. For this reason we simply use the first
element of the array as the pivot. The quick sort algorithm is given below.
procedure quicksort(T[i .. j])
{ sorts array T[i .. j] into increasing order }
if j - i is small then insert(T[i .. j])
else
    pivot(T[i .. j], l)
    quicksort(T[i .. l - 1])
    quicksort(T[l + 1 .. j])
Let p = T[i] be the pivot. One good way of pivoting consists of scanning the array T[i .. j] just once, but starting at both ends. Pointers k and l are initialized to i and j + 1, respectively. Pointer k is then incremented until T[k] > p, and pointer l is decremented until T[l] ≤ p. Now T[k] and T[l] are interchanged. This process continues as long as k < l. Finally, T[i] and T[l] are interchanged to put the pivot in its correct position.
procedure pivot(T[i .. j]; var l)
{ permutes the elements in array T[i .. j] in such a way that, at the end,
  i ≤ l ≤ j, the elements of T[i .. l-1] are not greater than p,
  T[l] = p, and the elements of T[l+1 .. j] are greater than p,
  where p is the initial value of T[i] }
p ← T[i]
k ← i; l ← j + 1
repeat k ← k + 1 until T[k] > p or k ≥ j
repeat l ← l - 1 until T[l] ≤ p
while k < l do
    interchange T[k] and T[l]
    repeat k ← k + 1 until T[k] > p
    repeat l ← l - 1 until T[l] ≤ p
interchange T[i] and T[l]
Quicksort is a recursive, comparison-based sorting algorithm. It selects one key of the list as the pivot and finds the position in the list where that key should be placed, so that i) the keys smaller than the pivot end up on the low side of the pivot, and ii) the keys larger than or equal to the pivot end up on the high side of the pivot. The same procedure is then applied recursively to these two parts.
The average time complexity of Quick Sort is O(n log n). The worst-case time complexity is O(n²).
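One possible C++ sketch of a concurrent quicksort (using std::async from the standard library to sort the two partitions in parallel; the simpler Lomuto-style partition and the size threshold for spawning a task are choices made here, not part of the pseudocode above) is:

#include <algorithm>
#include <future>
#include <iostream>
#include <vector>

// Partition T[lo..hi] around the first element (the pivot).
// Returns the final position of the pivot.
int partition(std::vector<int>& T, int lo, int hi) {
    int pivot = T[lo];
    int l = lo;
    for (int k = lo + 1; k <= hi; ++k) {
        if (T[k] < pivot) {
            ++l;
            std::swap(T[k], T[l]);
        }
    }
    std::swap(T[lo], T[l]);    // put the pivot into its correct position
    return l;
}

// Concurrent quicksort: the two sub-arrays on either side of the pivot are
// independent, so they can be sorted by separate tasks.
void quicksort(std::vector<int>& T, int lo, int hi) {
    if (lo >= hi) return;
    int l = partition(T, lo, hi);
    if (hi - lo > 1000) {                  // spawn a task only for large ranges
        auto left = std::async(std::launch::async,
                               [&T, lo, l] { quicksort(T, lo, l - 1); });
        quicksort(T, l + 1, hi);
        left.wait();
    } else {                               // small ranges: plain recursion
        quicksort(T, lo, l - 1);
        quicksort(T, l + 1, hi);
    }
}

int main() {
    std::vector<int> data = {9, 3, 7, 1, 8, 2, 5, 6, 4};
    quicksort(data, 0, static_cast<int>(data.size()) - 1);
    for (int v : data) std::cout << v << ' ';
    std::cout << '\n';
    return 0;
}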
Flow Chart for Quick Sort using Divide and Conquer Approach.
Conclusion:
The concept of divide and conquer strategy is studied and Concurrent Quick Sort algorithm is
implemented using C++.
FAQs
1) Explain the need of Divide and Conquer approach for Quick Sort.
2) What is the advantage of the Divide and Conquer technique over plain recursion?
3) Compare the conventional Quick Sort algorithm with Quick Sort using Divide and Conquer.
4) When does the worst case of Quick Sort occur?
5) What are the advantages and disadvantages of quick sort?
6) What is the complexity of quick sort?
Assignment No: 3
Aim:
Assignment to understand the syntax of LEX specifications, built-in functions and variables. (Lexical analyzer for a sample language using LEX)
Objective:
1. To understand how to construct a compiler using LEX and YACC. LEX and YACC are tools used to
generate lexical analyzers and parsers.
What is LEX?
It is a tool for generating a lexical analyzer. It takes a specification of tokens in the form of a list of regular expressions, and from this input LEX generates a lexical analyzer. Its source file is a specification file consisting of a set of regular expressions, each paired with an action.
%{
%}
Definition Section
%%
Rules Section
%%
User Subroutines
I] Definition Section:
This section may contain a literal block, definitions, internal table declarations, start conditions and translations.
We can also use C code as it is, simply by writing that code between the special brackets shown in the diagram above, i.e. %{ and %}; all code between those brackets is copied as it is into lex.yy.c. We can also declare regular expression definitions in this section and use them in the Rules section.
Some of the regular expression operators used by LEX, with their meanings, are listed below:
[^ ] matches any character except the ones within the brackets
\ escape character
II] Rules Section:
Each rule consists of a pattern (a regular expression) to be matched against the input stream, followed by an action. The action is typical C code stating what should be done by LEX after the pattern is matched.
III] User Subroutines Section:
This section is for defining the other subroutines required by the lexical analyzer, such as symbol table management. Hence it is also a typical C code section. The main() function is usually defined here; it calls yylex(), the scanner routine that LEX generates in lex.yy.c.
Block Diagram: FirstLexProgram.l → lex → lex.yy.c → cc → a.out
Input: FirstLexProgram.l
Output: lex.yy.c
Running the lex tool on the specification (lex FirstLexProgram.l) converts the lex specification given in FirstLexProgram.l into C code. There is a fixed destination, the default file, for this C code: lex.yy.c.
Input: lex.yy.c
Output: a.out
Compiling lex.yy.c with the C compiler (e.g. cc lex.yy.c) checks whether the code generated in the first step is syntactically correct according to the C language syntax and produces an executable.
-o: redirects the output of compilation to the file named after it.
a.out: the file containing the output of compilation. a.out is the default; using -o we can store the result in any other file.
Finally, a.out is nothing but the lexical analyzer. If we provide an input stream to a.out, it will separate out the different tokens in the given input stream.
Built-in variables
Built-in Functions
1. yylex(): the lexical analyzer produced by LEX is a C routine called yylex().
Built-in macros
a. input(): gets the next character from the input
b. unput(): puts a character back into the logical input stream
The following example prepends line numbers to each line in a file. Some implementations of lex
predefine and calculate yylineno. The input file for lex is yyin, and defaults to stdin.
Whitespace must separate the defining term and the associated expression. References to substitutions
in the rules section are surrounded by braces ({letter}) to distinguish them from literals. When we have a
match in the rules section, the associated C code is executed. Here is a scanner that counts the number of
characters, words, and lines in a file (similar to Unix wc).
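The lex listing of that scanner does not appear in this copy of the manual. Purely as an illustration of the counting logic it performs, a small C++ equivalent might look like this:

#include <cctype>
#include <iostream>

// Counts characters, words, and lines on standard input, like Unix wc.
// A "word" here is a maximal run of non-whitespace characters.
int main() {
    long chars = 0, words = 0, lines = 0;
    bool inWord = false;
    char c;
    while (std::cin.get(c)) {
        ++chars;
        if (c == '\n') ++lines;
        if (std::isspace(static_cast<unsigned char>(c))) {
            inWord = false;
        } else if (!inWord) {
            inWord = true;
            ++words;             // first character of a new word
        }
    }
    std::cout << chars << " characters, " << words << " words, "
              << lines << " lines\n";
    return 0;
}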
Conclusion:
LEX is a tool which accepts a list of regular expressions as input and generates C code to recognize the corresponding tokens. When a token is identified, LEX allows us to execute user-defined routines.
When we give an input specification file to LEX, LEX generates the file lex.yy.c as output. It contains the function yylex(), generated by the LEX tool, with the C code needed to recognize each token and the action to be carried out when the token is found.
We also wrote a small LEX specification for recognizing the C type comments.
FAQs:
3. What is a parser?
Assignment No. 4
Aim:
Write an ambiguous CFG to implement a parser for a sample language using YACC and LEX.
Provide the details of all conflicting entries in the parser table generated by LEX and YACC and how
they have been resolved.
Objectives:
Theory:
Ambiguous grammars:
C and Java have an ambiguity in the grammar for expressions, which, hugely simplified, looks
something like this:
exp : exp '-' sub_exp
| sub_exp
;
sub_exp : '(' type_name ')' sub_exp
| '-' sub_exp
| id
| literal
| '(' exp ')'
;
type_name : id
| more_complex_type_descriptions
;
target : id
| target '.' id
;
An LL(1) parser trying to compile this language would have difficulties distinguishing
between assignments (e.g. a=x;) and procedure calls i.e. functions/methods returning void (e.g. a(x);).
This is because an LL(1) parser has to decide which kind of statement it is looking at after seeing only
1 symbol (i.e. a), and it isn't until we see the = or ( that we can tell what is intended. Suppose we used
a more complex algorithm, such as LL(3) - even this couldn't decide between e.g. a.b=x and a.b(x). In
fact, no matter how far it looks ahead, an LL(n) parser, which looks ahead a fixed amount, can always
be confused by a sufficiently complicated target in an assignment or call.
There are two kinds of solutions - the parser can use a variable amount of lookahead, as
JAVACC can be asked to do, so it reads as far as the = or ( before making a decision - or we can
rewrite the grammar, by left-factorising it, so that the two kinds of statement are merged until we can
make the decision:
stat : target assign_or_call ';'
;
assign_or_call : '=' exp
| '(' explist ')'
;
An LR(1) parser has no difficulty dealing with the original grammar, as it will have read to
the end of the statement, and seen the = or ( on the way, before it has to decide whether to recognize
an assignment or a call.
It is possible to construct unambiguous grammars that would confuse any LR(n) parser (as
well as any LL(n) parser) e.g. palindromes - strings that are their own mirror images, such as abba or
abacaba:
P:
| 'a' | 'b' | 'c' |...
| 'a' P 'a' | 'b' P 'b' | 'c' P 'c' | . . .
;
The problem is that, although it is perfectly obvious to us what to do - find the middle, and
work out to both ends - LR(n) and LL(n) read strictly left-to-right, and can only locate the middle of
the string by using their finite lookahead to find the end of the string. This could not work for strings
of length > n for LL(n), or length >2n for LR(n).
Confusing YACC:
Once an ambiguity has been pointed out in a grammar, it is usually clear enough to the user
what the problem is, even if it isn't obvious what to do about it. However, what kinds of error
messages are reported by tools like YACC, and how easy is it to find the corresponding ambiguity or
confusion?
YACC reports problems with grammars, whether ambiguous or just confusing, as shift/reduce
conflicts (where YACC can't decide whether to perform a shift or reduce - i.e. the grammar rule is
complete?) and/or as reduce/reduce conflicts (where YACC can't decide which reduce to perform -
i.e. which grammar rule is it?).
An example of a shift/reduce conflict:
The start of a function/method declaration in a C-like language, that accepts headers like void fred(int a, int b, float x, float z), looks something like this:
header : type_name id '(' params ')'
       | type_name id '(' ')'
       ;
params : param
| params ',' param
;
param : type_name id
;
YACC has no problems with this grammar, but what if we modify it? It might be nice to be
able to write the example above simply as void fred(int a, b, float x, z). We could try rewriting the
grammar like this:
param : type_name ids
;
ids : id
| ids ',' id
;
But now, YACC reports a shift/reduce conflict, and the details from the y.output file are:
13: shift/reduce conflict (shift 15, reduce 5) on ','
state 13
param : type_name ids . (5)
ids : ids . ',' id (7)
That is, when the generated parser sees a , after a list of identifiers in a param, it doesn't know
whether that , (and the id it expects after) is part of the same param (in which case it should shift, to
include them as part of the RHS) or the start of the next param (in which case it should reduce this
RHS and start a new RHS).
This is not ambiguous, just confusing to YACC, as it needs more lookahead to see if the next
few symbols are e.g. , a b (a is a type_name, b is a parameter name of type a) or , a , or , a ) (a is a
parameter name of the current type). The way to make this clear to YACC is to rewrite the grammar
so that it can see more of the input before having to make a decision:
params : type_name id
| params ',' type_name id
| params ',' id
;
An example of a reduce/reduce conflict:
state 8
sub_exp : id . (5)
type_name : id . (8)
That is, when it sees id) it doesn't know whether the id is a variable giving a value or a type
name, so it doesn't know which rule to use to recognize the id.
Assuming we don't already know what the problem is, this hasn't helped much, but we can get
more information by working back through the states in the y.output file to try to find how we get
here. To do so, we need to look for states that include shift 8 or goto 8. In this example, all we find is:
state 4
sub_exp : '(' . type_name ')' sub_exp (3)
sub_exp : '(' . exp ')' (7)
...
id shift 8
So the input must include (id), which can be recognized either as a type-cast or as an
expression.
This is a big hint about the source of the ambiguity in the grammar, but more by luck than
anything else - YACC remains confused even if we make the grammar unambiguous, by removing the
rule sub_exp : '-' sub_exp. YACC still reports the same reduce/reduce conflict for this modified
grammar, as it is confused by an input as simple as ( a ) - it has to decide whether this is a value in an
expression or a type-cast before it reads past the ) to see e.g. ( a ) 99 (i.e. a type-cast) or ( a ) - 99 (i.e.
the value a - 99).
Luckily, the solution to the general problem of the ambiguity - to somehow get LEX to
distinguish between identifiers that are really type names (or class names) and all other identifiers -
also solves this confusion for YACC.
Epilogue:
Most of the time, an ambiguous grammar results from an error made by the implementers of a
programming language. Sometimes, however, it is the fault of the language designer. Many languages
are defined in such a way that some part is either inherently ambiguous or confusing (e.g. not LR(1)).
Does this matter? We should not limit language designers to what a particular type of parser generator
can cope with, but on the other hand there is no particular merit in making a language harder to
compile if a small change can simplify the problem.
An example of this is a well-known problem with conditional statements; the dangling else.
Most imperative languages permit conditional statements to take two slightly different forms:
if ( ... ) ...
if ( ... ) ... else ...
So the else d in if (a) if (b) c else d could be associated either with if (a) or with if (b).
Most languages attempt to fix this problem by stating that the second interpretation is more
natural, and so is correct, although some languages have different rules. Whatever the language
definition, it is an extra rule that anyone learning the language has to remember.
Similarly, the compiler writer has to deal with this special case: if we use a tool like YACC we
get a shift/reduce error - do we shift the else to get if (b) c else d, or do we reduce the if (b) c as it
stands, so we get if (a) ... else d. To overcome this problem, we can rewrite the grammar to explicitly
say ``you can't have an unmatched then (logically) immediately before an else - the then and the else
must be paired up'':
stat : matched
     | unmatched
     | ...
     | ...
     | exp
     | ... ;
Conclusion:
We have written an ambiguous CFG to recognize an infix expression and implemented a parser that recognizes the infix expression using YACC. We also examined the details of all conflicting entries in the parser table generated by LEX and YACC and how they have been resolved.
Questions:
4. What is ambiguity?
Assignment No. 05
Aim:
Theory:
Semantic Actions:
Parsing tools use a generalization of CFG's in which each grammar symbol one or more
values, called attributes, have associated with it. Each production of the grammar may have an
associated "action", which can refer to and compute the values of attributes. So we have:
Terminals & non-terminals . have attributes
Productions . have semantic actions
Example:
E -> E' + E
| E'
E' -> int * E'
| int
For each symbol, let X.val be an integer value associated with X.
For terminal symbols, val is the lexeme provided by the lexical analyzer.
For non-terminals, val should be the integer value of the expression. This attribute is
computed from the attributes of sub-expressions.
Production                 Action
E  -> E' + E1              E.val  = E'.val + E1.val
   |  E'                   E.val  = E'.val
E' -> int * E1'            E'.val = int.val * E1'.val
   |  int                  E'.val = int.val
Note: the attribute of some grammar symbols, such as the terminals + and *, is unused.
Example: 5 * 3 + 2 * 4
Parse tree (nodes numbered for reference): E1 -> E3' '+' E2, E3' -> int7 '*' E4', E4' -> int8, E2 -> E5', E5' -> int9 '*' E6', E6' -> int0.
Equations:
E1.val  = E3'.val + E2.val
E3'.val = int7.val * E4'.val
E4'.val = int8.val
E2.val  = E5'.val
E5'.val = int9.val * E6'.val
E6'.val = int0.val
int7.val = 5
int8.val = 3
int9.val = 2
int0.val = 4
Working from the leaves to the root, we can compute each val attribute.
For example, E6'.val = 4 and E5'.val = 8. Finally, E1.val = 23.
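To make the bottom-up evaluation concrete, the following hypothetical C++ sketch builds the tree for 5 * 3 + 2 * 4 by hand and computes the val attribute of each node from the values of its children (the struct and field names are illustrative only, not part of any parser generator):

#include <iostream>
#include <memory>

// A parse-tree node carrying a synthesized attribute "val".
struct Node {
    char op;                       // '+', '*', or 'n' for an integer leaf
    int  val = 0;                  // the attribute computed for this node
    std::unique_ptr<Node> left, right;
    Node(int v) : op('n'), val(v) {}
    Node(char o, std::unique_ptr<Node> l, std::unique_ptr<Node> r)
        : op(o), left(std::move(l)), right(std::move(r)) {}
};

// Bottom-up evaluation: a node's val depends only on its children's val.
int eval(Node& n) {
    if (n.op == 'n') return n.val;             // leaf: val is the lexeme's value
    int l = eval(*n.left), r = eval(*n.right);
    n.val = (n.op == '+') ? l + r : l * r;     // semantic action for the production
    return n.val;
}

int main() {
    // Tree for 5 * 3 + 2 * 4, mirroring the equations above.
    auto tree = std::make_unique<Node>('+',
        std::make_unique<Node>('*', std::make_unique<Node>(5), std::make_unique<Node>(3)),
        std::make_unique<Node>('*', std::make_unique<Node>(2), std::make_unique<Node>(4)));
    std::cout << "E1.val = " << eval(*tree) << "\n";   // prints 23
    return 0;
}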
Notes:
1. Fresh attributes are associated with every node in the parse tree.
2. The semantic actions specify a system of equations; they don't say in what order the
equations are resolved. The user just gives a specification and the parser takes care of the
implementation.
Warning: You can use side-effects in semantic actions, but in this case you have to understand the
order in which attributes get computed or the results will seem unpredictable.
3. In this example, the val attribute can be evaluated bottom-up: the .val attribute for a node
of the parse depends only on the .val attributes of its children.
4. The parse tree need not actually be built by the parser. In fact, a parser tool would
compile this specification into code that simply traces out the structure of the parse tree
without actually building it.
5. Pattern/action parsing can be thought of as a systematic translation of the original text into
a new form specified by the semantic actions. Because the translation is guided by the syntax,
it is called syntax-directed translation. (NB: Book uses SDT in a narrower sense.)
6. Attributes may also be passed top-down: an attribute of a node may depend on an attribute
of the parent in the parse tree. Such an attribute is called "inherited". We will talk about
inherited attributes eventually, but they will not be used in the course project.
A topological sort of the graph is any ordering n1,...,nk of the nodes such that edges of
the graph are all from left-to-right in the ordering; i.e., a node appears in the ordering after all of the
nodes it depends on. Any topological sort is a legal evaluation order of the attributes.
Note: for the topological sort to make sense there can be no cycles in the graph.
Input:
Identifiers from the input in a symbol table and other relevant information about the identifiers
Output:
Instructions:
For the for statement and the if / if-else statements (as per the syntax of C or Pascal), generate equivalent three-address code for the given input, made up of the constructs mentioned above, using LEX and YACC. Write code to store the identifiers from the input, along with other relevant information about them, in a symbol table, and to display the records stored in the symbol table.
Conclusion:
Questions:
Assignment No: 6
Theory:
In statistics and machine learning, k-means clustering is a method of cluster analysis which
aims to partition n observations into k clusters in which each observation belongs to the cluster
with the nearest mean.
Algorithm:
Regarding computational complexity, the k-means clustering problem is:
NP-hard in general Euclidean space (of dimension d), even for 2 clusters;
NP-hard for a general number of clusters k, even in the plane;
if k and d are fixed, the problem can be solved exactly in time O(n^(dk+1) log n).
Example:
Problem: Cluster the following eight points (with (x, y) representing locations) into three clusters
A1(2, 10) A2(2, 5) A3(8, 4) A4(5, 8) A5(7, 5) A6(6, 4) A7(1, 2) A8(4, 9). Initial cluster centers are:
A1(2, 10), A4(5, 8) and A7(1, 2). The distance function between two points a=(x1, y1) and b=(x2,
y2) is defined as:
ρ(a, b) = |x2 − x1| + |y2 − y1|.
Use k-means algorithm to find the three cluster centers after the second iteration.
First we list all points in the first column of the table above. The initial cluster centers (means) are (2, 10), (5, 8) and (1, 2). Next, we will calculate the distance from the first point (2, 10) to each of the three means, by using the distance function:
point        mean1
(x1, y1)     (x2, y2)
(2, 10)      (2, 10)

ρ(point, mean1) = |x2 − x1| + |y2 − y1| = |2 − 2| + |10 − 10| = 0 + 0 = 0
point        mean2
(x1, y1)     (x2, y2)
(2, 10)      (5, 8)

ρ(point, mean2) = |x2 − x1| + |y2 − y1| = |5 − 2| + |8 − 10| = 3 + 2 = 5

point        mean3
(x1, y1)     (x2, y2)
(2, 10)      (1, 2)

ρ(point, mean3) = |x2 − x1| + |y2 − y1| = |1 − 2| + |2 − 10| = 1 + 8 = 9
So, we fill in these values in the table:
So, which cluster should the point (2, 10) be placed in? The one where the point has the shortest distance to the mean, that is, mean 1 (cluster 1), since the distance is 0.
Cluster 1    Cluster 2    Cluster 3
(2, 10)
So, we go to the second point (2, 5) and we will calculate the distance to each of the three means,
by using the distance function:
point        mean1
(x1, y1)     (x2, y2)
(2, 5)       (2, 10)

ρ(point, mean1) = |x2 − x1| + |y2 − y1| = |2 − 2| + |10 − 5| = 0 + 5 = 5

point        mean2
(x1, y1)     (x2, y2)
(2, 5)       (5, 8)

ρ(point, mean2) = |x2 − x1| + |y2 − y1| = |5 − 2| + |8 − 5| = 3 + 3 = 6

point        mean3
(x1, y1)     (x2, y2)
(2, 5)       (1, 2)

ρ(point, mean3) = |x2 − x1| + |y2 − y1| = |1 − 2| + |2 − 5| = 1 + 3 = 4
So, we fill in these values in the table:
So, which cluster should the point (2, 5) be placed in? The one where the point has the shortest distance to the mean, that is, mean 3 (cluster 3), since that distance (4) is the smallest.
Cluster 1    Cluster 2    Cluster 3
(2, 10)                   (2, 5)
Analogously, we fill in the rest of the table and place each point in one of the clusters:
Iteration 1
Cluster 1: (2, 10)
Cluster 2: (8, 4), (5, 8), (7, 5), (6, 4), (4, 9)
Cluster 3: (2, 5), (1, 2)
Next, we need to re-compute the new cluster centers (means). We do so, by taking the mean of all
points in each cluster.
For Cluster 1, we only have one point A1(2, 10), which was the old mean, so the cluster center
remains the same.
For Cluster 2, we have ( (8+5+7+6+4)/5, (4+8+5+4+9)/5 ) = (6, 6).
For Cluster 3, we have ( (2+1)/2, (5+2)/2 ) = (1.5, 3.5).
The initial cluster centers are shown as red dots; the new cluster centers are shown as red x marks.
That was Iteration 1 (epoch 1). Next, we go to Iteration 2 (epoch 2), Iteration 3, and so on until the means do not change anymore.
In Iteration 2, we basically repeat the process from Iteration 1, this time using the new means we computed.
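A minimal C++ sketch of this procedure (Lloyd-style k-means with the Manhattan distance and the initial centers used in the worked example; the stopping rule is simply "until the assignments stop changing") could look like this:

#include <cmath>
#include <iostream>
#include <vector>

struct Point { double x, y; };

// Manhattan distance, as in the worked example.
double dist(const Point& a, const Point& b) {
    return std::fabs(a.x - b.x) + std::fabs(a.y - b.y);
}

int main() {
    std::vector<Point> pts = {{2,10},{2,5},{8,4},{5,8},{7,5},{6,4},{1,2},{4,9}};
    std::vector<Point> means = {{2,10},{5,8},{1,2}};       // initial centers A1, A4, A7
    std::vector<int> label(pts.size(), -1);

    for (int iter = 0; iter < 100; ++iter) {
        bool changed = false;
        // Assignment step: attach each point to its nearest mean.
        for (std::size_t i = 0; i < pts.size(); ++i) {
            int best = 0;
            for (std::size_t c = 1; c < means.size(); ++c)
                if (dist(pts[i], means[c]) < dist(pts[i], means[best]))
                    best = static_cast<int>(c);
            if (label[i] != best) { label[i] = best; changed = true; }
        }
        if (!changed) break;                                // means are stable
        // Update step: recompute each mean from the points assigned to it.
        for (std::size_t c = 0; c < means.size(); ++c) {
            double sx = 0, sy = 0; int cnt = 0;
            for (std::size_t i = 0; i < pts.size(); ++i)
                if (label[i] == static_cast<int>(c)) { sx += pts[i].x; sy += pts[i].y; ++cnt; }
            if (cnt > 0) means[c] = {sx / cnt, sy / cnt};
        }
    }
    for (std::size_t c = 0; c < means.size(); ++c)
        std::cout << "Cluster " << c + 1 << " center: (" << means[c].x
                  << ", " << means[c].y << ")\n";
    return 0;
}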
Conclusion:
FAQs :
GROUP B: ASSIGNMENTS
( any 6 Assignments)
Assignment No: 07
Aim:
Objective:
4. To understand the rules for generating the target code, taking three-address code as the input.
Theory:
Code generation is the final phase of a compiler. Basically, code generation is the process of creating low-level (assembly or machine) code from the three-address code produced by the intermediate code generation phase, or from the optimized three-address code produced by the code optimizer phase.
Read each expression in the form (operator, operand1, operand2) and generate code using the following algorithm.
Gen_Code(operator, operand1, operand2)
{
    if (operand1.addressmode == R)            // operand1 is already in register R0
    {
        if (operator == '+')       Generate(ADD operand2, R0);
        else if (operator == '-')  Generate(SUB operand2, R0);
        else if (operator == '*')  Generate(MUL operand2, R0);
        else if (operator == '/')  Generate(DIV operand2, R0);
    }
    else if (operand2.addressmode == R)       // operand2 is already in register R0
    {
        if (operator == '+')       Generate(ADD operand1, R0);
        else if (operator == '-')  Generate(SUB operand1, R0);
        else if (operator == '*')  Generate(MUL operand1, R0);
        else if (operator == '/')  Generate(DIV operand1, R0);
    }
    else                                      // neither operand is in a register
    {
        Generate(MOV operand1, R0);           // load operand1 into R0 first (assumed)
        if (operator == '+')       Generate(ADD operand2, R0);
        else if (operator == '-')  Generate(SUB operand2, R0);
        else if (operator == '*')  Generate(MUL operand2, R0);
        else if (operator == '/')  Generate(DIV operand2, R0);
    }
}
Example:
X:= (a+b)*(c-d)+((e/f)*(a+b))
t1:=a+b
t2:=c-d
t3:=e/f
t4:=t1*t2
t5:=t3*t1
t6:=t4+t5
X:=t6
Using the simple code generation algorithm, the corresponding sequence of target code can be generated.
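Purely as an illustration (a simplified scheme that always loads the first operand into R0 and then applies the operator, rather than tracking which operand is already in a register as Gen_Code above does), a small C++ driver over these three-address statements might look like this:

#include <iostream>
#include <string>
#include <vector>

// One three-address statement: result := op1 operator op2
struct TAC { std::string result, op1; char op; std::string op2; };

// Map the operator to its target-code mnemonic.
std::string mnemonic(char op) {
    switch (op) {
        case '+': return "ADD";
        case '-': return "SUB";
        case '*': return "MUL";
        case '/': return "DIV";
    }
    return "???";
}

int main() {
    // Three-address code for X := (a+b)*(c-d) + ((e/f)*(a+b)), as above.
    std::vector<TAC> code = {
        {"t1", "a",  '+', "b"},  {"t2", "c",  '-', "d"},  {"t3", "e",  '/', "f"},
        {"t4", "t1", '*', "t2"}, {"t5", "t3", '*', "t1"}, {"t6", "t4", '+', "t5"},
    };
    for (const TAC& s : code) {
        std::cout << "MOV " << s.op1 << ", R0\n";           // load first operand
        std::cout << mnemonic(s.op) << ' ' << s.op2 << ", R0\n";
        std::cout << "MOV R0, " << s.result << "\n";         // store the result
    }
    std::cout << "MOV t6, R0\nMOV R0, X\n";                   // X := t6
    return 0;
}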
Conclusion :
Thus we have studied to generate the target code for the optimized code.
Questions:
1. What is a compiler?
4. What is Ambiguity?
5. Explain the difference between the target code and intermediate code?
Assignment No: 8
Aim: Write a LEX and YACC program to generate abstract syntax tree.
Objective:
To understand working of Code Generation Phase of Compiler
Theory:
The purpose of this lab is to create and print an abstract syntax tree for a C program. The C program
will use only a small subset of the grammar.
As an example of a syntax tree, consider the statement tri_area = (base * height)/2; The root node is an assignment operation. Its left subtree is a pointer to tri_area. Its right subtree represents the expression (base * height)/2. The tree looks like the tree in the figure.
In this display, each node is followed by its left subtree and then its right subtree, indented one tab stop. Notice that base and height are dereferenced, but tri_area isn't. That will be explained next.
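Since the figure is not reproduced here, the following C++ sketch builds that tree for tri_area = (base * height)/2 and prints it in the indented form just described (node, then left subtree, then right subtree, one tab stop deeper per level). It only mimics the idea of the TreeNode class discussed below; it is not the Java code used in the lab:

#include <iostream>
#include <memory>
#include <string>

// A minimal abstract-syntax-tree node: an operator or a leaf label.
struct AstNode {
    std::string label;                          // "=", "/", "*", "tri_area", ...
    std::unique_ptr<AstNode> left, right;
    AstNode(std::string l,
            std::unique_ptr<AstNode> lft = nullptr,
            std::unique_ptr<AstNode> rgt = nullptr)
        : label(std::move(l)), left(std::move(lft)), right(std::move(rgt)) {}
};

// Print a node, then its left subtree, then its right subtree,
// indenting each level by one tab stop.
void print(const AstNode* n, int depth = 0) {
    if (!n) return;
    std::cout << std::string(depth, '\t') << n->label << '\n';
    print(n->left.get(), depth + 1);
    print(n->right.get(), depth + 1);
}

int main() {
    // AST for: tri_area = (base * height) / 2;
    auto tree = std::make_unique<AstNode>("=",
        std::make_unique<AstNode>("tri_area"),
        std::make_unique<AstNode>("/",
            std::make_unique<AstNode>("*",
                std::make_unique<AstNode>("base"),
                std::make_unique<AstNode>("height")),
            std::make_unique<AstNode>("2")));
    print(tree.get());
    return 0;
}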
Tree Nodes and the Tree Node Class
A tree node will be implemented by the Tree Node class. If a tree node is an interior node, then it
will contain an operator that acts on the left and right subtrees. The operator will have a mode,
which will be the data type involved in the operation. For example, if the mode of an assignment
operator is INT, then the operator will assign an int to an int. If a tree node is an exterior (leaf) node, then it will contain an object, which will be an identifier or a number (and later a string). The mode of an exterior node will be the kind of object stored in that node. For example, if the object is an integer variable (l-value), then the mode will be a pointer to an INT.
If the object is an integer constant, then the mode will be INT. Open the file TreeNode.java. This file defines the TreeNode class whose objects have the following attributes: the operation (oper) represented by the node, the mode (mode) of the operation, a reference to the left subtree
(left), a reference to the right subtree (right), the identifier (id) represented by the node, the number (num) represented by the node, and the string (str) represented by the node.
If the node is a binary interior node, then left and right will be non-null, and id, num, and str will be undefined. On the other hand, if the node is an exterior node, then left and right will be null, while exactly one of id, num, and str will be defined, depending on the kind of exterior node. From
time to time, we will have unary interior nodes. They will always use the left subtree rather than the
right subtree.
Note the types of the data members oper, mode, left, right, id, num,
and str. Also, one constructor
public TreeNode(IdEntry i)
and the toString() function have been defined. You will define three additional constructors. First, define the default constructor:
public TreeNode()
It should set oper, mode, and num to 0 and left, right, id, and str to null. Next, define the following constructor:
public TreeNode(int op, int m, TreeNode l, TreeNode r)
The purpose of this constructor is to join together two existing trees, with root nodes l and r, as the
left and right subtrees of a new tree with this node as its root node.
In the root node, the value of oper should be op and the value of mode
should be m. Finally, define the constructor
public TreeNode(int n)
It will create a node that represents a number. The member oper should be Ops.NUM,
mode should be Ops.INT, and num should be the value of n. Write these constructors. We will use
these constructors later in this lab.
Yacc is a tool for building syntax analyzers, also known as parsers. Yacc has been used to implement hundreds of languages. Its applications range from small desk calculators, to medium-sized preprocessors for typesetting, to large compiler front ends for complete programming
languages.
A yacc specification is based on a collection of grammar rules that describe the syntax of a
language; yacc turns the specification into a syntax analyzer. A pure syntax analyzer merely checks
whether or not an input string conforms to the syntax of the language.
Algorithm:
Step 1: Start.
Step 2: Put the required declarations and header files in the declarations section, e.g. %{ #include <ctype.h> %}.
Step 3: Declare the token: %token DIGIT.
Step 4: Define the translation rules for line, expr, term and factor:
    line   : expr '\n'        { printf("\n %d \n", $1); }
    expr   : expr '+' term    { $$ = $1 + $3; }
    term   : term '*' factor  { $$ = $1 * $3; }
    factor : '(' expr ')'     { $$ = $2; }
    %%
Step 5: Define the supporting C routines.
Step 6: Stop.
Conclusion:
FAQs
1. What is AST?
2. What is the need of AST?
3. Which phase of compiler generates AST?
4. What are the applications of AST in compiler?
Assignment No: 9
Objective:
To develop a recursive-descent parser for a given grammar.
To generate a syntax tree as an output of the parser
To handle syntax errors.
Theory:
A recursive descent parser is a kind of top-down parser built from a set of mutually-recursive
procedures (or a non-recursive equivalent) where each such procedure usually implements one
of the production rules of the grammar. Thus the structure of the resulting program closely
mirrors that of the grammar it recognizes.
This parser attempts to verify that the syntax of the input stream is correct as it is read from left
to right. A basic operation necessary for this involves reading characters from the input stream
and matching them with terminals from the grammar that describes the syntax of the input. Our
recursive descent parsers will look ahead one character and advance the input stream reading
pointer when proper matches occur. What a recursive descent parser actually does is to perform
a depth-first search of the derivation tree for the string being parsed. This provides the 'descent'
portion of the name. The 'recursive' portion comes from the parser's form, a collection of
recursive procedures.
As our first example, consider the simple grammar
E -> x + T
T -> ( E )
T -> x
and the derivation tree in figure 2 for the expression x+(x+x)
A recursive descent parser traverses the tree by first calling a procedure to recognize an E. This
procedure reads an 'x' and a '+' and then calls a procedure to recognize a T. This would look like
the following routine.
Procedure E()
Begin
    If (input_symbol = 'x') then next();
    If (input_symbol = '+') then
    Begin
        next();
        T();
    End
    Else
        Errorhandler();
END
Note that the 'next' looks ahead and always provides the next character that will be read from
the input stream. This feature is essential if we wish our parsers to be able to predict what is due
to arrive as input. Note that 'errorhandler' is a procedure that notifies the user that a syntax error
has been made and then possibly terminates execution.
In order to recognize a T, the parser must figure out which of the productions to execute. This is
not difficult and is done in the procedure that appears below.
Procedure T()
Begin
    If (input_symbol = '(') then
    Begin
        next();
        E();
        If (input_symbol = ')') then next();
    End
    else If (input_symbol = 'x') then
        next();
    else
        Errorhandler();
END
In the above routine, the parser determines whether T had the form (E) or x. If not then the
error routine was called, otherwise the appropriate terminals and nonterminals were recognized.
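A compact C++ sketch of the same parser for the grammar E -> x + T, T -> ( E ) | x (reporting a syntax error by throwing an exception; other error-handling strategies are equally valid) is given below:

#include <iostream>
#include <stdexcept>
#include <string>

class Parser {
public:
    explicit Parser(std::string text) : input(std::move(text)), pos(0) {}
    // Parse succeeds only if the whole input matches E.
    bool parse() {
        try { E(); return pos == input.size(); }
        catch (const std::runtime_error&) { return false; }
    }
private:
    std::string input;
    std::size_t pos;

    char look() const { return pos < input.size() ? input[pos] : '\0'; }
    void match(char expected) {                  // consume one expected terminal
        if (look() == expected) ++pos;
        else throw std::runtime_error(std::string("expected ") + expected);
    }
    void E() {            // E -> x + T
        match('x');
        match('+');
        T();
    }
    void T() {            // T -> ( E ) | x
        if (look() == '(') { match('('); E(); match(')'); }
        else               { match('x'); }
    }
};

int main() {
    std::cout << std::boolalpha
              << Parser("x+(x+x)").parse() << '\n'    // true
              << Parser("x+(x+)").parse()  << '\n';   // false
    return 0;
}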
Algorithm:
1. Make grammar suitable for parsing i.e. remove left recursion (if required).
2. Write a function for each production with error handler.
3. Given input is said to be valid if input is scanned completely and no error function is called.
Conclusion:
FAQs:
1.What do you mean by Recursive Descent Parsing?
2. What are the applications of Recursive Descent Parsing?
3. What are the advantages of Recursive Descent Parsing?
Assignment No: 10
Title: Implement Apriori approach for data mining to organize the data items on a shelf.
Aim: Write a program to implement Apriori algorithm.
Objective:
To find frequent itemsets and associations between different itemsets, i.e. association rules.
Theory:
Association rule mining is defined as follows. Let I = {i1, i2, ..., in} be a set of n binary attributes called items. Let D = {t1, t2, ..., tm} be a set of transactions called the database. Each transaction in D has a unique transaction ID and contains a subset of the items in I. A rule is defined as an implication of the form X => Y where X, Y ⊆ I and X ∩ Y = ∅. The sets of items (for short, itemsets) X and Y are called the antecedent (left-hand side or LHS) and consequent (right-hand side or RHS) of the rule respectively.
To illustrate the concepts, we use a small example from the supermarket domain. The set of items is I = {milk, bread, butter, beer} and a small database containing the items (1 codes presence and 0 absence of an item in a transaction) is shown in the table to the right. An example rule for the supermarket could be {milk, bread} => {butter}, meaning that if milk and bread are bought, customers also buy butter.
To select interesting rules from the set of all possible rules, constraints on various measures of significance and interest can be used. The best-known constraints are minimum thresholds on support and confidence. The support supp(X) of an itemset X is defined as the proportion of transactions in the data set which contain the itemset. In the example database, the itemset {milk, bread} has a support of 2/5 = 0.4 since it occurs in 40% of all transactions (2 out of 5 transactions).
The confidence of a rule is defined as conf(X => Y) = supp(X ∪ Y) / supp(X). For example, the rule {milk, bread} => {butter} has a confidence of 0.2 / 0.4 = 0.5 in the database, which means that for 50% of the transactions containing milk and bread the rule is correct. Confidence can be interpreted as an estimate of the probability P(Y | X), the probability of finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS.
The lift of a rule is defined as lift(X => Y) = supp(X ∪ Y) / (supp(X) × supp(Y)), i.e. the ratio of the observed confidence to that expected by chance. The example rule has a lift of 0.2 / (0.4 × 0.4) = 1.25.
The conviction of a rule is defined as conv(X => Y) = (1 − supp(Y)) / (1 − conf(X => Y)). The example rule has a conviction of (1 − 0.4) / (1 − 0.5) = 1.2, and can be interpreted as the ratio of the expected frequency that X occurs without Y (that is to say, the frequency that the rule makes an incorrect prediction) if X and Y were independent, divided by the observed frequency of incorrect predictions. In this example, the conviction value of 1.2 shows that the rule would be incorrect 20% more often (1.2 times as often) if the association between X and Y was purely random chance.
Association rules are required to satisfy a user-specified minimum support and a user-specified
minimum confidence at the same time. To achieve this, association rule generation is a two-step
process. First, minimum support is applied to find all frequent itemsets in a database. In a second
step, these frequent itemsets and the minimum confidence constraint are used to form rules. While
the second step is straightforward, the first step needs more attention.
Many algorithms for generating association rules were presented over time.
Some well known algorithms are Apriori, Eclat and FP-Growth, but they only do half the job, since
they are algorithms for mining frequent itemsets. Another step needs to be done after to generate
rules from frequent itemsets found in a database.
Algorithm:
Association rule mining is to find out association rules that satisfy the predefined minimum support and confidence from a given database. The problem is usually decomposed into two sub-problems. One is to find those itemsets whose occurrences exceed a predefined threshold in the database; those itemsets are called frequent or large itemsets. The second problem is to generate association rules from those large itemsets with the constraint of minimal confidence.
Suppose one of the large itemsets is Lk = {I1, I2, ..., Ik}. Association rules with this itemset are generated in the following way: the first rule is {I1, I2, ..., Ik-1} => {Ik}; by checking the confidence, this rule can be determined as interesting or not. Then other rules are generated by deleting the last item in the antecedent and inserting it into the consequent, and the confidences of the new rules are checked to determine their interestingness. These steps are iterated until the antecedent becomes empty. Since the second subproblem is quite straightforward, most of the research focuses on the first subproblem. The Apriori algorithm finds the frequent sets L in database D.
Find frequent set Lk-1.
Join Step: Ck is generated by joining Lk-1 with itself.
Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset, hence it should be removed,
where
Ck: candidate itemset of size k
Lk: frequent itemset of size k
Example:
A large supermarket tracks sales data by SKU (item), and thus is able to know what items are
typically purchased together. Apriori is a moderately efficient way to build a list of frequently purchased item pairs from this data. Let the database of transactions consist of the sets:
T1: {1,2,3,4},
T2: {2,3,4},
T3: {2,3},
T4:{1,2,4}, T5:
{1,2,3,4}, and
T6: {2,4}.
Each number corresponds to a product such as "butter" or "water". The first step of Apriori is to count up the frequencies, called the supports, of each member item separately:
We can define a minimum support level to qualify as "frequent," which depends on the context. For
this case, let min support = 3. Therefore, all are frequent. The next step is to generate a list of all 2-
pairs of the frequent items. Had any of the above items not been frequent, they wouldn't have been
included as a possible member of possible 2-item pairs
In this way, Apriori prunes the tree of all possible sets.
This is counting up the occurrences of each of those pairs in the database. Since minsup=3, we don't
need to generate 3-sets involving {1,3}. This is due to the fact that since they're not frequent, no
supersets of them can possibly be frequent. Keep going
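A small C++ sketch of the first two passes of Apriori on the transactions above (counting the supports of the single items and of the candidate 2-item pairs with min support = 3; larger itemsets would be generated the same way) is:

#include <iostream>
#include <iterator>
#include <map>
#include <set>
#include <vector>

int main() {
    // Transactions T1..T6 from the example.
    std::vector<std::set<int>> db = {
        {1,2,3,4}, {2,3,4}, {2,3}, {1,2,4}, {1,2,3,4}, {2,4}
    };
    const int minSupport = 3;

    // Pass 1: count the support of every single item.
    std::map<int, int> itemCount;
    for (const auto& t : db)
        for (int item : t) ++itemCount[item];

    std::set<int> frequentItems;
    for (const auto& [item, cnt] : itemCount)
        if (cnt >= minSupport) frequentItems.insert(item);

    // Pass 2 (join step): candidate pairs of frequent items;
    // (prune/count step): keep a pair only if its support reaches the threshold.
    for (auto i = frequentItems.begin(); i != frequentItems.end(); ++i)
        for (auto j = std::next(i); j != frequentItems.end(); ++j) {
            int support = 0;
            for (const auto& t : db)
                if (t.count(*i) && t.count(*j)) ++support;
            if (support >= minSupport)
                std::cout << "{" << *i << "," << *j << "} support=" << support << "\n";
        }
    return 0;
}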
Conclusion:
FAQs :
Assignment No:11
Title: Using any similarity based techniques develop an application to classify text data. Perform
tasks as per requirement.
Objectives:
Theory:
There are three fundamental measures for assessing the quality of text retrieval
Precision
Recall
F-score
Precision
Precision is the percentage of retrieved documents that are in fact relevant to the query. Precision can be defined as
Precision = |{relevant documents} ∩ {retrieved documents}| / |{retrieved documents}|
Recall
Recall is the percentage of documents that are relevant to the query and were in fact retrieved. Recall is defined as
Recall = |{relevant documents} ∩ {retrieved documents}| / |{relevant documents}|
F-score
F-score is the commonly used trade-off measure. An information retrieval system often needs to trade off recall for precision or vice versa. F-score is defined as the harmonic mean of recall and precision:
F-score = 2 × Precision × Recall / (Precision + Recall)
The World Wide Web contains huge amounts of information that provides a rich source for data
mining.
The web is too huge: The size of the web is very large and rapidly increasing. This makes the web too huge for data warehousing and data mining.
Complexity of Web pages: Web pages do not have a unifying structure. They are very complex compared to traditional text documents. There is a huge number of documents in the digital library of the web, and these libraries are not arranged in any particular sorted order.
Web is a dynamic information source: The information on the web is rapidly updated. Data such as news, stock markets, weather, sports, shopping, etc., are regularly updated.
Diversity of user communities: The user community on the web is rapidly expanding. These users have different backgrounds, interests, and usage purposes. There are more than 100 million workstations connected to the Internet, and the number is still rapidly increasing.
Relevancy of Information: It is considered that a particular person is generally interested in only a small portion of the web, while the rest of the web contains information that is not relevant to the user and may swamp desired results.
The DOM structure was initially introduced for presentation in the browser and not for description
of semantic structure of the web page. The DOM structure cannot correctly identify the semantic
relationship between the different parts of a web page.
Such a semantic structure corresponds to a tree structure. In this tree each node corresponds to
a block.
A value is assigned to each node. This value is called the Degree of Coherence. This value is
assigned to indicate the coherent content in the block based on visual perception.
The VIPS algorithm first extracts all the suitable blocks from the HTML DOM tree. After that
it finds the separators between these blocks.
The separators refer to the horizontal or vertical lines in a web page that visually cross with
no blocks.
The semantics of the web page is constructed on the basis of these blocks.
Data mining is widely used in diverse areas. There are a number of commercial data mining system
available today and yet there are many challenges in this field. In this tutorial, we will discuss the
applications and the trend of data mining.
FAQS
1] What are the different techniques used for classification of text data?
Assignment No:12
Prerequisites:
Knowledge of K-NN approach.
Objectives:
To learn the concept of K-NN approach with suitable example.
To implement K-NN approach.
Theory:
K-NN approach
In k-NN classification, the output is a class membership: an object is classified by a majority vote of its neighbors, being assigned to the class most common among its k nearest neighbors. In k-NN regression, the output is the property value for the object; this value is the average of the values of its k nearest neighbors. k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-NN algorithm is among the simplest of all machine learning algorithms.
Both for classification and regression, it can be useful to assign weight to the contributions of the
neighbors, so that the nearer neighbors contribute more to the average than the more distant ones.
For example, a common weighting scheme consists in giving each neighbor a weight of 1/d, where
d is the distance to the neighbor. The neighbors are taken from a set of objects for which the class
(for k-NN classification) or the object property value (for k-NN regression) is known. This can be
thought of as the training set for the algorithm, though no explicit training step is required.
A limitation of the k-NN algorithm is that it is sensitive to the local structure of the data. The
algorithm has nothing to do with and is not to be confused with k-means, another popular machine
learning technique.
Algorithm
The training examples are vectors in a multidimensional feature space, each with a class label.
The training phase of the algorithm consists only of storing the feature vectors and class labels of
the training samples. In the classification phase, k is a user-defined constant, and an unlabeled
vector (a query or test point) is classified by assigning the label which is most frequent among the
k training samples nearest to that query point.
A commonly used distance metric for continuous variables is Euclidean distance. For discrete
variables, such as for text classification, another metric can be used, such as the overlap metric (or
Hamming distance). In the context of gene expression microarray data, for example, k-NN has also
been employed with correlation coefficients such as Pearson and Spearman. Often, the
classification accuracy of k-NN can be improved significantly if the distance metric is learned with
specialized algorithms such as Large Margin Nearest Neighbor or Neighbourhood components
analysis.
A drawback of the basic "majority voting" classification occurs when the class distribution is
skewed. That is, examples of a more frequent class tend to dominate the prediction of the new
example, because they tend to be common among the k nearest neighbors due to their large
number. One way to overcome this problem is to weight the classification, taking into account the
distance from the test point to each of its k nearest neighbors. The class (or value, in regression
problems) of each of the k nearest points is multiplied by a weight proportional to the inverse of
the distance from that point to the test point. Another way to overcome skew is by abstraction in
data representation. For example in a self-organizing map (SOM), each node is a representative (a
center) of a cluster of similar points, regardless of their density in the original training data. K-NN
can then be applied to the SOM.
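A compact C++ sketch of this classification phase (2-D feature vectors with integer class labels, Euclidean distance and an unweighted majority vote; all of these are choices the assignment leaves open) is:

#include <algorithm>
#include <iostream>
#include <map>
#include <vector>

struct Sample { double x, y; int label; };

// Classify a query point by majority vote among its k nearest training samples.
int knnClassify(std::vector<Sample> train, double qx, double qy, int k) {
    // Sort the training samples by (squared) Euclidean distance to the query.
    std::sort(train.begin(), train.end(), [&](const Sample& a, const Sample& b) {
        double da = (a.x - qx) * (a.x - qx) + (a.y - qy) * (a.y - qy);
        double db = (b.x - qx) * (b.x - qx) + (b.y - qy) * (b.y - qy);
        return da < db;
    });
    std::map<int, int> votes;
    for (int i = 0; i < k && i < static_cast<int>(train.size()); ++i)
        ++votes[train[i].label];
    int best = -1, bestCount = -1;
    for (const auto& [label, count] : votes)
        if (count > bestCount) { best = label; bestCount = count; }
    return best;
}

int main() {
    std::vector<Sample> train = {
        {1.0, 1.0, 0}, {1.5, 2.0, 0}, {2.0, 1.5, 0},
        {6.0, 6.0, 1}, {6.5, 7.0, 1}, {7.0, 6.5, 1}
    };
    std::cout << "class = " << knnClassify(train, 6.2, 6.4, 3) << "\n";   // expected 1
    return 0;
}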
Parameter selection
The best choice of k depends upon the data; generally, larger values of k reduce the effect of noise
on the classification, but make boundaries between classes less distinct. A good k can be selected
by various heuristic techniques (see hyperparameter optimization). The special case where the
class is predicted to be the class of the closest training sample (i.e. when k = 1) is called the nearest
neighbor algorithm.
The accuracy of the k-NN algorithm can be severely degraded by the presence of noisy or
irrelevant features, or if the feature scales are not consistent with their importance. Much research
effort has been put into selecting or scaling features to improve classification. A particularly popular approach is the use of evolutionary algorithms to optimize feature scaling. Another popular approach is to scale features by the mutual information of the training data with the training classes.
In binary (two-class) classification problems, it is helpful to choose k to be an odd number as this avoids tied votes. One popular way of choosing the empirically optimal k in this setting is via the bootstrap method.
Feature Extraction.
When the input data to an algorithm is too large to be processed and it is suspected to be
notoriously redundant (e.g. the same measurement in both feet and meters) then the input data will
be transformed into a reduced representation set of features (also named features vector).
Transforming the input data into the set of features is called feature extraction. If the features
extracted are carefully chosen it is expected that the features set will extract the relevant
information from the input data in order to perform the desired task using this reduced
representation instead of the full size input. Feature extraction is performed on raw data prior to
applying k-NN algorithm on the transformed data in feature space.
Conclusion:
The K-NN approach is studied and implemented.
GROUP C: ASSIGNMENTS
(Any one)
Assignment No:13
Title: Generate Huffman codes for a gray scale 8 bit image.
Prerequisites:
Knowledge of Huffman codes.
Objectives:
To generate Huffman codes for a gray scale 8 bit image.
Theory:
Huffman coding
Huffman coding is an algorithm developed by David A. Huffman while he was a Ph.D. student at MIT, and published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes".
The output from Huffman's algorithm can be viewed as a variable-length code table for encoding a source
symbol (such as a character in a file). The algorithm derives this table from the estimated probability or
frequency of occurrence (weight) for each possible value of the source symbol. As in other entropy
encoding methods, more common symbols are generally represented using fewer bits than less common
symbols. Huffman's method can be efficiently implemented, finding a code in time linear in the number of input weights if these weights are sorted. However, although optimal among methods encoding symbols
separately, Huffman coding is not always optimal among all compression methods.
The beauty of Huffman codes is that variable length codes can achieve a higher data density than fixed
length codes if the characters differ in frequency of occurrence. The length of the encoded character is
inversely proportional to that character's frequency. Huffman wasn't the first to discover this, but his
paper presented the optimal algorithm for assigning these codes. Huffman codes are similar to the
Morse code. Morse code uses few dots and dashes for the most frequently occurring letter. An E is
represented with one dot. A T is represented with one dash. Q, a letter occurring less frequently is
represented with dash-dash-dot-dash.
Huffman codes are created by analyzing the data set and assigning short bit streams to the datum occurring
most frequently. The algorithm attempts to create codes that minimize the average number of bits per
character. Table 9.1 shows an example of the frequency of letters in some text and their corresponding
Huffman code. To keep the table manageable, only letters were used. It is well known that
in English text, the space character is the most frequently occurring character.
As expected, E and T had the highest frequency and the shortest Huffman codes. Encoding with these
codes is simple. Encoding the word toupee would be just a matter of stringing together the appropriate
bit strings, as follows:
T 0 U P E E
One ASCII character requires 8 bits. The original 48 bits of data have been coded with 23 bits
achieving a compression ratio of 2.08.
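A minimal C++ sketch of Huffman code construction (shown here for a handful of symbol frequencies; for the 8-bit gray-scale image of the assignment the symbols would be the 256 gray levels and the frequencies their histogram counts) is:

#include <iostream>
#include <queue>
#include <string>
#include <vector>

struct Node {
    int symbol;                  // gray level (or -1 for an internal node)
    long freq;                   // frequency of occurrence (weight)
    Node *left = nullptr, *right = nullptr;
};

struct Greater {                 // min-heap ordering by frequency
    bool operator()(const Node* a, const Node* b) const { return a->freq > b->freq; }
};

// Walk the tree; going left appends '0', going right appends '1'.
void emit(const Node* n, const std::string& code) {
    if (!n) return;
    if (!n->left && !n->right) {
        std::cout << "symbol " << n->symbol << " -> " << (code.empty() ? "0" : code) << "\n";
        return;
    }
    emit(n->left, code + "0");
    emit(n->right, code + "1");
}

int main() {
    // Example histogram: (gray level, count). For a real image, build this from the pixels.
    std::vector<std::pair<int, long>> hist = {{0, 45}, {64, 13}, {128, 12}, {192, 16}, {255, 9}};

    std::priority_queue<Node*, std::vector<Node*>, Greater> pq;
    for (auto [sym, f] : hist) pq.push(new Node{sym, f});

    // Repeatedly merge the two least frequent nodes until one tree remains.
    // (Nodes are intentionally not freed in this short sketch.)
    while (pq.size() > 1) {
        Node* a = pq.top(); pq.pop();
        Node* b = pq.top(); pq.pop();
        pq.push(new Node{-1, a->freq + b->freq, a, b});
    }
    emit(pq.top(), "");
    return 0;
}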
Modified Huffman coding is used in fax machines to encode black on white images (bitmaps). It is also
an option to compress images in the TIFF file format. It combines the variable length codes of Huffman
coding with the coding of repetitive data in run length encoding. Since facsimile transmissions are
typically black text or writing on white background, only one bit is required to represent each pixel or
sample. These samples are referred to as white bits and black bits. The runs of white bits and black bits
are counted, and the counts are sent as variable length bit streams.
The encoding scheme is fairly simple. Each line is coded as a series of alternating runs of white and
black bits. Runs of 63 or less are coded with a terminating code. Runs of 64 or greater require that a
makeup code prefix the terminating code. The makeup codes are used to describe runs in multiples of
64 from 64 to 2560. This deviates from the normal Huffman scheme which would normally require
encoding all 2560 possibilities. This reduces the size of the Huffman code tree and accounts for the
term modified in the name.
Studies have shown that most facsimiles are 85 percent white, so the Huffman codes have been
optimized for long runs of white and short runs of black. The protocol also assumes that the line begins
with a run of white bits. If it doesn't, a run of white bits of 0 length must begin the encoded line. The
encoding then alternates between black bits and white bits to the end of the line. Each scan line ends
with a special EOL (end of line) character consisting of eleven zeros and a 1 (000000000001). The
EOL character doubles as an error recovery code. Since there is no other combination of codes that has
more than seven zeroes in succession, a decoder seeing eight will recognize the end of line and
continue scanning for a 1. Upon receiving the 1, it will then start a new line. If bits in a scan line get
corrupted, the most that will be lost is the rest of the line. If the EOL code gets corrupted, the most that
will get lost is the next line.
Tables 13.2 and 13.3 show the terminating and makeup codes. Figure 13.1 shows how to encode a
1275 pixel scanline with 53 bits.
Run Length   White bits   Black bits      Run Length   White bits   Black bits
0            00110101     0000110111      32           00011011     000001101010
1            000111       010             33           00010010     000001101011
Run Length   White bits    Black bits
64           11011         000000111
128 10010 00011001000
192 010111 000011001001
256 0110111 000001011011
320 00110110 000000110011
384 00110111 000000110100
448 01100100 000000110101
512 01100101 0000001101100
576 01101000 0000001101101
640 01100111 0000001001010
704 011001100 0000001001011
768 011001101 0000001001100
832 011010010 0000001001101
896 101010011 0000001110010
960 011010100 0000001110011
1024 011010101 0000001110100
1088 011010110 0000001110101
1152 011010111 0000001110110
1216 011011000 0000001110111
1280 011011001 0000001010010
1344 011011010 0000001010011
1408 011011011 0000001010100
1472 010011000 0000001010101
1536 010011001 0000001011010
1600 010011010 0000001011011
Run      Color    Code words
0        white    00110101
1        black    010
4        white    1011
2        black    11
1        white    0111
1        black    010
1266     white    011011000 + 01010011
EOL               000000000001
Conclusion:
Generation of Huffman codes for a gray scale 8 bit image is studied.