ENGINEERING LAB MANUAL
Prepared By: Vibha Lahane
List of Practicals
Group A
1. Using Divide and Conquer Strategies, design a function for Binary Search using C++/Java/Python/Scala.
2. Using Divide and Conquer Strategies, design a class for Concurrent Quick Sort using C++.

Assignment No: 01
Title: Using Divide and Conquer Strategies, design a function for Binary Search using C++/Java/Python/Scala.
Prerequisites:
Knowledge of writing programs in C++.
Objectives:
To learn the concept of Divide and Conquer Strategy.
To study the design and implementation of Binary Search algorithm.
Theory:
Divide and Conquer strategy:
A divide and conquer algorithm works by recursively breaking down a problem into two or more sub-
problems of the same (or related) type, until these become simple enough to be solved directly. The
solutions to the sub-problems are then combined to give a solution to the original problem.
This technique is the basis of efficient algorithms for all kinds of problems, such as sorting (e.g.,
quicksort, merge sort), multiplying large numbers, syntactic analysis (e.g., top-down parsers) and
computing the discrete Fourier transform (FFTs).
Searching
Sequential Algorithm
function sequential(T[1 .. n], x)
This algorithm clearly takes a time in Θ(r), where r is the index returned: this is Θ(n) in the worst case and Θ(1) in the best case. If we assume that all the elements of T are distinct, that x is indeed somewhere in the array, and that it is equally likely to be at any position, the average time is also in Θ(n).
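The body of the sequential search is not reproduced above. A minimal C++ sketch of such a search over a sorted array (returning the first index whose element is at least x, consistent with the Θ(r) analysis) might look like this:

#include <vector>

// Sequential (linear) search in a sorted array T[0..n-1].
// Returns the first index i with T[i] >= x, or n if every element is smaller.
// Running time is proportional to the index returned: O(n) worst case, O(1) best case.
int sequentialSearch(const std::vector<int>& T, int x) {
    for (std::size_t i = 0; i < T.size(); ++i) {
        if (T[i] >= x) {
            return static_cast<int>(i);
        }
    }
    return static_cast<int>(T.size());   // x is larger than every element
}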
Binary Search
The binary search algorithm begins by comparing the target value to value of the middle element of the
sorted array. If the target value is equal to the middle element's value, the position is returned. If the
target value is smaller, the search continues on the lower half of the array, or if the target value is
larger, the search continues on the upper half of the array. This process continues until the element is
found and its position is returned, or there are no more elements left to search for in the array and a
"not found" indicator is returned.
Binary search can be applied to sorted list only. It searches sorted lists using a divide and conquer
technique. On each iteration the search domain is cut in half, until the result is found. The
computational complexity of a binary search is O(log n).
function binrec(T[i .. j], x)
{ binary search for x in sorted subarray T[i .. j] }
if i = j then return i
k ← (i + j + 1) div 2
if x < T[k] then return binrec(T[i .. k - 1], x)
else return binrec(T[k .. j], x)
Binary searching is the algorithm used to look up a word in a dictionary or a name in a telephone
directory. It is probably the simplest application of divide-and-conquer. It can be applied to a sorted list
only.
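As one possible C++ implementation of the assignment (a sketch, not the only acceptable design), a recursive divide-and-conquer binary search can be written as follows:

#include <iostream>
#include <vector>

// Recursive binary search in the sorted range T[low..high].
// Returns the index of x, or -1 if x is not present.
int binarySearch(const std::vector<int>& T, int x, int low, int high) {
    if (low > high) {
        return -1;                        // empty range: not found
    }
    int mid = low + (high - low) / 2;     // middle element of the current range
    if (T[mid] == x) {
        return mid;
    } else if (x < T[mid]) {
        return binarySearch(T, x, low, mid - 1);    // search the lower half
    } else {
        return binarySearch(T, x, mid + 1, high);   // search the upper half
    }
}

int main() {
    std::vector<int> data = {2, 5, 8, 12, 16, 23, 38, 56, 72, 91};
    int pos = binarySearch(data, 23, 0, static_cast<int>(data.size()) - 1);
    if (pos >= 0)
        std::cout << "Found at index " << pos << "\n";
    else
        std::cout << "Not found\n";
    return 0;
}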
Conclusion:
The concept of divide and conquer strategy is studied and binary search algorithm is implemented
using C++.
FAQs:
1) What is Divide and Conquer approach? Also explain its advantages.
3) Explain the need of analysis of algorithm with respect to complexities as well as techniques
used for analysis.
4) Compute time complexity and space complexity of your program. Also give the proper
justification for same.
5) Compare the conventional Binary Search algorithm and the Divide and Conquer Binary Search
algorithm. Also explain the advantages of Divide and Conquer approach in terms of quick sort.
6) Compare Divide and Conquer, Concurrent programming, Backtracking, and Branch and Bound approaches.
Assignment No: 02
Title: Using Divide and Conquer Strategies design a class for Concurrent Quick Sort using
C++.
Prerequisites:
Knowledge of writing programs in C++.
Objectives:
To learn the concept of Divide and Conquer Strategy.
To study the design and implementation of Quick Sort algorithm.
Theory:
Divide and Conquer strategy:
A divide and conquer algorithm works by recursively breaking down a problem into two or more sub-
problems of the same (or related) type, until these become simple enough to be solved directly. The
solutions to the sub-problems are then combined to give a solution to the original problem.
This technique is the basis of efficient algorithms for all kinds of problems, such as sorting (e.g.,
quicksort, merge sort), multiplying large numbers, syntactic analysis (e.g., top-down parsers) and
computing the discrete Fourier transform (FFTs).
Sorting
Quick Sort
The sorting algorithm invented by Hoare, usually known as "quicksort", is also based on the idea of
divide-and-conquer. As a first step, this algorithm chooses one of the items in the array to be sorted as
the pivot. The array is then partitioned on either side of the pivot, elements are moved in such a way
that those greater than the pivot are placed on its right, whereas all the others are moved to its left. If
now the two sections of the array on either side of the pivot are sorted independently by recursive calls
of the algorithm, the final result is a completely sorted array, no subsequent merge step being necessary.
To balance the sizes of the two sub instances to be sorted, we would like to use the median element as
the pivot. Finding the median takes more time than it is worth. For this reason we simply use the first
element of the array as the pivot. The quick sort algorithm is given below.
procedure quicksort(T[i .. j])
{ sorts array T[i .. j] into increasing order }
if j - i is small then insert(T[i .. j])
else
    pivot(T[i .. j], l)
    quicksort(T[i .. l - 1])
    quicksort(T[l + 1 .. j])
Let p = T[i] be the pivot. One good way of pivoting consists of scanning the array T[i .. j] just once, but starting at both ends. Pointers k and l are initialized to i and j + 1, respectively. Pointer k is then incremented until T[k] > p, and pointer l is decremented until T[l] ≤ p. Now T[k] and T[l] are interchanged. This process continues as long as k < l. Finally, T[i] and T[l] are interchanged to put the pivot in its correct position.
procedure pivot(T[i .. j]; var l)
{ permutes the elements in array T[i .. j] in such a way that, at the end,
  i ≤ l ≤ j, the elements of T[i .. l-1] are not greater than p,
  T[l] = p, and the elements of T[l+1 .. j] are greater than p,
  where p is the initial value of T[i] }
p ← T[i]
k ← i; l ← j + 1
repeat k ← k + 1 until T[k] > p or k ≥ j
repeat l ← l - 1 until T[l] ≤ p
while k < l do
    interchange T[k] and T[l]
    repeat k ← k + 1 until T[k] > p
    repeat l ← l - 1 until T[l] ≤ p
interchange T[i] and T[l]
Quicksort is a recursive, comparison-based sorting algorithm. It selects one key of the list as the pivot and finds the position in the list where that key should be placed, so that i) the keys smaller than the pivot end up on the low side of the pivot, and ii) the keys larger than or equal to the pivot end up on the high side of the pivot. The same procedure is then applied recursively to these two parts.
The average time complexity of Quick Sort is O(n log n). The worst-case time complexity is O(n²).
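One possible C++ sketch of a concurrent quicksort (using std::async from the standard library to sort the two partitions in parallel; the simpler Lomuto-style partition and the size threshold for spawning a task are choices made here, not part of the pseudocode above) is:

#include <algorithm>
#include <future>
#include <iostream>
#include <vector>

// Partition T[lo..hi] around the first element (the pivot).
// Returns the final position of the pivot.
int partition(std::vector<int>& T, int lo, int hi) {
    int pivot = T[lo];
    int l = lo;
    for (int k = lo + 1; k <= hi; ++k) {
        if (T[k] < pivot) {
            ++l;
            std::swap(T[k], T[l]);
        }
    }
    std::swap(T[lo], T[l]);    // put the pivot into its correct position
    return l;
}

// Concurrent quicksort: the two sub-arrays on either side of the pivot are
// independent, so they can be sorted by separate tasks.
void quicksort(std::vector<int>& T, int lo, int hi) {
    if (lo >= hi) return;
    int l = partition(T, lo, hi);
    if (hi - lo > 1000) {                  // spawn a task only for large ranges
        auto left = std::async(std::launch::async,
                               [&T, lo, l] { quicksort(T, lo, l - 1); });
        quicksort(T, l + 1, hi);
        left.wait();
    } else {                               // small ranges: plain recursion
        quicksort(T, lo, l - 1);
        quicksort(T, l + 1, hi);
    }
}

int main() {
    std::vector<int> data = {9, 3, 7, 1, 8, 2, 5, 6, 4};
    quicksort(data, 0, static_cast<int>(data.size()) - 1);
    for (int v : data) std::cout << v << ' ';
    std::cout << '\n';
    return 0;
}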
Flow Chart for Quick Sort using Divide and Conquer Approach.
Conclusion:
The concept of divide and conquer strategy is studied and Concurrent Quick Sort algorithm is
implemented using C++.
FAQs
1) Explain the need of Divide and Conquer approach for Quick Sort.
2) What is the advantage of the Divide and Conquer technique over plain recursion?
3) Compare the conventional Quick Sort algorithm with Quick Sort using Divide and Conquer.
4) When does the worst case of Quick Sort occur?
5) What are the advantages and disadvantages of quick sort?
6) What is the complexity of quick sort?
Assignment No: 3
Aim:
Assignment to understand the syntax of LEX specifications, built-in functions and variables. (Lexical analyzer for a sample language using LEX)
Objective:
1. To understand how to construct a compiler using LEX and YACC. LEX and YACC are tools used to
generate lexical analyzers and parsers.
What is LEX?
It is a tool for generating a lexical analyzer. It takes a specification of tokens in the form of a list of regular expressions, and from this input LEX generates a lexical analyzer. Its source file is a specification file consisting of a set of regular expressions, each paired with an action.
%{
%}
Definition Section
%%
Rules Section
%%
User Subroutines
I] Definition Section:
This section may contain a literal block, definitions, internal table declarations, start conditions and translations.
We can also use C code as it is, simply by writing that code between the special brackets shown in the diagram above, i.e. %{ and %}; all code between those brackets is copied as it is into lex.yy.c. We can also declare regular expression definitions in this section and use them in the Rules section.
Some of the regular expression operators used by LEX, with their meanings, are listed below:
[^ ] matches any character except the ones within the brackets
\ escape character
II] Rules Section:
Each rule consists of a pattern (a regular expression) to be matched against the input stream, followed by an action. The action is typical C code stating what should be done by LEX after the pattern is matched.
III] User Subroutines Section:
This section is for defining the other subroutines required by the lexical analyzer, such as symbol table management. Hence it is also a typical C code section. The main() function is usually defined here; it calls yylex(), the scanner routine that LEX generates in lex.yy.c.
Block Diagram: FirstLexProgram.l → lex → lex.yy.c → cc → a.out
Input: FirstLexProgram.l
Output: lex.yy.c
Running the lex tool on the specification (lex FirstLexProgram.l) converts the lex specification given in FirstLexProgram.l into C code. There is a fixed destination, the default file, for this C code: lex.yy.c.
Input: lex.yy.c
Output: a.out
Compiling lex.yy.c with the C compiler (e.g. cc lex.yy.c) checks whether the code generated in the first step is syntactically correct according to the C language syntax and produces an executable.
-o: redirects the output of compilation to the file named after it.
a.out: the file containing the output of compilation. a.out is the default; using -o we can store the result in any other file.
Finally, a.out is nothing but the lexical analyzer. If we provide an input stream to a.out, it will separate out the different tokens in the given input stream.
Built-in variables
Built-in Functions
1. yylex(): the lexical analyzer produced by LEX is a C routine called yylex().
Built-in macros
a. input(): gets the next character from the input
b. unput(): puts a character back into the logical input stream
The following example prepends line numbers to each line in a file. Some implementations of lex
predefine and calculate yylineno. The input file for lex is yyin, and defaults to stdin.
Whitespace must separate the defining term and the associated expression. References to substitutions
in the rules section are surrounded by braces ({letter}) to distinguish them from literals. When we have a
match in the rules section, the associated C code is executed. Here is a scanner that counts the number of
characters, words, and lines in a file (similar to Unix wc).
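The lex listing of that scanner does not appear in this copy of the manual. Purely as an illustration of the counting logic it performs, a small C++ equivalent might look like this:

#include <cctype>
#include <iostream>

// Counts characters, words, and lines on standard input, like Unix wc.
// A "word" here is a maximal run of non-whitespace characters.
int main() {
    long chars = 0, words = 0, lines = 0;
    bool inWord = false;
    char c;
    while (std::cin.get(c)) {
        ++chars;
        if (c == '\n') ++lines;
        if (std::isspace(static_cast<unsigned char>(c))) {
            inWord = false;
        } else if (!inWord) {
            inWord = true;
            ++words;             // first character of a new word
        }
    }
    std::cout << chars << " characters, " << words << " words, "
              << lines << " lines\n";
    return 0;
}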
Conclusion:
LEX is a tool which accepts a list of regular expressions as input and generates C code to recognize the corresponding tokens. When a token is identified, LEX allows us to execute user-defined routines.
When we give an input specification file to LEX, LEX generates the file lex.yy.c as output. It contains the function yylex(), generated by the LEX tool, with the C code needed to recognize each token and the action to be carried out when the token is found.
We also wrote a small LEX specification for recognizing the C type comments.
FAQs:
3. What is a parser?
Assignment No. 4
Aim:
Write an ambiguous CFG to implement a parser for a sample language using YACC and LEX.
Provide the details of all conflicting entries in the parser table generated by LEX and YACC and how
they have been resolved.
Objectives:
Theory:
Ambiguous grammars:
C and Java have an ambiguity in the grammar for expressions, which, hugely simplified, looks
something like this:
exp : exp '-' sub_exp
| sub_exp
;
sub_exp : '(' type_name ')' sub_exp
| '-' sub_exp
| id
| literal
| '(' exp ')'
;
type_name : id
| more_complex_type_descriptions
;
target : id
| target '.' id
;
An LL(1) parser trying to compile this language would have difficulties distinguishing
between assignments (e.g. a=x;) and procedure calls i.e. functions/methods returning void (e.g. a(x);).
This is because an LL(1) parser has to decide which kind of statement it is looking at after seeing only
1 symbol (i.e. a), and it isn't until we see the = or ( that we can tell what is intended. Suppose we used
a more complex algorithm, such as LL(3) - even this couldn't decide between e.g. a.b=x and a.b(x). In
fact, no matter how far it looks ahead, an LL(n) parser, which looks ahead a fixed amount, can always
be confused by a sufficiently complicated target in an assignment or call.
There are two kinds of solutions - the parser can use a variable amount of lookahead, as
JAVACC can be asked to do, so it reads as far as the = or ( before making a decision - or we can
rewrite the grammar, by left-factorising it, so that the two kinds of statement are merged until we can
make the decision:
stat : target assign_or_call ';'
;
assign_or_call : '=' exp
| '(' explist ')'
;
An LR(1) parser has no difficulty dealing with the original grammar, as it will have read to
the end of the statement, and seen the = or ( on the way, before it has to decide whether to recognize
an assignment or a call.
It is possible to construct unambiguous grammars that would confuse any LR(n) parser (as
well as any LL(n) parser) e.g. palindromes - strings that are their own mirror images, such as abba or
abacaba:
P:
| 'a' | 'b' | 'c' |...
| 'a' P 'a' | 'b' P 'b' | 'c' P 'c' | . . .
;
The problem is that, although it is perfectly obvious to us what to do - find the middle, and
work out to both ends - LR(n) and LL(n) read strictly left-to-right, and can only locate the middle of
the string by using their finite lookahead to find the end of the string. This could not work for strings
of length > n for LL(n), or length >2n for LR(n).
Confusing YACC:
Once an ambiguity has been pointed out in a grammar, it is usually clear enough to the user
what the problem is, even if it isn't obvious what to do about it. However, what kinds of error
messages are reported by tools like YACC, and how easy is it to find the corresponding ambiguity or
confusion?
YACC reports problems with grammars, whether ambiguous or just confusing, as shift/reduce
conflicts (where YACC can't decide whether to perform a shift or reduce - i.e. the grammar rule is
complete?) and/or as reduce/reduce conflicts (where YACC can't decide which reduce to perform -
i.e. which grammar rule is it?).
An example of a shift/reduce conflict:
The start of a function/method declaration in a C-like language, that accepts headers like void fred(int a, int b, float x, float z), looks something like this:
header : type_name id '(' params ')'
       | type_name id '(' ')'
       ;
params : param
| params ',' param
;
param : type_name id
;
YACC has no problems with this grammar, but what if we modify it? It might be nice to be
able to write the example above simply as void fred(int a, b, float x, z). We could try rewriting the
grammar like this:
param : type_name ids
;
ids : id
| ids ',' id
;
But now, YACC reports a shift/reduce conflict, and the details from the y.output file are:
13: shift/reduce conflict (shift 15, reduce 5) on ','
state 13
param : type_name ids . (5)
ids : ids . ',' id (7)
That is, when the generated parser sees a , after a list of identifiers in a param, it doesn't know
whether that , (and the id it expects after) is part of the same param (in which case it should shift, to
include them as part of the RHS) or the start of the next param (in which case it should reduce this
RHS and start a new RHS).
This is not ambiguous, just confusing to YACC, as it needs more lookahead to see if the next
few symbols are e.g. , a b (a is a type_name, b is a parameter name of type a) or , a , or , a ) (a is a
parameter name of the current type). The way to make this clear to YACC is to rewrite the grammar
so that it can see more of the input before having to make a decision:
params : type_name id
| params ',' type_name id
| params ',' id
;
An example of a reduce/reduce conflict:
state 8
sub_exp : id . (5)
type_name : id . (8)
That is, when it sees id) it doesn't know whether the id is a variable giving a value or a type
name, so it doesn't know which rule to use to recognize the id.
Assuming we don't already know what the problem is, this hasn't helped much, but we can get
more information by working back through the states in the y.output file to try to find how we get
here. To do so, we need to look for states that include shift 8 or goto 8. In this example, all we find is:
state 4
sub_exp : '(' . type_name ')' sub_exp (3)
sub_exp : '(' . exp ')' (7)
...
id shift 8
So the input must include (id), which can be recognized either as a type-cast or as an
expression.
This is a big hint about the source of the ambiguity in the grammar, but more by luck than
anything else - YACC remains confused even if we make the grammar unambiguous, by removing the
rule sub_exp : '-' sub_exp. YACC still reports the same reduce/reduce conflict for this modified
grammar, as it is confused by an input as simple as ( a ) - it has to decide whether this is a value in an
expression or a type-cast before it reads past the ) to see e.g. ( a ) 99 (i.e. a type-cast) or ( a ) - 99 (i.e.
the value a - 99).
Luckily, the solution to the general problem of the ambiguity - to somehow get LEX to
distinguish between identifiers that are really type names (or class names) and all other identifiers -
also solves this confusion for YACC.
Epilogue:
Most of the time, an ambiguous grammar results from an error made by the implementers of a
programming language. Sometimes, however, it is the fault of the language designer. Many languages
are defined in such a way that some part is either inherently ambiguous or confusing (e.g. not LR(1)).
Does this matter? We should not limit language designers to what a particular type of parser generator
can cope with, but on the other hand there is no particular merit in making a language harder to
compile if a small change can simplify the problem.
An example of this is a well-known problem with conditional statements; the dangling else.
Most imperative languages permit conditional statements to take two slightly different forms:
if ( ... ) ...
if ( ... ) ... else ...
So the else d in if (a) if (b) c else d could be associated either with if (a) or with if (b).
Most languages attempt to fix this problem by stating that the second interpretation is more
natural, and so is correct, although some languages have different rules. Whatever the language
definition, it is an extra rule that anyone learning the language has to remember.
Similarly, the compiler writer has to deal with this special case: if we use a tool like YACC we
get a shift/reduce error - do we shift the else to get if (b) c else d, or do we reduce the if (b) c as it
stands, so we get if (a) ... else d. To overcome this problem, we can rewrite the grammar to explicitly
say ``you can't have an unmatched then (logically) immediately before an else - the then and the else
must be paired up'':
stat : matched
     | unmatched
     | ...
     | ...
     | exp
     | ... ;
Conclusion:
We have written an ambiguous CFG to recognize an infix expression and implemented a parser that recognizes the infix expression using YACC. We also examined the details of all conflicting entries in the parser table generated by LEX and YACC and how they have been resolved.
Questions:
4. What is ambiguity?
Assignment No. 05
Aim:
Theory:
Semantic Actions:
Parsing tools use a generalization of CFG's in which each grammar symbol one or more
values, called attributes, have associated with it. Each production of the grammar may have an
associated "action", which can refer to and compute the values of attributes. So we have:
Terminals & non-terminals . have attributes
Productions . have semantic actions
Example:
E -> E' + E
| E'
E' -> int * E'
| int
For each symbol, let X.val be an integer value associated with X.
For terminal symbols, val is the lexeme provided by the lexical analyzer.
For non-terminals, val should be the integer value of the expression. This attribute is
computed from the attributes of sub-expressions.
Production                 Action
E  -> E' + E1              E.val  = E'.val + E1.val
   |  E'                   E.val  = E'.val
E' -> int * E1'            E'.val = int.val * E1'.val
   |  int                  E'.val = int.val
Note: the attribute of some grammar symbols, such as the terminals + and *, is unused.
Example: 5 * 3 + 2 * 4
Parse tree (nodes numbered for reference): E1 -> E3' '+' E2, E3' -> int7 '*' E4', E4' -> int8, E2 -> E5', E5' -> int9 '*' E6', E6' -> int0.
Equations:
E1.val  = E3'.val + E2.val
E3'.val = int7.val * E4'.val
E4'.val = int8.val
E2.val  = E5'.val
E5'.val = int9.val * E6'.val
E6'.val = int0.val
int7.val = 5
int8.val = 3
int9.val = 2
int0.val = 4
Working from the leaves to the root, we can compute each val attribute.
For example, E6'.val = 4 and E5'.val = 8. Finally, E1.val = 23.
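To make the bottom-up evaluation concrete, the following hypothetical C++ sketch builds the tree for 5 * 3 + 2 * 4 by hand and computes the val attribute of each node from the values of its children (the struct and field names are illustrative only, not part of any parser generator):

#include <iostream>
#include <memory>

// A parse-tree node carrying a synthesized attribute "val".
struct Node {
    char op;                       // '+', '*', or 'n' for an integer leaf
    int  val = 0;                  // the attribute computed for this node
    std::unique_ptr<Node> left, right;
    Node(int v) : op('n'), val(v) {}
    Node(char o, std::unique_ptr<Node> l, std::unique_ptr<Node> r)
        : op(o), left(std::move(l)), right(std::move(r)) {}
};

// Bottom-up evaluation: a node's val depends only on its children's val.
int eval(Node& n) {
    if (n.op == 'n') return n.val;             // leaf: val is the lexeme's value
    int l = eval(*n.left), r = eval(*n.right);
    n.val = (n.op == '+') ? l + r : l * r;     // semantic action for the production
    return n.val;
}

int main() {
    // Tree for 5 * 3 + 2 * 4, mirroring the equations above.
    auto tree = std::make_unique<Node>('+',
        std::make_unique<Node>('*', std::make_unique<Node>(5), std::make_unique<Node>(3)),
        std::make_unique<Node>('*', std::make_unique<Node>(2), std::make_unique<Node>(4)));
    std::cout << "E1.val = " << eval(*tree) << "\n";   // prints 23
    return 0;
}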
Notes:
1. Fresh attributes are associated with every node in the parse tree.
2. The semantic actions specify a system of equations; they don't say in what order the
equations are resolved. The user just gives a specification and the parser takes care of the
implementation.
Warning: You can use side-effects in semantic actions, but in this case you have to understand the
order in which attributes get computed or the results will seem unpredictable.
3. In this example, the val attribute can be evaluated bottom-up: the .val attribute for a node
of the parse depends only on the .val attributes of its children.
4. The parse tree need not actually be built by the parser. In fact, a parser tool would
compile this specification into code that simply traces out the structure of the parse tree
without actually building it.
5. Pattern/action parsing can be thought of as a systematic translation of the original text into
a new form specified by the semantic actions. Because the translation is guided by the syntax,
it is called syntax-directed translation. (NB: Book uses SDT in a narrower sense.)
6. Attributes may also be passed top-down: an attribute of a node may depend on an attribute
of the parent in the parse tree. Such an attribute is called "inherited". We will talk about
inherited attributes eventually, but they will not be used in the course project.
A topological sort of the graph is any ordering n1,...,nk of the nodes such that edges of
the graph are all from left-to-right in the ordering; i.e., a node appears in the ordering after all of the
nodes it depends on. Any topological sort is a legal evaluation order of the attributes.
Note: for the topological sort to make sense there can be no cycles in the graph.
Input:
Identifiers from the input in a symbol table and other relevant information about the identifiers
Output:
Instructions:
For the for statement and the if / if-else statements (as per the syntax of C or Pascal), generate equivalent three-address code for the given input, made up of the constructs mentioned above, using LEX and YACC. Write code to store the identifiers from the input, along with other relevant information about them, in a symbol table, and to display the records stored in the symbol table.
Conclusion:
Questions:
Assignment No: 6
Theory:
In statistics and machine learning, k-means clustering is a method of cluster analysis which
aims to partition n observations into k clusters in which each observation belongs to the cluster
with the nearest mean.
Algorithm:
Regarding computational complexity, the k-means clustering problem is:
NP-hard in general Euclidean space (of dimension d), even for 2 clusters;
NP-hard for a general number of clusters k, even in the plane;
if k and d are fixed, the problem can be solved exactly in time O(n^(dk+1) log n).
Example:
Problem: Cluster the following eight points (with (x, y) representing locations) into three clusters
A1(2, 10) A2(2, 5) A3(8, 4) A4(5, 8) A5(7, 5) A6(6, 4) A7(1, 2) A8(4, 9). Initial cluster centers are:
A1(2, 10), A4(5, 8) and A7(1, 2). The distance function between two points a=(x1, y1) and b=(x2,
y2) is defined as:
ρ(a, b) = |x2 − x1| + |y2 − y1|.
Use k-means algorithm to find the three cluster centers after the second iteration.
First we list all points in the first column of the table above. The initial cluster centers (means) are (2, 10), (5, 8) and (1, 2). Next, we will calculate the distance from the first point (2, 10) to each of the three means, by using the distance function:
point        mean1
(x1, y1)     (x2, y2)
(2, 10)      (2, 10)

ρ(point, mean1) = |x2 − x1| + |y2 − y1| = |2 − 2| + |10 − 10| = 0 + 0 = 0
point        mean2
(x1, y1)     (x2, y2)
(2, 10)      (5, 8)

ρ(point, mean2) = |x2 − x1| + |y2 − y1| = |5 − 2| + |8 − 10| = 3 + 2 = 5

point        mean3
(x1, y1)     (x2, y2)
(2, 10)      (1, 2)

ρ(point, mean3) = |x2 − x1| + |y2 − y1| = |1 − 2| + |2 − 10| = 1 + 8 = 9
So, we fill in these values in the table:
So, which cluster should the point (2, 10) be placed in? The one where the point has the shortest distance to the mean, that is, mean 1 (cluster 1), since the distance is 0.
Cluster 1    Cluster 2    Cluster 3
(2, 10)
So, we go to the second point (2, 5) and we will calculate the distance to each of the three means,
by using the distance function:
point        mean1
(x1, y1)     (x2, y2)
(2, 5)       (2, 10)

ρ(point, mean1) = |x2 − x1| + |y2 − y1| = |2 − 2| + |10 − 5| = 0 + 5 = 5

point        mean2
(x1, y1)     (x2, y2)
(2, 5)       (5, 8)

ρ(point, mean2) = |x2 − x1| + |y2 − y1| = |5 − 2| + |8 − 5| = 3 + 3 = 6

point        mean3
(x1, y1)     (x2, y2)
(2, 5)       (1, 2)

ρ(point, mean3) = |x2 − x1| + |y2 − y1| = |1 − 2| + |2 − 5| = 1 + 3 = 4
So, we fill in these values in the table:
So, which cluster should the point (2, 5) be placed in? The one where the point has the shortest distance to the mean, that is, mean 3 (cluster 3), since that distance (4) is the smallest.
Cluster 1    Cluster 2    Cluster 3
(2, 10)                   (2, 5)
Analogously, we fill in the rest of the table and place each point in one of the clusters:
Iteration 1
Cluster 1: (2, 10)
Cluster 2: (8, 4), (5, 8), (7, 5), (6, 4), (4, 9)
Cluster 3: (2, 5), (1, 2)
Next, we need to re-compute the new cluster centers (means). We do so, by taking the mean of all
points in each cluster.
For Cluster 1, we only have one point A1(2, 10), which was the old mean, so the cluster center
remains the same.
For Cluster 2, we have ( (8+5+7+6+4)/5, (4+8+5+4+9)/5 ) = (6, 6).
For Cluster 3, we have ( (2+1)/2, (5+2)/2 ) = (1.5, 3.5).
The initial cluster centers are shown as red dots; the new cluster centers are shown as red x marks.
That was Iteration 1 (epoch 1). Next, we go to Iteration 2 (epoch 2), Iteration 3, and so on until the means do not change anymore.
In Iteration 2, we basically repeat the process from Iteration 1, this time using the new means we computed.
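A minimal C++ sketch of this procedure (Lloyd-style k-means with the Manhattan distance and the initial centers used in the worked example; the stopping rule is simply "until the assignments stop changing") could look like this:

#include <cmath>
#include <iostream>
#include <vector>

struct Point { double x, y; };

// Manhattan distance, as in the worked example.
double dist(const Point& a, const Point& b) {
    return std::fabs(a.x - b.x) + std::fabs(a.y - b.y);
}

int main() {
    std::vector<Point> pts = {{2,10},{2,5},{8,4},{5,8},{7,5},{6,4},{1,2},{4,9}};
    std::vector<Point> means = {{2,10},{5,8},{1,2}};       // initial centers A1, A4, A7
    std::vector<int> label(pts.size(), -1);

    for (int iter = 0; iter < 100; ++iter) {
        bool changed = false;
        // Assignment step: attach each point to its nearest mean.
        for (std::size_t i = 0; i < pts.size(); ++i) {
            int best = 0;
            for (std::size_t c = 1; c < means.size(); ++c)
                if (dist(pts[i], means[c]) < dist(pts[i], means[best]))
                    best = static_cast<int>(c);
            if (label[i] != best) { label[i] = best; changed = true; }
        }
        if (!changed) break;                                // means are stable
        // Update step: recompute each mean from the points assigned to it.
        for (std::size_t c = 0; c < means.size(); ++c) {
            double sx = 0, sy = 0; int cnt = 0;
            for (std::size_t i = 0; i < pts.size(); ++i)
                if (label[i] == static_cast<int>(c)) { sx += pts[i].x; sy += pts[i].y; ++cnt; }
            if (cnt > 0) means[c] = {sx / cnt, sy / cnt};
        }
    }
    for (std::size_t c = 0; c < means.size(); ++c)
        std::cout << "Cluster " << c + 1 << " center: (" << means[c].x
                  << ", " << means[c].y << ")\n";
    return 0;
}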
Conclusion:
FAQs :
GROUP B: ASSIGNMENTS
( any 6 Assignments)
Assignment No: 07
Aim:
Objective:
4. To understand the rules for generating the target code, taking three-address code as the input.
Theory:
Code generation is the final phase of a compiler. Basically, code generation is the process of creating low-level (assembly or machine) code from the three-address code produced by the intermediate code generation phase, or from the optimized three-address code produced by the code optimizer phase.
Read each expression in the form (operator, operand1, operand2) and generate code using the following algorithm.
Gen_Code(operator, operand1, operand2)
{
    if (operand1.addressmode == R)            // operand1 is already in register R0
    {
        if (operator == '+')       Generate(ADD operand2, R0);
        else if (operator == '-')  Generate(SUB operand2, R0);
        else if (operator == '*')  Generate(MUL operand2, R0);
        else if (operator == '/')  Generate(DIV operand2, R0);
    }
    else if (operand2.addressmode == R)       // operand2 is already in register R0
    {
        if (operator == '+')       Generate(ADD operand1, R0);
        else if (operator == '-')  Generate(SUB operand1, R0);
        else if (operator == '*')  Generate(MUL operand1, R0);
        else if (operator == '/')  Generate(DIV operand1, R0);
    }
    else                                      // neither operand is in a register
    {
        Generate(MOV operand1, R0);           // load operand1 into R0 first (assumed)
        if (operator == '+')       Generate(ADD operand2, R0);
        else if (operator == '-')  Generate(SUB operand2, R0);
        else if (operator == '*')  Generate(MUL operand2, R0);
        else if (operator == '/')  Generate(DIV operand2, R0);
    }
}
Example:
X:= (a+b)*(c-d)+((e/f)*(a+b))
t1:=a+b
t2:=c-d
t3:=e/f
t4:=t1*t2
t5:=t3*t1
t6:=t4+t5
X:=t6
Using the simple code generation algorithm, the corresponding sequence of target code can be generated.
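Purely as an illustration (a simplified scheme that always loads the first operand into R0 and then applies the operator, rather than tracking which operand is already in a register as Gen_Code above does), a small C++ driver over these three-address statements might look like this:

#include <iostream>
#include <string>
#include <vector>

// One three-address statement: result := op1 operator op2
struct TAC { std::string result, op1; char op; std::string op2; };

// Map the operator to its target-code mnemonic.
std::string mnemonic(char op) {
    switch (op) {
        case '+': return "ADD";
        case '-': return "SUB";
        case '*': return "MUL";
        case '/': return "DIV";
    }
    return "???";
}

int main() {
    // Three-address code for X := (a+b)*(c-d) + ((e/f)*(a+b)), as above.
    std::vector<TAC> code = {
        {"t1", "a",  '+', "b"},  {"t2", "c",  '-', "d"},  {"t3", "e",  '/', "f"},
        {"t4", "t1", '*', "t2"}, {"t5", "t3", '*', "t1"}, {"t6", "t4", '+', "t5"},
    };
    for (const TAC& s : code) {
        std::cout << "MOV " << s.op1 << ", R0\n";           // load first operand
        std::cout << mnemonic(s.op) << ' ' << s.op2 << ", R0\n";
        std::cout << "MOV R0, " << s.result << "\n";         // store the result
    }
    std::cout << "MOV t6, R0\nMOV R0, X\n";                   // X := t6
    return 0;
}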
Conclusion :
Thus we have studied to generate the target code for the optimized code.
Questions:
1. What is a compiler?
4. What is Ambiguity?
5. Explain the difference between the target code and intermediate code?
Assignment No: 8
Aim: Write a LEX and YACC program to generate abstract syntax tree.
Objective:
To understand working of Code Generation Phase of Compiler
Theory:
The purpose of this lab is to create and print an abstract syntax tree for a C program. The C program
will use only a small subset of the grammar.
As an example of a syntax tree, consider the statement tri_area = (base * height)/2; The root node is an assignment operation. Its left subtree is a pointer to tri_area. Its right subtree represents the expression (base * height)/2. The tree looks like the tree in the figure.
In this display, each node is followed by its left subtree and then its right subtree, indented one tab stop. Notice that base and height are dereferenced, but tri_area isn't. That will be explained next.
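Since the figure is not reproduced here, the following C++ sketch builds that tree for tri_area = (base * height)/2 and prints it in the indented form just described (node, then left subtree, then right subtree, one tab stop deeper per level). It only mimics the idea of the TreeNode class discussed below; it is not the Java code used in the lab:

#include <iostream>
#include <memory>
#include <string>

// A minimal abstract-syntax-tree node: an operator or a leaf label.
struct AstNode {
    std::string label;                          // "=", "/", "*", "tri_area", ...
    std::unique_ptr<AstNode> left, right;
    AstNode(std::string l,
            std::unique_ptr<AstNode> lft = nullptr,
            std::unique_ptr<AstNode> rgt = nullptr)
        : label(std::move(l)), left(std::move(lft)), right(std::move(rgt)) {}
};

// Print a node, then its left subtree, then its right subtree,
// indenting each level by one tab stop.
void print(const AstNode* n, int depth = 0) {
    if (!n) return;
    std::cout << std::string(depth, '\t') << n->label << '\n';
    print(n->left.get(), depth + 1);
    print(n->right.get(), depth + 1);
}

int main() {
    // AST for: tri_area = (base * height) / 2;
    auto tree = std::make_unique<AstNode>("=",
        std::make_unique<AstNode>("tri_area"),
        std::make_unique<AstNode>("/",
            std::make_unique<AstNode>("*",
                std::make_unique<AstNode>("base"),
                std::make_unique<AstNode>("height")),
            std::make_unique<AstNode>("2")));
    print(tree.get());
    return 0;
}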
Tree Nodes and the Tree Node Class
A tree node will be implemented by the Tree Node class. If a tree node is an interior node, then it
will contain an operator that acts on the left and right subtrees. The operator will have a mode,
which will be the data type involved in the operation. For example, if the mode of an assignment
operator is INT, then the operator will assign an int to an int. If a tree node is an exterior (leaf) node, then it will contain an object, which will be an identifier or a number (and later a string). The mode of an exterior node will be the kind of object stored in that node. For example, if the object is an integer variable (l-value), then the mode will be a pointer to an INT.
If the object is an integer constant, then the mode will be INT. Open the file TreeNode.java. This file defines the TreeNode class whose objects have the following attributes: the operation (oper) represented by the node, the mode (mode) of the operation, a reference to the left subtree
(left), a reference to the right subtree (right), the identifier (id) represented by the node, the number (num) represented by the node, and the string (str) represented by the node.
If the node is a binary interior node, then left and right will be non-null, and id, num, and str will be undefined. On the other hand, if the node is an exterior node, then left and right will be null, while exactly one of id, num, and str will be defined, depending on the kind of exterior node. From
time to time, we will have unary interior nodes. They will always use the left subtree rather than the
right subtree.
Note the types of the data members oper, mode, left, right, id, num,
and str. Also, one constructor
public TreeNode(IdEntry i)
and the toString() function have been defined. You will define three additional constructors. First, define the default constructor:
public TreeNode()
It should set oper, mode, and num to 0 and left, right, id, and str to null. Next, define the following constructor:
public TreeNode(int op, int m, TreeNode l, TreeNode r)
The purpose of this constructor is to join together two existing trees, with root nodes l and r, as the
left and right subtrees of a new tree with this node as its root node.
In the root node, the value of oper should be op and the value of mode
should be m. Finally, define the constructor
public TreeNode(int n)
It will create a node that represents a number. The member oper should be Ops.NUM,
mode should be Ops.INT, and num should be the value of n. Write these constructors. We will use
these constructors later in this lab.
Yacc is a tool for building syntax analyzers, also known as parsers. Yacc has been used to implement hundreds of languages. Its applications range from small desk calculators, to medium-sized preprocessors for typesetting, to large compiler front ends for complete programming
languages.
A yacc specification is based on a collection of grammar rules that describe the syntax of a
language; yacc turns the specification into a syntax analyzer. A pure syntax analyzer merely checks
whether or not an input string conforms to the syntax of the language.
Algorithm:
Step 1: Start.
Step 2: Put the required declarations and header files in the declarations section, e.g. %{ #include <ctype.h> %}.
Step 3: Declare the token: %token DIGIT.
Step 4: Define the translation rules for line, expr, term and factor:
    line   : expr '\n'        { printf("\n %d \n", $1); }
    expr   : expr '+' term    { $$ = $1 + $3; }
    term   : term '*' factor  { $$ = $1 * $3; }
    factor : '(' expr ')'     { $$ = $2; }
    %%
Step 5: Define the supporting C routines.
Step 6: Stop.
Conclusion:
FAQs
1. What is AST?
2. What is the need of AST?
3. Which phase of compiler generates AST?
4. What are the applications of AST in compiler?
Assignment No: 9
Objective:
To develop a recursive-descent parser for a given grammar.
To generate a syntax tree as an output of the parser
To handle syntax errors.
Theory:
A recursive descent parser is a kind of top-down parser built from a set of mutually-recursive
procedures (or a non-recursive equivalent) where each such procedure usually implements one
of the production rules of the grammar. Thus the structure of the resulting program closely
mirrors that of the grammar it recognizes.
This parser attempts to verify that the syntax of the input stream is correct as it is read from left
to right. A basic operation necessary for this involves reading characters from the input stream
and matching them with terminals from the grammar that describes the syntax of the input. Our
recursive descent parsers will look ahead one character and advance the input stream reading
pointer when proper matches occur. What a recursive descent parser actually does is to perform
a depth-first search of the derivation tree for the string being parsed. This provides the 'descent'
portion of the name. The 'recursive' portion comes from the parser's form, a collection of
recursive procedures.
As our first example, consider the simple grammar
E -> x + T
T -> ( E )
T -> x
and the derivation tree in figure 2 for the expression x+(x+x)
A recursive descent parser traverses the tree by first calling a procedure to recognize an E. This
procedure reads an 'x' and a '+' and then calls a procedure to recognize a T. This would look like
the following routine.
Procedure E()
Begin
    If (input_symbol = 'x') then next();
    If (input_symbol = '+') then
    Begin
        next();
        T();
    End
    Else
        Errorhandler();
END
Note that the 'next' looks ahead and always provides the next character that will be read from
the input stream. This feature is essential if we wish our parsers to be able to predict what is due
to arrive as input. Note that 'errorhandler' is a procedure that notifies the user that a syntax error
has been made and then possibly terminates execution.
In order to recognize a T, the parser must figure out which of the productions to execute. This is
not difficult and is done in the procedure that appears below.
Procedure T()
Begin
    If (input_symbol = '(') then
    Begin
        next();
        E();
        If (input_symbol = ')') then next();
    End
    else If (input_symbol = 'x') then
        next();
    else
        Errorhandler();
END
In the above routine, the parser determines whether T had the form (E) or x. If not then the
error routine was called, otherwise the appropriate terminals and nonterminals were recognized.
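A compact C++ sketch of the same parser for the grammar E -> x + T, T -> ( E ) | x (reporting a syntax error by throwing an exception; other error-handling strategies are equally valid) is given below:

#include <iostream>
#include <stdexcept>
#include <string>

class Parser {
public:
    explicit Parser(std::string text) : input(std::move(text)), pos(0) {}
    // Parse succeeds only if the whole input matches E.
    bool parse() {
        try { E(); return pos == input.size(); }
        catch (const std::runtime_error&) { return false; }
    }
private:
    std::string input;
    std::size_t pos;

    char look() const { return pos < input.size() ? input[pos] : '\0'; }
    void match(char expected) {                  // consume one expected terminal
        if (look() == expected) ++pos;
        else throw std::runtime_error(std::string("expected ") + expected);
    }
    void E() {            // E -> x + T
        match('x');
        match('+');
        T();
    }
    void T() {            // T -> ( E ) | x
        if (look() == '(') { match('('); E(); match(')'); }
        else               { match('x'); }
    }
};

int main() {
    std::cout << std::boolalpha
              << Parser("x+(x+x)").parse() << '\n'    // true
              << Parser("x+(x+)").parse()  << '\n';   // false
    return 0;
}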
Algorithm:
1. Make grammar suitable for parsing i.e. remove left recursion (if required).
2. Write a function for each production with error handler.
3. Given input is said to be valid if input is scanned completely and no error function is called.
Conclusion:
FAQs:
1.What do you mean by Recursive Descent Parsing?
2. What are the applications of Recursive Descent Parsing?
3. What are the advantages of Recursive Descent Parsing?
Assignment No: 10
Title: Implement Apriori approach for data mining to organize the data items on a shelf.
Aim: Write a program to implement Apriori algorithm.
Objective:
To find frequent itemsets and associations between different itemsets, i.e. association rules.
Theory:
Association rule mining is defined as follows. Let I = {i1, i2, ..., in} be a set of n binary attributes called items. Let D = {t1, t2, ..., tm} be a set of transactions called the database. Each transaction in D has a unique transaction ID and contains a subset of the items in I. A rule is defined as an implication of the form X => Y where X, Y ⊆ I and X ∩ Y = ∅. The sets of items (for short, itemsets) X and Y are called the antecedent (left-hand side or LHS) and consequent (right-hand side or RHS) of the rule respectively.
To illustrate the concepts, we use a small example from the supermarket domain. The set of items is I = {milk, bread, butter, beer} and a small database containing the items (1 codes presence and 0 absence of an item in a transaction) is shown in the table to the right. An example rule for the supermarket could be {milk, bread} => {butter}, meaning that if milk and bread are bought, customers also buy butter.
To select interesting rules from the set of all possible rules, constraints on various measures of significance and interest can be used. The best-known constraints are minimum thresholds on support and confidence. The support supp(X) of an itemset X is defined as the proportion of transactions in the data set which contain the itemset. In the example database, the itemset {milk, bread} has a support of 2/5 = 0.4 since it occurs in 40% of all transactions (2 out of 5 transactions).
The confidence of a rule is defined as conf(X => Y) = supp(X ∪ Y) / supp(X). For example, the rule {milk, bread} => {butter} has a confidence of 0.2 / 0.4 = 0.5 in the database, which means that for 50% of the transactions containing milk and bread the rule is correct. Confidence can be interpreted as an estimate of the probability P(Y | X), the probability of finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS.
The lift of a rule is defined as lift(X => Y) = supp(X ∪ Y) / (supp(X) × supp(Y)), i.e. the ratio of the observed confidence to that expected by chance. The example rule has a lift of 0.2 / (0.4 × 0.4) = 1.25.
The conviction of a rule is defined as conv(X => Y) = (1 − supp(Y)) / (1 − conf(X => Y)). The example rule has a conviction of (1 − 0.4) / (1 − 0.5) = 1.2, and can be interpreted as the ratio of the expected frequency that X occurs without Y (that is to say, the frequency that the rule makes an incorrect prediction) if X and Y were independent, divided by the observed frequency of incorrect predictions. In this example, the conviction value of 1.2 shows that the rule would be incorrect 20% more often (1.2 times as often) if the association between X and Y was purely random chance.
Association rules are required to satisfy a user-specified minimum support and a user-specified
minimum confidence at the same time. To achieve this, association rule generation is a two-step
process. First, minimum support is applied to find all frequent itemsets in a database. In a second
step, these frequent itemsets and the minimum confidence constraint are used to form rules. While
the second step is straightforward, the first step needs more attention.
Many algorithms for generating association rules were presented over time.
Some well known algorithms are Apriori, Eclat and FP-Growth, but they only do half the job, since
they are algorithms for mining frequent itemsets. Another step needs to be done after to generate
rules from frequent itemsets found in a database.
Algorithm:
Association rule mining is to find out association rules that satisfy the predefined minimum support and confidence from a given database. The problem is usually decomposed into two sub-problems. One is to find those itemsets whose occurrences exceed a predefined threshold in the database; those itemsets are called frequent or large itemsets. The second problem is to generate association rules from those large itemsets with the constraint of minimal confidence.
Suppose one of the large itemsets is Lk = {I1, I2, ..., Ik}. Association rules with this itemset are generated in the following way: the first rule is {I1, I2, ..., Ik-1} => {Ik}; by checking the confidence, this rule can be determined as interesting or not. Then other rules are generated by deleting the last item in the antecedent and inserting it into the consequent, and the confidences of the new rules are checked to determine their interestingness. These steps are iterated until the antecedent becomes empty. Since the second subproblem is quite straightforward, most of the research focuses on the first subproblem. The Apriori algorithm finds the frequent sets L in database D.
Find frequent set Lk-1.
Join Step: Ck is generated by joining Lk-1 with itself.
Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset, hence it should be removed,
where
Ck: candidate itemset of size k
Lk: frequent itemset of size k
Example:
A large supermarket tracks sales data by SKU (item), and thus is able to know what items are
typically purchased together. Apriori is a moderately efficient way to build a list of frequently purchased item pairs from this data. Let the database of transactions consist of the sets:
T1: {1,2,3,4},
T2: {2,3,4},
T3: {2,3},
T4:{1,2,4}, T5:
{1,2,3,4}, and
T6: {2,4}.
Each number corresponds to a product such as "butter" or "water". The first step of Apriori is to count up the frequencies, called the supports, of each member item separately:
We can define a minimum support level to qualify as "frequent," which depends on the context. For
this case, let min support = 3. Therefore, all are frequent. The next step is to generate a list of all 2-
pairs of the frequent items. Had any of the above items not been frequent, they wouldn't have been
included as a possible member of possible 2-item pairs
In this way, Apriori prunes the tree of all possible sets.
This is counting up the occurrences of each of those pairs in the database. Since minsup=3, we don't
need to generate 3-sets involving {1,3}. This is due to the fact that since they're not frequent, no
supersets of them can possibly be frequent. Keep going
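A small C++ sketch of the first two passes of Apriori on the transactions above (counting the supports of the single items and of the candidate 2-item pairs with min support = 3; larger itemsets would be generated the same way) is:

#include <iostream>
#include <iterator>
#include <map>
#include <set>
#include <vector>

int main() {
    // Transactions T1..T6 from the example.
    std::vector<std::set<int>> db = {
        {1,2,3,4}, {2,3,4}, {2,3}, {1,2,4}, {1,2,3,4}, {2,4}
    };
    const int minSupport = 3;

    // Pass 1: count the support of every single item.
    std::map<int, int> itemCount;
    for (const auto& t : db)
        for (int item : t) ++itemCount[item];

    std::set<int> frequentItems;
    for (const auto& [item, cnt] : itemCount)
        if (cnt >= minSupport) frequentItems.insert(item);

    // Pass 2 (join step): candidate pairs of frequent items;
    // (prune/count step): keep a pair only if its support reaches the threshold.
    for (auto i = frequentItems.begin(); i != frequentItems.end(); ++i)
        for (auto j = std::next(i); j != frequentItems.end(); ++j) {
            int support = 0;
            for (const auto& t : db)
                if (t.count(*i) && t.count(*j)) ++support;
            if (support >= minSupport)
                std::cout << "{" << *i << "," << *j << "} support=" << support << "\n";
        }
    return 0;
}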
Conclusion:
FAQs :
Assignment No:11
Title: Using any similarity based techniques develop an application to classify text data. Perform
tasks as per requirement.
Objectives:
Theory:
There are three fundamental measures for assessing the quality of text retrieval
Precision
Recall
F-score
Precision
Precision is the percentage of retrieved documents that are in fact relevant to the query. Precision can be defined as
Precision = |{relevant documents} ∩ {retrieved documents}| / |{retrieved documents}|
Recall
Recall is the percentage of documents that are relevant to the query and were in fact retrieved. Recall is defined as
Recall = |{relevant documents} ∩ {retrieved documents}| / |{relevant documents}|
F-score
F-score is the commonly used trade-off measure. An information retrieval system often needs to trade off recall for precision or vice versa. F-score is defined as the harmonic mean of recall and precision:
F-score = 2 × Precision × Recall / (Precision + Recall)
The World Wide Web contains huge amounts of information that provides a rich source for data
mining.
The web is too huge: The size of the web is very large and rapidly increasing. This makes the web too huge for data warehousing and data mining.
Complexity of Web pages: Web pages do not have a unifying structure. They are very complex compared to traditional text documents. There is a huge number of documents in the digital library of the web, and these libraries are not arranged in any particular sorted order.
Web is a dynamic information source: The information on the web is rapidly updated. Data such as news, stock markets, weather, sports, shopping, etc., are regularly updated.
Diversity of user communities: The user community on the web is rapidly expanding. These users have different backgrounds, interests, and usage purposes. There are more than 100 million workstations connected to the Internet, and the number is still rapidly increasing.
Relevancy of Information: It is considered that a particular person is generally interested in only a small portion of the web, while the rest of the web contains information that is not relevant to the user and may swamp desired results.
The DOM structure was initially introduced for presentation in the browser and not for description
of semantic structure of the web page. The DOM structure cannot correctly identify the semantic
relationship between the different parts of a web page.
Such a semantic structure corresponds to a tree structure. In this tree each node corresponds to
a block.
A value is assigned to each node. This value is called the Degree of Coherence. This value is
assigned to indicate the coherent content in the block based on visual perception.
The VIPS algorithm first extracts all the suitable blocks from the HTML DOM tree. After that
it finds the separators between these blocks.
The separators refer to the horizontal or vertical lines in a web page that visually cross with
no blocks.
The semantics of the web page is constructed on the basis of these blocks.
Data mining is widely used in diverse areas. There are a number of commercial data mining system
available today and yet there are many challenges in this field. In this tutorial, we will discuss the
applications and the trend of data mining.
FAQS
1] What are the different techniques used for classification of text data?
Assignment No:12
Prerequisites:
Knowledge of K-NN approach.
Objectives:
To learn the concept of K-NN approach with suitable example.
To implement K-NN approach.
Theory:
K-NN approach
In k-NN classification, the output is a class membership: an object is classified by a majority vote of its neighbors, being assigned to the class most common among its k nearest neighbors. In k-NN regression, the output is the property value for the object; this value is the average of the values of its k nearest neighbors. k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-NN algorithm is among the simplest of all machine learning algorithms.
Both for classification and regression, it can be useful to assign weight to the contributions of the
neighbors, so that the nearer neighbors contribute more to the average than the more distant ones.
For example, a common weighting scheme consists in giving each neighbor a weight of 1/d, where
d is the distance to the neighbor. The neighbors are taken from a set of objects for which the class
(for k-NN classification) or the object property value (for k-NN regression) is known. This can be
thought of as the training set for the algorithm, though no explicit training step is required.
A limitation of the k-NN algorithm is that it is sensitive to the local structure of the data. The
algorithm has nothing to do with and is not to be confused with k-means, another popular machine
learning technique.
Algorithm
The training examples are vectors in a multidimensional feature space, each with a class label.
The training phase of the algorithm consists only of storing the feature vectors and class labels of
the training samples. In the classification phase, k is a user-defined constant, and an unlabeled
vector (a query or test point) is classified by assigning the label which is most frequent among the
k training samples nearest to that query point.
A commonly used distance metric for continuous variables is Euclidean distance. For discrete
variables, such as for text classification, another metric can be used, such as the overlap metric (or
Hamming distance). In the context of gene expression microarray data, for example, k-NN has also
been employed with correlation coefficients such as Pearson and Spearman. Often, the
classification accuracy of k-NN can be improved significantly if the distance metric is learned with
specialized algorithms such as Large Margin Nearest Neighbor or Neighbourhood components
analysis.
A drawback of the basic "majority voting" classification occurs when the class distribution is
skewed. That is, examples of a more frequent class tend to dominate the prediction of the new
example, because they tend to be common among the k nearest neighbors due to their large
number. One way to overcome this problem is to weight the classification, taking into account the
distance from the test point to each of its k nearest neighbors. The class (or value, in regression
problems) of each of the k nearest points is multiplied by a weight proportional to the inverse of
the distance from that point to the test point. Another way to overcome skew is by abstraction in
data representation. For example in a self-organizing map (SOM), each node is a representative (a
center) of a cluster of similar points, regardless of their density in the original training data. K-NN
can then be applied to the SOM.
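A compact C++ sketch of this classification phase (2-D feature vectors with integer class labels, Euclidean distance and an unweighted majority vote; all of these are choices the assignment leaves open) is:

#include <algorithm>
#include <iostream>
#include <map>
#include <vector>

struct Sample { double x, y; int label; };

// Classify a query point by majority vote among its k nearest training samples.
int knnClassify(std::vector<Sample> train, double qx, double qy, int k) {
    // Sort the training samples by (squared) Euclidean distance to the query.
    std::sort(train.begin(), train.end(), [&](const Sample& a, const Sample& b) {
        double da = (a.x - qx) * (a.x - qx) + (a.y - qy) * (a.y - qy);
        double db = (b.x - qx) * (b.x - qx) + (b.y - qy) * (b.y - qy);
        return da < db;
    });
    std::map<int, int> votes;
    for (int i = 0; i < k && i < static_cast<int>(train.size()); ++i)
        ++votes[train[i].label];
    int best = -1, bestCount = -1;
    for (const auto& [label, count] : votes)
        if (count > bestCount) { best = label; bestCount = count; }
    return best;
}

int main() {
    std::vector<Sample> train = {
        {1.0, 1.0, 0}, {1.5, 2.0, 0}, {2.0, 1.5, 0},
        {6.0, 6.0, 1}, {6.5, 7.0, 1}, {7.0, 6.5, 1}
    };
    std::cout << "class = " << knnClassify(train, 6.2, 6.4, 3) << "\n";   // expected 1
    return 0;
}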
Parameter selection
The best choice of k depends upon the data; generally, larger values of k reduce the effect of noise
on the classification, but make boundaries between classes less distinct. A good k can be selected
by various heuristic techniques (see hyperparameter optimization). The special case where the
class is predicted to be the class of the closest training sample (i.e. when k = 1) is called the nearest
neighbor algorithm.
The accuracy of the k-NN algorithm can be severely degraded by the presence of noisy or
irrelevant features, or if the feature scales are not consistent with their importance. Much research
effort has been put into selecting or scaling features to improve classification. A particularly popular approach is the use of evolutionary algorithms to optimize feature scaling. Another popular approach is to scale features by the mutual information of the training data with the training classes.
In binary (two-class) classification problems, it is helpful to choose k to be an odd number as this avoids tied votes. One popular way of choosing the empirically optimal k in this setting is via the bootstrap method.
Feature Extraction.
When the input data to an algorithm is too large to be processed and it is suspected to be
notoriously redundant (e.g. the same measurement in both feet and meters) then the input data will
be transformed into a reduced representation set of features (also named features vector).
Transforming the input data into the set of features is called feature extraction. If the features
extracted are carefully chosen it is expected that the features set will extract the relevant
information from the input data in order to perform the desired task using this reduced
representation instead of the full size input. Feature extraction is performed on raw data prior to
applying k-NN algorithm on the transformed data in feature space.
Conclusion:
The K-NN approach is studied and implemented.
GROUP C: ASSIGNMENTS
(Any one)
Assignment No:13
Title: Generate Huffman codes for a gray scale 8 bit image.
Prerequisites:
Knowledge of Huffman codes.
Objectives:
To generate Huffman codes for a gray scale 8 bit image.
Theory:
Huffman coding
Huffman coding is an algorithm developed by David A. Huffman while he was a Ph.D. student at MIT, and published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes".
The output from Huffman's algorithm can be viewed as a variable-length code table for encoding a source
symbol (such as a character in a file). The algorithm derives this table from the estimated probability or
frequency of occurrence (weight) for each possible value of the source symbol. As in other entropy
encoding methods, more common symbols are generally represented using fewer bits than less common
symbols. Huffman's method can be efficiently implemented, finding a code in time linear in the number of input weights if these weights are sorted. However, although optimal among methods encoding symbols
separately, Huffman coding is not always optimal among all compression methods.
The beauty of Huffman codes is that variable length codes can achieve a higher data density than fixed
length codes if the characters differ in frequency of occurrence. The length of the encoded character is
inversely proportional to that character's frequency. Huffman wasn't the first to discover this, but his
paper presented the optimal algorithm for assigning these codes. Huffman codes are similar to the
Morse code. Morse code uses few dots and dashes for the most frequently occurring letter. An E is
represented with one dot. A T is represented with one dash. Q, a letter occurring less frequently is
represented with dash-dash-dot-dash.
Huffman codes are created by analyzing the data set and assigning short bit streams to the datum occurring
most frequently. The algorithm attempts to create codes that minimize the average number of bits per
character. Table 9.1 shows an example of the frequency of letters in some text and their corresponding
Huffman code. To keep the table manageable, only letters were used. It is well known that
in English text, the space character is the most frequently occurring character.
As expected, E and T had the highest frequency and the shortest Huffman codes. Encoding with these
codes is simple. Encoding the word toupee would be just a matter of stringing together the appropriate
bit strings, as follows:
T 0 U P E E
One ASCII character requires 8 bits. The original 48 bits of data have been coded with 23 bits
achieving a compression ratio of 2.08.
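A minimal C++ sketch of Huffman code construction (shown here for a handful of symbol frequencies; for the 8-bit gray-scale image of the assignment the symbols would be the 256 gray levels and the frequencies their histogram counts) is:

#include <iostream>
#include <queue>
#include <string>
#include <vector>

struct Node {
    int symbol;                  // gray level (or -1 for an internal node)
    long freq;                   // frequency of occurrence (weight)
    Node *left = nullptr, *right = nullptr;
};

struct Greater {                 // min-heap ordering by frequency
    bool operator()(const Node* a, const Node* b) const { return a->freq > b->freq; }
};

// Walk the tree; going left appends '0', going right appends '1'.
void emit(const Node* n, const std::string& code) {
    if (!n) return;
    if (!n->left && !n->right) {
        std::cout << "symbol " << n->symbol << " -> " << (code.empty() ? "0" : code) << "\n";
        return;
    }
    emit(n->left, code + "0");
    emit(n->right, code + "1");
}

int main() {
    // Example histogram: (gray level, count). For a real image, build this from the pixels.
    std::vector<std::pair<int, long>> hist = {{0, 45}, {64, 13}, {128, 12}, {192, 16}, {255, 9}};

    std::priority_queue<Node*, std::vector<Node*>, Greater> pq;
    for (auto [sym, f] : hist) pq.push(new Node{sym, f});

    // Repeatedly merge the two least frequent nodes until one tree remains.
    // (Nodes are intentionally not freed in this short sketch.)
    while (pq.size() > 1) {
        Node* a = pq.top(); pq.pop();
        Node* b = pq.top(); pq.pop();
        pq.push(new Node{-1, a->freq + b->freq, a, b});
    }
    emit(pq.top(), "");
    return 0;
}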
Modified Huffman coding is used in fax machines to encode black on white images (bitmaps). It is also
an option to compress images in the TIFF file format. It combines the variable length codes of Huffman
coding with the coding of repetitive data in run length encoding. Since facsimile transmissions are
typically black text or writing on white background, only one bit is required to represent each pixel or
sample. These samples are referred to as white bits and black bits. The runs of white bits and black bits
are counted, and the counts are sent as variable length bit streams.
The encoding scheme is fairly simple. Each line is coded as a series of alternating runs of white and
black bits. Runs of 63 or less are coded with a terminating code. Runs of 64 or greater require that a
makeup code prefix the terminating code. The makeup codes are used to describe runs in multiples of
64 from 64 to 2560. This deviates from the normal Huffman scheme which would normally require
encoding all 2560 possibilities. This reduces the size of the Huffman code tree and accounts for the
term modified in the name.
Studies have shown that most facsimiles are 85 percent white, so the Huffman codes have been
optimized for long runs of white and short runs of black. The protocol also assumes that the line begins
with a run of white bits. If it doesn't, a run of white bits of 0 length must begin the encoded line. The
encoding then alternates between black bits and white bits to the end of the line. Each scan line ends
with a special EOL (end of line) character consisting of eleven zeros and a 1 (000000000001). The
EOL character doubles as an error recovery code. Since there is no other combination of codes that has
more than seven zeroes in succession, a decoder seeing eight will recognize the end of line and
continue scanning for a 1. Upon receiving the 1, it will then start a new line. If bits in a scan line get
corrupted, the most that will be lost is the rest of the line. If the EOL code gets corrupted, the most that
will get lost is the next line.
Tables 13.2 and 13.3 show the terminating and makeup codes. Figure 13.1 shows how to encode a
1275 pixel scanline with 53 bits.
Run Length   White bits   Black bits      Run Length   White bits   Black bits
0            00110101     0000110111      32           00011011     000001101010
1            000111       010             33           00010010     000001101011
Run Length   White bits    Black bits
64           11011         000000111
128 10010 00011001000
192 010111 000011001001
256 0110111 000001011011
320 00110110 000000110011
384 00110111 000000110100
448 01100100 000000110101
512 01100101 0000001101100
576 01101000 0000001101101
640 01100111 0000001001010
704 011001100 0000001001011
768 011001101 0000001001100
832 011010010 0000001001101
896 101010011 0000001110010
960 011010100 0000001110011
1024 011010101 0000001110100
1088 011010110 0000001110101
1152 011010111 0000001110110
1216 011011000 0000001110111
1280 011011001 0000001010010
1344 011011010 0000001010011
1408 011011011 0000001010100
1472 010011000 0000001010101
1536 010011001 0000001011010
1600 010011010 0000001011011
Run      Color    Code words
0        white    00110101
1        black    010
4        white    1011
2        black    11
1        white    0111
1        black    010
1266     white    011011000 + 01010011
EOL               000000000001
Conclusion:
Generation of Huffman codes for a gray scale 8 bit image is studied.