Original Title: Lab Manual Cl i Vibhs(1)

Uploaded by Rohit

ENGINEERING

LAB MANUAL

FOR

Prepared By

Vibha Lahane

Practicals: 4 Hrs/Week Practical Assessment : 50

List of Practicals

1 Using Divide and Conquer Strategies design a function for Binary Search using C++/ Java/ Python/Scala.

Group A

2 Using Divide and Conquer Strategies design a class for Concurrent Quick Sort using C++.

variables. (Lexical analyzer for sample language using LEX).

YACC and Lex. Provide the details of all conflicting entries in the parser table

generated by LEX and YACC and how they have been resolved

variables, arithmetic expressions, for if, if-else statement as per syntax of C

to generate three address code for the given input.

6 Write a program to implement k-means clustering using C++.

8 Write a LEX and YACC program to generate abstract syntax tree.

data. Perform tasks as per requirement.

GROUP A: ASSIGNMENTS

(Mandatory Six Assignments)

Assignment No: 01

Title: Using Divide and Conquer Strategies design a function for Binary Search using C++/

Java/ Python/Scala.

Aim: Implementation of Binary Search algorithm using C++/ Java/ Python/Scala.

Prerequisites:

Knowledge of writing programs in C++.

Objectives:

To learn the concept of Divide and Conquer Strategy.

To study the design and implementation of Binary Search algorithm.

Theory:

Divide and Conquer strategy:

A divide and conquer algorithm works by recursively breaking down a problem into two or more sub-

problems of the same (or related) type, until these become simple enough to be solved directly. The

solutions to the sub-problems are then combined to give a solution to the original problem.

This technique is the basis of efficient algorithms for all kinds of problems, such as sorting (e.g.,

quicksort, merge sort), multiplying large numbers, syntactic analysis (e.g., top-down parsers) and

computing the discrete Fourier transform (FFTs).

Searching

Sequential Algorithm

function sequential (T [1 .. n], x)
{ sequential search for x in array T }
for i <- 1 to n do
    if T [i] >= x then return i
return n+1

This algorithm clearly takes a time in Θ(r), where r is the index returned: this is O(n) in the worst case and

Θ(1) in the best case. If we assume that all the elements of T are distinct, that x is indeed somewhere in the

array

CL-I B.E. Computer Engineering

Binary Search

The binary search algorithm begins by comparing the target value to the value of the middle element of the

sorted array. If the target value is equal to the middle element's value, the position is returned. If the

target value is smaller, the search continues on the lower half of the array, or if the target value is

larger, the search continues on the upper half of the array. This process continues until the element is

found and its position is returned, or there are no more elements left to search for in the array and a

"not found" indicator is returned.

Binary search can be applied only to a sorted list. It searches sorted lists using a divide and conquer

technique. On each iteration the search domain is cut in half, until the result is found. The

computational complexity of a binary search is O(log n).

function binsearch (T [1 .. n], x)
{ binary search for x in array T [1 .. n] }
if n = 0 or x > T [n] then return n+1
else return binrec (T [1 .. n], x)

function binrec (T [i .. j], x)
{ binary search for x in subarray T [i .. j] }
if i = j then return i
k <- (i+j) div 2
if x <= T [k] then return binrec (T [i .. k], x)
else return binrec (T [k+1 .. j], x)

Binary searching is the algorithm used to look up a word in a dictionary or a name in a telephone

directory. It is probably the simplest application of divide-and-conquer. It can be applied to a sorted list

only.
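The recursive search described above can be sketched in C++ as follows. This is only an illustrative sketch: it assumes a sorted std::vector<int> with 0-based indices (the pseudocode uses 1-based indices), and the helper name contains is an assumption that does not appear in the manual. The function returns the position of the first element that is not smaller than x, or the size of the array when every element is smaller.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Recursive helper: searches t[lo..hi] (inclusive bounds) for the first
// position whose value is >= x. Only called when x <= t[hi].
std::size_t binrec(const std::vector<int>& t, std::size_t lo, std::size_t hi, int x) {
    assert(lo <= hi && hi < t.size());
    if (lo == hi) return lo;
    const std::size_t mid = lo + (hi - lo) / 2;      // floor of the midpoint
    if (x <= t[mid]) return binrec(t, lo, mid, x);   // answer lies in the lower half
    return binrec(t, mid + 1, hi, x);                // answer lies in the upper half
}

// Returns the 0-based index of the first element >= x, or t.size() when the
// array is empty or every element is smaller than x.
std::size_t binsearch(const std::vector<int>& t, int x) {
    if (t.empty() || x > t.back()) return t.size();
    return binrec(t, 0, t.size() - 1, x);
}

// Hypothetical convenience wrapper: true only when x is actually present.
bool contains(const std::vector<int>& t, int x) {
    const std::size_t pos = binsearch(t, x);
    return pos < t.size() && t[pos] == x;
}
```

Each call halves the range that is still under consideration, so the number of recursive calls grows with log n rather than n.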

Dr. D. Y. Patil College of Engg.,Ambi

Conclusion:

The concept of divide and conquer strategy is studied and binary search algorithm is implemented

using C++.

FAQs:

1) What is Divide and Conquer approach? Also explain its advantages.

3) Explain the need of analysis of algorithm with respect to complexities as well as techniques

used for analysis.

4) Compute time complexity and space complexity of your program. Also give the proper

justification for same.

5) Compare the conventional Binary Search algorithm and the Divide and Conquer Binary Search

algorithm. Also explain the advantages of Divide and Conquer approach in terms of quick sort.

6) Compare the Divide and Conquer, Concurrent programming, Backtracking, and Branch and

Bound approaches.

Assignment No: 02

Title: Using Divide and Conquer Strategies design a class for Concurrent Quick Sort using

C++.

Prerequisites:

Knowledge of writing programs in C++.

Objectives:

To learn the concept of Divide and Conquer Strategy.

To study the design and implementation of Quick Sort algorithm.

Theory:

Divide and Conquer strategy:

A divide and conquer algorithm works by recursively breaking down a problem into two or more sub-

problems of the same (or related) type, until these become simple enough to be solved directly. The

solutions to the sub-problems are then combined to give a solution to the original problem.

This technique is the basis of efficient algorithms for all kinds of problems, such as sorting (e.g.,

quicksort, merge sort), multiplying large numbers, syntactic analysis (e.g., top-down parsers) and

computing the discrete Fourier transform (FFTs).

Sorting

Quick Sort

The sorting algorithm invented by Hoare, usually known as "quicksort", is also based on the idea of

divide-and-conquer. As a first step, this algorithm chooses one of the items in the array to be sorted as

the pivot. The array is then partitioned on either side of the pivot: elements are moved in such a way

that those greater than the pivot are placed on its right, whereas all the others are moved to its left. If

now the two sections of the array on either side of the pivot are sorted independently by recursive calls

of the algorithm, the final result is a completely sorted array, no subsequent merge step being necessary.

To balance the sizes of the two sub instances to be sorted, we would like to use the median element as

the pivot. Finding the median takes more time than it is worth. For this reason we simply use the first

element of the array as the pivot. The quick sort algorithm is given below.

procedure quicksort (T [i .. j])
{ sorts array T [i .. j] into increasing order }
if j - i is small then insert (T [i .. j])
else pivot (T [i .. j], l)
     quicksort (T [i .. l-1])
     quicksort (T [l+1 .. j])

Let p = T [i] be the pivot. One good way of pivoting consists of scanning the array T [i .. j] just once,

but starting at both ends. Pointers k and l are initialized to i and j + 1, respectively. Pointer k is then

incremented until T [k] > p, and pointer l is decremented until T [l] <= p. Now T [k] and T [l] are

interchanged. This process continues as long as k < l. Finally, T [i] and T [l] are interchanged to put the

pivot in its correct position.

procedure pivot (T [i .. j]; var l)
{ permutes the elements in array T [i .. j] in such a way that, at the
  end, i <= l <= j, the elements of T [i .. l-1] are not greater than p,
  T [l] = p, and the elements of T [l+1 .. j] are greater than
  p, where p is the initial value of T [i] }
p <- T [i]
k <- i; l <- j+1
repeat k <- k+1 until T [k] > p or k >= j
repeat l <- l-1 until T [l] <= p
while k < l do
    interchange T [k] and T [l]
    repeat k <- k+1 until T [k] > p
    repeat l <- l-1 until T [l] <= p
interchange T [i] and T [l]

Quicksort is a comparison-based sorting algorithm. It is a recursive algorithm that selects one key from the list,

the pivot, and finds the position in the list where that key should be placed. In doing so, i) the keys smaller than the pivot are placed on the low side of the pivot

and ii) the keys larger than or equal to the pivot are placed on the high side of the pivot. Then the

same procedure is recursively applied to these two parts.

The average time complexity of Quick sort is O(n log n). The worst-case time complexity is O(n^2).
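Because the two parts on either side of the pivot are sorted independently, one of them can be handed to another thread. The following is a minimal sketch of that idea, not a definitive implementation. It assumes a std::vector<int>, 0-based indices and the first element as pivot; the function name concurrentQuickSort, the depth parameter that limits how many levels start a new task, and the fallback to sequential sorting when no thread can be started are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <system_error>
#include <utility>
#include <vector>

// Partitions t[lo..hi] around p = t[lo], scanning from both ends as in the
// pivot procedure. Returns the final index l of the pivot.
std::size_t pivot(std::vector<int>& t, std::size_t lo, std::size_t hi) {
    assert(lo <= hi && hi < t.size());
    const int p = t[lo];
    std::size_t k = lo;
    std::size_t l = hi + 1;
    do { ++k; } while (k < hi && t[k] <= p);   // until t[k] > p or k >= hi
    do { --l; } while (t[l] > p);              // until t[l] <= p
    while (k < l) {
        std::swap(t[k], t[l]);
        do { ++k; } while (t[k] <= p);
        do { --l; } while (t[l] > p);
    }
    std::swap(t[lo], t[l]);                    // put the pivot in its final place
    return l;
}

// Sorts t[lo..hi]. While depth > 0 the left part is sorted in a separate task.
void quicksort(std::vector<int>& t, std::size_t lo, std::size_t hi, int depth) {
    if (lo >= hi) return;
    const std::size_t l = pivot(t, lo, hi);
    auto sortLeft = [&t, lo, l](int d) {
        if (l > lo) quicksort(t, lo, l - 1, d);
    };
    std::future<void> left;
    if (depth > 0) {
        try {
            left = std::async(std::launch::async, sortLeft, depth - 1);
        } catch (const std::system_error&) {
            // No thread could be started; the left part is sorted below instead.
        }
    }
    quicksort(t, l + 1, hi, depth > 0 ? depth - 1 : 0);
    if (left.valid()) left.get();   // wait for the concurrent half
    else sortLeft(0);               // sequential fallback
}

void concurrentQuickSort(std::vector<int>& t, int depth = 3) {
    if (!t.empty()) quicksort(t, 0, t.size() - 1, depth);
}
```

The two tasks touch disjoint ranges of the array, so no locking is needed. The depth limit keeps the number of threads bounded (at most about 2^depth at a time).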

Flow Chart for Quick Sort using Divide and Conquer Approach.

Conclusion:

The concept of divide and conquer strategy is studied and Concurrent Quick Sort algorithm is

implemented using C++.

FAQs

1) Explain the need of Divide and Conquer approach for Quick Sort.

2) What is advantage of Divide and Conquer Technique over the recursion?

3) Compare the conventional Quick Sort algorithm with Quick sort using Divide and Conquer .

4) When does the worst case of Quick Sort occur?

5) What are the advantages and disadvantages of quick sort?

6) What is the complexity of quick sort?

Assignment No: 3

Aim:

Assignment to understand the syntax of LEX specifications, built-in functions and variables. (Lexical

analyzer for sample language using LEX)

Objective:

1. To understand how to construct a compiler using LEX and YACC. LEX and YACC are tools used to

generate lexical analyzers and parsers.

What is LEX?

It is a tool for generating lexical analyzers. It takes a specification of tokens in the form of a

list of regular expressions, and from this input LEX generates a lexical analyzer. Its source file is a

specification file consisting of a set of regular expressions, each together with an action.

This file has three sections as given below:

%{
%}
Definition Section
%%
Rules Section
%%
User Subroutines

I] Definition Section:

In this section the literal block, definitions, internal table declarations, start conditions and

translations are included.

C code can also be used as it is by writing it between the special brackets %{ and %} shown in the layout above;

all code between those brackets is copied unchanged into lex.yy.c. We can also declare regular expressions in

this section and then use them in the Rules section.

Some of the regular expression operators used by LEX are listed below with their meanings:

language, a language that you use to describe particular patterns of interest.

^ As the first character inside brackets, matches any character except the ones within the brackets

\ Escape character

regular expression

matched by any input stream. This action is a typical C Code Statements stating what action

should be taken by LEX after matching pattern.

This section is for defining the other subroutines required for the Lexical analyzer like symbol

table management etc.

Hence it is also a typical C code section. The main() function is defined here and calls yylex().

The yylex() function is generated by LEX and is defined in lex.yy.c.

Block Diagram:

a.out

Input : FirstLexProgram.l

Output : lex.yy.c

This command converts the LEX specification given in FirstLexProgram.l into C code. The default

file in which this C code is stored is lex.yy.c.

Input : lex.yy.c

Output : a.out

This command compiles lex.yy.c, generated in the first step, and in doing so checks that it is syntactically correct

according to C language syntax.

-o : Names the output file, i.e. the result of compilation is stored in the file mentioned after it.

a.out: File containing the output of compilation. a.out is the default; we can change it, i.e. we can store

the result in any file.

The final a.out is the lexical analyzer itself. If we provide an input stream to a.out, it separates

out the different tokens in the given input stream.

Built-in variables

matched the pattern.

3. yyleng:- Holds the length of the string recognized by the lexer

4. yyin :- The input file pointer that the lexer reads from; by default it is stdin

Built-in Functions

1. yylex() :- The lexical analyzer produced by LEX is a C routine called yylex().

read.

Built-in macros

a. input():- Gets the next character from the input

b. unput():- Puts a character back in the logical input stream

LEX patterns only match a given input character or string once. LEX executes the action for the longest

possible match for the current input. If two rules match the same length, the lexer uses the rule listed earlier.

Here is a program that does nothing at all. All input is matched, but no action is associated with any

pattern, so there will be no output.

%%
.
\n

The following example prepends line numbers to each line in a file. Some implementations of lex

predefine and calculate yylineno. The input file for lex is yyin, and defaults to stdin.

Whitespace must separate the defining term and the associated expression. References to substitutions

in the rules section are surrounded by braces ({letter}) to distinguish them from literals. When we have a

match in the rules section, the associated C code is executed. Here is a scanner that counts the number of

characters, words, and lines in a file (similar to Unix wc).
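The counting that such a scanner performs can be sketched in plain C++. This is not LEX source and not LEX output; it only mirrors the logic of counting every character, every newline, and every run of non-whitespace characters. The names Counts and countText are assumptions made for illustration.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

struct Counts {
    std::size_t chars = 0;
    std::size_t words = 0;
    std::size_t lines = 0;
};

// Counts characters, whitespace-separated words and newline characters.
Counts countText(const std::string& text) {
    Counts c;
    bool inWord = false;
    for (unsigned char ch : text) {
        ++c.chars;
        if (ch == '\n') ++c.lines;
        if (std::isspace(ch)) {
            inWord = false;            // whitespace ends the current word
        } else if (!inWord) {
            inWord = true;             // first character of a new word
            ++c.words;
        }
    }
    assert(c.words <= c.chars && c.lines <= c.chars);   // sanity check
    return c;
}
```

For example, countText("hello world\nfoo\n") gives 16 characters, 3 words and 2 lines.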

Conclusion:

LEX is a tool which accepts regular expressions as input and generates C code to recognize the

corresponding tokens. When a token is identified, LEX lets us run user-defined routines.

When we give an input specification file to LEX, it generates lex.yy.c as output. This file contains the function

yylex(), which holds the C code that recognizes each token and the action to be carried

out when the token is found.

We also wrote a small LEX specification for recognizing the C type comments.

FAQs:

3. What is a parser?

Assignment No. 4

Aim:

Write an ambiguous CFG to implement Parser for sample Language using YACC and Lex.

Provide the details of all conflicting entries in the parser table generated by LEX and YACC and how

they have been resolved.

Objectives:

Theory:

Ambiguous grammars:

C and Java have an ambiguity in the grammar for expressions, which, hugely simplified, looks

something like this:

exp : exp '-' sub_exp
    | sub_exp
    ;
sub_exp : '(' type_name ')' sub_exp
        | '-' sub_exp
        | id
        | literal
        | '(' exp ')'
        ;
type_name : id
          | more_complex_type_descriptions
          ;

but what is meant by: ( b ) - ( c ) ?

The problem is that a single input string corresponds to more than one possible parse tree.

That is, it is a valid part of the language, but we don't know what it means for certain!

This is a genuine problem with Java and with C, that takes extra work by compiler-writers to

solve - every identifier has to be checked (e.g. by LEX) to see if it has already appeared in a class or

typedef declaration, in which case it is definitely a type_name, otherwise it is an ordinary id and can't

become a type_name. We would also need to modify the grammar slightly to make this distinction

clear.

Ambiguous grammars are, by definition, going to be difficult to handle no matter what tools

we use. The assumption made with languages designed for computers is that we do our best to make

them unambiguous. Therefore, we would normally expect any tools we use, like YACC, only to have

to handle unambiguous grammars. Given that, can they handle any unambiguous grammar?

Unfortunately, the answer is ``no'' - there are unambiguous grammars that tools like YACC

and JAVACC can't handle. Luckily, for most good tools, you are unlikely to come across such a

grammar, and if you do, you can usually modify the grammar to overcome the problems but still

recognize the same language.

Equally unfortunately, there is no way of deciding whether a grammar is ambiguous or not -

the best that can be done is to try to create a parser, but if the process fails it can't tell us whether this

is because the grammar is really ambiguous or if it is just because the grammar is too confusing for

the kind of parser we are trying to make.

How to confuse parsers:

The decision that a parser repeatedly makes is: given what it has already read of the input, and

the grammar rules it has already recognised, what grammar rule comes next? The more input the

parser can look at before it has to make a decision, the more likely it is to be able to avoid confusion

and get it right.

For example, suppose we look at languages where assignment is a particular kind of

statement, rather than an operation that can be embedded in any expression:

stat : target '=' exp ';'
     | target '(' explist ')' ';'
     ;
target : id
       | target '.' id
       ;

An LL(1) parser trying to compile this language would have difficulties distinguishing

between assignments (e.g. a=x;) and procedure calls i.e. functions/methods returning void (e.g. a(x);).

This is because an LL(1) parser has to decide which kind of statement it is looking at after seeing only

1 symbol (i.e. a), and it isn't until we see the = or ( that we can tell what is intended. Suppose we used

a more complex algorithm, such as LL(3) - even this couldn't decide between e.g. a.b=x and a.b(x). In

fact, no matter how far it looks ahead, an LL(n) parser, which looks ahead a fixed amount, can always

be confused by a sufficiently complicated target in an assignment or call.

There are two kinds of solutions - the parser can use a variable amount of lookahead, as

JAVACC can be asked to do, so it reads as far as the = or ( before making a decision - or we can

rewrite the grammar, by left-factorising it, so that the two kinds of statement are merged until we can

make the decision:

stat : target assign_or_call ';'

;

assign_or_call : '=' exp

| '(' explist ')'

;
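The left-factored grammar can be turned into a small recursive-descent recognizer, sketched below under simplifying assumptions that are not part of the grammar above: an id is a run of letters, exp is a single id, and explist is one or more ids separated by commas. The names StatParser, StatKind and classify are hypothetical. The point of the sketch is that the parser reads the whole target first and only then, on seeing = or (, decides which kind of statement it has.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

enum class StatKind { Assignment, Call, Error };

class StatParser {
public:
    explicit StatParser(const std::string& s) : src(s) {}

    // stat : target assign_or_call ';'
    StatKind parseStat() {
        assert(pos == 0);                      // one statement per parser object
        if (!target()) return StatKind::Error;
        const StatKind kind = assignOrCall();  // the decision is made only here
        if (kind == StatKind::Error || !eat(';')) return StatKind::Error;
        skipSpaces();
        return pos == src.size() ? kind : StatKind::Error;
    }

private:
    std::string src;
    std::size_t pos = 0;

    void skipSpaces() {
        while (pos < src.size() && std::isspace(static_cast<unsigned char>(src[pos]))) ++pos;
    }

    bool eat(char c) {
        skipSpaces();
        if (pos < src.size() && src[pos] == c) { ++pos; return true; }
        return false;
    }

    bool id() {
        skipSpaces();
        const std::size_t start = pos;
        while (pos < src.size() && std::isalpha(static_cast<unsigned char>(src[pos]))) ++pos;
        return pos > start;
    }

    // target : id | target '.' id   (the left recursion is written as a loop)
    bool target() {
        if (!id()) return false;
        while (eat('.')) {
            if (!id()) return false;
        }
        return true;
    }

    // assign_or_call : '=' exp | '(' explist ')'
    StatKind assignOrCall() {
        if (eat('=')) return id() ? StatKind::Assignment : StatKind::Error;
        if (eat('(')) {
            if (!id()) return StatKind::Error;
            while (eat(',')) {
                if (!id()) return StatKind::Error;
            }
            return eat(')') ? StatKind::Call : StatKind::Error;
        }
        return StatKind::Error;
    }
};

StatKind classify(const std::string& s) { return StatParser(s).parseStat(); }
```

With this structure, classify("a.b=x;") and classify("a.b(x);") share all of the work of reading a.b, however long the target is, and one symbol of lookahead is enough at the decision point.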

An LR (1) parser has no difficulty dealing with the original grammar, as it will have read to

the end of the statement, and seen the = or ( on the way, before it has to decide whether to recognize

an assignment or a call.

It is possible to construct unambiguous grammars that would confuse any LR(n) parser (as

well as any LL(n) parser) e.g. palindromes - strings that are their own mirror images, such as abba or

abacaba:

P:

| 'a' | 'b' | 'c' |...

| 'a' P 'a' | 'b' P 'b' | 'c' P 'c' | . . .

;

The problem is that, although it is perfectly obvious to us what to do - find the middle, and

work out to both ends - LR(n) and LL(n) read strictly left-to-right, and can only locate the middle of

the string by using their finite lookahead to find the end of the string. This could not work for strings

of length > n for LL(n), or length >2n for LR(n).

Confusing YACC:

Once an ambiguity has been pointed out in a grammar, it is usually clear enough to the user

what the problem is, even if it isn't obvious what to do about it. However, what kinds of error

messages are reported by tools like YACC, and how easy is it to find the corresponding ambiguity or

confusion?

YACC reports problems with grammars, whether ambiguous or just confusing, as shift/reduce

conflicts (where YACC can't decide whether to perform a shift or reduce - i.e. is the grammar rule

complete?) and/or as reduce/reduce conflicts (where YACC can't decide which reduce to perform -

i.e. which grammar rule is it?).

An example of a shift/reduce conflict:

The start of a function/method declaration in a C-like language, that accepts headers like void

fred(int a, int b, float x, float z), looks something like this:

header : type_name id '(' params ')'
       | type_name id '(' ')'
       ;
params : param
       | params ',' param
       ;
param : type_name id
      ;

YACC has no problems with this grammar, but what if we modify it? It might be nice to be

able to write the example above simply as void fred(int a, b, float x, z). We could try rewriting the

grammar like this:

param : type_name ids

;

ids : id

| ids ',' id

;

But now, YACC reports a shift/reduce conflict, and the details from the y.output file are:

13: shift/reduce conflict (shift 15, reduce 5) on ','

state 13

param : type_name ids . (5)

ids : ids . ',' id (7)

That is, when the generated parser sees a , after a list of identifiers in a param, it doesn't know

whether that , (and the id it expects after) is part of the same param (in which case it should shift, to

include them as part of the RHS) or the start of the next param (in which case it should reduce this

RHS and start a new RHS).

This is not ambiguous, just confusing to YACC, as it needs more lookahead to see if the next

few symbols are e.g. , a b (a is a type_name, b is a parameter name of type a) or , a , or , a ) (a is a

parameter name of the current type). The way to make this clear to YACC is to rewrite the grammar

so that it can see more of the input before having to make a decision:

params : type_name id

| params ',' type_name id

| params ',' id

;

An example of a reduce/reduce conflict:

state 8

sub_exp : id . (5)

type_name : id . (8)

That is, when it sees id) it doesn't know whether the id is a variable giving a value or a type

name, so it doesn't know which rule to use to recognize the id.

Assuming we don't already know what the problem is, this hasn't helped much, but we can get

more information by working back through the states in the y.output file to try to find how we get

here. To do so, we need to look for states that include shift 8 or goto 8. In this example, all we find is:

state 4

sub_exp : '(' . type_name ')' sub_exp (3)

sub_exp : '(' . exp ')' (7)

...

id shift 8

So the input must include (id), which can be recognized either as a type-cast or as an

expression.

This is a big hint about the source of the ambiguity in the grammar, but more by luck than

anything else - YACC remains confused even if we make the grammar unambiguous, by removing the

rule sub_exp : '-' sub_exp. YACC still reports the same reduce/reduce conflict for this modified

grammar, as it is confused by an input as simple as ( a ) - it has to decide whether this is a value in an

expression or a type-cast before it reads past the ) to see e.g. ( a ) 99 (i.e. a type-cast) or ( a ) - 99 (i.e.

the value a - 99).

Luckily, the solution to the general problem of the ambiguity - to somehow get LEX to

distinguish between identifiers that are really type names (or class names) and all other identifiers -

also solves this confusion for YACC.

Epilogue:

Most of the time, an ambiguous grammar results from an error made by the implementers of a

programming language. Sometimes, however, it is the fault of the language designer. Many languages

are defined in such a way that some part is either inherently ambiguous or confusing (e.g. not LR(1)).

Does this matter? We should not limit language designers to what a particular type of parser generator

can cope with, but on the other hand there is no particular merit in making a language harder to

compile if a small change can simplify the problem.

An example of this is a well-known problem with conditional statements; the dangling else.

Most imperative languages permit conditional statements to take two slightly different forms:

if ( ... ) ...

if ( ... ) ... else ...

So the else d in if (a) if (b) c else d could be associated either with if (a) or with if (b).

Most languages attempt to fix this problem by stating that the second interpretation is more

natural, and so is correct, although some languages have different rules. Whatever the language

definition, it is an extra rule that anyone learning the language has to remember.
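C++ is one of the languages that follows the second interpretation, where the else belongs to the nearest if. The sketch below shows this with a deliberately unbraced statement; the function name run and the result strings are assumptions chosen for illustration, and the pragma only silences the compiler warning that this style normally triggers.

```cpp
#include <string>

#if defined(__GNUC__)
#pragma GCC diagnostic ignored "-Wdangling-else"
#endif

// Mirrors "if (a) if (b) c else d" without braces.
std::string run(bool a, bool b) {
    std::string result = "nothing";
    if (a) if (b) result = "c"; else result = "d";
    return result;
}
```

If the else were paired with if (a), then run(false, true) would yield "d". Because it is paired with if (b), it yields "nothing", and "d" is produced only by run(true, false).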

Similarly, the compiler writer has to deal with this special case: if we use a tool like YACC we

get a shift/reduce error - do we shift the else to get if (b) c else d, or do we reduce the if (b) c as it

stands, so we get if (a) ... else d? To overcome this problem, we can rewrite the grammar to explicitly

say ``you can't have an unmatched then (logically) immediately before an else - the then and the else

must be paired up'':

stat : matched

| unmatched

|...

|...

| exp

removes this ambiguity - to have a terminating keyword such as end_if or fi:

| . . .;

Conclusion:

We have written an ambiguous CFG to recognize an infix expression and implemented a parser

that recognizes the infix expression using YACC. We have also provided the details of all conflicting entries in the

parser table generated by LEX and YACC and how they have been resolved.

Questions:

2. Describe the way to avoid confusion of parsers?

3. What is reduce/ reduce conflict?

4. What is ambiguity?

Assignment No. 05

Aim:

arithmetic expressions, for if, if-else statement as per syntax of C to generate three address code

for the given input.

Theory:

Semantic Actions:

Parsing tools use a generalization of CFGs in which each grammar symbol has one or more

values, called attributes, associated with it. Each production of the grammar may have an

associated "action", which can refer to and compute the values of attributes. So we have:

Terminals and non-terminals have attributes

Productions have semantic actions

Example:

E -> E' + E

| E'

E' -> int * E'

| int

For each symbol, let X.val be an integer value associated with X.

For terminal symbols, val is the lexeme provided by the lexical analyzer.

For non-terminals, val should be the integer value of the expression. This attribute is

computed from the attributes of sub-expressions.

Production            Action

E  -> E' + E1         E.val  = E'.val + E1.val
    | E'              E.val  = E'.val
E' -> int * E1'       E'.val = int.val * E1'.val
    | int             E'.val = int.val

Note: the attribute of some grammar symbols, such as the terminals + and *, is unused.

Example:

5*3+2*4

Parse Tree

              E1
           /  |  \
        E3'   +   E2
       / | \       |
   int7  *  E4'   E5'
             |   / | \
          int8 int9 * E6'
                       |
                     int0

Equations

E1.val = E3'.val + E2.val
E3'.val = int7.val * E4'.val
E4'.val = int8.val
E2.val = E5'.val
E5'.val = int9.val * E6'.val
E6'.val = int0.val
int7.val = 5
int8.val = 3
int9.val = 2
int0.val = 4

Working from the leaves to the root, we can compute each val attribute.

For example, E6'.val = 4 and E5'.val = 8. Finally, E1.val = 23.
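The same bottom-up computation can be sketched as a small hand-written recursive-descent evaluator for this grammar, in which each function returns the val attribute of the symbol it recognizes. This is a sketch only: it assumes the input contains just digits, + and * with no spaces, and the names AttrEvaluator and evaluate are hypothetical. A parser generated by YACC would compute the same values in its semantic actions instead.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

class AttrEvaluator {
public:
    explicit AttrEvaluator(const std::string& s) : src(s) {}

    // E -> E' + E1   E.val = E'.val + E1.val
    //    | E'        E.val = E'.val
    int E() {
        int val = Eprime();
        if (peek() == '+') {
            ++pos;
            val += E();
        }
        return val;
    }

private:
    std::string src;
    std::size_t pos = 0;

    char peek() const { return pos < src.size() ? src[pos] : '\0'; }

    // int.val is the number denoted by the lexeme.
    int intVal() {
        assert(std::isdigit(static_cast<unsigned char>(peek())));
        int v = 0;
        while (std::isdigit(static_cast<unsigned char>(peek()))) {
            v = v * 10 + (src[pos] - '0');
            ++pos;
        }
        return v;
    }

    // E' -> int * E1'   E'.val = int.val * E1'.val
    //     | int         E'.val = int.val
    int Eprime() {
        int val = intVal();
        if (peek() == '*') {
            ++pos;
            val *= Eprime();
        }
        return val;
    }
};

int evaluate(const std::string& s) { return AttrEvaluator(s).E(); }
```

Each return value depends only on the values returned for the children, which is what makes val a synthesized attribute. For the example input, evaluate("5*3+2*4") computes 15 + 8 = 23.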

Notes:

1. Fresh attributes are associated with every node in the parse tree.

2. The semantic actions specify a system of equations; they don't say in what order the

equations are resolved. The user just gives a specification and the parser takes care of the

implementation.

Warning: You can use side-effects in semantic actions, but in this case you have to understand the

order in which attributes get computed or the results will seem unpredictable.

3. In this example, the val attribute can be evaluated bottom-up: the .val attribute for a node

of the parse depends only on the .val attributes of its children.

4. The parse tree need not actually be built by the parser. In fact, a parser tool would

compile this specification into code that simply traces out the structure of the parse tree

without actually building it.

5. Pattern/action parsing can be thought of as a systematic translation of the original text into

a new form specified by the semantic actions. Because the translation is guided by the syntax,

it is called syntax-directed translation. (NB: Book uses SDT in a narrower sense.)

6. Attributes may also be passed top-down: an attribute of a node may depend on an attribute

of the parent in the parse tree. Such an attribute is called "inherited". We will talk about

inherited attributes eventually, but they will not be used in the course project.

Synthesized:

Attribute value depends on descendants of the node

Example: the val attribute above

Inherited:

Attribute value depends on parent and siblings of the node

Example: symbol table environment

S-attributed Definitions:

- An attribute grammar is S-attributed if it consists only of synthesized attributes

- Can be evaluated bottom-up:

  - Keep a stack S parallel to the parsing stack

  - Consider the production

    A -> XY A.val = X.val + Y.val

  - When reducing by A -> XY

    - the top of the S stack has X.val and Y.val

    - compute A.val

    - pop X.val and Y.val from S, push A.val

    - symmetric with the reduce action on the parse stack

- Tools like Bison/Flex support S-attributed definitions
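The stack manipulation for a reduction by A -> XY can be sketched as follows. The value stack is modelled here as a std::vector<int>, and the function name reduceAdd is an assumption made for illustration; a generated parser keeps such a stack internally.

```cpp
#include <cassert>
#include <vector>

// Reduction by A -> X Y with the action A.val = X.val + Y.val.
// On entry the top two entries of the value stack are X.val and Y.val.
void reduceAdd(std::vector<int>& valueStack) {
    assert(valueStack.size() >= 2);
    const int y = valueStack.back();   // Y.val is on top
    valueStack.pop_back();
    const int x = valueStack.back();   // X.val is just below it
    valueStack.pop_back();
    valueStack.push_back(x + y);       // push A.val in their place
}
```

Entries below the top two belong to symbols that are not part of this reduction and are left untouched.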

Evaluating Attributes:

- S attributed definitions are a very special case of attribute grammars

- The most general method is to construct an ordering from the parse tree itself: Define a

graph as follows. For each attribute E.a to be computed add a node in the graph. If E.a

depends on E1.a1,...,En.an then add directed edges from Ei.ai to E.a for

1 <= i <= n.

A topological sort of the graph is any ordering n1,...,nk of the nodes such that edges of

the graph are all from left-to-right in the ordering; i.e., a node appears in the ordering after all of the

nodes it depends on. Any topological sort is a legal evaluation order of the attributes.

Note: for the topological sort to make sense there can be no cycles in the graph.

- It is possible to make sense of even cyclically defined attributes if they are treated as recursive

definitions
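One way to obtain such an ordering is sketched below, under the assumption that attributes are numbered 0 .. n-1 and that dependsOn[v] lists the attributes that attribute v depends on. The function name evaluationOrder is hypothetical. The sketch repeatedly emits an attribute whose dependencies have all been emitted already, and returns an empty order when a cycle makes that impossible.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Returns a legal evaluation order for the attributes, or an empty vector
// when the dependency graph contains a cycle.
std::vector<int> evaluationOrder(const std::vector<std::vector<int>>& dependsOn) {
    const std::size_t n = dependsOn.size();
    std::vector<std::vector<int>> users(n);   // users[d]: attributes that need d
    std::vector<std::size_t> pending(n, 0);   // dependencies not yet evaluated
    for (std::size_t v = 0; v < n; ++v) {
        pending[v] = dependsOn[v].size();
        for (int d : dependsOn[v]) {
            assert(d >= 0 && static_cast<std::size_t>(d) < n);
            users[static_cast<std::size_t>(d)].push_back(static_cast<int>(v));
        }
    }
    std::queue<int> ready;
    for (std::size_t v = 0; v < n; ++v) {
        if (pending[v] == 0) ready.push(static_cast<int>(v));
    }
    std::vector<int> order;
    while (!ready.empty()) {
        const int v = ready.front();
        ready.pop();
        order.push_back(v);
        for (int u : users[static_cast<std::size_t>(v)]) {
            if (--pending[static_cast<std::size_t>(u)] == 0) ready.push(u);
        }
    }
    if (order.size() != n) order.clear();     // some attributes lie on a cycle
    return order;
}
```

For example, with attribute 1 depending on 0 and attribute 2 depending on 0 and 1, the order returned is 0, 1, 2.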

- In practice, computing all of the attribute dependencies from the AST is rarely, if ever,

used. Instead, special cases of syntax-directed definitions are used where the attribute

evaluation order can be determined once and for all from the actions.

- The most important special case is S-attributed grammars: grammars with only

synthesized attributes. Building an AST is an example of an S-attributed grammar (i.e., PA3).

These attributes can be evaluated bottom-up during parsing.

Testing For Circularity:

- If an attribute grammar has a dependence cycle among attributes in some parse tree, then

the attribute grammar is said to be circular.

- Circular attribute grammars are considered meaningless---that is, erroneous.

- It is possible to check whether a given attribute grammar is circular.

Input:

Identifiers from the input in a symbol table and other relevant information about the identifiers

Output:

Instructions:

For the for statement and the if and if-else statements, as per the syntax of C or Pascal, generate

equivalent three address code for the given input made up of the constructs mentioned above, using LEX

and YACC. Write code to store the identifiers from the input in a symbol table, to record

other relevant information about the identifiers, and to display the records

stored in the symbol table.


Conclusion:

variables

Questions:


Assignment No: 6

To learn and use graphics for clustering in C++.

Theory:

In statistics and machine learning, k-means clustering is a method of cluster analysis which

aims to partition n observations into k clusters in which each observation belongs to the cluster

with the nearest mean.

Algorithm:

Regarding computational complexity, the k-means clustering problem is:

- NP-hard in general Euclidean space of dimension d, even for 2 clusters
- NP-hard for a general number of clusters k, even in the plane
- If k and d are fixed, the problem can be solved exactly in time $O(n^{dk+1} \log n)$, where n is the number of entities to be clustered

Standard algorithm

The most common algorithm uses an iterative refinement technique. Due to its ubiquity it is

often called the k-means algorithm.

Given an initial set of k means $m_1, \ldots, m_k$, which may be specified randomly or by some

heuristic, the algorithm proceeds by alternating between two steps:

Assignment step: Assign each observation to the cluster with the closest mean (i.e. partition

the observations according to the Voronoi diagram generated by the means).

Update step: Calculate the new means to be the centroid of the observations in the cluster.
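The two steps can also be written as formulas. The notation here is an assumption introduced for illustration: $S_i^{(t)}$ is the set of observations assigned to cluster i at iteration t, and $x_p$ is an observation. The formulas use squared Euclidean distance, which is the usual choice; the worked example later in this assignment uses a different distance function.

```latex
% Assignment step: each observation goes to the cluster with the nearest mean
S_i^{(t)} = \left\{ x_p : \lVert x_p - m_i^{(t)} \rVert^2 \le \lVert x_p - m_j^{(t)} \rVert^2 \ \text{for all } j,\ 1 \le j \le k \right\}

% Update step: each mean becomes the centroid of its cluster
m_i^{(t+1)} = \frac{1}{\lvert S_i^{(t)} \rvert} \sum_{x_j \in S_i^{(t)}} x_j
```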

Clustering allows for unsupervised learning. That is, the machine / software will learn on its own,

using the data (learning set), and will classify the objects into a particular class. For example, if

our class (decision) attribute is tumorType and its values are malignant, benign, etc., these will

be the classes. They will be represented by cluster1, cluster2, etc. However, the class information is

never provided to the algorithm. The class information can be used later on, to evaluate how

accurately the algorithm classified the objects.


Example:

Problem: Cluster the following eight points (with (x, y) representing locations) into three clusters

A1(2, 10) A2(2, 5) A3(8, 4) A4(5, 8) A5(7, 5) A6(6, 4) A7(1, 2) A8(4, 9). Initial cluster centers are:

A1(2, 10), A4(5, 8) and A7(1, 2). The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as:

$\rho(a, b) = |x_2 - x_1| + |y_2 - y_1|$

Use k-means algorithm to find the three cluster centers after the second iteration.

First we list all points in the first column of the table above. The initial cluster centers (means) are

(2, 10), (5, 8) and (1, 2) - chosen randomly. Next, we will calculate the distance from the first point

(2, 10) to each of the three means, by using the distance function:

point (x1, y1) = (2, 10), mean1 (x2, y2) = (2, 10):

$\rho(\text{point}, \text{mean1}) = |x_2 - x_1| + |y_2 - y_1| = |2 - 2| + |10 - 10| = 0 + 0 = 0$

point (x1, y1) = (2, 10), mean2 (x2, y2) = (5, 8):

$\rho(\text{point}, \text{mean2}) = |x_2 - x_1| + |y_2 - y_1| = |5 - 2| + |8 - 10| = 3 + 2 = 5$

point (x1, y1) = (2, 10), mean3 (x2, y2) = (1, 2):

$\rho(\text{point}, \text{mean3}) = |x_2 - x_1| + |y_2 - y_1| = |1 - 2| + |2 - 10| = 1 + 8 = 9$

So, we fill in these values in the table:


So, which cluster should the point (2, 10) be placed in? The one, where the point has the shortest

distance to the mean that is mean 1 (cluster 1), since the distance is 0.

Cluster 1    Cluster 2    Cluster 3
(2, 10)

So, we go to the second point (2, 5) and we will calculate the distance to each of the three means,

by using the distance function:

point (x1, y1) = (2, 5), mean1 (x2, y2) = (2, 10):

$\rho(\text{point}, \text{mean1}) = |x_2 - x_1| + |y_2 - y_1| = |2 - 2| + |10 - 5| = 0 + 5 = 5$

point (x1, y1) = (2, 5), mean2 (x2, y2) = (5, 8):

$\rho(\text{point}, \text{mean2}) = |x_2 - x_1| + |y_2 - y_1| = |5 - 2| + |8 - 5| = 3 + 3 = 6$

point (x1, y1) = (2, 5), mean3 (x2, y2) = (1, 2):

$\rho(\text{point}, \text{mean3}) = |x_2 - x_1| + |y_2 - y_1| = |1 - 2| + |2 - 5| = 1 + 3 = 4$

So, we fill in these values in the table:


So, which cluster should the point (2, 5) be placed in? The one where the point has the shortest
distance to the mean, that is mean 3 (cluster 3), since the distance 4 is the smallest of the three.

Cluster 1    Cluster 2    Cluster 3
(2, 10)                   (2, 5)

Analogously, we fill in the rest of the table and place each point in one of the clusters.

Iteration 1:

Cluster 1    Cluster 2    Cluster 3
(2, 10)      (8, 4)       (2, 5)
             (5, 8)       (1, 2)
             (7, 5)
             (6, 4)
             (4, 9)

Next, we need to re-compute the new cluster centers (means). We do so, by taking the mean of all

points in each cluster.

For Cluster 1, we only have one point A1(2, 10), which was the old mean, so the cluster center

remains the same.

For Cluster 2, we have ( (8+5+7+6+4)/5, (4+8+5+4+9)/5 ) = (6, 6).

For Cluster 3, we have ( (2+1)/2, (5+2)/2 ) = (1.5, 3.5).


The initial cluster centers are shown as red dots. The new cluster centers are shown as red x marks.

That was Iteration1 (epoch1). Next, we go to Iteration2 (epoch2), Iteration3, and so on until

the means do not change anymore.

In Iteration2, we basically repeat the process from Iteration1 this time using the new means

we computed.
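The worked example can be checked with a short program. The following is a Python sketch under stated assumptions: the helper names `manhattan` and `kmeans_step` are hypothetical, the distance is the |x2 - x1| + |y2 - y1| function defined in the problem, and an empty cluster simply keeps its old mean (a choice made for this sketch; the example never produces an empty cluster).

```python
def manhattan(a, b):
    """Distance function from the example: |x2 - x1| + |y2 - y1|."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def kmeans_step(points, means):
    """One iteration: assignment step followed by update step."""
    clusters = [[] for _ in means]
    # Assignment step: put each point in the cluster of its nearest mean.
    for p in points:
        nearest = min(range(len(means)), key=lambda i: manhattan(p, means[i]))
        clusters[nearest].append(p)
    # Update step: each new mean is the centroid of its cluster.
    new_means = []
    for old_mean, members in zip(means, clusters):
        if members:
            cx = sum(x for x, _ in members) / len(members)
            cy = sum(y for _, y in members) / len(members)
            new_means.append((cx, cy))
        else:
            new_means.append(old_mean)   # assumption: keep the old mean
    return clusters, new_means

points = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]
means = [(2, 10), (5, 8), (1, 2)]          # A1, A4 and A7

clusters1, means1 = kmeans_step(points, means)
clusters2, means2 = kmeans_step(points, means1)
print(means1)   # [(2.0, 10.0), (6.0, 6.0), (1.5, 3.5)]
print(means2)   # [(3.0, 9.5), (6.5, 5.25), (1.5, 3.5)]
```

The first iteration reproduces the centers (2, 10), (6, 6) and (1.5, 3.5) computed above. In the second iteration the point (4, 9) moves to cluster 1, which gives the centers (3, 9.5), (6.5, 5.25) and (1.5, 3.5).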


Conclusion:

FAQs :


GROUP B: ASSIGNMENTS

( any 6 Assignments)

Assignment No: 07

Aim:


Objective:

4. To understand the rules for generating target code when three-address code is provided as input.

Theory:

Code generation is the final phase of a compiler. It is the process of creating low-level (assembly language or machine) code from three-address code (generated by the intermediate code generation phase) or from optimized three-address code (produced by the code optimizer phase).

[Figure: compiler phases. Source Program -> Front End -> Code Optimization -> Code Generator -> Assembly Code, with the Symbol Table.]

Read the expression in the form operator, operand1, operand2 and generate code using the following algorithm.

Gen_Code(operator, operand1, operand2)
{
    // Case 1: operand1 is already in register R0
    if (operand1.addressmode == R)
    {
        if (operator == '+')      Generate("ADD operand2,R0");
        else if (operator == '-') Generate("SUB operand2,R0");
        else if (operator == '*') Generate("MUL operand2,R0");
        else if (operator == '/') Generate("DIV operand2,R0");
    }
    // Case 2: operand2 is already in register R0
    else if (operand2.addressmode == R)
    {
        if (operator == '+')      Generate("ADD operand1,R0");
        else if (operator == '-') Generate("SUB operand1,R0");
        else if (operator == '*') Generate("MUL operand1,R0");
        else if (operator == '/') Generate("DIV operand1,R0");
    }
    // Case 3: neither operand is in a register
    else
    {
        // Load operand1 first; the example that follows starts each such statement with a MOV.
        Generate("MOV operand1,R0");
        if (operator == '+')      Generate("ADD operand2,R0");
        else if (operator == '-') Generate("SUB operand2,R0");
        else if (operator == '*') Generate("MUL operand2,R0");
        else if (operator == '/') Generate("DIV operand2,R0");
    }
}


Example:

X:= (a+b)*(c-d)+((e/f)*(a+b))

t1:=a+b

t2:=c-d

t3:=e/f

t4:=t1*t2

t5:=t3*t1

t6:=t4+t5

Using the simple code generation algorithm, the target code sequence can be generated as follows:

Statement     Target code    Register descriptor    Address descriptor
                             Registers empty
t1:=a+b       MOV a,R0       R0 contains a
              ADD b,R0       R0 contains t1         t1 in R0
t2:=c-d       MOV c,R1       R1 contains c
              SUB d,R1       R1 contains t2         t2 in R1
t3:=e/f       MOV e,R2       R2 contains e
              DIV f,R2       R2 contains t3         t3 in R2
t4:=t1*t2     MUL R0,R1      R0 contains t1
                             R1 contains t4         t4 in R1
t5:=t3*t1     MUL R2,R0      R2 contains t3
                             R0 contains t5         t5 in R0
t6:=t4+t5     ADD R1,R0      R1 contains t4
                             R0 contains t6         t6 in R0
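The bookkeeping in the example can be sketched in code. This is a minimal Python sketch, not the manual's Gen_Code: the function name `generate_code`, the tuple format for statements, and the next-use check are assumptions made for illustration. Instructions are written as OP source,destination, meaning destination := destination OP source. Because the sketch chooses registers by its own rule, the last two instructions use different registers from the table above while computing the same values.

```python
OPCODES = {"+": "ADD", "-": "SUB", "*": "MUL", "/": "DIV"}
COMMUTATIVE = {"+", "*"}

def generate_code(statements, num_registers=3):
    """statements: list of (dst, left, op, right) three-address tuples."""
    in_reg = {}                                   # name -> register holding it
    free = [f"R{i}" for i in range(num_registers)]
    code = []

    def used_later(name, index):
        return any(name in (l, r) for _, l, _, r in statements[index + 1:])

    def loc(name):
        # A name is addressed by its register if it has one, else by memory name.
        return in_reg.get(name, name)

    for i, (dst, left, op, right) in enumerate(statements):
        if left in in_reg and not used_later(left, i):
            reg, other = in_reg[left], right      # overwrite dead left operand
        elif op in COMMUTATIVE and right in in_reg and not used_later(right, i):
            reg, other = in_reg[right], left      # overwrite dead right operand
        else:
            reg, other = free.pop(0), right       # IndexError if none is free
            code.append(f"MOV {loc(left)},{reg}")
        code.append(f"{OPCODES[op]} {loc(other)},{reg}")

        # Forget operands that are never used again and release their registers.
        for name in (left, right):
            if name in in_reg and not used_later(name, i):
                released = in_reg.pop(name)
                if released != reg:
                    free.append(released)
        in_reg[dst] = reg
    return code

tac = [("t1", "a", "+", "b"), ("t2", "c", "-", "d"), ("t3", "e", "/", "f"),
       ("t4", "t1", "*", "t2"), ("t5", "t3", "*", "t1"), ("t6", "t4", "+", "t5")]
print("\n".join(generate_code(tac)))
```

The sketch does not spill registers to memory, so it fails when more values are live than there are registers.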


Conclusion :

Thus we have studied how to generate target code from optimized three-address code.

Questions:

1. What is a compiler?

4. What is Ambiguity?

5. Explain the difference between the target code and intermediate code?


Assignment No: 8

Aim: Write a LEX and YACC program to generate abstract syntax tree.

Objective:

To understand working of Code Generation Phase of Compiler

Theory:

The purpose of this lab is to create and print an abstract syntax tree for a C program. The C program

will use only a small subset of the grammar.

As an example of a syntax tree, consider the statement tri_area = (base * height)/2; The root node is an assignment operation. Its left subtree is a pointer to tri_area. Its right subtree represents the expression (base * height)/2. The tree looks like the tree in Figure


ASSIGN INT
    ID PTR|INT value = "tri_area"
    DIVIDE INT
        TIMES INT
            DEREF INT
                ID PTR|INT value = "base"
            DEREF INT
                ID PTR|INT value = "height"
        NUM INT value = 2

In this display, each node is followed by its left subtree and then its right subtree, indented one

tab stop. Notice that base and height are dereferenced, but tri_area isn't. That will be explained

next.

Tree Nodes and the Tree Node Class

A tree node will be implemented by the Tree Node class. If a tree node is an interior node, then it

will contain an operator that acts on the left and right subtrees. The operator will have a mode,

which will be the data type involved in the operation. For example, if the mode of an assignment

operator is INT, then the operator will assign an int to an int. If a tree node is an exterior (leaf)

node, then it will contain an object, which will be an identifier or a number (and later a string). The

mode of an exterior node will be the kind of object stored in that node. For example, if the object is

an integer variable (l-value), then the mode will be a pointer to an INT.

If the object is an integer constant, then the mode will be INT. Open the file TreeNode.java. This file defines the TreeNode class whose objects have the following attributes: the operation (oper) represented by the node, the mode (mode) of the operation, a reference to the left subtree (left), a reference to the right subtree (right), the identifier (id) represented by the node, the number (num) represented by the node, and the string (str) represented by the node.

If the node is a binary interior node, then left and right will be non-null, and id, num, and str will be undefined. On the other hand, if the node is an exterior node, then left and right will be null, while exactly one of id, num, and str will be defined, depending on the kind of exterior node. From

time to time, we will have unary interior nodes. They will always use the left subtree rather than the

right subtree.

Note the types of the data members oper, mode, left, right, id, num, and str. Also, one constructor

    public TreeNode(IdEntry i)

and the toString() function have been defined. You will define three additional constructors. First, define the default constructor:

    public TreeNode()

It should set oper, mode, and num to 0 and left, right, id, and str to null. Next, define the following constructor.

    public TreeNode(int op, int m, TreeNode l, TreeNode r)

The purpose of this constructor is to join together two existing trees, with root nodes l and r, as the left and right subtrees of a new tree with this node as its root node. In the root node, the value of oper should be op and the value of mode should be m. Finally, define the constructor

    public TreeNode(int n)

It will create a node that represents a number. The member oper should be Ops.NUM, mode should be Ops.INT, and num should be the value of n. Write these constructors. We will use

these constructors later in this lab.
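To make the node layout and the indented display concrete, here is a small Python sketch. It is not the lab's TreeNode.java: the single `value` field (standing in for id, num and str), the string operator names, and the helpers `lines` and `deref` are simplifications assumed for illustration.

```python
class TreeNode:
    """AST node: interior nodes have children, leaf nodes carry a value."""

    def __init__(self, oper, mode, left=None, right=None, value=None):
        self.oper = oper      # operation, e.g. "ASSIGN" or "NUM"
        self.mode = mode      # data type involved, e.g. "INT" or "PTR|INT"
        self.left = left      # left subtree (also used by unary nodes)
        self.right = right    # right subtree
        self.value = value    # identifier name or number for leaf nodes

    def lines(self, depth=0):
        """Node first, then left and right subtrees, indented one tab stop."""
        text = "\t" * depth + f"{self.oper} {self.mode}"
        if self.value is not None:
            shown = f'"{self.value}"' if isinstance(self.value, str) else self.value
            text += f" value = {shown}"
        result = [text]
        for child in (self.left, self.right):
            if child is not None:
                result.extend(child.lines(depth + 1))
        return result

def deref(name):
    """Unary DEREF node over an identifier; uses the left subtree only."""
    return TreeNode("DEREF", "INT", TreeNode("ID", "PTR|INT", value=name))

# tri_area = (base * height)/2;
tree = TreeNode(
    "ASSIGN", "INT",
    TreeNode("ID", "PTR|INT", value="tri_area"),
    TreeNode(
        "DIVIDE", "INT",
        TreeNode("TIMES", "INT", deref("base"), deref("height")),
        TreeNode("NUM", "INT", value=2),
    ),
)
print("\n".join(tree.lines()))
```

Printing the tree gives the nine-line display shown earlier for tri_area = (base * height)/2.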

Yacc is a tool for building syntax analyzers, also known as parsers. Yacc has been used to

implement hundreds of languages. Its applications range from small desk calculators, to medium-

sized preprocessors for typesetting, to large compiler front ends for complete programming

languages.

A yacc specification is based on a collection of grammar rules that describe the syntax of a

language; yacc turns the specification into a syntax analyzer. A pure syntax analyzer merely checks

whether or not an input string conforms to the syntax of the language.


Algorithm:

Step1: Start

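Step2: Put the C declarations, such as the header file, in the declarations section: %{ #include <ctype.h> %}

Step3: Declare the token: %token DIGIT

Step4: After the first %%, define the translation rules for line, expr, term and factor:

    line   : expr '\n'          { printf("\n%d\n", $1); }
           ;
    expr   : expr '+' term      { $$ = $1 + $3; }
           | term
           ;
    term   : term '*' factor    { $$ = $1 * $3; }
           | factor
           ;
    factor : '(' expr ')'       { $$ = $2; }
           | DIGIT
           ;
    %%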

Step5: define the supporting C routines

Step6: Stop

Conclusion:

FAQs

1. What is AST?

2. What is the need of AST?

3. Which phase of compiler generates AST?

4. What are the applications of AST in compiler?


Assignment No: 9

Objective:

To develop a recursive-descent parser for a given grammar.

To generate a syntax tree as an output of the parser

To handle syntax errors.

Theory:

A recursive descent parser is a kind of top-down parser built from a set of mutually-recursive

procedures (or a non-recursive equivalent) where each such procedure usually implements one

of the production rules of the grammar. Thus the structure of the resulting program closely

mirrors that of the grammar it recognizes.

This parser attempts to verify that the syntax of the input stream is correct as it is read from left

to right. A basic operation necessary for this involves reading characters from the input stream

and matching them with terminals from the grammar that describes the syntax of the input. Our

recursive descent parsers will look ahead one character and advance the input stream reading

pointer when proper matches occur. What a recursive descent parser actually does is to perform

a depth-first search of the derivation tree for the string being parsed. This provides the 'descent'


portion of the name. The 'recursive' portion comes from the parser's form, a collection of

recursive procedures.

As our first example, consider the simple grammar

E -> x + T
T -> ( E )
T -> x

and the derivation tree in figure 2 for the expression x+(x+x)

A recursive descent parser traverses the tree by first calling a procedure to recognize an E. This

procedure reads an 'x' and a '+' and then calls a procedure to recognize a T. This would look like

the following routine.

Procedure E()
Begin
    If (input_symbol = 'x') then
    Begin
        next();
        If (input_symbol = '+') then
        Begin
            next();
            T();
        End
        Else
            Errorhandler();
    End
    Else
        Errorhandler();
END

Note that the 'next' looks ahead and always provides the next character that will be read from

the input stream. This feature is essential if we wish our parsers to be able to predict what is due

to arrive as input. Note that 'errorhandler' is a procedure that notifies the user that a syntax error

has been made and then possibly terminates execution.

In order to recognize a T, the parser must figure out which of the productions to execute. This is

not difficult and is done in the procedure that appears below.

Procedure T()
Begin
    If (input_symbol = '(') then
    Begin
        next();
        E();
        If (input_symbol = ')') then
            next();
        Else
            Errorhandler();
    End
    Else If (input_symbol = 'x') then
        next();
    Else
        Errorhandler();
END

In the above routine, the parser determines whether T had the form (E) or x. If not then the

error routine was called, otherwise the appropriate terminals and nonterminals were recognized.

Algorithm:

1. Make grammar suitable for parsing i.e. remove left recursion (if required).

2. Write a function for each production with error handler.

3. Given input is said to be valid if input is scanned completely and no error function is called.
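The steps above can be sketched for the example grammar E -> x + T, T -> ( E ) | x. This is an illustrative Python sketch: the class and function names (`Parser`, `ParseError`, `is_valid`) are assumptions, and raising an exception plays the role of the error handler.

```python
class ParseError(Exception):
    """Raised where the pseudocode would call Errorhandler()."""

class Parser:
    def __init__(self, text):
        self.text = text
        self.pos = 0                      # input stream reading pointer

    def peek(self):
        """One-character lookahead; None at end of input."""
        return self.text[self.pos] if self.pos < len(self.text) else None

    def match(self, expected):
        """Consume the lookahead if it equals the expected terminal."""
        if self.peek() != expected:
            raise ParseError(f"expected {expected!r} at position {self.pos}")
        self.pos += 1

    def parse_E(self):
        # E -> x + T
        self.match("x")
        self.match("+")
        self.parse_T()

    def parse_T(self):
        # T -> ( E ) | x, chosen by looking at the next character
        if self.peek() == "(":
            self.match("(")
            self.parse_E()
            self.match(")")
        else:
            self.match("x")

def is_valid(text):
    """Valid if E is recognized and the whole input is consumed."""
    parser = Parser(text)
    try:
        parser.parse_E()
    except ParseError:
        return False
    return parser.pos == len(text)

print(is_valid("x+(x+x)"))   # True
print(is_valid("x+"))        # False
```

This grammar has no left recursion, so step 1 needs no work here. Note that "x+x+x" is rejected because E derives exactly one '+' outside parentheses, which leaves "+x" unread.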

Conclusion:

FAQs:

1. What do you mean by Recursive Descent Parsing?

2. What are the applications of a recursive descent parser?

3. What are the advantages of a recursive descent parser?

Assignment No: 10

Title: Implement Apriori approach for data mining to organize the data items on a shelf.

Aim: Write a program to implement Apriori algorithm.

Objective:

To find frequent itemsets and associations between different itemsets, i.e. association rules.

Theory:

Association rule mining is defined as follows. Let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of n binary attributes called items. Let $D = \{t_1, t_2, \ldots, t_m\}$ be a set of transactions called the database. Each transaction in D has a unique transaction ID and contains a subset of the items in I. A rule is defined as an implication of the form $X \Rightarrow Y$ where $X, Y \subseteq I$ and $X \cap Y = \emptyset$. The sets of items (for short itemsets) X and Y are called antecedent (left-hand-side or LHS) and consequent (right-hand-side or RHS) of the rule respectively.


To illustrate the concepts, we use a small example from the supermarket domain. The set of

items is I = {milk,bread,butter,beer} and a small database containing the items (1 codes

presence and 0 absence of an item in a transaction) is shown in the table to the right. An

example rule for the supermarket could be $\{\text{milk}, \text{bread}\} \Rightarrow \{\text{butter}\}$, meaning that if milk and bread are bought, customers

also buy butter.

To select interesting rules from the set of all possible rules, constraints on various measures of

significance and interest can be used. The best-known constraints are minimum thresholds

on support and confidence. The support supp(X) of an itemset X is defined as the proportion of

transactions in the data set which contain the itemset. In the example database, the itemset

{milk,bread} has a support of 2 / 5 = 0.4 since it occurs in 40% of all transactions (2 out of 5

transactions).

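The confidence of a rule is defined as $\mathrm{conf}(X \Rightarrow Y) = \mathrm{supp}(X \cup Y) / \mathrm{supp}(X)$. For example, the rule $\{\text{milk}, \text{bread}\} \Rightarrow \{\text{butter}\}$ has a confidence of 0.2 / 0.4 = 0.5 in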

the database, which means that for 50% of the transactions containing milk and bread the rule is

correct. Confidence can be interpreted as an estimate of the probability P(Y | X), the probability

of finding the RHS of the rule in transactions under the condition that these transactions also

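contain the LHS.

The lift of a rule is defined as $\mathrm{lift}(X \Rightarrow Y) = \dfrac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X) \times \mathrm{supp}(Y)}$, or the ratio of the observed confidence to that expected by chance. The rule $\{\text{milk}, \text{bread}\} \Rightarrow \{\text{butter}\}$ has a lift of 0.2 / (0.4 × 0.4) = 1.25.

The conviction of a rule is defined as $\mathrm{conv}(X \Rightarrow Y) = \dfrac{1 - \mathrm{supp}(Y)}{1 - \mathrm{conf}(X \Rightarrow Y)}$. The rule $\{\text{milk}, \text{bread}\} \Rightarrow \{\text{butter}\}$ has a conviction of (1 - 0.4) / (1 - 0.5) = 1.2, and can be interpreted as the ratio of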

the expected frequency that X occurs without Y (that is to say, the frequency that the rule makes an

incorrect prediction) if X and Y were independent divided by the observed frequency of incorrect

predictions. In this example, the conviction value of 1.2 shows that the rule would be incorrect 20%

more often (1.2 times as often) if the association between X and Y was purely random chance.
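The four measures can be computed directly from a transaction database. The sketch below is illustrative Python: the five transactions in `DATABASE` are an assumption, chosen to be consistent with the figures quoted in the text (support 0.4 for {milk, bread}, confidence 0.5, conviction 1.2), because the original table is not reproduced here. Exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction

# Assumed database: five transactions consistent with the numbers in the text.
DATABASE = [
    {"milk", "bread"},
    {"butter"},
    {"beer"},
    {"milk", "bread", "butter"},
    {"bread"},
]

def support(itemset, database=DATABASE):
    """Proportion of transactions that contain every item of the itemset."""
    hits = sum(1 for transaction in database if itemset <= transaction)
    return Fraction(hits, len(database))

def confidence(lhs, rhs, database=DATABASE):
    """supp(X u Y) / supp(X)."""
    return support(lhs | rhs, database) / support(lhs, database)

def lift(lhs, rhs, database=DATABASE):
    """supp(X u Y) / (supp(X) * supp(Y))."""
    return support(lhs | rhs, database) / (
        support(lhs, database) * support(rhs, database))

def conviction(lhs, rhs, database=DATABASE):
    """(1 - supp(Y)) / (1 - conf(X => Y)); undefined when confidence is 1."""
    return (1 - support(rhs, database)) / (1 - confidence(lhs, rhs, database))

X, Y = {"milk", "bread"}, {"butter"}
print(support(X), confidence(X, Y), lift(X, Y), conviction(X, Y))
# 2/5 1/2 5/4 6/5
```

The printed values 2/5, 1/2, 5/4 and 6/5 correspond to 0.4, 0.5, 1.25 and 1.2.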

Association rules are required to satisfy a user-specified minimum support and a user-specified

minimum confidence at the same time. To achieve this, association rule generation is a two-step

process. First, minimum support is applied to find all frequent itemsets in a database. In a second

step, these frequent itemsets and the minimum confidence constraint are used to form rules. While

the second step is straight forward, the first step needs more attention.

Many algorithms for generating association rules were presented over time.

Some well known algorithms are Apriori, Eclat and FP-Growth, but they only do half the job, since

they are algorithms for mining frequent itemsets. Another step needs to be done after to generate

rules from frequent itemsets found in a database.


Algorithm:

Association rule mining is to find out association rules that satisfy the predefined minimum support

and confidence from a given database. The problem is usually decomposed into two sub problems.

One is to find those item sets whose occurrences exceed a predefined threshold in the database;

those item sets are called frequent or large itemsets. The second problem is to generate association

rules from those large itemsets with the constraints of minimal confidence.

Suppose one of the large itemsets is $L_k = \{I_1, I_2, \ldots, I_k\}$. Association rules with this itemset are generated in the following way: the first rule is $\{I_1, I_2, \ldots, I_{k-1}\} \Rightarrow \{I_k\}$; by checking the confidence, this rule can be determined to be interesting or not. Then other rules are generated by deleting the last item in the antecedent and inserting it into the consequent, and the confidences of the new rules are checked to determine their interestingness. This process iterates until the antecedent becomes empty. Since the second subproblem is quite straightforward, most research focuses on the first subproblem. The Apriori algorithm finds the frequent sets L in database D.

- Find frequent set $L_{k-1}$.
- Join step: $C_k$ is generated by joining $L_{k-1}$ with itself.
- Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset, hence should be removed.

where

$C_k$: candidate itemset of size k

$L_k$: frequent itemset of size k

Example:

A large supermarket tracks sales data by SKU (item), and thus is able to know what items are

typically purchased together. Apriori is a moderately efficient way to build a list of frequent

purchased item pairs from this data. Let the database of transactions consist of the sets

T1: {1,2,3,4},
T2: {2,3,4},
T3: {2,3},
T4: {1,2,4},
T5: {1,2,3,4}, and
T6: {2,4}.

Each number corresponds to a product such as "butter" or "water". The first step of Apriori is to count

up the frequencies, called the supports, of each member item separately:

We can define a minimum support level to qualify as "frequent," which depends on the context. For

this case, let min support = 3. Therefore, all are frequent. The next step is to generate a list of all 2-

pairs of the frequent items. Had any of the above items not been frequent, they wouldn't have been

included as a possible member of possible 2-item pairs

In this way, Apriori prunes the tree of all possible sets.

This is counting up the occurrences of each of those pairs in the database. Since minsup=3, we don't

need to generate 3-sets involving {1,3}. This is due to the fact that since they're not frequent, no

supersets of them can possibly be frequent. Keep going
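The join and prune steps can be sketched on the six transactions of this example. This is an illustrative Python sketch, not a tuned implementation: the function name `apriori` and the use of absolute counts for the minimum support are assumptions that follow the example's min support = 3.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: count} for all itemsets with count >= min_support."""
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]   # candidates C1
    frequent = {}
    k = 1
    while current:
        # Count each candidate and keep the frequent ones (Lk).
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        current = [c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent

transactions = [{1, 2, 3, 4}, {2, 3, 4}, {2, 3}, {1, 2, 4}, {1, 2, 3, 4}, {2, 4}]
for itemset, count in sorted(apriori(transactions, 3).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With min support 3, all four single items are frequent, the pair {1,3} is dropped with a count of 2, and the frequent 3-item sets are {1,2,4} and {2,3,4}, each with a count of 3. No 4-item candidate survives the prune step.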


Conclusion:

FAQs :

2].Give few techniques to improve the efficiency of Apriori algorithm.

Assignment No:11


Title: Using any similarity based techniques develop an application to classify text data. Perform

tasks as per requirement.

Prerequisites: Classification techniques such as k-nearest neighbor, SVM, decision learning and rule learning

Objectives:

Theory:

We need to check the accuracy of a system when it retrieves a number of documents on the basis of

user's input. Let the set of documents relevant to a query be denoted as {Relevant} and the set of

retrieved document as {Retrieved}. The set of documents that are relevant and retrieved can be

denoted as {Relevant} ∩ {Retrieved}. This can be shown in the form of a Venn diagram as follows.

There are three fundamental measures for assessing the quality of text retrieval

Precision

Recall

F-score

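Precision

Precision is the percentage of retrieved documents that are in fact relevant to the query. Precision can be defined as

$\text{Precision} = \dfrac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Retrieved}\}|}$

Recall

Recall is the percentage of documents that are relevant to the query and were in fact retrieved. Recall is defined as

$\text{Recall} = \dfrac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Relevant}\}|}$

F-score

F-score is the commonly used trade-off. The information retrieval system often needs to trade off recall for precision or vice versa. F-score is defined as the harmonic mean of recall and precision as follows

$\text{F-score} = \dfrac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$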
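The three measures follow directly from the two sets. A minimal Python sketch, assuming documents are identified by hashable IDs; the function name `retrieval_scores` and the convention of returning 0.0 for an empty denominator are choices made for this illustration.

```python
def retrieval_scores(relevant, retrieved):
    """Return (precision, recall, f_score) for two sets of document IDs."""
    hits = len(relevant & retrieved)            # relevant and retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d3", "d4", "d5", "d6"}
print(retrieval_scores(relevant, retrieved))    # (0.5, 0.5, 0.5)
```

Here two of the four retrieved documents are relevant, and two of the four relevant documents were retrieved, so all three scores are 0.5.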

The World Wide Web contains huge amounts of information that provides a rich source for data

mining.

The web poses great challenges for resource and knowledge discovery based on the following

observations

- The web is too huge: The size of the web is very large and rapidly increasing. It seems that the web is too huge for data warehousing and data mining.

- Complexity of web pages: Web pages do not have a unifying structure. They are very complex compared to traditional text documents. There is a huge number of documents in the digital library of the web, and these libraries are not arranged in any particular sorted order.

- The web is a dynamic information source: The information on the web is rapidly updated. Data such as news, stock markets, weather, sports, shopping, etc., are regularly updated.

- Diversity of user communities: The user community on the web is rapidly expanding. These users have different backgrounds, interests, and usage purposes. There are more than 100 million workstations connected to the Internet, and the number is still rapidly increasing.

- Relevancy of information: A particular person is generally interested in only a small portion of the web, while the rest of the web contains information that is not relevant to the user and may swamp the desired results.

The basic structure of the web page is based on the Document Object Model (DOM). The DOM

structure refers to a tree like structure where the HTML tag in the page corresponds to a node in the

DOM tree. We can segment the web page by using predefined tags in HTML. The HTML syntax is flexible; therefore, many web pages do not follow the W3C specifications. Not following the specifications of W3C may cause errors in the DOM tree structure.

The DOM structure was initially introduced for presentation in the browser and not for description

of semantic structure of the web page. The DOM structure cannot correctly identify the semantic

relationship between the different parts of a web page.

The purpose of VIPS (Vision-based Page Segmentation) is to extract the semantic structure of a web page based on its visual

presentation.

Such a semantic structure corresponds to a tree structure. In this tree each node corresponds to

a block.

A value is assigned to each node. This value is called the Degree of Coherence. This value is

assigned to indicate the coherent content in the block based on visual perception.

The VIPS algorithm first extracts all the suitable blocks from the HTML DOM tree. After that

it finds the separators between these blocks.

The separators refer to the horizontal or vertical lines in a web page that visually cross with

no blocks.

The semantics of the web page is constructed on the basis of these blocks.


Data mining is widely used in diverse areas. There are a number of commercial data mining systems

available today and yet there are many challenges in this field. In this tutorial, we will discuss the

applications and the trend of data mining.

Here is the list of areas where data mining is widely used

Retail Industry

Telecommunication Industry

Biological Data Analysis

Other Scientific Applications

Intrusion Detection

Loan payment prediction and customer credit policy analysis.


FAQS

1] What are different techniques used for classification of text data.


Assignment No:12

Prerequisites:

Knowledge of K-NN approach.

Objectives:

To learn the concept of K-NN approach with suitable example.

To implement K-NN approach.

Theory:

K-NN approach

In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest

neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to

the class of that single nearest neighbor.

In k-NN regression, the output is the property value for the object. This value is the average of the

values of its k nearest neighbors. k-NN is a type of instance-based learning, or lazy learning, where

the function is only approximated locally and all computation is deferred until classification. The

k-NN algorithm is among the simplest of all machine learning algorithms.

Both for classification and regression, it can be useful to assign weight to the contributions of the

neighbors, so that the nearer neighbors contribute more to the average than the more distant ones.

For example, a common weighting scheme consists in giving each neighbor a weight of 1/d, where

d is the distance to the neighbor. The neighbors are taken from a set of objects for which the class

(for k-NN classification) or the object property value (for k-NN regression) is known. This can be

thought of as the training set for the algorithm, though no explicit training step is required.

A limitation of the k-NN algorithm is that it is sensitive to the local structure of the data. The

algorithm has nothing to do with and is not to be confused with k-means, another popular machine

learning technique.


Algorithm

The training examples are vectors in a multidimensional feature space, each with a class label.

The training phase of the algorithm consists only of storing the feature vectors and class labels of

the training samples. In the classification phase, k is a user-defined constant, and an unlabeled

vector (a query or test point) is classified by assigning the label which is most frequent among the

k training samples nearest to that query point.
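The classification phase described above can be sketched in a few lines. This is an illustrative Python sketch: the function name `knn_classify` and the sample data are assumptions, the distance is Euclidean, and ties in the vote are broken by whichever label `Counter` reports first.

```python
import math
from collections import Counter

def knn_classify(training, query, k=3):
    """Label a query point by majority vote among its k nearest samples.

    training: list of (feature_vector, label) pairs, stored as-is;
    "training" consists only of keeping this list.
    """
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

training = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B"),
]
print(knn_classify(training, (2, 2), k=3))   # A
print(knn_classify(training, (6, 5), k=3))   # B
```

Sorting all samples costs O(n log n) per query, which is acceptable for a lab-sized data set.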

A commonly used distance metric for continuous variables is Euclidean distance. For discrete

variables, such as for text classification, another metric can be used, such as the overlap metric (or

Hamming distance). In the context of gene expression microarray data, for example, k-NN has also

been employed with correlation coefficients such as Pearson and Spearman. Often, the

classification accuracy of k-NN can be improved significantly if the distance metric is learned with

specialized algorithms such as Large Margin Nearest Neighbor or Neighbourhood components

analysis.

A drawback of the basic "majority voting" classification occurs when the class distribution is

skewed. That is, examples of a more frequent class tend to dominate the prediction of the new

example, because they tend to be common among the k nearest neighbors due to their large

number. One way to overcome this problem is to weight the classification, taking into account the

distance from the test point to each of its k nearest neighbors. The class (or value, in regression

problems) of each of the k nearest points is multiplied by a weight proportional to the inverse of

the distance from that point to the test point. Another way to overcome skew is by abstraction in

data representation. For example in a self-organizing map (SOM), each node is a representative (a

center) of a cluster of similar points, regardless of their density in the original training data. K-NN

can then be applied to the SOM.
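The distance-weighted vote can be sketched as follows. The names `weighted_knn_classify` and `eps`, and the toy data, are assumptions for illustration; `eps` is only there to avoid dividing by zero when the query coincides with a training point.

```python
import math
from collections import defaultdict

def weighted_knn_classify(train, query, k, eps=1e-9):
    """Vote among the k nearest samples, weighting each vote by 1 / distance."""
    nearest = sorted(
        ((math.dist(x, query), label) for x, label in train),
        key=lambda pair: pair[0],
    )[:k]
    scores = defaultdict(float)
    for distance, label in nearest:
        scores[label] += 1.0 / (distance + eps)
    return max(scores, key=scores.get)

# Hypothetical skewed data: one "rare" sample close to the query,
# three "common" samples farther away.
skewed = [
    ((0.0, 0.0), "rare"),
    ((1.0, 0.0), "common"), ((0.0, 1.0), "common"), ((1.0, 1.0), "common"),
]
print(weighted_knn_classify(skewed, (0.1, 0.1), k=3))
```

With k = 3, an unweighted majority vote on this data would pick "common" (two votes to one), while the weighted vote favours the much closer "rare" sample.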

Parameter selection

The best choice of k depends upon the data; generally, larger values of k reduce the effect of noise

on the classification, but make boundaries between classes less distinct. A good k can be selected


by various heuristic techniques (see hyperparameter optimization). The special case where the

class is predicted to be the class of the closest training sample (i.e. when k = 1) is called the nearest

neighbor algorithm.

The accuracy of the k-NN algorithm can be severely degraded by the presence of noisy or

irrelevant features, or if the feature scales are not consistent with their importance. Much research

effort has been put into selecting or scaling features to improve classification. A particularly

popular approach is the use of evolutionary algorithms to optimize feature

scaling. Another popular approach is to scale features by the mutual information of the training data

with the training classes.
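To see why inconsistent feature scales matter, it helps to rescale every feature to a common range before computing distances. The sketch below uses plain min-max scaling, which is a much simpler technique than the evolutionary or mutual-information approaches mentioned above and is substituted here only for illustration; `min_max_scale` is a hypothetical helper name.

```python
def min_max_scale(vectors):
    """Rescale each feature (column) to the range [0, 1]."""
    columns = list(zip(*vectors))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    scaled = []
    for v in vectors:
        scaled.append(tuple(
            # A constant column carries no information; map it to 0.0.
            (x - lo) / (hi - lo) if hi > lo else 0.0
            for x, lo, hi in zip(v, lows, highs)
        ))
    return scaled

# Hypothetical data: the second feature would dominate Euclidean distance if left unscaled.
print(min_max_scale([(1.0, 100.0), (2.0, 300.0), (3.0, 200.0)]))
```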

In binary (two class) classification problems, it is helpful to choose k to be an odd number as this

avoids tied votes. One popular way of choosing the empirically optimal k in this setting is via

the bootstrap method.

Feature extraction

When the input data to an algorithm is too large to be processed and it is suspected to be

notoriously redundant (e.g. the same measurement in both feet and meters) then the input data will

be transformed into a reduced representation set of features (also called a feature vector).

Transforming the input data into the set of features is called feature extraction. If the features

extracted are carefully chosen, it is expected that the feature set will capture the relevant

information from the input data in order to perform the desired task using this reduced

representation instead of the full size input. Feature extraction is performed on raw data prior to

applying the k-NN algorithm on the transformed data in feature space.

Conclusion:

The k-NN approach was studied and implemented.


GROUP C: ASSIGNMENTS

(Any one)


Assignment No: 13

Title: Generate Huffman codes for a gray scale 8 bit image.

Prerequisites:

Knowledge of Huffman codes.

Objectives:

To generate Huffman codes for a gray scale 8 bit image.

Theory:

Huffman coding

Huffman coding is an algorithm developed by David A. Huffman while he was a Ph.D. student at MIT

and published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes".

The output from Huffman's algorithm can be viewed as a variable-length code table for encoding a source

symbol (such as a character in a file). The algorithm derives this table from the estimated probability or

frequency of occurrence (weight) for each possible value of the source symbol. As in other entropy

encoding methods, more common symbols are generally represented using fewer bits than less common

symbols. Huffman's method can be efficiently implemented, finding a code in time linear in the number of

input weights if these weights are sorted. However, although optimal among methods encoding symbols

separately, Huffman coding is not always optimal among all compression methods.
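For the 8-bit grayscale image of this assignment, the source symbols are the gray levels 0 to 255 and the weights are their pixel counts. The following is a minimal sketch of the construction using a priority queue, not a prescribed solution: the function name `huffman_codes`, the flattened-list image layout, and the 4x4 sample image are assumptions made for the example.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_codes(pixels):
    """Map each gray level that occurs in pixels to a prefix-free bit string."""
    freq = Counter(pixels)
    if len(freq) == 1:
        # Degenerate image with a single gray level: give it a 1-bit code.
        return {next(iter(freq)): "0"}
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(w, next(tiebreak), {sym: ""}) for sym, w in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; prefix one side with 0, the other with 1.
        w1, _, codes1 = heapq.heappop(heap)
        w2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]

# Hypothetical 4x4 8-bit image, flattened row by row.
image = [0, 0, 0, 0, 0, 0, 0, 0, 255, 255, 255, 255, 128, 128, 64, 200]
codes = huffman_codes(image)
encoded_bits = sum(len(codes[p]) for p in image)
print(codes)
print(encoded_bits, "bits instead of", 8 * len(image))
```

Storing a whole code dictionary in each heap entry keeps the sketch short; a tree of nodes would be the more usual choice for a full implementation, where the code table also has to be transmitted to the decoder.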

The beauty of Huffman codes is that variable length codes can achieve a higher data density than fixed

length codes if the characters differ in frequency of occurrence. The length of an encoded character

grows roughly with the negative logarithm of that character's frequency, so rarer characters get longer codes. Huffman wasn't the first to discover this, but his

paper presented the optimal algorithm for assigning these codes. Huffman codes are similar to

Morse code, which uses few dots and dashes for the most frequently occurring letters. An E is

represented with one dot. A T is represented with one dash. Q, a letter occurring less frequently, is

represented with dash-dash-dot-dash.


Huffman codes are created by analyzing the data set and assigning short bit streams to the datum occurring

most frequently. The algorithm attempts to create codes that minimize the average number of bits per

character. Table 13.1 shows an example of the frequency of letters in some text and their corresponding

Huffman code. To keep the table manageable, only letters were used. It is well known that

in English text, the space character is the most frequently occurring character.

As expected, E and T had the highest frequency and the shortest Huffman codes. Encoding with these

codes is simple. Encoding the word "toupee" is just a matter of stringing together the appropriate

bit strings, as follows:

T    O     U      P      E    E
111  0100  10111  10110  100  100

One ASCII character requires 8 bits. The original 48 bits of data have been coded with 23 bits,

achieving a compression ratio of about 2.09.
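The same encoding can be checked with a few lines of Python. The codes for T, O, U, P and E are copied from the letter table in this assignment; the helper names `encode` and `decode` are illustrative only. Decoding works by reading bits until they match a code, which is unambiguous because no code is a prefix of another.

```python
# Codes copied from the letter table in this assignment (only the letters needed).
CODES = {"T": "111", "O": "0100", "U": "10111", "P": "10110", "E": "100"}

def encode(word, codes):
    """Concatenate the bit string of every letter in word."""
    return "".join(codes[ch] for ch in word.upper())

def decode(bits, codes):
    """Read bits left to right, emitting a letter whenever a full code is seen."""
    reverse = {code: ch for ch, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:
            out.append(reverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete code")
    return "".join(out)

bits = encode("toupee", CODES)
print(bits, len(bits))
print(decode(bits, CODES))
print(8 * len("toupee") / len(bits))  # compression ratio relative to 8-bit ASCII
```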


Letter  Frequency (%)  Huffman code
A        8.23          0000
B        1.26          110000
C        4.04          1101
D        3.40          01011
E       12.32          100
F        2.28          11001
G        2.77          10101
H        3.94          00100
I        8.08          0001
J        0.14          110001001
K        0.43          1100011
L        3.79          00101
M        3.06          10100
N        6.81          0110
O        7.59          0100
P        2.58          10110
Q        0.14          1100010000
R        6.67          0111
S        7.64          0011
T        8.37          111
U        2.43          10111
V        0.97          0101001
W        1.07          0101000
X        0.29          11000101
Y        1.46          010101
Z        0.09          1100010001


Modified Huffman coding is used in fax machines to encode black on white images (bitmaps). It is also

an option to compress images in the TIFF file format. It combines the variable length codes of Huffman

coding with the coding of repetitive data in run length encoding. Since facsimile transmissions are

typically black text or writing on white background, only one bit is required to represent each pixel or

sample. These samples are referred to as white bits and black bits. The runs of white bits and black bits

are counted, and the counts are sent as variable length bit streams.

The encoding scheme is fairly simple. Each line is coded as a series of alternating runs of white and

black bits. Runs of 63 or fewer are coded with a terminating code. Runs of 64 or greater require that a

makeup code prefix the terminating code. The makeup codes are used to describe runs in multiples of

64 from 64 to 2560. This deviates from the normal Huffman scheme, which would require

encoding all 2560 possibilities. This reduces the size of the Huffman code tree and accounts for the

term modified in the name.
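The split of a run into a makeup part and a terminating part is simple integer arithmetic. A minimal sketch, assuming a single makeup code per run (so run lengths up to 2560 + 63); the name `split_run` is hypothetical:

```python
def split_run(length):
    """Split a run length into (makeup, terminating) parts.

    makeup is a multiple of 64 between 64 and 2560, or None for runs below 64;
    terminating is always in the range 0..63.
    """
    if not 0 <= length <= 2560 + 63:
        raise ValueError("run length outside the range covered by one makeup code")
    makeup = (length // 64) * 64
    terminating = length % 64
    return (makeup if makeup else None, terminating)

print(split_run(1266))
print(split_run(63))
```

Because every run ends with a terminating code, a run of exactly 64 is sent as makeup 64 followed by terminating 0.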

Studies have shown that most facsimiles are 85 percent white, so the Huffman codes have been

optimized for long runs of white and short runs of black. The protocol also assumes that the line begins

with a run of white bits. If it doesn't, a run of white bits of 0 length must begin the encoded line. The

encoding then alternates between black bits and white bits to the end of the line. Each scan line ends

with a special EOL (end of line) character consisting of eleven zeros and a 1 (000000000001). The

EOL character doubles as an error recovery code. Since there is no other combination of codes that has

more than seven zeros in succession, a decoder seeing eight will recognize the end of line and

continue scanning for a 1. Upon receiving the 1, it will then start a new line. If bits in a scan line get

corrupted, the most that will be lost is the rest of the line. If the EOL code gets corrupted, the most that

will get lost is the next line.
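Before any codes are looked up, a scan line has to be turned into alternating white and black run lengths, starting with white. The sketch below shows that step only; it assumes pixels are given as 0 for white and 1 for black, and `scanline_runs` is an illustrative name.

```python
def scanline_runs(pixels):
    """Return alternating run lengths (white, black, white, ...) for one scan line.

    pixels is a sequence of 0 (white) and 1 (black). If the line starts with
    black, the first entry is a white run of length 0.
    """
    runs = []
    current_color = 0  # every line is assumed to start with a white run
    run_length = 0
    for p in pixels:
        if p == current_color:
            run_length += 1
        else:
            runs.append(run_length)
            current_color = p
            run_length = 1
    runs.append(run_length)
    return runs

# Hypothetical 1275-pixel line that starts with a black pixel.
line = [1] + [0] * 4 + [1] * 2 + [0] + [1] + [0] * 1266
print(scanline_runs(line))
```

Each run length would then be replaced by its white or black code, and the EOL code appended to close the line.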

Tables 13.2 and 13.3 show the terminating and makeup codes. Figure 13.1 shows how to encode a

1275-pixel scan line with 53 bits.

Table 13.2: Terminating codes

Run  White bits  Black bits     Run  White bits  Black bits
0    00110101    0000110111     32   00011011    000001101010
1    000111      010            33   00010010    000001101011
2    0111        11             34   00010011    000011010010
3    1000        10             35   00010100    000011010011
4    1011        011            36   00010101    000011010100
5    1100        0011           37   00010110    000011010101
6    1110        0010           38   00010111    000011010110
7    1111        00011          39   00101000    000011010111
8    10011       000101         40   00101001    000001101100
9    10100       000100         41   00101010    000001101101
10   00111       0000100        42   00101011    000011011010
11   01000       0000101        43   00101100    000011011011
12   001000      0000111        44   00101101    000001010100
13   000011      00000100       45   00000100    000001010101
14   110100      00000111       46   00000101    000001010110
15   110101      000011000      47   00001010    000001010111
16   101010      0000010111     48   00001011    000001100100
17   101011      0000011000     49   01010010    000001100101
18   0100111     0000001000     50   01010011    000001010010
19   0001100     00001100111    51   01010100    000001010011
20   0001000     00001101000    52   01010101    000000100100
21   0010111     00001101100    53   00100100    000000110111
22   0000011     00000110111    54   00100101    000000111000
23   0000100     00000101000    55   01011000    000000100111
24   0101000     00000010111    56   01011001    000000101000
25   0101011     00000011000    57   01011010    000001011000
26   0010011     000011001010   58   01011011    000001011001
27   0100100     000011001011   59   01001010    000000101011
28   0011000     000011001100   60   01001011    000000101100
29   00000010    000011001101   61   00110010    000001011010
30   00000011    000001101000   62   00110011    000001100110
31   00011010    000001101001   63   00110100    000001100111


Table 13.3: Makeup codes

Run   White bits     Black bits
64    11011          0000001111
128   10010          000011001000
192   010111         000011001001
256   0110111        000001011011
320   00110110       000000110011
384   00110111       000000110100
448   01100100       000000110101
512   01100101       0000001101100
576   01101000       0000001101101
640   01100111       0000001001010
704   011001100      0000001001011
768   011001101      0000001001100
832   011010010      0000001001101
896   011010011      0000001110010
960   011010100      0000001110011
1024  011010101      0000001110100
1088  011010110      0000001110101
1152  011010111      0000001110110
1216  011011000      0000001110111
1280  011011001      0000001010010
1344  011011010      0000001010011
1408  011011011      0000001010100
1472  010011000      0000001010101
1536  010011001      0000001011010
1600  010011010      0000001011011
1664  011000         0000001100100
1728  010011011      0000001100101
1792  00000001000    00000001000
1856  00000001100    00000001100
1920  00000001101    00000001101
1984  000000010010   000000010010
2048  000000010011   000000010011
2112  000000010100   000000010100
2176  000000010101   000000010101
2240  000000010110   000000010110
2304  000000010111   000000010111
2368  000000011100   000000011100
2432  000000011101   000000011101
2496  000000011110   000000011110
2560  000000011111   000000011111
EOL   000000000001   000000000001

Figure 13.1: Encoding a 1275-pixel scan line

Run length  Color  Code words
0           white  00110101
1           black  010
4           white  1011
2           black  11
1           white  0111
1           black  010
1266        white  011011000 + 01010011
EOL                000000000001

Conclusion:

Generation of Huffman codes for a gray scale 8 bit image is studied.
