
Message Classification Using Instance Based Learning Algorithm

A mini project report


Submitted in partial fulfillment of
the requirements for the award of the degree of

Master of Technology
in
Computer Science and Engineering
by
Swapna T R
M050181CS

Guided by
Abdul Nazeer A K

Department of Computer Engineering


National Institute of Technology, Calicut
Kerala – 673601
2006
CERTIFICATE

This is to certify that the mini project Message Classification Using
Instance Based Learning Algorithm is a bona fide record of the mini
project done by Ms. Swapna T R (M050181CS) under our supervision and
guidance. The project report has been submitted to the Department of Computer
Engineering of the National Institute of Technology, Calicut in partial fulfillment
of the requirements for the award of the degree of Master of Technology in
Computer Science and Engineering.

Dr. M. P. Sebastian                         Mr. Abdul Nazeer A K
Professor and Head                          Lecturer (SS)
Dept. of Computer Engineering               Dept. of Computer Engineering
NIT Calicut                                 NIT Calicut
Message Classification using Instance Based
Learning Algorithm

Swapna T R
M050181CS
Guided by: Abdul Nazeer A K
April 6, 2006

ACKNOWLEDGEMENT

I have been very fortunate to have Abdul Nazeer A K, Department of Computer
Engineering, as my guide, whose timely guidance, advice and inspiration
helped me in the preparation of this mini project. I express my sincere
gratitude for having guided me through this work. I am thankful to Dr. M. P.
Sebastian, Head of the Computer Engineering Department, for his encouragement
and for giving me this opportunity.

Swapna. T R

Contents

1 Abstract

2 Introduction

3 Instance Based Learning Methods

3.1 Common Instance Based Learning Methods
3.2 K-Nearest Neighbor Algorithm

4 About WEKA

5 Conclusion

6 Snapshots

7 References

1 Abstract

Classification of text documents is an example of supervised learning that
seeks to build an instance-based model of a function that maps documents
to classifications. In supervised learning of text, where an entire document
represents one example to be classified, a learning algorithm is presented with
a set of already classified, or labeled, examples. This set is called the training
set. A number of classified documents from the training set are removed prior
to model building to be used for testing the model's performance.

In this project the messages are classified into two predefined classes,
"hit" or "miss", using the instance-based K-Nearest Neighbor algorithm
within the Waikato Environment for Knowledge Analysis (WEKA) framework.

2 Introduction

The automated categorization of text into predefined categories has
witnessed a booming interest in the last 10 years, due to the increased
availability of documents in digital form and the ensuing need to organize
them. In the research community the dominant approach to this problem is
based on machine learning techniques: a general inductive process
automatically builds a classifier by learning the characteristics of the
categories from a set of preclassified documents. The advantages of this
approach over the knowledge engineering approach are very good effectiveness,
considerable savings in terms of expert labour power, and straightforward
portability to different domains.

This project involves classifying documents into two predefined categories,
"hit" and "miss", using an instance-based algorithm.

3 Instance Based Learning Methods

Instance-based learning classifies a new instance by generalizing from the
stored training examples. Training examples are processed only when a new
instance arrives; for this reason instance-based learning methods are
sometimes called lazy learning, because they delay processing until a new
instance must be classified. Each time a new query instance is encountered,
its relationship to the previously stored examples is examined to assign a
target function value for the new instance. The search is for the best,
similar, or close match, not necessarily an exact one.

For example, the popular industry draughting software for architecture and
engineering, AutoCAD, contains a standard design library plus the user's
previously stored designs. If a new project requires some machinery parts to
be designed, the engineer specifies some design parameters (attributes) to
the AutoCAD software, which retrieves similar instances of past designs
previously stored in the database. If a previous design instance matches
exactly, it retrieves that one; otherwise it retrieves the closest design
match. Many similar instances can be retrieved, and the best attributes of
each design can be combined and used to design a completely new one.

3.1 Common Instance Based Learning Methods

Instance-based learning algorithms consist simply of storing the presented
training examples. When a new instance is encountered, a set of similar,
related instances is retrieved from memory and used to classify the query
instance (target function).

The following are the most common instance-based learning methods:

• k-Nearest Neighbor

• Locally weighted regression

• Radial basis functions

Instance-based learning approaches can construct a different approximation
of the target function for each query instance that has to be classified.
Some techniques construct only a local approximation of the target function
that applies in the neighbourhood of the new query instance, and never
construct an approximation designed to perform well over the entire instance
space. This is a significant advantage when the target function is very
complex but can still be described by a collection of less complex local
approximations.

3.2 K-Nearest Neighbor Algorithm

The K-Nearest Neighbor algorithm is the most basic of all instance-based
learning methods. The idea behind it is to build a classification method
using no assumptions about the form of the function y = f(x1, x2, ..., xp)
that relates the dependent variable y to the independent (or predictor)
variables x1, x2, ..., xp. The only assumption we make is that f is a smooth
function. We have training data in which each observation has a y value,
which is just the class to which the observation belongs. For example, if we
have two classes, y is a binary variable. The idea in K-Nearest Neighbor
methods is to dynamically identify k observations in the training data set
that are similar to a new observation, say (u1, u2, ..., up), that we wish to
classify, and to use these observations to assign it to a class C. If we knew
the function f, we would simply compute C = f(u1, u2, ..., up). If we assume
that f is a smooth function, a reasonable idea is to look for observations in
our training data that are near the new observation, and then to compute C
from the values of y for those observations. When we talk about neighbours we
are implying that there is a distance or dissimilarity measure that we can
compute between observations based on the independent variables. The most
popular measure of distance is the Euclidean distance.

The Euclidean distance between the points (x1, x2, ..., xp) and
(u1, u2, ..., up) is given by the following equation:

d(x, u) = sqrt( (x1 − u1)^2 + (x2 − u2)^2 + ... + (xp − up)^2 )
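The distance computation and the majority vote over the k nearest training observations can be sketched in a few lines of standalone Java. The Sample record, the toy training points and the query below are purely illustrative and are not part of the project's actual code or data.

```java
import java.util.*;

public class KnnSketch {
    // A labeled training observation: predictor values plus its class ("hit" or "miss").
    record Sample(double[] x, String label) {}

    // Euclidean distance between two observations, as in the equation above.
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Classify a query by majority vote among its k nearest training samples.
    static String classify(List<Sample> train, double[] query, int k) {
        List<Sample> sorted = new ArrayList<>(train);
        sorted.sort(Comparator.comparingDouble(s -> distance(s.x(), query)));
        Map<String, Integer> votes = new HashMap<>();
        for (Sample s : sorted.subList(0, k)) {
            votes.merge(s.label(), 1, Integer::sum);
        }
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        List<Sample> train = List.of(
            new Sample(new double[]{1.0, 1.0}, "hit"),
            new Sample(new double[]{1.2, 0.9}, "hit"),
            new Sample(new double[]{5.0, 5.0}, "miss"),
            new Sample(new double[]{4.8, 5.2}, "miss"));
        // The query lies near the two "hit" points, so with k = 3 the vote is 2-1 for "hit".
        System.out.println(classify(train, new double[]{1.1, 1.0}, 3)); // prints "hit"
    }
}
```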

4 About WEKA

WEKA is a comprehensive workbench for machine learning and data mining.
WEKA was developed at the University of Waikato in New Zealand; "WEKA"
stands for the Waikato Environment for Knowledge Analysis. WEKA has been
widely tested on all major operating systems.

WEKA provides implementations of state-of-the-art learning algorithms that
can be applied to your dataset. It also includes a variety of tools for
transforming datasets, such as algorithms for discretization. We can
preprocess a dataset, feed it into a learning scheme, and analyze the
resulting classifier and its performance.

WEKA expects datasets to be in ARFF (Attribute-Relation File Format),
because it is necessary to have type information about each attribute,
which cannot be automatically deduced from the attribute values.
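A minimal ARFF file for a two-class message problem might look like the following. The relation name, the attribute names and the example messages are illustrative only and are not taken from the project's actual training file.

```
@relation messages

@attribute message string
@attribute class {hit, miss}

@data
'special offer, buy now', miss
'meeting moved to 3 pm', hit
```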

We can also call packages that are implemented in WEKA from our own source
code in Java. For this project I have used IBk, which is WEKA's
implementation of the instance-based (k-nearest neighbor) classifier.

5 Conclusion

In this project I have extensively studied the WEKA framework. It has its
own classifiers and filters, which are implemented as packages. The weka.core
package is central to the WEKA system and forms the base for every other
class. I implemented a message classifier with which I could train the system
on a sample message file and classify messages as "hit" or "miss". When an
unknown instance is given, it is also correctly classified using the
K-Nearest Neighbor algorithm in WEKA.

6 Snapshots

Figure 1: WEKA's Command prompt

Figure 2: Classifying a Message File

Figure 3: Already classified Message File

Figure 4: The Training file - The Message appears in bold
7 References

[1] Fabrizio Sebastiani. Machine Learning in Automated Text Categorization.
    ACM Computing Surveys.

[2] www.cs.waikato.ac.nz/ml/weka/

[3] K. P. Soman, Shyam Diwakar, V. Ajay. Data Mining: Theory and Practice.
