
Message Classification Using Instance Based Learning Algorithm

A mini project report


Submitted in partial fulfillment of
the requirements for the award of the degree of

Master of Technology
in
Computer Science and Engineering
by
Swapna T R
M050181CS

Guided by
Abdul Nazeer A K

Department of Computer Engineering


National Institute of Technology, Calicut
Kerala – 673601
2006
CERTIFICATE

This is to certify that the mini project Message Classification Using
Instance Based Learning Algorithm is a bona fide record of the mini
project done by Ms. Swapna T R (M050181CS) under our supervision and
guidance. The project report has been submitted to the Department of Computer
Engineering of the National Institute of Technology, Calicut in partial fulfillment
of the requirements for the award of the degree of Master of Technology in
Computer Science and Engineering.

Dr. M. P. Sebastian                         Mr. Abdul Nazeer A K
Professor and Head                          Lecturer (SS)
Dept. of Computer Engineering               Dept. of Computer Engineering
NIT Calicut                                 NIT Calicut
Message Classification using Instance Based
Learning Algorithm

Swapna T R
M050181CS
Guided by: Abdul Nazeer A K
April 6, 2006

ACKNOWLEDGEMENT

I have been very fortunate to have Abdul Nazeer A K, Department of Computer
Engineering, as my guide, whose timely guidance, advice and inspiration
helped me in the preparation of this mini project. I express my sincere
gratitude for having guided me through this work. I am thankful to Dr. M. P.
Sebastian, Head of the Computer Engineering Department, for his encouragement
and for giving me this opportunity.

Swapna. T R

Contents

1 Abstract

2 Introduction

3 Instance Based Learning Methods

3.1 Common Instance Based Learning Methods
3.2 K-Nearest Neighbor Algorithm

4 About WEKA

5 Conclusion

6 Snapshots

7 References

1 Abstract

Classification of text documents is an example of supervised learning that
seeks to build an instance-based model of a function that maps documents
to classifications. In supervised learning of text, where an entire document
represents one example to be classified, a learning algorithm is presented with
a set of already classified, or labeled, examples. This set is called the training
set. A number of classified documents from the training set are removed prior
to model building to be used for testing the model's performance.

In this project the messages are classified into two predefined classes,
"hit" or "miss", using the instance-based K-Nearest Neighbor algorithm
within the Waikato Environment for Knowledge Analysis (WEKA) framework.

2 Introduction

The automated categorization of text into predefined categories has
witnessed a booming interest in the last 10 years, due to the increased
availability of documents in digital form and the ensuing need to organize
them. In the research community the dominant approach to this problem is
based on machine learning techniques: a general inductive process
automatically builds a classifier by learning the characteristics of the
categories from a set of preclassified documents. The advantages of this
approach over the knowledge engineering approach are very good effectiveness,
considerable savings in terms of expert labour power, and straightforward
portability to different domains.

This project involves classifying documents into two predefined categories,
"hit" and "miss", using an instance-based algorithm.

3 Instance Based Learning Methods

Instance-based learning classifies a new instance by generalizing from the
stored training examples. Training examples are processed only when a new
instance arrives; for this reason instance-based learning methods are
sometimes called lazy learning, because they delay processing until a new
instance must be classified. Each time a new query instance is encountered,
its relationship to the previously stored examples is examined to assign a
target function value for the new instance. The search is for the best,
similar, or close match, not necessarily an exact one.

For example, the popular industry draughting software for architecture and
engineering, AutoCAD, contains a standard design library plus the user's
previously stored designs. If a new project requires some machinery parts to
be designed, the engineer specifies some design parameters (attributes) to
the AutoCAD software, which retrieves similar instances of past designs
previously stored in the database. If a previous design instance matches
exactly, it retrieves that one; otherwise it retrieves the closest design
match. Many similar instances can be retrieved, and the best attributes of
each design can be combined and used to design a completely new one.

3.1 Common Instance Based Learning Methods

Instance-based learning algorithms consist simply of storing the presented
training examples. When a new instance is encountered, a set of similar,
related instances is retrieved from memory and used to classify the query
instance (target function).

The following are the most common instance-based learning methods:

• k-Nearest Neighbor

• Locally weighted regression

• Radial basis functions

Instance-based learning approaches can construct a different approximation
of the target function for each query instance that has to be classified.
Some techniques construct only a local approximation of the target function
that applies in the neighbourhood of the new query instance, and never
construct an approximation designed to perform well over the entire instance
space. This is a significant advantage when the target function is very
complex but can still be described by a collection of less complex local
approximations.

3.2 K-Nearest Neighbor Algorithm

The K-Nearest Neighbor algorithm is the most basic of all instance-based
learning methods. The idea behind it is to build a classification method
using no assumptions about the form of the function y = f(x1, x2, ..., xp)
that relates the dependent variable y to the independent (or predictor)
variables x1, x2, ..., xp. The only assumption we make is that f is a smooth
function. We have training data in which each observation has a y value,
which is just the class to which the observation belongs. For example, if we
have two classes, y is a binary variable. The idea in K-Nearest Neighbor
methods is to dynamically identify k observations in the training data set
that are similar to a new observation, say (u1, u2, ..., up), that we wish to
classify, and to use these observations to assign it to a class C. If we knew
the function f, we would simply compute C = f(u1, u2, ..., up). If we assume
that f is a smooth function, a reasonable idea is to look for observations in
our training data that are near the new observation, and then to compute C
from the values of y for those observations. When we talk about neighbours we
are implying that there is a distance or dissimilarity measure that we can
compute between observations based on the independent variables. The most
popular measure of distance is the Euclidean distance.

The Euclidean distance between the points (x1, x2, ..., xp) and
(u1, u2, ..., up) is given by the following equation:

d(x, u) = sqrt( (x1 − u1)^2 + (x2 − u2)^2 + ... + (xp − up)^2 )
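The distance computation and the majority vote over the k nearest training observations can be sketched in a few lines of standalone Java. The Sample record, the toy training points and the query below are purely illustrative and are not part of the project's actual code or data.

```java
import java.util.*;

public class KnnSketch {
    // A labeled training observation: predictor values plus its class ("hit" or "miss").
    record Sample(double[] x, String label) {}

    // Euclidean distance between two observations, as in the equation above.
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Classify a query by majority vote among its k nearest training samples.
    static String classify(List<Sample> train, double[] query, int k) {
        List<Sample> sorted = new ArrayList<>(train);
        sorted.sort(Comparator.comparingDouble(s -> distance(s.x(), query)));
        Map<String, Integer> votes = new HashMap<>();
        for (Sample s : sorted.subList(0, k)) {
            votes.merge(s.label(), 1, Integer::sum);
        }
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        List<Sample> train = List.of(
            new Sample(new double[]{1.0, 1.0}, "hit"),
            new Sample(new double[]{1.2, 0.9}, "hit"),
            new Sample(new double[]{5.0, 5.0}, "miss"),
            new Sample(new double[]{4.8, 5.2}, "miss"));
        // The query lies near the two "hit" points, so with k = 3 the vote is 2-1 for "hit".
        System.out.println(classify(train, new double[]{1.1, 1.0}, 3)); // prints "hit"
    }
}
```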

4 About WEKA

WEKA is a comprehensive workbench for machine learning and data mining.
WEKA was developed at the University of Waikato in New Zealand; "WEKA"
stands for the Waikato Environment for Knowledge Analysis. WEKA has been
widely tested on all major operating systems.

WEKA provides implementations of state-of-the-art learning algorithms that
can be applied to your dataset. It also includes a variety of tools for
transforming datasets, such as algorithms for discretization. We can
preprocess a dataset, feed it into a learning scheme, and analyze the
resulting classifier and its performance.

WEKA expects datasets to be in ARFF (Attribute-Relation File Format),
because it is necessary to have type information about each attribute,
which cannot be automatically deduced from the attribute values.
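A minimal ARFF file for a two-class message problem might look like the following. The relation name, the attribute names and the example messages are illustrative only and are not taken from the project's actual training file.

```
@relation messages

@attribute message string
@attribute class {hit, miss}

@data
'special offer, buy now', miss
'meeting moved to 3 pm', hit
```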

We can also call packages that are implemented in WEKA from our own source
code in Java. For this project I have used IBk, which is WEKA's
implementation of the instance-based (k-nearest neighbor) classifier.

5 Conclusion

In this project I have extensively studied the WEKA framework. It has its
own classifiers and filters, which are implemented as packages. The weka.core
package is central to the WEKA system and forms the base for every other
class. I implemented a message classifier with which I could train the system
on a sample message file and classify messages as "hit" or "miss". When an
unknown instance is given, it is also correctly classified using the
K-Nearest Neighbor algorithm in WEKA.

6 Snapshots

Figure 1: WEKA's Command prompt

Figure 2: Classifying a Message File

Figure 3: Already classified Message File

Figure 4: The Training file - The Message appears in bold
7 References

[1] Fabrizio Sebastiani. Machine Learning in Automated Text Categorization.
    ACM Computing Surveys.

[2] www.cs.waikato.ac.nz/ml/weka/

[3] K. P. Soman, Shyam Diwakar, V. Ajay. Data Mining: Theory and Practice.
