Lecture6 - Classification and Its Techniques

Data Mining
Instructor: Bajuna Salehe Email: bajunar@yahoo.com Web: http://www.ifm.ac.tz/staff/bajuna/courses
Classification and Prediction
Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends. Such analysis can help provide us with a better understanding of the data at large.
Classification
Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data (i.e., data objects whose class label is known).
What is Classification
Classification is the task of assigning objects to their respective categories.

Examples include classifying email messages as spam or non-spam based upon the message header and content, and classifying galaxies based upon their respective shapes.
Classification can provide valuable support for informed decision making in an organisation. For example, suppose a mobile phone company would like to promote a new cellphone product to the public. Instead of mass mailing the promotional catalog to everyone, the company may be able to reduce the campaign cost by targeting only a small segment of the population.
The company may classify each person as a potential buyer or non-buyer based on personal information such as income, occupation, lifestyle, and credit rating.
For example, we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to predict the expenditures in dollars of potential customers on computer equipment given their income and occupation.
An example application

An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc.) of newly admitted patients. A decision is needed: whether to put a new patient in an intensive-care unit (ICU). Due to the high cost of the ICU, those patients who may survive less than a month are given higher priority. Problem: to predict high-risk patients and discriminate them from low-risk patients.
Another application
A credit card company receives thousands of applications for new cards. Each application contains information about an applicant: age, marital status, annual salary, outstanding debts, credit rating, etc.
Problem: to decide whether an application should be approved, or to classify applications into two categories, approved and not approved.
Classification and Machine Learning

Classification resembles human learning from past experiences. A computer does not have experiences of its own, so a computer system learns from data, which represent some past experiences of an application domain. Our focus: learn a target function that can be used to predict the values of a discrete class attribute, e.g., approved or not approved, and high-risk or low-risk. The task is commonly called supervised learning, classification, or inductive learning.
The data and the goal
Data: A set of data records (also called examples, instances or cases) described by
k attributes: A1, A2, ..., Ak.
a class: Each example is labelled with a predefined class.
Goal: To learn a classification model from the data that can be used to predict the classes of new (future, or test) cases/instances.
An example: data (loan application)
Approved or not
An example: the learning task

Learn a classification model from the data.
Use the model to classify future loan applications into Yes (approved) and No (not approved).
What is the class for the following case/instance?
Supervised vs. unsupervised Learning
Supervised learning: classification is seen as supervised learning from examples.

Supervision: The data (observations, measurements, etc.) are labeled with predefined classes. It is as if a teacher gives the classes (supervision). Test data are classified into these classes too.
Unsupervised learning (clustering)

Class labels of the data are unknown.
Given a set of data, the task is to establish the existence of classes or clusters in the data.
Preliminaries
The input data for a classification task is given in the form of a collection of records. Each record, also known as an instance or example, is characterised by a tuple (x, y), where x is the attribute set and y is the class label.
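The (x, y) representation can be sketched in Python as follows. This is only an illustration: the `Record` type, the attribute names, and the three records are hypothetical and are not taken from the lecture's data.

```python
from typing import NamedTuple


class Record(NamedTuple):
    """One instance: an attribute set x and a class label y."""
    x: dict  # attribute name -> attribute value
    y: str   # class label (a discrete value)


# Hypothetical records, for illustration only.
records = [
    Record(x={"body_temperature": "warm-blooded", "gives_birth": True, "can_fly": False},
           y="mammal"),
    Record(x={"body_temperature": "cold-blooded", "gives_birth": False, "can_fly": False},
           y="reptile"),
    Record(x={"body_temperature": "warm-blooded", "gives_birth": False, "can_fly": True},
           y="bird"),
]

# Separate the attribute sets from the class labels.
attribute_sets = [r.x for r in records]
class_labels = [r.y for r in records]
distinct_classes = sorted(set(class_labels))
```

Keeping x and y in one tuple makes it easy to hand the attribute sets to a learning algorithm while holding the labels aside for training or evaluation.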
Table 1. Vertebrate Data Set
Table 1 shows a sample data set used for classifying vertebrates into one of the following categories: mammal, bird, fish, reptile, or amphibian. The attribute set includes properties of a vertebrate such as its body temperature, skin cover, method of reproduction, ability to fly and ability to live in water.
The attribute set may contain both discrete and continuous features, although the attribute set in Table 1 contains mostly discrete values. The class label, on the other hand, must be a discrete attribute. This is a key characteristic that distinguishes classification from another predictive modeling task known as regression, where y is a continuous attribute.
Preliminaries: Discrete data
A type of data is discrete if there are only a finite number of values possible or if there is a space on the number line between each 2 possible values.
Example: A 5-question quiz is given in a math class. The number of correct answers on a student's quiz is an example of discrete data. The number of correct answers has to be one of the following: 0, 1, 2, 3, 4, or 5. There is not an infinite number of possible values, so this data is discrete.
Also, if we were to draw a number line and place each possible value on it, we would see a space between each pair of values. Discrete data usually occurs in a case where there are only a certain number of values, or when we are counting something (using whole numbers).
Preliminaries: Continuous data

Continuous data makes up the rest of numerical data. This is a type of data that is usually associated with some sort of physical measurement. Examples include height, weight, temperature, the amount of sugar in an orange, and the time required to run a mile.
Preliminaries: Categorical data
Categorical variables represent types of data which may be divided into groups. Examples of categorical variables are race, sex, age group, and educational level.
While the latter two variables (age group and educational level) may also be treated numerically by using exact values for age and highest grade completed, it is often more informative to categorize such variables into a relatively small number of groups. This process is called discretization.
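Grouping an exact age into a small number of age groups might look like the sketch below. The cut points and group names are assumptions chosen only for illustration; they do not come from the lecture.

```python
from bisect import bisect_right

# Hypothetical cut points and group names, for illustration only.
AGE_CUTS = [18, 35, 60]
AGE_GROUPS = ["under 18", "18-34", "35-59", "60 and over"]


def discretize_age(age: float) -> str:
    """Map an exact (continuous) age to one of a few categorical age groups."""
    # bisect_right counts how many cut points are <= age,
    # which is exactly the index of the matching group.
    return AGE_GROUPS[bisect_right(AGE_CUTS, age)]


ages = [4, 18, 34.5, 35, 72]
groups = [discretize_age(a) for a in ages]
```

After this step the attribute takes only four possible values, so it can be handled like any other categorical attribute.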
Usefulness of Classification Model
A classification model is useful for the following purposes:

It may serve as an explanatory tool to distinguish between objects of different classes (Descriptive Modeling).
It may also be used to predict the class label of unknown records (Predictive Modeling). Consider the table below:
A classification model can be treated as a black box that automatically assigns a class label when presented with the attribute set of an unknown record. For example, suppose you are given the characteristics of a creature known as a gila monster.
By building a classification model from the data set shown in Table 1, you may use the model to determine the class to which the creature belongs. Classification models are most suited for predicting or describing data sets with binary or nominal target attributes.
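The black-box view can be illustrated with a very small sketch. It assumes a 1-nearest-neighbor rule (one possible technique among many, chosen here only for brevity) and a handful of illustrative training records written for this example; they are not the rows of the lecture's table, and the attribute names are hypothetical.

```python
# Attribute order used in every record below (hypothetical names).
ATTRIBUTES = ("body_temperature", "skin_cover", "gives_birth", "can_fly", "lives_in_water")

# Illustrative training records (attribute set, class label); not the lecture's table.
training_set = [
    (("warm-blooded", "hair", "yes", "no", "no"), "mammal"),
    (("cold-blooded", "scales", "no", "no", "no"), "reptile"),
    (("cold-blooded", "scales", "no", "no", "yes"), "fish"),
    (("warm-blooded", "feathers", "no", "yes", "no"), "bird"),
    (("cold-blooded", "none", "no", "no", "semi"), "amphibian"),
]


def mismatches(a, b):
    """Count the attributes on which two attribute sets disagree."""
    return sum(1 for u, v in zip(a, b) if u != v)


def classify(x, training_set):
    """Black box: attribute set in, class label out (1-nearest-neighbor rule)."""
    _, label = min(training_set, key=lambda record: mismatches(x, record[0]))
    return label


# Attribute set of the unknown record (a gila monster).
gila_monster = ("cold-blooded", "scales", "no", "no", "no")
predicted = classify(gila_monster, training_set)
```

The caller only sees `classify`: it passes in an attribute set and receives a class label, without needing to know how the model reached its decision.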
Classification Techniques
Classification Technique
A classification technique is a systematic approach for building classification models from an input data set. Examples of classification techniques include:
Decision Tree Classifiers
Rule-Based Classifiers
Neural Networks
Support Vector Machines
Naïve Bayes Classifiers
Nearest-Neighbor Classifiers
Each technique employs a learning algorithm to identify a model that best fits the relationship between the attribute set and class label of the input data (produces outputs consistent with the class labels of the input data).
A good classification model must predict correctly the class labels of records it has never seen before. Building models with good generalization capability, i.e., models that accurately predict the class labels of previously unseen records, is therefore a key objective of the learning algorithm.
General Approach to Solve a Classification Problem
A general strategy for solving a classification problem is as follows:

First, the input data is divided into two disjoint sets, known as the training set and the test set, respectively.
The training set is used to build a classification model. The induced model is later applied to the test set to predict the class label of each test record.
Why are we dividing the data into two sets?
This strategy of dividing the data into independent training and test sets allows us to obtain an unbiased estimate of the performance of a model on previously unseen records. The figure on the next slide depicts this general approach.
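A minimal sketch of such a split, using only the Python standard library, is shown below. The helper name `split_train_test`, the 30% test fraction, the fixed seed, and the toy records are all assumptions made for illustration; the lecture does not prescribe any particular proportion.

```python
import random


def split_train_test(records, test_fraction=0.3, seed=0):
    """Randomly divide records into two disjoint lists: (training set, test set)."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(records)               # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Toy labelled records (attribute set, class label), for illustration only.
records = [((i,), "yes" if i % 2 == 0 else "no") for i in range(10)]
train, test = split_train_test(records)
```

Because every record lands in exactly one of the two lists, the model is never evaluated on a record it was trained on.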
General Approach to Solve a Classification Problem
Performance Measurement of Model
Evaluation of the performance of a classification model is based upon the number of test records predicted correctly and wrongly by the model. The counts are tabulated in a table known as a confusion matrix.
Table 2 depicts the confusion matrix for a binary classification problem.
Each entry fij in this table denotes the number of records from class i predicted to be of class j. For instance, f01 is the number of records from class 0 wrongly predicted as class 1. Based on the entries in the confusion matrix, the total number of correct predictions made by the model is (f11 + f00) and the total number of wrong predictions is (f10 + f01).
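The counting behind a binary confusion matrix can be sketched as follows. The class labels are assumed to be encoded as the integers 0 and 1, and the actual and predicted label lists are made up for illustration.

```python
def confusion_matrix(actual, predicted):
    """Return f, where f[i][j] is the number of records from class i predicted as class j."""
    f = [[0, 0], [0, 0]]
    for i, j in zip(actual, predicted):
        f[i][j] += 1
    return f


# Made-up test-set labels, for illustration only.
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

f = confusion_matrix(actual, predicted)
correct = f[1][1] + f[0][0]  # f11 + f00
wrong = f[1][0] + f[0][1]    # f10 + f01
```

Here `f[0][1]` plays the role of f01: the records from class 0 that were wrongly predicted as class 1.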
Although a confusion matrix provides the information needed to determine how well a classification model performs, it is useful to summarize this information in a single number. This makes it more convenient to compare the performance of different classification models.

There are several performance metrics available for doing this. One of the most popular metrics is model accuracy, which is defined as:

$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} = \frac{f_{11} + f_{00}}{f_{11} + f_{10} + f_{01} + f_{00}}$$

Equivalently, the performance of a model can be expressed in terms of its error rate, given by the following equation:

$$\text{Error rate} = \frac{\text{Number of wrong predictions}}{\text{Total number of predictions}} = \frac{f_{10} + f_{01}}{f_{11} + f_{10} + f_{01} + f_{00}}$$
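Both metrics follow directly from the four confusion-matrix counts, as the sketch below shows. The counts used here (40, 10, 5, 45) are hypothetical and serve only as a worked example.

```python
def accuracy(f11, f10, f01, f00):
    """Fraction of predictions that are correct: (f11 + f00) / total."""
    total = f11 + f10 + f01 + f00
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (f11 + f00) / total


def error_rate(f11, f10, f01, f00):
    """Fraction of predictions that are wrong: (f10 + f01) / total."""
    total = f11 + f10 + f01 + f00
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (f10 + f01) / total


# Hypothetical counts: 100 test records in total.
acc = accuracy(f11=40, f10=10, f01=5, f00=45)    # (40 + 45) / 100
err = error_rate(f11=40, f10=10, f01=5, f00=45)  # (10 + 5) / 100
```

Since every prediction is either correct or wrong, the two values always sum to 1, so reporting one of them is enough to recover the other.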
Decision Trees
