
Data Mining

Instructor: Bajuna Salehe Email: bajunar@yahoo.com Web: http://www.ifm.ac.tz/staff/bajuna/courses

Classification and Prediction

Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends. Such analysis can help provide us with a better understanding of the data at large.

Classification
Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data (i.e., data objects whose class label is known).

What is Classification

Classification is the task of assigning objects to their respective categories.


Examples include classifying email messages as spam or non-spam based upon the message header and content, and classifying galaxies based upon their shapes.

What is Classification

Classification can provide valuable support for informed decision making in an organisation. For example, suppose a mobile phone company would like to promote a new cellphone product to the public. Instead of mass mailing the promotional catalog to everyone, the company may be able to reduce the campaign cost by targeting only a small segment of the population.

What is Classification

The company may classify each person as a potential buyer or non-buyer based on personal information such as income, occupation, lifestyle, and credit rating.

Classification and Prediction

For example, we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to predict the expenditures in dollars of potential customers on computer equipment given their income and occupation.

An example application

An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc.) of newly admitted patients. A decision is needed: whether to put a new patient in an intensive-care unit (ICU). Due to the high cost of the ICU, those patients who may survive less than a month are given higher priority. Problem: to predict high-risk patients and discriminate them from low-risk patients.

Another application
A credit card company receives thousands of applications for new cards. Each application contains information about an applicant, such as age, marital status, annual salary, outstanding debts, credit rating, etc.

Problem: to decide whether an application should be approved, or to classify applications into two categories, approved and not approved.

Classification and Machine Learning


Machine learning is like human learning from past experiences. A computer does not have experiences of its own; a computer system learns from data, which represent some past experiences of an application domain.

Our focus: learn a target function that can be used to predict the values of a discrete class attribute, e.g., approved or not approved, and high-risk or low-risk.

The task is commonly called supervised learning, classification, or inductive learning.

The data and the goal

Data: A set of data records (also called examples, instances or cases) described by

k attributes: A1, A2, ..., Ak.

a class: Each example is labelled with a predefined class.

Goal: To learn a classification model from the data that can be used to predict the classes of new (future, or test) cases/instances.

An example: data (loan application)

(Table not shown: loan application records, each labelled with the class Approved or not.)

An example: the learning task


Learn a classification model from the data.

Use the model to classify future loan applications into Yes (approved) and No (not approved).

What is the class for the following case/instance?

Supervised vs. unsupervised Learning

Supervised learning: classification is seen as supervised learning from examples.


Supervision: The data (observations, measurements, etc.) are labeled with predefined classes. It is as if a teacher provides the classes (supervision). Test data are classified into these classes too.

Unsupervised learning (clustering)


Class labels of the data are unknown.

Given a set of data, the task is to establish the existence of classes or clusters in the data.

Preliminaries
The input data for a classification task is given in the form of a collection of records. Each record, also known as an instance or example, is characterised by a tuple (x, y), where x is the attribute set and y is the class label.
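The (x, y) form of a record can be sketched in Python as follows. This is only an illustration: the Record type, the attribute names, and the two sample records are assumptions made for this sketch and are not taken from the course data.

```python
from typing import Dict, NamedTuple


class Record(NamedTuple):
    """One labelled record: attribute set x and class label y (hypothetical type)."""
    x: Dict[str, str]  # attribute set: attribute name -> value
    y: str             # class label


# Illustrative records only; the values are made up for this sketch.
records = [
    Record(x={"body_temperature": "warm-blooded", "skin_cover": "hair",
              "gives_birth": "yes", "can_fly": "no"}, y="mammal"),
    Record(x={"body_temperature": "warm-blooded", "skin_cover": "feathers",
              "gives_birth": "no", "can_fly": "yes"}, y="bird"),
]

for record in records:
    print(record.x["skin_cover"], "->", record.y)
```

Keeping x and y separate in each record makes it easy to hand only x to a model when the class label is unknown.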

Preliminaries
Table 1. Vertebrate Data Set

Preliminaries

Table 1 shows a sample data set used for classifying vertebrates into one of the following categories: mammal, bird, fish, reptile, or amphibian. The attribute set includes properties of a vertebrate such as its body temperature, skin cover, method of reproduction, ability to fly, and ability to live in water.

Preliminaries

The attribute set may contain both discrete and continuous features; in Table 1, however, the attribute set contains mostly discrete values. The class label, on the other hand, must be a discrete attribute. This is the key characteristic that distinguishes classification from another predictive modeling task known as regression, where y is a continuous attribute.

Preliminaries: Discrete data

A type of data is discrete if there are only a finite number of possible values, or if there is a space on the number line between any two possible values.

Preliminaries: Discrete data

Example: A 5-question quiz is given in a Math class. The number of correct answers on a student's quiz is an example of discrete data. The number of correct answers would have to be one of the following: 0, 1, 2, 3, 4, or 5. There is not an infinite number of values, therefore this data is discrete.

Preliminaries: Discrete data

Also, if we were to draw a number line and place each possible value on it, we would see a space between each pair of values. Discrete data usually occurs in a case where there are only a certain number of values, or when we are counting something (using whole numbers).

Preliminaries: Continuous data


Continuous data makes up the rest of numerical data. This is a type of data that is usually associated with some sort of physical measurement. Examples include height, weight, temperature, the amount of sugar in an orange, and the time required to run a mile.

Preliminaries: Categorical data

Categorical variables represent types of data which may be divided into groups. Examples of categorical variables are race, sex, age group, and educational level.

Preliminaries: Categorical data

While the latter two variables may also be considered in a numerical manner by using exact values for age and highest grade completed, it is often more informative to categorize such variables into a relatively small number of groups. This process is called discretization.
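As a small illustration of grouping a numerical variable into categories, the Python sketch below maps an exact age to an age group. The function name discretize_age and the cut points (18, 40, 65) are assumptions chosen for this example only.

```python
def discretize_age(age: int) -> str:
    """Map an exact age in years to an age group (cut points are assumed)."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age < 18:
        return "under 18"
    if age < 40:
        return "18-39"
    if age < 65:
        return "40-64"
    return "65 and over"


ages = [5, 17, 18, 39, 40, 64, 65, 90]
groups = [discretize_age(a) for a in ages]
print(list(zip(ages, groups)))
```

Different cut points give different groups, so in practice they would be chosen to suit the analysis at hand.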

Usefulness of Classification Model

A classification model is useful for the following purposes:


It may serve as an explanatory tool to distinguish between objects of different classes (Descriptive Modeling).

It may also be used to predict the class label of unknown records (Predictive Modeling). Consider the table below:

Usefulness of Classification Model

A classification model can be treated as a black box that automatically assigns a class label when presented with the attribute set of an unknown record. For example, you may be given the characteristics of a creature known as a gila monster.

Usefulness of Classification Model

By building a classification model from the data set shown in Table 1, you may use the model to determine the class to which the creature belongs. Classification models are most suited for predicting or describing data sets with binary or nominal target attributes.

Classification Techniques

Classification Technique
A classification technique is a systematic approach for building classification models from an input data set. Examples of classification techniques include:

Decision Tree Classifiers
Rule-Based Classifiers
Neural Networks
Support Vector Machines
Naive Bayes Classifiers
Nearest-Neighbor Classifiers

Classification Technique

Each technique employs a learning algorithm to identify a model that best fits the relationship between the attribute set and class label of the input data (produces outputs consistent with the class labels of the input data).
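To make the idea of a learning algorithm concrete, here is a minimal Python sketch of one of the simplest techniques in the list, a 1-nearest-neighbor classifier over discrete attributes. It is an illustration under stated assumptions, not the method used later in the course: the names hamming_distance and train_1nn, the attribute names, and the three training records are all made up for this example.

```python
def hamming_distance(a, b):
    """Count the attributes of a whose values differ from those in b."""
    return sum(1 for key in a if a[key] != b.get(key))


def train_1nn(training_records):
    """'Learn' a 1-nearest-neighbor model: store the records, return a predictor."""
    stored = list(training_records)

    def predict(x):
        # Return the label of the stored record closest to x
        # (ties go to the earliest stored record).
        _, nearest_label = min(stored, key=lambda rec: hamming_distance(rec[0], x))
        return nearest_label

    return predict


# Hypothetical training data: (attribute set x, class label y) pairs.
training = [
    ({"body_temperature": "warm-blooded", "skin_cover": "hair", "can_fly": "no"}, "mammal"),
    ({"body_temperature": "warm-blooded", "skin_cover": "feathers", "can_fly": "yes"}, "bird"),
    ({"body_temperature": "cold-blooded", "skin_cover": "scales", "can_fly": "no"}, "reptile"),
]

predict = train_1nn(training)
unseen = {"body_temperature": "cold-blooded", "skin_cover": "scales", "can_fly": "yes"}
print(predict(unseen))
```

On this toy data the model reproduces the class label of every training record, which is what "consistent with the class labels of the input data" means here.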

Classification Technique
A good classification model must predict correctly the class labels of records it has never seen before. Building models with good generalization capability, i.e., models that accurately predict the class labels of previously unseen records, is therefore a key objective of the learning algorithm.

General Approach to Solve a Classification Problem

A general strategy for solving a classification problem is as follows:

First, the input data is divided into two disjoint sets, known as the training set and the test set.

The training set will be used for building a classification model. The induced model is later applied to the test set to predict the class label of each test record.
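The division into training and test sets can be sketched in Python as follows. The function name split_train_test, the 30% test fraction, and the fixed random seed are assumptions for this illustration; the slides do not prescribe a particular split.

```python
import random


def split_train_test(records, test_fraction=0.3, seed=0):
    """Shuffle the records and divide them into disjoint training and test sets."""
    rng = random.Random(seed)   # fixed seed so the split is repeatable
    shuffled = list(records)    # copy, so the caller's list is not modified
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    test_set = shuffled[:n_test]
    training_set = shuffled[n_test:]
    return training_set, test_set


records = list(range(10))       # stand-in for 10 labelled records
training_set, test_set = split_train_test(records)
print(len(training_set), len(test_set))
```

Because every record goes to exactly one of the two sets, no test record is seen while the model is being built.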

Why are we dividing the data into two sets?

This strategy of dividing the data into independent training and test sets allows us to obtain an unbiased estimate of the performance of a model on previously unseen records. The figure on the next slide depicts this general approach.

General Approach to Solve a Classification Problem

Performance Measurement of Model

Evaluation of the performance of a classification model is based upon the number of test records predicted correctly and wrongly by the model. The counts are tabulated in a table known as a confusion matrix.

Performance Measurement of Model

Table 2 depicts the confusion matrix for a binary classification problem.

Performance Measurement of Model

Each entry fij in this table denotes the number of records from class i predicted to be of class j. For instance, f01 is the number of records from class 0 wrongly predicted as class 1.

Based on the entries in the confusion matrix, the total number of correct predictions made by the model is (f11 + f00) and the total number of wrong predictions is (f10 + f01).
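As a rough illustration of how the counts fij could be tabulated, the Python sketch below builds a binary confusion matrix from actual and predicted class labels. The function name confusion_matrix and the two label lists are made up for this example.

```python
def confusion_matrix(actual, predicted):
    """Return a dict f where f[(i, j)] counts records of class i predicted as class j."""
    f = {(i, j): 0 for i in (0, 1) for j in (0, 1)}
    for i, j in zip(actual, predicted):
        f[(i, j)] += 1
    return f


# Hypothetical test labels and model predictions (binary classes 0 and 1).
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

f = confusion_matrix(actual, predicted)
correct = f[(1, 1)] + f[(0, 0)]   # f11 + f00
wrong = f[(1, 0)] + f[(0, 1)]     # f10 + f01
print(f, correct, wrong)
```

Here the first index is the actual class and the second is the predicted class, matching the fij notation.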

Performance Measurement of Model

Although a confusion matrix provides the information needed to determine how well a classification model performs, it is useful to summarize this information in a single number. This makes it more convenient to compare the performance of different classification models.

Performance Measurement of Model


There are several performance metrics available for doing this. One of the most popular metrics is model accuracy, which is defined as:

$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} = \frac{f_{11} + f_{00}}{f_{11} + f_{10} + f_{01} + f_{00}}$$

Performance Measurement of Model


Equivalently, the performance of a model can be expressed in terms of its error rate, given by the following equation:

$$\text{Error rate} = \frac{\text{Number of wrong predictions}}{\text{Total number of predictions}} = \frac{f_{10} + f_{01}}{f_{11} + f_{10} + f_{01} + f_{00}}$$
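Accuracy and error rate can be computed directly from the four confusion matrix counts, as in the Python sketch below. The function names and the example counts are assumptions for illustration only.

```python
def accuracy(f11, f10, f01, f00):
    """Fraction of predictions that are correct: (f11 + f00) / total."""
    total = f11 + f10 + f01 + f00
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (f11 + f00) / total


def error_rate(f11, f10, f01, f00):
    """Fraction of predictions that are wrong: (f10 + f01) / total."""
    total = f11 + f10 + f01 + f00
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (f10 + f01) / total


# Hypothetical counts: 3 + 3 correct predictions, 1 + 1 wrong predictions.
acc = accuracy(f11=3, f10=1, f01=1, f00=3)
err = error_rate(f11=3, f10=1, f01=1, f00=3)
print(acc, err)
```

The two measures always sum to 1, so reporting either one is enough to compare models.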

Decision Trees
