Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends. Such analysis can help provide us with a better understanding of the data at large.
Classification
Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data (i.e., data objects whose class label is known).
What is Classification
Examples include classifying email messages as spam or non-spam based upon the message header and content, and classifying galaxies based upon their shapes.
Classification can provide valuable support for informed decision making in an organisation. For example, suppose a mobile phone company would like to promote a new cellphone product to the public. Instead of mass-mailing the promotional catalog to everyone, the company may be able to reduce the campaign cost by targeting only a small segment of the population. It may classify each person as a potential buyer or non-buyer based on personal information such as income, occupation, lifestyle, and credit rating.
For example, we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to predict the expenditures in dollars of potential customers on computer equipment given their income and occupation.
An example application
An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc.) of newly admitted patients. A decision is needed: whether to put a new patient in the intensive-care unit (ICU). Due to the high cost of the ICU, those patients who may survive less than a month are given higher priority. Problem: to predict high-risk patients and discriminate them from low-risk patients.
Another application
A credit card company receives thousands of applications for new cards. Each application contains information about an applicant: age, marital status, annual salary, outstanding debts, credit rating, etc.
Problem: to decide whether an application should be approved, or to classify applications into two categories, approved and not approved.
Machine learning is like human learning from past experiences. A computer does not have experiences, so a computer system learns from data, which represent some past experiences of an application domain.
Our focus: learn a target function that can be used to predict the values of a discrete class attribute, e.g., approved or not approved, and high-risk or low-risk.
The task is commonly called supervised learning, classification, or inductive learning.
Data: A set of data records (also called examples, instances or cases) described by k attributes and a class label.
Goal: To learn a classification model from the data that can be used to predict the classes of new (future, or test) cases/instances.
Example (loan applications, each labeled approved or not): learn a classification model from the data, then use the model to classify future loan applications into Yes (approved) and No (not approved).
Supervised learning: the data (observations, measurements, etc.) are labeled with predefined classes, as if a teacher had supplied the classes (supervision). Test data are classified into these classes too.
Unsupervised learning (clustering): the class labels of the data are unknown. Given a set of data, the task is to establish the existence of classes or clusters in the data.
Preliminaries
The input data for a classification task is given as a collection of records. Each record, also known as an instance or example, is characterised by a tuple (x, y), where x is the attribute set and y is the class label.
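As a minimal sketch, a record (x, y) can be represented in Python as a pair made of an attribute dictionary and a class label. The attribute names and the two records below are illustrative assumptions, not the contents of any table in these slides.

```python
from typing import Dict, List, Tuple

# A record is a pair (x, y): x is the attribute set, y is the class label.
Record = Tuple[Dict[str, str], str]

# Hypothetical records, for illustration only.
records: List[Record] = [
    ({"body_temperature": "warm-blooded", "gives_birth": "yes", "can_fly": "no"}, "mammal"),
    ({"body_temperature": "warm-blooded", "gives_birth": "no", "can_fly": "yes"}, "bird"),
]


def split_record(record: Record) -> Tuple[Dict[str, str], str]:
    """Return the attribute set x and the class label y of a record."""
    x, y = record
    return x, y


x, y = split_record(records[0])
```

Keeping x and y separate in this way makes it easy to hand only the attribute set to a model when the class label is unknown.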
Table 1. Vertebrate Data Set
Table 1 shows a sample data set used for classifying vertebrates into one of the following categories: mammal, bird, fish, reptile, or amphibian. The attribute set includes properties of a vertebrate such as its body temperature, skin cover, method of reproduction, ability to fly, and ability to live in water.
The attribute set may contain both discrete and continuous features, although the attribute set in Table 1 contains mostly discrete values. The class label, on the other hand, must be a discrete attribute. This is the key characteristic that distinguishes classification from regression, another predictive modeling task in which y is a continuous attribute.
A type of data is discrete if only a finite number of values are possible, or if there is a gap on the number line between every two possible values.
Example: a five-question quiz is given in a math class. The number of correct answers on a student's quiz is an example of discrete data. The number of correct answers has to be one of the following: 0, 1, 2, 3, 4, or 5. There is not an infinite number of values, therefore this data is discrete.
Also, if we were to draw a number line and place each possible value on it, we would see a space between each pair of values. Discrete data usually occurs in a case where there are only a certain number of values, or when we are counting something (using whole numbers).
Categorical variables represent types of data which may be divided into groups. Examples of categorical variables are race, sex, age group, and educational level.
While the latter two variables may also be treated numerically by using exact values for age and highest grade completed, it is often more informative to group such variables into a relatively small number of categories (discretization).
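Discretizing an exact age into an age group can be sketched as follows. The group names and the cut points (18 and 65) are hypothetical and chosen only for illustration; a real application would pick boundaries that suit its data.

```python
from typing import List


def discretize_age(age: int) -> str:
    """Map an exact age in years to a coarse age group.

    The cut points (18 and 65) are assumptions made for this example.
    """
    if age < 0:
        raise ValueError("age must be non-negative")
    if age < 18:
        return "child"
    if age < 65:
        return "adult"
    return "senior"


ages: List[int] = [4, 17, 18, 42, 65, 80]
groups: List[str] = [discretize_age(a) for a in ages]
```

After this step, age behaves like a categorical attribute with three possible values instead of a numeric one.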
A classification model may serve as an explanatory tool to distinguish between objects of different classes (descriptive modeling). It may also be used to predict the class label of unknown records (predictive modeling). Consider the table below:
A classification model can be treated as a black box that automatically assigns a class label when presented with the attribute set of an unknown record. For example, suppose you are given the characteristics of a creature known as a gila monster.
By building a classification model from the data set shown in Table 2, you may use the model to determine the class to which the creature belongs. Classification models are most suited for predicting or describing data sets with binary or nominal target attributes.
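The black-box view can be sketched with a very small nearest-neighbor classifier: it receives an attribute set and returns the class label of the most similar training record. The technique is chosen here only for brevity, and the training records and attribute names are illustrative assumptions, not the contents of the tables in these slides.

```python
from typing import Dict, List, Tuple

Attributes = Dict[str, str]

# Hypothetical training records (attribute set, class label), for illustration only.
TRAINING: List[Tuple[Attributes, str]] = [
    ({"body_temp": "warm", "skin": "hair", "gives_birth": "yes", "aquatic": "no", "flies": "no"}, "mammal"),
    ({"body_temp": "warm", "skin": "feathers", "gives_birth": "no", "aquatic": "no", "flies": "yes"}, "bird"),
    ({"body_temp": "cold", "skin": "scales", "gives_birth": "no", "aquatic": "yes", "flies": "no"}, "fish"),
    ({"body_temp": "cold", "skin": "scales", "gives_birth": "no", "aquatic": "no", "flies": "no"}, "reptile"),
    ({"body_temp": "cold", "skin": "none", "gives_birth": "no", "aquatic": "semi", "flies": "no"}, "amphibian"),
]


def mismatches(a: Attributes, b: Attributes) -> int:
    """Count the attributes on which two records differ (Hamming distance)."""
    return sum(1 for key in a if a[key] != b.get(key))


def classify(x: Attributes) -> str:
    """Black box: return the label of the training record closest to x."""
    _, label = min(TRAINING, key=lambda rec: mismatches(x, rec[0]))
    return label


# Attribute set of the unknown record (a gila monster); its label is not given.
gila_monster: Attributes = {
    "body_temp": "cold", "skin": "scales", "gives_birth": "no", "aquatic": "no", "flies": "no",
}
predicted = classify(gila_monster)
```

With these illustrative records, the closest match to the gila monster is the reptile record, so the sketch predicts reptile.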
Classification Techniques
Classification Technique
A classification technique is a systematic approach for building classification models from an input data set. Examples of classification techniques include:
Decision Tree Classifiers
Rule-Based Classifiers
Neural Networks
Support Vector Machines
Naive Bayes Classifiers
Nearest-Neighbor Classifiers
Each technique employs a learning algorithm to identify a model that best fits the relationship between the attribute set and class label of the input data (produces outputs consistent with the class labels of the input data).
A good classification model must predict correctly the class labels of records it has never seen before. Building models with good generalization capability, i.e., models that accurately predict the class labels of previously unseen records, is therefore a key objective of the learning algorithm.
To measure generalization, the input data is divided into two disjoint sets, known as the training set and the test set.
The training set will be used for building a classification model. The induced model is later applied to the test set to predict the class label of each test record.
This strategy of dividing the data into independent training and test sets allows us to obtain an unbiased estimate of the performance of a model on previously unseen records. The figure on the next slide depicts this strategy.
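A minimal sketch of dividing the input data into disjoint training and test sets is shown here. The function name, the 30% test fraction, and the fixed random seed are assumptions made for the example; the slides do not prescribe a particular split.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_test_split(
    records: Sequence[T], test_fraction: float = 0.3, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle the records and split them into disjoint training and test sets."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in records: the integers 0..9 play the role of (x, y) tuples.
records = list(range(10))
train, test = train_test_split(records, test_fraction=0.3, seed=42)
```

The model would be built from `train` only, and `test` would be held back until evaluation, so that no test record influences the induced model.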
Evaluation of the performance of a classification model is based upon the number of test records predicted correctly and wrongly by the model. The counts are tabulated in a table known as a confusion matrix.
Each entry fij in this table denotes the number of records from class i predicted to be of class j. For instance, f01 is the number of records from class 0 wrongly predicted as class 1. Based on the entries in the confusion matrix, the total number of correct predictions made by the model is (f11 + f00) and the total number of wrong predictions is (f10 + f01).
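The tabulation can be sketched for a two-class problem as follows. The actual and predicted labels are made up for illustration, and the dictionary keyed by (i, j) is only one possible way to store the entries fij.

```python
from typing import Dict, Sequence, Tuple


def confusion_matrix(
    actual: Sequence[int], predicted: Sequence[int]
) -> Dict[Tuple[int, int], int]:
    """Return counts f[(i, j)]: records of actual class i predicted as class j."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    f = {(i, j): 0 for i in (0, 1) for j in (0, 1)}
    for i, j in zip(actual, predicted):
        f[(i, j)] += 1
    return f


# Hypothetical test-set labels and model predictions.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

f = confusion_matrix(actual, predicted)
correct = f[(1, 1)] + f[(0, 0)]  # f11 + f00
wrong = f[(1, 0)] + f[(0, 1)]    # f10 + f01
```

For these eight made-up records, six predictions agree with the actual class and two do not.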
Although a confusion matrix provides the information needed to determine how well a classification model performs, it is useful to summarize this information in a single number. This makes it more convenient to compare the performance of different classification models.
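One common single-number summary is accuracy, the fraction of correct predictions, together with its complement, the error rate. The sketch below assumes a two-class confusion matrix with entries f11, f10, f01, and f00; the counts used in the example are hypothetical.

```python
def accuracy(f11: int, f10: int, f01: int, f00: int) -> float:
    """Fraction of records predicted correctly: (f11 + f00) / total."""
    total = f11 + f10 + f01 + f00
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (f11 + f00) / total


def error_rate(f11: int, f10: int, f01: int, f00: int) -> float:
    """Fraction of records predicted wrongly: (f10 + f01) / total."""
    total = f11 + f10 + f01 + f00
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (f10 + f01) / total


# Hypothetical counts: 6 correct and 2 wrong predictions out of 8.
acc = accuracy(f11=3, f10=1, f01=1, f00=3)
err = error_rate(f11=3, f10=1, f01=1, f00=3)
```

Because every record is either predicted correctly or wrongly, the two values always add up to 1, so reporting either one is enough to compare models on the same test set.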
Decision Trees