Agenda
- Introduction
- Decision Tree Induction
- Statistical-Based Algorithms
- Distance-Based Algorithms
- Rule-Based Algorithms
Classification
- predicts categorical class labels
- constructs a model from the training set and the values (class labels) of a classifying attribute, then uses it to classify new data

Prediction
- models continuous-valued functions, i.e., predicts unknown or missing values

Typical applications
- Loan disbursement: is an applicant risky or safe?
- Medical diagnosis: which of Treatment A, Treatment B, or Treatment C to prescribe?
Model construction: describing a set of predetermined classes
- Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
- The set of tuples used for model construction is the training set
- The model is represented as classification rules, decision trees, or mathematical formulae

Model usage: classifying future or unknown objects (a code sketch follows the figure below)
- Estimate the accuracy of the model
  - The known label of each test sample is compared with the model's classification of that sample
  - The accuracy rate is the percentage of test-set samples correctly classified by the model
  - The test set is independent of the training set
- If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
Model construction -- Training Data (used to build the Classifier/Model):

  income   loan_decision
  low      risky
  low      risky
  high     safe
  low      risky
  low      safe
  medium   safe

Model usage -- Test Data (used to estimate accuracy):

  NAME    age          income   loan_decision
  Tom     senior       low      safe
  Crest   middle_aged  low      risky
  Yee     middle_aged  high     safe

Unseen Data: (Henry, middle_aged, low)
Loan_decision?  ->  risky
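A minimal sketch of this two-phase workflow, assuming pandas and scikit-learn (the slides name no particular library); only the income attribute is used, matching the training-data figure:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Phase 1 -- model construction from the labeled training set above.
train = pd.DataFrame({
    "income":        ["low", "low", "high", "low", "low", "medium"],
    "loan_decision": ["risky", "risky", "safe", "risky", "safe", "safe"],
})
X = pd.get_dummies(train[["income"]])   # one-hot encode the categorical attribute
y = train["loan_decision"]
model = DecisionTreeClassifier().fit(X, y)

# Phase 2 -- model usage: classify the unseen tuple (Henry, middle_aged, low).
unseen = pd.get_dummies(pd.DataFrame({"income": ["low"]}))
unseen = unseen.reindex(columns=X.columns, fill_value=0)
print(model.predict(unseen))            # expected: ['risky']
```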
Supervised learning (classification)
- Supervision: the training data are accompanied by labels indicating the class of each observation (e.g., risky, safe)
- New data is classified based on the training set

Unsupervised learning (clustering)
- The class labels of the training data are unknown
- Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Preparing the data for classification
- Data cleaning: preprocess the data to reduce noise and handle missing values
- Data transformation: generalize and/or normalize the data
Evaluating classification methods
- Accuracy (see the sketch after this list)
  - classifier accuracy: predicting class labels correctly
  - predictor accuracy: guessing the value of predicted attributes
- Speed
  - time to construct the model (training time)
  - time to use the model (classification/prediction time)
- Robustness: handling noise and missing values
- Scalability: efficiency for disk-resident databases
- Interpretability: the understanding and insight provided by the model
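A minimal sketch of the accuracy criterion: the fraction of test-set samples whose predicted label matches the known label (function name and data are illustrative):

```python
def accuracy(predicted_labels, true_labels):
    """Fraction of test samples classified correctly."""
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)

print(accuracy(["safe", "risky", "safe"], ["safe", "risky", "risky"]))  # ~0.67
```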
A decision tree is a flowchart-like tree structure:
- each non-leaf node denotes a test on an attribute
- each branch represents an outcome of the test
- each leaf node holds a class label
- the topmost node is the root
Example decision tree for buys_computer:

age?
├─ youth (<=30)         -> student?
│                           ├─ no  -> no
│                           └─ yes -> yes
├─ middle_aged (31..40) -> yes
└─ senior               -> credit_rating?
                            ├─ excellent -> no
                            └─ fair      -> yes
Basic algorithm
- The tree is constructed in a top-down, recursive, divide-and-conquer manner
- At the start, all the training examples are at the root
- Examples are partitioned recursively, based on selected attributes
- Test attributes are selected on the basis of a heuristic or statistical measure, e.g., information gain (a sketch follows under "DT Algorithm")
DT Algorithm
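A minimal Python sketch of the top-down, recursive, divide-and-conquer scheme described above, using information gain as the attribute-selection measure (function and variable names are illustrative):

```python
import math
from collections import Counter

def entropy(rows, target):
    """Expected information (bits) needed to classify a tuple in rows."""
    counts = Counter(row[target] for row in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(rows, attr, target):
    """Reduction in entropy obtained by partitioning rows on attr."""
    total = len(rows)
    remainder = 0.0
    for value in {row[attr] for row in rows}:
        subset = [row for row in rows if row[attr] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(rows, target) - remainder

def build_tree(rows, attributes, target):
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1:            # pure partition -> leaf with that class
        return labels[0]
    if not attributes:                   # no tests left -> majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(rows, a, target))
    branches = {}
    for value in {row[best] for row in rows}:   # one branch per test outcome
        subset = [row for row in rows if row[best] == value]
        rest = [a for a in attributes if a != best]
        branches[value] = build_tree(subset, rest, target)
    return {best: branches}
```

Called as build_tree(rows, ["age", "student", "credit_rating"], "buys_computer") on a list of dicts, it returns a nested dictionary mirroring the tree shown earlier.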
Issues in decision tree induction
- Choosing splitting attributes: which attributes are used for splitting impacts performance; e.g., age or credit_rating are useful, while a student's name is not
- Ordering of splitting attributes: the order in which attributes are chosen is important; in the example, age is chosen first, then student and credit_rating
- Splits: the number of splits to take place
- Tree structure: a balanced tree with the fewest levels is desirable
- Training data: the structure of the tree depends on the training data; if the training set is very small, the generated tree may not be general enough to work properly, and if it is too large, the tree may overfit and fail to generalize to future cases
- Pruning: once a tree is constructed, modifications may be needed to improve its performance; the pruning phase removes redundant comparisons or prunes subtrees
Statistical-Based Algorithms
- Regression
- Bayesian classification
Regression
Regression deals with estimating an output value from input values. When regression is used for classification, the input values are attribute values from the database D and the output values represent the classes. If we know the input parameters x1, x2, ..., xn, the relationship between the output parameter y and the inputs can be modeled as

    y = c0 + c1*x1 + c2*x2 + ... + cn*xn

where c0, c1, ..., cn are the regression coefficients.
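A minimal sketch of estimating the coefficients c0..cn by least squares with NumPy (the fitting method and the data are illustrative assumptions; the slides prescribe neither):

```python
import numpy as np

# Illustrative data generated from y = 1 + x1 + 2*x2.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])  # rows of (x1, x2)
y = np.array([6.0, 5.0, 12.0, 11.0])

A = np.column_stack([np.ones(len(X)), X])       # prepend a 1-column for the intercept c0
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)  # minimizes ||A c - y||^2
print(coeffs)                                   # [1. 1. 2.] -> y = 1 + 1*x1 + 2*x2
```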
Division: the regression function is used to divide the data space into regions, one per class.
Prediction: the regression function is used to predict a class membership value for a new item.
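A hedged sketch of one common way to realize this for two classes: encode the classes as 0 and 1, fit a regression function, and threshold the predicted membership value at 0.5 (data are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 6.0, 7.0, 8.0])  # single input attribute
y = np.array([0, 0, 0, 1, 1, 1])              # class A -> 0, class B -> 1

c1, c0 = np.polyfit(x, y, 1)                  # fit y = c0 + c1*x
predict = lambda v: "B" if c0 + c1 * v >= 0.5 else "A"
print(predict(2.5), predict(6.5))             # A B
```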
Bayesian classifiers
- A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
- Foundation: based on Bayes' theorem
- Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable to decision tree and selected neural network classifiers
- Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct, so prior knowledge can be combined with observed data
- Let X be a data sample ("evidence"); its class label is unknown
- Let H be the hypothesis that X belongs to class C
- Classification is to determine P(H|X), the probability that the hypothesis holds given the observed data sample X
- P(H) (prior probability of H): the initial probability of the hypothesis, before X is observed
- P(X) (prior probability of X): the probability that the sample data is observed, e.g., that a person from our set of customers is 35 years old and earns $40,000
- P(X|H) (likelihood): the probability of observing sample X given that the hypothesis holds; e.g., given that X will buy a computer, the probability that X is aged 31..40 with medium income
Bayes' Theorem

Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:

    P(H|X) = P(X|H) P(H) / P(X)
Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-dimensional attribute vector X = (x1, x2, ..., xn). Suppose there are m classes C1, C2, ..., Cm. Classification derives the maximum posterior, i.e., the maximal P(Ci|X). By Bayes' theorem,

    P(Ci|X) = P(X|Ci) P(Ci) / P(X)

Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized.
Classes:
  C1: buys_computer = yes
  C2: buys_computer = no

Data sample to classify:
  X = (age = <=30, income = medium, student = yes, credit_rating = fair)

Training data:

  age      income   student   credit_rating   buys_computer
  <=30     high     no        fair            no
  <=30     high     no        excellent       no
  31..40   high     no        fair            yes
  >40      medium   no        fair            yes
  >40      low      yes       fair            yes
  >40      low      yes       excellent       no
  31..40   low      yes       excellent       yes
  <=30     medium   no        fair            no
  <=30     low      yes       fair            yes
  >40      medium   yes       fair            yes
  <=30     medium   yes       excellent       yes
  31..40   medium   no        excellent       yes
  31..40   high     yes       fair            yes
  >40      medium   no        excellent       no
P(Ci):
  P(buys_computer = yes) = 9/14 = 0.643
  P(buys_computer = no)  = 5/14 = 0.357

P(Xk|Ci):
  P(age = <=30           | buys_computer = yes) = 2/9 = 0.222
  P(age = <=30           | buys_computer = no)  = 3/5 = 0.600
  P(income = medium      | buys_computer = yes) = 4/9 = 0.444
  P(income = medium      | buys_computer = no)  = 2/5 = 0.400
  P(student = yes        | buys_computer = yes) = 6/9 = 0.667
  P(student = yes        | buys_computer = no)  = 1/5 = 0.200
  P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
  P(credit_rating = fair | buys_computer = no)  = 2/5 = 0.400

P(X|Ci):
  P(X | buys_computer = yes) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
  P(X | buys_computer = no)  = 0.600 x 0.400 x 0.200 x 0.400 = 0.019

P(X|Ci) P(Ci):
  P(X | buys_computer = yes) P(buys_computer = yes) = 0.044 x 0.643 = 0.028
  P(X | buys_computer = no)  P(buys_computer = no)  = 0.019 x 0.357 = 0.007

Therefore, X belongs to class C1 (buys_computer = yes).
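A minimal Python sketch reproducing this hand computation from the 14-tuple table above (no smoothing, matching the calculation shown):

```python
# (age, income, student, credit_rating, buys_computer) -- the table above.
data = [
    ("<=30", "high", "no", "fair", "no"),         ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),      (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),         (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),        (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),     (">40", "medium", "no", "excellent", "no"),
]
X = ("<=30", "medium", "yes", "fair")            # the sample to classify

def score(label):
    rows = [r for r in data if r[4] == label]
    prior = len(rows) / len(data)                # P(Ci), e.g. 9/14 for "yes"
    likelihood = 1.0
    for i, value in enumerate(X):                # naive independence assumption
        likelihood *= sum(r[i] == value for r in rows) / len(rows)
    return prior * likelihood                    # P(X|Ci) * P(Ci)

print(max(["yes", "no"], key=score))             # -> yes (0.028 vs 0.007)
```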
Advantages
- Easy to implement
- Good results obtained in most cases

Disadvantages
- Assumes class-conditional independence, which causes a loss of accuracy, because in practice dependencies do exist among variables
Rule-based classifiers represent the knowledge in the form of IF-THEN rules, r = <a, c>, where a is the antecedent (condition) and c is the consequent (conclusion). For example:

R: IF age = youth AND student = yes THEN buys_computer = yes
Rules are generated by many techniques, such as decision trees and neural networks.
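A minimal sketch of the r = <a, c> representation above, with the antecedent as a set of attribute-value conditions and the consequent as a class assignment (all names are illustrative):

```python
# Rule R: IF age = youth AND student = yes THEN buys_computer = yes
rule = ({"age": "youth", "student": "yes"},   # antecedent a
        ("buys_computer", "yes"))             # consequent c

def antecedent_holds(rule, tuple_):
    """True if every condition in the antecedent matches the tuple."""
    antecedent, _ = rule
    return all(tuple_.get(attr) == value for attr, value in antecedent.items())

X = {"age": "youth", "student": "yes", "credit_rating": "fair"}
if antecedent_holds(rule, X):
    attr, value = rule[1]
    print(attr, "=", value)                   # buys_computer = yes
```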
Items mapped to the same class are more similar to the other items in that class than to items in other classes, so distance (or similarity) measures may be used to identify the alikeness of different items in the database. Each item is placed in the class to which it is closest; this requires determining the distance between an item and a class (a sketch follows below).
Distance Based
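A minimal sketch of this idea for numeric attributes: represent each class by the centroid of its members and place an item in the class whose centroid is closest under Euclidean distance (the centroid representation and the data are illustrative assumptions):

```python
import math

classes = {
    "A": [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0)],
    "B": [(8.0, 8.0), (9.0, 8.0), (8.0, 9.0)],
}

def centroid(points):
    """Component-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def classify(item):
    # Distance between an item and a class = distance to the class centroid.
    return min(classes, key=lambda c: math.dist(item, centroid(classes[c])))

print(classify((2.0, 2.0)))  # A -- closest to class A's centroid
```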
K nearest neighbors (KNN) is a common classification scheme. KNN classifiers are lazy learners, since they learn from the neighbors only when a new item must be classified. The training set includes the data along with their classes. To classify a new item, examine the K items in the training set closest to it; the new item is placed in the class to which most of those K closest items belong (a sketch follows under "KNN Algorithm").
KNN
K = 3: the three closest items in the training set to a new item T are examined; T will be placed in the class to which most of them belong.
KNN Algorithm
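A minimal Python sketch of the scheme just described, using Euclidean distance and a majority vote among the K closest training items (data are illustrative):

```python
import math
from collections import Counter

def knn_classify(training, item, k=3):
    """training: list of (point, class_label) pairs."""
    neighbors = sorted(training, key=lambda t: math.dist(t[0], item))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]   # class held by most of the k closest items

training = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
            ((8.0, 8.0), "B"), ((8.5, 9.0), "B")]
print(knn_classify(training, (1.2, 1.3)))   # A
```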