Data Mining
Data mining is the process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of data stored in repositories, using pattern-recognition technologies together with statistical and mathematical techniques.
Outlier analysis
Outlier: a data object that does not comply with the general behavior of the data.
Outliers can be treated as noise or exceptions, but they are quite useful in fraud detection and rare-event analysis.
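A minimal sketch of one way to flag such objects (an illustration; the slides name no particular method): mark values that lie far from the mean, measured in standard deviations.

```python
# Z-score outlier sketch: flag values more than `threshold` standard
# deviations from the mean. The threshold of 2.0 is a loose choice here,
# since a single extreme value also inflates the standard deviation.
def zscore_outliers(values, threshold=2.0):
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [v for v in values if std > 0 and abs(v - mean) / std > threshold]

data = [10, 12, 11, 13, 9, 11, 250]   # 250 is the injected anomaly
print(zscore_outliers(data))          # -> [250]
```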
Classification:
predicts categorical class labels
Prediction:
models continuous-valued functions, i.e., predicts unknown or missing values
Training Data:

NAME | RANK           | YEARS | Ind. Chrg.
Mike | Assistant Prof | 3     | no
Mary | Assistant Prof | 7     | yes
Bill | Professor      | 2     | yes
Jim  | Associate Prof | 7     | yes
Dave | Assistant Prof | 6     | no
Anne | Associate Prof | 3     | no

Classification Algorithms produce a Classifier (Model):

IF rank = 'professor' OR years > 6
THEN Ind. Chrg. = 'yes'
Test Data:

NAME    | RANK           | YEARS | Ind. Chrg.
Tom     | Assistant Prof | 2     | no
Merlisa | Associate Prof | 7     | no
George  | Professor      | 5     | yes
Joseph  | Assistant Prof | 7     | yes

Unseen Data: (Jeff, Professor, 4) → Ind. Chrg.?
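The two-step process above can be sketched directly in code: encode the learned rule as a function, estimate its accuracy on the test tuples, then classify the unseen tuple.

```python
# The rule learned from the training data (taken from the slide):
# IF rank = professor OR years > 6 THEN Ind. Chrg. = yes
def ind_chrg(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

test_data = [("Tom", "Assistant Prof", 2, "no"),
             ("Merlisa", "Associate Prof", 7, "no"),
             ("George", "Professor", 5, "yes"),
             ("Joseph", "Assistant Prof", 7, "yes")]

# Estimate accuracy on the test set (the rule misclassifies Merlisa).
correct = sum(ind_chrg(rank, years) == label
              for _, rank, years, label in test_data)
print(correct / len(test_data))     # -> 0.75

# Classify the unseen tuple (Jeff, Professor, 4).
print(ind_chrg("Professor", 4))     # -> yes
```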
Data transformation
Generalize and/or normalize data
[Figure: an example decision tree that tests income group and age (e.g., income group 3 or 4, age < 41) to predict which film a customer saw: "Saw Birdcage" vs. "Saw Courage Under Fire".]
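The normalization step mentioned above can be sketched as min-max scaling (one common choice; the slide does not specify a method): rescale a numeric attribute such as income to [0, 1] so that attributes on large scales do not dominate.

```python
# Min-max normalization: map the smallest value to new_min and the
# largest to new_max, interpolating linearly in between.
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

incomes = [30_000, 45_000, 60_000, 120_000]
print(min_max_normalize(incomes))   # 30,000 -> 0.0, 120,000 -> 1.0
```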
A decision tree is a tree structure:
Internal nodes denote a test on an attribute
Branches represent the outcomes of the test
Leaf nodes represent class labels
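One simple in-memory representation of that structure (an illustrative sketch, not taken from the slides): internal nodes as dicts holding the attribute tested and a branch per outcome, leaves as bare class labels.

```python
# Internal node: {"attribute": ..., "branches": {outcome: subtree}}
# Leaf: a plain class-label string.
tree = {
    "attribute": "age",
    "branches": {
        "<=30": "no",       # leaf = class label
        "31..40": "yes",
        ">40": "yes",
    },
}

def predict(node, example):
    # Walk from the root, following the branch matching the example's
    # attribute value, until a leaf (a bare label) is reached.
    while isinstance(node, dict):
        node = node["branches"][example[node["attribute"]]]
    return node

print(predict(tree, {"age": "31..40"}))   # -> yes
```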
Training Dataset (class label: buys_computer)

age    | buys_computer
<=30   | no
<=30   | no
31..40 | yes
>40    | yes
>40    | yes
>40    | no
31..40 | yes
<=30   | no
<=30   | yes
>40    | yes
<=30   | yes
31..40 | yes
31..40 | yes
>40    | no
[Figure: the resulting decision tree. The root tests age: the 30..40 branch leads to "yes"; the >40 branch tests credit rating? with "excellent" leading to "no" and "fair" leading to "yes".]
If a set contains p examples of class P and n examples of class N, the expected information needed to classify a sample is

I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}

If attribute A partitions the set into subsets containing p_i examples of P and n_i examples of N, the expected information (entropy) of the split is

E(A) = \sum_{i=1}^{\nu} \frac{p_i + n_i}{p+n}\, I(p_i, n_i)

and the information gained by branching on A is

Gain(A) = I(p, n) - E(A)
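The three formulas above translate directly into code; applying them to the buys_computer data (p = 9 "yes", n = 5 "no", with age producing the partitions (2,3), (4,0), (3,2)) reproduces the numbers worked out below.

```python
from math import log2

def I(p, n):
    """Expected information for a set with p positive, n negative examples."""
    if p == 0 or n == 0:
        return 0.0
    t = p + n
    return -(p / t) * log2(p / t) - (n / t) * log2(n / t)

def E(partitions, p, n):
    """Entropy of a split: partitions is a list of (p_i, n_i) pairs."""
    return sum((pi + ni) / (p + n) * I(pi, ni) for pi, ni in partitions)

def gain(partitions, p, n):
    return I(p, n) - E(partitions, p, n)

age_partitions = [(2, 3), (4, 0), (3, 2)]
print(round(I(9, 5), 3))                      # -> 0.94
print(round(E(age_partitions, 9, 5), 3))      # -> 0.694
print(round(gain(age_partitions, 9, 5), 3))   # -> 0.247
```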
E(age) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.69
Class P: buys_computer = "yes"; Class N: buys_computer = "no"
Hence Gain(age) = I(p, n) - E(age)

age    | p_i | n_i | I(p_i, n_i)
<=30   | 2   | 3   | 0.971
31..40 | 4   | 0   | 0
>40    | 3   | 2   | 0.971
If a node T is split into two subsets T_1 and T_2 with N_1 and N_2 examples respectively (N = N_1 + N_2):

gini_{split}(T) = \frac{N_1}{N}\, gini(T_1) + \frac{N_2}{N}\, gini(T_2)

The attribute that provides the smallest gini_{split}(T) is chosen to split the node.
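A short sketch of that criterion, assuming the standard Gini index gini(T) = 1 - Σ_j p_j² over the class proportions p_j (the slide uses gini(T) without restating its definition):

```python
def gini(labels):
    """Gini index of a set of class labels: 1 - sum of squared proportions."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_split(t1, t2):
    """Size-weighted Gini of a binary split into subsets t1 and t2."""
    n = len(t1) + len(t2)
    return len(t1) / n * gini(t1) + len(t2) / n * gini(t2)

# A pure split scores 0; the candidate with the smallest value wins.
print(gini_split(["yes", "yes"], ["no", "no"]))   # -> 0.0
print(gini_split(["yes", "no"], ["yes", "no"]))   # -> 0.5
```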
Boosting
Boosting increases classification accuracy and is applicable to decision trees.
Learn a series of classifiers, where each
classifier in the series pays more attention to the
examples misclassified by its predecessor
Boosting requires only linear time and constant
space
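The "pay more attention to misclassified examples" idea above can be sketched as an AdaBoost-style reweighting loop over weak learners (an assumption for illustration; the slide does not name a specific boosting algorithm):

```python
import math

def boost(X, y, weak_learners, rounds=3):
    """Fit an ensemble: each round reweights toward misclassified examples."""
    n = len(y)
    w = [1.0 / n] * n                    # start with uniform example weights
    ensemble = []                        # list of (alpha, weak learner)
    for _ in range(rounds):
        def weighted_error(h):
            return sum(wi for wi, xi, yi in zip(w, X, y) if h(xi) != yi)
        h = min(weak_learners, key=weighted_error)   # best weak learner
        err = weighted_error(h)
        if err == 0 or err >= 0.5:       # perfect or useless: stop here
            ensemble.append((1.0, h))
            break
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, h))
        # Raise weights of misclassified examples, lower the rest, renormalize.
        w = [wi * math.exp(alpha if h(xi) != yi else -alpha)
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    def predict(x):                      # weighted majority vote
        score = sum(a * (1 if h(x) == 1 else -1) for a, h in ensemble)
        return 1 if score >= 0 else 0
    return predict

# Toy data: label is 1 when x > 2; stumps threshold at different points.
X, y = [1, 2, 3, 4], [0, 0, 1, 1]
stumps = [lambda x, t=t: 1 if x > t else 0 for t in (1, 2, 3)]
clf = boost(X, y, stumps)
print([clf(x) for x in X])   # -> [0, 0, 1, 1]
```

Each weak learner here is a hypothetical threshold stump; in the setting the slide describes, the weak learners would be small decision trees.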