Professional Documents
Culture Documents
Course Outline
1. Pengenalan Data Mining
2. Proses Data Mining
3. Evaluasi dan Validasi pada Data Mini
ng
4. Metode dan Algoritma Data Mining
5. Penelitian Data Mining
Simplicity first
Simple algorithms often work very well!
There are many kinds of simple structure, eg:
Basic version
One branch for each value
Each branch assigns most frequent class
Error rate: proportion of instances that dont bel
ong to the majority class of their corresponding b
ranch
Choose attribute with lowest error rate
(assumes nominal attributes)
Pseudo-code for 1R
For each attribute,
For each value of the attribute, make a
rule as follows:
count how often each class appears
find the most frequent class
make the rule assign that class to this
attribute-value
Calculate the error rate of the rules
Choose the rules with the smallest error
rate
Outlook
Temp
Humidit
y
Wind
y
Play
Sunny
Hot
High
False
No
Sunny
Hot
High
True
No
Overcast
Hot
High
False
Yes
Rainy
Mild
High
False
Yes
Rainy
Cool
Normal
False
Yes
Rainy
Cool
Normal
True
No
Overcast
Cool
Normal
True
Yes
Sunny
Mild
High
False
No
Sunny
Cool
Normal
False
Yes
Rainy
Mild
Normal
False
Yes
Sunny
Mild
Normal
True
Yes
Overcast
Mild
High
True
Yes
Overcast
Hot
Normal
False
Yes
Rainy
Mild
High
True
No
Attribute
Rules
Error
s
Outlook
Sunny No
2/5
Overcast Yes
0/4
Rainy Yes
2/5
Hot No*
2/4
Mild Yes
2/6
Cool Yes
1/4
High No
3/7
Normal Yes
1/7
False Yes
2/8
True No*
3/6
Temp
Humidity
Windy
Total
error
s
4/14
5/14
4/14
5/14
71
72
72
75
75
80
81
Yes |
Humidity
Windy
Play
Sunny
Temperatur
e
85
85
False
No
Sunny
80
90
True
No
Overcast
83
86
False
Yes
Rainy
75
80
False
Yes
83
No
65
No
Yes Yes | No
68
69 70
Yes Yes Yes | No
71 72
No Yes
72
75
75
Yes Yes | No
Yes |
80
81
Yes Yes
85
No
83
No
85
Temperature
Humidity
Windy
Errors
Total errors
Sunny No
2/5
4/14
Overcast Yes
0/4
Rainy Yes
2/5
77.5 Yes
3/10
2/4
82.5 Yes
1/7
2/6
0/1
False Yes
2/8
True No*
3/6
5/14
3/14
5/14
Discussion of 1R
1R was described in a paper by Holte (1993)
Contains an experimental evaluation on 16 datas
ets (using cross-validation so that results were re
presentative of performance on future data)
Minimum number of instances was set to 6 after
some experimentation
1Rs simple rules performed not much worse tha
n much more complex decision trees
Statistical modeling
Opposite of 1R: use all the attributes
Two assumptions: Attributes are
equally important
statistically independent (given the class value)
I.e., knowing the value of one attribute says nothing ab
out the value of another (if the class is known)
Outlook
Temperature
Yes
Humidity
Yes
No
No
Sunny
Hot
Overcast
Mild
Rainy
Cool
Sunny
2/9
3/5
Hot
2/9
2/5
Overcast
4/9
0/5
Mild
4/9
2/5
Rainy
3/9
2/5
Cool
3/9
1/5
Windy
Yes
No
High
Normal
High
Normal
Play
Yes
No
Yes
No
False
True
3/9
4/5
False
6/9
2/5
6/9
1/5
True
3/9
3/5
9/
14
5/
14
Outlook
Temp
Humidity
Windy
Play
Sunny
Hot
High
False
No
Sunny
Hot
High
True
No
Overcast
Hot
High
False
Yes
Rainy
Mild
High
False
Yes
Rainy
Cool
Normal
False
Yes
Rainy
Cool
Normal
True
No
Overcast
Cool
Normal
True
Yes
Sunny
Mild
High
False
No
Sunny
Cool
Normal
False
Yes
Rainy
Mild
Normal
False
Yes
Sunny
Mild
Normal
True
Yes
Overcast
Mild
High
True
Yes
Overcast
Hot
Normal
False
Yes
Rainy
Mild
High
True
No
Outlook
Temperature
Yes
Humidity
Yes
No
Sunny
Hot
Overcast
Mild
Rainy
Cool
Sunny
2/9
3/5
Hot
2/9
2/5
Overcast
4/9
0/5
Mild
4/9
2/5
Rainy
3/9
2/5
Cool
3/9
1/5
A new day:
No
Windy
Yes
No
High
Normal
High
Normal
Outlook
Temp.
Sunny
Cool
Play
Yes
No
Yes
No
False
True
3/9
4/5
False
6/9
2/5
6/9
1/5
True
3/9
3/5
9/
14
5/
14
Humidit
y
High
Windy
Play
True
Bayess rule
Probability of event H given evidence E:
A priori probability of H :
Probability of event before evidence is seen
A posteriori probability of H :
Temp.
Sunny
Cool
Humidit
y
High
Windy Play
True
EvidenceE
Probabilityof
classyes
Cognitive Assignment II
1. Pahami dan kuasai satu algoritma data
mining dari berbagai literatur
2. Rangkumkan dengan detail dalam bentu
k slide, dengan format:
1. Definisi
2. Tahapan Algoritma
3. Penerapan Tahapan Algoritma untuk Dataset Main
Golf atau Iris
4. Code Java dari Algoritma
Pilihan Algoritma
1. Neural Network
2. Logistic Regression
3. Support Vector Mach
ine
4. K-Means
5. K-Nearest Neighbor
6. Self-Organizing Map
7. Linear Regression
8. Nave Bayes
9. FP-Growth
Referensi
1. Ian H. Witten, Frank Eibe, Mark A. Hall, Data mining: Practical
Machine Learning Tools and Techniques 3rd Edition, Elsevie
r, 2011
2. Daniel T. Larose, Discovering Knowledge in Data: an Introduc
tion to Data Mining, John Wiley & Sons, 2005
3. Florin Gorunescu, Data Mining: Concepts, Models and Techn
iques, Springer, 2011
4. Jiawei Han and Micheline Kamber, Data Mining: Concepts an
d Techniques Third Edition, Elsevier, 2012
5. Oded Maimon and Lior Rokach, Data Mining and Knowledge
Discovery Handbook Second Edition, Springer, 2010
6. Warren Liao and Evangelos Triantaphyllou (eds.), Recent Adv
ances in Data Mining of Enterprise Data: Algorithms and Appl
ications, World Scientific, 2007