
Data Mining:

4. Methods and Algorithms
Romi Satria Wahono
romi@romisatriawahono.net
http://romisatriawahono.net/dm
WA/SMS: +6281586220090

Romi Satria Wahono

SD Sompok Semarang (1987)
SMPN 8 Semarang (1990)
SMA Taruna Nusantara Magelang (1993)
B.Eng, M.Eng and Ph.D in Software Engineering, Saitama University, Japan (1994-2004)
Universiti Teknikal Malaysia Melaka (2014)
Research Interests: Software Engineering, Machine Learning
Founder and Coordinator of IlmuKomputer.Com
Researcher at LIPI (2004-2007)
Founder and CEO of PT Brainmatics Cipta Informatika

Course Outline
1. Introduction to Data Mining
2. The Data Mining Process
3. Evaluation and Validation in Data Mining
4. Data Mining Methods and Algorithms
5. Data Mining Research

4. Methods and Algorithms

1. Inferring rudimentary rules
2. Statistical modeling
3. Constructing decision trees
4. Constructing rules
5. Association rule learning
6. Linear models
7. Instance-based learning
8. Clustering

Simplicity first
Simple algorithms often work very well!
There are many kinds of simple structure, e.g.:
One attribute does all the work
All attributes contribute equally and independently
(a weighted linear combination might do)
Instance-based: use a few prototypes
Use simple logical rules
The success of a method depends on the domain

Inferring rudimentary rules

1R: learns a 1-level decision tree
I.e., rules that all test one particular attribute
Basic version:
One branch for each value
Each branch assigns the most frequent class
Error rate: proportion of instances that don't belong to the majority class of their corresponding branch
Choose the attribute with the lowest error rate
(assumes nominal attributes)

Pseudo-code for 1R

For each attribute,
    For each value of the attribute, make a rule as follows:
        count how often each class appears
        find the most frequent class
        make the rule assign that class to this attribute-value
    Calculate the error rate of the rules
Choose the rules with the smallest error rate

Note: "missing" is treated as a separate attribute value
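As a rough illustration, a minimal Java sketch of this procedure for nominal attributes might look as follows (the class name OneR and the array-based data layout are assumptions for illustration, not taken from any library):

import java.util.*;

// Minimal 1R sketch for nominal attributes: for each attribute, build
// one rule per value (predict the majority class) and keep the
// attribute whose rules make the fewest errors on the training data.
public class OneR {
    // data[i][a] = value of attribute a for instance i; labels[i] = class
    public static int bestAttribute(String[][] data, String[] labels) {
        int best = -1, bestErrors = Integer.MAX_VALUE;
        for (int a = 0; a < data[0].length; a++) {
            // count how often each class appears for each attribute value
            Map<String, Map<String, Integer>> counts = new HashMap<>();
            for (int i = 0; i < data.length; i++) {
                counts.computeIfAbsent(data[i][a], k -> new HashMap<>())
                      .merge(labels[i], 1, Integer::sum);
            }
            // errors = instances outside the majority class of their branch
            int errors = 0;
            for (Map<String, Integer> byClass : counts.values()) {
                int total = 0, max = 0;
                for (int c : byClass.values()) { total += c; max = Math.max(max, c); }
                errors += total - max;
            }
            if (errors < bestErrors) { bestErrors = errors; best = a; }
        }
        return best; // index of the attribute with the lowest error rate
    }
}

On the weather data below, this returns the index of Outlook (4/14 errors); Humidity also makes 4/14 errors but does not beat it strictly, so the earlier attribute wins.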

Evaluating the weather attributes

Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No

Attribute   Rules             Errors   Total errors
Outlook     Sunny → No        2/5      4/14
            Overcast → Yes    0/4
            Rainy → Yes       2/5
Temp        Hot → No*         2/4      5/14
            Mild → Yes        2/6
            Cool → Yes        1/4
Humidity    High → No         3/7      4/14
            Normal → Yes      1/7
Windy       False → Yes       2/8      5/14
            True → No*        3/6

(* indicates a tie: the two classes occur equally often, so one is chosen arbitrarily)

Dealing with numeric attributes

Discretize numeric attributes:
Divide each attribute's range into intervals
Sort instances according to the attribute's values
Place breakpoints where the class changes (majority class)
This minimizes the total error
Example: temperature from the weather data (numeric version):

Outlook    Temperature  Humidity  Windy  Play
Sunny      85           85        False  No
Sunny      80           90        True   No
Overcast   83           86        False  Yes
Rainy      75           80        False  Yes
...        ...          ...       ...    ...

Sorted temperature values, with a breakpoint wherever the (majority) class changes:

64    65   68   69   70    71   72   72    75   75    80   81   83    85
Yes | No | Yes  Yes  Yes | No   No   Yes | Yes  Yes | No | Yes  Yes | No
The problem of overfitting

This procedure is very sensitive to noise:
One instance with an incorrect class label will probably produce a separate interval
Also: a time stamp attribute would have zero errors
Simple solution: enforce a minimum number of instances in the majority class per interval

Example (with min = 3):

64   65   68   69   70 | 71   72   72   75   75 | 80   81   83   85
Yes  No   Yes  Yes  Yes | No   No   Yes  Yes  Yes | No   Yes  Yes  No

The first two intervals both have majority class "yes", so they are merged, leaving a single breakpoint at 77.5:

64   65   68   69   70   71   72   72   75   75 | 80   81   83   85
Yes  No   Yes  Yes  Yes  No   No   Yes  Yes  Yes | No   Yes  Yes  No
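A sketch of this discretization in Java, assuming the (value, label) pairs are already sorted and closing an interval only once its majority class holds at least min instances and the next class differs (names and details are illustrative):

import java.util.*;

// Sketch of 1R-style discretization with the minimum-count fix.
public class Discretize1R {
    static List<Double> breakpoints(double[] values, String[] labels, int min) {
        List<Double> cuts = new ArrayList<>();
        Map<String, Integer> counts = new HashMap<>();
        String majority = null; // set once some class reaches `min` instances
        for (int i = 0; i < values.length - 1; i++) {
            if (counts.merge(labels[i], 1, Integer::sum) >= min) {
                majority = labels[i];
            }
            // close the interval when a majority is established, the next
            // instance has a different class, and the values actually differ
            if (majority != null && !labels[i + 1].equals(majority)
                    && values[i + 1] > values[i]) {
                cuts.add((values[i] + values[i + 1]) / 2); // midpoint cut
                counts.clear();
                majority = null;
            }
        }
        return cuts;
    }

    public static void main(String[] args) {
        double[] temp = {64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85};
        String[] play = {"Yes", "No", "Yes", "Yes", "Yes", "No", "No",
                         "Yes", "Yes", "Yes", "No", "Yes", "Yes", "No"};
        // prints [70.5, 77.5]; merging adjacent intervals that share the
        // same majority class, as above, leaves only the cut at 77.5
        System.out.println(breakpoints(temp, play, 3));
    }
}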

With overfitting avoidance

Resulting rule set:

Attribute    Rules                    Errors   Total errors
Outlook      Sunny → No               2/5      4/14
             Overcast → Yes           0/4
             Rainy → Yes              2/5
Temperature  ≤ 77.5 → Yes             3/10     5/14
             > 77.5 → No*             2/4
Humidity     ≤ 82.5 → Yes             1/7      3/14
             > 82.5 and ≤ 95.5 → No   2/6
             > 95.5 → Yes             0/1
Windy        False → Yes              2/8      5/14
             True → No*               3/6

Discussion of 1R
1R was described in a paper by Holte (1993):
Contains an experimental evaluation on 16 datasets (using cross-validation so that results were representative of performance on future data)
The minimum number of instances was set to 6 after some experimentation
1R's simple rules performed not much worse than much more complex decision trees

Simplicity first pays off!

"Very Simple Classification Rules Perform Well on Most Commonly Used Datasets"
Robert C. Holte, Computer Science Department, University of Ottawa

Discussion of 1R: Hyperpipes

Another simple technique: build one rule for each class
Each rule is a conjunction of tests, one for each attribute
For numeric attributes: the test checks whether the instance's value is inside an interval
Interval given by the minimum and maximum observed in the training data
For nominal attributes: the test checks whether the value is one of a subset of attribute values
Subset given by all possible values observed in the training data
The class with the most matching tests is predicted
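For numeric attributes, a hyperpipes-style learner can be sketched in a few lines of Java (the class name Hyperpipes and the data layout are assumptions for illustration):

import java.util.*;

// Hyperpipes sketch for numeric attributes: one rule per class,
// recording the [min, max] interval observed for each attribute.
public class Hyperpipes {
    // intervals.get(cls)[a] = {min, max} for attribute a
    static Map<String, double[][]> train(double[][] data, String[] labels) {
        Map<String, double[][]> intervals = new HashMap<>();
        int nAttr = data[0].length;
        for (int i = 0; i < data.length; i++) {
            double[][] iv = intervals.computeIfAbsent(labels[i], k -> {
                double[][] fresh = new double[nAttr][2];
                for (double[] r : fresh) {
                    r[0] = Double.POSITIVE_INFINITY;  // min
                    r[1] = Double.NEGATIVE_INFINITY;  // max
                }
                return fresh;
            });
            for (int a = 0; a < nAttr; a++) {
                iv[a][0] = Math.min(iv[a][0], data[i][a]);
                iv[a][1] = Math.max(iv[a][1], data[i][a]);
            }
        }
        return intervals;
    }

    // predict the class whose intervals contain the most attribute values
    static String predict(Map<String, double[][]> model, double[] x) {
        String best = null;
        int bestHits = -1;
        for (Map.Entry<String, double[][]> e : model.entrySet()) {
            int hits = 0;
            for (int a = 0; a < x.length; a++) {
                if (x[a] >= e.getValue()[a][0] && x[a] <= e.getValue()[a][1]) hits++;
            }
            if (hits > bestHits) { bestHits = hits; best = e.getKey(); }
        }
        return best;
    }
}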

Statistical modeling
Opposite of 1R: use all the attributes
Two assumptions: attributes are
equally important
statistically independent (given the class value)
I.e., knowing the value of one attribute says nothing about the value of another (if the class is known)
The independence assumption is never correct!
But this scheme works well in practice

Probabilities for the weather data

Outlook        Yes   No      Temperature   Yes   No
Sunny          2/9   3/5     Hot           2/9   2/5
Overcast       4/9   0/5     Mild          4/9   2/5
Rainy          3/9   2/5     Cool          3/9   1/5

Humidity       Yes   No      Windy         Yes   No
High           3/9   4/5     False         6/9   2/5
Normal         6/9   1/5     True          3/9   3/5

Play           Yes   No
               9/14  5/14

(Fractions are read off the 14 instances of the weather data above, e.g. Outlook = Sunny occurs on 2 of the 9 "yes" days and on 3 of the 5 "no" days.)


Probabilities for the weather data: classifying a new day

A new day:

Outlook  Temp.  Humidity  Windy  Play
Sunny    Cool   High      True   ?

Likelihood of the two classes:
For "yes" = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
For "no" = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206
Conversion into probabilities by normalization:
P("yes") = 0.0053 / (0.0053 + 0.0206) = 0.205
P("no") = 0.0206 / (0.0053 + 0.0206) = 0.795
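The same arithmetic in a few lines of Java, purely as a check of the numbers above (the array layout is an assumption for illustration):

// Reproduces the likelihood computation for the new day
// (Sunny, Cool, High, True) from the probability tables above.
public class NaiveBayesDemo {
    public static void main(String[] args) {
        // Pr[attribute value | class] for Sunny, Cool, High, True
        double[] givenYes = {2.0 / 9, 3.0 / 9, 3.0 / 9, 3.0 / 9};
        double[] givenNo  = {3.0 / 5, 1.0 / 5, 4.0 / 5, 3.0 / 5};
        double likYes = 9.0 / 14, likNo = 5.0 / 14; // class priors
        for (int i = 0; i < givenYes.length; i++) {
            likYes *= givenYes[i];
            likNo  *= givenNo[i];
        }
        // normalize the two likelihoods into probabilities
        double pYes = likYes / (likYes + likNo);
        System.out.printf("P(yes) = %.3f, P(no) = %.3f%n", pYes, 1 - pYes);
        // prints P(yes) = 0.205, P(no) = 0.795, matching the slides
    }
}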

Bayes' rule
Probability of event H given evidence E:

Pr[H | E] = Pr[E | H] × Pr[H] / Pr[E]

A priori probability of H: Pr[H]
Probability of the event before evidence is seen
A posteriori probability of H: Pr[H | E]
Probability of the event after evidence is seen

Thomas Bayes
Born: 1702 in London, England
Died: 1761 in Tunbridge Wells, Kent, England

Naïve Bayes for classification

Classification learning: what's the probability of the class given an instance?
Evidence E = instance
Event H = class value for the instance
Naïve assumption: evidence splits into parts (i.e. attributes) that are independent, so

Pr[H | E] = Pr[E1 | H] × Pr[E2 | H] × ... × Pr[En | H] × Pr[H] / Pr[E]

Weather data example

Outlook  Temp.  Humidity  Windy  Play
Sunny    Cool   High      True   ?

Evidence E is the instance above; the probability of class "yes" is

Pr[yes | E] = Pr[Outlook = Sunny | yes] × Pr[Temp = Cool | yes]
              × Pr[Humidity = High | yes] × Pr[Windy = True | yes]
              × Pr[yes] / Pr[E]
            = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / Pr[E]
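Putting the pieces together, a count-based Naïve Bayes trainer for nominal attributes might look like this sketch (illustrative names; zero counts are not smoothed, which a full implementation would handle, e.g. with the Laplace estimator):

import java.util.*;

// Naive Bayes sketch for nominal attributes: estimate Pr[value | class]
// and Pr[class] by counting, then multiply for a new instance.
public class NaiveBayes {
    Map<String, Integer> classCounts = new HashMap<>();
    // counts.get(cls).get(a).get(value) = occurrences of value for attribute a
    Map<String, List<Map<String, Integer>>> counts = new HashMap<>();
    int n;

    void train(String[][] data, String[] labels) {
        n = data.length;
        for (int i = 0; i < n; i++) {
            classCounts.merge(labels[i], 1, Integer::sum);
            List<Map<String, Integer>> perAttr = counts.computeIfAbsent(
                labels[i], k -> {
                    List<Map<String, Integer>> l = new ArrayList<>();
                    for (int a = 0; a < data[0].length; a++) l.add(new HashMap<>());
                    return l;
                });
            for (int a = 0; a < data[i].length; a++) {
                perAttr.get(a).merge(data[i][a], 1, Integer::sum);
            }
        }
    }

    String predict(String[] x) {
        String best = null;
        double bestScore = -1;
        for (String cls : classCounts.keySet()) {
            double score = (double) classCounts.get(cls) / n; // class prior
            for (int a = 0; a < x.length; a++) {
                int c = counts.get(cls).get(a).getOrDefault(x[a], 0);
                score *= (double) c / classCounts.get(cls);   // Pr[value | class]
            }
            if (score > bestScore) { bestScore = score; best = cls; }
        }
        return best;
    }
}

Trained on the 14-instance weather data, predict(new String[]{"Sunny", "Cool", "High", "True"}) multiplies exactly the fractions shown above and returns "No".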

Cognitive Assignment II
1. Study and master one data mining algorithm from the literature
2. Summarize it in detail as slides, using this format:
   1. Definition
   2. Steps of the algorithm
   3. Application of the steps to the Main Golf or Iris dataset
   4. Java code for the algorithm
3. Present it at the next meeting

Algorithm choices
1. Neural Network
2. Logistic Regression
3. Support Vector Machine
4. K-Means
5. K-Nearest Neighbor
6. Self-Organizing Map
7. Linear Regression
8. Naïve Bayes
9. FP-Growth

References
1. Ian H. Witten, Eibe Frank, and Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition, Elsevier, 2011
2. Daniel T. Larose, Discovering Knowledge in Data: An Introduction to Data Mining, John Wiley & Sons, 2005
3. Florin Gorunescu, Data Mining: Concepts, Models and Techniques, Springer, 2011
4. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 3rd Edition, Elsevier, 2012
5. Oded Maimon and Lior Rokach, Data Mining and Knowledge Discovery Handbook, 2nd Edition, Springer, 2010
6. Warren Liao and Evangelos Triantaphyllou (eds.), Recent Advances in Data Mining of Enterprise Data: Algorithms and Applications, World Scientific, 2007
