What is WEKA?
Machine learning/data mining software written in Java (distributed under
the GNU General Public License)
Used for research, education, and applications
Complements the book "Data Mining" by Witten & Frank
Main features:
Comprehensive set of data pre-processing tools, learning algorithms
and evaluation methods
Graphical user interfaces (incl. data visualization)
Environment for comparing learning algorithms
WEKA versions
WEKA 3.4: "book version", compatible with the description in the data mining
book
WEKA 3.5: "developer version", with many improvements
Formatting Data into ARFF
@relation iris
@attribute sepallength real
@attribute sepalwidth real
@attribute petallength real
@attribute petalwidth real
@attribute class {Iris-setosa, Iris-versicolor, Iris-virginica}
@data
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
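A minimal sketch of how the ARFF layout above can be read in Python. It assumes only the subset shown on this slide (@relation, @attribute with real or nominal types, @data with comma-separated rows) and ignores quoting, sparse rows and other ARFF features. The function name parse_arff is hypothetical and is not part of WEKA.

```python
def parse_arff(text):
    """Parse a small subset of ARFF: @relation, @attribute and @data rows."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):      # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = []
            for (name, kind), value in zip(attributes, values):
                # numeric attributes become floats, nominal ones stay strings
                row.append(float(value) if kind == "numeric" else value)
            rows.append(row)
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                # nominal attribute: keep the list of allowed labels
                kind = [v.strip() for v in spec.strip("{}").split(",")]
            else:
                kind = "numeric"                  # assumption: real/numeric only
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


SAMPLE = """@relation iris
@attribute sepallength real
@attribute sepalwidth real
@attribute petallength real
@attribute petalwidth real
@attribute class {Iris-setosa, Iris-versicolor, Iris-virginica}
@data
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
"""

relation, attributes, rows = parse_arff(SAMPLE)
print(relation, len(attributes), rows[0])
```

The last attribute is nominal, so its allowed labels are kept as a list, while the four measurements are converted to floats.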
Practicing WEKA
What is WEKA?
Formatting the data into ARFF
Classification
Steps for building a classifier
Example case: iris flower classification
Steps for building a classifier
Summarizing the k-Nearest Neighbor Classifier experiment results
Experiments using other classifiers (ANN, SVM)
Classification of cancers based on gene expression
Parkinson Disease Detection
K-Means Clustering
Steps for Building a Classifier
1. Decide which pieces of information serve as
(a) attribute/feature
(b) class
(c) training & testing set
(d) accuracy measurement scenario
2. Choose a combination of model parameters and train the model
on the training set
3. Measure the accuracy achieved on the testing set
4. Change the model parameters and repeat from
step 2 until the desired accuracy is reached
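The loop in steps 2 to 4 can be sketched as follows. This is only an illustration: tune, train_and_score and toy_score are hypothetical names, and the scoring function below is synthetic, standing in for "train on the training set, then measure accuracy on the testing set".

```python
def tune(candidates, train_and_score, target):
    """Try parameter values in order until the target accuracy is reached."""
    best_param, best_score = None, float("-inf")
    for param in candidates:
        score = train_and_score(param)        # steps 2 and 3: train, then measure
        if score > best_score:
            best_param, best_score = param, score
        if best_score >= target:              # step 4: stop once accuracy is good enough
            break
    return best_param, best_score


def toy_score(k):
    """Synthetic stand-in for a real train/test run (not real results)."""
    return 1 - abs(k - 5) / 10


best_k, best_accuracy = tune([1, 3, 5, 7, 9], toy_score, target=0.95)
print(best_k, best_accuracy)
```

If no candidate reaches the target, the function simply returns the best parameter seen.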
Example Case: Iris Flower Classification
Flower Parts
Steps for Building a Classifier
1. Decide which pieces of information serve as
(a) attributes/features: sepal length, sepal width, petal length, petal width
(b) class: iris setosa, iris versicolor, iris virginica
(c) training & testing set
training set: 25 instances/class
testing set: 25 instances/class
(d) accuracy measurement scenario
Step-by-Step Classification
Open the file iris-training.arff
Click Classify to select a classifier algorithm
Click Choose to pick the classifier algorithm
Naive Bayes
IB1: 1-Nearest Neighbor Classifier
IBk: k-Nearest Neighbor Classifier
Multilayer Perceptron (Artificial Neural Network)
SMO stands for Sequential Minimal Optimization
SMO is an implementation of SVM
Based on the paper by John Platt
Decision Tree J48 (C4.5)
For example, we choose
IBk: k-Nearest Neighbor Classifier
Next, choose the accuracy measurement scenario. Of the 4
options provided, choose Supplied test set and click the
Set button to select the testing set file iris-testing.arff
Steps for Building a Classifier
Class             iris-training.arff   iris-testing.arff
iris setosa       25                   25
iris versicolor   25                   25
iris virginica    25                   25
Classifiers:
1. Naive Bayes
2. k-Nearest Neighbor Classifier (lazy IBk)
3. Artificial Neural Network (functions: Multilayer Perceptron)
4. Support Vector Machine
Accuracy on the testing set?
What Does Measuring Accuracy Mean?
The testing set iris-testing.arff includes the actual class
of each instance. For example, instance no. 1
is a flower with sepal length
5.0 cm, sepal width 3.0 cm, petal length 1.6 cm,
petal width 0.2 cm, and its flower type (class) is
Iris setosa
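Because the actual class of every test instance is known, accuracy is the fraction of instances whose predicted class matches the actual one. A minimal sketch of that comparison follows; the function name accuracy and the four-label example are hypothetical, chosen only for illustration.

```python
def accuracy(actual, predicted):
    """Fraction of instances whose predicted class equals the actual class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Hypothetical labels, not taken from iris-testing.arff
actual = ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-virginica"]
predicted = ["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-virginica"]
print(accuracy(actual, predicted))  # 3 of 4 correct -> 0.75
```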
Ways of Measuring Accuracy
Cross Validation Method
(fold = 5 or 10): an accuracy
estimation technique used
when the number of samples is
limited. A special form of
CV is Leave-one-out
Cross Validation (LOOCV),
used when the number of
samples is very limited
Illustration of Cross Validation (k=5)
1. The data consist of 100 instances (samples), divided into 5
blocks of equal size. The blocks are named A, B, C,
D and E, and each contains 20 instances
2. The quality of a particular parameter combination is tested as follows.
step 1: train on A,B,C,D   test on E   accuracy a
step 2: train on A,B,C,E   test on D   accuracy b
step 3: train on A,B,D,E   test on C   accuracy c
step 4: train on A,C,D,E   test on B   accuracy d
step 5: train on B,C,D,E   test on A   accuracy e
3. The average accuracy (a+b+c+d+e)/5 reflects the quality of
the chosen parameters
4. Change the model parameters and repeat from no. 2 until
the desired accuracy is reached
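The block scheme above can be sketched in Python as index splits. This is a minimal illustration that assumes the number of instances divides evenly by k and that the data are already in random order. The names k_fold_splits and cross_validate are hypothetical, and evaluate stands for any function that trains on the first index list and returns the accuracy on the second.

```python
def k_fold_splits(n_instances, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    fold_size = n_instances // k              # assumes n_instances is divisible by k
    indices = list(range(n_instances))
    for fold in reversed(range(k)):           # last block is tested first (E, D, C, B, A)
        start, stop = fold * fold_size, (fold + 1) * fold_size
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test


def cross_validate(n_instances, k, evaluate):
    """Average the accuracy returned by evaluate(train, test) over the k folds."""
    accuracies = [evaluate(train, test)
                  for train, test in k_fold_splits(n_instances, k)]
    return sum(accuracies) / len(accuracies)


splits = list(k_fold_splits(100, 5))
for step, (train, test) in enumerate(splits, start=1):
    print(f"step {step}: train on {len(train)} instances, "
          f"test on instances {test[0]}..{test[-1]}")
```

With 100 instances and k=5, every step trains on 80 instances and tests on the remaining 20, and each instance is used for testing exactly once.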
This time we use Supplied test set.
Next, click the area inside the box to set the
parameters. In this case, the parameter is
the value of k in the k-Nearest Neighbor
Classifier (nickname: IBk)
Set the value of k, for example to 3, and click OK.
To learn about the other parameters,
click the More and Capabilities buttons
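To show what the parameter k controls, here is a minimal sketch of a k-Nearest Neighbor classifier with plain Euclidean distance and majority voting. It is not WEKA's IBk implementation, which may differ in details such as attribute normalization. The function name knn_predict is hypothetical; the six training rows are the ones from the ARFF example, and the query is the instance no. 1 described for the testing set.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict the class of query by majority vote among its k nearest neighbors."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


train_X = [
    (5.1, 3.5, 1.4, 0.2), (4.9, 3.0, 1.4, 0.2),
    (7.0, 3.2, 4.7, 1.4), (6.4, 3.2, 4.5, 1.5),
    (6.3, 3.3, 6.0, 2.5), (5.8, 2.7, 5.1, 1.9),
]
train_y = [
    "Iris-setosa", "Iris-setosa",
    "Iris-versicolor", "Iris-versicolor",
    "Iris-virginica", "Iris-virginica",
]

print(knn_predict(train_X, train_y, (5.0, 3.0, 1.6, 0.2), k=3))
```

With k=3, the two setosa rows are the closest neighbors of the query, so they outvote the third neighbor and the prediction is Iris-setosa.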
Click the Start button
Experiment result: correct classification rate
(72 correct out of a total of 75 instances in the testing set)
1   k=1   ?      ?     ?     ?
2   k=3   100%   96%   92%   96%
3   k=5
4   k=7
5   k=9
Experiments Using SVM
C: complexity parameter (usually set to a
large value: 100, 1000, etc.)
For choosing the kernel
Classification of Cancers Based on Gene Expression
Biological reference:
Classification and diagnostic prediction of cancers using gene expression
profiling and artificial neural networks,
J. Khan, et al., Nature Medicine 7, pp. 673-679, 2001
(http://www.thep.lu.se/~carsten/pubs/lu_tp_01_06.pdf)
Data is available from
http://research.nhgri.nih.gov/microarray/Supplement/
Small Round Blue Cell Tumors (SRBCT) have four classes:
EWS: Ewing Family of Tumors
NB: Neuroblastoma
BL: Burkitt lymphomas
RMS: Rhabdomyosarcoma
Characteristics of the data
Training samples: 63 (EWS:23 BL:8 NB:12 RMS:20)
Testing samples: 20 (EWS:6 BL:3 NB:6 RMS:5)
Number of features (attributes): 2308
Classification of Cancers Based on Gene Expression
Experiment using k-Nearest Neighbor Classifier
1. The training and testing sets are given as separate ARFF files.
2. Use the training set to build a classifier: k-Nearest Neighbor (k=1).
3. Evaluate its performance on the testing set.
4. Change the value of k to 3, 5, 7 and 9, and repeat steps 1 to 3 for each
value.
Experiment using Artificial Neural Network
Run the same experiment using a Multilayer Perceptron Artificial Neural
Network with various parameter settings (hidden neurons, learning rate,
momentum, maximum iterations). Try at least five parameter settings.
Parkinson Disease Detection
Max Little (Oxford University) recorded speech signals and took
biomedical voice measurements from 31 people, 23 of whom have Parkinson Disease (PD). In the
dataset, which will be distributed during the final examination, each column in the
table is a particular voice measure, and each row corresponds to one of 195 voice
recordings from these individuals ("name" column). The main aim of the data is
to discriminate healthy people from those with PD, according to the "status" column,
which is set to 0 for healthy and 1 for PD. There are around six recordings per
patient, making a total of 195 instances. (Ref. 'Exploiting Nonlinear Recurrence
and Fractal Scaling Properties for Voice Disorder Detection', Little MA,
McSharry PE, Roberts SJ, Costello DAE, Moroz IM. BioMedical Engineering
OnLine 2007, 6:23, 26 June 2007).
Experiment using k-Nearest Neighbor Classifier
Conduct classification experiments using the k-Nearest Neighbor Classifier and
Support Vector Machines, using 50% of the data as the training set and the rest
as the testing set. Try at least 5 different values of k for k-Nearest Neighbor, and
draw a graph showing the relationship between k and the classification rate. For the
Support Vector Machine experiments, try several parameter combinations by
modifying the type of kernel and its parameters (at least 5 experiments).
Compare and discuss the results obtained by both classifiers. Which of them
achieved higher accuracy?
K-Means Clustering: Step by Step
Choose k data points as the initial centroids
Repeat
  Form k clusters by
  assigning each data point to the nearest
  centroid
  Update the centroid of each cluster
Until the centroids no longer change
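The steps above can be sketched in Python as follows. This is a minimal illustration, not WEKA's clustering implementation. It assumes Euclidean distance and takes the first k points as the initial centroids. The function name k_means, its max_iterations argument and the four sample points are hypothetical.

```python
import math


def k_means(points, k, max_iterations=100):
    """Cluster points into k groups; returns (centroids, clusters)."""
    centroids = [tuple(p) for p in points[:k]]   # assumption: first k points as initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(max_iterations):
        # Form k clusters by assigning each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update the centroid of each cluster to the mean of its members
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(centroids[i])   # keep the centroid of an empty cluster
        if new_centroids == centroids:               # stop when centroids no longer change
            break
        centroids = new_centroids
    return centroids, clusters


points = [(1.0, 1.0), (1.0, 2.0), (10.0, 10.0), (10.0, 11.0)]
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```

The max_iterations argument caps the number of passes in case the centroids keep changing, so the loop always terminates.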
Filename :
kmeans_clustering.arf
Click to choose the clustering algorithm
Click to choose the value of k
maxIterations: stops the clustering process if the number of
iterations exceeds a given value
numClusters: the value of k (the number of clusters)
Clustering result: 3 clusters are formed,
each containing 50 instances
Right-click to display
the cluster visualization
The values of attribute x are shown on the x-axis,
and the values of attribute y are shown on the y-axis