
Data Analysis using WEKA
IT for Business Intelligence
Walla - 108M60043
WEKA is a collection of visualization tools and algorithms for data analysis and predictive modelling, together with graphical user interfaces.

Contents
Cluster Analysis
    Advantages
    Disadvantages
    Cluster Analysis using WEKA
Regression Analysis
    Advantages
    Disadvantages
    Regression Analysis using WEKA
List of Figures
Works Cited



Cluster Analysis

Market segmentation is usually based not on one factor but on multiple factors. Initially, each variable has its own cluster. The task is to combine the variables so that homogeneous clusters are formed. Such clusters should be internally homogeneous and externally heterogeneous.
Cluster analysis (Anonymous, Cluster Analysis) is a class of statistical techniques that can be applied to data that exhibit "natural" groupings.
It allows a user to make groups of data in order to determine patterns from the data. Clustering has its advantages when the data set is defined and a general pattern needs to be determined from the data.
A user can create a specific number of groups, depending on the business needs. Cluster analysis is useful in the exploratory phase of research, when there are no prior hypotheses.
Advantages
Feasibility - Cluster sampling is feasible while dealing with a large population.
Economy - Reduction in the cost of travelling and listing. For example, compiling research information about every household in a city is difficult, whereas compiling information about various societies is easier, with the effort being greatly reduced.
Reduced variability - When the estimates are considered, reduced variability in results may be observed, although this is not the usual situation. More commonly, increased variability in results is observed in cluster sampling.
Disadvantages
Biased samples - If the group that is chosen as a cluster sample has a biased opinion, then the entire population is inferred to have the same opinion, which may not be the actual case.
Sampling errors - Other probabilistic methods give fewer errors than cluster sampling. Hence, cluster sampling is discouraged for beginners.
However, for the average user, clustering can be a useful data mining method.
Cluster Analysis using WEKA (Ian H. Witten, 2012)

The data set (Abernethy, 2010) for this clustering example focuses on a fictional BMW dealership. The dealership has kept track of how people walk through the dealership and the showroom, what cars they look at, and how often they ultimately make purchases.
With the help of cluster analysis, we want to find any patterns in the data and to determine whether certain behaviours emerge among the customers.
There are 100 rows of data in this sample, and each column describes a step that the customers reached in their BMW experience, with a column holding a 1 (they made it to this step or looked at this car) or a 0 (they didn't reach this step).


Figure 1 Cluster data in WEKA
Click on the Cluster tab. Click Choose and select SimpleKMeans from the choices that appear.

Figure 2 Simple K Means Cluster Algorithm
Adjust the attributes by clicking SimpleKMeans. Adjust the numClusters field, which tells how many clusters we want to create.

For our analysis we change the value from 2 to 5, but the user can adjust the number of clusters created.
Click OK to accept these values.
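
What the clustering step does with the chosen number of clusters can be illustrated outside WEKA. The following is a minimal sketch of the k-means idea (assign each row to its nearest centroid, then move each centroid to the mean of its rows, and repeat), not WEKA's SimpleKMeans implementation. The function name k_means, the helper names, the default seed and the toy visits rows are all assumptions made for illustration; the parameter k plays the role of the numClusters field.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Column-wise mean of a non-empty list of rows."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def k_means(rows, k, seed=10, max_iter=500):
    """Plain Lloyd-style k-means; returns (labels, centroids, sse)."""
    rng = random.Random(seed)
    # Start from k rows picked at random (an illustrative choice).
    centroids = [list(row) for row in rng.sample(rows, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each row goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(row, centroids[c]))
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are final
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = mean_point(members)
    # Within-cluster sum of squared errors for the final clustering.
    sse = sum(squared_distance(row, centroids[lab]) for row, lab in zip(rows, labels))
    return labels, centroids, sse


# Hypothetical 0/1 rows in the same spirit as the dealership data.
visits = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
]
labels, centroids, sse = k_means(visits, k=2)
print(labels, round(sse, 4))
```

Because the starting centroids are random, a different seed can give a different grouping, which is one reason to try several values for the number of clusters and the seed.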


Figure 3 Cluster Attribute
After running the clustering algorithm, the following output appears:
=== Run information ===
Scheme: weka.clusterers.SimpleKMeans -N 5 -A "weka.core.EuclideanDistance -R
first-last" -I 500 -S 10
Relation: car-browsers
Instances: 100
Attributes: 8
Dealership
Showroom
ComputerSearch
M5
3Series
Z4
Financing
Purchase
Test mode: evaluate on training data

=== Clustering model (full training set) ===
kMeans
======
Number of iterations: 8
Within cluster sum of squared errors: 113.58260073260074
Missing values globally replaced with mean/mode
Cluster centroids:
                                      Cluster#
Attribute        Full Data         0         1         2         3         4
                     (100)      (26)      (27)       (5)      (14)      (28)
============================================================================
Dealership             0.6    0.9615    0.6667         1    0.8571         0
Showroom              0.72    0.6923    0.6667         0    0.5714         1
ComputerSearch        0.43    0.6538         0         1    0.8571    0.3214
M5                    0.53    0.4615     0.963         1    0.7143         0
3Series               0.55    0.3846    0.4444       0.8    0.0714         1
Z4                    0.45    0.5385         0       0.8    0.5714    0.6786
Financing             0.61    0.4615    0.6296       0.8         1       0.5
Purchase              0.39         0    0.5185       0.4         1    0.3214

Time taken to build model (full training data): 0.05 seconds
=== Model and evaluation on training set ===
Clustered Instances
0 26 (26%)
1 27 (27%)
2 5 (5%)
3 14 (14%)
4 28 (28%)

Also, one can right-click on the Result list section of the Cluster tab and select Visualize Cluster Assignments, which lets the user see the results visually.
Change the X axis to be Z4, the Y axis to Purchase, and the Color to Cluster.

The result displays how the clusters are grouped in terms of who looked at the M5 and who purchased one.

Figure 4 Visualise Cluster Assignment

Figure 5 Cluster Chart
Each cluster shows us a type of behaviour in our customers, from which we can begin to draw some conclusions:

Cluster 0 - We name this group the "Fantasists," as they wander around the dealership, looking at cars parked outside on the lots, but trail off when it comes to coming into the dealership, and don't purchase anything.
Cluster 1 - The "M5 Devotees," as they tend to walk straight to the M5s, ignoring the other series cars. They don't have a high purchase rate (52%), which implies a potential problem and could be a focus for improvement.
Cluster 2 - Being a very small group, this cluster can be ignored, as it is statistically more or less irrelevant and no concrete conclusion can be drawn from its behaviour.
Cluster 3 - The "BMW Children," as they always end up purchasing a car by financing it. They walk around the lot looking at cars, then turn to the computer search available at the dealership. Eventually, they tend to buy the M5 or Z4 series. This cluster points the dealership towards making its search computers prominent around the lot (outdoor search computers), and maybe making the M5 or Z4 more prominent in the search results. Once the customer has made up his mind to purchase the vehicle, he always qualifies for financing and completes the purchase.
Cluster 4 - The "BMW Starters," as they always look at the 3-series and not at the luxurious options. They walk right into the showroom, don't walk around and tend to ignore the computer search terminals. While 50% reach the financing stage, only 32% eventually finish the transaction. One can infer that these customers are considering buying their first BMWs, knowing exactly what kind of car they want (the 3-series entry-level model), and are hoping to qualify for financing to be able to afford it. Maybe the dealership could boost sales to this cluster by relaxing its financing standards.
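
The percentages quoted in these interpretations come straight from the Financing and Purchase rows of the centroid table. As a rough sketch, the rates can be turned back into approximate customer counts per cluster; the names below are illustrative, and rounding to whole customers is an assumption that works here because every attribute is 0 or 1.

```python
# Cluster sizes and centroid rates for clusters 0..4, copied from the WEKA output.
cluster_sizes = [26, 27, 5, 14, 28]
financing_rate = [0.4615, 0.6296, 0.8, 1.0, 0.5]
purchase_rate = [0.0, 0.5185, 0.4, 1.0, 0.3214]

# A rate times the cluster size gives (after rounding) a head count.
financing_count = [round(r * n) for r, n in zip(financing_rate, cluster_sizes)]
purchase_count = [round(r * n) for r, n in zip(purchase_rate, cluster_sizes)]

for cluster, size in enumerate(cluster_sizes):
    print(
        f"Cluster {cluster}: {size} customers, "
        f"{financing_count[cluster]} reached financing ({financing_rate[cluster]:.0%}), "
        f"{purchase_count[cluster]} purchased ({purchase_rate[cluster]:.0%})"
    )

print("Total purchases:", sum(purchase_count))
```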



Regression Analysis

Regression analysis (Anonymous, Regression analysis) is the statistical technique that identifies the relationship between two or more quantitative variables: a dependent variable, whose value is to be predicted, and an independent or explanatory variable (or variables), about which knowledge is available.
The technique is used to find the equation that represents the relationship between the variables. A simple regression analysis can show that the relation between an independent variable X and a dependent variable Y is linear, using the simple linear regression equation $Y = a + bX$ (where a and b are constants). Multiple regression will provide an equation that predicts one variable from two or more independent variables, $Y = a + bX_1 + cX_2 + dX_3$.
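
To make the simple case concrete, the constants a and b can be estimated by ordinary least squares, where b is the covariance of X and Y divided by the variance of X and a is chosen so that the line passes through the means. The sketch below is only an illustration of that formula; the function name and the sample points are made up for this example and are not part of the data sets used in this report.

```python
def fit_simple_regression(xs, ys):
    """Least-squares estimates (a, b) for the line Y = a + b * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical points lying exactly on Y = 2 + 3X.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [5.0, 8.0, 11.0, 14.0]
a, b = fit_simple_regression(xs, ys)
print(f"Y = {a:.2f} + {b:.2f} * X")
```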
It helps one to comprehend the statistical dependence of one variable on other variables and can show what proportion of the variance between variables is due to the dependent variable, and what is due to the independent.
Advantages
Regression analysis offers a chance to postulate hypotheses regarding the nature of effects and explanatory factors.
With a statistically valid adjustment and execution, it yields a quantitative estimate of net effects.
Unlike the modified multiple approach, where one can control for differences on only one variable, a regression can be extended to allow for more than one variable and even for cross effects across these variables.
Disadvantages
It is sensitive to outliers and requires a large amount of data.
Data Snooping - Analysis might point towards a strong link between two variables, whereas the influence of other variables may not have been estimated.
It gives optimal estimates of the unknown parameters only when its underlying assumptions hold.
If the relation between the explained and explanatory variables is circular, the method becomes inapplicable.
Regression Analysis using WEKA (Ian H. Witten, 2012)
The data set (Abernethy, 2010) for this regression example focuses on a house price-based regression model.
Size (sq. feet)   Lot size   Bedrooms   Granite kitchen   Upgraded bathroom   Selling price (Rs.)
3529              9191       6          0                 0                   205,000
3247              10061      5          1                 1                   224,900
4032              10150      5          0                 1                   197,900
2397              14156      4          1                 0                   189,900
2200              9600       4          0                 1                   195,000
3536              19994      6          1                 1                   325,000
2983              9365       5          0                 1                   230,000
3198              9669       5          1                 1                   UNKNOWN

The left section of the Explorer window outlines all of the columns (Attributes) and the number of rows (Instances).
Selecting a column gives information about the data in that column. For example, on selecting the "sellingPrice" column, the right section shows statistical information about the column: the maximum value in the data set for this column is 325000, and the minimum is 189900.

Figure 6 House Data in WEKA
Also, one can visualize the data by clicking the Visualize All button.


Figure 7 Visualization - House Data
Click on the "Classify" tab, then click the "Choose" button, then expand the functions branch. Select "LinearRegression".

Figure 8 Select Linear Regression


Figure 9 Linear Regression Model
Apart from the present (training) data set, data for testing the model can also be provided in the following ways:
Supplied test set - Supply a different set of data on which the model is evaluated
Cross-validation - Build and evaluate models on subsets of the supplied data and average the results
Percentage split - Build the model on a percentage of the supplied data and evaluate it on the rest
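
In WEKA these test options govern which rows are used for building the model and which for evaluating it. The sketch below shows, under simplifying assumptions, how a percentage split and k-fold cross-validation partition a list of rows; the function names, the seed and the way folds are formed are illustrative and do not reproduce WEKA's own shuffling or stratification.

```python
import random


def percentage_split(rows, train_percent, seed=1):
    """Shuffle the rows, then cut them into a training and a test part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_percent / 100)
    return shuffled[:cut], shuffled[cut:]


def cross_validation_folds(rows, k):
    """Yield (train, test) pairs so that every row is tested exactly once."""
    folds = [rows[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield train, test


# Row indices stand in for data instances in this example.
rows = list(range(10))

train, test = percentage_split(rows, train_percent=70)
print("percentage split:", len(train), "training rows,", len(test), "test rows")

for fold_number, (fold_train, fold_test) in enumerate(cross_validation_folds(rows, k=5)):
    print("fold", fold_number, "tests on", fold_test)
```

With only seven instances, as in the house data used here, any held-out part is very small, which is presumably why the model in this report is evaluated on the training data itself.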
Choose the dependent variable, here the "selling price", by selecting it in the combo box below the test options.
Click on "Start" to build the model. The following output appears:
=== Run information ===
Scheme: weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
Relation: house
Instances: 7
Attributes: 6
houseSize
lotSize

bedrooms
granite
bathroom
sellingPrice
Test mode: evaluate on training data
=== Classifier model (full training set) ===
Linear Regression Model
sellingPrice =
-26.6882 * houseSize +
7.0551 * lotSize +
43166.0767 * bedrooms +
42292.0901 * bathroom +
-21661.1208
Time taken to build model: 0.01 seconds
=== Evaluation on training set ===
=== Summary ===
Correlation coefficient 0.9945
Mean absolute error 4053.821
Root mean squared error 4578.4125
Relative absolute error 13.1339 %
Root relative squared error 10.51 %
Total Number of Instances 7
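
The error figures in the summary can be approximately recomputed by applying the printed model to the seven training rows. The sketch below assumes the house data values as read from the table in this report and uses the coefficients exactly as printed; since those coefficients are rounded to four decimals, the results should only be expected to land close to WEKA's figures, not to match every digit. The function and variable names are illustrative.

```python
import math

# (houseSize, lotSize, bedrooms, granite, bathroom, sellingPrice)
houses = [
    (3529, 9191, 6, 0, 0, 205000),
    (3247, 10061, 5, 1, 1, 224900),
    (4032, 10150, 5, 0, 1, 197900),
    (2397, 14156, 4, 1, 0, 189900),
    (2200, 9600, 4, 0, 1, 195000),
    (3536, 19994, 6, 1, 1, 325000),
    (2983, 9365, 5, 0, 1, 230000),
]


def predict(house_size, lot_size, bedrooms, bathroom):
    """Selling price according to the printed linear regression model."""
    return (
        -26.6882 * house_size
        + 7.0551 * lot_size
        + 43166.0767 * bedrooms
        + 42292.0901 * bathroom
        - 21661.1208
    )


# Prediction error for every training row (granite is not used by the model).
errors = [
    predict(size, lot, beds, bath) - price
    for size, lot, beds, _granite, bath, price in houses
]

mean_absolute_error = sum(abs(e) for e in errors) / len(errors)
root_mean_squared_error = math.sqrt(sum(e * e for e in errors) / len(errors))

print(f"Mean absolute error     {mean_absolute_error:.2f}")
print(f"Root mean squared error {root_mean_squared_error:.2f}")
```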
Also, the visual classifier can be viewed for this model.

Figure 10 Visual Classifier for Linear Regression

Now we can calculate the value of the "UNKNOWN" variable:
sellingPrice = (-26.6882 * 3198) + (7.0551 * 9669) + (43166.0767 * 5) + (42292.0901 * 1) - 21661.1208
sellingPrice = Rs. 219,328
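
The same calculation can be broken down term by term to see how much each attribute contributes to the predicted price. This is a small sketch using the coefficients printed in the model output and the attribute values of the house with the unknown price; the dictionary layout and names are illustrative.

```python
# Coefficients and intercept as printed in the linear regression model.
coefficients = {
    "houseSize": -26.6882,
    "lotSize": 7.0551,
    "bedrooms": 43166.0767,
    "bathroom": 42292.0901,
}
intercept = -21661.1208

# Attribute values of the house whose selling price is unknown.
unknown_house = {"houseSize": 3198, "lotSize": 9669, "bedrooms": 5, "bathroom": 1}

# Contribution of each attribute = coefficient * attribute value.
contributions = {
    name: coefficients[name] * unknown_house[name] for name in coefficients
}
selling_price = sum(contributions.values()) + intercept

for name, value in contributions.items():
    print(f"{name:10s} {value:12.2f}")
print(f"{'intercept':10s} {intercept:12.2f}")
print(f"{'total':10s} {selling_price:12.2f}")
```

The breakdown shows that the bedrooms term dominates the estimate, while the house size term pulls it down.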
Other inferences one can arrive at:
Bathrooms matter - As per the model, the presence of an upgraded bathroom adds 42,292.09 to the total value.
The presence of granite does not add value, as the model ignores this attribute when it is formed (please see the output for reference).
Also, another unlikely inference that comes out is that a bigger house fetches a lower selling price (negative coefficient in front of the "houseSize" variable).

However, the house size is not an independent variable, as it is related to the number of bedrooms and bathrooms. This only points out that insufficient data gives an imperfect model, which validates the disadvantage of regression analysis regarding the requirement of large data.



List of Figures

Figure 1 Cluster data in WEKA
Figure 2 Simple K Means Cluster Algorithm
Figure 3 Cluster Attribute
Figure 4 Visualise Cluster Assignment
Figure 5 Cluster Chart
Figure 6 House Data in WEKA
Figure 7 Visualization - House Data
Figure 8 Select Linear Regression
Figure 9 Linear Regression Model
Figure 10 Visual Classifier for Linear Regression



Works Cited

Abernethy, M. (2010, April 27). Data Mining with WEKA. Retrieved from IBM developerWorks: http://www.ibm.com/developerworks/opensource/library/os-weka1/index.html
Anonymous. (n.d.). Cluster Analysis. Retrieved from Wikipedia: http://en.wikipedia.org/wiki/Cluster_analysis
Anonymous. (n.d.). Regression analysis. Retrieved from Wikipedia: http://en.wikipedia.org/wiki/Regression_analysis
Ian H. Witten, E. F. (2012). DATA MINING. Burlington: ELSEVIER.
