Assignment 1
Data and Data Pre-processing
Equal-width discretization
Equal-width discretization divides the range of an attribute into N intervals of
equal size. If X and Y are the minimum and maximum values of the attribute, the
width (W) of each interval is:
W = (Y – X) / N
Although equal-width discretization is the most straightforward method, it has some
drawbacks.
Example:
W = (27 - 0) / 3 = 9
Bin 1: 0, 5
Bin 2: 10, 14, 15, 17
Bin 3: 23, 26, 27
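The worked example above can be reproduced with a short Python sketch. The function name `equal_width_bins` is hypothetical. It assumes half-open intervals, with the maximum value placed in the last bin, and it assumes the values are not all identical (otherwise the width would be zero).

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # W = (Y - X) / N
    bins = [[] for _ in range(n_bins)]
    for v in values:
        # The maximum value would fall at index n_bins, so clamp it
        # into the last bin (assumption: the last interval is closed).
        idx = min(int((v - lo) / width), n_bins - 1)
        bins[idx].append(v)
    return width, bins


values = [0, 5, 10, 14, 15, 17, 23, 26, 27]
width, bins = equal_width_bins(values, 3)
print(width)  # 9.0
print(bins)   # [[0, 5], [10, 14, 15, 17], [23, 26, 27]]
```

The bins hold 2, 4 and 3 values, which shows that equal-width intervals do not guarantee equal counts.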
For equal-frequency (equal-depth) discretization of the same values, there are 9 values, so each bin holds 9 / 3 = 3 values.
Bin 1: 0, 5, 10
Bin 2: 14, 15, 17
Bin 3: 23, 26, 27
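A minimal Python sketch of this equal-count split follows. The function name `equal_frequency_bins` is hypothetical, and putting any leftover values into the last bin is an assumption made for illustration, not something the example above specifies.

```python
def equal_frequency_bins(values, n_bins):
    """Split sorted values into n_bins groups of (nearly) equal size."""
    ordered = sorted(values)
    size = len(ordered) // n_bins  # values per bin
    bins = [ordered[i * size:(i + 1) * size] for i in range(n_bins)]
    # Assumption: leftover values (when the count is not divisible
    # by n_bins) are appended to the last bin.
    bins[-1].extend(ordered[n_bins * size:])
    return bins


values = [0, 5, 10, 14, 15, 17, 23, 26, 27]
freq_bins = equal_frequency_bins(values, 3)
print(freq_bins)  # [[0, 5, 10], [14, 15, 17], [23, 26, 27]]
```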
Attribute      Bin 1    Bin 2    Bin 3
sepalLength    59       71       20
sepalWidth     47       88       15
petalLength    50       54       46
petalWidth     50       54       46
sepalWidth bin ranges: [2-2.8], [2.8-3.6], [3.6-4.4]
Attribute      Bin 1    Bin 2    Bin 3
sepalLength    52       47       51
sepalWidth     47       48       55
petalLength    50       49       51
petalWidth     50       52       48
Q2.
Min-Max normalization
Min-max normalization scales data into a fixed range, usually 0 to 1. The minimum
value of the data is transformed to 0, the maximum value is transformed to 1, and
every other value is transformed to a decimal between 0 and 1. It is defined as
$$v' = \frac{v - \min_A}{\max_A - \min_A}\,(\mathrm{new\_max}_A - \mathrm{new\_min}_A) + \mathrm{new\_min}_A$$
Min-max normalization guarantees that all data end up on exactly the same scale,
but it does not handle outliers well.
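The min-max definition can be sketched in Python as below. The function name `min_max_normalize` and its parameters are hypothetical, and the sample values are reused from the discretization example purely for illustration.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so they span [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    if span == 0:
        # All values are identical, so the formula would divide by zero.
        raise ValueError("cannot normalize a constant attribute")
    return [(v - old_min) / span * (new_max - new_min) + new_min
            for v in values]


data = [0, 5, 10, 14, 15, 17, 23, 26, 27]
scaled = min_max_normalize(data)
print(scaled[0], scaled[-1])  # 0.0 1.0
```

Because the result depends only on the minimum and maximum, a single extreme value compresses all the other values into a narrow part of the range.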
Example:
If we normalize 400
Z-score normalization
$$v' = \frac{v - \mu_A}{\sigma_A}$$
where μ_A is the mean and σ_A is the standard deviation of attribute A.
Z-score normalization handles outliers better than min-max normalization, but it does
not produce normalized data on exactly the same scale.
Example:
If the mean (μ) of the marks is 13.25 and the standard deviation (σ) is 4.6, the
z-score for a mark of 8 is
v' = (8 - 13.25) / 4.6
≈ -1.14
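The calculation above can be checked with a small Python sketch. The function names are hypothetical. Using the population standard deviation (`statistics.pstdev`) when normalizing a whole list is an assumption, since the text does not say which form of the standard deviation is intended. The sample list is reused from the discretization example for illustration only.

```python
import statistics


def z_score(v, mean, std):
    """Return how many standard deviations v lies from the mean."""
    return (v - mean) / std


def z_score_normalize(values):
    """Normalize a list using its own mean and standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumption: population std dev
    return [z_score(v, mean, std) for v in values]


print(round(z_score(8, 13.25, 4.6), 2))  # -1.14

normalized = z_score_normalize([0, 5, 10, 14, 15, 17, 23, 26, 27])
```

The normalized values have a mean of 0 and a standard deviation of 1, but their minimum and maximum are not fixed, which is why the resulting scale is not identical across attributes.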
Attribute summarization Before Normalisation
Reference
Li, J 2019, Lecture 1: Data pre-processing, Lecture notes, Data and Web Mining
(COMP 4008), University of South Australia, delivered 4 March 2019.