
Data and Web Mining (COMP 4008)

Assignment 1
Data and Data Pre-processing

Student Name: Mahesha Sisirakumara

Student No: 110238456


Q1.
Equal-width and equal-frequency discretisation are two unsupervised binning methods.

• Equal-width discretisation

Equal-width discretisation divides the range of an attribute into N intervals of equal size. If X and Y are the minimum and maximum values of the attribute, the width W of each interval is:

W = (Y − X) / N

Even though equal-width discretisation is the most straightforward method, it has some drawbacks:

• Outliers may dominate the presentation.

• Skewed data is not handled well (Li, J 2019).

Example:

Data: 0, 5, 10, 14, 15, 17, 23, 26, 27

If the number of intervals is 3:

W = (27 − 0) / 3 = 9

• Bin 1: 0, 5
• Bin 2: 10, 14, 15, 17
• Bin 3: 23, 26, 27
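
The same calculation can be expressed in code. Below is a minimal Python sketch (the helper equal_width_bins is a hypothetical name, not a library function) that reproduces the bins above:

# Equal-width binning: each interval has width W = (max - min) / N.
def equal_width_bins(values, n_bins):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in values:
        # Interval index; the maximum value is clamped into the last bin.
        idx = min(int((v - lo) // width), n_bins - 1)
        bins[idx].append(v)
    return width, bins

data = [0, 5, 10, 14, 15, 17, 23, 26, 27]
width, bins = equal_width_bins(data, 3)
print(width)  # 9.0
print(bins)   # [[0, 5], [10, 14, 15, 17], [23, 26, 27]]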

• Equal-depth (frequency) discretisation

Equal-frequency discretisation first sorts all the values in ascending order (if they are not already sorted) and then divides them into a user-defined number of intervals, called bins, so that each bin contains approximately the same number of values. This method scales data well, but managing categorical features can be tricky (Li, J 2019).
Example:

Data: 0, 5, 10, 14, 15, 17, 23, 26, 27

If the number of intervals is 3:

There are 9 values, so the number of values per bin is 9 / 3 = 3.

• Bin 1: 0, 5, 10
• Bin 2: 14, 15, 17
• Bin 3: 23, 26, 27
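
A corresponding Python sketch for equal-frequency binning (again a hypothetical helper, assuming the simple "sort and split into equal-sized groups" approach described above):

# Equal-depth (frequency) binning: sort, then split into groups of roughly equal size.
def equal_frequency_bins(values, n_bins):
    ordered = sorted(values)
    size = len(ordered) // n_bins
    bins = []
    for i in range(n_bins):
        start = i * size
        # The last bin absorbs any leftover values when the count does not divide evenly.
        end = (i + 1) * size if i < n_bins - 1 else len(ordered)
        bins.append(ordered[start:end])
    return bins

data = [0, 5, 10, 14, 15, 17, 23, 26, 27]
print(equal_frequency_bins(data, 3))
# [[0, 5, 10], [14, 15, 17], [23, 26, 27]]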

• Attribute summarization before discretisation

Attribute      Min    Q1     Median   Q3     Max
sepalLength    4.3    5.1    5.8      6.4    7.9
sepalWidth     2.0    2.8    3.0      3.3    4.4
petalLength    1.0    1.6    4.35     5.1    6.9
petalWidth     0.1    0.3    1.3      1.8    2.5

• Attribute summarization after equal-width discretisation

Attribute      Bin 1              Bin 2               Bin 3
sepalLength    [4.3-5.5]: 59      [5.5-6.7]: 71       [6.7-7.9]: 20
sepalWidth     [2-2.8]: 47        [2.8-3.6]: 88       [3.6-4.4]: 15
petalLength    [1-2.97]: 50       [2.97-4.93]: 54     [4.93-6.9]: 46
petalWidth     [0.1-0.9]: 50      [0.9-1.7]: 54       [1.7-2.5]: 46

• Attribute summarization after equal-depth (frequency) discretisation

Attribute      Bin 1              Bin 2               Bin 3
sepalLength    [4.3-4.5]: 52      [4.5-6.25]: 47      [6.25-7.9]: 51
sepalWidth     [2-2.85]: 47       [2.85-3.15]: 48     [3.15-4.4]: 55
petalLength    [1-2.45]: 50       [2.45-4.85]: 49     [4.85-6.9]: 51
petalWidth     [0.1-0.8]: 50      [0.8-1.65]: 52      [1.65-2.5]: 48
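
The summaries above could be checked in Python, for example with pandas. This is only a sketch: it assumes scikit-learn and pandas are available, and the exact counts may differ slightly from the tool used for the assignment because boundary handling varies between implementations.

# Bin the four Iris attributes into 3 equal-width and 3 equal-depth intervals.
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True).frame.iloc[:, :4]  # the four numeric attributes

for col in iris.columns:
    width_counts = pd.cut(iris[col], bins=3).value_counts().sort_index()   # equal width
    depth_counts = pd.qcut(iris[col], q=3).value_counts().sort_index()     # equal depth
    print(col)
    print(width_counts)
    print(depth_counts)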
Q2.

• Min-max normalisation

Min-max normalisation scales data into a fixed range, usually 0 to 1. The minimum value of the attribute is transformed into 0, the maximum value into 1, and every other value into a decimal between 0 and 1. It is defined as:

v' = \frac{v - \min_A}{\max_A - \min_A} \times (\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A

Min-max normalisation guarantees that all attributes have exactly the same scale, but it does not handle outliers well.

Example:

Data: 200, 300, 400, 600, 1000

Normalising the value 400 into the range [0, 1]:

v' = ((400 − 200) / (1000 − 200)) × (1 − 0) + 0
   = 0.25
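
A minimal Python sketch of min-max normalisation (hypothetical helper name), reproducing the value above:

# Min-max normalisation into [new_min, new_max] (default [0, 1]).
def min_max_normalise(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

data = [200, 300, 400, 600, 1000]
print(min_max_normalise(data))
# [0.0, 0.125, 0.25, 0.5, 1.0]  -> 400 maps to 0.25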
• Z-score normalisation

Z-score normalisation transforms the data by converting the values to a common scale. It calculates the statistical mean and standard deviation of the attribute values, then subtracts the mean from each value and divides the result by the standard deviation.

The transformation formula is:

v' = \frac{v - \mu_A}{\sigma_A}

μ_A = mean of the attribute

σ_A = standard deviation of the attribute

Z-score normalisation handles outliers better, but it does not produce normalised data on exactly the same scale.

Example:

If the mean (μ) of a set of marks is 13.25 and the standard deviation (σ) is 4.6, the z-score of the mark 8 is:

v' = (8 − 13.25) / 4.6
   = −1.14
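
A minimal Python sketch of z-score normalisation. The mean (13.25) and standard deviation (4.6) from the example are taken as given; the list version below uses the population standard deviation, which is an assumption.

import statistics

# Z-score for a single value with known mean and standard deviation.
def z_score(value, mean, std):
    return (value - mean) / std

print(round(z_score(8, 13.25, 4.6), 2))  # -1.14

# Z-score normalisation of a whole attribute.
def z_score_normalise(values):
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumption)
    return [(v - mean) / std for v in values]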
• Attribute summarization before normalisation

          sepalLength   sepalWidth   petalLength   petalWidth
Min       4.3           2.0          1.0           0.1
Q1        5.1           2.8          1.6           0.3
Median    5.8           3.0          4.35          1.3
Q3        6.4           3.3          5.1           1.8
Max       7.9           4.4          6.9           2.5

• Attribute summarization after min-max normalisation

          sepalLength   sepalWidth   petalLength   petalWidth
Min       0             0            0             0
Q1        0.22          0.33         0.10          0.083
Median    0.42          0.42         0.57          0.5
Q3        0.58          0.54         0.69          0.71
Max       1             1            1             1

• Attribute summarization after z-score normalisation

          sepalLength   sepalWidth   petalLength   petalWidth
Min       -1.86         -2.43        -1.56         -1.44
Q1        -0.90         -0.59        -1.22         -1.18
Median    -0.05         -0.12        0.34          0.13
Q3        0.67          0.57         0.76          0.79
Max       2.48          3.10         1.78          1.71
Bonus Q:
Discuss why the numbers of elements in the bins are not exactly the
same after equal depth discretisation.
I will explain this using the Iris data set. The Iris data set has 150 elements, so when it is discretised into 3 bins using equal-depth discretisation, each bin should receive 50 elements. After the discretisation, however, the numbers of elements in the bins are not exactly the same. This happens because some values are repeated in the data set. Ideally the first fifty sorted elements would be allocated to the 1st bin and the next fifty to the 2nd bin, but if the 51st element has the same value as the 50th, it is allocated to the 1st bin rather than the 2nd. The 1st bin then contains 51 elements and the 2nd bin only 49. For this reason the numbers of elements in the bins are not exactly the same after equal-depth discretisation, as illustrated in the sketch below.
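
The following is a small illustration of this tie effect (a sketch of the behaviour described above, not the exact algorithm of any particular tool): when the value at a bin boundary is repeated, the repeats stay in the earlier bin, so the bin sizes drift away from an exact even split.

# Equal-depth binning that keeps tied boundary values in the earlier bin.
def equal_depth_with_ties(values, n_bins):
    ordered = sorted(values)
    size = len(ordered) / n_bins
    bins, start = [], 0
    for i in range(1, n_bins + 1):
        end = round(i * size) if i < n_bins else len(ordered)
        # Extend the bin while the next value ties with the last value already in it.
        while end < len(ordered) and ordered[end] == ordered[end - 1]:
            end += 1
        bins.append(ordered[start:end])
        start = end
    return bins

data = [1, 2, 3, 3, 4, 5, 6, 7, 8]  # 9 values; the boundary value 3 is repeated
print([len(b) for b in equal_depth_with_ties(data, 3)])  # [4, 2, 3] instead of [3, 3, 3]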

Reference
Li, J 2019, Lecture 1: Data pre-processing, lecture notes, Data and Web Mining (COMP 4008), University of South Australia, delivered 4 March 2019.
