ISOM3360 Data Mining for Business Analytics

Data Preparation

Instructor: Rong Zheng


Department of ISOM
Fall 2018
Last lecture

Data (very importantly, variables)

Supervised vs. Unsupervised learning

This lecture

Data preparation

Data Mining Process

Business intelligence pyramid

[Figure: business intelligence pyramid, labeled "data preparation"]
Data Preparation Techniques

Handle categorical variables

Handle numeric variables

Handle missing data

Handle outliers

Etc.

Missing data
Missing data may be due to various reasons:

• data not entered due to misunderstanding

• certain data may not have been considered important at the time of entry

• data deleted accidentally

The fact that a value is missing may itself carry information

Other variables may contain information useful for inferring the missing value

How to handle missing data: Ignore

Ignore data instances that have missing values (may affect a lot of records)

Ignore attributes with missing values (may leave out important features)
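
The two "ignore" strategies can be sketched in plain Python as follows. This is a minimal illustration, assuming records are stored as dictionaries and a missing value is marked with `None`; the attribute names and values are made up for the example.

```python
# Toy records; None marks a missing value. All names and values are illustrative.
records = [
    {"age": 25, "income": 50000, "industry": "IT"},
    {"age": None, "income": 62000, "industry": "Finance"},
    {"age": 41, "income": None, "industry": "IT"},
    {"age": 33, "income": 58000, "industry": "Finance"},
]

def drop_incomplete_rows(rows):
    """Keep only the records in which every attribute has a value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def drop_incomplete_columns(rows):
    """Keep only the attributes that have a value in every record."""
    complete = [k for k in rows[0] if all(r[k] is not None for r in rows)]
    return [{k: r[k] for k in complete} for r in rows]

rows_kept = drop_incomplete_rows(records)
cols_kept = drop_incomplete_columns(records)
print(len(rows_kept), "of", len(records), "records survive row deletion")
print("attributes surviving column deletion:", list(cols_kept[0]))
```

In this toy data, row deletion discards half of the records and column deletion discards both numeric attributes, which is the trade-off the slide warns about.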

How to handle missing data: Infer

Use the attribute mean to fill in the missing value

Use the attribute mean of all samples belonging to the same class to fill in the missing value

Other, more sophisticated methods:

• Find the k nearest neighbors of the point and fill in their most frequent value or their average value
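
A rough sketch of the three inference strategies above, using only the standard library. The data, the attribute names, and the choice of height as the only distance feature are assumptions made for illustration; a real nearest-neighbor imputation would typically use several (normalized) attributes.

```python
from statistics import mean

# Toy data: (height_cm, weight_kg, class_label); None marks a missing weight.
samples = [
    (160.0, 55.0, "A"),
    (162.0, 58.0, "A"),
    (161.0, None, "A"),
    (180.0, 82.0, "B"),
    (183.0, 88.0, "B"),
]

known = [s for s in samples if s[1] is not None]
target = next(s for s in samples if s[1] is None)

# 1) Attribute mean over all samples with a known value.
overall_mean = mean(s[1] for s in known)

# 2) Attribute mean over the samples of the same class as the target.
class_mean = mean(s[1] for s in known if s[2] == target[2])

# 3) k nearest neighbors (distance measured on height only), averaging their weights.
def knn_impute(target_row, candidates, k=2):
    nearest = sorted(candidates, key=lambda s: abs(s[0] - target_row[0]))[:k]
    return mean(s[1] for s in nearest)

knn_value = knn_impute(target, known, k=2)
print(overall_mean, class_mean, knn_value)
```

Here the overall mean is pulled upward by the heavier class B samples, while the class mean and the nearest-neighbor estimate both stay close to the target's own group.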

Nearest neighbor

Outliers

Outliers are values that lie far away from the bulk of the data.

A common rule of thumb: anything more than 3 standard deviations away from the mean

But no rule can tell us whether an outlier is the result of an error.
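
The 3-standard-deviation rule of thumb might be coded as below. This is only a sketch: the return values are invented, and the use of the population standard deviation (`pstdev`) is an assumption, since the slide does not say which estimator is meant.

```python
from statistics import mean, pstdev

def flag_outliers(values, threshold=3.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing can be flagged
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Illustrative daily returns in percent; the last value is an extreme move.
returns = [0.1, -0.2, 0.3, 0.0, -0.1, 0.2, -0.3, 0.1, 0.0, -0.1,
           0.2, -0.2, 0.1, 0.0, 0.3, -0.1, 0.1, -0.3, 0.2, -20.0]
flagged = flag_outliers(returns)
print(flagged)
```

The function only flags values; whether a flagged value is an error or a genuine extreme event still has to be decided by someone who knows the data.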

How to handle outliers

Most likely: remove it

A simple model of daily stock market returns may include extreme moves such as Black Monday (1987), but might not model the breakdown of markets following the 9/11 attacks.

Handling categorical variables

Categorical variable

Size: small, medium, large

Industry: Finance, IT, etc...

Some data mining algorithms support categorical values without further manipulation, but many more do not.

Handling categorical variables

Automobile dataset [link]

Some variables in the dataset are categorical

Handling categorical variables: Find and Replace

Very simple: replace the words with numbers.

E.g., num_doors: two -> 2, four -> 4
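
A find-and-replace of this kind can be sketched with a lookup dictionary. The column values below are illustrative, not taken from the dataset.

```python
# Mapping from the word to the number it spells out.
NUM_DOORS = {"two": 2, "four": 4}

num_doors = ["two", "four", "four", "two"]  # illustrative column values
encoded = [NUM_DOORS[word] for word in num_doors]
print(encoded)  # [2, 4, 4, 2]
```

This works here because the words already denote numbers, so the replacement preserves their meaning.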

Handling categorical variables: Label Encoding
Label encoding converts each value in a column to a number.
• convertible -> 0
• hardtop -> 1
• hatchback -> 2
• sedan -> 3
• wagon -> 4

Advantage: straightforward

Disadvantage: the numeric values can be misinterpreted by the algorithms as an ordering. Is 0 worse than 4?

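
One way to produce the codes listed above is to number the distinct categories in alphabetical order. The sketch below assumes the column is called `body_style` and uses made-up rows; neither is stated on the slide.

```python
# Illustrative column values; the name body_style is an assumption.
body_style = ["sedan", "hatchback", "convertible", "wagon", "sedan", "hardtop"]

# Assign codes in alphabetical order of the distinct categories.
codes = {category: i for i, category in enumerate(sorted(set(body_style)))}
encoded = [codes[value] for value in body_style]
print(codes)
print(encoded)
```

The resulting integers are only identifiers: nothing about the cars makes a wagon (4) "more" than a convertible (0), which is exactly the misinterpretation risk noted above.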
Handling categorical variables: One-hot Encoding
One-hot encoding converts each category value into a new (dummy) column and assigns a 1 or 0 value to the column.

drive_wheels: rwd, fwd, 4wd

wheel_rwd  wheel_fwd  wheel_4wd
    1          0          0
    1          0          0
    1          0          0
    0          1          0
    0          0          1

Disadvantage: the number of columns grows greatly if there are many unique values in a column.

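
The dummy columns in the table above could be generated roughly as follows. The input rows are chosen to reproduce that table; the helper `one_hot` and its `prefix` parameter are hypothetical names for this sketch.

```python
drive_wheels = ["rwd", "rwd", "rwd", "fwd", "4wd"]
categories = ["rwd", "fwd", "4wd"]  # column order taken from the table above

def one_hot(values, categories, prefix="wheel"):
    """Return one dict per row with a 0/1 entry for each dummy column."""
    return [
        {f"{prefix}_{c}": int(v == c) for c in categories}
        for v in values
    ]

for row in one_hot(drive_wheels, categories):
    print(row)
```

Each row contains exactly one 1, and the number of columns equals the number of distinct categories, which is why the encoding becomes unwieldy for variables with many unique values.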
Exercise

Many stock price prediction models and firm performance prediction models use industry type as one of the variables.

The 2-digit SIC (Standard Industrial Classification) code defines around 80 different categories: Agriculture, Retail, Manufacturing, etc.

How would you handle such a categorical variable?

Handling numerical variables
Normalization (standardization) prevents attributes with large ranges from outweighing attributes with small ranges. It brings all variables to the same scale.

Do you apply normalization to the training set or the testing set?

Without normalization, a difference in one coordinate (in this example, weight) is insignificant compared to a change in the other coordinate (height).
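
The effect can be illustrated with a small distance computation. The three points and the units (weight in kilograms, height in millimetres) are assumptions chosen so that the two ranges differ; they are not taken from the slide's figure.

```python
import math

# Hypothetical people as (weight_kg, height_mm).
a = (60.0, 1700.0)
b = (90.0, 1705.0)   # very different weight, almost the same height
c = (61.0, 1800.0)   # almost the same weight, different height

def euclidean(p, q):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def min_max_scale(points):
    """Rescale every coordinate to [0, 1] using the min and max over `points`.

    Assumes each coordinate takes at least two distinct values.
    """
    lows = [min(col) for col in zip(*points)]
    highs = [max(col) for col in zip(*points)]
    return [
        tuple((x - lo) / (hi - lo) for x, lo, hi in zip(p, lows, highs))
        for p in points
    ]

print(euclidean(a, b), euclidean(a, c))      # raw: b looks much closer to a than c does
sa, sb, sc = min_max_scale([a, b, c])
print(euclidean(sa, sb), euclidean(sa, sc))  # scaled: the two distances are comparable
```

On the raw data the 30 kg weight gap counts for less than the 100 mm height gap; after rescaling, both differences contribute on the same footing.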

Handling numerical variables: min-max

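Transform the data from measured units to a new interval (commonly [0,1]), from \(\mathit{new\_min}_v\) to \(\mathit{new\_max}_v\), for variable v:

\[ v' = \frac{v - \mathit{min}_v}{\mathit{max}_v - \mathit{min}_v}\,(\mathit{new\_max}_v - \mathit{new\_min}_v) + \mathit{new\_min}_v \]

where \(\mathit{min}_v\) and \(\mathit{max}_v\) are the minimum and maximum observed values of v.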

Min-max normalization: example

Suppose that the minimum and maximum values for the income variable are $12,000 and $98,000, respectively. We would like to map income to the range [0.0,1.0]. By min-max normalization, a value of $73,600 for income is transformed to:

\[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0.0) + 0.0 \approx 0.716 \]
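
The income example can be checked with a few lines of Python. The function name `min_max` and its parameter names are chosen for this sketch only.

```python
def min_max(v, v_min, v_max, new_min=0.0, new_max=1.0):
    """Map v from the interval [v_min, v_max] to [new_min, new_max]."""
    if v_max == v_min:
        raise ValueError("v_max must differ from v_min")
    return (v - v_min) / (v_max - v_min) * (new_max - new_min) + new_min

scaled_income = min_max(73_600, 12_000, 98_000)
print(round(scaled_income, 3))  # 0.716
```

The observed minimum maps to `new_min` and the observed maximum to `new_max`, so every value in between lands inside the target interval.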

Handling numerical variables: z-score

Transform the data by converting the values to a common scale with a mean of 0 and a standard deviation of 1. A value \(v\) of a variable is normalized to \(v'\) by computing:

\[ v' = \frac{v - \bar{v}}{\sigma_v} \]

where \(\bar{v}\) and \(\sigma_v\) are the mean and standard deviation of the variable, respectively.

z-score normalization: example

Suppose that the mean and standard deviation of the values for the income variable are $54,000 and $16,000, respectively. With z-score normalization, a value of $73,600 for income is transformed to:

\[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]
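
The same computation as a short Python sketch. The function name `z_score` is hypothetical, the three-value income column is invented, and the use of the population standard deviation (`pstdev`) for the column is an assumption, since the slide does not specify the estimator.

```python
from statistics import mean, pstdev

def z_score(v, mu, sigma):
    """Standardize v given the mean mu and standard deviation sigma."""
    if sigma == 0:
        raise ValueError("sigma must be non-zero")
    return (v - mu) / sigma

# The slide's example: mean $54,000, standard deviation $16,000.
z = z_score(73_600, 54_000, 16_000)
print(z)

# Standardizing a whole (illustrative) column.
incomes = [38_000, 54_000, 70_000]
mu, sigma = mean(incomes), pstdev(incomes)
standardized = [z_score(v, mu, sigma) for v in incomes]
print(standardized)
```

After the transformation the column has mean 0 and standard deviation 1, so a value of 1.225 reads directly as "1.225 standard deviations above the mean".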

Summary

Handle missing data

Handle outliers

Handle categorical variables

Handle numeric variables
