3 Data Preparation

ISOM3360 Data Mining for Business Analytics
Data Preparation
Instructor: Rong Zheng

Department of ISOM
Fall 2018
Last lecture
Data (very importantly, variables)
Supervised vs. Unsupervised learning
This lecture
Data preparation
Data Mining Process
Business intelligence pyramid
[Figure: business intelligence pyramid, with the data preparation layer labeled]
Data Preparation Techniques
Handle categorical variables
Handle numeric variables
Handle missing data
Outliers
Etc.
Missing data
Missing data may be due to various reasons:
•data not entered due to misunderstanding
•certain data not considered important at the time of entry
•data deleted accidentally
Missing data may carry some information content
Other variables may contain useful information
How to handle missing data: Ignore
Ignore data instances that have missing values (may affect a lot of records)
Ignore attributes with missing values (may leave out important features)
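The two "ignore" strategies can be sketched as follows. This is a minimal illustration in plain Python; the records, the attribute names, and the helper names `drop_incomplete_rows` and `drop_incomplete_attributes` are hypothetical and not part of the lecture.

```python
# Hypothetical toy records; None marks a missing value.
rows = [
    {"age": 25, "income": 40_000, "city": "HK"},
    {"age": None, "income": 52_000, "city": "SZ"},
    {"age": 41, "income": None, "city": "HK"},
    {"age": 33, "income": 61_000, "city": "GZ"},
]

def drop_incomplete_rows(rows):
    """Ignore data instances: keep only rows with no missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def drop_incomplete_attributes(rows):
    """Ignore attributes: drop every attribute that is missing in any row."""
    incomplete = {k for r in rows for k, v in r.items() if v is None}
    return [{k: v for k, v in r.items() if k not in incomplete} for r in rows]

complete_rows = drop_incomplete_rows(rows)        # half of the records are lost
complete_attrs = drop_incomplete_attributes(rows)  # only "city" survives
```

In this toy data both strategies are costly: dropping rows discards half of the records, and dropping attributes discards both numeric features.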
How to handle missing data: Infer
Use the attribute mean to fill in the missing value
Use the attribute mean of all samples belonging to the same class to fill in the missing value
Other more sophisticated methods
•Find the k nearest neighbors of the point and fill in their most frequent value or their average value
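The two mean-based options can be sketched as below. This is an illustration under assumptions, not the lecture's own code: the records, the attribute names `cls` and `income`, and both helper functions are hypothetical.

```python
from statistics import mean

# Hypothetical toy records; None marks a missing income value.
records = [
    {"cls": "A", "income": 40.0},
    {"cls": "A", "income": None},
    {"cls": "A", "income": 60.0},
    {"cls": "B", "income": 100.0},
    {"cls": "B", "income": None},
]

def impute_mean(rows, attr):
    """Fill missing values of attr with the mean of all observed values."""
    fill = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: fill if r[attr] is None else r[attr]} for r in rows]

def impute_class_mean(rows, attr, class_attr):
    """Fill missing values of attr with the mean observed within the same class.

    Assumes every class has at least one observed value for attr.
    """
    by_class = {}
    for r in rows:
        if r[attr] is not None:
            by_class.setdefault(r[class_attr], []).append(r[attr])
    class_means = {c: mean(vals) for c, vals in by_class.items()}
    return [
        {**r, attr: class_means[r[class_attr]] if r[attr] is None else r[attr]}
        for r in rows
    ]

overall = impute_mean(records, "income")               # both gaps get 200/3
per_class = impute_class_mean(records, "income", "cls")  # gaps get 50.0 and 100.0
```

The class-conditional version fills the two gaps with different values because it only averages over records of the same class.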
Nearest neighbor
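A nearest-neighbor fill can be sketched as follows. The function `knn_impute`, its parameters, and the toy data are hypothetical; the sketch assumes numeric features on comparable scales and uses plain Euclidean distance, which is one possible choice among several.

```python
import math
from collections import Counter
from statistics import mean

def knn_impute(rows, target, features, k=3, categorical=False):
    """Fill missing `target` values from the k nearest complete rows.

    Numeric targets get the neighbors' average; categorical targets get
    the neighbors' most frequent value.
    """
    complete = [r for r in rows if r[target] is not None]
    filled = []
    for r in rows:
        if r[target] is not None:
            filled.append(dict(r))
            continue
        point = [r[f] for f in features]
        nearest = sorted(
            complete,
            key=lambda c: math.dist(point, [c[f] for f in features]),
        )[:k]
        values = [n[target] for n in nearest]
        fill = Counter(values).most_common(1)[0][0] if categorical else mean(values)
        filled.append({**r, target: fill})
    return filled

# Hypothetical data: the last row is close to the first three, far from the fourth.
rows = [
    {"x": 1.0, "y": 1.0, "size": 10.0},
    {"x": 1.2, "y": 0.9, "size": 12.0},
    {"x": 0.9, "y": 1.1, "size": 14.0},
    {"x": 8.0, "y": 8.0, "size": 90.0},
    {"x": 1.1, "y": 1.0, "size": None},
]
result = knn_impute(rows, "size", ["x", "y"], k=3)  # average of 10, 12, 14
```

Because the distant row is not among the three nearest neighbors, its value of 90 does not influence the fill.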
Outliers
Outliers are values that lie far away from the bulk of the data.
A common rule of thumb: anything more than 3 standard deviations away from the mean
But no rule can tell us whether an outlier is the result of an error.
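The 3-standard-deviation rule of thumb can be sketched as below. The function name `flag_outliers`, the use of the population standard deviation, and the sample values are assumptions made for illustration.

```python
from statistics import mean, pstdev

def flag_outliers(values, threshold=3.0):
    """Return one boolean per value: True if it lies more than
    `threshold` standard deviations away from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return [False] * len(values)
    return [abs(v - mu) / sigma > threshold for v in values]

# Hypothetical sample: 19 ordinary values and one extreme value.
values = [9.0, 10.0, 11.0] * 6 + [10.0, 100.0]
flags = flag_outliers(values)
outliers = [v for v, is_out in zip(values, flags) if is_out]
```

The rule only flags the value; deciding whether it is an error or a genuine extreme observation still requires domain knowledge.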
How to handle outliers
Most likely: remove it
A simple model of daily stock market returns may include extreme moves such as Black Monday (1987), but might not model the breakdown of markets following the 9/11 attacks.
Handling categorical variables
Categorical variable
•Size: small, medium, large
•Industry: Finance, IT, etc.
Some data mining algorithms can support categorical values without further manipulation, but many more algorithms cannot.
Handling categorical variables
Automobile dataset [link]
Some variables in the dataset are categorical
Handling categorical variables: Find and Replace
Very simple: replace words with numbers.
E.g. num_doors: two -> 2, four -> 4
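Find and replace amounts to a lookup table from words to numbers. A minimal sketch, assuming a hypothetical `num_doors` column held as a Python list:

```python
# Hypothetical column values from an automobile-style dataset.
num_doors = ["two", "four", "four", "two"]

# The replacement table: each word maps to the number it spells.
word_to_number = {"two": 2, "four": 4}

encoded = [word_to_number[w] for w in num_doors]
```

This works well here because the words already denote quantities, so the resulting numbers keep a meaningful order.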
Handling categorical variables: Label Encoding
Label encoding converts each value in a column to a number.
•convertible -> 0
•hardtop -> 1
•hatchback -> 2
•sedan -> 3
•wagon -> 4
Advantage: straightforward
Disadvantage: the numeric values can be misinterpreted by the algorithms as having an order. Is 0 worse than 4?
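Label encoding can be sketched as below. The helper `label_encode` is hypothetical; it assigns codes in alphabetical order, which happens to reproduce the mapping listed on the slide, but other orderings are equally valid.

```python
def label_encode(values):
    """Map each distinct category to an integer code (alphabetical order)."""
    categories = sorted(set(values))
    mapping = {category: code for code, category in enumerate(categories)}
    return [mapping[v] for v in values], mapping

# Hypothetical body-style column.
body_style = ["sedan", "wagon", "convertible", "hatchback", "hardtop", "sedan"]
codes, mapping = label_encode(body_style)
```

The codes are arbitrary identifiers: nothing about the data says that a wagon (4) is "more" than a convertible (0), which is exactly the misinterpretation risk noted above.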
Handling categorical variables: One-hot Encoding
Convert each category value into a new (dummy) column and assign a 1 or 0 value to the column.
drive_wheels: rwd, fwd, 4wd

drive_wheels  wheel_rwd  wheel_fwd  wheel_4wd
rwd           1          0          0
rwd           1          0          0
rwd           1          0          0
fwd           0          1          0
4wd           0          0          1

Disadvantage: the number of columns expands greatly if there are many unique values in a column.
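One-hot encoding can be sketched as follows. The helper `one_hot` and its parameters are hypothetical; the column names follow the `wheel_` prefix used in the slide's table.

```python
def one_hot(values, prefix, categories=None):
    """Turn one categorical column into one 0/1 dummy column per category."""
    if categories is None:
        categories = sorted(set(values))
    return [
        {f"{prefix}_{c}": int(v == c) for c in categories}
        for v in values
    ]

# Same five rows as in the table on the slide.
drive_wheels = ["rwd", "rwd", "rwd", "fwd", "4wd"]
encoded = one_hot(drive_wheels, "wheel", categories=["rwd", "fwd", "4wd"])
```

Each row has exactly one column set to 1, and a column with n unique values produces n new columns, which is where the expansion problem comes from.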
Exercise
Many stock price prediction models and firm performance prediction models use industry type as one of the variables.
The 2-digit SIC (Standard Industrial Classification) code defines around 80 different categories: Agriculture, Retail, Manufacturing, etc.
How would you handle such a categorical variable?
Handling numerical variables
Normalization (standardization) helps to prevent attributes with large ranges from outweighing attributes with small ranges. It brings all variables to the same scale.
Do you apply normalization to the training set or the testing set?
[Figure note: a difference in one coordinate (in this example, weight) is insignificant compared to a change in the other coordinate (height).]
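The effect of unequal ranges on a distance computation can be illustrated with a small sketch. The attributes (age and income), the three points, and the assumed age range are hypothetical; the income range reuses the $12,000 to $98,000 figures from the min-max example in this lecture.

```python
import math

def min_max(v, lo, hi):
    """Rescale v from [lo, hi] to [0, 1]."""
    return (v - lo) / (hi - lo)

# Hypothetical people described as (age in years, income in dollars).
a = (30, 50_000)
b = (60, 50_500)   # 30 years older than a, almost the same income
c = (31, 52_000)   # 1 year older than a, $2,000 more income

# On raw values, income dominates: a looks closer to b than to c.
raw_ab = math.dist(a, b)
raw_ac = math.dist(a, c)

AGE_RANGE = (20, 70)             # assumed range, for illustration only
INCOME_RANGE = (12_000, 98_000)  # range from the lecture's income example

def scale(p):
    return (min_max(p[0], *AGE_RANGE), min_max(p[1], *INCOME_RANGE))

# After scaling, both attributes contribute comparably: a is closer to c.
scaled_ab = math.dist(scale(a), scale(b))
scaled_ac = math.dist(scale(a), scale(c))
```

Before scaling, a 30-year age gap counts for less than a $2,000 income gap; after scaling, the ordering of the two distances flips.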
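Handling numerical variables: min-max
Normalization (standardization) helps to prevent attributes with large ranges from outweighing attributes with small ranges.
Transform the data from measured units to a new interval [new_min_v, new_max_v] (commonly [0, 1]) for variable v. A value v is mapped to v′ by computing:
$$v' = \frac{v - \min_v}{\max_v - \min_v}\,(\mathit{new\_max}_v - \mathit{new\_min}_v) + \mathit{new\_min}_v$$
where min_v and max_v are the minimum and maximum observed values of the variable.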
Min-max normalization: example
Suppose that the minimum and maximum values for the income variable are $12,000 and $98,000, respectively. We would like to map income to the range [0.0, 1.0]. By min-max normalization, a value of $73,600 for income is transformed to:
(73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0.0) + 0.0 = 0.716
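The income example can be checked with a few lines of Python. The function name `min_max_normalize` and its parameter names are hypothetical; the numbers are the ones from the example above.

```python
def min_max_normalize(v, old_min, old_max, new_min=0.0, new_max=1.0):
    """Map v from [old_min, old_max] onto [new_min, new_max]."""
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

scaled_income = min_max_normalize(73_600, 12_000, 98_000)
print(round(scaled_income, 3))  # 0.716
```

The minimum maps to 0.0 and the maximum to 1.0, so every observed value lands inside the target interval.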
Handling numerical variables: z-score
Transform the data by converting the values to a common scale with a mean of 0 and a standard deviation of 1. A value v of a variable is normalized to v′ by computing:
$$v' = \frac{v - \bar{v}}{\sigma_v}$$
where $\bar{v}$ and $\sigma_v$ are the mean and standard deviation of that variable, respectively.
z-score normalization: example
Suppose that the mean and standard deviation of the values for the income variable are $54,000 and $16,000, respectively. With z-score normalization, a value of $73,600 for income is transformed to:
(73,600 − 54,000) / 16,000 = 1.225
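The same check for the z-score example, as a minimal sketch. The function name `z_score` and its parameter names are hypothetical; the numbers are the ones from the example above.

```python
def z_score(v, mean, std):
    """Express v as a number of standard deviations from the mean."""
    return (v - mean) / std

z_income = z_score(73_600, 54_000, 16_000)
print(round(z_income, 3))  # 1.225
```

A value equal to the mean maps to 0, and the sign of the result shows whether the value lies above or below the mean.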
Summary
Handle missing data
Outliers
Handle categorical variables
Handle numeric variables
