Data Preparation
This lecture
Data preparation
Data Mining Process
Business intelligence pyramid
data preparation
Data Preparation Techniques
Outliers
Etc...
Missing data
Missing data may be due to various reasons, for example:
•data not entered due to misunderstanding
•data deleted accidentally
How to handle missing data: Ignore
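One common reading of "ignore" is to drop every record that has a missing value. The sketch below assumes that reading; the records and field names are hypothetical, and `None` stands for a missing entry.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 45, "income": None},
    {"age": 29, "income": 48000},
]

def drop_incomplete(rows):
    """Return only the rows in which every field has a value."""
    return [row for row in rows if all(v is not None for v in row.values())]

complete = drop_incomplete(records)
print(len(records), "records before,", len(complete), "after")
```

Dropping rows is simple but discards the observed values in those rows too, so it is most reasonable when only a small share of records is affected.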
How to handle missing data: Infer
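Inferring a missing value can be done in several ways. As a minimal sketch, the example below assumes the simplest one, filling each gap with the mean of the observed values in the same column. The helper name `impute_mean` and the sample column are illustrative only.

```python
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

ages = [34, None, 45, 29]   # hypothetical column with one gap
filled = impute_mean(ages)
print(filled)
```

For skewed columns the median may be a safer fill value than the mean, since it is less affected by extreme values.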
Nearest neighbor
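A nearest-neighbor fill can be sketched as follows: for a record with a missing attribute, find the most similar complete record and copy its value. This sketch assumes Euclidean distance over the other attributes and a single neighbor. The function name, field names and data are hypothetical.

```python
import math

def fill_from_nearest(rows, target, features):
    """Copy the target value from the closest complete row (Euclidean distance)."""
    complete = [r for r in rows if r[target] is not None]
    result = []
    for row in rows:
        if row[target] is None:
            point = [row[f] for f in features]
            nearest = min(
                complete,
                key=lambda c: math.dist(point, [c[f] for f in features]),
            )
            row = {**row, target: nearest[target]}  # copy; input row is not modified
        result.append(row)
    return result

people = [
    {"age": 25, "years_employed": 2, "income": 30000},
    {"age": 40, "years_employed": 15, "income": 60000},
    {"age": 27, "years_employed": 3, "income": None},
]
filled_people = fill_from_nearest(people, "income", ["age", "years_employed"])
print(filled_people[2])
```

The distance is only meaningful when the features are on comparable scales, so in practice they would usually be normalized first.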
Outliers
Outliers are values that lie far away from the bulk of the data.
How to handle outliers
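To make "far away from the bulk of the data" concrete, a detection rule is needed. The sketch below assumes the widely used 1.5 × IQR rule, which is one possible choice and not necessarily the one intended here, and shows two ways of handling the flagged values: removing them or clipping them to the bounds. The data are made up.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return (low, high) limits: k * IQR below Q1 and above Q3."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

readings = [10, 12, 11, 13, 12, 11, 95]   # hypothetical measurements
low, high = iqr_bounds(readings)

outliers = [v for v in readings if v < low or v > high]
kept = [v for v in readings if low <= v <= high]        # option 1: remove
clipped = [min(max(v, low), high) for v in readings]    # option 2: clip to bounds
print(outliers, kept, clipped)
```

Whether to remove, clip or keep an outlier depends on whether it is an error or a genuine but rare observation.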
Handling categorical variables
Categorical variable
Handling categorical variables: Find and Replace
Handling categorical variables: Label Encoding
Label encoding converts each distinct value in a column
to a number.
•convertible -> 0
•hardtop -> 1
•hatchback -> 2
•sedan -> 3
•wagon -> 4
Advantage: straightforward
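The mapping listed above assigns codes in alphabetical order of the categories. A minimal sketch that reproduces it is shown below; the helper name `label_encode` and the sample column are illustrative.

```python
def label_encode(values):
    """Map each distinct value to an integer, in sorted (alphabetical) order."""
    categories = sorted(set(values))
    mapping = {category: code for code, category in enumerate(categories)}
    return [mapping[v] for v in values], mapping

# Hypothetical column of car body styles.
body_styles = ["sedan", "convertible", "wagon", "hatchback", "hardtop", "sedan"]
codes, mapping = label_encode(body_styles)
print(mapping)
print(codes)
```

Note that the resulting numbers imply an order (wagon > sedan > ...) that the categories do not have, which some algorithms may misinterpret.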
1 0 0
1 0 0
1 0 0
0 1 0
0 0 1
Handling numerical variables
Normalization (standardization) prevents attributes with
large ranges from outweighing attributes with small
ranges. It brings all variables to the same scale.
Handling numerical variables: min-max
Min-max normalization: example
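As a worked illustration, min-max normalization rescales a value v to v' = (v - min) / (max - min) × (new_max - new_min) + new_min. The sketch below applies this standard formula with a target range of [0, 1]. The function name and the income figures are hypothetical.

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so that min -> new_min and max -> new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot rescale a constant column")
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

incomes = [12000, 30000, 73600, 98000]   # hypothetical attribute values
scaled = min_max(incomes)
print([round(s, 3) for s in scaled])
```

Because the result depends on the minimum and maximum, a single outlier can compress all other values into a narrow band.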
Handling numerical variables: z-score
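Z-score normalization rescales a value v to v' = (v - mean) / standard deviation, so the result has mean 0 and standard deviation 1. The sketch below assumes the population standard deviation (`statistics.pstdev`); using the sample standard deviation instead is an equally common choice. The function name and data are illustrative.

```python
from statistics import mean, pstdev

def z_score(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mu) / sigma for v in values]

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values: mean 5, std 2
standardized = z_score(scores)
print(standardized)
```

Unlike min-max, z-score normalization does not map values to a fixed interval, which makes it less sensitive to a single extreme value.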
Summary
Outliers