Cleaning and Transforming Data

Data Cleaning
Wajda Wikhamn, Associate Professor

Department of Business Administration
School of Business, Economics and Law
University of Gothenburg
Email: wajda.wikhamn@handels.gu.se
www.handels.gu.se
Lecture outline
1. Graphical examination.
2. Identify and evaluate missing values.
3. Identify and deal with outliers.

4. Check whether statistical assumptions are met.
5. Dummy variables.
6. Data Transformation.
1. Graphical Examination
1. Univariate profiling
The shape: histogram
This is the distribution for the HBAT database variable X19 - Satisfaction.
[Figure: histogram of X19 - Satisfaction; x-axis: X19 - Satisfaction (4.50 to 10.00), y-axis: Frequency (0 to 30); Mean = 6.92, Std. Dev = 1.19, N = 100]
2016-11-16
How to get such a graph?
Database: HBAT
In SPSS, select Analyze > Descriptive Statistics > Frequencies.
Move the variable(s) of interest into the Variables box.
Press Charts, then tick Histograms and Show normal curve.
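Outside SPSS, the same univariate profile can be approximated in a few lines of Python. The sketch below uses only the standard library. The `scores` list is made-up illustrative data (not the HBAT values), and the bin start (4.5), bin width (0.5) and the helper name `histogram_counts` are assumptions chosen to mirror the axis of the histogram on the slide.

```python
import math
import statistics

def histogram_counts(values, start, width, n_bins):
    """Count the values falling in each bin [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        i = math.floor((v - start) / width)
        if 0 <= i < n_bins:
            counts[i] += 1
    return counts

# Hypothetical satisfaction scores (illustrative only, not the HBAT data).
scores = [5.2, 6.1, 6.8, 7.0, 7.0, 7.4, 7.9, 8.3, 6.6, 9.1]

counts = histogram_counts(scores, start=4.5, width=0.5, n_bins=11)
mean = statistics.mean(scores)
sd = statistics.stdev(scores)  # sample standard deviation (n - 1 in the denominator)

# Text histogram: one row per bin, one '#' per observation.
for i, c in enumerate(counts):
    lower = 4.5 + i * 0.5
    print(f"{lower:5.2f} | {'#' * c}")
print(f"Mean = {mean:.2f}, Std. Dev = {sd:.2f}, N = {len(scores)}")
```

Reading the rows top to bottom shows where the observations concentrate and whether one tail is longer than the other.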
2. Bivariate profiling
a) The relationship: scatterplot
Shows strength and direction (positive, negative, no relationship, nonlinear relationship)
[Figure: scatterplots of variables X6, X7, X8, X12 and X13]
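A numeric companion to the scatterplot is the Pearson correlation, whose sign gives the direction and whose magnitude gives the strength of a linear relationship. The following is a minimal sketch with invented data; the function name `pearson_r` and all values are assumptions for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
positive = [2, 4, 5, 4, 6]   # tends to rise with x
negative = [6, 4, 5, 3, 2]   # tends to fall with x
u_shaped = [4, 1, 0, 1, 4]   # exact but nonlinear relationship: (x - 3) ** 2

r_pos = pearson_r(x, positive)
r_neg = pearson_r(x, negative)
r_u = pearson_r(x, u_shaped)
print(round(r_pos, 2), round(r_neg, 2), round(r_u, 2))
```

The U-shaped pair has a correlation of zero even though the two variables are perfectly related, which is why the scatterplot should be inspected and not only the coefficient.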
b) Group differences: boxplot
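A boxplot draws the minimum, quartiles, median and maximum of a metric variable for each group. As a rough sketch, these quantities can be computed per group as follows; the group data are invented, and the 1.5 x IQR fence is the common boxplot convention, assumed here, not something stated on the slide.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the quantities a boxplot draws."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

# Hypothetical metric outcome split by a nonmetric grouping variable.
groups = {
    "Group A": [4, 5, 5, 6, 7, 8, 9],
    "Group B": [6, 7, 8, 8, 9, 10, 15],
}

for name, values in groups.items():
    lo, q1, med, q3, hi = five_number_summary(values)
    iqr = q3 - q1
    # Points beyond 1.5 * IQR from the box are usually drawn as individual markers.
    flagged = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]
    print(name, (lo, q1, med, q3, hi), "flagged:", flagged)
```

Comparing the medians and box heights across groups gives a first impression of group differences and of unequal spread.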
2. Missing data
The impact of missing data
Reduction of the sample size included in the analysis (practical impact)
Biased results when the missing data are nonrandom (substantive impact)
Strategies for handling missing data
...
use observations with complete data only;

delete case(s) and/or variable(s);
estimate (impute) missing values.
Identifying missing data and applying remedies
1. Determine the type of missing data
Ignorable missing data: expected and part of the research design
Nonsampled observations (probability sampling)
Skip patterns
Not ignorable missing data: remedies are needed
Known missing: errors in data entry, failure to complete the questionnaire
Unknown missing: refusal to respond to some questions
2. A. Determine the extent of missing data: tabulate the percentage of variables with missing data for each case and the number of cases with missing data for each variable
Missing data under 10% for an individual case (respondent) or variable can generally be ignored
The nonmissing (complete) sample size must be sufficient for the analysis
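The tabulation in step 2. A can be sketched in plain Python. The data set, the variable names `X1` to `X4` and the use of `None` as the missing-value marker are all assumptions made for this illustration.

```python
# Hypothetical data set: each row is a case (respondent), None marks a missing value.
data = [
    {"X1": 3.0, "X2": 5.0,  "X3": 1.0,  "X4": 2.0},
    {"X1": 4.0, "X2": None, "X3": 2.0,  "X4": 3.0},
    {"X1": 2.0, "X2": None, "X3": None, "X4": 1.0},
    {"X1": 5.0, "X2": 4.0,  "X3": 3.0,  "X4": 4.0},
    {"X1": 1.0, "X2": 2.0,  "X3": 2.0,  "X4": None},
]
variables = list(data[0])

# Per case: percentage of variables with missing data.
pct_missing_by_case = [
    100 * sum(row[v] is None for v in variables) / len(variables) for row in data
]

# Per variable: number and percentage of cases with missing data.
n_missing_by_variable = {v: sum(row[v] is None for row in data) for v in variables}
pct_missing_by_variable = {v: 100 * n / len(data) for v, n in n_missing_by_variable.items()}

# Complete cases: what would remain under a complete-case approach.
n_complete = sum(all(row[v] is not None for v in variables) for row in data)

print(pct_missing_by_case)
print(n_missing_by_variable)
print(pct_missing_by_variable)
print(f"{n_complete} of {len(data)} cases are complete")
```

The two tabulations answer different questions: the per-case view identifies respondents who left much of the questionnaire blank, while the per-variable view identifies problematic questions.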
Missing data Example
2. B. Deleting individual cases or variables
A variable with 15% or more missing data is a candidate for deletion
The overall decrease in missing data should be large enough to justify the deletion
Cases with missing data on the dependent variable are deleted to avoid an artificial relationship
When deleting a variable, alternative, similar variables should be included in the analysis
Perform the analysis with and without the deleted cases or variables to identify differences
3. Diagnose the randomness of missing data
Levels of randomness of missing data (WHY data are missing)
Example: Suppose you are modeling weight (Y) as a function of gender (X). Some respondents would not disclose their weight, so you are missing some values for Y. There are three possible mechanisms for the nondisclosure:
There may be no particular reason why some respondents told you their weights and others didn't. That is, the probability that Y is missing may have no relationship to X or Y. Such data are said to be missing completely at random (MCAR).
One gender may be less likely to disclose its weight. That is, the
probability that Y is missing depends only on the value of X. Such
data are missing at random (MAR).
Overweight or underweight people may be less likely to disclose their weight. That is, the probability that Y is missing depends on the unobserved value of Y itself. Such data are not missing at random (NMAR).
Diagnostic tests for levels of randomness
Form two groups, observations with missing Y and observations without missing Y, and compare them on the other variables (percentages or t-test)
Run an overall test of randomness (MCAR), e.g., Little's MCAR test
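The two-group comparison can be sketched as follows. The example assumes a second, metric variable (age) that is not part of the weight and gender example on the slides, and it uses Welch's version of the t-test, which does not assume equal variances; both choices, the data and the name `welch_t` are illustrative assumptions.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two independent groups."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical cases: (age, weight); weight is None when it was not disclosed.
cases = [(25, 70.0), (31, None), (44, 82.0), (52, None), (38, 75.0),
         (60, None), (29, 68.0), (57, None), (35, 90.0), (48, 77.0)]

age_when_y_missing = [age for age, weight in cases if weight is None]
age_when_y_observed = [age for age, weight in cases if weight is not None]

t, df = welch_t(age_when_y_missing, age_when_y_observed)
print(f"t = {t:.2f}, df = {df:.1f}")
# A large |t| suggests that missingness on Y is related to age, i.e. the data are not MCAR.
```

For a nonmetric variable such as gender, the corresponding check compares the percentage of missing Y values between the categories.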
4. Select the imputation method
Imputation is the process of estimating the missing value based on valid values of other variables and/or cases in the sample.
Used only for metric variables
If missing data are MCAR: use only valid data or use replacements for the missing data
Valid data: complete case approach (LISTWISE) or all-available approach (PAIRWISE)
Replacement:
Known values: hot or cold deck imputation; case substitution
Calculated replacements: mean substitution or regression imputation
(predicted Y = b0 + b1X1 + b2X2 + b3X3 + b4X4)
Y is the variable with the missing value
For the case with missing Y (e.g., case 23), we substitute its valid values for X1, X2, X3 and X4 into the equation to obtain the replacement value.
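The two calculated replacements can be compared in a small sketch. To stay within the standard library it uses a single predictor, a simplification of the four-predictor equation; the data and the helper name `fit_simple_regression` are hypothetical.

```python
import statistics

def fit_simple_regression(x, y):
    """Least-squares estimates (b0, b1) for predicted y = b0 + b1 * x."""
    mx, my = statistics.mean(x), statistics.mean(y)
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - b1 * mx, b1

# Hypothetical cases (x, y); y is None where the value is missing.
cases = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0), (5.0, None)]
complete = [(x, y) for x, y in cases if y is not None]
xs = [x for x, _ in complete]
ys = [y for _, y in complete]

# Mean substitution: every missing y gets the mean of the valid y values.
y_mean = statistics.mean(ys)
mean_imputed = [(x, y if y is not None else y_mean) for x, y in cases]

# Regression imputation: fit on the complete cases, then predict y from the valid x.
b0, b1 = fit_simple_regression(xs, ys)
regression_imputed = [(x, y if y is not None else b0 + b1 * x) for x, y in cases]

print("mean substitution:    ", mean_imputed[-1])
print("regression imputation:", regression_imputed[-1])
```

In this constructed example the mean replacement ignores the clear upward trend in Y, while the regression replacement follows it, which illustrates why the two methods can give very different values for the same case.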
If missing data are nonrandom or MAR: use a specifically designed modeling approach (EM method; ML estimation), or define the observations with missing data as a subset of the sample and include them in the analysis
Imputation of Missing Data
Under 10%: Any of the imputation methods can be applied when missing data is this low, although the complete case method has been shown to be the least preferred.
10 to 20%: The increased presence of missing data makes the all-available, hot deck, case substitution and regression methods most preferred for MCAR data, while model-based methods are necessary with MAR missing data processes.
Over 20%: If it is necessary to impute missing data when the level is over 20%, the preferred methods are:
o the regression method for MCAR situations, and
o model-based methods when MAR missing data occurs.
3. Outliers
Outliers are observations with a unique combination of characteristics identifiable as distinctly different (unusually high or low) from the other observations.
Why do they occur?
Procedural error: data entry or coding
Extraordinary event or observation
Detecting outliers
Univariate detection: the distribution of each variable
Standardized scores
Small samples (80 or fewer): an absolute z score of 2.5 or larger
Larger samples (more than 80): an absolute z score of 3 (or 4) or larger
Bivariate detection: pairs of variables
Scatterplot with an ellipse representing a normal distribution (confidence interval at a specified alpha level)
Multivariate detection: Mahalanobis D² (a multivariate assessment that measures the distance of a case from the multidimensional mean of a distribution across a set of variables)
Chi-square test for D²: a p-value below .005 or .001 indicates an outlier
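The univariate and multivariate rules can be contrasted in a short sketch. It is restricted to two variables so that the covariance matrix can be inverted by hand; with two variables, the chi-square upper-tail probability has the closed form exp(-D²/2). The data and function names are invented for illustration.

```python
import math
import statistics

def univariate_outliers(values):
    """Flag cases by z score: |z| >= 2.5 if n <= 80, otherwise |z| >= 3."""
    m, s = statistics.mean(values), statistics.stdev(values)
    cutoff = 2.5 if len(values) <= 80 else 3.0
    return [i for i, v in enumerate(values) if abs((v - m) / s) >= cutoff]

def mahalanobis_d2(x, y):
    """Squared Mahalanobis distance of each case from the centroid, for TWO variables."""
    n = len(x)
    mx, my = statistics.mean(x), statistics.mean(y)
    sxx, syy = statistics.variance(x), statistics.variance(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    det = sxx * syy - sxy ** 2  # determinant of the 2x2 covariance matrix
    return [
        (syy * (a - mx) ** 2 - 2 * sxy * (a - mx) * (b - my) + sxx * (b - my) ** 2) / det
        for a, b in zip(x, y)
    ]

# Hypothetical data: y follows x closely, except for the last case.
x = list(range(1, 21))
y = [v + 0.5 if v % 2 else v - 0.5 for v in x[:-1]] + [3.0]

d2 = mahalanobis_d2(x, y)
# Chi-square upper-tail probability with 2 degrees of freedom is exp(-D2 / 2).
flagged = [i for i, d in enumerate(d2) if math.exp(-d / 2) < 0.005]

print("univariate outliers in x:", univariate_outliers(x))
print("univariate outliers in y:", univariate_outliers(y))
print("multivariate outliers (p < .005):", flagged)
```

The last case is not extreme on either variable alone, so the z-score rule does not flag it, but its combination of a high x and a low y is unusual, which is what D² picks up.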
4. Testing the assumptions of multivariate analysis
The following apply to metric data
1. Normality
The correspondence of the shape of the data distribution to the normal distribution
Statistical results are invalid if normality is violated, because t- and F-tests require it.
The distribution of the data can be skewed, kurtotic (too peaked or too flat), or normal
Sample size issue: larger samples, less sampling error
Graphically: normal probability plot (P-P plot), histogram
Statistically:
z-value for skewness or kurtosis: if it exceeds the critical value (e.g., ±1.96 at the .05 error level), the data are not normally distributed.
Shapiro-Wilk and Kolmogorov-Smirnov tests: if p < .05, the data are not normally distributed.
Less useful if the sample size is smaller than 30 or larger than 1000
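The z-values for skewness and kurtosis can be sketched with a commonly used approximation: z = skewness / sqrt(6/N) and z = kurtosis / sqrt(24/N). The code below uses simple moment estimators; statistical packages often apply small-sample corrections, so their output may differ slightly. The sample is invented.

```python
import math
import statistics

def skewness_kurtosis_z(values):
    """Return (skewness, excess kurtosis, z_skewness, z_kurtosis) from moment estimators."""
    n = len(values)
    m = statistics.mean(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2 - 3  # excess kurtosis: 0 for a normal distribution
    return skew, kurt, skew / math.sqrt(6 / n), kurt / math.sqrt(24 / n)

# Hypothetical positively skewed sample of 50 observations.
sample = [1] * 20 + [2] * 15 + [3] * 8 + [4] * 4 + [6] * 2 + [12]

skew, kurt, z_skew, z_kurt = skewness_kurtosis_z(sample)
print(f"skewness = {skew:.2f} (z = {z_skew:.2f})")
print(f"kurtosis = {kurt:.2f} (z = {z_kurt:.2f})")
print("departs from normality at .05:", abs(z_skew) > 1.96 or abs(z_kurt) > 1.96)
```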
2. Linearity
Linear association between the variables
Graphically: scatterplot (used mostly)
Remedy: transformation
3. Absence of correlated errors (correlated residuals)
Graphically: save the residuals and then plot them
Remedy: add a variable that represents the omitted factor
4. Homoscedasticity
Applies to both metric and nonmetric independent variables
The dependent variable(s) exhibit equal levels of variance across the range of predictor variable(s)
Importance: the variance of the dependent variable being explained in the dependence relationship should not be concentrated in only a limited range of the independent values.
Sources of heteroscedasticity
Variable type: as a variable increases in value (0 to 1 000 000), a wider range of answers is possible.
Skewed distribution of one or both variables
Graphically: scatterplots (metric variables) and boxplots (nonmetric variables)
Statistically: Levene's test and Box's M test
Remedy: data transformation
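As a rough illustration of what Levene's test measures, the sketch below computes the W statistic in its mean-centered form: a one-way ANOVA on the absolute deviations of each observation from its group mean. Only the statistic is computed; W is compared against an F distribution with k - 1 and N - k degrees of freedom, and that lookup is left out here. The groups are invented.

```python
import statistics

def levene_w(*groups):
    """Levene's W statistic (mean-centered version) for equal variances across groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of each observation from its own group mean.
    z = [[abs(v - statistics.mean(g)) for v in g] for g in groups]
    z_means = [statistics.mean(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total
    between = sum(len(zi) * (zm - z_grand) ** 2 for zi, zm in zip(z, z_means))
    within = sum((v - zm) ** 2 for zi, zm in zip(z, z_means) for v in zi)
    return (n_total - k) / (k - 1) * between / within

# Hypothetical dependent variable in two groups with the same mean but different spread.
narrow = [4, 5, 5, 6, 5, 4, 6, 5]
wide = [1, 9, 3, 8, 2, 10, 5, 2]

w = levene_w(narrow, wide)
print(f"W = {w:.2f} with df = (1, {len(narrow) + len(wide) - 2})")
```

Two groups with identical spread give W = 0, whatever their means; the larger W is, the stronger the evidence of unequal variances.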
5. Dummy variables
Nonmetric variables in multivariate techniques, e.g., gender, marital status, occupation, nationality
Dummy variables are dichotomous variables that represent one category of a nonmetric independent variable.
Any nonmetric variable with k categories can be represented by k-1 dummy variables
Coding dummy variables via indicator coding:
E.g., City: Gothenburg, Stockholm, & Malmö
Two dummy variables: Gothenburg = 1, else = 0; Stockholm = 1, else = 0; the omitted (reference) category is Malmö
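Indicator coding for the city example can be sketched as follows. The function name `indicator_coding` and the `is_` prefix for the dummy variable names are assumptions made for illustration.

```python
def indicator_coding(values, reference):
    """Create k - 1 dummy variables; the reference category is 0 on all of them."""
    categories = [c for c in sorted(set(values)) if c != reference]
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

cities = ["Gothenburg", "Stockholm", "Malmö", "Gothenburg"]
coded = indicator_coding(cities, reference="Malmö")

for city, dummies in zip(cities, coded):
    print(city, dummies)
```

A case from the omitted category is identified by zeros on both dummy variables, so a third dummy would add no information.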
6. Data Transformation
Non-linearity:
Normality:
If you have 200 observations or more, the effects of nonnormality may be negligible.
If you have 50 (or worse, 30) observations or fewer, significant departures from normality can have a substantial impact on the results.
Remedies to achieve normality and homoscedasticity:
Flat distribution -> inverse (e.g., 1/X or 1/Y)
Negatively skewed distribution -> squared or cubed (X² or X³)
Positively skewed distribution -> square root or logarithm
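The effect of the skewness remedies can be checked numerically. The following sketch applies them to two invented variables and reports the skewness (simple moment estimator) before and after; the data are illustrative assumptions.

```python
import math
import statistics

def skewness(values):
    """Moment-based skewness: 0 for a symmetric distribution."""
    n = len(values)
    m = statistics.mean(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical positively skewed variable (all values > 0, as sqrt and log require).
x = [1, 1, 2, 2, 2, 3, 3, 4, 5, 7, 12, 30]
skew_raw = skewness(x)
skew_sqrt = skewness([math.sqrt(v) for v in x])
skew_log = skewness([math.log(v) for v in x])
print(f"raw: {skew_raw:.2f}, square root: {skew_sqrt:.2f}, logarithm: {skew_log:.2f}")

# Hypothetical negatively skewed variable: squaring stretches the upper end.
y = [1, 6, 8, 9, 9, 10, 10, 10]
skew_y_raw = skewness(y)
skew_y_squared = skewness([v ** 2 for v in y])
print(f"raw: {skew_y_raw:.2f}, squared: {skew_y_squared:.2f}")
```

In this example the logarithm reduces positive skew more than the square root does, so the choice between them can be guided by how severe the skew is.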