
Data Cleaning

Wajda Wikhamn, Associate Professor


Department of Business Administration
School of Business, Economics and Law
University of Gothenburg
Email: wajda.wikhamn@handels.gu.se

www.handels.gu.se

Lecture outline

1. Graphical examination.
2. Identify and evaluate missing values.

3. Identify and deal with outliers.


4. Check whether statistical assumptions are met.
5. Dummy variables.
6. Data Transformation.


1. Graphical Examination

1. Univariate profiling
The shape: Histogram

This is the distribution of the HBAT database variable X19 Satisfaction.

[Figure: histogram of X19 - Satisfaction. x-axis: X19 - Satisfaction, from 4.50 to 10.00 in steps of 0.50; y-axis: Frequency, from 0 to 30. Mean = 6.92, Std. Dev = 1.19, N = 100]

2016-11-16


How to get such a graph?

Database: HBAT
In SPSS, select Analyze > Descriptive Statistics > Frequencies.
Move the variable(s) of interest into the Variables box.
Press Charts, then tick Histograms and Show normal curve.
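
Outside SPSS, the same frequency counts can be sketched in a few lines of Python. This is only an illustration: the ratings are made-up values (not the HBAT data), the bin width of 0.5 is an assumption, and the bars are printed as text rather than drawn. The mean and standard deviation printed at the end are the two parameters a normal curve overlay would use.

```python
import math
from collections import Counter
from statistics import mean, stdev

# Hypothetical satisfaction ratings (NOT the HBAT data), for illustration only.
ratings = [5.4, 6.1, 6.8, 7.2, 7.4, 6.9, 8.3, 7.7, 5.9, 6.6, 7.1, 8.8]

BIN_WIDTH = 0.5  # assumed bin width


def bin_start(value, width=BIN_WIDTH):
    """Return the lower edge of the bin that contains value."""
    return math.floor(value / width) * width


# Count how many ratings fall into each bin.
counts = Counter(bin_start(r) for r in ratings)

# Print one text bar per non-empty bin, lowest bin first.
for start in sorted(counts):
    print(f"{start:4.1f}-{start + BIN_WIDTH:4.1f} | {'#' * counts[start]}")

# Summary statistics of the kind reported next to a histogram.
print(f"Mean = {mean(ratings):.2f}, Std. Dev = {stdev(ratings):.2f}, N = {len(ratings)}")
```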


1. Graphical Examination
2. Bivariate profiling
a) The relationship: scatterplot
Shows strength and direction (positive, negative, no relationship, nonlinear relationship)

[Figure: scatterplots of variables X6, X7, X8, X12 and X13]


1. Graphical Examination

2. Bivariate profiling
b) Group differences: boxplot


2. Missing data

The impact of missing data


Reduction of the sample size included in the analysis (practical impact)
Biased results when the missing data are nonrandom (substantive impact)

Strategies for handling missing data

...

use observations with complete data only;


delete case(s) and/or variable(s);
estimate (impute) missing values.


Identifying missing data and applying remedies

1. Determine the type of missing data
  Ignorable missing data: expected and part of the research design
    Nonsampled observations (probability sampling)
    Skip patterns
  Nonignorable missing data: remedies are needed
    Known processes: errors in data entry, failure to complete the questionnaire
    Unknown processes: refusal to respond to some questions
2. A. Determine the extent of missing data: tabulate the percentage of variables with missing data for each case and the number of cases with missing data for each variable
  Missing data under 10% for an individual case (respondent) or variable can generally be ignored
  The number of cases with no missing data (complete cases) must be sufficient for the analysis
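
The tabulation in step 2.A can be sketched as follows. This is a minimal illustration, assuming a tiny made-up data set in which `None` marks a missing value; the variable names `v1` to `v4` are hypothetical.

```python
# Hypothetical survey data: each row is a case (respondent); None marks a missing value.
data = [
    {"v1": 3.2, "v2": 5.0, "v3": 1.0, "v4": 7.1},
    {"v1": None, "v2": 4.1, "v3": 2.0, "v4": 6.4},
    {"v1": 2.9, "v2": None, "v3": None, "v4": 5.8},
    {"v1": 3.5, "v2": 4.7, "v3": 1.0, "v4": 6.9},
    {"v1": 3.1, "v2": 4.4, "v3": 2.0, "v4": None},
]
variables = list(data[0])

# Percentage of variables with missing data for each case.
pct_missing_per_case = [
    100 * sum(row[v] is None for v in variables) / len(variables) for row in data
]

# Number and percentage of cases with missing data for each variable.
n_missing_per_variable = {v: sum(row[v] is None for row in data) for v in variables}
pct_missing_per_variable = {v: 100 * n / len(data) for v, n in n_missing_per_variable.items()}

# Number of cases with no missing data at all (the complete-case sample size).
n_complete = sum(all(row[v] is not None for v in variables) for row in data)

print("Percent missing per case:", pct_missing_per_case)
print("Percent missing per variable:", pct_missing_per_variable)
print("Complete cases:", n_complete, "of", len(data))
```

In this toy example the third case is missing 50% of its values, well above the 10% guideline, so it would be the first candidate for closer inspection.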

Missing data Example


Identifying missing data and applying remedies

2. B. Deleting individual cases or variables


A variable with 15% or more missing data is a candidate for deletion
The overall decrease in missing data should be large enough to justify the deletion
Cases with missing data on the dependent variable are deleted to avoid an artificial relationship with the independent variables
When deleting a variable, alternative similar variables should be included in the analysis
Perform the analysis with and without the deleted cases or variables to identify differences


Identifying missing data and applying remedies

3. Diagnose the randomness of missing data

Levels of randomness of missing data (WHY data are missing)

Example: Suppose you are modeling weight (Y) as a function of gender (X). Some respondents wouldn't disclose their weight, so you are missing some values for Y. There are three possible mechanisms for the nondisclosure:


Identifying missing data and applying remedies

3. Diagnose the randomness of missing data

There may be no particular reason why some respondents told you their weights and others didn't. That is, the probability that Y is missing may have no relationship to X or Y. Such data are said to be missing completely at random (MCAR).


Identifying missing data and applying remedies

3. Diagnose the randomness of missing data

One gender may be less likely to disclose its weight. That is, the
probability that Y is missing depends only on the value of X. Such
data are missing at random (MAR).


Identifying missing data and applying remedies

3. Diagnose the randomness of missing data

Overweight or underweight people may be less likely to disclose their weight. That is, the probability that Y is missing depends on the unobserved value of Y itself. Such data are not missing at random (NMAR).


Identifying missing data and applying remedies

Diagnostic tests for levels of randomness

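Form two groups, observations with missing Y and observations with valid Y, and compare them on the other variables (percentages for nonmetric variables, t-test for metric variables)

An overall test of randomness (MCAR)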
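
The two-group comparison can be sketched as below. This is a rough illustration on made-up data: `x` is a hypothetical metric variable, `y` is the variable with missing values, and Welch's form of the t statistic is an assumption (the slide only says "t-test"). The p-value is not computed here because it requires the t distribution.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical cases (x, y); y is None when the respondent did not answer.
cases = [(23, 61.0), (35, None), (41, 72.5), (29, 65.0), (52, None),
         (47, None), (31, 70.2), (26, 58.4), (44, None), (38, 80.1)]

# Split the values of x by whether y is valid or missing.
x_valid = [x for x, y in cases if y is not None]
x_missing = [x for x, y in cases if y is None]


def welch_t(a, b):
    """Welch's t statistic for the difference between the means of a and b."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


t = welch_t(x_missing, x_valid)
print(f"Mean x, missing y: {mean(x_missing):.2f}; valid y: {mean(x_valid):.2f}; t = {t:.2f}")
```

A large absolute t value suggests that missingness on Y is related to X, which argues against treating the data as MCAR.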


Identifying missing data and applying remedies


4. Select the imputation method
Imputation is the process of estimating the missing value based on valid values of other variables and/or cases in the sample.
Used only for metric variables
If missing data are MCAR: use only valid data or use replacement values for the missing data
  Valid data only: complete case approach (LISTWISE) or all-available approach (PAIRWISE)
  Replacement values:
    Known values: hot or cold deck imputation; case substitution
    Calculated values: mean substitution or regression imputation
      predicted Y = b0 + b1X1 + b2X2 + b3X3 + b4X4
      Y is the variable with the missing value
      For the case with the missing value (e.g., case 23), substitute its valid values for X1, X2, X3 and X4 to obtain the replacement value.
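
Mean substitution and regression imputation can be sketched as follows. This is a simplified illustration, assuming a single predictor `x1` instead of the four in the equation above and made-up data chosen so that the arithmetic is easy to follow (the valid cases lie exactly on the line y = 1 + 2x).

```python
from statistics import mean

# Hypothetical cases (x1, y); y is None for the case whose value is missing.
cases = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, None), (5.0, 11.0)]

# Only complete cases are used to estimate the replacement values.
complete = [(x, y) for x, y in cases if y is not None]
xs = [x for x, _ in complete]
ys = [y for _, y in complete]

# Mean substitution: replace the missing value with the mean of the valid values.
y_mean = mean(ys)

# Regression imputation: fit y = b0 + b1 * x1 on the complete cases by least squares.
x_mean = mean(xs)
b1 = sum((x - x_mean) * (y - y_mean) for x, y in complete) / sum((x - x_mean) ** 2 for x in xs)
b0 = y_mean - b1 * x_mean

imputed_mean = [(x, y if y is not None else y_mean) for x, y in cases]
imputed_regression = [(x, y if y is not None else b0 + b1 * x) for x, y in cases]

print("Mean substitution:    ", imputed_mean[3])
print("Regression imputation:", imputed_regression[3])
```

Mean substitution gives the same replacement to every missing case, whereas the regression value uses what is known about the case through its predictor.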

Identifying missing data and applying remedies

4. Select the imputation method

If missing data are nonrandom or MAR: use a specifically designed modeling approach (EM method; ML estimation), or define the observations with missing data as a subset of the sample and include them in the analysis


Imputation of Missing Data

Under 10%: Any of the imputation methods can be applied when missing data is this low, although the complete case method has been shown to be the least preferred.

10 to 20%: The increased presence of missing data makes the all-available, hot deck case substitution and regression methods most preferred for MCAR data, while model-based methods are necessary with MAR missing data processes.

Over 20%: If it is necessary to impute missing data when the level is over 20%, the preferred methods are:
o the regression method for MCAR situations, and
o model-based methods when MAR missing data occurs.

3. Outliers

Outliers are observations with a unique combination of characteristics identifiable as distinctly different (unusually high or low) from the other observations.

Why do they occur?
Procedural error: data entry or coding mistake
Extraordinary event or observation


Outliers

Detecting outliers
Univariate detection: the distribution of each variable
  Standardized scores (z scores)
  Small samples (80 or fewer): absolute z score of 2.5 or larger
  Larger samples (more than 80): absolute z score of 3 (or 4) or larger
Bivariate detection: pairs of variables
  Scatterplot with an ellipse representing a bivariate normal distribution (confidence interval at a specified alpha level)
Multivariate detection: Mahalanobis D² (a multivariate assessment that measures the distance of a case from the multidimensional mean of a distribution across a set of variables)
  Chi-square test for D²: a p-value below .005 or .001 indicates an outlier
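
Univariate and multivariate detection can be sketched as below. This is an illustration under stated assumptions: the data are made up (a 5 x 5 grid of ordinary cases plus one deliberately extreme case), only two variables are used so that the covariance matrix can be inverted by hand, and the chi-square p-value uses the closed form exp(-D²/2), which holds only for 2 degrees of freedom.

```python
from math import exp
from statistics import mean, stdev

# Hypothetical data: a 5 x 5 grid of ordinary cases plus one extreme case (index 25).
points = [(float(x), float(y)) for x in range(1, 6) for y in range(1, 6)] + [(12.0, 12.0)]
xs = [p[0] for p in points]
ys = [p[1] for p in points]
n = len(points)
mx, my = mean(xs), mean(ys)

# Univariate detection on x: z scores, cutoff 2.5 because the sample has 80 or fewer cases.
sx = stdev(xs)
z_x = [(x - mx) / sx for x in xs]
z_outliers = [i for i, z in enumerate(z_x) if abs(z) >= 2.5]

# Multivariate detection: Mahalanobis D^2 from the 2 x 2 sample covariance matrix.
sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
syy = sum((y - my) ** 2 for y in ys) / (n - 1)
sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
det = sxx * syy - sxy ** 2
d2 = [
    ((x - mx) ** 2 * syy - 2 * (x - mx) * (y - my) * sxy + (y - my) ** 2 * sxx) / det
    for x, y in points
]

# Chi-square p-value with 2 degrees of freedom (valid for two variables only).
p_values = [exp(-d / 2) for d in d2]
d2_outliers = [i for i, p in enumerate(p_values) if p < 0.001]

print("Univariate outliers (case index):", z_outliers)
print("Multivariate outliers (case index):", d2_outliers)
```

With more than two variables the same idea applies, but the covariance matrix inverse and the chi-square p-value would normally come from a statistics package.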

4. Testing the assumptions of multivariate analysis

The following apply to metric data


1. Normality
The correspondence of the shape of the data distribution to the normal distribution
Statistical results may be invalid if normality is violated, because t- and F-tests assume it
The distribution can depart from normality through skewness or kurtosis
Sample size issue: the larger the sample, the smaller the sampling error
Graphically: normal probability plot (P-P plot), histogram
Statistically:
  z-value for skewness or kurtosis: if its absolute value exceeds the critical value (e.g., 1.96 at the .05 error level), the data are not normally distributed
  Shapiro-Wilk and Kolmogorov-Smirnov tests: if p < .05, the data are not normally distributed
  These tests are less useful when the sample size is below 30 or above 1,000
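
The z-value check can be sketched as follows. This assumes that the z-value refers to the skewness and kurtosis statistics divided by their approximate standard errors, sqrt(6/N) and sqrt(24/N), a common textbook approximation; the two samples are made up for illustration.

```python
from math import sqrt
from statistics import mean


def skew_kurt_z(values):
    """Return (z_skewness, z_kurtosis) using the sqrt(6/N) and sqrt(24/N) approximations."""
    n = len(values)
    m = mean(values)
    m2 = sum((v - m) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - m) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - m) ** 4 for v in values) / n  # fourth central moment
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2 - 3  # excess kurtosis: 0 for a normal distribution
    return skewness / sqrt(6 / n), kurtosis / sqrt(24 / n)


# Hypothetical samples: one symmetric, one with a long right tail.
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]
right_skewed = [1] * 10 + [2] * 5 + [3] * 3 + [10] * 2

for name, sample in [("symmetric", symmetric), ("right-skewed", right_skewed)]:
    z_skew, z_kurt = skew_kurt_z(sample)
    verdict = "departs from normality" if max(abs(z_skew), abs(z_kurt)) > 1.96 else "no evidence of nonnormality"
    print(f"{name}: z_skew = {z_skew:.2f}, z_kurt = {z_kurt:.2f} -> {verdict}")
```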

4. Testing the assumptions of multivariate analysis

3. Linearity
Linear association between the variables
Graphically: scatterplot (most commonly used)
Remedy: transformation
4. Absence of correlated errors (correlated residuals)
Graphically: save the residuals and plot them
Remedy: add a variable that represents the omitted factor


4. Testing the assumptions of multivariate analysis

2. Homoscedasticity
Applies to both metric and nonmetric independent variables
The dependent variable(s) exhibit equal levels of variance across the range of predictor variable(s)
Importance: the variance of the dependent variable being explained in the dependence relationship should not be concentrated in only a limited range of the independent values.
Sources of heteroscedasticity:
  Variable type: as a variable increases in value (e.g., from 0 to 1,000,000), a wider range of answers is possible.
  Skewed distribution of one or both variables
Graphically: scatterplots (metric variables) and boxplots (nonmetric variables)
Statistically: Levene's test and Box's M test
Remedy: data transformation
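
Levene's test statistic can be sketched as below. This is a minimal illustration on made-up data for two groups, using absolute deviations from the group means (a variant based on medians also exists). The p-value is not computed because it requires the F distribution; instead the statistic is compared with the tabulated critical value of about 5.32 for F with (1, 8) degrees of freedom at the .05 level.

```python
from statistics import mean


def levene_w(groups):
    """Levene's W statistic based on absolute deviations from each group mean."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    group_means = [mean(g) for g in groups]
    # Absolute deviation of every value from its own group mean.
    z = [[abs(v - gm) for v in g] for g, gm in zip(groups, group_means)]
    z_means = [mean(zg) for zg in z]
    z_grand = sum(sum(zg) for zg in z) / n_total
    between = sum(len(zg) * (zm - z_grand) ** 2 for zg, zm in zip(z, z_means))
    within = sum((v - zm) ** 2 for zg, zm in zip(z, z_means) for v in zg)
    return (n_total - k) / (k - 1) * between / within


# Hypothetical dependent-variable values in two groups with the same mean but different spread.
group_a = [10, 11, 9, 10, 10]
group_b = [4, 16, 10, 2, 18]

w = levene_w([group_a, group_b])
F_CRITICAL = 5.32  # approximate table value for F(1, 8) at the .05 level
print(f"Levene's W = {w:.2f}; variances differ at .05: {w > F_CRITICAL}")
```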

5. Dummy variables

Nonmetric variables can be used in multivariate techniques, e.g., gender, marital status, occupation, nationality
Dummy variables are dichotomous variables that each represent one category of a nonmetric independent variable.
Any nonmetric variable with k categories can be represented by k-1 dummy variables

Coding dummy variables via indicator coding:
E.g., City: Gothenburg, Stockholm, and Malmö
Two dummy variables: Gothenburg = 1, else = 0; Stockholm = 1, else = 0; the omitted (reference) category is Malmö
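
The city example can be sketched in code as follows. The list of respondents and the dummy variable names (`city_Gothenburg`, `city_Stockholm`) are hypothetical and serve only as an illustration of indicator coding.

```python
# Hypothetical respondents' cities; the three categories come from the example above.
cities = ["Gothenburg", "Stockholm", "Malmö", "Stockholm", "Gothenburg"]

categories = ["Gothenburg", "Stockholm", "Malmö"]
REFERENCE = "Malmö"  # omitted category: coded 0 on every dummy variable

# k categories are represented by k - 1 dummy variables.
dummy_names = [c for c in categories if c != REFERENCE]


def indicator_code(value):
    """Return the k - 1 dummy variables (1 = in this category, 0 = otherwise)."""
    return {f"city_{name}": int(value == name) for name in dummy_names}


coded = [indicator_code(c) for c in cities]
for city, row in zip(cities, coded):
    print(f"{city:<11} {row}")
```

A respondent from the reference category is identified by zeros on both dummy variables, which is why a third dummy variable would be redundant.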


6. Data Transformation

Non-linearity:


6. Data Transformation

Normality:
If you have 200 observations or more, the effects of nonnormality may be negligible.
If you have 50 (or worse, 30) observations or fewer, significant departures from normality can have a substantial impact on the results.
Remedies to achieve normality and homoscedasticity:
Flat distribution -> inverse (e.g., 1/X or 1/Y)
Negatively skewed distribution -> squared or cubed (X² or X³)
Positively skewed distribution -> square root or logarithm
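
The effect of the remedies for positive skew can be sketched as below. The sample is made up for illustration, and skewness is computed with the simple moment formula. Both transformations require suitable values: non-negative for the square root and strictly positive for the logarithm.

```python
from math import log, sqrt
from statistics import mean


def skewness(values):
    """Moment-based skewness: 0 for a symmetric distribution, positive for a right tail."""
    n = len(values)
    m = mean(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical positively skewed variable (all values strictly positive).
raw = [1, 1, 2, 2, 3, 4, 6, 9, 15, 40]

# Candidate transformations for positive skew.
square_root = [sqrt(v) for v in raw]
logarithm = [log(v) for v in raw]

for name, values in [("raw", raw), ("square root", square_root), ("logarithm", logarithm)]:
    print(f"{name:<12} skewness = {skewness(values):.2f}")
```

In this sample the square root reduces the skewness and the logarithm reduces it further, so the logarithm would be the stronger remedy here.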
