www.handels.gu.se
Lecture outline
1. Graphical examination.
2. Identify and evaluate missing values.
1. Graphical Examination
1. Univariate profiling
The shape: histogram
This is the distribution of the HBAT database variable X19 (Satisfaction).
[Figure: histogram of X19 - Satisfaction. x-axis: X19 - Satisfaction, 4.50 to 10.00 in steps of 0.50; y-axis: Frequency, 0 to 30.]
2016-11-16
Database: HBAT
In SPSS, select Analyze > Descriptive Statistics > Frequencies.
Move the variable(s) of interest into the Variables box.
Press Charts, then tick Histograms and Show normal curve.
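Outside SPSS, the frequency counts behind such a histogram can be reproduced in a few lines. The following Python sketch is an illustration only, not the lecture's procedure: the function name `histogram_counts` is hypothetical, the scores are made-up values rather than HBAT data, and the bin width of 0.5 simply mirrors the axis of the histogram shown earlier.

```python
import math
from collections import Counter

def histogram_counts(values, bin_width=0.5):
    """Count how many values fall in each bin [start, start + bin_width)."""
    counts = Counter()
    for v in values:
        # Lower edge of the bin that contains v.
        start = math.floor(v / bin_width) * bin_width
        counts[start] += 1
    return dict(sorted(counts.items()))

# Hypothetical satisfaction scores (illustrative only, not HBAT data).
scores = [4.7, 5.2, 5.4, 6.0, 6.1, 6.3, 6.8, 7.0, 7.1, 7.4, 8.2, 9.9]

# Print a rough text histogram: one '#' per case in the bin.
for start, n in histogram_counts(scores).items():
    print(f"{start:5.2f}-{start + 0.5:5.2f} | {'#' * n}")
```

The shape to look for is the same as in the SPSS chart: where the bars peak, how far the tails extend, and whether one tail is longer than the other.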
1. Graphical Examination
2. Bivariate profiling
a) The relationship: scatterplot
Shows strength and direction (positive, negative, no relationship, nonlinear relationship)
[Figure: scatterplots of variables X6, X7, X8, X12 and X13]
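What a scatterplot shows visually can be summarized numerically when the relationship is linear. The sketch below computes the Pearson correlation coefficient; this is a swapped-in numeric summary rather than something the slide prescribes, and the three small data sets are made up to show a positive, a negative and a nonlinear pattern.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: sign gives direction, magnitude gives strength."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up illustrative data.
x = [-2, -1, 0, 1, 2]
positive = [1, 2, 2, 4, 5]   # rises with x
negative = [5, 4, 2, 2, 1]   # falls with x
curved = [4, 1, 0, 1, 4]     # y = x**2, a clear but nonlinear pattern

for name, y in [("positive", positive), ("negative", negative), ("curved", curved)]:
    print(name, round(pearson_r(x, y), 3))
```

The curved case is the reason to look at the plot first: the coefficient is zero even though y is fully determined by x.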
1. Graphical Examination
2. Bivariate profiling
b) Group differences: boxplot
2. Missing data
...
Example: one gender (X) may be less likely to disclose its weight (Y). That is, the
probability that Y is missing depends only on the value of X, not on the value of
Y itself. Such data are missing at random (MAR).
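A small simulation can make the MAR mechanism concrete. The sketch below is an assumption-laden illustration: the function names, the gender codes "A" and "B", the missingness rates and the weight distribution are all invented for the example. The key point is that the decision to delete a weight looks only at gender (X) and never at the weight (Y).

```python
import random

def simulate_mar(n=10_000, p_missing_by_gender=None, seed=0):
    """Generate (gender, weight) records where weight goes missing with a
    probability that depends only on gender (X), never on weight (Y)."""
    if p_missing_by_gender is None:
        p_missing_by_gender = {"A": 0.30, "B": 0.10}  # assumed rates
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        gender = rng.choice(["A", "B"])
        weight = rng.gauss(75.0, 12.0)  # assumed distribution, same for both
        if rng.random() < p_missing_by_gender[gender]:
            weight = None  # the value is withheld
        records.append((gender, weight))
    return records

def missing_rate(records, gender):
    """Share of records in the given gender group with a missing weight."""
    group = [w for g, w in records if g == gender]
    return sum(w is None for w in group) / len(group)

records = simulate_mar()
print(round(missing_rate(records, "A"), 2), round(missing_rate(records, "B"), 2))
```

The two printed rates differ between the groups, which is the pattern that tabulating missingness by gender would reveal in a real data set.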
3. Outliers
Outliers
Detecting outliers
Univariate detection: examine the distribution of each variable
Standardized scores (z scores)
Small samples (80 or fewer): a z score of 2.5 or larger in absolute value
Larger samples (more than 80): a z score of 3 (or 4) or larger in absolute value
Bivariate detection: examine pairs of variables
Scatterplot with an ellipse representing a bivariate normal distribution
(confidence interval at a specified alpha level)
Multivariate detection: Mahalanobis D² (a multivariate assessment
that measures the distance of a case from the multidimensional
mean of a distribution across a set of variables)
Chi-square test for D²: a case with a p-value below .005 or .001 is flagged as an outlier
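The univariate and multivariate rules above can be sketched in Python. This is a minimal illustration under stated assumptions, not the SPSS procedure: the function names are hypothetical, the data are made up, and the Mahalanobis part is restricted to two variables so that the 2x2 covariance inverse and the chi-square upper tail for 2 degrees of freedom (which has the closed form exp(-x/2)) can be written without external libraries.

```python
import math
import statistics

def z_score_outliers(values, threshold=None):
    """Indices of cases with |z| at or above the threshold.
    Default follows the rule of thumb: 2.5 for n <= 80, otherwise 3."""
    if threshold is None:
        threshold = 2.5 if len(values) <= 80 else 3.0
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [i for i, v in enumerate(values) if abs((v - mean) / sd) >= threshold]

def mahalanobis_d2_two_vars(x, y):
    """Mahalanobis D² of each case from the centroid (two variables only)."""
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx, syy = statistics.variance(x), statistics.variance(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    det = sxx * syy - sxy ** 2
    d2 = []
    for a, b in zip(x, y):
        dx, dy = a - mx, b - my
        # (dx, dy) S^-1 (dx, dy)^T with the 2x2 inverse written out.
        d2.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return d2

def chi2_upper_tail_df2(d2):
    """P(chi-square with 2 df > d2); closed form valid for df = 2 only."""
    return math.exp(-d2 / 2.0)

# Made-up data: 22 cases close to the line y = x, plus a last case that is
# unremarkable on each variable separately but lies far from that line.
x = [6 + k / 10 for k in range(-5, 6) for _ in (0, 1)] + [6.4]
y = [6 + k / 10 + e for k in range(-5, 6) for e in (0.05, -0.05)] + [5.6]

d2 = mahalanobis_d2_two_vars(x, y)
flagged = [i for i, d in enumerate(d2) if chi2_upper_tail_df2(d) < 0.001]
print(z_score_outliers(x), z_score_outliers(y), flagged)
```

In this constructed example no case reaches the univariate threshold on either variable, yet the last case is flagged by D². That is the situation multivariate detection is meant to catch. With more than two variables a general matrix inverse and chi-square distribution function would be needed.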
3. Linearity
Linear association between the variables
Graphically:
Scatterplot (most commonly used)
Remedy: transformation
4. Absence of correlated error (correlated residuals)
Graphically: save the residuals and then plot them
Remedy: adding a variable that represents the omitted factor
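Saving and inspecting residuals can be sketched as follows. The example is a hedged illustration: the helper names are hypothetical, the data are made up (y = x², so a straight line omits the curvature), and the lag-1 correlation of the residuals is used here only as a numeric stand-in for looking at the residual plot, not as a test the slide prescribes.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

def residuals(x, y):
    """Observed minus fitted values: the quantities to save and plot."""
    slope, intercept = fit_line(x, y)
    return [b - (intercept + slope * a) for a, b in zip(x, y)]

def lag1_correlation(res):
    """Correlation between each residual and the one before it."""
    m = sum(res) / len(res)
    num = sum((res[i] - m) * (res[i - 1] - m) for i in range(1, len(res)))
    den = sum((r - m) ** 2 for r in res)
    return num / den

# Made-up data with an omitted factor: the true relationship is curved.
x = list(range(10))
y = [v ** 2 for v in x]

res = residuals(x, y)
print([round(r, 2) for r in res])
print(round(lag1_correlation(res), 2))
```

The residuals start positive, dip negative and turn positive again instead of scattering randomly around zero. A pattern like this suggests that something systematic has been left out of the model; here it is the squared term.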
4. Homoscedasticity
Applies to both metric and nonmetric independent variables
The dependent variable(s) exhibit equal levels of variance across
the range of predictor variable(s)
Importance: the variance of the dependent variable being explained
in the dependence relationship should not be concentrated in only a
limited range of the independent values.
Sources of heteroscedasticity
Variable type: as a variable increases in value (e.g., from 0 to 1,000,000),
a wider range of answers is possible.
Skewed distribution of one or both variables
Graphically: plots (metric variables) and boxplots (nonmetric variables)
Statistically: Levene's test and Box's M test
Remedy: data transformation
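To show what Levene's test measures, the sketch below computes its W statistic by hand using the mean-centered form. It is an illustration under assumptions: the function name `levene_w` is hypothetical, the groups are made up, and no p-value is computed because that requires the F distribution with (k - 1, N - k) degrees of freedom, which is not in the Python standard library. SciPy's `scipy.stats.levene` is one existing implementation that also returns the p-value.

```python
def levene_w(groups):
    """Levene's W statistic (mean-centered) for a list of groups.
    A large W indicates unequal variances across the groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviations of each value from its own group mean.
    z = []
    for g in groups:
        mean_g = sum(g) / len(g)
        z.append([abs(v - mean_g) for v in g])
    z_bar_groups = [sum(zg) / len(zg) for zg in z]
    z_bar = sum(sum(zg) for zg in z) / n_total
    between = sum(len(zg) * (zb - z_bar) ** 2 for zg, zb in zip(z, z_bar_groups))
    within = sum((v - zb) ** 2 for zg, zb in zip(z, z_bar_groups) for v in zg)
    return (n_total - k) / (k - 1) * between / within

# Made-up groups with the same mean but very different spread.
tight = [4, 5, 6, 5]
wide = [1, 9, 2, 8]
print(levene_w([tight, wide]))

# Same spread, different level: a shift in the mean does not affect W.
shifted = [v + 10 for v in tight]
print(levene_w([tight, shifted]))
```

The statistic is a one-way ANOVA on the absolute deviations, so it responds to differences in spread and ignores differences in level.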
5. Dummy variables
6. Data Transformation
Non-linearity:
6. Data Transformation
Normality:
If you have 200 observations or more, the effects of nonnormality
may be negligible.
If you have 50 (or worse, 30) observations or fewer, significant
departures from normality can have a substantial impact on the results.
Remedies to achieve normality and homoscedasticity:
Flat distribution -> inverse (e.g., 1/X or 1/Y)
Negatively skewed distribution -> squared or cubed (X² or X³)
Positively skewed distribution -> square root or logarithm
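The effect of the skewness remedies can be checked numerically. The sketch below is illustrative only: the `skewness` helper (the moment-based coefficient) is written for this example, and the data are constructed so that the outcome is easy to see, with a doubling sequence for positive skew and square roots of evenly spaced values for negative skew.

```python
import math

def skewness(values):
    """Moment-based skewness: 0 for a symmetric distribution,
    positive for a long right tail, negative for a long left tail."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Positively skewed made-up data: each value doubles the previous one.
raw = [1, 2, 4, 8, 16, 32, 64]
root = [math.sqrt(v) for v in raw]
logged = [math.log(v) for v in raw]

# Negatively skewed made-up data: square roots of evenly spaced values.
neg = [math.sqrt(k) for k in range(1, 8)]
squared = [v ** 2 for v in neg]

for name, data in [("raw", raw), ("sqrt", root), ("log", logged),
                   ("neg", neg), ("neg squared", squared)]:
    print(f"{name:12s} skewness = {skewness(data):+.3f}")
```

For the doubling sequence the square root reduces the skew and the logarithm removes it, which illustrates why the logarithm is the stronger of the two remedies. Real data will rarely become perfectly symmetric, so the transformed variable should be re-examined graphically.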