
Data Exploration

A. Data as the Starting Point

In statistical theory we assume that the statistical model is correctly specified and
subsequently derive the properties of the estimators and of the hypothesis tests based
upon them.

Applied researchers, however, are more often than not mainly concerned with finding
an appropriate model for the problem at hand.

A major theme in applied research is not just to test ideas against data but also to get
ideas from data.

Data help to confirm (or falsify) ideas one holds, but they often also provide clues and
hints which point towards other, perhaps more powerful ideas.

Indeed applied researchers are often as much concerned with model creation as with
model estimation.

Model specification and model selection are two important steps to precede the model
estimation stage.

Exploratory Data Analysis (EDA) becomes an important tool of empirical research with
this changed orientation.

B. Data Presentation: Raw Data

Variables can be either quantitative (measurement variables: how much) or qualitative
(categorical variables: what kind) in nature;

The latter may also be either ordered (more-less-type: level of education) or unordered
(different-type: colour of eyes, religion, etc.);

The information on some variables may sometimes be missing, or, even when available,
there may be some measurement error;

Before starting data analysis one has to take care of these problems;

The raw data are presented in a tabular form (frequency distribution, grouped frequency
distribution, histogram, frequency polygon, etc.);

The STEM & LEAF display is a very useful technique of data presentation that captures the
characteristics of the entire data set before one enters into the analysis of summary measures;

Suppose we have the following data set available on the 50 states of the US:

SL. No.   State           Environmental Voting Percentage
1         Idaho           12
2         Utah            16
3         Alaska          17
4         Wyoming         26
5         Alabama         33
6         Mississippi     33
7         Virginia        33
8         Nebraska        34
9         Arizona         35
10        Arkansas        36
11        Texas           39
12        Kansas          39
13        Louisiana       40
14        Kentucky        41
15        N. Carolina     44
16        Tennessee       45
17        New Mexico      47
18        Nevada          47
19        S. Carolina     47
20        Colorado        47
21        Georgia         49
22        Florida         51
23        Oklahoma        52
24        Oregon          53
25        Indiana         54

SL. No.   State           Environmental Voting Percentage
26        S. Dakota       55
27        Illinois        56
28        Montana         56
29        Missouri        56
30        Ohio            57
31        Washington      57
32        California      59
33        N. Dakota       59
34        Maryland        62
35        Pennsylvania    62
36        Hawaii          64
37        Delaware        69
38        Michigan        69
39        W. Virginia     69
40        Minnesota       70
41        New York        72
42        Wisconsin       72
43        New Hampshire   72
44        New Jersey      72
45        Iowa            74
46        Maine           79
47        Connecticut     79
48        Massachusetts   82
49        Rhode Island    86
50        Vermont         96

Stem and Leaf Graph

1 | 267
2 | 6
3 | 33345699
4 | 014577779
5 | 123456667799
6 | 224999
7 | 02222499
8 | 26
9 | 6


Stem-and-leaf displays are compact versions of the ordered data: the initial digits are
broken off to form the stems, shown to the left of the vertical line, and the following
digits form the leaves, shown to the right of that line.

Since the raw data are arranged in order, the display also presents the shape of the distribution.

The voting percentage ranges from 12 to 96;

The centre is around 55;

Most of the observations lie between 30 and 70 per cent;
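The display is easy to reproduce programmatically. The following is a minimal Python sketch (assuming the 50 voting percentages from the table above are typed in as a list) that prints the same stem-and-leaf graph:

```python
# Minimal sketch: stem-and-leaf display for two-digit data.
# The list holds the 50 environmental voting percentages from the table above.
votes = [12, 16, 17, 26, 33, 33, 33, 34, 35, 36, 39, 39, 40, 41, 44,
         45, 47, 47, 47, 47, 49, 51, 52, 53, 54, 55, 56, 56, 56, 57,
         57, 59, 59, 62, 62, 64, 69, 69, 69, 70, 72, 72, 72, 72, 74,
         79, 79, 82, 86, 96]

stems = {}
for v in sorted(votes):
    # Stem = tens digit, leaf = units digit.
    stems.setdefault(v // 10, []).append(v % 10)

for stem in sorted(stems):
    print(stem, "|", "".join(str(leaf) for leaf in stems[stem]))
```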

C. Summarizing the Dataset:

To explore the nature of the data we have to put them in order;

In fact, data exploration always places more emphasis on median-based analysis than on
mean-based analysis, as the former, being less sensitive to extreme values (i.e., outliers),
is more robust in nature;

So, data exploration starts from order-based measures such as the median, the inter-quartile
range, Bowley's measure of skewness, etc.

For any data set five observations are most important in studying its nature:

a. Minimum Value;
b. Maximum Value;
c. Median Value (Q2);
d. First Quartile (Q1); and
e. Third Quartile (Q3);

The distance (Q3 − Q1) is called the Inter Quartile Range (IQR);

If the distribution is symmetric then (Q2 − Q1) = (Q3 − Q2);

So, by comparing these two distances an idea can be formed about the nature of asymmetry,
i.e., skewness;

The distances between Q1 and the Minimum Value and between Q3 and the Maximum Value help to
study the thickness of the tails (kurtosis), which is very important in analyzing outliers.

If a distribution is NORMAL then it is symmetric and thin-tailed (skewness = 0, mesokurtic);


Then [IQR/1.35] = standard deviation (σ);

For any distribution [IQR/1.35] is called the pseudo standard deviation (PSD) and is
compared with the actual SD to detect the presence of outliers;
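As a concrete illustration, here is a short Python (numpy) sketch that computes the five-number summary, the IQR and the pseudo standard deviation, and compares PSD with SD. The voting-percentage data from Section B are used only as an example; note that quartile conventions differ slightly across packages.

```python
import numpy as np

def five_number_summary(x):
    """Return (min, Q1, median, Q3, max) of a one-dimensional numeric array."""
    x = np.asarray(x, dtype=float)
    q1, q2, q3 = np.percentile(x, [25, 50, 75])
    return x.min(), q1, q2, q3, x.max()

def pseudo_sd(x):
    """Pseudo standard deviation, IQR / 1.35 (close to the SD for normal data)."""
    q1, q3 = np.percentile(x, [25, 75])
    return (q3 - q1) / 1.35

# Example: the environmental voting percentages used earlier.
x = np.array([12, 16, 17, 26, 33, 33, 33, 34, 35, 36, 39, 39, 40, 41, 44,
              45, 47, 47, 47, 47, 49, 51, 52, 53, 54, 55, 56, 56, 56, 57,
              57, 59, 59, 62, 62, 64, 69, 69, 69, 70, 72, 72, 72, 72, 74,
              79, 79, 82, 86, 96])

mn, q1, med, q3, mx = five_number_summary(x)
print("min, Q1, median, Q3, max:", mn, q1, med, q3, mx)
print("IQR:", q3 - q1)
print("PSD:", round(pseudo_sd(x), 2), " SD:", round(x.std(ddof=1), 2))
# A PSD noticeably smaller than the SD hints at heavy tails / outliers.
```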

If the distribution is asymmetric and/or PSD is distinctly different from SD, then the
arithmetic mean is no longer the proper representation of the average behavior of the data
in a probabilistic sense;

Now the centre of gravity (mean) differs from the centre of probability (median);

The box-plot is a very useful technique here;

Checking for symmetry (for a uni-modal distribution):

Positive skewness: mean > median;
Symmetry: mean ≈ median;
Negative skewness: mean < median;

Checking for the thickness of the tails:

For any distribution, comparison of SD and PSD helps to determine the nature of the tails:

PSD ≈ SD implies approximately normal tails;
PSD < SD implies heavier than normal tails;
PSD > SD implies thinner than normal tails.

Ladder analysis is applied to find out the most appropriate transformation to ensure
correspondence between mean and median;
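The idea behind ladder analysis can be sketched in a few lines of Python: try the usual ladder of powers and keep the rung that brings the mean closest to the median (the difference is scaled by the SD of the transformed values to make the comparison unit-free). This is only an illustration of the principle, assuming strictly positive data; it does not replace the formal goodness-of-fit comparison.

```python
import numpy as np

# Ladder of powers for strictly positive data; negative reciprocals keep the ordering.
LADDER = {
    "cubic":    lambda x: x ** 3,
    "square":   lambda x: x ** 2,
    "identity": lambda x: x,
    "sqrt":     np.sqrt,
    "log":      np.log,
    "1/sqrt":   lambda x: -1.0 / np.sqrt(x),
    "inverse":  lambda x: -1.0 / x,
    "1/square": lambda x: -1.0 / x ** 2,
    "1/cubic":  lambda x: -1.0 / x ** 3,
}

def best_ladder_rung(x):
    """Return the rung whose transform brings the mean closest to the median,
    together with the full dictionary of scaled |mean - median| scores."""
    x = np.asarray(x, dtype=float)
    scores = {}
    for name, f in LADDER.items():
        t = f(x)
        scores[name] = abs(t.mean() - np.median(t)) / t.std(ddof=1)
    return min(scores, key=scores.get), scores

# Usage: rung, scores = best_ladder_rung(income)   # income must be > 0
```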


Consider the distribution of per capita household income

We have to get a symmetric distribution;

[Figure: log-transformation of per capita household income.]

[Figure: fourth root of GNP per capita.]

D. Regression Models:

The theory of statistics primarily deals with univariate distributions;

In regression, using the proposed causal relation we derive an estimated value of the study
variable Y and then try to minimize the difference between the observed value of Y and its
estimated value;

This difference is called the residual term and is defined as ei = (Yi − Ŷi);

Regression models are entirely concerned with the stochastic properties of this single
variate ei;

If it is symmetric and thin-tailed, then the mean-centric analysis (centre of gravity) is
equivalent to the median-centric analysis (centre of probability), and the proposed causal
connection between Y and the Xs can be statistically confirmed;

Once that is ensured through the testing of Goodness of Fit the data exploration part is
complete;

One may apply standard Econometric analysis on the transformed model to come up with
statistically reliable confirmations;
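A minimal numpy sketch of this residual check: fit the linear model by least squares, form the residuals ei = Yi − Ŷi, and compare their mean with their median and their SD with their PSD, as a rough screen for asymmetry and heavy tails before the formal tests.

```python
import numpy as np

def residual_check(y, X):
    """OLS fit of y on X (an intercept is added) plus simple residual diagnostics."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares coefficients
    e = y - X @ beta                               # residuals e_i = Y_i - Yhat_i
    q1, q3 = np.percentile(e, [25, 75])
    report = {
        "mean(e)":   e.mean(),                     # ~0 by construction with an intercept
        "median(e)": np.median(e),                 # should also be near 0 if symmetric
        "SD(e)":     e.std(ddof=X.shape[1]),
        "PSD(e)":    (q3 - q1) / 1.35,             # PSD well below SD signals heavy tails
    }
    return beta, e, report
```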


E. Illustration I:

This section uses an example (Mukherjee, White & Wuyts 1998) to show that seemingly
good results in regression analysis may turn out to be quite questionable if we care to
scrutinize the data in greater depth.

In the proposed model, cross-country data are used to explain the crude birth rate (Y) in
terms of per capita GNP (X1) and the infant mortality rate (X2), where the former is expected
to have a negative influence on the study variable and the latter a positive influence.
The estimated regression is:

Ŷi = 18.8 − 0.00039 X1i + 0.22 X2i
    (11.65)  (2.23)        (14.04)

with R² = 0.79 and n = 109;

At first sight, these results look good. The coefficient of determination (R²) tells us
that the regression explains 79% of the total variation in CBR, and both slope coefficients
have the expected sign and are significant. Many researchers may be inclined to stop the
data analysis at this point. That would definitely be unwise.

Before accepting the model one has to carry out diagnostic checking. The normality
assumption is checked by the Jarque-Bera (J-B) test, which is passed, though only weakly,
at the 5% level of significance. The Goldfeld-Quandt (G-Q) test has been carried out to rule
out the possibility of heteroscedastic errors, and it too accepts the null hypothesis of
homoscedasticity at the 5% level of significance. So far, we have enough reason to be
satisfied with our fitted model.

However, a few interesting observations may be made at this juncture: (a) we never
actually looked at the data but concentrated on the final results alone; (b) the only
purpose of the data set was to verify whether or not it supports the hypothesis at hand;
(c) no attempt was made to explore other, equally interesting possibilities that may be
suggested by the same data set.

To explore the data, at the first step the histograms of the three variables are plotted,
and it can be seen that they have very different patterns of distribution. Since the
regression model tries to explain the variation in Y in terms of the variation in the Xs,
the pattern of the X-variations should correspond closely with that of the Y-variation.
Moreover, for the normality assumption to hold good, all these distributions should be more
or less bell-shaped.


At the next step pair-wise scatter plots and correlation coefficients should be studied.


Shapes of the pair-wise scatter plots:

X2 against X1 : negative exponential;
Y against X1 : negative exponential with a number of outliers;
Y against X2 : positive and concave (positive exponential with the axes interchanged);

Since the regression model is a linear one, the variables should be so transformed as to
arrive at linear scatters among all pairs. Skewness in the data is a major problem when
modeling an average.

A skewed distribution has no clear centre. The mean, its centre of gravity, will differ from
the median, its centre of probability. Moreover, the sample mean is no longer an attractive
estimator of the population mean in the presence of skewness.

To handle this problem one has to design a suitable non-linear transformation for each
relevant variable: for example, a power, a square root, a logarithm, and so on. The
log-transformation is highly popular in applied data analysis.

When X1 is replaced by log(X1) and X2 by √X2, the pair-wise scatter plots take regular
linear shapes.


Ŷi = 2.59 + 0.63 log X1i + 4.06 √X2i
    (0.925)  (0.38)         (13.78)

with R² = 0.85 and n = 109;

The first thing to note about this regression is that the value of R² has gone up from 0.79
to 0.85. The income variable log(X1) has lost its importance altogether and should be
dropped from the model.

When it is dropped, the estimated regression equation becomes

Ŷi = 3.61 + 3.83 √X2i
    (2.75)  (24.17)

with R² = 0.85 and n = 109.

This simple regression confirms that dropping the income variable from the equation
hardly affects the coefficient of determination. This regression, therefore, yields a better
result than the multiple regression proposed earlier.
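The whole workflow of Illustration I is easy to replicate with statsmodels. In the sketch below the column names cbr, gnp_pc and imr are hypothetical stand-ins for Y, X1 and X2, and the data frame passed in is assumed to hold the 109 country observations; the point is the sequence of specifications, not a reproduction of the published estimates.

```python
import numpy as np
import statsmodels.formula.api as smf

def fit_cbr_models(df):
    """Fit the original, transformed and simplified specifications and compare them.

    Expects a DataFrame with (hypothetical) columns: cbr, gnp_pc, imr.
    """
    m1 = smf.ols("cbr ~ gnp_pc + imr", data=df).fit()                   # original linear model
    m2 = smf.ols("cbr ~ np.log(gnp_pc) + np.sqrt(imr)", data=df).fit()  # transformed regressors
    m3 = smf.ols("cbr ~ np.sqrt(imr)", data=df).fit()                   # income variable dropped
    for name, m in [("linear", m1), ("transformed", m2), ("sqrt(IMR) only", m3)]:
        print(f"{name}: R2 = {m.rsquared:.3f}")
        print(m.params.round(4))
        print(m.tvalues.round(2))
    return m1, m2, m3
```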



F. Illustration II (Dutta & Banerjee, 2013):

Here an example is given using unit-level NSSO data on morbidity collected in the
60th Round during 2004-05;

Observations have been taken for urban West Bengal, and the type of fuel used has
been taken as a proxy for indoor pollution;

The causal link between indoor pollution and morbidity has been studied controlling
for a number of socio-economic variables such as monthly per capita expenditure,
living condition, level of education of the head of the household, etc.

Hypothesis and Variables:

Morbidity = f (Income, fuel-use, education, living condition)

Sl no. | Variable name                                   | Notation | Description                                                                                    | Expected sign
I.     | Morbidity index                                 | M2       | Percentage of morbid members in each household                                                 | Dependent variable
II.    | Monthly per capita expenditure of the household | MPCE     | Same as in the data source                                                                     |
III.   | Fuel use                                        | FUEL     | Dirty = 1 / Clean = 0                                                                          |
IV.    | Education                                       | EDU      | Education of the head of the household (indexed): EDU = (Actual / Maximum) * 100               |
V.     | Living Condition                                | LCI      | Constructed on the basis of information on house structure, latrine, drainage and water source, using PCA |



Descriptive Statistics (NSS 60th Round)

                               M2       MPCE      LCI     FUEL      EDU
Mean                        17.99    1062.45    79.28     0.51    58.02
Median                       0.00     875.00    88.27     1.00    66.67
Standard Deviation          25.58     753.27    18.78     0.50    35.24
IQR                         25.00     716.00    18.12     1.00    66.67
Pseudo SD                   18.53     531.30    12.07     0.74    49.42
Skewness                     1.77       2.78    -1.37    -0.04    -0.39
Kurtosis (Normalized)        2.80      13.42     1.33    -2.00    -1.10
Sample Size                  1878       1878     1878     1878     1878
Observations with 0 value                                 922
Observations with 1 value                                 956
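A summary table of this kind can be assembled with a few lines of pandas. The sketch below assumes the unit-level data sit in a DataFrame with the columns M2, MPCE, LCI, FUEL and EDU; note that pandas' kurt() already reports excess (normalized) kurtosis.

```python
import pandas as pd

def explore(df: pd.DataFrame) -> pd.DataFrame:
    """Mean, median, SD, IQR, pseudo-SD, skewness and excess kurtosis for each column."""
    q1, q3 = df.quantile(0.25), df.quantile(0.75)
    iqr = q3 - q1
    return pd.DataFrame({
        "Mean": df.mean(),
        "Median": df.median(),
        "Standard Deviation": df.std(),
        "IQR": iqr,
        "Pseudo SD": iqr / 1.35,
        "Skewness": df.skew(),
        "Kurtosis (Normalized)": df.kurt(),   # excess kurtosis: 0 for a normal distribution
        "Sample Size": df.count(),
    }).T

# Usage (hh is the assumed household-level DataFrame):
# print(explore(hh[["M2", "MPCE", "LCI", "FUEL", "EDU"]]).round(2))
```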

Box-Plot: MPCE

[Figure: box-plot of MPCE; the MPCE axis runs from roughly 0 to 8,000.]



Box-Plot: M2, EAI, LCI (all index values)

[Figure: box-plots of M2, EAI and LCI_indx on a common 20-100 scale.]

For MPCE:
Mean > Median: positively skewed, with a thick right tail since PSD < SD;
Outlier at the upper tail;
High values need to be dampened;
To decide on the nature of the appropriate transformation, consider the ladder graph;
Given the nature of transformation needed, log (ln) or square root (sqrt) seems suitable;



The Ladder-Graph for MPCE

[Figure: "Histograms by transformation" panel — histograms of mpce under each rung of the
ladder of powers (cubic, square, identity, sqrt, log, 1/sqrt, inverse, 1/square, 1/cubic),
with density on the vertical axis.]

Between ln and sqrt, the former appears more suitable, and this can be confirmed by the
χ² test of goodness of fit through application of ladder analysis;

For EAI no adjustment is needed;

For M2 the upper tail needs to be dampened; owing to the presence of a number of 0-values,
a log transformation cannot be applied; hence the square root appears to be the most
suitable choice;

For LCI the outlier is on the left-hand side; the lower values need to be raised and pushed
towards the centre;

This suggests the square of LCI as the required transformation (a sketch applying these
transformations follows below);
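The sketch below (Python, with the same hypothetical column names as before) compares ln and sqrt for MPCE using the sample skewness and a chi-square-based omnibus normality test as a stand-in for the goodness-of-fit comparison, and then applies the transformations chosen above.

```python
import numpy as np
import pandas as pd
from scipy import stats

def apply_transformations(hh: pd.DataFrame) -> pd.DataFrame:
    """Compare ln vs sqrt for MPCE, then add the transformed columns chosen above."""
    for name, t in [("ln(MPCE)", np.log(hh["MPCE"])), ("sqrt(MPCE)", np.sqrt(hh["MPCE"]))]:
        stat, p = stats.normaltest(t)          # D'Agostino-Pearson omnibus test (chi-square based)
        print(f"{name}: skewness = {stats.skew(t):.2f}, normality p-value = {p:.3f}")
    out = hh.copy()
    out["ln_MPCE"] = np.log(out["MPCE"])       # right tail dampened; MPCE has no zero values
    out["sqrt_M2"] = np.sqrt(out["M2"])        # many exact zeros, so sqrt rather than log
    out["LCI_sq"]  = out["LCI"] ** 2           # left-skewed: squaring pulls the low values up
    return out
```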




Transformation of the variables

Sl No. | Variable | Property                      | Transformation
I.     | M2       | Skewness > 0, fat right tail  | Square root (many 0 values, log not appropriate)
II.    | MPCE     | Skewness > 0, fat right tail  | Logarithmic
IV.    | EDU      | Kurtosis < 0 with thin tails  | No transformation
V.     | LCI      | Skewness < 0, fat left tail   | Square

So, data exploration advises the researcher to pay attention to the entire data set and not
merely to the central tendency alone. Whenever the distribution is skewed (i.e., the mean
differs from the median) or fat-tailed (indicating a lack of mean-convergence and the
presence of outliers), appropriate transformations should be defined to arrive at a
satisfactory model specification.

