
Data Exploration

A. Data as the Starting Point

In statistical theory we assume that the statistical model is correctly specified and
subsequently derive the properties of the estimators and of the hypothesis tests based
upon them.

Applied researchers, however, are more often than not mainly concerned with finding
an appropriate model for the problem at hand.

A major theme in applied research is not just to test ideas against data but also to get
ideas from data.

Data help to confirm (or falsify) ideas one holds, but they often also provide clues and
hints which point towards other, perhaps more powerful ideas.

Indeed applied researchers are often as much concerned with model creation as with
model estimation.

Model specification and model selection are two important steps to precede the model
estimation stage.

Exploratory Data Analysis (EDA) becomes an important tool of empirical research with
this changed orientation.

B. Data Presentation: Raw Data

Variables can be either quantitative (measurement variables: how much) or qualitative
(categorical variables: what kind) in nature;

The latter may also be either ordered (more-less-type: level of education) or unordered
(different-type: colour of eyes, religion, etc.);

The information on some variables may sometimes be missing, or, even when available,
there may be some measurement error;

Before starting data analysis one has to take care of these problems;

The raw data are presented in a tabular form (frequency distribution, grouped frequency
distribution, histogram, frequency polygon, etc.);

The STEM & LEAF display is a very useful technique of data presentation that captures the
characteristics of the entire data set before one enters into the analysis of summary measures;

Suppose we have the following data set available on the 50 states of the US:

SL. No.   State           Environmental Voting Percentage
1         Idaho           12
2         Utah            16
3         Alaska          17
4         Wyoming         26
5         Alabama         33
6         Mississippi     33
7         Virginia        33
8         Nebraska        34
9         Arizona         35
10        Arkansas        36
11        Texas           39
12        Kansas          39
13        Louisiana       40
14        Kentucky        41
15        N. Carolina     44
16        Tennessee       45
17        New Mexico      47
18        Nevada          47
19        S. Carolina     47
20        Colorado        47
21        Georgia         49
22        Florida         51
23        Oklahoma        52
24        Oregon          53
25        Indiana         54

SL. No.   State           Environmental Voting Percentage
26        S. Dakota       55
27        Illinois        56
28        Montana         56
29        Missouri        56
30        Ohio            57
31        Washington      57
32        California      59
33        N. Dakota       59
34        Maryland        62
35        Pennsylvania    62
36        Hawaii          64
37        Delaware        69
38        Michigan        69
39        W. Virginia     69
40        Minnesota       70
41        New York        72
42        Wisconsin       72
43        New Hampshire   72
44        New Jersey      72
45        Iowa            74
46        Maine           79
47        Connecticut     79
48        Massachusetts   82
49        Rhode Island    86
50        Vermont         96

Stem and Leaf Graph

1 | 267
2 | 6
3 | 33345699
4 | 014577779
5 | 123456667799
6 | 224999
7 | 02222499
8 | 26
9 | 6


Stem-and-leaf displays are compact versions of the ordered data: the initial digits are
broken off to form the stems, shown to the left of the vertical line, and the following
digits form the leaves, shown to the right of that line.

Since the raw data are arranged in order, the display also presents the shape of the distribution.

The voting percentage ranges from 12 to 96;

The centre is around 55;

Most of the observations lie between 30 and 70 per cent;
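The display is easy to reproduce programmatically. The following is a minimal Python sketch (assuming the 50 voting percentages from the table above are typed in as a list) that prints the same stem-and-leaf graph:

```python
# Minimal sketch: stem-and-leaf display for two-digit data.
# The list holds the 50 environmental voting percentages from the table above.
votes = [12, 16, 17, 26, 33, 33, 33, 34, 35, 36, 39, 39, 40, 41, 44,
         45, 47, 47, 47, 47, 49, 51, 52, 53, 54, 55, 56, 56, 56, 57,
         57, 59, 59, 62, 62, 64, 69, 69, 69, 70, 72, 72, 72, 72, 74,
         79, 79, 82, 86, 96]

stems = {}
for v in sorted(votes):
    # Stem = tens digit, leaf = units digit.
    stems.setdefault(v // 10, []).append(v % 10)

for stem in sorted(stems):
    print(stem, "|", "".join(str(leaf) for leaf in stems[stem]))
```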

C. Summarizing the Dataset:

To explore the nature of the data we have to put them in order;

In fact, data exploration always places more emphasis on median-based analysis than on
mean-based analysis, as the former, being less sensitive to extreme values (i.e., outliers),
is more robust in nature;

So, data exploration starts from order-based measures such as the median, the inter-quartile
range, Bowley's measure of skewness, etc.

For any data set five observations are most important in studying its nature:

a. Minimum Value;
b. Maximum Value;
c. Median Value (Q2);
d. First Quartile (Q1); and
e. Third Quartile (Q3);

The distance (Q3 − Q1) is called the Inter Quartile Range (IQR);

If the distribution is symmetric then (Q2 − Q1) = (Q3 − Q2);

So, by comparing these two distances an idea can be formed about the nature of asymmetry,
i.e., skewness;

The distances between Q1 and the Minimum Value and between Q3 and the Maximum Value help to
study the thickness of the tails (kurtosis), which is very important in analyzing outliers.

If a distribution is NORMAL then it is symmetric and thin-tailed (skewness = 0, mesokurtic);


Then [IQR/1.35] = standard deviation (σ);

For any distribution [IQR/1.35] is called the pseudo standard deviation (PSD) and is
compared with the actual SD to detect the presence of outliers;
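As a concrete illustration, here is a short Python (numpy) sketch that computes the five-number summary, the IQR and the pseudo standard deviation, and compares PSD with SD. The voting-percentage data from Section B are used only as an example; note that quartile conventions differ slightly across packages.

```python
import numpy as np

def five_number_summary(x):
    """Return (min, Q1, median, Q3, max) of a one-dimensional numeric array."""
    x = np.asarray(x, dtype=float)
    q1, q2, q3 = np.percentile(x, [25, 50, 75])
    return x.min(), q1, q2, q3, x.max()

def pseudo_sd(x):
    """Pseudo standard deviation, IQR / 1.35 (close to the SD for normal data)."""
    q1, q3 = np.percentile(x, [25, 75])
    return (q3 - q1) / 1.35

# Example: the environmental voting percentages used earlier.
x = np.array([12, 16, 17, 26, 33, 33, 33, 34, 35, 36, 39, 39, 40, 41, 44,
              45, 47, 47, 47, 47, 49, 51, 52, 53, 54, 55, 56, 56, 56, 57,
              57, 59, 59, 62, 62, 64, 69, 69, 69, 70, 72, 72, 72, 72, 74,
              79, 79, 82, 86, 96])

mn, q1, med, q3, mx = five_number_summary(x)
print("min, Q1, median, Q3, max:", mn, q1, med, q3, mx)
print("IQR:", q3 - q1)
print("PSD:", round(pseudo_sd(x), 2), " SD:", round(x.std(ddof=1), 2))
# A PSD noticeably smaller than the SD hints at heavy tails / outliers.
```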

If the distribution is asymmetric and/or PSD is distinctly different from SD, then the
arithmetic mean is no longer the proper representation of the average behavior of the data
in a probabilistic sense;

Now the centre of gravity (mean) differs from the centre of probability (median);

The box-plot is a very useful technique here;

Checking for symmetry (for a uni-modal distribution):

Positive skewness: mean > median;
Symmetry: mean ≈ median;
Negative skewness: mean < median;

Checking for the thickness of the tails:

For any distribution, comparison of SD and PSD helps to determine the nature of the tails:

PSD ≈ SD implies approximately normal tails;
PSD < SD implies heavier than normal tails;
PSD > SD implies thinner than normal tails.

Ladder analysis is applied to find out the most appropriate transformation to ensure
correspondence between mean and median;
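The idea behind ladder analysis can be sketched in a few lines of Python: try the usual ladder of powers and keep the rung that brings the mean closest to the median (the difference is scaled by the SD of the transformed values to make the comparison unit-free). This is only an illustration of the principle, assuming strictly positive data; it does not replace the formal goodness-of-fit comparison.

```python
import numpy as np

# Ladder of powers for strictly positive data; negative reciprocals keep the ordering.
LADDER = {
    "cubic":    lambda x: x ** 3,
    "square":   lambda x: x ** 2,
    "identity": lambda x: x,
    "sqrt":     np.sqrt,
    "log":      np.log,
    "1/sqrt":   lambda x: -1.0 / np.sqrt(x),
    "inverse":  lambda x: -1.0 / x,
    "1/square": lambda x: -1.0 / x ** 2,
    "1/cubic":  lambda x: -1.0 / x ** 3,
}

def best_ladder_rung(x):
    """Return the rung whose transform brings the mean closest to the median,
    together with the full dictionary of scaled |mean - median| scores."""
    x = np.asarray(x, dtype=float)
    scores = {}
    for name, f in LADDER.items():
        t = f(x)
        scores[name] = abs(t.mean() - np.median(t)) / t.std(ddof=1)
    return min(scores, key=scores.get), scores

# Usage: rung, scores = best_ladder_rung(income)   # income must be > 0
```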


Consider the distribution of per capita household income

We have to get a symmetric distribution;

[Figure: log-transformation of per capita household income.]

[Figure: fourth root of GNP per capita.]

D. Regression Models:

The theory of statistics primarily deals with univariate distributions;

In regression, using the proposed causal relation we derive an estimated value of the study
variable Y and then try to minimize the difference between the observed value of Y and its
estimated value;

This difference is called the residual term and is defined as ei = (Yi − Ŷi);

Regression models are entirely concerned with the stochastic properties of this single
variate ei;

If it is symmetric and thin-tailed, then the mean-centric analysis (centre of gravity) is
equivalent to the median-centric analysis (centre of probability), and the proposed causal
connection between Y and the Xs can be statistically confirmed;

Once that is ensured through the testing of Goodness of Fit the data exploration part is
complete;

One may apply standard Econometric analysis on the transformed model to come up with
statistically reliable confirmations;
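A minimal numpy sketch of this residual check: fit the linear model by least squares, form the residuals ei = Yi − Ŷi, and compare their mean with their median and their SD with their PSD, as a rough screen for asymmetry and heavy tails before the formal tests.

```python
import numpy as np

def residual_check(y, X):
    """OLS fit of y on X (an intercept is added) plus simple residual diagnostics."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares coefficients
    e = y - X @ beta                               # residuals e_i = Y_i - Yhat_i
    q1, q3 = np.percentile(e, [25, 75])
    report = {
        "mean(e)":   e.mean(),                     # ~0 by construction with an intercept
        "median(e)": np.median(e),                 # should also be near 0 if symmetric
        "SD(e)":     e.std(ddof=X.shape[1]),
        "PSD(e)":    (q3 - q1) / 1.35,             # PSD well below SD signals heavy tails
    }
    return beta, e, report
```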


E. Illustration I:

This section uses an example (Mukherjee, White & Wuyts 1998) to show that seemingly
good results in regression analysis may turn out to be quite questionable if we care to
scrutinize the data in greater depth.

In the proposed model, cross-country data are used to explain the crude birth rate (Y) in
terms of per capita GNP (X1) and the infant mortality rate (X2), where the former is expected
to have a negative influence on the study variable and the latter a positive influence.
The estimated regression is:

Ŷi = 18.8 − 0.00039 X1i + 0.22 X2i
    (11.65)  (2.23)        (14.04)

with R² = 0.79 and n = 109;

At first sight, these results look good. The coefficient of determination (R²) tells us
that the regression explains 79% of the total variation in CBR, and both slope coefficients
have the expected sign and are significant. Many researchers may be inclined to stop the
data analysis at this point. That would definitely be unwise.

Before accepting the model one has to carry out diagnostic checking. The normality
assumption is checked by the Jarque-Bera (J-B) test, which is passed, though only weakly,
at the 5% level of significance. The Goldfeld-Quandt (G-Q) test has been carried out to rule
out the possibility of heteroscedastic errors, and it too accepts the null hypothesis of
homoscedasticity at the 5% level of significance. So far, we have enough reason to be
satisfied with our fitted model.

However, a few interesting observations may be made at this juncture: (a) we never
actually looked at the data but concentrated on the final results alone; (b) the only
purpose of the data set was to verify whether or not it supports the hypothesis at hand;
(c) no attempt was made to explore other, equally interesting possibilities that may be
suggested by the same data set.

To explore the data, at the first step the histograms of the three variables are plotted,
and it can be seen that they have very different patterns of distribution. Since the
regression model tries to explain the variation in Y in terms of the variation in the Xs,
the pattern of the X-variations should correspond closely with that of the Y-variation.
Moreover, for the normality assumption to hold good, all these distributions should be more
or less bell-shaped.


At the next step pair-wise scatter plots and correlation coefficients should be studied.


Shapes of the pair-wise scatter plots:

X2 against X1 : negative exponential;
Y against X1 : negative exponential with a number of outliers;
Y against X2 : positive and concave (positive exponential with the axes interchanged);

Since the regression model is a linear one, the variables should be so transformed as to
arrive at linear scatters among all pairs. Skewness in the data is a major problem when
modeling an average.

A skewed distribution has no clear centre. The mean, its centre of gravity, will differ from
the median, its centre of probability. Moreover, the sample mean is no longer an attractive
estimator of the population mean in the presence of skewness.

To handle this problem one has to design a suitable non-linear transformation for each
relevant variable: for example, a power, a square root, a logarithm, and so on. The
log-transformation is highly popular in applied data analysis.

When X1 is replaced by log(X1) and X2 by √X2, the pair-wise scatter plots take regular
linear shapes.


Ŷi = 2.59 + 0.63 log X1i + 4.06 √X2i
    (0.925)  (0.38)         (13.78)

with R² = 0.85 and n = 109;

The first thing to note about this regression is that the value of R² has gone up from 0.79
to 0.85. The income variable log(X1) has lost its importance altogether and should be
dropped from the model.

When it is dropped, the estimated regression equation becomes

Ŷi = 3.61 + 3.83 √X2i
    (2.75)  (24.17)

with R² = 0.85 and n = 109.

This simple regression confirms that dropping the income variable from the equation
hardly affects the coefficient of determination. This regression, therefore, yields a better
result than the multiple regression proposed earlier.
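The whole workflow of Illustration I is easy to replicate with statsmodels. In the sketch below the column names cbr, gnp_pc and imr are hypothetical stand-ins for Y, X1 and X2, and the data frame passed in is assumed to hold the 109 country observations; the point is the sequence of specifications, not a reproduction of the published estimates.

```python
import numpy as np
import statsmodels.formula.api as smf

def fit_cbr_models(df):
    """Fit the original, transformed and simplified specifications and compare them.

    Expects a DataFrame with (hypothetical) columns: cbr, gnp_pc, imr.
    """
    m1 = smf.ols("cbr ~ gnp_pc + imr", data=df).fit()                   # original linear model
    m2 = smf.ols("cbr ~ np.log(gnp_pc) + np.sqrt(imr)", data=df).fit()  # transformed regressors
    m3 = smf.ols("cbr ~ np.sqrt(imr)", data=df).fit()                   # income variable dropped
    for name, m in [("linear", m1), ("transformed", m2), ("sqrt(IMR) only", m3)]:
        print(f"{name}: R2 = {m.rsquared:.3f}")
        print(m.params.round(4))
        print(m.tvalues.round(2))
    return m1, m2, m3
```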



F. Illustration II (Dutta & Banerjee, 2013):

Here an example is given using unit-level NSSO data on morbidity collected in the
60th Round during 2004-05;

Observations have been taken for urban West Bengal, and the type of fuel used has
been taken as a proxy for indoor pollution;

The causal link between indoor pollution and morbidity has been studied controlling
for a number of socio-economic variables such as monthly per capita expenditure,
living condition, level of education of the head of the household, etc.

Hypothesis and Variables:

Morbidity = f (Income, fuel-use, education, living condition)

Sl no. | Variable name                                   | Notation | Description                                                                                    | Expected sign
I.     | Morbidity index                                 | M2       | Percentage of morbid members in each household                                                 | Dependent variable
II.    | Monthly per capita expenditure of the household | MPCE     | Same as in the data source                                                                     |
III.   | Fuel use                                        | FUEL     | Dirty = 1 / Clean = 0                                                                          |
IV.    | Education                                       | EDU      | Education of the head of the household (indexed): EDU = (Actual / Maximum) * 100               |
V.     | Living Condition                                | LCI      | Constructed on the basis of information on house structure, latrine, drainage and water source, using PCA |



Descriptive Statistics (NSS 60th Round)

                               M2       MPCE      LCI     FUEL      EDU
Mean                        17.99    1062.45    79.28     0.51    58.02
Median                       0.00     875.00    88.27     1.00    66.67
Standard Deviation          25.58     753.27    18.78     0.50    35.24
IQR                         25.00     716.00    18.12     1.00    66.67
Pseudo SD                   18.53     531.30    12.07     0.74    49.42
Skewness                     1.77       2.78    -1.37    -0.04    -0.39
Kurtosis (Normalized)        2.80      13.42     1.33    -2.00    -1.10
Sample Size                  1878       1878     1878     1878     1878
Observations with 0 value                                 922
Observations with 1 value                                 956
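A summary table of this kind can be assembled with a few lines of pandas. The sketch below assumes the unit-level data sit in a DataFrame with the columns M2, MPCE, LCI, FUEL and EDU; note that pandas' kurt() already reports excess (normalized) kurtosis.

```python
import pandas as pd

def explore(df: pd.DataFrame) -> pd.DataFrame:
    """Mean, median, SD, IQR, pseudo-SD, skewness and excess kurtosis for each column."""
    q1, q3 = df.quantile(0.25), df.quantile(0.75)
    iqr = q3 - q1
    return pd.DataFrame({
        "Mean": df.mean(),
        "Median": df.median(),
        "Standard Deviation": df.std(),
        "IQR": iqr,
        "Pseudo SD": iqr / 1.35,
        "Skewness": df.skew(),
        "Kurtosis (Normalized)": df.kurt(),   # excess kurtosis: 0 for a normal distribution
        "Sample Size": df.count(),
    }).T

# Usage (hh is the assumed household-level DataFrame):
# print(explore(hh[["M2", "MPCE", "LCI", "FUEL", "EDU"]]).round(2))
```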

Box-Plot: MPCE

[Figure: box-plot of MPCE; the MPCE axis runs from roughly 0 to 8,000.]



Box-Plot: M2, EAI, LCI (all index values)

[Figure: box-plots of M2, EAI and LCI_indx on a common 20-100 scale.]

For MPCE:
Mean > Median: positively skewed, with a thick right tail since PSD < SD;
Outlier at the upper tail;
High values need to be dampened;
To decide on the nature of the appropriate transformation, consider the ladder graph;
Given the nature of transformation needed, log (ln) or square root (sqrt) seems suitable;



The Ladder-Graph for MPCE

[Figure: "Histograms by transformation" panel — histograms of mpce under each rung of the
ladder of powers (cubic, square, identity, sqrt, log, 1/sqrt, inverse, 1/square, 1/cubic),
with density on the vertical axis.]

Between ln and sqrt, the former appears more suitable, and this can be confirmed by the
χ² test of goodness of fit through application of ladder analysis;

For EAI no adjustment is needed;

For M2 the upper tail needs to be dampened; owing to the presence of a number of 0-values,
a log transformation cannot be applied; hence the square root appears to be the most
suitable choice;

For LCI the outlier is on the left-hand side; the lower values need to be raised and pushed
towards the centre;

This suggests the square of LCI as the required transformation (a sketch applying these
transformations follows below);
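The sketch below (Python, with the same hypothetical column names as before) compares ln and sqrt for MPCE using the sample skewness and a chi-square-based omnibus normality test as a stand-in for the goodness-of-fit comparison, and then applies the transformations chosen above.

```python
import numpy as np
import pandas as pd
from scipy import stats

def apply_transformations(hh: pd.DataFrame) -> pd.DataFrame:
    """Compare ln vs sqrt for MPCE, then add the transformed columns chosen above."""
    for name, t in [("ln(MPCE)", np.log(hh["MPCE"])), ("sqrt(MPCE)", np.sqrt(hh["MPCE"]))]:
        stat, p = stats.normaltest(t)          # D'Agostino-Pearson omnibus test (chi-square based)
        print(f"{name}: skewness = {stats.skew(t):.2f}, normality p-value = {p:.3f}")
    out = hh.copy()
    out["ln_MPCE"] = np.log(out["MPCE"])       # right tail dampened; MPCE has no zero values
    out["sqrt_M2"] = np.sqrt(out["M2"])        # many exact zeros, so sqrt rather than log
    out["LCI_sq"]  = out["LCI"] ** 2           # left-skewed: squaring pulls the low values up
    return out
```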




Transformation of the variables

Sl No. | Variable | Property                      | Transformation
I.     | M2       | Skewness > 0, fat right tail  | Square root (many 0 values, log not appropriate)
II.    | MPCE     | Skewness > 0, fat right tail  | Logarithmic
IV.    | EDU      | Kurtosis < 0 with thin tails  | No transformation
V.     | LCI      | Skewness < 0, fat left tail   | Square

So, data exploration advises the researcher to pay attention to the entire data set and not
merely to the central tendency alone. Whenever the distribution is skewed (i.e., the mean
differs from the median) or fat-tailed (indicating a lack of mean-convergence and the
presence of outliers), appropriate transformations should be defined to arrive at a
satisfactory model specification.

