A.
In statistical theory we assume that the statistical model is correctly specified and
subsequently derive the properties of the estimators and of the hypothesis tests based
upon them.
Applied researchers, however, are more often than not mainly concerned with finding
an appropriate model for the problem at hand.
A major theme in applied research is not just to test ideas against data but also to get
ideas from data.
Data help to confirm (or falsify) ideas one holds, but they often also provide clues and
hints which point towards other, perhaps more powerful ideas.
Indeed, applied researchers are often as much concerned with model creation as with
model estimation.
Model specification and model selection are two important steps that precede the model
estimation stage.
Exploratory Data Analysis (EDA) becomes an important tool of empirical research with
this changed orientation.
B.
Variables may be quantitative or qualitative; the latter may be either ordered
(more-less type: level of education) or unordered (different-type: colour of eyes, religion, etc.);
The information on some variables may sometimes be missing, or, even when available,
there may be some measurement error;
Before starting data analysis one has to take care of these problems;
The raw data are presented in a tabular form (frequency distribution, grouped frequency
distribution, histogram, frequency polygon, etc.);
The STEM & LEAF display is a very useful technique of data presentation that captures the
characteristics of the entire data set before one enters into the analysis of summary measures;
Suppose we have the following data set, available for the 50 states of the US:
Data Exploration
Environmental Voting Percentage of the 50 US states:

Sl. No.  State            %     Sl. No.  State              %
 1       Idaho           12      26      S. Dakota         55
 2       Utah            16      27      Illinois          56
 3       Alaska          17      28      Montana           56
 4       Wyoming         26      29      Missouri          56
 5       Alabama         33      30      Ohio              57
 6       Mississippi     33      31      Washington        57
 7       Virginia        33      32      California        59
 8       Nebraska        34      33      N. Dakota         59
 9       Arizona         35      34      Maryland          62
10       Arkansas        36      35      Pennsylvania      62
11       Texas           39      36      Hawaii            64
12       Kansas          39      37      Delaware          69
13       Louisiana       40      38      Michigan          69
14       Kentucky        41      39      W. Virginia       69
15       N. Carolina     44      40      Minnesota         70
16       Tennessee       45      41      New York          72
17       New Mexico      47      42      Wisconsin         72
18       Nevada          47      43      New Hampshire     72
19       S. Carolina     47      44      New Jersey        72
20       Colorado        47      45      Iowa              74
21       Georgia         49      46      Maine             79
22       Florida         51      47      Connecticut       79
23       Oklahoma        52      48      Massachusetts     82
24       Oregon          53      49      Rhode Island      86
25       Indiana         54      50      Vermont           96

Stem-and-leaf display of the voting percentages (stem = tens digit, leaf = units digit):

1 | 267
2 | 6
3 | 33345699
4 | 014577779
5 | 123456667799
6 | 224999
7 | 02222499
8 | 26
9 | 6
A stem-and-leaf display is a compact version of the ordered data: the initial digits are
broken off to form the stems, shown to the left of the vertical line, and the following
digits form the leaves, shown to the right of that line.
When the raw data are arranged in order, this display presents the shape of the distribution.
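The construction just described can be sketched in Python (a minimal illustration; the helper name and the subset of values used are ours, not from the text):

```python
def stem_and_leaf(values):
    """Stem-and-leaf display: the tens digit is the stem, the units
    digit the leaf; leaves are listed in order within each stem."""
    rows = {}
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        rows.setdefault(stem, []).append(str(leaf))
    return [f"{stem} | {''.join(leaves)}" for stem, leaves in sorted(rows.items())]

# The first ten voting percentages from the table above
for row in stem_and_leaf([12, 16, 17, 26, 33, 33, 33, 34, 35, 36]):
    print(row)
# 1 | 267
# 2 | 6
# 3 | 333456
```

Running the helper on all 50 values reproduces the full display shown above.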
C.
In fact, data exploration always places more emphasis on median-based than on mean-based
analysis, as the former, being less sensitive to extreme values (i.e., outliers), is
more robust in nature;
So, data exploration starts from order-based summary measures such as the median, the
inter-quartile range, Bowley's measure of skewness, etc.
For any data set five observations are most important in studying its nature:
a.
Minimum value;
b.
Maximum value;
c.
Median (Q2);
d.
Lower quartile (Q1);
e.
Upper quartile (Q3);
The distance (Q3 − Q1) is called the Inter-Quartile Range (IQR);
So, by comparing the two distances (Median − Q1) and (Q3 − Median), an idea can be formed
about the nature of asymmetry, i.e., skewness;
The distances between (Q1 & Minimum value) and (Q3 & Maximum value) help to
study the thickness of the tails (kurtosis), which is very important in analysing outliers.
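The five-number summary and the quartile-distance comparison can be sketched as follows (the function names are illustrative, and `statistics.quantiles` with the inclusive method stands in for whichever quartile definition is assumed in the text):

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max)."""
    xs = sorted(values)
    q1, med, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return xs[0], q1, med, q3, xs[-1]

def skew_direction(values):
    """Compare (median - Q1) with (Q3 - median): a longer upper
    distance suggests right skew, a longer lower distance left skew."""
    _, q1, med, q3, _ = five_number_summary(values)
    upper, lower = q3 - med, med - q1
    if upper > lower:
        return "right-skewed"
    if upper < lower:
        return "left-skewed"
    return "roughly symmetric"
```

For example, `skew_direction([1, 2, 3, 5, 9])` reports a right skew because the upper quartile distance exceeds the lower one.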
For any distribution [IQR/1.35] is called the pseudo standard deviation (PSD) and is
compared with the actual SD to detect the presence of outliers;
For a skewed distribution the centre of gravity (mean) differs from the centre of
probability (median);
Symmetry requires: mean ≈ median and PSD ≈ SD;
Ladder analysis is applied to find out the most appropriate transformation to ensure symmetry;
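The PSD screen can be sketched as below (the 1.35 divisor comes from the normal distribution, whose IQR is about 1.35 standard deviations; the 1.25 flag threshold is an illustrative choice, not from the text):

```python
import statistics

def pseudo_sd(values):
    """Pseudo standard deviation: IQR / 1.35, since a normal
    distribution's IQR is roughly 1.35 standard deviations."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (q3 - q1) / 1.35

def tails_look_heavy(values, tol=1.25):
    """Compare the actual SD with the PSD: an SD well above the PSD
    hints at outliers / heavy tails (tol is an illustrative cutoff)."""
    return statistics.stdev(values) > tol * pseudo_sd(values)

print(tails_look_heavy(list(range(1, 101))))                # uniform-like: False
print(tails_look_heavy([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]))  # one outlier: True
```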
D. Regression Models:
In regression, using the proposed causal relation we try to derive an estimated value
of the study variable Y and then try to minimize the difference between the observed
value of Y and the estimated value of Y;
This difference is called the residual term and is defined as ei = (Yi − Ŷi);
Regression models are entirely concerned with the stochastic properties of this single
variate ei;
If it is symmetric and thin-tailed, then the mean-centric analysis of gravity turns out to be
equivalent to the median-centric analysis of probability, and the likelihood of the proposed
causal connection between Y and the Xs will be statistically confirmed;
Once that is ensured through the testing of Goodness of Fit the data exploration part is
complete;
One may apply standard Econometric analysis on the transformed model to come up with
statistically reliable confirmations;
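The residual computation can be sketched on synthetic data (NumPy's least-squares routine stands in for whatever estimation software is used; none of the numbers below come from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0, 10, n)
y = 3.0 + 2.0 * x + rng.normal(0, 1, n)   # synthetic Y with a linear cause

X = np.column_stack([np.ones(n), x])      # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ beta                          # estimated values of Y
e = y - y_hat                             # residuals e_i = Y_i - Yhat_i

# With an intercept in the model, OLS residuals sum to (numerically) zero
print(abs(e.sum()) < 1e-8)                # True
```

It is exactly the stochastic behaviour of this `e` vector (symmetry, tail thickness) that the diagnostic checks below examine.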
E. Illustration I:
This section uses an example (Mukherjee, White & Wuyts 1998) to show that seemingly
good results in regression analysis may turn out to be quite questionable if we care to
scrutinize the data in greater depth.
In the proposed model, cross-country data are used to explain the crude birth rate (Y) in
terms of per capita GNP (X1) and the infant mortality rate (X2), where the former
is expected to have a negative influence on the study variable and the latter a positive
influence.
The estimated regression is:

Ŷi = 18.8 − 0.00039 X1 + 0.22 X2
    (11.65)  (2.23)      (14.04)

with R² = 0.79 and n = 109;
At first sight these results look good. The coefficient of determination (R²) tells us
that the regression explains 79% of the total variation in CBR, and both slope
coefficients have the expected sign and are significant. Many researchers may be inclined to
stop the data analysis at this point. That would definitely be unwise.
Before accepting the model one has to carry out diagnostic checking. The normality
assumption is checked by the J-B test, and it is passed only weakly at the 5% level of
significance. The G-Q test has been carried out to check for heteroscedastic
errors, and it too accepts the null hypothesis of homoscedasticity at the 5% level of
significance. So far, we have enough reason to be satisfied with our fitted model.
However, a few interesting observations may be made at this juncture: (a) we never
actually looked at the data but concentrated on the final results alone; (b) the only
purpose of the data set was to verify whether it supports the hypothesis at hand or not; (c)
no attempt was made to explore other equally interesting possibilities that may be
suggested by the same data set.
To explore the data, at the first step the histograms of the three variables are plotted, and it
can be seen that they have very different patterns of distribution. Since the regression model
is trying to explain the Y-variation in terms of variation in the Xs, the pattern of the
X-variations should have close correspondence with that of the Y-variation. Moreover, for the
normality assumption to hold good, all these distributions should be more or less bell-shaped.
At the next step pair-wise scatter plots and correlation coefficients should be studied.
Shapes of the distributions (diagonal) and of the pair-wise scatters (off-diagonal):

      X1                        X2                     Y
X1    Negative exponential
X2    Negative exponential      Negative exponential
      with number of outliers
Y     Negative exponential      Positive exponential   Positive concave
      with number of outliers
Since the regression model is a linear one the variables should be so transformed as to
arrive at linear scatters among all pairs. Skewness in data is a major problem when
modeling an average.
A skewed distribution has no clear centre. The mean, its centre of gravity, will differ from
the median, its centre of probability. Moreover, the sample mean is no longer an
attractive estimator of the population mean in the presence of skewness.
To handle this problem one has to design a suitable non-linear transformation for each
relevant variable: for example, a power, a square root, or a logarithm. The
log-transformation is highly popular in applied data analysis.
When X1 is replaced by log(X1) and X2 by √X2, the pair-wise scatter plots suggest
regular linear shapes.
Ŷi = 2.59 − 0.63 log X1 + 4.06 √X2
    (0.38)  (0.925)       (13.78)

where R² = 0.85 and n = 109;
The first thing to note about this regression is that the value of R² has gone up from 0.79
to 0.85. The income variable log(X1), however, has lost its significance altogether and
should be dropped from the model.
A simple regression of Y on √X2 alone confirms that dropping the income variable from the
equation hardly affects the coefficient of determination. This regression, therefore, yields
a better result than the multiple regression proposed earlier.
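The effect of dropping an irrelevant regressor can be illustrated on synthetic data (the data-generating process below is invented for the sketch; it merely mimics the situation in which one regressor carries no information about Y):

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an OLS fit of y on the columns of X (with intercept)."""
    Z = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

# Synthetic illustration: y is driven by x2 alone; x1 is irrelevant noise
rng = np.random.default_rng(1)
n = 109
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 4.0 * x2 + rng.normal(size=n)

r2_full = r_squared(np.column_stack([x1, x2]), y)
r2_reduced = r_squared(x2.reshape(-1, 1), y)
# Dropping the irrelevant regressor barely changes R^2
print(round(r2_full - r2_reduced, 3))
```

Because the models are nested, the full model's R² can never be lower, but an irrelevant regressor raises it only by a trivial amount.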
F. Illustration II:
Here an example is given using unit-level NSSO data on morbidity collected
in the 60th Round during 2004-05;
Observations are taken for urban West Bengal, and the type of fuel used is
taken as a proxy for indoor pollution;
The causal link between indoor pollution and morbidity is studied by controlling
for a number of socio-economic variables such as monthly per capita expenditure,
living condition, level of education of the head of the household, etc.
Sl no.  Variable name      Notation  Description                      Expected sign
I.      Morbidity index    M2        Dependent variable
II.     MPCE               MPCE      Monthly per capita expenditure
III.    Fuel use           FUEL      Dirty = 1 / Clean = 0
IV.     Education          EDU       EDU = (Actual/Maximum)*100
V.      Living Condition   LCI       Living condition index
                        M2       MPCE     LCI      FUEL    EDU
Mean                                               0.51    58.02
Median                  0.00    875.00    88.27    1.00    66.67
Standard Deviation     25.58    753.27    18.78    0.50    35.24
IQR                    25.00    716.00    18.12    1.00    66.67
Pseudo SD              18.53    531.3     12.07    0.74    49.42
Skewness                1.77      2.78    -1.37   -0.04    -0.39
Kurtosis (Normalized)   2.80     13.42     1.33   -2.00    -1.10
Sample Size            1878     1878     1878     1878    1878
[Box-Plot: MPCE]
[Box-plots: M2, LCI_indx, EAI (scale 20-100)]
For MPCE:
Mean > Median: positively skewed, with a thick right tail as PSD < SD;
Outlier at the upper tail;
High values need to be dampened;
To decide on the nature of the appropriate transformation, consider the ladder graph;
Given the nature of transformation needed, log (ln) or square root (sqrt) seems suitable;
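The idea behind ladder analysis can be sketched by computing the skewness of the variable under each rung of the ladder of powers and choosing the rung closest to symmetry (the data below are synthetic lognormal values standing in for MPCE; the rung set and the minimum-|skewness| selection rule are illustrative):

```python
import math
import random

def skewness(xs):
    """Moment-based sample skewness."""
    n = len(xs)
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs) / n
    s3 = sum((x - m) ** 3 for x in xs) / n
    return s3 / s2 ** 1.5

# Rungs of the ladder of powers (only transforms valid for positive data)
LADDER = {
    "square": lambda x: x * x,
    "identity": lambda x: x,
    "sqrt": math.sqrt,
    "log": math.log,
    "inverse": lambda x: 1.0 / x,
}

def best_rung(xs):
    """Pick the rung whose transformed data is closest to symmetric."""
    return min(LADDER, key=lambda name: abs(skewness([LADDER[name](x) for x in xs])))

# Synthetic stand-in for MPCE: lognormal, hence strongly right-skewed
random.seed(42)
mpce = [math.exp(random.gauss(6.5, 0.6)) for _ in range(1878)]
print(best_rung(mpce))   # log should win for lognormal data
```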
[Ladder graph: histograms of mpce by transformation - cubic, square, identity,
sqrt, log, 1/sqrt, inverse, 1/square, 1/cubic]
Between ln and sqrt, the former appears more suitable, and this can be confirmed
by the χ² test of goodness of fit through application of ladder analysis;
For M2 the upper tail needs to be dampened; due to the presence of a number of
0-values, the log transformation cannot be applied; hence the square root appeared
to be the most suitable one;
For LCI the outlier is on the left-hand side; the lower values need to be raised and
pushed towards the centre;
Sl no.  Variable  Transformation
I.      M2        Square root
II.     MPCE      Log (ln)
IV.     EDU       No transformation
V.      LCI       Square
So, data exploration advises the researcher to pay attention to the entire data set and not
merely to the central tendency. Whenever the distribution is skewed (i.e., the mean
differs from the median) or fat-tailed (indicating the failure of mean-convergence and the
presence of outliers), appropriate transformations should be defined to arrive at a
satisfactory model specification.