An observatory note on tests for
normality assumptions
Ahmed F. Siddiqi
School of Business & Economics, University of
Management & Technology, Lahore, Pakistan
Abstract
Purpose – The purpose of this paper is to discuss how the numerous tests available in the statistical literature for assessing the normality of a given set of observations perform in normal and near-normal situations. Not all of these tests are suitable for every situation; rather, each test has its own exclusive area of application.
Design/methodology/approach – The tests are assessed for their power at varying degrees of skewness, kurtosis and sample size on the basis of simulated experiments.
Findings – It is observed that almost all of these tests are indifferent to smaller values of skewness and kurtosis. Further, the power of accepting normality reduces with increasing sample size.
Originality/value – The article gives researchers guidelines for applying normality-assessing tests in different situations.
Keywords Decision making, Data analysis, Normality assumptions, Skewness, Kurtosis, Normality test
Paper type Research paper
1. Introduction
The normality assumption is an omnipresent assumption in almost every statistical, or even statistics-oriented, test of significance and model. In essence, this assumption
requires that a set of data upon which a statistical test of significance or statistical
modeling is to be applied must either exactly, or at least approximately, be normally
distributed. Primarily, this is because almost all of these tests and models have been developed with the normal distribution in mind. So, a proper, apt and legitimate application requires this distribution as a primary building block. Applications of the Student's t test, the χ² test and the F test all assume normality of the parent distribution.
Secondly, for theoretical reasons (such as the central limit theorem), any variable that is
the sum of a large number of independent factors is likely to be normally distributed. For
this reason, normal distribution is used throughout statistics, natural science and social
science as a simple model for complex phenomena. For example, the observational error
in an experiment is usually assumed to follow a normal distribution, and the
propagation of uncertainty is computed using this assumption. So, a violation of the
normality results in the inability of these tests to deliver a correct verdict on statistical significance.
Studies on assessing normality began probably at the dawn of the previous century, when Pearson (1900) introduced his test, which was based upon an estimated distribution that is then compared with the given set of information using the χ² distribution. There are certain conditions that need to be satisfied for this test to be applied legitimately: the information for events must be mutually exclusive and have total probability 1, and the sample size, both overall and per event, should be fairly large (Yates, 1934). All these
prerequisites put a question mark upon the application of, and the normality assessment by, this seminal test. After this attempt, Kolmogorov and Smirnov (Kolmogorov, 1933) proposed their test, which used the cumulative distribution and its distance from the empirical distribution of the sample. This periodogram statistic, as labeled by Stephens (1970), is based upon the deviation of the ith order statistic from its expected value. The test is among the most used non-parametric methods, in the sense that the critical values do not depend on the specific distribution being tested, and it is sensitive to both the location and the shape of the empirical distribution. However, the empirical distribution, being based upon a sample, diminishes the power of the test, thus making the normality assessment less reliable. The test was modified, up to a certain extent, by Lilliefors (1967), who made use of the distance between the fitted cumulative and the empirical distributions when the normal parameters are estimated from the sample. However, both of these tests, being based upon sample estimates, have come under severe criticism. D'Agostino (1986) described these tests as simply a historical curiosity and suggested that they not be used. Primarily, this test is meant for continuous distributions; however, versions for discontinuous data exist (Conover, 1972; Horn, 1977; Gleser, 1985).
Anderson and Darling (1954) suggested their test which is based upon the
cumulative distribution of order statistics of the given set of observations. In literature,
it is considered to be a modification of the Kolmogorov–Smirnov test, which gives more
weight to the tails. The test makes use of the specific distribution in calculating critical
values. This has the advantage of allowing a more sensitive test, especially at tails, and
the disadvantage that critical values must be calculated for each distribution. Kuiper
(1960) also suggested his V test, which uses the maximum difference between the empirical and cumulative distributions but has a different functional form. It is a
rotation-invariant Kolmogorov-type test statistic (Jammalamadaka and SenGupta,
2001, Section 7.2). It requires the knowledge of the two normal parameters, i.e. mean and
variance, without which it is not possible to apply it (Dyer, 1974). However, Louter and
Koerts (2008) devised a modification to apply this test even in case of composite
hypothesis.
Shapiro and Wilk (1965) suggested their test by introducing a ratio of a linear
combination of the order statistics of the sample to its variance estimate. This is in
contrast to previously discussed distance-based tests. The ratio is claimed to be both
scale and origin invariant. A modification of this ratio was suggested by D'Agostino (1972), who used the ratio of a linear unbiased estimator of the standard deviation, using
order statistics, to the usual mean square estimator. The test was originally proposed for
moderate sample sizes and can detect departures from normality both for skewness and
kurtosis. Ajne (1968) suggested a test which is primarily meant for circular uniformity
but is also used to assess normality. The test is locally most powerful and invariant for
circular rotation (Stephens, 1970). Vasicek (1976) introduced a test for normality, based
upon the property of the normal distribution that its entropy exceeds that of any other
distribution with a density that has the same variance. As the entropy of the normal distribution depends only on its variance and not upon the mean, the test is meant only for composite hypotheses. Arizono and Ohta (1989) extended this test for simple
hypotheses by using Kullback and Leibler's (1951) information, which is an extended
concept of entropy. Jarque and Bera (1987) proposed a goodness-of-fit measure of
departure from normality, based on the sample kurtosis and skewness. The null
hypothesis, for the test, is a joint hypothesis of the skewness being 0 and the excess
kurtosis being 0. Urzúa (1996) warns, however, about the incorrect use of this Jarque and Bera (1987) (JB) test in the case of small- and medium-size samples. He also introduces a modified version of the same test, which is more suitable for smaller samples.
Arizono and Ohta (1989) presented a Monte Carlo simulation-based comparison of
different tests to establish the power-based superiority of their sample entropy-based
test. The study shows that this entropy-based test statistics is statistically more
292 powerful as compared to Kolmogorov Smirnov (KS), and Cramer von Mises (CVM)
tests. However, the comparison is based upon a sample of size 20 only, and different
results are expected for smaller samples. D'Agostino et al. (1990) studied the usefulness of the symmetry and kurtosis measures, √β₁ and β₂, respectively, for assessing normality. As both of these measures are related to the graphical presentation of the
data, they recommended a combined use of graphical and numerical techniques in doing
so. Lee et al. (2005) developed some new tests to assess normality based on U processes.
But it is still based upon order statistics, which diminishes the power of these tests (Yazici and Yolacan, 2007). Oztuna et al. (2006) establish the superiority of the Shapiro and Wilk (1965) test over four of its competitors when compared on the basis of Type I error. Yazici
and Yolacan (2007) attempted a comparison of 12 different normality tests for their
statistical power to assess the normality assumption. They concluded that the tests
based upon the cumulative distribution function are slightly more powerful when compared
with the ones based upon ordered statistics. The study is, however, restricted to samples of size up to 50 only, and no attempt was made to study the skewness and kurtosis of the sample under investigation. Asma (2008) developed computer algorithms
in Delphi programming language for most of these tests without discussing the pros and
cons. Masuda (2010) derives consistent and asymptotically distribution-free test statistics for the normality of the driving Lévy process, based on the self-normalized partial sums of residuals. Akbilgi and Howe (2011) introduced a test based upon an
identity transformation of the Gaussian function. The test is evaluated on the basis of its
type I error and the associated power. Not many of the menu-driven statistical packages available in the software market, such as SPSS, Statistica, etc., offer a variety of these normality-assessing tests. Instead, they offer only a few, whose selection is based upon the vendors' own choice. Command-driven statistical packages like R, SAS, etc., on the other hand, offer a wide variety of such tests. This paper also uses some built-in algorithms in R for various tests.
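As a point of reference, the snippet below shows how the two normality tests that ship with base R can be invoked; the sample is hypothetical, and the remaining tests discussed in this paper are assumed to come from add-on packages, since the paper does not name the specific routines it uses.

```r
# A minimal sketch: the two normality tests available in base R (stats package).
set.seed(1)
x <- rnorm(100, mean = 50, sd = 10)    # hypothetical sample for illustration

shapiro.test(x)                        # Shapiro-Wilk (SW)
ks.test(x, "pnorm", mean(x), sd(x))    # Kolmogorov-Smirnov (KS) against a fitted normal
```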
All these studies that attempt to compare the relative performance of the different tests available to assess normality are stereotyped, in that no effort is made to study the behavior of these tests in nearly, or approximately, normal situations. These near-normal situations are usually judged either by symmetry or by kurtosis measures. Some comparisons do exist, like Yazici and Yolacan (2007), which address the lack of symmetry or kurtosis, though not in terms of their direct measures such as √β₁ or β₂, but in terms of different distributions which are either skewed or show non-zero excess kurtosis. Such a comparison may have academic worth, but for practical purposes, and especially when the user is a non-statistician, these comparisons simply add to the
confusion. The current study focuses exactly on near-normal situations where the
behavior of these tests is assessed with respect to skewness and kurtosis. Section 2
describes the tests of normality being discussed in this study. Section 3 describes a
Monte Carlo simulation-based comparison of these tests of normality with respect to
skewness, kurtosis and sample sizes.
2. Testing tests of normality
As has been discussed earlier, there is a long list of tests available in the statistical literature to test the normality assumption. It is not possible to study all of these tests in a single study. The selected tests are Anderson and Darling (1954) (AD), Cramér (1928) and von Mises (1947) (CVM), D'Agostino (1972) (DAG), Kuiper (1960), Kolmogorov and Smirnov (Kolmogorov, 1933) (KS), the Lilliefors (1967) modification to KS (KS-L), Pearson (1900) (P), Shapiro and Francia (1972) (SF) and, last but not least, Shapiro and Wilk (1965) (SW). A brief description of all these tests is given in Table I.
All these tests are developed for a composite null hypothesis of normality. The first column gives the symbol usually used for the test in the statistical literature, the second column gives the authors' names and the third column gives the respective expression for calculating the numerical value of each test. The fourth and fifth columns give the Pearson correlation coefficients showing the sensitivity of these tests to skewness and kurtosis, respectively (details are in Section 3.1).
Not all these tests are equally powerful in all situations, and in similar situations different tests behave differently. It has also been observed that the behavior of these tests is more erratic in near, or approximately, normal situations. This article attempts to discriminate between these tests for near-normal situations. Normality is assessed either through symmetry, through kurtosis or through a combination of these. The comparison is based upon simulations using the same established characteristics of the normal distribution.
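To make the comparison concrete, here is a minimal sketch of the kind of Monte Carlo experiment described above: many normal samples are generated, each sample's skewness and kurtosis are recorded alongside the p-values of the selected tests, and the Pearson correlations reported in columns 4 and 5 of Table I are then computed. The simulation settings, the use of absolute skewness, and the add-on packages nortest, moments and tseries are assumptions for illustration; the paper does not state which routines or parameters it used.

```r
# Sketch of the simulation behind the correlations in Table I (assumed settings).
library(nortest)   # ad.test, cvm.test, lillie.test, pearson.test, sf.test (assumed package)
library(moments)   # skewness, kurtosis, agostino.test (assumed package)
library(tseries)   # jarque.bera.test (assumed package)

pvalues <- function(x) {
  c(AD   = ad.test(x)$p.value,                       # Anderson-Darling
    CVM  = cvm.test(x)$p.value,                      # Cramer-von Mises
    DAG  = agostino.test(x)$p.value,                 # D'Agostino skewness test
    KS   = ks.test(x, "pnorm", mean(x), sd(x))$p.value,
    KS.L = lillie.test(x)$p.value,                   # Lilliefors modification of KS
    P    = pearson.test(x)$p.value,                  # Pearson chi-square
    SF   = sf.test(x)$p.value,                       # Shapiro-Francia
    SW   = shapiro.test(x)$p.value,                  # Shapiro-Wilk
    JB   = jarque.bera.test(x)$p.value)              # Jarque-Bera
}

set.seed(123)
n.samples <- 1000                                    # assumed number of simulated samples
n <- 100                                             # assumed (controlled) sample size
res <- t(replicate(n.samples, {
  x <- rnorm(n, mean = 50, sd = 50)
  c(skew = skewness(x), kurt = kurtosis(x), pvalues(x))
}))

# Analogues of columns 4 and 5 of Table I: correlation of each test's p-value with
# the shape measures (using absolute skewness is an assumption on my part).
cor(abs(res[, "skew"]), res[, -(1:2)])
cor(res[, "kurt"], res[, -(1:2)])
```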
Table I. Some tests of normality (excerpt)

KS (Kolmogorov–Smirnov test): D = max(D⁺, D⁻), where D⁺ = maxᵢ (i/n − p₍ᵢ₎), D⁻ = maxᵢ (p₍ᵢ₎ − (i − 1)/n) and p₍ᵢ₎ = Φ((x₍ᵢ₎ − x̄)/s). Correlation with skewness: −0.119; correlation with kurtosis: 0.067.

SW (Shapiro–Wilk test): W = (Σᵢ aᵢ x₍ᵢ₎)² / Σᵢ (xᵢ − x̄)². Correlation with skewness: −0.597; correlation with kurtosis: 0.360.
The results of these experiments are shown in columns 4 and 5 of Table I. Here is a brief commentary on these results:
• All correlation coefficients for skewness are negative (as shown in column 4). This shows that all the selected tests behave similarly for increasing values of skewness. However, their magnitudes are quite different, varying from 0.119 for KS to 0.781 for JB. A classical interpretation of the correlation coefficient says that a 1 per cent increase in the coefficient of skewness develops a 0.119 per cent change in the p-value of the KS test and a 0.781 per cent change for JB. In other words, KS is sensitive to a change in skewness, though not as much as JB, for which the correlation is the highest. As the correlation coefficient differs from test to test, one may infer that the sensitivity levels of these tests to skewness differ, and one may rank these tests for their sensitivity to skewness.
• Column 5 shows these correlation coefficients for the kurtosis values. The same pattern is observed as in the case of skewness.
Figure 1. Performance of normality tests for skewness
• For the CVM test, Figure 1(b), the aberration of points away from the vertical axis for higher values of skewness is even more explicit than for the AD test. No hollowness is observed at the origin, which means it rejects normality for samples having skewness at or around 0.
• For the SF test, in Figure 1(c), a hollow area is observed, which indicates its power to reject the normality hypothesis at or around 0 skewness. However, at the same time, it also accepts the normality hypothesis for many samples with higher values of skewness. It nevertheless deals with skewness in a better way than the AD or CVM tests.
• KS-L, in Figure 1(d), shows a similar behavior to CVM, accepting many samples with skewness well beyond 0 and, at the same time, rejecting normality for samples with skewness at or around 0.
• The P test, in Figure 1(e), and the KS test, in Figure 1(i), both show a really pathetic picture: no upside-down shape, but instead a pillar-like structure, which reflects their impotence in detecting, or rejecting, normality at or around 0 skewness. The comments made by D'Agostino (1986), rendering the KS test simply a historical curiosity, find their meaning here in Figure 1(e) and 1(i).
• The SW test, in Figure 1(g), shows slightly better behavior, at least accepting normality for zero skewness. However, it behaves equally badly in accepting normality for near-zero skewnesses.
• The DAG test, in Figure 1(h), gives higher p-values to samples with skewness around 0.5. It has a tendency to reject normality at 0 skewness, while accepting it around the 0 value.
• The JB test, in Figure 1(f), is probably the best test at rejecting normality for samples with skewness around 0, as it has a large hollow area at the center.
Generally, all these tests have a tendency to accept normality for samples with skewness within the range of ±1, or to reject normality at 0 skewness, with the exception of the JB test. These were the results obtained at controlled kurtosis. Let us do a similar exercise, with controlled skewness, to study the behavior of these tests at varying levels of the coefficient of kurtosis in another simulation experiment.
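A sketch of this controlled-skewness exercise is given below: normal samples are retained only when their skewness is close to zero, and each retained sample's kurtosis is stored next to the p-values of two of the tests, mirroring one panel of Figure 2. The filtering threshold, the sample size and the moments and tseries packages are assumptions, since the paper does not report these details.

```r
# Sketch: behaviour of the tests at varying kurtosis while skewness is controlled.
library(moments)                                     # assumed package for skewness()/kurtosis()

set.seed(42)
keep <- list()
while (length(keep) < 500) {
  x <- rnorm(100, mean = 50, sd = 50)
  if (abs(skewness(x)) < 0.1) {                      # assumed control on skewness
    keep[[length(keep) + 1]] <- c(kurt = kurtosis(x),
                                  SW   = shapiro.test(x)$p.value,
                                  JB   = tseries::jarque.bera.test(x)$p.value)
  }
}
res2 <- do.call(rbind, keep)

# One panel in the spirit of Figure 2: p-values against the coefficient of kurtosis.
plot(res2[, "kurt"], res2[, "SW"],
     xlab = "coefficient of kurtosis", ylab = "p-value (SW)")
```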
Figure 2. Performance of normality tests for kurtosis
• The SF test, Figure 2(c), does a comparatively better job, rejecting the hypothesis of normality for all samples having a kurtosis coefficient greater than 2.
• The behavior of the KS-L test in Figure 2(d) is almost similar to that of the CVM test.
• Both the JB test in Figure 2(f) and the DAG test in Figure 2(h) are better at assessing normality, much like the SF test.
In short, the results are not very different from what we have seen in the case of skewness.
In all these Monte Carlo simulation experiments, the sample size was kept constant. The sample size may play a critical role in the determination of normality, as observed by Bearden et al. (1982), among others. Another simulation experiment is needed to study the behavior of these tests for varying sample sizes.
Assessing the effect of sample size
A fourth simulation experiment is conducted to study the effect of sample size on these tests. The experiment is based on 500 random samples drawn from a normal distribution with both mean and standard deviation equal to 50, filtered for higher values of skewness and kurtosis in such a way that no sample has skewness or kurtosis of more than 1. The sample size is allowed to vary from 20 to 350, and observations are made regarding the number of times, in percentage, these tests reject normality.
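The following is a minimal sketch of such an experiment, with the Shapiro-Wilk test taken as the illustrative case. The α = 0.05 rejection threshold, the reading of the "kurtosis no more than 1" condition as excess kurtosis within ±1, and the moments package are assumptions on my part.

```r
# Sketch: % of near-normal samples rejected at each sample size (SW as example).
library(moments)                                     # assumed package for skewness()/kurtosis()

rejection.rate <- function(n, reps = 500, alpha = 0.05) {
  rejected <- replicate(reps, {
    repeat {                                         # filter out strongly shaped samples
      x <- rnorm(n, mean = 50, sd = 50)
      if (abs(skewness(x)) <= 1 && abs(kurtosis(x) - 3) <= 1) break   # assumed reading
    }
    shapiro.test(x)$p.value < alpha                  # TRUE when normality is rejected
  })
  100 * mean(rejected)                               # rejection rate in per cent
}

set.seed(7)
sizes <- seq(20, 350, by = 30)
plot(sizes, sapply(sizes, rejection.rate),
     xlab = "sample size", ylab = "% of samples rejected (SW)")
```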
Results are shown in Figure 3, which is a scatter diagram accentuated with a polynomial trend. Each point in these scatter diagrams shows the number of times, in percentage, a test rejects normality at a specific sample size. A downward trend is expected, as the behavior of these tests is expected to improve as the sample size increases.
However, the reality is quite different. Despite a low rejection rate, the trend is not downward for even a single test. For almost all of these tests, the trend is a curve: the rejection rate rises sharply at the start, for sample sizes up to 100, and then becomes constant. However, the situation is a little different for the SF, JB and DAG tests, where the rejection rate becomes constant around a sample size of 150. The rejection rate for the KS test is the highest. The scatter of these plots also varies with the test; for some, like KS, JB, SF and SW, it is not as pronounced as for others.
The results in Figure 3 confirm that the diagnostic power of the normality tests gets better with increasing sample size. However, for sample sizes up to 100, the rejection rate for normality for most of these tests increases, so their performance is not reliable for sizes of less than 100.
Apart from these near-normal situations, the behavior of these tests in entirely non-normal situations is also worth a discussion. For the purposes of the current article, only the Student's t, chi-square (χ²), binomial and Poisson probability distributions have been selected. Yazici and Yolacan (2007) have used many other distributions for a similar study.
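As a sketch of this non-normal check, the code below draws samples from the four selected parent distributions and records how often two of the tests reject normality; the distribution parameters, the sample size, the α = 0.05 threshold and the tseries package are assumptions for illustration.

```r
# Sketch: rejection rates for clearly non-normal parent distributions.
generators <- list(                                  # assumed parameter choices
  "Student t (df = 5)"   = function(n) rt(n, df = 5),
  "chi-square (df = 4)"  = function(n) rchisq(n, df = 4),
  "binomial (10, 0.5)"   = function(n) rbinom(n, size = 10, prob = 0.5),
  "Poisson (lambda = 3)" = function(n) rpois(n, lambda = 3)
)

set.seed(99)
sapply(generators, function(gen) {
  rej <- replicate(500, {
    x <- gen(100)                                    # assumed sample size
    c(SW = shapiro.test(x)$p.value < 0.05,
      JB = tseries::jarque.bera.test(x)$p.value < 0.05)
  })
  100 * rowMeans(rej)                                # % of samples rejected by each test
})
```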
Figure 3. Performance of normality tests for different sample sizes
The performance of most of the tests remains very good, except for JB and DAG, which fall prey to similarities in shape.
Results summary
These Monte Carlo experiments evaluate the relative efficacy, power and applicability of six widely used tests of normality assessment from five different perspectives. Tests like DAG, JB, SF and SW perform comparatively better in almost all situations, while tests like KS-L, CVM, AD and especially KS and P perform pathetically poorly.
Figure 4. Comparison for different tests of normality
4. Concluding remarks
One of the most common and crucial assumptions for a correct application of statistical tests is the normality of the given set of observations. The theory of statistics is very strict in this regard and has developed many yardsticks, benchmarks, graphs and diagnostics to assess normality. The relative efficacy and the power of these assessment techniques vary, however, with the situation, and this is a point of concern for many data analysts.
The paper is written with the sole objective of appraising the performance of the different statistical diagnostic tests available for assessing the normality of a given set of observations, and of discussing the appropriateness of these tests for different situations. Although many normality tests are available in the academic literature, none dominates under all conditions, and specific situations call for different tests. However, one may use the results of this paper to apply these tests more cognizantly to one's own situation.
The paper uses Monte Carlo simulation for generating random samples. Three different kinds of random samples have been generated:
(1) normal, with controlled sample size, to assess the effect of skewness and kurtosis;
(2) normal, with controlled skewness and kurtosis, to assess the effect of sample size; and
(3) non-normal, to assess the overall performance of these tests.
It is believed that one manifestation of the normality of a set of observations is that the corresponding skewness coefficient is 0 and the kurtosis value, calculated through the second moment ratio, β₂ = μ₄/μ₂², is exactly equal to 3. So, the diagnostic tests for normality should answer "no" for any deviation in these shape parameter values. However, this is not the case. Almost all of these tests fail, in one way or another, to discern non-normality for small deviations in these shape parameter values, accepting normality even when the skewness is not 0 or the kurtosis value is not 3. For some, like KS and Lilliefors, this failure is even more significant. For smaller sample sizes, the tests do perform well; the bigger the sample size, the lesser the chances that the test would assess the normality. One may conclude that the performance of these tests is not reliable in near-normal situations.
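To make the yardstick in the previous paragraph concrete, the following sketch computes the two shape measures for a simulated sample, first from raw central moments and then via the moments package (an assumed package; the quantities themselves are standard).

```r
# Sketch: the shape measures used as the normality yardstick above.
library(moments)                       # assumed package providing skewness() and kurtosis()

set.seed(7)
x <- rnorm(200, mean = 50, sd = 50)

m  <- mean(x)
m2 <- mean((x - m)^2)                  # second central moment
m3 <- mean((x - m)^3)                  # third central moment
m4 <- mean((x - m)^4)                  # fourth central moment

m3 / m2^1.5                            # skewness coefficient, close to 0 for a normal sample
m4 / m2^2                              # beta_2 = mu_4 / mu_2^2, close to 3 for a normal sample
c(skewness(x), kurtosis(x))            # the same quantities via the moments package
```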
These simulations indicate that tests like DAG, JB, SF and SW perform comparatively better in almost all situations, while tests like KS-L, CVM, AD and especially KS and P perform pathetically poorly. However, no verdict should be considered final. From a practical point of view, and especially for non-technical researchers, it is quite wise to start with a visual inspection of the data, either through QQ plots, density plots or some other more intelligent graphical techniques, to obtain a sketch of the data. Such a sketch is helpful not only for appraising the correctness of the diagnostics but also for locating the abnormality, if it exists. Never rely on the results of a single diagnostic test; apply at least three such tests. Prefer tests like DAG and JB, which give better results in almost all situations. Further, the results of the numerical diagnostics should be read in conjunction with the graphical sketch.
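A minimal sketch of this recommended workflow follows; the particular trio of tests (SW, plus the DAG and JB routines from the moments and tseries packages) is one illustrative choice rather than a prescription taken from the paper.

```r
# Sketch of the recommended workflow: inspect visually, then run at least three tests.
x <- rnorm(150, mean = 50, sd = 10)          # replace with the data set under study

qqnorm(x); qqline(x)                         # visual inspection via a QQ plot
plot(density(x))                             # density sketch of the data

c(SW  = shapiro.test(x)$p.value,
  DAG = moments::agostino.test(x)$p.value,   # D'Agostino test of skewness (assumed routine)
  JB  = tseries::jarque.bera.test(x)$p.value)
```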
References
Ajne, B. (1968), A simple test for uniformity of a circular distribution, Biometrika, Vol. 55 No. 2,
p. 343.
Akbilgi, O. and Howe, J.A. (2011), A novel normality test using an identity transformation of the Gaussian function, European Journal of Pure and Applied Mathematics, Vol. 4 No. 4, pp. 448-454.
Anderson, T.W. and Darling, D.A. (1954), A test of goodness of fit, Journal of the American Statistical Association, Vol. 49 No. 268, pp. 765-769.
Arizono, I. and Ohta, H. (1989), A test for normality based on Kullback-Leibler information, American Statistician, Vol. 43 No. 1, pp. 20-22.
Asma, S. (2008), Delphi programming of normality tests, Proceedings of the 7th WSEAS
International Conference on Application of Electrical Engineering, World Scientific and
Engineering Academy and Society (WSEAS), pp. 181-185.
Azzalini, A. (2005), The skew-normal distribution and related multivariate families,
Scandinavian Journal of Statistics, Vol. 32 No. 2, pp. 159-188.
Bearden, W., Sharma, S. and Teel, J. (1982), Sample size effects on chi square and other statistics
Corresponding author
Dr Ahmed F. Siddiqi can be contacted at: ahmedfsiddiqi@gmail.com