Non-Normality and Outliers1

Running head: NON-NORMALITY AND OUTLIERS
Checking for Non-normality and Outliers in ANOVA and MANOVA
Briefing Paper and Tutorial
France Goulard
APSY 607
Abstract

This briefing paper and tutorial presents a short review of ANOVA and MANOVA and explores various ways to check for non-normality and outliers in both types of analysis. Non-normality and outliers are defined, followed by a discussion of how outliers affect normality in research, why identifying outliers is important, and which practices are used to test for them. Finally, different methods for dealing with outliers, as well as different types of outlier approaches, are discussed.
Checking for Non-normality and Outliers in ANOVA and MANOVA

This briefing paper and tutorial discusses the importance of checking for outliers and non-normality in an analysis of variance (ANOVA) and a multivariate analysis of variance (MANOVA). An ANOVA tests the difference in means between two or more groups, while a MANOVA tests the difference between two or more vectors of means. In other words, a MANOVA is an ANOVA with several dependent variables. This paper now takes a closer look at what normality and outliers are, as well as their roles and importance in an analysis.

Normality

Normality refers to how closely the data follow a normal distribution, that is, the bell-shaped curve. The normal distribution is considered the most prominent probability distribution in statistics (Taylor & McGill, 2011), for several reasons. First, the normal distribution is very tractable analytically; that is, a large number of results involving this distribution can be derived in explicit form. Second, the normal distribution arises as the outcome of the Central Limit Theorem, which states that, with reasonable sample sizes, the sampling distribution of the sample mean is approximately normal, with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$, where $\mu$ and $\sigma$ are the population mean and standard deviation and $n$ is the sample size (Taylor & McGill, 2011). This holds regardless of the shape of the original population distribution (provided its variance is finite), and the approximation becomes increasingly accurate as the sample size increases. Finally, the bell shape of the normal distribution makes it a convenient choice for modeling a large variety of random variables encountered in practice.

According to Stevens (2009), each of the individual variables must be normally distributed in order for the set of variables to follow a multivariate normal distribution; this is a necessary but not a sufficient condition. Please refer to Figure 1 for an example of normally distributed data compared with non-normal data.
Non-normality, as shown in Figure 1b, is evident when some of the points deviate from the normal distribution line, shown in red. The data are normally distributed when the observation points follow the red line (Figure 1a). Any data points that do not follow the red line are possible outliers (Figure 1b).

Figure 1 [images not available]:
a) Normal distribution line
b) Identification of 3 outliers to the right of the normal distribution line
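The Central Limit Theorem statement in the Normality section can be checked numerically. The following is a minimal simulation sketch, not part of the original tutorial: the exponential population, the sample size of 50, the 2000 repetitions, and the helper names `sample_means` and `draw_exponential` are all assumptions chosen for illustration.

```python
import math
import random
import statistics


def sample_means(draw, n, reps, seed=0):
    """Return `reps` sample means, each computed from `n` draws of `draw(rng)`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(reps)]


def draw_exponential(rng):
    """Hypothetical skewed population: exponential with mean 1 and sd 1."""
    return rng.expovariate(1.0)


n = 50
means = sample_means(draw_exponential, n=n, reps=2000)

observed_mean = statistics.fmean(means)   # should be close to mu = 1
observed_sd = statistics.stdev(means)     # should be close to sigma / sqrt(n)
expected_sd = 1.0 / math.sqrt(n)

print(f"mean of sample means: {observed_mean:.3f} (population mean 1.000)")
print(f"sd of sample means:   {observed_sd:.3f} (sigma/sqrt(n) = {expected_sd:.3f})")
```

Even though the simulated population is strongly skewed, the spread of the sample means should land close to $\sigma/\sqrt{n}$, which is the behaviour the theorem describes.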
Outliers
In general, outliers are very influential data points. In a univariate analysis, an outlier is a case with an extreme value on one variable. In a multivariate analysis, an outlier is a case with an unusual combination of scores on two or more variables, and multivariate techniques are known to be very sensitive to such cases (Stevens, 2009). Outliers, in both cases, have the potential to distort statistical results. According to Stevens (2009), outliers are found in both univariate and multivariate situations; among both dichotomous and continuous variables; among both dependent and independent variables; and in both data and results of analysis. They are atypical, infrequent observations that diverge from the overall pattern and are unusually large or small compared to the other values being observed. They are so far separated in value from the remainder of the group that they may come from a different population or be the result of an error in measurement.

Identification of outliers

According to Stevens (2009), outliers occur for four fundamental reasons: a data recording or entry error was made; a missing-value code was not specified in the computer syntax, so that the missing value was read as real data; the outlier is not a member of the population from which the sample was intended to be drawn; or the subject is simply different from the rest. One can sometimes detect an outlier by visually examining the data at hand (Seaman & Allen, 2010); for example, a score of 9% stands out on a test where the median score is 85%. Because some outliers are hard to recognize in raw data, visual plots are helpful in identifying them. There are four different kinds of plots: histograms, normal probability plots, scatter plots, and box plots. Figure 2 shows examples of the different kinds of plots.

Figure 2: Types of visual plots to help identify outliers.

Histogram
[Images not available.] Panels: normal distribution; skewed to the left due to 6 outliers.

Normal Probability Plot
[Images not available.] Panels: normal probability plot (no outliers); probability plot (with outliers). Straight line = normal distribution; curved line = non-normal distribution.

Scatter Plots with Various Properties
[Image not available.] Panels: (a) shotgun scatter plot with low correlation; (b) strong positive correlation; (c) strong negative correlation; (d) low correlation; (e) low correlation; (f) spurious high correlation because of the points shaded in gray.

Box Plots
[Image not available.] The main features of a box plot; outliers or extreme values are excluded from the range.
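The straight-line versus curved-line reading of a normal probability plot can also be expressed as a number. The sketch below is an illustration only: it pairs each sorted observation with an expected standard-normal quantile and reports the correlation of those pairs. The plotting position (i - 0.5)/n is one common convention and is an assumption here, as are the made-up data and the helper names `probability_plot_points` and `pearson_r`.

```python
import statistics
from statistics import NormalDist


def probability_plot_points(data):
    """Pair each sorted observation with its expected standard-normal quantile."""
    ordered = sorted(data)
    n = len(ordered)
    std_normal = NormalDist()
    # Assumed plotting position: (i - 0.5) / n for the i-th smallest value.
    expected = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(expected, ordered))


def pearson_r(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    xs, ys = zip(*pairs)
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical data: 20 values placed exactly on a normal curve (mean 50, sd 10),
# then the same values with three extreme high scores appended.
clean = [NormalDist(50, 10).inv_cdf((i - 0.5) / 20) for i in range(1, 21)]
with_outliers = clean + [120, 130, 140]

r_clean = pearson_r(probability_plot_points(clean))
r_outliers = pearson_r(probability_plot_points(with_outliers))

print(f"plot correlation, no outliers:   {r_clean:.3f}")
print(f"plot correlation, with outliers: {r_outliers:.3f}")
```

A correlation near 1 corresponds to points hugging the straight line; the three added values bend the upper end of the plot and pull the correlation down noticeably.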
The shape of the distribution is an important aspect of the description of a variable, as it shows the frequency of values across different ranges of the variable. Bivariate normality for correlated variables implies that the scatter plot for each pair of variables will be elliptical; the higher the correlation, the thinner the ellipse (Stevens, 2009). Figure 3 shows how the elliptical formation changes when an outlier is included.

Figure 3: Change in elliptical formation with the presence of an outlier. [Images not available.] Panels: no outlier (normal elliptical distribution); outlier (non-normal elliptical distribution).
Stevens (2009) notes that an outlier due to a data recording or entry error can be identified by listing the data and checking that they have been read accurately. Stevens (2009) also stresses the importance of using the median as a robust measure of central tendency when there are extreme values, such as those shown in Figure 4, because the median is largely unaffected by outliers.

Figure 4: Outlier seen in the data set as well as on the scatter plot. [Image not available.]

In the data set, the x and y values for subject number 6 are markedly different from those of the other subjects. This means that subject 6 is an outlier.
The scatter plot on the right-hand side shows the outlier very clearly, as it lies far away from the other points, in the top right corner.

Importance of Identifying Outliers

Outliers are important to identify because they can wrongly inflate the value of a correlation coefficient or deflate the value of a genuine correlation (Stevens, 2009). This could lead to Type I (false positive) or Type II (false negative) errors. Furthermore, excluding an outlier could drastically change the interpretation of the results.

The effect of outliers on normality in research

Outliers are problematic because just one or two errant data points can strongly influence the results (Stevens, 2009). Researchers want the statistical analysis to reflect most of the data and, ideally, to represent the whole data set. Statistical procedures are sensitive to outliers, and there is a risk that outliers may have a profound influence on the researcher's results (Stevens, 2009).

Types of practices for testing outliers

A number of tests exist to help test for outliers. Which tests apply depends on whether or not the data are grouped. In ungrouped data, univariate and multivariate outliers are sought among all cases at once; examples of analyses with ungrouped data are regression, canonical correlation, and factor analysis. In grouped data, outliers are sought separately within each group; examples of analyses with grouped data are ANOVA and MANOVA. The following are some of the tests recommended as best practice when testing for outliers.
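Before turning to the individual tests, the distortion described under Importance of Identifying Outliers can be made concrete. This is a small sketch with made-up scores, not data from the paper; the helper name `pearson_r` and the five cases are assumptions for illustration. Four cases with no linear relationship are combined with one extreme case, and the correlation coefficient is computed with and without it.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical scores for four subjects with no linear relationship at all.
x = [1, 1, 2, 2]
y = [1, 2, 1, 2]
r_without = pearson_r(x, y)

# The same four subjects plus one extreme case far from the rest.
r_with = pearson_r(x + [10], y + [10])

print(f"r without the outlier: {r_without:.3f}")
print(f"r with the outlier:    {r_with:.3f}")
```

In this constructed example a single case moves the coefficient from 0 to roughly 0.98, the kind of spurious high correlation shown in panel (f) of the scatter plots in Figure 2.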
Univariate Tests

Stevens (2009) identifies the following as methods that are useful for assessing univariate normality:

i. Normal Probability Plot: Observations are arranged in increasing order of magnitude and then plotted against expected normal distribution values. The plot should resemble a straight line if normality is tenable. Outliers are evident as points that fall away from the line.
ii. Histogram or Stem-and-Leaf Plot: Examination of the variable in each group gives an indication of whether normality might be violated. With small or moderate sample sizes, it is difficult to assess whether non-normality is real or apparent because of considerable sampling error.
iii. Chi-square Goodness of Fit: The result of the chi-square test depends on the number of intervals used for grouping.
iv. Kolmogorov-Smirnov: This test is not as powerful as the Shapiro-Wilk test or the skewness and kurtosis coefficients.
v. Shapiro-Wilk Test and Skewness and Kurtosis Coefficients: This combination is the most powerful in detecting departures from normality.

Multivariate Tests

Tabachnick and Fidell (2007) identify the following as methods that are useful for determining multivariate outliers:

i. Mahalanobis Distance (hat elements): This test is used to identify influential data or outlier points on the predictors. It is the distance of a case from the centroid of the remaining cases, where the centroid is the point created at the intersection of the means of all variables. Such outliers will not necessarily be influential (Stevens, 2009).
ii. Leverage: Leverage is related to Mahalanobis distance (hat elements = leverage).
iii. Discrepancy: It is the extent to which a case is in line with the others.
iv. Influence: It is the product of leverage and discrepancy.
v. Cook's distance: It is a measure of the change in the regression coefficients that would occur if a case were omitted, revealing which cases are most influential in affecting the regression equation (Stevens, 2009). This is useful for identifying the combined influence of a case being an outlier on y and on the set of predictors. A value of 1 would be considered large and would warrant further investigation of that case.
vi. Weisberg Test: This test will detect y outliers.
vii. DFFITS: This indicates how much the fitted value will change if the observation is deleted.
viii. DFBETAS: This is useful in indicating how much each regression coefficient will change if the observation is deleted.

Range Tests

There are three main methods for identifying outliers using the range (Jubal, 2001).

i. Upper and lower quartile values: The upper quartile value (UQ) is the value that 75% of the data set is equal to or less than. The lower quartile value (LQ) is the value that 25% of the data set is equal to or less than. The interquartile range (IQR) is defined as the difference between the upper and lower quartiles (IQR = UQ - LQ). Statistically, outliers are defined as those data points that are at least 1.5 IQR greater than the upper quartile or at least 1.5 IQR less than the lower quartile.
ii. Z-test: Among continuous variables, univariate outliers are cases with very large standardized scores (z scores) on one or more variables that are disconnected from the other z scores (Tabachnick & Fidell, 2007). The mean and the standard deviation of the data set are calculated in a z-test. Anything that falls more than three standard deviations away from the mean is identified as an outlier. That is, x is an outlier if

$\frac{|x - \bar{x}|}{s} > 3$,

where $\bar{x}$ is the mean and $s$ is the standard deviation of the data set. About 99% of the scores should lie within three standard deviations of the mean (Stevens, 2009). Therefore, any z value greater than 3 in absolute value indicates a value very unlikely to occur.
iii. Q-test: The Q-test compares how far out the suspected outlier is with the total range of the data. To do a Q-test, the researcher first finds the ratio

$Q = \frac{|x_a - x_b|}{R}$,

where $x_a$ is the possible outlier, $x_b$ is the data point closest to it, and $R$ is the total range of the data set. If $Q$ is greater than a certain critical value ($Q_{crit}$ depends on the number of data points and on how sure you want to be that it is okay to reject $x_a$ as an outlier), then $x_a$ is an outlier. Jubal (2001) cautions that it is possible to reject almost the entire data set if the Q-test is applied several times in succession, so it should never be applied more than once.

F.E. Grubbs Parametric Tests

Grubbs' tests can be given as follows, in which $x_i$ denotes an individual data point, $s$ is the sample standard deviation, and $n$ is the sample size (Seaman & Allen, 2010):
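$G_1 = \frac{|\bar{x} - x_i|}{s}$, looks for outliers in single points of data,

$G_2 = \frac{x_n - x_1}{s}$, finds outliers at the minimum and maximum of a distribution, and

$G_3 = 1 - \frac{(n - 3)\, s_{n-2}^2}{(n - 1)\, s^2}$, finds pairs of outliers at either extreme.

Here $\bar{x}$ is the sample mean, $x_1$ and $x_n$ are the smallest and largest values, and $s_{n-2}$ is the standard deviation of the data with the suspected pair of outliers removed.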
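The three range-based rules listed under Range Tests can be sketched in a few lines of Python. This is an illustrative sketch rather than a definitive implementation: the test scores are made up (echoing the example of a 9% score among scores in the 80s), the function names are hypothetical, and quartiles are computed with the default method of the standard library's `statistics.quantiles`, which is only one of several quartile conventions.

```python
import statistics


def iqr_outliers(data, k=1.5):
    """Values at least k * IQR below the lower or above the upper quartile."""
    lower_q, _, upper_q = statistics.quantiles(data, n=4)
    iqr = upper_q - lower_q
    low_fence = lower_q - k * iqr
    high_fence = upper_q + k * iqr
    return [x for x in data if x <= low_fence or x >= high_fence]


def z_outliers(data, cutoff=3.0):
    """Values more than `cutoff` sample standard deviations from the mean."""
    mean = statistics.fmean(data)
    sd = statistics.stdev(data)
    return [x for x in data if abs(x - mean) / sd > cutoff]


def q_ratio(data):
    """Q = |x_a - x_b| / R for whichever end of the data has the larger gap."""
    ordered = sorted(data)
    data_range = ordered[-1] - ordered[0]
    gap_low = ordered[1] - ordered[0]
    gap_high = ordered[-1] - ordered[-2]
    return max(gap_low, gap_high) / data_range


# Hypothetical test scores (percent) with one suspiciously low value.
scores = [85, 88, 82, 90, 86, 84, 87, 83, 89, 85, 86, 9]

print("IQR rule flags:", iqr_outliers(scores))
print("z rule flags:  ", z_outliers(scores))
print(f"Q ratio:        {q_ratio(scores):.3f}")
```

The Q ratio still has to be compared with a tabulated critical value for the sample size, which the sketch deliberately leaves to the reader. One caveat on the z rule: when the sample standard deviation is used, no z score can exceed (n - 1)/sqrt(n) in absolute value, so with 10 or fewer observations the cutoff of 3 can never be reached.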
Other tests that might be useful include Dixon's Q test, which is similar to G2 for a small number of observations (between 3 and 25), and Rosner's test, which is a generalization of Grubbs' test that detects up to k outliers when the sample size is 25 or more (Seaman & Allen, 2010).

Chauvenet's Criterion

Chauvenet's criterion is a means of assessing whether one piece of experimental data, an outlier from a set of observations, is likely to be spurious. To apply this criterion, first calculate the mean and standard deviation of the observed data. Then, based on how much the suspect data point differs from the mean, use the normal distribution function to determine the probability that a given data point will be at least as far from the mean as the suspect data point. Multiply this probability by the number of data points taken. If the result is less than 0.5, the suspicious data point may be discarded.

Peirce's Criterion

Peirce's criterion is derived from a statistical analysis of the Gaussian distribution. Unlike some other criteria for removing outliers, Peirce's method can be applied to identify two or more outliers. The problem is to determine, in a series of m observations, the limit of error beyond which all observations involving so great an error may be rejected, provided there are as many as n such observations. The principle used to solve this problem is that the proposed observations should be rejected when the probability of the system of errors obtained by retaining them is less than that of the system of errors obtained by their rejection multiplied by the probability of making so many, and no more, abnormal observations (Peirce, 1878).

Dealing with Outliers
Outlier tests can be misleading for various reasons, such as limited measurement precision, which can place a result an apparently infinite number of standard deviations away from the mean of the remaining results; a genuinely long tail on the distribution, which can cause successive outlying points to be identified; and the risk of an outlier being identified by chance if there are very few data. Upon discovery of a suspected outlier, the initial temptation is to eliminate the point from the data in order to simplify the analysis and make the results easy to explain (Seaman & Allen, 2010). On the basis of some simple assumptions, outlier tests tell you where you are most likely to have a technical error, but they do not tell you that the point is wrong (Seaman & Allen, 2010). Also, it is not recommended that identified outliers be dropped automatically because, no matter how extreme a value is, it could be a correct piece of information (Stevens, 2009). With that being said, Stevens (2009) advises the following on how to deal with outliers: if an outlier is due to a recording or entry error, correct the data value and redo the analysis; if the outlier is due to an instrumentation error or process error, it is legitimate to drop it; and if neither of the previous is the case, do not drop the outlier but report two analyses, one that includes the outlier and one that does not.

Different Types of Outlier Approach

An outlier may be the result of an error in measurement or data entry, in which case it can distort the interpretation of the data. Therefore, once an outlier is identified, it may be necessary to investigate and fix any errors that occurred. Identified data points should only be removed if a technical reason can be found for the unusual behaviour (Ostle & Malone, 1988). It is imperative that outliers be examined thoroughly and carefully before starting any formal analysis. Dropping an outlier without good reason is not recommended and should not be practiced. According to Seaman and Allen (2010), removing an outlier may miss intricacies of the data, have large effects on any analysis of the data, and lead to serious biases. In the case where more than 20% of the data are identified as outliers, the researcher should start questioning the assumption about the data distribution and the quality of the collected data (Timm, 1975).

Another approach is to report two different analyses, one with the outlier and one without. This allows readers to decide for themselves which analysis to rely on. The only drawback of this approach is that reporting both analyses can be very time consuming.

Discussion

In conclusion, assessing normality shows how closely the data follow the bell-shaped curve of the normal distribution. Outliers should not be regarded as bad, as they can provide interesting cases for future study (Stevens, 2009). Testing for outliers is a necessary part of data analysis and must be conducted with care and caution. If an outlier is a genuine result, it should not be disregarded or dropped, as it can bring important value to the study at hand and can help in discovering why certain observations are more extreme than others.

Links for Checking Non-normality and Outliers in ANOVA and MANOVA

Assessing Classical Test Assumptions
http://www.statmethods.net/stats/anovaAssumptions.html
Missing Values, Outliers, Robust Statistics, & Non-parametric Methods
http://www.lcgceurope.com/lcgceurope/data/articlestandard/lcgceurope/502001/4509/article.pdf
Multivariate Analysis of Variance (MANOVA)
http://www.stat.psu.edu/~ajw13/stat505/fa06/12_1wMANOVA/05_1wMANOVA_ex.html
Numbers: Numerical methods for biosciences students
http://web.anglia.ac.uk/numbers/graphsCharts.html
Quality Progress
http://www.asq.org/quality-progress/2010/02/statistics-roundtable/outlier-options.html
StatSoft Electronic Statistics Textbook
http://www.statsoft.com/textbook/basic-statistics/
Stat Trek: Teach yourself statistics
http://stattrek.com/AP-Statistics-1/Residual.aspx
The Math Doctor
http://mathforum.org/dr.math/
References

Jubal, Dr. (2001, June). Using the range to find outliers. Retrieved from http://mathforum.org/library/drmath/view/52720.html
Ostle, B., & Malone, L. C. (1988). Statistics in research: Basic concepts and techniques for research workers (4th ed.). Ames, IA: Iowa State Press.
Peirce, B. (1878). On Peirce's criterion. Proceedings of the American Academy of Arts and Sciences, 13, 348-351. doi: 10.2307/2513498
Seaman, J. E., & Allen, E. I. (2010, June). Consider simple parametric tests to find an outlier's significance. Quality Progress. Retrieved from http://www.asq.org/quality-progress/2010/02/statistics-roundtable/outlier-options.html
StatSoft Electronic Statistics Textbook. (2011). Basic statistics. Retrieved from http://www.statsoft.com/textbook/basic-statistics/
Stevens, J. P. (2009). Applied multivariate statistics for the social sciences (5th ed.). New York, NY: Routledge, Taylor & Francis Group.
Tabachnick & Fidell. (2007). Cleaning up your act: Screening data prior to analysis (5th ed.). NY: Routledge.
Taylor, H. J., & McGill, J. I. (2011). Analysis based decision making: Executive MBA programs school of business. Kingston, Ontario: Queen's School of Business.
Timm, N. H. (1975). Multivariate analysis with applications in education and psychology. Monterey, CA: Brooks/Cole.
