Tsagris Michael
Contents

1.1 Introduction
2.1 Data Analysis
2.2 Descriptive Statistics
2.3 Z-test for two samples
2.4 t-test for two samples assuming unequal variances
2.5 t-test for two samples assuming equal variances
2.6 F-test for the equality of variances
2.7 Paired t-test for two samples
2.8 Ranks, Percentiles, Sampling, Random Numbers Generation
2.9 Covariance, Correlation, Linear Regression
2.10 One-way Analysis of Variance
2.11 Two-way Analysis of Variance with replication
2.12 Two-way Analysis of Variance without replication
3.1 Statistical Functions
3.2 Spearman's (non-parametric) correlation coefficient
3.3 Wilcoxon Signed Rank Test for a Median
3.4 Wilcoxon Signed Rank Test with Paired Data
1.1 Introduction
One of the reasons these notes were written was to help students, and not only students, perform some statistical analyses without having to use statistical software such as S-Plus, SPSS, Minitab, etc. Excel cannot reasonably be expected to offer as many analysis options as dedicated statistical packages, but it is at a good level nonetheless. The areas covered by these notes are: descriptive statistics, the z-test for two samples, the t-test for two samples assuming (un)equal variances, the paired t-test for two samples, the F-test for the equality of the variances of two samples, ranks and percentiles, sampling (random and periodic, or systematic), random number generation, Pearson's correlation coefficient, covariance, linear regression, one-way ANOVA, two-way ANOVA with and without replication, and the moving average. We will also demonstrate the use of non-parametric statistics in Excel for some of the previously mentioned techniques. Furthermore, informal comparisons between the results provided by Excel and those provided by SPSS and some other packages will be carried out, to check for any discrepancies. One thing worth mentioning before going through these notes is that they do not contain the theory underlying the techniques used; they simply show how to carry out these statistical analyses in Excel.
Picture 1
Picture 2
Picture 3
Picture 4
Column1
Mean                      194.0418719
Standard Error            5.221297644
Median                    148.5
Mode                      97
Standard Deviation        105.2062324
Sample Variance           11068.35133
Kurtosis                  -0.79094723
Skewness                  0.692125308
Range                     451
Minimum                   4
Maximum                   455
Sum                       78781
Count                     406
Confidence Level(95.0%)   10.26422853
Table 1: Descriptive Statistics

The results are essentially the same as they should be. There are only some very slight differences, due to rounding, compared with the results of SPSS, but these are of no importance. The sample variances differ slightly, but this is really not a problem. SPSS calculates a 95% confidence interval for the true mean, whereas Excel provides only the quantity used to construct that interval. The construction of the interval is straightforward: subtract this quantity from the mean to get the lower limit, and add it to the mean to get the upper limit of the 95% confidence interval.
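For readers who want to reproduce these quantities outside Excel, here is a minimal Python sketch. The data below are hypothetical, and the normal quantile 1.959964 is used to approximate the exact t quantile that Excel uses for the "Confidence Level(95.0%)" row; the approximation is accurate for large samples such as the 406 observations above.

```python
import math
import statistics


def descriptives(x, z=1.959964):
    """A subset of Excel's Descriptive Statistics output.

    z is the 97.5% normal quantile, used here to approximate the t
    quantile in the 95% confidence half-width (good for large n)."""
    n = len(x)
    mean = statistics.fmean(x)
    sd = statistics.stdev(x)        # sample standard deviation
    se = sd / math.sqrt(n)          # standard error of the mean
    return {
        "Mean": mean, "Standard Error": se,
        "Median": statistics.median(x),
        "Sample Variance": statistics.variance(x),
        "Minimum": min(x), "Maximum": max(x),
        "Range": max(x) - min(x), "Sum": sum(x), "Count": n,
        "Confidence Level(95.0%)": z * se,
    }


stats = descriptives([4, 97, 148, 152, 300, 455])  # hypothetical data
# The 95% interval is (Mean - half-width, Mean + half-width):
low = stats["Mean"] - stats["Confidence Level(95.0%)"]
high = stats["Mean"] + stats["Confidence Level(95.0%)"]
```

The half-width plays exactly the role of Excel's "Confidence Level(95.0%)" cell: subtract it from and add it to the mean to get the interval limits.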
Picture 5
z-Test: Two Sample for Means (only the Variable 2 column survived extraction)

                 Variable 2
Mean             5.810701181
Known Variance   9
Observations     80

Table 2: Z-test
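The z statistic behind this tool is straightforward to compute by hand. A minimal Python sketch with hypothetical data; the variances are assumed known, exactly as the tool requires:

```python
import math
from statistics import NormalDist


def z_test_two_sample(x, y, var_x, var_y, hyp_diff=0.0):
    """Two-sample z-test with known variances, mirroring Excel's
    z-Test: Two Sample for Means tool."""
    m1, m2 = sum(x) / len(x), sum(y) / len(y)
    # Standard error uses the known variances, not the sample ones.
    se = math.sqrt(var_x / len(x) + var_y / len(y))
    z = (m1 - m2 - hyp_diff) / se
    p_two_tail = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_tail


# Hypothetical samples with (assumed) known variances 0.04 and 9:
z, p = z_test_two_sample([5.1, 4.9, 5.3, 5.0], [6.2, 5.8, 6.1, 6.4], 0.04, 9)
```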
Picture 6
t-Test: Two-Sample Assuming Unequal Variances

                              Variable 1     Variable 2
Mean                          3.76501977     5.810701181
Variance                      1.095786123    8.073733335
Observations                  100            80
Hypothesized Mean Difference  0
df                            96
t Stat                        -6.115932537
P(T<=t) one-tail              1.0348E-08
t Critical one-tail           1.660881441
P(T<=t) two-tail              2.06961E-08
t Critical two-tail           1.984984263
Picture 7

The results are the same as the ones provided by SPSS. What is worth mentioning, and deserves attention, is that the degrees of freedom (df) in this case are equal to 178, whereas in the previous case they were equal to 96. The t statistic is also slightly different. The reason is that different formulae are used in the two cases.
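The two formulae can be compared directly. Here is a minimal Python sketch with small hypothetical samples; note that the pooled (equal-variance) df is always n1 + n2 - 2 (which gives 178 for samples of 100 and 80), while Welch's (unequal-variance) df depends on the sample variances, which is why Excel reported 96 above.

```python
import math
from statistics import fmean, variance


def t_unequal(x, y):
    """Welch's t-test (unequal variances), with the Welch df formula."""
    n1, n2 = len(x), len(y)
    v1, v2 = variance(x), variance(y)
    se2 = v1 / n1 + v2 / n2
    t = (fmean(x) - fmean(y)) / math.sqrt(se2)
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df


def t_equal(x, y):
    """Pooled t-test (equal variances): df is always n1 + n2 - 2."""
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    t = (fmean(x) - fmean(y)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


x = [3.1, 3.8, 4.0, 3.5]       # hypothetical sample of size 4
y = [5.2, 6.9, 4.4, 7.1, 5.8]  # hypothetical sample of size 5
t_w, df_w = t_unequal(x, y)
t_p, df_p = t_equal(x, y)       # df_p is 4 + 5 - 2 = 7
```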
t-Test: Two-Sample Assuming Equal Variances

                              Variable 1     Variable 2
Mean                          3.76501977     5.810701181
Variance                      1.095786123    8.073733335
Observations                  100            80
Pooled Variance               4.192740223
Hypothesized Mean Difference  0
df                            178
t Stat                        -6.660360895
P(T<=t) one-tail              1.6413E-10
t Critical one-tail           1.653459127
P(T<=t) two-tail              3.2826E-10
t Critical two-tail           1.973380848
Picture 8
F-Test Two-Sample for Variances

                     Variable 1    Variable 2
Mean                 3.76501977    5.8107012
Variance             1.09578612    8.0737333
Observations         100           80
df                   99            79
F                    0.13572236
P(F<=f) one-tail     0
F Critical one-tail  0.70552977
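The F statistic itself is just the ratio of the two sample variances, with n1 - 1 and n2 - 1 degrees of freedom. A minimal sketch with hypothetical data; the p-value requires the F distribution, which the Python standard library does not provide (scipy.stats.f could be used for that part):

```python
from statistics import variance


def f_test_statistic(x, y):
    """F statistic for the equality of two variances: the ratio of the
    sample variances, with (n1 - 1, n2 - 1) degrees of freedom."""
    f = variance(x) / variance(y)
    return f, len(x) - 1, len(y) - 1


# Hypothetical samples with variances 1 and 4, so F = 0.25:
f, df1, df2 = f_test_statistic([1.2, 2.2, 3.2], [2.0, 4.0, 6.0])
```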
Picture 9
t-Test: Paired Two Sample for Means (the Variable 2 column was lost in extraction)

                              Variable 1
Mean                          3.7307128
Variance                      1.1274599
Observations                  80
Pearson Correlation           0.1785439
Hypothesized Mean Difference  0
df                            79
t Stat                        -6.527179
P(T<=t) one-tail              2.95E-09
t Critical one-tail           1.6643714
P(T<=t) two-tail              5.9E-09
t Critical two-tail           1.9904502
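The paired t-test reduces to a one-sample t-test on the within-pair differences, with n - 1 degrees of freedom (79 for the 80 pairs above). A minimal Python sketch with hypothetical data:

```python
import math
from statistics import fmean, stdev


def paired_t(x, y):
    """Paired t-test: a one-sample t-test on the differences x - y."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    t = fmean(d) / (stdev(d) / math.sqrt(n))
    return t, n - 1


# Four hypothetical pairs of measurements:
t_stat, df = paired_t([2, 4, 6, 9], [1, 2, 3, 4])
```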
Picture 8
Picture 9

If you are interested in a random sample from a known distribution, then random number generation is the option you want to use. Unfortunately, not many distributions are offered. The window of this option is shown in picture 10. In the number of variables box you can select how many samples you want to be drawn from the specific distribution. The white box below it is used to define the sample size. The distributions offered are the Uniform, Normal, Bernoulli, Binomial, and Poisson; two more options are also available. Different distributions require different parameters to be defined. The random seed is an option used to give the sampling algorithm a starting value, but it can also be left blank.
Picture 10
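The same kind of sampling can be sketched with Python's random module; the seed value 42 and the distribution parameters below are arbitrary illustrations, chosen just to mirror the fields of the Excel dialog:

```python
import random

# The seed plays the role of Excel's "random seed" box: the same seed
# reproduces the same samples on every run.
rng = random.Random(42)

normal_sample = [rng.gauss(0.0, 1.0) for _ in range(100)]       # Normal(0, 1)
uniform_sample = [rng.uniform(0.0, 1.0) for _ in range(100)]    # Uniform(0, 1)
bernoulli_sample = [1 if rng.random() < 0.3 else 0              # Bernoulli(0.3)
                    for _ in range(100)]
```

Here 100 plays the role of the sample-size box, and drawing several such lists corresponds to the "number of variables" field.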
Picture 11
            Column 1   Column 2
Column 1    1.113367
Column 2    0.531949   7.972812
Table 7: Covariance

The above table is called the variance-covariance table, since it contains both of these measures. The first cell (1.113367) is the variance of the first column and the last cell (7.972812) is the variance of the second column. The remaining cell (0.531949) is the covariance of the two columns. The upper cell is left blank because the table is symmetric: the diagonal elements are the variances and the off-diagonal elements are the covariances. The window of the linear regression option is presented in picture 12 (different normal data are used in the regression analysis). We fill the white boxes with the columns that represent the Y and X values; the X values can contain more than one column (i.e. variable). We select the confidence interval option, and also the Line Fit Plots and Normal Probability Plots options. Then, by pressing OK, the results appear in table 8.
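The entries of such a table can be reproduced directly from their definitions. A minimal Python sketch with hypothetical data; the sample (n - 1) denominator is used here, so the numbers may differ slightly from Excel's tool if it divides by n instead:

```python
from statistics import fmean


def covariance_matrix(x, y):
    """Lower-triangular variance-covariance table for two columns,
    using the sample (n - 1) denominator throughout."""
    n = len(x)
    mx, my = fmean(x), fmean(y)
    var_x = sum((a - mx) ** 2 for a in x) / (n - 1)
    var_y = sum((b - my) ** 2 for b in y) / (n - 1)
    cov_xy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    # None marks the blank upper cell, as in the Excel output.
    return [[var_x, None], [cov_xy, var_y]]


vc = covariance_matrix([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # hypothetical columns
```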
Picture 12
SUMMARY OUTPUT

Regression Statistics
Multiple R         0.875372
R Square           0.766276
Adjusted R Square  0.76328
Standard Error     23.06123
Observations       80

ANOVA
            df   F         Significance F
Regression  1    255.7274  2.46E-26
Residual    78
Total       79

              Coefficients
Intercept     -10.6715
X Variable 1  0.043651

(The SS, MS, standard errors and p-values were lost in extraction.)
Table 8: Analysis of variance table

The multiple R is the Pearson correlation coefficient, whereas the R Square is called the coefficient of determination; it measures the fit of the model, showing the proportion of the variability of the data explained by the linear model. The fitted model is Y = -10.6715 + 0.043651*X. The adjusted R Square is the coefficient of determination adjusted for the degrees of freedom of the model; it penalizes the coefficient for the number of terms in the model. The p-value of the constant provides evidence that the constant is not statistically significant and therefore should be removed from the model. So, if we run the regression again, we will just tick Constant is Zero. The results are the same as those generated by SPSS, except for some slight differences due to rounding. A disadvantage of Excel is that it offers no normality test. The two plots constructed by Excel are presented below.
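The slope, intercept and R Square come straight from the least-squares formulae. A minimal Python sketch with hypothetical data:

```python
from statistics import fmean


def simple_regression(x, y):
    """Least-squares fit y = a + b*x, returning (intercept, slope, R^2)."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    b = sxy / sxx                     # slope ("X Variable 1")
    a = my - b * mx                   # intercept
    ss_tot = sum((v - my) ** 2 for v in y)
    ss_res = sum((v - (a + b * u)) ** 2 for u, v in zip(x, y))
    r2 = 1 - ss_res / ss_tot          # coefficient of determination
    return a, b, r2


# Hypothetical data lying exactly on y = 1 + 2x, so R^2 is 1:
a, b, r2 = simple_regression([0, 1, 2, 3], [1, 3, 5, 7])
```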
Figure 1: Line Fit Plot
Figure 2: Normal Probability Plot

The first figure is a scatter plot of the data: the X values versus the Y values and the predicted Y values. The linear relation between the two variables is obvious from the graph; do not forget that the correlation coefficient exhibited a high value. The Normal Probability Plot is used to check the normality of the residuals graphically. Should the residuals follow the normal distribution, the graph should be a straight line. Unfortunately, many times the eye is not the best judge of such things. The Kolmogorov-Smirnov test conducted in SPSS provided evidence to support the
normality hypothesis of the residuals. Excel also produced the residuals and the predicted values in the same sheet. We shall construct a scatter plot of these two quantities, in order to check (graphically) the assumption of homoscedasticity (i.e. constant variance of the residuals). If the assumption of homoscedasticity of the residuals holds true, then we should see all the values within a bandwidth. We see that almost all values fall between -40 and 40, except for two values that are over 70 and 100. These values are the so-called outliers. We can assume that the residuals exhibit constant variance. If we are not certain about the validity of the assumption, we can transform the Y values using a log transformation and run the regression with the transformed Y values.
[Scatter plot of the residuals versus the predicted values]
Picture 13
Anova: Single Factor

SUMMARY
Groups     Count
Column 1   253
Column 2   73
Column 3   79

ANOVA
Source of Variation   SS
Between Groups        1909939.2
Within Groups         2536538
Total                 4446477.2

(The remaining summary and ANOVA columns were lost in extraction.)
Table 9: The one-way analysis of variance

The results generated by SPSS are very close to the results shown above. There is some difference in the sums of squares, but of rather small importance. The mean square (MS) values are very close to one another. Still, by no means can we take the above results for granted, since Excel offers no options for checking the assumptions.
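The one-way ANOVA decomposition behind this table can be sketched as follows (hypothetical data; Excel additionally reports the MS columns, the p-value and the critical F):

```python
from statistics import fmean


def one_way_anova(groups):
    """One-way ANOVA: between/within sums of squares and the F statistic."""
    all_vals = [v for g in groups for v in g]
    grand = fmean(all_vals)
    ss_between = sum(len(g) * (fmean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - fmean(g)) ** 2 for g in groups for v in g)
    df_b = len(groups) - 1                  # between-groups df
    df_w = len(all_vals) - len(groups)      # within-groups df
    f = (ss_between / df_b) / (ss_within / df_w)
    return ss_between, ss_within, f


# Two hypothetical groups of three observations each:
ss_b, ss_w, f_stat = one_way_anova([[1, 2, 3], [4, 5, 6]])
```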
In other words, the first combination of the two factors occupies the cells from B2 to B26. This means that each combination of factors has 24 measurements.
Picture 14

From the window of picture 3, we select Anova: Two-Factor With Replication, and the window that appears is shown in picture 15.
Picture 15

We fill the two blank white boxes with the input range and the rows per sample. The alpha is at its usual value, 0.05. By pressing OK, the results are presented overleaf. The results generated by SPSS are the same. At the bottom of table 10 there are three p-values: two for the two factors and one for their interaction. The row factor is denoted as Sample in Excel.
Anova: Two-Factor With Replication (reconstructed; the row-group and column labels were scrambled in extraction)

SUMMARY

First row group:
Count      24         24         24         72
Sum        2378       2461       2523       7362
Average    99.08333   102.5417   105.125    102.25
Variance   508.4275   515.7373   664.8967   553.3732

Second row group:
Count      24         24         24         72
Sum        2537       2531       2826       7894
Average    105.7083   105.4583   117.75     109.6389
Variance   237.5199   416.433    802.9783   505.3326

Third row group (only one cell and the row total survived):
Count                            24         72
Sum                              7629       21861
Average                          317.875    303.625
Variance                         7763.679   9660.181

Total
Count      72         72         72
Sum        13144      10995      12978
Average    182.5556   152.7083   180.25
Variance   15441.38   8543.364   12621.15

ANOVA
Source of Variation   SS         df    MS
Sample (=Rows)        39713.18   2     19856.59
Columns               1877690    2     938845.2
Interaction           73638.1    4     18409.53
Within                647689.7   207   3128.936
Total                 2638731    215

(The F and p-value columns of the ANOVA table were lost in extraction.)
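The sums of squares above decompose the total variability into row, column, interaction and within-cell parts. A minimal Python sketch of that decomposition (hypothetical data; the interaction sum of squares is obtained by subtraction):

```python
from statistics import fmean


def two_way_anova(cells):
    """Two-factor ANOVA with replication.  cells[i][j] is the list of
    replicates for row level i and column level j (all of length r)."""
    rows, cols = len(cells), len(cells[0])
    r = len(cells[0][0])
    all_vals = [v for row in cells for cell in row for v in cell]
    g = fmean(all_vals)
    row_m = [fmean([v for cell in row for v in cell]) for row in cells]
    col_m = [fmean([v for row in cells for v in row[j]]) for j in range(cols)]
    ss_rows = cols * r * sum((m - g) ** 2 for m in row_m)
    ss_cols = rows * r * sum((m - g) ** 2 for m in col_m)
    ss_within = sum((v - fmean(cell)) ** 2
                    for row in cells for cell in row for v in cell)
    ss_total = sum((v - g) ** 2 for v in all_vals)
    ss_inter = ss_total - ss_rows - ss_cols - ss_within
    return {"Sample": ss_rows, "Columns": ss_cols,
            "Interaction": ss_inter, "Within": ss_within, "Total": ss_total}


# A tiny hypothetical 2x2 design with 2 replicates per cell:
res = two_way_anova([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
```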
Picture 16
Anova: Two-Factor Without Replication

SUMMARY    Count   Sum   Average    Variance
Row 1      3       553   184.3333   11385.33
Row 2      3       544   181.3333   21336.33
Row 3      3       525   175        15379
Column 1   3       975   325        499
Column 2   3       340   113.3333   332.3333
Column 3   3       307   102.3333   85.33333

ANOVA
Source of Variation   df   F
Rows                  2    0.160534
Columns               2    111.3707
Error                 4
Total                 8

(The SS, MS and p-value columns were lost in extraction.)
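The same decomposition without replication can be sketched as follows (hypothetical data; since each cell holds a single observation, the error term is simply what remains after the row and column effects are removed):

```python
from statistics import fmean


def two_way_no_rep(table):
    """Two-factor ANOVA without replication on an r-by-c grid of single
    observations; returns the F statistics for rows and for columns."""
    r, c = len(table), len(table[0])
    g = fmean([v for row in table for v in row])
    row_m = [fmean(row) for row in table]
    col_m = [fmean([table[i][j] for i in range(r)]) for j in range(c)]
    ss_rows = c * sum((m - g) ** 2 for m in row_m)
    ss_cols = r * sum((m - g) ** 2 for m in col_m)
    ss_total = sum((v - g) ** 2 for row in table for v in row)
    ss_err = ss_total - ss_rows - ss_cols        # residual ("Error") SS
    ms_err = ss_err / ((r - 1) * (c - 1))
    f_rows = (ss_rows / (r - 1)) / ms_err
    f_cols = (ss_cols / (c - 1)) / ms_err
    return f_rows, f_cols


# A tiny hypothetical 2x2 grid:
f_rows, f_cols = two_way_no_rep([[1, 2], [4, 3]])
```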
FDIST calculates the F probability distribution (degree of diversity) for two data sets.
FINV returns the inverse of the F probability distribution.
FISHER calculates the Fisher transformation.
FISHERINV returns the inverse of the Fisher transformation.
FORECAST calculates a future value along a linear trend based on an existing time series of values.
FREQUENCY calculates how often values occur within a range of values and then returns a vertical array of numbers having one more element than Bins_array.
FTEST returns the two-tailed probability that the variances of two data sets are not significantly different.
GAMMADIST calculates the gamma distribution.
GAMMAINV returns the inverse of the gamma distribution.
GAMMALN calculates the natural logarithm of the gamma function.
GEOMEAN calculates the geometric mean.
GROWTH predicts the exponential growth of a data series.
HARMEAN calculates the harmonic mean.
HYPGEOMDIST returns the probability of selecting an exact number of a single type of item from a mixed set of objects. For example, a jar holds 20 marbles, 6 of which are red; if you choose three marbles, what is the probability you will pick exactly one red marble?
INTERCEPT calculates the point at which a line will intersect the y-axis.
KURT calculates the kurtosis of a data set.
LARGE returns the k-th largest value in a data set.
LINEST generates a line that best fits a data set by generating a two-dimensional array of values to describe the line.
LOGEST generates a curve that best fits a data set by generating a two-dimensional array of values to describe the curve.
LOGINV returns the inverse of the lognormal cumulative distribution.
LOGNORMDIST returns the cumulative lognormal distribution of a value.
MAX returns the largest value in a data set (ignores logical values and text).
MAXA returns the largest value in a data set (does not ignore logical values and text).
MEDIAN returns the median of a data set.
MIN returns the smallest value in a data set (ignores logical values and text).
MINA returns the smallest value in a data set (does not ignore logical values and text).
MODE returns the most frequently occurring value in an array or range of data.
NEGBINOMDIST returns the probability that there will be a given number of failures before a given number of successes in a negative binomial distribution.
NORMDIST returns the normal cumulative distribution for a value, given the mean and standard deviation.
NORMINV returns the inverse of the normal cumulative distribution: the value below which a random value from the distribution falls with the stated probability.
NORMSDIST returns the standard normal cumulative distribution, with a mean of 0 and a standard deviation of 1.
NORMSINV returns the inverse of the standard normal cumulative distribution.
PEARSON returns a value that reflects the strength of the linear relationship between two data sets.
PERCENTILE returns the k-th percentile of the values in a range.
PERCENTRANK returns the rank of a value in a data set as a percentage of the data set.
PERMUT calculates the number of permutations for a given number of objects that can be selected from the total objects.
POISSON returns the probability of a number of events happening, given the Poisson distribution of events.
PROB calculates the probability that values in a range are between two limits or equal to a lower limit.
QUARTILE returns the quartile of a data set.
RANK calculates the rank of a number in a list of numbers: its size relative to other values in the list.
RSQ calculates the square of the Pearson correlation coefficient (also known as the coefficient of determination in the case of linear regression).
SKEW returns the skewness of a data set (the degree of asymmetry of a distribution around its mean).
SLOPE returns the slope of a line.
SMALL returns the k-th smallest value in a data set.
STANDARDIZE calculates the normalized values of a data set (each value minus the mean, divided by the standard deviation).
STDEV estimates the standard deviation of a numerical data set based on a sample of the data.
STDEVA estimates the standard deviation of a data set (which can include text and true/false values) based on a sample of the data.
STDEVP calculates the standard deviation of a numerical data population.
STDEVPA calculates the standard deviation of a data population (which can include text and true/false values).
STEYX returns the standard error of the predicted y value for each x value in the regression.
TDIST returns the Student's t distribution.
TINV returns a t value based on a stated probability and degrees of freedom.
TREND returns values along a trend line.
TRIMMEAN calculates the mean of a data set having excluded a percentage of the upper and lower values.
TTEST returns the probability associated with a Student's t-test.
VARA estimates the variance of a data set (which can include text and true/false values) based on a sample of the data.
VARP calculates the variance of a data population.
VARPA calculates the variance of a data population, which can include text and true/false values.
WEIBULL calculates the Weibull distribution.
ZTEST returns the two-tailed p-value of a z-test.
Column 1 contains the values, Rank contains the ranks of the values, Percent contains the cumulative percentage of the values (the size of each value relative to the others), and the first column (Points) indicates the row of each value. In the above table, Excel has sorted the values according to their ranks. The first column indicates the exact position of each value. We have to sort the data with respect to this first column, so that the format is as it was in the first place. We repeat these actions for the second set of data and then calculate the correlation coefficient of the ranks of the values. Attention must be paid to the sequence of the actions described: the ranks of the values must be calculated separately for each data set, and the sorting needs to be done before calculating the correlation coefficient. For the data used in this example, Spearman's correlation coefficient was calculated to be 0.020483, whereas the correlation calculated using SPSS is equal to 0.009. The reason for this difference is that SPSS has a way of dealing with values that share the same rank: it assigns to all of them the average of their ranks. That is, if three values are equal (so their ranks are the same), SPSS assigns to each of these three values the average of their ranks (Excel does not do this).
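The tie-averaging behaviour of SPSS described above can be sketched as follows (a minimal Python illustration of the idea, not SPSS's actual code): compute tie-averaged ranks for each data set, then take the Pearson correlation of the ranks.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks,
    as SPSS does (plain Excel ranking does not average ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                      # extend over the block of ties
        avg = (i + j) / 2 + 1           # average 1-based rank of the block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's coefficient: Pearson correlation of tie-averaged ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den
```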
test statistics, due to the different handling of the tied ranks and the use of a different test statistic. There is also another way to calculate a test statistic, namely by taking the sum of the positive ranks. Both Minitab and SPSS calculate another type of test statistic, which is based on either the positive or the negative ranks. What is worth mentioning is that the second formula is best used when there are no tied ranks. Irrespective of the test statistic used, the result will be the same as far as the rejection of the null hypothesis is concerned. Using the second formula the result is 1401, whereas Minitab gives 1231.5. As for the result of the test (reject the null hypothesis or not), one must look at the tables for the one-sample Wilcoxon signed rank test. The fact that Excel offers no options for calculating the probabilities used in the non-parametric tests, in conjunction with the tedious work involved, makes it less popular for this use.
Values (Xi)  m-Xi   |m-Xi|  Rank of |m-Xi|  Sign  Ranks Ri  Squared Ranks Ri^2
307          13     13      64              +1    64        4096
350          -30    30      47              -1    -47       2209
318          2      2       67              +1    67        4489
304          16     16      60              +1    60        3600
302          18     18      56              +1    56        3136
429          -109   109     19              -1    -19       361
454          -134   134     13              -1    -13       169
440          -120   120     17              -1    -17       289
455          -135   135     11              -1    -11       121
390          -70    70      32              -1    -32       1024
350          -30    30      47              -1    -47       2209
351          -31    31      43              -1    -43       1849
383          -63    63      37              -1    -37       1369
360          -40    40      41              -1    -41       1681
383          -63    63      37              -1    -37       1369
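The sum-of-positive-ranks statistic described above can be sketched as follows. This is a minimal Python illustration: zero differences are dropped and tied absolute differences receive the average rank (which, as noted, plain Excel ranking does not do), and it uses the standard convention that the smallest absolute difference gets rank 1, whereas the table above ranks in the opposite direction.

```python
def wilcoxon_signed_rank(x, m):
    """One-sample Wilcoxon signed-rank statistic W+: the sum of the ranks
    of the absolute differences from m that carry a positive sign."""
    diffs = [v - m for v in x if v != m]        # drop zero differences
    abs_sorted = sorted(abs(d) for d in diffs)

    def rank_of(a):
        lo = abs_sorted.index(a) + 1            # first 1-based position
        hi = lo + abs_sorted.count(a) - 1       # last 1-based position
        return (lo + hi) / 2                    # average rank for ties

    return sum(rank_of(abs(d)) for d in diffs if d > 0)


# Hypothetical data tested against the median m = 2:
w_plus = wilcoxon_signed_rank([1, 2, 3, 10], 2)
```

To decide on the null hypothesis, this statistic would then be compared against the one-sample Wilcoxon signed rank tables, exactly as described in the text.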
Xi    Yi    Ranks Ri   Squared Ranks Ri^2
307   225   8          64
350   250   6          36
318   250   11         121
304   232   9          81
302   350   -13        169
429   400   14         196
454   351   5          25
440   318   3          9
455   383   9          81
390   400   -15        225
350   400   -12        144
351   258   7          49
383   140   1          1
360   250   4          16
383   250   2          4
Table 14: Procedure of the Wilcoxon Signed Rank Test with Paired Data