This tutorial offers students with limited statistics backgrounds a concise review of, and introduction to, fundamental topics in the MA program. It also provides a refresher for students with more extensive statistics backgrounds.
To encourage a practical understanding, topics are presented using actual air travel data and Excel screenshots of statistical results.
There is a self-test at the end of each section to help each student evaluate his or her grasp of the material.
No one will grade these self-tests; responsibility rests with the student. Students are advised to review incorrect answers and to seek additional assistance if needed. Students may contact brian.goff@wku.edu with questions.
Additional, concise sources of information on the topics presented are available from Hyperstats (http://davidmlane.com/hyperstat/) and the Statsoft Electronic Textbook (http://www.statsoft.com/textbook/stathome.html).
Section I
Descriptive Statistics and Measures of Sampling Error
The distribution of fare per mile is similar to the normal after smoothing out the rectangles, but is slightly skewed (tilted) to the right.
This graphic was produced by the statistical software package SPSS.
Error not related to sampling selection (question bias, response bias, dishonest responses, data entry errors, etc.) must be small relative to the size of the sampling error. This kind of error is called non-sampling error.
One-Sample Test (Test Value = .30)
95% Confidence Interval of the Difference: Lower = -.0611, Upper = .0268
Note: our test is really 1-tailed since we are testing greater than 0.30. We should cut the p-value in half to 0.2125 (21.25%), but this is still well above 0.05.
One-Sample Test (Test Value = .35), variable: farepermile
t = -3.189, df = 20
95% Confidence Interval of the Difference: Lower = -.1111, Upper = -.0232
Sidebar on Hypothesis Testing
In addition to p-values, t-statistics and confidence intervals (all derived from standard errors) can also test a hypothesis. As a rule of thumb, t-values greater than 2 in absolute value are equivalent to p-values below 0.05.
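The one-sample tests above can be sketched in a few lines of code. This is a minimal illustration only: the fare-per-mile numbers below are hypothetical, not the course data set.

```python
# Sketch of a one-sample t-test like the SPSS output above.
# NOTE: these fare-per-mile values are made up for illustration.
import numpy as np
from scipy import stats

fare_per_mile = np.array([0.21, 0.35, 0.28, 0.41, 0.19, 0.33,
                          0.27, 0.30, 0.24, 0.38, 0.26, 0.31])

# Two-tailed test of H0: mean = 0.30
t_stat, p_two_tailed = stats.ttest_1samp(fare_per_mile, popmean=0.30)

# For a one-tailed test (H1: mean > 0.30), halve the two-tailed
# p-value only when t points in the hypothesized direction.
p_one_tailed = p_two_tailed / 2 if t_stat > 0 else 1 - p_two_tailed / 2

# Rule of thumb: |t| > 2 roughly corresponds to p < 0.05.
print(t_stat, p_two_tailed, p_one_tailed)
```

The confidence interval, t-statistic, and p-value are all derived from the same standard error, which is why any one of them can be used to test the hypothesis.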
One-Sample Test, variable: Time
t = -4.373, df = 24
Section II
Regression Analysis
The most frequently used statistical technique for examining relationships between variables is Regression Analysis (or some technique very similar to it). Regression analysis can be used for many kinds of data and relationships, including:
- Linear relationships and curved relationships
- Quantitative data and qualitative data
- Cross-sectional and time-series data
The following slides present the simplest form of regression analysis: a quantitative dependent variable (Air Fare) and one quantitative independent variable (Distance), with the relationship treated as linear.
The Scatterplot presented in Figure 1 depicts the 21 Fare (Y-axis) and Distance (X-axis) combinations in the data set
The graph shows that as distance increases, fare also tends to increase, but the relationship is not perfect; otherwise, the points would lie on a straight line.
Figure 3 adds another element to the plot: a straight line of points (a line connecting the pink points).
These points represent the regression line that Excel chose as the straight line that best fits the scatterplot points. Software chooses the line to minimize the sum of the squared distances between the blue points and the pink line; this method is called the Least Squares or Ordinary Least Squares (OLS) method and is widely used.
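The least-squares fit described above can be sketched as follows. The (distance, fare) pairs here are hypothetical stand-ins for the course data, chosen only to make the example run.

```python
# A minimal sketch of Ordinary Least Squares with NumPy.
# NOTE: the (distance, fare) pairs are hypothetical, not the course data.
import numpy as np

distance = np.array([200, 450, 700, 950, 1200, 1500])   # miles
fare = np.array([170, 195, 210, 240, 255, 290])         # dollars

# np.polyfit chooses the slope and intercept that minimize the sum of
# squared vertical distances between the points and the line.
slope, intercept = np.polyfit(distance, fare, 1)
predicted = intercept + slope * distance
residuals = fare - predicted
print(slope, intercept, (residuals**2).sum())
```

With an intercept in the model, the residuals always sum to (essentially) zero; only their squared sum is minimized.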
[Figure: scatterplot of Fare (Y-axis) versus Distance (X-axis) with the fitted regression line]
Fare = 157 + 0.08*Distance
(Intercept) (Slope)
The y-intercept indicates that if distance were 0, the fare would be 157
The intercept in this case is not an economically meaningful number because there are no flights of 0 miles. The intercept merely extends the line to the Y-axis for statistical purposes. Be aware of the relevant range (min, max) of the X-variable.
In a 2-variable regression like this one, the Multiple R is the same thing as the Correlation Coefficient between X and Y.
The R-square is the squared correlation coefficient in such cases; its maximum is 1.0 and its minimum is 0. The correlation coefficient itself ranges from -1.0 to 1.0 (perfectly correlated at either extreme) and can be positive or negative depending on the direction of the relationship between the two variables; squaring it removes the sign.
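The equality between the squared correlation and R-square in a two-variable regression can be verified directly. The x and y values below are hypothetical.

```python
# Check that, with one X-variable, R-square equals the squared
# correlation coefficient. NOTE: the data are hypothetical.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

r = np.corrcoef(x, y)[0, 1]          # correlation, in [-1, 1]

slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)
ss_res = (resid**2).sum()
ss_tot = ((y - y.mean())**2).sum()
r_square = 1 - ss_res / ss_tot       # R-square, in [0, 1]

print(r**2, r_square)                # the two agree
```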
Just like the sample mean, the regression coefficients are sample statistics, usually used to estimate what the true relationship would be if all possible data were used. Regression coefficients, therefore, also have standard errors that estimate their sampling error.
The slope coefficient for distance (0.08) has a standard error of 0.02. This implies that the population parameter (the regression coefficient using all possible data) may easily be 2 cents higher or lower than the 0.08 coefficient estimated from this sample. For a wider (approximately 95%) margin for error, this standard error can be multiplied by about 2.0.
The t Stat and P-value are also ways of assessing the reliability of the coefficient; they test whether the coefficient is significantly different from zero. As a rule of thumb, if the t-statistic is greater than 2.0 (or less than -2.0), the coefficient is viewed as significantly different from zero. The t Stat on Distance is 4.239, so it is statistically significant. The P-value estimates the likelihood of finding a coefficient of 0.084 by mere chance if the true value were zero. The P-value of 0.000 indicates that this would be very unlikely, also showing a statistically significant result. In scientific research, P-values below 5 percent (0.05) are taken as statistically significant; in other settings, the cutoff level for the P-value may vary.
Regressions can also include more than one independent (X) variable. Such regressions are called multiple regression analysis and involve only slight modifications of the earlier points.
Also, economists widely use qualitative variables as independent variables. When these take on only two values (e.g., male/female), they are usually coded as (1, 0) and called binary or dummy variables.
In the Air Travel data, we have such a variable, Direct SWA, that indicates whether Southwest Airlines flies this route directly (1) or not (0). This variable is added to the regression analysis, resulting in the following Excel output:
Note that the SWA dummy variable only influences the y-intercept; it does not influence the slope for distance (see next slide).
The line connecting the upper pink dots shows the regression line when SWA = 0. The line connecting the lower pink dots shows the regression line when SWA = 1. The Fare-Distance slope for both lines is 0.08.
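The intercept-shifting role of a dummy variable can be sketched in code. The distance, fare, and SWA values below are hypothetical.

```python
# Regression with a dummy (0/1) variable: the dummy shifts the
# intercept but not the Distance slope. NOTE: data are hypothetical.
import numpy as np

distance = np.array([200, 450, 700, 950, 1200, 1500, 300, 600, 900, 1300])
swa      = np.array([0,   0,   0,   0,   0,    0,    1,   1,   1,   1])
fare     = np.array([170, 195, 210, 240, 255,  290,  130, 160, 185, 220])

# Multiple regression via least squares: Fare = b0 + b1*Distance + b2*SWA
X = np.column_stack([np.ones(len(fare)), distance, swa])
b0, b1, b2 = np.linalg.lstsq(X, fare, rcond=None)[0]

# Two parallel lines with the same slope b1:
#   SWA = 0:  Fare = b0 + b1*Distance
#   SWA = 1:  Fare = (b0 + b2) + b1*Distance
print(b0, b1, b2)
```

Because b2 is a single constant, the SWA = 1 line is the SWA = 0 line shifted vertically by b2, with the slope unchanged.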
Table R2. Regression with Multiple X-Variables
Regression Statistics: Multiple R, R Square, Adjusted R Square, Standard Error, Observations
Another important difference that results from adding the SWA variable is the increase in the R-Square value: it is now 82.2% (it was about 48% when using only Distance).
The combination of Distance and Direct SWA accounts for 82.2% of the differences in fares across cities; adding SWA increased this value by about 34 percentage points.
From the regression predictions and errors, Excel (and other software) computes an Analysis of Variance, or ANOVA. The F-Statistic is the most important number here; it is the ratio of the mean regression sum of squares to the mean residual sum of squares. Unlike the R-Square value, the F-statistic adjusts for the number of variables used. The Significance F is simply a p-value testing the null hypothesis that the X-variables, as a group, do not matter in explaining the Y-variable; with this data, the null hypothesis is rejected because the p-value is very low. SS above refers to Sum of Squares. The Residual SS simply squares the individual errors and adds them up. MS refers to mean sum of squares, which divides each SS by its degrees of freedom (for the residual, the number of observations minus the number of coefficients in the regression). The Regression (Predicted) sum of squares computes the differences between the predicted values for Fare and the mean of Fare, squares them, and adds them up. The Total sum of squares adds the Predicted and Residual together. The R-Square is simply the regression sum of squares divided by the total. The Adjusted R-Square, like the F-statistic, adjusts for the number of variables used.
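The ANOVA pieces described above fit together as follows; this sketch uses hypothetical data and one X-variable.

```python
# How the ANOVA table's pieces relate: SS, MS, R-square, F, and
# Significance F. NOTE: the data are hypothetical.
import numpy as np
from scipy import stats

x = np.array([1., 2., 3., 4., 5., 6., 7., 8.])
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2, 6.8, 8.1, 8.7])
n, k = len(y), 1                      # k = number of X-variables

slope, intercept = np.polyfit(x, y, 1)
pred = intercept + slope * x

ss_res = ((y - pred)**2).sum()        # Residual SS
ss_reg = ((pred - y.mean())**2).sum() # Regression (Predicted) SS
ss_tot = ss_reg + ss_res              # Total SS

r_square = ss_reg / ss_tot            # regression SS / total SS
ms_reg = ss_reg / k                   # mean squares divide SS by df
ms_res = ss_res / (n - k - 1)
f_stat = ms_reg / ms_res              # the F-Statistic
signif_f = 1 - stats.f.cdf(f_stat, k, n - k - 1)  # "Significance F"
print(r_square, f_stat, signif_f)
```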
Regression Pointers
- Regressions that are well done have residuals with no obvious patterns that are roughly bell-shaped; checking the residuals for these and other characteristics is called Residual Analysis.
- Regressions that leave out key explanatory (X) variables can yield misleading slopes; this is called Omitted Variables Bias. Regressions leaving out key variables should be viewed as exploratory or preliminary in nature.
- There is no magical R-squared value to be obtained; if a model is put together well, then a low R-squared is fine, and if a model has key flaws, then a high R-squared does not make it good.
- Only humans can determine whether a regression is causal (Income-Sales) or merely associative (SAT-ACT); the software treats both cases the same.
1. The regression equation depicted by the table is
a. 5k Time = 0.731 + Age + Intervals + Residual
b. 5k Time = 17.554 + Age*Intervals + Residual
c. 5k Time = 17.554 + 0.071*(-0.863)*Age*Intervals + Residual
d. 5k Time = 17.554 + 0.071*Age - 0.863*Intervals + Residual

2. The percent of 5k time differences accounted for by Age and Intervals in the regression model is
a. 0.731
b. 17.554
c. 12.660
d. 0.535

3. The slope coefficient for Age is
a. 0.071
b. 0.731
c. 17.554
d. 0.016
4. The likely sampling error in the slope coefficient for Age is
a. 0.071
b. 0.731
c. 17.554
d. 0.016

5. The slope coefficient for Age implies that
a. For each 1 minute increase in Time, Age increases by 0.071 years
b. For each 1 year increase in Age, Time increases by 1 minute
c. For each 1 year increase in Age, Time increases by 0.071 minutes
d. For each 1 year increase in Time, Age increases by about 53%

6. The regression results imply that if Age were 0, then Time would be
a. 0.731
b. 12.660
c. 24.000
d. 17.554
7. The value in the preceding question
a. Means that a newborn baby would be predicted to run this time in a 5k
b. Means that the value is really only a hypothetical extension of the regression line, because none of the actual data go back to zero years of Age
c. Means that the regression is not reliable at any values
d. Means that babies should compete in the Olympics

8. The coefficient for Intervals implies that
a. When Intervals equals 1, the Age slope is reduced by 0.863
b. When Intervals equals 0, the y-intercept value is reduced by 0.863 minutes
c. When Intervals equals 1, the Age slope is the same but the entire regression line shifts down by 0.863 minutes
d. When Intervals equals 0, the Age slope is the same but the entire regression line shifts down by 0.863 minutes

9. If you wanted to compute the effects of 10 more years of Age on the predicted 5k Time, you should multiply
a. 0.10 x 0.071
b. 10 x 0.071
c. 10 x 1.0
d. 100 x 0.071
10. The predicted value for 5k Time when a person is 47 and using intervals in training would be found by which of the following equations?
a. Predicted 5k Time = 17.554 + 0.071*(47)
b. Predicted 5k Time = 0.731 + 0.072*47 - 0.863*(1)
c. Predicted 5k Time = 17.554 + 0.071*(47) - 0.863*(1)
d. Predicted 5k Time = 0.071*(47) - 0.863*(1)

11. Using the data sheet provided earlier, compute the residual for the first observation. (Note: you will first have to compute the predicted time.)
a. -0.545
b. 0.631
c. -0.034
d. 1.232

12. The data provided on the accuracy of the coefficients indicates that
a. All are not significantly different from zero
b. Age is significantly different from zero but not Intervals
c. Intervals is significantly different from zero but not Age
d. All are significantly different from zero
Section III
Statistical Software
Overview
Personal computers and software make it possible for almost anyone to complete the complicated or lengthy computations needed for statistics; knowing what to do with the results is the hard part. Excel contains many useful statistical and graphing capabilities; these are introduced in the next few slides.
Software dedicated to statistical operations vastly expands the breadth of procedures possible, as well as making some much easier than in Excel. Some commonly used statistical software includes:
- SAS (www.sas.com); the company offers many varieties; JMP is a point-click product; SAS is available in some places at WKU
- SPSS (www.spss.com); this software is available in most computer labs on campus; it is not as widely used by economists as SAS but contains most of the same features, especially for basic purposes
- Stata (www.stata.com) is widely used by economists and contains broad and very powerful tools
- Eviews (www.eviews.com) is also very powerful and especially useful for time series and forecasting applications; both Stata and Eviews provide point-click functionality
One of the main differences between Excel and the data sheets in statistical software is that Excel is address-driven (each cell has an address), whereas statistical software is variable-driven: once a column of data exists for a variable, the entire column can be manipulated simply by referring to its name.
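The variable-driven style can be illustrated with the pandas library in Python, which many statistical packages resemble in this respect. The column names and values below are hypothetical.

```python
# "Variable-driven" manipulation: a whole column is referenced by
# name rather than by cell addresses. NOTE: values are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "fare": [170, 195, 210],
    "distance": [200, 450, 700],
})

# One expression transforms the entire column at once
df["fare_per_mile"] = df["fare"] / df["distance"]
print(df)
```

In Excel, the equivalent operation would require writing a formula in one cell (e.g., =A2/B2) and copying it down the column.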
(Note: If you opened Excel from your desktop, the procedures above should work; if you happened to open Excel by opening an Excel-based spreadsheet while browsing on the internet, it may not work.)
Formatting the output table (this is something you should always do in Excel)
1. Highlight the columns with the table
2. Select the Format menu
3. Select the Column and AutoFit Selection options
4. Again, select the Format menu
5. Select the Cells option
6. In the Number menu, choose the Number option
7. Pick a number for the Decimal Places box (the number of decimal places depends somewhat on the data; 3 will be fine here)
Make sure to do this step in Excel; tables with a lot of insignificant decimal places are very messy to read
Also, under the Data menu, there is a Pivot Table and Pivot Chart option that provides further capabilities
If you would like a hands-on introduction to other statistical software, please contact Brian Goff at brian.goff@wku.edu. Several other economics professors can also provide assistance in becoming acquainted with software.
Probability Distributions
A final topic briefly introduced here is that of probability distributions (PDs). A PD is a formula (often presented as a graphic or table) that links values of a variable with the probability of those values.
PDs are used in many ways; for statistics, one of the key uses is to assess hypotheses including the use of t-statistics and p-values
Statistical software makes an extensive knowledge of PDs not necessary because the relevant information about the PD is stored by the computer and used as needed; however, a few basic points are worthwhile even for basic statistics users
Probability Distributions 2
PDs have a center, dispersion, and symmetry or skew (asymmetry)
Measures of the location of the center include the mean and median; measures of dispersion include the standard deviation and range. PDs also have tails (the ends), measured by the amount of kurtosis.
t-Distribution
It is also bell-shaped, but is wider in its tails than the normal; it converges to the normal with large samples.
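This convergence can be seen by comparing critical values: as the degrees of freedom grow, the t-distribution's two-tailed 5% cutoff shrinks toward the normal's 1.96.

```python
# The t-distribution's critical values approach the normal's 1.96
# as the degrees of freedom (sample size) grow.
from scipy import stats

for dof in (5, 20, 100, 1000):
    print(dof, stats.t.ppf(0.975, dof))   # two-tailed 5% cutoff

print(stats.norm.ppf(0.975))              # normal cutoff, about 1.96
```

This is why the rule of thumb "|t| > 2" works: for moderate sample sizes the exact cutoff sits a bit above 2, and for large samples a bit below.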
The Binomial Distribution deals with 2-outcome situations. The F-Distribution and Chi-Square Distribution are commonly used distributions when the topic is variability.
We can now produce probabilities for a variable assumed to be normal or near normal
Example: let's assume that male height is approximately normal with a mean of 70 inches and a standard deviation of 2 inches. What is the probability of finding someone taller than 74 inches? In the NORMDIST window, plug in 74 for X, 70 for Mean, and 2 for Standard Deviation, and put True in the Cumulative box. Excel will produce the probability of being 74 or less (that is, the cumulative probability): 0.977. The probability of being taller than 74 is 1 - 0.977 = 0.023, or 2.3%.
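The same NORMDIST calculation can be reproduced outside Excel, for instance with SciPy:

```python
# P(height > 74) when height ~ Normal(mean=70, sd=2),
# matching the NORMDIST example above.
from scipy import stats

p_less = stats.norm.cdf(74, loc=70, scale=2)   # cumulative, like NORMDIST(..., TRUE)
p_taller = 1 - p_less
print(round(p_less, 3), round(p_taller, 3))    # about 0.977 and 0.023
```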
The same or similar procedures can be used for 2-outcome (binomial) problems and many others, opening up a wide array of uses.
Clockwise from Left Corner: Normal, t-, F-, and Chi-Square Distributions
A gallery of PDs and more background is offered at the Engineering Statistics Handbook
http://www.itl.nist.gov/div898/handbook/eda/sect ion3/eda366.htm