
STATISTICS TUTORIAL FOR ECON MA STUDENTS

This tutorial offers students with limited statistics backgrounds a concise review of, and introduction to, fundamental topics used in the MA program. It also provides a refresher for students with more extensive statistics backgrounds.
To encourage a practical understanding, topics are presented using actual air travel data and Excel screenshots of statistical results.

There is a self-test at the end of each section to help each student evaluate his or her grasp of the material.
No one will grade these self-tests; responsibility rests with the student. Students are advised to review incorrect answers and to seek additional assistance in understanding them if needed. Students may contact brian.goff@wku.edu with questions.

Additional, concise sources of information on the topics presented are available from:
Hyperstat -- http://davidmlane.com/hyperstat/
Statsoft Electronic Textbook -- http://www.statsoft.com/textbook/stathome.html

Section I
Descriptive Statistics and Measures of Sampling Error

Air Travel Data


For 21 cities, the following data have been recorded or computed:
City = city identifying code
Fare = cheapest coach fare from Nashville to the city in $ on Orbitz on a given day
Distance = distance in miles for the route
Fare per Mile = Fare divided by Distance

Excel Screen Shot of Data

Distribution of Fare per Mile


The histogram has a normal (bell-shaped) distribution curve superimposed.

The distribution of fare per mile is similar to the normal after smoothing out the rectangles, but is slightly tilted, or skewed, to the right.
This graphic was produced by the statistical software package SPSS.

Table 1-- Descriptive Statistics for Fare per Mile


Key Univariate Descriptive Statistics
Mean = the average, about 28 cents per mile
Median = the middle value (50th percentile), so half of the values are above 28.8 cents per mile and half are below; the median is a better measure of the center of the data set when the data are highly skewed
Standard deviation = the average distance, or variability, of the observations from the mean; in this case, the 21 observations differ from the mean by an average of 9.6 cents per mile
Range = the difference between the minimum and maximum values
Skewness = the degree of asymmetry; zero is perfectly symmetric; large positive values (1.0 or larger) indicate a leaning to the right, and large negative values indicate a leaning to the left; the value of 0.559 indicates a slight rightward skew, as shown in the graph on the prior page
A sketch of these computations appears below.
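For readers who want to reproduce these measures outside of Excel or SPSS, here is a minimal Python sketch. The data array is a synthetic stand-in (the tutorial's 21 actual values are not reproduced here), so the printed numbers will only roughly resemble Table 1.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the 21 fare-per-mile observations; drawn to
# roughly match the sample mean (0.28) and standard deviation (0.096).
rng = np.random.default_rng(0)
fare_per_mile = rng.normal(loc=0.28, scale=0.096, size=21)

print("Mean:    ", fare_per_mile.mean())
print("Median:  ", np.median(fare_per_mile))
print("Std dev: ", fare_per_mile.std(ddof=1))            # sample (n-1) version
print("Range:   ", fare_per_mile.max() - fare_per_mile.min())
print("Skewness:", stats.skew(fare_per_mile, bias=False))
```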

Sample Statistics v. Population Parameters


The statistics reported in Table 1 are sample statistics: they summarize the 21 observations in the sample.
The full set of all possible fares between all cities of interest would represent the population of fares and fares per mile.
Population parameter refers to a summary measure using all possible data; for example, the population mean or population standard deviation.
The sample statistics reported in Table 1 provide estimates of these population parameters.
Table 1 also provides numerical estimates of the accuracy and reliability of the sample mean in estimating the population mean (see the next slide).

Table 1-- Estimates of Sampling Error


Key Estimates of Sampling Error
Standard Error (of the sample mean) = an estimate of the likely sampling error between the sample mean and the population mean; 0.021 implies that repeated samples of the same size could easily find sample means 2.1 cents higher or lower
Confidence Level (95%) = roughly two times the standard error (for 99%, roughly 2.5 times the standard error); it provides a figure similar to the standard error but with a wider margin for error; 0.044 at the 95% confidence level implies that about 95 out of 100 samples of this size would likely result in sample means within 4.4 cents of the estimated value
A sketch of both computations appears below.
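A minimal sketch of both calculations, using the summary numbers from Table 1 (mean 0.282, standard deviation 0.096, n = 21); the exact 95% multiplier comes from the t-distribution rather than the rough factor of 2.

```python
import math
from scipy import stats

n = 21
mean, sd = 0.282, 0.096           # sample mean and std dev from Table 1

se = sd / math.sqrt(n)            # standard error of the mean, about 0.021

# 95% margin for error: t critical value (about 2.09 for df = 20) times se
t_crit = stats.t.ppf(0.975, df=n - 1)
margin = t_crit * se              # about 0.044, matching Table 1

print(se, margin)
print("95% CI:", (mean - margin, mean + margin))
```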

How Reliable are Sample Error Estimates?


Standard errors and confidence intervals estimate sampling error.
Sampling error is error arising because one is using less than the entire population.
To accurately estimate population parameters and sampling error, samples must be representative of the population.
Randomly selected samples are the best (though not foolproof) way of assuring this.

Error not related to sample selection (question bias, response bias, dishonest responses, data entry errors, etc.) must be small relative to the size of the sampling error.
This kind of error is called non-sampling error.

Using Sampling Error in Testing Claims (Hypothesis Testing)


Estimates of sampling error permit claims or conjectures (hypotheses) concerning population parameters to be tested with sample statistics while taking into account a margin for error.
Testing a claim for the population mean:
Suppose someone thinks that the mean fare per mile for the full population is 30 cents or higher. Given the sample mean (0.282) and the standard error of 0.021, it is quite likely that another sample would yield an estimate of 30 cents or higher. If we double the standard error to get a 95% confidence interval and a margin for error of 0.042, we see that the claim of 30 cents or higher is quite plausible. In contrast, if someone were to claim that the mean is 35 cents or higher, the standard error and confidence interval suggest that such a figure is not very likely.

Testing Claims with P-values


Put briefly, a p-value shows the likelihood of obtaining the sample estimate by chance if the null hypothesis were true. Take the claim of a mean of 0.30 tested here (using SPSS software), given the sample mean of 0.282 and s.e. of 0.021.
The estimated p-value (called Sig. 2-tailed) is 0.425: the chance of finding such a sample value by chance is 42.5 percent. Typically, we reject the null only if this p-value is below a 5 percent threshold.
One-Sample Test (Test Value = .30)
              t       df   Sig. (2-tailed)   Mean Difference   95% CI Lower   95% CI Upper
farepermile   -.814   20   .425              -.01714           -.0611         .0268

Note: our test is really 1-tailed, since we are testing "0.30 or higher." We should cut the p-value in half, to 21.25 percent, but this is still well above the 5 percent (0.05) threshold. A sketch of the same test appears below.
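The same test can be run outside of SPSS; here is a minimal Python sketch using scipy's one-sample t-test. The data array is a synthetic stand-in for the 21 fares per mile, so its t and p will only roughly match the output above.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the 21 fare-per-mile observations
rng = np.random.default_rng(1)
fare_per_mile = rng.normal(loc=0.28, scale=0.096, size=21)

# Two-tailed test of the null hypothesis that the population mean is 0.30
t_stat, p_two_tailed = stats.ttest_1samp(fare_per_mile, popmean=0.30)

# One-tailed version ("0.30 or higher"): halve the two-tailed p-value,
# as the note above describes
p_one_tailed = p_two_tailed / 2

print(t_stat, p_two_tailed, p_one_tailed)
```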

Testing Claims with P-values


Now, test a mean of .35 or higher:
The estimated p-value (called Sig. 2-tailed) is 0.005. The chance of finding such a value by chance is 0.5 percent, which is far below the 5 percent threshold even before cutting it in half for a 1-tailed test. The p-value indicates that there is only a 0.5 percent chance of finding our mean of 0.282 if the true mean were 0.35 or higher. The null hypothesis of a mean of 0.35 or higher is rejected.

One-Sample Test (Test Value = .35)
              t        df   Sig. (2-tailed)   Mean Difference   95% CI Lower   95% CI Upper
farepermile   -3.189   20   .005              -.06714           -.1111         -.0232

Sidebar on Hypothesis Testing


In the previous slides, propositions that the population mean equaled a claimed value were tested using the p-value.
Any time that a p-value appears, a null hypothesis is being tested; the proposition being examined is called the null hypothesis. Using p-values from software output is the simplest way of testing a hypothesis. With small data sets, especially with small effects being tested, a p-value may not fall below 0.05.
This does not mean that the null hypothesis is true. It may indicate that the test lacks the Power to reject a false null (due to lack of data); see the Statsoft textbook under xxxxxxx for further information.

Sidebar on Hypothesis Testing
In addition to p-values, t-statistics and confidence intervals (all derived from standard errors) can also test a hypothesis.
As a rule of thumb, t-values greater than 2 in absolute value are equivalent to p-values below 0.05, as the sketch below illustrates.
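A quick sketch of where the rule of thumb comes from: for moderate degrees of freedom (20 here, matching the air travel tests), |t| = 2.0 corresponds to a two-tailed p-value just under 0.06.

```python
from scipy import stats

# Two-tailed p-value implied by |t| = 2.0 with 20 degrees of freedom
p = 2 * stats.t.sf(2.0, df=20)
print(p)   # about 0.059 -- close to the 0.05 cutoff, hence the rule of thumb
```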

Self Test Section I


The self-test uses a data set on 5K running times; the raw data appear on the next slide. The variables are:
Time = 5k time in minutes (decimals are fractions of minutes)
Age = age in years
Intervals = 1 if hard interval workouts were used and 0 if not
Miles Per Week = number of miles per week in training at the peak of training

Self-Test for Section I


1. The measure that provides the middle or 50th percentile observation is
a. 19.30
b. 19.50
c. 0.800
d. 19.00

2. The statistic that indicates how spread out the individual 5k times are from the average time is
a. 3.250
b. 0.160
c. 0.800
d. 0.192

3. Based on the data, you can say that the times are
a. Nearly symmetric
b. Highly skewed to the right
c. Highly skewed to the left
d. Not enough information

Self-Test for Section I


4. The likely sampling error for the mean is
a. 0.160
b. 0.192
c. 0.800
d. 3.250

5. The 95% confidence interval for the mean is computed by
a. Multiplying the standard error by about 2.0
b. Multiplying the standard deviation by 95%
c. Dividing the range by about 10
d. Dividing the mean by the sample size

6. The value for Age for the second observation is
a. 42
b. 21
c. 22
d. 44

Self-Test for Section I


7. In the output, a test of the mean is provided. The null hypothesis being tested is
a. That the population mean equals 19.3
b. That the population mean equals -4.373
c. That the population mean equals -0.700
d. That the population mean equals 20

8. The results in the table provide a 2-tailed test. To compute a 1-tailed test, you would
a. Double the p-value
b. Divide the t-statistic by two
c. Divide the p-value by two
d. Double the size of the confidence interval

9. Which of the following indicates that the null hypothesis should be rejected?
a. t = -4.373
b. p-value (Sig. 2-tailed) = 0.000
c. Both a and b
d. Neither a nor b

One-Sample Test (Test Value = 20)
       t        df   Sig. (2-tailed)   Mean Difference   95% CI Lower   95% CI Upper
Time   -4.373   24   .000              -.70000           -1.0304        -.3696

Correct Answers to Self-Test Section I


1. A
2. C
3. A (the skewness statistic is very small, 0.192, indicating only a slight amount of positive skew; 0 would be perfectly symmetric; values above 1.0 or below -1.0 would indicate substantial asymmetry)
4. A
5. A
6. C (go back to the original data sheet for this)
7. D (this is indicated by the "Test Value = 20" in the SPSS output)
8. C (the test provided is 2-tailed because it tests whether the mean equals 20 or not; a 1-tailed test would test whether it was 20 or more)
9. C (the p-value is less than the typical 0.05 threshold for rejecting the null hypothesis; the t-value's absolute value is greater than 2.0)

Section II
Regression Analysis

Relationships Between Variables


In economics, investigators are frequently interested in how one variable interacts with another; example: sales and income.
Often, one of the variables causes changes in the other, such as higher incomes causing more sales. The causal variable is referred to as the X, Independent, or Explanatory variable; the responding variable is referred to as the Y or Dependent variable.
Sometimes the relationship is not causal but merely one of association because of links to a third variable.
Example: SAT and ACT scores, which are both caused by academic ability and achievement.

The most frequently used statistical technique for examining relationships between variables is Regression Analysis, or some technique very similar to it. Regression analysis can be used for all kinds of data and relationships, including:
Linear relationships and curved relationships
Quantitative data and qualitative data
Cross-sectional and time series data
The following slides present the simplest form of regression analysis:
A quantitative dependent variable (Fare) and one quantitative independent variable (Distance)
The relationship is treated as linear

Scatterplot for Fare & Distance

Fig. 2 -- Scatterplot of Fare (Y) and Distance (X)



The scatterplot presented in Figure 2 depicts the 21 Fare (Y-axis) and Distance (X-axis) combinations in the data set.
The graph shows that as distance increases, fare also tends to increase, but the relationship is not perfect; otherwise, the points would lie on a straight line.


Regression from a Visual Standpoint

Figure 3. Scatterplot and Regression Plot for Fare-Distance

Figure 3 adds another element to the plot: a straight line of points (a line connecting the pink points).
These points represent the regression line that Excel chose as the straight line that best fits the scatterplot points. The software chooses the line to minimize the sum of the squared distances between the blue points and the pink line; this method is called the Least Squares or Ordinary Least Squares (OLS) method and is widely used. A sketch of the calculation appears below.
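Here is a minimal sketch of what "least squares" chooses, using the closed-form OLS formulas; the x (distance) and y (fare) arrays are hypothetical values, not the tutorial's data.

```python
import numpy as np

# Hypothetical distance (x) and fare (y) pairs
x = np.array([300.0, 600.0, 900.0, 1200.0, 1800.0, 2200.0])
y = np.array([180.0, 250.0, 230.0, 270.0, 310.0, 340.0])

# Closed-form OLS: slope = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()   # the line passes through the point of means

# No other (b0, b1) pair gives a smaller sum of squared residuals
sse = np.sum((y - (b0 + b1 * x)) ** 2)
print(b0, b1, sse)
```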


Fare-Distance Regression as Tabular Output

Regression in Table Form


Table R1 presents the same regression results as Figure 3.
The Regression Statistics and ANOVA parts of the table evaluate the overall performance of the regression in predicting Fare to different cities. The bottom part, with Coefficients for Intercept and Distance, presents the regression line as numbers that can be put into an equation, along with estimates of sampling error. The following slides break down the different parts of the table.

            Coefficients
Intercept       157.614
Distance          0.084

Regression output always implies an equation, written generally as:

y = b0 + b1*X
b0 = y-intercept
b1 = slope (change in Y over change in X)
b0 and b1 are referred to as regression coefficients, or the intercept coefficient and slope coefficient

The pink line in Figure 3 can be written down as an equation.

Recall the slope-intercept form of a line (y = mx + b) from basic algebra. If you draw a line through the pink points in Figure 3 and extend it to where Distance (X) = 0, the intercept should be obvious.

The equation for this line is:

Fare = 157 + 0.084*Distance + Error
(157 is the Intercept; 0.084 is the Slope)

A sketch of evaluating this equation appears below.
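A minimal sketch of evaluating this estimated line in code; the coefficient values come from the Table R1 output, and the small difference from Excel's prediction on a later slide reflects rounding of the slope.

```python
def predicted_fare(distance_miles, intercept=157.614, slope=0.084):
    """Predicted fare in dollars from the Table R1 regression line."""
    return intercept + slope * distance_miles

# For a 600-mile route (Dallas, used on a later slide):
print(predicted_fare(600))  # about 208; Excel's 208.310 uses the unrounded slope
```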

Slope & Intercept Meaning


The slope indicates that for every 1-mile increase in Distance, Fare increases by 0.084 dollars (about 8 cents).
The slope produced in regression analysis always shows the amount of increase in Y (or decrease, if negative) for a 1-unit increase in X. To correctly interpret the slope, it is critical to know the units in which X and Y are measured; here, the units are miles and dollars. A 100-mile increase implies an $8.40 (100 x 0.084) increase in Fare.

The y-intercept indicates that if Distance were 0, the Fare would be $157.
The intercept in this case is not an economically meaningful number, because there are no flights of 0 miles. The intercept merely extends the line to the Y-axis for statistical purposes. Be aware of the relevant range (min, max) of the X-variable.

Regression Line Errors (Residuals)


Using the regression equation, predicted Y-values for given X-values can be calculated:

Predicted Y = intercept + slope*(X-value)

Example: Observation 1 is Dallas, with a distance of 600 miles:

Predicted Fare = 157.6 + 0.084*(600) = 208 (Excel's prediction is 208.310; we rounded)

The regression Error (residual) = Actual Y-value - Predicted Y-value

For Dallas (observation 1), the actual fare was $250, so we calculate:

Residual = 250 - 208.310 = 41.690

Each observation has a predicted fare and error associated with it.

Multiple R             0.697
R Square               0.486
Adjusted R Square      0.459
Standard Error        43.294
Observations          21.000

R Square reports the percent of the variation in the Y-variable explained by the X-variable.

In other words, it expresses (as a percent) how close the regression line points come to the actual scatterplot points. The maximum R-square is 1.0 (100%) and the minimum is 0. In this case, Distance, by itself, can account for 48.6% of the Fare differences between cities.

In a 2-variable regression like this one, the Multiple R is the same thing as the Correlation Coefficient between X and Y.
The correlation coefficient can take on positive or negative values, depending on the direction of the relationship between the two variables; its extremes are 1.0 and -1.0 (perfectly correlated), and 0 indicates no linear relationship. The R-square is the squared correlation coefficient in such cases. The sketch below verifies this equivalence.
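A minimal sketch verifying the correlation/R-square relationship for a 2-variable regression; the x and y arrays are hypothetical, not the air travel data.

```python
import numpy as np
from scipy import stats

# Hypothetical distance (x) and fare (y) pairs
x = np.array([300.0, 600.0, 900.0, 1200.0, 1800.0, 2200.0])
y = np.array([180.0, 250.0, 230.0, 270.0, 310.0, 340.0])

r = np.corrcoef(x, y)[0, 1]    # correlation coefficient (the Multiple R here)
fit = stats.linregress(x, y)   # simple 2-variable regression

print(r, fit.rvalue)           # the same number
print(r ** 2)                  # equals the regression R-square
```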

Regression Coefficient Accuracy

Just like the sample mean, the regression coefficients are sample statistics that are used to estimate what the true relationship would be if all possible data were used. Regression coefficients, therefore, also have standard errors that estimate their sampling error.
The slope coefficient for Distance (0.08) has a standard error of 0.02. This implies that the population parameter (the regression coefficient using all possible data) may easily be 2 cents higher or lower than the 0.08 coefficient estimated from this sample. For a wider (approximately 95%) margin for error, this standard error can be multiplied by about 2.0.

More on Regression Coefficient Accuracy

The t-Stat and p-value are also ways of assessing the reliability of a coefficient. They test whether the coefficient is significantly different from zero.
As a rule of thumb, if the t-statistic is > 2.0 (or < -2.0), the coefficient is viewed as significantly different from zero. The t-Stat on Distance is 4.239, so it is statistically significant.
The p-value estimates the likelihood of finding the coefficient of 0.084 by mere chance if the true value were zero. The p-value of 0.000 indicates that this would be very unlikely, also showing a statistically significant result.
In scientific research, p-values below 5 percent (0.05) are taken as statistically significant; in other settings, the cutoff level for the p-value may vary.

Expanded Regression Analysis


In most situations in economics, investigators look at the effects of multiple variables on a dependent variable when using regression analysis.
Example: price and income effects on sales

Such regressions are sometimes called multiple regression analysis and involve only slight modifications of the earlier points.

Also, economists widely use qualitative variables as independent variables. When these take on only two values (male, female), they are usually coded as (1, 0) and called binary or dummy variables.
In the Air Travel data, we have such a variable, Direct SWA, which indicates whether Southwest Airlines flies the route directly (1) or not (0). This variable is added to the regression analysis, resulting in the following Excel output:

Fare Regression with Distance and Direct SWA

             Coefficients   Standard Error   t Stat    P-value
Intercept        193.032           14.411    13.395     0.000
Distance           0.081            0.012     6.698     0.000
Direct SWA       -66.779           11.446    -5.834     0.000

The regression equation is now:

Fare = 193 + 0.08*Distance - 66*Direct SWA + Residual
The slope coefficient for Distance is still about 0.08. The y-intercept coefficient was 157; it is now 193.

The Direct SWA variable has these effects:

When Direct SWA = 0 (SWA does not fly the route), the regression equation is Fare = 193 + 0.081*Distance, because -66*(0) = 0.
When Direct SWA = 1 (SWA flies the route), the regression equation is Fare = 193 + 0.081*Distance - 66*(1) = 127 + 0.081*Distance.

Note that the SWA dummy variable only influences the y-intercept; it does not influence the slope for Distance (see the next slide). A sketch of fitting such a dummy-variable regression appears below.
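A minimal sketch of fitting a regression with a 0/1 dummy, using hypothetical data and NumPy's least-squares solver rather than Excel; the point is that the dummy's coefficient shifts the intercept only.

```python
import numpy as np

# Hypothetical data: distance, a 0/1 "Direct SWA" dummy, and fare
distance = np.array([300.0, 600.0, 900.0, 1200.0, 1800.0, 2200.0])
swa = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0])
fare = np.array([140.0, 250.0, 190.0, 290.0, 240.0, 340.0])

# Design matrix: intercept column, distance, dummy
X = np.column_stack([np.ones_like(distance), distance, swa])
coefs, _, _, _ = np.linalg.lstsq(X, fare, rcond=None)
b0, b_dist, b_swa = coefs

# Two parallel lines: intercept b0 when SWA = 0, b0 + b_swa when SWA = 1;
# the distance slope b_dist is the same for both
print(b0, b_dist, b_swa, b0 + b_swa)
```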

Distance Line Fit Plot



The line connecting the upper pink dots shows the regression line when Direct SWA = 0.
The line connecting the lower pink dots shows the regression line when Direct SWA = 1. The Fare-Distance slope for both lines is 0.08.


Table R2. Regression with Multiple X-Variables

Regression Statistics
Multiple R             0.907
R Square               0.822
Adjusted R Square      0.802
Standard Error        26.161
Observations          21.000

Another important difference that results from adding the Direct SWA variable is the increase in the R-Square value: it is now 82.2% (it was about 48.6% when using only Distance).
The combination of Distance and Direct SWA accounts for 82.2% of the differences in Fares across cities. Adding Direct SWA increased this value by roughly 34 percentage points.

ANOVA
              df        SS           MS           F        Significance F
Regression     2.000    56977.083    28488.541    41.627    0.000
Residual      18.000    12318.727      684.374
Total         20.000    69295.810

From the regression predictions and errors, Excel (and other software) computes an Analysis of Variance, or ANOVA.
The F-Statistic is the most important number here; it is the ratio of the mean regression sum of squares to the mean residual sum of squares. Unlike the R-Square value, the F-statistic adjusts for the number of variables used.
The Significance F is simply a p-value for the null hypothesis that the X-variables, taken as a group, have no effect on the Y-variable; with these data, this null hypothesis is rejected because the p-value is very low. In effect, the F-statistic tests whether the X-variables, as a group, matter in explaining the Y-variable.
SS refers to Sum of Squares. The Residual SS squares the individual errors and adds them up. The Regression SS squares the differences between the predicted values for Fare and the mean of Fare and adds them up. The Total sum of squares adds the Regression and Residual SS together.
MS refers to the mean sum of squares, which divides each SS by its degrees of freedom.
The R-Square is simply the regression sum of squares divided by the total. The Adjusted R-Square, like the F-statistic, adjusts for the number of variables used. The sketch below reproduces these calculations.
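A minimal sketch reproducing the ANOVA arithmetic by hand for the hypothetical dummy-variable fit sketched earlier (the data are repeated so the block runs on its own); the formulas, not the particular numbers, are the point.

```python
import numpy as np

# Same hypothetical data as the dummy-variable sketch above
distance = np.array([300.0, 600.0, 900.0, 1200.0, 1800.0, 2200.0])
swa = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0])
fare = np.array([140.0, 250.0, 190.0, 290.0, 240.0, 340.0])

X = np.column_stack([np.ones_like(distance), distance, swa])
coefs, _, _, _ = np.linalg.lstsq(X, fare, rcond=None)
fitted = X @ coefs

n, k = len(fare), 2                            # observations, X-variables
ss_res = np.sum((fare - fitted) ** 2)          # Residual SS: squared errors, summed
ss_reg = np.sum((fitted - fare.mean()) ** 2)   # Regression SS: predictions vs. the mean
ss_tot = ss_reg + ss_res                       # Total SS

ms_reg = ss_reg / k                            # mean squares: SS divided by its df
ms_res = ss_res / (n - k - 1)
f_stat = ms_reg / ms_res                       # the F-statistic

r2 = ss_reg / ss_tot                           # R-Square = Regression SS / Total SS
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)  # adjusts for the number of variables
print(f_stat, r2, adj_r2)
```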

Regression Pointers
Regressions that are well done have residuals with no obvious patterns that are roughly bell-shaped; checking the residuals for these and other characteristics is called Residual Analysis.
Regressions that leave out key explanatory (X) variables can yield misleading slopes; this is called Omitted Variables Bias. Regressions leaving out key variables should be viewed as exploratory or preliminary in nature.
There is no magical R-squared value to be obtained; if a model is put together well, then a low R-squared is fine; if a model has key flaws, then a high R-squared value does not make it good.
Only humans can determine if a regression is causal (Income-Sales) or merely associative (SAT-ACT); the software treats both cases the same.

Self Test Section II


The self-test again uses the data set on 5K running times, shown on the next slide:
Time = 5k time in minutes (decimals are fractions of minutes)
Age = age in years
Intervals = 1 if hard interval workouts were used and 0 if not
Miles Per Week = number of miles per week in training at the peak of training

For These Questions, Refer to this Output

1. The regression equation depicted by the table is
a. 5k Time = 0.731 + Age + Intervals + Residual
b. 5k Time = 17.554 + Age*Intervals + Residual
c. 5k Time = 17.554 + 0.071*(-0.863)*Age*Intervals + Residual
d. 5k Time = 17.554 + 0.071*Age - 0.863*Intervals + Residual

2. The percent of 5k time differences accounted for by Age and Intervals in the regression model is
a. 0.731
b. 17.554
c. 12.660
d. 0.535

3. The slope coefficient for Age is
a. 0.071
b. 0.731
c. 17.554
d. 0.016

4. The likely sampling error in the slope coefficient for Age is
a. 0.071
b. 0.731
c. 17.554
d. 0.016

5. The slope coefficient for Age implies that
a. For each 1 minute increase in Time, Age increases by 0.071 years
b. For each 1 year increase in Age, Time increases by 1 minute
c. For each 1 year increase in Age, Time increases by 0.071 minutes
d. For each 1 year increase in Time, Age increases by about 53%

6. The regression results imply that if Age were 0, then Time would be
a. 0.731
b. 12.660
c. 24.000
d. 17.554

7. The value in the preceding question
a. Means that a newborn baby would be predicted to run this time in a 5k
b. Means that the value is really only a hypothetical extension of the regression line, because none of the actual data go back to zero years of Age
c. Means that the regression is not reliable at any values
d. Means that babies should compete in the Olympics

8. The coefficient for Intervals implies that
a. When Intervals equals 1, the Age slope is reduced by 0.863
b. When Intervals equals 0, the y-intercept value is reduced by 0.863 minutes
c. When Intervals equals 1, the Age slope is the same but the entire regression line shifts down by 0.863 minutes
d. When Intervals equals 0, the Age slope is the same but the entire regression line shifts down by 0.863 minutes

9. If you wanted to compute the effect of 10 more years of Age on the predicted 5k Time, you should multiply
a. 0.10 x 0.071
b. 10 x 0.071
c. 10 x 1.0
d. 100 x 0.071

10. The predicted value for 5k Time when a person is 47 and using intervals in training would be found by which of the following equations?
a. Predicted 5k Time = 17.554 + 0.071*(47)
b. Predicted 5k Time = 0.731 + 0.072*(47) - 0.863*(1)
c. Predicted 5k Time = 17.554 + 0.071*(47) - 0.863*(1)
d. Predicted 5k Time = 0.071*(47) - 0.863*(1)

11. Using the data sheet provided earlier, compute the residual for the first observation. (Note: you will first have to compute the predicted time)
a. -0.545
b. 0.631
c. -0.034
d. 1.232

12. The data provided on the accuracy of the coefficients indicate that
a. All are not significantly different from zero
b. Age is significantly different from zero but not Intervals
c. Intervals is significantly different from zero but not Age
d. All are significantly different from zero

Correct Answers Section II Self Test


1. D
2. D
3. A
4. D
5. C
6. D
7. B
8. C
9. B (the slope for a 1-unit (year) change in Age is 0.071; a 10-year change is simply 10 x slope)
10. D
11. A (Predicted Time = 17.554 + 0.071*(21) - 0.863*(0) = 19.045; Residual = Actual - Predicted = 18.50 - 19.045 = -0.545)
12. D (all of the p-values for the coefficients are below the 0.05 threshold for significance; all of the t-statistics are above 2.0 in absolute value, the rule-of-thumb value for significance)

Section III
Statistical Software

Overview
Personal computers and software make it possible for almost anyone to complete the complicated or lengthy computations needed for statistics; knowing what to do with them is the hard part. Excel contains many useful statistical and graphing capabilities; these are introduced in the next few slides.

Software dedicated to statistical operations vastly expands the breadth of procedures possible, as well as making some much easier than in Excel. Some commonly used statistical software includes:
SAS (www.sas.com); the company offers many varieties; JMP is a point-and-click product; SAS is available in some places at WKU
SPSS (www.spss.com); this software is available in most computer labs on campus; it is not as widely used by economists as SAS but contains most of the same features, especially for basic purposes
Stata (www.stata.com), which is widely used by economists and contains broad and very powerful tools
Eviews (www.eviews.com), which is also very powerful and especially useful for time series and forecasting applications; both Stata and Eviews provide point-and-click functionality

Excel Stat Introduction 1


Making Application
While there is no self-test with this section, you are strongly encouraged to practice in Excel; even if you use other software in later classes, the practice in Excel will be helpful.

One of the main differences between Excel and the spreadsheets in statistical software is that Excel is address driven (each cell has an address), whereas stat software is variable driven: once a column of data exists for a variable, the entire column can be manipulated simply by referring to its name.

Excel Stat Introduction 2


Click the Tools menu in Excel; if Data Analysis appears as an option, you may skip to the next slide; if not, then:
Select the Add-Ins option under the Tools menu
Check the box for Analysis ToolPak
The Data Analysis option should now appear under the Tools menu

(Note: If you opened Excel from your desktop, the procedures above should work; if you happened to open Excel by opening an Excel-based spreadsheet while browsing on the internet, it may not work.)

Excel Stat Introduction 3


Take one of the data sheets, Air Travel or 5k Times, used in this tutorial and enter the data into Excel. The instructions here proceed using the Air Travel data. To compute descriptive statistics for a variable:
Select the Tools menu
Select the Data Analysis option
Select the Descriptive Statistics option
Click on the icon next to the blank for Input Range
Highlight the column for Fare, including the label
Check the Labels in the First Row box
Check the Summary Statistics box
Check the Confidence Interval for the Mean box
Click the OK button

Excel Stat Introduction 4


You should now have an output table on a new sheet.
One disadvantage of Excel is that statistical output tables like this one tend to be collapsed or condensed and need to be formatted.

Formatting the output table (this is something you should always do in Excel):
Highlight the columns with the table
Select the Format menu
Select the Column and AutoFit Selection options
Again, select the Format menu
Select the Cells option
In the Number menu, choose the Number option
Pick a number for the Decimal Places box (the number of decimal places depends somewhat on the data; 3 will be fine here)
Make sure to do this step in Excel; tables with a lot of insignificant decimal places are very messy to read.

Excel Stat Introduction 5


Return to the original data sheet. Create a regression analysis:
Select the Tools menu and the Data Analysis option
Select the Regression option in the window
Select the icon next to the Input Y Range blank and highlight the data containing Fare, including the label
Select the icon next to the Input X Range and highlight the data containing Distance and Direct SWA, including the labels (Note: if you try to highlight the whole columns, you may get an error)
Check the Labels box, the Residuals box, and the Line Fit box
Select the OK button and reformat the output tables as before
You will also need to resize the Line Fit plot (another small hassle in Excel); just expand it using the mouse

Excel Stat Introduction 6


Return to the original data sheet. Charts in Excel:
Excel can also be used to create scatterplots, histograms, and other types of plots. This is an area where statistical software is much easier to use. If you want to tinker some, click on the Chart Wizard icon that should appear below the top-level menus.
The icon has the appearance of a bar chart.

Also, under the Data menu, there is a Pivot Table and Pivot Chart option that provides further capabilities.

If you would like a hands-on introduction to other statistical software, please contact Brian Goff at brian.goff@wku.edu. Also, several other economics professors can provide assistance in becoming acquainted with software.

Probability Distributions
A final topic briefly introduced here is that of probability distributions (PDs). A PD is a formula (often presented as a graphic or table) that links values of a variable with the probabilities of those values.

PDs are used in many ways; for statistics, one of the key uses is to assess hypotheses, including through t-statistics and p-values.
Statistical software makes extensive knowledge of PDs unnecessary, because the relevant information about the PD is stored by the computer and used as needed; however, a few basic points are worthwhile even for basic statistics users.

Probability Distributions 2
PDs have a center, dispersion, and symmetry or skew (asymmetry):
Measures of the location of the center include the mean and median
Measures of dispersion include the standard deviation and range
PDs also have tails (the ends), measured by the amount of kurtosis

Normal (Probability) Distribution


Most widely known, due to its bell shape. Many real-life situations are approximately (though not perfectly) distributed Normal. It is the mother of PDs in that many other distributions are related to it or converge to it with large samples or other conditions.

t-Distribution
Also bell-shaped; it is wider in its tails than the normal but converges to it with large samples.

Binomial Distribution = deals with 2-outcome situations. F-Distribution and Chi-Square Distribution = commonly used distributions when the topic is variability.

Excel permits PDs to be used directly if desired


Click on the function icon (the script f) just below the top menus. Select Statistical in the window and scroll to the desired distribution, such as NORMDIST for the normal.

We can now produce probabilities for a variable assumed to be normal or nearly normal.
Example: Let's assume that male height is approximately Normal with a mean of 70 inches and a standard deviation of 2 inches. What is the probability of finding someone taller than 74 inches? In the NORMDIST window, plug in 74 for X, 70 for Mean, and 2 for Standard Deviation. In the Cumulative box, put True. Excel will produce a number that is the probability of being 74 or less (that is, the cumulative probability): 0.977. The probability of being taller than 74 is 1 - 0.977 = 0.023, or 2.3%.

The same or similar procedures can be used for 2-outcome (binomial) problems and many others, opening up a wide array of uses. The sketch below reproduces the height example outside of Excel.
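For comparison, a minimal Python sketch of the same height example using scipy's normal distribution instead of Excel's NORMDIST:

```python
from scipy import stats

# P(height <= 74) for a Normal with mean 70 and standard deviation 2,
# the same quantity as NORMDIST(74, 70, 2, TRUE) in Excel
p_at_most_74 = stats.norm.cdf(74, loc=70, scale=2)   # about 0.977

# Probability of being taller than 74 inches
print(1 - p_at_most_74)                              # about 0.023, or 2.3%
```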

Clockwise from Left Corner: Normal, t-, F-, and Chi-Square Distributions

A gallery of PDs and more background is offered at the Engineering Statistics Handbook:
http://www.itl.nist.gov/div898/handbook/eda/section3/eda366.htm
