
Conjoint Analysis

Conjoint Analysis is used by marketers to determine which attributes of a product are most important to a consumer, and to what degree each one is important.

Step 1 - Make a list of the product attributes to be evaluated by the consumer.

Brand   Color   Price
A       Red     $50
B       Blue    $100
C               $150

Step 2 - Make a complete list of all possible attribute combinations.

Card   Brand   Color   Price
1      A       Red     50
2      A       Red     100
3      A       Red     150
4      A       Blue    50
5      A       Blue    100
6      A       Blue    150
7      B       Red     50
8      B       Red     100
9      B       Red     150
10     B       Blue    50
11     B       Blue    100
12     B       Blue    150
13     C       Red     50
14     C       Red     100
15     C       Red     150
16     C       Blue    50
17     C       Blue    100
18     C       Blue    150

Step 3 - Have the consumer rate each combination on a scale of 0 (worst) to 10 (best). Brand and Color are coded as numbers.

Card   Brand   Color   Price   Preference
1      1       1       50      5
2      1       1       100     5
3      1       1       150     0
4      1       2       50      8
5      1       2       100     5
6      1       2       150     2
7      2       1       50      7
8      2       1       100     5
9      2       1       150     3
10     2       2       50      9
11     2       2       100     6
12     2       2       150     5
13     3       1       50      10
14     3       1       100     7
15     3       1       150     5
16     3       2       50      9
17     3       2       100     7
18     3       2       150     8

Step 4 - Final data preparation step prior to running regression - Give each attribute level its own 0/1 column, then remove 1 variable from each set of variables with more than 1 choice. Removing these variables means the remaining variables can no longer be predicted from each other.

All attribute levels, before removal:

Card   A   B   C   Red   Blue   $50   $100   $150
1      1   0   0   1     0      1     0      0
2      1   0   0   1     0      0     1      0
3      1   0   0   1     0      0     0      1
4      1   0   0   0     1      1     0      0
5      1   0   0   0     1      0     1      0
6      1   0   0   0     1      0     0      1
7      0   1   0   1     0      1     0      0
8      0   1   0   1     0      0     1      0
9      0   1   0   1     0      0     0      1
10     0   1   0   0     1      1     0      0
11     0   1   0   0     1      0     1      0
12     0   1   0   0     1      0     0      1
13     0   0   1   1     0      1     0      0
14     0   0   1   1     0      0     1      0
15     0   0   1   1     0      0     0      1
16     0   0   1   0     1      1     0      0
17     0   0   1   0     1      0     1      0
18     0   0   1   0     1      0     0      1

After removing Brand A, Red, and $50:

Card   B   C   Blue   $100   $150
1      0   0   0      0      0
2      0   0   0      1      0
3      0   0   0      0      1
4      0   0   1      0      0
5      0   0   1      1      0
6      0   0   1      0      1
7      1   0   0      0      0
8      1   0   0      1      0
9      1   0   0      0      1
10     1   0   1      0      0
11     1   0   1      1      0
12     1   0   1      0      1
13     0   1   0      0      0
14     0   1   0      1      0
15     0   1   0      0      1
16     0   1   1      0      0
17     0   1   1      1      0
18     0   1   1      0      1


Conjoint is an analysis that provides a marketer with a method to predict how much more or less a consumer will value one combination of product attributes over another combination of product attributes. The degree to which a consumer likes a product attribute is called the "utility" of that attribute. For example, a product might come in three brands, two colors, and at three levels of price. Each color, brand, and price level will have its own utility calculated during the conjoint analysis. Conjoint is done using Multiple Regression. Each product attribute variation will be assigned as one of the independent variable inputs to the Multiple Regression equation. For example, the color red will be represented by one independent variable while the color blue will be represented by another independent variable. The resulting regression equation assigns a coefficient to each independent variable. These coefficients are the utilities of each of the attributes. The more positive an individual coefficient is, the more highly valued is the associated product attribute. The coefficients can be interpreted as the utilities of the variables.

In this conjoint exercise, we are going to determine the utilities of eight product attributes. They are as follows: Brand A, Brand B, Brand C, Red, Blue, $50, $100, and $150.

There are 18 possible combinations of these attributes (3 brands x two colors x three prices). The consumer rates each combination on a scale of 0 to 10 (10 being the best). The consumer test results are modified for the regression equation and then run through the regression. The resulting regression analysis calculates a coefficient for each independent variable as part of the regression output equation. Each coefficient is the measure of value that the consumer places on the product attribute associated with that utility.

The Step 2 chart lists the choices that the consumer had to analyze. The consumer was provided with 18 separate cards. Each card contained one of the 18 possible variations of product attributes. The consumer had to rate their overall preference for each combination of attributes on a scale of 0 to 10.

The Step 3 chart shows the consumer's stated preference for each combination of attributes. Non-numerical attributes were assigned numbers. Brand A and Red are shown as 1's in their respective columns. Brand B and Blue are shown as 2's in their respective columns. Brand C was assigned a 3 in its respective column.


The chart is now further prepared for Regression Analysis. Each individual product attribute is given its own column. Each product attribute now has either the value of 1 or 0.


One problem must be corrected before this data can be submitted for Regression Analysis. Independent variables, or combinations of independent variables, should not be able to predict each other. Using independent variables that are highly correlated with each other (either positively or negatively) produces a regression error known as collinearity. For example, if the color is either red or blue, knowing the state of one color tells us the state of the other (if Blue = 1, Red must = 0). This error condition also occurs when there are 3 variables: if you know the states of 2, you know the state of the remaining one. These error conditions are solved by removing one column of data from each type of variation. Here, the columns for Brand A, Red, and Price level $50 were removed. We will see below that this has no effect on the accuracy of the Regression output.
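The coding described above can be sketched in a few lines of Python. This is only an illustration, not part of the original worksheet: the names `cards`, `dummy_row`, and `design` are made up for the example, and the card order is assumed to follow the Step 2 chart.

```python
from itertools import product

brands = ["A", "B", "C"]
colors = ["Red", "Blue"]
prices = [50, 100, 150]

# Full factorial: 3 brands x 2 colors x 3 prices = 18 cards,
# generated in the same order as the Step 2 chart.
cards = list(product(brands, colors, prices))


def dummy_row(brand, color, price):
    """Return [B, C, Blue, $100, $150] as 0/1 values.

    Brand A, Red and $50 are the removed (reference) levels,
    so a card with all three is coded as five zeros.
    """
    return [
        int(brand == "B"),
        int(brand == "C"),
        int(color == "Blue"),
        int(price == 100),
        int(price == 150),
    ]


design = [dummy_row(*card) for card in cards]

for number, (card, row) in enumerate(zip(cards, design), start=1):
    print(number, card, row)
```

Each printed row should match the corresponding line of the reduced 0/1 chart from Step 4.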

Preference (dependent variable for the regression, cards 1 to 18): 5 5 0 8 5 2 7 5 3 9 6 5 10 7 5 9 7 8

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.933190299
R Square             0.870844134
Adjusted R Square    0.812136922
Standard Error       1.141319161
Observations         17

ANOVA
             df    SS             MS
Regression   5     96.61247277    19.3224946
Residual     11    14.3287037     1.30260943
Total        16    110.9411765

             Coefficients    Standard Error   t Stat
Intercept    5.916666667     0.807034518      7.33136753
Brand B      1.513888889     0.698912395      2.16606387
Brand C      3.347222222     0.698912395      4.7891871
Blue         1.231481481     0.559992106      2.19910507
$100         -2.319444444    0.698912395      -3.31864832
$150         -4.319444444    0.698912395      -6.18023729

Regression Equation
Combination Preference = 5.916666667 + (1.513888889)*(Brand B) + (3.347222222)*(Brand C) + (1.231481481)*(Blue) + (-2.319444444)*($100) + (-4.319444444)*($150)


Removing information about Brand A, Red, and Price level $50 did not hurt the output accuracy. These product attributes could still be considered to be part of the Regression equation, but with coefficients of 0. The coefficients attached to each of the product attributes simply show the consumer's utility for that attribute. The utilities for each attribute are relative to each other.

For example, Price level $50 has the highest preference with a utility of 0, while Price level $150 has the lowest utility of -4.319444444. Blue has a utility of 1.231481481, which is that much higher than the utility of Red, which is 0. Brand C is the most liked brand with a utility of 3.347222222, while Brand A is liked the least with a utility of 0. The resulting Regression Equation still does a good job of predicting overall preference. For example, the consumer rated the combination of attributes on card 13 (Brand C, Red, $50) with a 10. The predicted Combination Preference for the card 13 attribute combination is (5.9167) + (3.3472)(1) = 9.264, which is very close to the consumer's rating of 10.
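As a cross-check on the utilities, the regression can be refitted outside Excel. The sketch below solves the least-squares normal equations with plain Python; `least_squares` and `features` are names invented for this example. One assumption is labeled loudly: the summary output reports 17 observations, so the sketch leaves card 1 out of the fit, which is what reproduces the reported coefficients. Fitting all 18 cards gives slightly different values.

```python
from itertools import product

preference = [5, 5, 0, 8, 5, 2, 7, 5, 3, 9, 6, 5, 10, 7, 5, 9, 7, 8]
cards = list(product("ABC", ["Red", "Blue"], [50, 100, 150]))


def features(brand, color, price):
    # Columns: Intercept, Brand B, Brand C, Blue, $100, $150
    return [1.0, float(brand == "B"), float(brand == "C"),
            float(color == "Blue"), float(price == 100), float(price == 150)]


def least_squares(rows, targets):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, targets))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        a[col] = [v / a[col][col] for v in a[col]]
        for r in range(k):
            if r != col:
                factor = a[r][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][k] for i in range(k)]


X = [features(*card) for card in cards]

# Assumption: card 1 is excluded, matching "Observations 17" in the output.
coef = least_squares(X[1:], preference[1:])

# Predicted preference for card 13 (Brand C, Red, $50).
card13 = sum(b * x for b, x in zip(coef, X[12]))

print([round(b, 4) for b in coef])
print(round(card13, 3))
```

The coefficients come out in the order Intercept, Brand B, Brand C, Blue, $100, $150, so they can be compared directly with the Coefficients column of the summary output.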

The regression appears to be a good one because Adjusted R Square is high (close to 1). R Square is the share of the variance in preference that is explained by the regression, and Adjusted R Square corrects it for the number of independent variables. Here, Adjusted R Square is 0.812. The p-Values for Brand C, $100, and $150 are well below 0.05, so these are significant predictors; the p-Values for Brand B (0.053) and Blue (0.050) sit right at the 0.05 threshold. The absolute value of the coefficients indicates the effect that each has on the consumer's overall liking of the product. For example, Brand C (coefficient = 3.347) produced the highest positive influence while the $150 price (coefficient = -4.319) reduces consumer liking the most.

The low Significance F (0.000143) of the regression's F statistic indicates that the regression, overall, is valid.


Conjoint regression output (continued)

F                14.83368241
Significance F   0.000143011

             P-value        Lower 95%       Upper 95%
Intercept    1.48277E-05    4.140395669     7.692937664
Brand B      0.053140224    -0.024406919    3.052184697
Brand C      0.000563039    1.808926414     4.88551803
Blue         0.050164457    -0.001052832    2.464015795
$100         0.006847687    -3.857740252    -0.78114864
$150         6.90583E-05    -5.857740252    -2.78114864


Regression

Regression is a statistical technique that is used to create predictive models. The models receive input (independent) variables and predict the outcome of the dependent variable.

When performing Multiple Regression, Correlation Analysis should be performed on all independent and dependent variables first, as below.

Monthly Rates of Return


Date         S&P       Viacom     AT&T      GM        Coke
1/30/1998    0.8799    0.7541     2.1407    -4.6296   -18.8406
2/27/1998    7.5187    14.9701    -2.5948   18.986    6.6964
3/31/1998    5.558     11.9792    7.7869    -1.7226   -3.3473
4/30/1998    1.3716    7.907      -8.5551   -0.5535   5.8442
5/29/1998    -1.6289   -5.1724    1.2474    6.679     1.9427
6/27/1998    2.4171    3.4091     0.8214    1.8261    2.1063


Correlation Analysis
Tools / Data Analysis / Correlation
          S&P            Viacom          AT&T           GM            Coke
S&P       1
Viacom    0.938661647    1
AT&T      0.128558379    -0.098932814    1
GM        0.470349107    0.350437967     -0.26371086    1
Coke      0.255052662    0.342337358     -0.501490208   0.627513676   1

Coke has a low correlation with the S&P and is therefore not a good predictor of the S&P. Also, if two of the independent variables above are highly correlated with each other, only one of them should be used in the Multiple Regression below. This is not the case here because none of the variables above have a high correlation with another variable. Using highly correlated variables as inputs to a Multiple Regression causes an error called Multicollinearity and should be avoided. Multiple Regressions should be built up by adding one new independent variable at a time and evaluating results. Good new independent variables noticeably raise R-Square and lower Standard Error without causing much change to Coefficients. Poor new independent variables don't change R-Square much but have unpredictable effects on Coefficients. Build regressions up one variable at a time and evaluate after adding each new variable.
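For readers working outside Excel, the correlation check can be sketched as follows. The `pearson` helper and the `returns` dictionary are names made up for this example; the figures are the monthly rates of return from the table above.

```python
from math import sqrt

returns = {
    "S&P":    [0.8799, 7.5187, 5.558, 1.3716, -1.6289, 2.4171],
    "Viacom": [0.7541, 14.9701, 11.9792, 7.907, -5.1724, 3.4091],
    "AT&T":   [2.1407, -2.5948, 7.7869, -8.5551, 1.2474, 0.8214],
    "GM":     [-4.6296, 18.986, -1.7226, -0.5535, 6.679, 1.8261],
    "Coke":   [-18.8406, 6.6964, -3.3473, 5.8442, 1.9427, 2.1063],
}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Correlation of every candidate predictor with the S&P.
for name in ["Viacom", "AT&T", "GM", "Coke"]:
    print(name, round(pearson(returns["S&P"], returns[name]), 4))
```

The same helper can be applied to any pair of candidate predictors to look for multicollinearity before they are used together.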

Multiple Regression

Predicting S&P returns from returns of other investments

Tools / Data Analysis / Regression

Coke was not used because it has a low correlation with S&P and is therefore not a good predictor of the S&P. All others (Viacom, AT&T, GM) were used because they had a relatively high correlation with S&P and low correlations with each other. Regressions are Predictive, not Forecasting. All new independent variables must be chosen from within the range of the previously sampled independent variable.

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.987732311
R Square             0.975615119
Adjusted R Square    0.939037796
Standard Error       0.821001266
Observations         6

ANOVA
             df    SS             MS             F
Regression   3     53.9356009     17.97853363    26.67267746
Residual     2     1.348086157    0.674043079
Total        5     55.28368705

             Coefficients    Standard Error   t Stat         P-value
Intercept    0.1250621       0.44169756       0.283139667    0.803685895
Viacom       0.394220806     0.052563186      7.49994124     0.017317591
AT&T         0.170135064     0.070141633      2.425593151    0.136110167
GM           0.091267454     0.047497875      1.921506033    0.194617419

Regression Equation
S&P = (0.125062100111188) + (0.39422080554261)*(Viacom) + (0.170135064181028)*(AT&T) + (0.0912674536429872)*(GM)

Notes on the output:

Adjusted R Square - states that 94% of the variance of the S&P return is explained by the model - This is good.

The high coefficient of Viacom indicates that it is the biggest predictor of the S&P. Its high correlation indicates this as well.

The standard error of regression is used to determine confidence intervals. 95% confidence interval = Predicted S&P Value +/- z(95%) * (Standard Error)

MS (Mean Square) shows a high ratio of explained (regression) over unexplained (residual) variance. A low p value (Significance of F) shows that the regression model is statistically significant.

F Ratio = Explained variance (17.98) / Unexplained variance (0.67) = 26.67 - This is high and is good.

Interpreting the Regression:

Low Significance F - indicates that, overall, the regression output is statistically significant (valid), at least to the 0.05 level of significance.

p-Values for each variable - The lower the p-Value, the better predictor the variable was. Viacom returns are a good predictor of the S&P. AT&T and GM returns are much less effective predictors of the S&P return (higher p-Values) - These would not be valid predictors for a 0.05 level of significance. The small coefficients of these two company returns also indicate that they are less valid predictors.

Adding new independent variables to a regression equation always increases R Square. Adjusted R Square is increased only when newly added independent variables increase predictability of the dependent variable.
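The relationship between R Square and Adjusted R Square can be shown with the standard adjustment formula, 1 - (1 - R Square) x (n - 1) / (n - k - 1), where n is the number of observations and k the number of independent variables. The function name below is invented for this sketch; the inputs are the R Square values reported in the two summary outputs of this workbook.

```python
def adjusted_r_square(r_square, observations, predictors):
    """Penalize R Square for the number of independent variables used."""
    return 1 - (1 - r_square) * (observations - 1) / (observations - predictors - 1)


# S&P regression: 6 observations, 3 independent variables (Viacom, AT&T, GM).
sp_adjusted = adjusted_r_square(0.975615119, 6, 3)

# Conjoint regression: 17 observations, 5 independent variables.
conjoint_adjusted = adjusted_r_square(0.870844134, 17, 5)

print(round(sp_adjusted, 6), round(conjoint_adjusted, 6))
```

Because the penalty grows with each added variable, a new variable only raises Adjusted R Square when it lifts R Square by more than the penalty costs.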


S&P regression output (continued)

Significance F   0.036353424

             Lower 95%       Upper 95%
Intercept    -1.775409111    2.025533312
Viacom       0.16805967      0.620381941
AT&T         -0.131660024    0.471930152
GM           -0.113099408    0.295634316


Testing Two Population Means To Determine If Change Occurred


The Confidence Interval or the t-Test can be used to determine if a population mean has changed.

Testing to determine if a change has occurred, for example, after an ad campaign is run, using the Confidence Interval

Average Daily Sales

DEALER   BEFORE   AFTER   Difference
A        100      110     10
B        130      135     5
C        120      122     2
D        140      157     17
E        155      160     5
F        200      206     6
G        300      309     9
H        260      283     23
I        190      202     12
J        185      192     7
K        100      110     10
L        130      135     5
M        120      122     2
N        140      157     17
O        155      160     5
P        200      206     6
Q        300      309     9
R        260      283     23
S        190      202     12
T        185      192     7
U        100      110     10
V        130      135     5
W        120      122     2
X        140      157     17
Y        155      160     5
Z        200      206     6
A1       300      309     9
B1       260      283     23
C1       190      202     12
D1       185      192     7

Testing to determine if a change has occurred, using the t-Test


t-Test - Paired Means

Sampling the same thing before and after to determine if something has changed. Trying to determine if the "after" samples are statistically different from the "before" samples. At least 30 samples should always be taken, unless the population is known to be normally distributed. (Here only 6 samples are taken for brevity)

In this case, we want to determine with 95% certainty whether or not there has been a change from before to after. The Null hypothesis is that the mean difference is 0, and alpha = 0.05 (1 - 0.95).

t-Test: Paired Two Sample for Means

Before     After
0.7541     -4.6296
14.9701    18.986
11.9792    -1.7226
7.907      -0.5535
-5.1724    6.679
3.4091     1.8261

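P(T<=t) one-tail (0.290) is greater than alpha (0.05) so there has been no significant increase.

P(T<=t) two-tail (0.580) is greater than alpha (0.05) so there has been no significant change at all.

Here, because alpha is less than both P values, we cannot reject the Null Hypothesis in either case. The Null Hypothesis states that there is no change in the mean.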

Problem:

A tire manufacturer wants to determine if a new rubber formulation will improve tire wear. 12 sets of tires were created with the old rubber formula and 12 sets with the new rubber formulation. They were placed on the following cars and driven until they wore out. Determine at a 0.05 level of significance whether the new rubber produces longer tread life.

Car   Tire Location   Old Rubber   New Rubber
1     Front           37661        31902
1     Rear            42342        41203
2     Front           31108        38816
2     Rear            41239        43305
3     Front           32903        35375
3     Rear            42658        52353
4     Front           29829        30883
4     Rear            39616        49424
5     Front           34625        38724
5     Rear            42650        43234
6     Front           31923        34565
6     Rear            39990        43861

The NULL Hypothesis here is that the mean tread wear of the old rubber equals the mean tread wear of the new rubber. The p-Value for both the one-tailed test and the two-tailed test is less than the level of significance (0.05), so the NULL Hypothesis is rejected. Therefore, we have a 95% certainty that the new rubber compound increases tread life.
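The paired t statistic behind this conclusion can be reproduced by hand. The sketch below is an illustration in Python rather than the worksheet's own method (the worksheet uses Excel's paired t-Test tool). Since the standard library has no t distribution, the p-Value is not computed; instead the statistic is compared with the one-tail critical value for 11 degrees of freedom that the Excel output for this problem reports.

```python
from math import sqrt
from statistics import mean, stdev

old = [37661, 42342, 31108, 41239, 32903, 42658,
       29829, 39616, 34625, 42650, 31923, 39990]
new = [31902, 41203, 38816, 43305, 35375, 52353,
       30883, 49424, 38724, 43234, 34565, 43861]

# A paired test works on the per-tire differences (old minus new).
diffs = [o - n for o, n in zip(old, new)]
n = len(diffs)

# t = mean difference / standard error of the mean difference
t_stat = mean(diffs) / (stdev(diffs) / sqrt(n))

# One-tail critical value for df = 11 at 0.05, copied from the Excel output.
t_critical_one_tail = 1.795884814

print(round(t_stat, 4))
print(abs(t_stat) > t_critical_one_tail)
```

A negative t statistic here means the old rubber wore out sooner on average; its magnitude exceeding the critical value is the same decision as the p-Value falling below 0.05.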

Problem:

Evaluate the returns of these two stocks to determine if there is a real difference. Use a 0.05 level of significance.

Each column holds 30 observations: the six monthly returns below, repeated five times.

Viacom   0.7541    14.9701   11.9792   7.907     -5.1724   3.4091
GM       -4.6296   18.986    -1.7226   -0.5535   6.679     1.8261


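The p-Values for both the one-tailed and two-tailed tests are greater than the stated level of significance (0.05), so it cannot be stated with 95% certainty that there is a difference in the returns of these companies.

The NULL Hypothesis that the means of both returns are equal cannot be rejected.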


Problem:

A company is testing light bulbs from 2 suppliers. Below are listed the hours of usage before each sample burned out. Determine using a 0.05 level of significance whether the new supplier's light bulbs really last longer than the old supplier's.

Light Bulb Suppliers

Old (22 bulbs): 42 46 64 53 38 44 61 44 50 60 39 51 42 37 45 65 54 46 42 44 26 52

New (12 bulbs): 55 45 58 52 54 47 51 61 49 56 52 49


The one-tailed p-value (one-tailed because we are only testing if one is better) is very close to, but still greater than, the stated level of significance (0.05), so we cannot reject the NULL Hypothesis, which states that the mean light bulb life for both suppliers is the same.
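For the equal-variance case, the pooled variance and t statistic can be sketched as below. This is an illustrative hand calculation, not the worksheet's Excel tool; the variable names are invented, and the critical value is copied from the Excel output for this problem (df = 32) because the standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, variance

old = [42, 46, 64, 53, 38, 44, 61, 44, 50, 60, 39,
       51, 42, 37, 45, 65, 54, 46, 42, 44, 26, 52]
new = [55, 45, 58, 52, 54, 47, 51, 61, 49, 56, 52, 49]

n_old, n_new = len(old), len(new)

# Pooled variance: sample variances weighted by their degrees of freedom.
pooled_variance = ((n_old - 1) * variance(old) + (n_new - 1) * variance(new)) / (
    n_old + n_new - 2
)

# t = difference in means / standard error of that difference
t_stat = (mean(old) - mean(new)) / sqrt(pooled_variance * (1 / n_old + 1 / n_new))

# One-tail critical value for df = 32 at 0.05, copied from the Excel output.
t_critical_one_tail = 1.693888703

print(round(pooled_variance, 4), round(t_stat, 4))
print(abs(t_stat) > t_critical_one_tail)
```

The final line prints whether the statistic clears the critical value, which is the same decision as comparing the one-tailed p-value with 0.05.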


Confidence Interval method (dealer sales example)

In this case, we want to determine with 95% certainty whether an advertising campaign increased average daily sales to our large dealer network. To determine this, we must take Before and After samples of average daily sales from at least 30 dealers. The keys to success of this sampling are the following:

1) At least 30 dealers must be sampled.
2) Before and After samples must be taken from the same dealers.
3) The samples must be AVERAGE sales, for example, average daily sales over a week or a month. It cannot just be one sample of one day's sales.
4) The dealers sampled must be random and representative of the overall population.

We are trying to determine whether the Mean Difference falls inside or outside the 95% Confidence Interval around a Mean Difference of 0. If the Mean Difference falls within this 95% Confidence Interval, we cannot say that a change occurred. If the Mean Difference falls outside this Confidence Interval, we can state with 95% certainty that average daily sales for the whole network have changed.

To determine the 95% Confidence Interval for a 0 Mean, we need the following information:

Sample size (COUNT) = 30 (need at least 30 samples of daily averages from the same dealers)
Sample Standard Deviation (STDEV) = 6.11
Sample Standard Error = 1.11 = (Sample Standard Deviation) / (Square Root of Sample Size)
Sample Mean (AVERAGE) = 9.60
alpha (1 - Confidence Level) = 0.05 (for a 95% Confidence Interval, alpha = 0.05)

The 95% confidence interval will contain 95% of the area under the Normal curve. The remaining 5% (alpha) will be split between the two outer tails of the Normal curve. The Z Score represents the right outer edge of the confidence interval. The total area under the Normal curve to the left of this Z value for a 95% two-tailed confidence interval is 97.5%. The Z Score for this is 1.96. This means that 97.5% of the total area under the Normal curve is to the left of 1.96 Standard deviations to the right of the mean.

Z Score (two tailed) for 95% CI = NORMSINV(0.975) = 1.96

The 95% Confidence Interval around a Sample Mean of 0
= 0 +/- (Z Score for 95% CI) * (Sample Standard Error)
= 0 +/- (1.96) x (1.11)
The 95% Confidence Interval for the Mean = 0 is from -2.18 to +2.18

If the Sample Mean (9.60) is outside of the 95% Confidence Interval for the Mean Difference being 0, we can say with 95% certainty that Average Daily sales throughout the entire population of dealers have increased.

This is the case because the Mean of 9.60 is outside of the confidence interval of -2.18 to +2.18.

We can now state with 95% certainty that the advertising campaign has caused a change in the daily sales of the dealer network.
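The same interval can be computed from the Difference column with Python's standard library. This is a minimal sketch of the steps above, with `NormalDist().inv_cdf` standing in for Excel's NORMSINV; the variable names are invented for the example.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Difference (After - Before) for the 30 dealers.
differences = [10, 5, 2, 17, 5, 6, 9, 23, 12, 7] * 3

n = len(differences)
sample_mean = mean(differences)
standard_error = stdev(differences) / sqrt(n)

# Two-tailed 95% interval: 97.5% of the area lies to the left of z.
z = NormalDist().inv_cdf(0.975)
margin = z * standard_error

# The interval around a mean difference of 0 is (-margin, +margin).
changed = abs(sample_mean) > margin

print(round(sample_mean, 2), round(standard_error, 2), round(z, 2))
print(changed)
```

Working from unrounded intermediate values gives a margin that can differ slightly in the last digit from the worksheet's 1.96 x 1.11.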



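t-Test: Paired Two Sample for Means (6 samples, Before vs. After)

                               Before         After
Mean                           5.641183333    3.4309
Variance                       55.62648613    72.4984677
Observations                   6              6
Pearson Correlation            0.350437967
Hypothesized Mean Difference   0
df                             5
t Stat                         0.592077573
P(T<=t) one-tail               0.28977785
t Critical one-tail            2.015048372
P(T<=t) two-tail               0.5795557
t Critical two-tail            2.570581835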


t-Test: Paired Two Sample for Means


                               Old Rubber     New Rubber
Mean                           37212          40303.75
Variance                       23678506       43699518.39
Observations                   12             12
Pearson Correlation            0.736490409
Hypothesized Mean Difference   0
df                             11
t Stat                         -2.395091934
P(T<=t) one-tail               0.017769924
t Critical one-tail            1.795884814
P(T<=t) two-tail               0.035539848
t Critical two-tail            2.200985159



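t-Test: Two-Sample Assuming Unequal Variances

                               Viacom         GM
Mean                           5.641183333    3.4309
Variance                       47.95386735    62.49867906
Observations                   30             30
Hypothesized Mean Difference   0
df                             57
t Stat                         1.151915733
P(T<=t) one-tail               0.127082174
t Critical one-tail            1.672028889
P(T<=t) two-tail               0.254164347
t Critical two-tail            2.002465444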



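t-Test: Two-Sample Assuming Equal Variances

                               Old            New
Mean                           47.5           52.41666667
Variance                       90.54761905    21.53787879
Observations                   22             12
Pooled Variance                66.82552083
Hypothesized Mean Difference   0
df                             32
t Stat                         -1.675954
P(T<=t) one-tail               0.051746314
t Critical one-tail            1.693888703
P(T<=t) two-tail               0.103492628
t Critical two-tail            2.036933334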


Analysis of Variance - ANOVA

ANOVA is a technique for testing the equality of different population means. ANOVA is very useful because it can be extended to any number of populations. All ANOVA tests evaluate the NULL Hypothesis - that is - all samples drawn have the same mean.

ANOVA is often used by marketers to test whether different marketing campaigns with multiple varying elements actually yielded different results.

The NULL Hypothesis is rejected - that is - there are real differences between the means - if the p-Value pertaining to the item being evaluated is less than the desired level of significance. For example, in the 1st ANOVA below, the p-Value pertaining to "Between Methods (Groups)" is less than the desired level of significance - So there is a difference between the groups.

Anova: Single Factor - Single Factor Analysis Calculated by Excel


The Hand Calculation of this ANOVA is performed at the bottom of this worksheet

Problem: 3 different sales training methods are used. Three groups of four randomly chosen new salespeople are chosen. Each group is trained using one of the methods. After the course is completed, the sales total of each salesperson over the next two weeks is collected. Determine at a 0.05 level of significance whether there is a difference in the effectiveness of the courses.

Anova: Single Factor

SUMMARY
Groups      Count   Sum
Method 1    4       68
Method 2    4       80
Method 3    4       92

ANOVA
Source of Variation   SS     df
Between Groups        72     2
Within Groups         46     9
Total                 118    11

The p-Value for Methods (Between Groups, which are the Methods) (0.014419201) is less than the level of significance (0.05), so there is a difference between the effectiveness of the teaching methods.

The p-Value calculated by Excel agrees with the hand-calculated p-Value, which is less than the level of significance. This indicates that there is a real difference in the effectiveness between the courses.

Anova: Two Factor - Two Factor Without Replication


Two factors are being evaluated and each test is performed only once.

Problem:

Here are 3 different types of typing keyboards. 5 typists each got to use all three keyboards. Here are the typing speeds of each typist on each of the 3 keyboard types. Determine at a 0.01 level of significance (99% certainty) whether typing speed differs between the 3 keyboard types.

In this example, the two factors that influence the speed of typing are 1) the keyboard, and 2) the typing ability of each typist.

Anova: Two-Factor Without Replication

SUMMARY       Count   Sum
Typist 1      3       180
Typist 2      3       338
Typist 3      3       141
Typist 4      3       303
Typist 5      3       216
Keyboard A    5       375
Keyboard B    5       379
Keyboard C    5       424

ANOVA
Source of Variation   SS             df
Rows                  9151.066667    4
Columns               296.1333333    2
Error                 94.53333333    8
Total                 9541.733333    14

The p-Value for the Rows (5.42004E-08) is much less than the level of significance (0.01), so there is a difference between the speed of each typist.

The p-Value for Columns (0.003428581) is less than the level of significance (0.01), so there is a difference between keyboards regarding typing speed.
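The sums of squares and F statistics for this two-factor layout can be checked with a short hand calculation. The sketch below is illustrative only, with invented variable names; it uses the typing speeds from this problem and compares each F statistic with the F crit values from the Excel output (which correspond to the 0.01 level), since the standard library has no F distribution.

```python
# Rows are typists 1 to 5; columns are keyboards A, B, C.
speeds = [
    [51, 57, 72],
    [109, 112, 117],
    [47, 43, 51],
    [98, 98, 107],
    [70, 69, 77],
]

r, c = len(speeds), len(speeds[0])
grand_mean = sum(map(sum, speeds)) / (r * c)
row_means = [sum(row) / c for row in speeds]
col_means = [sum(col) / r for col in zip(*speeds)]

# Partition the total variation into typist, keyboard and error parts.
ss_rows = c * sum((m - grand_mean) ** 2 for m in row_means)
ss_cols = r * sum((m - grand_mean) ** 2 for m in col_means)
ss_total = sum((v - grand_mean) ** 2 for row in speeds for v in row)
ss_error = ss_total - ss_rows - ss_cols

ms_error = ss_error / ((r - 1) * (c - 1))
f_rows = (ss_rows / (r - 1)) / ms_error
f_cols = (ss_cols / (c - 1)) / ms_error

# F crit values copied from the Excel output for this problem.
f_crit_rows, f_crit_cols = 7.006076623, 8.649110641

print(round(f_rows, 3), f_rows > f_crit_rows)
print(round(f_cols, 3), f_cols > f_crit_cols)
```

Treating the typists as a second factor removes their large person-to-person differences from the error term, which is what lets the smaller keyboard effect show up as significant.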

Anova: Two Factor - Two Factor With Replication

Two factors are being evaluated and the tests are performed more than once (in this case, each test is performed in two markets).

Problem

A perfume company was testing a product using 3 different advertising focuses (Sophisticated, Athletic, Popular) and 3 different package designs (Design 1, Design 2, Design 3), in 2 separate markets. Using a 0.05 level of significance, determine whether 1) Advertising Focus, 2) Package Design, or 3) the Interaction between them had any effect on sales. The chart shows the sales with each combination in each of the two markets.

Anova: Two-Factor With Replication

SUMMARY      Sophisticated   Athletic

Design 1
Count        2               2
Sum          5.53            3.37
Average      2.765           1.685
Variance     0.00245         0.25205

Design 2
Count        2               2
Sum          5.97            2.9
Average      2.985           1.45
Variance     0.18605         0.005

Design 3
Count        2               2
Sum          5.13            6.03
Average      2.565           3.015
Variance     0.00125         0.03645

Total
Count        6               6
Sum          16.63           12.3
Average      2.771666667     2.05
Variance     0.073256667     0.62848

ANOVA
Source of Variation   SS             df
Sample                0.807211111    2
Columns               4.991077778    2
Interaction           2.277122222    4
Within                1.0447         9
Total                 9.120111111    17

The p-Value for Sample (0.076062) is more than the level of significance (0.05). We cannot reject the NULL Hypothesis that states that the package does not affect sales.

The p-Value for Columns (0.00037339) is less than the level of significance (0.05). This indicates that the overall advertising strategies affect sales differently.

The p-Value for Interaction (0.022409) is less than the level of significance. This indicates that different combinations (package / ad campaign) have different effects on sales.

Anova: Single Factor - Single Factor Analysis Calculated by Hand


(Excel calculation of Single Factor ANOVA is shown at the top of this Worksheet)

Problem: 3 different sales training methods are used. Three groups of four randomly chosen new salespeople are chosen. Each group is trained using one of the methods. After the course is completed, the sales total of each salesperson over the next two weeks is collected. Determine at a 0.05 level of significance whether there is a difference in the effectiveness of the courses.

                                            Method 1   Method 2   Method 3
Student 1                                   16         19         24
Student 2                                   21         20         21
Student 3                                   18         21         22
Student 4                                   13         20         25
Column Total                                68         80         92
Column Mean                                 17         20         23
Grand Mean = (17 + 20 + 23) / 3 = 20
Column Mean - Grand Mean                    -3         0          3
(Column Mean - Grand Mean)^2                9          0          9
# Rows x [ (Column Mean - Grand Mean)^2 ]   36         0          36

Sum of Squares Between Groups = 36 + 0 + 36 = 72

Degrees of Freedom
Between Groups DOF = # groups - 1 = c - 1 = 3 - 1 = 2
Within Groups DOF = c(r - 1) = 3 (4 - 1) = 9
Total Degrees of Freedom = 11

Sum of Squares
Sum of Squares Between Groups = 72
Sum of Squares Within Groups = 46
Total Sum of the Squares = 118

Mean Squares
MS = Mean Square = Sum of Squares / degrees of freedom

                 SS    df   MS
Between Groups   72    2    36
Within Groups    46    9    5.111111

F Statistic
F Statistic = (MS Between Groups) / (MS Within Groups)
F Statistic = 36 / 5.111111 = 7.043478261

p Value
p-Value = FDIST(F Statistic, DOF Between Groups, DOF Within Groups)
p-Value = FDIST(7.043478,2,9) = 0.014419203

The p-value of 0.014419 is less than the designated level of significance of 0.05. This indicates if there was no difference in effectiveness between the courses. Therefore, there is at least 95%
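The hand calculation can be retraced in a few lines of code. This is a minimal sketch in plain Python (standard library only) using the sales totals from the problem; the variable names are illustrative. It stops at the F statistic, because the F distribution lookup (Excel's FDIST) has no standard-library counterpart.

```python
# Sales totals for the three training methods (four salespeople each).
groups = {
    "Method 1": [16, 21, 18, 13],
    "Method 2": [19, 20, 21, 20],
    "Method 3": [24, 21, 22, 25],
}

all_values = [x for values in groups.values() for x in values]
grand_mean = sum(all_values) / len(all_values)

ss_between = 0.0
ss_within = 0.0
for values in groups.values():
    group_mean = sum(values) / len(values)
    # Between groups: # rows x (column mean - grand mean)^2
    ss_between += len(values) * (group_mean - grand_mean) ** 2
    # Within groups: squared distance of each value from its own column mean
    ss_within += sum((x - group_mean) ** 2 for x in values)

df_between = len(groups) - 1                # c - 1
df_within = len(all_values) - len(groups)   # c(r - 1)

ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_statistic = ms_between / ms_within

print("SS between:", ss_between, " SS within:", ss_within)
print("MS between:", ms_between, " MS within:", round(ms_within, 6))
print("F statistic:", round(f_statistic, 6))
```

The resulting F statistic is the value passed to FDIST together with the two degrees of freedom.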

useful because it can be mples drawn have the same mean.

ple varying elements actually yielded different results.

he p-Value pertaining to that NOVA below, the p-Value here is a difference between the groups.

by Excel

his worksheet

Method 1 16 21 18 13

Teaching Method Method 2 Method 3 19 24 20 21 21 22 20 25

Average

Variance 17 11.3333333 20 0.66666667 23 3.33333333

F P-value 36 7.04347826 0.014419201 5.111111111 MS

F crit 4.256494729

ess than the level of significance (0.05)

the level of significance. This indicates that there is a real

Keyboard A 51 109 47 98 70

Keyboard B 57 112 43 98 69

Keyboard C 72 117 51 107 77

) the typing ability of each typist.

Variance 60 117 112.6666667 16.3333333 47 16 101 27 72 19 75 75.8 84.8 767.5 819.7 724.2

Average

MS F P-value 2287.766667 193.605078 5.42004E-08 148.0666667 12.5303244 0.003428581 11.81666667

F crit 7.006076623 8.649110641

here is a difference between the speed of each typist.

here is a difference between keyboards regarding typing speed.

ach test is performed in two markets).

Sophisticated 2.80 2.73 3.29 2.68 2.54 2.59

Athletic 2.04 1.33 1.50 1.40 3.15 2.88

Popular 1.58 1.26 1.00 1.82 1.92 1.33

Use "2 Rows Per Sample"

Popular

Total

2 6 2.84 11.74 1.42 1.95666667 0.0512 0.46722667

2 6 2.82 11.69 1.41 1.94833333 0.3362 0.75057667

2 6 3.25 14.41 1.625 2.40166667 0.17405 0.44477667

6 8.91 1.485 0.12407

MS 0.403605556

F P-value 3.4770269 0.076062669

F crit 4.256494729

2.495538889 21.4988513 0.00037339 0.569280556 4.90430267 0.022409688 0.116077778

4.256494729 3.633088512

ject the NULL Hypothesis that states that the package does not affect sales.

ates that that overall advertising strategies affect sales differently.

at different combinations of interactions (package / ad campaign) have different affects on sales.


Method 2 19 20 21 20 80 20

Method 3 24 21 22 25 92 23

Column Total Column Mean

0 0 0

3 9 36

Sum of Squares Within Treatments = 34 + 2 + 10 =

MS 36 5.111111111

The p-Value represents the proportion of area under the F Distribution curve to the right of the given F value. If this p-Value is less than the stated level of significance, this indicates that there is a difference in the objects or processes being analyzed - in other words, at least one group mean differs from the others.

nce of 0.05. This indicates that there is less than a 5% chance that this result could have occurred refore, there is at least 95% certainty that there is a real difference in effectiveness of the courses.

                            Method 1   Method 2   Method 3
Sales totals                16         19         24
                            21         20         21
                            18         21         22
                            13         20         25
Column Total                68         80         92
Column Mean                 17         20         23

Value - Column Mean         16 - 17    19 - 20    24 - 23
                            21 - 17    20 - 20    21 - 23
                            18 - 17    21 - 20    22 - 23
                            13 - 17    20 - 20    25 - 23

Differences                 -1         -1         1
                            4          0          -2
                            1          1          -1
                            -4         0          2

Square each                 1          1          1
                            16         0          4
                            1          1          1
                            16         0          4
Column sum of squares       34         2          10

Sum of Squares Within Groups = 34 + 2 + 10 = 46

right of the given F value. ere is a difference

Determining if Population Variance Has Changed - Uses Chi Squared Distribution


Quality control people use the Chi Square test to determine if a process's variance levels are staying within given limits.

The Chi Square Distribution is used to determine if a population's variance has changed. The Chi Square Distribution is skewed, with the high point of the curve occurring near the point on the x axis that equals the number of degrees of freedom (n-1 --> Sample Size - 1). Strictly, the mean of the distribution equals the degrees of freedom and the peak lies 2 units to the left of it. The total area under the Chi Squared curve is 1.0. The area under the curve to the left or right of outer limits determines whether it can be said with a certain degree of confidence that the population variance has changed. If the area outside the Chi Square Statistic (the p value) is less than the desired level of significance, then we conclude that the population variance has changed.

If Sample Standard Deviation, s, is greater than Population Standard Deviation, σ, then the Chi Squared Statistic will be to the right of (greater than) the degree of freedom point and the p value produced by CHIDIST(ChiSquare Statistic, degrees of freedom) will be the p value of the right tail.

If Sample Standard Deviation, s, is less than Population Standard Deviation, σ, then the Chi Squared Statistic will be to the left of (less than) the degree of freedom point and the p value produced by CHIDIST(ChiSquare Statistic, degrees of freedom) will still be the area under the Chi Square curve to the right of the Chi Square Statistic point. To get the area under the left tail (area to the left of the Chi Square point), the p-value = 1 - CHIDIST(Chi Square Statistic, degrees of freedom)

Test on Whether a Population Variance Has Increased Above a Given Value


Problem:

A manufacturer wants to check if the variance on a process has changed. A machine drills a hole as part of the manufacturing process. The standard deviation of the hole diameter has historically been 1.6 ml. A random sample of 50 hole diameters were checked in one batch. The measured sample standard deviation was 1.9 ml. At a 0.05 level of significance, has the population standard deviation increased above 1.6 ml?

Givens:
n = 50
Degrees of Freedom = n-1 = 49
Level of Significance, α, = 0.05
Population Standard Deviation, σ, = 1.6
Sample Standard Deviation, s, = 1.9

Use the Chi Squared Test to determine if there has been a change in variance.
1) Calculate Chi Square Statistic, χ² = [ (n-1)*(s*s) ] / (σ*σ) = [ 49*(1.9*1.9) ] / (1.6*1.6) = 69.09766
2) Obtain p value from Chi Square Statistic
Upper p value = CHIDIST(69.09766,49) = 0.030749

This p value states the portion of total area under the Chi Square distribution curve for 49 degrees of freedom to the right of the Chi Square Statistic. The Chi Square Statistic is calculated from the degrees of freedom (n - 1), population standard deviation, and sample standard deviation. If the p value (the area under the Chi Square distribution curve to the right of the Chi Square Statistic on that curve) is greater than the level of significance value we are evaluating (α = 0.05 on a one-tailed test), then we do not reject the NULL Hypothesis.

In this case the p value (0.030749) is less than the desired level of significance (α = 0.05), and we reject the NULL Hypothesis. It appears that the population standard deviation has increased above 1.6 ml.

Test on Whether a Population Variance Has Decreased Below a Given Value


Problem:

A manufacturer wants to check if the variance on a process has changed. A machine drills a hole as part of the manufacturing process. The standard deviation of the hole diameter has historically been 1.6 ml. The engineers believe that they have improved the process. A random sample of 50 hole diameters were checked in one batch. The measured sample standard deviation was 1.375 ml. At a 0.05 level of significance, has the population standard deviation decreased below 1.6 ml?

Givens:
n = 50
Degrees of Freedom = n-1 = 49
Level of Significance, α, = 0.05
Population Standard Deviation, σ, = 1.6
Sample Standard Deviation, s, = 1.375

Use the Chi Squared Test to determine if there has been a change in variance.
1) Calculate Chi Square Statistic, χ² = [ (n-1)*(s*s) ] / (σ*σ) = [ 49*(1.375*1.375) ] / (1.6*1.6) = 36.18774
2) Obtain p value from Chi Square Statistic
Area under curve to right = CHIDIST(36.18774,49) = 0.912951
p value = Area to the left of Chi Square point = 1 - CHIDIST(36.18774,49) = 0.087049

This p value states the portion of total area under the Chi Square distribution curve for 49 degrees of freedom to the left of the Chi Square Statistic. The Chi Square Statistic is calculated from the degrees of freedom (n - 1), population standard deviation, and sample standard deviation. If the p value (here, the area under the Chi Square distribution curve to the left of the Chi Square Statistic on that curve) is greater than the level of significance value we are evaluating (α = 0.05 on a one-tailed test), then we do not reject the NULL Hypothesis.

In this case the p value (0.087049) is greater than the desired level of significance (α = 0.05), and we do not reject the NULL Hypothesis that there has been no change. It appears that the population standard deviation has not decreased below 1.6 ml.
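Both variance tests can be reproduced outside Excel. The sketch below is an illustration in Python's standard library, which has no built-in chi-square distribution, so the left-tail area is approximated by numerically integrating the chi-square density with Simpson's rule. The helper names chi_square_cdf and variance_test are hypothetical, and the integration approach is an assumption of this sketch, not how CHIDIST works internally.

```python
import math

def chi_square_cdf(x, df, steps=2000):
    """Approximate area under the chi-square density to the left of x.

    Uses Simpson's rule; steps must be even. Assumes df > 2.
    """
    if x <= 0:
        return 0.0
    log_norm = (df / 2) * math.log(2) + math.lgamma(df / 2)

    def pdf(t):
        if t <= 0:
            return 0.0  # density is 0 at the origin when df > 2
        return math.exp((df / 2 - 1) * math.log(t) - t / 2 - log_norm)

    h = x / steps
    total = pdf(0) + pdf(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return total * h / 3

def variance_test(n, sigma, s):
    """Return (chi-square statistic, right-tail area), like CHIDIST."""
    df = n - 1
    statistic = df * s ** 2 / sigma ** 2
    right_tail = 1 - chi_square_cdf(statistic, df)
    return statistic, right_tail

# Has the standard deviation increased above 1.6? (s = 1.9)
stat_up, p_up = variance_test(50, 1.6, 1.9)

# Has the standard deviation decreased below 1.6? (s = 1.375)
stat_down, right_down = variance_test(50, 1.6, 1.375)
p_down = 1 - right_down  # left-tail area

print(f"Increase test: statistic = {stat_up:.5f}, p = {p_up:.6f}")
print(f"Decrease test: statistic = {stat_down:.5f}, p = {p_down:.6f}")
```

With alpha = 0.05, the first p value falls below alpha and the second does not, matching the conclusions of the two problems.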

anged - Uses Chi Squared Distribution


The Chi Squre Distribution is skewed with the high point of the Sample Size - 1). The total area under the Chi Squared curve is 1.0. ith a certain degree of confidence that the population variance has changed. ance, then the population variance has changed.

Squared Statistic will be to the right (greater than) the degree of freedom point alue of the right tail.

quared Statistic will be to the left (less than) the degree of freedom point area under the Chi Square curve to the right of the Chi Square Statistic point.. DIST(Chi Square Statistic, degrees of freedom)

eased Above a Given Value

machine drills a hole as part of the manufacturing process.

ured sample standard deviation was 1.9 ml. sed above 1.6 ml?

for 49 degree of freedom to the left of the Chi Square Statistic d deviation, and sample standard deviation. hi Square Statistic on that curve) is ed test), then we accept the NULL Hypothesis.

e ( = 0.05), and we reject the NULL Hypothesis.

reased Below a Given Value

machine drills a hole as part of the manufacturing process. engineers believe that they have improved the process. ured sample standard deviation was 1.35 ml.

for 49 degree of freedom to the left of the Chi Square Statistic deviation, and sample standard deviation. hi Square Statistic on that curve) is ed test), then we accept the NULL Hypothesis.

nce ( = 0.05), and we do not reject the NULL Hypothesis that there has been no change.

Normal Distribution
Any Normal distribution can be identified by two parameters - the mean and the standard deviation. The area under the entire density function = 1.

The Normal distribution is a continuous distribution, as opposed to a discrete distribution such as the binomial distribution, which is a set of discrete points.

Most problems involving the Normal distribution fall into two categories:
1) Determining the probability of a normally distributed random variable having a value within a given interval
2) Determining a Confidence Interval - that is - determining an interval within which the value of a normally distributed random variable will fall with a given probability

To be able to apply the Normal distribution, it is extremely important that the underlying population can be shown to be normally distributed. This is often not the case.

For any population, whether Normally distributed or not, the distribution of x bar (the average of each sample) will be approximately Normally distributed if sample size is large (30 or more). This is a basic tenet of the Central Limit Theorem - Statistics' most fundamental rule.
It is important to note that the problems on this page do not deal with samples. These problems only use parameters of the entire populations.

z = number of standard deviations that a point lies from the mean
Population Mean = μ = "mu"
Population Standard Deviation = σ = "sigma"
z = ( x - μ ) / σ = ( x - mean ) / ( Length of 1 Standard Deviation )

The z distribution, sometimes called the standard normal distribution, is a normal distribution with the mean, μ, = 0 and the standard deviation, σ, = 1.

Population parameters are generally described with Greek letters, such as μ (population mean) and σ (population standard deviation), while sample statistics are generally described with Roman letters, such as x bar (sample mean) and s (sample standard deviation).

Statistical Function NORMSDIST(z) tells what percentage of total area of the standardized normal curve (mean = 0 and standard deviation length = 1) is to the left of a point z standard deviations from the mean, which is 0.

NORMSDIST(0) = 0.5
This means that half of the area under the standardized normal curve exists to the left of z when z = 0 (z is exactly on top of the mean, that is, 0 standard deviations away from the mean).

NORMSDIST(1.96) = 0.975
This means that 97.5% of the total area under the standardized normal curve is to the left of z when z is 1.96 standard deviations from the mean. This point of z = 1.96 is often used to calculate the 95% Confidence Interval. That is, the section under the normal curve that starts at 1.96 standard deviations to the left of the mean and extends to 1.96 standard deviations to the right of the mean will contain 95% of the total area under the bell shaped Normal curve.

Statistical Function NORMSINV() is the inverse of NORMSDIST(). It tells how many standard deviations from the mean a point on the standardized normal curve must lie so that the area under the curve to the left of that point equals the percentage given as the argument for the function.

NORMSINV(0.975) = 1.96
This means that 97.5% of the total area under the normal curve is to the left of the point 1.96 standard deviations from the mean.

Statistical Function NORMDIST(x, mean, standard dev, TRUE) will calculate the area under the curve to the left of point x on a normal curve with the given mean and standard deviation. (The TRUE is stated to provide Cumulative area - this is nearly always TRUE.)

NORMDIST(1.96,0,1,TRUE) = 0.975
Setting mean to 0 and standard deviation to 1 makes it a standardized Normal curve, like the above problem.

Problem: A store has normally distributed daily sales. The average daily sales = $2,000 and the daily sales standard deviation = $500. What is the probability that the sales of one random day will be below $1,000?
Population Mean = μ = "mu" = $2,000
Population Standard Deviation = σ = "sigma" = $500
x = $1,000
NORMDIST(1000,2000,500,TRUE) = 0.02275 = 2.28%
This can be interpreted by saying that only 2.28% of the total area under this particular Normal curve falls to the left of x = 1,000.

Problem: A brand of car has a mean fuel consumption of 27 mpg with a standard deviation of 5 mpg. What percentage of the cars can be expected to have a fuel consumption of between 25 mpg and 30 mpg? Fuel consumption is normally distributed for this population.
Percentage of cars with fuel efficiency between 25 mpg and 30 mpg
= Percentage of cars with fuel efficiency less than 30 mpg - Percentage of cars with fuel efficiency less than 25 mpg
= NORMDIST(30,27,5,TRUE) - NORMDIST(25,27,5,TRUE)
= 0.725747 - 0.344578 = 0.381169 = 38.12%
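The same cumulative areas can be checked outside Excel. The sketch below assumes Python 3.8 or later, whose standard library class statistics.NormalDist provides a cdf method that plays the role of NORMSDIST and NORMDIST(..., TRUE); the variable names are illustrative.

```python
from statistics import NormalDist

# Standardized normal curve: mean 0, standard deviation 1.
standard = NormalDist()
print("NORMSDIST(0)    ~", standard.cdf(0))
print("NORMSDIST(1.96) ~", round(standard.cdf(1.96), 4))

# Store problem: daily sales with mean $2,000 and standard deviation $500.
sales = NormalDist(mu=2000, sigma=500)
p_below_1000 = sales.cdf(1000)          # area to the left of x = 1,000
print("P(sales < 1000) ~", round(p_below_1000, 5))

# Car problem: fuel consumption with mean 27 mpg, standard deviation 5 mpg.
mpg = NormalDist(mu=27, sigma=5)
p_between = mpg.cdf(30) - mpg.cdf(25)   # area between 25 and 30 mpg
print("P(25 < mpg < 30) ~", round(p_between, 6))
```

Subtracting two left-tail areas, as in the car problem, is the general way to get the probability of an interval.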

For the regular Normal curve, x = μ + zσ. The standardized Normal curve has μ = 0 and σ = 1.

Statistical Function NORMINV(probability, mean, standard dev) returns the point x on a normal curve with the given mean and standard deviation for which the area under the curve to the left of x equals the probability given as the first argument.
NORMINV(0.975,0,1) = 1.96
This means that 97.5% of the total area under the standardized normal curve is to the left of the point 1.96 standard deviations from the mean.

Problem: A company's package delivery time is normally distributed with a mean of 10 hours and a standard deviation of 3 hours. What delivery time will be beaten by only 2.5% of all deliveries?
μ = 10
σ = 3
NORMINV(0.025,10,3) = 4.12
Meaning that only 2.5% of all package delivery times will be quicker than 4.12 hours.

Problem: A tire company makes a tire with a normally distributed tread life that has a mean of 39,000 miles and standard deviation of 5,300 miles. What tread life would be exceeded by 98% of all tires?
μ = 39,000
σ = 5,300
NORMINV(0.02,39000,5300) = 28115
Meaning that only 2% of all tires will wear out before 28,115 miles.

Problem: A tire company makes a tire with a normally distributed tread life that has a mean of 39,000 miles and standard deviation of 5,300 miles. What would the range of tread life be that 95% of all tires would wear out in?
μ = 39,000
σ = 5,300
Calculation of the left boundary: NORMINV(0.025,39000,5300) = 28612
Meaning that only 2.5% of all tires will wear out before 28,612 miles.

Calculation of the right boundary: NORMINV(0.975,39000,5300) = 49388
Meaning that only 2.5% of all tires will wear out after 49,388 miles.

So, 95% of tires will wear out in the range of 28,612 miles to 49,388 miles.
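The inverse lookups in the delivery and tire problems can be sketched the same way. This assumes Python 3.8 or later, where statistics.NormalDist.inv_cdf plays the role of NORMINV; the variable names are illustrative.

```python
from statistics import NormalDist

# Delivery problem: mean 10 hours, standard deviation 3 hours.
delivery = NormalDist(mu=10, sigma=3)
beaten_by_2_5_percent = delivery.inv_cdf(0.025)
print("Delivery time beaten by only 2.5%:", round(beaten_by_2_5_percent, 2))

# Tire problems: mean 39,000 miles, standard deviation 5,300 miles.
tread = NormalDist(mu=39000, sigma=5300)

# Tread life exceeded by 98% of tires = point with 2% of the area to its left.
exceeded_by_98_percent = tread.inv_cdf(0.02)
print("Exceeded by 98% of tires:", round(exceeded_by_98_percent))

# Range containing the middle 95% of tires: 2.5% of the area in each tail.
lower = tread.inv_cdf(0.025)
upper = tread.inv_cdf(0.975)
print("95% of tires wear out between", round(lower), "and", round(upper), "miles")
```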

on such as the binomial distribution, whish is a set of discrete points.


e value of a normally distributed random variable will fall with a given probability

nderlying population can be proven to be normally distributed. This is often not the case.

istribution of x bar (the average of each sample) will be approximately


problems only use parameters of the entire populations.

with the mean, , = 0 and the standard deviation, , = 1.

n) and (population standard deviation) mean) and s (sample standard devation)

l curve (mean = 0 and standard deviation length = 1)

andardized normal curve exists to the left of z when z = 0 (z is exactly on top of the mean, that is, 0 standard deviations away from the mean

er that staandardized normal curve is to the left of the z when z is 1.96 standard deviations from the mean. ate the 95% Confidence interval. That is, the section under the normal curve that starts a 1.96 nd extends to 1.96 standard deviations to the right of the normal curve will contain Normal curve.

e is to the left of the mean that the stated total area under the normal curve

er the normal curve is to the left of the point 1.96 standard deviations from the mean

e curve to the left of point x on a normal curve with the given mean and standard deviation.

To 1 makes it a standardized Normal curve, like the above problem.

and the daily sales standard deviation = $500,

g the only 2.28% of the total area under this particular Normal curve falls to the left of x = 1,000

tion of 5 mpg. 5 mpg and 30 mpg?

y less than 25% = 0.381169 38.12%

e is to the left of the mean that the stated total area under the normal curve

er the normal curve is to the left of the point 1.96 standard deviations from the mean

hours and a standard deviation of 3 hours.

ckage delivery times will be quicker than 4.12 hours.

ean of 39,000 miles and standard deviation of 5,300 miles.

will wear out before 28,115 miles..

ean of 39,000 miles and standard deviation of 5,300 miles.

s will wear out before 28,115 miles.

s will wear out after 49,388 miles..

ndard deviations away from the mean)

Confidence Intervals
Collection of 40 individual test scores
210 220 230 240 270 300 300 320 320 320
340 370 370 380 400 410 410 450 470 470
490 500 500 510 510 540 540 580 580 610
610 640 640 640 650 660 660 750 750 790

Calculate with 95% certainty an interval in which the population mean must fall, based upon a random sample of 40 test scores taken from that population.

In other words, calculate a 95% Confidence Interval for the population mean.
Sample size (COUNT) = 40
Sample Standard Deviation (STDEV) = 159.48
α (1 - Confidence Level) = 0.05 (for a 95% Confidence Interval, α = 0.05)
Mean (AVERAGE) = 473.75

Sample size must be at least 30 to be able to use the z score for the Normal Distribution, and the sample must be random and representative of the population.

Excel calculates the half-width of the Confidence Interval to be 49.42 using the following statistical function: CONFIDENCE(alpha, standard_dev, size)
Input for this function: CONFIDENCE(0.05,159.48,40) = 49.42

Let's see how Excel's calculation holds up to the manual calculation of the Confidence Interval from this sample:
(Excel hits it just about right on)

The 95% Confidence Interval around a Sample Mean = Sample Mean +/- (Z Score for 95% Confidence Interval) * (Sample Standard Error)
Z Score for 95% Confidence Interval (two sided) = Z(0.975) = NORMSINV(0.975) = 1.96
Sample Standard Error = (Sample Standard Deviation) / (Square Root of Sample Size)
Sample Standard Error = (159.48) / (Square Root [40]) = 25.21
Confidence Interval = Sample Mean +/- Z Score(95% Confidence Interval) * (Sample Standard Error)
Confidence Interval = 473.75 +/- (1.96) x (25.21) = 473.75 +/- 49.41
(Excel's answer of 49.42 is pretty close to the manual calculation of 49.41)

Confidence Interval = 473.75 +/- 49.41 = 424.34 to 523.16

This means that we can be 95% confident that the mean of the entire population is between the endpoints of this 95% Confidence Interval.

Statistically this is written as: Confidence Interval = Sample Mean +/- Z(α/2) * (Sample Standard Deviation / Square root of Sample Size)
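The confidence interval calculation can be retraced in code. This is a minimal sketch using Python's standard library (statistics.NormalDist needs Python 3.8 or later); the scores are the 40 test scores from the worksheet and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# The 40 test scores from the worksheet.
scores = [
    210, 220, 230, 240, 270, 300, 300, 320, 320, 320,
    340, 370, 370, 380, 400, 410, 410, 450, 470, 470,
    490, 500, 500, 510, 510, 540, 540, 580, 580, 610,
    610, 640, 640, 640, 650, 660, 660, 750, 750, 790,
]

n = len(scores)                   # COUNT
sample_mean = mean(scores)        # AVERAGE
sample_sd = stdev(scores)         # STDEV (sample standard deviation, n - 1)

alpha = 0.05                      # 1 - confidence level
z = NormalDist().inv_cdf(1 - alpha / 2)   # like NORMSINV(0.975)

standard_error = sample_sd / sqrt(n)
margin = z * standard_error       # comparable to CONFIDENCE(0.05, 159.48, 40)

lower, upper = sample_mean - margin, sample_mean + margin
print(f"mean = {sample_mean}, sd = {sample_sd:.2f}, margin = {margin:.2f}")
print(f"95% confidence interval: {lower:.2f} to {upper:.2f}")
```

Because the code keeps the unrounded z score and standard deviation, its margin agrees with Excel's 49.42 rather than the 49.41 obtained from rounded hand values.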

Getting Z Score for Two-Tailed 95% Confidence Interval

Two-tailed 95% confidence interval will have 2.5% of total curve area in each tail. Therefore this Z Score corresponds to 97.5% of total area to left of Z.

Z Score for two-tailed 95% confidence interval = NORMSINV(0.975) = 1.96

(NORMSINV) - Input is percentage (expressed as decimal) of area under standardized normal curve to the left of Z = 0.975
Standardized normal curve --> Mean = 0, Standard Deviation Length = 1

Getting Z Score for One-Tailed 95% Confidence Interval

One-tailed 95% confidence interval will have 5% of total curve area in right tail. Therefore this Z Score corresponds to 95% of total area to left of Z.

Z Score for one-tailed 95% confidence interval = NORMSINV(0.95) = 1.64

(NORMSINV) - Input is percentage (expressed as decimal) of area under standardized normal curve to the left of Z = 0.95

Determining Sample Size (n) for a Given Confidence Level and Bound (B)
n = number of samples needed to establish a specified confidence interval of width B on either side of the mean

e.g. How many samples must be taken to estimate the population diameter (of, for example, holes drilled by a machine) to within 0.05 mm of the mean sample diameter with 99% confidence? Standard deviation (determined from previous sampling) is 0.75 mm.
Z score of two-tailed 99% confidence = NORMSINV(0.995) = 2.576 (the calculation below uses 2.575)
n = [ (Z score of two-tailed 99% confidence)**2 x (sample standard deviation)**2 ] / [Interval**2]
n = [ (2.575)**2 x (0.75)**2 ] / [ (0.05)**2 ] = 1,492

Problem: A restaurant owner wants to estimate within $2.00 the average amount that customers spend during lunch. From experience, the standard deviation of the population is $5.00. How many samples need to be taken to get a sample average that is 92% certain of being within $2.00 of the population mean expenditure during lunch?
Z score of two-tailed 92% confidence = NORMSINV(0.96) = 1.751
Population Standard Deviation = 5.00
Interval = 2.00
n = [ (Z score of two-tailed 92% confidence)**2 x (sample standard deviation)**2 ] / [Interval**2]

n = [ (1.751)**2 x (5.00)**2 ] / [ (2.00)**2 ] = 19.16, or about 19 samples (round up to 20 to be sure of meeting the bound)
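The sample size formula is easy to wrap in a small helper. The sketch below is illustrative only: the function name sample_size is hypothetical, and it uses Python's statistics.NormalDist (Python 3.8 or later) in place of NORMSINV. Because it keeps the unrounded z score, its first result differs slightly from the worksheet value, which was computed with z rounded to 2.575.

```python
from math import ceil
from statistics import NormalDist

def sample_size(confidence, sigma, bound):
    """Unrounded n = (z * sigma / bound)**2 for a two-tailed confidence level."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return (z * sigma / bound) ** 2

# Hole diameter example: 99% confidence, sigma = 0.75 mm, bound = 0.05 mm.
n_holes = sample_size(0.99, 0.75, 0.05)

# Restaurant example: 92% confidence, sigma = $5.00, bound = $2.00.
n_lunch = sample_size(0.92, 5.00, 2.00)

# Rounding up guarantees the interval is no wider than the requested bound.
print(f"holes: n = {n_holes:.2f} -> {ceil(n_holes)} samples")
print(f"lunch: n = {n_lunch:.2f} -> {ceil(n_lunch)} samples")
```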

220 370 500 640

230 370 500 640

240 380 510 640

270 400 510 650

n interval in which the population mean must fall of 40 test scores taken from that population.

onfidence Interval for the population mean.


40 159.48 0.05 473.75 (Need

st be random and representative of the population.

a sample size of at least 30 to be able to use z score for Norma

(for 95% Confidence Inveral, = 0.05)

using the following statistical function: CONFIDENCE (alpha, standard_dev,size)]

49.42

rect, manual calculation of Confidence Interval calculated from this sample:

n of 0 = 0 +/- (Z Score for 95% Confidence Interval) * (Sample Standard Error) 1.96 Insert / NORMSINV(0.975) Function

Z(0.975) = 1.96

on) / ( Square Root of Sample Size)

40] ) = 25.21

5% Confidence Interval) *(Sample Standard Error) 473.5 +/-

49.41

(Excel's answer of 49.42 is pretty close to the manual calculation of 49.41)

32 to 223.16

an of the entire popultation

ample Standard Deviation / Square root of Sample Size)

dence Interval
area in each tail.

1.96
0.975

dence Interval

area in right tail.

1.64
0.95

Confidence Level and Bound (B)

ce interval of of width B on either side of mean

on diameter (of, for example, ple diameter with 99% confidence.

dard deviation)**2 ] / [Interval**2]

2.576

0 the average amount that customers spend during lunch. $5.00. How many samples need to be taken to get a sample average mean expenditure during lunch?

1.751

dard deviation)**2 ] / [Interval**2]

19

Samples

However, 30 samples should be the minimum taken unless you know for certain that the underlying population is normally distributed.

300 410 540 660

300 410 540 660

320 450 580 750

320 470 580 750

320 470 610 790

size of at least 30 to be able to use z score for Normal Distribution)


Binomial Distribution

Binomial distributions are collections of discrete values as opposed to, for example, the Normal distribution, which is continuous.

Any binomial distribution can be identified by the values of two of its variables - the number of trials (n) and the probability of success on a single trial (p).

Random Number Generator


Tools / Data Analysis / Random Number Generator

In this case, generate 5 random numbers, each with possible outcomes of 2 or 3. Each event has a 20% probability of a "2" outcome and an 80% probability of a "3" outcome. (You could easily do the same thing with outputs of 1 and 0 - measuring something occurring or not occurring.)

Number of variables = 1 (The value of the 1 variable is 1 or 0)
Number of random numbers = 5
Distribution type is Discrete
Value and probability input range - the Yellow highlighted range:
Outcome   Probability
2         0.2    This = p - the probability that the outcome of the event will be "1" and not "0"
3         0.8    This = q - the probability that the outcome of the event will be "0" and not "1"
Output range - Highlight the tan range

Output: 3 3 3 2 2

Count of 2's = 2 - Statistical function COUNTIF - Select the range of outputs to be counted and then select the cell that has the output to be counted (where outcome = 2).
This count is the number of successes in 5 random trials, each having a 0.20 chance of a "2" outcome.

This count is a binomially distributed random variable.

Calculating the probability of a certain number of a given outcome to occur in a certain number of trials if the probability of that outcome on a single trial is known.

Problem: What is the probability of 3 successful outcomes in 5 trials if the probability of a successful outcome in 1 trial is 20%?
s = number of successes = 3
n = number of trials = 5
p = probability of successful outcome on 1 trial = 0.2
Find Cumulative distribution (NO) - Use 0

Probability of this is = BINOMDIST(3,5,0.2,0) = 0.0512
(Statistical Function / BINOMDIST - in this case, you don't want cumulative distribution - Use 0 as that last argument)

Which is = 5.12% (Format / Cell / Percentage)

Problem - In 12 trials (n = 12), what is the probability that at least 10 of them (sum of the probabilities that s = 10, s = 11, and s = 12) will have the 1 of the 2 possible outcomes that has a probability of occurring of 65%? The probabilities of each outcome need to be added up.
Statistical function BINOMDIST(s,n,p,FALSE)
s    n    p      cumulative   probability
10   12   0.65   0            0.108846
11   12   0.65   0            0.036753
12   12   0.65   0            0.005688009
BINOMDIST(10,12,0.65,0) + BINOMDIST(11,12,0.65,0) + BINOMDIST(12,12,0.65,0) = 0.151288
This represents a combined probability of 15.13%

Problem - What is the probability of getting between 4 and 6 heads on 10 flips of a fair coin?
Probability of getting between 4 and 6 heads = P(4) + P(5) + P(6)
Also equals [ P(0) + P(1) + P(2) + P(3) + P(4) + P(5) + P(6) ] - [ P(0) + P(1) + P(2) + P(3) ]
This equals [ Cumulative probability of P(6) ] - [ Cumulative probability of P(3) ]
BINOMDIST(6,10,0.5,1) = 0.828125
BINOMDIST(3,10,0.5,1) = 0.171875
Equals 0.828125 - 0.171875 = 0.65625 = 65.63%

Problem - If 10% of products require servicing, what is the probability that fewer than 15 of 200 products will need servicing?
The problem actually asks what is the probability that up to 14 products will need servicing. Therefore, you are solving for the cumulative probability that up to 14 products need servicing.
s = 14
p = 0.10
n = 200
TRUE = 1
BINOMDIST(14,200,0.10,1) = 0.092946 = 9.29%
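The four binomial problems above can be cross-checked with a short script. This is a minimal sketch in standard-library Python (math.comb needs Python 3.8 or later); the helper names binom_pmf and binom_cdf are hypothetical stand-ins for BINOMDIST with the last argument set to 0 and 1 respectively.

```python
from math import comb

def binom_pmf(s, n, p):
    """Probability of exactly s successes in n trials (BINOMDIST(s,n,p,0))."""
    return comb(n, s) * p ** s * (1 - p) ** (n - s)

def binom_cdf(s, n, p):
    """Probability of s or fewer successes in n trials (BINOMDIST(s,n,p,1))."""
    return sum(binom_pmf(k, n, p) for k in range(s + 1))

# Exactly 3 successes in 5 trials, p = 0.2
p_exactly_3 = binom_pmf(3, 5, 0.2)

# At least 10 successes in 12 trials, p = 0.65
p_at_least_10 = sum(binom_pmf(s, 12, 0.65) for s in (10, 11, 12))

# Between 4 and 6 heads in 10 flips of a fair coin
p_4_to_6_heads = binom_cdf(6, 10, 0.5) - binom_cdf(3, 10, 0.5)

# Fewer than 15 (that is, up to 14) of 200 products need servicing, p = 0.10
p_under_15 = binom_cdf(14, 200, 0.10)

print(round(p_exactly_3, 4), round(p_at_least_10, 6),
      round(p_4_to_6_heads, 5), round(p_under_15, 6))
```

Subtracting two cumulative probabilities, as in the coin problem, avoids summing the individual terms by hand.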

, the Normal distribution, which is continuous. of trials (n) and the probability of success on a single trial (p)

has a 20% probability of a "2" outcome and an 80% of a "3" outcome. . r not occurring)

This = p - This is the probability that the outcome of the event will be "1" and not "0" This = q - This is the probabability that the outcome of the event will be "0" and not "1"

elect the range of outputs to be counted and then select the cell that has the output to be counted, (Where outcome = 2) 0.20 chance of a "2" outcome.

ome to occur

he probability of a successful outcome in 1 trial is 20%?

as that last argument)

abilities that s = 10, s = 11, and s = 12)

15.13%

eads on 10 flips of a fair coin?

0.65625 65.63%

ility that less than 15 of 200 products will need servicing?


Population Proportions

When a sample of size n is used to estimate a population proportion, e.g. the proportion of a population who would vote for a certain candidate, it can be analyzed using the binomial distribution. The population proportion of success will be the same as p, the probability of success of a single trial.

The following relationships hold true for population proportions:
The mean of sample proportions = μp = p
The standard deviation of sample proportions = σp = SQRT { [ p (1 - p) ] / n }
The confidence interval of a population proportion would be = p ± z σp = p ± z SQRT { [ p (1 - p) ] / n }

Problem: A random sample of 350 people was chosen and each person was asked if they recognized a particular brand. 112 people recognized the brand. Calculate a 95% confidence interval of the proportion of the total population who recognize the brand.

Givens:
n = 350
p = 112 / 350 = 0.32
Confidence level = 0.95 - This means that 2.5% of area under Normal curve exists in each tail above and below the confidence interval.

z = NORMSINV(0.975) = 1.96 - 97.5% of the total area under the normal curve is to the left of a point 1.96 standard deviations from the mean

The confidence interval = p ± z σp = p ± z SQRT { [ p (1 - p) ] / n }
The confidence interval = 0.32 ± 0.04887
The confidence interval = 0.27113 to 0.36887

Which means that we can be 95% confident that between 27.1% and 36.9% of the total population are aware of the brand.
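The brand-recognition interval can be reproduced with a few lines of code. This is a minimal sketch using Python's standard library (statistics.NormalDist needs Python 3.8 or later); the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n = 350            # people sampled
recognized = 112   # people who recognized the brand

p_hat = recognized / n                    # sample proportion
z = NormalDist().inv_cdf(0.975)           # like NORMSINV(0.975)

# Standard deviation of the sample proportion: SQRT(p(1 - p) / n)
sigma_p = sqrt(p_hat * (1 - p_hat) / n)
margin = z * sigma_p

lower, upper = p_hat - margin, p_hat + margin
print(f"p = {p_hat:.2f}, margin = {margin:.5f}")
print(f"95% confidence interval: {lower:.5f} to {upper:.5f}")
```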

Determining Sample Size for a Desired Sampling Error


n = p (1-p) (z/e)**2

This is the minimum number of samples needed, n, to obtain a confidence interval of a certain width, e (or given sample error), on either side of the estimated proportion.

It is better to use the binomial distribution to calculate the p value when dealing with a proportion.
(The p value is the area under the curve outside of x - NOT the probability of a successful trial)

Problem: A manufacturer of circuit boards wants to keep the proportion of defective boards at 0.098. The manufacturer tested 156 randomly chosen boards and found 20 to be defective. Determine with 95% certainty (0.05 level of significance) whether the defective proportion has increased above 0.098.
n = 156
p = 0.098
x = 19

The probability that 20 or more boards are defective =

1 - the probability that 19 or fewer are defective = 1 - Cumulative probability of 19 defective = 1 - BINOMDIST(19,156,0.098,1) = 1 - 0.870142 = 0.129858

This p-value of 0.129858 is greater than 0.05 (the level of significance - the proportion of area under the curve to the right of the critical value). We therefore conclude that the large x value could have happened by chance and we fail to reject the NULL Hypothesis.
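The circuit board test can be sketched in code as a one-tailed binomial test. This is an illustration in standard-library Python (math.comb needs Python 3.8 or later); the helper name binom_cdf is a hypothetical stand-in for BINOMDIST(..., 1), and the sample size of 156 is taken from the problem statement.

```python
from math import comb

def binom_cdf(s, n, p):
    """Probability of s or fewer successes in n trials (BINOMDIST(s,n,p,1))."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(s + 1))

n = 156            # boards tested
p0 = 0.098         # historical proportion of defective boards
defective = 20     # defective boards found in the sample
alpha = 0.05       # level of significance (one-tailed)

# P(20 or more defective) = 1 - P(19 or fewer defective)
p_value = 1 - binom_cdf(defective - 1, n, p0)
reject_null = p_value < alpha

print(f"p-value = {p_value:.6f}")
print("Reject NULL Hypothesis" if reject_null else "Fail to reject NULL Hypothesis")
```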

To determine whether a known population proportion has changed, take a sample of the population and use the binomial distribution to calculate the probability of that sampling event (the number of successes, x, per given sample size, n, given p - the previously known probability of success in a single trial) and compare that probability to the desired level of significance. If this probability is less than the level of significance you have established (α for a one-tailed test and α/2 for a two-tailed test), then the NULL Hypothesis is rejected.

population who would vote for a certain candidate,

recognized a particular brand. ortion of the total population

sts in each tail above and below the confidence interval.

e left of a point 1.96 standard deviations from the mean

here is a 95% chance that between 27.1% and 36.9% of the total population

dth, e (or given sample error)

ue when dealing with a proportion.

cessful trial)

n has not increased above 0.098.

= 1 - BINOMDIST(19,256,0.098,1)

under the Normal curve to the right of the critical value)

ct the NULL Hypothesis.

and use the binomial distribution to mple size,n, given p - the previously know probability of success in a single trial)

iled test and /2 for a two-tailed test),

Histograms, Charting, and Descriptive Statistics

Civilian Labor Force (1,000) Year Males


1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 40,619 40,803 41,129 40,831 40,712 41,334 41,496 41,749 42,645 42,625 42,833 43,053 43,563 43,907 43,589 44,025 44,397 44,837 44,698 45,086 45,671 46,081 46,842 47,627 48,542 49,389 50,862 51,213 51,753 52,784 54,077 55,349 56,225 56,860 57,461 58,105 59,250 59,949 61,126 61,899 62,423 63,375

Females
14,974 15,580 16,285 17,000 17,593 17,957 17,492 18,266 19,456 19,591 20,093 20,455 20,689 21,608 21,758 22,134 22,734 23,351 24,043 25,003 25,642 26,770 27,954 28,810 29,580 30,148 31,491 32,972 34,214 35,399 37,323 38,959 40,747 41,866 42,952 44,255 44,994 46,740 47,852 49,085 50,436 51,996

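[Chart: Males vs. Female Hires - Civilian Labor Force (1,000) by Year, vertical axis 0 to 80,000]

1st - Highlight the Males and Females columns of data to create the chart. Do not highlight the year column.

2nd - In the 2nd step of creating the chart, click the Series tab and highlight the Year column as the x-axis.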

Descriptive Statistics - Tools / Data Analysis / Descriptive S Males Mean Standard Error Median Mode Standard Deviation Sample Variance Kurtosis Skewness Range Minimum Maximum Sum Count

1990 1991 1992 1993 1994 1995 1996 1997 1998 1999

64,805 65,149 65,767 66,329 66,788 67,516 67,434 68,884 69,547 70,295

52,925 53,328 54,356 54,982 56,322 56,871 57,503 58,788 59,583 60,718

Measures of Dispersion - Standard Deviation and Variance


x      x bar (x mean)
20     118
30     118
42     118
40     118
55     118
521    118

Sum ( (x - x bar)**2 ) = 195586

# of points (Statistical Function COUNT), n = 6
n - 1 = 5
Sum (Arithmetic Function SUM) = 708
Mean (Statistical Function AVERAGE) = 118

Individual Function Calculation of Stan Dev & Var

Variance (Statistical Function VAR) = 195586 / 5 = 39117.2
Standard Deviation (Statistical Function STDEV) = SQRT(39117.2) = 197.7807
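The variance and standard deviation of the six points can be checked both step by step and with library functions. This is a minimal sketch using Python's statistics module, whose variance and stdev functions divide by n - 1 in the same way as Excel's VAR and STDEV; the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev, variance

x = [20, 30, 42, 40, 55, 521]

n = len(x)          # COUNT
x_bar = mean(x)     # AVERAGE

# Step-by-step calculation: sum of squared deviations from the mean.
sum_sq_dev = sum((value - x_bar) ** 2 for value in x)
var_by_hand = sum_sq_dev / (n - 1)
sd_by_hand = sqrt(var_by_hand)

# Library equivalents of VAR and STDEV.
var_lib = variance(x)
sd_lib = stdev(x)

print("Sum of squared deviations:", sum_sq_dev)
print("Variance:", var_by_hand, var_lib)
print("Standard deviation:", round(sd_by_hand, 4), round(sd_lib, 4))
```

The single large value (521) dominates the sum of squared deviations, which is why the standard deviation is larger than the mean.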

Histogram and Descriptive Statistics


State                    Median Value Owner Occupied
Alabama                  $  53,700
Alaska                   $  94,400
Arizona                  $  80,100
Arkansas                 $  46,300
California               $ 195,500
Colorado                 $  82,700
Connecticut              $ 177,800
Delaware                 $ 100,100
District of Columbia     $ 123,900
Florida                  $  77,100
Georgia                  $  71,300
Hawaii                   $ 245,300
Idaho                    $  58,200
Illinois                 $  80,900
Indiana                  $  53,900
Iowa                     $  45,900
Kansas                   $  52,200
Kentucky                 $  50,500
Louisiana                $  58,500
Maine                    $  87,400
Maryland                 $ 116,500
Massachusetts            $ 162,800
Michigan                 $  60,600
Minnesota                $  74,000
Mississippi              $  45,600
Missouri                 $  59,800
Montana                  $  56,600
Nebraska                 $  50,400
Nevada                   $  95,700
New Hampshire            $ 129,400
New Jersey               $ 162,300
New Mexico               $  70,100
New York                 $ 131,600
North Carolina           $  65,800
North Dakota             $  50,800
Ohio                     $  63,500
Oklahoma                 $  48,100
Oregon                   $  67,100
Pennsylvania             $  69,700
Rhode Island             $ 133,500
South Carolina           $  61,100
South Dakota             $  45,200
Tennessee                $  58,400
Texas                    $  59,600
Utah                     $  68,900
Vermont                  $  95,500
Virginia                 $  91,000
Washington               $  93,400
West Virginia            $  47,900
Wisconsin                $  62,500
Wyoming                  $  61,600

Histogram - Tools / Data Analysis / Histogram
Bin: 45000, 70000, 95000, 120000, 145000, 170000, 195000, 220000, More

[Chart: Histogram - Median Income; x-axis: bins starting at 45000 (25000 blocks); y-axis: Frequency]

Sorting Data and Histogram To Find Patterns


Original Data
Gross Domestic Product Per Capita using Purchasing Power Parity, 1991

Country            1991 per capita GDP (dollars)
Australia          $ 16,085
Austria            $ 17,280
Belgium            $ 17,454
Canada             $ 19,178
Denmark            $ 17,621
Finland            $ 15,997
France             $ 18,227
Germany            $ 19,500
Greece             $  7,775
Iceland            $ 17,237
Ireland            $ 11,507
Italy              $ 16,896
Japan              $ 19,107
Luxembourg         $ 21,372
Netherlands        $ 16,530
New Zealand        $ 13,883
Norway             $ 16,904
Portugal           $  9,191
Spain              $ 12,719
Sweden             $ 16,729
Switzerland        $ 21,747
Turkey             $  3,491
United Kingdom     $ 15,720
United States      $ 22,204

Sorted Data
Gross Domestic Product Per Capita using Purchasing Power Parity, 1991

Country            1991 per capita GDP (dollars)
Turkey             $  3,491
Greece             $  7,775
Portugal           $  9,191
Ireland            $ 11,507
Spain              $ 12,719
New Zealand        $ 13,883
United Kingdom     $ 15,720
Finland            $ 15,997
Australia          $ 16,085
Netherlands        $ 16,530
Sweden             $ 16,729
Italy              $ 16,896
Norway             $ 16,904
Iceland            $ 17,237
Austria            $ 17,280
Belgium            $ 17,454
Denmark            $ 17,621
France             $ 18,227
Japan              $ 19,107
Canada             $ 19,178
Germany            $ 19,500
Luxembourg         $ 21,372
Switzerland        $ 21,747
United States      $ 22,204

Males vs. Female Hires


Descriptive Statistics output (Males and Females)

Statistic             Males            Females
Mean                  52371.30769      34646.6
Standard Error        1362.939367      2051.745
Median                50125.5          30819.5
Mode                  #N/A             #N/A
Standard Deviation    9828.295549      14795.34
Sample Variance       96595393.39      2.19E+08
Kurtosis              -1.329434402     -1.36515
Skewness              0.41216714       0.341443
Range                 29676            45744
Minimum               40619            14974
Maximum               70295            60718
Sum                   2723308          1801623
Count                 52               52

x       x - x bar    (x - x bar)**2
20         -98            9604
30         -88            7744
42         -76            5776
40         -78            6084
55         -63            3969
521        403          162409

Sum ( (x - x bar)**2 ) = 195586
n - 1 = 5

Direct Calculations of Standard Deviation and Variance

Variance = [Sum ( (x - x bar)**2 )] / [n-1] = 195586 / 5 = 39117.2

Standard Deviation = SQRT (Variance) = 197.7807 (Arithmetic Function SQRT)
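The direct calculation can be checked outside Excel. The following is a minimal Python sketch using the six x values from the worksheet; the variable names are illustrative and not part of the original workbook. It applies the n-1 (sample) formula and cross-checks the result against the standard library.

```python
from math import sqrt
import statistics

# Sample values from the worksheet
x = [20, 30, 42, 40, 55, 521]

n = len(x)
x_bar = sum(x) / n                                  # mean (AVERAGE)
sum_sq = sum((value - x_bar) ** 2 for value in x)   # Sum((x - x bar)**2)

variance = sum_sq / (n - 1)   # sample variance, as VAR computes it
std_dev = sqrt(variance)      # sample standard deviation, as STDEV computes it

print(x_bar, sum_sq, variance, round(std_dev, 4))

# Cross-check against Python's built-in sample variance
library_variance = statistics.variance(x)
```

Dividing by n-1 rather than n is what makes these the sample variance and sample standard deviation.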

Descriptive Statistics Calculations of Stand Dev & Variance

Descriptive Statistics - Tools / Data Analysis / Descriptive Statistics

Mean                  118
Standard Error        80.7436272
Median                41
Mode                  #N/A
Standard Deviation    197.7806866
Sample Variance       39117.2
Kurtosis              5.925570311
Skewness              2.429919032
Range                 501
Minimum               20
Maximum               521
Sum                   708
Count                 6
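Several of the remaining Descriptive Statistics figures can be reproduced the same way. The sketch below assumes the standard error is the sample standard deviation divided by SQRT(n), and that skewness and kurtosis use the bias-adjusted sample formulas behind Excel's SKEW and KURT functions. The workbook does not state these formulas, so treat them as assumptions that happen to match the output shown.

```python
from math import sqrt
import statistics

x = [20, 30, 42, 40, 55, 521]
n = len(x)

mean = statistics.mean(x)
s = statistics.stdev(x)            # sample standard deviation (n-1)
standard_error = s / sqrt(n)       # assumed definition of "Standard Error"
median = statistics.median(x)
value_range = max(x) - min(x)

# Standardized deviations used by the skewness and kurtosis formulas
z = [(value - mean) / s for value in x]

# Bias-adjusted sample skewness (assumed to match Excel's SKEW)
skewness = n / ((n - 1) * (n - 2)) * sum(t ** 3 for t in z)

# Bias-adjusted sample excess kurtosis (assumed to match Excel's KURT)
kurtosis = (
    n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(t ** 4 for t in z)
    - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
)

print(standard_error, median, value_range, skewness, kurtosis)
```

The large positive skewness and kurtosis both come from the single outlying value, 521.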

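Descriptive Statistics - Median Value Owner Occupied

Mean                  84209.80392
Standard Error        6018.541452
Median                68900
Mode                  #N/A
Standard Deviation    42980.98303
Sample Variance       1847364902
Kurtosis              3.556208617
Skewness              1.84961606
Range                 200100
Minimum               45200
Maximum               245300
Sum                   4294700
Count                 51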

Bin Range Requested By Histogram (in Yellow)

Interval    More than..    But not more than..    Frequency
1              45000            70000                27
2              70000            95000                11
3              95000           120000                 4
4             120000           145000                 4
5             145000           170000                 2
6             170000           195000                 1
7             195000           220000                 1
8             220000           245000                 1
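The frequencies for the requested bins can be recounted from the 51 state values. This is a rough Python sketch, assuming each bin counts values greater than the previous bin value and not more than its own, with a final "More" bin for anything above 220000. That convention is inferred from the "More than.. / But not more than.." headings, not documented in the workbook.

```python
from bisect import bisect_left

# Median value of owner-occupied homes, in the order listed (Alabama ... Wyoming)
values = [
    53700, 94400, 80100, 46300, 195500, 82700, 177800, 100100, 123900, 77100,
    71300, 245300, 58200, 80900, 53900, 45900, 52200, 50500, 58500, 87400,
    116500, 162800, 60600, 74000, 45600, 59800, 56600, 50400, 95700,
    129400, 162300, 70100, 131600, 65800, 50800, 63500, 48100, 67100, 69700,
    133500, 61100, 45200, 58400, 59600, 68900, 95500, 91000, 93400, 47900,
    62500, 61600,
]

# Requested bin values: 45000 upward in blocks of 25000
bins = [45000, 70000, 95000, 120000, 145000, 170000, 195000, 220000]

# counts[i] holds values with bins[i-1] < value <= bins[i];
# the last slot is the "More" bin (value > 220000).
counts = [0] * (len(bins) + 1)
for value in values:
    counts[bisect_left(bins, value)] += 1

for label, count in zip(bins + ["More"], counts):
    print(label, count)
```

Under this convention no state falls at or below 45000, and Hawaii (245,300) is the only value in the "More" bin.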


The data needs to be copied here and then sorted (Data / Sort).

Allowing Excel to pick bin size (leave bin range blank)

Histogram
Bin          Frequency
3491             1
8169.25          1
12847.5          3
17525.75        11
More             8

[Chart: Histogram; x-axis: per Cap GDP; y-axis: Frequency]
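The bin values Excel picked (3491, 8169.25, 12847.5, 17525.75) are evenly spaced: they start at the minimum and step by one quarter of the range. The Python sketch below rebuilds the bins on that assumption and recounts the frequencies. The choice of four intervals is inferred from the output shown here, not taken from Excel's documentation.

```python
from bisect import bisect_left

# 1991 per capita GDP (dollars), in the original (unsorted) order
gdp = [
    16085, 17280, 17454, 19178, 17621, 15997, 18227, 19500, 7775, 17237,
    11507, 16896, 19107, 21372, 16530, 13883, 16904, 9191, 12719, 16729,
    21747, 3491, 15720, 22204,
]

num_intervals = 4  # assumption: inferred from the four bin values in the output
width = (max(gdp) - min(gdp)) / num_intervals

# Bin values start at the minimum and rise in equal steps
edges = [min(gdp) + i * width for i in range(num_intervals)]

# counts[i] holds values with edges[i-1] < value <= edges[i];
# the last slot is the "More" bin.
counts = [0] * (len(edges) + 1)
for value in gdp:
    counts[bisect_left(edges, value)] += 1

print(edges)
print(counts)
```

Sorting is not needed for the counting itself; the sorted table mainly makes the clustering between roughly 15,700 and 17,600 easy to see by eye.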

Hypothesis Testing of a Population Mean


Hypothesis testing is a type of statistical test used to determine whether a change has occurred to a population mean.

Overall, two hypotheses are created and tested.

The first hypothesis, the NULL Hypothesis, is usually stated in terms such as "There has been no change in the population mean." This will normally involve an equal sign.

The second hypothesis, the Alternative Hypothesis, states that the population mean has changed in one of three ways:
1) The population mean has changed (increased OR decreased) - This involves a two-tailed test.
2) The population mean has decreased - This involves a one-tailed test with the left tail.
3) The population mean has increased - This involves a one-tailed test with the right tail.

In summary, hypothesis testing involves:
1) Determining the NULL hypothesis. This is normally that the original population mean has not changed.
2) Determining the level of certainty to which that NULL Hypothesis will be tested. If you want to establish a 95% certainty level, then α ("alpha") = 0.05.
3) Taking a sample of the population.
4) Calculating the sample mean. This value will be called x.
5) Graphing this sample mean on the normal curve created from the original population mean.
6) Rejecting or not rejecting the NULL Hypothesis based upon the results of either of the following tests (which are equivalent to each other).

6a) The critical value test - The significance level, α, is converted to a "critical value." This "critical value" is the number of standard deviations that the level of certainty is from the mean. For example, on a two-tailed test, an α of 0.05 translates to a 95% level of certainty. On a two-tailed test, this results in 2.5% of the total area under the Normal curve being greater than the right critical value and 2.5% of the area under the Normal curve being less than the left critical value. Each critical value is 1.96 standard deviations from the mean on the normal curve - NORMSINV(0.975) = 1.96.
The z value of the sample mean is calculated. The z-value is the number of standard deviations that the sample mean is from the population mean on a Normal curve derived from the population mean. If the z-value of the sample is farther away from the mean than the critical value (the z value of that level of certainty), then the NULL hypothesis is normally rejected.

6b) The p-value test - This is equivalent to the above test. A Normal curve is constructed based upon the population mean. α is the significance level. The significance level represents the percentage of the area under the normal curve that is outside the required level of certainty. For example, on a two-tailed test with a 95% required level of certainty, α = 0.05. The test is two-tailed, so 2.5% of the total area will be in one tail above the 95% certainty level and 2.5% of the area under the normal curve will be below the 95% certainty area. The p value, as used here, is equal to the percentage of area under the normal curve that lies beyond x in the tail on x's side of the mean. If the p value is less than the percentage of the area under the normal curve in one tail corresponding to α (α/2 in a two-tailed test, α in a one-tailed test), the NULL Hypothesis is normally rejected.
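The critical value in step 6a can be computed without Excel. The sketch below uses Python's standard-library NormalDist, whose inv_cdf method plays the role of NORMSINV. The helper name critical_value is hypothetical and chosen only for illustration.

```python
from statistics import NormalDist


def critical_value(alpha, two_tailed=True):
    """Return the positive z critical value for significance level alpha.

    A two-tailed test places alpha/2 in each tail; a one-tailed test
    places all of alpha in a single tail.
    """
    tail_area = alpha / 2 if two_tailed else alpha
    # inv_cdf(p) returns the z value with cumulative area p to its left,
    # which is what NORMSINV(p) returns in Excel.
    return NormalDist().inv_cdf(1 - tail_area)


# Two-tailed test at alpha = 0.05 -> NORMSINV(0.975)
print(round(critical_value(0.05, two_tailed=True), 2))   # 1.96

# One-tailed test at alpha = 0.02 -> NORMSINV(0.98)
print(round(critical_value(0.02, two_tailed=False), 2))  # 2.05
```

For a left-tailed test the critical value is the negative of the number returned, since the standard Normal curve is symmetric about zero.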

Two-tailed test - Testing whether a population mean changed in either direction

Problem: A manufacturer claims that the average thickness of metal sheets is 15 mls. and that the population standard deviation, σ, is 0.1 mls. 50 sheets are sampled, having a sample mean of 14.982 mls. At the 0.05 significance level (95% confidence level), determine whether the manufacturer's claim that the average thickness is 15 mls. is correct.

Givens:
n = 50
α = 0.05
σ = 0.1
x = 14.982
μ = 15

The NULL Hypothesis is that the population mean, μ, = 15 mls.
The ALTERNATE Hypothesis is that μ ≠ 15 mls. (Since we are testing whether a difference exists in either direction, this is a two-tailed test.)

1) Calculate Sample Standard Error: Sample Standard Error = σ / SQRT(n) = 0.014142
2) Calculate z value for sample: Z value = (x - μ) / (Sample Standard Error) = -1.27279
3) Calculate p value - the area under the Normal curve outside the sample z value: 1 - NORMSDIST(1.272792) = 0.101546
This states that 10.15% of the total area under the Normal curve lies beyond a point 1.27 standard deviations from the mean in each tail of the Normal curve.

THE P TEST CAN BE PERFORMED AT THIS POINT
The NULL Hypothesis is rejected if the p-value (the percentage of area under the Normal curve outside point x) is less than α/2 (in a two-tailed test) or α (in a one-tailed test).

The p-value = 0.101546 and is much larger than α/2 (0.025), so the NULL Hypothesis is not rejected - The manufacturer's claim appears to be valid.

TO PERFORM THE EQUIVALENT CRITICAL VALUE TEST, DO THE FOLLOWING:
1) Calculate the critical value of α - NORMSINV(0.975) = 1.96

This states that an α of 0.05 on a two-tailed test produces a confidence interval that goes from 1.96 standard deviations above the mean to 1.96 standard deviations below the mean.
If x is outside of this range (the absolute z value of x is greater than 1.96), then the NULL Hypothesis is rejected.

In this case, the absolute z value of x (1.27279) is less than the critical value (1.96) and therefore x is closer to the mean than the critical value, and we do not reject the NULL Hypothesis.
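The two-tailed problem above can be worked through in a few lines of Python. This is a sketch under the same givens; the variable names are illustrative. It follows the worksheet's convention of comparing the one-tail area beyond the sample z value with α/2, which gives the same decision as comparing twice that area with α.

```python
from math import sqrt
from statistics import NormalDist

# Givens from the metal sheet problem
n = 50          # sample size
alpha = 0.05    # significance level
sigma = 0.1     # population standard deviation
x_bar = 14.982  # sample mean
mu0 = 15.0      # population mean under the NULL Hypothesis

std_error = sigma / sqrt(n)
z = (x_bar - mu0) / std_error

standard_normal = NormalDist()

# p-value test: area in one tail beyond |z|, compared with alpha / 2
tail_area = 1 - standard_normal.cdf(abs(z))
reject_by_p = tail_area < alpha / 2

# Critical value test: |z| compared with NORMSINV(1 - alpha / 2)
critical = standard_normal.inv_cdf(1 - alpha / 2)
reject_by_critical = abs(z) > critical

print(round(std_error, 6), round(z, 5), round(tail_area, 6))
print(reject_by_p, reject_by_critical)
```

Both tests give the same answer, which is the equivalence the section describes.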

One-tailed test - Testing whether a population mean changed in only one direction

Problem: A furniture company states that its average delivery time is 15 days with a (population) standard deviation of 4 days. A random sample of 50 deliveries showed an average delivery time of 17 days. Determine within 98% certainty (0.02 significance level) whether delivery time has increased.

Givens:
n = 50
α = 0.02
σ = 4
x = 17
μ = 15

This is a one-tailed test because we are checking whether delivery time increased.
NULL Hypothesis - μ = 15
ALTERNATE Hypothesis - μ > 15

Using the P-test, we will determine if the p value (area above x under the normal curve) is less than α (since this is a one-tailed test).
1) Calculate Sample Standard Error: Sample Standard Error = σ / SQRT(n) = 0.565685
2) Calculate z value for sample: Z value = (x - μ) / (Sample Standard Error) = 3.535534
3) Calculate p value - the area under the Normal curve outside the sample z value = 1 - NORMSDIST(3.535534) = 1 - 0.999797 = 0.000203
This states that 0.000203 of the total area under the Normal curve lies above the point 3.535534 standard deviations above the mean.

This p-value (0.000203) is less than α (0.02), so the NULL Hypothesis is rejected - It appears likely that delivery time has increased.
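The one-tailed delivery time problem follows the same pattern, except that all of α sits in the right tail. The Python sketch below uses the givens from the problem; the variable names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

# Givens from the furniture delivery problem
n = 50        # sample size
alpha = 0.02  # significance level (98% certainty)
sigma = 4.0   # population standard deviation, in days
x_bar = 17.0  # sample mean delivery time, in days
mu0 = 15.0    # claimed mean delivery time, in days

std_error = sigma / sqrt(n)
z = (x_bar - mu0) / std_error

# Right-tailed test: the p value is the area above z,
# the same quantity as 1 - NORMSDIST(z) in Excel.
p_value = 1 - NormalDist().cdf(z)

# One-tailed, so compare the p value with alpha itself (not alpha / 2)
reject_null = p_value < alpha

print(round(std_error, 6), round(z, 6), round(p_value, 6), reject_null)
```

For a left-tailed test (checking for a decrease) the p value would instead be the area below z, that is NormalDist().cdf(z).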


Discrete Variables
Calculating means, standard deviations, and variances of the distributions of discrete variables.

x (Grade)    P(x) (Probability)    x * P(x)
4                 0.10               0.40
3                 0.20               0.60
2                 0.35               0.70
1                 0.25               0.25
0                 0.10               0.00
Sum               1.00               1.95

Expected Value = mean = x bar = Sum [ x * P(x) ] = 1.95

x (Grade)    Mean    (x - Mean)    Square of (x - Mean)    P(x)    { Square of (x-Mean) } * P(x)
4            1.95       2.05            4.2025             0.10         0.42025
3            1.95       1.05            1.1025             0.20         0.2205
2            1.95       0.05            0.0025             0.35         0.000875
1            1.95      -0.95            0.9025             0.25         0.225625
0            1.95      -1.95            3.8025             0.10         0.38025

Variance = SUM [ { Square of (x-Mean) } * P(x) ] = 1.2475

Standard Deviation = SQRT (Variance) = 1.116915 (Mathematical Function SQRT)

These are the variance and standard deviation of the probability distribution of x (the distribution of the grades).
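The expected value, variance, and standard deviation of the grade distribution can be computed directly from the probabilities. The following is a minimal Python sketch, with variable names chosen for illustration.

```python
from math import sqrt

# Grade distribution from the worksheet: grade -> probability
distribution = {4: 0.10, 3: 0.20, 2: 0.35, 1: 0.25, 0: 0.10}

# Probabilities of a valid distribution sum to 1
total_probability = sum(distribution.values())

# Expected value: Sum[x * P(x)]
mean = sum(x * p for x, p in distribution.items())

# Variance: Sum[(x - mean)^2 * P(x)], weighted by probability,
# so there is no n-1 divisor as there is for a sample variance.
variance = sum((x - mean) ** 2 * p for x, p in distribution.items())
std_dev = sqrt(variance)

print(round(mean, 2), round(variance, 4), round(std_dev, 6))
```

Because the probabilities already act as weights, the variance here describes the whole distribution rather than an estimate from a sample.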
