
by Benjamin L. Marciano, Jr.

Objectives
- Understand the statistical nature of research data
- Identify approaches in quantitative research planning (data collection, organization and analysis)
- Identify appropriate statistical techniques for a given study design

Considerations in Choosing Statistical Tools

1. Level of Measurement
2. Nature of Statistical Relationship
3. Parametric versus Nonparametric Test

Levels of Measurement
- Nominal: numbers are just categories
- Ordinal: ranks, hierarchy, order
- Interval: equally spaced scores; no mathematical concept of multiplicity; no true zero
- Ratio: highest level of measurement

Nature of Statistical Relationship (depends on the objective of the study)

- Association/Correlation
- Comparing groups or treatment effects
- Predicting a value of an attribute of interest
- Testing the effect of several factors on a response

Parametric vs. Nonparametric


Choice relies on
- the level of measurement
- the assumption of normality
- the sample size

Note: Parametric tests are generally more powerful than nonparametric tests when their assumptions hold.

Probability and Non-probability Sampling


- Probability Sampling: a procedure wherein every element of the population is given a (known) nonzero chance of being selected in the sample
- Nonprobability Sampling: a procedure wherein not all the elements in the population are given a chance of being included in the sample

Issues
- Choice relies on
  - Nature of measurement
  - Variation in the population
  - Tolerable margin of error
- Treatment of Heterogeneity
  - Stratification
  - Clustering
  - Multi-staging
- Formula

Testing Statistical Hypotheses


The Hypotheses
- Null hypothesis (Ho): the hypothesis of no difference or no effect
- Alternative hypothesis (Ha): the operational statement that is accepted in case the null hypothesis is rejected

Testing Statistical Hypotheses


Level of Significance (alpha)
- The size of the risk (0 < alpha < 1) of erroneously rejecting Ho that the researcher is willing to take
- The choice of alpha usually depends on the consequences associated with erroneously rejecting Ho.
  - alpha = 0.01 or less => very serious error
  - alpha = 0.05 => moderately serious error
  - alpha = 0.10 => not too serious error

A Summary of Possible Decisions in Hypothesis Testing


                         State of Nature (True Situation)
Decision (Data says)     Ho is true                           Ho is false
Reject Ho                TYPE I error                         CORRECT decision
                         chance of occurrence = alpha         chance of occurrence = 1 - beta
                         (level of significance)              (power of the test)
Do not reject Ho         CORRECT decision                     TYPE II error
                         chance of occurrence = 1 - alpha     chance of occurrence = beta

Testing Statistical Hypotheses


The p-value
- The smallest level of significance at which Ho will be rejected based on the information contained in the sample
- Alternative form of decision rule based on the p-value: Reject Ho if the p-value is less than or equal to the level of significance (alpha).
- Remember: If p is low, Ho must go!
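The decision rule above can be sketched in a few lines of Python. This is only an illustration: the function name `decide` and the default alpha of 0.05 are assumptions made for the example, not part of the slides.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Apply the rule: reject Ho when the p-value is <= alpha."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return "reject Ho" if p_value <= alpha else "do not reject Ho"


# The same p-value can lead to different decisions at different alphas.
print(decide(0.03))               # tested against alpha = 0.05
print(decide(0.03, alpha=0.01))   # tested against a stricter alpha
```

Fixing alpha before looking at the data keeps the rule honest; the p-value is then compared against a threshold chosen in advance.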

DESCRIPTIVE METHODS
Describing and Summarizing a Set of Measurements
- Presentation of Tables
- Construction of Graphs
- Computation of Summary Measures

How to Describe Data

- Averages describe the central value. Issue: Which average to use?
- Variation describes the extent of dispersion. Issue: Absolute or comparative dispersion?
- Skewness describes the degree of asymmetry. Where in the range of values do data cluster?
- Percentiles identify markers or thresholds.
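As a rough sketch of these four kinds of summary measures, the Python standard library is enough. The data values below are hypothetical (they do not come from the slides) and were chosen to be right-skewed so that the mean and median disagree.

```python
import statistics

# Hypothetical right-skewed measurements (not from the slides).
data = [12, 15, 15, 16, 18, 19, 21, 24, 30, 45]

# Averages: the mean is pulled toward the long tail, the median is not.
mean = statistics.mean(data)
median = statistics.median(data)

# Variation: absolute (standard deviation) vs. comparative (coefficient of variation).
sd = statistics.stdev(data)   # sample standard deviation
cv = sd / mean                # unit-free, useful for comparing data sets

# Skewness: average cubed standardized deviation (population moment form).
pop_sd = statistics.pstdev(data)
skewness = sum(((x - mean) / pop_sd) ** 3 for x in data) / len(data)

# Percentiles: the quartiles are the 25th, 50th and 75th percentiles.
q1, q2, q3 = statistics.quantiles(data, n=4)

print(mean, median, round(sd, 2), round(cv, 2), round(skewness, 2), (q1, q2, q3))
```

A positive skewness together with a mean above the median suggests the median is the safer "average" for data like these.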

Chi-Square Test
- The chi-square test determines the association between two (categorical) variables set in a contingency table.
- Generally regarded as a nonparametric test, though no parametric counterpart is gaining popularity.
- The Fisher Exact Test is an alternative to this test for 2x2 contingency tables.

Chi-Square Test
                Low Income   Middle Income   High Income
(-) attitude        31             29             27
(+) attitude        48             93            165
Total               79            122            192

The null and alternative hypotheses are:
- Ho: Socioeconomic status and attitude are independent.
- Ha: The two variables are associated.
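The chi-square statistic for the income-by-attitude table can be computed by hand as a sketch, using only the counts shown on the slide. The p-value line relies on a shortcut that is valid only for 2 degrees of freedom, where the upper-tail probability equals exp(-x/2); for other table sizes a statistical package would be needed. Variable names are illustrative.

```python
import math

# Observed counts: rows are (-) and (+) attitude; columns are low, middle, high income.
observed = [
    [31, 29, 27],
    [48, 93, 165],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Sum of (observed - expected)^2 / expected over all cells,
# where expected = row total * column total / grand total under independence.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)

# Shortcut valid only because df == 2 here: P(chi-square > x) = exp(-x / 2).
p_value = math.exp(-chi2 / 2)

print(f"chi-square = {chi2:.2f}, df = {df}, p-value = {p_value:.6f}")
```

If the printed p-value falls below the chosen alpha, Ho (independence) is rejected in favor of an association between socioeconomic status and attitude.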

Correlation Analysis
- Correlation means the degree of linear association between two measurements.
- The most common correlation measure is the Pearson coefficient, r. An alternative is the Spearman coefficient for rank data.
- Pearson's r ranges from -1 to +1. Values close to either -1 or +1 indicate strong correlation, while near-zero values mean minimal or no correlation.

Correlation Analysis
- Positive correlation means that as one variable increases, there is a tendency for the other to increase as well. Also, there is a tendency for both variables to decrease together.
- Negative correlation means that as one variable increases, there is a tendency for the other to decrease, and vice versa.

Correlation Analysis
- Example: Refer to the data showing 20 nations ranked with respect to births attended by trained health care personnel and maternal mortality rate. The Spearman correlation (rs) is -0.88 (p < 0.001). A significant negative correlation exists; there is a general tendency for maternal mortality to decrease when more births are attended by trained health care personnel.

Nation         Rank by Percentage      Rank by Maternal Mortality Rate
               of Births Attended      per 100,000 Live Births
Bangladesh            1                        18
Nepal                 2                        20
Morocco               3                        16
Pakistan              4                        17
Nigeria               5                        19
Kenya                 6                        14.5
Philippines           7                        11
Iran                  8                        12.5
Ecuador               9                        14.5
Portugal             10                         6.5
Vietnam              11                        12.5
Spain                12.5                       2.5
Panama               12.5                       9
Chile                14                        10
Switzerland          16                         2.5
USA                  16                         5
Hungary              16                         8
Netherlands          19                         6.5
Hong Kong            19                         4
Belgium              19                         1
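Because the data are already ranks, the Spearman coefficient can be sketched as the Pearson correlation computed on the two rank columns. The helper name `pearson` is an assumption for this example. The result should land close to the reported -0.88; small differences in the second decimal can come from how tied ranks are handled.

```python
import math

# Ranks by percentage of births attended, in the order the nations are listed.
attended = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12.5, 12.5, 14,
            16, 16, 16, 19, 19, 19]
# Ranks by maternal mortality rate, same nation order.
mortality = [18, 20, 16, 17, 19, 14.5, 11, 12.5, 14.5, 6.5, 12.5, 2.5, 9, 10,
             2.5, 5, 8, 6.5, 4, 1]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Spearman's rs is Pearson's r applied to the ranks.
rs = pearson(attended, mortality)
print(f"Spearman rs = {rs:.3f}")
```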

Paired-Sample Tests

- Paired-sample tests are used to test significant differences in scores between related observations or matched pairs.
- The two common types of paired-sample tests are:
  - Paired t-test (parametric)
  - Wilcoxon Signed Ranks Test (nonparametric)

Paired-Sample Tests

- The paired t-test is used when scores are assumed to be normally distributed, that is, when they follow a bell-shaped histogram.
- The Wilcoxon signed-ranks test is used when there is marked skewness in the data or when data are measured on an ordinal scale (ranks).

Independent-Sample Tests

- Independent-sample tests are used to determine if scores significantly differ between two disjoint or exclusive groups.
- The two most common types of independent-sample tests are:
  - Independent-sample t-test (parametric)
  - Mann-Whitney Test (nonparametric)

Independent-Sample Tests

- Like the paired t-test, the independent-sample t-test is used when scores are assumed to be normally distributed, that is, when they follow a bell-shaped histogram.
- The Mann-Whitney test is used when marked skewness in the observed measurements is present or when data are ordinal (ranks).

One-way Analysis of Variance


- The One-way ANOVA is the extension of the independent-sample t-test to the case of three or more disjoint or exclusive groups.
- When data are ordinal or when there is skewness, the counterpart procedure is the Kruskal-Wallis test.
- When the null hypothesis of equality of means is rejected, pairwise comparisons are necessary (e.g., Duncan, Tukey, Scheffe, etc.).

One-way Analysis of Variance


- Example: Four techniques are being used to perform a task. Five subjects each were included in the experimental design to determine whether or not the techniques yield, on the average, the same results (time, in seconds). The analytical results for the 4 techniques are as follows:

Technique      A       B       C       D
              58.7    62.7    55.9    60.7
              61.4    64.5    56.1    60.3
              60.9    63.1    57.3    60.9
              59.1    59.2    55.2    61.4
              58.2    60.3    58.1    62.3

Mean          59.7    62.0    56.5    61.1
Std. Dev.      1.4     2.2     1.2     0.8

One-way Analysis of Variance


- Ho: The means across the four techniques are equal.
- Ha: At least one mean is different.
- The F-test statistic has a p-value below 0.001 (displayed as 0.000).
- At the 5% level of significance, we reject Ho. At least one mean is different.
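The F statistic for the four-technique example can be reproduced with a short sketch that splits the total variation into between-group and within-group parts. The data are the 20 times from the example; the critical value 3.24 is the tabulated 5% point of the F distribution with 3 and 16 degrees of freedom. Variable names are illustrative.

```python
# Times (in seconds) for the five subjects under each technique.
groups = {
    "A": [58.7, 61.4, 60.9, 59.1, 58.2],
    "B": [62.7, 64.5, 63.1, 59.2, 60.3],
    "C": [55.9, 56.1, 57.3, 55.2, 58.1],
    "D": [60.7, 60.3, 60.9, 61.4, 62.3],
}

all_values = [v for values in groups.values() for v in values]
grand_mean = sum(all_values) / len(all_values)
group_means = {name: sum(values) / len(values) for name, values in groups.items()}

# Between-group sum of squares: variation of group means around the grand mean.
ss_between = sum(len(values) * (group_means[name] - grand_mean) ** 2
                 for name, values in groups.items())
# Within-group sum of squares: variation of observations around their own group mean.
ss_within = sum((v - group_means[name]) ** 2
                for name, values in groups.items() for v in values)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

f_stat = (ss_between / df_between) / (ss_within / df_within)

F_CRITICAL_5PCT = 3.24  # tabulated F(0.05; 3, 16)
print(f"F = {f_stat:.2f} on ({df_between}, {df_within}) df")
print("reject Ho" if f_stat > F_CRITICAL_5PCT else "do not reject Ho")
```

Comparing F with a tabulated critical value leads to the same decision as comparing the p-value with alpha; it is used here only because the standard library has no F distribution.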

N-way Analysis of Variance


- Allows analysis of main effects and interactions
- Most popular is the two-way ANOVA
- Presents difficulty for higher-order ANOVA
- Useful if there are blocking variables

Regression Analysis
- Regression analysis is a method for analyzing a variable by using information on other variables. The variable that is being explained or analyzed is called the response or dependent variable.
- The variables whose effects act on the response are called predictor, regressor or independent variables.

Regression Analysis
- When there is only one predictor, we have a simple linear regression model.
- Response = function (one predictor)
- Ex. O2 Consumption = function of Running Time
- The formal model is $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, where $\varepsilon_i$ is a random disturbance.
- O2 = intercept value + slope value times RunTime + random error
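A minimal least-squares sketch of the simple linear regression model follows. The running-time and oxygen-consumption numbers are hypothetical values invented for the illustration; they are not the data behind the slide's O2 example.

```python
# Hypothetical data (not from the slides): running time vs. oxygen consumption.
run_time = [8, 9, 10, 11, 12, 13]
o2 = [60, 56, 52, 49, 45, 42]

n = len(run_time)
mean_x = sum(run_time) / n
mean_y = sum(o2) / n

# Least-squares estimates: slope = Sxy / Sxx, intercept = mean_y - slope * mean_x.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(run_time, o2))
sxx = sum((x - mean_x) ** 2 for x in run_time)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residuals play the role of the random error term in the model.
residuals = [y - (intercept + slope * x) for x, y in zip(run_time, o2)]

print(f"O2 = {intercept:.3f} + ({slope:.3f}) x RunTime")
```

With an intercept in the model, the least-squares residuals always sum to zero, which is a quick sanity check on a hand computation.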

Regression Analysis
- When there are many predictors, we have a multiple linear regression model (MLRM).
- Response = function (several predictors)
- Ex. O2 = function of RunTime and Age
- The MLRM is written as $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + \varepsilon_i$, where
  - $Y_i$ is the value of the response variable in the ith observation
  - $\beta_0, \beta_1, \beta_2, \dots, \beta_k$ are parameters of the model
  - $X_{1i}, X_{2i}, \dots, X_{ki}$ are the values of the predictors in the ith observation
  - $\varepsilon_i$ is the error term

So, I want to use regression. What is the first thing I should do?

IDENTIFY YOUR RESPONSE VARIABLE!
- This should be quantifiable.
- Yes/No, High/Low, and similar categorical responses are not valid here.

How about my predictors?

- You may choose quantitative and dummy variables as your predictors. Quantitative predictors should be correlated with the response.
- Make sure there is no redundancy among predictors. Check this by computing their correlations. If there are correlated predictors, choose only the one that has practical significance to your study. There are advanced statistical methods that treat correlated predictors.

What's next?
- You are now ready to fit the regression equation. To illustrate, consider an example.

Renar Interiors operates in medium-sized business areas. In considering an expansion into other areas of similar size, it wishes to investigate how sales (Y) can be predicted from the size of the target market, i.e., the 20-39 age group (X1), and the average monthly income of households in the area (X2). Data on these variables in the most recent year for the 21 business areas where the company operates are given below.

Renar Interiors Data


- See the provided copies.

How to use Excel?

In Excel, click Tools, Data Analysis, Regression.
1. Supply the Input Y Range box with the appropriate cell addresses.
2. Supply the Input X Range box with the appropriate cell addresses of the X1 and X2 values, contiguously placed in the data matrix.
3. Supply the Output Range with any convenient location.
4. Excel returns an output of the analysis.

Results
- The Coefficients column gives the estimated values of the regression parameters.
- Here, the fitted model is: Y = -3.887 + 0.146 X1 + 0.929 X2
- SALES = -3.887 + 0.146 x Market Size + 0.929 x Income

How do I interpret the fitted model?

-3.887
- The value of the intercept, -3.887, is not interpreted since the two predictors do not take values equal to zero.

0.146 x Market Size
- There is an estimated increase of 0.146 million pesos (i.e., P146,000) in mean sales when the size of the target market increases by one percent, holding the average monthly family income constant.

0.929 x Income
- There is an estimated increase of 0.929 million pesos (i.e., P929,000) in mean sales when the average monthly family income increases by one thousand pesos, holding the size of the target market constant.
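The "holding the other predictor constant" reading of each coefficient can be checked with a small sketch built on the fitted equation. The function name `predict_sales` and the input values (market size 50, income 15) are hypothetical and chosen only for illustration; they are not taken from the Renar data.

```python
def predict_sales(market_size: float, income: float) -> float:
    """Estimated mean sales (million pesos) from the fitted Renar model."""
    return -3.887 + 0.146 * market_size + 0.929 * income


# Hypothetical baseline values, for illustration only.
base = predict_sales(market_size=50, income=15)

# Raise one predictor by one unit while the other stays fixed.
effect_market = predict_sales(market_size=51, income=15) - base
effect_income = predict_sales(market_size=50, income=16) - base

print(f"baseline prediction: {base:.3f}")
print(f"one more unit of market size adds {effect_market:.3f}")
print(f"one more thousand pesos of income adds {effect_income:.3f}")
```

Each difference recovers the corresponding coefficient, which is exactly what the interpretation statements say.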

Can I use the model already for prediction purposes?

NOT YET!
- You still need to investigate the model's goodness-of-fit.
- You need to test whether your predictors are significant.
- You must also verify that the assumptions of regression hold.

How do I assess goodness-of-fit?

Three things:
- ANOVA
- F-test
- R squared

They lurk somewhere in the Excel output!

Analysis of Variance (ANOVA)


- The ANOVA is a decomposition of the total variation in the response into explained (pattern) and unexplained (error) parts.
- The explained variability is the amount of variation in the response variable that may be attributed to the predictors explicitly stated in the model.
- The unexplained variability is the amount of variation attributed to random error.

Results from the ANOVA table for the Renar Interiors data
- The first column in the table labels the sources of variation (Regression and Residual).
- The df column refers to the degrees of freedom. The df for Regression is always the number of regression parameters minus one. The df for Residual is the sample size minus the number of regression parameters. The total df is the sum of these two degrees of freedom. For the Renar data (3 parameters, 21 business areas), these are 2, 18 and 20.

Results from the ANOVA table for the Renar Interiors data
- SS refers to Sum of Squares. The value 240.3407 represents the amount of variation in sales explained by the two predictors in the model. The value 21.9658 represents the unexplained variation. These two values sum to 262.3065. There is good fit if the Regression Sum of Squares is much larger than the Residual Sum of Squares.
- MS refers to Mean Squares. The values in this column are the ratios of each sum of squares to its respective degrees of freedom. Mean squares have no physical meaning but are instrumental in computing the F-statistic.
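The df, MS and F entries can be rebuilt from the two sums of squares as a sketch. The inputs are the values quoted for the Renar data (21 areas, an intercept plus two slopes). The p-value line uses a closed form that holds only when the regression df equals 2; it is included as a check, not as a general formula. Variable names are illustrative.

```python
# Values quoted from the Renar Interiors ANOVA table.
ss_regression = 240.3407
ss_residual = 21.9658
n_obs = 21       # business areas
n_params = 3     # intercept + two slopes

# Degrees of freedom as described above.
df_regression = n_params - 1
df_residual = n_obs - n_params
df_total = df_regression + df_residual

# Mean squares are sums of squares divided by their degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# The F statistic is the ratio of the two mean squares.
f_stat = ms_regression / ms_residual

# Closed form valid only for df_regression == 2:
# P(F > f) = (1 + 2 f / df_residual) ** (-df_residual / 2)
p_value = (1 + df_regression * f_stat / df_residual) ** (-df_residual / 2)

print(f"df = ({df_regression}, {df_residual}), total {df_total}")
print(f"MS regression = {ms_regression:.4f}, MS residual = {ms_residual:.4f}")
print(f"F = {f_stat:.2f}, p-value = {p_value:.3g}")
```

The computed F should agree with the F-statistic of 98.47 reported for the Renar data, which is a useful check that the table has been read correctly.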

The F-test
- The F-test determines if regression is meaningful for the data at hand. When the p-value is small (see Significance F in the Excel output), it means that there is at least one significant predictor in the analysis.

What is the role of the p-value?


- The p-value is our evidence against the hypothesis that we do not have any significant predictor in the data. When it is small, we reject that hypothesis.
- Technically, we call the above hypothesis our null hypothesis or Ho.
- Remember: WHEN p IS LOW, Ho MUST GO!
- Rule of Thumb: The p-value is low if it is less than 0.05.

Results from the Renar Data


- In the Renar data, the F-statistic is 98.47 with an associated p-value of $2.03 \times 10^{-10}$ (almost zero!).
- Since the p-value is lower than 0.05, we reject Ho. We can therefore conclude that at least one of our two predictors can significantly explain sales.

The Coefficient of Multiple Determination (R squared)


- The coefficient of multiple determination, R squared, is a goodness-of-fit measure.
- R squared is a figure of merit; the higher the R squared, the more successful the model is in explaining the variation in the response using the set of predictors.

Results from the Renar Data


- The R squared is normally expressed as a percentage and is interpreted as the amount of variability in the response explained by the independent variables.
- The value of the R squared = 0.9163 means that 91.63% of the variation in sales can be explained by size of target market and average monthly family income.

CAVEAT on the Coefficient of Multiple Determination (R squared)

- A drawback of the R squared is that it naturally increases as the number of predictors increases. This is true even if the added predictor(s) are not significant.
- As an alternative, we use the adjusted R squared (Ra squared).
- Ra squared penalizes the R squared for the addition of regressors that do not contribute to the explanatory power of the model.
- The Ra squared is never larger than the R squared, can decrease as regressors are added and, for poorly fitting models, may even be negative.
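The penalty can be made concrete with the usual adjusted R squared formula, Ra squared = 1 - (1 - R squared)(n - 1)/(n - k - 1), where n is the sample size and k the number of predictors. The sketch below applies it to the Renar figures (R squared = 0.9163, n = 21, k = 2); the function name and the second, poorly fitting case are hypothetical illustrations.

```python
def adjusted_r_squared(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R squared: 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Renar Interiors: R squared = 0.9163 with 21 areas and 2 predictors.
renar_adjusted = adjusted_r_squared(0.9163, n_obs=21, n_predictors=2)

# Hypothetical poorly fitting model: the adjusted value drops below zero.
poor_adjusted = adjusted_r_squared(0.05, n_obs=21, n_predictors=5)

print(f"Renar adjusted R squared = {renar_adjusted:.3f}")
print(f"Poor-fit adjusted R squared = {poor_adjusted:.3f}")
```

For the Renar model the adjustment is small because the fit is strong and only two predictors are used relative to 21 observations.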

The T-tests
- The t-test helps in assessing if an individual predictor is significant.
- Let us interpret the t-tests for the Renar data.
  - X Variable 1 (Target Market Size): Since p = $2.05 \times 10^{-6}$ < 0.05, size of target market is a significant predictor of sales.
  - X Variable 2 (Average Monthly Income): Since p = 0.0353 < 0.05, average monthly income is a significant predictor of sales.
  - Intercept: Since p = 0.5466 > 0.05, the intercept is not significantly different from zero.
