
Business Research Methods

Assignment On Statistical Data Analysis Univariate Analysis Bivariate Analysis

Submitted By: Tariq Mahmood Asghar Roll No. # 77 MBA (A, B1)

DATA ANALYSIS
The terms "statistics" and "data analysis" mean the same thing -- the study of how we describe, combine, and make inferences from numbers. A lot of people are scared of numbers (quantiphobia), but statistics has got less to do with numbers, and more to do with rules for arranging them. It even lets you create some of those rules yourself, so instead of looking at it like a lot of memorization, it's best to see it as an extension of the research mentality, something researchers do (crunch numbers) to obtain complete and total power over the numbers. After awhile, the principles behind the computations become clear, and there's no better way to accomplish this than by understanding the research purpose of statistics.

Basic Statistical Analysis

UNIVARIATE ANALYSIS


In mathematics, univariate refers to an expression, equation, function, or polynomial of only one variable. Data analysis begins with univariate analysis: it is the first step once a data set is ready. Various descriptive statistics provide valuable basic information about variables and are used to determine which analysis methods should be employed. There are graphical and numerical methods for conducting univariate analysis and normality tests. Graphical methods produce plots such as the stem-and-leaf plot, histogram, and P-P plot, which are intuitive and easy to interpret; some are descriptive and others are theory-driven. Numerical methods compute measures of central tendency and dispersion. The mean, median, and mode measure the central tendency of a variable, while measures of dispersion include the variance, standard deviation, range, and interquartile range (IQR).

Statistical methods rest on various underlying assumptions. One common assumption is that a random variable is normally distributed. In many statistical analyses, normality is often conveniently assumed without any empirical evidence or test, yet normality is critical in many statistical methods: when this assumption is violated, interpretation and inference may not be reliable or valid.
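As a minimal sketch of the numerical side of univariate analysis, the snippet below (Python, using the standard statistics module and scipy) computes the usual measures of central tendency and dispersion and runs a Shapiro-Wilk normality test. The sample data and variable names are illustrative assumptions, not part of the original text.

```python
import statistics
from scipy import stats

# Illustrative sample of a single variable (hypothetical data)
scores = [12, 15, 14, 10, 8, 12, 16, 14, 18, 11, 9, 13]

# Measures of central tendency
print("mean:  ", statistics.mean(scores))
print("median:", statistics.median(scores))
print("mode:  ", statistics.mode(scores))

# Measures of dispersion
print("range: ", max(scores) - min(scores))
print("var:   ", statistics.variance(scores))   # sample variance
print("stdev: ", statistics.stdev(scores))      # sample standard deviation
qs = statistics.quantiles(scores, n=4)          # quartile cut points
print("IQR:   ", qs[2] - qs[0])

# A simple numerical normality check (Shapiro-Wilk test)
w, p = stats.shapiro(scores)
print("Shapiro-Wilk W = %.3f, p = %.3f" % (w, p))
```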

MEASURES OF CENTRAL TENDENCY


The most commonly used measure of central tendency is the mean. To compute the mean, you add up all the numbers and divide by how many numbers there are. It is the arithmetic average: not a halfway point, but a kind of center that balances high numbers with low numbers. For this reason, it is most often reported along with some simple measure of dispersion, such as the range, which is expressed as the lowest and highest number.

The median is the number that falls in the middle of a range of numbers. It is not the average; it is the halfway point: there are always just as many numbers above the median as below it. In cases where there is an even set of numbers, you average the two middle numbers. The median is best suited for data that are ordinal, or ranked. It is also useful when you have extremely low or high scores.

The mode is the most frequently occurring number in a list of numbers. It is the closest thing to what people mean when they say something is average or typical, and it does not even have to be a number: it will be a category when the data are nominal or qualitative. The mode is useful when you have a highly skewed set of numbers, mostly low or mostly high. You can also have two modes (a bimodal distribution) when one group of scores is mostly low and the other group is mostly high, with few in the middle.
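To make these distinctions concrete, here is a small, hypothetical example (Python, invented numbers) showing how one extreme score pulls the mean but not the median, and how statistics.multimode reports both modes of a bimodal list.

```python
import statistics

# One extreme salary drags the mean upward; the median stays put
salaries = [30, 32, 34, 35, 36, 38, 40, 200]   # hypothetical values (in $1000s)
print(statistics.mean(salaries))     # 55.625 -- inflated by the outlier
print(statistics.median(salaries))   # 35.5   -- the halfway point

# A bimodal distribution: one cluster of low scores, one of high scores
scores = [2, 2, 3, 3, 3, 9, 9, 9, 10]
print(statistics.multimode(scores))  # [3, 9] -- both modes are reported
```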

MEASURES OF DISPERSION
In data analysis, the purpose of statistically computing a measure of dispersion is to discover the extent to which scores differ from, cluster around, or spread around a measure of central tendency. The most commonly used measure of dispersion is the standard deviation. You first compute the variance, which is calculated by subtracting the mean from each number, squaring each difference, and dividing the grand total (the Sum of Squares) by how many numbers there are (or, for a sample, by one less than that number). The square root of the variance is the standard deviation.

The standard deviation is important for many reasons. One reason is that, once you know the standard deviation, you can standardize by it. Standardization is the process of converting raw scores into what are called standard scores, which allow you to better compare groups of different sizes. Standardization is not required for data analysis, but it becomes useful when you want to compare different subgroups in your sample, or groups in different studies. A standard score is called a z-score (not to be confused with a z-test), and is calculated by subtracting the mean from each and every number and dividing by the standard deviation. Once you have converted your data into standard scores, you can then use probability tables to estimate the likelihood that a certain raw score will appear in the population. This is an example of using a descriptive statistic (the standard deviation) for inferential purposes.
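The steps described above can be illustrated with a short sketch (Python, hypothetical data); it computes the variance and standard deviation by hand and then converts the raw scores to z-scores. This is a generic illustration of the same calculations, not the author's worked example.

```python
import math

raw = [4, 8, 6, 5, 3, 7, 9, 6]           # hypothetical raw scores
n = len(raw)
mean = sum(raw) / n

# Variance: sum of squared deviations (Sum of Squares) divided by n
ss = sum((x - mean) ** 2 for x in raw)
variance = ss / n
std_dev = math.sqrt(variance)             # standard deviation

# Standard (z) scores: subtract the mean, divide by the standard deviation
z_scores = [(x - mean) / std_dev for x in raw]
print("mean =", mean, "sd =", round(std_dev, 3))
print([round(z, 2) for z in z_scores])
```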

Methods of Univariate Analysis

HYPOTHESIS TESTING


A hypothesis is an unproven proposition or supposition that tentatively explains certain facts or phenomena; it is a proposition that can be empirically tested. Setting up and testing hypotheses is an essential part of statistical inference. In order to formulate such a test, usually some theory has been put forward, either because it is believed to be true or because it is to be used as a basis for argument, but has not been proved; for example, the claim that a new drug is better than the current drug for treatment of the same symptoms.

In each problem considered, the question of interest is simplified into two competing claims, or hypotheses, between which we have a choice: the null hypothesis, denoted H0, against the alternative hypothesis, denoted H1. These two competing hypotheses are not, however, treated on an equal basis: special consideration is given to the null hypothesis. The hypotheses are often statements about population parameters such as the expected value and variance; for example, H0 might be that the expected height of ten-year-old boys in the Scottish population is not different from that of ten-year-old girls. A hypothesis might also be a statement about the distributional form of a characteristic of interest, for example that the height of ten-year-old boys is normally distributed within the Scottish population. The outcome of a hypothesis test is either "Reject H0 in favour of H1" or "Do not reject H0".

Type I and Type II Errors

A Type I error occurs when a true null hypothesis is rejected; a Type II error occurs when a false null hypothesis is not rejected.

Alternate Way of Testing the Hypothesis:


$Z_{obs} = \dfrac{\bar{X} - \mu}{S_{\bar{X}}}$, where $\bar{X}$ is the sample mean, $\mu$ the hypothesized population mean, and $S_{\bar{X}} = S/\sqrt{n}$ the standard error of the mean.

Univariate Hypothesis Test t-Test:


$t_{obs} = \dfrac{\bar{X} - \mu}{S_{\bar{X}}}$, with the standard error $S_{\bar{X}} = S/\sqrt{n}$ estimated from the sample standard deviation $S$.
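As a hedged illustration of the univariate test statistic above, the following sketch (Python, invented data and an assumed hypothesized mean) computes $t_{obs}$ by hand and checks it against scipy.stats.ttest_1samp.

```python
import math
from scipy import stats

sample = [3.2, 3.8, 4.1, 2.9, 3.6, 4.0, 3.4, 3.7]   # hypothetical observations
mu0 = 3.0                                            # hypothesized population mean

n = len(sample)
mean = sum(sample) / n
s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))  # sample std dev
se = s / math.sqrt(n)                                # standard error of the mean

t_obs = (mean - mu0) / se
print("t_obs (by hand):", round(t_obs, 3))

# Same test via scipy, which also returns a two-tailed p-value
t_stat, p_value = stats.ttest_1samp(sample, mu0)
print("t =", round(t_stat, 3), "p =", round(p_value, 4))
```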

Testing a Hypothesis about a Distribution


The chi-square test is a test for significance in the analysis of frequency distributions: it compares observed frequencies with expected frequencies to assess goodness of fit.

CHI-SQUARE
A technique designed for less-than-interval level data is chi-square (pronounced ky-square), and its most common forms are the chi-square test for contingency and the chi-square test for independence. The chi-square test for contingency is interpreted as a strength-of-association measure, while the chi-square test for independence (which requires two samples) is a nonparametric test of significance that essentially rules out as much sampling error and chance as possible.
$\chi^2 = \sum_{i} \dfrac{(O_i - E_i)^2}{E_i}$

where $\chi^2$ is the chi-square statistic, $O_i$ is the observed frequency in the ith cell, and $E_i$ is the expected frequency in the ith cell.
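As a brief, hypothetical illustration of the chi-square test for independence mentioned above (two variables cross-tabulated in a contingency table), the sketch below uses scipy.stats.chi2_contingency on made-up counts; the table and its labels are assumptions for demonstration only.

```python
from scipy import stats

# Hypothetical 2x2 contingency table: gender (rows) by purchase decision (columns)
observed = [[30, 20],    # male:   bought, did not buy
            [10, 40]]    # female: bought, did not buy

chi2, p, dof, expected = stats.chi2_contingency(observed)
print("chi-square =", round(chi2, 3), "df =", dof, "p =", round(p, 4))
print("expected frequencies:", expected)
```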

Chi-Square Goodness of Fit Test


When an analyst attempts to fit a statistical model to observed data, he or she may wonder how well the model actually reflects the data. How "close" are the observed values to those which would be expected under the fitted model? One statistical test that addresses this issue is the chi-square goodness of fit test. This test is commonly used to test association of variables in two-way tables (see "Two-Way Tables and the Chi-Square Test"), where the assumed model of independence is evaluated against the observed data. In general, the chi-square test statistic is of the form $\chi^2 = \sum \dfrac{(O - E)^2}{E}$. If the computed test statistic is large, then the observed and expected values are not close and the model is a poor fit to the data.
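A minimal goodness-of-fit sketch follows, assuming a hypothetical die-rolling experiment and equal expected frequencies; scipy.stats.chisquare computes the same $\sum (O-E)^2/E$ statistic described above.

```python
from scipy import stats

# Hypothetical observed counts from 60 rolls of a die
observed = [12, 9, 11, 8, 10, 10]
expected = [10] * 6              # fair-die model: equal expected frequencies

chi2, p = stats.chisquare(f_obs=observed, f_exp=expected)
print("chi-square =", round(chi2, 3), "p =", round(p, 4))
# A large chi-square (small p) would suggest the fair-die model fits poorly
```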

BIVARIATE ANALYSIS
Bivariate analysis is the simultaneous analysis of two variables. It is usually undertaken to see if one variable, such as gender, is related to another variable, perhaps attitudes toward male/female equality. This analysis ascertains whether the values of the dependent variable tend to coincide with those of the independent variable. In most instances, the association between two variables is assessed with a bivariate statistical technique (see below for exceptions). The three most commonly used techniques are contingency tables, analysis of variance (ANOVA), and correlations. The basic bivariate analysis is then usually extended to a multivariate form to evaluate whether the association can be interpreted as a relationship.

The importance of bivariate analysis is sometimes overlooked because it is superseded by multivariate analysis. This misperception is reinforced by scientific journals that report bivariate associations only in passing, if at all. This practice creates the misleading impression that analysis begins at the multivariate level. In reality, the multiple-variable model rests upon the foundation laid by the thorough analysis of the two-variable model. The proper specification of the theoretical model at the bivariate level is essential to the quality of subsequent multivariate analysis.

Some forms of bivariate analysis require that variables be differentiated into independent or dependent types. For example, the analysis of group differences in means, either by t-test or ANOVA, treats the group variable as independent, which means that the procedure is asymmetrical: different values are obtained if the independent and dependent variables are inverted. In contrast, the Pearson correlation coefficient, the most widely used measure of bivariate association, yields identical values irrespective of which variable is treated as dependent, meaning that it is symmetrical: the same coefficient and probability level are obtained if the two variables are interchanged. Similarly, the chi-squared ($\chi^2$) test for independence between nominal variables yields the same value irrespective of whether the dependent variable appears in the rows or the columns of the contingency table. Although the test of statistical significance is unchanged, switching variables yields different expressions of the association because row and column percentages are not interchangeable. Unlike the correlation coefficient, where both the statistic and the test of statistical significance are symmetrical, only the probability level is symmetrical in the $\chi^2$ technique.

Designating one variable as independent and the other variable as dependent is productive even when this differentiation is not required by the statistical method. The value of this designation lies in setting the stage for subsequent multivariate analysis, where the differentiation is required by most statistical techniques. It is also helpful in the bivariate analysis of the focal relationship, because multivariate analysis ultimately seeks to determine whether the bivariate association is indicative of a state of dependency between the two variables. This approach makes more sense if the original association is conceptualized as a potential relationship.

Methods of Bivariate Analysis

CORRELATION


The most commonly used relational statistic is correlation; it is a measure of the strength of some relationship between two variables, not of causality. Interpretation of a correlation coefficient does not allow even the slightest hint of causality. The most a researcher can say is that the variables share something in common, that is, they are related in some way. The more two things have in common, the more strongly they are related. There can also be negative relations, but the important quality of correlation coefficients is not their sign but their absolute value. A correlation of -.58 is stronger than a correlation of .43, even though with the former the relationship is negative. The following table lists the interpretations for various correlation coefficients:

.8 to 1.0   very strong
.6 to .8    strong
.4 to .6    moderate
.2 to .4    weak
.0 to .2    very weak

The most frequently used correlation coefficient in data analysis is the Pearson product moment correlation. It is symbolized by the small letter r, and is fairly easy to compute from raw scores using the following formula:

$r = \dfrac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \, \sum (Y - \bar{Y})^2}}$

If you square the Pearson correlation coefficient, you get the coefficient of determination, symbolized by R squared (R^2). It is the amount of variance in one variable accounted for by the other. R^2 can also be computed by using the statistical technique of regression, but in that situation it is interpreted as the amount of variance in one variable explained by another. If you subtract the coefficient of determination from one, you get something called the coefficient of alienation, which is sometimes seen in the literature.
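The sketch below (Python, made-up paired data) computes Pearson's r with scipy.stats.pearsonr and squares it to obtain the coefficient of determination; one minus that value gives the coefficient of alienation described above.

```python
from scipy import stats

# Hypothetical paired observations (e.g., hours studied vs. exam score)
x = [2, 4, 5, 7, 8, 10, 11, 13]
y = [50, 55, 60, 64, 70, 73, 75, 82]

r, p = stats.pearsonr(x, y)
r_squared = r ** 2                    # coefficient of determination
alienation = 1 - r_squared            # coefficient of alienation

print("r =", round(r, 3), "p =", round(p, 4))
print("R^2 =", round(r_squared, 3), " 1 - R^2 =", round(alienation, 3))
```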

REGRESSION
Regression is the closest thing to estimating causality in data analysis, and that is because it predicts how well the numbers "fit" a projected straight line. There are also advanced regression techniques for curvilinear estimation. The most common form of regression, however, is linear regression, in which the least squares method is used to find the equation that best fits a line representing what is called the regression of y on x. The procedure is similar to computing minima in calculus. Instead of finding the perfect number, however, one is interested in finding the perfect line, such that there is one and only one line (represented by an equation) that best represents, or fits, the data, regardless of how scattered the data points are. The slope of the line (equation) provides information about predicted directionality, and the estimated coefficients (or beta weights) for x and y (the independent and dependent variables) indicate the power of the relationship. Use of a regression formula (not shown here because it is too large; only the generic regression equation is shown) produces a number called R-squared, which is a kind of conservative, yet powerful, coefficient of determination. Interpretation of R-squared is somewhat controversial, but it generally uses the same strength table as correlation coefficients, and at a minimum researchers say it represents "variance explained."
$\hat{Y} = a + \beta X$

where:

a = the constant, alpha, or intercept (the value of Y when X = 0)
$\beta$ = the slope, or beta coefficient (the change in Y associated with a one-unit change in X)
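A short, hypothetical sketch of the y-on-x regression described above, using scipy.stats.linregress to obtain the intercept (a), slope (b), and R-squared; the data are invented for illustration only.

```python
from scipy import stats

# Hypothetical data: advertising spend (x) and sales (y)
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [3.1, 4.2, 5.8, 6.9, 8.2, 9.1, 10.4, 11.8]

result = stats.linregress(x, y)
a, b = result.intercept, result.slope      # Y-hat = a + bX
r_squared = result.rvalue ** 2             # "variance explained"

print("Y-hat = %.2f + %.2f X" % (a, b))
print("R-squared =", round(r_squared, 3), "p =", round(result.pvalue, 4))
```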

Z-TESTS, F-TESTS, AND T-TESTS


These refer to a variety of tests used for inferential purposes. Z-tests are not to be confused with z-scores. Z-tests come in a variety of forms, the most popular being: (1) to test the significance of correlation coefficients; and (2) to test the equivalence of sample proportions to population proportions, as in whether the number of minorities in your sample is proportionate to the number in the population. Z-tests essentially check for linearity and normality, allow some rudimentary hypothesis testing, and help control the risk of Type I and Type II error. F-tests are much more powerful, as they assess how much of the variance in one variable is accounted for by variance in another variable. In this sense, they are very much like the coefficient of determination. One really needs a full-fledged statistics course to gain an understanding of F-tests, so suffice it to say here that you find them most commonly with regression and ANOVA techniques. F-tests require interpretation by using a table of critical values.
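As a hedged example of the second z-test use named above (comparing a sample proportion to a population proportion), the sketch below computes the test statistic by hand; the assumed population share of 30% and the sample counts are invented for illustration.

```python
import math
from scipy.stats import norm

# Hypothetical check: is the sample's minority share consistent with the population's?
p0 = 0.30          # assumed population proportion of minorities
n = 200            # sample size
count = 48         # minorities observed in the sample
p_hat = count / n  # sample proportion

# z statistic for a one-sample proportion test
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
p_value = 2 * norm.sf(abs(z))   # two-tailed p-value

print("p_hat =", p_hat, "z =", round(z, 3), "p =", round(p_value, 4))
```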

T-tests are kind of like little F-tests, and similar to Z-tests. They are appropriate for smaller samples and are relatively easy to interpret, since by rule of thumb any calculated t over 2.0 is significant. T-tests can be used for one sample or two samples, and can be one-tailed or two-tailed. You use a two-tailed test if there is any possibility of bidirectionality in the relationship between your variables. The formula for the one-sample t-test is as follows:

$t = \dfrac{\bar{X} - \mu}{S / \sqrt{n}}$
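Since the one-sample case was sketched earlier, here is a brief two-sample illustration with invented group scores, using scipy.stats.ttest_ind and the rule of thumb that a calculated |t| above about 2.0 is significant.

```python
from scipy import stats

# Hypothetical scores for two independent groups
group_a = [23, 25, 28, 30, 27, 26, 24, 29]
group_b = [31, 33, 29, 35, 34, 32, 30, 36]

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print("t =", round(t_stat, 3), "p =", round(p_value, 4))
# |t| well above 2.0 (and a small p) suggests the group means differ
```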

ANOVA
Analysis of Variance (ANOVA) is a data analytic technique based on the idea of comparing explained variance with unexplained variance, rather like a comparison of the coefficient of determination with the coefficient of alienation. It uses a rather unique computational formula which involves squaring almost every column of numbers. What is called the Between Sum of Squares (BSS) refers to variance in one variable explained by variance in another variable, and what is called the Within Sum of Squares (WSS) refers to variance that is not explained by variance in another variable. An F-test is then conducted on the ratio of the mean squares obtained by dividing each sum of squares by its degrees of freedom. The results are presented in what is called an ANOVA source table, which looks like the following:

Source    SS      df    MS        F       p
Between   1800    1     1800.00   10.80   <.05
Within    1000    6     166.67
Total     2800    7
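A minimal one-way ANOVA sketch (Python, three invented groups) using scipy.stats.f_oneway, which returns the F ratio and p-value of the kind that would populate the source table above.

```python
from scipy import stats

# Hypothetical scores for three independent groups
group1 = [18, 20, 22, 19, 21]
group2 = [25, 27, 26, 28, 24]
group3 = [30, 32, 31, 29, 33]

f_stat, p_value = stats.f_oneway(group1, group2, group3)
print("F =", round(f_stat, 2), "p =", round(p_value, 4))
# A large F (small p) means between-group variance dominates within-group variance
```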

The differences between univariate and bivariate data can be summarized as follows:


Univariate Data
- involves a single variable
- does not deal with causes or relationships
- the major purpose of univariate analysis is to describe
- central tendency: mean, mode, median
- dispersion: range, variance, max, min, quartiles, standard deviation
- frequency distributions
- bar graph, histogram, pie chart, line graph, box-and-whisker plot

Bivariate Data
- involves two variables
- deals with causes or relationships
- the major purpose of bivariate analysis is to explain
- analysis of two variables simultaneously
- correlations, comparisons, relationships, causes, explanations
- tables where one variable is contingent on the values of the other variable
- independent and dependent variables
