Submitted By: Tariq Mahmood Asghar Roll No. # 77 MBA (A, B1)
DATA ANALYSIS
The terms "statistics" and "data analysis" refer to much the same thing: the study of how we describe, combine, and make inferences from numbers. Many people are afraid of numbers (quantiphobia), but statistics has less to do with numbers themselves than with rules for arranging them. It even lets you create some of those rules yourself, so rather than treating it as a lot of memorization, it is best seen as an extension of the research mentality: something researchers do (crunch numbers) to gain command over the numbers. After a while, the principles behind the computations become clear, and there is no better way to reach that point than by understanding the research purpose of statistics.
Normality is critical in many statistical methods: when this assumption is violated, interpretation and inference may not be reliable or valid.
The median is the number that falls in the middle of a range of numbers. It's not the average; it's the halfway point. There are always just as many numbers above the median as below it. When there is an even set of numbers, you average the two middle numbers. The median is best suited for data that are ordinal, or ranked. It is also useful when you have extremely low or high scores. The mode is the most frequently occurring number in a list of numbers. It's the closest thing to what people mean when they say something is average or typical. The mode doesn't even have to be a number; it will be a category when the data are nominal or qualitative. The mode is useful when you have a highly skewed set of numbers, mostly low or mostly high. You can also have two modes (a bimodal distribution) when one group of scores is mostly low and the other group is mostly high, with few in the middle.
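These rules are easy to check in code. A minimal sketch using Python's standard library and hypothetical score lists (the data values are invented for illustration):

```python
from statistics import median, mode, multimode

# Hypothetical data with one extreme score (21)
scores = [2, 3, 3, 5, 7, 9, 21]
print(median(scores))               # 5 -- the halfway point, unaffected by the 21
print(mode(scores))                 # 3 -- the most frequently occurring value

# An even-sized list averages the two middle numbers
print(median([1, 2, 3, 4]))         # 2.5

# A bimodal list has two modes
print(multimode([1, 1, 2, 8, 9, 9]))  # [1, 9]
```

Note how the extreme score of 21 pulls the mean upward but leaves the median at 5, which is why the median is preferred with extreme values.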
MEASURES OF DISPERSION
In data analysis, the purpose of statistically computing a measure of dispersion is to discover the extent to which scores differ from, cluster around, or spread out from a measure of central tendency. The most commonly used measure of dispersion is the standard deviation. You first compute the variance: subtract the mean from each number, square each difference, add up these squared differences (the Sum of Squares), and divide by how many numbers there are. The square root of the variance is the standard deviation.
The standard deviation is important for many reasons. One reason is that, once you know the standard deviation, you can standardize by it. Standardization is the process of converting raw scores into what are called standard scores, which allow you to better compare groups of different sizes. Standardization isn't required for data analysis, but it becomes useful when you want to compare different subgroups in your sample, or between groups in different studies. A standard score is called a z-score (not to be confused with a z-test), and is calculated by subtracting the mean from each and every number and dividing by the standard deviation. Once you have converted your data into standard scores, you can then use probability tables that exist for estimating the likelihood that a certain raw score will appear in the population. This is an example of using a descriptive statistic (standard deviation) for inferential purposes.
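The two computations described above, standard deviation and z-score standardization, can be sketched together. This example uses hypothetical raw scores; `pstdev` computes the population standard deviation (dividing by n, as described in the text):

```python
from statistics import mean, pstdev

# Hypothetical raw scores
scores = [4, 8, 6, 5, 3, 7, 9]

m = mean(scores)        # 6.0
sd = pstdev(scores)     # 2.0 -- square root of the variance (Sum of Squares / n)

# A z-score subtracts the mean from each number and divides by the standard deviation
z_scores = [(x - m) / sd for x in scores]
```

After standardization, the z-scores always have a mean of 0 and a standard deviation of 1, which is what makes scores from different groups comparable.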
CHI-SQUARE
A technique designed for less than interval-level data is chi-square (pronounced "ky-square"). Its most common forms are the chi-square test for contingency and the chi-square test for independence. The chi-square test for contingency is interpreted as a strength-of-association measure, while the chi-square test for independence (which requires two samples) is a nonparametric test of significance that essentially rules out as much sampling error and chance as possible.
χ² = Σ [(Oi − Ei)² / Ei]

where
χ² = the chi-square statistic
Oi = observed frequency in the ith cell
Ei = expected frequency in the ith cell
If the computed test statistic is large, then the observed and expected values are not close, and the model is a poor fit to the data.
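The chi-square statistic is a straightforward sum over cells. A minimal sketch, using hypothetical observed and expected frequencies for a four-cell table:

```python
# Hypothetical observed and expected frequencies for four cells
observed = [50, 30, 10, 10]
expected = [40, 40, 10, 10]

# Chi-square: sum over cells of (O - E)^2 / E
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi_sq)   # 5.0 -- only the first two cells contribute (2.5 each)
```

A larger value of `chi_sq` means the observed frequencies depart more from what the model expects.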
BIVARIATE ANALYSIS
Bivariate analysis is the simultaneous analysis of two variables. It is usually undertaken to see whether one variable, such as gender, is related to another, such as attitudes toward male/female equality. The analysis ascertains whether the values of the dependent variable tend to coincide with those of the independent variable. In most instances, the association between two variables is assessed with a bivariate statistical technique (see below for exceptions). The three most commonly used techniques are contingency tables, analysis of variance (ANOVA), and correlations. The basic bivariate analysis is then usually extended to a multivariate form to evaluate whether the association can be interpreted as a relationship.

The importance of bivariate analysis is sometimes overlooked because it is superseded by multivariate analysis. This misperception is reinforced by scientific journals that report bivariate associations only in passing, if at all. This practice creates the misleading impression that analysis begins at the multivariate level. In reality, the multiple-variable model rests upon the foundation laid by a thorough analysis of the two-variable model, and proper specification of the theoretical model at the bivariate level is essential to the quality of subsequent multivariate analysis.

Some forms of bivariate analysis require that variables be differentiated into independent and dependent types. For example, the analysis of group differences in means, whether by t-test or ANOVA, treats the group variable as independent, which means the procedure is asymmetrical: different values are obtained if the independent and dependent variables are inverted. In contrast, the Pearson correlation coefficient, the most widely used measure of bivariate association, yields identical values irrespective of which variable is treated as dependent, meaning that it is symmetrical: the same coefficient and probability level are obtained if the two variables are interchanged. Similarly, the chi-square (χ²) test for independence between nominal variables yields the same value irrespective of whether the dependent variable appears in the rows or the columns of the contingency table. Although the test of statistical significance is unchanged, switching variables yields different expressions of the association, because row and column percentages are not interchangeable. Unlike the correlation coefficient, where both the statistic and the test of statistical significance are symmetrical, only the probability level is symmetrical in the χ² technique.

Designating one variable as independent and the other as dependent is productive even when this differentiation is not required by the statistical method. The value of the designation lies in setting the stage for subsequent multivariate analysis, where most statistical techniques do require it. The designation is also helpful in the bivariate analysis of the focal relationship, because multivariate analysis ultimately seeks to determine whether the bivariate association is indicative of a state of dependency between the two variables. This approach makes more sense if the original association is conceptualized as a potential relationship.
The most frequently used correlation coefficient in data analysis is the Pearson product-moment correlation. It is symbolized by the small letter r, and is fairly easy to compute from raw scores using the following formula:

r = [nΣxy − (Σx)(Σy)] / √{[nΣx² − (Σx)²][nΣy² − (Σy)²]}
If you square the Pearson correlation coefficient, you get the coefficient of determination, symbolized as R². It is the amount of variance in one variable accounted for by the other. R² can also be computed by using the statistical technique of regression, but in that situation it is interpreted as the amount of variance in one variable explained by another. If you subtract a coefficient of determination from one, you get something called the coefficient of alienation (1 − R²), which is sometimes seen in the literature.
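The raw-score computation of r, along with the coefficients of determination and alienation, can be sketched with hypothetical paired scores:

```python
from math import sqrt

# Hypothetical paired scores
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sxx = sum(a * a for a in x)
syy = sum(b * b for b in y)

# Pearson product-moment correlation from raw scores
r = (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

r_squared = r ** 2            # coefficient of determination
alienation = 1 - r_squared    # coefficient of alienation
```

For these scores, r is about .77, so about 60% of the variance in one variable is accounted for by the other, and 40% remains unexplained.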
REGRESSION
Regression is the closest thing to estimating causality in data analysis, because it predicts how well the numbers "fit" a projected straight line. There are also advanced regression techniques for curvilinear estimation. The most common form of regression, however, is linear regression, which uses the least squares method to find the equation that best fits a line representing what is called the regression of y on x. The procedure is similar to computing a minimum in calculus (if you've had a math course in calculus). Instead of finding the perfect number, however, one is interested in finding the perfect line, such that there is one and only one line (represented by an equation) that best fits the data, regardless of how scattered the data points are. The slope of the line provides information about predicted directionality, and the estimated coefficients (or beta weights) for x and y (independent and dependent variables) indicate the power of the relationship. Use of a regression formula (not shown here because it's too large; only the generic regression equation is shown) produces a number called R-squared, which is a kind of conservative yet powerful coefficient of determination. Interpretation of R-squared is somewhat controversial, but it generally uses the same strength table as correlation coefficients, and at a minimum researchers say it represents "variance explained."
y = a + bx

where
y = the predicted value of the dependent variable
a = the intercept (the value of y when x = 0)
b = the slope (the regression coefficient)
x = the value of the independent variable
T-tests are kind of like little F-tests, and similar to z-tests. They are appropriate for smaller samples, and relatively easy to interpret, since by rule of thumb any calculated t over 2.0 is significant. T-tests can be used for one sample or two samples, one-tailed or two-tailed. You use a two-tailed test if there's any possibility of bidirectionality in the relationship between your variables. The formula for the one-sample t-test is as follows:

t = (x̄ − μ) / (s / √n)

where x̄ is the sample mean, μ is the hypothesized population mean, s is the sample standard deviation, and n is the sample size.
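A minimal sketch of a one-sample t computation, using a hypothetical sample and a hypothesized population mean of 5.0 (`stdev` is the sample standard deviation, dividing by n − 1):

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical sample: does its mean differ from a hypothesized value of 5.0?
sample = [6.1, 5.8, 7.2, 6.5, 5.9, 6.8, 6.3, 7.0]
mu = 5.0

n = len(sample)
t = (mean(sample) - mu) / (stdev(sample) / sqrt(n))

# By the rule of thumb in the text, |t| over 2.0 is significant
print(t > 2.0)   # True
```

Here t is large (around 8), so by the rule of thumb the sample mean differs significantly from the hypothesized value; a proper analysis would check t against a table with n − 1 degrees of freedom.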
ANOVA
Analysis of Variance (ANOVA) is a data-analytic technique based on the idea of comparing explained variance with unexplained variance, kind of like a comparison of the coefficient of determination with the coefficient of alienation. It uses a rather unique computational formula which involves squaring almost every column of numbers. What is called the Between Sum of Squares (BSS) refers to variance in one variable explained by variance in another variable, and what is called the Within Sum of Squares (WSS) refers to variance that is not explained by variance in another variable. Each sum of squares is divided by its degrees of freedom to produce a mean square, and an F-test is then conducted on the ratio of the between mean square to the within mean square. The results are presented in what's called an ANOVA source table, which looks like the following:

Source    SS     df    MS        F       p
Total     2800    7
Between   1800    1    1800.00   10.80   <.05
Within    1000    6     166.67
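The BSS/WSS partition and F-ratio can be sketched for a one-way ANOVA. The two groups of scores below are hypothetical:

```python
from statistics import mean

# Hypothetical scores for two groups
groups = [[10, 12, 14, 16], [20, 22, 24, 26]]
all_scores = [s for g in groups for s in g]
grand_mean = mean(all_scores)

# Between Sum of Squares: variance explained by group membership
bss = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)

# Within Sum of Squares: variance left unexplained within groups
wss = sum((s - mean(g)) ** 2 for g in groups for s in g)

# F is the ratio of the mean squares (SS divided by degrees of freedom)
df_between = len(groups) - 1
df_within = len(all_scores) - len(groups)
f = (bss / df_between) / (wss / df_within)
```

For these scores BSS is 200, WSS is 40, and F is 30, a large ratio of explained to unexplained variance; the F value would then be checked for significance in an F table.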
UNIVARIATE DATA
- involves a single variable
- does not deal with causes or relationships
- the major purpose of univariate analysis is to describe
- central tendency: mean, mode, median
- dispersion: range, variance, max, min, quartiles, standard deviation
- frequency distributions
- bar graph, histogram, pie chart, line graph, box-and-whisker plot

BIVARIATE DATA
- involves two variables
- deals with causes or relationships
- the major purpose of bivariate analysis is to explain
- analysis of two variables simultaneously
- correlations
- comparisons, relationships, causes, explanations
- tables where one variable is contingent on the values of the other variable
- independent and dependent variables