
Canonical Analysis

Canonical Correlation is an additional procedure for assessing the relationship between variables. Specifically, this analysis allows us to investigate the relationship between two sets of variables. For example, an educational researcher may want to compute the (simultaneous) relationship between three measures of scholastic ability and five measures of success in school. A sociologist may want to investigate the relationship between two predictors of social mobility based on interviews and actual subsequent social mobility as measured by four different indicators. A medical researcher may want to study the relationship of various risk factors to the development of a group of symptoms. In all of these cases, the researcher is interested in the relationship between two sets of variables, and Canonical Correlation would be the appropriate method of analysis.
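As a rough sketch of how this looks in practice (assuming scikit-learn is available; the data and variable names below are synthetic stand-ins, not from the original examples):

```python
# A sketch of canonical correlation between two sets of variables
# (hypothetical data: three "ability" measures vs. five "success" measures).
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
ability = rng.normal(size=(100, 3))                                       # set 1: 3 variables
success = ability @ rng.normal(size=(3, 5)) + rng.normal(size=(100, 5))   # set 2: 5 variables

cca = CCA(n_components=2)
A, S = cca.fit_transform(ability, success)   # paired canonical variates

for i in range(2):
    r = np.corrcoef(A[:, i], S[:, i])[0, 1]
    print(f"canonical correlation {i + 1}: {r:.3f}")
```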

ANOVA
The purpose of analysis of variance (ANOVA) is to test for significant differences between means. This is accomplished by analyzing the variance, that is, by partitioning the total variance into the component that is due to true random error (i.e., within-group SS) and the components that are due to differences between means. These latter variance components are then tested for statistical significance, and, if significant, we reject the null hypothesis of no differences between means and accept the alternative hypothesis that the means (in the population) are different from each other.
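A minimal sketch of a one-way ANOVA (assuming SciPy; the scores below are invented for illustration):

```python
# One-way ANOVA: test for differences among three group means (SciPy).
from scipy import stats

group_a = [85, 90, 88, 75, 95]
group_b = [70, 65, 80, 72, 68]
group_c = [60, 74, 62, 71, 66]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# If p < .05, reject the null hypothesis of equal population means.
```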

Dependent and independent variables


The variables that are measured (e.g., a test score) are called dependent variables. The variables that are manipulated or controlled (e.g., a teaching method or some other criterion used to divide observations into groups that are compared) are called factors or independent variables.

Data Mining (Predictive Analytics, Big Data)


Data Mining is an analytic process designed to explore data (usually large amounts of data, typically business or market related, also known as "big data") in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. The ultimate goal of data mining is prediction. The process of data mining consists of three stages: (1) initial exploration, (2) model building or pattern identification with validation/verification, and (3) deployment (i.e., the application of the model to new data in order to generate predictions).

Exploration
This stage usually starts with data preparation, which may involve cleaning the data, applying data transformations, selecting subsets of records, and, in the case of data sets with large numbers of variables ("fields"), performing some preliminary feature selection operations to bring the number of variables to a manageable range (depending on the statistical methods being considered).

Model building and validation


This stage involves considering various models and choosing the best one based on their predictive performance (i.e., explaining the variability in question and producing stable results across samples). This may sound like a simple operation, but in fact, it sometimes involves a very elaborate process. There are a variety of techniques developed to achieve that goal - many of which are based on so-called "competitive evaluation of models," that is, applying different models to the same data set and then comparing their performance to choose the best.
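A hedged sketch of this "competitive evaluation" idea (assuming scikit-learn; the candidate models and synthetic data are illustrative choices, not prescribed by the text):

```python
# "Competitive evaluation of models": fit several candidate models on the
# same data and compare cross-validated predictive performance.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
# The model with the best (and most stable) score would be chosen for deployment.
```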

Deployment
This final stage involves using the model selected as best in the previous stage and applying it to new data in order to generate predictions or estimates of the expected outcome. Data Mining is often considered to be "a blend of statistics, AI, and database research."

Cluster analysis
The term cluster analysis (first used by Tryon, 1939) encompasses a number of different algorithms and methods for grouping objects of similar kind into respective categories. Cluster analysis is an exploratory data analysis tool which aims at sorting different objects into groups in such a way that the degree of association between two objects is maximal if they belong to the same group and minimal otherwise. Cluster analysis simply discovers structures in data without explaining why they exist.
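One common clustering algorithm is k-means; a minimal sketch (assuming scikit-learn, with synthetic two-group data):

```python
# K-means clustering: sort objects into groups so that within-group
# association is high and between-group association is low.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, size=(50, 2)),   # one synthetic "kind" of object
                  rng.normal(5, 1, size=(50, 2))])  # another "kind"

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print(kmeans.labels_[:10])       # group assignment of the first 10 objects
print(kmeans.cluster_centers_)   # centers of the discovered groups
```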

Discriminant analysis
Discriminant function analysis is used to determine which variables discriminate between two or more naturally occurring groups. For example, an educational researcher may want to investigate which variables discriminate between high school graduates who decide (1) to go to college, (2) to attend a trade or professional school, or (3) to seek no further training or education. Discriminant Analysis is used to determine which variable(s) are the best predictors of students' subsequent educational choice.

Computationally, discriminant function analysis is very similar to analysis of variance (ANOVA). Let us consider a simple example. Suppose we measure height in a random sample of 50 males and 50 females. Females are, on average, not as tall as males, and this difference will be reflected in the difference in means (for the variable Height). Therefore, the variable height allows us to discriminate between males and females with a better than chance probability: if a person is tall, then he is likely to be a male; if a person is short, then she is likely to be a female.
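A minimal sketch of this height example as a discriminant analysis (assuming scikit-learn; the height distributions are invented for illustration):

```python
# Linear discriminant analysis: use height to classify male vs. female,
# mirroring the example above (synthetic data, not real measurements).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
height_m = rng.normal(178, 7, size=50)   # hypothetical male heights (cm)
height_f = rng.normal(165, 7, size=50)   # hypothetical female heights (cm)

X = np.concatenate([height_m, height_f]).reshape(-1, 1)
y = np.array([0] * 50 + [1] * 50)        # 0 = male, 1 = female

lda = LinearDiscriminantAnalysis().fit(X, y)
print(f"better-than-chance accuracy: {lda.score(X, y):.2f}")
```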

Time Series Analysis


There are two main goals of time series analysis: (a) identifying the nature of the phenomenon represented by the sequence of observations, and (b) forecasting (predicting future values of the time series variable). Most time series patterns can be described in terms of two basic classes of components: trend and seasonality. For example, sales of a company can grow rapidly over the years yet still follow consistent seasonal patterns (e.g., as much as 25% of yearly sales are made in December, whereas only 4% are made in August). Common tools for identifying and modeling these patterns include autocorrelation (displayed as a correlogram) and ARIMA models.
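A sketch of fitting a seasonal ARIMA model and forecasting (assuming statsmodels; the sales series is synthetic, built from a trend plus a December spike):

```python
# Fit an ARIMA model to a trended, seasonal series and forecast ahead.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
months = pd.date_range("2015-01", periods=96, freq="MS")
sales = (np.linspace(100, 300, 96)        # upward trend over 8 years
         + 40 * (months.month == 12)      # December spike (seasonality)
         + rng.normal(0, 10, 96))         # noise
series = pd.Series(sales, index=months)

model = ARIMA(series, order=(1, 1, 1), seasonal_order=(1, 0, 0, 12))
result = model.fit()
print(result.forecast(steps=12))          # predict the next 12 months
```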

Factor analysis
Factor analysis is used to reduce the number of variables and to detect structure in the relationships between variables, that is, to classify variables. Therefore, factor analysis is applied as a data reduction or structure detection method (the term factor analysis was first introduced by Thurstone, 1931). Suppose we conducted a (rather "silly") study in which we measure 100 people's height in inches and centimeters. Thus, we would have two variables that measure height. If, in future studies, we want to research, for example, the effect of different nutritional food supplements on height, would we continue to use both measures? Probably not; height is one characteristic of a person, regardless of how it is measured.
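A sketch of the height example as a factor analysis (assuming scikit-learn; the measurements are simulated):

```python
# The "silly" height study: inches and centimeters are two measures of one
# underlying trait, so a single factor captures essentially all the variance.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
height_in = rng.normal(67, 4, size=100)                   # height in inches
height_cm = height_in * 2.54 + rng.normal(0, 0.1, 100)    # same trait, new units

X = np.column_stack([height_in, height_cm])
fa = FactorAnalysis(n_components=1).fit(X)
scores = fa.transform(X)                                  # one factor score per person
print(abs(np.corrcoef(scores[:, 0], height_in)[0, 1]))    # ~1.0: the single factor is height
```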

Parametric and Nonparametric Tests


Parametric Assumptions
- The observations must be independent.
- The observations must be drawn from normally distributed populations.
- The means of these normal and homoscedastic populations must be linear combinations of effects due to columns and/or rows.
- The dependent variables are measured at the interval level.
- Sampling is random.
Examples: t tests, ANOVA.

Nonparametric Test Assumptions


- Observations are independent.
- The variable under study has underlying continuity.
- Normality is not required.
- Interval-level measurement is not required.

Parametric tests assume we have information about the population, or can make certain assumptions; in particular, they assume the population is normally distributed. Nonparametric tests are used when no assumptions are made about the population distribution (they are also known as distribution-free tests), though information is known about the sampling distribution.
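As an illustration (assuming SciPy; the data are invented), the same two-group comparison can be run with a parametric and a nonparametric test:

```python
# Parametric vs. nonparametric comparison of two groups (SciPy).
from scipy import stats

group_a = [2.1, 3.5, 2.8, 4.0, 3.3]
group_b = [1.0, 1.9, 2.2, 1.5, 2.7]

t_stat, t_p = stats.ttest_ind(group_a, group_b)       # parametric: assumes normality
u_stat, u_p = stats.mannwhitneyu(group_a, group_b)    # nonparametric: distribution-free
print(f"t test: p = {t_p:.4f}; Mann-Whitney U: p = {u_p:.4f}")
```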

One-Way Chi Square Test


Compares observed frequencies within groups to their expected frequencies. Null hypothesis (H0): the observed frequencies are not different from the expected frequencies. Research hypothesis: they are different.

Chi Square Statistic


f_o = observed frequency, f_e = expected frequency

$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$

One-way Chi Square


Calculate the chi square statistic across all the categories. Degrees of freedom = k - 1, where k is the number of categories. Compare the calculated value to the table of χ² critical values.
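A minimal sketch of a one-way chi square test (assuming SciPy; the observed counts are hypothetical, with equal expected frequencies used by default):

```python
# One-way chi-square test: observed counts vs. expected counts.
from scipy import stats

observed = [30, 14, 34, 45, 57, 20]   # e.g., counts in k = 6 categories
# With no f_exp given, SciPy assumes equal expected frequencies.
chi2, p = stats.chisquare(observed)
print(f"chi2 = {chi2:.2f}, df = {len(observed) - 1}, p = {p:.4f}")
```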

One-way Chi Square Interpretation


If our calculated value of chi square is less than the table value, we accept (retain) H0. If our calculated chi square is greater than the table value, we reject H0. As with t tests and ANOVA, acceptance and rejection of the null hypothesis work on the same principle.

Two-Way Chi Square


Review cross-tabulations (= contingency tables) from Chapter 2. Are the differences in the responses of two groups statistically significant? One-way: observed vs. expected frequencies. Two-way: one set of observed frequencies vs. another set.

Two-way Chi Square


The two-way chi square compares frequencies (rather than scores, as in t or F tests). So the null hypothesis is that the two or more populations do not differ with respect to frequency of occurrence; we work with frequencies rather than with means as in the t test.

Two-way Chi Square Example


Null hypothesis: The relative frequency [or percentage] of liberals who are permissive is the same as the relative frequency of conservatives who are permissive. The categories (independent variable) are liberals and conservatives. The dependent variable being measured is permissiveness.

Two-Way Chi Square Example


                          Child-rearing Practices
                        Permissive   Non-permissive   Total
Political   Liberals        13              7           20
Orientation Conservatives    7             13           20
            Total           20             20           40

Two-Way Chi Square Example


Because we had 20 respondents in each column and each row, our expected values in this cross-tabulation would be 10 cases per cell. Note that both rows and columns are nominal data, which could not be handled by a t test or ANOVA. Here the numbers are frequencies, not an interval variable.

Two-Way Chi Square Expected

                          Child-rearing Practices
                        Permissive   Non-permissive   Total
Political   Liberals        10             10           20
Orientation Conservatives   10             10           20
(Expected)  Total           20             20           40

Two-Way Chi Square Example


Unfortunately, most examples do not have equal row and column totals, so it is harder to figure out the expected frequencies.

Two-Way Chi Square Example


What frequencies would we see if there were no difference between groups (i.e., if the null hypothesis were true)? If 25 out of 40 respondents (62.5%) were permissive, and there were no difference between liberals and conservatives, then 62.5% of each group would be permissive.

Two-Way Chi Square Example


We get the expected frequencies for each cell by multiplying the row marginal total by the column marginal total and dividing the result by N. We'll put the expected values in parentheses.
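For example, for the Liberals row (total 20) and the Permissive column (total 25) in the table that follows:

$$f_e = \frac{(\text{row total}) \times (\text{column total})}{N} = \frac{20 \times 25}{40} = 12.5$$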

Two-Way Chi-Square Example

Political Orientation   Permissive   Not Permissive   Total
Liberals                15 (12.5)       5 (7.5)         20
Conservatives           10 (12.5)      10 (7.5)         20
Total                   25              15              40

Two-Way Chi-Square Example


So the chi square statistic from these data is:

$$\chi^2 = \frac{(15 - 12.5)^2}{12.5} + \frac{(10 - 12.5)^2}{12.5} + \frac{(5 - 7.5)^2}{7.5} + \frac{(10 - 7.5)^2}{7.5} = .5 + .5 + .83 + .83 = 2.66$$

Two-Way Chi-Square Example


df = (r - 1)(c - 1), where r = rows and c = columns, so df = (2 - 1)(2 - 1) = 1. From Table C, at α = .05 with df = 1, the critical value of χ² is 3.84. Compare: our calculated 2.66 is less than the table value, so we retain the null hypothesis.
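The worked example can be checked in software; a sketch assuming SciPy (correction=False matches the hand calculation above):

```python
# Two-way chi square on the 2 x 2 table above, without Yates correction.
from scipy.stats import chi2_contingency

table = [[15, 5],    # Liberals: permissive, not permissive
         [10, 10]]   # Conservatives: permissive, not permissive
chi2, p, df, expected = chi2_contingency(table, correction=False)
print(chi2, df)      # ~2.67 with df = 1 (2.66 above reflects rounding)
print(expected)      # [[12.5, 7.5], [12.5, 7.5]]
```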

T test
The t-test assesses whether the means of two groups are statistically different from each other. This analysis is appropriate whenever you want to compare the means of two groups, and especially appropriate as the analysis for the posttest-only two-group randomized experimental design.

Formula for the T test
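For two independent groups with means x̄1 and x̄2, sample variances s1² and s2², and sizes n1 and n2, one standard (pooled-variance) form of the statistic is:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}, \qquad s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

with df = n1 + n2 - 2.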

F test
The F-test is designed to test whether two population variances are equal. It does this by comparing the ratio of the two variances: if the variances are equal, the ratio will be 1. If the null hypothesis is true, the F test statistic simplifies dramatically, and the ratio of the sample variances is the test statistic used. If the null hypothesis is false, we reject the hypothesis that the ratio equals 1, and with it our assumption that the two variances are equal.
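Concretely, with sample variances s1² and s2² from samples of sizes n1 and n2:

$$F = \frac{s_1^2}{s_2^2} \sim F(n_1 - 1,\; n_2 - 1) \quad \text{under } H_0: \sigma_1^2 = \sigma_2^2$$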
