
BASIC CONCEPTS

Population : Collection of all individuals, objects, or items under study; its size is denoted by N
Sample : A part of a population; its size is denoted by n
Variable : Characteristic of an individual or object
Parameter : Characteristic of the population
Statistic : Characteristic of the sample
Qualitative and Quantitative variables

NOTATIONS OF POPULATION AND SAMPLE

Characteristic             Population                       Sample
Size                       N                                n
Mean                       μ                                x̄ = Σx / n
SD                         σ = sqrt[ Σ(x − μ)² / N ]        s = sqrt[ Σ(x − x̄)² / (n − 1) ]
Proportion                 P                                p = x / n
Correlation Coefficient    ρ                                r = Cov(x, y) / (σx σy)

Chart on population, sample and statistical inference

Population too large → Sample drawn from the population → Collect data from the sample → Organise data → Analyse the organised data → Draw inference which is applicable to the population

Jun 16, 2011    Dr.R.RAVANAN, Presidency College

Sampling Techniques

Probability Sampling
  Simple Random Sampling
  Stratified Random Sampling (Proportionate or Disproportionate)
  Systematic Sampling
  Cluster Sampling (One stage, Two stage or Multi stage)
Non-Probability Sampling
  Convenience Sampling
  Quota Sampling
  Judgement Sampling
  Snowball Sampling

Stages in Data Analysis

Editing
Error Checking and Verification
Coding
Data Entry (Keyboarding)
Data Analysis
  Descriptive Analysis: Univariate Analysis, Bivariate Analysis, Multivariate Analysis
Interpretation

Statistical Inference

Statistical Inference
  Theory of Estimation
    Point Estimation
    Interval Estimation
  Testing of Hypothesis
    Parametric Tests
    Non-Parametric Tests
The Concept of P Value


Given the observed data set, the P value is the smallest significance level at which the null hypothesis is rejected (and the alternative is accepted).
If the P value ≤ α, then reject H0; otherwise accept H0.
If the P value ≤ 0.01, then reject H0 at the 1% level of significance.
If the P value lies between 0.01 and 0.05 (i.e. 0.01 < P value ≤ 0.05), then reject H0 at the 5% level of significance.
If the P value > 0.05, then accept H0 at the 5% level of significance.
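The decision rule above can be sketched as a small function (a minimal illustration; the thresholds are the 1% and 5% levels stated above):

```python
def p_value_decision(p):
    """Apply the P-value decision rule from the slide."""
    if p <= 0.01:
        return "reject H0 at the 1% level of significance"
    elif p <= 0.05:
        return "reject H0 at the 5% level of significance"
    else:
        return "accept H0 at the 5% level of significance"

print(p_value_decision(0.002))  # reject H0 at the 1% level of significance
print(p_value_decision(0.025))  # reject H0 at the 5% level of significance
print(p_value_decision(0.40))   # accept H0 at the 5% level of significance
```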

Measurement Scales
Types of measurement scales are Nominal Scale Ordinal Scale Interval scale Ratio Scale

The Measurement Principles

Nominal : People or objects with the same scale value are the same on some attribute. The values of the scale have no 'numeric' meaning in the way that you usually think about numbers.
Ordinal : People or objects with a higher scale value have more of some attribute. The intervals between adjacent scale values are indeterminate. Scale assignment is by the property of "greater than," "equal to," or "less than."
Interval : Intervals between adjacent scale values are equal with respect to the attribute being measured. E.g., the difference between 8 and 9 is the same as the difference between 76 and 77.
Ratio : There is a rational zero point for the scale. Ratios are equivalent, e.g., the ratio of 2 to 1 is the same as the ratio of 8 to 4.

Examples of the Measurement Scales

Nominal : Classification data; no ordering (e.g. it makes no sense to state that M > F); arbitrary labels (e.g., M/F, 0/1, etc.)
Ordinal : Ordered, but differences between values are not important (e.g., political parties on a left-to-right spectrum given labels 0, 1, 2; Likert scales, such as rating your degree of satisfaction on a scale of 1..5; restaurant ratings)
Interval : Ordered, constant scale, but no natural zero; differences make sense, but ratios do not (e.g., temperature (C, F), dates)
Ratio : Ordered, constant scale, natural zero (e.g., height, weight, age, length)

What is a natural zero: Some scales of measurement have a natural zero and some do not. For example, height and weight have a natural zero at no height or no weight. Consequently, it makes sense to say that 2 m is twice as large as 1 m. Both of these variables are ratio scale. On the other hand, year and temperature (°C) do not have a natural zero. The year 0 is arbitrary, and it is not sensible to say that the year 2000 is twice as old as the year 1000. Similarly, 0°C is arbitrary (why pick the freezing point of water?), and it again does not make sense to say that 20°C is twice as hot as 10°C. Both of these variables are interval scale.

Scales of Measurement

Level    Scale of Measurement    Scale Qualities                              Example(s)
4        Ratio                   Magnitude, Equal Intervals, Absolute Zero    Age, Height, Weight, Percentage
3        Interval                Magnitude, Equal Intervals                   Temperature
2        Ordinal                 Magnitude                                    Likert Scale, anything rank ordered
1        Nominal                 None                                         Names, lists of words

Permissible Arithmetic Operations

Nominal : Counting
Ordinal : "Greater than" or "less than" operations
Interval : Addition and subtraction of scale values
Ratio : Multiplication and division of scale values

Appropriate Statistics

Nominal
[ Cross tabs ] Chi-square, Phi, Cramér's V, Contingency coefficient
[ Nonparametric ] Chi-square, Runs, Binomial, McNemar, Cochran

Ordinal
[ Frequencies ] Median, Interquartile range
[ Nonparametric ] Kolmogorov-Smirnov, Sign, Wilcoxon, Kendall coefficient of concordance, Friedman two-way ANOVA, Mann-Whitney U, Wald-Wolfowitz, Kruskal-Wallis

Interval
Mean, Standard deviation, Pearson's product-moment correlation, t test, Analysis of variance, Multivariate analysis of variance (MANOVA), Factor analysis, Regression, Multiple correlation R

Ratio
Coefficient of Variation (CV = SD / Mean)

Scale construction decisions

What level of data is involved (nominal, ordinal, interval, or ratio)?
What will the results be used for?
Should you use a scale, index, or typology?
What types of statistical analysis would be useful?
Should you use a comparative scale or a noncomparative scale?
How many scale divisions or categories should be used (1 to 10; 1 to 7; -3 to +3)?
Should there be an odd or even number of divisions? (Odd gives a neutral center value; even forces respondents to take a non-neutral position.)
What should the nature and descriptiveness of the scale labels be?
What should the physical form or layout of the scale be? (graphic, simple linear, vertical, horizontal)
Should a response be forced or be left optional?

Statistical Inference
There are two types of statistical inferences: Estimation of population parameters and hypothesis testing. Hypothesis testing is one of the most important tools of application of statistics to real life problems. Most often, decisions are required to be made concerning populations on the basis of sample information. Statistical tests are used in arriving at these decisions.

Five ingredients to statistical test


Null Hypothesis Alternate Hypothesis Level of Significance Test Statistic Interpretation

Steps in Hypothesis Testing
1. Identify the null hypothesis H0 and the alternate hypothesis H1.
2. Choose α. The value should be small, usually less than 10%. It is important to consider the consequences of both types of errors.
3. Select the test statistic and determine its value from the sample data. This value is called the observed value of the test statistic.
4. Compare the observed value of the statistic to the critical value obtained for the chosen α.
5. Make a decision:
   If the test statistic falls in the critical region, reject H0 in favour of H1.
   If the test statistic does not fall in the critical region, conclude that there is not enough evidence to reject H0.
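The five steps can be illustrated with a one-sample t-test on hypothetical data (the sample values, H0: μ = 50, and α = 0.05 are all assumptions for illustration), using scipy:

```python
from scipy import stats

# Step 1: H0: mu = 50 vs H1: mu != 50 (two-tailed); hypothetical sample
sample = [52, 48, 55, 53, 49, 57, 51, 54]
alpha = 0.05                                              # Step 2: choose alpha

t_obs, p = stats.ttest_1samp(sample, popmean=50)          # Step 3: observed test statistic
t_crit = stats.t.ppf(1 - alpha / 2, df=len(sample) - 1)   # Step 4: critical value

# Step 5: reject H0 only if the statistic falls in the critical region
if abs(t_obs) > t_crit:
    print("Reject H0 in favour of H1")
else:
    print("Not enough evidence to reject H0")
```

For this sample the observed |t| falls short of the critical value, so H0 is not rejected.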

Types of Error

Type of decision    H0 true                     H0 false
Reject H0           Type I error (α)            Correct decision (1 − β)
Accept H0           Correct decision (1 − α)    Type II error (β)

Use the key by answering the questions in the most relevant way.
1. Have you got more than two samples?
No......go to 2 Yes.....go to 8

2. Have you got one or two samples?


One.....Single sample t-test Two....go to 3

3. Are your data sets normally distributed (K-S test or Shapiro-Wilk)?


No.......go to 4 Yes......go to 5

4. Do your data sets have any factor in common (dependence), i.e. location or individuals?
No......Mann-Whitney U test Yes.....Wilcoxon Matched Pairs

5. Do your data sets have any factor in common (dependence), i.e. location or individuals?
No......go to 6 Yes.....paired sample t-test

Use the key by answering the questions in the most relevant way.
6. Do your data sets have equal variances (F-test)?
No......unequal variance t-test Yes.....go to 7

7. Is n greater or less than 30?


<30.....equal variance t-test or ANOVA >30.....z-test or ANOVA

8. Are your samples normally distributed and with equal variances?


No......Kruskal-Wallis non-parametric ANOVA Yes.....go to 9

9. Does your data involve one factor or two factors?


One.....One-way ANOVA (see also Multiple comparison tests) Two.....Two-way ANOVA (see also Multiple comparison tests)
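The whole key can be sketched as a single function whose arguments mirror the questions above (a simplified illustration; the function and argument names are invented for this sketch):

```python
def choose_test(n_samples, normal=None, dependent=None, equal_var=None,
                n=None, factors=None):
    """Walk the decision key; each branch corresponds to a numbered question."""
    if n_samples > 2:                                    # questions 1, 8, 9
        if not normal or not equal_var:
            return "Kruskal-Wallis non-parametric ANOVA"
        return "One-way ANOVA" if factors == 1 else "Two-way ANOVA"
    if n_samples == 1:                                   # question 2
        return "Single sample t-test"
    if not normal:                                       # questions 3-4
        return "Wilcoxon Matched Pairs" if dependent else "Mann-Whitney U test"
    if dependent:                                        # question 5
        return "Paired sample t-test"
    if not equal_var:                                    # question 6
        return "Unequal variance t-test"
    return "Equal variance t-test" if n < 30 else "z-test"  # question 7

print(choose_test(2, normal=False, dependent=False))           # Mann-Whitney U test
print(choose_test(3, normal=True, equal_var=True, factors=1))  # One-way ANOVA
```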

Descriptive Statistics

Diversity Indices

Comparisons

Correlations

Regression

A Classification of Multivariate Methods

All Multivariate Methods
  Are some of the variables dependent on others?
    Yes → Dependence Methods
    No  → Independence Methods

Multivariate Analysis: Classification of Dependence Methods

How many variables are dependent?
  One dependent variable
    Metric (the scales are interval or ratio) → Multiple Regression
    Nonmetric (the scales are nominal or ordinal) → Multiple Discriminant Analysis
  Several dependent variables
    Metric → Multivariate Analysis of Variance (MANOVA)
    Nonmetric → Conjoint Analysis
  Multiple independent and dependent variables → Canonical Analysis

Multivariate Analysis: Classification of Independence Methods


I n M A d e p e n d e e t h o d s i n p u t r i c ? t s n c e

r e M a

M s c a o r C A

t r i c - T h e l e s a r e r a I n t e r v a l l u n a s t e r l y s Ms i M u S e

N t i o s

o n m e t r i c c a l e s a r e o r o r d i n a

F A n

c t o r a l y s i s

t r i c N o n m e t r i l t i d i m eM n u s l i t o i d n i am l e c a l i n g S c a l i n g

Test of Hypothesis
Test of Hypotheses concerning mean(s). Test of Hypotheses concerning variance/Variances. Test of Hypotheses concerning proportions.

Types of Statistical Tests and their Characteristics

Hypothesis About           Number of Samples               Measurement Scale    Test
Frequency distribution     One                             Nominal              Chi-square
Frequency distribution     Two or more                     Nominal              Chi-square
Means                      One (large sample)              Interval or Ratio    Z test
Means                      One (small sample)              Interval or Ratio    t test
Means                      Two (large sample)              Interval or Ratio    Z test
Means                      Two (small sample)              Interval or Ratio    t test
Means                      Three or more (small sample)    Interval or Ratio    ANOVA
Proportions                One (large sample)              Interval or Ratio    Z test
Proportions                One (small sample)              Interval or Ratio    t test
Proportions                Two (large sample)              Interval or Ratio    Z test
Proportions                Two (small sample)              Interval or Ratio    t test
Variance                   Two or more samples             Interval or Ratio    ANOVA

Example for Tests of Hypotheses concerning Two Population Means

Sample I: 110, 120, 123, 112, 125
Sample II: 120, 128, 133, 138, 129

Group Statistics (Mark)
Class      N    Mean      Std. Deviation    Std. Error Mean
Class A    5    118.00    6.671             2.983
Class B    5    129.60    6.656             2.977

Independent Samples Test (Mark)
Levene's Test for Equality of Variances: F = .178, Sig. = .684
t-test for Equality of Means (equal variances assumed):
  t = -2.753, df = 8, Sig. (2-tailed) = .025
  Mean Difference = -11.60, Std. Error Difference = 4.214
  95% Confidence Interval of the Difference: (-21.318, -1.882)

Summary
Class      Sample    Mean      SD       t Value    P Value
Class A    5         118.00    6.671    2.753      0.025
Class B    5         129.60    6.656
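The t and P values above should be reproducible with scipy's independent-samples t-test (a sketch; the pooled, equal-variance form is used because Levene's test does not reject equality of variances):

```python
from scipy import stats

class_a = [110, 120, 123, 112, 125]
class_b = [120, 128, 133, 138, 129]

# Pooled (equal-variance) two-sample t-test
t, p = stats.ttest_ind(class_a, class_b, equal_var=True)
print(round(t, 3), round(p, 3))  # -2.753 0.025
```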

Tests of Hypotheses concerning proportion(s)


1. One-tailed tests concerning single proportion 2. Two-tailed tests concerning single proportion 3. One-tailed tests concerning two proportions 4. Two-tailed tests concerning two proportions

Tests of Hypotheses concerning Variance(s)


1. One-tailed chi-square test concerning single population variance 2. Two-tailed chi-square test concerning single population variance 3. One-tailed F-test concerning equality of two population variances 4. Two-tailed F-test concerning equality of two population variances

Chi-square test for checking independence of two categorized data


Let us consider two factors which may or may not have an influence on the observed frequencies formed with respect to combinations of different levels of the two factors.
H0: Factor A and factor B are independent
H1: Factor A and factor B are not independent
Objective: To check whether the null hypothesis is to be accepted, based on the value of the chi-square statistic, by placing the significance level α at the right tail of the chi-square distribution.

Chi-square test for goodness of fit


The aim is to fit the data to the nearest distribution which represents the data more meaningfully for future analysis. Such fitting of data to the nearest distribution is done using the goodness of fit test.
H0: The given data follow an assumed distribution
H1: The given data do not follow an assumed distribution
Objective: To check whether the null hypothesis is to be accepted, based on the value of the chi-square statistic, by placing the significance level α at the right tail of the chi-square distribution.
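A minimal sketch of the goodness-of-fit test with scipy, using hypothetical die-roll counts against a uniform (fair-die) H0:

```python
from scipy import stats

# Hypothetical example: 60 rolls of a die, H0: the die is fair (uniform)
observed = [12, 8, 9, 11, 6, 14]
expected = [10] * 6           # 60 rolls / 6 faces under H0

chi2, p = stats.chisquare(observed, expected)
print(round(chi2, 2), round(p, 3))  # chi-square = 4.2; large P value, so accept H0
```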

Comparing Multiple Population 1. Comparing multiple population variances 2. Comparing multiple population means

Comparing multiple population variances


For more than two populations, it is assumed that the probability distribution (i.e. histogram) of each population is approximately normal.
H0: All the population variances are equal
H1: At least two population variances differ
This test is called Bartlett's test.
Objective: To check whether the null hypothesis is to be accepted, based on the value of the chi-square statistic, by placing the significance level α at the right tail of the chi-square distribution.

Comparing multiple population means


For more than two populations, it is assumed that the probability distribution (i.e. histogram) of each population is approximately normal.
H0: All the population means are equal
H1: At least two population means differ
This test is called Analysis of Variance (ANOVA):
  Data from unrestricted (independent) samples: One-way ANOVA
  Data from block-restricted samples: Two-way ANOVA
Objective: To check whether the null hypothesis is to be accepted, based on the value of the F statistic, by placing the significance level α at the right tail of the Snedecor F distribution.

Example for One Way ANOVA


School I : 45, 54, 35, 43, 48 School II : 54, 65, 67, 55, 52 School III : 87, 65, 75, 79, 67
ANOVA (Mark)
                  Sum of Squares    df    Mean Square    F         Sig.
Between Groups    2195.200          2     1097.600       18.646    .000
Within Groups     706.400           12    58.867
Total             2901.600          14

Duncan's post hoc test (Mark): subsets for alpha = .05
School        N    Subset 1    Subset 2    Subset 3
School I      5    45.00
School II    5                 58.60
School III   5                             74.60

Means for groups in homogeneous subsets are displayed. Uses Harmonic Mean Sample Size = 5.000.
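The F statistic above should be reproducible with scipy's one-way ANOVA on the three school samples:

```python
from scipy import stats

school_1 = [45, 54, 35, 43, 48]
school_2 = [54, 65, 67, 55, 52]
school_3 = [87, 65, 75, 79, 67]

f, p = stats.f_oneway(school_1, school_2, school_3)
print(round(f, 3))  # 18.646, matching the ANOVA table; p is well below 0.001
```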

Non-Parametric Tests
In some situations, the practical data may be non-normal and/or it may not be possible to estimate the parameter(s) of the data. The tests used in such situations are called non-parametric tests. Since these tests are free of assumptions about the distribution and its parameters, they are known as non-parametric (NP) tests or distribution-free tests. NP tests can be used even for nominal data (qualitative data like greater or less, etc.) and ordinal data, like ranked data. NP tests require less calculation, because there is no need to compute parameters.

List of Non-Parametric Tests

1. One-sample tests
   One-sample sign test
   Chi-square one-sample test
   Kolmogorov-Smirnov test
2. Two related samples tests
   Two-samples sign test
   Wilcoxon matched-pairs signed-rank test
3. Two independent samples tests
   Chi-square test for two independent samples
   Mann-Whitney U test
   Kolmogorov-Smirnov two-sample test

List of Non-Parametric Tests

4. K related samples tests
   Friedman two-way analysis of variance by ranks
   The Cochran Q test
5. K independent samples tests
   Chi-square test for k independent samples
   The extension of the median test
   Kruskal-Wallis one-way analysis of variance by ranks

One sample sign test


This test is applied to a situation where a sample is taken from a population which has a continuous symmetrical distribution, known to be non-normal, such that the probability of a sample value being less than the mean equals the probability of a sample value being more than the mean (p = 1/2). Classified into four categories:
1. One-tailed one-sample sign tests for small samples
2. Two-tailed one-sample sign tests for small samples
3. One-tailed one-sample sign tests for large samples
4. Two-tailed one-sample sign tests for large samples

Kolmogorov-smirnov test
It is similar to the chi-square test in testing the goodness of fit of a given set of data to an assumed distribution. This test is more powerful for small samples, whereas the chi-square test is suited for large samples.
H0: The given data follow an assumed distribution
H1: The given data do not follow an assumed distribution
The K-S test is a one-tailed test. Hence, if the calculated value of D exceeds the theoretical value of D for a given significance level, reject H0; otherwise accept H0.

Two samples sign test

The two samples sign test is applied to a situation where two samples are taken from two populations which have continuous symmetrical distributions and are known to be non-normal.

Modified sample value:
  Zi = +  if Xi > Yi
  Zi = −  if Xi < Yi
  Zi = 0  if Xi = Yi

Classified into four categories:
1. One-tailed two-sample sign tests with binomial distribution
2. Two-tailed two-sample sign tests with binomial distribution
3. One-tailed two-sample sign tests with normal distribution
4. Two-tailed two-sample sign tests with normal distribution

The Wilcoxon Matched-pairs signed-ranks test


The Wilcoxon test is one of the most useful tests for behavioral scientists.
Let di = the difference score for any matched pair.
Rank all the di without regard to sign.
T = the sum of the ranks with the less frequent sign.
Compute Z = [T − E(T)] / SD(T)

Mann-Whitney U Test
The Mann-Whitney U test is an alternative to the two-sample t-test. This test is based on the ranks of the observations of the two samples put together. An alternate name for this test is the Rank-Sum Test.
Let R1 = the sum of the ranks of the observations of the first sample
Let R2 = the sum of the ranks of the observations of the second sample
Objective: To check whether the two samples are drawn from different populations having the same distribution.
Compute Z = [U − E(U)] / SD(U), where U = n1n2 + [n1(n1 + 1)/2] − R1 or U = n1n2 + [n2(n2 + 1)/2] − R2
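The rank computation behind the U statistic can be sketched on hypothetical tie-free samples (note that scipy.stats.mannwhitneyu reports the statistic under a complementary convention, so it is computed directly here from the formula above):

```python
from scipy import stats

# Hypothetical unpaired samples (no ties, to keep the ranking simple)
x = [14, 2, 9, 7, 12]
y = [11, 5, 6, 3, 1, 8]
n1, n2 = len(x), len(y)

# Rank all observations together, then sum the ranks of the first sample
ranks = stats.rankdata(x + y)
r1 = ranks[:n1].sum()

u = n1 * n2 + n1 * (n1 + 1) / 2 - r1   # U as defined above
print(r1, u)  # 37.0 8.0
```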

Correlation and Regression Analysis


The chi-square test measures the association between two or more variables. This test is applicable only when the data are on a nominal scale. Correlation and regression analysis are used for measuring the relationship between two variables measured on an interval or ratio scale.

Correlation Analysis
Correlation analysis is a statistical technique used to measure the magnitude of the linear relationship between two variables. Correlation analysis cannot be used in isolation to describe the relationship between variables. It can be used along with regression analysis to determine the nature of the relationship between two variables, and thus as a basis for further analysis.
Two prominent types of correlation coefficient are:
  Pearson product-moment correlation coefficient
  Spearman's rank correlation coefficient
Testing the significance of the correlation coefficient:
  Type I   H0: ρ = 0 and H1: ρ ≠ 0
  Type II  H0: ρ = r (a specified value) and H1: ρ ≠ r
  Type III H0: ρ1 = ρ2 and H1: ρ1 ≠ ρ2

Correlation Analysis

Example:
Marks in Mathematics: 89, 58, 78, 79, 86, 58
Marks in Statistics: 75, 79, 59, 78, 84, 65

Correlations
                                       MATHS     STATISTICS
MATHS        Pearson Correlation       1         .968**
             Sig. (2-tailed)           .         .002
             N                         6         6
STATISTICS   Pearson Correlation       .968**    1
             Sig. (2-tailed)           .002      .
             N                         6         6

**. Correlation is significant at the 0.01 level (2-tailed).

Regression Analysis
Regression analysis is used to predict the nature and closeness of relationships between two or more variables. It evaluates the causal effect of one variable on another variable. It is used to predict the variability in the dependent (or criterion) variable based on information about one or more independent (or predictor) variables.
Two variables: simple or linear regression analysis
More than two variables: multiple regression analysis

Linear Regression Analysis

Linear regression: Y = α + βX
Where
  Y : Dependent variable
  X : Independent variable
  α and β : Two constants, called regression coefficients
  β : Slope coefficient, i.e. the change in the value of Y with the corresponding change of one unit in X
  α : Y intercept, the value of Y when X = 0

R² : The strength of association, i.e. to what degree the variation in Y can be explained by X. If R² = 0.10, then only 10% of the total variation in Y can be explained by the variation in X.

Test of significance of Regression Equation

Linear regression: Y = α + βX
The F test is used to test the significance of the linear relationship between the two variables Y and X.
H0: β = 0 (There is no linear relationship between Y and X)
H1: β ≠ 0 (There is a linear relationship between Y and X)
Objective: To check whether the estimates from the regression model represent the real-world data.

Example for Regression Analysis

School Climate: 25, 34, 55, 45, 56, 49, 65
Academic Achievement: 58, 62, 80, 75, 84, 72, 89

Variables Entered/Removed (Dependent Variable: Academic Achievement)
Model 1: Variables Entered: School Climate; Method: Enter. All requested variables entered.

Model Summary (Predictors: (Constant), School Climate)
Model    R        R Square    Adjusted R Square    Std. Error of the Estimate
1        .978     .957        .949                 2.55330

ANOVA (Dependent Variable: Academic Achievement)
Model 1        Sum of Squares    df    Mean Square    F          Sig.
Regression     732.832           1     732.832        112.409    .000
Residual       32.597            5     6.519
Total          765.429           6

Coefficients (Dependent Variable: Academic Achievement)
Model 1           Unstandardized B    Std. Error    Standardized Beta    t         Sig.
(Constant)        36.436              3.698                              9.853     .000
School Climate    .805                .076          .978                 10.602    .000
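The coefficients and R² above should be reproducible with scipy's linregress:

```python
from scipy import stats

climate = [25, 34, 55, 45, 56, 49, 65]
achievement = [58, 62, 80, 75, 84, 72, 89]

res = stats.linregress(climate, achievement)
print(round(res.intercept, 3), round(res.slope, 3), round(res.rvalue ** 2, 3))
# 36.436 0.805 0.957 -- the constant, the School Climate coefficient, and R Square
```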

Multivariate Analysis
Multivariate analysis is defined as the set of statistical techniques which simultaneously analyse more than two variables on a sample of observations. Multivariate analysis helps the researcher in evaluating the relationship between multiple (more than two) variables simultaneously. Multivariate techniques are broadly classified into two categories:
Dependence Techniques
Interdependence Techniques


Discriminant Analysis
Discriminant analysis aims at studying the effect of two or more predictor variables (independent variables) on a certain evaluation criterion. The evaluation criterion may comprise two or more groups: two groups such as good or bad, like or dislike, successful or unsuccessful, above or below the expected level; or three groups such as good, normal or poor. The analysis checks whether the predictor variables discriminate among the groups, and identifies which predictor variable is more important compared with the other predictor variable(s). Such analysis is called discriminant analysis.

Discriminant Analysis
Designing a discriminant function: Y = aX1 + bX2, where Y is a linear composite representing the discriminant function, and X1 and X2 are the predictor variables (independent variables) which have an effect on the evaluation criterion of the problem of interest.
Finding the discriminant ratio (K) and determining the variables which account for the intergroup difference in terms of group means. This ratio is the maximum possible ratio between the variability between groups and the variability within groups.
Finding the critical value which can be used to assign a new data set (i.e. a new combination of values of the predictor variables) to its appropriate group.
Testing H0: The group means are equal in importance, against H1: The group means are not equal in importance, using the F test at a given significance level.
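A minimal numpy sketch of a two-group linear discriminant function (Fisher's approach), with hypothetical data; the weights, cut-off and group assignment below illustrate the steps, not a worked example from the slides:

```python
import numpy as np

# Hypothetical data: two predictor variables X1, X2 observed in two groups
group1 = np.array([[2.0, 3.0], [3.0, 4.0], [4.0, 5.0], [3.0, 5.0]])
group2 = np.array([[6.0, 5.0], [7.0, 7.0], [8.0, 6.0], [7.0, 5.0]])

m1, m2 = group1.mean(axis=0), group2.mean(axis=0)
# Pooled within-group scatter matrix
s_w = np.cov(group1.T) * (len(group1) - 1) + np.cov(group2.T) * (len(group2) - 1)

# Discriminant weights (a, b) in Y = a*X1 + b*X2, determined up to scale
w = np.linalg.solve(s_w, m1 - m2)

# Critical (cut-off) value: midpoint of the projected group means
cutoff = w @ (m1 + m2) / 2

# Assign a new observation to the group whose projected mean it is closer to
new_obs = np.array([3.5, 4.0])
group = 1 if w @ new_obs > cutoff else 2
print("assigned to group", group)
```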

Factor Analysis
Factor analysis can be defined as a set of methods in which the observable or manifest responses of individuals on a set of variables are represented as functions of a small number of latent variables called factors. Factor analysis helps the researcher to reduce the number of variables to be analyzed, thereby making the analysis easier. For example, consider a market researcher at a credit card company who wants to evaluate the credit card usage and behaviour of customers using various variables: age, gender, marital status, income level, education, employment status, credit history and family background. Analysis based on such a wide range of variables can be tedious and time consuming. Using factor analysis, the researcher can reduce the large number of variables into a few dimensions called factors that summarize the available data. It aims at grouping the original input variables into factors which underlie the input variables. For example, age, gender and marital status can be combined under a factor called demographic characteristics; income level, education and employment status under a factor called socio-economic status; and credit history and family background under a factor called background status.

Benefits of Factor Analysis


To identify the hidden dimensions or construct which may not be apparent from direct analysis To identify relationships between variables It helps in data reduction It helps the researcher to cluster the product and population being analyzed.

Terminology in Factor Analysis


Factor: A factor is an underlying construct or dimension that represents a set of observed variables. In the credit card company example, demographic characteristics, socio-economic status and background status each represent a set of variables.
Factor Loadings: Factor loadings help in interpreting and labeling the factors. They measure how closely the variables in the factor are associated. They are also called factor-variable correlations: factor loadings are correlation coefficients between the variables and the factors.
Eigenvalues: Eigenvalues measure the variance in all the variables corresponding to a factor. Eigenvalues are calculated by adding the squares of the factor loadings of all the variables on the factor. They aid in explaining the importance of the factor with respect to the variables. Generally, factors with eigenvalues of more than 1.0 are considered stable; factors with low eigenvalues (<1.0) may not explain the variance in the variables related to that factor.

Terminology in Factor Analysis


Communalities: Communalities, denoted by h², measure the percentage of variance in each variable explained by the factors extracted. A communality ranges from 0 to 1. A high communality value indicates that the maximum amount of the variance in the variable is explained by the factors extracted in the factor analysis.
Total Variance Explained: The total variance explained is the percentage of the total variance of the variables explained. It is calculated by adding all the communality values of the variables and dividing by the number of variables.
Factor Variance Explained: The factor variance explained is the percentage of the total variance of the variables explained by a factor. It is calculated by adding the squared factor loadings of all the variables on the factor and dividing by the number of variables.
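The eigenvalue, communality and variance-explained computations can be illustrated on a hypothetical loading matrix (the loading values are invented for this sketch):

```python
import numpy as np

# Hypothetical factor-loading matrix: 5 variables (rows) x 2 factors (columns)
loadings = np.array([
    [0.80, 0.10],
    [0.75, 0.20],
    [0.70, 0.15],
    [0.10, 0.85],
    [0.20, 0.80],
])

# Eigenvalue of each factor: column sums of squared loadings
eigenvalues = (loadings ** 2).sum(axis=0)
# Communality h^2 of each variable: row sums of squared loadings
communalities = (loadings ** 2).sum(axis=1)
# Total variance explained: mean communality across variables
total_variance = communalities.sum() / loadings.shape[0]

print(eigenvalues.round(4), communalities.round(4), round(total_variance, 4))
```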

Procedure followed for Factor Analysis


Define the problem Construct the correlation matrix that measures the relationship between the factors and the variables. Select an appropriate factor analysis method Determine the number of factors Rotation of factors Interpret the factors Determine the factor scores

Cluster Analysis
Cluster analysis can be defined as a set of techniques used to classify objects into relatively homogeneous groups called clusters. It involves identifying similar objects and grouping them under homogeneous groups. A cluster is a group of objects that display high correlation with each other and low correlation with the variables in other clusters.

Procedure in Cluster Analysis

1. Defining the problem: First define the problem and decide upon the variables based on which the objects are to be clustered.
2. Selection of similarity or distance measures: The similarity measure examines the proximity between the objects; closer or similar objects are grouped together and the farther objects are ignored. There are three major methods to measure the similarity between objects: Euclidean distance measures, correlation coefficients, and association coefficients.
3. Selection of clustering approach: There are two types of clustering approaches:
   Hierarchical clustering approach: consists of either a top-down approach or a bottom-up approach. Prominent hierarchical clustering methods are single linkage, complete linkage, average linkage, Ward's method and the centroid method.
   Non-hierarchical clustering approach: a cluster center is first determined, and all the objects within the specified distance from the cluster center are included in the cluster.
4. Deciding on the number of clusters to be selected.
5. Interpreting the clusters.
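A brief sketch of the bottom-up (agglomerative) approach with Euclidean distance and Ward's method, on hypothetical data, using scipy:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical objects measured on two variables; two well-separated groups
data = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
                 [8.0, 8.0], [8.5, 8.2], [7.8, 7.7]])

# Bottom-up hierarchical clustering: Euclidean distance, Ward's method
tree = linkage(data, method='ward', metric='euclidean')

# Cut the tree into two clusters
labels = fcluster(tree, t=2, criterion='maxclust')
print(labels)  # the first three and last three objects fall in separate clusters
```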

Canonical Correlation Analysis (CCA)


CCA is a way of measuring the linear relationship between two multidimensional variables. CCA is an extension of multiple regression analysis (MRA): MRA analyses the linear relationship between a single dependent variable and multiple independent variables, whereas CCA analyses the linear relationship between multiple dependent variables and multiple independent variables. For example, a social researcher wants to know the relationship between various work environment factors (like work culture, HR policies, compensation structure, top management) and various employee behaviour elements (employee productivity, job satisfaction, perception of the company). The linear combination for each set of variables is called a canonical variable or canonical variate. CCA tries to maximize the correlation between the two canonical variables.

Canonical Correlation Analysis (CCA)


For example, let U represent the linear combination of work environment factors, U = a1X1 + a2X2 + a3X3 + a4X4, and let V represent the linear combination of employee behaviour factors, V = b1Y1 + b2Y2 + b3Y3 + b4Y4. The coefficients of each canonical variable are called canonical coefficients. To interpret the canonical analysis, the researcher examines the relative magnitude and the sign of the several weights defining each equation and sees whether a meaningful interpretation can be given. Being a complex statistical tool that requires a great investment of effort and computing resources, CCA has not gained as much popularity as statistical tools like multiple regression.

Multivariate Analysis of Variance (MANOVA)


MANOVA examines the relationship between several dependent variables and several independent variables It tries to examine whether there is any difference between various dependent variables with respect to the independent variables. For example, an industrial buyer wants to know whether the product from Company A, Company B and Company C differ in terms of various parameters (set by the company) such as quality, customer support, pricing and reliability. The difference between ANOVA and MANOVA is that while ANOVA deals with problems containing one dependent variable and several independent variables, MANOVA deals with problems containing several dependent variables and several independent variables. Another major difference is that the ANOVA test ignores interrelationship between the variables. This leads to biased results MANOVA considers this aspect by testing the mean difference between groups on two or more dependent variables simultaneously.

Books for Reference

SPSS for Windows Step by Step: A Simple Guide and Reference, Sixth Edition
Darren George and Paul Mallery
Pearson Education, 48, Ariya Gowda Road, West Mambalam, Chennai. Phone: 24803091, 92, 93, 94

Books for Reference

Statistics: Concepts and Applications


Nabendu Pal and Sahadeb Sarkar Prentice-Hall of India Private Limited, New Delhi.

Marketing Research Text and cases


Rajendra Nargundkar Tata McGraw-Hill Publishing Company Limited, New Delhi

Research Methodology
Panneerselvam.R Prentice-Hall of India Private Limited, New Delhi.
