
Data Preparation

University of Technology, Mauritius

BUSINESS RESEARCH METHODS


(Data preparation and preliminary data analysis)
Session 10

By Reza Beebeejaun

This session deals with how to code the data, input the data and clean the data. We will focus on preliminary data analysis techniques such as frequency distributions, and also discuss hypothesis testing using various analysis techniques.

Nature and scope of data preparation


Once the data are collected, it is important to use editing and coding procedures to input the data into the appropriate statistical software. Several steps are required to make the data ready for analysis; they generally involve data editing and coding, data entry, and data cleaning.

Data editing

The purpose of editing is to generate data which are:
- accurate
- consistent with the intent of the question and other information in the survey
- uniformly entered
- complete
- arranged to simplify coding and tabulation

One of the major editing problems concerns the faking of interviews. One of the best ways to detect fraudulent interviews is to add a few open-ended questions within the questionnaire.

Coding

Coding involves assigning numbers or other symbols to answers so that the responses can be grouped into a limited number of classes or categories; it entails the assignment of a numerical value to each individual response for each question within the survey. Coding closed-ended questions is much easier because they are structured questions whose response options are predetermined. For open-ended questions, content analysis is used, which provides an objective, systematic and quantitative description of the responses.

Data entry

Once the questionnaire is coded appropriately, researchers input the data into a statistical software package. There are various methods of data entry; manual data entry (keyboarding) remains a mainstay for researchers who need to create a data file immediately and store it in minimal space on a variety of media.

Data cleaning

Data cleaning focuses on error detection and consistency checks, as well as the treatment of missing responses. The first step in the data cleaning process is to check each variable for data that are out of range, or otherwise logically inconsistent. Such data must be corrected because they can hamper the overall analysis.
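As an illustration, here is a minimal sketch of SPSS syntax for such an out-of-range check, assuming a hypothetical questionnaire item q1 that should only contain the codes 1 to 5:

* Inspect the observed minimum and maximum of the hypothetical item q1.
FREQUENCIES VARIABLES=q1
  /FORMAT=NOTABLE
  /STATISTICS=MINIMUM MAXIMUM.
* Flag cases whose value falls outside the legitimate 1-5 range (1 = out of range, 0 = in range).
COMPUTE q1_out_of_range = (q1 < 1 OR q1 > 5).
EXECUTE.

Flagged cases can then be traced back to the original questionnaires and corrected.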

Preliminary data analysis

In this section we focus on the first stage of data analysis, which is mostly concerned with descriptive statistics. Descriptive statistics include the mean, standard deviation, range of scores, skewness and kurtosis. These statistics can be obtained using the Frequencies, Descriptives or Explore commands in SPSS. For analysis purposes, researchers group the primary scales of measurement (nominal, ordinal, interval and ratio) into two categories: nominal and ordinal scale based variables are called categorical variables (such as gender, marital status and so on), while interval and ratio scale based variables are called continuous variables (such as height, length, distance, temperature and so on).

Univariate Data Analysis

Univariate data analysis explores each variable in a data set separately.

It serves as a good method to check the quality of the data.

Inconsistencies or unexpected results should be investigated using the original data as the reference point.

Descriptive analysis process

Categorical variables: SPSS menu Analyse > Descriptive statistics > Frequencies (choose the appropriate variables and transfer them into the Variable(s) box using the arrow button; then choose the required analyses using the Statistics, Charts and Format buttons in the same window; press OK and the results appear in the output window).
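A sketch of the syntax these menu steps would paste, assuming hypothetical categorical variables gender and marital_status:

* Frequency tables and bar charts for two hypothetical categorical variables.
FREQUENCIES VARIABLES=gender marital_status
  /STATISTICS=MODE
  /BARCHART FREQ
  /ORDER=ANALYSIS.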

Frequencies

Frequencies can tell you whether many study participants share a characteristic of interest (age, gender, etc.). They let you look at the detailed information in nominal (category) data that describes the results. Graphs and tables can be helpful.

Frequency distributions: frequency tables, histograms.

Descriptive analysis process

Central tendency - the most common values for a variable:
- Mean: the arithmetic average
- Mode: the most common response
- Median: the middle response when the observations are arranged in order

Variability (dispersion) - how cases are distributed across the set of attributes of a variable:
- Variance: the spread of the data around the mean
- Range: the difference between the highest and lowest scores
- Standard deviation: the average amount of deviation from the mean within a group of scores; the greater the spread of scores, the greater the standard deviation

Shape of the overall distribution:
- Skewness: negatively skewed, no skew, or positively skewed
- Kurtosis

Descriptive analysis process


Continuous variables: SPSS menu Analyse > Descriptive statistics > Descriptives (choose all the continuous variables and transfer them into the Variable(s) box using the arrow button; then, under the Options button, choose the analyses you wish to perform; press OK and the results appear in the output window).
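A sketch of the corresponding syntax, assuming hypothetical continuous variables age and income:

* Basic descriptives, including skewness and kurtosis, for two hypothetical continuous variables.
DESCRIPTIVES VARIABLES=age income
  /STATISTICS=MEAN STDDEV MIN MAX RANGE SKEWNESS KURTOSIS.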

Descriptive analysis process


Checking normality using the Explore option
SPSS menu Analyse > Descriptive statistics > Explore:
1. Choose all the continuous variables and transfer them into the Dependent List box using the arrow button.
2. Click on the independent or grouping variable you wish to use (such as gender) and move it into the Factor List box.
3. In the Display section, tick Both.
4. Under the Plots button, tick Histogram and Normality plots with tests.
5. Click on the case id variable and move it into the Label Cases by box.
6. Under the Statistics button, tick Outliers.
7. Under the Options button, select Exclude cases pairwise.
8. Press OK and the results appear in the output window.
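A sketch of the syntax these steps would paste, assuming a hypothetical continuous variable income, a grouping variable gender and a case identifier id:

* Explore income by gender, with histograms, normality plots and extreme values.
EXAMINE VARIABLES=income BY gender
  /ID=id
  /PLOT BOXPLOT HISTOGRAM NPPLOT
  /COMPARE GROUPS
  /STATISTICS DESCRIPTIVES EXTREME
  /MISSING PAIRWISE
  /NOTOTAL.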

Normality test
The main things to look for are:
(a) The 5% trimmed mean: if there is a big difference between the original mean and the 5% trimmed mean, there are many extreme values in the dataset.
(b) Skewness and kurtosis values, which are also provided by this technique: a positive value of skewness indicates a long right tail, and a positive value of kurtosis indicates a high peak.

Normality test
(c) The tests of normality
SPSS produces two Sig. values: the first is for the Kolmogorov-Smirnov test, the second for the Shapiro-Wilk test.
- For sample sizes greater than 50 (n > 50), use the result from the Kolmogorov-Smirnov test.
- For sample sizes less than 50 (n < 50), use the result from the Shapiro-Wilk test.
- A Sig. value less than or equal to 0.05 means the data are not normally distributed; a Sig. value greater than 0.05 means the data are normally distributed.

The ratio of skewness to its standard error can be used as a test of normality: you can reject normality if the ratio is less than -2 or greater than +2. The same rule applies to the ratio of kurtosis to its standard error.

(d) The histograms provide a visual representation of the data distribution (approximately a bell-shaped curve if the data are normally distributed).

Bivariate analysis
Bivariate analysis explores relationships/associations and differences between two variables.

Common Bivariate Tests


Type of Measurement          Measures of Association
Nominal                      Chi-square, Phi coefficient
Ordinal Scales               Chi-square, Rank correlation
Interval and Ratio Scales    Correlation coefficient, Bivariate regression

Chi-square tests hypotheses about relationships between two nominal or ordinal level variables; correlation and regression look at measures of the strength of the relationship between two variables.

Measuring association: bivariate correlations, partial correlation, multiple correlation (multiple regression), crosstabs.
Measuring differences: t-tests, ANOVA.

Bivariate analysis- crosstabulation

We use cross-tabulation when:
- we want to demonstrate a relationship between two categorical variables (nominal or ordinal variables)
- we want a descriptive statistical measure that tells us whether differences among groups are large enough to indicate some sort of relationship among the variables.

Chi-square test

A chi-square test is used when you want to see whether there is a relationship between two categorical variables. Chi-square is simply an extension of a cross-tabulation that gives you more information about the relationship; however, it provides no information about the direction of the relationship (positive or negative) between the two variables. In SPSS, the CHISQ option on the STATISTICS subcommand of the CROSSTABS command is used to obtain the test statistic and its associated p-value. To conduct a chi-square test, the following conditions must be met:
- there must be at least 30 observations (people) in the table in total
- each cell must contain an expected count of 5 or more.
To conduct a chi-square test we compare the observed data (from the study results) with the data we would expect to see.
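A sketch of the corresponding syntax, assuming two hypothetical categorical variables gender and smoking_status:

* Cross-tabulation with the chi-square test, phi coefficient and expected counts.
CROSSTABS
  /TABLES=gender BY smoking_status
  /FORMAT=AVALUE TABLES
  /STATISTICS=CHISQ PHI
  /CELLS=COUNT EXPECTED
  /COUNT ROUND CELL.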


Different Scales, Different Measures of Association


Scale of Both Variables       Measures of Association
Nominal Scale                 Pearson chi-square (χ²)
Ordinal Scale                 Spearman's rho
Interval or Ratio Scale       Pearson r

Correlation

Correlation examines relationships among variables to show the strength and the direction of a relationship. A correlation tells you how, and to what extent, two variables are linearly related.
1. Rank correlation (ordinal variables)
2. Linear correlation (interval variables)

Rank correlation
Rank correlation is used for ordinal variables; the test used is Spearman's rho.
The correlation coefficient (r) may range from -1 to +1, where -1 or +1 indicates a perfect relationship. The further the coefficient is from 0, whether positive or negative, the stronger the relationship between the two variables.
- r > 0.70 indicates a strong correlation/relationship
- 0.30 < r < 0.70 indicates a moderate correlation
- r < 0.30 indicates a weak correlation
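A sketch of the corresponding syntax, assuming two hypothetical ordinal variables satisfaction and loyalty:

* Spearman's rho rank correlation between two hypothetical ordinal variables.
NONPAR CORR
  /VARIABLES=satisfaction loyalty
  /PRINT=SPEARMAN TWOTAIL NOSIG
  /MISSING=PAIRWISE.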

Linear correlation
Linear correlation is used for interval (continuous) variables; the test used is Pearson's r. For example, the correlation between height and weight: the hypothesis is that tall people are heavier than short people.

If the p-value < .05, there is a significant relationship; if the p-value > .05, there is no significant relationship.
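A sketch of the corresponding syntax for the height and weight example above (variable names assumed to be height and weight):

* Pearson correlation between height and weight.
CORRELATIONS
  /VARIABLES=height weight
  /PRINT=TWOTAIL NOSIG
  /MISSING=PAIRWISE.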


Partial correlations

The Partial Correlations procedure computes partial correlation coefficients that describe the linear relationship between two variables while controlling for the effects of one or more additional variables. Partial correlation is defined as the measure of the association between two variables after controlling or adjusting for the effects of one or more additional variables, e.g. height and weight controlling for BMI.
Testing the hypothesis:
- If the p-value > the significance level (0.05), do not reject the null hypothesis.
- If the p-value < the significance level (0.05), reject the null hypothesis.
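A sketch of the corresponding syntax for the height and weight example, controlling for a hypothetical bmi variable:

* Partial correlation between height and weight, controlling for bmi.
PARTIAL CORR
  /VARIABLES=height weight BY bmi
  /SIGNIFICANCE=TWOTAIL
  /MISSING=LISTWISE.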

Linear Regression

Linear Regression estimates the coefficients of the linear equation, involving one or more independent variables, that best predict the value of the dependent variable. For example, a regression between smoking and cough; the variables would be:
- cough (value is 1 if yes, 0 if not)
- smoking (value is 1 if yes, 0 if not)
i.e. dependent variable = cough, independent variable = smoking habits.
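A sketch of the corresponding syntax for the smoking and cough example (variable names assumed to be cough and smoking):

* Simple linear regression of cough on smoking.
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT cough
  /METHOD=ENTER smoking.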


Independent-Samples T Test

The independent-samples t test compares the means for two groups of cases to determine whether there is a significant difference between the means of the two groups, e.g. 1 = female, 2 = male; for example, comparing the mean cholesterol level for the two groups.
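A sketch of the corresponding syntax for the cholesterol example, assuming a grouping variable gender coded 1 and 2 and an outcome variable cholesterol:

* Independent-samples t test comparing mean cholesterol between the two gender groups.
T-TEST GROUPS=gender(1 2)
  /MISSING=ANALYSIS
  /VARIABLES=cholesterol
  /CRITERIA=CI(.95).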

Paired-Samples T Test

The paired-samples t test is used to test whether one continuous variable has a significantly higher mean value than another for the same cases in the same data file.

Repeated Measures are obtained on one group of participants, such as in measuring participants before a treatment is applied and again after the treatment.

Thus, each person serves as his/her own control, and because the two sets of scores to be compared are obtained from the same people, the two groups of scores are not independent.
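A sketch of the corresponding syntax, assuming hypothetical before and after measurements named score_before and score_after:

* Paired-samples t test comparing the same participants before and after a treatment.
T-TEST PAIRS=score_before WITH score_after (PAIRED)
  /CRITERIA=CI(.9500)
  /MISSING=ANALYSIS.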



ANOVA
Simple or one-way ANOVA is used to determine whether there is a significant difference between two or more means. Most statistical software programs will calculate ANOVA, although the output varies slightly between programs. ANOVA uses either the t-test or the F-test when comparing continuous variables between groups of study subjects:
- use a t-test for comparing 2 groups
- use an F-test for comparing 3 or more groups.
Both tests result in a p-value. Example: testing age differences between 2 groups. If the groups have similar average ages and a similar distribution of age values, the t-statistic will be small and the p-value will not be significant; if the average ages of the 2 groups are different, the t-statistic will be larger and the p-value smaller (a p-value < 0.05 indicates that the two groups have significantly different ages).
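A sketch of one-way ANOVA syntax, assuming a hypothetical continuous variable cholesterol and a grouping variable age_group with three or more groups:

* One-way ANOVA comparing mean cholesterol across age groups, with Tukey post-hoc tests.
ONEWAY cholesterol BY age_group
  /STATISTICS DESCRIPTIVES
  /MISSING ANALYSIS
  /POSTHOC=TUKEY ALPHA(0.05).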


Multivariate Analysis

Analysis that provides a simultaneous analysis of multiple independent and dependent variables.

Allows us to estimate the effects of each independent variable on the dependent variable


Multiple regression analysis

Multiple regression is very similar to simple regression, except that in multiple regression you have more than one predictor (independent) variable in the equation. The dependent variable is continuous, while the independent variables are entered as factors or covariates; factors should be categorical variables and covariates should be continuous variables.
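A sketch of multiple regression syntax, assuming a hypothetical continuous dependent variable job_satisfaction and three hypothetical predictors salary, tenure and workload:

* Multiple regression with three predictors entered simultaneously.
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT job_satisfaction
  /METHOD=ENTER salary tenure workload.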



Univariate Techniques

Format of analysis and findings chapter


- Introduction: purpose of this chapter
- What is being analysed - main objective
- Software and statistical tests used
- Population and response rate
- Demographics of respondents
- Reliability analysis
- Descriptive analysis
- Bivariate and multivariate analysis
- Hypothesis testing
