Preparing Preliminary Plan of Data Analysis
Questionnaire Checking
Editing
Coding
Transcribing
Data Cleaning
Statistically Adjusting the Data
Selecting a Data Analysis Strategy
Questionnaire Checking
Involves checking all questionnaires for completeness and interviewing quality
A questionnaire returned from the field may be unacceptable for several reasons:
Parts of the questionnaire may be incomplete
The questionnaire is received after the pre-established cutoff date
It is answered by someone who does not qualify for participation
The questionnaire is physically incomplete
Editing
Editing is done on filled-in questionnaires
Editing is the review of the questionnaires with the objective of increasing their accuracy and precision
It consists of screening the questionnaires to identify illegible, incomplete, inconsistent, or ambiguous responses
Editing
Treatment of unsatisfactory responses
Returning to the field
Assigning missing values
Discarding unsatisfactory respondents
Coding
Coding means assigning a code to each possible response to each question
Coding example
Do you currently have a valid passport?
1. Yes 2. No
Here the Yes response is coded as 1 and the No response is coded as 2
Codes allow the data to be entered and processed easily on a computer
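The passport example can be sketched in a few lines of Python. This is only an illustration: the dictionary name `PASSPORT_CODES`, the helper `code_responses`, and the sample answers are assumptions made for this sketch, not part of the original material.

```python
# Hypothetical code book entry for the passport question (Yes = 1, No = 2).
PASSPORT_CODES = {"Yes": 1, "No": 2}


def code_responses(responses, codes):
    """Return the numeric code for each raw answer; None marks an answer with no code."""
    return [codes.get(answer) for answer in responses]


# Made-up raw answers; "Maybe" has no code and would be flagged during editing.
raw_answers = ["Yes", "No", "No", "Yes", "Maybe"]
coded = code_responses(raw_answers, PASSPORT_CODES)
print(coded)  # [1, 2, 2, 1, None]
```

Answers that do not appear in the code book come back as None, which makes them easy to find before data entry.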
Coding
Researchers organize the coded data into fields, records, and files
Field
A collection of characters that represents a single type of data
Record
A collection of all related fields
File
A collection of all related records
Coding
The data matrix
A rectangular arrangement of data in rows and columns
Fields (columns): State, Population (in millions), Average Age, Cars per 1000
Records (rows):
State   Population (in millions)   Average Age   Cars per 1000
GUJ     4                          29.3          543
MAH                                26.1          550
RAJ     3.6                        29.2          460
DEL     2.4                        32.5          560
Coding
Code Book
A book containing coding instructions and the necessary information about the variables in the data set
It is used by researchers to promote more accurate and more efficient data entry
Coding
Pre-coding
Codes are assigned before the field work
Used for closed-ended questions
It is easy to code closed-ended questions
Post-coding
Codes are assigned after the field work
Used for open-ended questions
It is much more complex to code open-ended questions
Coding
Coding Rules
Appropriate to the research problem and purpose Exhaustive
Transcribing
Transcribing involves transferring the coded data from the questionnaires to a computer
Techniques used for transcribing the data include:
Computer Aided Telephonic Interviews (CATI)
Keypunching
Optical recognition
Digital technologies
Bar codes
Other technologies
Data Cleaning
Data cleaning includes consistency checks and treatment of missing responses
Although preliminary consistency checks are made during the editing stage, the checks at this stage are more thorough and extensive, because they are made by computer.
Data Cleaning
Consistency checks
A part of the data cleaning process that identifies the data that are out of range, logically inconsistent, or have extreme values
Missing responses
Represent values of a variable that are unknown, either because respondents provided ambiguous answers or because their answers were not properly recorded
Data Cleaning
Treatment of missing responses
Substitute a neutral value
A neutral value, typically the mean response to the variable, is substituted for the missing responses
Data Cleaning
Treatment of missing responses
Case wise deletion
Cases or respondents with any missing responses are discarded from the analysis
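The two treatments above, substituting a neutral value and case-wise deletion, can be compared side by side. The sketch below is only an illustration: the respondent records, the variable names `q1` and `q2`, and both helper functions are hypothetical.

```python
from statistics import mean

# Hypothetical ratings from four respondents; None marks a missing response.
cases = [
    {"id": 1, "q1": 5, "q2": 3},
    {"id": 2, "q1": None, "q2": 4},
    {"id": 3, "q1": 7, "q2": None},
    {"id": 4, "q1": 3, "q2": 5},
]


def substitute_mean(cases, variable):
    """Replace missing values of one variable with the mean of the observed values."""
    observed = [c[variable] for c in cases if c[variable] is not None]
    neutral = mean(observed)
    return [
        {**c, variable: neutral if c[variable] is None else c[variable]}
        for c in cases
    ]


def casewise_delete(cases, variables):
    """Keep only the cases that have a response for every listed variable."""
    return [c for c in cases if all(c[v] is not None for v in variables)]


filled = substitute_mean(cases, "q1")
complete = casewise_delete(cases, ["q1", "q2"])
print([c["q1"] for c in filled])    # [5, 5, 7, 3]
print([c["id"] for c in complete])  # [1, 4]
```

Mean substitution keeps all four cases, while case-wise deletion keeps only two of them, which shows how quickly deletion can shrink a small sample.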
Variable re specification
Involves transformation of data to create new variables or modify existing variables
Characteristics of data
Measurement scales used
Multivariate techniques
Suitable for analyzing the data when there are two or more measurements of each element and the variables are analyzed simultaneously
Univariate Techniques
Univariate Techniques: Metric Data, Nonmetric Data
Metric Data
Data that are measured on an interval or ratio scale
Nonmetric Data
Data that are measured on a nominal or ordinal scale
Univariate Techniques
[Diagram: classification of univariate techniques]
Metric data: one sample (z test, t test); two or more samples, independent or dependent (paired t test)
Nonmetric data: one sample; two samples, related
Multivariate Techniques
Multivariate Techniques Dependence Techniques Interdependence Techniques
Dependence Techniques
Appropriate when one or more variables can be identified as dependent and the remaining variables as independent
Interdependence techniques
The variables are not classified as dependent or independent; rather, the whole set of interdependent relationships is examined
Frequency Distribution
A mathematical distribution whose objective is to obtain a count of the number of responses associated with different values of one variable and to express these counts in percentage terms
Frequency Distribution
42 30 53 50 52 30 55 49 61 74 26 58 40 40 28 36 30 33 31 37 32 37 30 32 23 32 58 43 30 29 34 50 47 31 35 26 64 46 40 43 57 30 49 40 25 50 52 32 60 54
Ages of a Sample of Managers from Urban Child Care Centers in the United States
Frequency Distribution
Class Interval   Frequency   Relative Frequency   Cumulative Frequency
20-under 30      6           .12                  6
30-under 40      18          .36                  24
40-under 50      11          .22                  35
50-under 60      11          .22                  46
60-under 70      3           .06                  49
70-under 80      1           .02                  50
Total            50          1.00
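The frequency, relative frequency, and cumulative frequency columns can be rebuilt from the 50 ages listed earlier. The following is a minimal sketch; the function name `frequency_distribution` and its parameters are assumptions made for illustration.

```python
# Ages of the 50 managers from the sample above.
ages = [
    42, 30, 53, 50, 52, 30, 55, 49, 61, 74,
    26, 58, 40, 40, 28, 36, 30, 33, 31, 37,
    32, 37, 30, 32, 23, 32, 58, 43, 30, 29,
    34, 50, 47, 31, 35, 26, 64, 46, 40, 43,
    57, 30, 49, 40, 25, 50, 52, 32, 60, 54,
]


def frequency_distribution(values, start, width, n_classes):
    """Count values in the classes [start, start + width), [start + width, ...), etc."""
    counts = [0] * n_classes
    for v in values:
        index = (v - start) // width
        if not 0 <= index < n_classes:
            raise ValueError(f"{v} falls outside the class intervals")
        counts[index] += 1

    total = len(values)
    rows, cumulative = [], 0
    for i, count in enumerate(counts):
        cumulative += count
        lower = start + i * width
        label = f"{lower}-under {lower + width}"
        rows.append((label, count, round(count / total, 2), cumulative))
    return rows


table = frequency_distribution(ages, start=20, width=10, n_classes=6)
for row in table:
    print(row)
```

Each printed row holds the class interval, its frequency, its relative frequency, and the running cumulative frequency.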
For tables:
Does it have clear row and column headings?
Are the columns and rows in a logical sequence?
[Figure: "Employees by Grade", with axes Number and Grade; three further charts share a Years axis running from 10 to 80]
To Show Proportions
Pie chart
The most frequently used diagram to show proportion or share is the pie chart
A pie chart is a circular chart divided into sectors, illustrating proportion
It is perhaps the most ubiquitous statistical chart in the business world and the mass media
Pie chart
To Show Proportions
In a box plot, the box itself contains the middle 50% of the data
The upper edge of the box indicates the 75th percentile
The lower edge indicates the 25th percentile
Comparing Variables
To show specific values and interdependence
Contingency tables or cross tabulation
Is often used to record and analyze the relation between two or more categorical variables. It displays the frequency distribution of the variables in a matrix format.
Left-handed 9 4 13
Totals 52 48 100
Comparing Variables
To compare highest and lowest values
Comparison of the highest and lowest values is best explored using a multiple bar chart or a compound bar chart
[Figure: multiple bar chart, "Amount of sales during 1999 and 2010"; y-axis: Sales in lakhs (0 to 160); x-axis: Company (A, B, C); series: 1999, 2010]
Comparing Variables
To compare proportions
Comparison of proportions between variables uses either a percentage component bar chart or two or more pie charts
[Figure: percentage component bar chart, "Would you purchase this product again?"; y-axis: Percentage of respondents (0% to 120%); x-axis: Products (A, B, C, D); series: No, Unsure, Probably, Definitely]
Comparing Variables
To compare trends and conjunctions
The most suitable diagram for comparing trends between two or more variables is the multiple line graph
[Figure: multiple line graph, "Trends in Sale of ice cream"; y-axis: Sales in '000 liters (0 to 18); x-axis: Months (Jan to Dec); series: Chocolate, Strawberry, Vanilla]
Comparing Variables
To compare distribution of values
Often it is useful to compare the distribution of values for two or more variables
Multiple frequency polygons or bar charts are useful in such situations
[Figure: bar chart, "Amount of sales during 1999 and 2010"; y-axis: Sales in lakhs (0 to 160); x-axis: Company (A, B, C); series: 1999, 2010]
Comparing Variables
To show the relationship between cases for variables
A scatter graph or scatter plot is useful for showing the relationship between cases for two variables
Measures of variability
Range, Interquartile Range, Standard Deviation, and Coefficient of Variation
Measures of Shape
Skewness and Kurtosis
Measures of Location
Mean
The average: the value obtained by summing all elements in a set and dividing by the number of elements
The most commonly used measure of central tendency
Affected by each value in the data set, including extreme values
Measures of Location
Mean
\[ \bar{X} = \frac{\sum X}{n} = \frac{57 + 86 + 42 + 38 + 90 + 66}{6} = \frac{379}{6} = 63.167 \]
Measures of Location
Mode
The most frequently occurring value in a data set
Represents the highest peak of the distribution
Multimodal -- Data sets that contain more than two modes
Measures of Location
Mode
The mode is 44. There are more 44s than any other value.
35 37 37 39 40 40
41 41 43 43 43 43
44 44 44 44 44 45
45 46 46 46 46 48
Measures of Location
Median
Middle value in an ordered array of numbers. Unaffected by extremely large and extremely small values.
First Procedure
Arrange the observations in an ordered array. If there is an odd number of terms, the median is the middle term of the ordered array. If there is an even number of terms, the median is the average of the middle two terms.
Second Procedure
The median's position in an ordered array is given by (n+1)/2, where n = number of items in the array
Measures of Location
Median
Ordered Array: 3 4 5 7 8 9 11 14 15 16 16 17 19 19 20 21 22
There are 17 terms in the ordered array.
Position of median = (n+1)/2 = (17+1)/2 = 9
The median is the 9th term, 15.
If the 22 is replaced by 100, the median is 15.
If the 3 is replaced by -103, the median is 15.
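The three measures of location can be checked against the worked examples using Python's standard `statistics` module. The variable names below are chosen for this sketch only.

```python
from statistics import mean, median, mode

# Data from the mean example.
scores = [57, 86, 42, 38, 90, 66]

# Data from the mode example.
mode_data = [
    35, 37, 37, 39, 40, 40,
    41, 41, 43, 43, 43, 43,
    44, 44, 44, 44, 44, 45,
    45, 46, 46, 46, 46, 48,
]

# Ordered array from the median example.
ordered = [3, 4, 5, 7, 8, 9, 11, 14, 15, 16, 16, 17, 19, 19, 20, 21, 22]

print(round(mean(scores), 3))  # 63.167
print(mode(mode_data))         # 44
print(median(ordered))         # 15

# Replacing an extreme value leaves the median unchanged.
print(median(ordered[:-1] + [100]))  # 15
print(median([-103] + ordered[1:]))  # 15
```

The last two lines illustrate why the median is described as unaffected by extremely large or extremely small values.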
Median
Measures of shape
Skewness
Absence of symmetry Extreme values in one side of a distribution
Negatively Skewed
Positively Skewed
Measures of Shape
Skewness
Negatively skewed: mean < median < mode
Positively skewed: mode < median < mean
Measures of Shape
Kurtosis
A measure of the relative peakedness or flatness of the curve defined by the frequency distribution
The kurtosis of a normal distribution is zero (when measured as excess kurtosis)
If the kurtosis is positive, the distribution is more peaked than a normal distribution
Measures of variability
A statistic that indicates the distribution's dispersion
It measures the dispersion of the data in a distribution
Common measures of variability:
Range
Interquartile Range
Mean Absolute Deviation
Variance
Standard Deviation
Z scores
Coefficient of Variation
Measures of variability
Range
The difference between the largest and the smallest values in a set of data
Simple to compute
Ignores all data points except the two extremes
Example: Range = Largest - Smallest = 48 - 35 = 13
35 37 37 39 40 40 41 41 43 43 43 43 44 44 44 44 44 45 45 46 46 46 46 48
Measures of variability
Interquartile Range
Range of values between the first and third quartiles Range of the middle half Less influenced by extremes
\[ \text{Interquartile Range} = Q_3 - Q_1 \]
Measures of variability
Variance
The mean squared deviation of all the values from the mean
When data points are clustered around the mean, the variance is small
Data set: 5, 9, 16, 17, 18
Mean: \( \mu = \frac{\sum X}{N} = \frac{65}{5} = 13 \)
Deviations from the mean: -8, -4, +3, +4, +5
[Figure: number line from 0 to 20 showing each value's deviation from the mean]
Measures of variability
Standard Deviation
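The square root of the variance
Expressed in the same units as the data, rather than in squared units
Population variance and standard deviation:
\[ \sigma^2 = \frac{\sum (X - \mu)^2}{N}, \qquad \sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}} \]
Sample variance and standard deviation:
\[ S^2 = \frac{\sum (X - \bar{X})^2}{n - 1}, \qquad S = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} \]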
Measures of variability
Coefficient of variation
The coefficient of variation is the ratio of standard deviation to the mean, expressed as percentage
\[ CV = \frac{S}{\bar{X}} \times 100 \]
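The measures of variability can be computed for the small data set from the variance example (5, 9, 16, 17, 18). This is a sketch using the standard `statistics` module. The quartiles use that module's default "exclusive" method, which is an assumption here, since quartile conventions differ between textbooks and packages.

```python
from statistics import mean, pvariance, quantiles, stdev, variance

# Data set from the variance example.
data = [5, 9, 16, 17, 18]

# Range: largest minus smallest value.
data_range = max(data) - min(data)

# Interquartile range; other quartile conventions can give a different value.
q1, _, q3 = quantiles(data, n=4)
iqr = q3 - q1

# Population variance divides by N; sample variance divides by n - 1.
population_variance = pvariance(data)
sample_variance = variance(data)
sample_sd = stdev(data)

# Coefficient of variation: standard deviation relative to the mean, in percent.
cv = sample_sd / mean(data) * 100

print(data_range)           # 13
print(iqr)
print(population_variance)  # 26
print(sample_variance)      # 32.5
print(round(sample_sd, 3))
print(round(cv, 2))
```

The squared deviations (64, 16, 9, 16, 25) sum to 130, so dividing by N = 5 gives 26 and dividing by n - 1 = 4 gives 32.5.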
Hypothesis
A hypothesis is an assumption about a population parameter
Hypotheses about population parameters may be right or wrong
Hypothesis testing is concerned with determining whether a hypothesis is correct
\( H_0 : \mu = 500 \)
A one-tail test will reject the null hypothesis if the sample mean is significantly
Higher than the hypothesized population mean, or
Lower than the hypothesized population mean
Appropriate when
Null hypothesis is \( H_0 : \mu = \mu_0 \)
Alternative hypothesis is \( H_1 : \mu > \mu_0 \) or \( H_1 : \mu < \mu_0 \)
A two-tail test will reject the null hypothesis if the sample mean is significantly higher or lower than the hypothesized population mean, so there are two rejection areas
Appropriate when
Null hypothesis is \( H_0 : \mu = \mu_0 \)
Alternative hypothesis is \( H_1 : \mu \neq \mu_0 \)
n>30
n < 30
T- distribution t- table
Type - 2 error
Error of accepting the null hypothesis when it is false
Its probability is denoted by \( \beta \)
It becomes more likely when the significance level is low
There is a trade-off between these two types of errors
z - test
\[ Z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}} \]
t - test
\[ t = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}} \]
Cross tabulation
Contingency table
Hypothesis tests
Parametric Tests
Univariate Techniques
[Diagram: classification of hypothesis tests]
Parametric: one sample (z test, t test); two or more samples, independent or dependent (paired t test)
Non-parametric: one sample; two samples, independent or related
Parametric vs. Nonparametric Tests

Parametric test: Two-sample t-test
Goal: To see if two samples have identical population means
Nonparametric test: Wilcoxon rank-sum test
Goal: To see if two samples have identical population medians

Parametric test: (one-sample test of a mean)
Goal: To test a hypothesis about the mean of the population a sample was taken from
Nonparametric test: Wilcoxon signed ranks test
Goal: To test a hypothesis about the median of the population a sample was taken from

Parametric test: Chi-squared test for goodness of fit
Goal: To see if a sample fits a theoretical distribution, such as the normal curve
Nonparametric test: Kolmogorov-Smirnov test
Goal: To see if a sample could have come from a certain distribution

Parametric test: ANOVA
Goal: To see if two or more sample means are significantly different
Nonparametric test: Kruskal-Wallis test
Goal: To test if two or more sample medians are significantly different
Non-parametric
Not Normal Ordinal or Nominal Any Median Simplicity; Less affected by outliers
Benefits
\[ Z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}} \]
\[ t = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}} \]
Several hypotheses in marketing relate the parameters of two different populations. For example:
Brand perception of users and non-users
Spending habits of high-income consumers and low-income consumers, etc.
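\[ H_0 : \mu_1 = \mu_2 \qquad H_1 : \mu_1 \neq \mu_2 \]
\[ z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sigma_{\bar{x}_1 - \bar{x}_2}} \]
\[ \sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \]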
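To see if two samples have identical population means
Used when the sample size is small
The t statistic is computed as follows:
\[ t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sigma_{\bar{x}_1 - \bar{x}_2}} \]
\[ \sigma_{\bar{x}_1 - \bar{x}_2} = s_p \times \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \]
\[ s_p = \sqrt{\frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}} \]
DOF = n1 + n2 - 2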
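The pooled two-sample t statistic can be sketched directly from the formulas for t, the pooled standard deviation, and the degrees of freedom. The two samples below are made-up numbers for illustration, and `pooled_t` is a hypothetical helper; the null hypothesis is taken as equal population means, so the hypothesized difference is 0.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(sample_1, sample_2, hypothesized_diff=0.0):
    """Return (t, degrees of freedom) for the pooled-variance two-sample t test."""
    n1, n2 = len(sample_1), len(sample_2)
    s1_sq, s2_sq = variance(sample_1), variance(sample_2)  # sample variances (n - 1)
    dof = n1 + n2 - 2
    pooled_sd = sqrt(((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / dof)
    standard_error = pooled_sd * sqrt(1 / n1 + 1 / n2)
    t = ((mean(sample_1) - mean(sample_2)) - hypothesized_diff) / standard_error
    return t, dof


# Hypothetical data, not taken from the original material.
group_a = [20, 22, 19, 24, 25]
group_b = [18, 17, 21, 16, 18]

t_value, dof = pooled_t(group_a, group_b)
print(round(t_value, 3), dof)  # 2.828 8
```

The computed t would then be compared with the critical value from a t table with n1 + n2 - 2 degrees of freedom.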
F test (ANOVA)
An F test of sample variances can be performed to determine whether the two populations have equal variances
In this case the hypotheses take the following form:
\[ H_0 : \sigma_1^2 = \sigma_2^2 \qquad H_1 : \sigma_1^2 \neq \sigma_2^2 \]
\[ F_{(n_1 - 1),\,(n_2 - 1)} = \frac{s_1^2}{s_2^2} \]
Where
n1 = Size of sample 1
n2 = Size of sample 2
n1 - 1 = Degrees of freedom for sample 1
n2 - 1 = Degrees of freedom for sample 2
s1^2 = Sample variance for sample 1
s2^2 = Sample variance for sample 2
\[ H_0 : \mu_D = 0 \qquad H_1 : \mu_D \neq 0 \]
\[ t_{n-1} = \frac{\bar{D} - \mu_D}{S_D / \sqrt{n}} \]
\[ K = \max \lvert A_i - O_i \rvert \]
Where,
A_i = Cumulative relative frequency for each category of the theoretical distribution
O_i = Comparable value of the sample frequency
A run is a sequence of identical occurrences that are followed and preceded by different occurrences.
Example: The list of X's and O's below consists of 7 runs.
xxxooooxxooooxxxxoox
Suppose r is the number of runs, n1 is the number of type 1 occurrences and n2 is the number of type 2 occurrences.
Example: A stock exhibits a pattern of price increases (+) and decreases (-) over 25 business days. Test at the 1% level whether the pattern is random.
The observed sequence has r = 16 runs, with n1 = 13 increases (+) and n2 = 12 decreases (-).
The critical values of z at the 1% level are -2.575 and 2.575.
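The runs test can be sketched as follows. Counting runs follows the definition given above. The mean and variance of r are not stated in this material, so the standard large-sample (Wald-Wolfowitz) formulas are used here as an assumption: mean = 2·n1·n2/(n1 + n2) + 1 and variance = 2·n1·n2·(2·n1·n2 - n1 - n2) / ((n1 + n2)^2 · (n1 + n2 - 1)). The function names are hypothetical.

```python
from math import sqrt


def count_runs(sequence):
    """Count maximal blocks of identical neighbouring symbols."""
    if not sequence:
        return 0
    return 1 + sum(1 for a, b in zip(sequence, sequence[1:]) if a != b)


def runs_test_z(r, n1, n2):
    """z statistic for the number of runs (standard large-sample formulas, assumed)."""
    n = n1 + n2
    mean_r = 2 * n1 * n2 / n + 1
    var_r = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))
    return (r - mean_r) / sqrt(var_r)


print(count_runs("xxxooooxxooooxxxxoox"))  # 7

# Stock example: r = 16 runs, 13 increases, 12 decreases.
z = runs_test_z(16, 13, 12)
print(round(z, 2))     # 1.03
print(abs(z) > 2.575)  # False
```

Under these assumed formulas z is about 1.03, which lies between -2.575 and 2.575, so randomness would not be rejected at the 1% level.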
Sometimes variables are not normally distributed, or the samples taken are so small that one cannot tell whether they come from a normal distribution. Using the t-test to tell whether there is a significant difference between samples is not appropriate in these cases.
The Mann-Whitney U-test can be used in these situations. This test can be used for very small samples.
\[ U = n_1 n_2 + \frac{n_1 (n_1 + 1)}{2} - R_1 \]
Ranks of sample 1: 1 2 3 4 5 6 9 (n1 = 7, R1 = 30)
Ranks of sample 2: 7 8 10 11 12 (n2 = 5, R2 = 48)
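Plugging the two sets of ranks into the U formula gives a quick check. In this sketch the helper name `mann_whitney_u` is hypothetical, and the second statistic (computed from the ranks of sample 2 by the same formula with the roles swapped) is included as an assumption so that the usual identity U1 + U2 = n1·n2 can be verified.

```python
def mann_whitney_u(ranks_1, ranks_2):
    """Return (U1, U2) computed from the ranks of each sample in the merged ordering."""
    n1, n2 = len(ranks_1), len(ranks_2)
    r1, r2 = sum(ranks_1), sum(ranks_2)
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    return u1, u2


# Ranks from the example: R1 = 30 with n1 = 7, R2 = 48 with n2 = 5.
sample_1_ranks = [1, 2, 3, 4, 5, 6, 9]
sample_2_ranks = [7, 8, 10, 11, 12]

u1, u2 = mann_whitney_u(sample_1_ranks, sample_2_ranks)
print(u1, u2)         # 33.0 2.0
print(u1 + u2 == 35)  # True, since n1 * n2 = 35
```

The resulting U is then compared with the tabulated critical value for the given n1 and n2.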
Based on the following samples from two universities, test at the 10% level whether graduates from the two schools have the same average grade on an aptitude test.
First merge and rank the grades. Sum the ranks for each sample.
University A grades: 50 52 56 60 64 68 71 74 89 95
University B grades: 70 73 77 80 83 85 87 88 96 99
rank        1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20
grade       50 52 56 60 64 68 70 71 73 74 77 80 83 85 87 88 89 95 96 99
university  A  A  A  A  A  A  B  A  B  A  B  B  B  B  B  B  A  A  B  B
Rank sum for university A: 74
Rank sum for university B: 136
Note: If there are ties, each value gets the average rank. For example, if 2 values tie for 7th and 8th place, both are ranked 7.5.
Here, the group from university A is considered the 1st sample. When the samples differ in size, designate the smaller of the 2 samples as the 1st sample. Define W = sum of the ranks for the 1st sample.
The mean of W is
\[ \mu_W = \frac{n_1 (n_1 + n_2 + 1)}{2}, \]
and the variance of W is
\[ \sigma_W^2 = \frac{n_1 n_2 (n_1 + n_2 + 1)}{12}. \]
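[Figure: standard normal curve with an area of .45 on each side of 0 and a critical region of area .05 in each tail, beyond z = -1.645 and z = 1.645]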
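The university example can be worked through end to end: rank the merged grades, sum the ranks of the first sample to get W, and standardize W with its mean and variance. This is a sketch; the function names are hypothetical, and the normal approximation for W is assumed, as the critical values of ±1.645 suggest.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 upward; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def rank_sum_z(sample_1, sample_2):
    """Return (W, z), where W is the rank sum of sample_1 in the merged data."""
    n1, n2 = len(sample_1), len(sample_2)
    ranks = average_ranks(list(sample_1) + list(sample_2))
    w = sum(ranks[:n1])
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return w, (w - mean_w) / sd_w


university_a = [50, 52, 56, 60, 64, 68, 71, 74, 89, 95]
university_b = [70, 73, 77, 80, 83, 85, 87, 88, 96, 99]

w, z = rank_sum_z(university_a, university_b)
print(w)               # 74.0
print(round(z, 2))     # -2.34
print(abs(z) > 1.645)  # True
```

With W = 74, mean 105, and variance 175, z is about -2.34. This falls in the lower critical region, so the hypothesis of equal average grades would be rejected at the 10% level.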
The Kruskal-Wallis statistic is
\[ K = \frac{12}{n(n+1)} \sum_{j} \frac{R_j^2}{n_j} - 3(n+1), \]
where n_j is the number of observations in the jth sample, n is the total number of observations, and R_j is the sum of ranks for the jth sample.
If each \( n_j \geq 5 \) and the null hypothesis is true, then the distribution of K is \( \chi^2 \) with dof = c - 1, where c is the number of sample groups.
Kruskal-Wallis Test Example: Test at the 5% level whether average employee performance is the same at 3 firms, using the following standardized test scores for 20 employees.
n3 =7
We rank all the scores. Then we sum the ranks for each firm. Then we calculate the K statistic.
[Figure: chi-square density f(χ²) with 2 dof, showing the critical region in the upper tail]
From the χ² table, we see that the 5% critical value for a χ² with 2 dof is 5.991. Since our value for K was 6.641, we reject H0 that the means are the same and accept H1 that the means are different.
Sign Test
Earliest form of non-parametric testing, developed in 1710
Compares matched pairs of data
Looks at the direction of the difference within each pair
Not interested in the size of the difference
Expressed as
0 if no difference
+ if 1st of pair is higher (or = 1)
Ignore the 0's
Let the test statistic = S = no. of + signs
Ho = no differences in the population
The sampling distribution of S is binomial, and the probability of success (getting a + sign) is 0.5
Sampling dist of S
Mean: \( \mu_S = np \)
Std Dev: \( \sigma_S = \sqrt{npq} \)
When \( n \geq 10 \), the normal distribution can be used as a good approximation to the binomial
Hence the test statistic is
\[ Z = \frac{S - \mu_S}{\sigma_S} \]
Example
A soft drink distributor wants to see if people prefer the old diet cola to the new diet cola (e.g. Coke Zero)
n = 11 customers
7 prefer the new brand, 4 still like the old brand
Let new brand = +
Hypo test
Ho: Preferences for the 2 brands are similar (p = 0.5)
Ha: Preferences for the 2 brands are not similar
Test at α = 0.05, so Zcrit = ±1.96
Let S = 7 = no. of + preferences
Then μ = np = 11*0.5 = 5.5
σ = 0.5*(11)^0.5 = 1.66
So Zcalc = (7 - 5.5)/1.66 = 0.9
Hence Ho can't be rejected
So customers are indifferent between the brands of cola; if they are to be sold at the same price, equal quantities should be stocked
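The sign test arithmetic for the cola example can be reproduced in a few lines. This is a sketch using the normal approximation to the binomial described for the sampling distribution of S; the function name `sign_test_z` is hypothetical.

```python
from math import sqrt


def sign_test_z(plus_count, n, p=0.5):
    """z statistic for the sign test using the normal approximation to the binomial."""
    mu = n * p
    sigma = sqrt(n * p * (1 - p))
    return (plus_count - mu) / sigma


# Cola example: 11 customers, 7 prefer the new brand (+).
z_calc = sign_test_z(plus_count=7, n=11)
print(round(z_calc, 2))    # 0.9
print(abs(z_calc) > 1.96)  # False
```

Because |Zcalc| is below 1.96, the null hypothesis of similar preferences is not rejected at the 0.05 level.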
Correlation
A measure of association between two numerical variables.
Example: typically, in the summer, as the temperature increases people are thirstier.
It measures the linear relationship between two variables
It does not imply that either variable causes the other
Correlation
For seven random summer days, a person recorded the temperature and their water consumption, during a three-hour period spent outside.
Temperature (F)               75  83  85  85  92  97  99
Water Consumption (ounces)    16  20  25  27  32  48  48
Correlation
Correlation
Pearson's Sample Correlation Coefficient, r
Measures the direction and the strength of the linear association between two numerical paired variables.
Correlation
r value   Interpretation
1         perfect positive linear relationship
0         no linear relationship
-1        perfect negative linear relationship
Correlation
Direction of correlation Positive Correlation Negative Correlation
Correlation
Strength of correlation
r value   Interpretation
0.9       strong association
0.5       moderate association
0.25      weak association
Correlation
Strength of correlation
\[ r = \frac{COV_{XY}}{S_X S_Y} \]
Σ = the sum
n = number of paired items
x_i = input variable
x̄ = x-bar = mean of the x's
s_x = standard deviation of the x's
y_i = output variable
ȳ = y-bar = mean of the y's
s_y = standard deviation of the y's
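Pearson's r for the temperature and water consumption data can be computed from the definition above, as the covariance divided by the product of the two standard deviations. In this sketch the sample covariance is assumed to use the n - 1 divisor, matching the sample standard deviations, and the helper name `pearson_r` is hypothetical.

```python
from statistics import mean, stdev


def pearson_r(x, y):
    """Sample correlation: covariance (n - 1 divisor) over the product of sample SDs."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    covariance = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)
    return covariance / (stdev(x) * stdev(y))


temperature_f = [75, 83, 85, 85, 92, 97, 99]
water_ounces = [16, 20, 25, 27, 32, 48, 48]

r = pearson_r(temperature_f, water_ounces)
print(round(r, 3))  # 0.963
```

An r of about 0.96 indicates a strong positive linear association between temperature and water consumption in this small sample.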