Presentation 1

Module - 3
Data Analysis And Presentation
The Data Preparation Process
Preparing a preliminary plan of data analysis
Questionnaire checking
Editing
Coding
Transcribing
Data cleaning
Statistically adjusting the data
Selecting a data analysis strategy
Questionnaire Checking
Involves checking all questionnaires for completeness and interviewing quality
A questionnaire returned from the field may be unacceptable for several reasons:
Parts of the questionnaire may be incomplete
The questionnaire is received after the pre-established cutoff date
It is answered by someone who does not qualify for participation
The questionnaire is physically incomplete
Editing
Editing is done on filled-in questionnaires
Editing is the review of the questionnaires with the objective of increasing their accuracy and precision
It consists of screening the questionnaires to identify illegible, incomplete, inconsistent or ambiguous responses
Treatment of unsatisfactory responses
Returning to the field
Assigning missing values
Discarding unsatisfactory respondents
Coding
Coding means assigning a code to each possible response to each question
Coding example
Do you currently have a valid passport?
1. Yes 2. No
Here the Yes response is coded as 1 and the No response is coded as 2
Codes allow the data to be easily entered and processed on a computer
Researchers arrange and organize the coded data into fields, records and files
Field
A collection of characters that represents a single type of data
Record
A collection of all related fields
File
A collection of all related records
The data matrix
A rectangular arrangement of data in rows and columns; each column is a field and each row is a record
Record   State               Population (in millions)   Average Age   Cars per 1000
Row-1    Gujarat (GUJ)       4                          29.3          543
Row-2    Maharashtra (MAH)   (missing)                  26.1          550
Row-3    Rajasthan (RAJ)     3.6                        29.2          460
Row-4    Delhi (DEL)         2.4                        32.5          560
Code Book
A book containing coding instructions and the necessary information about the variables in the data set
It is used by researchers to promote more accurate and more efficient data entry
Pre-coding
Codes are assigned before the field work
Used for closed-ended questions
It is easy to code closed-ended questions
Post-coding
Codes are assigned after the field work
Used for open-ended questions
It is much more complex to code open-ended questions
Coding
Coding Rules
Appropriate to the research problem and purpose Exhaustive
Transcribing
Transcribing involves transferring the coded data from the questionnaires to a computer
Techniques used for transcribing the data include:
Computer-Assisted Telephone Interviewing (CATI)
Keypunching
Optical recognition
Digital technologies
Bar codes
Other technologies
Data Cleaning
Data cleaning includes consistency checks and treatment of missing responses
Although preliminary consistency checks are made during the editing stage, the checks at this stage are more thorough and extensive because they are made by computer
Consistency checks
A part of the data cleaning process that identifies data that are out of range, logically inconsistent, or have extreme values
Missing responses
Represent values of a variable that are unknown, either because respondents provided ambiguous answers or because their answers were not properly recorded
Treatment of missing responses
Substitute a neutral value
A neutral value, typically the mean response to the variable, is substituted for the missing responses
Substitute an imputed response
The respondent's pattern of responses to other questions is used to impute a suitable response to the missing question
Casewise deletion
Cases or respondents with any missing responses are discarded from the analysis
Pairwise deletion
Instead of discarding all cases with missing values, the researcher uses only the cases with complete responses for the variables involved in each calculation
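The treatments of missing responses described here can be sketched in a few lines of Python. This is only an illustrative sketch: the records, the variable names q1 to q3 and the helper function names are hypothetical and do not come from the slides. None marks a missing response.

```python
from statistics import mean

# Hypothetical survey records (not from the slides); None marks a missing response.
records = [
    {"q1": 4, "q2": 5, "q3": 3},
    {"q1": None, "q2": 2, "q3": 4},
    {"q1": 2, "q2": None, "q3": 5},
    {"q1": 5, "q2": 4, "q3": None},
]

def substitute_mean(rows, var):
    """Neutral-value substitution: fill missing values of var with the mean of the observed values."""
    observed = [r[var] for r in rows if r[var] is not None]
    fill = mean(observed)
    return [{**r, var: fill if r[var] is None else r[var]} for r in rows]

def casewise_delete(rows):
    """Casewise deletion: keep only cases with no missing response at all."""
    return [r for r in rows if all(v is not None for v in r.values())]

def pairwise_delete(rows, var_a, var_b):
    """Pairwise deletion: keep cases that are complete for the two variables being analyzed."""
    return [r for r in rows if r[var_a] is not None and r[var_b] is not None]

filled_q1 = substitute_mean(records, "q1")
complete_cases = casewise_delete(records)
q1_q2_cases = pairwise_delete(records, "q1", "q2")

print(len(complete_cases), "case(s) survive casewise deletion")
print(len(q1_q2_cases), "case(s) are usable for a q1-q2 analysis")
```

Casewise deletion keeps the fewest cases, which is why pairwise deletion is often preferred when missing values are scattered across variables.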
Statistically adjusting the data

Procedures for statistically adjusting the data consist of:
Weighting
Variable re-specification
Scale transformation

Weighting
Each case or respondent in the database is assigned a weight to reflect its importance relative to other cases or respondents
Variable re-specification
Involves transforming the data to create new variables or modify existing variables

Scale transformation
Involves modifying the assigned scale values to make the scale more useful
Another objective is to increase the comparability between scales
Selecting a data analysis strategy

While selecting a data analysis strategy, the following should be taken into consideration
Earlier steps of marketing research
Problem identification, research design etc.
Characteristics of data
Measurement scales used
Properties of statistical technique

A technique suitable for your research should be selected
Background and philosophy of researcher
Classification of statistical techniques

Statistical techniques can be classified as univariate and multivariate
Univariate techniques
Appropriate when there is a single measurement of each element in the sample, or when there are several measurements of each element but each variable is analyzed in isolation
Multivariate techniques
Suitable for analyzing data when there are two or more measurements of each element and the variables are analyzed simultaneously
Univariate Techniques
Univariate techniques are classified by the type of data: metric or nonmetric
Metric Data
Data that are measured on an interval or ratio scale
Nonmetric Data
Data that are measured on a nominal or ordinal scale
Univariate Techniques for Metric Data
One sample
o z Test
o t Test
Two or more samples, independent
o Two Group t Test
o Two Group z Test
o One way ANOVA
Two or more samples, dependent (related)
o Paired t Test
Univariate Techniques for Nonmetric Data
One sample
o Frequency
o Chi-Square
o Kolmogorov-Smirnov
o Runs
o Binomial
Two or more samples, independent
o Chi-Square
o Mann-Whitney
o Median
o Kolmogorov-Smirnov (K-S)
o Kruskal-Wallis (K-W) ANOVA
Two or more samples, related
o Sign
o Wilcoxon
o McNemar
o Chi-Square
Multivariate Techniques
Multivariate techniques are classified as dependence techniques and interdependence techniques
Dependence Techniques
Appropriate when one or more variables can be identified as dependent and the remaining variables as independent
Interdependence techniques
The variables are not classified as dependent or independent; rather, the whole set of interdependent relationships is examined
Frequency Distribution
A mathematical distribution whose objective is to obtain a count of the number of responses associated with different values of one variable and to express these counts in percentage terms
Ages of a Sample of Managers from Urban Child Care Centers in the United States
42 30 53 50 52 30 55 49 61 74
26 58 40 40 28 36 30 33 31 37
32 37 30 32 23 32 58 43 30 29
34 50 47 31 35 26 64 46 40 43
57 30 49 40 25 50 52 32 60 54
Class Interval   Frequency   Relative Frequency   Cumulative Frequency
20-under 30       6          .12                   6
30-under 40      18          .36                  24
40-under 50      11          .22                  35
50-under 60      11          .22                  46
60-under 70       3          .06                  49
70-under 80       1          .02                  50
Total            50         1.00
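As a small illustration, the frequency, relative frequency and cumulative frequency columns can be rebuilt from the 50 ages listed for the managers. This is a minimal sketch using only the Python standard library; the class width of 10 matches the class intervals used in the table, and the variable names are illustrative.

```python
from collections import Counter

# The 50 ages of managers listed in the slides.
ages = [
    42, 30, 53, 50, 52, 30, 55, 49, 61, 74,
    26, 58, 40, 40, 28, 36, 30, 33, 31, 37,
    32, 37, 30, 32, 23, 32, 58, 43, 30, 29,
    34, 50, 47, 31, 35, 26, 64, 46, 40, 43,
    57, 30, 49, 40, 25, 50, 52, 32, 60, 54,
]

width = 10  # class width, as in "20-under 30"
# Map every age to the lower bound of its class interval and count.
counts = Counter((age // width) * width for age in ages)

total = len(ages)
cumulative = 0
table = []
for lower in sorted(counts):
    freq = counts[lower]
    cumulative += freq
    table.append((f"{lower}-under {lower + width}", freq, freq / total, cumulative))

for label, freq, rel, cum in table:
    print(f"{label:<12} {freq:>3} {rel:>6.2f} {cum:>4}")
```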
Graphical Presentation of data

Once your data have been entered and checked for errors, you are ready for analysis
Data can be presented using various diagrams, graphs and tables

The following checklist should be considered while using diagrams and tables
For diagrams:
Does it have clear axis labels?
Are the bars and their components in a logical sequence?
Is a key or legend included?
For tables:
Does it have clear row and column headings?
Are the columns and rows in a logical sequence?
For both diagrams and tables:
Does it have a brief but clear and descriptive title?
Are the units of measurement clearly stated?
Are the sources of the data clearly stated?
Are there notes to explain abbreviations and unusual terminology?

Tables and graphs can be used to present the following types of information:
Specific values
Highest and lowest values
Trends over time
Proportions
Distributions
Conjunctions
Totals
Interdependence and relationships
To show specific values

The simplest ways of summarizing the data for individual variables so that specific values can be read are:
Tables
Frequency distributions
To show highest and lowest values

A table attaches no visual significance to the highest or lowest value
Diagrams, charts and graphs do this job better
The following types of diagrams can be used for this purpose:
Bar chart
Histogram
Line graph / Frequency polygon

Bar chart
[Figure: bar chart "Employees by Grade"; x-axis: Grade (Production, Marketing, Finance, HR); y-axis: Number, 0 to 70]

Histogram:
A histogram is a series of rectangles, each proportional in width to the range of values within a class and proportional in height to the number of items falling in that class
Advantages:
Each rectangle clearly shows a separate class in the distribution
The area of each rectangle, relative to all other rectangles, shows the proportion of the total number of observations that occur in that class

Histogram
[Figure: histogram of the managers' ages by class interval (20-under 30 to 70-under 80); x-axis: Years, 0 to 80; y-axis: Frequency, 0 to 20]

Line graph / Frequency polygon
A frequency polygon is a line graph of frequencies
Each class frequency is plotted as a dot above the class midpoint, and successive dots are connected by straight lines to form a polygon
Advantages of the polygon
The frequency polygon is simpler than its histogram counterpart
It sketches an outline of the data pattern more clearly

Frequency polygon
[Figure: frequency polygon of the managers' ages by class interval (frequencies 6, 18, 11, 11, 3, 1); x-axis: Years, 0 to 80; y-axis: Frequency, 0 to 20]
To Show Proportions
Pie chart
The most frequently used diagram to show proportions or shares is the pie chart
A pie chart is a circular chart divided into sectors, illustrating proportion.
The pie chart is perhaps the most ubiquitous statistical chart in the business world and the mass media.
To Show Distribution of values

To show the distribution of values of a variable, the following diagrams can be used
Pie chart
Frequency polygon
Bar chart
Box plot diagram
To Show Distribution of values

Box plot diagram
It summarizes the following statistical measures
Median
Upper and lower quartiles
Minimum and maximum data values
The box itself contains the middle 50% of the data
The upper edge of the box indicates the 75th percentile
The lower edge indicates the 25th percentile
Comparing Variables
To show specific values and interdependence
Contingency tables or cross tabulation
It is often used to record and analyze the relation between two or more categorical variables. It displays the frequency distribution of the variables in a matrix format.
           Right-handed   Left-handed   Totals
Males      43             9             52
Females    44             4             48
Totals     87             13            100
Comparing Variables
To compare highest and lowest values
Comparisons of highest and lowest values are best explored using a multiple bar chart or a compound bar chart
[Figure: multiple bar chart "Amount of sales during 1999 and 2010"; x-axis: Company (Company A, Company B, Company C); y-axis: Sales in Lakhs, 0 to 160; series: 1999, 2010]
Comparing Variables
To compare proportions
Comparisons of proportions between variables use either a percentage component bar chart or two or more pie charts
[Figure: percentage component bar chart "Would you purchase this product again?"; x-axis: Products (Product A to Product D); y-axis: Percentage of respondents; series: No, Unsure, Probably, Definitely]
Comparing Variables
To compare trends and conjunctions
The most suitable diagram for comparing trends between two or more variables is a multiple line graph
[Figure: multiple line graph "Trends in Sale of ice cream"; x-axis: Months (Jan to Dec); y-axis: Sales in '000 liters; series: Chocolate, Strawberry, Vanilla]
Comparing Variables
To compare distribution of values
Often it is useful to compare the distribution of values for two or more variables
Multiple frequency polygons or bar charts are useful in such situations
[Figure: multiple bar chart "Amount of sales during 1999 and 2010"; x-axis: Company (Company A, Company B, Company C); y-axis: Sales in Lakhs, 0 to 160; series: 1999, 2010]
Comparing Variables
To show the relationship between cases for variables
To show how the cases relate on two variables, a scatter graph (scatter plot) is useful: each case is plotted as a point using its values on the two variables
Statistics Associated With Frequency Distribution

The most commonly used statistics associated with a frequency distribution are:
Measures of location
Mean, Median and Mode
Measures of variability
Range, Interquartile range, Standard deviation and Coefficient of variation
Measures of shape
Skewness and Kurtosis
Measures of Location
Mean
The average: the value obtained by summing all elements in a set and dividing by the number of elements
The most commonly used measure of central tendency
Affected by each value in the data set, including extreme values
$\bar{X} = \frac{\sum X}{n} = \frac{57 + 86 + 42 + 38 + 90 + 66}{6} = \frac{379}{6} = 63.167$
Mode
The most frequently occurring value in a data set Represent the highest peak of the distribution Multimodal -- Data sets that contain more than two modes
Mode
The mode is 44. There are more 44s than any other value.
35 37 37 39 40 40
41 41 43 43 43 43
44 44 44 44 44 45
45 46 46 46 46 48
Median
Middle value in an ordered array of numbers. Unaffected by extremely large and extremely small values.
First Procedure
Arrange the observations in an ordered array. If there is an odd number of terms, the median is the middle term of the ordered array. If there is an even number of terms, the median is the average of the middle two terms.
Second Procedure
The median's position in an ordered array is given by (n+1)/2, where n = number of items in the array
Ordered array: 3 4 5 7 8 9 11 14 15 16 16 17 19 19 20 21 22
There are 17 terms in the ordered array.
Position of median = (n+1)/2 = (17+1)/2 = 9
The median is the 9th term, 15.
If the 22 is replaced by 100, the median is 15.
If the 3 is replaced by -103, the median is 15.
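The three measures of location can be checked against the slide examples with Python's standard statistics module. This is a minimal sketch; the variable names are illustrative, and the data are the example values given in the slides.

```python
from statistics import mean, median, mode

# Mean example: six values that sum to 379.
scores = [57, 86, 42, 38, 90, 66]
mean_score = mean(scores)

# Mode example: the value 44 occurs more often than any other value.
values = [35, 37, 37, 39, 40, 40, 41, 41, 43, 43, 43, 43,
          44, 44, 44, 44, 44, 45, 45, 46, 46, 46, 46, 48]
most_common = mode(values)

# Median example: 17 ordered terms, so the median is the 9th term.
ordered = [3, 4, 5, 7, 8, 9, 11, 14, 15, 16, 16, 17, 19, 19, 20, 21, 22]
middle = median(ordered)

# Replacing an extreme value leaves the median unchanged.
with_outlier = ordered[:-1] + [100]
middle_with_outlier = median(with_outlier)

print(round(mean_score, 3), most_common, middle, middle_with_outlier)
```

The last two values are equal, which illustrates why the median is described as unaffected by extremely large or small values.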
Measures of shape
Skewness
Absence of symmetry
Extreme values on one side of a distribution
Negatively skewed: Mean < Median < Mode
Symmetric (not skewed): Mean = Median = Mode
Positively skewed: Mode < Median < Mean
Measures of Shape
Kurtosis
A measure of the relative peakedness or flatness of the curve defined by the frequency distribution
The kurtosis of a normal distribution, measured as excess kurtosis, is zero
If the kurtosis is positive, the distribution is more peaked than a normal distribution
Measures of Variability
A statistic that indicates the distribution's dispersion
It measures the dispersion of the data in the distribution
Common measures of variability
Range
Interquartile Range
Mean Absolute Deviation
Variance
Standard Deviation
Z scores
Coefficient of Variation
Range
The difference between the largest and the smallest values in a set of data
Simple to compute
Ignores all data points except the two extremes
Example: Range = Largest - Smallest = 48 - 35 = 13
35 37 37 39 40 40 41 41 43 43 43 43 44 44 44 44 44 45 45 46 46 46 46 48
Interquartile Range
Range of values between the first and third quartiles
Range of the middle half
Less influenced by extremes
$\text{Interquartile Range} = Q_3 - Q_1$
Variance
The mean squared deviation of all the values from the mean
When data points are clustered around the mean, the variance is small
Data set: 5, 9, 16, 17, 18
Mean: $\mu = \frac{\sum X}{N} = \frac{65}{5} = 13$
Deviations from the mean: -8, -4, +3, +4, +5
[Figure: number line from 0 to 20 showing the five data points and their deviations from the mean of 13]
Standard Deviation
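The square root of the variance
Expressed in the same units as the data, rather than in squared units
Population variance and standard deviation
$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$, $\sigma = \sqrt{\sigma^2}$
Sample variance and standard deviation
$S^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$, $S = \sqrt{S^2}$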
Coefficient of variation
The coefficient of variation is the ratio of the standard deviation to the mean, expressed as a percentage
$CV = \frac{S}{\bar{X}}$ (multiplied by 100 to express it as a percentage)
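As a worked check, the variability measures can be computed for the variance example data set (5, 9, 16, 17, 18) with the standard statistics module. This is a sketch with illustrative variable names; it reports both the population and the sample versions, since the slides define both.

```python
from statistics import mean, pstdev, pvariance, stdev, variance

data = [5, 9, 16, 17, 18]  # data set from the variance example

mu = mean(data)                      # 65 / 5 = 13
deviations = [x - mu for x in data]  # -8, -4, +3, +4, +5

pop_var = pvariance(data)   # sum of squared deviations divided by N
pop_sd = pstdev(data)       # square root of the population variance
samp_var = variance(data)   # sum of squared deviations divided by n - 1
samp_sd = stdev(data)       # square root of the sample variance

# Coefficient of variation as a percentage, using the sample standard deviation.
cv_percent = samp_sd / mu * 100

print(mu, deviations, pop_var, samp_var)
print(round(pop_sd, 3), round(samp_sd, 3), round(cv_percent, 2))
```

The squared deviations sum to 130, so the population variance is 130 / 5 = 26 and the sample variance is 130 / 4 = 32.5.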
Introduction to hypothesis testing
Hypothesis
Hypothesis means assumption
It is related to making an assumption about population parameters
Hypotheses (assumptions) about population parameters may be right or wrong
Hypothesis testing is concerned with determining the correctness of a hypothesis
Procedure for hypothesis testing

1. Formulate the null and alternative hypotheses
2. Select an appropriate statistical technique
3. Choose the level of significance
4. Determine the sample size and collect the data. Calculate the value of the test statistic
5. Determine the probability associated with the test statistic, and determine the critical value of the test statistic that divides the rejection and non-rejection regions
6. Compare the probability with the level of significance. Determine whether the test statistic has fallen within the rejection or non-rejection region
7. Reject or do not reject the null hypothesis
8. Draw a marketing research conclusion

Step 1 Formulate the hypothesis Null hypothesis
The assumption we wish to test is called the null hypothesis
Denoted as H0
For example, if the assumption is that the population mean is 500, the null hypothesis is denoted as
$H_0: \mu = 500$

Step 1 Formulate the hypothesis Alternative hypothesis
When we reject the null hypothesis, the conclusion we accept is called the alternative hypothesis
For example, if the null hypothesis is $H_0: \mu = 500$, the alternative hypothesis would be one of the following:
1. $H_1: \mu \neq 500$
2. $H_1: \mu > 500$
3. $H_1: \mu < 500$
Step 1 Formulate the hypothesis One tail test
A one tail test rejects the null hypothesis if the sample mean is significantly
higher than the hypothesized population mean, or significantly lower than the hypothesized population mean, but not both
So there is only one rejection area
Step 1 Formulate the hypothesis One tail test

Appropriate when
the null hypothesis is $H_0: \mu = \mu_0$
and the alternative hypothesis is $H_1: \mu > \mu_0$ or $H_1: \mu < \mu_0$
where $\mu_0$ is the hypothesized population mean
Step 1 Formulate the hypothesis Two tail test
A two tail test will reject the null hypothesis if the sample mean is significantly higher or lower than the hypothesized population mean
So there are two rejection areas
Appropriate when the null hypothesis is $H_0: \mu = \mu_0$ and the alternative hypothesis is $H_1: \mu \neq \mu_0$

Step 2 Select an appropriate test Test statistic measures how close the sample has come to the null hypothesis It often follows a well known distributions such as normal, t or chi-square distribution

Step 2 Select an appropriate test
Sample size | Population standard deviation is known | Population standard deviation is unknown
n > 30 | Normal distribution, Z-table | Normal distribution, Z-table
n < 30 | Normal distribution, Z-table | t-distribution, t-table

Step 3 Choose a level of significance
The purpose of hypothesis testing is to make a judgment about the difference between the sample statistic (mean) and the hypothesized population parameter (mean)
The significance level is the level of difference between the sample mean and the hypothesized population mean that we are willing to accept
The higher the significance level used for testing a hypothesis, the higher the probability of rejecting the null hypothesis when it is true
Type-1 error
Error of rejecting the null hypothesis when it is true
Its probability is denoted by $\alpha$
It is more likely when the significance level is high
Type-2 error
Error of accepting the null hypothesis when it is false
Its probability is denoted by $\beta$
It is more likely when the significance level is low
There is a trade-off between these two types of errors

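Step 4 Collect the data and calculate test statistic
Collect the required data and compute the value of the test statistic
The test statistic is calculated as follows
z - test
$z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}}$
t - test
$t = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}}$
where $\sigma_{\bar{x}}$ is the standard error of the mean (estimated from the sample standard deviation for the t test)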

Step 5 Determine the probability (or critical value)
Determine the probability or critical value for the z-value or t-value by using the standard normal table or the t table
The critical value is determined by taking the significance level into consideration

Step 6 Compare the probability and make the decision
Check whether the value of the sample statistic falls within the acceptance region or the rejection region
Step 7 Marketing research conclusion
Finally, draw the marketing research conclusion based on the hypothesis tested
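The steps of the procedure can be traced in code for a two-tailed one-sample z test. This is a hedged sketch, not a method taken from the slides: the function name and the inputs (a sample mean of 510, a known population standard deviation of 40, n = 64) are hypothetical, and only the null value of 500 echoes the slide example. It uses NormalDist from the Python standard library in place of a printed Z-table.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-tailed z test of H0: mu = mu0 against H1: mu != mu0 (sigma assumed known)."""
    standard_error = sigma / sqrt(n)                    # standard error of the mean
    z = (sample_mean - mu0) / standard_error            # Step 4: test statistic
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))        # Step 5: probability
    critical = NormalDist().inv_cdf(1 - alpha / 2)      # Step 5: critical value
    reject = abs(z) > critical                          # Step 6: decision
    return z, p_value, critical, reject

# Hypothetical inputs chosen for illustration only.
z, p_value, z_crit, reject = one_sample_z_test(sample_mean=510, mu0=500, sigma=40, n=64)
print(f"z = {z:.2f}, p = {p_value:.4f}, critical value = {z_crit:.3f}, reject H0: {reject}")
```

Comparing |z| with the critical value and comparing the p-value with the significance level always lead to the same decision; they are two views of the same test.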
Cross tabulation
Introduction to cross tabulation

If you want to better understand how two different survey items interrelate, cross tabulation analysis is the answer
It is a statistical technique that describes two or more variables simultaneously and results in a table that reflects the joint distribution of those variables

Examples
Relationship between gender and internet use (contingency table)
Internet Use   Male   Female   Total
Light          5      10       15
Heavy          10     5        15
Total          15     15       30

Examples
How many brand-loyal users are male?
Is familiarity with a new product related to age and education level?
Is product ownership related to income?
Cross tabulation - Advantages

Cross tabulation analysis and results can be easily understood and interpreted by managers
The clarity of interpretation provides a stronger link between research results and managerial actions
Simple to conduct and appealing to less sophisticated researchers
Cross tabulation With Two Variables

Cross tabulation with two variables is also known as bivariate cross tabulation
It shows the relationship between two variables
Gender and internet use (counts)
Internet Use   Male   Female   Total
Light          5      10       15
Heavy          10     5        15
Total          15     15       30
Internet use by gender (percentages within each gender)
Internet Use   Male     Female
Light          33.3%    66.7%
Heavy          66.7%    33.3%
Total          100%     100%
Gender by internet use (percentages within each usage level)
Internet Use   Male     Female   Total
Light          33.3%    66.7%    100%
Heavy          66.7%    33.3%    100%
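The two percentage tables differ only in which total each count is divided by. A minimal sketch with the gender and internet use counts from the slides (variable names are illustrative, standard library only):

```python
from collections import Counter

# One (gender, internet use) pair per respondent, matching the counts in the slides.
respondents = (
    [("Male", "Light")] * 5 + [("Male", "Heavy")] * 10
    + [("Female", "Light")] * 10 + [("Female", "Heavy")] * 5
)
counts = Counter(respondents)

genders = ["Male", "Female"]
usage_levels = ["Light", "Heavy"]

# Percentages of internet use within each gender (each gender sums to 100%).
pct_within_gender = {
    g: {u: 100 * counts[(g, u)] / sum(counts[(g, v)] for v in usage_levels)
        for u in usage_levels}
    for g in genders
}

# Percentages of gender within each usage level (each usage level sums to 100%).
pct_within_usage = {
    u: {g: 100 * counts[(g, u)] / sum(counts[(h, u)] for h in genders)
        for g in genders}
    for u in usage_levels
}

for g in genders:
    print(g, {u: round(p, 1) for u, p in pct_within_gender[g].items()})
```

Which direction to compute depends on which variable is treated as the independent variable; percentages are usually computed within the categories of the independent variable.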
Cross tabulation With Three Variables

Cross tabulation with three variables reflects the relationship between three variables
Purchase of fashion clothing by marital status and gender
Gender: Male
Purchase of Fashion Clothing   Married   Unmarried
High                           35%       40%
Low                            65%       60%
Column Totals                  100%      100%
Gender: Female
Purchase of Fashion Clothing   Married   Unmarried
High                           25%       60%
Low                            75%       40%
Column Totals                  100%      100%
Statistics associated with cross tabulation

The following are the statistical tests associated with cross tabulation
Chi-Square
Phi Coefficient
Contingency Coefficient
Cramer's V
Lambda Coefficient
Hypothesis Testing Related to Differences
Hypothesis tests are classified as parametric tests and non-parametric tests
Parametric Tests
One sample
o z Test
o t Test
Two or more samples, independent
o Two Group t Test
o Two Group z Test
o F Test
Two or more samples, dependent
o Paired t Test
Non-Parametric Tests
One sample
o Chi-Square
o Kolmogorov-Smirnov
o Runs
o Binomial
Two samples, independent
o Chi-Square
o Mann-Whitney
o Median
o Kolmogorov-Smirnov (K-S)
Two samples, related
o Sign
o Wilcoxon
o McNemar
o Chi-Square
Parametric V/S Nonparametric Tests

Parametric tests
Hypothesis testing procedures that assume the variables are measured on at least an interval scale
Assume that the sampling distribution is normally distributed
Non-parametric tests
Assume that the variables are measured on a nominal or ordinal scale
Useful when the sampling distribution is not normally distributed
These are tests that can be done without the assumption of normality, approximate normality, or symmetry
These tests do not require a mean and standard deviation
Parametric Test | Goal for Parametric Test | Non-Parametric Test | Goal for Non-Parametric Test
Two Sample T-Test | To see if two samples have identical population means | Wilcoxon Rank-Sum Test | To see if two samples have identical population medians
One Sample T-Test | To test a hypothesis about the mean of the population a sample was taken from | Wilcoxon Signed Ranks Test | To test a hypothesis about the median of the population a sample was taken from
Chi-Squared Test for Goodness of Fit | To see if a sample fits a theoretical distribution, such as the normal curve | Kolmogorov-Smirnov Test | To see if a sample could have come from a certain distribution
ANOVA | To see if two or more sample means are significantly different | Kruskal-Wallis Test | To test if two or more sample medians are significantly different
Property | Parametric | Non-parametric
Assumed distribution | Normal | Not normal
Typical data | Ratio or interval | Ordinal or nominal
Data set relationships | Independent | Any
Usual central measure | Mean | Median
Benefits | Can draw more conclusions | Simplicity; less affected by outliers
Parametric Test One Sample

z test
The z test compares sample and population means to determine if there is a significant difference
It requires a simple random sample from a population with a normal distribution and a known standard deviation
Used when the sample size is large (more than 30)
Uses the Z-distribution
$Z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}}$

t test
To test a hypothesis about the mean of the population a sample was taken from
Used when the population standard deviation is unknown and the sample size is small (n < 30)
Uses the t-distribution
Characteristics of the t distribution
Flatter than the normal distribution
Lower at the mean and higher at the tails than the normal distribution
More values in the tails as compared to the normal distribution
As n increases, the t distribution approaches the normal distribution
$t = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}}$
Procedure for hypothesis testing when the t test is used
Formulate the null and alternative hypotheses
Select the appropriate formula for the t statistic
Select a significance level
Select one or two samples and compute the mean and standard deviation
Calculate the degrees of freedom and find the critical value from the t table
Compare the t value with the critical value and, based on the comparison, accept or reject the null hypothesis
Parametric Test Two Independent Samples
Several hypotheses in marketing relate to parameters from two different populations
For example,
Brand perception of users and non-users
Spending habits of high-income and low-income consumers

Two group z test
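To see if two samples have identical population means
Used when the sample size is large
In this case the hypotheses take the following form
$H_0: \mu_1 = \mu_2$
$H_1: \mu_1 \neq \mu_2$
The z statistic is computed as follows
$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sigma_{\bar{x}_1 - \bar{x}_2}}$
$\sigma_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$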
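Two group t test
To see if two samples have identical population means
Used when the sample size is small
The t statistic is computed as follows
$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sigma_{\bar{x}_1 - \bar{x}_2}}$
$\sigma_{\bar{x}_1 - \bar{x}_2} = s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$
$s_p = \sqrt{s_p^2}$, where $s_p^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}$
DOF = n1 + n2 - 2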
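The pooled-variance computation for the two group t test can be sketched as follows. The two samples (ratings from users and non-users) are hypothetical values invented for illustration, and the function name is an assumption; the formula is the pooled one, with the statistic evaluated under the null hypothesis that the two population means are equal.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(sample1, sample2):
    """Two group t statistic with pooled variance, under H0: mu1 - mu2 = 0."""
    n1, n2 = len(sample1), len(sample2)
    s1_sq, s2_sq = variance(sample1), variance(sample2)   # sample variances (n - 1 divisor)
    sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)  # pooled variance
    standard_error = sqrt(sp_sq) * sqrt(1 / n1 + 1 / n2)
    t = (mean(sample1) - mean(sample2)) / standard_error
    dof = n1 + n2 - 2
    return t, dof

# Hypothetical brand ratings, for illustration only.
users = [7, 8, 6, 9, 7]
non_users = [5, 6, 4, 6, 5]

t_stat, dof = pooled_t(users, non_users)
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

The resulting t value would then be compared with the critical value from the t table for the chosen significance level and n1 + n2 - 2 degrees of freedom.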
F test (ANOVA)
An F test of sample variances can be performed to determine whether the two populations have equal variances
In this case the hypotheses take the following form
$H_0: \sigma_1^2 = \sigma_2^2$
$H_1: \sigma_1^2 \neq \sigma_2^2$
The F statistic is computed as follows
$F_{(n_1 - 1),(n_2 - 1)} = \frac{s_1^2}{s_2^2}$
Where
$n_1$ = size of sample 1
$n_2$ = size of sample 2
$n_1 - 1$ = degrees of freedom for sample 1
$n_2 - 1$ = degrees of freedom for sample 2
$s_1^2$ = sample variance for sample 1
$s_2^2$ = sample variance for sample 2
Parametric Test Paired Samples

In many marketing research applications, the observations for two groups are not selected from independent samples
Rather, the observations relate to paired samples, in that the two sets of observations relate to the same respondents
For example, a sample of respondents may rate two competing brands
The difference in this case is examined by the paired t test
In this case the hypotheses take the following form
$H_0: \mu_D = 0$
$H_1: \mu_D \neq 0$
$t_{n-1} = \frac{\bar{D} - \mu_D}{S_D / \sqrt{n}}$
where $\bar{D}$ is the mean of the paired differences and $S_D$ is their standard deviation
Non Parametric One sample

Kolmogorov Smirnov test (K S Test)
A fully non-parametric test for comparing two distributions
Used to see if a sample could have come from a certain distribution (normal, uniform, Poisson)
It is a goodness of fit test
$K = \max |A_i - O_i|$
Where,
$A_i$ = cumulative relative frequency for each category of the theoretical distribution
$O_i$ = comparable value of the sample frequency
One sample test of runs

A test for randomness of the order of occurrence
A run is a sequence of identical occurrences that are followed and preceded by different occurrences.
Example: The list of X's and O's below consists of 7 runs.
xxxooooxxooooxxxxoox
Suppose r is the number of runs, n1 is the number of type 1 occurrences and n2 is the number of type 2 occurrences.
The mean number of runs is $\mu_r = \frac{2 n_1 n_2}{n_1 + n_2} + 1$.
The standard deviation of the number of runs is $\sigma_r = \sqrt{\frac{2 n_1 n_2 (2 n_1 n_2 - n_1 - n_2)}{(n_1 + n_2)^2 (n_1 + n_2 - 1)}}$.
If n1 and n2 are each at least 10, then r is approximately normal.
So $Z = \frac{r - \mu_r}{\sigma_r}$ is a standard normal variable.
Example: A stock exhibits price increases (+) and decreases (-) over 25 business days, with r = 16 runs, n1 = 13 increases (+) and n2 = 12 decreases (-). Test at the 1% level whether the pattern is random.
$\mu_r = \frac{2 n_1 n_2}{n_1 + n_2} + 1 = \frac{2(13)(12)}{13 + 12} + 1 = 13.48$
$\sigma_r = \sqrt{\frac{2 n_1 n_2 (2 n_1 n_2 - n_1 - n_2)}{(n_1 + n_2)^2 (n_1 + n_2 - 1)}} = \sqrt{\frac{2(13)(12)[2(13)(12) - 13 - 12]}{(13 + 12)^2 (13 + 12 - 1)}} = 2.44$
$Z = \frac{r - \mu_r}{\sigma_r} = \frac{16 - 13.48}{2.44} = 1.03$
Since the critical values for a 2-tailed 1% test are 2.575 and -2.575, and Z = 1.03 lies between them, we accept H0 that the pattern is random.
[Figure: standard normal curve with the acceptance region (area .495 on each side of 0) between -2.575 and 2.575 and a critical region of area .005 in each tail]
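The runs computation can be reproduced with a short sketch. The function names are illustrative; the run-counting helper is checked on the X and O list from the slides, and the z statistic uses the stock example's values (r = 16, n1 = 13, n2 = 12).

```python
from math import sqrt

def count_runs(sequence):
    """Count runs: a new run starts at the first item and wherever the symbol changes."""
    return sum(1 for i, s in enumerate(sequence) if i == 0 or s != sequence[i - 1])

def runs_z(r, n1, n2):
    """Mean, standard deviation and z statistic for the one sample test of runs."""
    mu_r = 2 * n1 * n2 / (n1 + n2) + 1
    sigma_r = sqrt(
        2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
        / ((n1 + n2) ** 2 * (n1 + n2 - 1))
    )
    return mu_r, sigma_r, (r - mu_r) / sigma_r

runs_in_example = count_runs("xxxooooxxooooxxxxoox")   # the list of X's and O's
mu_r, sigma_r, z = runs_z(r=16, n1=13, n2=12)          # the stock price example

print(runs_in_example, round(mu_r, 2), round(sigma_r, 2), round(z, 2))
```

The normal approximation used here relies on n1 and n2 each being at least 10, as stated in the slides.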
Non Parametric Two independent samples
Mann Whitney U test
Sometimes variables are not normally distributed, or the samples taken are so small that one cannot tell whether they come from a normal distribution or not. Using the t-test to tell if there is a significant difference between samples is not appropriate here.
The Mann-Whitney U test can be used in these situations. This test can be used for very small samples (between 5 and 20).
$U = n_1 n_2 + \frac{n_1 (n_1 + 1)}{2} - R_1$
where $R_1$ is the sum of the ranks of the first sample
Mann Whitney U test - Example
Two-tailed null hypothesis that there is no difference between the heights of male and female students
H0: Male and female students are the same height
HA: Male and female students are not the same height

Heights are ranked from tallest (rank 1) to shortest (rank 12)
Heights of males (cm):     193  188  185  183  180  178  170    n1 = 7
Ranks of male heights:       1    2    3    4    5    6    9    R1 = 30
Heights of females (cm):   175  173  168  165  163              n2 = 5
Ranks of female heights:     7    8   10   11   12              R2 = 48

$U = n_1 n_2 + \frac{n_1 (n_1 + 1)}{2} - R_1 = (7)(5) + \frac{(7)(8)}{2} - 30 = 35 + 28 - 30 = 33$
$U' = n_1 n_2 - U = (7)(5) - 33 = 2$
$U_{0.05(2),7,5} = U_{0.05(2),5,7} = 30$
As 33 > 30, H0 is rejected
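The rank sums and U values of the height example can be recomputed directly. This sketch follows the example's convention of ranking from tallest (rank 1) to shortest and assumes there are no tied heights, which holds for these data; the variable names are illustrative.

```python
# Heights from the example (cm).
males = [193, 188, 185, 183, 180, 178, 170]
females = [175, 173, 168, 165, 163]

# Rank all heights together, tallest first (rank 1), as in the example.
# With no ties, each height maps to exactly one rank.
combined = sorted(males + females, reverse=True)
rank = {height: position + 1 for position, height in enumerate(combined)}

n1, n2 = len(males), len(females)
r1 = sum(rank[h] for h in males)     # sum of ranks of male heights
r2 = sum(rank[h] for h in females)   # sum of ranks of female heights

u = n1 * n2 + n1 * (n1 + 1) // 2 - r1
u_prime = n1 * n2 - u

print(f"R1 = {r1}, R2 = {r2}, U = {u}, U' = {u_prime}")
```

The larger of U and U' is then compared with the tabulated critical value, 30 in the example.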
Non Parametric Two Independent Samples

Wilcoxon Rank Sum test
This test is used to test whether two independent samples have been drawn from populations with the same mean. It is a nonparametric substitute for the t-test on the difference between two means.

Wilcoxon Rank Sum test Example
Based on the following samples from two universities, test at the 10% level whether graduates from the two schools have the same average grade on an aptitude test.
university A: 50 52 56 60 64 68 71 74 89 95
university B: 70 73 77 80 83 85 87 88 96 99
First merge and rank the grades. Sum the ranks for each sample.
rank         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
grade       50 52 56 60 64 68 70 71 73 74 77 80 83 85 87 88 89 95 96 99
university   A  A  A  A  A  A  B  A  B  A  B  B  B  B  B  B  A  A  B  B
rank sum for university A: 74
rank sum for university B: 136
Note: If there are ties, each value gets the average rank. For example, if 2 values tie for 7th and 8th place, both are ranked 7.5.
Here, the group from university A is considered the 1st sample. When the samples differ in size, designate the smaller of the 2 samples as the 1st sample.
Define W = sum of the ranks for the 1st sample.
The mean of W is $\mu_W = \frac{n_1 (n_1 + n_2 + 1)}{2}$,
and the standard deviation is $\sigma_W = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}$.
If n1 and n2 are each at least 10, W is approximately normal.
So $Z = \frac{W - \mu_W}{\sigma_W}$ has a standard normal distribution.
For our example, W = 74 and n = n1 + n2 = 20.
$\mu_W = \frac{n_1 (n + 1)}{2} = \frac{10(20 + 1)}{2} = 105$
$\sigma_W = \sqrt{\frac{n_1 n_2 (n + 1)}{12}} = \sqrt{\frac{(10)(10)(20 + 1)}{12}} = 13.229$
$Z = \frac{W - \mu_W}{\sigma_W} = \frac{74 - 105}{13.229} = -2.343$

Since the critical values for a 2-tailed Z test at the 10% level are 1.645 and -1.645, we reject H0 that the means are the same and accept H1 that the means are different.
[Figure: standard normal curve; critical regions of area .05 in each tail beyond -1.645 and 1.645, with area .45 on each side of 0]
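As a check on the rank-sum example, the steps (merge, rank, sum, standardize) can be sketched in a few lines. The function name `rank_sum_z` is hypothetical, and the sketch uses the large-sample normal approximation, which assumes both samples have at least 10 observations.

```python
import math


def rank_sum_z(first, second):
    """Return (W, mean of W, sd of W, Z) for the 1st sample.

    Illustrative helper; tied values receive the average of their ranks.
    """
    n1, n2 = len(first), len(second)
    pooled = sorted(first + second)

    def average_rank(value):
        # 1-based positions of this value in the pooled, sorted data
        positions = [i + 1 for i, x in enumerate(pooled) if x == value]
        return sum(positions) / len(positions)

    w = sum(average_rank(v) for v in first)
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return w, mean_w, sd_w, (w - mean_w) / sd_w


univ_a = [50, 52, 56, 60, 64, 68, 71, 74, 89, 95]
univ_b = [70, 73, 77, 80, 83, 85, 87, 88, 96, 99]
w, mean_w, sd_w, z = rank_sum_z(univ_a, univ_b)
print(w, mean_w, round(sd_w, 3), round(z, 3))
```

A Z value outside the interval from -1.645 to 1.645 leads to rejecting H0 at the 10% level.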

The Kruskal-Wallis Test
This test is used to test whether several populations have the same mean. It is a nonparametric substitute for a one-factor ANOVA.
The test statistic is $K = \frac{12}{n(n+1)} \sum_j \frac{R_j^2}{n_j} - 3(n+1)$,
where $n_j$ is the number of observations in the jth sample, $n$ is the total number of observations, and $R_j$ is the sum of ranks for the jth sample.
If each $n_j \ge 5$ and the null hypothesis is true, then the distribution of K is $\chi^2$ with dof = c - 1, where c is the number of sample groups.
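In the case of ties, a corrected statistic should be computed:

$K_c = \frac{K}{1 - \sum_j \frac{t_j^3 - t_j}{n^3 - n}}$
where $t_j$ is the number of observations tied at the jth tied value (here the sum runs over the groups of tied values, not over the samples).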
Kruskal-Wallis Test Example: Test at the 5% level whether average employee performance is the same at 3 firms, using the following standardized test scores for 20 employees.
Firm 1 score   Firm 2 score   Firm 3 score
78             68             82
95             77             65
85             84             50
87             61             93
75             62             70
90             72             60
80                            73
n1 = 7         n2 = 6         n3 = 7
We rank all the scores. Then we sum the ranks for each firm. Then we calculate the K statistic.
Firm 1          Firm 2          Firm 3
score  rank     score  rank     score  rank
78     12       68     6        82     14
95     20       77     11       65     5
85     16       84     15       50     1
87     17       61     3        93     19
75     10       62     4        70     7
90     18       72     8        60     2
80     13                       73     9
n1 = 7          n2 = 6          n3 = 7
R1 = 106        R2 = 47         R3 = 57
$K = \frac{12}{n(n+1)} \sum_j \frac{R_j^2}{n_j} - 3(n+1) = \frac{12}{20(21)} \left( \frac{106^2}{7} + \frac{47^2}{6} + \frac{57^2}{7} \right) - 3(21) = 6.641$
[Figure: $\chi^2$ density with 2 dof; acceptance region to the left of 5.991, critical region of area .05 to the right]
From the $\chi^2$ table, we see that the 5% critical value for a $\chi^2$ with 2 dof is 5.991. Since our value for K was 6.641, we reject H0 that the means are the same and accept H1 that the means are different.
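The K statistic for the three firms can be recomputed as follows. This is a minimal sketch with a made-up function name, `kruskal_wallis_k`; it assumes no tied scores, so the tie correction is not applied.

```python
def kruskal_wallis_k(groups):
    """Return the Kruskal-Wallis K statistic for a list of samples.

    Illustrative helper; assumes all observations are distinct (no ties).
    """
    pooled = sorted(x for group in groups for x in group)
    rank_of = {value: i + 1 for i, value in enumerate(pooled)}  # 1 = smallest
    n = len(pooled)
    # Sum over samples of (rank sum squared) / (sample size)
    total = sum(sum(rank_of[x] for x in group) ** 2 / len(group)
                for group in groups)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)


firm1 = [78, 95, 85, 87, 75, 90, 80]
firm2 = [68, 77, 84, 61, 62, 72]
firm3 = [82, 65, 50, 93, 70, 60, 73]
k = kruskal_wallis_k([firm1, firm2, firm3])
print(round(k, 3), k > 5.991)
```

The comparison with 5.991 uses the 5% critical value for 2 degrees of freedom quoted in the text.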
Sign Test
Earliest form of Non-parametric testing, developed in 1710
Compares matched pairs of data
Looks at direction of differences of each pair
Not interested in size of difference
Expressed as
0 if no difference
+ if 1st of pair is higher (or +1)
- if 1st of pair is lower (or -1)
Ignore 0s
Let test statistic = S = no. of + signs
Ho = no differences in population
Sampling dist of S is binomial & probability of success of getting a + sign is 0.5
Sampling dist of S
Mean: $\mu_S = np$
Std Dev: $\sigma_S = \sqrt{npq}$
When $n \ge 10$ can use normal distribution as good approximation for binomial
Hence test statistic $Z = \frac{S - \mu_S}{\sigma_S}$
Example
Soft drink distributor wants to see if people prefer old diet cola to new diet cola (e.g. Coke Zero)
n = 11 customers
7 prefer new brand, 4 still like old brand
Let new brand = +
Hypo test
Ho: Preferences for the 2 brands are similar (p = 0.5)
Ha: Preferences for the 2 brands are not similar (p ≠ 0.5)
Test at $\alpha = 0.05$, so Zcrit = ±1.96
Let S = 7 = no. of +ve preferences
Then $\mu_S = np = 11(0.5) = 5.5$
$\sigma_S = \sqrt{npq} = 0.5\sqrt{11} = 1.66$
So Zcalc = (7 - 5.5)/1.66 = 0.9
Since 0.9 < 1.96, Ho can't be rejected
So customers are indifferent between the two colas; if both are to be sold at the same price, equal quantities should be stocked
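The same arithmetic can be written as a short function. This is a sketch under the normal approximation to the binomial; the name `sign_test_z` and its parameters are illustrative, and zero differences are assumed to have been dropped already.

```python
import math


def sign_test_z(n_plus, n_minus, p=0.5):
    """Return Z for the sign test using the normal approximation.

    Illustrative helper; n_plus and n_minus are the counts of + and - signs
    after pairs with no difference have been ignored.
    """
    n = n_plus + n_minus
    mean_s = n * p                        # mean of S under Ho
    sd_s = math.sqrt(n * p * (1 - p))     # standard deviation of S under Ho
    return (n_plus - mean_s) / sd_s


z_calc = sign_test_z(7, 4)
reject_h0 = abs(z_calc) > 1.96            # two-tailed test at the 5% level
print(round(z_calc, 2), reject_h0)
```

With 7 plus signs out of 11, the statistic falls well inside the acceptance region, matching the conclusion above.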
Simple Linear Regression and correlation

In order to understand the relationship between two variables, regression and correlation studies are carried out
Correlation
A measure of association between two numerical variables. Typically, in the summer as the temperature increases people are thirstier.
It measures a linear relationship between two variables
It does not imply that either variable causes the other
Correlation
For seven random summer days, a person recorded the temperature and their water consumption, during a three-hour period spent outside.
Temperature (F)   Water Consumption (ounces)
75                16
83                20
85                25
85                27
92                32
97                48
99                48
Correlation
Pearson's Sample Correlation Coefficient, r
Measures the direction and the strength of the linear association between two numerical paired variables.
Correlation
r value   Interpretation
1         perfect positive linear relationship
0         no linear relationship
-1        perfect negative linear relationship
Correlation
Direction of correlation
Positive Correlation
Negative Correlation
Correlation
Strength of correlation
r value   Interpretation
0.9       strong association
0.5       moderate association
0.25      weak association
Correlation Coefficient Formula
$r = \frac{COV_{XY}}{S_X S_Y}$
Correlation Coefficient Formula
$\Sigma$ = the sum
$n$ = number of paired items
$x_i$ = input variable
$\bar{x}$ = x-bar = mean of x's
$s_x$ = standard deviation of x's
$y_i$ = output variable
$\bar{y}$ = y-bar = mean of y's
$s_y$ = standard deviation of y's
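To illustrate the formula, r can be computed for the temperature and water consumption data recorded on the seven summer days. This is a minimal sketch; the function name `pearson_r` is made up, and it assumes the sample covariance and sample standard deviations all use the n - 1 divisor.

```python
import math


def pearson_r(xs, ys):
    """Return Pearson's sample correlation coefficient r = COV_XY / (S_X * S_Y).

    Illustrative helper; uses the n - 1 divisor throughout.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return cov_xy / (s_x * s_y)


temperature = [75, 83, 85, 85, 92, 97, 99]   # degrees F
water = [16, 20, 25, 27, 32, 48, 48]         # ounces
r = pearson_r(temperature, water)
print(round(r, 3))
```

For these data r comes out close to 1, that is, a strong positive linear association between temperature and water consumption.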
Hypothesis testing for correlation

Inferences from sample correlation coefficient (r) to population correlation coefficient ($\rho$)
(r) can be used as an estimate of population correlation coefficient ($\rho$)

The mean number of runs is $\mu_r = \frac{2 n_1 n_2}{n_1 + n_2} + 1$.

$\mu_r = \frac{2 n_1 n_2}{n_1 + n_2} + 1 = \frac{2(13)(12)}{13 + 12} + 1 = 13.48$

$\sigma_r = \sqrt{\frac{2 n_1 n_2 (2 n_1 n_2 - n_1 - n_2)}{(n_1 + n_2)^2 (n_1 + n_2 - 1)}}$
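The mean and standard deviation of the number of runs can be evaluated numerically for the n1 = 13, n2 = 12 case. This is a sketch only; the function name `runs_mean_sd` is hypothetical.

```python
import math


def runs_mean_sd(n1, n2):
    """Return (mean, standard deviation) of the number of runs.

    Illustrative helper for the runs test with n1 and n2 items of each type.
    """
    mean_r = 2 * n1 * n2 / (n1 + n2) + 1
    var_r = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
             / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    return mean_r, math.sqrt(var_r)


mean_r, sd_r = runs_mean_sd(13, 12)
print(round(mean_r, 2), round(sd_r, 3))
```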