
Case 1: Missing Values

Identify the undesired values and clean the data of typing errors:
In this analysis, we identified several typing errors. For example, the Price variable contains out-of-range decimal values, with a minimum of 0.4 and a maximum of 3.8, which are not consistent with the rest of the data.

Statistics

Variable            N Valid  N Missing  Mean    Median  Mode   Std. Deviation  Minimum  Maximum
Delivery_speed      49       21         4.008   3.800   3.0    .9318           2.8      6.5
Price               57       13         1.944   1.900   2.0    .8751           .4       3.8
Price_flexibility   53       17         8.062   8.200   6.4 a  1.4072          5.0      9.9
reputation          63       7          5.17    5.00    5      1.171           3        8
service             61       9          2.856   3.000   3.3    .7760           1.1      4.6
Sales_forces        64       6          2.611   2.600   2.5 a  .7174           1.1      4.0
Quality             60       10         6.81    6.80    7      1.691           2        10
Size_company        68       2          .37     .00     0      .486            0        1
Utility             61       9          46.03   47.00   49     9.356           25       65
Satisfaction        63       7          4.759   4.900   5.0 a  .8319           3.3      6.2
evaluation          68       2          .65     1.00    1      .481            0        1
Order_type          68       2          .46     .00     0      .502            0        1
Industry            69       1          .48     .00     0      .503            0        1
Purchase_Situation  68       2          2.03    2.00    3      .828            1        3

a. Multiple modes exist. The smallest value is shown.

Examine the descriptive statistics of the data:

As the Statistics table shows, some typing errors are still present in the data. They must be corrected so that no minimum value falls below 1 and no maximum value rises above 7, in line with the Likert scale. Price flexibility, however, stands out from the rest of the variables when it comes to appraising the perceptions of purchasing professionals.
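The range check described above can be sketched in a few lines of Python. This is only an illustration: the bounds 1 and 7 come from the text, while the function name `out_of_range` and the sample `price` values are hypothetical and chosen to resemble the Price frequencies reported later.

```python
# Assumed Likert bounds taken from the text (1 = lowest, 7 = highest).
LIKERT_MIN, LIKERT_MAX = 1, 7

def out_of_range(values, low=LIKERT_MIN, high=LIKERT_MAX):
    """Return (position, value) pairs that fall outside [low, high].

    None marks a missing answer and is skipped, not flagged.
    """
    return [
        (i, v)
        for i, v in enumerate(values)
        if v is not None and not (low <= v <= high)
    ]

# Hypothetical excerpt of one variable.
price = [0.4, 1.9, None, 3.8, 0.5, 2.0]
flagged = out_of_range(price)
print(flagged)
```

Each flagged pair points at a case that should be checked against the original questionnaire before it is corrected or set to missing.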
Count the missing values per variable and per case:
Most real-world data contain some (or many) missing values. It is always a good idea to inspect the amount of missing values to avoid unpleasant surprises later on. System missing values are values that are truly absent from the data. In Data View they are shown as empty cells holding just a tiny dot. Some reasons why they occur are the following:

some questions weren't offered to all respondents;
some respondents skipped some of the questions;
a technical failure occurred.

User missing values are values that are present in the data but must be excluded from calculations and analyses. To count the missing values per variable, we recoded each variable into a new variable that assigns 1 to a missing value and 0 to every other value. We then created a dummy variable named "MISSING" that counts how many questions each case left unanswered. According to the output, a few cases did not answer more than 5 questions, as examined later on.
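The recoding step can be illustrated with a small Python sketch. The data below are made up for the example; only the idea (1 for missing, 0 otherwise, then sums per variable and per case) and the dummy variable name "MISSING" are taken from the text.

```python
# Hypothetical mini data set: one dict per case, None marks a missing answer.
cases = [
    {"Price": 1.9, "Quality": 6.8, "Utility": 47},
    {"Price": None, "Quality": 7.1, "Utility": None},
    {"Price": 2.0, "Quality": None, "Utility": 49},
    {"Price": None, "Quality": None, "Utility": None},
]
variables = ["Price", "Quality", "Utility"]

# Recode every answer into an indicator: 1 = missing, 0 = answered.
indicators = [{v: int(case[v] is None) for v in variables} for case in cases]

# Missing values per variable: column sums of the indicators.
missing_per_variable = {v: sum(row[v] for row in indicators) for v in variables}

# Missing values per case: row sums, stored in the dummy variable "MISSING".
for case, row in zip(cases, indicators):
    case["MISSING"] = sum(row.values())

print(missing_per_variable)
print([case["MISSING"] for case in cases])
```

Sorting the cases by "MISSING" then shows at a glance which respondents skipped the most questions.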
Data Filtration

Data filtration is an important step for gaining complete insight into the missing values per case. As the above table suggests, we need to eliminate the 6 respondents who did not answer 7 questions each. If we also removed the other 4 cases who left 3 questions unanswered each, too few cases would remain for the analysis. Moreover, the missing values of those 4 cases can be replaced, which introduces less bias than imputing values for the 6 cases who did not answer 7 questions each. If we eliminated respondents other than the ones chosen on this rational basis, we might lose a large amount of significant information.
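A minimal sketch of this filtering rule follows. The cut-off `MAX_MISSING = 6` is an assumption that matches the decision described above (drop cases with 7 unanswered questions, keep cases with 3); the case records themselves are hypothetical.

```python
# Hypothetical per-case missing counts (the "MISSING" dummy variable).
cases = [
    {"id": 1, "MISSING": 0},
    {"id": 2, "MISSING": 7},
    {"id": 3, "MISSING": 3},
    {"id": 4, "MISSING": 1},
]

# Assumed cut-off: a case with more than 6 unanswered questions is removed.
MAX_MISSING = 6

kept = [c for c in cases if c["MISSING"] <= MAX_MISSING]
dropped_ids = [c["id"] for c in cases if c["MISSING"] > MAX_MISSING]

print([c["id"] for c in kept], dropped_ids)
```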

Check whether values are randomly or systematically missing


Delivery_speed

Frequency Percent Valid Percent Cumulative Percent

Valid 2.8 3 4.3 6.1 6.1

2.9 1 1.4 2.0 8.2

3 5 7.1 10.2 18.4

3.1 3 4.3 6.1 24.5

3.2 1 1.4 2.0 26.5

3.3 2 2.9 4.1 30.6

3.4 2 2.9 4.1 34.7

3.5 1 1.4 2.0 36.7

3.6 3 4.3 6.1 42.9

3.7 2 2.9 4.1 46.9

3.8 2 2.9 4.1 51.0

3.9 1 1.4 2.0 53.1

4 2 2.9 4.1 57.1

4.1 2 2.9 4.1 61.2

4.2 2 2.9 4.1 65.3

4.3 1 1.4 2.0 67.3

4.5 2 2.9 4.1 71.4

4.6 1 1.4 2.0 73.5

4.7 2 2.9 4.1 77.6

4.8 1 1.4 2.0 79.6


4.9 1 1.4 2.0 81.6

5.1 2 2.9 4.1 85.7

5.2 2 2.9 4.1 89.8

5.3 2 2.9 4.1 93.9

5.6 1 1.4 2.0 95.9

6.1 1 1.4 2.0 98.0

6.5 1 1.4 2.0 100.0

Total 49 70.0 100.0


Missing System 21 30.0
Total 70 100.0

Price

Frequency Percent Valid Percent Cumulative Percent

Valid 0.4 1 1.4 1.8 1.8

0.5 2 2.9 3.5 5.3

0.7 1 1.4 1.8 7.0

0.8 2 2.9 3.5 10.5

0.9 2 2.9 3.5 14.0

1 1 1.4 1.8 15.8

1.1 1 1.4 1.8 17.5

1.3 4 5.7 7.0 24.6

1.4 4 5.7 7.0 31.6

1.5 3 4.3 5.3 36.8

1.6 3 4.3 5.3 42.1

1.7 1 1.4 1.8 43.9

1.8 2 2.9 3.5 47.4

1.9 3 4.3 5.3 52.6

2 5 7.1 8.8 61.4

2.1 2 2.9 3.5 64.9

2.2 3 4.3 5.3 70.2

2.4 2 2.9 3.5 73.7

2.5 2 2.9 3.5 77.2

2.6 2 2.9 3.5 80.7

2.7 1 1.4 1.8 82.5

2.8 3 4.3 5.3 87.7

3.2 1 1.4 1.8 89.5

3.3 1 1.4 1.8 91.2

3.7 3 4.3 5.3 96.5

3.8 2 2.9 3.5 100.0

Total 57 81.4 100.0


Missing System 13 18.6
Total 70 100.0

Price_flexibility

Frequency Percent Valid Percent Cumulative Percent
Valid 5 1 1.4 1.9 1.9

5.2 1 1.4 1.9 3.8

5.5 1 1.4 1.9 5.7

5.7 1 1.4 1.9 7.5

5.9 1 1.4 1.9 9.4

6 1 1.4 1.9 11.3

6.4 4 5.7 7.5 18.9

6.6 1 1.4 1.9 20.8

6.7 3 4.3 5.7 26.4

6.9 1 1.4 1.9 28.3

7.4 1 1.4 1.9 30.2

7.5 1 1.4 1.9 32.1

7.6 2 2.9 3.8 35.8

7.7 2 2.9 3.8 39.6

7.8 1 1.4 1.9 41.5

7.9 1 1.4 1.9 43.4

8.1 1 1.4 1.9 45.3

8.2 3 4.3 5.7 50.9

8.5 1 1.4 1.9 52.8

8.6 2 2.9 3.8 56.6

8.7 2 2.9 3.8 60.4

8.9 1 1.4 1.9 62.3

9 2 2.9 3.8 66.0

9.1 3 4.3 5.7 71.7

9.2 3 4.3 5.7 77.4

9.3 2 2.9 3.8 81.1

9.4 1 1.4 1.9 83.0

9.6 1 1.4 1.9 84.9

9.7 4 5.7 7.5 92.5

9.9 4 5.7 7.5 100.0

Total 53 75.7 100.0


Missing System 17 24.3
Total 70 100.0

reputation

Frequency Percent Valid Percent Cumulative Percent

Valid 2.5 1 1.4 1.6 1.6

2.7 1 1.4 1.6 3.2

2.9 1 1.4 1.6 4.8

3.1 1 1.4 1.6 6.3

3.3 1 1.4 1.6 7.9

3.4 1 1.4 1.6 9.5

3.5 1 1.4 1.6 11.1

3.7 1 1.4 1.6 12.7

3.8 1 1.4 1.6 14.3

4 1 1.4 1.6 15.9

4.2 1 1.4 1.6 17.5

4.5 6 8.6 9.5 27.0

4.6 2 2.9 3.2 30.2

4.7 2 2.9 3.2 33.3

4.8 5 7.1 7.9 41.3


4.9 3 4.3 4.8 46.0

5 3 4.3 4.8 50.8

5.1 1 1.4 1.6 52.4

5.2 1 1.4 1.6 54.0

5.3 2 2.9 3.2 57.1

5.4 3 4.3 4.8 61.9

5.5 2 2.9 3.2 65.1

5.7 2 2.9 3.2 68.3

5.8 3 4.3 4.8 73.0

5.9 2 2.9 3.2 76.2

6 1 1.4 1.6 77.8

6.1 2 2.9 3.2 81.0

6.2 2 2.9 3.2 84.1

6.6 1 1.4 1.6 85.7

6.7 1 1.4 1.6 87.3

6.8 1 1.4 1.6 88.9

6.9 2 2.9 3.2 92.1

7 2 2.9 3.2 95.2

7.1 2 2.9 3.2 98.4

7.8 1 1.4 1.6 100.0

Total 63 90.0 100.0


Missing System 7 10.0
Total 70 100.0

service

Frequency Percent Valid Percent Cumulative Percent

Valid 1.1 1 1.4 1.6 1.6

1.2 1 1.4 1.6 3.3

1.3 1 1.4 1.6 4.9

1.5 1 1.4 1.6 6.6

1.6 1 1.4 1.6 8.2

1.7 1 1.4 1.6 9.8

1.9 2 2.9 3.3 13.1

2 1 1.4 1.6 14.8

2.1 4 5.7 6.6 21.3

2.2 3 4.3 4.9 26.2

2.4 2 2.9 3.3 29.5

2.5 3 4.3 4.9 34.4

2.6 2 2.9 3.3 37.7

2.7 2 2.9 3.3 41.0

2.8 1 1.4 1.6 42.6

2.9 1 1.4 1.6 44.3

3 5 7.1 8.2 52.5

3.1 4 5.7 6.6 59.0

3.2 3 4.3 4.9 63.9

3.3 6 8.6 9.8 73.8

3.4 3 4.3 4.9 78.7

3.5 3 4.3 4.9 83.6

3.6 4 5.7 6.6 90.2

3.7 2 2.9 3.3 93.4

4 2 2.9 3.3 96.7


4.5 1 1.4 1.6 98.4

4.6 1 1.4 1.6 100.0

Total 61 87.1 100.0


Missing System 9 12.9
Total 70 100.0

Sales_forces

Frequency Percent Valid Percent Cumulative Percent

Valid 1.1 1 1.4 1.6 1.6

1.2 1 1.4 1.6 3.1

1.4 2 2.9 3.1 6.3

1.5 2 2.9 3.1 9.4

1.6 1 1.4 1.6 10.9

1.7 3 4.3 4.7 15.6

1.8 1 1.4 1.6 17.2

1.9 1 1.4 1.6 18.8

2.1 3 4.3 4.7 23.4

2.2 1 1.4 1.6 25.0

2.3 3 4.3 4.7 29.7

2.4 2 2.9 3.1 32.8

2.5 8 11.4 12.5 45.3

2.6 8 11.4 12.5 57.8

2.7 5 7.1 7.8 65.6

2.8 3 4.3 4.7 70.3

2.9 2 2.9 3.1 73.4

3 2 2.9 3.1 76.6

3.1 2 2.9 3.1 79.7

3.2 2 2.9 3.1 82.8

3.4 2 2.9 3.1 85.9

3.6 1 1.4 1.6 87.5

3.7 2 2.9 3.1 90.6

3.8 1 1.4 1.6 92.2

3.9 3 4.3 4.7 96.9

4 2 2.9 3.1 100.0

Total 64 91.4 100.0


Missing System 6 8.6
Total 70 100.0

Quality

Frequency Percent Valid Percent Cumulative Percent

Valid 1.7 1 1.4 1.7 1.7

3.8 2 2.9 3.3 5.0

4.4 1 1.4 1.7 6.7

4.5 1 1.4 1.7 8.3

4.6 1 1.4 1.7 10.0

4.7 1 1.4 1.7 11.7

4.8 2 2.9 3.3 15.0

5 1 1.4 1.7 16.7

5.2 2 2.9 3.3 20.0


5.3 2 2.9 3.3 23.3

5.4 1 1.4 1.7 25.0

5.6 1 1.4 1.7 26.7

5.8 1 1.4 1.7 28.3

5.9 2 2.9 3.3 31.7

6 1 1.4 1.7 33.3

6.2 2 2.9 3.3 36.7

6.3 1 1.4 1.7 38.3

6.6 1 1.4 1.7 40.0

6.7 2 2.9 3.3 43.3

6.8 5 7.1 8.3 51.7

7.1 2 2.9 3.3 55.0

7.2 2 2.9 3.3 58.3

7.3 3 4.3 5.0 63.3

7.4 1 1.4 1.7 65.0

7.6 1 1.4 1.7 66.7

7.7 1 1.4 1.7 68.3

7.9 1 1.4 1.7 70.0

8 2 2.9 3.3 73.3

8.2 2 2.9 3.3 76.7

8.3 1 1.4 1.7 78.3

8.4 3 4.3 5.0 83.3

8.5 1 1.4 1.7 85.0

8.8 2 2.9 3.3 88.3

8.9 1 1.4 1.7 90.0

9 1 1.4 1.7 91.7

9.1 1 1.4 1.7 93.3

9.2 1 1.4 1.7 95.0

9.3 1 1.4 1.7 96.7

9.6 1 1.4 1.7 98.3

9.9 1 1.4 1.7 100.0

Total 60 85.7 100.0


Missing System 10 14.3
Total 70 100.0

Size_company

Frequency Percent Valid Percent Cumulative Percent

Valid small 43 61.4 63.2 63.2

large 25 35.7 36.8 100.0

Total 68 97.1 100.0

Missing System 2 2.9

Total 70 100.0

Utility

Frequency Percent Valid Percent Cumulative Percent

Valid 25 1 1.4 1.6 1.6

28 1 1.4 1.6 3.3


29 1 1.4 1.6 4.9

31 1 1.4 1.6 6.6

32 2 2.9 3.3 9.8

33 1 1.4 1.6 11.5

35 2 2.9 3.3 14.8

36 2 2.9 3.3 18.0

38 1 1.4 1.6 19.7

39 5 7.1 8.2 27.9

40 1 1.4 1.6 29.5

41 3 4.3 4.9 34.4

42 1 1.4 1.6 36.1

43 2 2.9 3.3 39.3

44 1 1.4 1.6 41.0

45 1 1.4 1.6 42.6

46 3 4.3 4.9 47.5

47 5 7.1 8.2 55.7

49 7 10.0 11.5 67.2

50 2 2.9 3.3 70.5

53 2 2.9 3.3 73.8

54 4 5.7 6.6 80.3

55 3 4.3 4.9 85.2

56 1 1.4 1.6 86.9

58 1 1.4 1.6 88.5

59 1 1.4 1.6 90.2

60 4 5.7 6.6 96.7

62 1 1.4 1.6 98.4

65 1 1.4 1.6 100.0

Total 61 87.1 100.0


Missing System 9 12.9
Total 70 100.0

Satisfaction

Frequency Percent Valid Percent Cumulative Percent

Valid 3.3 4 5.7 6.3 6.3

3.4 2 2.9 3.2 9.5

3.6 2 2.9 3.2 12.7

3.7 2 2.9 3.2 15.9

3.8 1 1.4 1.6 17.5

3.9 2 2.9 3.2 20.6

4 1 1.4 1.6 22.2

4.1 2 2.9 3.2 25.4

4.2 2 2.9 3.2 28.6

4.3 3 4.3 4.8 33.3

4.4 3 4.3 4.8 38.1

4.5 2 2.9 3.2 41.3

4.7 1 1.4 1.6 42.9

4.8 3 4.3 4.8 47.6

4.9 3 4.3 4.8 52.4

5 5 7.1 7.9 60.3

5.1 4 5.7 6.3 66.7


5.2 5 7.1 7.9 74.6

5.3 1 1.4 1.6 76.2

5.4 2 2.9 3.2 79.4

5.5 1 1.4 1.6 81.0

5.6 2 2.9 3.2 84.1

5.8 1 1.4 1.6 85.7

5.9 3 4.3 4.8 90.5

6 3 4.3 4.8 95.2

6.1 2 2.9 3.2 98.4

6.2 1 1.4 1.6 100.0

Total 63 90.0 100.0


Missing System 7 10.0
Total 70 100.0

evaluation

Frequency Percent Valid Percent Cumulative Percent

Valid checking of the specifications 24 34.3 35.3 35.3

each purchase separately 44 62.9 64.7 100.0

Total 68 97.1 100.0


Missing System 2 2.9
Total 70 100.0

Order_type

Frequency Percent Valid Percent Cumulative Percent

Valid decentralized 37 52.9 54.4 54.4

centralized 31 44.3 45.6 100.0

Total 68 97.1 100.0

Missing System 2 2.9

Total 70 100.0

Industry

Frequency Percent Valid Percent Cumulative Percent

Valid others 36 51.4 52.2 52.2

raw 33 47.1 47.8 100.0


Total 69 98.6 100.0

Missing System 1 1.4

Total 70 100.0

Purchase_Situation

Frequency Percent Valid Percent Cumulative Percent

Valid new 22 31.4 32.4 32.4

modified repurchase 22 31.4 32.4 64.7

simple repurchase 24 34.3 35.3 100.0

Total 68 97.1 100.0

Missing System 2 2.9


Total 70 100.0

In the following table, we check whether values are randomly or systematically missing. To do this, we examine the correlations between all variables, as shown in the table below. In SPSS output, two stars mark a correlation that is significant at the 0.01 level and one star marks a correlation that is significant at the 0.05 level. Values marked with two stars indicate that the variables are highly correlated with each other. For instance, sales force and reputation are highly correlated, as are delivery speed and purchase situation. This means that a respondent who did not answer the sales force question also did not answer the reputation question. Bias exists when respondents are reluctant to answer any of these variables. A single star also indicates correlation, but a weaker one than two stars. We treat correlations marked with a single star as a sign that the values are randomly missing and that the variables are not strongly correlated with each other. For instance, the table shows that the correlation between utility and satisfaction is .437* with a significance of .029. This indicates that the values are not systematically missing; the respondents missed the variable unexpectedly.
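One way to reproduce this kind of check outside SPSS is to correlate the 0/1 missing indicators of two variables. The sketch below assumes that the correlations are computed between those indicators; the helper `pearson` and the indicator values are hypothetical and only illustrate the contrast between a systematic and a near-random pattern.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

# Hypothetical missing indicators for 8 cases (1 = question not answered).
miss_sales_force = [1, 1, 0, 0, 1, 0, 0, 0]
miss_reputation  = [1, 1, 0, 0, 1, 0, 0, 0]  # same cases skip both questions
miss_utility     = [0, 1, 0, 1, 0, 0, 1, 0]  # unrelated pattern

r_systematic = pearson(miss_sales_force, miss_reputation)
r_random = pearson(miss_sales_force, miss_utility)
print(round(r_systematic, 3), round(r_random, 3))
```

A correlation near 1 means the same respondents skipped both questions, which points to systematic missingness; a correlation near 0 is what random missingness would look like.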
After filtration of the data, impute the missing values
With series means
Series mean replaces missing values with the mean for the entire series. Missing observations can be problematic in analysis, and some time series measures cannot be computed if there are missing values in the series. Sometimes the value for a particular observation is simply not known. In addition, missing data can result from any of the following:
Each degree of differencing reduces the length of a series by 1.
Each degree of seasonal differencing reduces the length of a series by one season.
If you create new series that contain forecasts beyond the end of the existing series, the generated residual series will have missing data for the new observations.
Some transformations (for example, the log transformation) produce missing data for certain values of the original series.

The following table shows the results after replacing the missing values. Column 3 shows the number of missing values that were replaced. For instance, 21 answers were missing for Delivery_speed, and we replaced them with the series mean. The same was done for each of the 14 variables.
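Series mean replacement can be sketched as follows. The function name `impute_series_mean` and the sample values are hypothetical; the function also returns the number of replaced values, mirroring column 3 of the result table.

```python
from statistics import mean

def impute_series_mean(series):
    """Replace every None with the mean of the observed values.

    Returns the completed series and the number of replaced values.
    """
    observed = [v for v in series if v is not None]
    series_mean = mean(observed)
    filled = [series_mean if v is None else v for v in series]
    return filled, len(series) - len(observed)

# Hypothetical excerpt of one variable with two unanswered questions.
delivery_speed = [3.0, None, 4.0, 5.0, None]
filled, n_replaced = impute_series_mean(delivery_speed)
print(filled, n_replaced)
```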

With linear interpolation

Linear interpolation replaces each missing value by interpolating linearly between the last valid value before it and the first valid value after it. If the first or last case in the series has a missing value, that missing value is not replaced. In the above table, we replaced the missing values using the linear interpolation method.
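The rule can be written out in a short Python sketch. The function name `impute_linear` and the sample series are hypothetical; the behaviour follows the description above, including leaving leading and trailing gaps untouched.

```python
def impute_linear(series):
    """Fill None values by linear interpolation between valid neighbours.

    Gaps at the start or end of the series have only one valid
    neighbour, so they are left as None.
    """
    filled = list(series)
    valid = [i for i, v in enumerate(series) if v is not None]
    # Walk over consecutive pairs of valid positions and fill the gap between them.
    for left, right in zip(valid, valid[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled

# Hypothetical series with a leading gap, two inner gaps and a trailing gap.
satisfaction = [None, 4.0, None, 5.0, None, None, 6.5, None]
print(impute_linear(satisfaction))
```

For a single missing value between two answers, the interpolated value is simply the mean of the two neighbours.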
Hot Deck method
Hot deck imputation is a method for handling missing data in which each missing value is replaced with an observed response from a

similar unit. Missing data are often a problem in large-scale surveys, arising when a sampled unit does not respond to the entire

survey (unit non-response) or to a particular question (item non-response). A common technique for handling item non-response is

imputation, whereby the missing values are filled in to create a complete data set that can then be analyzed with traditional analysis

methods. It is important to note at the outset that usually sample surveys are conducted with the goal of making inferences about

population quantities such as means, correlations and regression coefficients, and the values of individual cases in the data set are not

the main interest. Thus, the objective of imputation is not to get the best possible predictions of the missing values, but to replace them

by plausible values in order to exploit the information in the recorded variables in the incomplete cases for inference about population
parameters. For this purpose we use the hot deck method, in which the missing values of one or more variables for a non-respondent (called the recipient) are replaced with observed values from a respondent (the donor) that is similar to the non-respondent with respect to characteristics observed in both cases.

This method differs from the series mean. The series mean takes the average of a variable over all cases, whereas this method takes the average of randomly selected cases within a particular cluster and places that value in the same variable for the cases within that cluster.
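A minimal sketch of hot deck imputation is given below. It implements the classic random-donor variant, in which a missing value is copied from one randomly chosen respondent in the same cell, rather than the cluster-average variant mentioned above. The function name `hot_deck`, the choice of Size_company as the matching variable, and the records are assumptions made for the example.

```python
import random

def hot_deck(records, target, by, seed=0):
    """Fill missing `target` values with a random donor value from the same `by` cell."""
    rng = random.Random(seed)  # fixed seed so the illustration is repeatable
    # Collect the donor pool: observed target values grouped by cell.
    donors = {}
    for rec in records:
        if rec[target] is not None:
            donors.setdefault(rec[by], []).append(rec[target])
    completed = []
    for rec in records:
        rec = dict(rec)  # copy, so the original records stay untouched
        if rec[target] is None and donors.get(rec[by]):
            rec[target] = rng.choice(donors[rec[by]])
        completed.append(rec)
    return completed

# Hypothetical records: Quality is imputed within Size_company cells.
records = [
    {"Size_company": "small", "Quality": 6.8},
    {"Size_company": "small", "Quality": 7.2},
    {"Size_company": "small", "Quality": None},
    {"Size_company": "large", "Quality": 8.4},
    {"Size_company": "large", "Quality": None},
]
completed = hot_deck(records, target="Quality", by="Size_company")
print([rec["Quality"] for rec in completed])
```

A recipient whose cell contains no donor keeps its missing value, which is the "empty cell" problem that arises when too many matching variables are used.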

Merits and Demerits of Imputation

Series Mean
Series Mean imputation is a method in which the missing value on a certain variable is replaced by the mean of the available cases.

This method maintains the sample size and is easy to use, but the variability in the data is reduced, so the standard deviations and the

variance estimates tend to be underestimated. The magnitude of the covariances and correlation also decreases by restricting the

variability and this method often causes biased estimates, irrespective of the underlying missing data mechanism.
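The loss of variability is easy to see with a small numeric sketch. The values are hypothetical: four observed answers and four missing answers that are all replaced by the mean.

```python
from statistics import mean, stdev

# Hypothetical variable: four observed answers, four missing ones.
observed = [2.0, 4.0, 6.0, 8.0]
n_missing = 4

# Series mean imputation puts the same value into every missing cell.
imputed = observed + [mean(observed)] * n_missing

sd_before = stdev(observed)  # sample standard deviation of the observed answers
sd_after = stdev(imputed)    # sample standard deviation after imputation

print(mean(observed), mean(imputed))
print(round(sd_before, 3), round(sd_after, 3))
```

The mean is unchanged, but the standard deviation drops because the imputed cases add no spread around the mean.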

For numerical data, the missing values are replaced by the mean of all respondents to that question. This preserves the correct average value, yet it is generally not a good method: it can distort the shape of distributions and the relationships between variables.

The advantage of this imputation method is that it simply places the average of all observed values into the missing value box, so it is simple and less time-consuming. The primary disadvantage is that it uses the average of that specific variable across all cases, which carries a high risk of distorted and biased results.

For missing values on multi-item questionnaires, mean imputation can be applied at the item level. One option is to impute the missing item

scores with the item mean for each item. In that case, the average of the respondents with observed scores for each item is computed

and that average value is imputed for respondents with a missing score. Another option is to impute the person mean. In that method,

the average of the observed item scores for each respondent is computed and that average is imputed for the item scores that are

missing for that respondent.
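The two options can be compared in a short sketch. The score matrix is hypothetical, with one row per respondent and one column per item; the function names are illustrative only.

```python
from statistics import mean

# Hypothetical item scores: rows are respondents, columns are items.
scores = [
    [4.0, 6.0, None],
    [2.0, None, 3.0],
    [3.0, 4.0, 6.0],
]

def item_mean_impute(rows):
    """Replace a missing score with the mean of that item over all respondents."""
    n_items = len(rows[0])
    item_means = [
        mean(row[j] for row in rows if row[j] is not None) for j in range(n_items)
    ]
    return [[item_means[j] if v is None else v for j, v in enumerate(row)] for row in rows]

def person_mean_impute(rows):
    """Replace a missing score with the mean of that respondent's observed items."""
    completed = []
    for row in rows:
        person_mean = mean(v for v in row if v is not None)
        completed.append([person_mean if v is None else v for v in row])
    return completed

print(item_mean_impute(scores))
print(person_mean_impute(scores))
```

The same missing cell receives different values under the two options, because one borrows information across respondents and the other across items.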

Linear Interpolation
Linear interpolation replaces each missing value by interpolating linearly between the last valid value before it and the first valid value after it. If the first or last case in the series has a missing value, that value is not replaced, because one of the two neighbouring values needed for the interpolation does not exist. The term "linear" is specific enough to define a model, whereas "nonlinear" is not: one might consider many different nonlinear functions of time as possible models. The merit of this technique is its simplicity. For a single missing value it takes the value from the case above and the value from the case below, takes their mean, and puts the result in the missing value box. Overall, however, missing data represent a complex problem for the data analyst, and simple solutions such as replacing missing values are generally not advised. Furthermore, imputing missing values is not recommended for variables with a large amount of missing data.

Hot Deck
In hot deck imputation, the missing values are filled in by selecting the values from other records within the survey data. It gets its

name from the way it was originally carried out when survey data was on cards and the cards were sorted in order to find similar

records to use for the imputation. The process involves finding other records in the data set that are similar in other parts of their

responses to the record with the missing value or values. Often there will be more than one record that could be used for hot deck

imputation, and the records that could potentially be used for filling a cell are known as donor records. Hot deck imputation often

involves taking, not the best match, but a random choice from a series of good matches and replacing the missing value or values with

one of the records from the donor set.

Hot deck imputation is very heavily used with census data. It has the advantage that it can be carried out as the data are being collected

using everything that is in the data set so far. Hot deck imputation procedures are usually programmed up in a programming language

and generally done by a survey firm often around the time the data are being collected. Hot-deck imputation is a popular and widely

used imputation method to handle missing data. The method involves filling in missing data on variables of interest from nonrespondents (or recipients) using observed values from respondents (i.e. donors) within the same survey data set. Hot-deck imputation

can be applied to missing data caused by either failure to participate in a survey (i.e. unit nonresponse) or failure to respond to certain

survey questions (i.e. item non-response). The term hot deck, in contrast with cold deck, dates back to the storage of data on punch

cards. It indicates that the donors and the recipients are from the same data set; the stack of cards was "hot" because it was currently

being processed.

Hot decks only impute values from the original data, which makes them very good at reproducing marginal distributions including

idiosyncratic features of the data that would be smoothed out by parametric methods. This advantage turns into a downside if the

source data suffers from problems and one would like to impose constraints or model selection into the source data. Classic hot deck

methods cannot include continuous variables in the conditioning set and are limited in the number of categorical variables because the

curse of dimensionality quickly makes the number of cells large, which results in imputation from empty or thinly populated cells.

Thereby, they often fail to capture multivariate relationships beyond basic ones such as univariate statistics within cells. The main advantage of more parametric methods is that they allow more flexible conditioning sets and thereby capture the relations between variables better, but they make more parametric assumptions which are often chosen arbitrarily, because specification tests are not available. However,

the main problem of hot deck methods is that it is very hard to work with the imputed values, because we know very little about their

theoretical properties. Overall, this makes the hot deck a rather specialized imputation procedure. If the object of interest is univariate

or a univariate population statistic within one of the imputation cells, the hot deck may perform well since it does not make a
functional form assumption on the marginal distributions and only uses the original data. However, in multivariate models it is hard to

specify and likely to produce biased and inconsistent estimates.

After imputation of the data, again check the descriptive statistics of the data and compare these

statistics with the initial data set.
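This final comparison can be sketched as follows. The helper `describe` and both series are hypothetical; the imputed series stands for the result of a series mean replacement on the initial one.

```python
from statistics import mean, median, stdev

def describe(values):
    """Descriptive statistics of the valid (non-missing) values."""
    valid = [v for v in values if v is not None]
    return {
        "n_valid": len(valid),
        "n_missing": len(values) - len(valid),
        "mean": round(mean(valid), 3),
        "median": round(median(valid), 3),
        "sd": round(stdev(valid), 3),
        "min": min(valid),
        "max": max(valid),
    }

# Hypothetical variable before and after series mean imputation.
initial = [3.0, None, 4.0, 5.0, None, 6.0]
imputed = [3.0, 4.5, 4.0, 5.0, 4.5, 6.0]

before, after = describe(initial), describe(imputed)
for key in before:
    print(f"{key:10} {before[key]:>8} {after[key]:>8}")
```

Statistics that stay the same (here the mean) and statistics that shift (here the standard deviation) show how strongly the chosen imputation method has altered the data.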
