Identify the undesired values and clean the data from typo errors:
In this analysis, we identified several typographical errors: out-of-range decimal values, such as the minimum of 0.4 and the maximum of 3.8 in the second column of the table below.
Statistics
                V1     V2     V3      V4     V5     V6     V7     V8    V9     V10    V11   V12   V13   V14
N (valid)       49     57     53      63     61     64     60     68    61     63     68    68    69    68
N (missing)     21     13     17      7      9      6      10     2     9      7      2     2     1     2
Mean            4.008  1.944  8.062   5.17   2.856  2.611  6.81   .37   46.03  4.759  .65   .46   .48   2.03
Median          3.800  1.900  8.200   5.00   3.000  2.600  6.80   .00   47.00  4.900  1.00  .00   .00   2.00
Mode(a)         3.0    2.0    6.4     5      3.3    2.5    7      0     49     5.0    1     0     0     3
Std. Deviation  .9318  .8751  1.4072  1.171  .7760  .7174  1.691  .486  9.356  .8319  .481  .502  .503  .828
Minimum         2.8    .4     5.0     3      1.1    1.1    2      0     25     3.3    0     0     0     1
Maximum         6.5    3.8    9.9     8      4.6    4.0    10     1     65     6.2    1     1     1     3
(a. Multiple modes exist. Columns V1-V14 correspond to the study's 14 variables in the order of the original output; the labels were not preserved in the export.)
A valid value should not fall below 1 or rise above 7, in keeping with the Likert scale used. Price flexibility, however, genuinely stands out from the rest of the variables when it comes to appraising the perceptions of purchasing professionals.
Count the missing values per variable and per case:
Most real-world data contain some (or many) missing values, and it is always a good idea to inspect the amount of missing data in order to avoid unpleasant surprises later on. System-missing values are truly absent from the data; in Data View they are shown as empty cells holding just a tiny dot. User-missing values, by contrast, are present in the data but must be excluded from calculations and analyses. To count missing values per variable, we recode each variable into a dummy variable that is 1 for a missing value and 0 for every other value; summing each dummy gives the number of missing values per variable. Summing the dummies across variables into a new variable named "MISSING" shows how many questions were left unanswered by a single case; according to the output, a few cases did not answer more than 5 questions, as examined later on.
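The counting described above was done in SPSS; as a rough sketch of the same idea in Python/pandas (hypothetical data and column names, not the study's):

```python
import pandas as pd
import numpy as np

# Hypothetical survey data: NaN marks an unanswered question.
df = pd.DataFrame({
    "price":        [3.1, np.nan, 2.8, 4.0],
    "service":      [np.nan, 2.6, 3.0, np.nan],
    "satisfaction": [4.9, 5.0, np.nan, 4.7],
})

# Missing values per variable: sum the 0/1 missing-indicator ("dummy") columns.
per_variable = df.isna().sum()

# Missing values per case: sum the indicators across each row,
# mirroring the "MISSING" variable described above.
df["MISSING"] = df.isna().sum(axis=1)

print(per_variable)
print(df["MISSING"].tolist())  # [1, 1, 1, 1]
```

`isna()` produces exactly the 0/1 dummy coding the text describes, so both counts fall out of a single sum.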
Data Filtration
Data filtration is an important step for gaining complete insight into the missing values per case. As the table above suggests, we need to eliminate the 6 respondents who did not answer 7 of the questions. If we also removed the 4 cases that each left 3 questions unanswered, too few cases would remain for the analysis; moreover, we can impute those 3 missing values per case with less bias than we could for the 6 cases that skipped 7 questions each. If we eliminated respondents other than the ones chosen on this rational basis, we might lose significant information.
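The filtering rule above can be sketched in pandas as a threshold on the per-case missing count (hypothetical data; the threshold is illustrative):

```python
import pandas as pd
import numpy as np

# Hypothetical data: drop respondents who skipped too many questions.
df = pd.DataFrame({
    "q1": [1, np.nan, 3, np.nan],
    "q2": [2, np.nan, np.nan, 4],
    "q3": [3, np.nan, 5, 6],
})

# Threshold is illustrative; in the text, cases missing 7 of 14 items are dropped.
MAX_MISSING = 1

n_missing = df.isna().sum(axis=1)        # missing count per case
kept = df[n_missing <= MAX_MISSING].copy()  # retain cases under the threshold

print(len(kept))  # 3 (the all-missing respondent is removed)
```

Cases below the threshold keep their few gaps, which are then filled by imputation as described in the following sections.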
[SPSS Frequencies output: one table per variable (Price, Price_flexibility, reputation, service, Sales_forces, Quality, Size_company, Utility, Satisfaction, evaluation, Order_type, Industry, Purchase_Situation), each reporting Frequency, Percent, Valid Percent, and Cumulative Percent over the 70 cases; e.g. for Price_flexibility the value 5 occurs once (1.4% of all cases, 1.9% of valid cases).]
In the following table, we check whether the missing values are missing at random or missing systematically. To do this, we examine the correlations between the missingness of all factors, as shown in the table below. Values flagged with two stars indicate factors whose missingness is highly correlated; for instance, sales force and reputation are highly correlated, as are delivery speed and purchase situation. This means that respondents who did not answer the sales-force question also tended not to answer the reputation question. Bias can arise when respondents are reluctant to answer these particular items. A single star also indicates correlation, but a weaker one than two stars; such values suggest the corresponding data are missing at random, with no substantial association between the missingness patterns. For instance, the table shows a correlation of .437* between utility and satisfaction, with a significance of .029, which we take to indicate that the values are not systematically missing: the respondent simply missed the item unexpectedly.
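One way to approximate this check outside SPSS is to correlate the 0/1 missingness indicators of the variables; a small pandas sketch with made-up data (variable names borrowed from the text, values hypothetical):

```python
import pandas as pd
import numpy as np

# Hypothetical responses; NaN = unanswered.
df = pd.DataFrame({
    "sales_force": [1.0, np.nan, 2.0, np.nan, 3.0, 1.5],
    "reputation":  [2.0, np.nan, 1.0, np.nan, 2.5, 2.0],
    "utility":     [np.nan, 3.0, 2.0, 1.0, np.nan, 2.0],
})

# 0/1 indicators of missingness, then their pairwise correlations.
indicators = df.isna().astype(int)
corr = indicators.corr()

# Here sales_force and reputation are always missing together -> correlation 1.0,
# suggesting systematic (non-random) missingness for this pair.
print(corr.loc["sales_force", "reputation"])  # 1.0
```

A high correlation between two indicators means the same respondents skipped both items, which is the signature of systematic missingness described above.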
After filtration of the data, impute the missing values
With series means
Series mean replaces missing values with the mean of the entire series. Missing observations can be problematic in analysis, and some time series measures cannot be computed if there are missing values in the series. Sometimes the value for a particular observation is simply not known. In addition, missing data can result from any of the following:
- Each degree of differencing reduces the length of a series by 1.
- Each degree of seasonal differencing reduces the length of a series by one season.
- If you create new series that contain forecasts beyond the end of the existing series, the generated residual series will have missing data for the new observations.
- Some transformations (for example, the log transformation) produce missing data for certain values of the original series.
The number of missing values replaced is shown in the third column of the output; for instance, 21 items were left unanswered in one variable, and the series mean was imputed into each missing-value cell, across all 14 variables.
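A minimal pandas sketch of series-mean imputation (illustrative data, not the study's):

```python
import pandas as pd
import numpy as np

# Hypothetical variable with missing entries.
s = pd.Series([2.0, np.nan, 4.0, np.nan, 6.0])

# Series-mean imputation: every NaN is replaced by the mean
# of the observed values, here (2 + 4 + 6) / 3 = 4.0.
imputed = s.fillna(s.mean())

print(imputed.tolist())  # [2.0, 4.0, 4.0, 4.0, 6.0]
```

Note that the imputed series keeps the same mean but has a smaller standard deviation than the observed data, which is exactly the drawback discussed in the Series Mean section below.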
Linear interpolation replaces missing values using a linear interpolation. The last valid value before the missing value and the first
valid value after the missing value are used for the interpolation. If the first or last case in the series has a missing value, the missing
value is not replaced. In the table above, we replaced the missing values using the Linear Interpolation method.
Hot Deck method
Hot deck imputation is a method for handling missing data in which each missing value is replaced with an observed response from a
similar unit. Missing data are often a problem in large-scale surveys, arising when a sampled unit does not respond to the entire
survey (unit non-response) or to a particular question (item non-response). A common technique for handling item non-response is
imputation, whereby the missing values are filled in to create a complete data set that can then be analyzed with traditional analysis
methods. It is important to note at the outset that usually sample surveys are conducted with the goal of making inferences about
population quantities such as means, correlations and regression coefficients, and the values of individual cases in the data set are not
the main interest. Thus, the objective of imputation is not to get the best possible predictions of the missing values, but to replace them
by plausible values in order to exploit the information in the recorded variables in the incomplete cases for inference about population
parameters. For such purposes we use the hot deck method: imputation involves replacing missing values of one or more variables for a
non-respondent (called the recipient) with observed values from a respondent (the donor) that is similar to the non-respondent with respect to characteristics observed for both.
This method differs from the series mean, because the series mean averages over all cases of a particular variable, whereas the hot deck draws on randomly selected cases within a particular cluster and places the resulting value in the same variable within that cluster.
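A toy sketch of this within-cluster idea, assuming a cluster label is already available and each cluster has at least one observed donor (pandas, hypothetical data):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: impute within clusters of similar respondents.
df = pd.DataFrame({
    "cluster": ["A", "A", "A", "B", "B", "B"],
    "score":   [3.0, np.nan, 4.0, 7.0, 8.0, np.nan],
})

def hot_deck(group: pd.Series) -> pd.Series:
    """Replace each NaN with a randomly drawn observed value (a donor)
    from the same cluster; assumes the cluster has at least one donor."""
    donors = group.dropna().to_numpy()
    return group.apply(lambda v: rng.choice(donors) if pd.isna(v) else v)

df["score"] = df.groupby("cluster")["score"].transform(hot_deck)
print(df["score"].tolist())
```

Because every filled-in value is an actually observed response from a similar case, the imputed column keeps realistic values rather than an artificial average.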
Series Mean
Series Mean imputation is a method in which the missing value on a certain variable is replaced by the mean of the available cases.
This method maintains the sample size and is easy to use, but the variability in the data is reduced, so the standard deviations and the
variance estimates tend to be underestimated. The magnitude of the covariances and correlation also decreases by restricting the
variability, and this method often causes biased estimates, irrespective of the underlying missing-data mechanism.
For numerical data, the missing values are replaced by the mean of all respondents to that question. This preserves the average value, but it is generally not a good method: it can distort the shape of distributions and the relationships between variables.
The advantage of this imputation method is that it simply places the overall average into each missing-value cell; it is also easy and quick. Its primary disadvantage is that it averages that specific variable over all cases, which creates a high risk of biased estimates.
For missing values on multi-item questionnaires, mean imputation can be applied at the item level. One option is to impute the missing item
scores with the item mean for each item. In that case, the average of the respondents with observed scores for each item is computed
and that average value is imputed for respondents with a missing score. Another option is to impute the person mean. In that method,
the average of the observed item scores for each respondent is computed, and that average is imputed for the item scores that are missing.
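The item-mean and person-mean options can be sketched as follows (pandas, hypothetical item scores):

```python
import pandas as pd
import numpy as np

# Hypothetical 3-item questionnaire; respondent 1 skipped item2.
items = pd.DataFrame({
    "item1": [4.0, 5.0, 3.0],
    "item2": [3.0, np.nan, 4.0],
    "item3": [5.0, 4.0, 2.0],
})

# Item mean: average of the observed scores on item2 -> (3 + 4) / 2 = 3.5
item_mean = items.fillna(items.mean())

# Person mean: average of respondent 1's observed items -> (5 + 4) / 2 = 4.5
person_mean = items.apply(lambda row: row.fillna(row.mean()), axis=1)

print(item_mean.loc[1, "item2"])    # 3.5
print(person_mean.loc[1, "item2"])  # 4.5
```

The two options generally give different imputed values, as here, because one conditions on the item and the other on the respondent.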
Linear Interpolation
Linear interpolation replaces missing values using a linear interpolation. The last valid value before the missing value and the first
valid value after the missing value are used for the interpolation. If the first or last case in the series has a missing value, the missing
value is not replaced. Linear interpolation is specific enough to define a model, whereas "nonlinear interpolation" is not: one could consider many different nonlinear functions of time as possible models. One advantage of the procedure is that each missing value is replaced, by linear interpolation, with a value derived from the nearby points. Overall, missing data represent a
complex problem for the data analyst, and simple solutions such as replacing missing values are generally not advised. Furthermore, imputing missing values is not recommended for variables with a large amount of missing data. In this technique, the last complete observation before the missing value and the first complete observation after it are used in place of the missing data. If the first or last observations in the series are missing, no values can be imputed in their place. The merit of this technique is that it takes one value from the case above and one from the case below the missing value, computes their mean, and places that value in the missing-value cell.
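A minimal sketch of this behaviour with pandas `Series.interpolate` (illustrative data; `limit_area="inside"` reproduces the rule that leading and trailing gaps stay unfilled):

```python
import pandas as pd
import numpy as np

# Hypothetical series with a leading, an interior, and a trailing gap.
s = pd.Series([np.nan, 2.0, np.nan, 6.0, np.nan])

# Linear interpolation uses the last valid value before and the first
# valid value after each gap; limit_area="inside" leaves the leading
# and trailing NaNs untouched, matching the description above.
imputed = s.interpolate(method="linear", limit_area="inside")

print(imputed.tolist())  # [nan, 2.0, 4.0, 6.0, nan]
```

For a single interior gap, the interpolated value (4.0) is just the mean of its two neighbours, which is the merit described above.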
Hot Deck
In hot deck imputation, the missing values are filled in by selecting the values from other records within the survey data. It gets its
name from the way it was originally carried out when survey data was on cards and the cards were sorted in order to find similar
records to use for the imputation. The process involves finding other records in the data set that are similar in other parts of their
responses to the record with the missing value or values. Often there will be more than one record that could be used for hot deck imputation, and the records that could potentially fill a cell are known as donor records. Hot deck imputation often involves taking, not the best match, but a random choice from a series of good matches, and replacing the missing value or values with those of the chosen donor.
Hot deck imputation is very heavily used with census data. It has the advantage that it can be carried out as the data are being collected
using everything that is in the data set so far. Hot deck imputation procedures are usually programmed up in a programming language
and generally done by a survey firm often around the time the data are being collected. Hot-deck imputation is a popular and widely
used imputation method for handling missing data. It involves filling in missing data on variables of interest for nonrespondents (i.e. recipients) using observed values from respondents (i.e. donors) within the same survey data set. Hot-deck imputation
can be applied to missing data caused by either failure to participate in a survey (i.e. unit nonresponse) or failure to respond to certain
survey questions (i.e. item non-response). The term hot deck, in contrast with cold deck, dates back to the storage of data on punch
cards. It indicates that the donors and the recipients are from the same data set; the stack of cards was "hot" because it was currently
being processed.
Hot decks only impute values from the original data, which makes them very good at reproducing marginal distributions including
idiosyncratic features of the data that would be smoothed out by parametric methods. This advantage turns into a downside if the source data suffer from problems and one would like to impose constraints or model selection on the imputation. Classic hot deck methods cannot include continuous variables in the conditioning set and are limited in the number of categorical variables, because the curse of dimensionality quickly makes the number of cells large, which results in imputation from empty or thinly populated cells. Thereby, they often fail to capture multivariate relationships beyond basic ones such as univariate statistics within cells. The main advantage of parametric methods, by contrast, is that they allow more flexible conditioning sets and thereby capture the relations between variables better; but they make more parametric assumptions, which are often chosen arbitrarily because specification tests are not available. However,
the main problem of hot deck methods is that the imputed values are very hard to work with, because we know little about their theoretical properties. Overall, this makes the hot deck a rather specialized imputation procedure. If the object of interest is univariate, or is a univariate population statistic within one of the imputation cells, the hot deck may perform well, since it makes no functional-form assumption on the marginal distributions and uses only the original data. In multivariate models, however, it is hard to capture the joint relationships between variables correctly.
After imputing the data, check the descriptive statistics again and compare them with the statistics computed before imputation.