INTRODUCTION
Linear discriminant analysis (LDA) and logistic regression (LR) are two widely used multivariate statistical
methods for analysing data with categorical outcome variables [1]. Both methods construct a linear classification
model that creates a linear boundary between two groups. In other words, LDA and LR profile the
characteristics of the subjects under study and then assign each subject to the most suitable group.
The two methods differ in their underlying assumptions. LDA assumes, among other things, that the
explanatory or predictor variables are normally distributed. When this assumption is fulfilled, LDA is expected
to produce accurate results, whereas violation of the multivariate normality assumption causes severe
shortcomings. LDA also assumes equality of the unknown variance-covariance matrices across all categories
of the grouping variable; unequal covariance matrices negatively affect the classification process [2].
LR is a robust alternative to LDA because it requires no assumption about the distribution of the data.
Hence, LR has often been suggested as the first choice for classification, especially when the data are not
normally distributed. Nevertheless, the computing time of LR is much longer than that of LDA, making it a
less desirable alternative.
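As a hedged illustration of the point that both methods reduce to a linear boundary, the following sketch fits LDA and LR to the same synthetic two-group data and prints the linear rule each one learns. The data, parameters and scikit-learn library are assumptions of this example, not part of the study.

```python
# Illustrative sketch: both LDA and LR yield a linear decision rule
# w . x + b, differing only in how the coefficients are estimated.
# The dataset below is synthetic and chosen arbitrarily.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Two Gaussian groups with a shared covariance (LDA's ideal setting)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

lda = LinearDiscriminantAnalysis().fit(X, y)
lr = LogisticRegression().fit(X, y)

# Each model's boundary is a straight line in the predictor space
print("LDA coefficients:", lda.coef_[0], "intercept:", lda.intercept_)
print("LR  coefficients:", lr.coef_[0], "intercept:", lr.intercept_)
```

On well-separated Gaussian groups like these, the two boundaries are typically very close to one another.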
This study focuses on evaluating and comparing the performance of LDA and LR. Since the normality
assumption is the issue most frequently studied by researchers, it is the main topic of this study. Furthermore,
by using real datasets, the comparison should provide a more practical view of how the two methods perform
in use. The computing time of the two methods is also compared, and the effect of the prior probability setting
on linear discriminant analysis is reported.
COMPARISON CRITERIA
Two measurements are used in this study to assess the performance of LDA and LR; this section discusses
the criteria used to compare the two methods.
The first measurement, the percentage of correct classification, evaluates the performance of a
classification model with the following formula:

Percentage of correct classification = (number of correctly classified objects / total number of objects) × 100%   (1)
Although classification error is the simplest and the most frequently used measurement for comparison between
two methods, according to Harrell and Lee [3], it is a very insensitive and statistically inefficient measure.
The second measurement, the B index, assesses the accuracy of the outcome prediction. It is an efficient
measure because it provides information on how well a model discriminates between the groups and how good
the prediction is. The B index is based on the average squared difference between the estimated and actual
values; its values lie in [0, 1], where 1 indicates perfect prediction [1]. The B index is computed as follows:

B = 1 − Σᵢ₌₁ⁿ (Pᵢ − Yᵢ)² / n   (2)

where Pᵢ is the predicted probability of group membership for observation i, Yᵢ is the actual outcome (0 or 1),
and n is the number of observations.
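The two comparison measures, Eq. (1) and Eq. (2), can be sketched as small helper functions. These helpers and the example numbers are assumptions for illustration, not from the paper.

```python
# Minimal sketch of the two comparison measures used in the study.
def percent_correct(predicted_groups, actual_groups):
    """Eq. (1): share of objects assigned to the correct group, in %."""
    hits = sum(p == a for p, a in zip(predicted_groups, actual_groups))
    return 100.0 * hits / len(actual_groups)

def b_index(probs, outcomes):
    """Eq. (2): B = 1 - sum((P_i - Y_i)^2) / n; 1 means perfect prediction."""
    n = len(outcomes)
    return 1.0 - sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / n

# Example: predicted probabilities of group 1 versus actual membership
probs = [0.9, 0.8, 0.3, 0.1]
outcomes = [1, 1, 0, 0]
predicted = [1 if p >= 0.5 else 0 for p in probs]
print(percent_correct(predicted, outcomes))  # 100.0
print(b_index(probs, outcomes))              # about 0.9625
```

Note that all four objects are classified correctly (100%), yet the B index stays below 1 because the probabilities are not exactly 0 or 1; this is why the B index is the more sensitive measure.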
Level of Normality
In building a linear classification model, LDA makes several assumptions about the data; LR is a more
flexible and robust method when those assumptions are violated [1]. One of the assumptions is that the
independent variables are normally distributed with equal covariance matrices. Harrell and Lee [3] stated that
when the multivariate normality assumption is met, LR gives the same level of accuracy as discriminant
analysis. However, when the normality assumption is not fulfilled, LR is preferable to LDA because its results
are not affected by the degree of normality of the independent variables [2].
In practice, the presence of binary variables violates the multivariate normality assumption [4]. Truett et al.
[5] also emphasised that the assumption can hardly be met in applications. When the independent variables are
not normally distributed, especially when at least one is discrete, the use of discriminant analysis is
incorrect [3]. In contrast, LR works well with non-metric variables through dummy variable coding [2].
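A hedged sketch of the dummy variable coding mentioned above: a discrete predictor is expanded into 0/1 indicator columns before fitting LR. The column names, data and use of pandas/scikit-learn are invented for this example.

```python
# Illustration: LR handles a non-metric (categorical) predictor via
# dummy coding. All variable names here are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "income": [2.1, 3.5, 1.8, 4.2, 2.9, 3.1],
    "region": ["north", "south", "south", "north", "east", "east"],
    "group":  [0, 1, 0, 1, 0, 1],
})

# Dummy coding turns the discrete variable into 0/1 indicators,
# dropping one level to avoid a redundant column
X = pd.get_dummies(df[["income", "region"]], columns=["region"],
                   drop_first=True)
model = LogisticRegression().fit(X, df["group"])
print(X.columns.tolist())  # ['income', 'region_north', 'region_south']
```

The dropped level ("east" here) becomes the baseline against which the other region coefficients are interpreted.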
Prior Probability
Prior probability refers to the size of a particular group relative to the population [2]. Normally the prior
probability is estimated by the group size relative to the entire sample. Two settings are common: assuming
that the two groups are of the same size, or computing the prior probability from each group's size. Researchers
assign equal probabilities to all groups instead of computing them from group sizes for a few reasons. One
main reason is that the real population proportions are unknown, so all groups are assumed to be of the same
size. Another is that equal priors simplify the computations involved.
In LR, unequal prior probabilities move the boundary line closer to the group with the smaller size. This
influences only the constant term, while the coefficients of the independent variables remain the same [1].
Besides that, SPSS provides no prior probability option for LR. Therefore, prior probability in LR is not
examined in this study, and interest in this criterion rests solely on LDA.
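The two prior probability settings for LDA can be sketched as follows. The study itself used SPSS; scikit-learn is substituted here purely for illustration, and the unbalanced synthetic dataset is an assumption of this example.

```python
# Sketch: equal priors versus priors computed from the group sizes.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
# Unbalanced groups: 30 observations in group 0, 120 in group 1
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(1.5, 1, (120, 2))])
y = np.array([0] * 30 + [1] * 120)

# Setting 1: assume both groups are the same size
equal = LinearDiscriminantAnalysis(priors=[0.5, 0.5]).fit(X, y)
# Setting 2: compute priors from the observed group sizes
computed = LinearDiscriminantAnalysis(priors=[30 / 150, 120 / 150]).fit(X, y)
# (omitting `priors` makes scikit-learn infer them from group frequencies)

print("equal priors accuracy:   ", equal.score(X, y))
print("computed priors accuracy:", computed.score(X, y))
```

Only the constant term of the discriminant rule changes between the two settings, which shifts the boundary toward the smaller group when priors are unequal.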
TABLE (1). Classification results for the first dataset.
Method  B Index
LDA     0.727
LR      1.000
The next step is to analyse a dataset which violates the multivariate normality assumption, but not too badly.
HBAT.sav matches this requirement, and its results are shown in Table 2. LDA classified 93% of the objects
into the correct region while LR achieved only 92%. The difference in the percentage of correct classification
is not significant, but the B index values differ substantially: LDA has a B index of 0.640 whereas LR obtains
0.952. This demonstrates that the percentage of correct classification is a biased measure. Judged by the
B index, the discriminating power of LR is stronger than that of LDA even though its percentage of correct
classification is slightly lower.
TABLE (2). Classification results for HBAT.sav.
Method  Percentage of Correct Classification  B Index
LDA     93                                    0.640
LR      92                                    0.952
The comparison continues with a dataset that completely violates the normality assumption. The dataset
spss_use.sav is split into two parts: 1920 observations form the analysis sample and 1371 observations form the
validation sample. Table 3 shows that LR performs better than LDA in the non-normal case. A total of 77.6%
of the objects were classified correctly by LR while LDA attained only 75.1%. Moreover, the B index for LR
is 0.845 compared with only 0.729 for LDA. Therefore, LR produces higher prediction accuracy than LDA.
TABLE (3). Classification results for spss_use.sav (analysis sample).
Method  Gender  Percentage of Correct Classification  B Index
LDA     Female  73.5
        Male    75.8
        Total   75.1                                  0.729
LR      Female  53.0
        Male    89.6
        Total   77.6                                  0.845
For the validation sample, the results again show that LDA is weaker at classifying data that do not meet the
underlying assumption. LDA assigned 74.6% of the people to the correct gender. LR succeeded in categorising
76.4% correctly, although it failed to classify more than half of the females into the correct group. LDA has a
B index of 0.732 while LR does better with 0.846. In this case too, LR has better predictive power than LDA
when analysing non-normal data.
TABLE (4). Classification results for spss_use.sav (validation sample).
Method  Gender  Percentage of Correct Classification  B Index
LDA     Female  72.3
        Male    75.7
        Total   74.6                                  0.732
LR      Female  49.9
        Male    88.7
        Total   76.4                                  0.846
As summarised in Table 5, LR requires about 4.66 times longer computing time than LDA when processing
the large dataset spss_use.sav, even though it gives a better percentage of correct classification: 76.7%
compared with 74.3% for LDA. This behaviour may be caused by the distributions of the three datasets.
Overall, we can conclude that LR performs better than LDA.
In terms of the number of independent variables, spss_use.sav has the most with 122, followed by
General.sav with 29 and HATCO.sav with only 9. The question raised at this point is what effect the number of
independent variables has on the computing time taken by both methods.
TABLE (5). Computing time for two-group classification for three datasets.
Dataset       Method  Number of independent variables  Trial 1  Trial 2  Trial 3  Average (s)  Percentage of Correct Classification
HATCO.sav     LDA     9                                1.2      0.8      1.1      1.03         95.0
HATCO.sav     LR      9                                1.3      0.9      1.0      1.07         100.0
General.sav   LDA     29                               2.0      1.5      1.0      1.50         60.0
General.sav   LR      29                               1.3      1.2      1.0      1.17         75.5
spss_use.sav  LDA     122                              3.4      2.6      2.3      2.77         74.3
spss_use.sav  LR      122                              13.2     13.6     11.9     12.90        76.7
Random samples of various sizes were drawn from the single dataset spss_use.sav; this should minimise
bias in comparing the two methods. Ten random samples of sizes 30, 50, 80, 100, 500, 1000, 1500, 2000, 2500
and 3000 were generated for the comparison, each with 122 independent variables.
Referring to Table 6, LR needs less computing time than LDA for sample sizes in the range 30 to 500.
When the sample size increases to 1000 and above, the computing time of LR grows at a faster rate: it takes
10.2 s to analyse the random sample of size 3000, whereas LDA completes the classification in only 3.2 s, a
difference of 7 s. Although the computing time for LDA doubles from 1.6 s to 3.2 s over this range, it remains
far smaller than the time taken by LR.
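A rough sketch of such a timing comparison is shown below. The sample sizes, feature count and use of scikit-learn are assumptions for illustration; the study's own runs were performed in SPSS, and absolute times will differ by machine and implementation.

```python
# Sketch: timing LDA and LR fits on random samples of increasing size,
# in the spirit of Table 6. All settings here are illustrative.
import time
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
for n in (100, 500, 1000):
    X = rng.normal(size=(n, 20))
    # Synthetic binary outcome driven by the first predictor plus noise
    y = (X[:, 0] + rng.normal(size=n) > 0).astype(int)
    for name, model in (("LDA", LinearDiscriminantAnalysis()),
                        ("LR", LogisticRegression(max_iter=1000))):
        start = time.perf_counter()
        model.fit(X, y)
        elapsed = time.perf_counter() - start
        print(f"n={n:5d}  {name:3s}  {elapsed:.4f}s")
```

LDA fits in closed form from group means and a pooled covariance matrix, whereas LR fits iteratively, which is the usual explanation for LR's faster growth in computing time with sample size.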
TABLE (6). Computing time for two-group classification of random samples from spss_use.sav.
Size of the random sample  Percentage of Correct Classification (LDA)  Percentage of Correct Classification (LR)  Computing time of LR (s)
30                         90.0                                        100.0                                      1.2
50                         100.0                                       95.9                                       0.9
80                         100.0                                       100.0                                      1.0
100                        91.0                                        100.0                                      1.5
500                        74.8                                        77.5                                       1.7
1000                       74.6                                        78.8                                       3.5
1500                       75.2                                        76.9                                       4.0
2000                       73.7                                        77.1                                       6.2
2500                       74.2                                        76.9                                       7.6
3000                       73.4                                        76.9                                       10.2
[Table: prior probabilities (group size 40, prior 0.4; total 100, prior 1.0) and B index values 0.488 and 0.470 for an earlier dataset.]
The third dataset used to study the effect of the prior probability setting on classification accuracy is
General.sav. The prior probabilities computed from the group sizes are 0.249 for the female group and 0.751
for the male group. With the prior probability set at 0.5 for each group, 60% of the people are categorised
correctly, whereas 67.2% are classified into the correct group using the computed prior probabilities. The
B index for the computed prior probabilities is 0.920, substantially higher than the 0.736 obtained with equal
prior probabilities.
TABLE (7). Prior probabilities and classification results for General.sav.
Group   Size  Prior Probability
Female  665   0.249
Male    2006  0.751
Total   2671  1.000
B index: 0.736 (equal prior probabilities), 0.920 (computed prior probabilities)
CONCLUSIONS
In conclusion, LR has higher predictive power than LDA and fits well to various distributions of the data.
Some researchers have shown that LDA performs well when the normality assumption on the independent
variables is fulfilled or not too badly violated. However, at every level of normality examined in this study, LR
obtained a higher percentage of correct classification and a higher B index than LDA.
When the sample size is large, the computing time of LR is longer than that of LDA. For example, when the
whole dataset spss_use.sav with more than 3000 observations is processed, LR needs about 4.66 times the
computing time of LDA. Moreover, the computing time of LDA is consistent across all sample sizes. Setting
the computing time issue aside, nearly all the percentages of correct classification for LR are higher than those
for LDA. Hence, the recommendation is to use LDA when cost and time are limited, provided the sample size
is large and the data are normally distributed or the normality assumption is not too badly violated. The number
of independent variables, on the other hand, has no significant effect on the computing time of either method.
This study also found that if the prior probability is computed from the group sizes rather than set equal for
all groups, LDA achieves a higher percentage of correct classification and a higher B index. Therefore, to
obtain higher predictive accuracy, the prior probability should be calculated according to group size when
using LDA.
ACKNOWLEDGMENTS
We would like to express our gratitude to Universiti Kebangsaan Malaysia and the Ministry of Higher Education
Malaysia (MOHE) for the financial support through Research Grant No. UKM-ST-06-FRGS0183-2010.
REFERENCES
1. M. Pohar, M. Blas and S. Turk, Metodološki zvezki 1, 143–161 (2004).
2. J. F. Hair, W. C. Black, B. J. Babin, R. E. Anderson and R. L. Tatham, Multivariate Data Analysis, Upper Saddle River, N.J.:
Pearson Prentice Hall, 2006.
3. F. E. Harrell Jr. and K. L. Lee, "A Comparison of the Discrimination of Discriminant Analysis and Logistic Regression Under
Multivariate Normality" in Biostatistics: Statistics in Biomedical, Public Health and Environmental Sciences, edited by P. K.
Sen, Amsterdam: North-Holland, 1985, pp. 333–343.
4. S. J. Press and S. Wilson, Journal of the American Statistical Association 73, 699–705 (1978).
5. J. Truett, J. Cornfield and W. Kannel, Journal of Chronic Diseases 20, 511–524 (1967).
6. T. Brenn and E. Arnesen, Statistics in Medicine 4, 413–423 (1985).
Copyright of AIP Conference Proceedings is the property of American Institute of Physics and its content may
not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written
permission. However, users may print, download, or email articles for individual use.