
Comparison of Linear Discriminant Analysis and Logistic

Regression for Data Classification


Choong-Yeun Liong and Sin-Fan Foo
School of Mathematical Sciences, Faculty of Science and Technology,
Universiti Kebangsaan Malaysia, 46300 UKM Bangi,
Selangor DE, Malaysia.
Abstract. Linear discriminant analysis (LDA) and logistic regression (LR) are often used for the purpose of classifying
populations or groups using a set of predictor variables. Assumptions of multivariate normality and equal variance-covariance matrices across groups are required before proceeding with LDA, but such assumptions are not required for
LR, and hence LR is considered to be much more robust than LDA. In this paper, several real datasets which differ
in terms of normality, number of independent variables and sample size are used to study the performance of both
methods. The methods are compared based on the percentage of correct classification and the B index. The results show that,
overall, LR performs better regardless of whether the distribution of the data is normal or nonnormal. However, LR needs longer
computing time than LDA as the sample size increases. The performance of LDA was also tested by using various
prior probabilities. The results show that the average percentage of correct classification and the B index are higher when
the prior probability is set based on the group size rather than using equal probabilities for all groups.
Keywords: linear discriminant analysis, logistic regression, multivariate normality, prior probability, sample size
PACS: 02.50.-r

INTRODUCTION
Linear discriminant analysis (LDA) and logistic regression (LR) are two widely used multivariate statistical
methods for data analysis with categorical outcome variables [1]. Both methods can construct linear classification
model which creates a linear boundary between two groups. In other words, LDA and LR assist in profiling the
characteristics of the subjects being studied and then assign them to the most suitable group.
Both methods differ in their basic assumptions. LDA makes a few assumptions, such as that the
explanatory or predictor variables must be normally distributed. When this assumption is fulfilled, LDA is expected
to produce accurate and good results, whereas violation of the multivariate normality assumption will cause severe
shortcomings. Another assumption made is equality of the unknown variance-covariance matrices for all categories
of the grouping variable. Unequal covariance matrices will have a negative impact on the classification process
[2]. LR appears as a robust alternative to LDA as it does not need any underlying assumption on the
distribution of the data. Hence, LR has always been suggested as the first choice for data classification,
especially when the data is not normally distributed. Nevertheless, the computing time of LR is much
longer than that of LDA, making it a less desirable alternative.
This study focuses on evaluating and comparing the performance of both LDA and LR. Since the
normality assumption has been studied by a number of researchers more than any other issue, it is the
main topic of this study. Furthermore, by using real datasets, the performance of both methods should
provide a more practical point of view on their usage. The computing time will also be compared. In
addition, the effect of the prior probability setting for linear discriminant analysis will be reported.

COMPARISON CRITERIA
Two measurements are used in this study to assess the performance of LDA and LR, and the criteria used
to compare the two methods are discussed below.
The first measurement, which is the percentage of correct classification, evaluates the performance of a
classification model with the following formula:

Proceedings of the 20th National Symposium on Mathematical Sciences


AIP Conf. Proc. 1522, 1159-1165 (2013); doi: 10.1063/1.4801262
2013 AIP Publishing LLC 978-0-7354-1150-0/$30.00


Percentage of correct classification =
(Number of observations correctly classified in a particular group /
 Total number of observations in that group) × 100%    (1)

Although classification error is the simplest and the most frequently used measurement for comparison between
two methods, according to Harrell and Lee [3], it is a very insensitive and statistically inefficient measure.
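The computation in Eq. (1) is straightforward; a minimal Python sketch (the labels below are hypothetical, purely for illustration) is:

```python
def pct_correct(actual, predicted, group=None):
    """Percentage of correct classification, per Eq. (1).

    If `group` is given, the percentage is computed only over
    observations whose actual membership is that group."""
    pairs = [(a, p) for a, p in zip(actual, predicted)
             if group is None or a == group]
    correct = sum(1 for a, p in pairs if a == p)
    return 100.0 * correct / len(pairs)

# Hypothetical labels, purely for illustration:
actual    = ["F", "F", "M", "M", "M"]
predicted = ["F", "M", "M", "M", "M"]
print(pct_correct(actual, predicted))        # overall: 80.0
print(pct_correct(actual, predicted, "F"))   # female group only: 50.0
```

The per-group form is what the tables in this paper report for "Female", "Male" and "Total" rows.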
The second measurement, the B index, is used to assess the accuracy of the outcome
prediction. It is an efficient measure because it provides information on how well a model discriminates between the
groups and how good the prediction is. The B index is one minus the average squared difference between the
estimated and actual values, so its values lie in [0, 1], where 1 indicates perfect prediction [1]. The
B index is computed as follows:
B = 1 − Σ_{i=1}^{n} (Pi − Yi)² / n    (2)

where

Pi = estimated probability of group membership for observation i
Yi = actual group membership (1 or 0)
n = sample size of both populations.
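Equation (2) can be sketched directly in Python (an illustrative helper, with made-up probabilities):

```python
def b_index(p, y):
    """B index per Eq. (2): one minus the mean squared difference
    between the estimated probability p_i and the actual group
    membership y_i (1 or 0). A value of 1 indicates perfect prediction."""
    n = len(p)
    return 1.0 - sum((pi - yi) ** 2 for pi, yi in zip(p, y)) / n

# Perfect prediction yields B = 1:
print(b_index([1.0, 0.0, 1.0], [1, 0, 1]))   # 1.0
# An uninformative model (all probabilities 0.5) yields B = 0.75:
print(b_index([0.5, 0.5], [1, 0]))           # 0.75
```

Because it uses the estimated probabilities rather than only the final group assignments, the B index is sensitive to how confident the model is, not just whether its hard classifications are right.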

Level of Normality
In the process of building a linear classification model, LDA makes several assumptions about the data. LR is seen
as a more flexible and robust method when the assumptions made by LDA are violated [1]. One of the assumptions
made is that the independent variables are normally distributed with equal covariance matrices. Harrell and Lee [3]
stated that when the multivariate normality assumption is met, LR is able to give the same level of accuracy as
discriminant analysis. However, if the normality assumption is not fulfilled, LR has more benefits
compared to LDA because its results are not affected by the degree of normality of the independent variables
[2].
In practice, the presence of a binary variable violates the multivariate normality assumption [4]. Truett et al.
[5] also emphasized that the assumption can hardly be met in applications. When the independent variables are
not normally distributed, especially with at least one discrete variable, the use of discriminant analysis is
inappropriate [3]. On the contrary, LR works well with non-metric variables by using dummy variable coding [2].
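The dummy variable coding mentioned above can be sketched as follows (an illustrative helper; the choice of the first level in sorted order as the reference category is an assumption for this example):

```python
def dummy_code(values):
    """Dummy (indicator) coding of a non-metric variable: each level
    except a reference level becomes a 0/1 column. Here the first
    level in sorted order is taken as the reference, an arbitrary
    choice made for illustration."""
    levels = sorted(set(values))
    rest = levels[1:]  # levels[0] is the reference category
    return [[1 if v == lvl else 0 for lvl in rest] for v in values]

# "blue" is the reference; the columns correspond to green, red:
print(dummy_code(["red", "blue", "red", "green"]))
# [[0, 1], [0, 0], [0, 1], [1, 0]]
```

A categorical predictor with k levels thus enters LR as k − 1 binary columns, sidestepping any normality requirement on that variable.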

Effect of Varying Sample Size on Computing Time


A study by Press and Wilson [4] showed that the computing time of LR is longer than that needed
for discriminant analysis. Brenn and Arnesen [6] compared both methods using a large dataset and arrived at the
conclusion that discriminant analysis takes a shorter time than LR, although the estimates yielded are about the same for
both methods. Nonetheless, not much research has been carried out on computing time. Therefore, it is
worth investigating further the effect of varying sample sizes on the computing time taken by each method in
performing the classification task.
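A timing comparison of this kind can be sketched with a simple harness that averages wall-clock time over repeated trials, mirroring the three-trial averaging used later in Table 5 (`fit_fn` is a hypothetical stand-in for whichever classifier-fitting routine is being timed):

```python
import time

def average_fit_time(fit_fn, data, trials=3):
    """Average wall-clock time of a fitting routine over repeated
    trials. `fit_fn` is a placeholder for the LDA or LR fit being
    timed; `data` is whatever input that routine expects."""
    elapsed = []
    for _ in range(trials):
        t0 = time.perf_counter()
        fit_fn(data)
        elapsed.append(time.perf_counter() - t0)
    return sum(elapsed) / trials

# Example with a trivial placeholder "fit":
avg = average_fit_time(lambda d: sorted(d), list(range(10000)))
print(avg >= 0.0)  # True
```

Averaging over several trials smooths out run-to-run variation from the operating system, which matters when the differences being compared are fractions of a second.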

Prior Probability
Prior probability refers to the size of a particular group relative to the population [2]. Normally the prior probability
is estimated by the group size relative to the entire sample. It can be set in two ways: the first is to assume that
the two groups are of the same size, and the second is to compute the prior probability based on each group's size.
There are a few reasons why some researchers assign equal probabilities to all groups instead of computing them
from group sizes. One main reason is that the researcher does not know the real population sizes, and hence
all groups are assumed to be of the same size. Another reason is that it eases the computational processes
involved.


In LR, unequal prior probabilities will move the boundary line closer to the group with the smaller size. This
influences only the constant term, while the coefficients of the independent variables remain the same [1]. Besides
that, an analysis of prior probability for LR is not available in SPSS. Therefore, prior probability in LR is not
examined in this study, and the interest in this criterion is solely on LDA.
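How the prior shifts the LDA decision boundary can be illustrated with a univariate two-group discriminant rule (a simplified sketch; the means, pooled variance and priors below are hypothetical, not the study's fitted model):

```python
import math

def lda_classify(x, mean1, mean2, pooled_var, prior1=0.5):
    """Two-group univariate LDA rule: assign x to group 1 when
    prior1 * N(x; mean1, s2) exceeds (1 - prior1) * N(x; mean2, s2).
    With equal priors the boundary sits midway between the means;
    an unequal prior shifts it toward the smaller-prior group's mean."""
    def log_density(x, m, v):  # log of the normal density, constants dropped
        return -((x - m) ** 2) / (2 * v)
    score1 = log_density(x, mean1, pooled_var) + math.log(prior1)
    score2 = log_density(x, mean2, pooled_var) + math.log(1 - prior1)
    return 1 if score1 > score2 else 2

# Hypothetical means 0 and 2, pooled variance 1 (not the study's data):
print(lda_classify(0.9, 0.0, 2.0, 1.0))              # 1 (boundary at 1.0)
print(lda_classify(0.9, 0.0, 2.0, 1.0, prior1=0.2))  # 2 (boundary shifted left)
```

The same observation is assigned to different groups under the two prior settings, which is exactly the mechanism behind the classification differences reported in Tables 8, 10 and 12.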

DESCRIPTIONS OF THE DATASETS


There are five datasets in this study:
(i) wolves.sav: This is the smallest dataset, with 25 observations. After undergoing a normality test, wolves.sav
meets the multivariate normality assumption.
(ii) HBAT.sav: This dataset is taken from the book Multivariate Data Analysis [2]. It contains 100
observations and is a survey of the customers of a paper manufacturing company. HBAT.sav
violates the normality assumption, but not too badly.
(iii) HATCO.sav: This dataset is also demonstrative material from the book Multivariate Data Analysis. It contains
100 observations. HATCO.sav also violates the multivariate normality assumption, but not too badly.
(iv) General.sav: This dataset has 5022 observations. However, there are only 2671 effective observations after
deleting those with incomplete information. It is adopted from GVU's 10th WWW User Survey, which
was conducted by GVU, a research centre in Atlanta, USA. None of the independent variables are normally
distributed.
(v) spss_use.sav: This dataset is also from GVU's 10th WWW User Survey. The file has 3291 records, of which
3205 are effective observations after filtering out those with incomplete details. About 60% of the dataset, or more
precisely 1868 observations, were used as the analysis sample while the rest were used as the validation sample.
The file has 122 independent variables. The normality test shows that none of the independent variables are
normally distributed, and therefore they do not fulfil the normality assumption required by LDA.

RESULTS AND DISCUSSION


Effect of Different Levels of Normality
Three datasets with different levels of normality are tested using LDA and LR. In this case, the prior
probability for LDA is set at 0.5 for both classification groups. The percentage of correct classification and the B
index are then computed and compared.
All the independent variables in the dataset wolves.sav are normally distributed. LDA is expected to give good
results for normally distributed data, but the results show otherwise. Referring to Table 1, LR achieves 100%
correct classification while LDA is weaker with 88%. Besides that, the B index for LR
is 1 while it is only 0.727 for LDA. In other words, in this study, LR gives perfect prediction and its performance is
better than LDA when processing the normally distributed dataset wolves.sav.

TABLE (1). Classification results for wolves.sav
Method   Gender   Percentage of Correct Classification   B Index
LDA      Female    77.8
         Male      93.8
         Total     88.0                                  0.727
LR       Female   100.0
         Male     100.0
         Total    100.0                                  1.000

The next step is to analyse a dataset which violates the multivariate normality assumption, but not too badly. HBAT.sav
matches this requirement and its results are shown in Table 2. LDA classified 93% of the objects to the correct
region while LR achieves only 92%. The difference in the percentage of correct classification is not significant, but
their B index values differ greatly: LDA has a B index of 0.640 whereas LR obtains 0.952. This
shows that the percentage of correct classification can be misleading. With reference to the B index, the discriminating power
of LR is stronger than that of LDA even though its percentage of correct classification is slightly lower.

TABLE (2). Classification results for HBAT.sav
Method   Region                  Percentage of Correct Classification   B Index
LDA      USA / North America     100.0
         Outside North America    88.5
         Total                    93.0                                  0.640
LR       USA / North America      92.3
         Outside North America    91.8
         Total                    92.0                                  0.952

The comparison continues by testing a dataset which totally violates the normality assumption. The dataset
spss_use.sav is split into two parts: 1920 observations as the analysis sample and 1371 observations as the
validation sample. Table 3 shows that LR performs better than LDA in the nonnormal case. A
total of 77.6% of the objects were classified correctly by LR while LDA attained only 75.1%. Moreover, the B index
for LR is 0.845 while it is only 0.729 for LDA. Therefore, LR produces higher prediction accuracy than
LDA.
TABLE (3). Classification results for spss_use.sav (analysis sample)
Method   Gender   Percentage of Correct Classification   B Index
LDA      Female   73.5
         Male     75.8
         Total    75.1                                   0.729
LR       Female   53.0
         Male     89.6
         Total    77.6                                   0.845

For the validation sample, the results also show that LDA is weaker in classifying data which does not
meet the underlying assumption. With LDA, 74.6% of the people were classified to the correct gender. LR
succeeded in categorising 76.4% of people correctly, yet it failed to classify more than half of the females to the
correct group. LDA has a B index of 0.732 while LR does better with 0.846. In this case, LR has better
predictive power than LDA when analysing nonnormal data.
TABLE (4). Classification results for spss_use.sav (validation sample)
Method   Gender   Percentage of Correct Classification   B Index
LDA      Female   72.3
         Male     75.7
         Total    74.6                                   0.732
LR       Female   49.9
         Male     88.7
         Total    76.4                                   0.846

Effect of Varying Sample Size on Computing Time


The three datasets tested here are HATCO.sav with a sample size of 100, General.sav with a treated
sample size of 2671, and spss_use.sav with a treated sample size of 3205. In Table 5, the computing times for
HATCO.sav are the shortest, at 1.03 s and 1.07 s for LDA and LR respectively. In this case, LR takes a slightly
longer time than LDA. However, for the dataset General.sav, the average computing time for LDA is about 0.33 s longer
than for LR, yet the percentage of correct classification for LDA is worse than for LR. On the other hand, LR takes about 4.66
times longer than LDA when processing the large dataset spss_use.sav, even though it gives a better
result of correct classification, 76.6% compared to 74.3% obtained by LDA. This behaviour may
be caused by the distributions of the three datasets. Hence, we can conclude that overall LR performs better than
LDA.
From the aspect of the number of independent variables, spss_use.sav has the most, with 122, followed by
General.sav with 29, and HATCO.sav with only 9. A question raised at this point is the effect of the number of
independent variables on the computing time taken by both methods.

TABLE (5). Computing time for two-group classification for three datasets.
Dataset        Method   Number of independent   Trial 1   Trial 2   Trial 3   Average   Percentage of Correct
                        variables               (s)       (s)       (s)       (s)       Classification
HATCO.sav      LDA      9                        1.2       0.8       1.1       1.03      95.0
               LR                                1.3       0.9       1.0       1.07     100.0
General.sav    LDA      29                       2.0       1.5       1.0       1.50      60.0
               LR                                1.3       1.2       1.0       1.17      75.5
spss_use.sav   LDA      122                      3.4       2.6       2.3       2.77      74.3
               LR                               13.2      13.6      11.9      12.9       76.7

Random samples of various sizes were drawn from the single dataset spss_use.sav to minimise bias
in comparing the two methods. Ten random samples of various sizes (30, 50, 80, 100, 500, 1000, 1500, 2000, 2500
and 3000) were generated for the comparison. The number of independent variables is 122.
Referring to Table 6, LR uses a shorter computing time than LDA for sample sizes in the range of 30 to 500.
When the sample size increases to 1000 and above, the computing time taken by LR increases at a faster rate. LR
takes 10.2 s when analysing the random sample of size 3000 while LDA uses only 3.2 s to
complete the classification, a difference of 7 s between the two methods.
Although the computing time needed for LDA increases from 1.6 s to 3.2 s, double
the initial time, the value is far smaller than the time taken by LR.

Effect of Prior Probability


In this part, only LDA is tested and the prior probability is either set equal for all groups or computed using
group size. Since there are only two groups for the outcome variable, the equal prior probability is 0.5
for each group, while the prior probability computed from group size depends on the dataset
tested. Three datasets are involved in this part.
Table 7 shows the prior probabilities for spss_use.sav computed according to group size: 0.324 for the female
group and 0.676 for the male group. With the computed prior probabilities, LDA discriminates 76.9% of people to the
correct gender, compared with 74.3% under equal prior probabilities. Apparently, by using the
computed prior probabilities, LDA produces a better result. The B index when
using the computed prior probabilities (0.836) is also clearly higher than when using equal prior probabilities (0.733).
HATCO.sav is the second dataset analysed in this part. When the prior probabilities are set at 0.6 and 0.4 for
small and large firms respectively, LDA gives a percentage of correct classification of 96%, against 95%
with equal prior probabilities. However, these percentages are close. Based on the B index,
the model with equal prior probabilities makes the better prediction: its B index is 0.488,
whereas it is only 0.470 for the model built with the computed
prior probabilities. Again, since the B index values for both prior probability settings are close, we conclude that
the difference in the discriminating power of the two models is not significant. Both models are actually weak
because their B indexes are less than 0.5.


TABLE (6). Computing time for two-group classification for the dataset spss_use.sav
Size of the     Number of useful   Percentage of Correct   Computing time (s)
random sample   observations       Classification
                                   LDA        LR           LDA     LR
30                30                90.0     100.0         1.6      1.2
50                50               100.0      95.9         1.6      0.9
80                79               100.0     100.0         1.9      1.0
100               99                91.0     100.0         2.0      1.5
500              491                74.8      77.5         2.2      1.7
1000             972                74.6      78.8         2.4      3.5
1500            1464                75.2      76.9         2.4      4.0
2000            1944                73.7      77.1         2.3      6.2
2500            2438                74.2      76.6         3.2      7.6
3000            2921                73.4      76.9         3.2     10.2

TABLE (7). Prior probability for each group in spss_use.sav
Gender   Sample Size   Prior Probability
Female   1038          0.324
Male     2167          0.676
Total    3205          1.000
TABLE (8). Classification results for spss_use.sav with LDA
Prior Probability   Gender   Percentage of Correct Classification   B Index
0.5, 0.5            Female   72.0
                    Male     75.5
                    Total    74.3                                   0.733
0.324, 0.676        Female   50.0
                    Male     89.8
                    Total    76.9                                   0.836
TABLE (9). Prior probability for each group in HATCO.sav
Firm Size   Sample Size   Prior Probability
Small        60           0.6
Large        40           0.4
Total       100           1.0

TABLE (10). Classification results for HATCO.sav with LDA
Prior Probability   Firm Size   Percentage of Correct Classification   B Index
0.5, 0.5            Small        91.7
                    Large       100.0
                    Total        95.0                                  0.488
0.6, 0.4            Small        95.0
                    Large        97.5
                    Total        96.0                                  0.470

The third dataset used to study the effect of the prior probability setting on classification accuracy is
General.sav. The prior probabilities computed from group size are 0.249 for the female group and 0.751 for the male
group. 60% of the people are categorised correctly when the prior probability is set at 0.5 for each group, while
67.2% are classified to the correct group when using the computed prior probabilities. The B index
for the computed prior probabilities is 0.920, much higher than the 0.736 obtained with equal prior probabilities.


TABLE (11). Prior probability for each group in General.sav
Gender   Sample Size   Prior Probability
Female    665          0.249
Male     2006          0.751
Total    2671          1.000

TABLE (12). Classification results for General.sav with LDA
Prior Probability   Gender   Percentage of Correct Classification   B Index
0.5, 0.5            Female   61.4
                    Male     59.4
                    Total    60.0                                   0.736
0.249, 0.751        Female    5.8
                    Male     98.3
                    Total    67.2                                   0.920

CONCLUSIONS
In conclusion, LR has higher predictive power than LDA and fits well to various types of data
distribution. Some researchers have shown that LDA performs well when the normality assumption of the
independent variables is fulfilled or not too badly violated. However, for all levels of normality in this study, LR
obtained a higher percentage of correct classification and B index than LDA.
When the sample size is large, the computing time used by LR is longer than that of LDA. For example,
when the whole dataset spss_use.sav with more than 3000 observations is processed, LR needs about 4.66 times
the computing time of LDA. Moreover, the computing time taken by LDA
is consistent for all sample sizes. Setting the computing time issue aside, nearly all the percentages of correct
classification for LR are higher than for LDA. Hence, a recommendation is to use LDA
when cost and time for building the classification models are limited, with the condition that the sample size is
large and the data is normally distributed or the normality assumption is not too badly violated. On the other hand,
the number of independent variables has no significant effect on the computing time taken by either method.
This study has also found that if the prior probabilities are computed based on group size rather
than set equal for all groups, LDA can achieve a higher percentage of correct classification and B
index. Therefore, in order to obtain higher predictive accuracy, prior probabilities should be calculated according to
group size when using LDA.

ACKNOWLEDGMENTS
We would like to express our gratitude to Universiti Kebangsaan Malaysia and the Ministry of Higher Education
Malaysia (MOHE) for the financial support through Research Grant No. UKM-ST-06-FRGS0183-2010.

REFERENCES
1. M. Pohar, M. Blas and S. Turk, Metodološki zvezki 1, 143–161 (2004).
2. J. F. Hair, W. C. Black, B. J. Babin, R. E. Anderson and R. L. Tatham, Multivariate Data Analysis, Upper Saddle River, N.J.:
Pearson Prentice Hall, 2006.
3. F. E. Harrell, Jr. and K. L. Lee, "A Comparison of the Discrimination of Discriminant Analysis and Logistic Regression Under
Multivariate Normality," in Biostatistics: Statistics in Biomedical, Public Health and Environmental Sciences, edited by P. K.
Sen, Amsterdam: North-Holland, 1985, pp. 333–343.
4. S. J. Press and S. Wilson, Journal of the American Statistical Association 73, 699–705 (1978).
5. J. Truett, J. Cornfield and W. Kannel, Journal of Chronic Diseases 20, 511–524 (1967).
6. T. Brenn and E. Arnesen, Statistics in Medicine 4, 413–423 (1985).

