ABSTRACT

This research paper explores a variety of strategies for performing classification with missing feature values. The classification setting is particularly affected by the presence of missing feature values since most discriminative learning approaches, including logistic regression, support vector machines, and neural networks, have no natural ability to deal with missing input features. Our main interest is in classification methods that can both learn from data cases with missing features and make predictions for data cases with missing features.

Keywords: Multiple imputation, Classification, Machine learning approaches

1. INTRODUCTION

We begin with an overview of strategies for dealing with missing data in classification. Generative classifiers learn a joint model of labels and features, and have a natural ability to learn from incomplete data cases and to make predictions when features are missing. We then discuss several strategies that can be applied to any discriminative classifier, including case deletion, imputation, and classification in subspaces. Finally, we discuss a framework for classification from incomplete data based on augmenting the input representation of complete data classifiers with a vector of response indicators.

We consider Linear Discriminant Analysis as an example of a generative classifier. We present both maximum likelihood and maximum conditional likelihood learning methods for a regularized Linear Discriminant Analysis model with missing data. We consider applying these methods to classification with missing data using imputation, reduced models, and the response indicator framework.

2. Frameworks for Classification with Missing Features

Generative classifiers have a natural ability to deal with missing data through marginalization. This makes them well suited for dealing with randomly missing data. The most well-known methods for dealing with missing data in discriminative classifiers are case deletion, imputation, and learning in subspaces. All of these methods can be applied in conjunction with any classifier that operates on complete data. In this section we discuss these methods for dealing with missing data. We also discuss a different strategy for converting a complete data classifier into a classifier that can operate on incomplete data cases by augmenting the input representation with response indicators.

2.1 Generative Classifiers

Generative classifiers model the joint distribution of labels and features. If any feature values are missing, they can be marginalized over when classifying data cases. In class conditional models like the Naive Bayes classifier and Linear Discriminant Analysis, the marginalization operation can be performed efficiently. Missing data must also be dealt with during learning; this typically requires an application of the Expectation Maximization algorithm.
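The response-indicator strategy mentioned above can be sketched in a few lines: fill the missing entries with a neutral value and append one binary indicator per feature, so that any complete-data classifier can consume the result. The data and the zero-fill convention here are purely illustrative.

```python
import numpy as np

def augment_with_indicators(X):
    """Convert an incomplete data matrix (np.nan marks missing values)
    into a complete-data representation: zero-fill the missing entries
    and append one binary response indicator per feature (1 = observed)."""
    X = np.asarray(X, dtype=float)
    R = (~np.isnan(X)).astype(float)          # response indicators
    X_filled = np.where(np.isnan(X), 0.0, X)  # zero-fill missing entries
    return np.hstack([X_filled, R])

X = np.array([[1.0, np.nan],
              [np.nan, 3.0]])
Z = augment_with_indicators(X)
# Z has shape (2, 4): zero-filled features followed by the indicators
```

Any complete-data classifier (logistic regression, an SVM, a neural network) can then be trained on the augmented matrix Z without further changes.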
@ IJTSRD | Available Online @ www.ijtsrd.com | Volume – 2 | Issue – 3 | Mar-Apr 2018 Page: 746
International Journal of Trend in Scientific Research and Development (IJTSRD) ISSN: 2456-6470
3. Linear Discriminant Analysis

In this section we present Linear Discriminant Analysis and its application to classification with missing features. We begin by reviewing Fisher's original conception of Linear Discriminant Analysis. We then describe the relationship between Fisher's view and a view based on maximum probability classification in a class conditional Gaussian model. We discuss several extensions of LDA, including Quadratic Discriminant Analysis (QDA) and Regularized Discriminant Analysis (RDA). We introduce a new method for missing data classification based on generative training of a linear discriminant analysis model with a factor analysis-style covariance matrix. Finally, we present a discriminative training method for the same model that maximizes the conditional probability of labels given features.

6.2.5 LDA and Missing Data

As a generative model, Linear Discriminant Analysis has a natural ability to deal with missing input features. The class conditional probability of a data vector with missing input features is given in Equation 1. The posterior probability of each class given a data case with missing features is shown in Equation 2.
$P(\mathbf{x}_n^{o_n} \mid y_n = k) = \mathcal{N}(\mathbf{x}_n^{o_n};\, \mu_k^{o_n},\, \Sigma^{o_n o_n})$ ---- (1)

$P(y_n = k \mid \mathbf{x}_n^{o_n}) = \dfrac{\pi_k\, \mathcal{N}(\mathbf{x}_n^{o_n};\, \mu_k^{o_n},\, \Sigma^{o_n o_n})}{\sum_{c=1}^{K} \pi_c\, \mathcal{N}(\mathbf{x}_n^{o_n};\, \mu_c^{o_n},\, \Sigma^{o_n o_n})}$ ---- (2)
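Equations 1 and 2 amount to evaluating each class conditional Gaussian on the observed sub-vector only, then normalizing across classes. A minimal sketch (the means, covariance, and priors below are made-up illustration, not values from the paper):

```python
import numpy as np

def class_posterior(x, observed, means, cov, priors):
    """Posterior over classes for a data case with missing features.
    x: length-D vector (entries outside `observed` are ignored),
    observed: indices of observed dimensions, means: (K, D) class means,
    cov: (D, D) shared covariance, priors: length-K class priors."""
    o = np.asarray(observed)
    xo = x[o]
    cov_oo = cov[np.ix_(o, o)]                       # observed sub-matrix
    L = np.linalg.cholesky(cov_oo)
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    logps = []
    for k in range(len(priors)):
        d = xo - means[k][o]
        alpha = np.linalg.solve(cov_oo, d)
        # log prior + log N(x^o; mu_k^o, Sigma^oo), dropping the shared 2*pi term
        logps.append(np.log(priors[k]) - 0.5 * (d @ alpha) - 0.5 * logdet)
    logps = np.array(logps)
    p = np.exp(logps - logps.max())                  # stable normalization
    return p / p.sum()

means = np.array([[0.0, 0.0], [2.0, 2.0]])
cov = np.eye(2)
# Only dimension 0 is observed; dimension 1 is marginalized out.
post = class_posterior(np.array([0.0, np.nan]), [0], means, cov, [0.5, 0.5])
```

Because the unnormalized class scores share the constant terms, dropping them leaves the posterior unchanged; only the observed sub-vector and sub-covariance ever enter the computation.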
Maximum Likelihood Estimation

The maximum likelihood estimate of the mean parameters is computed from incomplete data as shown in Equation 3. The factor analysis covariance parameters can then be estimated by first subtracting the appropriate class mean as seen in Equation 3, and then applying the Expectation Maximization algorithm for factor analysis with missing data. The dimensionality of the latent factors Q is a free parameter that must be set using cross validation.
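The maximum likelihood mean estimate is just a per-class, per-dimension average over the entries that are actually observed. A sketch follows; using np.nan to mark missing values is an illustrative convention, not the paper's notation.

```python
import numpy as np

def ml_class_means(X, y, K):
    """Estimate class means from incomplete data: for each class k and
    dimension d, average x_nd over cases with y_n = k where d is observed."""
    X = np.asarray(X, dtype=float)
    D = X.shape[1]
    means = np.zeros((K, D))
    for k in range(K):
        Xk = X[np.asarray(y) == k]
        obs = ~np.isnan(Xk)                      # the [d in o_n] indicator
        sums = np.where(obs, Xk, 0.0).sum(axis=0)
        counts = obs.sum(axis=0)
        means[k] = sums / np.maximum(counts, 1)  # guard: dimension never observed
    return means

X = [[1.0, np.nan],
     [3.0, 4.0],
     [np.nan, 8.0]]
mu = ml_class_means(X, [0, 0, 1], K=2)
# mu[0] averages only the observed entries of class 0: [2.0, 4.0]
```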
The parameters of the full covariance matrix can be estimated using the Expectation Maximization algorithm. However, when data vectors are high dimensional and there are a relatively small number of data cases, it is preferable to use a structured covariance approximation. We choose to use a factor analysis-like covariance matrix of the form $\Lambda\Lambda^{\top} + \Psi$ with $\Psi$ diagonal. We call this model LDA-FA for Linear Discriminant Analysis with Factor Analysis covariance. The factor analysis covariance model is slightly more general than the PPCA covariance model used by Tipping and Bishop in their LDA experiments. Note that while Tipping and Bishop also consider learning PPCA models with missing data, they do not consider the simultaneous application of PPCA to linear discriminant analysis with missing data.

6.2.6 Discriminatively Trained LDA and Missing Data

One of the main drawbacks of generatively trained classification methods is that they tend to be very sensitive to violations of the underlying modeling assumptions. In this section we consider a discriminative training procedure for the LDA-FA model described in the previous section. The main insight is that we can fit the LDA-FA model parameters by maximizing the conditional probability of the labels given the incomplete features instead of maximizing the joint probability of the labels and the incomplete features. This training procedure is closely related to the minimum classification error factor analysis algorithm introduced by Saul and Rahim for complete data.

The posterior class probabilities given an incomplete feature vector are again given by Equation 2.
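The structured covariance used by the LDA-FA model can be sketched numerically as follows; the dimensions, the random loadings, and the noise values are all illustrative. Marginalizing such a covariance to the observed dimensions is again just sub-matrix selection.

```python
import numpy as np

rng = np.random.default_rng(0)
D, Q = 5, 2                       # data dimension, number of latent factors
Lam = rng.normal(size=(D, Q))     # factor loading matrix (Lambda)
Psi = np.diag(rng.uniform(0.5, 1.0, size=D))   # diagonal noise covariance

# Factor analysis covariance: Lambda Lambda^T + Psi
Sigma = Lam @ Lam.T + Psi

# Marginal covariance over an observed subset of dimensions:
obs = np.array([0, 2, 4])
Sigma_oo = Sigma[np.ix_(obs, obs)]

# Sigma is symmetric positive definite by construction,
# since Lambda Lambda^T is PSD and Psi has strictly positive diagonal.
assert np.allclose(Sigma, Sigma.T)
assert np.all(np.linalg.eigvalsh(Sigma) > 0)
```

The appeal of this parameterization is that only D(Q + 1) parameters are learned instead of the D(D + 1)/2 of a full covariance, which matters when data cases are scarce relative to D.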
$\hat{\mu}_{kd} = \dfrac{\sum_{n=1}^{N} [y_n = k]\,[d \in o_n]\, x_{nd}}{\sum_{n=1}^{N} [y_n = k]\,[d \in o_n]}$ ----- (3)

$\mathcal{L} = \sum_{n=1}^{N} \sum_{k=1}^{K} [y_n = k] \log P_{nk}$ ----- (4)

where $P_{nk} = P(y_n = k \mid \mathbf{x}_n^{o_n})$ denotes the posterior class probability of Equation 2.
We begin by computing the partial derivative of the conditional log likelihood with respect to the posterior class probabilities $P_{nk}$, and the partial derivative of the posterior class probability with respect to the activation $a_{kn}$.

$\dfrac{\partial \mathcal{L}}{\partial P_{nk}} = \dfrac{[y_n = k]}{P_{nk}}, \qquad \dfrac{\partial P_{nk}}{\partial a_{cn}} = P_{nk}([k = c] - P_{nc})$ ----- (5)

We then compute the partial derivative of $a_{cn}$ with respect to $\mu_c$, and use the chain rule to find the partial derivative of the conditional log likelihood with respect to $\mu_c$. The projection matrix $H_{o_n}$ was introduced in Section 1.2.1. Recall that $H_{o_n}$ projects the observed dimensions of $\mathbf{x}_n^{o_n}$ back into $D$ dimensions such that the missing dimensions are filled in with zeros. These projection matrices arise naturally when taking the derivative of a sub-matrix or sub-vector with respect to a full dimensional matrix or vector. Also recall that $o_n$ refers to the vector of observed dimensions for data case $\mathbf{x}_n$ such that $o_{in} = d$ if $d$ is the $i$th observed dimension of $\mathbf{x}_n$.
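Since the posterior class probabilities form a softmax over the per-class activations, the derivative of the posterior with respect to an activation has the standard form $P_{nk}([k = c] - P_{nc})$. A quick finite-difference check of that Jacobian, with illustrative activation values:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# Analytic Jacobian of the softmax posterior w.r.t. the activations:
# dP_k / da_c = P_k * ([k == c] - P_c)
a = np.array([0.3, -1.2, 0.7])
P = softmax(a)
J = np.diag(P) - np.outer(P, P)

# Central finite-difference approximation of the same Jacobian
eps = 1e-6
J_num = np.zeros((3, 3))
for c in range(3):
    ap = a.copy(); ap[c] += eps
    am = a.copy(); am[c] -= eps
    J_num[:, c] = (softmax(ap) - softmax(am)) / (2 * eps)

assert np.allclose(J, J_num, atol=1e-8)
```

Each column of the Jacobian sums to zero, reflecting the fact that the posterior probabilities are constrained to sum to one.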
Table 6.1: Summary of illustrative results for generative and discriminatively trained LDA-FA models. We
report the log loss (average negative log probability of the correct class), as well as the average classification
error.
CONCLUSION
The multiple imputation results show much smoother classification functions than any of the other methods. This results from a combination of noise due to sampling variation in the imputations, as well as from the fact that each classification function results from an ensemble of logistic regression classifiers. The multiple imputation results also show that multiple imputation can perform well even if the imputation model is incorrect. There is little difference in the classification functions based on a one component factor analysis model and a two component factor analysis mixture. This is
explained by the fact that if a single Gaussian is used
to explain both clusters in the Simple training data set,
the conditional densities are approximately correct for
most data cases, even though a two component
mixture gives a much better fit to the data.
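The multiple imputation scheme discussed above can be sketched generically: draw several completions of each incomplete case from an imputation model, score each completion with the classifier, and average the predicted probabilities. The Gaussian imputation model, the fixed logistic scoring function, and simple averaging below are all illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(1)

def impute_draw(x, mean, cov):
    """Draw one completion of x: sample the missing entries from the
    conditional Gaussian given the observed entries (illustrative model)."""
    m = np.isnan(x)
    if not m.any():
        return x.copy()
    mi, oi = m.nonzero()[0], (~m).nonzero()[0]
    cov_mo = cov[np.ix_(mi, oi)]
    cov_oo = cov[np.ix_(oi, oi)]
    cov_mm = cov[np.ix_(mi, mi)]
    cond_mean = mean[mi] + cov_mo @ np.linalg.solve(cov_oo, x[oi] - mean[oi])
    cond_cov = cov_mm - cov_mo @ np.linalg.solve(cov_oo, cov_mo.T)
    out = x.copy()
    out[mi] = rng.multivariate_normal(cond_mean, cond_cov)
    return out

def ensemble_predict(x, predict_proba, mean, cov, M=10):
    """Average a classifier's predictive probabilities over M imputations."""
    return np.mean([predict_proba(impute_draw(x, mean, cov)) for _ in range(M)])

# Hypothetical trained classifier: a fixed logistic model on two features
def predict_proba(x):
    return 1.0 / (1.0 + np.exp(-(x[0] + x[1])))

mean, cov = np.zeros(2), np.eye(2)
p = ensemble_predict(np.array([1.0, np.nan]), predict_proba, mean, cov, M=200)
```

Averaging over many sampled completions is what smooths the resulting classification function: individual draws are noisy, but their mean varies gently with the observed inputs.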