Erniel B. Barrios
School of Statistics, University of the Philippines Diliman
School of Statistics
Ramon Magsaysay Avenue
U.P. Diliman, Quezon City
Telefax: 928-08-81
Email: updstat@yahoo.com
ABSTRACT
In modelling count data with multivariate predictors, we often encounter problems with
clustering of observations and interdependency of predictors. We propose to use principal
components of the predictors to mitigate the multicollinearity problem; to abate the
information loss due to dimension reduction, a semiparametric link between the count
dependent variable and the principal components is postulated. Clustering of observations is
accounted for in the model through a random component, and the model is estimated via the
backfitting algorithm. A simulation study illustrates the advantages of the proposed model
over standard Poisson regression in a wide range of simulation scenarios.
1. Introduction
In many diverse fields, outcomes of certain phenomena are measured using indicators that
possess the characteristics of Poisson events, e.g., prevalence of a disease, number of
customers patronizing products/services, number of student enrollees. Poisson regression is
used to characterize such data and to predict the average number of instances an event
occurs, conditional on one or more factors. [1] demonstrated using malaria data that Poisson
regression is advantageous over classical regression in modeling count data: classical
regression analysis requires more predictors to achieve as much predictive ability as Poisson
regression.
Spatial aggregation causes certain Poisson events to manifest clustering. The spread of
AH1N1, for example, is influenced by determinants leading to the vulnerability of individuals
in the same community, and these may differ from the determinants affecting individuals in a
different community. Clusters may be independent of one another, but members of the same
cluster (or neighborhood) are necessarily dependent, since some spatial endowment is
commonly shared among the units that form the cluster. Classical statistical inference assumes
independence of observations, i.e., data are independently collected on similar, homogeneous
units. This assumption is not necessarily true for clustered data. Thus, analyzing clustered
data with methods that implicitly assume independence of observations may yield incorrect
conclusions about the dynamics of the events/phenomena being characterized.
Predictors that explain the occurrence of Poisson events within a cluster can also be naturally
correlated. The interdependence among predictors usually causes problems in statistical
inference involving linear models. The multicollinearity problem exists when two or more
explanatory variables in a regression model are highly correlated, implying inefficiency
of the ordinary least squares estimates of the regression coefficients. As an illustration,
consider income and educational attainment as predictors of political preference. Income and
educational attainment are structurally correlated since income varies according to the level of
educational attainment of an individual. The presence of multicollinearity in a statistical
model inflates the standard error of the estimated coefficients, resulting in unreliable
inference on the individual effects of the predictors.
There are several solutions to the multicollinearity problem. For example, instead of the
individual predictors, some important principal components are used in the model. In the
presence of multicollinearity, the design matrix becomes ill-conditioned, if not singular;
principal components analysis transforms the correlated variables into a smaller number of
mutually independent components.
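As a sketch of this idea, the following snippet (with simulated, hypothetical data) shows how principal components transform two highly correlated predictors into uncorrelated components, with the first component carrying nearly all of the variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two strongly correlated predictors (simulated for illustration only).
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)             # nearly collinear with x1
X = np.column_stack([x1, x2])

# Principal components from the eigen-decomposition of the covariance.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (n - 1)
eigvals, eigvecs = np.linalg.eigh(cov)         # ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
scores = Xc @ eigvecs                          # component scores

# The scores are uncorrelated, and the first component carries
# nearly all of the variance of the original pair.
corr_scores = np.corrcoef(scores.T)[0, 1]
variance_share = eigvals[0] / eigvals.sum()
```

Dropping the second component here discards almost no variance, which is exactly the trade-off principal components regression exploits.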
We propose a model for characterizing high dimensional clustered count data. The clustering
effect is accounted for in the model through a random intercept term. Dimension reduction is
achieved through principal components, and because of the inherent deterioration in model fit
due to dimension reduction, the covariate effect summarized in terms of the principal
components is postulated as nonparametric functions.
2. Review of Related Literature
Classical linear regression assumes a continuous dependent variable and will lead to
inefficient, inconsistent, and biased estimates when applied to a count dependent variable.
Poisson regression is appropriate for modeling count data. Even when Poisson regression can
be approximated by classical linear regression, e.g., for large sample sizes, Poisson regression
is advantageous over classical linear regression since it usually requires fewer predictors to
achieve a good fit, as demonstrated in the study of malaria incidence by [1].
[5] introduced generalized linear models (GLM) to relax some of the classical assumptions of
the linear model. The model is given by

g(E(Y|X)) = X'B,

where g is a link function that relates the mean of the dependent variable to the linear
predictor. GLMs were developed for regression models with non-normal dependent variables;
special cases include Poisson regression, where Y is a count variable, and logistic regression,
where Y is a binary outcome.
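As an illustration of the Poisson special case, the following sketch fits a Poisson regression with log link by iteratively reweighted least squares (Fisher scoring) on simulated data; the data, sample size, and starting values are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated count data from a Poisson model with log link.
n = 500
x = rng.uniform(-1.0, 1.0, size=n)
X = np.column_stack([np.ones(n), x])         # intercept + one predictor
beta_true = np.array([0.5, 1.0])
y = rng.poisson(np.exp(X @ beta_true))

# Iteratively reweighted least squares for the Poisson GLM.
beta = np.zeros(2)
for _ in range(50):
    mu = np.exp(X @ beta)                    # mean under the log link
    W = mu                                   # Poisson variance equals the mean
    z = X @ beta + (y - mu) / mu             # working response
    XtW = X.T * W
    beta_new = np.linalg.solve(XtW @ X, XtW @ z)
    if np.max(np.abs(beta_new - beta)) < 1e-10:
        beta = beta_new
        break
    beta = beta_new
```

The recovered coefficients should lie close to the generating values (0.5, 1.0), up to sampling error.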
[6] compared the following models for clustered data: (1) ordinary Poisson regression, which
ignores intracluster correlation; (2) Poisson regression with fixed cluster-specific intercepts;
(3) a generalized estimating equations approach with an equi-correlation matrix; (4) an exact
generalized estimating equations approach with an exact covariance matrix; and (5) maximum
likelihood. All five methods lead to consistent estimates of the slopes but yield varying
efficiency levels, especially for unbalanced data. [6] also noted that the fixed cluster-specific
intercepts may be eliminated and can be viewed as limiting maximum likelihood estimates
when the variance of the intercepts approaches infinity.
Summarizing the predictors into a few principal components (PCs) can entail substantial loss
of information. Typically, only a subset of the PCs is included in regression modeling, though
there is no universally accepted procedure yet for determining which PCs to retain [9].
However, [10] proposed a procedure that simultaneously chooses the components while model
fit is optimized.
[11] and [3] justified why the principal components with low eigenvalues are not included in
the model. Since the variance of the estimator of the regression coefficients is proportional to
the reciprocals of the eigenvalues, inclusion of one or more components with small
eigenvalues in the model yields a high variance of the estimates.
The bias and the lost information in principal component regression should be addressed. For
instance, nonparametric smoothing techniques, which aim to provide a strategy for modeling
the relationships between variables without specifying any particular form for the underlying
regression function, may be considered. When several covariates are present, [13] proposed to
extend the idea of linear regression into a flexible form known as the generalized additive
model (GAM). The regression model is given by

g(E(Y|X1, ..., Xp)) = a + f1(X1) + ... + fp(Xp),

where g is a link function and the fj are smooth functions, estimated, e.g., through splines for
each predictor. [14] suggested that additive models be used as an initial procedure to locate
the patterns and behavior of the predictors relative to the response, suggesting a possible
parametric form with which to model Y at a later stage.
[15] formulated a censored regression model with additive effects of the covariates. The
additive model sped up computation in the inference process and yielded more promising
results than a class of linear models, especially when the linearity assumption is violated.
3. Methodology
Multicollinearity can be a crucial issue in modeling. There are, however, various approaches
to mitigate its ill effects, e.g., transforming the individual predictors into linear combinations
chosen to be independent while retaining the maximum amount of the variance that the
original predictors contain. With such linear combinations as predictors in place of the
individual variables in a linear model, however, the predictive ability could suffer.
[16] proposed a semiparametric Poisson regression model for spatially clustered count data,
given n predetermined clusters with nk observations in each cluster, given by
(1)
where the count response is expressed through a cluster-specific random intercept and
nonparametric functions of the principal components of the predictors. The flexibility in form
of the nonparametric function of the principal components will compensate for the bias in the
estimation of the regression coefficients due to the information lost by selecting only the most
important principal components as predictors in the model. The random intercepts will
account for cluster differences (homogeneous within a cluster) and the possible peculiarities
in the model caused by clustering.
The principle of backfitting is used to estimate the parametric and nonparametric components
of the model. Assuming additivity of Model (3), the parametric and nonparametric
frameworks are embedded into the backfitting algorithm to estimate the parameters and the
nonparametric components of the model.
Two approaches are presented in this section. In Method 1, the backfitting algorithm estimates
the nonparametric part first, and the parametric part is then estimated from the residuals. In
Method 2, ordinary Poisson regression is used, but with the principal components as
predictors. After the extraction of the principal components, the parametric and nonparametric
parts of the model are estimated iteratively in the context of backfitting. The nonparametric
functions of the principal components are estimated through spline smoothing.
Smoothing splines are used to fit the nonparametric part of the model. Consider a simple
additive regression model Y = f(X) + e, where E(e) = 0 and Var(e) = s^2. We want to estimate
the function f as the solution to the penalized least squares problem

minimize over f:  SUM_i [y_i - f(x_i)]^2 + L * INT [f''(t)]^2 dt,

for a given value of the smoothing parameter L > 0. The first term measures goodness of fit,
and the second term serves as the penalty for lack of smoothness in f due to interpolation in
the first term. The smoothing parameter L controls the tradeoff between smoothness and
goodness of fit. Large values of L emphasize smoothness of f over model fit, while small
values put higher leverage on model fit rather than on smoothness of f. As L approaches zero,
the fitted function interpolates the data points. The choice of the value of the smoothing
parameter is optimized through generalized cross validation (GCV): L is chosen to minimize
the generalized cross validation mean squared error given by

GCV(L) = (1/n) SUM_i [(y_i - f_L(x_i)) / (1 - tr(S_L)/n)]^2,

where S_L is the smoother matrix associated with L.
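To make the roles of the smoothing parameter and GCV concrete, the following sketch replaces the full smoothing spline with a simpler penalized truncated-line spline basis (an assumption made here for brevity) and selects the smoothing parameter over a grid by minimizing the GCV criterion:

```python
import numpy as np

rng = np.random.default_rng(2)

# Noisy observations of a smooth function (simulated for illustration).
n = 100
x = np.sort(rng.uniform(0.0, 1.0, size=n))
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=n)

# Truncated-line spline basis with a ridge penalty on the knot terms.
knots = np.linspace(0.05, 0.95, 20)
B = np.column_stack([np.ones(n), x] + [np.clip(x - k, 0.0, None) for k in knots])
D = np.diag([0.0, 0.0] + [1.0] * len(knots))   # do not penalize the linear part

def gcv(lam):
    # Generalized cross validation score for a given smoothing parameter.
    S = B @ np.linalg.solve(B.T @ B + lam * D, B.T)     # smoother matrix
    resid = y - S @ y
    return np.mean((resid / (1.0 - np.trace(S) / n)) ** 2)

lams = 10.0 ** np.arange(-6.0, 3.0)            # grid of candidate lambdas
best = min(lams, key=gcv)
f_hat = B @ np.linalg.solve(B.T @ B + best * D, B.T @ y)
```

The GCV-selected fit tracks the underlying sine curve closely despite the noise; very small lambdas overfit and very large ones flatten the curve toward a straight line.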
The partial residuals e_i are computed and used to estimate the random intercepts (with a
priori information on the clustering of observations) through methods like maximum
likelihood or the EM algorithm; see, for example, [19].
Spline smoothing and mixed model estimation in the backfitting framework are then iterated
until convergence; see [20] for some optimal properties of the backfitting estimators.
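A minimal sketch of this backfitting cycle, assuming a Gaussian kernel smoother in place of spline smoothing and simple cluster means in place of mixed model estimation, alternates between the nonparametric component and the cluster intercepts:

```python
import numpy as np

rng = np.random.default_rng(3)

# Clustered data: cluster-specific intercepts plus a smooth covariate
# effect (all values simulated, for illustration only).
n_clusters, per_cluster = 5, 40
cluster = np.repeat(np.arange(n_clusters), per_cluster)
delta = rng.normal(0.0, 1.0, size=n_clusters)      # random intercepts
x = rng.uniform(0.0, 1.0, size=n_clusters * per_cluster)
y = delta[cluster] + np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=x.size)

def smooth(x, r, bw=0.08):
    # Gaussian kernel smoother standing in for spline smoothing.
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / bw) ** 2)
    return (w * r[None, :]).sum(axis=1) / w.sum(axis=1)

# Backfitting: alternate between the nonparametric component and the
# cluster intercepts, each fitted on the other's partial residuals.
f = np.zeros_like(x)
d = np.zeros(n_clusters)
for _ in range(20):
    f = smooth(x, y - d[cluster])       # smooth the partial residuals
    f -= f.mean()                       # identifiability constraint
    partial = y - f
    d = np.array([partial[cluster == k].mean() for k in range(n_clusters)])
```

After a few iterations the additive fit f + d[cluster] recovers the sum of the smooth effect and the cluster intercepts; the centering step keeps the intercepts and the smooth component identifiable.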
With the Poisson link function, a generalized linear model (GLM) with the principal
components as predictors was also estimated. In a GLM, the outcome Y is assumed to be
generated from a particular distribution in the exponential family, in this case, from a Poisson
distribution. The heterogeneous mean u of the distribution depends on the independent
variables, in this case the PCs, through:

log(u) = b0 + b1 PC1 + ... + bk PCk. (4)
These two methods are compared with Ordinary Poisson Regression on the original predictors
in terms of their predictive ability.
We conduct a simulation study that covers the features of typical data to be analyzed using
Model (3), i.e., data that include high dimensional predictors, exhibit the multicollinearity
problem, or both. We then compare the predictive ability of Methods 1 and 2 with ordinary
Poisson regression using the individual variables as predictors. The simulation scenarios are
summarized in Table 1:
Table 1. Summary of Simulation Scenarios

Factor                                       Levels
Number of Variables/Expected Number of PCs   Few (5); Many (30) - Single Pattern; Many (30) - Three Patterns
Multicollinearity                            Absence; Strong
Sample Size                                  50; 100
Number of Clusters                           5; 10
Model Fit                                    Good; Poor
Starting with five Xs, the rest were simulated as an additive combination of a function of
another X and a random error term. The initial set of predictors is simulated from:
(5)
The error term is distributed as N(0,1); the multiplier of the error term controls the
multicollinearity among the predictors, i.e., a higher multiplier implies absence of
multicollinearity, while a lower multiplier induces multicollinearity. The twenty-five other
predictors were generated to fulfil various multicollinearity structures.
The response variable was computed as the linear combination of the Xs, with a cluster mean
(drawn from a normal distribution) and an error term added. The means of the normal
distributions from which the cluster means were simulated were spread out to differentiate the
clusters. A multiplier of the error term is included to alter model fit (a large multiplier implies
poor model fit; a small multiplier implies good model fit). A model usually fits the data well
when the functional form used is correct and the correct predictors are aptly accounted for in
the model. When the functional form of the model is incorrect or predictors are missed out,
the variation of the error term will dominate. Hence, to simulate misspecification, we
magnified the error by multiplying it by a constant.
Scenarios with varying sample sizes (50, 100) as well as varying numbers of clusters (5, 10)
were generated to assess the robustness of the proposed model.
The predictive ability of the proposed model is assessed by comparing the mean absolute
prediction error (MAPE) under estimation Methods 1 and 2 with the MAPE obtained using
ordinary Poisson regression (OPR) based on the original predictors.
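A minimal sketch of the MAPE comparison criterion, assuming MAPE is scaled as a percentage of the observed values (consistent with the "MAPE (%)" columns of the tables, though the paper's exact definition is not shown):

```python
import numpy as np

def mape(y, y_pred):
    # Mean absolute prediction error, expressed as a percentage of the
    # observed values (the percentage scaling is an assumption made here
    # to match the "MAPE (%)" table columns).
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs(y - y_pred) / np.abs(y))

observed = [10, 20, 40]
predicted = [12, 18, 44]
# mape(observed, predicted) -> 100 * mean(0.2, 0.1, 0.1) = 13.33...
```

Lower MAPE means better predictive ability, so each model's predictions are scored against the simulated responses with this criterion.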
4. Results and Discussion
In the simulation study, good model fit is represented by linear equations where the
coefficient of determination is at least 60%, while poor model fit is associated with a
coefficient of determination lower than 60%. The simulation shows that whether
misspecification error is present or not, the proposed semiparametric model always
outperformed the other models in terms of predictive ability. In the absence of
misspecification error, the proposed model is advantageous over ordinary Poisson regression
by about 10% in MAPE. Shifting from ordinary Poisson regression to the parametric principal
components regression, on the other hand, results in a 65% increase in MAPE. This illustrates
the effect of the information lost by summarizing the predictors into principal components
instead of using the individual predictors.
The advantages of the proposed method are also observed in the presence of misspecification
errors. MAPE of the semiparametric model improved by 9% over ordinary Poisson
regression. Furthermore, ordinary Poisson regression has a 48% advantage in MAPE relative
to the parametric principal components model. The MAPE of the three models by nature of
model fit is summarized in Table 2.
Table 2. Comparison of MAPE for Varying Model Fit

            MAPE (%)
Model Fit   Semiparametric Model   Parametric Model   OPR on Original Predictors
Good        24.26                  44.28              26.88
Poor        141.54                 230.64             156.23
Since a substantial number of cases with poor model fit was included, the subsequent results
show higher MAPE levels, as the predictive ability of the models with good fit was
contaminated. The discussion will therefore focus on the comparison of the three models
rather than on the magnitude of the MAPE.
In the absence of multicollinearity, the proposed model yields better prediction than ordinary
Poisson regression on the original predictors (OPR), with MAPE lower by 5%. The parametric
principal component regression, on the other hand, yields 48% higher MAPE than OPR. This
is explained as the effect of the information lost by selecting only the more important
principal components. With multicollinearity present, similar ranges of MAPE can be
observed for the proposed model, with an improvement of 14% over ordinary Poisson
regression. Again, the parametric model yields 52% higher MAPE relative to ordinary
Poisson regression on the original variables. Principal components analysis is a procedure
aimed at addressing multicollinearity, thus giving the proposed model an advantage over
ordinary Poisson regression with the original correlated variables as predictors. The
semiparametric nature of the proposed model makes the regression more flexible and endows
it with advantages over the parametric model. The proposed model performs best among the
three methods whether multicollinearity is present or not. The MAPE values are summarized
in Table 3.
Table 3. Comparison of MAPE for Varying Multicollinearity

                    MAPE (%)
Multicollinearity   Semiparametric Model   Parametric Model   OPR on Original Predictors
Absence             108.84                 170.28             114.79
Presence            65.61                  115.58             76.06
We simulated two sample sizes: 50 observations (small) and 100 observations (large). For the
small sample size, while ordinary Poisson regression yields the best predictive ability among
the three models, the proposed model is still within a comparable range. For the large sample
size, however, the proposed model is most advantageous, yielding 19% lower MAPE than
ordinary Poisson regression. This is consistent with the observation of [21] that in small
samples, errors can easily occur, particularly among multivariate techniques such as principal
components analysis (e.g., extraction of erroneous principal components). For small samples,
estimates of the correlation among the predictors are relatively unstable; hence, the estimates
of the component loadings may not be accurate, resulting in loss of information. This lost
information is somehow retrieved by relaxing the parametric structure of the model and
employing the more flexible nature of nonparametric regression.
As shown in Table 4, the parametric model is the most robust to sample size, with its MAPE
changing only by 3% as the sample size changes from 50 to 100. MAPE of the proposed
model increased by 23% as the sample size increased, whereas MAPE of ordinary Poisson
regression increased by 63%. For the parametric model, a sample of size 50 is already fairly
large; hence, no significant change in MAPE is observed as the sample size is increased
further to 100. While MAPE of the proposed model increases with the sample size, it still
performed better than ordinary Poisson regression and the parametric model.
Table 4. Comparison of MAPE for Varying Sample Size

              MAPE (%)
Sample Size   Semiparametric Model   Parametric Model   OPR on Original Predictors
50            74.19                  135.26             69.59
100           91.61                  139.66             113.51
A priori information on the clustering of observations can inform the modeler of possible
sources of variation of the response variable. More clusters mean more basis for attributing
the heterogeneity of the observations. Furthermore, the proposed model includes a random
intercept term that essentially explains the cluster differences. The simulation study illustrates
the influence of the number of clusters on the proposed model. Increasing the number of
clusters yields an improvement in the predictive ability of the proposed model. With fewer
clusters, MAPE of the semiparametric model improved by 7% compared to ordinary Poisson
regression. On the other hand, MAPE increased by 49% from ordinary Poisson regression to
the parametric model. A similar trend can be observed for cases with more clusters, but the
greater advantage of the semiparametric model over ordinary Poisson regression (about a 12%
decline in MAPE) is observed in these cases.
The above result is consistent with [22], who recommended adding more clusters, rather than
more individual observations, for increased efficiency in the analysis of clustered data. Table
5 provides details on the effect of the number of clusters on the predictive ability of the
models for clustered data. Table 6 illustrates the implications of the number of clusters and
the number of observations per cluster in the analysis of clustered data.
Table 5. Comparison of MAPE for Varying Cluster Count

                     MAPE (%)
Number of Clusters   Semiparametric Model   Parametric Model   OPR on Original Predictors
5                    87.49                  140.57             94.19
10                   78.31                  134.36             88.91
Table 6. Comparison of MAPE for Varying Cluster Count and Cluster Size

                                    MAPE (%)
Number of Clusters (Cluster Size)   Semiparametric Model   Parametric Model   OPR on Original Predictors
5 (10)                              79.40                  131.69             64.20
5 (20)                              95.58                  149.45             124.18
10 (5)                              68.98                  138.84             74.98
10 (10)                             87.64                  129.87             102.84
While multicollinearity may appear even with few predictors, the chances of observing it in
high dimensional data are higher. Scenarios with 5 (few) and 30 (many) variables were
simulated in this study. For high dimensional data, principal components analysis is known as
an effective data reduction technique. Whether there are few or many variables, the
semiparametric model still exhibits better predictive ability, and its MAPE values are even
lower in cases with 30 variables. With fewer variables, the proposed model yields 17% lower
MAPE compared to ordinary Poisson regression, whereas the parametric model yields 8%
higher MAPE than ordinary Poisson regression. Table 7 indicates similar results for scenarios
involving many variables.
Table 7. Comparison of MAPE for Varying Number of Variables

                      MAPE (%)
Number of Variables   Semiparametric Model   Parametric Model   OPR on Original Predictors
Few (5)               81.53                  106.84             98.36
Many (30)             78.19                  146.52             83.37
For scenarios with thirty variables, the presence of multicollinearity is further examined by
comparing the effect of a single pattern of interdependence among the predictors with cases
where there are three patterns of such interdependencies. The purpose of this heterogeneity in
the interdependencies is to vary the number of principal components that need to be included
in the model. The simulation study illustrates that the predictive ability of the semiparametric
model improves as the interdependencies become more complicated, i.e., with more patterns
of interdependence. Compared to ordinary Poisson regression and the parametric model, the
proposed model remains advantageous in terms of predictive ability.
Table 8. Comparison of MAPE for Varying Patterns of Interdependence

                 MAPE (%)
Pattern          Semiparametric Model   Parametric Model   OPR on Original Predictors
Single Pattern   95.05                  180.60             94.31
Three Patterns   61.33                  112.43             72.43
5. Conclusions
The proposed model and the corresponding estimation procedure are capable of mitigating the
problem of multicollinearity by regressing on the principal components instead of the original
predictors. Furthermore, the nonparametric specification of the effect of the principal
components abates the potential reduction in predictive ability that is usually observed in
principal components regression due to the loss of information from dimension reduction.
REFERENCES