
GLMs include multiple regression but generalize in several ways:

1) the conditional distribution of the response (dependent variable) is from the exponential family, which includes the Poisson, binomial, gamma, normal and numerous other distributions.

2) the mean response is related to the predictors (independent variables) through a link function. Each family of distributions has an associated canonical link function - for example, in the case of the Poisson, the canonical link is the log. The canonical links are almost always the default, but in most software you generally have several link choices within each distribution. For the binomial the canonical link is the logit (the linear predictor is modelling $\log\left(\frac{p}{1-p}\right)$, the log-odds of a success, or a "1") and for the gamma the canonical link is the inverse - but in both cases other link functions are often used.

So if your response was $Y$ and your predictors were $X_1$ and $X_2$, with a Poisson regression with the log link you might have, for your description of how the mean of $Y$ is related to the $X$'s:

$E(Y_i) = \mu_i$

$\log \mu_i = \eta_i$

($\eta$ is called the 'linear predictor', and here the link function is $\log$; the symbol $g$ is often used to represent the link function)

$\eta_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i}$
3) the variance of the response is not constant, but operates through a variance function (a function of the mean, possibly times a scaling parameter). For example, the variance of a Poisson is equal to the mean, while for a gamma it's proportional to the square of the mean. (The quasi-distributions allow some degree of decoupling of the variance function from the assumed distribution.)

--

So what assumptions are in common with what you remember from MLR?

- Independence is still there.
- Homoskedasticity is no longer assumed; the variance is explicitly a function of the mean and so in general varies with the predictors (so while the model is generally heteroskedastic, the heteroskedasticity takes a specific form).
- Linearity: the model is still linear in the parameters (i.e. the linear predictor is $X\beta$), but the expected response is not linearly related to them (unless you use the identity link function!).
- The distribution of the response is substantially more general.

The interpretation of the output is in many ways quite similar; you can still look at estimated coefficients divided by their standard errors, for example, and interpret them similarly (they're asymptotically normal - a Wald z-test - but people still seem to call them t-ratios, even when there's no theory that makes them $t$-distributed in general).
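For example, those Wald z-ratios can be pulled straight out of a fitted model (again a statsmodels sketch with simulated data, as one possible software choice):

```python
import numpy as np
import statsmodels.api as sm

# simulated data and fit, purely for illustration
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = rng.poisson(np.exp(0.3 + 0.6 * x))
fit = sm.GLM(y, sm.add_constant(x), family=sm.families.Poisson()).fit()

print(fit.params / fit.bse)   # Wald z-ratios: coefficient / standard error
print(fit.pvalues)            # p-values from the asymptotic normal, not a t distribution
```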

Comparisons between nested models (via 'anova-table'-like setups) are a bit different, but similar (involving asymptotic chi-square tests). If you're comfortable with AIC and BIC, these can be calculated as well.
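A sketch of such a nested comparison, using the drop in deviance between two Poisson GLMs (statsmodels again; the simulated data are only for illustration):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

# simulated data, purely for illustration; x2 has no real effect here
rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = rng.poisson(np.exp(0.5 + 0.8 * x1))

full = sm.GLM(y, sm.add_constant(np.column_stack([x1, x2])),
              family=sm.families.Poisson()).fit()
reduced = sm.GLM(y, sm.add_constant(x1), family=sm.families.Poisson()).fit()

# the drop in deviance is asymptotically chi-square on the extra degrees of freedom
lr_stat = reduced.deviance - full.deviance
df_diff = full.df_model - reduced.df_model
print(lr_stat, df_diff, stats.chi2.sf(lr_stat, df_diff))
print(full.aic, reduced.aic)          # AIC is available as well
```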

Similar kinds of diagnostic displays are generally used, but can be harder to interpret.
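One common display, for instance, is deviance residuals against fitted values - a minimal sketch, again assuming statsmodels and simulated data:

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# simulated data and fit, purely for illustration
rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = rng.poisson(np.exp(0.5 + 0.8 * x))
fit = sm.GLM(y, sm.add_constant(x), family=sm.families.Poisson()).fit()

# deviance residuals against fitted means: a GLM analogue of the usual residual plot
plt.scatter(fit.fittedvalues, fit.resid_deviance, s=10)
plt.axhline(0, color="grey")
plt.xlabel("fitted mean")
plt.ylabel("deviance residual")
plt.show()
```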

Much of your multiple linear regression intuition will carry over if you keep the
differences in mind.

Here's an example of something you can do with a GLM that you can't really do with linear regression (indeed, most people would use nonlinear regression for this, but GLM is easier and nicer for it) in the normal case - $Y$ is normal, modelled as a function of $x$:

$E(Y) = \exp(\eta) = \exp(X\beta) = \exp(\beta_0 + \beta_1 x)$

(that is, a log link)

$\operatorname{Var}(Y) = \sigma^2$

That is, a least-squares fit of an exponential relationship between $Y$ and $x$.
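A minimal sketch of that model - Gaussian family, log link - in statsmodels (one possible software choice; the data are simulated for illustration):

```python
import numpy as np
import statsmodels.api as sm

# simulated data: normal errors around an exponential mean, for illustration
rng = np.random.default_rng(4)
x = np.linspace(0.0, 2.0, 100)
y = np.exp(1.0 + 0.7 * x) + rng.normal(scale=0.5, size=x.size)

# Gaussian family with a (non-canonical) log link: least squares on an exponential mean
family = sm.families.Gaussian(link=sm.families.links.Log())
fit = sm.GLM(y, sm.add_constant(x), family=family).fit()
print(fit.params)   # estimates of beta_0 and beta_1, on the log scale
```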

Can I transform the variables the same way (I've already discovered transforming the
dependent variable is a bad call since it needs to be a natural number)?

You (usually) don't want to transform the response (DV). You may sometimes want to transform predictors (IVs) in order to achieve linearity of the linear predictor.

I already determined that the negative binomial distribution would help with the overdispersion in my data (variance is around 2000, the mean is 48).

Yes, it can deal with overdispersion. But take care not to confuse the conditional dispersion with the unconditional dispersion - the variance-to-mean ratio you quote is unconditional, and part of that variation will be explained by the predictors, so the dispersion that matters is the one conditional on the fitted mean.

Another common approach - if a bit more kludgy and so somewhat less satisfying to my
mind - is quasi-Poisson regression (overdispersed Poisson regression).
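One way to sketch the quasi-Poisson idea in statsmodels is to fit a Poisson GLM but estimate the dispersion from the Pearson chi-square (this is just one way to mimic quasi-Poisson there; the simulated data are for illustration):

```python
import numpy as np
import statsmodels.api as sm

# simulated overdispersed counts, purely for illustration
rng = np.random.default_rng(5)
x = rng.normal(size=300)
mu = np.exp(1.0 + 0.5 * x)
y = rng.negative_binomial(n=2, p=2 / (2 + mu))   # counts with mean mu, variance > mu

print(y.var() / y.mean())   # the unconditional variance-to-mean ratio, for contrast

# scale='X2' estimates the dispersion from the Pearson chi-square,
# inflating the standard errors quasi-Poisson style
qpois = sm.GLM(y, sm.add_constant(x), family=sm.families.Poisson()).fit(scale='X2')
print(qpois.scale)          # estimated (conditional) dispersion; 1 would be plain Poisson
```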

With the negative binomial, it's in the exponential family if you specify a particular one of its parameters (the way it's usually reparameterized for GLMs, at least). Some packages will fit it if you specify that parameter; others will wrap ML estimation of the parameter (say, via profile likelihood) around a GLM routine, automating the process. Some will restrict you to a smaller set of distributions; you don't say what software you might use, so it's difficult to say much more there.

I think the log link is what's usually used with negative binomial regression.
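In statsmodels, for instance, both routes mentioned above exist - a GLM with the dispersion parameter fixed by the user, and a routine that estimates it by maximum likelihood (a minimal sketch; the simulated data are only for illustration):

```python
import numpy as np
import statsmodels.api as sm

# simulated overdispersed counts, purely for illustration
rng = np.random.default_rng(6)
x = rng.normal(size=500)
mu = np.exp(1.0 + 0.5 * x)
y = rng.negative_binomial(n=2, p=2 / (2 + mu))   # true alpha = 1/2

X = sm.add_constant(x)

# route 1: GLM with the NB dispersion parameter fixed by the user (log link by default)
nb_glm = sm.GLM(y, X, family=sm.families.NegativeBinomial(alpha=0.5)).fit()

# route 2: maximum likelihood estimation of alpha along with the coefficients
nb_ml = sm.NegativeBinomial(y, X).fit()
print(nb_glm.params)
print(nb_ml.params)   # the last entry is the estimated alpha
```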

There are a number of introductory-level documents (readily found via Google) that walk through some basic Poisson GLM and then negative binomial GLM analyses of data, but you may prefer to look at a book on GLMs, and maybe do a little Poisson regression first just to get used to it.
