Linear regression
From Wikipedia, the free encyclopedia.

In statistics, linear regression is a method of estimating the conditional expected value of one variable y given the values
of some other variable or variables x. The variable of interest, y, is conventionally called the "dependent variable". The
terms "endogenous variable" and "output variable" are also used. The other variables x are called the "independent
variables". The terms "exogenous variables" and "input variables" are also used. The dependent and independent variables
may be scalars or vectors. If the independent variable is a vector, one speaks of multiple linear regression.

The term independent variable suggests that its value can be chosen at will, and the dependent variable is an effect, i.e.,
causally dependent on the independent variable, as in a stimulus-response model. Although many linear regression models
are formulated as models of cause and effect, the direction of causation may just as well go the other way, or indeed there
need not be any causal relation at all.

Regression, in general, is the problem of estimating a conditional expected value. Linear regression is called "linear"
because the relation of the dependent to the independent variables is assumed to be a linear function of some parameters.
Regression models which are not a linear function of the parameters are called nonlinear regression models. A neural
network is an example of a nonlinear regression model.

Still more generally, regression may be viewed as a special case of density estimation. The joint distribution of the
dependent and independent variables can be constructed from the conditional distribution of the dependent variable and
the marginal distribution of the independent variables. In some problems, it is convenient to work in the other direction:
from the joint distribution, the conditional distribution of the dependent variable can be derived.

Contents
1 Historical remarks
2 Justification for regression
3 Statement of the linear regression model
4 Parameter estimation
4.1 Robust regression
4.2 Summarizing the data
4.3 Estimating beta
4.4 Estimating alpha
4.5 Displaying the residuals
4.6 Ancillary statistics
5 Multiple linear regression
6 Scientific applications of regression
7 See also
8 References
8.1 Historical
8.2 Modern theory
8.3 Modern practice
9 External links

Historical remarks
The earliest form of linear regression was the method of least squares, which was published by Legendre in 1805, and by
Gauss in 1809. The term "least squares" is from Legendre's term, moindres carrés. However, Gauss claimed that he had known the method since 1795.

Legendre and Gauss both applied the method to the problem of determining, from astronomical observations, the orbits of
bodies about the sun. Euler had worked on the same problem (1748) without success. Gauss published a further
development of the theory of least squares in 1821, including a version of the Gauss-Markov theorem.

The term "reversion" was used in the nineteenth century to describe a biological phenomenon, namely that the progeny of
exceptional individuals tend on average to be less exceptional than their parents, and more like their more distant
ancestors. Francis Galton studied this phenomenon, and applied the slightly misleading term "regression towards
mediocrity" to it (parents of exceptional individuals also tend on average to be less exceptional than their children). For
Galton, regression had only this biological meaning, but his work (1877, 1885) was extended by Karl Pearson and Udny
Yule to a more general statistical context (1897, 1903). In the work of Pearson and Yule, the joint distribution of the
dependent and independent variables is assumed to be Gaussian. This assumption was weakened by R.A. Fisher in his
works of 1922 and 1925. Fisher assumed that the conditional distribution of the dependent variable is Gaussian, but the
joint distribution need not be. In this respect, Fisher's assumption is closer to Gauss's formulation of 1821.

Justification for regression


The theoretical problem is: given two random variables X and Y, what is the best estimator of Y, in other words the estimator that minimizes the mean squared error (MSE)?

1. If we estimate Y by a constant, it can be shown that the constant Ŷ = E(Y) (the population mean) is the best choice, with MSE var(Y) = E[(Y − E(Y))²].

2. If we estimate Y with a linear (technically affine) predictor of the form Ŷ = aX + b, it can be shown that choosing

a = cov(X, Y) / var(X)  and  b = E(Y) − a E(X)

minimizes the MSE E[(Y − aX − b)²].

3. Finally, what is the best general function f(X) that estimates Y? It is Ŷ = f(X) = E[Y | X].

Note: E[Y | X] is a function of X. That it is the best estimator can be proven with the inequality

E[(Y − f(X))²] = E[(Y − E[Y | X])²] + E[(E[Y | X] − f(X))²] ≥ E[(Y − E[Y | X])²],

which holds for any function f.

Thus, regression estimates the conditional mean of Y given X because that is the estimator which minimizes the MSE. Other ways of testing for a relationship, such as correlation, do not use the conditional mean.
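
The three cases above can be checked numerically. The following sketch (assuming Python with NumPy, and a simulated data set in which E[Y | X] = X² is known by construction) compares the mean squared error of the constant predictor, the best affine predictor, and the conditional mean:

    # Compare the MSE of a constant, the best affine predictor, and E[Y | X]
    # on simulated data where the conditional mean is known to be X^2.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    x = rng.uniform(0.0, 2.0, n)
    y = x**2 + rng.normal(0.0, 0.1, n)          # E[Y | X] = X^2 by construction

    def mse(prediction):
        return np.mean((y - prediction) ** 2)

    # 1. constant predictor E(Y)
    print("constant E(Y):     ", mse(y.mean()))

    # 2. best affine predictor a*X + b with a = cov(X, Y)/var(X), b = E(Y) - a*E(X)
    a = np.cov(x, y, bias=True)[0, 1] / x.var()
    b = y.mean() - a * x.mean()
    print("best affine aX + b:", mse(a * x + b))

    # 3. conditional mean E[Y | X]
    print("conditional mean:  ", mse(x**2))

The printed MSEs decrease from case 1 to case 3, matching the ordering described above.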

Statement of the linear regression model


A linear regression model is typically stated in the form

y = α + βx + ε.

The right-hand side may take other forms, but generally comprises a linear combination of the parameters, here denoted α and β. The term ε represents the unpredicted or unexplained variation in the dependent variable; it is conventionally called the "error", whether it is really a measurement error or not. The error term is conventionally assumed to have expected value equal to zero, as a nonzero expected value could be absorbed into α. See also errors and residuals in statistics; the difference between an error and a residual is also dealt with below. It is also assumed that ε is independent of x.

An equivalent formulation that explicitly shows the linear regression as a model of conditional expectation is

E(y | x) = α + βx,

with the conditional distribution of y given x essentially the same as the distribution of the error term.

A linear regression model need not be affine, let alone linear, in the independent variables x. For example,

y = α + βx + γx² + ε

is a linear regression model, for the right-hand side is a linear combination of the parameters α, β, and γ. In this case it is useful to think of x² as a new independent variable, formed by modifying the original variable x. Indeed, any linear combination of functions f(x), g(x), h(x), ..., is a linear regression model, so long as these functions do not have any free parameters (otherwise the model is generally a nonlinear regression model). The least-squares estimates of α, β, and γ are linear in the response variable y, and nonlinear in x (they are nonlinear in x even if the γ and α terms are absent; if only β were present, then doubling all observed x values would multiply the least-squares estimate of β by 1/2).
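
As a concrete illustration, the quadratic model above can be fitted with ordinary linear least squares by treating x² as an additional column of the design matrix. This is only a sketch, assuming NumPy; the data are simulated:

    # Fit y = alpha + beta*x + gamma*x^2 + eps by ordinary linear least squares.
    # The model is linear in the parameters even though it is nonlinear in x.
    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(0.0, 10.0, 200)
    y = 2.0 + 0.5 * x - 0.3 * x**2 + rng.normal(0.0, 1.0, 200)   # simulated data

    # Design matrix with columns 1, x and x^2 ("x^2 as a new independent variable")
    X = np.column_stack([np.ones_like(x), x, x**2])
    alpha_hat, beta_hat, gamma_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    print(alpha_hat, beta_hat, gamma_hat)    # close to 2.0, 0.5 and -0.3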

Parameter estimation
Often in linear regression problems statisticians rely on the Gauss-Markov assumptions:

The random errors εi have expected value 0.


The random errors εi are uncorrelated (this is weaker than an assumption of probabilistic independence).
The random errors εi are "homoscedastic", i.e., they all have the same variance.

(See also Gauss-Markov theorem. That result says that under the assumptions above, least-squares estimators are in a
certain sense optimal.)

Sometimes stronger assumptions are relied on:

The random errors εi have expected value 0.


They are independent.
They are normally distributed.
They all have the same variance.

If xi is a vector we can take the product βxi to be a scalar product (see "dot product").

A statistician will usually estimate the unobservable values of the parameters α and β by the method of least squares, which consists of finding the values of a and b that minimize the sum of squares of the residuals,

(y1 − a − bx1)² + (y2 − a − bx2)² + ··· + (yn − a − bxn)².

Those values of a and b are the "least-squares estimates." The residuals may be regarded as estimates of the errors; see also errors and residuals in statistics.
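
A brief numerical sketch (assuming NumPy and SciPy; the data and starting values are made up for illustration) shows that direct minimization of this sum of squares and the usual closed-form estimates give the same a and b:

    # Least squares: find a and b minimizing sum_i (y_i - a - b*x_i)^2.
    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(2)
    x = rng.uniform(0.0, 10.0, 50)
    y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, 50)

    def sum_of_squares(params):
        a, b = params
        return np.sum((y - a - b * x) ** 2)

    # direct numerical minimization of the sum of squared residuals
    a_num, b_num = minimize(sum_of_squares, x0=[0.0, 0.0]).x

    # closed-form estimates: b = cov(x, y)/var(x), a = mean(y) - b*mean(x)
    b_cf = np.cov(x, y, bias=True)[0, 1] / x.var()
    a_cf = y.mean() - b_cf * x.mean()
    print(a_num, b_num)    # agrees (to optimizer tolerance) with
    print(a_cf, b_cf)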

Notice that, whereas the errors are independent, the residuals cannot be independent, because the use of least-squares estimates implies that the sum of the residuals must be 0, and the scalar product of the vector of residuals with the vector of x-values must be 0; i.e., writing ei = yi − a − bxi for the ith residual, we must have

e1 + e2 + ··· + en = 0

and

x1e1 + x2e2 + ··· + xnen = 0.

These two linear constraints imply that the vector of residuals must lie within a certain (n − 2)-dimensional subspace of Rⁿ; hence we say that there are "n − 2 degrees of freedom for error". If one assumes the errors are normally distributed and independent, then it can be shown that

1. the sum of squares of residuals

e1² + e2² + ··· + en²

is distributed as

σ²χ²(n − 2),

i.e., the sum of squares divided by the error variance σ² has a chi-square distribution with n − 2 degrees of freedom, and

2. the sum of squares of residuals is probabilistically independent of the estimates a, b of the parameters α and β.

These facts make it possible to use Student's t-distribution with n − 2 degrees of freedom (so named in honor of the
pseudonymous "Student") to find confidence intervals for α and β.
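
The following sketch (assuming NumPy and SciPy, with simulated data) checks the two residual constraints and uses the n − 2 degrees of freedom to form a 95% confidence interval for β; the standard-error formula used here is the usual textbook one rather than anything stated explicitly above:

    # Residual constraints and a t-based 95% confidence interval for beta.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    n = 40
    x = rng.uniform(0.0, 10.0, n)
    y = 1.0 + 2.0 * x + rng.normal(0.0, 1.5, n)

    b = np.cov(x, y, bias=True)[0, 1] / x.var()
    a = y.mean() - b * x.mean()
    resid = y - a - b * x

    print(resid.sum())              # ~0: the residuals sum to zero
    print((x * resid).sum())        # ~0: the residuals are orthogonal to x

    # "n - 2 degrees of freedom for error"
    s2 = (resid ** 2).sum() / (n - 2)                    # estimate of sigma^2
    se_b = np.sqrt(s2 / ((x - x.mean()) ** 2).sum())     # standard error of b
    t_crit = stats.t.ppf(0.975, df=n - 2)
    print("95% CI for beta:", (b - t_crit * se_b, b + t_crit * se_b))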

Denote by capital Y the column vector whose ith entry is yi, and by capital X the n×2 matrix whose first column contains n 1s and whose second column contains the xi. Let ε be the column vector containing the errors εi. Let δ and d be, respectively, the 2×1 column vector containing α and β and the 2×1 column vector containing the estimates a and b. Then the model can be written as

Y = Xδ + ε,

where ε is normally distributed with expected value 0 (i.e., a column vector of 0s) and variance σ²In, where In is the n×n identity matrix. The vector Xd (where, remember, d is the vector of estimates) is then the orthogonal projection of Y onto the column space of X.

Then it can be shown that

d = (X′X)⁻¹X′Y

(where X′ is the transpose of X) and that the sum of squares of residuals is

Y′(In − X(X′X)⁻¹X′)Y.

The fact that the matrix X(X′X)⁻¹X′ is a symmetric idempotent matrix is relied on repeatedly, both in computations and in proofs of theorems. The linearity of d as a function of the vector Y, expressed above by writing d = (X′X)⁻¹X′Y, is the reason why this is called "linear" regression. Nonlinear regression uses nonlinear methods of estimation.

The matrix In − X(X′X)⁻¹X′ that appears above is a symmetric idempotent matrix of rank n − 2. Here is an example of the use of that fact in the theory of linear regression. The finite-dimensional spectral theorem of linear algebra says that any real symmetric matrix M can be diagonalized by an orthogonal matrix G, i.e., the matrix G′MG is a diagonal matrix. If the matrix M is also idempotent, then the diagonal entries of G′MG must be idempotent numbers, and only two real numbers are idempotent: 0 and 1. So In − X(X′X)⁻¹X′, after diagonalization, has n − 2 1s and two 0s on the diagonal. That is most of the work in showing that the sum of squares of residuals has a chi-square distribution with n − 2 degrees of freedom.
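
A short sketch of the matrix formulation (assuming NumPy, with simulated data) computes d = (X′X)⁻¹X′Y and checks that In − X(X′X)⁻¹X′ is idempotent with trace, and hence rank, equal to n − 2:

    # Matrix form of simple linear regression: Y = X*delta + epsilon.
    import numpy as np

    rng = np.random.default_rng(4)
    n = 30
    x = rng.uniform(0.0, 10.0, n)
    Y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, n)

    X = np.column_stack([np.ones(n), x])       # n x 2 design matrix
    d = np.linalg.solve(X.T @ X, X.T @ Y)      # d = (X'X)^{-1} X'Y
    print(d)                                   # the estimates a and b

    H = X @ np.linalg.inv(X.T @ X) @ X.T       # projection onto the column space of X
    M = np.eye(n) - H                          # I_n - X (X'X)^{-1} X'
    print(np.allclose(M @ M, M))               # idempotent: True
    print(np.trace(M))                         # trace = rank = n - 2
    print(Y @ M @ Y, ((Y - X @ d) ** 2).sum()) # both equal the sum of squared residuals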

Regression parameters can also be estimated by Bayesian methods. This has the advantages that

confidence intervals can be produced for parameter estimates without the use of asymptotic approximations,
prior information can be incorporated into the analysis.

Suppose that in the linear regression

y = α + βx + ε

we know from domain knowledge that α can only take one of the values {−1, +1}, but we do not know which. We can build this information into the analysis by choosing a prior for α which is a discrete distribution with a probability of 0.5 on −1 and 0.5 on +1. The posterior for α will also be a discrete distribution on {−1, +1}, but the probability weights will change to reflect the evidence from the data.
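
A minimal sketch of this discrete-prior calculation (assuming NumPy and SciPy, and, purely to keep the example short, treating β and σ as known) computes the posterior weights for α:

    # Posterior for alpha under a 50/50 prior on {-1, +1}, with beta and sigma
    # treated as known for simplicity.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    beta, sigma = 2.0, 1.0                            # assumed known in this sketch
    x = rng.uniform(0.0, 5.0, 20)
    y = 1.0 + beta * x + rng.normal(0.0, sigma, 20)   # true alpha is +1

    log_post = {}
    for alpha in (-1.0, +1.0):
        log_prior = np.log(0.5)
        log_lik = stats.norm.logpdf(y, loc=alpha + beta * x, scale=sigma).sum()
        log_post[alpha] = log_prior + log_lik

    # normalize the weights to obtain posterior probabilities
    m = max(log_post.values())
    total = sum(np.exp(v - m) for v in log_post.values())
    posterior = {alpha: float(np.exp(v - m) / total) for alpha, v in log_post.items()}
    print(posterior)    # nearly all of the weight ends up on alpha = +1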

Robust regression

One useful alternative to ordinary linear regression is robust regression, in which the mean absolute error is minimized instead of the mean squared error. Robust regression is computationally much more intensive than linear regression and is somewhat more difficult to implement as well.

The term "robust regression" is also often used to mean linear regression with robust (Huber-White) standard errors, i.e. with the assumption of homoscedasticity relaxed.

An alternative approach to robust regression is to replace the normal distribution in the error term with a Student's
t-distribution.
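
A minimal sketch of the mean-absolute-error variant (assuming NumPy and SciPy; the data, including one deliberately inserted outlier, are simulated) compares a least-absolute-deviations fit with ordinary least squares:

    # Least absolute deviations versus least squares on data with one gross outlier.
    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(6)
    x = rng.uniform(0.0, 10.0, 30)
    y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, 30)
    y[0] += 30.0                                   # one gross outlier

    # minimize the sum of absolute residuals (Nelder-Mead copes with the kinks)
    lad = minimize(lambda p: np.abs(y - p[0] - p[1] * x).sum(),
                   x0=[0.0, 0.0], method="Nelder-Mead").x

    b_ls = np.cov(x, y, bias=True)[0, 1] / x.var()
    a_ls = y.mean() - b_ls * x.mean()

    print("least absolute deviations:", lad)        # stays close to (1, 2)
    print("least squares:            ", a_ls, b_ls) # pulled toward the outlier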

Summarizing the data

We sum the observations, the squares of the Xs and Ys, and the products XY to obtain the following quantities:

Sx = x1 + x2 + ··· + xn, and Sy similarly;

Sxx = x1² + x2² + ··· + xn², and Syy similarly;

Sxy = x1y1 + x2y2 + ··· + xnyn.
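
A small sketch (assuming NumPy, with a made-up five-point data set) computes these sums:

    # Summary sums for a small made-up data set.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    n = len(x)
    Sx, Sy = x.sum(), y.sum()
    Sxx, Syy = (x ** 2).sum(), (y ** 2).sum()
    Sxy = (x * y).sum()
    print(n, Sx, Sy, Sxx, Syy, Sxy)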

Estimating beta

We use the summary statistics above to calculate b, the estimate of β:

b = (nSxy − SxSy) / (nSxx − Sx²).

Estimating alpha

We use the estimate of β and the other statistics to estimate α by:

a = (Sy − bSx) / n.

A consequence of this estimate is that the regression line will always pass through the "center" (Sx/n, Sy/n), the point whose coordinates are the sample means of x and y.
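
Continuing the small made-up data set from the previous sketch, the estimates and the "center" property can be checked as follows (again assuming NumPy):

    # Estimates of beta and alpha from the summary sums, for the same
    # made-up five-point data set as above.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
    n, Sx, Sy = len(x), x.sum(), y.sum()
    Sxx, Sxy = (x ** 2).sum(), (x * y).sum()

    b = (n * Sxy - Sx * Sy) / (n * Sxx - Sx ** 2)
    a = (Sy - b * Sx) / n
    print(a, b)

    # the fitted line passes through the "center" (mean of x, mean of y)
    print(np.isclose(a + b * (Sx / n), Sy / n))    # True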

Displaying the residuals

The first method of displaying the residuals uses the histogram or cumulative distribution to depict the similarity (or lack thereof) to a normal distribution. Non-normality suggests that the model may not be a good summary description of the data.

We also plot the residuals,

ei = yi − a − bxi,

against the independent variable, x. There should be no discernible trend or pattern if the model is satisfactory for these data. Some of the possible problems are:

Residuals increase (or decrease) as the independent variable increases: this indicates mistakes in the calculations. Find the mistakes and correct them.
Residuals first rise and then fall (or first fall and then rise): this indicates that the appropriate model is (at least) quadratic. Adding a quadratic term (and then possibly higher-order terms) to the model may be appropriate; see polynomial regression.
One residual is much larger than the others: this suggests that there is one unusual observation which is distorting the fit. Either
verify its value before publishing, or
eliminate it, document your decision to do so, and recalculate the statistics.

Studentized residuals can be used in outlier detection.

The vertical spread of the residuals increases as the independent variable increases (a funnel-shaped plot): this indicates that the homoscedasticity assumption is violated (i.e. there is heteroscedasticity: the variability of the response depends on the value of x).
Transform the data: for example, the logarithm or logit transformations are often useful.
Use a more general modeling approach that can account for non-constant variance, for example a general linear model or a generalized linear model.
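
A plotting sketch of the two displays described above (assuming NumPy and Matplotlib, with simulated data):

    # Residual displays: a histogram and a residuals-versus-x scatter plot.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(7)
    x = rng.uniform(0.0, 10.0, 100)
    y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, 100)

    b = np.cov(x, y, bias=True)[0, 1] / x.var()
    a = y.mean() - b * x.mean()
    resid = y - a - b * x

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9.0, 3.5))
    ax1.hist(resid, bins=15)                  # should look roughly normal
    ax1.set_title("Histogram of residuals")
    ax2.scatter(x, resid)
    ax2.axhline(0.0, color="gray")
    ax2.set_xlabel("x")
    ax2.set_ylabel("residual")
    ax2.set_title("Residuals vs. x (no pattern expected)")
    plt.tight_layout()
    plt.show()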

Ancillary statistics

The sum of squared deviations can be partitioned as in ANOVA to indicate what part of the dispersion of the dependent
variable is explained by the independent variable.

The correlation coefficient, r, can be calculated by

r = (nSxy − SxSy) / √((nSxx − Sx²)(nSyy − Sy²)).

This statistic is a measure of how well a straight line describes the data. Values near zero suggest that the model is ineffective. r² is frequently interpreted as the fraction of the variability explained by the independent variable, X.
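
A quick check of this formula (assuming NumPy, reusing the made-up five-point data set from earlier) compares it with NumPy's built-in correlation:

    # Correlation coefficient r from the summary sums, checked against NumPy.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
    n, Sx, Sy = len(x), x.sum(), y.sum()
    Sxx, Syy, Sxy = (x ** 2).sum(), (y ** 2).sum(), (x * y).sum()

    r = (n * Sxy - Sx * Sy) / np.sqrt((n * Sxx - Sx ** 2) * (n * Syy - Sy ** 2))
    print(r, np.corrcoef(x, y)[0, 1])   # the two values agree
    print(r ** 2)                       # fraction of variability "explained"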

Multiple linear regression


Linear regression can be extended to functions of two or more variables, for example

z = a + bX + cY.

Here X and Y are independent variables. The values of the parameters a, b and c are estimated by the method of least squares, which minimizes the sum of squares of the residuals,

Sr = (z1 − a − bX1 − cY1)² + ··· + (zn − a − bXn − cYn)².

Take derivatives of Sr with respect to a, b and c and set them equal to zero. This leads to a system of three linear equations, the normal equations, from which the estimates of the parameters a, b and c are found:

n·a + (ΣXi)·b + (ΣYi)·c = Σzi
(ΣXi)·a + (ΣXi²)·b + (ΣXiYi)·c = ΣXizi
(ΣYi)·a + (ΣXiYi)·b + (ΣYi²)·c = ΣYizi
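
A sketch in NumPy (with simulated data) sets up this 3×3 system and checks it against a library least-squares solve:

    # Normal equations for z = a + b*X + c*Y, solved directly and via lstsq.
    import numpy as np

    rng = np.random.default_rng(8)
    X = rng.uniform(0.0, 10.0, 50)
    Y = rng.uniform(0.0, 10.0, 50)
    z = 1.0 + 2.0 * X - 0.5 * Y + rng.normal(0.0, 1.0, 50)

    n = len(z)
    A = np.array([[n,       X.sum(),        Y.sum()],
                  [X.sum(), (X**2).sum(),   (X * Y).sum()],
                  [Y.sum(), (X * Y).sum(),  (Y**2).sum()]])
    rhs = np.array([z.sum(), (X * z).sum(), (Y * z).sum()])
    print(np.linalg.solve(A, rhs))             # a, b, c from the normal equations

    D = np.column_stack([np.ones(n), X, Y])    # the same fit via the design matrix
    print(np.linalg.lstsq(D, z, rcond=None)[0])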

Scientific applications of regression


Linear regression is widely used in biological and behavioral sciences to describe relationships between variables. It ranks
as one of the most important tools used in these disciplines. For example, early evidence relating cigarette smoking to
mortality came from studies employing regression. Researchers usually include several variables in their regression
analysis in an effort to remove factors that might produce spurious correlations. For the cigarette smoking example,
researchers might include socio-economic status in addition to smoking to ensure that any observed effect of smoking on
mortality is not due to some effect of education or income. However, it is never possible to include all possible
confounding variables in a study employing regression. For the smoking example, a hypothetical gene might increase
mortality and also cause people to smoke more. For this reason, randomized experiments are considered to be more
trustworthy than a regression analysis.

See also
regression analysis
least squares
instrumental variable, instrumental variables estimation

References
Historical

A.M. Legendre. Nouvelles méthodes pour la détermination des orbites des comètes (1805). "Sur la Méthode des
moindres quarrés" appears as an appendix.

C.F. Gauss. Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientum. (1809)

C.F. Gauss. Theoria combinationis observationum erroribus minimis obnoxiae. (1821/1823)

Charles Darwin. The Variation of Animals and Plants under Domestication. (1869) (Chapter XIII describes what
was known about reversion in Galton's time. Darwin uses the term "reversion".)

Francis Galton. "Typical laws of heredity", Nature 15 (1877), 492-495, 512-514, 532-533. (Galton uses the term
"reversion" in this paper, which discusses the size of peas.)

Francis Galton. Presidential address, Section H, Anthropology. (1885) (Galton uses the term "regression" in this
paper, which discusses the height of humans.)

Francis Galton. "Regression Towards Mediocrity in Hereditary Stature," Journal of the Anthropological Institute,
15:246-263 (1886). (Facsimile at: [1]
(http://www.mugu.com/galton/essays/1880-1889/galton-1886-jaigi-regression-stature.pdf))

G. Udny Yule. "On the Theory of Correlation", J. Royal Statist. Soc., 1897, pp. 812-854.

Karl Pearson, G. U. Yule, Norman Blanchard, and Alice Lee. "The Law of Ancestral Heredity", Biometrika (1903)

R.A. Fisher. "The goodness of fit of regression formulae, and the distribution of regression coefficients", J. Royal
Statist. Soc., 85, 597-612 (1922)

R.A. Fisher. Statistical Methods for Research Workers (1925)

Modern theory

Draper, N.R. and Smith, H. Applied Regression Analysis Wiley Series in Probability and Statistics (1998)

Modern practice

External links
Earliest Known uses of some of the Words of Mathematics (http://members.aol.com/jeff570/mathword.html) . See:
[2] (http://members.aol.com/jeff570/e.html) for "error", [3] (http://members.aol.com/jeff570/g.html) for
"Gauss-Markov theorem", [4] (http://members.aol.com/jeff570/m.html) for "method of least squares", and [5]
(http://members.aol.com/jeff570/r.html) for "regression".
Online linear regression calculator. (http://www.wessa.net/esteq.wasp)
Online regression by eye (simulation). (http://www.ruf.rice.edu/~lane/stat_sim/reg_by_eye/)
Online curve and surface fitting. (http://zunzun.com/)
Leverage Effect (http://www.vias.org/simulations/simusoft_leverage.html) Interactive simulation to show the effect
of outliers on the regression results
Linear regression as an optimisation problem (http://www.vias.org/simulations/simusoft_linregr.html)
Visual Statistics with Multimedia
(http://www.visualstatistics.net/web%20Visual%20Statistics/Visual%20Statistics%20Multimedia/regression_analysi

Retrieved from "http://en.wikipedia.org/wiki/Linear_regression"

Categories: Statistics | Estimation theory

This page was last modified 21:21, 7 September 2005.


All text is available under the terms of the GNU Free Documentation License (see Copyrights for details).
