Regression analysis
Otavio R de Medeiros
Introduction

A regression is a model of the relationship between one variable on one side and one or more variables on the other side. Regression analysis constructs and tests a mathematical model of the relationship between one dependent (endogenous) variable and one or more independent (exogenous) variables. The direction of causality between the variables is determined from a priori information (e.g. theory) and embodied in the model by way of hypothesis. Regression analysis then tests the statistical strength of the model as hypothesized. E.g. suppose the level of the FTSE 100 is linearly dependent on the S&P 500; we can test this hypothesis using simple linear regression (Fig. 6.1). Regressions can be simple (2 variables) or multiple (more than 2 variables). There are 3 types of regression data: time series, cross section, and panel data.
[Fig. 6.1: scatter of Y against X (X = independent variable)]
The simple linear regression model is

Y = α + βX + e

where Y = dependent variable; X = independent variable; α = constant or intercept; β = slope or regression coefficient; e = random error or disturbance term. The error term exists because there are other unobserved and unknown effects not included in the regression. We cannot infer causality: regression analysis cannot prove a hypothesis, it can only support or fail to support the hypothesis formulated.
Regression analysis

Ordinary least-squares (OLS) regression: to test the relationship between Y and X it is necessary to derive the values of α, β and e using a method that yields the Best Linear Unbiased Estimator (BLUE):
Best: most efficient, i.e. smallest variance
Linear
Unbiased: E(α̂) = α; E(β̂) = β
Ordinary least squares (OLS) minimizes the sum of squared errors, i.e. it minimizes Σe². If the data comply with the assumptions to be seen later, OLS gives the BLUE, i.e. the straight line that best fits the data, calculated as the line that minimizes the sum of squared differences between Y and Ŷ (Fig. 6.2).
Regression analysis

Statistical assumptions of OLS regression:
1. The mathematical form of the relationship between the true dependent variable Y and the independent variable X is

Y = α + βX + e

Estimated model:

Ŷ = α̂ + β̂X

2. The error term e is normally distributed with zero mean and constant variance σ², i.e. e ~ N(0, σ²)
3. Successive error terms are independent of each other, i.e. cov(eᵢ, eⱼ) = 0 for i ≠ j
4. X is non-stochastic (exogenous), i.e. fixed in repeated samples and uncorrelated with the error term
Regression analysis

Normality is also known as Gaussianity, i.e. having a Gaussian distribution. If e has constant variance σ², this is called homoscedasticity. If the variance of e is not constant, this is called heteroscedasticity, the opposite of homoscedasticity. If cov(eᵢ, eⱼ) = 0, the residuals e are called non-autocorrelated or non-serially correlated. This assumption means that the factors that caused one observation of Y to show an error do not automatically cause the other observations of Y to show errors. When the e values are independent, the data are said to be non-autocorrelated.
Regression analysis

As Y is related to e in a linear form, Y itself is a random variable. For any value of X, Y will be ~ N(μ, σ²), and therefore the statistical distribution of Y can be fully described by its mean and variance. The expected value (mean) of Y:

Yᵢ = α + βXᵢ + eᵢ
E(Yᵢ) = E(α + βXᵢ + eᵢ) = E(α) + E(βXᵢ) + E(eᵢ)

But since E(eᵢ) = 0,

E(Yᵢ) = α + βXᵢ
Regression analysis

As the expected value of e is 0, the variance of Y, which is also the variance of e, is the mean value of e², i.e. Σ(eᵢ − 0)²/n = Σeᵢ²/n = E(eᵢ²) = σ². Thus Y ~ N(α + βX, σ²). This can be seen in Fig. 6.3.
Yᵢ = α + βXᵢ + eᵢ

If we take variances on both sides, we get

Var(Yᵢ) = Var(α + βXᵢ + eᵢ) = Var(α) + Var(βXᵢ) + Var(eᵢ) = Var(eᵢ)

(Var(α) = Var(βXᵢ) = 0 because α, β and Xᵢ are non-stochastic), and

Var(eᵢ) = Σ(eᵢ − ē)²/n = Σ(eᵢ − 0)²/n = Σeᵢ²/n = E(eᵢ²) = σ²

Thus Yᵢ ~ N(α + βXᵢ, σ²)
Regression analysis

Fitting the regression line: the values of α̂ and β̂ that minimize Σe² are

β̂ = cov(X, Y)/var(X) = Σ[(Xᵢ − X̄)(Yᵢ − Ȳ)] / Σ(Xᵢ − X̄)²

α̂ = Ȳ − β̂X̄

eᵢ = Yᵢ − Ŷᵢ
The sum of squares is minimized when the partial derivatives with respect to α̂ and β̂ are set to zero, i.e.

Σ(−2(Y − α̂ − β̂X)) = 0
Σ(−2X(Y − α̂ − β̂X)) = 0

This is achieved when

ΣY = nα̂ + β̂ΣX
ΣXY = α̂ΣX + β̂ΣX²
Regression analysis

This is a simultaneous-equation problem. Multiply the 1st equation by ΣX and the 2nd by n:

ΣXΣY = nα̂ΣX + β̂(ΣX)²
nΣXY = nα̂ΣX + nβ̂ΣX²

Subtracting the 1st equation from the 2nd gives

nΣXY − ΣXΣY = β̂(nΣX² − (ΣX)²)

∴ β̂ = (nΣXY − ΣXΣY) / (nΣX² − (ΣX)²)
Since ΣX = nX̄ and ΣY = nȲ,

β̂ = (nΣXY − nX̄·nȲ) / (nΣX² − (nX̄)²) = (ΣXY − nX̄Ȳ) / (ΣX² − nX̄²)
Starting from the 1st normal equation,

ΣY = nα̂ + β̂ΣX

Dividing by n:

(1/n)ΣY = α̂ + β̂(1/n)ΣX ∴ Ȳ = α̂ + β̂X̄

Solving for α̂:

α̂ = Ȳ − β̂X̄
Regression analysis

Example: table 6.1 (page 189) and page 194. Applying β̂ = cov(X, Y)/var(X) and α̂ = Ȳ − β̂X̄ gives

Ŷ = 196.3298 + 5.964X

i.e. an intercept of 196.33.
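The fitted line can be reproduced numerically. A minimal sketch in NumPy, using made-up data in place of Table 6.1 (which is not reproduced here): the slope is cov(X, Y)/var(X) and the intercept is Ȳ − β̂X̄.

```python
import numpy as np

# Made-up data standing in for Table 6.1 (true alpha = 196.33, beta = 5.964)
rng = np.random.default_rng(0)
X = rng.uniform(300.0, 500.0, 60)
Y = 196.33 + 5.964 * X + rng.normal(0.0, 100.0, 60)

# OLS estimates: beta_hat = cov(X, Y)/var(X); alpha_hat = Ybar - beta_hat*Xbar
Sxy = np.sum((X - X.mean()) * (Y - Y.mean()))
Sxx = np.sum((X - X.mean()) ** 2)
beta_hat = Sxy / Sxx
alpha_hat = Y.mean() - beta_hat * X.mean()
```

As a cross-check, `np.polyfit(X, Y, 1)` returns the same slope/intercept pair.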
Regression analysis

Significance tests of coefficients: as shown in Fig. 6.3, calculating the regression coefficients gives single estimates of Y. The estimated regression coefficients are also assumed to come from a normal distribution. We need to know the statistical significance of these coefficients, which we assess by testing whether the regression coefficients are significantly different from zero. The statistical significance of a coefficient is measured by the degree of dispersion around its estimated value. As the errors or residuals are assumed to be ~ N(0, σ²), the standard deviation of the errors is used to measure that dispersion. These standard deviations are called the standard errors of the coefficients.
Regression analysis

Significance tests of coefficients: we use t-statistics to indicate the degree of significance of the coefficients. To derive these measures we need to know the sampling distribution of the coefficients and estimates of their variances and standard deviations. We can then perform tests of hypotheses concerning the coefficients or construct confidence intervals for them.
The sampling distribution of α̂ is

α̂ ~ N(α, σ²ΣXᵢ² / (nΣ(Xᵢ − X̄)²))

The sampling distribution of β̂ is

β̂ ~ N(β, σ² / Σ(Xᵢ − X̄)²)
Since σ² is unknown, it is estimated by the regression variance

s² = Σeᵢ²/(n − 2) = Σ(Yᵢ − Ŷᵢ)²/(n − 2)

SE of α̂:

SE(α̂) = s √(ΣXᵢ² / (nΣ(Xᵢ − X̄)²))

SE of β̂:

SE(β̂) = s / √(Σ(Xᵢ − X̄)²) = √( Σ(Yᵢ − Ŷᵢ)² / ((n − 2)Σ(Xᵢ − X̄)²) )
Regression analysis

For data with a normal distribution, the difference between a variable and its mean, divided by the estimate of its standard deviation, has a t-distribution. The probability statements are:

P(−t_{n−2, c/2} ≤ (α̂ − α)/SE(α̂) ≤ t_{n−2, c/2}) = 1 − c
P(−t_{n−2, c/2} ≤ (β̂ − β)/SE(β̂) ≤ t_{n−2, c/2}) = 1 − c
Regression analysis
Thus we have probability 1 − c that the true value of the coefficients falls within the range specified. If that range includes zero, the coefficient is not statistically significantly different from zero.
The variance of the random variable uₜ is given by Var(uₜ) = E[(uₜ − E(uₜ))²], which reduces to Var(uₜ) = E(uₜ²). We could estimate this using the average of uₜ²:

s² = (1/T) Σuₜ²

Unfortunately this is not workable, since uₜ is not observable. We can use the sample counterpart to uₜ, which is the residual ûₜ:

s² = Σûₜ² / (T − 2)

where T − 2 is the number of degrees of freedom (two parameters are estimated).
Example

Example: the following model with k = 3 is estimated over 15 observations:

y = β₁ + β₂x₂ + β₃x₃ + u

and the following quantities have been calculated from the original X's:

(X'X)⁻¹ = [ 2.0  3.5  −1.0
            3.5  1.0   6.5
           −1.0  6.5   4.3 ],  X'y = [−3.0, 2.2, 0.6]',  u'u = 10.96

Calculate the coefficient estimates and their standard errors. To calculate the coefficients, just multiply the matrix by the vector to obtain β̂ = (X'X)⁻¹X'y. To calculate the standard errors, we need an estimate of σ²:

s² = RSS/(T − k) = 10.96/(15 − 3) = 0.91
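The arithmetic of this example can be replayed in a few lines; note that the matrix entries and signs above are reconstructed from a garbled original, so treat the numbers as an assumption.

```python
import numpy as np

# Worked example with k = 3, T = 15: beta_hat = (X'X)^-1 X'y,
# s^2 = u'u / (T - k), SE(beta_i) = sqrt(s^2 * [(X'X)^-1]_ii)
XtX_inv = np.array([[ 2.0, 3.5, -1.0],
                    [ 3.5, 1.0,  6.5],
                    [-1.0, 6.5,  4.3]])
Xty = np.array([-3.0, 2.2, 0.6])
uu, T, k = 10.96, 15, 3

beta_hat = XtX_inv @ Xty                 # coefficient estimates
s2 = uu / (T - k)                        # estimate of sigma^2 (= 0.91)
se = np.sqrt(s2 * np.diag(XtX_inv))     # standard errors
```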
Recall that the formula for the test-of-significance approach to hypothesis testing using a t-test is

test statistic = (β̂ᵢ − βᵢ*) / SE(β̂ᵢ)

If the test is H₀: βᵢ = 0 against H₁: βᵢ ≠ 0, i.e. a test that the population coefficient is zero against a two-sided alternative, this is known as a t-ratio test. Since βᵢ* = 0,

test statistic = β̂ᵢ / SE(β̂ᵢ)

The ratio of the coefficient to its SE is known as the t-ratio or t-statistic.
Compare this with t_crit with 15 − 3 = 12 degrees of freedom (2.5% in each tail for a 5% test), at the 5% and 1% significance levels. Do we reject H₀: β₁ = 0? H₀: β₂ = 0? H₀: β₃ = 0?
The hypotheses for the simple regression coefficients are

H₀: α = 0, H₁: α ≠ 0
H₀: β = 0, H₁: β ≠ 0

To test these hypotheses, we need to calculate the t-statistics for the coefficients:

t_α = α̂ / SE(α̂)
t_β = β̂ / SE(β̂)
Regression analysis

It is usual to test for statistical significance at the 95% or 99% level of confidence. That means that there is a 95% or 99% probability that the values of α̂ and β̂ are not due to chance. The t-statistics follow a t-distribution with n − 2 degrees of freedom, i.e. the number of data points used in the regression minus the two estimated coefficients. The regression coefficients are significant if the t-statistic is greater than the critical value given in the t-distribution tables. From the book example (page 198), β̂ = 5.964 and SE(β̂) = 0.3476, hence t = 5.964/0.3476 = 17.1577. The test statistic for α̂ is t = 196.3298/136.991 = 1.4332. 95% confidence intervals:

for α: 196.3298 − 2 × 136.991 ≤ α ≤ 196.3298 + 2 × 136.991 ⇒ −77.65 ≤ α ≤ 470.31
for β: 5.964 − 2 × 0.3476 ≤ β ≤ 5.964 + 2 × 0.3476 ⇒ 5.27 ≤ β ≤ 6.66
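The t-ratios and 2-standard-error confidence intervals quoted in the example can be verified directly (the critical value is approximated by 2, as in the text):

```python
# Book estimates (page 198): coefficient and standard-error pairs
alpha_hat, se_alpha = 196.3298, 136.991
beta_hat, se_beta = 5.964, 0.3476

t_alpha = alpha_hat / se_alpha   # ~1.43: alpha not significant at 5%
t_beta = beta_hat / se_beta      # ~17.16: beta highly significant

# Approximate 95% confidence intervals (critical value taken as 2)
ci_alpha = (alpha_hat - 2 * se_alpha, alpha_hat + 2 * se_alpha)
ci_beta = (beta_hat - 2 * se_beta, beta_hat + 2 * se_beta)
```

Because the interval for α straddles zero, α is not statistically different from zero, while the interval for β lies well away from zero.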
Regression analysis

A one-tailed test or a two-tailed test? We have to decide whether the significance test will be one-tailed or two-tailed. This decision is made before the regression results are known. The choice is determined by the theory underlying the model of the relationship between X and Y which the regression is testing. E.g. if a theory says that the slope of the relationship between X and Y should be greater than one, our test should be

H₀: β = 1
H₁: β > 1
[Figure: decomposition of Yᵢ − Ȳ into the explained part Ŷᵢ − Ȳ and the residual Yᵢ − Ŷᵢ, shown on a plot of Y against X]
The total sum of squares (SST) is the sum of the squared differences between Yᵢ and Ȳ:

SST = Σ(Yᵢ − Ȳ)²

The sum of squares due to the regression (SSR) is the sum of the squared differences between Ŷᵢ and Ȳ:

SSR = Σ(Ŷᵢ − Ȳ)²

The sum of squares due to the error (SSE) is the sum of the squared differences between Yᵢ and Ŷᵢ:

SSE = Σ(Yᵢ − Ŷᵢ)²
Regression analysis

SST = SSR + SSE. The ratio between SSR and SST gives the proportion of the variation in Y explained by the variation in X and is referred to as R², the coefficient of determination or goodness of fit:

R² = SSR/SST = Σ(Ŷᵢ − Ȳ)² / Σ(Yᵢ − Ȳ)²
Regression analysis

R² = SSR/SST = Σ(Ŷᵢ − Ȳ)² / Σ(Yᵢ − Ȳ)²

∴ R² = 1 − SSE/SST = 1 − Σ(Yᵢ − Ŷᵢ)² / Σ(Yᵢ − Ȳ)² = 1 − Σeᵢ² / Σ(Yᵢ − Ȳ)² = 1 − e'e / ((Y − Ȳι)'(Y − Ȳι))

(ι is a vector of ones). If the regression is so good that all the points lie exactly on the regression line, then Ŷᵢ = Yᵢ. In this case we would have R² = 1, i.e. a perfect regression. If the regression is very bad, the regression line will be the mean, i.e. Ŷᵢ = Ȳ, ∴ Σ(Yᵢ − Ŷᵢ)² = Σ(Yᵢ − Ȳ)² and R² = 0. Hence R² ranges from 0 to 1, i.e. 0 ≤ R² ≤ 1.
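The identity SST = SSR + SSE and the two equivalent forms of R² can be checked on any sample; a small made-up one is enough:

```python
import numpy as np

# Small made-up sample; fit Y = alpha + beta*X by OLS
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

beta_hat = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
alpha_hat = Y.mean() - beta_hat * X.mean()
Y_fit = alpha_hat + beta_hat * X

SST = np.sum((Y - Y.mean()) ** 2)    # total sum of squares
SSR = np.sum((Y_fit - Y.mean()) ** 2)  # explained sum of squares
SSE = np.sum((Y - Y_fit) ** 2)       # residual sum of squares
r2 = SSR / SST
```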
Regression analysis

If R² is multiplied by 100 and expressed as a percentage, it represents the proportion of the variation in Y that is explained by the variation in X. R² is a random variable, and its significance can be tested using an F distribution. The test statistic is

F_{k−1, n−2} = (R²/(k − 1)) / ((1 − R²)/(n − 2))

For the simple regression the test has k − 1 = 1 degree of freedom in the numerator and n − 2 degrees of freedom in the denominator.
From the book example,

F_{1,50} = R² / ((1 − R²)/(n − 2)) = 294.4

From the F table, we see that the 5% critical value for v₁ = 1 and v₂ = 50 is 4.03. As the value of the test statistic (294.4) is greater than 4.03, we reject the null that R² = 0.
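In a simple regression the F statistic for H₀: R² = 0 is the square of the slope's t-statistic; the 294.4 above follows from t = 17.1577, assuming n − 2 = 50:

```python
# F(1, n-2) test of H0: R^2 = 0 in a simple regression
t_beta = 17.1577
n = 52                      # so that n - 2 = 50, as in the F(1, 50) example

F = t_beta ** 2             # in a simple regression, F = t^2
r2 = F / (F + (n - 2))      # inverting F = R^2 / ((1 - R^2) / (n - 2))

crit_5pct = 4.03            # 5% critical value for v1 = 1, v2 = 50 (from the text)
reject_null = F > crit_5pct
```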
Regression analysis

Using regression for prediction — the prediction interval: the results of applying the OLS model can be used for prediction. E.g. suppose that we wish to predict the level of the FTSE 100 if the S&P 500 rose to 550. The predicted value would be Ŷ = 196.33 + 5.964 × 550 ≈ 3476.
Regression analysis

Using regression for prediction (SKIP): the standard error of the estimate (= standard error of the regression) is

s = √( Σeᵢ²/(n − 2) ) = √( Σ(Yᵢ − Ŷᵢ)²/(n − 2) )

The prediction interval is

Ŷ ± t₉₉ × s √( 1 + 1/n + (X* − X̄)² / Σ(Xᵢ − X̄)² )

where 99 indicates the level of confidence and X* is the value used in the prediction, i.e. 550.
Regression analysis

The standard error of the regression is s = 114.27. The prediction interval is

Ŷ ± t₉₉ × s √( 1 + 1/n + (X* − X̄)² / Σ(Xᵢ − X̄)² )
= 3476 ± 2.5 × 114.27 × √( 1 + 0.0192 + (550 − 391.42)²/108046.7 )
= 3476 ± 319.65

Thus we can consider with 99% confidence that if the S&P 500 rises to 550, the FTSE 100 will rise to 3476 ± 320, i.e. between 3156 and 3796.
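The interval arithmetic can be replayed with the quantities quoted in the text (s, X̄, Σ(Xᵢ − X̄)²; n ≈ 52 is inferred from 1/n = 0.0192):

```python
import math

# Quantities quoted in the text; n is inferred from 1/n = 0.0192
s, n = 114.27, 52
X_bar, Sxx = 391.42, 108046.7
X_star, Y_hat, t99 = 550.0, 3476.0, 2.5   # t99 approximated by 2.5, as in the text

half_width = t99 * s * math.sqrt(1 + 1 / n + (X_star - X_bar) ** 2 / Sxx)
lower, upper = Y_hat - half_width, Y_hat + half_width
```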
Regression analysis
Spurious regressions: economic and financial time series are usually nonstationary variables (they trend over time and have unit roots). Regressions with non-stationary variables are not valid (spurious). There are tests to check for unit roots, the most popular being the ADF (Augmented Dickey-Fuller) and the PP (Phillips-Perron) tests. Unit roots are eliminated by differencing the variables
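A quick numerical illustration of the last point: a random walk has a unit root, and first-differencing it recovers the stationary shocks (a sketch, not a formal ADF or PP test):

```python
import numpy as np

# A random walk (unit-root process) is non-stationary; its first
# difference recovers the stationary shocks that generated it.
rng = np.random.default_rng(42)
shocks = rng.normal(0.0, 1.0, 500)   # stationary white noise
walk = np.cumsum(shocks)             # non-stationary random walk

diffed = np.diff(walk)               # first difference: stationary again
```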
[Figure: example of a non-stationary variable vs a stationary variable]
Regression analysis
Multiple regression: a regression model incorporating several independent variables is known as a multiple regression, i.e.

Y = α + β₁X₁ + β₂X₂ + ... + βₙXₙ + e

The true relationship is unknown and we have to estimate

Ŷ = α̂ + β̂₁X₁ + β̂₂X₂ + ... + β̂ₙXₙ

The βs are the partial derivatives of Y with respect to the Xs, i.e.

β₁ = ∂Y/∂X₁; β₂ = ∂Y/∂X₂; ...; βₙ = ∂Y/∂Xₙ
Regression analysis

Computer packages (EViews, SPSS, RATS, etc.) are used to solve multiple regressions. Example of results given by software (data in Appendix 6.2, n = 51), with t-statistics in parentheses:

Ŷ = 0.215 + 0.209X₁ + 0.934X₂ + 0.302X₃
    (0.39)  (1.02)   (6.42)   (2.54)

The assumptions for the multivariate OLS are the same as for the univariate model. However, the multivariate model has the additional assumption that the independent variables are independent of each other, i.e. cov(Xⱼ, Xₖ) = 0 for j ≠ k.
Regression analysis
Interpretation of results:
Regression analysis

Adjusted R²: in multivariate regressions, adding explanatory variables will cause R² to increase. Consequently, R² must be adjusted to take this into account:

R̄² = 1 − (1 − R²)(n − 1)/(n − k)

where n = number of observations and k = number of regressors (including the constant).
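The adjustment is a one-line helper; for fixed R², R̄² falls as k rises, which is the penalty for adding regressors:

```python
def adjusted_r2(r2, n, k):
    """R-bar^2 = 1 - (1 - R^2) * (n - 1) / (n - k)."""
    return 1 - (1 - r2) * (n - 1) / (n - k)
```

Note that with k = 1 the formula leaves R² unchanged, as it should.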
The 1% critical value of the F statistic for 2 DF in the numerator and 48 in the denominator = 5.08 As the decision rule for testing H0 that R2 = 0 is to reject H0 if F > critical value, we reject H0.
Regression analysis

Heteroscedasticity: a test regresses the residuals on a power of X,

eᵢ = α + βXᵢ^H

where X is the independent variable assumed to be the cause of the heteroscedasticity and H is the power of the relationship (2, 1/2, ...). The variance of the errors then becomes E(σᵢ²) = σ²Xᵢ^H. Thus if H = 2, so that the error standard deviation is proportional to Xᵢ, we would transform the regression model to

Yᵢ/Xᵢ = α(1/Xᵢ) + β + eᵢ/Xᵢ
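A sketch of this transformation, under the assumption H = 2 (error standard deviation proportional to X): dividing through by X and regressing Y/X on 1/X recovers α as the slope and β as the intercept.

```python
import numpy as np

# Simulated heteroscedastic data: sd of e proportional to X (H = 2)
rng = np.random.default_rng(7)
X = rng.uniform(1.0, 10.0, 5000)
Y = 3.0 + 2.0 * X + X * rng.normal(0.0, 1.0, 5000)

# Transformed model Y/X = alpha*(1/X) + beta + e/X has constant error
# variance; in the OLS fit on the transformed data the slope estimates
# alpha and the intercept estimates beta
alpha_hat, beta_hat = np.polyfit(1.0 / X, Y / X, 1)
```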
Autocorrelated errors can be modelled as autoregressive (AR) processes. First order, AR(1):

eₜ = ρeₜ₋₁ + zₜ

Higher-order processes:

AR(2): eₜ = ρ₁eₜ₋₁ + ρ₂eₜ₋₂ + zₜ
AR(4): eₜ = ρ₁eₜ₋₁ + ρ₂eₜ₋₂ + ρ₃eₜ₋₃ + ρ₄eₜ₋₄ + zₜ
Regression analysis

Test for 1st-order autocorrelation: the Durbin-Watson test

DW = Σ(eₜ − eₜ₋₁)² / Σeₜ²

To test for autocorrelation we test the following null hypothesis:
H₀: no autocorrelation, if dU ≤ d ≤ 4 − dU
H₁: positive autocorrelation, if d < dL; negative autocorrelation, if d > 4 − dL
Inconclusive: dL < d < dU or 4 − dU < d < 4 − dL

[Figure: DW decision regions along the scale 0, dL, dU, 2, 4 − dU, 4 − dL, 4]
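The statistic itself is two lines of NumPy; white-noise residuals give DW near 2, perfectly alternating residuals push it toward 4, and constant residuals give 0:

```python
import numpy as np

def durbin_watson(e):
    """DW = sum_t (e_t - e_{t-1})^2 / sum_t e_t^2."""
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(1)
dw_white = durbin_watson(rng.normal(0.0, 1.0, 2000))  # ~2: no autocorrelation
```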
Regression analysis

Autocorrelation may be caused by omitted variables or a wrong functional form. It can also arise when lagged variables are introduced. To solve the autocorrelation problem:
Consider the possibility of, and correct for, omitted variables or a wrong functional form
If this is unsuccessful, use the Cochrane-Orcutt procedure (skip):
Calculate the autocorrelation coefficient

ρ̂ = Σeₜeₜ₋₁ / Σeₜ²

Change the equation to

Yₜ − ρ̂Yₜ₋₁ = α(1 − ρ̂) + β(Xₜ − ρ̂Xₜ₋₁) + zₜ

This will remove the 1st-order autocorrelation from the data.
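One iteration of the procedure can be sketched as below, using the slide's ρ̂ formula (Σeₜeₜ₋₁/Σeₜ²); the helper name is mine:

```python
import numpy as np

def cochrane_orcutt_step(Y, X, e):
    """Quasi-difference Y and X with rho estimated from the OLS residuals e."""
    rho = np.sum(e[1:] * e[:-1]) / np.sum(e ** 2)  # rho-hat, as on the slide
    Y_star = Y[1:] - rho * Y[:-1]                   # Y_t - rho * Y_{t-1}
    X_star = X[1:] - rho * X[:-1]                   # X_t - rho * X_{t-1}
    return rho, Y_star, X_star
```

OLS is then re-run on (Y*, X*), and the step can be repeated until ρ̂ converges.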
[Figure: dummy variables in Y = α + β₁X + e — a shift dummy changes the intercept; a slope dummy changes the slope]
Data transformations

Non-linear relationships can be transformed into linear ones. E.g. the power function

Y = αX^β

[Figure: curves of Y = αX^β for β > 1 and β < 1]

is linearized by taking logs:

ln Y = ln α + β ln X

Another example is the reciprocal transformation:

Y = α + βZ, with Z = 1/X
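The log transformation can be verified on noiseless made-up power-law data (α = 2, β = 1.5): OLS on the logs recovers both parameters exactly.

```python
import numpy as np

# Y = alpha * X^beta  =>  ln Y = ln alpha + beta * ln X
alpha_true, beta_true = 2.0, 1.5
X = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
Y = alpha_true * X ** beta_true

# OLS on the logged data: slope estimates beta, intercept estimates ln(alpha)
beta_hat, ln_alpha_hat = np.polyfit(np.log(X), np.log(Y), 1)
alpha_hat = np.exp(ln_alpha_hat)
```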