
Advanced Research Methods

Final Report: Explaining the effect of a number of predictors on the annual league score of Chelsea FC

Submitted to: Mr. Tehseen Javed
Submitted by: Ibad Ur Rehman (5873)
Date: 20/07/2012

Research Model
The multiple regression problem selected for this assignment is the prediction/explanation of Chelsea FC's annual English Premier League score by a combination of variables: transfer expenditure, number of foreign players in the team, success rate, average attendance in the stadium, average age of the players, number of new arrivals in the team, previous year's league position, previous season's goals scored and previous season's goal difference.

That is, we will evaluate and conclude how well these variables predict and explain the dependent variable: the annual PL score of Chelsea.

Sample Size
Sample size is the most influential factor within our control in the research model, and it plays an important role in determining the statistical power of an analysis. This research problem includes 43 years of time-series data for each variable. A larger sample better represents the average value of the dependent variable.

Analysis
Following are the initial test results when all variables were entered,
Variables entered (method: Enter): Last yrs Goals, Average age, New Arrivals, Annual att. in thousand, %age foreign players, Adjusted transfer exp., Last yrs Position, Last yr Success rate, Last yrs Goal Diff.

Dependent variable: Annual adjusted Score

Initial test results - Table 1 (ANOVA)

Model        Sum of Squares    df    Mean Square    F        Sig.
Regression   24886.411          9    2765.157       19.080   .000
Residual      4492.711         31     144.926
Total        29379.122         40

The above table indicates significance of the overall model.

The P value is less than 0.1 (operating at a 90% level of confidence); therefore the model significantly predicts/explains the dependent variable, annual adjusted score, and the above predictors are contributing to the model. The ANOVA table (Table 1) shows F = 19.08, which is significant. This indicates that the combination of predictors significantly predicts/explains the annual team score, i.e. the combined effect of the independent variables is significant.
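The quantities in Table 1 fit together arithmetically; a minimal sketch (using the reported sums of squares, not raw data) shows how the mean squares and the F statistic are derived:

```python
# Reproduce the ANOVA quantities of Table 1 from the sums of squares.
ss_regression = 24886.411    # explained sum of squares, 9 predictors
ss_residual = 4492.711       # residual sum of squares
df_regression, df_residual = 9, 31

ms_regression = ss_regression / df_regression  # mean square (regression)
ms_residual = ss_residual / df_residual        # mean square (residual)
f_statistic = ms_regression / ms_residual      # F ratio

print(round(ms_regression, 3))  # 2765.157
print(round(ms_residual, 3))    # 144.926
print(round(f_statistic, 2))    # 19.08
```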

Model Summary Table 2


R       R Square    Adjusted R Square    Std. Error of the Estimate
.920    .847        .803                 12.03853

The Model Summary (Table 2) shows that the multiple correlation coefficient (R), using all the predictors, is .920; R square is 0.847 and the adjusted R square is 0.803. R, the multiple correlation coefficient, is the linear correlation between the observed and model-predicted values of the dependent variable; its large value of 0.92 indicates a strong relationship. R square gives information about the goodness of fit: it is a statistical measure of how well the regression line approximates the real data points. Here it means 84.7% of the variance in the annual score of Chelsea can be predicted from success rate, annual transfer expenditure and the other variables combined. Adjusted R square is a modification of R square that adjusts for the number of explanatory terms in a model. Unlike R square, it increases only if a newly included term improves the model. It accounts for the degrees of freedom associated with the sums of squares and is lower than the unadjusted value, because it takes into account the sample size and the number of independent variables in the regression model. Here the adjusted R square of 80.3% shows the true explanatory power of the model.
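The R square and adjusted R square figures follow directly from the sums of squares in Table 1; a small sketch (n = 41 observations and k = 9 predictors, inferred from the degrees of freedom):

```python
# Reproduce R-square and adjusted R-square from Tables 1 and 2.
ss_total = 29379.122    # total sum of squares (df = 40, hence n = 41)
ss_residual = 4492.711  # residual sum of squares
n, k = 41, 9            # observations and number of predictors

r_square = 1 - ss_residual / ss_total
adj_r_square = 1 - (1 - r_square) * (n - 1) / (n - k - 1)

print(round(r_square, 3))      # 0.847
print(round(adj_r_square, 3))  # 0.803
```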

Coefficients - Table 3

Variable                      B        Std. Error   Beta     t        Sig.
(Constant)                  37.850     76.624                 .494    .625
Adjusted transfer exp.       -.073       .080      -.123     -.912    .369
%age foreign players          .369       .189       .244     1.954    .060
Average age                  -.044      2.227      -.002     -.020    .984
Last yr Success rate          .095       .596       .050      .160    .874
Annual att. in thousand       .209       .375       .072      .557    .581
Last yrs Position           -2.280       .978      -.511    -2.331    .026
Last yrs Goal Diff.          -.046       .363      -.043     -.126    .901
New Arrivals                 1.117       .503       .386     2.220    .034
Last yrs Goals               -.123       .353      -.061     -.349    .729

(B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient; Sig. is the P value.)

Table 3 assesses the individual predictors. The results in this table show which variables are significantly contributing to the model and which are not. It also tells the relative importance of each predictor and the type of relationship with the dependent variable. The null hypothesis for each row is B (beta) = 0; we do not reject it if the P value is greater than 0.1. The t value and the Sig. opposite each independent variable indicate whether that variable contributes significantly to the equation for predicting the annual score of Chelsea FC from the whole set of predictors. Even though the model fit looks positive, the coefficients table shows that there are too many predictors in the model: several coefficients are non-significant, indicating that those variables do not contribute much. Thus percentage foreign players, last year's league position and new arrivals are the only variables significantly adding to the prediction, as their P values are less than 0.1. The standardized beta coefficients in the table show the relative importance of each predictor in the model: last year's league position, with the highest absolute beta of -0.511, contributes the most to the research model, while average age of players, with a beta of -0.002, contributes the least.

The signs of the beta values show the type of relationship each predictor possesses with the predicted (dependent) variable. Individual interpretation of the predictors will be given after the elimination of insignificant variables, because the B values are likely to be biased now due to the violation of several assumptions. The magnitude of the unstandardized B coefficients tells how much average change in the annual league score would take place per unit change in each predictor. For instance, the B of New Arrivals is 1.117, meaning that with the inclusion of one new player in the club, the average annual score would increase by 1.117 points.

Alpha (the constant) represents the value of the dependent variable when all independent variables are equal to zero; it also reflects other factors on which the annual score depends (captured in the residuals). Here alpha is insignificant according to the table; it would be interpreted as saying that, without any of the above predictors, the annual club score would be 37.85.
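Each t value in Table 3 is simply the unstandardized coefficient divided by its standard error; a sketch using three rows of the table (small discrepancies with the printed t values come from rounding of the published B and Std. Error):

```python
# t = B / Std. Error for each row of the coefficients table.
rows = {  # variable: (B, std_error), copied from Table 3
    "Last yrs Position": (-2.280, 0.978),
    "New Arrivals": (1.117, 0.503),
    "%age foreign players": (0.369, 0.189),
}
t_values = {name: b / se for name, (b, se) in rows.items()}

print(round(t_values["Last yrs Position"], 2))  # -2.33
print(round(t_values["New Arrivals"], 2))       # 2.22
```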

Multicollinearity
Multicollinearity (or collinearity) occurs when there are high inter-correlations among some set of the predictor variables. In other words, multicollinearity happens when two or more independent variables contain much of the same information, so that one starts explaining another predictor instead of contributing to the prediction of the dependent variable. To check for multicollinearity we consider the collinearity statistics in the following table.

Table - 4 (Collinearity Statistics)

Variable                     Tolerance    VIF
Adjusted transfer exp.         .273       3.661
%age foreign players           .316       3.166
Average age                    .648       1.543
Last yr Success rate           .050      19.847
Annual att. in thousands       .295       3.392
Last yrs Position              .103       9.755
Last yrs Goal Diff.            .042      23.588
New Arrivals                   .164       6.116
Last yrs Goals                 .163       6.140

The tolerance is the percentage of the variance in a given predictor that cannot be explained by the other predictors. The small tolerances here show that 70%-90% of the variance in some predictors can be explained by the others. When tolerances are close to 0, there is high multicollinearity and the standard errors of the regression coefficients are inflated. The VIF (variance inflation factor) also quantifies the severity of multicollinearity: it is an index of how much the variance of an estimated regression coefficient is increased because of collinearity. The above table shows VIF values greater than 10 for last year's success rate and last year's goal difference, with corresponding tolerances below 0.1. The collinearity diagnostics therefore confirm that there are serious problems with multicollinearity.
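Tolerance and VIF are reciprocals of each other, both derived from the R square of an auxiliary regression of one predictor on the rest; a sketch using tolerances from Table 4 (the printed VIFs differ slightly because the tolerances are rounded):

```python
# VIF_j = 1 / (1 - R_j^2) = 1 / tolerance_j, where R_j^2 comes from
# regressing predictor j on the other predictors. Tolerances from Table 4.
tolerance = {
    "Last yr Success rate": 0.050,
    "Last yrs Goal Diff.": 0.042,
    "Average age": 0.648,
}
vif = {name: 1.0 / tol for name, tol in tolerance.items()}

# Common rule of thumb: VIF above 10 (tolerance below 0.1) signals trouble.
flagged = sorted(name for name, v in vif.items() if v > 10)
print(flagged)  # the two seriously collinear predictors
```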

Resolving Multicollinearity
As indicated before, two variables, last year's success rate and last year's goal difference, show very high multicollinearity; last year's team position is also highly inflated. Therefore a few remedial actions are adopted to cure this violation. 1. In the first step, the most problematic predictors are addressed. It was not possible to take any ratio of success rate. The goal difference of the club was adjusted by shifting all values to be positive and then taking the ratio with the number of matches played in the season. The new VIF and tolerance results are as follows:

Table - 5 (Collinearity Statistics)

Variable                     Tolerance    VIF
Adjusted transfer exp.         .273       3.662
%age foreign players           .282       3.542
Average age                    .659       1.518
Annual att. in thousand        .245       4.083
Last yrs Position              .092      10.897
New Arrivals                   .167       6.001
Last yrs Goals                 .311       3.213
Last yr Success rate           .065      15.481
Goal Diff. Adjusted            .316       3.164

After adjusting the previously inflated variable (goal difference), its VIF has decreased to 3.16, but two other variables still have high VIFs.

2. In the second step, it is appropriate to remove the problematic variable(s) suffering from multicollinearity.

Table 6 (Collinearity Statistics)

Variable                     Tolerance    VIF
Adjusted transfer exp.         .295       3.393
%age foreign players           .290       3.445
Average age                    .673       1.485
Annual att. in thousands       .253       3.951
Last yrs Position              .231       4.326
New Arrivals                   .215       4.644
Last yrs Goals                 .365       2.739
Goal Diff. Adjusted            .337       2.969

In this table the variable last year's success rate has been excluded from the model, and all VIF values now lie below the threshold and can be considered acceptable. As far as contribution to the model is concerned, most predictors are still insignificant, so it is reasonable to run a test and remove the insignificant predictors.

Table 7 (Backward Method)

Model 1:
Variable                      B        Std. Error   Beta     t        Sig.
(Constant)                  53.036     56.682                 .936    .356
%age foreign players          .477       .183       .316     2.599    .014
Average age                  -.870      2.038      -.034     -.427    .672
Annual att. in thousand       .575       .378       .198     1.521    .138
Last yrs Position           -2.034       .607      -.456    -3.349    .002
New Arrivals                 1.309       .409       .452     3.203    .003
Last yrs Goals               -.070       .220      -.034     -.317    .753
Goal Diff. Adjusted         -8.574      4.479      -.216    -1.914    .065
Adjusted transfer exp.       -.089       .072      -.148    -1.228    .228

Model 2 (Last yrs Goals removed):
(Constant)                  43.859     48.054                 .913    .368
%age foreign players          .483       .180       .320     2.680    .011
Average age                  -.735      1.966      -.029     -.374    .711
Annual att. in thousand       .586       .371       .202     1.582    .123
Last yrs Position           -1.903       .439      -.427    -4.335    .000
New Arrivals                 1.330       .398       .459     3.339    .002
Goal Diff. Adjusted         -8.820      4.351      -.222    -2.027    .051
Adjusted transfer exp.       -.092       .070      -.154    -1.303    .202

Model 3 (Average age removed):
(Constant)                  26.590     13.138                2.024    .051
%age foreign players          .485       .178       .321     2.730    .010
Annual att. in thousand       .516       .315       .178     1.637    .111
Last yrs Position           -1.934       .426      -.434    -4.544    .000
New Arrivals                 1.300       .385       .449     3.374    .002
Goal Diff. Adjusted         -8.444      4.180      -.213    -2.020    .051
Adjusted transfer exp.       -.084       .066      -.140    -1.265    .215

Model 4 (Adjusted transfer exp. removed):
(Constant)                  29.926     12.980                2.306    .027
%age foreign players          .486       .179       .322     2.713    .010
Annual att. in thousand       .426       .310       .147     1.374    .178
Last yrs Position           -1.937       .429      -.435    -4.513    .000
New Arrivals                 1.028       .322       .355     3.189    .003
Goal Diff. Adjusted         -8.446      4.216      -.213    -2.003    .053

Model 5 (Annual att. removed):
(Constant)                  42.535      9.294                4.577    .000
%age foreign players          .435       .177       .288     2.451    .019
Last yrs Position           -2.251       .368      -.505    -6.121    .000
New Arrivals                 1.125       .318       .389     3.533    .001
Goal Diff. Adjusted         -5.813      3.801      -.147    -1.529    .135

Model 6 (Goal Diff. Adjusted removed):
(Constant)                  38.478      9.067                4.244    .000
%age foreign players          .362       .174       .240     2.080    .045
Last yrs Position           -2.196       .373      -.493    -5.893    .000
New Arrivals                  .959       .305       .331     3.147    .003

The backward procedure above eliminated those predictors that were not contributing significantly to explaining the dependent variable, the annual score of the club. Model 6 in the table will be used in further tests and interpretations.

Autocorrelation
Autocorrelation is another assumption of linear regression; it represents the degree of similarity between a given time series and a lagged version of itself over successive time intervals. It shows dependence/correlation between the errors of a single variable over different periods, and is also known as lagged or serial correlation. If error terms in different time periods are correlated, they start predicting future errors, so the error no longer remains random; as a result the coefficient estimates become biased.

Detection of Autocorrelation
To check whether there is autocorrelation in the data set, the following tests are observed or performed: 1. Durbin-Watson stat: a test of whether the residuals from a linear regression are independent. It ranges from 0 to 4, from extreme positive to extreme negative autocorrelation, with 2 indicating no autocorrelation. ρ (rho), the coefficient on the lagged error term, shows negative/positive autocorrelation by its sign, and the strength of the autocorrelation by its magnitude.
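The Durbin-Watson statistic described above can be sketched directly from its formula (the residual series here is illustrative only, not the report's data):

```python
# DW = sum((e_t - e_{t-1})^2) / sum(e_t^2); values near 2 suggest no
# first-order autocorrelation, near 0 positive, near 4 negative.
def durbin_watson(residuals):
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

print(durbin_watson([1.0, 1.0, 1.0, 1.0]))    # 0.0: extreme positive pattern
print(durbin_watson([1.0, -1.0, 1.0, -1.0]))  # 3.0: toward the negative end
```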

Table - 8
Dependent Variable: SCORE
Method: Least Squares
Sample: 1970 2012
Included observations: 43

Variable                Coefficient   Std. Error   t-Statistic   Prob.
Last year's Position    -1.844158      0.378198    -4.876172     0.0000
New Arrivals             0.864051      0.324832     2.659996     0.0113
Foreign players          0.573993      0.167098     3.435071     0.0014
C                       25.11707       8.293498     3.028526     0.0043

R-squared             0.829049    Mean dependent var      47.53488
Adjusted R-squared    0.815899    S.D. dependent var      28.50175
S.E. of regression   12.22925     Akaike info criterion    7.933947
Sum squared resid  5832.632      Schwarz criterion        8.09778
Log likelihood     -166.5799     Hannan-Quinn criter.     7.994363
F-statistic          63.0451     Durbin-Watson stat       1.350897
Prob(F-statistic)     0.0000

In the above table the Durbin-Watson stat = 1.351, which is not very near 2, so from it alone we cannot conclude whether autocorrelation exists; being between 0 and 2, it shows signs of positive autocorrelation. 2. Serial Correlation LM test: this test confirms the presence of autocorrelation in the model. The test is run in EViews, with the following results:

Table - 9
Breusch-Godfrey Serial Correlation LM Test:
F-statistic      4.191964    Prob. F(1,38)         0.0476
Obs*R-squared    4.272246    Prob. Chi-Square(1)   0.0387

The null hypothesis of the above test states that no autocorrelation exists. As the P value is less than 0.1, the null hypothesis is rejected and we conclude that autocorrelation is present: the errors predict future errors. This might have happened because an important variable was excluded from the model and left in the residuals, which would result in biased values.


Resolving Autocorrelation
There are a few ways to remove autocorrelation from the model. 1. Including an independent variable: as discussed earlier, important variable(s) missing from the model may have created the autocorrelation problem, so by consulting theory an important variable could be identified. In this research, last year's goal difference (adjusted) is an important factor in explaining the club's annual score. Though it was previously declared insignificant, its inclusion might let the model satisfy this assumption.

Table - 10
Dependent Variable: SCORE
Method: Least Squares
Sample: 1970 2012
Included observations: 43

Variable                Coefficient   Std. Error   t-Statistic   Prob.
Last year's Position    -1.948561      0.371677    -5.242623     0.0000
New Arrivals             1.086236      0.337981     3.213899     0.0027
GD Adjusted             -7.305805      3.989478    -1.831269     0.0749
Foreign players          0.640403      0.166276     3.851445     0.0004
C                       31.73432       8.827499     3.594939     0.0009

R-squared             0.842912    Mean dependent var      47.53488
Adjusted R-squared    0.826376    S.D. dependent var      28.50175
S.E. of regression   11.87616     Akaike info criterion    7.895887
Sum squared resid  5359.638      Schwarz criterion        8.100677
Log likelihood     -164.7616     Hannan-Quinn criter.     7.971407
F-statistic          50.97566    Durbin-Watson stat       1.466973
Prob(F-statistic)     0.0000

Here the variable adjusted goal difference of the previous year is included, and it increased the Durbin-Watson stat to 1.467 (closer to 2). Still, it cannot yet be deduced that autocorrelation has been resolved; the serial LM test should be performed.


Table 11
Breusch-Godfrey Serial Correlation LM Test:
F-statistic      2.329571    Prob. F(1,37)         0.1354
Obs*R-squared    2.546978    Prob. Chi-Square(1)   0.1105

Now the P value is 0.135, so the null hypothesis cannot be rejected: autocorrelation no longer exists.

2. Cochrane-Orcutt method: a procedure that adjusts a linear model for serial correlation in the error term without adding other variables. The following steps are performed to resolve autocorrelation: a) Calculate a residual series

Table - 12 (Residual series)

1970   11.144       1992    3.073346
1971    3.162562    1993   11.70838
1972    8.969859    1994    5.547431
1973    4.678068    1995   17.07079
1974   17.66892     1996    6.964462
1975   16.36122     1997   10.72888
1976   -5.33152     1998    9.940079
1977  -22.0511      1999    9.457424
1978   -5.64855     2000   -6.063351
1979    8.317451    2001  -17.29621
1980    5.518213    2002   -8.243533
1981  -25.9485      2003    2.581138
1982   -9.35849     2004   -8.572424
1983  -17.9232      2005    9.379695
1984  -21.163       2006    9.172913
1985    4.057698    2007   -0.854429
1986   15.19987     2008    3.318821
1987   -5.01143     2009   -3.079499
1988    3.086235    2010    7.287237
1989  -25.2261      2011  -13.02357
1990    4.571231    2012  -13.39078
1991   -0.78024


b) Running the residuals on their lagged values to calculate ρ (rho)

Table - 13

Dependent Variable: RES
Method: Least Squares
Sample (adjusted): 1971 2012
Included observations: 42 after adjustments

Variable    Coefficient    Std. Error    t-Statistic    Prob.
RES(-1)      0.308003       0.14938       2.061881      0.0456

In Table 13 above, the coefficient of the residuals is computed by running the residual series on its own lag [res on res(-1)], so ρ = 0.308. c) In this step, the previous year's equation is multiplied by ρ and subtracted from this year's equation.
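Step b) amounts to a no-intercept regression of the residual on its own lag, whose slope is ρ = Σ e_t·e_{t-1} / Σ e²_{t-1}; a sketch on a toy series constructed so that ρ is exactly 0.5 (illustrative only, not the report's residuals):

```python
# rho is the slope of a no-intercept regression of e_t on e_{t-1}:
# rho = sum(e_t * e_{t-1}) / sum(e_{t-1}^2)
def estimate_rho(residuals):
    current, lagged = residuals[1:], residuals[:-1]
    return (sum(e * l for e, l in zip(current, lagged))
            / sum(l ** 2 for l in lagged))

toy_residuals = [8.0, 4.0, 2.0, 1.0]  # each value exactly half the previous
print(estimate_rho(toy_residuals))    # 0.5
```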

The following equations are used to transform all variables in the model:
1. tscore = score - 0.308*score(-1)
2. tpos = pos - 0.308*pos(-1)
3. tnew = new - 0.308*new(-1)
4. tfp = fp - 0.308*fp(-1)
Now the error trend should be different in the equation of transformed variables, and the β coefficients should no longer be biased.
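The four transformation equations share one pattern, x*_t = x_t - ρ·x_{t-1}, which can be sketched as follows (the score values below are illustrative only):

```python
# Cochrane-Orcutt quasi-differencing: x*_t = x_t - rho * x_{t-1}.
# The first observation is lost, which is why the adjusted sample
# starts in 1971 rather than 1970.
def co_transform(series, rho=0.308):  # rho = 0.308 from Table 13
    return [series[t] - rho * series[t - 1] for t in range(1, len(series))]

scores = [50.0, 60.0, 40.0]  # illustrative values, not the report's data
print(co_transform(scores))  # [60 - 0.308*50, 40 - 0.308*60]
```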


Table 14
Dependent Variable: TSCORE
Method: Least Squares
Sample (adjusted): 1971 2012
Included observations: 42 after adjustments

Variable    Coefficient   Std. Error   t-Statistic   Prob.
TPOS        -1.80987       0.36586     -4.946887     0.0000
TNEW         0.855013      0.360296     2.373084     0.0228
TFP          0.555992      0.190089     2.924911     0.0058
C           17.51219       6.519326     2.686197     0.0107

R-squared             0.747405    Mean dependent var      32.95619
Adjusted R-squared    0.727463    S.D. dependent var      22.31985
S.E. of regression   11.65209     Akaike info criterion    7.83924
Sum squared resid  5159.303      Schwarz criterion        8.004733
Log likelihood     -160.6241     Hannan-Quinn criter.     7.8999
F-statistic          37.47948    Durbin-Watson stat       1.952549
Prob(F-statistic)     0.0000

The above table shows a Durbin-Watson stat of 1.95, very near the no-autocorrelation value of 2.

Table - 15

Breusch-Godfrey Serial Correlation LM Test:
F-statistic      0.011483    Prob. F(1,37)         0.9152
Obs*R-squared    0.01303     Prob. Chi-Square(1)   0.9091

After applying the Cochrane-Orcutt method, the LM test shows a P value of 0.915, which cannot reject the null hypothesis of no autocorrelation; autocorrelation is resolved. 3. AR(1) method: this method is run, multiple times if required, to remove autocorrelation, using the equation 'score pos new fp c ar(1)'.

Table - 16
Dependent Variable: SCORE
Method: Least Squares
Sample (adjusted): 1971 2012
Included observations: 42 after adjustments
Convergence achieved after 9 iterations

Variable    Coefficient   Std. Error   t-Statistic   Prob.
POS         -1.81039       0.379373    -4.772054     0.0000
NEW          0.85247       0.374142     2.278468     0.0286
FP           0.555417      0.193583     2.869147     0.0068
C           25.3577        9.627306     2.633935     0.0122
AR(1)        0.315388      0.163844     1.924925     0.0620

R-squared             0.848765    Mean dependent var      47.47619
Adjusted R-squared    0.832415    S.D. dependent var      28.84461
S.E. of regression   11.80817     Akaike info criterion    7.886804
Sum squared resid  5159.016      Schwarz criterion        8.093669
Log likelihood     -160.6229     Hannan-Quinn criter.     7.962628
F-statistic          51.9129     Durbin-Watson stat       1.965921
Prob(F-statistic)     0.0000
Inverted AR Roots     0.32

This method has improved the Durbin-Watson stat; to remove any doubts about remaining autocorrelation, the LM test is observed.

Table - 17
Breusch-Godfrey Serial Correlation LM Test:
F-statistic      0.032267    Prob. F(1,36)         0.8585
Obs*R-squared    0.037611    Prob. Chi-Square(1)   0.8462

Hence the P value confirms the removal of autocorrelation.


Stationarity
When there is a trend in the data, spurious regression can arise: if the data series are not stationary, the regression may appear statistically significant even though there is no causal relationship between the variables. First we carry out a graphical analysis of all variables to check whether the data are stationary at level, and then a unit root test is performed to confirm the interpretation of the graphs.

[Figure: FP (%age foreign players), 1970-2012]

The figure indicates some trend in the percentage of foreign players over the years, as the percentage increases from 1983 onwards.

The presence of a trend can be checked by the unit root test; the results are as follows:

Table -18
Null Hypothesis: FP has a unit root
Exogenous: Constant
Lag Length: 0 (Automatic based on AIC, MAXLAG=9)

                                          t-Statistic    Prob.*
Augmented Dickey-Fuller test statistic     -0.940473     0.7653
Test critical values:   1% level           -3.596616
                        5% level           -2.933158
                        10% level          -2.604867

The Prob. value is greater than 0.1, so the null hypothesis is not rejected: FP is non-stationary, meaning there is a trend in this variable at level.

[Figure: NEW (new arrivals), 1970-2012]

In the variable new arrivals of players, a trend seems to exist in the data.

Table 19

Null Hypothesis: NEW arrivals has a unit root
Exogenous: Constant
Lag Length: 1 (Automatic based on AIC, MAXLAG=1)

                                          t-Statistic    Prob.*
Augmented Dickey-Fuller test statistic     -1.209201     0.6614
Test critical values:   1% level           -3.600987
                        5% level           -2.935001
                        10% level          -2.605836

The sig. value is greater than 0.1, so the null hypothesis is not rejected: new arrivals is non-stationary, meaning there is a trend in this variable at level.


[Figure: POS (last yr's position), 1970-2012]

The figure of last year's position in the premier league does not show any significant trend; the data appear stationary.

The observed results can be checked through Unit root test.

Table - 20
Null Hypothesis: POS has a unit root
Exogenous: Constant
Lag Length: 0 (Automatic based on AIC, MAXLAG=1)

                                          t-Statistic    Prob.*
Augmented Dickey-Fuller test statistic     -3.999433     0.0034
Test critical values:   1% level           -3.596616
                        5% level           -2.933158
                        10% level          -2.604867

Here the P value is less than 0.1, therefore we can say that the variable last year's position of the club is stationary.


[Figure: SCORE (annual team score), 1970-2012]

Annual team score is the dependent variable, and its figure shows some trend in the data over time.

Table 21

Null Hypothesis: SCORE has a unit root
Exogenous: Constant
Lag Length: 0 (Automatic based on AIC, MAXLAG=1)

                                          t-Statistic    Prob.*
Augmented Dickey-Fuller test statistic     -2.022128     0.2766
Test critical values:   1% level           -3.596616
                        5% level           -2.933158
                        10% level          -2.604867

Here too, the sig. value is greater than 0.1, so the null hypothesis is not rejected: annual score is non-stationary, meaning there is a trend in this variable at level.

Three out of four variables in this model are non-stationary. Now we take the first difference and check whether the trend is removed.
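First differencing, D(x)_t = x_t - x_{t-1}, is what removes a linear trend; a quick sketch on a purely deterministic trend series:

```python
# D(x)_t = x_t - x_{t-1}: differencing turns a linear trend into a constant.
def first_difference(series):
    return [series[t] - series[t - 1] for t in range(1, len(series))]

trending = [5 + 2 * t for t in range(6)]  # 5, 7, 9, 11, 13, 15
print(first_difference(trending))         # [2, 2, 2, 2, 2] -- trend removed
```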


[Figure: D(FP), first difference of %age foreign players, 1970-2012]

According to this figure, at first difference the trend in foreign players percentage seems to be removed. We can further confirm this through the unit root test.

Table - 22
Null Hypothesis: D(FP) has a unit root
Exogenous: Constant
Lag Length: 0 (Automatic based on AIC, MAXLAG=1)

                                          t-Statistic    Prob.*
Augmented Dickey-Fuller test statistic     -5.517794     0.000
Test critical values:   1% level           -3.600987
                        5% level           -2.935001
                        10% level          -2.605836

The null hypothesis of a unit root is rejected here, showing that percentage foreign players is stationary at its first difference.


[Figure: D(NEW), first difference of new arrivals, 1970-2012]

At first difference, in new arrivals of players, trend seems to be removed. We can further confirm it through Unit root test.

Table - 23
Null Hypothesis: D(NEW) has a unit root
Exogenous: Constant
Lag Length: 1 (Automatic based on AIC, MAXLAG=1)

                                          t-Statistic    Prob.*
Augmented Dickey-Fuller test statistic     -8.832034     0.000
Test critical values:   1% level           -3.605593
                        5% level           -2.936942
                        10% level          -2.606857

The unit root test confirmed the absence of trend in new arrivals after taking the first difference; the series is now stationary.


[Figure: D(SCORE), first difference of annual club score, 1970-2012]

Similarly, in the dependent variable (annual club score) the trend is removed, as shown in the figure.

Table - 24
Null Hypothesis: D(SCORE) has a unit root
Exogenous: Constant
Lag Length: 0 (Automatic based on AIC, MAXLAG=1)

                                          t-Statistic    Prob.*
Augmented Dickey-Fuller test statistic     -7.482432     0.000
Test critical values:   1% level           -3.600987
                        5% level           -2.935001
                        10% level          -2.605836

The unit root test confirmed the above figure and rejected the hypothesis of non-stationarity. So the three variables that were non-stationary at level become stationary when the first difference is taken. It would not be appropriate to use trended and trend-free variables together at different levels, so the effect of trend on the results is checked by including the first-difference series of last year's position as well. Following are the first-difference data of the four variables:


Year   Dscore   Dnew   Dfp     Dpos
1970   NA       NA     NA      NA
1971   -4        2     -2.5    -2
1972   -3       -1     -4.2     3
1973   -7       -1      0       1
1974   -3        1    -13.3     5
1975   -3        3      8.6     5
1976  -30       -4      4.4     4
1977    0        0     -3     -10
1978   33        0      0      -9
1979  -15        3    -10      14
1980  -18       -1     -5.7     6
1981    0       -2      0     -18
1982    0        0     -3.2     8
1983    0        4      8.9     0
1984    0        4     18.9     6
1985   60       -5     13.5   -17
1986    4        4     -2.4     5
1987  -17       -1      7.1     0
1988   -7        0     -0.6     8
1989  -40       -2     -4.5     4
1990   60        0     -2     -17
1991  -11        2      0       4
1992   -1        4      4.8     6
1993    3        4     -6.2     3
1994   -5       -3     -3.1    -3
1995    3       -2     -2.2     3
1996    1        2      6.7    -3
1997    9        4      3.1     0
1998    4       -4     -1.7    -5
1999   12        4      9.3    -2
2000  -10        1      4.9    -1
2001   -4        5     11.5     2
2002    3       -6      1.7     1
2003    3       -6     -4.6     0
2004   12       21      2.3    -2
2005   16       -5     -2.3    -2
2006   -4       -4     -3.8    -1
2007   -8        6     -5.5     0
2008    2        5     -8.1     1
2009   -2       -7     18.2     0
2010    3       -4     -3.6     1
2011  -15        5     -4.7    -2
2012   -7       -8      3.7     1

Following are the results obtained after running the regression with these computed variables.

Table - 25
Dependent Variable: DSCORE
Method: Least Squares
Date: 07/16/12  Time: 09:53
Sample (adjusted): 1971 2012
Included observations: 42 after adjustments

Variable    Coefficient   Std. Error   t-Statistic   Prob.
DPOS        -1.786161      0.341828    -5.225326     0.0000
DNEW         0.379865      0.442625     0.858208     0.3962
DFP          0.212552      0.313157     0.678740     0.5014
C           -0.110896      2.167687    -0.051159     0.9595

R-squared             0.437688    Mean dependent var       0.333333
Adjusted R-squared    0.393295    S.D. dependent var      17.86899
S.E. of regression   13.91839     Akaike info criterion    8.194691
Sum squared resid  7361.416      Schwarz criterion        8.360184
Log likelihood     -168.0885     Hannan-Quinn criter.     8.255351
F-statistic           9.859372   Durbin-Watson stat       2.498079
Prob(F-statistic)     0.000061

The overall model is still significant, but the first differences of new arrivals and foreign players no longer contribute significantly; the model is significant probably due to the effect of the first difference of last year's position. We may say that the now-insignificant variables previously had a significant relationship with the annual team score because of the presence of trend.


Cointegration
Cointegration is a statistical property of time series variables. It checks whether a long-term relationship exists between the variables entering the OLS regression: if there exists a stationary linear combination of non-stationary random variables, the variables combined are said to be cointegrated. The sample size must be large enough (30 and above) for the relationship to qualify as long term. Following are the results of the Johansen cointegration test summary:

Table - 26
Sample: 1970 2012
Included observations: 41
Series: SCORE POS NEW FP
Lags interval: 1 to 1

Selected (0.05 level*) Number of Cointegrating Relations by Model

Data Trend:   None           None         Linear       Linear      Quadratic
Test Type     No Intercept   Intercept    Intercept    Intercept   Intercept
              No Trend       No Trend     No Trend     Trend       Trend
Trace         0              0            1            1           1
Max-Eig       0              0            1            1           1

*Critical values based on MacKinnon-Haug-Michelis (1999)

Information Criteria by Rank and Model

Log Likelihood by Rank (rows) and Model (columns)
0   -547.1368   -547.1368   -546.5609   -546.5609   -546.202
1   -535.7472   -532.9043   -532.3382   -528.4732   -528.179
2   -530.295    -527.3508   -526.8145   -520.9823   -520.879
3   -529.0465   -521.9657   -521.4726   -515.4816   -515.456
4   -529.0441   -521.192    -521.192    -510.9452   -510.945

Akaike Information Criteria by Rank (rows) and Model (columns)
0   27.47009    27.47009    27.63712    27.63712    27.81474
1   27.30474    27.21484    27.33357    27.19382*   27.3258
2   27.42902    27.38297    27.45437    27.26743    27.35994
3   27.75837    27.5593     27.58403    27.43813    27.48567
4   28.14849    27.96058    27.96058    27.65587    27.65587

Schwarz Criteria by Rank (rows) and Model (columns)
0   28.13880*   28.13880*   28.47301    28.47301    28.81781
1   28.30781    28.25971    28.50381    28.40585    28.66323
2   28.76645    28.80398    28.95896    28.85562    29.03172
3   29.43014    29.35646    29.42298    29.40247    29.4918
4   30.15463    30.13389    30.13389    29.99635    29.99635

The asterisk in the Akaike table marks the minimum value, 27.19382, which falls at rank (lag) 1 in the fourth column. The test is therefore run again with option 4 (linear deterministic trend with intercept and trend) selected as the deterministic trend assumption, and with the lag interval set to 1 1, because the asterisk was found in the 4th column at lag 1.


Table - 27
Sample (adjusted): 1972 2012
Included observations: 41 after adjustments
Trend assumption: Linear deterministic trend (restricted)
Series: SCORE POS NEW FP
Lags interval (in first differences): 1 to 1

Unrestricted Cointegration Rank Test (Trace)
Hypothesized No. of CE(s)   Eigenvalue   Trace Statistic   0.05 Critical Value   Prob.**
None *                      0.586181     71.23136          63.8761               0.0106
At most 1                   0.30609      35.05599          42.91525              0.2428
At most 2                   0.23534      20.07406          25.87211              0.2222
At most 3                   0.198514      9.072785         12.51798              0.1759

Trace test indicates 1 cointegrating eqn(s) at the 0.05 level
* denotes rejection of the hypothesis at the 0.05 level
**MacKinnon-Haug-Michelis (1999) p-values

Unrestricted Cointegration Rank Test (Maximum Eigenvalue)
Hypothesized No. of CE(s)   Eigenvalue   Max-Eigen Statistic   0.05 Critical Value   Prob.**
None *                      0.586181     36.17538              32.11832              0.0151
At most 1                   0.30609      14.98192              25.82321              0.6361
At most 2                   0.23534      11.00128              19.38704              0.5132
At most 3                   0.198514      9.072785             12.51798              0.1759

Max-eigenvalue test indicates 1 cointegrating eqn(s) at the 0.05 level
* denotes rejection of the hypothesis at the 0.05 level
**MacKinnon-Haug-Michelis (1999) p-values

The above tables show, by both the trace statistic and the maximum eigenvalue, that one cointegrating equation exists at the 0.05 level: only one null hypothesis is rejected (P value less than 0.1), so only one cointegrating relation exists, and it may be concluded that a long-term relationship exists. The problem of spurious relationships can be eliminated by differencing the data, but this implies a loss of the long-run information content in the data. Cointegration is therefore the simplest way of eliminating the illogical correlation established between time series due to the presence of trends.

Causality Test
Regression quantifies the association between variables, but the estimated relationship may not reflect genuine cause and effect: one variable may not actually be causing the other. Regression also tells us nothing about the direction of causation. To establish the direction of causation we use a causality test.

Table - 28
Pairwise Granger Causality Tests
Sample: 1970 2012
Lags: 1

Null Hypothesis:                    Obs   F-Statistic   Prob.
POS does not Granger Cause SCORE    42    0.0051        0.9434
SCORE does not Granger Cause POS          2.28238       0.1389
NEW does not Granger Cause SCORE    42    7.23605       0.0105
SCORE does not Granger Cause NEW          6.56673       0.0144
FP does not Granger Cause SCORE     42    15.4385       0.0003
SCORE does not Granger Cause FP           0.00083       0.9772

The above table examines the relationship of the annual score with the three other variables (position, new arrivals, and foreign players) using the Granger causality test. In the first pair, last year's team position does not Granger-cause the team score, nor does the score cause last year's position. In the second pair, new arrivals are significant in causing the score, but the score also causes new arrivals, so the causality runs in both directions. The third pair is the most clear-cut: the percentage of foreign players affects the annual team score, while the score has no significant effect on the percentage of foreign players.


Forecasting
Forecasting is a technique to check whether the model has the ability to forecast future values. There are two methods of forecasting:

1. In-sample forecast
2. Out-of-sample forecast

In-Sample Forecast
Here the existing sample data is used to predict already-known values, to check whether the model predicts well.

[Figure: in-sample forecast of SCORE (SCOREF with ±2 S.E. bands), 2007 2012]

Forecast: SCOREF    Actual: SCORE
Forecast sample: 2007 2012    Included observations: 6

Root Mean Squared Error        10.00013
Mean Absolute Error             8.175795
Mean Abs. Percent Error        11.29742
Theil Inequality Coefficient    0.060828
  Bias Proportion               0.430574
  Variance Proportion           0.214531
  Covariance Proportion         0.354895

For estimation, data from 1970 2006 is used to forecast the values for 2007 2012. The Theil inequality coefficient measures the forecasting ability of the model; it lies between 0 and 1, where 0 indicates a perfect fit and values below 0.1 indicate good forecasting performance. In the above output the Theil inequality coefficient equals 0.06, which is below 0.1 and close to 0, so it can be concluded that the model has good forecasting ability.
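The Theil inequality coefficient can be recomputed directly from the actual and forecasted scores. The sketch below, using the 2007 2012 values from Table 29, reproduces a value of about 0.061, close to the reported 0.060828 (the small gap comes from the forecasts being rounded to whole points in Table 29).

```python
import numpy as np

def theil_u(actual, forecast):
    """Theil inequality coefficient: 0 = perfect forecast, 1 = worst."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    rmse = np.sqrt(np.mean((actual - forecast) ** 2))
    # Denominator: sum of the root-mean-square levels of the two series.
    denom = np.sqrt(np.mean(actual ** 2)) + np.sqrt(np.mean(forecast ** 2))
    return rmse / denom

# Actual vs in-sample forecasted scores for 2007-2012 (Table 29).
actual = [83, 85, 83, 86, 71, 64]
forecast = [88, 86, 89, 81, 88, 79]
print(round(theil_u(actual, forecast), 3))  # 0.061
```

A perfect forecast (forecast equal to actual) gives exactly 0, which is why values near 0 indicate good forecasting ability.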


Table - 29
Years   Actual Scores   Forecasted Scores
2007    83              88
2008    85              86
2009    83              89
2010    86              81
2011    71              88
2012    64              79

The above table shows the actual and in-sample forecasted values. There are some differences between the two columns, but overall the forecasting ability is good.

Out-of-Sample Forecast
Future values of the independent variables can be used to forecast the dependent variable (annual team score). These future values are calculated on the basis of a moving average. The extrapolated data is shown in the following table:

Table 30
Years   Position   % change   Arrivals   % change   Fplayers   % change   Score   % change
2007    1                     27                    64.9                  83.0
2008    2           1.000     32          0.185     56.8       -0.125     85.0     0.024
2009    2           0.000     25         -0.219     75.0        0.320     83.0    -0.024
2010    3           0.500     21         -0.160     71.4       -0.048     86.0     0.036
2011    1          -0.667     26          0.238     66.7       -0.066     71.0    -0.174
2012    2           1.000     18         -0.308     70.4        0.055     64.0    -0.099
2013    2           0.367     18         -0.053     70.4        0.027     64.0    -0.047
2014    2           0.240     18         -0.100     70.5        0.058     63.9    -0.062
2015    2           0.288     18         -0.076     70.5        0.005     63.9    -0.069
2016    2           0.246     18         -0.060     70.5        0.016     63.8    -0.090
2017    2           0.428     18         -0.119     70.5        0.032     63.8    -0.073

The rows for 2013 2017 (highlighted in the original output) show the calculated future values of the predictors and the dependent variable.
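The moving-average rule implied by Table 30 can be checked with a short sketch: forecasting next year's percentage change as the mean of the previous five changes reproduces the table's 2013 row.

```python
import numpy as np

# Year-on-year % changes for 2008-2012, read from Table 30.
pos_changes = [1.000, 0.000, 0.500, -0.667, 1.000]        # league position
arrival_changes = [0.185, -0.219, -0.160, 0.238, -0.308]  # new arrivals

def ma_forecast(changes, window=5):
    """Next year's % change forecast: mean of the last `window` changes."""
    return float(np.mean(changes[-window:]))

# Reproduces the 2013 row of Table 30.
print(round(ma_forecast(pos_changes), 3))      # 0.367
print(round(ma_forecast(arrival_changes), 3))  # -0.053
```

Both values match the 2013 entries in Table 30 (0.367 and -0.053), confirming the five-year moving-average construction of the future percentage changes.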

Now a new workfile is created covering 48 years (the previous 43 plus the 5 new years).
[Figure: out-of-sample forecast of SCORE (SCOREF with ±2 S.E. bands), 2013 2017]

Forecast: SCOREF    Actual: SCORE
Forecast sample: 2013 2017    Included observations: 5

Root Mean Squared Error        10.50677
Mean Absolute Error            10.50636
Mean Abs. Percent Error        16.44722
Theil Inequality Coefficient    0.075989
  Bias Proportion               0.999923
  Variance Proportion           0.000027
  Covariance Proportion         0.000050

The above output shows a Theil inequality coefficient of 0.076, which is below 0.1; therefore the model also has good out-of-sample forecasting ability.

Dummy Variable
A dummy variable takes the value 0 or 1 to indicate the absence or presence of some categorical effect that may be expected to shift the outcome; it is a metric variable constructed from a category of a non-metric variable. A dummy variable can be introduced into the model to check the effect of that particular attribute on the dependent variable. The era of Roman Abramovich is considered very critical for Chelsea Football Club: he acquired the club in June 2003, and since then there have been significant changes at the club. A dummy variable for the Abramovich era is therefore included and analyzed against the club's score, with the following results:

31

Table 31
Dependent Variable: SCORE
Method: Least Squares
Sample: 1970 2012
Included observations: 43

Variable   Coefficient   Std. Error   t-Statistic   Prob.
ROMAN      42.82424      7.981463     5.365463      0.000
C          37.57576      3.849003     9.762464      0.000

R-squared            0.412508    Mean dependent var      47.53488
Adjusted R-squared   0.398179    S.D. dependent var      28.50175
S.E. of regression   22.11084    Akaike info criterion    9.075408
Sum squared resid    20044.46    Schwarz criterion        9.157325
Log likelihood      -193.1213    Hannan-Quinn criter.     9.105617
F-statistic          28.78819    Durbin-Watson stat       0.732021
Prob(F-statistic)    0.000003

The relationship is significant and can be interpreted as follows: after Roman Abramovich acquired the club, the average annual score increased by 42.82 points, which is substantial. However, when this dummy variable was combined with the other independent variables it produced insignificant results with biased beta values, possibly due to the violation of one or more regression assumptions.

Factor Analysis
Factor analysis is a technique to summarize data and reduce the number of variables. For this purpose a set of variables is taken and analyzed, with the following results:

KMO and Bartlett's Test (table 32)

Kaiser-Meyer-Olkin Measure of Sampling Adequacy         .836
Bartlett's Test of Sphericity   Approx. Chi-Square   652.293
                                df                    66
                                Sig.                  .000

As the significance value in the above table is less than 0.1, there is sufficient correlation between the variables and it is appropriate to run factor analysis (the KMO value of .836 also indicates good sampling adequacy).
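Bartlett's test statistic itself is simple to compute from the correlation matrix. The sketch below implements the standard formula, χ² = -(n - 1 - (2p + 5)/6) · ln|R| with p(p-1)/2 degrees of freedom, on synthetic correlated data standing in for the report's variables.

```python
import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(data):
    """Bartlett's test that the correlation matrix is the identity."""
    n, p = data.shape
    R = np.corrcoef(data, rowvar=False)
    stat = -(n - 1 - (2 * p + 5) / 6.0) * np.log(np.linalg.det(R))
    df = p * (p - 1) / 2.0
    return stat, df, chi2.sf(stat, df)

# Synthetic correlated variables: all four share a common component,
# so the correlation matrix is far from the identity.
rng = np.random.default_rng(1)
base = rng.normal(size=(100, 1))
data = base + 0.5 * rng.normal(size=(100, 4))

stat, df, p = bartlett_sphericity(data)
print(df, p)  # df = 6.0; p near 0, so factor analysis is appropriate
```

A tiny p-value, as in table 32 (Sig. = .000 with df = 66), rejects the hypothesis that the variables are uncorrelated.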

Table - 33
Total Variance Explained

            Initial Eigenvalues               Extraction Sums of Squared Loadings   Rotation Sums of Squared Loadings
Component   Total    % of Var.   Cum. %       Total    % of Var.   Cum. %            Total    % of Var.   Cum. %
1           7.491    62.426      62.426       7.491    62.426      62.426            4.600    38.335      38.335
2           1.479    12.326      74.752       1.479    12.326      74.752            4.292    35.764      74.099
3           1.185     9.873      84.625       1.185     9.873      84.625            1.263    10.526      84.625
4            .605     5.042      89.667
5            .492     4.101      93.768
6            .281     2.344      96.113
7            .183     1.528      97.641
8            .128     1.067      98.708
9            .081      .673      99.381
10           .043      .355      99.736
11           .028      .234      99.971
12           .004      .029     100.000

The above table shows how many factors will be formed. The eigenvalues indicate how much variance each successive factor extracts: three components have eigenvalues above 1 and together explain 84.6% of the total variance.

Rotated Component Matrix (table 34)

Variables (in row order): Adjusted annual score, Adjusted transfer exp., %age foreign players, Average age, Last yr Success rate, Annual att. in thousand, Last yrs Position, Last yrs Goal Diff., New Arrivals, Percentage T. Expenditure, Last yrs Goals, Goal Diff. Adjusted

Component 1 loadings:  .589   .913   .722   .430   .570   .409   .888   .903
Component 2 loadings:  .648   .429   .966   .868   .506  -.903   .889   .456
Component 3 loadings:  .890   .716

Extraction Method: Principal Component Analysis. Rotation Method: Varimax with Kaiser Normalization. (Rotation converged in 4 iterations.)

Table 34 shows the factor loadings. Each variable is assigned to the factor on which its loading is highest, and variables with cross-loadings are eliminated. For instance, annual expenditure, new arrivals, last year's goals, and goal difference could be placed under the head of one factor.

Communalities (table 35)

Variable                    Initial   Extraction
Adjusted annual score       1.000     .794
Adjusted transfer exp.      1.000     .878
%age foreign players        1.000     .707
Average age                 1.000     .934
Last yr Success rate        1.000     .947
Annual att. in thousand     1.000     .790
Last yrs Position           1.000     .918
Last yrs Goal Diff.         1.000     .962
New Arrivals                1.000     .863
Percentage T. Expenditure   1.000     .876
Last yrs Goals              1.000     .844
Goal Diff. Adjusted         1.000     .643

Extraction Method: Principal Component Analysis.

The communalities report the total amount of a variable's variance extracted by all the factors together; each communality is the sum of the squared factor loadings for that variable.
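The quantities in Tables 33 and 35 can be reproduced from first principles: eigenvalues of the correlation matrix give the "Initial Eigenvalues" column, and communalities are row sums of squared loadings. A sketch on synthetic data (a stand-in for the report's variables):

```python
import numpy as np

# Synthetic stand-in: 43 observations of 5 variables sharing one factor.
rng = np.random.default_rng(3)
common = rng.normal(size=(43, 1))
X = common + 0.7 * rng.normal(size=(43, 5))

# Eigen-decomposition of the correlation matrix: the eigenvalues are the
# "Initial Eigenvalues" of Table 33, and eigenvalue / p is the proportion
# of total variance each component explains.
R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # sort descending

# Unrotated loadings for a k-component solution; the communality of each
# variable is its row sum of squared loadings, as table 35 reports.
k = 1
loadings = eigvecs[:, :k] * np.sqrt(eigvals[:k])
communalities = (loadings ** 2).sum(axis=1)
print(eigvals.sum())  # equals p = 5, the trace of the correlation matrix
```

The eigenvalues always sum to the number of variables, which is why "% of Variance" in Table 33 is each eigenvalue divided by 12.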

