Taylor
The purpose of this guide is to explore linear regression using Excel. This note consists of the following sections:
- Summarising and describing a multi-variable data set
- Correlation analysis
- Scatter plots
- Simple regression
- Multiple regression
- Excel's regression functions
- Exercises
First, check that Excel's statistical add-in, Data Analysis, is attached to Excel. From the set of tabs at the top of your screen, click on the Data tab. If Data Analysis is attached, it will be available as an option in the Analysis group towards the top right of your screen.
If Data Analysis is not one of the options, you need to attach it by working through the following steps:
(i) Click on the File tab, near the top left of the screen, and then select Options.
(ii) Click Add-Ins (on the left of the screen), and then in the Manage box (at the bottom of the screen), select Excel Add-ins. If this is not one of the options in the dialog box, you need to install the add-ins from your Microsoft Excel installation disc.
(iii) Click Go.
(iv) In the Add-Ins box (shown on the right here), select three add-ins: Analysis ToolPak, Analysis ToolPak VBA, and Solver. Then click OK. If you get prompted that the Analysis ToolPak Add-in is not currently installed on your computer, click Yes to install it.
(v) After you load these add-ins, the Solver and Analysis ToolPak commands are available in the Analysis group on the Data tab.
1. SUMMARISING & DESCRIBING A MULTI-VARIABLE DATA SET
The Excel file ElectricityConsumption.xls contains monthly observations from January 2004 to July 2012 for the following variables:
ELEC   Residential electricity sales (kWh) per customer in a mid-Atlantic U.S. city
C66    Cooling degree hours at base temperature 66 degrees (a measure of summer heat)
C76    Cooling degree hours at base temperature 76 degrees (a measure of summer heat)
H55    Heating degree hours at base temperature 55 degrees (a measure of winter cold)
DINC   Disposable income per household ($)
AIRC   Proportion of households with air conditioning
The ultimate aim is to build a forecasting model for residential electricity consumption.
      A        B        C      D      E       F       G
 1    MONTH    ELEC     C66    C76    H55     DINC    AIRC
 2    Jan-04   681.7    20     0      10148   34825   0.698
 3    Feb-04   620.3    0      0      12504   34934   0.701
 4    Mar-04   590.8    20     0      9300    35050   0.705
 5    Apr-04   538.0    14     0      5333    35172   0.708
 6    May-04   513.4    559    3      2846    35302   0.712
 7    Jun-04   575.5    1601   83     282     35438   0.716
 8    Jul-04   1019.3   5348   833    1       35583   0.72
 9    Aug-04   1203.9   7416   1547   0       35734   0.724
10    Sep-04   1176.7   6887   1287   0       35892   0.728
11    Oct-04   723.0    2975   398    155     36056   0.731
12    Nov-04   519.0    427    5      1812    36222   0.735
13    Dec-04   604.9    9      0      5779    36391   0.739
Use the Analysis ToolPak Descriptive Statistics tool to get summary statistics (in one sequence of operations) for all 6 variables:
- From the main Excel menu, click on the Data tab
- From the Analysis group, select Data Analysis
- In the resulting dialog box, select Descriptive Statistics
In the Descriptive Statistics dialog box, specify:
- Input Range as the range containing values and variable names: B1:G104
- Click the Labels in First Row checkbox
- Output options as New Worksheet Ply with the name Summary
- Click the Summary Statistics checkbox.
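For readers who want to cross-check the tool's output outside Excel, the following is a minimal sketch in Python (standard library only). It assumes only the first 12 ELEC values shown in the data table above, so its figures will not match the full 103-month Summary worksheet; the function name `describe` is illustrative, and only a subset of the statistics Excel reports is computed.

```python
import statistics

# First 12 monthly ELEC values (Jan-04 to Dec-04) from the data table.
# Assumption: the full worksheet has 103 rows, so Excel's figures will differ.
elec = [681.7, 620.3, 590.8, 538.0, 513.4, 575.5,
        1019.3, 1203.9, 1176.7, 723.0, 519.0, 604.9]


def describe(values):
    """Return some of the statistics listed by the Descriptive Statistics tool."""
    n = len(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)
    return {
        "Mean": statistics.mean(values),
        "Standard Error": sd / n ** 0.5,  # standard error of the mean
        "Median": statistics.median(values),
        "Standard Deviation": sd,
        "Sample Variance": statistics.variance(values),
        "Range": max(values) - min(values),
        "Minimum": min(values),
        "Maximum": max(values),
        "Sum": sum(values),
        "Count": n,
    }


summary = describe(elec)
for name, value in summary.items():
    print(f"{name:20s}{value:12.3f}")
```

The same function can be applied to each of the other five columns in turn to mimic the one-pass summary that Excel produces.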
2. CORRELATION ANALYSIS
- Return to the Data worksheet
- From the main Excel menu, click on the Data tab
- From the Analysis group, select Data Analysis
- In the resulting dialog box, select Correlation
In the Correlation dialog box, specify:
- Input Range: as B1:G104 (don't include the MONTH column, column A)
- Grouped By: as Columns, so that Excel knows that each column is a variable
- The Labels in First Row checkbox should be ticked
- Output options: as New Worksheet Ply with the name Correlations
Then click OK.
The correlation matrix below should result. Correlation coefficients for pairs of variables indicate the strength and direction of the linear association between them. For example, ELEC and C76 have a correlation of 0.94, so ELEC tends to rise as C76 rises.
        ELEC    C66     C76     H55     DINC    AIRC
ELEC    1.00    0.92    0.94   -0.36    0.14    0.14
C66     0.92    1.00    0.95   -0.65    0.02    0.02
C76     0.94    0.95    1.00   -0.52    0.01    0.01
H55    -0.36   -0.65   -0.52    1.00   -0.04   -0.05
DINC    0.14    0.02    0.01   -0.04    1.00    0.94
AIRC    0.14    0.02    0.01   -0.05    0.94    1.00
You should get the same values using the Excel function =CORREL, e.g. =CORREL(B2:B104,D2:D104) for ELEC and C76. Note any variables strongly correlated with ELEC, and any strong inter-correlations between the potential explanatory variables, C66, C76, H55, DINC and AIRC.
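The calculation behind each entry of the matrix can be sketched in a few lines of Python. This is an illustrative sketch, not Excel's own code: the function name `pearson` is hypothetical, and only the first 12 rows of three columns from the data table are used, so the values will differ from the full-sample matrix above.

```python
import math

# First 12 rows (Jan-04 to Dec-04) of three columns from the data table.
# Assumption: a 12-month subset, so results differ from the 103-month matrix.
data = {
    "ELEC": [681.7, 620.3, 590.8, 538.0, 513.4, 575.5,
             1019.3, 1203.9, 1176.7, 723.0, 519.0, 604.9],
    "C76": [0, 0, 0, 0, 3, 83, 833, 1547, 1287, 398, 5, 0],
    "H55": [10148, 12504, 9300, 5333, 2846, 282, 1, 0, 0, 155, 1812, 5779],
}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Build the full (symmetric) correlation matrix as a nested dictionary.
names = list(data)
matrix = {r: {c: pearson(data[r], data[c]) for c in names} for r in names}

print("      " + "".join(f"{c:>8s}" for c in names))
for r in names:
    print(f"{r:6s}" + "".join(f"{matrix[r][c]:8.2f}" for c in names))
```

Even on this small subset, the signs agree with the full matrix: ELEC is positively correlated with C76 and negatively correlated with H55.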
3. SCATTER PLOTS
Scatter plots are of great help in identifying the strength, nature and direction of relationships between pairs of variables. In particular, they can highlight non-linear relationships, which will not necessarily be apparent from the correlation values. Since the observed correlation, 0.94, between ELEC and C76 suggests a relationship, let's examine their scatter plot.
- Return to the Data worksheet.
- Copy the ELEC column of data to column K.
- Copy C76 to column J.
- Highlight the new C76 and ELEC columns (columns J and K), as shown in the screen dump below.
From the main Excel menu, click on the Insert tab. From the Charts group, select Scatter with no lines as highlighted above.
Dealing with charts is somewhat cumbersome in Excel 2010. A simple way to insert axis titles and chart titles is to use Excel's text box option, which is also highlighted in the screen dump above. After a little work, the chart can look something like this. The scatter plot confirms the moderate-strength, linear relationship, with ELEC increasing as C76 increases.
[Figure: scatter plot of ELEC (vertical axis, 0 to 1600) against C76 (horizontal axis, 0 to 2000)]
4. SIMPLE REGRESSION
Regression analysis produces the estimated linear equation that best fits a set of data. By best fitting we mean the line (or linear model) for which there is least residual scatter, i.e. the smallest sum of squared residuals.
- Return to the Data worksheet
- From the main Excel menu, click on the Data tab
- From the Analysis group, select Data Analysis
- In the resulting dialog box, select Regression
Complete the regression dialog box by specifying:
- Input Y Range as B1:B104, i.e. ELEC as dependent variable
- Input X Range as D1:D104, i.e. C76 as independent variable
- Check the Labels box, as the first entries in each cell range are labels
- Specify Output options as New Worksheet Ply, with the name Regression1
- Under the heading Residuals, select Residuals, Residual Plots and Line Fit Plots
Then click OK.
[Screenshot: Excel regression output, with coefficient rows for Intercept and C76]
The 1st part of the output contains summary statistics for the regression as a whole, including R² (R Square) and the residual standard deviation (called Standard Error). Ignore the 2nd part, which displays ANOVA (Analysis of Variance) calculations. The 3rd part of the output indicates that the best-fitting linear model has the equation:
ELEC = 632.20 + 0.538*C76
and that the slope, 0.538, has a t-stat of 26.86 and a very small p-value. The variable C76 therefore explains a statistically significant part of the variation in ELEC. The 4th part shows the predicted value and the residual for each observation.
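To make the least-squares idea concrete, here is a minimal Python sketch of how an intercept and slope are obtained, followed by a prediction from the equation quoted above. The function names `fit_line` and `predict_elec` are hypothetical, and the fit uses only the first 12 rows of the data table, so its coefficients are not expected to reproduce 632.20 and 0.538, which come from all 103 months.

```python
# First 12 rows (Jan-04 to Dec-04) of C76 and ELEC from the data table.
# Assumption: a 12-month subset, so the fitted values differ from Excel's.
c76 = [0, 0, 0, 0, 3, 83, 833, 1547, 1287, 398, 5, 0]
elec = [681.7, 620.3, 590.8, 538.0, 513.4, 575.5,
        1019.3, 1203.9, 1176.7, 723.0, 519.0, 604.9]


def fit_line(x, y):
    """Least-squares (intercept, slope) for the model y = intercept + slope*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_elec(c76_value, intercept=632.20, slope=0.538):
    """Predicted ELEC; the defaults are the full-sample coefficients from the text."""
    return intercept + slope * c76_value


sample_intercept, sample_slope = fit_line(c76, elec)
print(f"12-month fit: ELEC = {sample_intercept:.2f} + {sample_slope:.3f}*C76")

# Prediction from the equation in the text for a month with C76 = 833.
print(f"Predicted ELEC at C76 = 833: {predict_elec(833):.1f}")
```

In Excel itself, the functions =INTERCEPT and =SLOPE applied to the ELEC and C76 columns should give the same coefficients as the Regression tool.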
4.2 REGRESSION - INTERPRETING EXCEL'S GRAPHICAL OUTPUT
The Regression tool puts one chart on top of another. Click on the top chart so that it becomes the active chart, and then move it down. The Line Fit Plot shows actual ELEC and predicted ELEC, plotted for different values of C76. The regression line (called Predicted ELEC in the legend) is shown as points rather than as a line. This can be changed by formatting the points.
[Figure: C76 Line Fit Plot, showing ELEC and Predicted ELEC (vertical axis) against C76]
The Residual Plot shows the residuals plotted against the values of the C76 variable. Check that the residuals do not display an obvious pattern. Ideally, residuals should look random: showing no systematic pattern, of much the same average size throughout, and not increasing in size as X (C76) increases. Residual plots are also useful for spotting outliers.
[Figure: C76 Residual Plot, showing Residuals (vertical axis) against C76]
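The residual checks described above can also be done numerically. The sketch below is illustrative only: it refits the line on the first 12 rows of the data table (an assumption made to keep the example self-contained), computes residuals as actual minus predicted, and flags any residual larger than two standard errors, which is a rough rule of thumb assumed here rather than something Excel applies.

```python
import math

# First 12 rows (Jan-04 to Dec-04) of C76 and ELEC from the data table.
c76 = [0, 0, 0, 0, 3, 83, 833, 1547, 1287, 398, 5, 0]
elec = [681.7, 620.3, 590.8, 538.0, 513.4, 575.5,
        1019.3, 1203.9, 1176.7, 723.0, 519.0, 604.9]


def fit_line(x, y):
    """Least-squares (intercept, slope) for the model y = intercept + slope*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = fit_line(c76, elec)
predicted = [intercept + slope * v for v in c76]
residuals = [actual - fitted for actual, fitted in zip(elec, predicted)]

# Residual standard deviation (Excel's "Standard Error"): sqrt(SSE / (n - 2)).
sse = sum(r * r for r in residuals)
std_error = math.sqrt(sse / (len(elec) - 2))

# Assumed rule of thumb: flag residuals larger than two standard errors.
# i + 2 converts the list index to the worksheet row number (row 1 holds labels).
flagged = [(i + 2, round(r, 1)) for i, r in enumerate(residuals)
           if abs(r) > 2 * std_error]

for x, r in zip(c76, residuals):
    print(f"C76 = {x:5d}   residual = {r:8.1f}")
print(f"Standard error: {std_error:.1f}")
print(f"Possible outliers (worksheet row, residual): {flagged}")
```

For a least-squares line with an intercept, the residuals always sum to zero, so what matters in the plot is their pattern and spread rather than their average.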
5. MULTIPLE REGRESSION
Can the ELEC predictions be improved if other possible explanatory variables are brought into the model? This section briefly describes how Excel's regression can be extended from simple regression (ELEC on C76) to multiple regression (ELEC on two or more variables). The purpose is to find the best equation for predicting ELEC from one or more of the independent variables. Let's regress ELEC on the other five variables.
- Return to the Data worksheet
- From the main Excel menu, click on the Data tab
- From the Analysis group, select Data Analysis
- In the resulting dialog box, select Regression
In the Regression dialog box, specify:
- Input Y Range as B1:B104, i.e. ELEC as dependent variable
- Input X Range as C1:G104, i.e. five explanatory variables
- Check the Labels checkbox
- Specify Output options as New Worksheet Ply, with the name Regression2
- Under the heading Residuals, select Residuals, Residual Plots and Line Fit Plots
Then click OK.
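As a rough illustration of what changes when a second explanatory variable is added, the Python sketch below fits ELEC on C76 and H55 by solving the least-squares normal equations on centred data. It is deliberately simplified and rests on stated assumptions: only two of the five predictors are used, only the first 12 rows of the data table are included, and the function names `fit_two_predictors` and `r_squared` are hypothetical. It therefore does not reproduce the Regression2 output.

```python
# First 12 rows (Jan-04 to Dec-04) from the data table.
# Assumption: two predictors and 12 months only, purely for illustration.
c76 = [0, 0, 0, 0, 3, 83, 833, 1547, 1287, 398, 5, 0]
h55 = [10148, 12504, 9300, 5333, 2846, 282, 1, 0, 0, 155, 1812, 5779]
elec = [681.7, 620.3, 590.8, 538.0, 513.4, 575.5,
        1019.3, 1203.9, 1176.7, 723.0, 519.0, 604.9]


def fit_two_predictors(x1, x2, y):
    """Least-squares (b0, b1, b2) for the model y = b0 + b1*x1 + b2*x2."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    d1 = [v - m1 for v in x1]
    d2 = [v - m2 for v in x2]
    dy = [v - my for v in y]
    s11 = sum(a * a for a in d1)
    s22 = sum(b * b for b in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * c for a, c in zip(d1, dy))
    s2y = sum(b * c for b, c in zip(d2, dy))
    # Solve the 2x2 normal equations by Cramer's rule.
    det = s11 * s22 - s12 * s12
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2


def r_squared(y, fitted):
    """Proportion of the variation in y explained by the fitted values."""
    mean_y = sum(y) / len(y)
    sse = sum((a - f) ** 2 for a, f in zip(y, fitted))
    sst = sum((a - mean_y) ** 2 for a in y)
    return 1 - sse / sst


# Simple regression (ELEC on C76) for comparison.
n = len(elec)
mx, my = sum(c76) / n, sum(elec) / n
slope = (sum((a - mx) * (b - my) for a, b in zip(c76, elec))
         / sum((a - mx) ** 2 for a in c76))
simple_fit = [my + slope * (a - mx) for a in c76]

# Multiple regression (ELEC on C76 and H55).
b0, b1, b2 = fit_two_predictors(c76, h55, elec)
multi_fit = [b0 + b1 * a + b2 * b for a, b in zip(c76, h55)]

r2_simple = r_squared(elec, simple_fit)
r2_multi = r_squared(elec, multi_fit)
print(f"R Square, ELEC on C76:         {r2_simple:.3f}")
print(f"R Square, ELEC on C76 and H55: {r2_multi:.3f}")
```

Adding a predictor can never lower R Square, so an increase on its own does not show that the extra variable is useful; the t-stats and p-values in the Excel output, and Adjusted R Square, are better guides.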