The mean of $X_i$ is defined as
$$\bar{X}_i = \sum_{j=1}^{M} x_{ij} / M.$$

Example 2.1

Suppose a family's spending in each of the twelve months of a year is 10.0, 19.0, 9.5, 11.0, 12.0, 11.0, 10.0, 13.0, 10.0, 10.0, 11.0 and 10.0. The mean monthly spending is
$$\bar{X} = (10.0 + 19.0 + \cdots + 10.0)/12 = 136.5/12 = 11.375.$$

Example 2.2
For the problem in Example 2.1, we shall have the following set of data:

Month      Spending - Mean
January    10.0 - 11.375 = -1.375
Feb.       19.0 - 11.375 =  7.625
Mar.        9.5 - 11.375 = -1.875
Apr.       11.0 - 11.375 = -0.375
May        12.0 - 11.375 =  0.625
June       11.0 - 11.375 = -0.375
July       10.0 - 11.375 = -1.375
Aug.       13.0 - 11.375 =  1.625
Sep.       10.0 - 11.375 = -1.375
Oct.       10.0 - 11.375 = -1.375
Nov.       11.0 - 11.375 = -0.375
Dec.       10.0 - 11.375 = -1.375
The above data are more informative. Note that the sign plays an important role: if a deviation is positive (negative), the spending is above (below) average. Thus, one can easily see that in January the family spends less than average, and in May it spends more.

After the mean is calculated, we can extract more information by calculating the variance, which is a measure of dispersion.
Definition

Let $X_i$ denote a variable and $x_{i1}, x_{i2}, \ldots, x_{iM}$ denote $M$ individual observations of $X_i$. The variance of $X_i$ is defined as
$$\sigma_i^2 = \sum_{j=1}^{M} (x_{ij}')^2 / M = \sum_{j=1}^{M} (x_{ij} - \bar{X}_i)^2 / M,$$
where $\bar{X}_i$ is the mean of $X_i$ and $x_{ij}' = x_{ij} - \bar{X}_i$. $\sigma_i$ is called the standard deviation of $X_i$.
Example 2.3

For the family spending problem described in Example 2.1, the variance is calculated as follows:
$$\sigma^2 = ((-1.375)^2 + (7.625)^2 + (-1.875)^2 + (-0.375)^2 + (0.625)^2 + (-0.375)^2 + (-1.375)^2 + (1.625)^2 + (-1.375)^2 + (-1.375)^2 + (-0.375)^2 + (-1.375)^2)/12 = 6.21.$$
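To make these computations concrete, here is a minimal Python sketch (the variable names are our own) that reproduces the mean of Example 2.1 and the variance of Example 2.3:

```python
# Monthly spending data of Example 2.1.
spending = [10.0, 19.0, 9.5, 11.0, 12.0, 11.0,
            10.0, 13.0, 10.0, 10.0, 11.0, 10.0]

M = len(spending)
mean = sum(spending) / M                        # X-bar = 11.375
deviations = [x - mean for x in spending]       # x' = x - X-bar, as in Example 2.2
variance = sum(d * d for d in deviations) / M   # sigma^2, divisor M as in the text

print(mean, variance)   # 11.375 and about 6.21
```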
Let us imagine that we have a set of personnel data in which two variables are involved. One is the number of years of college education and the other is the body weight (in kilograms). Some typical examples may be as follows:

X1 (weight)    X2 (years of college education)
60             4
70             8
75             2
65             4
72             6
73             8
80             4
55             6

Table 2.1
One can see that there is another problem: the values of weight are much larger than the values of years of college education. Thus variable X1 will dominate variable X2. Since we want these variables to play equally important roles, we want to normalize them. There are several ways to normalize the data. One practical and commonly used method is to normalize the data so that the variances are all 1. The normalization procedure is as follows:
Input: $x_{i1}, x_{i2}, \ldots, x_{iM}$.

Step 1. Calculate the mean $\bar{X}_i$.
Step 2. Let $x_{ij}' = x_{ij} - \bar{X}_i$.
Step 3. Let $\sigma_i^2 = \sum_{j=1}^{M} (x_{ij}')^2 / M$.
Step 4. Let $x_{ij}'' = x_{ij}' / \sigma_i$.

Example 2.4

For the data in Table 2.1,
$$\sigma_1^2 = 59.436, \quad \sigma_1 = 7.710, \quad \sigma_2^2 = 3.937, \quad \sigma_2 = 1.984.$$
After normalization with respect to means and variances, the data become

X1 (weight)    X2 (years of college education)
-1.135         -0.63
 0.162          1.38
 0.810         -1.64
-0.486         -0.63
 0.421          0.37
 0.551          1.38
 1.459         -0.63
-1.783          0.37
The reader can see that the influence of the units of measurement has now been eliminated.
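The four-step procedure translates directly into code. The following sketch (the function name normalize is our own) applies Steps 1 through 4 to the two variables of Table 2.1; note that, following the text, the variance uses divisor M rather than the M - 1 of sample statistics:

```python
def normalize(values):
    """Normalize a variable to zero mean and unit variance (Steps 1-4)."""
    M = len(values)
    mean = sum(values) / M                                 # Step 1
    deviations = [v - mean for v in values]                # Step 2
    sigma = (sum(d * d for d in deviations) / M) ** 0.5    # Step 3
    return [d / sigma for d in deviations]                 # Step 4

weight = [60, 70, 75, 65, 72, 73, 80, 55]   # X1 of Table 2.1
education = [4, 8, 2, 4, 6, 8, 4, 6]        # X2 of Table 2.1

print(normalize(weight))      # -1.135, 0.162, 0.810, ...
print(normalize(education))   # matches the normalized table above up to rounding
```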
[Fig. 2.1: three scatter plots of X2 versus X1, panels (a), (b), and (c).]
Definition

Let variable $X_1$ assume values $x_{11}, \ldots, x_{1M}$ and variable $X_2$ assume values $x_{21}, \ldots, x_{2M}$. The covariance between $X_1$ and $X_2$ is defined as
$$v_{12} = \sum_{j=1}^{M} (x_{1j} - \bar{X}_1)(x_{2j} - \bar{X}_2) / M.$$
If, instead of $X_1$ and $X_2$, we use the notation of variables $X$ and $Y$, then we use $v_{xy}$ to denote the covariance between $X$ and $Y$. If we have two identical variables, then the covariance between these variables degenerates into the variance of the individual variable. That is, $v_{11} = \sigma_1^2$.
Example 2.5

Consider the following two variables:

X1:  170  165  150  180  173  184  153
X2:  130  127  121  140  130  144  125
Example 2.6

Consider the following eight observations of two variables:

X1:  1.000   0.707   0.000  -0.707  -1.000  -0.707   0.000   0.707
X2:  0.000   0.707   1.000   0.707   0.000  -0.707  -1.000  -0.707

We have $\bar{X}_1 = 0.0$ and $\bar{X}_2 = 0.0$. Therefore
$$v_{12} = ((1.0 - 0.0)(0.0 - 0.0) + (0.707 - 0.0)(0.707 - 0.0) + (0.0 - 0.0)(1.0 - 0.0) + (-0.707 - 0.0)(0.707 - 0.0) + (-1.0 - 0.0)(0.0 - 0.0) + (-0.707 - 0.0)(-0.707 - 0.0) + (0.0 - 0.0)(-1.0 - 0.0) + (0.707 - 0.0)(-0.707 - 0.0))/8 = 0.$$
The reader may have already noted that the above eight points lie on a circle. That the covariance between $X_1$ and $X_2$ is zero is therefore not surprising.
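A covariance routine is just as short. This sketch (the helper name covariance is our own) verifies the computation of Example 2.6:

```python
def covariance(xs, ys):
    """v_xy with divisor M, per the definition above."""
    M = len(xs)
    mx = sum(xs) / M
    my = sum(ys) / M
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / M

# The eight points on the circle (Example 2.6).
x1 = [1.0, 0.707, 0.0, -0.707, -1.0, -0.707, 0.0, 0.707]
x2 = [0.0, 0.707, 1.0, 0.707, 0.0, -0.707, -1.0, -0.707]

print(covariance(x1, x2))   # 0.0
```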
Again, as we discussed before, the covariance is heavily influenced by the units of measurement. In Example 2.5, X1 may be body height in centimeters and X2 may be body weight in pounds. If we measured body weight in tons, the covariance would approach zero simply because the values of X2 would be too small, although the two variables actually do covary. If the influence of the units of measurement has to be eliminated, one may use correlations.
Definition

Let variables $X_1$ and $X_2$ assume values $x_{11}, x_{12}, \ldots, x_{1M}$ and $x_{21}, x_{22}, \ldots, x_{2M}$ respectively. Let $\bar{X}_1$ and $\bar{X}_2$ be the means of $X_1$ and $X_2$ respectively, and let $\sigma_1$ and $\sigma_2$ be the standard deviations of $X_1$ and $X_2$ respectively. Then the correlation between $X_1$ and $X_2$ is defined as
$$r_{12} = \frac{1}{M} \sum_{j=1}^{M} (x_{1j} - \bar{X}_1)(x_{2j} - \bar{X}_2) \Big/ (\sigma_1 \sigma_2) = v_{12} / (\sigma_1 \sigma_2).$$
Note that if these two variables have been normalized with respect to variances, the correlation and the covariance between these two variables will be the same.

It should be easy for the reader to prove that the correlation satisfies the following properties:
(1) $r_{11} = 1$
(2) $r_{12} = r_{21}$
(3) $-1 \le r_{12} \le 1$
Given a set of variables $X_1, X_2, \ldots, X_N$, it is often desirable to describe the correlations, or covariances, among these variables in matrix form. We shall use $V$ to denote the covariance matrix and $R$ to denote the correlation matrix. Each matrix is of dimension $N \times N$. For the covariance (correlation) matrix $V$ ($R$), the $(i,j)$th entry $V[i,j]$ ($R[i,j]$) is the covariance $v_{ij}$ (correlation $r_{ij}$) between $X_i$ and $X_j$.
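With NumPy, both matrices can be built at once. In this sketch (the function name cov_corr is our own) each variable is one row of the input array; the divisor is M, as in the definitions above, whereas NumPy's own np.cov defaults to M - 1:

```python
import numpy as np

def cov_corr(data):
    """Return (V, R) for an N x M array holding one variable per row."""
    centered = data - data.mean(axis=1, keepdims=True)
    V = centered @ centered.T / data.shape[1]   # V[i, j] = v_ij
    sigma = np.sqrt(np.diag(V))                 # standard deviations
    R = V / np.outer(sigma, sigma)              # R[i, j] = v_ij / (sigma_i sigma_j)
    return V, R

# The two variables of Table 2.1, one per row.
data = np.array([[60.0, 70, 75, 65, 72, 73, 80, 55],
                 [4.0, 8, 2, 4, 6, 8, 4, 6]])
V, R = cov_corr(data)
print(np.diag(V))   # 59.44 and 3.94, the variances found earlier
print(R[0, 1])      # correlation between weight and education
```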
Example 2.7
Consider the data in Table 2.2. The covariance matrix and the correlation matrix
are shown in Table 2.3 and Table 2.4 respectively. Let us consider the correlation
matrix. From this matrix, we know that the change in population (X1) from 1950 to
1960 is highly related to the change in employment (X3) during the same period. On
the other hand, the change in median income (X6) is almost unrelated to all other
variables.
SELECTED POPULATION AND ECONOMIC STATISTICS FOR SELECTED CITIES

X1: change in population, 1950-1960 (percent)
X2: employment per household, 1950
X3: change in employment, 1950-1960 (percent)
X4: median age of the population (years)
X5: median income, 1950 (dollars)
X6: change in median income, 1950-1960 (percent)

City      X1      X2      X3      X4       X5       X6
  1    98.30    1.36   63.50   27.20  3473.00   120.80
  2    26.80    1.20   23.30   23.20  2367.00    74.90
  3    40.80    1.38   41.90   27.20  2126.00   140.80
  4    22.50    1.28   15.70   28.20  4045.00    71.50
  5    95.30    1.26  103.20   28.20  5128.00    60.60
  6    80.70    1.32   48.90   27.30  3098.00    83.40
  7    33.20    1.19   29.00   23.30  1846.00    63.70
  8   -15.90    1.09   -9.10   30.50  1932.00   111.30
  9    19.20    1.10   22.80   33.40  2592.00    95.70
 10    54.90    1.30   40.40   26.30  2880.00    81.30
 11    56.40    1.48   43.70   30.90  3042.00    96.40
 12    31.00    1.12   19.30   23.90  1746.00    98.70
 13    25.60    1.42   36.20   23.60   885.00   464.30
 14    51.10    1.23   59.90   23.30  1816.00    66.70
 15    27.80    1.38   17.20   30.10  2830.00    93.80
 16    16.30    1.14   16.20   32.70  2232.00   112.90
 17   264.20    1.41  245.40   26.10  3398.00    99.90
 18   108.20    1.30  110.30   26.70  3307.00    74.30
 19    77.40    1.29   41.10   24.30  2225.00    87.30
 20    -8.70    1.26  -12.40   41.50  5144.00   166.50
 21    49.70    1.28   27.10   23.90  2207.00    97.80
 22    16.20    1.30   15.40   24.90  2642.00    88.20
 23    63.50    1.33   51.40   28.90  2491.00   115.00
 24    79.40    1.39   70.00   25.70  2646.00   111.00
 25    63.10    1.26   67.50   24.70  2095.00    80.90
 26    30.30    1.22   25.90   31.90  2209.00    96.20
 27     8.60    1.16   12.70   21.80  1208.00    98.30
 28   188.40    1.34  175.60   26.70  3842.00    84.60
 29     2.80    1.24    5.50   27.20  1599.00   128.60
 30    28.00    1.24   26.30   27.90  2371.00    94.20
 31    20.90    1.20   12.50   25.50  2913.00    89.20
 32    11.80    1.13    3.80   33.30  2325.00    96.50
 33    -3.10    1.07    0.20   31.80  1725.00   108.90
 34   161.30    1.27  157.70   25.20  3990.00    79.80
 35    15.90    1.30    4.30   29.00  3386.00    67.10
 36    12.90    1.21   11.10   27.80  2530.00    83.80
 37    23.70    1.15   28.40   22.60  1598.00    84.30
 38    24.00    1.21   15.50   31.50  2494.00    90.60
 39     2.20    1.30   -4.20   28.00  2819.00    72.20
 40    19.40    1.25   18.20   30.10  2400.00    84.80
 41    22.10    1.22   13.10   30.90  2069.00   110.40
 42    92.90    1.24   85.50   26.10  3502.00    74.20
 43    -4.40    1.33    2.30   35.50  4578.00   125.10
 44    -4.00    1.21   -2.40   29.90  2443.00    82.50
 45   104.90    1.37   49.10   28.70  2638.00   100.10
 46    15.50    1.28   12.90   29.20  2274.00   113.70
 47    13.80    1.22   22.20   29.40  1972.00   127.10
 48   -14.30    1.29    3.30   33.90  5760.00    48.10
 49     6.30    1.15   10.30   22.70  3233.00    48.70
 50    49.50    1.29   42.60   25.00  1764.00   209.00

Table 2.2
          X1        X2        X3        X4         X5         X6
X1    2753.92      2.22   2483.46    -62.07   10876.57    -271.96
X2       2.22      0.00      1.85     -0.01      23.23       1.40
X3    2483.46      1.85   2339.34    -56.41   11250.71    -181.65
X4     -62.07     -0.01    -56.41     14.60    1696.77      -2.79
X5   10876.57     23.23  11250.71   1696.77  967468.75  -19475.46
X6    -271.96      1.40   -181.65     -2.79  -19475.46    3385.94

Table 2.3
        X1     X2     X3     X4     X5     X6
X1    1.00   0.47   0.97  -0.30   0.21  -0.08
X2    0.47   1.00   0.42  -0.05   0.26   0.27
X3    0.97   0.42   1.00  -0.30   0.23  -0.06
X4   -0.30  -0.05  -0.30   1.00   0.45  -0.01
X5    0.21   0.26   0.23   0.45   1.00  -0.34
X6   -0.08   0.27  -0.06  -0.01  -0.34   1.00

Table 2.4
[Fig. 2.2: a black box whose input is a current and whose output is a voltage.]

Consider the black box of Fig. 2.2, whose input is current and whose output is voltage. We can measure the current and the voltage. Thus, for every input observation, we have a corresponding output observation. A typical case may be as follows:
X (current):  10  11  12   9   8  13   7  14
Y (voltage):  21  23  23  19  17  25  14  28

Table 2.5
There are many ways in which the current and voltage can be related. The simplest model is to assume a linear relationship, that is,
$$Y = b_0 + b_1 X.$$
Of course, it is quite unlikely that the above model holds exactly. We are interested in the $b_0$ and $b_1$ which best fit our observed data. The meaning of best fitting can be explained by considering Fig. 2.3.
[Fig. 2.3: the data of Table 2.5 with two different straight lines drawn through the points, panels (a) and (b).]
In both figures, the points do not all lie on the line. We may therefore say that
errors occur if we use these straight lines to approximate our data. As can be seen, the
line in Fig. 2.3(a) is much better than that in Fig. 2.3(b). In the following, we shall
show how the best fitting line can be found.
Let us assume that corresponding to the observations $x_1, x_2, \ldots, x_M$ we observe $y_1, y_2, \ldots, y_M$. Ideally,
$$y_i' = b_0 + b_1 x_i. \qquad (2.1)$$
The observed value corresponding to $x_i$ is $y_i$. Therefore, we have an error
$$y_i - y_i' = y_i - b_0 - b_1 x_i. \qquad (2.2)$$
The total sum of squares of errors is
$$E = \sum_{i=1}^{M} (y_i - y_i')^2 = \sum_{i=1}^{M} (y_i - b_0 - b_1 x_i)^2. \qquad (2.3)$$
We shall choose $b_0$ and $b_1$ on the basis that they minimize $E$. This is achieved by differentiating $E$ with respect to $b_0$ and $b_1$:
$$\frac{\partial E}{\partial b_0} = -2 \sum_{i=1}^{M} (y_i - b_0 - b_1 x_i) \qquad (2.4)$$
$$\frac{\partial E}{\partial b_1} = -2 \sum_{i=1}^{M} x_i (y_i - b_0 - b_1 x_i). \qquad (2.5)$$
The values $b_0$ and $b_1$ are found by solving
$$\sum_{i=1}^{M} (y_i - b_0 - b_1 x_i) = 0 \qquad (2.6)$$
and
$$\sum_{i=1}^{M} x_i (y_i - b_0 - b_1 x_i) = 0. \qquad (2.7)$$
We have
$$M b_0 + \Big(\sum_{i=1}^{M} x_i\Big) b_1 = \sum_{i=1}^{M} y_i \qquad (2.8)$$
$$\Big(\sum_{i=1}^{M} x_i\Big) b_0 + \Big(\sum_{i=1}^{M} x_i^2\Big) b_1 = \sum_{i=1}^{M} x_i y_i. \qquad (2.9)$$
Dividing by $M$,
$$b_0 + \Big(\sum_{i=1}^{M} x_i / M\Big) b_1 = \Big(\sum_{i=1}^{M} y_i\Big) / M \qquad (2.10)$$
$$\Big(\sum_{i=1}^{M} x_i / M\Big) b_0 + \Big(\sum_{i=1}^{M} x_i^2 / M\Big) b_1 = \Big(\sum_{i=1}^{M} x_i y_i\Big) / M. \qquad (2.11)$$
Solving, we obtain
$$b_1 = \frac{\displaystyle\sum_{i=1}^{M} x_i y_i - \Big(\sum_{i=1}^{M} x_i\Big)\Big(\sum_{i=1}^{M} y_i\Big) / M}{\displaystyle\sum_{i=1}^{M} x_i^2 - \Big(\sum_{i=1}^{M} x_i\Big)^2 / M}. \qquad (2.12)$$
Equivalently,
$$b_1 = \frac{\displaystyle\sum_{i=1}^{M} (x_i - \bar{x})(y_i - \bar{y})}{\displaystyle\sum_{i=1}^{M} (x_i - \bar{x})^2} \qquad (2.13)$$
$$\phantom{b_1} = v_{xy} / \sigma_x^2 \qquad (2.14)$$
$$b_0 = \bar{y} - b_1 \bar{x}. \qquad (2.15)$$
Example 2.8:

For the set of data in Table 2.5, we have
$$\bar{x} = 10.5, \quad \bar{y} = 21.25, \quad v_{xy} = 9.5, \quad \sigma_x^2 = 5.25.$$
Therefore
$$b_1 = v_{xy} / \sigma_x^2 = 9.5 / 5.25 = 1.81$$
and
$$b_0 = \bar{y} - b_1 \bar{x} = 21.25 - 1.81 \times 10.5 = 2.25.$$
We have
$$Y = 2.25 + 1.81X.$$
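Equations (2.13) through (2.15) translate into a short function. This sketch (the name fit_line is our own) reproduces Example 2.8, including the prediction discussed below:

```python
def fit_line(xs, ys):
    """Least-squares line via (2.13)-(2.15); returns (b0, b1)."""
    M = len(xs)
    xbar = sum(xs) / M
    ybar = sum(ys) / M
    vxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / M
    varx = sum((x - xbar) ** 2 for x in xs) / M
    b1 = vxy / varx            # (2.14)
    b0 = ybar - b1 * xbar      # (2.15)
    return b0, b1

current = [10, 11, 12, 9, 8, 13, 7, 14]      # X of Table 2.5
voltage = [21, 23, 23, 19, 17, 25, 14, 28]   # Y of Table 2.5

b0, b1 = fit_line(current, voltage)
print(b0, b1)          # 2.25 and about 1.81
print(b0 + b1 * 7.5)   # predicted voltage at x = 7.5, about 15.8
```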
[Fig. 2.4: the data of Table 2.5 and the fitted line plotted in the X-Y plane.]
Since we assumed that variable Y measures voltage and variable X measures current, and our linear regression model indicates that the relation between Y and X can be approximated by the equation
$$Y = 2.25 + 1.81X,$$
we can realize the system in the black box by the following passive and linear circuit.

[Fig. 2.5: a linear circuit with current (X) as input and voltage (Y) as output.]
Let us assume that we are given a certain value of X, say x = 7.5. Can we guess what Y should be? It is not unreasonable to use Y = 2.25 + 1.81X to make an educated guess. That is,
$$Y = 2.25 + 1.81 \times 7.5 \approx 15.83.$$
One can see that linear regression is indeed an information extraction method. We were given only a set of data to start with. We have now established a relationship between X and Y and can predict, with some degree of confidence, the output associated with a new input.
Example 2.9

Let us assume that we have $y = e^{-x^2/2}$. In the interval [0, 1], we may approximate this function by a straight line. Let us use 11 values of x (0, 0.1, 0.2, ..., 0.9, 1.0). For every $x_i$, we have a corresponding $y_i$, as in the following table:
x_i:  0.0    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    1.0
y_i:  1.000  0.995  0.980  0.956  0.923  0.882  0.835  0.782  0.726  0.667  0.606
Using linear regression analysis, we obtain
$$\bar{y} = 0.85, \quad \bar{x} = 0.5, \quad \sigma_x = 0.316, \quad v_{xy} = -0.0406,$$
so that $b_1 = v_{xy} / \sigma_x^2 = -0.0406 / 0.1 = -0.406$ and $b_0 = \bar{y} - b_1 \bar{x} = 0.85 + 0.406 \times 0.5 = 1.053$. We have
$$y = 1.053 - 0.406x.$$
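The fit can be checked numerically; this sketch reuses the fit_line function from the earlier example:

```python
import math

xs = [i / 10 for i in range(11)]          # 0.0, 0.1, ..., 1.0
ys = [math.exp(-x * x / 2) for x in xs]   # y = exp(-x^2 / 2)

b0, b1 = fit_line(xs, ys)                 # fit_line as defined above
print(round(b0, 3), round(b1, 3))         # about 1.054 and -0.407,
                                          # i.e., 1.053 and -0.406 up to rounding
```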
The function $y = e^{-x^2/2}$ and the line $y = 1.053 - 0.406x$ are now plotted in Fig. 2.6.

[Fig. 2.6: the curve y = e^{-x^2/2} and the fitted line y = 1.053 - 0.406x over the interval [0, 1].]
Let us now put the linear regression problem in matrix form. Ideally,
$$y_1' = b_0 + b_1 x_1$$
$$\vdots$$
$$y_M' = b_0 + b_1 x_M.$$
The above equations can be expressed as
$$Y = XB, \qquad (2.16)$$
where
$$Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end{bmatrix}, \qquad X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_M \end{bmatrix}, \qquad \text{and} \qquad B = \begin{bmatrix} b_0 \\ b_1 \end{bmatrix}.$$
In matrix form, the total sum of squares of errors in (2.3) is
$$E = (Y - XB)'(Y - XB), \qquad (2.17)$$
where $X'$ denotes the transpose of $X$. It can be easily proved that
$$X'X = \begin{bmatrix} M & \sum_{i=1}^{M} x_i \\ \sum_{i=1}^{M} x_i & \sum_{i=1}^{M} x_i^2 \end{bmatrix} \qquad (2.18)$$
and
$$X'Y = \begin{bmatrix} \sum_{i=1}^{M} y_i \\ \sum_{i=1}^{M} x_i y_i \end{bmatrix}, \qquad (2.19)$$
so the normal equations (2.8) and (2.9) become
$$(X'X)B = X'Y. \qquad (2.20)$$
Therefore
$$B = \begin{bmatrix} b_0 \\ b_1 \end{bmatrix} = (X'X)^{-1}(X'Y). \qquad (2.21)$$
Example 2.10:

Let us consider the data in Example 2.9. We have
$$X = \begin{bmatrix} 1 & 0.0 \\ 1 & 0.1 \\ 1 & 0.2 \\ 1 & 0.3 \\ 1 & 0.4 \\ 1 & 0.5 \\ 1 & 0.6 \\ 1 & 0.7 \\ 1 & 0.8 \\ 1 & 0.9 \\ 1 & 1.0 \end{bmatrix}, \qquad Y = \begin{bmatrix} 1.000 \\ 0.995 \\ 0.980 \\ 0.956 \\ 0.923 \\ 0.882 \\ 0.835 \\ 0.782 \\ 0.726 \\ 0.667 \\ 0.606 \end{bmatrix},$$
$$X'X = \begin{bmatrix} 11 & 5.5 \\ 5.5 & 3.85 \end{bmatrix}, \qquad X'Y = \begin{bmatrix} 9.353 \\ 4.229 \end{bmatrix}, \qquad (X'X)^{-1} = \begin{bmatrix} 0.318 & -0.454 \\ -0.454 & 0.909 \end{bmatrix}.$$
Therefore
$$B = \begin{bmatrix} b_0 \\ b_1 \end{bmatrix} = \begin{bmatrix} 0.318 & -0.454 \\ -0.454 & 0.909 \end{bmatrix} \begin{bmatrix} 9.353 \\ 4.229 \end{bmatrix} = \begin{bmatrix} 1.053 \\ -0.406 \end{bmatrix}$$
and
$$y = 1.053 - 0.406x.$$
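In NumPy the whole of Example 2.10 is a few lines. This sketch solves the normal equations with np.linalg.solve instead of forming the inverse explicitly, which is numerically safer and gives the same result:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 11)               # 0.0, 0.1, ..., 1.0
y = np.exp(-x ** 2 / 2)

X = np.column_stack([np.ones_like(x), x])   # design matrix of (2.16)
B = np.linalg.solve(X.T @ X, X.T @ y)       # solves (X'X)B = X'Y, i.e. (2.20)
print(B)                                    # about [1.054, -0.407]
```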
Let us now consider a black box which has N inputs x1, ..., xN and one output y.

[Fig. 2.7: a black box with inputs x1, ..., xN and output y.]

A very simple model would be
$$y = b_0 + b_1 x_1 + \cdots + b_N x_N. \qquad (2.22)$$
Let us assume that for every set of values $(x_{i1}, x_{i2}, \ldots, x_{iN})$ of the independent variables, we have a corresponding idealized dependent variable $y_i'$. Thus we have
$$y_1' = b_0 + b_1 x_{11} + \cdots + b_N x_{1N}$$
$$\vdots$$
$$y_M' = b_0 + b_1 x_{M1} + \cdots + b_N x_{MN}. \qquad (2.23)$$
The total sum of squares of errors is
$$E = \sum_{i=1}^{M} (y_i - y_i')^2 = \sum_{i=1}^{M} (y_i - b_0 - b_1 x_{i1} - b_2 x_{i2} - \cdots - b_N x_{iN})^2.$$
Differentiating and setting the derivatives to zero,
$$\frac{\partial E}{\partial b_0} = -2 \sum_{i=1}^{M} (y_i - b_0 - b_1 x_{i1} - b_2 x_{i2} - \cdots - b_N x_{iN}) = 0$$
$$\vdots \qquad (2.24)$$
$$\frac{\partial E}{\partial b_N} = -2 \sum_{i=1}^{M} x_{iN} (y_i - b_0 - b_1 x_{i1} - b_2 x_{i2} - \cdots - b_N x_{iN}) = 0.$$
M
We have
18
i 1
i 1
i 1
i 1
i 1
i 1
i 1
(2.25)
i 1
i 1
i 1
i 1
Y ,
X
and B
y M
1 x M 1 x M 2 x MN
bN
and express the equations in (2.25) in matrix forms:
( X ' X ) B X 'Y .
To obtain B, we have
( X ' X ) 1 ( X ' X ) B ( X ' X ) 1 X ' Y
or
B ( X ' X ) 1 X ' Y .
(2.26)
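The general case is the same computation with a wider design matrix. Here is a sketch (the function name multiple_regression is our own); applied to the data of Table 2.6 below, it yields coefficients close to those of (2.27):

```python
import numpy as np

def multiple_regression(inputs, targets):
    """Solve (2.26); returns [b0, b1, ..., bN] for M observations."""
    Z = np.asarray(inputs, dtype=float)          # M x N matrix of x_ij
    y = np.asarray(targets, dtype=float)
    X = np.column_stack([np.ones(len(Z)), Z])    # prepend the column of ones
    return np.linalg.solve(X.T @ X, X.T @ y)     # B = (X'X)^{-1} X'Y

# The data of Table 2.6 (X1, X2 as inputs, Y as the target):
x1 = [1.28, 1.09, 0.87, 1.35, 1.27, 0.83, 1.14, 1.15, 1.25, 1.37, 0.96, 1.14]
x2 = [27.0, 1.0, 1.0, 14.0, 14.0, 6.0, 1.0, 6.0, 18.0, 39.0, 8.0, 10.0]
y = [25.0, 10.0, 9.0, 25.0, 25.0, 12.0, 16.0, 14.0, 18.0, 25.0, 12.0, 13.0]

print(multiple_regression(np.column_stack([x1, x2]), y))
```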
Example 2.11: (Socio-economic Data)

In this example, we shall use some socio-economic data obtained from [Burns and Harman 1966], as shown in Table 2.6. We use the median value of houses as the dependent variable and the other two variables (median school years and misc. professional services) as the independent variables. The linear regression equation obtained is
$$y = -10.743 + 22.219 x_1 + 0.1992 x_2. \qquad (2.27)$$
In Table 2.7, we list the value of each house obtained through the use of (2.27), the actual value, and the percentage error. The reader can see that the average error is found to be 14%.
Individual (Tract No.)    X1      X2      Y
 1                        1.28    27.0    25.0
 2                        1.09     1.0    10.0
 3                        0.87     1.0     9.0
 4                        1.35    14.0    25.0
 5                        1.27    14.0    25.0
 6                        0.83     6.0    12.0
 7                        1.14     1.0    16.0
 8                        1.15     6.0    14.0
 9                        1.25    18.0    18.0
10                        1.37    39.0    25.0
11                        0.96     8.0    12.0
12                        1.14    10.0    13.0

(X1: median school years; X2: misc. professional services; Y: median value of houses)

Table 2.6
Individual Tract    Y (Actual Value)    Y (Estimated)    Percentage Error
 1                  25.0                22.9             0.09
 2                  10.0                13.7             0.27
 3                   9.0                 9.0             0.00
 4                  25.0                22.2             0.13
 5                  25.0                20.4             0.22
 6                  12.0                 8.8             0.36
 7                  16.0                14.78            0.08
 8                  14.0                15.96            0.12
 9                  18.0                20.48            0.12
10                  25.0                27.19            0.08
11                  12.0                12.12            0.01
12                  13.0                16.50            0.21

Average error = 1.69/12 = 0.14 = 14%

Table 2.7
Example 2.12: A Set of Artificial Data

Consider the following set of artificial data.

  x1      x2       x3          y
 25.0    34.0    200.0      135.638
 51.0    70.0     36.0     -208.327
 85.0   100.0     35.0     -255.589
 54.0   720.0     51.0    -4983.289
 45.0    78.0      5.0      339.020
 70.0    88.0    654.0      559.850
 22.0    11.0    428.0      586.920
  1.0     5.0      7.0      -25.658
-24.0   -51.0   -750.0     -725.615
 51.0   -11.0      6.0      351.608

Table 2.8