1 Introduction
Independent component analysis (ICA) [Jutten and Herault, 1988] and projection pursuit (PP) [Friedman, 1987]
are methods for recovering underlying source signals from linear mixtures of those signals. This rather terse
description does not capture the deep connection between ICA/PP and the fundamental nature of the physical
world. In the following pages, we hope to establish not only that ICA/PP are powerful and useful tools, but
also that this power follows naturally from the fact that ICA/PP are based on assumptions which are remarkably
attuned to the spatiotemporal structure of the physical world.
Most measured quantities are actually mixtures of other quantities. Typical examples are, i) sound signals in
a room with several people talking simultaneously, ii) an EEG signal, which contains contributions from many
different brain regions, and, iii) a person's height, which is determined by contributions from many different
genetic and environmental factors. Science is, to a large extent, concerned with establishing the precise nature
of the component processes responsible for a given set of measurements, whether these involve height, EEG
signals, or even IQ. Under certain conditions, the underlying sources of measured quantities can be recovered
by making use of methods (PP and ICA) based on two intimately related assumptions.
The more intuitively obvious of these assumptions is that different physical processes tend to generate signals
that are statistically independent of each other. This suggests that one way to recover source signals from
signal mixtures is to find transformations of those mixtures that produce independent signal components. This
independence is given much emphasis in the ICA literature, although an apparently subsidiary assumption, that
source signals have amplitude histograms that are non-Gaussian, is also required. In (apparent) contrast, the
PP method relies on the assumption that any linear mixture of any set of (finite variance) source signals is
approximately Gaussian (a consequence of the central limit theorem), and that the source signals themselves are
not Gaussian. Thus, another method for extracting source signals from linear mixtures of those signals is to find
transformations of the signal mixtures that extract non-Gaussian signals. It can be shown that the assumption
of statistical independence is implicit in the assumption that source signals are non-Gaussian, and therefore
that both PP and ICA are actually based on the same assumptions.
Within the literature, PP is used to extract one signal at a time, whereas ICA extracts a set of signals
simultaneously. However, like the apparently different assumptions of PP and ICA, this difference is superficial,
and reflects the underlying histories of the two methods rather than any fundamental difference between them.
Recent applications of ICA include separation of different speech signals [Bell and Sejnowski, 1995], analysis of
EEG data [Makeig et al., 1997], functional magnetic resonance imaging (fMRI) data [McKeown et al., 1998],
image processing [Bell and Sejnowski, 1997], and the relation between biological image processing and ICA
[van Hateren and van der Schaaf, 1998].
If a set of source signals s_t is combined by a mixing matrix A to form a set of signal mixtures x_t, then

x_t = A s_t   (1)
Both ICA and PP are capable of taking the signal mixtures x and recovering the sources s. That the mixtures
can be separated in principle is easily demonstrated now that the problem has been summarised in matrix
algebra. An 'unmixing' matrix W is defined such that:

s_t = W x_t   (2)

Given that each row in W specifies how the mixtures in x are recombined to produce one source signal, it
follows that it must be possible to recover one signal at a time by using a different row vector to extract each
signal. For example, if only one signal is to be extracted from M signal mixtures then W is a 1 × M matrix.
Thus, the shape of the unmixing matrix W depends upon how many signals are to be extracted. Usually, ICA
is used to extract a number of sources simultaneously, whereas PP is used to extract one source at a time.
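As a concrete illustration, here is a minimal Python/numpy sketch of Equations (1) and (2), assuming a known,
hypothetical 2 × 2 mixing matrix A; in practice A is unknown, and estimating W is precisely the job of ICA and PP.

import numpy as np

# Two non-Gaussian source signals, one per row: a sawtooth and a sine.
t = np.linspace(0, 10, 1000)
s = np.vstack([np.mod(t, 1.0) - 0.5,          # sawtooth source
               np.sin(2 * np.pi * t)])        # sine source

A = np.array([[1.0, 0.6],                     # hypothetical mixing matrix
              [0.4, 1.0]])
x = A @ s                                     # Equation (1): x_t = A s_t

W = np.linalg.inv(A)                          # here W is known exactly
y = W @ x                                     # Equation (2): recovers s

w1 = W[0:1, :]                                # a single 1 x M row vector ...
y1 = w1 @ x                                   # ... extracts one source at a time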
A single source can be recovered by finding a line through the origin of X that is orthogonal to all but one
direction S1′. Such a line is defined by a vector w1 = (w1, w2) (depicted as a dashed line in Figure 1b), defined
so that only components of X that lie along the direction S1′ are transformed to non-zero values of y = w1 x.
This is depicted graphically in Figure 1b, with the result of unmixing both signals y = W x depicted in Figure 2.
To summarise, the linear transformation y_t = W x_t produces a scalar value for each point x_t in X, so that a
single signal results from the transformation y = W x. The signal amplitude y_t at time t is found by taking the
inner product of W with a point x_t. As the row vector W is defined to be orthogonal to the directions corresponding
to all but one source signal in X, only that signal will be projected to non-zero values y = W x.
Having demonstrated that an unmixing matrix W exists that can extract one or more source signals from a
mixture, the following sections describe how PP and ICA can be used to obtain values for W .
The degree of similarity between two signals x and y can be measured by their correlation:

ρ(x, y) = Cov(x, y) / (σ_x σ_y)   (3)

where σ_x and σ_y are the standard deviations of x and y, respectively, and Cov(x, y) is the covariance between
x and y:

Cov(x, y) = (1/n) Σ_i (x_i − x̄)(y_i − ȳ)   (4)

where x̄ and ȳ are the means of x and y, respectively. Correlation is simply a form of covariance that
has been normalised to lie in the range [−1, +1]. Note that if two variables x and y are uncorrelated then
ρ(x, y) = Cov(x, y) = 0, although ρ(x, y) and Cov(x, y) are not equal in general.
The covariance Cov(x, y) can be shown to be:

Cov(x, y) = (1/n) Σ_i x_i y_i − ((1/n) Σ_i x_i)((1/n) Σ_i y_i)   (5)
Each term in Equation (5) is a mean, or expected value E[·], and can be written more succinctly as:

Cov(x, y) = E[xy] − E[x]E[y]   (6)
A histogram plot with abscissas x and y, and with the ordinate denoting frequency, approximates the probability
density function (pdf) of the joint distribution of xy. The quantity E[xy] is known as a second moment of
this joint distribution. Similarly, histograms of x and y approximate their respective pdfs, and are known as
the marginal distributions of the joint distribution xy. The quantities E[x] and E[y] are the first moments
(respectively) of these marginal distributions. Thus, covariance is defined in terms of moments associated with
the joint distribution xy.
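The moments above are easily estimated from samples; the following Python fragment (with arbitrary,
illustrative data) computes the covariance of Equation (6) and the correlation of Equation (3) directly from them.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10000)
y = 0.5 * x + rng.normal(size=10000)          # y partially depends on x

# Equation (6): covariance as a difference of moments.
cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)

# Equation (3): correlation normalises covariance by the standard deviations.
rho_xy = cov_xy / (np.std(x) * np.std(y))
print(cov_xy, rho_xy)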
Just because x and y are uncorrelated, this does not imply that they are independent. To take a simple example,
given a variable z ∈ [0, 2π], we can define x = sin(z) and y = cos(z). Intuitively, it can be seen that both
x and y depend on z. As can be seen from Figure 3, the variables x and y are highly interdependent. However,
the covariance (and therefore the correlation) of x and y is zero:

Cov(x, y) = E[sin(z) cos(z)] − E[sin(z)] E[cos(z)]   (7)
          = (1/2) E[sin(2z)] − 0 × 0   (8)
          = 0   (9)
In summary, covariance does not capture all types of dependencies between x and y, whereas measures of
statistical independence do.
Like covariance, independence is defined in terms of the expected values of the joint distribution xy. We have
established that if x and y are uncorrelated then they have zero covariance:

E[xy] − E[x]E[y] = 0   (10)
Using a generalised form of covariance involving powers of x and y, if x and y are statistically independent
then:
E[x^p y^q] − E[x^p]E[y^q] = 0   (11)
for all positive integer values of p and q. Whereas covariance uses p = q = 1, all positive integer values of p and
q are implicit in measures of independence. Formally, if x and y are independent then each moment E[x^p y^q] is
equal to the product of the expected values of the marginal distributions, E[x^p]E[y^q], which leads to the
result stated in Equation (11).
The formal similarity between measures of independence and covariance can be interpreted as follows. Whereas
covariance measures the amount of linear covariation between x and y, independence measures the linear
covariation between x raised to powers p and y raised to powers q. Thus, independence can be considered
as a generalised form of covariance, which measures the linear covariation between non-linear functions (e.g.
cubes) of two variables.
For example, using x = sin(z) and y = cos(z) we know that Cov(x, y) = 0. However, the measure of linear
covariation between the variables x^p and y^q, as depicted in Figure 3 for p = q = 2, is:

E[x^p y^q] − E[x^p]E[y^q] = −0.123   (12)

This corresponds to a correlation between x^2 and y^2 of −0.864 (see Figure 3). Thus, whereas the correlation
between x = sin(z) and y = cos(z) is zero, the fact that the value of x can be predicted from y is implicit in the
non-zero values of the higher order moments of the distribution of xy.
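This example is easy to verify numerically; in the Python sketch below, the theoretical value of the p = q = 2
generalised covariance is −1/8, consistent with the sampled value quoted above.

import numpy as np

z = np.linspace(0, 2 * np.pi, 10000)
x, y = np.sin(z), np.cos(z)

# Ordinary covariance (p = q = 1) is numerically zero ...
c1 = np.mean(x * y) - np.mean(x) * np.mean(y)

# ... but the generalised covariance for p = q = 2 is not (about -1/8),
# exposing the dependence between x and y.
c2 = np.mean(x**2 * y**2) - np.mean(x**2) * np.mean(y**2)
print(c1, c2)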
If we assume that source signals have non-Gaussian pdfs then, whilst most transformations produce data with
Gaussian distributions, a small number of transformations exist that produce data with non-Gaussian
distributions. Under certain conditions, the non-Gaussian signals extracted from signal mixtures by such a
transformation are in fact the original source signals. This is the basis of projection pursuit methods
[Friedman, 1987].
In order to set about finding non-Gaussian component signals, it is necessary to define precisely what is meant
by the term 'non-Gaussian'. Two important classes of signals with non-Gaussian pdfs have super-Gaussian and
sub-Gaussian pdfs. These are defined in terms of kurtosis, which is defined as

k = [ (1/T) ∫ (s_t − s̄)^4 dt ] / [ (1/T) ∫ (s_t − s̄)^2 dt ]^2 − 3   (13)

where s_t is the value of a signal at time t, s̄ is the mean value of s_t, and the constant −3 ensures that
super-Gaussian signals have positive kurtosis, whereas sub-Gaussian signals have negative kurtosis.
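In Python, a sample estimate of Equation (13), together with a deliberately brute-force projection pursuit step
for two signal mixtures, might look as follows. The function names are illustrative, and practical PP methods
use gradient-based optimisation rather than an exhaustive angular scan.

import numpy as np

def kurtosis(s):
    # Sample version of Equation (13): fourth moment over the squared
    # second moment, minus 3 so that a Gaussian signal scores zero.
    s = s - np.mean(s)
    return np.mean(s**4) / np.mean(s**2)**2 - 3

def pursue_one(x, n_angles=360):
    # x: 2 x T array of mixtures. Scan unit row vectors w and keep the
    # one whose projection y = w x is most non-Gaussian (largest |kurtosis|).
    best_w, best_k = None, -1.0
    for theta in np.linspace(0, np.pi, n_angles, endpoint=False):
        w = np.array([np.cos(theta), np.sin(theta)])
        k = abs(kurtosis(w @ x))
        if k > best_k:
            best_w, best_k = w, k
    return best_w       # estimate of one row of the unmixing matrix W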
Although statistical independence is not explicit in the projection pursuit optimisation process, extracting
non-Gaussian signals produces signals that are mutually independent. In contrast, ICA explicitly maximises
the mutual independence of extracted signals. Note, however, that if one subset of sources is combined to form a
mixture x1, and a disjoint subset is combined to form a mixture x2, then x1 and x2 are mutually independent,
even though both consist of mixtures of source signals. Thus, statistical independence of extracted signals is a
necessary, but not sufficient, condition for source separation.
Having established the connection between ICA and PP, and conditions under which they are equivalent, we
proceed by describing the 'standard' ICA method [Bell and Sejnowski, 1995].
The entropy of a signal x with pdf f_x is defined as

H(x) = − ∫ f_x(x) ln f_x(x) dx   (15)
As might be expected, the transformation of a given data set x affects the entropy of the transformed data Y
according to the change in the amount of 'spread' introduced by the transformation. Given a multidimensional
signal x, if a cluster of points in x is mapped to a large region in Y, then the transformation implicitly maps
infinitesimal volumes from one space to another. The 'volumetric mapping' between spaces is given by the
Jacobian of the transformation between spaces. The Jacobian combines the derivative of each axis in x with
respect to every axis in y to form a ratio of infinitesimal volumes in x and y. The change in entropy induced
by the transformation W can be shown to be equal to the expected value of log |J|, where |·| denotes absolute
value.
Given that Y = φ(Wx), the output entropy H(Y) can be shown to be related to the entropy of the input H(x)
by

H(Y) = H(x) + E[ log |J| ]   (16)

where |J| is the determinant of the Jacobian matrix J = ∂Y/∂x. Note that the entropy of the input H(x) is
constant, so any W that maximises H(Y) is unaffected by H(x), which can therefore be ignored.
(In fact, sources s_i normalised so that E[s_i tanh s_i] = 1/2 can be separated using tanh sigmoids if and only
if the pairwise conditions κ_i κ_j > 1 are satisfied, where κ_i = 2 E[s_i^2] E[sech^2 s_i] [Porrill, 1997].)
Using the chain rule, we can evaluate |J| as:

|J| = |∂Y/∂x| = |∂Y/∂y| |∂y/∂x| = ( ∏_{i=1}^N φ_i′(y_i) ) |W|   (17)
where ∂Y/∂y and ∂y/∂x are Jacobian matrices. Substituting Equation (17) in Equation (16) yields

H(Y) = H(x) + E[ Σ_{i=1}^N log φ_i′(y_i) ] + log |W|   (18)
As the entropy of x is unaffected by W, it can be ignored in the maximisation of H(Y). The term
E[ Σ_i log φ_i′(y_i) ] can be estimated given n samples from the distribution defined by y:

E[ Σ_{i=1}^N log φ_i′(y_i) ] ≈ (1/n) Σ_{j=1}^n Σ_{i=1}^N log φ_i′(y_i^(j))   (19)
Ignoring H(x), and substituting Equation (19) in (18), yields a new function that differs from H(Y) by a
constant equal to H(x):

h(W) = (1/n) Σ_{j=1}^n Σ_{i=1}^N log φ_i′(y_i^(j)) + log |W|   (20)
If we define the cdf φ_i = tanh then this evaluates to

h(W) = (1/n) Σ_{j=1}^n Σ_{i=1}^N log(1 − (y_i^(j))^2) + log |W|   (21)

This function can be maximised by taking its derivative with respect to the matrix W:

∇h(W) = [W^T]^{-1} − 2yx^T   (22)
Now an unmixing matrix can be found by taking small steps of size η to update W:

ΔW = η ( [W^T]^{-1} − 2yx^T )   (23)
In fact, the matching of the pdf of y to each cdf also requires that each signal y_i has zero mean. This is easily
accommodated by introducing a 'bias' weight w_i to ensure that y_i = W_i x + w_i has zero mean. The value of
each bias weight is learned like any other weight in W. For a tanh cdf, this evaluates to:

Δw_i = −2η y_i   (24)
In practice, h(W) is maximised either, a) by using a 'natural gradient' [Amari, 1998], which normalises the error
surface so that the step-size along each dimension is scaled by the local gradient in that direction, and which
obviates the need to invert W at each step, or b) by a second order technique (such as BFGS or a conjugate
gradient or Marquardt method), which estimates an optimal search direction and step-size under the assumption
that the error surface is locally quadratic.
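A compact Python sketch of the natural-gradient form of the update in Equation (23) follows. The function
name and parameters are illustrative, and a tanh nonlinearity is assumed, which is appropriate for
super-Gaussian sources.

import numpy as np

def infomax_ica(x, eta=0.01, n_iter=2000, seed=0):
    # x: M x T array of signal mixtures. Returns an M x M unmixing matrix.
    rng = np.random.default_rng(seed)
    m, t = x.shape
    W = np.eye(m) + 0.1 * rng.normal(size=(m, m))
    for _ in range(n_iter):
        y = W @ x
        # Natural-gradient form of Equation (23): right-multiplying the
        # ordinary gradient by W^T W removes the matrix inverse [Amari, 1998].
        dW = (np.eye(m) - (2.0 / t) * np.tanh(y) @ y.T) @ W
        W += eta * dW
    return W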
Once W has been estimated, the extracted signals y = Wx can be related back to the measured mixtures by
inverting the unmixing:

x = Ay   (25)

where A = W^{-1} is an N × N matrix. Therefore, each row (source signal) of y specifies how the contribution to
x of one column (image) of A varies over time. So, whereas each row y_i of y specifies a signal that is independent
of all rows in y, each column a_i of A consists of an image that varies independently over time according to the
amplitude of y_i. Note that, in general, the rows of y are constrained to be mutually independent, whereas the
relationship between the columns of A is completely unconstrained.
Any N × T data matrix x can be decomposed as

x = U D V^T   (26)

where the diagonal elements of D contain the ordered singular values (the square roots of the eigenvalues)
associated with the corresponding eigenvectors in the columns of U and V. This decomposition is produced
by singular value decomposition (SVD). Note that each eigenvalue specifies the amount of data variance
associated with the direction defined by a corresponding eigenvector in U and V. We can therefore discard
eigenvectors with small eigenvalues, because these account for trivial variations in the data set. Setting K ≪ N
permits a more economical representation of x:
x ≈ x̃ = Ũ D̃ Ṽ^T   (27)

Note that Ũ is now an N × K matrix, Ṽ is a T × K matrix, and D̃ is a diagonal K × K matrix. As with ICAs
and ICAt, these can be considered in temporal and spatial terms. If each column of x is an image of N pixels
then each column of U is an eigenimage, and each column of V is an eigensequence.
Given that we require a small unmixing matrix W, it is desirable to use Ṽ instead of X for ICAt, and Ũ instead
of X^T for ICAs. The basic method consists of performing ICA on Ṽ or Ũ to obtain K ICs, and then using the
relation X̃ = Ũ D̃ Ṽ^T to obtain the K corresponding columns of A.
Temporal ICs are recovered as

y = W Ṽ^T   (28)

where each row of the K × T matrix Ṽ^T is an 'eigensequence', and W is a K × K matrix. In this case, ICA
recovers K mutually independent sequences, each of length T.
The set of images corresponding to the K temporal ICs can be obtained as follows. Given

Ṽ^T = Ay = W^{-1} y   (29)

and

x̃ = Ũ D̃ Ṽ^T   (30)

we have

x̃ = Ũ D̃ W^{-1} y   (31)
   = Ay   (32)

so that

A = Ũ D̃ W^{-1}   (33)
where A is an N × K matrix in which each column is an image. Thus, we have extracted K independent
T-dimensional sequences and their corresponding N-dimensional images using a K × K unmixing matrix W.
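Putting Equations (27)-(33) together in Python (a sketch reusing the hypothetical infomax_ica function above,
and assuming x is an N × T data matrix):

import numpy as np

def reduced_temporal_ica(x, K):
    # Truncated SVD: x ~ U~ D~ V~^T with K components, Equation (27).
    U, d, Vt = np.linalg.svd(x, full_matrices=False)
    Ut, Dt, Vt = U[:, :K], np.diag(d[:K]), Vt[:K, :]

    W = infomax_ica(Vt)              # ICA on the K eigensequences
    y = W @ Vt                       # Equation (28): K independent sequences
    A = Ut @ Dt @ np.linalg.inv(W)   # Equation (33): corresponding images
    return y, A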
Spatial ICs are recovered analogously, by unmixing the eigenimages:

y = W Ũ^T   (34)

Given

Ũ^T = Ay = W^{-1} y   (35)

and

x^T = Ṽ D̃ Ũ^T   (36)

we have

x^T = Ṽ D̃ W^{-1} y   (37)
    = Ay   (38)

so that

A = Ṽ D̃ W^{-1}   (39)
where A is a T × K matrix in which each column is a time course. Thus, we have extracted K independent
N-dimensional images and their corresponding T-dimensional time courses using a K × K unmixing matrix W.
Note that using SVD in this manner requires the assumption that the ICs are not distributed amongst the
smaller eigenvectors, which are usually discarded. The validity of this assumption is by no means guaranteed.
The matrices V and D can also be obtained by applying PCA to the covariance matrix of the data:

C = x^T x   (40)
This covariance matrix is the starting point of many standard PCA algorithms. After PCA we have V and a
corresponding set of ordered eigenvalues. The matrix D, which is normally obtained with SVD, can be constructed
by setting each diagonal element to the square root of each corresponding eigenvalue. Given that

x = U D V^T   (41)

it follows that

U = x V D^{-1}   (42)
So, given V and D from a PCA of the covariance matrix of x, we can obtain the eigenimages U . Note that we
can compute as many eigenimages as required by simply omitting corresponding eigensequences and eigenvalues
from V and D, respectively.
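In Python, this route to the eigenimages (a sketch assuming the K retained eigenvalues are strictly positive) is:

import numpy as np

def eigenimages_from_covariance(x, K):
    C = x.T @ x                              # Equation (40): T x T covariance
    evals, V = np.linalg.eigh(C)             # eigenvalues in ascending order
    idx = np.argsort(evals)[::-1][:K]        # keep the K largest
    V = V[:, idx]
    D = np.diag(np.sqrt(evals[idx]))         # singular values = sqrt(eigenvalues)
    U = x @ V @ np.linalg.inv(D)             # Equation (42): U = x V D^-1
    return U, D, V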
Recall that ICA decomposes the data as

x = As   (44)
The main difference between SVD and ICA is as follows. Each matrix produced by SVD has orthogonal columns.
That is, the variation in each column is uncorrelated with the variations in every other column within U and V. In
contrast, ICA produces two matrices with quite different properties. Rather than being uncorrelated, the rows
of s are independent. This stringent requirement on the rows of s means that the columns of A cannot, in general,
also be independent; indeed, ICA places no constraints on the relationships between the columns of A.
References
[Amari, 1998] Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation,
10(2):251-276.
[Bell and Sejnowski, 1995] Bell, A. and Sejnowski, T. (1995). An information-maximization approach to blind
separation and blind deconvolution. Neural Computation, 7:1129-1159.
[Bell and Sejnowski, 1997] Bell, A. and Sejnowski, T. (1997). The 'independent components' of natural scenes
are edge filters. Vision Research, 37(23):3327-3338.
[Friedman, 1987] Friedman, J. (1987). Exploratory projection pursuit. J. Amer. Statistical Association,
82(397):249-266.
[Girolami and Fyfe, 1996] Girolami, M. and Fyfe, C. (1996). Negentropy and kurtosis as projection pursuit
indices provide generalised ICA algorithms. In NIPS'96 Blind Signal Separation Workshop.
[Jutten and Herault, 1988] Jutten, C. and Herault, J. (1988). Independent component analysis versus PCA. In
Proc. EUSIPCO, pages 643-646.
[Makeig et al., 1997] Makeig, S., Jung, T., Bell, A., Ghahremani, D., and Sejnowski, T. (1997). Blind separation
of auditory event-related brain responses into independent components. Proc. Natl. Acad. Sci. USA,
94:10979-10984.
[McKeown et al., 1998] McKeown, M., Makeig, S., Brown, G., Jung, T., Kindermann, S., and Sejnowski, T.
(1998). Spatially independent activity patterns in functional magnetic resonance imaging data during the
Stroop color-naming task. Proceedings of the National Academy of Sciences USA (in press).
[Porrill, 1997] Porrill, J. (1997). Independent component analysis: Conditions for a local maximum. Technical
Report 123, Psychology Department, Sheffield University, England.
[van Hateren and van der Schaaf, 1998] van Hateren, J. and van der Schaaf, A. (1998). Independent component
filters of natural images compared with simple cells in primary visual cortex. Proc. Royal Soc. London B,
265:359-366.
Figure 1: The geometry of source separation. a) Plot of signal s1 versus s2. Each point s_t in S represents the
amplitudes of the source signals s1t and s2t at time t. These signals are plotted separately in Figure 2. b) Plot
of signal mixture x1 versus x2. Each point x_t = A s_t in X represents the amplitudes of the signal mixtures
x1t and x2t at time t. These signal mixtures are plotted separately in Figure 2. The orthogonal axes S1 and
S2 in S (solid lines in Figure a) are transformed by the mixing matrix A to form the skewed axes S1′ and S2′
in X (solid lines in Figure b). An 'unmixing' matrix W consists of two row vectors, each of which 'selects' a
direction associated with a different signal in X. The dashed line in Figure b specifies one row vector w1 of an
'unmixing' matrix W which is (in general) orthogonal to every transformed axis Si′ except one (S1′, in this case).
Variations in signal amplitude associated with directions (such as S2′) that are orthogonal to w1 have no effect
on the inner product y1 = w1 x. Therefore, y1 only reflects amplitude changes associated with the direction S1′,
so that y1 = k s1, where k is a constant that equals unity if S1′ and w1 are co-linear.
Figure 2: Separation of two signals. Original signals s = (s1, s2) are displayed in the left hand graphs. Two
signal mixtures x = As are displayed in the middle graphs. The results y = Wx of applying an unmixing matrix
W = A^{-1} to the mixtures x are displayed in the right hand graphs.
Figure 3: The interdependence of x = sin(z) and y = cos(z) is only apparent in the higher order moments of
the joint distribution of xy. a) Plot of x = sin(z) versus y = cos(z). Even though the value of x is highly
predictable given the corresponding value of y (and vice versa), the correlation between x and y is r = 0. For
display purposes, noise has been added in order to make the set of points visible. b) Plot of sin^2(z) versus
cos^2(z). The correlation between sin^2(z) and cos^2(z) is r = −0.864. Whereas x and y are uncorrelated if the
correlation between x and y is zero, they are statistically independent only if the correlation between x^p and
y^q is zero for all positive integer values of p and q. Therefore, sin(z) and cos(z) are uncorrelated, but not
independent.
Figure 4: Histograms of signals with different probability density functions. From left to right: histograms
of a super-Gaussian, Gaussian, and sub-Gaussian signal. The left hand histogram is derived from a portion of
Handel's Messiah, the middle histogram is derived from Gaussian noise, and the right hand histogram is derived
from a sine wave.
Figure 5: Six sound signals and their pdfs. Each signal consists of ten thousand samples. From top to bottom:
chirping, gong, Handel's Messiah, people laughing, whistle-plop, steam train.
Figure 6: The outputs of six microphones, each of which receives input from six sound sources according to
its proximity to each source. Each microphone receives a different mixture of the six non-Gaussian signals
displayed in Figure 5. Note that the pdf of each signal mixture shown on the rhs is approximately Gaussian.
Figure 7: A typical signal produced by applying a random 'unmixing' matrix to the six signal mixtures displayed
in Figure 6. The resultant signal has a pdf that is approximately Gaussian. From top to bottom: a single mixture
of the six signals shown in Figure 5, the mixture's pdf, and the pdf of a Gaussian signal. Note that the correct
unmixing matrix would produce each of the original source signals displayed in Figure 5.