
Independent Component Analysis

Seungjin Choi
Abstract Independent component analysis (ICA) is a statistical method whose goal is to decompose multivariate data into a linear sum of non-orthogonal basis vectors with coefficients (encoding variables, latent variables, hidden variables) being statistically independent. ICA generalizes widely-used subspace analysis methods such as principal component analysis (PCA) and factor analysis, allowing latent variables to be non-Gaussian and basis vectors to be non-orthogonal in general. ICA is a density estimation method in which a linear model is learned such that the probability distribution of the observed data is best captured, whereas factor analysis aims at best modeling the covariance structure of the observed data. We begin with the fundamental theory and present various principles and algorithms for ICA.
1 Introduction
Independent component analysis (ICA) is a widely-used multivariate data analysis
method that plays an important role in various applications such as pattern recog-
nition, medical image analysis, bioinformatics, digital communications, computa-
tional neuroscience, and so on. ICA seeks a decomposition of multivariate data into
a linear sum of non-orthogonal basis vectors with coefficients being statistically as
independent as possible.
We consider a linear generative model, where m-dimensional observed data $x \in \mathbb{R}^m$ is assumed to be generated by a linear combination of n basis vectors $a_i \in \mathbb{R}^m$,

$x = a_1 s_1 + a_2 s_2 + \cdots + a_n s_n, \qquad (1)$
where $s_i \in \mathbb{R}$ are encoding variables representing the extent to which each basis vector is used to reconstruct the data vector. Given N samples, the model (1) can be written in a compact form:

$X = AS, \qquad (2)$

where $X = [x(1), \ldots, x(N)] \in \mathbb{R}^{m \times N}$ is a data matrix, $A = [a_1, \ldots, a_n] \in \mathbb{R}^{m \times n}$ is a basis matrix, and $S = [s(1), \ldots, s(N)] \in \mathbb{R}^{n \times N}$ is an encoding matrix with $s(t) = [s_1(t), \ldots, s_n(t)]^{\top}$.
A dual interpretation of basis and encoding in the model (2) is given as follows.
When the columns of X are treated as data points in m-dimensional space, the columns of A are considered as basis vectors and each column of S is an encoding that represents the extent to which each basis vector is used to reconstruct the data vector.
Alternatively, when the rows of X are data points in N-dimensional space, the rows of S correspond to basis vectors and each row of A represents an encoding.
A major application of ICA is the problem of blind source separation (BSS), the goal of which is to restore the sources S (associated with encodings) without knowledge of A, given the data matrix X. ICA and BSS have often been treated as an identical problem since they are closely related to each other. In BSS, the matrix A is referred to as the mixing matrix. In practice, we find a linear transformation W, referred to as the demixing matrix, such that the rows of the output matrix

$Y = WX, \qquad (3)$

are statistically independent. Assume that the sources (rows of S) are statistically independent. In such a case, it is well known that WA becomes a transparent transformation when the rows of Y are statistically independent. The transparent transformation is given by $WA = P\Lambda$, where P is a permutation matrix and $\Lambda$ is a nonsingular diagonal matrix involving scaling. This transparent transformation reflects two indeterminacies in ICA [1]: (1) scaling ambiguity; (2) permutation ambiguity. In other words, the rows of Y correspond to scaled and permuted rows of S.
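To make these indeterminacies concrete, the following short numpy sketch (a toy illustration; the particular sources, matrices, and seed are arbitrary choices) verifies that a permuted and rescaled pair of mixing matrix and sources generates exactly the same data X, so the scale and order of the sources cannot be recovered from X alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy sources (n=2) and mixing matrix (m=2): X = A S
S = rng.laplace(size=(2, 1000))          # non-Gaussian sources
A = rng.normal(size=(2, 2))
X = A @ S

# apply an arbitrary permutation P and diagonal scaling D to the sources
P = np.array([[0.0, 1.0], [1.0, 0.0]])   # swap the two sources
D = np.diag([3.0, -0.5])                 # rescale (and flip the sign of) the sources
S_alt = P @ D @ S                        # altered sources
A_alt = A @ np.linalg.inv(P @ D)         # compensating mixing matrix

# both factorizations explain the data equally well
print(np.allclose(X, A_alt @ S_alt))     # True
```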
Since Jutten and Herault's first solution [2] to ICA, various methods have been developed, including a neural network approach [3], information maximization [4], natural gradient (or relative gradient) learning [5, 6, 7], maximum likelihood estimation [8, 9, 10, 11], and nonlinear principal component analysis (PCA) [12, 13, 14]. Several books on ICA [15, 16, 17, 18, 19] are available, serving as a good resource for a thorough review and tutorial of ICA. In addition, tutorial papers
on ICA [20, 21] are useful resources as well.
In this chapter, we begin with a fundamental idea, emphasizing why independent
components are sought. Then we introduce well-known principles to tackle ICA,
leading to an objective function to be optimized. We explain the natural gradient al-
gorithm for ICA. We also elucidate how we incorporate nonstationarity or temporal
information into the standard ICA framework.
2 Why Independent Components?
Principal component analysis (PCA) is a popular subspace analysis method that has
been used for dimensionality reduction and feature extraction. Given a data matrix
$X \in \mathbb{R}^{m \times N}$, the covariance matrix $R_{xx}$ is computed by

$R_{xx} = \frac{1}{N} X H X^{\top},$

where $H = I_{N \times N} - \frac{1}{N} 1_N 1_N^{\top}$ is the centering matrix, $I_{N \times N}$ is the $N \times N$ identity matrix, and $1_N = [1, \ldots, 1]^{\top} \in \mathbb{R}^N$. The rank-n approximation of the covariance matrix $R_{xx}$ is of the form

$R_{xx} \approx U \Lambda U^{\top},$

where $U \in \mathbb{R}^{m \times n}$ contains the n eigenvectors associated with the n largest eigenvalues of $R_{xx}$ in its columns and the corresponding eigenvalues are the diagonal entries of the diagonal matrix $\Lambda$. Then principal components z(t) are determined by projecting data points x(t) onto these eigenvectors, leading to

$z(t) = U^{\top} x(t),$

or in a compact form,

$Z = U^{\top} X.$

It is well known that the rows of Z are uncorrelated with each other.
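As a concrete illustration, the following minimal numpy sketch (written directly from the centering and eigendecomposition formulas above; the toy data are arbitrary) computes the principal components and confirms numerically that the rows of Z are uncorrelated.

```python
import numpy as np

def pca_components(X, n):
    """Project the m x N data matrix X onto its top-n principal directions."""
    m, N = X.shape
    H = np.eye(N) - np.ones((N, N)) / N        # centering matrix
    Rxx = X @ H @ X.T / N                      # sample covariance R_xx
    evals, evecs = np.linalg.eigh(Rxx)         # eigenvalues in ascending order
    U = evecs[:, ::-1][:, :n]                  # n leading eigenvectors
    Z = U.T @ X                                # principal components
    return U, Z

# toy 5-dimensional correlated data
rng = np.random.default_rng(1)
latent = rng.normal(size=(3, 2000)) * np.array([[3.0], [1.0], [0.2]])
X = rng.normal(size=(5, 3)) @ latent
U, Z = pca_components(X, n=2)
print(np.round(np.cov(Z), 3))                  # off-diagonal entries are (numerically) zero
```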
ICA generalizes PCA in the sense that the latent variables (components) are non-Gaussian and A is allowed to be a non-orthogonal transformation, whereas PCA considers only orthogonal transformations and implicitly assumes Gaussian components. Fig. 1 shows a simple example, emphasizing the main difference between PCA and ICA.
We present a core theorem which plays an important role in ICA. It provides a fundamental principle for various unsupervised learning algorithms for ICA and BSS.

Theorem 1 (Skitovich-Darmois). Let $s_1, s_2, \ldots, s_n$ be a set of independent random variables. Consider two random variables $y_1$ and $y_2$ which are linear combinations of the $s_i$,

$y_1 = \alpha_1 s_1 + \cdots + \alpha_n s_n,$
$y_2 = \beta_1 s_1 + \cdots + \beta_n s_n, \qquad (4)$

where $\alpha_i$ and $\beta_i$ are real constants. If $y_1$ and $y_2$ are statistically independent, then each variable $s_i$ for which $\alpha_i \beta_i \neq 0$ is Gaussian.
Fig. 1 Two-dimensional data with two main arms fitted by two different sets of basis vectors: (a) PCA makes the implicit assumption that the data have a Gaussian distribution and determines optimal basis vectors that are orthogonal, which are not efficient at representing non-orthogonal distributions; (b) ICA does not require the basis vectors to be orthogonal and considers non-Gaussian distributions, which is more suitable for fitting more general types of distributions.
Consider the linear model (2) with m = n. Throughout this chapter, we consider this simplest case, m = n (square mixing). Let us define the global transformation as G = WA, where A is the mixing matrix and W is the demixing matrix. With this definition, we write the output y(t) as

$y(t) = W x(t) = G s(t). \qquad (5)$

Let us assume that both A and W are nonsingular, hence G is nonsingular. Under this assumption, one can easily see that if the $y_i(t)$ are mutually independent non-Gaussian signals, then, invoking Theorem 1, G has the decomposition

$G = P\Lambda. \qquad (6)$

This justifies why ICA performs BSS.
3 Principles
The task of ICA is to estimate the mixing matrix A or its inverse $W = A^{-1}$ (referred to as the demixing matrix) such that the elements of the estimate $y = A^{-1} x = W x$ are as independent as possible. For the sake of simplicity, we often leave out the index t when the time structure does not have to be considered. In this section we review four different principles: (1) maximum likelihood estimation; (2) mutual information minimization; (3) information maximization; (4) negentropy maximization.
3.1 Maximum likelihood estimation
Suppose that the sources s are independent with marginal distributions $q_i(s_i)$:

$q(s) = \prod_{i=1}^{n} q_i(s_i). \qquad (7)$

In the linear model x = As, a single factor in the likelihood function is given by

$p(x \mid A, q) = \int p(x \mid s, A)\, q(s)\, ds$
$= \int \prod_{j=1}^{n} \delta\Big( x_j - \sum_{i=1}^{n} A_{ji} s_i \Big) \prod_{i=1}^{n} q_i(s_i)\, ds \qquad (8)$
$= |\det A|^{-1} \prod_{i=1}^{n} q_i\Big( \sum_{j=1}^{n} A^{-1}_{ij} x_j \Big). \qquad (9)$

Then we have

$p(x \mid A, q) = |\det A|^{-1} q(A^{-1} x). \qquad (10)$

The log-likelihood is written as

$\log p(x \mid A, q) = -\log |\det A| + \log q(A^{-1} x), \qquad (11)$

which can also be written as

$\log p(x \mid W, q) = \log |\det W| + \log p(y), \qquad (12)$

where $W = A^{-1}$ and y is the estimate of s, with the true distribution $q(\cdot)$ replaced by a hypothesized distribution $p(\cdot)$. Since the sources are assumed to be statistically independent, (12) is written as

$\log p(x \mid W, q) = \log |\det W| + \sum_{i=1}^{n} \log p_i(y_i). \qquad (13)$

The demixing matrix W is determined by

$\widehat{W} = \arg\max_{W} \Big\{ \log |\det W| + \sum_{i=1}^{n} \log p_i(y_i) \Big\}. \qquad (14)$

It is well known that maximum likelihood estimation is equivalent to Kullback matching, where the optimal model is estimated by minimizing the Kullback-Leibler (KL) divergence between the empirical distribution and the model distribution. Consider the KL divergence from the empirical distribution $\widetilde{p}(x)$ to the model distribution $p_{\theta}(x) = p(x \mid A, q)$:

$\mathrm{KL}[\widetilde{p}(x) \,\|\, p_{\theta}(x)] = \int \widetilde{p}(x) \log \frac{\widetilde{p}(x)}{p_{\theta}(x)}\, dx$
$= -H(\widetilde{p}) - \int \widetilde{p}(x) \log p_{\theta}(x)\, dx, \qquad (15)$

where $H(\widetilde{p}) = -\int \widetilde{p}(x) \log \widetilde{p}(x)\, dx$ is the entropy of $\widetilde{p}$. Given a set of data points $x_1, \ldots, x_N$ drawn from the underlying distribution p(x), the empirical distribution $\widetilde{p}(x)$ puts probability $\frac{1}{N}$ on each data point, leading to

$\widetilde{p}(x) = \frac{1}{N} \sum_{t=1}^{N} \delta(x - x_t). \qquad (16)$

It follows from (15) that

$\arg\min_{\theta} \mathrm{KL}[\widetilde{p}(x) \,\|\, p_{\theta}(x)] = \arg\max_{\theta} \big\langle \log p_{\theta}(x) \big\rangle_{\widetilde{p}}, \qquad (17)$

where $\langle \cdot \rangle_{\widetilde{p}}$ represents the expectation with respect to the distribution $\widetilde{p}$. Plugging (16) into the right-hand side of (15) leads to

$\big\langle \log p_{\theta}(x) \big\rangle_{\widetilde{p}} = \frac{1}{N} \int \sum_{t=1}^{N} \delta(x - x_t) \log p_{\theta}(x)\, dx = \frac{1}{N} \sum_{t=1}^{N} \log p_{\theta}(x_t). \qquad (18)$

Apart from the scaling factor $\frac{1}{N}$, this is just the log-likelihood function. In other words, maximum likelihood estimation is obtained from the minimization of (15).
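The objective in (14) is straightforward to evaluate numerically. The sketch below (illustrative numpy code; it assumes a hypothesized Laplacian source density $p_i(y_i) = \frac{1}{2} e^{-|y_i|}$, which is one common choice rather than something prescribed here) computes the per-sample negative log-likelihood $-\log|\det W| - \frac{1}{N}\sum_{t}\sum_{i}\log p_i(y_i(t))$ for a candidate demixing matrix W; the gradient-based algorithms of Section 4 descend exactly this quantity.

```python
import numpy as np

def neg_log_likelihood(W, X):
    """Average negative log-likelihood -log|det W| - (1/N) sum_t sum_i log p_i(y_i(t)),
    with the hypothesized Laplacian density p_i(y) = 0.5 * exp(-|y|)."""
    Y = W @ X                                   # source estimates, one column per sample
    log_det = np.log(np.abs(np.linalg.det(W)))
    log_p = -np.abs(Y) + np.log(0.5)            # elementwise log p_i(y_i)
    return -log_det - log_p.sum(axis=0).mean()

rng = np.random.default_rng(0)
S = rng.laplace(size=(2, 5000))
A = rng.normal(size=(2, 2))
X = A @ S

print(neg_log_likelihood(np.eye(2), X))         # an arbitrary candidate W
print(neg_log_likelihood(np.linalg.inv(A), X))  # the true unmixing matrix typically scores lower
```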
3.2 Mutual information minimization
Mutual information is a measure of statistical independence. The demixing matrix W is learned such that the mutual information of y = Wx is minimized, leading to the following objective function:

$\mathcal{J}_{mi} = \int p(y) \log \frac{p(y)}{\prod_{i=1}^{n} p_i(y_i)}\, dy$
$= -H(y) - \Big\langle \sum_{i=1}^{n} \log p_i(y_i) \Big\rangle_{y}, \qquad (19)$

where $H(\cdot)$ represents the entropy, i.e.,

$H(y) = -\int p(y) \log p(y)\, dy, \qquad (20)$

and $\langle \cdot \rangle_{y}$ denotes the statistical average with respect to the distribution p(y). Note that $p(y) = \frac{p(x)}{|\det W|}$. Thus, the objective function (19) is given by

$\mathcal{J}_{mi} = -\log |\det W| - \sum_{i=1}^{n} \big\langle \log p_i(y_i) \big\rangle, \qquad (21)$

where $\langle \log p(x) \rangle$ is left out since it does not depend on the parameters W. For on-line learning, we consider only the instantaneous value, leading to

$\mathcal{J}_{mi} = -\log |\det W| - \sum_{i=1}^{n} \log p_i(y_i). \qquad (22)$
3.3 Information maximization
Infomax [4] involves the maximization of the entropy of the output z = g(y), where y = Wx and $g(\cdot)$ is a squashing function (e.g., $g_i(y_i) = \frac{1}{1 + e^{-y_i}}$). It was shown that the Infomax contrast maximization is equivalent to the minimization of the KL divergence between the distribution of y = Wx and the distribution $p(s) = \prod_{i=1}^{n} p_i(s_i)$. In fact, Infomax is nothing but mutual information minimization in the ICA framework.

The Infomax contrast function is given by

$\mathcal{J}_{I}(W) = H(g(Wx)), \qquad (23)$

where $g(y) = [g_1(y_1), \ldots, g_n(y_n)]^{\top}$. If $g_i(\cdot)$ is differentiable, then it is the cumulative distribution function of some probability density function $q_i(\cdot)$,

$g_i(y_i) = \int_{-\infty}^{y_i} q_i(s_i)\, ds_i.$

Let us choose a squashing function $g_i(y_i)$ as

$g_i(y_i) = \frac{1}{1 + e^{-y_i}}, \qquad (24)$

where $g_i(\cdot): \mathbb{R} \rightarrow (0, 1)$ is a monotonically increasing function.

Let us consider an n-dimensional random vector $\bar{s}$, the joint distribution of which is factored into the product of marginal distributions:

$q(\bar{s}) = \prod_{i=1}^{n} q_i(\bar{s}_i). \qquad (25)$

Then $g_i(\bar{s}_i)$ is distributed uniformly on (0, 1), since $g_i(\cdot)$ is the cumulative distribution function of $\bar{s}_i$. Define $u = g(\bar{s}) = [g_1(\bar{s}_1), \ldots, g_n(\bar{s}_n)]^{\top}$, which is distributed uniformly on $(0, 1)^n$.

Define v = g(Wx). Then the Infomax contrast function is re-written as

$\mathcal{J}_{I}(W) = H(g(Wx))$
$= H(v)$
$= -\int p(v) \log p(v)\, dv$
$= -\int p(v) \log \frac{p(v)}{\prod_{i=1}^{n} 1_{(0,1)}(v_i)}\, dv$
$= -\mathrm{KL}[v \,\|\, u]$
$= -\mathrm{KL}[g(Wx) \,\|\, u], \qquad (26)$

where $1_{(0,1)}(\cdot)$ denotes the uniform distribution on (0, 1). Note that the KL divergence is invariant under an invertible transformation f:

$\mathrm{KL}[f(u) \,\|\, f(v)] = \mathrm{KL}[u \,\|\, v] = \mathrm{KL}[f^{-1}(u) \,\|\, f^{-1}(v)].$

Therefore we have

$\mathcal{J}_{I}(W) = -\mathrm{KL}[g(Wx) \,\|\, u]$
$= -\mathrm{KL}[Wx \,\|\, g^{-1}(u)]$
$= -\mathrm{KL}[Wx \,\|\, \bar{s}]. \qquad (27)$

It follows from (27) that maximizing $\mathcal{J}_{I}(W)$ (the Infomax principle) is identical to minimizing the KL divergence between the distribution of the output vector y = Wx and the distribution of $\bar{s}$, whose entries are statistically independent. In other words, Infomax is equivalent to mutual information minimization in the framework of ICA.
3.4 Negentropy maximization
Negative entropy, or negentropy, is a measure of distance to Gaussianity, taking a larger value for a random variable whose distribution is far from Gaussian. Negentropy is always nonnegative and vanishes if and only if the random variable is Gaussian. Negentropy is defined as

$J(y) = H(y_G) - H(y), \qquad (28)$

where $H(y) = -E\{\log p(y)\}$ represents the entropy and $y_G$ is a Gaussian random vector whose mean vector and covariance matrix are the same as those of y. In fact, negentropy is the KL divergence of $p(y_G)$ from p(y), i.e.,

$J(y) = \mathrm{KL}\big[ p(y) \,\|\, p(y_G) \big] = \int p(y) \log \frac{p(y)}{p(y_G)}\, dy, \qquad (29)$

leading to (28).

Let us derive a relation between negentropy and mutual information. To this end, we consider the mutual information I(y):

$I(y) = I(y_1, \ldots, y_n)$
$= \sum_{i=1}^{n} H(y_i) - H(y)$
$= \sum_{i=1}^{n} H(y_{G_i}) - \sum_{i=1}^{n} J(y_i) + J(y) - H(y_G)$
$= J(y) - \sum_{i=1}^{n} J(y_i) + \frac{1}{2} \log \frac{\prod_{i=1}^{n} [R_{yy}]_{ii}}{\det R_{yy}}, \qquad (30)$

where $R_{yy} = E\{y y^{\top}\}$ and $[R_{yy}]_{ii}$ denotes the ith diagonal entry of $R_{yy}$.

Assume that y is already whitened (decorrelated), i.e., $R_{yy} = I$. Then the sum of marginal negentropies is given by

$\sum_{i=1}^{n} J(y_i) = J(y) - I(y) + \underbrace{\frac{1}{2} \log \frac{\prod_{i=1}^{n} [R_{yy}]_{ii}}{\det R_{yy}}}_{=\,0}$
$= -H(y) - \int p(y) \log p(y_G)\, dy - I(y)$
$= -H(x) - \log |\det W| - I(y) - \int p(y) \log p(y_G)\, dy. \qquad (31)$

Invoking $R_{yy} = I$, (31) becomes

$\sum_{i=1}^{n} J(y_i) = -I(y) - H(x) - \log |\det W| + \frac{1}{2} \log |\det R_{yy}|. \qquad (32)$

Note that

$\frac{1}{2} \log |\det R_{yy}| = \frac{1}{2} \log \big|\det\big( W R_{xx} W^{\top} \big)\big|. \qquad (33)$

Therefore, we have

$\sum_{i=1}^{n} J(y_i) = -I(y), \qquad (34)$

where irrelevant terms (those that do not depend on W) are omitted. It follows from (34) that maximizing the sum of marginal negentropies is equivalent to minimizing the mutual information.
4 Natural gradient algorithm
In the previous section, four different principles led to the same objective function

$\mathcal{J} = -\log |\det W| - \sum_{i=1}^{n} \log p_i(y_i). \qquad (35)$

That is, ICA boils down to learning the W which minimizes (35),

$\widehat{W} = \arg\min_{W} \Big\{ -\log |\det W| - \sum_{i=1}^{n} \log p_i(y_i) \Big\}. \qquad (36)$

An easy way to solve (36) is the gradient descent method, which gives a learning algorithm for W of the form

$\Delta W = -\eta\, \frac{\partial \mathcal{J}}{\partial W} = \eta \Big( W^{-\top} - \varphi(y)\, x^{\top} \Big), \qquad (37)$

where $\eta > 0$ is the learning rate and $\varphi(y) = [\varphi_1(y_1), \ldots, \varphi_n(y_n)]^{\top}$ is the negative score function whose ith element $\varphi_i(y_i)$ is given by

$\varphi_i(y_i) = -\frac{d \log p_i(y_i)}{d y_i}. \qquad (38)$

A popular ICA algorithm is based on the natural gradient [22], which is known to be efficient since the steepest descent direction is used when the parameter space is a Riemannian manifold. We now derive the natural gradient ICA algorithm [5].

Invoking (38), we have

$d \Big\{ \sum_{i=1}^{n} \log p_i(y_i) \Big\} = -\sum_{i=1}^{n} \varphi_i(y_i)\, d y_i \qquad (39)$
$= -\varphi^{\top}(y)\, dy, \qquad (40)$

where $\varphi(y) = [\varphi_1(y_1), \ldots, \varphi_n(y_n)]^{\top}$ and dy is given in terms of dW as

$dy = dW\, W^{-1} y. \qquad (41)$

Define a modified coefficient differential dV as

$dV = dW\, W^{-1}. \qquad (42)$

With this definition, we have

$d \Big\{ \sum_{i=1}^{n} \log p_i(y_i) \Big\} = -\varphi^{\top}(y)\, dV\, y. \qquad (43)$

Calculating the infinitesimal increment of $\log |\det W|$, we have

$d \log |\det W| = \mathrm{tr}\{dV\}, \qquad (44)$

where $\mathrm{tr}\{\cdot\}$ denotes the trace, which adds up all diagonal elements. Combining (43) and (44) gives

$d\mathcal{J} = \varphi^{\top}(y)\, dV\, y - \mathrm{tr}\{dV\}. \qquad (45)$

The differential in (45) is in terms of the modified coefficient differential matrix dV. Note that dV is a linear combination of the coefficient differentials $dW_{ij}$. Thus, as long as dW is nonsingular, dV represents a valid search direction to minimize (35), because dV spans the same tangent space of matrices as spanned by dW. This leads to a stochastic gradient learning algorithm for V given by

$\Delta V = -\eta\, \frac{d\mathcal{J}}{dV} = \eta \big( I - \varphi(y)\, y^{\top} \big). \qquad (46)$

Thus the learning algorithm for updating W is described by

$\Delta W = \Delta V\, W = \eta \big( I - \varphi(y)\, y^{\top} \big) W. \qquad (47)$
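The update rule (47) is easy to implement. Below is a minimal numpy sketch of the on-line natural gradient algorithm, assuming the tanh nonlinearity $\varphi_i(y_i) = \tanh(y_i)$ appropriate for super-Gaussian sources (see Section 5); the learning rate, number of epochs, and toy data are illustrative choices, not values prescribed by the text.

```python
import numpy as np

def natural_gradient_ica(X, eta=0.01, n_epochs=50, seed=0):
    """On-line natural gradient ICA: W <- W + eta * (I - phi(y) y^T) W, cf. Eq. (47)."""
    n, N = X.shape
    rng = np.random.default_rng(seed)
    W = np.eye(n)
    for _ in range(n_epochs):
        for t in rng.permutation(N):
            y = W @ X[:, t]                    # current source estimate
            phi = np.tanh(y)                   # score function for super-Gaussian sources
            W += eta * (np.eye(n) - np.outer(phi, y)) @ W
    return W

# toy demonstration: separate two mixed super-Gaussian (Laplacian) sources
rng = np.random.default_rng(1)
S = rng.laplace(size=(2, 3000))
A = rng.normal(size=(2, 2))
W = natural_gradient_ica(A @ S)
print(np.round(W @ A, 2))   # close to a scaled permutation matrix P * Lambda
```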
5 Flexible ICA
The optimal nonlinear function $\varphi_i(y_i)$ is given by (38). However, it requires knowledge of the probability distributions of the sources, which are not available to us. A variety of hypothesized density models have been used. For example, for super-Gaussian source signals, a unimodal density model such as the hyperbolic-Cauchy distribution [9] leads to the nonlinear function

$\varphi_i(y_i) = \tanh(y_i). \qquad (48)$

Such a sigmoid function was also used in [4]. For sub-Gaussian source signals, the cubic nonlinear function $\varphi_i(y_i) = y_i^3$ has been a favorite choice. For mixtures of sub- and super-Gaussian source signals, the nonlinear function can be selected between these two choices according to the estimated kurtosis of the extracted signals [23].

Flexible ICA [24] incorporates the generalized Gaussian density model into the natural gradient ICA algorithm, so that the parameterized nonlinear function provides flexibility in learning. The generalized Gaussian probability distribution is a set of distributions parameterized by a positive real number $\alpha$, which is usually referred to as the Gaussian exponent of the distribution. The Gaussian exponent $\alpha$ controls the peakiness of the distribution. The probability density function (PDF) of a generalized Gaussian is described by

$p(y; \alpha) = \frac{\alpha}{2 \Gamma\!\left(\frac{1}{\alpha}\right)}\, e^{-|y|^{\alpha}}, \qquad (49)$

where $\Gamma(x)$ is the Gamma function given by

$\Gamma(x) = \int_{0}^{\infty} t^{x-1} e^{-t}\, dt. \qquad (50)$

Note that if $\alpha = 1$, the distribution becomes a Laplacian distribution. If $\alpha = 2$, the distribution is Gaussian (see Fig. 2).
Fig. 2 The generalized Gaussian distribution plotted for several different values of the Gaussian exponent, $\alpha = 0.8, 1, 2, 4$.
For a generalized Gaussian distribution, the kurtosis can be expressed in terms of the Gaussian exponent as

$\kappa_{\alpha} = \frac{\Gamma\!\left(\frac{5}{\alpha}\right) \Gamma\!\left(\frac{1}{\alpha}\right)}{\Gamma^{2}\!\left(\frac{3}{\alpha}\right)} - 3. \qquad (51)$

Plots of the kurtosis $\kappa_{\alpha}$ versus the Gaussian exponent $\alpha$ for leptokurtic and platykurtic signals are shown in Fig. 3.

Fig. 3 The kurtosis $\kappa_{\alpha}$ plotted versus the Gaussian exponent $\alpha$: (a) for a leptokurtic signal; (b) for a platykurtic signal.

From the parameterized generalized Gaussian density model, the nonlinear function in the algorithm (47) is given by

$\varphi_i(y_i) = -\frac{d \log p_i(y_i)}{d y_i} = |y_i|^{\alpha_i - 1}\, \mathrm{sgn}(y_i), \qquad (52)$
where $\mathrm{sgn}(y_i)$ is the signum function of $y_i$.

Note that for $\alpha_i = 1$, $\varphi_i(y_i)$ in (38) becomes the signum function (which can also be derived from the Laplacian density model for sources). The signum nonlinearity is favorable for the separation of speech signals, since natural speech is often modeled by a Laplacian distribution. Note also that for $\alpha_i = 4$, $\varphi_i(y_i)$ in (38) becomes a cubic function, which is known to be a good choice for sub-Gaussian sources.

In order to select a proper value of the Gaussian exponent $\alpha_i$, we estimate the kurtosis of the output signal $y_i$ and select the corresponding $\alpha_i$ from the relationship in Fig. 3. The kurtosis of $y_i$, denoted $\kappa_i$, can be estimated via the following iterative algorithm:

$\kappa_i(t+1) = \frac{M_{4i}(t+1)}{M_{2i}^{2}(t+1)} - 3, \qquad (53)$

where

$M_{4i}(t+1) = (1 - \delta) M_{4i}(t) + \delta\, |y_i(t)|^{4}, \qquad (54)$
$M_{2i}(t+1) = (1 - \delta) M_{2i}(t) + \delta\, |y_i(t)|^{2}, \qquad (55)$

and $\delta$ is a small constant, say 0.01.

In general, the estimated kurtosis of the demixing filter output does not exactly match the kurtosis of the original source. However, it indicates whether the estimated source is a sub-Gaussian or a super-Gaussian signal. Moreover, it was shown in [11, 25] that the performance of source separation is not degraded even if the hypothesized density does not match the true density. For these reasons, we suggest a practical method where only a few different forms of nonlinear functions are used.
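The practical method above can be sketched as follows (illustrative numpy code; the particular switching rule between $\alpha_i = 1$ for positive kurtosis and $\alpha_i = 4$ for negative kurtosis is an assumption in the spirit of the text, not a prescription): the moments $M_{2i}$ and $M_{4i}$ are tracked as in (53)-(55) and the score function (52) is switched accordingly inside the natural gradient loop of (47).

```python
import numpy as np

def flexible_phi(y, M2, M4, delta=0.01):
    """Update running moments, estimate kurtosis (Eqs. 53-55), and return
    the generalized-Gaussian score phi_i(y_i) = |y_i|^(alpha_i - 1) sgn(y_i)."""
    M4 = (1 - delta) * M4 + delta * np.abs(y) ** 4
    M2 = (1 - delta) * M2 + delta * np.abs(y) ** 2
    kappa = M4 / (M2 ** 2) - 3.0                  # estimated kurtosis per output
    alpha = np.where(kappa > 0, 1.0, 4.0)         # super-Gaussian -> 1, sub-Gaussian -> 4
    phi = np.abs(y) ** (alpha - 1.0) * np.sign(y)
    return phi, M2, M4

# usage inside the natural gradient loop of Eq. (47)
n, eta = 2, 0.01
M2, M4 = np.ones(n), 9.0 * np.ones(n)             # initialize as super-Gaussian (signum score)
W = np.eye(n)
rng = np.random.default_rng(0)
S = rng.laplace(size=(n, 2000))
X = rng.normal(size=(n, n)) @ S                   # toy mixed observations
for t in range(X.shape[1]):
    y = W @ X[:, t]
    phi, M2, M4 = flexible_phi(y, M2, M4)
    W += eta * (np.eye(n) - np.outer(phi, y)) @ W
```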
6 Differential ICA
In a wide sense, most ICA algorithms based on unsupervised learning belong to the Hebb-type rule or its generalization with nonlinear functions. Motivated by the differential Hebb rule [26] and differential decorrelation [27, 28], we introduce an ICA algorithm employing differential learning and the natural gradient, which leads to a differential ICA algorithm. We first introduce a random walk model for latent variables, in order to show that differential learning can be interpreted as maximum likelihood estimation of a linear generative model. Then the detailed derivation of the differential ICA algorithm is presented.
6.1 Random walk model for latent variables
Given a set of observation data x(t), the task of learning the linear generative model (1) under the constraint that the latent variables are statistically independent is a semiparametric estimation problem. The maximum likelihood estimation of the basis vectors $a_i$ involves a probabilistic model for the latent variables, which are treated as nuisance parameters.

In order to show a link between differential learning and maximum likelihood estimation, we consider a random walk model for the latent variables $s_i(t)$, which is a simple Markov chain, i.e.,

$s_i(t) = s_i(t-1) + \epsilon_i(t), \qquad (56)$

where the innovation $\epsilon_i(t)$ is assumed to have zero mean with a density function $q_i(\epsilon_i(t))$. In addition, the innovation sequences $\epsilon_i(t)$ are assumed to be mutually independent white sequences, i.e., they are spatially independent and temporally white as well.

Let us consider the latent variables $s_i(t)$ over an N-point time block. We define the vector $s_i$ as

$s_i = [s_i(0), \ldots, s_i(N-1)]^{\top}. \qquad (57)$

Then the joint probability density function of $s_i$ can be written as

$p_i(s_i) = p_i(s_i(0), \ldots, s_i(N-1)) = \prod_{t=0}^{N-1} p_i\big( s_i(t) \mid s_i(t-1) \big), \qquad (58)$

where $s_i(t) = 0$ for $t < 0$ and the statistical independence of the innovation sequences was taken into account.

It follows from the random walk model (56) that the conditional probability density of $s_i(t)$ given its past samples can be written as

$p_i\big( s_i(t) \mid s_i(t-1) \big) = q_i(\epsilon_i(t)). \qquad (59)$

Combining (58) and (59) leads to

$p_i(s_i) = \prod_{t=0}^{N-1} q_i(\epsilon_i(t)) = \prod_{t=0}^{N-1} q_i\big( s_i'(t) \big), \qquad (60)$

where $s_i'(t) = s_i(t) - s_i(t-1)$ is the first-order approximation of the differentiation.

Taking the statistical independence of the latent variables and (60) into account, we can write the joint density $p(s_1, \ldots, s_n)$ as

$p(s_1, \ldots, s_n) = \prod_{i=1}^{n} p_i(s_i) = \prod_{t=0}^{N-1} \prod_{i=1}^{n} q_i\big( s_i'(t) \big). \qquad (61)$

The factorial model given in (61) will be used as an optimization criterion in deriving the differential ICA algorithm.
6.2 Algorithm
Denote a set of observation data by

$X = \{x_1, \ldots, x_n\}, \qquad (62)$

where

$x_i = [x_i(0), \ldots, x_i(N-1)]^{\top}. \qquad (63)$

Then the normalized log-likelihood is given by

$\frac{1}{N} \log p(X \mid A) = -\log |\det A| + \frac{1}{N} \log p(s_1, \ldots, s_n)$
$= -\log |\det A| + \frac{1}{N} \sum_{t=0}^{N-1} \sum_{i=1}^{n} \log q_i\big( s_i'(t) \big). \qquad (64)$

Let us denote the inverse of A by $W = A^{-1}$. The estimate of the latent variables is denoted by y(t) = Wx(t). With these variables, the objective function (the negative normalized log-likelihood) is given by

$\mathcal{J}_{diffb} = -\frac{1}{N} \log p(X \mid A)$
$= -\log |\det W| - \frac{1}{N} \sum_{t=0}^{N-1} \sum_{i=1}^{n} \log q_i\big( y_i'(t) \big), \qquad (65)$

where $s_i$ is replaced by its estimate $y_i$ and $y_i'(t) = y_i(t) - y_i(t-1)$ (the first-order approximation of the differentiation).

For on-line learning, the sample average is replaced by the instantaneous value. Hence the on-line version of the objective function (65) is given by

$\mathcal{J}_{diff} = -\log |\det W| - \sum_{i=1}^{n} \log q_i\big( y_i'(t) \big). \qquad (66)$

Note that the objective function (66) is slightly different from (35), used in conventional ICA based on the minimization of mutual information or maximum likelihood estimation.

We derive a natural gradient learning algorithm which finds a minimum of (66). To this end, we follow the approach discussed in [29, 22, 24]. We calculate the total differential $d\mathcal{J}_{diff}(W)$ due to the change dW:

$d\mathcal{J}_{diff} = \mathcal{J}_{diff}(W + dW) - \mathcal{J}_{diff}(W)$
$= -d \log |\det W| - d \Big\{ \sum_{i=1}^{n} \log q_i\big( y_i'(t) \big) \Big\}. \qquad (67)$

Define

$\varphi_i(y_i') = -\frac{d \log q_i(y_i')}{d y_i'}, \qquad (68)$

and construct the vector $\varphi(y') = [\varphi_1(y_1'), \ldots, \varphi_n(y_n')]^{\top}$. With this definition, we have

$d \Big\{ \sum_{i=1}^{n} \log q_i\big( y_i'(t) \big) \Big\} = -\sum_{i=1}^{n} \varphi_i\big( y_i'(t) \big)\, d y_i'(t) = -\varphi^{\top}\big( y'(t) \big)\, d y'(t). \qquad (69)$

One can easily see that

$d \log |\det W| = \mathrm{tr}\big\{ dW\, W^{-1} \big\}. \qquad (70)$

Define a modified differential matrix dV by

$dV = dW\, W^{-1}. \qquad (71)$

Then, noting that $d y'(t) = dW\, W^{-1} y'(t) = dV\, y'(t)$, the total differential $d\mathcal{J}_{diff}(W)$ is computed as

$d\mathcal{J}_{diff} = -\mathrm{tr}\{dV\} + \varphi^{\top}\big( y'(t) \big)\, dV\, y'(t). \qquad (72)$

A gradient descent learning algorithm for updating V is given by

$\Delta V(t+1) = -\eta_t\, \frac{d\mathcal{J}_{diff}}{dV} = \eta_t \Big( I - \varphi\big( y'(t) \big)\, y'^{\top}(t) \Big). \qquad (73)$

Hence, it follows from the relation (71) that the updating rule for W has the form

$W(t+1) = W(t) + \eta_t \Big( I - \varphi\big( y'(t) \big)\, y'^{\top}(t) \Big) W(t). \qquad (74)$
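A minimal numpy sketch of the update (74) is given below, assuming a tanh score function for the innovations (an illustrative choice, not fixed by the text); the only difference from the standard natural gradient rule (47) is that the update is driven by the differenced outputs $y'(t) = y(t) - y(t-1)$.

```python
import numpy as np

def differential_ica(X, eta=0.01, n_epochs=30):
    """Differential ICA: W <- W + eta * (I - phi(y') y'^T) W, cf. Eq. (74)."""
    n, N = X.shape
    W = np.eye(n)
    for _ in range(n_epochs):
        y_prev = W @ X[:, 0]
        for t in range(1, N):
            y = W @ X[:, t]
            dy = y - y_prev                  # first-order difference y'(t)
            phi = np.tanh(dy)                # hypothesized score of the innovations
            W += eta * (np.eye(n) - np.outer(phi, dy)) @ W
            y_prev = y
    return W

# toy usage: random-walk sources (cf. Eq. 56) with Laplacian innovations
rng = np.random.default_rng(0)
S = np.cumsum(rng.laplace(size=(2, 4000)), axis=1)
A = rng.normal(size=(2, 2))
W = differential_ica(A @ S)
print(np.round(W @ A, 2))    # approximately a scaled permutation
```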
7 Nonstationary Source Separation
So far, we have assumed that the sources are stationary random processes whose statistics do not vary over time. In this section, we show how the natural gradient ICA algorithm is modified to handle nonstationary sources. As in [30], the following assumptions are made in this section.

AS1 The mixing matrix A has full column rank.

AS2 Source signals $s_i(t)$ are statistically independent with zero mean. This implies that the covariance matrix of the source signal vector, $R_s(t) = E\{s(t) s^{\top}(t)\}$, is a diagonal matrix, i.e.,

$R_s(t) = \mathrm{diag}\{r_1(t), \ldots, r_n(t)\}, \qquad (75)$

where $r_i(t) = E\{s_i^2(t)\}$ and $E$ denotes the statistical expectation operator.

AS3 The ratios $r_i(t)/r_j(t)$ ($i, j = 1, \ldots, n$, $i \neq j$) are not constant with time.

We point out that the first two assumptions (AS1, AS2) are common to most existing approaches to source separation, whereas the third assumption (AS3) is critical here. For nonstationary sources, the third assumption is satisfied, and it allows us to separate linear mixtures of sources using second-order statistics (SOS).

For stationary source separation, the typical cost function is based on mutual information, which requires knowledge of the underlying distributions of the sources. Since the probability distributions of the sources are not known in advance, most ICA algorithms rely on hypothesized distributions (for example, see [24] and references therein), so higher-order statistics (HOS) must be incorporated either explicitly or implicitly.

For nonstationary sources, Matsuoka et al. have shown that the decomposition (6) is satisfied if the cross-correlations $E\{y_i(t) y_j(t)\}$ ($i, j = 1, \ldots, n$, $i \neq j$) are zero at any time instant t, provided that the assumptions (AS1)-(AS3) hold. To eliminate the cross-correlations, the following cost function was proposed in [30]:

$\mathcal{J}(W) = \frac{1}{2} \Big\{ \sum_{i=1}^{n} \log E\{y_i^2(t)\} - \log \det \big( E\{y(t) y^{\top}(t)\} \big) \Big\}, \qquad (76)$

where $\det(\cdot)$ denotes the determinant of a matrix. The cost function (76) is non-negative and attains its minima if and only if $E\{y_i(t) y_j(t)\} = 0$ for $i, j = 1, \ldots, n$, $i \neq j$. This is a direct consequence of Hadamard's inequality, which is summarized below.

Theorem 2 (Hadamard's inequality). Suppose $K = [k_{ij}]$ is a non-negative definite symmetric $n \times n$ matrix. Then

$\det(K) \leq \prod_{i=1}^{n} k_{ii}, \qquad (77)$

with equality iff $k_{ij} = 0$ for $i \neq j$.

Taking the logarithm of both sides of (77) gives

$\sum_{i=1}^{n} \log k_{ii} - \log \det(K) \geq 0. \qquad (78)$

Replacing the matrix K by $E\{y(t) y^{\top}(t)\}$, one can easily see that the cost function (76) attains its minima iff $E\{y_i(t) y_j(t)\} = 0$ for $i, j = 1, \ldots, n$ and $i \neq j$.

We compute

$d \big\{ \log \det \big( E\{y(t) y^{\top}(t)\} \big) \big\} = 2\, d \log |\det W| + d \log \det C(t) = 2\, \mathrm{tr}\big\{ dW\, W^{-1} \big\} + d \log \det C(t), \qquad (79)$

where $C(t) = E\{x(t) x^{\top}(t)\}$ does not depend on W. Define a modified differential matrix dV as

$dV = dW\, W^{-1}. \qquad (80)$

Then we have

$d \Big\{ \sum_{i=1}^{n} \log E\{y_i^2(t)\} \Big\} = 2\, E\big\{ y^{\top}(t)\, \Lambda^{-1}(t)\, dV\, y(t) \big\}, \qquad (81)$

where $\Lambda(t) = \mathrm{diag}\{E\{y_1^2(t)\}, \ldots, E\{y_n^2(t)\}\}$. Proceeding as in Section 4, we derive the learning algorithm for W, which has the form

$\Delta W(t) = \eta_t \big( I - \Lambda^{-1}(t)\, y(t)\, y^{\top}(t) \big) W(t) = \eta_t\, \Lambda^{-1}(t) \big( \Lambda(t) - y(t)\, y^{\top}(t) \big) W(t). \qquad (82)$
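The update (82) only requires running estimates of the output powers $E\{y_i^2(t)\}$. The following numpy sketch (illustrative; the exponential averaging constant and the learning rate are assumed values) realizes the expectation by a moving average and applies the rule sample by sample.

```python
import numpy as np

def nonstationary_separation(X, eta=0.005, beta=0.05):
    """Equivariant nonstationary source separation, cf. Eq. (82):
    W <- W + eta * (I - Lambda^{-1} y y^T) W, with Lambda = diag of running E[y_i^2]."""
    n, N = X.shape
    W = np.eye(n)
    lam = np.ones(n)                          # running estimates of E[y_i^2(t)]
    for t in range(N):
        y = W @ X[:, t]
        lam = (1 - beta) * lam + beta * y ** 2
        W += eta * (np.eye(n) - np.outer(y / lam, y)) @ W
    return W
```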
8 Spatial, Temporal, and Spatio-Temporal ICA
The ICA decomposition X = AS is inherently dual. Considering the data matrix $X \in \mathbb{R}^{m \times N}$, where each row is assumed to be a time course of an attribute, ICA decomposition produces n independent time courses. On the other hand, regarding the data matrix in the form of $X^{\top}$, ICA decomposition leads to n independent patterns (for instance, images in fMRI or arrays in DNA microarray data).

The standard ICA (where X is considered) is treated as temporal ICA (tICA). Its dual decomposition (regarding $X^{\top}$) is known as spatial ICA (sICA). Combining these two ideas leads to spatio-temporal ICA (stICA). These variations of ICA were first investigated in [31]. Spatial ICA and spatio-temporal ICA were shown to be useful in fMRI image analysis [31] and gene expression data analysis [32, 33].

Suppose that the singular value decomposition (SVD) of X is given by

$X = U D V^{\top} = \big( U D^{1/2} \big) \big( V D^{1/2} \big)^{\top} = \widetilde{U} \widetilde{V}^{\top}, \qquad (83)$

where $U \in \mathbb{R}^{m \times n}$, $D \in \mathbb{R}^{n \times n}$, and $V \in \mathbb{R}^{N \times n}$ for $n \leq \min(m, N)$.
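In code, this preparation step amounts to a truncated SVD (a short numpy sketch is given below; the function name and interface are illustrative). It forms $\widetilde{U} = U D^{1/2}$ and $\widetilde{V} = V D^{1/2}$; tICA then applies an ICA algorithm to $\widetilde{V}^{\top}$ and sICA applies it to $\widetilde{U}^{\top}$, as described in the following subsections.

```python
import numpy as np

def stica_factors(X, n):
    """Truncated SVD X ~ U D V^T and the square-root factors of Eq. (83)."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    U, d, V = U[:, :n], d[:n], Vt[:n, :].T
    sqrtD = np.diag(np.sqrt(d))
    U_tilde = U @ sqrtD        # m x n, so that X = U_tilde @ V_tilde.T
    V_tilde = V @ sqrtD        # N x n
    return U_tilde, V_tilde

# tICA would apply an ICA algorithm to V_tilde.T (n x N time courses);
# sICA would apply it to U_tilde.T (n x m spatial patterns).
```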
8.1 Temporal ICA
Temporal ICA finds a set of independent time courses and a corresponding set of dual unconstrained spatial patterns. It embodies the assumption that each row vector of $\widetilde{V}^{\top}$ consists of a linear combination of n independent sequences, i.e., $\widetilde{V}^{\top} = \widetilde{A}_T S_T$, where $S_T \in \mathbb{R}^{n \times N}$ contains a set of n independent temporal sequences of length N and $\widetilde{A}_T \in \mathbb{R}^{n \times n}$ is an associated mixing matrix.

Unmixing by $Y_T = W_T \widetilde{V}^{\top}$, where $W_T = P \Lambda \widetilde{A}_T^{-1}$, allows us to recover the n dual patterns $A_T$ associated with the n independent time courses by calculating $A_T = \widetilde{U} W_T^{-1}$, which is a consequence of $X = A_T Y_T = \widetilde{U} \widetilde{V}^{\top} = \widetilde{U} W_T^{-1} Y_T$.
8.2 Spatial ICA
Spatial ICA seeks a set of independent spatial patterns $S_S$ and a corresponding set of dual unconstrained time courses $A_S$. It embodies the assumption that each row vector of $\widetilde{U}^{\top}$ is composed of a linear combination of n independent spatial patterns, i.e., $\widetilde{U}^{\top} = \widetilde{A}_S S_S$, where $S_S \in \mathbb{R}^{n \times m}$ contains a set of n independent m-dimensional patterns and $\widetilde{A}_S \in \mathbb{R}^{n \times n}$ is an encoding variable matrix (mixing matrix).

Define $Y_S = W_S \widetilde{U}^{\top}$, where $W_S$ is a permuted version of $\widetilde{A}_S^{-1}$. With this definition, the n dual time courses $A_S \in \mathbb{R}^{N \times n}$ associated with the n independent patterns are computed by $A_S = \widetilde{V} W_S^{-1}$, since $X^{\top} = A_S Y_S = \widetilde{V} \widetilde{U}^{\top} = \widetilde{V} W_S^{-1} Y_S$. Each column vector of $A_S$ corresponds to a temporal mode.
8.3 Spatio-temporal ICA
In a linear decomposition, sICA enforces independence constraints over space, to find a set of independent spatial patterns, whereas tICA embodies independence constraints over time, to seek a set of independent time courses. Spatio-temporal ICA finds a linear decomposition by maximizing the degree of independence over space as well as over time, without necessarily producing independence in either space or time. In fact, it allows a trade-off between the independence of the spatial patterns and the independence of the time courses.

Given $X = \widetilde{U} \widetilde{V}^{\top}$, stICA finds the following decomposition:

$X = S_S^{\top} \Lambda S_T, \qquad (84)$

where $S_S \in \mathbb{R}^{n \times m}$ contains a set of n independent m-dimensional patterns, $S_T \in \mathbb{R}^{n \times N}$ contains a set of n independent temporal sequences of length N, and $\Lambda$ is a diagonal scaling matrix. There exist two $n \times n$ matrices, $W_S$ and $W_T$, such that $S_S = W_S \widetilde{U}^{\top}$ and $S_T = W_T \widetilde{V}^{\top}$. The relation

$X = S_S^{\top} \Lambda S_T = \widetilde{U} W_S^{\top} \Lambda W_T \widetilde{V}^{\top} = \widetilde{U} \widetilde{V}^{\top} \qquad (85)$

implies that $W_S^{\top} \Lambda W_T = I$, which leads to

$W_T = \Lambda^{-1} W_S^{-\top}. \qquad (86)$

The linear transforms $W_S$ and $W_T$ are found by jointly optimizing objective functions associated with sICA and tICA. That is, the objective function for stICA has the form

$\mathcal{J}_{stICA} = \gamma\, \mathcal{J}_{sICA} + (1 - \gamma)\, \mathcal{J}_{tICA}, \qquad (87)$

where $\mathcal{J}_{sICA}$ and $\mathcal{J}_{tICA}$ could be Infomax criteria or log-likelihood functions and $\gamma$ defines the relative weighting of spatial independence and temporal independence. More details on stICA can be found in [31].
9 Algebraic Methods for BSS
Up to now, we have introduced on-line ICA algorithms in the framework of unsupervised learning. In this section, we explain several algebraic methods for BSS in which matrix decomposition plays a critical role.
9.1 Fundamental principle for algebraic BSS
Algebraic methods for BSS often make use of the eigendecomposition of correlation matrices or cumulant matrices. Exemplary algebraic methods for BSS include FOBI [34], AMUSE [35], JADE [36], SOBI [37], and SEONS [38]. Some of these methods (FOBI and AMUSE) are based on the simultaneous diagonalization of two symmetric matrices. Methods such as JADE, SOBI, and SEONS make use of joint approximate diagonalization of multiple matrices (more than two). The following theorem provides a fundamental principle for algebraic BSS, justifying why the simultaneous diagonalization of two symmetric data matrices (one of which is assumed to be positive definite) provides a solution to BSS.
Theorem 3. Let $\Lambda_1, D_1 \in \mathbb{R}^{n \times n}$ be diagonal matrices with positive diagonal entries and $\Lambda_2, D_2 \in \mathbb{R}^{n \times n}$ be diagonal matrices with non-zero diagonal entries. Suppose that $G \in \mathbb{R}^{n \times n}$ satisfies the following decompositions:

$D_1 = G \Lambda_1 G^{\top}, \qquad (88)$
$D_2 = G \Lambda_2 G^{\top}. \qquad (89)$

Then the matrix G is a generalized permutation matrix, i.e., $G = P \Lambda$, if $D_1^{-1} D_2$ and $\Lambda_1^{-1} \Lambda_2$ have distinct diagonal entries.

Proof. It follows from (88) that there exists an orthogonal matrix Q such that

$G \Lambda_1^{\frac{1}{2}} = D_1^{\frac{1}{2}} Q. \qquad (90)$

Hence,

$G = D_1^{\frac{1}{2}} Q \Lambda_1^{-\frac{1}{2}}. \qquad (91)$

Substituting (91) into (89) gives

$D_1^{-1} D_2 = Q \Lambda_1^{-1} \Lambda_2 Q^{\top}. \qquad (92)$

Since the right-hand side of (92) is an eigendecomposition of the left-hand side of (92), the diagonal elements of $D_1^{-1} D_2$ and $\Lambda_1^{-1} \Lambda_2$ are the same. From the assumption that the diagonal elements of $D_1^{-1} D_2$ and $\Lambda_1^{-1} \Lambda_2$ are distinct, the orthogonal matrix Q must have the form $Q = P \Theta$, where $\Theta$ is a diagonal matrix whose diagonal elements are either +1 or -1. Hence we have

$G = D_1^{\frac{1}{2}} P \Theta \Lambda_1^{-\frac{1}{2}} = P \big( P^{\top} D_1^{\frac{1}{2}} P \Theta \Lambda_1^{-\frac{1}{2}} \big) = P \Lambda, \qquad (93)$

where

$\Lambda = P^{\top} D_1^{\frac{1}{2}} P \Theta \Lambda_1^{-\frac{1}{2}},$

which completes the proof.
9.2 AMUSE
As an example of Theorem 3, we briefly explain AMUSE [35], where a BSS solution is determined by simultaneously diagonalizing the equal-time correlation matrix of x(t) and a time-delayed correlation matrix of x(t).

Let us assume that the sources $s_i(t)$ (entries of s(t)) are uncorrelated stochastic processes with zero mean, i.e.,

$E\{s_i(t)\, s_j(t - \tau)\} = \delta_{ij}\, \gamma_i(\tau), \qquad (94)$

where $\delta_{ij}$ is the Kronecker delta and the $\gamma_i(\tau)$ are distinct for $i = 1, \ldots, n$, given $\tau$. In other words, the equal-time correlation matrix of the sources, $R_{ss}(0) = E\{s(t) s^{\top}(t)\}$, is a diagonal matrix with distinct diagonal entries. Moreover, a time-delayed correlation matrix of the sources, $R_{ss}(\tau) = E\{s(t) s^{\top}(t - \tau)\}$, is diagonal as well, with distinct non-zero diagonal entries.

It follows from (2) that the correlation matrices of the observation vector x(t) satisfy

$R_{xx}(0) = A R_{ss}(0) A^{\top}, \qquad (95)$
$R_{xx}(\tau) = A R_{ss}(\tau) A^{\top}, \qquad (96)$

for some non-zero time-lag $\tau$, and both $R_{ss}(0)$ and $R_{ss}(\tau)$ are diagonal matrices since the sources are assumed to be spatially uncorrelated.

Invoking Theorem 3, one can easily see that the inverse of the mixing matrix, $A^{-1}$, can be identified up to a re-scaled and permuted version by the simultaneous diagonalization of $R_{xx}(0)$ and $R_{xx}(\tau)$, provided that $R_{ss}^{-1}(0) R_{ss}(\tau)$ has distinct diagonal elements. In other words, we determine a linear transformation W such that $R_{yy}(0)$ and $R_{yy}(\tau)$ of the output y(t) = Wx(t) are simultaneously diagonalized:

$R_{yy}(0) = (WA) R_{ss}(0) (WA)^{\top},$
$R_{yy}(\tau) = (WA) R_{ss}(\tau) (WA)^{\top}.$

It follows from Theorem 3 that WA is then the transparent transformation.
9.3 Simultaneous diagonalization
We now explain how two symmetric matrices are simultaneously diagonalized by a linear transformation; more details on simultaneous diagonalization can be found in [39]. Simultaneous diagonalization consists of two steps (whitening followed by a unitary transformation), as implemented in the sketch after this list.

(1) First, the matrix $R_{xx}(0)$ is whitened by

$z(t) = D_1^{-\frac{1}{2}} U_1^{\top} x(t), \qquad (97)$

where $D_1$ and $U_1$ are the eigenvalue and eigenvector matrices of $R_{xx}(0)$:

$R_{xx}(0) = U_1 D_1 U_1^{\top}. \qquad (98)$

Then we have

$R_{zz}(0) = D_1^{-\frac{1}{2}} U_1^{\top} R_{xx}(0) U_1 D_1^{-\frac{1}{2}} = I_m,$
$R_{zz}(\tau) = D_1^{-\frac{1}{2}} U_1^{\top} R_{xx}(\tau) U_1 D_1^{-\frac{1}{2}}.$

(2) Second, a unitary transformation is applied to diagonalize the matrix $R_{zz}(\tau)$. The eigendecomposition of $R_{zz}(\tau)$ has the form

$R_{zz}(\tau) = U_2 D_2 U_2^{\top}. \qquad (99)$

Then $y(t) = U_2^{\top} z(t)$ satisfies

$R_{yy}(0) = U_2^{\top} R_{zz}(0) U_2 = I_m,$
$R_{yy}(\tau) = U_2^{\top} R_{zz}(\tau) U_2 = D_2.$

Thus both matrices $R_{xx}(0)$ and $R_{xx}(\tau)$ are simultaneously diagonalized by the linear transform $W = U_2^{\top} D_1^{-\frac{1}{2}} U_1^{\top}$. It follows from Theorem 3 that $W = U_2^{\top} D_1^{-\frac{1}{2}} U_1^{\top}$ is a valid demixing matrix if all the diagonal elements of $D_2$ are distinct.
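A compact numpy sketch of this two-step procedure (whitening, then rotation), which is essentially the AMUSE estimator of Section 9.2, is given below. The symmetrization of the lagged correlation matrix and the toy sources are illustrative choices, not part of the procedure described above.

```python
import numpy as np

def amuse(X, tau=1):
    """Estimate W by simultaneously diagonalizing R_xx(0) and R_xx(tau), cf. Eqs. (97)-(99)."""
    X = X - X.mean(axis=1, keepdims=True)
    N = X.shape[1]
    R0 = X @ X.T / N                                   # R_xx(0)
    Rt = X[:, tau:] @ X[:, :-tau].T / (N - tau)        # R_xx(tau)
    Rt = (Rt + Rt.T) / 2                               # symmetrize (numerical safeguard)

    d1, U1 = np.linalg.eigh(R0)                        # whitening step, Eq. (97)
    Q = np.diag(1.0 / np.sqrt(d1)) @ U1.T              # z(t) = Q x(t)
    Rzz = Q @ Rt @ Q.T
    d2, U2 = np.linalg.eigh(Rzz)                       # rotation step, Eq. (99)
    return U2.T @ Q                                    # W = U2^T D1^{-1/2} U1^T

# toy usage: temporally correlated sources with distinct spectra
rng = np.random.default_rng(0)
t = np.arange(5000)
S = np.vstack([np.sin(0.05 * t), np.sin(0.3 * t + 1.0)]) + 0.01 * rng.normal(size=(2, 5000))
A = rng.normal(size=(2, 2))
W = amuse(A @ S)
print(np.round(W @ A, 2))    # approximately a scaled permutation (Theorem 3)
```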
9.4 Generalized eigenvalue problem
The simultaneous diagonalization of two symmetric matrices can be carried out without going through the two-step procedure. From the discussion in Section 9.3, we have

$W R_{xx}(0) W^{\top} = I_n, \qquad (100)$
$W R_{xx}(\tau) W^{\top} = D_2. \qquad (101)$

The linear transformation W which satisfies (100) and (101) is the eigenvector matrix of $R_{xx}^{-1}(0) R_{xx}(\tau)$ [39]. In other words, the matrix W is the generalized eigenvector matrix of the pencil $R_{xx}(\tau) - \lambda R_{xx}(0)$ [40].

Recently, Chang et al. proposed the matrix pencil method for BSS [41], where they exploited $R_{xx}(\tau_1)$ and $R_{xx}(\tau_2)$ for $\tau_1 \neq \tau_2 \neq 0$. Since the noise vector was assumed to be temporally white, the two matrices $R_{xx}(\tau_1)$ and $R_{xx}(\tau_2)$ are not theoretically affected by the noise vector, i.e.,

$R_{xx}(\tau_1) = A R_{ss}(\tau_1) A^{\top}, \qquad (102)$
$R_{xx}(\tau_2) = A R_{ss}(\tau_2) A^{\top}. \qquad (103)$

Thus it is clear that we can find an estimate of the demixing matrix that is not sensitive to white noise. A similar idea was also exploited in [42, 43].

In general, the generalized eigenvalue decomposition requires a symmetric-definite pencil (one matrix is symmetric and the other is symmetric and positive definite). However, $R_{xx}(\tau_2) - \lambda R_{xx}(\tau_1)$ is not symmetric-definite, which might cause a numerical instability problem resulting in complex-valued eigenvectors.

The set of all matrices of the form $R_1 - \lambda R_2$ with $\lambda \in \mathbb{R}$ is said to be a pencil. Frequently we encounter the case where $R_1$ is symmetric and $R_2$ is symmetric and positive definite. Pencils of this variety are referred to as symmetric-definite pencils [44].

Theorem 4 (p. 468 in [44]). If $R_1 - \lambda R_2$ is symmetric-definite, then there exists a nonsingular matrix $U = [u_1, \ldots, u_n]$ such that

$U^{\top} R_1 U = \mathrm{diag}\{\gamma_1(\tau_1), \ldots, \gamma_n(\tau_1)\}, \qquad (104)$
$U^{\top} R_2 U = \mathrm{diag}\{\gamma_1(\tau_2), \ldots, \gamma_n(\tau_2)\}. \qquad (105)$

Moreover, $R_1 u_i = \lambda_i R_2 u_i$ for $i = 1, \ldots, n$, with $\lambda_i = \frac{\gamma_i(\tau_1)}{\gamma_i(\tau_2)}$.

It is apparent from Theorem 4 that $R_1$ should be symmetric and $R_2$ should be symmetric and positive definite so that the generalized eigenvector matrix U is a valid solution, provided the $\lambda_i$ are distinct.
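In code, this amounts to a single call to a generalized symmetric eigensolver. The sketch below is illustrative: it uses scipy.linalg.eigh, whose second argument must be symmetric positive definite, so the equal-time correlation $R_{xx}(0)$ is placed there and the lagged correlation is symmetrized first, in line with the symmetric-definite requirement discussed above.

```python
import numpy as np
from scipy.linalg import eigh

def pencil_bss(X, tau=1):
    """Demixing matrix from the generalized eigenproblem R_xx(tau) u = lambda R_xx(0) u."""
    X = X - X.mean(axis=1, keepdims=True)
    N = X.shape[1]
    R0 = X @ X.T / N
    Rt = X[:, tau:] @ X[:, :-tau].T / (N - tau)
    Rt = (Rt + Rt.T) / 2                     # enforce symmetry of the lagged correlation
    _, U = eigh(Rt, R0)                      # generalized eigenvectors of the pencil (Rt, R0)
    return U.T                               # rows of W = U^T diagonalize both R0 (to I) and Rt
```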
10 Software
A variety of ICA software packages are available. ICA Central (http://www.tsi.enst.fr/icacentral/) was created in 1999 to promote research on ICA and blind source separation by means of public mailing lists, a repository of data sets, a repository of ICA/BSS algorithms, and so on. ICA Central might be the first place where you can find data sets and ICA algorithms. In addition, several widely-used packages include:
- ICALAB Toolboxes (http://www.bsp.brain.riken.go.jp/ICALAB/): ICALAB is an ICA Matlab software toolbox developed in the Laboratory for Advanced Brain Signal Processing at the RIKEN Brain Science Institute, Japan. It consists of two independent packages, ICALAB for signal processing and ICALAB for image processing, and each package contains a variety of algorithms.
- FastICA (http://www.cis.hut.fi/projects/ica/fastica/): The FastICA Matlab package implements fast fixed-point algorithms for non-Gaussianity maximization [16]. It was developed at the Helsinki University of Technology, Finland, and implementations in other environments (R, C++, Python) are also available.
- Infomax ICA (http://www.cnl.salk.edu/~tewon/ica_cnl.html): Matlab and C codes for Bell and Sejnowski's Infomax algorithm [4] and extended Infomax [15], where a parametric density model is incorporated into Infomax to handle both super-Gaussian and sub-Gaussian sources.
- EEGLAB (http://sccn.ucsd.edu/eeglab/): EEGLAB is an interactive Matlab toolbox for processing continuous and event-related EEG, MEG, and other electrophysiological data using ICA, time/frequency analysis, artifact rejection, and several modes of data visualization.
- ICA: DTU Toolbox (http://isp.imm.dtu.dk/toolbox/ica/): ICA: DTU Toolbox is a collection of ICA algorithms that includes: (1) icaML, an efficient implementation of Infomax; (2) icaMF, an iterative algorithm that offers a variety of possible source priors and mixing matrix constraints (e.g., positivity) and can also handle over- and under-complete mixing; (3) icaMS, a "one-shot" fast algorithm that requires time correlation between samples.
11 Further Issues
- Overcomplete representation: Overcomplete representation allows the latent space dimension n to be greater than the data dimension m in the linear model (1). Sparseness constraints on the latent variables are necessary to learn a fruitful representation [45].
- Bayesian ICA: Bayesian ICA incorporates uncertainty and prior distributions of latent variables into the model (1). Independent factor analysis [46] is pioneering work along this direction. An EM algorithm for ICA was developed in [47] and a full Bayesian ICA (also known as ensemble learning) was developed in [48].
- Kernel ICA: Kernel methods were introduced to consider statistical independence in a reproducing kernel Hilbert space [49], leading to kernel ICA.
- Nonnegative ICA: Nonnegativity constraints can be imposed on the latent variables, yielding nonnegative ICA [50]. A rectified Gaussian prior can also be used in Bayesian ICA to handle nonnegative latent variables.
- Sparseness: Sparseness is another important characteristic of sources, besides independence. Sparse component analysis is studied in [51].
- Beyond ICA: Independent subspace analysis [52] and tree-dependent component analysis [53] generalize ICA, allowing intra-dependence structure in feature subspaces or clusters.
12 Summary
ICA has been successfully applied to various applications in machine learning, pattern recognition, and signal processing. A brief overview of ICA has been presented, starting from fundamental principles for learning a linear latent variable model for parsimonious representation. Natural gradient ICA algorithms were derived in the framework of maximum likelihood estimation, mutual information minimization, Infomax, and negentropy maximization. We have explained flexible ICA, where a generalized Gaussian density was adopted so that a flexible nonlinear function could be incorporated into the natural gradient ICA algorithm. Equivariant nonstationary source separation was presented in the framework of the natural gradient as well. Differential learning was also adopted to incorporate the temporal structure of sources. We have also presented a core idea and various methods for algebraic source separation. Various software packages for ICA were introduced, so that one can easily apply ICA to his/her own applications. Further issues were also briefly mentioned, so that readers can follow the current status of ICA.
References
1. Comon, P.: Independent component analysis, a new concept? Signal Processing 36(3) (1994) 287-314
2. Jutten, C., Herault, J.: Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing 24 (1991) 1-10
3. Cichocki, A., Unbehauen, R.: Robust neural networks with on-line learning for blind identification and blind separation of sources. IEEE Transactions on Circuits and Systems - I: Fundamental Theory and Applications 43 (1996) 894-906
4. Bell, A., Sejnowski, T.: An information maximisation approach to blind separation and blind deconvolution. Neural Computation 7 (1995) 1129-1159
5. Amari, S., Cichocki, A., Yang, H.H.: A new learning algorithm for blind signal separation. In Touretzky, D.S., Mozer, M.C., Hasselmo, M.E., eds.: Advances in Neural Information Processing Systems (NIPS). Volume 8., MIT Press (1996) 757-763
6. Cardoso, J.F., Laheld, B.H.: Equivariant adaptive source separation. IEEE Transactions on Signal Processing 44(12) (1996) 3017-3030
7. Amari, S., Cichocki, A.: Adaptive blind signal processing - neural network approaches. Proceedings of the IEEE, Special Issue on Blind Identification and Estimation 86(10) (1998) 2026-2048
8. Pham, D.T.: Blind separation of instantaneous mixtures of sources via an independent component analysis. IEEE Transactions on Signal Processing 44(11) (1996) 2768-2779
9. MacKay, D.J.C.: Maximum likelihood and covariant algorithms for independent component analysis. Technical Report Draft 3.7, University of Cambridge, Cavendish Laboratory (1996)
10. Pearlmutter, B., Parra, L.: Maximum likelihood blind source separation: A context-sensitive generalization of ICA. In Mozer, M.C., Jordan, M.I., Petsche, T., eds.: Advances in Neural Information Processing Systems (NIPS). Volume 9. (1997) 613-619
11. Cardoso, J.F.: Infomax and maximum likelihood for source separation. IEEE Signal Processing Letters 4(4) (1997) 112-114
12. Karhunen, J.: Neural approaches to independent component analysis. In: Proceedings of the European Symposium on Artificial Neural Networks (ESANN). (1996) 249-266
13. Oja, E.: The nonlinear PCA learning rule and signal separation - mathematical analysis. Technical Report A26, Helsinki University of Technology, Laboratory of Computer and Information Science (1995)
14. Hyvärinen, A., Oja, E.: A fast fixed-point algorithm for independent component analysis. Neural Computation 9 (1997) 1483-1492
15. Lee, T.W.: Independent Component Analysis: Theory and Applications. Kluwer Academic Publishers (1998)
16. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. John Wiley & Sons, Inc. (2001)
17. Haykin, S.: Unsupervised Adaptive Filtering: Blind Source Separation. Prentice-Hall (2000)
18. Cichocki, A., Amari, S.: Adaptive Blind Signal and Image Processing: Learning Algorithms and Applications. John Wiley & Sons, Inc. (2002)
19. Stone, J.V.: Independent Component Analysis: A Tutorial Introduction. MIT Press (2004)
20. Hyvärinen, A.: Survey on independent component analysis. Neural Computing Surveys 2 (1999) 94-128
21. Choi, S., Cichocki, A., Park, H.M., Lee, S.Y.: Blind source separation and independent component analysis: A review. Neural Information Processing - Letters and Review 6(1) (2005) 1-57
22. Amari, S.: Natural gradient works efficiently in learning. Neural Computation 10(2) (1998) 251-276
23. Lee, T.W., Girolami, M., Sejnowski, T.: Independent component analysis using an extended infomax algorithm for mixed sub-Gaussian and super-Gaussian sources. Neural Computation 11(2) (1999) 609-633
24. Choi, S., Cichocki, A., Amari, S.: Flexible independent component analysis. Journal of VLSI Signal Processing 26(1/2) (2000) 25-38
25. Amari, S., Cardoso, J.F.: Blind source separation: Semiparametric statistical approach. IEEE Transactions on Signal Processing 45 (1997) 2692-2700
26. Kosko, B.: Differential Hebbian learning. In: Proceedings of American Institute of Physics: Neural Networks for Computing. (1986) 277-282
27. Choi, S.: Adaptive differential decorrelation: A natural gradient algorithm. In: Proceedings of the International Conference on Artificial Neural Networks (ICANN), Madrid, Spain (2002) 1168-1173
28. Choi, S.: Differential learning and random walk model. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Hong Kong (2003) 724-727
29. Amari, S., Chen, T.P., Cichocki, A.: Stability analysis of learning algorithms for blind source separation. Neural Networks 10(8) (1997) 1345-1351
30. Matsuoka, K., Ohya, M., Kawamoto, M.: A neural net for blind separation of nonstationary signals. Neural Networks 8(3) (1995) 411-419
31. Stone, J.V., Porrill, J., Porter, N.R., Wilkinson, I.W.: Spatiotemporal independent component analysis of event-related fMRI data using skewed probability density functions. NeuroImage 15(2) (2002) 407-421
32. Liebermeister, W.: Linear modes of gene expression determined by independent component analysis. Bioinformatics 18(1) (2002) 51-60
33. Kim, S., Choi, S.: Independent arrays or independent time courses for gene expression data. In: Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), Kobe, Japan (2005)
34. Cardoso, J.F.: Source separation using higher-order moments. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). (1989)
35. Tong, L., Soon, V.C., Huang, Y.F., Liu, R.: AMUSE: a new blind identification algorithm. In: Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS). (1990) 1784-1787
36. Cardoso, J.F., Souloumiac, A.: Blind beamforming for non-Gaussian signals. IEE Proceedings-F 140(6) (1993) 362-370
37. Belouchrani, A., Abed-Meraim, K., Cardoso, J.F., Moulines, E.: A blind source separation technique using second order statistics. IEEE Transactions on Signal Processing 45 (1997) 434-444
38. Choi, S., Cichocki, A., Belouchrani, A.: Second order nonstationary source separation. Journal of VLSI Signal Processing 32 (2002) 93-104
39. Fukunaga, K.: An Introduction to Statistical Pattern Recognition. Academic Press, New York, NY (1990)
40. Molgedey, L., Schuster, H.G.: Separation of a mixture of independent signals using time delayed correlations. Physical Review Letters (1994) 3634-3637
41. Chang, C., Ding, Z., Yau, S.F., Chan, F.H.Y.: A matrix-pencil approach to blind separation of colored nonstationary signals. IEEE Transactions on Signal Processing 48(3) (2000) 900-907
42. Choi, S., Cichocki, A.: Blind separation of nonstationary sources in noisy mixtures. Electronics Letters 36(9) (2000) 848-849
43. Choi, S., Cichocki, A.: Blind separation of nonstationary and temporally correlated sources from noisy mixtures. In: Proceedings of IEEE Workshop on Neural Networks for Signal Processing, Sydney, Australia (2000) 405-414
44. Golub, G.H., Loan, C.F.V.: Matrix Computations. 2nd edn. Johns Hopkins University Press (1993)
45. Lewicki, M.S., Sejnowski, T.: Learning overcomplete representations. Neural Computation 12(2) (2000) 337-365
46. Attias, H.: Independent factor analysis. Neural Computation 11 (1999) 803-851
47. Welling, M., Weber, M.: A constrained EM algorithm for independent component analysis. Neural Computation 13 (2001) 677-689
48. Miskin, J.W., MacKay, D.J.C.: Ensemble learning for blind source separation. In Roberts, S., Everson, R., eds.: Independent Component Analysis: Principles and Practice. Cambridge University Press (2001) 209-233
49. Bach, F., Jordan, M.I.: Kernel independent component analysis. Journal of Machine Learning Research 3 (2002) 1-48
50. Plumbley, M.D.: Algorithms for nonnegative independent component analysis. IEEE Transactions on Neural Networks 14(3) (2003) 534-543
51. Li, Y., Cichocki, A., Amari, S.: Blind estimation of channel parameters and source components for EEG signals: A sparse factorization approach. IEEE Transactions on Neural Networks 17(2) (2006) 419-431
52. Hyvärinen, A., Hoyer, P.: Emergence of phase- and shift-invariant features by decomposition of natural images into independent feature subspaces. Neural Computation 12(7) (2000) 1705-1720
53. Bach, F.R., Jordan, M.I.: Beyond independent components: Trees and clusters. Journal of Machine Learning Research 4 (2003) 1205-1233