signal to several other neurons. It is important to note that a neuron can send
only one signal at a time, although that signal is broadcast to several other
neurons. For example, consider a neuron Y, illustrated in the figure, that receives
inputs from neurons X1, X2, and X3. The activations (output signals) of these
neurons are x1, x2, and x3, respectively. The weights on the connections from
X1, X2, and X3 to neuron Y are w1, w2, and w3, respectively. The net input, y_in,
to neuron Y is the sum of the weighted signals from neurons X1, X2, and X3, i.e.,
$$y_{in} = w_1 x_1 + w_2 x_2 + w_3 x_3$$
The activation y of neuron Y is given by some function of its net input, y = f(y_in),
e.g., the logistic sigmoid function (an S-shaped curve)
$$f(x) = \frac{1}{1 + \exp(-x)}$$
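The net input and activation described above can be sketched in a few lines of Python. This is only an illustration: the function names and the numeric activations and weights are assumptions chosen for the example, not values from the text.

```python
import math

def logistic(x):
    """Logistic sigmoid f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(activations, weights):
    """Return (y_in, y): the weighted sum of the inputs and its sigmoid."""
    y_in = sum(w * x for w, x in zip(weights, activations))
    return y_in, logistic(y_in)

# Hypothetical example values (not from the text)
x = [0.5, -1.0, 2.0]   # activations x1, x2, x3 of neurons X1, X2, X3
w = [0.2, 0.4, 0.1]    # weights w1, w2, w3 on the connections to Y

y_in, y = neuron_output(x, w)
print("net input y_in =", y_in)
print("activation y   =", y)
```

Because the sigmoid maps any net input into the interval (0, 1), the single output y can be sent unchanged to every neuron that Y is connected to.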
treated as binary. However, the frequency of firing varies and can be viewed as a
signal of either greater or lesser magnitude. This corresponds to looking at
discrete time steps and summing all activity (signals received or signals sent) at
a particular point in time. The transmission of the signal from a particular neuron
is accomplished by an action potential resulting from differential concentrations
of ions on either side of the neuron's axon sheath (the brain's "white matter").
The ions most directly involved are potassium, sodium, and chloride.
A generic biological neuron is illustrated in Figure 1.3, together with axons from
two other neurons (from which the illustrated neuron could receive signals) and
dendrites for two other neurons (to which the original neuron would send
signals). Several key features of the processing elements of artificial networks are
suggested by the properties of biological neurons, viz., that:
1. The processing element receives many signals.
2. Signals may be modified by a weight at the receiving synapse.
3. The processing element sums the weighted inputs.
4. Under appropriate circumstances (sufficient input), the neuron transmits a
single output.
5. The output from a particular neuron may go to many other neurons (the axon
branches).
Other features of artificial neural networks that are suggested by biological
neurons are:
6. Information processing is local (although other means of transmission, such as
the action of hormones, may suggest means of overall process control).
7. Memory is distributed:
a. Long-term memory resides in the neurons' synapses or weights.
b. Short-term memory corresponds to the signals sent by the neurons.
8. A synapse's strength may be modified by experience.
9. Neurotransmitters for synapses may be excitatory or inhibitory.
Yet another important characteristic that artificial neural networks share with
biological neural systems is fault tolerance. Biological neural systems are fault
tolerant in two respects. First, we are able to recognize many input signals that
are somewhat different from any signal we have seen before. An example of this
is our ability to recognize a person in a picture we have not seen before or to
recognize a person after a long period of time.
Second, we are able to tolerate damage to the neural system itself. Humans are
born with as many as 100 billion neurons. Most of these are in the brain, and
most are not replaced when they die [Johnson & Brown, 1988]. In spite of our
continuous loss of neurons, we continue to learn. Even in cases of traumatic
neural loss, other neurons can sometimes be trained to take over the functions of
the damaged cells. In a similar manner, artificial neural networks can be
designed to be insensitive to small damage to the network, and the network can
be retrained in cases of significant damage (e.g., loss of data and some
connections). Even for uses of artificial neural networks that are not intended
primarily to model biological neural systems, attempts to achieve biological
plausibility may lead to improved computational features. One example is the
use of a planar array of neurons, as is found in the neurons of the visual cortex,
for Kohonen's self-organizing maps. The topological nature of these maps has
computational advantages, even in applications where the structure of the
output units is not itself significant.
Other researchers have found that computationally optimal groupings of artificial
neurons correspond to biological bundles of neurons [Rogers & Kabrisky, 1989].
Separating the action of a back propagation net into smaller pieces to make it
more local (and therefore, perhaps more biologically plausible) also allows
improvement in computational power (cf. Section 6.2.3) [D. Fausett, 1990].
In MDA, we are additionally interested in finding the directions that maximize the
separation (or discrimination) between different classes (for example, in pattern
classification problems where our dataset consists of multiple classes). This is in
contrast to PCA, which ignores the class labels.
In other words, via PCA, we are projecting the entire set of data (without class labels) onto
a different subspace, and in MDA, we are trying to determine a suitable subspace to
distinguish between patterns that belong to different classes. Or, roughly speaking, in PCA
we are trying to find the axes with maximum variance, along which the data is most spread
out (within a class, since PCA treats the whole data set as one class), and in MDA we are
additionally maximizing the spread between classes.
In typical pattern recognition problems, a PCA is often followed by an MDA.
With many dimensions, PCA is more useful, because it's hard to see through a
cloud of data.
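As a small illustration of "finding the axes with maximum variance", the sketch below runs PCA on a two-dimensional cloud in plain Python. It uses the closed-form eigen-decomposition of a 2x2 covariance matrix, and the data set (an elongated Gaussian cloud tilted by 30 degrees) is a made-up assumption, not data from the text.

```python
import math
import random

def pca_2d(points):
    """Return (angle, var1, var2): the direction of the first principal axis
    in radians, and the variances along the first and second axes."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    cxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    cyy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    cxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    # Closed-form eigen-decomposition of the symmetric 2x2 covariance matrix
    angle = 0.5 * math.atan2(2.0 * cxy, cxx - cyy)
    mean = 0.5 * (cxx + cyy)
    spread = math.hypot(0.5 * (cxx - cyy), cxy)
    return angle, mean + spread, mean - spread

random.seed(0)
# Hypothetical data: large spread along a 30-degree direction, small across it
tilt = math.radians(30.0)
cos_t, sin_t = math.cos(tilt), math.sin(tilt)
points = []
for _ in range(2000):
    u = random.gauss(0.0, 3.0)   # coordinate along the main direction
    v = random.gauss(0.0, 0.5)   # coordinate perpendicular to it
    points.append((u * cos_t - v * sin_t, u * sin_t + v * cos_t))

angle, var1, var2 = pca_2d(points)
print("first principal axis (degrees):", math.degrees(angle))
print("variance along axis 1 and 2:", var1, var2)
```

Note that no class labels appear anywhere in this computation, which is exactly the difference from MDA described above.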
What if our data have way more than 3 dimensions? Like, 17 dimensions?! The table gives the
average consumption of 17 types of food in grams per person per week for every country in the
UK.
The table shows some interesting variations across different food types, but overall differences
aren't so notable. Let's see if PCA can eliminate dimensions to emphasize how countries differ.
Here's the plot of the data along the first principal component. Already we can see something is
different about Northern Ireland.
Now, plotting the first and second principal components, we see that Northern Ireland is a major
outlier. Once we go back and look at the data in the table, this makes sense: the Northern Irish
eat way more grams of fresh potatoes and way fewer of fresh fruits, cheese, fish and alcoholic
drinks. It's a good sign that the structure we've visualized reflects a big fact of real-world
geography: Northern Ireland is the only one of the four countries not on the island of Great
Britain.
Independent component analysis (ICA) is a powerful technique that is able (in
principle) to separate independent sources that have been linearly mixed in
several sensors. For instance, when recording electroencephalograms (EEG) on the
scalp, ICA can separate out artifacts embedded in the data (since they are
usually independent of each other).
ICA is a technique to separate linearly mixed sources. For instance, let's try to
mix and then separate two sources. Let's define the time courses of 2
independent sources, A (top) and B (bottom).
We then mix these two sources linearly. The top curve is equal to A minus twice
B, and the bottom curve is the linear combination 1.73*A + 3.41*B.
We then input these two signals into the ICA algorithm (in this case, fastICA),
which is able to uncover the original activation of A and B.
Note that the algorithm cannot recover the exact amplitude of the source
activities. Note also that, in theory, ICA can only extract sources that are
combined linearly.
(Matlab Code)
A = sin(linspace(0,50, 1000));    % A
B = sin(linspace(0,37, 1000)+5);  % B
figure;
subplot(2,1,1); plot(A);          % plot A
subplot(2,1,2); plot(B, 'r');     % plot B
M1 = A - 2*B;                     % mixing 1
M2 = 1.73*A+3.41*B;               % mixing 2
figure;
subplot(2,1,1); plot(M1);         % plot mixing 1
subplot(2,1,2); plot(M2, 'r');    % plot mixing 2
figure;
c = fastica([M1;M2]);             % compute and plot unmixing using fastICA
subplot(1,2,1); plot(c(1,:));
subplot(1,2,2); plot(c(2,:));
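The remark that ICA cannot recover the exact amplitude of the sources can be checked directly. The Python sketch below (standard library only) rebuilds the two sources and mixtures from the Matlab example above, then shows that a rescaled source paired with inversely rescaled mixing coefficients produces exactly the same mixtures. The scale factor k and the helper mix are assumptions introduced for the illustration.

```python
import math

n = 1000
# Same sources as the Matlab example: linspace(0, 50, n) and linspace(0, 37, n)
A = [math.sin(50.0 * i / (n - 1)) for i in range(n)]
B = [math.sin(37.0 * i / (n - 1) + 5.0) for i in range(n)]

def mix(src_a, src_b, coef_a, coef_b):
    """Linear mixture coef_a * src_a + coef_b * src_b, sample by sample."""
    return [coef_a * a + coef_b * b for a, b in zip(src_a, src_b)]

M1 = mix(A, B, 1.0, -2.0)     # mixing 1
M2 = mix(A, B, 1.73, 3.41)    # mixing 2

# Hypothetical alternative: source A is k times larger and its
# mixing coefficients are k times smaller.
k = 4.0
A_scaled = [k * a for a in A]
M1_alt = mix(A_scaled, B, 1.0 / k, -2.0)
M2_alt = mix(A_scaled, B, 1.73 / k, 3.41)

max_diff = max(abs(p - q) for p, q in zip(M1 + M2, M1_alt + M2_alt))
print("largest difference between the two sets of mixtures:", max_diff)
```

Since both descriptions generate the same recorded signals, no algorithm that sees only M1 and M2 can tell which amplitude of A is the true one.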
We will now explain the preprocessing performed by most ICA algorithms before
actually applying ICA.
A first step in many ICA algorithms is to whiten (or sphere) the data. This means
that we remove any correlations in the data, i.e. the different channels (matrix Q)
are forced to be uncorrelated.
Why do that? A geometrical interpretation is that it restores the initial "shape" of
the data, so that ICA then only has to rotate the resulting matrix (see below).
Once more, let's mix two random sources A and B. At each time point, in the
following graph, the value of A is the abscissa of the data point and the value of
B is its ordinate.
Let us take two linear mixtures of A and B and plot these two new variables.
(Matlab Code)
POINTS = 1000; % number of points to plot
% define the two random variables
% -------------------------------
for i=1:POINTS
    A(i) = round(rand*99)-50;   % A
    B(i) = round(rand*99)-50;   % B
end;
figure; plot(A,B, '.');         % plot the variables
set(gca, 'xlim', [-80 80], 'ylim', [-80 80]); % redefines limits of the graph
% mix linearly these two variables
% --------------------------------
M1 = 0.54*A - 0.84*B;           % mixing 1
M2 = 0.42*A + 0.27*B;           % mixing 2
Then, if we whiten the two linear mixtures, we get the following plot:
the variance on both axes is now equal and the correlation of the projection of the
data on both axes is 0 (meaning that the covariance matrix is diagonal and that
all the diagonal elements are equal). Applying ICA then only means "rotating"
this representation back to the original A and B axis space.
The whitening process is simply a linear change of coordinates of the mixed data.
Once the ICA solution is found in this "whitened" coordinate frame, we can easily
reproject the ICA solution back into the original coordinate frame.
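A minimal whitening sketch in plain Python follows. It is one possible choice among several: it whitens with the Cholesky factor of the covariance matrix, which is an assumption of this example and not necessarily the method used by any particular ICA toolbox. The mixing coefficients are the ones from the Matlab code above; the function names are made up for the illustration.

```python
import math
import random

def covariance(x, y):
    """Return (cxx, cxy, cyy), the sample covariance of two channels."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cxx = sum((a - mx) ** 2 for a in x) / (n - 1)
    cyy = sum((b - my) ** 2 for b in y) / (n - 1)
    cxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    return cxx, cxy, cyy

def whiten(x, y):
    """Center two channels and whiten them so their covariance is identity."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cxx, cxy, cyy = covariance(x, y)
    # Cholesky factor L of the covariance C (C = L L^T), L = [[l11, 0], [l21, l22]]
    l11 = math.sqrt(cxx)
    l21 = cxy / l11
    l22 = math.sqrt(cyy - l21 * l21)
    # Solve L z = (data - mean) by forward substitution
    z1 = [(a - mx) / l11 for a in x]
    z2 = [((b - my) - l21 * u) / l22 for b, u in zip(y, z1)]
    return z1, z2

random.seed(1)
POINTS = 1000
A = [round(random.random() * 99) - 50 for _ in range(POINTS)]
B = [round(random.random() * 99) - 50 for _ in range(POINTS)]
M1 = [0.54 * a - 0.84 * b for a, b in zip(A, B)]   # mixing 1
M2 = [0.42 * a + 0.27 * b for a, b in zip(A, B)]   # mixing 2

Z1, Z2 = whiten(M1, M2)
print("covariance before whitening:", covariance(M1, M2))
print("covariance after whitening: ", covariance(Z1, Z2))
```

After whitening, the covariance is the identity matrix up to rounding, so the only freedom left for ICA is a rotation of the two axes.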
The ICA algorithm
Intuitively, you can imagine that ICA rotates the whitened matrix back to the
original (A,B) space (first scatter plot above). It performs the rotation by
minimizing the Gaussianity of the data projected on both axes (fixed-point ICA).
For instance, in the example above, the projection on both axes is quite Gaussian
(i.e., it looks like a bell-shaped curve). By contrast, the projection in the
original A, B space is far from Gaussian.
By rotating the axes and minimizing the Gaussianity of the projection in the first
scatter plot, ICA is able to recover the original sources, which are statistically
independent (this property comes from the central limit theorem, which states
that any linear mixture of 2 independent random variables is more Gaussian than
the original variables). In Matlab, the function kurtosis (kurt() in the EEGLAB
toolbox; kurtosis() in the Matlab statistical toolbox) gives an indication of the
Gaussianity of a distribution (but the fixed-point ICA algorithm uses a slightly
different measure called negentropy).
The Infomax ICA in the EEGLAB toolbox is not as intuitive and
involves minimizing the mutual information of the data projected on both axes.
However, even if ICA algorithms differ from a numerical point of view, they are all
equivalent from a theoretical point of view.
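The "rotate until the projections are as non-Gaussian as possible" idea can be sketched in plain Python for two channels. This is a toy illustration only: it uses a brute-force search over rotation angles with the absolute excess kurtosis as the non-Gaussianity measure, not the negentropy-based fixed-point algorithm or Infomax mentioned above. The uniform sources, the Cholesky whitening step and all function names are assumptions of the example.

```python
import math
import random

def excess_kurtosis(v):
    """Fourth standardized moment minus 3 (zero for a Gaussian)."""
    n = len(v)
    m = sum(v) / n
    m2 = sum((a - m) ** 2 for a in v) / n
    m4 = sum((a - m) ** 4 for a in v) / n
    return m4 / (m2 * m2) - 3.0

def whiten(x, y):
    """Center and whiten two channels (Cholesky factor of the covariance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    xc = [a - mx for a in x]
    yc = [b - my for b in y]
    cxx = sum(a * a for a in xc) / (n - 1)
    cyy = sum(b * b for b in yc) / (n - 1)
    cxy = sum(a * b for a, b in zip(xc, yc)) / (n - 1)
    l11 = math.sqrt(cxx)
    l21 = cxy / l11
    l22 = math.sqrt(cyy - l21 * l21)
    z1 = [a / l11 for a in xc]
    z2 = [(b - l21 * u) / l22 for b, u in zip(yc, z1)]
    return z1, z2

def rotate(z1, z2, theta):
    """Rotate the two-channel data by the angle theta (radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return ([c * a - s * b for a, b in zip(z1, z2)],
            [s * a + c * b for a, b in zip(z1, z2)])

def correlation(x, y):
    """Pearson correlation coefficient of two sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(2)
n = 2000
A = [random.uniform(-50.0, 50.0) for _ in range(n)]   # source A
B = [random.uniform(-50.0, 50.0) for _ in range(n)]   # source B
M1 = [0.54 * a - 0.84 * b for a, b in zip(A, B)]      # mixing 1
M2 = [0.42 * a + 0.27 * b for a, b in zip(A, B)]      # mixing 2
Z1, Z2 = whiten(M1, M2)

# Search rotations from 0 to 89.5 degrees (the score repeats every 90 degrees)
best_theta, best_score = 0.0, -1.0
for step in range(180):
    theta = math.radians(0.5 * step)
    y1, y2 = rotate(Z1, Z2, theta)
    score = abs(excess_kurtosis(y1)) + abs(excess_kurtosis(y2))
    if score > best_score:
        best_theta, best_score = theta, score

S1, S2 = rotate(Z1, Z2, best_theta)
print("best rotation (degrees):", math.degrees(best_theta))
print("|corr(S1, A)|, |corr(S1, B)|:", abs(correlation(S1, A)), abs(correlation(S1, B)))
print("|corr(S2, A)|, |corr(S2, B)|:", abs(correlation(S2, A)), abs(correlation(S2, B)))
```

Each recovered signal should correlate strongly with exactly one source, but its sign, amplitude and order are arbitrary, which is why the comparison uses absolute correlations.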
ICA in N dimensions
We dealt with only 2 dimensions. However, ICA can deal with an arbitrarily high
number of dimensions. Let's consider 128 EEG electrodes, for instance. The signal
recorded at all electrodes at each time point then constitutes a data point in a
128-dimensional space. After whitening the data, ICA will "rotate the 128 axes" in
order to minimize the Gaussianity of the projection on all axes (note that, unlike
PCA, the axes do not have to remain orthogonal).
What we call ICA components is the matrix that allows projecting the data in the
initial space to one of the axes found by ICA. The weight matrix is the full
transformation from the original space. When we write
$$S = WX$$
X is the data in the original space (for EEG, one row per electrode and one column
per time point; for fMRI, the dimensions are voxels and time points), S is the
source activity (one row per component and one column per time point), and W is
the weight matrix. For EEG, W has one row per component and one column per
electrode:
              elec1   elec2   elec3   elec4   elec5
Component 1 [ 0.824   0.534   0.314   0.654   0.739 ...]
Component 2 [ 0.314   0.154   0.732   0.932   0.183 ...]
Component 3 [ 0.153   0.734   0.134   0.324   0.654 ...]
For instance, to compute the activity of the second source or second independent
component (in a matrix multiplication format), you may simply multiply matrix X
(see beginning of paragraph) by the row vector
              elec1   elec2   elec3   elec4   elec5
Component 2 [ 0.314   0.154   0.732   0.932   0.183 ...]
Now you have the activity of the second component, but the activity is unitless.
If you have heard of inverse modeling, the analogy with EEG/ERP sources in
dipole localization software is the easiest to grasp. Each dipole has an activity
(which projects linearly to all electrodes). The activity of the brain source
(dipole) is unitless unless it is projected to the electrodes. So each dipole
creates a contribution at each electrode site. ICA components are just the same.
Now we will see how to reproject one component to the electrode space. W^{-1} is
the inverse matrix to go from the source space S to the data space X.
$$X = W^{-1} S$$
In Matlab you would just type inv(W) to obtain the inverse of a matrix. Here
W^{-1} has one row per electrode and one column per component:
          comp1   comp2   comp3   comp4   comp5
elec1 [   0.184   0.253   0.131   0.364   0.639 ...]
elec2 [   0.731   0.854   0.072   0.293   0.513 ...]
elec3 [   0.125   0.374   0.914   0.134   0.465 ...]
The inverse weights for component 2 are the second column of this matrix:
elec1 [ 0.253 ]
elec2 [ 0.854 ]
elec3 [ 0.374 ]
We obtain the projected activity of component 2 by multiplying the inverse weights
for component 2 (the column vector above) by the activity of component 2 (a row
vector, i.e. one row of the S matrix). The result is the component projection, a
matrix XC2 with one row per electrode and one column per time point.
Now, if one wants to remove component number 2 from the data (for instance if
component number 2 proved to be an artifact), one can simply subtract the
matrix above (XC2) from the original data X.
Note that in the matrix computed above (XC2) all the columns are proportional,
which means that the scalp activity is simply scaled. For this reason, we call
the columns of the W^{-1} matrix the scalp topographies of the components. Each
column of this matrix is the topography of one component, which is scaled in time
by the activity of the component. The scalp topography of each component can
be used to estimate the equivalent dipole location for this component (assuming
the component is not an artifact).
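The unmixing, back-projection and removal steps described above can be traced on a tiny example in plain Python. The 2-electrode data X and the 2x2 weight matrix W below are made-up values chosen so the inverse is easy to compute by hand; they are not the matrices shown in the text, and the helper names are assumptions.

```python
def matmul(P, Q):
    """Plain matrix product of two lists-of-rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def inverse_2x2(M):
    """Inverse of a 2x2 matrix via the determinant formula."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Hypothetical data: 2 electrodes (rows) by 4 time points (columns)
X = [[1.0, 2.0, 0.5, -1.0],    # elec1
     [0.0, 1.0, 1.5, 2.0]]     # elec2
# Hypothetical weight matrix: one row per component, one column per electrode
W = [[0.8, 0.5],
     [0.3, -0.9]]

S = matmul(W, X)               # S = W X, rows are component activities
W_inv = inverse_2x2(W)         # columns are the scalp topographies

def project(component):
    """Column `component` of W_inv times row `component` of S."""
    return [[W_inv[e][component] * s for s in S[component]]
            for e in range(len(W_inv))]

XC1 = project(0)
XC2 = project(1)
# Remove component 2 by subtracting its projection from the data
X_clean = [[X[e][t] - XC2[e][t] for t in range(len(X[0]))]
           for e in range(len(X))]

print("XC2 =", XC2)
print("X with component 2 removed =", X_clean)
```

In this sketch the projections of all components add up to the original data, so removing component 2 leaves exactly the projection of component 1, and every column of XC2 is a multiple of the component 2 topography.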
In conclusion, when we talk about independent components, we usually refer
to two concepts:
1. Rows of the S matrix, which are the time courses of the component activity.
2. Columns of the W^{-1} matrix, which are the scalp projections of the
components.
ICA properties
From the preceding paragraphs, several properties of ICA become obvious:
1. Since ICA is dealing with clouds of points, changing the order in which the
points are plotted (the order of the time points in EEG) has virtually no effect on
the outcome of the algorithm.
2. Even when the sources are not independent, ICA finds a space where they
are maximally independent.
Signal Mixtures
We know that signal mixtures tend to have Gaussian (normal) probability density
functions, and that source signals have non-gaussian pdfs. We also know that
each source signal can be extracted from a set of signal mixtures by taking the
inner product of a weight vector and those signal mixtures, where this inner
product provides an orthogonal projection of the signal mixtures. But we do not
yet know precisely how to find such a weight vector. One type of method for
doing so is exploratory projection pursuit, often referred to simply as projection
pursuit.
Projection pursuit methods seek one projection at a time such that the
extracted signal is as non-gaussian as possible. This contrasts with ICA, which
typically extracts M signals simultaneously from M signal mixtures, which
requires estimating a (possibly very large) M x M unmixing matrix. One practical
advantage of projection pursuit over ICA is that fewer than M signals can be
extracted if required, where each source signal is extracted from M signal
mixtures using an M-element weight vector.
The name projection pursuit derives from the fact that this method seeks
a weight vector which provides an orthogonal projection of a set of signal
mixtures such that each extracted signal has a pdf which is as non-gaussian as
possible.
Let us consider the example of human height. Suppose that the height h_i of
an individual is the outcome of many underlying factors, which include a
genetic component S_i^G and a dietary component S_i^D (i.e. nature vs nurture).
Let us further suppose that the contribution of each factor to height is the same
for all individuals (i.e. the nature/nurture ratio is fixed). Finally, we need to
assume that the total effect of these different factors in each individual is the
sum of their contributions. If we consider the contribution of each factor as a
constant coefficient then we can write
$$h_i = a S_i^G + b S_i^D$$
where a and b are non-zero coefficients. Each coefficient determines how height
increases with the factors S_i^G and S_i^D. Note that S_i^G and S_i^D vary across
individuals, whereas the coefficients a and b are the same for all individuals. The
central limit theorem ensures that the pdf of the h_i values is approximately
gaussian, irrespective of the pdf of the S_i^G or S_i^D values and irrespective of
the constants a and b.
Of course, we should recognize the above equation for what it is: the
formation of a signal mixture h by a linear combination of source signals S_i^G and
S_i^D, using mixing coefficients a and b. Note that h_i could equally well be a
mixture of two voice signals.
As a further example, in signal processing it is almost always
assumed that, after the signals of interest have been extracted from a noisy
stream of data, the residual noise is gaussian. As stated above, this assumption is
mathematically very convenient, but it is also usually valid. If the residual noise
is the result of many processes whose outputs are added together, then the
central limit theorem (CLT) guarantees that this noise is indeed approximately
gaussian.
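The effect of the CLT on a sum of many processes can be seen numerically. The Python sketch below compares the excess kurtosis (zero for a gaussian) of a single uniformly distributed process with that of a sum of twelve such processes. The choice of uniform processes, the number twelve and the sample size are assumptions made only for the illustration.

```python
import random

def excess_kurtosis(v):
    """Fourth standardized moment minus 3 (zero for a gaussian)."""
    n = len(v)
    m = sum(v) / n
    m2 = sum((a - m) ** 2 for a in v) / n
    m4 = sum((a - m) ** 4 for a in v) / n
    return m4 / (m2 * m2) - 3.0

random.seed(3)
n = 20000
# One non-gaussian process: uniform noise on [-1, 1]
single = [random.uniform(-1.0, 1.0) for _ in range(n)]
# Residual "noise" formed by adding the outputs of 12 such processes
summed = [sum(random.uniform(-1.0, 1.0) for _ in range(12)) for _ in range(n)]

k_single = excess_kurtosis(single)
k_sum = excess_kurtosis(summed)
print("excess kurtosis, single process:", k_single)
print("excess kurtosis, sum of 12:     ", k_sum)
```

The single process has a clearly negative excess kurtosis, while the sum is much closer to zero, i.e. much closer to gaussian.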
Gaussian Signals: Good News, Bad News
The bad news is that the converse of the CLT is not true in general; that is,
it is not true that any gaussian signal is a mixture of non-gaussian signals. The
good news is that, in practice, gaussian signals often do consist of a mixture of
non-gaussian signals. This is good news because it means we can treat any
gaussian signal as if it consists of a mixture of non-gaussian source signals.
Given a set of such gaussian mixtures, we can then proceed to find each source
signal by finding that unmixing vector which extracts the most non-gaussian
signal from the set of mixtures.
We could now proceed using two different strategies. We could define a
measure of the distance between the pdf of the signal extracted by a given
unmixing vector and a gaussian pdf, and then find the unmixing vector that
maximizes this distance. This distance is known as the Kullback-Leibler
divergence. A simpler strategy consists of defining a measure of non-gaussianity
and then finding the unmixing vector that maximizes this measure.
The fact that there are actually two types of non-gaussian signals will not
detain us long, because we shall assume that our source signals are of one type
only. The two types are known by various terms, such as super-gaussian and
sub-gaussian, or equivalently as leptokurtic and platykurtic, respectively; a
signal with zero kurtosis is mesokurtic. A signal with a super-gaussian pdf has
most of its values clustered around zero, whereas a signal with a sub-gaussian pdf
does not. As examples, a speech signal has a super-gaussian pdf, and a sawtooth
function and uniformly distributed white noise have sub-gaussian pdfs. This
implies that super-gaussian signals have pdfs that are more peaky than that of a
gaussian signal, whereas a sub-gaussian signal has a pdf that is less peaky than
that of a gaussian signal.
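The two types of non-gaussian signal can be told apart by the sign of the excess kurtosis. In the Python sketch below, a Laplace-distributed signal stands in for a peaky, speech-like super-gaussian signal and uniform noise stands in for a sub-gaussian one. Both stand-ins, the sample size and the function name are assumptions for illustration.

```python
import random

def excess_kurtosis(v):
    """Fourth standardized moment minus 3: positive for super-gaussian,
    negative for sub-gaussian, zero for gaussian signals."""
    n = len(v)
    m = sum(v) / n
    m2 = sum((a - m) ** 2 for a in v) / n
    m4 = sum((a - m) ** 4 for a in v) / n
    return m4 / (m2 * m2) - 3.0

random.seed(4)
n = 50000
# Laplace samples: the difference of two independent exponential variables
super_gaussian = [random.expovariate(1.0) - random.expovariate(1.0)
                  for _ in range(n)]
sub_gaussian = [random.uniform(-1.0, 1.0) for _ in range(n)]
gaussian = [random.gauss(0.0, 1.0) for _ in range(n)]

k_super = excess_kurtosis(super_gaussian)
k_sub = excess_kurtosis(sub_gaussian)
k_gauss = excess_kurtosis(gaussian)
print("super-gaussian (Laplace):", k_super)
print("sub-gaussian (uniform):  ", k_sub)
print("gaussian:                ", k_gauss)
```

A measure of non-gaussianity that should work for both types therefore has to ignore the sign, for example by using the absolute value or the square of the kurtosis.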