Neural Networks Chapter

The sole purpose of this paper is to identify which neural network offers the best
storage efficiency, quality, robustness, pattern completion, and content-addressable
memory for image (object) recognition in traffic signal systems.
Most pattern-mapping neural networks suffer from the drawback that, while the
weights are being learned, the weight matrix tends to encode the presently active
pattern, thus weakening the trace of patterns it has already learnt. The other
problem that common types of neural networks face is the forced categorization
of a new pattern into one of the already learnt classes. On occasion such
categorization is unreasonable, because the new pattern may differ significantly
from the center of even its nearest class. These two problems, the lack of
stability of the weight matrix and the forced categorization of a new pattern
into one of the existing classes, have led to the proposal of a new
architecture for pattern classification.
Neural nets are of interest to researchers in many areas for different reasons.
Electronic engineers find numerous applications in signal processing and control
theory. Computer engineers are intrigued by the potential for hardware to
implement neural nets efficiently and by applications of neural nets to robotics.
Computer scientists find that neural nets show promise for difficult problems in
areas such as artificial intelligence and pattern recognition. For applied
mathematicians, neural nets are a powerful tool for modelling problems for which
the explicit form of the relationships among certain variables is not known.
There are various points of view as to the nature of a neural net. For example, is
it a specialized piece of computer hardware (say, a VLSI chip) or a computer
program? We shall take the view that neural nets are basically mathematical
models of information processing. They provide a method of representing
relationships that is quite different from Turing machines or computers with
stored programs. As with other numerical methods, the availability of computer
resources, either software or hardware, greatly enhances the usefulness of the
approach, especially for large problems.
The characteristics of biological neural networks serve as the inspiration for
artificial neural networks, or neurocomputing. Artificial neural networks have
been developed as generalizations of mathematical models of human cognition
or neural biology, based on the assumptions that:
1. Information processing occurs at many simple elements called neurons.
2. Signals are passed between neurons over connection links.
3. Each connection link has an associated weight, which, in a typical neural net,
multiplies the signal transmitted.
4. Each neuron applies an activation function (usually nonlinear) to its net input
(sum of weighted input signals) to determine its output signal.
A neural network is characterized by its architecture (the pattern of connections
between the neurons), its training algorithm (the method of determining the
weights on the connections), and its activation function. The weights represent
the information used by the net to solve a problem.
Each neuron has an internal state, called its activation or activity level, which is a
function of the inputs it has received. Typically, a neuron sends its activation as a
signal to several other neurons. It is important to note that a neuron can send
only one signal at a time, although that signal is broadcast to several other
neurons. For example, consider a neuron Y, illustrated in the figure, that receives
inputs from neurons X1, X2, and X3. The activations (output signals) of these
neurons are x1, x2, and x3, respectively. The weights on the connections from
X1, X2, and X3 to neuron Y are w1, w2, and w3, respectively. The net input, y_in,
to neuron Y is the sum of the weighted signals from neurons X1, X2, and X3, i.e.,
$$y_{in} = w_1 x_1 + w_2 x_2 + w_3 x_3$$
The activation y of neuron Y is given by some function of its net input, y = f(y_in),
e.g., the logistic sigmoid function (an S-shaped curve)
$$f(x) = \frac{1}{1 + \exp(-x)}$$
Now suppose further that neuron Y is connected to neurons Z1 and Z2, with
weights v1 and v2, respectively. Neuron Y sends its signal y to each of these
units. However, in general, the values received by neurons Z1 and Z2 will be
different, because each signal is scaled by the appropriate weight, v1 or v2. In a
typical net, the activations z1 and z2 of neurons Z1 and Z2 would depend on
inputs from several or even many neurons, not just one.
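The computation described above for neuron Y can be sketched in a few lines of Matlab. This is only an illustration: the numeric values of the activations and weights are made up, and the variable names (x, w, v, to_z) are chosen here for convenience.

```matlab
% Hypothetical example values; none of these numbers come from the text.
x = [0.5; -1.0; 2.0];          % activations x1, x2, x3 of neurons X1, X2, X3
w = [0.2;  0.4; 0.1];          % weights w1, w2, w3 on the connections to Y
v = [1.5; -0.7];               % weights v1, v2 on the connections from Y to Z1, Z2

f = @(t) 1 ./ (1 + exp(-t));   % logistic sigmoid

y_in = w' * x;                 % net input: w1*x1 + w2*x2 + w3*x3
y    = f(y_in);                % activation of neuron Y, a single signal
to_z = v * y;                  % values received by Z1 and Z2: same y, different scaling
```

The single value y is broadcast unchanged; only the weights v1 and v2 make the values arriving at Z1 and Z2 differ.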
There is a close analogy between the structure of a biological neuron (i.e., a
brain or nerve cell) and the processing element (or artificial neuron) presented in
the rest of this book. In fact, the structure of an individual neuron varies much
less from species to species than does the organization of the system of which
the neuron is an element.
A biological neuron has three types of components that are of particular interest
in understanding an artificial neuron: its dendrites, soma, and axon. The many
dendrites receive signals from other neurons. The signals are electric impulses
that are transmitted across a synaptic gap by means of a chemical process. The
action of the chemical transmitter modifies the incoming signal (typically, by
scaling the frequency of the signals that are received) in a manner similar to the
action of the weights in an artificial neural network. The soma, or cell body, sums
the incoming signals. When sufficient input is received, the cell fires; that is, it
transmits a signal over its axon to other cells. It is often supposed that a cell
either fires or doesn't at any instant of time, so that transmitted signals can be
treated as binary. However, the frequency of firing varies and can be viewed as a
signal of either greater or lesser magnitude. This corresponds to looking at
discrete time steps and summing all activity (signals received or signals sent) at
a particular point in time. The transmission of the signal from a particular neuron
is accomplished by an action potential resulting from differential concentrations
of ions on either side of the neuron's axon sheath (the brain's "white matter").
The ions most directly involved are potassium, sodium, and chloride.
A generic biological neuron is illustrated in Figure 1.3, together with axons from
two other neurons (from which the illustrated neuron could receive signals) and
dendrites for two other neurons (to which the original neuron would send
signals). Several key features of the processing elements of artificial networks are
suggested by the properties of biological neurons, viz., that:
1. The processing element receives many signals.
2. Signals may be modified by a weight at the receiving synapse.
3. The processing element sums the weighted inputs.
4. Under appropriate circumstances (sufficient input), the neuron transmits a
single output.
5. The output from a particular neuron may go to many other neurons (the axon
branches).
Other features of artificial neural networks that are suggested by biological
neurons are:
6. Information processing is local (although other means of transmission, such as
the action of hormones, may suggest means of overall process control).
7. Memory is distributed:
a. Long-term memory resides in the neurons' synapses or weights.
b. Short-term memory corresponds to the signals sent by the neurons.
8. A synapse's strength may be modified by experience.
9. Neurotransmitters for synapses may be excitatory or inhibitory.
Yet another important characteristic that artificial neural networks share with
biological neural systems is fault tolerance. Biological neural systems are fault
tolerant in two respects. First, we are able to recognize many input signals that
are somewhat different from any signal we have seen before. An example of this
is our ability to recognize a person in a picture we have not seen before or to
recognize a person after a long period of time.
Second, we are able to tolerate damage to the neural system itself. Humans are
born with as many as 100 billion neurons. Most of these are in the brain, and
most are not replaced when they die [Johnson & Brown, 1988]. In spite of our
continuous loss of neurons, we continue to learn. Even in cases of traumatic
neural loss, other neurons can sometimes be trained to take over the functions of
the damaged cells. In a similar manner, artificial neural networks can be
designed to be insensitive to small damage to the network, and the network can
be retrained in cases of significant damage (e.g., loss of data and some
connections). Even for uses of artificial neural networks that are not intended
primarily to model biological neural systems, attempts to achieve biological
plausibility may lead to improved computational features. One example is the
use of a planar array of neurons, as is found in the neurons of the visual cortex,
for Kohonen's self-organizing maps. The topological nature of these maps has
computational advantages, even in applications where the structure of the
output units is not itself significant.
Other researchers have found that computationally optimal groupings of artificial
neurons correspond to biological bundles of neurons [Rogers & Kabrisky, 1989].
Separating the action of a back propagation net into smaller pieces to make it
more local (and therefore, perhaps more biologically plausible) also allows
improvement in computational power (cf. Section 6.2.3) [D. Fausett, 1990].
A unified probabilistic model for independent and principal component analysis
(Aapo Hyvärinen)
Principal component analysis (PCA) and independent component analysis (ICA)
are both based on a linear model of multivariate data. They are often seen as
complementary tools, PCA providing dimension reduction and ICA separating
underlying components or sources. In practice, a two-stage approach is often
followed, where first PCA and then ICA is applied. Here, we show how PCA and
ICA can be seen as special cases of the same probabilistic generative model. In
contrast to conventional ICA theory, we model the variances of the components
as further parameters. Such variance parameters can be integrated out in a
Bayesian framework, or estimated in a more classic framework. In both cases,
we find a simple objective function whose maximization enables estimation of
PCA and ICA. Specifically, maximization of the objective under Gaussian
assumption performs PCA, while its maximization for whitened data, under
assumption of non-Gaussianity, performs ICA.
The main purposes of a principal component analysis are to analyze data in order to
identify patterns, and to use those patterns to reduce the dimensions of the dataset
with minimal loss of information. Here, our desired outcome of the principal
component analysis is to project a feature space (our dataset consisting of n
d-dimensional samples) onto a smaller subspace that represents our data "well". A
possible application would be a pattern classification task, where we want to
reduce the computational costs and the error of parameter estimation by
reducing the number of dimensions of our feature space by extracting a
subspace that describes our data "best".
Principal Component Analysis (PCA) vs. Multiple Discriminant Analysis (MDA)
Both Multiple Discriminant Analysis (MDA) and Principal Component Analysis (PCA) are linear
transformation methods and are closely related to each other. In PCA, we are interested in
finding the directions (components) that maximize the variance in our dataset, whereas in MDA
we are additionally interested in finding the directions that maximize the separation (or
discrimination) between different classes (for example, in pattern classification problems
where our dataset consists of multiple classes), in contrast to PCA, which ignores the class
labels.
In other words, via PCA we are projecting the entire set of data (without class labels) onto
a different subspace, and in MDA we are trying to determine a suitable subspace to
distinguish between patterns that belong to different classes. Or, roughly speaking, in PCA
we are trying to find the axes with maximum variance, along which the data is most spread
(within a class, since PCA treats the whole data set as one class), and in MDA we are
additionally maximizing the spread between classes.
In typical pattern recognition problems, a PCA is often followed by an MDA.
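Under one common formulation, the two goals can be written as optimization problems over a single projection direction w. The symbols are notation introduced here for illustration, not taken from the text: $\Sigma$ is the covariance matrix of the whole dataset, $S_W$ and $S_B$ are the within-class and between-class scatter matrices, $m_c$ and $n_c$ are the mean and size of class $c$, and $m$ is the overall mean.

```latex
% PCA: direction of maximum variance, class labels ignored
w_{\mathrm{PCA}} = \arg\max_{\|w\| = 1} \; w^{\top} \Sigma \, w

% MDA (Fisher criterion): large spread between classes relative to spread within classes
w_{\mathrm{MDA}} = \arg\max_{w} \; \frac{w^{\top} S_B \, w}{w^{\top} S_W \, w},
\qquad
S_W = \sum_{c} \sum_{x \in c} (x - m_c)(x - m_c)^{\top},
\qquad
S_B = \sum_{c} n_c \, (m_c - m)(m_c - m)^{\top}
```

The PCA objective contains no class information at all, whereas the MDA objective is built entirely from the class means and the scatter around them.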
What is a "good" subspace?
Let's assume that our goal is to reduce the dimensions of a d-dimensional dataset by projecting it
onto a k-dimensional subspace (where k<d). So, how do we know what size we should choose
for k, and how do we know if we have a feature space that represents our data "well"?
Later, we will compute eigenvectors (the components) from our data set, using a so-called
scatter matrix (or, alternatively, the covariance matrix). Each of those eigenvectors is
associated with an eigenvalue, which tells us about the "length" or "magnitude" of the
eigenvector, that is, how much of the data's variance lies along it. If we observe that all the
eigenvalues are of very similar magnitude, this is a good indicator that our data is already in
a "good" subspace. If, however, some of the eigenvalues are much higher than others, we might
be interested in keeping only those eigenvectors with the much larger eigenvalues, since they
contain more information about our data distribution. Conversely, eigenvalues that are close
to 0 are less informative, and we might consider dropping them when we construct the new
feature subspace.
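One way to turn the eigenvalue comparison into a choice of k is to keep the smallest number of components that explains a chosen share of the total variance. The Matlab sketch below assumes synthetic data and a 95% threshold; both are illustrative choices made here, not values from the text.

```matlab
rng(3);                                    % fixed seed so the run is repeatable
n = 500; d = 5;
X = randn(n, d) * diag([5 3 1 0.2 0.1]);   % synthetic samples (rows), unequal spread per feature

Xc = X - ones(n,1) * mean(X);              % center each feature
C  = cov(Xc);                              % d x d covariance matrix
[V, D] = eig(C);                           % columns of V are eigenvectors
[lambda, order] = sort(diag(D), 'descend');
V = V(:, order);                           % eigenvectors ordered by decreasing eigenvalue

explained = cumsum(lambda) / sum(lambda);  % cumulative fraction of variance
k = find(explained >= 0.95, 1);            % smallest k reaching the assumed 95% threshold
Y = Xc * V(:, 1:k);                        % n x k projection onto the chosen subspace
```

Eigenvalues close to 0 contribute almost nothing to the cumulative sum, which is why the corresponding eigenvectors can be dropped with little loss.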
2D example
First, consider a dataset in only two dimensions, like (height, weight). This dataset can be plotted
as points in a plane. But if we want to tease out variation, PCA finds a new coordinate system in
which every point has a new (x,y) value. The axes don't actually mean anything physical; they're
combinations of height and weight called "principal components" that are chosen to give one
axis lots of variation.
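A minimal Matlab sketch of this 2D case follows. The (height, weight) values are generated synthetically with made-up means and spreads, so the numbers are purely illustrative.

```matlab
rng(4);                                            % fixed seed so the run is repeatable
n = 200;
height = 170 + 10*randn(n,1);                      % cm, made-up distribution
weight = 65 + 0.9*(height - 170) + 5*randn(n,1);   % kg, correlated with height by construction
X  = [height weight];
Xc = X - ones(n,1) * mean(X);                      % center the data

[V, D] = eig(cov(Xc));
[~, order] = sort(diag(D), 'descend');
V = V(:, order);        % V(:,1) is the first principal component
P = Xc * V;             % new (x,y) coordinates of every point

var(P)                  % the first entry is the larger one: most variation lies on the first axis
c = cov(P); c(1,2)      % off-diagonal term near zero: the new coordinates are uncorrelated
```

Each column of V mixes height and weight, which is why the new axes have no direct physical meaning.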
Eating in the UK (a 17D example)
With more dimensions, PCA is more useful, because it's hard to see through a
cloud of data.
What if our data have way more than 3 dimensions? Like, 17 dimensions?! In the table is the
average consumption of 17 types of food in grams per person per week for every country in the
UK.
The table shows some interesting variations across different food types, but overall differences
aren't so notable. Let's see if PCA can eliminate dimensions to emphasize how countries differ.
Here's the plot of the data along the first principal component. Already we can see something is
different about Northern Ireland.
Now, looking at the first and second principal components, we see that Northern Ireland is a major
outlier. Once we go back and look at the data in the table, this makes sense: the Northern Irish
eat way more grams of fresh potatoes and way fewer of fresh fruits, cheese, fish and alcoholic
drinks. It's a good sign that the structure we've visualized reflects a big fact of real-world
geography: Northern Ireland is the only one of the four countries not on the island of Great
Britain.
Independent component analysis (ICA) is a powerful technique that is able (in
principle) to separate independent sources that have been linearly mixed in several
sensors. For instance, when recording electroencephalograms (EEG) on the scalp, ICA
can separate out artifacts embedded in the data (since they are usually independent
of each other).
ICA is a technique to separate linearly mixed sources. For instance, let's try to
mix and then separate two sources. Let's define the time courses of 2
independent sources A (top) and B (bottom).
We then mix these two sources linearly. The top curve is equal to A minus twice
B, and the bottom curve is the linear combination 1.73*A + 3.41*B.
We then input these two signals into the ICA algorithm (in this case, fastICA),
which is able to uncover the original activation of A and B.
Note that the algorithm cannot recover the exact amplitude of the source
activities. Note also that, in theory, ICA can only extract sources that are
combined linearly.
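The reason the amplitude cannot be recovered can be made explicit with one line of algebra. The symbols are introduced here for illustration: $s$ is the vector of sources, $M$ the mixing matrix, $x$ the recorded mixtures, and $D$ any invertible diagonal matrix.

```latex
x = M s = \left( M D^{-1} \right) \left( D s \right)
```

The rescaled sources $D s$, mixed by $M D^{-1}$, produce exactly the same recordings $x$, and they are just as independent as $s$. Nothing in the data distinguishes the two descriptions, so the scale (and sign) of each recovered source is arbitrary.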
(Matlab Code)
A = sin(linspace(0,50, 1000));    % A
B = sin(linspace(0,37, 1000)+5);  % B
figure;
subplot(2,1,1); plot(A);          % plot A
subplot(2,1,2); plot(B, 'r');     % plot B
M1 = A - 2*B;                     % mixing 1
M2 = 1.73*A+3.41*B;               % mixing 2
figure;
subplot(2,1,1); plot(M1);         % plot mixing 1
subplot(2,1,2); plot(M2, 'r');    % plot mixing 2
figure;
c = fastica([M1;M2]);             % compute and plot unmixing using fastICA
subplot(1,2,1); plot(c(1,:));
subplot(1,2,2); plot(c(2,:));
Whitening the data
We will now explain the preprocessing performed by most ICA algorithms before
actually applying ICA.
A first step in many ICA algorithms is to whiten (or sphere) the data. This means
that we remove any correlations in the data, i.e. the different channels (matrix Q)
are forced to be uncorrelated.
Why do that? A geometrical interpretation is that it restores the initial "shape" of
the data, so that ICA then only has to rotate the resulting matrix (see below).
Once more, let's mix two random sources A and B. At each time point, in the following
graph, the value of A is the abscissa of the data point and the value of B is its
ordinate.
Let's take two linear mixtures of A and B and plot these two new variables.
(Matlab Code)
POINTS = 1000; % number of points to plot
% define the two random variables
% -------------------------------
for i=1:POINTS
    A(i) = round(rand*99)-50; % A
    B(i) = round(rand*99)-50; % B
end;
figure; plot(A,B, '.'); % plot the variables
set(gca, 'xlim', [-80 80], 'ylim', [-80 80]); % redefines limits of the graph
% mix linearly these two variables
% --------------------------------
M1 = 0.54*A - 0.84*B; % mixing 1
M2 = 0.42*A + 0.27*B; % mixing 2
figure; plot(M1,M2, '.'); % plot the mixing
set(gca, 'ylim', get(gca, 'xlim')); % redefines limits of the graph
% whiten the data
% ---------------
x = [M1;M2];
c=cov(x') % covariance
sq=inv(sqrtm(c)); % inverse of square root
mx=mean(x'); % mean
xx=x-mx'*ones(1,POINTS); % subtract the mean
xx=2*sq*xx; % whiten (the factor 2 only rescales both axes equally)
cov(xx') % the covariance is now a diagonal matrix
figure; plot(xx(1,:), xx(2,:), '.');
% show projections (whitened data)
% --------------------------------
figure;
axes('position', [0.2 0.2 0.8 0.8]); plot(xx(1,:), xx(2,:), '.'); hold on;
axes('position', [0 0.2 0.2 0.8]); hist(xx(1,:)); set(gca, 'view', [90 90]);
axes('position', [0.2 0 0.8 0.2]); hist(xx(2,:));
% show projections (original sources)
% -----------------------------------
figure;
axes('position', [0.2 0.2 0.8 0.8]); plot(A,B, '.'); hold on;
axes('position', [0 0.2 0.2 0.8]); hist(A); set(gca, 'view', [90 90]);
axes('position', [0.2 0 0.8 0.2]); hist(B);
Then, if we whiten the two linear mixtures, we get the following plot:
the variance on both axes is now equal and the correlation of the projection of the
data on both axes is 0 (meaning that the covariance matrix is diagonal and that
all the diagonal elements are equal). Applying ICA then only means "rotating"
this representation back to the original A and B axis space.
The whitening process is simply a linear change of coordinates of the mixed data.
Once the ICA solution is found in this "whitened" coordinate frame, we can easily
reproject the ICA solution back into the original coordinate frame.
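Why multiplying by the inverse square root of the covariance matrix produces equal variances and zero correlation can be checked directly. In the notation assumed here, $x$ is a mixed data point, $\mu$ its mean, $C$ the covariance matrix of $x$, and $z$ the whitened point.

```latex
z = C^{-1/2} (x - \mu)
\quad\Longrightarrow\quad
\operatorname{cov}(z) = C^{-1/2} \, C \, C^{-1/2} = I

% with an extra scalar factor, as in the listing's line xx = 2*sq*xx:
z' = 2 \, C^{-1/2} (x - \mu)
\quad\Longrightarrow\quad
\operatorname{cov}(z') = 4 I
```

In both cases the covariance matrix is diagonal with equal diagonal elements, which is the property the text relies on; the scalar factor only changes the common variance.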
The ICA algorithm
Intuitively you can imagine that ICA rotates the whitened matrix back to the
original (A,B) space (first scatter plot above). It performs the rotation by
minimizing the Gaussianity of the data projected on both axes (fixed-point ICA).
For instance, in the example above, the projection of the whitened mixtures on
both axes is quite Gaussian (i.e., it looks like a bell-shaped curve). By contrast,
the projection in the original A, B space is far from Gaussian.
By rotating the axes and minimizing the Gaussianity of the projections, ICA is
able to recover the original sources, which are statistically independent (this
property follows from the central limit theorem, which implies that a linear
mixture of 2 independent random variables tends to be more Gaussian than
the original variables). In Matlab, the function kurtosis (kurt() in the EEGLAB
toolbox; kurtosis() in the Matlab statistical toolbox) gives an indication of the
Gaussianity of a distribution (but the fixed-point ICA algorithm uses a slightly
different measure called negentropy).
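The rotation idea can be illustrated with a brute-force search in two dimensions. The Matlab sketch below is not the fixed-point algorithm: it simply tries a grid of rotation angles on whitened data and keeps the angle whose projections have the largest absolute excess kurtosis. The uniform sources, the seed, the angle grid, and the helper name exkurt are assumptions made here; the kurtosis is computed by hand so that no toolbox is needed.

```matlab
rng(0);                                  % fixed seed so the run is repeatable
N  = 5000;
S  = [rand(1,N); rand(1,N)] - 0.5;       % two independent uniform (non-Gaussian) sources
X  = [0.54 -0.84; 0.42 0.27] * S;        % mixing coefficients taken from the text
Xc = X - mean(X,2) * ones(1,N);          % remove the mean of each mixture
Z  = inv(sqrtm(cov(Xc'))) * Xc;          % whiten: cov(Z') is close to the identity

exkurt = @(v) mean(v.^4) / mean(v.^2)^2 - 3;   % excess kurtosis of zero-mean data, 0 for a Gaussian

angles = linspace(0, pi/2, 91);          % candidate rotation angles
score  = zeros(size(angles));
for k = 1:numel(angles)
    t = angles(k);
    R = [cos(t) sin(t); -sin(t) cos(t)]; % 2-D rotation matrix
    Y = R * Z;
    score(k) = abs(exkurt(Y(1,:))) + abs(exkurt(Y(2,:)));   % total non-Gaussianity
end
[~, best] = max(score);
t = angles(best);
Y = [cos(t) sin(t); -sin(t) cos(t)] * Z; % estimated sources, up to order, sign and scale

Rcheck = abs(corrcoef([S; Y]'));         % 4 x 4 correlation matrix of sources and estimates
Rcheck(1:2, 3:4)                         % each row should hold one large and one small entry
```

Searching only the range from 0 to pi/2 is enough here, because larger rotations only permute the two axes or flip their signs.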
The Infomax ICA algorithm in the EEGLAB toolbox is not as intuitive and
involves minimizing the mutual information of the data projected on both axes.
However, even if ICA algorithms differ from a numerical point of view, they are all
equivalent from a theoretical point of view.
ICA in N dimensions
We have dealt with only 2 dimensions. However, ICA can deal with an arbitrarily high
number of dimensions. Let's consider 128 EEG electrodes, for instance. The signal
recorded at all electrodes at each time point then constitutes a data point in a
128-dimensional space. After whitening the data, ICA will "rotate the 128 axes" in order
to minimize the Gaussianity of the projection on all axes (note that, unlike PCA,
the axes do not have to remain orthogonal).
What we call ICA components is the matrix that allows projecting the data in the
initial space to one of the axes found by ICA. The weight matrix is the full
transformation from the original space. When we write
$S = W X$
X is the data in the original space. For EEG:
              Time points
Electrode 1   [ 0.134 0.424 0.653 0.739 0.932 0.183 0.834 ....]
Electrode 2   [ 0.314 0.154 0.732 0.932 0.183 0.834 0.134 ....]
Electrode 3   [ 0.824 0.534 0.314 0.654 0.739 0.932 0.183 ....]
For fMRI:
              Voxels
Time 1        [ 0.134 0.424 0.653 0.739 0.932 0.183 0.834 ....]
Time 2        [ 0.314 0.154 0.732 0.932 0.183 0.834 0.134 ....]
Time 3        [ 0.824 0.534 0.314 0.654 0.739 0.932 0.183 ....]
S is the source activity.
In EEG: an artifact time course or the time course of one compact domain in
the brain
              Time points
Component 1   [ 0.824 0.534 0.314 0.654 0.739 0.932 0.183 ....]
Component 2   [ 0.314 0.154 0.732 0.932 0.183 0.834 0.134 ....]
Component 3   [ 0.153 0.734 0.134 0.324 0.654 0.739 0.932 ....]
In fMRI: an artifact topography or the topography of a statistically maximally
independent pattern of activation.
W is the weight matrix to go from the X space to the S space.
Now the rows of W are the vectors with which we can compute the activity of one
independent component. To compute the component activity in the formula
$S = W X$, the weight matrix W is defined as follows (note: if the linear
transformation between X and S is still unclear, that is, if you do not know how to
perform matrix multiplication, an introductory linear algebra book is a good
starting point).
              elec1  elec2  elec3  elec4  elec5
Component 1 [ 0.824  0.534  0.314  0.654  0.739 ...]
Component 2 [ 0.314  0.154  0.732  0.932  0.183 ...]
Component 3 [ 0.153  0.734  0.134  0.324  0.654 ...]
For instance, to compute the activity of the second source or second independent
component (in a matrix multiplication format), you may simply multiply matrix X
(see beginning of paragraph) by the row vector
              elec1  elec2  elec3  elec4  elec5
Component 2 [ 0.314  0.154  0.732  0.932  0.183 ...]
Now you have the activity of the second component, but the activity is unitless.
If you have heard of inverse modeling, the analogy with EEG/ERP sources in
dipole localization software is the easiest to grasp. Each dipole has an activity
(which projects linearly to all electrodes). The activity of the brain source (dipole)
is unitless unless it is projected to the electrodes. So each dipole creates a
contribution at each electrode site. ICA components are just the same.
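The row-by-matrix product described above can be sketched in Matlab as follows. The matrices are random placeholders, not real EEG data or real ICA weights, and the small sizes (3 electrodes, 7 time points) are chosen here only to keep the example readable.

```matlab
rng(1);                   % fixed seed so the run is repeatable
X = randn(3, 7);          % placeholder data: 3 electrodes x 7 time points
W = randn(3, 3);          % placeholder weight matrix: 3 components x 3 electrodes

S  = W * X;               % activities of all components (one row per component)
s2 = W(2,:) * X;          % activity of component 2 only: row 2 of W times the data

max(abs(s2 - S(2,:)))     % zero up to rounding: both routes give the same time course
```

Computing a single component this way needs only one row of W, which is why each row can be read as the spatial filter of one component.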
Now we will see how to reproject one component to the electrode space. $W^{-1}$ is
the inverse matrix to go from the source space S to the data space X.
$X = W^{-1} S$
In Matlab you would just type inv(W) to obtain the inverse of a matrix.
              comp1  comp2  comp3  comp4  comp5
Electrode 1 [ 0.184  0.253  0.131  0.364  0.639 ...]
Electrode 2 [ 0.731  0.854  0.072  0.293  0.513 ...]
Electrode 3 [ 0.125  0.374  0.914  0.134  0.465 ...]
If S is a row vector (for instance the activity of component 2 computed above)
and we multiply it by the following column vector from the inverse matrix above
              comp2
Electrode 1 [ 0.253 ]
Electrode 2 [ 0.854 ]
Electrode 3 [ 0.374 ]
we will obtain the projected activity of component 2: the inverse weights for
component 2 (column vector; bottom left below) multiplied by the activity for
component 2 (row vector; top right below) lead to the component projection
(matrix; bottom right).
              [ 0.314 0.154 0.732 0.932 0.183 0.834 0.134 ....]

[ 0.253 ]     [ 0.824 0.534 0.314 0.654 0.739 0.932 0.183 ....]
[ 0.854 ]     [ 0.314 0.154 0.732 0.932 0.183 0.834 0.134 ....]
[ 0.374 ]     [ 0.153 0.734 0.134 0.324 0.654 0.739 0.932 ....]
(Top right: one row of the S matrix, the activity of component 2. Bottom right: the
projection of that component activity on all the electrodes. Note that the
calculation is not accurate and that the numbers are meaningless.)
This matrix will be denoted XC2.
Now, if one wants to remove component number 2 from the data (for instance if
component number 2 proved to be an artifact), one can simply subtract the
matrix above (XC2) from the original data X.
Note that in the matrix computed above (XC2) all the columns are proportional,
which means that the scalp activity is simply scaled. For this reason, we call
the columns of the $W^{-1}$ matrix the scalp topographies of the components. Each
column of this matrix is the topography of one component, which is scaled in time
by the activity of the component. The scalp topography of each component can
be used to estimate the equivalent dipole location for this component (assuming
the component is not an artifact).
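The back-projection and removal of one component can be sketched in Matlab as below. As before, X and W are random placeholders rather than real recordings or real ICA weights; the names Winv and Xclean are chosen here for illustration.

```matlab
rng(1);                        % fixed seed so the run is repeatable
X = randn(3, 7);               % placeholder data: 3 electrodes x 7 time points
W = randn(3, 3);               % placeholder weight matrix
S = W * X;                     % component activities
Winv = inv(W);                 % columns are the scalp topographies

XC2 = Winv(:,2) * S(2,:);      % projection of component 2 on all electrodes (3 x 7)
rank(XC2)                      % 1: all columns are proportional to Winv(:,2)

Xclean = X - XC2;              % data with component 2 removed
max(abs(W(2,:) * Xclean))      % zero up to rounding: component 2 no longer contributes

Xsum = zeros(size(X));         % summing the projections of all components ...
for k = 1:3
    Xsum = Xsum + Winv(:,k) * S(k,:);
end
max(abs(Xsum(:) - X(:)))       % ... reproduces the original data, up to rounding
```

The last check shows that the data is exactly the sum of the individual component projections, so subtracting one projection removes that component and nothing else.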
In conclusion, when we talk about independent components, we usually refer
to two concepts:
- Rows of the S matrix, which are the time courses of the component activity
- Columns of the $W^{-1}$ matrix, which are the scalp projections of the
  components
ICA properties
From the preceding paragraphs, several properties of ICA become obvious:
- ICA can only separate linearly mixed sources.
- Since ICA is dealing with clouds of points, changing the order in which the
  points are plotted (the time point order in EEG) has virtually no effect on
  the outcome of the algorithm.
- Changing the channel order (for instance swapping electrode locations in
  EEG) also has no effect on the outcome of the algorithm. For EEG, the
  algorithm has no a priori knowledge of the electrode locations, and the fact
  that ICA components can most of the time be resolved to a single
  equivalent dipole is evidence that ICA is able to isolate compact
  domains of cortical synchrony.
- Since ICA separates sources by maximizing their non-Gaussianity, perfectly
  Gaussian sources cannot be separated.
- Even when the sources are not independent, ICA finds a space where they
  are maximally independent.
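The statement about Gaussian sources can be justified in two lines. The notation is assumed here: $s$ is a vector of independent, unit-variance Gaussian sources (as obtained after whitening) and $R$ is any rotation matrix, so that $R R^{\top} = I$.

```latex
s \sim \mathcal{N}(0, I), \qquad y = R s
\quad\Longrightarrow\quad
y \sim \mathcal{N}\!\left(0, \; R \, I \, R^{\top}\right) = \mathcal{N}(0, I)
```

Every rotation of the whitened data has exactly the same distribution, with independent Gaussian components on every pair of axes. No measure of non-Gaussianity can therefore prefer one rotation over another, and the original axes cannot be identified.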
Signal Mixtures
We know that signal mixtures tend to have Gaussian (normal) probability density
functions, and that source signals have non-gaussian pdfs. We also know that
each source signal can be extracted from a set of signal mixtures by taking the
inner product of a weight vector and those signal mixtures, where this inner
product provides an orthogonal projection of the signal mixtures. But we do not
yet know precisely how to find such a weight vector. One type of method for
doing so is exploratory projection pursuit, often referred to simply as projection
pursuit.
Projection pursuit methods seek one projection at a time such that the
extracted signal is as non-gaussian as possible. This contrasts with ICA, which
typically extracts M signals simultaneously from M signal mixtures, which
requires estimating a (possibly very large) M x M unmixing matrix. One practical
advantage of projection pursuit over ICA is that fewer than M signals can be
extracted if required, where each source signal is extracted from M signal
mixtures using an M-element weight vector.
The name projection pursuit derives from the fact that this method seeks
a weight vector which provides an orthogonal projection of a set of signal
mixtures such that each extracted signal has a pdf which is as non-gaussian as
possible.
Let us consider the example of human height. Suppose that the height $h_i$ of
an individual $i$ is the outcome of many underlying factors, which include a
genetic component $s_i^G$ and a dietary component $s_i^D$ (i.e. nature vs nurture).
Let us further suppose that the contribution of each factor to height is the same
for all individuals (i.e. the nature/nurture ratio is fixed). Finally, we need to
assume that the total effect of these different factors in each individual is the
sum of their contributions. If we consider the contribution of each factor as a
constant coefficient, then we can write
$$h_i = a s_i^G + b s_i^D$$
where a and b are non-zero coefficients. Each coefficient determines how height
increases with the factors $s_i^G$ and $s_i^D$. Note that $s_i^G$ and $s_i^D$ vary across
individuals, whereas the coefficients a and b are the same for all individuals. When
many such factors are summed, the central limit theorem ensures that the pdf of the
$h_i$ values is approximately gaussian, irrespective of the pdfs of the individual
factors and irrespective of the constants a and b.
Of course, we should recognize the above equation for what it is: the
formation of a signal mixture h by a linear combination of source signals $s_i^G$ and
$s_i^D$, using mixing coefficients a and b. Note that $h_i$ could equally well be a
mixture of two voice signals.
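The tendency of mixtures to be more gaussian than their sources can be observed numerically. The Matlab sketch below uses uniformly distributed factors and the coefficients a = 0.6 and b = 0.8; these choices, and the helper name exkurt, are assumptions made here for illustration. Excess kurtosis is used as the measure of non-gaussianity (it is 0 for a gaussian).

```matlab
rng(2);                                  % fixed seed so the run is repeatable
N = 100000;
exkurt = @(v) mean((v-mean(v)).^4) / mean((v-mean(v)).^2)^2 - 3;   % excess kurtosis

sG = rand(1,N);                          % hypothetical "genetic" factor, uniform pdf
sD = rand(1,N);                          % hypothetical "dietary" factor, uniform pdf
a = 0.6; b = 0.8;                        % made-up mixing coefficients
h = a*sG + b*sD;                         % mixture of the two factors

[exkurt(sG) exkurt(sD) exkurt(h)]        % sources near -1.2, mixture closer to 0

hmany = sum(rand(20,N), 1);              % sum of 20 independent uniform factors
exkurt(hmany)                            % close to 0: nearly gaussian
```

With only two factors the mixture is already closer to gaussian than either source, and the effect becomes much stronger as more factors are added.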
As a further example, in signal processing it is almost always
assumed that, after the signals of interest have been extracted from a noisy stream
of data, the residual noise is gaussian. As stated above, this assumption is
mathematically very convenient, but it is also usually valid. If the residual noise
is the result of many processes whose outputs are added together, then the
central limit theorem (CLT) guarantees that this noise is indeed approximately
gaussian.
Gaussian Signals: Good News, Bad News
The bad news is that the converse of the CLT is not true in general; that is,
it is not true that any gaussian signal is a mixture of non-gaussian signals. The
good news is that, in practice, gaussian signals often do consist of a mixture of
non-gaussian signals. This is good news because it means we can treat any
gaussian signal as if it consists of a mixture of non-gaussian source signals.
Given a set of such gaussian mixtures, we can then proceed to find each source
signal by finding the unmixing vector which extracts the most non-gaussian
signal from the set of mixtures.
We could now proceed using two different strategies. We could define a
measure of the distance between the signal extracted by a given unmixing
vector and a gaussian signal, and then find the unmixing vector that maximizes
this distance. This distance is known as the Kullback-Leibler divergence. A simpler
strategy consists of defining a measure of non-gaussianity and then finding the
unmixing vector that maximizes this measure.
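For reference, the Kullback-Leibler divergence between the pdf $p_y$ of the extracted signal and a gaussian pdf $p_g$ can be written as follows. Taking $p_g$ to have the same mean and variance as the extracted signal is a common convention assumed here, not something stated in the text.

```latex
D_{\mathrm{KL}}\!\left(p_y \,\|\, p_g\right)
  = \int p_y(y) \, \log \frac{p_y(y)}{p_g(y)} \, dy \;\ge\; 0
```

The divergence is zero only when the two pdfs coincide, so a larger value means that the extracted signal is further from gaussian. It is not symmetric in its two arguments, so it is a "distance" only in an informal sense.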
The fact that there are actually two types of non-gaussian signals will not
detain us long, because we shall assume that our source signals are of one type
only. The two types are known by various terms, such as super-gaussian and
sub-gaussian, or equivalently as leptokurtic and platykurtic, respectively; a signal
with zero kurtosis is mesokurtic. A signal with a super-gaussian pdf has most of its
values clustered around zero, whereas a signal with a sub-gaussian pdf does not.
As examples, a speech signal has a super-gaussian pdf, and a sawtooth function
and uniformly distributed white noise have sub-gaussian pdfs. This implies that
super-gaussian signals have pdfs that are more peaky than that of a gaussian signal,
whereas a sub-gaussian signal has a pdf that is less peaky than that of a gaussian
signal.
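The three cases can be compared numerically with a hand-written excess kurtosis. In the Matlab sketch below, a Laplace-distributed signal stands in for a super-gaussian signal such as speech; this substitution, the sample size, and the helper name exkurt are assumptions made here for illustration.

```matlab
rng(5);                                   % fixed seed so the run is repeatable
N = 200000;
exkurt = @(v) mean((v-mean(v)).^4) / mean((v-mean(v)).^2)^2 - 3;   % excess kurtosis

g   = randn(1,N);                         % gaussian signal
lap = -log(rand(1,N)) + log(rand(1,N));   % Laplace signal: difference of two exponential variables
saw = mod(linspace(0, 50, N), 1) - 0.5;   % sawtooth ramp, values spread evenly over [-0.5, 0.5)

[exkurt(lap) exkurt(g) exkurt(saw)]       % positive, near zero, negative
```

The sign of the excess kurtosis separates the two types: positive for the peaky, heavy-tailed super-gaussian signal and negative for the flat sub-gaussian one.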
